Powering Recommendations @ embed.ly

Developed by Ramesh Sampath

I built a data pipeline for embed.ly that converts log files coming from user activity at popular websites into a queryable data store in Redshift.

I worked with Zach Gazak, a data science fellow @ Insight, who built a recommendation model with this dataset.

The webapp is hosted at insight.sampathweb.com

About Embed.ly

Embed.ly provides a service that enables popular websites to understand which content is most engaging for their users. For example, Embed.ly helps its clients know which part of a video users like the most.

By sitting between users and popular websites, Embedly collects a lot of data that can be analyzed to provide greater value to its clients. Embedly wants to build a data pipeline that a data scientist can build recommendation models from. This project is an attempt to make this possible. I am reviewing this with Embedly's engineering team to integrate it with their system.

Data Pipeline

ETL Process

Embed.ly creates log files at a rate of about 2 GB every 30 minutes.
A cron job uploads these files to an S3 bucket.
An AWS Elastic MapReduce (EMR) job reads these JSON events and extracts the fields we need for building the recommendation model. The results of the EMR job are written to another S3 bucket.
An AWS Data Pipeline process loads these processed files from S3 into the Redshift data warehouse.
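
The field-extraction step can be sketched roughly as follows. This is a minimal illustration, not the code in etl_emr: the field names in `FIELDS` and the helper names `extract_fields` and `process` are assumptions made for the example, since the actual Embed.ly event schema is not described here. It assumes one JSON event per log line and writes tab-delimited rows, which is one format that can be loaded into a warehouse table.

```python
import json

# Hypothetical field names for illustration only; the real event schema
# used by Embed.ly is not documented in this README.
FIELDS = ("user_id", "url", "event_type", "timestamp")


def extract_fields(line, fields=FIELDS):
    """Return one tab-delimited row for a JSON event, or None if the line is unusable."""
    try:
        event = json.loads(line)
    except ValueError:
        return None  # malformed JSON: skip the line
    if not isinstance(event, dict):
        return None  # valid JSON, but not an event object

    values = []
    for name in fields:
        value = event.get(name)
        # Missing or null fields become empty strings so that every row
        # has the same number of columns.
        text = "" if value is None else str(value)
        # Keep the delimiter and row separator out of the data itself.
        values.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(values)


def process(lines):
    """Yield extracted rows for every usable event in an iterable of log lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = extract_fields(line)
        if row is not None:
            yield row
```

In a streaming-style job, something like `for row in process(sys.stdin): print(row)` would turn this into a mapper. Dropping malformed lines instead of failing the job is a design choice for this sketch; a production job might count or log them.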

User interface

Credits

John Emhoff, Engineering Team @ Embed.ly (https://embed.ly), for helping us understand the data challenges faced by Embed.ly and how we can help
Zach Gazak, Insight Data Science Fellow, for the various whiteboard sessions to understand what features we need for the recommendation model
Insight Data Engineering Program for making this possible in four weeks
