redditract-pipeline

1. Introduction & Motivation

Introduction

A simple Airflow ETL pipeline, written in Python, that gathers data on Reddit's top 100 subreddits. Data is retrieved through the Reddit API using the Python client PRAW.

Motivation:

Build a simple ETL system as a personal fun project
Demonstrate basic Airflow skills and a workflow that involves data retrieval from an API service

Task Flow

Get the names of the top 100 subreddits
Get details and data on each subreddit via the Reddit API
Get hot/new/top submissions for each subreddit via the Reddit API
Load the retrieved data into MongoDB
Send an email report confirming the successful DAG run
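
The five steps above can be sketched without Airflow as a plain chain of Python functions. This only illustrates the ordering. Every function name and all stubbed data below are hypothetical. The real pipeline would call the Reddit API through PRAW, write to MongoDB, and run each step as an Airflow task.

```python
from typing import Callable, Dict, List


def run_pipeline(
    get_top_subreddits: Callable[[], List[str]],
    get_details: Callable[[str], Dict],
    get_submissions: Callable[[str], Dict[str, List[str]]],
    load: Callable[[List[Dict]], int],
    send_report: Callable[[str], None],
) -> int:
    """Run the five steps in order; each callable stands in for one task."""
    names = get_top_subreddits()                          # step 1
    documents = []
    for name in names:
        doc = {"name": name}
        doc["details"] = get_details(name)                # step 2
        doc["submissions"] = get_submissions(name)        # step 3
        documents.append(doc)
    loaded = load(documents)                              # step 4
    send_report(f"Loaded {loaded} subreddit documents")   # step 5
    return loaded


# Hypothetical stubs so the sketch runs without network or database access.
store: List[Dict] = []
reports: List[str] = []


def _load(docs: List[Dict]) -> int:
    store.extend(docs)
    return len(docs)


loaded = run_pipeline(
    get_top_subreddits=lambda: ["python", "dataengineering"],
    get_details=lambda name: {"subscribers": 0},
    get_submissions=lambda name: {"hot": [], "new": [], "top": []},
    load=_load,
    send_report=reports.append,
)
```

Passing the steps in as callables keeps the ordering testable on its own. In Airflow, the same ordering would be expressed as task dependencies.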

2. Airflow Variables

In order for the pipeline to work properly, the following Airflow variables are required.

MongoDB: `mongo_uri`

URI for MongoDB. Retrieved data will be stored in the MongoDB instance this URI points to.
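
A malformed URI tends to surface only when the load step runs, so a small check up front can help. The sketch below is an assumption, not part of the pipeline. `check_mongo_uri` is a hypothetical helper. It only verifies that the URI uses one of the two MongoDB connection schemes (`mongodb://` or `mongodb+srv://`) and names a host.

```python
from urllib.parse import urlsplit

VALID_SCHEMES = ("mongodb", "mongodb+srv")


def check_mongo_uri(uri: str) -> str:
    """Return the URI unchanged if it looks like a MongoDB connection string."""
    parts = urlsplit(uri)
    if parts.scheme not in VALID_SCHEMES:
        raise ValueError(f"mongo_uri must start with one of {VALID_SCHEMES}")
    if not parts.hostname:
        raise ValueError("mongo_uri does not name a host")
    return uri


# Example value only; the real URI comes from the Airflow variable.
checked = check_mongo_uri("mongodb://localhost:27017/reddit")
```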

Reddit Credentials: `reddit_credentials`

Credentials for the Reddit API. The pipeline assumes the following structure for `reddit_credentials`:

{
    "reddit_credentials": {
        "reddit_key": "YOUR_REDDIT_KEY",
        "reddit_secret": "YOUR_REDDIT_SECRET"
    }
}
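
If the JSON above is kept in a file, its shape can be checked before it is imported into Airflow. The sketch below uses only the standard library, and `load_reddit_credentials` is a hypothetical helper, not part of the pipeline. Inside a DAG, Airflow's `Variable.get("reddit_credentials", deserialize_json=True)` returns the parsed value. Whether that value includes the outer `reddit_credentials` key depends on how the variable was created, which is not stated here, so the helper accepts both shapes.

```python
import json

REQUIRED_KEYS = ("reddit_key", "reddit_secret")


def load_reddit_credentials(raw: str) -> dict:
    """Parse the JSON text and return only the two required credential fields."""
    data = json.loads(raw)
    # Accept either the full structure or just the inner object.
    creds = data.get("reddit_credentials", data)
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise KeyError(f"reddit_credentials is missing: {', '.join(missing)}")
    return {key: creds[key] for key in REQUIRED_KEYS}


# Placeholder values for illustration only.
sample = '{"reddit_credentials": {"reddit_key": "abc", "reddit_secret": "xyz"}}'
credentials = load_reddit_credentials(sample)
```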

Email Configuration: `EMAIL_SENDER & EMAIL_RECEIVER`

The pipeline is configured to send an email with a result summary after each successful DAG run. The following variables are required for it to work:

`email_sender`: Email address to send notifications from
`email_receiver`: Email address to send notifications to
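
A report message built from the two email variables might look like the following. This is a sketch using only the standard library. `build_report`, the subject line, the DAG id `redditract`, and the summary text are all hypothetical, since the actual report format is not specified here. Sending is left out. The message could be handed to `smtplib` or to Airflow's email utilities.

```python
from email.message import EmailMessage


def build_report(sender: str, receiver: str, dag_id: str, loaded: int) -> EmailMessage:
    """Assemble a plain-text summary email for a successful DAG run."""
    msg = EmailMessage()
    msg["From"] = sender        # value of the email_sender variable
    msg["To"] = receiver        # value of the email_receiver variable
    msg["Subject"] = f"[{dag_id}] DAG run succeeded"
    msg.set_content(f"Documents loaded to MongoDB: {loaded}\n")
    return msg


# Example addresses only.
report = build_report("airflow@example.com", "me@example.com", "redditract", 100)
```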

Here is a sample of the email report.

3. Deployment

Docker containers are used for deployment (image: `puckel/docker-airflow:1.10.9`). Deploying is as simple as running `docker-compose up` in the terminal.
