
AndrewSLowe/airflow_batch_project


Project Overview

This data pipeline brings in data from two separate sources: customer purchase data from an OLTP database, and movie review data from a third-party data vendor. Using Apache Airflow, we extract the data from both sources on a daily basis, process it with big data tools (Apache Spark on AWS EMR), and combine the two sources into a Redshift database (data warehouse) for analytics.
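
The orchestration can be pictured as one daily DAG with two branches that join before the warehouse load. The sketch below is illustrative only, not the project's actual code: the DAG id, task names, and callables are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract_purchases(**context):
    """Pull the day's customer purchase data out of the OLTP database."""
    ...


def process_reviews(**context):
    """Ship the movie review CSV to S3 and run the Spark job on EMR."""
    ...


def load_warehouse(**context):
    """COPY both cleaned datasets from S3 into Redshift."""
    ...


with DAG(
    dag_id="user_behaviour",          # name assumed for illustration
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # both sources are extracted daily
    catchup=False,
) as dag:
    purchases = PythonOperator(task_id="extract_purchases", python_callable=extract_purchases)
    reviews = PythonOperator(task_id="process_reviews", python_callable=process_reviews)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_warehouse)

    # Both branches must land in S3 before the warehouse load runs.
    [purchases, reviews] >> load
```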

Tech Stack

Languages: Python, SQL (PostgreSQL)

Frameworks: Apache Airflow

Services: AWS EMR (Spark), AWS Redshift, Docker, PostgreSQL

Finished Project

What you see below is the full data pipeline. In the bottom branch, Airflow reads the entire PostgreSQL database daily, moves the data to an S3 bucket, and immediately deletes the extracted copy. Once the data is in the S3 bucket, it is staged in the Redshift database (the final product of the data pipeline).
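
A minimal sketch of this bottom branch is shown below, assuming illustrative connection strings, bucket, table, and IAM role names; none of these come from the repo.

```python
import os

import boto3
import psycopg2


def postgres_to_s3(bucket: str = "my-pipeline-bucket") -> str:
    """Dump the purchases table to CSV, upload it to S3, delete the local copy."""
    local = "/tmp/user_purchase.csv"
    conn = psycopg2.connect("dbname=oltp user=airflow")  # assumed DSN
    with conn, conn.cursor() as cur, open(local, "w") as f:
        # COPY ... TO STDOUT streams the whole table out as CSV.
        cur.copy_expert("COPY user_purchase TO STDOUT WITH CSV HEADER", f)
    boto3.client("s3").upload_file(local, bucket, "stage/user_purchase.csv")
    os.remove(local)  # the extract is deleted as soon as it lands in S3
    return f"s3://{bucket}/stage/user_purchase.csv"


def stage_in_redshift(s3_uri: str) -> None:
    """Stage the S3 file in the warehouse with Redshift's COPY command."""
    rs = psycopg2.connect(
        "dbname=dw host=example-cluster.redshift.amazonaws.com port=5439 user=awsuser"
    )  # assumed endpoint
    with rs, rs.cursor() as cur:
        cur.execute(
            f"""
            COPY user_purchase_staging
            FROM '{s3_uri}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            CSV IGNOREHEADER 1;
            """
        )
```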

In the top branch of the data pipeline, movie review data is brought in from a CSV file. You'll notice that Airflow moves the raw data to an S3 bucket while simultaneously uploading the EMR step scripts to the same bucket. Next, the steps are added to EMR, where the data is cleaned using a naive classification model in Apache Spark.

[Diagram: the full Airflow pipeline DAG]
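
The step submission might look like the boto3 sketch below; the cluster id, bucket, script name, and S3 paths are all assumptions for illustration. In Airflow, a submission like this is typically paired with a sensor that waits for the step to finish before the Redshift load runs.

```python
import boto3


def add_spark_step(cluster_id: str, bucket: str = "my-pipeline-bucket") -> str:
    """Queue the review-classification Spark job as a step on a running EMR cluster."""
    emr = boto3.client("emr")
    resp = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name": "classify_movie_reviews",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # standard way to run spark-submit on EMR
                "Args": [
                    "spark-submit",
                    f"s3://{bucket}/scripts/classify_reviews.py",  # the uploaded step script (name assumed)
                    f"s3://{bucket}/raw/movie_reviews.csv",        # raw reviews in
                    f"s3://{bucket}/clean/movie_reviews/",         # cleaned output
                ],
            },
        }],
    )
    # Poll this step id (e.g. with Airflow's EmrStepSensor) before loading Redshift.
    return resp["StepIds"][0]
```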

This project was inspired by a blog post at Start Data Engineering (https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition).
