
Capstone Project

The purpose of this capstone project is to demonstrate various data engineering skills acquired during the nanodegree. This capstone project mainly focuses on the following key areas:

  • Developing ETL/ELT pipelines using Apache Airflow
  • Constructing Data Lakes in S3 for raw data or for staging purposes
  • Using AWS EMR and pyspark to transform raw data
  • Constructing Redshift DWH from the staging/schematized data in AWS S3
  • Analyzing the Data using pyspark in a Jupyter Notebook

This project is for demonstration purposes and makes use of the following data sources:

Data Sources from Udacity

  • I94 Immigration Data: This data comes from the US National Tourism and Trade Office found here. Each report contains international visitor arrival statistics by world regions and select countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries)
  • U.S. City Demographics Data: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. The dataset comes from OpenSoft, found here.
  • Airport Codes: This is a simple table of airport codes and corresponding cities. An airport code may refer to either the IATA airport code, a three-letter code used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport code, a four-letter code used by ATC systems and for airports that do not have an IATA airport code (from Wikipedia). It comes from here

External Data Sources

I have enriched the dataset using the following data sources to make the analysis more useful and informative:

  • Port of Entry/Nationality Codes: This dataset contains the port-of-entry codes used to join against the I94 immigration data, and the nationality codes contain the country abbreviations: Port Codes, Nationality Codes and Port of Entry Codes
  • US Visa Types: This data is extracted from the US DHS and Wikipedia and gives information on the various visa types offered by the US: US Non-immigrant Visa Types and US Immigrant Visa Types
  • Airline Codes: This data source consists of airline IATA abbreviations and the country of origin: Airline Codes

Project Architecture

The following project architecture is used for developing the Data Lake in AWS

Architecture

Using the data from the Data Lake in S3 (which serves as the Staging Bucket here), the following DWH star schema was finalized:

schema

Deciding on and Creating the AWS CloudFormation Stack

In this project I decided to use AWS for building the Data Lake and the DWH. I chose AWS because of the wide range of available services and its elastic resources at scale with appropriate pricing. I used AWS EMR for processing massive amounts of data and AWS S3 for storing it.

In order to run this project the following resources are needed:

  • AWS EC2 - an m4.xlarge instance is needed for running Apache Airflow
  • Apache Airflow - for orchestrating the ETL pipeline
  • AWS RDS - for Airflow to store its metadata
  • AWS S3 - for creating the Data Lake
  • AWS EMR - a cluster with Apache Spark and Apache Livy is needed to perform transformation tasks on the raw data

To make provisioning these resources easier, I used AWS CloudFormation to create the stack needed for running this project.

The required resources are already configured in the airflow_server.yaml template file.
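
For illustration, a minimal sketch of how the airflow_server.yaml stack could be launched with boto3 is shown below; the stack name, template URL and parameter key are placeholders, not necessarily what the project uses:

# Hypothetical sketch: launching the CloudFormation template with boto3.
# The stack name, template URL and parameter key are placeholders.
import boto3

cf = boto3.client("cloudformation", region_name="us-west-2")

response = cf.create_stack(
    StackName="dend-capstone-airflow",                                        # placeholder name
    TemplateURL="https://s3.amazonaws.com/dend-capstone-data/airflow_server.yaml",
    Parameters=[{"ParameterKey": "KeyName", "ParameterValue": "airflow_key_pair"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],                                    # needed if the template creates IAM roles
)

# Block until the EC2 instance, security groups, etc. are fully created
cf.get_waiter("stack_create_complete").wait(StackName="dend-capstone-airflow")
print(response["StackId"])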

To start creating the necessary resources, perform the following steps:

  1. Create an AWS account and provide the necessary billing information (if needed)
  2. Create a user in AWS IAM
  3. Make sure the user is granted the permissions shown in IAM Permissions
  4. Finish the user creation step and download the AWS KEY ID and AWS SECRET into a csv file
  5. Create an EC2 key pair named airflow_key_pair for accessing the EC2 instance that runs Airflow

Setting up Airflow and AWS Connections

  1. Create an S3 bucket for raw data and upload the airflow_server.yaml file into the bucket

  2. Fill the dwh.cfg file with your AWS details and also fill in dwh_airflow.cfg

  3. Run python create_resources.py to create the Airflow EC2 instance (a sketch of the S3 setup it performs is shown after this list)

    • This script uploads the raw data into the bucket named e.g. dend-capstone-data (created in step 1)
    • Python scripts are also uploaded into the raw data bucket (see the S3 folder structure below)
    • Creates a Staging Bucket in S3 named e.g. dend-capstone-staging
    • Creates an AWS EC2 instance with Airflow installed on it
  4. Once the EC2 instance is created, log in to the instance using PuTTY and then run bash ~/start_airflow.sh to start airflow at <ec2_instance_ip>:8080

  5. Use your browser to open Airflow at <ec2_instance_ip>:8080; the Airflow Home page will list the following DAGs

    • metadata_transform_dag - This DAG extracts all the information needed for creating the dimension tables in the DWH
    • immigration_transform_dag - This DAG extracts the I94 immigration SAS data and stores the processed files in parquet format in S3
    • create_DWH_dag - This DAG uses the staging bucket in S3 to create a DWH in AWS Redshift
  6. Fill in the new aws_credentials on the Connections page as in the picture below: AWS_cred

  7. Fill in the new redshift connection details as well: AWS_redshift
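
For reference, here is a minimal sketch of the kind of S3 setup create_resources.py performs, based on the bullets in step 3; the bucket names, region and local file paths are examples only and the actual script may differ:

# Hypothetical sketch of the S3 setup done by create_resources.py.
# Bucket names, region and local file paths are examples only.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

raw_bucket = "dend-capstone-data"          # raw data bucket created in step 1
staging_bucket = "dend-capstone-staging"   # staging bucket for the Data Lake

# Create the staging bucket (LocationConstraint is required outside us-east-1)
s3.create_bucket(
    Bucket=staging_bucket,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Upload the CloudFormation template and the PySpark applications used by EMR
s3.upload_file("airflow_server.yaml", raw_bucket, "airflow_server.yaml")
s3.upload_file("python_apps/transform_immigration.py", raw_bucket,
               "python_apps/transform_immigration.py")
s3.upload_file("python_apps/data_lake_quality_check.py", raw_bucket,
               "python_apps/data_lake_quality_check.py")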

Data Paths

The following files are used in the data pipeline:

| Filename | Location | Purpose | Execution Platform |
| --- | --- | --- | --- |
| create_resources.py | Udacity | Uploads data to S3 | Udacity |
| extras/data_scraper.py | Udacity | Scrapes data from the internet to add additional data | Udacity |
| extras/nearest_city.py | Udacity | Adds the nearest city to the airports csv file | Udacity |
| dags/lib/emr_cluster_provider.py | AWS EC2 | Provisions the EMR cluster | AWS EC2 |
| dags/lib/emr_session_provider.py | AWS EC2 | Provides the Livy REST session | AWS EMR |
| dags/transforms/transforms_codes.py | AWS EC2 | Transforms all the csv files with the -codes suffix | AWS EMR + Livy Session |
| dags/transforms/tranforms_demographics.py | AWS EC2 | Transforms the demographics file | AWS EMR + Livy Session |
| dags/transforms/transforms_i94_meta_data.py | AWS EC2 | Transforms all the metadata extracted from the SAS label file | AWS EMR + Livy Session |
| dags/transforms/tranforms_immigration.py | AWS EC2 | Transforms the immigration SAS data | AWS EMR |
| dags/transforms/tranforms_visa.py | AWS EC2 | Transforms the csv files related to visas | AWS EMR + Livy Session |
| dags/transforms/data_lake_quality_check.py | AWS EC2 | Performs the Data Lake quality check | AWS EMR + Livy Session |
| dags/capstone_meta_data_dag.py | AWS EC2 | ELT pipeline for transforming CSV metadata to S3 parquet format | Airflow |
| dags/capstone_immigration_dag.py | AWS EC2 | ELT pipeline for transforming the SAS immigration data to S3 parquet format, partitioned by month+year | Airflow |
| dags/capstone_DWH_dag.py | AWS EC2 | ETL pipeline for building the DWH in Redshift from the staging data in S3 | Airflow |
| emr_bootstrap/boostrap_action.sh | AWS S3 | Copies transform_immigration.py and data_lake_quality_check.py to EMR when the cluster bootstraps | AWS EMR |
| plugins/helpers/dwh_airflow.cfg | AWS EC2 | Used by Airflow to get all the resource paths | Airflow |
| plugins/helpers/sql_queries.py | AWS EC2 | SQL queries for the DWH | Airflow |
| plugins/operators/copy_redshift.py | AWS EC2 | Copies parquet data from S3 staging to AWS Redshift | Airflow |
| plugins/operators/create_table.py | AWS EC2 | Creates empty tables in AWS Redshift | Airflow |
| plugins/operators/DWH_data_quality.py | AWS EC2 | Checks data quality in all the tables in AWS Redshift | Airflow |
| plugins/operators/emr_add_steps.py | AWS EC2 | Adds EMR steps to execute the transformations | Airflow |
| plugins/operators/S3_Data_check.py | AWS EC2 | Checks if the data is present in S3 | Airflow |
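
As an example of the custom operators listed above, here is a minimal sketch of an S3 data-presence check; the actual plugins/operators/S3_Data_check.py may differ, and Airflow 1.10-style imports are assumed:

# Hypothetical sketch of an S3 data-presence check operator; the real
# plugins/operators/S3_Data_check.py in the repository may look different.
from airflow.hooks.S3_hook import S3Hook          # Airflow 1.10-style import (assumed)
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class S3DataCheckOperator(BaseOperator):
    """Fails the task if the expected prefix is not present in the given S3 bucket."""

    @apply_defaults
    def __init__(self, aws_conn_id, bucket, prefix, *args, **kwargs):
        super(S3DataCheckOperator, self).__init__(*args, **kwargs)
        self.aws_conn_id = aws_conn_id
        self.bucket = bucket
        self.prefix = prefix

    def execute(self, context):
        hook = S3Hook(aws_conn_id=self.aws_conn_id)
        if not hook.check_for_prefix(prefix=self.prefix, delimiter="/", bucket_name=self.bucket):
            raise ValueError("No data found at s3://{}/{}".format(self.bucket, self.prefix))
        self.log.info("Data present at s3://%s/%s", self.bucket, self.prefix)
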
S3 Folder Structure

This is the S3 folder structure for both the RAW and STAGING buckets:

<RAW DATA BUCKET> 
│   airflow_server_v3.yaml 
│
└───awsemr/
│   │   bootstrap_action.sh
│   
└───python_apps/
│    │   data_lake_quality_check.py
│    │   transform_immigration.py
│
└───raw/
│    └───codes/
│        │   airlines-codes.csv
│        │   airport-codes.csv
│        │   nationality-codes.csv
│        │   port-of-entry-codes.csv
│    └───i94_immigration_data/
│        │   *.sas7bdat
│    └───sas_meta_data/
│        │   I94_SAS_Labels_Descriptions.SAS
│    └───us-demogrpahics/
│        │   us-cities-demographics.csv
│    └───us-visa/
│        │   visa-type.csv
│        │   visa-issuing-ports.csv

<STAGING DATA BUCKET>  
└───lake/
│   └───codes/
│        │   airlines-codes/*.parquet
│        │   airport-codes/*.parquet
│        │   nationality-codes/*.parquet
│        │   port-of-entry-codes/*.parquet
│   └───demogrpahics/
│        │   *.parquet
│   └───i94_meta_data/
│        │   ... extracted sas labels *.parquet
│   └───immigration/
│        │   month_year=apr_16/*.parquet
│        │   .....
│   └───visa-type/
│        │   *.parquet
│   └───visa-issue-post/
│        │   *.parquet

Executing the DAGs

  • Use the Airflow UI to trigger the DAGs
  • First trigger the metadata_transform_dag DAG. This DAG spins up an EMR cluster and uses the Apache Livy REST interface to perform the necessary transformations and data quality checks (a sketch of a Livy submission follows the image below)

meta_data_dag
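
For illustration, driving a transformation through the Livy REST API could look roughly like this; the EMR master address is a placeholder, and the DAG's own session handling lives in dags/lib/emr_session_provider.py and may differ:

# Hypothetical sketch of running a transform through the Livy REST API;
# the EMR master address is a placeholder.
import json, time
import requests

livy_url = "http://<emr_master_dns>:8998"            # Livy's default port on the EMR master
headers = {"Content-Type": "application/json"}

# 1. Open a PySpark session on the EMR cluster
session = requests.post(livy_url + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
session_url = "{}/sessions/{}".format(livy_url, session["id"])

# 2. Wait for the session to become idle
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(10)

# 3. Submit transformation code as a statement (example: the codes transform)
code = open("dags/transforms/transforms_codes.py").read()
requests.post(session_url + "/statements",
              data=json.dumps({"code": code}),
              headers=headers)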

  • Then execute the immigration_transform_dag. This DAG also spins up an EMR cluster, but uses the EMR add-steps method to add the transformation steps defined for obtaining the transformed SAS immigration data. This DAG has a monthly schedule and can be run from a particular point in time to another (a sketch of adding an EMR step follows the tree view below)

immigration_dag tree_view
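
A minimal sketch of the add-steps approach with boto3 is shown below; the cluster id and S3 path are placeholders, and the project's emr_add_steps.py operator wraps similar logic:

# Hypothetical sketch of adding a Spark step to a running EMR cluster;
# the cluster id and the S3 path are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

step = {
    "Name": "transform_immigration",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "--deploy-mode", "cluster",
            "s3://dend-capstone-data/python_apps/transform_immigration.py",
        ],
    },
}

response = emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
step_id = response["StepIds"][0]

# Block until the step finishes (in Airflow this polling is usually done by a sensor)
emr.get_waiter("step_complete").wait(ClusterId="j-XXXXXXXXXXXXX", StepId=step_id)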

  • Once both DAGs are completed, the S3 Staging area (Data Lake) is ready and the DWH can then be created in Redshift
  • Execute the capstone_DWH_dag to fill the DWH with the tables and data
  • The Data Dictionary for the DWH tables can be found here

DWH_dag DWH_dag

Results from DWH

Using the Redshift DWH we can check statistics from the SAS data as follows:

  • The following plot shows the number of visitors to California in 2016 by various transport modes
SELECT a.entry_month, b.transportation_mode, COUNT(cicid) AS count
FROM immigration a INNER JOIN i94mode b ON a.mode_id = b.mode_id
WHERE a.state_code = 'CA'
GROUP BY a.entry_month, b.transportation_mode
ORDER BY COUNT(cicid) DESC;

CA_plot

  • All the airports around the world can be viewed using geopandas as follows (a sketch of the plotting code follows the figure below)

airports
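
A rough sketch of how such a plot could be produced from the staged airport codes; the S3 path and the latitude/longitude column names are assumptions:

# Hypothetical sketch of plotting airport locations with geopandas;
# the staging path and the coordinate column names are assumptions.
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

airports = pd.read_parquet("s3://dend-capstone-staging/lake/codes/airport-codes/")
gdf = gpd.GeoDataFrame(
    airports,
    geometry=gpd.points_from_xy(airports["longitude"], airports["latitude"]),
    crs="EPSG:4326",
)

ax = gdf.plot(markersize=1, figsize=(15, 8))
ax.set_title("Airports around the world")
plt.show()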

  • Top 10 ports of entry used by immigrants entering the US
SELECT TOP 10 b.port_city, b.state_code_or_country, COUNT(cicid) AS count
FROM immigration a INNER JOIN i94ports b ON a.port_id=b.port_code
GROUP BY b.port_city, b.state_code_or_country
ORDER BY COUNT(cicid) DESC;

entry_ports

  • Visa type distribution based on gender in 2016
SELECT im.gender, im.visa_type, count(im.cicid)
FROM immigration as im
JOIN visa_type ON im.visa_type = visa_type.visa_type
GROUP BY im.gender, im.visa_type;

visa_type_gender

Alternative Data Scenarios

The project rubric asks how alternative scenarios might be tackled; a discussion of these scenarios is included below.

Data increased by 100x:

  • I have used AWS EMR (elastic) with Spark, which helps when huge volumes of data need to be processed.
  • The SAS data is already partitioned by month and year, which helps with querying in Spark (see the sketch after this list).
  • The remaining metadata is stored in parquet format, which can be processed with Spark in a distributed way.
  • I am using parquet format for all the tables in the S3 Data Lake, so the transfer from staging to the DWH is faster.
  • I have used S3 for storing the staging and raw data; its main advantages are scalability and easy integration with the rest of the AWS infrastructure.
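
A minimal sketch of the partitioned parquet write used for the immigration data; the sample rows, column names and staging path are illustrative and follow the S3 layout shown earlier:

# Hypothetical sketch: partitioned parquet write matching the month_year=... layout
# in the staging bucket; the sample rows and column names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition_example").getOrCreate()

immigration_df = spark.createDataFrame(
    [(1, "apr", "16", "CA"), (2, "may", "16", "NY")],
    ["cicid", "entry_month", "entry_year", "state_code"],
)

(immigration_df
    .withColumn("month_year", F.concat_ws("_", F.col("entry_month"), F.col("entry_year")))
    .write.mode("overwrite")
    .partitionBy("month_year")               # creates month_year=apr_16/, month_year=may_16/, ...
    .parquet("s3://dend-capstone-staging/lake/immigration/"))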

The pipelines would be run on a daily basis by 7 am every day:

  • In this scenario an additional Airflow DAG should be implemented to transfer the data from the Udacity workspace to AWS S3 in order to meet the SLA. The existing Airflow DAGs would have their schedule interval changed from monthly to daily (a sketch is shown below).
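
A minimal sketch of such a daily 7 am schedule with an SLA, assuming Airflow 1.10-style arguments; the DAG id, start date and timings are illustrative:

# Hypothetical sketch of a daily 07:00 schedule with an SLA;
# the DAG id, start date and timings are illustrative only.
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    "sla": timedelta(hours=1),               # tasks should finish within an hour of the run
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="daily_raw_data_transfer",        # the extra DAG moving raw data into S3
    default_args=default_args,
    schedule_interval="0 7 * * *",           # run every day at 07:00
    catchup=False,
)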

Database needed to be accessed by 100+ people:

  • I have already chosen Redshift as the DWH, which has auto-scaling capabilities and good read performance. To manage user access efficiently, different user groups should be defined; this restricts write access for users who only need read access, while admin users keep full access.

Conclusion

This capstone project demonstrates the steps involved in building a Data Lake and a DWH in AWS:

  • Creating AWS resources using AWS Cloudformation - IaC
  • Creating S3 buckets, Redshift Cluster, IAM Roles, Security groups
  • Developing an ELT Pipeline that transforms raw data from S3 buckets into staging buckets in S3 using AWS EMR
  • Developing an ETL pipeline to copy data from S3 staging area to AWS Redshift
  • Using Airflow and Python to automate ETL pipelines
  • Writing custom operators to perform tasks such as S3 Data checks, copying data from S3 to Redshift and EMR add steps
  • Transforming data from various sources into a star schema optimized for the immigration analysts' use cases.

Code References

  1. Developing Cloud Formation Script
  2. Data pipeline Orchestration with Airflow, Spark and Livy