This repository contains the capstone project submission for the Udacity Data Engineering Nanodegree.
The project uses the following concepts:
- Building a data pipeline using Apache Airflow
- Loading data for processing from Amazon S3 and saving processed data to Amazon Redshift
- Loading Airflow and running the DAG locally through Docker
ArXiv is a free distribution service and an open-access archive for more than 1.5 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. arXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University. (Description from Kaggle.)
In this project, we load arXiv metadata from S3, process it into analytics tables on Redshift, and verify data integrity using tests. We implement this process using Apache Airflow.
The metadata on Kaggle is updated monthly (as noted on the Kaggle page), hence we run this process at a monthly interval to stay up to date.
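The monthly cadence maps directly onto the DAG's schedule. A minimal sketch of the DAG definition is shown below; it assumes Airflow 1.x (matching the `/admin/` UI path used in this project), and the `default_args` values are illustrative, not taken from the repository.

```python
# Sketch of a monthly DAG definition (assumes Airflow 1.x imports;
# owner/retry settings are illustrative assumptions).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "arxiv-etl",       # assumed owner name
    "retries": 3,
    "depends_on_past": False,
}

# schedule_interval="@monthly" triggers one run per month,
# matching Kaggle's monthly metadata refresh.
dag = DAG(
    "arXiv_Redshift_dag",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
)

begin_execution = DummyOperator(task_id="begin_execution", dag=dag)
```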
Filename | Description |
---|---|
assets | Assets folder containing images displayed in the README. |
airflow | Airflow folder containing all needed files (docker-compose, DAGs, plugins, etc.) used by Airflow. |
data | Data folder, initially empty with exception of a README. Data will be loaded into this folder before being moved to S3. |
exploration | Exploratory Jupyter notebooks for exploring the data and setup. |
setup | Folder containing all scripts for setting up the project, downloading data and loading this to S3, setting up AWS infrastructure (Redshift) and adding connections to Airflow programmatically. |
.gitignore | All the things that should NOT be committed to Git. |
LICENSE | In case anyone wants to re-use stuff from this repository. |
README.md | This file, describing the repository and its content. |
- Follow the setup guide.
- Verify that Airflow is running in Docker, the data is loaded to S3, and a Redshift cluster is running.
- Go to http://localhost:8080/admin/ in your browser.
- Enable `arXiv_Redshift_dag` and it will start automatically.
Step | Type | Description |
---|---|---|
begin_execution | DummyOperator | Dummy task with no action |
create_staging_tables, create_main_tables | RedshiftExecuteSQLOperator | Creates the staging and main tables; if a table already exists, it is dropped and re-created |
re_parse_authors, re_parse_citations | PythonOperator | Downloads and parses JSON files, then saves them in another JSON format to leverage Redshift's COPY functionality |
stage_authors, stage_metadata, stage_citations, stage_classifications | StageFromS3ToRedshiftOperator | Loads data from S3 into staging tables using COPY |
quality_check_staged_metadata, quality_check_staged_authors, quality_check_staged_citations, quality_check_staged_classifications | DataQualityOperator | Runs quality checks and validation scripts on staged data |
stage_to_main_tables | DummyOperator | Dummy task with no action |
load_articles_table, load_article_version_dimension_table, load_article_authors_dimension_table, load_article_classifications_dimension_table, load_article_citations_dimension_table | LoadRedshiftTableOperator | Transforms and loads data from staging tables into the main tables |
run_data_quality_checks | DummyOperator | Dummy task with no action |
quality_check_articles_fact, quality_check_versions_dim, quality_check_authors_dim, quality_check_classifications_dim, quality_check_citations_dim | DataQualityOperator | Runs quality checks on data loaded into the main dimension tables and fact table |
end_execution | DummyOperator | Dummy task with no action |
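The re-parse tasks exist because Redshift's COPY loads newline-delimited JSON far more easily than deeply nested documents. A minimal sketch of that re-shaping step follows; the `{"<article_id>": [[last, first, suffix], ...]}` input shape is an assumption based on arXiv's authors-parsed dump, and the function name is illustrative.

```python
import json

def reparse_authors(raw_json: str) -> str:
    """Re-shape the parsed-authors JSON into newline-delimited JSON.

    Flattens the assumed {"<article_id>": [[last, first, suffix], ...]}
    structure into one JSON object per author per line, a shape that
    Redshift's COPY ... FORMAT AS JSON can ingest directly.
    """
    records = []
    for article_id, authors in json.loads(raw_json).items():
        for author in authors:
            records.append({
                "article_id": article_id,
                "last_name": author[0],
                "first_name": author[1] if len(author) > 1 else "",
            })
    return "\n".join(json.dumps(r) for r in records)
```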
The process reads data from S3 from the following locations:
- Metadata: s3://arxiv-etl/metadata
- Authors data: s3://arxiv-etl/authors/authors-parsed.json
- Citations data: s3://arxiv-etl/citations/citations_parsed.csv
- Classifications data: s3://arxiv-etl/classifications
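Internally, the staging step boils down to issuing a COPY statement per source. The sketch below shows how such a statement might be assembled; the staging table name and IAM role ARN are assumptions (the real operator reads these from its Airflow connection and parameters), and the format clause would differ for the CSV citations file.

```python
# Hypothetical sketch of how a COPY statement could be built for staging;
# table name and IAM role are placeholder assumptions.
COPY_TEMPLATE = """
COPY {table}
FROM '{s3_path}'
IAM_ROLE '{iam_role}'
FORMAT AS JSON 'auto'
"""

def build_copy_sql(table: str, s3_path: str, iam_role: str) -> str:
    """Render the COPY statement for one staging table."""
    return COPY_TEMPLATE.format(
        table=table, s3_path=s3_path, iam_role=iam_role
    ).strip()
```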
Data loaded from S3 is processed and transformed into five main fact and dimension tables in your Redshift cluster:
Table name | Row count |
---|---|
articles_fact | 1,697,508 |
authors_dim | 6,643,208 |
citations_dim | 9,346,457 |
classifications_dim | 2,735,492 |
versions_dim | 2,618,307 |
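The quality checks guarding these tables come down to simple assertions such as "the table must not be empty". The sketch below demonstrates that pattern against an in-memory SQLite database for illustration; the real DataQualityOperator would run the same kind of query against Redshift.

```python
import sqlite3

def check_has_rows(conn, table: str) -> None:
    """Fail loudly if a table is empty -- the simplest data-quality gate."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if count < 1:
        raise ValueError(f"Data quality check failed: {table} contains no rows")

# Demo against SQLite (the real operator targets Redshift).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles_fact (id TEXT)")
conn.execute("INSERT INTO articles_fact VALUES ('0704.0001')")
check_has_rows(conn, "articles_fact")  # passes silently
```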
- Increase the size of the current Redshift cluster. For this project the cluster has 2 nodes with a combined 320 GB of storage. It can easily be scaled both vertically (larger node types) and horizontally (more nodes in the cluster).
- This should be implemented as a separate Airflow DAG from the one in this project that loads from S3. The new DAG would update the dashboard, with an SLA requiring the task to complete no later than 7:00 AM.
- With the current setup leveraging a Redshift cluster, this is already possible. As in the first scenario, it might require scaling the cluster to a larger size than the current one.