
ArXiv Metadata ETL Pipeline with Airflow

[ArXiv banner]

Purpose

This repository contains the capstone project submission for the Udacity Data Engineering Nanodegree.

The project brings together the following concepts: pipeline orchestration with Apache Airflow, data storage on AWS S3, data warehousing with Amazon Redshift, and automated data quality checks.

Context

ArXiv is a free distribution service and an open-access archive for more than 1.5 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. arXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University. (Description from Kaggle.)

In this project, we load ArXiv metadata from S3, process it into analytics tables on Redshift, and verify data integrity using tests. The process is implemented and orchestrated with Apache Airflow.

How often should the data be updated, and why?

The metadata on Kaggle is updated monthly (as noted on the Kaggle page), so this process should be run at a monthly interval to stay up to date.
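
For reference, a monthly cadence maps directly onto Airflow's schedule_interval. The snippet below is a minimal sketch using Airflow 1.10-style imports; the default_args values and start date are illustrative, not taken from the project.

```python
# Minimal sketch of a monthly-scheduled DAG (Airflow 1.10-style imports).
# default_args values and the start date are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'arxiv-etl',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'depends_on_past': False,
}

dag = DAG(
    dag_id='arXiv_Redshift_dag',
    default_args=default_args,
    description='Load ArXiv metadata from S3 into analytics tables on Redshift',
    schedule_interval='@monthly',   # matches the monthly Kaggle refresh cadence
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

begin_execution = DummyOperator(task_id='begin_execution', dag=dag)
```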

Project Structure

  • assets: Folder containing the images displayed in the README.
  • airflow: Folder containing all files used by Airflow (docker-compose, DAGs, plugins, etc.).
  • data: Data folder, initially empty except for a README. Data is loaded into this folder before being moved to S3.
  • exploration: Exploratory Jupyter notebooks for examining the data and the setup.
  • setup: Scripts for setting up the project: downloading data and loading it to S3, provisioning AWS infrastructure (Redshift), and adding connections to Airflow programmatically.
  • .gitignore: Everything that should NOT be committed to Git.
  • LICENSE: In case anyone wants to re-use parts of this repository.
  • README.md: This file, describing the repository and its contents.

How to run

Execute locally

Setup Configurations

  1. Follow the setup guide.
  2. Verify that Airflow is running in Docker, the data is loaded to S3, and a Redshift cluster is running.

Running

  1. Go to http://localhost:8080/admin/ in your browser
  2. Enable arXiv_Redshift_dag and it will auto-start

ArXiv DAG Pipeline

Overview of the ArXiv Airflow pipeline:

[DAG Flow]

  • begin_execution (DummyOperator): Dummy task with no action.
  • create_staging_tables, create_main_tables (RedshiftExecuteSQLOperator): Creates the staging and main tables if they do not already exist; otherwise drops and re-creates them.
  • re_parse_authors, re_parse_citations (PythonOperator): Downloads and parses the JSON files, then saves them in another JSON format so that Redshift's COPY functionality can be leveraged.
  • stage_authors, stage_metadata, stage_citations, stage_classifications (StageFromS3ToRedshiftOperator): Loads data from S3 into the staging tables using COPY.
  • quality_check_staged_metadata, quality_check_staged_authors, quality_check_staged_citations, quality_check_staged_classifications (DataQualityOperator): Runs quality checks and validation scripts on the staged data.
  • stage_to_main_tables (DummyOperator): Dummy task with no action.
  • load_articles_table, load_article_version_dimension_table, load_article_authors_dimension_table, load_article_classifications_dimension_table, load_article_citations_dimension_table (LoadRedshiftTableOperator): Transforms and loads data from the staging tables into the main tables.
  • run_data_quality_checks (DummyOperator): Dummy task with no action.
  • quality_check_articles_fact, quality_check_versions_dim, quality_check_authors_dim, quality_check_classifications_dim, quality_check_citations_dim (DataQualityOperator): Runs quality checks on the data loaded into the main fact and dimension tables.
  • End_execution (DummyOperator): Dummy task with no action.
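
The ordering between these tasks can be expressed with Airflow's bitshift dependency syntax. The sketch below shows one plausible wiring inferred from the list above; the operators are assumed to be instantiated already, and the exact dependencies in the project's DAG may differ.

```python
# One plausible wiring inferred from the task list above; the project's
# actual dependencies may differ. Each name is an already-instantiated task.
begin_execution >> [create_staging_tables, create_main_tables]

create_staging_tables >> [re_parse_authors, re_parse_citations,
                          stage_metadata, stage_classifications]
re_parse_authors >> stage_authors
re_parse_citations >> stage_citations

stage_metadata >> quality_check_staged_metadata >> stage_to_main_tables
stage_authors >> quality_check_staged_authors >> stage_to_main_tables
stage_citations >> quality_check_staged_citations >> stage_to_main_tables
stage_classifications >> quality_check_staged_classifications >> stage_to_main_tables
create_main_tables >> stage_to_main_tables

stage_to_main_tables >> load_articles_table >> run_data_quality_checks
# ...the remaining load_*_dimension_table tasks fan out from stage_to_main_tables
#    and join at run_data_quality_checks in the same way...

run_data_quality_checks >> quality_check_articles_fact >> End_execution
# ...the remaining quality_check_*_dim tasks also feed End_execution...
```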

Loading data from S3

The process reads data from S3 using the following paths:

  • Metadata: s3://arxiv-etl/metadata
  • Authors: s3://arxiv-etl/authors/authors-parsed.json
  • Citations: s3://arxiv-etl/citations/citations_parsed.csv
  • Classifications: s3://arxiv-etl/classifications
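
As an example, staging one of these paths could look like the snippet below. StageFromS3ToRedshiftOperator is the project's custom plugin; the parameter names here (redshift_conn_id, aws_credentials_id, table, s3_path, json_format) are assumptions for the sketch and may not match its actual signature.

```python
# Hedged sketch: the parameter names are illustrative and may not match
# the actual signature of the project's StageFromS3ToRedshiftOperator.
stage_authors = StageFromS3ToRedshiftOperator(
    task_id='stage_authors',
    dag=dag,
    redshift_conn_id='redshift',             # Airflow connection to the cluster
    aws_credentials_id='aws_credentials',    # Airflow connection holding AWS keys
    table='staging_authors',
    s3_path='s3://arxiv-etl/authors/authors-parsed.json',
    json_format='auto',                      # maps to COPY ... FORMAT AS JSON 'auto'
)
```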

Data Processing

Data loaded from S3 is processed and transformed into five main fact and dimension tables in the Redshift cluster:

[DAG Flow]

Final row counts for denormalized data in main tables:

Table name           Row count
articles_fact        1,697,508
authors_dim          6,643,208
citations_dim        9,346,457
classifications_dim  2,735,492
versions_dim         2,618,307
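
The quality-check tasks validate, among other things, that the loaded tables are not empty. The function below is an illustrative sketch of such a row-count check against a generic PostgresHook, not the project's exact DataQualityOperator implementation.

```python
# Illustrative sketch of a row-count quality check; not the project's exact
# DataQualityOperator implementation.
from airflow.hooks.postgres_hook import PostgresHook

def check_table_not_empty(redshift_conn_id, table):
    """Fail the task if the given Redshift table contains no rows."""
    hook = PostgresHook(postgres_conn_id=redshift_conn_id)
    records = hook.get_records("SELECT COUNT(*) FROM {}".format(table))
    if not records or not records[0] or records[0][0] < 1:
        raise ValueError("Data quality check failed: {} returned no rows".format(table))
    return records[0][0]
```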

Scenarios

How I would approach the problem differently under the following scenarios:

The data was increased by 100x.

  • Increase the size of the current Redshift cluster. For this project I use a cluster of two nodes with a combined 320 GB of storage. This can easily be scaled both vertically (larger node types) and horizontally (more nodes in the cluster); see the sketch below.
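
For reference, a resize can be scripted with boto3; the sketch below assumes boto3 is available (as is typical for programmatic Redshift setups) and uses placeholder values for the cluster identifier, region, and target size.

```python
# Hedged sketch: cluster identifier, region, and target size are placeholders.
import boto3

redshift = boto3.client('redshift', region_name='us-west-2')

# Classic resize to a larger node type and a bigger node count.
redshift.modify_cluster(
    ClusterIdentifier='arxiv-etl-cluster',
    NodeType='dc2.8xlarge',
    NumberOfNodes=8,
)
```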

The data populates a dashboard that must be updated on a daily basis by 7am every day.

  • This should be handled by a separate Airflow DAG from the one in this project that loads from S3. The new DAG should update the dashboard and carry an SLA requiring the task to complete no later than 7:00 AM; see the sketch below.
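
Airflow supports this with task-level SLAs. The sketch below assumes a hypothetical dashboard-refresh DAG and callable; the schedule and SLA values are chosen so that an alert fires if the refresh is not finished before 7:00 AM.

```python
# Hedged sketch: the DAG id, schedule, and refresh_dashboard callable are
# hypothetical; only the SLA mechanism itself is standard Airflow.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def refresh_dashboard(**context):
    # Placeholder: read from the Redshift main tables and update the dashboard.
    pass

dashboard_dag = DAG(
    dag_id='arXiv_dashboard_dag',
    schedule_interval='0 5 * * *',   # run daily, early enough to finish before 7:00 AM
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

refresh = PythonOperator(
    task_id='refresh_dashboard',
    python_callable=refresh_dashboard,
    provide_context=True,
    sla=timedelta(hours=2),          # flag an SLA miss if not done in time for 7:00 AM
    dag=dashboard_dag,
)
```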

The database needed to be accessed by 100+ people.

  • With the current setup leveraging a Redshift cluster, this is already possible. As in the first scenario, it might require scaling the cluster to a larger size than it is currently.
