Summary

This project demonstrates an ETL pipeline built with Spark that populates a set of analytics tables (stored on S3 as Parquet files) for analysing song plays from a fictional music streaming app called Sparkify. The project was made during my Udacity Data Engineering Nanodegree.

Building a Data Lake with Spark

  • Launched an AWS EMR cluster with Spark preinstalled
  • Created an ETL pipeline with PySpark
  • Created 1 fact table and 4 dimension tables, loaded back into S3 as Parquet files

Notes

  • The song metadata is a subset of the Million Song Dataset (http://millionsongdataset.com/).
  • The log dataset simulates user activity on the fictional music streaming app Sparkify.
  • Both datasets reside in Udacity's S3 buckets and can be found here (see the read snippet after this list):
    • s3://udacity-dend/log_data
    • s3://udacity-dend/song_data
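A minimal sketch of reading both datasets into Spark DataFrames, assuming the JSON files sit a few partition levels deep inside each bucket (the wildcard depths are an assumption, not something documented here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Wildcard depths are assumptions about the bucket layout; adjust to the real prefixes.
song_df = spark.read.json("s3://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3://udacity-dend/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```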

ETL pipeline

The ETL pipeline etl.py uses Python (PySpark) and Spark SQL (see the sketch after this list):

  • Extracts JSON files from S3
  • Transforms them into PySpark DataFrames
  • Loads them back into S3 as Parquet files for analytical purposes
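A hedged sketch of those three steps for the songs and artists tables, reusing spark, song_df and log_df from the snippet above; the output bucket s3://sparkify-output/ is hypothetical and the column names are assumptions about this project's data:

```python
from pyspark.sql import functions as F

output = "s3://sparkify-output/"  # hypothetical output bucket

# Transform: derive dimension tables from the raw song metadata.
songs_table = (
    song_df.select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
artists_table = (
    song_df.select(
        "artist_id",
        F.col("artist_name").alias("name"),
        F.col("artist_location").alias("location"),
        F.col("artist_latitude").alias("latitude"),
        F.col("artist_longitude").alias("longitude"),
    )
    .dropDuplicates(["artist_id"])
)

# Load: write the tables back to S3 as Parquet, partitioning songs for pruning.
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output + "songs/")
artists_table.write.mode("overwrite").parquet(output + "artists/")
```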
Running the ETL job
  • Open a new IPython notebook and run the command
    • %run etl.py

Data for analytical purposes is loaded back into a separate S3 bucket

The Parquet files loaded back into the S3 bucket are designed for analysing song plays from the Sparkify app.
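For example, an analyst could read the Parquet files back and query them directly; the bucket name and column names below are the same assumptions used in the sketches above:

```python
from pyspark.sql import functions as F

songplays = spark.read.parquet("s3://sparkify-output/songplays/")
songs = spark.read.parquet("s3://sparkify-output/songs/")

# Top 10 most-played songs.
(
    songplays.join(songs, "song_id")
    .groupBy("title")
    .count()
    .orderBy(F.desc("count"))
    .show(10)
)
```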

Schema design

  • Fact table
    • songplays are records in log data associated with song plays i.e. records with page NextSong
  • Dimension tables
    • users are users in the app
    • songs are songs from song metadata database
    • artists are artists from song metadata database
    • time are timestamps of records in songplays broken down into specific units

Together, the fact table and dimension tables form a star schema (see the sketch below).
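A minimal sketch of how the remaining tables could be built from the log data, reusing log_df and song_df from the earlier snippets; the log column names (ts, page, userId, ...) are assumptions about the dataset, not taken from this README:

```python
from pyspark.sql import functions as F

# Keep only actual song plays (page == NextSong).
plays = log_df.filter(F.col("page") == "NextSong")

# users dimension table.
users_table = plays.select(
    F.col("userId").alias("user_id"),
    F.col("firstName").alias("first_name"),
    F.col("lastName").alias("last_name"),
    "gender",
    "level",
).dropDuplicates(["user_id"])

# time dimension table: break the millisecond epoch timestamp into units.
plays = plays.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
time_table = plays.select("start_time").dropDuplicates().select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofmonth("start_time").alias("day"),
    F.weekofyear("start_time").alias("week"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
    F.dayofweek("start_time").alias("weekday"),
)

# songplays fact table: match log events to song metadata by title, artist and duration
# (the song, artist and length columns on the log side are assumptions).
songplays_table = plays.join(
    song_df,
    (plays.song == song_df.title)
    & (plays.artist == song_df.artist_name)
    & (plays.length == song_df.duration),
    "left",
).select(
    F.monotonically_increasing_id().alias("songplay_id"),
    "start_time",
    F.col("userId").alias("user_id"),
    "level",
    "song_id",
    "artist_id",
    F.col("sessionId").alias("session_id"),
    "location",
    F.col("userAgent").alias("user_agent"),
)
```

In etl.py these tables are then written back to S3 as Parquet files, in the same way as the songs and artists tables shown above.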
