Sparkify DataLake

Unlike DWHs, Spark-based Datalakes are not limited by relations and allow a wide range of analytics on the row data. Therefore, optimizing queries and analytics is not confined to traditinoal techniques in DWHs such as using starschemas. However, we adopt in this project a starschema-like database since it powers the analytics related to the songs played by the users in a good way. Since we are using a DataLake, the user is always able to query the data in its original form.

Considering the song dataset we can divide the fields into two groups. One group of fields concern the artist. Namely, “artist_id”, “artist_location”, etc… The other group of fields concern the song. Namely, “song_id”, “duration”, etc…

Similarly, looking at the log data, we recognize other cateogries of info. Most importantly, in the timestamped logdata we can discern each song play (defined as a record with page = next). We also find data related to the session (session_id, itemsinSession, etc... ) and the users(e.g. userAgent, userId, etc...).

Each record for a songplay needs to be indexed with a new field called songplay_id which will serve as a primary key for our central fact table.

Based on our user requirements, session data will not be required. As for all other categories, this project includes the code to create the tables and the ETL pipelines powering storing them in a starschema.

Project structure

etl.py - A script defining the ETL process
dl.cfg - A configuration file containing AWS credentials

How to run

SSH to the master node of AWS EMR spark cluster
use spark-submit to run the the script. e.g. /usr/bin/spark-sbumit --master yarn ./etl.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

dl.cfg

dl.cfg

etl.py

etl.py

Repository files navigation

Sparkify DataLake

Contents

Summary

Data Modelling

Project structure

How to run

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
dl.cfg		dl.cfg
etl.py		etl.py

HasanAhmadQ7/datalakeSparkify

Folders and files

Latest commit

History

Repository files navigation

Sparkify DataLake

Contents

Summary

Data Modelling

Project structure

How to run

About

Resources

Stars

Watchers

Forks

Languages