DENDDataLakeProject

Project Overview

This project is part of the Data Engineering Nanodegree (Data Lake module). It focuses on a music streaming startup, Sparkify, that wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in the app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.

File Structure

data: sample data files included for local practice. The full Sparkify dataset resides in S3 at s3://udacity-dend/song_data/ and s3://udacity-dend/log_data/
dl.cfg: holds our AWS user info
etl.py: the entire job lives in this file. It opens a Spark session, extracts the data from S3, transforms it, and loads the result to our own S3 path
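
The flow described for etl.py can be sketched roughly as below. This is a minimal illustration, not the actual contents of etl.py: the function names, the `songs` table name, the selected columns, the partition columns, and the `song_data/*/*/*/*.json` glob are all assumptions about the dataset layout. It also assumes pyspark is installed and that the cluster can read `s3a://` paths.

```python
import posixpath


def table_path(output_root, table_name):
    """Join an output root and a table name into one S3-style path."""
    return posixpath.join(output_root, table_name)


def create_spark_session():
    """Open (or reuse) a Spark session. Requires pyspark to be installed."""
    # Imported here so table_path() stays usable without Spark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName("sparkify-data-lake").getOrCreate()


def process_song_data(spark, input_root, output_root):
    """Extract song JSON, build a songs table, and load it as Parquet."""
    # Extract: the nesting depth of the glob is an assumption.
    song_df = spark.read.json(posixpath.join(input_root, "song_data/*/*/*/*.json"))

    # Transform: these column names are assumed, not taken from etl.py.
    songs = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Load: write the table under the output root, partitioned for faster reads.
    songs.write.mode("overwrite").partitionBy("year", "artist_id").parquet(
        table_path(output_root, "songs")
    )
```

A run would then look like `process_song_data(create_spark_session(), "s3a://udacity-dend/", "s3a://your-bucket/sparkify/")`, where `your-bucket` is a placeholder for your own output location.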

Data Schema

The following diagram (SCHEMA.png) depicts our dimensional model, which is a star schema:

Instructions to run the files

Enter your AWS account info in dl.cfg
In the main() section of etl.py, set the directory path where the dimensional model should be written
Open a terminal, navigate to the project folder, and run etl.py, or run it from your EMR terminal
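
As a rough sketch of how the credentials in dl.cfg might be picked up before the Spark job starts, the snippet below reads an INI-style file and exports the keys as environment variables. The `[AWS]` section name, the two key names, and the `load_aws_credentials` helper are assumptions for illustration; check them against your own dl.cfg and etl.py.

```python
import configparser
import os


def load_aws_credentials(config_path="dl.cfg", section="AWS"):
    """Read AWS keys from an INI-style file and export them as environment variables."""
    # interpolation=None keeps any '%' in a secret from being treated as a placeholder.
    config = configparser.ConfigParser(interpolation=None)

    # read() returns the list of files it parsed; an empty list means the file is missing.
    if not config.read(config_path):
        raise FileNotFoundError(f"config file not found: {config_path}")

    for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        os.environ[key] = config[section][key]
    return config
```

With a dl.cfg that contains an `[AWS]` section holding both keys, calling `load_aws_credentials()` from the project folder makes the credentials visible to the S3 connector through the environment.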

Acknowledgements

Credit goes to Udacity for providing the opportunity to practice our skills!
