Million-Song-Dataset-Analysis-using-ML-models-on-Big-Data

Description: Music is often considered a reflection of society and is of particular interest to researchers examining the culture and values of each generation. For a human it is relatively easy to tell whether a song belongs to a certain era, but for machines the problem is not trivial. Using the Million Song Dataset, a collection of audio features and metadata, I evaluated different classification algorithms on their ability to predict whether a song dates from before or after the year 2000, and achieved a best ROC-AUC score of 0.775.
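For readers unfamiliar with the metric, the following is a minimal sketch of how ROC-AUC can be computed for a before/after-2000 label. It is not the code used in this repository: the years and scores are made-up illustration values, the `roc_auc` helper is hypothetical, and treating the year 2000 itself as "after" is an assumption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive example is
    scored above a random negative one (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up release years and classifier scores, for illustration only.
years = [1985, 1999, 2003, 2010]
# Assumption: songs from 2000 onward form the positive class.
labels = [1 if y >= 2000 else 0 for y in years]
scores = [0.2, 0.6, 0.4, 0.9]

auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one, which puts the reported 0.775 in context. The pairwise loop is quadratic and only suitable for small examples.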

While the goal of this project is to accurately determine whether a song dates from before or after 2000, the format of the data poses a significant challenge of its own: the raw dataset consists of files in the binary .h5 (HDF5) format. Processing those files, especially in a distributed environment, is non-trivial. The implementation therefore focuses on parsing those binary files in a distributed manner with Spark, and the machine learning part serves more as a proof of concept.
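One possible way to approach the distributed parsing is sketched below. It is an illustration under stated assumptions, not the repository's actual implementation: it assumes `h5py` is installed on the executors, that each song file stores its release year in the `year` field of the `/musicbrainz/songs` table, and that a year of 0 means "unknown". The function names and the `cutoff` parameter are hypothetical.

```python
import io


def year_from_h5_bytes(content):
    """Read the release year from one song file held in memory."""
    import h5py  # imported lazily so it is resolved on the Spark executors

    with h5py.File(io.BytesIO(content), "r") as f:
        # Assumed layout: compound table /musicbrainz/songs with a 'year' field.
        return int(f["musicbrainz/songs"]["year"][0])


def to_label(year, cutoff=2000):
    """Return 1 for songs from `cutoff` onward, 0 for earlier songs,
    and None when the year is missing (assumed to be stored as 0)."""
    if year <= 0:
        return None
    return 1 if year >= cutoff else 0


def label_songs(sc, path_glob):
    """Build an RDD of (file path, label) pairs from a glob of .h5 files."""
    # binaryFiles yields one (path, bytes) pair per file, so each file
    # is parsed whole on whichever executor receives it.
    files = sc.binaryFiles(path_glob)
    labelled = files.mapValues(lambda b: to_label(year_from_h5_bytes(b)))
    # Drop songs whose year is unknown.
    return labelled.filter(lambda kv: kv[1] is not None)
```

Given an existing `SparkContext` named `sc`, calling `label_songs(sc, path_glob)` with a glob that matches the .h5 files would return the labelled RDD. Reading each file fully into memory is reasonable here only because individual song files are small.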


skalogerakis/Million-Song-Dataset-Analysis-using-ML-models-on-Big-Data
