QUICK GETTING STARTED (this will only work until May 15, when I delete the snapshot):
Clone EBS snapshot snap-b1367ade onto a 350GB EBS volume
Use StarCluster to start a cluster with the cloned volume mounted at /mnt/data
SSH to the master and each node and run "source /mnt/data/initialize64.sh"
SSH to the master; "cd million_song"
Add the dataset to Hadoop: either run "./add_database.sh" (this will take several hours) or "hadoop fs -copyFromLocal /mnt/data/hadoop_backup/data data" (this will take a few minutes)
You should now be ready to go. Try running some of the .sh scripts.

GETTING STARTED WITHOUT THE AWS SNAPSHOT
Getting PyTables:
HDF5_getters.py requires PyTables. If you're on a standard Ubuntu image, you can just "apt-get install python-tables".
If you're on StarCluster's Hadoop AMI, you need to do the following:
"apt-get remove cython"
Download and install the latest version of Cython
Download and install the latest version of numexpr
Download and install the latest version of HDF5
Export HDF5_DIR as the directory where you installed HDF5
Export LD_LIBRARY_PATH as the path to the HDF5 shared libraries ($HDF5_DIR/lib)
Download and install the latest version of PyTables

Manually generating /mnt/data:
Download the Million Song Dataset from http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset and extract it into /mnt/data/AdditionalFiles and /mnt/data/data 
Download the training-artist split from https://github.com/tb2332/MSongsDB/raw/master/Tasks_Demos/Tagging/artists_train.txt and add it to AdditionalFiles
Download the cover song database from https://github.com/tb2332/MSongsDB/raw/master/Tasks_Demos/CoverSongs/shs_dataset_train.txt and https://github.com/tb2332/MSongsDB/raw/master/Tasks_Demos/CoverSongs/shs_dataset_test.txt and add both files to AdditionalFiles
Run filelist.sh to generate filelist.txt. Copy this to /mnt/data
Run add_database.sh to add data to the Hadoop file system. After running this, it is suggested to reboot the cluster before continuing.

DESCRIPTION OF EACH SCRIPT:
add_database.sh:
Requires filelist.sh to be run first and filelist.txt to be copied to /mnt/data.
Goes through /mnt/data/data, processing and copying data to the Hadoop file system. MUST BE RUN BEFORE ANY HADOOP SCRIPTS.

all.sh:
Runs all Hadoop scripts

artist_by_most_popular_term.sh
Finds the most commonly tagged genre for each artist, then counts these up to create a top-ten list. This list is then used for the Bayes and Fisher classifiers.

average_tempo.sh
Finds the average tempo across all songs

bayes.sh
Builds and runs the Bayes classifier

bayes_classify.sh
Runs previously-built Bayes classifier

cover_song.sh
Requires cover_song_prepare.sh to be run first
PageRanks the most influential artists using the cover song database
This uses the same PageRank code from homework 4, but the process of turning the list of cover songs into a link map was my own idea. I take each group of cover songs, find the oldest song, and say the newer songs are covers of that song. I then turn this into links from the newer songs' artists to the original song's artist. The resulting link map can then be PageRanked. Having completely made this up, I was pleasantly surprised when the results actually made sense: the top-ten list consists of artists that are thought to be very influential in music.
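
A minimal sketch of that link-map construction, assuming each cover-song group has already been reduced to a list of (artist_id, year) pairs; the function and variable names here are illustrative rather than the actual script's:

    # Build the artist link map described above: within each group of cover
    # songs, the oldest recording is treated as the original, and every other
    # artist in the group "links" to the original's artist.
    def build_link_map(cover_groups):
        links = {}  # covering artist -> set of original artists
        for songs in cover_groups:
            dated = [(artist, year) for artist, year in songs if year > 0]
            if len(dated) < 2:
                continue  # need an original plus at least one cover
            original_artist = min(dated, key=lambda pair: pair[1])[0]
            for artist, _ in dated:
                if artist != original_artist:
                    links.setdefault(artist, set()).add(original_artist)
        return links

    # Example: a 1968 song by artist A, covered by B and C
    print(build_link_map([[("A", 1968), ("B", 1975), ("C", 1990)]]))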

cover_song_prepare.sh
Converts cover song database into a format that is easier to run map-reduce on.

cover_song_results.sh
Returns results of cover_song.sh in a tab-delimited format for easy import into Excel

danceability.sh
Finds the average danceability score. Note that this is 0, as no song has a danceability score.

fastest.sh
Sorts songs by tempo

filelist.sh
Generates filelist.txt. After running, copy this file to /mnt/data. This is needed to run add_database.sh.

fisher.sh
Builds and runs the Fisher classifier

fisher_classify.sh
Runs previously-built Fisher classifier

hottness.sh
Finds the top 10 songs by "hotttnesss" in each genre

initialize64.sh
This script resides in /mnt/data and is run on the master and each node after cluster creation to install the necessary packages

loudness.sh
Finds the song with the highest variance in loudness

loudness_by_year.sh
Reports the average variance in loudness per year

most_generic_pop.sh
Finds the most generic rock song by running the Fisher classifier on all rock songs and sorting by the rock classifier's score. The Fisher classifier must have been generated before running this.

most_popular_terms.sh
Finds the most popular terms that songs have been tagged with

songs_per_artist.sh
Reports how many songs each artist has in the dataset.

timing.sh
Runs a representative sample of scripts in order to compare speeds between different-sized clusters


PROJECT WRITEUP:
Project background:
Pandora and similar projects have popularized analysis and machine learning on large audio datasets. The Million Song Dataset is a recently released collection of the features and metadata for a million contemporary songs. The dataset spans over two hundred and seventy gigabytes, which makes it a great test of Hadoop's abilities with large files.

The dataset is divided into a million separate files, one file per song. Within each file, the data is stored in the HDF5 format, a binary format optimized for large tables. This data can be read by PyTables, but PyTables relies on the tables being resident on disk, which complicates Hadoop streaming programs. Because of this, and because uploading the entire dataset into Hadoop would have taken over 60 hours, it was decided to instead access the files over NFS and upload only the data we needed into Hadoop, in a text format.
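
A rough sketch of that approach, in the form of a Hadoop streaming mapper that receives HDF5 file paths on stdin, reads the files over NFS, and emits only a few summary values as tab-separated text (the getter names follow the dataset's HDF5_getters module, but treat them as illustrative):

    #!/usr/bin/env python
    # Streaming mapper sketch: input records are paths to HDF5 song files on
    # the NFS mount; output is a small tab-separated summary per song.
    import sys
    import numpy as np
    import HDF5_getters

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        h5 = HDF5_getters.open_h5_file_read(path)
        try:
            artist = HDF5_getters.get_artist_name(h5)
            tempo = HDF5_getters.get_tempo(h5)
            # keep only summary statistics, not the full time series
            loud = np.array(HDF5_getters.get_segments_loudness_max(h5))
            print("%s\t%f\t%f\t%f" % (artist, tempo, loud.mean(), loud.var()))
        finally:
            h5.close()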

The dataset contains both user- and automatically-generated metadata (genre, hotttnesss, danceability) as well as data about each section, segment, tatum, bar, and beat of the song. An additional project added a manually-generated database of cover songs, consisting of a unique cover-song identifier and the title and artist identifiers of each version of that song.

This program is divided into separate shell scripts that each run an individual Hadoop job. For the majority of the shell scripts, there is no post-processing and the data is simply left in the Hadoop file system. Scripts that output data a future job will rely on (e.g., the classifiers use previously generated lists of the top ten most popular terms) do the necessary post-processing and keep the generated text files in the current directory of the local file system. There are also a few scripts to batch up a large number of jobs; all.sh runs all of the tasks, while timing.sh runs a subset used for timing comparisons between clusters of different sizes.

Parallelization:
Three tasks were run on both 4- and 8-node m1.large clusters in an attempt to characterize the speed of the different types of tasks. The script to find the average tempo is a good representative of the majority of the scripts in the project: a mapper that mostly just reads data from the dataset, followed by a simple reducer. It ran on the 8-node cluster in 47.8% of the time of the 4-node cluster, close to the ideal 2x speedup that was expected.
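
For reference, the "simple reducer" half of that pattern looks roughly like the following sketch, which assumes the mappers emit lines of the form "tempo<TAB>value":

    # Reducer sketch: average every value that arrives under the "tempo" key.
    import sys

    total, count = 0.0, 0
    for line in sys.stdin:
        try:
            _, value = line.rstrip("\n").split("\t")
            total += float(value)
            count += 1
        except ValueError:
            continue  # skip malformed records
    if count:
        print("average_tempo\t%f" % (total / count))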

The cover song PageRank algorithm didn't do as well. It consists of over twenty separate map-reduce jobs run in series, and the majority of the time was spent tearing down the previous map-reduce job or starting up the next one rather than in the actual map-reduce work. Because of this, adding more nodes to the cluster only reduced runtime by 3%. I would expect the improvement from extra nodes to grow if a larger cover song database were used, but the link map created from this database was too simple to fully utilize the added nodes.

Building the Fisher classifier is inherently limited in its parallelism because it can only run ten reducers at the same time, one for each category it is classifying. We see the 8-node cluster taking 43% less time than the 4-node cluster because the 4-node cluster can only run 8 reducers at one time, while the 8-node cluster can run all 10. Once all reducers are running at once, however, adding nodes will not help. A future optimization would be to use a parallel algorithm in place of this reducer to build the classification matrix.

Categorization with the Fisher classifier, on the other hand, does take advantage of as many nodes as possible. We see it running in 48% less time on the 8-node cluster than on the 4-node cluster.

Speedup lessons:
The most interesting lesson from this project was that "hadoop fs -copyFromLocal" is not necessarily the fastest way to get data into the cluster when not all of the data is needed. In our case, the Million Song Dataset contains a lot of information that could be further compressed; it includes several time sequences of data when our processing tasks were only concerned with the means and variances of those sequences. By computing these means and variances from the raw files and uploading only them, rather than the original files, a significant speedup was attained.

For the Million Song Dataset, directly uploading the data to the Hadoop file system would have taken over 60 hours. Instead, we ran a map-reduce job in which the filenames were passed to the mapper, which then accessed the files over NFS, processed them, and uploaded the results; this took under 7 hours on a 4-node cluster. The uncompressed data is 271GB, while the compressed data that was uploaded is only 3GB.

NumPy is significantly faster than doing the math in pure Python. For any sort of math over a list, converting that list to a NumPy array, running a NumPy operation on the array, and converting the array back to a list is orders of magnitude faster than iterating over the list in Python. This was discovered early in the project, so there is no hard data for the amount of speedup gained, but it is significant.
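
As a rough illustration (timings are indicative only), the vectorized version of a mean/variance computation looks like this:

    # Compare a pure-Python mean with the vectorized NumPy equivalent.
    import numpy as np

    loudness = [float(i % 60) - 30.0 for i in range(1000000)]

    # Pure Python: iterate over the list
    mean_py = sum(loudness) / len(loudness)

    # NumPy: convert once, then let the array operations do the work
    arr = np.asarray(loudness)
    mean_np, var_np = arr.mean(), arr.var()

    assert abs(mean_py - mean_np) < 1e-6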

Use global variables for constants. When building the Bayesian classifier, the original design was for the list of training artists to be re-loaded from disk at every iteration of the mapper. By moving this from the mapper to program start, the execution time of building the classifier dropped from 54 minutes to seven and a half.
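
A sketch of that change, assuming a streaming mapper that filters records by training artist (the file name matches artists_train.txt above, but its exact format is simplified here):

    # Load the training-artist list once at program start, not per record.
    import sys

    TRAIN_ARTISTS = set(line.split("<SEP>")[0].strip()
                        for line in open("artists_train.txt"))

    def mapper(line):
        artist_id, payload = line.split("\t", 1)
        if artist_id in TRAIN_ARTISTS:  # cheap lookup against the global set
            print("%s\t%s" % (artist_id, payload))

    for record in sys.stdin:
        mapper(record.rstrip("\n"))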

To determine whether an artist is in the test or training set, the classifiers look in a dictionary that has the training-set artists as keys. The statement "dictionary.has_key(key)", which hashes the key and checks whether there is any data under that hash, is much faster than the equivalent statement "key in dictionary.keys()", which (in Python 2) builds a list of all of the keys in the dictionary and then compares them individually to the key. In retrospect this makes sense, but at the time that one line was responsible for quadrupling the running time of the classifier.
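
Concretely, the three spellings of the membership test look like this (Python 2 semantics; in Python 3, has_key is gone and a bare "key in dictionary" is the idiom):

    # Membership tests against the training-artist dictionary.
    train = {"AR001": 1, "AR002": 1}   # training-set artists as keys
    key = "AR002"

    fast = train.has_key(key)   # hash lookup (Python 2 only)
    also_fast = key in train    # equivalent hash lookup, works everywhere
    slow = key in train.keys()  # Python 2: builds a list of all keys first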

Data that is more complicated than a few values is passed between mappers and reducers as JSON objects. This makes programming much simpler, but it is slow; json.loads(object) has a significant fixed cost for each object it decodes. Because of this, it's much faster to decode a single large object than many smaller ones. This fact was used to speed up the Fisher classification. The first part of the classifier builder takes arrays from the mappers and combines them into a single two-dimensional array. The cost of converting each JSON object to an array and combining them, coupled with the large amount of time taken by the actual classifier-building work, caused the reducer not to finish within ten minutes, triggering a timeout. By using string manipulation to combine these JSON objects into a single object and decoding that instead, eight of the ten reducers were able to finish in the allotted ten minutes. The ultimate fix for the other two was to increase the timeout, but the speed gain still holds.
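
A small sketch of that batching trick, assuming each mapper record carries one JSON-encoded array:

    # Decode many JSON arrays in one json.loads call instead of one per record.
    import json

    records = ['[1.0, 2.0, 3.0]', '[4.0, 5.0, 6.0]', '[7.0, 8.0, 9.0]']

    # Slow path: one json.loads call per record
    rows_slow = [json.loads(r) for r in records]

    # Faster path: splice the strings into one JSON list, decode once
    rows_fast = json.loads("[" + ",".join(records) + "]")

    assert rows_slow == rows_fast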

Annoyances:
Having not worked on projects with large memory requirements in a garbage-collected language before, I now see that the main difference from C is that when the program has a memory leak, it is a lot harder to fix in Python. Objects created in the mapper and reducer functions tended not to be destroyed, leading to memory leaks. This would not be a big deal with a smaller dataset, but when dealing with hundreds of gigabytes of data, it can quickly crash a cluster. The solution was to manually delete objects when done with them and call gc.collect() to force a garbage-collection pass.
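
The clean-up pattern looks roughly like this (a sketch only; the data here is a stand-in for the large per-group arrays the real reducers build):

    # Drop large temporaries explicitly and force a collection pass so a
    # long-running streaming job does not accumulate memory.
    import gc
    import numpy as np

    def process_groups(groups):
        for key, values in groups:
            matrix = np.array(values, dtype=np.float64)  # large temporary
            yield key, (matrix.mean(), matrix.var())
            del matrix     # release the reference as soon as we are done
            gc.collect()   # and run the garbage collector explicitly

    for key, stats in process_groups([("song", [[1.0, 2.0], [3.0, 4.0]])]):
        print("%s\t%s" % (key, stats))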

Rebooting a cluster after garbage collection failed, or after the cluster otherwise crashed, would often crash StarCluster itself. Running the StarCluster reboot command a second time seemed to work.

The danceability for all songs was 0 in the data set. This made comparing songs by danceability rather difficult.

Individual songs were not tagged by genre; instead, artists were tagged. I got around this by tagging each song with its artist's genre, but this is far from ideal for artists whose songs span multiple genres.

Future work:
This work barely scratches the surface of what is possible with the Million Song Dataset. The most obvious next step is to try additional classifiers for genre classification. In addition, there is a lot of other metadata to classify on, such as artist name.

During the middle of the project, a lyrics database was introduced. This will be able to yield all sorts of interesting data, as well as potentially improving the accuracy of our classifiers by a large margin.

This project intentionally avoided trying to figure out machine learning methods for time-differing data; we took the mean and variance of pitch and loudness rather than trying to see how it changed over time. Figuring out how to incorporate this time data into classifiers would be an interesting challenge.



SCRIPT FOR VIDEO PRESENTATION:
The Million Song Dataset is a recently released collection of the features and metadata for a million contemporary songs. This dataset spans over two hundred and seventy gigabytes, which makes it a great test of Hadoop.

As a warm-up exercise, we can very easily find the AVERAGE TEMPO, MOST POPULAR GENRE, FASTEST SONG, and several other statistics. 

The dataset includes measures of a song's loudness for each of its sections. By taking the variance of these loudnesses, we can determine the difference in volume between the loud and quiet sections of a song. If we graph the average of these variances per year, we see a weak upward trend. However, if we focus on just the last twenty years, we see a much clearer trend toward higher variance in loudness.

The dataset also includes a list of cover songs. Using this, we can find the most influential artists. We take each group of cover songs and attach release dates to the songs, then call the oldest song the original and the others covers. We can then turn this into a link map, with the cover songs linking to the original, and PageRank that link map to find the most influential artists.

Finally, we can try some machine learning algorithms on this data. We tried two classification algorithms: a naive Bayes classifier and the Fisher linear discriminant. The Bayes classifier does a great job finding the members of the most popular category, in this case rock, but a terrible job of classifying the rest, correctly categorizing only 0.6% of them. The Fisher classifier does worse at finding rock songs, but a better job overall, correctly classifying 54% of songs versus 38% for Bayes.

We can also turn this classification algorithm inward, using it to find the song that best fits into a category. Doing this for rock, the largest genre, the most generic rock song turns out to be Oasis's "Better Man".

This work just scratches the surface of what you can do with the Million Song Dataset. I encourage each of you to explore it on your own. Feel free to play around with my code on GitHub and have fun.
