GenrePredictor

This project uses several machine learning techniques to predict the genre of songs as accurately as possible.

To run the data collector and organizer:

The data collector relies on parsing an aggregated h5 song file called agg_all_songs.h5. I already ran the code that builds this file. It is too big to put on GitHub or send by email, so for now I am sharing it through Google Drive. You need this file in order to parse the data.

You will need some Python libraries to run the code. The easiest way to get them is probably pip; instructions are easy to find online if you have trouble installing them. They are:

- tables
  + This is a library for reading hdf5 files
- sqlite3
  + This is part of the Python standard library, so it normally needs no separate install
- numpy

Not sure what else you would need to install, but if the code fails because of a missing library, the error should say which one, and you can just install it.
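To find out which of these libraries are missing before running the collector, a small check like the following can help. This is only a convenience sketch and not part of the repository: the function name `missing_modules` and the constant `REQUIRED_MODULES` are made up for illustration, and the list only covers the libraries named above.

```python
import importlib.util

# Import names taken from the list above; extend this if the code
# reports another missing library.
REQUIRED_MODULES = ["tables", "sqlite3", "numpy"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        # sqlite3 ships with Python itself, so pip only applies to the others.
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Anything reported as missing, other than sqlite3, can usually be installed with `pip install <name>`.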

The file that has everything you need to get a list of Datapoints is datacollector.py. It relies on Artistdata.py for the custom dictionary class named Datapoint, and on hdf5_getters.py, which is used to retrieve data from the h5 song file.

The main function of datacollector.py shows how to use this class to parse the data and get a list of Datapoints:

- Lines 556-559 call get_mbtag_freq() to get the 100 most frequent tags used in the DB and then build an inverted index out of them.
- Line 583 calls parse_aggregate_songs with the aggregate file shared on Google Drive. This reads the raw song data and creates an intermediate map that groups all of the song data by artist.
- Line 585 calls parse_artis_map() with the map just created and the tag frequency inverted index. It flattens all list values in this map, using simple averaging at the moment, and produces a list of flattened maps.
- Line 587 calls scale_and_convert_maps(), which takes the list of flattened artist maps, scales all of the non-binary features (besides the years) to values between 0.0 and 1.0, converts the maps to the Datapoint class (literally just a dictionary with instance variables), and returns them in a new list.

The list returned by scale_and_convert_maps() is the final list you should be working with. The file also has a lot of notes that are mostly reminders to myself about future updates; what is there should still be enough to get going. This code lives in the main function for now, but you can import the class and call the functions the same way the main function does to get a list of Datapoint maps.
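The three ideas in this pipeline (an inverted index over the most frequent tags, flattening list values by averaging, and scaling features to 0.0-1.0) can be illustrated with a small standalone sketch. This is not the code in datacollector.py: the function names, the feature names `tempo` and `loudness`, and the choice to map an empty list to 0.0 are all assumptions made for the example.

```python
from collections import Counter


def build_tag_index(tags, top_n=100):
    """Map each of the top_n most frequent tags to a position (inverted index)."""
    most_common = Counter(tags).most_common(top_n)
    return {tag: position for position, (tag, _count) in enumerate(most_common)}


def flatten_by_average(artist_map):
    """Replace every list value with its mean; leave scalar values untouched."""
    flat = {}
    for key, value in artist_map.items():
        if isinstance(value, list):
            # Assumption: an empty list is flattened to 0.0.
            flat[key] = sum(value) / len(value) if value else 0.0
        else:
            flat[key] = value
    return flat


def min_max_scale(flat_maps, keys):
    """Scale the given keys to the range 0.0-1.0 across all maps (returns copies)."""
    scaled = [dict(m) for m in flat_maps]
    for key in keys:
        values = [m[key] for m in flat_maps]
        low, high = min(values), max(values)
        span = high - low
        for m in scaled:
            # If every value is identical there is nothing to scale; use 0.0.
            m[key] = (m[key] - low) / span if span else 0.0
    return scaled


# Hypothetical per-artist data: list values hold one entry per song.
artists = [
    {"tempo": [100.0, 140.0], "loudness": [-10.0, -6.0], "year": 1999},
    {"tempo": [80.0], "loudness": [-4.0], "year": 2005},
    {"tempo": [160.0, 200.0], "loudness": [-12.0], "year": 1987},
]

tag_index = build_tag_index(["rock", "pop", "rock", "jazz", "rock", "pop"], top_n=2)
flat = [flatten_by_average(a) for a in artists]
# Years are left out of the scaled keys, matching the description above.
scaled = min_max_scale(flat, keys=["tempo", "loudness"])
```

Here the year is passed through unchanged, since the description above says years are not scaled.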
