我們想要觀察音樂的流行趨勢,並視覺化資料的結果和預測。
因此我們找了有百萬首歌以上的資料庫進行分析。
- 來自Amazon Public Dataset snapshot
- 放在Amazon AWS上的500GB資料集,需要透過Amazon EBS存放後再用Amazon EC2下載該資料集。
- 資料庫裡面含有歌曲名稱,歌詞,歌手資料,風格的標籤,歌手所在經緯度等。
- Apache Spark
- Anaconda
- Python
- MLlib
- D3.js
- Google API
- Amazon AWS存放資料,下載資料
- Apache Spark整理資料,分析資料
- html, css, javascript來視覺化分析結果
- Google API來plot經緯度
- 目前的資料集著重在歐美,而且歌曲數量仍不足以表現出流行趨勢的變動(無法做出有力的結論)。
- Google API還有從Amazon AWS處理這些資料時,免費的選擇可能下載超久或是額度不夠用。
- 很多歌曲的資料欄位仍然有缺失,在處理遺失資料的問題時,應該有更好的處理方法。
We wanted to observe the music trends around the world and try to visualize the data to make some conclusion and prediction. Therefore, we found a million song dataset and several relative datasets to do the work.
- A collection of audio features and metadata for a million contemporary popular music tracks, as our datasets to analyze.
- The Million Song Dataset is also a cluster of complementary datasets containing cover songs, lyrics, user data, genre labels and song tags similarity contributed by the community.
- Apache Spark
- Anaconda
- Python
- MLlib
- D3.js
- Google API
We used html, css, javascript to visualize the data analysis result. From this webpage, we can clearly see the popularity of the song tags, for instance, language, style, singers etc. from 1924 to 2010. Using Google api to acquire the exact coordinates of each location, and plot them on the map.
- Some data were missing, made it difficult to analyze. If the data is more complete, we can do more works and predictions with it.
- Google API only provides 2500 free accesses quota per day, we used many google accounts to accomplish this task with 24000 locations.
- The information provided by last.fm which is very trivial, is in the json form with a million directories. We have to organize these data into a csv table with one million times IO. Therefore, we chose spark which can do multiple read-in tasks simultaneously. It took us 4.5 hours to finish the tasks with approximately 6x efficiency compared to running program on localhost.
- The size of the data is still too small to make a strong conclusion.
Use Echo API to search for more complete global music dataset instead of current ones which mainly focused on Europe and America.
How to get the dataset:
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.