

This project, "connected by songs" (song + friendly), is a song recommendation application built during my time in the Insight Data Engineering program.


  • Many talented local artists get less visibility and reach in music streaming applications - increase their reach
  • Build a community of people with similar musical tastes and let them explore music together - connect
  • Personalized recommendations often tie users down to their history and fail to explain why something is recommended to the user - provide transparency

This project presents an approach to address the above concerns.

Introduction

The project is a song recommendation application with the following features:

  • Suggest songs to a user based on the songs listened to by the most relevant friends of the user
  • Suggest artists to listen to based on the current location of the user
  • Suggest songs frequently played together with the current song (users who listened to this also listened to)
  • Suggest friends based on a relevance score which mimics a naive, logical implementation of collaborative filtering defined as:

[Image: relevance score formula]
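
The relevance formula itself was published as an image and cannot be recovered from this text. As a purely illustrative stand-in, the sketch below ranks candidate friends by the Jaccard overlap of their listened-song sets, which is one common naive form of user-based collaborative filtering. The function names, the sample data and the choice of Jaccard similarity are all assumptions, not the project's actual definition.

```python
def jaccard_relevance(songs_a, songs_b):
    """Hypothetical relevance score: shared songs divided by all songs of both users."""
    a, b = set(songs_a), set(songs_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def suggest_friends(user, listens, top_n=3):
    """Rank other users by the stand-in score, dropping users with no overlap."""
    scores = [
        (other, jaccard_relevance(listens[user], songs))
        for other, songs in listens.items()
        if other != user
    ]
    scores = [(other, score) for other, score in scores if score > 0]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Made-up listening history for illustration only.
listens = {
    "adam": {"s1", "s2", "s3"},
    "beth": {"s2", "s3", "s4"},
    "carl": {"s9"},
}
print(suggest_friends("adam", listens))  # [('beth', 0.5)]
```

Because the score is built from the shared songs themselves, the same intersection can be shown to the user as the reason for a suggestion, which fits the transparency goal stated earlier.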


I used the "Million Song Dataset" [1], which is "a freely-available collection of audio features and metadata for a million contemporary popular music tracks" according to the LabROSA website. Along with the song metadata, a list of more than 150 million user-song request pairs was obtained from the Echo Nest [2], as was a list of unique artists with their location information. More details can be found here.

Data Pipeline

[Image: data pipeline diagram]

Ingestion Layer

Kafka: The user taste profile is used to synthesize more user-song requests as a stream of request data. This stream and a synthesized stream of users' current locations are ingested into Kafka.
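
One way to synthesize such a request stream is sketched below in plain Python. It assumes the taste profile is available as (user, song, play count) triplets and samples pairs in proportion to their play counts. The message field names (`user_id`, `song_id`, `timestamp`) and the function name are hypothetical, and the actual publishing to a Kafka topic is left out.

```python
import json
import random
import time


def synthesize_requests(taste_profile, n, seed=0, start_ts=None):
    """Yield n JSON messages, sampling (user, song) pairs weighted by play count."""
    rng = random.Random(seed)
    pairs = [(user, song) for user, song, _ in taste_profile]
    weights = [count for _, _, count in taste_profile]
    ts = int(time.time()) if start_ts is None else start_ts
    for i in range(n):
        user, song = rng.choices(pairs, weights=weights, k=1)[0]
        # One message per second of simulated time.
        yield json.dumps({"user_id": user, "song_id": song, "timestamp": ts + i})


# Made-up triplets for illustration only.
profile = [("u1", "s1", 3), ("u2", "s2", 1)]
for message in synthesize_requests(profile, 3, seed=42, start_ts=1000):
    print(message)  # each message would be sent to a Kafka topic by a producer client
```

Weighting by play count keeps the synthetic stream statistically close to the original profile, so songs that were popular in the dataset stay popular in the stream.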

Streaming Layer

Spark Streaming: The ingested data is processed by Spark Streaming to extract data in the required formats. Each user-song request is stored with its timestamp in Cassandra - a NoSQL data store. The counts for requested songs and the users' current locations are stored in Redis - a caching datastore - for faster access. The data is periodically flushed into HDFS.
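
The per-batch logic of this layer can be sketched without any cluster. In the sketch below a `Counter` stands in for the Redis song counts, a dict stands in for the Redis location cache, and the returned rows stand in for what would be written to Cassandra. The function name, the field names and the `city` attribute are assumptions made for illustration; this is not Spark Streaming code.

```python
from collections import Counter


def process_batch(requests, locations, song_counts, last_location):
    """Fold one micro-batch into running state and return rows for the request log."""
    # Running play count per song (stand-in for a Redis counter).
    song_counts.update(req["song_id"] for req in requests)
    # Keep only the most recent location per user (stand-in for a Redis key).
    for loc in locations:
        last_location[loc["user_id"]] = loc["city"]
    # Rows that would be appended to the time-stamped request log.
    return [(req["timestamp"], req["user_id"], req["song_id"]) for req in requests]


song_counts, last_location = Counter(), {}
rows = process_batch(
    [
        {"user_id": "u1", "song_id": "s1", "timestamp": 1},
        {"user_id": "u2", "song_id": "s1", "timestamp": 2},
    ],
    [{"user_id": "u1", "city": "Austin"}],
    song_counts,
    last_location,
)
print(song_counts.most_common(1), last_location, rows)
```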

Batch Layer

Spark: Apache Spark reads data from HDFS to find friend suggestions, update relevance scores and mine frequent patterns among songs. The recommendations are explained here.
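
As a rough illustration of the pattern-mining step, the sketch below counts how many users listened to both songs of each pair and keeps the pairs that reach a minimum support. This is simple pairwise co-occurrence counting in plain Python, not necessarily the algorithm the batch job uses; the function name, the `min_support` parameter and the sample data are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_song_pairs(user_songs, min_support=2):
    """Count users who listened to both songs of a pair; keep pairs at or above min_support."""
    counts = Counter()
    for songs in user_songs.values():
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(set(songs)), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_support}


# Made-up listening history for illustration only.
user_songs = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
}
print(frequent_song_pairs(user_songs))  # {('a', 'b'): 2, ('b', 'c'): 2}
```

The resulting pair counts have the same shape as the song-song frequencies that back the "users who listened to this also listened to" suggestion.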

Cassandra Tables

user_song_log: (streaming) stores user-song requests partitioned by time
user_to_song: (streaming) stores user-song requests partitioned by user
song_to_user: (streaming) stores user-song requests partitioned by song
user_connections: stores user's connections (follows) partitioned by user
user_relevance: (batch) stores suggested users with relevance score
frequent_song_pairs: (batch) stores song-song frequencies
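
The three streaming tables hold the same requests under different partition keys, so that each query pattern (by time, by user, by song) reads a single partition. The sketch below mimics that fan-out with in-memory dicts instead of Cassandra; the hourly time bucket, the field names and the function name are assumptions made for illustration.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def fan_out(request, tables):
    """Write one request into three dict-backed 'tables', one per query pattern."""
    user, song, ts = request["user_id"], request["song_id"], request["timestamp"]
    hour_bucket = ts // SECONDS_PER_HOUR  # assumed time partition: one bucket per hour
    tables["user_song_log"][hour_bucket].append((ts, user, song))
    tables["user_to_song"][user].append((ts, song))
    tables["song_to_user"][song].append((ts, user))


tables = {
    name: defaultdict(list)
    for name in ("user_song_log", "user_to_song", "song_to_user")
}
fan_out({"user_id": "adam", "song_id": "s1", "timestamp": 7200}, tables)
fan_out({"user_id": "adam", "song_id": "s2", "timestamp": 7260}, tables)
print(tables["user_to_song"]["adam"])  # [(7200, 's1'), (7260, 's2')]
```

Writing each request three times trades storage for read speed, which is the usual approach in a store that is queried by partition key.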


The application can be accessed online. To log in, use username: adam, password: 123

The app may not work as intended after Feb 28. The AWS machines will be terminated after that. Please look at the video for a demo.
Video demo:


[1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

[2] The Echo Nest Taste profile subset, the official user data collection for the Million Song Dataset, available at:

