moviEharmony.com is a data platform that finds movies two people may like to watch together. It is completely open source and uses the following technologies:
- Apache Kafka
- Python
- Amazon S3
- Spark / Spark MLlib
- Apache Cassandra
- Flask
As of Oct 7, 2015, moviEharmony.com batch processes the Amazon review dataset. These reviews provide the data that drives the following components of moviEharmony.com:
- MovieSearch: Allows two users to find movies that they may like to watch together.
- QueryUser: Allows users to find what they have reviewed in the past.
- EnterReview: Allows users to add movie reviews.
My pipeline works as follows. The first step ingests users' movie reviews: a web page lets users submit reviews from their web browser. Each review is converted to a JSON message and sent to Kafka. A batch consumer job saves these messages from Kafka to S3. I then combine the new reviews with the historical reviews from the Amazon dataset and train a collaborative filtering model on my Spark cluster. Spark MLlib uses a model-based alternating least squares (ALS) algorithm to learn latent factors, then uses these latent factors to predict the missing movie ratings for each user. The model is saved to S3 and the estimated ratings are saved to Cassandra. Finally, Flask queries Cassandra and returns movie recommendations to the users.
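
To make the collaborative filtering step more concrete, here is a minimal pure-Python sketch of alternating least squares. It is not the Spark MLlib implementation used by the pipeline, only an illustration of the idea: hold the movie factors fixed and solve a small regularized least-squares problem for each user, then swap roles. The function names (`solve`, `update_factors`, `train_als`, `predict`), the hyperparameters, and the toy ratings are all assumptions made up for this example.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def update_factors(ratings_by_key, fixed, rank, reg):
    """Re-fit one side's factors while the other side (`fixed`) is held constant."""
    updated = {}
    for key, rated in ratings_by_key.items():
        # Normal equations of ridge regression: (reg * I + sum v v^T) x = sum r v
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other_key, rating in rated:
            v = fixed[other_key]
            for i in range(rank):
                b[i] += rating * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        updated[key] = solve(a, b)
    return updated


def train_als(ratings, rank=2, reg=0.1, iterations=20, seed=0):
    """Learn latent factors from (user, movie, rating) triples."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for user, movie, rating in ratings:
        by_user.setdefault(user, []).append((movie, rating))
        by_movie.setdefault(movie, []).append((user, rating))
    movie_f = {m: [rng.random() for _ in range(rank)] for m in by_movie}
    user_f = {}
    for _ in range(iterations):
        user_f = update_factors(by_user, movie_f, rank, reg)   # movies fixed
        movie_f = update_factors(by_movie, user_f, rank, reg)  # users fixed
    return user_f, movie_f


def predict(user_f, movie_f, user, movie):
    """Estimated rating: dot product of the user's and the movie's latent factors."""
    return sum(u * v for u, v in zip(user_f[user], movie_f[movie]))


# Toy ratings invented for illustration (not taken from the Amazon dataset).
ratings = [
    ("alice", "Alien", 5.0), ("alice", "Heat", 4.0),
    ("bob", "Alien", 4.0), ("bob", "Heat", 5.0), ("bob", "Up", 2.0),
    ("carol", "Alien", 1.0), ("carol", "Up", 5.0),
]
user_f, movie_f = train_als(ratings)
# alice never rated "Up"; the model fills in that missing rating.
print(predict(user_f, movie_f, "alice", "Up"))
```

In the real pipeline the equivalent of `predict` would be evaluated in bulk on the Spark cluster, with the resulting estimated ratings written to Cassandra for Flask to query.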