Project built during the Insight Data Engineering program
Work in progress, feedback is really appreciated
- [Introduction] (README.md#1-introduction)
- [AWS Clusters] (README.md#2-aws-clusters)
- [Data Pipeline] (README.md#3-data-pipeline)
- [Front End] (README.md#4-front-end)
Rendr is an application that builds a bipartite graph of users and restaurants to make recommendations using the structure of this network. A user is shown a restaurant based on how popular it is with people who are similar to the user
Foursquare Data collected by University of Minnesota researchers obtained from Internet Archives containing 2,153,471 users, 1,143,092 venues, 121,970 check-ins and 2,809,581 ratings that users assigned to venues; all extracted from the Foursquare application through the public API
Yelp Data obtained from the Yelp Academic Dataset Challenge consisting of 1.6M reviews and 500K tips by 366K users for 61K businesses
Rendr is powered by three clusters on AWS-
- 4 m4.larges for [Spark] (https://spark.apache.org/) and [HDFS] (http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
- 3 t2.mediums for [Kafka] (http://kafka.apache.org/) and [Zookeeper] (https://zookeeper.apache.org/)
- 3 t2.mediums for Cassandra and [Flask] (http://flask.pocoo.org/)
-
The data collected from the sources is stored on HDFS with 3 data nodes and 1 name node.
-
3 consumers are collecting data from the Rendr frontend and send messages to Kafka when the user performs an action such as clicking no for the restaurant that was recommended. These messages are sent in the same format as the foursquare data and can be processed in the same manner. They are consumed using [camus] (https://github.com/linkedin/camus). [Camus] (https://github.com/linkedin/camus) is a tool built by [Linkedin] (https://www.linkedin.com/) which is a distributed consumer running a map reduce job underneath to consume messages from [Kafka] (http://kafka.apache.org/) and save them to [HDFS] (http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html).
Spark and Graphx is used for all batch processing. The data from Yelp and Foursquare has very diffferent schema. Foursquare data only contains latitude and longitude of the venue and no other metadata such as whether the venue is a restaurant or not, the name, city, state etc. This needs to be filtered against yelp data which is much richer. Geohashing is used for entity resolution to determine whether a rating in foursquare refers to a restaurant in yelp.
Cassandra is used to to save the batch results and serve the front end. Three main tables serve the application-
- Seeds- key is the username and value is the restaurant id of the most recent restaurant liked/reviewed by the user
- Ranks - key is the restaurant id and values are the ranks and ids of other restaurants in the network
- IdMapper - key is the restaurant id and value is the metadata of the restaurant such as name, city, state which is needed to construct the query to the yelp API
Used flask for the front end along with javascript, html and css for views