# COTL
(The servers may be shut down in the future; if so, check the demo video.)
A location-based photo sharing system.
See the following slides for data pipeline details: Data Pipeline
Click into each folder to see more details.
COTL is the backend for a photo sharing application. Each user gets a newsfeed consisting of photos posted by people near them. The system can also recommend the most popular photos to users.
- Photos are pulled from Flickr. A script runs on top of the Flickr API to provide a live data stream.
- Users are synthesized. For now, 1 million active users are generated to post the photos.
- Like/view events are synthesized from each user's newsfeed. A script scans every user's newsfeed and decides whether to like each photo.
- Only the photo URL (from the Flickr data store) is referenced in the system, but a photo store could easily be integrated into the data pipeline. Photos are assumed to be stored properly before any message reaches Kafka.
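
The like/view synthesis described above can be illustrated with a minimal sketch. The repository's actual decision logic is not shown here, so everything in this block is an assumption: the function name `synthesize_events`, the in-memory `newsfeeds` mapping, and the fixed `like_probability` are hypothetical stand-ins for whatever the real script does.

```python
import random


def synthesize_events(newsfeeds, like_probability=0.1, seed=0):
    """Yield (user_id, photo_url, event_type) tuples for every newsfeed entry.

    newsfeeds: dict mapping a user id to the list of photo URLs in that
    user's newsfeed (hypothetical shape, chosen for illustration).
    like_probability: assumed chance that a viewed photo is also liked.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same events
    for user_id, photo_urls in newsfeeds.items():
        for url in photo_urls:
            # Every photo in the newsfeed counts as a view.
            yield (user_id, url, "view")
            # A simple coin flip decides whether the user also likes it.
            if rng.random() < like_probability:
                yield (user_id, url, "like")


if __name__ == "__main__":
    feeds = {"user-1": ["http://example.com/a.jpg", "http://example.com/b.jpg"]}
    for event in synthesize_events(feeds, like_probability=0.5):
        print(event)
```

In a pipeline like the one described, each yielded tuple would presumably become one message sent to Kafka; that step is left out to keep the sketch self-contained.
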
- ElasticSearch, as a spatial database.
- HBase, to store nearly all other information to be queried by our API.
- HDFS, as the source of truth. Batch jobs run on top of HDFS.
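
To make the "spatial database" role more concrete, here is a sketch of how a nearby-photos lookup might be expressed as an ElasticSearch `geo_distance` filter. This is not taken from the project: the field name `location` (which would need a `geo_point` mapping), the default radius, and the helper name `nearby_photos_query` are all assumptions, and the `bool`/`filter` form shown is the one used by recent ElasticSearch versions.

```python
import json


def nearby_photos_query(lat, lon, radius_km=5, size=50):
    """Build a query body that matches documents within radius_km of a point.

    Assumes each photo document has a `location` field mapped as geo_point
    (a hypothetical field name, not confirmed by the project).
    """
    return {
        "size": size,  # maximum number of photos to return
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to a search endpoint.
    print(json.dumps(nearby_photos_query(37.77, -122.42, radius_km=2), indent=2))
```

The function only builds the request body, so it can be inspected or tested without a running cluster.
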
- Simulated 1 million active users.
- Streaming incoming photos from Flickr at an average of 10 photos/second.
- Accumulated over 1 million photos.
- Simulated over 1 billion user behaviors.
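
A quick back-of-the-envelope check puts these figures in proportion. The calculation below assumes, purely for illustration, that every photo arrived through the stream at the stated average rate and that behaviors are spread evenly across users; neither is claimed by the numbers above.

```python
# Figures quoted in the list above, treated as lower bounds.
PHOTOS_PER_SECOND = 10
TOTAL_PHOTOS = 1_000_000
TOTAL_USERS = 1_000_000
TOTAL_BEHAVIORS = 1_000_000_000

# Time needed to accumulate the photos at the average streaming rate.
streaming_seconds = TOTAL_PHOTOS / PHOTOS_PER_SECOND
streaming_hours = streaming_seconds / 3600

# Average number of simulated like/view events per user.
behaviors_per_user = TOTAL_BEHAVIORS / TOTAL_USERS

if __name__ == "__main__":
    print(f"streaming time: {streaming_hours:.1f} hours")
    print(f"behaviors per user: {behaviors_per_user:.0f}")
```

Under those assumptions, the photo set corresponds to a bit more than a day of continuous streaming, and each simulated user accounts for about a thousand events.
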
I've stopped the simulation, so the data should all be static by the time you look at the system. I also didn't spend much time developing a web user interface for actually posting or liking photos; it would be more worthwhile to develop a mobile client and make a real impact.
It is really cool to scale something, even though my data hasn't hit the bottleneck of the original.
- 5 nodes are run in AWS as a cluster.
- HBase: 1 master, 4 datanodes.
- ElasticSearch: 5 shards, 1 replication each.
- HDFS: 1 namenode, 4 datanodes.
- Kafka: 1 producer, 1 broker, 2 partitions for each topic, and 2 consumers on different machines for each consumer group.
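
The Kafka layout above pairs 2 partitions per topic with 2 consumers per group. Within a consumer group, Kafka gives each partition to exactly one consumer, so this layout lets both consumers work in parallel. The sketch below is a simplified round-robin model of that idea, not Kafka's actual assignor code; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Distribute partitions over the consumers of one group, round-robin.

    Each partition ends up with exactly one consumer, mirroring the
    guarantee Kafka provides inside a consumer group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # The setup described above: 2 partitions, 2 consumers in the group.
    print(assign_partitions([0, 1], ["consumer-a", "consumer-b"]))
    # A third consumer would sit idle, since there are only 2 partitions.
    print(assign_partitions([0, 1], ["consumer-a", "consumer-b", "consumer-c"]))
```

This is also why adding consumers beyond the partition count does not increase throughput: the partition count caps the parallelism of a group.
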