GitHub - patrickzheng/cotl: A location-based photo sharing system

# COTL

(The servers might be shut down in the future; if so, check the demo video.)

A location-based photo sharing system.

See the Data Pipeline slides for data pipeline details.

Click into each folder to see more details.

Overview

COTL is the backend for a photo sharing application. Each user gets a newsfeed consisting of photos posted by people near that user. The system can also recommend the most popular photos to users.

Photos are pulled from Flickr. A script runs on top of the Flickr API to provide a live data stream.
Users are synthesized. For now, 1 million active users are generated to post the photos.
Like/view events are synthesized from each user's newsfeed. A script scans every user's newsfeed and decides whether to like each photo.
Only the photo URL (from Flickr's data store) is referenced anywhere in the system, but a photo store could easily be integrated into the data pipeline. It is assumed that photos are stored properly before any message reaches Kafka.
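
The like/view synthesis described above could look roughly like the following. This is a minimal sketch, not the script from `simulated_behaviors`: the function name, the `like_probability` parameter, the event tuple layout, and the example URLs are all assumptions made for illustration.

```python
import random


def simulate_behaviors(newsfeeds, like_probability=0.1, seed=0):
    """Yield (user_id, photo_url, event) tuples for every photo in every newsfeed.

    Each photo in a feed produces a "view" event. A "like" event follows
    with probability `like_probability` (a hypothetical tuning knob).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same events
    for user_id, photo_urls in newsfeeds.items():
        for url in photo_urls:
            yield (user_id, url, "view")
            if rng.random() < like_probability:
                yield (user_id, url, "like")


# Hypothetical newsfeeds: user id -> list of photo URLs shown to that user.
feeds = {
    "user_1": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "user_2": ["https://example.com/c.jpg"],
}
events = list(simulate_behaviors(feeds, like_probability=0.5))
```

In a pipeline like this one, each emitted tuple would presumably be published to Kafka as a message rather than collected in a list.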

Data Stores

ElasticSearch, as a spatial database.
HBase, to store nearly all other information to be queried by our API.
HDFS, the source of truth. Batch jobs run on top of HDFS.
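
To illustrate how ElasticSearch can serve as the spatial store behind a "photos near me" newsfeed, here is a sketch of a `geo_distance` query body built in Python. The field names `location` and `timestamp`, the default radius, and the function name are assumptions, not taken from this repository. The query uses the `bool`/`filter` form of recent ElasticSearch versions and assumes `location` is mapped as a `geo_point`; older releases use a different syntax.

```python
def nearby_photos_query(lat, lon, radius_km=5, size=20):
    """Build a search body for photos within `radius_km` of (lat, lon).

    Assumes each photo document has a `location` geo_point field and a
    `timestamp` field (both hypothetical names).
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        },
        # Newest photos first, as a newsfeed would likely want.
        "sort": [{"timestamp": {"order": "desc"}}],
    }


body = nearby_photos_query(lat=37.77, lon=-122.42)
```

The resulting dictionary could be passed as the request body of a search call against whatever index holds the photo documents.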

Data size & Throughput

Simulated 1 million active users.
Streaming incoming photos from Flickr at an average of 10 photos/second.
Accumulated over 1 million photos.
Simulated over 1 billion user behaviors.
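
A quick back-of-the-envelope check relates these figures to each other. The inputs are the round numbers quoted above, treated as lower bounds ("over 1 million", "over 1 billion"), so the results are rough estimates rather than measurements.

```python
PHOTOS_PER_SECOND = 10          # average ingest rate from Flickr
TOTAL_PHOTOS = 1_000_000        # "over 1 million photos"
TOTAL_BEHAVIORS = 1_000_000_000 # "over 1 billion user behaviors"
ACTIVE_USERS = 1_000_000        # simulated active users

# Time needed to accumulate the photos at the average rate.
seconds_to_accumulate = TOTAL_PHOTOS / PHOTOS_PER_SECOND
days_to_accumulate = seconds_to_accumulate / 86_400  # seconds per day

# Average number of like/view events per user and per photo.
behaviors_per_user = TOTAL_BEHAVIORS / ACTIVE_USERS
behaviors_per_photo = TOTAL_BEHAVIORS / TOTAL_PHOTOS

print(f"{seconds_to_accumulate:.0f} s, about {days_to_accumulate:.2f} days of streaming")
print(f"{behaviors_per_user:.0f} behaviors per user on average")
print(f"{behaviors_per_photo:.0f} behaviors per photo on average")
```

At 10 photos/second, reaching 1 million photos takes a bit more than one day of continuous streaming, and the simulated behaviors average about 1,000 per user.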

I've stopped the simulation, so the data should be static by the time you look at the system. I also didn't spend much time developing a web user interface for actually posting or liking a photo; it would be more worthwhile to develop a mobile client and make a real impact.

Scalability

It is really cool to scale something, even though my data hasn't yet hit a bottleneck in the original setup.

5 nodes run on AWS as a cluster.
HBase: 1 master, 4 datanodes.
ElasticSearch: 5 shards, 1 replica each.
HDFS: 1 namenode, 4 datanodes.
Kafka: 1 producer, 1 broker, 2 partitions for each topic, and 2 consumers on different machines for each consumer group.
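
The Kafka layout above (2 partitions per topic, 2 consumers per group) means each consumer in a group can own exactly one partition. The following pure-Python simulation sketches that idea without talking to a real broker. It is illustrative only: Kafka's own partitioner and group assignment strategies differ in detail, CRC32 is used here as a stand-in hash, and the consumer names are hypothetical.

```python
import zlib

NUM_PARTITIONS = 2  # partitions per topic, as in the cluster layout above


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Spread partitions round-robin over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two consumers on different machines, one partition each.
group = assign_partitions(["consumer_a", "consumer_b"])

# A third consumer in the same group would sit idle with only 2 partitions.
oversized_group = assign_partitions(["consumer_a", "consumer_b", "consumer_c"])
```

With this layout, adding consumers beyond the partition count does not increase parallelism; the partition count would need to grow first.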

Folders and files (45 commits in history)

initializations
kafka_consumers
public_html
simulated_behaviors
simulated_post_photos
spark_hbase
spark_hdfs
.gitignore
README.md
