InLivingColor

http://The.InLivingColor.website

InLivingColor is a data platform for exploring and analyzing the colors of the world around us using unsupervised machine learning. It is completely open-source and uses the following technologies:

Apache Kafka
Python / SciPy
Amazon S3
Spark
Apache Cassandra
Elasticsearch
Flask
Highmaps

The InLivingColor Website

As of July 1, 2015, InLivingColor.website is downloading and analyzing photos from Flickr. These photos provide the data that drives the following components of InLivingColor:

ColorSearch: Allows users to search by color.

ColorMap: Currently shows the number of photos that have been processed in each county of the United States. (As you can see in the images above, however, the website is not indexing only the U.S.)

InLivingColor InANutshell

InLivingColor ingests photos along with their metadata using Python and Kafka (more on this later). Upon ingestion, photos are analyzed for their color content using a k-means algorithm. Each record (containing a photo, its metadata, and the color content data) is stored on S3. Spark then takes these records and creates a number of batch views, using Elasticsearch for the color search and Cassandra for the geographical/temporal aggregates. These databases are then made accessible through a Flask API, which the webpage calls from JavaScript.
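To make the color-analysis step concrete, here is a minimal sketch of k-means applied to a list of RGB pixels. It is not the project's actual code: the function names, the choice of k, the fixed iteration count, and seeding the centroids from distinct colors are all assumptions made for illustration, and it uses only the standard library rather than SciPy.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dominant_colors(pixels, k=3, iterations=10, seed=0):
    """Return [(centroid_rgb, share_of_pixels), ...], largest share first.

    Hypothetical helper: plain k-means with a fixed number of iterations.
    """
    unique = sorted(set(pixels))
    if k > len(unique):
        raise ValueError("k exceeds the number of distinct colors")
    rng = random.Random(seed)
    centroids = rng.sample(unique, k)  # start from k distinct colors
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every pixel to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for pixel in pixels:
            nearest = min(
                range(k), key=lambda i: squared_distance(pixel, centroids[i])
            )
            clusters[nearest].append(pixel)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(
                    sum(channel) / len(members) for channel in zip(*members)
                )
    shares = [(centroids[i], len(clusters[i]) / len(pixels)) for i in range(k)]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Toy "image": six pure-red pixels and four pure-blue pixels.
toy_pixels = [(255, 0, 0)] * 6 + [(0, 0, 255)] * 4
toy_result = dominant_colors(toy_pixels, k=2)
print(toy_result)
```

On the toy input the two centroids settle on pure red and pure blue with shares 0.6 and 0.4. A real photo would first need to be decoded into pixels, which this sketch leaves out.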

How to Download Millions of Photos from Flickr

InLivingColor uses a distributed system to download photographs from Flickr. For instance, the script ingest_photos_continuously_2014.py attempts to collect as many photo IDs as possible from 2014. There are hundreds of millions of them, however, so in practice it collects a uniformly distributed sample of those photos. ingest_photos_continuously_2014.py then sends these photo IDs in Kafka messages to any number of consumers. The consumers are started by the d_p_s-multiservice.sh script, which lets you run numerous processes on a single machine using tmux windows (e.g., ./d_p_s-multiservice.sh 5 will open 5 of them). These consumers download the JPEGs of the photos along with any metadata that Flickr has, preprocess the photos for their color information, and then upload the enriched JSON records to S3 (the images are also included in Base64, as PySpark currently does not enjoy accessing binary files on S3).
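As an illustration of the enriched record described above, the following sketch packs an image, its metadata, and its color data into one JSON line, with the image bytes Base64-encoded so that the record is plain text. The field names (photo_id, metadata, colors, image_base64) and both helper functions are hypothetical; the repository's actual record layout may differ.

```python
import base64
import json


def build_record(photo_id, jpeg_bytes, metadata, colors):
    """Serialize one photo as a single JSON line (hypothetical layout)."""
    record = {
        "photo_id": photo_id,
        "metadata": metadata,
        "colors": colors,
        # Base64 turns the binary JPEG into ASCII text that JSON can carry.
        "image_base64": base64.b64encode(jpeg_bytes).decode("ascii"),
    }
    return json.dumps(record, sort_keys=True)


def extract_image(record_json):
    """Recover the original image bytes from a serialized record."""
    record = json.loads(record_json)
    return base64.b64decode(record["image_base64"])


# Placeholder bytes standing in for a downloaded JPEG.
fake_jpeg = b"\xff\xd8\xff\xe0 placeholder image bytes"
line = build_record(
    photo_id="12345",
    jpeg_bytes=fake_jpeg,
    metadata={"date_taken": "2014-06-01", "latitude": 37.8, "longitude": -122.3},
    colors=[{"rgb": [255, 0, 0], "share": 0.6}],
)
restored = extract_image(line)
```

Keeping each record as one line of text means a batch job can read the files as ordinary text and decode the image only when it is needed.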

Batch Processing

batch_hourly.sh runs all the batch processes. Although it is named "hourly", it can be used to rebuild all the intermediate batch steps and populate the Cassandra and Elasticsearch databases. Subsequently, it can be run hourly for incremental updates. batch_hourly.sh calls the following:

-python batch_hourly_1_aggregate.py aggregates the small files created by the processes that download, preprocess, and store the images/metadata. This is the most difficult step of the batch process: if the hourly download bins contain many files, it may take several minutes. In the future it may be better to have the individual processes route these messages back to Kafka and use Secor to bundle them into nicely sized files. The benefit of pushing the small files to S3 is that they are never queued anywhere, and we take advantage of S3's excellent availability. On the other hand, sending them back to Kafka would allow us to use Kafka's "Pub-Sub" capabilities to also send a real-time stream to Spark Streaming.

-python batch_hourly_prepare_colorsearch.py takes the aggregated files on S3, extracts the fields needed for the color search, saves those intermediate files back to S3, and then uploads the data to Elasticsearch.

-python batch_hourly_prepare_counts.py takes the aggregated files on S3 and runs a map-reduce job to compute counts at 5 levels of geographic granularity and 3 levels of temporal granularity (actually their Cartesian product, so 15 combinations). These counts are calculated for each hourly download bin and stored on S3. The complete reduction is then handled by combining all of these pre-reduced results.
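The counting step can be sketched in plain Python as follows. This is an illustrative stand-in for the Spark job, not the project's code: the level names in GEO_LEVELS and TIME_LEVELS are assumptions (the text above states only that there are 5 geographic and 3 temporal levels), and the record fields are invented for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical level names; only the counts (5 and 3) come from the README.
GEO_LEVELS = ("country", "state", "county", "city", "zipcode")
TIME_LEVELS = ("year", "month", "day")


def count_bin(records):
    """Pre-reduce one hourly bin: count photos per (geo, time) combination."""
    counts = Counter()
    for record in records:
        # Cartesian product of levels: 5 * 3 = 15 keys per record.
        for geo, time in product(GEO_LEVELS, TIME_LEVELS):
            counts[(geo, record[geo], time, record[time])] += 1
    return counts


def merge_bins(bin_counts):
    """Complete the reduction by summing the pre-reduced bins."""
    total = Counter()
    for counts in bin_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


oakland = {
    "country": "US", "state": "CA", "county": "Alameda",
    "city": "Oakland", "zipcode": "94607",
    "year": "2015", "month": "2015-07", "day": "2015-07-01",
}
fresno = dict(oakland, county="Fresno", city="Fresno", zipcode="93721")

bin_one = count_bin([oakland, fresno])
bin_two = count_bin([oakland])
merged = merge_bins([bin_one, bin_two])
```

Because counts are additive, each hourly bin only has to be reduced once, and an incremental run just merges the newest bin into the existing totals.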
