
Data Engineering Project at Insight


sqlqry/artmosphere

 
 


# Artmosphere


Note: The original website has been down since the Insight program ended. However, the video demo of the website is available here. Slides are available here.

Code for the Flask web framework can be found here. Code for the front-end web application can be found in this folder.

## Table of Contents

## Introduction

This is a data engineering project from the Insight Data Engineering Fellows Program. The project provides a platform where users can search for artworks, view similar art pieces, and follow the real-time popularity of a given piece. Users can also see where artworks have been uploaded across the world. The main goal of the project is to learn the tools used in a data pipeline that processes large datasets in a distributed manner.

Tools used: Kafka, HDFS, Spark, Spark Streaming, Elasticsearch, Cassandra, Flask, Bootstrap, and Highcharts.

## Settings

Dataset: The dataset is a collection of 26,000 artworks and 45,000 artists collected from Artsy.net in JSON format. To simulate real-time user activity, the project also uses self-engineered data in two formats:

  • Collecting log: timestamp, user_id, collected, artwork_id
  • Uploading log: timestamp, user_id, uploaded, artwork_id, location_code
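Under those formats, the simulated activity can be generated with a short script. A minimal sketch in Python: the ID ranges mirror the dataset sizes above, while the location codes are illustrative assumptions.

```python
import random

# Illustrative ranges: the dataset has ~45,000 artists and ~26,000 artworks.
# The location codes are assumed examples, not from the project.
LOCATION_CODES = ["US", "FR", "JP", "BR", "DE"]


def make_collect_event(now):
    """Build one 'collecting' log record: timestamp, user_id, action, artwork_id."""
    return {
        "timestamp": now,
        "user_id": random.randint(1, 45000),
        "action": "collected",
        "artwork_id": random.randint(1, 26000),
    }


def make_upload_event(now):
    """Build one 'uploading' log record, which adds a location_code field."""
    event = make_collect_event(now)
    event["action"] = "uploaded"
    event["location_code"] = random.choice(LOCATION_CODES)
    return event
```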

AWS Cluster: A distributed AWS cluster of 4 EC2 machines is used for this project. All components (ingestion, batch, and real-time processing) are configured and run in distributed mode, with 1 master node and 3 slave nodes. The master node has 8GB of memory and 50GB of storage; the worker nodes each have 8GB of memory and 1TB of storage.

## Data Processing

  • Data Ingestion (Kafka): The datasets for batch and real-time processing are ingested using Kafka. For batch processing, the data is stored in HDFS. For real-time processing, the data is streamed into Spark Streaming.
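As a rough sketch of the ingestion step, each log record can be serialized to JSON and published to a Kafka topic. This assumes the kafka-python client; the broker address and topic name are illustrative, not from the project.

```python
import json


def encode_event(event):
    """Serialize a log record into UTF-8 JSON bytes, the value format
    assumed here for Kafka messages."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def publish_event(event, topic="artwork-logs", servers="localhost:9092"):
    """Send one encoded event to Kafka. Requires the kafka-python package
    and a running broker; topic and server address are assumptions."""
    from kafka import KafkaProducer  # imported here so the sketch loads without Kafka

    producer = KafkaProducer(bootstrap_servers=servers)
    producer.send(topic, encode_event(event))
    producer.flush()
```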

  • Batch Processing (HDFS, Spark): To perform the batch processing jobs, Spark loads the data from HDFS and processes it in a distributed way. The two major batch processing steps are aggregating the artists' upload locations and computing artwork-artwork similarities.

    The following graph shows the performance analysis of Spark for one of the batch processing jobs - aggregating artists' upload locations - on datasets up to 500GB:

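The README does not state how artwork-artwork similarity is computed, so the following is only an illustration of the idea: treat each artwork as a set of tags and rank the others by cosine similarity over those binary tag vectors. In the project this step runs on Spark over the full dataset; the pure-Python version below just shows the math.

```python
import math


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two artworks represented as tag sets
    (binary term vectors). The metric choice is an assumption for
    illustration; the project does not document its similarity measure."""
    a, b = set(tags_a), set(tags_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def top_similar(target_id, tag_index, k=3):
    """Rank the other artworks by similarity to target_id. tag_index maps
    artwork_id -> tag set; a hypothetical stand-in for the Artsy metadata."""
    scores = [
        (other_id, cosine_similarity(tag_index[target_id], tags))
        for other_id, tags in tag_index.items()
        if other_id != target_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]
```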
  • Serving Layer (Elasticsearch, Cassandra): The platform provides a search function that matches a given keyword against artwork titles. To achieve this, the metadata of all artworks is stored in Elasticsearch. All artworks and artists are stored in Cassandra tables and can be retrieved by id. The aggregated artist locations are also stored in a Cassandra table, which can be queried by location_code and timestamp.
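The keyword search against titles maps naturally onto an Elasticsearch match query. A sketch of how the query body might be built - the index name `artworks`, the `title` field, and the client usage are all assumptions, since the README does not document the mapping:

```python
def title_search_body(keyword, size=10):
    """Build an Elasticsearch query body that matches `keyword` against
    the artwork title field (index and field names are assumptions)."""
    return {"size": size, "query": {"match": {"title": keyword}}}


def search_titles(keyword):
    """Run the search with the official client. Requires the elasticsearch
    package and a running cluster; the host address is an assumption."""
    from elasticsearch import Elasticsearch  # imported here so the sketch loads offline

    es = Elasticsearch(["http://localhost:9200"])
    return es.search(index="artworks", body=title_search_body(keyword))
```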

  • Stream Processing (Spark Streaming): Spark Streaming processes the data in micro batches. The job aggregates how many people collected a certain piece of art every 5 seconds and saves the result into a table in Cassandra. The information can be queried by artwork_id and timestamp.

    • Streaming processing code: spark_streaming
      • To execute: run bash log_streaming_run.sh
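The 5-second aggregation can be illustrated without a cluster: floor each event's timestamp to the start of its 5-second window, then count collect events per (window, artwork_id). This mirrors the rows the streaming job writes to Cassandra, which are queried by artwork_id and timestamp. A pure-Python stand-in for the Spark Streaming logic:

```python
from collections import Counter

WINDOW_SECONDS = 5  # the micro-batch window size described above


def window_start(timestamp, width=WINDOW_SECONDS):
    """Floor a Unix timestamp to the start of its aggregation window."""
    return timestamp - (timestamp % width)


def count_collects(events, width=WINDOW_SECONDS):
    """Count collect events per (window_start, artwork_id) pair, skipping
    non-collect actions such as uploads."""
    counts = Counter(
        (window_start(e["timestamp"], width), e["artwork_id"])
        for e in events
        if e.get("action") == "collected"
    )
    return dict(counts)
```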
  • Front-end (Flask, Bootstrap, Highcharts): The front-end uses Flask as its framework, and the website uses JavaScript and the Twitter Bootstrap libraries. All plots are rendered with Highcharts.

## Website

Note: The website has been down since the Insight program ended. However, the video demo of the website is available here.

  • The artwork information:


  • Display similar artworks:


  • Plots show, in real time, how many people have collected a piece of art within each 5-second window:


  • A map shows where all the artworks have been uploaded across the world:


## Presentation Deck

The presentation slides are available here.

The video demo of the website is available here.

## Packages Used for the Pipeline

`pyspark`, `pyspark-cassandra`, `elasticsearch-hadoop-2.1.0.Beta2.jar`
