The 411 on the 311

This project was a proof of concept for a data pipeline, built as my Insight Data Engineering project. Specifically, I elected to base the project on NYC 311 data. For the purposes of this project, the data has been either modified or self-engineered, so the results are fictitious.

The slides that accompany this project are available here.

The video that demonstrates the functionality of the site is available here.


Table of Contents

  • Project Overview
  • How to Use this Repo
  • Front End Results

Project Overview

I have two streams of data: historical and (near) real-time. After ingesting this data and performing some processing in Spark and Spark Streaming (for historical and real-time, respectively), I use Cassandra as my key-value store. A full diagram of my pipeline is below.

Data Architecture

The following tools were used for this project:

  • Zookeeper
  • Kafka
  • HDFS
  • Spark
  • Spark Streaming
  • Cassandra

The data and processing were done on four AWS EC2 m4.xlarge instances. The ingestion, storage, and processing were set up to run in a distributed manner, with 1 master node and 3 worker nodes. The master node had 8GB of memory and 50GB of storage. The worker nodes each had 8GB of memory and 1TB of storage.
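As a reference point, here is a minimal sketch of how a PySpark job could be pointed at a cluster laid out this way. The master host name, standalone master URL, and memory settings are assumptions based on the description above, not the project's actual configuration.

```python
from pyspark.sql import SparkSession

# Hypothetical cluster settings: one standalone master, three 8GB workers.
spark = (SparkSession.builder
         .appName("project311")
         .master("spark://master-node:7077")        # assumed master host/port
         .config("spark.executor.memory", "6g")     # leave headroom on 8GB workers
         .config("spark.executor.instances", "3")   # one executor per worker node
         .getOrCreate())
```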

Historical Data:

(Historical data pipeline diagram)
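Below is a minimal sketch of the historical (batch) path, assuming PySpark with the spark-cassandra-connector: read the modified 311 CSV data from HDFS, aggregate it, and write the results to Cassandra. The HDFS path, column names, keyspace, and table names are placeholders, not the exact ones used in the project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("historical-311-batch").getOrCreate()

# Read the historical 311 dump out of HDFS (path is hypothetical).
requests = (spark.read
            .option("header", "true")
            .csv("hdfs://master-node:9000/data/311_historical.csv"))

# Example aggregation: request counts per complaint type and borough.
counts = (requests
          .groupBy("complaint_type", "borough")
          .agg(F.count("*").alias("num_requests")))

# Persist the aggregates to Cassandra via the spark-cassandra-connector
# (keyspace and table names are assumptions).
(counts.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="nyc311", table="complaint_counts")
 .mode("append")
 .save())
```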

Near Real Time Data:

(Near real-time data pipeline diagram)
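And a minimal sketch of the near real-time path, assuming a Spark Streaming (DStream) job that consumes simulated 311 events from Kafka, counts them per micro-batch, and updates Cassandra with the DataStax Python driver. The broker address, topic, message format, keyspace, and table names are all assumptions, and the counter-column schema is purely illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from cassandra.cluster import Cluster

sc = SparkContext(appName="realtime-311-stream")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Consume simulated 311 events from Kafka; each message value is assumed
# to be a CSV line whose second field is the complaint type.
stream = KafkaUtils.createDirectStream(
    ssc, ["nyc311-events"], {"metadata.broker.list": "master-node:9092"})

def save_partition(rows):
    """Write one partition of (complaint_type, count) pairs to Cassandra."""
    cluster = Cluster(["worker-node-1"])
    session = cluster.connect("nyc311")
    for complaint_type, count in rows:
        session.execute(
            "UPDATE realtime_counts SET num_requests = num_requests + %s "
            "WHERE complaint_type = %s",
            (count, complaint_type))
    cluster.shutdown()

# Count requests per complaint type in each batch and persist the counts.
(stream
 .map(lambda kv: kv[1].split(",")[1])
 .map(lambda complaint_type: (complaint_type, 1))
 .reduceByKey(lambda a, b: a + b)
 .foreachRDD(lambda rdd: rdd.foreachPartition(save_partition)))

ssc.start()
ssc.awaitTermination()
```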

How to Use this Repo

Front End Results

I created a simple Flask app that displays the results from my data pipeline. The app is available at www.sonia.nyc and a video demonstration of the site is available [here](https://youtu.be/pQgADLRgwkE).
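For context, here is a minimal sketch of what a Flask app like this can look like, reading the precomputed results back out of Cassandra. The host, keyspace, table, and template names are assumptions, not the site's actual code.

```python
from cassandra.cluster import Cluster
from flask import Flask, render_template

app = Flask(__name__)

# Connect to the Cassandra keyspace the Spark jobs wrote to (names assumed).
cluster = Cluster(["worker-node-1"])
session = cluster.connect("nyc311")

@app.route("/")
def index():
    # Pull the precomputed complaint counts and hand them to the template.
    rows = session.execute(
        "SELECT complaint_type, borough, num_requests FROM complaint_counts")
    return render_template("index.html", rows=rows)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```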