New York Violation location predictor

This project trains a model on big data with PySpark and predicts violation locations from the New York parking violations data for 2013-2014.

Introduction

Given the New York parking violations data, the main objective is to build a machine learning model and use it to predict in real time as the data is streamed.

PySpark was used to build the ML model on the full dataset. Random forest and XGBoost were used to predict violation locations. Extensive preprocessing was applied; details are in Quarantined Cops_Final Project.pdf. On the test dataset, XGBoost achieved 99.4% accuracy and random forest achieved 95% accuracy.
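
As a rough illustration of the training step, the sketch below assembles a PySpark ML pipeline around a random forest classifier. It is not the code in Train_dataproc.py: the "_idx" naming scheme, the column names passed in, and the hyperparameter defaults are assumptions made for the example.

```python
def indexed(col):
    """Name of the string-indexed copy of a column (hypothetical naming scheme)."""
    return col + "_idx"


def feature_inputs(categorical_cols, numeric_cols):
    """Columns fed to the assembler: indexed categoricals first, then numerics."""
    return [indexed(c) for c in categorical_cols] + list(numeric_cols)


def build_pipeline(categorical_cols, numeric_cols, label_col,
                   num_trees=50, max_bins=256):
    """Return an unfitted Pipeline: string indexing -> feature vector -> random forest."""
    # PySpark is imported here so the helpers above work without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Index each categorical column and the label; "keep" maps unseen
    # values to an extra bucket instead of raising an error.
    indexers = [
        StringIndexer(inputCol=c, outputCol=indexed(c), handleInvalid="keep")
        for c in list(categorical_cols) + [label_col]
    ]
    assembler = VectorAssembler(
        inputCols=feature_inputs(categorical_cols, numeric_cols),
        outputCol="features",
    )
    # maxBins must be at least the largest number of categories in any
    # indexed feature; 256 is an assumed value, not one from the project.
    forest = RandomForestClassifier(
        labelCol=indexed(label_col),
        featuresCol="features",
        numTrees=num_trees,
        maxBins=max_bins,
    )
    return Pipeline(stages=indexers + [assembler, forest])
```

Under these assumptions, calling fit on the returned pipeline with a training DataFrame yields a fitted model, and accuracy on held-out data can be measured with MulticlassClassificationEvaluator using metricName="accuracy".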
Data from Google Cloud Storage was streamed to a Kafka server using Google Pub/Sub. With Spark Streaming on Dataproc, predictions on the streamed data had a latency of 4.8 seconds for batches of 3 records sent every 30 seconds.
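
The publishing side of this flow might look roughly like the sketch below, which encodes CSV rows as JSON messages and sends them to a Pub/Sub topic in groups of 3 every 30 seconds. This is an assumption-laden illustration rather than the contents of publish_data.py: the JSON encoding, the function names, and the project and topic identifiers are all hypothetical, and the publish step requires the google-cloud-pubsub package with valid credentials.

```python
import csv
import io
import json
import time


def encode_rows(csv_text):
    """Turn CSV text into one JSON-encoded bytes message per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(dict(row)).encode("utf-8") for row in reader]


def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def publish_in_batches(project_id, topic_id, messages, size=3, interval_s=30):
    """Publish messages in groups of `size`, pausing `interval_s` between groups."""
    # Imported lazily so the helpers above run without the Pub/Sub client.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    for batch in batches(messages, size):
        futures = [publisher.publish(topic_path, data=m) for m in batch]
        for future in futures:
            future.result()  # wait until Pub/Sub has accepted the message
        time.sleep(interval_s)
```

Keeping the encoding and batching helpers free of cloud dependencies makes them easy to check locally against small_temp.csv before any credentials are involved.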

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Libraries needed

xgboost
pandas
pickle (part of the Python standard library)
sklearn (installed as scikit-learn)

Information about files

1. create_cluster.sh - Bash script that spins up a Dataproc cluster with the required packages. This was needed to use the XGBoost model for training and real-time prediction in Dataproc.
2. publish_data.py - Publishes the data found in the bucket to a topic, which is then read by the Kafka server. It uses a googlecredentials.json file, which is not included for privacy reasons.
3. small_temp.csv - Contains the first 10000 rows of the full dataset. Useful for testing.
4. Real_time.py - Code for running predictions in real time.
5. Train_dataproc.py - Main training code.
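
To make the real-time step in the list above more concrete, here is a minimal sketch of scoring one batch of streamed messages with a pickled model and timing it. It does not reproduce Real_time.py: the function names are hypothetical, and it assumes each message is a JSON object that already holds preprocessed numeric features and that the model exposes a scikit-learn style predict method.

```python
import json
import pickle
import time


def load_model(path):
    """Load a pickled model from disk; only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def score_batch(model, messages, feature_names):
    """Decode JSON messages, predict on them, and return (predictions, seconds)."""
    start = time.perf_counter()
    decoded = [json.loads(m) for m in messages]
    # Build one feature row per message, in a fixed column order.
    rows = [[record[name] for name in feature_names] for record in decoded]
    predictions = list(model.predict(rows))
    elapsed = time.perf_counter() - start
    return predictions, elapsed
```

The elapsed time returned here covers only decoding and prediction for a single batch; end-to-end latency in a streaming job also includes transport and scheduling delays.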

Vishwesh4/Real-time-prediction
