This project trains a model on big data to predict violation locations, using the New York parking violations data for 2013-2014 and PySpark.
Given the New York parking violations data, the main objective is to build a machine learning model and use it to make predictions in real time as the data is streamed.
- PySpark was used to build the ML model on the big data. Random forest and XGBoost were used to predict violation locations. A lot of preprocessing was done; more information can be found in Quarantined Cops_Final Project.pdf. Overall, XGBoost achieved 99.4% accuracy and random forest achieved 95% accuracy on the test dataset.
- Using Kafka and Google Pub/Sub, data from Google Cloud Storage was streamed to a Kafka server. With Spark Streaming on Dataproc, predictions on the streamed data were made with a latency of 4.8 seconds for a batch of 3 records sent every 30 seconds.
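The batch-and-latency figure above can be illustrated with a minimal, library-free sketch. It only shows the idea of grouping streamed records into batches of 3 and timing the prediction call for each batch; it is not the project's Kafka or Spark Streaming code. The names `micro_batches`, `timed_predict`, and `placeholder_predict`, and the made-up records, are assumptions introduced for illustration.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def micro_batches(records: Iterable[dict], batch_size: int = 3) -> Iterator[List[dict]]:
    """Group an incoming stream of records into batches of at most batch_size."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def timed_predict(
    batch: List[dict], predict: Callable[[List[dict]], list]
) -> Tuple[list, float]:
    """Run predict on one batch and return (predictions, latency in seconds)."""
    start = time.perf_counter()
    predictions = predict(batch)
    return predictions, time.perf_counter() - start


def placeholder_predict(batch: List[dict]) -> list:
    # Hypothetical stand-in for the trained model: returns a fixed label per record.
    return ["unknown"] * len(batch)


if __name__ == "__main__":
    stream = ({"row_id": i} for i in range(7))  # made-up records, not real data
    for batch in micro_batches(stream):
        predictions, latency = timed_predict(batch, placeholder_predict)
        print(len(batch), predictions, f"{latency:.6f}s")
```

In a real deployment the latency would also include the time spent in Pub/Sub, Kafka, and Spark scheduling, which this sketch does not model.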
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Libraries needed
xgboost
pandas
pickle
sklearn
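A quick way to confirm that these libraries are importable is sketched below. Note that `pickle` ships with the Python standard library, and `sklearn` is the import name of the scikit-learn package. The helper name `missing_modules` is an assumption introduced for this example.

```python
from importlib.util import find_spec

# Import names taken from the list above.
REQUIRED_MODULES = ["xgboost", "pandas", "pickle", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```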
1-create_cluster.sh
- Contains bash code to spin up a cluster with the required packages. This was needed to try the XGBoost model for training and real-time prediction in Dataproc.
2-publish_data.py
- Publishes data found in the bucket to a topic, which is then read by the Kafka server. It uses a googlecredentials.json file, which has not been included for privacy reasons.
3-small_temp.csv
- Contains the first 10000 rows of the big dataset. Useful for testing purposes.
4-Real_time.py
- Code that runs the prediction in real time.
5-Train_dataproc.py
- Main training code.
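A small sample file such as 3-small_temp.csv could be produced along the following lines. This is only a sketch using the standard library; it is not necessarily how the included file was created. The function name `write_head_sample`, the placeholder file names, and the assumption that the first line of the source file is a header are all introduced for illustration.

```python
import csv
from itertools import islice


def write_head_sample(source_path, sample_path, n_rows=10000):
    """Copy the header and the first n_rows data rows; return the number of data rows written."""
    with open(source_path, newline="") as src, open(sample_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)  # assumes the first line is a header row
        if header is None:
            return 0  # empty source file
        writer.writerow(header)
        count = 0
        for row in islice(reader, n_rows):
            writer.writerow(row)
            count += 1
        return count


if __name__ == "__main__":
    # Hypothetical paths; replace them with the real dataset location.
    written = write_head_sample("full_dataset.csv", "sample.csv", n_rows=10000)
    print(f"Wrote {written} rows")
```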