
SF Crime Statistics Spark Streaming

A real-world dataset exercise using San Francisco crime incident data extracted from Kaggle. Spark Structured Streaming is used for statistical analyses. The project's main objective was to create a Kafka server that produces the data and to ingest that data with Spark Structured Streaming.

Introduction

To run the project, set up the following tools:

  • Spark 2.4.3
  • Scala 2.11.x
  • Java 1.8.x
  • Kafka built with Scala 2.11.x
  • Python 3.6.x or 3.7.x

Working with Kafka

Before you start Kafka, you need to start ZooKeeper. You can do that by executing:

zookeeper-server-start config/zookeeper.properties

Modify config/zookeeper.properties to reflect your environment (e.g. a free listening port).

To start Kafka, execute:

kafka-server-start config/server.properties

Modifying config/server.properties lets you change Kafka server settings such as the ZooKeeper address, default retention period, port, replication factor, and others.

Assuming the provided config files are used, the following commands let you:

  • Create a topic (not mandatory: topics are not protected, so the first write to a topic will create it)
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic com.udacity.police.calls.new
  • List existing topics
kafka-topics --zookeeper localhost:2181 --list
  • Read data from a topic (check that the data was produced successfully by kafka_server.py)
kafka-console-consumer --bootstrap-server localhost:9092 --topic com.udacity.police.calls.new --from-beginning
  • Write data to a topic
kafka-console-producer --broker-list localhost:9092 --topic com.udacity.police.calls.new

Producer

kafka_server.py

Before you run the program, locate the path (absolute or relative) to the JSON data file (police-department-calls-for-service.json) and set the input_file variable to that path. To start producing data on the topic, execute:

python kafka_server.py

In a separate terminal session, execute:

kafka-console-consumer --bootstrap-server localhost:9092 --topic com.udacity.police.calls.new --from-beginning

You should see output similar to the screenshot included in the repository.
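For reference, the core of the producer boils down to reading the JSON file and sending each record to the topic. Below is a minimal sketch using the kafka-python client (the actual kafka_server.py may be structured differently):

import json
from kafka import KafkaProducer  # kafka-python client

# Path you set in the input_file variable
input_file = "police-department-calls-for-service.json"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open(input_file) as f:
    records = json.load(f)  # the file holds an array of call records

for record in records:
    producer.send("com.udacity.police.calls.new", value=record)

producer.flush()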

The produced data has the following schema:

root
 |-- crime_id: string (nullable = true)
 |-- original_crime_type_name: string (nullable = true)
 |-- report_date: string (nullable = true)
 |-- call_date: string (nullable = true)
 |-- offense_date: string (nullable = true)
 |-- call_time: string (nullable = true)
 |-- call_date_time: string (nullable = true)
 |-- disposition: string (nullable = true)
 |-- address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- agency_id: string (nullable = true)
 |-- address_type: string (nullable = true)
 |-- common_location: string (nullable = true)
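
In data_stream.py this schema can be declared as a PySpark StructType so the JSON payload can be parsed (a sketch mirroring the fields above; the consumer code may name it differently):

from pyspark.sql.types import StructType, StructField, StringType

# Every field arrives as a string; call_date_time is converted to a timestamp later
schema = StructType([
    StructField("crime_id", StringType(), True),
    StructField("original_crime_type_name", StringType(), True),
    StructField("report_date", StringType(), True),
    StructField("call_date", StringType(), True),
    StructField("offense_date", StringType(), True),
    StructField("call_time", StringType(), True),
    StructField("call_date_time", StringType(), True),
    StructField("disposition", StringType(), True),
    StructField("address", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("agency_id", StringType(), True),
    StructField("address_type", StringType(), True),
    StructField("common_location", StringType(), True),
])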

Consumer

data_stream.py

To start the structured streaming job, execute:

python data_stream.py

Implementation

To make starting the program easier, I've added the following lines. They allow the code to be run with a simple python data_stream.py command.

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4 pyspark-shell'

After reading the stream from the Kafka topic, I pull out value (the JSON object produced by kafka_server.py) and timestamp (which I find the most convenient real-time date to work with).

kafka_df = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS timestamp)")
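
For context, df above is the raw Kafka source, and the parsed stream feeds the aggregation below as distinct_table. A sketch of how both might be derived (the spark session, subscribe/offset options, and the to_timestamp conversion are assumptions; schema is the StructType shown earlier):

import pyspark.sql.functions as psf

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "com.udacity.police.calls.new") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse the JSON payload into columns and turn call_date_time into a real
# timestamp so the watermark and window below can operate on it;
# the name distinct_table matches the aggregation snippet that follows
distinct_table = kafka_df \
    .select(psf.from_json(psf.col("value"), schema).alias("DF")) \
    .select("DF.*") \
    .withColumn("call_date_time", psf.to_timestamp("call_date_time"))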

For the aggregation I use a watermark and a time window, as shown below.

agg_df = distinct_table.select('call_date_time', 'original_crime_type_name', 'disposition') \
    .withWatermark('call_date_time', '1 day') \
    .groupBy(
             psf.window('call_date_time', "10 minutes", "5 minutes"),
             'original_crime_type_name', 
             'disposition') \
    .count()
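
To surface the results, the aggregated stream is written to the console roughly like this (a sketch; the output mode and sink options are assumptions):

query = agg_df \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

# query.awaitTermination() is called once all queries have been started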

The results of the aggregation can be seen in the screenshot included in the repository.

For the join, the options are limited because it is a join of a stream with a static DataFrame: the two supported possibilities are an inner join and a left outer join. I experimented with both, and switching between them is as easy as typing inner instead of left_outer.

join_query = agg_df.join(radio_code_df, radio_code_df.disposition==agg_df.disposition, "left_outer")
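
For reference, radio_code_df above is a static DataFrame loaded from the radio code lookup file shipped with the project; below is a sketch of how it might be prepared (the file name radio_code.json and the disposition_code column are assumptions):

radio_code_df = spark.read.json("radio_code.json", multiLine=True)

# Rename the lookup key so it matches the disposition column
# of the streaming aggregate used in the join above
radio_code_df = radio_code_df.withColumnRenamed("disposition_code", "disposition")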

The result of the join can be seen in the screenshot included in the repository.

consumer_server.py

One of the project goals was to write a plain vanilla Kafka consumer. The result can be seen in the screenshot included in the repository.
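
A minimal sketch of such a consumer using the kafka-python client (the actual consumer_server.py may use a different client library or an asynchronous loop):

from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "com.udacity.police.calls.new",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="sf-crime-stats-consumer",  # hypothetical group id
)

for message in consumer:
    # message.value holds the raw JSON bytes produced by kafka_server.py
    print(message.value.decode("utf-8"))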

Optimisation

Optimising throughput means optimising the processedRowsPerSecond KPI (the higher the better). It can be improved by giving the Kafka cluster more brokers (more even load distribution), by choosing a higher partition count when the topic is defined, and by tinkering with a few parameters:

spark.default.parallelism defines the default level of parallelism. It is advised to have around 2-3 tasks per CPU core.

spark.streaming.kafka.maxRatePerPartition caps the rate (messages per second) at which data is read from each Kafka partition, which bounds how many messages end up in each micro-batch.
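
Both settings can be supplied when the SparkSession is built, for example (a sketch; the master, app name, and values are illustrative, not tuned recommendations):

from pyspark.sql import SparkSession

# Illustrative values only; tune them against the processedRowsPerSecond KPI
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("SFCrimeStatisticsSparkStreaming") \
    .config("spark.default.parallelism", 100) \
    .config("spark.streaming.kafka.maxRatePerPartition", 100) \
    .getOrCreate()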
