Spark Streaming with Kafka

Sample code showing how to use Spark Streaming with Kafka.

Dependencies

Apache Spark (tested with 2.3.3 pre-built for Hadoop)
Python (tested with 3.7.3)
Apache Avro for Python 3 (tested with 1.8.2)
Docker

Notes on Running Apache Spark

If you're using Apache Spark with pre-built Hadoop, it's highly recommended to create a SPARK_HOME environment variable in your ~/.bash_profile or ~/.zshrc file like so:

# Spark
export SPARK_HOME="/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7"
export PATH=$PATH:$SPARK_HOME/bin

This makes it easy to call the Spark executables, and the shell scripts in this repo rely on that variable existing.

Installation

Clone this repo and cd into the directory.

Create a new Python 3.7.3 environment, activate it, and install the additional packages:

$ conda create -n pyspark_env python=3.7.3
$ source activate pyspark_env
$ pip install -r requirements.txt

Download and install Apache Avro:

$ tar xvf avro-1.8.2.tar.gz
$ cd avro-1.8.2
$ sudo python setup.py install
$ python
>>> import avro # should not raise ImportError

Run

There are examples showing parsing JSON and avro. They run the same way, so let's see how to run the avro example. From the root directory:

Launch the Docker containers:

$ docker-compose up -d
$ docker ps

In one terminal window, run the avro producer:

$ python avro/avro_producer.py

In another terminal window, run the avro consumer as a check:

$ python avro/avro_consumer.py

If everything is working you should see the messages being produced to Kafka by the producer script and consumed by the consumer script.

Now, stop the consumer script and run the shell script to use Spark Streaming:

$ chmod +x spark_submit_avro.sh
$ ./spark_submit_avro.sh

It takes a minute for Spark to get going, however once it does you should see output like this:

-------------------------------------------
Time: 2019-05-20 05:20:15
-------------------------------------------
{'name': 'Robert', 'favorite_number': 0, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 1, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 2, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 3, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 4, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 5, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 6, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 7, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 8, 'favorite_color': 'blue'}
{'name': 'Robert', 'favorite_number': 9, 'favorite_color': 'blue'}

As Spark is processing the Kafka stream in mini-batches, you'll continue to see messages like this while the producer continues to produce events.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
avro		avro
json		json
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
spark_submit_avro.sh		spark_submit_avro.sh
spark_submit_json.sh		spark_submit_json.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

avro

avro

json

json

.gitignore

.gitignore

README.md

README.md

docker-compose.yml

docker-compose.yml

requirements.txt

requirements.txt

spark_submit_avro.sh

spark_submit_avro.sh

spark_submit_json.sh

spark_submit_json.sh

Repository files navigation

Spark Streaming with Kafka

Dependencies

Notes on Running Apache Spark

Installation

Run

About

Releases

Packages

Languages

ReallyDiffcult/spark-streaming-kafka

Folders and files

Latest commit

History

Repository files navigation

Spark Streaming with Kafka

Dependencies

Notes on Running Apache Spark

Installation

Run

About

Resources

Stars

Watchers

Forks

Languages