Twitter Distributed Pipeline
Code repo for building a streaming pipeline using Twitter, Kafka, Spark, and Cassandra.
This application has been tested with Python 3.6.3, Docker 17.12.0-ce, and docker-compose 1.18.0 on a Windows 10 machine.
Browse to the python folder and follow the instructions to run the application:
- Install all Python packages listed in requirements.txt:
pip install -r requirements.txt
- In the docker-compose file, KAFKA_ADVERTISED_HOST_NAME should be set to the IP address of your machine, as in the sketch below.
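A minimal sketch of the relevant compose entry; the service name and the extra port variable are assumptions about the repo's compose file, only KAFKA_ADVERTISED_HOST_NAME is prescribed by the step above:

```yaml
# Hypothetical service layout; only KAFKA_ADVERTISED_HOST_NAME is required above.
kafka:
  environment:
    KAFKA_ADVERTISED_HOST_NAME: 192.168.1.10  # replace with your machine's IP
    KAFKA_ADVERTISED_PORT: 9092
```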
- In the properties file, provide your Twitter credentials, for example as sketched below.
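The exact key names depend on the repo's properties file; a typical layout for Twitter API credentials looks something like this (all key names here are hypothetical):

```properties
# Hypothetical key names; match them to the repo's actual properties file.
consumer_key=YOUR_CONSUMER_KEY
consumer_secret=YOUR_CONSUMER_SECRET
access_token=YOUR_ACCESS_TOKEN
access_token_secret=YOUR_ACCESS_TOKEN_SECRET
```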
- Run docker-compose up to launch all the containers.
- In another window, run producer.py. This starts loading Twitter stream data into Kafka; a sketch of such a producer is shown below.
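This is a minimal sketch of the kind of producer this step runs, not the repo's producer.py. It assumes tweepy 3.x and kafka-python; the credentials, broker address, and track filter are placeholders, while the topic name twitter-data comes from the spark-submit command further down.

```python
# A minimal sketch, not the repo's producer.py.
import tweepy
from kafka import KafkaProducer

# Placeholder credentials; in the repo these come from the properties file.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# From the host, connect via the IP set in KAFKA_ADVERTISED_HOST_NAME.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

class KafkaStreamListener(tweepy.StreamListener):
    def on_data(self, data):
        # Forward each raw tweet JSON payload to the Kafka topic.
        producer.send("twitter-data", data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream on errors (e.g. rate limits).
        return False

stream = tweepy.Stream(auth, KafkaStreamListener())
stream.filter(track=["python"])  # placeholder filter keywords
```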
- Open a new window and run the following command to create the keyspace and table in Cassandra:
docker exec -it cassandra-seed-node /usr/bin/env cqlsh -f /src/cassandra.cql
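This command runs /src/cassandra.cql inside the seed node. The schema below is only an illustration of what such a file contains; the keyspace, table, and column names are hypothetical:

```sql
-- Hypothetical schema; the actual definitions live in /src/cassandra.cql.
CREATE KEYSPACE IF NOT EXISTS twitter
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS twitter.tweets (
  id   text PRIMARY KEY,
  text text
);
```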
- Next, execute the command below to run the Spark job and load the results into the Cassandra database. A sketch of the consumer logic follows the command.
docker-compose exec master spark-submit --jars /src/spark-streaming-kafka-0-8-assembly.jar --packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=cassandra-seed-node /src/consumer.py kafka:9092 twitter-data 2000
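For reference, here is a minimal sketch of the kind of Spark Streaming job this command runs; it is not the repo's consumer.py. It reads the three positional arguments (broker list, topic, and, assumed here, a batch interval in milliseconds) and uses hypothetical keyspace, table, and column names that would have to match cassandra.cql.

```python
# A minimal sketch, not the repo's consumer.py. Assumes pyspark 2.x with the
# spark-streaming-kafka-0-8 assembly and anguenot:pyspark-cassandra available,
# as in the spark-submit command above.
import sys
import json

import pyspark_cassandra  # patches RDDs with saveToCassandra
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

brokers, topic, batch_ms = sys.argv[1], sys.argv[2], int(sys.argv[3])

sc = SparkContext(appName="twitter-consumer")
ssc = StreamingContext(sc, batch_ms / 1000.0)

# Direct stream from Kafka; each record is a (key, value) pair.
stream = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": brokers})

def to_row(record):
    tweet = json.loads(record[1])
    # Hypothetical column names; they must match the table in cassandra.cql.
    return {"id": tweet["id_str"], "text": tweet["text"]}

# Hypothetical keyspace and table names.
stream.map(to_row).foreachRDD(
    lambda rdd: rdd.saveToCassandra("twitter", "tweets"))

ssc.start()
ssc.awaitTermination()
```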
Next steps:
- Migrate to Scala
- Use the Akka actor model for the producer