This project demonstrates
- streaming of tweets (twitter_app.py)
- processing of streamed tweets using Spark (spark_app.py)
- storing processed tweets into SQLite (database_app.py)
Spin up GCE instance (Container Optimized OS)
SSH into instance and clone project
git clone https://github.com/randy-chng/demo_pyspark_streaming.git
Build image and provide Twitter details [W], [X], [Y] and [Z]
cd demo_pyspark_streaming
docker image build --tag streaming_pipeline_i --build-arg access_token=[W] --build-arg access_secret=[X] --build-arg consumer_key=[Y] --build-arg consumer_secret=[Z] --file Dockerfile .
Run created image
docker run --name streaming_pipeline_c --publish 5555:5555 -di streaming_pipeline_i
Access created container
docker exec -it streaming_pipeline_c /bin/bash
Run following commands to start pipeline
nohup python -u twitter_app.py > twitter_app_output.log 2>&1 &
nohup python -u spark_app.py > spark_app_output.log 2>&1 &
For sample output, refer to notebook.ipynb
Access created container
docker exec -it streaming_pipeline_c /bin/bash
Run following commands to start jupyterlab
jupyter lab --ip=0.0.0.0 --port=5555 --allow-root
Visit provided link using local web browser
- http://127.0.0.1:5555/?token=[SOME RANDOM GENERATED TOKEN VALUE]