Build a streaming application that reads tweets via the Twitter API and calculates the top hashtags, broken down by the following aspects:
- Language
- Date
- Source (e.g. Twitter for iPhone)
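
To make the three aspects concrete, here is a minimal plain-Python sketch of how they could be pulled out of a single tweet. It is not the code shipped in the count_hashtags image. It assumes the tweet arrives as Twitter API v1.1 JSON (fields `lang`, `created_at`, `source`, `entities.hashtags`); the function name `extract_rows`, the lowercasing of hashtags, and the sample tweet are illustrative assumptions.

```python
import re
from datetime import datetime

# Assumption: tweets follow the Twitter API v1.1 JSON layout.
# Adjust the field names if the NiFi flow reshapes the payload.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"
HTML_TAG = re.compile(r"<[^>]+>")


def extract_rows(tweet):
    """Return one (language, date, source, hashtag) tuple per hashtag in the tweet."""
    language = tweet.get("lang", "und")
    date = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT).date().isoformat()
    # `source` arrives as an HTML anchor; keep only the visible client name.
    source = HTML_TAG.sub("", tweet.get("source", "")).strip()
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    # Lowercasing is an illustrative choice so "#Kafka" and "#kafka" count together.
    return [(language, date, source, tag["text"].lower()) for tag in hashtags]


# Hypothetical sample tweet, trimmed to the fields used above.
sample_tweet = {
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "lang": "en",
    "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    "entities": {"hashtags": [{"text": "Kafka"}, {"text": "NiFi"}]},
}

rows = extract_rows(sample_tweet)
```

Each resulting row carries the hashtag together with the language, date, and source it should be counted under.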
- Spin up an EC2 instance and deploy Minikube.
- Deploy Apache NiFi Helm Chart.
helm repo add cetic https://cetic.github.io/helm-charts
helm install tweets cetic/nifi
- Deploy Apache Kafka Helm Chart.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install tweets-kafka bitnami/kafka --set zookeeper.enabled=false,externalZookeeper.servers=tweets-zookeeper:2181
- Deploy Cassandra Helm Chart.
helm install tweets-db --set dbUser.user=admin,dbUser.password=<password> bitnami/cassandra
- Upload the NiFi template via the UI.
- Insert the Twitter API tokens in the GetTwitter processor.
- Start all processors in Apache NiFi.
Browse to: http://<ec2-public-ip>:8081/nifi
- Deploy the count_hashtags container in Minikube:
sudo kubectl run count-hashtags --image gcr.io/pmoraesm/count_hashtags:0.4
The Docker images used in this project are available at https://gcr.io/pmoraesm. Two images are available:
- pyspark: A PySpark interactive environment, used for testing purposes.
- count_hashtags: The Spark application that counts the hashtags. It starts processing automatically and saves the results to Cassandra.
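
As a rough illustration of the counting step, the sketch below keeps running hashtag counts per aspect as batches of rows arrive. It is an in-memory stand-in written in plain Python, not the Spark job inside the count_hashtags image, and it does not write to Cassandra. The class name `TopHashtags`, the row layout `(language, date, source, hashtag)`, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class TopHashtags:
    """Running hashtag counts per (aspect, value), e.g. ("language", "en")."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, batch):
        """Fold a batch of (language, date, source, hashtag) rows into the counts."""
        for language, date, source, hashtag in batch:
            self.counts[("language", language)][hashtag] += 1
            self.counts[("date", date)][hashtag] += 1
            self.counts[("source", source)][hashtag] += 1

    def top(self, aspect, value, n=10):
        """Return the n most frequent hashtags for one aspect value."""
        return self.counts.get((aspect, value), Counter()).most_common(n)


# Hypothetical rows, fed in two batches to mimic a stream.
tracker = TopHashtags()
tracker.update([
    ("en", "2018-10-10", "Twitter for iPhone", "kafka"),
    ("en", "2018-10-10", "Twitter Web App", "kafka"),
    ("pt", "2018-10-10", "Twitter for iPhone", "nifi"),
])
tracker.update([("en", "2018-10-11", "Twitter for iPhone", "spark")])

print(tracker.top("language", "en", n=1))  # [('kafka', 2)]
```

Keeping one counter per aspect value means a query such as "top hashtags for English tweets" never has to rescan earlier batches. A real deployment would hold this state in Spark and persist it to the database instead of a Python dictionary.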