
Data pipelines to aggregate Twitter metrics, in both batch and streaming mode. The batch mode runs once over historical tweets; the streaming mode runs aggregations continuously over unbounded streaming data.


tb-tweet-aggregator

There are two types of jobs: batch and streaming.

  • BATCH: This Spark job reads historical tweets in JSON format, performs several types of aggregation, and saves the aggregated data to Firestore.
    HOW TO RUN:

    • On Dataproc: gcloud dataproc jobs submit pyspark --cluster tb-cluster --region=us-central1 batch/aggregation-historical-tweets.py -- "gs://twitter-battle-2/historical-tweets.json"
    • On an in-house Spark cluster:
      spark-submit --master SPARK_MASTER_IP:PORT batch/aggregation-historical-tweets.py -- "gs://twitter-battle-2/historical-tweets.json"

    Types of aggregation:
    • Historical total tweet count for each handle.
    • Historical tweet count per 1-hour window (total, and broken down by tweet sentiment).
  • STREAMING: This Dataflow (Apache Beam) job reads incoming streamed tweets from Pub/Sub and performs sentiment aggregation over a fixed time window.
    HOW TO RUN:
    python streaming/streaming-aggretaion.py
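The two batch aggregations above (total count per handle, and per-1-hour-window counts by sentiment) can be sketched in plain Python to illustrate the logic the Spark job implements at scale. The field names (`handle`, `created_at`, `sentiment`) and the sample records are assumptions for illustration; the real job reads its schema from the JSON file on GCS.

```python
from collections import Counter
from datetime import datetime

def hour_window(ts: str) -> str:
    """Truncate an ISO-8601 timestamp to the start of its 1-hour window."""
    dt = datetime.fromisoformat(ts)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat()

def aggregate(tweets):
    """Compute the two batch aggregations over a list of tweet dicts."""
    # Historical total tweet count for each handle.
    total_by_handle = Counter(t["handle"] for t in tweets)
    # Tweet count per 1-hour window, keyed by (window start, sentiment).
    by_window_sentiment = Counter(
        (hour_window(t["created_at"]), t["sentiment"]) for t in tweets
    )
    return total_by_handle, by_window_sentiment

# Hypothetical sample tweets; the real input is the historical JSON on GCS.
tweets = [
    {"handle": "a", "created_at": "2020-01-01T10:05:00", "sentiment": "pos"},
    {"handle": "a", "created_at": "2020-01-01T10:45:00", "sentiment": "neg"},
    {"handle": "b", "created_at": "2020-01-01T11:10:00", "sentiment": "pos"},
]
totals, windows = aggregate(tweets)
# totals["a"] → 2; windows[("2020-01-01T10:00:00", "pos")] → 1
```

In the Spark job the same shape appears as a `groupBy("handle").count()` and a `groupBy(window(...), "sentiment").count()`, with the results written to Firestore instead of returned in memory.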
