Data pipeline construction with Spark, Kafka, Python and Airflow


Streaming Data Analysis

This project ingests live data from the Twitter API and processes it with Kafka, Spark, Airflow, and AWS. The pipeline is triggered at five-minute intervals each day.

(screenshot: pipeline overview)

Collecting and Preprocessing

Kafka

(screenshot: Kafka step)
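The producer code itself isn't reproduced in this README. As a minimal sketch (the topic name and record fields are assumptions), tweets are typically serialized to JSON bytes before being published to Kafka; the kafka-python client calls are shown only as comments since a running broker isn't assumed:

```python
import json

# Hypothetical topic name; the repo's actual topic is defined in its producer code.
TOPIC = "tweets"

def serialize_tweet(tweet: dict) -> bytes:
    """Encode a tweet record as UTF-8 JSON, a common Kafka payload format."""
    return json.dumps(tweet, ensure_ascii=False).encode("utf-8")

def deserialize_tweet(payload: bytes) -> dict:
    """Decode a Kafka message value back into a tweet record."""
    return json.loads(payload.decode("utf-8"))

# With the kafka-python client, publishing a tweet would look roughly like:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send(TOPIC, value=serialize_tweet(tweet))
#   producer.flush()
```

Keeping serialization in its own function means the same encoding can be reused by both the producer and any consumer-side tests.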

Spark

(screenshot: Spark step)

See ETL.py for code.
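ETL.py isn't reproduced here, but one preprocessing step it likely involves — extracting hashtags from tweet text — can be sketched as a plain Python function. The function name and the Spark UDF registration (shown only as comments) are assumptions:

```python
import re

HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(text: str) -> list:
    """Pull lowercase hashtags out of a tweet's text.

    In Spark ETL code, a function like this would typically be registered
    as a UDF, e.g.:
        from pyspark.sql.functions import udf
        from pyspark.sql.types import ArrayType, StringType
        hashtags_udf = udf(extract_hashtags, ArrayType(StringType()))
    """
    return [tag.lower() for tag in HASHTAG_RE.findall(text or "")]
```

The `text or ""` guard keeps the function safe on null tweet bodies, which do occur in streaming data.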

Storage

Because streaming involves frequent reads and writes, the data is stored as a Delta table in AWS S3.
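As a sketch of what such a Delta write needs to configure (the bucket paths are hypothetical, and the actual pyspark/delta calls are shown only as comments): a streaming write requires a checkpoint location for exactly-once progress tracking, and append mode for continuously arriving records.

```python
# Hypothetical S3 locations; the repo's actual bucket is not shown in this README.
DELTA_PATH = "s3a://my-bucket/tweets/delta"
CHECKPOINT_PATH = "s3a://my-bucket/tweets/_checkpoints"

# Minimal configuration for a Delta streaming sink.
delta_write_options = {
    "format": "delta",
    "outputMode": "append",           # streaming data only ever adds rows
    "checkpointLocation": CHECKPOINT_PATH,  # enables exactly-once recovery
    "path": DELTA_PATH,
}

# With pyspark and delta-lake installed, the write would look roughly like:
#
#   (df.writeStream
#      .format(delta_write_options["format"])
#      .outputMode(delta_write_options["outputMode"])
#      .option("checkpointLocation", delta_write_options["checkpointLocation"])
#      .start(delta_write_options["path"]))
```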

Analysis

(screenshot: analysis overview)

See lda-pyspark.py for code.

1. Visualize a bar plot of the top hashtags

(screenshot: top-hashtags bar plot)
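Before plotting, the hashtags have to be counted across tweets. A minimal sketch of that aggregation (the record field name is an assumption; the matplotlib call is shown only as a comment):

```python
from collections import Counter

def top_hashtags(tweets, n=10):
    """Count hashtags across tweet records and return the n most common as (tag, count) pairs."""
    counts = Counter(tag for tweet in tweets for tag in tweet.get("hashtags", []))
    return counts.most_common(n)

# Plotting the result with matplotlib would be roughly:
#   import matplotlib.pyplot as plt
#   tags, counts = zip(*top_hashtags(tweets))
#   plt.bar(tags, counts)
#   plt.show()
```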

2. Group the texts with LDA topic analysis.

First, select the hyperparameter with cross-validation, then pass it to the full-dataset run through Airflow's XCom. The LDA analysis itself uses a Spark NLP session and Spark MLlib. The result is as follows:

(screenshot: LDA topic results)
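The tuning step described above amounts to choosing the topic count k with the best held-out score and handing it to the full-dataset task via XCom. A minimal sketch of the selection logic (the MLlib calls and XCom keys, shown as comments, are assumptions — the repo's actual scoring is in lda-pyspark.py):

```python
def pick_best_k(scores: dict) -> int:
    """Given {num_topics: held-out log-likelihood}, return the k with the best score.

    With Spark MLlib, the scores would come from something like:
        from pyspark.ml.clustering import LDA
        model = LDA(k=k, maxIter=10).fit(train_df)
        score = model.logLikelihood(holdout_df)
    """
    return max(scores, key=scores.get)

# In the Airflow DAG, the tuning task could push the chosen value with
#   ti.xcom_push(key="best_k", value=best_k)
# and the full-dataset task would retrieve it with
#   ti.xcom_pull(key="best_k")
```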

Visualization examples:

With pyLDAvis:

You can see the ranking of topics across all documents, and click a topic number on the left to see the words in that topic.

(screenshot: pyLDAvis view)

Pipeline construction

Deploy with Airflow (see dags/main.py for code).
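dags/main.py isn't reproduced here; as a rough sketch of what an Airflow DAG for this pipeline might declare (the DAG id, task names, and callables are hypothetical — only the five-minute trigger comes from this README):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl(**context):
    # Placeholder: submit the Spark job in ETL.py.
    pass

def run_lda(**context):
    # Placeholder: run lda-pyspark.py, pulling the tuned k from XCom.
    pass

with DAG(
    dag_id="streaming_pipeline",
    start_date=datetime(2021, 8, 1),
    schedule_interval="*/5 * * * *",  # trigger every five minutes
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="spark_etl", python_callable=run_etl)
    lda = PythonOperator(task_id="lda_analysis", python_callable=run_lda)
    etl >> lda  # run the analysis only after ETL completes
```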
