Data pipeline construction with Spark, Kafka, Python and Airflow


Streaming Data Analysis

This project ingests live data from the Twitter API and processes it with Kafka, Spark, Airflow, and AWS. The pipeline is triggered at five-minute intervals each day.

(screenshot: pipeline overview)

Collecting and Preprocessing

Kafka

(screenshot: Kafka step)
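The producer code itself isn't reproduced in this README. As a minimal sketch (the topic name and record fields are assumptions), tweets are typically serialized to JSON bytes before being published to Kafka; the kafka-python client calls are shown only as comments since a running broker isn't assumed:

```python
import json

# Hypothetical topic name; the repo's actual topic is defined in its producer code.
TOPIC = "tweets"

def serialize_tweet(tweet: dict) -> bytes:
    """Encode a tweet record as UTF-8 JSON, a common Kafka payload format."""
    return json.dumps(tweet, ensure_ascii=False).encode("utf-8")

def deserialize_tweet(payload: bytes) -> dict:
    """Decode a Kafka message value back into a tweet record."""
    return json.loads(payload.decode("utf-8"))

# With the kafka-python client, publishing a tweet would look roughly like:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send(TOPIC, value=serialize_tweet(tweet))
#   producer.flush()
```

Keeping serialization in its own function means the same encoding can be reused by both the producer and any consumer-side tests.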

Spark

(screenshot: Spark step)

See ETL.py for code.
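ETL.py isn't reproduced here, but one preprocessing step it likely involves — extracting hashtags from tweet text — can be sketched as a plain Python function. The function name and the Spark UDF registration (shown only as comments) are assumptions:

```python
import re

HASHTAG_RE = re.compile(r"#\w+")

def extract_hashtags(text: str) -> list:
    """Pull lowercase hashtags out of a tweet's text.

    In Spark ETL code, a function like this would typically be registered
    as a UDF, e.g.:
        from pyspark.sql.functions import udf
        from pyspark.sql.types import ArrayType, StringType
        hashtags_udf = udf(extract_hashtags, ArrayType(StringType()))
    """
    return [tag.lower() for tag in HASHTAG_RE.findall(text or "")]
```

The `text or ""` guard keeps the function safe on null tweet bodies, which do occur in streaming data.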

Storage

Because streaming involves frequent reads and writes, the data is stored as a Delta table in AWS S3.
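As a sketch of what such a Delta write needs to configure (the bucket paths are hypothetical, and the actual pyspark/delta calls are shown only as comments): a streaming write requires a checkpoint location for exactly-once progress tracking, and append mode for continuously arriving records.

```python
# Hypothetical S3 locations; the repo's actual bucket is not shown in this README.
DELTA_PATH = "s3a://my-bucket/tweets/delta"
CHECKPOINT_PATH = "s3a://my-bucket/tweets/_checkpoints"

# Minimal configuration for a Delta streaming sink.
delta_write_options = {
    "format": "delta",
    "outputMode": "append",           # streaming data only ever adds rows
    "checkpointLocation": CHECKPOINT_PATH,  # enables exactly-once recovery
    "path": DELTA_PATH,
}

# With pyspark and delta-lake installed, the write would look roughly like:
#
#   (df.writeStream
#      .format(delta_write_options["format"])
#      .outputMode(delta_write_options["outputMode"])
#      .option("checkpointLocation", delta_write_options["checkpointLocation"])
#      .start(delta_write_options["path"]))
```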

Analysis

(screenshot: analysis overview)

See lda-pyspark.py for code.

1. Visualize a bar plot of the top hashtags

(screenshot: top-hashtags bar plot)
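Before plotting, the hashtags have to be counted across tweets. A minimal sketch of that aggregation (the record field name is an assumption; the matplotlib call is shown only as a comment):

```python
from collections import Counter

def top_hashtags(tweets, n=10):
    """Count hashtags across tweet records and return the n most common as (tag, count) pairs."""
    counts = Counter(tag for tweet in tweets for tag in tweet.get("hashtags", []))
    return counts.most_common(n)

# Plotting the result with matplotlib would be roughly:
#   import matplotlib.pyplot as plt
#   tags, counts = zip(*top_hashtags(tweets))
#   plt.bar(tags, counts)
#   plt.show()
```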

2. Group the texts with LDA topic analysis.

First, select the hyperparameter with cross-validation, then pass it to the full-dataset run through Airflow's XCom. The LDA analysis itself uses a Spark NLP session and Spark MLlib. The result is as follows:

(screenshot: LDA topic results)
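The tuning step described above amounts to choosing the topic count k with the best held-out score and handing it to the full-dataset task via XCom. A minimal sketch of the selection logic (the MLlib calls and XCom keys, shown as comments, are assumptions — the repo's actual scoring is in lda-pyspark.py):

```python
def pick_best_k(scores: dict) -> int:
    """Given {num_topics: held-out log-likelihood}, return the k with the best score.

    With Spark MLlib, the scores would come from something like:
        from pyspark.ml.clustering import LDA
        model = LDA(k=k, maxIter=10).fit(train_df)
        score = model.logLikelihood(holdout_df)
    """
    return max(scores, key=scores.get)

# In the Airflow DAG, the tuning task could push the chosen value with
#   ti.xcom_push(key="best_k", value=best_k)
# and the full-dataset task would retrieve it with
#   ti.xcom_pull(key="best_k")
```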

Visualization examples:

With pyLDAvis:

You can see the ranking of topics across all documents, and click a topic number on the left to see the words in that topic.

(screenshot: pyLDAvis view)

Pipeline construction

Deploy with Airflow (see dags/main.py for code).
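dags/main.py isn't reproduced here; as a rough sketch of what an Airflow DAG for this pipeline might declare (the DAG id, task names, and callables are hypothetical — only the five-minute trigger comes from this README):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl(**context):
    # Placeholder: submit the Spark job in ETL.py.
    pass

def run_lda(**context):
    # Placeholder: run lda-pyspark.py, pulling the tuned k from XCom.
    pass

with DAG(
    dag_id="streaming_pipeline",
    start_date=datetime(2021, 8, 1),
    schedule_interval="*/5 * * * *",  # trigger every five minutes
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="spark_etl", python_callable=run_etl)
    lda = PythonOperator(task_id="lda_analysis", python_callable=run_lda)
    etl >> lda  # run the analysis only after ETL completes
```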
