This project classifies news articles into categories such as politics and sports by streaming articles from The Guardian through Apache Kafka.
Data preprocessing and training are done with Spark MLlib by building a Pipeline model from a Tokenizer, a StopWordsRemover, a label indexer, a TF-IDF vectorizer, and a classifier.
See the Spark ML Pipeline documentation for reference.
- Start ZooKeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
- Start Kafka Server:
bin/kafka-server-start.sh config/server.properties
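The producer and consumer steps below use the topic `guardian2`. If your broker is not configured to auto-create topics, create it once before streaming; a sketch, assuming a local single-broker setup with ZooKeeper on its default port:

```shell
# Create the guardian2 topic on the local broker (single partition, no replication)
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic guardian2
```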
- To create the ML model and pipeline files, run pipeline.py:
python3 pipeline.py
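A minimal sketch of what pipeline.py assembles, assuming the "Labelizer" is a StringIndexer (mapping category strings to numeric labels), HashingTF + IDF as the TF-IDF vectorizer, and LogisticRegression as the classifier; the actual stage choices and training-data schema may differ:

```python
def build_pipeline():
    """Assemble the preprocessing + classification pipeline.

    Assumes input rows with a "text" column (article body) and a
    "category" column (string label).
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (Tokenizer, StopWordsRemover,
                                    HashingTF, IDF, StringIndexer)
    from pyspark.ml.classification import LogisticRegression

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="rawFeatures")
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    # Assumed "Labelizer": StringIndexer turns category strings into labels
    indexer = StringIndexer(inputCol="category", outputCol="label")
    clf = LogisticRegression(maxIter=20)
    return Pipeline(stages=[tokenizer, remover, tf, idf, indexer, clf])

def train_and_save(train_df, out_path="models/guardian_pipeline"):
    """Fit the pipeline and persist it for the consumer to load.

    out_path is a hypothetical location; point the consumer at the
    same path you save to here.
    """
    model = build_pipeline().fit(train_df)
    model.write().overwrite().save(out_path)
    return model
```

pipeline.py would then read a labelled training DataFrame (for example with `spark.read.csv`) and pass it to `train_and_save`.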
- Run the producer, passing your Guardian API key and a date range:
python3 stream_producer.py API-key fromDate toDate
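A sketch of what stream_producer.py does, assuming the kafka-python client and the Guardian Content API search endpoint; the `show-fields` and paging parameters are illustrative and may differ from the actual script:

```python
import json
import sys

GUARDIAN_URL = "https://content.guardianapis.com/search"

def build_params(api_key, from_date, to_date, page=1):
    """Query parameters for one page of Guardian search results."""
    return {
        "api-key": api_key,
        "from-date": from_date,     # e.g. "2020-01-01"
        "to-date": to_date,
        "show-fields": "bodyText",  # include the article body in each result
        "page-size": 50,
        "page": page,
    }

def stream_articles(api_key, from_date, to_date,
                    topic="guardian2", bootstrap="localhost:9092"):
    # Imported lazily so the module stays importable without the dependencies.
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    resp = requests.get(GUARDIAN_URL,
                        params=build_params(api_key, from_date, to_date))
    resp.raise_for_status()
    for article in resp.json()["response"]["results"]:
        producer.send(topic, article)  # one article per Kafka message
    producer.flush()

if __name__ == "__main__":
    if len(sys.argv) >= 4:
        stream_articles(*sys.argv[1:4])
```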
- Run the consumer from inside the apache-spark directory (the paths of the saved models and pipeline must be set in the consumer file):
bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 /consumer.py localhost:9092 guardian2
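A sketch of the consumer's shape, assuming Spark 2.x Streaming with the spark-streaming-kafka-0-8 direct stream, JSON-encoded articles on the topic, and the pipeline saved by pipeline.py; the field names and model path are assumptions:

```python
import json
import sys

def parse_article(raw):
    """Extract the article body from one JSON-encoded Kafka message."""
    doc = json.loads(raw)
    return doc.get("fields", {}).get("bodyText", "")

def main(brokers, topic):
    # Imported lazily so the module stays importable without pyspark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark.ml import PipelineModel

    sc = SparkContext(appName="guardian-consumer")
    ssc = StreamingContext(sc, batchDuration=10)

    # Hypothetical path; point this at wherever pipeline.py saved the model.
    model = PipelineModel.load("models/guardian_pipeline")

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers})
    # Each record is a (key, value) pair; the value holds the article JSON.
    articles = stream.map(lambda kv: parse_article(kv[1]))
    # Per batch, the RDD of texts would be converted to a DataFrame
    # and classified with model.transform(df).

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    if len(sys.argv) >= 3:
        main(sys.argv[1], sys.argv[2])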