This project classifies news articles into categories such as politics and sports by streaming articles from The Guardian through Apache Kafka.
Data preprocessing and training are done with Spark MLlib by building a Pipeline model from a Tokenizer, a StopWordsRemover, a label indexer, a TF-IDF vectorizer, and a classifier.
See the Spark ML Pipeline documentation for reference.
- Start ZooKeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
- Start Kafka Server:
bin/kafka-server-start.sh config/server.properties
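The producer and consumer steps below use the topic `guardian2`. If your broker is not configured to auto-create topics, create it once before streaming; a sketch, assuming a local single-broker setup with ZooKeeper on its default port:

```shell
# Create the guardian2 topic on the local broker (single partition, no replication)
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic guardian2
```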
- To create the ML model and pipeline files, run pipeline.py:
python3 pipeline.py
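A minimal sketch of what pipeline.py assembles, assuming the "Labelizer" is a StringIndexer (mapping category strings to numeric labels), HashingTF + IDF as the TF-IDF vectorizer, and LogisticRegression as the classifier; the actual stage choices and training-data schema may differ:

```python
def build_pipeline():
    """Assemble the preprocessing + classification pipeline.

    Assumes input rows with a "text" column (article body) and a
    "category" column (string label).
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (Tokenizer, StopWordsRemover,
                                    HashingTF, IDF, StringIndexer)
    from pyspark.ml.classification import LogisticRegression

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="rawFeatures")
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    # Assumed "Labelizer": StringIndexer turns category strings into labels
    indexer = StringIndexer(inputCol="category", outputCol="label")
    clf = LogisticRegression(maxIter=20)
    return Pipeline(stages=[tokenizer, remover, tf, idf, indexer, clf])

def train_and_save(train_df, out_path="models/guardian_pipeline"):
    """Fit the pipeline and persist it for the consumer to load.

    out_path is a hypothetical location; point the consumer at the
    same path you save to here.
    """
    model = build_pipeline().fit(train_df)
    model.write().overwrite().save(out_path)
    return model
```

pipeline.py would then read a labelled training DataFrame (for example with `spark.read.csv`) and pass it to `train_and_save`.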
- Run the producer, passing your Guardian API key and a date range:
python3 stream_producer.py API-key fromDate toDate
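A sketch of what stream_producer.py does, assuming the kafka-python client and the Guardian Content API search endpoint; the `show-fields` and paging parameters are illustrative and may differ from the actual script:

```python
import json
import sys

GUARDIAN_URL = "https://content.guardianapis.com/search"

def build_params(api_key, from_date, to_date, page=1):
    """Query parameters for one page of Guardian search results."""
    return {
        "api-key": api_key,
        "from-date": from_date,     # e.g. "2020-01-01"
        "to-date": to_date,
        "show-fields": "bodyText",  # include the article body in each result
        "page-size": 50,
        "page": page,
    }

def stream_articles(api_key, from_date, to_date,
                    topic="guardian2", bootstrap="localhost:9092"):
    # Imported lazily so the module stays importable without the dependencies.
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    resp = requests.get(GUARDIAN_URL,
                        params=build_params(api_key, from_date, to_date))
    resp.raise_for_status()
    for article in resp.json()["response"]["results"]:
        producer.send(topic, article)  # one article per Kafka message
    producer.flush()

if __name__ == "__main__":
    if len(sys.argv) >= 4:
        stream_articles(*sys.argv[1:4])
```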
- Run the consumer from inside the apache-spark directory (the paths of the saved models and pipeline must be set in the consumer file):
bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 /consumer.py localhost:9092 guardian2
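A sketch of the consumer's shape, assuming Spark 2.x Streaming with the spark-streaming-kafka-0-8 direct stream, JSON-encoded articles on the topic, and the pipeline saved by pipeline.py; the field names and model path are assumptions:

```python
import json
import sys

def parse_article(raw):
    """Extract the article body from one JSON-encoded Kafka message."""
    doc = json.loads(raw)
    return doc.get("fields", {}).get("bodyText", "")

def main(brokers, topic):
    # Imported lazily so the module stays importable without pyspark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark.ml import PipelineModel

    sc = SparkContext(appName="guardian-consumer")
    ssc = StreamingContext(sc, batchDuration=10)

    # Hypothetical path; point this at wherever pipeline.py saved the model.
    model = PipelineModel.load("models/guardian_pipeline")

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers})
    # Each record is a (key, value) pair; the value holds the article JSON.
    articles = stream.map(lambda kv: parse_article(kv[1]))
    # Per batch, the RDD of texts would be converted to a DataFrame
    # and classified with model.transform(df).

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    if len(sys.argv) >= 3:
        main(sys.argv[1], sys.argv[2])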