FlairifyMe

FlairifyMe is a Reddit Flair Detector for the r/india subreddit: it takes a post's URL as input and predicts the post's flair using a Logistic Regression model. The web application is hosted on Heroku at https://flairify-me.herokuapp.com/.

The web-application also offers visual content and temporal analysis of the collected data.

Directory Structure

The project has been developed using Python and several of its libraries and frameworks:

  • Scikit-learn
  • PRAW
  • NLTK
  • Flask
  • numpy
  • pandas
  • PyMongo

The scraped data is saved to and loaded from a MongoDB instance. The web application is built with Flask and deployed on Heroku.

Following is the description of the files and folders in the repository:

  • Data: Contains CSV files with the preprocessed scraped data, the MongoDB collections, and scripts for scraping, preprocessing, and analysing the data.
  • Models: Contains the machine learning model used for predicting flairs.
  • Training: Contains the script for text-classification.
  • templates: Contains the HTML templates for the web application.
  • app.py: Used to start up the Flask server.
  • flair_predictor.py: Module to accept a valid URL and predict the post's flair by loading the model.
  • nltk.txt: Contains NLTK library dependencies for deployment on Heroku.
  • requirements.txt: Contains all dependencies for the project.

Usage

The web application lets the user enter an r/india post URL and displays the predicted flair for the submitted post. Content and temporal analysis of the scraped data is available via the 'Post Analysis' button at the top right corner of the page.

To run on a local server:

  1. Clone the repository:
git clone https://github.com/BhavyaC16/FlairifyMe.git
  2. Create and activate a virtual environment:
python3 -m venv FlairifyMe
source FlairifyMe/bin/activate
cd FlairifyMe/
  3. Install the project dependencies:
pip3 install -r requirements.txt
  4. Run the server:
python3 app.py

Approach

Data Scraping

The Python library PRAW has been used to scrape data from the subreddit r/india: 3,156 posts in total, across 13 different flairs. A chart of the number of posts scraped per flair is included in the repository.

Data preprocessing

The data has been preprocessed using the NLTK library. The following procedures have been executed on the title, body and comments to clean the data:

  1. Tokenizing and removing symbols
  2. Removing stopwords
  3. Stemming

Two separate datasets have been prepared and saved as MongoDB collections for training: one with stemming and one without, since stemming is reported to reduce prediction accuracy in some cases.

Training

The data has been loaded from MongoDB into a pandas DataFrame and split 80-20 into training and testing sets using scikit-learn. Each combination of post features (Title, Body, Comments, Title+Comments, and Title+Body+Comments) was used to train three algorithms: Naive Bayes, Linear SVM, and Logistic Regression, on both datasets (with and without stemming).

Following are the results, summarized as a table:

DATA WITHOUT STEMMING:

Feature\Algorithm      Naive Bayes   Linear SVM   Logistic Regression
Title                  0.59177       0.58386      0.54430
Body                   0.20569       0.24367      0.24051
Comments               0.31171       0.59494      0.58069
Title+Comments         0.37500       0.64082      0.63449
Title+Body+Comments    0.37816       0.64399      0.65189

DATA WITH STEMMING:

Feature\Algorithm      Naive Bayes   Linear SVM   Logistic Regression
Title                  0.57753       0.57120      0.54430
Body                   0.18354       0.23101      0.24051
Comments               0.30063       0.55538      0.56013
Title+Comments         0.36076       0.58703      0.60126
Title+Body+Comments    0.36551       0.59335      0.61392

After comparing the flair-wise and overall prediction accuracies, the Logistic Regression model trained on Title+Body+Comments from the non-stemmed data was chosen.
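The chosen setup can be sketched as a scikit-learn pipeline; the TF-IDF vectorizer and the toy data below are assumptions for illustration, not the repository's exact configuration:

```python
# Sketch of the chosen setup: text features over the combined
# Title+Body+Comments, classified with Logistic Regression.
# TF-IDF and the toy data are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_model():
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Toy stand-in for the Title+Body+Comments text and its flair labels:
texts = [
    "cricket match stadium win",
    "election vote parliament bill",
    "cricket score team sport",
    "policy minister parliament vote",
]
labels = ["Sports", "Politics", "Sports", "Politics"]
model = build_model().fit(texts, labels)
```

In the project itself the pipeline would be fitted on the 80% training split and scored on the held-out 20%, producing the accuracies tabulated above.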

Flair Prediction

The saved model is loaded for predicting the flair once the post features (title, body and comments) have been cleaned using NLTK. The returned result is displayed on the web-application.

API for querying FlairifyMe

A developer API has been implemented using Flask; it returns a JSON object containing the predicted flair of the Reddit post queried by the user.

It can be accessed by querying:

flairify-me.herokuapp.com/api/resource?redditURL=<enter_url_here>

On success, it returns JSON of the following format:

{"status": "successful", "status_code": 200, "result": {"flair": "<predicted_flair>"}}

Otherwise, it returns JSON of the format:

{"status": "failed", "status_code": <error_code>, "result": {"error": "<error_message>"}}
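One way to query the endpoint from Python with the requests library; the `extract_flair` helper is hypothetical, written against the two JSON shapes above:

```python
# Sketch of calling the API. extract_flair is a hypothetical helper
# for unpacking the success/failure payloads described above.
import requests

API_URL = "https://flairify-me.herokuapp.com/api/resource"

def query_flair(post_url):
    """GET the API endpoint and return the parsed JSON payload."""
    resp = requests.get(API_URL, params={"redditURL": post_url}, timeout=10)
    return resp.json()

def extract_flair(payload):
    """Return the predicted flair, or raise with the API's error message."""
    if payload.get("status") == "successful":
        return payload["result"]["flair"]
    raise RuntimeError(payload["result"]["error"])
```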

Future Extension

I plan on adding the following features to the project:

  1. Improving predictions by training the model on user-submitted posts.
  2. Automating the pipeline so that users can build a prediction model for any subreddit they enter.

Learnings

This project has been a great learning experience: it was my first time working with Machine Learning and Natural Language Processing, and with most of the tools involved, such as Heroku and MongoDB, as well as libraries like scikit-learn, NLTK, PRAW, and Flask.

