AnjayGoel/political-content-classification

Political Content Classification

Classifying US-centric political posts on Reddit.

Motivation

I was annoyed by the volume of US-centric news and political posts on Reddit.

How the dataset was generated:

  • Mine data using PushShift, the Reddit API, and BigQuery, then merge the results.
  • Label posts based on the subreddit they were posted in.
  • Extract keywords using TextRank and build a frequency table.
  • Train models on the relative frequencies of the extracted keywords.
  • A simple logistic regression on relative word frequencies gives ~94% classification accuracy.
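
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this repository: the subreddit-to-label mapping, the toy posts, and all function names are hypothetical, plain word tokenization stands in for TextRank keyword extraction, and the logistic regression is fitted with simple batch gradient descent.

```python
import math
import re
from collections import Counter

# Hypothetical subreddit-to-label mapping (1 = US politics, 0 = other).
SUBREDDIT_LABELS = {"politics": 1, "worldnews": 0}


def tokenize(text):
    """Stand-in for TextRank keyword extraction: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(text, vocabulary):
    """Count of each vocabulary word divided by the post's token count."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty posts
    return [counts[word] / total for word in vocabulary]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=1.0, epochs=500):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(text, vocabulary, weights, bias):
    """Probability that a post is US-political under the fitted model."""
    x = relative_frequencies(text, vocabulary)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy posts as (subreddit, text) pairs; the real dataset is far larger.
posts = [
    ("politics", "Senate votes on the new congress budget bill"),
    ("politics", "President signs bill after senate debate"),
    ("worldnews", "Earthquake hits coastal city, rescue teams deployed"),
    ("worldnews", "Festival draws record crowds to the city"),
]
labels = [SUBREDDIT_LABELS[subreddit] for subreddit, _ in posts]
vocabulary = sorted({token for _, text in posts for token in tokenize(text)})
features = [relative_frequencies(text, vocabulary) for _, text in posts]
weights, bias = train_logistic_regression(features, labels)

print(predict_proba("Senate passes bill", vocabulary, weights, bias))
```

Dividing by the post length keeps long and short posts on the same scale, which is presumably why relative rather than raw frequencies are used as features.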

How to use

A trained logistic regression model is included.
Example:

from classifier import Classifier

text = "Senate passes new budget bill"  # any post title or body to classify
Classifier.predict(text)

The final dataset can be found here.

TODO

Make a browser plugin.