Text_Classification_20NewsGroupsData

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Download link: http://qwone.com/~jason/20Newsgroups/

Four classes were selected for this project:

rec.sport.hockey
sci.med
soc.religion.christian
talk.religion.misc
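As a rough illustration of restricting the data to these four classes, the sketch below uses scikit-learn's `load_files`. It assumes the samples location holds one subfolder per newsgroup with one file per post; the helper name `load_newsgroups`, the `latin-1` encoding, and the tiny stand-in corpus are assumptions for illustration, not taken from the project scripts.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# The four newsgroups used in this project.
CATEGORIES = [
    "rec.sport.hockey",
    "sci.med",
    "soc.religion.christian",
    "talk.religion.misc",
]


def load_newsgroups(samples_location):
    """Load only the four selected classes from a folder of class subfolders."""
    # Assumed layout: samples_location/<newsgroup name>/<one file per post>.
    return load_files(
        samples_location,
        categories=CATEGORIES,  # any other subfolder is ignored
        encoding="latin-1",     # assumption: decode raw posts byte-for-byte
        random_state=0,
    )


if __name__ == "__main__":
    # Tiny stand-in corpus so the sketch runs without the real download.
    with tempfile.TemporaryDirectory() as root:
        for name in CATEGORIES + ["comp.graphics"]:
            folder = Path(root) / name
            folder.mkdir()
            (folder / "0001").write_text("a short post", encoding="latin-1")
        data = load_newsgroups(root)
        print(data.target_names)  # class names found on disk, after filtering
        print(len(data.data))     # number of documents loaded
```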

Classifiers used - Naive Bayes, Logistic Regression, Support Vector Machines, and Random Forests.

Configurations used:

Unigram Baseline (UB) -- Basic sentence segmentation and tokenization. Use all words.
Bigram Baseline (BB) -- Use all bigrams (e.g. I ran a race => {I ran, ran a, a race}).

All classifiers were applied to both configurations. The best model was then selected and tuned further along three dimensions:

Feature representations
Feature selection
Hyperparameters
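To make the difference between the unigram and bigram baselines concrete, here is a minimal sketch in plain Python. The helper `ngrams` is hypothetical and uses simple whitespace tokenization, which is an assumption; the project scripts may tokenize differently.

```python
def ngrams(sentence, n):
    """Return the word n-grams of a whitespace-tokenized sentence, in order."""
    tokens = sentence.split()
    # Slide a window of n tokens across the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("I ran a race", 1))  # ['I', 'ran', 'a', 'race']
print(ngrams("I ran a race", 2))  # ['I ran', 'ran a', 'a race']
```

scikit-learn's `CountVectorizer` exposes the same choice through its `ngram_range` parameter, though its default tokenizer lowercases text and drops single-character tokens, so its output for this sentence would differ.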

My best configuration was obtained by removing stop words and applying L2 penalization to an SVM. The classifier is SGDClassifier with hinge loss, which is a linear SVM; since the number of features is more than 10,000, the points are most likely linearly separable. I have created two Python files, one to build the model and another to test it.

Execute the following command to get a learned model from a training dataset:

python best_configuration_train.py Training_samples_location

Execute the following command to run the pre-learned model on a test dataset:

python best_config_test.py Testing_Samples_Location
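The best configuration described above (stop-word removal plus an L2-penalized linear SVM trained with SGDClassifier and hinge loss) could be sketched roughly as follows. This is not the code from the training or testing scripts: the use of `CountVectorizer`, scikit-learn's built-in English stop-word list, the function name `build_best_config`, and the toy documents are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_best_config():
    """Unigram counts without stop words, fed to an L2-penalized linear SVM."""
    return make_pipeline(
        # Assumption: scikit-learn's built-in English stop-word list.
        CountVectorizer(stop_words="english"),
        # loss="hinge" makes SGDClassifier a linear SVM; penalty="l2" adds L2.
        SGDClassifier(loss="hinge", penalty="l2", random_state=0),
    )


# Toy stand-in data (hypothetical); the real scripts read newsgroup folders.
train_docs = [
    "the goalie stopped the puck in overtime",
    "the team won the hockey game on the ice",
    "the doctor prescribed a new medicine",
    "patients reported symptoms of the disease",
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.med", "sci.med"]

model = build_best_config().fit(train_docs, train_labels)
print(model.predict(["the puck hit the ice", "the medicine helped the patients"]))
```

Wrapping the vectorizer and classifier in one pipeline means the same stop-word filtering is applied at training and at test time.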

Exploration of the best parameters for the unigram configuration:

python Analysis_config.py Training_samples_location Testing_Samples_Location
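One common way to run this kind of exploration is a cross-validated grid search. The sketch below is only an illustration of that idea and is not taken from Analysis_config.py: the parameter grid, the step names `vect` and `clf`, and the toy documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),                           # unigram counts
    ("clf", SGDClassifier(loss="hinge", random_state=0)),  # linear SVM
])

# Hypothetical grid: stop words on/off, penalty type, regularization strength.
param_grid = {
    "vect__stop_words": [None, "english"],
    "clf__penalty": ["l2", "l1"],
    "clf__alpha": [1e-4, 1e-3],
}

# Toy stand-in data (hypothetical), four documents per class.
docs = [
    "the goalie stopped the puck in overtime",
    "the team won the hockey game on the ice",
    "the playoffs start with a hockey match",
    "the defenseman scored a goal on the ice",
    "the doctor prescribed a new medicine",
    "patients reported symptoms of the disease",
    "the treatment reduced the symptoms quickly",
    "the doctor studied the disease in patients",
]
labels = ["rec.sport.hockey"] * 4 + ["sci.med"] * 4

# cv=2 keeps the sketch runnable on eight documents; real data allows more folds.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
print(search.best_score_)
```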

Code for the learning-curve plots:

python Learning_curves.py

Baseline models:

python Unigram_bigram_models.py Training_samples_location Testing_Samples_Location

