GitHub - kareemf/news-article-classification: CS 363 - Artifical Intelligence Final: News Article Classification + Clustering

Kareem Francis

Naive Bayes Article Classifier

5/27/12

CS363 Artificial Intelligence

CUNY Queens College

###Requirements:

Python 2.7
NLTK 2.0.1, which requires PyYAML 3.09 and Numpy 1.6.1
Beautiful Soup is requires for the web scraper

###Execution: From the project source directory, run python nbac.py [-h] [-training_dir TRAINING_DIR] [-testing_dir TESTING_DIR] [-o OUTPUT] [-c] [-v]

###Naive Bayes Article Classifier

####optional arguments:

-h, --help
show this help message and exit
-training_dir TRAINING_DIR A directory containing training articles and a class.txt file.
-testing_dir TESTING_DIR A directory containing test articles and a class.txt file
-o OUTPUT, --output OUTPUT The file to which output is to be written. Defaults to predictions.txt in current directory
-c, --clean Execute without using previous data. Necessary if training data has changed
-v, --verbose
Execute in verbose mode. Greatly increases printed output.

If no training or testing directories are specified, the default locations are used, which are located in the root of the project folder. The default location of the output file is the same directory as the python modules. Important: even though both the training and testing folders contain the same articles, only those files listed in class.txt will be opened, and there is NO overlap between the two.

###Overview: Naive Bayes classification is a probabilistic approach to classifying/categorizing information based on provided information. It is considered naive because the algorithm ignores dependences between features, assuming them to be independent. A document classifier is one that labels documents - in this case, news articles - in to predefined categories based on their content. A naive Bayes document classifier applies naive Bayes to the task of document classification. Clustering, unlike classification is an unsupervised machine learning technique, which means that the input labels (classifications) are unknown. Data is grouped by similarities, but those similarities are unknown prior execution, and must be discerned. K-Means is a clustering algorithm that groups data into K clusters, with each data entry being a member of the cluster to which it is the closest. There are many ways in which the distance between data points can be measured, such as Euclidian distance and cosine similarity, which measures the angle between vectors.

###Files:

README.rtf - this file
Project Overview.rtf - a description of the approach I used
Newscraper.py - the web scraper I wrote to gather news articles from Google News
NBAC.py - the main program, which performs both classification and clustering
NBAC_Ensemble.py – my attempt at an ensemble classification approach
Article.py - a class used as the data representation of an article
Model.p and Data.p - serialized execution data, used to avoid having to recompute data between executions, which may be time and processor intensive.
Predictions.txt: contains output of classification of test documents
Clusters.txt: contains output of clustering of test documents
Performance Report.rtf: contains analysis of program results

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
src		src
testing_articles		testing_articles
training_articles		training_articles
.gitignore		.gitignore
FinalProject-2012Spring.pdf		FinalProject-2012Spring.pdf
Performance Results.rtf		Performance Results.rtf
Project Outline.rtf		Project Outline.rtf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

testing_articles

testing_articles

training_articles

training_articles

.gitignore

.gitignore

FinalProject-2012Spring.pdf

FinalProject-2012Spring.pdf

Performance Results.rtf

Performance Results.rtf

Project Outline.rtf

Project Outline.rtf

README.md

README.md

Repository files navigation

About

Releases

Packages

Languages

kareemf/news-article-classification

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages