GitHub - LunasAbacus/DataMiner: Mine all the data, for homework and science!

CleanXML0.py
FeatureVector1.py
FeatureVector2.py
FeatureVector3.py
FeatureVector4.py
FeatureVector5.py
NaiveBayesClassifier.py
README.txt
TODO.txt
TagExtractor.py
TagExtractor.pyc
output-FeatureVector2.txt
output-FeatureVector4.txt
output-NaiveBayesClassifier.txt
output-NaiveBayesClassifier2.txt
output.txt
output2.txt
output4.txt
reut2-000.sgm
reut2-001.sgm
stopwords.txt
test.py


DataMiner
=========

Authors
=======
Nathan Jacobs
Joshua Adams

Mine all the data, for homework and science!

Files
=====

TagExtractor.py - data structure used to extract tags from the .sgm files that are used to create the feature vectors

stopwords.txt - a list of stop words that are eliminated in a preliminary step when creating a feature vector

FeatureVector2.py - constructs a feature vector for each Reuters article in the file by pulling out all of the nouns in the body

FeatureVector3.py - constructs a feature vector by keeping a count of the words that are not listed in stopwords.txt

FeatureVector4.py - constructs a feature vector for each Reuters article by finding the frequency distribution of the words in the body

output-FeatureVector2.txt - sample output for FeatureVector2

output-FeatureVector3.txt - sample output for FeatureVector3

output-FeatureVector4.txt - sample output for FeatureVector4
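
The feature vectors described above can be illustrated with a short sketch.
This is not the code from the repository: the function names, the regular
expressions, the sample article and the two-word stop list are all
assumptions made for illustration. It assumes the article text sits inside
<BODY>...</BODY> tags, as in the Reuters-21578 .sgm files, and it reads
"frequency distribution" as relative frequency, which may differ from what
FeatureVector4.py computes.

```python
import re
from collections import Counter

# Hypothetical sample in the Reuters-21578 SGML layout (not taken from the repo).
SAMPLE_SGM = """<REUTERS NEWID="1">
<TEXT><TITLE>COCOA REVIEW</TITLE>
<BODY>Cocoa prices rose as the cocoa harvest fell.</BODY></TEXT>
</REUTERS>"""

BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)
WORD_RE = re.compile(r"[a-z]+")


def extract_bodies(sgm_text):
    """Return the text inside every <BODY>...</BODY> element."""
    return BODY_RE.findall(sgm_text)


def count_words(body, stopwords):
    """Count lowercase words in body, skipping anything in stopwords."""
    words = WORD_RE.findall(body.lower())
    return Counter(w for w in words if w not in stopwords)


def relative_frequencies(counts):
    """Turn raw counts into fractions that sum to 1 (empty input gives {})."""
    total = float(sum(counts.values()))
    if not total:
        return {}
    return dict((word, count / total) for word, count in counts.items())


# Stand-in for the contents of stopwords.txt (assumed: one word per line).
stopwords = set(["as", "the"])

for body in extract_bodies(SAMPLE_SGM):
    counts = count_words(body, stopwords)
    print(sorted(counts.items()))
    print(sorted(relative_frequencies(counts).items()))
```

The sketch uses only the standard library so it runs without NLTK. Pulling
out nouns, as FeatureVector2.py does, would additionally need a
part-of-speech tagger.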

Installation
============

1. First, download Python 2.7.5 from http://www.python.org/download/
2. Next, install NLTK, following the instructions at http://nltk.org/install.html
3. In Python's IDLE, type the following:
	import nltk
	nltk.download()
4. Click the download button in the window that pops up
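
To confirm that step 2 worked before running the feature vector scripts, a
small check along these lines may help. The helper name nltk_version is
made up for this example; it only reports whether the nltk package can be
imported and does not check that the data from step 4 was downloaded.

```python
import importlib


def nltk_version():
    """Return the installed NLTK version string, or None if it is missing."""
    try:
        nltk = importlib.import_module("nltk")
    except ImportError:
        return None
    # Fall back to a placeholder if the package exposes no version string.
    return getattr(nltk, "__version__", "unknown")


print(nltk_version() or "nltk is not installed")
```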

Running The Program
===================

To run the program, type 'python FeatureVector[number].py' in a terminal,
where [number] is the number of the feature vector to run.

Each program writes its output to a file named after the feature vector
that was run. For example, FeatureVector2.py creates the file
'output-FeatureVector2.txt'.
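
The naming convention above can be expressed as a small helper. This is a
sketch only: the function name output_name is hypothetical, and the scripts
in the repository may simply hard-code their output file names.

```python
import os


def output_name(script_path):
    """Map a script such as FeatureVector2.py to output-FeatureVector2.txt."""
    base = os.path.splitext(os.path.basename(script_path))[0]
    return "output-" + base + ".txt"


print(output_name("FeatureVector2.py"))
```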