DataMiner
=========

Authors
=======
Nathan Jacobs
Joshua Adams

Mine all the data, for homework and science!

Files
=====

TagExtractor.py - data structure that extracts the tags from the .sgm files that are used to create the feature vectors

stopwords.txt - a list of stop words that are eliminated in a preliminary step when creating a feature vector

FeatureVector2.py - constructs a feature vector for each Reuters article in the file by pulling out all of the nouns in the body

FeatureVector3.py - constructs a feature vector by keeping a count of the words that are not listed in stopwords.txt

FeatureVector4.py - constructs a feature vector for each Reuters article by finding the frequency distribution of each word in the body

output-FeatureVector2.txt - sample output for FeatureVector2

output-FeatureVector3.txt - sample output for FeatureVector3

output-FeatureVector4.txt - sample output for FeatureVector4
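
The stop-word counting idea described for FeatureVector3.py can be sketched roughly as follows. This is not the code from the repository: the helper names (``extract_bodies``, ``load_stopwords``, ``count_words``), the regex-based tokenizing, and the assumption that each article's text sits between ``<BODY>`` tags are all illustrative guesses.

```python
import re
from collections import Counter

# Assumption: article text is wrapped in <BODY>...</BODY> tags in the .sgm file.
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL | re.IGNORECASE)


def extract_bodies(sgm_text):
    """Return the text of every <BODY> element found in sgm_text."""
    return BODY_RE.findall(sgm_text)


def load_stopwords(lines):
    """Build a lowercase stop-word set from an iterable of lines."""
    return set(line.strip().lower() for line in lines if line.strip())


def count_words(body, stopwords):
    """Count the lowercase alphabetic words in body that are not stop words."""
    words = re.findall(r"[a-z]+", body.lower())
    return Counter(w for w in words if w not in stopwords)


# Hypothetical sample data; a real run would read an .sgm file and stopwords.txt.
sample = "<REUTERS><BODY>The bank raised the rate. The rate rose.</BODY></REUTERS>"
stopwords = load_stopwords(["the", "a", "of"])
vectors = [count_words(body, stopwords) for body in extract_bodies(sample)]
```

Each entry in ``vectors`` is a word-to-count mapping for one article. In a real run, ``load_stopwords`` could be passed an open ``stopwords.txt`` file instead of a list.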

Installation
============

1. Download Python 2.7.5 from http://www.python.org/download/
2. Install NLTK, following the instructions at http://nltk.org/install.html
3. In Python's IDLE, type the following:
	import nltk
	nltk.download()
4. Click the Download button in the window that pops up

Running The Program
===================

To run a program, type 'python FeatureVector[number].py' in a terminal,
where [number] is the number of the feature vector to run (2, 3, or 4).

Each program writes its output to a file named after its feature vector.
For example, FeatureVector2.py creates the file 'output-FeatureVector2.txt'.
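
The naming convention above can be spelled out with a small sketch. The helpers ``script_name`` and ``output_name`` are hypothetical and not part of the repository; they only print the command and the expected output file for each feature vector listed under Files.

```python
from __future__ import print_function


def script_name(number):
    """Name of the script for the given feature vector number."""
    return "FeatureVector%d.py" % number


def output_name(number):
    """Name of the output file that script is described as creating."""
    return "output-FeatureVector%d.txt" % number


# Only feature vectors 2, 3, and 4 are listed in this README.
for number in (2, 3, 4):
    print("python %s -> %s" % (script_name(number), output_name(number)))
```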
