Lab 1 - Preprocessing Data

Table of Contents

  1. Overview
  2. Module Description
  3. Usage
  4. Development
  5. Change Log

Overview

The purpose of this module is to preprocess a set of SGML documents representing a Reuters article database into datasets of feature vectors and class labels. These datasets will be employed in future assignments for automated categorization, similarity search, and building document graphs.

Module Description

This Python module contains the following files and directories:

  • preprocess.py - main module for preprocessing the Reuters article database
  • feature1.py - sub-module that generates feature vector dataset #1
  • feature2.py - sub-module that generates feature vector dataset #2
  • feature3.py - sub-module that generates feature vector dataset #3
  • tfidf.py - module for term frequency-inverse document frequency
  • data/
    • reut2-xxx.sgm - formatted SGML articles, where xxx ranges over {000,...,021} (a short parsing sketch follows this list)
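
As a quick illustration of how these SGML files can be read, the sketch below uses BeautifulSoup (installed in the Usage section) to pull the title, body, and topic labels out of each <REUTERS> element; the tag names follow the standard Reuters-21578 layout, and the actual parsing logic lives in preprocess.py and may differ.

from bs4 import BeautifulSoup

# Parse one SGML file from data/ (the file name here is just an example).
with open('data/reut2-000.sgm') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# html.parser lowercases tag names, so <REUTERS> becomes 'reuters'.
for article in soup.find_all('reuters'):
    title = article.find('title')
    topics = article.find('topics')
    labels = [d.text for d in topics.find_all('d')] if topics else []
    print(title.text if title else '', labels)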

Running preprocess.py will generate the following files:

  • dataset1.csv
  • dataset2.csv
  • dataset3.csv

The feature vectors in the datasets were generated using the following methodologies:

  • TF-IDF of title & body words to select the top 1000 words as features (a sketch of this step follows the list)
  • Filtering nouns & verbs from the term lists, and repeating the previous process
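
The TF-IDF selection step can be pictured with the minimal sketch below. It assumes documents are already reduced to token lists and uses the log-normalized term frequency and probabilistic inverse document frequency weights that the change log describes for tfidf.py; aggregating each term's score by its maximum over documents is an illustrative choice, and the actual logic in tfidf.py and feature1.py may differ.

import math
from collections import Counter

def top_terms(docs, k=1000):
    """docs: a list of token lists. Returns the k highest-scoring terms."""
    N = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))

    scores = {}
    for doc in docs:
        for term, f in Counter(doc).items():
            tf = 1.0 + math.log(f)      # log normalization
            ratio = (N - df[term]) / float(df[term])
            idf = math.log(ratio) if ratio > 1.0 else 0.0   # probabilistic idf, clipped at 0
            scores[term] = max(scores.get(term, 0.0), tf * idf)

    return sorted(scores, key=scores.get, reverse=True)[:k]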

For a more detailed report of the methodology used to sanitize the data and construct these feature vectors, read the file in this project titled Report1.md using the following command:

> less Report1.md

Potential additions for future iterations of feature vector generation:

  • different normalization
  • bigram/trigram/n-gram aggregation (see the short example after this list)
  • stratified sampling: starting letter, stem, etc.
  • binning: equal-width & equal-depth (grouping by topics/places, part-of-speech, etc.)
  • entropy-based discretization (partitioning based on entropy calculations)
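
As one example, the bigram/trigram idea could be prototyped with NLTK's ngrams helper; this is only an illustration of the planned direction, not code that exists in the module.

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("Gold prices rose sharply in early trading")
bigrams = [' '.join(g) for g in ngrams(tokens, 2)]    # 'Gold prices', 'prices rose', ...
trigrams = [' '.join(g) for g in ngrams(tokens, 3)]   # 'Gold prices rose', ...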

Usage

This module relies on several libraries to perform preprocessing.

First, ensure NLTK is installed:

> pip install nltk

Next, enter a Python shell and download the necessary NLTK data:

> python
>>> import nltk
>>> nltk.download()

From the downloader, ensure punkt, wordnet, and stopwords are installed on your machine; a text-mode session looks like this:

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> punkt
    Downloading package punkt to /home/3/loua/nltk_data...
      Unzipping tokenizers/punkt.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> stopwords
    Downloading package stopwords to /home/3/loua/nltk_data...
      Unzipping corpora/stopwords.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> wordnet
    Downloading package wordnet to /home/3/loua/nltk_data...
      Unzipping corpora/wordnet.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q
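
Alternatively, the same data can be fetched non-interactively with direct nltk.download() calls:

> python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> nltk.download('stopwords')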

Next, ensure BeautifulSoup4 is installed:

> pip install beautifulsoup4

To run the code, first ensure the preprocess.py file has execute privileges:

> chmod +x preprocess.py

Next, ensure the tfidf.py, feature1.py, feature2.py, and feature3.py files are in the same directory as preprocess.py. Also, ensure the data/ directory containing the reut2-xxx.sgm files is present in that same folder. To begin preprocessing the data, run:

> python preprocess.py

or

> ./preprocess.py
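
Note that the ./preprocess.py form also requires the script to begin with a shebang line pointing at your Python interpreter, for example:

#!/usr/bin/env python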

The preprocessing might take some time to complete.

Once preprocess.py finishes execution, three datasets labeled dataset1.csv, dataset2.csv, and dataset3.csv are generated in the project directory (the same folder as preprocess.py). To view these datasets, run:

> less datasetX.csv

where X is replaced with 1, 2, or 3 depending on the dataset.
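
For a quick programmatic look at a dataset, the rows can also be read with Python's csv module; the snippet below only assumes the files are plain comma-separated rows.

import csv

with open('dataset1.csv') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(len(row), row[:5])    # column count and the first few values
        if i == 4:                  # peek at the first five rows only
            break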

Development

  • This module was developed with Python 2.7.10 using the NLTK and BeautifulSoup4 modules.

Contributors

Change Log

2015-09-11 - Version 1.0.3

  • Finalized the construction and output of dataset3.csv
  • Updated Report1.md to reflect the approach/rationale of dataset3.csv
  • Finalized documentation
  • Included usage of scikit-learn

2015-09-11 - Version 1.0.2

  • Updated the tf-idf module to use log normalization & probabilistic inverse frequency
  • Finalized the construction and output of dataset2.csv
  • Updated Report1.md to reflect the approach/rationale of dataset2.csv
  • Began construction of dataset3.csv
  • TODO: finish Report1.md and dataset3.csv

2015-09-11 - Version 1.0.1

  • Fixed the tf-idf module to provide normalized scores in the range [0,1]
  • Updated tokenization in preprocess.py to filter non-English words and shorter stems
  • Updated the feature selection process for feature vector 1 to run in minimal time
  • Finalized the construction and output of dataset1.csv
  • Began construction of dataset2.csv
  • TODO: finish Report1.md and dataset2.csv; start dataset3.csv

2015-09-10 - Version 1.0.0

  • Initial code import
  • Added functionality to generate a parse tree
  • Added functionality to generate document objects
  • Added functionality to tokenize, stem, and filter words (sketched after this list)
  • Added functionality to generate lexicons for title & body words
  • Prepared documents for feature selection & dataset generation
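
For reference, the tokenize/stem/filter step noted above can be pictured roughly as follows; this is a sketch built from the NLTK pieces installed in the Usage section (punkt tokenizer, stopwords corpus, Porter stemmer), not the exact code in preprocess.py.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def clean_tokens(text):
    # Tokenize, lowercase, keep alphabetic non-stopword tokens, then stem.
    tokens = word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOP]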
