NLPProject

Dependencies

Numpy and Scipy are required to use our code. We recommend downloading the Anaconda distribution of Python, which includes Numpy, Scipy, and many other useful packages: https://store.continuum.io/cshop/anaconda/

You will also need to install the Enum package for Python, which can be done with the command:

$ easy_install enum

You will also need the Gensim package, which can be installed with the command:

$ easy_install -U gensim

Gensim provides a Python interface to Word2Vec.

Finally, to perform the fuzzy clustering experiments, you'll need to download and install Peach. You can download it from here: https://code.google.com/p/peach/. Just unzip the archive and place it in the folder containing your Python libraries.

Preliminaries

Data Set Organization

These modules allow you to perform classification and clustering experiments on data sets formatted in the following way:

The data_sets folder has the following layout:

data_sets
       <data_set_name>
              train_classes.txt
              test_classes.txt
              class_label_index.txt
              train/
                     train_00001
                     train_00001.srl
              test/
                     test_00001
                     test_00001.srl

Each file train/test_XXXXX is a raw text file containing a training or testing document; train/test_XXXXX.srl is the dependency-parsed and semantic-role-labeled version of that document. train_classes.txt and test_classes.txt store the class labels of the training and testing documents. They are organized such that the class label of the ith file (determined by the number after train/test in the filename) in the train/test folders is on the ith line of the file.
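
For example, a hypothetical train_classes.txt (the labels here are made up) for a train folder containing train_00001, train_00002, and train_00003 with class labels 2, 0, and 1, respectively, would contain:

2
0
1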

Utility Functions

The Util module provides several key functions that will be used throughout this explanation. These functions handle tasks such as parsing the documents in a corpus and reading class label files. We highlight a few of these key functions here. See the full documentation for more details.

Dependency Parsing and Semantic Role Labeling

The SRL function of the Util module runs the ClearNLP parser to extract both dependency pairs and semantic role labels from all of the files in a specified directory. For example, if I had a data set called my_data_set in the data_sets folder, I would run the parser on both my training and testing data in the following way:

import Util
Util.SRL('<Path-To>/data_sets/my_data_set/train')
Util.SRL('<Path-To>/data_sets/my_data_set/test')

The parser will produce a .srl file for every file in train and test, as shown in the above description of data sets. This file contains both the dependency pairs and semantic role information.

NOTE. Please first set up the ClearNLP parser as specified by: http://clearnlp.wikispaces.com/

NOTE. Please refer to the ClearNLP documentation for the specification of the .srl file format: http://clearnlp.wikispaces.com/dataFormat

NOTE. The parser WILL parse hidden files in the given directories, but the SRL function will delete the parsed version of these files.

Loading Class Label Files

The class labels associated with each document in a data set are stored in train_classes.txt and test_classes.txt. The ith line of these files stores, in plain text, the class labels of the document whose filename is lexicographically ith in the respective folder. If a document has multiple labels, the labels are stored on the same line separated by whitespace. The function LoadClassFile is used to load the class labels. These class labels are stored as numbers (typically 0-based).

Y_train = Util.LoadClassFile('<Path-To>/data_sets/<Data-Set-Name>/train_classes.txt')
Y_test = Util.LoadClassFile('<Path-To>/data_sets/<Data-Set-Name>/test_classes.txt')

In the case where every document has only one label, the returned array (Y_train or Y_test above) is a 1-by-N numpy array of class label numbers, where N is the number of documents. In the case where one or more documents have more than one label, the result is an N-by-C matrix, where C is the total number of class labels and N is the number of documents. Each class label corresponds to a column, and each document has a corresponding row with a 0 or 1 in each column indicating whether or not the document has that class label. This follows the specification in the sklearn package for multiclass-multioutput labels.
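
For instance, a made-up label matrix (illustrative only, not the output of LoadClassFile) for three documents and three classes might look like this:

import numpy as np

# Document 0 has labels 0 and 2, document 1 has label 1, document 2 has labels 1 and 2.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 1, 1]])
# Row i is the indicator vector for document i; column j corresponds to class label j.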

The Features Module

Configuration

The Features module is used to extract Vector Space Model feature vectors from parsed text documents. The parsing comes as a preprocessing step using ClearNLP's parser (as explained above). The Features module allows for several different flavors of features in addition to the traditional bag-of-words features. It also allows for several different options such as the use of lemmatization, the inclusion of part-of-speech tags, etc.

Let's begin by defining some terminology that will make this explanation clearer. By units or base units we are referring to the terms whose presence/absence in a document determines the values of the feature vector of the document. This means each unit corresponds to an entry in the feature vector of a document. The different possible units are described below. We'll call the feature definition the list of units corresponding to the entries in a feature vector of a document. The typical definition of feature vector is intended here: a feature vector is a D-element vector such that element i corresponds to the ith element of the feature definition, with a value representing the presence/absence of the unit. We say that units and their corresponding feature definition can have one of two representations. The representation determines whether the hashed version or the string version of the unit is used. Finally, a feature vector can be of one of three types. It can be binary, which means the values of the feature vector are 0 or 1 depending on whether or not a unit appears in the document. It can be tf-idf, in which the values of the feature vector are the term-frequency-inverse-document-frequency of a unit. Lastly, it can be count, in which the values of the feature vector are the term frequencies of the units.

Now let's examine the different units that the Features module provides. There are three base forms of units: words, dependency pairs, and predicate argument components. These forms can also be combined together, e.g. words and dependency pairs, dependency pairs and predicate argument components, all three, etc. By words, we mean a traditional bag-of-words representation. Dependency pairs refers to the bag of all dependency pairs of words (as determined by the dependency parser). Predicate argument components refers to the bag of all the predicates and arguments that appear in a document. Note that this does not mean that a predicate argument structure is the unit; it means that the predicate itself is one unit and each of the arguments is an additional unit. For example, if we had the predicate argument structure {Predicate: ate, Arg0: the hungry child, Arg1: the cake}, the feature definition would have three units: "ate", "the hungry child" and "the cake".
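
The sketch below (illustrative only, not the module's actual implementation) shows how such a structure breaks down into units:

# Illustrative sketch: one unit for the predicate and one unit per argument.
pred_arg = {"Predicate": "ate", "Arg0": "the hungry child", "Arg1": "the cake"}
units = [pred_arg["Predicate"]] + [v for k, v in sorted(pred_arg.items()) if k.startswith("Arg")]
# units == ['ate', 'the hungry child', 'the cake']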

To select which unit, representation, and type are used to extract features from documents, set the following global variables of the module:

FUNIT
FREP
FTYPE

The values that these variables take on are enum values from the classes FeatureUnits, FeatureRepresentation, and FeatureType, which are nested in the Features module. Below are the possible values for each variable:

import Features

Features.FUNIT = Features.FeatureUnits.WORD
Features.FUNIT = Features.FeatureUnits.DEPENDENCY_PAIR
Features.FUNIT = Features.FeatureUnits.WORDS_AND_DEPENDENCY_PAIRS
Features.FUNIT = Features.FeatureUnits.PREDICATE_ARGUMENT 
Features.FUNIT = Features.FeatureUnits.WORDS_AND_PREDICATE_ARGUMENT
Features.FUNIT = Features.FeatureUnits.DEPENDENCY_PAIRS_AND_PREDICATE_ARGUMENT 
Features.FUNIT = Features.FeatureUnits.ALL (default)

Features.FREP = Features.FeatureRepresentation.HASH (default)
Features.FREP = Features.FeatureRepresentation.STRING

Features.FTYPE = Features.FeatureType.BINARY 
Features.FTYPE = Features.FeatureType.TFIDF (default)
Features.FTYPE = Features.FeatureType.COUNT 

Now let's go over some of the other options we can set in the Features module. We can determine if the units are lemmatized with the following global variable:

Features.USE_LEMMA = True (default) /False

We can determine whether units are case sensitive. Of course, if lemmatization is used, the value of this variable has no effect.

Features.CASE_SENSITIVE = True/False (default)

We can append part of speech tags (provided by the parser) to every word in every unit with the following global variable:

Features.USE_POS_TAGS = True/False (default)

The dependency parser also provides dependency relation tags. These can be appended to words in FeatureUnits.DEPENDENCY_PAIR units with the following global variable:

Features.USE_DEP_TAGS = True/False (default)

The semantic role labeler also provides argument labels. These can be appended to the argument structures. This can be set with the following global variable:

Features.USE_ARG_LABELS  = True/False (default)

NOTE. Appending any of these labels will result in a larger feature space, since it creates a unique unit for each combination of word and label.

Rather than removing stop words based on a fixed list, we only keep those words with certain part of speech tags. The part of speech tags which define the retained words are controlled by the global KEEPER_POS. Its default value is shown below:

KEEPER_POS = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RR", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]

NOTE. A dependency pair unit is retained only if both words in the pair have part of speech tags in KEEPER_POS. A predicate is retained if its POS tag appears in KEEPER_POS. Only the words in an argument that have POS tags in KEEPER_POS are retained. This means that if we had the predicate argument structure {Predicate: drove, Arg0: the old man, Arg1: there}, the only units that would be retained are "drove" and "old man": "the" is dropped from "the old man" and "there" is left off entirely.
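
The following sketch demonstrates these filtering rules. It is illustrative only: the helper names are hypothetical and the tag set is abbreviated for the example.

# Illustrative sketch of the KEEPER_POS filtering rules (abbreviated tag set).
KEEPER_POS = {"JJ", "NN", "VBD"}

def keep_dependency_pair(pos1, pos2):
    # A dependency pair is kept only if both words' POS tags are in KEEPER_POS.
    return pos1 in KEEPER_POS and pos2 in KEEPER_POS

def filter_argument(tagged_words):
    # Only the words of an argument whose POS tags are in KEEPER_POS survive.
    return [word for (word, pos) in tagged_words if pos in KEEPER_POS]

# Arg0 = "the old man" tagged as the/DT old/JJ man/NN; Arg1 = "there" tagged as there/RB.
print(filter_argument([("the", "DT"), ("old", "JJ"), ("man", "NN")]))  # ['old', 'man']
print(filter_argument([("there", "RB")]))                              # [] -- dropped entirely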

We also provide four other options to remove units which might introduce noise. These options are to remove units consisting of a single character, to remove units which appear only one time, to remove units that appear in only one document, and to remove units containing non-alpha-numeric symbols. The first three are set with the following global variables.

Features.REMOVE_SINGLE_CHARACTERS = True (default) / False
Features.REMOVE_FEATURES_APPEARING_IN_ONLY_ONE_DOCUMENT = True (default) / False
Features.REMOVE_FEATURES_ONLY_APPEARING_ONE_TIME = True (default) / False

The final option, the removal of units with non-alphanumeric symbols, is generalized to only keeping units consisting of a specified set of symbols. The global variable SYMBOLS_TO_KEEP is a regular expression (using Python's re module) which specifies the set of symbols from which a unit must be drawn for the unit to be a part of the feature definition. Its default value is shown below.

Features.SYMBOLS_TO_KEEP = '[a-zA-Z0-9]*'
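
As an illustration (assuming the unit is checked against the pattern with a full match, which is our reading rather than a documented detail), a unit is kept only if it is made up entirely of the allowed symbols:

import re

# Hedged sketch: we assume a unit is kept only if the whole unit matches SYMBOLS_TO_KEEP.
SYMBOLS_TO_KEEP = '[a-zA-Z0-9]*'

def unit_is_kept(unit):
    return re.fullmatch(SYMBOLS_TO_KEEP, unit) is not None

print(unit_is_kept("cake"))    # True
print(unit_is_kept("e-mail"))  # False: '-' is not in [a-zA-Z0-9]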

There is also one optimization configuration option, the use of memory maps for the feature vector matrices. This can be set with the following global:

Features.USE_MEMORY_MAP = True / False (default)

The current value of each variable can be quickly checked using the DisplayConfiguration() method of the Features module:

Features.DisplayConfiguration()

which gives the following output:

Feature Configuration Settings
------------------------------
USE_LEMMA: True
CASE_SENSITIVE: False
USE_POS_TAGS: False
USE_DEP_TAGS: False
USE_ARG_LABELS: False
SYMBOLS_TO_KEEP: [a-zA-Z0-9]*
REMOVE_SINGLE_CHARACTERS: True
REMOVE_FEATURES_APPEARING_IN_ONLY_ONE_DOCUMENT: True
REMOVE_FEATURES_ONLY_APPEARING_ONE_TIME: True
USE_MEMORY_MAP: False
KEEPER_POS: ['JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS', 'RR', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
FUNIT: Words, Dependency Pairs, and Predicate Argument Components
FREP: Hash
FTYPE: TF-IDF

Defining and Extracting Features

The Features module provides the method Features() to define a feature definition and extract feature vectors from a set of documents. First configure your Features module as described above. Then, given a training set of documents <Path-To>/data_sets/<Data-Set-Name>/train we can extract features from all of the .srl files in train in the following way:

(feature_def, X_train) = Features.Features('<Path-To>/data_sets/<Data-Set-Name>/train')

The first output argument, feature_def, is a 1-by-D array of the units which correspond to each of the D columns of X_train. The second output argument, X_train, is an N-by-D matrix, where N is the number of documents and D is the dimensionality of the feature definition. The ith row of X_train corresponds to the lexicographically ith document in '<Path-To>/data_sets/<Data-Set-Name>/train'.

To extract features from another set of documents (e.g. a testing set of documents) using the same feature definition, we pass feature_def as an argument to the method. This optional argument has the name feature. For example, extracting feature vectors from '<Path-To>/data_sets/<Data-Set-Name>/test':

X_test = Features.Features('<Path-To>/data_sets/<Data-Set-Name>/test', feature=feature_def)

The next section explains how these can be used in supervised and unsupervised learning.

Supervised Learning

To do supervised learning with the features extracted from documents we provide the SupervisedLearning module, which acts as a wrapper to some of the classifiers provided by sklearn.

The easiest way to use SupervisedLearning is to use its Run function. The signature for the function is:

def Run(FeaturesModule, clf, dirname):

The inputs are a configured Features module, the string name of a classifier to use, and the directory of the data set to be used (formatted in the way that is described above). clf can alternatively be one of the classifier objects of sklearn (additional documentation for this to come). The possible values for clf using the string input are: 'ridge', 'percepton', 'Passive Aggressive', 'LinearSVM', 'SVM', 'SGD'.

The output of the function is a tuple containing the following information, in this order (see the usage sketch after the list):

Accuracy
Overall Precision
Overall Recall
Overall F1 score
Avg. Precision per class
Avg. Recall per class
F1 Score
Precision per class
Recall per class
F1 Score per class
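
Here is a minimal usage sketch, assuming the data set path is filled in; the variable names are illustrative and only the first element of the returned tuple is unpacked:

import Features
import SupervisedLearning

# Usage sketch: configure the Features module, then train and evaluate a linear
# SVM on the data set directory. The path below is a placeholder.
Features.FTYPE = Features.FeatureType.TFIDF
results = SupervisedLearning.Run(Features, 'LinearSVM', '<Path-To>/data_sets/<Data-Set-Name>')
accuracy = results[0]  # the first element of the returned tuple is the accuracy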

Unsupervised Learning

The document clustering or UnsupervisedLearning module is used in much the same way as the SupervisedLearning module. The signature of the method Run in this case is:

def Run(FeaturesModule, clstr, dirname, train_test='train'):

The inputs are a configured Features module, the string name of a clustering algorithm (KMeans or GMM), the directory of the data set, and whether the training or testing documents of the data set are to be used.

The method returns a tuple with the following information, in the order presented below (see the usage sketch after the list):

Purity Score
Normalized Mutual Information Score
Rand Index Score
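
For example, a minimal sketch (the path is a placeholder and the unpacked names are illustrative):

import Features
import UnsupervisedLearning

# Usage sketch: cluster the training documents with KMeans and unpack the three scores.
purity, nmi, rand_index = UnsupervisedLearning.Run(Features, 'KMeans', '<Path-To>/data_sets/<Data-Set-Name>', train_test='train')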
