We need the following tools to run or build our code:
- Python (>= 2.6 and < 3.0) : for most scripts
- NLTK (any recent version) : for preprocessing
- GNU toolchain (any recent version; including gcc, autoconf, make, etc) : for building liblbfgs
- OCaml (any recent version) : for building megam
- g++ (>= 4.4) : for building our own maxent and latent model implementations
- Standard UNIX utils such as grep, sort, uniq, cat, sh
- A program called seq, which is part of GNU coreutils (note that there is no counterpart in BSD coreutils, so it is not available by default on Mac OS X)
- Matlab (any recent version) : for feature extraction
- common.py : common utilities (possibly) useful to everyone
- feateng/ : feature engineering stuff
- preprocess/ : preprocessing stuff (cleaning the data, etc.)
- coarsefine/ : scripts for coarser grouping classification
- maxent/ : our implementation of MaxEnt and the latent model using liblbfgs
- tools/ : source code and executables of off-the-shelf tools
- scripts/ : scripts for running the whole training and testing procedure
See instructions in tools/BUILDING and maxent/BUILDING.
Add the project directory to your PYTHONPATH so that every script knows which common.py to import. If you are using bash, do this (suppose $PROJ_PATH is the project directory):
export PYTHONPATH=$PROJ_PATH:$PYTHONPATH
- Assuming $ORIG is the original CSV data, run this at the top level of the project directory to get clean and split data:
sh clean.sh $ORIG split
- Now you should have a split directory with a couple of number-named files. Run this (still at top level) to get randomly merged training/dev/test data:
preprocess/tr_de_te.py 3 1 1 . split/*
You will see three files called train, dev, test in your current working directory.
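For a rough idea of what this step does, here is an illustrative sketch only, not the actual preprocess/tr_de_te.py; it assumes the three integers are the relative train/dev/test proportions, "." is the output directory, and the split is done over lines:

    # Hypothetical sketch of a 3:1:1 random split (not the actual tr_de_te.py).
    import glob
    import random

    def split_merge(pattern, ratios=(3, 1, 1), names=("train", "dev", "test")):
        # Gather all lines from the number-named files produced by clean.sh.
        lines = []
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                lines.extend(f.readlines())
        random.shuffle(lines)

        # Cut the shuffled lines into chunks proportional to the ratios.
        total = float(sum(ratios))
        start = 0
        for i, (name, ratio) in enumerate(zip(names, ratios)):
            end = len(lines) if i == len(names) - 1 else start + int(round(len(lines) * ratio / total))
            with open(name, "w") as out:
                out.writelines(lines[start:end])
            start = end

    if __name__ == "__main__":
        split_merge("split/*")

The real script may well split by file or preserve dialogue structure; the sketch only illustrates the 3:1:1 idea.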
Use scripts under feateng/ to generate features. Note that baseline features are not combined with any other features later on. Use feateng/baseline.py to generate baseline features. If you run it without any arguments, you will see the usage:
$ feateng/baseline.py
Usage: feateng/baseline.py which(=megam|crfsuite) out_dir train dev test [window_back=1 window_forward=1]
Since we will only use MegaM from here on, run the following to obtain baseline features:
feateng/baseline.py megam $OUTDIR $TRAIN_PATH $DEV_PATH $TEST_PATH
where $OUTDIR is the output directory (make sure it exists), and $TRAIN_PATH, $DEV_PATH and $TEST_PATH are the paths of the corresponding data files obtained by running preprocess/tr_de_te.py.
Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam.
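For a rough picture of what these files contain (treat this as an assumption, not a spec; the exact contents depend on the feature set and on how MegaM is invoked): each line of train.megam, dev.megam and test.megam is one instance, an integer class label followed by the features that fire for it, and map.megam records how the textual codes were mapped to those integer labels. Schematically, with made-up feature names:

    3 prev_word=would cur_pos=VB utt_len=5
    7 cur_word=ok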
Similar to baseline features, run without arguments for usage:
$ feateng/ngrams.py
Usage: feateng/ngrams.py which(=megam|crfsuite) out_dir train dev test window_back window_forward ngram_window stem
Usually, we use the following arguments:
feateng/ngrams.py megam $OUTDIR $TRAIN_PATH $DEV_PATH $TEST_PATH 0 0 1 1
Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam.
First, we need to extract collocations. To do this, run the following to get the input for MDL collocation extraction:
cut -f5 $TRAIN_PATH | preprocess/remove_triple.py > $TRAIN_FOR_MDL
Then run mdl/mdlinduct.py to induce collocations:
mdl/mdlinduct.py $TRAIN_FOR_MDL $TRAIN_FOR_MDL
Now run the following to get a list of collocations longer than 2 words:
grep -E '_.*_' $TRAIN_FOR_MDL-words | sort | uniq > $COLLOC_LIST
Using $COLLOC_LIST, we can now do the actual feature generation with feateng/colloc.py. Here's the usage:
$ feateng/colloc.py
Usage: feateng/colloc.py which(=megam|crfsuite) out_dir train dev test colloc
Run the following to get collocation features:
feateng/colloc.py megam $OUTDIR $TRAIN_PATH $DEV_PATH $TEST_PATH $COLLOC_LIST
Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam. It's fine that some lines contain only an integer label, since in this part we are only extracting a small number of features.
We extract all kinds of task-specific features using feateng/FeatureEng.py. Here's the usage:
feateng/FeatureEng.py which(=megam|matlab) $INDIR $OUTDIR $TRAIN_NAME $DEV_NAME $TEST_NAME
Usually we use megam for which. Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam.
Once you have separate feature files from different feature generation routines, the next thing is to merge them into a single data set. Use feateng/merge_feat_megam.py for this. For example, the following command line merges n-gram, collocation and some other mysterious features of the training data into another output file called merged/train.megam:
feateng/merge_feat_megam.py merged/train.megam $TRAIN_COLLOC $TRAIN_NGRAMS $TRAIN_MYSTERY
Do the same thing for test and dev data.
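The merging idea can be sketched roughly as follows (an illustration only, not the actual feateng/merge_feat_megam.py; it assumes every input file lists the same instances in the same order, each line being a label followed by features, and that the merged line keeps the label once and concatenates the feature lists):

    # Hypothetical sketch of merging MegaM-format feature files line by line
    # (not the actual feateng/merge_feat_megam.py).
    import sys

    def merge(out_path, in_paths):
        ins = [open(p) for p in in_paths]
        with open(out_path, "w") as out:
            for rows in zip(*ins):
                parts = [r.strip().split() for r in rows]
                label = parts[0][0]        # keep the label from the first file
                feats = []
                for p in parts:
                    feats.extend(p[1:])    # concatenate the feature columns
                out.write(" ".join([label] + feats) + "\n")
        for f in ins:
            f.close()

    if __name__ == "__main__":
        merge(sys.argv[1], sys.argv[2:])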
Once you have the features ready, use one of the scripts under scripts/ to try it out! For MaxEnt, you can use either megam or our own implementation. For the latent model, our implementation is your only choice.
All scripts rely on a valid $PROJ_PATH variable; therefore, before running any of them, make sure to modify $PROJ_PATH to be the actual path of your project directory.
There are four scripts under scripts/, namely run_megam.sh, run_maxent.sh, run_megam_merge.sh, and run_latent.sh. All of them require the input training/dev/test data to be renamed as train.megam, dev.megam, and test.megam respectively. They also need a file called map.megam for converting back and forth between textual and integral label names (some scripts under feateng/ might not generate this last file; you can just use the one generated by another script on the same data set, for example feateng/ngrams.py). Here's a brief description of what the scripts do:
- run_{megam,maxent}.sh : These two do the same thing, training and testing using MaxEnt, except that training is done with different implementations of MaxEnt. To use one of these, go to the directory containing the train/dev/test/map data named as train.megam, dev.megam, test.megam and map.megam respectively.
- run_megam_merge.sh : This runs the coarser grouping model. To use it, go to the directory containing the train/dev/test/map data named as train.megam, dev.megam, test.megam and map.megam respectively. You also need a file named megam.groups containing the grouping information. There is one that corresponds to the grouping scheme in the coding manual in scripts/megam.groups.
- run_latent.sh : This runs the latent model. To use it, go to the directory containing the train/dev/test/map data named as train.megam, dev.megam, test.megam and map.megam respectively.
After running, each script produces several output files:
- megam.run.out : the raw prediction file, where each line corresponds to a line in the test data and the numbers are labels sorted in k-best order
- megam.run.kbest : k-best accuracy on the test data
Sometimes, we also produce other output for diagnostic purposes, for example:
- test.labels : textual labels (codes in our case) of the test data; we do this because megam requires labels to be encoded as integers and sometimes it is desirable to see their textual counterparts
- megam.run.eval : label-wise 1-best accuracy
- megam.run.csv : confusion matrix over labels in CSV format
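To show how these outputs fit together, here is a rough sketch of computing k-best accuracy from megam.run.out given a file of gold integer labels (one per line, in the same order as the test data); the gold-label file name and the assumption that megam.run.out holds whitespace-separated integer labels are illustrative assumptions, not guaranteed by the scripts:

    # Hypothetical sketch: k-best accuracy from a k-best prediction file
    # (one line per test instance, labels sorted best-first) and a gold label file.
    def kbest_accuracy(pred_path, gold_path, k):
        golds = [line.strip() for line in open(gold_path)]
        correct = 0
        for gold, pred_line in zip(golds, open(pred_path)):
            topk = pred_line.split()[:k]   # the k highest-ranked labels
            if gold in topk:
                correct += 1
        return correct / float(len(golds))

    # e.g. kbest_accuracy("megam.run.out", "gold.labels", 1) for 1-best accuracy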
We use Matlab for this part. Before running, make sure the data format is correct (i.e. generated using which=matlab; currently this only works with task-specific features).
- libsvm_read.m : reads libsvm-formatted files and returns a sparse data matrix and a class label vector
- lsa.m : performs LSA and writes train/dev/test features in megam format
- pca.m : performs PCA and writes train/dev/test features in megam format
Training and testing are then the same as with ordinary features.
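For reference, the libsvm format that libsvm_read.m expects is plain text with one instance per line: an integer label followed by index:value pairs for the non-zero features. A minimal Python parser sketch under that assumption (a rough analogue of libsvm_read.m, not part of the project):

    # Hypothetical sketch of parsing libsvm-format lines into labels and
    # sparse {index: value} feature dicts.
    def read_libsvm(path):
        labels, rows = [], []
        for line in open(path):
            parts = line.split()
            if not parts:
                continue
            labels.append(int(parts[0]))
            feats = {}
            for pair in parts[1:]:
                idx, val = pair.split(":")
                feats[int(idx)] = float(val)
            rows.append(feats)
        return labels, rows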