We need the following tools to run or build our code:
- Python (>= 2.6 and < 3.0) : for most scripts
- NLTK (any recent version) : for preprocessing
- GNU toolchain (any recent version; including gcc, autoconf, make, etc) : for building liblbfgs
- OCaml (any recent version) : for building megam
- g++ (>= 4.4) : for building our own maxent and latent model implementations
- Standard UNIX utils such as grep, sort, uniq, cat, sh
- A program called seq, which is part of GNU coreutils (note that there is no counterpart in BSD coreutils, so it is not available by default on Mac OS X)
- Matlab (any recent version) : for feature extraction
- common.py : common utilities (possibly) useful to everyone
- feateng/ : feature engineering stuff
- preprocess/ : preprocessing stuff (cleaning the data, etc.)
- coarsefine/ : scripts for coarser grouping classification
- maxent/ : our implementation of MaxEnt and the latent model using liblbfgs
- tools/ : source code and executables of off-the-shelf tools
- scripts/ : scripts for running the whole training and testing procedure
See instructions in tools/BUILDING and maxent/BUILDING.
Add the project directory to your PYTHONPATH so that every script knows which common.py to import. If you are using bash, do this (suppose $PROJ_PATH is the project directory):
export PYTHONPATH=$PROJ_PATH:$PYTHONPATH
- Assuming $ORIG is the original CSV data, run this at the top level of the project directory to get clean and split data:
sh clean.sh $ORIG split
- Now you should have a split directory with a couple of number-named files. Run this (still at top level) to get randomly merged training/dev/test data:
preprocess/tr_de_te.py 3 1 1 . split/*
You will see three files called train, dev, test in your current working directory.
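For a rough idea of what this step does, here is an illustrative sketch only, not the actual preprocess/tr_de_te.py; it assumes the three integers are the relative train/dev/test proportions, "." is the output directory, and the split is done over lines:

    # Hypothetical sketch of a 3:1:1 random split (not the actual tr_de_te.py).
    import glob
    import random

    def split_merge(pattern, ratios=(3, 1, 1), names=("train", "dev", "test")):
        # Gather all lines from the number-named files produced by clean.sh.
        lines = []
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                lines.extend(f.readlines())
        random.shuffle(lines)

        # Cut the shuffled lines into chunks proportional to the ratios.
        total = float(sum(ratios))
        start = 0
        for i, (name, ratio) in enumerate(zip(names, ratios)):
            end = len(lines) if i == len(names) - 1 else start + int(round(len(lines) * ratio / total))
            with open(name, "w") as out:
                out.writelines(lines[start:end])
            start = end

    if __name__ == "__main__":
        split_merge("split/*")

The real script may well split by file or preserve dialogue structure; the sketch only illustrates the 3:1:1 idea.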
Use scripts under feateng/ to generate features. Note that baseline features are not combined with any other features later on. Use feateng/baseline.py to generate baseline features. If you run it without any arguments, you will see the usage:
$ feateng/baseline.py
Usage: feateng/baseline.py which(=megam|crfsuite) out_dir train dev test [window_back=1 window_forward=1]
Since we will only use MegaM from here on, run the following to obtain baseline features:
feateng/baseline.py megam $OUTDIR $TRAIN_PATH $DEV_PATH $TEST_PATH
where $OUTDIR is the output directory (make sure it exists), and $TRAIN_PATH, $DEV_PATH and $TEST_PATH are the paths of the corresponding data files obtained by running preprocess/tr_de_te.py.
Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam.
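For a rough picture of what these files contain (treat this as an assumption, not a spec; the exact contents depend on the feature set and on how MegaM is invoked): each line of train.megam, dev.megam and test.megam is one instance, an integer class label followed by the features that fire for it, and map.megam records how the textual codes were mapped to those integer labels. Schematically, with made-up feature names:

    3 prev_word=would cur_pos=VB utt_len=5
    7 cur_word=ok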
Similar to baseline features, run without arguments for usage:
$ feateng/ngrams.py
Usage: feateng/ngrams.py which(=megam|crfsuite) out_dir train dev test window_back window_forward ngram_window stem
Usually, we use the following arguments:
feateng/ngrams.py megam $OUTDIR $TRAIN_PATH $DEV_PATH $TEST_PATH 0 0 1 1
Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam.
First, we need to extract collocations. To do this, run the following to get the input for MDL collocation extraction:
cut -f5 $TRAIN_PATH | preprocess/remove_triple.py > $TRAIN_FOR_MDL
Then run mdl/mdlinduct.py to induce collocations:
mdl/mdlinduct.py $TRAIN_FOR_MDL $TRAIN_FOR_MDL
Now run the following to get a list of collocations longer than 2 words:
grep -E '_.*_' $TRAIN_FOR_MDL-words | sort | uniq > $COLLOC_LIST
Using $COLLOC_LIST, we can now do the actual feature generation with feateng/colloc.py. Here's the usage:
$ feateng/colloc.py
Usage: feateng/colloc.py which(=megam|crfsuite) out_dir train dev test colloc
Run the following to get collocation features:
feateng/colloc.py megam $OUTDIR $TRAIN_PATH $DEV_PATH $TEST_PATH $COLLOC_LIST
Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam. It's fine that some lines contain only an integer label, since in this part we are only extracting a small number of features.
We extract all kinds of task-specific features using feateng/FeatureEng.py. Here's the usage:
feateng/FeatureEng.py which(=megam|matlab) $INDIR $OUTDIR $TRAIN_NAME $DEV_NAME $TEST_NAME
Usually we use megam for which. Under $OUTDIR, you should see four files, namely train.megam, dev.megam, test.megam and map.megam.
Once you have separate feature files from different feature generation routines, the next thing is to merge them into a single data set. Use feateng/merge_feat_megam.py for this. For example, the following command line merges n-gram, collocation and some other mysterious features of the training data into another output file called merged/train.megam:
feateng/merge_feat_megam.py merged/train.megam $TRAIN_COLLOC $TRAIN_NGRAMS $TRAIN_MYSTERY
Do the same thing for test and dev data.
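The merging idea can be sketched roughly as follows (an illustration only, not the actual feateng/merge_feat_megam.py; it assumes every input file lists the same instances in the same order, each line being a label followed by features, and that the merged line keeps the label once and concatenates the feature lists):

    # Hypothetical sketch of merging MegaM-format feature files line by line
    # (not the actual feateng/merge_feat_megam.py).
    import sys

    def merge(out_path, in_paths):
        ins = [open(p) for p in in_paths]
        with open(out_path, "w") as out:
            for rows in zip(*ins):
                parts = [r.strip().split() for r in rows]
                label = parts[0][0]        # keep the label from the first file
                feats = []
                for p in parts:
                    feats.extend(p[1:])    # concatenate the feature columns
                out.write(" ".join([label] + feats) + "\n")
        for f in ins:
            f.close()

    if __name__ == "__main__":
        merge(sys.argv[1], sys.argv[2:])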
Once you have the features ready, use one of the scripts under scripts/ to try it out! For MaxEnt, you can use either megam or our own implementation. For the latent model, our implementation is your only choice.
All scripts rely on a valid $PROJ_PATH variable; therefore, before running any of them, make sure to modify $PROJ_PATH to be the actual path of your project directory.
There are four scripts under scripts/, namely run_megam.sh, run_maxent.sh, run_megam_merge.sh, and run_latent.sh. All of them require the input training/dev/test data to be renamed as train.megam, dev.megam, and test.megam respectively. They also need a file called map.megam for converting back and forth between textual and integral label names (some scripts under feateng/ might not generate this last file; you can just use the one generated by another script on the same data set, for example feateng/ngrams.py). Here's a brief description of what the scripts do:
- run_{megam,maxent}.sh : These two do the same thing, training and testing using MaxEnt, except that training is done with different implementations of MaxEnt. To use one of these, go to the directory containing the train/dev/test/map data named as train.megam, dev.megam, test.megam and map.megam respectively.
- run_megam_merge.sh : This runs the coarser grouping model. To use it, go to the directory containing the train/dev/test/map data named as train.megam, dev.megam, test.megam and map.megam respectively. You also need a file named megam.groups containing the grouping information. There is one that corresponds to the grouping scheme in the coding manual in scripts/megam.groups.
- run_latent.sh : This runs the latent model. To use it, go to the directory containing the train/dev/test/map data named as train.megam, dev.megam, test.megam and map.megam respectively.
After running, each script produces several output files:
- megam.run.out : the raw prediction file, where each line corresponds to a line in the test data and the numbers are labels sorted in k-best order
- megam.run.kbest : k-best accuracy on the test data
Sometimes, we also produce other output for diagnostic purposes, for example:
- test.labels : textual labels (codes in our case) of the test data; we do this because megam requires labels to be encoded as integers and sometimes it is desirable to see their textual counterparts
- megam.run.eval : label-wise 1-best accuracy
- megam.run.csv : confusion matrix over labels in CSV format
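To show how these outputs fit together, here is a rough sketch of computing k-best accuracy from megam.run.out given a file of gold integer labels (one per line, in the same order as the test data); the gold-label file name and the assumption that megam.run.out holds whitespace-separated integer labels are illustrative assumptions, not guaranteed by the scripts:

    # Hypothetical sketch: k-best accuracy from a k-best prediction file
    # (one line per test instance, labels sorted best-first) and a gold label file.
    def kbest_accuracy(pred_path, gold_path, k):
        golds = [line.strip() for line in open(gold_path)]
        correct = 0
        for gold, pred_line in zip(golds, open(pred_path)):
            topk = pred_line.split()[:k]   # the k highest-ranked labels
            if gold in topk:
                correct += 1
        return correct / float(len(golds))

    # e.g. kbest_accuracy("megam.run.out", "gold.labels", 1) for 1-best accuracy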
We use Matlab for this part. Before running, make sure the data format is correct (i.e. generated using which=matlab; currently this only works with task-specific features).
- libsvm_read.m : reads libsvm-formatted files and returns a sparse data matrix and a class label vector
- lsa.m : performs LSA and writes train/dev/test features in megam format
- pca.m : performs PCA and writes train/dev/test features in megam format
Training and testing are then the same as with ordinary features.
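For reference, the libsvm format that libsvm_read.m expects is plain text with one instance per line: an integer label followed by index:value pairs for the non-zero features. A minimal Python parser sketch under that assumption (a rough analogue of libsvm_read.m, not part of the project):

    # Hypothetical sketch of parsing libsvm-format lines into labels and
    # sparse {index: value} feature dicts.
    def read_libsvm(path):
        labels, rows = [], []
        for line in open(path):
            parts = line.split()
            if not parts:
                continue
            labels.append(int(parts[0]))
            feats = {}
            for pair in parts[1:]:
                idx, val = pair.split(":")
                feats[int(idx)] = float(val)
            rows.append(feats)
        return labels, rows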