GitHub - turian/query-classification-with-word-representations: KDDCup 2005 query classification with word representations

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
data		data
scripts		scripts
.gitignore		.gitignore
.hgignore		.hgignore
README		README
README.batch		README.batch
TODO		TODO

Repository files navigation

Query classification with word representations
--------------------------------------

    scripts and steps by Joseph Turian

Based upon the KDDcup challenge.

    http://www.sigkdd.org/kdd2005/kddcup.html

Using l2-regularized logistic regression, balancing positive examples
vs negative examples, as described in Dekang Lin's "Phrase Clustering
for Discriminative Learning".

We use Megam for l2-regularized logistic regression, because it makes
it easy to weight the examples. It might also be interesting to try
svmsgd, Vowpal Wabbit, or a polynomial SVM.

We have instructions and scripts for how we add word representations
(Brown clusters and/or word embeddings) to the training.

INSTALLATION:
-------------

Download and install megam:
    http://www.cs.utah.edu/~hal/megam/
We use the pre-compiled optimized binary version.

You will need my common Python library:
    http://github.com/turian/common

Download data:
    mkdir data/
    cd data/
    mkdir train/
    cd train/
    wget http://www.sigkdd.org/kdd2005/kddcup/KDDCUPData.zip
    unzip KDDCUPData.zip
    cd ..
    mkdir test/
    cd test/
    wget http://www.sigkdd.org/kdd2005/Labeled800Queries.zip
    unzip Labeled800Queries.zip
    cp labeler2.txt.new labeler2.txt
# Last line is necessary because of inconsistent formatting in one line.

    
Download word representations:
    # Go back to base directory
    mkdir representations/
    cd representations/

    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt

    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
    ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
    ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz 
    ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz
    ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz

    wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
    ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz 

RUNNING
----

See README.batch

NOTE
----

I generate one feature file per l2 parameter. Otherwise, different runs might clobber each other.