Skip to content

turian/query-classification-with-word-representations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Query classification with word representations
--------------------------------------

    scripts and steps by Joseph Turian

Based upon the KDDcup challenge.

    http://www.sigkdd.org/kdd2005/kddcup.html

Using l2-regularized logistic regression, balancing positive examples
vs negative examples, as described in Dekang Lin's "Phrase Clustering
for Discriminative Learning".

We use Megam for l2-regularized logistic regression, because it makes
it easy to weight the examples. It might also be interesting to try
svmsgd, Vowpal Wabbit, or a polynomial SVM.

We have instructions and scripts for how we add word representations
(Brown clusters and/or word embeddings) to the training.

INSTALLATION:
-------------

Download and install megam:
    http://www.cs.utah.edu/~hal/megam/
We use the pre-compiled optimized binary version.

You will need my common Python library:
    http://github.com/turian/common

Download data:
    mkdir data/
    cd data/
    mkdir train/
    cd train/
    wget http://www.sigkdd.org/kdd2005/kddcup/KDDCUPData.zip
    unzip KDDCUPData.zip
    cd ..
    mkdir test/
    cd test/
    wget http://www.sigkdd.org/kdd2005/Labeled800Queries.zip
    unzip Labeled800Queries.zip
    cp labeler2.txt.new labeler2.txt
# Last line is necessary because of inconsistent formatting in one line.

    
Download word representations:
    # Go back to base directory
    mkdir representations/
    cd representations/

    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
    wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt

    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
    wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
    ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
    ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz 
    ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz
    ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz

    wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
    ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
    ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz 

RUNNING
----

See README.batch

NOTE
----

I generate one feature file per l2 parameter. Otherwise, different runs might clobber each other.

About

KDDCup 2005 query classification with word representations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published