-
Notifications
You must be signed in to change notification settings - Fork 2
turian/query-classification-with-word-representations
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Query classification with word representations -------------------------------------- scripts and steps by Joseph Turian Based upon the KDDcup challenge. http://www.sigkdd.org/kdd2005/kddcup.html Using l2-regularized logistic regression, balancing positive examples vs negative examples, as described in Dekang Lin's "Phrase Clustering for Discriminative Learning". We use Megam for l2-regularized logistic regression, because it makes it easy to weight the examples. It might also be interesting to try svmsgd, Vowpal Wabbit, or a polynomial SVM. We have instructions and scripts for how we add word representations (Brown clusters and/or word embeddings) to the training. INSTALLATION: ------------- Download and install megam: http://www.cs.utah.edu/~hal/megam/ We use the pre-compiled optimized binary version. You will need my common Python library: http://github.com/turian/common Download data: mkdir data/ cd data/ mkdir train/ cd train/ wget http://www.sigkdd.org/kdd2005/kddcup/KDDCUPData.zip unzip KDDCUPData.zip cd .. mkdir test/ cd test/ wget http://www.sigkdd.org/kdd2005/Labeled800Queries.zip unzip Labeled800Queries.zip cp labeler2.txt.new labeler2.txt # Last line is necessary because of inconsistent formatting in one line. Download word representations: # Go back to base directory mkdir representations/ cd representations/ wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz RUNNING ---- See README.batch NOTE ---- I generate one feature file per l2 parameter. Otherwise, different runs might clobber each other.
About
KDDCup 2005 query classification with word representations
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published