dsl

Discriminating between Similar Languages

python dsl/tfidf.py --train dat/confused/all_slavic/train/ --test dat/confused/all_slavic/dev/ --topn 100000 --tokenize word --tf lognorm --idf smooth --rare 0 --N 3 --qf smooth > out

Or, for a quicker result, use a smaller --topn:

python dsl/tfidf.py --train dat/train/ --test dat/dev/ --topn 10000 --tokenize word --tf lognorm --idf smooth --rare 0 --N 3 --qf smooth > out
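The flag values lognorm and smooth match the names of two common textbook TF-IDF weighting variants. The formulas that dsl/tfidf.py actually uses are not stated in this README, so the following is only a minimal sketch under the assumption that the flags refer to those textbook definitions. The function names lognorm_tf and smooth_idf are hypothetical and are not taken from the repository.

```python
import math
from collections import Counter


def lognorm_tf(tokens):
    """Log-normalized term frequency: log(1 + raw count) per term.

    Assumption: this is the textbook 'log normalization' variant,
    not necessarily what dsl/tfidf.py computes for --tf lognorm.
    """
    counts = Counter(tokens)
    return {term: math.log(1.0 + count) for term, count in counts.items()}


def smooth_idf(docs):
    """Smoothed inverse document frequency: log(N / (1 + df)) + 1.

    Assumption: this is the textbook 'smooth' variant,
    not necessarily what dsl/tfidf.py computes for --idf smooth.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {
        term: math.log(n_docs / (1.0 + df)) + 1.0
        for term, df in doc_freq.items()
    }


# Toy word-tokenized documents, made up for illustration.
docs = [["a", "b"], ["a"]]
tf = lognorm_tf(["a", "a", "b"])
idf = smooth_idf(docs)
tfidf = {term: tf[term] * idf[term] for term in tf}
```

A term that occurs in every document still gets a nonzero weight under this smoothing, because the idf has a constant 1 added to it.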

Download this year's training and development datasets and split them into one file per language:

bash scripts/download_and_split.sh dat/

Running an experiment

Prerequisites

The newest version of this repository.
Train and dev feature files, each a space-separated feature matrix in which the first column contains the labels and the remaining columns contain the feature values.
A directory where the language names can be looked up (--lang-map). The default is dat/t1/train, which presumes that you are running an experiment on all 14 languages. If you are running on fewer languages, specify a directory that contains one file per language, each named after its language. These names are used in the classifier's output instead of integer labels.
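The feature file format described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name read_feature_matrix and the sample values are hypothetical, and it assumes the labels are integers, as the note about integer labels suggests.

```python
import io


def read_feature_matrix(stream):
    """Parse lines of 'label f1 f2 ...' into (labels, rows).

    Assumption: labels are integers and feature values are floats.
    """
    labels, rows = [], []
    for line in stream:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        labels.append(int(fields[0]))
        rows.append([float(value) for value in fields[1:]])
    return labels, rows


# Made-up two-row matrix standing in for a real train or dev file.
sample = io.StringIO("0 0.5 1.0 0.0\n3 0.1 0.0 2.5\n")
labels, rows = read_feature_matrix(sample)
```

To read a real file, pass an open file object instead of the StringIO sample.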

Parameters

--train: train file, space-separated matrix, where the first column contains the labels
--test: test file, in the same format as --train. Unlabeled test files are not currently supported
--encoder: name of the encoder
--classifier: name of the classifier
params: positional arguments, i.e. anything without a leading --. Each argument is split at '=' and passed to the Representation constructor as a keyword argument. For example, pca_latentDimension=50 becomes a parameter whose key is pca_latentDimension and whose value is 50.

python dsl/representation/run_experiment.py --train train_matrix --test dev_matrix --lang-map dat/t1/train --encoder pca --class svm svm_ktype=linearsvc pca_latentDimension=50 > out
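The way positional key=value arguments turn into keyword arguments can be sketched as follows. This is an illustration only: parse_params is a hypothetical helper, not the code in run_experiment.py, and it keeps every value as a string because this README does not say whether values are converted to numbers.

```python
def parse_params(params):
    """Split each 'key=value' argument into a keyword-argument dict.

    Assumption: values stay strings; any type conversion done by
    the real script is not described in the README.
    """
    kwargs = {}
    for item in params:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % item)
        kwargs[key] = value
    return kwargs


# The positional arguments from the example command above.
kwargs = parse_params(["svm_ktype=linearsvc", "pca_latentDimension=50"])
```

The resulting dict could then be unpacked into a constructor call with the ** operator.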
