-
Notifications
You must be signed in to change notification settings - Fork 0
swabhs/transliteration
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
=================================================================================================== To train the classifier, run: pypy perceptron.py train.dat featlist.dat gaz.dat brown.dat x test.dat > weights.dat train.dat - the training data file is a tab-separated file in this format: de_word de_pos de_tag featlist.dat - the feature list file is a list of preextracted features from the training data. Find it at /usr0/home/sswayamd/transliteration/featlist.dat gaz.dat - file containing german/english words from the gazetteer. Find it at /usr0/home/sswayamd/transliteration/gaz.de-en brown.dat - file containing the brown clusters. Find it at /usr0/home/sswayamd/transliteration/brown.dat x - number of iterations you want to run test.dat - the test data file in a tab-separated file in the same format as the training data final.model - output file containing the feature names and feature weights, space-separated, from every iteration. Feature weights from different iterations are separated by a blank line. =================================================================================================== To run the decoder: pypy decode.py test.dat final.model gaz.dat brown.dat > output.dat A fully trained model can be found at /usr0/home/sswayamd/transliteration/final.model output.dat - contains the output tags in the same format as the test file =================================================================================================== Preprocessing Chris's script to convert parallel data into training data with BIO tags: ------------------------------------------------------------------------ python mnt.py inp > output1 My script to get tab-separated training data, for POS tagging: ------------------------------------------------------------- python data_extract.py output1 > output2 Turboparser to tag the German data: ---------------------------------- ./TurboTagger --test --evaluate \ --file_model=/usr2/home/sswayamd/wmt/TurboParser/models/german_tagger.model \ --file_test=output2 \ --file_prediction=output3 \ --logtostderr More preprocessing: ------------------ paste output3 output2 > output4 awk '{print $1,$2,$4}' output4 > output5 sed 's/ /\t/g' output5 > output.dat output.dat - contains the data in the format that is accepted by classifier ===================================================================================================
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published