GitHub - swabhs/transliteration

===================================================================================================
To train the classifier, run:

pypy perceptron.py train.dat featlist.dat gaz.dat brown.dat x test.dat > weights.dat

train.dat    - the training data file is a tab-separated file in this format:

de_word	de_pos	de_tag

featlist.dat - the feature list file is a list of preextracted features from
               the training data. 
               Find it at /usr0/home/sswayamd/transliteration/featlist.dat

gaz.dat      - file containing German/English words from the gazetteer.
               Find it at /usr0/home/sswayamd/transliteration/gaz.de-en

brown.dat    - file containing the Brown clusters.
               Find it at /usr0/home/sswayamd/transliteration/brown.dat

x            - number of training iterations to run

test.dat     - the test data file, a tab-separated file in the same format as the
               training data

final.model  - output file (named weights.dat in the command above) containing the
               feature names and feature weights, space-separated, from every
               iteration. Feature weights from different iterations are separated
               by a blank line.
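
Since the model file holds one block of weights per iteration, a reader usually
wants only the last block. The following is a minimal sketch of such a reader,
not code from this repository. The function name read_last_iteration_weights is
hypothetical, and the sketch assumes each non-blank line is "feature weight"
with the weight as the final space-separated field.

```python
def read_last_iteration_weights(lines):
    """Return {feature: weight} for the last blank-line-separated block.

    Assumption: every non-blank line is "<feature name> <weight>", and the
    weight is the last space-separated field on the line.
    """
    blocks = [{}]
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current iteration's block.
            if blocks[-1]:
                blocks.append({})
            continue
        # Split from the right so the weight is always the final field.
        name, weight = line.rsplit(" ", 1)
        blocks[-1][name] = float(weight)
    # Drop the empty block left behind by a trailing blank line.
    if not blocks[-1] and len(blocks) > 1:
        blocks.pop()
    return blocks[-1]
```

It accepts any iterable of lines, so an open file handle for the model file can
be passed directly.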

===================================================================================================

To run the decoder:

pypy decoder.py test.dat final.model gaz.dat brown.dat > output.dat

A fully trained model can be found at /usr0/home/sswayamd/transliteration/final.model

output.dat   - contains the output tags in the same format as the test file
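
Because output.dat shares the format of test.dat, the two files can be compared
tag by tag. The sketch below is one way to compute token-level accuracy and is
not part of this repository. The names read_tags and tag_accuracy are
hypothetical, and it assumes the tag is the third tab-separated column, that
both files list the same tokens in the same order, and that any blank lines can
be skipped.

```python
def read_tags(lines):
    """Collect the third tab-separated column, skipping blank lines."""
    tags = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: blank lines carry no token and can be ignored.
            continue
        fields = line.split("\t")
        tags.append(fields[2])
    return tags


def tag_accuracy(gold_lines, predicted_lines):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    gold = read_tags(gold_lines)
    predicted = read_tags(predicted_lines)
    if len(gold) != len(predicted):
        raise ValueError("files do not align token by token")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return float(correct) / len(gold)
```

Passing the open test.dat and output.dat file handles as the two arguments
gives the accuracy over the whole test set.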

===================================================================================================
Preprocessing

Chris's script to convert parallel data into training data with BIO tags:
------------------------------------------------------------------------
python mnt.py inp > output1

My script to get tab-separated training data, for POS tagging:
-------------------------------------------------------------
python data_extract.py output1 > output2

TurboParser to tag the German data:
----------------------------------
./TurboTagger --test --evaluate \
--file_model=/usr2/home/sswayamd/wmt/TurboParser/models/german_tagger.model \
--file_test=output2 \
--file_prediction=output3 \
--logtostderr

More preprocessing:
------------------
paste output3 output2 > output4
awk '{print $1,$2,$4}' output4 > output5
sed 's/ /\t/g' output5 > output.dat

output.dat   - contains the data in the format that is accepted by the classifier
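
The paste/awk/sed steps above can also be done in one pass. The following is a
rough Python 3 sketch of the same merge, not a script from this repository. The
function name merge_tagged is hypothetical. Like the awk command, it keeps
whitespace-separated fields 1, 2 and 4 of each pasted line and joins them with
tabs. What those fields contain depends on the layout of output3 and output2,
which is not specified here.

```python
from itertools import zip_longest


def merge_tagged(tagger_lines, extracted_lines):
    """Mimic: paste output3 output2 | awk '{print $1,$2,$4}' | sed 's/ /\\t/g'."""
    merged = []
    # paste keeps going until the longer file ends, padding with empty text.
    for left, right in zip_longest(tagger_lines, extracted_lines, fillvalue=""):
        # paste joins the two lines with a tab; awk then splits on whitespace.
        fields = (left.rstrip("\n") + "\t" + right.rstrip("\n")).split()
        # awk prints an empty string for a field that does not exist.
        picked = [fields[i] if i < len(fields) else "" for i in (0, 1, 3)]
        merged.append("\t".join(picked))
    return merged
```

Writing the returned lines to output.dat, one per line, reproduces the final
step of the pipeline.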

===================================================================================================