A* Decoding

To work with the pre-trained models, install fairseq (the latest version should be fine). For scoring, install sacrebleu. Both are available as standard pip packages. Unzip the model checkpoints and place them in data/ckpts.

To run Dijkstra's algorithm with a standard conditional LM, use the command:

python decode.py  --fairseq_path data/ckpts/cond_model.pt --fairseq_lang_pair de-en --src_wmap data/wmaps/wmap.bpe.de --trg_wmap data/wmaps/wmap.bpe.en --input_method file --src_test data/valid.de --preprocessing word  --postprocessing bpe@@ --decoder dijkstra_ts

Note that this probably won't finish, since it uses a huge amount of memory. You can limit the search breadth by adding the flag --beam <k> for a chosen beam width k. You can set the number of CPU threads with --n_cpu_threads, e.g. --n_cpu_threads 30.
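Both the memory growth and the effect of a breadth limit can be seen in a toy version of best-first search. The following is a minimal sketch, not the code behind --decoder dijkstra_ts: toy_next_log_probs is a hypothetical stand-in for the conditional LM, and the way the beam argument caps the number of expansions per hypothesis length is an assumption about what --beam does.

```python
import heapq
import math
from collections import defaultdict

EOS = "</s>"


def toy_next_log_probs(prefix):
    """Hypothetical stand-in for a conditional LM: next-token log-probs."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    return {"a": math.log(0.3), "b": math.log(0.2), EOS: math.log(0.5)}


def best_first_search(next_log_probs, beam=None):
    """Return the first complete hypothesis popped from the priority queue.

    Log-probabilities are never positive, so a hypothesis can only get worse
    as it grows. The first finished hypothesis popped is therefore the best
    one, which is the same argument that makes Dijkstra's algorithm correct.
    """
    # heapq is a min-heap, so scores are stored negated.
    queue = [(0.0, ())]
    expanded_at_length = defaultdict(int)
    while queue:
        neg_score, prefix = heapq.heappop(queue)
        if prefix and prefix[-1] == EOS:
            return list(prefix), -neg_score
        if beam is not None:
            # Assumed meaning of the limit: expand at most `beam`
            # hypotheses of each length, and drop the rest.
            if expanded_at_length[len(prefix)] >= beam:
                continue
            expanded_at_length[len(prefix)] += 1
        for token, log_prob in next_log_probs(prefix).items():
            heapq.heappush(queue, (neg_score - log_prob, prefix + (token,)))
    return None, float("-inf")


tokens, score = best_first_search(toy_next_log_probs)
limited_tokens, limited_score = best_first_search(toy_next_log_probs, beam=1)
```

Without a limit, every expansion pushes one queue entry per vocabulary item and nothing is ever discarded, which is why the queue can grow very large with a real vocabulary. With a limit, hypotheses that arrive after the per-length budget is used up are dropped instead of expanded.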

To run beam search with k=5, use the command:

python decode.py  --fairseq_path data/ckpts/cond_model.pt --fairseq_lang_pair de-en --src_wmap data/wmaps/wmap.bpe.de --trg_wmap data/wmaps/wmap.bpe.en --input_method file --src_test data/valid.de --preprocessing word --n_cpu_threads 30 --postprocessing bpe@@ --decoder beam --beam 5

To run dijkstra_ts with PMI and a unigram model as the marginal LM, use the command:

python decode.py  --fairseq_path data/ckpts/cond_model.pt --fairseq_lang_pair de-en --src_wmap data/wmaps/wmap.bpe.de --trg_wmap data/wmaps/wmap.bpe.en --input_method file --src_test data/valid.de --preprocessing word --n_cpu_threads 30 --postprocessing bpe@@ --decoder dijkstra_ts --subtract-uni --lmbd 0.2

Note that --lmbd is the interpolation parameter (e.g. --lmbd 0.2 gives log P(y|x) - 0.2 log P(y)).

To run dijkstra_ts with PMI and a neural network as the marginal LM, use the command:

python decode.py  --fairseq_path data/ckpts/cond_model.pt --fairseq_lang_pair de-en --src_wmap data/wmaps/wmap.bpe.de --trg_wmap data/wmaps/wmap.bpe.en --input_method file --src_test data/valid.de --preprocessing word --n_cpu_threads 30 --postprocessing bpe@@ --decoder dijkstra_ts --subtract-marg --marg_path data/ckpts/lm.pt --lmbd 0.2
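The interpolated objective log P(y|x) - lambda * log P(y) is easiest to understand on a small example. The sketch below is illustrative only: pmi_score is a hypothetical helper, not a function from this repository, and the candidate sentences and log-probabilities are made-up numbers chosen to show how the ranking can change.

```python
def pmi_score(cond_log_prob, marg_log_prob, lmbd=0.2):
    """Interpolated score: log P(y|x) - lmbd * log P(y)."""
    return cond_log_prob - lmbd * marg_log_prob


# (candidate, log P(y|x), log P(y)); all values are invented for illustration.
candidates = [
    ("i do not know", -2.0, -3.0),  # generic: likely under the marginal LM
    ("the cat sat", -2.2, -9.0),    # specific: unlikely under the marginal LM
]


def best_candidate(lmbd):
    """Pick the candidate with the highest interpolated score."""
    return max(candidates, key=lambda c: pmi_score(c[1], c[2], lmbd))[0]


plain_best = best_candidate(0.0)  # lmbd = 0 reduces to log P(y|x)
pmi_best = best_candidate(0.2)
```

With lmbd set to 0 the generic candidate wins on conditional probability alone. With lmbd set to 0.2 its high marginal probability is subtracted and the more specific candidate ranks first.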

Scoring

For scoring, append the arguments --outputs text --output_path <file_name>.txt to the decoding command, then detokenize the output using the Moses detokenizer script (copied to scripts/detokenizer.perl for convenience):

cat <file_name>.txt | perl scripts/detokenizer.perl -threads 8 -l en > out

The detokenized valid/test files for IWSLT14 de-en are already located in the data folder. You can score with sacrebleu using:

cat out | sacrebleu data/valid.detok.en
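The two scoring steps can also be chained from Python. This is a sketch that assumes perl and the sacrebleu command-line tool are on the PATH and that it is run from the repository root. The helper names detokenize_cmd, sacrebleu_cmd and score_file are hypothetical, and their arguments simply mirror the commands above.

```python
import subprocess


def detokenize_cmd(lang="en", threads=8):
    """Argument list for the detokenizer call shown above."""
    return ["perl", "scripts/detokenizer.perl",
            "-threads", str(threads), "-l", lang]


def sacrebleu_cmd(ref_path="data/valid.detok.en"):
    """Argument list for the sacrebleu call shown above."""
    return ["sacrebleu", ref_path]


def score_file(hyp_path, ref_path="data/valid.detok.en"):
    """Detokenize hyp_path, pipe the result to sacrebleu, return its output."""
    with open(hyp_path, "rb") as hyp:
        detok = subprocess.run(detokenize_cmd(), stdin=hyp,
                               stdout=subprocess.PIPE, check=True)
    result = subprocess.run(sacrebleu_cmd(ref_path), input=detok.stdout,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")
```

Calling score_file("<file_name>.txt") would then return whatever sacrebleu prints for that output file.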

If you want to decode on different data sets, let me know and I can send the tokenization scripts and BPE codes I used for training the models.

SGNMT

SGNMT is an open-source framework for neural machine translation (NMT) and other sequence prediction tasks. The tool provides a flexible platform which allows pairing NMT with various other models such as language models, length models, or bag2seq models. It supports rescoring both n-best lists and lattices. A wide variety of search strategies is available for complex decoding problems.

SGNMT is compatible with the following NMT toolkits:

Tensor2Tensor (TensorFlow)
fairseq (PyTorch)

Old SGNMT versions (0.x) are compatible with:

Features:

Syntactically guided neural machine translation (NMT lattice rescoring)
n-best list rescoring with NMT
Integrating external n-gram posterior probabilities used in MBR
Ensemble NMT decoding
Forced NMT decoding
Integrating language models
Different search algorithms (beam, A*, depth first search, greedy...)
Target sentence length modelling
Bag2Sequence models and decoding algorithms
Joint decoding with word- and subword/character-level models
Hypothesis recombination
Heuristic search
...

Documentation

Please see the full SGNMT documentation for more information.

Contributors

Felix Stahlberg, University of Cambridge
Eva Hasler, SDL Research
Danielle Saunders, University of Cambridge

Citing

If you use SGNMT in your work, please cite the following paper:

Felix Stahlberg, Eva Hasler, Danielle Saunders, and Bill Byrne. SGNMT - A Flexible NMT Decoding Platform for Quick Prototyping of New Models and Search Strategies. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 17 Demo Session), September 2017. Copenhagen, Denmark. arXiv
