Modular set of models for multilingual translation applied to low-resource languages
- `decoded_results`: our decoding results on the validation and test sets
- `scripts`: training and decoding scripts
- `code`: all Python code
  - `paths.py`: paths to all data files
  - `config.py`: where all hyperparameters and other options are stored
  - `utils.py`: auxiliary functions for data reading and iteration
  - `subwords.py`: word segmentation training and decoding
  - `vocab.py`: vocabulary extraction file
  - `extract_ted_talks.py`: data extraction file from the provided starter code
  - `run.py`: main file called by the scripts
  - `nmt`: the baseline model and main network components, used in all other models
    - `layers.py`: enhanced versions of PyTorch's LSTM, LSTMCell and Dropout modules
    - `routine.py`: main training, pretraining, searching and scoring functions for any model and parameters
    - `encoder.py`: the encoder module
    - `decoder.py`: the decoder module
    - `nmtmodel.py`: a simple sequence-to-sequence model, designed for `main_training`
    - `nmt.py`: data loading, model loading or initialization, and decoding functions
  - `multi`, `shared` and `transfer`: other models with their own versions of the `nmt.py`, `nmtmodel.py` and `config.py` files. They correspond respectively to multiple-encoder multitask, shared-encoder multitask, and transfer learning approaches.
Requirements:
- numpy>=1.15.1
- pytorch 0.4.1
- tqdm
- docopt
- nltk
- sentencepiece
To use this code, you should have an additional `data` folder in the root, containing:
* the WMT 2015 data files for Azeri, Belarusian, Galician, Turkish, Russian and Portuguese, in `data/bilingual`
* (unused) Wikipedia data for Azeri, Belarusian and Galician, in `data/monolingual`
* initially empty `data/vocab` and `data/subwords` folders.
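As a sketch, the empty parts of this layout can be created and checked with a short Python script (the folder names are taken from the list above; the script itself is not part of the repository, and the corpora still have to be copied in by hand):

```python
from pathlib import Path

# Folder names from the data layout described above.
folders = ["data/bilingual", "data/monolingual", "data/vocab", "data/subwords"]

# Create any missing folders; existing ones are left untouched.
for folder in folders:
    Path(folder).mkdir(parents=True, exist_ok=True)

print(all(Path(folder).is_dir() for folder in folders))  # prints True
```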
All scripts are run from the main directory.
Before the first launch, run `python code/subwords.py`, then `python code/vocab.py`.
Models can be trained with `./scripts/train.sh` and run with `./scripts/decode.sh`.
Every option is set by editing the `config.py` file or, if specific to a mode, that mode's own config file.
Here are the main options to change:
- Change the low-resource language with `config.language` (az, be or gl).
- Decode on the test file with `config.test = True` (otherwise the dev file is used).
- Change the mode with `config.mode` between `normal`, `shared`, `transfer` and `multi`. The `shared` mode corresponds to alternate minibatch sampling.
- Activate or deactivate subwords with `config.subwords`.
- Deactivate the use of the helper language with `config.use_helper = False` (normal mode only). `True` actually corresponds to shared-encoder multitask with concatenated corpora.
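With these option names, an edited `config.py` might look like the following sketch (the option names come from the list above; the values shown are only illustrative settings, not the project's actual defaults):

```python
# Illustrative config.py fragment; values are example settings only.
language = "az"      # low-resource language: "az", "be" or "gl"
test = False         # False decodes the dev file, True the test file
mode = "normal"      # one of "normal", "shared", "transfer", "multi"
subwords = True      # activate subword segmentation
use_helper = True    # normal mode only; True behaves like shared-encoder
                     # multitask on concatenated corpora
```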