
FoTraNMT - Scaling Up Language Coverage in Multilingual Neural Machine Translation

FoTraNMT is a multilingual NMT toolkit developed as part of the FoTran project at the University of Helsinki. We developed FoTraNMT specifically to train and extract massively multilingual meaning representations in a cost-effective way. It includes features for multilingual & multimodal machine translation, enabled by distributed training of independent encoders and decoders connected via an inner-attention layer.

FoTraNMT is optimized for training large models (on a sufficiently large high-performance cluster). For this, we pay special attention to

  1. distributing the modules across several processing units, and
  2. efficiently training the network over many translation directions.

After training, the system can also be used in less resource-intensive settings, because its modular design allows each individual module to be loaded and used independently, as sketched below. We plan to provide trained modules, so the community can benefit from this feature.
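For example, a single encoder-decoder pair extracted from a trained multilingual model can translate on its own. A minimal sketch, assuming the standard OpenNMT-py translation interface (the checkpoint and file names are hypothetical, and the exact options for selecting individual modules may differ in this branch):

# hypothetical checkpoint and file names; replace with your own
python translate.py -model model_step_10000.pt -src input.fr -output pred.cs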

Requirements & Installation

First, clone this repository:

git clone --branch att-brg https://github.com/Helsinki-NLP/OpenNMT-py.git
cd OpenNMT-py

We strongly recommend performing the setup in a virtual environment:

python3 -m pip install --user --upgrade pip
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate

You can confirm you’re in the virtual environment by checking the location of your Python interpreter; it should point to the env directory:

which python
pip install --upgrade pip

Now that you’re in your virtual environment, you can install packages without affecting your system-wide Python installation.

First, install torch according to your system requirements (your GPU driver and CUDA version can be checked with nvidia-smi):

python3 -m pip install torch torchvision
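To confirm that PyTorch is installed and can see your GPU (if you have one), you can run:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"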

After installing PyTorch, install the remaining dependencies:

pip install six tqdm h5py future configargparse PyYAML
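As a quick sanity check, you can verify that the dependencies import cleanly (note that PyYAML is imported as yaml):

python3 -c "import six, tqdm, h5py, future, configargparse, yaml; print('all dependencies OK')"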

Hands-on example

The following scripts require subword-nmt and sacrebleu.

pip install subword-nmt
pip install sacrebleu
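These tools handle subword segmentation and BLEU evaluation, respectively. For reference, a typical subword-nmt workflow looks like the following (file names are hypothetical; the data-preparation script below performs this step for you):

subword-nmt learn-bpe -s 10000 < train.fr > bpe.codes
subword-nmt apply-bpe -c bpe.codes < train.fr > train.bpe.fr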


If you have not already cloned the repository:

git clone --branch att-brg https://github.com/Helsinki-NLP/OpenNMT-py.git

The following example shows how to train a simple model with 2 encoders and 1 decoder.
First, prepare the parallel data for training, validation, and testing:

cd OpenNMT-py/data
source ./prep-data.sh

Second, let's train a model using French and German as input languages and Czech as the target language:

bash train_example.sh

The script runs on CPU and trains a 1-layer model for 10,000 training steps.

After training is complete, we can evaluate the model against a reference test set:

bash test_example.sh
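If you want to score the detokenized output yourself, sacrebleu can also be invoked directly (file names here are hypothetical):

sacrebleu reference.cs < predictions.cs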

Citing this work

If you use this work, please consider citing the work it builds on, which introduces the architecture:

@article{vazquez-etal-2020-systematic,
    title = "A Systematic Study of Inner-Attention-Based Sentence Representations in Multilingual Neural Machine Translation",
    author = {V{\'a}zquez, Ra{\'u}l  and
      Raganato, Alessandro  and
      Creutz, Mathias  and
      Tiedemann, J{\"o}rg},
    journal = "Computational Linguistics",
    volume = "46",
    number = "2",
    month = jun,
    year = "2020",
    url = "https://aclanthology.org/2020.cl-2.5",
    doi = "10.1162/coli_a_00377",
    pages = "387--424",
}

FoTraNMT is built on top of OpenNMT, so please also acknowledge them if you use the system. See the official OpenNMT-py repository and website for documentation of the system's basic functionalities.

@inproceedings{klein-etal-2017-opennmt,
    title = "{O}pen{NMT}: Open-Source Toolkit for Neural Machine Translation",
    author = "Klein, Guillaume  and
      Kim, Yoon  and
      Deng, Yuntian  and
      Senellart, Jean  and
      Rush, Alexander",
    booktitle = "Proceedings of {ACL} 2017, System Demonstrations",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-4012",
    pages = "67--72",
}
