This project was developed during an Insight AI Fellowship in the summer of 2019. It has two parts. The first part is code for training a model that translates French sentences into English. The second part is a small web app that translates French sentences into English using a trained model.

The main goal of this project was to test ideas for speeding up training. Unfortunately, none of these ideas worked in practice.
The model definition used the Pervasive Attention repo (https://github.com/elbayadm/attn2d) as a reference implementation. Unless marked with an attribution (see especially the docstrings at the beginning of classes and functions), all code is my own. If you are looking for code to reuse in your own work, here is a list of classes that I would have liked to have found implemented elsewhere instead of writing my own:
- `LanguageCorpus` and its descendants make a flexible framework for creating training sets from parallel sentences.
- `SubSampler` and `DistributedSampler` implement samplers that allow epochs to be smaller than the entire training set.
- `beam_search()` performs a vectorized beam search for the best outputs for a batch of input sentences.
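To illustrate what "vectorized" means here, below is a minimal, self-contained sketch of batched beam search in NumPy. It is not the repo's implementation: the `step_fn` callback, the array shapes, and the simplistic EOS handling are all assumptions made for the example.

```python
import numpy as np

def beam_search_sketch(step_fn, start_tokens, eos_id, beam_size=4, max_len=10):
    """Minimal vectorized beam search over a batch of sequences.

    step_fn(seqs) must return next-token log-probabilities of shape
    (num_seqs, vocab), one row per partial sequence in `seqs`.
    """
    batch = start_tokens.shape[0]
    # One row per hypothesis: shape (batch * beam_size, 1).
    seqs = np.repeat(start_tokens[:, None], beam_size, axis=0)
    # Start with one live hypothesis per sentence so the first step does
    # not select beam_size identical continuations.
    scores = np.full((batch, beam_size), -np.inf)
    scores[:, 0] = 0.0
    finished = np.zeros((batch, beam_size), dtype=bool)
    for _ in range(max_len):
        logp = step_fn(seqs).reshape(batch, beam_size, -1)
        vocab = logp.shape[-1]
        # Finished hypotheses may only repeat EOS, at no extra cost.
        logp[finished] = -np.inf
        logp[finished, eos_id] = 0.0
        cand = (scores[:, :, None] + logp).reshape(batch, beam_size * vocab)
        # Keep the best beam_size (hypothesis, token) pairs per sentence.
        top = np.argsort(-cand, axis=1)[:, :beam_size]
        scores = np.take_along_axis(cand, top, axis=1)
        beam_idx, tok = top // vocab, top % vocab
        # Reorder surviving hypotheses and append the chosen tokens.
        seqs = seqs.reshape(batch, beam_size, -1)
        seqs = np.take_along_axis(seqs, beam_idx[:, :, None], axis=1)
        seqs = np.concatenate([seqs, tok[:, :, None]], axis=2)
        finished = np.take_along_axis(finished, beam_idx, axis=1) | (tok == eos_id)
        seqs = seqs.reshape(batch * beam_size, -1)
        if finished.all():
            break
    return seqs.reshape(batch, beam_size, -1), scores
```

The key point is that all hypotheses for all sentences in the batch advance in a single call to the model per step, rather than looping over sentences one at a time.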
- Linux or MacOS. Windows Subsystem for Linux (WSL) might work too, but I have not tested it.
- `bash`
- `git`
- `perl` (for third-party tokenization scripts)
- `pip`. For example, on Ubuntu 18.04 run `sudo apt-get install python3-pip`.
- `python` (Python version 3.6)
- `virtualenv`. If everything else is installed, you should be able to install `virtualenv` by running `pip3 install virtualenv`.

If you prefer to use `conda` for package management, then the list of required packages is in `requirements.in` and `dev-requirements.in` (not `requirements.txt`, which lists transitive dependencies too).
- Clone the repository.

  ```
  git clone https://github.com/expz/fast-training
  ```

- Change directory into the repo.

  ```
  cd fast-training
  ```

- Create a Python environment named `fast-training`. NOTE: Moving this directory after creating the Python environment will break the environment and require deleting and redoing the setup.

  ```
  deactivate 2>/dev/null
  virtualenv --python=$(which python3) fast-training
  ```

- Install Python packages.

  ```
  source .env
  pip install -r requirements.txt
  ```
- Activate the Python environment. From the root directory of this repository, run

  ```
  source .env
  ```

- Download the tokenizer (19KB), model (32MB), and vocabulary (90KB).

  ```
  ./download
  ```

- Translate a sentence.

  ```
  python translate.py 'Comment allez-vous?'
  ```
- Activate the Python environment. From the root directory of this repository, run

  ```
  source .env
  ```

- Download the tokenizer (19KB), model (32MB), and vocabulary (90KB).

  ```
  ./download
  ```

- Run the app.

  ```
  gunicorn -b 0.0.0.0:8887 app:api
  ```

- Navigate to http://localhost:8887 in a browser to see the website. The health check is available at http://localhost:8887/health. It should print 'OK'.

- To test the API directly, open a new terminal and run (requires `curl` to be installed):

  ```
  curl -X POST -H 'Content-type: application/json' -d '{"fr": "Comment allez-vous?"}' http://localhost:8887/api/translate/fren
  ```
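The same request can also be made from Python using only the standard library. This is a sketch: it assumes the app is running locally on port 8887 and makes no assumption about the keys in the JSON response.

```python
import json
import urllib.request

API_URL = "http://localhost:8887/api/translate/fren"

def build_request(sentence, url=API_URL):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"fr": sentence}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-type": "application/json"})

def translate(sentence):
    """Send the request to the running app and decode the JSON response."""
    with urllib.request.urlopen(build_request(sentence)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# With the app running:  translate("Comment allez-vous?")
```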
If you would like to host the app on Google Cloud, then set `GCP_PROJECT`, e.g., run

```
export GCP_PROJECT=my-google-cloud-project
```

Then build the Docker container and push it to Google Cloud Run:

```
./build/build
```
WARNING: Training takes substantial computing resources. Some datasets are large and require significant computing power to preprocess and significant RAM as working space (although for embedding-vector datasets, preprocessing does not load all of the data into memory). The code will run on a CPU, but at least one NVIDIA GPU is required to train in a reasonable amount of time.
NOTE: The possible commands and their syntax can be found by running `python dev.py --help`, or for a subcommand like `prepare-data`, by running `python dev.py prepare-data -- --help`.
Run the following commands from the root directory of the repository.
- Load the environment and install development requirements.

  ```
  source .env
  pip install -r dev-requirements.txt
  ```

- Download and prepare a small dataset of a few hundred MB. For more data, a larger data source or multiple data sources can be used. Possible data sources are listed by running `python dev.py prepare-data list-datasets`.

  ```
  python dev.py prepare-data standard fr news_fr_en "['news2014']" 50 --shuffle True --joint-vocab-size 24000 --valid-size 4096 --use-cache True
  ```
- (Optional) View the model architecture.

  ```
  python dev.py summary config/news_fr_en.yaml
  ```
- Train a model. With the `news_fr_en` dataset prepared, a 6+4 layer DenseNet can be trained (`config/news_fr_en.yaml`). This can take a few hours to reach a point of reasonable translations for some sentences. To train on NVIDIA GPU 0 (e.g., if you have just one GPU):

  ```
  python dev.py train config/news_fr_en.yaml --lr 0.01 --device_ids '[0]'
  ```

  To train on multiple GPUs at once (e.g., GPUs 0 and 1), run:

  ```
  python dev.py train config/news_fr_en.yaml --lr 0.01 --device_ids '[0,1]'
  ```

  Press CTRL-C to stop training in the middle. The model is saved at the end of every epoch in the `model/news_fr_en` folder.
- (Optional) To continue training a saved model, e.g., `model/news_fr_en/model_0.pth`, run

  ```
  python dev.py train config/news_fr_en.yaml --lr 0.01 --device_ids '[0]' --restore model/news_fr_en/model_0.pth
  ```
- Try out the model. To translate 16 examples, choose a saved model, e.g., `model_0.pth`, and run

  ```
  python dev.py example config/news_fr_en.yaml --gpu_id 0 --model model/news_fr_en/model_0.pth --batch 16
  ```

  Translate a sentence of your own with a beam search of beam size 5:

  ```
  python translate.py --config config/news_fr_en.yaml --model model/news_fr_en/model_0.pth --beam 5 'Comment allez-vous?'
  ```
First install development requirements. From the root directory of the repository, run

```
source .env
pip install -r dev-requirements.txt
```

Then, to run the tests, from the root directory of the repository run

```
source .env
pytest
```
Upgrade packages:

```
pip-compile --upgrade requirements.in
pip-compile --upgrade dev-requirements.in
```

This repo uses Google's Yet Another Python Formatter (yapf) to format code.

```
yapf -i --style=google src/file_to_format.py
```