This project was developed during an Insight AI Fellowship in the summer of 2019. It has two parts. The first part is code for training a model that translates French sentences into English. The second part is a small web app that translates French sentences into English using a trained model.

The main goal of this project was to test ideas for speeding up training. Unfortunately, none of these ideas worked in practice.
The model definition used the Pervasive Attention repo (https://github.com/elbayadm/attn2d) as a reference implementation. Unless marked with an attribution (see especially the docstrings at the beginning of classes and functions), all code is my own. If you are looking for code to reuse in your own work, here is a list of classes that I would have liked to have found implemented elsewhere instead of writing my own:
- `LanguageCorpus` and its descendants make a flexible framework for creating training sets from parallel sentences.
- `SubSampler` and `DistributedSampler` implement samplers that allow epochs to be smaller than the entire training set.
- `beam_search()` performs a vectorized beam search for the best outputs for a batch of input sentences.
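To illustrate what "vectorized" means here, below is a minimal, self-contained sketch of batched beam search in NumPy. It is not the repo's implementation: the `step_fn` callback, the array shapes, and the simplistic EOS handling are all assumptions made for the example.

```python
import numpy as np

def beam_search_sketch(step_fn, start_tokens, eos_id, beam_size=4, max_len=10):
    """Minimal vectorized beam search over a batch of sequences.

    step_fn(seqs) must return next-token log-probabilities of shape
    (num_seqs, vocab), one row per partial sequence in `seqs`.
    """
    batch = start_tokens.shape[0]
    # One row per hypothesis: shape (batch * beam_size, 1).
    seqs = np.repeat(start_tokens[:, None], beam_size, axis=0)
    # Start with one live hypothesis per sentence so the first step does
    # not select beam_size identical continuations.
    scores = np.full((batch, beam_size), -np.inf)
    scores[:, 0] = 0.0
    finished = np.zeros((batch, beam_size), dtype=bool)
    for _ in range(max_len):
        logp = step_fn(seqs).reshape(batch, beam_size, -1)
        vocab = logp.shape[-1]
        # Finished hypotheses may only repeat EOS, at no extra cost.
        logp[finished] = -np.inf
        logp[finished, eos_id] = 0.0
        cand = (scores[:, :, None] + logp).reshape(batch, beam_size * vocab)
        # Keep the best beam_size (hypothesis, token) pairs per sentence.
        top = np.argsort(-cand, axis=1)[:, :beam_size]
        scores = np.take_along_axis(cand, top, axis=1)
        beam_idx, tok = top // vocab, top % vocab
        # Reorder surviving hypotheses and append the chosen tokens.
        seqs = seqs.reshape(batch, beam_size, -1)
        seqs = np.take_along_axis(seqs, beam_idx[:, :, None], axis=1)
        seqs = np.concatenate([seqs, tok[:, :, None]], axis=2)
        finished = np.take_along_axis(finished, beam_idx, axis=1) | (tok == eos_id)
        seqs = seqs.reshape(batch * beam_size, -1)
        if finished.all():
            break
    return seqs.reshape(batch, beam_size, -1), scores
```

The key point is that all hypotheses for all sentences in the batch advance in a single call to the model per step, rather than looping over sentences one at a time.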
- Linux or MacOS. Windows Subsystem for Linux (WSL) might work too, but I have not tested it.
- `bash`
- `git`
- `perl` (for third-party tokenization scripts)
- `pip`. For example, on Ubuntu 18.04 run `sudo apt-get install python3-pip`.
- `python` (Python version 3.6)
- `virtualenv`. If everything else is installed, you should be able to install `virtualenv` by running `pip3 install virtualenv`.

If you prefer to use `conda` for package management, then the list of required packages is in `requirements.in` and `dev-requirements.in` (not `requirements.txt`, which lists transitive dependencies too).
- Clone the repository.

  ```
  git clone https://github.com/expz/fast-training
  ```

- Change directory into the repo.

  ```
  cd fast-training
  ```

- Create a Python environment named `fast-training`. NOTE: Moving this directory after creating the Python environment will break the environment and require deleting and redoing the setup.

  ```
  deactivate 2>/dev/null
  virtualenv --python=$(which python3) fast-training
  ```

- Install Python packages.

  ```
  source .env
  pip install -r requirements.txt
  ```
- Activate the Python environment. From the root directory of this repository, run

  ```
  source .env
  ```

- Download the tokenizer (19KB), model (32MB), and vocabulary (90KB).

  ```
  ./download
  ```

- Translate a sentence.

  ```
  python translate.py 'Comment allez-vous?'
  ```
- Activate the Python environment. From the root directory of this repository, run

  ```
  source .env
  ```

- Download the tokenizer (19KB), model (32MB), and vocabulary (90KB).

  ```
  ./download
  ```

- Run the app.

  ```
  gunicorn -b 0.0.0.0:8887 app:api
  ```

- Navigate to http://localhost:8887 in a browser to see the website. The health check is available at http://localhost:8887/health. It should print 'OK'.

- To test the API directly, open a new terminal and run (requires `curl` to be installed):

  ```
  curl -X POST -H 'Content-type: application/json' -d '{"fr": "Comment allez-vous?"}' http://localhost:8887/api/translate/fren
  ```
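The same request can also be made from Python using only the standard library. This is a sketch: it assumes the app is running locally on port 8887 and makes no assumption about the keys in the JSON response.

```python
import json
import urllib.request

API_URL = "http://localhost:8887/api/translate/fren"

def build_request(sentence, url=API_URL):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"fr": sentence}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-type": "application/json"})

def translate(sentence):
    """Send the request to the running app and decode the JSON response."""
    with urllib.request.urlopen(build_request(sentence)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# With the app running:  translate("Comment allez-vous?")
```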
If you would like to host the app on Google Cloud, then set `GCP_PROJECT`, e.g., run

```
export GCP_PROJECT=my-google-cloud-project
```

Then build the Docker container and push it to Google Cloud Run:

```
./build/build
```
WARNING: Training takes substantial computing resources. Some datasets are large and require significant computing power to preprocess and significant RAM as working space (although for embedding-vector datasets, preprocessing does not load all of the data into memory). The code will run on a CPU, but at least one NVIDIA GPU is required to train in a reasonable amount of time.
NOTE: The possible commands and their syntax can be found by running `python dev.py --help`, or for a subcommand like `prepare-data`, by running `python dev.py prepare-data -- --help`.
Run the following commands from the root directory of the repository.
- Load the environment and install development requirements.

  ```
  source .env
  pip install -r dev-requirements.txt
  ```

- Download and prepare a small dataset of a few hundred MB. For more data, a larger data source or multiple data sources can be used. Possible data sources are listed by running `python dev.py prepare-data list-datasets`.

  ```
  python dev.py prepare-data standard fr news_fr_en "['news2014']" 50 --shuffle True --joint-vocab-size 24000 --valid-size 4096 --use-cache True
  ```
- (Optional) View the model architecture.

  ```
  python dev.py summary config/news_fr_en.yaml
  ```
- Train a model. With the `news_fr_en` dataset prepared, a 6+4 layer DenseNet can be trained (`config/news_fr_en.yaml`). This can take a few hours to reach a point of reasonable translations for some sentences. To train on NVIDIA GPU 0 (e.g., if you have just one GPU):

  ```
  python dev.py train config/news_fr_en.yaml --lr 0.01 --device_ids '[0]'
  ```

  To train on multiple GPUs at once (e.g., GPUs 0 and 1), run:

  ```
  python dev.py train config/news_fr_en.yaml --lr 0.01 --device_ids '[0,1]'
  ```

  Press CTRL-C to stop training in the middle. The model is saved at the end of every epoch in the `model/news_fr_en` folder.
- (Optional) To continue training a saved model, e.g., `model/news_fr_en/model_0.pth`, run

  ```
  python dev.py train config/news_fr_en.yaml --lr 0.01 --device_ids '[0]' --restore model/news_fr_en/model_0.pth
  ```
- Try out the model. To translate 16 examples, choose a saved model, e.g., `model_0.pth`, and run

  ```
  python dev.py example config/news_fr_en.yaml --gpu_id 0 --model model/news_fr_en/model_0.pth --batch 16
  ```

  Translate a sentence of your own with a beam search of beam size 5:

  ```
  python translate.py --config config/news_fr_en.yaml --model model/news_fr_en/model_0.pth --beam 5 'Comment allez-vous?'
  ```
First install development requirements. From the root directory of the repository, run

```
source .env
pip install -r dev-requirements.txt
```

Then, to run the tests, from the root directory of the repository run

```
source .env
pytest
```
Upgrade packages:

```
pip-compile --upgrade requirements.in
pip-compile --upgrade dev-requirements.in
```

This repo uses Google's Yet Another Python Formatter (yapf) to format code.

```
yapf -i --style=google src/file_to_format.py
```