
Le Traducteur

A framework for neural machine translation. The name refers to the framework's first task: English-to-French translation.

Supported Models

Recommended Corpora

Getting Started

The system is built with PyTorch and AllenNLP, which are the main dependencies.

Prerequisites

  • Python 3.6 (3.6.5+ recommended)

Installing

It is recommended to create a virtual environment before installing dependencies.

Using Conda

conda create --name le-traducteur python=3.6

Using VirtualEnv

python3 -m venv /path/to/new/virtual/environment
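
In either case, activate the environment before installing dependencies, using the name or path from the command above:

conda activate le-traducteur                              # Conda
source /path/to/new/virtual/environment/bin/activate      # VirtualEnv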

Then install PyTorch, AllenNLP, and the remaining dependencies via

`pip install -r requirements.txt`

Caveat with installing AllenNLP using pip / conda

The current version of AllenNLP doesn't support restricting the vocabulary by namespace. To enable this and run the provided experiments, you'll have to install AllenNLP from source.

Once version 0.5.2 is released, this should no longer be a problem.
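
Until then, a from-source install typically looks something like this (a sketch; see AllenNLP's own installation instructions for the authoritative steps):

git clone https://github.com/allenai/allennlp.git
cd allennlp
pip install --editable .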

Pre-trained NLP models

Several of the tokenizers rely on NLTK's and spaCy's pre-trained models for tokenizing English, French, and Spanish. You do not need to download these models explicitly: if a tokenizer in the config specifies a spaCy model that does not yet exist on your machine, it will be downloaded automatically.
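
If you would rather fetch the models up front (for example, on a machine that will be offline at training time), something like the following works. The model names here are assumptions; download whichever models your configuration actually references:

import nltk
import spacy.cli

nltk.download("punkt")    # NLTK tokenizer data
spacy.cli.download("en")  # spaCy English model (assumed name)
spacy.cli.download("fr")  # spaCy French model (assumed name)
spacy.cli.download("es")  # spaCy Spanish model (assumed name)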

Dataset Reading

Running initial tests

Go to the root directory of this repository and run pytest to verify the provided dataset readers are working.

Creating a corpus

generate_parallel_corpus.py is a provided tool for creating a combined parallel corpus for any language pair. It is recommended to refer to languages by their ISO codes when using this script and the framework in general.

This script can be used with any pair of monolingual transcriptions. It assumes only that the two files have the same number of lines and that each line in one file is a translation of the line at the same position in the other.

Arguments to this script are as follows (an example invocation is shown after the list):

  • src language: The ISO code of the source language (the language to translate from)
  • dst language: The ISO code of the destination language (the language to translate to)
  • src path: The path to the source-language utterances
  • dst path: The path to the destination-language utterances
  • save dir: The directory in which to save the new corpus
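
For example, a hypothetical invocation for English and French Europarl transcriptions, assuming the arguments are positional and given in the order listed above:

python generate_parallel_corpus.py en fr /path/to/europarl.en /path/to/europarl.fr data/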

Output: A JSON-lines (.jsonl) file containing a single JSON object per line, of the form

{
    "id": <line number of the utterance pair>,
    <src language>: <the src-language utterance>,
    <dst language>: <the dst-language utterance>
}
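
Because the output is one JSON object per line, it can be read with nothing more than the standard library. A minimal sketch, assuming an English-French corpus at a hypothetical path:

import json

# Each line holds one utterance pair, keyed by the ISO codes given above.
with open("data/europarl_en_fr.jsonl") as corpus:  # hypothetical path
    for line in corpus:
        pair = json.loads(line)
        print(pair["id"], pair["en"], pair["fr"])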

An example dataset reader for the Europarl French-English corpus is provided in europarl_french_english.py. smoke_europarl_en_fr.jsonl is a subset of the full English-French parallel corpus, produced by passing the Europarl transcriptions to generate_parallel_corpus.py.

Experiments

Example parallel corpora and configurations are provided in experiments/ and tests/fixtures.

Experiments are run with

allennlp train <path to the current experiment's JSON configuration> \
-s <directory for serialization>  \
--include-package library
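
Training produces a model archive (model.tar.gz) in the serialization directory, which can then be scored on held-out data with AllenNLP's evaluate command. The exact arguments vary across AllenNLP versions, so treat this as a sketch and consult allennlp evaluate --help:

allennlp evaluate <directory for serialization>/model.tar.gz <path to evaluation data> \
--include-package library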

A recommended workflow for extending beyond the provided models and supported language pairs is provided in conventions.md.

Built With

  • AllenNLP - The NLP framework used, built by AI2
  • PyTorch - The deep learning library used

Authors

  • Tam Dang

License

This project is licensed under the Apache License - see the LICENSE.md file for details.
