Machine-Translation-with-CRFs

Project 2 of NLP2, in which we implement a latent-variable conditional random field (LV-CRF) for the task of translating a source sentence x into a target sentence y. Latent inversion transduction grammar (ITG) trees mapping between x and y serve as the latent variables. The trees are stored compactly as hypergraph forests, where each hyperedge is featurized into a vector phi and carries a local potential function. A weight vector w is fit to the observed translation pairs (x,y) by stochastic optimization. For more details read the project description or the paper that partly inspired it.
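For reference, a sketch of the LV-CRF formulation this description assumes (our notation; the exact feature decomposition over hyperedges is defined in the code):

    p(y, d | x; w) = exp(w · phi(x, y, d)) / Z(x; w)
    p(y | x; w)    = Σ_d  p(y, d | x; w)
    Z(x; w)        = Σ_{y', d'}  exp(w · phi(x, y', d'))

Here d ranges over the ITG derivations stored in the hypergraph forest for x, and phi(x, y, d) is the sum of the feature vectors of the hyperedges in d, so the potential of a derivation factorizes over its local hyperedge potentials.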

See the final report for our findings.

Report

The following papers are useful reference material for the CRF model; for example, we can take some of their plots and figures as inspiration.

How-to

  • Use save-parses.py to save the parse forests for a number of sentence pairs from a corpus. In translations you can set k and null to control how many translations (k) and insertions (null) to make. Set the size of the corpus in read_data and the maximal sentence length just below it.

  • Use train.py to load these parses and train on them. For SGD we scale the learning rate down each time we make a weight-vector update (i.e. each minibatch); see section 5.2 of this paper on SGD tricks. This introduces a new hyperparameter lmbda which controls the rate of decay. We now start with a high initial learning rate of around 1 to 10 and let the schedule scale it down during training (see the sketch after this list).

  • Use predict.py to load a trained weight vector w and some parses in the right format, and predict the best translations (Viterbi and sampled). Write these to a prediction .txt file in the folder predict. These can be used to compute BLEU scores with respect to a reference with this command.
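A minimal sketch of the learning-rate scaling mentioned in the train.py bullet, assuming the schedule from section 5.2 of Bottou's SGD-tricks paper; the names delta_0 and lmbda match the hyperparameters used elsewhere in this README, but the exact formula in train.py may differ:

    # Learning-rate decay for SGD (sketch).
    # delta_0 is the (high) initial learning rate, lmbda the decay hyperparameter,
    # t the number of weight-vector updates (minibatches) done so far.
    def learning_rate(delta_0, lmbda, t):
        return delta_0 / (1.0 + delta_0 * lmbda * t)

    # With delta_0=10 and lmbda=0.01 the rate decays from 10.0 (t=0)
    # to about 0.91 (t=100) and 0.099 (t=1000).
    rates = [learning_rate(10.0, 0.01, t) for t in (0, 100, 1000)]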

The parses

Let's train with two types of parses: short sentences (length under 10) with only 2 translations per word (plus -EPS-, so 3), and short sentences (length under 10) with only 4 translations per word (plus -EPS-, so 5). With the new parallel parser we can now easily do max_sents=40000. See the settings below and the link to the Dropbox where they are located.

  • First parse type (k=3, null=3):

        ch_en, en_ch, _, _ = translations(path='data/lexicon', k=3, null=3, remove_punct=True)
        corpus = read_data(max_sents=40000)
        corpus = [(ch, en) for ch, en in corpus if len(en.split()) < 10]

    Link to training parses. Link to dev parses.

  • Second parse type (k=5, null=5):

        ch_en, en_ch, _, _ = translations(path='data/lexicon', k=5, null=5, remove_punct=True)
        corpus = read_data(max_sents=40000)
        corpus = [(ch, en) for ch, en in corpus if len(en.split()) < 10]

    Link to training parses. Link to dev parses.

Note: When you select the sentences of a certain length you get fewer than 40k! The first setting (length under 10) gives 28372 parses. To make sure the final training runs for the two parse types can be compared fairly, let's only use parses 0-28k.

Note: Tim has made a parallel version of save-parses! You can now use the branch parallel to check it out for yourself. If you have 4 cores you can simply run python save-parses.py --num-cores 8 and see the magic of parallel computing unfold in front of your eyes. Warning: expect massive speedup (4x or more) and some beautiful wind-tunnel effects from your desktop/laptop.
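For the curious, a rough sketch of how parse-saving can be parallelised over sentence pairs (a sketch only: parse_forest is a hypothetical stand-in for the project's parser, and the actual parallel branch may be organised differently):

    import pickle
    from multiprocessing import Pool

    def parse_forest(pair):
        # Hypothetical placeholder for the project's parser: build the ITG
        # hypergraph forest for one (Chinese, English) sentence pair.
        ch, en = pair
        return (ch, en)

    def save_parses_parallel(corpus, num_cores=4, outfile='parses.pkl'):
        # Distribute the sentence pairs over num_cores worker processes
        # and pickle the resulting forests to disk.
        with Pool(processes=num_cores) as pool:
            forests = pool.map(parse_forest, corpus)
        with open(outfile, 'wb') as f:
            pickle.dump(forests, f)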

Training schedule

  • DONE Train on eps-40k-ml10-3trans for one iteration with these settings. (Took 11 hours.)

  • DONE Train on eps-40k-ml10-5trans for one iteration with these settings. (Took 13 hours.)

Trained weights

  • One iteration over the whole eps-40k-ml10-3trans: weights. See training settings here.

  • One iteration over the whole eps-40k-ml10-5trans: weights. See training settings here.

Training-set translations

We have some wonderful training-set translations! See also the reference translations of the training set.

Viterbi translations

  • Translations for eps-40k-ml10-3trans and their probabilities. Results: BLEU = 4.04, 45.7/7.2/2.0/0.5 (200 sentences).

  • Translations for eps-40k-ml10-5trans. Results: BLEU = 0.00, 32.3/2.6/0.1/0.0 (200 sentences).

Sampled translations

Dev-set translations

We have obtained the following translations with the above trained weights. See also the reference translations of the dev-set.

Viterbi translations

  • Translations for eps-40k-ml10-3trans. Results: BLEU = 0.00, 75.6/12.5/1.5/0.2 (200 sentences). BLEU = 0.00, 74.7/11.5/1.6/0.2 (500 sentences).

  • Translations for eps-40k-ml10-5trans. Results: BLEU = 0.00, 65.4/6.4/0.4/0.0 (200 sentences). BLEU = 0.00, 65.4/6.5/0.3/0.0 (500 sentences).

Sampled translations

IBM1 translations

As an interesting baseline we use the IBM1 word translations to generate sentence translations by monotonically translating the Chinese sentences word by word, using this code.
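In essence the baseline does something like the following (a sketch assuming ch_en maps each Chinese word to a dictionary of candidate English translations with their probabilities, as the lexicon from translations suggests; the actual code is the one linked above):

    # Monotone word-by-word IBM1 baseline: translate each Chinese word with its
    # most probable English translation and keep the source word order.
    def ibm1_translate(chinese_sentence, ch_en):
        english = []
        for ch_word in chinese_sentence.split():
            candidates = ch_en.get(ch_word, {})   # {english_word: probability}
            if candidates:
                best = max(candidates, key=candidates.get)
                if best != '-EPS-':               # drop epsilon/null translations
                    english.append(best)
            # words missing from the lexicon are simply skipped in this sketch
        return ' '.join(english)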

This achieves the following results:

  • Translations of the training-set. Results: BLEU = 7.22, 60.6/12.8/3.4/1.3 (200 sentences).

  • Translations of the dev-set. Results: BLEU = 0.00, 83.8/18.4/3.4/0.4 (200 sentences).

Individual comparisons

Here is a small selection of individual comparisons of translations.

Some (old) results

See these translations for our best result so far! This was achieved by training for 1 iteration over 1300 sentences of maximal length 9, parsed with eps=True and at most 3 epsilon insertions, with minibatch size 1, delta_0=10, lmbda=0.01, scale_weight=2 and regularizer=False. See the correct translations for reference. (Also note that later iterations get worse, which you can see here.) Lastly: we achieve a BLEU score of 3.44 on these translations (hurray!): BLEU = 3.44, 49.8/6.2/1.1/0.5 (BP=0.967, ratio=0.968, hyp_len=1222, ref_len=1263).

Some issues/questions

  • The problem with derivations for which p(y,d|x) = nan is this: the weight vector w. It still occurs, even with the hack described above, though only with long sentences. I think this is because the derivation of a long sentence has many edges, and then sum([estimated_weights[edge] for edge in derivation]), which we use in join_prob to compute p(y,d|x), gets upset (the sum is so large that exponentiating it overflows). NOTE: This is not really an issue: we still get Viterbi estimates! We just cannot compute the correct probability.
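The usual remedy, sketched below (not necessarily what the current code does), is to stay in log space and only exponentiate the difference with the log normalizer, which is never positive and therefore cannot overflow. estimated_weights and log_Z are assumptions for illustration: the edge log-potentials w·phi and the log inside value of the forest's root node.

    import math

    def joint_prob(derivation, estimated_weights, log_Z):
        # Sum the edge log-potentials of the derivation (this is the quantity
        # that blows up when exponentiated directly for long sentences).
        log_numerator = sum(estimated_weights[edge] for edge in derivation)
        # log_numerator - log_Z <= 0, so exp() cannot overflow; at worst it
        # underflows to 0.0 for extremely improbable derivations.
        return math.exp(log_numerator - log_Z)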
