DLM

Dictionary Language Modeling for cross-lingual pretraining on Neural Machine Translation

environment

Python 3.7
tensorflow>=2.0.0
tensorflow_datasets==2.1.0
nltk==3.4.5
mecab-python3==0.996.2
jieba==0.42.1
pkuseg==0.0.22
gensim==3.8.1
chardet==3.0.4
six==1.12.0
matplotlib==3.1.1
torch==1.4.0
torchvision==0.5.0
tinysegmenter==0.4

dataset

zh-en:

jr-en

ASPEC

de-en:

fr-en:

note

Before using the project, remember to modify the "__data_dir" variable
at the top of each of the following files so that it points to your local data directory:
"preprocess/corpus/wmt-news.py", "preprocess/corpus/europarl.py", "preprocess/corpus/KFTT.py", "preprocess/corpus/um_corpus.py"

for training

Tutorial for training the models

load data

At the top of train.py there is an import of the form "from load.xxx import Loader".

Change xxx to zh_en, or to the module for whichever other language pair you want to train on.
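If editing the import by hand becomes tedious, the same choice could be made at run time. The sketch below is an alternative, not what train.py does: the helper name `get_loader_class` is hypothetical, and it assumes that each module under load/ exposes a class named `Loader`, as the import line suggests.

```python
import importlib


def get_loader_class(lang_pair, package="load"):
    """Import <package>.<lang_pair> (e.g. load.zh_en) and return its Loader class."""
    module = importlib.import_module(f"{package}.{lang_pair}")
    return module.Loader
```

With this helper, `Loader = get_loader_class("zh_en")` would play the role of the hand-edited import, provided it is run from the project root so that the `load` package is importable.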

if you want to use the PyTorch transformer instead of the TensorFlow one (optional)

At the top of train.py there is an import of the form "from models.transformer_for_nmt import Model".

Change it to "from models.transformer_for_nmt_torch import Model".

train.py

At the bottom of train.py, make sure the code reads:
    o_train = Train(use_cache=True)
    o_train.train()
    o_train.test()

The "use_cache" parameter indicates whether to load the preprocessed data from the cache when a cache exists.
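The behaviour described for "use_cache" follows a common load-or-compute pattern, sketched below. This is not the project's implementation: the function name `load_or_preprocess`, the use of pickle and the cache path are all assumptions chosen for illustration.

```python
import os
import pickle


def load_or_preprocess(cache_path, preprocess_fn, use_cache=True):
    """Return preprocessed data, reading it from cache_path when allowed."""
    if use_cache and os.path.exists(cache_path):
        # A cache exists and may be used: skip the expensive preprocessing.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = preprocess_fn()
    # Store the fresh result so that later runs can reuse it.
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

Passing `use_cache=False` forces the preprocessing to run again, which is what you would want after changing the dataset or the preprocessing pipeline.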

for adjusting the parameters, loss, optimizer and so on

These settings are in models/transformer_for_nmt.py.

You can change:

+ data_params

+ model_params

+ train_params

+ compile_params

+ monitor_params

if you want to choose the dataset

The dataset is selected at the top of the "__init__" function of "Loader" in load/zh_en.py.

if you want to change the preprocessing pipeline

The pipelines are defined in models/transformer_for_nmt.py.

You can change them at the top of the "Model" class.

if you want to load a trained model

This is controlled by the "checkpoint_params" in models/transformer_for_nmt.py.

Specify the "name, time" of the model_dir there, and the best model is loaded automatically.
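To make the idea concrete, here is a minimal sketch of how a name and a time could map to a model directory. The `runtime/models/<name>/<time>` layout, the helper name `model_dir` and the example values are all assumptions for illustration; check "checkpoint_params" and the contents of "runtime/models" for the format the project actually uses.

```python
import os


def model_dir(name, time, root=os.path.join("runtime", "models")):
    """Build the (assumed) directory of a trained model from its name and time."""
    return os.path.join(root, name, time)


# Hypothetical values; real names and timestamps come from your own training runs.
print(model_dir("transformer_for_nmt", "2020_01_01_00_00_00"))
```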

tensorboard

The TensorBoard files are saved in the "runtime/tensorboard" directory.

This directory is created automatically when train.py is run.
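To inspect these files, TensorBoard can be pointed at that directory with its `--logdir` option. The sketch below does this from Python; the helper names are hypothetical, and it assumes that the `tensorboard` executable is on your PATH and that you run it from the project root.

```python
import subprocess


def tensorboard_command(logdir="runtime/tensorboard"):
    """Return the command line that starts TensorBoard on the given log directory."""
    return ["tensorboard", "--logdir", logdir]


def launch_tensorboard(logdir="runtime/tensorboard"):
    """Start TensorBoard and block until it is stopped (for example with Ctrl+C)."""
    return subprocess.call(tensorboard_command(logdir))
```

Calling `launch_tensorboard()` after a training run serves the dashboard until it is interrupted.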

model files

All model files are saved in "runtime/models".

This directory is created automatically when train.py is run.

log

All results of running train.py are logged to "runtime/log", including the data params, model params, train params and the results.

However, if train.py exits before it finishes, no log is written.

for testing

Tutorial for testing the models

train.py

At the bottom of train.py, make sure the code reads:
    o_train = Train(use_cache=True)
    # o_train.train()
    o_train.test(True)

The "use_cache" parameter indicates whether to load the preprocessed data from the cache when a cache exists.

choose which model to load

This is controlled by the "checkpoint_params" in models/transformer_for_nmt.py.

Specify the "name, time" of the model_dir there, and the best model is loaded automatically.
