This is the code repo for our EACL 2021 paper "Combining Deep Generative Models and Cross-lingual Pretraining for Semi-supervised Document Classification".
The code structure is as follows:
```
.
├── data_model
├── nn_model
├── utils
├── train_bert
├── args.py
├── params.py
├── main.py
├── train_xling.py
├── train_semicldc.py
├── train_xlsemicldc.py
├── ...
├── requirements.txt
└── README.md
```
`args.py` and `params.py` contain all the arguments (data paths, hyperparameters, etc.) for all experiments.
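As a rough illustration of what these files hold, the standard argparse pattern is sketched below; the flag names (`--data_path`, `--lr`, `--batch_size`) are hypothetical and not necessarily the ones defined in `args.py`:

```python
# Minimal sketch of the argparse pattern behind args.py / params.py.
# The flag names below are illustrative, not the repo's actual options.
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="Semi-supervised CLDC experiments")
    parser.add_argument("--data_path", type=str, default="data/", help="dataset root")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--batch_size", type=int, default=32, help="mini-batch size")
    return parser.parse_args()
```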
`main.py` is the main entry point for all programs.
`train_xling.py` is used to train our non-parallel cross-lingual VAE (NXVAE).
`train_semicldc.py` is used to perform both supervised and semi-supervised mono-lingual document classification.
`train_xlsemicldc.py` is used to perform both supervised and semi-supervised zero-shot cross-lingual document classification.
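Putting these together, a minimal sketch of how `main.py` might dispatch to the three training scripts is shown below; the `--task` values, the `run()` entry points, and the `get_args` helper (from the sketch above) are assumptions for illustration, not the repo's actual API:

```python
# Illustrative dispatch from main.py to the training scripts.
# Task names and run() entry points are hypothetical.
from args import get_args  # hypothetical accessor; see args.py for real names

def main():
    args = get_args()
    if args.task == "xling":          # train the NXVAE (train_xling.py)
        import train_xling; train_xling.run(args)
    elif args.task == "semicldc":     # mono-lingual (semi-)supervised CLDC
        import train_semicldc; train_semicldc.run(args)
    elif args.task == "xlsemicldc":   # zero-shot cross-lingual CLDC
        import train_xlsemicldc; train_xlsemicldc.run(args)

if __name__ == "__main__":
    main()
```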
We use PyTorch 1.1.0, and all library dependencies exported from Conda (there could be MANY useless/redundant dependencies) can be found in `requirements.txt`.
The purpose of each folder is:
`data_model` contains the data-reading modules, e.g., building the vocabulary and converting text to vocabulary ids. NOTE that MLDoc (the dataset we use in our experiments) is not publicly available.
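As an illustration of the kind of logic that lives in `data_model`, here is a minimal, self-contained vocabulary sketch; the class and method names (`Vocab`, `encode`) are hypothetical, not the repo's actual interface:

```python
# Illustrative vocabulary builder: maps tokens to integer ids.
# All names here are hypothetical, for illustration only.
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

class Vocab:
    def __init__(self, texts, max_size=50000):
        counts = Counter(tok for text in texts for tok in text.split())
        self.itos = [PAD, UNK] + [w for w, _ in counts.most_common(max_size)]
        self.stoi = {w: i for i, w in enumerate(self.itos)}

    def encode(self, text):
        # convert text to a list of vocabulary ids, falling back to <unk>
        return [self.stoi.get(tok, self.stoi[UNK]) for tok in text.split()]
```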
`nn_model` contains all the neural model implementations, among which:
- `xlingva.py` is the NXVAE model;
- `semicldc_model.py` is the mono-lingual M1+M2 model;
- `xlsemicldc_model.py` is the zero-shot M1+M2 model;
- `aux_semicldc_model.py` is the mono-lingual AUX model;
- `aux_xlsemicldc_model.py` is the zero-shot AUX model.
`utils` contains scripts for IO, preprocessing, and computing some common distributions.
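One such common computation in VAE codebases is the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior; the helper below is an illustrative sketch, not necessarily how `utils` names or structures it:

```python
# KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form, a typical helper
# in VAE training code; the function name and signature are illustrative.
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # summed over the latent dimension, averaged over the batch
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()
```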
`train_bert` contains all the code for pretraining BERT and performing document classification with BERT.
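For orientation only, the snippet below shows one common way to fine-tune BERT for MLDoc-style 4-class document classification using the HuggingFace `transformers` API; it is not the repo's actual `train_bert` code, which may use a different library and training setup:

```python
# Illustrative BERT document classifier, NOT the repo's train_bert code.
# Assumes a recent transformers version; MLDoc has 4 classes.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=4)

inputs = tokenizer("Example document text.", return_tensors="pt",
                   truncation=True, max_length=128)
labels = torch.tensor([0])
outputs = model(**inputs, labels=labels)  # cross-entropy over the 4 classes
outputs.loss.backward()
```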