In this project we develop new deep learning models for bootstrapping language understanding models for languages with no labeled data using labeled data from other languages.

microsoft/Multilingual-Model-Transfer

Zero-Resource Multilingual Model Transfer

This repo contains the source code for our ACL 2019 paper:

Multi-Source Cross-Lingual Model Transfer: Learning What to Share
Xilun Chen, Ahmed Hassan Awadallah, Hany Hassan, Wei Wang, Claire Cardie
The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
paper, arXiv, bibtex

Introduction

Modern NLP applications have benefited greatly from neural network models. Such deep neural models, however, are not applicable to most human languages because annotated training data for various NLP tasks is lacking. Cross-lingual transfer learning (CLTL) is a viable method for building NLP models for a low-resource target language by leveraging labeled data from other (source) languages. In this work, we focus on the multilingual transfer setting, where training data in multiple source languages is leveraged to further boost target language performance.

Unlike most existing methods that rely only on language-invariant features for CLTL, our approach coherently utilizes both language-invariant and language-specific features at instance level. Our model leverages adversarial networks to learn language-invariant features, and mixture-of-experts models to dynamically exploit the similarity between the target language and each individual source language. This enables our model to learn effectively what to share between various languages in the multilingual setup. Moreover, when coupled with unsupervised multilingual embeddings, our model can operate in a zero-resource setting where neither target language training data nor cross-lingual resources (e.g. parallel corpora or Machine Translation systems) are available. Our model achieves significant performance gains over prior art, as shown in an extensive set of experiments over multiple text classification and sequence tagging tasks including a large-scale industry dataset.
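To make the mixture-of-experts idea above concrete, here is a minimal, dependency-free sketch of how per-language expert outputs can be combined with softmax gate weights for a single instance. This is an illustration only, not the code in `models.py`: the names `softmax` and `mixture_of_experts` are hypothetical, the experts are plain Python callables, and the gate scores are passed in by hand, whereas in the model they would come from a learned gate network.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Turn raw gate scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(
    x: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    gate_scores: Sequence[float],
) -> Vector:
    """Combine expert outputs for one instance, weighted by the gate.

    `experts` holds one function per source language (hypothetical
    stand-ins for learned networks); `gate_scores` holds one raw score
    per expert, assumed here to be given rather than learned.
    """
    weights = softmax(gate_scores)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two toy "experts": one doubles the features, one zeroes them out.
    toy_experts = [lambda v: [2.0 * a for a in v], lambda v: [0.0 for _ in v]]
    # Equal gate scores give each expert the same weight.
    print(mixture_of_experts([1.0, 2.0], toy_experts, [0.0, 0.0]))
```

Because the weights are recomputed for every input, an instance that resembles one source language can lean more heavily on that language's expert, which is the instance-level sharing described above.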

Requirements

  • Python 3.6
  • PyTorch 0.4
  • PyTorchNet (for confusion matrix)
  • tqdm (for progress bar)
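Before launching a long training run, it can help to confirm that the packages above are importable. The sketch below is not part of the repository; the helper names are hypothetical, and the import names `torch`, `torchnet`, and `tqdm` are assumed to correspond to PyTorch, PyTorchNet, and tqdm. It checks only that the interpreter is at least 3.6 and does not verify the PyTorch version. Whether versions newer than those listed work is not stated here.

```python
import importlib.util
import sys
from typing import Iterable, List, Sequence

# Assumed import names for PyTorch, PyTorchNet, and tqdm.
REQUIRED_MODULES = ["torch", "torchnet", "tqdm"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_is_supported(
    version: Sequence[int] = sys.version_info,
    minimum: Sequence[int] = (3, 6),
) -> bool:
    """Check that the (major, minor) version is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.6 or later is expected.")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```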

File Structure

.
├── LICENSE
├── README.md
├── conlleval.pl                            (official CoNLL evaluation script)
├── data_prep                               (data processing scripts)
│   ├── bio_dataset.py                      (processing the CoNLL dataset)
│   └── multi_lingual_amazon.py             (processing the Amazon Review dataset)
├── data_processing_scripts                 (auxiliary scripts for dataset pre-processing)
│   └── amazon
│       ├── pickle_dataset.py
│       └── process_dataset.py
├── layers.py                               (lower-level helper modules)
├── models.py                               (higher-level modules)
├── options.py                              (hyper-parameters, a.k.a. all the knobs you may want to turn)
├── scripts                                 (scripts for training and evaluating the models)
│   ├── get_overall_perf_amazon.py          (evaluation script for Amazon Reviews)
│   ├── get_overall_perf_ner.py             (evaluation script for CoNLL NER)
│   ├── train_amazon_3to1.sh                (training script for Amazon Reviews)
│   └── train_conll_ner_3to1.sh             (training script for CoNLL NER)
├── train_cls_man_moe.py                    (training code for text classification)
├── train_tagging_man_moe.py                (training code for sequence tagging)
├── utils.py                                (helper functions)
└── vocab.py                                (building the vocabulary)

Dataset

The CoNLL 2002, 2003 and Amazon datasets, as well as the multilingual word embeddings (MUSE, VecMap, UMWE) are all publicly available online.

Run Experiments

CoNLL Named Entity Recognition

./scripts/train_conll_ner_3to1.sh {exp_name}

The following script prints the compiled dev/test F1 scores for all languages:

python scripts/get_overall_perf_ner.py save {exp_name}

Multilingual Amazon Reviews

./scripts/train_amazon_3to1.sh {exp_name}

The following script prints the compiled dev/test F1 scores for all languages:

python scripts/get_overall_perf_amazon.py save {exp_name}

Citation

If you find this project useful for your research, please cite our ACL 2019 paper:

@InProceedings{chen-etal-acl2019-multi-source,
    author = {Chen, Xilun and Hassan Awadallah, Ahmed and Hassan, Hany and Wang, Wei and Cardie, Claire},
    title = {Multi-Source Cross-Lingual Model Transfer: Learning What to Share},
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
