Note: This repository was archived by the owner on Jan 12, 2022 and is now read-only.

Parsimonious Parser Transfer

This repository contains the code for Parsimonious Parser Transfer (PPT), published at EACL 2021, by Kemal Kurniawan, Lea Frermann, Philip Schulz, and Trevor Cohn.

Citation

@inproceedings{kurniawan2021,
  title = {PPT: Parsimonious Parser Transfer for Unsupervised Cross-Lingual Adaptation},
  shorttitle = {PPT},
  booktitle = {Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume},
  author = {Kurniawan, Kemal and Frermann, Lea and Schulz, Philip and Cohn, Trevor},
  year = {2021},
  month = apr,
  pages = {2907--2918},
  url = {https://www.aclweb.org/anthology/2021.eacl-main.254}
}

Fetching submodules

After cloning this repository, you also need to fetch the submodules:

git submodule init
git submodule update
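
Alternatively, if you're cloning afresh, you can fetch the submodules in one step with git's --recurse-submodules flag:

git clone --recurse-submodules https://github.com/kmkurn/ppt-eacl2021.git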

Installing requirements

We recommend using the conda package manager. Create a virtual environment with all the required dependencies:

conda env create -n [env] -f environment.yml

Replace [env] with your desired environment name. Once created, activate the environment. The command above installs the CPU version of PyTorch; if you need the GPU version, follow the PyTorch installation docs afterwards. If you're using another package manager (e.g., pip), check the environment.yml file to see what the requirements are.
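
For example, to create and activate an environment named ppt (the name is arbitrary):

conda env create -n ppt -f environment.yml
conda activate ppt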

Preparing dataset

Download UD treebanks v2.2 from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2837
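
The commands below assume the treebanks are extracted in the repository root. For example, assuming the downloaded archive is named ud-treebanks-v2.2.tgz (the exact name may differ):

tar -xzf ud-treebanks-v2.2.tgz

This produces the ud-treebanks-v2.2 directory referenced in the commands below.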

Preparing word embeddings

Next, download fastText's pre-trained Wiki word embeddings in text format (.vec) from the fastText website (https://fasttext.cc/docs/en/pretrained-vectors.html). Suppose you put the word embedding files in a directory named fasttext. Then perform the word embedding alignment to get the multilingual embeddings:

./align_embedding.py with heetal
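
For concreteness, assuming you work with English and Indonesian (fastText names its Wiki vectors wiki.<lang>.vec), the layout before running the command above would be:

fasttext/
  wiki.en.vec
  wiki.id.vec

The aligned embeddings are then written to the aligned_fasttext directory, which the commands below read from.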

Lastly, minimise the word embedding files so they contain only words that actually occur in the UD data. Assuming the UD data is stored in ud-treebanks-v2.2, run

./minimize_vectors_file.py with vectors_path=aligned_fasttext/wiki.multi.id.vec output_path=aligned_fasttext/wiki.multi.min.id.vec corpus.lang=id

The command above minimises the word vector file for Indonesian (id). You can set corpus.lang to other language codes mentioned in the paper, e.g., ar for Arabic, es for Spanish, etc.
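
If you're preparing several target languages, a small shell loop over the language codes (the codes here are just the examples above) does the same job:

for lang in ar es id; do
    ./minimize_vectors_file.py with vectors_path=aligned_fasttext/wiki.multi.$lang.vec \
        output_path=aligned_fasttext/wiki.multi.min.$lang.vec corpus.lang=$lang
done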

Training the source (English) parser

Assuming you have minimised the English word vectors file to wiki.en.min.vec, run

./run_parser.py with ahmadetal word_emb_path=wiki.en.min.vec

The trained parser will be stored in the artifacts directory.

Performing direct transfer

Assuming the source parser parameters are saved in artifacts/100_model.pth, run

./run_parser.py evaluate with ahmadetal heetal_eval_setup load_params=100_model.pth word_emb_path=aligned_fasttext/wiki.multi.min.id.vec corpus.lang=id
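
The same command works for the other target languages from the paper; for example, for Arabic (assuming you've prepared the Arabic embeddings as above):

./run_parser.py evaluate with ahmadetal heetal_eval_setup load_params=100_model.pth word_emb_path=aligned_fasttext/wiki.multi.min.ar.vec corpus.lang=ar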

Running the self-training baseline

./run_st.py with ahmadetal heetal_eval_setup distant word_emb_path=aligned_fasttext/wiki.multi.min.id.vec load_params=100_model.pth corpus.lang=id

Change distant to nearby to use hyperparameters optimised for nearby languages.
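
For example, for a language treated as nearby (Spanish here, purely for illustration; use the embeddings you prepared for that language):

./run_st.py with ahmadetal heetal_eval_setup nearby word_emb_path=aligned_fasttext/wiki.multi.min.es.vec load_params=100_model.pth corpus.lang=es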

Running PPT

./run_ppt.py with ahmadetal heetal_eval_setup distant word_emb_path=aligned_fasttext/wiki.multi.min.id.vec load_params=100_model.pth corpus.lang=id

As before, you can change distant to nearby for nearby languages.

Running PPTX

Suppose you've trained the source parsers using run_parser.py and the trained models are saved in artifacts and de_artifacts for English and German respectively (two languages in this example, but you can use as many as you want), with model parameters artifacts/100_model.pth and de_artifacts/150_model.pth. Then you can run PPTX with:

./run_pptx.py with ahmadetal heetal_eval_setup distant load_src="{'en':('artifacts','100_model.pth'),'de':('de_artifacts','150_model.pth')}" \
     main_src=en word_emb_path=aligned_fasttext/wiki.multi.min.id.vec corpus.lang=id

Change distant to nearby for nearby languages.
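
To add more source languages, extend the load_src dictionary. For example, with a hypothetical third source parser for French saved in fr_artifacts/120_model.pth:

./run_pptx.py with ahmadetal heetal_eval_setup distant load_src="{'en':('artifacts','100_model.pth'),'de':('de_artifacts','150_model.pth'),'fr':('fr_artifacts','120_model.pth')}" \
     main_src=en word_emb_path=aligned_fasttext/wiki.multi.min.id.vec corpus.lang=id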

(Optional) Sacred: an experiment manager

Almost all scripts in this repository use Sacred. The scripts are written so that you can store everything about an experiment run in a MongoDB database: simply set the environment variables SACRED_MONGO_URL (pointing to a MongoDB instance) and SACRED_DB_NAME (a database name) to activate this. Also, invoke the help command of any such script to print its usage, e.g., ./run_parser.py help.
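
For example, to log runs to a local MongoDB instance under a database named ppt (both values are illustrative):

export SACRED_MONGO_URL=mongodb://localhost:27017
export SACRED_DB_NAME=ppt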
