UDify

MIT License

UDify is a single model that parses Universal Dependencies (UPOS, UFeats, Lemmas, Deps) jointly, accepting any of 75 supported languages as input (UD v2.3). This repository accompanies the paper "75 Languages, 1 Model: Parsing Universal Dependencies Universally" and provides tools to train a multilingual model capable of parsing any Universal Dependencies treebank with high accuracy. The project also supports training and evaluating models for the SIGMORPHON 2019 Shared Task #2.

The project is built using AllenNLP and PyTorch.

Note that this repository is still a work in progress (sorry, I'm traveling a lot ✈️).

Getting Started

Install the Python packages in requirements.txt. UDify depends on AllenNLP and PyTorch. On Windows, use WSL (Windows Subsystem for Linux). Optionally, install TensorFlow to enable TensorBoard, which provides a rich visualization of model performance on each UD task.

pip install -r ./requirements.txt
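For the optional TensorBoard visualization, installing TensorFlow should be sufficient, since TensorBoard ships with it (a minimal sketch; any recent release should work):

pip install tensorflow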

Download the UD corpus by running the script

bash ./scripts/download_ud_data.sh

Alternatively, download the data from universaldependencies.org and extract it into data/ud-treebanks-v2.3/, then run scripts/concat_ud_data.sh to generate the multilingual UD dataset.
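As a sketch of that manual route (assuming the downloaded archive is named ud-treebanks-v2.3.tgz; adjust to the actual filename):

mkdir -p data
tar -xzf ud-treebanks-v2.3.tgz -C data/
bash ./scripts/concat_ud_data.sh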

Training the Model

To train the multilingual model (fine-tune BERT on UD), run the command

python train.py --config config/ud/multilingual/udify_bert_finetune_multilingual.json --name multilingual

which will load the dataset and model and then begin training the network. The model metrics, vocab, and weights will be saved under logs/multilingual. Note that this process is highly memory-intensive, requiring 16+ GB of RAM and 12+ GB of GPU memory. Depending on your GPU, training may take 20 or more days to complete all 80 epochs.

Viewing Model Performance

One can view how well the models are performing by running TensorBoard

tensorboard --logdir logs

This should show the currently training model as well as any previously trained models. Each model is stored in a folder named after the --name parameter combined with a date stamp, e.g., logs/multilingual/2019.07.03_11.08.51.
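To inspect a single run instead of all of them, point --logdir at that run's folder, e.g.,

tensorboard --logdir logs/multilingual/2019.07.03_11.08.51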

Predicting Universal Dependencies from a Trained Model

To predict UD annotations, one can supply the path to the trained model and an input CoNLL-U-formatted file:

python predict.py <archive_dir> <input.conllu> [--pred_file <output.conllu>]

For instance, to predict the dev set of English EWT with the trained model saved under logs/multilingual/2019.07.03_11.08.51, run

python predict.py logs/multilingual/2019.07.03_11.08.51 data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu --pred_file en-ewt_dev.conllu

This will save the output predictions to logs/multilingual/2019.07.03_11.08.51/en-ewt_dev.conllu.
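For a quick sanity check of the predictions, one can print the predicted word/UPOS pairs (this assumes the standard 10-column CoNLL-U layout, with FORM in column 2 and UPOS in column 4):

awk -F'\t' 'NF==10 {print $2, $4}' logs/multilingual/2019.07.03_11.08.51/en-ewt_dev.conllu | head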

Configuration Options

One can specify the type of device to run on. For a single GPU, use the flag --device 0; for CPU, use --device -1.
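For example, to train on the first GPU (appending the flag to the training command from above is an assumption based on the flags described here):

python train.py --config config/ud/multilingual/udify_bert_finetune_multilingual.json --name multilingual --device 0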

Pretrained Models [Coming Soon]

The pretrained multilingual UDify model can drastically speed up fine-tuning on any UD treebank. A download link for the pretrained model is coming soon (a proper hosting site still needs to be found).

SIGMORPHON 2019 Shared Task [Coming Soon]

A modification to the basic UDify model is available for parsing morphology in the SIGMORPHON 2019 Shared Task #2. The configuration will be made available soon.
