bheinzerling/subword-sequence-tagging

This repository contains code and data for the ACL 2019 paper Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation by Benjamin Heinzerling and Michael Strube.

https://www.aclweb.org/anthology/P19-1027/

POS Tagging Experiments

To run the POS tagging experiments described in the paper, use a command like the following, replacing the language code with the one for your language of interest:

python main.py train \
    --dataset ud_1_2 \
    --lang el \
    --tag upostag \
    --use-char \
    --use-bpe \
    --use-meta-rnn \
    --best-vocab-size \
    --char-emb-dim 50 \
    --char-nhidden 256 \
    --bpe-nhidden 256 \
    --meta-nhidden 256 \
    --dropout 0.2

The arguments --use-char, --use-bpe, and --use-bert specify whether to use (randomly initialized) character embeddings, (pretrained) BPE embeddings, and/or (multilingual) BERT. As described in the paper, we use a "meta" LSTM to combine all input representations, which is enabled with --use-meta-rnn. To automatically select the "best" BPE vocabulary size for the given language, set --best-vocab-size, or select a vocabulary size manually with --vocab-size (the supported vocabulary sizes for each language can be found here).
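
For intuition, here is a minimal sketch (not the actual code in model.py; the class and argument names are made up) of how a "meta" BiLSTM can combine per-token character and BPE representations before the tagging layer:

import torch
import torch.nn as nn

class MetaCombiner(nn.Module):
    """Combine per-token character and BPE representations with a BiLSTM."""

    def __init__(self, char_nhidden=256, bpe_nhidden=256, meta_nhidden=256, dropout=0.2):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # The meta BiLSTM reads the concatenation of the input
        # representations and produces the final token representation.
        self.meta_rnn = nn.LSTM(
            input_size=char_nhidden + bpe_nhidden,
            hidden_size=meta_nhidden,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, char_repr, bpe_repr):
        # char_repr: (batch, seq_len, char_nhidden)
        # bpe_repr:  (batch, seq_len, bpe_nhidden)
        combined = torch.cat([char_repr, bpe_repr], dim=-1)
        out, _ = self.meta_rnn(self.dropout(combined))
        return out  # (batch, seq_len, 2 * meta_nhidden), fed to the tag classifier

With --use-bert, a BERT representation would presumably be concatenated in the same way as a third input.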

Training will take between 10 minutes and several hours, depending on the size of the dataset for the given language. For the above command, you should see training finish due to early stopping, followed by the results on the test set:

2020-11-12 12:38:08| score acc_0.9316/0.9328
2020-11-12 12:38:08| out/81/e365_acc_0.9328_model.pt
2020-11-12 12:38:08| Early stop after 10 steps
...
2020-11-12 12:38:11| loaded model out/81/e365_acc_0.9328_model.pt
2020-11-12 12:38:12| score acc_0.8996/0.8996
2020-11-12 12:38:12| out/81/acc_0.8996_model.pt
2020-11-12 12:38:12| final score: 0.8996

This accuracy is higher than the one reported in the paper (82%) for this language (el: Modern Greek), which might be due to newer versions of pytorch and transformers, but I haven't looked into it.

To train on several languages simultaneously (MultiBPEmb experiments in the paper), run a command like this:

python main.py train \
    --dataset ud_1_2_lowres_multi \
    --lang multi \
    --tag upostag \
    --use-char \
    --use-bpe \
    --use-meta-rnn \
    --char-emb-dim 100 \
    --char-nhidden 512 \
    --bpe-nhidden 1024 \
    --meta-nhidden 1024 \
    --dropout 0.2 \
    --vocab-size 1000000

Comments on the code

This code was written in 2018 and would look much different if it were written today. If you want to develop your own sequence tagging models, I recommend using https://github.com/flairNLP/flair instead.

Having said that, the project structure is:

  • argparser.py: contains all command line arguments
  • data.py: code for loading plain text data and converting it into pytorch tensors
  • main.py: entry point, run either python main.py train or python main.py test
  • model.py: a sequence tagging model
  • trainer.py: contains all the boilerplate for loading data and model training
  • data/: contains datasets in plain text format
  • out/: results and model checkpoints will be written to a subdirectory for each run
  • out/cache: tensorized datasets will be cached here (see the sketch below)
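
As an illustration of the out/cache idea, here is a minimal caching sketch (the helper and the cache file name are hypothetical, not the actual code in data.py or trainer.py):

from pathlib import Path

import joblib
import torch

def load_or_build(cache_file, build_fn):
    # Return the cached object if present; otherwise build it once and cache it.
    cache_file = Path(cache_file)
    if cache_file.exists():
        return joblib.load(cache_file)
    obj = build_fn()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(obj, cache_file)
    return obj

# Hypothetical cache entry for one (dataset, language, tag) configuration.
tensors = load_or_build(
    "out/cache/ud_1_2.el.upostag.joblib",
    lambda: {"token_ids": torch.zeros(2, 5, dtype=torch.long)},
)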

Feel free to send me an email or open an issue if anything is unclear and/or doesn't work.

Requirements

pytorch transformers numpy bpemb joblib conllu boltons

The first three libraries are well known; the others are used for the following (a short usage example appears after this list):

  • bpemb: pretrained subword embeddings
  • joblib: for pickling
  • conllu: for reading files in .conllu format
  • boltons: useful utility functions
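
Here is a short, self-contained usage example for the two less common dependencies (my own example, not code from this repo; it assumes the 25k-vocabulary, 100-dimensional Greek BPEmb model, which is downloaded on first use):

from bpemb import BPEmb
import conllu

# Pretrained Greek subword embeddings: 25k BPE pieces, 100-dimensional vectors.
bpemb_el = BPEmb(lang="el", vs=25000, dim=100)
print(bpemb_el.encode("καλημέρα"))        # subword pieces
print(bpemb_el.embed("καλημέρα").shape)   # one 100-dim vector per piece

# Parse a tiny CoNLL-U snippet and print the word forms of each sentence.
sample = (
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n\n"
)
for sentence in conllu.parse(sample):
    print([token["form"] for token in sentence])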

After cloning this repository, extract the UD 1.2 data:

$ cd data
$ tar xzf ud_1_2.tar.gz
