NNVLP - A Neural Network-Based Vietnamese Language Processing Toolkit

Code by Thai-Hoang Pham at Alt Inc. (Some code is adapted from another repository.)

A demo website is available at nnvlp.org

1. Introduction

NNVLP is a Python implementation of the system described in the paper NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit. The system handles common sequence labeling tasks for Vietnamese, including part-of-speech (POS) tagging, chunking, and named entity recognition (NER). Its architecture combines a bi-directional Long Short-Term Memory network (Bi-LSTM) with a Conditional Random Field (CRF) layer. Each word is represented by the concatenation of a pre-trained word embedding learnt with the skip-gram model and character-level word features learnt by a Convolutional Neural Network (CNN).

Figure 1: The CNN layer for extracting character-level word features of word Học_sinh (Student).

Figure 2: The Bi-LSTM-CRF layers for input sentence Anh rời EU hôm qua (UK left EU yesterday).
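
As a rough illustration of how the word representation described above could be assembled, the following pure-Python sketch convolves over character embeddings, applies max pooling over positions, and concatenates the result with a word embedding. The dimensions, the window size of 3, the random weights, and the function names are all assumptions made for illustration; this is not the toolkit's Lasagne code.

```python
import random

def char_cnn_features(char_vectors, filters, window=3):
    """Max-over-time pooled 1D convolution over a word's character vectors.

    char_vectors: list of character embeddings (each a list of floats).
    filters: list of filters; each filter is a flat list of
             window * char_dim weights.
    Returns one feature per filter.
    """
    char_dim = len(char_vectors[0])
    # Zero-pad both ends so that even very short words yield a position.
    pad = [[0.0] * char_dim] * (window - 1)
    padded = pad + char_vectors + pad
    features = []
    for weights in filters:
        best = float("-inf")
        for start in range(len(padded) - window + 1):
            # Flatten the characters under the window into one vector.
            patch = [x for vec in padded[start:start + window] for x in vec]
            activation = sum(w * x for w, x in zip(weights, patch))
            best = max(best, activation)  # max-over-time pooling
        features.append(best)
    return features

def word_representation(word_vector, char_vectors, filters):
    """Concatenate the word embedding with the character-level features."""
    return word_vector + char_cnn_features(char_vectors, filters)

rng = random.Random(0)
char_dim, num_filters, word_dim = 4, 5, 6  # toy sizes, not the toolkit's
word = "Học_sinh"
# Hypothetical lookup tables filled with random values.
char_table = {c: [rng.uniform(-1, 1) for _ in range(char_dim)]
              for c in sorted(set(word))}
filters = [[rng.uniform(-1, 1) for _ in range(3 * char_dim)]
           for _ in range(num_filters)]
word_vector = [rng.uniform(-1, 1) for _ in range(word_dim)]

representation = word_representation(
    word_vector, [char_table[c] for c in word], filters)
print(len(representation))  # word_dim + num_filters
```

In the full system, a vector of this kind for each word in the sentence would be the input to the Bi-LSTM-CRF layers.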

Our system achieves an accuracy of 91.92% for POS tagging and F1 scores of 84.11% and 92.91% for chunking and NER, respectively.

The following tables compare the performance of NNVLP with previous toolkits on the POS tagging, chunking, and NER tasks.

POS tagging

System	Accuracy
Vitk	88.41
vTools	90.73
RDRPOSTagger	91.96
NNVLP	91.92

Chunking

System	P	R	F1
vTools	82.79	83.55	83.17
NNVLP	83.93	84.28	84.11

NER

System	P	R	F1
Vitk	88.36	89.20	88.78
vie-ner-lstm	91.09	93.03	92.05
NNVLP	92.76	93.07	92.91
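
The F1 columns in the chunking and NER tables are consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes F1 from the P and R values in the tables. The function name is an assumption, and a difference in the last digit can appear because the published P and R are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) rows copied from the chunking and NER tables.
rows = {
    "vTools (chunking)": (82.79, 83.55, 83.17),
    "NNVLP (chunking)": (83.93, 84.28, 84.11),
    "Vitk (NER)": (88.36, 89.20, 88.78),
    "vie-ner-lstm (NER)": (91.09, 93.03, 92.05),
    "NNVLP (NER)": (92.76, 93.07, 92.91),
}
for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```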

2. Installation

This software depends on NumPy, Theano, and Lasagne. You must install them before using NNVLP.

The simplest way to install them is with pip:

	$ pip install -U numpy theano lasagne
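
To confirm that the three dependencies are visible to the Python interpreter before running NNVLP, a check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of NNVLP.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(["numpy", "theano", "lasagne"])
if missing:
    print("Missing dependencies:", ", ".join(missing))
else:
    print("All dependencies found.")
```

The check only looks the modules up without importing them, so it does not verify that the installed versions are compatible with each other.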

3. Usage

3.1. Data

The input data of NNVLP follows the CoNLL format. The corpus for the POS tagging task has two columns: word and POS tag. The corpus for the chunking task has three columns: word, POS tag, and chunk. The corpus for the NER task has four columns, in this order: word, POS tag, chunk, and named entity. The table below shows an example Vietnamese sentence from the NER corpus.

Word	POS	Chunk	NER
Từ	E	B-PP	O
Singapore	NNP	B-NP	B-LOC
,	CH	O	O
chỉ	R	O	O
khoảng	N	B-NP	O
vài	L	B-NP	O
chục	M	B-NP	O
phút	Nu	B-NP	O
ngồi	V	B-VP	O
phà	N	B-NP	O
là	V	B-VP	O
đến	V	B-VP	O
được	R	O	O
Batam	NNP	B-NP	B-LOC
.	CH	O	O
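
A file in this layout can be read with a few lines of Python. The sketch below assumes whitespace-separated columns and a blank line between sentences, which is the usual CoNLL convention but is not stated explicitly here; the function name `read_conll` is hypothetical. Splitting on whitespace works because multi-syllable words are joined with underscores (for example Học_sinh).

```python
def read_conll(lines, num_columns=4):
    """Group CoNLL-style lines into sentences.

    Each sentence is a list of tuples with num_columns fields
    (word, POS, chunk, NER for the NER corpus). A blank line is
    assumed to end a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != num_columns:
            raise ValueError(
                f"expected {num_columns} columns, got {len(fields)}: {line!r}")
        current.append(tuple(fields))
    if current:
        sentences.append(current)
    return sentences

# Toy input built from rows of the example table; the blank line is
# inserted only to demonstrate sentence splitting.
sample = """Từ\tE\tB-PP\tO
Singapore\tNNP\tB-NP\tB-LOC
,\tCH\tO\tO

Batam\tNNP\tB-NP\tB-LOC
.\tCH\tO\tO
"""
sentences = read_conll(sample.splitlines())
words = [token[0] for token in sentences[0]]
print(words)
```

For the POS tagging and chunking corpora, `num_columns` would be 2 and 3 respectively.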

To access the full dataset of VLSP, you need to sign the user agreement of the VLSP consortium.

3.2. Command-line Usage

You can run NNVLP with the following shell commands:

For POS tagging:

	$ bash pos.sh

For chunking:

	$ bash chunk.sh

For NER:

	$ bash ner.sh

Arguments in these scripts:

train_dir: path to the training data
dev_dir: path to the development data
test_dir: path to the test data
word_dir: path to the word dictionary
vector_dir: path to the vector dictionary
char_embedd_dim: dimension of the character embeddings
num_units: number of hidden units in the LSTM
num_filters: number of filters in the CNN
grad_clipping: gradient clipping value
peepholes: whether the LSTM uses peephole connections (True or False)
learning_rate: learning rate
decay_rate: decay rate
dropout: whether dropout is applied to the input data (True or False)
batch_size: size of the input batch used for training
patience: number used for early stopping during training
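
The patience argument is described as controlling early stopping. One common convention is to stop training once the development score has not improved for patience consecutive epochs, and the sketch below illustrates that convention with made-up scores. Whether NNVLP counts epochs in exactly this way is an assumption, and the function name is hypothetical.

```python
def epochs_until_stop(dev_scores, patience):
    """Return (best_epoch, stop_epoch) under patience-based early stopping.

    Training stops after `patience` consecutive epochs without a new
    best development score. Epochs are numbered from 1.
    """
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # The scores ran out before the patience budget was exhausted.
    return best_epoch, len(dev_scores)

# Made-up development scores, one per epoch.
scores = [80.1, 85.4, 88.0, 87.9, 88.0, 87.5, 89.0]
best_epoch, stop_epoch = epochs_until_stop(scores, patience=3)
print(best_epoch, stop_epoch)
```

A larger patience value lets training continue through temporary plateaus at the cost of more epochs; with the scores above, a patience of 5 would reach the later improvement at epoch 7.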

Note: The first time you run NNVLP, the system automatically downloads the Vietnamese word embeddings from the internet. This may take a long time because the embedding set is about 1 GB. If the automatic download fails, you can download the files manually from here (vector, unknown vector, word) and put them into the embedding directory.

4. References

Thai-Hoang Pham, Xuan-Khoai Pham, Tuan-Anh Nguyen, Phuong Le-Hong, "NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit" Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017)

@inproceedings{Pham:2017b,
  title={NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit},
  author={Thai-Hoang Pham and Xuan-Khoai Pham and Tuan-Anh Nguyen and Phuong Le-Hong},
  booktitle={Proceedings of The 8th International Joint Conference on Natural Language Processing},
  year={2017},
}

Thai-Hoang Pham, Phuong Le-Hong, "End-to-end Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-level vs. Character-level" Proceedings of The 15th International Conference of the Pacific Association for Computational Linguistics (PACLING 2017)

@inproceedings{Pham:2017a,
  title={End-to-end Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-level vs. Character-level},
  author={Thai-Hoang Pham and Phuong Le-Hong},
  booktitle={Proceedings of The 15th International Conference of the Pacific Association for Computational Linguistics},
  year={2017},
}

5. Contact

Thai-Hoang Pham <phamthaihoang.hn@gmail.com>

Alt Inc, Hanoi, Vietnam
