THUOCL dataset
icwb dataset
news and novel dataset
我爱自然语言处理 reference
HMM Framework reference
Before testing the project, install the required Python packages with the following command. Note that we use Python 3.7.0 or above.
pip3 install -r requirement.txt
- Part1: segmentation, e.g.
  'woaibeijingtiananmen' --> ['wo', 'ai', 'bei', 'jing', 'tian', 'an', 'men']
- Part2: tokenization, e.g.
  ['wo', 'ai', 'bei', 'jing', 'tian', 'an', 'men'] --> ['wo', 'ai', 'beijing', 'tiananmen']
- Part3: translation, e.g.
  ['wo', 'ai', 'beijing', 'tiananmen'] --> ['我', '爱', '北京', '天安门']
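To make the three parts concrete, here is a minimal sketch of the pipeline on the example above. It is not the project's implementation: the trained models are replaced by two tiny hypothetical dictionaries (`SYLLABLES`, `WORDS`), and the function names `segment`, `tokenize`, and `translate` are made up for illustration.

```python
# Toy three-stage pipeline. SYLLABLES and WORDS are hypothetical stand-ins
# for the trained models; they only cover the example sentence.
SYLLABLES = {"wo", "ai", "bei", "jing", "tian", "an", "men"}
WORDS = {"wo": "我", "ai": "爱", "beijing": "北京", "tiananmen": "天安门"}


def segment(text, syllables=SYLLABLES):
    """Part 1: split an unspaced pinyin string into syllables.

    Returns a list of syllables, or None if no full split exists.
    """
    if not text:
        return []
    # Try longer syllables first; backtrack if the remainder cannot be split.
    for end in range(min(len(text), 6), 0, -1):
        head = text[:end]
        if head in syllables:
            rest = segment(text[end:], syllables)
            if rest is not None:
                return [head] + rest
    return None


def tokenize(syls, words=WORDS, max_len=4):
    """Part 2: greedily merge syllables into the longest known word."""
    tokens, i = [], 0
    while i < len(syls):
        for n in range(min(max_len, len(syls) - i), 0, -1):
            cand = "".join(syls[i:i + n])
            # A single syllable is always accepted as a fallback token.
            if cand in words or n == 1:
                tokens.append(cand)
                i += n
                break
    return tokens


def translate(tokens, words=WORDS):
    """Part 3: map each pinyin token to Chinese; unknown tokens pass through."""
    return [words.get(t, t) for t in tokens]


syllables = segment("woaibeijingtiananmen")
tokens = tokenize(syllables)
print(translate(tokens))
```

A dictionary lookup cannot resolve homophones, which is where a statistical model would be needed in practice; the sketch only shows how the output of each part feeds the next.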
- test:
  In folder part2, train the 2nd part:
  bash train.sh
  In folder part3, train the 3rd part:
  bash train.sh
  In folder combination, test a single sentence (alternatively, use python ./test/interpret.py):
  bash singlesentence.sh
  In folder combination, test a batch of sentences to get the accuracy on words and sentences:
  bash accuracytest.sh
  Note that the batch test may take a long time, and relative-path problems may occur on different machines, so if you need to run the accuracy test, please contact yuzy@shanghaitech.edu.cn. Alternatively, run python ./test/test.py.
  For the accuracy tests of part1 and part3, please refer to the branches 'zheng_dev' and 'main' respectively.
- test_dataset: data used for testing
- train_dataset: data used for training
Throughout the experiments, we use the following notation:
Seg: Segmentation, i.e. part 1
Tok: Tokenization, i.e. part 2
Trs: Translation, i.e. part 3
W: accuracy per word, i.e. # of correct words / # of total words
S: accuracy per sentence, i.e. # of correct sentences / # of total sentences
3-5: a sentence with 3-5 tokens
6-8: a sentence with 6-8 tokens
>=9: a sentence with >=9 tokens
top1: accuracy of the highest-scored sentence
top3: accuracy of the 3 highest-scored sentences
top5: accuracy of the 5 highest-scored sentences
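The W, S, and top-k definitions above can be sketched in a few lines. This is an illustration under two assumptions that the README does not state: words are compared position by position against the reference, and a top-k prediction counts as correct when the reference sentence appears among the k highest-scored candidates. The function names are hypothetical.

```python
def word_accuracy(predictions, references):
    """W: # of correct words / # of total words (assumed position-wise match)."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        total += len(ref)
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total if total else 0.0


def sentence_accuracy(predictions, references):
    """S: # of correct sentences / # of total sentences (exact match)."""
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references) if references else 0.0


def topk_sentence_accuracy(candidates, references, k):
    """Top-k S: a sentence is a hit if its reference is among the first k
    candidates (candidates are assumed to be sorted by descending score)."""
    hits = sum(ref in cands[:k] for cands, ref in zip(candidates, references))
    return hits / len(references) if references else 0.0


refs = [["我", "爱", "北京"], ["你", "好"]]
preds = [["我", "爱", "背景"], ["你", "好"]]
print(word_accuracy(preds, refs))      # 4 of 5 words match -> 0.8
print(sentence_accuracy(preds, refs))  # 1 of 2 sentences match -> 0.5

# Ranked candidate lists per sentence, best first.
cands = [[["我", "爱", "背景"], ["我", "爱", "北京"]], [["你", "好"]]]
print(topk_sentence_accuracy(cands, refs, k=1))  # -> 0.5
print(topk_sentence_accuracy(cands, refs, k=3))  # -> 1.0
```

Under this reading, top-k accuracy can only stay equal or increase as k grows.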
Stage | Acc Type | 3-5 tokens | 6-8 tokens | >=9 tokens
---|---|---|---|---
Seg | W | 0.999 | 0.998 | 0.992
Seg | S | 0.995 | 0.980 | 0.935
Tok | W | 0.623 | 0.653 | 0.655
Tok | S | 0.265 | 0.140 | 0.060
Trs | W | 0.924 | 0.938 | 0.957
Trs | S | 0.770 | 0.700 | 0.655
Stages | Acc Type | 3-5 tokens | 6-8 tokens | >=9 tokens
---|---|---|---|---
Seg+Tok+Trs | W | 0.782 | 0.827 | 0.824
Seg+Tok+Trs | S | 0.310 | 0.230 | 0.085
Seg+Trs | W | 0.786 | 0.793 | 0.801
Seg+Trs | S | 0.365 | 0.220 | 0.095
Number of tokens | Accuracy Type | Top1 | Top3 | Top5
---|---|---|---|---
3-5 | W | 0.782 | 0.826 | 0.837
3-5 | S | 0.310 | 0.375 | 0.395
6-8 | W | 0.827 | 0.855 | 0.863
6-8 | S | 0.230 | 0.285 | 0.295
>=9 | W | 0.824 | 0.854 | 0.864
>=9 | S | 0.085 | 0.115 | 0.135