w-dq/PY2HZ

AI Final - PY2HZ

Datasets, References and Results record

Record Document

Tencent Document

Links

THUOCL dataset

icwb dataset

news and novel dataset

我爱自然语言处理 reference

HMM Framework reference


Prerequisites

Before testing the project, use the following command to install the required Python packages. Note that we are using Python 3.7.0 or above.

pip3 install -r requirement.txt

Code

  • Part1: segmentation, e.g.

    'woaibeijingtiananmen' --> ['wo', 'ai', 'bei', 'jing', 'tian', 'an','men']
    
  • Part2: tokenization, e.g.

    ['wo', 'ai', 'bei', 'jing', 'tian', 'an','men'] --> ['wo', 'ai', 'beijing', 'tiananmen']
    
  • Part3: translation, e.g.

    ['wo', 'ai', 'beijing', 'tiananmen'] --> ['我', '爱', '北京', '天安门']
    
    
  • test: in folder part2

    bash train.sh              # train the 2nd part

    in folder part3

    bash train.sh              # train the 3rd part

    in folder combination

    bash singlesentence.sh     # test a single sentence

    or, alternatively,

    python ./test/interpret.py

    in folder combination

    bash accuracytest.sh       # test a batch of sentences

    This reports the accuracy per word and per sentence. Note that it may take a long time, and relative-path problems may occur on different PCs, so if you need to run the accuracy test, please contact yuzy@shanghaitech.edu.cn. Alternatively, run

    python ./test/test.py

    For accuracy tests of part1 and part3, please refer to branches 'zheng_dev' and 'main' respectively.

  • test_dataset

    data used for testing

  • train_dataset

    data used for training
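
The three parts listed above can be illustrated with a toy end-to-end sketch. This is not the repository's implementation: the syllable set and lexicon below are hypothetical and only cover the example sentence, segmentation is done with a simple dynamic program, tokenization with greedy longest match, and translation with a plain dictionary lookup, whereas the project itself trains models for parts 2 and 3.

```python
from typing import Dict, List, Optional

# Hypothetical toy resources, just large enough for the README example.
SYLLABLES = {"wo", "ai", "bei", "jing", "tian", "an", "men"}
LEXICON: Dict[str, str] = {
    "wo": "我",
    "ai": "爱",
    "beijing": "北京",
    "tiananmen": "天安门",
}


def segment(pinyin: str) -> Optional[List[str]]:
    """Part 1: split an unspaced pinyin string into syllables."""
    # best[i] holds one valid split of pinyin[:i], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (len(pinyin) + 1)
    best[0] = []
    for end in range(1, len(pinyin) + 1):
        # Pinyin syllables have at most 6 letters; try the longest piece first.
        for start in range(max(0, end - 6), end):
            piece = pinyin[start:end]
            if best[start] is not None and piece in SYLLABLES:
                best[end] = best[start] + [piece]
                break
    return best[-1]


def tokenize(syllables: List[str]) -> List[str]:
    """Part 2: merge syllables into words by greedy longest match."""
    tokens: List[str] = []
    i = 0
    while i < len(syllables):
        for j in range(len(syllables), i, -1):
            word = "".join(syllables[i:j])
            # Fall back to the single syllable when no lexicon word matches.
            if word in LEXICON or j == i + 1:
                tokens.append(word)
                i = j
                break
    return tokens


def translate(tokens: List[str]) -> List[str]:
    """Part 3: map each pinyin word to Chinese characters (unknown words pass through)."""
    return [LEXICON.get(token, token) for token in tokens]


syllables = segment("woaibeijingtiananmen")
tokens = tokenize(syllables)
print(syllables)          # ['wo', 'ai', 'bei', 'jing', 'tian', 'an', 'men']
print(tokens)             # ['wo', 'ai', 'beijing', 'tiananmen']
print(translate(tokens))  # ['我', '爱', '北京', '天安门']
```

A dictionary lookup cannot resolve homophones, which is why a statistical model is needed for the translation stage in practice.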


Numerical Results

Throughout the experiments, we use the following notation:

Stage

Seg: Segmentation, i.e. part 1

Tok: Tokenization, i.e. part 2

Trs: Translation, i.e. part 3

Acc Type

W: accuracy per word, i.e. # of correct words / # of total words

S: accuracy per sentence, i.e. # of correct sentences / # of total sentences
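
The two accuracy types can be computed as in the sketch below. It assumes that predicted and reference words are compared position by position, which the definitions above do not specify; the function names are illustrative and are not taken from the repository's test scripts.

```python
from typing import List

Sentence = List[str]


def word_accuracy(predictions: List[Sentence], references: List[Sentence]) -> float:
    """W: number of correct words / number of total (reference) words."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        total += len(ref)
        # Assumption: a word is correct if it matches the reference at the same position.
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total if total else 0.0


def sentence_accuracy(predictions: List[Sentence], references: List[Sentence]) -> float:
    """S: number of fully correct sentences / number of total sentences."""
    if not references:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


references = [["我", "爱", "北京", "天安门"], ["我", "爱", "北京"]]
predictions = [["我", "爱", "北京", "天安门"], ["我", "哎", "北京"]]
print(word_accuracy(predictions, references))      # 6 of 7 words are correct
print(sentence_accuracy(predictions, references))  # 1 of 2 sentences is correct
```

One wrong word costs the whole sentence, so S is always at most W under this definition, which matches the pattern in the tables.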

Test data set:

3-5: a sentence with 3-5 tokens

6-8: a sentence with 6-8 tokens

>=9: a sentence with >=9 tokens

Result

top1: accuracy of the highest-scoring candidate sentence

top3: accuracy over the 3 highest-scoring candidate sentences

top5: accuracy over the 5 highest-scoring candidate sentences
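
One plausible reading of the top-k results is sketched below. It assumes a sentence counts as correct when the reference appears anywhere among the k highest-scoring candidates, and that the word accuracy takes the best position-wise match among those k candidates. The README does not state how the top-k figures were actually computed, so both choices are assumptions, and the function names are hypothetical.

```python
from typing import List

Sentence = List[str]


def topk_sentence_accuracy(
    ranked: List[List[Sentence]], references: List[Sentence], k: int
) -> float:
    """Fraction of inputs whose reference is among the first k candidates."""
    if not references:
        return 0.0
    hits = sum(1 for cands, ref in zip(ranked, references) if ref in cands[:k])
    return hits / len(references)


def topk_word_accuracy(
    ranked: List[List[Sentence]], references: List[Sentence], k: int
) -> float:
    """Word accuracy using, per input, the best of the first k candidates."""
    correct = 0
    total = 0
    for cands, ref in zip(ranked, references):
        total += len(ref)
        # Count position-wise matches for each candidate and keep the best one.
        correct += max(
            (sum(p == r for p, r in zip(cand, ref)) for cand in cands[:k]),
            default=0,
        )
    return correct / total if total else 0.0


# Candidates for each input are ordered from highest to lowest score.
ranked = [
    [["我", "哎", "北京"], ["我", "爱", "北京"]],
    [["天", "安", "们"], ["天", "按", "门"]],
]
references = [["我", "爱", "北京"], ["天", "安", "门"]]
print(topk_sentence_accuracy(ranked, references, k=1))  # no top-1 candidate is fully correct
print(topk_sentence_accuracy(ranked, references, k=2))  # the first input is recovered at rank 2
```

Under this reading, accuracy can only stay the same or increase as k grows, which is consistent with the Top1, Top3 and Top5 columns reported below.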

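Accuracy of each stage by number of tokens (values are fractions, e.g. 0.999 = 99.9%)

| Stage | Acc Type | 3-5   | 6-8   | >=9   |
|-------|----------|-------|-------|-------|
| Seg   | W        | 0.999 | 0.998 | 0.992 |
| Seg   | S        | 0.995 | 0.980 | 0.935 |
| Tok   | W        | 0.623 | 0.653 | 0.655 |
| Tok   | S        | 0.265 | 0.140 | 0.060 |
| Trs   | W        | 0.924 | 0.938 | 0.957 |
| Trs   | S        | 0.770 | 0.700 | 0.655 |
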
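Accuracy of combined stages by number of tokens (values are fractions)

| Stages      | Acc Type | 3-5   | 6-8   | >=9   |
|-------------|----------|-------|-------|-------|
| Seg+Tok+Trs | W        | 0.782 | 0.827 | 0.824 |
| Seg+Tok+Trs | S        | 0.310 | 0.230 | 0.085 |
| Seg+Trs     | W        | 0.786 | 0.793 | 0.801 |
| Seg+Trs     | S        | 0.365 | 0.220 | 0.095 |
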
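Top-k accuracy by number of tokens (values are fractions)

| Number of tokens | Acc Type | Top1  | Top3  | Top5  |
|------------------|----------|-------|-------|-------|
| 3-5              | W        | 0.782 | 0.826 | 0.837 |
| 3-5              | S        | 0.310 | 0.375 | 0.395 |
| 6-8              | W        | 0.827 | 0.855 | 0.863 |
| 6-8              | S        | 0.230 | 0.285 | 0.295 |
| >=9              | W        | 0.824 | 0.854 | 0.864 |
| >=9              | S        | 0.085 | 0.115 | 0.135 |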
