GitHub - ultraicy/NLPCC_2018_TASK2_GEC: Code for the paper "A Sequence to Sequence Learning for Chinese Grammatical Error Correction" (NLPCC-18).

This is the code of our team (Zlbnlp) for the NLPCC 2018 Shared Task 2, Grammatical Error Correction. The paper is "A Sequence to Sequence Learning for Chinese Grammatical Error Correction".

Usage

Prerequisites

Python 3.6
PyTorch 0.2.0 (use the following commands to install from source)

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn
conda install -c pytorch magma-cuda80

git clone https://github.com/pytorch/pytorch.git
cd pytorch
git reset --hard a03e5cb40938b6b3f3e6dbddf9cff8afdff72d1b
git submodule update --init
pip install -r requirements.txt
python setup.py install

m2score scripts (to compute the metrics)
libgrass-ui toolkit (word segmentation toolkit)
fairseq-py (depends on torch; use the following commands to install)

cd CS2S+BPE+Emb/software/fairseq-py
pip install -r requirements.txt
python setup.py build 
python setup.py develop

Data

The data and embeddings can be found in Zlbnlp_data. You need to manually split the whole dataset into two parts.

Training dataset: contains 1,215,876 sentence pairs. File paths are CS2S+BPE+Emb/data/train.tok.src and CS2S+BPE+Emb/data/train.tok.trg.
Development dataset: contains 5k sentence pairs. File paths are CS2S+BPE+Emb/data/dev.tok.src and CS2S+BPE+Emb/data/dev.tok.trg.
Test data: source.txt.jieba.seg, segmented with the jieba toolkit.
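
The manual split described above could be done with a short script along these lines. This is only a sketch: the README does not say how the authors chose the development sentences, so the seeded random shuffle, the function names `split_parallel` and `write_pairs`, and the default `dev_size=5000` (read from "5k") are assumptions made for illustration.

```python
import random


def split_parallel(src_lines, trg_lines, dev_size=5000, seed=0):
    """Split aligned source/target sentences into (train, dev) lists of pairs."""
    if len(src_lines) != len(trg_lines):
        raise ValueError("source and target must have the same number of lines")
    if not 0 <= dev_size <= len(src_lines):
        raise ValueError("dev_size must be between 0 and the number of pairs")
    pairs = list(zip(src_lines, trg_lines))
    # Shuffle with a fixed seed so the split is reproducible (an assumption;
    # the README does not say whether the authors sampled at random).
    random.Random(seed).shuffle(pairs)
    return pairs[dev_size:], pairs[:dev_size]


def write_pairs(pairs, src_path, trg_path):
    """Write pairs to two files, one sentence per line, keeping them aligned."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(trg_path, "w", encoding="utf-8") as trg:
        for s, t in pairs:
            src.write(s.rstrip("\n") + "\n")
            trg.write(t.rstrip("\n") + "\n")
```

The two returned lists would then be written to the train.tok.src / train.tok.trg and dev.tok.src / dev.tok.trg paths listed above with `write_pairs`.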

Data processing

cd CS2S+BPE+Emb/training/
chmod +x preprocess.sh
./preprocess.sh
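
Because the data consists of sentence pairs, each .src file should have exactly as many lines as its .trg counterpart. A quick check like the one below can catch a misaligned split before preprocessing. It is a sketch that is not part of the repository, and the helper names `count_lines` and `check_aligned` are hypothetical.

```python
def count_lines(path):
    """Count lines in a UTF-8 text file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(src_path, trg_path):
    """Return the shared line count, or raise if the two files differ in length."""
    n_src, n_trg = count_lines(src_path), count_lines(trg_path)
    if n_src != n_trg:
        raise ValueError(f"{src_path} has {n_src} lines but {trg_path} has {n_trg}")
    return n_src
```

For example, `check_aligned("../data/train.tok.src", "../data/train.tok.trg")` could be run from the training directory, assuming the files sit at the paths given in the Data section.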

Training

Training command

The command below is what we used to train a model on the NLPCC-2018 Task 2 dataset.

./train_embed.sh

Decoding

The following commands were used to generate the outputs and the F0.5 score:

cd CS2S+BPE+Emb/
./run.sh ./data/source.txt.jieba.seg ./output/CS2S+BPE+Emb/ 0 ./training/models/mlconv_embed/model1
cd libgrass-ui/
./remove_spac_pkunlp_segment.sh
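
For reference, the F0.5 score is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below shows only the arithmetic and is not the m2score implementation. The function names are hypothetical, and treating precision or recall as 1.0 when the corresponding count is zero is an assumption of this sketch.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def prf_from_counts(correct, proposed, gold, beta=0.5):
    """Precision, recall and F-beta from edit counts.

    correct:  proposed edits that match a gold edit
    proposed: edits produced by the system
    gold:     edits in the reference annotation
    """
    # Assumption of this sketch: an empty denominator counts as a perfect score.
    precision = correct / proposed if proposed else 1.0
    recall = correct / gold if gold else 1.0
    return precision, recall, f_beta(precision, recall, beta)
```

With made-up counts of 6 correct edits out of 8 proposed and 12 gold edits, precision is 0.75, recall is 0.5, and F0.5 is about 0.68, closer to the precision than a balanced F1 would be.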
