Bi-LSTM+CNN for Chinese word segmentation
This implementation is partly based on Koth's kcws.
Requires TensorFlow 1.2.
-
Preprocessing
python preprocess.py --rootDir <ROOTDIR> --corpusAll Corpora/people2014All.txt --resultFile pre_chars_for_w2v.txt
ROOTDIR is the absolute path of your corpus. Run python preprocess.py -h to see more details.
-
Word2vec Training
./third_party/word2vec -train pre_chars_for_w2v.txt -save-vocab pre_vocab.txt -min-count 3
python SentHandler/replace_unk.py pre_vocab.txt pre_chars_for_w2v.txt chars_for_w2v.txt
./third_party/word2vec -train chars_for_w2v.txt -output char_vec.txt \
-size 50 -sample 1e-4 -negative 0 -hs 1 -binary 0 -iter 5
First, compile word2vec.c in the third_party directory (see third_party/compile_w2v.sh). word2vec then counts the characters that occur at least 3 times and saves them to pre_vocab.txt. Characters that are not in pre_vocab.txt are replaced with "UNK", and finally word2vec training begins.
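The "UNK" replacement step can be sketched as follows. This is a minimal illustration, not the code of SentHandler/replace_unk.py: it assumes the vocabulary file uses the "token count" lines that word2vec writes with -save-vocab, and that the corpus holds space-separated characters. The function names load_vocab and replace_unk are hypothetical.

```python
def load_vocab(vocab_lines):
    """Collect tokens from vocabulary lines of the form "token count"."""
    vocab = set()
    for line in vocab_lines:
        parts = line.split()
        if parts:
            vocab.add(parts[0])
    return vocab


def replace_unk(line, vocab, unk="UNK"):
    """Replace every space-separated token that is not in vocab with unk."""
    return " ".join(tok if tok in vocab else unk for tok in line.split())


# Toy vocabulary standing in for pre_vocab.txt (assumed format).
vocab = load_vocab(["</s> 0", "我 12", "们 7"])
print(replace_unk("我 们 龘", vocab))  # prints: 我 们 UNK
```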
-
Generate Training Files
python pre_train.py --corpusAll Corpora/people2014All.txt --vecpath char_vec.txt \
--train_file Corpora/train.txt --test_file Corpora/test.txt
Run python pre_train.py -h to see more details.
-
Training
python ./CWSTrain/lstm_cnn_train.py --train_data_path Corpora/train.txt \
--test_data_path Corpora/test.txt --word2vec_path char_vec.txt
The arguments of lstm_cnn_train.py are defined with tf.app.flags. See the file for the other available options.
./cws_train.sh <ROOTDIR>
Reference: Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural Architectures for Named Entity Recognition. In Proc. NAACL-HLT, 2016.
-
Freeze graph
python tools/freeze_graph.py --input_graph Logs/seg_logs/graph.pbtxt --input_checkpoint Logs/seg_logs/model.ckpt --output_node_names "input_placeholder, transitions, Reshape_7" --output_graph Models/lstm_crf_model.pbtxt
Build model for segmentation.
See Here.
-
Freeze graph
python tools/freeze_graph.py --input_graph Logs/seg_cnn_logs/graph.pbtxt --input_checkpoint Logs/seg_cnn_logs/model.ckpt --output_node_names "input_placeholder,Reshape_5" --output_graph Models/lstm_cnn_model.pbtxt
Experiments on corpus People 2014.
Models | Bi-LSTM-CRF | Bi-LSTM-CNN |
---|---|---|
Precision | 96.11% | 96.27% |
Recall | 95.73% | 96.34% |
F-value | 95.92% | 96.30% |
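The F-values in the table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition; the helper name f_value is hypothetical.

```python
def f_value(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) taken from the table above.
results = [("Bi-LSTM-CRF", 96.11, 95.73), ("Bi-LSTM-CNN", 96.27, 96.34)]
for name, p, r in results:
    print("{}: {:.2f}%".format(name, f_value(p, r)))
```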
-
Dump Vocabulary
python tools/vob_dump.py --vecpath char_vec.txt --dump_path Models/vob_dump.pk
This step is necessary for the segmentation model.
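As a rough idea of what such a dump might contain, the sketch below maps each character in a word2vec text file to its row index and pickles the mapping. The real layout of vob_dump.pk is defined by tools/vob_dump.py and may differ. The function name build_vocab and the sample vectors are hypothetical, and the input is assumed to be word2vec's non-binary output (a "vocab_size dim" header followed by one "token v1 v2 ..." line per entry).

```python
import pickle


def build_vocab(vec_lines):
    """Map each token in word2vec text output to its row index."""
    lines = iter(vec_lines)
    next(lines)  # skip the header line: "<vocab_size> <dim>"
    vocab = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) > 1:
            vocab[parts[0]] = len(vocab)
    return vocab


# In-memory stand-in for char_vec.txt (assumed format).
sample = ["3 2", "</s> 0.0 0.0", "UNK 0.1 0.2", "我 0.3 0.4"]
vocab = build_vocab(sample)

# Round-trip through pickle, as a dump and a later load would do.
restored = pickle.loads(pickle.dumps(vocab))
print(restored == vocab)
```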
-
Seg Script
Use tools/crf_seg.py or tools/cnn_seg.py. See those files for the detailed parameter configuration.
By default, running python tools/crf_seg.py or python tools/cnn_seg.py from the repository root will work.
-
PRF Scoring
python PRF_Score.py Results/cnn_result.txt Corpora/test_gold.txt
Result files are put in directory Results/.
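Word-level precision, recall and F-value for segmentation are commonly computed by comparing word spans between the predicted and gold segmentations. The following is a minimal Python 3 sketch of that idea, not the code of PRF_Score.py. It assumes both files hold one sentence per line with words separated by spaces, and the names spans and prf are hypothetical.

```python
def spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def prf(pred_lines, gold_lines):
    """Return (precision, recall, F-value) over aligned sentence pairs."""
    correct = n_pred = n_gold = 0
    for pred, gold in zip(pred_lines, gold_lines):
        p, g = spans(pred.split()), spans(gold.split())
        correct += len(p & g)  # a word counts only if both boundaries match
        n_pred += len(p)
        n_gold += len(g)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Toy example: the prediction merges two gold words into one.
precision, recall, f = prf(["我们 在 北京大学"], ["我们 在 北京 大学"])
print(precision, recall, f)
```

In this toy example two of the three predicted words match the four gold words, so precision is 2/3 and recall is 1/2.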