LSTM-CNN-CWS

Bi-LSTM+CNN for Chinese word segmentation

Acknowledgement

The implementation of this repository partly refers to Koth's kcws.

Have tensorflow 1.2 installed.

Preprocessing

python preprocess.py --rootDir <ROOTDIR> --corpusAll Corpora/people2014All.txt --resultFile pre_chars_for_w2v.txt

ROOTDIR is the absolute path of your corpus. Run python preprocess.py -h to see more details.
Word2vec Training

./third_party/word2vec -train pre_chars_for_w2v.txt -save-vocab pre_vocab.txt -min-count 3

python SentHandler/replace_unk.py pre_vocab.txt pre_chars_for_w2v.txt chars_for_w2v.txt

./third_party/word2vec -train chars_for_w2v.txt -output char_vec.txt \
-size 50 -sample 1e-4 -negative 0 -hs 1 -binary 0 -iter 5

First off, the file word2vec.c in third_party directory should be compiled (see third_party/compile_w2v.sh). Then word2vec counts the characters which have a frequency more than 3 and saves them into file pre_vocab.txt. After replacing with "UNK" the words that are not in pre_vocab.txt, finally, word2vec training begins.
Generate Training Files

python pre_train.py --corpusAll Corpora/people2014All.txt --vecpath char_vec.txt \
--train_file Corpora/train.txt --test_file Corpora/test.txt

Run python pre_train.py -h to see more details.
Training

python ./CWSTrain/lstm_cnn_train.py --train_data_path Corpora/train.txt \
--test_data_path Corpora/test.txt --word2vec_path char_vec.txt

Arguments of lstm_cnn_train.py are set by tf.app.flags. See the file for more args' configurations.

./cws_train.sh <ROOTDIR>

Freeze graph

python tools/freeze_graph.py --input_graph Logs/seg_logs/graph.pbtxt --input_checkpoint Logs/seg_logs/model.ckpt --output_node_names "input_placeholder, transitions, Reshape_7" --output_graph Models/lstm_crf_model.pbtxt

Build model for segmentation.

See Here.

Freeze graph

python tools/freeze_graph.py --input_graph Logs/seg_cnn_logs/graph.pbtxt --input_checkpoint Logs/seg_cnn_logs/model.ckpt --output_node_names "input_placeholder,Reshape_5" --output_graph Models/lstm_cnn_model.pbtxt

Experiments on corpus People 2014.

Dump Vocabulary

python tools/vob_dump.py --vecpath char_vec.txt --dump_path Models/vob_dump.pk

This step is neccessary for the seg model.
Seg Script

Use file tools/crf_seg.py and file tools/cnn_seg.py. You may refer to the files about detailed parameters config.
For default, at the root path of this repository, python tools/crf_seg.py or python tools/cnn_seg.py will work.
PRF Scoring

python PRF_Score.py Results/cnn_result.txt Corpora/test_gold.txt

Result files are put in directory Results/.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
CWSTrain		CWSTrain
Corpora		Corpora
Extra		Extra
Logs		Logs
Models		Models
Results		Results
SentHandler		SentHandler
third_party		third_party
tools		tools
PRF_Score.py		PRF_Score.py
README.md		README.md
char_vec.txt		char_vec.txt
cws_train.sh		cws_train.sh
pre_train.py		pre_train.py
pre_vocab.txt		pre_vocab.txt
preprocess.py		preprocess.py