classycn: Classical Chinese sentence segmenter.

Data

Warning: the data folder is over 1 GB in size!

- data/sjw - cleaned data from the Seungjeongwon Ilgi, memos from the ancient Korean Royal Secretariat. Over 200 million characters and 16k+ unique characters.
- data/24s - semi-cleaned data from the "Twenty-Four Histories" of China, excluding the Han Shu and San Guo Zhi. The data comes from Wikisource and may contain noisy tokens. 20 million tokens, 12k unique.
- data/vectors - word vectors produced using GloVe and Word2Vec.
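
The character and unique-character counts quoted for each corpus can be checked with a few lines of Python. This is a minimal sketch, not code from this repository: the function name `corpus_stats` is hypothetical, and it assumes every non-whitespace character counts as one token.

```python
from collections import Counter


def corpus_stats(text):
    """Return (total characters, unique characters), ignoring whitespace.

    Assumption: each non-whitespace character is one token. The README
    does not say how its own counts were produced.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    return sum(counts.values()), len(counts)


if __name__ == "__main__":
    total, unique = corpus_stats("天地玄黃 天地\n")
    print(total, unique)
```

For a corpus as large as data/sjw, reading the files line by line and updating a single `Counter` would avoid loading everything into memory at once.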
Scripts

- runhmm - trains and tests an HMM tagger from NLTK.
- runcrf - trains and tests a CRF tagger from CRFsuite.
- runlstm - trains and tests a bidirectional LSTM tagger, implemented with Theano.
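
All three scripts treat segmentation as a tagging problem, so the text has to be turned into per-character labels first. The sketch below shows one common way to do that and is an assumption, not the scheme used by this repository: the punctuation set `BOUNDARY_MARKS`, the tag names `B` (a break follows) and `N` (no break), and both function names are hypothetical.

```python
# Assumed set of punctuation marks that signal a sentence break.
BOUNDARY_MARKS = set("。，；！？、：")


def to_tagged(text):
    """Convert punctuated text into (character, tag) pairs.

    Punctuation is removed; the character just before a mark gets
    tag "B" (break follows), every other character gets "N".
    """
    pairs = []
    for ch in text:
        if ch in BOUNDARY_MARKS:
            if pairs:  # ignore a mark with no preceding character
                pairs[-1] = (pairs[-1][0], "B")
        elif not ch.isspace():
            pairs.append((ch, "N"))
    return pairs


def segment(chars, tags, sep="/"):
    """Rebuild segmented text by inserting sep after each "B" character."""
    out = []
    for ch, tag in zip(chars, tags):
        out.append(ch)
        if tag == "B":
            out.append(sep)
    return "".join(out)


if __name__ == "__main__":
    pairs = to_tagged("學而時習之，不亦說乎。")
    chars = [c for c, _ in pairs]
    tags = [t for _, t in pairs]
    print(segment(chars, tags))
```

Pairs in this shape are the kind of input a sequence tagger is trained on, and `segment` applies predicted tags back to unpunctuated text.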

Contact: Yizhou Hu @ huyz725 at gmail.com


mpezeshki/classycn