classycn: Classical Chinese sentence segmenter.
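
Sentence segmentation of unpunctuated text can be framed as character tagging, which is the kind of problem the taggers listed under Scripts address. The following is a minimal sketch, assuming a two-label scheme (`B` for a character followed by a break, `N` otherwise) and a hypothetical set of break punctuation. The label names, the `BREAKS` set, and the `to_tagged` helper are illustrative and are not taken from this repository.

```python
# Hypothetical set of punctuation marks treated as sentence breaks.
BREAKS = set("。，、；：？！")

def to_tagged(text):
    """Turn punctuated text into (character, label) pairs.

    Label "B" means a break follows this character, "N" means no break.
    The punctuation itself is dropped, so a tagger sees unpunctuated text.
    """
    tagged = []
    for ch in text:
        if ch in BREAKS:
            if tagged:
                # Mark the previous character as the last one before a break.
                tagged[-1] = (tagged[-1][0], "B")
        elif not ch.isspace():
            tagged.append((ch, "N"))
    return tagged

pairs = to_tagged("學而時習之，不亦說乎。")
print(pairs)
```

Training data built this way lets any sequence tagger learn where breaks belong; at prediction time a break is inserted after every character labelled `B`.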

Data - Warning: the data folder is over 1 GB in size!

data/sjw - cleaned data from the Seungjeongwon Ilgi, memos from the ancient Korean Royal Secretariat. Over 200 million characters, with 16k+ unique characters.

data/24s - semi-cleaned data from the "Twenty-Four Histories" of China, excluding the Han Shu and San Guo Zhi. The data comes from Wikisource and may contain noisy tokens. About 20 million tokens, with 12k unique tokens.
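
Size figures like the ones above (total characters and unique characters) can be reproduced with a simple count. This is a minimal sketch: the on-disk layout of the data folders is not described here, so the hypothetical `corpus_stats` helper takes any iterable of text lines (for example an open file) and ignores whitespace.

```python
from collections import Counter

def corpus_stats(lines):
    """Return (total characters, distinct characters), ignoring whitespace."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return sum(counts.values()), len(counts)

# Tiny in-memory sample; a real run would pass an open corpus file instead.
sample = ["上曰 知道", "上曰 依啓"]
total, uniques = corpus_stats(sample)
print(total, uniques)
```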

data/vectors - word vectors produced using GloVe & Word2Vec.
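
The file format of the vectors is not stated here. As an assumption, the sketch below reads the plain-text layout that both tools can write: one `token v1 v2 ...` entry per line, optionally preceded by a `count dim` header line as in word2vec's text output. The `load_text_vectors` name is hypothetical.

```python
import io

def load_text_vectors(handle):
    """Read 'token v1 v2 ...' lines into a dict mapping token -> list of floats."""
    vectors = {}
    for i, line in enumerate(handle):
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # 'count dim' header line, not a vector
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# In-memory stand-in for a vector file (assumed format, made-up values).
sample = io.StringIO("2 3\n子 0.1 0.2 0.3\n曰 0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)
print(len(vectors), len(vectors["子"]))
```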

Scripts

runhmm - trains and tests an HMM tagger from NLTK
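
To illustrate what an HMM tagger computes, here is a small standalone sketch. It does not use NLTK and is not the code in runhmm: it is a hand-written supervised HMM with add-one smoothing and Viterbi decoding, using hypothetical labels `B` (a break follows the character) and `N` (no break) and a toy training set.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count start, transition and emission events from labelled sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sent in tagged_sentences:
        prev = None
        for ch, tag in sent:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][ch] += 1
            vocab.add(ch)
            prev = tag
    tags = sorted(emit)
    # One extra symbol slot is reserved for characters unseen in training.
    return start, trans, emit, tags, len(vocab) + 1

def log_prob(counter, key, bins):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + bins))

def viterbi(chars, model):
    """Return the most likely tag sequence for a list of characters."""
    start, trans, emit, tags, n_symbols = model
    if not chars:
        return []
    # best[t] = (log score of the best path ending in tag t, that path)
    best = {
        t: (log_prob(start, t, len(tags)) + log_prob(emit[t], chars[0], n_symbols), [t])
        for t in tags
    }
    for ch in chars[1:]:
        new = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_prob(trans[p], t, len(tags)), best[p][1]) for p in tags
            )
            new[t] = (score + log_prob(emit[t], ch, n_symbols), path + [t])
        best = new
    return max(best.values())[1]

# Toy training data (illustrative only).
train = [
    list(zip("子曰學而時習之", "NBNNNNB")),
    list(zip("子曰不亦說乎", "NBNNNB")),
]
model = train_hmm(train)
print(viterbi(list("子曰"), model))
```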

runcrf - trains and tests a CRF tagger from CRFsuite
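
A CRF tagger is driven by per-character features. The features used by runcrf are not listed here, so the following is only a sketch of a typical character-window template, producing one feature dict per position. The feature names (`c0`, `c-1`, `c+1`, `c0c+1`) and both helper functions are hypothetical.

```python
def char_features(chars, i, window=1):
    """Features for position i: the character itself plus neighbours within `window`."""
    feats = {"bias": 1.0, "c0": chars[i]}
    for off in range(1, window + 1):
        # Pad with boundary markers when the window runs off either end.
        feats[f"c-{off}"] = chars[i - off] if i - off >= 0 else "<BOS>"
        feats[f"c+{off}"] = chars[i + off] if i + off < len(chars) else "<EOS>"
    # Bigram of the current and following character.
    nxt = chars[i + 1] if i + 1 < len(chars) else "<EOS>"
    feats["c0c+1"] = chars[i] + nxt
    return feats

def sentence_features(chars):
    """One feature dict per character in the sequence."""
    return [char_features(chars, i) for i in range(len(chars))]

for feats in sentence_features(list("子曰學")):
    print(feats)
```

A wider `window` gives the model more context at the cost of more, sparser features.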

runlstm - trains and tests a bidirectional LSTM tagger, implemented with Theano.
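
Each script above trains and tests a tagger. Which metric the scripts report is not stated here; one common way to score a segmenter, shown as a sketch under that assumption, is precision, recall and F1 on the break label. The `boundary_prf` helper and the `B`/`N` labels (break follows / no break) are hypothetical.

```python
def boundary_prf(gold, pred, positive="B"):
    """Precision, recall and F1 for the break label over two aligned label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    n_pred = sum(p == positive for p in pred)
    n_gold = sum(g == positive for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

gold = list("NNNNBNNNB")  # reference labels
pred = list("NNBNBNNNB")  # tagger output with one spurious break
precision, recall, f1 = boundary_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```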

Contact: Yizhou Hu @ huyz725+github at gmail.com

