Skip to content

d2207197/smttoktag

Repository files navigation

zh_tok_tagger

chinese tokenizer and POS tagger based on statistical machine translation

$ ./toktagger.py -t tgt_data/sbc4/toktag.phrasetable.h5 -l tgt_data/sbc4/tag.blm -f verbose <<< 我下週要去旅行
我下週要去旅行	我 下 週 要 去 旅行	Nh Nes Nf D D VA	Nh + Nes Nf + D D VA	-36.222996470898366
$ cat infiles | ./toktagger.py -t tgt_data/sbc4/toktag.phrasetable.h5 -l tgt_data/sbc4/tag.blm > outfile

usage

./toktagger.py --help
usage: toktagger.py [-h] --translation-model H5_FILE_PATH --language-model
                    KENLM_BLM_PATH [--format {verbose,tab,/}]
                    [FILE [FILE ...]]

Chinese tokenzier and Part-Of-Speech tagger.

positional arguments:
  FILE                  input file(text in chinese)

optional arguments:
  -h, --help            show this help message and exit
  --translation-model H5_FILE_PATH, -t H5_FILE_PATH
                        Pytables Phrase Table
  --language-model KENLM_BLM_PATH, -l KENLM_BLM_PATH
                        KenLM BLM
  --format {verbose,tab,/}, -f {verbose,tab,/}
                        output format

About

chinese tokenizer and POS tagger based on statistical machine translation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages