a simple pinyin input method implemented in python
https://github.com/MolunYu/Pinyin-Input-Method
- python3
- sina_news_gbk: download
Pinyin-Input-Method
│ README.md
│
├─data
│ │ acc_result.yaml
│ │ char2freq.json //一元词频
│ │ four2freq.json //四元词篇
│ │ pinyin2word.json //拼音转汉字
│ │ single_pinyin2word.json //拼音转汉字(多音字处理版)
│ │ three2freq.json //三元词频
│ │ word2freq.json //二元词篇
│ │
│ ├─pinyin2word
│ │ .....
│ │
│ ├─pinyin_test //测试文件目录(in-ans为一对测试数据)
│ │ .....
│ │
│ └─sina_news_gbk
│ 下载地址.txt
│
└─src
| acc.py //准确率测试
| bar.py
| pinyin.py //转换拼音文件为句子
| pinyin_interact.py //交互式转换拼音文件
| pre_gram3.py //3-gram 预处理
| pre_gram4.py //4-gram 预处理
| pre_process.py //2-gram 预处理(3-gram,4-gram前执行)
| viterbi_2gram.py //2-gram 译码
| viterbi_3gram.py //3-gram 译码
| viterbi_4gram.py //4-gram 译码
- Download sina_news_gbk
- Move sina_news_gbk to data/
|-- Pinyin-Input-Method
|-- data
|-- sina_news_gbk
|-- 2016-01.txt
...
|-- 2016-09.txt
- Data preprocessing
cd ./src
python pre_process.py
- Option: 3-gram and 4-gram required
cd ./src
python pre_gram3.py
python pre_gram4.py
- Single_pinyin2word.json has a manual way to handle multi-tone words, if you don't need it, change with pinyin2word.json in viterbi_2gram.py, viterbi_gram3.py, viterbi_gram4.py
# multi-tone word
with open("../data/single_pinyin2word.json", mode="r") as src:
pinyin2word = json.load(src)
# normal
with open("../data/pinyin2word.json", mode="r") as src:
pinyin2word = json.load(src)
- Transform pinyin to Chinese character
cd ./src
python pinyin.py --input=INPUT_FILE_PATH --output=OUTPUT_FILE_PATH
- Transform pinyin in console interactively
cd ./src
python pinyin_interact.py
- Test accuracy from given files. Available test files in pinyin_test can be used as below. You can also use your files, make sure you have right format.
cd ./src
python acc.py --input=INPUT_FILE_PATH --ans=ANSWER_FILE_PATH
- Change n-gram model in pinin.py, pinyin_interact.py, acc.py.
# 2-gram
from viterbi_2gram import viterbi
# 3-gram
from viterbi_3gram import viterbi
# 4-gram
from viterbi_4gram import viterbi
- 2-gram: ~86%
- 3-gram: ~90%
- 4-gram: ~91%