# Ivan-Nebogatikov/ChineseCorrector

Current instance:

HTTP POST

http://ec2-18-223-97-193.us-east-2.compute.amazonaws.com:5000/correct/best

Body:

{
    "text": "我已经等猴多时了。"
}

Response:

{
    "best": "我已经等候多时了。"
}
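As a sketch, the `/correct/best` endpoint above can be called with nothing but the Python standard library. The endpoint URL is copied from this README; the instance may no longer be running, and the helper names here are mine, not part of the repository.

```python
import json
from urllib import request

# Endpoint taken from the README; the EC2 instance may be down.
BEST_URL = "http://ec2-18-223-97-193.us-east-2.compute.amazonaws.com:5000/correct/best"

def make_body(text):
    """Serialize the request body; ensure_ascii=False keeps the Chinese text readable."""
    return json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")

def correct_best(text):
    """POST the text and return the single best correction from the 'best' field."""
    req = request.Request(
        BEST_URL,
        data=make_body(text),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["best"]

# Example (requires the instance to be up):
# correct_best("我已经等猴多时了。")
```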

HTTP POST

http://ec2-18-223-97-193.us-east-2.compute.amazonaws.com:5000/correct/all

Body:

{
    "text" : "我已经等猴多时了。"
}

Response:

{
    "all": [
        "我",
        "已经",
        [
            "华新水泥",
            "眼见为实"
        ],
        "了",
        "。"
    ]
}
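The `all` response mixes confirmed words (plain strings) with lists of correction candidates. A small helper, mirroring the `word[0]` trick in the Usage session further down (and not part of the repository itself), rebuilds a single sentence by taking the first candidate of each list, assuming candidates are ordered best-first:

```python
def join_best(tokens):
    """Join an /correct/all token list into one sentence.

    Confirmed words are plain strings; ambiguous spans are lists of
    candidate corrections. Assuming best-first ordering, take index 0.
    """
    return "".join(t[0] if isinstance(t, list) else t for t in tokens)

# The sample response from the README:
sample = ["我", "已经", ["华新水泥", "眼见为实"], "了", "。"]
print(join_best(sample))  # -> 我已经华新水泥了。
```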

You can use the client application on your Android device: build the APK, install it, and set the system spell checker to the new one.

You can rate corrections through the Android application's UI.

Details (a serverless AWS Lambda function that updates the database):

HTTP POST

https://1shknletu9.execute-api.us-east-2.amazonaws.com/default/serverlessrepo-RateCorrection-helloworldpython3-WBC9LG4EQ4KT

Body:

{
    "Id" : "123",
    "Corrected" : "我不知道",
    "Input" : "我不知道",
    "IsLike" : false
}
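A rating can be posted the same way. The field names come from the body shown above; the helper functions are illustrative, and the Lambda must still be deployed for the request to succeed.

```python
import json
from urllib import request

# Endpoint copied from the README.
RATE_URL = ("https://1shknletu9.execute-api.us-east-2.amazonaws.com/"
            "default/serverlessrepo-RateCorrection-helloworldpython3-WBC9LG4EQ4KT")

def make_rating(correction_id, original, corrected, is_like):
    """Build a rating body with the field names the endpoint expects."""
    return {
        "Id": correction_id,
        "Corrected": corrected,
        "Input": original,
        "IsLike": bool(is_like),
    }

def send_rating(rating):
    """POST the rating; returns the HTTP status code."""
    req = request.Request(
        RATE_URL,
        data=json.dumps(rating, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status

# send_rating(make_rating("123", "我不知道", "我不知道", False))
```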

Each week the worst-rated corrections are sent to my email so the correction model can be updated. Thank you for using the rating system!


Original info

Author: Hanwen LIU (HKUST)

A Chinese word correction system with detection and correction functions based on an n-gram language model and Chinese text segmentation. The detection core focuses on runs of continuous singletons, while the correction core focuses on the shape and pronunciation similarity of characters.
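The "continuous singletons" idea is that when the segmenter cannot form multi-character words, it emits a run of single characters, which often signals a typo. The sketch below is an illustration of that idea on a pre-segmented token list, not the repository's actual detector:

```python
def find_singleton_runs(tokens, min_run=2):
    """Return (start, end) index spans of maximal runs of single-character,
    non-punctuation tokens: the signal the detector looks for. A run of
    singletons suggests the segmenter failed to form words there."""
    PUNCT = set("，。！？、；：")
    runs, start = [], None
    for i, tok in enumerate(tokens):
        if len(tok) == 1 and tok not in PUNCT:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and len(tokens) - start >= min_run:
        runs.append((start, len(tokens)))
    return runs

# A plausible segmentation of '我已经等猴多时了。'; the typo 猴 breaks 等候多时
# into singletons. Note the innocent singleton 了 is swept into the run,
# which is exactly the detection weakness discussed under Future Work.
tokens = ["我", "已经", "等", "猴", "多", "时", "了", "。"]
print(find_singleton_runs(tokens))  # -> [(2, 7)]
```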

Usage

```python
In [1]: import Checker

PREPROCESSING DONE!

In [2]: fix = Checker.correct_core('我已经等猴多时了。')

In [3]: for word in fix:
   ...:     print(word)
   ...:

已经
['等候多时', '得道多助', '勇而多计']

In [4]: fix = Checker.correct_core('我已经等猴多时了。')

In [5]: answer = ''

In [6]: for word in fix:
   ...:     if type(word)==list:
   ...:         answer+=word[0]
   ...:     else:
   ...:         answer+=word
   ...:

In [7]: print(answer)
我已经等候多时了
```

Please make sure the data files mentioned in Checker.py are downloaded. Some files are too large for the repository; download them from Google Drive: https://drive.google.com/open?id=1A_rifWNTVLkPeTfKTPN-KaeaqTj09-IG

Files

  • Checker.py: Detection and correction system module.
  • CharSimilarity.py: Character similarity measurement module.
  • Experimental Results.ipynb: Experimental results.
  • sijiao_dict.py: Sijiao codes of characters from https://github.com/contr4l/SimilarCharactor.
  • similar_char_preprocessing.py: Builds the cache of all similarity values between common characters.
  • testing data (folder): Testing data files.
  • chinese_word_correction_data.json: The original training data (provided by Prof. Lei CHEN).
  • weibo_contents_words.set: The vocabulary file of the training data.
  • weibo_contents_words.bin: The trained binary KenLM language model.
  • pd_simi_dic.pkl: The similarity cache file; can be generated by similar_char_preprocessing.py.

Future Work

Although the performance of our correction system is acceptable, there are several problems that should be solved in the future.

First of all, the correction speed of our design is very low. We designed some algorithms to speed the system up, for example precomputing the similarity values between all pairs of common characters. However, the number of candidate combinations is large and many comparisons are still required, so the overall correction speed remains very low. Because of this, we could only test our corrector on small test sets, which may not be convincing.

The next problem is detection accuracy. We have noticed that if several singleton words appear in a row that are not actually errors, the detector still treats them as errors, since that is how our detection algorithm works.

The last problem is the candidate selection algorithm in the correction part. We select candidates by combining each candidate with the prefix words and querying the language model for a score. However, a candidate that segments into more words tends to get a lower score simply because the score is summed over more tokens. For example, when correcting the sentence '平民刘备鉴持卟鞋', the prefix words are '平民' and '刘备' and the candidates are '坚持/补鞋' and '坚持不懈'. Although the first candidate is more similar to the input, its score will probably be lower than the second one's, because a 4-gram query usually scores lower than a 3-gram query.
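The length bias above can be shown with a few numbers. The per-token log probabilities below are made up for illustration, but the mechanism is real: an n-gram model's total score is a sum of per-token log probabilities, so every extra token drags the total down. One common mitigation (not implemented in this system) is to normalize by token count:

```python
# Made-up per-token log10 probabilities for the two candidate queries.
cand_a = ["平民", "刘备", "坚持", "补鞋"]   # 4 tokens, the more similar candidate
cand_b = ["平民", "刘备", "坚持不懈"]       # 3 tokens

logp_a = [-2.1, -2.4, -2.0, -3.0]   # total: -9.5
logp_b = [-2.1, -2.4, -2.8]         # total: -7.3

# Raw sums: cand_b wins only because it has fewer tokens.
print(sum(logp_a), sum(logp_b))

# Per-token normalization removes the length bias; now cand_a wins.
print(sum(logp_a) / len(logp_a), sum(logp_b) / len(logp_b))
```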

Reference

[1] J. L. Peterson, Computer programs for detecting and correcting spelling errors, Communications of the ACM 23 (1980) 676–687.

[2] S. Cucerzan, E. Brill, Spelling correction as an iterative process that exploits the collective knowledge of web users, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

[3] F. Ahmad, G. Kondrak, Learning a spelling error model from search query logs, in: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 955–962.

[4] C.-H. Chang, A new approach for automatic Chinese spelling correction, in: Proceedings of the Natural Language Processing Pacific Rim Symposium, volume 95, Citeseer, pp. 278–283.

[5] Y. Zheng, C. Li, M. Sun, CHIME: An efficient error-tolerant Chinese pinyin input method, in: IJCAI, volume 11, pp. 2551–2556.

[6] fxsjy, Jieba, https://github.com/fxsjy/jieba.

[7] K. Heafield, I. Pouzyrevsky, J. H. Clark, P. Koehn, Scalable modified Kneser-Ney language model estimation, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp. 690–696.

[8] K. Heafield, KenLM: Faster and smaller language model queries, in: Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, pp. 187–197.

[9] K. Heafield, KenLM homepage, https://kheafield.com/code/kenlm/.

[10] D. China, A Chinese text similarity measurement algorithm based on SSC, https://blog.csdn.net/chndata/article/details/41114771.

[11] contr4l, SimilarCharactor, https://github.com/contr4l/SimilarCharactor.
