soft_mask_bert 文本纠错与BERT的最新结合

soft_mask_bert model for Chinese Spelling Correction in keras

本代码是对论文Spelling Error Correction with Soft-Masked BERT结果的复现。

测试环境

python 3.6 + tensorflow 2.3.1 + Keras-2.4.3

依赖 keras_bert

https://github.com/CyberZHG/keras-bert
注意事项：修改keras_bert layers目录的__init__.py, 导出PositionEmbedding类
- 假设python3的安装位置为：/usr/local/lib/python3.6/，执行如下命令
- sudo vim /usr/local/lib/python3.6/site-packages/keras_bert/layers/__init__.py
- 把第二行修改为from .embedding import get_embedding, TokenEmbedding, EmbeddingSimilarity, PositionEmbedding

训练&测试数据集

训练&测试数据集从下面的github获取，sgml格式

https://github.com/wdimmy/Automatic-Corpus-Generation

Bert中文版权重

官方版： https://github.com/google-research/bert
哈工大版： https://github.com/ymcui/Chinese-BERT-wwm

中文拼写纠错 (Spelling Correction)

背景

拼写纠错 (Spelling Correction)任务的目标是在文本中查找并更正拼写错误 (typographical errors),这类错误通常发生在外观、发音相似的字符之间.

示例

输入:

1986年毕业于国防科技大学计算机应用专业，获学时学位。

输出:

1986年毕业于国防科技大学计算机应用专业，获学士学位。
(时 -> 士)

标准评价指标

拼写纠错 (Spelling Correction)任务通常使用accuracy, precision, recall, 以及F1 score指标进行评测.这些指标可以是字符层面(character level)的或者是句子层面(sentence level)的.通常会分别对识别(Detection)和纠正(Correction)进行评估.

识别(Detection): 识别一段文字中所有拼写错误字符的位置，应该做到与正确参考 (gold standard) 相同.
纠正(Correction): 识别的错误字符以及纠正错误字符，应与正确参考 (gold standard)相同.

SIGHAN Bake-off: Chinese Spelling Check Task.

版本: SIGHAN 数据集有3个版本 (2015, 2014, 2013)
综述论文: Tseng et. al. (2015)
Licence: http://nlp.ee.ncu.edu.tw/resource/csc_download.html (academic use only)

test set	# sentence pairs	# characters	#拼写错误 (chars)	字符集	主题 (Genre)
SIGHAN 2015 (Tseng et. al. 2015)	1,100	33,711	715	繁体	第二语言学习 (second-language learning)
SIGHAN 2014 (Yu et. al. 2014)	1,062	53,114	792	繁体	第二语言学习 (second-language learning)
SIGHAN 2013 (Wu et. al. 2013)	2,000	143,039	1,641	繁体	第二语言学习 (second-language learning)

评价指标

(1) 误报率 (False Positive Rate), (2) 识别正确率 (Accuracy), (3) 识别精确率 (precision), (4) 识别召回率 (recall), (5) 识别F1, (6) 纠正正确率 (Accuracy), (7) 纠正精确率 (precision), (8) 纠正召回率 (recall), (9) 纠正 F1
具体实现: http://nlp.ee.ncu.edu.tw/resource/csc_download.html

结果

System	False Positive Rate	Detection Accuracy	Detection Precision	Detection Recall	Detection F1	Correction Accuracy	Correction Precision	Correction Recall	Correction F1
Soft-Masked BERT (Zhang et. al. 2020)	-	80.9	73.7	73.2	73.5	77.4	66.7	66.2	66.4
Confusion-set (Wang et. al. 2019)	-	-	66.8	73.1	69.8	-	71.5	59.5	64.9
FASPell (Hong et. al. 2019)	-	74.2	67.6	60.0	63.5	73.7	66.6	59.1	62.6
CAS (Zhang et. al. 2015)	11.6	70.1	80.3	53.3	64.0	69.2	79.7	51.5	62.5
上表是不同模型在结果SIGHAN 2015测试集上的结果.

其他资源

Source	# sentence pairs	# chars	#拼写错误	字符集	主题 (Genre)
Synthetic training dataset (Wang et. al. 2018)	271,329	12M	382,702	简体	新闻

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
doc		doc
README.md		README.md
data_generator.py		data_generator.py
soft_mask_bert_keras.py		soft_mask_bert_keras.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

doc

doc

README.md

README.md

data_generator.py

data_generator.py

soft_mask_bert_keras.py

soft_mask_bert_keras.py

Repository files navigation

soft_mask_bert 文本纠错与BERT的最新结合

测试环境

依赖 keras_bert

训练&测试数据集

Bert中文版权重

推荐阅读

中文拼写纠错 (Spelling Correction)

背景

示例

标准评价指标

SIGHAN Bake-off: Chinese Spelling Check Task.

评价指标

结果

相关资源

其他资源

About

Releases

Packages

Languages

Source	# sentence pairs	# chars	#拼写错误	字符集	主题 (Genre)
SIGHAN 2015 Training data (Tseng et. al. 2015)	2,334	146,076	2,594	繁体	第二语言学习 (second-language learning)
SIGHAN 2014 Training data (Yu et. al. 2014)	6,526	324,342	10,087	繁体	第二语言学习 (second-language learning)
SIGHAN 2013 Training data (Wu et. al. 2013)	350	17,220	350	繁体	第二语言学习 (second-language learning)

yangxudong/soft_mask_bert

Folders and files

Latest commit

History

Repository files navigation

soft_mask_bert 文本纠错与BERT的最新结合

测试环境

依赖 keras_bert

训练&测试数据集

Bert中文版权重

推荐阅读

中文拼写纠错 (Spelling Correction)

背景

示例

标准评价指标

SIGHAN Bake-off: Chinese Spelling Check Task.

评价指标

结果

相关资源

其他资源

About

Resources

Stars

Watchers

Forks

Languages