Text Classification with Rejection

This repository holds the engineering-grade version of the project.

Keywords:

Notes

The results of every test run are written to result.csv.

0. Data preparation

data.csv must contain at least the columns [sent, target].
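
As a quick sanity check before running the pipeline, the column requirement above can be verified with standard-library Python. This is a minimal sketch, not part of the repository: `missing_columns` and the sample rows are hypothetical, and only the column names `sent` and `target` come from this README.

```python
import csv
import io

REQUIRED_COLUMNS = {"sent", "target"}  # column names taken from this README


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])  # fieldnames is None for an empty file
    return sorted(REQUIRED_COLUMNS - header)


# Hypothetical two-row sample; a real check would pass
# open("data.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("sent,target\n今天天气不错,0\nasdf qwer zxcv,1\n")
print(missing_columns(sample))  # []
```

An empty list means both required columns are present; extra columns are ignored.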

1. Preprocessing

python Preprocessing.py

Generates the columns [sent, sent_chars, sent_words, target], written to:

data_train.csv data_test.csv
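
The shape of this step can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `preprocess_row` and `train_test_split` are hypothetical helpers, the 80/20 split ratio is assumed, and a plain whitespace split stands in for the word segmenter that real Chinese text would need.

```python
import random


def preprocess_row(sent, target):
    """Build one output row with the four columns named in this README."""
    return {
        "sent": sent,
        # Character-level view: every non-space character, space-separated.
        "sent_chars": " ".join(ch for ch in sent if not ch.isspace()),
        # Assumption: a whitespace split stands in for a real word segmenter.
        "sent_words": " ".join(sent.split()),
        "target": target,
    }


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle deterministically and split; the ratio is an assumption."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample data.
raw = [("hello world", 0), ("a  b c", 1), ("你好 世界", 0), ("x y", 1), ("foo bar", 0)]
rows = [preprocess_row(sent, target) for sent, target in raw]
train, test = train_test_split(rows)
print(rows[2]["sent_chars"])  # 你 好 世 界
```

Fixing the seed keeps the train and test files reproducible across runs.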

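2. Rule-based filtering

python FilterRules.py

(This adds a column ['isFilter']. It defaults to None; if a row is filtered out, it holds the name of the violated rule, e.g. _islen.)

To use exploring mode (which evaluates how well the filter works), comment out the filtering call in __main__ and uncomment the exploring call.
To change the filter rules, edit the toFilter function.

python FilterRules.py -task exploring

Exploring mode evaluates the accuracy of the current rules.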
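
The filtering and exploring modes described above might look roughly like the sketch below. The names `isFilter`, `_islen`, `filtering`, and `exploring` come from this README; everything else is an assumption made for illustration: the length threshold, the extra `_isempty` rule, the `to_filter` helper (a stand-in for `toFilter`), and the reading of "accuracy" as the share of filtered rows whose target equals the rejection label.

```python
MAX_LEN = 50  # hypothetical threshold, not taken from the repository


def _islen(sent):
    """Rule: the sentence is longer than MAX_LEN characters."""
    return len(sent) > MAX_LEN


def _isempty(sent):
    """Hypothetical second rule: the sentence is blank."""
    return not sent.strip()


RULES = [_islen, _isempty]


def to_filter(sent):
    """Return the name of the first violated rule, or None if the row passes."""
    for rule in RULES:
        if rule(sent):
            return rule.__name__
    return None


def filtering(rows):
    """Return copies of the rows (dicts) with an added isFilter column."""
    return [dict(row, isFilter=to_filter(row["sent"])) for row in rows]


def exploring(rows, reject_label=1):
    """Share of filtered rows whose target equals reject_label (assumed metric)."""
    hits = [row for row in filtering(rows) if row["isFilter"] is not None]
    if not hits:
        return None  # no row was filtered, so there is nothing to score
    return sum(row["target"] == reject_label for row in hits) / len(hits)


# Hypothetical sample rows.
rows = [
    {"sent": "ok", "target": 0},
    {"sent": "x" * 60, "target": 1},
    {"sent": "  ", "target": 1},
    {"sent": "y" * 51, "target": 0},
]
print([row["isFilter"] for row in filtering(rows)])
```

Keeping each rule as a small named function means the value stored in `isFilter` doubles as a pointer to the code that rejected the row.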

3.1 Features from language models

(1) Train the language model

python LangModelMgr.py

python LangModelMgr.py -n 2 -dtype words -dsource std -dname weibo
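
To make the idea concrete, here is a minimal n-gram language model with add-one smoothing that scores a sentence by perplexity. It is a sketch only: the `NgramLM` class is hypothetical, the README does not say which smoothing or toolkit LangModelMgr.py uses, and reading `-n 2` as the n-gram order and `-dtype words` as word-level tokens is an assumption.

```python
import math
from collections import Counter


class NgramLM:
    """Add-one smoothed n-gram language model over token lists."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = Counter()    # counts of (context + token) tuples
        self.contexts = Counter()  # counts of context tuples
        self.vocab = set()

    def _pad(self, tokens):
        return ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]

    def fit(self, sentences):
        for tokens in sentences:
            padded = self._pad(tokens)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.ngrams[context + (padded[i],)] += 1
                self.contexts[context] += 1
        return self

    def perplexity(self, tokens):
        padded = self._pad(tokens)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen tokens
        log_prob = 0.0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            count = self.ngrams[context + (padded[i],)]
            log_prob += math.log((count + 1) / (self.contexts[context] + v))
        return math.exp(-log_prob / (len(padded) - self.n + 1))


# Hypothetical toy corpus of pre-segmented sentences.
lm = NgramLM(n=2).fit([["我", "喜欢", "猫"], ["我", "喜欢", "狗"]])
fluent = lm.perplexity(["我", "喜欢", "猫"])
scrambled = lm.perplexity(["猫", "我", "喜欢"])
print(fluent < scrambled)  # True
```

A sentence that resembles the training corpus gets a lower perplexity than a scrambled one, which is what makes perplexity usable as a feature for spotting abnormal text.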

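(2) Feature engineering

python FeatureEngr.py

Generates data_train_feat.csv and data_test_feat.csv.

(3) Feature selection

python Visualization.py

Generates a heatmap of the Pearson correlation coefficients between the features and the label.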

python Visualization.py -plot len l3_neg_ppl

python FeatureEngr.py -del len
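
The selection criterion behind the heatmap can be reproduced numerically. The sketch below computes the Pearson correlation of each feature with the label and ranks the features by absolute value. The helpers `pearson` and `rank_features` are hypothetical, and the feature values are invented for illustration; only the feature names `len` and `l3_neg_ppl` appear in this README.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented feature values for four samples.
columns = {
    "len": [5, 7, 6, 5],
    "l3_neg_ppl": [-10.0, -80.0, -12.0, -75.0],
}
target = [0, 1, 0, 1]
for name, r in rank_features(columns, target):
    print(f"{name}: {r:+.3f}")
```

In this toy data `len` correlates only weakly with the label, which is the kind of feature one might then drop with the `-del` option shown above.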

(4) Discriminative model

python DiscriminantModel.py

Trained models are saved as *.model files under /Model.

Based on word vectors

(1) Obtain word vectors

python ToVectorMgr.py

Generates data_train_chars_d2v.vec and data_test_chars_d2v.vec.

Document-level Doc2Vec is used by default.
Document-level Word2Vec (not yet implemented)
Vocabulary-level WordList2Vec (not yet implemented)

(2) Generative model

python GenerativeModel.py

SVM is the default model; LR or MLP can be selected instead.

Neural networks

python DeepNet.py

fasttext is used by default.

python DeepNet.py -net textcnn

Ensemble learning

python Ensemble.py
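
This README does not say how the ensemble script combines the individual models. As one common possibility, the sketch below applies majority voting over the label predictions of several classifiers. The `majority_vote` function and the prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority vote."""
    combined = []
    for sample_votes in zip(*predictions):
        # most_common(1) resolves ties in favor of the label encountered first,
        # i.e. the vote of the earliest model in the list.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three models for four samples.
model_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
print(majority_vote(model_predictions))  # [0, 1, 0, 0]
```

Using an odd number of models avoids ties for binary labels; with an even number, the order of the models decides tied samples.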


SixingYan/ErrorTextDetection
