Dementia

file list: wikiXMLtoTXT.py, train_word2vec_model.py, data_preprocess.py, tokenize_data_helper.py, syntactic_extract.py, semantic_extract.py, semantic_pointer_sentence_embedding.py, cluster.py, test_cluster.py, CNN_text_classification.py

Section 1: Word Embedding

wikiXMLtoTXT.py: 將wiki corpus的XML轉換成TXT檔，每一個換行符代表一篇文章

train_word2vec_model.py: 包含了兩個functions，一個是將corpus進行分詞並儲存，會用到dict.txt.big來擴充jieba分詞的辭典；一個是word embedding投影向量空間訓練

Section 2: Data Preprocess

data_preprocess.py: 將收集到的control資料由CSV轉成TXT檔;每一筆包含一行的ID碼與一行的敘述;每一比包含一句話與是否來自失智患者；將label從二為轉為一維;分詞;分詞加詞性標注;將段落依標點符號切分為句子；load word2vec model by gensim；產生sentence vector by average of word vectors

tokenize_data_helper.py: tokenize data by keras Tokenizer; pad tokenized list; inverse tokens to string; get embedding matrix using pretrained w2v

Section 3: Syntactic and Semantic Clustering

Syntacitc feature extract by using postag, semantic feature extract by sentence vector Using K-mean cluster on control and dementia data Plot scatter of the result Predicting a unknown sentence is dementia or not by k-mean precit function

Section 4: CNN text mining, Using CNN on text raw data with three filter_size (3,4,5) and each has 128 filters to extract feature automaticly. Classification layer is also connect on the last layer using softmax cross entropy. Two types of embedding layer is used: one hot encoding and pretrained word2vec model. Training step is using one-of-ten cross validation Test step is using 5 of each class, i.e. 5 control& 5 dementia

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.ipynb_checkpoints		.ipynb_checkpoints
classify_result_pkl		classify_result_pkl
cluster_result_pickle		cluster_result_pickle
jupyter_notebook		jupyter_notebook
.gitignore		.gitignore
CNN_text_classifictaion.py		CNN_text_classifictaion.py
README.md		README.md
ckip_synt.csv		ckip_synt.csv
cluster.py		cluster.py
data_preprocess.py		data_preprocess.py
jieba_synt.csv		jieba_synt.csv
result_to_csv.py		result_to_csv.py
ros_synt_seman.py		ros_synt_seman.py
semantic_extract.py		semantic_extract.py
semantic_pointer_sentence_embedding.py		semantic_pointer_sentence_embedding.py
stanford_example.py		stanford_example.py
syntactic_extract.py		syntactic_extract.py
test_cluster.py		test_cluster.py
tokenize_data_helper.py		tokenize_data_helper.py
train_word2vec_model.py		train_word2vec_model.py
wikiXMLtoTXT.py		wikiXMLtoTXT.py

YYLMike/dementia_text

Folders and files

Latest commit

History

Repository files navigation

Dementia

About

Resources

Stars

Watchers

Forks

Languages