Personalized Medicine

Project Overview

Personalized Medicine (PM) [1] promises to make health care more efficient and safer by enabling earlier diagnoses, better risk assessments, and optimal treatments for patients. However, PM in cancer treatment is still developing slowly because a large body of text-based medical literature and clinical observations must be analyzed manually. To address this issue and speed up the progress of PM, we propose an efficient text classifier that automatically classifies the effect of genetic variants. We show that contributions from the NLP community and domain specialists can help advance PM.

Methods

Data

The data come from a Kaggle competition and Memorial Sloan Kettering Cancer Center.

Two data files were used in this project, with 3321 data points in total. Gene and Variation are categorical. Class contains 9 unique values. TEXT contains the biomedical literature related to each class (the effect of the genetic variant).

training_text.zip: ID TEXT
training_variants.zip: ID Gene Variation Class
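
As a rough illustration of how the two files fit together, the sketch below joins them on ID using only the standard library. It assumes the unzipped variants file is comma-separated and the text file uses an `ID||text` layout after a single header line; both layouts, the helper names, and the toy rows are assumptions for illustration, not taken from the project code.

```python
import csv
import io

# Toy in-memory stand-ins for the two unzipped files (all values are made up).
VARIANTS_CSV = """ID,Gene,Variation,Class
0,GENE_A,Truncating Mutations,1
1,GENE_B,X123Y,2
"""
TEXT_FILE = """ID,Text
0||First toy abstract about a truncating mutation.
1||Second toy abstract about a point mutation.
"""

def read_variants(handle):
    """Return {ID: row dict} from the comma-separated variants file."""
    return {int(row["ID"]): row for row in csv.DictReader(handle)}

def read_text(handle):
    """Return {ID: text}; assumes 'ID||text' lines after one header line."""
    next(handle)  # skip the header line
    texts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        doc_id, _, text = line.partition("||")
        texts[int(doc_id)] = text
    return texts

def merge_on_id(variants, texts):
    """Join the two tables on ID; a missing text becomes an empty string."""
    merged = []
    for doc_id in sorted(variants):
        row = variants[doc_id]
        merged.append({
            "ID": doc_id,
            "Gene": row["Gene"],
            "Variation": row["Variation"],
            "Class": int(row["Class"]),
            "TEXT": texts.get(doc_id, ""),
        })
    return merged

records = merge_on_id(read_variants(io.StringIO(VARIANTS_CSV)),
                      read_text(io.StringIO(TEXT_FILE)))
```

With real data, the `io.StringIO` objects would be replaced by file handles opened on the unzipped files.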

Data Preprocessing

Tools: NLTK, SpaCy
Steps:
- Tokenization
- Removal of punctuation
- Lemmatization
- Removal of stop words
- Removal of integers and of single-integer, single-character combinations (special consideration for biomedical literature)
- Lowercasing
(Stop words were removed when building machine learning classifiers but not when building neural networks, because neural networks try to learn semantic meaning, and the meaning of a word depends on its context.)
Special considerations for biomedical literature:
- In addition to the stop words suggested by the tools, remove words that are common but uninformative in biomedical literature, such as 'figure', 'fig', 'table', 'tab', 'supplement', 'supplementary', 'download', 'author', 'et', 'al', etc.
- Remove integers that are not adjacent to any character (in-text citation numbers).
- Remove single-integer, single-character combinations (figure indices in the literature), such as '1a', '7c', etc.
Any null text was replaced by the merged string of the gene and variation values of the same row. Any space in the Gene and Variation columns was replaced by an underscore.
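
The cleaning rules above can be sketched in plain Python as follows. This is a simplified illustration, not the project's pipeline: lemmatization is omitted because it needs NLTK or SpaCy, the general stop-word set is a tiny stand-in for the tool-provided lists, and the function names, the single `remove_stop_words` switch, and the space used to merge gene and variation are all assumptions.

```python
import re
import string

# Illustrative subsets only; the real lists come from NLTK/SpaCy plus the
# biomedical words named above.
GENERAL_STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "as", "by"}
DOMAIN_STOP_WORDS = {"figure", "fig", "table", "tab", "supplement",
                     "supplementary", "download", "author", "et", "al"}

BARE_INTEGER = re.compile(r"\d+")      # e.g. in-text citation numbers
FIGURE_INDEX = re.compile(r"\d[a-z]")  # e.g. '1a', '7c'

def clean_text(text, remove_stop_words=True):
    """Lowercase, strip surrounding punctuation, and drop unwanted tokens."""
    kept = []
    for raw in text.lower().split():
        tok = raw.strip(string.punctuation)
        if not tok:
            continue  # the token was punctuation only
        if BARE_INTEGER.fullmatch(tok) or FIGURE_INDEX.fullmatch(tok):
            continue
        if remove_stop_words and (tok in GENERAL_STOP_WORDS
                                  or tok in DOMAIN_STOP_WORDS):
            continue
        kept.append(tok)
    return kept

def fill_missing_text(text, gene, variation):
    """Fall back to 'Gene Variation' (spaces -> underscores) for null text."""
    if text is None or not text.strip():
        return f"{gene.replace(' ', '_')} {variation.replace(' ', '_')}"
    return text

sentence = "As shown in Fig. 1a, the BRAF V600E mutation is oncogenic [12]."
for_ml = clean_text(sentence)                           # stop words removed
for_nn = clean_text(sentence, remove_stop_words=False)  # stop words kept
```

Stripping punctuation only at token boundaries keeps identifiers with inner punctuation intact, which seemed the safer choice for gene and variant names in this sketch.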

Exploratory Data Analysis

Univariate and bivariate analyses show the distribution of the categorical variables and the interactions between them.
The text analyses cover the distribution of word and character counts across all text data, the distribution of unigrams, bigrams, and trigrams for each class, and word cloud plots for each class.
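
A minimal sketch of the per-class n-gram counting behind such plots is shown below. The function names and the toy documents and labels are hypothetical; the project's own EDA scripts may compute these distributions differently.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts_by_class(docs, labels, n):
    """Map each class label to a Counter of its n-gram frequencies."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(ngrams(tokens, n))
    return counts

# Toy tokenized documents and class labels (made up for illustration).
docs = [
    ["loss", "of", "function", "mutation"],
    ["gain", "of", "function", "mutation"],
    ["loss", "of", "function", "variant"],
]
labels = [1, 7, 1]

bigram_counts = ngram_counts_by_class(docs, labels, n=2)
top_class_1 = bigram_counts[1].most_common(2)  # most frequent bigrams in class 1
```

Calling the same function with `n=1` or `n=3` gives the unigram and trigram distributions.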

Here is the EDA demo and some interesting findings.

Feature Extraction

Methods:

One-hot encoding gene and variation features
Bag-of-Words (tools: count vectorizer and TF-IDF vectorizer) and word embeddings (sources: Stanford GloVe, BioWordVec, BioSentVec, BioConceptVec, pubmed2018_w2v_200D, pubmed2018_w2v_400D) for the text feature.
For the BERT model, a model pre-trained by Microsoft on PubMed biomedical literature.

Features for training machine learning models: one-hot encoded gene and variation, plus count and TF-IDF vectors of the text reduced with truncated SVD.

Features for training neural networks: feature vectors produced by the pubmed2018_w2v_400D pre-trained model (except for BERT).
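
To make the machine learning feature layout concrete, here is a small standard-library sketch that one-hot encodes a categorical column, builds count vectors, and applies a textbook TF-IDF weighting (`tf * ln(N / df)`). It is an illustration under assumptions: library vectorizers use different smoothing and normalization, the truncated SVD step is left out, and all names and toy inputs are hypothetical.

```python
import math
from collections import Counter

def one_hot(values):
    """One-hot encode a categorical column; returns (vectors, categories)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return vectors, categories

def count_vectors(docs):
    """Bag-of-words counts; returns (vectors, vocabulary)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vectors, vocab

def tfidf_vectors(count_rows):
    """Textbook TF-IDF: term count times ln(N / document frequency)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]  # df > 0 for every vocab term
    return [[row[j] * idf[j] for j in range(n_terms)] for row in count_rows]

# Toy inputs (made up): one gene per document plus tokenized text.
genes = ["GENE_A", "GENE_B", "GENE_A"]
docs = [["kinase", "mutation"],
        ["mutation", "deletion", "deletion"],
        ["kinase", "domain"]]

gene_vecs, gene_cats = one_hot(genes)
counts, vocab = count_vectors(docs)
tfidf = tfidf_vectors(counts)

# Final feature row = categorical one-hot part followed by the text part.
features = [g + t for g, t in zip(gene_vecs, tfidf)]
```

On the full dataset the text part is very wide, which is why a dimensionality reduction step such as truncated SVD is applied before concatenation.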

Here is an evaluation of some text feature extraction methods.

Machine Learning Methods (Baseline)

Eight supervised machine learning methods were applied:

Support Vector Machine (SVM)
Logistic Regression (LR)
k-Nearest Neighbors (kNN)
Random Forest (RF)
Adaptive Boosting (AdaBoost)
eXtreme Gradient Boosting (XGBoost)
Multi-Layer Perceptron (MLP)
An ensemble (voting) model of LR, RF, kNN, and SVM
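
The text does not say whether the ensemble uses hard or soft voting, so the sketch below shows both under that caveat. The function names and the toy probabilities are hypothetical, and three classes are used instead of nine to keep the example short.

```python
from collections import Counter

def soft_vote(prob_rows):
    """Average the class-probability vectors of several models (one sample)."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) / n_models
            for c in range(n_classes)]

def argmax_class(probs, classes):
    """Return the class label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return classes[best]

def hard_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

classes = [1, 2, 3]
# Toy per-model probabilities for one sample (made up), in LR, RF, kNN, SVM order.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
    [0.5, 0.2, 0.3],
]

avg_probs = soft_vote(model_probs)
soft_prediction = argmax_class(avg_probs, classes)
hard_prediction = hard_vote([argmax_class(p, classes) for p in model_probs])
```

Soft voting yields averaged probabilities, which a probability-based metric such as log loss requires; hard voting yields only a label.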

Here is an evaluation of model performance.

Neural Networks

The implemented models are CNN [2] and BiLSTM [3]. RCNN, RNN+Attention, and BERT are worth trying.

The pre-trained word embedding used is pubmed2018_w2v_400D.

Evaluation Metrics

Log loss / Cross entropy loss
Balanced accuracy
F1-score (micro-average)
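
For reference, the three metrics can be written out directly from their definitions. This is a plain-Python sketch with hypothetical function names and toy labels; in practice a library implementation would normally be used.

```python
import math
from collections import defaultdict

def log_loss(y_true, probs, classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    index = {c: i for i, c in enumerate(classes)}
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[index[label]], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    hits = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 from TP/FP/FN pooled over all classes."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # In single-label classification every error is one false positive
    # (for the predicted class) and one false negative (for the true class).
    fp = fn = len(y_true) - tp
    return 2 * tp / (2 * tp + fp + fn)

# Toy two-class example (made up).
classes = [1, 2]
y_true = [1, 1, 1, 2]
y_pred = [1, 1, 2, 2]
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]

ll = log_loss(y_true, probs, classes)
bal_acc = balanced_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
```

For single-label multi-class problems, micro-averaged F1 equals plain accuracy, so balanced accuracy is the metric here that responds to class imbalance.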

Results and Discussions

An EDA demo
Comparison of oversampling techniques
Traditional oversampling does not appear to solve the class-imbalance problem. Worse, it can introduce serious overfitting.
Comparison of pre-trained word embedding models
- The representational power of a pre-trained word embedding model depends heavily on the dataset itself.
Evaluation of eight machine learning methods
Evaluation of several neural-network-based models
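
On the oversampling remark above: the text does not name the techniques that were compared, so the sketch below shows only the simplest one, random oversampling, as an assumption. It duplicates minority-class samples until every class matches the largest class, which hints at why overfitting can follow: the model sees exact copies of the same few documents. The function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy imbalanced training set (made up): class 1 is three times larger.
samples = ["doc_a", "doc_b", "doc_c", "doc_d"]
labels = [1, 1, 1, 2]

new_samples, new_labels = random_oversample(samples, labels)
class_sizes = Counter(new_labels)
```

Resampling of this kind should be applied to the training split only; applying it before the split would place copies of the same document in both training and evaluation data.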

Future Work

Include different types of data from other sources, e.g., personal information (family disease history, age, race, etc.).
Implement three or more additional NN-based models.
Combine / stack NN models.
Build non-static NN models. Non-static NN models (whose word embeddings are fine-tuned during training) are reported to generally outperform static NN models.
Concatenate multiple word vector representations (e.g., pubmed2018_w2v_400D and BioConceptVec). In addition to word vectors trained on PubMed, biological concepts are important features.
Deal with imbalanced text data through sentence / word augmentation using nlpaug.
Build a hybrid model / multi-model: one part trained on text data, the other trained on sequence data to capture genetic variants (as DeepSEA does).

Environment

python 3.8

pytorch 1.7.0

Directory

```
Personalized-Medicine
├── eight-ml-classifiers
│   ├── images
│   │   ├── confusion-matrix
│   │   ├── learning-curve
│   │   ├── Accuracy_allmodel.png
│   │   ├── F1score_allmodels.png
│   │   └── logloss_allmodels.png
│   ├── README.md
│   ├── data-preprocessing_v1.py
│   ├── data-preprocessing_v2.py
│   ├── feature-extraction.py
│   ├── model-evaluation.py
│   ├── performance-of-ml-classifiers.ipynb
│   ├── train-models.py
│   ├── workflow-part1.ipynb
│   └── workflow-part2.ipynb
├── exploratory-data-analysis
│   ├── images
│   │   ├── other_distribution
│   │   │   ├── dist_char.png
│   │   │   ├── dist_class.png
│   │   │   ├── dist_gene.png
│   │   │   ├── dist_variation.png
│   │   │   ├── dist_word.png
│   │   │   ├── gene_class.png
│   │   │   └── word_class.png
│   │   ├── uni_bi_trigram_distribution
│   │   │   ├── bi_c1.png
│   │   │   ├── ...
│   │   │   ├── bi_c9.png
│   │   │   ├── tri_c1.png
│   │   │   ├── ...
│   │   │   ├── tri_c9.png
│   │   │   ├── uni_c1.png
│   │   │   ├── ...
│   │   │   └── uni_c9.png
│   │   └── wordcloud_image
│   │       ├── wordCloud_class_1.png
│   │       ├── ...
│   │       ├── wordCloud_class_9.png
│   │       ├── wordCloud_not_strict.png
│   │       └── wordCloud_strict.png
│   ├── eda-demo.ipynb
│   ├── eda-gene-variation.py
│   ├── eda-text.py
│   └── resampling.ipynb
├── neural-nets
│   ├── image
│   │   ├── CE
│   │   ├── acc
│   │   ├── cm
│   │   ├── f1score
│   │   └── logloss
│   ├── models
│   │   ├── __pycache__
│   │   ├── CNN.py
│   │   └── BiLSTM.py
│   ├── LICENSE
│   ├── run.py
│   ├── train_eval.py
│   ├── utils.py
│   └── visualize.py
├── word-embedding-and-bow
│   ├── README.md
│   ├── bioconceptvec-rf.py
│   ├── biosentvec-rf.py
│   ├── biowordvec-rf.py
│   ├── glove-rf.py
│   ├── tfidf-count-rf.py
│   └── word2vec-rf.py
├── LICENSE.txt
└── README.md

```

References

[1] Personalized Medicine: Part 1: Evolution and Development into Theranostics
[2] Convolutional Neural Networks for Sentence Classification
[3] Recurrent Neural Network for Text Classification with Multi-Task Learning

JoKerDii/Personalized-Medicine
