An Empirical Evaluation of CNNs and RNNs for ICD-9 Code Assignment using MIMIC-III Clinical Notes

  • Members: Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy (all three provided equal contribution)
  • Report: link
  • If you used this code in your work, please cite the following publication: Huang, J., Osorio, C., & Sy, L. W. (2018). An Empirical Evaluation of Deep Learning for ICD-9 Code Assignment using MIMIC-III Clinical Notes. Retrieved from https://arxiv.org/pdf/1802.02311.pdf
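
For convenience, here is a BibTeX entry derived from the reference above (the citation key is illustrative):

    @article{huang2018empirical,
      title   = {An Empirical Evaluation of Deep Learning for ICD-9 Code Assignment using MIMIC-III Clinical Notes},
      author  = {Huang, Jinmiao and Osorio, Cesar and Sy, Luke Wicent},
      journal = {arXiv preprint arXiv:1802.02311},
      year    = {2018}
    }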

If you have questions on how to run this code, feel free to message us via GitHub.

General Pipeline

  1. (optional) Clean NOTEEVENTS.csv using PostgreSQL: import NOTEEVENTS.csv by adapting the MIMIC-III GitHub scripts, then strip embedded newlines with "select regexp_replace(field, E'[\n\r]+', ' ', 'g')". The cleaned version (NOTEEVENTS-2.csv) can be downloaded from the Google Drive mentioned in "Environment Setup (local)". A pandas alternative is sketched after this list.
  2. Run preprocess.ipynb to produce DATA_HADM and DATA_HADM_CLEANED.
  3. Run describe_icd9code.ipynb and describe_icd9category.ipynb to produce the descriptive statistics.
  4. (optional) Run word2vec-generator.ipynb to produce the word2vec models.
  5. Run feature_extraction_seq.ipynb and feature_extraction_nonseq.ipynb to produce the input features for the machine learning and deep learning classifiers.
  6. Run ml_baseline.py to get the results for Logistic Regression and Random Forest.
  7. Run nn_baseline_train.py and nn_baseline_test.py to get the results for the Feed-Forward Neural Network.
  8. Run wordseq_train.py and wordseq_test.py to get the results for Conv1D, RNN, LSTM, and GRU (refer to 'help' or the guide below on training and testing the Keras deep learning models).
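
If you prefer not to use PostgreSQL for step 1, the same newline cleanup can be done in pandas. This is a minimal sketch, assuming NOTEEVENTS.csv sits in the working directory and the note text lives in the TEXT column:

    import pandas as pd

    # Collapse embedded newlines in each note so every record fits on one CSV row,
    # mirroring the regexp_replace(field, E'[\n\r]+', ' ', 'g') step above.
    notes = pd.read_csv("NOTEEVENTS.csv")
    notes["TEXT"] = notes["TEXT"].str.replace(r"[\n\r]+", " ", regex=True)
    notes.to_csv("NOTEEVENTS-2.csv", index=False)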

Training and Testing for the Feed-Forward Neural Network

  • Prerequisite: Keras + TensorFlow, or Keras + Theano

  • Models are specified in nn_baseline_models.py.

  • Run nn_baseline_preprocessing to prepare the data for training and testing.

  • Training:

    • Run training with the default arguments: python nn_baseline_train.py
    • Or run the training script with customized arguments: python nn_baseline_train.py --epoch 10 --batch_size 128 --model_name nn_model_1 --pre_train False
    • Please refer to the parse_args() function in nn_baseline_train.py for the full list of input arguments (an illustrative sketch follows below).
  • Testing:

    • Test the model with the default model and data file: python nn_baseline_test.py
    • Please refer to the parse_args() function in nn_baseline_test.py for the full list of input arguments.
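
For reference, the training flags above map onto a standard argparse setup. The sketch below is illustrative only (the defaults shown are assumptions); the authoritative list is the parse_args() function in the script itself:

    import argparse

    def parse_args():
        # Illustrative only: the flags shown in the examples above, with assumed defaults.
        parser = argparse.ArgumentParser(description="Feed-forward baseline training")
        parser.add_argument("--epoch", type=int, default=10)
        parser.add_argument("--batch_size", type=int, default=128)
        parser.add_argument("--model_name", type=str, default="nn_model_1")
        # argparse's type=bool is a trap (bool("False") is True), so parse explicitly.
        parser.add_argument("--pre_train", type=lambda s: s.lower() == "true",
                            default=False)
        return parser.parse_args()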

Training and Testing for Recurrent and Convolutional Neural Networks

  • Similar to the Feed-Forward Neural Network, users can run training and testing with the default settings in wordseq_train.py and wordseq_test.py. All the model architectures are specified in wordseq_models.py.

Environment Setup (local)

  1. conda env create -f environment.yml
  2. Install Spark or download a Spark binary from here
  3. pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
    • The above command should install Toree. If it fails, refer to the GitHub link.
    • Note that Toree was not included in environment.yml because including it there did not work for me before.
  4. jupyter toree install --user --spark_home=/spark-2.1.0-bin-hadoop2.7 --interpreters=PySpark
  5. Extract the following files to the directory "code/data":
    • DIAGNOSES_ICD.csv (from MIMIC-III database)
    • NOTEEVENTS-2.csv (cleaned version of MIMIC-III NOTEEVENTS.csv, replaced '\n' with ' ') link
    • D_ICD_DIAGNOSES.csv (from MIMIC-III database)
    • model_word2vec_v2_*dim.txt (generated word2vec)
    • bio_nlp_vec/PubMed-shuffle-win-*.txt Download here (you will need to convert the .bin files to .txt; I used gensim to do this, as sketched after this list)
    • model_doc2vec_v2_*dim_final.csv (generated doc2vec)
  6. To run the data preprocessing, descriptive statistics, and other notebook steps, start the Jupyter notebook. Don't forget to set the kernel to "Toree PySpark".
    • jupyter notebook
  7. To run the deep learning experiments, follow the corresponding guide below.
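
The .bin-to-.txt conversion mentioned in step 5 can be done with gensim. A minimal sketch, assuming a recent gensim release; adjust the file names to the binaries you actually downloaded:

    from gensim.models import KeyedVectors

    # Load the binary word2vec file and re-save it in the plain-text format
    # expected under code/data.
    vectors = KeyedVectors.load_word2vec_format(
        "bio_nlp_vec/PubMed-shuffle-win-2.bin", binary=True)
    vectors.save_word2vec_format(
        "bio_nlp_vec/PubMed-shuffle-win-2.txt", binary=False)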

Environment Setup (azure)

  1. Set up Docker with GPU support by following this guide
  2. Using Azure's portal, select the VM's firewall (in my case, it appeared as "azure01-firewall" under "all resources"), then "allow" ports 22 (SSH) and 8888 (Jupyter) for both inbound and outbound.
  3. You can SSH into the VM through one of the following:
    • docker-machine ssh azure01
    • ssh docker-user@public_ip_addr
  4. Spark can be installed by following the instructions in "Environment Setup (local)", but note that this will not be as powerful as HDInsight. I recommend taking advantage of the VM's large memory by setting the Spark memory to a higher value in conf/spark-defaults.conf (see the example after this list).
  5. If you have a jupyter notebook running in this VM, you can access via http://public_ip_addr:8888/
  6. To enable the GPUs for deep learning, follow the instructions on the TensorFlow website link
    • You can check the GPUs' status with "nvidia-smi"
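
For step 4, raising the Spark memory is a two-line change in conf/spark-defaults.conf. The values below are examples only; size them to your VM's RAM:

    # conf/spark-defaults.conf
    spark.driver.memory    16g
    spark.executor.memory  16g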
