TheeChris/hospital_readmission

Using natural language processing with word embeddings and topic modeling on clinical notes to predict hospital readmission

Thirty-day hospital readmission rates are widely used as a key metric of patient care. In 2012, the Affordable Care Act initiated the Hospital Readmissions Reduction Program (HRRP), which incentivizes improved patient outcomes by financially penalizing hospitals with excessive readmission rates. According to the American Hospital Association, hospitals incurred $1.9 billion in penalties in the first five years of the HRRP.

This project uses 283,208 clinical notes (nursing notes and discharge summaries) covering 35,779 patient admissions from the MIMIC-III database. Several natural language processing models were assessed on their ability to predict all-cause 30-day readmission for all patients, excluding neonates. A Random Forest model with a bag-of-words approach was judged the best fit, with a ROC-AUC of 0.7076. A preliminary skip-gram word embedding model using Word2Vec with Latent Dirichlet Allocation (LDA) and logistic regression showed promising results, with an essentially equal ROC-AUC of 0.7078. Future work could focus on more effective classification algorithms and hyperparameter tuning.
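The prediction target described above can be illustrated with a minimal sketch of how a 30-day readmission label might be derived from admission records. This is not the project's actual preprocessing code: the field names (`patient_id`, `admission_id`, `admit_time`, `discharge_time`) are hypothetical, the rule "next admission within 30 days of discharge" is an assumed definition, and exclusions such as neonates are not handled here.

```python
from datetime import datetime, timedelta


def label_30_day_readmissions(admissions, window_days=30):
    """Map each admission_id to 1 if the same patient is admitted again
    within `window_days` of that admission's discharge, else 0.

    `admissions` is a list of dicts with hypothetical keys:
    patient_id, admission_id, admit_time, discharge_time.
    """
    # Group admissions by patient so each stay is compared with the next one.
    by_patient = {}
    for adm in admissions:
        by_patient.setdefault(adm["patient_id"], []).append(adm)

    window = timedelta(days=window_days)
    labels = {}
    for stays in by_patient.values():
        stays.sort(key=lambda a: a["admit_time"])
        # Pair each stay with the following one; the last stay has no successor.
        for current, following in zip(stays, stays[1:] + [None]):
            readmitted = (
                following is not None
                and following["admit_time"] - current["discharge_time"] <= window
            )
            labels[current["admission_id"]] = int(readmitted)
    return labels


# Toy records, invented for illustration only.
example_admissions = [
    {"patient_id": 1, "admission_id": 100,
     "admit_time": datetime(2020, 1, 1), "discharge_time": datetime(2020, 1, 5)},
    {"patient_id": 1, "admission_id": 101,
     "admit_time": datetime(2020, 1, 20), "discharge_time": datetime(2020, 1, 25)},
    {"patient_id": 1, "admission_id": 102,
     "admit_time": datetime(2020, 6, 1), "discharge_time": datetime(2020, 6, 3)},
    {"patient_id": 2, "admission_id": 200,
     "admit_time": datetime(2020, 3, 1), "discharge_time": datetime(2020, 3, 2)},
]
example_labels = label_30_day_readmissions(example_admissions)
```

In this toy data only admission 100 is followed by another admission within 30 days of discharge, so it is the only positive label.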

Data Source

MIMIC-III v1.4

Table of Contents

  1. Predictive Modeling Notebook: runs through each model that was attempted and outputs its ROC curve and confusion matrix
  2. Slide Deck: used to present the findings from exploration and modeling
  3. notebooks: exploratory notebooks that show the model-building process
    1. DataPrep: collecting and cleaning data
    2. DataExploration: examining trends in the data
    3. Logistic Regression: bag-of-words models using logistic regression
    4. Word2Vec: word embeddings models
    5. ULMFiT: pre-trained language model
    6. Random Forest: bag-of-words models using random forest
    7. SVM: bag-of-words models using support vector machine
  4. reports
    1. Figures: all saved plot outputs
    2. Final Report: a summary of the project process and results
    3. Milestone Report 1: a summary of data preparation and exploration
    4. Milestone Report 2: a summary of the initial logistic regression model with over- and under-sampling
  5. src: source code for modules used in the project
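Milestone Report 2 above mentions over- and under-sampling. As a rough illustration of that idea, and not the code in `src`, the following sketch balances a binary-labeled dataset by random resampling. The function name `random_resample` and its parameters are hypothetical.

```python
import random


def random_resample(rows, labels, mode="over", seed=0):
    """Balance a binary dataset by random over- or under-sampling.

    mode="over": duplicate random minority examples until classes match.
    mode="under": keep a random subset of the majority class.
    Assumes labels are 0/1 and both classes are present.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    # The shorter index list is the minority class.
    minority, majority = sorted([positives, negatives], key=len)

    if mode == "over":
        # Sample minority indices with replacement to close the gap.
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        keep = majority + minority + extra
    elif mode == "under":
        # Sample majority indices without replacement down to minority size.
        keep = minority + rng.sample(majority, len(minority))
    else:
        raise ValueError("mode must be 'over' or 'under'")

    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 2 positive and 8 negative examples.
toy_rows = list(range(10))
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
over_rows, over_labels = random_resample(toy_rows, toy_labels, mode="over")
under_rows, under_labels = random_resample(toy_rows, toy_labels, mode="under")
```

Resampling of this kind is normally applied to the training split only, so that evaluation metrics still reflect the original class balance.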

Latent Dirichlet Allocation Topic Modeling

LDA Topics

Random Forest Bag-of-Words ROC Curve and Confusion Matrix

ROC Curve

Confusion Matrix
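For readers unfamiliar with the two evaluation outputs named above, here is a small dependency-free sketch of how a ROC-AUC value and a confusion matrix can be computed from predicted scores. It is illustrative only: the data is invented, the 0.5 threshold is an assumption, and in practice a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Ties count as half. Pairwise O(n^2), fine for small illustrations."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_matrix(y_true, scores, threshold=0.5):
    """Return [[TN, FP], [FN, TP]]: rows are true classes, columns are
    predicted classes, with a score >= threshold predicted as positive."""
    matrix = [[0, 0], [0, 0]]
    for y, s in zip(y_true, scores):
        predicted = int(s >= threshold)
        matrix[y][predicted] += 1
    return matrix


# Invented labels and scores for illustration.
demo_true = [0, 0, 1, 1, 0, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
demo_auc = roc_auc(demo_true, demo_scores)
demo_matrix = confusion_matrix(demo_true, demo_scores)
```

ROC-AUC does not depend on a threshold, whereas the confusion matrix does, which is why the two are usually reported together.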
