bfetler/coronary_disease

Coronary Heart Disease

How well can we predict heart disease from patient data? Ideally, we would measure a set of parameters for a patient and predict whether they will develop heart disease, when, and how severe it will be.

A dataset from a 1988 coronary disease study is available as the UCI Machine Learning Heart Disease Dataset. Data was collected at the Cleveland Clinic from 303 patients with and without heart disease. Because there are few data points, patients with varying severity of heart disease were grouped together into a single target variable. Some of the seventy-five original data columns were corrupted, and the data author replaced them with fourteen columns. After data cleaning, 297 patients remained. Despite its small size, it is a reasonable dataset for starting to explore coronary disease prediction.

Exploration

Data exploration and prediction are implemented in coronary_predict.py. The data was randomly split into 70% training data and 30% test data. A scatter matrix of the training data shows some correlation between variables, for example between maximum heart rate and fluoroscopy vessel count, but no strong trends are apparent. Due to its large variability (standard error divided by mean greater than 2), the b_sugar_up column was dropped.

[Figure: scatter matrix of training data]
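The split and the variability screen described above can be sketched as follows. This is a minimal sketch, not the actual code in coronary_predict.py: the column names are taken from the dataset, but the values are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (297 patients); the real
# columns come from the UCI Heart Disease dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, 297),
    "heart_rate": rng.integers(70, 203, 297),
    "b_sugar_up": rng.integers(0, 2, 297),
})

# 70% / 30% random train/test split.
train, test = train_test_split(df, test_size=0.3, random_state=42)

# Drop columns with large variability (standard error / mean > 2).
ratio = train.sem() / train.mean()
train = train.drop(columns=ratio[ratio > 2].index)
```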

Histograms of train and test data show similar patterns, indicating that each variable's values are distributed similarly across the two splits.

[Figure: coronary training data histograms]

[Figure: coronary test data histograms]

Evaluating Incoming Test Data

If test data for groups of patients arrives in periodic batches, we can compare variable distributions between train and test data to check whether incoming data is statistically different from the training data and needs attention; this also tells us something about the validity of the procedure. We can model this process with the existing test data, using an independent t-test to compare each variable column between train and test. Typical p-values are given in the table below; all are > 0.05 (no significant difference).

| variable | p-value (train vs. test) |
| --- | --- |
| age | 0.24 |
| sex | 0.98 |
| chest_pain | 0.42 |
| b_pressure | 0.65 |
| cholesterol | 0.56 |
| ecg_type | 0.93 |
| heart_rate | 0.71 |
| exer_angina | 0.33 |
| exer_depress | 0.051 |
| exer_slope | 0.72 |
| fluor_count | 0.68 |
| thal_defect | 0.29 |

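A per-column check of this kind can be sketched with scipy.stats.ttest_ind. The arrays here are synthetic stand-ins for one column (age) of the train and test splits; a real check would loop over all columns.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Stand-ins for the age column in the train and test splits.
train_age = rng.normal(54, 9, size=207)
test_age = rng.normal(54, 9, size=90)

# Independent t-test: a p-value above 0.05 means no significant
# difference between the two distributions.
t_stat, p_value = ttest_ind(train_age, test_age)
```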
A Note on Unit Testing

Unit testing of data science methods can be useful when writing a new algorithm or testing routines. One may test whether an algorithm returns a result, returns good results, or implements a particular API. Most scikit-learn methods are already tested for this, provided one follows the suggestions for sensible data described in the docs, such as scaling the data beforehand if needed.

However! One very useful task in data science is to test not just the routines, but the data. These types of tests are represented above. For example, is the distribution of parameters in the training and test sets the same, statistically speaking? Are they the same for incoming data in a new data stream? If not, one may flag the data in production. For example, for each data column one may write:

self.assertGreater(ttest.pvalue, 0.05)

This of course depends on what your Null Hypothesis is, your assumptions about the data and models, and what issues you are attempting to solve with data science. In this case, we are trying to predict heart disease in patients.
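Wrapped in a unittest.TestCase, such a data check might look like the following sketch. The class and method names are hypothetical, the column values are synthetic stand-ins, and a real test would iterate over the actual data columns.

```python
import unittest

import numpy as np
from scipy.stats import ttest_ind


class TestDataDistribution(unittest.TestCase):
    """Flag incoming data whose distribution differs from training data."""

    def test_age_column_matches_training(self):
        rng = np.random.default_rng(2)
        train_age = rng.normal(54, 9, size=207)    # stand-in training column
        incoming_age = rng.normal(54, 9, size=90)  # stand-in incoming batch
        ttest = ttest_ind(train_age, incoming_age)
        # Fail if the incoming batch looks statistically different.
        self.assertGreater(ttest.pvalue, 0.05)
```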

Modeling and Fitting

If there are no significant anomalies in the data, we proceed to fit the training set using:

  • Logistic Regression
  • LinearSVC

We scaled the continuous variable columns and one-hot encoded chest_pain. The training data fits the presence or absence of coronary disease with an accuracy of 82% ± 7% using either method. The standard error was estimated from 5-fold cross-validation scores; it varies by 1-2% due to the small amount of data and random variation in the train/test split.
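This preprocessing and cross-validation can be sketched as below, on synthetic stand-in data; the real pipeline in coronary_predict.py may differ in detail.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 207  # roughly a 70% training split of 297 patients
X = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "heart_rate": rng.integers(70, 203, n),
    "chest_pain": rng.integers(1, 5, n),  # four categorical pain types
})
y = rng.integers(0, 2, n)  # presence / absence of disease

# Scale the continuous columns, one-hot encode chest_pain.
pre = ColumnTransformer([
    ("scale", StandardScaler(), ["age", "heart_rate"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["chest_pain"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])

# Mean and spread of 5-fold cross-validation scores.
scores = cross_val_score(model, X, y, cv=5)
acc, spread = scores.mean(), scores.std()
```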

Logistic Regression on normalized data gives an idea of variable importance, provided the coefficients are not collinear. The ordering of the coefficients varies somewhat, depending on the randomness of the train/test split. In general, fluoroscopy vessel count (fluor_count) is always at the top, chest_pain type is in the top five, and sex has more influence than age. Typical values are given below.

| variable | logistic regression coefficient |
| --- | --- |
| fluor_count | 1.08 |
| sex | 0.95 |
| chest_pain | 0.78 |
| b_pressure | 0.51 |
| intercept | -0.44 |
| thal_defect | 0.44 |
| heart_rate | -0.43 |
| exer_slope | 0.40 |
| exer_angina | 0.34 |
| cholesterol | 0.26 |
| exer_depress | 0.17 |
| ecg_type | 0.15 |
| age | -0.06 |
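Coefficients like these can be read off a fitted model and ranked by magnitude. A minimal sketch on synthetic, pre-scaled stand-in data (the abbreviated column list is illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["fluor_count", "sex", "chest_pain", "age"]  # abbreviated for brevity
X = StandardScaler().fit_transform(rng.normal(size=(207, len(cols))))
y = rng.integers(0, 2, 207)

clf = LogisticRegression().fit(X, y)

# Rank variables by |coefficient|; only meaningful on scaled inputs
# and when predictors are not strongly collinear.
ranking = sorted(zip(cols, clf.coef_[0]), key=lambda t: -abs(t[1]))
```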

Prediction

Assuming we are satisfied there are no significant anomalies in the incoming test data, and that the training data is not overfit and is reasonable, we proceed with test data prediction using:

  • Logistic Regression
  • LinearSVC

Using the test data, we find a prediction accuracy of about 80% ± 7% with either classifier. The fit and prediction scores depend on the random split between train and test data, and are as reliable as random variation in the data allows. Logistic Regression gives an idea of variable importance, while LinearSVC is less sensitive to dependence between variables. Both methods are comparably fast. We suggest Logistic Regression for modeling parameter importance and LinearSVC for training and production tests, though more data is needed for firmer conclusions.
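Held-out prediction with both classifiers can be sketched as follows, again on synthetic stand-in data rather than the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(297, 12))                        # stand-in features
y = (X[:, 0] + rng.normal(size=297) > 0).astype(int)  # stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the scaler on training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Accuracy on the held-out test data for each classifier.
scores = {}
for name, clf in [("logreg", LogisticRegression()), ("svc", LinearSVC())]:
    clf.fit(X_train_s, y_train)
    scores[name] = clf.score(X_test_s, y_test)
```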

Conclusion

These methods indicate that predicting heart disease from patient data is generally feasible. However, better quality data is needed before the results can be quantified.
