GitHub - bfetler/lending_club: predict high or low interest rate

Loan Scrutiny and Ad Targeting Using Interest Rate Predictors

Can we use consumer loan data to assess the level of scrutiny needed for future loan applicants? For example, we could target different types of questions to loan applicants using variables that predict a high or low interest rate, as a potential predictor of high- or low-risk behavior.

Could we also use loan data to predict which types of ads consumers should receive? We could potentially use the same predictors to target advertising to different customer segments.

A consumer loan dataset is available from Lending Club, an online lending service, which we may use to explore these questions.

Exploration

The dataset contains 14 variables for 2500 loan applicants from FY 2013, including Interest.Rate (the approved interest rate), which may be used as a target variable for supervised learning. We divided consumers into two categories based on a target interest rate, high interest if above 12% and low interest if below 12%, using a variable IR_TF (interest rate true-false). After data cleaning, 2498 rows remained. Histograms gave some indication of data variability.
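As a rough illustration of the high/low split, the sketch below builds IR_TF from Interest.Rate with pandas. The sample rows are made up, the percent-string format of the raw column is an assumption, and a rate of exactly 12% is treated as low here because the text does not say which side it falls on.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Lending Club data.
loans = pd.DataFrame({
    "Interest.Rate": ["8.90%", "12.12%", "21.98%", "9.99%"],
    "FICO.Score": [735, 715, 690, 695],
})

# Assumption: the raw column is a percent string such as "8.90%".
loans["Interest.Rate"] = loans["Interest.Rate"].str.rstrip("%").astype(float)

# IR_TF is 1 for high interest (above 12%) and 0 otherwise.
loans["IR_TF"] = (loans["Interest.Rate"] > 12.0).astype(int)

print(loans)
```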

Some financial variables were not normally distributed, as their histograms showed, and were replaced by their logarithms for statistical analysis.
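A minimal sketch of the log replacement follows. The values are invented, the column name Monthly.Income is only an example of a skewed financial variable, and the choice of base-10 logarithm is an assumption.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed values standing in for a financial variable.
income = pd.Series([2500.0, 4000.0, 6500.0, 12000.0, 45000.0],
                   name="Monthly.Income")

# Replace the raw variable by its logarithm (base 10 is an assumption).
log_income = np.log10(income).rename("Log.Monthly.Income")

# Sample skewness shrinks after the transform, so the histogram is
# closer to symmetric.
print("raw skew:", income.skew())
print("log skew:", log_income.skew())
```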

Modeling and Prediction

The data was randomly split into 75% training data and 25% test data. We used the training data to model loan behavior, and the test data as an analog for batches of incoming new loan applicants. Fitting and prediction were carried out with several machine learning methods for comparison. The following steps were performed for each method:
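The 75/25 split can be sketched with scikit-learn's train_test_split. The data here is synthetic, and the random_state value is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned loan table: 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Hold out 25% of the rows as test data; fit only on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```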

Initial fit and cross validation of training data.
Optimization of meta-parameters by grid search with cross validation if desired.
Cross validation with statistics tells us whether or not further parameter optimization is needed.
Variable optimization on training data as follows:
- start with two variables FICO.Score, Amount.Requested
- successively add random columns
- keep columns with improved score (cross validation mean score)
- repeat N times
Finally, prediction on test data using optimized columns and parameters.
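The variable optimization steps above can be sketched roughly as follows. This is one reading of the procedure, not the repository's code: the function name select_columns, the synthetic data, and the number of trials are all hypothetical, and Logistic Regression stands in for whichever method is being tuned.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_columns(X, y, start_cols, n_trials=20, seed=0):
    """Greedily add randomly chosen columns that improve the mean CV score."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000)
    cols = list(start_cols)
    best = cross_val_score(model, X[:, cols], y, cv=10).mean()
    for _ in range(n_trials):
        remaining = [c for c in range(X.shape[1]) if c not in cols]
        if not remaining:
            break
        # Try one random column that is not yet in the model.
        candidate = int(rng.choice(remaining))
        score = cross_val_score(model, X[:, cols + [candidate]], y, cv=10).mean()
        # Keep the column only if the mean CV score improved.
        if score > best:
            cols.append(candidate)
            best = score
    return cols, best

# Synthetic data: the target depends on columns 0 and 4 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 4] > 0).astype(int)

# Columns 0 and 1 play the role of FICO.Score and Amount.Requested.
cols, best = select_columns(X, y, start_cols=[0, 1])
print("selected columns:", cols, "mean CV score:", round(best, 3))
```

Because candidates are drawn at random, different seeds can end with different column sets, which is consistent with an optimum that varies between runs.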

Summary

A summary of the results follows.

Support Vector Machines
- 90% +- 4% accuracy
- insensitive to parameter choice
- slightly slower than other methods
Naive Bayes
- 89% +- 5% accuracy
- sensitive to parameter choice (seven columns optimal)
- reasonably fast
Logistic Regression
- 89% +- 4% accuracy
- somewhat sensitive to parameter choice (five to eight columns optimal)
- fast

Any of the above classification methods will predict high or low interest rate with about 89% +- 5% accuracy. The training error estimate is low and consistent across methods, indicating the error comes from variability within the data. Logistic Regression is the best choice, since it is fast, accurate, and easy to implement with a little optimization.

Detailed Analysis by Logistic Regression

A detailed analysis by Logistic Regression follows.

We fit high or low interest rate in the training data from eleven numeric variables using Logistic Regression, scored by fit accuracy. A score of 90% was found after scaling the data.

Cross-validation can tell us whether or not further parameter optimization is needed. With a CV factor of 10, the training data is split into 90% subtrain data and 10% validation data, the model is fit on the subtrain data and scored on the validation data, and the process is repeated 10 times with a different validation fold each time. Each fold gives a new prediction score, and one may do statistics on the scores to tell how well we fit the model.
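A minimal sketch of 10-fold cross-validation with scaling, assuming scikit-learn and synthetic data. Putting the scaler inside a pipeline is a design choice made here so that each fold is scaled using only its own subtrain data; the original script may scale differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with a noisy linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
noise = rng.normal(scale=0.5, size=400)
y = (X[:, 0] - 0.5 * X[:, 1] + noise > 0).astype(int)

# Scale, then fit Logistic Regression with the default C=1.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))

# cv=10 gives ten accuracy scores, one per validation fold.
scores = cross_val_score(model, X, y, cv=10)

mean = scores.mean()
std = scores.std(ddof=1)
std_err = std / np.sqrt(len(scores))
print(f"accuracy: {mean:.3f}, std: {std:.3f}, std error: {std_err:.3f}")
```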

We used cross-validation prediction scores from Logistic Regression to calculate a mean and standard error for different values of the regularization parameter C. Their range is shown below in boxplots. We tested the statistical significance of the score differences between parameter values using a t-test, which shows insensitivity to C at higher values. The error bars are bigger than the variation in accuracy for most values.

In other words, the choice of C doesn't matter much as long as the value is high enough, which we can measure by statistics. We chose the standard value of C=1 for our model.
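The comparison between values of C can be sketched as below. The data is synthetic, the three C values are arbitrary, and the use of an independent two-sample t-test (scipy.stats.ttest_ind) is an assumption, since the text does not say which t-test was applied.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with a noisy linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
noise = rng.normal(scale=0.5, size=400)
y = (X[:, 0] - 0.5 * X[:, 1] + noise > 0).astype(int)

# Ten cross-validation scores for each candidate value of C.
scores_by_c = {}
for c in [0.001, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=c))
    scores_by_c[c] = cross_val_score(model, X, y, cv=10)

# Compare the score distributions for C=1 and C=100.
t_stat, p_value = stats.ttest_ind(scores_by_c[1.0], scores_by_c[100.0])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A large p-value would indicate that the difference in mean score between the two settings is within the fold-to-fold noise.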

Optimization using randomly chosen column variables with CV gave a best score of 90% +- 4% using eight variables, with optimum number varying between five and eight columns.

Prediction score of test data was estimated at 89%. A plot is shown below.

A processing script is given in logistic_regression.py. Plots of logistic functions are in logistic_regression_plots/ and script output in logistic_regression_output.txt.

Interest Rate by Linear Regression

To predict the actual interest rate from other variables, rather than just whether the interest rate was high or low, we applied Linear Regression. Fitting the training data using all columns gave an accuracy of 76% +- 5% by cross-validation, while prediction of test data gave 76% accuracy. By examining variable importance, we found we could model the same fitting accuracy using only five of the columns plus the Intercept:

FICO.Score
Loan.Length
Amount.Funded.By.Investors
Inquiries.in.the.Last.6.Months
Log.CREDIT.Lines
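A sketch of a five-column linear fit follows, using scikit-learn on synthetic data. The column names come from the list above, but the values and the relationship used to generate the target are invented for the demo. Note that scikit-learn scores regressors by R^2 by default; whether the 76% figure above is an R^2 value is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic predictors named after the five selected columns.
X = pd.DataFrame({
    "FICO.Score": rng.integers(640, 830, n),
    "Loan.Length": rng.choice([36, 60], n),
    "Amount.Funded.By.Investors": rng.uniform(1000, 35000, n),
    "Inquiries.in.the.Last.6.Months": rng.integers(0, 5, n),
    "Log.CREDIT.Lines": np.log10(rng.integers(2, 30, n)),
})

# Invented target: rate falls with FICO score and rises with loan length.
y = (70.0
     - 0.085 * X["FICO.Score"]
     + 0.13 * X["Loan.Length"]
     + 0.0001 * X["Amount.Funded.By.Investors"]
     + 0.4 * X["Inquiries.in.the.Last.6.Months"]
     + rng.normal(scale=2.0, size=n))

model = LinearRegression()  # fits an intercept by default

# Ten-fold cross-validation; the default score for regressors is R^2.
r2_scores = cross_val_score(model, X, y, cv=10)

# Fit on all rows to inspect the coefficients and the intercept.
model.fit(X, y)
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
print("intercept:", model.intercept_)
print("mean R^2:", r2_scores.mean())
```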

A processing script is given in linear_regression.py. Plots are in linear_regression_plots/ and script output in linear_regression_output.txt. Interest rate prediction is shown below.

Conclusion

We can predict high or low interest rate with about 89% +- 5% accuracy. Therefore, we may:

accurately target new customer segments with extra scrutiny on their loan questionnaires
correctly target ads to existing customer segments
