rateflask predicts the rate of return at inception of a Lending Club loan

nhu2000/rateflask

 
 

This Capstone Project was completed as a part of the Zipfian Academy Data Science Bootcamp, Winter 2015 cohort. For developments since then, please visit github.com/rateflask.

rateflask

rateflask predicts the rate of return at inception of a Lending Club loan, web version at rateflask.com.

Description

Analyses of Lending Club data tend to focus on loans that have already matured. Matured loans, however, comprise less than 10% of loans issued. This conundrum inspired a new methodology that allows loans to be compared regardless of whether they have matured, are ongoing, or are yet to be issued.

The methodology uses Random Forests to predict the expected loan payment for each month, then aggregates these payments across the whole loan term to give a single rate of return metric for each loan. This allows 90% of loans issued to be used as training data, and the remaining set of matured loans as validation.

Details

The model consists of 4 x 36 individual Random Forest sub-models, one for each grade-month pair (grades A to D, months Jan 2012 to Dec 2014). The training data is the set of 3-year loans issued between 2012 and 2014, i.e. ongoing loans. Loan details (FICO score, monthly income, etc.) are used as features, and the loan status (current, in default, etc.) as the target.
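The 4 x 36 layout can be pictured as a lookup table with one entry per grade-month pair. The sketch below only enumerates the keys; the names `GRADES`, `issue_months` and `submodel_keys` are hypothetical and are not taken from model.py, and how each key maps to a fitted forest is left open.

```python
from itertools import product

# Grades covered by the model, as stated above.
GRADES = ["A", "B", "C", "D"]


def issue_months(first_year=2012, last_year=2014):
    """Return (year, month) tuples for every month in the training window."""
    return [
        (year, month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def submodel_keys():
    """One key per grade-month pair; each key would hold one sub-model."""
    return list(product(GRADES, issue_months()))


keys = submodel_keys()
print(len(keys))          # number of sub-models
print(keys[0], keys[-1])  # first and last grade-month pair
```

A dictionary keyed this way makes it straightforward to train, store and look up each sub-model independently.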

The loan status is used to calculate the probability of each payment being made, and these probabilities are aggregated to give the rate of return of that loan. Viewed as a black box, the model takes in loan features as input, and outputs the expected rate of return.
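As a rough illustration of how payment probabilities could be aggregated into one rate, the sketch below weights a level monthly installment by the probability that it is paid and solves for the internal rate of return by bisection. This is an assumption about the aggregation step, not the code in cashflow.py; the function names, the 10,000 principal and the linearly declining probabilities are all made up for the example. Only the 9.49% headline rate comes from this README.

```python
def monthly_installment(principal, annual_rate, n_months=36):
    """Level payment of a fully amortizing loan."""
    r = annual_rate / 12.0
    if r == 0:
        return principal / n_months
    return principal * r / (1.0 - (1.0 + r) ** -n_months)


def net_present_value(monthly_rate, principal, cashflows):
    """NPV of lending `principal` now and receiving `cashflows` monthly."""
    return -principal + sum(
        cf / (1.0 + monthly_rate) ** t
        for t, cf in enumerate(cashflows, start=1)
    )


def annualized_return(principal, cashflows, tol=1e-12):
    """Find the monthly rate with zero NPV by bisection, then annualize it."""
    lo, hi = -0.99, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # NPV falls as the rate rises when all cashflows are non-negative.
        if net_present_value(mid, principal, cashflows) > 0:
            lo = mid
        else:
            hi = mid
    return (1.0 + (lo + hi) / 2.0) ** 12 - 1.0


principal = 10_000.0   # hypothetical loan size
headline = 0.0949      # B-2 rate quoted in this README
installment = monthly_installment(principal, headline)

# Hypothetical probabilities that each of the 36 payments is made.
payment_probs = [1.0 - 0.002 * t for t in range(36)]
expected = [p * installment for p in payment_probs]

rate_certain = annualized_return(principal, [installment] * 36)
rate_expected = annualized_return(principal, expected)
print(round(rate_certain, 4), round(rate_expected, 4))
```

With every probability equal to 1 the result is just the effective annual rate implied by the headline rate; any chance of missed payments pulls the expected rate below it.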

The trained model is validated against 3-year loans issued between 2009 and 2011, i.e. loans that have matured. The validation process compares the actual rate of return, calculated purely from actual payment data, against the expected rate of return, calculated purely from loan features. The actual and predicted rates of return are shown in the graph below in blue and green respectively. The headline rate, in red, is the Lending Club rate.
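One simple way to summarize such a comparison numerically is sketched below. The README does not state which metric validation.py reports, so the choice of mean bias and mean absolute error is an assumption, `compare_rates` is a hypothetical helper, and the three rates per list are invented purely to exercise the function.

```python
from statistics import mean


def compare_rates(actual, predicted):
    """Summarize how predicted rates of return differ from actual ones."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        "mean_actual": mean(actual),
        "mean_predicted": mean(predicted),
        "bias": mean(errors),                  # > 0 means over-prediction
        "mae": mean(abs(e) for e in errors),   # average size of the miss
    }


# Invented rates for three matured loans, for illustration only.
actual_rates = [0.05, 0.08, -0.02]
predicted_rates = [0.06, 0.07, 0.00]

summary = compare_rates(actual_rates, predicted_rates)
print(summary)
```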

Loans of the same sub-grade band pay the same headline rate. For example, all B-2 loans issued in December 2014 had an interest rate of 9.49%. Given a number of B-2 loans to choose from, it is worth asking whether it is possible to beat the average return in a statistically reliable way.

The graph below shows the improvement in rate of return with an active selection strategy based on the model, compared to choosing a loan at random. The active selection strategy involves using the model to generate the predicted rate of return, ranking the loans and identifying the top quartile. The bottom quartile is also included in the graph for illustrative purposes.
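The ranking step can be sketched in a few lines. The example assumes each loan is represented as a (predicted rate, actual rate) pair; `quartiles_by_prediction` and `mean_actual` are hypothetical helpers, and the eight loans are invented numbers rather than Lending Club data.

```python
from statistics import mean


def quartiles_by_prediction(loans):
    """Rank (predicted, actual) pairs by predicted rate, best first,
    and return the top and bottom quartiles."""
    ranked = sorted(loans, key=lambda loan: loan[0], reverse=True)
    k = max(1, len(ranked) // 4)
    return ranked[:k], ranked[-k:]


def mean_actual(loans):
    """Average realized rate of return of a group of loans."""
    return mean(actual for _, actual in loans)


# Invented (predicted, actual) rates for eight loans of one sub-grade.
loans = [
    (0.10, 0.09), (0.08, 0.07), (0.07, 0.08), (0.06, 0.05),
    (0.05, 0.06), (0.04, 0.02), (0.03, 0.03), (0.01, -0.04),
]

top, bottom = quartiles_by_prediction(loans)
print("top quartile:", mean_actual(top))
print("random pick :", mean_actual(loans))
print("bottom      :", mean_actual(bottom))
```

Comparing the top-quartile average against the average over all loans mirrors the "active selection versus random choice" comparison described above.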

For further details:

Presentation slides can be found here. The charts in the presentation and on this page were generated with R's ggplot via rpy2, as detailed here.

Requirements

  • numpy 1.9.0
  • scipy 0.14.0
  • pandas 0.14.1
  • scikit-learn 0.14.1
  • matplotlib 1.3.1
  • flask 0.10.1
  • lendingclub 0.1.8
  • pymongo 2.7.2
  • psycopg2 2.5.3
  • dill 0.2.2

Installation

  1. Clone this repo.
  2. Download the full version of the data (~270 MB) from the Lending Club website or from the following Dropbox address, and place it in a directory named data.
  3. Install the listed requirements.
  4. (Optional) Start up a MongoDB instance, and a PostgreSQL database named 'rateflask'.

To run the production version locally, run python app.py (or sudo python app.py if there are permission errors) in a terminal from the repo directory. Point your browser to 0.0.0.0:8000 to view the app, and to 0.0.0.0:8000/refresh to update the data (requires a Lending Club login). For debugging, run python app.py debug.

To test whether the installation was successful, run python test.py from the same location. To run the model against the validation set, run python test.py compare. Please note that the validation process might take some time.

Modules

model - rate of return prediction and validation

  • model.py - core prediction model, trained on 2012-14 loan data
  • validation.py - validates prediction model with 2009-11 loan data
  • start - trains new model on first start

helpers - data processing and cashflow generation

  • preprocessing.py - cleans up data and fills missing values
  • postprocessing.py - creates files for charts and data table
  • cashflow.py - generates cashflows and compounding curves

transfers - file input/output, API requests and database insertions

  • fileio.py - dumping and loading data with pickle/dill
  • retrieve.py - requests data from Lending Club API
  • database.py - inserts data to MongoDB and PostgreSQL

The CodeFlower visualization can be found here.

Next steps

A portfolio selection model that selects the highest-returning diversified portfolio based on a user's desired risk profile.

License

Copyright (c) 2015 Rateflask

Licensed under the MIT licence.
