rateflask

This Capstone Project was completed as a part of the Zipfian Academy Data Science Bootcamp, Winter 2015 cohort. For developments since then, please visit github.com/rateflask.

rateflask predicts the rate of return of a Lending Club loan at inception. A web version is available at rateflask.com.

Description

Analyses of Lending Club data tend to focus on loans that have already matured. Matured loans, however, comprise less than 10% of loans issued. This conundrum inspired a new methodology that allows loans to be compared regardless of whether they have matured, are ongoing or are yet to be issued.

The methodology uses Random Forests to predict the expected loan payment for each month, then aggregates these predictions across the whole loan period into a single rate-of-return metric for each loan. This allows the 90% of issued loans that have not yet matured to be used as training data, with the remaining set of matured loans used for validation.

Details

The model consists of 4 x 36 individual Random Forest sub-models, one for each grade-month pair (grades A to D, over the period Jan 2012 to Dec 2014). The training data is the set of 3-year loans issued between 2012 and 2014, i.e. ongoing loans. Loan details (FICO score, monthly income, etc.) are used as features, and the loan status (current, in default, etc.) as the target.
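As an illustration of how the 4 x 36 sub-models could be organised, here is a minimal sketch that builds one untrained model per grade-month pair. It assumes the 36 months are the calendar months from January 2012 to December 2014, as the description above suggests. The names `GRADES`, `MONTHS`, `build_submodels` and `make_model` are hypothetical and are not taken from model.py.

```python
from itertools import product

GRADES = ["A", "B", "C", "D"]
# Calendar months from Jan 2012 to Dec 2014 as (year, month) tuples: 36 in total.
MONTHS = [(year, month) for year in (2012, 2013, 2014) for month in range(1, 13)]


def build_submodels(make_model):
    """Return a dict mapping (grade, (year, month)) to a fresh, untrained model.

    `make_model` is any zero-argument callable that returns a new model object.
    """
    return {(grade, month): make_model() for grade, month in product(GRADES, MONTHS)}


# `dict` is only a placeholder factory so the sketch runs without dependencies.
# With scikit-learn installed, one could pass something like
# lambda: RandomForestClassifier(n_estimators=100) instead (an assumption,
# not necessarily the configuration used in this project).
submodels = build_submodels(make_model=dict)
print(len(submodels))  # 4 grades x 36 months = 144
```

Keying the sub-models by a `(grade, month)` tuple keeps the lookup for a given loan to a single dictionary access.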

The predicted loan status is used to calculate the probability that each payment is made, and these probabilities are aggregated to give the rate of return of the loan. Viewed as a black box, the model takes loan features as input and outputs the expected rate of return.
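The exact aggregation is not spelled out here, so the following is only a sketch under stated assumptions: each month's expected payment is the installment multiplied by the probability that the payment is made, and the rate of return is the internal rate of return of those expected cashflows, annualised by multiplying the monthly rate by 12. All function names and the example probabilities are hypothetical, and cashflow.py may use a different calculation.

```python
def expected_cashflows(installment, payment_probs):
    """Expected payment per month: installment times probability of payment."""
    return [installment * p for p in payment_probs]


def net_present_value(rate, principal, cashflows):
    """NPV of lending `principal` at month 0 and receiving `cashflows` monthly."""
    return -principal + sum(cf / (1 + rate) ** (t + 1) for t, cf in enumerate(cashflows))


def monthly_irr(principal, cashflows, lo=-0.99, hi=1.0, tol=1e-10):
    """Monthly rate at which the NPV is zero, found by bisection.

    Assumes non-negative cashflows, so the NPV decreases as the rate rises.
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if net_present_value(mid, principal, cashflows) > 0:
            lo = mid  # NPV still positive, so the root lies at a higher rate
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


principal = 10000.0
annual_rate = 0.0949  # headline rate, used here only to size the installment
r = annual_rate / 12
n_months = 36
installment = principal * r / (1 - (1 + r) ** -n_months)

# Every payment certain, versus illustrative probabilities that decay over time.
certain = expected_cashflows(installment, [1.0] * n_months)
risky = expected_cashflows(installment, [1.0 - 0.003 * t for t in range(n_months)])

rate_certain = 12 * monthly_irr(principal, certain)
rate_risky = 12 * monthly_irr(principal, risky)
print(round(rate_certain, 4), round(rate_risky, 4))
```

When every payment is certain, the computed rate matches the headline rate; any payment probability below one pulls the expected rate of return under it.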

The trained model is validated against 3-year loans issued between 2009 and 2011, i.e. loans that have matured. The validation process compares the actual rate of return, calculated purely from actual payment data, against the expected rate of return, calculated purely from loan features. The actual and predicted rates of return are shown in the graph below in blue and green respectively. The headline rate in red is the Lending Club rate.
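One simple way to summarise such a comparison is to average the actual and predicted rates of return within each grade and look at the gap. The sketch below assumes each loan is a dict with hypothetical `grade`, `actual` and `predicted` fields; the figures are made up for illustration and are not results from validation.py.

```python
from collections import defaultdict


def mean_by_grade(loans, field):
    """Average the given field over the loans within each grade."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for loan in loans:
        totals[loan["grade"]] += loan[field]
        counts[loan["grade"]] += 1
    return {grade: totals[grade] / counts[grade] for grade in totals}


# Illustrative values only, not real Lending Club data.
matured_loans = [
    {"grade": "A", "actual": 0.06, "predicted": 0.05},
    {"grade": "A", "actual": 0.04, "predicted": 0.05},
    {"grade": "B", "actual": 0.08, "predicted": 0.09},
]

actual = mean_by_grade(matured_loans, "actual")
predicted = mean_by_grade(matured_loans, "predicted")
for grade in sorted(actual):
    gap = predicted[grade] - actual[grade]
    print(grade, round(actual[grade], 4), round(predicted[grade], 4), round(gap, 4))
```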

Loans in the same sub-grade band pay the same headline rate. For example, all B-2 loans issued in December 2014 had an interest rate of 9.49%. If you were given a number of B-2 loans to choose from, it would be worth asking whether it is possible to beat the average return in a statistically reliable way.

The graph below shows the improvement in rate of return from an active selection strategy based on the model, compared to choosing a loan at random. The active selection strategy uses the model to generate the predicted rate of return for each loan, ranks the loans and identifies the top quartile. The bottom quartile is also included in the graph for illustrative purposes.
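The ranking step described above can be sketched as follows. The field names `predicted_return` and `actual_return`, the helper names and the eight toy loans are all hypothetical, chosen only to show the mechanics of picking the top and bottom quartiles.

```python
def split_quartiles(loans, key="predicted_return"):
    """Rank loans by predicted return, highest first; return (top, bottom) quartiles."""
    ranked = sorted(loans, key=lambda loan: loan[key], reverse=True)
    size = max(1, len(ranked) // 4)
    return ranked[:size], ranked[-size:]


def mean_return(loans, key="actual_return"):
    """Average realised return over a group of loans."""
    return sum(loan[key] for loan in loans) / len(loans)


# Illustrative values only, not real Lending Club data.
predicted = [0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
realised = [0.10, 0.07, 0.08, 0.05, 0.06, 0.03, 0.04, 0.01]
loans = [
    {"id": i, "predicted_return": p, "actual_return": a}
    for i, (p, a) in enumerate(zip(predicted, realised))
]

top, bottom = split_quartiles(loans)
print([loan["id"] for loan in top], [loan["id"] for loan in bottom])
print(mean_return(top), mean_return(loans), mean_return(bottom))
```

Comparing the mean realised return of the top quartile against the mean over all loans is the same comparison as active selection versus random choice, since a randomly chosen loan earns the overall average in expectation.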

For further details:

Presentation slides can be found here. The charts in the presentation and on this page were generated with R's ggplot via rpy2, and are detailed here.

Requirements

numpy 1.9.0
scipy 0.14.0
pandas 0.14.1
scikit-learn 0.14.1
matplotlib 1.3.1
flask 0.10.1
lendingclub 0.1.8
pymongo 2.7.2
psycopg2 2.5.3
dill 0.2.2

Installation

Clone this repo.
Download the full-version data (~270 MB) from the Lending Club website or from the following Dropbox address, and place it in a directory named data.
Install the listed requirements.
(Optional) Start up a MongoDB instance, and a PostgreSQL database named 'rateflask'.

To run the production version locally, run python app.py (or sudo python app.py should there be permission errors) in terminal from the repo directory. Direct your browser to 0.0.0.0:8000 to view the app, and to 0.0.0.0:8000/refresh to update the data (requires Lending Club login). For debugging, run python app.py debug.

To test if the installation has been successful, run python test.py from the same location. To run the model against the validation set, run python test.py compare. Please note that the validation process might take some time.

Modules

model - rate of return prediction and validation

model.py - core prediction model, trained on 2012-14 loan data
validation.py - validates prediction model with 2009-11 loan data
start - trains new model on first start

helpers - data processing and cashflow generation

preprocessing.py - cleans up data and fills missing values
postprocessing.py - creates files for charts and data table
cashflow.py - generates cashflows and compounding curves

transfers - file input/output, API requests and database insertions

fileio.py - dumping and loading data with pickle/dill
retrieve.py - requests data from Lending Club API
database.py - inserts data to MongoDB and PostgreSQL

The CodeFlower visualization can be found here.

Next steps

Portfolio selection model that selects the highest-returning diversified portfolio based on a user's desired risk profile.

License

Licensed under the MIT licence.
