GitHub

Introduction

Objective of the challenge is to analyze BankSim Kaggle dataset.

For this challenge, I covered the following steps:

EDA
Training a XGB model to predict Gender
Hyperparameter tuning of XGB model
Training a Random Forest model to predict Gender
Hyperparameter tuning of Random Forest model
Model assessment
CLV model implementation for customer long-term values estimation

1) Repository structure

Current repository in inspired from https://drivendata.github.io/cookiecutter-data-science/

/data:
  /output -> Predictions
  /raw -> Raw data
  /processed -> Processed data (dataset with generated features)
/models:
  *.pkl -> Pickled trained models / Hyperopt trials
/params:
  *.json -> Model parameters
/notebooks
  EDA.ipynb -> Use for EDA
  CLV.ipynb -> Use to validate CLV predictions
  assess.ipynb -> Use for assessing model performance
/src
  features.py --> Generate features for classification
  train.py --> Train ML model
  optimize.py --> Run Hyperopt with TPE
  model.py --> Class to make it easier to fit multiple models
  spaces.py --> Search space used for optimization
  clv.py --> Run CLV to predict long-term values
  /resources
    /sql
      *.sql --> SQL scripts
Makefile -> Use to simplify pipeline execution
requirements.txt -> Packages to install
setup.py -> Use to install current package

2) Quickstart

Before anything else, you must download the dataset bs140513_032310.csv from Kaggle and paste it into data/raw folder.

If you would like to get started ASAP, run these make commands in the following order:
make venv --> Set-up python virtual environment
make features --> Generate features
make train_xgb --> Train initial XGB model for gender classification
make optimize_xgb --> Run Hyperopt hyperparameter optimizer for XGB model
make train_opt_xgb --> Train optimal XGB model for gender classification
make train_rf --> Train initial Random Forest model for gender classification
make optimize_rf --> Run Hyperopt hyperparameter optimizer for Random Forest model
make train_opt_rf --> Train optimal Random Forest model for gender classification
make clv --> Run CLV model to predict long-term values

3) Set-up Environment

Run the following command:
make venv
It will install all necessary packages used in this challenge.

4) Exploratory Data Analysis

EDA is available in notebook notebooks/EDA.ipynb.

In this notebook, I am exploring each variable in the dataset and how they correlate with Gender which is our ML problem outcome variable. I also try building more insightful features which will help us achieve better results on our ML task.

5) Classification using XGBoost

Second model is a XGBoost model.
This model was chosen as it yields good results without much data transformation (such as normalization, clipping etc...) required.
Its parameters can be found in params/def_xgb_model.json.

{
  "name": "xgb",
  "clf": "XGBClassifier",
  "params": {
    "max_depth": 4,
    "n_estimators": 100,
    "learning_rate": 0.05,
    "n_jobs": -1,
    "objective": "binary:logistic",
    "colsample_bytree": 0.5,
    "gamma": 1
  }
}

a) Training initial XGB model

I use 90% of the data for training and the remaining 10% for validation to assess that model doesn't overfit and generalizes Ill to new data.

Run the following command:
make train_xgb

You should reach an AUC of {} for the validation set.
This is a very low score which shows our model didn't learn successfully on our classification task.

b) Optimize model

To try reaching better results, I optimize model parameters.
I use hyperopt package for that purpose and our objective is to maximize validation AUC using K-Folds with K=10. TPE algorithm is picked to search the space for the best parameters.

Search history is dumped at every round in models/opt_xgb_trials.pkl so that optimizer can be stopped and resumed anytime.

To run the optimizer, run the following command:
make optimize_xgb

c) Training optimized XGB model

Its parameters can be found in params/opt_xgb_model.json.

{
	"name": "opt_xgb",
	"clf": "XGBClassifier",
	"params": {
		"colsample_bytree": 0.5,
		"eta": 0.157,
		"gamma": 0.5700000000000001,
		"max_depth": 3,
		"min_child_Iight": 5.0,
		"n_estimators": 162.0,
		"subsample": 0.1
	}
}

Run the following command:
make train_opt

Validation AUC is improved to .
It is still not great but it shows the model learnt a bit

6) Classification using RandomForest

I try to use another model.
I picked RandomForest as it is less prone to overfitting than Gradient Boosting Algorithm.

Its parameters can be found in params/def_rf_model.json.

a) Training initial RandomForest model

Run the following command:
make train_rf

b) Optimize model

To run the optimizer, run the following command:
make optimize_rf

c) Training optimized RandomForest model

Its parameters can be found in params/opt_rf_model.json.

{
	"name": "opt_rf",
	"clf": "RandomForestClassifier",
	"params": {
		"ccp_alpha": 0.007,
		"class_weight": "balanced",
		"max_depth": 1,
		"max_features": 0.92,
		"max_leaf_nodes": 3,
		"max_samples": 0.64,
		"n_estimators": 209.0,
		"n_jobs": -1,
		"verbose": 0,
		"warm_start": false
	}
}

6) Model assessment

Model is assessed in notebook notebooks/assess.ipynb.

In this notebook, I am investigating different plots and metrics to assess model performance.
I am also looking at feature importance for model interpretability.

7) Predict customer long-term values

To find out customer long-term values, I use a package called lifetimes.
It combines 2 bayesian models (BG/NBD and Gamma-Gamma models).
Both models require at most RFM features (Recency, Frequency, Monetary).
BG/NBD predicts expected number of future purchases per customer.
Gamma-Gamma predicts expected amount for each purchase per customer.
Multiplying both provides customer long-term values.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
models		models
notebooks		notebooks
params		params
src		src
.gitignore		.gitignore
README.md		README.md
makefile		makefile
requirements.txt		requirements.txt
setup.py		setup.py

damienmarlier51/bankSim

Folders and files

Latest commit

History

Repository files navigation

Introduction

1) Repository structure

2) Quickstart

3) Set-up Environment

4) Exploratory Data Analysis

5) Classification using XGBoost

a) Training initial XGB model

b) Optimize model

c) Training optimized XGB model

6) Classification using RandomForest

a) Training initial RandomForest model

b) Optimize model

c) Training optimized RandomForest model

6) Model assessment

7) Predict customer long-term values

About

Resources

Stars

Watchers

Forks

Languages