Scripts and libraries used to participate in the Kaggle competition "Homesite Quote Conversion", whose objective is to predict which customers will purchase a quoted insurance plan. https://www.kaggle.com/c/homesite-quote-conversion
The dataset consists of a training set of 261 features for around 250,000 observations.
The predictive model developed here averages two simple predictive models (Gradient Boosted classification and K-Nearest-Neighbours classification). Parameter tuning was performed using the benchmark scripts.
This model gets a score of 0.96144, while the leader reaches a score of 0.97006 (scores computed on a test set with the area under the ROC curve metric).
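
The averaging and the scoring metric described above can be sketched in plain Python. This is only an illustration and not the code from train_models.py or predict.py: the function names `average_predictions` and `roc_auc` and the toy probabilities are assumptions made for the example, and the pairwise AUC loop is quadratic, so it only suits small inputs.

```python
def average_predictions(probs_a, probs_b):
    """Element-wise mean of two lists of predicted probabilities."""
    if len(probs_a) != len(probs_b):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count for 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data (made up for illustration): each model misranks one pair,
# but the two models make different mistakes.
labels = [1, 0, 1, 0]
xgb_probs = [0.9, 0.2, 0.4, 0.6]  # hypothetical gradient boosting output
knn_probs = [0.3, 0.1, 0.9, 0.4]  # hypothetical nearest-neighbours output
blended = average_predictions(xgb_probs, knn_probs)

print(roc_auc(labels, xgb_probs))  # 0.75
print(roc_auc(labels, knn_probs))  # 0.75
print(roc_auc(labels, blended))    # 1.0
```

On this toy data the averaged prediction ranks every pair correctly because the two models err on different observations, which is the usual motivation for blending models of different kinds.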
- file_handler.py: library of functions providing an abstraction layer on top of the manipulated files (csv, cache, json, ...)
- summary.py: library of functions for plotting and describing the dataset's features
- utils.py: library of functions to manipulate data (dates, categorical features, ...)
- benchmark_xgb.py: benchmark of the Gradient Boosted classification (xgboost library) with parameter tuning
- benchmark_knn.py: benchmark of the K-Nearest-Neighbours classification (sklearn library) with parameter tuning
- train_models.py: script that trains the classifiers and serializes them into the models folder
- predict.py: loads the classifiers from the models folder and performs the prediction (output is in the results folder)
- data: contains the dataset in 2 subfolders (originals in data/csv, cache in data/cache)
- models: contains the trained and serialized classifiers
- plots: directory reserved for plots
- results: contains csv files for Kaggle submission
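
As an illustration of the kind of data manipulation attributed to utils.py above (dates, categorical features), here is a minimal sketch. It is not the actual content of utils.py: the function names `split_date` and `encode_categories`, the ISO date format, and the sample values are all assumptions made for the example.

```python
from datetime import datetime


def split_date(value, fmt="%Y-%m-%d"):
    """Turn a date string into numeric features a classifier can use.
    The default format is an assumption about how dates are stored."""
    parsed = datetime.strptime(value, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "weekday": parsed.weekday(),  # Monday is 0, Sunday is 6
    }


def encode_categories(values):
    """Map each distinct category to an integer code, in order of
    first appearance, and return the codes with the mapping used."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


date_features = split_date("2015-03-09")
codes, mapping = encode_categories(["A", "B", "A", "C"])
print(date_features)  # {'year': 2015, 'month': 3, 'weekday': 0}
print(codes)          # [0, 1, 0, 2]
```

Returning the mapping along with the codes allows the same encoding to be reused on the test set, so that a category gets the same integer in both files.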