Kaggle Homesite contest 3rd Place solution (New Model Army part)

This is the description of the approach taken by team New Model Army (Michael Pearmain and Konrad Banachewicz) in the Homesite Quote Conversion contest hosted by Kaggle:

https://www.kaggle.com/c/homesite-quote-conversion

The general idea is a multi-level stacking architecture (see the sketch after this list):

  1. Create multiple transformations of the train/test data.
  2. Generate stacked out-of-fold predictions ("metafeatures") from multiple models trained on the different versions of the data.
  3. Combine the metafeatures in a stacked ensemble.
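
To make the pattern concrete, here is a minimal Python sketch of two-level stacking on synthetic data. It is illustrative only: the base models, the stacker, and the data are placeholders, and it uses a current scikit-learn API rather than the contest-era one.

```python
# Minimal two-level stacking sketch (illustrative, not the contest code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Level 1: out-of-fold probabilities from base models become "metafeatures".
base_models = [RandomForestClassifier(n_estimators=100, random_state=1),
               LogisticRegression(max_iter=1000)]
metafeatures = np.column_stack([
    cross_val_predict(m, X, y, cv=folds, method="predict_proba")[:, 1]
    for m in base_models])

# Level 2: a stacker is trained on the metafeatures only.
stacker = LogisticRegression().fit(metafeatures, y)
```

Because every level-1 prediction for a training row comes from a model that never saw that row, the level-2 model is trained without target leakage.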

System and Dependencies

Replicating the full solution requires a minimum of 16 GB RAM and, depending on the number of cores and machines available, will take in excess of 10 days to run. This is due to the heavy workload of producing meta-level features for 5-fold splits on multiple datasets with multiple models.

The code is written as a combination of Python (2.7.6) with the following libraries:

  • numpy
  • pandas
  • keras
  • sklearn
  • datetime
  • itertools
  • xgboost

and R (3.2.1) with the following libraries:

  • data.table
  • caret
  • stringr
  • readr
  • lubridate
  • Rtsne
  • lme4
  • chron
  • h2o
  • earth
  • ranger
  • glmnet
  • Metrics
  • e1071
  • nnet
  • xgboost

Solution Replication

To replicate the top NMA (New Model Army) submission, everything should be run from the top-level directory. It is expected that ./input contains the train, test, and sample submission files for the contest, and that metafeatures and submissions folders also exist.
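
For example, the expected folders could be prepared with a few lines of Python (folder names taken from the description above):

```python
# Create the directories the pipeline expects.
import os

for d in ["input", "metafeatures", "submissions"]:
    if not os.path.isdir(d):
        os.makedirs(d)
```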

Datasets

Dataset creation is done in R by running the ~/R/data_preparation.R script. Those datasets are used as input to the metafeature creation.

WARNING: the creation of the datasets may take in excess of 2 days.

Folds

The file ~/R/folds_prep.R generates a split of the training set into 5 folds, 10 folds, and a train/validation split. The output is a dataframe xfolds (and a file xfolds.csv), which is used in all subsequent analysis. This ensures consistency across models: the same folds are used every time, so there is no leakage, and the performance of new models can be compared directly.
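
A minimal Python sketch of building such a shared fold table (the column names and random seed are assumptions, not necessarily those used by folds_prep.R):

```python
# Build a reproducible fold-assignment table shared by all models.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

y = pd.read_csv("input/train.csv")["QuoteConversion_Flag"]  # contest target
xfolds = pd.DataFrame({"index": np.arange(len(y))})

# Fold assignments for the 5-fold and 10-fold schemes.
for k, col in [(5, "fold5"), (10, "fold10")]:
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=2016)
    xfolds[col] = -1
    for fold_id, (_, va_idx) in enumerate(skf.split(xfolds, y)):
        xfolds.loc[va_idx, col] = fold_id

# A single train/validation split, stored as a 0/1 flag.
_, va_idx = train_test_split(np.arange(len(y)), test_size=0.25,
                             stratify=y, random_state=2016)
xfolds["valid"] = 0
xfolds.loc[va_idx, "valid"] = 1

xfolds.to_csv("xfolds.csv", index=False)
```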

Metafeatures

This is the main workhorse of the solution: we create out-of-fold predictions ("metafeatures") from a wide variety of classification models, in both R and Python. The ~/R/ and ~/python directories contain numerous build_meta_XXX files. Each of these files must be run, and its output is saved into the ~/metafeatures subdirectory.
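
Each build_meta file follows roughly the same pattern. A generic Python sketch of that pattern (the file names, target column, output names, and model settings are all placeholders):

```python
# Generic shape of a build_meta_XXX script: out-of-fold predictions for the
# training set, plus full-model predictions for the test set.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

train = pd.read_csv("input/train_prepared.csv")  # a dataset from data_preparation.R
test = pd.read_csv("input/test_prepared.csv")
xfolds = pd.read_csv("xfolds.csv")

y = train.pop("target").values                   # placeholder target column
oof = np.zeros(len(train))

for fold_id in sorted(xfolds["fold5"].unique()):
    tr = (xfolds["fold5"] != fold_id).values
    model = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
    model.fit(train[tr], y[tr])
    oof[~tr] = model.predict_proba(train[~tr])[:, 1]

# Refit on the full training set for the test-set metafeature.
model = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(train, y)

pd.DataFrame({"xgb_meta": oof}).to_csv("metafeatures/train_xgb.csv", index=False)
pd.DataFrame({"xgb_meta": model.predict_proba(test)[:, 1]}).to_csv(
    "metafeatures/test_xgb.csv", index=False)
```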

WARNING: the creation of the metafeatures is an extremely long process. Running it on a single-threaded machine was never attempted and could take in excess of 20 days (a pure guesstimate).

Second-Level Meta Ensembling

Once we have the predictions of a variety of models as metafeatures, we build a second-stage set of metafeatures using build_ensemble.R. This script first removes any linear combinations among the metafeatures that may have been produced, and then runs five more stacked classification models: nnet, xgboost, ranger, HillClimbing (our implementation of the "library of models" approach of Caruana et al.), and glmnet.
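
A rough Python analogue of that filtering step, which greedily keeps only columns that increase the matrix rank (the tolerance is an assumption, and this is not the exact logic of build_ensemble.R):

```python
# Drop metafeature columns that are exact linear combinations of earlier ones.
import numpy as np

def drop_linear_combos(X, tol=1e-8):
    keep = []
    for j in range(X.shape[1]):
        candidate = keep + [j]
        # Keep column j only if it adds a new dimension to the column space.
        if np.linalg.matrix_rank(X[:, candidate], tol=tol) == len(candidate):
            keep.append(j)
    return keep

meta = np.random.rand(100, 4)
meta = np.column_stack([meta, meta[:, 0] + meta[:, 1]])  # add a dependent column
print(drop_linear_combos(meta))  # -> [0, 1, 2, 3]; column 4 is dropped
```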

Third-Level Meta Ensemble

This is a final blend of the models using our hand-crafted hill-climbing model, which is contained in the build_ensemble.R file.
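
A minimal Python sketch of such a hill-climbing loop, in the spirit of the Caruana et al. ensemble-selection technique (greedy forward selection with replacement, maximizing validation AUC; not the exact build_ensemble.R logic):

```python
# Greedy ensemble selection with replacement (Caruana et al. style).
import numpy as np
from sklearn.metrics import roc_auc_score

def hill_climb(preds, y_valid, n_steps=50):
    """preds: dict of model name -> validation predictions.
    Returns selection counts; final blend weights are counts / n_steps."""
    total = np.zeros(len(y_valid))
    counts = {name: 0 for name in preds}
    for step in range(n_steps):
        # Add the model that most improves AUC of the running average.
        scored = [(roc_auc_score(y_valid, (total + p) / (step + 1)), name)
                  for name, p in preds.items()]
        best_auc, best_name = max(scored)
        total += preds[best_name]
        counts[best_name] += 1
    return counts
```

The final blend is then a weighted average of the model predictions, with each model's weight equal to the number of times it was selected divided by n_steps.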
