My first Kaggle competition
https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction
“In this challenge, your task is to predict a transformed count of hazards or pre-existing damages using a dataset of property information. This will enable Liberty Mutual to more accurately identify high risk homes that require additional examination to confirm their insurability.”
Main points:
- Predictors’ names are NOT informative, so you cannot use insurance-specific domain knowledge to improve the models; you can only use machine learning techniques.
- There are categorical variables that you have to factorize. Doing this in Python seems a little more complicated than in R.
PS: (DO NOT export this code. When exporting, all code blocks are executed and you will create AMI instances)
PS: You can configure org-mode to not execute code blocks during export:
TODO Make the code safer by setting org export parameters in the header to prevent code execution while exporting.
(setq org-export-babel-evaluate nil)
- Run models on AWS cloud service (EC2)
- Run the models exploring parallelism
- Develop literate devops (deployment) using Emacs
- Learning popular python analytics libraries
- Start learning new machine learning technique
- Gradient Boost Tree (Implementation XGBoost)
https://github.com/dmlc/xgboost
We started with exploratory data analysis (EDA) to get familiar with the data and begin understanding the relationships between the predictors and the response variable. There are 33 columns and almost 51k rows in the training data. The Hazard column is the response (dependent) variable and it is an integer starting at 1. Many predictors (features) are categorical, while others are numerical. In this competition there is no additional information about the data.
To get a bird's-eye view of the data, we built histograms of the predictors as well as the correlation matrix.
PS: Shamelessly stolen from http://blog.kaggle.com/2015/09/28/liberty-mutual-property-inspection-winners-interview-qingchen-wang/
We investigated the Hazard score and discovered that it is extremely concentrated in the first levels: almost 40% of the data has a Hazard score equal to 1, and 80% of the data has a Hazard score less than 7.
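This kind of concentration check is a one-liner in pandas; a sketch on a toy series (the real `train_pre.Hazard` column, with ~51k rows, is not reproduced in this document):

```python
import pandas as pd

# Toy stand-in for train_pre.Hazard
hazard = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 6, 9])

share_of_ones = (hazard == 1).mean()   # fraction of rows with Hazard == 1
share_below_7 = (hazard < 7).mean()    # fraction of rows with Hazard < 7
print(share_of_ones, share_below_7)    # 0.4 0.9 on this toy series
```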
To start reducing dimensionality through feature selection, we used the relative importance plot of the first xgboost models and eliminated a few features based on it.
import pandas as pd
import numpy as np
import xgboost as xgb
import libs.utils as utl
import libs.exploratory as epl
train_pre = pd.read_pickle("data/pre/train_pre.pkl")
train_pre.Hazard.describe()
epl.build_histogram_dashboard(train_pre)
epl.build_corrmatrix_dashboard(train_pre)
xgb_model_file = "submissions/20151021/xgb_model.bin"
xgb_model = xgb.Booster({'nthread': 3})  # initialize the booster
xgb_model.load_model(xgb_model_file)  # load the trained model
epl.build_xgb_features_importance_dashboard(xgb_model,train_pre)
The only preprocessing we tried was factorizing the categorical columns:
columns_to_factorize = [
'T1_V4', 'T1_V5', 'T1_V6', 'T1_V7', 'T1_V8',
'T1_V9', 'T1_V11', 'T1_V12', 'T1_V15', 'T1_V16',
'T1_V17', 'T2_V3', 'T2_V5', 'T2_V11', 'T2_V12',
'T2_V13'
]
The other columns are numerical. It was impossible to interpret the features based on their names, and there are no explanations of the individual features, but I should have tried a few data transformations (that will be one of the next steps in the next competition).
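For reference, a minimal sketch of the factorization step with `pd.factorize` (a toy frame with two of the columns listed above; the real preprocessing lives elsewhere in the project):

```python
import pandas as pd

# Toy frame with two of the categorical columns listed above
df = pd.DataFrame({'T1_V4': ['B', 'A', 'B', 'C'],
                   'T1_V5': ['K', 'K', 'M', 'K']})

for col in ['T1_V4', 'T1_V5']:
    df[col] = pd.factorize(df[col])[0]   # integer codes, in order of appearance

print(df['T1_V4'].tolist())  # [0, 1, 0, 2]
print(df['T1_V5'].tolist())  # [0, 0, 1, 0]
```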
We worked only with Gradient Boosting (https://en.wikipedia.org/wiki/Gradient_boosting), because it is a technique I was not familiar with. GBM combines weak learners in order to get a single strong learner. In each iteration, the current model \(F_k\) is improved by adding a new function \(h_k\):
\begin{equation} F_{k+1}(\mathbf{x}) = F_k(\mathbf{x}) + h_k(\mathbf{x}) \nonumber \end{equation}
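To make the update concrete, here is a from-scratch sketch with depth-1 trees (stumps) as the weak learners and squared loss, so each h_k simply fits the residuals of F_k (an illustration only, not the xgboost implementation):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r."""
    best = None
    for t in np.unique(x)[:-1]:                # candidate split points
        left, right = r[x <= t], r[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((r - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lm, rm = best
    return lambda z: np.where(z <= t, lm, rm)

# Toy 1-D regression problem
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])

F = np.zeros_like(y)                           # F_0 = 0
for k in range(50):
    h = fit_stump(x, y - F)                    # h_k fits the residuals of F_k
    F = F + 0.5 * h(x)                         # F_{k+1} = F_k + eta * h_k

print(((y - F) ** 2).mean())                   # training MSE shrinks toward 0
```

Each round the strong model gets the shrunk prediction of a stump trained on what the current ensemble still gets wrong, exactly the additive update in the equation above.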
We chose XGBoost, a parallel implementation of GBM, because it is very popular in Kaggle competitions and allows us to run GBM algorithms in parallel. The main points are summarized below:
- Usability
- Easy to install (local and remote machine)
- Easy to use in R and Python
- Efficiency
- Can explore parallelism
- Can run in clusters and multithreads systems
- Implemented in C/C++ (Double check this later)
- Feasibility
- Customized objective and evaluation function
- Tunable parameters
The parameters that we investigated during the competition were:
- Controls complexity
- gamma
- max_depth
- Robust to noise
- subsample
- colsample_bytree
- num_round
- Optimization related
- eta: controls the learning rate (It can help to prevent overfitting)
We randomly split the data into train (70%) and validation (30%) sets and tried different values for eta, max_depth and num_round. We used rmse as the metric to train the model, but we also monitored the Gini metric on the validation data set. An important property of the Gini metric is that only the order of the predictions matters.
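A sketch of the normalized Gini metric (a common formulation from the competition forums; our actual monitoring helper is not reproduced here). Note how a monotone transform of the predictions leaves the score unchanged:

```python
import numpy as np

def gini(actual, pred):
    """Unnormalized Gini: depends only on the ordering induced by pred."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)
    order = np.argsort(pred)[::-1]            # sort by prediction, descending
    cum = np.cumsum(actual[order]) / actual.sum()
    return cum.sum() / n - (n + 1) / (2.0 * n)

def normalized_gini(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

y_val = np.array([1, 1, 2, 3, 5], dtype=float)
preds = np.array([0.1, 0.5, 0.2, 0.9, 1.2])
print(normalized_gini(y_val, preds))
# Only the order matters: a monotone transform gives the same score
print(normalized_gini(y_val, 2 * preds + 1))
```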
We started by modifying the starter kit. We observed a significant increase in our score when we chose count:poisson as the objective, which is a natural choice because the response variable is an integer count.
AWS services provide us (data scientists) with access to clusters, machines with large memory, powerful GPUs and distributed systems at a low price, thanks to the hardware-as-commodity business model. Of course there are more reasons, some even more important than those cited (reliability and scalability), but they are out of the scope of this document.
I started by exploring the service known as Elastic Compute Cloud (EC2). EC2 allows us to run a virtual machine, or a cluster of virtual machines, in the cloud, and you can scale up or down according to your needs.
I installed the AWS command line tools on my local machine
(https://aws.amazon.com/cli/). I found an interesting blog post
(http://howardism.org/Technical/Emacs/literate-devops.html) about how to
deploy code using Emacs + org-mode
(org-babel: http://orgmode.org/worg/org-contrib/babel/). The process is
known as literate programming deployment. It makes life much easier,
because it automates the entire process of deploying the code to the AWS
cloud and also provides better documentation of the whole deployment
process. Emacs also has a nice mode called TRAMP that can be used to
edit remote files as if they were local
(http://www.emacswiki.org/emacs/TrampMode).
To avoid reinventing the wheel and to simplify the entire process, we started by choosing the AMI using the criteria below, ranked by priority:
- Total Cost: < USD 10
- #cpu : [8 ,16]
- Memory RAM: 2GB
- Systems similar to the development environment (my local machine)
- with pre-installed tools:
- python and pip (same version or similar of my local machine)
- scikit-learn, pandas and numpy
- json and zipfile
- command make
- Easy to install xgboost
- Storage: 8GB (The minimum will be enough)
- Networking requirements: low
Instance candidates:
- m3.2xlarge
- #cpu: 8
- RAM: 30 GB
- pricing: 0.616/hour => 16h
- m4.2xlarge
- #cpu: 8
- RAM: 32 GB
- pricing: 0.588/hour => 17h
- c1.xlarge old generation instance
- #cpu: 8
- RAM: 7 GB
- pricing: 0.478/hour => 21h
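The hour figures above are simply the USD 10 budget divided by each hourly price:

```python
# Hours of compute afforded by the USD 10 budget at each instance's hourly price
budget = 10.0
prices = {'m3.2xlarge': 0.616, 'm4.2xlarge': 0.588, 'c1.xlarge': 0.478}
hours = {name: round(budget / price) for name, price in prices.items()}
print(hours)  # {'m3.2xlarge': 16, 'm4.2xlarge': 17, 'c1.xlarge': 21}
```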
Based on the criteria, we chose the community AMI (Compute-Optimized) anaconda-2.3.0-on-ubuntu-14.04-lts - ami-31b27375 (thanks to the Anaconda project: http://docs.continuum.io/anaconda/images). Compute-Optimized instances have a higher ratio of vCPUs to memory than other families and the lowest cost per vCPU of all the Amazon EC2 instance types. Our budget with this instance allows us to play for 21h in the AWS cloud. :)
To access the web interface use the link below and, if you need to create a new key pair, follow the instructions below.
- Login to the aws console: https://xxxxxxxxxxxx.signin.aws.amazon.com/console/
- Use the aws console web interface to create the key pair (in case you
don't have it):
- create and download the key pair: key.pem
- move key.pem to ~/.ssh/
- change the permissions: chmod 400 key.pem
Run this only if you don't have the key pair yet.
mv -v ~/Downloads/key.pem ~/.ssh/
chmod 400 ~/.ssh/key.pem
The deployment process is explained in the next sections. You can run the code blocks inside Emacs with C-c C-c, or you can use Emacs to build and save the commands and run them manually.
- launch: anaconda-2.3.0-on-ubuntu-14.04-lts - ami-31b27375
- the ami has almost the same python version as the development
environment
- ami: python-2.7.10
- dev: python-2.7.6
- set tag: kaggle-competition-ncalifornia
## Launch instance and get instance id
INSTANCE_TYPE=c1.xlarge
INSTANCE_ID=`aws ec2 run-instances --image-id ami-31b27375 \
  --security-group-ids sg-d681d4b3 --count 1 \
  --instance-type $INSTANCE_TYPE --key-name key \
  --query 'Instances[0].InstanceId' --output text`
echo "Instance ID: "
echo $INSTANCE_ID
# Get instance public ip
INSTANCE_PUBLIC_IP=`aws ec2 describe-instances --instance-ids $INSTANCE_ID \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text`
echo "Instance PublicIP: "
echo $INSTANCE_PUBLIC_IP
PS: You need to wait for the instance to boot. This takes about a minute.
# Get instance state
aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[0].Instances[0].State.Name'
- Export setup.sh script (tangle code in Property_Inspection_Prediction.org)
Only if you are using emacs, org-mode and org-babel.
- Go to setup.sh first block
- C-u C-u C-c C-v t (run org-tangle with 2 Universal arguments)
- Copy the project and data to the ami
- Compress the project and remove unnecessary folders and files
cd ~/Documents/kaggle/competition/
tar -cjf ~/tmp/lmgpip.pack.tar.bz2 Liberty_Mutual_Group_Property_Inspection_Prediction \
  --exclude-backups --exclude-vcs \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/data/pre/* \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/dev \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/snippet \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/study \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/scratch \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/.idea \
  --exclude='*.pyc' \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/submissions/2015* \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/figures/*
cd -
- Copy the package to the running instance
scp -o "StrictHostKeyChecking no" -i ~/.ssh/key.pem ~/tmp/lmgpip.pack.tar.bz2 ubuntu@$INSTANCE_PUBLIC_IP:/home/ubuntu/
- Extract the project on the running instance
ssh -t -o "StrictHostKeyChecking no" \
  -i ~/.ssh/key.pem \
  ubuntu@$INSTANCE_PUBLIC_IP 'tar -xjvf lmgpip.pack.tar.bz2'
- Update ami
- Set setup.sh permission
ssh -t -o "StrictHostKeyChecking no"\
-i ~/.ssh/key.pem \
ubuntu@$INSTANCE_PUBLIC_IP 'chmod -v 700 Liberty_Mutual_Group_Property_Inspection_Prediction/config/setup.sh'
- Run setup.sh
echo "ssh -t -o \"StrictHostKeyChecking no\"\\
-i ~/.ssh/key.pem \\
ubuntu@$INSTANCE_PUBLIC_IP 'bash -x ./Liberty_Mutual_Group_Property_Inspection_Prediction/config/setup.sh'"
PS: This is going to take a while (about 7 minutes). PS: For debugging, ssh into the instance and run the script manually.
- Check deployment by running unit tests
ssh -t -o "StrictHostKeyChecking no" \
  -i ~/.ssh/key.pem \
  ubuntu@$INSTANCE_PUBLIC_IP \
  'cd ./Liberty_Mutual_Group_Property_Inspection_Prediction/ ; pwd; /home/ubuntu/anaconda/bin/nosetests tests/'
ssh to the running instance (ami)
- access
echo "ssh -i ~/.ssh/key.pem ubuntu@$INSTANCE_PUBLIC_IP"
- configure emacs tramp (edit remote file)
- edit ~/.ssh/config
echo "Host $INSTANCE_PUBLIC_IP" > ~/.ssh/config
echo "  IdentityFile ~/.ssh/key.pem" >> ~/.ssh/config
echo "  HostName $INSTANCE_PUBLIC_IP" >> ~/.ssh/config
echo "  User ubuntu" >> ~/.ssh/config
cat ~/.ssh/config
- In Emacs, C-x C-f (go to root and type ssh:)
- List the anaconda images
aws ec2 describe-images --filters "Name=name,Values=*anaconda*" --output text
- Stop
aws ec2 stop-instances --instance-ids $INSTANCE_ID
- Start
aws ec2 start-instances --instance-ids $INSTANCE_ID
- Terminate
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
This script prepares the AMI instance for the project. It updates the system and installs the necessary packages, such as xgboost and nose.
echo "preparing environment variables"
export PATH=/home/ubuntu/anaconda/bin:${PATH}
echo "updating the system"
sudo apt-get update ## && sudo apt-get upgrade -y
echo "installing packages"
echo "\tinstalling git"
sudo apt-get -y install git
echo "\tinstalling make"
sudo apt-get -y install make
echo "\tinstalling htop"
sudo apt-get -y install htop
echo "\tinstalling g++"
sudo apt-get -y install g++
Update pip and install nose to run unit test
echo "updating pip"
pip install --upgrade pip
echo "installing nose"
pip install nose
Install XGBoost: https://github.com/dmlc/xgboost/tree/master/python-package
echo "clone xgboost"
git clone https://github.com/dmlc/xgboost.git
echo "building xgboost"
cd xgboost
./build.sh
echo "python setting up"
cd python-package
python setup.py install
My local machine configuration
- Operating System: Ubuntu 14.04.3 LTS
- Processor: 4x Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
- RAM: 6012 MB
- #cpus: 4
To make sure that running the code on the EC2 instance is worthwhile, we varied the number of trees (num_round) and executed the code with different numbers of threads on both the remote and local machines. We concluded that, in our configuration, there is a significant gain in time performance when we execute xgboost in the cloud with 6 threads and num_round greater than 500. See the comparative graph below.
Legend:
- Rem thr N: executed in ec2 instance with N threads
- Loc thr N: executed in local machine with N threads
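The grid of runs behind the comparison can be sketched with a small timing harness (`fake_train` is a stand-in; the real runs called xgboost's training with the given num_round and nthread, and the timings depend on the machine):

```python
import time

def benchmark(train_fn, num_rounds, nthreads):
    """Time train_fn for every (num_round, nthread) combination."""
    timings = {}
    for n in num_rounds:
        for t in nthreads:
            start = time.perf_counter()
            train_fn(num_round=n, nthread=t)
            timings[(n, t)] = time.perf_counter() - start
    return timings

# Stand-in workload; the real runs called xgboost's training instead
def fake_train(num_round, nthread):
    sum(range(num_round * 1000))

times = benchmark(fake_train, num_rounds=[100, 500, 1500], nthreads=[1, 6])
print(sorted(times))
```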
htop
The image above is the output of the htop command; it shows our algorithm running in parallel using 7 CPU cores.
In order to assess my relative performance and to plan my next steps and strategy, we conducted a brief analysis of the competition leaderboard scores and also of scores that I found on the internet.
The table below summarizes the scores I found on the internet. The difference between my best score and the winner's score is only 2.2%, but I made only 18 submissions (the winner made 232) because of the limited amount of time I could spend on the competition. This suggests that I would have to spend much more time to have any chance of winning a competition, or at least of finishing in the top 25%.
Model | public | private | Desc | link
---|---|---|---|---
Winner | 0.394970 | 0.397064 | Ensemble: 232 entries. Takes 2h to run |
25% Pos: 559 | 0.391804 | | Yi Li |
alex | 0.390355 | 0.392787 | Ensemble | alex
Me | 0.385060 | 0.387957 | Single model XGBoost: 18 entries |
Sean XGBoost | 0.392 | | XGBoost (not many details) | sean
Sean AWML | 0.343 | | Amazon Machine Learning (AML) service | sean
Xavier Xgboost | 0.391169 | | Xgboost ensemble | xavier
Xavier Random Forest | 0.373147 | | | xavier
Xavier SVM | 0.3188 | | | xavier
getwd()
source("libs/kaggle_leaderboard_parser.R")
source("libs/kaggle_leaderboard_dashboard.R")
# Downloading the leaderboard
# Shamelessly stolen (adapted) from Jeff Hebert: https://rstudio-pubs-static.s3.amazonaws.com/29531_4b5b689e7adf4448a8d420e6b356397c.html
contest.url <- "https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction"
prop.inspection.lb <- leaderboard(contest.url)
build.leaderboard.dashboard(prop.inspection.lb)
The histogram below shows a comparison between the private score distributions of all kaggle competitors and my public and private scores.
In this competition, the private Gini score of my model was higher than the public one. My score is located to the left of the mode of the histogram. So we calculated a private score improvement metric, by subtracting the public score from the private one, and then we investigated how much the scores changed between the public and private leaderboards.
We noted that almost half of the top 25 in the public leaderboard were able to improve their rank in the private leaderboard, but in general the public rank can be very different from the private rank. See the boxplot.
The scatterplot below shows the relationship between Gini score improvement and rank improvement. We selected the top 100 submissions in the private leaderboard for this analysis.
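The improvement metrics are straightforward to compute from the leaderboard table; a sketch on a hypothetical three-row frame (the column names are assumptions; the real parsing lives in libs/kaggle_leaderboard_parser.R):

```python
import pandas as pd

# Hypothetical leaderboard slice; column names are assumptions
lb = pd.DataFrame({'public_score':  [0.3950, 0.3904, 0.3851],
                   'private_score': [0.3971, 0.3928, 0.3880],
                   'public_rank':   [3, 1, 2],
                   'private_rank':  [1, 2, 3]})

lb['score_improvement'] = lb['private_score'] - lb['public_score']
lb['rank_improvement']  = lb['public_rank'] - lb['private_rank']   # positive = moved up
print(lb[['score_improvement', 'rank_improvement']])
```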
Few kagglers in the top 100 of the private leaderboard actually saw their Gini score decrease. We were located in the upper-right quadrant, where kagglers increased their private score but lost positions in the leaderboard: our neighbors in the public leaderboard were able to increase their scores even more. The winner increased his score slightly, gained one position, and ended the competition in first place. The data seems to contain a few clusters that might be related to similar types of models or approaches, and you can see the pattern that large improvements in score can lead to a better rank.
Thanks to Emacs and org-mode (http://orgmode.org/), we were able to track the time spent on every task. The tasks in this project were classified as:
- DOC (28%): Time spent writing documentation and taking notes
- MODELLING (20%): Time spent analyzing, modeling and planning the next steps
- DATA (3%): Time spent in preparing the data for analysis
- PROG (26%): Time spent implementing and refactoring the code
- STUDY (23%): Time spent studying libraries and machine learning algorithms
It is interesting to note that, thanks to Kaggle's good work, I spent only 3% of the time preparing the data. Normally, I spend 60% to 80% of the time on data processing: acquiring data, deciding which data to collect or use, preparing, cleaning and dealing with missing values*.
The necessity of saving code for the next competitions is clear, and I expect the time I spent studying to pay off. Most of the writing was done after the end of the competition, and I believe it was very important.
PS: This is a rough estimate, but useful for planning.
In general, Kaggle competitions are a good way to learn, try and test new machine learning algorithms.
- What I haven’t used
- I should have used cross-validation with grid search or randomized search to tune the parameters and save time
- I should have spent more time designing the training and validation data. It is good to have validation data similar to the test data (submission)
- I should have tried ensemble model
- Bagging or
- Boosting or
- Stacking (Blending)
- In real data analysis, where interpretability is extremely important, I would have spent more time on the exploratory phase and on variable selection. I still believe that it might have contributed to better results
- Goals and What I learned
- Running the algorithms on the AWS cloud is cheap and can save a lot of time
- Setting up AWS instances can be greatly facilitated by using literate deployment with Emacs and org-babel.
- Developing the algorithms in Python was not as difficult as I was expecting (I normally use R for these tasks). The first thing I noticed is that working with categorical data is easier in R.
- Keeping organized and tracking all your attempts is extremely important
- Gradient Boosting is a powerful technique and also can be used as feature selection (relative importance)
- Kaggle competitions, blogs and forums are a good way to train and apply machine learning algorithms
- It is important to understand the evaluation metric.
I used a lot of information from other blogs. I tried to cite everything, but I confess that during my annotations I lost track of many of my sources. So, if you see something that came from another site and was not cited, and you feel wronged, please let me know and I will do my best to include all references.
The author believes that sharing code and knowledge is awesome. Feel free to share and modify this piece of code, but don't be impolite: remember to cite the author and give him credit.
(defun send-region-to-terminal (start end)
  "Execute region in an inferior terminal.
To help org-babel deploy projects on aws.
Basically it sends the current region to the terminal process
buffer named *terminal*."
  (interactive "r")
  (process-send-string "*terminal*" (concat (buffer-substring-no-properties start end) "\n")))
https://www.kaggle.com/wiki/WinningModelDocumentationTemplate
- DRY: Do not repeat yourself
- Write shy code, design by contract, and keep unit testing in mind
- Decoupling and the Law of Demeter
- The Law of Demeter for functions states that any method of an object should call only methods belonging to:
- itself
- parameter that was passed in to the method
- any object it created
- any directly held component objects
- Write code that writes code (Yasnippet)
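A minimal Python sketch of the Law of Demeter (hypothetical classes):

```python
class Engine:
    """Hypothetical component object."""
    def __init__(self):
        self.rpm = 0

    def set_rpm(self, rpm):
        self.rpm = rpm

class Car:
    def __init__(self):
        self._engine = Engine()        # a directly held component

    def accelerate(self):
        self._engine.set_rpm(3000)     # OK: a method may talk to its own components

# Violates the law: reaches through Car into its internal component
def bad(car):
    car._engine.set_rpm(3000)

# Respects the law: asks the object to do the work itself
def good(car):
    car.accelerate()

car = Car()
good(car)
print(car.rpm if hasattr(car, 'rpm') else car._engine.rpm)  # 3000
```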
- Change the header structure and create Dev Code and Analysis headers
- Set the :noexport: tag to exclude the Dev Code and Analysis subtrees from the output
- org-html-export-as-html
- Save as html (Stop here to publish as html)
- Edit (delete) the xml lines (first 3 lines)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- Open it in MS word
- Remember to turn on Navigation Panel in word:
- View -> Tick Navigation Panel
- Change the header structure and create Dev Code and Analysis headers
- Set the :noexport: tag to exclude the Dev Code and Analysis subtrees from the output
- org-html-export-as-html
- Save as html (Stop here to publish as html)
- Zip (the project folder)
- model_2014.org and/or model_2014.docx
- model_2014.html
- figures
- org-md-export-to-markdown: C-c C-e m m
ipython notebook &
The browser will open a new page at http://127.0.0.1:8888/ with your notebook.
- C-c C-v t (org-tangle)
Steps:
- M-x ess-build-tags-for-directory
- Select the folder (Rcode)
- Select the TAGS file
- visit-tags-table (update the hash)
- M-. visit tag (while the point is on a function call)
Unfortunately, these programs do not recognize R code syntax. They do allow tagging of arbitrary language files through regular expressions, but this is not sufficient for R.
R 2.9.0 onwards provides the rtags function as a tagging utility for R code. It parses R code files (using R’s parser) and produces tags in Emacs’ etags format.
To update you can use: M-x visit-tags-table (select the tag table)
M-. = visit tag (go to the function definition)
M-x ess-build-tags-for-directory runs the shell script below for you: it asks for the directory in which to run rtags, and then for the file in which to save the tags (TAGS).
## Generate TAGS file
rtags(path="Rcode/",recursive = TRUE,verbose=TRUE,ofile = "TAGS")
This does not seem to work with ggtags (a mode for working with tags in Emacs).
The gtags command is GNU tags. It supports several languages, and projectile works with gtags.
So this way I will not have the TAGS updated every time I save files.
- C-c C-c inside FSTREE