My first Kaggle competition
https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction
“In this challenge, your task is to predict a transformed count of hazards or pre-existing damages using a dataset of property information. This will enable Liberty Mutual to more accurately identify high risk homes that require additional examination to confirm their insurability.”
Main points:
- Predictors’ names are NOT informative, so you cannot use insurance-specific domain knowledge to improve the models; you can only use machine learning techniques.
- There are categorical variables that you have to factorize. Doing this in Python seems a little more complicated than in R.
PS: (DO NOT export this code. When exporting, all code blocks are executed and you will create AMI instances)
PS: You can configure org-mode to not execute code blocks during export:
TODO Make the code safer by setting org export parameters in the header to prevent code execution while exporting.
(setq org-export-babel-evaluate nil)
- Run models on AWS cloud service (EC2)
- Run the models exploring parallelism
- Develop literate devops (deployment) using Emacs
- Learning popular python analytics libraries
- Start learning new machine learning technique
- Gradient Boost Tree (Implementation XGBoost)
https://github.com/dmlc/xgboost
We started with exploratory data analysis (EDA) to get familiar with the data and begin understanding the relationships between the predictors and the response variable. There are 33 columns and almost 51k rows in the training data. The Hazard column is the response (dependent) variable and it is an integer starting at 1. Many predictors (features) are categorical, while others are numerical. In this competition there is no additional information about the data.
To get a bird's-eye view of the data, we built histograms of the predictors as well as the correlation matrix.
PS: Shamelessly stolen from http://blog.kaggle.com/2015/09/28/liberty-mutual-property-inspection-winners-interview-qingchen-wang/
We investigated the Hazard score and discovered that it is extremely concentrated in the first levels: almost 40% of the data has a Hazard score equal to 1, and 80% of the data has a Hazard score less than 7.
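This kind of concentration check is a one-liner in pandas; a sketch on a toy series (the real `train_pre.Hazard` column, with ~51k rows, is not reproduced in this document):

```python
import pandas as pd

# Toy stand-in for train_pre.Hazard
hazard = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 6, 9])

share_of_ones = (hazard == 1).mean()   # fraction of rows with Hazard == 1
share_below_7 = (hazard < 7).mean()    # fraction of rows with Hazard < 7
print(share_of_ones, share_below_7)    # 0.4 0.9 on this toy series
```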
To start reducing dimensionality through feature selection, we used the relative importance plot of the first xgboost models and eliminated a few features based on it.
import pandas as pd
import numpy as np
import xgboost as xgb
import libs.utils as utl
import libs.exploratory as epl
train_pre = pd.read_pickle("data/pre/train_pre.pkl")
train_pre.Hazard.describe()
epl.build_histogram_dashboard(train_pre)
epl.build_corrmatrix_dashboard(train_pre)
xgb_model_file = "submissions/20151021/xgb_model.bin"
xgb_model = xgb.Booster({'nthread': 3})  # initialize the booster
xgb_model.load_model(xgb_model_file)  # load the trained model
epl.build_xgb_features_importance_dashboard(xgb_model,train_pre)
The only preprocessing we tried was factorizing the categorical columns:
columns_to_factorize = [
'T1_V4', 'T1_V5', 'T1_V6', 'T1_V7', 'T1_V8',
'T1_V9', 'T1_V11', 'T1_V12', 'T1_V15', 'T1_V16',
'T1_V17', 'T2_V3', 'T2_V5', 'T2_V11', 'T2_V12',
'T2_V13'
]
The other columns are numerical. It was impossible to interpret the features based on their names, and there are no explanations of the individual features, but I should have tried a few data transformations (that will be one of the next steps in the next competition).
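For reference, a minimal sketch of the factorization step with `pd.factorize` (a toy frame with two of the columns listed above; the real preprocessing lives elsewhere in the project):

```python
import pandas as pd

# Toy frame with two of the categorical columns listed above
df = pd.DataFrame({'T1_V4': ['B', 'A', 'B', 'C'],
                   'T1_V5': ['K', 'K', 'M', 'K']})

for col in ['T1_V4', 'T1_V5']:
    df[col] = pd.factorize(df[col])[0]   # integer codes, in order of appearance

print(df['T1_V4'].tolist())  # [0, 1, 0, 2]
print(df['T1_V5'].tolist())  # [0, 0, 1, 0]
```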
We worked only with Gradient Boosting (https://en.wikipedia.org/wiki/Gradient_boosting), because it is a technique I was not familiar with. GBM combines weak learners in order to get a single strong learner. In each iteration, the current model \(F_k\) is improved by adding a new function \(h_k\):
\begin{equation} F_{k+1}(\mathbf{x}) = F_k(\mathbf{x}) + h_k(\mathbf{x}) \nonumber \end{equation}
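To make the update concrete, here is a from-scratch sketch with depth-1 trees (stumps) as the weak learners and squared loss, so each h_k simply fits the residuals of F_k (an illustration only, not the xgboost implementation):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r."""
    best = None
    for t in np.unique(x)[:-1]:                # candidate split points
        left, right = r[x <= t], r[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((r - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lm, rm = best
    return lambda z: np.where(z <= t, lm, rm)

# Toy 1-D regression problem
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])

F = np.zeros_like(y)                           # F_0 = 0
for k in range(50):
    h = fit_stump(x, y - F)                    # h_k fits the residuals of F_k
    F = F + 0.5 * h(x)                         # F_{k+1} = F_k + eta * h_k

print(((y - F) ** 2).mean())                   # training MSE shrinks toward 0
```

Each round the strong model gets the shrunk prediction of a stump trained on what the current ensemble still gets wrong, exactly the additive update in the equation above.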
We chose XGBoost, a parallel implementation of GBM, because it is very popular in Kaggle competitions and allows us to run GBM algorithms in parallel. The main points are summarized below:
- Usability
- Easy to install (local and remote machine)
- Easy to use in R and Python
- Efficiency
- Can explore parallelism
- Can run in clusters and multithreads systems
- Implemented in C/C++ (Double check this later)
- Feasibility
- Customized objective and evaluation function
- Tunable parameters
The parameters that we investigated during the competition were:
- Controls complexity
- gamma
- max_depth
- Robust to noise
- subsample
- colsample_bytree
- num_round
- Optimization related
- eta: controls the learning rate (It can help to prevent overfitting)
We randomly split the data into train (70%) and validation (30%) sets and tried different values for eta, max_depth and num_round. We used rmse as the metric to train the model, but we also monitored the Gini metric on the validation data set. An important property of the Gini metric is that only the order of the predictions matters.
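A sketch of the normalized Gini metric (a common formulation from the competition forums; our actual monitoring helper is not reproduced here). Note how a monotone transform of the predictions leaves the score unchanged:

```python
import numpy as np

def gini(actual, pred):
    """Unnormalized Gini: depends only on the ordering induced by pred."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)
    order = np.argsort(pred)[::-1]            # sort by prediction, descending
    cum = np.cumsum(actual[order]) / actual.sum()
    return cum.sum() / n - (n + 1) / (2.0 * n)

def normalized_gini(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

y_val = np.array([1, 1, 2, 3, 5], dtype=float)
preds = np.array([0.1, 0.5, 0.2, 0.9, 1.2])
print(normalized_gini(y_val, preds))
# Only the order matters: a monotone transform gives the same score
print(normalized_gini(y_val, 2 * preds + 1))
```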
We started by modifying the starter kit. We observed a significant increase in our score when we chose count:poisson as the objective, which is a natural choice because the response variable is an integer count.
AWS services provide us (data scientists) with access to clusters, machines with large memory, powerful GPUs and distributed systems at a low price, thanks to the hardware-as-commodity business model. Of course there are more reasons, some even more important than those cited (reliability and scalability), but they are out of the scope of this document.
I started by exploring the service known as Elastic Compute Cloud (EC2). EC2 allows us to run a virtual machine, or a cluster of virtual machines, in the cloud, and you can scale up or down according to your needs.
I installed the AWS command line tools on my local machine
(https://aws.amazon.com/cli/). I found an interesting blog post
(http://howardism.org/Technical/Emacs/literate-devops.html) about how to
deploy code using Emacs + org-mode
(org-babel: http://orgmode.org/worg/org-contrib/babel/). The process is
known as literate programming deployment. It makes life much easier,
because it automates the entire process of deploying the code to the AWS
cloud and also provides better documentation of the whole deployment
process. Emacs also has a nice mode called TRAMP that can be used to
edit remote files as if they were local
(http://www.emacswiki.org/emacs/TrampMode).
To avoid reinventing the wheel and to simplify the entire process, we started by choosing the AMI using the criteria below, ranked by priority:
- Total Cost: < USD 10
- #cpu : [8 ,16]
- Memory RAM: 2GB
- Systems similar to the development environment (my local machine)
- with pre-installed tools:
- python and pip (same version or similar of my local machine)
- scikit-learn, pandas and numpy
- json and zipfile
- command make
- Easy to install xgboost
- Storage: 8GB (The minimum will be enough)
- Networking requirements: low
Instance candidates:
- m3.2xlarge
- #cpu: 8
- RAM: 30 GB
- pricing: 0.616/hour => 16h
- m4.2xlarge
- #cpu: 8
- RAM: 32 GB
- pricing: 0.588/hour => 17h
- c1.xlarge old generation instance
- #cpu: 8
- RAM: 7 GB
- pricing: 0.478/hour => 21h
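The hour figures above are simply the USD 10 budget divided by each hourly price:

```python
# Hours of compute afforded by the USD 10 budget at each instance's hourly price
budget = 10.0
prices = {'m3.2xlarge': 0.616, 'm4.2xlarge': 0.588, 'c1.xlarge': 0.478}
hours = {name: round(budget / price) for name, price in prices.items()}
print(hours)  # {'m3.2xlarge': 16, 'm4.2xlarge': 17, 'c1.xlarge': 21}
```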
Based on the criteria, we chose the community AMI (Compute-Optimized) anaconda-2.3.0-on-ubuntu-14.04-lts - ami-31b27375 (thanks to the Anaconda project: http://docs.continuum.io/anaconda/images). Compute-Optimized instances have a higher ratio of vCPUs to memory than other families and the lowest cost per vCPU of all the Amazon EC2 instance types. Our budget with this instance allows us to play for 21h in the AWS cloud. :)
To access the web interface use the link below and, if you need to create a new key pair, follow the instructions below.
- Login to the aws console: https://xxxxxxxxxxxx.signin.aws.amazon.com/console/
- Use the aws console web interface to create the key pair (in case you
don't have it):
- create and download the key pair: key.pem
- move key.pem to ~/.ssh/
- change the permissions: chmod 400 key.pem
Run this only if you don't have the key pair yet.
mv -v ~/Downloads/key.pem ~/.ssh/
chmod 400 ~/.ssh/key.pem
The deployment process is explained in the next sections. You can run the code blocks inside Emacs with C-c C-c, or you can use Emacs to build and save the commands and run them manually.
- launch: anaconda-2.3.0-on-ubuntu-14.04-lts - ami-31b27375
- the ami has almost the same python version as the development
environment
- ami: python-2.7.10
- dev: python-2.7.6
- set tag: kaggle-competition-ncalifornia
## Launch instance and get instance id
INSTANCE_TYPE=c1.xlarge
INSTANCE_ID=`aws ec2 run-instances --image-id ami-31b27375 \
  --security-group-ids sg-d681d4b3 --count 1 \
  --instance-type $INSTANCE_TYPE --key-name key \
  --query 'Instances[0].InstanceId' --output text`
echo "Instance ID: "
echo $INSTANCE_ID
# Get instance public ip
INSTANCE_PUBLIC_IP=`aws ec2 describe-instances --instance-ids $INSTANCE_ID \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text`
echo "Instance PublicIP: "
echo $INSTANCE_PUBLIC_IP
PS: You need to wait for the instance to boot. This takes about a minute.
# Get instance state
aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[0].Instances[0].State.Name'
- Export setup.sh script (tangle code in Property_Inspection_Prediction.org)
Only if you are using emacs, org-mode and org-babel.
- Go to setup.sh first block
- C-u C-u C-c C-v t (run org-tangle with 2 Universal arguments)
- Copy the project and data to the ami
- Compress the project and remove unnecessary folders and files
cd ~/Documents/kaggle/competition/
tar -cjf ~/tmp/lmgpip.pack.tar.bz2 Liberty_Mutual_Group_Property_Inspection_Prediction \
  --exclude-backups --exclude-vcs \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/data/pre/* \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/dev \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/snippet \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/study \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/scratch \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/.idea \
  --exclude='*.pyc' \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/submissions/2015* \
  --exclude=Liberty_Mutual_Group_Property_Inspection_Prediction/figures/*
cd -
- Copy the package to the running instance
scp -o "StrictHostKeyChecking no" -i ~/.ssh/key.pem ~/tmp/lmgpip.pack.tar.bz2 ubuntu@$INSTANCE_PUBLIC_IP:/home/ubuntu/
- Extract the project on the running instance
ssh -t -o "StrictHostKeyChecking no" \
  -i ~/.ssh/key.pem \
  ubuntu@$INSTANCE_PUBLIC_IP 'tar -xjvf lmgpip.pack.tar.bz2'
- Update ami
- Set setup.sh permission
ssh -t -o "StrictHostKeyChecking no"\
-i ~/.ssh/key.pem \
ubuntu@$INSTANCE_PUBLIC_IP 'chmod -v 700 Liberty_Mutual_Group_Property_Inspection_Prediction/config/setup.sh'
- Run setup.sh
echo "ssh -t -o \"StrictHostKeyChecking no\"\\
-i ~/.ssh/key.pem \\
ubuntu@$INSTANCE_PUBLIC_IP 'bash -x ./Liberty_Mutual_Group_Property_Inspection_Prediction/config/setup.sh'"
PS: This is going to take a while (about 7 minutes). PS: For debugging, ssh into the instance and run the script manually.
- Check deployment by running unit tests
ssh -t -o "StrictHostKeyChecking no" \
  -i ~/.ssh/key.pem \
  ubuntu@$INSTANCE_PUBLIC_IP \
  'cd ./Liberty_Mutual_Group_Property_Inspection_Prediction/ ; pwd; /home/ubuntu/anaconda/bin/nosetests tests/'
ssh to the running instance (ami)
- access
echo "ssh -i ~/.ssh/key.pem ubuntu@$INSTANCE_PUBLIC_IP"
- configure emacs tramp (edit remote file)
- edit ~/.ssh/config
echo "Host $INSTANCE_PUBLIC_IP" > ~/.ssh/config
echo "  IdentityFile ~/.ssh/key.pem" >> ~/.ssh/config
echo "  HostName $INSTANCE_PUBLIC_IP" >> ~/.ssh/config
echo "  User ubuntu" >> ~/.ssh/config
cat ~/.ssh/config
- In Emacs, C-x C-f (go to root and type ssh:)
- List the anaconda images
aws ec2 describe-images --filters "Name=name,Values=*anaconda*" --output text
- Stop
aws ec2 stop-instances --instance-ids $INSTANCE_ID
- Start
aws ec2 start-instances --instance-ids $INSTANCE_ID
- Terminate
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
This script prepares the AMI instance for the project. It updates the system and installs the necessary packages, such as xgboost and nose.
echo "preparing environment variables"
export PATH=/home/ubuntu/anaconda/bin:${PATH}
echo "updating the system"
sudo apt-get update ## && sudo apt-get upgrade -y
echo "installing packages"
echo "\tinstalling git"
sudo apt-get -y install git
echo "\tinstalling make"
sudo apt-get -y install make
echo "\tinstalling htop"
sudo apt-get -y install htop
echo "\tinstalling g++"
sudo apt-get -y install g++
Update pip and install nose to run unit test
echo "updating pip"
pip install --upgrade pip
echo "installing nose"
pip install nose
Install XGBoost: https://github.com/dmlc/xgboost/tree/master/python-package
echo "clone xgboost"
git clone https://github.com/dmlc/xgboost.git
echo "building xgboost"
cd xgboost
./build.sh
echo "python setting up"
cd python-package
python setup.py install
My local machine configuration
- Operating System: Ubuntu 14.04.3 LTS
- Processor: 4x Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
- RAM: 6012 MB
- #cpus: 4
To make sure that running the code on the EC2 instance is worthwhile, we varied the number of trees (num_round) and executed the code with different numbers of threads on both the remote and local machines. We concluded that, in our configuration, there is a significant gain in time performance when we execute xgboost in the cloud with 6 threads and num_round greater than 500. See the comparative graph below.
Legend:
- Rem thr N: executed in ec2 instance with N threads
- Loc thr N: executed in local machine with N threads
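The grid of runs behind the comparison can be sketched with a small timing harness (`fake_train` is a stand-in; the real runs called xgboost's training with the given num_round and nthread, and the timings depend on the machine):

```python
import time

def benchmark(train_fn, num_rounds, nthreads):
    """Time train_fn for every (num_round, nthread) combination."""
    timings = {}
    for n in num_rounds:
        for t in nthreads:
            start = time.perf_counter()
            train_fn(num_round=n, nthread=t)
            timings[(n, t)] = time.perf_counter() - start
    return timings

# Stand-in workload; the real runs called xgboost's training instead
def fake_train(num_round, nthread):
    sum(range(num_round * 1000))

times = benchmark(fake_train, num_rounds=[100, 500, 1500], nthreads=[1, 6])
print(sorted(times))
```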
htop
The image above is the output of the htop command; it shows our algorithm running in parallel using 7 CPU cores.
In order to assess my relative performance and to plan my next steps and strategy, we conducted a brief analysis of the competition leaderboard scores and also of scores that I found on the internet.
The table below summarizes the scores I found on the internet. The difference between my best score and the winner's score is only 2.2%, but I made only 18 submissions (the winner made 232) because of the limited amount of time I could spend on the competition. This suggests that I would have to spend much more time to have any chance of winning a competition, or at least of finishing in the top 25%.
Model | public | private | Desc | link
---|---|---|---|---
Winner | 0.394970 | 0.397064 | Ensemble: 232 entries. Takes 2h to run |
25% Pos: 559 | 0.391804 | | Yi Li |
alex | 0.390355 | 0.392787 | Ensemble | alex
Me | 0.385060 | 0.387957 | Single model XGBoost: 18 entries |
Sean XGBoost | 0.392 | | XGBoost (not many details) | sean
Sean AWML | 0.343 | | Amazon Machine Learning (AML) service | sean
Xavier Xgboost | 0.391169 | | Xgboost ensemble | xavier
Xavier Random Forest | 0.373147 | | | xavier
Xavier SVM | 0.3188 | | | xavier
getwd()
source("libs/kaggle_leaderboard_parser.R")
source("libs/kaggle_leaderboard_dashboard.R")
# Downloading the leaderboard
# Shamelessly stolen (adapted) from Jeff Hebert: https://rstudio-pubs-static.s3.amazonaws.com/29531_4b5b689e7adf4448a8d420e6b356397c.html
contest.url <- "https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction"
prop.inspection.lb <- leaderboard(contest.url)
build.leaderboard.dashboard(prop.inspection.lb)
The histogram below shows a comparison between the private score distributions of all kaggle competitors and my public and private scores.
In this competition, the private Gini score of my model was higher than the public one. My score is located to the left of the mode of the histogram. So we calculated a private score improvement metric, by subtracting the public score from the private one, and then we investigated how much the scores changed between the public and private leaderboards.
We noted that almost half of the top 25 in the public leaderboard were able to improve their rank in the private leaderboard, but in general the public rank can be very different from the private rank. See the boxplot.
The scatterplot below shows the relationship between Gini score improvement and rank improvement. We selected the top 100 submissions in the private leaderboard for this analysis.
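The improvement metrics are straightforward to compute from the leaderboard table; a sketch on a hypothetical three-row frame (the column names are assumptions; the real parsing lives in libs/kaggle_leaderboard_parser.R):

```python
import pandas as pd

# Hypothetical leaderboard slice; column names are assumptions
lb = pd.DataFrame({'public_score':  [0.3950, 0.3904, 0.3851],
                   'private_score': [0.3971, 0.3928, 0.3880],
                   'public_rank':   [3, 1, 2],
                   'private_rank':  [1, 2, 3]})

lb['score_improvement'] = lb['private_score'] - lb['public_score']
lb['rank_improvement']  = lb['public_rank'] - lb['private_rank']   # positive = moved up
print(lb[['score_improvement', 'rank_improvement']])
```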
Few kagglers in the top 100 of the private leaderboard actually saw their Gini score decrease. We were located in the upper-right quadrant, where kagglers increased their private score but lost positions in the leaderboard: our neighbors in the public leaderboard were able to increase their scores even more. The winner increased his score slightly, gained one position, and ended the competition in first place. The data seems to contain a few clusters that might be related to similar types of models or approaches, and you can see the pattern that large improvements in score can lead to a better rank.
Thanks to Emacs and org-mode (http://orgmode.org/), we were able to track the time spent on every task. The tasks in this project were classified as:
- DOC (28%): Time spent writing documentation and taking notes
- MODELLING (20%): Time spent analyzing, modeling and planning the next steps
- DATA (3%): Time spent in preparing the data for analysis
- PROG (26%): Time spent implementing and refactoring the code
- STUDY (23%): Time spent studying libraries and machine learning algorithms
It is interesting to note that, thanks to Kaggle's good work, I spent only 3% of the time preparing the data. Normally, I spend 60% to 80% of the time on data processing: acquiring data, deciding which data to collect or use, preparing, cleaning and dealing with missing values*.
The necessity of saving code for the next competitions is clear, and I expect the time I spent studying to pay off. Most of the writing was done after the end of the competition, and I believe it was very important.
PS: This is a rough estimate, but useful for planning.
In general, Kaggle competitions are a good way to learn, try and test new machine learning algorithms.
- What I haven’t used
- I should have used cross-validation with grid search or randomized search to tune the parameters and save time
- I should have spent more time designing the training and validation data. It is good to have validation data similar to the test data (submission)
- I should have tried ensemble model
- Bagging or
- Boosting or
- Stacking (Blending)
- In real data analysis, where interpretability is extremely important, I would have spent more time on the exploratory phase and on variable selection. I still believe that it might have contributed to better results
- Goals and What I learned
- Running the algorithms on the AWS cloud is cheap and can save a lot of time
- Setting up AWS instances can be greatly facilitated by using literate deployment with Emacs and org-babel.
- Developing the algorithms in Python was not as difficult as I was expecting (I normally use R for these tasks). The first thing I noticed is that working with categorical data is easier in R.
- Keeping organized and tracking all your attempts is extremely important
- Gradient Boosting is a powerful technique and also can be used as feature selection (relative importance)
- Kaggle competitions, blogs and forums are a good way to train and apply machine learning algorithms
- It is important to understand the evaluation metric.
I used a lot of information from other blogs. I tried to cite everything, but I confess that during my annotations I lost track of many of my sources. So, if you see something that came from another site and was not cited, and you feel wronged, please let me know and I will do my best to include all references.
The author believes that sharing code and knowledge is awesome. Feel free to share and modify this piece of code, but don't be impolite: remember to cite the author and give him credit.
(defun send-region-to-terminal (start end)
  "Execute region in an inferior terminal.
To help org-babel deploy projects on aws.
Basically it sends the current region to the terminal process
buffer named *terminal*."
  (interactive "r")
  (process-send-string "*terminal*" (concat (buffer-substring-no-properties start end) "\n")))
https://www.kaggle.com/wiki/WinningModelDocumentationTemplate
- DRY: Do not repeat yourself
- Write shy code, design by contract, and keep unit testing in mind
- Decoupling and the Law of Demeter
- The Law of Demeter for functions states that any method of an object should call only methods belonging to:
- itself
- parameter that was passed in to the method
- any object it created
- any directly held component objects
- Write code that writes code (Yasnippet)
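A minimal Python sketch of the Law of Demeter (hypothetical classes):

```python
class Engine:
    """Hypothetical component object."""
    def __init__(self):
        self.rpm = 0

    def set_rpm(self, rpm):
        self.rpm = rpm

class Car:
    def __init__(self):
        self._engine = Engine()        # a directly held component

    def accelerate(self):
        self._engine.set_rpm(3000)     # OK: a method may talk to its own components

# Violates the law: reaches through Car into its internal component
def bad(car):
    car._engine.set_rpm(3000)

# Respects the law: asks the object to do the work itself
def good(car):
    car.accelerate()

car = Car()
good(car)
print(car.rpm if hasattr(car, 'rpm') else car._engine.rpm)  # 3000
```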
- Change the header structure and create Dev Code and Analysis headers
- Set the :noexport: tag to exclude the Dev Code and Analysis subtrees from the output
- org-html-export-as-html
- Save as html (Stop here to publish as html)
- Edit (delete) the xml lines (first 3 lines)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- Open it in MS word
- Remember to turn on Navigation Panel in word:
- View -> Tick Navigation Panel
- Change the header structure and create Dev Code and Analysis headers
- Set the :noexport: tag to exclude the Dev Code and Analysis subtrees from the output
- org-html-export-as-html
- Save as html (Stop here to publish as html)
- Zip (the project folder)
- model_2014.org and/or model_2014.docx
- model_2014.html
- figures
- org-md-export-to-markdown: C-c C-e m m
ipython notebook &
The browser will open a new page at http://127.0.0.1:8888/ with your notebook.
- C-c C-v t (org-tangle)
Steps:
- M-x ess-build-tags-for-directory
- Select the folder (Rcode)
- Select the TAGS file
- visit-tags-table (update the hash)
- M-. visit tag (while the point is on a function call)
Unfortunately, these programs do not recognize R code syntax. They do allow tagging of arbitrary language files through regular expressions, but this is not sufficient for R.
R 2.9.0 onwards provides the rtags function as a tagging utility for R code. It parses R code files (using R’s parser) and produces tags in Emacs’ etags format.
To update you can use: M-x visit-tags-table (select the tag table)
M-. = visit tag (go to the function definition)
M-x ess-build-tags-for-directory runs the shell script below for you: it asks for the directory in which to run rtags, and then for the file in which to save the tags (TAGS).
## Generate TAGS file
rtags(path="Rcode/",recursive = TRUE,verbose=TRUE,ofile = "TAGS")
This does not seem to work with ggtags (a mode for working with tags in Emacs).
The gtags command is GNU tags. It supports several languages, and projectile works with gtags.
So this way I will not have the TAGS updated every time I save files.
- C-c C-c inside FSTREE