Kaggle Invasive Species Monitoring

This is my solution to the Invasive Species Monitoring Kaggle prediction competition, which ended mid August 2017.

Techniques used:

Convolutional deep neural networks
Gradient boosting trees
Stacking
Bagging
Cross-validation

Tools used:

Keras (on top of TensorFlow)
XGBoost
scikit-learn
Python

To run

To run the whole process (assuming files have been placed as specified in config.yaml):

# Create folds.
python create_folds.py config.yaml

# Generate features.
python preprocess_nn.py config.yaml
python preprocess_gbt.py config.yaml

# Find hyperparameters.
python fit_hyperparams.py config.yaml 1 # GBT

# Fit models.
python train_loop.py config.yaml 0 # NN
python train_loop.py config.yaml 1 # GBT
python train_loop.py config.yaml 3 # NN (extra layers)
python train_loop.py config.yaml 4 # GBT (log loss)

# Combine models.
python stack.py config.yaml

Classifier overview

My ensemble consisted of four base models:

A convolutional neural networks (CNN), based on code from Finlay Liu;
Another CNN, with an extra layer;
A gradient boosting trees (GBT) (trained using XGBoost) with AUC score, based on code from Paulo Pinto;
Another GBT, with log likelihood loss.

From each of these I created a bagging ensemble, and then I stacked those together using regularised logistic regression.

Hyperparameters for the GBTs were chosen using cross-validation.

I originally included a support vector machine (SVM), but it performed so badly that I dropped it.

Discussion

I was fairly sure this would be a competition dominated by deep neural networks (DNNs), as these have dominated recent image-related competitions. I did not expect a high position in this competition, as the computers I had access to lacked the GPU RAM to run code made publicly available by Finlay Liu, which reported achieved 98.5% accuracy on the public test set. Rather, I saw this competition as a good chance to try linking a variety of machine learning tools together using the Python library scikit-learn, which I had not used before, and in this sense I felt it was a success.

There were several more approaches I would have liked to have tried in working on this data set:

Optimising hyperparameters using Gaussian processes, rather than cross-validation;
Trying different stacking functions;
Optimising the hyperparameters of the stacking function;
Trying to see what performance I could draw out of a SVM;
Trying more modifications of the DNNs. The largest DNN I was able to run seemed to perform worse than a smaller one, which I would have liked to have looked into further.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.gitignore		.gitignore
README.md		README.md
config.yaml		config.yaml
create_folds.py		create_folds.py
dummy.yaml		dummy.yaml
dummy_hyperparams.yaml		dummy_hyperparams.yaml
estimators.py		estimators.py
fit_hyperparams.py		fit_hyperparams.py
hyperparams.yaml		hyperparams.yaml
model_nn.py		model_nn.py
preprocess_gbt.py		preprocess_gbt.py
preprocess_nn.py		preprocess_nn.py
stack.py		stack.py
train.py		train.py
train_loop.py		train_loop.py
utils.py		utils.py

andycraig/kaggle-hydrangea

Folders and files

Latest commit

History

Repository files navigation

Kaggle Invasive Species Monitoring

To run

Classifier overview

Discussion

About

Resources

Stars

Watchers

Forks

Languages