The benchmark tests:

- scikit-learn implementation
- xgboost implementation
- LightGBM implementation
We used a conda environment. The toolboxes were installed as follows:
```
git clone https://github.com/scikit-learn/scikit-learn.git
cd scikit-learn
python setup.py install
```
```
git clone https://github.com/dmlc/xgboost.git
cd xgboost
git submodule init
git submodule update
cp make/config.mk .
```

Edit `config.mk` and set `TEST_COVER = 1` to activate the debug mode, then:

```
make -j
cd python-package
python setup.py install
```
```
git clone https://github.com/Microsoft/LightGBM.git
cd LightGBM
mkdir build
cd build
ccmake ../
```

In `ccmake`, select the Debug build type instead of Release, then:

```
make -j
cd ../python-package
python setup.py install
```
We used the following list of parameters:
Parameters | Value |
---|---|
'learning_rate' | 0.1 |
'loss' | 'deviance' |
'min_weight_fraction_leaf' | 0. |
'subsample' | 1. |
'max_features' | None |
'min_samples_split' | 2 |
'min_samples_leaf' | 1 |
'min_impurity_split' | 1 |
'max_leaf_nodes' | None |
'presort' | 'auto' |
'init' | None |
'warm_start' | False |
'verbose' | 0 |
'random_state' | 42 |
'criterion' | 'friedman_mse' |
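Collected into a dict, the table above can be unpacked straight into the estimator. This is a sketch only: it assumes an older scikit-learn release in which `presort` and `min_impurity_split` are still accepted constructor arguments.

```python
# scikit-learn parameters from the table above, as a plain dict.
# Assumes an older scikit-learn where 'presort' and
# 'min_impurity_split' are still valid keyword arguments.
sklearn_params = {
    "learning_rate": 0.1,
    "loss": "deviance",
    "min_weight_fraction_leaf": 0.0,
    "subsample": 1.0,
    "max_features": None,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "min_impurity_split": 1,
    "max_leaf_nodes": None,
    "presort": "auto",
    "init": None,
    "warm_start": False,
    "verbose": 0,
    "random_state": 42,
    "criterion": "friedman_mse",
}

# Typical usage, with the grid parameters supplied separately:
# from sklearn.ensemble import GradientBoostingClassifier
# clf = GradientBoostingClassifier(n_estimators=10, max_depth=3,
#                                  **sklearn_params)
```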
We fixed the following xgboost parameters to be similar to scikit-learn's.
Parameters | Value |
---|---|
'booster' | 'gbtree' |
'eta' | 0.1 |
'objective' | 'binary:logistic' |
'subsample' | 1. |
'colsample_bytree' | 1. |
'colsample_bylevel' | 1. |
'min_child_weight' | 1 |
'gamma' | 1 |
'max_delta_step' | 0 |
'alpha' | 0. |
'delta' | 0. |
'tree_method' | 'exact' |
'scale_pos_weight' | 1. |
'presort' | 'auto' |
'init' | None |
'verbose_eval' | False |
'random_state' | 42 |
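With xgboost's native API, booster parameters are passed as a dict to `xgb.train`. The sketch below keeps only the entries from the table that are plain booster parameters; `verbose_eval` is in reality an argument of `xgb.train` itself, and the seed is passed as `seed` in the native API (these mappings are our assumption, not stated in the table).

```python
# xgboost booster parameters from the table above (sketch; entries
# such as 'presort', 'init' and 'delta' are omitted because they are
# not plain booster parameters in the native API).
xgb_params = {
    "booster": "gbtree",
    "eta": 0.1,
    "objective": "binary:logistic",
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "colsample_bylevel": 1.0,
    "min_child_weight": 1,
    "gamma": 1,
    "max_delta_step": 0,
    "alpha": 0.0,
    "tree_method": "exact",
    "scale_pos_weight": 1.0,
    "seed": 42,  # 'random_state' in the table
}

# Typical usage (requires xgboost and training data X, y):
# import xgboost as xgb
# dtrain = xgb.DMatrix(X, label=y)
# bst = xgb.train(xgb_params, dtrain, num_boost_round=10,
#                 verbose_eval=False)
```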
We fixed the following LightGBM parameters to be similar to scikit-learn's.
Parameters | Value |
---|---|
'boosting' | 'gbdt' |
'learning_rate' | 0.1 |
'application' | 'binary' |
'metric' | 'binary_logloss' |
'tree_learner' | 'serial' |
'feature_fraction' | 1. |
'bagging_fraction' | 1. |
'bagging_freq' | 0 |
'max_bin' | 255 |
'is_sparse' | False |
'min_gain_to_split' | 1 |
'verbose' | 1 |
'feature_fraction_seed' | 42 |
'bagging_seed' | 42 |
'data_random_seed' | 42 |
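LightGBM's native API likewise takes its parameters as a dict passed to `lgb.train`. A minimal sketch mirroring the table:

```python
# LightGBM parameters from the table above, as a plain dict (sketch).
lgb_params = {
    "boosting": "gbdt",
    "learning_rate": 0.1,
    "application": "binary",
    "metric": "binary_logloss",
    "tree_learner": "serial",
    "feature_fraction": 1.0,
    "bagging_fraction": 1.0,
    "bagging_freq": 0,
    "max_bin": 255,
    "is_sparse": False,
    "min_gain_to_split": 1,
    "verbose": 1,
    "feature_fraction_seed": 42,
    "bagging_seed": 42,
    "data_random_seed": 42,
}

# Typical usage (requires lightgbm and training data X, y):
# import lightgbm as lgb
# train_set = lgb.Dataset(X, label=y)
# bst = lgb.train(lgb_params, train_set, num_boost_round=10)
```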
A useful list of aliases between parameters is available in `config.h`.
We used the following datasets to benchmark the libraries:
- Randomly generated dataset
- Real dataset
The following parameters were used to build the datasets using a grid:

- `n_samples`: 1k, 10k, 100k
- `n_features`: 1, 5, 10
The following parameters were used to build the classifiers using a grid:

- `max_depth`: 1, 3, 5, 8
- `n_estimators`: 1, 10
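Taking the Cartesian product of the two grids above gives 3 × 3 × 4 × 2 = 72 benchmark configurations per library. A minimal sketch of the enumeration:

```python
from itertools import product

# Dataset grid and classifier grid as described above.
n_samples_grid = [1_000, 10_000, 100_000]
n_features_grid = [1, 5, 10]
max_depth_grid = [1, 3, 5, 8]
n_estimators_grid = [1, 10]

# One dict per benchmark run, combining both grids.
configurations = [
    {"n_samples": ns, "n_features": nf, "max_depth": md, "n_estimators": ne}
    for ns, nf, md, ne in product(
        n_samples_grid, n_features_grid, max_depth_grid, n_estimators_grid
    )
]

print(len(configurations))  # 72 runs per library
```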
In the `results` folder, the results of the benchmark have been dumped using `joblib`.
The file `check_trees.py` is intended to check the structure of a tree created within the gradient boosting algorithm. The parameters are fixed in the Python file. The resulting structures are:
- `sklearn` tree structure
- `xgboost` tree structure
- `LightGBM` tree structure
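As a rough illustration of the kind of check involved: LightGBM's `dump_model`, for example, returns each tree as nested JSON of split and leaf nodes, on which properties such as leaf count and depth can be verified recursively. The tree below is a hand-made toy example, not an actual model dump.

```python
# Toy nested-dict tree in the spirit of LightGBM's JSON model dump
# (hand-written illustration, not an actual dump_model output).
toy_tree = {
    "split_feature": 0,
    "threshold": 0.5,
    "left_child": {"leaf_value": -0.1},
    "right_child": {
        "split_feature": 2,
        "threshold": 1.3,
        "left_child": {"leaf_value": 0.2},
        "right_child": {"leaf_value": 0.4},
    },
}

def count_leaves(node):
    """Count leaf nodes in a nested split/leaf dict."""
    if "leaf_value" in node:
        return 1
    return count_leaves(node["left_child"]) + count_leaves(node["right_child"])

def depth(node):
    """Depth of the tree; a single leaf has depth 0."""
    if "leaf_value" in node:
        return 0
    return 1 + max(depth(node["left_child"]), depth(node["right_child"]))

print(count_leaves(toy_tree), depth(toy_tree))  # 3 2
```

The same kind of recursive check applies to any of the three libraries once their tree structures are exported to a common nested form.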