This code implements the inference framework for Gaussian process (GP) models proposed in [1]. The framework can perform inference for GP models with arbitrary likelihood functions and scales to large datasets.

SAVIGP stands for Scalable Automated Variational Inference for Gaussian Process models.
## Experiments
Several examples of different likelihood functions are available in the `GP/experiment_setup.py` file.

To replicate the experiments reported in the paper, uncomment the corresponding line in the `GP/experiment_run.py` file. For example, the following line in the file runs the experiment on the Boston dataset: `ExperimentRunner.boston_experiment()`. The experiments can also be run concurrently, as shown in the file.
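Concurrent execution can be sketched with Python's `multiprocessing` module. The experiment functions below are placeholders standing in for the real `ExperimentRunner` methods in `GP/experiment_run.py`; only the pattern of launching one process per experiment is intended to carry over.

```python
from multiprocessing import Process

# Placeholder experiment functions; the real ones are methods such as
# ExperimentRunner.boston_experiment() in GP/experiment_run.py.
def boston_experiment():
    print("running the boston experiment")

def second_experiment():
    print("running a second experiment")

if __name__ == "__main__":
    # launch each experiment in its own process, then wait for all of them
    procs = [Process(target=f) for f in (boston_experiment, second_experiment)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```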
## Output
After an experiment finishes, the results are saved in a directory called `results`, located one level above the code directory. The results of each experiment are saved in a separate subdirectory containing the following files:
File | Content |
---|---|
`test_.csv` | Results of the predictions on the test data |
`train_.csv` | The training data used. The model can be configured not to save these data, for example when the training dataset is large |
`model.dump` | A snapshot of the model. It is updated after each iteration of the optimisation and can be used later to initialise the model |
`opt.dump` | The last state of the optimiser. It can be used to continue the optimisation from the last iteration |
`config_.csv` | Configuration of the model |
`*.log` | Log file |
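The dump files are serialized snapshots of Python objects. Assuming they are written with Python's `pickle` module (the actual serialization used by the code may differ), saving and reloading a state to resume from the last iteration looks like this; the `state` dictionary here is purely hypothetical:

```python
import os
import pickle
import tempfile

# Hypothetical model state; the real model.dump stores the full model image.
state = {"iteration": 42, "params": [0.1, 0.2]}

dump_path = os.path.join(tempfile.mkdtemp(), "model.dump")

# save a snapshot after an optimisation iteration
with open(dump_path, "wb") as f:
    pickle.dump(state, f)

# later: reload the snapshot to initialise the model or resume optimisation
with open(dump_path, "rb") as f:
    restored = pickle.load(f)

print(restored["iteration"])  # prints 42
```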
## Visualization
To plot the generated results, uncomment the last line in the `GP/experiment_run.py` file: `ExperimentRunner.plot()`.

If the results of more than one experiment are in the results folder, this method plots the average of the results. The type of plot depends on the likelihood function: for example, bar charts are generated for classification and box plots for regression models. The type of likelihood is read from the `config_.csv` file. The plot function also exports the data used for the graphs (e.g., the heights of the bars and the sizes of the error bars) into a separate folder called `graph_data`, one level above the code folder. These data can be used to regenerate the plots with other tools; the plots in the paper were produced with the R script in `graphs_R/graphs.R`.
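The averaging across experiment folders can be sketched as follows. The error rates below are made-up illustrative numbers, not actual columns from the result files; the point is that the mean and the error-bar sizes are the kind of summary exported into `graph_data`.

```python
import numpy as np

# Hypothetical per-split error rates from two experiment folders.
run1 = np.array([0.12, 0.10, 0.15])
run2 = np.array([0.14, 0.08, 0.13])

# average across experiments, plus error-bar sizes (standard deviations)
mean = np.mean([run1, run2], axis=0)
err = np.std([run1, run2], axis=0)

print(mean)
print(err)
```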
## Likelihood functions
The code comes with the following set of predefined likelihood functions:

Likelihood class | Problem type |
---|---|
`Likelihood.UnivariateGaussian` | Standard Gaussian process regression with a single output |
`Likelihood.MultivariateGaussian` | Standard Gaussian process regression with multi-dimensional output |
`Likelihood.LogGaussianCox` | Log Gaussian Cox process; can be used, for example, to predict incident rates |
`Likelihood.LogisticLL` | Logistic likelihood function; can be used for binary classification |
`Likelihood.SoftmaxLL` | Softmax likelihood function; can be used for multi-class classification |
`Likelihood.WarpLL` | Likelihood corresponding to warped Gaussian processes |
`Likelihood.CogLLL` | Likelihood corresponding to Gaussian process regression networks |

To define a new likelihood function, extend the class `Likelihood.likelihood` and implement the required methods. See the class documentation for more details.
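The shape of such an extension can be sketched as below. The base class and the method name `ll` here are illustrative stand-ins, not the actual `Likelihood.likelihood` interface; consult the class documentation for the real method signatures.

```python
import numpy as np

class BaseLikelihood(object):
    """Illustrative stand-in for the Likelihood.likelihood base class."""
    def ll(self, f, y):
        raise NotImplementedError

class MyGaussianLikelihood(BaseLikelihood):
    """Univariate Gaussian likelihood p(y | f) = N(y; f, sigma^2)."""
    def __init__(self, sigma2):
        self.sigma2 = sigma2

    def ll(self, f, y):
        # log N(y; f, sigma^2)
        return (-0.5 * np.log(2 * np.pi * self.sigma2)
                - (y - f) ** 2 / (2 * self.sigma2))

lik = MyGaussianLikelihood(1.0)
print(lik.ll(0.0, 0.0))  # log density of a standard normal at zero
```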
## Example
Below is a short example using the Boston dataset:

```python
import logging

import numpy as np

from ExtRBF import ExtRBF
from model_learn import ModelLearn
from data_transformation import MeanTransformation
from likelihood import UnivariateGaussian
from data_source import DataSource

# model type; it can be "mix1", "mix2", or "full"
method = "full"

# number of inducing points
num_inducing = 30

# loading data
data = DataSource.boston_data()
d = data[0]
Xtrain = d['train_X']
Ytrain = d['train_Y']
Xtest = d['test_X']
Ytest = d['test_Y']

# a name used for the folders and files when exporting results
name = 'boston'

# defining the likelihood function
cond_ll = UnivariateGaussian(np.array(1.0))

# number of samples used for approximating the likelihood and its gradients
num_samples = 2000

# defining the kernel
kernels = [ExtRBF(Xtrain.shape[1], variance=1, lengthscale=np.array((1.,)), ARD=False)]

ModelLearn.run_model(Xtest,
                     Xtrain,
                     Ytest,
                     Ytrain,
                     cond_ll,
                     kernels,
                     method,
                     name,
                     d['id'],
                     num_inducing,
                     num_samples,
                     # ratio of inducing points to training points
                     # (float() avoids Python 2 integer division)
                     float(num_inducing) / Xtrain.shape[0],
                     # optimise hyper-parameters (hyp), posterior parameters
                     # (mog), and likelihood parameters (ll)
                     ['hyp', 'mog', 'll'],
                     # transform the data before training
                     MeanTransformation,
                     # place inducing points on the training data; if False,
                     # they will be placed using clustering
                     True,
                     # level of logging
                     logging.DEBUG,
                     # do not export training data to csv files
                     False,
                     # add a small latent noise to the kernel for the
                     # stability of numerical computations
                     latent_noise=0.001,
                     # for how many iterations each set of parameters
                     # is optimised
                     opt_per_iter={'mog': 25, 'hyp': 25, 'll': 25},
                     # total number of global optimisations
                     max_iter=200,
                     # number of threads
                     n_threads=1,
                     # size of each data partition
                     partition_size=3000)
```
The code above shows how to configure the model. Two options can significantly affect the speed of the code and its memory usage: `n_threads` and `partition_size`. The whole dataset is divided into partitions of size `partition_size`, and the calculations for each partition are performed on a separate thread, with at most `n_threads` threads.
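The partitioning scheme can be sketched as follows; the chunking and the thread pool mirror the description above, while the `work` function is a placeholder for the per-partition computation (the actual parallelisation inside `ModelLearn` may differ):

```python
import numpy as np
from multiprocessing.pool import ThreadPool

partition_size = 3000
n_threads = 4

X = np.zeros((10000, 5))  # toy dataset with 10000 data points

# split the rows into chunks of at most partition_size
partitions = [X[i:i + partition_size]
              for i in range(0, X.shape[0], partition_size)]

def work(chunk):
    # placeholder for the per-partition likelihood/gradient computation
    return chunk.shape[0]

# process the partitions with at most n_threads threads
pool = ThreadPool(n_threads)
sizes = pool.map(work, partitions)
pool.close()
pool.join()

print(sizes)  # [3000, 3000, 3000, 1000]
```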
## Dependencies

The following packages are required:
- Python 2.7 (2.7.6)
- Scipy (0.15.1)
- Numpy (1.9.1)
- GPy (0.6.0)
- pandas (0.16.0)
- scikit-learn (0.14.1)
- matplotlib (1.3.1)
The following are required for tests:
- DerApproximator (0.52)
- texttable (0.8.2)
Numbers in parentheses indicate the tested versions.
## References
[1] A. Dezfouli and E. V. Bonilla. Scalable Inference for Gaussian Process Models with Black-Box Likelihoods. In Advances in Neural Information Processing Systems (NIPS), Montreal, December 2015.