teamdandelion/Bhattacharrya

Implements the Bhattacharyya kernel in Python and Matlab.

The Bhattacharyya Kernel:
The Bhattacharyya kernel is described by Kondor and Jebara in "A Kernel Between Sets of Vectors" (ICML 2003, http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_KondorJ03.pdf). The kernel takes a base kernel k and two datasets D1 and D2, where each dataset is a set of vectors in R^r (r-dimensional space). It maps the vectors in each set into the Hilbert space induced by the base kernel and computes the Bhattacharyya overlap between the two datasets in that space.
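For reference (paraphrasing the paper), the quantity being computed is the Bhattacharyya affinity between two regularized Gaussians p1 and p2 fitted to the images of D1 and D2 in the base kernel's Hilbert space:

    K(D1, D2) = ∫ sqrt(p1(x) * p2(x)) dx

which equals 1 when the two fitted distributions coincide and falls toward 0 as they separate.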

For the mathematics, see "A Kernel Between Sets of Vectors".

The Algorithm:
The program in "bhatta.py" is built so it can take a number of datasets and compute the Bhattacharyya kernel matrix between them, or compute Bhattacharyya evaluations between individual datasets. Because much of the computation depends only on a single dataset, the Bhatta_Manager class instantiates a Dataset object for each dataset, which precomputes the eigenvector decomposition, the local kernel matrix, and so on. The MemorizedBhatta class operates on the same principle but accepts new Datasets on the fly, so that not all the Datasets need to be known at object instantiation.
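For illustration only, here is a minimal sketch of that per-dataset precomputation pattern. The class, function, and parameter names below are invented for the example; they are not the actual API of bhatta.py.

    # Sketch of per-dataset precomputation (illustration only, not bhatta.py's code).
    # Each dataset caches its own Gram matrix and eigendecomposition once, so that
    # pairwise Bhattacharyya evaluations can reuse them instead of recomputing.
    import numpy as np

    def rbf_kernel(X, Y, gamma=0.5):
        # Base kernel k(x, y) = exp(-gamma * ||x - y||^2)
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)

    class PrecomputedDataset:
        def __init__(self, X):
            self.X = X
            self.K = rbf_kernel(X, X)                          # local kernel (Gram) matrix
            self.evals, self.evecs = np.linalg.eigh(self.K)    # cached eigendecomposition

    datasets = [PrecomputedDataset(np.random.randn(40, 5)) for _ in range(3)]
    # A pairwise evaluation would combine datasets[i] and datasets[j], reusing
    # their cached Gram matrices and eigendecompositions.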

Proving Correctness:
Since the mathematics are fairly opaque, it can be difficult to prove that the Bhattacharyya algorithm is correct. However, we can test it using the polynomial kernel. Because the polynomial kernel maps into a known and easily computed Hilbert space, we can map datasets into that space explicitly, compute the Bhattacharyya overlap there directly, and compare the result to the Bhattacharyya algorithm's evaluation with a polynomial base kernel. If the results are identical for several random datasets, it is extremely likely that the algorithm is producing the correct result. Both steps of the check, mapping data into the polynomial kernel's Hilbert space and computing the Bhattacharyya overlap directly in that space, are very simple operations.
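Below is a minimal, self-contained sketch of this check, using the degree-2 polynomial kernel k(x, y) = (x.y + 1)^2 and a simple ridge regularization of the covariances. The function names, the regularization constant, and the covariance conventions are choices made for the example; to get an exact match with bhatta.py, the same base kernel parameters and regularization must be used on both sides.

    # Sketch of the polynomial-kernel correctness check (not the code in
    # bhatta_poly / bhatta.py itself).  We map data explicitly into the feature
    # space of the degree-2 polynomial kernel, fit a regularized Gaussian to each
    # mapped dataset, and compute the Bhattacharyya overlap in closed form.
    import numpy as np

    def poly2_features(X):
        # Explicit feature map phi with phi(x).phi(y) = (x.y + 1)^2
        quad = np.einsum('ni,nj->nij', X, X).reshape(len(X), -1)   # all x_i * x_j terms
        lin = np.sqrt(2.0) * X
        const = np.ones((len(X), 1))
        return np.hstack([quad, lin, const])

    def bhattacharyya_overlap(P1, P2, eta=0.1):
        # Closed-form Bhattacharyya coefficient between Gaussians fitted to the
        # rows of P1 and P2 (ridge-regularized so the covariances are invertible).
        mu1, mu2 = P1.mean(0), P2.mean(0)
        d = P1.shape[1]
        S1 = np.cov(P1, rowvar=False) + eta * np.eye(d)
        S2 = np.cov(P2, rowvar=False) + eta * np.eye(d)
        S = 0.5 * (S1 + S2)
        diff = mu1 - mu2
        _, ld = np.linalg.slogdet(S)
        _, ld1 = np.linalg.slogdet(S1)
        _, ld2 = np.linalg.slogdet(S2)
        db = 0.125 * diff @ np.linalg.solve(S, diff) + 0.5 * (ld - 0.5 * (ld1 + ld2))
        return np.exp(-db)

    rng = np.random.default_rng(0)
    D1, D2 = rng.normal(size=(30, 3)), rng.normal(size=(30, 3))
    direct = bhattacharyya_overlap(poly2_features(D1), poly2_features(D2))
    # Compare `direct` against the kernelized evaluation with a degree-2
    # polynomial base kernel and the same regularization constant.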

"bhatta.py" contains the kernel. This is the main file
"basis.py" contains some helper functions
"bhatta_plots.py" is a utility for vizualising kernel evaluations between datasets
"mnist.py" handles loading the MNIST digit database and vectorizing it into a form the Bhattacharrya kernel can handle
"bhatta_poly" contains utilities to explicitly discover the polynomial feature space, for bugtesting the kernel and confirming correctness
"Matlab/" contains Matlab versions of the kernel; may not be correct atm. The multi_poly_bhatta.matlab file is useful for comparing the closed-form polynomial RKHS version of the algorithm to empirical and PCA evaluations.

Here's an email I sent to Risi which is relevant: 

Hey Risi,
I've done some more work organizing and commenting the Bhattacharyya code, as well as adding a readme. The key insight I had while working on this project was figuring out how to test the correctness of a Bhattacharyya kernel implementation. It's very straightforward to calculate the Bhattacharyya overlap between two distributions in a known, regular space, and it's straightforward to map any dataset into the reproducing kernel Hilbert space of the polynomial kernel. So, to test a kernelized Bhattacharyya implementation, take some data, map it into the polynomial kernel RKHS, directly compute the Bhattacharyya overlap in that space, and compare to what the Bhattacharyya kernel algorithm gives for the same data with a polynomial base kernel.

This testing method has another significant benefit, in that you can test individual components of the Bhattacharyya algorithm. Since there is a 1-to-1 correspondence between the components (covariance matrices and determinants) of the Bhattacharyya evaluation in the polynomial RKHS and those of the Bhattacharyya algorithm with a base kernel, if there is an error in the algorithm you can isolate which component is failing. Through this method, I discovered that the math as described in the paper doesn't quite work out: the mixed determinant and covariance terms break. I think this had to do with the proposed method of constructing a shared basis using Gram-Schmidt orthogonalization failing. I don't quite remember why the issue surfaced, but I do know that I fixed it and the fix is reflected in the Python algorithm.

Matlab was particularly helpful in debugging the algorithm. Take a look, for example, at multi_poly_bhatta.m. It generates a bunch of random datasets and computes Phi (the explicit polynomial RKHS representation) for them. It then computes the closed-form Bhattacharyya evaluation on the RKHS data, as well as the empirical and PCA Bhattacharyya algorithm evaluations using just the kernel matrix. So you can use it to compare the different Bhattacharyya components in the kernelized and closed-form versions.
