The package provides a Python implementation of the High-Dimensional Discriminant Analysis/Clustering models; see the publications:
The original R package is available on CRAN:
http://cran.r-project.org/web/packages/HDclassif/index.html
Only some of the models are implemented: those that usually provide the best results in terms of classification accuracy.
Just download the package. It has been tested on Linux (Debian Stretch). SciPy and scikit-learn must be installed. For faster processing, a good linear algebra library such as OpenBLAS is also recommended.
import platform
platform.python_version()
'2.7.13'
import scipy as sp
sp.__version__
'1.0.0'
import sklearn
sklearn.__version__
'0.19.1'
We provide an introductory example, taken from Charles Bouveyron's thesis, on the crabs data set. We also compare with the standard GMM fitted with EM, as provided by scikit-learn.
First, load packages and define some variables:
import hdda
import matplotlib.pyplot as plt
import scipy as sp
from sklearn.decomposition import PCA
from sklearn import mixture
# Parameters for HDDA
MODEL = 'M2'
C = 4 # For this example we do not fit the number of classes
th = 0.05 # Threshold for the Cattell scree test
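The Cattell scree test selects the intrinsic dimension of each class from the eigenvalue scree: dimensions are kept as long as the drop between successive eigenvalues stays above th times the largest drop. A minimal sketch of this rule (our own illustration, not the hdda API):

```python
import numpy as np

def cattell_dimension(eigenvalues, th=0.05):
    # Eigenvalues are assumed sorted in decreasing order;
    # the "scree" is the drop between successive eigenvalues.
    diff = -np.diff(eigenvalues)
    # Keep a dimension while its drop exceeds th times the largest drop
    keep = diff > th * diff.max()
    # Intrinsic dimension = position of the last significant drop + 1
    return int(np.nonzero(keep)[0][-1] + 1)

# Illustrative eigenvalues: two dominant axes, then a flat tail
ev = np.array([10.0, 5.0, 1.0, 0.9, 0.85, 0.84])
print(cattell_dimension(ev, th=0.05))  # -> 2
```

With a smaller th, more of the small drops in the tail pass the test, so the selected dimension grows.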
Load the data:
data = sp.load('crabs.npz')
X = data['x']
Y = data['y']
For illustration, we can plot the projection onto the first two principal axes.
plt.figure()
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
plt.scatter(Xp[:,0],Xp[:,1],c=Y,s=40)
plt.savefig('2D_true_labels.png')
We could select the best model using BIC or ICL:
bic, icl = [], []
for model_ in ['M1', 'M2', 'M3', 'M4', 'M5']:
model = hdda.HDDC(C=C,th=th,model=model_)
model.fit(X)
bic.append(model.bic)
icl.append(model.icl)
plt.figure()
plt.plot(bic)
plt.plot(icl)
plt.legend(("BIC", "ICL"))
plt.xticks(sp.arange(5), ('M1', 'M2', 'M3', 'M4', 'M5'))
plt.grid()
plt.savefig('bic_icl_crabs.png')
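To pick a model programmatically from these criteria, take the entry with the optimal value (a sketch with illustrative numbers, not values computed from the crabs data; check hdda's sign convention for bic, since some implementations report a criterion to be maximized rather than minimized):

```python
# Sketch: select the model with the lowest criterion value.
# The BIC values below are illustrative only.
models = ['M1', 'M2', 'M3', 'M4', 'M5']
bic = [1520.3, 1480.1, 1495.7, 1510.2, 1530.9]
best = models[bic.index(min(bic))]
print(best)  # -> M2
```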
model = hdda.HDDC(C=C,th=th,model=MODEL)
model.fit(X)
print model.bic
yp=model.predict(X)
plt.figure()
plt.scatter(Xp[:,0],Xp[:,1],c=yp,s=40)
plt.savefig("2D_hdda.png")
The same learning is done with the GMM from scikit-learn, and we plot the results:
clf = mixture.GaussianMixture(n_components=4, covariance_type='full')
clf.fit(X)
yp=clf.predict(X)
plt.figure()
plt.scatter(Xp[:,0],Xp[:,1],c=yp,s=40)
plt.savefig('2D_gmm.png')
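Since the true labels Y are available for the crabs data, the agreement of either clustering with the ground truth can be quantified, for instance with the adjusted Rand index from scikit-learn (a sketch on toy labels, not part of script_crabs.py):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Toy labels: cluster indices may be permuted with respect to the
# true classes, and the adjusted Rand index is invariant to that.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(adjusted_rand_score(y_true, y_pred))  # -> 1.0
```

On the real data, `adjusted_rand_score(Y, yp)` compares a clustering yp against the true labels; 1.0 means a perfect match up to label permutation.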
The complete file is available in script_crabs.py.
In this example, we show how HDDC clusters pixels from a satellite image time series. Again, we load the data and set up some parameters, then use the fit_all function to learn the best model. This example comes from the work of Maïlys Lopes on grassland monitoring from satellite image time series.
import hdda
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import mixture
import time as time
# Load data
data = sp.load('prairie5.npy')
x = data
n,d=x.shape
print "Number of samples: {}\n Number of variables: {}".format(n,d)
# Parameters
MODEL = ['M1','M2','M3','M4','M5','M6','M7','M8']
th = [0.05,0.1,0.2]
C = sp.arange(1,5)
# Model Selection
model = hdda.HDGMM()
tic = time.clock()
model.fit_all(x,MODEL=MODEL,C=C,th=th,VERBOSE=True)
toc = time.clock()
print "Processing time: {}".format(toc-tic)
# Plot data
bands= ['B','G','R','NIR']
for i,b in enumerate(bands):
plt.figure()
# Plot the samples
for j in xrange(n):
plt.plot(data[j,(i*17):((i+1)*17)],'k',lw=0.5)
# Plot the means
for j in xrange(len(model.mean)):
plt.plot(model.mean[j][(i*17):((i+1)*17)],lw=3)
plt.savefig('grassland_{}.png'.format(b))
for b in {B,G,R,NIR}
do
echo [[file:grassland_$b.png]]
done
Let's do the clustering with the conventional GMM from scikit-learn. The best model is selected according to the BIC (as computed by scikit-learn).
bicGmm = []
lowest_bic = sp.infty
for c in C:
gmm = mixture.GaussianMixture(n_components=c, covariance_type='full')
gmm.fit(x)
bicGmm.append(gmm.bic(x))
if bicGmm[-1] < lowest_bic:
lowest_bic = bicGmm[-1]
best_gmm = gmm
print bicGmm
# Plot data
bands= ['B','G','R','NIR']
for i,b in enumerate(bands):
plt.figure()
# Plot the samples
for j in xrange(n):
plt.plot(data[j,(i*17):((i+1)*17)],'k',lw=0.5)
# Plot the means
for j in xrange(best_gmm.means_.shape[0]):
plt.plot(best_gmm.means_[j][(i*17):((i+1)*17)],lw=3)
plt.savefig('grassland_gmm_{}.png'.format(b))
for b in {B,G,R,NIR}
do
echo [[file:grassland_gmm_$b.png]]
done