The package provides a Python implementation of the High-Dimensional Discriminant Analysis/Clustering models; see the publications:
The original R package is available on CRAN:
http://cran.r-project.org/web/packages/HDclassif/index.html
Only some of the models are implemented: those that usually provide the best results in terms of classification accuracy.
Just download the package. It has been tested on Linux (Debian Stretch). SciPy and scikit-learn must be installed. For faster processing, a good linear algebra library such as OpenBLAS is also recommended.
import platform
platform.python_version()
'2.7.13'
import scipy as sp
sp.__version__
'1.0.0'
import sklearn
sklearn.__version__
'0.19.1'
We provide an introductory example, taken from Charles Bouveyron's thesis, on the crabs data set. We also compare with the standard GMM fitted with EM, as provided by scikit-learn.
First, load packages and define some variables:
import hdda
import matplotlib.pyplot as plt
import scipy as sp
from sklearn.decomposition import PCA
from sklearn import mixture
# Parameters for HDDA
MODEL = 'M2'
C = 4 # For this example we do not fit the number of classes
th = 0.05 # Threshold for the Cattell scree test
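The Cattell scree test selects the intrinsic dimension of each class from the eigenvalue scree: dimensions are kept as long as the drop between successive eigenvalues stays above th times the largest drop. A minimal sketch of this rule (our own illustration, not the hdda API):

```python
import numpy as np

def cattell_dimension(eigenvalues, th=0.05):
    # Eigenvalues are assumed sorted in decreasing order;
    # the "scree" is the drop between successive eigenvalues.
    diff = -np.diff(eigenvalues)
    # Keep a dimension while its drop exceeds th times the largest drop
    keep = diff > th * diff.max()
    # Intrinsic dimension = position of the last significant drop + 1
    return int(np.nonzero(keep)[0][-1] + 1)

# Illustrative eigenvalues: two dominant axes, then a flat tail
ev = np.array([10.0, 5.0, 1.0, 0.9, 0.85, 0.84])
print(cattell_dimension(ev, th=0.05))  # -> 2
```

With a smaller th, more of the small drops in the tail pass the test, so the selected dimension grows.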
Load the data:
data = sp.load('crabs.npz')
X = data['x']
Y = data['y']
For illustration, we can plot the projection onto the first two principal axes.
plt.figure()
pca = PCA(n_components=2)
Xp = pca.fit_transform(X)
plt.scatter(Xp[:,0],Xp[:,1],c=Y,s=40)
plt.savefig('2D_true_labels.png')
We could select the best model using BIC or ICL:
bic, icl = [], []
for model_ in ['M1', 'M2', 'M3', 'M4', 'M5']:
model = hdda.HDDC(C=C,th=th,model=model_)
model.fit(X)
bic.append(model.bic)
icl.append(model.icl)
plt.figure()
plt.plot(bic)
plt.plot(icl)
plt.legend(("BIC", "ICL"))
plt.xticks(sp.arange(5), ('M1', 'M2', 'M3', 'M4', 'M5'))
plt.grid()
plt.savefig('bic_icl_crabs.png')
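To pick a model programmatically from these criteria, take the entry with the optimal value (a sketch with illustrative numbers, not values computed from the crabs data; check hdda's sign convention for bic, since some implementations report a criterion to be maximized rather than minimized):

```python
# Sketch: select the model with the lowest criterion value.
# The BIC values below are illustrative only.
models = ['M1', 'M2', 'M3', 'M4', 'M5']
bic = [1520.3, 1480.1, 1495.7, 1510.2, 1530.9]
best = models[bic.index(min(bic))]
print(best)  # -> M2
```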
model = hdda.HDDC(C=C,th=th,model=MODEL)
model.fit(X)
print model.bic
yp=model.predict(X)
plt.figure()
plt.scatter(Xp[:,0],Xp[:,1],c=yp,s=40)
plt.savefig("2D_hdda.png")
The same learning is done with the GMM from scikit-learn, and we plot the results:
clf = mixture.GaussianMixture(n_components=4, covariance_type='full')
clf.fit(X)
yp=clf.predict(X)
plt.figure()
plt.scatter(Xp[:,0],Xp[:,1],c=yp,s=40)
plt.savefig('2D_gmm.png')
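Since the true labels Y are available for the crabs data, the agreement of either clustering with the ground truth can be quantified, for instance with the adjusted Rand index from scikit-learn (a sketch on toy labels, not part of script_crabs.py):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Toy labels: cluster indices may be permuted with respect to the
# true classes, and the adjusted Rand index is invariant to that.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(adjusted_rand_score(y_true, y_pred))  # -> 1.0
```

On the real data, `adjusted_rand_score(Y, yp)` compares a clustering yp against the true labels; 1.0 means a perfect match up to label permutation.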
The complete file is available in script_crabs.py.
In this example, we show how HDDC clusters pixels from a satellite image time series. Again, we load the data and set up some parameters, then use the fit_all function to learn the best model. This example comes from the work of Maïlys Lopes on grassland monitoring from satellite image time series.
import hdda
import scipy as sp
import matplotlib.pyplot as plt
from sklearn import mixture
import time as time
# Load data
data = sp.load('prairie5.npy')
x = data
n,d=x.shape
print "Number of samples: {}\n Number of variables: {}".format(n,d)
# Parameters
MODEL = ['M1','M2','M3','M4','M5','M6','M7','M8']
th = [0.05,0.1,0.2]
C = sp.arange(1,5)
# Model Selection
model = hdda.HDGMM()
tic = time.clock()
model.fit_all(x,MODEL=MODEL,C=C,th=th,VERBOSE=True)
toc = time.clock()
print "Processing time: {}".format(toc-tic)
# Plot data
bands= ['B','G','R','NIR']
for i,b in enumerate(bands):
plt.figure()
# Plot the samples
for j in xrange(n):
plt.plot(data[j,(i*17):((i+1)*17)],'k',lw=0.5)
# Plot the means
for j in xrange(len(model.mean)):
plt.plot(model.mean[j][(i*17):((i+1)*17)],lw=3)
plt.savefig('grassland_{}.png'.format(b))
for b in {B,G,R,NIR}
do
echo [[file:grassland_$b.png]]
done
Let's do the clustering with the conventional GMM from scikit-learn. The best model is selected according to the BIC (as computed by scikit-learn).
bicGmm = []
lowest_bic = sp.infty
for c in C:
gmm = mixture.GaussianMixture(n_components=c, covariance_type='full')
gmm.fit(x)
bicGmm.append(gmm.bic(x))
if bicGmm[-1] < lowest_bic:
lowest_bic = bicGmm[-1]
best_gmm = gmm
print bicGmm
# Plot data
bands= ['B','G','R','NIR']
for i,b in enumerate(bands):
plt.figure()
# Plot the samples
for j in xrange(n):
plt.plot(data[j,(i*17):((i+1)*17)],'k',lw=0.5)
# Plot the means
for j in xrange(best_gmm.means_.shape[0]):
plt.plot(best_gmm.means_[j][(i*17):((i+1)*17)],lw=3)
plt.savefig('grassland_gmm_{}.png'.format(b))
for b in {B,G,R,NIR}
do
echo [[file:grassland_gmm_$b.png]]
done