Example #1
def linkage(X, method='single', metric='euclidean', preserve_input=True):
    '''Hierarchical, agglomerative clustering on a dissimilarity matrix or on
Euclidean data.

Apart from the argument 'preserve_input', the method has the same input
parameters and output format as the functions of the same name in the
module scipy.cluster.hierarchy.

The argument X is preferably a NumPy array with floating point entries
(X.dtype==numpy.double). Any other data format will be converted before
it is processed.

If X is a one-dimensional array, it is considered a condensed matrix of
pairwise dissimilarities in the format which is returned by
scipy.spatial.distance.pdist. It contains the flattened, upper-
triangular part of a pairwise dissimilarity matrix. That is, if there
are N data points and the matrix d contains the dissimilarity between
the i-th and j-th observation at position d(i,j), the vector X has
length N(N-1)/2 and is ordered as follows:

  [ d(0,1), d(0,2), ..., d(0,N-1), d(1,2), ..., d(1,N-1), ...,
    d(N-2,N-1) ]

The 'metric' argument is ignored in case of dissimilarity input.
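
For illustration (a hypothetical five-point example; the variable names
below are not part of this module), such a condensed vector can be
obtained with scipy.spatial.distance.pdist:

  import numpy as np
  from scipy.spatial.distance import pdist
  points = np.random.rand(5, 3)      # five observations in three dimensions
  D = pdist(points)                  # condensed vector of length 5*4/2 == 10
  Z = linkage(D, method='single')    # 'metric' is ignored for this input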

The optional argument 'preserve_input' specifies whether the method
makes a working copy of the dissimilarity vector or writes temporary
data into the existing array. If the dissimilarities are generated for
the clustering step only and are not needed afterward, approximately
half the memory can be saved by specifying 'preserve_input=False'. Note
that the input array X contains unspecified values after this procedure.
It is therefore safer to write

  linkage(X, method="...", preserve_input=False)
  del X

to make sure that the matrix X is not accessed accidentally after it has
been used as scratch memory. (The single linkage algorithm does not
write to the distance matrix or its copy anyway, so the 'preserve_input'
flag has no effect in this case.)

If X contains vector data, it must be a two-dimensional array with N
observations in D dimensions as an (N×D) array. The preserve_input
argument is ignored in this case. The specified metric is used to
generate pairwise distances from the input. The following two function
calls yield the same output:

  linkage(pdist(X, metric), method="...", preserve_input=False)
  linkage(X, metric=metric, method="...")

The general scheme of the agglomerative clustering procedure is as
follows:

  1. Start with N singleton clusters (nodes) labeled 0,...,N−1, which
     represent the input points.
  2. Find a pair of nodes with minimal distance among all pairwise
     distances.
  3. Join the two nodes into a new node and remove the two old nodes.
     The new nodes are labeled consecutively N, N+1, ...
  4. The distances from the new node to all other nodes are determined
     by the method parameter (see below).
  5. Repeat N−1 times from step 2, until there is one big node, which
     contains all original input points.

The output of linkage is a stepwise dendrogram, which is represented as an
(N−1)×4 NumPy array with floating point entries (dtype=numpy.double).
The first two columns contain the node indices which are joined in each
step. The input nodes are labeled 0,...,N−1, and the newly generated
nodes have the labels N,...,2N−2. The third column contains the distance
between the two nodes at each step, i.e. the current minimal distance at
the time of the merge. The fourth column counts the number of points
which comprise each new node.
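
As a small, hand-checkable illustration with hypothetical numbers,
consider three points with d(0,1)=1, d(0,2)=4 and d(1,2)=2 under single
linkage:

  import numpy as np
  D = np.array([1.0, 4.0, 2.0])      # d(0,1), d(0,2), d(1,2)
  Z = linkage(D, method='single')
  # Step 1: nodes 0 and 1 merge at distance 1 into node 3 (2 points).
  # Step 2: node 3 and node 2 merge at distance min(4, 2) = 2 (3 points).
  # So Z should equal [[0, 1, 1, 2], [2, 3, 2, 3]] (the first two columns
  # may list the joined nodes in either order).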

The parameter method specifies which clustering scheme to use. The
clustering scheme determines the distance from a new node to the other
nodes. Denote the dissimilarities by d, the nodes to be joined by I, J,
the new node by K and any other node by L. The symbol |I| denotes the
size of the cluster I.

method='single': d(K,L) = min(d(I,L), d(J,L))

  The distance between two clusters A, B is the closest distance between
  any two points in each cluster:

    d(A,B) = min{ d(a,b) | a∈A, b∈B }

method='complete': d(K,L) = max(d(I,L), d(J,L))

  The distance between two clusters A, B is the maximal distance between
  any two points in each cluster:

    d(A,B) = max{ d(a,b) | a∈A, b∈B }

method='average': d(K,L) = ( |I|·d(I,L) + |J|·d(J,L) ) / (|I|+|J|)

  The distance between two clusters A, B is the average distance between
  the points in the two clusters:

    d(A,B) = (|A|·|B|)^(-1) · Σ { d(a,b) | a∈A, b∈B }

method='weighted': d(K,L) = (d(I,L)+d(J,L))/2

  There is no global description for the distance between clusters since
  the distance depends on the order of the merging steps.
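
As a numeric illustration with hypothetical values, the 'average' update
for |I|=2, |J|=3, d(I,L)=4 and d(J,L)=6 gives the size-weighted mean,
which is the average pairwise distance between K=I∪J and L:

  size_I, size_J = 2, 3
  d_IL, d_JL = 4.0, 6.0
  d_KL = (size_I * d_IL + size_J * d_JL) / (size_I + size_J)   # == 5.2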

The following three methods are intended for Euclidean data only, i.e.
when X contains the pairwise (non-squared!) distances between vectors in
Euclidean space. The algorithm will work on any input, however, and it
is up to the user to make sure that applying the methods makes sense.

method='centroid': d(K,L) = ( (|I|·d(I,L)^2 + |J|·d(J,L)^2) / (|I|+|J|)
                              − |I|·|J|·d(I,J)^2/(|I|+|J|)^2 )^(1/2)

  There is a geometric interpretation: d(A,B) is the distance between
  the centroids (i.e. barycenters) of the clusters in Euclidean space:

    d(A,B) = ‖c_A−c_B‖,

  where c_A denotes the centroid of the points in cluster A.

method='median': d(K,L) = ( d(I,L)^2/2 + d(J,L)^2/2 − d(I,J)^2/4 )^(1/2)

  Define the midpoint w_K of a cluster K iteratively as w_K=k if K={k}
  is a singleton and as the midpoint (w_I+w_J)/2 if K is formed by
  joining I and J. Then we have

    d(A,B) = ‖w_A−w_B‖

  in Euclidean space for all nodes A,B. Notice however that this
  distance depends on the order of the merging steps.

method='ward': d(K,L) = ( ((|I|+|L|)·d(I,L)^2 + (|J|+|L|)·d(J,L)^2
                            − |L|·d(I,J)^2) / (|I|+|J|+|L|) )^(1/2)

  The global cluster dissimilarity can be expressed as

    d(A,B) = ( 2|A|·|B|/(|A|+|B|) )^(1/2) · ‖c_A−c_B‖,

  where c_A again denotes the centroid of the points in cluster A.
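
The geometric interpretation of the centroid method can be checked on a
small, hypothetical data set; the last merge distance should equal the
Euclidean distance between the centroids of the two remaining clusters:

  import numpy as np
  pts = np.array([[0., 0.], [1., 0.], [5., 0.], [5., 3.]])
  Z = linkage(pts, method='centroid', metric='euclidean')
  # Merges: {0, 1} at distance 1, {2, 3} at distance 3, then both pairs.
  c_A = pts[:2].mean(axis=0)         # centroid (0.5, 0.0)
  c_B = pts[2:].mean(axis=0)         # centroid (5.0, 1.5)
  assert np.isclose(Z[-1, 2], np.linalg.norm(c_A - c_B))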

The clustering algorithm handles infinite values correctly, as long as the
chosen distance update formula makes sense. If a NaN value occurs, either
in the original dissimilarities or as an updated dissimilarity, an error is
raised.
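
For instance, a NaN in the condensed input is expected to be rejected
(illustrative snippet, hypothetical values):

  import numpy as np
  D = np.array([1.0, np.nan, 2.0])   # condensed matrix for three points
  linkage(D, method='single')        # expected to raise an error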

The linkage method does not treat NumPy's masked arrays as special
and simply ignores the mask.'''
    X = array(X, copy=False, subok=True)
    if X.ndim == 1:
        # Condensed dissimilarity vector. Single linkage never writes to the
        # distance array, so no working copy is needed in that case.
        if method == 'single':
            preserve_input = False
        X = array(X, dtype=double, copy=preserve_input, order='C', subok=True)
        NN = len(X)
        # Recover the number of observations N from the vector length N(N-1)/2.
        N = int(ceil(sqrt(NN * 2)))
        if (N * (N - 1) // 2) != NN:
            raise ValueError('The length of the condensed distance matrix '
                             'must be (k choose 2) for k data points!')
    else:
        # Vector data: generate the condensed distance matrix with the
        # requested metric first.
        assert X.ndim == 2
        N = len(X)
        X = pdist(X, metric)
        X = array(X, dtype=double, copy=False, order='C', subok=True)
    Z = empty((N - 1, 4))
    if N > 1:
        # linkage_wrap (provided elsewhere in the module) fills Z with the
        # stepwise dendrogram; mthidx maps the method name to an internal code.
        linkage_wrap(N, X, Z, mthidx[method])
    return Z
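
# Usage sketch (illustrative addition, not part of the original example):
# build a stepwise dendrogram with the linkage() above and cut it into flat
# clusters with SciPy's fcluster. Data and variable names are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
pts = rng.standard_normal((20, 3))        # 20 observations in 3 dimensions
D = pdist(pts)                            # condensed distances, length 20*19/2
Z = linkage(D, method='average', preserve_input=False)
del D                                     # D may hold scratch data afterwards
labels = fcluster(Z, t=3, criterion='maxclust')   # at most three flat clusters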
Example #2
def linkage(D, method='single', metric='euclidean', preserve_input=True):
    '''Hierarchical (agglomerative) clustering on a dissimilarity matrix or on
Euclidean data.

The argument D is either a condensed distance matrix or a collection
of m observation vectors in n dimensions as an (m×n) NumPy array. Apart
from the argument preserve_input, the methods have the same input
parameters and output format as the functions of the same name in the
package scipy.cluster.hierarchy. Therefore, the documentation is not
duplicated here. Please refer to the SciPy documentation for further
details.

The additional, optional argument preserve_input specifies whether the
fastcluster package first copies the distance matrix or writes into
the existing array. If the distance matrix is only generated for the
clustering step and is not needed afterwards, half the memory can be
saved by specifying preserve_input=False.

Note that the input array D contains unspecified values after this
procedure. It is therefore safer to write

    linkage(D, method="…", preserve_input=False)
    del D

to make sure the matrix D is not accidentally used after it has been
used as scratch memory.

The single linkage algorithm does not write to the distance matrix or
its copy anyway, so the preserve_input flag has no effect in this
case.'''
    if not isinstance(D, ndarray):
        raise ValueError('The first argument must be of type numpy.ndarray.')
    if len(D.shape) == 1:
        # Condensed distance vector input.
        if method == 'single':
            # Single linkage never modifies the distances; only a C-contiguous
            # double array is required, and a copy is made (with a warning
            # below) only if the flags demand it.
            assert D.dtype == double
            D_ = require(D, dtype=double, requirements=['C'])
            if D_ is not D:
                stderr.write('The condensed distance matrix had to be copied since it has the following flags:\n')
                stderr.write(str(D.flags) + '\n')
        elif preserve_input:
            # Work on a private copy so the caller's array stays untouched.
            D_ = D.copy()
            assert D_.dtype == double
            assert D_.flags.c_contiguous
            assert D_.flags.owndata
            assert D_.flags.writeable
            assert D_.flags.aligned
        else:
            # preserve_input=False: reuse the caller's array as scratch memory
            # if its flags allow it; otherwise a copy is made and reported.
            assert D.dtype == double
            D_ = require(D, dtype=double, requirements=['C', 'A', 'W', 'O'])
            if D_ is not D:
                stderr.write('The condensed distance matrix had to be copied since it has the following flags:\n')
                stderr.write(str(D.flags) + '\n')

        is_valid_y(D_, throw=True)

        N = num_obs_y(D_)
        Z = empty((N-1,4))
        if N > 1:
            linkage_wrap(N, D_, Z, mthidx[method])
        return Z
    else:
        # Vector data: compute the condensed distance matrix with the
        # requested metric first.
        assert len(D.shape) == 2
        N = D.shape[0]
        Z = empty((N - 1, 4))
        D_ = pdist(D, metric)
        assert D_.dtype == double
        assert D_.flags.c_contiguous
        assert D_.flags.owndata
        assert D_.flags.writeable
        assert D_.flags.aligned
        if N > 1:
            linkage_wrap(N, D_, Z, mthidx[method])
        return Z
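
# Usage sketch (illustrative addition, not part of the original example):
# a condensed, double-precision, C-contiguous vector from pdist needs no
# copy here, and preserve_input=False lets linkage() reuse D as scratch
# memory. Data and variable names are hypothetical.
import numpy as np
from scipy.spatial.distance import pdist

pts = np.random.rand(100, 4)
D = pdist(pts)                    # already double, C-contiguous and writeable
Z = linkage(D, method='complete', preserve_input=False)
del D                             # contents of D are unspecified afterwards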