def linkage(X, method='single', metric='euclidean', preserve_input=True):
    r'''Hierarchical, agglomerative clustering on a dissimilarity matrix or on
Euclidean data.

Apart from the argument 'preserve_input', the method has the same input
parameters and output format as the function of the same name in the
module scipy.cluster.hierarchy.

The argument X is preferably a NumPy array with floating point entries
(X.dtype==numpy.double). Any other data format will be converted before
it is processed.

If X is a one-dimensional array, it is considered a condensed matrix of
pairwise dissimilarities in the format which is returned by
scipy.spatial.distance.pdist. It contains the flattened, upper-triangular
part of a pairwise dissimilarity matrix. That is, if there are N data
points and the matrix d contains the dissimilarity between the i-th and
j-th observation at position d(i,j), the vector X has length N(N-1)/2 and
is ordered as follows:

  [ d(0,1), d(0,2), ..., d(0,N-1), d(1,2), ..., d(1,N-1), ...,
    d(N-2,N-1) ]

The 'metric' argument is ignored in case of dissimilarity input.

The optional argument 'preserve_input' specifies whether the method
makes a working copy of the dissimilarity vector or writes temporary
data into the existing array. If the dissimilarities are generated for
the clustering step only and are not needed afterward, approximately half
the memory can be saved by specifying 'preserve_input=False'. Note that
the input array X contains unspecified values after this procedure. It
is therefore safer to write

  linkage(X, method="...", preserve_input=False)
  del X

to make sure that the matrix X is not accessed accidentally after it has
been used as scratch memory. (The single linkage algorithm does not
write to the distance matrix or its copy anyway, so the 'preserve_input'
flag has no effect in this case.)

If X contains vector data, it must be a two-dimensional array with N
observations in D dimensions as an (N×D) array.
The preserve_input argument is ignored in this case. The specified
metric is used to generate pairwise distances from the input. The
following two function calls yield the same output:

  linkage(pdist(X, metric), method="...", preserve_input=False)
  linkage(X, metric=metric, method="...")

The general scheme of the agglomerative clustering procedure is as
follows:

  1. Start with N singleton clusters (nodes) labeled 0,...,N−1, which
     represent the input points.
  2. Find a pair of nodes with minimal distance among all pairwise
     distances.
  3. Join the two nodes into a new node and remove the two old nodes.
     The new nodes are labeled consecutively N, N+1, ...
  4. The distances from the new node to all other nodes are determined
     by the method parameter (see below).
  5. Repeat N−1 times from step 2, until there is one big node, which
     contains all original input points.

The output of linkage is a stepwise dendrogram, which is represented as
an (N−1)×4 NumPy array with floating point entries (dtype=numpy.double).
The first two columns contain the node indices which are joined in each
step. The input nodes are labeled 0,...,N−1, and the newly generated
nodes have the labels N,...,2N−2. The third column contains the distance
between the two nodes at each step, i.e. the current minimal distance at
the time of the merge. The fourth column counts the number of points
which comprise each new node.

The parameter method specifies which clustering scheme to use. The
clustering scheme determines the distance from a new node to the other
nodes. Denote the dissimilarities by d, the nodes to be joined by I, J,
the new node by K and any other node by L. The symbol |I| denotes the
size of the cluster I.
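
The labeling convention of the stepwise dendrogram described above can be
checked with a few lines of plain Python. The following is only an
illustrative sketch: the helper name node_members and the example rows are
hypothetical and are not part of this module, and the rows are written by
hand rather than computed by linkage.

```python
def node_members(Z, N):
    # Map every node label to the sorted list of input points it contains.
    # Z holds N-1 rows of the form (a, b, distance, size).
    members = {i: [i] for i in range(N)}    # input points are labeled 0..N-1
    for step, (a, b, dist, size) in enumerate(Z):
        new_label = N + step                # new nodes get labels N, ..., 2N-2
        merged = sorted(members[int(a)] + members[int(b)])
        # The fourth column must equal the number of points in the new node.
        assert len(merged) == int(size)
        members[new_label] = merged
    return members


# Hand-written dendrogram for N = 4 points (illustrative values only):
# step 0 joins points 0 and 1, step 1 joins points 2 and 3,
# step 2 joins the two new nodes 4 and 5.
Z_example = [
    [0.0, 1.0, 1.0, 2.0],
    [2.0, 3.0, 1.5, 2.0],
    [4.0, 5.0, 4.0, 4.0],
]
members = node_members(Z_example, 4)
print(members[4], members[5], members[6])
```

The last node, labeled 2N−2, always contains all N input points.
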

method='single': d(K,L) = min(d(I,L), d(J,L))

  The distance between two clusters A, B is the closest distance between
  any two points in each cluster:

    d(A,B) = min{ d(a,b) | a∈A, b∈B }

method='complete': d(K,L) = max(d(I,L), d(J,L))

  The distance between two clusters A, B is the maximal distance between
  any two points in each cluster:

    d(A,B) = max{ d(a,b) | a∈A, b∈B }

method='average': d(K,L) = ( |I|·d(I,L) + |J|·d(J,L) ) / (|I|+|J|)

  The distance between two clusters A, B is the average distance between
  the points in the two clusters:

    d(A,B) = (|A|·|B|)^(-1) · ∑ { d(a,b) | a∈A, b∈B }

method='weighted': d(K,L) = (d(I,L)+d(J,L))/2

  There is no global description for the distance between clusters since
  the distance depends on the order of the merging steps.

The following three methods are intended for Euclidean data only, i.e.
when X contains the pairwise (non-squared!) distances between vectors in
Euclidean space. The algorithm will work on any input, however, and it
is up to the user to make sure that applying the methods makes sense.
Since d denotes non-squared distances, the update formulas below operate
on the squares d(·,·)^2 and take the square root of the result.

method='centroid': d(K,L) = ( (|I|·d(I,L)^2 + |J|·d(J,L)^2) / (|I|+|J|)
                              − |I|·|J|·d(I,J)^2/(|I|+|J|)^2 )^(1/2)

  There is a geometric interpretation: d(A,B) is the distance between the
  centroids (i.e. barycenters) of the clusters in Euclidean space:

    d(A,B) = ‖c_A−c_B‖,

  where c_A denotes the centroid of the points in cluster A.

method='median': d(K,L) = ( d(I,L)^2/2 + d(J,L)^2/2 − d(I,J)^2/4 )^(1/2)

  Define the midpoint w_K of a cluster K iteratively as w_K=k if K={k} is
  a singleton and as the midpoint (w_I+w_J)/2 if K is formed by joining
  I and J. Then we have

    d(A,B) = ‖w_A−w_B‖

  in Euclidean space for all nodes A, B. Notice, however, that this
  distance depends on the order of the merging steps.

method='ward': d(K,L) = ( ((|I|+|L|)·d(I,L)^2 + (|J|+|L|)·d(J,L)^2
                           − |L|·d(I,J)^2) / (|I|+|J|+|L|) )^(1/2)

  The global cluster dissimilarity can be expressed as

    d(A,B) = ( 2|A|·|B|/(|A|+|B|) )^(1/2) · ‖c_A−c_B‖,

  where c_A again denotes the centroid of the points in cluster A.
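
The general scheme and the update formulas for 'single', 'complete',
'average' and 'weighted' can be traced with a deliberately naive,
cubic-time sketch in plain Python. This is an illustration only and is not
how this module computes its result: the names condensed_index and
naive_linkage are hypothetical, ties are broken by dictionary order (which
may differ from the library), and the Euclidean methods are left out.

```python
def condensed_index(N, i, j):
    # Position of d(i,j), i != j, in the condensed dissimilarity vector.
    if i > j:
        i, j = j, i
    return N * i - i * (i + 1) // 2 + (j - i - 1)


def naive_linkage(X, N, method='single'):
    # Dissimilarities between active nodes, keyed by (smaller, larger) label.
    d = {(i, j): float(X[condensed_index(N, i, j)])
         for i in range(N) for j in range(i + 1, N)}
    size = {i: 1 for i in range(N)}          # step 1: N singleton nodes
    Z = []
    for step in range(N - 1):
        # Step 2: find a pair of nodes with minimal distance.
        (I, J), dIJ = min(d.items(), key=lambda item: item[1])
        # Step 3: join I and J into the new node K and remove the old nodes.
        K = N + step
        nI, nJ = size.pop(I), size.pop(J)
        del d[(I, J)]
        # Step 4: distance from K to every remaining node L.
        for L in list(size):
            dIL = d.pop((min(I, L), max(I, L)))
            dJL = d.pop((min(J, L), max(J, L)))
            if method == 'single':
                dKL = min(dIL, dJL)
            elif method == 'complete':
                dKL = max(dIL, dJL)
            elif method == 'average':
                dKL = (nI * dIL + nJ * dJL) / (nI + nJ)
            elif method == 'weighted':
                dKL = (dIL + dJL) / 2
            else:
                raise ValueError('unsupported method: ' + repr(method))
            d[(L, K)] = dKL                  # L < K, since K is the newest label
        size[K] = nI + nJ
        Z.append([float(I), float(J), dIJ, float(nI + nJ)])
    return Z


# Four points on a line at positions 0, 1, 3 and 7, as a condensed vector:
# [ d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3) ]
X_example = [1, 3, 7, 2, 6, 4]
for row in naive_linkage(X_example, 4, method='average'):
    print(row)
```

For 'average', the distance in the last row equals the mean of d(0,3),
d(1,3) and d(2,3), in agreement with the global description given above.
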

The clustering algorithm handles infinite values correctly, as long as
the chosen distance update formula makes sense. If a NaN value occurs,
either in the original dissimilarities or as an updated dissimilarity, an
error is raised.

The linkage method does not treat NumPy's masked arrays as special
and simply ignores the mask.'''
    X = array(X, copy=False, subok=True)
    if X.ndim == 1:
        # Single linkage never writes to the dissimilarity vector, so no
        # working copy is needed.
        if method == 'single':
            preserve_input = False
        X = array(X, dtype=double, copy=preserve_input, order='C', subok=True)
        # Recover N from the condensed length NN = N(N-1)/2.
        NN = len(X)
        N = int(ceil(sqrt(NN * 2)))
        if (N * (N - 1) // 2) != NN:
            raise ValueError('The length of the condensed distance matrix '
                             'must be (k \\choose 2) for k data points!')
    else:
        # Vector data: compute the pairwise distances first.
        assert X.ndim == 2
        N = len(X)
        X = pdist(X, metric)
        X = array(X, dtype=double, copy=False, order='C', subok=True)
    Z = empty((N - 1, 4))
    if N > 1:
        linkage_wrap(N, X, Z, mthidx[method])
    return Z

def linkage(D, method='single', metric='euclidean', preserve_input=True):
    '''Hierarchical (agglomerative) clustering on a dissimilarity matrix or on
Euclidean data.

The argument D is either a condensed distance matrix or a collection
of m observation vectors in n dimensions as an (m×n) NumPy array. Apart
from the argument preserve_input, the method has the same input
parameters and output format as the function of the same name in the
package scipy.cluster.hierarchy. Therefore, the documentation is not
duplicated here. Please refer to the SciPy documentation for further
details.

The additional, optional argument preserve_input specifies whether the
fastcluster package first copies the distance matrix or writes into the
existing array. If the distance matrix is only generated for the
clustering step and is not needed afterwards, half the memory can be
saved by specifying preserve_input=False.

Note that the input array D contains unspecified values after this
procedure. It is therefore safer to write

    linkage(D, method="…", preserve_input=False)
    del D

to make sure the matrix D is not accidentally used after it has been used
as scratch memory.

The single linkage algorithm does not write to the distance matrix or its
copy anyway, so the preserve_input flag has no effect in this case.'''
    if not isinstance(D, ndarray):
        raise ValueError('The first argument must be of type numpy.ndarray.')
    if len(D.shape) == 1:
        if method == 'single':
            # Single linkage only reads the distances: C order is enough.
            assert D.dtype == double
            D_ = require(D, dtype=double, requirements=['C'])
            if D_ is not D:
                stderr.write('The condensed distance matrix had to be copied '
                             'since it has the following flags:\n')
                stderr.write(str(D.flags) + '\n')
        elif preserve_input:
            # Work on a private copy so that the caller's array stays intact.
            D_ = D.copy()
            assert D_.dtype == double
            assert D_.flags.c_contiguous
            assert D_.flags.owndata
            assert D_.flags.writeable
            assert D_.flags.aligned
        else:
            # Use the caller's array as scratch memory if its flags allow it.
            assert D.dtype == double
            D_ = require(D, dtype=double, requirements=['C', 'A', 'W', 'O'])
            if D_ is not D:
                stderr.write('The condensed distance matrix had to be copied '
                             'since it has the following flags:\n')
                stderr.write(str(D.flags) + '\n')

        is_valid_y(D_, throw=True)

        N = num_obs_y(D_)
        Z = empty((N - 1, 4))
        if N > 1:
            linkage_wrap(N, D_, Z, mthidx[method])
        return Z
    else:
        # Vector data: compute the condensed distance matrix first.
        assert len(D.shape) == 2
        N = D.shape[0]
        Z = empty((N - 1, 4))
        D_ = pdist(D, metric)
        assert D_.dtype == double
        assert D_.flags.c_contiguous
        assert D_.flags.owndata
        assert D_.flags.writeable
        assert D_.flags.aligned
        if N > 1:
            linkage_wrap(N, D_, Z, mthidx[method])
        return Z