def linkage_vector(X, method='single', metric='euclidean', extraarg=None):
    r'''Hierarchical (agglomerative) clustering on Euclidean data.

Compared to the 'linkage' method, 'linkage_vector' uses a memory-saving
algorithm. While the 'linkage' method requires Θ(N^2) memory for
clustering of N points, this method needs Θ(ND) for N points in R^D,
which is usually much smaller.

The argument X has the same format as in the 'linkage' method when X
describes vector data, i.e. it is an (N×D) array. The output array also
has the same format.

The parameter 'method' must be one of 'single', 'centroid', 'median',
'ward', i.e. memory-saving algorithms currently exist only for these
methods. If 'method' is one of 'centroid', 'median', 'ward', the
'metric' must be 'euclidean'. For single linkage clustering, any
dissimilarity function may be chosen.

Basically, every metric which is implemented in the function
scipy.spatial.distance.pdist is reimplemented here. However, the
metrics differ in some instances since a number of mistakes and typos
(both in the code and in the documentation) were corrected in the
fastcluster package. Therefore, the available metrics with their
definitions are listed below as a reference. The symbols u and v mostly
denote vectors in R^D with coordinates u_j and v_j respectively. See
below for additional metrics for Boolean vectors.

Unless otherwise stated, the input array X is converted to a floating
point array (X.dtype==numpy.double) if it does not already have the
required data type. Some metrics accept Boolean input; in this case,
this is stated explicitly below.

If a NaN value occurs, either in the original dissimilarities or as an
updated dissimilarity, an error is raised. In principle, the clustering
algorithm handles infinite values correctly, but the user is advised to
carefully check the behavior of the metric and distance update formulas
under these circumstances.

The distance formulas combined with the clustering in the
'linkage_vector' method have no specified behavior if the data X
contains infinite or NaN values. Also, the masks in NumPy's masked
arrays are simply ignored.

metric='euclidean': Euclidean metric, L_2 norm

    d(u,v) = ∥u−v∥ = ( \sum_j (u_j−v_j)^2 )^(1/2)

metric='sqeuclidean': squared Euclidean metric

    d(u,v) = ∥u−v∥^2 = \sum_j (u_j−v_j)^2

metric='seuclidean': standardized Euclidean metric

    d(u,v) = ( \sum_j (u_j−v_j)^2 / V_j )^(1/2)

  The vector V=(V_0,...,V_{D−1}) is given as the 'extraarg' argument.
  If no 'extraarg' is given, V_j is by default the unbiased sample
  variance of all observations in the j-th coordinate:

    V_j = Var_i( X(i,j) ) = 1/(N−1) · \sum_i ( X(i,j)^2 − μ(X_j)^2 )

  (Here, μ(X_j) denotes as usual the mean of X(i,j) over all rows i.)

metric='mahalanobis': Mahalanobis distance

    d(u,v) = ( transpose(u−v) V (u−v) )^(1/2)

  Here, V=extraarg, a (D×D)-matrix. If V is not specified, the inverse
  of the covariance matrix numpy.linalg.inv(numpy.cov(X, rowvar=False))
  is used.

metric='cityblock': the Manhattan distance, L_1 norm

    d(u,v) = \sum_j |u_j−v_j|

metric='chebychev': the supremum norm, L_∞ norm

    d(u,v) = max_j |u_j−v_j|

metric='minkowski': the L_p norm

    d(u,v) = ( \sum_j |u_j−v_j|^p )^(1/p)

  This metric coincides with the cityblock, euclidean and chebychev
  metrics for p=1, p=2 and p=∞ (numpy.inf), respectively. The parameter
  p is given as the 'extraarg' argument.
metric='cosine'

    d(u,v) = 1 − ⟨u,v⟩ / (∥u∥·∥v∥)
           = 1 − (\sum_j u_j·v_j) / ( (\sum_j u_j^2)(\sum_j v_j^2) )^(1/2)

metric='correlation': This method first mean-centers the rows of X and
  then applies the 'cosine' distance. Equivalently, the correlation
  distance measures 1 − (Pearson's correlation coefficient).

    d(u,v) = 1 − ⟨u−μ(u),v−μ(v)⟩ / (∥u−μ(u)∥·∥v−μ(v)∥)

metric='canberra'

    d(u,v) = \sum_j |u_j−v_j| / (|u_j|+|v_j|)

  Summands with u_j=v_j=0 contribute 0 to the sum.

metric='braycurtis'

    d(u,v) = (\sum_j |u_j−v_j|) / (\sum_j |u_j+v_j|)

metric=(user function): The parameter 'metric' may also be a function
  which accepts two NumPy floating point vectors and returns a number.
  E.g. the Euclidean distance could be emulated with

    fn = lambda u, v: numpy.sqrt(((u-v)*(u-v)).sum())
    linkage_vector(X, method='single', metric=fn)

  This method, however, is much slower than the built-in function.

metric='hamming': The Hamming distance accepts a Boolean array
  (X.dtype==bool) for efficient storage. Any other data type is
  converted to numpy.double.

    d(u,v) = |{j | u_j≠v_j }| / D

metric='jaccard': The Jaccard distance accepts a Boolean array
  (X.dtype==bool) for efficient storage. Any other data type is
  converted to numpy.double.

    d(u,v) = |{j | u_j≠v_j }| / |{j | u_j≠0 or v_j≠0 }|
    d(0,0) = 0

  Python represents True by 1 and False by 0. In the Boolean case, the
  Jaccard distance is therefore:

    d(u,v) = |{j | u_j≠v_j }| / |{j | u_j ∨ v_j }|

The following metrics are designed for Boolean vectors. The input array
is converted to the 'bool' data type if it is not Boolean already. Use
the following abbreviations to count the number of True/False
combinations:

    a = |{j | u_j ∧ v_j }|
    b = |{j | u_j ∧ (¬v_j) }|
    c = |{j | (¬u_j) ∧ v_j }|
    d = |{j | (¬u_j) ∧ (¬v_j) }|

Recall that D denotes the number of dimensions, hence D=a+b+c+d.

metric='yule'

    d(u,v) = 2bc / (ad+bc)

metric='dice':

    d(u,v) = (b+c) / (2a+b+c)
    d(0,0) = 0

metric='rogerstanimoto':

    d(u,v) = 2(b+c) / (b+c+D)

metric='russellrao':

    d(u,v) = (b+c+d) / D

metric='sokalsneath':

    d(u,v) = 2(b+c) / (a+2(b+c))
    d(0,0) = 0

metric='kulsinski'

    d(u,v) = (b/(a+b) + c/(a+c)) / 2

metric='matching':

    d(u,v) = (b+c) / D

  Notice that when given a Boolean array, the 'matching' and 'hamming'
  distances are the same. The 'matching' distance formula, however,
  converts every input to Boolean first. Hence, the vectors (0,1) and
  (0,2) have zero 'matching' distance since they are both converted to
  (False, True), but their Hamming distance is 0.5.

metric='sokalmichener' is an alias for 'matching'.
'''
    if method == 'single':
        # 'USER' is an internal code for callable metrics; it must not be
        # passed in directly.
        assert metric != 'USER'
        if metric in ('hamming', 'jaccard'):
            # These metrics accept Boolean input directly; keep it if
            # present, otherwise fall back to double precision.
            X = array(X, copy=False, subok=True)
            dtype = bool if X.dtype == bool else double
        else:
            dtype = bool if metric in booleanmetrics else double
        X = array(X, dtype=dtype, copy=False, order='C', subok=True)
    else:
        assert metric == 'euclidean'
        X = array(X, dtype=double, copy=(method == 'ward'), order='C',
                  subok=True)
    assert X.ndim == 2
    N = len(X)
    Z = empty((N - 1, 4))
    if metric == 'seuclidean':
        if extraarg is None:
            extraarg = var(X, axis=0, ddof=1)
    elif metric == 'mahalanobis':
        if extraarg is None:
            extraarg = inv(cov(X, rowvar=False))
        # Instead of the inverse covariance matrix, pass the matrix
        # product with the data matrix!
        extraarg = array(dot(X, extraarg), dtype=double, copy=False,
                         order='C', subok=True)
    elif metric == 'correlation':
        # Mean-center each row, then treat it as the 'cosine' metric.
        X = X - expand_dims(X.mean(axis=1), 1)
        metric = 'cosine'
    elif not isinstance(metric, str):
        # A callable metric: dispatch to the internal 'USER' code path.
        assert extraarg is None
        metric, extraarg = 'USER', metric
    elif metric != 'minkowski':
        assert extraarg is None
    if N > 1:
        linkage_vector_wrap(X, Z, mthidx[method], mtridx[metric], extraarg)
    return Z
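# The following self-contained sketch is an editorial illustration, not part
# of the fastcluster API. It exercises the user-defined metric described in
# the docstring above: a callable metric should reproduce the built-in
# 'euclidean' result (up to floating point rounding), only much slower.
if __name__ == '__main__':
    import numpy
    rng = numpy.random.RandomState(0)  # fixed seed, illustrative data only
    X_demo = rng.rand(10, 3)
    fn = lambda u, v: numpy.sqrt(((u - v) * (u - v)).sum())
    Z_user = linkage_vector(X_demo, method='single', metric=fn)
    Z_builtin = linkage_vector(X_demo, method='single', metric='euclidean')
    # Both calls return an (N-1)×4 linkage matrix, here of shape (9, 4).
    assert numpy.allclose(Z_user, Z_builtin)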
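# Another editorial sketch (not part of the library): it checks the
# 'matching' vs. 'hamming' behavior stated in the docstring. The rows
# (0,1) and (0,2) are both converted to (False, True) by 'matching',
# giving distance 0, while 'hamming' keeps the numeric values and gives
# 0.5. For two points, Z[0, 2] holds their pairwise distance.
if __name__ == '__main__':
    import numpy
    X_bool_demo = numpy.array([[0, 1], [0, 2]])
    d_matching = linkage_vector(X_bool_demo, method='single',
                                metric='matching')[0, 2]
    d_hamming = linkage_vector(X_bool_demo, method='single',
                               metric='hamming')[0, 2]
    assert d_matching == 0.0
    assert d_hamming == 0.5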