Example #1
# NumPy names used below. 'booleanmetrics' (the metrics accepting Boolean
# input), the dispatch tables 'mthidx'/'mtridx' and the C routine
# 'linkage_vector_wrap' come from the surrounding fastcluster module and
# are assumed to be in scope.
from numpy import array, cov, dot, double, empty, expand_dims, var
from numpy.linalg import inv

def linkage_vector(X, method='single', metric='euclidean', extraarg=None):
    r'''Hierarchical (agglomerative) clustering on Euclidean data.

Compared to the 'linkage' method, 'linkage_vector' uses a memory-saving
algorithm. While the 'linkage' method requires Θ(N^2) memory to cluster
N points, this method needs only Θ(ND) for N points in R^D, which is
usually much smaller.

The argument X has the same format as in the 'linkage' method when X
describes vector data, i.e., it is an (N×D) array. The output array also
has the same format. The parameter method must be one of 'single',
'centroid', 'median', 'ward', i.e., memory-saving algorithms currently
exist only for these methods. If 'method' is one of 'centroid', 'median',
'ward', the 'metric' must be 'euclidean'.
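
For example, a minimal call (a sketch, assuming a random data set;
'ward' requires the default Euclidean metric):

    import numpy
    X = numpy.random.standard_normal((100, 3))  # 100 points in R^3
    Z = linkage_vector(X, method='ward')        # stepwise dendrogram, (99×4)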

For single linkage clustering, any dissimilarity function may be chosen.
Essentially, every metric implemented in the function
scipy.spatial.distance.pdist is reimplemented here. However, the metrics
differ in some instances since a number of mistakes and typos (both in
the code and in the documentation) were corrected in the fastcluster
package.

Therefore, the available metrics with their definitions are listed below
as a reference. The symbols u and v mostly denote vectors in R^D with
coordinates u_j and v_j respectively. See below for additional metrics
for Boolean vectors. Unless otherwise stated, the input array X is
converted to a floating point array (X.dtype==numpy.double) if it does
not already have the required data type. Some metrics accept Boolean
input; in this case this is stated explicitly below.

If a NaN value occurs, either in the original dissimilarities or as an
updated dissimilarity, an error is raised. In principle, the clustering
algorithm handles infinite values correctly, but the user is advised to
carefully check the behavior of the metric and distance update formulas
under these circumstances.

In contrast, the distance formulas combined with the clustering in the
'linkage_vector' method have unspecified behavior if the data X itself
contains infinite or NaN values. Also, the masks in NumPy’s masked
arrays are simply ignored.

metric='euclidean': Euclidean metric, L_2 norm

    d(u,v) = ∥u−v∥ = ( \sum_j { (u_j−v_j)^2 } )^(1/2)

metric='sqeuclidean': squared Euclidean metric

    d(u,v) = ∥u−v∥^2 = \sum_j { (u_j−v_j)^2 }

metric='seuclidean': standardized Euclidean metric

    d(u,v) = ( \sum_j { (u_j−v_j)^2 / V_j } )^(1/2)

  The vector V=(V_0,...,V_{D−1}) is given as the 'extraarg' argument. If
  no 'extraarg' is given, V_j is by default the unbiased sample variance
  of all observations in the j-th coordinate:

    V_j = Var_i (X(i,j) ) = 1/(N−1) · \sum_i ( X(i,j)^2 − μ(X_j)^2 )

  (Here, μ(X_j) denotes as usual the mean of X(i,j) over all rows i.)
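
  For illustration, the default is exactly NumPy's unbiased variance, so
  the following call is equivalent to omitting 'extraarg':

    V = numpy.var(X, axis=0, ddof=1)
    linkage_vector(X, metric='seuclidean', extraarg=V)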

metric='mahalanobis': Mahalanobis distance

    d(u,v) = ( transpose(u−v) V (u−v) )^(1/2)

  Here, V=extraarg, a (D×D)-matrix. If V is not specified, the inverse
  of the covariance matrix numpy.linalg.inv(numpy.cov(X, rowvar=False))
  is used.
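
  For example, a precomputed inverse covariance matrix (here the same
  matrix as the default) is passed as:

    Vinv = numpy.linalg.inv(numpy.cov(X, rowvar=False))
    linkage_vector(X, metric='mahalanobis', extraarg=Vinv)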

metric='cityblock': the Manhattan distance, L_1 norm

    d(u,v) = \sum_j |u_j−v_j|

metric='chebychev': the supremum norm, L_∞ norm

    d(u,v) = max_j { |u_j−v_j| }

metric='minkowski': the L_p norm

    d(u,v) = ( \sum_j |u_j−v_j|^p ) ^(1/p)

  This metric coincides with the cityblock, euclidean and chebychev
  metrics for p=1, p=2 and p=∞ (numpy.inf), respectively. The parameter
  p is given as the 'extraarg' argument.
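
  For instance, clustering with the L_3 norm passes p=3 through
  'extraarg':

    linkage_vector(X, method='single', metric='minkowski', extraarg=3)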

metric='cosine'

    d(u,v) = 1 − ⟨u,v⟩ / (∥u∥·∥v∥)
           = 1 − (\sum_j u_j·v_j) / ( (\sum u_j^2)(\sum v_j^2) )^(1/2)

metric='correlation': This method first mean-centers the rows of X and
  then applies the 'cosine' distance. Equivalently, the correlation
  distance equals 1 − (Pearson’s correlation coefficient).

    d(u,v) = 1 − ⟨u−μ(u),v−μ(v)⟩ / (∥u−μ(u)∥·∥v−μ(v)∥)
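
  As a quick check, the distance relates to Pearson's r as computed by
  NumPy (for two rows u and v of X):

    r = numpy.corrcoef(u, v)[0, 1]  # Pearson correlation coefficient
    # the 'correlation' distance d(u,v) equals 1 - r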

metric='canberra'

    d(u,v) = \sum_j ( |u_j−v_j| / (|u_j|+|v_j|) )

  Summands with u_j=v_j=0 contribute 0 to the sum.

metric='braycurtis'

    d(u,v) = (\sum_j |u_j-v_j|) / (\sum_j |u_j+v_j|)

metric=(user function): The parameter metric may also be a function
  which accepts two NumPy floating point vectors and returns a number.
  E.g., the Euclidean distance could be emulated with

    fn = lambda u, v: numpy.sqrt(((u-v)*(u-v)).sum())
    linkage_vector(X, method='single', metric=fn)

  This method, however, is much slower than the built-in function.

metric='hamming': The Hamming distance accepts a Boolean array
  (X.dtype==bool) for efficient storage. Any other data type is
  converted to numpy.double.

    d(u,v) = |{j | u_j≠v_j }| / D

metric='jaccard': The Jaccard distance accepts a Boolean array
  (X.dtype==bool) for efficient storage. Any other data type is
  converted to numpy.double.

    d(u,v) = |{j | u_j≠v_j }| / |{j | u_j≠0 or v_j≠0 }|
    d(0,0) = 0

  Python represents True by 1 and False by 0. In the Boolean case, the
  Jaccard distance is therefore:

    d(u,v) = |{j | u_j≠v_j }| / |{j | u_j ∨ v_j }|

The following metrics are designed for Boolean vectors. The input array
is converted to the 'bool' data type if it is not Boolean already. Use
the following abbreviations to count the number of True/False
combinations:

  a = |{j | u_j ∧ v_j }|
  b = |{j | u_j ∧ (¬v_j) }|
  c = |{j | (¬u_j) ∧ v_j }|
  d = |{j | (¬u_j) ∧ (¬v_j) }|

Recall that D denotes the number of dimensions, hence D=a+b+c+d.
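
As a sketch, these counts can be computed with NumPy for two Boolean
vectors u and v:

    a = numpy.count_nonzero(u & v)
    b = numpy.count_nonzero(u & ~v)
    c = numpy.count_nonzero(~u & v)
    d = numpy.count_nonzero(~u & ~v)  # a + b + c + d == D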

metric='yule'

    d(u,v) = 2bc / (ad+bc)

metric='dice':

    d(u,v) = (b+c) / (2a+b+c)
    d(0,0) = 0

metric='rogerstanimoto':

    d(u,v) = 2(b+c) / (b+c+D)

metric='russellrao':

    d(u,v) = (b+c+d) / D

metric='sokalsneath':

    d(u,v) = 2(b+c) / (a+2(b+c))
    d(0,0) = 0

metric='kulsinski'

    d(u,v) = (b/(a+b) + c/(a+c)) / 2

metric='matching':

    d(u,v) = (b+c)/D

  Notice that when given a Boolean array, the 'matching' and 'hamming'
  distances are the same. The 'matching' distance formula, however,
  converts every input to Boolean first. Hence, the vectors (0,1) and
  (0,2) have zero 'matching' distance since they are both converted to
  (False, True), but the Hamming distance is 0.5.
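
  This can be verified directly with NumPy:

    u, v = numpy.array([0, 1]), numpy.array([0, 2])
    numpy.count_nonzero(u != v) / len(u)    # 0.5, the 'hamming' value
    ub, vb = u.astype(bool), v.astype(bool)
    numpy.count_nonzero(ub != vb) / len(u)  # 0.0, the 'matching' value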

metric='sokalmichener' is an alias for 'matching'.'''
    if method == 'single':
        # Single linkage supports every metric; Boolean input is kept only
        # for the metrics that accept it.
        assert metric != 'USER'
        if metric in ('hamming', 'jaccard'):
            X = array(X, copy=False, subok=True)
            dtype = bool if X.dtype == bool else double
        else:
            dtype = bool if metric in booleanmetrics else double
        X = array(X, dtype=dtype, copy=False, order='C', subok=True)
    else:
        # The memory-saving 'centroid', 'median' and 'ward' algorithms work
        # on Euclidean data only; 'ward' works on a private copy of X.
        assert metric == 'euclidean'
        X = array(X, dtype=double, copy=(method == 'ward'), order='C',
                  subok=True)
    assert X.ndim == 2
    N = len(X)
    Z = empty((N - 1, 4))  # output: one row per merging step

    if metric == 'seuclidean':
        if extraarg is None:
            # Default: unbiased sample variance of each coordinate.
            extraarg = var(X, axis=0, ddof=1)
    elif metric == 'mahalanobis':
        if extraarg is None:
            extraarg = inv(cov(X, rowvar=False))
        # instead of the inverse covariance matrix, pass the matrix product
        # with the data matrix!
        extraarg = array(dot(X, extraarg), dtype=double, copy=False,
                         order='C', subok=True)
    elif metric == 'correlation':
        # Mean-center each row, then fall back to the 'cosine' metric.
        X = X - expand_dims(X.mean(axis=1), 1)
        metric = 'cosine'
    elif not isinstance(metric, str):
        # A callable metric is dispatched as 'USER'; the function itself is
        # handed over in 'extraarg'.
        assert extraarg is None
        metric, extraarg = 'USER', metric
    elif metric != 'minkowski':
        # Only 'minkowski' (and the metrics handled above) use 'extraarg'.
        assert extraarg is None
    if N > 1:
        linkage_vector_wrap(X, Z, mthidx[method], mtridx[metric], extraarg)
    return Z
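
The returned array Z encodes the stepwise dendrogram in the same format
as scipy.cluster.hierarchy.linkage, so it can, for example, be plotted
directly (a sketch, assuming matplotlib and scipy are installed):

    import numpy
    from scipy.cluster.hierarchy import dendrogram
    from fastcluster import linkage_vector

    X = numpy.random.standard_normal((50, 4))
    Z = linkage_vector(X, method='single')
    dendrogram(Z)  # draws the hierarchy on the current matplotlib axes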