MartijnGosgens/validation_indices

All indices that are discussed in the paper (except SMI, which requires a lot of computation and is implemented by its original authors in MATLAB) can be found in the list Indices, which can be imported via:

from validation_indices import Indices

Other indices (e.g. other pair-counting indices and other normalizations of NMI) are also implemented. These can be found in the files PairCountingIndices and InformationTheoreticIndices, respectively. Each index is a class that has (among others) a static function score that takes two clusterings as input. Clusterings can be provided in two forms:

  • either as a list where the i-th entry corresponds to the cluster-label (integer) of the i-th item,
  • or as a list of lists where each inner-list contains the indices of all items in that cluster.

Hence, the clustering where the first two items are assigned to the first cluster and the third is assigned to another cluster can be represented as either [0,0,1] or [[0,1],[2]].
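
To make the two representations concrete, here is a minimal sketch that converts between them. The helper names labels_to_clusters and clusters_to_labels are hypothetical and are not part of validation_indices; the package accepts both forms directly, so this is for illustration only.

```python
def labels_to_clusters(labels):
    """Convert a label list such as [0, 0, 1] into a list of lists such as [[0, 1], [2]]."""
    clusters = {}
    for item, label in enumerate(labels):
        clusters.setdefault(label, []).append(item)
    # Dicts keep insertion order, so clusters appear in order of first occurrence.
    return list(clusters.values())


def clusters_to_labels(clusters):
    """Convert a list of lists such as [[0, 2], [1]] into a label list such as [0, 1, 0]."""
    n = sum(len(members) for members in clusters)
    labels = [None] * n
    for label, members in enumerate(clusters):
        for item in members:
            labels[item] = label
    return labels


print(labels_to_clusters([0, 0, 1]))      # [[0, 1], [2]]
print(clusters_to_labels([[0, 2], [1]]))  # [0, 1, 0]
```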

As a short example, the following code computes the values of all indices for two clusterings of three items. Each clustering has two clusters, and the clusters of size 2 differ between them:

from validation_indices import Indices
A = [0,0,1]
B = [[0,2],[1]]
# Map each index name to its score for the pair (A, B).
scores = {
    i.__name__: i.score(A,B)
    for i in Indices
}
print(scores)
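
To show what a pair-counting index measures on this example, the sketch below counts item pairs by hand and computes the Rand index from those counts. It is a standalone illustration that does not use or reproduce the package's implementation; the helper names to_labels, pair_counts and rand_index are assumptions made for this example.

```python
from itertools import combinations


def to_labels(clustering):
    """Accept either representation and return a label list."""
    if clustering and isinstance(clustering[0], (list, tuple, set)):
        labels = {}
        for label, members in enumerate(clustering):
            for item in members:
                labels[item] = label
        return [labels[i] for i in range(len(labels))]
    return list(clustering)


def pair_counts(A, B):
    """Count item pairs by whether they share a cluster in A and in B."""
    a, b = to_labels(A), to_labels(B)
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1  # together in both clusterings
        elif same_a:
            n10 += 1  # together in A only
        elif same_b:
            n01 += 1  # together in B only
        else:
            n00 += 1  # separated in both clusterings
    return n11, n10, n01, n00


def rand_index(A, B):
    """Fraction of pairs on which the two clusterings agree."""
    n11, n10, n01, n00 = pair_counts(A, B)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


A = [0, 0, 1]
B = [[0, 2], [1]]
print(pair_counts(A, B))  # (0, 1, 1, 1)
print(rand_index(A, B))   # 0.3333333333333333
```

Of the three item pairs, the two clusterings agree only on the pair (1, 2), which both keep apart, so the Rand index is 1/3.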

The statistical test described in the appendix of our paper can be applied to an index (NMI in the example below) in the following way:

from validation_indices import NamedIndices
from validation_indices.ConstantBaselineTests import check_constant_baseline

# We apply the test to the Normalized Mutual Information index.
I = NamedIndices["NMI"]

# Choose n=50,100,150,...,1000
ns = range(50, 1001, 50)
# For each n, we consider balanced cluster sizes with k=sqrt(n) clusters.
n2gtk = {
    n: int(n**0.5)
    for n in ns
}
# For each n, we consider candidates with balanced cluster sizes with
# k1=n^0.25, k2=n^0.5, k3=n^0.75.
n2ks = {
    n: [int(n**0.25),int(n**0.5),int(n**0.75)]
    for n in ns
}

check_constant_baseline(
    I=I,
    n2ks=n2ks,
    n2gtk=n2gtk,
    repeats=500,
    aggregate=True)
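
For intuition about the "balanced cluster sizes" mentioned in the comments above, here is a minimal sketch of one way to build such a clustering. How check_constant_baseline constructs its clusterings is not shown in this README, so the helper balanced_labels and its round-robin assignment are assumptions made purely for illustration.

```python
from collections import Counter


def balanced_labels(n, k):
    """Assign n items to k clusters whose sizes differ by at most one (hypothetical helper)."""
    return [i % k for i in range(n)]


n = 50
k = int(n ** 0.5)  # 7, the same value that n2gtk holds for n=50
labels = balanced_labels(n, k)
sizes = sorted(Counter(labels).values())
print(k, sizes)    # 7 [7, 7, 7, 7, 7, 7, 8]
```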

To generate the figures showing the inconsistencies between the indices, run the file InconsistencyVisualizations.py. This can be done in the following way:

from validation_indices.InconsistencyVisualizations import *

To generate the figures for the constant baseline experiments in the appendix, run the file ConstantBaselineExperiments.py in the following way:

from validation_indices.ConstantBaselineExperiments import *

Experiments with synthetic datasets

The module rules_bruteforce computes the minimal set of inconsistency triplets ('rules') that is shown in Figure 1 of the Supplementary material.

The inconsistency triplets can be found by running the file bruteforce_minimal_basis.py.

Running check_minimum_coverage.py produces a table that shows, for each pair of indices, which triplet that pair is inconsistent on.
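
As a rough illustration of what it means for two indices to be inconsistent on a triplet, the sketch below assumes a triplet is made up of a reference clustering and two candidates, and that two indices are inconsistent when they strictly disagree on which candidate is more similar to the reference. This definition and the helper is_inconsistent are assumptions for illustration, not code from rules_bruteforce, and the scores are hypothetical stand-in values.

```python
def is_inconsistent(index_a, index_b, truth, cand1, cand2):
    """Return True if the two score functions rank the candidates in opposite order."""
    diff_a = index_a(truth, cand1) - index_a(truth, cand2)
    diff_b = index_b(truth, cand1) - index_b(truth, cand2)
    # Opposite signs mean one index prefers cand1 and the other prefers cand2.
    return diff_a * diff_b < 0


# Hypothetical stand-in scores, keyed by candidate name, for two made-up indices.
scores_a = {"B1": 0.8, "B2": 0.6}
scores_b = {"B1": 0.5, "B2": 0.7}


def index_a(truth, cand):
    return scores_a[cand]


def index_b(truth, cand):
    return scores_b[cand]


print(is_inconsistent(index_a, index_b, "A", "B1", "B2"))  # True
```

Since each index class exposes a static score function that takes two clusterings, the score functions of two such classes could be passed as index_a and index_b together with real clusterings.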

Experiments with datasets

The module datasets_experiments performs the clustering experiments on datasets to show the inconsistencies among the validation indices. The full experiment can be performed by running the file perform_experiment.py. This will generate three .txt files containing the LaTeX code for the tables in the paper.

The perform_experiment.py file makes use of a few other scripts in the datasets_experiments folder:

  • The file parse_realworld_datasets.py contains the function parse_datasets() that retrieves a number of datasets from this repository and parses them.
  • The file apply_clustering_algorithms.py contains the function apply_clustering_algorithms() that takes these datasets and applies a number of clustering algorithms to them to obtain a large number of candidate clusterings.
  • The file compute_indices.py contains the function compute_indices() that takes all these obtained clusterings and computes their similarity to the ground truth with respect to each validation index. The results will be stored in the file all_datasets_methods_metrics.tsv.
  • The file count_agreements.py contains various functions to process the values in all_datasets_methods_metrics.tsv to obtain the tables listed in the paper.

About

Implementations of cluster similarity indices.
