PRC Clustering and Classification Using sklearn

This is a python package implementing several clustering and classification algorithms that use Pinch Ratio Clustering. To use it, you'll need scikit-learn (tested with versions 0.12 and 0.13) as well as the python bindings to the C++ library.

Install with

python setup.py install

Run the tests with

python tests.py

to make sure everything works.

Clustering

The file cluster.py contains sklearn compatible clustering algorithms using PRC. An example:

>>> from sk_prc.cluster import PinchRatioClustering
>>> from sk_prc import similarity
>>> from sklearn.datasets import make_blobs
>>> from sklearn.metrics import adjusted_rand_score
>>> data, labels = make_blobs(100, 2, 2, random_state=106)
>>> knn_strategy = similarity.KNN(10)
>>> c = PinchRatioClustering(n_clusters=2, 
...                          adj_matrix_strategy=knn_strategy,
...                          n_trials=1)
>>> c.fit(data)
>>> adjusted_rand_score(c.labels, labels)
1.0

Note that we can set the number of clusters we want, the adjacency matrix type to use, how many TILO runs to do, and the initial ordering to use. Gaussian and k nearest neighbors adjacency matrices are supported, and if you want to use your own adjacency matrix you can do that, too. Right now the TILO run with minimal width (widths sorted in nondecreasing order!) over n_trials runs is chosen.

Furthermore, you can get a bunch of information about the clustering, like the ordering, boundary, pinch ratios, and width:

>>> c.ordering                  # doctest: +ELLIPSIS
array([15, 35, ..., 59, 86])
>>> c.boundary[49] == 0.0       # good separation between two blobs
True
>>> c.pinch_ratios              # note good separation
[0.0]
>>> list(c.width)               # doctest: +ELLIPSIS
[0.0, 6.0, ..., 50.0, 50.5]

Classification

Note that these classifiers aren't as well tested as the clustering stuff. Use at your own risk.

The file classify.py contains BinaryTiloClassifier, which can work with sklearn to implement classifiers based on TILO/PRC. It is parametrized by a cut strategy and an adjacency matrix strategy.

>>> from sk_prc.classify import BinaryTiloClassifier, NearestCutStrategy
>>> from sk_prc import similarity
>>> import numpy as np
>>> c = BinaryTiloClassifier(NearestCutStrategy(),
...                          similarity.Gaussian())
>>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
...                  [15, 15], [14, 14], [14, 15], [15, 14]], dtype=float)
>>> labels = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])
>>> fitted_model = c.fit(data, labels)
>>> guesses = fitted_model.predict(np.array([[1.5, 1.5],
...                                         [11.0, 11.0]]))
>>> guesses[0], guesses[1]
('a', 'b')

Here is an example of multiclass classification using bits of sklearn. (We only use some of the iris data for speed)

>>> from sklearn.datasets import load_iris
>>> from sklearn.multiclass import OneVsOneClassifier
>>> iris = load_iris()
>>> indices = np.arange(0, 150, 10) ## use a subset of the data for speed
>>> iris_data, iris_labels = iris.data[indices], iris.target[indices]
>>> c = BinaryTiloClassifier(NearestCutStrategy(),
...                          similarity.KNN(6))
>>> mcc = OneVsOneClassifier(c)
>>> guessed_labels = mcc.fit(iris_data, iris_labels).predict(iris_data)
>>> (guessed_labels != iris_labels).sum()
1

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
sk_prc		sk_prc
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sk_prc

sk_prc

README.md

README.md

setup.py

setup.py

Repository files navigation

PRC Clustering and Classification Using sklearn

Clustering

Classification

About

Releases

Packages

Languages

rsbowman/sklearn-prc

Folders and files

Latest commit

History

Repository files navigation

PRC Clustering and Classification Using sklearn

Clustering

Classification

About

Resources

Stars

Watchers

Forks

Languages