k-means-constrained

K-means clustering implementation whereby a minimum and/or maximum size for each cluster can be specified.

This K-means implementation modifies the cluster assignment step (E in EM) by formulating it as a Minimum Cost Flow (MCF) linear network optimisation problem. The problem is solved with a cost-scaling push-relabel algorithm, using SimpleMinCostFlow from Google's Operations Research tools (OR-tools), which is a fast C++ implementation.

This package is inspired by Bradley et al. The original Minimum Cost Flow (MCF) network proposed by Bradley et al. has been modified so that a maximum cluster size can be specified along with a minimum cluster size.

The code is based on scikit-learn's KMeans and implements the same API with modifications.

Ref:

  1. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.
  2. Google's SimpleMinCostFlow C++ implementation

Example

>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
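
The size bounds can only be met when `n_clusters * size_min <= n_samples <= n_clusters * size_max`. The helper below is a small sketch for checking this before fitting. The name `size_bounds_feasible` is hypothetical and not part of this package, and treating an omitted bound as `None` is an assumption of the sketch.

```python
def size_bounds_feasible(n_samples, n_clusters, size_min=None, size_max=None):
    """Return True if n_samples points can be split into n_clusters clusters
    that each hold between size_min and size_max points."""
    lower = 0 if size_min is None else size_min          # no minimum given
    upper = n_samples if size_max is None else size_max  # no maximum given
    return lower <= upper and n_clusters * lower <= n_samples <= n_clusters * upper


# The example above: 6 samples, 2 clusters, between 2 and 5 points each.
ok = size_bounds_feasible(6, 2, size_min=2, size_max=5)
```

For the example data this returns `True`, while `size_min=4` would not be satisfiable because two clusters of four points need eight samples.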

Installation

Requires Google's OR-tools to be installed.

Currently tested with:

  • scikit-learn == 0.19.1
  • ortools == 6.7.4973

Because this package uses internal scikit-learn methods associated with k-means, it may break with other versions of scikit-learn. This can be addressed in the future by importing the internal methods into this project.
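
One way to notice a version mismatch early is to compare the installed packages against the tested versions listed above. The following is only a sketch using the standard library's `importlib.metadata` (Python 3.8+). The function name `version_mismatches` and the `TESTED` mapping are illustrative and not part of this package.

```python
from importlib import metadata

# Versions this README reports as tested.
TESTED = {"scikit-learn": "0.19.1", "ortools": "6.7.4973"}


def version_mismatches(tested=TESTED, get_version=metadata.version):
    """Return {package: (tested_version, installed_version_or_None)} for
    every package whose installed version differs from the tested one."""
    mismatches = {}
    for name, wanted in tested.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `version_mismatches()` returns an empty dict when both packages match the tested versions. A non-empty result is a hint that the internal scikit-learn methods may have changed.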

TODO:

  • Documentation
  • Test with sparse X
  • Remove dependencies on internal scikit-learn methods by importing them into this project
