
Implementing KMedoids in scikit-learn-extra #12


Merged
26 commits merged on Jul 29, 2019
Changes from 19 commits

Commits (26)
cd19b57
Added kmedoids code
znd4 Apr 7, 2019
3e18444
changed k_medoids_ imports to absolute
znd4 Apr 23, 2019
936919d
Merge branch 'master' of https://github.com/scikit-learn-contrib/scik…
znd4 Apr 29, 2019
d4c086c
Added .vscode to .gitignore
znd4 Apr 29, 2019
bacc931
Add venv to .gitignore
znd4 Apr 29, 2019
0cb8e43
Added cluster tests
znd4 Apr 29, 2019
96f3a2e
Fix KMedoids docstring
znd4 Apr 29, 2019
8d9d9d6
Reconfigure _kpp_init tests
znd4 Apr 30, 2019
8e534e8
added documentation
znd4 May 11, 2019
4d61529
Rename k_medoids_.py -> _k_medoids.py
znd4 Jul 26, 2019
03f9e54
Update conf.py to include mathjax
znd4 Jul 26, 2019
2e95287
Add KMedoids to test_common.py
znd4 Jul 26, 2019
0e1ee5b
add plot_kmedoids_digits.py
znd4 Jul 26, 2019
ee1688b
Add Examples line to KMedoids docstring
znd4 Jul 26, 2019
e96e2b0
Remove duplicate examples section in _k_medoids.py docstring
znd4 Jul 26, 2019
07f6e3c
ACTUALLY remove duplicate examples section
znd4 Jul 26, 2019
9910804
Add sphinx gallery of plot_kmedoids_digits.py
znd4 Jul 26, 2019
0c8d032
Added k-medoids++ to help message
znd4 Jul 26, 2019
0368daa
Merge branch 'master' into kmedoids
znd4 Jul 26, 2019
3d71001
Run `black` on code
znd4 Jul 27, 2019
182d505
Remove commented out math code
znd4 Jul 27, 2019
88d9630
Remove unnecessary plot_kmedoids_digits.py
znd4 Jul 27, 2019
9405d98
Remove `x_squared_norms` from _kpp_init (copied over from kmeans)
znd4 Jul 27, 2019
0989f88
Add comment for _kpp_init
znd4 Jul 27, 2019
d76d6b8
update n_samples -> n_query, where appropriate
znd4 Jul 28, 2019
c060b0e
Add sklearn_extra/cluster/tests/__init__.py
rth Jul 29, 2019
4 changes: 4 additions & 0 deletions .gitignore
@@ -8,6 +8,9 @@ __pycache__/
# C extensions
*.so

# Text Editors
.vscode/

# scikit-learn specific
doc/_build/
doc/auto_examples/
@@ -17,6 +20,7 @@ doc/datasets/generated/
# Distribution / packaging

.Python
venv/
env/
build/
develop-eggs/
10 changes: 10 additions & 0 deletions doc/api.rst
@@ -12,3 +12,13 @@ Kernel approximation
:template: class.rst

kernel_approximation.Fastfood

Clustering
====================

.. autosummary::
:toctree: generated/
:template: class.rst

cluster.KMedoids

18 changes: 14 additions & 4 deletions doc/conf.py
@@ -48,13 +48,23 @@
# pngmath / imgmath compatibility layer for different sphinx versions
import sphinx
from distutils.version import LooseVersion
if LooseVersion(sphinx.__version__) < LooseVersion('1.4'):
    extensions.append('sphinx.ext.pngmath')
else:
    extensions.append('sphinx.ext.imgmath')
# if LooseVersion(sphinx.__version__) < LooseVersion('1.4'):
#     extensions.append('sphinx.ext.pngmath')
# else:
#     extensions.append('sphinx.ext.imgmath')

autodoc_default_flags = ['members', 'inherited-members']

# For maths, use mathjax by default and svg if NO_MATHJAX env variable is set
# (useful for viewing the doc offline)
if os.environ.get('NO_MATHJAX'):
    extensions.append('sphinx.ext.imgmath')
    imgmath_image_format = 'svg'
else:
    extensions.append('sphinx.ext.mathjax')
    mathjax_path = ('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/'
                    'MathJax.js?config=TeX-AMS_SVG')

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

58 changes: 57 additions & 1 deletion doc/user_guide.rst
@@ -6,4 +6,60 @@
User guide
==========

To add.
.. _k_medoids:

K-Medoids
=========

:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
:class:`KMeans` tries to minimize the within-cluster sum of squares,
:class:`KMedoids` tries to minimize the sum of distances between each point and
the medoid of its cluster. The medoid is a data point (unlike the centroid)
with the least total distance to the other members of its cluster. Because each
cluster's center is an actual data point, any distance metric can be used for
clustering.
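
A minimal usage sketch (attribute names such as ``labels_`` and
``cluster_centers_`` follow scikit-learn's clustering conventions; the exact
medoids chosen can vary with the initialization)::

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    X = np.asarray([[1, 2], [1, 4], [1, 0],
                    [4, 2], [4, 4], [4, 0]])
    model = KMedoids(n_clusters=2, metric='manhattan').fit(X)
    print(model.labels_)           # cluster index assigned to each sample
    print(model.cluster_centers_)  # the medoids, i.e. actual rows of X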

:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
because it chooses one of the cluster members as the medoid, while
:class:`KMeans` moves the center of the cluster towards the outlier, which
might in turn move other points away from the cluster center.
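
As a small illustration of this robustness (a sketch, not part of the shipped
examples), fitting both estimators on one tight cluster plus a single outlier
shows :class:`KMeans` reporting a mean pulled toward the outlier while
:class:`KMedoids` reports an actual sample::

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn_extra.cluster import KMedoids

    # one tight cluster plus a single far-away outlier
    X = np.asarray([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [10.0, 10.0]])

    print(KMeans(n_clusters=1).fit(X).cluster_centers_)    # the mean, dragged toward the outlier
    print(KMedoids(n_clusters=1).fit(X).cluster_centers_)  # an actual point from the tight cluster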

:class:`KMedoids` also differs from K-Medians, which is analogous to :class:`KMeans`
except that the Manhattan median is used for each cluster center instead of
the centroid. K-Medians is robust to outliers, but it is limited to the
Manhattan distance metric and, like :class:`KMeans`, it does not guarantee
that the center of each cluster will be a member of the original dataset.

The complexity of K-Medoids is :math:`O(N^2 K T)`, where :math:`N` is the number
of samples, :math:`T` is the number of iterations and :math:`K` is the number of
clusters. This makes it more suitable for smaller datasets than
:class:`KMeans`, whose complexity is :math:`O(N K T)`.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
with various distance metrics.


**Algorithm description:**
There are several algorithms to compute K-Medoids, though :class:`KMedoids`
currently only supports Partitioning Around Medoids (PAM). The PAM algorithm
uses a greedy search, which may fail to find the global optimum. It consists of
two alternating steps commonly called the
Assignment and Update steps (BUILD and SWAP in Kaufman and Rousseeuw, 1987).

PAM works as follows:

* Initialize: select ``n_clusters`` points from the dataset as the initial
  medoids, using a heuristic, random, or k-medoids++ approach (configurable
  with the ``init`` parameter).
* Assignment step: assign each element of the dataset to its closest medoid.
* Update step: identify the new medoid of each cluster.
* Repeat the assignment and update steps until the medoids stop changing or
  the maximum number of iterations ``max_iter`` is reached (a condensed
  sketch of this loop is shown below).
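
For illustration only, here is a condensed NumPy sketch of the loop above. It
is not the implementation in ``_k_medoids.py``: initialization is purely
random, empty clusters are skipped, and ties are broken arbitrarily::

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def pam_sketch(X, n_clusters, metric='euclidean', max_iter=300, seed=None):
        rng = np.random.default_rng(seed)
        D = pairwise_distances(X, metric=metric)       # precomputed pairwise distances
        medoids = rng.choice(len(X), size=n_clusters, replace=False)
        for _ in range(max_iter):
            labels = np.argmin(D[:, medoids], axis=1)  # assignment step
            new_medoids = medoids.copy()
            for k in range(n_clusters):
                members = np.flatnonzero(labels == k)
                if members.size == 0:
                    continue
                # update step: the member with the least total distance to the others
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(within)]
            if np.array_equal(new_medoids, medoids):   # medoids stopped changing
                break
            medoids = new_medoids
        return medoids, labels

Precomputing the full :math:`N \times N` distance matrix, as in this sketch,
is what drives the :math:`O(N^2)` factor in the complexity noted above.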

.. topic:: References:

* "Clustering by Means of Medoids'"
Kaufman, L. and Rousseeuw, P.J.,
Statistical Data Analysis Based on the L1Norm and Related Methods, edited
by Y. Dodge, North-Holland, 405416. 1987
97 changes: 97 additions & 0 deletions examples/plot_kmedoids_digits.py
@@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
"""
=============================================================
A demo of K-Medoids clustering on the handwritten digits data
=============================================================
In this example we compare different pairwise distance
metrics for K-Medoids.
"""
import numpy as np
import matplotlib.pyplot as plt

from collections import namedtuple
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

print(__doc__)

# Authors: Timo Erkkilä <[email protected]>
# Antti Lehmussola <[email protected]>
# Kornel Kiełczewski <[email protected]>
# License: BSD 3 clause

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))

reduced_data = PCA(n_components=2).fit_transform(data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.figure()
plt.clf()

plt.suptitle("Comparing multiple K-Medoids metrics to K-Means and each other",
             fontsize=14)

Algorithm = namedtuple('ClusterAlgorithm', ['model', 'description'])

selected_models = [
    Algorithm(KMedoids(metric='manhattan',
                       n_clusters=n_digits),
              'KMedoids (manhattan)'),
    Algorithm(KMedoids(metric='euclidean',
                       n_clusters=n_digits),
              'KMedoids (euclidean)'),
    Algorithm(KMedoids(metric='cosine',
                       n_clusters=n_digits),
              'KMedoids (cosine)'),
    Algorithm(KMeans(n_clusters=n_digits),
              'KMeans')
]

plot_rows = int(np.ceil(len(selected_models) / 2.0))
plot_cols = 2

for i, (model, description) in enumerate(selected_models):

    # Fit the model on the PCA-reduced data and obtain labels for each
    # point in the mesh.
    model.fit(reduced_data)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.subplot(plot_rows, plot_cols, i + 1)
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')

    plt.plot(reduced_data[:, 0],
             reduced_data[:, 1],
             'k.', markersize=2,
             alpha=0.3,
             )
    # Plot the cluster centers (medoids for KMedoids) as a white X
    centroids = model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=10)
    plt.title(description)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())

plt.show()
5 changes: 5 additions & 0 deletions sklearn_extra/cluster/__init__.py
@@ -0,0 +1,5 @@
from ._k_medoids import KMedoids

__all__ = [
    'KMedoids',
]