
Update documentation #81


Merged
merged 35 commits into from
Dec 18, 2020
Changes from all commits
35 commits
91891ae
Add pam algorithm
Nov 6, 2020
40e42e7
Merge remote-tracking branch 'upstream/master' into kmedoid_pam
Nov 7, 2020
4362f85
pam algorithm, not naive.
Nov 8, 2020
5a86b95
black reformat
Nov 8, 2020
6e6c90d
Fix mistake in code
Nov 8, 2020
2ed1803
optimization of the algorithm for speed, review from @kno10
Nov 10, 2020
8422c17
remove generator for couples
Nov 10, 2020
6eae84b
fix mistake
Nov 10, 2020
2a92978
Update pam review 2
Nov 11, 2020
ecce8c8
fix mistake
Nov 12, 2020
1cba61c
cython implementation
Nov 13, 2020
258c262
add test
Nov 13, 2020
9078628
disable openmp for windows and mac
Nov 13, 2020
482bc37
fix black
Nov 13, 2020
50d0eb3
fix setup.py for windows
Nov 13, 2020
7637891
remove test
Nov 13, 2020
eeaa2a3
change review
Nov 15, 2020
093f8b0
Merge branch 'master' into kmedoid_pam
TimotheeMathieu Nov 18, 2020
bd04827
fix black
Nov 18, 2020
e979579
Add build, remove parallel computing
Nov 21, 2020
b51d23b
Apply suggestions from code review
TimotheeMathieu Nov 21, 2020
e675bdb
apply suggested change & rename alternating to alternate.
Nov 21, 2020
4250681
fix test
Nov 21, 2020
8f2ada3
Merge remote-tracking branch 'upstream/master' into kmedoid_pam
Nov 21, 2020
552294b
make build default. Allow max_iter = 0 for build-only algo
Nov 21, 2020
b024b8e
Test for method and init
Nov 21, 2020
018c9c7
test on blobs example
Nov 21, 2020
2f6368f
fix typo
Nov 21, 2020
f1a33ad
fix difference long/long long windows vs linux
Nov 21, 2020
498d9b6
try another fix for windows/linux long difference
Nov 21, 2020
daa9879
test another fix cython long/int on different platforms
Nov 21, 2020
213bb2e
test all in int, cython kmedoid
Nov 22, 2020
3c5adf1
Add doc cluster module, update doc kmedoids, change examples.
Nov 26, 2020
acf20c7
fix black
Nov 26, 2020
7592374
Merge branch 'master' into doc
rth Nov 26, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions doc/api.rst
@@ -40,5 +40,6 @@ Robust
:toctree: generated/
:template: class.rst

robust.RobustWeightedEstimator

robust.RobustWeightedClassifier
robust.RobustWeightedRegressor
robust.RobustWeightedKMeans
199 changes: 199 additions & 0 deletions doc/modules/cluster.rst
@@ -0,0 +1,199 @@
.. _cluster:

=====================================================
Clustering with KMedoids and Common-nearest-neighbors
=====================================================
.. _k_medoids:

K-Medoids
=========

:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
:class:`KMeans` tries to minimize the within-cluster sum-of-squares,
:class:`KMedoids` tries to minimize the sum of distances between each point and
the medoid of its cluster. The medoid is a data point (unlike the centroid)
that has the least total distance to the other members of its cluster. Using
a data point to represent each cluster's center allows the use of any distance
metric for clustering. This can also be a practical advantage: for instance,
K-Medoids algorithms have been used for facial recognition, where the medoid is a
typical photo of the person to recognize, whereas K-Means would produce a
blurry image obtained by mixing several pictures of that person.
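
A minimal usage sketch is given below; it assumes the standard scikit-learn
``fit`` API and that ``cluster_centers_`` holds the selected medoids (adjust
the names if your installed version differs)::

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    # Two small, well-separated groups of points.
    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])

    # Any distance metric can be used; here the Manhattan distance.
    model = KMedoids(n_clusters=2, metric="manhattan", random_state=0).fit(X)

    print(model.labels_)           # cluster assignment of each point
    print(model.cluster_centers_)  # medoids, i.e. actual points taken from X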

:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
because it chooses one of the cluster members as the medoid, while
:class:`KMeans` moves the center of the cluster towards the outlier, which
might in turn move other points away from the cluster center.

:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans`
except that the Manhattan Median is used for each cluster center instead of
the centroid. K-Medians is robust to outliers, but it is limited to the
Manhattan Distance metric and, similar to :class:`KMeans`, it does not guarantee
that the center of each cluster will be a member of the original dataset.

The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number
of samples, :math:`T` is the number of iterations and :math:`K` is the number of
clusters. This makes it more suitable for smaller datasets in comparison to
:class:`KMeans` which is :math:`O(N K T)`.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
with various distance metrics.


**Algorithm description:**
There are several algorithms to compute K-Medoids. :class:`KMedoids` currently
supports two of them: a solver analogous to K-Means, called ``alternate``, and
the PAM algorithm (Partitioning Around Medoids). The ``alternate`` solver is
faster and is typically preferred when speed is an issue; a short usage sketch
for both solvers follows the descriptions below.


* The ``alternate`` method works as follows:

  * Initialize: select ``n_clusters`` points from the dataset as the initial
    medoids using a heuristic, random, or k-medoids++ approach (configurable
    with the ``init`` parameter).
  * Assignment step: assign each element of the dataset to its closest medoid.
  * Update step: identify the new medoid of each cluster.
  * Repeat the assignment and update steps until the medoids stop changing or
    the maximum number of iterations ``max_iter`` is reached.

* The PAM method works as follows:

  * Initialize (the ``init`` value ``build``): greedy initialization of the
    ``n_clusters`` medoids. First, select the point in the dataset that
    minimizes the sum of distances to all other points. Then, repeatedly add
    the point that most decreases the cost until ``n_clusters`` points are
    selected.
  * Swap step: for each selected medoid, compute the cost of swapping it with
    every non-medoid point. Perform the swap that decreases the cost the most,
    and stop when no swap decreases the cost.
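
As a rough sketch of how the two solvers can be selected (the ``method`` and
``init`` values below mirror the descriptions above; the ``inertia_`` attribute
is assumed to hold the sum of distances to the closest medoid)::

    from sklearn.datasets import make_blobs
    from sklearn_extra.cluster import KMedoids

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

    # PAM with the greedy 'build' initialization described above.
    pam = KMedoids(n_clusters=3, method="pam", init="build", random_state=0).fit(X)

    # Faster alternating solver, analogous to K-Means.
    alt = KMedoids(n_clusters=3, method="alternate", init="k-medoids++",
                   random_state=0).fit(X)

    # Compare the resulting clustering costs.
    print(pam.inertia_, alt.inertia_)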

.. topic:: References:

* Maranzana, F.E., 1963. On the location of supply points to minimize
transportation costs. IBM Systems Journal, 2(2), pp.129-135.
* Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
clustering. Expert systems with applications, 36(2), pp.3336-3341.
* Kaufman, L. and Rousseeuw, P.J. (2008). Partitioning Around Medoids (Program PAM).
In Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw).
doi:10.1002/9780470316801.ch2
* Bhat, Aruna (2014). K-medoids clustering using partitioning around medoids
for performing face recognition. International Journal of Soft Computing,
Mathematics and Control, 3(3), pp. 1-12.

.. _commonnn:

Common-nearest-neighbors clustering
===================================

:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
provides an interface to density-based
common-nearest-neighbors clustering. Density-based clustering identifies
clusters as dense regions of high point density, separated by sparse
regions of lower density. Common-nearest-neighbors clustering
approximates local density as the number of shared (common) neighbors
between two points with respect to a neighbor search radius. A density
threshold (the density criterion), defined by the cluster
parameters ``min_samples`` (number of common neighbors) and ``eps`` (search
radius), is used to distinguish high from low density. A high value of
``min_samples`` and a low value of ``eps`` correspond to high density.
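
A minimal usage sketch, assuming the estimator follows the usual scikit-learn
clustering API (``fit`` and a ``labels_`` attribute)::

    import numpy as np
    from sklearn_extra.cluster import CommonNNClustering

    # Two dense groups of four points each, plus one isolated point.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1],
                  [10.0, 10.0]])

    # Pairs of points need at least min_samples common neighbors within eps.
    clustering = CommonNNClustering(eps=0.5, min_samples=2).fit(X)
    print(clustering.labels_)  # noise (here the isolated point) is labeled -1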

As such the method is related to other density-based cluster algorithms
like :class:`DBSCAN <sklearn.cluster.DBSCAN>` or Jarvis-Patrick. DBSCAN
approximates local density as the number of points in the neighborhood
of a single point. The Jarvis-Patrick algorithm uses the number of
common neighbors shared by two points among the :math:`k` nearest neighbors.
As these approaches each provide a different notion of how density is
estimated from point samples, they can be used complementarily. Their
relative suitability for a classification problem depends on the nature
of the clustered data. Common-nearest-neighbors clustering (as
density-based clustering in general) has the following advantages over
other clustering techniques:

* The cluster result is deterministic. The same set of cluster
parameters always leads to the same classification for a data set.
A different ordering of the data set leads to a different ordering
of the cluster assignment, but does not change the assignment
qualitatively.
* Little prior knowledge about the data is required, e.g. the number
of resulting clusters does not need to be known beforehand (although
cluster parameters need to be tuned to obtain a desired result).
* Identified clusters are not restricted in their shape or size.
* Points can be considered noise (outliers) if they do not fulfil
the density criterion.

The common-nearest-neighbors algorithm tests the density criterion for
pairs of neighbors (do they have at least ``min_samples`` points in the
intersection of their neighborhoods at a radius ``eps``?). Two points that
fulfil this criterion are directly part of the same dense data region,
i.e. they are *density reachable*. A *density connected* network of
density reachable points (a connected component if density reachability
is viewed as a graph structure) constitutes a separated dense region and
therefore a cluster. Note that, in contrast to
:class:`DBSCAN <sklearn.cluster.DBSCAN>` for example, there is no differentiation
between *core* points (dense points) and *edge* points (points that are not dense
themselves but neighbors of dense points). The assignment of points on
the cluster rims to a cluster is possible, but can be ambiguous. The
cluster result is returned as a 1D container of labels, i.e. a sequence
of zero-based integers of length :math:`n` for a data set of :math:`n`
points, denoting the assignment of each point to a specific cluster. Noise is
labeled with ``-1``. Valid clusters have at least two members. The
clusters are not sorted by cluster member count. In some cases the
algorithm tends to identify small clusters that can be filtered out
manually.
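
For example, small clusters could be relabeled as noise after the fact. The
following is a generic post-processing sketch on the ``labels_`` array of a
fitted estimator; the minimum cluster size used here is arbitrary::

    import numpy as np

    labels = clustering.labels_.copy()        # labels from a fitted CommonNNClustering
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    too_small = ids[counts < 5]               # hypothetical minimum cluster size
    labels[np.isin(labels, too_small)] = -1   # treat members of small clusters as noise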

.. topic:: Examples:

* :ref:`examples/cluster/plot_commonnn.py <sphx_glr_auto_examples_plot_commonnn.py>`
Basic usage of the
:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
* :ref:`examples/cluster/plot_commonnn_data_sets.py <sphx_glr_auto_examples_plot_commonnn_data_sets.py>`
Common-nearest-neighbors clustering of toy data sets

.. topic:: Implementation:

The present implementation of the common-nearest-neighbors algorithm in
:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
shares some commonalities with the current
scikit-learn implementation of :class:`DBSCAN <sklearn.cluster.DBSCAN>`.
It computes neighborhoods from points in bulk with
:class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>` before
the actual clustering. Consequently, storing the neighborhoods
requires memory on the order of
:math:`O(n \cdot n_n)` for :math:`n` points in the data set, where :math:`n_n`
is the average number of neighbors (which is proportional to ``eps``); in the
worst case this is :math:`O(n^2)`. Depending on the input structure (dense or sparse
points or similarity matrix) the additional memory demand varies.
The clustering itself follows a
breadth-first-search scheme, checking the density criterion at every
node expansion. The time complexity is roughly linear in
the number of data points :math:`n`, the total number of neighbors :math:`N`,
and the value of ``min_samples``. For density-based clustering
schemes with lower memory demand, also consider:

* :class:`OPTICS <sklearn.cluster.OPTICS>` – Density-based clustering
related to DBSCAN using an ``eps`` value range.
* `cnnclustering <https://pypi.org/project/cnnclustering/>`_ – A
different implementation of common-nearest-neighbors clustering.

.. topic:: Notes:

* :class:`DBSCAN <sklearn.cluster.DBSCAN>` provides an option to
specify data point weights with ``sample_weight``. This feature is
currently experimental for :class:`CommonNNClustering`, as
weights are not well defined for checking the common-nearest-neighbor
density criterion. It should not be used in production yet.

.. topic:: References:

* B. Keller, X. Daura, W. F. van Gunsteren "Comparing Geometric and
Kinetic Cluster Algorithms for Molecular Simulation Data" J. Chem.
Phys., 2010, 132, 074110.

* O. Lemke, B.G. Keller "Density-based Cluster Algorithms for the
Identification of Core Sets" J. Chem. Phys., 2016, 145, 164104.

* O. Lemke, B.G. Keller "Common nearest neighbor clustering - a
benchmark" Algorithms, 2018, 11, 19.
176 changes: 1 addition & 175 deletions doc/user_guide.rst
@@ -11,179 +11,5 @@ User guide
:numbered:

modules/eigenpro.rst
modules/cluster.rst
modules/robust.rst

6 changes: 6 additions & 0 deletions examples/cluster/README.txt
@@ -0,0 +1,6 @@
.. _cluster_examples:

Cluster
=======

Examples concerning the :mod:`sklearn_extra.cluster` module.