diff --git a/doc/api.rst b/doc/api.rst index b99a3b5b..57b36246 100644 --- a/doc/api.rst +++ b/doc/api.rst @@ -40,5 +40,6 @@ Robust :toctree: generated/ :template: class.rst - robust.RobustWeightedEstimator - + robust.RobustWeightedClassifier + robust.RobustWeightedRegressor + robust.RobustWeightedKMeans diff --git a/doc/modules/cluster.rst b/doc/modules/cluster.rst new file mode 100644 index 00000000..bb351308 --- /dev/null +++ b/doc/modules/cluster.rst @@ -0,0 +1,199 @@ +.. _cluster: + +===================================================== +Clustering with KMedoids and Common-nearest-neighbors +===================================================== +.. _k_medoids: + +K-Medoids +========= + +:class:`KMedoids` is related to the :class:`KMeans` algorithm. While +:class:`KMeans` tries to minimize the within cluster sum-of-squares, +:class:`KMedoids` tries to minimize the sum of distances between each point and +the medoid of its cluster. The medoid is a data point (unlike the centroid) +which has the least total distance to the other members of its cluster. The use of +a data point to represent each cluster's center allows the use of any distance +metric for clustering. It can also be a practical advantage: for instance, K-Medoids +has been used for facial recognition, where the medoid is a typical photo of the +person to recognize, whereas K-Means would produce a blurry image obtained by +mixing several pictures of that person. + +:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans` +as it will choose one of the cluster members as the medoid while +:class:`KMeans` will move the center of the cluster towards the outlier which +might in turn move other points away from the cluster centre. + +:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans` +except that the Manhattan Median is used for each cluster center instead of +the centroid. K-Medians is robust to outliers, but it is limited to the +Manhattan Distance metric and, similar to :class:`KMeans`, it does not guarantee +that the center of each cluster will be a member of the original dataset. + +The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number +of samples, :math:`T` is the number of iterations and :math:`K` is the number of +clusters. This makes it more suitable for smaller datasets in comparison to +:class:`KMeans` which is :math:`O(N K T)`. + +.. topic:: Examples: + + * :ref:`sphx_glr_auto_examples_cluster_plot_kmedoids_digits.py`: Applying K-Medoids on digits + with various distance metrics. + + +**Algorithm description:** +There are several algorithms to compute K-Medoids, though :class:`KMedoids` +currently only supports two of them: a solver analogous to K-Means, called alternate, +and the PAM algorithm (partitioning around medoids). The alternate algorithm is used +when speed is an issue. + + +* Alternate method works as follows: + + * Initialize: Select ``n_clusters`` points from the dataset as the medoids using + a heuristic, random, or k-medoids++ approach (configurable using the ``init`` parameter). + * Assignment step: assign each element from the dataset to the closest medoid. + * Update step: Identify the new medoid of each cluster. + * Repeat the assignment and update step while the medoids keep changing or + maximum number of iterations ``max_iter`` is reached. + +* PAM method works as follows: + + * Initialize: Greedy initialization of ``n_clusters`` medoids. First select the point + in the dataset that minimizes the sum of distances to all other points.
Then, add one + point that minimizes the cost and loop until ``n_clusters`` points are selected. + This corresponds to setting the ``init`` parameter to ``build``. + * Swap Step: for each medoid already selected, compute the cost of swapping this + medoid with any non-medoid point. Then, make the swap that decreases the cost + the most. Loop and stop when there is no change anymore. + +.. topic:: References: + + * Maranzana, F.E., 1963. On the location of supply points to minimize + transportation costs. IBM Systems Journal, 2(2), pp.129-135. + * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids + clustering. Expert systems with applications, 36(2), pp.3336-3341. + * Kaufman, L. and Rousseeuw, P.J. (2008). Partitioning Around Medoids (Program PAM). + In Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw). + doi:10.1002/9780470316801.ch2 + * Bhat, Aruna (2014). K-medoids clustering using partitioning around medoids + for performing face recognition. International Journal of Soft Computing, + Mathematics and Control, 3(3), pp.1-12. + +.. _commonnn: + +Common-nearest-neighbors clustering +=================================== + +:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>` +provides an interface to density-based +common-nearest-neighbors clustering. Density-based clustering identifies +clusters as dense regions of high point density, separated by sparse +regions of lower density. Common-nearest-neighbors clustering +approximates local density as the number of shared (common) neighbors +between two points with respect to a neighbor search radius. A density +threshold (density criterion) is used – defined by the cluster +parameters ``min_samples`` (number of common neighbors) and ``eps`` (search +radius) – to distinguish high from low density. A high value of +``min_samples`` and a low value of ``eps`` correspond to high density. + +As such, the method is related to other density-based cluster algorithms +like :class:`DBSCAN <sklearn.cluster.DBSCAN>` or Jarvis-Patrick. DBSCAN +approximates local density as the number of points in the neighborhood +of a single point. The Jarvis-Patrick algorithm uses the number of +common neighbors shared by two points among the :math:`k` nearest neighbors. +As these approaches each provide a different notion of how density is +estimated from point samples, they can be used complementarily. Their +relative suitability for a classification problem depends on the nature +of the clustered data. Common-nearest-neighbors clustering (as +density-based clustering in general) has the following advantages over +other clustering techniques: + + * The cluster result is deterministic. The same set of cluster + parameters always leads to the same classification for a data set. + A different ordering of the data set leads to a different ordering + of the cluster assignment, but does not change the assignment + qualitatively. + * Little prior knowledge about the data is required, e.g. the number + of resulting clusters does not need to be known beforehand (although + cluster parameters need to be tuned to obtain a desired result). + * Identified clusters are not restricted in their shape or size. + * Points can be considered noise (outliers) if they do not fulfil + the density criterion. + +The common-nearest-neighbors algorithm tests the density criterion for +pairs of neighbors (do they have at least ``min_samples`` points in the +intersection of their neighborhoods at a radius ``eps``?). Two points that +fulfil this criterion are directly part of the same dense data region, +i.e.
they are *density reachable*. A *density connected* network of +density reachable points (a connected component if density reachability +is viewed as a graph structure) constitutes a separated dense region and +therefore a cluster. Note that, in contrast to +:class:`DBSCAN <sklearn.cluster.DBSCAN>` for example, there is no differentiation between +*core* (dense points) and *edge* points (points that are not dense +themselves but neighbors of dense points). The assignment of points on +the cluster rims to a cluster is possible, but can be ambiguous. The +cluster result is returned as a 1D container of labels, i.e. a sequence +of integers (zero-based) of length :math:`n` for a data set of :math:`n` +points, +denoting the assignment of points to a specific cluster. Noise is +labeled with ``-1``. Valid clusters have at least two members. The +clusters are not sorted by cluster member count. In some cases the +algorithm tends to identify small clusters that can be filtered out +manually. + +.. topic:: Examples: + + * :ref:`examples/cluster/plot_commonnn.py ` + Basic usage of the + :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>` + * :ref:`examples/cluster/plot_commonnn_data_sets.py ` + Common-nearest-neighbors clustering of toy data sets + +.. topic:: Implementation: + + The present implementation of the common-nearest-neighbors algorithm in + :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>` + shares some + commonalities with the current + scikit-learn implementation of :class:`DBSCAN <sklearn.cluster.DBSCAN>`. + It computes neighborhoods from points in bulk with + :class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>` before + the actual clustering. Consequently, to store the neighborhoods + it requires memory on the order of + :math:`O(n \cdot n_n)` for :math:`n` points in the data set where :math:`n_n` + is the + average number of neighbors (which is proportional to ``eps``), that is at + worst :math:`O(n^2)`. Depending on the input structure (dense or sparse + points or similarity matrix) the additional memory demand varies. + The clustering itself follows a + breadth-first-search scheme, checking the density criterion at every + node expansion. The time complexity is roughly linear in + the number of data points :math:`n`, the total number of neighbors :math:`N` + and the value of ``min_samples``. For density-based clustering + schemes with lower memory demand, also consider: + + * :class:`OPTICS <sklearn.cluster.OPTICS>` – Density-based clustering + related to DBSCAN using an ``eps`` value range. + * `cnnclustering `_ – A + different implementation of common-nearest-neighbors clustering. + +.. topic:: Notes: + + * :class:`DBSCAN <sklearn.cluster.DBSCAN>` provides an option to + specify data point weights with ``sample_weight``. This feature is + currently experimental for :class:`CommonNNClustering`, as + weights are not well defined for checking the common-nearest-neighbor + density criterion. It should not be used in production yet. + +.. topic:: References: + + * B. Keller, X. Daura, W. F. van Gunsteren "Comparing Geometric and + Kinetic Cluster Algorithms for Molecular Simulation Data" J. Chem. + Phys., 2010, 132, 074110. + + * O. Lemke, B.G. Keller "Density-based Cluster Algorithms for the + Identification of Core Sets" J. Chem. Phys., 2016, 145, 164104. + + * O. Lemke, B.G. Keller "Common nearest neighbor clustering - a + benchmark" Algorithms, 2018, 11, 19. diff --git a/doc/user_guide.rst b/doc/user_guide.rst index 7468a46d..6ee38b1f 100644 --- a/doc/user_guide.rst +++ b/doc/user_guide.rst @@ -11,179 +11,5 @@ User guide :numbered: modules/eigenpro.rst + modules/cluster.rst modules/robust.rst - -.. 
_k_medoids: - -K-Medoids -========= - -:class:`KMedoids` is related to the :class:`KMeans` algorithm. While -:class:`KMeans` tries to minimize the within cluster sum-of-squares, -:class:`KMedoids` tries to minimize the sum of distances between each point and -the medoid of its cluster. The medoid is a data point (unlike the centroid) -which has least total distance to the other members of its cluster. The use of -a data point to represent each cluster's center allows the use of any distance -metric for clustering. - -:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans` -as it will choose one of the cluster members as the medoid while -:class:`KMeans` will move the center of the cluster towards the outlier which -might in turn move other points away from the cluster centre. - -:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans` -except that the Manhattan Median is used for each cluster center instead of -the centroid. K-Medians is robust to outliers, but it is limited to the -Manhattan Distance metric and, similar to :class:`KMeans`, it does not guarantee -that the center of each cluster will be a member of the original dataset. - -The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number -of samples, :math:`T` is the number of iterations and :math:`K` is the number of -clusters. This makes it more suitable for smaller datasets in comparison to -:class:`KMeans` which is :math:`O(N K T)`. - -.. topic:: Examples: - - * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits - with various distance metrics. - - -**Algorithm description:** -There are several algorithms to compute K-Medoids, though :class:`KMedoids` -currently only supports K-Medoids solver analogous to K-Means. Other frequently -used approach is partitioning around medoids (PAM) which is currently not -implemented. - -This version works as follows: - -* Initialize: Select ``n_clusters`` from the dataset as the medoids using - a heuristic, random, or k-medoids++ approach (configurable using the ``init`` parameter). -* Assignment step: assign each element from the dataset to the closest medoid. -* Update step: Identify the new medoid of each cluster. -* Repeat the assignment and update step while the medoids keep changing or - maximum number of iterations ``max_iter`` is reached. - -.. topic:: References: - - * Maranzana, F.E., 1963. On the location of supply points to minimize - transportation costs. IBM Systems Journal, 2(2), pp.129-135. - * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids - clustering. Expert systems with applications, 36(2), pp.3336-3341. - -.. _commonnn: - -Common-nearest-neighbors clustering -=================================== - -:class:`CommonNNClustering ` -provides an interface to density-based -common-nearest-neighbors clustering. Density-based clustering identifies -clusters as dense regions of high point density, separated by sparse -regions of lower density. Common-nearest-neighbors clustering -approximates local density as the number of shared (common) neighbors -between two points with respect to a neighbor search radius. A density -threshold (density criterion) is used – defined by the cluster -parameters ``min_samples`` (number of common neighbors) and ``eps`` (search -radius) – to distinguish high from low density. A high value of -``min_samples`` and a low value of ``eps`` corresponds to high density. 
- -As such the method is related to other density-based cluster algorithms -like :class:`DBSCAN ` or Jarvis-Patrick. DBSCAN -approximates local density as the number of points in the neighborhood -of a single point. The Jarvis-Patrick algorithm uses the number of -common neighbors shared by two points among the :math:`k` nearest neighbors. -As these approaches each provide a different notion of how density is -estimated from point samples, they can be used complementarily. Their -relative suitability for a classification problem depends on the nature -of the clustered data. Common-nearest-neighbors clustering (as -density-based clustering in general) has the following advantages over -other clustering techniques: - - * The cluster result is deterministic. The same set of cluster - parameters always leads to the same classification for a data set. - A different ordering of the data set leads to a different ordering - of the cluster assignment, but does not change the assignment - qualitatively. - * Little prior knowledge about the data is required, e.g. the number - of resulting clusters does not need to be known beforehand (although - cluster parameters need to be tuned to obtain a desired result). - * Identified clusters are not restricted in their shape or size. - * Points can be considered noise (outliers) if they do not fullfil - the density criterion. - -The common-nearest-neighbors algorithm tests the density criterion for -pairs of neighbors (do they have at least ``min_samples`` points in the -intersection of their neighborhoods at a radius ``eps``). Two points that -fullfil this criterion are directly part of the same dense data region, -i.e. they are *density reachable*. A *density connected* network of -density reachable points (a connected component if density reachability -is viewed as a graph structure) constitutes a separated dense region and -therefore a cluster. Note, that for example in contrast to -:class:`DBSCAN ` there is no differentiation in -*core* (dense points) and *edge* points (points that are not dense -themselves but neighbors of dense points). The assignment of points on -the cluster rims to a cluster is possible, but can be ambiguous. The -cluster result is returned as a 1D container of labels, i.e. a sequence -of integers (zero-based) of length :math:`n` for a data set of :math:`n` -points, -denoting the assignment of points to a specific cluster. Noise is -labeled with ``-1``. Valid clusters have at least two members. The -clusters are not sorted by cluster member count. In same cases the -algorithm tends to identify small clusters that can be filtered out -manually. - -.. topic:: Examples: - - * :ref:`examples/cluster/plot_commonnn.py ` - Basic usage of the - :class:`CommonNNClustering ` - * :ref:`examples/cluster/plot_commonnn_data_sets.py ` - Common-nearest-neighbors clustering of toy data sets - -.. topic:: Implementation: - - The present implementation of the common-nearest-neighbors algorithm in - :class:`CommonNNClustering ` - shares some - commonalities with the current - scikit-learn implementation of :class:`DBSCAN `. - It computes neighborhoods from points in bulk with - :class:`NearestNeighbors ` before - the actual clustering. Consequently, to store the neighborhoods - it requires memory on the order of - :math:`O(n ⋅ n_n)` for :math:`n` points in the data set where :math:`n_n` - is the - average number of neighbors (which is proportional to ``eps``), that is at - worst :math:`O(n^2)`. 
Depending on the input structure (dense or sparse - points or similarity matrix) the additional memory demand varies. - The clustering itself follows a - breadth-first-search scheme, checking the density criterion at every - node expansion. The linear time complexity is roughly proportional to - the number of data points :math:`n`, the total number of neighbors :math:`N` - and the value of ``min_samples``. For density-based clustering - schemes with lower memory demand, also consider: - - * :class:`OPTICS ` – Density-based clustering - related to DBSCAN using a ``eps`` value range. - * `cnnclustering `_ – A - different implementation of common-nearest-neighbors clustering. - -.. topic:: Notes: - - * :class:`DBSCAN ` provides an option to - specify data point weights with ``sample_weights``. This feature is - experimentally at the moment for :class:`CommonNNClustering` as - weights are not well defined for checking the common-nearest-neighbor - density criterion. It should not be used in production, yet. - -.. topic:: References: - - * B. Keller, X. Daura, W. F. van Gunsteren "Comparing Geometric and - Kinetic Cluster Algorithms for Molecular Simulation Data" J. Chem. - Phys., 2010, 132, 074110. - - * O. Lemke, B.G. Keller "Density-based Cluster Algorithms for the - Identification of Core Sets" J. Chem. Phys., 2016, 145, 164104. - - * O. Lemke, B.G. Keller "Common nearest neighbor clustering - a - benchmark" Algorithms, 2018, 11, 19. diff --git a/examples/cluster/README.txt b/examples/cluster/README.txt new file mode 100644 index 00000000..ad0ebf6a --- /dev/null +++ b/examples/cluster/README.txt @@ -0,0 +1,6 @@ +.. _cluster_examples: + +Cluster +======= + +Examples concerning the :mod:`sklearn_extra.cluster` module. diff --git a/examples/plot_clustering.py b/examples/cluster/plot_clustering.py similarity index 99% rename from examples/plot_clustering.py rename to examples/cluster/plot_clustering.py index a5621fe9..dd479b9e 100644 --- a/examples/plot_clustering.py +++ b/examples/cluster/plot_clustering.py @@ -129,5 +129,3 @@ size=20, ) plot_num += 1 - - plt.show() diff --git a/examples/plot_commonnn_data_sets.py b/examples/cluster/plot_commonnn_data_sets.py similarity index 100% rename from examples/plot_commonnn_data_sets.py rename to examples/cluster/plot_commonnn_data_sets.py diff --git a/examples/plot_kmedoids_digits.py b/examples/cluster/plot_kmedoids_digits.py similarity index 100% rename from examples/plot_kmedoids_digits.py rename to examples/cluster/plot_kmedoids_digits.py diff --git a/examples/kernel_approximation/README.txt b/examples/kernel_approximation/README.txt new file mode 100644 index 00000000..5ea04362 --- /dev/null +++ b/examples/kernel_approximation/README.txt @@ -0,0 +1,7 @@ +.. _kernel_approximation_examples: + +Kernel approximation +==================== + +Examples concerning the :mod:`sklearn_extra.kernel_approximation` +module.
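For reference, a minimal usage sketch of the :class:`KMedoids` estimator documented in the new ``doc/modules/cluster.rst`` above. It relies only on what this patch itself shows (the ``n_clusters``, ``metric`` and ``method`` parameters and the ``labels_`` / ``cluster_centers_`` attributes); any other defaults are assumptions and may differ from the released estimator::

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn_extra.cluster import KMedoids

    # Three well-separated blobs, mirroring the plot_kmedoids.py example added below.
    X, _ = make_blobs(
        n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]], cluster_std=0.4, random_state=0
    )

    # 'alternate' is the faster solver, 'pam' the more accurate one (see the
    # _k_medoids.py docstring change at the end of this diff).
    for method in ("alternate", "pam"):
        model = KMedoids(n_clusters=3, metric="manhattan", method=method).fit(X)
        # Medoids are actual samples, so every row of cluster_centers_ is a row of X.
        print(method, model.cluster_centers_, np.bincount(model.labels_))

Because the medoids are data points, arbitrary metrics (here Manhattan) can be used, which is the main practical difference from :class:`KMeans` highlighted in the documentation.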
diff --git a/examples/plot_digits_classification_fastfood.py b/examples/kernel_approximation/plot_digits_classification_fastfood.py similarity index 100% rename from examples/plot_digits_classification_fastfood.py rename to examples/kernel_approximation/plot_digits_classification_fastfood.py diff --git a/examples/plot_kernel_approximation.py b/examples/kernel_approximation/plot_kernel_approximation.py similarity index 100% rename from examples/plot_kernel_approximation.py rename to examples/kernel_approximation/plot_kernel_approximation.py diff --git a/examples/plot_kmedoids.py b/examples/plot_kmedoids.py new file mode 100644 index 00000000..fa7cadd3 --- /dev/null +++ b/examples/plot_kmedoids.py @@ -0,0 +1,63 @@ +# -*- coding: utf-8 -*- +""" +============= +KMedoids Demo +============= + +KMedoids clustering of data points. The goal is to find medoids that minimize the +sum of distances from each point to its closest medoid. A medoid is a point of the dataset. +Read more in the :ref:`User Guide +<k_medoids>`. + +""" +import matplotlib.pyplot as plt +import numpy as np + +from sklearn_extra.cluster import KMedoids +from sklearn.datasets import make_blobs + + +print(__doc__) + +# ############################################################################# +# Generate sample data +centers = [[1, 1], [-1, -1], [1, -1]] +X, labels_true = make_blobs( + n_samples=750, centers=centers, cluster_std=0.4, random_state=0 +) + +# ############################################################################# +# Compute KMedoids clustering +cobj = KMedoids(n_clusters=3).fit(X) +labels = cobj.labels_ + +# ############################################################################# +# Plot results +unique_labels = set(labels) +colors = [ + plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels)) +] +for k, col in zip(unique_labels, colors): + + class_member_mask = labels == k + + xy = X[class_member_mask] + plt.plot( + xy[:, 0], + xy[:, 1], + "o", + markerfacecolor=tuple(col), + markeredgecolor="k", + markersize=6, + ) + +plt.plot( + cobj.cluster_centers_[:, 0], + cobj.cluster_centers_[:, 1], + "o", + markerfacecolor="cyan", + markeredgecolor="k", + markersize=6, +) + +plt.title("KMedoids clustering. Medoids are represented in cyan.") diff --git a/examples/plot_robust_classification_toy.py b/examples/plot_robust_classification_toy.py index 33226306..17d54454 100644 --- a/examples/plot_robust_classification_toy.py +++ b/examples/plot_robust_classification_toy.py @@ -44,7 +44,7 @@ RobustWeightedClassifier( max_iter=100, weighting="mom", - k=6, + k=8, random_state=rng, ), # The parameter k is set larger the number of outliers diff --git a/examples/robust/README.txt b/examples/robust/README.txt new file mode 100644 index 00000000..526c9400 --- /dev/null +++ b/examples/robust/README.txt @@ -0,0 +1,6 @@ +.. _robust_examples: + +Robust +====== + +Examples concerning the :mod:`sklearn_extra.robust` module. diff --git a/examples/plot_robust_classification_diabete.py b/examples/robust/plot_robust_classification_diabete.py similarity index 99% rename from examples/plot_robust_classification_diabete.py rename to examples/robust/plot_robust_classification_diabete.py index 6f87e5b9..da649df6 100644 --- a/examples/plot_robust_classification_diabete.py +++ b/examples/robust/plot_robust_classification_diabete.py @@ -81,7 +81,5 @@ ) plt.ylabel("AUC") -plt.show() - # Remark : when using accuracy score, the optimal hyperparameters change and # for example the parameter c changes from 1.35 to 10.
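The ``k=6`` → ``k=8`` change above follows the comment already present in ``plot_robust_classification_toy.py``: with ``weighting="mom"``, ``k`` should be set larger than the number of outliers. Below is a small, hypothetical sketch of that rule of thumb, using only the constructor arguments that appear in this diff (``max_iter``, ``weighting``, ``k``, ``random_state``) and assuming the estimator is importable from ``sklearn_extra.robust``::

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn_extra.robust import RobustWeightedClassifier

    rng = np.random.RandomState(42)
    X, y = make_blobs(n_samples=100, centers=[[-1, -1], [1, 1]], random_state=rng)

    # Corrupt a handful of labels to act as outliers.
    n_outliers = 6
    y[:n_outliers] = 1 - y[:n_outliers]

    # Median-of-means weighting: pick k a bit larger than the expected number
    # of outliers, as the toy example's comment suggests.
    clf = RobustWeightedClassifier(
        max_iter=100, weighting="mom", k=n_outliers + 2, random_state=rng
    )
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))

If the number of corrupted samples grows, ``k`` has to grow with it, which is presumably why the toy example bumps it from 6 to 8.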
diff --git a/examples/plot_robust_regression_california_houses.py b/examples/robust/plot_robust_regression_california_houses.py similarity index 99% rename from examples/plot_robust_regression_california_houses.py rename to examples/robust/plot_robust_regression_california_houses.py index 8e7ba1e0..fe2f9260 100644 --- a/examples/plot_robust_regression_california_houses.py +++ b/examples/robust/plot_robust_regression_california_houses.py @@ -101,5 +101,3 @@ def quadratic_loss(est, X, y, X_test, y_test): axe2.set_title("median of errors") fig.suptitle("Boxplots of the test squared error") - -plt.show() diff --git a/sklearn_extra/cluster/_k_medoids.py b/sklearn_extra/cluster/_k_medoids.py index 18fb987d..25795872 100644 --- a/sklearn_extra/cluster/_k_medoids.py +++ b/sklearn_extra/cluster/_k_medoids.py @@ -39,7 +39,7 @@ class KMedoids(BaseEstimator, ClusterMixin, TransformerMixin): What distance metric to use. See :func:metrics.pairwise_distances method : {'alternate', 'pam'}, default: 'alternate' - Which algorithm to use. + Which algorithm to use. 'alternate' is faster while 'pam' is more accurate. init : {'random', 'heuristic', 'k-medoids++', 'build'}, optional, default: 'build' Specify medoid initialization method. 'random' selects n_clusters