
DOC fix: The algorithm explained - and implemented - in K-Medoids is not PAM #44


Merged · 4 commits · Mar 29, 2020
17 changes: 8 additions & 9 deletions doc/user_guide.rst
@@ -49,12 +49,11 @@ clusters. This makes it more suitable for smaller datasets in comparison to

**Algorithm description:**
There are several algorithms to compute K-Medoids, though :class:`KMedoids`
-currently only supports Partitioning Around Medoids (PAM). The PAM algorithm
-uses a greedy search, which may fail to find the global optimum. It consists of
-two alternating steps commonly called the
-Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987).
+currently only supports a K-Medoids solver analogous to K-Means. Another
+frequently used approach, Partitioning Around Medoids (PAM), is currently
+not implemented.

-PAM works as follows:
+This version works as follows:

* Initialize: Select ``n_clusters`` from the dataset as the medoids using
a heuristic, random, or k-medoids++ approach (configurable using the ``init`` parameter).
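
For intuition, the alternating solver described above can be sketched directly in NumPy. This is a minimal illustration, not the library's implementation: it assumes a precomputed distance matrix, random initialization, and that no cluster goes empty.

import numpy as np

def alternating_k_medoids(D, n_clusters, max_iter=100, seed=None):
    # D is a precomputed (n_samples, n_samples) distance matrix.
    rng = np.random.default_rng(seed)
    # Initialize: pick n_clusters distinct points at random as medoids.
    medoids = rng.choice(len(D), size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        labels = np.argmin(D[medoids], axis=0)
        # Update step: within each cluster, pick the point that minimizes
        # the summed distance to the other members.
        new_medoids = np.empty_like(medoids)
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break  # converged: the medoid set did not change
        medoids = new_medoids
    labels = np.argmin(D[medoids], axis=0)
    return medoids, labels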
@@ -65,7 +64,7 @@ PAM works as follows:

.. topic:: References:

* "Clustering by Means of Medoids'"
Kaufman, L. and Rousseeuw, P.J.,
Statistical Data Analysis Based on the L1Norm and Related Methods, edited
by Y. Dodge, North-Holland, 405416. 1987
Contributor:

OK let's then add Maranzana (1963) and Park (2009) references here and in the docstring below.

Contributor Author:

But the references should reflect what was actually used and implemented. For example, Park specifies a different initialization strategy. I don't think retrofitting references is the proper way to go. Maybe the ESL book should be cited instead.

Contributor:

It's not about retrofitting; we can cite ESL, but it is not the primary source, and there are barely two pages on K-medoids there. Maranzana (1963) does seem to describe this algorithm with random initialization. The initialization is indeed different in Park (2009), but I would still mention it (I understand that you don't like it :)), as otherwise the iterative step is the same, and it has a more recent bibliography review on the topic. We can add their initialization as an option as well.

Contributor:

> For example, Park specifies a different initialization strategy.

Actually, init="heuristic"

elif self.init == "heuristic":  # Initialization by heuristic
    # Pick K first data points that have the smallest sum distance
    # to every other point. These are the initial medoids.
    medoids = np.argpartition(np.sum(D, axis=1), n_clusters - 1)[:n_clusters]

is not that different from what they do up to a normalization factor I think?
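
For reference, Park and Jun (2009) score each candidate medoid j with the normalized criterion v_j = sum_i d(i, j) / sum_l d(i, l). A rough sketch of that initialization (the function name is illustrative, not part of this codebase):

import numpy as np

def park_jun_init(D, n_clusters):
    # Normalize each distance d(i, j) by the total distance from point i
    # to all points, then sum over i to score candidate medoid j.
    v = (D / D.sum(axis=1, keepdims=True)).sum(axis=0)
    # The n_clusters points with the smallest scores become the medoids.
    return np.argpartition(v, n_clusters - 1)[:n_clusters]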

Contributor Author:

That is quite similar, except for the missing normalization term. From my intuition this will work very poorly, because most likely these medoids will be close to each other at the center of the data set; so none of them will be a good medoid. If you benchmark this, it will likely work worse than uniform random.
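
A quick way to test that claim, assuming the KMedoids estimator from this repository (the dataset and parameters below are arbitrary, so treat the numbers as indicative only):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# Compare final inertia (sum of distances to the closest medoid);
# lower is better. init="heuristic" is deterministic, so only
# init="random" actually varies with the seed.
for init in ("random", "heuristic"):
    scores = [
        KMedoids(n_clusters=5, init=init, random_state=seed).fit(X).inertia_
        for seed in range(10)
    ]
    print(init, np.mean(scores))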

+    * Maranzana, F.E., 1963. On the location of supply points to minimize
+      transportation costs. IBM Systems Journal, 2(2), pp.129-135.
+    * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
+      clustering. Expert Systems with Applications, 36(2), pp.3336-3341.
7 changes: 4 additions & 3 deletions sklearn_extra/cluster/_k_medoids.py
@@ -90,9 +90,10 @@ class KMedoids(BaseEstimator, ClusterMixin, TransformerMixin):

References
----------
-    Kaufman, L. and Rousseeuw, P.J., Statistical Data Analysis Based on
-    the L1–Norm and Related Methods, edited by Y. Dodge, North-Holland,
-    405–416. 1987
+    Maranzana, F.E., 1963. On the location of supply points to minimize
+    transportation costs. IBM Systems Journal, 2(2), pp.129-135.
+    Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
+    clustering. Expert Systems with Applications, 36(2), pp.3336-3341.

See also
--------