Implementing KMedoids in scikit-learn-extra #12
Conversation
Hmm. I could've sworn that I'd written tests to cover …

This is my first time working with a code coverage tool. I thought that my most recent commit should've fixed the coverage issues involving the …
Thanks @zdog234 ! I'll try to review this in detail soon.
Could you also please add the section from the user guide and the example from the original PR?
`sklearn_extra/cluster/k_medoids_.py` (outdated)

```python
        return medoids

    def _kpp_init(self, D, n_clusters, random_state_, n_local_trials=None):
```
It's strange indeed that this function is reported as not being run in coverage while it is explicitly exercised in `test_kmedoids_pp`. I'll try to have a closer look soon.
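For context, here is a minimal pure-Python sketch of the k-means++-style seeding that `_kpp_init` performs on a precomputed distance matrix (a toy illustration, not the library code; `kpp_seed` and the example points are made up): each new medoid is drawn with probability proportional to its squared distance to the nearest medoid chosen so far.

```python
import random

def kpp_seed(D, n_clusters, seed=0):
    """Toy k-means++-style seeding on a precomputed distance matrix D.

    Each new medoid index is sampled with probability proportional to
    the squared distance to the closest already-chosen medoid, so points
    far from existing medoids are favoured.
    """
    rng = random.Random(seed)
    n = len(D)
    medoids = [rng.randrange(n)]
    while len(medoids) < n_clusters:
        weights = [min(D[i][m] for m in medoids) ** 2 for i in range(n)]
        # Already-chosen medoids have weight 0 and are never re-drawn.
        medoids.append(rng.choices(range(n), weights=weights, k=1)[0])
    return medoids

# Two well-separated groups of points on a line: {0, 1} and {10, 11}.
pts = [0.0, 1.0, 10.0, 11.0]
D = [[abs(a - b) for b in pts] for a in pts]
seeds = kpp_seed(D, 2)
```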
I tried to add some of the documentation, but I don't understand how to add MathJax/mathjs, so all of the math is broken when I try to build. Has a decision been made for this project as to which one to use? And how do I go about adding it? Is there something I can just …

EDIT: I added the documentation; I just don't know if the math will build properly. This is my first time working with …
See scikit-learn's `doc/conf.py`, where `sphinx.ext.mathjax` is included and `mathjax_path` is specified.
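A minimal sketch of the relevant `conf.py` settings, assuming a setup like the one rth points to (the extension list and CDN URL here are illustrative assumptions, not copied from scikit-learn's actual file):

```python
# doc/conf.py (sketch) -- enable MathJax rendering for :math: roles
# and math directives in the Sphinx docs.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.mathjax",  # renders LaTeX math in HTML output
]

# Serve MathJax from a CDN instead of bundling it with the built docs.
# This URL is an example; pick whichever CDN/version the project prefers.
mathjax_path = (
    "https://cdn.jsdelivr.net/npm/mathjax@2/MathJax.js?config=TeX-AMS_HTML"
)
```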
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. A few minor comments:

- could you please add the `KMedoids` estimator to https://github.com/scikit-learn-contrib/scikit-learn-extra/blob/master/sklearn_extra/tests/test_common.py, to the list of estimators on which `check_estimator` is run
- please also add the example from the original PR under `examples/`
- if you could rename the file `k_medoid_.py` to `_k_medoids.py`, so that it becomes a private export when not used as `from sklearn_extra.cluster import KMedoids`, that would be great.
The docs are rendered here: https://25-173284824-gh.circle-artifacts.com/0/doc/user_guide.html. No need to worry about the formatting too much; we can fix that later.
Testing this code fails when the data has more than two dimensions.

Result: …
This behaviour is expected. KMedoids represents each cluster by the training sample that is closest to its middle, not by an arbitrary coordinate in its middle.
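To make that concrete, a toy sketch (not the library code; `medoid` and the sample points are made up for illustration): the medoid is the cluster member minimizing the summed distance to all other members, so it is always an actual data point, unlike a KMeans centroid.

```python
import math

def medoid(points):
    """Return the member of `points` with the smallest summed distance
    to all other members -- the sample that represents the cluster."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
center = medoid(cluster)
# The KMeans-style centroid, for comparison: an arbitrary coordinate.
mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
```

Here `center` is one of the four input points, while `mean` is `(1.5, 1.5)`, a coordinate that appears nowhere in the data.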
OK, got it.
Just got back from a vacation where I intentionally didn't bring my computer :) I'll try to get to @rth's comments this week/weekend.
@zdog234 Great!
@zdog234 Because I want to know the mapping between the centers and the labels.
It doesn't work on large data; the labels all print as '0': …
Thanks @rth. When you get a chance, could you let me know if the changes I've made address your concerns? Best 😊
A few more comments @zdog234, otherwise (after a light review) LGTM.

We adopted black for code style recently. Please run `black sklearn_extra/ examples/` to fix the linter CI.
I would rather we merged this and opened follow up issues than keep this PR open until everything is perfect.
Maybe @jeremiedbb who worked on KMeans lately would also have some comments.
Later it would be nice to add an example on some dataset where KMedoids does better than the existing scikit-learn clustering algorithms, as discussed in scikit-learn/scikit-learn#11099 (comment).
"than the number of samples %d." | ||
% (self.n_clusters, X.shape[0])) | ||
|
||
D = pairwise_distances(X, metric=self.metric) |
So the scaling is `O(N**2)` because of this distance calculation, right? For the medoid assignment, wouldn't it be more efficient to re-compute the nearest neighbours between the medoids and the samples at each iteration? That would only be `O(n_clusters*N)` at each iteration, and assuming `n_clusters*max_iter < n_samples` it should still be faster?

I suppose because there are typically few clusters, constructing a `BallTree` instead of brute-force nearest neighbours is not worth it?

Not asking to make this change now, just wondering if we should open an issue about this once it's merged.
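The trade-off above can be sketched numerically (hypothetical sizes and medoid indices, purely for illustration): a full pairwise matrix costs `N*N` distance evaluations once, while recomputing only medoid-to-sample distances costs `k*N` per iteration.

```python
import math

# Hypothetical problem sizes for the back-of-the-envelope comparison.
N, k, max_iter = 200, 3, 10
samples = [(float(i % 17), float(i % 7)) for i in range(N)]
medoids = [0, 70, 140]  # indices of the current (hypothetical) medoids

# Strategy 1: precompute the full pairwise matrix once.
full_evals = N * N

# Strategy 2: per iteration, compute only medoid-to-sample distances
# and assign each sample to its nearest medoid.
labels = [
    min(range(k), key=lambda j: math.dist(s, samples[medoids[j]]))
    for s in samples
]
per_iter_evals = k * N
```

With these numbers, strategy 2 costs `k * N * max_iter = 6000` evaluations over all iterations versus `40000` for the full matrix, matching the `n_clusters*max_iter < n_samples` condition in the comment.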
In the above comment I forgot that the distance matrix is also used in `_update_medoid_idxs_in_place`: there, recomputing distances would be `O(n_clusters*(n_samples/n_clusters)**2)`, so it would indeed probably be slower, but it might still be interesting since it wouldn't require storing the full distance matrix.

Anyway, we might want to replace `pairwise_distances` with `pairwise_distances_chunked` to reduce memory usage in the current implementation.
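A quick sketch of the chunked approach (toy data; the row-sum reduction is just an example of reducing each chunk as it arrives): `pairwise_distances_chunked` yields row-slices of the distance matrix as a generator, so the full `n_samples x n_samples` array never has to exist in memory at once.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(500, 3)

# Reduce each chunk of rows as soon as it is produced (here: per-row
# distance sums), instead of materialising the whole 500 x 500 matrix.
row_sums = np.concatenate(
    [chunk.sum(axis=1) for chunk in pairwise_distances_chunked(X)]
)
```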
That's a really interesting idea. I can try to take a look at `pairwise_distances_chunked` for a future PR 🙂
```python
        Parameters
        ----------
        distances : {array-like, sparse matrix}, shape=(n_samples, n_clusters)
```
I guess this is always a dense matrix?
I'm not sure if there are tests for this functionality, but there is a line that makes me think that sparse arrays are accepted (`X = check_array(X, accept_sparse=["csr", "csc"])`).

mentioning `k_means_._k_init` copypasta
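A small sketch of why that line suggests sparse support (toy matrix, for illustration): `check_array` with `accept_sparse` validates CSR/CSC input and passes it through without densifying it.

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_array

X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# With accept_sparse listing "csr"/"csc", the input is validated
# but stays sparse rather than being converted to a dense array.
X_checked = check_array(X, accept_sparse=["csr", "csc"])
```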
@zdog234 yes, it runs successfully when I set 'k-medoids++' and the 'manhattan' distance, but the order of the centers is not the same as in my input data. I wish the output order matched my input order. Original input centers: …

Output centers with 'manhattan': …

Output centers with 'euclidean': …
Thanks @zdog234 ! LGTM.
This PR has been open since April and I don't think there is a point in waiting for further feedback. Merging. We should rather open follow-up issues for remaining problems or improvements.
@zdog234 Would you mind opening a separate issue with a reproducible example and the expected/obtained result? Thanks!
BTW, I added an empty …
Seeing this finally merged is very comforting!! Thanks everyone for your work.
Based on the recommendation of a few people, I'm porting the KMedoids implementation from scikit-learn/scikit-learn#11099.
I think I'm missing the documentation atm, but I'll have to take a look at that later (I don't really have any experience with reStructuredText).