Description
KMedoids currently pre-computes a full distance matrix with `pairwise_distances`,
resulting in large memory usage and making it unsuitable for datasets with more than 20-50k samples.
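To put the 20-50k figure in perspective, here is a rough back-of-the-envelope sketch. It assumes a dense n x n matrix with 8-byte (float64) entries by default; `distance_matrix_gib` is a made-up helper for illustration, not part of any library.

```python
def distance_matrix_gib(n_samples, itemsize=8):
    """Approximate size in GiB of a dense n_samples x n_samples matrix.

    itemsize is the number of bytes per entry (8 for float64, 4 for float32).
    """
    return n_samples ** 2 * itemsize / 2 ** 30


for n in (20_000, 50_000):
    f64 = distance_matrix_gib(n)
    f32 = distance_matrix_gib(n, itemsize=4)
    print(f"n={n}: float64 ~{f64:.1f} GiB, float32 ~{f32:.1f} GiB")
# n=20000: float64 ~3.0 GiB, float32 ~1.5 GiB
# n=50000: float64 ~18.6 GiB, float32 ~9.3 GiB
```

So a 32-bit matrix would halve the footprint, but the quadratic growth remains.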
To improve the situation somewhat, the following approaches could be possible:
- use `pairwise_distances_chunked`
- make sure that for `float32` input the distance matrix is also 32 bit
- investigate re-computing distances in each iteration (Implementing KMedoids in scikit-learn-extra #12 (comment)). This would reduce the memory requirements at the cost of additional compute time. I'm not sure it would be worth it.
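As a rough illustration of how the chunked and re-computation ideas above might combine, here is a minimal sketch built on `sklearn.metrics.pairwise_distances_chunked`. The helper names `assign_to_medoids` and `best_medoid_in_cluster` are hypothetical and not part of KMedoids; this is one possible shape under those assumptions, not a proposed implementation. Distances are re-computed on every call and reduced chunk by chunk, so the full n x n matrix is never held in memory.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked


def assign_to_medoids(X, medoid_idx, metric="euclidean", working_memory=64):
    """Hypothetical helper: nearest medoid per sample, computed in chunks."""
    medoids = X[medoid_idx]

    def reduce_func(D_chunk, start):
        # D_chunk has shape (rows_in_chunk, n_medoids); keep only the
        # nearest medoid and its distance for each row.
        return D_chunk.argmin(axis=1), D_chunk.min(axis=1)

    labels, dists = [], []
    for chunk_labels, chunk_dists in pairwise_distances_chunked(
        X, medoids, reduce_func=reduce_func, metric=metric,
        working_memory=working_memory,  # chunk budget in MiB
    ):
        labels.append(chunk_labels)
        dists.append(chunk_dists)
    return np.concatenate(labels), np.concatenate(dists)


def best_medoid_in_cluster(X, member_idx, metric="euclidean", working_memory=64):
    """Hypothetical helper: the member with the smallest summed distance
    to all other members of the same cluster."""
    members = X[member_idx]
    costs = np.concatenate(list(pairwise_distances_chunked(
        members,
        # Reduce each chunk of rows to per-row sums right away.
        reduce_func=lambda D_chunk, start: D_chunk.sum(axis=1),
        metric=metric, working_memory=working_memory,
    )))
    return member_idx[int(costs.argmin())]
```

Peak memory is then bounded by `working_memory` rather than by n squared, at the cost of re-computing distances in every iteration, which is the trade-off mentioned in the last bullet.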