Description
Hello!
I've been working on a model that clusters mixtures of Gaussians (MoGs) and noticed a strange drop in performance when all MoGs in a given input belong to a single cluster. After some sleuthing, I found an inconsistency between the ARI implementations in Clustering.jl and sklearn. Specifically, the degenerate case is handled here in Clustering.jl and here in sklearn: Clustering.jl assigns ARI = 0, while sklearn assigns ARI = 1.
Intuitively, when all points belong to one cluster and the model correctly predicts that, one would expect ARI to be 1. Here follows the code to reproduce the scenario:
```julia
import Clustering: randindex
# randindex returns a tuple; the adjusted Rand index (ARI) is its first element
randindex([1, 1, 1], [1, 1, 1])[1] == 1.0  # false, the ARI here is 0.0
```

```python
from sklearn.metrics import adjusted_rand_score as ari

ari([1, 1, 1], [1, 1, 1]) == 1.0  # True
```

Moreover, while debugging I came across another case where the results differ. It is less crucial, since it rarely happens in real-life scenarios, but I'll include it here as well: whenever each cluster contains exactly one point, the special-case condition also sets the ARI to 0, even though one would expect 1 since the clustering is perfect.
```julia
import Clustering: randindex
randindex([1, 2, 3], [1, 2, 3])[1] == 1.0  # false, the ARI here is 0.0
```

```python
from sklearn.metrics import adjusted_rand_score as ari

ari([1, 2, 3], [1, 2, 3]) == 1.0  # True
```

I'm assuming this is not the expected behaviour, hence this issue. If it is expected, I'd be happy to hear an explanation, in the hope that it changes my intuition about the ARI.
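For context, here is a minimal sketch of how the adjusted Rand index is commonly computed from the contingency table (this is not Clustering.jl's or sklearn's code; the helper name `ari_sketch` is made up, and it assumes integer labels 1..k). It shows that in both examples above the expected index equals the maximum index, so the general formula degenerates to 0/0 and some convention has to be chosen; sklearn resolves that case to 1.0, while Clustering.jl currently returns 0.0.

```julia
# Minimal sketch of the ARI, not library code. Assumes labels are integers 1..k.
function ari_sketch(a::Vector{Int}, b::Vector{Int})
    C = zeros(Int, maximum(a), maximum(b))        # contingency table
    for (i, j) in zip(a, b)
        C[i, j] += 1
    end
    comb2(m) = m * (m - 1) ÷ 2                    # number of pairs, "m choose 2"
    index    = sum(comb2, C)                      # pairs grouped together in both
    sum_a    = sum(comb2, vec(sum(C, dims=2)))    # pairs grouped together in a
    sum_b    = sum(comb2, vec(sum(C, dims=1)))    # pairs grouped together in b
    expected = sum_a * sum_b / comb2(length(a))   # expected index under chance
    maxindex = (sum_a + sum_b) / 2
    # In both examples from this issue (all points in one cluster, or every point
    # its own cluster), maxindex == expected, so the general formula below is 0/0.
    # sklearn resolves this degenerate case to 1.0; Clustering.jl returns 0.0.
    maxindex == expected && return 1.0            # sklearn-style convention
    return (index - expected) / (maxindex - expected)
end

ari_sketch([1, 1, 1], [1, 1, 1])  # 1.0 under the sklearn-style convention
```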