Adjusted Rand Index inconsistency with Python's sklearn implementation #226

@jindravo

Hello!

I've been working on a model that groups MoGs into clusters and noticed an unexpected drop in performance when all MoGs in a given input belong to a single cluster. After some digging, I found an inconsistency between the ARI implementations in Clustering.jl and sklearn. Specifically, the special case is handled here in Clustering.jl and here in sklearn: Clustering.jl assigns ARI = 0, while sklearn assigns ARI = 1.

Intuitively, when all points belong to one cluster and the model correctly predicts that, one would expect the ARI to be 1. The following code reproduces the scenario:

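# Julia (Clustering.jl)
import Clustering: randindex

# randindex returns a tuple of indices; the first element is the ARI
randindex([1, 1, 1], [1, 1, 1])[1] == 1.0 # false

# Python (sklearn)
from sklearn.metrics import adjusted_rand_score as ari

ari([1, 1, 1], [1, 1, 1]) == 1.0 # True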

Moreover, while debugging, I came across another case where the results differ. It is less important, since it rarely happens in real-life scenarios, but I'll include it here as well. Whenever every cluster contains exactly one point, the special-case condition also sets the ARI to 0, even though one would expect 1 since the clustering is perfect.

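# Julia (Clustering.jl)
import Clustering: randindex

# randindex returns a tuple of indices; the first element is the ARI
randindex([1, 2, 3], [1, 2, 3])[1] == 1.0 # false

# Python (sklearn)
from sklearn.metrics import adjusted_rand_score as ari

ari([1, 2, 3], [1, 2, 3]) == 1.0 # True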
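To make it clearer why both cases hit a special case at all, here is a minimal pure-Python sketch of the standard pair-counting ARI formula. It is not the code of either library; the function name `adjusted_rand_index` and the `degenerate_value` parameter are hypothetical and only serve to illustrate the two conventions. In both examples above the numerator and the denominator of the formula are zero, so the result is 0/0 and has to be defined by convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b, degenerate_value=1.0):
    """Pair-counting ARI; `degenerate_value` is returned in the 0/0 case.

    Illustrative sketch only, not the Clustering.jl or sklearn code.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_a), 2)

    # Number of point pairs placed together by both labelings, by a, and by b.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    if total_pairs == 0:
        # Fewer than two points: no pairs to compare.
        return degenerate_value

    expected = pairs_a * pairs_b / total_pairs  # expected index under chance
    max_index = (pairs_a + pairs_b) / 2         # upper bound of the index
    denominator = max_index - expected

    if denominator == 0:
        # Happens for a single cluster in both labelings,
        # and for all-singleton clusters in both labelings.
        return degenerate_value

    return (pairs_joint - expected) / denominator


print(adjusted_rand_index([1, 1, 1], [1, 1, 1]))                        # convention: 1
print(adjusted_rand_index([1, 1, 1], [1, 1, 1], degenerate_value=0.0))  # convention: 0
print(adjusted_rand_index([1, 2, 3], [1, 2, 3]))
```

With `degenerate_value=1.0` the sketch matches the sklearn results reported above, and with `degenerate_value=0.0` it matches the Clustering.jl results; non-degenerate inputs are unaffected by the choice.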

I'm assuming this is not the expected behaviour, hence this issue. If it is expected, I'd be happy to hear an explanation, in the hope that it changes my intuition about ARI.
