Adjusted Rand Index inconsistency with Python's sklearn implementation

Hello!

I've been working on a model that groups MoGs into clusters and noticed an unexpected drop in performance whenever all MoGs in a given input belong to a single cluster. After some digging, I found an inconsistency between the ARI implementations in `Clustering.jl` and `sklearn`. Specifically, the special case is handled [here in `Clustering.jl`](https://github.com/JuliaStats/Clustering.jl/blob/master/src/randindex.jl#L39) and [here in `sklearn`](https://github.com/scikit-learn/scikit-learn/blob/0d378913b/sklearn/metrics/cluster/_supervised.py#L396): `Clustering.jl` assigns `ARI = 0`, while `sklearn` assigns `ARI = 1`.

Intuitively, when all points belong to one cluster and the model correctly predicts that, one would expect the ARI to be 1. The following code reproduces the scenario:

```jl
import Clustering: randindex

# randindex returns a tuple; its first element is the ARI
randindex([1, 1, 1], [1, 1, 1])[1] == 1.0 # false
```

```py
from sklearn.metrics import adjusted_rand_score as ari

ari([1, 1, 1], [1, 1, 1]) == 1.0 # True
```

While debugging, I came across another case where the results differ. It is less important, since it rarely occurs in real-life scenarios, but I'll include it here as well. Whenever every cluster contains exactly one point, the special-case condition also sets the ARI to 0, even though one would expect 1 because the clustering is perfect.

```jl
import Clustering: randindex

# the first element of the returned tuple is the ARI
randindex([1, 2, 3], [1, 2, 3])[1] == 1.0 # false
```

```py
from sklearn.metrics import adjusted_rand_score as ari

ari([1, 2, 3], [1, 2, 3]) == 1.0 # True
```
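For context, here is a minimal sketch of why both inputs end up in a special case at all. It computes the three terms of the pair-counting ARI formula, `ARI = (index - expected) / (max_index - expected)`, in plain Python. `ari_terms` is a hypothetical helper written for this illustration and is not part of `Clustering.jl` or `sklearn`.

```python
from collections import Counter
from math import comb


def ari_terms(a, b):
    """Return (index, expected, max_index) of the pair-counting ARI formula."""
    n = len(a)
    # Pairs of points that share a cluster in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    # Pairs of points that share a cluster within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return index, expected, max_index


for labels in ([1, 1, 1], [1, 2, 3]):
    index, expected, max_index = ari_terms(labels, labels)
    # Print the numerator and denominator of the ARI for identical labelings.
    print(labels, index - expected, max_index - expected)
```

In both cases the numerator and the denominator are zero, so the formula itself gives 0/0 and the returned value is a convention chosen by each implementation.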

I'm assuming this is not the expected behaviour, hence this issue. If it is intended, I'd be happy to hear an explanation, in the hope that it corrects my intuition about ARI.
