Adjusted Rand index inconsistency for large n #225

@ynschy

The adjusted Rand index returned by randindex is unexpectedly wrong when n is large (on the order of 100,000). Here is an example, with a comparison to an R implementation.

using Random
using Clustering
using RCall
Random.seed!(123);

n = 100_000;
a = rand(1:3,n);
b = rand(1:3,n);

randindex(a,b)[1]

only(R"library(mclust); adjustedRandIndex($a,$b)")

which gives the following for Clustering.jl and R, respectively:

0.2933142400616828

-1.5731751561282826e-6

Since a and b are independent random labelings, the true adjusted Rand index should be close to 0, as the R result is. For me, the discrepancy starts to appear around n = 83,000.

As a Julia comparison, my own implementation of the adjusted Rand index gives the same result as in R:

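function ari(a,b)
    # Contingency table: table[i,j] = number of points with a == i and b == j
    table = counts(a,b)
    # Marginal cluster sizes (column sums and row sums of the table)
    acounts = sum(table,dims=1)
    bcounts = sum(table,dims=2)

    # Pair counts x*(x-1)/2; `/` returns Float64, so these sums do not overflow
    score = sum([x*(x-1)/2 for x in table])
    asum = sum([x*(x-1)/2 for x in acounts])
    bsum = sum([x*(x-1)/2 for x in bcounts])
    # Expected score under random labeling, and the maximum attainable score
    expected = asum*bsum/binomial(sum(table),2)
    total = (asum + bsum)/2

    if total == expected
        # Degenerate case: avoid 0/0
        return 0.0
    else
        return (score-expected)/(total-expected)
    end
end;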
ari(a,b)
-1.5731751561282826e-6
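
For reference, the quantity computed by ari above can be written out as follows. This is only a restatement of that function in formula form; the symbols n_{ij} (entries of the contingency table), r_i and c_j (its row and column sums), and S, A, B, E are notation introduced here for illustration, not names taken from Clustering.jl or mclust.

```latex
\[
S = \sum_{i,j} \binom{n_{ij}}{2}, \qquad
A = \sum_{j} \binom{c_j}{2}, \qquad
B = \sum_{i} \binom{r_i}{2}, \qquad
E = \frac{A\,B}{\binom{n}{2}},
\]
\[
\mathrm{ARI} = \frac{S - E}{\tfrac{1}{2}(A + B) - E}.
\]
```

Here S corresponds to score, A and B to asum and bsum, E to expected, and (A + B)/2 to total. For two independent random labelings S is expected to be close to E, which is why a value near 0 is the expected outcome in the example above.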

I am using Clustering.jl 0.14.2 with Julia 1.6.2.