Closed
The adjusted Rand index returned by randindex is incorrect when n is large (on the order of 100,000 observations). Here is an example comparing it with an R implementation.
using Random
using Clustering
using RCall
Random.seed!(123);
n = 100_000;
a = rand(1:3,n);
b = rand(1:3,n);
randindex(a,b)[1]
only(R"library(mclust); adjustedRandIndex($a,$b)")
which gives (Clustering.jl first, then mclust)
0.2933142400616828
-1.5731751561282826e-6
Since a and b are independent random labelings, the true adjusted Rand index should be close to 0, as the R result is. For me, the discrepancy starts to appear around n = 83,000.
As a Julia comparison, my own implementation of the adjusted Rand index gives the same result as in R:
function ari(a,b)
    table = counts(a,b)
    acounts = sum(table,dims=1)
    bcounts = sum(table,dims=2)
    score = sum([x*(x-1)/2 for x in table])
    asum = sum([x*(x-1)/2 for x in acounts])
    bsum = sum([x*(x-1)/2 for x in bcounts])
    expected = asum*bsum/binomial(sum(table),2)
    total = (asum + bsum)/2
    if total == expected
        return 0
    else
        return (score-expected)/(total-expected)
    end
end;
ari(a,b)
-1.5731751561282826e-6
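For a cross-check that depends on neither Clustering.jl nor R, the same formula can be sketched in plain Julia with the pair counts held in Int128 and converted to Float64 before any multiplication. The names `ari_wide` and `pairs2` are hypothetical and used only for illustration. The sketch follows the `ari` function above and is not the method used inside `randindex`.

```julia
# Dependency-free adjusted Rand index (illustrative sketch; names are hypothetical).

# Number of unordered pairs among x items, computed in Int128.
pairs2(x::Integer) = (y = Int128(x); y * (y - 1) ÷ 2)

function ari_wide(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) || throw(ArgumentError("label vectors must have equal length"))
    length(a) >= 2 || throw(ArgumentError("need at least two observations"))

    # Contingency table keyed by (label in a, label in b).
    table = Dict{Tuple{Int,Int},Int}()
    for (x, y) in zip(a, b)
        table[(x, y)] = get(table, (x, y), 0) + 1
    end

    # Marginal counts of each labeling.
    acounts = Dict{Int,Int}()
    bcounts = Dict{Int,Int}()
    for ((x, y), c) in table
        acounts[x] = get(acounts, x, 0) + c
        bcounts[y] = get(bcounts, y, 0) + c
    end

    score = sum(pairs2, values(table); init = Int128(0))
    asum = sum(pairs2, values(acounts); init = Int128(0))
    bsum = sum(pairs2, values(bcounts); init = Int128(0))

    # Convert to Float64 before multiplying, so no integer product is formed.
    expected = Float64(asum) * Float64(bsum) / Float64(pairs2(length(a)))
    total = (Float64(asum) + Float64(bsum)) / 2

    # Same convention as `ari` above: return 0 when the denominator vanishes.
    return total == expected ? 0.0 : (Float64(score) - expected) / (total - expected)
end
```

Calling `ari_wide(a, b)` on the vectors above should again give a value close to 0. As a point of reference on magnitudes: with n = 100_000 and three balanced clusters, a sum of squared marginal counts is about n^2/3 ≈ 3.3e9, so a product of two such sums is about 1.1e19, which exceeds `typemax(Int64)` ≈ 9.2e18. Whether `randindex` forms such a product in Int64 is an assumption that has not been checked against its source.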
I use Clustering.jl 0.14.2, Julia 1.6.2.