Skip to content

hash(float) is not a very good hash function, causes specific performance problem #704

@jepler

Description

@jepler

The following program constructs three sets. The first contains integers (3,4,5,...). The second contains floating point numbers (3.0, 4.0, 5.0, ...). The third, floating point numbers (3.0, 3.0001, 3.0002, ...).

def make_set(i0, di, n):
    s = set()
    for i in range(n): s.add(i0 + di * i)

def timeit(f, n=10, r=3):
    import time
    best = 1e100
    for i in range(r):
        t0 = time.monotonic()
        for j in range(n): f()
        t1 = time.monotonic()
        dt = t1 - t0
        best = min(dt, best)
    return best

print(timeit(lambda: make_set(3, 1, 100)))
print(timeit(lambda: make_set(3., 1., 100)))
print(timeit(lambda: make_set(3., .001, 100)))

Typical performance in circuitpython on trinket m0:

0.155273
0.236816
2.21582

Because hash(float) is essentially floor(float), all the items in the third set go to the same hash bucket, giving pathological and slow behavior. In this case, 10x slower than a set containing different floats.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions