Skip to content

Conversation

@sjakobi
Copy link
Member

@sjakobi sjakobi commented Nov 2, 2025

...and define differenceWith via differenceWithKey.

Closes #389, closes #364.

...and define `differenceWith` via `differenceWithKey`

Closes #389.
This makes the overlapping case significantly faster.
@sjakobi sjakobi force-pushed the sjakobi/issue389-dWK branch from 5539624 to 261ba6d Compare November 5, 2025 21:17
@sjakobi sjakobi marked this pull request as ready for review November 5, 2025 21:25
differenceWith f = differenceWithKey (const f)
{-# INLINE differenceWith #-}

-- | \(O(n \log m)\) Difference with a combining function. When two equal keys are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are m and n here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n is the size of the first map. m is the size of the second map. This is a convention this package uses for many functions. I suspect it was adopted from containers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but I don't think that's necessarily what this implementation does; it was left unchanged from the old one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obviously wrong to me at least. If the first map is small, the second is a relatively large superset, and lookup[Cont] takes log(m), we still do n lookups in the larger map. To be fair, we don't start these lookups at the root, so maybe O(n log (m/n)) would be more accurate?!

IMHO these log(size)s are not very useful anyways, since on 64-bit systems we have a maximum tree height of 13, and on 32-bit systems the maximum tree height is 8; and you can still have a map with two entries and full tree height…

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bounds are given assuming sufficiently uniform hashing, but that's not at all the case for important instances like Int. It's ... a problem. I can't say if n log (m/n) is accurate or not, but it should be something symmetrical!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take that back. Maybe not symmetrical. But ... I dunno...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not symmetrical actually. I have no idea.

go_differenceWithKey !_s Empty _tB = Empty
go_differenceWithKey _s a Empty = a
go_differenceWithKey s a@(Leaf hA (L kA vA)) b
= lookupCont
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need lookupCont for compatibility, or can we commit to unboxed sums and make life easier?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that lookupCont is currently the only lookup version that takes a Shift argument and can therefore be used at different levels of the tree.

#410 proposes getting rid of lookupCont, although I personally don't find it so bad.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, a version of lookupRecordCollision# that takes a shift should be able to reduce code size, compared to lookupCont, because we only need to compile it once for each key type. This package has a tradition of aggressively inlining everything to ensure specialization really happens, but I would love to imagine that we might not need to be quite so heavy-handed with modern GHC.

Just v | v `ptrEq` vA -> a
| otherwise -> Leaf hA (L kA v))
hA kA s b
go_differenceWithKey _s a@(Collision hA aryA) (Leaf hB (L kB vB))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally assume that collisions are rare, And that multiple collisions are even more so. It might make sense to defer to basic operations when one of the maps is a collision bucket, rather than trying to do something fancy. That might require you to lift some go functions to the top level. That same sort of lifting might also give a better way to talk about leaves here.

Copy link
Member Author

@sjakobi sjakobi Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I agree, but could you make a concrete suggestion on how to handle this case?

IMHO the use of lookupInArrayCont is already quite clear.

I guess we could use a new function like

updateCollisionWithKey :: (k -> v -> Maybe v) -> Hash -> k -> Array (Leaf k v) -> HashMap k v -> HashMap k v

But I'm not yet convinced that it would make the code much clearer.

In general, I feel that our collection of functions for operating on Collision arrays is pretty awkward, especially the names (e.g. updateOrSnocWithKey). (#437 is related)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Provide a differenceWithKey function difference and differenceWith could be much faster

3 participants