This repository was archived by the owner on Feb 4, 2025. It is now read-only.

Fast hash/equality for Model objects #129

Closed
jakemac53 opened this issue Oct 31, 2024 · 19 comments

Comments

@jakemac53
Contributor

As a part of performance work, I have been looking into generating hash functions for Model objects (see my WIP branch). A basic implementation isn't too bad to write, but it is very slow, taking almost an entire second, cumulatively, computing hashes for each edit in the large JSON benchmark.

My first approach here is to generate "identityHash" functions which do lookups on the node object for each known property, recursively calling "identityHash" on all the nested objects. For example, Interface.identityHash looks like this:

  /// Hash code for comparing instances of this extension type.
  int get identityHash => Object.hash(
        Object.hashAll((node['members'] as Map?)
                ?.cast<String, Member>()
                ?.entries
                .map((entry) =>
                    Object.hash(entry.key, entry.value.identityHash)) ??
            const []),
        (node['thisType'] as NamedTypeDesc?)?.identityHash ?? 0,
        Object.hashAll((node['metadataAnnotations'] as List?)
                ?.cast<MetadataAnnotation>()
                ?.map((entry) => entry.identityHash) ??
            const []),
        (node['properties'] as Properties?)?.identityHash ?? 0,
      );

Ultimately the result of this is that even cached macro phases take an unacceptable amount of time (multiple milliseconds), so we will need to come up with something faster and evaluate exactly what is making this so slow.

@jakemac53
Contributor Author

jakemac53 commented Oct 31, 2024

Some ideas:

  • It is probably faster to iterate the entries - although we would have to do a switch for the keys to cast things to the correct extension types.
  • We could also potentially try just using the deep hashing functionality from package:collection and work at the raw map level (see the sketch after this list).
  • We could potentially build up a hash in the JsonBufferBuilder as we build objects, but unless we also serialize that hash on the wire, it would require reading the entire payload the first time we receive a buffer. We also only want the hash of the Model field, not the entire buffer, which complicates things.
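
For the second idea, a minimal sketch of deep hashing at the raw map level with package:collection might look like this (the function name is illustrative, not from the branch):

import 'package:collection/collection.dart';

const _deepEquality = DeepCollectionEquality();

/// Hash the raw JSON map behind a Model, recursing into nested maps and
/// lists, without going through the typed extension-type accessors.
int deepHashOfNode(Map<String, Object?> node) => _deepEquality.hash(node);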

@davidmorgan
Contributor

Hmmmm when the query finishes on the host, the Model should be the only thing in the buffer, e.g. after the initial query here:

https://github.com/dart-lang/macros/blob/178c59faae7b6d8a4c3542ff36987bb761a54ec9/pkgs/_analyzer_macros/lib/macro_implementation.dart#L110

the host could then hash the whole buffer and send the result. It'd be nice to have it as a field on Model; today we can only do that by having a Map field, since those can be added to after creation.

That's assuming that hashing the buffer as raw bytes turns out to be fast; I hope so :)

Hashing raw bytes will only do what we want if the bytes are sufficiently deterministic; I don't know if that will be the case, but it's a lot easier than hashing objects, so I guess it's worth a try :)
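
A minimal sketch of the raw-bytes idea, assuming the serialized buffer is available as a Uint8List (the function names and the choice of md5 are illustrative):

import 'dart:typed_data';

import 'package:crypto/crypto.dart';

/// Hash the serialized buffer bytes directly; two Models are treated as
/// equal when their buffers produce the same digest.
Digest hashModelBuffer(Uint8List bufferBytes) => md5.convert(bufferBytes);

bool sameSerializedModel(Uint8List a, Uint8List b) =>
    hashModelBuffer(a) == hashModelBuffer(b);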

@jakemac53
Contributor Author

  • We could also potentially try just using the deep hashing functionality from package:collection and work at the raw map level.

Fwiw, I did give this a shot and it was much worse (about 3x slower).

@jakemac53
Contributor Author

  • It is probably faster to iterate the entries - although we would have to do a switch for the keys to cast things to the correct extension types.

I tried this as well and it seems a bit faster but not enough to meaningfully change the result.

@jakemac53
Contributor Author

jakemac53 commented Oct 31, 2024

It looks like just hacking something in to grab the raw bytes and hash those is definitely faster: 150ms or so per edit for the large JSON example, cumulatively spent doing hash computations. I also edited my branch to save the hash instead of the Model objects for the cached responses, and compare against that, so we do about half as many total hashes. I haven't played around with the different hash algorithms much to see which is fastest; I tried sha1, md5, and sha256, and sha1/md5 were close but sha256 was about twice as slow.

Still a lot of time to spend hashing, and I don't love that it is dependent on the specific ordering in which the buffer is built up (technically, my other approach was too, though it would have been relatively trivial to make it order-independent).

Given that this now only ever hashes a given Model object once, I don't believe it is worth trying to compute the hash on the fly and store it in the buffer.
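
For comparing the algorithms, a minimal micro-benchmark sketch over the same buffer bytes (the function name and iteration count are illustrative):

import 'dart:typed_data';

import 'package:crypto/crypto.dart';

/// Time each digest algorithm over the same bytes, printing elapsed time.
void compareDigests(Uint8List bytes, {int iterations = 100}) {
  for (final entry in {'md5': md5, 'sha1': sha1, 'sha256': sha256}.entries) {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < iterations; i++) {
      entry.value.convert(bytes);
    }
    print('${entry.key}: ${stopwatch.elapsedMilliseconds}ms');
  }
}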

@davidmorgan
Contributor

I read a bit about what the google3 build does; there is a public doc about a specialized hash, PSHA2. It's based on SHA256 with tweaks that add parallelization so it can run using SIMD instructions, i.e. it parallelizes within a single CPU. The parallel part of psha256 kicks in at message sizes of 1024 bytes, so it looks like it won't be too hard to hit it.

Comparing the command line versions of md5sum, sha256sum, sha1sum and psha2sum on my machine using 1GB of input: md5 hits 0.54GB/s, sha256 is 1.08GB/s, sha1 is 1.23GB/s and psha256 is 1.56GB/s.

Using the package:crypto example script, AOT-compiled, the numbers are: md5 0.17GB/s, sha1 0.10GB/s, sha256 0.08GB/s.

Based on these numbers, if you were using the package:crypto implementations, it looks like there would be about a 9x speedup with native-performance PSHA2, 7x with native-performance SHA1, and 3x with native-performance md5.

I wonder how much of that native-performance boost we could get with FFI? But actually these are so important that I could see an argument for supporting them directly in the platform, e.g. putting them in dart:io.

@davidmorgan
Contributor

Some bazel discussion (bazelbuild/bazel#22011) mentions blake3, which looks even faster: b3sum hits 7.87GB/s, mostly because it also parallelizes across multiple cores. With 1 thread (b3sum --num-threads 1) it's 1.83GB/s, which is just a bit faster than PSHA2, then it gets considerably faster as I add 1-3 more threads.

@mraleph

mraleph commented Nov 1, 2024

Do you really need to compute hashes to compare things? Can't you just (for example) do a simple byte-by-byte comparison of the incoming data? It seems you are just using the hash as a proxy for equality anyway.

That being said: I am puzzled as to why we are trying to solve all these problems externally to the tools, which are supposed to have all the information necessary to short-cut the "what has changed" computation. The CFE is supposed to have fine-grained incremental recompilation of the dependencies; the analyzer does not, but should eventually implement it anyway. It seems like we are reinventing the same calculation with macro-specific twists.

@davidmorgan
Contributor

Do you really need to compute hashes to compare things? Can't you just (for example) do a simple byte-by-byte comparison of the incoming data? It seems you are just using the hash as a proxy for equality anyway.

If a match means there is no more work to do, then it's nice to match based on hashes, because you can get a match without having to store all possible matches as full data.
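
A minimal sketch of that, keeping only a digest per cached response (the class and the md5 choice are illustrative, not the actual implementation):

import 'dart:typed_data';

import 'package:crypto/crypto.dart';

/// Caches macro output keyed by the digest of the serialized query result,
/// so only a hash (not the full Model data) is retained per entry.
class QueryResultCache<T> {
  final _outputByDigest = <Digest, T>{};

  /// Returns the cached output if these bytes hash to a known digest.
  T? lookup(Uint8List queryResultBytes) =>
      _outputByDigest[md5.convert(queryResultBytes)];

  void store(Uint8List queryResultBytes, T output) {
    _outputByDigest[md5.convert(queryResultBytes)] = output;
  }
}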

That being said: I am puzzled as to why we are trying to solve all these problems externally to the tools, which are supposed to have all the information necessary to short-cut the "what has changed" computation. The CFE is supposed to have fine-grained incremental recompilation of the dependencies; the analyzer does not, but should eventually implement it anyway. It seems like we are reinventing the same calculation with macro-specific twists.

Neither the analyzer nor the CFE has a data model that's immediately useful to macros, because those models are private to the analyzer and CFE, which means you can't code against them: they change. So macros have their own data model that is public and stable (the JSON representation and the corresponding binary format and extension types).

Macros describe what data they need as a query; the host (analyzer or CFE) converts its own data model to the macro model and sends it in response.

Macros usually only care about a part of the code, for example fields in classes with a particular annotation and their types, so what each macro receives is significantly cut down from the full host model. This also means that it should be very common that when a file changes the macro does not have to rerun: something changed but it wasn't what the macro cares about.

This investigation is about noticing that the data being sent to a macro is the same as last time, and so the output from last time can be reused. "The same as last time" is easy to check by keeping a hash from last time and comparing.

It's true that we could perhaps optimize further by pushing some part of the "same as last time" check before the conversion to the macro data model, so that for example the CFE could compare what changed against the macro query before it even starts to do the conversion. But this would be a lot more work to do, and it's possible that convert-then-compare gets us most of the performance, so we obviously check that first.

@jakemac53
Contributor Author

That being said: I am puzzled as to why we are trying to solve all these problems externally to the tools, which are supposed to have all the information necessary to short-cut the "what has changed" computation.

  • We would have to implement it twice that way, which is extra work as well as a larger potential for bugs. The question of "what does this macro depend on" is fairly challenging to answer, whereas the query-result hashing model is quite trivial and reliable.
  • It is probably a lot more expensive, especially in terms of memory, to maintain back-links across the program to the macro applications which depend on those nodes. The current solution we are exploring just needs to store a single hash per query that the macro runs. We do trade CPU cost for that, to re-run the queries and hash them, absolutely; we are now exploring the actual cost of doing so.
  • The hashing approach could also be more compatible with re-using macro results in a modular compiler, although we probably won't actually do this.

@jakemac53
Contributor Author

I hooked up the CPU profiler, from head but with #134 applied.

Here are some noteworthy things, from a profile spanning a single incremental edit:

[profiler screenshot]

  • about 27% of the total time is still spent computing hashes. We definitely need to improve that.

[profiler screenshot]

  • I don't know exactly what all of this is doing, but it seems pretty expensive, and if we were more deeply integrated we could possibly avoid re-doing this work when it is cached (cc @scheglov).

[profiler screenshots]

  • We are spending >10% of the time just re-running the queries to see what has changed. It looks like almost half that time is spent just creating typed maps, though (almost all the created objects here are a result of queries), which doesn't even account for all the work spent building up these objects.
  • This indicates we need to optimize the creation of these objects.

[profiler screenshot]

  • This isn't too unexpected; we do expect a lot of the time to go to parsing, since every library file gets re-parsed, and I believe all augmentation code is parsed twice.

@jakemac53
Contributor Author

Fwiw, this is my launch_config.json, which assumes you have already generated a benchmark to run (dart tool/benchmark_generator/bin/main.dart large macro 64 for example):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "benchmark_debug",
            "request": "launch",
            "type": "dart",
            "program": "pkgs/_macro_tool/bin/main.dart",
            "args": [
                "--workspace=goldens/foo",
                "--packageConfig=.dart_tool/package_config.json",
                "--script=goldens/foo/lib/generated/large/a0.dart",
                "--host=analyzer",
                "--watch"
            ],
            "cwd": "${workspaceFolder}"
        },
    ]
}

@scheglov

scheglov commented Nov 6, 2024

One of the previously recorded trees of performance operations in dart-lang/sdk#55784 (comment) provides details about what we do in Linker._mergeMacroAugmentations.

  1. Merge separate macro results into single code.
  2. Parse.
  3. Build unlinked data. Mostly hashing.

Similar data internally.

@scheglov

scheglov commented Nov 6, 2024

See also my previous benchmarks for hashing, Dart vs. Rust.
Or, maybe more fairly, different algorithms, some of which can be optimized for modern CPUs.

@jakemac53
Contributor Author

See also my previous benchmarks for hashing, Dart vs. Rust.

Fwiw, the actual hashing is not the problem in this particular case; it is the work to pull out the interesting bits of the objects that we want to hash that is expensive.

In my PR I am just using Object.hash and Object.hashAll, at least initially, and combined they consume only about 300ms, which isn't nothing, to be sure, but it's not the primary issue by any means.

[profiler screenshot]

@davidmorgan
Contributor

Re #147, which uses md5.

I have never tried FFI before, I figured it's about time and a good use of a Friday afternoon :)

I got a random md5 C implementation working easily enough:

dart-lang/core@89faa07

It's 2x faster than the Dart one:

> dart example/example.dart
 FFI: 71ms
9be5897a856106bf8fc9ef292366a3ac
 FFI: 71ms
9be5897a856106bf8fc9ef292366a3ac
...
...
Dart: 130ms
9be5897a856106bf8fc9ef292366a3ac
Dart: 145ms
9be5897a856106bf8fc9ef292366a3ac

It would be interesting to try one of the newer+faster hashes with a C implementation like blake3. (I notice, though, that the blake3 C implementation says that only the Rust implementation provides multithreading.)

Total FFI newbie, as I mentioned, but my newly-gained understanding is that if we allocate bytes natively (FFI "malloc") then you can treat the bytes on the Dart side as a Uint8List.

So we could write JSON directly into one, and there would be no copying to hash it. This assumes we can hash the whole buffer, of course, which is not something we can do yet; we'd have to write a new buffer just for hashing. But it should be fast.

The reason you have to allocate the bytes natively and then use them as a Uint8List, rather than the other way around, is that bytes allocated on the Dart side can be moved by the garbage collector, while bytes allocated natively can't.
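
A minimal sketch of that idea, assuming package:ffi's malloc (the helper name is illustrative):

import 'dart:ffi';
import 'dart:typed_data';

import 'package:ffi/ffi.dart';

/// Allocate the buffer in native memory so the same bytes can be handed to a
/// C hash function without copying. The returned Uint8List is a view over the
/// native memory; free it later with `malloc.free(pointer)`.
(Pointer<Uint8> pointer, Uint8List bytes) allocateNativeBuffer(int byteCount) {
  final pointer = malloc.allocate<Uint8>(byteCount);
  return (pointer, pointer.asTypedList(byteCount));
}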

@mraleph

mraleph commented Nov 22, 2024

@davidmorgan you can mark your hashing function as a leaf; that would allow you to do md5(buffer.address) even if buffer is allocated in the Dart heap, because leaf functions don't allow GC to happen while they execute.
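
A minimal sketch of opting in via lookupFunction's isLeaf flag (the symbol name, library name, and C signature are assumptions, not the actual binding):

import 'dart:ffi';

// Assumed C signature: void md5_hash(const uint8_t* data, intptr_t len, uint8_t* out);
typedef _Md5Native = Void Function(Pointer<Uint8>, IntPtr, Pointer<Uint8>);
typedef _Md5Dart = void Function(Pointer<Uint8>, int, Pointer<Uint8>);

final _lib = DynamicLibrary.open('libmd5.so'); // assumed library name
// isLeaf: true tells the VM this call cannot re-enter Dart or trigger GC,
// which is what makes passing addresses of Dart-heap buffers safe.
final md5Hash =
    _lib.lookupFunction<_Md5Native, _Md5Dart>('md5_hash', isLeaf: true);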

@davidmorgan
Contributor

davidmorgan commented Nov 22, 2024

Thanks Slava! I saw a reference to "leaf" functions, but didn't know it was something I can just opt into. That makes sense.

I managed to hack a working blake3 example, too. (Hurrah for ffigen!)

https://github.com/davidmorgan/core/tree/crypto-example-blake3/pkgs/crypto (build with: cmake -DBUILD_SHARED_LIBS=ON . && make)

For the example above it hits ~32ms, so another 2x speedup.

The example does lots of small (2000-byte) hashes, which does not hit maximum throughput; the difference gets much bigger if we do a small number of large hashes, e.g. for one hash of a billion bytes:

  FFI blake3: 291ms
570a1dd602f4c46553f0c4c211b4d08e32d8013425dda2cf18fcd18262cda24c
  Dart md5: 6182ms
66b64ed98371d30f6a48c0a9b0b8b6cd

For the very large hashes there is probably another 2-3x throughput available with a multithreaded blake3 implementation, based on what I saw with the blake3 CLI tool.

@jakemac53
Contributor Author

jakemac53 commented Nov 22, 2024

Note that the MD5 approach in #147 uses chunked encoding, which might not translate as well to the FFI approach, since there are many very small chunks. Maybe we can do some sort of streaming API though?

But it might translate well to the approach where we build up a buffer just for hashing. Then we are also more order-dependent, though, and I am not sure we want to rely on that, although it might turn out that things are already pretty deterministic if the analyzer/CFE have a deterministic ordering of members.
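
For reference, package:crypto does expose a chunked conversion API that can fold many small chunks into one digest; a minimal sketch (the function name is illustrative, and whether the per-chunk overhead is acceptable here is exactly the open question above):

import 'package:convert/convert.dart';
import 'package:crypto/crypto.dart';

/// Feed many small chunks into a single digest using the chunked API.
Digest hashChunks(Iterable<List<int>> chunks) {
  final output = AccumulatorSink<Digest>();
  final input = md5.startChunkedConversion(output);
  for (final chunk in chunks) {
    input.add(chunk); // each chunk is folded into the running digest
  }
  input.close();
  return output.events.single;
}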

@davidmorgan closed this as not planned on Feb 4, 2025.