Fast hash/equality for Model objects #129
Some ideas:
Hmmmm when the query finishes on the host, the host could then hash the whole buffer and send the result. It'd be nice to have it as a field on the response. That's assuming that hashing the buffer as raw bytes turns out to be fast, I hope so :) Hashing raw bytes will only do what we want if the bytes are sufficiently deterministic; I don't know if that will be the case, but it's a lot easier than hashing objects, so I guess it's worth a try :)
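A minimal sketch of what that could look like, assuming package:crypto and an already-serialized buffer (the function name is illustrative, not actual host code):

```dart
// Sketch only: hash the serialized query result as raw bytes on the host.
import 'dart:typed_data';

import 'package:crypto/crypto.dart';

Digest hashResponseBytes(Uint8List buffer) {
  // Hashing raw bytes avoids walking the object graph, but only gives a
  // stable result if serialization is deterministic.
  return sha256.convert(buffer);
}
```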
I did give this a shot and it was much worse, fwiw (about 3x worse).
I tried this as well and it seems a bit faster but not enough to meaningfully change the result.
It looks like just hacking something in to grab the raw bytes and hash those is definitely faster: 150ms or so per edit for the large json example, cumulatively spent doing hash computations. I also edited my branch to save the hash instead of the Model objects for the cached responses, and compare against that, so we do about half as many total hashes. I haven't played around with the different hash algorithms much to see which is fastest; I tried sha1, md5, and sha256, and sha1/md5 were close but sha256 was about twice as slow. That is still a lot of time to spend hashing, and I don't love that it is dependent on the specific ordering in which the buffer is built up (technically, my other approach was too, but it would have been relatively trivial to make it not so). Given that this now only ever hashes a given Model object once, I don't believe it is worth trying to compute the hash on the fly and store it in the buffer.
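For reference, a rough Stopwatch-based way to compare the algorithms over the same raw bytes (a sketch, not the benchmark harness used above):

```dart
// Micro-benchmark sketch: time md5/sha1/sha256 over the same byte buffer.
import 'dart:typed_data';

import 'package:crypto/crypto.dart';

void compareHashes(Uint8List buffer, {int iterations = 100}) {
  for (final (name, hash) in [('md5', md5), ('sha1', sha1), ('sha256', sha256)]) {
    final stopwatch = Stopwatch()..start();
    for (var i = 0; i < iterations; i++) {
      hash.convert(buffer);
    }
    print('$name: ${stopwatch.elapsedMilliseconds}ms for $iterations hashes');
  }
}
```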
I read a bit about what the google3 build does; there is a public doc about a specialized hash, PSHA2. It's based on SHA256 with tweaks to add parallelization so it can run using SIMD instructions, i.e. it uses parallelization within a single CPU. The parallel part of psha256 kicks in for message sizes of 1024 bytes, so it looks like it won't be too hard to hit it. Comparing the command line versions of md5sum, sha256sum, sha1sum and psha2sum on my machine using 1GB of input: md5 hits 0.54GB/s, sha256 is 1.08GB/s, sha1 is 1.23GB/s and psha256 is 1.56GB/s. Based on these numbers, I wonder how much of that native-performance boost we could get with ffi? But actually these are so important, I could see an argument for directly supporting them in the platform.
Some bazel discussion (bazelbuild/bazel#22011) mentions blake3, which looks even faster.
Do you really need to compute a hash to compare things? Can't you just (for example) do a simple byte-by-byte comparison of the incoming data? It seems you are just using it as a proxy for equality anyway. That being said: I am puzzled as to why we are trying to solve all these problems externally to the tools which are supposed to have all the information necessary to short-cut the what-has-changed computation. The CFE is supposed to have fine-grained incremental recompilation of the dependencies; the analyzer does not, but should eventually implement it anyway. It seems like we are reinventing the same calculation with macro-specific twists.
If a match means there is no more work to do then it's nice to get a match based on hashes, because then you can get a match without having to store all possible matches as full data.
Neither analyzer nor CFE has a data model that's immediately useful to macros, because those models are private to the analyzer and CFE, which means you can't code against them: they change. So, macros have their own data model that is public and stable (the JSON representation and corresponding binary format and extension types). Macros describe what data they need as a query, and the host (analyzer or CFE) converts its own data model to the macro model and sends it in response.

Macros usually only care about a part of the code, for example fields in classes with a particular annotation and their types, so what each macro receives is significantly cut down from the full host model. This also means that it should be very common that when a file changes the macro does not have to rerun: something changed, but it wasn't what the macro cares about. This investigation is about noticing that the data being sent to a macro is the same as last time, so the output from last time can be reused. "The same as last time" is easy to check by keeping a hash from last time and comparing.

It's true that we could perhaps optimize further by pushing some part of the "same as last time" check before the conversion to the macro data model, so that for example the CFE could compare what changed against the macro query before it even starts to do the conversion. But this would be a lot more work to do, and it's possible that convert-then-compare gets us most of the performance, so we obviously check that first.
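A hypothetical sketch of that check, assuming the query result is already serialized to bytes (the cache shape and `MacroOutput` type are stand-ins, not the real host types):

```dart
// Hypothetical "same as last time" check on the host side.
import 'dart:typed_data';

import 'package:crypto/crypto.dart';

class MacroOutput {} // Stand-in for the real macro output type.

class MacroResultCache {
  // Previous digest and output, keyed by some (macro, query) identity.
  final Map<String, (Digest, MacroOutput)> _previous = {};

  MacroOutput run(
    String key,
    Uint8List queryResultBytes,
    MacroOutput Function() runMacro,
  ) {
    final digest = md5.convert(queryResultBytes);
    final cached = _previous[key];
    if (cached != null && cached.$1 == digest) {
      // Same data as last time: reuse the previous output, skip the macro.
      return cached.$2;
    }
    final output = runMacro();
    _previous[key] = (digest, output);
    return output;
  }
}
```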
I hooked up the CPU profiler, from head but with #134 applied. Here are some noteworthy things, from a profile spanning a single incremental edit:
Fwiw, this is my launch_config.json, which assumes you have already generated a benchmark to run:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "benchmark_debug",
      "request": "launch",
      "type": "dart",
      "program": "pkgs/_macro_tool/bin/main.dart",
      "args": [
        "--workspace=goldens/foo",
        "--packageConfig=.dart_tool/package_config.json",
        "--script=goldens/foo/lib/generated/large/a0.dart",
        "--host=analyzer",
        "--watch"
      ],
      "cwd": "${workspaceFolder}"
    }
  ]
}
```
One of the previously recorded trees of performance operations in dart-lang/sdk#55784 (comment) provides details of what we do. Similar data exists internally.
See also my previous benchmarks for hashing, Dart vs. Rust.
Fwiw, the actual hashing is not the problem in this particular case; it is the work to pull out the interesting bits of the objects that we want to hash that is expensive. In my PR I am just using md5.
Re #147, which uses md5: I have never tried FFI before, and I figured it's about time and a good use of a Friday afternoon :) I got a random md5 C implementation working easily enough, and it's 2x faster than the Dart one.
It would be interesting to try one of the newer+faster hashes with a C implementation, like blake3. (I notice though that the blake3 C implementation says that only the Rust implementation provides multithreading.) Total FFI newbie, as I mentioned, but my newly-gained understanding is that if we allocate bytes natively (FFI "malloc") then you can treat the bytes on the Dart side as a Uint8List. So we could write JSON directly into one, and there would be no copying to hash it. This assumes we can hash the whole buffer, of course, which is not something we can do yet; we'd have to write a new buffer just for hashing. But it should be fast. The reason you have to allocate the bytes natively and then view them from Dart, rather than the other way around, is that native code can't safely hold a pointer into the Dart heap, where the GC is free to move objects.
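A purely illustrative FFI sketch of that idea; the library name and `md5_hash` symbol are hypothetical stand-ins for whatever C implementation gets bound:

```dart
// Sketch: allocate the buffer natively, view it from Dart as a Uint8List,
// write bytes into it, then hash it in C without any copy.
import 'dart:ffi';

import 'package:ffi/ffi.dart';

// Hypothetical C signature: void md5_hash(const uint8_t* data, intptr_t length, uint8_t* out).
typedef _Md5HashC = Void Function(Pointer<Uint8> data, IntPtr length, Pointer<Uint8> out);
typedef _Md5HashDart = void Function(Pointer<Uint8> data, int length, Pointer<Uint8> out);

void main() {
  final lib = DynamicLibrary.open('libnative_md5.so'); // hypothetical library
  final md5Hash = lib.lookupFunction<_Md5HashC, _Md5HashDart>('md5_hash');

  const length = 1024 * 1024;
  final buffer = malloc<Uint8>(length); // native allocation
  final digest = malloc<Uint8>(16); // md5 digest is 16 bytes
  try {
    // A Uint8List view over the native memory: serialized JSON could be
    // written straight into it, and the same bytes hashed without copying.
    final bytes = buffer.asTypedList(length);
    bytes.fillRange(0, bytes.length, 42); // placeholder "serialized" content
    md5Hash(buffer, length, digest);
    print(digest.asTypedList(16));
  } finally {
    malloc.free(buffer);
    malloc.free(digest);
  }
}
```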
@davidmorgan you can mark your hashing function as a leaf, that would allow you to do
Thanks Slava! I saw a reference to "leaf" functions, but didn't know it was something I can just opt into. That makes sense. I managed to hack a working blake3 example, too. (Hurrah for ffigen!) https://github.com/davidmorgan/core/tree/crypto-example-blake3/pkgs/crypto For the example above it hits ~32ms, so another 2x speedup. The example does lots of small (2000 byte) hashes, which does not hit maximum throughput; the difference gets much bigger if we do a small number of large hashes, e.g. one hash of a billion bytes.
For the very large hashes there is probably another 2-3x throughput available with a multithreaded blake3 implementation, based on what I saw with the blake3 CLI tool.
Note that the MD5 approach in #147 uses chunked encoding, which might not translate as well to the FFI approach, since there are many very small chunks. Maybe we can do some sort of streaming API though? But it might translate well to the approach where we build up a buffer just for hashing. Then we are also more order-dependent though, and I am not sure we want to rely on that, although it might turn out that things are pretty deterministic already if the analyzer/CFE have a deterministic ordering of members.
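For the streaming question, package:crypto does expose a chunked API, so in principle the many small chunks could be fed in as they are produced; a sketch with placeholder chunks:

```dart
// Sketch: feed chunks into a chunked md5 conversion instead of building one
// big buffer first.
import 'package:convert/convert.dart' show AccumulatorSink;
import 'package:crypto/crypto.dart';

Digest hashChunks(Iterable<List<int>> chunks) {
  final output = AccumulatorSink<Digest>();
  final input = md5.startChunkedConversion(output);
  for (final chunk in chunks) {
    input.add(chunk);
  }
  input.close();
  return output.events.single;
}

void main() {
  print(hashChunks([
    [1, 2, 3],
    [4, 5, 6],
  ]));
}
```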
As part of the performance work, I have been looking into generating hash functions for Model objects (see my WIP branch). It isn't too bad to make a basic implementation, but it is very slow, taking almost an entire second cumulatively computing hashes for each edit in the large JSON benchmark.
My first approach here is to generate "identityHash" functions which do lookups on the `node` object for each known property, recursively calling "identityHash" on all the nested objects; `Interface.identityHash`, for example, does this for each of an interface's properties.
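A rough illustration of that shape, assuming extension types over the underlying JSON maps; the property names and types here are guesses, not the actual generated code:

```dart
// Illustrative only: the rough shape of a generated identityHash that looks
// up each known property on the `node` map and recurses into nested objects.
extension type Interface(Map<String, Object?> node) {
  int get identityHash => Object.hash(
        node['name'],
        Object.hashAll([
          for (final member in (node['members'] as List<Object?>? ?? const []))
            Member(member as Map<String, Object?>).identityHash,
        ]),
      );
}

extension type Member(Map<String, Object?> node) {
  int get identityHash => Object.hash(node['name'], node['returnType']);
}
```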
Ultimately the result of this is that even cached macro phases take an unacceptable amount of time (multiple milliseconds), so we will need to come up with something faster and evaluate exactly what is making this so slow.