Use length prefix in default Hasher::write_str
#134134
Using a 0xFF trailer is only correct for bytewise hashes. A generic `Hasher` is not necessarily bytewise, so use a length prefix in the default implementation instead.
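A minimal sketch of the problem described above, under stated assumptions: `PaddingHasher`, `hash_with_trailer`, and `hash_with_prefix` are hypothetical names invented for illustration and are not part of `std` or of this PR. The toy hasher zero-pads every `write` call to 8-byte words, so it is not bytewise. The two helper functions imitate the 0xFF-trailer scheme and a length-prefix scheme by hand, using only stable `Hasher` methods.

```rust
use std::hash::Hasher;

/// Hypothetical non-bytewise hasher: each `write` call is split into
/// 8-byte words and the last partial word is zero-padded.
struct PaddingHasher {
    state: u64,
}

impl PaddingHasher {
    fn with_seed(seed: u64) -> Self {
        PaddingHasher { state: seed }
    }
}

impl Hasher for PaddingHasher {
    fn write(&mut self, bytes: &[u8]) {
        for chunk in bytes.chunks(8) {
            let mut word = [0u8; 8];
            word[..chunk.len()].copy_from_slice(chunk);
            let w = u64::from_le_bytes(word);
            // XOR followed by multiplication by an odd constant is a
            // bijection on u64, so distinct inputs keep distinct states.
            self.state = (self.state.rotate_left(5) ^ w).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// Imitates the trailer scheme: string bytes, then a single 0xFF byte.
fn hash_with_trailer(seed: u64, s: &str) -> u64 {
    let mut h = PaddingHasher::with_seed(seed);
    h.write(s.as_bytes());
    h.write_u8(0xff);
    h.finish()
}

/// Imitates a length-prefix scheme: the length first, then the string bytes.
fn hash_with_prefix(seed: u64, s: &str) -> u64 {
    let mut h = PaddingHasher::with_seed(seed);
    h.write_usize(s.len());
    h.write(s.as_bytes());
    h.finish()
}

fn main() {
    for seed in [0u64, 1, 0xDEAD_BEEF] {
        // "a" and "a\0" pad to the same word, so the trailer cannot separate them.
        assert_eq!(hash_with_trailer(seed, "a"), hash_with_trailer(seed, "a\0"));
        // The lengths 1 and 2 differ, so the prefixed inputs diverge at the first word.
        assert_ne!(hash_with_prefix(seed, "a"), hash_with_prefix(seed, "a\0"));
    }
}
```

With this toy hasher the trailer collision holds for every seed, while the length prefix separates the two strings. Real hashers differ in how they buffer and pad, so this only illustrates the class of issue.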
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @saethlin (or someone else) some time within the next two weeks. Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (
@bors try @rust-timer queue
…refix, r=<try> Use length prefix in default `Hasher::write_str` Using a 0xFF trailer is only correct for bytewise hashes. A generic `Hasher` is not necessarily bytewise, so use a length prefix in the default implementation instead. Context: https://rust-lang.zulipchat.com/#narrow/channel/219381-t-libs/topic/collision-free.20non-bytewise.20hashers.20on.20stable r? saethlin
☀️ Try build successful - checks-actions
Note that rustc-hash optionally overrides both |
I'm having trouble estimating the quality of |
I don't know of any hasher where one The other experiment that would be useful is checking how easy or hard it is to trigger this theoretical problem. That is, pick some reasonably popular hasher that doesn't mix in the length and try to find a set of keys that'll lead to full hash collisions with the current |
Finished benchmarking commit (3eb8d2d): comparison URL.
Overall result: ❌✅ regressions and improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count
This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)
Results (primary -2.0%, secondary -2.4%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles
Results (primary -2.3%, secondary 2.3%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size
Results (primary 0.0%, secondary 0.0%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 768.372s -> 769.733s (0.18%)
IMO, the main problem here is that this issue introduces seed-independent collisions. This affects, say, DoS resistance. To give another example, suppose I'd like to find a collision-free hash function for a given set of 1M elements. Typically, I can keep trying random seeds until I get one without collisions. But at the moment, this is not the case for the affected hashers.
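The seed-retry argument can be sketched as follows. Everything here is an assumption made for illustration: `mix` is a hypothetical word-at-a-time mixer that zero-pads each write, `trailer_hash` and `prefix_hash` imitate the two string framings, and `find_collision_free_seed` is a made-up helper that tries seeds in order. None of these names come from `std` or from any real hasher crate.

```rust
use std::collections::HashSet;

/// Hypothetical mixer: processes one write as zero-padded 8-byte words.
fn mix(mut state: u64, bytes: &[u8]) -> u64 {
    for chunk in bytes.chunks(8) {
        let mut word = [0u8; 8];
        word[..chunk.len()].copy_from_slice(chunk);
        state = (state.rotate_left(5) ^ u64::from_le_bytes(word))
            .wrapping_mul(0x9E37_79B9_7F4A_7C15);
    }
    state
}

/// String bytes followed by a 0xFF trailer byte, as two separate writes.
fn trailer_hash(seed: u64, s: &str) -> u64 {
    mix(mix(seed, s.as_bytes()), &[0xff])
}

/// Length (as a little-endian u64) followed by the string bytes.
fn prefix_hash(seed: u64, s: &str) -> u64 {
    mix(mix(seed, &(s.len() as u64).to_le_bytes()), s.as_bytes())
}

/// Tries seeds 0..max_tries and returns the first one under which
/// all keys hash to distinct values, if any.
fn find_collision_free_seed(
    keys: &[&str],
    hash: fn(u64, &str) -> u64,
    max_tries: u64,
) -> Option<u64> {
    (0..max_tries).find(|&seed| {
        let mut seen = HashSet::new();
        // `insert` returns false on a duplicate, which stops `all` early.
        keys.iter().all(|k| seen.insert(hash(seed, k)))
    })
}

fn main() {
    let keys = ["a", "a\0", "b"];
    // "a" and "a\0" collide under every seed, so the search never succeeds.
    assert_eq!(find_collision_free_seed(&keys, trailer_hash, 1000), None);
    // With a length prefix, the very first seed already works for these keys.
    assert_eq!(find_collision_free_seed(&keys, prefix_hash, 1000), Some(0));
}
```

The point of the sketch is only that retrying seeds cannot help when two keys produce identical input to the mixer; it says nothing about how any particular published hasher behaves.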
In theory, this is a strong argument. In practice it's not clear to me if there's any existing hasher that only has seed independent collisions today because of There are a bunch of hashers that have seed-independent collisions among strings of the same length, and many hashers already mix the length into |
Generally speaking, all modern quality hash functions try to resist DoS as well as they can, as SMHasher tests this behavior, so this is hardly surprising. Of course, many cryptographic or almost-cryptographic hash functions have Rust implementations that are only safe because they're bytewise. The moment anyone decides to optimize them in a seemingly innocuous way, the jig is up, and DoS resistance is broken.
For the core loops (pre-finalization):
These collisions don't typically matter, as the finalization mixes in the total length. But this doesn't protect against multiple writes whose lengths sum up to the same value. That is only broken because of
I'd like to make a somewhat tangential counterargument. No existing hashes are tuned to Rust's needs in particular, leading to suboptimal performance when using anything other than strings as keys. This is very much a barren field: I can think of multiple optimizations tuned for the streaming design of Any attempts to solve this (like what I'm currently working on) will inevitably raise the question: what does the |
An alternative sound formalization exists. We can specify that Of course, this will break FNV and all other popular existing |
As the author of foldhash, my 2 cents is that I think it's always been a mistake to automatically mix in anything at all. This applies to both I would also like to call attention to the fact that |
Thinking aloud: I agree that I believe that the only possibility in which this extra avalanche step can be avoided is when the length and the array contents are available at once. As such, it seems reasonable to remove

```rust
fn write_bytes(&mut self, bytes: &[u8]) {
    self.write_length_prefix(bytes.len());
    self.write(bytes);
}
```

The The This change would be backwards-compatible (nightly excluded, but deprecating Does this sound reasonable or am I missing something?
Ignoring
If IMO If |
I never meant to suggest that |
As the author of rapidhash, I agree with a lot of what's been discussed here. I tangentially wanted to note that the changes suggested here potentially break the portability of the To allow us to make future changes based on the suggestions above, I've raised the concern in a separate issue which suggests documenting that |