Use hashes in Key instead of allocated strings #2616

Closed · wants to merge 3 commits

Conversation

udoprog commented Apr 15, 2022

Description

Changes Key to wrap an Rc<[u8]> instead of an Rc<str>, which among other things allows Key to implement From<&[u8]>. This in turn allows it to more efficiently wrap types which directly have a binary representation.

This does add some unsafe code which we wouldn't need if From<Rc<str>> for Rc<[u8]> existed in std, see rust-lang/rust#96078.
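A minimal sketch of the kind of conversion involved, assuming the goal is to reuse the existing Rc allocation (illustrative, not necessarily the PR's exact code):

use std::rc::Rc;

fn rc_str_into_bytes(s: Rc<str>) -> Rc<[u8]> {
    // SAFETY: str and [u8] share the same layout and length metadata, so the
    // original allocation can be reinterpreted without copying.
    unsafe { Rc::from_raw(Rc::into_raw(s) as *const [u8]) }
}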

This also introduces a handful of breaking changes which need to be considered, especially since the binary representations of seemingly unrelated values may now overlap, or stop overlapping (a sketch illustrating both cases follows the list):

  • Integer types have lengths, so 0u32 would never match 0u64. Previously this was papered over since both compared equal to the string "0".
  • Different types can conflict if they have the same byte representation, such as a Vec<u8> whose bytes match a String's, like "foo" and b"foo".
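A hypothetical illustration of both points (assuming Key's existing From<&str> impl plus the proposed From<&[u8]> and integer impls):

use yew::virtual_dom::Key;

// Widths now matter: the byte representations differ in length.
assert_ne!(Key::from(0u32), Key::from(0u64));

// Unrelated types can now compare equal when their bytes match.
assert_eq!(Key::from("foo"), Key::from(&b"foo"[..]));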

In my mind these shortcomings are mitigated since keys tend to be uniform in the context in which they are used. Consider iterating over something and generating keys; they tend to be generated in the same manner:

html! {
    {for entries.iter().map(|e| {
        html! { <Entry key={e.id} entry={e} /> }
    })}
}

The upside of this is that using keys is made more efficient and sometimes avoids the extra allocation necessary to stringify things. My motivating use-case was that I wanted to use a Uuid as a key. With this change I could use Uuid::as_bytes directly, as sketched below.
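For example, something along these lines should become possible (hypothetical usage; e.id is assumed to be a uuid::Uuid and Key is yew::virtual_dom::Key):

html! {
    {for entries.iter().map(|e| {
        // The key wraps the uuid's 16 bytes directly, with no string allocation.
        html! { <Entry key={Key::from(&e.id.as_bytes()[..])} entry={e} /> }
    })}
}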

Solving the shortcomings

If we want to, I do believe we could solve the above shortcomings with a bit of extra work.

  • We can store a TypeId in the Key, or define our own type tag. E.g. I personally think it would be reasonable to compare Rust strings with their binary equivalents, since UTF-8 is intentionally ASCII-compatible.
  • We can use a variable-length encoding for integers, which would give the same binary representation regardless of width. If all numbers additionally used zig-zag encoding, signed and unsigned integers could be compared correctly (see the sketch after this list).
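A minimal sketch of the zig-zag idea, assuming every integer key is first widened to i128 so width and signedness stop mattering:

// Zig-zag maps ..., -2, -1, 0, 1, 2, ... to 3, 1, 0, 2, 4, ..., keeping small
// magnitudes small so a variable-length encoding can follow.
fn zigzag(n: i128) -> u128 {
    ((n as u128) << 1) ^ ((n >> 127) as u128)
}

assert_eq!(zigzag(0i8 as i128), zigzag(0u16 as i128));
assert_eq!(zigzag(-1), 1);
assert_eq!(zigzag(2), 4);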

Checklist

  • I have run cargo make pr-flow
  • I have reviewed my own code
  • I have added tests

ranile added the "breaking change" and "A-yew (Area: The main yew crate)" labels on Apr 15, 2022

github-actions bot commented Apr 15, 2022

Visit the preview URL for this PR (updated for commit 923efa0):

https://yew-rs-api--pr2616-key-as-bytes-mkqpfsb3.web.app

(expires Thu, 28 Apr 2022 08:11:17 GMT)

🔥 via Firebase Hosting GitHub Action 🌎


github-actions bot commented Apr 15, 2022

Size Comparison

examples    master (KB)    pull request (KB)    diff (KB)
boids 173.392 174.075 +0.684
contexts 110.692 110.865 +0.173
counter 87.387 87.640 +0.253
counter_functional 87.939 87.938 -0.002
dyn_create_destroy_apps 90.438 90.583 +0.146
file_upload 103.313 103.671 +0.357
function_memory_game 167.781 169.016 +1.234
function_router 353.482 361.712 +8.229
function_todomvc 162.715 164.108 +1.394
futures 227.142 227.249 +0.107
game_of_life 108.423 109.269 +0.846
inner_html 83.878 83.716 -0.162
js_callback 113.460 113.881 +0.421
keyed_list 196.287 199.624 +3.337
mount_point 86.903 86.854 -0.049
nested_list 116.408 117.112 +0.704
node_refs 90.273 90.812 +0.539
password_strength 1539.888 1540.306 +0.418
portals 97.476 97.915 +0.439
router 319.312 327.790 +8.478
simple_ssr 500.776 500.855 +0.079
ssr_router 428.949 437.451 +8.502
suspense 111.171 111.651 +0.480
timer 90.069 90.405 +0.336
todomvc 144.063 145.352 +1.288
two_apps 87.971 88.243 +0.272
webgl 87.560 87.418 -0.142

@futursolo
Member

Instead of converting the internal representation of keys from Rc<str> to Rc<[u8]>, how about implementing it with std::hash::Hash?


udoprog commented Apr 15, 2022

Instead of converting the internal representation of keys from Rc<str> to Rc<[u8]>, how about implementing it with std::hash::Hash?

That would make sense to me! If you use something like twox-hash you could store keys as 64- or 128-bit integers.

It still might be desirable to salt things appropriately (i.e. hash some constant to distinguish between byte arrays and integers) and to encode integers so that the same number in different widths or signedness results in the same hash.
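A rough sketch of what that salting and canonicalisation could look like (names are illustrative, not a proposed API):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Tag each kind of key before hashing so byte sequences and integers cannot
// trivially collide.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    0u8.hash(&mut hasher); // kind tag: byte sequence
    bytes.hash(&mut hasher);
    hasher.finish()
}

// Widen every integer to one canonical type so e.g. 0u32 and 0u64 hash the same.
fn hash_int(n: i128) -> u64 {
    let mut hasher = DefaultHasher::new();
    1u8.hash(&mut hasher); // kind tag: integer
    n.hash(&mut hasher);
    hasher.finish()
}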


ranile commented Apr 15, 2022

If it's done with Hash, it's worth benchmarking the performance (not sure how good our CI is for that, @voidpumpkin can comment on that). While I doubt there will be a performance impact, we may see a noticeable increase in binary size by adding another dependency.

@futursolo
Member

we may see a noticeable increase in binary size by adding another dependency

We can use std::collections::hash_map::DefaultHasher?


udoprog commented Apr 16, 2022

@hamza1311 I tried a branch which uses hashing (but does not use zig-zag encoding): https://github.com/udoprog/yew/commit/key-hashing

Size differences when cherry picked on top of 0.19.3 and used in one of my applications (twox-hash added as a dependency):

before:           10934052
after:            11155539
diff:               221487

before optimized:  1674539
after optimized:   1715264
diff:                40725

@futursolo It's an option. You'd have to be comfortable with 64-bit wide keys and it does do a bit of extra work by virtue of being cryptographic and randomly seeded. But I don't think that matters much.


ranile commented Apr 16, 2022

I assume that is in bytes? An increase of ~40KB is a significant increase in binary size


udoprog commented Apr 17, 2022

I replaced the current proposal with one that does hashing; twox-hash with 128-bit keys is hidden behind a feature flag with the same name. These are the results from the benches I added, for anyone curious:

Note that the benches do not account for allocation very well, i.e. a benchmark is an ideal allocation scenario where copying the specified segment of memory is cheap. But we can note that allocating is roughly as fast as calculating the hash over the same data, which is all we should really care about since the newly proposed key doesn't allocate at all.

SipHasher is apparently really good and consistent, and it has a constructor that avoids random seeding. The std implementation is excellently written and tuned.

Current Key:

test bench_string_key ... bench:       3,385 ns/iter (+/- 91)
test bench_u128_key   ... bench:       7,589 ns/iter (+/- 218)
test bench_u64_key    ... bench:       7,589 ns/iter (+/- 469)

DefaultHasher (which is SipHasher using 64-bit keys and without random seeding):

test bench_string_key ... bench:       3,020 ns/iter (+/- 99)
test bench_u128_key   ... bench:       1,788 ns/iter (+/- 38)
test bench_u64_key    ... bench:       1,781 ns/iter (+/- 20)

twox-hash (using 128-bit keys):

test bench_string_key ... bench:       6,555 ns/iter (+/- 625)
test bench_u128_key   ... bench:       6,416 ns/iter (+/- 283)
test bench_u64_key    ... bench:       6,493 ns/iter (+/- 283)
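For context, the benches have roughly this shape (an assumed reconstruction, not the PR's exact code; requires nightly and the test crate):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};
use yew::virtual_dom::Key;

#[bench]
fn bench_u64_key(b: &mut Bencher) {
    // Construct many keys per iteration so key construction/hashing dominates.
    b.iter(|| {
        for n in 0u64..1024 {
            black_box(Key::from(black_box(n)));
        }
    });
}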


udoprog commented Apr 17, 2022

I assume that is in bytes? An increase of ~40KB is a significant increase in binary size

Up to you to decide. I get something like 10 KB of variability between builds, so it isn't that big of a deal for me. But as a result of your comment I put twox-hash behind a non-default feature flag in this proposal. If you don't want it at all, please tell me and I'll remove it.

@WorldSEnder
Member

I get something like 10kb variability between builds so it isn't that big of a deal for me.

It sounds like we might be using different measurement processes. There is a CI test for size comparison, and any change above ~30 bytes is out of the ordinary if functionality is unchanged. Be sure to run the comparison with build flags optimized for size (see the size-cmp workflow), which runs wasm-opt and uses nightly Rust for tree-shaking and the best size performance.

@lukechu10
Contributor

What about hash collisions? The std implementation of HashMap has special code for dealing with hash collisions which can be seen here: https://docs.rs/hashbrown/latest/src/hashbrown/raw/mod.rs.html#822

Although hash collisions should be relatively rare, especially with the default hasher, they still can occur. And the biggest problem that I see with this is that hashes are not deterministic which means that an error caused by a hash collision would not be reproducible.


udoprog commented Apr 19, 2022

What about hash collisions? The std implementation of HashMap has special code for dealing with hash collisions which can be seen here: https://docs.rs/hashbrown/latest/src/hashbrown/raw/mod.rs.html#822

Hash collisions in a HashMap are common because the hash is used to pick between a much smaller number of buckets, sized according to the capacity of the map.

For the way we'd use keys, and given a hash with a decent distribution (which I believe DefaultHasher has), if we want to keep the probability of a single collision below roughly one in a million (10^-6) we can have:

  • about 6.1 million elements with a 64-bit hash.
  • on the order of 10^16 elements with a 128-bit hash.

If I'm reading the table here correctly: https://en.wikipedia.org/wiki/Birthday_attack
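For reference, the approximation behind that table: the probability of at least one collision among n random keys drawn from d possible hashes is p ≈ n²/(2d), so n ≈ √(2dp). With d = 2^64 and p = 10^-6 that gives n ≈ 6.1 × 10^6.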

I prefer the latter (which is why I at least want it as a feature to test it). But I don't see much of an issue with the former which is why I also think a 64-bit hash is fine. Random UUIDs for example are 128-bit which is considered sufficient to be globally unique.

Although hash collisions should be relatively rare, especially with the default hasher, they still can occur. And the biggest problem that I see with this is that hashes are not deterministic which means that an error caused by a hash collision would not be reproducible.

The hashes proposed here are deterministic. Both DefaultHasher and twox-hash here are initialized with fixed seeds.


ranile commented Apr 19, 2022

I get something like 10kb variability between builds so it isn't that big of a deal for me

In release builds with wasm-opt and the build size optimizations? If so, that seems a little off to me.

Side note: @udoprog, can you run our benchmark "suite"? You can refer to the benchmark workflow to see how it's run.


udoprog commented Apr 20, 2022

It's unclear to me how to run the full suite of benchmarks using that action, since it looks rather complicated. Is there a simplified way to run it, or can we simply invoke the action itself?

Furthermore, is that or anything else blocking this PR? Is this something you want to adopt?


WorldSEnder commented Apr 20, 2022

Furthermore, is that or anything else blocking this PR? Is this something you want to adopt?

I see two blockers, one of which you can't do anything about. I'd want to run the performance benchmarks locally to see how exactly this affects them, since the GitHub runner is a bit unstable and fluctuates too much for accurate results.

The other thing is clippy complaining about the use of #![feature(test)] in the benchmarks. Can you change it so that the benchmarks only get picked up on nightly?

Note: I'm also convinced that hash collisions will basically never occur. I have yet to see a website with more than a few thousand items in a single element. Usually, browsers begin to struggle with other things well before we hit a point where I'd start worrying about hash collisions and the birthday problem, even with 64-bit hashes, but going for 128 bits for good measure seems reasonable. Perhaps we should add a lint for when the number of elements in a list is greater than some X, e.g. 10000, if we opt to go with 64 bits.

udoprog changed the title from "Wrap bytes in a Keys instead of strings" to "Use hashes in Key instead of allocated strings" on Apr 21, 2022

udoprog commented Apr 21, 2022

Switched to criterion, since selectively disabling the default benches turned out to be much harder.

@WorldSEnder If I read you correctly, you personally wanted to run the webdriver benchmarks?

@WorldSEnder
Member

@WorldSEnder If I read you correctly, you personally wanted to run the webdriver benchmarks?

Yep. Running the benchmark-struct and benchmark-hooks tests locally for accurate results, on multiple environments: (Chrome/Firefox) × 2 different CPUs.


WorldSEnder commented Apr 21, 2022

Update: looks like wins across the board. I think it's definitely something for the release profile and it has some nice performance wins on update. Measurements. There seems to be a persistent small overhead when creating "trivial" keys. Which brings me to three more questions I want to ask:

  • Can this benefit from niche optimization? Specifically, we use Option<Key> a few times in the code. Can we reserve one hash value as a niche such that size_of::<Option<Key>> == size_of::<Key>?
  • Do we have to use such a good hashing algorithm, or could we stick in something simple like FxHasher, which should be even faster, especially when the input is a single u64 getting hashed.
  • Should we include the original key during debugging to provide a bit of a safety-net against hash collisions for good measure? In any case, we might want to make it clear in the documentation that the key should not be initialized with user/content data, to be a bit more resistant against crafted/unwanted collisions.


udoprog commented Apr 21, 2022

Can this benefit from niche optimization? Specifically, we use Option<Key> a few times in the code. Can we reserve one hash value as a niche such that size_of::<Option<Key>> == size_of::<Key>?

I suppose we can add a loop which tries to construct a NonZeroU64/NonZeroU128. Alternatively, provide a default constructor which returns an all-zero key that can be used instead. Either way it seems like a good follow-up PR to me.
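A sketch of the "loop until non-zero" idea (illustrative only, not a concrete proposal): re-hash with an incrementing salt until the result is non-zero, so the key fits in a NonZeroU64 and Option<Key> gets the niche for free.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::num::NonZeroU64;

fn non_zero_hash(value: &impl Hash) -> NonZeroU64 {
    for salt in 0u64.. {
        let mut hasher = DefaultHasher::new();
        salt.hash(&mut hasher);
        value.hash(&mut hasher);
        // Almost always returns on the first iteration; only a hash of exactly
        // zero forces another round with a different salt.
        if let Some(hash) = NonZeroU64::new(hasher.finish()) {
            return hash;
        }
    }
    unreachable!("some salt produces a non-zero hash")
}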

Do we have to use such a good hashing algorithm, or could we stick in something simple like FxHasher, which should be even faster, especially when the input is a single u64 getting hashed.

FxHash is, in my mind, very similar to a single round of FNV, which does not produce very randomly distributed outputs, making it more prone to collisions. How much more? Not sure. Increasing the number of rounds helps, but then it's supposedly much slower than something like xxHash (or SipHash).

the key should not be initialized with user/content data, to be a bit more resistant against crafted collisions.

I'm not following. Could you elaborate? The proposed algorithms all have good avalanche effects, which should mean that crafting collisions shouldn't be easy.


WorldSEnder commented Apr 21, 2022

I'm not following. Could you elaborate?

Sorry, I bunched a few things together. The crafted part was conditional on using a faster but worse hasher such as Fx. The overall recommendation still stands though, especially when using u64 hashes. A chance of 1e-12 for ~6.1k elements "will" happen if you start testing with actually random data (it's roughly 1/1000th of the odds of winning the Powerball, right?), and it will be very hard to debug. If you use controlled keys, such as a running index in a collection, debuggability should be a lot better, and tying that in with testing against the full, unhashed key in debug mode should give at least reproducible issues if they do happen.

EDIT: at least a flag to turn it off? I do like the speedup, I just like to be able to debug errors, too :)
EDIT2: I ran the benchmarks again with FxHasher64 patched in, see the updated results link. It gives a bit of a speedup in create_10k and swap, and comparable performance for the rest.

futursolo (Member) left a comment

Looks good for the most part, just some minor issues.

In addition, we should also look into why the code size of function_router has increased by 8 KB (the example itself does not use keys).

fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
    self.key.fmt(f)

// Default implementation for byte sequences.
impl From<&[u8]> for Key {
Member

Why not impl<T> From<T> for Key where T: Hash, and only specialise on types that need special handling?

udoprog (Author) commented Apr 28, 2022

Providing a blanket impl<T> From<T> for Key where T: Hash would prevent any specialization, since Hash might be implemented upstream for any T in the future (trait coherence rules) and we don't have access to stable specialization.
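An illustrative example of the coherence problem, using a stand-in Key type (hypothetical code, not from the PR):

use std::hash::Hash;

struct Key(u64);

// A blanket impl like this is rejected by coherence: it overlaps with core's
// reflexive `impl<T> From<T> for T` (for T = Key), and it also conflicts with
// any specific impl we still want, such as the one below.
impl<T: Hash> From<T> for Key {
    fn from(_value: T) -> Self { unimplemented!() }
}

// error[E0119]: conflicting implementations of trait `From<&[u8]>` for type `Key`
impl From<&[u8]> for Key {
    fn from(_bytes: &[u8]) -> Self { unimplemented!() }
}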

Member

I guess instead of using From<T>, having pub fn new<H: Hash>(h: H) -> Key as an associated function would work.

The use of From (Into) for conversion into Key is dictated by the html! macro and can be changed to something else.

Member

I don't see how it semantically makes sense to construct a Key from an abstract hashable type. I see the hashing as a performance improvement that acts as if it compared the actual values the Key was constructed from. I wouldn't expand this. Users can hash those types themselves before constructing the keys and wrestle with their own collision problems.

key_impl_from_to_string!(i64);
key_impl_from_to_string!(i128);
key_impl_from_to_string!(isize);
macro_rules! signed {
Member

Does this mean that Key::from(0_u8) == Key::from(0_u16) but Key::from(0_i8) != Key::from(0_u8)?

Author

Yeah, this can be remedied with zig-zag encoding, as mentioned under "Solving the shortcomings" above.

Member

Personally, I feel that passing a key of 0_u16 into an element / component whose last key was 0_u8, and having it treated as the same element / component, may not be intentional behaviour in most cases.

I would prefer to treat them as different values if a correct implementation could potentially negatively impact code size.

We can emit a warning if the type of key changes between renders in debug mode.

udoprog closed this by deleting the head repository on Oct 16, 2022

udoprog commented May 4, 2023

Uh, as a heads up, I didn't intend to close this. I removed my fork, not remembering that this PR was still open. But it didn't seem like anyone was moving on it anyway.
