Skip to content

UB with impl FromWasmAbi for char when unpaired surrogate is given #2269

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
alvinhochun opened this issue Aug 6, 2020 · 2 comments · Fixed by #3866
Closed

UB with impl FromWasmAbi for char when unpaired surrogate is given #2269

alvinhochun opened this issue Aug 6, 2020 · 2 comments · Fixed by #3866
Labels
breaking-change Tracking breaking changes for the next major version bump (if ever)

Comments

@alvinhochun
Copy link

Describe the Bug

Wasm-bindgen assumes the "char" (which is a JavaScript String) to be converted is always valid and uses char::from_u32_unchecked. However the JS String may contain a single "char" of an unpaired surrogate (for example when the JS code uses charAt), in which using char::from_u32_unchecked with it creates an invalid char and causes UB.

Steps to Reproduce

// Rust:
#[wasm_bindgen]
pub fn take_char_by_value(x: char) {}
// JS:
take_char_by_value('😃'.charAt(0));

Expected Behavior

I don't have an idea what behaviour would be the best. Probably use U+FFFD like how strings are handled? Perhaps if the type is Option<char> it should use char::from_u32 instead, but then this might cause ambiguity between null-ish and invalid char.

Actual Behavior

An invalid char is produced.

Additional Context

@alvinhochun alvinhochun added the bug Something isn't working label Aug 6, 2020
@alexcrichton
Copy link
Contributor

Oh dear, thanks for catching this! At the very least we should probably switch to using the replacement character. This probably means that char shouldn't have been exposed a wasm-bindgen type since it's lossy. An alternative would be to panic, but in general there's not really a great option available to us.

@alexcrichton alexcrichton added breaking-change Tracking breaking changes for the next major version bump (if ever) and removed bug Something isn't working labels Aug 6, 2020
@alvinhochun
Copy link
Author

In my case I ended up using JsString instead. I am interacting with a JS library that (mis)uses charAt in a stream interface that counts char. Since I can also request for the next UTF-16 code unit, I resorted to using char::decode_utf16 to manually decode any non-BMP scalar values.

I think panicking isn't a very good option but it might be necessary to help developers notice the issue. In any case, I can see it being a trap because it requires the JS side to properly split by code point (which the ES6 String Iterator does); but even then, neither ES6 String Iterator nor codePointAt would handle invalid surrogates properly so it would be quite easy to unintentionally trigger the panic. Beginners might misuse charAt or the bracket notation, or like in my case an external library might be using them on different assumptions.

At the very least, there should be a huge warning in the char page of the guide, perhaps also prefer JsString over char.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Tracking breaking changes for the next major version bump (if ever)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants