Don't write 0 to the end of strings in embind #10844

jgravelle-google · 2020-04-03T22:54:22Z

This is unsafe because it's one past the end of the string, and in threaded contexts it can cause data races.

kripken

Please update the docs in site/source/docs/api_reference/preamble.js.rst.

How well is this tested?

kripken · 2020-04-06T16:22:52Z

src/runtime_strings_extra.js

 #endif // TEXTDECODER == 2

-function UTF16ToString(ptr) {
+function UTF16ToString(ptr, maxCharsToRead) {


The utf8 version calls the param maxBytesToRead, not chars, is the difference intentional?

Yes, though I'm not sure what's best here. I have no idea how people expect to use UTF16 or 32 strings in practice. The C++ wstrings have a length that seems accurate? My understanding of unicode encodings is honestly pretty shaky.

This is also why I didn't change documentation; I really don't want people to use these. Which maybe isn't the best approach, granted. Does it make sense to leave a comment here saying "hey don't use this" but still keep the parameter to fix the issue? Is there a better way to avoid having to write a 0 byte?

The other question is a consistency one: does it make sense to change utf8 to maxChars, or is it actually bytes? I matched the names to my understanding, so thank you for noticing that.

Good points - I'm not sure about those either.

We could take the time to investigate this in depth, but I wonder if that's worth it? The by far important use case is UTF8. I'm not sure if 16 and 32 are even that important. If we do a little searching and don't find a reason not to, just saying embind only does UTF8 for now might be an option.

It looks like similar Dart and .NET APIs use bytes, so maybe we could do that too.

kripken · 2020-04-06T16:25:17Z

src/runtime_strings_extra.js


-function UTF16ToString(ptr) {
+function UTF16ToString(ptr, maxCharsToRead) {
+  if (maxCharsToRead === undefined) maxCharsToRead = Infinity;


I think we should be consistent with how UTF8ArrayToString does things, it has that trick,

// (As a tiny code save trick, compare endPtr against endIdx using a negation, so that undefined means Infinity)

We should either do them all that way, or none, I think, so it's consistent one way or the other. I lean towards doing the trick, with that documentation everywhere.

Makes sense (and I'd missed that comment so I was more befuddled than I needed to be when I first saw it), consistency makes sense.

jgravelle-google · 2020-04-09T17:27:00Z

How well is this tested?

There's some embind tests in test_core, and other.test_embind is pretty thorough. other.test_embind fails for all of the string types if I just remove the 0-byte-write, and passes with the rest of the change. That seemed reasonable to me, but there are probably more corner-cases in utf encodings that we don't test.

kripken · 2020-05-12T00:43:09Z

I think we should add proper embind + pthreads testing eventually, and for all the string types, but that doesn't need to block this PR. But the other comments above should be addressed I think.

dschuff · 2020-05-12T01:15:51Z

Yeah, test_embind looks pretty reasonable overall. Maybe we could address the 2 comments in code but not update the documents (i.e. leave the extra 'bytes' option undocumented for now) until we add a test that exercises it in a future PR?

juj · 2020-05-12T05:41:13Z

src/runtime_strings_extra.js

  while (1) {
    var utf32 = {{{ makeGetValue('ptr', 'i*4', 'i32') }}};
-    if (utf32 == 0) return str;
+    if (utf32 == 0 || i >= maxCharsToRead) return str;


Instead of having a while(1) with i >= maxCharsToRead, it would be better to move the index check to the while.

These changes to UTF16ToString and UTF32ToString regress code size. I am still keeping my hopes up that Closure compiler would one day be able to optimize it (google/closure-compiler#3193, google/closure-compiler#3201 and google/closure-compiler#3223). Could you change this to use the same compare against NaN/undefined trick that UTF8ToString uses to save code size and make the code (perhaps) optimizable by Closure in the future?

This is unsafe because it's one past the end of the string, and in threaded contexts it can cause data races

kripken

Looks good, thanks @jgravelle-google !

Note that I tested this locally with the asanjs PR, and the test (asan.test_embind_val) still fails. So while this fixes one set of issues, it looks like there might be more. It would be good to figure those out so we can be more sure of things here. Meanwhile in the asanjs PR I'll disable it on that test.

juj · 2020-05-14T06:32:05Z

src/runtime_strings_extra.js

    while (1) {
      var codeUnit = {{{ makeGetValue('ptr', 'i*2', 'i16') }}};
-      if (codeUnit == 0) return str;
+      if (codeUnit == 0 || i == maxBytesToRead / 2) return str;


Should this not be i == maxIdx? Having a division in the hot per-characterr loop would be bad for performance.

Also note the inconsistency in implementation between UTF16ToString() and UTF32ToString: this retains a while(1) whereas UTF32ToString switched to a while (!(i >= maxBytesToRead / 4)) {. This variant is better for code size and less branches (termination test is preserved in one if, whereas it is in two parts in UTF32ToString), let's use UTF16ToString() format in both?

+1 for a followup to make the code more consistent.

juj · 2020-05-14T06:32:32Z

src/runtime_strings_extra.js

-  while (1) {
+  // If maxBytesToRead is not passed explicitly, it will be undefined, and this
+  // will always evaluate to true. This saves on code size.
+  while (!(i >= maxBytesToRead / 4)) {


Likewise here, having a per-loop iteration division is not good for perf.

I think it is safe to assume the JS VM will do loop invariant code motion on such things, it's been standard for many years since the JS speed wars. And it can be more compact this way instead of defining a new variable (unless we also need that variable elsewhere).

juj · 2020-05-14T06:37:50Z

src/runtime_strings_extra.js

+  // If maxBytesToRead is not passed explicitly, it will be undefined, and this
+  // will always evaluate to true. This saves on code size.
+  while (!(i >= maxBytesToRead / 4)) {
    var utf32 = {{{ makeGetValue('ptr', 'i*4', 'i32') }}};


These makeGetValues could be demoted to direct HEAPU32/HEAPU16 reads for better performance and smaller code size, although could leave that for a later day as well.

The compiler turns those {{{ makeGetValue }}} things into HEAP32 at compile time, so there should be no speed or size difference.

I don't think it is able to optimize away the *4, so it will end up with a (x*4)>>2 in each iteration. Manually iterating over indices instead of iterating over bytes will be shorter code size wise and also faster too.

That's a good question, yeah, maybe not. Sounds good to iterate over HEAP32 indexes in a followup.

jgravelle-google requested a review from kripken April 3, 2020 22:54

kripken reviewed Apr 6, 2020

View reviewed changes

juj reviewed May 12, 2020

View reviewed changes

jgravelle-google added 2 commits May 13, 2020 10:51

Don't write 0 to the end of strings in embind

0ec8839

This is unsafe because it's one past the end of the string, and in threaded contexts it can cause data races

Code review feedback

422301c

jgravelle-google force-pushed the embind_wstring_threadsafe branch from 6f02a46 to 422301c Compare May 13, 2020 20:53

kripken approved these changes May 14, 2020

View reviewed changes

kripken merged commit cfd03f3 into emscripten-core:master May 14, 2020

kripken mentioned this pull request May 14, 2020

Fix asan.test_embind_val #11158

Closed

juj reviewed May 14, 2020

View reviewed changes

jiulongw mentioned this pull request Jul 9, 2020

ASAN violation in embind fromWireType for std::string #11598

Closed

Don't write 0 to the end of strings in embind #10844

Don't write 0 to the end of strings in embind #10844

Uh oh!

Conversation

jgravelle-google commented Apr 3, 2020

Uh oh!

kripken left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgravelle-google commented Apr 9, 2020

Uh oh!

kripken commented May 12, 2020

Uh oh!

dschuff commented May 12, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kripken left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juj May 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kripken left a comment •

edited

Loading

juj May 14, 2020 •

edited

Loading