- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.4k
Don't write 0 to the end of strings in embind #10844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't write 0 to the end of strings in embind #10844
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the docs in site/source/docs/api_reference/preamble.js.rst.
How well is this tested?
        
          
                src/runtime_strings_extra.js
              
                Outdated
          
        
      | #endif // TEXTDECODER == 2 | ||
|  | ||
| function UTF16ToString(ptr) { | ||
| function UTF16ToString(ptr, maxCharsToRead) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The utf8 version calls the param maxBytesToRead, not chars, is the difference intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though I'm not sure what's best here. I have no idea how people expect to use UTF16 or 32 strings in practice. The C++ wstrings have a length that seems accurate? My understanding of unicode encodings is honestly pretty shaky.
This is also why I didn't change documentation; I really don't want people to use these. Which maybe isn't the best approach, granted. Does it make sense to leave a comment here saying "hey don't use this" but still keep the parameter to fix the issue? Is there a better way to avoid having to write a 0 byte?
The other question is a consistency one: does it make sense to change utf8 to maxChars, or is it actually bytes? I matched the names to my understanding, so thank you for noticing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points - I'm not sure about those either.
We could take the time to investigate this in depth, but I wonder if that's worth it? The by far important use case is UTF8. I'm not sure if 16 and 32 are even that important. If we do a little searching and don't find a reason not to, just saying embind only does UTF8 for now might be an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like similar Dart and .NET APIs use bytes, so maybe we could do that too.
        
          
                src/runtime_strings_extra.js
              
                Outdated
          
        
      |  | ||
| function UTF16ToString(ptr) { | ||
| function UTF16ToString(ptr, maxCharsToRead) { | ||
| if (maxCharsToRead === undefined) maxCharsToRead = Infinity; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be consistent with how UTF8ArrayToString does things, it has that trick,
  // (As a tiny code save trick, compare endPtr against endIdx using a negation, so that undefined means Infinity)We should either do them all that way, or none, I think, so it's consistent one way or the other. I lean towards doing the trick, with that documentation everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense (and I'd missed that comment so I was more befuddled than I needed to be when I first saw it), consistency makes sense.
| 
 There's some embind tests in test_core, and other.test_embind is pretty thorough. other.test_embind fails for all of the string types if I just remove the 0-byte-write, and passes with the rest of the change. That seemed reasonable to me, but there are probably more corner-cases in utf encodings that we don't test. | 
| I think we should add proper embind + pthreads testing eventually, and for all the string types, but that doesn't need to block this PR. But the other comments above should be addressed I think. | 
| Yeah, test_embind looks pretty reasonable overall. Maybe we could address the 2 comments in code but not update the documents (i.e. leave the extra 'bytes' option undocumented for now) until we add a test that exercises it in a future PR? | 
        
          
                src/runtime_strings_extra.js
              
                Outdated
          
        
      | while (1) { | ||
| var utf32 = {{{ makeGetValue('ptr', 'i*4', 'i32') }}}; | ||
| if (utf32 == 0) return str; | ||
| if (utf32 == 0 || i >= maxCharsToRead) return str; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a while(1) with i >= maxCharsToRead, it would be better to move the index check to the while.
These changes to UTF16ToString and UTF32ToString regress code size. I am still keeping my hopes up that Closure compiler would one day be able to optimize it (google/closure-compiler#3193, google/closure-compiler#3201 and google/closure-compiler#3223). Could you change this to use the same compare against NaN/undefined trick that UTF8ToString uses to save code size and make the code (perhaps) optimizable by Closure in the future?
This is unsafe because it's one past the end of the string, and in threaded contexts it can cause data races
6f02a46    to
    422301c      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks @jgravelle-google !
Note that I tested this locally with the asanjs PR, and the test (asan.test_embind_val) still fails. So while this fixes one set of issues, it looks like there might be more. It would be good to figure those out so we can be more sure of things here. Meanwhile in the asanjs PR I'll disable it on that test.
| while (1) { | ||
| var codeUnit = {{{ makeGetValue('ptr', 'i*2', 'i16') }}}; | ||
| if (codeUnit == 0) return str; | ||
| if (codeUnit == 0 || i == maxBytesToRead / 2) return str; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this not be i == maxIdx? Having a division in the hot per-characterr loop would be bad for performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note the inconsistency in implementation between UTF16ToString() and UTF32ToString: this retains a while(1) whereas UTF32ToString switched to a while (!(i >= maxBytesToRead / 4)) {. This variant is better for code size and less branches (termination test is preserved in one if, whereas it is in two parts in UTF32ToString), let's use UTF16ToString() format in both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for a followup to make the code more consistent.
| while (1) { | ||
| // If maxBytesToRead is not passed explicitly, it will be undefined, and this | ||
| // will always evaluate to true. This saves on code size. | ||
| while (!(i >= maxBytesToRead / 4)) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise here, having a per-loop iteration division is not good for perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is safe to assume the JS VM will do loop invariant code motion on such things, it's been standard for many years since the JS speed wars. And it can be more compact this way instead of defining a new variable (unless we also need that variable elsewhere).
| // If maxBytesToRead is not passed explicitly, it will be undefined, and this | ||
| // will always evaluate to true. This saves on code size. | ||
| while (!(i >= maxBytesToRead / 4)) { | ||
| var utf32 = {{{ makeGetValue('ptr', 'i*4', 'i32') }}}; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These makeGetValues could be demoted to direct HEAPU32/HEAPU16 reads for better performance and smaller code size, although could leave that for a later day as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler turns those {{{ makeGetValue }}} things into HEAP32 at compile time, so there should be no speed or size difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is able to optimize away the *4, so it will end up with a (x*4)>>2 in each iteration. Manually iterating over indices instead of iterating over bytes will be shorter code size wise and also faster too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question, yeah, maybe not. Sounds good to iterate over HEAP32 indexes in a followup.
This is unsafe because it's one past the end of the string, and in threaded contexts it can cause data races.