Skip to content

Conversation

@jgravelle-google
Copy link
Contributor

This is unsafe because it's one past the end of the string, and in threaded contexts it can cause data races.

@jgravelle-google jgravelle-google requested a review from kripken April 3, 2020 22:54
Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the docs in site/source/docs/api_reference/preamble.js.rst.

How well is this tested?

#endif // TEXTDECODER == 2

function UTF16ToString(ptr) {
function UTF16ToString(ptr, maxCharsToRead) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The utf8 version calls the param maxBytesToRead, not chars, is the difference intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though I'm not sure what's best here. I have no idea how people expect to use UTF16 or 32 strings in practice. The C++ wstrings have a length that seems accurate? My understanding of unicode encodings is honestly pretty shaky.

This is also why I didn't change documentation; I really don't want people to use these. Which maybe isn't the best approach, granted. Does it make sense to leave a comment here saying "hey don't use this" but still keep the parameter to fix the issue? Is there a better way to avoid having to write a 0 byte?

The other question is a consistency one: does it make sense to change utf8 to maxChars, or is it actually bytes? I matched the names to my understanding, so thank you for noticing that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points - I'm not sure about those either.

We could take the time to investigate this in depth, but I wonder if that's worth it? The by far important use case is UTF8. I'm not sure if 16 and 32 are even that important. If we do a little searching and don't find a reason not to, just saying embind only does UTF8 for now might be an option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like similar Dart and .NET APIs use bytes, so maybe we could do that too.


function UTF16ToString(ptr) {
function UTF16ToString(ptr, maxCharsToRead) {
if (maxCharsToRead === undefined) maxCharsToRead = Infinity;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be consistent with how UTF8ArrayToString does things, it has that trick,

  // (As a tiny code save trick, compare endPtr against endIdx using a negation, so that undefined means Infinity)

We should either do them all that way, or none, I think, so it's consistent one way or the other. I lean towards doing the trick, with that documentation everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense (and I'd missed that comment so I was more befuddled than I needed to be when I first saw it), consistency makes sense.

@jgravelle-google
Copy link
Contributor Author

How well is this tested?

There's some embind tests in test_core, and other.test_embind is pretty thorough. other.test_embind fails for all of the string types if I just remove the 0-byte-write, and passes with the rest of the change. That seemed reasonable to me, but there are probably more corner-cases in utf encodings that we don't test.

@kripken
Copy link
Member

kripken commented May 12, 2020

I think we should add proper embind + pthreads testing eventually, and for all the string types, but that doesn't need to block this PR. But the other comments above should be addressed I think.

@dschuff
Copy link
Member

dschuff commented May 12, 2020

Yeah, test_embind looks pretty reasonable overall. Maybe we could address the 2 comments in code but not update the documents (i.e. leave the extra 'bytes' option undocumented for now) until we add a test that exercises it in a future PR?

while (1) {
var utf32 = {{{ makeGetValue('ptr', 'i*4', 'i32') }}};
if (utf32 == 0) return str;
if (utf32 == 0 || i >= maxCharsToRead) return str;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a while(1) with i >= maxCharsToRead, it would be better to move the index check to the while.

These changes to UTF16ToString and UTF32ToString regress code size. I am still keeping my hopes up that Closure compiler would one day be able to optimize it (google/closure-compiler#3193, google/closure-compiler#3201 and google/closure-compiler#3223). Could you change this to use the same compare against NaN/undefined trick that UTF8ToString uses to save code size and make the code (perhaps) optimizable by Closure in the future?

This is unsafe because it's one past the end of the string, and in
threaded contexts it can cause data races
@jgravelle-google jgravelle-google force-pushed the embind_wstring_threadsafe branch from 6f02a46 to 422301c Compare May 13, 2020 20:53
Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @jgravelle-google !

Note that I tested this locally with the asanjs PR, and the test (asan.test_embind_val) still fails. So while this fixes one set of issues, it looks like there might be more. It would be good to figure those out so we can be more sure of things here. Meanwhile in the asanjs PR I'll disable it on that test.

@kripken kripken merged commit cfd03f3 into emscripten-core:master May 14, 2020
while (1) {
var codeUnit = {{{ makeGetValue('ptr', 'i*2', 'i16') }}};
if (codeUnit == 0) return str;
if (codeUnit == 0 || i == maxBytesToRead / 2) return str;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be i == maxIdx? Having a division in the hot per-characterr loop would be bad for performance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note the inconsistency in implementation between UTF16ToString() and UTF32ToString: this retains a while(1) whereas UTF32ToString switched to a while (!(i >= maxBytesToRead / 4)) {. This variant is better for code size and less branches (termination test is preserved in one if, whereas it is in two parts in UTF32ToString), let's use UTF16ToString() format in both?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a followup to make the code more consistent.

while (1) {
// If maxBytesToRead is not passed explicitly, it will be undefined, and this
// will always evaluate to true. This saves on code size.
while (!(i >= maxBytesToRead / 4)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here, having a per-loop iteration division is not good for perf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is safe to assume the JS VM will do loop invariant code motion on such things, it's been standard for many years since the JS speed wars. And it can be more compact this way instead of defining a new variable (unless we also need that variable elsewhere).

// If maxBytesToRead is not passed explicitly, it will be undefined, and this
// will always evaluate to true. This saves on code size.
while (!(i >= maxBytesToRead / 4)) {
var utf32 = {{{ makeGetValue('ptr', 'i*4', 'i32') }}};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These makeGetValues could be demoted to direct HEAPU32/HEAPU16 reads for better performance and smaller code size, although could leave that for a later day as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler turns those {{{ makeGetValue }}} things into HEAP32 at compile time, so there should be no speed or size difference.

Copy link
Collaborator

@juj juj May 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is able to optimize away the *4, so it will end up with a (x*4)>>2 in each iteration. Manually iterating over indices instead of iterating over bytes will be shorter code size wise and also faster too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question, yeah, maybe not. Sounds good to iterate over HEAP32 indexes in a followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants