Skip to content

Conversation

bnoordhuis
Copy link
Member

@bnoordhuis bnoordhuis commented Feb 2, 2017

Detect it when source files in lib/ are not ASCII. Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

CI: https://ci.nodejs.org/job/node-test-pull-request/6176/

@nodejs-github-bot nodejs-github-bot added build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++. labels Feb 2, 2017
@bnoordhuis
Copy link
Member Author

I rather feared this approach may be a bit too forward-looking for some compilers... I'll see if I can dumb it down.

tools/js2c.py Outdated
# Treat non-ASCII as UTF-8 and convert it to UTF-16.
if any(ord(c) > 127 for c in lines):
ctype = 'uint16_t'
data = map(ord, lines.decode('utf-8').encode('utf-16be'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe 'utf-16be' if sys.byteorder == 'little' else 'utf-16le'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big-endian encoding is intentional, the line below stitches the individual octets together again into uint16s. (.encode() returns a byte string in case you're wondering.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Right.

@jasnell
Copy link
Member

jasnell commented Feb 2, 2017

Just curious, have there been specific issues or is this purely preventative?

@aqrln
Copy link
Contributor

aqrln commented Feb 2, 2017

@jasnell yes, reported in #10673

@gibfahn
Copy link
Member

gibfahn commented Feb 2, 2017

What's the reason for using UTF-16 rather than UTF-8 (out of interest) to store the files in the binary?

@jasnell
Copy link
Member

jasnell commented Feb 2, 2017

I would have to assume it is because the external string API does not directly support UTF8. That is, if you look at ExternalStringResource, it assumes a utf16_t* buffer while ExternalOneByteStringResource assumes const char*. If the strings were stored as UTF8 we would incur an additional cost at startup that is not necessary.

@aqrln
Copy link
Contributor

aqrln commented Feb 2, 2017

@gibfahn so that they can be used as external strings. UTF-8 strings must be copied to V8's managed memory. See #5458 by @bnoordhuis.

using v8::Object;
using v8::String;

template <typename T, size_t N, T P>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny suggestion: Could we be a bit more explicit and rename P to Ptr or Pointer? 😄

@aqrln
Copy link
Contributor

aqrln commented Feb 3, 2017

@bnoordhuis ping. Did you have time to dumb it down? 😉 This patch seems to work on GCC 4.8+.

Diff

diff --git a/src/node_javascript.cc b/src/node_javascript.cc
index 8c665381db..07a0daaa83 100644
--- a/src/node_javascript.cc
+++ b/src/node_javascript.cc
@@ -16,16 +16,16 @@ using v8::String;
 template <typename T, size_t N, T P>
 struct ExternalStringResource;
 
-template <size_t N, const char (&P)[N]>
-struct ExternalStringResource<const char[N], N, P>
+template <size_t N, const char* P>
+struct ExternalStringResource<const char*, N, P>
     : public String::ExternalOneByteStringResource {
   const char* data() const override { return P; }
   size_t length() const override { return N; }
   void Dispose() override { /* Default calls `delete this`. */ }
 };
 
-template <size_t N, const uint16_t (&P)[N]>
-struct ExternalStringResource<const uint16_t[N], N, P>
+template <size_t N, const uint16_t* P>
+struct ExternalStringResource<const uint16_t*, N, P>
     : public String::ExternalStringResource {
   const uint16_t* data() const override { return P; }
   size_t length() const override { return N; }
@@ -34,7 +34,7 @@ struct ExternalStringResource<const uint16_t[N], N, P>
 
 // id##_data is defined in node_natives.h.
 #define V(id) \
-  static ExternalStringResource<decltype(id##_data),                          \
+  static ExternalStringResource<decltype(&id##_data[0]),                      \
                                 arraysize(id##_data),                         \
                                 id##_data> id##_external_data;
 NODE_NATIVES_MAP(V)
diff --git a/tools/js2c.py b/tools/js2c.py
index e6c56cf075..8104e50e5a 100755
--- a/tools/js2c.py
+++ b/tools/js2c.py
@@ -189,7 +189,7 @@ NODE_NATIVES_MAP = """\
 SOURCES = """\
 static const uint8_t {escaped_id}_name[] = {{
 {name}}};
-static const {ctype} {escaped_id}_data[] = {{
+static constexpr const {ctype} {escaped_id}_data[] = {{
 {data}}};
 """
 

@bnoordhuis
Copy link
Member Author

bnoordhuis commented Feb 6, 2017

@aqrln Thanks. I tried a couple of different approaches but I ended up settling for something that's pretty similar to your patch. :-)

I added a second commit that replaces a few U+2019 quotes with ASCII quotes so that their files can be stored as one-byte strings. Can I get a quick LGTM or two?

EDIT: The CI scoreboard is currently linking to a previous CI build. The correct one is https://ci.nodejs.org/job/node-test-commit/7713/.

@richardlau
Copy link
Member

I added a second commit that replaces a few U+2019 quotes with ASCII quotes so that their files can be stored as one-byte strings.

Probably one for another PR, but this sounds like a good candidate for a lint rule.

@bnoordhuis
Copy link
Member Author

bnoordhuis commented Feb 6, 2017

Interesting VS build failure...

an expression involving objects with internal linkage cannot be used as a non-type argument

I suppose the compiler is technically correct, and wasn't that the best kind of correct?

EDIT: New attempt: https://ci.nodejs.org/job/node-test-pull-request/6243/

@bnoordhuis
Copy link
Member Author

bnoordhuis commented Feb 11, 2017

Different take: moved all the logic to js2c.py.

CI: https://ci.nodejs.org/job/node-test-pull-request/6357/

EDIT: Sigh...

node_javascript.cc(55276): error C2466: cannot allocate an array of constant size 0

@bnoordhuis
Copy link
Member Author

CI is finally green except for that silly issue where the ARM buildbots don't report their status properly.

@addaleax The PR changed quite a bit, perhaps you want to take another look?

node.gyp Outdated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd expect INTERMEDIATE_DIR to work here but it doesn't: js2c generates the file but the build doesn't pick it up (and I updated the 'sources' list in case you're wondering.) I decided to leave well enough alone for now.

Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: nodejs#10673
PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
The previous commit stores baked-in files with non-ASCII characters
as UTF-16.  Replace the \u2019 with a regular quote character so that
the files they're in can be stored as one-byte strings.  The UTF-16
functionality is still tested by the Unicode diagram in lib/timers.js.

PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@bnoordhuis bnoordhuis closed this Feb 13, 2017
@bnoordhuis bnoordhuis deleted the fix10673 branch February 13, 2017 18:24
@bnoordhuis bnoordhuis merged commit fd8587e into nodejs:master Feb 13, 2017
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 14, 2017
Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: nodejs#10673
PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 14, 2017
The previous commit stores baked-in files with non-ASCII characters
as UTF-16.  Replace the \u2019 with a regular quote character so that
the files they're in can be stored as one-byte strings.  The UTF-16
functionality is still tested by the Unicode diagram in lib/timers.js.

PR-URL: nodejs#11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
aqrln added a commit to aqrln/node that referenced this pull request Feb 16, 2017
In order to allow using Unicode characters inside comments of built-in
JavaScript libraries without forcing them to be stored as UTF-16 data in
Node's binary, update the tooling to strip comments during build
process. All line breaks are preserved so that line numbers in stack
traces aren't broken.

Refs: nodejs#11129
Refs: nodejs#11371 (comment)
aqrln added a commit to aqrln/node that referenced this pull request Feb 16, 2017
This test ensures that UTF-8 characters can be used in core JavaScript
modules built into Node's binary.

Refs: nodejs#11129
italoacasas pushed a commit that referenced this pull request Feb 16, 2017
Notable changes:

* deps:
    * update V8 to 5.5 (Michaël Zasso) [#11029](#11029)
    * upgrade libuv to 1.11.0 (cjihrig) [#11094](#11094)
    * add node-inspect 1.10.2 (Jan Krems) [#10187](#10187)
* lib: build `node inspect` into `node` (Anna Henningsen) [#10187](#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](#9469)
* inspector: add --inspect-brk (Josh Gavant) [#11149](#11149)
* fs: allow WHATWG URL and file: URLs as paths (James M Snell) [#10739](#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [#10857](#10857)

PR-URL: #11185
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 21, 2017
Notable changes:

* deps:
    * update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029)
    * upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094)
    * add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187)
    * upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469)
* inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857)

PR-URL: nodejs#11185
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 22, 2017
Notable changes:

* deps:
    * update V8 to 5.5 (Michaël Zasso) [nodejs#11029](nodejs#11029)
    * upgrade libuv to 1.11.0 (cjihrig) [nodejs#11094](nodejs#11094)
    * add node-inspect 1.10.4 (Jan Krems) [nodejs#10187](nodejs#10187)
    * upgrade zlib to 1.2.11 (Sam Roberts) [nodejs#10980](nodejs#10980)
* lib: build `node inspect` into `node` (Anna Henningsen) [nodejs#10187](nodejs#10187)
* crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [nodejs#9469](nodejs#9469)
* inspector: add --inspect-brk (Josh Gavant) [nodejs#11149](nodejs#11149)
* fs: allow WHATWG URL objects as paths (James M Snell) [nodejs#10739](nodejs#10739)
* src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [nodejs#11129](nodejs#11129)
* url: extend url.format to support WHATWG URL (James M Snell) [nodejs#10857](nodejs#10857)

PR-URL: nodejs#11185
imyller added a commit to imyller/meta-nodejs that referenced this pull request Mar 2, 2017
    Notable changes:

    * deps:
        * update V8 to 5.5 (Michaël Zasso) [#11029](nodejs/node#11029)
        * upgrade libuv to 1.11.0 (cjihrig) [#11094](nodejs/node#11094)
        * add node-inspect 1.10.4 (Jan Krems) [#10187](nodejs/node#10187)
        * upgrade zlib to 1.2.11 (Sam Roberts) [#10980](nodejs/node#10980)
    * lib: build `node inspect` into `node` (Anna Henningsen) [#10187](nodejs/node#10187)
    * crypto: Remove expired certs from CNNIC whitelist (Shigeki Ohtsu) [#9469](nodejs/node#9469)
    * inspector: add --inspect-brk (Josh Gavant) [#11149](nodejs/node#11149)
    * fs: allow WHATWG URL objects as paths (James M Snell) [#10739](nodejs/node#10739)
    * src: support UTF-8 in compiled-in JS source files (Ben Noordhuis) [#11129](nodejs/node#11129)
    * url: extend url.format to support WHATWG URL (James M Snell) [#10857](nodejs/node#10857)

    PR-URL: nodejs/node#11185

Signed-off-by: Ilkka Myller <[email protected]>
@jasnell
Copy link
Member

jasnell commented Mar 7, 2017

Needs a backport PR to land in v4.
Only one of the two commits was cherry-picked into v6 (the second one wasn't needed)

jasnell pushed a commit that referenced this pull request Mar 7, 2017
Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: #10673
PR-URL: #11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this pull request Mar 9, 2017
Detect it when source files in lib/ are not ASCII.  Decode them as UTF-8
and store them as UTF-16 in the binary so they can be used as external
string resources without non-ASCII characters getting mangled.

Fixes: #10673
PR-URL: #11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@MylesBorins MylesBorins mentioned this pull request Mar 9, 2017
jasnell pushed a commit that referenced this pull request Apr 4, 2017
This test ensures that UTF-8 characters can be used in core JavaScript
modules built into Node's binary.

PR-URL: #11423
Ref: #11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: James M Snell <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this pull request Apr 10, 2017
This test ensures that UTF-8 characters can be used in core JavaScript
modules built into Node's binary.

PR-URL: nodejs#11423
Ref: nodejs#11129
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. c++ Issues and PRs that require attention from people who are familiar with C++.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants