Add support for all Unicode codepoints of the BMP in the JSON parser [TG-7634] #4594
Conversation
Force-pushed 645a7c3 to 040d31b
src/util/unicode.h (Outdated)
@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \u0041)
It seems you will need to escape this for Doxygen to be happy.
PRECONDITION(hex.length() == 4);
const char16_t utf16_char = std::strtol(hex.c_str(), nullptr, 16);
const std::u16string u16(1, utf16_char);
return std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{}
This is failing with a linker error with Visual Studio. It's a known problem, and https://stackoverflow.com/questions/32055357/visual-studio-c-2015-stdcodecvt-with-char16-t-or-char32-t discusses some fixes. It seems that using `wchar_t` instead of `char16_t` might be the easiest of those, if that does actually work. But it would need to be wrapped in an `#ifdef _MSC_VER`.
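A minimal sketch of the workaround being suggested, assuming the Stack Overflow fix applies here (the function name and structure are illustrative, not the PR's actual code):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Illustrative sketch, not the PR's actual code: on MSVC, instantiate the
// converter with wchar_t (which is 16-bit on Windows) to sidestep the known
// char16_t linker bug; on other compilers, keep using char16_t directly.
std::string utf16_unit_to_utf8(char16_t c)
{
#ifdef _MSC_VER
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;
  return converter.to_bytes(static_cast<wchar_t>(c));
#else
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
  return converter.to_bytes(c);
#endif
}
```

Note that `std::wstring_convert` is deprecated in C++17, so this is only a stopgap on either branch of the `#ifdef`.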
I tried the solution that uses `wchar_t`, but the AWS Windows build is still failing, and I can't tell from the logs whether it's because of this or for some other reason... Can you tell from the logs what the problem is?
Turns out the problem was a compilation error that, for some reason, did not show up anywhere in the AWS build logs.
Now everything builds fine, but tests are failing - again only on Windows...
The problem was the encoding of string literals, which can vary between compilers. Should all be fixed now!
unit/json/json_parser.cpp (Outdated)
@@ -84,17 +84,21 @@ SCENARIO("Loading JSON files")
      }
    }
  }
-  GIVEN("A JSON file containing a hexadecimal Unicode character")
+  GIVEN("A JSON file containing hexadecimal Unicode characters")
⛏️ The use of "characters" here is a bit confusing. I would talk about symbols (or code points) for Unicode and only use characters for the C++ `char` type 🍔
src/util/unicode.h (Outdated)
@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
🍔
src/util/unicode.h (Outdated)
@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \u0041)
/// \return the UTF-8 encoding of the character.
🍔
Force-pushed 040d31b to e5a260e
Nitpick: chars between 00 and ff are not UTF-8 (which can represent all Unicode characters, perhaps with many bytes) -- better to just say codepoints <= 255 or 0xff. Similarly, codepoints <= 0xffff are the basic multilingual plane (BMP), but not UTF-16, which again can represent all codepoints, perhaps using multiple wchars. Therefore I suggest twiddling the PR description / commit messages to clarify that this is adding support for all BMP Unicode codepoints.
src/json/parser.y (Outdated)
@@ -51,7 +53,7 @@ static std::string convert_TOK_STRING()
  // \uABCD, i.e. the following four digits are part of this character.
  assert(p + 4 < yyjsontext + len - 1);
  std::string hex(++p, 4);
-  result += std::stoi(hex, nullptr, 16);
+  result += utf16_hex_to_utf8(hex);
Rename to codepoint_hex_to_utf8 -- the difference is that UTF-16 implies the possibility of two 16-bit numbers representing a single codepoint, such as U+10437 == 0xD801 0xDC37
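The surrogate-pair distinction the reviewer is drawing can be sketched like this (a hypothetical helper for illustration, not part of the PR):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper illustrating the reviewer's point: a codepoint outside
// the BMP is represented in UTF-16 by two 16-bit units (a surrogate pair),
// so a single four-digit \uXXXX escape cannot denote it on its own.
std::pair<uint16_t, uint16_t> to_surrogate_pair(uint32_t codepoint)
{
  assert(codepoint >= 0x10000 && codepoint <= 0x10FFFF);
  const uint32_t v = codepoint - 0x10000;
  const uint16_t high = 0xD800 + (v >> 10);  // top 10 bits
  const uint16_t low = 0xDC00 + (v & 0x3FF); // bottom 10 bits
  return {high, low};
}
```

For U+10437 this yields 0xD801 and 0xDC37, matching the reviewer's example; within the BMP no pair is needed, which is why "codepoint" is the accurate term for what the function handles.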
src/util/unicode.cpp (Outdated)
@@ -315,3 +318,25 @@ std::string utf16_native_endian_to_java(const std::wstring &in)
    utf16_native_endian_to_java(ch, result, loc);
  return result.str();
}

char16_t utf16_hex_to_utf16_native_endian(const std::string &hex)
Again, codepoint rather than utf16
src/util/unicode.h (Outdated)

std::string utf16_native_endian_to_utf8(char16_t utf16_char);

/// \param hex: a representation of a UTF-16 symbol as a four-digit string
of a BMP codepoint
Force-pushed e5a260e to afe2843
Force-pushed 87b8203 to 21c4ee6
The previous implementation only supported codepoints up to 0x7f as characters, and all remaining codepoints up to 0xff as integers. The new implementation supports all codepoints in the BMP, i.e. up to 0xffff.
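The new behaviour can be sketched as a hand-rolled UTF-8 encoder for BMP codepoints (an illustrative sketch of the encoding rules only; the PR itself converts via `std::wstring_convert`):

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch of what "supporting all BMP codepoints" means at the
// byte level: codepoints <= 0x7f take one UTF-8 byte, <= 0x7ff take two,
// and the rest of the BMP (<= 0xffff) takes three.
std::string bmp_codepoint_to_utf8(uint16_t code)
{
  std::string out;
  if(code <= 0x7f)
    out += static_cast<char>(code);
  else if(code <= 0x7ff)
  {
    out += static_cast<char>(0xc0 | (code >> 6));
    out += static_cast<char>(0x80 | (code & 0x3f));
  }
  else
  {
    out += static_cast<char>(0xe0 | (code >> 12));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3f));
    out += static_cast<char>(0x80 | (code & 0x3f));
  }
  return out;
}
```

The old behaviour corresponds to taking only the first branch (and appending the raw value for 0x80..0xff); the fix covers all three.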
A named variable is clearer than referring to the value twice in relation to p, where p changes in between.
All review comments have been addressed. I split the original "hex to UTF-8" function into four tiny functions, as this makes it easier to reuse them (and I'm planning to use one of them in some new code soon). All four functions are documented (@smowton could you check that the terminology in the Doxygen makes sense?). Some of the string literals in the unit test have been specified to use UTF-8 encoding, as the Windows compiler used a different encoding by default.
Force-pushed 21c4ee6 to 4d502e7
I'd probably clarify that while anything taking a `char16_t` is converting UTF-16 to UTF-8, it's not tackling the whole of UTF-16, which can use low / high surrogate pairs (two successive `char16_t`s) to represent characters outside the BMP (i.e. those from U+10000 upwards). So really they're converting a UTF-16 BMP char... which is always just equal to the codepoint number. Otherwise lgtm.
(In case that was a bit confused... what I'm saying is we're handling the subset of Unicode where the UTF-16 <-> codepoint number transformation is trivial.)
This is a follow-up PR to #4590, which implemented support in the JSON parser for hexadecimal Unicode representations (`\uABCD`) but was limited to codepoints <= `0x7f` (with values from `0x80` to `0xff` also sort of working, representing them as integers rather than characters). This improved implementation supports all values from `\u0000` to `\uFFFF`, i.e. all codepoints in the BMP.