Add support for all Unicode codepoints of the BMP in the JSON parser [TG-7634] #4594


Merged: 2 commits merged into diffblue:develop from antonia/json-parsing-utf16 on May 2, 2019

Conversation

@antlechner (Contributor) commented on Apr 30, 2019:

This is a follow-up to PR #4590, which implemented support in the JSON parser for hexadecimal Unicode representations (\uABCD) but was limited to codepoints <= 0x7f (values from 0x80 to 0xff partially worked, but were represented as integers rather than characters).
This improved implementation supports all values from \u0000 to \uFFFF, i.e. all codepoints in the BMP.
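
For reference, a minimal standalone sketch (my own illustration, not the code in this PR, which goes through the standard library as discussed below) of how a BMP codepoint maps to UTF-8 bytes:

#include <cstdint>
#include <string>

// Encode a BMP codepoint (0x0000 to 0xFFFF) as UTF-8:
// one byte below 0x80, two bytes below 0x800, three bytes otherwise.
// (A production version should reject the surrogate range 0xD800-0xDFFF,
// which are not valid codepoints on their own.)
std::string bmp_codepoint_to_utf8(uint16_t cp)
{
  std::string out;
  if(cp < 0x80)
    out += static_cast<char>(cp);
  else if(cp < 0x800)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}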

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user-visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/.
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • n/a My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • n/a White-space or formatting changes outside the feature-related changed lines are in commits of their own.

@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \u0041)
Review comment (Collaborator):

It seems you will need to escape this for Doxygen to be happy.
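
Presumably something along these lines (my guess at the fix; in Doxygen, a doubled backslash renders a literal backslash so that \u0041 is not parsed as a Doxygen command):

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \\u0041)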

PRECONDITION(hex.length() == 4);
const char16_t utf16_char = std::strtol(hex.c_str(), nullptr, 16);
const std::u16string u16(1, utf16_char);
return std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{}
  .to_bytes(u16);
Review comment (Collaborator):

This is failing with a linker error with Visual Studio. It's a known problem and https://stackoverflow.com/questions/32055357/visual-studio-c-2015-stdcodecvt-with-char16-t-or-char32-t discusses some fixes. It seems that using wchar_t instead of char16_t might be the easiest of those, if that does actually work. But it would need to be wrapped in an #ifdef _MSC_VER.
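
A sketch of what that workaround might look like (an untested assumption based on the linked Stack Overflow discussion; the code that finally landed may differ):

#include <codecvt>
#include <locale>
#include <string>

std::string utf16_native_endian_to_utf8(char16_t utf16_char)
{
#ifdef _MSC_VER
  // MSVC fails to export the char16_t specialisation of std::codecvt,
  // so route the conversion through wchar_t (16 bits wide on Windows).
  const std::wstring wide(1, static_cast<wchar_t>(utf16_char));
  return std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t>{}
    .to_bytes(wide);
#else
  const std::u16string u16(1, utf16_char);
  return std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{}
    .to_bytes(u16);
#endif
}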

@antlechner (Contributor, Author) replied:

I tried the solution that uses wchar_t, but the AWS Windows build is still failing, and I can't tell from the logs whether it's because of this or for some other reason... Can you tell from the logs what the problem is?

@antlechner (Contributor, Author) replied:

Turns out the problem was a compilation error that, for some reason, did not show up anywhere in the AWS build logs.
Now everything builds fine, but tests are failing - again only on Windows...

@antlechner (Contributor, Author) replied:

The problem was the encoding of string literals, which can vary between compilers. Should all be fixed now!
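
For anyone hitting the same issue: the fix presumably relies on C++11 u8 string literals, which pin a literal's encoding to UTF-8 regardless of the compiler's default (MSVC otherwise uses the local code page). A hypothetical example:

// Without the u8 prefix, the bytes of this literal depend on the
// compiler's execution character set; with it, they are always the
// UTF-8 sequence {0xC3, 0xBC} for U+00FC.
const std::string expected = u8"\u00fc";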

@@ -84,17 +84,21 @@ SCENARIO("Loading JSON files")
}
}
}
GIVEN("A JSON file containing a hexadecimal Unicode character")
GIVEN("A JSON file containing hexadecimal Unicode characters")
Review comment (Contributor):

⛏️ The use of "characters" here is a bit confusing. I would talk about symbols (or code points) for Unicode and only use "characters" for the C++ char type 🍔

@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

/// \param hex: a representation of a UTF-16 character as a four-digit string
Review comment (Contributor):

🍔

@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \u0041)
/// \return the UTF-8 encoding of the character.
Review comment (Contributor):

🍔

@smowton (Contributor) commented on May 1, 2019:

Nitpick: codepoints between 0x00 and 0xff are not "UTF-8" (UTF-8 can represent all Unicode codepoints, perhaps with many bytes) -- better to just say codepoints <= 255 or 0xff. Similarly, codepoints <= 0xffff are the basic multilingual plane (BMP), but not "UTF-16", which again can represent all codepoints, perhaps using multiple 16-bit code units. I therefore suggest tweaking the PR description / commit messages to clarify that this is adding support for all BMP Unicode codepoints.

@@ -51,7 +53,7 @@ static std::string convert_TOK_STRING()
// \uABCD, i.e. the following four digits are part of this character.
assert(p + 4 < yyjsontext + len - 1);
std::string hex(++p, 4);
-result += std::stoi(hex, nullptr, 16);
+result += utf16_hex_to_utf8(hex);
Review comment (Contributor):

Rename to codepoint_hex_to_utf8 -- the difference is that UTF-16 implies the possibility of two 16-bit units representing a single codepoint, such as U+10437 == 0xD801 0xDC37.
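
To illustrate the distinction (my own worked example, not part of the diff): a codepoint above the BMP needs a surrogate pair in UTF-16, whereas any BMP codepoint fits in a single char16_t.

// UTF-16 encoding of U+10437, which lies above the BMP:
const char32_t codepoint = 0x10437;
const char32_t v = codepoint - 0x10000;     // 20 bits: 0x00437
const char16_t high = 0xD800 | (v >> 10);   // 0xD801 (high surrogate)
const char16_t low = 0xDC00 | (v & 0x3FF);  // 0xDC37 (low surrogate)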

@@ -315,3 +318,25 @@ std::string utf16_native_endian_to_java(const std::wstring &in)
utf16_native_endian_to_java(ch, result, loc);
return result.str();
}

char16_t utf16_hex_to_utf16_native_endian(const std::string &hex)
Review comment (Contributor):

Again, codepoint rather than utf16


std::string utf16_native_endian_to_utf8(char16_t utf16_char);

/// \param hex: a representation of a UTF-16 symbol as a four-digit string
Review comment (Contributor):

of a BMP codepoint

@antlechner force-pushed the antonia/json-parsing-utf16 branch from e5a260e to afe2843 on May 1, 2019 at 11:15
@antlechner changed the title from "Add support for UTF-16 in the JSON parser" to "Add support for all Unicode codepoints of the BMP in the JSON parser" on May 1, 2019
@antlechner force-pushed the antonia/json-parsing-utf16 branch 4 times, most recently from 87b8203 to 21c4ee6 on May 1, 2019 at 15:03
antlechner added 2 commits on May 2, 2019 at 10:37:

1. The previous implementation only supported codepoints up to 0x7f as
characters, and all remaining codepoints up to 0xff as integers.
The new implementation supports all codepoints in the BMP, i.e. up to
0xffff.

2. A named variable is clearer than referring to the value twice in
relation to p, where p changes in between.
@antlechner (Contributor, Author) commented:

All review comments have been addressed. I split the original "hex to UTF-8" function into four tiny functions, as this makes it easier to reuse them (and I'm planning to use one of them in some new code soon). All four functions are documented (@smowton, could you check that the terminology in the Doxygen comments makes sense?). Some of the string literals in the unit test have been explicitly specified as UTF-8, because the Windows compiler used a different encoding by default.
I also addressed @smowton's review comment from #4590: I used a variable name instead of comments to (hopefully) clarify how the parser processes Unicode representations.
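
A rough sketch of how such a decomposition might look (the names below are partly taken from the diff excerpts above and partly my own guesses, so they need not match the final code exactly):

// Parse a four-digit hex string such as "0041" into one native-endian
// UTF-16 code unit.
char16_t codepoint_hex_to_utf16_native_endian(const std::string &hex);
// Convert one BMP code unit, or a whole UTF-16 string, to UTF-8.
std::string utf16_native_endian_to_utf8(char16_t utf16_char);
std::string utf16_native_endian_to_utf8(const std::u16string &utf16_str);
// Composition of the above: hex digits straight to UTF-8.
std::string codepoint_hex_to_utf8(const std::string &hex);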

@antlechner changed the title from "Add support for all Unicode codepoints of the BMP in the JSON parser" to "Add support for all Unicode codepoints of the BMP in the JSON parser [TG-7634]" on May 2, 2019
@antlechner force-pushed the antonia/json-parsing-utf16 branch from 21c4ee6 to 4d502e7 on May 2, 2019 at 09:49
@antlechner removed their assignment on May 2, 2019
@smowton (Contributor) left a review comment:

I'd probably clarify that while anything taking a char16_t is converting UTF-16 to UTF-8, it's not tackling the whole of UTF-16, which can use low/high surrogate pairs (two successive char16_ts) to represent characters outside the BMP (i.e. those from U+10000 upwards). So really they're converting a UTF-16 BMP char... which is always just equal to the codepoint number. Otherwise lgtm.

@smowton (Contributor) commented on May 2, 2019:

(In case that was a bit confusing... what I'm saying is that we're handling the subset of Unicode where the UTF-16 <-> codepoint number transformation is trivial (int codepoint = 0x1234; char16_t utf16_codepoint = (char16_t)codepoint;). By contrast, those characters that don't fit into a char16_t need a more involved transformation.)
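
A tiny sketch (again my own illustration) of when that trivial transformation is valid: a codepoint fits into a single char16_t exactly when it lies in the BMP and outside the surrogate range, which UTF-16 reserves for encoding the higher planes.

#include <cstdint>

// True iff the codepoint round-trips through a single char16_t.
bool fits_single_utf16_unit(uint32_t codepoint)
{
  return codepoint <= 0xFFFFu &&
         !(codepoint >= 0xD800u && codepoint <= 0xDFFFu);
}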

@tautschnig merged commit 9d2ca43 into diffblue:develop on May 2, 2019