Add support for all Unicode codepoints of the BMP in the JSON parser [TG-7634] #4594
Conversation
Force-pushed 645a7c3 to 040d31b
src/util/unicode.h (Outdated)
@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \u0041)
It seems you will need to escape this for Doxygen to be happy.
PRECONDITION(hex.length() == 4);
const char16_t utf16_char = std::strtol(hex.c_str(), nullptr, 16);
const std::u16string u16(1, utf16_char);
return std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{}
This is failing with a linker error with Visual Studio. It's a known problem, and https://stackoverflow.com/questions/32055357/visual-studio-c-2015-stdcodecvt-with-char16-t-or-char32-t discusses some fixes. It seems that using `wchar_t` instead of `char16_t` might be the easiest of those, if that does actually work. But it would need to be wrapped in an `#ifdef _MSC_VER`.
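A minimal sketch of the workaround being suggested, assuming the Stack Overflow fix applies here (the function name and structure are illustrative, not the PR's actual code):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Illustrative sketch, not the PR's actual code: on MSVC, instantiate the
// converter with wchar_t (which is 16-bit on Windows) to sidestep the known
// char16_t linker bug; on other compilers, keep using char16_t directly.
std::string utf16_unit_to_utf8(char16_t c)
{
#ifdef _MSC_VER
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;
  return converter.to_bytes(static_cast<wchar_t>(c));
#else
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
  return converter.to_bytes(c);
#endif
}
```

Note that `std::wstring_convert` is deprecated in C++17, so this is only a stopgap on either branch of the `#ifdef`.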
I tried the solution that uses `wchar_t`, but the AWS Windows build is still failing, and I can't tell from the logs whether it's because of this or for some other reason... Can you tell from the logs what the problem is?
Turns out the problem was a compilation error that, for some reason, did not show up anywhere in the AWS build logs.
Now everything builds fine, but tests are failing - again only on Windows...
The problem was the encoding of string literals, which can vary between compilers. Should all be fixed now!
unit/json/json_parser.cpp (Outdated)
@@ -84,17 +84,21 @@ SCENARIO("Loading JSON files")
      }
    }
  }
-  GIVEN("A JSON file containing a hexadecimal Unicode character")
+  GIVEN("A JSON file containing hexadecimal Unicode characters")
⛏️ The use of "characters" here is a bit confusing. I would talk about symbols (or code points) for Unicode and only use characters for the C++ `char` type 🍔
src/util/unicode.h (Outdated)
@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
🍔
src/util/unicode.h (Outdated)
@@ -31,6 +31,11 @@ std::string utf16_native_endian_to_java(const std::wstring &in);

std::vector<std::string> narrow_argv(int argc, const wchar_t **argv_wide);

/// \param hex: a representation of a UTF-16 character as a four-digit string
/// (e.g.\ "0041" for \u0041)
/// \return the UTF-8 encoding of the character.
🍔
Force-pushed 040d31b to e5a260e
Nitpick: chars between 00 and ff are not UTF-8 (which can represent all Unicode characters, perhaps with many bytes) -- better to just say codepoints <= 255 or 0xff. Similarly, codepoints <= 0xffff are the basic multilingual plane (BMP), but not UTF-16, which again can represent all codepoints, perhaps using multiple wchars. Therefore I suggest twiddling the PR description / commit messages to clarify that this is adding support for all BMP Unicode codepoints.
src/json/parser.y (Outdated)
@@ -51,7 +53,7 @@ static std::string convert_TOK_STRING()
  // \uABCD, i.e. the following four digits are part of this character.
  assert(p + 4 < yyjsontext + len - 1);
  std::string hex(++p, 4);
-  result += std::stoi(hex, nullptr, 16);
+  result += utf16_hex_to_utf8(hex);
Rename to codepoint_hex_to_utf8 -- the difference is that UTF-16 implies the possibility of two 16-bit numbers representing a single codepoint, such as U+10437 == 0xD801 0xDC37
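The surrogate-pair distinction the reviewer is drawing can be sketched like this (a hypothetical helper for illustration, not part of the PR):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper illustrating the reviewer's point: a codepoint outside
// the BMP is represented in UTF-16 by two 16-bit units (a surrogate pair),
// so a single four-digit \uXXXX escape cannot denote it on its own.
std::pair<uint16_t, uint16_t> to_surrogate_pair(uint32_t codepoint)
{
  assert(codepoint >= 0x10000 && codepoint <= 0x10FFFF);
  const uint32_t v = codepoint - 0x10000;
  const uint16_t high = 0xD800 + (v >> 10);  // top 10 bits
  const uint16_t low = 0xDC00 + (v & 0x3FF); // bottom 10 bits
  return {high, low};
}
```

For U+10437 this yields 0xD801 and 0xDC37, matching the reviewer's example; within the BMP no pair is needed, which is why "codepoint" is the accurate term for what the function handles.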
src/util/unicode.cpp (Outdated)
@@ -315,3 +318,25 @@ std::string utf16_native_endian_to_java(const std::wstring &in)
    utf16_native_endian_to_java(ch, result, loc);
  return result.str();
}

char16_t utf16_hex_to_utf16_native_endian(const std::string &hex)
Again, codepoint rather than utf16
src/util/unicode.h (Outdated)

std::string utf16_native_endian_to_utf8(char16_t utf16_char);

/// \param hex: a representation of a UTF-16 symbol as a four-digit string
of a BMP codepoint
Force-pushed e5a260e to afe2843
Force-pushed 87b8203 to 21c4ee6
The previous implementation only supported codepoints up to 0x7f as characters, and all remaining codepoints up to 0xff as integers. The new implementation supports all codepoints in the BMP, i.e. up to 0xffff.
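The new behaviour can be sketched as a hand-rolled UTF-8 encoder for BMP codepoints (an illustrative sketch of the encoding rules only; the PR itself converts via `std::wstring_convert`):

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch of what "supporting all BMP codepoints" means at the
// byte level: codepoints <= 0x7f take one UTF-8 byte, <= 0x7ff take two,
// and the rest of the BMP (<= 0xffff) takes three.
std::string bmp_codepoint_to_utf8(uint16_t code)
{
  std::string out;
  if(code <= 0x7f)
    out += static_cast<char>(code);
  else if(code <= 0x7ff)
  {
    out += static_cast<char>(0xc0 | (code >> 6));
    out += static_cast<char>(0x80 | (code & 0x3f));
  }
  else
  {
    out += static_cast<char>(0xe0 | (code >> 12));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3f));
    out += static_cast<char>(0x80 | (code & 0x3f));
  }
  return out;
}
```

The old behaviour corresponds to taking only the first branch (and appending the raw value for 0x80..0xff); the fix covers all three.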
A named variable is clearer than referring to the value twice in relation to p, where p changes in between.
All review comments have been addressed. I split the original "hex to UTF-8" function into four tiny functions, as this makes it easier to reuse them (and I'm planning to use one of them in some new code soon). All four functions are documented (@smowton could you check that the terminology in the Doxygen makes sense?). Some of the string literals in the unit test have been specified to use UTF-8 encoding, as the Windows compiler used a different encoding by default.
Force-pushed 21c4ee6 to 4d502e7
I'd probably clarify that while anything taking a `char16_t` is converting UTF-16 to UTF-8, it's not tackling the whole of UTF-16, which can use low / high surrogate pairs (two successive `char16_t`s) to represent characters outside the BMP (i.e. those from U+10000 upwards). So really they're converting a UTF-16 BMP char... which is always just equal to the codepoint number. Otherwise lgtm.
(In case that was a bit confused... what I'm saying is we're handling the subset of Unicode where the UTF-16 <-> codepoint number transformation is trivial.)
This is a follow-up PR to #4590, which implemented support in the JSON parser for hexadecimal Unicode representations (`\uABCD`) but was limited to codepoints <= `0x7f` (with values from `0x80` to `0xff` also sort of working, representing them as integers rather than characters). This improved implementation supports all values from `\u0000` to `\uFFFF`, i.e. all codepoints in the BMP.