Improve test coverage #290
Conversation
Added tests and moved code used only in tests to test source sets
import kotlinx.io.internal.REPLACEMENT_CODE_POINT
import kotlinx.io.internal.commonAsUtf8ToByteArray
import kotlinx.io.internal.processUtf8CodePoints
import kotlin.test.*
While reviewing and testing these, I found a subtle difference: in the Java decoding process, when the decoder stumbles across an invalid multi-byte sequence, it replaces every byte of the sequence with \ufffd, while our decoder replaces the whole group with a single code point.
E.g. consider these sequences starting with the 4-byte lead byte 0xf0:
- 0xf0, 0x89, 0x89
- 0xf0, 0x89, 0x89, 0x89
Our decoder replaces each of them with a single code point, while Java produces 3 and 4 replacement code points, respectively.
This behaviour surfaces surprisingly for the sequence 0xf0 0xf0 0xf0 -- there, three replacement characters are produced. The same applies to 3-byte (0xE0) and probably 2-byte (haven't checked) sequences.
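A minimal sketch of the discrepancy, assuming kotlinx-io's Buffer, Sink.write(ByteArray) and Source.readString(); the counts in the printout are the ones quoted above (1 replacement from our decoder vs. 3 and 4 from the JVM), not re-verified here:

```kotlin
import kotlinx.io.*

fun main() {
    // Valid 4-byte prefix that terminates abruptly.
    val truncated = byteArrayOf(0xF0.toByte(), 0x89.toByte(), 0x89.toByte())
    // Structurally complete 4-byte group whose encoded value falls outside the valid range.
    val illFormed = byteArrayOf(0xF0.toByte(), 0x89.toByte(), 0x89.toByte(), 0x89.toByte())

    for (bytes in listOf(truncated, illFormed)) {
        // kotlinx-io: the whole ill-formed group is replaced with a single U+FFFD.
        val ours = Buffer().apply { write(bytes) }.readString()
        // JVM String constructor: one U+FFFD per byte of the group.
        val jvm = String(bytes, Charsets.UTF_8)
        println("ours=${ours.length} replacement(s), jvm=${jvm.length} replacement(s)")
    }
}
```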
My take on that -- the replacement behaviour is undocumented (for the readString family), and thus it's hard both to justify and to figure out this behaviour, as well as to take it into account when implementing.
replacement behaviour is undocumented
Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.
😢
I mean undocumented in kotlinx-io.
I think we can pick a single strategy, document it and stick with it.
Right now, depending on the kind of ill-formedness, we can replace N bytes with anywhere from 1 to 4 replacement characters, and that's a bit surprising. Always emitting a single replacement, or as many replacement characters as there are bytes in the ill-formed sequence, seems a bit more reasonable.
It's not documented in the standard either. :)
And, IIRC, we have the same discrepancy in the stdlib, where the Native and JVM implementations of byte-array-to-string conversion disagree.
Speaking of the examples you initially gave, it all seems more or less consistent (see the test sketch below):
- 0xf0 0x89 0x89 <EOF> -- the sequence has a valid prefix but terminates abruptly -> we replaced the whole sequence with \ufffd;
- 0xf0 0x89 0x89 0x89 <EOF> -- the sequence is structurally complete, but the encoded value lies outside the valid range -> we replaced the whole sequence with \ufffd;
- 0xf0 0xf0 0xf0 <EOF> -- we don't consider 0xf0 0xf0 a sequence, as the second 0xf0 is not a continuation byte, so the first ill-formed sequence consists of a single byte, which we replaced, and we did the same for the other two single-byte sequences.
However, looking at UTF-8's definitions of well- and ill-formed code unit sequences (D84-D86) gives the impression that 0xf0 0xf0 0xf0 should be treated as a single ill-formed sequence, as none of its bytes overlaps with a minimal well-formed subsequence. 🤔
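To make the three cases above concrete, here is a hedged test sketch; decode is a hypothetical helper, and the expected strings simply restate the current behaviour described in this comment:

```kotlin
import kotlinx.io.*
import kotlin.test.*

// Hypothetical helper, not part of the PR: writes the bytes and decodes them as UTF-8.
private fun decode(vararg bytes: Int): String =
    Buffer().apply { write(ByteArray(bytes.size) { bytes[it].toByte() }) }.readString()

class IllFormedSequenceTest {
    @Test
    fun currentReplacementBehaviour() {
        // Valid prefix that terminates abruptly: the whole group becomes one U+FFFD.
        assertEquals("\uFFFD", decode(0xF0, 0x89, 0x89))
        // Complete group whose encoded value lies outside the valid range: one U+FFFD.
        assertEquals("\uFFFD", decode(0xF0, 0x89, 0x89, 0x89))
        // 0xF0 is not a continuation byte, so each one forms its own single-byte
        // ill-formed sequence and is replaced individually.
        assertEquals("\uFFFD\uFFFD\uFFFD", decode(0xF0, 0xF0, 0xF0))
    }
}
```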
But the current behavior is a conforming one:
An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.
Ok, I'll cut this stream of consciousness off: we should probably follow what others (Java, Python) do and consider only ill-formed subsequences consisting of a single code unit:
>>> b'\xf0\x89\x89\x89'.decode("utf-8", errors='replace')
'����'
jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"
https://go.dev/play/p/Yc9wxok2aV2
fmt.Println(string([]byte{0xf0, 0x89, 0x89, 0x89}))
...
����
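For comparison with the Kotlin stdlib (which, as noted above, may behave differently on Native and JVM), a small sketch; on the JVM, decodeToString() goes through the platform decoder, so the output is expected to match the jshell result above:

```kotlin
fun main() {
    val bytes = byteArrayOf(0xF0.toByte(), 0x89.toByte(), 0x89.toByte(), 0x89.toByte())
    // On the JVM this is expected to yield one U+FFFD per byte, like the jshell output;
    // Kotlin/Native may differ, as mentioned earlier in the thread.
    val decoded = bytes.decodeToString()
    println("${decoded.length} code unit(s): $decoded")
}
```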
import kotlinx.io.internal.REPLACEMENT_CHARACTER
import kotlinx.io.internal.REPLACEMENT_CODE_POINT
import kotlinx.io.internal.commonAsUtf8ToByteArray
Also, I wonder if we should delegate to JVM string constructor immediately.
Because right now we process bytes one by one (expecting non-ASCII), then invoke concatToString(), which does the same again, doing extra work and maybe even repackaging these bytes into ASCII-compressed strings. Bonus points on the JVM -- everything there is intrinsified, e.g. StringCoding.hasNegatives and StringUTF16.putChar, which is really nice.
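A minimal sketch of what delegating to the JVM could look like; decodeUtf8OnJvm is a hypothetical name, not the actual readString implementation:

```kotlin
// Hypothetical JVM-only helper: let the JDK decode (and replace ill-formed input with
// U+FFFD) instead of processing the bytes one by one ourselves.
internal fun decodeUtf8OnJvm(bytes: ByteArray, startIndex: Int, endIndex: Int): String =
    String(bytes, startIndex, endIndex - startIndex, Charsets.UTF_8)
```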
Calling String's constructor seems to be slower (by about 10-15% for UTF-8 strings) compared to what we do now.
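For reference, a hypothetical JMH-style harness for this kind of comparison; it is not the benchmark behind the 10-15% figure, and the buffer fill in readStringCurrent is included in the measurement for simplicity:

```kotlin
import kotlinx.io.*
import org.openjdk.jmh.annotations.*

@State(Scope.Benchmark)
open class Utf8DecodeBenchmark {
    // Non-ASCII payload so the pure-ASCII fast path is not the only thing measured.
    private val bytes = "résumé señor żółć".repeat(1024).encodeToByteArray()

    @Benchmark
    fun readStringCurrent(): String =
        Buffer().apply { write(bytes) }.readString()

    @Benchmark
    fun stringCtor(): String =
        String(bytes, Charsets.UTF_8)
}
```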
This is really unexpected though 🥲
Added tests to cover some UTF-8-related edge cases and moved functions used primarily by tests to corresponding source sets.