Improve test coverage #290

fzhinkin · 2024-04-03T10:55:22Z

Added tests to cover some UTF-8-related edge cases and moved functions used primarily by tests to corresponding source sets.

Added tests and moved code used only in tests to test source sets

qwwdfsad · 2024-04-16T12:24:57Z

core/common/test/Utf8Test.kt

 import kotlinx.io.internal.REPLACEMENT_CODE_POINT
-import kotlinx.io.internal.commonAsUtf8ToByteArray
 import kotlinx.io.internal.processUtf8CodePoints
 import kotlin.test.*


While reviewing and testing these, I found a subtle difference: in the Java decoding process, when the decoder stumbles across an invalid multi-byte sequence, it replaces every byte of the sequence with \ufffd, while our decoder replaces the whole group with a single code point.

E.g., consider these sequences, both starting with a 4-byte lead byte:

0xf0, 0x89, 0x89

0xf0, 0x89, 0x89, 0x89

With our decoder, each of them is replaced by a single code point; with Java, they are replaced by 3 and 4 replacement code points, respectively.

This behaviour surfaces in a surprising way for the sequence 0xf0 0xf0 0xf0: there, three replacement characters are produced. The same applies to 3-byte (0xE0) and probably 2-byte (haven't checked) sequences.
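As a quick cross-check of the per-byte counts, here is a minimal sketch using Python's built-in decoder. It is only a stand-in for a decoder that substitutes byte by byte, not the JVM or kotlinx-io code path, and `count_replacements` is a hypothetical helper name.

```python
REPLACEMENT = "\ufffd"

def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with U+FFFD substitution and count the substitutions."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)

samples = [
    bytes([0xF0, 0x89, 0x89]),
    bytes([0xF0, 0x89, 0x89, 0x89]),
    bytes([0xF0, 0xF0, 0xF0]),
]

for sample in samples:
    # Prints 3, 4 and 3 replacement characters for the three samples.
    print(sample.hex(" "), "->", count_replacements(sample))
```

For all three inputs the count equals the number of input bytes, which is the behaviour described for Java above.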

My take on that: the replacement behaviour is undocumented (for the readString family), so it's hard to justify or even figure out this behaviour, let alone take it into account when implementing.

replacement behaviour is undocumented

Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.

😢

I mean undocumented in kotlinx-io.
I think we can pick a single strategy, document it and stick with it.
Right now, depending on the "ill-formedness", we can replace N bytes with [1..4] replacement characters, and that's a bit surprising. Always emitting either a single replacement character or as many replacement characters as there are bytes in the ill-formed sequence seems a bit more reasonable.
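To make the "as many characters as in an ill-formed sequence" option concrete, here is a minimal sketch built on Python's codec error-handler hook. The handler name `replace_each_byte` and the helper `decode_per_byte` are hypothetical, and this only illustrates the strategy; it is not how kotlinx-io is implemented.

```python
import codecs

REPLACEMENT = "\ufffd"

def replace_each_byte(exc):
    """Error handler: emit one U+FFFD per byte of the reported ill-formed range."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.start and exc.end delimit the bytes the decoder rejected.
    return REPLACEMENT * (exc.end - exc.start), exc.end

codecs.register_error("replace_each_byte", replace_each_byte)

def decode_per_byte(data: bytes) -> str:
    """Decode UTF-8, mapping every ill-formed byte to its own U+FFFD."""
    return data.decode("utf-8", errors="replace_each_byte")

print(len(decode_per_byte(bytes([0xF0, 0x89, 0x89, 0x89]))))  # 4
print(len(decode_per_byte(bytes([0xF0, 0x90, 0x80]))))        # 3
```

With this strategy the number of replacement characters depends only on how many bytes were ill-formed, not on how the decoder groups them into errors.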

It's undocumented in the standard, too. :)

And, IIRC, we have the same discrepancy in stdlib, where the native and JVM implementations of byte-array-to-string conversion differ.

Speaking of examples you initially gave, it all seems more or less consistent:

0xf0 0x89 0x89 <EOF> - the sequence has a valid prefix but terminates abruptly -> we replaced the whole sequence with \ufffd;

0xf0 0x89 0x89 0x89 <EOF> - the sequence is structurally valid, but the encoded value lies outside the valid range -> we replaced the whole sequence with \ufffd;

0xf0 0xf0 0xf0 <EOF> - we don't consider 0xf0 0xf0 a sequence because the second 0xf0 is not a continuation byte; thus the first invalid sequence consists of a single byte, which we replaced, and we did the same for the other two single-byte sequences.
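The grouping described in these three cases can be modelled with a small sketch. This is an assumption-based model reconstructed only from the cases above, not the actual kotlinx-io code; `expected_length` and `decode_grouping` are hypothetical names. A lead byte and the continuation bytes that follow it (up to the announced length) form one group, and a group that fails strict decoding becomes a single U+FFFD.

```python
REPLACEMENT = "\ufffd"

def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 0  # stray continuation byte or invalid lead byte

def decode_grouping(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        # Collect continuation bytes (0b10xxxxxx) up to the announced length.
        j = i + 1
        while j < len(data) and j < i + n and (data[j] & 0xC0) == 0x80:
            j += 1
        try:
            # Strict decoding rejects truncated, overlong and out-of-range groups.
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # the whole group maps to one U+FFFD
        i = j
    return "".join(out)

print(len(decode_grouping(bytes([0xF0, 0x89, 0x89]))))        # 1
print(len(decode_grouping(bytes([0xF0, 0x89, 0x89, 0x89]))))  # 1
print(len(decode_grouping(bytes([0xF0, 0xF0, 0xF0]))))        # 3
```

Under this model the number of replacement characters depends on where continuation bytes happen to sit, which is the inconsistency the thread is pointing at.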

However, looking at UTF-8's definitions of well- and ill-formed cp-sequences (D84-D86) gives an impression that 0xf0 0xf0 0xf0 should be treated as a single ill-formed sequence as none of its bytes overlaps with a minimal well-formed subsequence. 🤔

But the current behavior is still a conforming one:

An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.

Ok, I'll cut this stream of consciousness off: we should probably follow what others (Java, Python) do and consider only ill-formed subsequences consisting of a single code unit:

>>> b'\xf0\x89\x89\x89'.decode("utf-8", errors='replace')
'����'

jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"

https://go.dev/play/p/Yc9wxok2aV2

fmt.Println(string([]byte{0xf0, 0x89, 0x89, 0x89}))
...
����
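One nuance worth checking before settling on a strategy: as far as I can tell, CPython's decoder substitutes per maximal ill-formed subpart (the practice recommended in the Unicode Standard) rather than strictly per byte. The two coincide for the input above, but a truncated prefix that is otherwise well-formed collapses into a single U+FFFD. A minimal sketch, where `replaced` is a hypothetical helper name:

```python
def replaced(data: bytes) -> str:
    """Decode UTF-8 with the built-in 'replace' error handler."""
    return data.decode("utf-8", errors="replace")

# First three bytes of the encoding of U+10000 (f0 90 80 80), cut short at EOF.
truncated = bytes([0xF0, 0x90, 0x80])
# 0x89 can never follow 0xf0, so no byte here extends a valid prefix.
ill_formed = bytes([0xF0, 0x89, 0x89])

print(len(replaced(truncated)))   # 1
print(len(replaced(ill_formed)))  # 3
```

So "one replacement per byte" and "what Python does" differ for truncated input, and the chosen strategy should say which of the two is intended.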

@qwwdfsad, I filed #301 to track it

qwwdfsad · 2024-04-16T12:30:01Z

core/common/test/Utf8Test.kt


+import kotlinx.io.internal.REPLACEMENT_CHARACTER
 import kotlinx.io.internal.REPLACEMENT_CODE_POINT
-import kotlinx.io.internal.commonAsUtf8ToByteArray


Also, I wonder if we should delegate to JVM string constructor immediately.

Because right now we process bytes one by one (expecting non-ASCII), then invoke concatToString(), which does the same, doing extra work and maybe even repackaging these bytes again into ASCII-compressed strings. Bonus points for the JVM: everything there is intrinsified, e.g. StringCoding.hasNegatives and StringUTF16.putChar, which is really nice.

Calling String's ctor seems to be slower (by about 10-15% when it comes to utf8 strings) compared to what we do now.

This is really unexpected though 🥲

Improve test coverage

fc759b6

Added tests and moved code used only in tests to test source sets

fzhinkin marked this pull request as ready for review April 3, 2024 11:27

qwwdfsad self-requested a review April 13, 2024 11:57

qwwdfsad requested changes Apr 16, 2024

fzhinkin mentioned this pull request Apr 25, 2024

Map each code point of an ill-formed UTF-8 subsequence to a replacement character individually #301

Open

qwwdfsad self-requested a review May 6, 2024 18:12

qwwdfsad approved these changes May 6, 2024

fzhinkin merged commit 0431af5 into develop May 7, 2024

fzhinkin deleted the improve-test-coverage branch May 7, 2024 07:32
