@fzhinkin
Collaborator

@fzhinkin fzhinkin commented Apr 3, 2024

Added tests to cover some UTF-8-related edge cases and moved functions used primarily by tests to corresponding source sets.

Added tests and moved code used only in tests to test source sets
@fzhinkin fzhinkin marked this pull request as ready for review April 3, 2024 11:27
@qwwdfsad qwwdfsad self-requested a review April 13, 2024 11:57
import kotlinx.io.internal.REPLACEMENT_CODE_POINT
import kotlinx.io.internal.commonAsUtf8ToByteArray
import kotlinx.io.internal.processUtf8CodePoints
import kotlin.test.*
Member

@qwwdfsad qwwdfsad Apr 16, 2024

While reviewing and testing these, I found a subtle difference: in the Java decoding process, when the decoder stumbles across an invalid multi-byte sequence, it replaces every byte of the sequence with \ufffd, while our decoder replaces the whole group with a single code point.

E.g., consider these sequences, both starting with the 4-byte lead byte 0xf0:

  • 0xf0, 0x89, 0x89
  • 0xf0, 0x89, 0x89, 0x89

Our decoder replaces each of them with a single code point; Java produces 3 and 4 replacement code points, respectively.

This behaviour surfaces in a surprising way for the sequence 0xf0 0xf0 0xf0: there our decoder produces three characters. The same applies to 3-byte (0xE0) and probably 2-byte (haven't checked) sequences.
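
For reference, the per-byte replacement can be observed with Python's built-in UTF-8 codec. Python is used here only as a stand-in for the Java decoder, which is an assumption based on the Python and jshell sessions later in this thread, where both produce four replacement characters for 0xf0 0x89 0x89 0x89. The helper name `replacement_count` is hypothetical.

```python
REPLACEMENT = "\ufffd"


def replacement_count(data: bytes) -> int:
    """Count the U+FFFD characters produced by Python's lenient UTF-8 decoder."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


samples = [
    bytes([0xF0, 0x89, 0x89]),
    bytes([0xF0, 0x89, 0x89, 0x89]),
    bytes([0xF0, 0xF0, 0xF0]),
]
for sample in samples:
    # Print each byte sequence next to the number of replacement characters.
    print(sample.hex(" "), "->", replacement_count(sample))
```

For these three inputs the count equals the number of input bytes. That holds here because 0x89 is not a valid second byte after 0xf0, so no byte gets grouped with its neighbour.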

Member

@qwwdfsad qwwdfsad Apr 16, 2024

My take on that: the replacement behaviour is undocumented (the readString family), so it's hard to justify or figure out this behaviour, or to take it into account when implementing.

Collaborator Author

replacement behaviour is undocumented

Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.

😢

Member

@qwwdfsad qwwdfsad Apr 22, 2024

I mean undocumented in kotlinx-io.
I think we can pick a single strategy, document it and stick with it.
Right now, depending on the "ill-formness", we can replace N bytes with [1..4] replacement characters and it's a bit surprising. Always having a single replacement or as many characters as in an ill-formed sequence seems a bit more reasonable
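
A minimal sketch of the two candidate strategies, under one possible reading of them: a maximal run of bytes that start no well-formed sequence becomes either a single U+FFFD or one U+FFFD per byte. The names `well_formed_length` and `decode_with_policy` are hypothetical, and this is not kotlinx-io code.

```python
REPLACEMENT = "\ufffd"


def well_formed_length(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at i, or 0 if none."""
    for n in range(1, min(4, len(data) - i) + 1):
        try:
            data[i:i + n].decode("utf-8")  # strict decoding raises on ill-formed input
            return n
        except UnicodeDecodeError:
            continue
    return 0


def decode_with_policy(data: bytes, per_byte: bool) -> str:
    out = []
    i = 0
    while i < len(data):
        n = well_formed_length(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            continue
        # Collect the maximal run of bytes that start no well-formed sequence.
        start = i
        while i < len(data) and well_formed_length(data, i) == 0:
            i += 1
        run = i - start
        out.append(REPLACEMENT * (run if per_byte else 1))
    return "".join(out)


data = b"a\xf0\x89\x89\x89b"
print(len(decode_with_policy(data, per_byte=False)))  # 'a', one U+FFFD, 'b'
print(len(decode_with_policy(data, per_byte=True)))   # 'a', four U+FFFD, 'b'
```

With either policy the number of replacement characters depends only on the length of the ill-formed run, not on why the run is ill-formed.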

Collaborator Author

It's not documented in the standard, either. :)

And, IIRC, we also have the same discrepancy in stdlib, where the native and JVM implementations of byte-array-to-string conversion behave differently.

Collaborator Author

Speaking of examples you initially gave, it all seems more or less consistent:

  • 0xf0 0x89 0x89 <EOF> - the sequence has a valid prefix, but terminates abruptly -> we replaced the whole sequence with \ufffd;
  • 0xf0 0x89 0x89 0x89 <EOF> - the sequence is structurally valid, but the encoded value lies outside the valid range -> we replaced the whole sequence with \ufffd;
  • 0xf0 0xf0 0xf0 <EOF> - we don't consider 0xf0 0xf0 a sequence, as the second 0xf0 is not a continuation byte; thus the first invalid sequence consists of a single byte, which we replaced, and we did the same for the other two single-byte sequences.
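
The grouping described in these three cases can be modelled roughly as follows. This is a sketch of the description above, not kotlinx-io's actual implementation: a lead byte is grouped with the continuation bytes (0x80..0xBF) that follow it, up to the length the lead byte announces, and a group that does not decode strictly becomes one \ufffd. The names `expected_length` and `decode_grouped` are hypothetical.

```python
REPLACEMENT = "\ufffd"


def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte; 0 for a byte that cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0


def decode_grouped(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        # Group the lead byte with the continuation bytes that follow it.
        j = i + 1
        while j < len(data) and j < i + n and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))  # strict: rejects truncated or overlong groups
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # one replacement for the whole group
        i = j
    return "".join(out)


for sample in (b"\xf0\x89\x89", b"\xf0\x89\x89\x89", b"\xf0\xf0\xf0"):
    # Print each byte sequence next to the number of code points it decodes to.
    print(sample.hex(" "), "->", len(decode_grouped(sample)))
```

Under this model the first two inputs yield one replacement each and the third yields three, matching the cases listed above.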

However, looking at UTF-8's definitions of well- and ill-formed code unit sequences (D84-D86) gives the impression that 0xf0 0xf0 0xf0 should be treated as a single ill-formed sequence, as none of its bytes overlaps with a minimal well-formed subsequence. 🤔

Collaborator Author

But the current behavior is a conforming one:

An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors.

Collaborator Author

Ok, I'll cut this stream of consciousness off: we should probably follow what others (Java, Python) do and consider only ill-formed subsequences consisting of a single code unit:

>>> b'\xf0\x89\x89\x89'.decode("utf-8", errors='replace')
'����'
jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"

https://go.dev/play/p/Yc9wxok2aV2

fmt.Println(string([]byte{0xf0, 0x89, 0x89, 0x89}))
...

����

Collaborator Author

@qwwdfsad, I filed #301 to track it


import kotlinx.io.internal.REPLACEMENT_CHARACTER
import kotlinx.io.internal.REPLACEMENT_CODE_POINT
import kotlinx.io.internal.commonAsUtf8ToByteArray
Member

Also, I wonder if we should delegate to the JVM String constructor immediately.

Because right now we process bytes one by one (expecting non-ASCII), then invoke concatToString(), which does the same, doing extra work and maybe even repacking these bytes into ASCII-compressed strings. Bonus points for the JVM: everything is intrinsified, e.g. StringCoding.hasNegatives and StringUTF16.putChar, which is really nice.

Collaborator Author

Calling String's ctor seems to be slower (by about 10-15% for UTF-8 strings) than what we do now.

Member

This is really unexpected though 🥲

@fzhinkin fzhinkin merged commit 0431af5 into develop May 7, 2024
@fzhinkin fzhinkin deleted the improve-test-coverage branch May 7, 2024 07:32

3 participants