unicode/utf8: RuneError should be a 4-byte value -- something uncreatable via DecodeRune's valid returns #47826

Closed
BrannonKing opened this issue Aug 19, 2021 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@BrannonKing

What version of Go are you using (go version)?

1.16.6

Does this issue reproduce with the latest release?

I have not tried 1.17.x

What operating system and processor architecture are you using (go env)?

linux on amd64

What did you do?

I've been doing my own rune code folding (due to bugs in x/text/cases). I have this function:

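// MyCaseFold returns a case-folded copy of name. If name contains invalid
// UTF-8, it is returned unchanged.
// foldMap is defined elsewhere (not shown here); it is assumed to map a rune
// to its folded replacement runes, e.g. a map[rune][]rune.
func MyCaseFold(name []byte) []byte {
	var b bytes.Buffer
	b.Grow(len(name))
	for i := 0; i < len(name); {
		r, w := utf8.DecodeRune(name[i:])
		// A width below 2 marks a real decoding error; a validly encoded
		// U+FFFD also yields utf8.RuneError, but with a width of 3.
		if r == utf8.RuneError && w < 2 {
			return name
		}
		replacements := foldMap[r]
		if len(replacements) > 0 {
			for _, fr := range replacements {
				b.WriteRune(fr)
			}
		} else {
			// No folding entry: keep the rune as it is.
			b.WriteRune(r)
		}
		i += w
	}
	return b.Bytes()
}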

What did you expect to see?

I had expected that I could do the same with r := range string(name). However, I happened to have this (in hex) unicode string: 43efbfbd. The last three bytes of that are valid utf-8 (and utf8.Valid agrees). It so happens that those three bytes decode to the same value as utf8.RuneError. I had to add that w < 2 check, which I determined after reading the DecodeRune source, where I saw that RuneError was only returned as an error with 0 or 1 for the byte width.

utf8.RuneError should be a value that cannot be created via one of the non-error code-paths in DecodeRune.
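To make the ambiguity concrete, here is a minimal sketch that decodes the 43efbfbd input from the report next to a genuinely invalid input. The step type and the decodeAll helper are hypothetical names introduced only for this illustration; they are not part of unicode/utf8.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"unicode/utf8"
)

// step records one result of utf8.DecodeRune (hypothetical helper type).
type step struct {
	r rune // decoded rune
	w int  // number of bytes consumed
}

// decodeAll applies utf8.DecodeRune repeatedly until p is exhausted.
func decodeAll(p []byte) []step {
	var out []step
	for len(p) > 0 {
		r, w := utf8.DecodeRune(p)
		out = append(out, step{r, w})
		p = p[w:]
	}
	return out
}

func main() {
	// 'C' followed by a validly encoded U+FFFD (EF BF BD).
	valid, err := hex.DecodeString("43efbfbd")
	if err != nil {
		panic(err)
	}
	// 'C' followed by 0xFF, a byte that can never appear in UTF-8.
	invalid := []byte{0x43, 0xff}

	for _, in := range [][]byte{valid, invalid} {
		fmt.Printf("% x valid=%v\n", in, utf8.Valid(in))
		for _, s := range decodeAll(in) {
			fmt.Printf("  %U width=%d\n", s.r, s.w)
		}
	}
}
```

Both inputs end in a rune equal to utf8.RuneError; only the width (3 for the valid encoding, 1 for the invalid byte) tells them apart.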

@neild
Contributor

neild commented Aug 19, 2021

RuneError is U+FFFD REPLACEMENT CHARACTER, the standard character "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".

Decoding a string via repeated application of utf8.DecodeRune will consume the entire string and produce a series of valid UTF-8 characters, replacing invalid bytes with REPLACEMENT CHARACTER. Programs that want to distinguish between invalid bytes in the string and validly encoded U+FFFD values can check for a length < 2, as stated in the documentation:

If p is empty it returns (RuneError, 0). Otherwise, if the encoding is invalid, it returns (RuneError, 1). Both are impossible results for correct, non-empty UTF-8.
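The same width check can be combined with a range loop over a string, which yields the rune but not its width. The following is only a sketch under that assumption: firstInvalid is a hypothetical helper name, and it re-decodes at the current offset with utf8.DecodeRuneInString whenever range yields utf8.RuneError.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte offset of the first invalid UTF-8 byte in s,
// or -1 if s is entirely valid. (Hypothetical helper for illustration.)
func firstInvalid(s string) int {
	for i, r := range s {
		if r != utf8.RuneError {
			continue
		}
		// range does not report the width, so decode again at offset i.
		// Width 1 means an invalid byte; width 3 means a real U+FFFD.
		if _, w := utf8.DecodeRuneInString(s[i:]); w == 1 {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid("C\uFFFD")) // validly encoded U+FFFD: -1
	fmt.Println(firstInvalid("C\xff"))   // invalid byte at offset 1: 1
}
```

The second decode only runs for runes equal to utf8.RuneError, so the common path stays a plain range loop.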

@ianlancetaylor changed the title from "utf8.RuneError should be a 4-byte value -- something uncreatable via DecodeRune's valid returns" to "unicode/utf8: RuneError should be a 4-byte value -- something uncreatable via DecodeRune's valid returns" Aug 19, 2021
@ianlancetaylor
Contributor

We can't change the value now, as it would break the Go 1 compatibility guarantee (https://golang.org/doc/go1compat).

@mknyszek mknyszek added this to the Backlog milestone Aug 20, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 20, 2021
@BrannonKing
Author

Is there any technical reason that utf8.RuneError should equal unicode.ReplacementCharacter ?

@ianlancetaylor
Contributor

Is there any technical reason that utf8.RuneError should equal unicode.ReplacementCharacter ?

I don't know whether this counts as a technical reason, but it's the standard way to represent an encoding error. See https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character .

@ianlancetaylor
Contributor

In any case, regardless of the merits, we can't make this change today. It would not be backward compatible. I'm going to close this issue. Please comment if you disagree.

@golang golang locked and limited conversation to collaborators Aug 20, 2022
5 participants