Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

@nalimilan

This proposal has been made for Rust and is apparently still under discussion there (see rust-lang/rust#10044 (comment)), but I thought it could be interesting for Julia too.

The idea is that indexing a string with a number, as in s[3], only makes sense when 3 is the byte offset at which a Unicode code point starts. This is a trap for developers who only test with ASCII: the bugs appear only in production, once the code meets non-ASCII text. Typical cases are the naive:

julia> s = "noël";
julia> s[4] # Thinking that this provides the fourth "character"
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

the tempting:

julia> s[end - 1] # Thinking you skip the last character, works well... until it breaks
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

(Julia equivalent of http://www.reddit.com/r/rust/comments/1zlq21/should_rust_be_more_careful_with_unicode/cfush88)

or the slightly more involved:

julia> s[match(r"l", s).offset - 1]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57
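
For comparison, here is a minimal sketch of how the three cases above can be written with index-walking functions instead of integer arithmetic. It assumes the function names of Julia 1.x (`firstindex`, `lastindex`, `nextind`, `prevind`, `eachindex`), which are newer than the version that produced the errors above, and it spells the string with an escape so that ë is guaranteed to be the precomposed two-byte character.

```julia
s = "no\u00ebl"   # "noël"; ë occupies bytes 3 and 4 in UTF-8

# Byte offsets at which a character starts.
valid = collect(eachindex(s))                 # [1, 2, 3, 5]

# Fourth character: walk three characters forward from the first index.
fourth = s[nextind(s, firstindex(s), 3)]      # 'l'

# Second-to-last character, instead of s[end - 1].
before_last = s[prevind(s, lastindex(s))]     # 'ë'

# Character before a regex match, instead of s[m.offset - 1].
m = match(r"l", s)
before_match = s[prevind(s, m.offset)]        # 'ë'
```

Nothing forces callers to use these functions, which is the gap the proposal is aimed at.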

Instead of letting people do incorrect-but-easy things like this, it could be useful to restrict string indexing to a special type, say ByteIndex, instead of a plain integer. match would return offset as that type too. This would prevent both naive indexing with integers and incorrect arithmetic on indexes obtained from functions, encouraging people to always use dedicated functions.
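
A rough sketch of what such a type could look like, written against Julia 1.x. Everything here except `ByteIndex` itself is a hypothetical name chosen for illustration (`after`, `before`, `matchindex`), and it only adds a new `getindex` method rather than removing integer indexing, which is the breaking part of the proposal.

```julia
# Hypothetical wrapper around a byte offset that came from a string function.
struct ByteIndex
    offset::Int
end

# Indexing with a ByteIndex; no arithmetic is defined on the type on purpose.
Base.getindex(s::AbstractString, i::ByteIndex) = s[i.offset]

# Hypothetical dedicated functions for moving between characters.
after(s::AbstractString, i::ByteIndex) = ByteIndex(nextind(s, i.offset))
before(s::AbstractString, i::ByteIndex) = ByteIndex(prevind(s, i.offset))

# Hypothetical stand-in for `match` returning its offset as a ByteIndex.
function matchindex(r::Regex, s::AbstractString)
    m = match(r, s)
    return m === nothing ? nothing : ByteIndex(m.offset)
end

s = "no\u00ebl"            # "noël"
i = matchindex(r"l", s)
s[i]                       # 'l'
s[before(s, i)]            # 'ë'
# s[i - 1] would raise a MethodError, since ByteIndex has no `-` method.
```

With this shape, the mistake from the third example above fails on every input, including pure ASCII, instead of only on non-ASCII text.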

It might also make sense to allow arithmetic operations on this type, so that idx + 1 means "the code point after the one at position idx". This would be O(n) in the size of the offset, since the walk starts from idx rather than from the beginning of the string, and usually you don't take very large offsets. I'm not saying this is necessarily a good idea, though, because ByteIndex implies reasoning in bytes, and arithmetic operations would then switch to reasoning in code points. It could be named StringIndex instead, and made opaque so that people never see the underlying integer index, which is in bytes.
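
A minimal sketch of that arithmetic, assuming Julia 1.x. For `idx + 1` to work without naming the string, the index has to know which string it points into, so this version stores a reference to it; that field, and the helper names `char` and `lastidx`, are assumptions made for the example.

```julia
# Hypothetical opaque index: callers are not meant to read `byte`.
struct StringIndex{S<:AbstractString}
    str::S      # the string this index points into
    byte::Int   # internal byte offset
end

# idx + n moves n code points from idx; the cost grows with |n|, not with idx.
function Base.:+(i::StringIndex, n::Integer)
    if n >= 0
        return StringIndex(i.str, nextind(i.str, i.byte, n))
    else
        return StringIndex(i.str, prevind(i.str, i.byte, -n))
    end
end
Base.:-(i::StringIndex, n::Integer) = i + (-n)

# Hypothetical accessors standing in for s[idx] and for `end`.
char(i::StringIndex) = i.str[i.byte]
lastidx(s::AbstractString) = StringIndex(s, lastindex(s))

s = "no\u00ebl"           # "noël"
char(lastidx(s) - 1)      # 'ë', where s[end - 1] fails
char(lastidx(s) - 3)      # 'n'
```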

Finally, if the index held a reference to the string it was built from, it could be checked that this string matches the indexed string, which might make it possible to remove the checks that the index corresponds to the start of a code point. I'm not sure the gain would be significant, though.
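
The shape of that check could look like the sketch below, assuming Julia 1.x. `BoundIndex`, `boundindex`, `trusted` and `charat` are hypothetical names. The sketch only shows where the boundary check could be skipped; plain `s[i]` still validates internally, so an actual speedup would need a lower-level decoding path that is not shown here.

```julia
# Hypothetical index that remembers the string it was computed from.
struct BoundIndex{S<:AbstractString}
    str::S
    byte::Int
end

# Construction pays for the boundary check once.
function boundindex(s::AbstractString, byte::Integer)
    isvalid(s, byte) || throw(ArgumentError("byte $byte does not start a character"))
    return BoundIndex(s, Int(byte))
end

# For String, === compares contents; equal contents imply the same boundaries.
trusted(s::AbstractString, i::BoundIndex) = i.str === s

function charat(s::AbstractString, i::BoundIndex)
    # Re-validate only when the index was built from a different string.
    if !trusted(s, i) && !isvalid(s, i.byte)
        throw(ArgumentError("index does not start a character in this string"))
    end
    return s[i.byte]
end

s = "no\u00ebl"           # "noël"
i = boundindex(s, 3)
charat(s, i)              # 'ë', with the boundary already known to be valid
```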

Labels: breaking (this change will break code), design (design of APIs or of the language itself), unicode (related to Unicode characters and encodings)
