Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type

This was proposed for Rust and is apparently still under discussion there (see https://github.com/rust-lang/rust/issues/10044#issuecomment-26982523), but I thought it could be interesting for Julia too.

The idea is that indexing a string with a number, as in `s[3]`, only makes sense when `3` is the byte offset at which a Unicode code point starts. That is a trap for developers who only test with ASCII: the bugs appear only in production, once the code meets non-ASCII text. Typical cases are the naive:

``` julia
julia> s = "noël";
julia> s[4] # Thinking that this provides the fourth "character"
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57
```

the tempting:

``` julia
julia> s[end - 1] # Thinking you skip the last character, works well... until it breaks
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57
```

(Julia equivalent of http://www.reddit.com/r/rust/comments/1zlq21/should_rust_be_more_careful_with_unicode/cfush88)

or the slightly more involved:

``` julia
julia> s[match(r"l", s).offset - 1]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57
```

Instead of letting people do incorrect-but-easy things like this, it could be useful to restrict string indexing to a special type, say `ByteIndex`, instead of a plain integer. `match` would return `offset` as that type too. This would prevent both naive indexing with integers and incorrect arithmetic on indexes returned by functions, encouraging people to always use dedicated functions.
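To make this more concrete, here is a minimal sketch of what such a type could look like. Everything defined here (`ByteIndex`, `firstidx`, `lastidx`, `following`, `preceding`, `charat`, `matchidx`) is a hypothetical name made up for illustration, not an existing API. It assumes Julia 1.x, where `firstindex`, `lastindex`, `nextind` and `prevind` are the Base functions for walking valid string indexes.

``` julia
# Hypothetical sketch, not part of Base: an opaque wrapper around a byte offset.
struct ByteIndex
    byte::Int
end

# Dedicated functions are the intended way to obtain a ByteIndex.
firstidx(s::AbstractString) = ByteIndex(firstindex(s))
lastidx(s::AbstractString) = ByteIndex(lastindex(s))

# Step to the neighbouring code point instead of doing integer arithmetic.
following(s::AbstractString, i::ByteIndex) = ByteIndex(nextind(s, i.byte))
preceding(s::AbstractString, i::ByteIndex) = ByteIndex(prevind(s, i.byte))

# Indexing accepts only a ByteIndex; there is deliberately no method for Int.
charat(s::AbstractString, i::ByteIndex) = s[i.byte]

# A `match` variant that hands back its offset as a ByteIndex.
function matchidx(r::Regex, s::AbstractString)
    m = match(r, s)
    return m === nothing ? nothing : ByteIndex(m.offset)
end

s = "no\u00ebl"                          # "noël", with a two-byte 'ë'
charat(s, preceding(s, lastidx(s)))      # 'ë', where s[end - 1] fails
charat(s, preceding(s, matchidx(r"l", s)))  # 'ë', where offset - 1 fails
```

With this, `charat(s, 4)` is a `MethodError` instead of something that happens to work on ASCII input.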

It _might_ also make sense to allow arithmetic operations on this type, so that `idx + 1` means "the code point after the one at position `idx`". That would be O(n), but counted from the index rather than from the start of the string, and usually you don't take very large offsets. I'm not saying this is necessarily a good idea, though, because `ByteIndex` implies reasoning in bytes, and arithmetic operations would then switch to reasoning in code points. It could be named `StringIndex` instead, and made opaque so that people never see the underlying integer index, which is in bytes.
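A rough sketch of the arithmetic variant, under one extra assumption: the index has to carry a reference to its string, since `idx + 1` alone cannot know where the next code point starts. `StringIndex`, `first_index`, `last_index` and the one-argument `charat` are hypothetical names; `nextind(s, i, n)` and `prevind(s, i, n)` are the Julia 1.x Base functions that step over `n` code points.

``` julia
# Hypothetical sketch: an index that knows which string it points into.
struct StringIndex{S<:AbstractString}
    str::S
    byte::Int   # internal byte offset; callers never need to read it
end

first_index(s::AbstractString) = StringIndex(s, firstindex(s))
last_index(s::AbstractString) = StringIndex(s, lastindex(s))

# Arithmetic counts code points, walking from the current position.
Base.:+(i::StringIndex, n::Integer) = StringIndex(i.str, nextind(i.str, i.byte, n))
Base.:-(i::StringIndex, n::Integer) = StringIndex(i.str, prevind(i.str, i.byte, n))

# Read the character the index points at.
charat(i::StringIndex) = i.str[i.byte]

s = "no\u00ebl"                # "noël", with a two-byte 'ë'
charat(last_index(s) - 1)      # 'ë': the code point before the last one
charat(first_index(s) + 3)     # 'l': three code points after the first
```

Each `+` or `-` costs time proportional to the offset, not to the length of the string.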

Finally, it might be possible to perform some optimizations by removing the checks that the index corresponds to the start of a code point, if the index held a reference to the string it was built from, so that it can be checked that it matches the indexed string. Not sure it would be significant, though.
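To illustrate where that check would sit, here is a small sketch. `OwnedIndex`, `ownedindex` and this `charat` method are hypothetical names. The sketch only shows validating once at construction and comparing the string on use; it does not actually skip anything, because it still goes through the ordinary checked `getindex`.

``` julia
# Hypothetical sketch: an index that remembers which string produced it.
struct OwnedIndex{S<:AbstractString}
    str::S
    byte::Int
end

# Validity is established once, when the index is created.
function ownedindex(s::AbstractString, byte::Integer)
    isvalid(s, byte) || throw(ArgumentError("byte $byte does not start a code point"))
    return OwnedIndex(s, Int(byte))
end

function charat(s::AbstractString, i::OwnedIndex)
    # The index is only trusted for the string it was built from.
    i.str === s || throw(ArgumentError("index was built from a different string"))
    # An optimized implementation could decode without re-validating here;
    # this sketch just uses the regular checked indexing.
    return s[i.byte]
end

s = "no\u00ebl"          # "noël", with a two-byte 'ë'
i = ownedindex(s, 3)
charat(s, i)             # 'ë'
```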
