String guide nits #17340

Closed
huonw opened this issue Sep 17, 2014 · 4 comments

@huonw
Member

huonw commented Sep 17, 2014

  • The codepoint iterator chars is not mentioned in http://doc.rust-lang.org/master/guide-strings.html#indexing-strings

  • The use of Str in http://doc.rust-lang.org/master/guide-strings.html#generic-functions is not particularly idiomatic; since T is at the 'top level' taking a &str directly is fine; being generic over Str is better when it may be expensive for the user to get to a &str, e.g. if the string data is inside something

    fn foo(x: &[&str])
    
    fn bar<T: Str>(x: &[T])
    

    If the user has &[String] and wants to call foo, they're forced to allocate storage to store the .as_slice's of each element, which can be arbitrarily expensive (they may have a lot of Strings); bar gets around this since it can be called with &[String] or &[&str].

    (FWIW, this is still true for functions taking iterators, even though one can sometimes .map(|s| s.as_slice()) to get an Iterator<&str>. If you have an Iterator<String>, there's no way to convert that into an Iterator<&str> without collecting into a temporary data structure.)
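To make the foo/bar contrast concrete, here is a minimal sketch. It assumes a current Rust toolchain, where the `Str` trait discussed above no longer exists and `AsRef<str>` plays the equivalent role; the function names `total_len_slices` and `total_len_generic` are hypothetical and not taken from the guide.

```rust
// Hypothetical helpers for illustration only.
// Assumption: `AsRef<str>` stands in for the old `Str` trait.

/// Takes a slice of borrowed strings. A caller holding `&[String]` must
/// first build a temporary `Vec<&str>` to call this.
fn total_len_slices(xs: &[&str]) -> usize {
    xs.iter().map(|s| s.len()).sum()
}

/// Generic over anything that can be viewed as a `&str`, so both
/// `&[String]` and `&[&str]` work without an intermediate allocation.
fn total_len_generic<T: AsRef<str>>(xs: &[T]) -> usize {
    xs.iter().map(|s| s.as_ref().len()).sum()
}
```

With a `Vec<String>` in hand, `total_len_generic(&owned[..])` can be called directly, whereas `total_len_slices` needs something like `owned.iter().map(|s| s.as_str()).collect::<Vec<&str>>()` first, which is the extra allocation described above.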

@huonw huonw added the A-docs label Sep 17, 2014
@steveklabnik
Member

What's the difference between chars and graphemes? Such a unicode newbie 😦

@huonw
Member Author

huonw commented Sep 17, 2014

There are 3 basic levels of Unicode (and its encodings):

  • code units, the underlying data type used to store everything
  • code points/unicode scalar values (char)
  • graphemes (visible characters)

For UTF-8, which Rust's strings use, the code units are bytes (u8); the other two levels are independent of the encoding format. Code points are the letters and combining characters, and graphemes are sequences of code points that make up a single visible entity. E.g. consider:

u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé

It is 45 bytes in UTF-8 and 26 code points, but only 7 visible characters. The full breakdown is:

bytes             45: [117, 205, 148, 110, 204, 142, 205, 136, 204, 176, 105, 204, 153, 204, 174, 205, 154, 204, 166, 99, 204, 137, 205, 154, 111, 205, 151, 204, 188, 204, 169, 204, 176, 100, 204, 134, 205, 131, 205, 165, 205, 148, 101, 204, 129]
code points/chars 26: [u, ͔, n, ̎, ͈, ̰, i, ̙, ̮, ͚, ̦, c, ̉, ͚, o, ͗, ̼, ̩, ̰, d, ̆, ̓, ͥ, ͔, e, ́]
graphemes         7: [u͔, n͈̰̎, i̙̮͚̦, c͚̉, o̼̩̰͗, d͔̆̓ͥ, é]


That is, each little diacritic/combining character is its own codepoint, but sequences of them are rendered together as a single visible unit. (NB. they aren't always separated, there are some precombined characters like ä which can be either one (U+00E4) or two codepoints (U+0061 U+0308).)
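The first two levels can be checked with the standard library alone. This is a small sketch, not the code behind the breakdown above; the helper name `counts` is hypothetical. Counting graphemes is not covered here because, as far as I know, it needs an external crate on current Rust (for example `unicode-segmentation`), which is an assumption about your setup rather than something from this thread.

```rust
/// Hypothetical helper: returns (UTF-8 code units, code points) for `s`.
fn counts(s: &str) -> (usize, usize) {
    // `len()` is the number of bytes; `chars()` iterates over code points.
    (s.len(), s.chars().count())
}

/// The UTF-8 code units (bytes) of `s`, as in the "bytes" row above.
fn code_units(s: &str) -> Vec<u8> {
    s.bytes().collect()
}
```

For the precombined ä (`"\u{e4}"`) this gives 2 bytes and 1 code point; for the decomposed form (`"a\u{308}"`) it gives 3 bytes and 2 code points, even though both render as one visible character.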

@steveklabnik
Member

Ahhh gotcha. When would you prefer chars over graphemes? Or is it just that I mentioned two out of the three?

@huonw
Member Author

huonw commented Sep 17, 2014

If possible, text should be treated as a black-box chunk of memory that can only be read, written (not iterated over) and re-encoded (almost always via chars, but preferably a library would do this for you). It's hard to handle arbitrary text in a semantically accurate way; even just splitting words is hard or impossible.

Obviously it isn't always possible to avoid, e.g. an application may wish to convert textual source code into machine code (but who would do that in Rust anyway? ;P ), in which case it's whatever makes sense for what it's doing.

That compiler example probably wants chars (since the input is very structured); text editors and text rendering libraries would want to be handling graphemes on some levels and code points on others.

But yes, it's mainly that you mentioned only 2 of 3.
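As a rough illustration of the compiler case, here is a toy tokenizer that walks its input with chars. The `Token` type, the `tokenize` function, and the token rules are all hypothetical and much simpler than any real lexer; it only shows why code points are a convenient unit for highly structured input.

```rust
/// Hypothetical token type for this sketch.
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Number(String),
    Symbol(char),
}

/// Splits `src` into tokens by iterating over code points (`char`s).
fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Skip whitespace between tokens.
            chars.next();
        } else if c.is_alphabetic() || c == '_' {
            // Identifier: a letter or underscore, then letters, digits, underscores.
            let mut ident = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_alphanumeric() || d == '_' {
                    ident.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Ident(ident));
        } else if c.is_ascii_digit() {
            // Number: a run of ASCII digits.
            let mut number = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_digit() {
                    number.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Number(number));
        } else {
            // Anything else is a single-code-point symbol.
            tokens.push(Token::Symbol(c));
            chars.next();
        }
    }
    tokens
}
```

Note that this treats a decomposed sequence such as `a` followed by U+0308 as a letter plus a separate symbol, which is exactly the kind of place where a text editor or renderer would want graphemes instead.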

lnicola pushed a commit to lnicola/rust that referenced this issue Jun 23, 2024
internal: Improve `find_path` performance

cc rust-lang/rust-analyzer#17339, db80216dac3d972612d8e2d12ade3c28a1826ac2 should fix a case where we don't reduce our search space appropriately. This also adds a fuel system which really shouldn't ever be hit, hence why it warns
2 participants