Add support for NFC and NFKC #15986
This follows the current layout of normalization iterators.
cc @kwantam
Quick glance looks quite nice, but I'd love to take a careful look at this. Sadly, I'm in a paper crunch for the next several days... I'll try to get to it asap. And thanks, @Florob!
pub struct Recompositions<'a> {
    iter: Decompositions<'a>,
    state: RecompositionState,
    buffer: Vec<char>,
Depending on how large this is expected to get, it may be better to use a `RingBuf` to avoid O(n) shifts.
@sfackler That is a good catch. Can you elaborate a bit on the "depending on how large this is expected to get" part? My instinct is that for a push/shift access pattern a `RingBuf` would always be better than, or equivalent to, a `Vec`, but maybe I'm missing an edge case.
This is expected to be small, often only a single `char`; for "normal" strings I'd expect at most 2-4 `char`s. If one were to craft a string, this could grow up to the input string's `char_len()` though.
If it's always going to be super small, then the extra bookkeeping overhead from a `RingBuf` may make it slower than a `Vec`. Since this one can grow to the size of the input, it's probably best to use a `RingBuf` if for no other reason than to defend against pathological inputs.
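To make the trade-off in this thread concrete, here is a minimal sketch of the two front-removal patterns. It is written against current Rust, where `RingBuf` has since been renamed `std::collections::VecDeque`, and it uses `Vec::remove(0)` as a stand-in for the front-removing shift discussed above. The function names `drain_front_vec` and `drain_front_deque` are hypothetical and are not part of this PR.

```rust
use std::collections::VecDeque;

/// Drain a `Vec` from the front. Every `remove(0)` moves all remaining
/// elements one slot to the left, so each call is O(n) and draining the
/// whole buffer is O(n^2).
fn drain_front_vec(mut buf: Vec<char>) -> String {
    let mut out = String::new();
    while !buf.is_empty() {
        out.push(buf.remove(0));
    }
    out
}

/// Drain a ring buffer from the front. `pop_front` only advances the head
/// index, so each call is O(1) and draining the whole buffer is O(n).
fn drain_front_deque(mut buf: VecDeque<char>) -> String {
    let mut out = String::new();
    while let Some(c) = buf.pop_front() {
        out.push(c);
    }
    out
}
```

Both functions return the same string for the same input; only the cost per front removal differs. For the 1-4 element buffers expected on ordinary text the difference is negligible, which is why the argument for the ring buffer rests on crafted inputs whose buffer grows with the input length.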
Finally, some spare time! This looks good 👍 @Florob, thanks again!
Also: it seems OK to me that this lives in libcollections, since (1) that's where The only obvious other option is to have a third library that inherits from both collections and unicode and implements this functionality, but in that case we'd have to import
@kwantam, I'm not sure you're aware of this, but NFC support has previously been rejected (because of the amount of data required to implement it).
@Florob: it's true what you say. If we imported As it is now, core and unicode both contribute functionality to collections, which then re-exports to std; that is, I think, what makes this whole thing seem like a bit of a mess. I also agree regarding your other point: we can do this change in either order, so from my point of view integrating compositions first and then doing rearrangement in a separate PR should be fine. But of course my opinion isn't the one that counts! @alexcrichton, thoughts?
Due to For now though, thanks @kwantam for taking a look, and thanks @Florob for the implementation!
This adds a new `Recompositions` iterator, which performs canonical composition on the result of the `Decompositions` iterator (which is canonical or compatibility decomposition). In effect this implements Unicode normalization forms C and KC.
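As a rough illustration of what canonical composition does to a decomposed stream, the sketch below composes adjacent pairs only. It is not the PR's `Recompositions` implementation: the names `compose_pair` and `recompose_adjacent` are hypothetical, the three-entry Latin table is a tiny hand-picked excerpt standing in for the full Unicode composition data, and the handling of canonical combining classes (blocking and reordering) that real NFC/NFKC requires is deliberately left out. The Hangul branch uses the algorithmic composition constants from the Unicode Standard.

```rust
/// Return the canonical composite of `a` followed by `b`, if this sketch knows one.
fn compose_pair(a: char, b: char) -> Option<char> {
    // Hangul syllable composition constants from the Unicode Standard.
    const S_BASE: u32 = 0xAC00;
    const L_BASE: u32 = 0x1100;
    const V_BASE: u32 = 0x1161;
    const T_BASE: u32 = 0x11A7;
    const L_COUNT: u32 = 19;
    const V_COUNT: u32 = 21;
    const T_COUNT: u32 = 28;
    const S_COUNT: u32 = L_COUNT * V_COUNT * T_COUNT;

    let (a32, b32) = (a as u32, b as u32);

    // Leading consonant + vowel -> LV syllable.
    if (L_BASE..L_BASE + L_COUNT).contains(&a32) && (V_BASE..V_BASE + V_COUNT).contains(&b32) {
        let lv = S_BASE + ((a32 - L_BASE) * V_COUNT + (b32 - V_BASE)) * T_COUNT;
        return char::from_u32(lv);
    }
    // LV syllable + trailing consonant -> LVT syllable.
    if (S_BASE..S_BASE + S_COUNT).contains(&a32)
        && (a32 - S_BASE) % T_COUNT == 0
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&b32)
    {
        return char::from_u32(a32 + (b32 - T_BASE));
    }
    // Illustrative excerpt only; the real table has many more pairs.
    match (a, b) {
        ('e', '\u{0301}') => Some('\u{00E9}'), // e + combining acute
        ('a', '\u{0308}') => Some('\u{00E4}'), // a + combining diaeresis
        ('A', '\u{030A}') => Some('\u{00C5}'), // A + combining ring above
        _ => None,
    }
}

/// Compose adjacent pairs in an already-decomposed string.
/// Simplification: combining classes are ignored, so this is not full NFC.
fn recompose_adjacent(decomposed: &str) -> String {
    let mut out = String::new();
    let mut pending: Option<char> = None;
    for c in decomposed.chars() {
        pending = match pending {
            Some(p) => match compose_pair(p, c) {
                // Keep the composite pending so it can absorb further characters.
                Some(composed) => Some(composed),
                None => {
                    out.push(p);
                    Some(c)
                }
            },
            None => Some(c),
        };
    }
    if let Some(p) = pending {
        out.push(p);
    }
    out
}
```

For example, `recompose_adjacent("e\u{0301}")` yields the single code point U+00E9, and the jamo sequence U+1112 U+1161 U+11AB composes to the syllable U+D55C. The table lookup is the part that needs the large data set mentioned elsewhere in this conversation; the Hangul case needs no table at all.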