Skip to content

Add support for NFC and NFKC #15986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

Florob
Copy link
Contributor

@Florob Florob commented Jul 25, 2014

This adds a new Recompositions iterator, which performs canonical composition on the result of the Decompositions iterator (which is canonical or compatibility decomposition). In effect this implements Unicode normalization forms C and KC.

@Florob
Copy link
Contributor Author

Florob commented Jul 25, 2014

This follows the current layout of normalization iterators.
Ideally I'd want to move all the Normalization iterators completely to libunicode, but as is this would create a cyclic dependency, which I'm not sure how to resolve.

@alexcrichton
Copy link
Member

cc @kwantam

@kwantam
Copy link
Contributor

kwantam commented Jul 28, 2014

Quick glance looks quite nice, but I'd love to take a careful look at this. Sadly, I'm in a paper crunch for the next several days... I'll try to get to it asap.

And thanks, @Florob!

pub struct Recompositions<'a> {
iter: Decompositions<'a>,
state: RecompositionState,
buffer: Vec<char>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how large this is expected to get, it may be better to use a RingBuf to avoid O(n) shifts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfackler That is a good catch. Can you elaborate a bit on the "depending on how large this is expected to get" part? My instinct is that for a push/shift access pattern a RingBuf would always be better, or equivalent to a Vec, but maybe I'm missing an edge case.
This is expected to be small, often only a single char, for "normal" strings I'd expect at most 2-4 chars. If one were to craft a string this could grow up to the input string's char_len() though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's always going to be super small, then the extra bookkeeping overhead from a RingBuf may make it slower than a Vec. Since this one can grow to the size of the input, it's probably best to use a RingBuf if for no other reason than to defend against pathological inputs.

@kwantam
Copy link
Contributor

kwantam commented Aug 3, 2014

Finally, some spare time! This looks good 👍

@Florob, thanks again!

@kwantam
Copy link
Contributor

kwantam commented Aug 3, 2014

Also: it seems OK to me that this lives in libcollections, since (1) that's where str lives anyway (from the point of view of the stdlib, I mean), and (2) these iterators are themselves built on top of collections.

The only obvious other option is to have a third library that inherits from both collections and unicode and implements this functionality, but in that case we'd have to import str there and then re-export to stdlib, and that sounds like the beginning of a mess.

@Florob
Copy link
Contributor Author

Florob commented Aug 3, 2014

@kwantam, I'm not sure you're aware of this, but NFC support has previously been rejected (because of the amount of data required to implement it).
I was told it should be implemented in a separate libunicode, which now exists. The premise as I understand it was that str should be usable without linking to the unicode crate at all. It does indeed seem sensible to me that, since we have a libunicode, all Unicode algorithms would live there, and not just some of them (or just the data for some of them).
I have not thought this through, but I think it should be possible to have libcollections not depend on libunicode at all (breaking-change though). However, I'd rather do that as a separate PR. It seems to me that adding nfc/nfkc next to the pre-existing nfd/nfkd should be fine for now.

@kwantam
Copy link
Contributor

kwantam commented Aug 3, 2014

@Florob: it's true what you say. If we imported str from libcollections into libunicode and from there into stdlib (like char, which goes core->unicode->std), then libunicode inherits from libcollections, but not vice-versa.

As it is now, core and unicode both contribute functionality to collections, which then re-exports to std; that is, I think, what makes this whole thing seem like a bit of a mess.

I also agree regarding your other point: we can do this change in either order, so from my point of view integrating compositions first and then doing rearrangement in a separate PR should be fine.

But of course my opinion isn't the one that counts! @alexcrichton, thoughts?

@alexcrichton
Copy link
Member

Due to nfd being in StrAllocating inside of libcollections already, I think it's ok to put nfk into libcollections for now. These methods are after all #[experimental] and may be moved in the future. I would perhaps prefer then in a separate extension trait that's not in the prelude by default, but that's a matter that can be resolved at a later date.

For now though, thanks @kwantam for taking a look, and thanks @Florob for the implementation!

bors added a commit that referenced this pull request Aug 4, 2014
This adds a new `Recompositions` iterator, which performs canonical composition on the result of the `Decompositions` iterator (which is canonical or compatibility decomposition). In effect this implements Unicode normalization forms C and KC.
@bors bors closed this Aug 4, 2014
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 11, 2023
…=lnicola

internal: move to `Arc::from_iter`

Builds atop of rust-lang/rust-analyzer#15985, will rebase.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants