Skip to content

Unicode for String Processing proposal #257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 22, 2022

Conversation

natecook1000
Copy link
Member

@natecook1000 natecook1000 commented Apr 8, 2022

Draft of the Unicode for String Processing proposal

@natecook1000 natecook1000 marked this pull request as ready for review April 14, 2022 17:56
@natecook1000 natecook1000 changed the title Draft of Unicode for String Processing proposal Unicode for String Processing proposal Apr 14, 2022

Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes.

- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we said this was explicitly not the case, so that you could support e[\u{301}-\u{305}] or some such, under the impression that alternation required a grapheme break. Has that changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, the (a|b|c) grouping would require a grapheme break, which shouldn't be the case. I'll update this language.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, what's the reason alternation requires a grapheme break again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternation start and end points are set by grouping (or the start/end of the whole pattern), which are the places we're requiring a grapheme break.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We surely need them for capturing groups and for the overall match, but even for non-capturing groups? The latter is just syntactic scopes. In the builder that's be an embedded ChoiceOf { ... }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, does the alternation rule turn out to just be the capture rule, and if so we don't need it?

Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes.

- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.
- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange<Character>` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does that compose with concatenation and alternation?


- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.
- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange<Character>` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`.
- A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain and provide rationale? This is a surprise and I thought we were working on clarifying the rules here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clarify this language; this doesn't totally capture the intent.

@natecook1000
Copy link
Member Author

@swift-ci Please test

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very quick pass, LGTM and we should try to move to pitch phase ASAP

str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar))
// true - "e\u{301}" matches with /e\u{301}/
```

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere (not sure here or in detailed design) I feel we should have an example using the DSL where we switch modes of a sub-builder. That way we can explain where the implicit grapheme break requirements are inserted (e.g. on entry or on exit?)

public func subtracting(_ other: CharacterClass) -> CharacterClass

/// Returns a character class matching elements in one or the other, but not both,
/// of this class and the given class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure we get the doc comments into the code as well

### More general `CharacterSet` replacement

Foundation's `CharacterSet` type is in some ways similar to the `CharacterClass` type defined in this proposal. `CharacterSet` is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift `String`s. The proposed `CharacterClass` type is a `RegexBuilder`-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the `CharacterClass` API or introducing a different type to fill that role.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it in the regex builder module or the stdlib? If the stdlib I feel we'll want something a little more than saying it's regex-specific. It is a great model for character classes (or we should make it be that way), but we can say extra API or extra conversions are future work as we're focusing on regex's needs (which are broad and deep, so it's a good area to focus on)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the RegexBuilder – if/when we decide it's fully generally useful, we could lower to the stdlib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need to keep the ABI, so it's still pretty relevant (or just as relevant). I don't know if there's a good way to deprecate it in the future.

@natecook1000
Copy link
Member Author

@swift-ci Please test

@natecook1000 natecook1000 merged commit 8dd8470 into swiftlang:main Apr 22, 2022
@natecook1000 natecook1000 deleted the unicode_proposal branch April 22, 2022 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants