-
Notifications
You must be signed in to change notification settings - Fork 50
Unicode for String Processing proposal #257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes. | ||
|
||
- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we said this was explicitly not the case, so that you could support e[\u{301}-\u{305}]
or some such, under the impression that alternation required a grapheme break. Has that changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, the (a|b|c)
grouping would require a grapheme break, which shouldn't be the case. I'll update this language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, what's the reason alternation requires a grapheme break again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternation start and end points are set by grouping (or the start/end of the whole pattern), which are the places we're requiring a grapheme break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We surely need them for capturing groups and for the overall match, but even for non-capturing groups? The latter is just syntactic scopes. In the builder that's be an embedded ChoiceOf { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, does the alternation rule turn out to just be the capture rule, and if so we don't need it?
Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes. | ||
|
||
- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes. | ||
- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange<Character>` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does that compose with concatenation and alternation?
|
||
- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes. | ||
- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange<Character>` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`. | ||
- A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain and provide rationale? This is a surprise and I thought we were working on clarifying the rules here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll clarify this language; this doesn't totally capture the intent.
@swift-ci Please test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very quick pass, LGTM and we should try to move to pitch phase ASAP
str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar)) | ||
// true - "e\u{301}" matches with /e\u{301}/ | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhere (not sure here or in detailed design) I feel we should have an example using the DSL where we switch modes of a sub-builder. That way we can explain where the implicit grapheme break requirements are inserted (e.g. on entry or on exit?)
public func subtracting(_ other: CharacterClass) -> CharacterClass | ||
|
||
/// Returns a character class matching elements in one or the other, but not both, | ||
/// of this class and the given class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure we get the doc comments into the code as well
### More general `CharacterSet` replacement | ||
|
||
Foundation's `CharacterSet` type is in some ways similar to the `CharacterClass` type defined in this proposal. `CharacterSet` is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift `String`s. The proposed `CharacterClass` type is a `RegexBuilder`-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the `CharacterClass` API or introducing a different type to fill that role. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it in the regex builder module or the stdlib? If the stdlib I feel we'll want something a little more than saying it's regex-specific. It is a great model for character classes (or we should make it be that way), but we can say extra API or extra conversions are future work as we're focusing on regex's needs (which are broad and deep, so it's a good area to focus on)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in the RegexBuilder
– if/when we decide it's fully generally useful, we could lower to the stdlib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need to keep the ABI, so it's still pretty relevant (or just as relevant). I don't know if there's a good way to deprecate it in the future.
@swift-ci Please test |
Draft of the Unicode for String Processing proposal