Unicode for String Processing proposal #257

natecook1000 · 2022-04-08T11:08:39Z

Draft of the Unicode for String Processing proposal

Documentation/Evolution/UnicodeForStringProcessing.md

milseman · 2022-04-18T18:17:15Z

Documentation/Evolution/UnicodeForStringProcessing.md

+
+Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes.
+
+- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.


I thought we said this was explicitly not the case, so that you could support e[\u{301}-\u{305}] or some such, under the impression that alternation required a grapheme break. Has that changed?

That's a good point, the (a|b|c) grouping would require a grapheme break, which shouldn't be the case. I'll update this language.

BTW, what's the reason alternation requires a grapheme break again?

Alternation start and end points are set by grouping (or the start/end of the whole pattern), which are the places we're requiring a grapheme break.

We surely need them for capturing groups and for the overall match, but even for non-capturing groups? The latter are just syntactic scopes. In the builder that'd be an embedded ChoiceOf { ... }

Basically, does the alternation rule turn out to just be the capture rule, and if so we don't need it?
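To make the case in this thread concrete, here is a minimal sketch of the `e[\u{301}-\u{305}]` pattern under Unicode scalar semantics. It assumes the `Regex(_:)` string initializer and `matchingSemantics(_:)` as available in Swift 5.7 or later; the variable names are illustrative only. No result is claimed for the regex itself, since where grapheme breaks are required is exactly what is being discussed.

```swift
// A decomposed "é": the scalar "e" followed by U+0301 COMBINING ACUTE ACCENT.
let decomposed = "Cafe\u{301}"

// The accented letter is one Character (grapheme cluster) but two Unicode scalars.
let graphemeCount = decomposed.count
let scalarCount = decomposed.unicodeScalars.count
let lastScalarCount = decomposed.last!.unicodeScalars.count
print(graphemeCount, scalarCount, lastScalarCount)

do {
    // Pattern from the thread: "e" followed by one combining mark in a scalar range.
    // Under scalar semantics the class is meant to test the scalar that follows "e",
    // which would be impossible if the class implied a grapheme break before it.
    let pattern = try Regex(#"e[\u{301}-\u{305}]"#)
        .matchingSemantics(.unicodeScalar)
    print(decomposed.contains(pattern))
} catch {
    print("Pattern failed to parse: \(error)")
}
```

If `[...]` were defined as sugar for `(a|b|c)`, and alternation boundaries required a grapheme break, a pattern like this could not inspect the combining mark on its own.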

milseman · 2022-04-18T18:17:42Z

Documentation/Evolution/UnicodeForStringProcessing.md

+Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes.
+
+- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.
+- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange<Character>` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`.


How does that compose with concatenation and alternation?
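For context on the range bullet, the following sketch contrasts `ClosedRange<Character>.contains` with one possible reading of "membership using NFD form". The helpers `nfdScalars` and `nfdRangeContains` are hypothetical and are not part of the proposal or of `Regex`; they assume that membership means a lexicographic comparison of NFD scalar values, which is an assumption made here for illustration. The NFKD and caseless case is not covered.

```swift
import Foundation

// Characters compare through String's Comparable conformance, so the
// precomposed "é" (U+00E9) sorts after "z" and falls outside "a"..."z".
let range: ClosedRange<Character> = "a"..."z"
let inSwiftRange = range.contains("\u{E9}")

// Hypothetical helper: the NFD scalar values of a Character.
func nfdScalars(_ c: Character) -> [UInt32] {
    String(c).decomposedStringWithCanonicalMapping.unicodeScalars.map(\.value)
}

// Hypothetical membership test: lower <= candidate <= upper, comparing
// NFD scalar sequences lexicographically.
func nfdRangeContains(_ lower: Character, _ upper: Character, _ candidate: Character) -> Bool {
    let lo = nfdScalars(lower)
    let hi = nfdScalars(upper)
    let c = nfdScalars(candidate)
    return !c.lexicographicallyPrecedes(lo) && !hi.lexicographicallyPrecedes(c)
}

// In NFD, "é" is "e" + U+0301, and "e" lies between "a" and "z".
let inDecomposedRange = nfdRangeContains("a", "z", "\u{E9}")
print(inSwiftRange, inDecomposedRange)
```

Under this reading the two notions of range membership disagree for the same character, which is the difference the bullet points out.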

milseman · 2022-04-18T18:18:18Z

Documentation/Evolution/UnicodeForStringProcessing.md

+
+- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.
+- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange<Character>` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`.
+- A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics.


Can you explain and provide rationale? This is a surprise and I thought we were working on clarifying the rules here.

I'll clarify this language; this doesn't totally capture the intent.

...

natecook1000 · 2022-04-22T15:27:36Z

@swift-ci Please test

milseman

Very quick pass, LGTM and we should try to move to pitch phase ASAP

milseman · 2022-04-22T15:42:18Z

Documentation/Evolution/UnicodeForStringProcessing.md

+str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar))
+// true - "e\u{301}" matches with /e\u{301}/
+```
+


Somewhere (not sure here or in detailed design) I feel we should have an example using the DSL where we switch modes of a sub-builder. That way we can explain where the implicit grapheme break requirements are inserted (e.g. on entry or on exit?)
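A rough sketch of the kind of example being asked for, assuming the `RegexBuilder` module and `matchingSemantics(_:)` as available in Swift 5.7 or later. The names `scalarWhole` and `mixed` are illustrative. The first regex mirrors the snippet in the diff; the second switches modes only for an inner sub-builder, and no result is claimed for it because the entry and exit rules are the open question.

```swift
import RegexBuilder

let str = "Cafe\u{301}"

// Whole-pattern scalar semantics, as in the snippet above. The pattern is a
// fixed string, so a parse failure here would be a programmer error.
let scalarWhole = try! Regex(#"Cafe\u{301}"#).matchingSemantics(.unicodeScalar)
let wholeMatches = str.contains(scalarWhole)

// Only the inner sub-builder uses scalar semantics; the outer builder keeps the
// default grapheme cluster semantics. Whether a grapheme break is required on
// entry to, or exit from, the inner builder decides if this can match.
let mixed = Regex {
    "Caf"
    Regex {
        "e"
        "\u{301}"
    }
    .matchingSemantics(.unicodeScalar)
}
let mixedMatches = str.contains(mixed)

print(wholeMatches, mixedMatches)
```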

milseman · 2022-04-22T15:43:43Z

Documentation/Evolution/UnicodeForStringProcessing.md

  public func subtracting(_ other: CharacterClass) -> CharacterClass

+  /// Returns a character class matching elements in one or the other, but not both,
+  /// of this class and the given class.


Make sure we get the doc comments into the code as well
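As a usage sketch for the doc comment above, assuming the `CharacterClass` set operations land in `RegexBuilder` with the names shown in the diff (`symmetricDifference(_:)`) together with `CharacterClass.anyOf(_:)`. The variable names are illustrative.

```swift
import RegexBuilder

// Two overlapping classes: "b" and "c" are members of both.
let first = CharacterClass.anyOf("abc")
let second = CharacterClass.anyOf("bcd")

// Elements in one class or the other, but not both: "a" and "d".
let eitherButNotBoth = first.symmetricDifference(second)

let regex = Regex { OneOrMore(eitherButNotBoth) }

// "b" and "c" are skipped; the first run made only of "a" or "d" is matched.
let matched = "bcad".firstMatch(of: regex).map { String($0.output) }
print(matched ?? "no match")
```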

milseman · 2022-04-22T15:45:08Z

Documentation/Evolution/UnicodeForStringProcessing.md

+### More general `CharacterSet` replacement
+
+Foundation's `CharacterSet` type is in some ways similar to the `CharacterClass` type defined in this proposal. `CharacterSet` is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift `String`s. The proposed `CharacterClass` type is a `RegexBuilder`-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the `CharacterClass` API or introducing a different type to fill that role.
+


Is it in the regex builder module or the stdlib? If the stdlib I feel we'll want something a little more than saying it's regex-specific. It is a great model for character classes (or we should make it be that way), but we can say extra API or extra conversions are future work as we're focusing on regex's needs (which are broad and deep, so it's a good area to focus on)

It's in the RegexBuilder – if/when we decide it's fully generally useful, we could lower to the stdlib.

We'd need to keep the ABI, so it's still pretty relevant (or just as relevant). I don't know if there's a good way to deprecate it in the future.
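To illustrate the awkwardness the section mentions, here is a small sketch using only existing Foundation and standard library API. `CharacterSet` tests `Unicode.Scalar` values, so checking a `Character` means walking its scalars, whereas `Character` properties work at the grapheme cluster level. The variable names are illustrative.

```swift
import Foundation

// One Character made of two scalars: "e" + U+0301 COMBINING ACUTE ACCENT.
let accented: Character = "e\u{301}"

// CharacterSet.contains(_:) takes a Unicode.Scalar, so the Character has to be
// broken apart and each scalar tested separately.
let everyScalarIsLetter = accented.unicodeScalars.allSatisfy {
    CharacterSet.letters.contains($0)
}

// The Character-level property answers the question for the whole cluster.
let isLetter = accented.isLetter

print(everyScalarIsLetter, isLetter)
```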

natecook1000 · 2022-04-22T15:52:35Z

@swift-ci Please test

Draft of Unicode for String Processing proposal

2ab649a

stephentyrone reviewed Apr 11, 2022

Documentation/Evolution/UnicodeForStringProcessing.md (outdated)

stephentyrone reviewed Apr 11, 2022

Documentation/Evolution/UnicodeForStringProcessing.md (outdated)

stephentyrone reviewed Apr 11, 2022

Documentation/Evolution/UnicodeForStringProcessing.md (outdated)

stephentyrone reviewed Apr 11, 2022

Documentation/Evolution/UnicodeForStringProcessing.md (outdated)

Revisions and additions

cf5cbe0

milseman reviewed Apr 12, 2022

natecook1000 added 4 commits April 14, 2022 06:31

Merge branch 'main' into unicode_proposal

1cf20c1

Finish draft

7f7fab8

Additional revisions, API fixes

23494c6

Re-order option sections

daff3e4

natecook1000 marked this pull request as ready for review April 14, 2022 17:56

natecook1000 requested review from milseman, stephentyrone and Azoy April 14, 2022 17:56

natecook1000 changed the title ~~Draft of Unicode for String Processing proposal~~ Unicode for String Processing proposal Apr 14, 2022

stephentyrone reviewed Apr 15, 2022

Documentation/Evolution/UnicodeForStringProcessing.md (outdated)

Update word boundary selection method

c25c146

milseman reviewed Apr 18, 2022

natecook1000 added 6 commits April 18, 2022 22:20

Change option API to nominal naming

c839b62

Rename quantification behavior

03f0b1f

Align proposal with existing API

aa04a56

Updated API and matching semantic descriptions

b176fa2

...

Merge branch 'main' into unicode_proposal

7f3637a

Update proposal overview doc

25f5a2d

Update proposal authors

1a5e4d0

milseman approved these changes Apr 22, 2022

natecook1000 merged commit 8dd8470 into swiftlang:main Apr 22, 2022

natecook1000 deleted the unicode_proposal branch April 22, 2022 16:27

