From 2ab649a4ed946870da9afbe13b73c7652daab5b2 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Fri, 8 Apr 2022 06:07:42 -0500 Subject: [PATCH 01/12] Draft of Unicode for String Processing proposal --- Documentation/Evolution/CharacterClasses.md | 503 ------------- .../Evolution/UnicodeForStringProcessing.md | 677 ++++++++++++++++++ 2 files changed, 677 insertions(+), 503 deletions(-) delete mode 100644 Documentation/Evolution/CharacterClasses.md create mode 100644 Documentation/Evolution/UnicodeForStringProcessing.md diff --git a/Documentation/Evolution/CharacterClasses.md b/Documentation/Evolution/CharacterClasses.md deleted file mode 100644 index c9ffcbc95..000000000 --- a/Documentation/Evolution/CharacterClasses.md +++ /dev/null @@ -1,503 +0,0 @@ -# Character Classes for String Processing - -- **Authors:** [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman) -- **Status:** Draft pitch - -## Introduction - -[Declarative String Processing Overview][overview] presents regex-powered matching broadly, without details concerning syntax and semantics, leaving clarification to subsequent pitches. [Regular Expression Literals][literals] presents more details on regex _syntax_ such as delimiters and PCRE-syntax innards, but explicitly excludes discussion of regex _semantics_. This pitch and discussion aims to address a targeted subset of regex semantics: definitions of character classes. We propose a comprehensive treatment of regex character class semantics in the context of existing and newly proposed API directly on `Character` and `Unicode.Scalar`. - -Character classes in regular expressions include metacharacters like `\d` to match a digit, `\s` to match whitespace, and `.` to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves, and, in case-insensitive matching, their case-toggled counterpart. 
For the purpose of this work, then, we consider a *character class* to be any part of a regular expression literal that can match an actual component of a string. - -## Motivation - -Operating over classes of characters is a vital component of string processing. Swift's `String` provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. - -```swift -let str = "Cafe\u{301}" // "Café" -str == "Café" // true -str.dropLast() // "Caf" -str.last == "é" // true (precomposed e with acute accent) -str.last == "e\u{301}" // true (e followed by composing acute accent) -``` - -Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult. - -
Other engines - -Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster. - -| Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining | -|---|---|---|---|---| -| C#, Rust, Go | `"Cafe"` | `"´"` | n/a | n/a | -| NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` | - -Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence. - -
- -[SE-0211 Unicode Scalar Properties][scalarprops] added basic building blocks for classification of scalars by surfacing Unicode data from the [UCD][ucd]. [SE-0221: Character Properties][charprops] defined grapheme-cluster semantics for Swift for a subset of these. But, many classifications used in string processing are combinations of scalar properties or ad-hoc listings, and as such are not present today in Swift. - -Regardless of any syntax or underlying formalism, classifying characters is a worthy and much needed addition to the Swift standard library. We believe our thorough treatment of every character class found across many popular regex engines gives Swift a solid semantic basis. - -## Proposed Solution - -This pitch is narrowly scoped to Swift definitions of character classes found in regexes. For each character class, we propose: - -- A name for use in API -- A `Character` API, by extending Unicode scalar definitions to grapheme clusters -- A `Unicode.Scalar` API with modern Unicode definitions -- If applicable, a `Unicode.Scalar` API for notable standards like POSIX - -We're proposing what we believe to be the Swiftiest definitions using [Unicode's guidance][uts18] for `Unicode.Scalar` and extending this to grapheme clusters using `Character`'s existing [rationale][charpropsrationale]. - -
Broad language/engine survey - -For these definitions, we cross-referenced Unicode's [UTS\#18][uts18] with a broad survey of existing languages and engines. We found that while these all support a subset of UTS\#18, each language or framework implements a slightly different subset. The following table shows some of the variations: - -| Language/Framework | Dot (`.`) matches | Supports `\X` | Canonical Equivalence | `\d` matches FULL WIDTH digit | -|------------------------------|----------------------------------------------------|---------------|---------------------------|-------------------------------| -| [ECMAScript][ecmascript] | UTF16 code unit (Unicode scalar in Unicode mode) | no | no | no | -| [Perl][perl] / [PCRE][pcre] | UTF16 code unit, (Unicode scalar in Unicode mode) | yes | no | no | -| [Python3][python] | Unicode scalar | no | no | yes | -| [Raku][raku] | Grapheme cluster | n/a | strings always normalized | yes | -| [Ruby][ruby] | Unicode scalar | yes | no | no | -| [Rust][rust] | Unicode scalar | no | no | no | -| [C#][csharp] | UTF16 code unit | no | no | yes | -| [Java][java] | Unicode scalar | yes | Only in CANON_EQ mode | no | -| [Go][go] | Unicode scalar | no | no | no | -| [`NSRegularExpression`][icu] | Unicode scalar | yes | no | yes | - -We are still in the process of evaluating [C++][cplusplus], [RE2][re2], and [Oniguruma][oniguruma]. - -
- -## Detailed Design - -### Literal characters - -A literal character (such as `a`, `é`, or `한`) in a regex literal matches that particular character or code sequence. When matching at the semantic level of `Unicode.Scalar`, it should match the literal sequence of scalars. When matching at the semantic level of `Character`, it should match `Character`-by-`Character`, honoring Unicode canonical equivalence. - -We are not proposing new API here as this is already handled by `String` and `String.UnicodeScalarView`'s conformance to `Collection`. - -### Unicode values: `\u`, `\U`, `\x` - -Metacharacters that begin with `\u`, `\U`, or `\x` match a character with the specified Unicode scalar values. We propose these be treated exactly the same as literals. - -### Match any: `.`, `\X` - -The dot metacharacter matches any single character or element. Depending on options and modes, it may exclude newlines. - -`\X` matches any grapheme cluster (`Character`), even when the regular expression is otherwise matching at semantic level of `Unicode.Scalar`. - -We are not proposing new API here as this is already handled by collection conformances. - -While we would like for the stdlib to have grapheme-breaking API over collections of `Unicode.Scalar`, that is a separate discussion and out-of-scope for this pitch. - -### Decimal digits: `\d`,`\D` - -We propose `\d` be named "decimalDigit" with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character represents - /// a decimal digit. - /// - /// Decimal digits are comprised of a single Unicode scalar that has a - /// `numericType` property equal to `.decimal`. This includes the digits - /// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode - /// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE` - /// (U+096F). - /// - /// Decimal digits are a subset of whole numbers, see `isWholeNumber`. 
- /// - /// To get the character's value, use the `decimalDigitValue` property. - public var isDecimalDigit: Bool { get } - - /// The numeric value this character represents, if it is a decimal digit. - /// - /// Decimal digits are comprised of a single Unicode scalar that has a - /// `numericType` property equal to `.decimal`. This includes the digits - /// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode - /// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE` - /// (U+096F). - /// - /// Decimal digits are a subset of whole numbers, see `wholeNumberValue`. - /// - /// let chars: [Character] = ["1", "९", "A"] - /// for ch in chars { - /// print(ch, "-->", ch.decimalDigitValue) - /// } - /// // Prints: - /// // 1 --> Optional(1) - /// // ९ --> Optional(9) - /// // A --> nil - public var decimalDigitValue: Int? { get } - -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// a decimal digit. - /// - /// Any Unicode scalar that has a `numericType` property equal to `.decimal` - /// is considered a decimal digit. This includes the digits from the ASCII - /// range, from the _Halfwidth and Fullwidth Forms_ Unicode block, as well - /// as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F). - public var isDecimalDigit: Bool { get } -} -``` - -`\D` matches the inverse of `\d`. - -*TBD*: [SE-0221: Character Properties][charprops] did not define equivalent API on `Unicode.Scalar`, as it was itself an extension of single `Unicode.Scalar.Properties`. Since we're defining additional classifications formed from algebraic formulations of properties, it may make sense to put API such as `decimalDigitValue` on `Unicode.Scalar` as well as back-porting other API from `Character` (e.g. `hexDigitValue`). We'd like to discuss this with the community. 
- -*TBD*: `Character.isHexDigit` is currently constrained to the subset of decimal digits that are followed by encodings of Latin letters `A-F` in various forms (all 6 of them... thanks Unicode). We could consider extending this to be a superset of `isDecimalDigit` by allowing and producing values for all decimal digits, one would just have to use the Latin letters to refer to values greater than `9`. We'd like to discuss this with the community. - -_
Rationale_ - -Unicode's recommended definition for `\d` is its [numeric type][numerictype] of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its [definition][derivednumeric] and is a proper subset of `Character.isWholeNumber`. - -We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make this Character property _restrictive_, similar to `isHexDigit` and `isWholeNumber` and provide a way to access this value. - -It's possible we might add future properties to differentiate Unicode's non-decimal digits, but that is outside the scope of this pitch. - -
- -### Word characters: `\w`, `\W` - -We propose `\w` be named "word character" with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character is considered - /// a "word" character. - /// - /// See `Unicode.Scalar.isWordCharacter`. - public var isWordCharacter: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// a "word" character. - /// - /// Any Unicode scalar that has one of the Unicode properties - /// `Alphabetic`, `Digit`, or `Join_Control`, or is in the - /// general category `Mark` or `Connector_Punctuation`. - public var isWordCharacter: Bool { get } -} -``` - -`\W` matches the inverse of `\w`. - -_
Rationale_ - -Word characters include more than letters, and we went with Unicode's recommended scalar semantics. We extend to grapheme clusters similarly to `Character.isLetter`, that is, subsequent (combining) scalars do not change the word-character-ness of the grapheme cluster. - -
- -### Whitespace and newlines: `\s`, `\S` (plus `\h`, `\H`, `\v`, `\V`, and `\R`) - -We propose `\s` be named "whitespace" with the following definitions: - -```swift -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// whitespace. - /// - /// All Unicode scalars with the derived `White_Space` property are - /// considered whitespace, including: - /// - /// - `CHARACTER TABULATION` (U+0009) - /// - `LINE FEED (LF)` (U+000A) - /// - `LINE TABULATION` (U+000B) - /// - `FORM FEED (FF)` (U+000C) - /// - `CARRIAGE RETURN (CR)` (U+000D) - /// - `NEWLINE (NEL)` (U+0085) - public var isWhitespace: Bool { get } -} -``` - -This definition matches the value of the existing `Unicode.Scalar.Properties.isWhitespace` property. Note that `Character.isWhitespace` already exists with the desired semantics, which is a grapheme cluster that begins with a whitespace Unicode scalar. - -We propose `\h` be named "horizontalWhitespace" with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character is considered - /// horizontal whitespace. - /// - /// All characters with an initial Unicode scalar in the general - /// category `Zs`/`Space_Separator`, or the control character - /// `CHARACTER TABULATION` (U+0009), are considered horizontal - /// whitespace. - public var isHorizontalWhitespace: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// horizontal whitespace. - /// - /// All Unicode scalars with the general category - /// `Zs`/`Space_Separator`, along with the control character - /// `CHARACTER TABULATION` (U+0009), are considered horizontal - /// whitespace. 
- public var isHorizontalWhitespace: Bool { get } -} -``` - -We propose `\v` be named "verticalWhitespace" with the following definitions: - - -```swift -extension Character { - /// A Boolean value indicating whether this scalar is considered - /// vertical whitespace. - /// - /// All characters with an initial Unicode scalar in the general - /// category `Zl`/`Line_Separator`, or the following control - /// characters, are considered vertical whitespace (see below) - public var isVerticalWhitespace: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// vertical whitespace. - /// - /// All Unicode scalars with the general category - /// `Zl`/`Line_Separator`, along with the following control - /// characters, are considered vertical whitespace: - /// - /// - `LINE FEED (LF)` (U+000A) - /// - `LINE TABULATION` (U+000B) - /// - `FORM FEED (FF)` (U+000C) - /// - `CARRIAGE RETURN (CR)` (U+000D) - /// - `NEWLINE (NEL)` (U+0085) - public var isVerticalWhitespace: Bool { get } -} -``` - -Note that `Character.isNewline` already exists with the definition [required][lineboundary] by UTS\#18. *TBD:* Should we backport to `Unicode.Scalar`? - -`\S`, `\H`, and `\V` match the inverse of `\s`, `\h`, and `\v`, respectively. - -We propose `\R` include "verticalWhitespace" above with detection (and consumption) of the CR-LF sequence when applied to `Unicode.Scalar`. It is equivalent to `Character.isVerticalWhitespace` when applied to `Character`s. - -We are similarly not proposing any new API for `\R` until the stdlib has grapheme-breaking API over `Unicode.Scalar`. - -_
Rationale_ - -Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept. - -We use Unicode's recommended scalar semantics for horizontal whitespace and extend that to grapheme semantics similarly to `Character.isWhitespace`. - -We use ICU's definition for vertical whitespace, similarly extended to grapheme clusters. - -
- -### Control characters: `\t`, `\r`, `\n`, `\f`, `\0`, `\e`, `\a`, `\b`, `\cX` - -We propose the following names and meanings for these escaped literals representing specific control characters: - -```swift -extension Character { - /// A horizontal tab character, `CHARACTER TABULATION` (U+0009). - public static var tab: Character { get } - - /// A carriage return character, `CARRIAGE RETURN (CR)` (U+000D). - public static var carriageReturn: Character { get } - - /// A line feed character, `LINE FEED (LF)` (U+000A). - public static var lineFeed: Character { get } - - /// A form feed character, `FORM FEED (FF)` (U+000C). - public static var formFeed: Character { get } - - /// A NULL character, `NUL` (U+0000). - public static var nul: Character { get } - - /// An escape control character, `ESC` (U+001B). - public static var escape: Character { get } - - /// A bell character, `BEL` (U+0007). - public static var bell: Character { get } - - /// A backspace character, `BS` (U+0008). - public static var backspace: Character { get } - - /// A combined carriage return and line feed as a single character denoting - // end-of-line. - public static var carriageReturnLineFeed: Character { get } - - /// Returns a control character with the given value, Control-`x`. - /// - /// This method returns a value only when you pass a letter in - /// the ASCII range as `x`: - /// - /// if let ch = Character.control("G") { - /// print("'ch' is a bell character", ch == Character.bell) - /// } else { - /// print("'ch' is not a control character") - /// } - /// // Prints "'ch' is a bell character: true" - /// - /// - Parameter x: An upper- or lowercase letter to derive - /// the control character from. - /// - Returns: Control-`x` if `x` is in the pattern `[a-zA-Z]`; - /// otherwise, `nil`. - public static func control(_ x: Unicode.Scalar) -> Character? -} - -extension Unicode.Scalar { - /// Same as above, producing Unicode.Scalar, except for CR-LF... 
-} -``` - -We also propose `isControl` properties with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character represents - /// a control character. - /// - /// Control characters are a single Unicode scalar with the - /// general category `Cc`/`Control` or the CR-LF pair (`\r\n`). - public var isControl: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar represents - /// a control character. - /// - /// Control characters have the general category `Cc`/`Control`. - public var isControl: Bool { get } -} -``` - -*TBD*: Should we have a CR-LF static var on `Unicode.Scalar` that produces a value of type `Character`? - - -_
Rationale_ - -This approach simplifies the use of some common control characters, while making the rest available through a method call. - -
- - - -### Unicode named values and properties: `\N`, `\p`, `\P` - -`\N{NAME}` matches a Unicode scalar value with the specified name. `\p{PROPERTY}` and `\p{PROPERTY=VALUE}` match a Unicode scalar value with the given Unicode property (and value, if given). - -While most Unicode-defined properties can only match at the Unicode scalar level, some are defined to match an extended grapheme cluster. For example, `/\p{RGI_Emoji_Flag_Sequence}/` will match any flag emoji character, which are composed of two Unicode scalar values. - -`\P{...}` matches the inverse of `\p{...}`. - -Most of this is already present inside `Unicode.Scalar.Properties`, and we propose to round it out with anything missing, e.g. script and script extensions. (API is _TBD_, still working on it.) - -Even though we are not proposing any `Character`-based API, we'd like to discuss with the community whether or how to extend them to grapheme clusters. Some options: - -- Forbid in any grapheme-cluster semantic mode -- Match only single-scalar grapheme clusters with the given property -- Match any grapheme cluster that starts with the given property -- Something more-involved such as per-property reasoning - - -### POSIX character classes: `[:NAME:]` - -We propose that POSIX character classes be prefixed with "posix" in their name with APIs for testing membership of `Character`s and `Unicode.Scalar`s. `Unicode.Scalar.isASCII` and `Character.isASCII` already exist and can satisfy `[:ascii:]`, and can be used in combination with new members like `isDigit` to represent individual POSIX character classes. Alternatively, we could introduce an option-set-like `POSIXCharacterClass` and `func isPOSIX(_:POSIXCharacterClass)` since POSIX is a fully defined standard. This would cut down on the amount of API noise directly visible on `Character` and `Unicode.Scalar` significantly. 
We'd like some discussion the the community here, noting that this will become clearer as more of the string processing overview takes shape. - -POSIX's character classes represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which are covered elsewhere in this pitch and some of which already exist today. Some Character definitions are *TBD* and we'd like more discussion with the community. - - -| POSIX class | API name | `Character` | `Unicode.Scalar` | POSIX mode value | -|-------------|----------------------|-----------------------|-------------------------------|-------------------------------| -| `[:lower:]` | lowercase | (exists) | `\p{Lowercase}` | `[a-z]` | -| `[:upper:]` | uppercase | (exists) | `\p{Uppercase}` | `[A-Z]` | -| `[:alpha:]` | alphabetic | (exists: `.isLetter`) | `\p{Alphabetic}` | `[A-Za-z]` | -| `[:alnum:]` | alphaNumeric | TBD | `[\p{Alphabetic}\p{Decimal}]` | `[A-Za-z0-9]` | -| `[:word:]` | wordCharacter | (pitched) | (pitched) | `[[:alnum:]_]` | -| `[:digit:]` | decimalDigit | (pitched) | (pitched) | `[0-9]` | -| `[:xdigit:]`| hexDigit | (exists) | `\p{Hex_Digit}` | `[0-9A-Fa-f]` | -| `[:punct:]` | punctuation | (exists) | (port from `Character`) | `[-!"#%&'()*,./:;?@[\\\]_{}]` | -| `[:blank:]` | horizontalWhitespace | (pitched) | (pitched) | `[ \t]` | -| `[:space:]` | whitespace | (exists) | `\p{Whitespace}` | `[ \t\n\r\f\v]` | -| `[:cntrl:]` | control | (pitched) | (pitched) | `[\x00-\x1f\x7f]` | -| `[:graph:]` | TBD | TBD | TBD | `[^ [:cntrl:]]` | -| `[:print:]` | TBD | TBD | TBD | `[[:graph:] ]` | - - -### Custom classes: `[...]` - -We propose that custom classes function just like set union. We propose that ranged-based custom character classes function just like `ClosedRange`. Thus, we are not proposing any additional API. - -That being said, providing grapheme cluster semantics is simultaneously obvious and tricky. 
A direct extension treats `[a-f]` as equivalent to `("a"..."f").contains()`. Strings (and thus Characters) are ordered for the purposes of efficiently maintaining programming invariants while honoring Unicode canonical equivalence. This ordering is _consistent_ but [linguistically meaningless][meaningless] and subject to implementation details such as whether we choose to normalize under NFC or NFD. - -```swift -let c: ClosedRange = "a"..."f" -c.contains("e") // true -c.contains("g") // false -c.contains("e\u{301}") // false, NFC uses precomposed é -c.contains("e\u{305}") // true, there is no precomposed e̅ -``` - -We will likely want corresponding `RangeExpression`-based API in the future and keeping consistency with ranges is important. - -We would like to discuss this problem with the community here. Even though we are not addressing regex literals specifically in this thread, it makes sense to produce suggestions for compilation errors or warnings. - -Some options: - -- Do nothing, embrace emergent behavior -- Warn/error for _any_ character class ranges -- Warn/error for character class ranges outside of a quasi-meaningful subset (e.g. ACII, albeit still has issues above) -- Warn/error for multiple-scalar grapheme clusters (albeit still has issues above) - - - -## Future Directions - -### Future API - -Library-extensible pattern matching will necessitate more types, protocols, and API in the future, many of which may involve character classes. This pitch aims to define names and semantics for exactly these kinds of API now, so that they can slot in naturally. - -### More classes or custom classes - -Future API might express custom classes or need more built-in classes. This pitch aims to establish rationale and precedent for a large number of character classes in Swift, serving as a basis that can be extended. 
- -### More lenient conversion APIs - -The proposed semantics for matching "digits" are broader than what the existing `Int(_:radix:)?` initializer accepts. It may be useful to provide additional initializers that can understand the whole breadth of characters matched by `\d`, or other related conversions. - - - - -[literals]: https://forums.swift.org/t/pitch-regular-expression-literals/52820 -[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 -[charprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md -[charpropsrationale]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md#detailed-semantics-and-rationale -[canoneq]: https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence -[graphemes]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries -[meaningless]: https://forums.swift.org/t/declarative-string-processing-overview/52459/121 -[scalarprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md -[ucd]: https://www.unicode.org/reports/tr44/tr44-28.html -[numerictype]: https://www.unicode.org/reports/tr44/#Numeric_Type -[derivednumeric]: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt - - -[uts18]: https://unicode.org/reports/tr18/ -[proplist]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt -[pcre]: https://www.pcre.org/current/doc/html/pcre2pattern.html -[perl]: https://perldoc.perl.org/perlre -[raku]: https://docs.raku.org/language/regexes -[rust]: https://docs.rs/regex/1.5.4/regex/ -[python]: https://docs.python.org/3/library/re.html -[ruby]: https://ruby-doc.org/core-2.4.0/Regexp.html -[csharp]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference -[icu]: https://unicode-org.github.io/icu/userguide/strings/regexp.html -[posix]: 
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html -[oniguruma]: https://www.cuminas.jp/sdk/regularExpression.html -[go]: https://pkg.go.dev/regexp/syntax@go1.17.2 -[cplusplus]: https://www.cplusplus.com/reference/regex/ECMAScript/ -[ecmascript]: https://262.ecma-international.org/12.0/#sec-pattern-semantics -[re2]: https://github.com/google/re2/wiki/Syntax -[java]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md new file mode 100644 index 000000000..5500c06d3 --- /dev/null +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -0,0 +1,677 @@ +# Unicode for String Processing + +Proposal: [SE-NNNN](NNNN-filename.md) +Authors: [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman) +Review Manager: TBD +Implementation: [apple/swift-experimental-string-processing][repo] +Status: **Draft** + + +## Introduction + +This proposal describes `Regex`'s rich Unicode support during regular expression matching, along with the character classes and options that define that behavior. + +## Motivation + +Character classes in regular expressions include metacharacters like `\d` to match a digit, `\s` to match whitespace, and `.` to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves, and, in case-insensitive matching, their case-toggled counterpart. For the purpose of this work, then, we consider a *character class* to be any part of a regular expression literal that can match an actual component of a string. + +Operating over classes of characters is a vital component of string processing. Swift's `String` provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. 
+ +```swift +let str = "Cafe\u{301}" // "Café" +str == "Café" // true +str.dropLast() // "Caf" +str.last == "é" // true (precomposed e with acute accent) +str.last == "e\u{301}" // true (e followed by composing acute accent) +``` + +At a regular expression's simplest, without metacharacters or special features, matching should behave like a test for equality. A string should always match a regular expression that simply contains the same characters. + +```swift +str.contains(/Café/) // true +``` + +And from there, small changes should continue to comport with the element counting and comparison expectations set by `String`: + +```swift +str.contains(/Caf./) // true +str.contains(/.+é/) // true +str.contains(/.+e\u{301}/) // true +str.contains(/\w+é/) // true +``` + +With these initial principles in hand, we can look at how character classes should behave with Swift strings. Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult. + +
Other engines + +Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster. + +| Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining | +|---|---|---|---|---| +| C#, Rust, Go | `"Cafe"` | `"´"` | n/a | n/a | +| NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` | + +Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence. + +
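The difference between the two semantic levels in the table above can be seen with the standard library alone, without any regex support. The following is a minimal sketch, not part of the proposal's API; the names `decomposed` and `precomposed` are illustrative only.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT forms a single grapheme cluster.
let decomposed = "Cafe\u{301}"
// The same text using the precomposed scalar U+00E9.
let precomposed = "Caf\u{E9}"

// At the `Character` level, comparison honors canonical equivalence.
print(decomposed == precomposed)               // true
print(decomposed.count)                        // 4

// At the `Unicode.Scalar` level, the two strings have different contents.
print(decomposed.unicodeScalars.count)         // 5
print(precomposed.unicodeScalars.count)        // 4
print(decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)) // false
```

An engine that matches `.` at the scalar level consumes only the `e` of the decomposed form, which corresponds to the leftover `"´"` shown in the table.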
+ +## Proposed solution + +TK: semantic levels, options for controlling, canonical equivalence, Unicode properties + +## Detailed design + +First, we'll discuss the options that let you control a regex's behavior, and then explore the character classes that define your pattern. + +### Options + +Options can be declared in two different ways: as part of [regular expression literal syntax][literals], or applied as methods when declaring a `Regex`. For example, both of these `Regex`es are declared with case insensitivity: + +```swift +let regex1 = /(?i)label/ +let regex2 = Regex { + "label" +}.ignoringCase() +``` + +Note that `ignoringCase()` is available on any type conforming to `RegexComponent`, which means that you can use the more readable option-setting interface in conjunction with literals: + +```swift +let regex3 = /label/.ignoringCase() +``` + +Calling option-setting methods like `ignoringCase(_:)` acts like wrapping the regex in an option-setting group. That is, while it sets the *default* behavior for the callee, it doesn’t override options that are applied to more specific regions. In this example, the `"b"` in `"label"` matches case-sensitively, despite the outer call to `ignoringCase()`: + +```swift +let regex4 = Regex { + "la" + "b".ignoringCase(false) + "el" +} +.ignoringCase() + +"label".contains(regex4) // true +"LAbEL".contains(regex4) // true +"LABEL".contains(regex4) // false +``` + +Option scoping in literals is discussed in the [Run-time Regex Construction proposal][option-scoping]. + +#### Case insensitivity + +Regular expressions perform case sensitive comparisons by default. The `i` option or the `ignoringCase(_:)` method enables case insensitive comparison. + +```swift +let str = "Café" + +str.firstMatch(of: /CAFÉ/) // nil +str.firstMatch(of: /(?i)CAFÉ/) // "Café" +str.firstMatch(of: /(?i)cAfÉ/) // "Café" +``` + +Case insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected. 
+ +*Regex literal syntax:* `/(?i).../` or `/(?i...)/` + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression that ignores casing when matching. + public func ignoringCase(_ ignoreCase: Bool = true) -> Regex +} +``` + +#### Single line mode (`.` matches newlines) + +The "any" metacharacter (`.`) matches any character in a string *except* newlines by default. With the `s` option enabled, `.` matches any character including newlines. + +```swift +let str = """ + <<This string + uses double-angle-brackets + to group text.>> + """ + +str.firstMatch(of: /<<.+>>/) // nil +str.firstMatch(of: /(?s)<<.+>>/) // "This string\nuses double-angle-brackets\nto group text." +``` + +This option also affects the behavior of `CharacterClass.any`, which is designed to match the behavior of the `.` regex literal component. + +*Regex literal syntax:* `/(?s).../` or `/(?s...)/` + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression where the "any" metacharacter (`.`) + /// also matches newlines. + public func dotMatchesNewlines(_ dotMatchesNewlines: Bool = true) -> Regex +} +``` + +#### Reluctant quantification by default + +Regular expression quantifiers (`+`, `*`, and `?`) match eagerly by default, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant instead, so that it matches the shortest possible substring. + +```swift +let str = "A value." + +// By default, the '+' quantifier is eager, and consumes as much as possible. +str.firstMatch(of: /<.+>/) // "A value." + +// Adding '?' makes the '+' quantifier reluctant, so that it consumes as little as possible. +str.firstMatch(of: /<.+?>/) // "" +``` + +The `U` option toggles the "eagerness" of quantifiers, so that quantifiers are reluctant by default, and only become eager when `?` is added to the quantifier. 
+ +```swift +// '(?U)' toggles the eagerness of quantifiers: +str.firstMatch(of: /(?U)<.+>/) // "" +str.firstMatch(of: /(?U)<.+?>/) // "A value." +``` + +*Regex literal syntax:* `/(?U).../` or `/(?U...)/` + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression where quantifiers are reluctant by default + /// instead of eager. + public func reluctantCaptures(_ useReluctantCaptures: Bool = true) -> Regex +} +``` + +#### Use ASCII-only character classes + +With one or more of these options enabled, the default character classes match only ASCII values instead of the full Unicode range of characters. Four options are included in this group: + +* `D`: Match only ASCII members for `\d`, `\p{Digit}`, `[:digit:]`, and the `CharacterClass.digit`. +* `S`: Match only ASCII members for `\s`, `\p{Space}`, `[:space:]`. +* `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, `\b`, `CharacterClass.word`, and `Anchor.wordBoundary`. +* `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`). + +*Regex literal syntax:* `/(?DSWP).../` or `/(?DSWP...)/` + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression that only matches ASCII characters as "word + /// characters". + public func usingASCIIWordCharacters(_ useASCII: Bool = true) -> Regex + + /// Returns a regular expression that only matches ASCII characters as digits. + public func usingASCIIDigits(_ useASCII: Bool = true) -> Regex + + /// Returns a regular expression that only matches ASCII characters as space + /// characters. + public func usingASCIISpaces(_ useASCII: Bool = true) -> Regex + + /// Returns a regular expression that only matches ASCII characters when + /// matching character classes. 
+ public func usingASCIICharacterClasses(_ useASCII: Bool = true) -> Regex +} +``` + +#### Use Unicode word boundaries + +By default, matching uses the Unicode specification for finding word boundaries for the `\b` and `Anchor.wordBoundary` anchors. Disabling the `w` option switches to finding word boundaries at points in the input where `\b\B` or `\B\b` match, given the other matching options that are enabled, which may be more compatible with other regular expression engines. + +In this example, the default matching behavior finds the whole first word of the string, while the match with Unicode word boundaries disabled stops at the apostrophe: + +```swift +let str = "Don't look down!" + +str.firstMatch(of: /D\S+\b/) // "Don't" +str.firstMatch(of: /(?-w)D\S+\b/) // "Don" +``` + +*Regex literal syntax:* `/(?-w).../` or `/(?-w...)/` + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression that uses the Unicode word boundary + /// algorithm. + /// + /// This option is enabled by default; pass `false` to disable use of + /// Unicode's word boundary algorithm. + public func usingUnicodeWordBoundaries(_ useUnicodeWordBoundaries: Bool = true) -> Regex +} +``` + +### Matching semantic level + +When matching with grapheme cluster semantics (the default), metacharacters like `.` and `\w`, custom character classes, and character class instances like `.any` match a grapheme cluster when possible, corresponding with the default string representation. In addition, matching with grapheme cluster semantics compares characters using their canonical representation, corresponding with the way comparing strings for equality works. + +When matching with Unicode scalar semantics, metacharacters and character classes always match a single Unicode scalar value, even if that scalar is part of a grapheme cluster. + +These semantic levels can lead to different results, especially when working with strings that have decomposed characters. 
In the following example, `queRegex` matches any 3-character string that begins with `"q"`. + +```swift +let composed = "qué" +let decomposed = "que\u{301}" + +let queRegex = /^q..$/ + +print(composed.contains(queRegex)) +// Prints "true" +print(decomposed.contains(queRegex)) +// Prints "true" +``` + +When using Unicode scalar semantics, however, the regular expression only matches the composed version of the string, because each `.` matches a single Unicode scalar value. + +```swift +let queRegexScalar = queRegex.matchingSemantics(.unicodeScalar) +print(composed.contains(queRegexScalar)) +// Prints "true" +print(decomposed.contains(queRegexScalar)) +// Prints "false" +``` + +*Regex literal syntax:* `(?X)...` or `(?X...)` for grapheme cluster semantics, `(?u)...` or `(?u...)` for Unicode scalar semantics. + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression that matches with the specified semantic + /// level. + public func matchingSemantics(_ semanticLevel: RegexSemanticLevel) -> Regex +} + +public struct RegexSemanticLevel: Hashable { + /// Match at the default semantic level of a string, where each matched + /// element is a `Character`. + public static var graphemeCluster: RegexSemanticLevel + + /// Match at the semantic level of a string's `UnicodeScalarView`, where each + /// matched element is a `UnicodeScalar` value. + public static var unicodeScalar: RegexSemanticLevel +} +``` + +#### Multiline mode + +By default, the start and end anchors (`^` and `$`) match only the beginning and end of a string. With the `m` option, they also match the beginning and end of each line. + +```swift +let str = """ + abc + def + ghi + """ + +str.firstMatch(of: /^abc/) // "abc" +str.firstMatch(of: /^abc$/) // nil +str.firstMatch(of: /(?m)^abc$/) // "abc" + +str.firstMatch(of: /^def/) // nil +str.firstMatch(of: /(?m)^def$/) // "def" +``` + +This option applies only to anchors used in a regex literal. 
The anchors defined in `RegexBuilder` are specific about matching at the start/end of the input or the line, and therefore do not correspond directly with the `^` and `$` literal anchors. + +```swift +str.firstMatch(of: Regex { Anchor.startOfInput ; "def" }) // nil +str.firstMatch(of: Regex { Anchor.startOfLine ; "def" }) // "def" +``` + +*Regex literal syntax:* `/(?m).../` or `/(?m...)/` + +*Regex builder syntax:* + +```swift +extension RegexComponent { + /// Returns a regular expression where the start and end of input + /// anchors (`^` and `$`) also match against the start and end of a line. + public func anchorsMatchLineEndings(_ matchLineEndings: Bool = true) -> Regex +} +``` + +--- + +### Character Classes + +We propose the following definitions for regex character classes, along with a `CharacterClass` type as part of the `RegexBuilder` module, to encapsulate and simplify character class usage within builder-style regexes. + +The two regular expressions defined in this example will match the same inputs, looking for one or more word characters followed by up to three digits, optionally separated by a space: + +```swift +let regex1 = /\w+\s?\d{,3}/ +let regex2 = Regex { + OneOrMore(.word) + Optionally(.whitespace) + Repeat(.decimalDigit, ...3) +} +``` + +You can build custom character classes by combining regex-defined classes with individual characters or ranges, or by performing common set operations such as subtracting or negating a character class. + + +#### “Any” + +The simplest character class, representing **any character**, is written as `.` or `CharacterClass.any` and is also referred to as the "dot" metacharacter. This class always matches a single `Character` or Unicode scalar value, depending on the matching semantic level. This class excludes newlines, unless "single line mode" is enabled (see section above). 
+ +In the following example, using grapheme cluster semantics, a dot matches a grapheme cluster, so the decomposed é is treated as a single value: + +```swift +"Cafe\u{301}".contains(/C.../) +// true +``` + +For this example, using Unicode scalar semantics, a dot matches only a single Unicode scalar value, so the combining marks don't get grouped with the commas before them: + +```swift +let data = "\u{300},\u{301},\u{302},\u{303},..." +for match in data.matches(of: /(.),/.matchingSemantics(.unicodeScalar)) { + print(match.1) +} +// Prints: +// ̀ +// ́ +// ̂ +// ... +``` + +`Regex` also provides ways to select a specific level of "any" matching, without needing to change semantic levels. + +- The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. +- The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. + +#### Decimal and hexadecimal digits + +The **decimal digit** character class is matched by `\d` or `CharacterClass.decimalDigit`. Both regexes in this example match one or more decimal digits followed by a colon: + +```swift +let regex1 = /\d+:/ +let regex2 = Regex { + OneOrMore(.decimalDigit) + ":" +} +``` + +_Unicode scalar semantics:_ Matches a Unicode scalar that has a `numericType` property equal to `.decimal`. This includes the digits from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F). This corresponds to the general category `Decimal_Number`. + +_Grapheme cluster semantics:_ Matches a character made up of a single Unicode scalar that fits the decimal digit criteria above. + +_ASCII mode_: Matches a Unicode scalar in the range `0` to `9`. 
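The scalar-level and grapheme-cluster-level definitions of a decimal digit can be sketched with the standard library's Unicode scalar properties. The helper names `isDecimalDigitScalar` and `isDecimalDigitCharacter` below are hypothetical and exist only for illustration; they are not part of the proposed API.

```swift
/// Hypothetical helper mirroring the Unicode scalar definition above:
/// the scalar's numeric type must be "Decimal".
func isDecimalDigitScalar(_ scalar: Unicode.Scalar) -> Bool {
    scalar.properties.numericType == .decimal
}

/// Hypothetical helper mirroring the grapheme cluster definition above:
/// the character must consist of exactly one scalar, and that scalar
/// must be a decimal digit.
func isDecimalDigitCharacter(_ character: Character) -> Bool {
    let scalars = character.unicodeScalars
    guard scalars.count == 1, let only = scalars.first else { return false }
    return isDecimalDigitScalar(only)
}

let asciiSeven = isDecimalDigitScalar("7")
let devanagariNine = isDecimalDigitScalar("\u{096F}")   // DEVANAGARI DIGIT NINE
let fullwidthSeven = isDecimalDigitScalar("\u{FF17}")   // FULLWIDTH DIGIT SEVEN
let superscriptTwo = isDecimalDigitScalar("\u{B2}")     // numeric type "Digit", not "Decimal"

let plainCharacter = isDecimalDigitCharacter("7")
let accentedCharacter = isDecimalDigitCharacter("7\u{301}") // two scalars, one Character

print(asciiSeven, devanagariNine, fullwidthSeven, superscriptTwo)
print(plainCharacter, accentedCharacter)
```

The last case shows the restrictive grapheme cluster interpretation: a digit followed by a combining mark is a single `Character`, but it does not count as a decimal digit.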
+ + +To invert the decimal digit character class, use `\D` or `CharacterClass.decimalDigit.inverted`. + + +The **hexadecimal digit** character class is matched by `CharacterClass.hexDigit`. + +_Unicode scalar semantics:_ Matches a decimal digit, as described above, or an uppercase or small `A` through `F` from the _Halfwidth and Fullwidth Forms_ Unicode block. Note that this is a broader class than described by the `UnicodeScalar.properties.isHexDigit` property, as that property only includes ASCII and fullwidth decimal digits. + +_Grapheme cluster semantics:_ Matches a character made up of a single Unicode scalar that fits the hex digit criteria above. + +_ASCII mode_: Matches a Unicode scalar in the range `0` to `9`, `a` to `f`, or `A` to `F`. + +To invert the hexadecimal digit character class, use `CharacterClass.hexDigit.inverted`. + +*
Rationale* + +Unicode's recommended definition for `\d` is its [numeric type][numerictype] of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its [definition][derivednumeric] and is a proper subset of `Character.isWholeNumber`. + +We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make the grapheme cluster interpretation *restrictive*. + +
+ + +#### "Word" characters + +The **word** character class is matched by `\w` or `CharacterClass.word`. This character class and its name are essentially terms of art within regular expressions, and represents part of a notional "word". Note that, by default, this is distinct from the algorithm for identifying word boundaries. + +_Unicode scalar semantics:_ Matches a Unicode scalar that has one of the Unicode properties `Alphabetic`, `Digit`, or `Join_Control`, or is in the general category `Mark` or `Connector_Punctuation`. + +_Grapheme cluster semantics:_ Matches a character that begins with a Unicode scalar value that fits the criteria above. + +_ASCII mode_: Matches the numbers `0` through `9`, lowercase and uppercase `A` through `Z`, and the underscore (`_`). + +To invert the word character class, use `\W` or `CharacterClass.word.inverted`. + +*
Rationale* + +Word characters include more than letters, and we went with Unicode's recommended scalar semantics. Following the Unicode recommendation that nonspacing marks remain with their base characters, we extend to grapheme clusters similarly to `Character.isLetter`. That is, combining scalars do not change the word-character-ness of the grapheme cluster. + +
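The "starts-with" extension from scalars to grapheme clusters can be sketched as follows. The helpers `isWordScalar` and `isWordCharacter` are hypothetical names, and the sketch assumes that the `Digit` property in the definition above corresponds to the general category `Decimal_Number`; it is an illustration of the rule, not the proposal's implementation.

```swift
/// Hypothetical helper approximating the scalar-level word character rule.
func isWordScalar(_ scalar: Unicode.Scalar) -> Bool {
    let properties = scalar.properties
    if properties.isAlphabetic || properties.isJoinControl {
        return true
    }
    switch properties.generalCategory {
    case .decimalNumber,                                  // assumed meaning of "Digit"
         .nonspacingMark, .spacingMark, .enclosingMark,   // general category "Mark"
         .connectorPunctuation:
        return true
    default:
        return false
    }
}

/// Hypothetical helper for grapheme cluster semantics: only the first
/// scalar decides, so trailing combining scalars do not change the result.
func isWordCharacter(_ character: Character) -> Bool {
    guard let first = character.unicodeScalars.first else { return false }
    return isWordScalar(first)
}

let letter = isWordCharacter("a")
let underscore = isWordCharacter("_")
let digit = isWordCharacter("9")
let decomposedE = isWordCharacter("e\u{301}")
let space = isWordCharacter(" ")
let hyphen = isWordCharacter("-")

print(letter, underscore, digit, decomposedE, space, hyphen)
```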
+ + +#### Whitespace and newlines + +The **whitespace** character class is matched by `\s` and `CharacterClass.whitespace`. + +_Unicode scalar semantics:_ Matches a Unicode scalar that has the Unicode property `Whitespace`, including a space, a horizontal tab (U+0009), `LINE FEED (LF)` (U+000A), `LINE TABULATION` (U+000B), `FORM FEED (FF)` (U+000C), `CARRIAGE RETURN (CR)` (U+000D), and `NEXT LINE (NEL)` (U+0085). Note that under Unicode scalar semantics, `\s` only matches the first scalar in a `CR`+`LF` pair. + +_Grapheme cluster semantics:_ Matches a character that begins with a `Whitespace` Unicode scalar value. This includes matching a `CR`+`LF` pair. + +_ASCII mode_: Matches characters that are both ASCII and fit the criteria given above. The current matching semantics dictate whether a `CR`+`LF` pair is matched in ASCII mode. + +The **horizontal whitespace** character class is matched by `\h` and `CharacterClass.horizontalWhitespace`. + +_Unicode scalar semantics:_ Matches a Unicode scalar that has the Unicode general category `Zs`/`Space_Separator` as well as a horizontal tab (U+0009). + +_Grapheme cluster semantics:_ Matches a character that begins with a Unicode scalar value that fits the criteria above. + +_ASCII mode_: Matches either a space (`" "`) or a horizontal tab. + +The **vertical whitespace** character class is matched by `\v` and `CharacterClass.verticalWhitespace`. Additionally, `\R` and `CharacterClass.newline` provide a way to include the `CR`+`LF` pair, even when matching with Unicode scalar semantics. + +_Unicode scalar semantics:_ Matches a Unicode scalar that has the Unicode general category `Zl`/`Line_Separator` as well as any of the following control characters: `LINE FEED (LF)` (U+000A), `LINE TABULATION` (U+000B), `FORM FEED (FF)` (U+000C), `CARRIAGE RETURN (CR)` (U+000D), and `NEXT LINE (NEL)` (U+0085). Only when specified as `\R` or `CharacterClass.newline` does this match the whole `CR`+`LF` pair. 
+ +_Grapheme cluster semantics:_ Matches a character that begins with a Unicode scalar value that fits the criteria above. + +_ASCII mode_: Matches any of the four ASCII control characters listed above. The current matching semantics dictate whether a `CR`+`LF` pair is matched in ASCII mode. + +To invert these character classes, use `\S`, `\H`, and `\V`, respectively, or the `inverted` property on a `CharacterClass` instance. + +
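The different treatment of a `CR`+`LF` pair at the two semantic levels follows from how `String` itself models that pair. This short sketch uses only existing standard library properties; the variable names are illustrative.

```swift
let crlf = "\r\n"

// One Character at the grapheme cluster level...
let characterCount = crlf.count
// ...but two Unicode scalars at the scalar level.
let scalarCount = crlf.unicodeScalars.count

// The pair as a single Character is both a newline and whitespace.
let crlfCharacter: Character = "\r\n"
let pairIsNewline = crlfCharacter.isNewline
let pairIsWhitespace = crlfCharacter.isWhitespace

// At the scalar level, each scalar is classified on its own.
let eachScalarIsWhitespace = crlf.unicodeScalars.allSatisfy { $0.properties.isWhitespace }

print(characterCount, scalarCount)
print(pairIsNewline, pairIsWhitespace, eachScalarIsWhitespace)
```

A class that matches one scalar at a time can therefore consume only the `CR`, which is why `\R` is needed to take the whole pair under Unicode scalar semantics.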
Rationale + +Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept. + +We use Unicode's recommended scalar semantics for horizontal and vertical whitespace, extended to grapheme clusters as in the existing `Character.isWhitespace` property. + +
+ + +#### Unicode properties + +Character classes that match **Unicode properties** are written as `\p{PROPERTY}` or `\p{PROPERTY=VALUE}`, as described in the [Run-time Regex Construction proposal][literal-properties]. + +While most Unicode properties are only defined at the scalar level, some are defined to match an extended grapheme cluster. For example, `\p{RGI_Emoji_Flag_Sequence}` will match any flag emoji character, which is composed of two Unicode scalar values. Such property classes will match multiple scalars, even when matching with Unicode scalar semantics. + +Unicode property matching is extended to `Character`s with a goal of consistency with other regex character classes. For `\p{Decimal}` and `\p{Hex_Digit}`, only single-scalar `Character`s can match, for the reasons described in that section, above. For all other Unicode property classes, matching `Character`s can comprise multiple scalars, as long as the first scalar matches the property. + +To invert a Unicode property character class, use `\P{...}`. + + +#### POSIX character classes: `[:NAME:]` + +**POSIX character classes** represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which have been described above. When matching with grapheme cluster semantics, Unicode properties are extended to `Character`s as described in the rationale above, and as shown in the table below. That is, for POSIX class `[:word:]`, any `Character` that starts with a matching scalar is a match, while for `[:digit:]`, a matching `Character` must comprise only a single Unicode scalar value. 
+ +| POSIX class | Unicode property class | Character behavior | ASCII mode value | +|--------------|-----------------------------------|----------------------|-------------------------------| +| `[:lower:]` | `\p{Lowercase}` | starts-with | `[a-z]` | +| `[:upper:]` | `\p{Uppercase}` | starts-with | `[A-Z]` | +| `[:alpha:]` | `\p{Alphabetic}` | starts-with | `[A-Za-z]` | +| `[:alnum:]` | `[\p{Alphabetic}\p{Decimal}]` | starts-with | `[A-Za-z0-9]` | +| `[:word:]` | See \* below | starts-with | `[[:alnum:]_]` | +| `[:digit:]` | `\p{DecimalNumber}` | single-scalar | `[0-9]` | +| `[:xdigit:]` | `\p{Hex_Digit}` | single-scalar | `[0-9A-Fa-f]` | +| `[:punct:]` | `\p{Punctuation}` | starts-with | `[-!"#%&'()*,./:;?@[\\\]{}]` | +| `[:blank:]` | `[\p{Space_Separator}\u{09}]` | starts-with | `[ \t]` | +| `[:space:]` | `\p{Whitespace}` | starts-with | `[ \t\n\r\f\v]` | +| `[:cntrl:]` | `\p{Control}` | starts-with | `[\x00-\x1f\x7f]` | +| `[:graph:]` | See \*\* below | starts-with | `[^ [:cntrl:]]` | +| `[:print:]` | `[[:graph:][:blank:]--[:cntrl:]]` | starts-with | `[[:graph:] ]` | + +\* The Unicode scalar property definition for `[:word:]` is `[\p{Alphanumeric}\p{Mark}\p{Join_Control}\p{Connector_Punctuation}]`. +\*\* The Unicode scalar property definition for `[:graph:]` is `[^\p{Space}\p{Control}\p{Surrogate}\p{Unassigned}]`. + +#### Custom classes + +Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes. + +- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes. +- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). 
This differs from how a `ClosedRange` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`. +- A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics. + +In regex literals, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [literal syntax proposal][literal-charclass]. + +With the `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.decimalDigit` with a range of characters. + +```swift +let octoDecimalRegex: Regex = Regex { + let charClass = CharacterClass(.decimalDigit, "a"..."h").ignoringCase() + Capture(OneOrMore(charClass)) + transform: { Int($0, radix: 18) } +} +``` + +The full `CharacterClass` API is as follows: + +```swift +public struct CharacterClass: RegexComponent { + public var regex: Regex { get } + + public var inverted: CharacterClass { get } +} + +extension RegexComponent where Self == CharacterClass { + public static var any: CharacterClass { get } + + public static var anyGraphemeCluster: CharacterClass { get } + + public static var anyUnicodeScalar: CharacterClass { get } + + public static var decimalDigit: CharacterClass { get } + + public static var hexDigit: CharacterClass { get } + + public static var word: CharacterClass { get } + + public static var whitespace: CharacterClass { get } + + public static var horizontalWhitespace: CharacterClass { get } + + public static var newlineSequence: CharacterClass { get } + + public static var 
verticalWhitespace: CharacterClass { get } +} + +extension RegexComponent where Self == CharacterClass { + /// Returns a character class that matches any character in the given string + /// or sequence. + public static func anyOf(_ s: S) -> CharacterClass + + /// Returns a character class that matches any unicode scalar in the given + /// sequence. + public static func anyOf(_ s: S) -> CharacterClass +} + +// Unicode properties +extension CharacterClass { + public static func generalCategory(_ category: Unicode.GeneralCategory) -> CharacterClass +} + +// Set algebra methods +extension CharacterClass { + public init(_ first: CharacterClass, _ rest: CharacterClass...) + + public func union(_ other: CharacterClass) -> CharacterClass + + public func intersection(_ other: CharacterClass) -> CharacterClass + + public func subtracting(_ other: CharacterClass) -> CharacterClass + + public func symmetricDifference(_ other: CharacterClass) -> CharacterClass +} + +/// Range syntax for characters in `CharacterClass`es. +public func ...(lhs: Character, rhs: Character) -> CharacterClass + +/// Range syntax for unicode scalars in `CharacterClass`es. +@_disfavoredOverload +public func ...(lhs: UnicodeScalar, rhs: UnicodeScalar) -> CharacterClass +``` + +## Source compatibility + +Everything in this proposal is additive, and has no compatibility effect on existing source code. + + +## Effect on ABI stability + +Everything in this proposal is additive, and has no effect on existing stable ABI. + + +## Effect on API resilience + +TBD + + +## Future directions + +### Expanded options + +The initial version of `Regex` includes only the options described above. Filling out the remainder of options described in the [Run-time Regex Construction proposal][literals] could be completed as future work. 
+ +### Extensions to Character and Unicode Scalar APIs + +An earlier version of this pitch described adding standard library APIs to `Character` and `UnicodeScalar` for each of the supported character classes, as well as convenient static members for control characters. In addition, regex literals support Unicode property features that don’t currently exist in the standard library, such as a scalar’s script or extended category, or creating a scalar by its Unicode name instead of its scalar value. These kinds of additions are + +## Alternatives considered + +### Operate on String.UnicodeScalarView instead of using semantic modes + +Instead of providing APIs to select whether `Regex` matching is `Character`-based vs. `UnicodeScalar`-based, we could instead provide methods to match against the different views of a string. This different approach has multiple drawbacks: + +* As the scalar level used when matching changes the behavior of individual components of a `Regex`, it’s more appropriate to specify the semantic level at the declaration site than the call site. +* With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach. 
+ + + + +[repo]: https://github.com/apple/swift-experimental-string-processing/ +[option-scoping]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#matching-options +[literals]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md +[literal-properties]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#character-properties +[literal-charclass]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#custom-character-classes + +[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 +[charprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md +[charpropsrationale]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md#detailed-semantics-and-rationale +[canoneq]: https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence +[graphemes]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries +[meaningless]: https://forums.swift.org/t/declarative-string-processing-overview/52459/121 +[scalarprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md +[ucd]: https://www.unicode.org/reports/tr44/tr44-28.html +[numerictype]: https://www.unicode.org/reports/tr44/#Numeric_Type +[derivednumeric]: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt + + +[uts18]: https://unicode.org/reports/tr18/ +[proplist]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt +[pcre]: https://www.pcre.org/current/doc/html/pcre2pattern.html +[perl]: https://perldoc.perl.org/perlre +[raku]: https://docs.raku.org/language/regexes +[rust]: https://docs.rs/regex/1.5.4/regex/ +[python]: https://docs.python.org/3/library/re.html +[ruby]: 
https://ruby-doc.org/core-2.4.0/Regexp.html +[csharp]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference +[icu]: https://unicode-org.github.io/icu/userguide/strings/regexp.html +[posix]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html +[oniguruma]: https://www.cuminas.jp/sdk/regularExpression.html +[go]: https://pkg.go.dev/regexp/syntax@go1.17.2 +[cplusplus]: https://www.cplusplus.com/reference/regex/ECMAScript/ +[ecmascript]: https://262.ecma-international.org/12.0/#sec-pattern-semantics +[re2]: https://github.com/google/re2/wiki/Syntax +[java]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html From cf5cbe0e5549b95da09afb5cb9ef8afd7e31195b Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Mon, 11 Apr 2022 10:10:09 -0500 Subject: [PATCH 02/12] Revisions and additions --- .../Evolution/UnicodeForStringProcessing.md | 108 ++++++++++-------- 1 file changed, 61 insertions(+), 47 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 5500c06d3..0e53ff524 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -1,7 +1,7 @@ # Unicode for String Processing Proposal: [SE-NNNN](NNNN-filename.md) -Authors: [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman) +Authors: [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman), [Alejandro Alonso](https://github.com/Azoy) Review Manager: TBD Implementation: [apple/swift-experimental-string-processing][repo] Status: **Draft** @@ -9,7 +9,7 @@ Status: **Draft** ## Introduction -This proposal describes `Regex`'s rich Unicode support during regular expression matching, along with the character classes and options and that define that behavior. 
+This proposal describes `Regex`'s rich Unicode support during regular expression matching, along with the character classes and options that define that behavior. ## Motivation @@ -25,13 +25,13 @@ str.last == "é" // true (precomposed e with acute accent) str.last == "e\u{301}" // true (e followed by composing acute accent) ``` -At a regular expression's simplest, without metacharacters or special features, matching should behave like a test for equality. A string should always match a regular expression that simply contains the same characters. +At a regular expression's simplest, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regular expression that simply contains the same characters. ```swift str.contains(/Café/) // true ``` -And from there, small changes should continue to comport with the element counting and comparison expectations set by `String`: +From that point, small changes continue to comport with the element counting and comparison expectations set by `String`: ```swift str.contains(/Caf./) // true @@ -40,7 +40,7 @@ str.contains(/.+e\u{301}/) // true str.contains(/\w+é/) // true ``` -With these initial principles in hand, we can look at how character classes should behave with Swift strings. Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult. +With these initial principles in hand, we can look at how character classes behave with Swift strings. Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult.
Other engines @@ -65,37 +65,38 @@ First, we'll discuss the options that let you control a regex's behavior, and th ### Options -Options can be declared in two different ways: as part of [regular expression literal syntax][literals], or applied as methods when declaring a `Regex`. For example, both of these `Regex`es are declared with case insensitivity: +Options can be enabled and disabled in two different ways: as part of [regular expression internal syntax][internals], or applied as methods when declaring a `Regex`. For example, both of these `Regex`es are declared with case insensitivity: ```swift -let regex1 = /(?i)label/ +let regex1 = /(?i)banana/ let regex2 = Regex { - "label" + "banana" }.ignoringCase()` ``` -Note that the `ignoringCase()` is available on any type conforming to `RegexComponent`, which means that you can use the more readable option-setting interface in conjunction with literals: +Note that the `ignoringCase()` is available on any type conforming to `RegexComponent`, which means that you can use the more readable option-setting interface in conjunction with regex literals: ```swift -let regex3 = /label/.ignoringCase() +let regex3 = /banana/.ignoringCase() ``` -Calling option-setting methods like `ignoringCase(_:)` acts like wrapping the regex in a option-setting group. That is, while it sets the *default* behavior for the callee, it doesn’t override options that are applied to more specific regions. In this example, the `"b"` in `"label"` matches case-sensitively, despite the outer call to `ignoringCase()`: +Calling an option-setting method like `ignoringCase(_:)` acts like wrapping the callee in an option-setting group `(?:...)`. That is, while it sets the behavior for the callee, it doesn’t override options that are applied to more specific regions. 
In this example, the middle `"na"` in `"banana"` matches case-sensitively, despite the outer call to `ignoringCase()`: ```swift let regex4 = Regex { - "la" - "b".ignoringCase(false) - "el" + "ba" + "na".ignoringCase(false) + "na" } .ignoringCase() -"label".contains(regex4) // true -"LAbEL".contains(regex4) // true -"LABEL".contains(regex4) // false -``` +"banana".contains(regex4) // true +"BAnaNA".contains(regex4) // true +"BANANA".contains(regex4) // false -Option scoping in literals is discussed in the [Run-time Regex Construction proposal][option-scoping]. +// Equivalent to: +let regex5 = /(?i)ba(?-i:na)na/ +``` #### Case insensitivity @@ -111,9 +112,9 @@ str.firstMatch(of: /(?i)cAfÉ/) // "Café" Case insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected. -*Regex literal syntax:* `/(?i).../` or `/(?i...)/` +**Regular expression syntax:** `(?i)...` or `(?i:...)` -*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { @@ -139,9 +140,9 @@ str.firstMatch(of: /(?s)<<.+>>/) // "This string\nuses double-angle-brackets\ This option also affects the behavior of `CharacterClass.any`, which is designed to match the behavior of the `.` regex literal component. -*Regex literal syntax:* `/(?s).../` or `/(?s...)/` +**Regular expression syntax:** `(?s)...` or `(?s...)` -*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { @@ -173,9 +174,9 @@ str.firstMatch(of: /(?U)<.+>/) // "" str.firstMatch(of: /(?U)<.+?>/) // "A value." ``` -*Regex literal syntax:* `/(?U).../` or `/(?U...)/` +**Regular expression syntax:** `(?U)...` or `(?U...)` -*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { @@ -194,9 +195,9 @@ With one or more of these options enabled, the default character classes match o * `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, `\b`, `CharacterClass.word`, and `Anchor.wordBoundary`. 
* `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`). -*Regex literal syntax:* `/(?DSWP).../` or `/(?DSWP...)/` +**Regular expression syntax:** `(?DSWP)...` or `(?DSWP...)` -*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { @@ -219,9 +220,11 @@ extension RegexComponent { #### Use Unicode word boundaries -By default, matching uses the Unicode specification for finding word boundaries for the `\b` and `Anchor.wordBoundary` anchors. Disabling the `w` option switches to finding word boundaries at points in the input where `\b\B` or `\B\b` match, given the other matching options that are enabled, which may be more compatible with other regular expression engines. +By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode default word boundaries, specified as Unicode [level 2 regular expression support][level2-word-boundaries]. + +Disabling the `w` option switches to [simple word boundaries][level1-word-boundaries], finding word boundaries at points in the input where `\b\B` or `\B\b` match. Depending on the other matching options that are enabled, this may be more compatible with the behavior other regular expression engines. -In this example, the default matching behavior find the whole first word of the string, while the match with Unicode word boundaries disabled stops at the apostrophe: +As shown in this example, the default matching behavior finds the whole first word of the string, while the match with simple word boundaries stops at the apostrophe: ```swift let str = "Don't look down!" 
@@ -230,18 +233,23 @@ str.firstMatch(of: /D\S+\b/) // "Don't" str.firstMatch(of: /(?-w)D\S+\b/) // "Don" ``` -*Regex literal syntax:* `/(?-w).../` or `/(?-w...)/` +**Regular expression syntax:** `(?-w)...` or `(?-w...)` -*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { - /// Returns a regular expression that uses the Unicode word boundary - /// algorithm. + /// Returns a regular expression that uses simple word boundaries. /// - /// This option is enabled by default; pass `false` to disable use of - /// Unicode's word boundary algorithm. - public func usingUnicodeWordBoundaries(_ useUnicodeWordBoundaries: Bool = true) -> Regex + /// A simple word boundary is a position in the input between two characters + // that match `/\w\W/` or `/\W\w/`, or between the start or end of the input + // and `\w` character. Word boundaries therefore depend on the option-defined + // behavior of `\w`. + // + // The default word boundaries use a Unicode algorithm that handles some cases + // better than simple word boundaries, such as words with internal + // punctuation, changes in script, and Emoji. + public func usingSimpleWordBoundaries(_ useSimpleWordBoundaries: Bool = true) -> Regex } ``` @@ -275,9 +283,9 @@ print(decomposed.contains(queRegexScalar)) // Prints "false" ``` -*Regex literal syntax:* `(?X)...` or `(?X...)` for grapheme cluster semantics, `(?u)...` or `(?u...)` for Unicode scalar semantics. +**Regular expression syntax:** `(?X)...` or `(?X...)` for grapheme cluster semantics, `(?u)...` or `(?u...)` for Unicode scalar semantics. 
-*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { @@ -323,9 +331,9 @@ str.firstMatch(of: Regex { Anchor.startOfInput ; "def" }) // nil str.firstMatch(of: Regex { Anchor.startOfLine ; "def" }) // "def" ``` -*Regex literal syntax:* `/(?m).../` or `/(?m...)/` +**Regular expression syntax:** `(?m)...` or `(?m...)` -*Regex builder syntax:* +**`RegexBuilder` API:** ```swift extension RegexComponent { @@ -484,7 +492,7 @@ We use Unicode's recommended scalar semantics for horizontal and vertical whites #### Unicode properties -Character classes that match **Unicode properties** are written as `\p{PROPERTY}` or `\p{PROPERTY=VALUE}`, as described in the [Run-time Regex Construction proposal][literal-properties]. +Character classes that match **Unicode properties** are written as `\p{PROPERTY}` or `\p{PROPERTY=VALUE}`, as described in the [Run-time Regex Construction proposal][internals-properties]. While most Unicode properties are only defined at the scalar level, some are defined to match an extended grapheme cluster. For example, `\p{RGI_Emoji_Flag_Sequence}` will match any flag emoji character, which are composed of two Unicode scalar values. Such property classes will match multiple scalars, even when matching with Unicode scalar semantics. @@ -524,9 +532,9 @@ Custom classes function as the set union of their individual components, whether - When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`. - A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. 
This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics. -In regex literals, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [literal syntax proposal][literal-charclass]. +Inside regular expressions, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [Run-time Regex Construction proposal][internals-charclass]. -With the `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.decimalDigit` with a range of characters. +With `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.decimalDigit` with a range of characters. ```swift let octoDecimalRegex: Regex = Regex { @@ -615,7 +623,7 @@ Everything in this proposal is additive, and has no effect on existing stable AB ## Effect on API resilience -TBD +TK ## Future directions @@ -628,6 +636,10 @@ The initial version of `Regex` includes only the options described above. Fillin An earlier version of this pitch described adding standard library APIs to `Character` and `UnicodeScalar` for each of the supported character classes, as well as convenient static members for control characters. In addition, regex literals support Unicode property features that don’t currently exist in the standard library, such as a scalar’s script or extended category, or creating a scalar by its Unicode name instead of its scalar value. 
These kinds of additions are +### Byte semantic mode + +A future `Regex` version could support a byte-level semantic mode in addition to grapheme cluster and Unicode scalar semantics. Byte-level semantics would allow matching individual bytes, potentially providing the capability of parsing string and non-string data together. + ## Alternatives considered ### Operate on String.UnicodeScalarView instead of using semantic modes @@ -642,9 +654,11 @@ Instead of providing APIs to select whether `Regex` matching is `Character`-base [repo]: https://github.com/apple/swift-experimental-string-processing/ [option-scoping]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#matching-options -[literals]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md -[literal-properties]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#character-properties -[literal-charclass]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#custom-character-classes +[internals]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md +[internals-properties]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#character-properties +[internals-charclass]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#custom-character-classes +[level1-word-boundaries]:https://unicode.org/reports/tr18/#Simple_Word_Boundaries +[level2-word-boundaries]:https://unicode.org/reports/tr18/#RL2.3 [overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 [charprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md From 
7f7fab852bb6e624abe1547656103772d586a00c Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Thu, 14 Apr 2022 11:48:03 -0500 Subject: [PATCH 03/12] Finish draft --- .../Evolution/UnicodeForStringProcessing.md | 111 ++++++++++++------ 1 file changed, 73 insertions(+), 38 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 0e53ff524..2ade4aa7f 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -9,13 +9,11 @@ Status: **Draft** ## Introduction -This proposal describes `Regex`'s rich Unicode support during regular expression matching, along with the character classes and options that define that behavior. +This proposal describes `Regex`'s rich Unicode support during regex matching, along with the character classes and options that define that behavior. ## Motivation -Character classes in regular expressions include metacharacters like `\d` to match a digit, `\s` to match whitespace, and `.` to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves, and, in case-insensitive matching, their case-toggled counterpart. For the purpose of this work, then, we consider a *character class* to be any part of a regular expression literal that can match an actual component of a string. - -Operating over classes of characters is a vital component of string processing. Swift's `String` provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. +Swift's `String` type provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. 
Each character in a string can be composed of one or more Unicode scalar values, while still being treated as a single unit, equivalent to other ways of formulating the equivalent character: ```swift let str = "Cafe\u{301}" // "Café" @@ -25,30 +23,36 @@ str.last == "é" // true (precomposed e with acute accent) str.last == "e\u{301}" // true (e followed by composing acute accent) ``` -At a regular expression's simplest, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regular expression that simply contains the same characters. +This default view is fairly novel. Most languages that support Unicode strings generally operate at the Unicode scalar level, and don't provide the same affordance for operating on a string as a collection of grapheme clusters. In Python, for example, Unicode strings report their length as the number of scalar values, and don't use canonical equivalence in comparisons: -```swift -str.contains(/Café/) // true +```python +cafe = u"Cafe\u0301" +len(cafe) # 5 +cafe == u"Café" # False ``` -From that point, small changes continue to comport with the element counting and comparison expectations set by `String`: +Existing regex engines follow this same model of operating at the Unicode scalar level. To match canonically equivalent characters, or have equivalent behavior between equivalent strings, you must normalize your string and regex to the same canonical format. -```swift -str.contains(/Caf./) // true -str.contains(/.+é/) // true -str.contains(/.+e\u{301}/) // true -str.contains(/\w+é/) // true +```python +# Matches a four-element string +re.match(u"^.{4}$", cafe) # None +# Matches a string ending with 'é' +re.match(u".+é$", cafe) # None + +cafeComp = unicodedata.normalize("NFC", cafe) +re.match(u"^.{4}$", cafeComp) # +re.match(u".+é$", cafeComp) # ``` -With these initial principles in hand, we can look at how character classes behave with Swift strings. 
Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult. +With Swift's string model, this behavior would surprising and undesirable — Swift's default regex semantics must match the semantics of a `String`.
Other engines -Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster. +Other regex engines match character classes (such as `\w` or `.`) at the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster. | Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining | |---|---|---|---|---| -| C#, Rust, Go | `"Cafe"` | `"´"` | n/a | n/a | +| C#, Rust, Go, Python | `"Cafe"` | `"´"` | n/a | n/a | | NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` | Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence. @@ -57,7 +61,23 @@ Other than Java's `CANON_EQ` option, the vast majority of other languages and en ## Proposed solution -TK: semantic levels, options for controlling, canonical equivalence, Unicode properties +In a regex's simplest form, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regex that simply contains the same characters. 
+ +```swift +let str = "Cafe\u{301}" // "Café" +str.contains(/Café/) // true +``` + +From that point, small changes continue to comport with the element counting and comparison expectations set by `String`: + +```swift +str.contains(/Caf./) // true +str.contains(/.+é/) // true +str.contains(/.+e\u{301}/) // true +str.contains(/\w+é/) // true +``` + +Swift's `Regex` follows the level 2 guidelines for Unicode support in regular expressions described in [Unicode Technical Standard #18][uts18], with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. `Regex` provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which corresponds more closely with other regex engines. ## Detailed design @@ -65,7 +85,7 @@ First, we'll discuss the options that let you control a regex's behavior, and th ### Options -Options can be enabled and disabled in two different ways: as part of [regular expression internal syntax][internals], or applied as methods when declaring a `Regex`. For example, both of these `Regex`es are declared with case insensitivity: +Options can be enabled and disabled in two different ways: as part of [regex internal syntax][internals], or applied as methods when declaring a `Regex`. For example, both of these `Regex`es are declared with case insensitivity: ```swift let regex1 = /(?i)banana/ @@ -100,7 +120,7 @@ let regex5 = /(?i)ba(?-i:na)na/ #### Case insensitivity -Regular expressions perform case sensitive comparisons by default. The `i` option or the `ignoringCase(_:)` method enables case insensitive comparison. +Regexes perform case sensitive comparisons by default. The `i` option or the `ignoringCase(_:)` method enables case insensitive comparison. 
```swift let str = "Café" @@ -112,7 +132,7 @@ str.firstMatch(of: /(?i)cAfÉ/) // "Café" Case insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected. -**Regular expression syntax:** `(?i)...` or `(?i:...)` +**Regex syntax:** `(?i)...` or `(?i:...)` **`RegexBuilder` API:** @@ -140,7 +160,7 @@ str.firstMatch(of: /(?s)<<.+>>/) // "This string\nuses double-angle-brackets\ This option also affects the behavior of `CharacterClass.any`, which is designed to match the behavior of the `.` regex literal component. -**Regular expression syntax:** `(?s)...` or `(?s...)` +**Regex syntax:** `(?s)...` or `(?s...)` **`RegexBuilder` API:** @@ -154,7 +174,7 @@ extension RegexComponent { #### Reluctant quantification by default -Regular expression quantifiers (`+`, `*`, and `?`) match eagerly by default, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring. +Regex quantifiers (`+`, `*`, and `?`) match eagerly by default, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring. ```swift let str = "A value." @@ -174,7 +194,7 @@ str.firstMatch(of: /(?U)<.+>/) // "" str.firstMatch(of: /(?U)<.+?>/) // "A value." ``` -**Regular expression syntax:** `(?U)...` or `(?U...)` +**Regex syntax:** `(?U)...` or `(?U...)` **`RegexBuilder` API:** @@ -195,7 +215,7 @@ With one or more of these options enabled, the default character classes match o * `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, `\b`, `CharacterClass.word`, and `Anchor.wordBoundary`. * `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`). 
-**Regular expression syntax:** `(?DSWP)...` or `(?DSWP...)` +**Regex syntax:** `(?DSWP)...` or `(?DSWP...)` **`RegexBuilder` API:** @@ -220,9 +240,9 @@ extension RegexComponent { #### Use Unicode word boundaries -By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode default word boundaries, specified as Unicode [level 2 regular expression support][level2-word-boundaries]. +By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode default word boundaries, specified as [Unicode level 2 regular expression support][level2-word-boundaries]. -Disabling the `w` option switches to [simple word boundaries][level1-word-boundaries], finding word boundaries at points in the input where `\b\B` or `\B\b` match. Depending on the other matching options that are enabled, this may be more compatible with the behavior other regular expression engines. +Disabling the `w` option switches to [simple word boundaries][level1-word-boundaries], finding word boundaries at points in the input where `\b\B` or `\B\b` match. Depending on the other matching options that are enabled, this may be more compatible with the behavior other regex engines. As shown in this example, the default matching behavior finds the whole first word of the string, while the match with simple word boundaries stops at the apostrophe: @@ -233,7 +253,25 @@ str.firstMatch(of: /D\S+\b/) // "Don't" str.firstMatch(of: /(?-w)D\S+\b/) // "Don" ``` -**Regular expression syntax:** `(?-w)...` or `(?-w...)` +You can see more differences between level 1 and level 2 word boundaries in the following table: + +| Example | Level 1 | Level 2 | +|---------------------|---------------------------------|-------------------------------------------| +| I can't do that. 
| ["I", "can", "t", "do", "that"] | ["I", "can't", "do", "that", "."] | +| 🔥😊👍 | ["🔥😊👍"] | ["🔥", "😊", "👍"] | +| 👩🏻👶🏿👨🏽🧑🏾👩🏼 | ["👩🏻👶🏿👨🏽🧑🏾👩🏼"] | ["👩🏻", "👶🏿", "👨🏽", "🧑🏾", "👩🏼"] | +| 🇨🇦🇺🇸🇲🇽 | ["🇨🇦🇺🇸🇲🇽"] | ["🇨🇦", "🇺🇸", "🇲🇽"] | +| 〱㋞ツ | ["〱", "㋞", "ツ"] | ["〱㋞ツ"] | +| hello〱㋞ツ | ["hello〱", "㋞", "ツ"] | ["hello", "〱㋞ツ"] | +| 나는 Chicago에 산다 | ["나는", "Chicago에", "산다"] | ["나", "는", "Chicago", "에", "산", "다"] | +| 眼睛love食物 | ["眼睛love食物"] | ["眼", "睛", "love", "食", "物"] | +| 아니ㅋㅋㅋ네 | ["아니ㅋㅋㅋ네"] | ["아", "니", "ㅋㅋㅋ", "네"] | +| Re:Zero | ["Re", "Zero"] | ["Re:Zero"] | +| \u{d}\u{a} | ["\u{d}", "\u{a}"] | ["\u{d}\u{a}"] | +| €1 234,56 | ["1", "234", "56"] | ["€", "1", "234,56"] | + + +**Regex syntax:** `(?-w)...` or `(?-w...)` **`RegexBuilder` API:** @@ -253,7 +291,7 @@ extension RegexComponent { } ``` -### Matching semantic level +#### Matching semantic level When matching with grapheme cluster semantics (the default), metacharacters like `.` and `\w`, custom character classes, and character class instances like `.any` match a grapheme cluster when possible, corresponding with the default string representation. In addition, matching with grapheme cluster semantics compares characters using their canonical representation, corresponding with the way comparing strings for equality works. @@ -273,7 +311,7 @@ print(decomposed.contains(queRegex)) // Prints "true" ``` -When using Unicode scalar semantics, however, the regular expression only matches the composed version of the string, because each `.` matches a single Unicode scalar value. +When using Unicode scalar semantics, however, the regex only matches the composed version of the string, because each `.` matches a single Unicode scalar value. ```swift let queRegexScalar = queRegex.matchingSemantics(.unicodeScalar) @@ -283,7 +321,7 @@ print(decomposed.contains(queRegexScalar)) // Prints "false" ``` -**Regular expression syntax:** `(?X)...` or `(?X...)` for grapheme cluster semantics, `(?u)...` or `(?u...)` for Unicode scalar semantics. 
+**Regex syntax:** `(?X)...` or `(?X...)` for grapheme cluster semantics, `(?u)...` or `(?u...)` for Unicode scalar semantics. **`RegexBuilder` API:** @@ -331,7 +369,7 @@ str.firstMatch(of: Regex { Anchor.startOfInput ; "def" }) // nil str.firstMatch(of: Regex { Anchor.startOfLine ; "def" }) // "def" ``` -**Regular expression syntax:** `(?m)...` or `(?m...)` +**Regex syntax:** `(?m)...` or `(?m...)` **`RegexBuilder` API:** @@ -349,7 +387,7 @@ extension RegexComponent { We propose the following definitions for regex character classes, along with a `CharacterClass` type as part of the `RegexBuilder` module, to encapsulate and simplify character class usage within builder-style regexes. -The two regular expressions defined in this example will match the same inputs, looking for one or more word characters followed by up to three digits, optionally separated by a space: +The two regexes defined in this example will match the same inputs, looking for one or more word characters followed by up to three digits, optionally separated by a space: ```swift let regex1 = /\w+\s?\d{,3}/ @@ -436,7 +474,7 @@ We interpret Unicode's definition of the set of scalars, especially its requirem #### "Word" characters -The **word** character class is matched by `\w` or `CharacterClass.word`. This character class and its name are essentially terms of art within regular expressions, and represents part of a notional "word". Note that, by default, this is distinct from the algorithm for identifying word boundaries. +The **word** character class is matched by `\w` or `CharacterClass.word`. This character class and its name are essentially terms of art within regexes, and represents part of a notional "word". Note that, by default, this is distinct from the algorithm for identifying word boundaries. 
_Unicode scalar semantics:_ Matches a Unicode scalar that has one of the Unicode properties `Alphabetic`, `Digit`, or `Join_Control`, or is in the general category `Mark` or `Connector_Punctuation`. @@ -532,7 +570,7 @@ Custom classes function as the set union of their individual components, whether - When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`. - A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics. -Inside regular expressions, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [Run-time Regex Construction proposal][internals-charclass]. +Inside regexes, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [Run-time Regex Construction proposal][internals-charclass]. With `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.decimalDigit` with a range of characters. @@ -615,16 +653,13 @@ public func ...(lhs: UnicodeScalar, rhs: UnicodeScalar) -> CharacterClass Everything in this proposal is additive, and has no compatibility effect on existing source code. - ## Effect on ABI stability Everything in this proposal is additive, and has no effect on existing stable ABI. 
- ## Effect on API resilience -TK - +N/A ## Future directions From 23494c6dc83f382664b6b2bfc9bb803b20746d57 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Thu, 14 Apr 2022 12:27:21 -0500 Subject: [PATCH 04/12] Additional revisions, API fixes --- .../Evolution/UnicodeForStringProcessing.md | 55 +++++++++++++++---- 1 file changed, 43 insertions(+), 12 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 2ade4aa7f..876e83777 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -118,6 +118,24 @@ let regex4 = Regex { let regex5 = /(?i)ba(?-i:na)na/ ``` +All option APIs are provided on `RegexComponent`, so they can be called on a `Regex` instance, or on any component that you would use inside a `RegexBuilder` block when the `RegexBuilder` module is imported. + +The options that `Regex` supports are shown in the table below. Options that affect _matching behavior_ are supported through both regex syntax and APIs, while options that have _structural_ or _syntactic_ effects are only supported through regex syntax. + +| **Matching Behavior** | | | +|--------------------------|----------------|-------------------------------| +| Case insensitivity | `(?i)` | `ignoringCase()` | +| Dot matches newlines | `(?s)` | `dotMatchesNewlines()` | +| Anchors match newlines | `(?m)` | `anchorsMatchNewlines()` | +| Unicode word boundaries | `(?w)` | `usingSimpleWordBoundaries()` | +| ASCII character classes | `(?DSWP)` | `usingASCIIDigits()`, etc | +| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | +| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | +| **Structural/Syntactic** | | | +| Extended syntax | `(?x)`,`(?xx)` | n/a | +| Named captures only | `(?n)` | n/a | +| Shared capture names | `(?J)` | n/a | + #### Case insensitivity Regexes perform case sensitive comparisons by default. 
The `i` option or the `ignoringCase(_:)` method enables case insensitive comparison. @@ -202,10 +220,23 @@ str.firstMatch(of: /(?U)<.+?>/) // "A value." extension RegexComponent { /// Returns a regular expression where quantifiers are reluctant by default /// instead of eager. - public func reluctantCaptures(_ useReluctantCaptures: Bool = true) -> Regex + public func reluctantQuantifiers(_ useReluctantQuantifiers: Bool = true) -> Regex +} +``` + +In order for this option to have the same effect on regexes built with `RegexBuilder` as with regex syntax, the `RegexBuilder` quantifier APIs are amended to have an `nil`-defaulted optional `behavior` parameter. For example: + +```swift +extension OneOrMore { + public init( + _ behavior: QuantificationBehavior? = nil, + @RegexComponentBuilder _ component: () -> Component + ) where Output == (Substring, C0), Component.Output == (W, C0) } ``` +When you pass `nil`, the quantifier uses the default behavior as set by this option (either eager or reluctant). If an explicit behavior is passed, that behavior is used regardless of the default. + #### Use ASCII-only character classes With one or more of these options enabled, the default character classes match only ASCII values instead of the full Unicode range of characters. Four options are included in this group: @@ -221,10 +252,6 @@ With one or more of these options enabled, the default character classes match o ```swift extension RegexComponent { - /// Returns a regular expression that only matches ASCII characters as "word - /// characters". - public func usingASCIIWordCharacters(_ useASCII: Bool = true) -> Regex - /// Returns a regular expression that only matches ASCII characters as digits. public func usingASCIIDigits(_ useASCII: Bool = true) -> Regex @@ -232,6 +259,10 @@ extension RegexComponent { /// characters. public func usingASCIISpaces(_ useASCII: Bool = true) -> Regex + /// Returns a regular expression that only matches ASCII characters as "word + /// characters". 
+ public func usingASCIIWordCharacters(_ useASCII: Bool = true) -> Regex + /// Returns a regular expression that only matches ASCII characters when /// matching character classes. public func usingASCIICharacterClasses(_ useASCII: Bool = true) -> Regex @@ -428,8 +459,8 @@ for match in data.matches(of: /(.),/.matchingSemantics(.unicodeScalar)) { `Regex` also provides ways to select a specific level of "any" matching, without needing to change semantic levels. -- The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. -- The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. +- The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. This includes matching newlines, regardless of any option settings. +- The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. This includes matching newlines, regardless of any option settings, but only the first scalar in an `\r\n` cluster. #### Decimal and hexadecimal digits @@ -665,7 +696,7 @@ N/A ### Expanded options -The initial version of `Regex` includes only the options described above. Filling out the remainder of options described in the [Run-time Regex Construction proposal][literals] could be completed as future work. +The initial version of `Regex` includes only the options described above. Filling out the remainder of options described in the [Run-time Regex Construction proposal][literals] could be completed as future work, as well as additional improvements, such as adding an option for making quantifiers possessive by default. 
### Extensions to Character and Unicode Scalar APIs @@ -688,10 +719,10 @@ Instead of providing APIs to select whether `Regex` matching is `Character`-base [repo]: https://github.com/apple/swift-experimental-string-processing/ -[option-scoping]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#matching-options -[internals]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md -[internals-properties]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#character-properties -[internals-charclass]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md#custom-character-classes +[option-scoping]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#matching-options +[internals]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md +[internals-properties]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#character-properties +[internals-charclass]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#custom-character-classes [level1-word-boundaries]:https://unicode.org/reports/tr18/#Simple_Word_Boundaries [level2-word-boundaries]:https://unicode.org/reports/tr18/#RL2.3 From daff3e45bfd0210a0dcdc6fccf2866e6e066d09c Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Thu, 14 Apr 2022 12:55:43 -0500 Subject: [PATCH 05/12] Re-order option sections --- .../Evolution/UnicodeForStringProcessing.md | 125 +++++++++--------- 1 file changed, 63 insertions(+), 62 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md 
b/Documentation/Evolution/UnicodeForStringProcessing.md index 876e83777..4fa58f1b9 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -122,19 +122,19 @@ All option APIs are provided on `RegexComponent`, so they can be called on a `Re The options that `Regex` supports are shown in the table below. Options that affect _matching behavior_ are supported through both regex syntax and APIs, while options that have _structural_ or _syntactic_ effects are only supported through regex syntax. -| **Matching Behavior** | | | -|--------------------------|----------------|-------------------------------| -| Case insensitivity | `(?i)` | `ignoringCase()` | -| Dot matches newlines | `(?s)` | `dotMatchesNewlines()` | -| Anchors match newlines | `(?m)` | `anchorsMatchNewlines()` | -| Unicode word boundaries | `(?w)` | `usingSimpleWordBoundaries()` | -| ASCII character classes | `(?DSWP)` | `usingASCIIDigits()`, etc | -| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | -| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | -| **Structural/Syntactic** | | | -| Extended syntax | `(?x)`,`(?xx)` | n/a | -| Named captures only | `(?n)` | n/a | -| Shared capture names | `(?J)` | n/a | +| **Matching Behavior** | | | +|------------------------------|----------------|-------------------------------| +| Case insensitivity | `(?i)` | `ignoringCase()` | +| Single-line mode | `(?s)` | `dotMatchesNewlines()` | +| Multi-line mode | `(?m)` | `anchorsMatchNewlines()` | +| ASCII-only character classes | `(?DSWP)` | `usingASCIIDigits()`, etc | +| Unicode word boundaries | `(?w)` | `usingSimpleWordBoundaries()` | +| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | +| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | +| **Structural/Syntactic** | | | +| Extended syntax | `(?x)`,`(?xx)` | n/a | +| Named captures only | `(?n)` | n/a | +| Shared capture names | `(?J)` | n/a | #### Case insensitivity @@ 
-190,54 +190,45 @@ extension RegexComponent { } ``` -#### Reluctant quantification by default +#### Multiline mode -Regex quantifiers (`+`, `*`, and `?`) match eagerly by default, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring. +By default, the start and end anchors (`^` and `$`) match only the beginning and end of a string. With the `m` or the option, they also match the beginning and end of each line. ```swift -let str = "A value." +let str = """ + abc + def + ghi + """ -// By default, the '+' quantifier is eager, and consumes as much as possible. -str.firstMatch(of: /<.+>/) // "A value." +str.firstMatch(of: /^abc/) // "abc" +str.firstMatch(of: /^abc$/) // nil +str.firstMatch(of: /(?m)^abc$/) // "abc" -// Adding '?' makes the '+' quantifier reluctant, so that it consumes as little as possible. -str.firstMatch(of: /<.+?>/) // "" +str.firstMatch(of: /^def/) // nil +str.firstMatch(of: /(?m)^def$/) // "def" ``` -The `U` option toggles the "eagerness" of quanitifiers, so that quantifiers are reluctant by default, and only become eager when `?` is added to the quantifier. +This option applies only to anchors used in a regex literal. The anchors defined in `RegexBuilder` are specific about matching at the start/end of the input or the line, and therefore do not correspond directly with the `^` and `$` literal anchors. ```swift -// '(?U)' toggles the eagerness of quantifiers: -str.firstMatch(of: /(?U)<.+>/) // "" -str.firstMatch(of: /(?U)<.+?>/) // "A value." +str.firstMatch(of: Regex { Anchor.startOfInput ; "def" }) // nil +str.firstMatch(of: Regex { Anchor.startOfLine ; "def" }) // "def" ``` -**Regex syntax:** `(?U)...` or `(?U...)` +**Regex syntax:** `(?m)...` or `(?m...)` **`RegexBuilder` API:** ```swift extension RegexComponent { - /// Returns a regular expression where quantifiers are reluctant by default - /// instead of eager. 
- public func reluctantQuantifiers(_ useReluctantQuantifiers: Bool = true) -> Regex -} -``` - -In order for this option to have the same effect on regexes built with `RegexBuilder` as with regex syntax, the `RegexBuilder` quantifier APIs are amended to have an `nil`-defaulted optional `behavior` parameter. For example: - -```swift -extension OneOrMore { - public init( - _ behavior: QuantificationBehavior? = nil, - @RegexComponentBuilder _ component: () -> Component - ) where Output == (Substring, C0), Component.Output == (W, C0) + /// Returns a regular expression where the start and end of input + /// anchors (`^` and `$`) also match against the start and end of a line. + public func anchorsMatchLineEndings(_ matchLineEndings: Bool = true) -> Regex } ``` -When you pass `nil`, the quantifier uses the default behavior as set by this option (either eager or reluctant). If an explicit behavior is passed, that behavior is used regardless of the default. - -#### Use ASCII-only character classes +#### ASCII-only character classes With one or more of these options enabled, the default character classes match only ASCII values instead of the full Unicode range of characters. Four options are included in this group: @@ -269,7 +260,7 @@ extension RegexComponent { } ``` -#### Use Unicode word boundaries +#### Unicode word boundaries By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode default word boundaries, specified as [Unicode level 2 regular expression support][level2-word-boundaries]. @@ -374,44 +365,54 @@ public struct RegexSemanticLevel: Hashable { } ``` -#### Multiline mode +#### Reluctant quantification by default -By default, the start and end anchors (`^` and `$`) match only the beginning and end of a string. With the `m` or the option, they also match the beginning and end of each line. +Regex quantifiers (`+`, `*`, and `?`) match eagerly by default, such that they match the longest possible substring. 
Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring. ```swift -let str = """ - abc - def - ghi - """ +let str = "A value." -str.firstMatch(of: /^abc/) // "abc" -str.firstMatch(of: /^abc$/) // nil -str.firstMatch(of: /(?m)^abc$/) // "abc" +// By default, the '+' quantifier is eager, and consumes as much as possible. +str.firstMatch(of: /<.+>/) // "A value." -str.firstMatch(of: /^def/) // nil -str.firstMatch(of: /(?m)^def$/) // "def" +// Adding '?' makes the '+' quantifier reluctant, so that it consumes as little as possible. +str.firstMatch(of: /<.+?>/) // "" ``` -This option applies only to anchors used in a regex literal. The anchors defined in `RegexBuilder` are specific about matching at the start/end of the input or the line, and therefore do not correspond directly with the `^` and `$` literal anchors. +The `U` option toggles the "eagerness" of quanitifiers, so that quantifiers are reluctant by default, and only become eager when `?` is added to the quantifier. ```swift -str.firstMatch(of: Regex { Anchor.startOfInput ; "def" }) // nil -str.firstMatch(of: Regex { Anchor.startOfLine ; "def" }) // "def" +// '(?U)' toggles the eagerness of quantifiers: +str.firstMatch(of: /(?U)<.+>/) // "" +str.firstMatch(of: /(?U)<.+?>/) // "A value." ``` -**Regex syntax:** `(?m)...` or `(?m...)` +**Regex syntax:** `(?U)...` or `(?U...)` **`RegexBuilder` API:** ```swift extension RegexComponent { - /// Returns a regular expression where the start and end of input - /// anchors (`^` and `$`) also match against the start and end of a line. - public func anchorsMatchLineEndings(_ matchLineEndings: Bool = true) -> Regex + /// Returns a regular expression where quantifiers are reluctant by default + /// instead of eager. 
+ public func reluctantQuantifiers(_ useReluctantQuantifiers: Bool = true) -> Regex } ``` +In order for this option to have the same effect on regexes built with `RegexBuilder` as with regex syntax, the `RegexBuilder` quantifier APIs are amended to have an `nil`-defaulted optional `behavior` parameter. For example: + +```swift +extension OneOrMore { + public init( + _ behavior: QuantificationBehavior? = nil, + @RegexComponentBuilder _ component: () -> Component + ) where Output == (Substring, C0), Component.Output == (W, C0) +} +``` + +When you pass `nil`, the quantifier uses the default behavior as set by this option (either eager or reluctant). If an explicit behavior is passed, that behavior is used regardless of the default. + + --- ### Character Classes From c25c1462f2bcf077944be8947f09b1bcadeb9c82 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Mon, 18 Apr 2022 12:47:12 -0500 Subject: [PATCH 06/12] Update word boundary selection method --- .../Evolution/UnicodeForStringProcessing.md | 65 ++++++++++++------- 1 file changed, 43 insertions(+), 22 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 4fa58f1b9..b43bb021e 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -122,19 +122,19 @@ All option APIs are provided on `RegexComponent`, so they can be called on a `Re The options that `Regex` supports are shown in the table below. Options that affect _matching behavior_ are supported through both regex syntax and APIs, while options that have _structural_ or _syntactic_ effects are only supported through regex syntax. 
-| **Matching Behavior** | | | -|------------------------------|----------------|-------------------------------| -| Case insensitivity | `(?i)` | `ignoringCase()` | -| Single-line mode | `(?s)` | `dotMatchesNewlines()` | -| Multi-line mode | `(?m)` | `anchorsMatchNewlines()` | -| ASCII-only character classes | `(?DSWP)` | `usingASCIIDigits()`, etc | -| Unicode word boundaries | `(?w)` | `usingSimpleWordBoundaries()` | -| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | -| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | -| **Structural/Syntactic** | | | -| Extended syntax | `(?x)`,`(?xx)` | n/a | -| Named captures only | `(?n)` | n/a | -| Shared capture names | `(?J)` | n/a | +| **Matching Behavior** | | | +|------------------------------|----------------|------------------------------------| +| Case insensitivity | `(?i)` | `ignoringCase()` | +| Single-line mode | `(?s)` | `dotMatchesNewlines()` | +| Multi-line mode | `(?m)` | `anchorsMatchNewlines()` | +| ASCII-only character classes | `(?DSWP)` | `usingASCIIDigits()`, etc | +| Unicode word boundaries | `(?w)` | `identifyingWordBoundaries(with:)` | +| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | +| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | +| **Structural/Syntactic** | | | +| Extended syntax | `(?x)`,`(?xx)` | n/a | +| Named captures only | `(?n)` | n/a | +| Shared capture names | `(?J)` | n/a | #### Case insensitivity @@ -299,17 +299,36 @@ You can see more differences between level 1 and level 2 word boundaries in the ```swift extension RegexComponent { - /// Returns a regular expression that uses simple word boundaries. + /// Returns a regular expression that uses the specified word boundary algorithm. /// /// A simple word boundary is a position in the input between two characters - // that match `/\w\W/` or `/\W\w/`, or between the start or end of the input - // and `\w` character. Word boundaries therefore depend on the option-defined - // behavior of `\w`. 
- // - // The default word boundaries use a Unicode algorithm that handles some cases - // better than simple word boundaries, such as words with internal - // punctuation, changes in script, and Emoji. - public func usingSimpleWordBoundaries(_ useSimpleWordBoundaries: Bool = true) -> Regex + /// that match `/\w\W/` or `/\W\w/`, or between the start or end of the input + /// and `\w` character. Word boundaries therefore depend on the option-defined + /// behavior of `\w`. + /// + /// The default word boundaries use a Unicode algorithm that handles some cases + /// better than simple word boundaries, such as words with internal + /// punctuation, changes in script, and Emoji. + public func identifyingWordBoundaries(with wordBoundaryKind: RegexWordBoundaryKind) -> Regex +} + +public struct RegexWordBoundaryKind: Hashable { + /// A word boundary algorithm that implements the "simple word boundary" + /// Unicode recommendation. + /// + /// A simple word boundary is a position in the input between two characters + /// that match `/\w\W/` or `/\W\w/`, or between the start or end of the input + /// and a `\w` character. Word boundaries therefore depend on the option- + /// defined behavior of `\w`. + public static var unicodeLevel1: Self { get } + + /// A word boundary algorithm that implements the "default word boundary" + /// Unicode recommendation. + /// + /// Default word boundaries use a Unicode algorithm that handles some cases + /// better than simple word boundaries, such as words with internal + /// punctuation, changes in script, and Emoji. + public static var unicodeLevel2: Self { get } } ``` @@ -716,7 +735,9 @@ Instead of providing APIs to select whether `Regex` matching is `Character`-base * As the scalar level used when matching changes the behavior of individual components of a `Regex`, it’s more appropriate to specify the semantic level at the declaration site than the call site. 
* With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach. +### Binary word boundary option method +A prior version of this proposal used a binary method for setting the word boundary algorithm, called `usingSimpleWordBoundaries()`. A method taking a `RegexWordBoundaryKind` instance is included in the proposal instead, to leave room for implementing other word boundary algorithms in the future. [repo]: https://github.com/apple/swift-experimental-string-processing/ From c839b62bb924a8e7f505a8cdf7af87a154f4980e Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Mon, 18 Apr 2022 22:20:02 -0500 Subject: [PATCH 07/12] Change option API to nominal naming --- .../Evolution/UnicodeForStringProcessing.md | 52 +++++++++---------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index b43bb021e..721d41e71 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -91,25 +91,25 @@ Options can be enabled and disabled in two different ways: as part of [regex int let regex1 = /(?i)banana/ let regex2 = Regex { "banana" -}.ignoringCase()` +}.ignoresCase()` ``` -Note that the `ignoringCase()` is available on any type conforming to `RegexComponent`, which means that you can use the more readable option-setting interface in conjunction with regex literals: +Note that the `ignoresCase()` is available on any type conforming to `RegexComponent`, which means that you can always use the more readable option-setting interface in conjunction with regex literals or run-time compiled `Regex`es: ```swift -let regex3 = /banana/.ignoringCase() +let regex3 = /banana/.ignoresCase() ``` -Calling an option-setting method like `ignoringCase(_:)` acts like wrapping the callee in an 
option-setting group `(?:...)`. That is, while it sets the behavior for the callee, it doesn’t override options that are applied to more specific regions. In this example, the middle `"na"` in `"banana"` matches case-sensitively, despite the outer call to `ignoringCase()`: +Calling an option-setting method like `ignoresCase(_:)` acts like wrapping the callee in an option-setting group `(?:...)`. That is, while it sets the behavior for the callee, it doesn’t override options that are applied to more specific regions. In this example, the middle `"na"` in `"banana"` matches case-sensitively, despite the outer call to `ignoresCase()`: ```swift let regex4 = Regex { "ba" - "na".ignoringCase(false) + "na".ignoresCase(false) "na" } -.ignoringCase() - +.ignoresCase() + "banana".contains(regex4) // true "BAnaNA".contains(regex4) // true "BANANA".contains(regex4) // false @@ -122,19 +122,19 @@ All option APIs are provided on `RegexComponent`, so they can be called on a `Re The options that `Regex` supports are shown in the table below. Options that affect _matching behavior_ are supported through both regex syntax and APIs, while options that have _structural_ or _syntactic_ effects are only supported through regex syntax. 
-| **Matching Behavior** | | | -|------------------------------|----------------|------------------------------------| -| Case insensitivity | `(?i)` | `ignoringCase()` | -| Single-line mode | `(?s)` | `dotMatchesNewlines()` | -| Multi-line mode | `(?m)` | `anchorsMatchNewlines()` | -| ASCII-only character classes | `(?DSWP)` | `usingASCIIDigits()`, etc | -| Unicode word boundaries | `(?w)` | `identifyingWordBoundaries(with:)` | -| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | -| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | -| **Structural/Syntactic** | | | -| Extended syntax | `(?x)`,`(?xx)` | n/a | -| Named captures only | `(?n)` | n/a | -| Shared capture names | `(?J)` | n/a | +| **Matching Behavior** | | | +|------------------------------|----------------|---------------------------| +| Case insensitivity | `(?i)` | `ignoresCase()` | +| Single-line mode | `(?s)` | `dotMatchesNewlines()` | +| Multi-line mode | `(?m)` | `anchorsMatchNewlines()` | +| ASCII-only character classes | `(?DSWP)` | `asciiOnlyDigits()`, etc | +| Unicode word boundaries | `(?w)` | `wordBoundaryKind(_:)` | +| Semantic level | `(?Xu)` | `matchingSemantics(_:)` | +| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | +| **Structural/Syntactic** | | | +| Extended syntax | `(?x)`,`(?xx)` | n/a | +| Named captures only | `(?n)` | n/a | +| Shared capture names | `(?J)` | n/a | #### Case insensitivity @@ -157,7 +157,7 @@ Case insensitive matching uses case folding to ensure that canonical equivalence ```swift extension RegexComponent { /// Returns a regular expression that ignores casing when matching. - public func ignoringCase(_ ignoreCase: Bool = true) -> Regex + public func ignoresCase(_ ignoresCase: Bool = true) -> Regex } ``` @@ -244,19 +244,19 @@ With one or more of these options enabled, the default character classes match o ```swift extension RegexComponent { /// Returns a regular expression that only matches ASCII characters as digits. 
- public func usingASCIIDigits(_ useASCII: Bool = true) -> Regex + public func asciiOnlyDigits(_ asciiOnly: Bool = true) -> Regex /// Returns a regular expression that only matches ASCII characters as space /// characters. - public func usingASCIISpaces(_ useASCII: Bool = true) -> Regex + public func asciiOnlyWhitespace(_ asciiOnly: Bool = true) -> Regex /// Returns a regular expression that only matches ASCII characters as "word /// characters". - public func usingASCIIWordCharacters(_ useASCII: Bool = true) -> Regex + public func asciiOnlyWordCharacters(_ asciiOnly: Bool = true) -> Regex /// Returns a regular expression that only matches ASCII characters when /// matching character classes. - public func usingASCIICharacterClasses(_ useASCII: Bool = true) -> Regex + public func asciiOnlyCharacterClasses(_ asciiOnly: Bool = true) -> Regex } ``` @@ -309,7 +309,7 @@ extension RegexComponent { /// The default word boundaries use a Unicode algorithm that handles some cases /// better than simple word boundaries, such as words with internal /// punctuation, changes in script, and Emoji. - public func identifyingWordBoundaries(with wordBoundaryKind: RegexWordBoundaryKind) -> Regex + public func wordBoundaryKind(_ wordBoundaryKind: RegexWordBoundaryKind) -> Regex } public struct RegexWordBoundaryKind: Hashable { From 03f0b1fa2a35578fac302ccfd26a5ed8caafa21c Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Thu, 21 Apr 2022 13:13:04 -0500 Subject: [PATCH 08/12] Rename quantification behaivor --- .../Evolution/UnicodeForStringProcessing.md | 31 ++++++++++++++----- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 721d41e71..36e622fc1 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -130,7 +130,7 @@ The options that `Regex` supports are shown in the table below. 
Options that aff | ASCII-only character classes | `(?DSWP)` | `asciiOnlyDigits()`, etc | | Unicode word boundaries | `(?w)` | `wordBoundaryKind(_:)` | | Semantic level | `(?Xu)` | `matchingSemantics(_:)` | -| Reluctant quantifiers | `(?U)` | `reluctantQuantifiers()` | +| Repetition behavior | `(?U)` | `repetitionBehavior(_:)` | | **Structural/Syntactic** | | | | Extended syntax | `(?x)`,`(?xx)` | n/a | | Named captures only | `(?n)` | n/a | @@ -384,9 +384,9 @@ public struct RegexSemanticLevel: Hashable { } ``` -#### Reluctant quantification by default +#### Default repetition behavior -Regex quantifiers (`+`, `*`, and `?`) match eagerly by default, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring. +Regex quantifiers (`+`, `*`, and `?`) match eagerly by default when they repeat, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring. ```swift let str = "A value." @@ -398,7 +398,7 @@ str.firstMatch(of: /<.+>/) // "A value." str.firstMatch(of: /<.+?>/) // "" ``` -The `U` option toggles the "eagerness" of quanitifiers, so that quantifiers are reluctant by default, and only become eager when `?` is added to the quantifier. +The `U` option toggles the "eagerness" of quantifiers, so that quantifiers are reluctant by default, and only become eager when `?` is added to the quantifier. ```swift // '(?U)' toggles the eagerness of quantifiers: @@ -410,11 +410,26 @@ str.firstMatch(of: /(?U)<.+?>/) // "A value." **`RegexBuilder` API:** +The `repetitionBehavior(_:)` method lets you set the default behavior for all quantifiers that don't explicitly provide their own behavior. For example, you can make all quantifiers behave possessively, eliminating any quantification-caused backtracking. 
+ ```swift extension RegexComponent { /// Returns a regular expression where quantifiers are reluctant by default /// instead of eager. - public func reluctantQuantifiers(_ useReluctantQuantifiers: Bool = true) -> Regex + public func repetitionBehavior(_ behavior: RegexRepetitionBehavior) -> Regex +} + +public struct RegexRepetitionBehavior { + /// Match as much of the input string as possible, backtracking when + /// necessary. + public static var eager: RegexRepetitionBehavior { get } + + /// Match as little of the input string as possible, expanding the matched + /// region as necessary to complete a match. + public static var reluctant: RegexRepetitionBehavior { get } + + /// Match as much of the input string as possible, performing no backtracking. + public static var possessive: RegexRepetitionBehavior { get } } ``` @@ -423,7 +438,7 @@ In order for this option to have the same effect on regexes built with `RegexBui ```swift extension OneOrMore { public init( - _ behavior: QuantificationBehavior? = nil, + _ behavior: RegexRepetitionBehavior? = nil, @RegexComponentBuilder _ component: () -> Component ) where Output == (Substring, C0), Component.Output == (W, C0) } @@ -714,9 +729,9 @@ N/A ## Future directions -### Expanded options +### Expanded options and modifiers -The initial version of `Regex` includes only the options described above. Filling out the remainder of options described in the [Run-time Regex Construction proposal][literals] could be completed as future work, as well as additional improvements, such as adding an option for making quantifiers possessive by default. +The initial version of `Regex` includes only the options described above. Filling out the remainder of options described in the [Run-time Regex Construction proposal][literals] could be completed as future work, as well as additional improvements, such as adding an option that makes a regex match only at the start of a string. 
### Extensions to Character and Unicode Scalar APIs From aa04a563a8d451660c7d9544c015e9e3ec769191 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Thu, 21 Apr 2022 13:23:41 -0500 Subject: [PATCH 09/12] Align proposal with existing API --- .../Evolution/UnicodeForStringProcessing.md | 20 ++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 36e622fc1..0f0cc80e6 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -138,7 +138,7 @@ The options that `Regex` supports are shown in the table below. Options that aff #### Case insensitivity -Regexes perform case sensitive comparisons by default. The `i` option or the `ignoringCase(_:)` method enables case insensitive comparison. +Regexes perform case sensitive comparisons by default. The `i` option or the `ignoresCase(_:)` method enables case insensitive comparison. ```swift let str = "Café" @@ -460,7 +460,7 @@ let regex1 = /\w+\s?\d{,3}/ let regex2 = Regex { OneOrMore(.word) Optionally(.whitespace) - Repeat(.decimalDigit, ...3) + Repeat(.digit, ...3) } ``` @@ -497,14 +497,14 @@ for match in data.matches(of: /(.),/.matchingSemantics(.unicodeScalar)) { - The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. This includes matching newlines, regardless of any option settings. - The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. This includes matching newlines, regardless of any option settings, but only the first scalar in an `\r\n` cluster. 
-#### Decimal and hexadecimal digits +#### Digits -The **decimal digit** character class is matched by `\d` or `CharacterClass.decimalDigit`. Both regexes in this example match one or more decimal digits followed by a colon: +The **decimal digit** character class is matched by `\d` or `CharacterClass.digit`. Both regexes in this example match one or more decimal digits followed by a colon: ```swift let regex1 = /\d+:/ let regex2 = Regex { - OneOrMore(.decimalDigit) + OneOrMore(.digit) ":" } ``` @@ -516,7 +516,7 @@ _Grapheme cluster semantics:_ Matches a character made up of a single Unicode sc _ASCII mode_: Matches a Unicode scalar in the range `0` to `9`. -To invert the decimal digit character class, use `\D` or `CharacterClass.decimalDigit.inverted`. +To invert the decimal digit character class, use `\D` or `CharacterClass.digit.inverted`. The **hexadecimal digit** character class is matched by `CharacterClass.hexDigit`. @@ -638,11 +638,11 @@ Custom classes function as the set union of their individual components, whether Inside regexes, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [Run-time Regex Construction proposal][internals-charclass]. -With `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.decimalDigit` with a range of characters. +With `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.digit` with a range of characters. 
```swift let octoDecimalRegex: Regex = Regex { - let charClass = CharacterClass(.decimalDigit, "a"..."h").ignoringCase() + let charClass = CharacterClass(.digit, "a"..."h").ignoresCase() Capture(OneOrMore(charClass)) transform: { Int($0, radix: 18) } } @@ -664,7 +664,7 @@ extension RegexComponent where Self == CharacterClass { public static var anyUnicodeScalar: CharacterClass { get } - public static var decimalDigit: CharacterClass { get } + public static var digit: CharacterClass { get } public static var hexDigit: CharacterClass { get } @@ -683,10 +683,12 @@ extension RegexComponent where Self == CharacterClass { /// Returns a character class that matches any character in the given string /// or sequence. public static func anyOf(_ s: S) -> CharacterClass + where S.Element == Character /// Returns a character class that matches any unicode scalar in the given /// sequence. public static func anyOf(_ s: S) -> CharacterClass + where S.Element == UnicodeScalar } // Unicode properties From b176fa23efc963d66e30578962613d0e4fa1e5c7 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Fri, 22 Apr 2022 08:14:57 -0500 Subject: [PATCH 10/12] Updated API and matching semantic descriptions ... 
--- .../Evolution/UnicodeForStringProcessing.md | 94 +++++++++++++++++-- Tests/RegexBuilderTests/RegexDSLTests.swift | 8 ++ 2 files changed, 93 insertions(+), 9 deletions(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index 0f0cc80e6..a403bb14b 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -77,7 +77,38 @@ str.contains(/.+e\u{301}/) // true str.contains(/\w+é/) // true ``` -Swift's `Regex` follows the level 2 guidelines for Unicode support in regular expressions described in [Unicode Technical Standard #18][uts18], with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. `Regex` provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which corresponds more closely with other regex engines. + +For compatibility with other regex engines and the flexibility to match at both `Character` and Unicode scalar level, you can switch between matching levels for an entire regex or within select portions. This powerful capability provides the expected default behavior when working with strings, while allowing you to drop down for Unicode scalar-specific matching. + +By default, literal characters and Unicode scalar values (e.g. `\u{301}`) are coalesced into characters in the same way as a normal string, as shown above. Metacharacters, like `.` and `\w`, and custom character classes each match a single element at the current matching level. 
+ +For example, these matches fail, because by the time the parser encounters the "`\u{301}`" Unicode scalar literal, the full `"é"` character has been matched: + +```swift +str.contains(/Caf.\u{301}/) // false - `.` matches "é" character +str.contains(/Caf\w\u{301}/) // false - `\w` matches "é" character +str.contains(/.+\u{301}/) // false - `.+` matches each character +``` + +Alternatively, we can drop down to use Unicode scalar semantics if we want to match specific Unicode sequences. For example, these regexes match an `"e"` followed by any modifier with the specified parameters: + +```swift +str.contains(/e[\u{300}-\u{314}]/.matchingSemantics(.unicodeScalar)) +// true - matches an "e" followed by a Unicode scalar in the range U+0300 - U+0314 +str.contains(/e\p{Nonspacing Mark}/.matchingSemantics(.unicodeScalar)) +// true - matches an "e" followed by a Unicode scalar with general category "Nonspacing Mark" +``` + +Matching in Unicode scalar mode is analogous to comparing against a string's `UnicodeScalarView`: individual Unicode scalars are matched without combining them into characters or testing for canonical equivalence. + +```swift +str.contains(/Café/.matchingSemantics(.unicodeScalar)) +// false - "e\u{301}" doesn't match with /é/ +str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar)) +// true - "e\u{301}" matches with /e\u{301}/ +``` + +Swift's `Regex` follows the level 2 guidelines for Unicode support in regular expressions described in [Unicode Technical Standard #18][uts18], with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. In addition to selecting the matching semantics, `Regex` provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which correspond more closely with other regex engines. 
## Detailed design @@ -262,9 +293,9 @@ extension RegexComponent { #### Unicode word boundaries -By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode default word boundaries, specified as [Unicode level 2 regular expression support][level2-word-boundaries]. +By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode _default word boundaries,_ specified as [Unicode level 2 regular expression support][level2-word-boundaries]. -Disabling the `w` option switches to [simple word boundaries][level1-word-boundaries], finding word boundaries at points in the input where `\b\B` or `\B\b` match. Depending on the other matching options that are enabled, this may be more compatible with the behavior other regex engines. +Disabling the `w` option switches to _[simple word boundaries][level1-word-boundaries],_ finding word boundaries at points in the input where `\b\B` or `\B\b` match. Depending on the other matching options that are enabled, this may be more compatible with the behavior of other regex engines. As shown in this example, the default matching behavior finds the whole first word of the string, while the match with simple word boundaries stops at the apostrophe: @@ -338,7 +369,7 @@ When matching with grapheme cluster semantics (the default), metacharacters like When matching with Unicode scalar semantics, metacharacters and character classes always match a single Unicode scalar value, even if that scalar comprises part of a grapheme cluster. -These semantic levels can lead to different results, especially when working with strings that have decomposed characters. In the following example, `queRegex` matches any 3-character string that begins with `"q"`. +These semantic levels lead to different results, especially when working with strings that have decomposed characters. In the following example, `queRegex` matches any 3-character string that begins with `"q"`. 
```swift let composed = "qué" @@ -362,6 +393,38 @@ print(decomposed.contains(queRegexScalar)) // Prints "false" ``` +With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, including using the `\O` metacharacter or `.anyUnicodeScalar` character class, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a `String` instance. + +```swift +let family = "👨‍👨‍👧‍👦 is a family" + +// Grapheme-cluster mode: Yields a character +let firstCharacter = /^./ +let characterMatch = family.firstMatch(of: firstCharacter)!.output +print(characterMatch) +// Prints "👨‍👨‍👧‍👦" + +// Unicode-scalar mode: Yields only part of a character +let firstUnicodeScalar = /^./.matchingSemantics(.unicodeScalar) +let unicodeScalarMatch = family.firstMatch(of: firstUnicodeScalar)!.output +print(unicodeScalarMatch) +// Prints "👨" + +// The end of `unicodeScalarMatch` is not aligned on a character boundary +print(unicodeScalarMatch.endIndex == family.index(after: family.startIndex)) +// Prints "false" +``` + +When a regex proceeds with grapheme cluster semantics from a position that _isn't_ grapheme cluster aligned, it attempts to match the partial grapheme cluster that starts at that point. In the first call to `contains(_:)` below, `\O` matches a single Unicode scalar value, as shown above, and then the engine tries to match `\s` against the remainder of the family emoji character. Because that character is not whitespace, the match fails. The second call uses `\X`, which matches the entire emoji character, and then successfully matches the following space. 
+ +```swift +// \O matches a single Unicode scalar, whatever the current semantics +family.contains(/^\O\s/) // false + +// \X matches a single character, whatever the current semantics +family.contains(/^\X\s/) // true +``` + **Regex syntax:** `(?X)...` or `(?X...)` for grapheme cluster semantics, `(?u)...` or `(?u...)` for Unicode scalar semantics. **`RegexBuilder` API:** @@ -494,8 +557,8 @@ for match in data.matches(of: /(.),/.matchingSemantics(.unicodeScalar)) { `Regex` also provides ways to select a specific level of "any" matching, without needing to change semantic levels. -- The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. This includes matching newlines, regardless of any option settings. -- The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. This includes matching newlines, regardless of any option settings, but only the first scalar in an `\r\n` cluster. +- The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. This includes matching newlines, regardless of any option settings. This metacharacter is equivalent to the regex syntax `(?s-u:.)`. +- The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. This includes matching newlines, regardless of any option settings, but only the first scalar in an `\r\n` cluster. This metacharacter is equivalent to the regex syntax `(?su:.)`. 
#### Digits @@ -641,10 +704,11 @@ Inside regexes, custom classes are enclosed in square brackets `[...]`, and can With `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.digit` with a range of characters. ```swift -let octoDecimalRegex: Regex = Regex { +let octoDecimalRegex: Regex<(Substring, Int?)> = Regex { let charClass = CharacterClass(.digit, "a"..."h").ignoresCase() - Capture(OneOrMore(charClass)) - transform: { Int($0, radix: 18) } + Capture { + OneOrMore(charClass) + } transform: { Int($0, radix: 18) } } ``` @@ -693,19 +757,27 @@ extension RegexComponent where Self == CharacterClass { // Unicode properties extension CharacterClass { + /// Returns a character class that matches elements in the given Unicode + /// general category. public static func generalCategory(_ category: Unicode.GeneralCategory) -> CharacterClass } // Set algebra methods extension CharacterClass { + /// Creates a character class that combines the given classes in a union. public init(_ first: CharacterClass, _ rest: CharacterClass...) + /// Returns a character class from the union of this class and the given class. public func union(_ other: CharacterClass) -> CharacterClass + /// Returns a character class from the intersection of this class and the given class. public func intersection(_ other: CharacterClass) -> CharacterClass + /// Returns a character class by subtracting the given class from this class. public func subtracting(_ other: CharacterClass) -> CharacterClass + /// Returns a character class matching elements in one or the other, but not both, + /// of this class and the given class. 
public func symmetricDifference(_ other: CharacterClass) -> CharacterClass } @@ -743,6 +815,10 @@ An earlier version of this pitch described adding standard library APIs to `Char A future `Regex` version could support a byte-level semantic mode in addition to grapheme cluster and Unicode scalar semantics. Byte-level semantics would allow matching individual bytes, potentially providing the capability of parsing string and non-string data together. +### More general `CharacterSet` replacement + +Foundation's `CharacterSet` type is in some ways similar to the `CharacterClass` type defined in this proposal. `CharacterSet` is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift `String`s. The proposed `CharacterClass` type is a `RegexBuilder`-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the `CharacterClass` API or introducing a different type to fill that role. + ## Alternatives considered ### Operate on String.UnicodeScalarView instead of using semantic modes diff --git a/Tests/RegexBuilderTests/RegexDSLTests.swift b/Tests/RegexBuilderTests/RegexDSLTests.swift index 0c0bf7c8f..7a2fca1ea 100644 --- a/Tests/RegexBuilderTests/RegexDSLTests.swift +++ b/Tests/RegexBuilderTests/RegexDSLTests.swift @@ -338,6 +338,14 @@ class RegexDSLTests: XCTestCase { Repeat(2...) { "e" } Repeat(0...) 
{ "f" } } + + let octoDecimalRegex: Regex<(Substring, Int?)> = Regex { + let charClass = CharacterClass(.digit, "a"..."h")//.ignoringCase() + Capture { + OneOrMore(charClass) + } transform: { Int($0, radix: 18) } + } + XCTAssertEqual("ab12".firstMatch(of: octoDecimalRegex)!.output.1, 61904) } func testAssertions() throws { From 25f5a2d194a703e229a13f52be73d69e2172d9c7 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Fri, 22 Apr 2022 10:25:28 -0500 Subject: [PATCH 11/12] Update proposal overview doc --- Documentation/Evolution/ProposalOverview.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/Evolution/ProposalOverview.md b/Documentation/Evolution/ProposalOverview.md index 4346932b5..0bdc7a7da 100644 --- a/Documentation/Evolution/ProposalOverview.md +++ b/Documentation/Evolution/ProposalOverview.md @@ -43,13 +43,13 @@ Introduces `CustomMatchingRegexComponent`, which is a monadic-parser style inter ## Unicode for String Processing -- Draft: TBD +- [Draft](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md) - (Old) [Character class definitions](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920) Covers three topics: -- Proposes literal and DSL API for library-defined character classes, Unicode scripts and properties, and custom character classes. -- Proposes literal and DSL API for options that affect matching behavior. +- Proposes regex syntax and `RegexBuilder` API for options that affect matching behavior. +- Proposes regex syntax and `RegexBuilder` API for library-defined character classes, Unicode properties, and custom character classes. - Defines how Unicode scalar-based classes are extended to grapheme clusters in the different semantic and other matching modes. 
From 1a5e4d06315188b6dc4b3195851a63c8cc7c9831 Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Fri, 22 Apr 2022 10:43:35 -0500 Subject: [PATCH 12/12] Update proposal authors --- Documentation/Evolution/UnicodeForStringProcessing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md index a403bb14b..828d8f53c 100644 --- a/Documentation/Evolution/UnicodeForStringProcessing.md +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -1,7 +1,7 @@ # Unicode for String Processing Proposal: [SE-NNNN](NNNN-filename.md) -Authors: [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman), [Alejandro Alonso](https://github.com/Azoy) +Authors: [Nate Cook](https://github.com/natecook1000), [Alejandro Alonso](https://github.com/Azoy) Review Manager: TBD Implementation: [apple/swift-experimental-string-processing][repo] Status: **Draft**