diff --git a/Documentation/Evolution/CharacterClasses.md b/Documentation/Evolution/CharacterClasses.md
deleted file mode 100644
index c9ffcbc95..000000000
--- a/Documentation/Evolution/CharacterClasses.md
+++ /dev/null
@@ -1,503 +0,0 @@
-# Character Classes for String Processing
-
-- **Authors:** [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman)
-- **Status:** Draft pitch
-
-## Introduction
-
-[Declarative String Processing Overview][overview] presents regex-powered matching broadly, without details concerning syntax and semantics, leaving clarification to subsequent pitches. [Regular Expression Literals][literals] presents more details on regex _syntax_ such as delimiters and PCRE-syntax innards, but explicitly excludes discussion of regex _semantics_. This pitch and discussion aims to address a targeted subset of regex semantics: definitions of character classes. We propose a comprehensive treatment of regex character class semantics in the context of existing and newly proposed API directly on `Character` and `Unicode.Scalar`.
-
-Character classes in regular expressions include metacharacters like `\d` to match a digit, `\s` to match whitespace, and `.` to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves, and, in case-insensitive matching, their case-toggled counterpart. For the purpose of this work, then, we consider a *character class* to be any part of a regular expression literal that can match an actual component of a string.
-
-## Motivation
-
-Operating over classes of characters is a vital component of string processing. Swift's `String` provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq].
-
-```swift
-let str = "Cafe\u{301}" // "Café"
-str == "Café" // true
-str.dropLast() // "Caf"
-str.last == "é" // true (precomposed e with acute accent)
-str.last == "e\u{301}" // true (e followed by composing acute accent)
-```
-
-Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult.
-
-Other engines
-
-Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster.
-
-| Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining |
-|---|---|---|---|---|
-| C#, Rust, Go | `"Cafe"` | `"´"` | n/a | n/a |
-| NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` |
-
-Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence.
-
-
-
-[SE-0211 Unicode Scalar Properties][scalarprops] added basic building blocks for classification of scalars by surfacing Unicode data from the [UCD][ucd]. [SE-0221: Character Properties][charprops] defined grapheme-cluster semantics for Swift for a subset of these. But, many classifications used in string processing are combinations of scalar properties or ad-hoc listings, and as such are not present today in Swift.
-
-Regardless of any syntax or underlying formalism, classifying characters is a worthy and much needed addition to the Swift standard library. We believe our thorough treatment of every character class found across many popular regex engines gives Swift a solid semantic basis.
-
-## Proposed Solution
-
-This pitch is narrowly scoped to Swift definitions of character classes found in regexes. For each character class, we propose:
-
-- A name for use in API
-- A `Character` API, by extending Unicode scalar definitions to grapheme clusters
-- A `Unicode.Scalar` API with modern Unicode definitions
-- If applicable, a `Unicode.Scalar` API for notable standards like POSIX
-
-We're proposing what we believe to be the Swiftiest definitions using [Unicode's guidance][uts18] for `Unicode.Scalar` and extending this to grapheme clusters using `Character`'s existing [rationale][charpropsrationale].
-
-Broad language/engine survey
-
-For these definitions, we cross-referenced Unicode's [UTS\#18][uts18] with a broad survey of existing languages and engines. We found that while these all support a subset of UTS\#18, each language or framework implements a slightly different subset. The following table shows some of the variations:
-
-| Language/Framework | Dot (`.`) matches | Supports `\X` | Canonical Equivalence | `\d` matches FULL WIDTH digit |
-|------------------------------|----------------------------------------------------|---------------|---------------------------|-------------------------------|
-| [ECMAScript][ecmascript] | UTF16 code unit (Unicode scalar in Unicode mode) | no | no | no |
-| [Perl][perl] / [PCRE][pcre] | UTF16 code unit, (Unicode scalar in Unicode mode) | yes | no | no |
-| [Python3][python] | Unicode scalar | no | no | yes |
-| [Raku][raku] | Grapheme cluster | n/a | strings always normalized | yes |
-| [Ruby][ruby] | Unicode scalar | yes | no | no |
-| [Rust][rust] | Unicode scalar | no | no | no |
-| [C#][csharp] | UTF16 code unit | no | no | yes |
-| [Java][java] | Unicode scalar | yes | Only in CANON_EQ mode | no |
-| [Go][go] | Unicode scalar | no | no | no |
-| [`NSRegularExpression`][icu] | Unicode scalar | yes | no | yes |
-
-We are still in the process of evaluating [C++][cplusplus], [RE2][re2], and [Oniguruma][oniguruma].
-
-
-
-## Detailed Design
-
-### Literal characters
-
-A literal character (such as `a`, `é`, or `한`) in a regex literal matches that particular character or code sequence. When matching at the semantic level of `Unicode.Scalar`, it should match the literal sequence of scalars. When matching at the semantic level of `Character`, it should match `Character`-by-`Character`, honoring Unicode canonical equivalence.
-
-We are not proposing new API here as this is already handled by `String` and `String.UnicodeScalarView`'s conformance to `Collection`.
-
-### Unicode values: `\u`, `\U`, `\x`
-
-Metacharacters that begin with `\u`, `\U`, or `\x` match a character with the specified Unicode scalar values. We propose these be treated exactly the same as literals.
-
-### Match any: `.`, `\X`
-
-The dot metacharacter matches any single character or element. Depending on options and modes, it may exclude newlines.
-
-`\X` matches any grapheme cluster (`Character`), even when the regular expression is otherwise matching at semantic level of `Unicode.Scalar`.
-
-We are not proposing new API here as this is already handled by collection conformances.
-
-While we would like for the stdlib to have grapheme-breaking API over collections of `Unicode.Scalar`, that is a separate discussion and out-of-scope for this pitch.
-
-### Decimal digits: `\d`,`\D`
-
-We propose `\d` be named "decimalDigit" with the following definitions:
-
-```swift
-extension Character {
- /// A Boolean value indicating whether this character represents
- /// a decimal digit.
- ///
- /// Decimal digits are comprised of a single Unicode scalar that has a
- /// `numericType` property equal to `.decimal`. This includes the digits
- /// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode
- /// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE`
- /// (U+096F).
- ///
- /// Decimal digits are a subset of whole numbers, see `isWholeNumber`.
- ///
- /// To get the character's value, use the `decimalDigitValue` property.
- public var isDecimalDigit: Bool { get }
-
- /// The numeric value this character represents, if it is a decimal digit.
- ///
- /// Decimal digits are comprised of a single Unicode scalar that has a
- /// `numericType` property equal to `.decimal`. This includes the digits
- /// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode
- /// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE`
- /// (U+096F).
- ///
- /// Decimal digits are a subset of whole numbers, see `wholeNumberValue`.
- ///
- /// let chars: [Character] = ["1", "९", "A"]
- /// for ch in chars {
- /// print(ch, "-->", ch.decimalDigitValue)
- /// }
- /// // Prints:
- /// // 1 --> Optional(1)
- /// // ९ --> Optional(9)
- /// // A --> nil
- public var decimalDigitValue: Int? { get }
-
-}
-
-extension Unicode.Scalar {
- /// A Boolean value indicating whether this scalar is considered
- /// a decimal digit.
- ///
- /// Any Unicode scalar that has a `numericType` property equal to `.decimal`
- /// is considered a decimal digit. This includes the digits from the ASCII
- /// range, from the _Halfwidth and Fullwidth Forms_ Unicode block, as well
- /// as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F).
- public var isDecimalDigit: Bool { get }
-}
-```
-
-`\D` matches the inverse of `\d`.
-
-*TBD*: [SE-0221: Character Properties][charprops] did not define equivalent API on `Unicode.Scalar`, as it was itself an extension of single `Unicode.Scalar.Properties`. Since we're defining additional classifications formed from algebraic formulations of properties, it may make sense to put API such as `decimalDigitValue` on `Unicode.Scalar` as well as back-porting other API from `Character` (e.g. `hexDigitValue`). We'd like to discuss this with the community.
-
-*TBD*: `Character.isHexDigit` is currently constrained to the subset of decimal digits that are followed by encodings of Latin letters `A-F` in various forms (all 6 of them... thanks Unicode). We could consider extending this to be a superset of `isDecimalDigit` by allowing and producing values for all decimal digits, one would just have to use the Latin letters to refer to values greater than `9`. We'd like to discuss this with the community.
-
-_Rationale_
-
-Unicode's recommended definition for `\d` is its [numeric type][numerictype] of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its [definition][derivednumeric] and is a proper subset of `Character.isWholeNumber`.
-
-We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make this Character property _restrictive_, similar to `isHexDigit` and `isWholeNumber` and provide a way to access this value.
-
-It's possible we might add future properties to differentiate Unicode's non-decimal digits, but that is outside the scope of this pitch.
-
-
-
-### Word characters: `\w`, `\W`
-
-We propose `\w` be named "word character" with the following definitions:
-
-```swift
-extension Character {
- /// A Boolean value indicating whether this character is considered
- /// a "word" character.
- ///
- /// See `Unicode.Scalar.isWordCharacter`.
- public var isWordCharacter: Bool { get }
-}
-
-extension Unicode.Scalar {
- /// A Boolean value indicating whether this scalar is considered
- /// a "word" character.
- ///
- /// Any Unicode scalar that has one of the Unicode properties
- /// `Alphabetic`, `Digit`, or `Join_Control`, or is in the
- /// general category `Mark` or `Connector_Punctuation`.
- public var isWordCharacter: Bool { get }
-}
-```
-
-`\W` matches the inverse of `\w`.
-
-_Rationale_
-
-Word characters include more than letters, and we went with Unicode's recommended scalar semantics. We extend to grapheme clusters similarly to `Character.isLetter`, that is, subsequent (combining) scalars do not change the word-character-ness of the grapheme cluster.
-
-
-
-### Whitespace and newlines: `\s`, `\S` (plus `\h`, `\H`, `\v`, `\V`, and `\R`)
-
-We propose `\s` be named "whitespace" with the following definitions:
-
-```swift
-extension Unicode.Scalar {
- /// A Boolean value indicating whether this scalar is considered
- /// whitespace.
- ///
- /// All Unicode scalars with the derived `White_Space` property are
- /// considered whitespace, including:
- ///
- /// - `CHARACTER TABULATION` (U+0009)
- /// - `LINE FEED (LF)` (U+000A)
- /// - `LINE TABULATION` (U+000B)
- /// - `FORM FEED (FF)` (U+000C)
- /// - `CARRIAGE RETURN (CR)` (U+000D)
- /// - `NEWLINE (NEL)` (U+0085)
- public var isWhitespace: Bool { get }
-}
-```
-
-This definition matches the value of the existing `Unicode.Scalar.Properties.isWhitespace` property. Note that `Character.isWhitespace` already exists with the desired semantics, which is a grapheme cluster that begins with a whitespace Unicode scalar.
-
-We propose `\h` be named "horizontalWhitespace" with the following definitions:
-
-```swift
-extension Character {
- /// A Boolean value indicating whether this character is considered
- /// horizontal whitespace.
- ///
- /// All characters with an initial Unicode scalar in the general
- /// category `Zs`/`Space_Separator`, or the control character
- /// `CHARACTER TABULATION` (U+0009), are considered horizontal
- /// whitespace.
- public var isHorizontalWhitespace: Bool { get }
-}
-
-extension Unicode.Scalar {
- /// A Boolean value indicating whether this scalar is considered
- /// horizontal whitespace.
- ///
- /// All Unicode scalars with the general category
- /// `Zs`/`Space_Separator`, along with the control character
- /// `CHARACTER TABULATION` (U+0009), are considered horizontal
- /// whitespace.
- public var isHorizontalWhitespace: Bool { get }
-}
-```
-
-We propose `\v` be named "verticalWhitespace" with the following definitions:
-
-
-```swift
-extension Character {
- /// A Boolean value indicating whether this scalar is considered
- /// vertical whitespace.
- ///
- /// All characters with an initial Unicode scalar in the general
- /// category `Zl`/`Line_Separator`, or the following control
- /// characters, are considered vertical whitespace (see below)
- public var isVerticalWhitespace: Bool { get }
-}
-
-extension Unicode.Scalar {
- /// A Boolean value indicating whether this scalar is considered
- /// vertical whitespace.
- ///
- /// All Unicode scalars with the general category
- /// `Zl`/`Line_Separator`, along with the following control
- /// characters, are considered vertical whitespace:
- ///
- /// - `LINE FEED (LF)` (U+000A)
- /// - `LINE TABULATION` (U+000B)
- /// - `FORM FEED (FF)` (U+000C)
- /// - `CARRIAGE RETURN (CR)` (U+000D)
- /// - `NEWLINE (NEL)` (U+0085)
- public var isVerticalWhitespace: Bool { get }
-}
-```
-
-Note that `Character.isNewline` already exists with the definition [required][lineboundary] by UTS\#18. *TBD:* Should we backport to `Unicode.Scalar`?
-
-`\S`, `\H`, and `\V` match the inverse of `\s`, `\h`, and `\v`, respectively.
-
-We propose `\R` include "verticalWhitespace" above with detection (and consumption) of the CR-LF sequence when applied to `Unicode.Scalar`. It is equivalent to `Character.isVerticalWhitespace` when applied to `Character`s.
-
-We are similarly not proposing any new API for `\R` until the stdlib has grapheme-breaking API over `Unicode.Scalar`.
-
-_Rationale_
-
-Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept.
-
-We use Unicode's recommended scalar semantics for horizontal whitespace and extend that to grapheme semantics similarly to `Character.isWhitespace`.
-
-We use ICU's definition for vertical whitespace, similarly extended to grapheme clusters.
-
-
-
-### Control characters: `\t`, `\r`, `\n`, `\f`, `\0`, `\e`, `\a`, `\b`, `\cX`
-
-We propose the following names and meanings for these escaped literals representing specific control characters:
-
-```swift
-extension Character {
- /// A horizontal tab character, `CHARACTER TABULATION` (U+0009).
- public static var tab: Character { get }
-
- /// A carriage return character, `CARRIAGE RETURN (CR)` (U+000D).
- public static var carriageReturn: Character { get }
-
- /// A line feed character, `LINE FEED (LF)` (U+000A).
- public static var lineFeed: Character { get }
-
- /// A form feed character, `FORM FEED (FF)` (U+000C).
- public static var formFeed: Character { get }
-
- /// A NULL character, `NUL` (U+0000).
- public static var nul: Character { get }
-
- /// An escape control character, `ESC` (U+001B).
- public static var escape: Character { get }
-
- /// A bell character, `BEL` (U+0007).
- public static var bell: Character { get }
-
- /// A backspace character, `BS` (U+0008).
- public static var backspace: Character { get }
-
- /// A combined carriage return and line feed as a single character denoting
- /// end-of-line.
- public static var carriageReturnLineFeed: Character { get }
-
- /// Returns a control character with the given value, Control-`x`.
- ///
- /// This method returns a value only when you pass a letter in
- /// the ASCII range as `x`:
- ///
- /// if let ch = Character.control("G") {
- /// print("'ch' is a bell character", ch == Character.bell)
- /// } else {
- /// print("'ch' is not a control character")
- /// }
- /// // Prints "'ch' is a bell character: true"
- ///
- /// - Parameter x: An upper- or lowercase letter to derive
- /// the control character from.
- /// - Returns: Control-`x` if `x` is in the pattern `[a-zA-Z]`;
- /// otherwise, `nil`.
- public static func control(_ x: Unicode.Scalar) -> Character?
-}
-
-extension Unicode.Scalar {
- /// Same as above, producing Unicode.Scalar, except for CR-LF...
-}
-```
-
-We also propose `isControl` properties with the following definitions:
-
-```swift
-extension Character {
- /// A Boolean value indicating whether this character represents
- /// a control character.
- ///
- /// Control characters are a single Unicode scalar with the
- /// general category `Cc`/`Control` or the CR-LF pair (`\r\n`).
- public var isControl: Bool { get }
-}
-
-extension Unicode.Scalar {
- /// A Boolean value indicating whether this scalar represents
- /// a control character.
- ///
- /// Control characters have the general category `Cc`/`Control`.
- public var isControl: Bool { get }
-}
-```
-
-*TBD*: Should we have a CR-LF static var on `Unicode.Scalar` that produces a value of type `Character`?
-
-
-_Rationale_
-
-This approach simplifies the use of some common control characters, while making the rest available through a method call.
-
-
-
-
-
-### Unicode named values and properties: `\N`, `\p`, `\P`
-
-`\N{NAME}` matches a Unicode scalar value with the specified name. `\p{PROPERTY}` and `\p{PROPERTY=VALUE}` match a Unicode scalar value with the given Unicode property (and value, if given).
-
-While most Unicode-defined properties can only match at the Unicode scalar level, some are defined to match an extended grapheme cluster. For example, `/\p{RGI_Emoji_Flag_Sequence}/` will match any flag emoji character, which are composed of two Unicode scalar values.
-
-`\P{...}` matches the inverse of `\p{...}`.
-
-Most of this is already present inside `Unicode.Scalar.Properties`, and we propose to round it out with anything missing, e.g. script and script extensions. (API is _TBD_, still working on it.)
-
-Even though we are not proposing any `Character`-based API, we'd like to discuss with the community whether or how to extend them to grapheme clusters. Some options:
-
-- Forbid in any grapheme-cluster semantic mode
-- Match only single-scalar grapheme clusters with the given property
-- Match any grapheme cluster that starts with the given property
-- Something more-involved such as per-property reasoning
-
-
-### POSIX character classes: `[:NAME:]`
-
-We propose that POSIX character classes be prefixed with "posix" in their name with APIs for testing membership of `Character`s and `Unicode.Scalar`s. `Unicode.Scalar.isASCII` and `Character.isASCII` already exist and can satisfy `[:ascii:]`, and can be used in combination with new members like `isDigit` to represent individual POSIX character classes. Alternatively, we could introduce an option-set-like `POSIXCharacterClass` and `func isPOSIX(_:POSIXCharacterClass)` since POSIX is a fully defined standard. This would cut down on the amount of API noise directly visible on `Character` and `Unicode.Scalar` significantly. We'd like some discussion with the community here, noting that this will become clearer as more of the string processing overview takes shape.
-
-POSIX's character classes represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which are covered elsewhere in this pitch and some of which already exist today. Some Character definitions are *TBD* and we'd like more discussion with the community.
-
-
-| POSIX class | API name | `Character` | `Unicode.Scalar` | POSIX mode value |
-|-------------|----------------------|-----------------------|-------------------------------|-------------------------------|
-| `[:lower:]` | lowercase | (exists) | `\p{Lowercase}` | `[a-z]` |
-| `[:upper:]` | uppercase | (exists) | `\p{Uppercase}` | `[A-Z]` |
-| `[:alpha:]` | alphabetic | (exists: `.isLetter`) | `\p{Alphabetic}` | `[A-Za-z]` |
-| `[:alnum:]` | alphaNumeric | TBD | `[\p{Alphabetic}\p{Decimal}]` | `[A-Za-z0-9]` |
-| `[:word:]` | wordCharacter | (pitched) | (pitched) | `[[:alnum:]_]` |
-| `[:digit:]` | decimalDigit | (pitched) | (pitched) | `[0-9]` |
-| `[:xdigit:]`| hexDigit | (exists) | `\p{Hex_Digit}` | `[0-9A-Fa-f]` |
-| `[:punct:]` | punctuation | (exists) | (port from `Character`) | `[-!"#%&'()*,./:;?@[\\\]_{}]` |
-| `[:blank:]` | horizontalWhitespace | (pitched) | (pitched) | `[ \t]` |
-| `[:space:]` | whitespace | (exists) | `\p{Whitespace}` | `[ \t\n\r\f\v]` |
-| `[:cntrl:]` | control | (pitched) | (pitched) | `[\x00-\x1f\x7f]` |
-| `[:graph:]` | TBD | TBD | TBD | `[^ [:cntrl:]]` |
-| `[:print:]` | TBD | TBD | TBD | `[[:graph:] ]` |
-
-
-### Custom classes: `[...]`
-
-We propose that custom classes function just like set union. We propose that ranged-based custom character classes function just like `ClosedRange`. Thus, we are not proposing any additional API.
-
-That being said, providing grapheme cluster semantics is simultaneously obvious and tricky. A direct extension treats `[a-f]` as equivalent to `("a"..."f").contains()`. Strings (and thus Characters) are ordered for the purposes of efficiently maintaining programming invariants while honoring Unicode canonical equivalence. This ordering is _consistent_ but [linguistically meaningless][meaningless] and subject to implementation details such as whether we choose to normalize under NFC or NFD.
-
-```swift
-let c: ClosedRange = "a"..."f"
-c.contains("e") // true
-c.contains("g") // false
-c.contains("e\u{301}") // false, NFC uses precomposed é
-c.contains("e\u{305}") // true, there is no precomposed e̅
-```
-
-We will likely want corresponding `RangeExpression`-based API in the future and keeping consistency with ranges is important.
-
-We would like to discuss this problem with the community here. Even though we are not addressing regex literals specifically in this thread, it makes sense to produce suggestions for compilation errors or warnings.
-
-Some options:
-
-- Do nothing, embrace emergent behavior
-- Warn/error for _any_ character class ranges
-- Warn/error for character class ranges outside of a quasi-meaningful subset (e.g. ASCII, albeit still has issues above)
-- Warn/error for multiple-scalar grapheme clusters (albeit still has issues above)
-
-
-
-## Future Directions
-
-### Future API
-
-Library-extensible pattern matching will necessitate more types, protocols, and API in the future, many of which may involve character classes. This pitch aims to define names and semantics for exactly these kinds of API now, so that they can slot in naturally.
-
-### More classes or custom classes
-
-Future API might express custom classes or need more built-in classes. This pitch aims to establish rationale and precedent for a large number of character classes in Swift, serving as a basis that can be extended.
-
-### More lenient conversion APIs
-
-The proposed semantics for matching "digits" are broader than what the existing `Int(_:radix:)?` initializer accepts. It may be useful to provide additional initializers that can understand the whole breadth of characters matched by `\d`, or other related conversions.
-
-
-
-
-[literals]: https://forums.swift.org/t/pitch-regular-expression-literals/52820
-[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459
-[charprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md
-[charpropsrationale]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md#detailed-semantics-and-rationale
-[canoneq]: https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence
-[graphemes]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
-[meaningless]: https://forums.swift.org/t/declarative-string-processing-overview/52459/121
-[scalarprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md
-[ucd]: https://www.unicode.org/reports/tr44/tr44-28.html
-[numerictype]: https://www.unicode.org/reports/tr44/#Numeric_Type
-[derivednumeric]: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt
-
-
-[uts18]: https://unicode.org/reports/tr18/
-[proplist]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
-[pcre]: https://www.pcre.org/current/doc/html/pcre2pattern.html
-[perl]: https://perldoc.perl.org/perlre
-[raku]: https://docs.raku.org/language/regexes
-[rust]: https://docs.rs/regex/1.5.4/regex/
-[python]: https://docs.python.org/3/library/re.html
-[ruby]: https://ruby-doc.org/core-2.4.0/Regexp.html
-[csharp]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
-[icu]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
-[posix]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
-[oniguruma]: https://www.cuminas.jp/sdk/regularExpression.html
-[go]: https://pkg.go.dev/regexp/syntax@go1.17.2
-[cplusplus]: https://www.cplusplus.com/reference/regex/ECMAScript/
-[ecmascript]: https://262.ecma-international.org/12.0/#sec-pattern-semantics
-[re2]: https://github.com/google/re2/wiki/Syntax
-[java]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
diff --git a/Documentation/Evolution/ProposalOverview.md b/Documentation/Evolution/ProposalOverview.md
index 4346932b5..898e0db20 100644
--- a/Documentation/Evolution/ProposalOverview.md
+++ b/Documentation/Evolution/ProposalOverview.md
@@ -19,7 +19,7 @@ Covers the result builder approach and basic API.
## Run-time Regex Construction
-- [Pitch](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md)
+- [Pitch](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md), [Thread](https://forums.swift.org/t/pitch-2-regex-syntax-and-run-time-construction/56624)
- (old) Pitch thread: [Regex Syntax](https://forums.swift.org/t/pitch-regex-syntax/55711)
+ Brief: Syntactic superset of PCRE2, Oniguruma, ICU, UTS\#18, etc.
@@ -27,7 +27,7 @@ Covers the "interior" syntax, extended syntaxes, run-time construction of a rege
## Regex Literals
-- [Draft](https://github.com/apple/swift-experimental-string-processing/pull/187)
+- [Draft](https://github.com/apple/swift-experimental-string-processing/pull/187), [Thread](https://forums.swift.org/t/pitch-2-regex-literals/56736)
- (Old) original pitch:
+ [Thread](https://forums.swift.org/t/pitch-regular-expression-literals/52820)
+ [Update](https://forums.swift.org/t/pitch-regular-expression-literals/52820/90)
@@ -39,17 +39,17 @@ Covers the "interior" syntax, extended syntaxes, run-time construction of a rege
Proposes a slew of Regex-powered algorithms.
-Introduces `CustomMatchingRegexComponent`, which is a monadic-parser style interface for external parsers to be used as components of a regex.
+Introduces `CustomPrefixMatchRegexComponent`, which is a monadic-parser style interface for external parsers to be used as components of a regex.
## Unicode for String Processing
-- Draft: TBD
+- [Draft](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md)
- (Old) [Character class definitions](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920)
Covers three topics:
-- Proposes literal and DSL API for library-defined character classes, Unicode scripts and properties, and custom character classes.
-- Proposes literal and DSL API for options that affect matching behavior.
+- Proposes regex syntax and `RegexBuilder` API for options that affect matching behavior.
+- Proposes regex syntax and `RegexBuilder` API for library-defined character classes, Unicode properties, and custom character classes.
- Defines how Unicode scalar-based classes are extended to grapheme clusters in the different semantic and other matching modes.
diff --git a/Documentation/Evolution/RegexLiterals.md b/Documentation/Evolution/RegexLiterals.md
index 3c12c9c7a..3643590d4 100644
--- a/Documentation/Evolution/RegexLiterals.md
+++ b/Documentation/Evolution/RegexLiterals.md
@@ -12,7 +12,7 @@ In *[Regex Type and Overview][regex-type]* we introduced the `Regex` type, which
```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
-let regex = try! Regex(compiling: pattern)
+let regex = try! Regex(pattern)
// regex: Regex
```
@@ -366,7 +366,7 @@ However we decided against this because:
### No custom literal
-Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex(compiling: "[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean:
+Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex("[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean:
- No source tooling support (e.g syntax highlighting, refactoring actions) would be available.
- Parse errors would be diagnosed at run time rather than at compile time.
diff --git a/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md b/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md
index cab21288d..5c9fa6c59 100644
--- a/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md
+++ b/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md
@@ -1,7 +1,12 @@
# Regex Syntax and Run-time Construction
-- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
+* Proposal: [SE-NNNN](NNNN-filename.md)
+* Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
+* Review Manager: [Ben Cohen](https://github.com/airspeedswift)
+* Status: **Awaiting review**
+* Implementation: https://github.com/apple/swift-experimental-string-processing
+ * Available in nightly toolchain snapshots with `import _StringProcessing`
## Introduction
@@ -50,11 +55,11 @@ We propose run-time construction of `Regex` from a best-in-class treatment of fa
```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
-let regex = try! Regex(compiling: pattern)
+let regex = try! Regex(pattern)
// regex: Regex
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
- try! Regex(compiling: pattern)
+ try! Regex(pattern)
```
### Syntax
@@ -81,11 +86,11 @@ We propose initializers to declare and compile a regex from syntax. Upon failure
```swift
extension Regex {
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
- public init(compiling pattern: String, as: Output.Type = Output.self) throws
+ public init(_ pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
- public init(compiling pattern: String) throws
+ public init(_ pattern: String) throws
}
```
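+
+Because both initializers throw, a pattern that arrives at run time can be validated without trapping. The following is a minimal sketch, assuming a toolchain where `Regex` is available (Swift 5.7 or later); the helper name `compileOrReport` is hypothetical and not part of this proposal:
+
+```swift
+// Hypothetical helper (not part of the proposal): returns `nil` instead of
+// trapping when `pattern` is not valid regex syntax.
+func compileOrReport(_ pattern: String) -> Regex<AnyRegexOutput>? {
+    do {
+        // Resolves to the existentially-typed initializer.
+        return try Regex(pattern)
+    } catch {
+        print("Could not compile \(pattern): \(error)")
+        return nil
+    }
+}
+
+let valid = compileOrReport("[abc]+")   // a usable regex
+let invalid = compileOrReport("[abc")   // nil: the character class is never closed
+```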
@@ -156,6 +161,20 @@ extension Regex.Match where Output == AnyRegexOutput {
}
```
+We propose adding API to query and access captures by name in an existentially typed regex match:
+
+```swift
+extension Regex.Match where Output == AnyRegexOutput {
+ /// If a named-capture with `name` is present, returns its value. Otherwise `nil`.
+ public subscript(_ name: String) -> AnyRegexOutput.Element? { get }
+}
+
+extension AnyRegexOutput {
+ /// If a named-capture with `name` is present, returns its value. Otherwise `nil`.
+ public subscript(_ name: String) -> AnyRegexOutput.Element? { get }
+}
+```
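+
+As an illustration of the subscripts above, the sketch below looks up a capture by name in a regex constructed at run time. It assumes a toolchain where `Regex` is available (Swift 5.7 or later); the helper `namedCapture(_:of:in:)` is hypothetical and exists only for this example:
+
+```swift
+// Hypothetical helper: compiles `pattern`, matches it against all of `input`,
+// and returns the text of the capture called `name`, if there is one.
+func namedCapture(_ name: String, of pattern: String, in input: String) throws -> Substring? {
+    let regex: Regex<AnyRegexOutput> = try Regex(pattern)
+    guard let match = input.wholeMatch(of: regex) else { return nil }
+    // The subscript returns `nil` when no capture has that name.
+    return match[name]?.substring
+}
+
+let pattern = #"(?<year>\d{4})-(?<month>\d{2})"#
+let year = try? namedCapture("year", of: pattern, in: "2022-04")  // "2022"
+let day = try? namedCapture("day", of: pattern, in: "2022-04")    // nil
+```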
+
The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.
Grammar Notation
@@ -392,7 +411,7 @@ For non-Unicode properties, only a value is required. These include:
- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.
-Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
+Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`. Both spellings may be used inside and outside of a custom character class.
#### `\K`
@@ -534,6 +553,7 @@ These operators have a lower precedence than the implicit union of members, e.g
To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand-side is a nested custom character class. We propose following this behavior.
+Note that a custom character class may begin with the `:` character, and only becomes a POSIX character property if a closing `:]` is present. For example, `[:a]` is the character class of `:` and `a`.
### Matching options
@@ -863,7 +883,23 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat
### Extended character property syntax
-ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We propose supporting this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
+ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`. This has two effects:
+
+- They share the same internal grammar, which allows the use of any Unicode character properties in addition to the POSIX properties.
+- The POSIX syntax may be used outside of custom character classes, unlike in PCRE and Oniguruma.
+
+We propose following both of these rules. The former is purely additive, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. The latter does conflict with other engines, but we feel it is much more likely that a user would expect e.g `[:space:]` to be a character property rather than the character class `[:aceps]`. We do however feel that a warning might be warranted in order to avoid confusion.
+
+### POSIX character property disambiguation
+
+PCRE, Oniguruma and ICU allow `[:` to be part of a custom character class if a closing `:]` is not present. For example, `[:a]` is the character class of `:` and `a`. However they each have different rules for detecting the closing `:]`:
+
+- PCRE will scan ahead until it hits either `:]`, `]`, or `[:`.
+- Oniguruma will scan ahead until it hits either `:]`, `]`, or the length exceeds 20 characters.
+- ICU will scan ahead until it hits a known escape sequence (e.g `\a`, `\e`, `\Q`, ...), or `:]`. Note this excludes character class escapes e.g `\d`. It also excludes `]`, meaning that even `[:a][:]` is parsed as a POSIX character property.
+
+We propose unifying these behaviors by scanning ahead until we hit either `[`, `]`, `:]`, or `\`. Additionally, we will stop on encountering `}` or a second occurrence of `=`. These two extra rules follow from the fact that such characters would be invalid contents of the alternative `\p{...}` syntax.
+
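+The proposed scan can be sketched as follows. This is an illustrative reading of the rule above, not the actual parser implementation; the function name `closesAsPOSIXProperty` is hypothetical, and its argument is the text that follows an opening `[:`:
+
+```swift
+/// Hypothetical helper: reports whether a closing `:]` is reached before any
+/// character that ends the scan.
+func closesAsPOSIXProperty(_ rest: Substring) -> Bool {
+    var equalsCount = 0
+    var previous: Character? = nil
+    for ch in rest {
+        switch ch {
+        case "]":
+            // `:]` closes the property; a bare `]` ends the scan without one.
+            return previous == ":"
+        case "[", "\\", "}":
+            return false
+        case "=":
+            equalsCount += 1
+            if equalsCount == 2 { return false }
+        default:
+            break
+        }
+        previous = ch
+    }
+    return false
+}
+
+let asProperty = closesAsPOSIXProperty("space:]")  // true:  `[:space:]`
+let asClass = closesAsPOSIXProperty("a]")          // false: `[:a]` is a custom character class
+```
+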
### Script properties
diff --git a/Documentation/Evolution/RegexTypeOverview.md b/Documentation/Evolution/RegexTypeOverview.md
index bce336551..68dd6ccc7 100644
--- a/Documentation/Evolution/RegexTypeOverview.md
+++ b/Documentation/Evolution/RegexTypeOverview.md
@@ -1,6 +1,11 @@
# Regex Type and Overview
-- Authors: [Michael Ilseman](https://github.com/milseman)
+* Proposal: [SE-0350](0350-regex-type-overview.md)
+* Authors: [Michael Ilseman](https://github.com/milseman)
+* Review Manager: [Ben Cohen](https://github.com/airspeedswift)
+* Status: **Active Review (4 - 28 April 2022)**
+* Implementation: https://github.com/apple/swift-experimental-string-processing
+ * Available in nightly toolchain snapshots with `import _StringProcessing`
## Introduction
@@ -134,11 +139,11 @@ Regexes can be created at run time from a string containing familiar regex synta
```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
-let regex = try! Regex(compiling: pattern)
+let regex = try! Regex(pattern)
// regex: Regex
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
- try! Regex(compiling: pattern)
+ try! Regex(pattern)
```
*Note*: The syntax accepted and further details on run-time compilation, including `AnyRegexOutput` and extended syntaxes, are discussed in [Run-time Regex Construction][pitches].
@@ -207,7 +212,7 @@ func processEntry(_ line: String) -> Transaction? {
// amount: Substring
// )>
- guard let match = regex.matchWhole(line),
+ guard let match = regex.wholeMatch(line),
let kind = Transaction.Kind(match.kind),
let date = try? Date(String(match.date), strategy: dateParser),
let amount = try? Decimal(String(match.amount), format: decimalParser)
@@ -226,7 +231,7 @@ The result builder allows for inline failable value construction, which particip
Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
-`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used a regex components. This allows us to drop the overly-permissive pre-parsing step:
+`CustomPrefixMatchRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:
```swift
func processEntry(_ line: String) -> Transaction? {
@@ -300,7 +305,7 @@ Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_U
```swift
/// A regex represents a string processing algorithm.
///
-/// let regex = try Regex(compiling: "a(.*)b")
+/// let regex = try Regex("a(.*)b")
/// let match = "cbaxb".firstMatch(of: regex)
/// print(match.0) // "axb"
/// print(match.1) // "x"
@@ -384,21 +389,25 @@ extension Regex.Match {
// Run-time compilation interfaces
extension Regex {
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
- public init(compiling pattern: String, as: Output.Type = Output.self) throws
+ public init(_ pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
- public init(compiling pattern: String) throws
+ public init(_ pattern: String) throws
}
```
+### Cancellation
+
+Regex is somewhat different from existing standard library operations in that regex processing can be a long-running task.
+For this reason regex algorithms may check if the parent task has been cancelled and end execution.
+
### On severability and related proposals
The proposal split presented is meant to aid focused discussion, while acknowledging that each is interconnected. The boundaries between them are not completely cut-and-dry and could be refined as they enter proposal phase.
Accepting this proposal in no way implies that all related proposals must be accepted. They are severable and each should stand on their own merit.
-
## Source compatibility
Everything in this proposal is additive. Regex delimiters may have their own source compatibility impact, which is discussed in that proposal.
@@ -422,7 +431,7 @@ Regular expressions have a deservedly mixed reputation, owing to their historica
* "Regular expressions are bad because you should use a real parser"
- In other systems, you're either in or you're out, leading to a gravitational pull to stay in when... you should get out
- - Our remedy is interoperability with real parsers via `CustomMatchingRegexComponent`
+ - Our remedy is interoperability with real parsers via `CustomPrefixMatchRegexComponent`
- Literals with refactoring actions provide an incremental off-ramp from regex syntax to result builders and real parsers
* "Regular expressions are bad because ugly unmaintainable syntax"
- We propose literals with source tools support, allowing for better syntax highlighting and analysis
@@ -488,6 +497,16 @@ The generic parameter `Output` is proposed to contain both the whole match (the
The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familarity and have the pitfall of introducing off-by-one errors.
+### Encoding `Regex`es into the type system
+
+During the initial review period the following comment was made:
+
+> I think the goal should be that, at least for regex literals (and hopefully for the DSL to some extent), one day we might not even need a bytecode or interpreter. I think the ideal case is if each literal was its own function or type that gets generated and optimised as if you wrote it in Swift.
+
+This is an approach that has been tried a few times in a few different languages (including by a few members of the Swift Standard Library and Core teams), and while it can produce attractive microbenchmarks, it has almost always proved to be a bad idea at the macro scale. In particular, even if we set aside witness tables and other associated Swift generics overhead, optimizing a fixed pipeline for each pattern you want to match causes significant code size expansion when there are multiple patterns in use, as compared to a more flexible bytecode interpreter. A bytecode interpreter makes better use of instruction caches and memory, and can also benefit from microarchitectural resources that are shared across different patterns. There is a tradeoff w.r.t. branch prediction resources, where separately compiled patterns may have more decisive branch history data, but a shared bytecode engine has much more data to use; this tradeoff tends to fall on the side of a bytecode engine, but it does not always do so.
+
+It should also be noted that nothing prevents AOT or JIT compiling of the bytecode if we believe it will be advantageous, but compiling or interpreting arbitrary Swift code at runtime is rather more unattractive, since both the type system and language are undecidable. Even absent this rationale, we would probably not encode regex programs directly into the type system simply because it is unnecessarily complex.
+
### Future work: static optimization and compilation
Swift's support for static compilation is still developing, and future work here is leveraging that to compile regex when profitable. Many regex describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).
@@ -497,7 +516,7 @@ Regex are compiled into an intermediary representation and fairly simple analysi
### Future work: parser combinators
-What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomMatchingRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system.
+What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomPrefixMatchRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system.
An issues with traditional parser combinator libraries are the compilation barriers between call-site and definition, resulting in excessive and overly-cautious backtracking traffic. These can be eliminated through better [compilation techniques](https://core.ac.uk/download/pdf/148008325.pdf). As mentioned above, Swift's support for custom static compilation is still under development.
@@ -546,7 +565,7 @@ Regexes are often used for tokenization and tokens can be represented with Swift
### Future work: baked-in localized processing
-- `CustomMatchingRegexComponent` gives an entry point for localized processors
+- `CustomPrefixMatchRegexComponent` gives an entry point for localized processors
- Future work includes (sub?)protocols to communicate localization intent
-->
diff --git a/Documentation/Evolution/StringProcessingAlgorithms.md b/Documentation/Evolution/StringProcessingAlgorithms.md
index b976c562e..edefbd19b 100644
--- a/Documentation/Evolution/StringProcessingAlgorithms.md
+++ b/Documentation/Evolution/StringProcessingAlgorithms.md
@@ -8,9 +8,9 @@ We propose:
1. New regex-powered algorithms over strings, bringing the standard library up to parity with scripting languages
2. Generic `Collection` equivalents of these algorithms in terms of subsequences
-3. `protocol CustomMatchingRegexComponent`, which allows 3rd party libraries to provide their industrial-strength parsers as intermixable components of regexes
+3. `protocol CustomPrefixMatchRegexComponent`, which allows 3rd party libraries to provide their industrial-strength parsers as intermixable components of regexes
-This proposal is part of a larger [regex-powered string processing initiative](https://forums.swift.org/t/declarative-string-processing-overview/52459). Throughout the document, we will reference the still-in-progress [`RegexProtocol`, `Regex`](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md), and result builder DSL, but these are in flux and not formally part of this proposal. Further discussion of regex specifics is out of scope of this proposal and better discussed in another thread (see [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107) for links to relevant threads).
+This proposal is part of a larger [regex-powered string processing initiative](https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md); the status of each proposal is tracked [here](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md). Further discussion of regex specifics is out of scope for this proposal and is better suited to the relevant reviews.
## Motivation
@@ -91,18 +91,18 @@ Note: Only a subset of Python's string processing API are included in this table
### Complex string processing
-Even with the API additions, more complex string processing quickly becomes unwieldy. Up-coming support for authoring regexes in Swift help alleviate this somewhat, but string processing in the modern world involves dealing with localization, standards-conforming validation, and other concerns for which a dedicated parser is required.
+Even with the API additions, more complex string processing quickly becomes unwieldy. String processing in the modern world involves dealing with localization, standards-conforming validation, and other concerns for which a dedicated parser is required.
Consider parsing the date field `"Date: Wed, 16 Feb 2022 23:53:19 GMT"` in an HTTP header as a `Date` type. The naive approach is to search for a substring that looks like a date string (`16 Feb 2022`), and attempt to post-process it as a `Date` with a date parser:
```swift
let regex = Regex {
- capture {
- oneOrMore(.digit)
+ Capture {
+ OneOrMore(.digit)
" "
- oneOrMore(.word)
+ OneOrMore(.word)
" "
- oneOrMore(.digit)
+ OneOrMore(.digit)
}
}
@@ -128,21 +128,21 @@ DEBIT 03/24/2020 IRX tax payment ($52,249.98)
Parsing a currency string such as `$3,020.85` with regex is also tricky, as it can contain localized and currency symbols in addition to accounting conventions. This is why Foundation provides industrial-strength parsers for localized strings.
-## Proposed solution
+## Proposed solution
### Complex string processing
-We propose a `CustomMatchingRegexComponent` protocol which allows types from outside the standard library participate in regex builders and `RegexComponent` algorithms. This allows types, such as `Date.ParseStrategy` and `FloatingPointFormatStyle.Currency`, to be used directly within a regex:
+We propose a `CustomPrefixMatchRegexComponent` protocol which allows types from outside the standard library to participate in regex builders and `RegexComponent` algorithms. This allows types, such as `Date.ParseStrategy` and `FloatingPointFormatStyle.Currency`, to be used directly within a regex:
```swift
let dateRegex = Regex {
- capture(dateParser)
+ Capture(dateParser)
}
let date: Date = header.firstMatch(of: dateRegex).map(\.result.1)
let currencyRegex = Regex {
- capture(.localizedCurrency(code: "USD").sign(strategy: .accounting))
+ Capture(.localizedCurrency(code: "USD").sign(strategy: .accounting))
}
let amount: [Decimal] = statement.matches(of: currencyRegex).map(\.result.1)
@@ -162,28 +162,30 @@ We also propose the following regex-powered algorithms as well as their generic
|`replace(:with:subrange:maxReplacements)`| Replaces all occurrences of the sequence matching the given `RegexComponent` or sequence with a given collection |
|`split(by:)`| Returns the longest possible subsequences of the collection around elements equal to the given separator |
|`firstMatch(of:)`| Returns the first match of the specified `RegexComponent` within the collection |
+|`wholeMatch(of:)`| Matches the specified `RegexComponent` in the collection as a whole |
+|`prefixMatch(of:)`| Matches the specified `RegexComponent` against the collection at the beginning |
|`matches(of:)`| Returns a collection containing all matches of the specified `RegexComponent` |
+## Detailed design
-## Detailed design
+### `CustomPrefixMatchRegexComponent`
-### `CustomMatchingRegexComponent`
-
-`CustomMatchingRegexComponent` inherits from `RegexComponent` and satisfies its sole requirement; Conformers can be used with all of the string algorithms generic over `RegexComponent`.
+`CustomPrefixMatchRegexComponent` inherits from `RegexComponent` and satisfies its sole requirement. Conformers can be used with all of the string algorithms generic over `RegexComponent`.
```swift
-/// A protocol for custom match functionality.
-public protocol CustomMatchingRegexComponent : RegexComponent {
- /// Match the input string within the specified bounds, beginning at the given index, and return
- /// the end position (upper bound) of the match and the matched instance.
+/// A protocol allowing custom types to function as regex components by
+/// providing the raw functionality backing `prefixMatch`.
+public protocol CustomPrefixMatchRegexComponent: RegexComponent {
+ /// Process the input string within the specified bounds, beginning at the given index, and return
+ /// the end position (upper bound) of the match and the produced output.
/// - Parameters:
/// - input: The string in which the match is performed.
/// - index: An index of `input` at which to begin matching.
/// - bounds: The bounds in `input` in which the match is performed.
/// - Returns: The upper bound where the match terminates and a matched instance, or `nil` if
/// there isn't a match.
- func match(
+ func consuming(
_ input: String,
startingAt index: String.Index,
in bounds: Range
@@ -197,8 +199,8 @@ public protocol CustomMatchingRegexComponent : RegexComponent {
We use Foundation `FloatingPointFormatStyle.Currency` as an example for protocol conformance. It would implement the `match` function with `Match` being a `Decimal`. It could also add a static function `.localizedCurrency(code:)` as a member of `RegexComponent`, so it can be referred as `.localizedCurrency(code:)` in the `Regex` result builder:
```swift
-extension FloatingPointFormatStyle.Currency : CustomMatchingRegexComponent {
- public func match(
+extension FloatingPointFormatStyle.Currency : CustomPrefixMatchRegexComponent {
+ public func consuming(
_ input: String,
startingAt index: String.Index,
in bounds: Range
@@ -389,7 +391,7 @@ extension BidirectionalCollection where SubSequence == Substring {
}
```
-#### First match
+#### Match
```swift
extension BidirectionalCollection where SubSequence == Substring {
@@ -398,6 +400,16 @@ extension BidirectionalCollection where SubSequence == Substring {
/// - Returns: The first match of `regex` in the collection, or `nil` if
/// there isn't a match.
public func firstMatch(of regex: R) -> RegexMatch?
+
+ /// Match a regex in its entirety.
+ /// - Parameter r: The regex to match against.
+ /// - Returns: The match if there is one, or `nil` if none.
+ public func wholeMatch(of r: R) -> Regex.Match?
+
+ /// Match part of the regex, starting at the beginning.
+ /// - Parameter r: The regex to match against.
+ /// - Returns: The match if there is one, or `nil` if none.
+ public func prefixMatch(of r: R) -> Regex.Match?
}
```
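+
+The three matching entry points differ in where a match may start and end. The sketch below contrasts them; it assumes a toolchain where `Regex` is available (Swift 5.7 or later), and the variable names are illustrative only:
+
+```swift
+let digits: Regex<AnyRegexOutput> = try! Regex(#"\d+"#)
+
+// `firstMatch` searches the whole string for the first occurrence.
+let anywhere = "abc123".firstMatch(of: digits)   // matches "123"
+
+// `prefixMatch` only succeeds if the match begins at the start of the string.
+let leading = "123abc".prefixMatch(of: digits)   // matches "123"
+let noPrefix = "abc123".prefixMatch(of: digits)  // nil
+
+// `wholeMatch` requires the regex to consume the entire string.
+let entire = "123".wholeMatch(of: digits)        // matches "123"
+let partial = "123abc".wholeMatch(of: digits)    // nil
+```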
@@ -473,7 +485,7 @@ extension RangeReplaceableCollection where SubSequence == Substring {
/// - Returns: A new collection in which all occurrences of subsequence
/// matching `regex` in `subrange` are replaced by `replacement`.
public func replacing(
- _ regex: R,
+ _ r: R,
with replacement: Replacement,
subrange: Range,
maxReplacements: Int = .max
@@ -489,7 +501,7 @@ extension RangeReplaceableCollection where SubSequence == Substring {
/// - Returns: A new collection in which all occurrences of subsequence
/// matching `regex` are replaced by `replacement`.
public func replacing(
- _ regex: R,
+ _ r: R,
with replacement: Replacement,
maxReplacements: Int = .max
) -> Self where Replacement.Element == Element
@@ -502,7 +514,7 @@ extension RangeReplaceableCollection where SubSequence == Substring {
/// - maxReplacements: A number specifying how many occurrences of the
/// sequence matching `regex` to replace. Default is `Int.max`.
public mutating func replace(
- _ regex: R,
+ _ r: R,
with replacement: Replacement,
maxReplacements: Int = .max
) where Replacement.Element == Element
@@ -511,48 +523,48 @@ extension RangeReplaceableCollection where SubSequence == Substring {
/// the given regex are replaced by another regex match.
/// - Parameters:
/// - regex: A regex describing the sequence to replace.
- /// - replacement: A closure that receives the full match information,
- /// including captures, and returns a replacement collection.
/// - subrange: The range in the collection in which to search for `regex`.
/// - maxReplacements: A number specifying how many occurrences of the
/// sequence matching `regex` to replace. Default is `Int.max`.
+ /// - replacement: A closure that receives the full match information,
+ /// including captures, and returns a replacement collection.
/// - Returns: A new collection in which all occurrences of subsequence
/// matching `regex` are replaced by `replacement`.
public func replacing(
_ regex: R,
- with replacement: (RegexMatch) throws -> Replacement,
subrange: Range,
- maxReplacements: Int = .max
+ maxReplacements: Int = .max,
+ with replacement: (RegexMatch) throws -> Replacement
) rethrows -> Self where Replacement.Element == Element
/// Returns a new collection in which all occurrences of a sequence matching
/// the given regex are replaced by another collection.
/// - Parameters:
/// - regex: A regex describing the sequence to replace.
- /// - replacement: A closure that receives the full match information,
- /// including captures, and returns a replacement collection.
/// - maxReplacements: A number specifying how many occurrences of the
/// sequence matching `regex` to replace. Default is `Int.max`.
+ /// - replacement: A closure that receives the full match information,
+ /// including captures, and returns a replacement collection.
/// - Returns: A new collection in which all occurrences of subsequence
/// matching `regex` are replaced by `replacement`.
public func replacing(
_ regex: R,
- with replacement: (RegexMatch) throws -> Replacement,
- maxReplacements: Int = .max
+ maxReplacements: Int = .max,
+ with replacement: (RegexMatch) throws -> Replacement
) rethrows -> Self where Replacement.Element == Element
/// Replaces all occurrences of the sequence matching the given regex with
/// a given collection.
/// - Parameters:
/// - regex: A regex describing the sequence to replace.
- /// - replacement: A closure that receives the full match information,
- /// including captures, and returns a replacement collection.
/// - maxReplacements: A number specifying how many occurrences of the
/// sequence matching `regex` to replace. Default is `Int.max`.
+ /// - replacement: A closure that receives the full match information,
+ /// including captures, and returns a replacement collection.
public mutating func replace(
_ regex: R,
- with replacement: (RegexMatch) throws -> Replacement,
- maxReplacements: Int = .max
+ maxReplacements: Int = .max,
+ with replacement: (RegexMatch) throws -> Replacement
) rethrows where Replacement.Element == Element
}
```
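+
+Moving the closure to the final parameter position allows trailing-closure syntax at the call site. A minimal sketch, assuming a toolchain where `Regex` and these algorithms are available (Swift 5.7 or later); the input string and variable names are illustrative:
+
+```swift
+let input = "a1b22c333"
+let digits: Regex<AnyRegexOutput> = try! Regex(#"\d+"#)
+
+// Replace only the first two runs of digits, one "#" per digit matched.
+let masked = input.replacing(digits, maxReplacements: 2) { match in
+    String(repeating: "#", count: input[match.range].count)
+}
+// masked == "a#b##c333"
+```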
@@ -609,4 +621,4 @@ Trimming a string from both sides shares a similar story. For example, `"ababa".
### Future API
-Some Python functions are not currently included in this proposal, such as trimming the suffix from a string/collection. This pitch aims to establish a pattern for using `RegexComponent` with string processing algorithms, so that further enhancement can to be introduced to the standard library easily in the future, and eventually close the gap between Swift and other popular scripting languages.
+Some common string processing functions are not currently included in this proposal, such as trimming the suffix from a string/collection, and finding overlapping ranges of matched substrings. This pitch aims to establish a pattern for using `RegexComponent` with string processing algorithms, so that further enhancements can be introduced to the standard library easily in the future, and eventually close the gap between Swift and other popular scripting languages.
diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md
new file mode 100644
index 000000000..828d8f53c
--- /dev/null
+++ b/Documentation/Evolution/UnicodeForStringProcessing.md
@@ -0,0 +1,872 @@
+# Unicode for String Processing
+
+* Proposal: [SE-NNNN](NNNN-filename.md)
+* Authors: [Nate Cook](https://github.com/natecook1000), [Alejandro Alonso](https://github.com/Azoy)
+* Review Manager: TBD
+* Implementation: [apple/swift-experimental-string-processing][repo]
+* Status: **Draft**
+
+
+## Introduction
+
+This proposal describes `Regex`'s rich Unicode support during regex matching, along with the character classes and options that define that behavior.
+
+## Motivation
+
+Swift's `String` type provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. Each character in a string can be composed of one or more Unicode scalar values, while still being treated as a single unit that compares equal to any other way of composing the same character:
+
+```swift
+let str = "Cafe\u{301}" // "Café"
+str == "Café" // true
+str.dropLast() // "Caf"
+str.last == "é" // true (precomposed e with acute accent)
+str.last == "e\u{301}" // true (e followed by composing acute accent)
+```
+
+This default view is fairly novel. Most languages that support Unicode strings generally operate at the Unicode scalar level, and don't provide the same affordance for operating on a string as a collection of grapheme clusters. In Python, for example, Unicode strings report their length as the number of scalar values, and don't use canonical equivalence in comparisons:
+
+```python
+cafe = u"Cafe\u0301"
+len(cafe) # 5
+cafe == u"Café" # False
+```
+
+Existing regex engines follow this same model of operating at the Unicode scalar level. To match canonically equivalent characters, or have equivalent behavior between equivalent strings, you must normalize your string and regex to the same canonical format.
+
+```python
+# Matches a four-element string
+re.match(u"^.{4}$", cafe) # None
+# Matches a string ending with 'é'
+re.match(u".+é$", cafe) # None
+
+cafeComp = unicodedata.normalize("NFC", cafe)
+re.match(u"^.{4}$", cafeComp)     # <re.Match object; span=(0, 4), match='Café'>
+re.match(u".+é$", cafeComp)       # <re.Match object; span=(0, 4), match='Café'>
+```
+
+With Swift's string model, this behavior would be surprising and undesirable: Swift's default regex semantics must match the semantics of a `String`.
+
+### Other engines
+
+Other regex engines match character classes (such as `\w` or `.`) at the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster.
+
+| Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining |
+|---|---|---|---|---|
+| C#, Rust, Go, Python | `"Cafe"` | `"´"` | n/a | n/a |
+| NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` |
+
+Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence.
+
+
+
+## Proposed solution
+
+In a regex's simplest form, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regex that simply contains the same characters.
+
+```swift
+let str = "Cafe\u{301}" // "Café"
+str.contains(/Café/) // true
+```
+
+From that point, small changes continue to comport with the element counting and comparison expectations set by `String`:
+
+```swift
+str.contains(/Caf./) // true
+str.contains(/.+é/) // true
+str.contains(/.+e\u{301}/) // true
+str.contains(/\w+é/) // true
+```
+
+
+For compatibility with other regex engines and the flexibility to match at both `Character` and Unicode scalar level, you can switch between matching levels for an entire regex or within select portions. This powerful capability provides the expected default behavior when working with strings, while allowing you to drop down for Unicode scalar-specific matching.
+
+By default, literal characters and Unicode scalar values (e.g. `\u{301}`) are coalesced into characters in the same way as a normal string, as shown above. Metacharacters, like `.` and `\w`, and custom character classes each match a single element at the current matching level.
+
+For example, these matches fail, because by the time the parser encounters the "`\u{301}`" Unicode scalar literal, the full `"é"` character has been matched:
+
+```swift
+str.contains(/Caf.\u{301}/) // false - `.` matches "é" character
+str.contains(/Caf\w\u{301}/) // false - `\w` matches "é" character
+str.contains(/.+\u{301}/) // false - `.+` matches each character
+```
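+
+One way to see why is to inspect how much of the string `.` consumes under the default semantics. This is a minimal sketch; it uses the extended `#/.../#` delimiters only so that it compiles without the bare-slash regex feature enabled, which is an assumption about the build setup rather than part of this pitch.
+
+```swift
+let str = "Cafe\u{301}" // "Café"
+
+// Under the default (grapheme cluster) semantics, `.` consumes the whole "é".
+if let match = str.firstMatch(of: #/Caf./#) {
+    print(match.output.count)                     // 4 characters
+    print(match.output.unicodeScalars.count)      // 5 scalars, including U+0301
+    print(match.range.upperBound == str.endIndex) // true: nothing is left for \u{301}
+}
+```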
+
+Alternatively, we can drop down to Unicode scalar semantics when we want to match specific Unicode sequences. For example, these regexes match an `"e"` followed by any combining mark that falls in the specified range or general category:
+
+```swift
+str.contains(/e[\u{300}-\u{314}]/.matchingSemantics(.unicodeScalar))
+// true - matches an "e" followed by a Unicode scalar in the range U+0300 - U+0314
+str.contains(/e\p{Nonspacing Mark}/.matchingSemantics(.unicodeScalar))
+// true - matches an "e" followed by a Unicode scalar with general category "Nonspacing Mark"
+```
+
+Matching in Unicode scalar mode is analogous to comparing against a string's `UnicodeScalarView` — individual Unicode scalars are matched without combining them into characters or testing for canonical equivalence.
+
+```swift
+str.contains(/Café/.matchingSemantics(.unicodeScalar))
+// false - "e\u{301}" doesn't match with /é/
+str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar))
+// true - "e\u{301}" matches with /e\u{301}/
+```
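+
+The same distinction is visible on plain strings, with no regex involved. As a small illustration (the constant names are made up for this sketch), comparing two canonically equivalent strings as `String`s succeeds, while comparing their `unicodeScalars` views element by element does not:
+
+```swift
+let decomposed = "Cafe\u{301}" // "e" followed by U+0301
+let precomposed = "Caf\u{E9}"  // single precomposed U+00E9
+
+// Character-level comparison honors canonical equivalence.
+print(decomposed == precomposed) // true
+
+// Scalar-level comparison does not.
+print(decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)) // false
+print(decomposed.unicodeScalars.count, precomposed.unicodeScalars.count)   // 5 4
+```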
+
+Swift's `Regex` follows the level 2 guidelines for Unicode support in regular expressions described in [Unicode Technical Standard #18][uts18], with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. In addition to selecting the matching semantics, `Regex` provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which correspond more closely to the behavior of other regex engines.
+
+## Detailed design
+
+First, we'll discuss the options that let you control a regex's behavior, and then explore the character classes that define your pattern.
+
+### Options
+
+Options can be enabled and disabled in two different ways: as part of [regex internal syntax][internals], or applied as methods when declaring a `Regex`. For example, both of these `Regex`es are declared with case insensitivity:
+
+```swift
+let regex1 = /(?i)banana/
+let regex2 = Regex {
+ "banana"
+}.ignoresCase()
+```
+
+Note that `ignoresCase()` is available on any type conforming to `RegexComponent`, which means that you can always use the more readable option-setting interface in conjunction with regex literals or run-time compiled `Regex`es:
+
+```swift
+let regex3 = /banana/.ignoresCase()
+```
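+
+The run-time compiled case might look like the following. This is a sketch that assumes the string-based `Regex` initializer; the constant names are illustrative, and `try!` is used only because the pattern is a known-valid literal.
+
+```swift
+// Compile the pattern at run time, then apply the option as a method.
+let caseSensitive = try! Regex("banana")
+let caseInsensitive = caseSensitive.ignoresCase()
+
+print("BANANA".contains(caseSensitive))   // false
+print("BANANA".contains(caseInsensitive)) // true
+```
+
+In production code, a pattern that comes from user input would normally be compiled with `try` inside a `do`/`catch` block instead.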
+
+Calling an option-setting method like `ignoresCase(_:)` acts like wrapping the callee in an option-setting group `(?:...)`. That is, while it sets the behavior for the callee, it doesn’t override options that are applied to more specific regions. In this example, the middle `"na"` in `"banana"` matches case-sensitively, despite the outer call to `ignoresCase()`:
+
+```swift
+let regex4 = Regex {
+ "ba"
+ "na".ignoresCase(false)
+ "na"
+}
+.ignoresCase()
+
+"banana".contains(regex4) // true
+"BAnaNA".contains(regex4) // true
+"BANANA".contains(regex4) // false
+
+// Equivalent to:
+let regex5 = /(?i)ba(?-i:na)na/
+```
+
+All option APIs are provided on `RegexComponent`, so they can be called on a `Regex` instance, or on any component that you would use inside a `RegexBuilder` block when the `RegexBuilder` module is imported.
+
+The options that `Regex` supports are shown in the table below. Options that affect _matching behavior_ are supported through both regex syntax and APIs, while options that have _structural_ or _syntactic_ effects are only supported through regex syntax.
+
+| **Matching Behavior** | | |
+|------------------------------|----------------|---------------------------|
+| Case insensitivity | `(?i)` | `ignoresCase()` |
+| Single-line mode | `(?s)` | `dotMatchesNewlines()` |
+| Multi-line mode | `(?m)` | `anchorsMatchNewlines()` |
+| ASCII-only character classes | `(?DSWP)` | `asciiOnlyDigits()`, etc |
+| Unicode word boundaries | `(?w)` | `wordBoundaryKind(_:)` |
+| Semantic level | `(?Xu)` | `matchingSemantics(_:)` |
+| Repetition behavior | `(?U)` | `repetitionBehavior(_:)` |
+| **Structural/Syntactic** | | |
+| Extended syntax | `(?x)`,`(?xx)` | n/a |
+| Named captures only | `(?n)` | n/a |
+| Shared capture names | `(?J)` | n/a |
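+
+As a brief illustration of two rows in the table, the sketch below uses the regex-syntax forms `(?s)` and `(?m)` in run-time compiled patterns. It deliberately avoids the method spellings, since those names may still change during review; the constant names are illustrative.
+
+```swift
+let text = "a\nb"
+
+// The patterns are known-valid literals, so `try!` is acceptable in this sketch.
+let dot = try! Regex("a.b")
+let dotSingleLine = try! Regex("(?s)a.b")
+let anchored = try! Regex("^b$")
+let anchoredMultiLine = try! Regex("(?m)^b$")
+
+print(text.contains(dot))               // false: `.` skips newlines by default
+print(text.contains(dotSingleLine))     // true: `(?s)` lets `.` match the newline
+print(text.contains(anchored))          // false: `^` only matches at the start of input
+print(text.contains(anchoredMultiLine)) // true: `(?m)` lets `^` match after a newline
+```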
+
+#### Case insensitivity
+
+Regexes perform case-sensitive comparisons by default. The `i` option or the `ignoresCase(_:)` method enables case-insensitive comparison.
+
+```swift
+let str = "Café"
+
+str.firstMatch(of: /CAFÉ/) // nil
+str.firstMatch(of: /(?i)CAFÉ/) // "Café"
+str.firstMatch(of: /(?i)cAfÉ/) // "Café"
+```
+
+Case-insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected.
+
+**Regex syntax:** `(?i)...` or `(?i:...)`
+
+**`RegexBuilder` API:**
+
+```swift
+extension RegexComponent {
+ /// Returns a regular expression that ignores casing when matching.
+ public func ignoresCase(_ ignoresCase: Bool = true) -> Regex<RegexOutput>