diff --git a/Documentation/Evolution/CharacterClasses.md b/Documentation/Evolution/CharacterClasses.md deleted file mode 100644 index c9ffcbc95..000000000 --- a/Documentation/Evolution/CharacterClasses.md +++ /dev/null @@ -1,503 +0,0 @@ -# Character Classes for String Processing - -- **Authors:** [Nate Cook](https://github.com/natecook1000), [Michael Ilseman](https://github.com/milseman) -- **Status:** Draft pitch - -## Introduction - -[Declarative String Processing Overview][overview] presents regex-powered matching broadly, without details concerning syntax and semantics, leaving clarification to subsequent pitches. [Regular Expression Literals][literals] presents more details on regex _syntax_ such as delimiters and PCRE-syntax innards, but explicitly excludes discussion of regex _semantics_. This pitch and discussion aims to address a targeted subset of regex semantics: definitions of character classes. We propose a comprehensive treatment of regex character class semantics in the context of existing and newly proposed API directly on `Character` and `Unicode.Scalar`. - -Character classes in regular expressions include metacharacters like `\d` to match a digit, `\s` to match whitespace, and `.` to match any character. Individual literal characters can also be thought of as character classes, as they at least match themselves, and, in case-insensitive matching, their case-toggled counterpart. For the purpose of this work, then, we consider a *character class* to be any part of a regular expression literal that can match an actual component of a string. - -## Motivation - -Operating over classes of characters is a vital component of string processing. Swift's `String` provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. - -```swift -let str = "Cafe\u{301}" // "Café" -str == "Café" // true -str.dropLast() // "Caf" -str.last == "é" // true (precomposed e with acute accent) -str.last == "e\u{301}" // true (e followed by composing acute accent) -``` - -Unicode leaves all interpretation of grapheme clusters up to implementations, which means that Swift needs to define any semantics for its own usage. Since other regex engines operate, at most, at the semantics level of Unicode scalar values, there is little to no prior art to consult. - -
Other engines - -Character classes in other languages match at either the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster. - -| Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining | -|---|---|---|---|---| -| C#, Rust, Go | `"Cafe"` | `"´"` | n/a | n/a | -| NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` | - -Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence. - -
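To make the distinction concrete, here is a minimal illustration (not part of the original pitch text) of why the unit of matching matters: a single Swift `Character` can span several Unicode scalars, and those scalars are what most other engines iterate over.

```swift
let eAcute: Character = "e\u{301}"   // "é" as a single grapheme cluster
eAcute.unicodeScalars.count          // 2: the elements a scalar-level engine matches against
String(eAcute).count                 // 1: the element Swift's Character view matches against
```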
- -[SE-0211 Unicode Scalar Properties][scalarprops] added basic building blocks for classification of scalars by surfacing Unicode data from the [UCD][ucd]. [SE-0221: Character Properties][charprops] defined grapheme-cluster semantics for Swift for a subset of these. But, many classifications used in string processing are combinations of scalar properties or ad-hoc listings, and as such are not present today in Swift. - -Regardless of any syntax or underlying formalism, classifying characters is a worthy and much needed addition to the Swift standard library. We believe our thorough treatment of every character class found across many popular regex engines gives Swift a solid semantic basis. - -## Proposed Solution - -This pitch is narrowly scoped to Swift definitions of character classes found in regexes. For each character class, we propose: - -- A name for use in API -- A `Character` API, by extending Unicode scalar definitions to grapheme clusters -- A `Unicode.Scalar` API with modern Unicode definitions -- If applicable, a `Unicode.Scalar` API for notable standards like POSIX - -We're proposing what we believe to be the Swiftiest definitions using [Unicode's guidance][uts18] for `Unicode.Scalar` and extending this to grapheme clusters using `Character`'s existing [rationale][charpropsrationale]. - -
Broad language/engine survey - -For these definitions, we cross-referenced Unicode's [UTS\#18][uts18] with a broad survey of existing languages and engines. We found that while these all support a subset of UTS\#18, each language or framework implements a slightly different subset. The following table shows some of the variations: - -| Language/Framework | Dot (`.`) matches | Supports `\X` | Canonical Equivalence | `\d` matches FULL WIDTH digit | -|------------------------------|----------------------------------------------------|---------------|---------------------------|-------------------------------| -| [ECMAScript][ecmascript] | UTF16 code unit (Unicode scalar in Unicode mode) | no | no | no | -| [Perl][perl] / [PCRE][pcre] | UTF16 code unit, (Unicode scalar in Unicode mode) | yes | no | no | -| [Python3][python] | Unicode scalar | no | no | yes | -| [Raku][raku] | Grapheme cluster | n/a | strings always normalized | yes | -| [Ruby][ruby] | Unicode scalar | yes | no | no | -| [Rust][rust] | Unicode scalar | no | no | no | -| [C#][csharp] | UTF16 code unit | no | no | yes | -| [Java][java] | Unicode scalar | yes | Only in CANON_EQ mode | no | -| [Go][go] | Unicode scalar | no | no | no | -| [`NSRegularExpression`][icu] | Unicode scalar | yes | no | yes | - -We are still in the process of evaluating [C++][cplusplus], [RE2][re2], and [Oniguruma][oniguruma]. - -
- -## Detailed Design - -### Literal characters - -A literal character (such as `a`, `é`, or `한`) in a regex literal matches that particular character or code sequence. When matching at the semantic level of `Unicode.Scalar`, it should match the literal sequence of scalars. When matching at the semantic level of `Character`, it should match `Character`-by-`Character`, honoring Unicode canonical equivalence. - -We are not proposing new API here as this is already handled by `String` and `String.UnicodeScalarView`'s conformance to `Collection`. - -### Unicode values: `\u`, `\U`, `\x` - -Metacharacters that begin with `\u`, `\U`, or `\x` match a character with the specified Unicode scalar values. We propose these be treated exactly the same as literals. - -### Match any: `.`, `\X` - -The dot metacharacter matches any single character or element. Depending on options and modes, it may exclude newlines. - -`\X` matches any grapheme cluster (`Character`), even when the regular expression is otherwise matching at semantic level of `Unicode.Scalar`. - -We are not proposing new API here as this is already handled by collection conformances. - -While we would like for the stdlib to have grapheme-breaking API over collections of `Unicode.Scalar`, that is a separate discussion and out-of-scope for this pitch. - -### Decimal digits: `\d`,`\D` - -We propose `\d` be named "decimalDigit" with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character represents - /// a decimal digit. - /// - /// Decimal digits are comprised of a single Unicode scalar that has a - /// `numericType` property equal to `.decimal`. This includes the digits - /// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode - /// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE` - /// (U+096F). - /// - /// Decimal digits are a subset of whole numbers, see `isWholeNumber`. - /// - /// To get the character's value, use the `decimalDigitValue` property. - public var isDecimalDigit: Bool { get } - - /// The numeric value this character represents, if it is a decimal digit. - /// - /// Decimal digits are comprised of a single Unicode scalar that has a - /// `numericType` property equal to `.decimal`. This includes the digits - /// from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode - /// block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE` - /// (U+096F). - /// - /// Decimal digits are a subset of whole numbers, see `wholeNumberValue`. - /// - /// let chars: [Character] = ["1", "९", "A"] - /// for ch in chars { - /// print(ch, "-->", ch.decimalDigitValue) - /// } - /// // Prints: - /// // 1 --> Optional(1) - /// // ९ --> Optional(9) - /// // A --> nil - public var decimalDigitValue: Int? { get } - -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// a decimal digit. - /// - /// Any Unicode scalar that has a `numericType` property equal to `.decimal` - /// is considered a decimal digit. This includes the digits from the ASCII - /// range, from the _Halfwidth and Fullwidth Forms_ Unicode block, as well - /// as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F). - public var isDecimalDigit: Bool { get } -} -``` - -`\D` matches the inverse of `\d`. - -*TBD*: [SE-0221: Character Properties][charprops] did not define equivalent API on `Unicode.Scalar`, as it was itself an extension of single `Unicode.Scalar.Properties`. 
Since we're defining additional classifications formed from algebraic formulations of properties, it may make sense to put API such as `decimalDigitValue` on `Unicode.Scalar` as well as back-porting other API from `Character` (e.g. `hexDigitValue`). We'd like to discuss this with the community. - -*TBD*: `Character.isHexDigit` is currently constrained to the subset of decimal digits that are followed by encodings of Latin letters `A-F` in various forms (all 6 of them... thanks Unicode). We could consider extending this to be a superset of `isDecimalDigit` by allowing and producing values for all decimal digits, one would just have to use the Latin letters to refer to values greater than `9`. We'd like to discuss this with the community. - -_
Rationale_ - -Unicode's recommended definition for `\d` is its [numeric type][numerictype] of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its [definition][derivednumeric] and is a proper subset of `Character.isWholeNumber`. - -We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make this Character property _restrictive_, similar to `isHexDigit` and `isWholeNumber` and provide a way to access this value. - -It's possible we might add future properties to differentiate Unicode's non-decimal digits, but that is outside the scope of this pitch. - -
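As a sketch of how the proposed API would interact with the existing `Character` properties (expected values follow from the Unicode data cited above; `isDecimalDigit` and `decimalDigitValue` are proposed here and do not exist yet):

```swift
let nine: Character = "९"             // DEVANAGARI DIGIT NINE (U+096F), Numeric_Type=Decimal
nine.wholeNumberValue                  // 9 (existing API)
nine.isDecimalDigit                    // expected: true (proposed API)

let superscriptTwo: Character = "²"    // SUPERSCRIPT TWO (U+00B2), Numeric_Type=Digit
superscriptTwo.isWholeNumber           // true (existing API)
superscriptTwo.isDecimalDigit          // expected: false (not part of a decimal-radix digit chain)
```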
- -### Word characters: `\w`, `\W` - -We propose `\w` be named "word character" with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character is considered - /// a "word" character. - /// - /// See `Unicode.Scalar.isWordCharacter`. - public var isWordCharacter: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// a "word" character. - /// - /// Any Unicode scalar that has one of the Unicode properties - /// `Alphabetic`, `Digit`, or `Join_Control`, or is in the - /// general category `Mark` or `Connector_Punctuation`. - public var isWordCharacter: Bool { get } -} -``` - -`\W` matches the inverse of `\w`. - -_
Rationale_ - -Word characters include more than letters, and we went with Unicode's recommended scalar semantics. We extend to grapheme clusters similarly to `Character.isLetter`, that is, subsequent (combining) scalars do not change the word-character-ness of the grapheme cluster. - -
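A small sketch of the expected behavior, applying the scalar definition above to a few characters (`isWordCharacter` is the proposed API, shown with expected results):

```swift
let examples: [Character] = ["A", "é", "٣", "_", "-"]
for ch in examples {
    print(ch, ch.isWordCharacter)
}
// Expected:
// "A" true  (Alphabetic)
// "é" true  (Alphabetic; trailing combining marks would not change this)
// "٣" true  (ARABIC-INDIC DIGIT THREE, a digit)
// "_" true  (Connector_Punctuation)
// "-" false (Dash_Punctuation is not included)
```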
- -### Whitespace and newlines: `\s`, `\S` (plus `\h`, `\H`, `\v`, `\V`, and `\R`) - -We propose `\s` be named "whitespace" with the following definitions: - -```swift -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// whitespace. - /// - /// All Unicode scalars with the derived `White_Space` property are - /// considered whitespace, including: - /// - /// - `CHARACTER TABULATION` (U+0009) - /// - `LINE FEED (LF)` (U+000A) - /// - `LINE TABULATION` (U+000B) - /// - `FORM FEED (FF)` (U+000C) - /// - `CARRIAGE RETURN (CR)` (U+000D) - /// - `NEWLINE (NEL)` (U+0085) - public var isWhitespace: Bool { get } -} -``` - -This definition matches the value of the existing `Unicode.Scalar.Properties.isWhitespace` property. Note that `Character.isWhitespace` already exists with the desired semantics, which is a grapheme cluster that begins with a whitespace Unicode scalar. - -We propose `\h` be named "horizontalWhitespace" with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character is considered - /// horizontal whitespace. - /// - /// All characters with an initial Unicode scalar in the general - /// category `Zs`/`Space_Separator`, or the control character - /// `CHARACTER TABULATION` (U+0009), are considered horizontal - /// whitespace. - public var isHorizontalWhitespace: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// horizontal whitespace. - /// - /// All Unicode scalars with the general category - /// `Zs`/`Space_Separator`, along with the control character - /// `CHARACTER TABULATION` (U+0009), are considered horizontal - /// whitespace. - public var isHorizontalWhitespace: Bool { get } -} -``` - -We propose `\v` be named "verticalWhitespace" with the following definitions: - - -```swift -extension Character { - /// A Boolean value indicating whether this scalar is considered - /// vertical whitespace. - /// - /// All characters with an initial Unicode scalar in the general - /// category `Zl`/`Line_Separator`, or the following control - /// characters, are considered vertical whitespace (see below) - public var isVerticalWhitespace: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar is considered - /// vertical whitespace. - /// - /// All Unicode scalars with the general category - /// `Zl`/`Line_Separator`, along with the following control - /// characters, are considered vertical whitespace: - /// - /// - `LINE FEED (LF)` (U+000A) - /// - `LINE TABULATION` (U+000B) - /// - `FORM FEED (FF)` (U+000C) - /// - `CARRIAGE RETURN (CR)` (U+000D) - /// - `NEWLINE (NEL)` (U+0085) - public var isVerticalWhitespace: Bool { get } -} -``` - -Note that `Character.isNewline` already exists with the definition [required][lineboundary] by UTS\#18. *TBD:* Should we backport to `Unicode.Scalar`? - -`\S`, `\H`, and `\V` match the inverse of `\s`, `\h`, and `\v`, respectively. - -We propose `\R` include "verticalWhitespace" above with detection (and consumption) of the CR-LF sequence when applied to `Unicode.Scalar`. It is equivalent to `Character.isVerticalWhitespace` when applied to `Character`s. - -We are similarly not proposing any new API for `\R` until the stdlib has grapheme-breaking API over `Unicode.Scalar`. - -_
Rationale_ - -Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept. - -We use Unicode's recommended scalar semantics for horizontal whitespace and extend that to grapheme semantics similarly to `Character.isWhitespace`. - -We use ICU's definition for vertical whitespace, similarly extended to grapheme clusters. - -
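A sketch of how the horizontal/vertical split falls out of these definitions (`isHorizontalWhitespace` and `isVerticalWhitespace` are proposed, not existing API; `isWhitespace` already exists):

```swift
let tab: Character = "\t"                  // CHARACTER TABULATION (U+0009)
tab.isHorizontalWhitespace                 // expected: true
tab.isVerticalWhitespace                   // expected: false

let lineFeed: Character = "\n"             // LINE FEED (U+000A)
lineFeed.isVerticalWhitespace              // expected: true

let noBreakSpace: Character = "\u{00A0}"   // NO-BREAK SPACE, general category Zs
noBreakSpace.isHorizontalWhitespace        // expected: true
noBreakSpace.isWhitespace                  // true (existing API)
```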
- -### Control characters: `\t`, `\r`, `\n`, `\f`, `\0`, `\e`, `\a`, `\b`, `\cX` - -We propose the following names and meanings for these escaped literals representing specific control characters: - -```swift -extension Character { - /// A horizontal tab character, `CHARACTER TABULATION` (U+0009). - public static var tab: Character { get } - - /// A carriage return character, `CARRIAGE RETURN (CR)` (U+000D). - public static var carriageReturn: Character { get } - - /// A line feed character, `LINE FEED (LF)` (U+000A). - public static var lineFeed: Character { get } - - /// A form feed character, `FORM FEED (FF)` (U+000C). - public static var formFeed: Character { get } - - /// A NULL character, `NUL` (U+0000). - public static var nul: Character { get } - - /// An escape control character, `ESC` (U+001B). - public static var escape: Character { get } - - /// A bell character, `BEL` (U+0007). - public static var bell: Character { get } - - /// A backspace character, `BS` (U+0008). - public static var backspace: Character { get } - - /// A combined carriage return and line feed as a single character denoting - // end-of-line. - public static var carriageReturnLineFeed: Character { get } - - /// Returns a control character with the given value, Control-`x`. - /// - /// This method returns a value only when you pass a letter in - /// the ASCII range as `x`: - /// - /// if let ch = Character.control("G") { - /// print("'ch' is a bell character", ch == Character.bell) - /// } else { - /// print("'ch' is not a control character") - /// } - /// // Prints "'ch' is a bell character: true" - /// - /// - Parameter x: An upper- or lowercase letter to derive - /// the control character from. - /// - Returns: Control-`x` if `x` is in the pattern `[a-zA-Z]`; - /// otherwise, `nil`. - public static func control(_ x: Unicode.Scalar) -> Character? -} - -extension Unicode.Scalar { - /// Same as above, producing Unicode.Scalar, except for CR-LF... -} -``` - -We also propose `isControl` properties with the following definitions: - -```swift -extension Character { - /// A Boolean value indicating whether this character represents - /// a control character. - /// - /// Control characters are a single Unicode scalar with the - /// general category `Cc`/`Control` or the CR-LF pair (`\r\n`). - public var isControl: Bool { get } -} - -extension Unicode.Scalar { - /// A Boolean value indicating whether this scalar represents - /// a control character. - /// - /// Control characters have the general category `Cc`/`Control`. - public var isControl: Bool { get } -} -``` - -*TBD*: Should we have a CR-LF static var on `Unicode.Scalar` that produces a value of type `Character`? - - -_
Rationale_ - -This approach simplifies the use of some common control characters, while making the rest available through a method call. - -
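For illustration, a short sketch of how the proposed members compose (expected values follow the definitions above; none of this API exists yet):

```swift
Character.control("G") == Character.bell     // expected: true (Control-G is BEL, U+0007)
Character.control("1")                       // expected: nil (only ASCII letters are accepted)
Character.tab.isControl                      // expected: true (U+0009 has general category Cc)
Character.carriageReturnLineFeed.isControl   // expected: true (the CR-LF pair is explicitly included)
```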
- - - -### Unicode named values and properties: `\N`, `\p`, `\P` - -`\N{NAME}` matches a Unicode scalar value with the specified name. `\p{PROPERTY}` and `\p{PROPERTY=VALUE}` match a Unicode scalar value with the given Unicode property (and value, if given). - -While most Unicode-defined properties can only match at the Unicode scalar level, some are defined to match an extended grapheme cluster. For example, `/\p{RGI_Emoji_Flag_Sequence}/` will match any flag emoji character, which are composed of two Unicode scalar values. - -`\P{...}` matches the inverse of `\p{...}`. - -Most of this is already present inside `Unicode.Scalar.Properties`, and we propose to round it out with anything missing, e.g. script and script extensions. (API is _TBD_, still working on it.) - -Even though we are not proposing any `Character`-based API, we'd like to discuss with the community whether or how to extend them to grapheme clusters. Some options: - -- Forbid in any grapheme-cluster semantic mode -- Match only single-scalar grapheme clusters with the given property -- Match any grapheme cluster that starts with the given property -- Something more-involved such as per-property reasoning - - -### POSIX character classes: `[:NAME:]` - -We propose that POSIX character classes be prefixed with "posix" in their name with APIs for testing membership of `Character`s and `Unicode.Scalar`s. `Unicode.Scalar.isASCII` and `Character.isASCII` already exist and can satisfy `[:ascii:]`, and can be used in combination with new members like `isDigit` to represent individual POSIX character classes. Alternatively, we could introduce an option-set-like `POSIXCharacterClass` and `func isPOSIX(_:POSIXCharacterClass)` since POSIX is a fully defined standard. This would cut down on the amount of API noise directly visible on `Character` and `Unicode.Scalar` significantly. We'd like some discussion the the community here, noting that this will become clearer as more of the string processing overview takes shape. - -POSIX's character classes represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which are covered elsewhere in this pitch and some of which already exist today. Some Character definitions are *TBD* and we'd like more discussion with the community. 
- - -| POSIX class | API name | `Character` | `Unicode.Scalar` | POSIX mode value | -|-------------|----------------------|-----------------------|-------------------------------|-------------------------------| -| `[:lower:]` | lowercase | (exists) | `\p{Lowercase}` | `[a-z]` | -| `[:upper:]` | uppercase | (exists) | `\p{Uppercase}` | `[A-Z]` | -| `[:alpha:]` | alphabetic | (exists: `.isLetter`) | `\p{Alphabetic}` | `[A-Za-z]` | -| `[:alnum:]` | alphaNumeric | TBD | `[\p{Alphabetic}\p{Decimal}]` | `[A-Za-z0-9]` | -| `[:word:]` | wordCharacter | (pitched) | (pitched) | `[[:alnum:]_]` | -| `[:digit:]` | decimalDigit | (pitched) | (pitched) | `[0-9]` | -| `[:xdigit:]`| hexDigit | (exists) | `\p{Hex_Digit}` | `[0-9A-Fa-f]` | -| `[:punct:]` | punctuation | (exists) | (port from `Character`) | `[-!"#%&'()*,./:;?@[\\\]_{}]` | -| `[:blank:]` | horizontalWhitespace | (pitched) | (pitched) | `[ \t]` | -| `[:space:]` | whitespace | (exists) | `\p{Whitespace}` | `[ \t\n\r\f\v]` | -| `[:cntrl:]` | control | (pitched) | (pitched) | `[\x00-\x1f\x7f]` | -| `[:graph:]` | TBD | TBD | TBD | `[^ [:cntrl:]]` | -| `[:print:]` | TBD | TBD | TBD | `[[:graph:] ]` | - - -### Custom classes: `[...]` - -We propose that custom classes function just like set union. We propose that ranged-based custom character classes function just like `ClosedRange`. Thus, we are not proposing any additional API. - -That being said, providing grapheme cluster semantics is simultaneously obvious and tricky. A direct extension treats `[a-f]` as equivalent to `("a"..."f").contains()`. Strings (and thus Characters) are ordered for the purposes of efficiently maintaining programming invariants while honoring Unicode canonical equivalence. This ordering is _consistent_ but [linguistically meaningless][meaningless] and subject to implementation details such as whether we choose to normalize under NFC or NFD. - -```swift -let c: ClosedRange = "a"..."f" -c.contains("e") // true -c.contains("g") // false -c.contains("e\u{301}") // false, NFC uses precomposed é -c.contains("e\u{305}") // true, there is no precomposed e̅ -``` - -We will likely want corresponding `RangeExpression`-based API in the future and keeping consistency with ranges is important. - -We would like to discuss this problem with the community here. Even though we are not addressing regex literals specifically in this thread, it makes sense to produce suggestions for compilation errors or warnings. - -Some options: - -- Do nothing, embrace emergent behavior -- Warn/error for _any_ character class ranges -- Warn/error for character class ranges outside of a quasi-meaningful subset (e.g. ACII, albeit still has issues above) -- Warn/error for multiple-scalar grapheme clusters (albeit still has issues above) - - - -## Future Directions - -### Future API - -Library-extensible pattern matching will necessitate more types, protocols, and API in the future, many of which may involve character classes. This pitch aims to define names and semantics for exactly these kinds of API now, so that they can slot in naturally. - -### More classes or custom classes - -Future API might express custom classes or need more built-in classes. This pitch aims to establish rationale and precedent for a large number of character classes in Swift, serving as a basis that can be extended. - -### More lenient conversion APIs - -The proposed semantics for matching "digits" are broader than what the existing `Int(_:radix:)?` initializer accepts. 
It may be useful to provide additional initializers that can understand the whole breadth of characters matched by `\d`, or other related conversions. - - - - -[literals]: https://forums.swift.org/t/pitch-regular-expression-literals/52820 -[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 -[charprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md -[charpropsrationale]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md#detailed-semantics-and-rationale -[canoneq]: https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence -[graphemes]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries -[meaningless]: https://forums.swift.org/t/declarative-string-processing-overview/52459/121 -[scalarprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md -[ucd]: https://www.unicode.org/reports/tr44/tr44-28.html -[numerictype]: https://www.unicode.org/reports/tr44/#Numeric_Type -[derivednumeric]: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt - - -[uts18]: https://unicode.org/reports/tr18/ -[proplist]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt -[pcre]: https://www.pcre.org/current/doc/html/pcre2pattern.html -[perl]: https://perldoc.perl.org/perlre -[raku]: https://docs.raku.org/language/regexes -[rust]: https://docs.rs/regex/1.5.4/regex/ -[python]: https://docs.python.org/3/library/re.html -[ruby]: https://ruby-doc.org/core-2.4.0/Regexp.html -[csharp]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference -[icu]: https://unicode-org.github.io/icu/userguide/strings/regexp.html -[posix]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html -[oniguruma]: https://www.cuminas.jp/sdk/regularExpression.html -[go]: https://pkg.go.dev/regexp/syntax@go1.17.2 -[cplusplus]: https://www.cplusplus.com/reference/regex/ECMAScript/ -[ecmascript]: https://262.ecma-international.org/12.0/#sec-pattern-semantics -[re2]: https://github.com/google/re2/wiki/Syntax -[java]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html diff --git a/Documentation/Evolution/ProposalOverview.md b/Documentation/Evolution/ProposalOverview.md new file mode 100644 index 000000000..7656526a6 --- /dev/null +++ b/Documentation/Evolution/ProposalOverview.md @@ -0,0 +1,55 @@ + +# Regex Proposals + +## Regex Type and Overview + +- [Proposal](https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md), [Thread](https://forums.swift.org/t/se-0350-regex-type-and-overview/56530) +- [Pitch thread](https://forums.swift.org/t/pitch-regex-type-and-overview/56029) + +Presents basic Regex type and gives an overview of how everything fits into the overall story + + +## Regex Builder DSL + +- [Proposal](https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md), [Thread](https://forums.swift.org/t/se-0351-regex-builder-dsl/56531) +- [Pitch thread](https://forums.swift.org/t/pitch-regex-builder-dsl/56007) + +Covers the result builder approach and basic API. 
+ + +## Run-time Regex Construction + +- [Pitch](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md), [Thread](https://forums.swift.org/t/pitch-2-regex-syntax-and-run-time-construction/56624) +- (old) Pitch thread: [Regex Syntax](https://forums.swift.org/t/pitch-regex-syntax/55711) + + Brief: Syntactic superset of PCRE2, Oniguruma, ICU, UTS\#18, etc. + +Covers the "interior" syntax, extended syntaxes, run-time construction of a regex from a string, and details of `AnyRegexOutput`. + +## Regex Literals + +- [Draft](https://github.com/apple/swift-experimental-string-processing/pull/187), [Thread](https://forums.swift.org/t/pitch-2-regex-literals/56736) +- (Old) original pitch: + + [Thread](https://forums.swift.org/t/pitch-regular-expression-literals/52820) + + [Update](https://forums.swift.org/t/pitch-regular-expression-literals/52820/90) + + +## String processing algorithms + +- [Pitch thread](https://forums.swift.org/t/pitch-regex-powered-string-processing-algorithms/55969) + +Proposes a slew of Regex-powered algorithms. + +Introduces `CustomConsumingRegexComponent`, which is a monadic-parser style interface for external parsers to be used as components of a regex. + +## Unicode for String Processing + +- [Draft](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md) +- (Old) [Character class definitions](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920) + +Covers three topics: + +- Proposes regex syntax and `RegexBuilder` API for options that affect matching behavior. +- Proposes regex syntax and `RegexBuilder` API for library-defined character classes, Unicode properties, and custom character classes. +- Defines how Unicode scalar-based classes are extended to grapheme clusters in the different semantic and other matching modes. + + diff --git a/Documentation/Evolution/RegexBuilderDSL.md b/Documentation/Evolution/RegexBuilderDSL.md index f0a477644..635112e93 100644 --- a/Documentation/Evolution/RegexBuilderDSL.md +++ b/Documentation/Evolution/RegexBuilderDSL.md @@ -1,7 +1,7 @@ # Regex builder DSL * Proposal: [SE-NNNN](NNNN-filename.md) -* Authors: [Richard Wei](https://github.com/rxwei) +* Authors: [Richard Wei](https://github.com/rxwei), [Michael Ilseman](https://github.com/milseman), [Nate Cook](https://github.com/natecook1000) * Review Manager: TBD * Implementation: [apple/swift-experimental-string-processing](https://github.com/apple/swift-experimental-string-processing/tree/main/Sources/_StringProcessing/RegexDSL) * Status: **Pitch** @@ -17,6 +17,7 @@ - [Quantification](#quantification) - [Capture and reference](#capture-and-reference) - [Subpattern](#subpattern) + - [Scoping](#scoping) - [Source compatibility](#source-compatibility) - [Effect on ABI stability](#effect-on-abi-stability) - [Effect on API resilience](#effect-on-api-resilience) @@ -400,95 +401,7 @@ extension RegexComponentBuilder { } ``` -To support `if` statements, `buildEither(first:)`, `buildEither(second:)` and `buildOptional(_:)` are defined with overloads to support up to 10 captures because each capture type needs to be transformed to an optional. The overload for non-capturing regexes, due to the lack of generic constraints, must be annotated with `@_disfavoredOverload` in order not shadow other overloads. 
We expect that a variadic-generic version of this method will eventually superseded all of these overloads. - -```swift -extension RegexComponentBuilder { - // The following builder methods implement what would be possible with - // variadic generics (using imaginary syntax) as a single method: - // - // public static func buildEither< - // Component, WholeMatch, Capture... - // >( - // first component: Component - // ) -> Regex<(Substring, Capture...)> - // where Component.Output == (WholeMatch, Capture...) - - public static func buildEither( - first component: Component - ) -> Regex { - component - } - - public static func buildEither( - first component: Component - ) -> Regex<(Substring, C0)> where R.Output == (W, C0) { - component - } - - public static func buildEither( - first component: Component - ) -> Regex<(Substring, C0, C1)> where R.Output == (W, C0, C1) { - component - } - - // The following builder methods implement what would be possible with - // variadic generics (using imaginary syntax) as a single method: - // - // public static func buildEither< - // Component, WholeMatch, Capture... - // >( - // second component: Component - // ) -> Regex<(Substring, Capture...)> - // where Component.Output == (WholeMatch, Capture...) - - public static func buildEither( - second component: Component - ) -> Regex { - component - } - - public static func buildEither( - second component: Component - ) -> Regex<(Substring, C0)> where R.Output == (W, C0) { - component - } - - public static func buildEither( - second component: Component - ) -> Regex<(Substring, C0, C1)> where R.Output == (W, C0, C1) { - component - } - - // ... `O(arity)` overloads of `buildEither(_:)` - - // The following builder methods implement what would be possible with - // variadic generics (using imaginary syntax) as a single method: - // - // public static func buildOptional< - // Component, WholeMatch, Capture... - // >( - // _ component: Component? - // ) where Component.Output == (WholeMatch, Capture...) - - @_disfavoredOverload - public static func buildOptional( - _ component: Component? - ) -> Regex - - public static func buildOptional( - _ component: Component? - ) -> Regex<(Substring, C0?)> - - public static func buildOptional( - _ component: Component? - ) -> Regex<(Substring, C0?, C1?)> - - // ... `O(arity)` overloads of `buildOptional(_:)` -} -``` - -To support `if #available(...)` statements, `buildLimitedAvailability(_:)` is defined with overloads to support up to 10 captures. Similar to `buildOptional`, the overload for non-capturing regexes must be annotated with `@_disfavoredOverload`. +To support `if #available(...)` statements, `buildLimitedAvailability(_:)` is defined with overloads to support up to 10 captures. The overload for non-capturing regexes, due to the lack of generic constraints, must be annotated with `@_disfavoredOverload` in order not shadow other overloads. We expect that a variadic-generic version of this method will eventually superseded all of these overloads. ```swift extension RegexComponentBuilder { @@ -518,6 +431,8 @@ extension RegexComponentBuilder { } ``` +`buildOptional` and `buildEither` are intentionally not supported due to ergonomic issues and fundamental semantic differences between regex conditionals and result builder conditionals. Please refer to the [alternatives considered](#support-buildoptional-and-buildeither) section for detailed rationale. + ### Alternation Alternations are used to match one of multiple patterns. 
An alternation wraps its underlying patterns' capture types in an `Optional` and concatenates them together, first to last. @@ -620,99 +535,6 @@ public enum AlternationBuilder { // ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)` } -extension AlternationBuilder { - // The following builder methods implement what would be possible with - // variadic generics (using imaginary syntax) as a single method: - // - // public static func buildEither< - // R, WholeMatch, Capture... - // >( - // first component: Component - // ) -> Regex<(Substring, Component?...)> - // where R.Output == (WholeMatch, Capture...) - - @_disfavoredOverload - public static func buildEither( - first component: Component - ) -> Regex - - public static func buildEither( - first component: Component - ) -> Regex<(Substring, C0?)> - - public static func buildEither( - first component: Component - ) -> Regex<(Substring, C0?, C1?)> - - // ... `O(arity)` overloads of `buildEither(_:)` - - public static func buildEither( - first component: Component - ) -> Regex<(Substring, C0?, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8, C9?)> where R.Output == (W, C0, C1, C2, C3, C4, C5, C6, C7, C8, C9) -} - -extension AlternationBuilder { - // The following builder methods implement what would be possible with - // variadic generics (using imaginary syntax) as a single method: - // - // public static func buildEither< - // R, WholeMatch, Capture... - // >( - // second component: Component - // ) -> Regex<(Substring, Capture?...)> - // where R.Output == (WholeMatch, Capture...) - - @_disfavoredOverload - public static func buildEither( - second component: Component - ) -> Regex - - public static func buildEither( - second component: Component - ) -> Regex<(Substring, C0?)> - - public static func buildEither( - second component: Component - ) -> Regex<(Substring, C0?, C1?)> - - // ... `O(arity)` overloads of `buildEither(_:)` - - public static func buildEither( - second component: Component - ) -> Regex<(Substring, C0?, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8, C9?)> where R.Output == (W, C0, C1, C2, C3, C4, C5, C6, C7, C8, C9) -} - -extension AlternationBuilder { - // The following builder methods implement what would be possible with - // variadic generics (using imaginary syntax) as a single method: - // - // public static func buildOptional< - // Component, WholeMatch, Capture... - // >( - // _ component: Component? - // ) -> Regex<(Substring, Capture?...)> - // where Component.Output == (WholeMatch, Capture...) - - @_disfavoredOverload - public static func buildOptional( - _ component: Component? - ) -> Regex - - public static func buildOptional( - _ component: Component? - ) -> Regex<(Substring, C0?)> - - public static func buildOptional( - _ component: Component? - ) -> Regex<(Substring, C0?, C1?)> - - // ... `O(arity)` overloads of `buildOptional(_:)` - - public static func buildOptional( - _ component: Component? - ) -> Regex<(Substring, C0?, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8, C9?)> where R.Output == (W, C0, C1, C2, C3, C4, C5, C6, C7, C8, C9) -} - extension AlternationBuilder { // The following builder methods implement what would be possible with // variadic generics (using imaginary syntax) as a single method: @@ -1290,6 +1112,53 @@ Regex { wholeSentence in } ``` +### Scoping + +In textual regexes, atomic groups (`(?>...)`) can be used to define a backtracking scope. That is, when the regex engine exits from the scope successfully, it throws away all backtracking positions from the scope. 
In regex builder, the `Local` type serves this purpose. + +```swift +public struct Local: RegexComponent { + public var regex: Regex + + // The following builder methods implement what would be possible with + // variadic generics (using imaginary syntax) as a single set of methods: + // + // public init( + // @RegexComponentBuilder _ component: () -> Component + // ) where Output == (Substring, Capture...), Component.Output == (WholeMatch, Capture...) + + @_disfavoredOverload + public init( + @RegexComponentBuilder _ component: () -> Component + ) where Output == Substring + + public init( + @RegexComponentBuilder _ component: () -> Component + ) where Output == (Substring, C0), Component.Output == (W, C0) + + public init( + @RegexComponentBuilder _ component: () -> Component + ) where Output == (Substring, C0, C1), Component.Output == (W, C0, C1) + + // ... `O(arity)` overloads +} +``` + +For example, the following regex matches string `abcc` but not `abc`. + +```swift +Regex { + "a" + Local { + ChoiceOf { + "bc" + "b" + } + } + "c" +} +``` + ## Source compatibility Regex builder will be shipped in a new module named `RegexBuilder`, and thus will not affect the source compatibility of the existing code. @@ -1306,7 +1175,7 @@ The proposed feature relies heavily upon overloads of `buildBlock` and `buildPar ### Operators for quantification and alternation -While `ChoiceOf` and quantifier functions provide a general way of creating alternations and quantifications, we recognize that some synctactic sugar can be useful for creating one-liners like in textual regexes, e.g. infix operator `|`, postfix operator `*`, etc. +While `ChoiceOf` and quantifier types provide a general way of creating alternations and quantifications, we recognize that some synctactic sugar can be useful for creating one-liners like in textual regexes, e.g. infix operator `|`, postfix operator `*`, etc. ```swift // The following functions implement what would be possible with variadic @@ -1441,6 +1310,83 @@ One could argue that type such as `OneOrMore` could be defined as a top- Another reason to use types instead of free functions is consistency with existing result-builder-based DSLs such as SwiftUI. +### Support `buildOptional` and `buildEither` + +To support `if` statements, an earlier iteration of this proposal defined `buildEither(first:)`, `buildEither(second:)` and `buildOptional(_:)` as the following: + +```swift +extension RegexComponentBuilder { + public static func buildEither< + Component, WholeMatch, Capture... + >( + first component: Component + ) -> Regex<(Substring, Capture...)> + where Component.Output == (WholeMatch, Capture...) + + public static func buildEither< + Component, WholeMatch, Capture... + >( + second component: Component + ) -> Regex<(Substring, Capture...)> + where Component.Output == (WholeMatch, Capture...) + + public static func buildOptional< + Component, WholeMatch, Capture... + >( + _ component: Component? + ) where Component.Output == (WholeMatch, Capture...) +} +``` + +However, multiple-branch control flow statements (e.g. `if`-`else` and `switch`) would need to be required to produce either the same regex type, which is limiting, or an "either-like" type, which can be difficult to work with when nested. Unlike `ChoiceOf`, producing a tuple of optionals is not an option, because the branch taken would be decided when the builder closure is executed, and it would cause capture numbering to be inconsistent with conventional regex. 
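As a purely hypothetical illustration (this is exactly the kind of code the proposal chooses *not* to support): with `buildEither`/`buildOptional`, the branch below would be chosen when the builder closure runs, so a run-time Swift value, not the text being matched, would decide both the pattern and its capture arity.

```swift
// Hypothetical; intentionally not supported by this proposal.
let preferHex = true
let number = Regex {
  if preferHex {
    Capture(OneOrMore(.hexDigit))   // this branch has one capture...
  } else {
    Capture(OneOrMore(.digit))      // ...this one has two, so capture numbering
    "."                             // would depend on closure evaluation, unlike
    Capture(OneOrMore(.digit))      // a textual regex alternation or conditional.
  }
}
```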
+
+Moreover, result builder conditionals do not work the same way as regex conditionals. In regex conditionals, the conditions are themselves regexes and are evaluated by the regex engine during matching, whereas result builder conditionals are evaluated as part of the builder closure. We hope that a future result builder feature will support "lifting" control flow conditions into the DSL domain, e.g. supporting `Regex` as a condition.
+
+### Flatten optionals
+
+With the proposed design, `ChoiceOf` with `AlternationBuilder` wraps every component's capture type with an `Optional`. This means that any `ChoiceOf` with optional-capturing components would lead to doubly-nested optional captures. This could make the result of matching harder to use.
+
+```swift
+ChoiceOf {
+  OneOrMore(Capture(.digit)) // Output == (Substring, Substring)
+  Optionally {
+    ZeroOrMore(Capture(.word)) // Output == (Substring, Substring?)
+    "a"
+  } // Output == (Substring, Substring??)
+} // Output == (Substring, Substring?, Substring???)
+```
+
+One way to improve this could be overloading quantifier initializers (e.g. `ZeroOrMore.init(_:)`) and `AlternationBuilder.buildPartialBlock` to flatten any optionals upon composition. However, this would be non-trivial. Quantifier initializers would need to be overloaded `O(2^arity)` times to account for all possible positions of `Optional` that may appear in the `Output` tuple. Even worse, `AlternationBuilder.buildPartialBlock` would need to be overloaded `O(arity!)` times to account for all possible combinations of two `Output` tuples with all possible positions of `Optional` that may appear in one of the `Output` tuples.
+
+### Structured rather than flat captures
+
+We propose inferring capture types in such a way as to align with the traditional numbering of backreferences. This is because much of the motivation behind providing regexes in Swift is their familiarity.
+
+If we decided to deprioritize this motivation, there are opportunities to infer safer, more ergonomic, and arguably more intuitive types for captures. For example, to be consistent with traditional regex backreferences, quantifications of multiple or nested captures have to produce parallel arrays rather than an array of tuples.
+
+```swift
+OneOrMore {
+  Capture {
+    OneOrMore(.hexDigit)
+  }
+  ".."
+  Capture {
+    OneOrMore(.hexDigit)
+  }
+}
+
+// Flat capture types:
+// => `Output == (Substring, Substring, Substring)`
+
+// Structured capture types:
+// => `Output == (Substring, (Substring, Substring))`
+```
+
+Similarly, an alternation of multiple or nested captures could produce a structured alternation type (or an anonymous sum type) rather than flat optionals.
+
+This is cool, but it adds extra complexity to regex builder and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. We think the consistency of the flat capture types trumps the added safety and ergonomics of the structured capture types.
+ + [Declarative String Processing]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/DeclarativeStringProcessing.md [Strongly Typed Regex Captures]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md [Regex Syntax]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntax.md diff --git a/Documentation/Evolution/RegexLiteralPitch.md b/Documentation/Evolution/RegexLiteralPitch.md deleted file mode 100644 index bf2a5dad3..000000000 --- a/Documentation/Evolution/RegexLiteralPitch.md +++ /dev/null @@ -1,292 +0,0 @@ -# Regular Expression Literals - -- Authors: Hamish Knight, Michael Ilseman - -## Introduction - -We propose to introduce a first-class regular expression literal into the language that can take advantage of library support to offer extensible, powerful, and familiar textual pattern matching. - -This is a component of a larger string processing picture. We would like to start a focused discussion surrounding our approach to the literal itself, while acknowledging that evaluating the utility of the literal will ultimately depend on the whole picture (e.g. supporting API). To aid this focused discussion, details such as the representation of captures in the type system, semantic details, extensions to lexing/parsing, additional API, etc., are out of scope of this pitch and thread. Feel free to continue discussion of anything related in the [overview thread][overview]. - -## Motivation - -Regular expressions are a ubiquitous, familiar, and concise syntax for matching and extracting text that satisfies a particular pattern. Syntactically, a regex literal in Swift should: - -- Support a syntax familiar to developers who have learned to use regular expressions in other tools and languages -- Allow reuse of many regular expressions not specifically designed for Swift (e.g. from Stack Overflow or popular programming books) -- Allow libraries to define custom types that can be constructed with regex literals, much like string literals -- Diagnose at compile time if a regex literal uses capabilities that aren't allowed by the type's regex dialect - -Further motivation, examples, and discussion can be found in the [overview thread][overview]. - -## Proposed Solution - -We propose the introduction of a regular expression literal that supports [the PCRE syntax][PCRE], in addition to new standard library protocols `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` that allow for the customization of how the regex literal is interpreted (similar to [string interpolation][stringinterpolation]). The compiler will parse the PCRE syntax within a regex literal, and synthesize calls to corresponding builder methods. Types conforming to `ExpressibleByRegexLiteral` will be able to provide a builder type that opts into supporting various regex constructs through the use of normal function declarations and `@available`. 
- -_Note: This pitch concerns language syntax and compiler changes alone, it isn't stating what features the stdlib should support in the initial version or in future versions._ - -## Detailed Design - -A regular expression literal will be introduced using `/` delimiters, within which the compiler will parse [PCRE regex syntax][PCRE]: - -```swift -// Matches " = ", extracting the identifier and hex number -let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ -``` - -The above regex literal will be inferred to be the default regex literal type `Regex`. Errors in the regex will be diagnosed by the compiler. - -_`Regex` here is a stand-in type, further details about the type such as if or how this will scale to strongly typed captures is still under investigation._ - -_How best to diagnose grapheme-semantic concerns is still under investigation and probably best discussed in their corresponding threads. For example, `Range` is not [countable][countable] and [ordering is not linguistically meaningful][ordering], so validating character class ranges may involve restricting to a semantically-meaningful range (e.g. ASCII). This is best discussed in the (upcoming) character class pitch/thread._ - -The compiler will then transform the literal into a set of builder calls that may be customized by adopting the `ExpressibleByRegexLiteral` protocol. Below is a straw-person transformation of this example: - -```swift -// let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/ -let regex = { - var builder = T.RegexLiteral() - - // __A4 = /([[:alpha:]]\w*)/ - let __A1 = builder.buildCharacterClass_POSIX_alpha() - let __A2 = builder.buildCharacterClass_w() - let __A3 = builder.buildConcatenate(__A1, __A2) - let __A4 = builder.buildCaptureGroup(__A3) - - // __B1 = / = / - let __B1 = builder.buildLiteral(" = ") - - // __C3 = /([0-9A-F]+)/ - let __C1 = builder.buildCustomCharacterClass(["0"..."9", "A"..."F"]) - let __C2 = builder.buildOneOrMore(__C1) - let __C3 = builder.buildCaptureGroup(__C2) - - let __D1 = builder.buildConcatenate(__A4, __B1, __C3) - return T(regexLiteral: builder.finalize(__D1)) -}() -``` - -In this formulation, the compiler fully parses the regex literal, calling mutating methods on a builder which constructs an AST. Here, the compiler recognizes syntax such as ranges and classifies metacharacters (`buildCharacterClass_w()`). Alternate formulations could involve less reasoning (`buildMetacharacter_w`), or more (`builderCharacterClass_word`). We'd like community feedback on this approach. - -Additionally, it may make sense for the stdlib to provide a `RegexLiteral` conformer that just constructs a string to pass off to a string-based library. Such a type might assume all features are supported unless communicated otherwise, and we'd like community feedback on mechanisms to communicate this (e.g. availability). - -### The `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` protocols - -New `ExpressibleByRegexLiteral` and `RegexLiteralProtocol` protocols will be introduced to the standard library, and will serve a similar purpose to the existing literal protocols `ExpressibleByStringInterpolation` and `StringInterpolationProtocol`. - -```swift -public protocol ExpressibleByRegexLiteral { - associatedtype RegexLiteral : RegexLiteralProtocol = DefaultRegexLiteral - init(regexLiteral: RegexLiteral) -} - -public protocol RegexLiteralProtocol { - init() - - // Informal builder requirements for building a regex literal - // will be specified here. 
-} -``` - -Types conforming to `ExpressibleByRegexLiteral` will be able to provide a custom type that conforms to `RegexLiteralProtocol`, which will be used to build the resulting regex value. A default conforming type will be provided by the standard library (`DefaultRegexLiteral` here). - -Libraries can extend regex handling logic for their domains. For example, a higher-level library could provide linguistically richer regular expressions by incorporating locale, collation, language dictionaries, and fuzzier matching. Similarly, libraries wrapping different regex engines (e.g. `NSRegularExpression`) can support custom regex literals. - -### Opting into certain regex features - -We intend for the compiler to completely parse [the PCRE syntax][PCRE]. However, types conforming to `RegexLiteralProtocol` might not be able to handle the full feature set. The compiler will look for corresponding function declarations inside `RegexLiteralProtocol` and will emit a compilation error if missing. Conforming types can use `@available` on these function declarations to communicate versioning and add more support in the future. - -This approach of lookup combined with availability allows the stdlib to support more features over time. - -### Impact of using `/` as the delimiter - -#### On comment syntax - -Single line comments use the syntax `//`, which would conflict with the spelling for an empty regex literal. As such, an empty regex literal would be forbidden. - -While not conflicting with the syntax proposed in this pitch, it's also worth noting that the `//` comment syntax (in particular documentation comments that use `///`) would likely preclude the ability to use `///` as a delimiter if we ever wanted to support multi-line regex literals. It's possible though that future multi-line support could be provided through raw regex literals. Alternatively, it could be inferred from the regex options provided. For example, a regex that uses the multi-line option `/(?m)/` could be allowed to span multiple lines. - -Multi-line comments use the `/*` delimiter. As such, a regex literal starting with `*` wouldn't be parsed. This however isn't a major issue as an unqualified `*` is already invalid regex syntax. An escaped `/\*/` regex literal wouldn't be impacted. - -#### On custom infix operators using the `/` character - -Choosing `/` as the delimiter means there will be a conflict for infix operators containing `/` in cases where whitespace isn't used, for example: - -```swift -x+/y/+z -``` - -Should the operators be parsed as `+/` and `/+` respectively, or should this be parsed as `x + /y/ + z`? - -In this case, things can be disambiguated by the user inserting additional whitespace. We therefore could continue to parse `x+/y/+z` as a binary operator chain, and require additional whitespace to interpret `/y/` as a regex literal. - -#### On custom prefix and postfix operators using the `/` character - -There will also be parsing ambiguity with any user-defined prefix and postfix operators containing the `/` character. For example, code such as the following poses an issue: - -```swift -let x = /0; let y = 1/ -``` - -Should this be considered to be two `let` bindings, with each initialization expression using prefix and postfix `/` operators, or is it a single regex literal? 
- -This also extends more generally to prefix and postfix operators containing the `/` character, e.g: - -```swift -let x = Int { 0 } -} - -let x = 0 -/ 1 / .foo() -``` - -Today, this is parsed as a single binary operator chain `0 / 1 / .foo()`, with `.foo()` becoming an argument to the `/` operator. This is because while Swift does have some parser behavior that is affected by newlines, generally newlines are treated as whitespace, and expressions therefore may span multiple lines. However the user may well be expecting the second line to be parsed as a regex literal. - -This is also potentially an issue for result builders, for example: - -```swift -SomeBuilder { - x - / y / - z -} -``` - -Today this is parsed as `SomeBuilder { x / y / z }`, however it's likely the user was expecting this to become a result builder with 3 elements, the second of which being a regex literal. - -There is currently no source compatibility impact as both cases will continue to parse as binary operations. The user may insert a `;` on the prior line to get the desired regex literal parsing. However this may not be sufficient we may need to change parsing rules (under a version check) to favor parsing regex literals in these cases. We'd like to discuss this further with the community. - -It's worth noting that this is similar to an ambiguity that already exists today with trailing closures, for example: - -```swift -SomeBuilder { - SomeType() - { print("hello") } - AnotherType() -} -``` - -`{ print("hello") }` will be parsed as a trailing closure to `SomeType()` rather than as a separate element to the result builder. - -It can also currently arise with leading dot syntax in a result builder, e.g: - -```swift -SomeBuilder { - SomeType() - .member -} -``` - -`.member` will be parsed as a member access on `SomeType()` rather than as a separate element that may have its base type inferred by the parameter of a `buildExpression` method on the result builder. - - -## Future Directions - -### Typed captures - -Typed captures would statically represent how many captures and of what kind are present in a regex literals. They could produce a `Substring` for a regular capture, `Substring?` for a zero-or-one capture, and `Array` (or a lazy collection) for a zero(or one)-or-more capture. These are worth exploring, especially in the context of the [start of variadic generics][variadics] support, but we'd like to keep this pitch and discussion focused to the details presented. - -### Other regex literals - -Multi-line extensions to regex literals is considered future work. Generally, we'd like to encourage refactoring into `Pattern` when the regex gets to that degree of complexity. - -User-specified [choice of quote delimiters][perlquotes] is considered future work. A related approach to this could be a "raw" regex literal analogous to [raw strings][rawstrings]. For example (total strawperson), an approach where `n` `#`s before the opening delimiter would requires `n` `#` at the end of the trailing delimiter as well as requiring `n-1` `#`s to access metacharacters. - -```txt -// All of the below are trying to match a path like "/tmp/foo/bar/File.app/file.txt" - -/\/tmp\/.*\/File\.app\/file\.txt/ -#//tmp/.*/File\.app/file\.txt/# -##//tmp/#.#*/File.app/file.txt/## -``` - -"Swiftier" literals, such as with non-semantic whitespace (e.g. [Raku's][rakuregex]), is future work. We'd want to strongly consider using a different backing technology for Swifty matching literals, such as PEGs. 
- -Fully-custom literal support, that is literals whose bodies are not parsed and there is no default type available, is orthogonal to this work. It would require support for compilation-time Swift libraries in addition to Swift APIs for the compiler and type system. - - -### Further extension to Swift language constructs - -Other language constructs, such as raw-valued enums, might benefit from further regex enhancements. - -```swift -enum CalculatorToken: Regex { - case wholeNumber = /\d+/ - case identifier = /\w+/ - case symbol = /\p{Math}/ - ... -} -``` - -As mentioned in the overview, general purpose extensions to Swift (syntactic) pattern matching could benefit regex - -```swift -func parseField(_ field: String) -> ParsedField { - switch field { - case let text <- /#\s?(.*)/: - return .comment(text) - case let (l, u) <- /([0-9A-F]+)(?:\.\.([0-9A-F]+))?/: - return .scalars(Unicode.Scalar(hex: l) ... Unicode.Scalar(hex: u ?? l)) - case let prop <- GraphemeBreakProperty.init: - return .property(prop) - } -} -``` - -### Other semantic details - -Further details about the semantics of regex literals, such as what definition we give to character classes, the initial supported feature set, and how to switch between grapheme-semantic and scalar-semantic usage, is still under investigation and outside the scope of this discussion. - -## Alternatives considered - -### Using a different delimiter to `/` - -As explored above, using `/` as the delimiter has the potential to conflict with existing operators using that character, and may necessitate: - -- Changing of parsing rules around chained `/` over multiple lines -- Deprecating prefix and postfix operators containing the `/` character -- Requiring additional whitespace to disambiguate from infix operators containing `/` -- Requiring a new language version mode to parse the literal with `/` delimiters - -However one of the main goals of this pitch is to introduce a familiar syntax for regular expression literals, which has been the motivation behind choices such as using the PCRE regex syntax. Given the fact that `/` is an existing term of art for regular expressions, we feel that if the aforementioned parsing issues can be solved in a satisfactory manner, we should prefer it as the delimiter. - - -### Reusing string literal syntax - -Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an `ExpressibleByRegexLiteral` conforming type. 
- -```swift -let regex: Regex = "([[:alpha:]]\w*) = ([0-9A-F]+)" -``` - -However we decided against this because: - -- We would not be able to easily apply custom syntax highlighting for the regex syntax -- It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired -- In an overloaded context it may be ambiguous whether a string literal is meant to be interpreted as a literal string or regex -- Regex escape sequences aren't currently compatible with string literal escape sequence rules, e.g `\w` is currently illegal in a string literal -- It wouldn't be compatible with other string literal features such as interpolations - -[PCRE]: http://pcre.org/current/doc/html/pcre2syntax.html -[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 -[variadics]: https://forums.swift.org/t/pitching-the-start-of-variadic-generics/51467 -[stringinterpolation]: https://github.com/apple/swift-evolution/blob/master/proposals/0228-fix-expressiblebystringinterpolation.md -[countable]: https://en.wikipedia.org/wiki/Countable_set -[ordering]: https://forums.swift.org/t/min-function-doesnt-work-on-values-greater-than-9-999-any-idea-why/52004/16 -[perlquotes]: https://perldoc.perl.org/perlop#Quote-and-Quote-like-Operators -[rawstrings]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md -[rakuregex]: https://docs.raku.org/language/regexes diff --git a/Documentation/Evolution/RegexLiterals.md b/Documentation/Evolution/RegexLiterals.md new file mode 100644 index 000000000..3643590d4 --- /dev/null +++ b/Documentation/Evolution/RegexLiterals.md @@ -0,0 +1,389 @@ +# Regex Literals + +- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman), [David Ewing](https://github.com/DaveEwing) + +## Introduction + +We propose the introduction of regex literals to Swift source code, providing compile-time checks and typed-capture inference. Regex literals help complete the story told in *[Regex Type and Overview][regex-type]*. + +## Motivation + +In *[Regex Type and Overview][regex-type]* we introduced the `Regex` type, which is able to dynamically compile a regex pattern: + +```swift +let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"# +let regex = try! Regex(pattern) +// regex: Regex +``` + +The ability to compile regex patterns at run time is useful for cases where it is e.g provided as user input, however it is suboptimal when the pattern is statically known for a number of reasons: + +- Regex syntax errors aren't detected until run time, and explicit error handling (e.g `try!`) is required to deal with these errors. +- No special source tooling support, such as syntactic highlighting, code completion, and refactoring support, is available. +- Capture types aren't known until run time, and as such a dynamic `AnyRegexOutput` capture type must be used. +- The syntax is overly verbose, especially for e.g an argument to a matching function. + +## Proposed solution + +A regex literal may be written using `/.../` delimiters: + +```swift +// Matches " = ", extracting the identifier and hex number +let regex = /(?[[:alpha:]]\w*) = (?[0-9A-F]+)/ +// regex: Regex<(Substring, identifier: Substring, hex: Substring)> +``` + +Forward slashes are a regex term of art. They are used as the delimiters for regex literals in, e.g., Perl, JavaScript and Ruby. 
Perl and Ruby additionally allow for [user-selected delimiters](https://perldoc.perl.org/perlop#Quote-and-Quote-like-Operators) to avoid having to escape any slashes inside a regex. For that purpose, we propose the extended literal `#/.../#`. + +An extended literal, `#/.../#`, avoids the need to escape forward slashes within the regex. It allows an arbitrary number of balanced `#` characters around the literal and escape. When the opening delimiter is followed by a new line, it supports a multi-line literal where whitespace is non-semantic and line-ending comments are ignored. + +The compiler will parse the contents of a regex literal using regex syntax outlined in *[Regex Construction][internal-syntax]*, diagnosing any errors at compile time. The capture types and labels are automatically inferred based on the capture groups present in the regex. Regex literals allows editors and source tools to support features such as syntax coloring inside the literal, highlighting sub-structure of the regex, and conversion of the literal to an equivalent result builder DSL (see *[Regex builder DSL][regex-dsl]*). + +A regex literal also allows for seamless composition with the Regex DSL, enabling lightweight intermixing of a regex syntax with other elements of the builder: + +```swift +// A regex for extracting a currency (dollars or pounds) and amount from input +// with precisely the form /[$£]\d+\.\d{2}/ +let regex = Regex { + Capture { /[$£]/ } + TryCapture { + /\d+/ + "." + /\d{2}/ + } transform: { + Amount(twoDecimalPlaces: $0) + } +} +``` + +This flexibility allows for terse matching syntax to be used when it's suitable, and more explicit syntax where clarity and strong types are required. + +Due to the existing use of `/` in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is sufficient to disqualify the syntax. Some of these ambiguities require a couple of source breaking language changes, and as such the `/.../` syntax requires upgrading to a new language mode in order to use. + +## Detailed design + +### Named typed captures + +Regex literals have their capture types statically determined by the capture groups present. This follows the same inference behavior as [the DSL][regex-dsl], and is explored in more detail in *[Strongly Typed Captures][strongly-typed-captures]*. One aspect of this that is currently unique to the literal is the ability to infer labeled tuple elements for named capture groups. For example: + +```swift +func matchHexAssignment(_ input: String) -> (String, Int)? { + let regex = /(?[[:alpha:]]\w*) = (?[0-9A-F]+)/ + // regex: Regex<(Substring, identifier: Substring, hex: Substring)> + + guard let match = regex.matchWhole(input), + let hex = Int(match.hex, radix: 16) + else { return nil } + + return (String(match.identifier), hex) +} +``` + +This allows the captures to be referenced as `match.identifier` and `match.hex`, in addition to numerically (like unnamed capture groups) as `match.1` and `match.2`. This label inference behavior is not available in the DSL, however users are able to [bind captures to named variables instead][dsl-captures]. + +### Extended delimiters `#/.../#`, `##/.../##` + +Backslashes may be used to write forward slashes within the regex literal, e.g `/foo\/bar/`. However, this can be quite syntactically noisy and confusing. To avoid this, a regex literal may be surrounded by an arbitrary number of balanced number signs. 
This changes the delimiter of the literal, and therefore allows the use of forward slashes without escaping. For example: + +```swift +let regex = #/usr/lib/modules/([^/]+)/vmlinuz/# +// regex: Regex<(Substring, Substring)> +``` + +The number of `#` characters may be further increased to allow the use of e.g `/#` within the literal. This is similar in style to the raw string literal syntax introduced by [SE-0200], however it has a couple of key differences. Backslashes do not become literal characters. Additionally, a multi-line mode, where whitespace and line-ending comments are ignored, is entered when the opening delimiter is followed by a newline. + +```swift +let regex = #/ + usr/lib/modules/ # Prefix + (? [^/]+) + /vmlinuz # The kernel +#/ +// regex: Regex<(Substring, subpath: Substring)> +``` + +#### Escaping of backslashes + +This syntax differs from raw string literals `#"..."#` in that it does not treat backslashes as literal within the regex. A string literal `#"\n"#` represents the literal characters `\n`. However a regex literal `#/\n/#` remains a newline escape sequence. + +One of the primary motivations behind this escaping behavior in raw string literals is that it allows the contents to be easily transportable to/from e.g external files where escaping is unnecessary. For string literals, this suggests that backslashes be treated as literal by default. For regex literals however, it instead suggests that backslashes should retain their semantic meaning. This enables interoperability with regexes taken from outside your code without having to adjust escape sequences to match the delimiters used. + +With string literals, escaping can be tricky without the use of raw syntax, as backslashes may have semantic meaning to the consumer, rather than the compiler. For example: + +```swift +// Matches '\' * '=' * + +let regex = try NSRegularExpression(pattern: "\\\\w\\s*=\\s*\\d+", options: []) +``` + +In this case, the intent is not for the compiler to recognize any of these sequences as string literal escapes, it is instead for `NSRegularExpression` to interpret them as regex escape sequences. However this is not an issue for regex literals, as the regex parser is the only possible consumer of such escape sequences. Such a regex would instead be spelled as: + +```swift +let regex = /\\\w\s*=\s*\d+/ +// regex: Regex +``` + +Backslashes still require escaping to be treated as literal, however we don't expect this to be as common of an occurrence as needing to write a regex escape sequence such as `\s`, `\w`, or `\p{...}`, within a regex literal with extended delimiters `#/.../#`. + +#### Multi-line mode + +Extended regex delimiters additionally support a multi-line mode when the opening delimiter is followed by a new line. For example: + +```swift +let regex = #/ + # Match a line of the format e.g "DEBIT 03/03/2022 Totally Legit Shell Corp $2,000,000.00" + (? \w+) \s\s+ + (? \S+) \s\s+ + (? (?: (?!\s\s) . )+) \s\s+ # Note that account names may contain spaces. + (? .*) + /# +``` + +In this mode, [extended regex syntax][extended-regex-syntax] `(?x)` is enabled by default. This means that whitespace becomes non-semantic, and end-of-line comments are supported with `# comment` syntax. + +This mode is supported with any (non-zero) number of `#` characters in the delimiter. Similar to multi-line strings introduced by [SE-0168], the closing delimiter must appear on a new line. To avoid parsing confusion, such a literal will not be parsed if a closing delimiter is not present. 
This avoids inadvertently treating the rest of the file as regex if you only type the opening. + +### Ambiguities with comment syntax + +Line comment syntax `//` and block comment syntax `/*` will continue to be parsed as comments. An empty regex literal is not a particularly useful thing to express, but can be written as `#//#` if desired. `*` would be an invalid starting character of a regex, and therefore does not pose an issue. + +A parsing conflict does however arise when a block comment surrounds a regex literal ending with `*`, for example: + + ```swift + /* + let regex = /[0-9]*/ + */ + ``` + +In this case, the block comment prematurely ends on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, though it is more likely to occur in a regex given the prevalence of the `*` quantifier. This issue can be avoided in many cases by using line comment syntax `//` instead, which it should be noted is the syntax that Xcode uses when commenting out multiple lines. + + +### Ambiguity with infix operators + +There is a minor ambiguity when infix operators are used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required for regex literal interpretation, e.g `x + /y/`. Alternatively, extended literals may be used, e.g `x+#/y/#`. + +### Regex syntax limitations + +In order to help avoid further parsing ambiguities, a `/.../` regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax. This restriction may be avoided by using the extended `#/.../#` literal. + +#### Rationale + +This is due to 2 main parsing ambiguities. The first of which arises when a `/.../` regex literal starts a new line. This is particularly problematic for result builders, where we expect it to be frequently used, in particular within a `Regex` builder: + +```swift +let digit = Regex { + TryCapture(OneOrMore(.digit)) { Int($0) } +} +// Matches against + (' + ' | ' - ') + +let regex = Regex { + digit + / [+-] / + digit +} +``` + +Instead of being parsed as 3 result builder elements, the second of which being a regex literal, this is instead parsed as a single operator chain with the operands `digit`, `[+-]`, and `digit`. This will therefore be diagnosed as semantically invalid. + +To avoid this issue, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side. + +If a space or tab is needed as the first character, it must be either escaped, e.g: + +```swift +let regex = Regex { + digit + /\ [+-] / + digit +} +``` + +or extended literal must be used, e.g: + +```swift +let regex = Regex { + digit + #/ [+-] /# + digit +} +``` + +The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function or subscript, for example: + +```swift +let arr: [Double] = [2, 3, 4] +let x = arr.reduce(1, /) / 5 +``` + +The `/` in the call to `reduce` is in a valid expression context, and as such could be parsed as a regex literal. This is also applicable to operators in tuples and parentheses. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. This should have minimal impact, as this would not be valid regex syntax anyway. 
+ +It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section. + +### Language changes required + +In addition to ambiguities listed above, there are also some parsing ambiguities that require the following language changes in a new language mode: + +- Deprecation of prefix operators containing the `/` character. +- Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. + +#### Prefix operators containing `/` + +We need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as: + +```swift +let x = /0; let y = 1/ +let z = /^x^/ +``` + +Prefix operators containing `/` more generally also need banning, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g: + +```swift +let x = !/y / .foo() +``` + +Today, this is interpreted as the prefix operator `!/` on `y`. With the banning of prefix operators containing `/`, it becomes prefix `!` on a regex literal, with a member access `.foo`. + +Postfix `/` operators do not require banning, as they'd only be treated as regex literal delimiters if we are already trying to lex as a regex literal. + +#### `/,` and `/]` as regex literal openings + +As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,` or `]`. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex. + +For example: + +```swift +// Ambiguity with comma: +func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {} +foo(/, /) + +// Also affects cases where the closing '/' is outside the argument list. +func bar(_ fn: (Int, Int) -> Int, _ x: Int) -> Int { 0 } +bar(/, 2) + bar(/, 3) + +// Ambiguity with right square bracket: +struct S { + subscript(_ fn: (Int, Int) -> Int) -> Int { 0 } +} +func baz(_ x: S) -> Int { + x[/] + x[/] +} +``` + +`foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these will become regex literals arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter will produce a regex error). + +To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g: + +```swift +foo((/), /) +bar((/), 2) + bar(/, 3) + +func baz(_ x: S) -> Int { + x[(/)] + x[/] +} +``` + +This takes advantage of the fact that a regex literal will not be parsed if the first character is `)`. + + + +## Source Compatibility + +As explored above, two source breaking changes are needed for `/.../` syntax: + +- Deprecation of prefix operators containing the `/` character. +- Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than two unapplied operator arguments. + +As such, both these changes and the `/.../` syntax will be introduced in Swift 6 mode. 
However, projects will be able to adopt the syntax earlier by passing the compiler flag `-enable-bare-regex-syntax`. Note this does not affect the extended delimiter syntax `#/.../#`, which will be usable immediately. + +## Future Directions + +### Modern literal syntax + +We could support a more modern Swift-like syntax in regex literals. For example, comments could be done with `//` and `/* ... */`, and quoted sequences could be done with `"..."`. This would however be incompatible with the syntactic superset of regex syntax we intend to parse, and as such may need to be introduced using a new literal kind, with no obvious choice of delimiter. + +However, such a syntax would lose out on the familiarity benefits of standard regex, and as such may lead to an "uncanny valley" effect. It's also possible that the ability to use regex literals in the DSL lessens the benefit that this syntax would bring. + +## Alternatives Considered + +Given the fact that `/.../` is an existing term of art for regular expressions, we feel it should be the preferred delimiter syntax. It should be noted that the syntax has become less popular in some communities such as Perl, however we still feel that it is a compelling choice, especially with extended delimiters `#/.../#`. Additionally, while there has some syntactic ambiguities, we do not feel that they are sufficient to disqualify the syntax. To evaluate this trade-off, below is a list of alternative delimiters that would not have the same ambiguities, and would not therefore require source breaking changes. + +### Prefixed quote `re'...'` + +We could choose to use `re'...'` delimiters, for example: + +```swift +// Matches " = ", extracting the identifier and hex number +let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)' +``` + +The use of two letter prefix could potentially be used as a namespace for future literal types. It would also have obvious extensions to extended and multi-line literals using `re#'...'#` and `re'''...'''` respectively. However, it is unusual for a Swift literal to be prefixed in this way. We also feel that its similarity to a string literal might have users confuse it with a raw string literal. + +Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?)`, `\k`, and `(?C"arg")`. Those could be required instead. An extended regex literal syntax e.g `re#'...'#` would also avoid this issue. + +### Prefixed double quote `re"...."` + +This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or extended literal syntax. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference. + +### Single letter prefixed quote `r'...'` + +This would be a slightly shorter version of `re'...'`. 
While it's more concise, it could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. + +### Single quotes `'...'` + +This would be an even more concise version of `re'...'` that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regex as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules). + +We could help distinguish it from a string literal by requiring e.g `'/.../'`, though it may not be clear that the `/` characters are part of the delimiters rather than part of the literal. Additionally, this would potentially rule out the use of `'...'` as a future literal kind. + +### Magic literal `#regex(...)` + +We could opt for for a more explicitly spelled out literal syntax such as `#regex(...)`. This is a more heavyweight option, similar to `#selector(...)`. As such, it may be considered syntactically noisy as e.g a function argument `str.match(#regex([abc]+))` vs `str.match(/[abc]+/)`. + +Such a syntax would require the containing regex to correctly balance parentheses for groups, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex. + +We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However this is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of an argument. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters. + +It should also be noted that `#regex(...)` would introduce a syntactic inconsistency where the argument of a `#literal(...)` is no longer necessarily valid Swift syntax, despite being written in the form of an argument. + +### Shortened magic literal `#(...)` + +We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. However it would still retain the same issues, such as still looking potentially visually noisy as an argument, and having suboptimal behavior for parenthesis balancing. It is also not clear why regex literals would deserve such privileged syntax. + +### Using a different delimiter for multi-line + +Instead of re-using the extended delimiter syntax `#/.../#` for multi-line regex literals, we could choose a different delimiter for it. Unfortunately, the obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. This signifies a (documentation) comment, and as such would not be viable. + +### Reusing string literal syntax + +Instead of supporting a first-class literal kind for regex, we could instead allow users to write a regex in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to the `Regex` type. 
+ +```swift +let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"# +``` + +However we decided against this because: + +- We would not be able to easily apply custom syntax highlighting and other editor features for the regex syntax. +- It would require a `Regex` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired. +- In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex. +- Regex-specific escape sequences such as `\w` would likely require the use of raw string syntax `#"..."#`, as they are otherwise invalid in a string literal. +- It wouldn't be compatible with other string literal features such as interpolations. + +### No custom literal + +Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex("[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean: + +- No source tooling support (e.g syntax highlighting, refactoring actions) would be available. +- Parse errors would be diagnosed at run time rather than at compile time. +- We would lose the type safety of typed captures. +- More verbose syntax is required. + +We therefore feel this would be a much less compelling feature without first class literal support. + +[SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md +[SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md + +[pitch-status]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md +[regex-type]: https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md +[strongly-typed-captures]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md + +[internal-syntax]: https://github.com/apple/swift-experimental-string-processing/blob/39cb22d96d90ee7cb308b1153e106e50598afdd9/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md +[extended-regex-syntax]: https://github.com/apple/swift-experimental-string-processing/blob/39cb22d96d90ee7cb308b1153e106e50598afdd9/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#extended-syntax-modes + +[regex-dsl]: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md +[dsl-captures]: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md#capture-and-reference diff --git a/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md b/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md index 7dd56b6a8..bee7bbf03 100644 --- a/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md +++ b/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md @@ -1,16 +1,18 @@ - # Regex Syntax and Run-time Construction -- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman) +* Proposal: [SE-NNNN](NNNN-filename.md) +* Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman) +* Review Manager: [Ben Cohen](https://github.com/airspeedswift) +* Status: **Awaiting review** +* Implementation: https://github.com/apple/swift-experimental-string-processing + * Available in nightly toolchain snapshots with `import 
_StringProcessing` ## Introduction A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. We propose the ability to create a regex at run time from a string containing regex syntax (detailed here), API for accessing the match and captures, and a means to convert between an existential capture representation and concrete types. -The overall story is laid out in [Regex Type and Overview](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexTypeOverview.md) and each individual component is tracked in [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107). +The overall story is laid out in [SE-0350 Regex Type and Overview][overview] and each individual component is tracked in [Pitch and Proposal Status][pitches]. ## Motivation @@ -44,9 +46,8 @@ func processEntry(_ line: String) -> Transaction? { Fixing these fundamental limitations requires migrating to a completely different engine and type system representation. This is the path we're proposing with `Regex`, outlined in [Regex Type and Overview][overview]. Details on the semantic differences between ICU's string model and Swift's `String` is discussed in [Unicode for String Processing][pitches]. -The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings. +Run-time construction is important for tools and editors. For example, SwiftPM allows the user to provide a regular expression to filter tests via `swift test --filter`. -This proposal specifically hones in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax. ## Proposed Solution @@ -85,11 +86,11 @@ We propose initializers to declare and compile a regex from syntax. Upon failure ```swift extension Regex { /// Parse and compile `pattern`, resulting in a strongly-typed capture list. - public init(compiling pattern: String, as: Output.Type = Output.self) throws + public init(_ pattern: String, as: Output.Type = Output.self) throws } extension Regex where Output == AnyRegexOutput { /// Parse and compile `pattern`, resulting in an existentially-typed capture list. - public init(compiling pattern: String) throws + public init(_ pattern: String) throws } ``` @@ -160,6 +161,20 @@ extension Regex.Match where Output == AnyRegexOutput { } ``` +We propose adding API to query and access captures by name in an existentially typed regex match: + +```swift +extension Regex.Match where Output == AnyRegexOutput { + /// If a named-capture with `name` is present, returns its value. Otherwise `nil`. + public subscript(_ name: String) -> AnyRegexOutput.Element? { get } +} + +extension AnyRegexOutput { + /// If a named-capture with `name` is present, returns its value. Otherwise `nil`. + public subscript(_ name: String) -> AnyRegexOutput.Element? { get } +} +``` + The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.
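For illustration, here is a brief usage sketch of the run-time construction and named-capture access proposed above. It is a sketch only: the pattern, input, and capture names are illustrative, we follow the `wholeMatch` spelling used in *Regex Type and Overview*, and we assume `AnyRegexOutput.Element` exposes the matched `substring` as described later in this proposal.

```swift
// Compile a pattern supplied at run time; captures are existentially typed.
let pattern = #"(?<identifier>[[:alpha:]]\w*) = (?<hex>[0-9A-F]+)"#
let regex = try! Regex(pattern)
// regex: Regex<AnyRegexOutput>

if let match = regex.wholeMatch("someValue = 2A") {
  // Look up captures by name via the proposed subscripts.
  let identifier = match["identifier"]?.substring  // "someValue"
  let hex = match["hex"]?.substring                // "2A"
  print(identifier ?? "", hex ?? "")
}
```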
Grammar Notation @@ -326,7 +341,7 @@ BuiltinCharClass -> '.' | '\C' | '\d' | '\D' | '\h' | '\H' | '\N' | '\O' | '\R' - `\W`: Non-word character. - `\X`: Any extended grapheme cluster. -Precise definitions of character classes is discussed in [Character Classes for String Processing](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920). +Precise definitions of character classes is discussed in [Unicode for String Processing][pitches]. #### Unicode scalars @@ -396,7 +411,7 @@ For non-Unicode properties, only a value is required. These include: - The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`. - The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`. -Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`. +Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`. Both spellings may be used inside and outside of a custom character class. #### `\K` @@ -538,6 +553,7 @@ These operators have a lower precedence than the implicit union of members, e.g To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand-side is a nested custom character class. We propose following this behavior. +Note that a custom character class may begin with the `:` character, and only becomes a POSIX character property if a closing `:]` is present. For example, `[:a]` is the character class of `:` and `a`. ### Matching options @@ -867,7 +883,23 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat ### Extended character property syntax -ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We propose supporting this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. +ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`. This has two effects: + +- They share the same internal grammar, which allows the use of any Unicode character properties in addition to the POSIX properties. +- The POSIX syntax may be used outside of custom character classes, unlike in PCRE and Oniguruma. + +We propose following both of these rules. The former is purely additive, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. The latter does conflict with other engines, but we feel it is much more likely that a user would expect e.g `[:space:]` to be a character property rather than the character class `[:aceps]`. We do however feel that a warning might be warranted in order to avoid confusion. + +### POSIX character property disambiguation + +PCRE, Oniguruma and ICU allow `[:` to be part of a custom character class if a closing `:]` is not present. For example, `[:a]` is the character class of `:` and `a`. However they each have different rules for detecting the closing `:]`: + +- PCRE will scan ahead until it hits either `:]`, `]`, or `[:`. +- Oniguruma will scan ahead until it hits either `:]`, `]`, or the length exceeds 20 characters. 
+- ICU will scan ahead until it hits a known escape sequence (e.g `\a`, `\e`, `\Q`, ...), or `:]`. Note this excludes character class escapes e.g `\d`. It also excludes `]`, meaning that even `[:a][:]` is parsed as a POSIX character property. + +We propose unifying these behaviors by scanning ahead until we hit either `[`, `]`, `:]`, or `\`. Additionally, we will stop on encountering `}` or a second occurrence of `=`. These fall out the fact that they would be invalid contents of the alternative `\p{...}` syntax. + ### Script properties @@ -932,7 +964,7 @@ We are deferring runtime support for callouts from regex literals as future work ## Alternatives Considered -### Failalbe inits +### Failable inits There are many ways for compilation to fail, from syntactic errors to unsupported features to type mismatches. In the general case, run-time compilation errors are not recoverable by a tool without modifying the user's input. Even then, the thrown errors contain valuable information as to why compilation failed. For example, swiftpm presents any errors directly to the user. @@ -977,3 +1009,9 @@ This proposal regards _syntactic_ support, and does not necessarily mean that ev [unicode-scripts]: https://www.unicode.org/reports/tr24/#Script [unicode-script-extensions]: https://www.unicode.org/reports/tr24/#Script_Extensions [balancing-groups]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/grouping-constructs-in-regular-expressions#balancing-group-definitions +[overview]: https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md +[pitches]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md + + + + diff --git a/Documentation/Evolution/RegexTypeOverview.md b/Documentation/Evolution/RegexTypeOverview.md index cf2fb9265..94230d724 100644 --- a/Documentation/Evolution/RegexTypeOverview.md +++ b/Documentation/Evolution/RegexTypeOverview.md @@ -1,6 +1,11 @@ # Regex Type and Overview -- Authors: [Michael Ilseman](https://github.com/milseman) and the Standard Library Team +* Proposal: [SE-0350](0350-regex-type-overview.md) +* Authors: [Michael Ilseman](https://github.com/milseman) +* Review Manager: [Ben Cohen](https://github.com/airspeedswift) +* Status: **Active Review (4 - 28 April 2022)** +* Implementation: https://github.com/apple/swift-experimental-string-processing + * Available in nightly toolchain snapshots with `import _StringProcessing` ## Introduction @@ -13,7 +18,7 @@ We propose addressing this basic shortcoming through an effort we are calling re 3. A literal for compile-time construction of a regex with statically-typed captures, enabling powerful source tools. 4. An expressive and composable result-builder DSL, with support for capturing strongly-typed values. 5. A modern treatment of Unicode semantics and string processing. -6. A treasure trove of string processing algorithms, along with library-extensible protocols enabling industrial-strength parsers to be used seamlessly as regex components. +6. A slew of regex-powered string processing algorithms, along with library-extensible protocols enabling industrial-strength parsers to be used seamlessly as regex components. This proposal provides details on \#1, the `Regex` type and captures, and gives an overview of how each of the other proposals fit into regex in Swift. @@ -207,7 +212,7 @@ func processEntry(_ line: String) -> Transaction? 
{ // amount: Substring // )> - guard let match = regex.matchWhole(line), + guard let match = regex.wholeMatch(line), let kind = Transaction.Kind(match.kind), let date = try? Date(String(match.date), strategy: dateParser), let amount = try? Decimal(String(match.amount), format: decimalParser) @@ -226,7 +231,7 @@ The result builder allows for inline failable value construction, which particip Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure"). -`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used a regex components. This allows us to drop the overly-permissive pre-parsing step: +`CustomConsumingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used a regex components. This allows us to drop the overly-permissive pre-parsing step: ```swift func processEntry(_ line: String) -> Transaction? { @@ -384,21 +389,25 @@ extension Regex.Match { // Run-time compilation interfaces extension Regex { /// Parse and compile `pattern`, resulting in a strongly-typed capture list. - public init(compiling pattern: String, as: Output.Type = Output.self) throws + public init(_ pattern: String, as: Output.Type = Output.self) throws } extension Regex where Output == AnyRegexOutput { /// Parse and compile `pattern`, resulting in an existentially-typed capture list. - public init(compiling pattern: String) throws + public init(_ pattern: String) throws } ``` +### Cancellation + +Regex is somewhat different from existing standard library operations in that regex processing can be a long-running task. +For this reason regex algorithms may check if the parent task has been cancelled and end execution. + ### On severability and related proposals The proposal split presented is meant to aid focused discussion, while acknowledging that each is interconnected. The boundaries between them are not completely cut-and-dry and could be refined as they enter proposal phase. Accepting this proposal in no way implies that all related proposals must be accepted. They are severable and each should stand on their own merit. - ## Source compatibility Everything in this proposal is additive. Regex delimiters may have their own source compatibility impact, which is discussed in that proposal. @@ -422,7 +431,7 @@ Regular expressions have a deservedly mixed reputation, owing to their historica * "Regular expressions are bad because you should use a real parser" - In other systems, you're either in or you're out, leading to a gravitational pull to stay in when... 
you should get out - - Our remedy is interoperability with real parsers via `CustomMatchingRegexComponent` + - Our remedy is interoperability with real parsers via `CustomConsumingRegexComponent` - Literals with refactoring actions provide an incremental off-ramp from regex syntax to result builders and real parsers * "Regular expressions are bad because ugly unmaintainable syntax" - We propose literals with source tools support, allowing for better syntax highlighting and analysis @@ -488,6 +497,16 @@ The generic parameter `Output` is proposed to contain both the whole match (the The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familarity and have the pitfall of introducing off-by-one errors. +### Encoding `Regex`es into the type system + +During the initial review period the following comment was made: + +> I think the goal should be that, at least for regex literals (and hopefully for the DSL to some extent), one day we might not even need a bytecode or interpreter. I think the ideal case is if each literal was its own function or type that gets generated and optimised as if you wrote it in Swift. + +This is an approach that has been tried a few times in a few different languages (including by a few members of the Swift Standard Library and Core teams), and while it can produce attractive microbenchmarks, it has almost always proved to be a bad idea at the macro scale. In particular, even if we set aside witness tables and other associated swift generics overhead, optimizing a fixed pipeline for each pattern you want to match causes significant codesize expansion when there are multiple patterns in use, as compared to a more flexible byte code interpreter. A bytecode interpreter makes better use of instruction caches and memory, and can also benefit from micro architectural resources that are shared across different patterns. There is a tradeoff w.r.t. branch prediction resources, where separately compiled patterns may have more decisive branch history data, but a shared bytecode engine has much more data to use; this tradeoff tends to fall on the side of a bytecode engine, but it does not always do so. + +It should also be noted that nothing prevents AOT or JIT compiling of the bytecode if we believe it will be advantageous, but compiling or interpreting arbitrary Swift code at runtime is rather more unattractive, since both the type system and language are undecidable. Even absent this rationale, we would probably not encode regex programs directly into the type system simply because it is unnecessarily complex. + ### Future work: static optimization and compilation Swift's support for static compilation is still developing, and future work here is leveraging that to compile regex when profitable. Many regex describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared). 
@@ -497,7 +516,7 @@ Regex are compiled into an intermediary representation and fairly simple analysi ### Future work: parser combinators -What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomMatchingRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system. +What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomConsumingRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system. An issues with traditional parser combinator libraries are the compilation barriers between call-site and definition, resulting in excessive and overly-cautious backtracking traffic. These can be eliminated through better [compilation techniques](https://core.ac.uk/download/pdf/148008325.pdf). As mentioned above, Swift's support for custom static compilation is still under development. @@ -546,9 +565,9 @@ Regexes are often used for tokenization and tokens can be represented with Swift ### Future work: baked-in localized processing -- `CustomMatchingRegexComponent` gives an entry point for localized processors +- `CustomConsumingRegexComponent` gives an entry point for localized processors - Future work includes (sub?)protocols to communicate localization intent --> -[pitches]: https://github.com/apple/swift-experimental-string-processing/issues/107 +[pitches]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md diff --git a/Documentation/Evolution/StringProcessingAlgorithms.md b/Documentation/Evolution/StringProcessingAlgorithms.md index b976c562e..58426c145 100644 --- a/Documentation/Evolution/StringProcessingAlgorithms.md +++ b/Documentation/Evolution/StringProcessingAlgorithms.md @@ -8,13 +8,13 @@ We propose: 1. New regex-powered algorithms over strings, bringing the standard library up to parity with scripting languages 2. Generic `Collection` equivalents of these algorithms in terms of subsequences -3. `protocol CustomMatchingRegexComponent`, which allows 3rd party libraries to provide their industrial-strength parsers as intermixable components of regexes +3. `protocol CustomConsumingRegexComponent`, which allows 3rd party libraries to provide their industrial-strength parsers as intermixable components of regexes -This proposal is part of a larger [regex-powered string processing initiative](https://forums.swift.org/t/declarative-string-processing-overview/52459). Throughout the document, we will reference the still-in-progress [`RegexProtocol`, `Regex`](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/StronglyTypedCaptures.md), and result builder DSL, but these are in flux and not formally part of this proposal. 
Further discussion of regex specifics is out of scope of this proposal and better discussed in another thread (see [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107) for links to relevant threads). +This proposal is part of a larger [regex-powered string processing initiative](https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md), the status of each proposal is tracked [here](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md). Further discussion of regex specifics is out of scope of this proposal and better discussed in their relevant reviews. ## Motivation -A number of common string processing APIs are missing from the Swift standard library. While most of the desired functionalities can be accomplished through a series of API calls, every gap adds a burden to developers doing frequent or complex string processing. For example, here's one approach to find the number of occurrences a substring ("banana") within a string: +A number of common string processing APIs are missing from the Swift standard library. While most of the desired functionalities can be accomplished through a series of API calls, every gap adds a burden to developers doing frequent or complex string processing. For example, here's one approach to find the number of occurrences of a substring ("banana") within a string: ```swift let str = "A banana a day keeps the doctor away. I love bananas; banana are my favorite fruit." @@ -31,7 +31,7 @@ while let r = str.range(of: "banana", options: [], range: idx.. @@ -91,18 +91,18 @@ Note: Only a subset of Python's string processing API are included in this table ### Complex string processing -Even with the API additions, more complex string processing quickly becomes unwieldy. Up-coming support for authoring regexes in Swift help alleviate this somewhat, but string processing in the modern world involves dealing with localization, standards-conforming validation, and other concerns for which a dedicated parser is required. +Even with the API additions, more complex string processing quickly becomes unwieldy. String processing in the modern world involves dealing with localization, standards-conforming validation, and other concerns for which a dedicated parser is required. Consider parsing the date field `"Date: Wed, 16 Feb 2022 23:53:19 GMT"` in an HTTP header as a `Date` type. The naive approach is to search for a substring that looks like a date string (`16 Feb 2022`), and attempt to post-process it as a `Date` with a date parser: ```swift let regex = Regex { - capture { - oneOrMore(.digit) + Capture { + OneOrMore(.digit) " " - oneOrMore(.word) + OneOrMore(.word) " " - oneOrMore(.digit) + OneOrMore(.digit) } } @@ -128,21 +128,21 @@ DEBIT 03/24/2020 IRX tax payment ($52,249.98) Parsing a currency string such as `$3,020.85` with regex is also tricky, as it can contain localized and currency symbols in addition to accounting conventions. This is why Foundation provides industrial-strength parsers for localized strings. -## Proposed solution +## Proposed solution ### Complex string processing -We propose a `CustomMatchingRegexComponent` protocol which allows types from outside the standard library participate in regex builders and `RegexComponent` algorithms. 
This allows types, such as `Date.ParseStrategy` and `FloatingPointFormatStyle.Currency`, to be used directly within a regex: - +We propose a `CustomConsumingRegexComponent` protocol which allows types from outside the standard library participate in regex builders and `RegexComponent` algorithms. This allows types, such as `Date.ParseStrategy` and `FloatingPointFormatStyle.Currency`, to be used directly within a regex: + ```swift let dateRegex = Regex { - capture(dateParser) + Capture(dateParser) } -let date: Date = header.firstMatch(of: dateRegex).map(\.result.1) +let date: Date = header.firstMatch(of: dateRegex).map(\.result.1) let currencyRegex = Regex { - capture(.localizedCurrency(code: "USD").sign(strategy: .accounting)) + Capture(.localizedCurrency(code: "USD").sign(strategy: .accounting)) } let amount: [Decimal] = statement.matches(of: currencyRegex).map(\.result.1) @@ -162,28 +162,55 @@ We also propose the following regex-powered algorithms as well as their generic |`replace(:with:subrange:maxReplacements)`| Replaces all occurrences of the sequence matching the given `RegexComponent` or sequence with a given collection | |`split(by:)`| Returns the longest possible subsequences of the collection around elements equal to the given separator | |`firstMatch(of:)`| Returns the first match of the specified `RegexComponent` within the collection | +|`wholeMatch(of:)`| Matches the specified `RegexComponent` in the collection as a whole | +|`prefixMatch(of:)`| Matches the specified `RegexComponent` against the collection at the beginning | |`matches(of:)`| Returns a collection containing all matches of the specified `RegexComponent` | +We also propose an overload of `~=` allowing regexes to be used in `case` expressions: + +```swift + switch "abcde" { + case /a.*f/: // never taken + case /abc/: // never taken + case /ab.*e/: return "success" + default: // never taken + } + + switch "2022-04-22" { + case decimalParser: // never taken + + case OneOrMore { + CharacterClass.whitespace + }: // never taken + + case #/\d{2}/\d{2}/\d{4}/# // never taken + + case dateParser: return "success" + + default: // never taken + } +``` -## Detailed design +## Detailed design -### `CustomMatchingRegexComponent` +### `CustomConsumingRegexComponent` -`CustomMatchingRegexComponent` inherits from `RegexComponent` and satisfies its sole requirement; Conformers can be used with all of the string algorithms generic over `RegexComponent`. +`CustomConsumingRegexComponent` inherits from `RegexComponent` and satisfies its sole requirement. Conformers can be used with all of the string algorithms generic over `RegexComponent`. ```swift -/// A protocol for custom match functionality. -public protocol CustomMatchingRegexComponent : RegexComponent { - /// Match the input string within the specified bounds, beginning at the given index, and return - /// the end position (upper bound) of the match and the matched instance. +/// A protocol allowing custom types to function as regex components by +/// providing the raw functionality backing `prefixMatch`. +public protocol CustomConsumingRegexComponent: RegexComponent { + /// Process the input string within the specified bounds, beginning at the given index, and return + /// the end position (upper bound) of the match and the produced output. /// - Parameters: /// - input: The string in which the match is performed. /// - index: An index of `input` at which to begin matching. /// - bounds: The bounds in `input` in which the match is performed. 
/// - Returns: The upper bound where the match terminates and a matched instance, or `nil` if /// there isn't a match. - func match( + func consuming( _ input: String, startingAt index: String.Index, in bounds: Range @@ -197,8 +224,8 @@ public protocol CustomMatchingRegexComponent : RegexComponent { We use Foundation `FloatingPointFormatStyle.Currency` as an example for protocol conformance. It would implement the `match` function with `Match` being a `Decimal`. It could also add a static function `.localizedCurrency(code:)` as a member of `RegexComponent`, so it can be referred as `.localizedCurrency(code:)` in the `Regex` result builder: ```swift -extension FloatingPointFormatStyle.Currency : CustomMatchingRegexComponent { - public func match( +extension FloatingPointFormatStyle.Currency : CustomConsumingRegexComponent { + public func consuming( _ input: String, startingAt index: String.Index, in bounds: Range @@ -214,17 +241,19 @@ Matching and extracting a localized currency amount, such as `"$3,020.85"`, can ```swift let regex = Regex { - capture(.localizedCurreny(code: "USD")) + Capture(.localizedCurrency(code: "USD")) } ``` - +
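To make the extension point more concrete, here is a hedged sketch of a from-scratch conformer: a hypothetical `HexNumber` component that consumes a `0x`-prefixed integer and produces its value. The requirement's return type is elided in the listing above, so the `(upperBound:output:)` tuple shape shown here is an assumption based on the prose description (the end position of the match plus the produced output).

```swift
struct HexNumber: CustomConsumingRegexComponent {
  typealias RegexOutput = Int

  func consuming(
    _ input: String,
    startingAt index: String.Index,
    in bounds: Range<String.Index>
  ) throws -> (upperBound: String.Index, output: Int)? {
    // Only look at the portion of `input` the engine asked us to consider.
    let slice = input[index..<bounds.upperBound]
    guard slice.hasPrefix("0x") else { return nil }

    var value = 0
    var current = slice.index(slice.startIndex, offsetBy: 2)
    let digitsStart = current
    while current < slice.endIndex, let digit = slice[current].hexDigitValue {
      value = value * 16 + digit
      current = slice.index(after: current)
    }
    // Require at least one hex digit after the "0x" prefix.
    guard current > digitsStart else { return nil }
    return (current, value)
  }
}

// The conformer then composes with the generic algorithms proposed below:
"buffer at 0x7fee, length 0x40".contains(HexNumber())           // true
"buffer at 0x7fee, length 0x40".matches(of: HexNumber()).count  // 2
```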
-### String algorithm additions +### String and Collection algorithm additions #### Contains +We propose a `contains` variant over collections that tests for subsequence membership. The second algorithm allows for specialization using e.g. the [two way search algorithm](https://en.wikipedia.org/wiki/Two-way_string-matching_algorithm). + ```swift extension Collection where Element: Equatable { /// Returns a Boolean value indicating whether the collection contains the @@ -232,35 +261,80 @@ extension Collection where Element: Equatable { /// - Parameter other: A sequence to search for within this collection. /// - Returns: `true` if the collection contains the specified sequence, /// otherwise `false`. - public func contains(_ other: S) -> Bool + public func contains(_ other: C) -> Bool + where S.Element == Element +} +extension BidirectionalCollection where Element: Comparable { + /// Returns a Boolean value indicating whether the collection contains the + /// given sequence. + /// - Parameter other: A sequence to search for within this collection. + /// - Returns: `true` if the collection contains the specified sequence, + /// otherwise `false`. + public func contains(_ other: C) -> Bool where S.Element == Element } +``` + +We propose a regex-taking variant over string types (those that produce a `Substring` upon slicing). -extension BidirectionalCollection where SubSequence == Substring { +```swift +extension Collection where SubSequence == Substring { /// Returns a Boolean value indicating whether the collection contains the /// given regex. /// - Parameter regex: A regex to search for within this collection. /// - Returns: `true` if the regex was found in the collection, otherwise /// `false`. - public func contains(_ regex: R) -> Bool + public func contains(_ regex: some RegexComponent) -> Bool +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns a Boolean value indicating whether this collection contains a + /// match for the regex, where the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to search for within + /// this collection. + /// - Returns: `true` if the regex returned by `content` matched anywhere in + /// this collection, otherwise `false`. + public func contains( + @RegexComponentBuilder _ content: () -> some RegexComponent + ) -> Bool } ``` #### Starts with +We propose a regex-taking `starts(with:)` variant for string types: + ```swift -extension BidirectionalCollection where SubSequence == Substring { +extension Collection where SubSequence == Substring { /// Returns a Boolean value indicating whether the initial elements of the /// sequence are the same as the elements in the specified regex. /// - Parameter regex: A regex to compare to this sequence. /// - Returns: `true` if the initial elements of the sequence matches the /// beginning of `regex`; otherwise, `false`. - public func starts(with regex: R) -> Bool + public func starts(with regex: some RegexComponent) -> Bool +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns a Boolean value indicating whether the initial elements of this + /// collection are a match for the regex created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to match at + /// the beginning of this collection. + /// - Returns: `true` if the initial elements of this collection match + /// regex returned by `content`; otherwise, `false`. 
+ public func starts( + @RegexComponentBuilder with content: () -> some RegexComponent + ) -> Bool } ``` #### Trim prefix +We propose generic `trimmingPrefix` and `trimPrefix` methods for collections that trim elements matching a predicate or a possible prefix sequence. + ```swift extension Collection { /// Returns a new collection of the same type by removing initial elements @@ -279,7 +353,7 @@ extension Collection where SubSequence == Self { /// - Parameter predicate: A closure that takes an element of the sequence /// as its argument and returns a Boolean value indicating whether the /// element should be removed from the collection. - public mutating func trimPrefix(while predicate: (Element) throws -> Bool) + public mutating func trimPrefix(while predicate: (Element) throws -> Bool) rethrows } extension RangeReplaceableCollection { @@ -288,7 +362,7 @@ extension RangeReplaceableCollection { /// - Parameter predicate: A closure that takes an element of the sequence /// as its argument and returns a Boolean value indicating whether the /// element should be removed from the collection. - public mutating func trimPrefix(while predicate: (Element) throws -> Bool) + public mutating func trimPrefix(while predicate: (Element) throws -> Bool) rethrows } extension Collection where Element: Equatable { @@ -297,44 +371,78 @@ extension Collection where Element: Equatable { /// - Parameter prefix: The collection to remove from this collection. /// - Returns: A collection containing the elements that does not match /// `prefix` from the start. - public func trimmingPrefix(_ prefix: Prefix) -> SubSequence + public func trimmingPrefix(_ prefix: Prefix) -> SubSequence where Prefix.Element == Element } extension Collection where SubSequence == Self, Element: Equatable { /// Removes the initial elements that matches `prefix` from the start. /// - Parameter prefix: The collection to remove from this collection. - public mutating func trimPrefix(_ prefix: Prefix) + public mutating func trimPrefix(_ prefix: Prefix) where Prefix.Element == Element } extension RangeReplaceableCollection where Element: Equatable { /// Removes the initial elements that matches `prefix` from the start. /// - Parameter prefix: The collection to remove from this collection. - public mutating func trimPrefix(_ prefix: Prefix) + public mutating func trimPrefix(_ prefix: Prefix) where Prefix.Element == Element } +``` -extension BidirectionalCollection where SubSequence == Substring { +We propose regex-taking variants for string types: + +```swift +extension Collection where SubSequence == Substring { /// Returns a new subsequence by removing the initial elements that matches /// the given regex. /// - Parameter regex: The regex to remove from this collection. /// - Returns: A new subsequence containing the elements of the collection /// that does not match `prefix` from the start. - public func trimmingPrefix(_ regex: R) -> SubSequence + public func trimmingPrefix(_ regex: some RegexComponent) -> SubSequence } -extension RangeReplaceableCollection - where Self: BidirectionalCollection, SubSequence == Substring -{ +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns a subsequence of this collection by removing the elements + /// matching the regex from the start, where the regex is created by + /// the given closure. + /// + /// - Parameter content: A closure that returns the regex to search for at + /// the start of this collection. 
+ /// - Returns: A collection containing the elements after those that match + /// the regex returned by `content`. If the regex does not match at + /// the start of the collection, the entire contents of this collection + /// are returned. + public func trimmingPrefix( + @RegexComponentBuilder _ content: () -> some RegexComponent + ) -> SubSequence +} + +extension RangeReplaceableCollection where SubSequence == Substring { /// Removes the initial elements that matches the given regex. /// - Parameter regex: The regex to remove from this collection. - public mutating func trimPrefix(_ regex: R) + public mutating func trimPrefix(_ regex: some RegexComponent) +} + +// In RegexBuilder module +extension RangeReplaceableCollection where SubSequence == Substring { + /// Removes the initial elements matching the regex from the start of + /// this collection, if the initial elements match, using the given closure + /// to create the regex. + /// + /// - Parameter content: A closure that returns the regex to search for + /// at the start of this collection. + public mutating func trimPrefix( + @RegexComponentBuilder _ content: () -> some RegexComponent + ) } ``` #### First range +We propose a generic collection algorithm for finding the first range of a given subsequence: + ```swift extension Collection where Element: Equatable { /// Finds and returns the range of the first occurrence of a given sequence @@ -342,8 +450,8 @@ extension Collection where Element: Equatable { /// - Parameter sequence: The sequence to search for. /// - Returns: A range in the collection of the first occurrence of `sequence`. /// Returns nil if `sequence` is not found. - public func firstRange(of sequence: S) -> Range? - where S.Element == Element + public func firstRange(of other: C) -> Range? + where C.Element == Element } extension BidirectionalCollection where Element: Comparable { @@ -352,22 +460,42 @@ extension BidirectionalCollection where Element: Comparable { /// - Parameter other: The sequence to search for. /// - Returns: A range in the collection of the first occurrence of `sequence`. /// Returns `nil` if `sequence` is not found. - public func firstRange(of other: S) -> Range? - where S.Element == Element + public func firstRange(of other: C) -> Range? + where C.Element == Element } +``` -extension BidirectionalCollection where SubSequence == Substring { +We propose a regex-taking variant for string types. + +```swift +extension Collection where SubSequence == Substring { /// Finds and returns the range of the first occurrence of a given regex /// within the collection. /// - Parameter regex: The regex to search for. /// - Returns: A range in the collection of the first occurrence of `regex`. /// Returns `nil` if `regex` is not found. - public func firstRange(of regex: R) -> Range? + public func firstRange(of regex: some RegexComponent) -> Range? +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns the range of the first match for the regex within this collection, + /// where the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to search for. + /// - Returns: A range in the collection of the first occurrence of the first + /// match of if the regex returned by `content`. Returns `nil` if no match + /// for the regex is found. + public func firstRange( + @RegexComponentBuilder of content: () -> some RegexComponent + ) -> Range? 
} ``` #### Ranges +We propose a generic collection algorithm for iterating over all (non-overlapping) ranges of a given subsequence. + ```swift extension Collection where Element: Equatable { /// Finds and returns the ranges of the all occurrences of a given sequence @@ -375,45 +503,133 @@ extension Collection where Element: Equatable { /// - Parameter other: The sequence to search for. /// - Returns: A collection of ranges of all occurrences of `other`. Returns /// an empty collection if `other` is not found. - public func ranges(of other: S) -> some Collection> - where S.Element == Element + public func ranges(of other: C) -> some Collection> + where C.Element == Element } -extension BidirectionalCollection where SubSequence == Substring { +extension BidirectionalCollection where Element: Comparable { + /// Finds and returns the ranges of the all occurrences of a given sequence + /// within the collection. + /// - Parameter other: The sequence to search for. + /// - Returns: A collection of ranges of all occurrences of `other`. Returns + /// an empty collection if `other` is not found. + public func ranges(of other: C) -> some Collection> + where C.Element == Element +} +``` + +And of course regex-taking versions for string types: + +```swift +extension Collection where SubSequence == Substring { /// Finds and returns the ranges of the all occurrences of a given sequence /// within the collection. /// - Parameter regex: The regex to search for. /// - Returns: A collection or ranges in the receiver of all occurrences of /// `regex`. Returns an empty collection if `regex` is not found. - public func ranges(of regex: R) -> some Collection> + public func ranges(of regex: some RegexComponent) -> some Collection> +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns the ranges of the all non-overlapping matches for the regex + /// within this collection, where the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to search for. + /// - Returns: A collection of ranges of all matches for the regex returned by + /// `content`. Returns an empty collection if no match for the regex + /// is found. + public func ranges( + @RegexComponentBuilder of content: () -> some RegexComponent + ) -> some Collection> } ``` -#### First match +#### Match + +We propose algorithms for extracting a `Match` instance from a given regex from the start, anywhere in the middle, or over the entire `self`. ```swift -extension BidirectionalCollection where SubSequence == Substring { +extension Collection where SubSequence == Substring { /// Returns the first match of the specified regex within the collection. /// - Parameter regex: The regex to search for. /// - Returns: The first match of `regex` in the collection, or `nil` if /// there isn't a match. - public func firstMatch(of regex: R) -> RegexMatch? + public func firstMatch(of regex: R) -> Regex.Match? + + /// Match a regex in its entirety. + /// - Parameter regex: The regex to match against. + /// - Returns: The match if there is one, or `nil` if none. + public func wholeMatch(of regex: R) -> Regex.Match? + + /// Match part of the regex, starting at the beginning. + /// - Parameter regex: The regex to match against. + /// - Returns: The match if there is one, or `nil` if none. + public func prefixMatch(of regex: R) -> Regex.Match? 
+} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns the first match for the regex within this collection, where + /// the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns the regex to search for. + /// - Returns: The first match for the regex created by `content` in this + /// collection, or `nil` if no match is found. + public func firstMatch( + @RegexComponentBuilder of content: () -> R + ) -> Regex.Match? + + /// Matches a regex in its entirety, where the regex is created by + /// the given closure. + /// + /// - Parameter content: A closure that returns a regex to match against. + /// - Returns: The match if there is one, or `nil` if none. + public func wholeMatch( + @RegexComponentBuilder of content: () -> R + ) -> Regex.Match? + + /// Matches part of the regex, starting at the beginning, where the regex + /// is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to match against. + /// - Returns: The match if there is one, or `nil` if none. + public func prefixMatch( + @RegexComponentBuilder of content: () -> R + ) -> Regex.Match? } ``` #### Matches +We propose an algorithm for iterating over all (non-overlapping) matches of a given regex: + ```swift -extension BidirectionalCollection where SubSequence == Substring { +extension Collection where SubSequence == Substring { /// Returns a collection containing all matches of the specified regex. /// - Parameter regex: The regex to search for. /// - Returns: A collection of matches of `regex`. - public func matches(of regex: R) -> some Collection> + public func matches(of regex: R) -> some Collection.Match> +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns a collection containing all non-overlapping matches of + /// the regex, created by the given closure. + /// + /// - Parameter content: A closure that returns the regex to search for. + /// - Returns: A collection of matches for the regex returned by `content`. + /// If no matches are found, the returned collection is empty. + public func matches( + @RegexComponentBuilder of content: () -> R + ) -> some Collection.Match> } ``` #### Replace +We propose generic collection algorithms that will replace all occurences of a given subsequence: + ```swift extension RangeReplaceableCollection where Element: Equatable { /// Returns a new collection in which all occurrences of a target sequence @@ -425,14 +641,59 @@ extension RangeReplaceableCollection where Element: Equatable { /// - maxReplacements: A number specifying how many occurrences of `other` /// to replace. Default is `Int.max`. /// - Returns: A new collection in which all occurrences of `other` in - /// `subrange` of the collection are replaced by `replacement`. - public func replacing( - _ other: S, + /// `subrange` of the collection are replaced by `replacement`. + public func replacing( + _ other: C, + with replacement: Replacement, + subrange: Range, + maxReplacements: Int = .max + ) -> Self where C.Element == Element, Replacement.Element == Element + + /// Returns a new collection in which all occurrences of a target sequence + /// are replaced by another collection. + /// - Parameters: + /// - other: The sequence to replace. + /// - replacement: The new elements to add to the collection. + /// - maxReplacements: A number specifying how many occurrences of `other` + /// to replace. Default is `Int.max`. 
+ /// - Returns: A new collection in which all occurrences of `other` in + /// `subrange` of the collection are replaced by `replacement`. + public func replacing( + _ other: C, + with replacement: Replacement, + maxReplacements: Int = .max + ) -> Self where C.Element == Element, Replacement.Element == Element + + /// Replaces all occurrences of a target sequence with a given collection + /// - Parameters: + /// - other: The sequence to replace. + /// - replacement: The new elements to add to the collection. + /// - maxReplacements: A number specifying how many occurrences of `other` + /// to replace. Default is `Int.max`. + public mutating func replace( + _ other: C, + with replacement: Replacement, + maxReplacements: Int = .max + ) where C.Element == Element, Replacement.Element == Element +} +extension RangeReplaceableCollection where Self: BidirectionalCollection, Element: Comparable { + /// Returns a new collection in which all occurrences of a target sequence + /// are replaced by another collection. + /// - Parameters: + /// - other: The sequence to replace. + /// - replacement: The new elements to add to the collection. + /// - subrange: The range in the collection in which to search for `other`. + /// - maxReplacements: A number specifying how many occurrences of `other` + /// to replace. Default is `Int.max`. + /// - Returns: A new collection in which all occurrences of `other` in + /// `subrange` of the collection are replaced by `replacement`. + public func replacing( + _ other: C, with replacement: Replacement, subrange: Range, maxReplacements: Int = .max - ) -> Self where S.Element == Element, Replacement.Element == Element - + ) -> Self where C.Element == Element, Replacement.Element == Element + /// Returns a new collection in which all occurrences of a target sequence /// are replaced by another collection. /// - Parameters: @@ -442,25 +703,29 @@ extension RangeReplaceableCollection where Element: Equatable { /// to replace. Default is `Int.max`. /// - Returns: A new collection in which all occurrences of `other` in /// `subrange` of the collection are replaced by `replacement`. - public func replacing( - _ other: S, + public func replacing( + _ other: C, with replacement: Replacement, maxReplacements: Int = .max - ) -> Self where S.Element == Element, Replacement.Element == Element - + ) -> Self where C.Element == Element, Replacement.Element == Element + /// Replaces all occurrences of a target sequence with a given collection /// - Parameters: /// - other: The sequence to replace. /// - replacement: The new elements to add to the collection. /// - maxReplacements: A number specifying how many occurrences of `other` /// to replace. Default is `Int.max`. - public mutating func replace( - _ other: S, + public mutating func replace( + _ other: C, with replacement: Replacement, maxReplacements: Int = .max - ) where S.Element == Element, Replacement.Element == Element + ) where C.Element == Element, Replacement.Element == Element } +``` + +We propose regex-taking variants for string types as well as variants that take a closure which will generate the replacement portion from a regex match (e.g. by reading captures). +```swift extension RangeReplaceableCollection where SubSequence == Substring { /// Returns a new collection in which all occurrences of a sequence matching /// the given regex are replaced by another collection. @@ -472,13 +737,13 @@ extension RangeReplaceableCollection where SubSequence == Substring { /// sequence matching `regex` to replace. 
Default is `Int.max`. /// - Returns: A new collection in which all occurrences of subsequence /// matching `regex` in `subrange` are replaced by `replacement`. - public func replacing( - _ regex: R, + public func replacing( + _ r: some RegexComponent, with replacement: Replacement, subrange: Range, maxReplacements: Int = .max ) -> Self where Replacement.Element == Element - + /// Returns a new collection in which all occurrences of a sequence matching /// the given regex are replaced by another collection. /// - Parameters: @@ -488,12 +753,12 @@ extension RangeReplaceableCollection where SubSequence == Substring { /// sequence matching `regex` to replace. Default is `Int.max`. /// - Returns: A new collection in which all occurrences of subsequence /// matching `regex` are replaced by `replacement`. - public func replacing( - _ regex: R, + public func replacing( + _ r: some RegexComponent, with replacement: Replacement, maxReplacements: Int = .max ) -> Self where Replacement.Element == Element - + /// Replaces all occurrences of the sequence matching the given regex with /// a given collection. /// - Parameters: @@ -501,112 +766,417 @@ extension RangeReplaceableCollection where SubSequence == Substring { /// - replacement: The new elements to add to the collection. /// - maxReplacements: A number specifying how many occurrences of the /// sequence matching `regex` to replace. Default is `Int.max`. - public mutating func replace( - _ regex: R, + public mutating func replace( + _ r: some RegexComponent, with replacement: Replacement, maxReplacements: Int = .max ) where Replacement.Element == Element - + /// Returns a new collection in which all occurrences of a sequence matching /// the given regex are replaced by another regex match. /// - Parameters: /// - regex: A regex describing the sequence to replace. - /// - replacement: A closure that receives the full match information, - /// including captures, and returns a replacement collection. /// - subrange: The range in the collection in which to search for `regex`. /// - maxReplacements: A number specifying how many occurrences of the /// sequence matching `regex` to replace. Default is `Int.max`. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. /// - Returns: A new collection in which all occurrences of subsequence /// matching `regex` are replaced by `replacement`. public func replacing( _ regex: R, - with replacement: (RegexMatch) throws -> Replacement, subrange: Range, - maxReplacements: Int = .max + maxReplacements: Int = .max, + with replacement: (Regex.Match) throws -> Replacement ) rethrows -> Self where Replacement.Element == Element - + /// Returns a new collection in which all occurrences of a sequence matching /// the given regex are replaced by another collection. /// - Parameters: /// - regex: A regex describing the sequence to replace. - /// - replacement: A closure that receives the full match information, - /// including captures, and returns a replacement collection. /// - maxReplacements: A number specifying how many occurrences of the /// sequence matching `regex` to replace. Default is `Int.max`. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. /// - Returns: A new collection in which all occurrences of subsequence /// matching `regex` are replaced by `replacement`. 
public func replacing( _ regex: R, - with replacement: (RegexMatch) throws -> Replacement, - maxReplacements: Int = .max + maxReplacements: Int = .max, + with replacement: (Regex.Match) throws -> Replacement ) rethrows -> Self where Replacement.Element == Element - + /// Replaces all occurrences of the sequence matching the given regex with /// a given collection. /// - Parameters: /// - regex: A regex describing the sequence to replace. - /// - replacement: A closure that receives the full match information, - /// including captures, and returns a replacement collection. /// - maxReplacements: A number specifying how many occurrences of the /// sequence matching `regex` to replace. Default is `Int.max`. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. public mutating func replace( _ regex: R, - with replacement: (RegexMatch) throws -> Replacement, - maxReplacements: Int = .max + maxReplacements: Int = .max, + with replacement: (Regex.Match) throws -> Replacement + ) rethrows where Replacement.Element == Element +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closure to create the regex. + /// + /// - Parameters: + /// - replacement: The new elements to add to the collection in place of + /// each match for the regex, using `content` to create the regex. + /// - subrange: The range in the collection in which to search for + /// the regex. + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by `replacement`, using `content` to create the regex. + public func replacing( + with replacement: Replacement, + subrange: Range, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> some RegexComponent + ) -> Self where Replacement.Element == Element + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closure to create the regex. + /// + /// - Parameters: + /// - replacement: The new elements to add to the collection in place of + /// each match for the regex, using `content` to create the regex. + /// - maxReplacements: A number specifying how many occurrences of regex + /// to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by `replacement`, using `content` to create the regex. + public func replacing( + with replacement: Replacement, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> some RegexComponent + ) -> Self where Replacement.Element == Element + + /// Replaces all matches for the regex in this collection, using the given + /// closure to create the regex. + /// + /// - Parameters: + /// - replacement: The new elements to add to the collection in place of + /// each match for the regex, using `content` to create the regex. + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace. + /// - content: A closure that returns the collection to search for + /// and replace. 
+ public mutating func replace( + with replacement: Replacement, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> some RegexComponent + ) where Replacement.Element == Element + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closures to create the replacement + /// and the regex. + /// + /// - Parameters: + /// - subrange: The range in the collection in which to search for the + /// regex, using `content` to create the regex. + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by the result of calling `replacement`, where regex + /// is the result of calling `content`. + public func replacing( + subrange: Range, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R, + with replacement: (Regex.Match) throws -> Replacement + ) rethrows -> Self where Replacement.Element == Element + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closures to create the replacement + /// and the regex. + /// + /// - Parameters: + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace, using `content` to create the regex. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by the result of calling `replacement`, where regex is + /// the result of calling `content`. + public func replacing( + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R, + with replacement: (Regex.Match) throws -> Replacement + ) rethrows -> Self where Replacement.Element == Element + + /// Replaces all matches for the regex in this collection, using the + /// given closures to create the replacement and the regex. + /// + /// - Parameters: + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace, using `content` to create the regex. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. + public mutating func replace( + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R, + with replacement: (Regex.Match) throws -> Replacement ) rethrows where Replacement.Element == Element } ``` #### Split +We propose a generic collection `split` that can take a subsequence separator: + ```swift extension Collection where Element: Equatable { /// Returns the longest possible subsequences of the collection, in order, - /// around elements equal to the given separator. - /// - Parameter separator: The element to be split upon. + /// around elements equal to the given separator collection. + /// + /// - Parameters: + /// - separator: A collection of elements to be split upon. 
+ /// - maxSplits: The maximum number of times to split the collection, + /// or one less than the number of subsequences to return. + /// - omittingEmptySubsequences: If `false`, an empty subsequence is + /// returned in the result for each consecutive pair of separator + /// sequences in the collection and for each instance of separator + /// sequences at the start or end of the collection. If `true`, only + /// nonempty subsequences are returned. /// - Returns: A collection of subsequences, split from this collection's - /// elements. - public func split(by separator: S) -> some Collection - where S.Element == Element + /// elements. + public func split( + separator: C, + maxSplits: Int = Int.max, + omittingEmptySubsequences: Bool = true + ) -> some Collection where C.Element == Element } +extension BidirectionalCollection where Element: Comparable { + /// Returns the longest possible subsequences of the collection, in order, + /// around elements equal to the given separator collection. + /// + /// - Parameters: + /// - separator: A collection of elements to be split upon. + /// - maxSplits: The maximum number of times to split the collection, + /// or one less than the number of subsequences to return. + /// - omittingEmptySubsequences: If `false`, an empty subsequence is + /// returned in the result for each consecutive pair of separator + /// sequences in the collection and for each instance of separator + /// sequences at the start or end of the collection. If `true`, only + /// nonempty subsequences are returned. + /// - Returns: A collection of subsequences, split from this collection's + /// elements. + public func split( + separator: C, + maxSplits: Int = Int.max, + omittingEmptySubsequences: Bool = true + ) -> some Collection where C.Element == Element +} +``` + +And a regex-taking variant for string types: -extension BidirectionalCollection where SubSequence == Substring { +```swift +extension Collection where SubSequence == Substring { /// Returns the longest possible subsequences of the collection, in order, - /// around elements equal to the given separator. - /// - Parameter separator: A regex describing elements to be split upon. + /// around subsequence that match the given separator regex. + /// + /// - Parameters: + /// - separator: A regex to be split upon. + /// - maxSplits: The maximum number of times to split the collection, + /// or one less than the number of subsequences to return. + /// - omittingEmptySubsequences: If `false`, an empty subsequence is + /// returned in the result for each consecutive pair of matches + /// and for each match at the start or end of the collection. If + /// `true`, only nonempty subsequences are returned. /// - Returns: A collection of substrings, split from this collection's - /// elements. - public func split(by separator: R) -> some Collection + /// elements. + public func split( + separator: some RegexComponent, + maxSplits: Int = Int.max, + omittingEmptySubsequences: Bool = true + ) -> some Collection +} + +// In RegexBuilder module +extension Collection where SubSequence == Substring { + /// Returns the longest possible subsequences of the collection, in order, + /// around subsequence that match the regex created by the given closure. + /// + /// - Parameters: + /// - maxSplits: The maximum number of times to split the collection, + /// or one less than the number of subsequences to return. 
+ /// - omittingEmptySubsequences: If `false`, an empty subsequence is + /// returned in the result for each consecutive pair of matches + /// and for each match at the start or end of the collection. If + /// `true`, only nonempty subsequences are returned. + /// - separator: A closure that returns a regex to be split upon. + /// - Returns: A collection of substrings, split from this collection's + /// elements. + public func split( + maxSplits: Int = Int.max, + omittingEmptySubsequences: Bool = true, + @RegexComponentBuilder separator: () -> some RegexComponent + ) -> some Collection +} +``` + +**Note:** We plan to adopt the new generics features enabled by [SE-0346][] for these proposed methods when the standard library adopts primary associated types, [pending a forthcoming proposal][stdlib-pitch]. For example, the first method in the _Replacement_ section above would instead be: + +```swift +extension RangeReplaceableCollection where Element: Equatable { + /// Returns a new collection in which all occurrences of a target sequence + /// are replaced by another collection. + public func replacing( + _ other: some Collection, + with replacement: some Collection, + subrange: Range, + maxReplacements: Int = .max + ) -> Self } ``` +### Language-level pattern matching via `~=` + +We propose allowing any regex component be used in case statements by overloading the `~=` operator for matching against the entire input: + +```swift +extension RegexComponent { + public static func ~=(regex: Self, input: String) -> Bool + + public static func ~=(regex: Self, input: Substring) -> Bool +} +``` + + +[SE-0346]: https://github.com/apple/swift-evolution/blob/main/proposals/0346-light-weight-same-type-syntax.md +[stdlib-pitch]: https://forums.swift.org/t/pitch-primary-associated-types-in-the-standard-library/56426 +#### Searching for empty strings and matches + +Empty matches and inputs are an important edge case for several of the algorithms proposed above. For example, what is the result of `"123.firstRange(of: /[a-z]*/)`? How do you split a collection separated by an empty collection, as in `"1234".split(separator: "")`? For the Swift standard library, this is a new consideration, as current algorithms are `Element`-based and cannot be passed an empty input. + +Languages and libraries are nearly unanimous about finding the location of an empty string, with Ruby, Python, C#, Java, Javascript, etc, finding an empty string at each index in the target. Notably, Foundation's `NSString.range(of:)` does _not_ find an empty string at all. + +The methods proposed here follow the consensus behavior, which makes sense if you think of `a.firstRange(of: b)` as returning the first subrange `r` where `a[r] == b`. If a regex can match an empty substring, like `/[a-z]*/`, the behavior is the same. + +```swift +let hello = "Hello" +let emptyRange = hello.firstRange(of: "") +// emptyRange is equivalent to '0..<0' (integer ranges shown for readability) +``` + +Because searching again at the same index would yield that same empty string, we advance one position after finding an empty string or matching an empty pattern when finding all ranges. This yields the position of every valid index in the string. + +```swift +let allRanges = hello.ranges(of: "") +// allRanges is equivalent to '[0..<0, 1..<1, 2..<2, 3..<3, 4..<4, 5..<5]' +``` + +Splitting with an empty separator (or a pattern that matches empty string), uses this same behavior, resulting in a collection of single-element substrings. 
Interestingly, a couple languages make different choices here. C# returns the original string instead of its parts, and Python rejects an empty separator (though it permits regexes that match empty strings). + +```swift +let parts = hello.split(separator: "") +// parts == ["h", "e", "l", "l", "o"] + +let moreParts = hello.split(separator: "", omittingEmptySubsequences: false) +// parts == ["", "h", "e", "l", "l", "o", ""] +``` +Finally, searching for an empty string within an empty string yields, as you might imagine, the empty string: +```swift +let empty = "" +let range = empty.firstRange(of: empty) +// empty == empty[range] +``` ## Alternatives considered ### Extend `Sequence` instead of `Collection` -Most of the proposed algorithms are necessarily on `Collection` due to the use of indices or mutation. `Sequence` does not support multi-pass iteration, so even `trimPrefix` would problematic on `Sequence` because it needs to look 1 `Element` ahead to know when to stop trimming. +Most of the proposed algorithms are necessarily on `Collection` due to the use of indices or mutation. `Sequence` does not support multi-pass iteration, so even `trimmingPrefix` would problematic on `Sequence` because it needs to look one `Element` ahead to know when to stop trimming and would need to return a wrapper for the in-progress iterator instead of a subsequence. + +### Cross-proposal API naming consistency + +The regex work is broken down into 6 proposals based on technical domain, which is advantageous for deeper technical discussions and makes reviewing the large body of work manageable. The disadvantage of this approach is that relatively-shallow cross-cutting concerns, such as API naming consistency, are harder to evaluate until we've built up intuition from multiple proposals. + +We've seen the [Regex type and overview](https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md), the [Regex builder DSL](https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md), and here we present lots of ways to use regex. Now's a good time to go over API naming consistency. + +(The other proposal with a significant amount of API is [Unicode for String Processing](https://forums.swift.org/t/pitch-unicode-for-string-processing/56907), which is in the pitch phase. It is a technical niche and less impactful on these naming discussions. We'll still want to design those names for consistency, of course.) + + +```swift +protocol RegexComponent { + associatedtype RegexOutput +} +``` + +The associatedtype name is "RegexOutput" to help libraries conform their parsers to this protocol (e.g. via `CustomConsumingRegexComponent`). Regex's capture representation is regexy: it has the overall matched portion as the first capture and the regex builders know how to combine these kinds of capture lists together. This could be different than how e.g. a parser combinator library's output types might be represented. Thus, we chose a more specific name to avoid any potential conflicts. + +The name "RegexComponent" accentuates that any conformer can be used as part of a larger regex, while it de-emphasizes that `Regex` instances themselves can be used directly. We propose methods that are generic over `RegexComponent` and developers will be considering whether they should make their functions that otherwise take a `Regex` also be generic over `RegexComponent`. + +It's possible there might be some initial confusion around the word "component", i.e. 
a developer may have a regex and not be sure how to make it into a component or how to get the component out. The word "component" carries a lot of value in the context of the regex DSL. An alternative name might be `RegexProtocol`, which implies that a Regex can be used at the site and would be clearly the way to make a function taking a concrete `Regex` generic. But, it's otherwise a naming workaround that doesn't carry the additional regex builder connotations. + +The protocol requirement is `var regex: Regex`, i.e. any type that can produce a regex or hook into the engine's customization hooks (this is what `consuming` does) can be used as a component of the DSL and with these generic API. An alternative name could be "CustomRegexConvertible", but we don't feel that communicates component composability very well, nor is it particularly enlightening when encountering these generic API. + +Another alternative is to have a second protocol just for generic API. But without a compelling semantic distinction or practical utility, we'd prefer to avoid adding protocols just for names. If a clearly superior name exists, we should just choose that. + + +```swift +protocol CustomConsumingRegexComponent { + func consuming(...) +} +``` + +This is not a normal developer-facing protocol or concept; it's an advanced library-extensibility feature. Explicit, descriptive, and careful names are more important than concise names. The "custom" implies that we're not just vending a regex directly ourselves, we're instead customizing behavior by hooking into the run-time engine directly. + +Older versions of the pitch had `func match(...) -> (String.Index, T)?` as the protocol requirement. As [Regex type and overview](https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md) went through review, naming convention settled on using the word "match" as a noun and in context with operations that produce a `Match` instance. Since this is the engine's customization hook, it produces the value and position to resume execution from directly, and hence different terminology is apt and avoids confusion or future ambiguities. "Consuming" is the nomenclature we're going with for something that chews off the front of its input in order to produces a value. + +This protocol customizes the basic consume-from-the-front functionality. A protocol for customizing search is future work and involves accommodating different kinds of state and ways that a searcher may wish to speed up subsequent searches. Alternative names for the protocol include `CustomRegexComponent`, `CustomConsumingRegex`, etc., but we don't feel brevity is the key consideration here. + + +### Why `where SubSequence == Substring`? + +A `Substring` slice requirement allows the regex engine to produce indicies in the original collection by operating over a portion of the input. Unfortunately, this is not one of the requirements of `StringProtocol`. + +A new protocol for types that can produce a `Substring` on request (e.g. from UTF-8 contents) would have to eagerly produce a `String` copy first and would need requirements to translate indices. When higher-level algorithms are implemented via multiple calls to the lower-level algorithms, these copies could happen many times. Shared strings are future work but a much better solution to this. 
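As a rough illustration of why the `Substring`-slicing requirement is useful (a sketch, not proposed API): both `String` and `Substring` satisfy the constraint, and because a `Substring` shares indices with its base string, a range found while searching a slice can be used to index back into the original.

```swift
let header = "Date: Wed, 16 Feb 2022 23:53:19 GMT"
let body: Substring = header.dropFirst(6)  // "Wed, 16 Feb 2022 23:53:19 GMT"

// Both `header` and `body` satisfy `SubSequence == Substring`, so both get
// the regex-taking algorithms, and the returned range is valid in `header`.
if let range = body.firstRange(of: /\d{4}/) {
  print(header[range])  // "2022"
}
```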
+ ## Future directions ### Backward algorithms -It would be useful to have algorithms that operate from the back of a collection, including ability to find the last non-overlapping range of a pattern in a string, and/or that to find the first range of a pattern when searching from the back, and trimming a string from both sides. They are deferred from this proposal as the API that could clarify the nuances of backward algorithms are still being explored. +It would be useful to have algorithms that operate from the back of a collection, including the ability to find the last non-overlapping range of a pattern in a string and/or the first range of a pattern when searching from the back, as well as trimming a string from both sides. These are deferred from this proposal because the API that would clarify the nuances of backward algorithms is still being explored.
Nuances of backward algorithms -There is a subtle difference between finding the last non-overlapping range of a pattern in a string, and finding the first range of this pattern when searching from the back. +There is a subtle difference between finding the last non-overlapping range of a pattern in a string, and finding the first range of this pattern when searching from the back. -The currently proposed algorithm that finds a pattern from the front, e.g. `"aaaaa".ranges(of: "aa")`, produces two non-overlapping ranges, splitting the string in the chunks `aa|aa|a`. It would not be completely unreasonable to expect to introduce a counterpart, such as `"aaaaa".lastRange(of: "aa")`, to return the range that contains the third and fourth characters of the string. This would be a shorthand for `"aaaaa".ranges(of: "aa").last`. Yet, it would also be reasonable to expect the function to return the first range of `"aa"` when searching from the back of the string, i.e. the range that contains the fourth and fifth characters. +The currently proposed algorithm that finds a pattern from the front, e.g. `"aaaaa".ranges(of: "aa")`, produces two non-overlapping ranges, splitting the string in the chunks `aa|aa|a`. It would not be completely unreasonable to expect to introduce a counterpart, such as `"aaaaa".lastRange(of: "aa")`, to return the range that contains the third and fourth characters of the string. This would be a shorthand for `"aaaaa".ranges(of: "aa").last`. Yet, it would also be reasonable to expect the function to return the first range of `"aa"` when searching from the back of the string, i.e. the range that contains the fourth and fifth characters. -Trimming a string from both sides shares a similar story. For example, `"ababa".trimming("aba")` can return either `"ba"` or `"ab"`, depending on whether the prefix or the suffix was trimmed first. +Trimming a string from both sides shares a similar story. For example, `"ababa".trimming("aba")` can return either `"ba"` or `"ab"`, depending on whether the prefix or the suffix was trimmed first.
- +### Split preserving the separator + +Future work is a split variant that interweaves the separator with the separated portions. For example, when splitting over `\p{punctuation}` it might be useful to be able to preserve the punctionation as a separate entry in the returned collection. + ### Future API -Some Python functions are not currently included in this proposal, such as trimming the suffix from a string/collection. This pitch aims to establish a pattern for using `RegexComponent` with string processing algorithms, so that further enhancement can to be introduced to the standard library easily in the future, and eventually close the gap between Swift and other popular scripting languages. +Some common string processing functions are not currently included in this proposal, such as trimming the suffix from a string/collection, and finding overlapping ranges of matched substrings. This pitch aims to establish a pattern for using `RegexComponent` with string processing algorithms, so that further enhancement can to be introduced to the standard library easily in the future, and eventually close the gap between Swift and other popular scripting languages. diff --git a/Documentation/Evolution/UnicodeForStringProcessing.md b/Documentation/Evolution/UnicodeForStringProcessing.md new file mode 100644 index 000000000..828d8f53c --- /dev/null +++ b/Documentation/Evolution/UnicodeForStringProcessing.md @@ -0,0 +1,872 @@ +# Unicode for String Processing + +Proposal: [SE-NNNN](NNNN-filename.md) +Authors: [Nate Cook](https://github.com/natecook1000), [Alejandro Alonso](https://github.com/Azoy) +Review Manager: TBD +Implementation: [apple/swift-experimental-string-processing][repo] +Status: **Draft** + + +## Introduction + +This proposal describes `Regex`'s rich Unicode support during regex matching, along with the character classes and options that define that behavior. + +## Motivation + +Swift's `String` type provides, by default, a view of `Character`s or [extended grapheme clusters][graphemes] whose comparison honors [Unicode canonical equivalence][canoneq]. Each character in a string can be composed of one or more Unicode scalar values, while still being treated as a single unit, equivalent to other ways of formulating the equivalent character: + +```swift +let str = "Cafe\u{301}" // "Café" +str == "Café" // true +str.dropLast() // "Caf" +str.last == "é" // true (precomposed e with acute accent) +str.last == "e\u{301}" // true (e followed by composing acute accent) +``` + +This default view is fairly novel. Most languages that support Unicode strings generally operate at the Unicode scalar level, and don't provide the same affordance for operating on a string as a collection of grapheme clusters. In Python, for example, Unicode strings report their length as the number of scalar values, and don't use canonical equivalence in comparisons: + +```python +cafe = u"Cafe\u0301" +len(cafe) # 5 +cafe == u"Café" # False +``` + +Existing regex engines follow this same model of operating at the Unicode scalar level. To match canonically equivalent characters, or have equivalent behavior between equivalent strings, you must normalize your string and regex to the same canonical format. 
+ +```python +# Matches a four-element string +re.match(u"^.{4}$", cafe) # None +# Matches a string ending with 'é' +re.match(u".+é$", cafe) # None + +cafeComp = unicodedata.normalize("NFC", cafe) +re.match(u"^.{4}$", cafeComp) # +re.match(u".+é$", cafeComp) # +``` + +With Swift's string model, this behavior would surprising and undesirable — Swift's default regex semantics must match the semantics of a `String`. + +
Other engines + +Other regex engines match character classes (such as `\w` or `.`) at the Unicode scalar value level, or even the code unit level, instead of recognizing grapheme clusters as characters. When matching the `.` character class, other languages will only match the first part of an `"e\u{301}"` grapheme cluster. Some languages, like Perl, Ruby, and Java, support an additional `\X` metacharacter, which explicitly represents a single grapheme cluster. + +| Matching `"Cafe\u{301}"` | Pattern: `^Caf.` | Remaining | Pattern: `^Caf\X` | Remaining | +|---|---|---|---|---| +| C#, Rust, Go, Python | `"Cafe"` | `"´"` | n/a | n/a | +| NSString, Java, Ruby, Perl | `"Cafe"` | `"´"` | `"Café"` | `""` | + +Other than Java's `CANON_EQ` option, the vast majority of other languages and engines are not capable of comparing with canonical equivalence. + +
+ +## Proposed solution + +In a regex's simplest form, without metacharacters or special features, matching behaves like a test for equality. A string always matches a regex that simply contains the same characters. + +```swift +let str = "Cafe\u{301}" // "Café" +str.contains(/Café/) // true +``` + +From that point, small changes continue to comport with the element counting and comparison expectations set by `String`: + +```swift +str.contains(/Caf./) // true +str.contains(/.+é/) // true +str.contains(/.+e\u{301}/) // true +str.contains(/\w+é/) // true +``` + + +For compatibility with other regex engines and the flexibility to match at both `Character` and Unicode scalar level, you can switch between matching levels for an entire regex or within select portions. This powerful capability provides the expected default behavior when working with strings, while allowing you to drop down for Unicode scalar-specific matching. + +By default, literal characters and Unicode scalar values (e.g. `\u{301}`) are coalesced into characters in the same way as a normal string, as shown above. Metacharacters, like `.` and `\w`, and custom character classes each match a single element at the current matching level. + +For example, these matches fail, because by the time the parser encounters the "`\u{301}`" Unicode scalar literal, the full `"é"` character has been matched: + +```swift +str.contains(/Caf.\u{301}) // false - `.` matches "é" character +str.contains(/Caf\w\u{301}) // false - `\w` matches "é" character +str.contains(/.+\u{301}) // false - `.+` matches each character +``` + +Alternatively, we can drop down to use Unicode scalar semantics if we want to match specific Unicode sequences. For example, these regexes matches an `"e"` followed by any modifier with the specified parameters: + +```swift +str.contains(/e[\u{300}-\u{314}]/.matchingSemantics(.unicodeScalar)) +// true - matches an "e" followed by a Unicode scalar in the range U+0300 - U+0314 +str.contains(/e\p{Nonspacing Mark}/.matchingSemantics(.unicodeScalar)) +// true - matches an "e" followed by a Unicode scalar with general category "Nonspacing Mark" +``` + +Matching in Unicode scalar mode is analogous to comparing against a string's `UnicodeScalarView` — individual Unicode scalars are matched without combining them into characters or testing for canonical equivalence. + +```swift +str.contains(/Café/.matchingSemantics(.unicodeScalar)) +// false - "e\u{301}" doesn't match with /é/ +str.contains(/Cafe\u{301}/.matchingSemantics(.unicodeScalar)) +// true - "e\u{301}" matches with /e\u{301}/ +``` + +Swift's `Regex` follows the level 2 guidelines for Unicode support in regular expressions described in [Unicode Technical Standard #18][uts18], with support for Unicode character classes, canonical equivalence, grapheme cluster matching semantics, and level 2 word boundaries enabled by default. In addition to selecting the matching semantics, `Regex` provides options for selecting different matching behaviors, such as ASCII character classes or Unicode scalar semantics, which corresponds more closely with other regex engines. + +## Detailed design + +First, we'll discuss the options that let you control a regex's behavior, and then explore the character classes that define the your pattern. + +### Options + +Options can be enabled and disabled in two different ways: as part of [regex internal syntax][internals], or applied as methods when declaring a `Regex`. 
For example, both of these `Regex`es are declared with case insensitivity:

```swift
let regex1 = /(?i)banana/
let regex2 = Regex {
    "banana"
}
.ignoresCase()
```

Note that `ignoresCase()` is available on any type conforming to `RegexComponent`, which means that you can always use the more readable option-setting interface in conjunction with regex literals or run-time compiled `Regex`es:

```swift
let regex3 = /banana/.ignoresCase()
```

Calling an option-setting method like `ignoresCase(_:)` acts like wrapping the callee in an option-setting group such as `(?i:...)`. That is, while it sets the behavior for the callee, it doesn't override options that are applied to more specific regions. In this example, the middle `"na"` in `"banana"` matches case-sensitively, despite the outer call to `ignoresCase()`:

```swift
let regex4 = Regex {
    "ba"
    "na".ignoresCase(false)
    "na"
}
.ignoresCase()

"banana".contains(regex4)    // true
"BAnaNA".contains(regex4)    // true
"BANANA".contains(regex4)    // false

// Equivalent to:
let regex5 = /(?i)ba(?-i:na)na/
```

All option APIs are provided on `RegexComponent`, so they can be called on a `Regex` instance, or on any component that you would use inside a `RegexBuilder` block when the `RegexBuilder` module is imported.

The options that `Regex` supports are shown in the table below. Options that affect _matching behavior_ are supported through both regex syntax and APIs, while options that have _structural_ or _syntactic_ effects are only supported through regex syntax.

| **Matching Behavior**         |                |                             |
|-------------------------------|----------------|-----------------------------|
| Case insensitivity            | `(?i)`         | `ignoresCase()`             |
| Single-line mode              | `(?s)`         | `dotMatchesNewlines()`      |
| Multi-line mode               | `(?m)`         | `anchorsMatchLineEndings()` |
| ASCII-only character classes  | `(?DSWP)`      | `asciiOnlyDigits()`, etc    |
| Unicode word boundaries       | `(?w)`         | `wordBoundaryKind(_:)`      |
| Semantic level                | `(?Xu)`        | `matchingSemantics(_:)`     |
| Repetition behavior           | `(?U)`         | `repetitionBehavior(_:)`    |
| **Structural/Syntactic**      |                |                             |
| Extended syntax               | `(?x)`,`(?xx)` | n/a                         |
| Named captures only           | `(?n)`         | n/a                         |
| Shared capture names          | `(?J)`         | n/a                         |

#### Case insensitivity

Regexes perform case-sensitive comparisons by default. The `i` option or the `ignoresCase(_:)` method enables case-insensitive comparison.

```swift
let str = "Café"

str.firstMatch(of: /CAFÉ/)     // nil
str.firstMatch(of: /(?i)CAFÉ/) // "Café"
str.firstMatch(of: /(?i)cAfÉ/) // "Café"
```

Case-insensitive matching uses case folding to ensure that canonical equivalence continues to operate as expected.

**Regex syntax:** `(?i)...` or `(?i:...)`

**`RegexBuilder` API:**

```swift
extension RegexComponent {
  /// Returns a regular expression that ignores casing when matching.
  public func ignoresCase(_ ignoresCase: Bool = true) -> Regex<RegexOutput>
}
```

#### Single line mode (`.` matches newlines)

The "any" metacharacter (`.`) matches any character in a string *except* newlines by default. With the `s` option enabled, `.` matches any character including newlines.

```swift
let str = """
    <<This string
    uses double-angle-brackets
    to group text.>>
    """

str.firstMatch(of: /<<.+>>/)     // nil
str.firstMatch(of: /(?s)<<.+>>/) // "<<This string\nuses double-angle-brackets\nto group text.>>"
```

This option also affects the behavior of `CharacterClass.any`, which is designed to match the behavior of the `.` regex literal component.
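For illustration, here is a sketch of the same match written with `RegexBuilder`, using the proposed `CharacterClass.any` class together with the `dotMatchesNewlines()` method; the `groupedText` name is only for this example:

```swift
// A builder-style equivalent of /(?s)<<.+>>/ (a sketch; names are illustrative).
let groupedText = Regex {
    "<<"
    OneOrMore(.any)   // with dotMatchesNewlines(), `.any` also matches "\n"
    ">>"
}
.dotMatchesNewlines()

str.contains(groupedText)  // true
```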
**Regex syntax:** `(?s)...` or `(?s:...)`

**`RegexBuilder` API:**

```swift
extension RegexComponent {
  /// Returns a regular expression where the "any" metacharacter (`.`)
  /// also matches newlines.
  public func dotMatchesNewlines(_ dotMatchesNewlines: Bool = true) -> Regex<RegexOutput>
}
```

#### Multiline mode

By default, the start and end anchors (`^` and `$`) match only the beginning and end of a string. With the `m` option enabled, they also match the beginning and end of each line.

```swift
let str = """
    abc
    def
    ghi
    """

str.firstMatch(of: /^abc/)       // "abc"
str.firstMatch(of: /^abc$/)      // nil
str.firstMatch(of: /(?m)^abc$/)  // "abc"

str.firstMatch(of: /^def/)       // nil
str.firstMatch(of: /(?m)^def$/)  // "def"
```

This option applies only to anchors used in a regex literal. The anchors defined in `RegexBuilder` are specific about matching at the start/end of the input or the line, and therefore do not correspond directly with the `^` and `$` literal anchors.

```swift
str.firstMatch(of: Regex { Anchor.startOfInput ; "def" }) // nil
str.firstMatch(of: Regex { Anchor.startOfLine ; "def" })  // "def"
```

**Regex syntax:** `(?m)...` or `(?m:...)`

**`RegexBuilder` API:**

```swift
extension RegexComponent {
  /// Returns a regular expression where the start and end of input
  /// anchors (`^` and `$`) also match against the start and end of a line.
  public func anchorsMatchLineEndings(_ matchLineEndings: Bool = true) -> Regex<RegexOutput>
}
```

#### ASCII-only character classes

With one or more of these options enabled, the default character classes match only ASCII values instead of the full Unicode range of characters. Four options are included in this group:

* `D`: Match only ASCII members for `\d`, `\p{Digit}`, `[:digit:]`, and `CharacterClass.digit`.
* `S`: Match only ASCII members for `\s`, `\p{Space}`, `[:space:]`.
* `W`: Match only ASCII members for `\w`, `\p{Word}`, `[:word:]`, `\b`, `CharacterClass.word`, and `Anchor.wordBoundary`.
* `P`: Match only ASCII members for all POSIX properties (including `digit`, `space`, and `word`).

**Regex syntax:** `(?DSWP)...` or `(?DSWP:...)`

**`RegexBuilder` API:**

```swift
extension RegexComponent {
  /// Returns a regular expression that only matches ASCII characters as digits.
  public func asciiOnlyDigits(_ asciiOnly: Bool = true) -> Regex<RegexOutput>

  /// Returns a regular expression that only matches ASCII characters as space
  /// characters.
  public func asciiOnlyWhitespace(_ asciiOnly: Bool = true) -> Regex<RegexOutput>

  /// Returns a regular expression that only matches ASCII characters as "word
  /// characters".
  public func asciiOnlyWordCharacters(_ asciiOnly: Bool = true) -> Regex<RegexOutput>

  /// Returns a regular expression that only matches ASCII characters when
  /// matching character classes.
  public func asciiOnlyCharacterClasses(_ asciiOnly: Bool = true) -> Regex<RegexOutput>
}
```

#### Unicode word boundaries

By default, matching word boundaries with the `\b` and `Anchor.wordBoundary` anchors uses Unicode _default word boundaries,_ specified as [Unicode level 2 regular expression support][level2-word-boundaries].

Disabling the `w` option switches to _[simple word boundaries][level1-word-boundaries],_ finding word boundaries at points in the input where `\w\W` or `\W\w` match. Depending on the other matching options that are enabled, this may be more compatible with the behavior of other regex engines.
+ +As shown in this example, the default matching behavior finds the whole first word of the string, while the match with simple word boundaries stops at the apostrophe: + +```swift +let str = "Don't look down!" + +str.firstMatch(of: /D\S+\b/) // "Don't" +str.firstMatch(of: /(?-w)D\S+\b/) // "Don" +``` + +You can see more differences between level 1 and level 2 word boundaries in the following table: + +| Example | Level 1 | Level 2 | +|---------------------|---------------------------------|-------------------------------------------| +| I can't do that. | ["I", "can", "t", "do", "that"] | ["I", "can't", "do", "that", "."] | +| 🔥😊👍 | ["🔥😊👍"] | ["🔥", "😊", "👍"] | +| 👩🏻👶🏿👨🏽🧑🏾👩🏼 | ["👩🏻👶🏿👨🏽🧑🏾👩🏼"] | ["👩🏻", "👶🏿", "👨🏽", "🧑🏾", "👩🏼"] | +| 🇨🇦🇺🇸🇲🇽 | ["🇨🇦🇺🇸🇲🇽"] | ["🇨🇦", "🇺🇸", "🇲🇽"] | +| 〱㋞ツ | ["〱", "㋞", "ツ"] | ["〱㋞ツ"] | +| hello〱㋞ツ | ["hello〱", "㋞", "ツ"] | ["hello", "〱㋞ツ"] | +| 나는 Chicago에 산다 | ["나는", "Chicago에", "산다"] | ["나", "는", "Chicago", "에", "산", "다"] | +| 眼睛love食物 | ["眼睛love食物"] | ["眼", "睛", "love", "食", "物"] | +| 아니ㅋㅋㅋ네 | ["아니ㅋㅋㅋ네"] | ["아", "니", "ㅋㅋㅋ", "네"] | +| Re:Zero | ["Re", "Zero"] | ["Re:Zero"] | +| \u{d}\u{a} | ["\u{d}", "\u{a}"] | ["\u{d}\u{a}"] | +| €1 234,56 | ["1", "234", "56"] | ["€", "1", "234,56"] | + + +**Regex syntax:** `(?-w)...` or `(?-w...)` + +**`RegexBuilder` API:** + +```swift +extension RegexComponent { + /// Returns a regular expression that uses the specified word boundary algorithm. + /// + /// A simple word boundary is a position in the input between two characters + /// that match `/\w\W/` or `/\W\w/`, or between the start or end of the input + /// and `\w` character. Word boundaries therefore depend on the option-defined + /// behavior of `\w`. + /// + /// The default word boundaries use a Unicode algorithm that handles some cases + /// better than simple word boundaries, such as words with internal + /// punctuation, changes in script, and Emoji. + public func wordBoundaryKind(_ wordBoundaryKind: RegexWordBoundaryKind) -> Regex +} + +public struct RegexWordBoundaryKind: Hashable { + /// A word boundary algorithm that implements the "simple word boundary" + /// Unicode recommendation. + /// + /// A simple word boundary is a position in the input between two characters + /// that match `/\w\W/` or `/\W\w/`, or between the start or end of the input + /// and a `\w` character. Word boundaries therefore depend on the option- + /// defined behavior of `\w`. + public static var unicodeLevel1: Self { get } + + /// A word boundary algorithm that implements the "default word boundary" + /// Unicode recommendation. + /// + /// Default word boundaries use a Unicode algorithm that handles some cases + /// better than simple word boundaries, such as words with internal + /// punctuation, changes in script, and Emoji. + public static var unicodeLevel2: Self { get } +} +``` + +#### Matching semantic level + +When matching with grapheme cluster semantics (the default), metacharacters like `.` and `\w`, custom character classes, and character class instances like `.any` match a grapheme cluster when possible, corresponding with the default string representation. In addition, matching with grapheme cluster semantics compares characters using their canonical representation, corresponding with the way comparing strings for equality works. + +When matching with Unicode scalar semantics, metacharacters and character classes always match a single Unicode scalar value, even if that scalar comprises part of a grapheme cluster. 
These semantic levels lead to different results, especially when working with strings that have decomposed characters. In the following example, `queRegex` matches any 3-character string that begins with `"q"`.

```swift
let composed = "qué"
let decomposed = "que\u{301}"

let queRegex = /^q..$/

print(composed.contains(queRegex))
// Prints "true"
print(decomposed.contains(queRegex))
// Prints "true"
```

When using Unicode scalar semantics, however, the regex only matches the composed version of the string, because each `.` matches a single Unicode scalar value.

```swift
let queRegexScalar = queRegex.matchingSemantics(.unicodeScalar)
print(composed.contains(queRegexScalar))
// Prints "true"
print(decomposed.contains(queRegexScalar))
// Prints "false"
```

With grapheme cluster semantics, a grapheme cluster boundary is naturally enforced at the start and end of the match and every capture group. Matching with Unicode scalar semantics, on the other hand, including using the `\O` metacharacter or the `.anyUnicodeScalar` character class, can yield string indices that aren't aligned to character boundaries. Take care when using indices that aren't aligned with grapheme cluster boundaries, as they may have to be rounded to a boundary if used in a `String` instance.

```swift
let family = "👨‍👨‍👧‍👦 is a family"

// Grapheme-cluster mode: Yields a character
let firstCharacter = /^./
let characterMatch = family.firstMatch(of: firstCharacter)!.output
print(characterMatch)
// Prints "👨‍👨‍👧‍👦"

// Unicode-scalar mode: Yields only part of a character
let firstUnicodeScalar = /^./.matchingSemantics(.unicodeScalar)
let unicodeScalarMatch = family.firstMatch(of: firstUnicodeScalar)!.output
print(unicodeScalarMatch)
// Prints "👨"

// The end of `unicodeScalarMatch` is not aligned on a character boundary
print(unicodeScalarMatch.endIndex == family.index(after: family.startIndex))
// Prints "false"
```

When a regex proceeds with grapheme cluster semantics from a position that _isn't_ grapheme cluster aligned, it attempts to match the partial grapheme cluster that starts at that point. In the first call to `contains(_:)` below, `\O` matches a single Unicode scalar value, as shown above, and then the engine tries to match `\s` against the remainder of the family emoji character. Because that character is not whitespace, the match fails. The second call uses `\X`, which matches the entire emoji character, and then successfully matches the following space.

```swift
// \O matches a single Unicode scalar, whatever the current semantics
family.contains(/^\O\s/)  // false

// \X matches a single character, whatever the current semantics
family.contains(/^\X\s/)  // true
```

**Regex syntax:** `(?X)...` or `(?X:...)` for grapheme cluster semantics, `(?u)...` or `(?u:...)` for Unicode scalar semantics.

**`RegexBuilder` API:**

```swift
extension RegexComponent {
  /// Returns a regular expression that matches with the specified semantic
  /// level.
  public func matchingSemantics(_ semanticLevel: RegexSemanticLevel) -> Regex<RegexOutput>
}

public struct RegexSemanticLevel: Hashable {
  /// Match at the default semantic level of a string, where each matched
  /// element is a `Character`.
  public static var graphemeCluster: RegexSemanticLevel

  /// Match at the semantic level of a string's `UnicodeScalarView`, where each
  /// matched element is a `UnicodeScalar` value.
  public static var unicodeScalar: RegexSemanticLevel
}
```

#### Default repetition behavior

Regex quantifiers (`+`, `*`, and `?`) match eagerly by default when they repeat, such that they match the longest possible substring. Appending `?` to a quantifier makes it reluctant, instead, so that it matches the shortest possible substring.

```swift
let str = "<token>A value.</token>"

// By default, the '+' quantifier is eager, and consumes as much as possible.
str.firstMatch(of: /<.+>/)     // "<token>A value.</token>"

// Adding '?' makes the '+' quantifier reluctant, so that it consumes as little as possible.
str.firstMatch(of: /<.+?>/)    // "<token>"
```

The `U` option toggles the "eagerness" of quantifiers, so that quantifiers are reluctant by default, and only become eager when `?` is added to the quantifier.

```swift
// '(?U)' toggles the eagerness of quantifiers:
str.firstMatch(of: /(?U)<.+>/)   // "<token>"
str.firstMatch(of: /(?U)<.+?>/)  // "<token>A value.</token>"
```

**Regex syntax:** `(?U)...` or `(?U:...)`

**`RegexBuilder` API:**

The `repetitionBehavior(_:)` method lets you set the default behavior for all quantifiers that don't explicitly provide their own behavior. For example, you can make all quantifiers behave possessively, eliminating any quantification-caused backtracking.

```swift
extension RegexComponent {
  /// Returns a regular expression where quantifiers are reluctant by default
  /// instead of eager.
  public func repetitionBehavior(_ behavior: RegexRepetitionBehavior) -> Regex<RegexOutput>
}

public struct RegexRepetitionBehavior {
  /// Match as much of the input string as possible, backtracking when
  /// necessary.
  public static var eager: RegexRepetitionBehavior { get }

  /// Match as little of the input string as possible, expanding the matched
  /// region as necessary to complete a match.
  public static var reluctant: RegexRepetitionBehavior { get }

  /// Match as much of the input string as possible, performing no backtracking.
  public static var possessive: RegexRepetitionBehavior { get }
}
```

In order for this option to have the same effect on regexes built with `RegexBuilder` as with regex syntax, the `RegexBuilder` quantifier APIs are amended to have a `nil`-defaulted optional `behavior` parameter. For example:

```swift
extension OneOrMore {
  public init<W, C0, Component: RegexComponent>(
    _ behavior: RegexRepetitionBehavior? = nil,
    @RegexComponentBuilder _ component: () -> Component
  ) where Output == (Substring, C0), Component.Output == (W, C0)
}
```

When you pass `nil`, the quantifier uses the default behavior as set by this option (either eager or reluctant). If an explicit behavior is passed, that behavior is used regardless of the default.


---

### Character Classes

We propose the following definitions for regex character classes, along with a `CharacterClass` type as part of the `RegexBuilder` module, to encapsulate and simplify character class usage within builder-style regexes.

The two regexes defined in this example will match the same inputs, looking for one or more word characters followed by up to three digits, optionally separated by a space:

```swift
let regex1 = /\w+\s?\d{,3}/
let regex2 = Regex {
    OneOrMore(.word)
    Optionally(.whitespace)
    Repeat(.digit, ...3)
}
```

You can build custom character classes by combining regex-defined classes with individual characters or ranges, or by performing common set operations such as subtracting or negating a character class.
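As a rough sketch of that composition, assuming the set-algebra methods and range syntax listed later in this proposal, a class of ASCII consonants could be built by subtracting the vowels from a character range (the names here are illustrative only):

```swift
// Combining a character range with set subtraction (a sketch).
let asciiConsonants = CharacterClass("a"..."z").subtracting(.anyOf("aeiou"))

"rhythm".contains(OneOrMore(asciiConsonants))               // true
"audio".wholeMatch(of: OneOrMore(asciiConsonants)) == nil   // true
```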
+ + +#### “Any” + +The simplest character class, representing **any character**, is written as `.` or `CharacterClass.any` and is also referred to as the "dot" metacharacter. This class always matches a single `Character` or Unicode scalar value, depending on the matching semantic level. This class excludes newlines, unless "single line mode" is enabled (see section above). + +In the following example, using grapheme cluster semantics, a dot matches a grapheme cluster, so the decomposed é is treated as a single value: + +```swift +"Cafe\u{301}".contains(/C.../) +// true +``` + +For this example, using Unicode scalar semantics, a dot matches only a single Unicode scalar value, so the combining marks don't get grouped with the commas before them: + +```swift +let data = "\u{300},\u{301},\u{302},\u{303},..." +for match in data.matches(of: /(.),/.matchingSemantics(.unicodeScalar)) { + print(match.1) +} +// Prints: +// ̀ +// ́ +// ̂ +// ... +``` + +`Regex` also provides ways to select a specific level of "any" matching, without needing to change semantic levels. + +- The **any grapheme cluster** character class is written as `\X` or `CharacterClass.anyGraphemeCluster`, and matches from the current location up to the next grapheme cluster boundary. This includes matching newlines, regardless of any option settings. This metacharacter is equivalent to the regex syntax `(?s-u:.)`. +- The **any Unicode scalar** character class is written as `\O` or `CharacterClass.anyUnicodeScalar`, and matches exactly one Unicode scalar value at the current location. This includes matching newlines, regardless of any option settings, but only the first scalar in an `\r\n` cluster. This metacharacter is equivalent to the regex syntax `(?su:.)`. + +#### Digits + +The **decimal digit** character class is matched by `\d` or `CharacterClass.digit`. Both regexes in this example match one or more decimal digits followed by a colon: + +```swift +let regex1 = /\d+:/ +let regex2 = Regex { + OneOrMore(.digit) + ":" +} +``` + +_Unicode scalar semantics:_ Matches a Unicode scalar that has a `numericType` property equal to `.decimal`. This includes the digits from the ASCII range, from the _Halfwidth and Fullwidth Forms_ Unicode block, as well as digits in some scripts, like `DEVANAGARI DIGIT NINE` (U+096F). This corresponds to the general category `Decimal_Number`. + +_Grapheme cluster semantics:_ Matches a character made up of a single Unicode scalar that fits the decimal digit criteria above. + +_ASCII mode_: Matches a Unicode scalar in the range `0` to `9`. + + +To invert the decimal digit character class, use `\D` or `CharacterClass.digit.inverted`. + + +The **hexadecimal digit** character class is matched by `CharacterClass.hexDigit`. + +_Unicode scalar semantics:_ Matches a decimal digit, as described above, or an uppercase or small `A` through `F` from the _Halfwidth and Fullwidth Forms_ Unicode block. Note that this is a broader class than described by the `UnicodeScalar.properties.isHexDigit` property, as that property only include ASCII and fullwidth decimal digits. + +_Grapheme cluster semantics:_ Matches a character made up of a single Unicode scalar that fits the hex digit criteria above. + +_ASCII mode_: Matches a Unicode scalar in the range `0` to `9`, `a` to `f`, or `A` to `F`. + +To invert the hexadecimal digit character class, use `CharacterClass.hexDigit.inverted`. + +*
Rationale* + +Unicode's recommended definition for `\d` is its [numeric type][numerictype] of "Decimal" in contrast to "Digit". It is specifically restricted to sets of ascending contiguously-encoded scalars in a decimal radix positional numeral system. Thus, it excludes "digits" such as superscript numerals from its [definition][derivednumeric] and is a proper subset of `Character.isWholeNumber`. + +We interpret Unicode's definition of the set of scalars, especially its requirement that scalars be encoded in ascending chains, to imply that this class is restricted to scalars which meaningfully encode base-10 digits. Thus, we choose to make the grapheme cluster interpretation *restrictive*. + +
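As an illustration of the hexadecimal digit class in builder syntax, the following sketch matches a six-digit hex color code; the leading `"#"`, the input string, and the `colorRegex` name are assumptions for this example:

```swift
// Matching a hex color like "#1A2b3C" with the proposed `.hexDigit` class (a sketch).
let colorRegex = Regex {
    "#"
    Capture {
        Repeat(.hexDigit, count: 6)
    }
}

"color: #1A2b3C;".firstMatch(of: colorRegex)?.1   // "1A2b3C"
```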
+ + +#### "Word" characters + +The **word** character class is matched by `\w` or `CharacterClass.word`. This character class and its name are essentially terms of art within regexes, and represents part of a notional "word". Note that, by default, this is distinct from the algorithm for identifying word boundaries. + +_Unicode scalar semantics:_ Matches a Unicode scalar that has one of the Unicode properties `Alphabetic`, `Digit`, or `Join_Control`, or is in the general category `Mark` or `Connector_Punctuation`. + +_Grapheme cluster semantics:_ Matches a character that begins with a Unicode scalar value that fits the criteria above. + +_ASCII mode_: Matches the numbers `0` through `9`, lowercase and uppercase `A` through `Z`, and the underscore (`_`). + +To invert the word character class, use `\W` or `CharacterClass.word.inverted`. + +*
Rationale* + +Word characters include more than letters, and we went with Unicode's recommended scalar semantics. Following the Unicode recommendation that nonspacing marks remain with their base characters, we extend to grapheme clusters similarly to `Character.isLetter`. That is, combining scalars do not change the word-character-ness of the grapheme cluster. + +
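To make the definition concrete, here is a small sketch of the word character class in use; the input string is illustrative:

```swift
// `\w` matches letters, digits, marks, connector punctuation, and join controls,
// so the underscore and the accented character stay inside the same run.
let words = "café_menu = 3".matches(of: OneOrMore(.word)).map(\.output)
// ["café_menu", "3"]
```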
#### Whitespace and newlines

The **whitespace** character class is matched by `\s` and `CharacterClass.whitespace`.

_Unicode scalar semantics:_ Matches a Unicode scalar that has the Unicode `Whitespace` property, including a space, a horizontal tab (U+0009), `LINE FEED (LF)` (U+000A), `LINE TABULATION` (U+000B), `FORM FEED (FF)` (U+000C), `CARRIAGE RETURN (CR)` (U+000D), and `NEWLINE (NEL)` (U+0085). Note that under Unicode scalar semantics, `\s` only matches the first scalar in a `CR`+`LF` pair.

_Grapheme cluster semantics:_ Matches a character that begins with a `Whitespace` Unicode scalar value. This includes matching a `CR`+`LF` pair.

_ASCII mode_: Matches characters that are both ASCII and fit the criteria given above. The current matching semantics dictate whether a `CR`+`LF` pair is matched in ASCII mode.

The **horizontal whitespace** character class is matched by `\h` and `CharacterClass.horizontalWhitespace`.

_Unicode scalar semantics:_ Matches a Unicode scalar that has the Unicode general category `Zs`/`Space_Separator`, as well as a horizontal tab (U+0009).

_Grapheme cluster semantics:_ Matches a character that begins with a Unicode scalar value that fits the criteria above.

_ASCII mode_: Matches either a space (`" "`) or a horizontal tab.

The **vertical whitespace** character class is matched by `\v` and `CharacterClass.verticalWhitespace`. Additionally, `\R` and `CharacterClass.newlineSequence` provide a way to include the `CR`+`LF` pair, even when matching with Unicode scalar semantics.

_Unicode scalar semantics:_ Matches a Unicode scalar that has the Unicode general category `Zl`/`Line_Separator`, as well as any of the following control characters: `LINE FEED (LF)` (U+000A), `LINE TABULATION` (U+000B), `FORM FEED (FF)` (U+000C), `CARRIAGE RETURN (CR)` (U+000D), and `NEWLINE (NEL)` (U+0085). Only when specified as `\R` or `CharacterClass.newlineSequence` does this match the whole `CR`+`LF` pair.

_Grapheme cluster semantics:_ Matches a character that begins with a Unicode scalar value that fits the criteria above.

_ASCII mode_: Matches any of the four ASCII control characters listed above. The current matching semantics dictate whether a `CR`+`LF` pair is matched in ASCII mode.

To invert these character classes, use `\S`, `\H`, and `\V`, respectively, or the `inverted` property on a `CharacterClass` instance.
Rationale + +Note that "whitespace" is a term-of-art and is not correlated with visibility, which is a completely separate concept. + +We use Unicode's recommended scalar semantics for horizontal and vertical whitespace, extended to grapheme clusters as in the existing `Character.isWhitespace` property. + +
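For example, splitting on runs of whitespace with the class described above behaves as follows; this is a minimal sketch using the regex-taking `split` algorithm from the string processing algorithms additions:

```swift
// `\s+` matches runs of spaces, tabs, and newlines; under grapheme cluster
// semantics a CR+LF pair is a single whitespace character.
let fields = "alpha \t beta\r\ngamma".split(separator: /\s+/)
// ["alpha", "beta", "gamma"]
```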
#### Unicode properties

Character classes that match **Unicode properties** are written as `\p{PROPERTY}` or `\p{PROPERTY=VALUE}`, as described in the [Run-time Regex Construction proposal][internals-properties].

While most Unicode properties are only defined at the scalar level, some are defined to match an extended grapheme cluster. For example, `\p{RGI_Emoji_Flag_Sequence}` will match any flag emoji character, each of which is composed of two Unicode scalar values. Such property classes will match multiple scalars, even when matching with Unicode scalar semantics.

Unicode property matching is extended to `Character`s with a goal of consistency with other regex character classes. For `\p{Decimal}` and `\p{Hex_Digit}`, only single-scalar `Character`s can match, for the reasons described in that section, above. For all other Unicode property classes, matching `Character`s can comprise multiple scalars, as long as the first scalar matches the property.

To invert a Unicode property character class, use `\P{...}`.


#### POSIX character classes: `[:NAME:]`

**POSIX character classes** represent concepts that we'd like to define at all semantic levels. We propose the following definitions, some of which have been described above. When matching with grapheme cluster semantics, Unicode properties are extended to `Character`s as described in the rationale above, and as shown in the table below. That is, for the POSIX class `[:word:]`, any `Character` that starts with a matching scalar is a match, while for `[:digit:]`, a matching `Character` must only comprise a single Unicode scalar value.

| POSIX class  | Unicode property class            | Character behavior | ASCII mode value             |
|--------------|-----------------------------------|--------------------|------------------------------|
| `[:lower:]`  | `\p{Lowercase}`                   | starts-with        | `[a-z]`                      |
| `[:upper:]`  | `\p{Uppercase}`                   | starts-with        | `[A-Z]`                      |
| `[:alpha:]`  | `\p{Alphabetic}`                  | starts-with        | `[A-Za-z]`                   |
| `[:alnum:]`  | `[\p{Alphabetic}\p{Decimal}]`     | starts-with        | `[A-Za-z0-9]`                |
| `[:word:]`   | See \* below                      | starts-with        | `[[:alnum:]_]`               |
| `[:digit:]`  | `\p{DecimalNumber}`               | single-scalar      | `[0-9]`                      |
| `[:xdigit:]` | `\p{Hex_Digit}`                   | single-scalar      | `[0-9A-Fa-f]`                |
| `[:punct:]`  | `\p{Punctuation}`                 | starts-with        | `[-!"#%&'()*,./:;?@[\\\]{}]` |
| `[:blank:]`  | `[\p{Space_Separator}\u{09}]`     | starts-with        | `[ \t]`                      |
| `[:space:]`  | `\p{Whitespace}`                  | starts-with        | `[ \t\n\r\f\v]`              |
| `[:cntrl:]`  | `\p{Control}`                     | starts-with        | `[\x00-\x1f\x7f]`            |
| `[:graph:]`  | See \*\* below                    | starts-with        | `[^ [:cntrl:]]`              |
| `[:print:]`  | `[[:graph:][:blank:]--[:cntrl:]]` | starts-with        | `[[:graph:] ]`               |

\* The Unicode scalar property definition for `[:word:]` is `[\p{Alphanumeric}\p{Mark}\p{Join_Control}\p{Connector_Punctuation}]`.
\*\* The Unicode scalar property definition for `[:graph:]` is `[^\p{Space}\p{Control}\p{Surrogate}\p{Unassigned}]`.

#### Custom classes

Custom classes function as the set union of their individual components, whether those parts are individual characters, individual Unicode scalar values, ranges, Unicode property classes or POSIX classes, or other custom classes.

- Individual characters and scalars will be tested using the same behavior as if they were listed in an alternation. That is, a custom character class like `[abc]` is equivalent to `(a|b|c)` under the same options and modes.
+- When in grapheme cluster semantic mode, ranges of characters will test for membership using NFD form (or NFKD when performing caseless matching). This differs from how a `ClosedRange` would operate its `contains` method, since that depends on `String`'s `Comparable` conformance, but the decomposed comparison better aligns with the canonical equivalence matching used elsewhere in `Regex`. +- A custom character class will match a maximum of one `Character` or `UnicodeScalar`, depending on the matching semantic level. This means that a custom character class with extended grapheme cluster members may not match anything while using scalar semantics. + +Inside regexes, custom classes are enclosed in square brackets `[...]`, and can be nested or combined using set operators like `&&`. For more detail, see the [Run-time Regex Construction proposal][internals-charclass]. + +With `RegexBuilder`'s `CharacterClass` type, you can use built-in character classes with ranges and groups of characters. For example, to parse a valid octodecimal number, you could define a custom character class that combines `.digit` with a range of characters. + +```swift +let octoDecimalRegex: Regex<(Substring, Int?)> = Regex { + let charClass = CharacterClass(.digit, "a"..."h").ignoresCase() + Capture { + OneOrMore(charClass) + } transform: { Int($0, radix: 18) } +} +``` + +The full `CharacterClass` API is as follows: + +```swift +public struct CharacterClass: RegexComponent { + public var regex: Regex { get } + + public var inverted: CharacterClass { get } +} + +extension RegexComponent where Self == CharacterClass { + public static var any: CharacterClass { get } + + public static var anyGraphemeCluster: CharacterClass { get } + + public static var anyUnicodeScalar: CharacterClass { get } + + public static var digit: CharacterClass { get } + + public static var hexDigit: CharacterClass { get } + + public static var word: CharacterClass { get } + + public static var whitespace: CharacterClass { get } + + public static var horizontalWhitespace: CharacterClass { get } + + public static var newlineSequence: CharacterClass { get } + + public static var verticalWhitespace: CharacterClass { get } +} + +extension RegexComponent where Self == CharacterClass { + /// Returns a character class that matches any character in the given string + /// or sequence. + public static func anyOf(_ s: S) -> CharacterClass + where S.Element == Character + + /// Returns a character class that matches any unicode scalar in the given + /// sequence. + public static func anyOf(_ s: S) -> CharacterClass + where S.Element == UnicodeScalar +} + +// Unicode properties +extension CharacterClass { + /// Returns a character class that matches elements in the given Unicode + /// general category. + public static func generalCategory(_ category: Unicode.GeneralCategory) -> CharacterClass +} + +// Set algebra methods +extension CharacterClass { + /// Creates a character class that combines the given classes in a union. + public init(_ first: CharacterClass, _ rest: CharacterClass...) + + /// Returns a character class from the union of this class and the given class. + public func union(_ other: CharacterClass) -> CharacterClass + + /// Returns a character class from the intersection of this class and the given class. + public func intersection(_ other: CharacterClass) -> CharacterClass + + /// Returns a character class by subtracting the given class from this class. 
+ public func subtracting(_ other: CharacterClass) -> CharacterClass + + /// Returns a character class matching elements in one or the other, but not both, + /// of this class and the given class. + public func symmetricDifference(_ other: CharacterClass) -> CharacterClass +} + +/// Range syntax for characters in `CharacterClass`es. +public func ...(lhs: Character, rhs: Character) -> CharacterClass + +/// Range syntax for unicode scalars in `CharacterClass`es. +@_disfavoredOverload +public func ...(lhs: UnicodeScalar, rhs: UnicodeScalar) -> CharacterClass +``` + +## Source compatibility + +Everything in this proposal is additive, and has no compatibility effect on existing source code. + +## Effect on ABI stability + +Everything in this proposal is additive, and has no effect on existing stable ABI. + +## Effect on API resilience + +N/A + +## Future directions + +### Expanded options and modifiers + +The initial version of `Regex` includes only the options described above. Filling out the remainder of options described in the [Run-time Regex Construction proposal][literals] could be completed as future work, as well as additional improvements, such as adding an option that makes a regex match only at the start of a string. + +### Extensions to Character and Unicode Scalar APIs + +An earlier version of this pitch described adding standard library APIs to `Character` and `UnicodeScalar` for each of the supported character classes, as well as convenient static members for control characters. In addition, regex literals support Unicode property features that don’t currently exist in the standard library, such as a scalar’s script or extended category, or creating a scalar by its Unicode name instead of its scalar value. These kinds of additions are + +### Byte semantic mode + +A future `Regex` version could support a byte-level semantic mode in addition to grapheme cluster and Unicode scalar semantics. Byte-level semantics would allow matching individual bytes, potentially providing the capability of parsing string and non-string data together. + +### More general `CharacterSet` replacement + +Foundation's `CharacterSet` type is in some ways similar to the `CharacterClass` type defined in this proposal. `CharacterSet` is primarily a set type that is defined over Unicode scalars, and can therefore sometimes be awkward to use in conjunction with Swift `String`s. The proposed `CharacterClass` type is a `RegexBuilder`-specific type, and as such isn't intended to be a full general purpose replacement. Future work could involve expanding upon the `CharacterClass` API or introducing a different type to fill that role. + +## Alternatives considered + +### Operate on String.UnicodeScalarView instead of using semantic modes + +Instead of providing APIs to select whether `Regex` matching is `Character`-based vs. `UnicodeScalar`-based, we could instead provide methods to match against the different views of a string. This different approach has multiple drawbacks: + +* As the scalar level used when matching changes the behavior of individual components of a `Regex`, it’s more appropriate to specify the semantic level at the declaration site than the call site. +* With the proposed options model, you can define a Regex that includes different semantic levels for different portions of the match, which would be impossible with a call site-based approach. 
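As a sketch of that second point, the declaration-site model lets one portion of a pattern drop to Unicode scalar semantics while the rest keeps the default grapheme cluster semantics; the specific pattern below is illustrative only:

```swift
// Only the inner component uses Unicode scalar semantics; the outer pattern
// keeps the default grapheme cluster semantics.
let mixedLevels = Regex {
    "Caf"
    Regex {
        "e"
        ZeroOrMore(CharacterClass.generalCategory(.nonspacingMark))
    }
    .matchingSemantics(.unicodeScalar)
}

"Cafe\u{301}".contains(mixedLevels)   // true
```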
+ +### Binary word boundary option method + +A prior version of this proposal used a binary method for setting the word boundary algorithm, called `usingSimpleWordBoundaries()`. A method taking a `RegexWordBoundaryKind` instance is included in the proposal instead, to leave room for implementing other word boundary algorithms in the future. + + +[repo]: https://github.com/apple/swift-experimental-string-processing/ +[option-scoping]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#matching-options +[internals]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md +[internals-properties]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#character-properties +[internals-charclass]: https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md#custom-character-classes +[level1-word-boundaries]:https://unicode.org/reports/tr18/#Simple_Word_Boundaries +[level2-word-boundaries]:https://unicode.org/reports/tr18/#RL2.3 + +[overview]: https://forums.swift.org/t/declarative-string-processing-overview/52459 +[charprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md +[charpropsrationale]: https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md#detailed-semantics-and-rationale +[canoneq]: https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence +[graphemes]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries +[meaningless]: https://forums.swift.org/t/declarative-string-processing-overview/52459/121 +[scalarprops]: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md +[ucd]: https://www.unicode.org/reports/tr44/tr44-28.html +[numerictype]: https://www.unicode.org/reports/tr44/#Numeric_Type +[derivednumeric]: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt + + +[uts18]: https://unicode.org/reports/tr18/ +[proplist]: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt +[pcre]: https://www.pcre.org/current/doc/html/pcre2pattern.html +[perl]: https://perldoc.perl.org/perlre +[raku]: https://docs.raku.org/language/regexes +[rust]: https://docs.rs/regex/1.5.4/regex/ +[python]: https://docs.python.org/3/library/re.html +[ruby]: https://ruby-doc.org/core-2.4.0/Regexp.html +[csharp]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference +[icu]: https://unicode-org.github.io/icu/userguide/strings/regexp.html +[posix]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html +[oniguruma]: https://www.cuminas.jp/sdk/regularExpression.html +[go]: https://pkg.go.dev/regexp/syntax@go1.17.2 +[cplusplus]: https://www.cplusplus.com/reference/regex/ECMAScript/ +[ecmascript]: https://262.ecma-international.org/12.0/#sec-pattern-semantics +[re2]: https://github.com/google/re2/wiki/Syntax +[java]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html diff --git a/Package.swift b/Package.swift index 26b7f90af..f8162e762 100644 --- a/Package.swift +++ b/Package.swift @@ -76,7 +76,7 @@ let package = Package( .unsafeFlags(["-Xfrontend", "-enable-experimental-pairwise-build-block"]), .unsafeFlags(["-Xfrontend", 
"-disable-availability-checking"]) ]), - .target( + .testTarget( name: "Prototypes", dependencies: ["_RegexParser", "_StringProcessing"], swiftSettings: [ @@ -100,7 +100,7 @@ let package = Package( // MARK: Exercises .target( name: "Exercises", - dependencies: ["_RegexParser", "Prototypes", "_StringProcessing", "RegexBuilder"], + dependencies: ["_RegexParser", "_StringProcessing", "RegexBuilder"], swiftSettings: [ .unsafeFlags(["-Xfrontend", "-enable-experimental-pairwise-build-block"]), .unsafeFlags(["-Xfrontend", "-disable-availability-checking"]) diff --git a/README.md b/README.md index e8a6e387e..42586ad2b 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,21 @@ See [Declarative String Processing Overview][decl-string] - [Swift Trunk Development Snapshot](https://www.swift.org/download/#snapshots) DEVELOPMENT-SNAPSHOT-2022-03-09 or later. +## Trying it out + +To try out the functionality provided here, download the latest open source development toolchain. Import `_StringProcessing` in your source file to get access to the API and specify `-Xfrontend -enable-experimental-string-processing` to get access to the literals. + +For example, in a `Package.swift` file's target declaration: + +```swift +.target( + name: "foo", + dependencies: ["depA"], + swiftSettings: [.unsafeFlags(["-Xfrontend", "-enable-experimental-string-processing"])] + ), +``` + + ## Integration with Swift `_RegexParser` and `_StringProcessing` are specially integrated modules that are built as part of apple/swift. diff --git a/Sources/Exercises/Exercises.swift b/Sources/Exercises/Exercises.swift index f9801ca90..17c1eeb56 100644 --- a/Sources/Exercises/Exercises.swift +++ b/Sources/Exercises/Exercises.swift @@ -16,7 +16,6 @@ public enum Exercises { HandWrittenParticipant.self, RegexDSLParticipant.self, RegexLiteralParticipant.self, - PEGParticipant.self, NSREParticipant.self, ] } diff --git a/Sources/Exercises/Participants/PEGParticipant.swift b/Sources/Exercises/Participants/PEGParticipant.swift index 21bec6a7c..8987b6b0c 100644 --- a/Sources/Exercises/Participants/PEGParticipant.swift +++ b/Sources/Exercises/Participants/PEGParticipant.swift @@ -9,6 +9,9 @@ // //===----------------------------------------------------------------------===// +// Disabled because Prototypes is a test target. +#if false + struct PEGParticipant: Participant { static var name: String { "PEG" } } @@ -51,3 +54,4 @@ private func graphemeBreakPropertyData(forLine line: String) -> GraphemeBreakEnt } +#endif diff --git a/Sources/RegexBuilder/Algorithms.swift b/Sources/RegexBuilder/Algorithms.swift new file mode 100644 index 000000000..f1f6d97a0 --- /dev/null +++ b/Sources/RegexBuilder/Algorithms.swift @@ -0,0 +1,315 @@ +//===----------------------------------------------------------------------===// +// +// This source file is part of the Swift.org open source project +// +// Copyright (c) 2021-2022 Apple Inc. and the Swift project authors +// Licensed under Apache License v2.0 with Runtime Library Exception +// +// See https://swift.org/LICENSE.txt for license information +// +//===----------------------------------------------------------------------===// + +import _StringProcessing + +// FIXME(rdar://92459215): We should be using 'some RegexComponent' instead of +// for the methods below that don't impose any additional +// requirements on the type. Currently the generic parameter is needed to work +// around a compiler issue. 
+ +extension BidirectionalCollection where SubSequence == Substring { + /// Matches a regex in its entirety, where the regex is created by + /// the given closure. + /// + /// - Parameter content: A closure that returns a regex to match against. + /// - Returns: The match if there is one, or `nil` if none. + @available(SwiftStdlib 5.7, *) + public func wholeMatch( + @RegexComponentBuilder of content: () -> R + ) -> Regex.Match? { + wholeMatch(of: content()) + } + + /// Matches part of the regex, starting at the beginning, where the regex + /// is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to match against. + /// - Returns: The match if there is one, or `nil` if none. + @available(SwiftStdlib 5.7, *) + public func prefixMatch( + @RegexComponentBuilder of content: () -> R + ) -> Regex.Match? { + prefixMatch(of: content()) + } + + /// Returns a Boolean value indicating whether this collection contains a + /// match for the regex, where the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to search for within + /// this collection. + /// - Returns: `true` if the regex returned by `content` matched anywhere in + /// this collection, otherwise `false`. + @available(SwiftStdlib 5.7, *) + public func contains( + @RegexComponentBuilder _ content: () -> R + ) -> Bool { + contains(content()) + } + + /// Returns the range of the first match for the regex within this collection, + /// where the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to search for. + /// - Returns: A range in the collection of the first occurrence of the first + /// match of if the regex returned by `content`. Returns `nil` if no match + /// for the regex is found. + @available(SwiftStdlib 5.7, *) + public func firstRange( + @RegexComponentBuilder of content: () -> R + ) -> Range? { + firstRange(of: content()) + } + + // FIXME: Return `some Collection>` for SE-0346 + /// Returns the ranges of the all non-overlapping matches for the regex + /// within this collection, where the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to search for. + /// - Returns: A collection of ranges of all matches for the regex returned by + /// `content`. Returns an empty collection if no match for the regex + /// is found. + @available(SwiftStdlib 5.7, *) + public func ranges( + @RegexComponentBuilder of content: () -> R + ) -> [Range] { + ranges(of: content()) + } + + // FIXME: Return `some Collection` for SE-0346 + /// Returns the longest possible subsequences of the collection, in order, + /// around subsequence that match the regex created by the given closure. + /// + /// - Parameters: + /// - maxSplits: The maximum number of times to split the collection, + /// or one less than the number of subsequences to return. + /// - omittingEmptySubsequences: If `false`, an empty subsequence is + /// returned in the result for each consecutive pair of matches + /// and for each match at the start or end of the collection. If + /// `true`, only nonempty subsequences are returned. + /// - separator: A closure that returns a regex to be split upon. + /// - Returns: A collection of substrings, split from this collection's + /// elements. 
+ @available(SwiftStdlib 5.7, *) + public func split( + maxSplits: Int = Int.max, + omittingEmptySubsequences: Bool = true, + @RegexComponentBuilder separator: () -> R + ) -> [SubSequence] { + split(separator: separator(), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) + } + + /// Returns a Boolean value indicating whether the initial elements of this + /// collection are a match for the regex created by the given closure. + /// + /// - Parameter content: A closure that returns a regex to match at + /// the beginning of this collection. + /// - Returns: `true` if the initial elements of this collection match + /// regex returned by `content`; otherwise, `false`. + @available(SwiftStdlib 5.7, *) + public func starts( + @RegexComponentBuilder with content: () -> R + ) -> Bool { + starts(with: content()) + } + + /// Returns a subsequence of this collection by removing the elements + /// matching the regex from the start, where the regex is created by + /// the given closure. + /// + /// - Parameter content: A closure that returns the regex to search for at + /// the start of this collection. + /// - Returns: A collection containing the elements after those that match + /// the regex returned by `content`. If the regex does not match at + /// the start of the collection, the entire contents of this collection + /// are returned. + @available(SwiftStdlib 5.7, *) + public func trimmingPrefix( + @RegexComponentBuilder _ content: () -> R + ) -> SubSequence { + trimmingPrefix(content()) + } + + /// Returns the first match for the regex within this collection, where + /// the regex is created by the given closure. + /// + /// - Parameter content: A closure that returns the regex to search for. + /// - Returns: The first match for the regex created by `content` in this + /// collection, or `nil` if no match is found. + @available(SwiftStdlib 5.7, *) + public func firstMatch( + @RegexComponentBuilder of content: () -> R + ) -> Regex.Match? { + firstMatch(of: content()) + } + + // FIXME: Return `some Collection.Match> for SE-0346 + /// Returns a collection containing all non-overlapping matches of + /// the regex, created by the given closure. + /// + /// - Parameter content: A closure that returns the regex to search for. + /// - Returns: A collection of matches for the regex returned by `content`. + /// If no matches are found, the returned collection is empty. + @available(SwiftStdlib 5.7, *) + public func matches( + @RegexComponentBuilder of content: () -> R + ) -> [Regex.Match] { + matches(of: content()) + } +} + +extension RangeReplaceableCollection +where Self: BidirectionalCollection, SubSequence == Substring { + /// Removes the initial elements matching the regex from the start of + /// this collection, if the initial elements match, using the given closure + /// to create the regex. + /// + /// - Parameter content: A closure that returns the regex to search for + /// at the start of this collection. + @available(SwiftStdlib 5.7, *) + public mutating func trimPrefix( + @RegexComponentBuilder _ content: () -> R + ) { + trimPrefix(content()) + } + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closure to create the regex. + /// + /// - Parameters: + /// - replacement: The new elements to add to the collection in place of + /// each match for the regex, using `content` to create the regex. + /// - subrange: The range in the collection in which to search for + /// the regex. 
+ /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by `replacement`, using `content` to create the regex. + @available(SwiftStdlib 5.7, *) + public func replacing( + with replacement: Replacement, + subrange: Range, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R + ) -> Self where Replacement.Element == Element { + replacing(content(), with: replacement, subrange: subrange, maxReplacements: maxReplacements) + } + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closure to create the regex. + /// + /// - Parameters: + /// - replacement: The new elements to add to the collection in place of + /// each match for the regex, using `content` to create the regex. + /// - maxReplacements: A number specifying how many occurrences of regex + /// to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by `replacement`, using `content` to create the regex. + @available(SwiftStdlib 5.7, *) + public func replacing( + with replacement: Replacement, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R + ) -> Self where Replacement.Element == Element { + replacing(content(), with: replacement, maxReplacements: maxReplacements) + } + + /// Replaces all matches for the regex in this collection, using the given + /// closure to create the regex. + /// + /// - Parameters: + /// - replacement: The new elements to add to the collection in place of + /// each match for the regex, using `content` to create the regex. + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + @available(SwiftStdlib 5.7, *) + public mutating func replace( + with replacement: Replacement, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R + ) where Replacement.Element == Element { + replace(content(), with: replacement, maxReplacements: maxReplacements) + } + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closures to create the replacement + /// and the regex. + /// + /// - Parameters: + /// - subrange: The range in the collection in which to search for the + /// regex, using `content` to create the regex. + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by the result of calling `replacement`, where regex + /// is the result of calling `content`. 
+ @available(SwiftStdlib 5.7, *) + public func replacing( + subrange: Range, + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R, + with replacement: (Regex.Match) throws -> Replacement + ) rethrows -> Self where Replacement.Element == Element { + try replacing(content(), subrange: subrange, maxReplacements: maxReplacements, with: replacement) + } + + /// Returns a new collection in which all matches for the regex + /// are replaced, using the given closures to create the replacement + /// and the regex. + /// + /// - Parameters: + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace, using `content` to create the regex. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. + /// - Returns: A new collection in which all matches for regex in `subrange` + /// are replaced by the result of calling `replacement`, where regex is + /// the result of calling `content`. + @available(SwiftStdlib 5.7, *) + public func replacing( + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R, + with replacement: (Regex.Match) throws -> Replacement + ) rethrows -> Self where Replacement.Element == Element { + try replacing(content(), maxReplacements: maxReplacements, with: replacement) + } + + /// Replaces all matches for the regex in this collection, using the + /// given closures to create the replacement and the regex. + /// + /// - Parameters: + /// - maxReplacements: A number specifying how many occurrences of + /// the regex to replace, using `content` to create the regex. + /// - content: A closure that returns the collection to search for + /// and replace. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. 
+ @available(SwiftStdlib 5.7, *) + public mutating func replace( + maxReplacements: Int = .max, + @RegexComponentBuilder content: () -> R, + with replacement: (Regex.Match) throws -> Replacement + ) rethrows where Replacement.Element == Element { + try replace(content(), maxReplacements: maxReplacements, with: replacement) + } +} diff --git a/Sources/RegexBuilder/Anchor.swift b/Sources/RegexBuilder/Anchor.swift index 55b554aea..e8cd4ac54 100644 --- a/Sources/RegexBuilder/Anchor.swift +++ b/Sources/RegexBuilder/Anchor.swift @@ -9,7 +9,7 @@ // //===----------------------------------------------------------------------===// -import _RegexParser +@_implementationOnly import _RegexParser @_spi(RegexBuilder) import _StringProcessing @available(SwiftStdlib 5.7, *) @@ -31,34 +31,21 @@ public struct Anchor { @available(SwiftStdlib 5.7, *) extension Anchor: RegexComponent { - var astAssertion: AST.Atom.AssertionKind { - if !isInverted { - switch kind { - case .startOfSubject: return .startOfSubject - case .endOfSubjectBeforeNewline: return .endOfSubjectBeforeNewline - case .endOfSubject: return .endOfSubject - case .firstMatchingPositionInSubject: return .firstMatchingPositionInSubject - case .textSegmentBoundary: return .textSegment - case .startOfLine: return .startOfLine - case .endOfLine: return .endOfLine - case .wordBoundary: return .wordBoundary - } - } else { - switch kind { - case .startOfSubject: fatalError("Not yet supported") - case .endOfSubjectBeforeNewline: fatalError("Not yet supported") - case .endOfSubject: fatalError("Not yet supported") - case .firstMatchingPositionInSubject: fatalError("Not yet supported") - case .textSegmentBoundary: return .notTextSegment - case .startOfLine: fatalError("Not yet supported") - case .endOfLine: fatalError("Not yet supported") - case .wordBoundary: return .notWordBoundary - } + var baseAssertion: DSLTree._AST.AssertionKind { + switch kind { + case .startOfSubject: return .startOfSubject(isInverted) + case .endOfSubjectBeforeNewline: return .endOfSubjectBeforeNewline(isInverted) + case .endOfSubject: return .endOfSubject(isInverted) + case .firstMatchingPositionInSubject: return .firstMatchingPositionInSubject(isInverted) + case .textSegmentBoundary: return .textSegmentBoundary(isInverted) + case .startOfLine: return .startOfLine(isInverted) + case .endOfLine: return .endOfLine(isInverted) + case .wordBoundary: return .wordBoundary(isInverted) } } public var regex: Regex { - Regex(node: .atom(.assertion(astAssertion))) + Regex(node: .atom(.assertion(baseAssertion))) } } diff --git a/Sources/RegexBuilder/CharacterClass.swift b/Sources/RegexBuilder/CharacterClass.swift index 8d0cde435..3a96ba363 100644 --- a/Sources/RegexBuilder/CharacterClass.swift +++ b/Sources/RegexBuilder/CharacterClass.swift @@ -9,7 +9,7 @@ // //===----------------------------------------------------------------------===// -import _RegexParser +@_implementationOnly import _RegexParser @_spi(RegexBuilder) import _StringProcessing @available(SwiftStdlib 5.7, *) @@ -21,19 +21,10 @@ public struct CharacterClass { } init(unconverted model: _CharacterClassModel) { - // FIXME: Implement in DSLTree instead of wrapping an AST atom - switch model.makeAST() { - case .atom(let atom): - self.ccc = .init(members: [.atom(.unconverted(atom))]) - default: - fatalError("Unsupported _CharacterClassModel") + guard let ccc = model.makeDSLTreeCharacterClass() else { + fatalError("Unsupported character class") } - } - - init(property: AST.Atom.CharacterProperty) { - // FIXME: Implement in DSLTree 
instead of wrapping an AST atom - let astAtom = AST.Atom(.property(property), .fake) - self.ccc = .init(members: [.atom(.unconverted(astAtom))]) + self.ccc = ccc } } @@ -109,7 +100,7 @@ extension RegexComponent where Self == CharacterClass { members: s.map { .atom(.char($0)) })) } - /// Returns a character class that matches any unicode scalar in the given + /// Returns a character class that matches any Unicode scalar in the given /// sequence. public static func anyOf(_ s: S) -> CharacterClass where S.Element == UnicodeScalar @@ -123,15 +114,11 @@ extension RegexComponent where Self == CharacterClass { @available(SwiftStdlib 5.7, *) extension CharacterClass { public static func generalCategory(_ category: Unicode.GeneralCategory) -> CharacterClass { - guard let extendedCategory = category.extendedGeneralCategory else { - fatalError("Unexpected general category") - } - return CharacterClass(property: - .init(.generalCategory(extendedCategory), isInverted: false, isPOSIX: false)) + return CharacterClass(.generalCategory(category)) } } -/// Range syntax for characters in `CharacterClass`es. +/// Returns a character class that includes the characters in the given range. @available(SwiftStdlib 5.7, *) public func ...(lhs: Character, rhs: Character) -> CharacterClass { let range: DSLTree.CustomCharacterClass.Member = .range(.char(lhs), .char(rhs)) @@ -139,7 +126,7 @@ public func ...(lhs: Character, rhs: Character) -> CharacterClass { return CharacterClass(ccc) } -/// Range syntax for unicode scalars in `CharacterClass`es. +/// Returns a character class that includes the Unicode scalars in the given range. @_disfavoredOverload @available(SwiftStdlib 5.7, *) public func ...(lhs: UnicodeScalar, rhs: UnicodeScalar) -> CharacterClass { @@ -148,44 +135,6 @@ public func ...(lhs: UnicodeScalar, rhs: UnicodeScalar) -> CharacterClass { return CharacterClass(ccc) } -extension Unicode.GeneralCategory { - var extendedGeneralCategory: Unicode.ExtendedGeneralCategory? 
{ - switch self { - case .uppercaseLetter: return .uppercaseLetter - case .lowercaseLetter: return .lowercaseLetter - case .titlecaseLetter: return .titlecaseLetter - case .modifierLetter: return .modifierLetter - case .otherLetter: return .otherLetter - case .nonspacingMark: return .nonspacingMark - case .spacingMark: return .spacingMark - case .enclosingMark: return .enclosingMark - case .decimalNumber: return .decimalNumber - case .letterNumber: return .letterNumber - case .otherNumber: return .otherNumber - case .connectorPunctuation: return .connectorPunctuation - case .dashPunctuation: return .dashPunctuation - case .openPunctuation: return .openPunctuation - case .closePunctuation: return .closePunctuation - case .initialPunctuation: return .initialPunctuation - case .finalPunctuation: return .finalPunctuation - case .otherPunctuation: return .otherPunctuation - case .mathSymbol: return .mathSymbol - case .currencySymbol: return .currencySymbol - case .modifierSymbol: return .modifierSymbol - case .otherSymbol: return .otherSymbol - case .spaceSeparator: return .spaceSeparator - case .lineSeparator: return .lineSeparator - case .paragraphSeparator: return .paragraphSeparator - case .control: return .control - case .format: return .format - case .surrogate: return .surrogate - case .privateUse: return .privateUse - case .unassigned: return .unassigned - @unknown default: return nil - } - } -} - // MARK: - Set algebra methods @available(SwiftStdlib 5.7, *) diff --git a/Sources/RegexBuilder/DSL.swift b/Sources/RegexBuilder/DSL.swift index 4020e2035..10590fb74 100644 --- a/Sources/RegexBuilder/DSL.swift +++ b/Sources/RegexBuilder/DSL.swift @@ -9,7 +9,7 @@ // //===----------------------------------------------------------------------===// -import _RegexParser +@_implementationOnly import _RegexParser @_spi(RegexBuilder) import _StringProcessing @available(SwiftStdlib 5.7, *) @@ -95,8 +95,8 @@ extension UnicodeScalar: RegexComponent { // Note: Quantifiers are currently gyb'd. extension DSLTree.Node { - /// Generates a DSLTree node for a repeated range of the given DSLTree node. - /// Individual public API functions are in the generated Variadics.swift file. + // Individual public API functions are in the generated Variadics.swift file. + /// Generates a DSL tree node for a repeated range of the given node. @available(SwiftStdlib 5.7, *) static func repeating( _ range: Range, @@ -116,13 +116,13 @@ extension DSLTree.Node { return .quantification(.oneOrMore, kind, node) case _ where range.count == 1: // ..<1 or ...0 or any range with count == 1 // Note: `behavior` is ignored in this case - return .quantification(.exactly(.init(faking: range.lowerBound)), .default, node) + return .quantification(.exactly(range.lowerBound), .default, node) case (0, _): // 0..: _BuiltinRegexComponent { // MARK: - Groups -/// An atomic group, i.e. opens a local backtracking scope which, upon successful exit, -/// discards any remaining backtracking points from within the scope +/// An atomic group. +/// +/// This group opens a local backtracking scope which, upon successful exit, +/// discards any remaining backtracking points from within the scope. @available(SwiftStdlib 5.7, *) public struct Local: _BuiltinRegexComponent { public var regex: Regex @@ -265,6 +267,7 @@ public struct Local: _BuiltinRegexComponent { // MARK: - Backreference @available(SwiftStdlib 5.7, *) +/// A backreference. 
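The character-class factories and range operators revised above compose with `Reference` in builder syntax. The sketch below is illustrative only: the key/value format is made up, and it assumes the `Capture(as:)` overloads and the `Regex.Match` subscript that accept a `Reference`, which belong to the wider RegexBuilder API rather than this hunk:

```swift
import RegexBuilder

let number = Reference(Substring.self)
let setting = Regex {
    OneOrMore("a"..."z")                 // character range via the `...` operator above
    "="
    Capture(as: number) {
        OneOrMore(.anyOf("0123456789"))  // explicit character set via anyOf
    }
}

if let match = "retries=42".wholeMatch(of: setting) {
    print(match[number])                 // "42"
}
```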
public struct Reference: RegexComponent { let id = ReferenceID() diff --git a/Sources/RegexBuilder/Match.swift b/Sources/RegexBuilder/Match.swift deleted file mode 100644 index 78a466a18..000000000 --- a/Sources/RegexBuilder/Match.swift +++ /dev/null @@ -1,45 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021-2022 Apple Inc. and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// -//===----------------------------------------------------------------------===// - -import _StringProcessing - -@available(SwiftStdlib 5.7, *) -extension String { - @available(SwiftStdlib 5.7, *) - public func wholeMatch( - @RegexComponentBuilder of content: () -> R - ) -> Regex.Match? { - wholeMatch(of: content()) - } - - @available(SwiftStdlib 5.7, *) - public func prefixMatch( - @RegexComponentBuilder of content: () -> R - ) -> Regex.Match? { - prefixMatch(of: content()) - } -} - -extension Substring { - @available(SwiftStdlib 5.7, *) - public func wholeMatch( - @RegexComponentBuilder of content: () -> R - ) -> Regex.Match? { - wholeMatch(of: content()) - } - - @available(SwiftStdlib 5.7, *) - public func prefixMatch( - @RegexComponentBuilder of content: () -> R - ) -> Regex.Match? { - prefixMatch(of: content()) - } -} diff --git a/Sources/RegexBuilder/Variadics.swift b/Sources/RegexBuilder/Variadics.swift index 196a67cb4..f06978c8b 100644 --- a/Sources/RegexBuilder/Variadics.swift +++ b/Sources/RegexBuilder/Variadics.swift @@ -11,7 +11,6 @@ // BEGIN AUTO-GENERATED CONTENT -import _RegexParser @_spi(RegexBuilder) import _StringProcessing @available(SwiftStdlib 5.7, *) @@ -646,7 +645,7 @@ extension Repeat { ) where RegexOutput == Substring { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } @_disfavoredOverload @@ -656,7 +655,7 @@ extension Repeat { ) where RegexOutput == Substring { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } @_disfavoredOverload @@ -761,7 +760,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?), Component.RegexOutput == (W, C1) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -770,7 +769,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?), Component.RegexOutput == (W, C1) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ 
-873,7 +872,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?), Component.RegexOutput == (W, C1, C2) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -882,7 +881,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?), Component.RegexOutput == (W, C1, C2) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -985,7 +984,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?), Component.RegexOutput == (W, C1, C2, C3) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -994,7 +993,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?), Component.RegexOutput == (W, C1, C2, C3) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1097,7 +1096,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?), Component.RegexOutput == (W, C1, C2, C3, C4) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1106,7 +1105,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?), Component.RegexOutput == (W, C1, C2, C3, C4) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1209,7 +1208,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?), Component.RegexOutput == (W, C1, C2, C3, C4, C5) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1218,7 +1217,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?), Component.RegexOutput == (W, C1, C2, C3, C4, C5) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 
1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1321,7 +1320,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1330,7 +1329,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1433,7 +1432,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1442,7 +1441,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1545,7 +1544,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7, C8) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1554,7 +1553,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7, C8) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1657,7 +1656,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?, C9?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7, C8, C9) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - 
self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1666,7 +1665,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?, C9?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7, C8, C9) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( @@ -1769,7 +1768,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?, C9?, C10?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } public init( @@ -1778,7 +1777,7 @@ extension Repeat { ) where RegexOutput == (Substring, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?, C9?, C10?), Component.RegexOutput == (W, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } public init( diff --git a/Sources/VariadicsGenerator/VariadicsGenerator.swift b/Sources/VariadicsGenerator/VariadicsGenerator.swift index b0bd03b38..56e2f3790 100644 --- a/Sources/VariadicsGenerator/VariadicsGenerator.swift +++ b/Sources/VariadicsGenerator/VariadicsGenerator.swift @@ -112,8 +112,17 @@ let defaultAvailableAttr = "@available(SwiftStdlib 5.7, *)" @main struct VariadicsGenerator: ParsableCommand { @Option(help: "The maximum arity of declarations to generate.") - var maxArity: Int + var maxArity: Int = 10 + + @Flag(help: "Suppress status messages while generating.") + var silent: Bool = false + func log(_ message: String, terminator: String = "\n") { + if !silent { + print(message, terminator: terminator, to: &standardError) + } + } + func run() throws { precondition(maxArity > 1) precondition(maxArity < Counter.bitWidth) @@ -132,20 +141,17 @@ struct VariadicsGenerator: ParsableCommand { // BEGIN AUTO-GENERATED CONTENT - import _RegexParser @_spi(RegexBuilder) import _StringProcessing """) - print("Generating concatenation overloads...", to: &standardError) + log("Generating concatenation overloads...") for (leftArity, rightArity) in Permutations(totalArity: maxArity) { guard rightArity != 0 else { continue } - print( - " Left arity: \(leftArity) Right arity: \(rightArity)", - to: &standardError) + log(" Left arity: \(leftArity) Right arity: \(rightArity)") emitConcatenation(leftArity: leftArity, rightArity: rightArity) } @@ -155,42 +161,40 @@ struct VariadicsGenerator: ParsableCommand { output("\n\n") - print("Generating quantifiers...", to: &standardError) + log("Generating quantifiers...") for arity in 0...maxArity { - print(" Arity \(arity): ", terminator: "", to: &standardError) + log(" Arity \(arity): ", terminator: "") for kind in 
QuantifierKind.allCases { - print("\(kind.rawValue) ", terminator: "", to: &standardError) + log("\(kind.rawValue) ", terminator: "") emitQuantifier(kind: kind, arity: arity) } - print("repeating ", terminator: "", to: &standardError) + log("repeating ", terminator: "") emitRepeating(arity: arity) - print(to: &standardError) + log("") } - print("Generating atomic groups...", to: &standardError) + log("Generating atomic groups...") for arity in 0...maxArity { - print(" Arity \(arity): ", terminator: "", to: &standardError) + log(" Arity \(arity): ", terminator: "") emitAtomicGroup(arity: arity) - print(to: &standardError) + log("") } - print("Generating alternation overloads...", to: &standardError) + log("Generating alternation overloads...") for (leftArity, rightArity) in Permutations(totalArity: maxArity) { - print( - " Left arity: \(leftArity) Right arity: \(rightArity)", - to: &standardError) + log(" Left arity: \(leftArity) Right arity: \(rightArity)") emitAlternation(leftArity: leftArity, rightArity: rightArity) } - print("Generating 'AlternationBuilder.buildBlock(_:)' overloads...", to: &standardError) + log("Generating 'AlternationBuilder.buildBlock(_:)' overloads...") for arity in 1...maxArity { - print(" Capture arity: \(arity)", to: &standardError) + log(" Capture arity: \(arity)") emitUnaryAlternationBuildBlock(arity: arity) } - print("Generating 'capture' and 'tryCapture' overloads...", to: &standardError) + log("Generating 'capture' and 'tryCapture' overloads...") for arity in 0...maxArity { - print(" Capture arity: \(arity)", to: &standardError) + log(" Capture arity: \(arity)") emitCapture(arity: arity) } @@ -198,7 +202,7 @@ struct VariadicsGenerator: ParsableCommand { output("// END AUTO-GENERATED CONTENT\n") - print("Done!", to: &standardError) + log("Done!") } func tupleType(arity: Int, genericParameters: () -> String) -> String { @@ -492,7 +496,7 @@ struct VariadicsGenerator: ParsableCommand { ) \(params.whereClauseForInit) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component.regex.root)) + self.init(node: .quantification(.exactly(count), .default, component.regex.root)) } \(params.disfavored)\ @@ -502,7 +506,7 @@ struct VariadicsGenerator: ParsableCommand { ) \(params.whereClauseForInit) { assert(count > 0, "Must specify a positive count") // TODO: Emit a warning about `repeatMatch(count: 0)` or `repeatMatch(count: 1)` - self.init(node: .quantification(.exactly(.init(faking: count)), .default, component().regex.root)) + self.init(node: .quantification(.exactly(count), .default, component().regex.root)) } \(params.disfavored)\ @@ -517,7 +521,7 @@ struct VariadicsGenerator: ParsableCommand { \(params.disfavored)\ public init<\(params.genericParams), R: RangeExpression>( _ expression: R, - _ behavior: QuantificationBehavior? = nil, + _ behavior: RegexRepetitionBehavior? = nil, @\(concatBuilderName) _ component: () -> Component ) \(params.repeatingWhereClause) { self.init(node: .repeating(expression.relative(to: 0..= low) { - int idx = low + (high - low) / 2; - - const uint32_t entry = _swift_stdlib_graphemeBreakProperties[idx]; - - // Shift the enum and range count out of the value. - uint32_t lower = (entry << 11) >> 11; - - // Shift the enum out first, then shift out the scalar value. - uint32_t upper = lower + ((entry << 3) >> 24); - - // Shift everything out. 
- uint8_t enumValue = (uint8_t)(entry >> 29); - - // Special case: extendedPictographic who used an extra bit for the range. - if (enumValue == 5) { - upper = lower + ((entry << 2) >> 23); - } - - if (scalar >= lower && scalar <= upper) { - return enumValue; - } - - if (scalar > upper) { - low = idx + 1; - continue; - } - - if (scalar < lower) { - high = idx - 1; - continue; - } - } - - // If we made it out here, then our scalar was not found in the grapheme - // array (this occurs when a scalar doesn't map to any grapheme break - // property). Return the max value here to indicate .any. - return 0xFF; -} - -SWIFT_CC -_Bool _swift_stdlib_isLinkingConsonant(uint32_t scalar) { - intptr_t idx = _swift_stdlib_getScalarBitArrayIdx(scalar, - _swift_stdlib_linkingConsonant, - _swift_stdlib_linkingConsonant_ranks); - - if (idx == INTPTR_MAX) { - return false; - } - - return true; -} diff --git a/Sources/_CUnicode/UnicodeNormalization.c b/Sources/_CUnicode/UnicodeNormalization.c deleted file mode 100644 index 6c7abfc81..000000000 --- a/Sources/_CUnicode/UnicodeNormalization.c +++ /dev/null @@ -1,116 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021 Apple Inc. and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors -// -//===----------------------------------------------------------------------===// - -#if defined(__APPLE__) -#include "Apple/NormalizationData.h" -#else -#include "Common/NormalizationData.h" -#endif - -#include "include/UnicodeData.h" - -SWIFT_CC -uint16_t _swift_stdlib_getNormData(uint32_t scalar) { - // Fast Path: ASCII and some latiny scalars are very basic and have no - // normalization properties. - if (scalar < 0xC0) { - return 0; - } - - intptr_t dataIdx = _swift_stdlib_getScalarBitArrayIdx(scalar, - _swift_stdlib_normData, - _swift_stdlib_normData_ranks); - - // If we don't have an index into the data indices, then this scalar has no - // normalization information. - if (dataIdx == INTPTR_MAX) { - return 0; - } - - const uint8_t scalarDataIdx = _swift_stdlib_normData_data_indices[dataIdx]; - return _swift_stdlib_normData_data[scalarDataIdx]; -} - -SWIFT_CC -const uint8_t *_swift_stdlib_nfd_decompositions(void) { - return _swift_stdlib_nfd_decomp; -} - -SWIFT_CC -uint32_t _swift_stdlib_getDecompositionEntry(uint32_t scalar) { - intptr_t levelCount = NFD_DECOMP_LEVEL_COUNT; - intptr_t decompIdx = _swift_stdlib_getMphIdx(scalar, levelCount, - _swift_stdlib_nfd_decomp_keys, - _swift_stdlib_nfd_decomp_ranks, - _swift_stdlib_nfd_decomp_sizes); - - return _swift_stdlib_nfd_decomp_indices[decompIdx]; -} - -SWIFT_CC -uint32_t _swift_stdlib_getComposition(uint32_t x, uint32_t y) { - intptr_t levelCount = NFC_COMP_LEVEL_COUNT; - intptr_t compIdx = _swift_stdlib_getMphIdx(y, levelCount, - _swift_stdlib_nfc_comp_keys, - _swift_stdlib_nfc_comp_ranks, - _swift_stdlib_nfc_comp_sizes); - const uint32_t *array = _swift_stdlib_nfc_comp_indices[compIdx]; - - // Ensure that the first element in this array is equal to our y scalar. 
- const uint32_t realY = (array[0] << 11) >> 11; - - if (y != realY) { - return UINT32_MAX; - } - - const uint32_t count = array[0] >> 21; - - uint32_t low = 1; - uint32_t high = count - 1; - - while (high >= low) { - uint32_t idx = low + (high - low) / 2; - - const uint32_t entry = array[idx]; - - // Shift the range count out of the scalar. - const uint32_t lower = (entry << 15) >> 15; - - _Bool isNegative = entry >> 31; - uint32_t rangeCount = (entry << 1) >> 18; - - if (isNegative) { - rangeCount = -rangeCount; - } - - const uint32_t composed = lower + rangeCount; - - if (x == lower) { - return composed; - } - - if (x > lower) { - low = idx + 1; - continue; - } - - if (x < lower) { - high = idx - 1; - continue; - } - } - - // If we made it out here, then our scalar was not found in the composition - // array. - // Return the max here to indicate that we couldn't find one. - return UINT32_MAX; -} diff --git a/Sources/_CUnicode/include/UnicodeData.h b/Sources/_CUnicode/include/UnicodeData.h index 3ce6e3591..f846ddf68 100644 --- a/Sources/_CUnicode/include/UnicodeData.h +++ b/Sources/_CUnicode/include/UnicodeData.h @@ -32,32 +32,6 @@ intptr_t _swift_stdlib_getScalarBitArrayIdx(uint32_t scalar, const uint64_t *bitArrays, const uint16_t *ranks); -//===----------------------------------------------------------------------===// -// Normalization -//===----------------------------------------------------------------------===// - -SWIFT_CC -uint16_t _swift_stdlib_getNormData(uint32_t scalar); - -SWIFT_CC -const uint8_t *_swift_stdlib_nfd_decompositions(void); - -SWIFT_CC -uint32_t _swift_stdlib_getDecompositionEntry(uint32_t scalar); - -SWIFT_CC -uint32_t _swift_stdlib_getComposition(uint32_t x, uint32_t y); - -//===----------------------------------------------------------------------===// -// Grapheme Breaking -//===----------------------------------------------------------------------===// - -SWIFT_CC -uint8_t _swift_stdlib_getGraphemeBreakProperty(uint32_t scalar); - -SWIFT_CC -_Bool _swift_stdlib_isLinkingConsonant(uint32_t scalar); - //===----------------------------------------------------------------------===// // Scalar Props //===----------------------------------------------------------------------===// diff --git a/Sources/_RegexParser/Regex/AST/AST.swift b/Sources/_RegexParser/Regex/AST/AST.swift index 409d5a7ee..eae393289 100644 --- a/Sources/_RegexParser/Regex/AST/AST.swift +++ b/Sources/_RegexParser/Regex/AST/AST.swift @@ -9,8 +9,9 @@ // //===----------------------------------------------------------------------===// -/// A regex abstract syntax tree. This is a top-level type that stores the root -/// node. +/// A regex abstract syntax tree. +/// +/// This is a top-level type that stores the root node. public struct AST: Hashable { public var root: AST.Node public var globalOptions: GlobalMatchingOptionSequence? @@ -22,7 +23,7 @@ public struct AST: Hashable { } extension AST { - /// Whether this AST tree has nested somewhere inside it a capture. + /// Whether this AST tree contains at least one capture nested inside of it. public var hasCapture: Bool { root.hasCapture } /// The capture structure of this AST tree. @@ -94,7 +95,9 @@ extension AST.Node { _associatedValue as? T } - /// If this node is a parent node, access its children + /// The child nodes of this node. + /// + /// If the node isn't a parent node, this value is `nil`. public var children: [AST.Node]? { return (_associatedValue as? 
_ASTParent)?.children } @@ -103,7 +106,7 @@ extension AST.Node { _associatedValue.location } - /// Whether this node is "trivia" or non-semantic, like comments + /// Whether this node is trivia or non-semantic, like comments. public var isTrivia: Bool { switch self { case .trivia: return true @@ -111,7 +114,7 @@ extension AST.Node { } } - /// Whether this node has nested somewhere inside it a capture + /// Whether this node contains at least one capture nested inside of it. public var hasCapture: Bool { switch self { case .group(let g) where g.kind.value.isCapturing: @@ -122,7 +125,7 @@ extension AST.Node { return self.children?.any(\.hasCapture) ?? false } - /// Whether this AST node may be used as the operand of a quantifier such as + /// Whether this node may be used as the operand of a quantifier such as /// `?`, `+` or `*`. public var isQuantifiable: Bool { switch self { @@ -203,7 +206,9 @@ extension AST { } } - /// An Oniguruma absent function. This is used to model a pattern which should + /// An Oniguruma absent function. + /// + /// This is used to model a pattern which should /// not be matched against across varying scopes. public struct AbsentFunction: Hashable, _ASTNode { public enum Start: Hashable { diff --git a/Sources/_RegexParser/Regex/AST/Atom.swift b/Sources/_RegexParser/Regex/AST/Atom.swift index 1f6043d72..e17ce68bb 100644 --- a/Sources/_RegexParser/Regex/AST/Atom.swift +++ b/Sources/_RegexParser/Regex/AST/Atom.swift @@ -415,7 +415,7 @@ extension AST.Atom.CharacterProperty { } extension AST.Atom { - /// Anchors and other built-in zero-width assertions + /// Anchors and other built-in zero-width assertions. @frozen public enum AssertionKind: String { /// \A @@ -574,7 +574,7 @@ extension AST.Atom { } extension AST.Atom.Callout { - /// A tag specifier `[...]` which may appear in an Oniguruma callout. + /// A tag specifier `[...]` that can appear in an Oniguruma callout. public struct OnigurumaTag: Hashable { public var leftBracket: SourceLocation public var name: AST.Located @@ -668,8 +668,10 @@ extension AST.Atom.EscapedBuiltin { } extension AST.Atom { - /// Retrieve the character value of the atom if it represents a literal - /// character or unicode scalar, nil otherwise. + /// Retrieves the character value of the atom. + /// + /// If the atom doesn't represent a literal character or a Unicode scalar, + /// this value is `nil`. public var literalCharacterValue: Character? { switch kind { case .char(let c): @@ -711,9 +713,9 @@ extension AST.Atom { } } - /// Produce a string literal representation of the atom, if possible + /// A string literal representation of the atom, if possible. /// - /// Individual characters will be returned, Unicode scalars will be + /// Individual characters are returned as-is, and Unicode scalars are /// presented using "\u{nnnn}" syntax. public var literalStringValue: String? { switch kind { diff --git a/Sources/_RegexParser/Regex/AST/CustomCharClass.swift b/Sources/_RegexParser/Regex/AST/CustomCharClass.swift index 19e72aef5..c1dd4c620 100644 --- a/Sources/_RegexParser/Regex/AST/CustomCharClass.swift +++ b/Sources/_RegexParser/Regex/AST/CustomCharClass.swift @@ -104,8 +104,9 @@ extension CustomCC.Member { } extension AST.CustomCharacterClass { - /// Strip trivia from the character class members. This does not recurse into - /// nested custom character classes. + /// Strips trivia from the character class members. + /// + /// This method doesn't recurse into nested custom character classes. 
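The accessors documented above (`children`, `hasCapture`, `literalStringValue`, and so on) operate on a parsed tree. A rough sketch, assuming the `parseWithDelimiters(_:)` entry point that appears later in this patch and the `/.../` delimiter form (the pattern itself is arbitrary, and `_RegexParser` is an internal module shown here only for illustration):

```swift
import _RegexParser

// Parse a delimited pattern and inspect the resulting AST.
let ast = try! parseWithDelimiters("/(?<year>\\d{4})-\\d{2}/")
print(ast.hasCapture)                 // true — the named group is capturing
print(ast.root.children?.count ?? 0)  // number of top-level child nodes
```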
public var strippingTriviaShallow: Self { var copy = self copy.members = copy.members.filter(\.isSemantic) diff --git a/Sources/_RegexParser/Regex/AST/Group.swift b/Sources/_RegexParser/Regex/AST/Group.swift index a8c4f8b0f..8ecaadeda 100644 --- a/Sources/_RegexParser/Regex/AST/Group.swift +++ b/Sources/_RegexParser/Regex/AST/Group.swift @@ -78,6 +78,7 @@ extension AST { } extension AST.Group.Kind { + /// Whether the group is a capturing group. public var isCapturing: Bool { switch self { case .capture, .namedCapture, .balancedCapture: return true @@ -85,7 +86,9 @@ extension AST.Group.Kind { } } - /// If this is a named group, its name, `nil` otherwise. + /// The name of the group. + /// + /// If the group doesn't have a name, this value is `nil`. public var name: String? { switch self { case .namedCapture(let name): return name.value @@ -96,9 +99,11 @@ extension AST.Group.Kind { } } extension AST.Group.Kind { - /// If this group is a lookaround assertion, return its direction - /// and whether it is positive or negative. Otherwise returns - /// `nil`. + /// The direction of a lookaround assertion + /// and an indication of whether the assertion is positive or negative. + /// + /// If the group isn't a lookahead or lookbehind assertion, + /// this value is `nil`. public var lookaroundKind: (forwards: Bool, positive: Bool)? { switch self { case .lookahead: return (true, true) diff --git a/Sources/_RegexParser/Regex/AST/MatchingOptions.swift b/Sources/_RegexParser/Regex/AST/MatchingOptions.swift index 8e4b31bc5..e779c39fb 100644 --- a/Sources/_RegexParser/Regex/AST/MatchingOptions.swift +++ b/Sources/_RegexParser/Regex/AST/MatchingOptions.swift @@ -10,7 +10,7 @@ //===----------------------------------------------------------------------===// extension AST { - /// An option written in source that changes matching semantics. + /// An option, written in source, that changes matching semantics. public struct MatchingOption: Hashable { public enum Kind { // PCRE options @@ -83,7 +83,7 @@ extension AST { } } - /// A sequence of matching options written in source. + /// A sequence of matching options, written in source. public struct MatchingOptionSequence: Hashable { /// If the sequence starts with a caret '^', its source location, or nil /// otherwise. If this is set, it indicates that all the matching options @@ -138,8 +138,11 @@ extension AST.MatchingOptionSequence: _ASTPrintable { } extension AST { - /// Global matching option specifiers. Unlike `MatchingOptionSequence`, - /// these must appear at the start of the pattern, and apply globally. + /// Global matching option specifiers. + /// + /// Unlike `MatchingOptionSequence`, + /// these options must appear at the start of the pattern, + /// and they apply to the entire pattern. public struct GlobalMatchingOption: _ASTNode, Hashable { /// Determines the definition of a newline for the '.' character class and /// when parsing end-of-line comments. diff --git a/Sources/_RegexParser/Regex/AST/Quantification.swift b/Sources/_RegexParser/Regex/AST/Quantification.swift index f2189cb38..fa7e4de82 100644 --- a/Sources/_RegexParser/Regex/AST/Quantification.swift +++ b/Sources/_RegexParser/Regex/AST/Quantification.swift @@ -59,7 +59,7 @@ extension AST { /// MARK: - Semantic API extension AST.Quantification.Amount { - /// Get the bounds + /// The bounds. public var bounds: (atLeast: Int, atMost: Int?)
{ switch self { case .zeroOrMore: return (0, nil) diff --git a/Sources/_RegexParser/Regex/Parse/CaptureStructure.swift b/Sources/_RegexParser/Regex/Parse/CaptureStructure.swift index 8298dc207..9cb31c7d9 100644 --- a/Sources/_RegexParser/Regex/Parse/CaptureStructure.swift +++ b/Sources/_RegexParser/Regex/Parse/CaptureStructure.swift @@ -286,10 +286,11 @@ extension CaptureStructure { MemoryLayout.stride + inputUTF8CodeUnitCount + 1 } - /// Encode the capture structure to the given buffer as a serialized + /// Encodes the capture structure to the given buffer as a serialized /// representation. /// /// The encoding rules are as follows: + /// /// ``` /// encode(〚`T`〛) ==> , 〚`T`〛, .end /// 〚`T` (atom)〛 ==> .atom diff --git a/Sources/_RegexParser/Regex/Parse/Parse.swift b/Sources/_RegexParser/Regex/Parse/Parse.swift index a2790924a..ec6e1c26c 100644 --- a/Sources/_RegexParser/Regex/Parse/Parse.swift +++ b/Sources/_RegexParser/Regex/Parse/Parse.swift @@ -577,8 +577,8 @@ fileprivate func defaultSyntaxOptions( } } -/// Parse a given regex string with delimiters, inferring the syntax options -/// from the delimiter used. +/// Parses a given regex string with delimiters, inferring the syntax options +/// from the delimiters used. public func parseWithDelimiters( _ regex: S ) throws -> AST where S.SubSequence == Substring { diff --git a/Sources/_RegexParser/Regex/Parse/Source.swift b/Sources/_RegexParser/Regex/Parse/Source.swift index 6eac16395..23cc0497d 100644 --- a/Sources/_RegexParser/Regex/Parse/Source.swift +++ b/Sources/_RegexParser/Regex/Parse/Source.swift @@ -9,10 +9,12 @@ // //===----------------------------------------------------------------------===// -/// The source given to a parser. This can be bytes in memory, a file on disk, -/// something streamed over a network connection, etc. +// For now, we use String as the source while prototyping... + +/// The source of text being given to a parser. /// -/// For now, we use String... +/// This can be bytes in memory, a file on disk, +/// something streamed over a network connection, and so on. /// public struct Source { var input: Input @@ -37,7 +39,7 @@ extension Source { public typealias Input = String // for wrapper... public typealias Char = Character // for wrapper... - /// A precise point in the input, commonly used for bounded ranges + /// A precise point in the input, commonly used for bounded ranges. public typealias Position = String.Index } diff --git a/Sources/_RegexParser/Regex/Parse/SourceLocation.swift b/Sources/_RegexParser/Regex/Parse/SourceLocation.swift index a58473c96..eb51643bd 100644 --- a/Sources/_RegexParser/Regex/Parse/SourceLocation.swift +++ b/Sources/_RegexParser/Regex/Parse/SourceLocation.swift @@ -62,7 +62,7 @@ public protocol LocatedErrorProtocol: Error { } extension Source { - /// An error with source location info + /// An error that includes information about the location in source code. public struct LocatedError: Error, LocatedErrorProtocol { public let error: E public let location: SourceLocation @@ -77,10 +77,10 @@ extension Source { } } - /// Located value: a value wrapped with a source range + /// A value wrapped with a source range. /// - /// Note: source location is part of value identity, so that the same - /// e.g. `Character` appearing twice can be stored in a data structure + /// Note: Source location is part of value identity so that, for example, the + /// same `Character` value appearing twice can be stored in a data structure /// distinctly. 
To ignore source locations, use `.value` directly. public struct Located { public var value: T diff --git a/Sources/_RegexParser/Regex/Parse/SyntaxOptions.swift b/Sources/_RegexParser/Regex/Parse/SyntaxOptions.swift index b7c09ea1c..0a6270f1b 100644 --- a/Sources/_RegexParser/Regex/Parse/SyntaxOptions.swift +++ b/Sources/_RegexParser/Regex/Parse/SyntaxOptions.swift @@ -31,31 +31,31 @@ public struct SyntaxOptions: OptionSet { [.endOfLineComments, .nonSemanticWhitespace] } + // NOTE: Currently, this means we have raw quotes. + // Better would be to have real Swift string delimiter parsing logic. + /// `'a "." b' == '/a\Q.\Eb/'` - /// - /// NOTE: Currently, this means we have raw quotes. - /// Better would be to have real Swift string delimiter parsing logic. public static var experimentalQuotes: Self { Self(1 << 2) } + // NOTE: traditional comments are not nested. Currently, we are neither. + // Traditional comments can't have `)`, not even escaped in them either, we + // can. Traditional comments can have `*/` in them, we can't without + // escaping. We don't currently do escaping. + /// `'a /* comment */ b' == '/a(?#. comment )b/'` - /// - /// NOTE: traditional comments are not nested. Currently, we are neither. - /// Traditional comments can't have `)`, not even escaped in them either, we - /// can. Traditional comments can have `*/` in them, we can't without - /// escaping. We don't currently do escaping. public static var experimentalComments: Self { Self(1 << 3) } /// ``` - /// 'a{n...m}' == '/a{n,m}/' - /// 'a{n...*)` - /// `(_: .*)` == `(?:.*)` + /// `(_: .*)` == `(?:.*)` public static var experimentalCaptures: Self { Self(1 << 5) } /// The default syntax for a multi-line regex literal. diff --git a/Sources/_RegexParser/Regex/Printing/DumpAST.swift b/Sources/_RegexParser/Regex/Printing/DumpAST.swift index 8565b14e9..a9cf6b424 100644 --- a/Sources/_RegexParser/Regex/Printing/DumpAST.swift +++ b/Sources/_RegexParser/Regex/Printing/DumpAST.swift @@ -9,10 +9,11 @@ // //===----------------------------------------------------------------------===// -/// AST entities can be pretty-printed or dumped +/// AST entities that can be pretty-printed or dumped. /// -/// Alternative: just use `description` for pretty-print -/// and `debugDescription` for dump +/// As an alternative to this protocol, +/// you can also use `description` to pretty-print an AST, +/// and `debugDescription` to dump a debugging representation. public protocol _ASTPrintable: CustomStringConvertible, CustomDebugStringConvertible diff --git a/Sources/_RegexParser/Regex/Printing/PrettyPrinter.swift b/Sources/_RegexParser/Regex/Printing/PrettyPrinter.swift index f1d8c83b0..bf379fc14 100644 --- a/Sources/_RegexParser/Regex/Printing/PrettyPrinter.swift +++ b/Sources/_RegexParser/Regex/Printing/PrettyPrinter.swift @@ -9,17 +9,25 @@ // //===----------------------------------------------------------------------===// -/// Track and handle state relevant to pretty-printing ASTs. +/// State used when pretty-printing regex ASTs. public struct PrettyPrinter { // Configuration - /// Cut off pattern conversion after this many levels + /// The maximum number of levels, from the root of the tree, + /// at which to perform pattern conversion. + /// + /// A `nil` value indicates that there is no maximum, + /// and pattern conversion always takes place. public var maxTopDownLevels: Int?
- /// Cut off pattern conversion after this tree height + /// The maximum number of levels, from the leaf nodes of the tree, + /// at which to perform pattern conversion. + /// + /// A `nil` value indicates that there is no maximum, + /// and pattern conversion always takes place. public var minBottomUpLevels: Int? - /// How many spaces to indent with ("tab-width") + /// The number of spaces used for indentation. public var indentWidth = 2 // Internal state @@ -46,25 +54,27 @@ extension PrettyPrinter { self.minBottomUpLevels = minBottomUpLevels } - /// Output a string directly, without termination, without - /// indentation, and without updating _any_ internal state. + /// Outputs a string directly, without termination or + /// indentation, and without updating any internal state. /// - /// This is the low-level interface to the pret + /// This is the low-level interface to the pretty printer. /// - /// NOTE: If `s` includes a newline, even at the end, - /// this function will not update any tracking state. + /// - Note: If `s` includes a newline, even at the end, + /// this method does not update any tracking state. public mutating func output(_ s: String) { result += s } - /// Terminate a line, updating any relevant state + /// Terminates a line, updating any relevant state. public mutating func terminateLine() { output("\n") startOfLine = true } - /// Indent a new line, if at the start of a line, otherwise - /// does nothing. Updates internal state. + /// Indents a new line, if at the start of a line, otherwise + /// does nothing. + /// + /// This function updates internal state. public mutating func indent() { guard startOfLine else { return } let numCols = indentLevel * indentWidth @@ -72,7 +82,9 @@ extension PrettyPrinter { startOfLine = false } - // Finish, flush, and clear. Returns the rendered output + // Finish, flush, and clear. + // + // - Returns: The rendered output. public mutating func finish() -> String { defer { result = "" } return result @@ -85,18 +97,18 @@ extension PrettyPrinter { /// Print out a new entry. /// - /// This will property indent `s`, update any internal state, - /// and will also terminate the current line. + /// This method indents `s`, updates any internal state, + /// and terminates the current line. public mutating func print(_ s: String) { indent() output("\(s)") terminateLine() } - /// Print out a new entry by invoking `f` until it returns `nil`. + /// Prints out a new entry by invoking `f` until it returns `nil`. /// - /// This will property indent, update any internal state, - /// and will also terminate the current line. + /// This method indents the output, updates any internal state, + /// and terminates the current line. public mutating func printLine(_ f: () -> String?) { // TODO: What should we do if `f` never returns non-nil? indent() @@ -106,7 +118,7 @@ extension PrettyPrinter { terminateLine() } - /// Execute `f` at one increased level of indentation + /// Executes `f` at one increased level of indentation. public mutating func printIndented( _ f: (inout Self) -> () ) { @@ -115,7 +127,7 @@ extension PrettyPrinter { self.indentLevel -= 1 } - /// Execute `f` inside an indented "block", which has a header + /// Executes `f` inside an indented block, which has a header /// and delimiters.
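The indentation behavior described by `indent()`, `print(_:)`, and `printIndented(_:)` can be summarized with a small self-contained sketch. This is a simplified stand-in rather than the internal `PrettyPrinter` type itself; the two-space `indentWidth` mirrors the default above:

```swift
struct MiniPrinter {
    var indentWidth = 2
    private var level = 0
    private(set) var result = ""

    // Indent according to the current level, emit the entry, and terminate the line.
    mutating func print(_ s: String) {
        result += String(repeating: " ", count: level * indentWidth) + s + "\n"
    }

    // Run `body` with the indentation level temporarily increased by one.
    mutating func printIndented(_ body: (inout Self) -> Void) {
        level += 1
        body(&self)
        level -= 1
    }
}

var p = MiniPrinter()
p.print("alternation")
p.printIndented { $0.print("concatenation") }
print(p.result)
// alternation
//   concatenation
```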
public mutating func printBlock( _ header: String, diff --git a/Sources/_RegexParser/Regex/Printing/PrintAsCanonical.swift b/Sources/_RegexParser/Regex/Printing/PrintAsCanonical.swift index ab961ba51..59c0cc04a 100644 --- a/Sources/_RegexParser/Regex/Printing/PrintAsCanonical.swift +++ b/Sources/_RegexParser/Regex/Printing/PrintAsCanonical.swift @@ -12,7 +12,7 @@ // TODO: Round-tripping tests extension AST { - /// Render using Swift's preferred regex literal syntax + /// Renders using Swift's preferred regex literal syntax. public func renderAsCanonical( showDelimiters delimiters: Bool = false, terminateLine: Bool = false @@ -27,7 +27,7 @@ extension AST { } extension AST.Node { - /// Render using Swift's preferred regex literal syntax + /// Renders using Swift's preferred regex literal syntax. public func renderAsCanonical( showDelimiters delimiters: Bool = false, terminateLine: Bool = false @@ -38,8 +38,12 @@ extension AST.Node { } extension PrettyPrinter { - /// Will output `ast` in canonical form, taking care to - /// also indent and terminate the line (updating internal state) + /// Outputs a regular expression abstract syntax tree in canonical form, + /// indenting and terminating the line, and updating its internal state. + /// + /// - Parameter ast: The abstract syntax tree of the regular expression being output. + /// - Parameter delimiters: Whether to include commas between items. + /// - Parameter terminateLine: Whether to terminate the line. public mutating func printAsCanonical( _ ast: AST, delimiters: Bool = false, @@ -57,8 +61,8 @@ extension PrettyPrinter { } } - /// Output the `ast` in canonical form, does not indent, terminate, - /// or affect internal state + /// Outputs a regular expression abstract syntax tree in canonical form, + /// without indentation, line termination, or affecting its internal state. mutating func outputAsCanonical(_ ast: AST.Node) { switch ast { case let .alternation(a): diff --git a/Sources/_RegexParser/Utility/Misc.swift b/Sources/_RegexParser/Utility/Misc.swift index 55d3d3adc..bd9bc665e 100644 --- a/Sources/_RegexParser/Utility/Misc.swift +++ b/Sources/_RegexParser/Utility/Misc.swift @@ -111,8 +111,11 @@ extension Collection { } extension Collection where Element: Equatable { - /// Attempt to drop a given prefix from the collection, returning the - /// resulting subsequence, or `nil` if the prefix does not match. + /// Attempts to drop a given prefix from the collection. + /// + /// - Parameter other: The collection that contains the prefix. + /// - Returns: The resulting subsequence, + /// or `nil` if the prefix doesn't match. public func tryDropPrefix( _ other: C ) -> SubSequence? where C.Element == Element { @@ -121,8 +124,11 @@ extension Collection where Element: Equatable { return dropFirst(prefixCount) } - /// Attempt to drop a given suffix from the collection, returning the - /// resulting subsequence, or `nil` if the suffix does not match. + /// Attempts to drop a given suffix from the collection. + /// + /// - Parameter other: The collection that contains the suffix. + /// - Returns: The resulting subsequence, + /// or `nil` if the suffix doesn't match. public func tryDropSuffix( _ other: C ) -> SubSequence?
where C.Element == Element { diff --git a/Sources/_RegexParser/Utility/MissingUnicode.swift b/Sources/_RegexParser/Utility/MissingUnicode.swift index 4d819806b..b1a4a07ff 100644 --- a/Sources/_RegexParser/Utility/MissingUnicode.swift +++ b/Sources/_RegexParser/Utility/MissingUnicode.swift @@ -12,13 +12,13 @@ // MARK: - Missing stdlib API extension Unicode { + // Note: The `Script` enum includes the "meta" script type "Katakana_Or_Hiragana", which + // isn't defined by https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt, + // but is defined by https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt. + // We may want to split it out, as it's the only case that is a union of + // other script types. + /// Character script types. - /// - /// Note this includes the "meta" script type "Katakana_Or_Hiragana", which - /// isn't defined by https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt, - /// but is defined by https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt. - /// We may want to split it out, as it's the only case that is a union of - /// other script types. @frozen public enum Script: String, Hashable { case adlam = "Adlam" @@ -254,7 +254,8 @@ extension Unicode { case spaceSeparator = "Zs" } - /// A list of unicode properties that can either be true or false. + /// A list of Unicode properties that can either be true or false. + /// /// https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt @frozen public enum BinaryProperty: String, Hashable { @@ -328,9 +329,10 @@ extension Unicode { } } +// TODO: These should become aliases for the Block (blk) Unicode character +// property. + /// Oniguruma properties that are not covered by Unicode spellings. -/// TODO: These should become aliases for the Block (blk) Unicode character -/// property. @frozen public enum OnigurumaSpecialProperty: String, Hashable { case inBasicLatin = "In_Basic_Latin" @@ -657,18 +659,24 @@ public enum OnigurumaSpecialProperty: String, Hashable { } extension Character { + /// Whether this character represents an octal (base 8) digit, + /// for the purposes of pattern parsing. public var isOctalDigit: Bool { ("0"..."7").contains(self) } + /// Whether this character represents a word character, + /// for the purposes of pattern parsing. public var isWordCharacter: Bool { isLetter || isNumber || self == "_" } - /// Whether this character represents whitespace for the purposes of pattern - /// parsing. + /// Whether this character represents whitespace, + /// for the purposes of pattern parsing. public var isPatternWhitespace: Bool { return unicodeScalars.first!.properties.isPatternWhitespace } } extension UnicodeScalar { + /// Whether this character represents a printable ASCII character, + /// for the purposes of pattern parsing. public var isPrintableASCII: Bool { // Exclude non-printables before the space character U+20, and anything // including and above the DEL character U+7F. diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Contains.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Contains.swift index 1d4332ad0..2a1ef72a2 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Contains.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Contains.swift @@ -28,16 +28,16 @@ extension Collection where Element: Equatable { /// - Returns: `true` if the collection contains the specified sequence, /// otherwise `false`. 
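A usage sketch for the generic `contains(_:)` algorithm documented here and declared just below; the values are illustrative and a 5.7 standard library is assumed:

```swift
let scores = [3, 1, 4, 1, 5, 9, 2, 6]
scores.contains([1, 5, 9])      // true — a contiguous subsequence
scores.contains([9, 5, 1])      // false

"Hello, world".contains(", wo") // true, via the StringProtocol overloads below
```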
@available(SwiftStdlib 5.7, *) - public func contains(_ other: S) -> Bool - where S.Element == Element + public func contains(_ other: C) -> Bool + where C.Element == Element { firstRange(of: other) != nil } } extension BidirectionalCollection where Element: Comparable { - func contains(_ other: S) -> Bool - where S.Element == Element + func contains(_ other: C) -> Bool + where C.Element == Element { if #available(SwiftStdlib 5.7, *) { return firstRange(of: other) != nil @@ -46,6 +46,20 @@ extension BidirectionalCollection where Element: Comparable { } } +// Overload breakers + +extension StringProtocol { + @available(SwiftStdlib 5.7, *) + public func contains(_ other: String) -> Bool { + firstRange(of: other) != nil + } + + @available(SwiftStdlib 5.7, *) + public func contains(_ other: Substring) -> Bool { + firstRange(of: other) != nil + } +} + // MARK: Regex algorithms extension BidirectionalCollection where SubSequence == Substring { diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/FirstRange.swift b/Sources/_StringProcessing/Algorithms/Algorithms/FirstRange.swift index 508c04663..42703827e 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/FirstRange.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/FirstRange.swift @@ -32,31 +32,33 @@ extension BidirectionalCollection { // MARK: Fixed pattern algorithms extension Collection where Element: Equatable { - /// Finds and returns the range of the first occurrence of a given sequence - /// within the collection. - /// - Parameter sequence: The sequence to search for. + /// Finds and returns the range of the first occurrence of a given collection + /// within this collection. + /// + /// - Parameter other: The collection to search for. /// - Returns: A range in the collection of the first occurrence of `sequence`. /// Returns nil if `sequence` is not found. @available(SwiftStdlib 5.7, *) - public func firstRange( - of sequence: S - ) -> Range? where S.Element == Element { + public func firstRange( + of other: C + ) -> Range? where C.Element == Element { // TODO: Use a more efficient search algorithm - let searcher = ZSearcher(pattern: Array(sequence), by: ==) + let searcher = ZSearcher(pattern: Array(other), by: ==) return searcher.search(self[...], in: startIndex..( - of other: S - ) -> Range? where S.Element == Element { + public func firstRange( + of other: C + ) -> Range? where C.Element == Element { let searcher = PatternOrEmpty( searcher: TwoWaySearcher(pattern: Array(other))) let slice = self[...] diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift index 853c73271..33a9748ac 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift @@ -175,9 +175,9 @@ extension BidirectionalCollection { // MARK: Fixed pattern algorithms extension Collection where Element: Equatable { - func ranges( - of other: S - ) -> RangesCollection> where S.Element == Element { + func ranges( + of other: C + ) -> RangesCollection> where C.Element == Element { ranges(of: ZSearcher(pattern: Array(other), by: ==)) } @@ -188,9 +188,9 @@ extension Collection where Element: Equatable { /// - Returns: A collection of ranges of all occurrences of `other`. Returns /// an empty collection if `other` is not found. 
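A sketch of the fixed-pattern `firstRange(of:)` and `ranges(of:)` algorithms in this hunk; the strings are illustrative and a 5.7 standard library is assumed:

```swift
let text = "banana band"
if let range = text.firstRange(of: "an") {
    print(text[range])                 // "an" — the first occurrence
}
print(text.ranges(of: "an").count)     // 3 — non-overlapping occurrences
```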
@available(SwiftStdlib 5.7, *) - public func ranges( - of other: S - ) -> [Range] where S.Element == Element { + public func ranges( + of other: C + ) -> [Range] where C.Element == Element { ranges(of: ZSearcher(pattern: Array(other), by: ==)).map { $0 } } } @@ -207,10 +207,10 @@ extension BidirectionalCollection where Element: Equatable { } extension BidirectionalCollection where Element: Comparable { - func ranges( - of other: S + func ranges( + of other: C ) -> RangesCollection>> - where S.Element == Element + where C.Element == Element { ranges(of: PatternOrEmpty(searcher: TwoWaySearcher(pattern: Array(other)))) } @@ -247,6 +247,7 @@ extension BidirectionalCollection where SubSequence == Substring { // FIXME: Return `some Collection>` for SE-0346 /// Finds and returns the ranges of the all occurrences of a given sequence /// within the collection. + /// /// - Parameter regex: The regex to search for. /// - Returns: A collection or ranges in the receiver of all occurrences of /// `regex`. Returns an empty collection if `regex` is not found. diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Replace.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Replace.swift index 4a6da6c10..217fb90d6 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Replace.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Replace.swift @@ -78,12 +78,12 @@ extension RangeReplaceableCollection where Element: Equatable { /// - Returns: A new collection in which all occurrences of `other` in /// `subrange` of the collection are replaced by `replacement`. @available(SwiftStdlib 5.7, *) - public func replacing( - _ other: S, + public func replacing( + _ other: C, with replacement: Replacement, subrange: Range, maxReplacements: Int = .max - ) -> Self where S.Element == Element, Replacement.Element == Element { + ) -> Self where C.Element == Element, Replacement.Element == Element { replacing( ZSearcher(pattern: Array(other), by: ==), with: replacement, @@ -101,11 +101,11 @@ extension RangeReplaceableCollection where Element: Equatable { /// - Returns: A new collection in which all occurrences of `other` in /// `subrange` of the collection are replaced by `replacement`. @available(SwiftStdlib 5.7, *) - public func replacing( - _ other: S, + public func replacing( + _ other: C, with replacement: Replacement, maxReplacements: Int = .max - ) -> Self where S.Element == Element, Replacement.Element == Element { + ) -> Self where C.Element == Element, Replacement.Element == Element { replacing( other, with: replacement, @@ -120,11 +120,11 @@ extension RangeReplaceableCollection where Element: Equatable { /// - maxReplacements: A number specifying how many occurrences of `other` /// to replace. Default is `Int.max`. 
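Similarly, a hedged sketch of the `ranges(of:)` and `replacing(_:with:maxReplacements:)` overloads whose generic parameters are renamed above (example values are hypothetical):

```swift
let word = "banana"

word.ranges(of: "an").count                          // 2
word.replacing("an", with: "um")                     // "bumuma"
word.replacing("a", with: "o", maxReplacements: 2)   // "bonona"
```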
@available(SwiftStdlib 5.7, *) - public mutating func replace( - _ other: S, + public mutating func replace( + _ other: C, with replacement: Replacement, maxReplacements: Int = .max - ) where S.Element == Element, Replacement.Element == Element { + ) where C.Element == Element, Replacement.Element == Element { self = replacing( other, with: replacement, @@ -136,12 +136,12 @@ extension RangeReplaceableCollection where Element: Equatable { extension RangeReplaceableCollection where Self: BidirectionalCollection, Element: Comparable { - func replacing( - _ other: S, + func replacing( + _ other: C, with replacement: Replacement, subrange: Range, maxReplacements: Int = .max - ) -> Self where S.Element == Element, Replacement.Element == Element { + ) -> Self where C.Element == Element, Replacement.Element == Element { replacing( PatternOrEmpty(searcher: TwoWaySearcher(pattern: Array(other))), with: replacement, @@ -149,11 +149,11 @@ extension RangeReplaceableCollection maxReplacements: maxReplacements) } - func replacing( - _ other: S, + func replacing( + _ other: C, with replacement: Replacement, maxReplacements: Int = .max - ) -> Self where S.Element == Element, Replacement.Element == Element { + ) -> Self where C.Element == Element, Replacement.Element == Element { replacing( other, with: replacement, @@ -161,11 +161,11 @@ extension RangeReplaceableCollection maxReplacements: maxReplacements) } - mutating func replace( - _ other: S, + mutating func replace( + _ other: C, with replacement: Replacement, maxReplacements: Int = .max - ) where S.Element == Element, Replacement.Element == Element { + ) where C.Element == Element, Replacement.Element == Element { self = replacing( other, with: replacement, diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Split.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Split.swift index 8c7a9832d..ab465c382 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Split.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Split.swift @@ -15,13 +15,28 @@ struct SplitCollection { public typealias Base = Searcher.Searched let ranges: RangesCollection - - init(ranges: RangesCollection) { + var maxSplits: Int + var omittingEmptySubsequences: Bool + + init( + ranges: RangesCollection, + maxSplits: Int, + omittingEmptySubsequences: Bool) + { self.ranges = ranges + self.maxSplits = maxSplits + self.omittingEmptySubsequences = omittingEmptySubsequences } - init(base: Base, searcher: Searcher) { + init( + base: Base, + searcher: Searcher, + maxSplits: Int, + omittingEmptySubsequences: Bool) + { self.ranges = base.ranges(of: searcher) + self.maxSplits = maxSplits + self.omittingEmptySubsequences = omittingEmptySubsequences } } @@ -30,97 +45,131 @@ extension SplitCollection: Sequence { let base: Base var index: Base.Index var ranges: RangesCollection.Iterator - var isDone: Bool - - init(ranges: RangesCollection) { + var maxSplits: Int + var omittingEmptySubsequences: Bool + + var splitCounter = 0 + var isDone = false + + init( + ranges: RangesCollection, + maxSplits: Int, + omittingEmptySubsequences: Bool + ) { self.base = ranges.base self.index = base.startIndex self.ranges = ranges.makeIterator() - self.isDone = false + self.maxSplits = maxSplits + self.omittingEmptySubsequences = omittingEmptySubsequences } public mutating func next() -> Base.SubSequence? { guard !isDone else { return nil } - guard let range = ranges.next() else { + /// Return the rest of base if it's non-empty or we're including + /// empty subsequences. 
+ func finish() -> Base.SubSequence? { isDone = true - return base[index...] + return index == base.endIndex && omittingEmptySubsequences + ? nil + : base[index...] + } + + if index == base.endIndex { + return finish() + } + + if splitCounter >= maxSplits { + return finish() } - defer { index = range.upperBound } - return base[index.. Iterator { - Iterator(ranges: ranges) - } -} - -extension SplitCollection: Collection { - public struct Index { - var start: Base.Index - var base: RangesCollection.Index - var isEndIndex: Bool - } - - public var startIndex: Index { - let base = ranges.startIndex - return Index(start: ranges.base.startIndex, base: base, isEndIndex: false) - } - - public var endIndex: Index { - Index(start: ranges.base.endIndex, base: ranges.endIndex, isEndIndex: true) - } - - public func formIndex(after index: inout Index) { - guard !index.isEndIndex else { fatalError("Cannot advance past endIndex") } - - if let range = index.base.range { - let newStart = range.upperBound - ranges.formIndex(after: &index.base) - index.start = newStart - } else { - index.isEndIndex = true - } - } - - public func index(after index: Index) -> Index { - var index = index - formIndex(after: &index) - return index - } - - public subscript(index: Index) -> Base.SubSequence { - guard !index.isEndIndex else { - fatalError("Cannot subscript using endIndex") - } - let end = index.base.range?.lowerBound ?? ranges.base.endIndex - return ranges.base[index.start.. Bool { - switch (lhs.isEndIndex, rhs.isEndIndex) { - case (false, false): - return lhs.start == rhs.start - case (let lhs, let rhs): - return lhs == rhs - } - } - - static func < (lhs: Self, rhs: Self) -> Bool { - switch (lhs.isEndIndex, rhs.isEndIndex) { - case (true, _): - return false - case (_, true): - return true - case (false, false): - return lhs.start < rhs.start - } - } -} +//extension SplitCollection: Collection { +// public struct Index { +// var start: Base.Index +// var base: RangesCollection.Index +// var isEndIndex: Bool +// } +// +// public var startIndex: Index { +// let base = ranges.startIndex +// return Index(start: ranges.base.startIndex, base: base, isEndIndex: false) +// } +// +// public var endIndex: Index { +// Index(start: ranges.base.endIndex, base: ranges.endIndex, isEndIndex: true) +// } +// +// public func formIndex(after index: inout Index) { +// guard !index.isEndIndex else { fatalError("Cannot advance past endIndex") } +// +// if let range = index.base.range { +// let newStart = range.upperBound +// ranges.formIndex(after: &index.base) +// index.start = newStart +// } else { +// index.isEndIndex = true +// } +// } +// +// public func index(after index: Index) -> Index { +// var index = index +// formIndex(after: &index) +// return index +// } +// +// public subscript(index: Index) -> Base.SubSequence { +// guard !index.isEndIndex else { +// fatalError("Cannot subscript using endIndex") +// } +// let end = index.base.range?.lowerBound ?? ranges.base.endIndex +// return ranges.base[index.start.. 
Bool { +// switch (lhs.isEndIndex, rhs.isEndIndex) { +// case (false, false): +// return lhs.start == rhs.start +// case (let lhs, let rhs): +// return lhs == rhs +// } +// } +// +// static func < (lhs: Self, rhs: Self) -> Bool { +// switch (lhs.isEndIndex, rhs.isEndIndex) { +// case (true, _): +// return false +// case (_, true): +// return true +// case (false, false): +// return lhs.start < rhs.start +// } +// } +//} // MARK: `ReversedSplitCollection` @@ -176,10 +225,15 @@ extension ReversedSplitCollection: Sequence { extension Collection { func split( - by separator: Searcher + by separator: Searcher, + maxSplits: Int, + omittingEmptySubsequences: Bool ) -> SplitCollection where Searcher.Searched == Self { - // TODO: `maxSplits`, `omittingEmptySubsequences`? - SplitCollection(base: self, searcher: separator) + SplitCollection( + base: self, + searcher: separator, + maxSplits: maxSplits, + omittingEmptySubsequences: omittingEmptySubsequences) } } @@ -198,9 +252,11 @@ extension BidirectionalCollection { extension Collection { // TODO: Non-escaping and throwing func split( - whereSeparator predicate: @escaping (Element) -> Bool + whereSeparator predicate: @escaping (Element) -> Bool, + maxSplits: Int, + omittingEmptySubsequences: Bool ) -> SplitCollection> { - split(by: PredicateConsumer(predicate: predicate)) + split(by: PredicateConsumer(predicate: predicate), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } } @@ -216,9 +272,11 @@ extension BidirectionalCollection where Element: Equatable { extension Collection where Element: Equatable { func split( - by separator: Element + by separator: Element, + maxSplits: Int, + omittingEmptySubsequences: Bool ) -> SplitCollection> { - split(whereSeparator: { $0 == separator }) + split(whereSeparator: { $0 == separator }, maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } } @@ -234,23 +292,28 @@ extension BidirectionalCollection where Element: Equatable { extension Collection where Element: Equatable { @_disfavoredOverload - func split( - by separator: S - ) -> SplitCollection> where S.Element == Element { - split(by: ZSearcher(pattern: Array(separator), by: ==)) + func split( + by separator: C, + maxSplits: Int, + omittingEmptySubsequences: Bool + ) -> SplitCollection> where C.Element == Element { + split(by: ZSearcher(pattern: Array(separator), by: ==), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } // FIXME: Return `some Collection` for SE-0346 /// Returns the longest possible subsequences of the collection, in order, /// around elements equal to the given separator. + /// /// - Parameter separator: The element to be split upon. /// - Returns: A collection of subsequences, split from this collection's - /// elements. + /// elements. 
@available(SwiftStdlib 5.7, *) - public func split( - by separator: S - ) -> [SubSequence] where S.Element == Element { - Array(split(by: ZSearcher(pattern: Array(separator), by: ==))) + public func split( + separator: C, + maxSplits: Int = .max, + omittingEmptySubsequences: Bool = true + ) -> [SubSequence] where C.Element == Element { + Array(split(by: ZSearcher(pattern: Array(separator), by: ==), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences)) } } @@ -266,13 +329,16 @@ extension BidirectionalCollection where Element: Equatable { } extension BidirectionalCollection where Element: Comparable { - func split( - by separator: S + func split( + by separator: C, + maxSplits: Int, + omittingEmptySubsequences: Bool ) -> SplitCollection>> - where S.Element == Element + where C.Element == Element { split( - by: PatternOrEmpty(searcher: TwoWaySearcher(pattern: Array(separator)))) + by: PatternOrEmpty(searcher: TwoWaySearcher(pattern: Array(separator))), + maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } // FIXME @@ -292,9 +358,11 @@ extension BidirectionalCollection where Element: Comparable { extension BidirectionalCollection where SubSequence == Substring { @_disfavoredOverload func split( - by separator: R + by separator: R, + maxSplits: Int, + omittingEmptySubsequences: Bool ) -> SplitCollection> { - split(by: RegexConsumer(separator)) + split(by: RegexConsumer(separator), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } func splitFromBack( @@ -303,15 +371,23 @@ extension BidirectionalCollection where SubSequence == Substring { splitFromBack(by: RegexConsumer(separator)) } - // FIXME: Return `some Collection` for SE-0346 + // TODO: Is this @_disfavoredOverload necessary? + // It prevents split(separator: String) from choosing this overload instead + // of the collection-based version when String has RegexComponent conformance + + // FIXME: Return `some Collection` for SE-0346 /// Returns the longest possible subsequences of the collection, in order, /// around elements equal to the given separator. + /// /// - Parameter separator: A regex describing elements to be split upon. /// - Returns: A collection of substrings, split from this collection's - /// elements. + /// elements. + @_disfavoredOverload public func split( - by separator: R + separator: R, + maxSplits: Int = .max, + omittingEmptySubsequences: Bool = true ) -> [SubSequence] { - Array(split(by: RegexConsumer(separator))) + Array(split(by: RegexConsumer(separator), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences)) } } diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/StartsWith.swift b/Sources/_StringProcessing/Algorithms/Algorithms/StartsWith.swift index 0dd91f360..2f45a734b 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/StartsWith.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/StartsWith.swift @@ -51,9 +51,10 @@ extension BidirectionalCollection where Element: Equatable { extension BidirectionalCollection where SubSequence == Substring { /// Returns a Boolean value indicating whether the initial elements of the /// sequence are the same as the elements in the specified regex. + /// /// - Parameter regex: A regex to compare to this sequence. /// - Returns: `true` if the initial elements of the sequence matches the - /// beginning of `regex`; otherwise, `false`. + /// beginning of `regex`; otherwise, `false`. 
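To illustrate the new `maxSplits` and `omittingEmptySubsequences` parameters on the collection-based `split(separator:)` shown above, a usage sketch (not part of the patch; a multi-character separator is used so the call cannot be mistaken for the existing `split(separator: Character)` overload):

```swift
let record = "key::value::::note"

record.split(separator: "::")
// ["key", "value", "note"]

record.split(separator: "::", omittingEmptySubsequences: false)
// ["key", "value", "", "note"]

record.split(separator: "::", maxSplits: 1)
// ["key", "value::::note"]
```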
public func starts(with regex: R) -> Bool { starts(with: RegexConsumer(regex)) } diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift index 73a5cd554..7411236da 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift @@ -102,21 +102,26 @@ extension RangeReplaceableCollection where Self: BidirectionalCollection { // MARK: Predicate algorithms extension Collection { - // TODO: Non-escaping and throwing - func trimmingPrefix( - while predicate: @escaping (Element) -> Bool - ) -> SubSequence { - trimmingPrefix(ManyConsumer(base: PredicateConsumer(predicate: predicate))) + fileprivate func endOfPrefix(while predicate: (Element) throws -> Bool) rethrows -> Index { + try firstIndex(where: { try !predicate($0) }) ?? endIndex + } + + @available(SwiftStdlib 5.7, *) + public func trimmingPrefix( + while predicate: (Element) throws -> Bool + ) rethrows -> SubSequence { + let end = try endOfPrefix(while: predicate) + return self[end...] } } extension Collection where SubSequence == Self { @available(SwiftStdlib 5.7, *) public mutating func trimPrefix( - while predicate: @escaping (Element) -> Bool - ) { - trimPrefix(ManyConsumer( - base: PredicateConsumer(predicate: predicate))) + while predicate: (Element) throws -> Bool + ) throws { + let end = try endOfPrefix(while: predicate) + self = self[end...] } } @@ -124,9 +129,10 @@ extension RangeReplaceableCollection { @_disfavoredOverload @available(SwiftStdlib 5.7, *) public mutating func trimPrefix( - while predicate: @escaping (Element) -> Bool - ) { - trimPrefix(ManyConsumer(base: PredicateConsumer(predicate: predicate))) + while predicate: (Element) throws -> Bool + ) rethrows { + let end = try endOfPrefix(while: predicate) + removeSubrange(startIndex..( + public func trimmingPrefix( _ prefix: Prefix ) -> SubSequence where Prefix.Element == Element { trimmingPrefix(FixedPatternConsumer(pattern: prefix)) @@ -202,7 +208,7 @@ extension Collection where SubSequence == Self, Element: Equatable { /// as its argument and returns a Boolean value indicating whether the /// element should be removed from the collection. @available(SwiftStdlib 5.7, *) - public mutating func trimPrefix( + public mutating func trimPrefix( _ prefix: Prefix ) where Prefix.Element == Element { trimPrefix(FixedPatternConsumer(pattern: prefix)) @@ -217,7 +223,7 @@ extension RangeReplaceableCollection where Element: Equatable { /// as its argument and returns a Boolean value indicating whether the /// element should be removed from the collection. 
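A brief sketch of the predicate- and pattern-based trimming APIs reworked above, plus the regex-flavored `starts(with:)` (illustrative only; the `Regex(_:)` string initializer is assumed to be available, and the input values are hypothetical):

```swift
let identifier = "__private_name"

// Trim a prefix described by a predicate, or by a fixed pattern.
identifier.trimmingPrefix(while: { $0 == "_" })    // "private_name"
identifier.trimmingPrefix("__")                    // "private_name"

// Check whether the string starts with a regex match.
"user_42".starts(with: try! Regex("[a-z]+_"))      // true
```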
@available(SwiftStdlib 5.7, *) - public mutating func trimPrefix( + public mutating func trimPrefix( _ prefix: Prefix ) where Prefix.Element == Element { trimPrefix(FixedPatternConsumer(pattern: prefix)) diff --git a/Sources/_StringProcessing/Algorithms/Consumers/FixedPatternConsumer.swift b/Sources/_StringProcessing/Algorithms/Consumers/FixedPatternConsumer.swift index e611f477a..8312c247a 100644 --- a/Sources/_StringProcessing/Algorithms/Consumers/FixedPatternConsumer.swift +++ b/Sources/_StringProcessing/Algorithms/Consumers/FixedPatternConsumer.swift @@ -9,7 +9,7 @@ // //===----------------------------------------------------------------------===// -struct FixedPatternConsumer +struct FixedPatternConsumer where Consumed.Element: Equatable, Pattern.Element == Consumed.Element { let pattern: Pattern @@ -21,20 +21,17 @@ extension FixedPatternConsumer: CollectionConsumer { in range: Range ) -> Consumed.Index? { var index = range.lowerBound - var patternIndex = pattern.startIndex + var patternIterator = pattern.makeIterator() - while true { - if patternIndex == pattern.endIndex { - return index - } - - if index == range.upperBound || consumed[index] != pattern[patternIndex] { + while let element = patternIterator.next() { + if index == range.upperBound || consumed[index] != element { return nil } consumed.formIndex(after: &index) - pattern.formIndex(after: &patternIndex) } + + return index } } diff --git a/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift b/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift index cb527f948..4342391af 100644 --- a/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift +++ b/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift @@ -39,6 +39,7 @@ extension BidirectionalCollection { extension BidirectionalCollection where SubSequence == Substring { @available(SwiftStdlib 5.7, *) + @_disfavoredOverload func firstMatch( of regex: R ) -> _MatchResult>? { diff --git a/Sources/_StringProcessing/Algorithms/Matching/MatchReplace.swift b/Sources/_StringProcessing/Algorithms/Matching/MatchReplace.swift index 09e021a29..206d68554 100644 --- a/Sources/_StringProcessing/Algorithms/Matching/MatchReplace.swift +++ b/Sources/_StringProcessing/Algorithms/Matching/MatchReplace.swift @@ -118,19 +118,19 @@ extension RangeReplaceableCollection where SubSequence == Substring { /// the given regex are replaced by another regex match. /// - Parameters: /// - regex: A regex describing the sequence to replace. - /// - replacement: A closure that receives the full match information, - /// including captures, and returns a replacement collection. /// - subrange: The range in the collection in which to search for `regex`. /// - maxReplacements: A number specifying how many occurrences of the /// sequence matching `regex` to replace. Default is `Int.max`. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. /// - Returns: A new collection in which all occurrences of subsequence /// matching `regex` are replaced by `replacement`. 
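The parameter reordering documented above, and applied to the `replacing`/`replace` signatures that follow, moves the replacement closure into the final position so it can be written as a trailing closure. A hedged sketch, assuming the `Regex(_:)` string initializer and a hypothetical input:

```swift
let text = "Score: 42, Lives: 3"

// Mask every run of digits with asterisks of the same length,
// using the match's range into the original string.
let masked = text.replacing(try! Regex("[0-9]+")) { match in
    String(repeating: "*", count: text[match.range].count)
}
// masked == "Score: **, Lives: *"
```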
@available(SwiftStdlib 5.7, *) public func replacing( _ regex: R, - with replacement: (Regex.Match) throws -> Replacement, subrange: Range, - maxReplacements: Int = .max + maxReplacements: Int = .max, + with replacement: (Regex.Match) throws -> Replacement ) rethrows -> Self where Replacement.Element == Element { precondition(maxReplacements >= 0) @@ -155,43 +155,43 @@ extension RangeReplaceableCollection where SubSequence == Substring { /// the given regex are replaced by another collection. /// - Parameters: /// - regex: A regex describing the sequence to replace. - /// - replacement: A closure that receives the full match information, - /// including captures, and returns a replacement collection. /// - maxReplacements: A number specifying how many occurrences of the /// sequence matching `regex` to replace. Default is `Int.max`. + /// - replacement: A closure that receives the full match information, + /// including captures, and returns a replacement collection. /// - Returns: A new collection in which all occurrences of subsequence /// matching `regex` are replaced by `replacement`. @available(SwiftStdlib 5.7, *) public func replacing( _ regex: R, - with replacement: (Regex.Match) throws -> Replacement, - maxReplacements: Int = .max + maxReplacements: Int = .max, + with replacement: (Regex.Match) throws -> Replacement ) rethrows -> Self where Replacement.Element == Element { try replacing( regex, - with: replacement, subrange: startIndex..( _ regex: R, - with replacement: (Regex.Match) throws -> Replacement, - maxReplacements: Int = .max + maxReplacements: Int = .max, + with replacement: (Regex.Match) throws -> Replacement ) rethrows where Replacement.Element == Element { self = try replacing( regex, - with: replacement, subrange: startIndex.. ) -> State { + // FIXME: Is this 'limitedBy' requirement a sign of error? let criticalIndex = searched.index( - range.lowerBound, offsetBy: criticalIndex) + range.lowerBound, offsetBy: criticalIndex, limitedBy: range.upperBound) + ?? range.upperBound return State( end: range.upperBound, index: range.lowerBound, @@ -66,7 +68,10 @@ extension TwoWaySearcher: CollectionSearcher { let start = _searchLeft(searched, &state, end) { state.index = end - state.criticalIndex = searched.index(end, offsetBy: criticalIndex) + // FIXME: Is this 'limitedBy' requirement a sign of error? + state.criticalIndex = searched.index( + end, offsetBy: criticalIndex, limitedBy: searched.endIndex) + ?? searched.endIndex state.memory = nil return start.. CaptureRegister { + mutating func makeCapture( + id: ReferenceID?, name: String? + ) -> CaptureRegister { defer { nextCaptureRegister.rawValue += 1 } // Register the capture for later lookup via symbolic references. if let id = id { @@ -446,6 +450,10 @@ extension MEProgram.Builder { captureCount, forKey: id) assert(preexistingValue == nil) } + if let name = name { + // TODO: Reject duplicate capture names unless `(?J)`? 
+ namedCaptureOffsets.updateValue(captureCount, forKey: name) + } return nextCaptureRegister } diff --git a/Sources/_StringProcessing/Engine/MECapture.swift b/Sources/_StringProcessing/Engine/MECapture.swift index 390af7d66..807598637 100644 --- a/Sources/_StringProcessing/Engine/MECapture.swift +++ b/Sources/_StringProcessing/Engine/MECapture.swift @@ -145,6 +145,7 @@ extension Processor._StoredCapture: CustomStringConvertible { struct CaptureList { var values: Array._StoredCapture> var referencedCaptureOffsets: [ReferenceID: Int] + var namedCaptureOffsets: [String: Int] // func extract(from s: String) -> Array> { // caps.map { $0.map { s[$0] } } diff --git a/Sources/_StringProcessing/Engine/MEProgram.swift b/Sources/_StringProcessing/Engine/MEProgram.swift index b0f2e6a79..0bfa0ecba 100644 --- a/Sources/_StringProcessing/Engine/MEProgram.swift +++ b/Sources/_StringProcessing/Engine/MEProgram.swift @@ -36,6 +36,7 @@ struct MEProgram where Input.Element: Equatable { let captureStructure: CaptureStructure let referencedCaptureOffsets: [ReferenceID: Int] + let namedCaptureOffsets: [String: Int] } extension MEProgram: CustomStringConvertible { diff --git a/Sources/_StringProcessing/Executor.swift b/Sources/_StringProcessing/Executor.swift index c7d4527a5..6ebb93f5c 100644 --- a/Sources/_StringProcessing/Executor.swift +++ b/Sources/_StringProcessing/Executor.swift @@ -37,7 +37,8 @@ struct Executor { let capList = CaptureList( values: cpu.storedCaptures, - referencedCaptureOffsets: engine.program.referencedCaptureOffsets) + referencedCaptureOffsets: engine.program.referencedCaptureOffsets, + namedCaptureOffsets: engine.program.namedCaptureOffsets) let capStruct = engine.program.captureStructure let range = inputRange.lowerBound.. ) -> Substring { input[range] } + + public subscript(name: String) -> AnyRegexOutput.Element? { + namedCaptureOffsets[name].map { self[$0 + 1] } + } } -/// A type-erased regex output +/// A type-erased regex output. @available(SwiftStdlib 5.7, *) public struct AnyRegexOutput { let input: String + let namedCaptureOffsets: [String: Int] fileprivate let _elements: [ElementRepresentation] /// The underlying representation of the element of a type-erased regex @@ -62,7 +72,7 @@ extension AnyRegexOutput { /// Creates a type-erased regex output from an existing output. /// /// Use this initializer to fit a regex with strongly typed captures into the - /// use site of a dynamic regex, i.e. one that was created from a string. + /// use site of a dynamic regex, like one that was created from a string. public init(_ match: Regex.Match) { // Note: We use type equality instead of `match.output as? ...` to prevent // unexpected optional flattening. @@ -79,7 +89,7 @@ extension AnyRegexOutput { /// /// - Parameter type: The expected output type. /// - Returns: The output, if the underlying value can be converted to the - /// output type, or nil otherwise. + /// output type; otherwise `nil`. public func `as`(_ type: Output.Type) -> Output? 
{ let elements = _elements.map { StructuredCapture( @@ -94,9 +104,12 @@ extension AnyRegexOutput { @available(SwiftStdlib 5.7, *) extension AnyRegexOutput { internal init( - input: String, elements: C + input: String, namedCaptureOffsets: [String: Int], elements: C ) where C.Element == StructuredCapture { - self.init(input: input, _elements: elements.map(ElementRepresentation.init)) + self.init( + input: input, + namedCaptureOffsets: namedCaptureOffsets, + _elements: elements.map(ElementRepresentation.init)) } } @@ -170,12 +183,19 @@ extension AnyRegexOutput: RandomAccessCollection { } } +@available(SwiftStdlib 5.7, *) +extension AnyRegexOutput { + public subscript(name: String) -> Element? { + namedCaptureOffsets[name].map { self[$0 + 1] } + } +} + @available(SwiftStdlib 5.7, *) extension Regex.Match where Output == AnyRegexOutput { /// Creates a type-erased regex match from an existing match. /// /// Use this initializer to fit a regex match with strongly typed captures into the - /// use site of a dynamic regex match, i.e. one that was created from a string. + /// use site of a dynamic regex match, like one that was created from a string. public init(_ match: Regex.Match) { fatalError("FIXME: Not implemented") } @@ -184,9 +204,34 @@ extension Regex.Match where Output == AnyRegexOutput { /// types. /// /// - Parameter type: The expected output type. - /// - Returns: A match generic over the output type if the underlying values can be converted to the - /// output type. Returns `nil` otherwise. + /// - Returns: A match generic over the output type, if the underlying values + /// can be converted to the output type; otherwise, `nil`. public func `as`(_ type: Output.Type) -> Regex.Match? { fatalError("FIXME: Not implemented") } } + +@available(SwiftStdlib 5.7, *) +extension Regex where Output == AnyRegexOutput { + /// Returns whether a named-capture with `name` exists + public func contains(captureNamed name: String) -> Bool { + fatalError("FIXME: not implemented") + } + + /// Creates a type-erased regex from an existing regex. + /// + /// Use this initializer to fit a regex with strongly typed captures into the + /// use site of a dynamic regex, i.e. one that was created from a string. + public init(_ match: Regex) { + fatalError("FIXME: Not implemented") + } + + /// Returns a typed regex by converting the underlying types. + /// + /// - Parameter type: The expected output type. + /// - Returns: A regex generic over the output type if the underlying types can be converted. + /// Returns `nil` otherwise. + public func `as`(_ type: Output.Type) -> Regex? { + fatalError("FIXME: Not implemented") + } +} diff --git a/Sources/_StringProcessing/Regex/Core.swift b/Sources/_StringProcessing/Regex/Core.swift index d77784df4..1f9a35dad 100644 --- a/Sources/_StringProcessing/Regex/Core.swift +++ b/Sources/_StringProcessing/Regex/Core.swift @@ -19,7 +19,7 @@ public protocol RegexComponent { var regex: Regex { get } } -/// A regex represents a string processing algorithm. +/// A regular expression. 
/// /// let regex = try Regex("a(.*)b") /// let match = "cbaxb".firstMatch(of: regex) @@ -61,6 +61,13 @@ public struct Regex: RegexComponent { } } +@available(SwiftStdlib 5.7, *) +extension Regex { + public init(quoting string: String) { + self.init(node: .quotedLiteral(string)) + } +} + @available(SwiftStdlib 5.7, *) extension Regex { /// A program representation that caches any lowered representation for diff --git a/Sources/_StringProcessing/Regex/CustomComponents.swift b/Sources/_StringProcessing/Regex/CustomComponents.swift index a5f9bd9ed..d675c3ae7 100644 --- a/Sources/_StringProcessing/Regex/CustomComponents.swift +++ b/Sources/_StringProcessing/Regex/CustomComponents.swift @@ -31,7 +31,7 @@ public protocol CustomConsumingRegexComponent: RegexComponent { @available(SwiftStdlib 5.7, *) extension CustomConsumingRegexComponent { public var regex: Regex { - let node: DSLTree.Node = .matcher(.init(RegexOutput.self), { input, index, bounds in + let node: DSLTree.Node = .matcher(RegexOutput.self, { input, index, bounds in try consuming(input, startingAt: index, in: bounds) }) return Regex(node: node) diff --git a/Sources/_StringProcessing/Regex/DSLTree.swift b/Sources/_StringProcessing/Regex/DSLTree.swift index 51f5ea36f..52eaeffb0 100644 --- a/Sources/_StringProcessing/Regex/DSLTree.swift +++ b/Sources/_StringProcessing/Regex/DSLTree.swift @@ -24,38 +24,38 @@ public struct DSLTree { extension DSLTree { @_spi(RegexBuilder) - public indirect enum Node: _TreeNode { - /// Try to match each node in order + public indirect enum Node { + /// Matches each node in order. /// /// ... | ... | ... case orderedChoice([Node]) - /// Match each node in sequence + /// Match each node in sequence. /// /// ... ... case concatenation([Node]) - /// Capture the result of a subpattern + /// Captures the result of a subpattern. /// /// (...), (?...) case capture( name: String? = nil, reference: ReferenceID? = nil, Node) - /// Match a (non-capturing) subpattern / group - case nonCapturingGroup(AST.Group.Kind, Node) + /// Matches a noncapturing subpattern. + case nonCapturingGroup(_AST.GroupKind, Node) // TODO: Consider splitting off grouped conditions, or have // our own kind - /// Match a choice of two nodes based on a condition + /// Matches a choice of two nodes, based on a condition. /// /// (?(cond) true-branch | false-branch) /// case conditional( - AST.Conditional.Condition.Kind, Node, Node) + _AST.ConditionKind, Node, Node) case quantification( - AST.Quantification.Amount, + _AST.QuantificationAmount, QuantificationKind, Node) @@ -63,7 +63,7 @@ extension DSLTree { case atom(Atom) - /// Comments, non-semantic whitespace, etc + /// Comments, non-semantic whitespace, and so on. // TODO: Do we want this? Could be interesting case trivia(String) @@ -73,20 +73,20 @@ extension DSLTree { case quotedLiteral(String) - /// An embedded literal - case regexLiteral(AST.Node) + /// An embedded literal. + case regexLiteral(_AST.ASTNode) // TODO: What should we do here? /// /// TODO: Consider splitting off expression functions, or have our own kind - case absentFunction(AST.AbsentFunction) + case absentFunction(_AST.AbsentFunction) // MARK: - Tree conversions /// The target of AST conversion. 
/// /// Keeps original AST around for rich syntactic and source information - case convertedRegexLiteral(Node, AST.Node) + case convertedRegexLiteral(Node, _AST.ASTNode) // MARK: - Extensibility points @@ -95,7 +95,7 @@ extension DSLTree { case consumer(_ConsumerInterface) - case matcher(AnyType, _MatcherInterface) + case matcher(Any.Type, _MatcherInterface) // TODO: Would this just boil down to a consumer? case characterPredicate(_CharacterPredicateInterface) @@ -108,9 +108,17 @@ extension DSLTree { /// The default quantification kind, as set by options. case `default` /// An explicitly chosen kind, overriding any options. - case explicit(AST.Quantification.Kind) + case explicit(_AST.QuantificationKind) /// A kind set via syntax, which can be affected by options. - case syntax(AST.Quantification.Kind) + case syntax(_AST.QuantificationKind) + + var ast: AST.Quantification.Kind? { + switch self { + case .default: return nil + case .explicit(let kind), .syntax(let kind): + return kind.ast + } + } } @_spi(RegexBuilder) @@ -134,6 +142,12 @@ extension DSLTree { self.isInverted = isInverted } + public static func generalCategory(_ category: Unicode.GeneralCategory) -> Self { + let property = AST.Atom.CharacterProperty(.generalCategory(category.extendedGeneralCategory!), isInverted: false, isPOSIX: false) + let astAtom = AST.Atom(.property(property), .fake) + return .init(members: [.atom(.unconverted(.init(ast: astAtom)))]) + } + public var inverted: CustomCharacterClass { var result = self result.isInverted.toggle() @@ -162,13 +176,51 @@ extension DSLTree { case scalar(Unicode.Scalar) case any - case assertion(AST.Atom.AssertionKind) - case backreference(AST.Reference) + case assertion(_AST.AssertionKind) + case backreference(_AST.Reference) case symbolicReference(ReferenceID) - case changeMatchingOptions(AST.MatchingOptionSequence) + case changeMatchingOptions(_AST.MatchingOptionSequence) + + case unconverted(_AST.Atom) + } +} - case unconverted(AST.Atom) +extension Unicode.GeneralCategory { + var extendedGeneralCategory: Unicode.ExtendedGeneralCategory? 
{ + switch self { + case .uppercaseLetter: return .uppercaseLetter + case .lowercaseLetter: return .lowercaseLetter + case .titlecaseLetter: return .titlecaseLetter + case .modifierLetter: return .modifierLetter + case .otherLetter: return .otherLetter + case .nonspacingMark: return .nonspacingMark + case .spacingMark: return .spacingMark + case .enclosingMark: return .enclosingMark + case .decimalNumber: return .decimalNumber + case .letterNumber: return .letterNumber + case .otherNumber: return .otherNumber + case .connectorPunctuation: return .connectorPunctuation + case .dashPunctuation: return .dashPunctuation + case .openPunctuation: return .openPunctuation + case .closePunctuation: return .closePunctuation + case .initialPunctuation: return .initialPunctuation + case .finalPunctuation: return .finalPunctuation + case .otherPunctuation: return .otherPunctuation + case .mathSymbol: return .mathSymbol + case .currencySymbol: return .currencySymbol + case .modifierSymbol: return .modifierSymbol + case .otherSymbol: return .otherSymbol + case .spaceSeparator: return .spaceSeparator + case .lineSeparator: return .lineSeparator + case .paragraphSeparator: return .paragraphSeparator + case .control: return .control + case .format: return .format + case .surrogate: return .surrogate + case .privateUse: return .privateUse + case .unassigned: return .unassigned + @unknown default: return nil + } } } @@ -226,8 +278,8 @@ extension DSLTree.Node { .customCharacterClass, .atom: return [] - case let .absentFunction(a): - return a.children.map(\.dslTreeNode) + case let .absentFunction(abs): + return abs.ast.children.map(\.dslTreeNode) } } } @@ -235,8 +287,8 @@ extension DSLTree.Node { extension DSLTree.Node { var astNode: AST.Node? { switch self { - case let .regexLiteral(ast): return ast - case let .convertedRegexLiteral(_, ast): return ast + case let .regexLiteral(literal): return literal.ast + case let .convertedRegexLiteral(_, literal): return literal.ast default: return nil } } @@ -280,9 +332,9 @@ extension DSLTree.Node { case .capture: return true case let .regexLiteral(re): - return re.hasCapture + return re.ast.hasCapture case let .convertedRegexLiteral(n, re): - assert(n.hasCapture == re.hasCapture) + assert(n.hasCapture == re.ast.hasCapture) return n.hasCapture default: @@ -295,70 +347,15 @@ extension DSLTree { var captureStructure: CaptureStructure { // TODO: nesting var constructor = CaptureStructure.Constructor(.flatten) - return root._captureStructure(&constructor) + return _Tree(root)._captureStructure(&constructor) } } extension DSLTree.Node { - @_spi(RegexBuilder) - public func _captureStructure( - _ constructor: inout CaptureStructure.Constructor - ) -> CaptureStructure { - switch self { - case let .orderedChoice(children): - return constructor.alternating(children) - - case let .concatenation(children): - return constructor.concatenating(children) - - case let .capture(name, _, child): - if let type = child.valueCaptureType { - return constructor.capturing( - name: name, child, withType: type) - } - return constructor.capturing(name: name, child) - - case let .nonCapturingGroup(kind, child): - assert(!kind.isCapturing) - return constructor.grouping(child, as: kind) - - case let .conditional(cond, trueBranch, falseBranch): - return constructor.condition( - cond, - trueBranch: trueBranch, - falseBranch: falseBranch) - - case let .quantification(amount, _, child): - return constructor.quantifying( - child, amount: amount) - - case let .regexLiteral(re): - // TODO: Force a re-nesting? 
- return re._captureStructure(&constructor) - - case let .absentFunction(abs): - return constructor.absent(abs.kind) - - case let .convertedRegexLiteral(n, _): - // TODO: Switch nesting strategy? - return n._captureStructure(&constructor) - - case .matcher: - return .empty - - case .transform(_, let child): - return child._captureStructure(&constructor) - - case .customCharacterClass, .atom, .trivia, .empty, - .quotedLiteral, .consumer, .characterPredicate: - return .empty - } - } - /// For typed capture-producing nodes, the type produced. var valueCaptureType: AnyType? { switch self { case let .matcher(t, _): - return t + return AnyType(t) case let .transform(t, _): return AnyType(t.resultType) default: return nil @@ -455,3 +452,225 @@ public struct CaptureTransform: Hashable, CustomStringConvertible { "" } } + +// MARK: AST wrapper types +// +// These wrapper types are required because even @_spi-marked public APIs can't +// include symbols from implementation-only dependencies. + +extension DSLTree { + /// Presents a wrapped version of `DSLTree.Node` that can provide an internal + /// `_TreeNode` conformance. + struct _Tree: _TreeNode { + var node: DSLTree.Node + + init(_ node: DSLTree.Node) { + self.node = node + } + + var children: [_Tree]? { + switch node { + + case let .orderedChoice(v): return v.map(_Tree.init) + case let .concatenation(v): return v.map(_Tree.init) + + case let .convertedRegexLiteral(n, _): + // Treat this transparently + return _Tree(n).children + + case let .capture(_, _, n): return [_Tree(n)] + case let .nonCapturingGroup(_, n): return [_Tree(n)] + case let .transform(_, n): return [_Tree(n)] + case let .quantification(_, _, n): return [_Tree(n)] + + case let .conditional(_, t, f): return [_Tree(t), _Tree(f)] + + case .trivia, .empty, .quotedLiteral, .regexLiteral, + .consumer, .matcher, .characterPredicate, + .customCharacterClass, .atom: + return [] + + case let .absentFunction(abs): + return abs.ast.children.map(\.dslTreeNode).map(_Tree.init) + } + } + + func _captureStructure( + _ constructor: inout CaptureStructure.Constructor + ) -> CaptureStructure { + switch node { + case let .orderedChoice(children): + return constructor.alternating(children.map(_Tree.init)) + + case let .concatenation(children): + return constructor.concatenating(children.map(_Tree.init)) + + case let .capture(name, _, child): + if let type = child.valueCaptureType { + return constructor.capturing( + name: name, _Tree(child), withType: type) + } + return constructor.capturing(name: name, _Tree(child)) + + case let .nonCapturingGroup(kind, child): + assert(!kind.ast.isCapturing) + return constructor.grouping(_Tree(child), as: kind.ast) + + case let .conditional(cond, trueBranch, falseBranch): + return constructor.condition( + cond.ast, + trueBranch: _Tree(trueBranch), + falseBranch: _Tree(falseBranch)) + + case let .quantification(amount, _, child): + return constructor.quantifying( + Self(child), amount: amount.ast) + + case let .regexLiteral(re): + // TODO: Force a re-nesting? + return re.ast._captureStructure(&constructor) + + case let .absentFunction(abs): + return constructor.absent(abs.ast.kind) + + case let .convertedRegexLiteral(n, _): + // TODO: Switch nesting strategy? 
+ return Self(n)._captureStructure(&constructor) + + case .matcher: + return .empty + + case .transform(_, let child): + return Self(child)._captureStructure(&constructor) + + case .customCharacterClass, .atom, .trivia, .empty, + .quotedLiteral, .consumer, .characterPredicate: + return .empty + } + } + } + + @_spi(RegexBuilder) + public enum _AST { + @_spi(RegexBuilder) + public struct GroupKind { + internal var ast: AST.Group.Kind + + public static var atomicNonCapturing: Self { + .init(ast: .atomicNonCapturing) + } + public static var lookahead: Self { + .init(ast: .lookahead) + } + public static var negativeLookahead: Self { + .init(ast: .negativeLookahead) + } + } + + @_spi(RegexBuilder) + public struct ConditionKind { + internal var ast: AST.Conditional.Condition.Kind + } + + @_spi(RegexBuilder) + public struct QuantificationKind { + internal var ast: AST.Quantification.Kind + + public static var eager: Self { + .init(ast: .eager) + } + public static var reluctant: Self { + .init(ast: .reluctant) + } + public static var possessive: Self { + .init(ast: .possessive) + } + } + + @_spi(RegexBuilder) + public struct QuantificationAmount { + internal var ast: AST.Quantification.Amount + + public static var zeroOrMore: Self { + .init(ast: .zeroOrMore) + } + public static var oneOrMore: Self { + .init(ast: .oneOrMore) + } + public static var zeroOrOne: Self { + .init(ast: .zeroOrOne) + } + public static func exactly(_ n: Int) -> Self { + .init(ast: .exactly(.init(faking: n))) + } + public static func nOrMore(_ n: Int) -> Self { + .init(ast: .nOrMore(.init(faking: n))) + } + public static func upToN(_ n: Int) -> Self { + .init(ast: .upToN(.init(faking: n))) + } + public static func range(_ lower: Int, _ upper: Int) -> Self { + .init(ast: .range(.init(faking: lower), .init(faking: upper))) + } + } + + @_spi(RegexBuilder) + public struct ASTNode { + internal var ast: AST.Node + } + + @_spi(RegexBuilder) + public struct AbsentFunction { + internal var ast: AST.AbsentFunction + } + + @_spi(RegexBuilder) + public struct AssertionKind { + internal var ast: AST.Atom.AssertionKind + + public static func startOfSubject(_ inverted: Bool = false) -> Self { + .init(ast: .startOfSubject) + } + public static func endOfSubjectBeforeNewline(_ inverted: Bool = false) -> Self { + .init(ast: .endOfSubjectBeforeNewline) + } + public static func endOfSubject(_ inverted: Bool = false) -> Self { + .init(ast: .endOfSubject) + } + public static func firstMatchingPositionInSubject(_ inverted: Bool = false) -> Self { + .init(ast: .firstMatchingPositionInSubject) + } + public static func textSegmentBoundary(_ inverted: Bool = false) -> Self { + inverted + ? .init(ast: .notTextSegment) + : .init(ast: .textSegment) + } + public static func startOfLine(_ inverted: Bool = false) -> Self { + .init(ast: .startOfLine) + } + public static func endOfLine(_ inverted: Bool = false) -> Self { + .init(ast: .endOfLine) + } + public static func wordBoundary(_ inverted: Bool = false) -> Self { + inverted + ? 
.init(ast: .notWordBoundary) + : .init(ast: .wordBoundary) + } + } + + @_spi(RegexBuilder) + public struct Reference { + internal var ast: AST.Reference + } + + @_spi(RegexBuilder) + public struct MatchingOptionSequence { + internal var ast: AST.MatchingOptionSequence + } + + @_spi(RegexBuilder) + public struct Atom { + internal var ast: AST.Atom + } + } +} diff --git a/Sources/_StringProcessing/Regex/Match.swift b/Sources/_StringProcessing/Regex/Match.swift index a86899041..3e8f8e9e8 100644 --- a/Sources/_StringProcessing/Regex/Match.swift +++ b/Sources/_StringProcessing/Regex/Match.swift @@ -19,20 +19,22 @@ extension Regex { public struct Match { let input: String - /// The range of the overall match + /// The range of the overall match. public let range: Range let rawCaptures: [StructuredCapture] let referencedCaptureOffsets: [ReferenceID: Int] + let namedCaptureOffsets: [String: Int] + let value: Any? } } @available(SwiftStdlib 5.7, *) extension Regex.Match { - /// The produced output from the match operation + /// The output produced from the match operation. public var output: Output { if Output.self == AnyRegexOutput.self { let wholeMatchAsCapture = StructuredCapture( @@ -40,6 +42,7 @@ extension Regex.Match { storedCapture: StoredCapture(range: range, value: nil)) let output = AnyRegexOutput( input: input, + namedCaptureOffsets: namedCaptureOffsets, elements: [wholeMatchAsCapture] + rawCaptures) return output as! Output } else if Output.self == Substring.self { @@ -59,12 +62,12 @@ extension Regex.Match { } } - /// Lookup a capture by name or number + /// Accesses a capture by its name or number. public subscript(dynamicMember keyPath: KeyPath) -> T { output[keyPath: keyPath] } - // Allows `.0` when `Match` is not a tuple. + /// Accesses a capture using the `.0` syntax, even when the match isn't a tuple. @_disfavoredOverload public subscript( dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output> @@ -85,44 +88,50 @@ extension Regex.Match { @available(SwiftStdlib 5.7, *) extension Regex { - /// Match a string in its entirety. + /// Matches a string in its entirety. /// - /// Returns `nil` if no match and throws on abort + /// - Parameter s: The string to match this regular expression against. + /// - Returns: The match, or `nil` if no match was found. public func wholeMatch(in s: String) throws -> Regex.Match? { try _match(s, in: s.startIndex.. Regex.Match? { try _match(s, in: s.startIndex.. Regex.Match? { try _firstMatch(s, in: s.startIndex.. Regex.Match? { try _match(s.base, in: s.startIndex.. Regex.Match? { try _match(s.base, in: s.startIndex.. Regex.Match? { try _firstMatch(s.base, in: s.startIndex..( of r: R ) -> Regex.Match? { - try? r.regex.wholeMatch(in: self) + try? r.regex.wholeMatch(in: self[...].base) } + /// Checks for a match against the string, starting at its beginning. + /// + /// - Parameter r: The regular expression being matched. + /// - Returns: The match, or `nil` if no match was found. public func prefixMatch( of r: R ) -> Regex.Match? { - try? r.regex.prefixMatch(in: self) + try? r.regex.prefixMatch(in: self[...]) } } @available(SwiftStdlib 5.7, *) -extension Substring { - public func wholeMatch( - of r: R - ) -> Regex.Match? { - try? r.regex.wholeMatch(in: self) +extension RegexComponent { + /*public*/ static func ~=(regex: Self, input: String) -> Bool { + input.wholeMatch(of: regex) != nil } - public func prefixMatch( - of r: R - ) -> Regex.Match? { - try? 
r.regex.prefixMatch(in: self) + /*public*/ static func ~=(regex: Self, input: Substring) -> Bool { + input.wholeMatch(of: regex) != nil } } diff --git a/Sources/_StringProcessing/Regex/Options.swift b/Sources/_StringProcessing/Regex/Options.swift index abc98991b..24d5c422e 100644 --- a/Sources/_StringProcessing/Regex/Options.swift +++ b/Sources/_StringProcessing/Regex/Options.swift @@ -13,35 +13,57 @@ @available(SwiftStdlib 5.7, *) extension RegexComponent { - /// Returns a regular expression that ignores casing when matching. + /// Returns a regular expression that ignores case when matching. + /// + /// - Parameter ignoresCase: A Boolean value indicating whether to ignore case. + /// - Returns: The modified regular expression. public func ignoresCase(_ ignoresCase: Bool = true) -> Regex { wrapInOption(.caseInsensitive, addingIf: ignoresCase) } - /// Returns a regular expression that only matches ASCII characters as "word - /// characters". + /// Returns a regular expression that matches only ASCII characters as word + /// characters. + /// + /// - Parameter useASCII: A Boolean value indicating whether to match only + /// ASCII characters as word characters. + /// - Returns: The modified regular expression. public func asciiOnlyWordCharacters(_ useASCII: Bool = true) -> Regex { wrapInOption(.asciiOnlyWord, addingIf: useASCII) } - /// Returns a regular expression that only matches ASCII characters as digits. + /// Returns a regular expression that matches only ASCII characters as digits. + /// + /// - Parameter useasciiOnlyDigits: A Boolean value indicating whether to + /// match only ASCII characters as digits. + /// - Returns: The modified regular expression. public func asciiOnlyDigits(_ useASCII: Bool = true) -> Regex { wrapInOption(.asciiOnlyDigit, addingIf: useASCII) } - /// Returns a regular expression that only matches ASCII characters as space + /// Returns a regular expression that matches only ASCII characters as space /// characters. + /// + /// - Parameter asciiOnlyWhitespace: A Boolean value indicating whether to + /// match only ASCII characters as space characters. + /// - Returns: The modified regular expression. public func asciiOnlyWhitespace(_ useASCII: Bool = true) -> Regex { wrapInOption(.asciiOnlySpace, addingIf: useASCII) } - /// Returns a regular expression that only matches ASCII characters when + /// Returns a regular expression that matches only ASCII characters when /// matching character classes. + /// + /// - Parameter useASCII: A Boolean value indicating whether to match only + /// ASCII characters when matching character classes. + /// - Returns: The modified regular expression. public func asciiOnlyCharacterClasses(_ useASCII: Bool = true) -> Regex { wrapInOption(.asciiOnlyPOSIXProps, addingIf: useASCII) } /// Returns a regular expression that uses the specified word boundary algorithm. + /// + /// - Parameter wordBoundaryKind: The algorithm to use for determining word boundaries. + /// - Returns: The modified regular expression. public func wordBoundaryKind(_ wordBoundaryKind: RegexWordBoundaryKind) -> Regex { wrapInOption(.unicodeWordBoundaries, addingIf: wordBoundaryKind == .unicodeLevel2) } @@ -51,6 +73,7 @@ extension RegexComponent { /// /// - Parameter dotMatchesNewlines: A Boolean value indicating whether `.` /// should match a newline character. + /// - Returns: The modified regular expression. 
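Finally, a short sketch chaining the option-setting methods whose documentation is expanded in this hunk (illustrative only; assumes the `Regex(_:)` string initializer and a hypothetical pattern):

```swift
let pattern = try! Regex("hello.world")

let lenient = pattern
    .ignoresCase()          // also match "HELLO.WORLD"
    .dotMatchesNewlines()   // let `.` match a line break

let found = (try? lenient.firstMatch(in: "HELLO\nWORLD")) != nil
// found == true
```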
public func dotMatchesNewlines(_ dotMatchesNewlines: Bool = true) -> Regex { wrapInOption(.singleLine, addingIf: dotMatchesNewlines) } @@ -65,6 +88,7 @@ extension RegexComponent { /// /// - Parameter matchLineEndings: A Boolean value indicating whether `^` and /// `$` should match the start and end of lines, respectively. + /// - Returns: The modified regular expression. public func anchorsMatchLineEndings(_ matchLineEndings: Bool = true) -> Regex { wrapInOption(.multiline, addingIf: matchLineEndings) } @@ -124,6 +148,9 @@ extension RegexComponent { /// // Prints "true" /// print(decomposed.contains(queRegexScalar)) /// // Prints "false" + /// + /// - Parameter semanticLevel: The semantics to use during matching. + /// - Returns: The modified regular expression. public func matchingSemantics(_ semanticLevel: RegexSemanticLevel) -> Regex { switch semanticLevel.base { case .graphemeCluster: @@ -144,14 +171,18 @@ public struct RegexSemanticLevel: Hashable { internal var base: Representation - /// Match at the default semantic level of a string, where each matched - /// element is a `Character`. + /// Match at the character level. + /// + /// At this semantic level, each matched element is a `Character` value. + /// This is the default semantic level. public static var graphemeCluster: RegexSemanticLevel { .init(base: .graphemeCluster) } - /// Match at the semantic level of a string's `UnicodeScalarView`, where each - /// matched element is a `UnicodeScalar` value. + /// Match at the Unicode scalar level. + /// + /// At this semantic level, the string's `UnicodeScalarView` is used for matching, + /// and each matched element is a `UnicodeScalar` value. public static var unicodeScalar: RegexSemanticLevel { .init(base: .unicodeScalar) } @@ -200,7 +231,7 @@ public struct RegexRepetitionBehavior: Hashable { var kind: Kind - @_spi(RegexBuilder) public var dslTreeKind: AST.Quantification.Kind { + @_spi(RegexBuilder) public var dslTreeKind: DSLTree._AST.QuantificationKind { switch kind { case .eager: return .eager case .reluctant: return .reluctant @@ -241,6 +272,6 @@ extension RegexComponent { ? AST.MatchingOptionSequence(adding: [.init(option, location: .fake)]) : AST.MatchingOptionSequence(removing: [.init(option, location: .fake)]) return Regex(node: .nonCapturingGroup( - .changeMatchingOptions(sequence), regex.root)) + .init(ast: .changeMatchingOptions(sequence)), regex.root)) } } diff --git a/Sources/_StringProcessing/Unicode/Data.swift b/Sources/_StringProcessing/Unicode/Data.swift deleted file mode 100644 index 2436b51cd..000000000 --- a/Sources/_StringProcessing/Unicode/Data.swift +++ /dev/null @@ -1,188 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021 Apple Inc. and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors -// -//===----------------------------------------------------------------------===// - - -internal typealias ScalarAndNormData = ( - scalar: Unicode.Scalar, - normData: Unicode._NormData -) - -extension Unicode { - // A wrapper type over the normalization data value we receive when we - // lookup a scalar's normalization information. 
The layout of the underlying - // 16 bit value we receive is as follows: - // - // 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - // └───┬───┘ └──── CCC ────┘ └─┘ │ - // │ │ └── NFD_QC - // │ └── NFC_QC - // └── Unused - // - // NFD_QC: This is a simple Yes/No on whether the scalar has canonical - // decomposition. Note: Yes is indicated via 0 instead of 1. - // - // NFC_QC: This is either Yes/No/Maybe on whether the scalar is NFC quick - // check. Yes, represented as 0, means the scalar can NEVER compose - // with another scalar previous to it. No, represented as 1, means the - // scalar can NEVER appear within a well formed NFC string. Maybe, - // represented as 2, means the scalar could appear with an NFC string, - // but further information is required to determine if that is the - // case. At the moment, we really only care about Yes/No. - // - // CCC: This is the canonical combining class property of a scalar that is - // used when sorting scalars of a normalization segment after NFD - // computation. A scalar with a CCC value of 128 can NEVER appear before - // a scalar with a CCC value of 100, unless there are normalization - // boundaries between them. - // - internal struct _NormData { - var rawValue: UInt16 - - var ccc: UInt8 { - UInt8(truncatingIfNeeded: rawValue >> 3) - } - - var isNFCQC: Bool { - rawValue & 0x6 == 0 - } - - var isNFDQC: Bool { - rawValue & 0x1 == 0 - } - - init(_ scalar: Unicode.Scalar, fastUpperbound: UInt32 = 0xC0) { - if _fastPath(scalar.value < fastUpperbound) { - // CCC = 0, NFC_QC = Yes, NFD_QC = Yes - rawValue = 0 - } else { - rawValue = _swift_stdlib_getNormData(scalar.value) - - // Because we don't store precomposed hangul in our NFD_QC data, these - // will return true for NFD_QC when in fact they are not. - if (0xAC00 ... 0xD7A3).contains(scalar.value) { - // NFD_QC = false - rawValue |= 0x1 - } - } - } - - init(rawValue: UInt16) { - self.rawValue = rawValue - } - } -} - -extension Unicode { - // A wrapper type for normalization buffers in the NFC and NFD iterators. - // This helps remove some of the buffer logic like removal and sorting out of - // the iterators and into this type. - internal struct _NormDataBuffer { - var storage: [ScalarAndNormData] = [] - - // This is simply a marker denoting that we've built up our storage, and - // now everything within it needs to be emitted. We reverse the buffer and - // pop elements from the back as a way to remove them. - var isReversed = false - - var isEmpty: Bool { - storage.isEmpty - } - - var last: ScalarAndNormData? { - storage.last - } - - mutating func append(_ scalarAndNormData: ScalarAndNormData) { - _internalInvariant(!isReversed) - storage.append(scalarAndNormData) - } - - // Removes the first element from the buffer. Note: it is not safe to append - // to the buffer after this function has been called. We reverse the storage - // internally for everything to be emitted out, so appending would insert - // into the storage at the wrong location. One must continue to call this - // function until a 'nil' return value has been received before appending. - mutating func next() -> ScalarAndNormData? { - guard !storage.isEmpty else { - isReversed = false - return nil - } - - // If our storage hasn't been reversed yet, do so now. - if !isReversed { - storage.reverse() - isReversed = true - } - - return storage.removeLast() - } - - // Sort the entire buffer based on the canonical combining class. 
- mutating func sort() { - storage._insertionSort(within: storage.indices) { - $0.normData.ccc < $1.normData.ccc - } - } - } -} - -extension Unicode { - // A wrapper type over the decomposition entry value we receive when we - // lookup a scalar's canonical decomposition. The layout of the underlying - // 32 bit value we receive is as follows: - // - // Top 14 bits Bottom 18 bits - // - // 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - // └───────── Index ─────────┘ └───────── Hashed Scalar ─────────┘ - // - // Index: This is the direct index into '_swift_stdlib_nfd_decompositions' - // that points to a size byte indicating the overall size of the - // UTF-8 decomposition string. Following the size byte is said string. - // - // Hashed Scalar: Because perfect hashing doesn't know the original set of - // keys it was hashed with, we store the original scalar in the - // decomposition entry so that we can guard against scalars - // who happen to hash to the same index. - // - internal struct _DecompositionEntry { - let rawValue: UInt32 - - // Our original scalar is stored in the first 18 bits of this entry. - var hashedScalar: Unicode.Scalar { - Unicode.Scalar(_value: (rawValue << 14) >> 14) - } - - // The index into the decomposition array is stored in the top 14 bits. - var index: Int { - Int(truncatingIfNeeded: rawValue >> 18) - } - - // A buffer pointer to the UTF8 decomposition string. - var utf8: UnsafeBufferPointer { - let decompPtr = _swift_stdlib_nfd_decompositions()._unsafelyUnwrappedUnchecked - - // This size is the utf8 length of the decomposition. - let size = Int(truncatingIfNeeded: decompPtr[index]) - - return UnsafeBufferPointer( - // We add 1 here to skip the size byte. - start: decompPtr + index + 1, - count: size - ) - } - - init(_ scalar: Unicode.Scalar) { - rawValue = _swift_stdlib_getDecompositionEntry(scalar.value) - } - } -} diff --git a/Sources/_StringProcessing/Unicode/Graphemes.swift b/Sources/_StringProcessing/Unicode/Graphemes.swift deleted file mode 100644 index d27f1d32b..000000000 --- a/Sources/_StringProcessing/Unicode/Graphemes.swift +++ /dev/null @@ -1,668 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021-2022 Apple Inc. and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// -//===----------------------------------------------------------------------===// - -@_silgen_name("_swift_stdlib_isLinkingConsonant") -func _swift_stdlib_isLinkingConsonant(_: UInt32) -> Bool - -@_silgen_name("_swift_stdlib_getGraphemeBreakProperty") -func _swift_stdlib_getGraphemeBreakProperty(_: UInt32) -> UInt8 - -extension Unicode { - internal enum _GraphemeBreakProperty { - case any - case control - case extend - case extendedPictographic - case l - case lv - case lvt - case prepend - case regionalIndicator - case spacingMark - case t - case v - case zwj - - init(from scalar: Unicode.Scalar) { - switch scalar.value { - // Some fast paths for ascii characters... - case 0x0 ... 0x1F: - self = .control - case 0x20 ... 0x7E: - self = .any - - case 0x200D: - self = .zwj - case 0x1100 ... 0x115F, - 0xA960 ... 0xA97C: - self = .l - case 0x1160 ... 0x11A7, - 0xD7B0 ... 0xD7C6: - self = .v - case 0x11A8 ... 0x11FF, - 0xD7CB ... 0xD7FB: - self = .t - case 0xAC00 ... 
0xD7A3: - if scalar.value % 28 == 16 { - self = .lv - } else { - self = .lvt - } - case 0x1F1E6 ... 0x1F1FF: - self = .regionalIndicator - case 0x1FC00 ... 0x1FFFD: - self = .extendedPictographic - case 0xE01F0 ... 0xE0FFF: - self = .control - default: - // Otherwise, default to binary searching the data array. - let rawEnumValue = _swift_stdlib_getGraphemeBreakProperty(scalar.value) - - switch rawEnumValue { - case 0: - self = .control - case 1: - self = .extend - case 2: - self = .prepend - case 3: - self = .spacingMark - - // Extended pictographic uses 2 values for its representation. - case 4, 5: - self = .extendedPictographic - default: - self = .any - } - } - } - } -} - -/// CR and LF are common special cases in grapheme breaking logic -private var _CR: UInt8 { return 0x0d } -private var _LF: UInt8 { return 0x0a } - -internal func _hasGraphemeBreakBetween( - _ lhs: Unicode.Scalar, _ rhs: Unicode.Scalar -) -> Bool { - - // CR-LF is a special case: no break between these - if lhs == Unicode.Scalar(_CR) && rhs == Unicode.Scalar(_LF) { - return false - } - - // Whether the given scalar, when it appears paired with another scalar - // satisfying this property, has a grapheme break between it and the other - // scalar. - func hasBreakWhenPaired(_ x: Unicode.Scalar) -> Bool { - // TODO: This doesn't generate optimal code, tune/re-write at a lower - // level. - // - // NOTE: Order of case ranges affects codegen, and thus performance. All - // things being equal, keep existing order below. - switch x.value { - // Unified CJK Han ideographs, common and some supplemental, amongst - // others: - // U+3400 ~ U+A4CF - case 0x3400...0xa4cf: return true - - // Repeat sub-300 check, this is beneficial for common cases of Latin - // characters embedded within non-Latin script (e.g. newlines, spaces, - // proper nouns and/or jargon, punctuation). - // - // NOTE: CR-LF special case has already been checked. - case 0x0000...0x02ff: return true - - // Non-combining kana: - // U+3041 ~ U+3096 - // U+30A1 ~ U+30FC - case 0x3041...0x3096: return true - case 0x30a1...0x30fc: return true - - // Non-combining modern (and some archaic) Cyrillic: - // U+0400 ~ U+0482 (first half of Cyrillic block) - case 0x0400...0x0482: return true - - // Modern Arabic, excluding extenders and prependers: - // U+061D ~ U+064A - case 0x061d...0x064a: return true - - // Precomposed Hangul syllables: - // U+AC00 ~ U+D7AF - case 0xac00...0xd7af: return true - - // Common general use punctuation, excluding extenders: - // U+2010 ~ U+2029 - case 0x2010...0x2029: return true - - // CJK punctuation characters, excluding extenders: - // U+3000 ~ U+3029 - case 0x3000...0x3029: return true - - // Full-width forms: - // U+FF01 ~ U+FF9D - case 0xFF01...0xFF9D: return true - - default: return false - } - } - return hasBreakWhenPaired(lhs) && hasBreakWhenPaired(rhs) -} - -extension Unicode.Scalar { - fileprivate var _isLinkingConsonant: Bool { - _swift_stdlib_isLinkingConsonant(value) - } - - fileprivate var _isVirama: Bool { - switch value { - // Devanagari - case 0x94D: - return true - // Bengali - case 0x9CD: - return true - // Gujarati - case 0xACD: - return true - // Oriya - case 0xB4D: - return true - // Telugu - case 0xC4D: - return true - // Malayalam - case 0xD4D: - return true - - default: - return false - } - } -} - -internal struct _GraphemeBreakingState { - // When we're looking through an indic sequence, one of the requirements is - // that there is at LEAST 1 Virama present between two linking consonants. 
- // This value helps ensure that when we ultimately need to decide whether or - // not to break that we've at least seen 1 when walking. - var hasSeenVirama = false - - // When walking forwards in a string, we need to know whether or not we've - // entered an emoji sequence to be able to eventually break after all of the - // emoji's various extenders and zero width joiners. This bit allows us to - // keep track of whether or not we're still in an emoji sequence when deciding - // to break. - var isInEmojiSequence = false - - // Similar to emoji sequences, we need to know not to break an Indic grapheme - // sequence. This sequence is (potentially) composed of many scalars and isn't - // as trivial as comparing two grapheme properties. - var isInIndicSequence = false - - // When walking forward in a string, we need to not break on emoji flag - // sequences. Emoji flag sequences are composed of 2 regional indicators, so - // when we see our first (.regionalIndicator, .regionalIndicator) decision, - // we need to know to return false in this case. However, if the next scalar - // is another regional indicator, we reach the same decision rule, but in this - // case we actually need to break there's a boundary between emoji flag - // sequences. - var shouldBreakRI = false -} - -extension String { - // Returns the stride of the next grapheme cluster at the previous boundary - // offset. - internal func nextBoundary( - startingAt index: Int, - nextScalar: (Int) -> (Unicode.Scalar, end: Int) - ) -> Int { - _internalInvariant(index != endIndex._encodedOffsetSP) - var state = _GraphemeBreakingState() - var index = index - - while true { - let (scalar1, nextIdx) = nextScalar(index) - index = nextIdx - - guard index != endIndex._encodedOffsetSP else { - break - } - - let (scalar2, _) = nextScalar(index) - - if shouldBreak(scalar1, between: scalar2, &state, index) { - break - } - } - - return index - } - - // Returns the stride of the previous grapheme cluster at the current boundary - // offset. - internal func previousBoundary( - endingAt index: Int, - previousScalar: (Int) -> (Unicode.Scalar, start: Int) - ) -> Int { - _internalInvariant(index != startIndex._encodedOffsetSP) - var state = _GraphemeBreakingState() - var index = index - - while true { - let (scalar2, previousIdx) = previousScalar(index) - index = previousIdx - - guard index != startIndex._encodedOffsetSP else { - break - } - - let (scalar1, _) = previousScalar(index) - - if shouldBreak( - scalar1, - between: scalar2, - &state, - index, - isBackwards: true - ) { - break - } - } - - return index - } -} - -extension String { - // The "algorithm" that determines whether or not we should break between - // certain grapheme break properties. - // - // This is based off of the Unicode Annex #29 for [Grapheme Cluster Boundary - // Rules](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules). - internal func shouldBreak( - _ scalar1: Unicode.Scalar, - between scalar2: Unicode.Scalar, - _ state: inout _GraphemeBreakingState, - _ index: Int, - isBackwards: Bool = false - ) -> Bool { - // GB3 - if scalar1.value == 0xD, scalar2.value == 0xA { - return false - } - - if _hasGraphemeBreakBetween(scalar1, scalar2) { - return true - } - - let x = Unicode._GraphemeBreakProperty(from: scalar1) - let y = Unicode._GraphemeBreakProperty(from: scalar2) - - // This variable and the defer statement help toggle the isInEmojiSequence - // state variable to false after every decision of 'shouldBreak'. 
If we - // happen to see a rhs .extend or .zwj, then it's a signal that we should - // continue treating the current grapheme cluster as an emoji sequence. - var enterEmojiSequence = false - - // Very similar to emoji sequences, but for Indic grapheme sequences. - var enterIndicSequence = false - - defer { - state.isInEmojiSequence = enterEmojiSequence - state.isInIndicSequence = enterIndicSequence - } - - switch (x, y) { - - // Fast path: If we know our scalars have no properties the decision is - // trivial and we don't need to crawl to the default statement. - case (.any, .any): - return true - - // GB4 - case (.control, _): - return true - - // GB5 - case (_, .control): - return true - - // GB6 - case (.l, .l), - (.l, .v), - (.l, .lv), - (.l, .lvt): - return false - - // GB7 - case (.lv, .v), - (.v, .v), - (.lv, .t), - (.v, .t): - return false - - // GB8 - case (.lvt, .t), - (.t, .t): - return false - - // GB9 (partial GB11) - case (_, .extend), - (_, .zwj): - - // If we're currently in an emoji sequence, then extends and ZWJ help - // continue the grapheme cluster by combining more scalars later. If we're - // not currently in an emoji sequence, but our lhs scalar is a pictograph, - // then that's a signal that it's the start of an emoji sequence. - if state.isInEmojiSequence || x == .extendedPictographic { - enterEmojiSequence = true - } - - // If we're currently in an indic sequence (or if our lhs is a linking - // consonant), then this check and everything underneath ensures that - // we continue being in one and may check if this extend is a Virama. - if state.isInIndicSequence || scalar1._isLinkingConsonant { - if y == .extend { - let extendNormData = Unicode._NormData(scalar2, fastUpperbound: 0x300) - - // If our extend's CCC is 0, then this rule does not apply. - guard extendNormData.ccc != 0 else { - return false - } - } - - enterIndicSequence = true - - if scalar2._isVirama { - state.hasSeenVirama = true - } - } - - return false - - // GB9a - case (_, .spacingMark): - return false - - // GB9b - case (.prepend, _): - return false - - // GB11 - case (.zwj, .extendedPictographic): - if isBackwards { - return !checkIfInEmojiSequence(index) - } - - return !state.isInEmojiSequence - - // GB12 & GB13 - case (.regionalIndicator, .regionalIndicator): - if isBackwards { - return countRIs(index) - } - - defer { - state.shouldBreakRI.toggle() - } - - return state.shouldBreakRI - - // GB999 - default: - // GB9c - if state.isInIndicSequence, state.hasSeenVirama, scalar2._isLinkingConsonant { - state.hasSeenVirama = false - return false - } - - // Handle GB9c when walking backwards. - if isBackwards { - switch (x, scalar2._isLinkingConsonant) { - case (.extend, true): - let extendNormData = Unicode._NormData(scalar1, fastUpperbound: 0x300) - - guard extendNormData.ccc != 0 else { - return true - } - - return !checkIfInIndicSequence(index) - - case (.zwj, true): - return !checkIfInIndicSequence(index) - - default: - return true - } - } - - return true - } - } - - // When walking backwards, it's impossible to know whether we were in an emoji - // sequence without walking further backwards. This walks the string backwards - // enough until we figure out whether or not to break our - // (.zwj, .extendedPictographic) question. For example: - // - // Scalar view #1: - // - // [.control, .zwj, .extendedPictographic] - // ^ - // | = To determine whether or not we break here, we need - // to see the previous scalar's grapheme property. 
- // ^ - // | = This is neither .extendedPictographic nor .extend, thus we - // were never in an emoji sequence, so break between the .zwj - // and .extendedPictographic. - // - // Scalar view #2: - // - // [.extendedPictographic, .zwj, .extendedPictographic] - // ^ - // | = Same as above, move backwards one to - // view the previous scalar's property. - // ^ - // | = This is an .extendedPictographic, so this indicates that - // we are in an emoji sequence, so we should NOT break - // between the .zwj and .extendedPictographic. - // - // Scalar view #3: - // - // [.extendedPictographic, .extend, .extend, .zwj, .extendedPictographic] - // ^ - // | = Same as above - // ^ - // | = This is an .extend which means - // there is a potential emoji - // sequence, walk further backwards - // to find an .extendedPictographic. - // - // <-- = Another extend, go backwards more. - // ^ - // | = We found our starting .extendedPictographic letting us - // know that we are in an emoji sequence so our initial - // break question is answered as NO. - internal func checkIfInEmojiSequence(_ index: Int) -> Bool { - var emojiIdx = String.Index(_encodedOffsetSP: index) - - guard emojiIdx != startIndex else { - return false - } - - let scalars = unicodeScalars - scalars.formIndex(before: &emojiIdx) - - while emojiIdx != startIndex { - scalars.formIndex(before: &emojiIdx) - let scalar = scalars[emojiIdx] - - let gbp = Unicode._GraphemeBreakProperty(from: scalar) - - switch gbp { - case .extend: - continue - case .extendedPictographic: - return true - default: - return false - } - } - - return false - } - - // When walking backwards, it's impossible to know whether we break when we - // see our first ((.extend|.zwj), .linkingConsonant) without walking - // further backwards. This walks the string backwards enough until we figure - // out whether or not to break this indic sequence. For example: - // - // Scalar view #1: - // - // [.virama, .extend, .linkingConsonant] - // ^ - // | = To be able to know whether or not to break these - // two, we need to walk backwards to determine if - // this is a legitimate indic sequence. - // ^ - // | = The scalar sequence ends without a starting linking consonant, - // so this is in fact not an indic sequence, so we can break the two. - // - // Scalar view #2: - // - // [.linkingConsonant, .virama, .extend, .linkingConsonant] - // ^ - // | = Same as above - // ^ - // | = This is a virama, so we at least have seen - // 1 to be able to return true if we see a - // linking consonant later. - // ^ - // | = Is a linking consonant and we've seen a virama, so this is a - // legitimate indic sequence, so do NOT break the initial question. - internal func checkIfInIndicSequence(_ index: Int) -> Bool { - var indicIdx = String.Index(_encodedOffsetSP: index) - - guard indicIdx != startIndex else { - return false - } - - let scalars = unicodeScalars - scalars.formIndex(before: &indicIdx) - - var hasSeenVirama = false - - // Check if the first extend was the Virama. 
- let scalar = scalars[indicIdx] - - if scalar._isVirama { - hasSeenVirama = true - } - - while indicIdx != startIndex { - scalars.formIndex(before: &indicIdx) - let scalar = scalars[indicIdx] - - let gbp = Unicode._GraphemeBreakProperty(from: scalar) - - switch (gbp, scalar._isLinkingConsonant) { - case (.extend, false): - let extendNormData = Unicode._NormData(scalar, fastUpperbound: 0x300) - - guard extendNormData.ccc != 0 else { - return false - } - - if scalar._isVirama { - hasSeenVirama = true - } - - case (.zwj, false): - continue - - // LinkingConsonant - case (_, true): - guard hasSeenVirama else { - return false - } - - return true - - default: - return false - } - } - - return false - } - - // When walking backwards, it's impossible to know whether we break when we - // see our first (.regionalIndicator, .regionalIndicator) without walking - // further backwards. This walks the string backwards enough until we figure - // out whether or not to break these RIs. For example: - // - // Scalar view #1: - // - // [.control, .regionalIndicator, .regionalIndicator] - // ^ - // | = To be able to know whether or not to - // break these two, we need to walk - // backwards to determine if there were - // any previous .regionalIndicators in - // a row. - // ^ - // | = Not a .regionalIndicator, so our total riCount is 0 and 0 is - // even thus we do not break. - // - // Scalar view #2: - // - // [.control, .regionalIndicator, .regionalIndicator, .regionalIndicator] - // ^ - // | = Same as above - // ^ - // | = This is a .regionalIndicator, so continue - // walking backwards for more of them. riCount is - // now equal to 1. - // ^ - // | = Not a .regionalIndicator. riCount = 1 which is odd, so break - // the last two .regionalIndicators. - internal func countRIs( - _ index: Int - ) -> Bool { - var riIdx = String.Index(_encodedOffsetSP: index) - - guard riIdx != startIndex else { - return false - } - - var riCount = 0 - - let scalars = unicodeScalars - scalars.formIndex(before: &riIdx) - - while riIdx != startIndex { - scalars.formIndex(before: &riIdx) - let scalar = scalars[riIdx] - - let gbp = Unicode._GraphemeBreakProperty(from: scalar) - - guard gbp == .regionalIndicator else { - break - } - - riCount += 1 - } - - return riCount & 1 != 0 - } -} diff --git a/Sources/_StringProcessing/Unicode/NecessaryEvils.swift b/Sources/_StringProcessing/Unicode/NecessaryEvils.swift index 1c2499fbc..672a731bd 100644 --- a/Sources/_StringProcessing/Unicode/NecessaryEvils.swift +++ b/Sources/_StringProcessing/Unicode/NecessaryEvils.swift @@ -88,14 +88,3 @@ extension UTF16 { (UInt32(lead & 0x03ff) &<< 10 | UInt32(trail & 0x03ff))) } } - -extension String.Index { - internal var _encodedOffsetSP: Int { - // The encoded offset is found in the top 48 bits. - Int(unsafeBitCast(self, to: UInt64.self) >> 16) - } - - internal init(_encodedOffsetSP offset: Int) { - self = unsafeBitCast(offset << 16, to: Self.self) - } -} diff --git a/Sources/_StringProcessing/Unicode/Normalization.swift b/Sources/_StringProcessing/Unicode/Normalization.swift deleted file mode 100644 index ce73aac47..000000000 --- a/Sources/_StringProcessing/Unicode/Normalization.swift +++ /dev/null @@ -1,406 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021-2022 Apple Inc. 
and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// -//===----------------------------------------------------------------------===// - -@_silgen_name("_swift_stdlib_getNormData") -func _swift_stdlib_getNormData(_: UInt32) -> UInt16 - -@_silgen_name("_swift_stdlib_nfd_decompositions") -func _swift_stdlib_nfd_decompositions() -> UnsafePointer? - -@_silgen_name("_swift_stdlib_getDecompositionEntry") -func _swift_stdlib_getDecompositionEntry(_: UInt32) -> UInt32 - -@_silgen_name("_swift_stdlib_getComposition") -func _swift_stdlib_getComposition(_: UInt32, _: UInt32) -> UInt32 - -extension Unicode { - internal struct _NFD { - let base: S - } -} - -extension Unicode._NFD { - internal struct Iterator { - var buffer = Unicode._NormDataBuffer() - - // This index always points at the next starter of a normalization segment. - // Each iteration of 'next()' moves this index up to the next starter. - var index: S.UnicodeScalarView.Index - - let unicodeScalars: S.UnicodeScalarView - } -} - -extension Unicode._NFD.Iterator: IteratorProtocol { - internal mutating func decompose( - _ scalar: Unicode.Scalar, - with normData: Unicode._NormData - ) { - // ASCII always decomposes to itself. - if _fastPath(scalar.value < 0xC0) { - // ASCII always has normData of 0. - // CCC = 0, NFC_QC = Yes, NFD_QC = Yes - buffer.append((scalar, normData)) - return - } - - // Handle Hangul decomposition algorithmically. - // S.base = 0xAC00 - // S.count = 11172 - // S.base + S.count - 1 = 0xD7A3 - if (0xAC00 ... 0xD7A3).contains(scalar.value) { - decomposeHangul(scalar) - return - } - - // Otherwise, we need to lookup the decomposition (if there is one). - decomposeSlow(scalar, with: normData) - } - - @inline(never) - internal mutating func decomposeHangul(_ scalar: Unicode.Scalar) { - // L = Hangul leading consonants - let L: (base: UInt32, count: UInt32) = (base: 0x1100, count: 19) - // V = Hangul vowels - let V: (base: UInt32, count: UInt32) = (base: 0x1161, count: 21) - // T = Hangul tail consonants - let T: (base: UInt32, count: UInt32) = (base: 0x11A7, count: 28) - // N = Number of precomposed Hangul syllables that start with the same - // leading consonant. (There is no base for N). - let N: (base: UInt32, count: UInt32) = (base: 0x0, count: 588) - // S = Hangul precomposed syllables - let S: (base: UInt32, count: UInt32) = (base: 0xAC00, count: 11172) - - let sIdx = scalar.value &- S.base - - let lIdx = sIdx / N.count - let l = Unicode.Scalar(_value: L.base &+ lIdx) - // Hangul leading consonants, L, always have normData of 0. - // CCC = 0, NFC_QC = Yes, NFD_QC = Yes - buffer.append((scalar: l, normData: .init(rawValue: 0))) - - let vIdx = (sIdx % N.count) / T.count - let v = Unicode.Scalar(_value: V.base &+ vIdx) - // Hangul vowels, V, always have normData of 4. - // CCC = 0, NFC_QC = Maybe, NFD_QC = Yes - buffer.append((scalar: v, normData: .init(rawValue: 4))) - - let tIdx = sIdx % T.count - if tIdx != 0 { - let t = Unicode.Scalar(_value: T.base &+ tIdx) - // Hangul tail consonants, T, always have normData of 4. - // CCC = 0, NFC_QC = Maybe, NFD_QC = Yes - buffer.append((scalar: t, normData: .init(rawValue: 4))) - } - } - - @inline(never) - internal mutating func decomposeSlow( - _ scalar: Unicode.Scalar, - with normData: Unicode._NormData - ) { - // Look into the decomposition perfect hash table. 
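    // Editorial note (illustrative, not part of the original file): a hit in
    // this table yields the scalar's canonical decomposition as a UTF-8
    // string. For example, U+00E9 LATIN SMALL LETTER E WITH ACUTE decomposes
    // to <U+0065, U+0301>, which is what the loop below emits scalar by
    // scalar.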
- let decompEntry = Unicode._DecompositionEntry(scalar) - - // If this is not our original scalar, then we have no decomposition for this - // scalar, so just emit itself. This is required because perfect hashing - // does not know the original set of keys that it used to create itself, so - // we store the original scalar in our decomposition entry to ensure that - // scalars that hash to the same index don't succeed. - guard scalar == decompEntry.hashedScalar else { - buffer.append((scalar, normData)) - return - } - - var utf8 = decompEntry.utf8 - - while utf8.count > 0 { - let (scalar, len) = UnsafeAssumingValidUTF8.decode( - utf8.byteBuffer, - startingAt: 0 - ) - utf8 = UnsafeBufferPointer(rebasing: utf8[len...]) - - // Fast path: Because this will be emitted into the completed NFD buffer, - // we don't need to look at NFD_QC anymore which lets us do a larger - // latiny check for NFC_QC and CCC (0xC0 vs. 0x300). - let normData = Unicode._NormData(scalar, fastUpperbound: 0x300) - - buffer.append((scalar, normData)) - } - } - - internal mutating func next() -> ScalarAndNormData? { - // Empty out our buffer before attempting to decompose the next - // normalization segment. - if let nextBuffered = buffer.next() { - return nextBuffered - } - - while index < unicodeScalars.endIndex { - let scalar = unicodeScalars[index] - let normData = Unicode._NormData(scalar) - - // If we've reached a starter, stop. - if normData.ccc == 0, !buffer.isEmpty { - break - } - - unicodeScalars.formIndex(after: &index) - - // If our scalar IS NFD quick check, then it's as simple as appending to - // our buffer and moving on the next scalar. Otherwise, we need to - // decompose this and append each decomposed scalar. - if normData.isNFDQC { - // Fast path: If our scalar is also ccc = 0, then this doesn't need to - // be appended to the buffer at all. - if normData.ccc == 0 { - return (scalar, normData) - } - - buffer.append((scalar, normData)) - } else { - decompose(scalar, with: normData) - } - } - - // Sort the entire buffer based on the canonical combining class. - buffer.sort() - - return buffer.next() - } -} - -extension Unicode._NFD: Sequence { - internal func makeIterator() -> Iterator { - Iterator( - index: base.unicodeScalars.startIndex, - unicodeScalars: base.unicodeScalars - ) - } -} - -extension StringProtocol { - internal var _nfd: Unicode._NFD { - Unicode._NFD(base: self) - } -} - -extension Unicode { - internal struct _NFC { - let base: S - } -} - -extension Unicode._NFC { - internal struct Iterator { - var buffer = Unicode._NormDataBuffer() - - // This is our starter that is currently being composed with other scalars - // into new scalars. For example, "e\u{301}", here our first scalar is 'e', - // which is a starter, thus we assign composee to this 'e' and move to the - // next scalar. We attempt to compose our composee, 'e', with '\u{301}' and - // find that there is a composition. Thus our new composee is now 'é' and - // we continue to try and compose following scalars with this composee. - var composee: Unicode.Scalar? = nil - - var iterator: Unicode._NFD.Iterator - } -} - -extension Unicode._NFC.Iterator: IteratorProtocol { - internal func compose( - _ x: Unicode.Scalar, - and y: Unicode.Scalar - ) -> Unicode.Scalar? { - // Fast path: ASCII and some latiny scalars never compose when they're on - // the rhs. - if _fastPath(y.value < 0x300) { - return nil - } - - if let hangul = composeHangul(x, and: y) { - return hangul - } - - // Otherwise, lookup the composition. 
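    // Editorial note (illustrative, not part of the original file): for
    // example, composing U+0065 LATIN SMALL LETTER E with U+0301 COMBINING
    // ACUTE ACCENT yields U+00E9. When the pair has no primary composite,
    // the lookup returns UInt32.max, which the guard below treats as
    // "no composition found".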
- let composition = _swift_stdlib_getComposition(x.value, y.value) - - guard composition != .max else { - return nil - } - - return Unicode.Scalar(_value: composition) - } - - @inline(never) - internal func composeHangul( - _ x: Unicode.Scalar, - and y: Unicode.Scalar - ) -> Unicode.Scalar? { - // L = Hangul leading consonants - let L: (base: UInt32, count: UInt32) = (base: 0x1100, count: 19) - // V = Hangul vowels - let V: (base: UInt32, count: UInt32) = (base: 0x1161, count: 21) - // T = Hangul tail consonants - let T: (base: UInt32, count: UInt32) = (base: 0x11A7, count: 28) - // N = Number of precomposed Hangul syllables that start with the same - // leading consonant. (There is no base for N). - let N: (base: UInt32, count: UInt32) = (base: 0x0, count: 588) - // S = Hangul precomposed syllables - let S: (base: UInt32, count: UInt32) = (base: 0xAC00, count: 11172) - - switch (x.value, y.value) { - // Check for Hangul (L, V) -> LV compositions. - case (L.base ..< L.base &+ L.count, V.base ..< V.base &+ V.count): - let lIdx = x.value &- L.base - let vIdx = y.value &- V.base - let lvIdx = lIdx &* N.count &+ vIdx &* T.count - let s = S.base &+ lvIdx - return Unicode.Scalar(_value: s) - - // Check for Hangul (LV, T) -> LVT compositions. - case (S.base ..< S.base &+ S.count, T.base &+ 1 ..< T.base &+ T.count): - if (x.value &- S.base) % T.count == 0 { - return Unicode.Scalar(_value: x.value &+ y.value &- T.base) - } else { - fallthrough - } - - default: - return nil - } - } - - internal mutating func next() -> Unicode.Scalar? { - // Empty out our buffer before attempting to compose anything with our new - // composee. - if let nextBuffered = buffer.next() { - return nextBuffered.scalar - } - - while let current = iterator.next() { - guard let currentComposee = composee else { - // If we don't have a composee at this point, we're most likely looking - // at the start of a string. If our class is 0, then attempt to compose - // the following scalars with this one. Otherwise, it's a one off scalar - // that needs to be emitted. - if current.normData.ccc == 0 { - composee = current.scalar - continue - } else { - return current.scalar - } - } - - // If we have any scalars in the buffer, it means those scalars couldn't - // compose with our composee to form a new scalar. However, scalars - // following them may still compose with our composee, so take the last - // scalar in the buffer and get its normalization data so that we can - // perform the check underneath this one about whether this current scalar - // is "blocked". We get the last scalar because the scalars we receive are - // already NFD, so the last scalar in the buffer will have the highest - // CCC value in this normalization segment. - guard let lastBufferedNormData = buffer.last?.normData else { - // If we do not any have scalars in our buffer yet, then this step is - // trivial. Attempt to compose our current scalar with whatever composee - // we're currently building up. - - // If our right hand side scalar IS NFC_QC, then that means it can - // never compose with any scalars previous to it. So, if our current - // scalar is NFC_QC, then we have no composition. - guard !current.normData.isNFCQC, - let composed = compose(currentComposee, and: current.scalar) else { - // We did not find a composition between the two. If our current class - // is 0, then set that as the new composee and return whatever built - // up scalar we have. Otherwise, add our current scalar to the buffer - // for eventual removal! 
- - if current.normData.ccc == 0 { - composee = current.scalar - return currentComposee - } - - buffer.append(current) - continue - } - - // We found a composition! Record it as our new composee and repeat the - // process. - composee = composed - continue - } - - // Check if our current scalar is not blocked from our current composee. - // In this case blocked means there is some scalar whose class - // (lastBufferedNormData.ccc) is either == 0 or >= current.normData.ccc. - // - // Example: - // - // "z\u{0335}\u{0327}\u{0324}\u{0301}" - // - // In this example, there are several combining marks following a 'z', but - // none of them actually compose with the composee 'z'. However, the last - // scalar U+0301 does actually compose. So this check makes sure that the - // last scalar doesn't have any scalar in between it and the composee that - // would otherwise "block" it from composing. - guard lastBufferedNormData.ccc < current.normData.ccc else { - // We had a scalar block it. That means our current scalar is either a - // starter or has a same class (preserve ordering). - - // Starters are the "start" of a new normalization segment. Set it as - // the new composee and return our current composee. This will trigger - // any other scalars in the buffer to be emitted before we handle - // normalizing this new segment. - if current.normData.ccc == 0 { - composee = current.scalar - return currentComposee - } - - _internalInvariant(current.normData.ccc == lastBufferedNormData.ccc) - buffer.append(current) - continue - } - - // There were no blockers! Attempt to compose the two! (Again, if our rhs - // scalar IS NFC_QC, then it can never compose with anything previous to - // it). - guard !current.normData.isNFCQC, - let composed = compose(currentComposee, and: current.scalar) else { - // No composition found. Stick it at the end of the buffer with the rest - // of non-composed scalars. - - buffer.append(current) - continue - } - - // They composed! Assign the composition as our new composee and iterate - // to the next scalar. - composee = composed - } - - // If we have a leftover composee, make sure to return it. - return composee._take() - } -} - -extension Unicode._NFC: Sequence { - internal func makeIterator() -> Iterator { - Iterator(iterator: base._nfd.makeIterator()) - } -} - -extension StringProtocol { - internal var _nfc: Unicode._NFC { - Unicode._NFC(base: self) - } -} - diff --git a/Sources/_StringProcessing/_CharacterClassModel.swift b/Sources/_StringProcessing/_CharacterClassModel.swift index 670a26c79..4d0c12c1f 100644 --- a/Sources/_StringProcessing/_CharacterClassModel.swift +++ b/Sources/_StringProcessing/_CharacterClassModel.swift @@ -28,7 +28,7 @@ public struct _CharacterClassModel: Hashable { var isInverted: Bool = false // TODO: Split out builtin character classes into their own type? - public enum Representation: Hashable { + public enum Representation: Hashable { /// Any character case any /// Any grapheme cluster @@ -54,10 +54,14 @@ public struct _CharacterClassModel: Hashable { case custom([CharacterSetComponent]) } - public typealias SetOperator = AST.CustomCharacterClass.SetOp + public enum SetOperator: Hashable { + case subtraction + case intersection + case symmetricDifference + } /// A binary set operation that forms a character class component. 
- public struct SetOperation: Hashable { + public struct SetOperation: Hashable { var lhs: CharacterSetComponent var op: SetOperator var rhs: CharacterSetComponent @@ -74,7 +78,7 @@ public struct _CharacterClassModel: Hashable { } } - public enum CharacterSetComponent: Hashable { + public enum CharacterSetComponent: Hashable { case character(Character) case range(ClosedRange) @@ -135,23 +139,30 @@ public struct _CharacterClassModel: Hashable { return result } - /// Returns an inverted character class if true is passed, otherwise the - /// same character class is returned. - func withInversion(_ invertion: Bool) -> Self { + /// Conditionally inverts a character class. + /// + /// - Parameter inversion: Indicates whether to invert the character class. + /// - Returns: The inverted character class if `inversion` is `true`; + /// otherwise, the same character class. + func withInversion(_ inversion: Bool) -> Self { var copy = self - if invertion { + if inversion { copy.isInverted.toggle() } return copy } - /// Returns the inverse character class. + /// Inverts a character class. public var inverted: Self { return withInversion(true) } - /// Returns the end of the match of this character class in `str`, if - /// it matches. + /// Returns the end of the match of this character class in the string. + /// + /// - Parameter str: The string to match against. + /// - Parameter at: The index to start matching. + /// - Parameter options: Options for the match operation. + /// - Returns: The index of the end of the match, or `nil` if there is no match. func matches(in str: String, at i: String.Index, with options: MatchingOptions) -> String.Index? { switch matchLevel { case .graphemeCluster: @@ -306,7 +317,17 @@ extension _CharacterClassModel: CustomStringConvertible { } extension _CharacterClassModel { - public func makeAST() -> AST.Node? { + public func makeDSLTreeCharacterClass() -> DSLTree.CustomCharacterClass? { + // FIXME: Implement in DSLTree instead of wrapping an AST atom + switch makeAST() { + case .atom(let atom): + return .init(members: [.atom(.unconverted(.init(ast: atom)))]) + default: + return nil + } + } + + internal func makeAST() -> AST.Node? { let inv = isInverted func esc(_ b: AST.Atom.EscapedBuiltin) -> AST.Node { @@ -387,7 +408,7 @@ extension DSLTree.Atom { var characterClass: _CharacterClassModel? 
{ switch self { case let .unconverted(a): - return a.characterClass + return a.ast.characterClass default: return nil } diff --git a/Sources/Prototypes/CMakeLists.txt b/Tests/Prototypes/CMakeLists.txt similarity index 100% rename from Sources/Prototypes/CMakeLists.txt rename to Tests/Prototypes/CMakeLists.txt diff --git a/Sources/Prototypes/Combinators/Combinators.swift b/Tests/Prototypes/Combinators/Combinators.swift similarity index 100% rename from Sources/Prototypes/Combinators/Combinators.swift rename to Tests/Prototypes/Combinators/Combinators.swift diff --git a/Sources/Prototypes/PEG/PEG.swift b/Tests/Prototypes/PEG/PEG.swift similarity index 100% rename from Sources/Prototypes/PEG/PEG.swift rename to Tests/Prototypes/PEG/PEG.swift diff --git a/Sources/Prototypes/PEG/PEGCode.swift b/Tests/Prototypes/PEG/PEGCode.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGCode.swift rename to Tests/Prototypes/PEG/PEGCode.swift diff --git a/Sources/Prototypes/PEG/PEGCompile.swift b/Tests/Prototypes/PEG/PEGCompile.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGCompile.swift rename to Tests/Prototypes/PEG/PEGCompile.swift diff --git a/Sources/Prototypes/PEG/PEGCore.swift b/Tests/Prototypes/PEG/PEGCore.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGCore.swift rename to Tests/Prototypes/PEG/PEGCore.swift diff --git a/Sources/Prototypes/PEG/PEGInterpreter.swift b/Tests/Prototypes/PEG/PEGInterpreter.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGInterpreter.swift rename to Tests/Prototypes/PEG/PEGInterpreter.swift diff --git a/Sources/Prototypes/PEG/PEGTranspile.swift b/Tests/Prototypes/PEG/PEGTranspile.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGTranspile.swift rename to Tests/Prototypes/PEG/PEGTranspile.swift diff --git a/Sources/Prototypes/PEG/PEGVM.swift b/Tests/Prototypes/PEG/PEGVM.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGVM.swift rename to Tests/Prototypes/PEG/PEGVM.swift diff --git a/Sources/Prototypes/PEG/PEGVMExecute.swift b/Tests/Prototypes/PEG/PEGVMExecute.swift similarity index 100% rename from Sources/Prototypes/PEG/PEGVMExecute.swift rename to Tests/Prototypes/PEG/PEGVMExecute.swift diff --git a/Sources/Prototypes/PEG/Printing.swift b/Tests/Prototypes/PEG/Printing.swift similarity index 100% rename from Sources/Prototypes/PEG/Printing.swift rename to Tests/Prototypes/PEG/Printing.swift diff --git a/Sources/Prototypes/PTCaRet/Interpreter.swift b/Tests/Prototypes/PTCaRet/Interpreter.swift similarity index 100% rename from Sources/Prototypes/PTCaRet/Interpreter.swift rename to Tests/Prototypes/PTCaRet/Interpreter.swift diff --git a/Sources/Prototypes/PTCaRet/PTCaRet.swift b/Tests/Prototypes/PTCaRet/PTCaRet.swift similarity index 100% rename from Sources/Prototypes/PTCaRet/PTCaRet.swift rename to Tests/Prototypes/PTCaRet/PTCaRet.swift diff --git a/Sources/Prototypes/TourOfTypes/CharacterClass.swift b/Tests/Prototypes/TourOfTypes/CharacterClass.swift similarity index 100% rename from Sources/Prototypes/TourOfTypes/CharacterClass.swift rename to Tests/Prototypes/TourOfTypes/CharacterClass.swift diff --git a/Sources/Prototypes/TourOfTypes/Literal.swift b/Tests/Prototypes/TourOfTypes/Literal.swift similarity index 100% rename from Sources/Prototypes/TourOfTypes/Literal.swift rename to Tests/Prototypes/TourOfTypes/Literal.swift diff --git a/Tests/RegexBuilderTests/AlgorithmsTests.swift b/Tests/RegexBuilderTests/AlgorithmsTests.swift index 793054cd1..0a2e6bc21 100644 --- 
a/Tests/RegexBuilderTests/AlgorithmsTests.swift +++ b/Tests/RegexBuilderTests/AlgorithmsTests.swift @@ -11,7 +11,7 @@ import XCTest import _StringProcessing -@testable import RegexBuilder +import RegexBuilder @available(SwiftStdlib 5.7, *) class RegexConsumerTests: XCTestCase { @@ -60,4 +60,384 @@ class RegexConsumerTests: XCTestCase { result: "6 60 0 6x4", { match in "\(match.output.1 * match.output.2 * (match.output.3 ?? 1))" }) } + + func testMatchReplaceSubrange() { + func replaceTest( + _ regex: R, + input: String, + _ replace: (Regex.Match) -> String, + _ tests: (subrange: Range, maxReplacement: Int, result: String)..., + file: StaticString = #file, + line: UInt = #line + ) { + for (subrange, maxReplacement, result) in tests { + XCTAssertEqual(input.replacing(regex, subrange: subrange, maxReplacements: maxReplacement, with: replace), result, file: file, line: line) + } + } + + let int = Capture(OneOrMore(.digit)) { Int($0)! } + + let addition = "9+16, 0+3, 5+5, 99+1" + + replaceTest( + Regex { int; "+"; int }, + input: "9+16, 0+3, 5+5, 99+1", + { match in "\(match.output.1 + match.output.2)" }, + + (subrange: addition.startIndex..( + _ algo: MatchAlgo, + _ tests: (input: String, expectedCaptures: MatchType?)..., + matchType: MatchType.Type, + equivalence: (MatchType, MatchType) -> Bool, + file: StaticString = #file, + line: UInt = #line, + @RegexComponentBuilder _ content: () -> R + ) throws { + for (input, expectedCaptures) in tests { + var actual: Regex.Match? + switch algo { + case .whole: + actual = input.wholeMatch(of: content) + case .first: + actual = input.firstMatch(of: content) + case .prefix: + actual = input.prefixMatch(of: content) + } + if let expectedCaptures = expectedCaptures { + let match = try XCTUnwrap(actual, file: file, line: line) + let captures = try XCTUnwrap(match.output as? MatchType, file: file, line: line) + XCTAssertTrue(equivalence(captures, expectedCaptures), file: file, line: line) + } else { + XCTAssertNil(actual, file: file, line: line) + } + } + } + + func expectEqual( + _ algo: EquatableAlgo, + _ tests: (input: String, expected: Expected)..., + file: StaticString = #file, + line: UInt = #line, + @RegexComponentBuilder _ content: () -> R + ) throws { + for (input, expected) in tests { + var actual: Expected + switch algo { + case .contains: + actual = input.contains(content) as! Expected + case .starts: + actual = input.starts(with: content) as! Expected + case .trimmingPrefix: + actual = input.trimmingPrefix(content) as! Expected + } + XCTAssertEqual(actual, expected) + } + } + + func testMatches() throws { + let int = Capture(OneOrMore(.digit)) { Int($0)! 
} + + // Test syntax + let add = Regex { + int + "+" + int + } + let content = { add } + + let m = "2020+16".wholeMatch { + int + "+" + int + } + XCTAssertEqual(m?.output.0, "2020+16") + XCTAssertEqual(m?.output.1, 2020) + XCTAssertEqual(m?.output.2, 16) + + let m1 = "2020+16".wholeMatch(of: content) + XCTAssertEqual(m1?.output.0, m?.output.0) + XCTAssertEqual(m1?.output.1, m?.output.1) + XCTAssertEqual(m1?.output.2, m?.output.2) + + let firstMatch = "2020+16 0+0".firstMatch(of: content) + XCTAssertEqual(firstMatch?.output.0, "2020+16") + XCTAssertEqual(firstMatch?.output.1, 2020) + XCTAssertEqual(firstMatch?.output.2, 16) + + let prefix = "2020+16 0+0".prefixMatch(of: content) + XCTAssertEqual(prefix?.output.0, "2020+16") + XCTAssertEqual(prefix?.output.1, 2020) + XCTAssertEqual(prefix?.output.2, 16) + + try expectMatch( + .whole, + ("0+0", ("0+0", 0, 0)), + ("2020+16", ("2020+16", 2020, 16)), + ("-2020+16", nil), + ("2020+16+0+0", nil), + matchType: (Substring, Int, Int).self, + equivalence: == + ) { + int + "+" + int + } + + try expectMatch( + .prefix, + ("0+0", ("0+0", 0, 0)), + ("2020+16", ("2020+16", 2020, 16)), + ("-2020+16", nil), + ("2020+16+0+0", ("2020+16", 2020, 16)), + matchType: (Substring, Int, Int).self, + equivalence: == + ) { + int + "+" + int + } + + try expectMatch( + .first, + ("0+0", ("0+0", 0, 0)), + ("2020+16", ("2020+16", 2020, 16)), + ("-2020+16", ("2020+16", 2020, 16)), + ("2020+16+0+0", ("2020+16", 2020, 16)), + matchType: (Substring, Int, Int).self, + equivalence: == + ) { + int + "+" + int + } + } + + func testStartsAndContains() throws { + let fam = "👨‍👩‍👧‍👦👨‍👨‍👧‍👧 we Ⓡ family" + let startsWithGrapheme = fam.starts { + OneOrMore(.anyGrapheme) + OneOrMore(.whitespace) + } + XCTAssertEqual(startsWithGrapheme, true) + + let containsDads = fam.contains { + "👨‍👨‍👧‍👧" + } + XCTAssertEqual(containsDads, true) + + let content = { + Regex { + OneOrMore(.anyGrapheme) + OneOrMore(.whitespace) + } + } + XCTAssertEqual(fam.starts(with: content), true) + XCTAssertEqual(fam.contains(content), true) + + let int = Capture(OneOrMore(.digit)) { Int($0)! } + + try expectEqual( + .starts, + ("9+16, 0+3, 5+5, 99+1", true), + ("-9+16, 0+3, 5+5, 99+1", false), + (" 9+16", false), + ("a+b, c+d", false), + ("", false) + ) { + int + "+" + int + } + + try expectEqual( + .contains, + ("9+16, 0+3, 5+5, 99+1", true), + ("-9+16, 0+3, 5+5, 99+1", true), + (" 9+16", true), + ("a+b, c+d", false), + ("", false) + ) { + int + "+" + int + } + } + + func testTrim() throws { + let int = Capture(OneOrMore(.digit)) { Int($0)! } + + // Test syntax + let code = "(408)888-8888".trimmingPrefix { + "(" + OneOrMore(.digit) + ")" + } + XCTAssertEqual(code, Substring("888-8888")) + + var mutable = "👨‍👩‍👧‍👦 we Ⓡ family" + mutable.trimPrefix { + .anyGrapheme + ZeroOrMore(.whitespace) + } + XCTAssertEqual(mutable, "we Ⓡ family") + + try expectEqual( + .trimmingPrefix, + ("9+16 0+3 5+5 99+1", Substring(" 0+3 5+5 99+1")), + ("a+b 0+3 5+5 99+1", Substring("a+b 0+3 5+5 99+1")), + ("0+3+5+5+99+1", Substring("+5+5+99+1")), + ("", "") + ) { + int + "+" + int + } + } + + func testReplace() { + // Test no ambiguitiy using the trailing closure + var replaced: String + let str = "9+16, 0+3, 5+5, 99+1" + replaced = str.replacing(with: "🔢") { + OneOrMore(.digit) + "+" + OneOrMore(.digit) + } + XCTAssertEqual(replaced, "🔢, 🔢, 🔢, 🔢") + + replaced = str.replacing( + with: "🔢", + subrange: str.startIndex..( } } +// Test support +struct Concat : Equatable { + var wrapped: String + init(_ name: String, _ suffix: Int?) 
{ + if let suffix = suffix { + wrapped = name + String(suffix) + } else { + wrapped = name + } + } +} + +extension Concat : Collection { + typealias Index = String.Index + typealias Element = String.Element + + var startIndex: Index { return wrapped.startIndex } + var endIndex: Index { return wrapped.endIndex } + + subscript(position: Index) -> Element { + return wrapped[position] + } + + func index(after i: Index) -> Index { + return wrapped.index(after: i) + } +} + +extension Concat: BidirectionalCollection { + typealias Indices = String.Indices + typealias SubSequence = String.SubSequence + + func index(before i: Index) -> Index { + return wrapped.index(before: i) + } + + var indices: Indices { + wrapped.indices + } + + subscript(bounds: Range) -> Substring { + Substring(wrapped[bounds]) + } +} + class CustomRegexComponentTests: XCTestCase { // TODO: Refactor below into more exhaustive, declarative // tests. - func testCustomRegexComponents() { + func testCustomRegexComponents() throws { customTest( Regex { Numbler() @@ -178,14 +223,13 @@ class CustomRegexComponentTests: XCTestCase { } } - guard let res3 = "ab123c".firstMatch(of: regex3) else { - XCTFail() - return - } + let str = "ab123c" + let res3 = try XCTUnwrap(str.firstMatch(of: regex3)) - XCTAssertEqual(res3.range, "ab123c".index(atOffset: 2)..<"ab123c".index(atOffset: 5)) - XCTAssertEqual(res3.output.0, "123") - XCTAssertEqual(res3.output.1, "123") + let expectedSubstring = str.dropFirst(2).prefix(3) + XCTAssertEqual(res3.range, expectedSubstring.startIndex..( + _ regex: Regex, + _ input: Concat, + expected: (wholeMatch: Match?, firstMatch: Match?, prefixMatch: Match?), + file: StaticString = #file, line: UInt = #line + ) { + let wholeResult = input.wholeMatch(of: regex)?.output + let firstResult = input.firstMatch(of: regex)?.output + let prefixResult = input.prefixMatch(of: regex)?.output + XCTAssertEqual(wholeResult, expected.wholeMatch, file: file, line: line) + XCTAssertEqual(firstResult, expected.firstMatch, file: file, line: line) + XCTAssertEqual(prefixResult, expected.prefixMatch, file: file, line: line) + } + + typealias CaptureMatch1 = (Substring, Int?) 
+ func customTest( + _ regex: Regex, + _ input: Concat, + expected: (wholeMatch: CaptureMatch1?, firstMatch: CaptureMatch1?, prefixMatch: CaptureMatch1?), + file: StaticString = #file, line: UInt = #line + ) { + let wholeResult = input.wholeMatch(of: regex)?.output + let firstResult = input.firstMatch(of: regex)?.output + let prefixResult = input.prefixMatch(of: regex)?.output + XCTAssertEqual(wholeResult?.0, expected.wholeMatch?.0, file: file, line: line) + XCTAssertEqual(wholeResult?.1, expected.wholeMatch?.1, file: file, line: line) + + XCTAssertEqual(firstResult?.0, expected.firstMatch?.0, file: file, line: line) + XCTAssertEqual(firstResult?.1, expected.firstMatch?.1, file: file, line: line) + + XCTAssertEqual(prefixResult?.0, expected.prefixMatch?.0, file: file, line: line) + XCTAssertEqual(prefixResult?.1, expected.prefixMatch?.1, file: file, line: line) + } + + var regex = Regex { + OneOrMore(.digit) + } + + customTest(regex, Concat("amy", 2023), expected:(nil, "2023", nil)) // amy2023 + customTest(regex, Concat("amy2023", nil), expected:(nil, "2023", nil)) + customTest(regex, Concat("amy", nil), expected:(nil, nil, nil)) + customTest(regex, Concat("", 2023), expected:("2023", "2023", "2023")) // 2023 + customTest(regex, Concat("bob012b", 2023), expected:(nil, "012", nil)) // b012b2023 + customTest(regex, Concat("bob012b", nil), expected:(nil, "012", nil)) + customTest(regex, Concat("007bob", 2023), expected:(nil, "007", "007")) + customTest(regex, Concat("", nil), expected:(nil, nil, nil)) + + regex = Regex { + OneOrMore(CharacterClass("a"..."z")) + } + + customTest(regex, Concat("amy", 2023), expected:(nil, "amy", "amy")) // amy2023 + customTest(regex, Concat("amy", nil), expected:("amy", "amy", "amy")) + customTest(regex, Concat("amy2022-bob", 2023), expected:(nil, "amy", "amy")) // amy2023 + customTest(regex, Concat("", 2023), expected:(nil, nil, nil)) // 2023 + customTest(regex, Concat("bob012b", 2023), expected:(nil, "bob", "bob")) // b012b2023 + customTest(regex, Concat("bob012b", nil), expected:(nil, "bob", "bob")) + customTest(regex, Concat("007bob", 2023), expected:(nil, "bob", nil)) + customTest(regex, Concat("", nil), expected:(nil, nil, nil)) + + regex = Regex { + OneOrMore { + CharacterClass("A"..."Z") + OneOrMore(CharacterClass("a"..."z")) + Repeat(.digit, count: 2) + } + } + + customTest(regex, Concat("Amy12345", nil), expected:(nil, "Amy12", "Amy12")) + customTest(regex, Concat("Amy", 2023), expected:(nil, "Amy20", "Amy20")) + customTest(regex, Concat("Amy", 23), expected:("Amy23", "Amy23", "Amy23")) + customTest(regex, Concat("", 2023), expected:(nil, nil, nil)) // 2023 + customTest(regex, Concat("Amy23 Boba17", nil), expected:(nil, "Amy23", "Amy23")) + customTest(regex, Concat("amy23 Boba17", nil), expected:(nil, "Boba17", nil)) + customTest(regex, Concat("Amy23 boba17", nil), expected:(nil, "Amy23", "Amy23")) + customTest(regex, Concat("amy23 Boba", 17), expected:(nil, "Boba17", nil)) + customTest(regex, Concat("Amy23Boba17", nil), expected:("Amy23Boba17", "Amy23Boba17", "Amy23Boba17")) + customTest(regex, Concat("Amy23Boba", 17), expected:("Amy23Boba17", "Amy23Boba17", "Amy23Boba17")) + customTest(regex, Concat("23 Boba", 17), expected:(nil, "Boba17", nil)) + + let twoDigitRegex = Regex { + OneOrMore { + CharacterClass("A"..."Z") + OneOrMore(CharacterClass("a"..."z")) + Capture(Repeat(.digit, count: 2)) { Int($0) } + } + } + + customTest(twoDigitRegex, Concat("Amy12345", nil), expected: (nil, ("Amy12", 12), ("Amy12", 12))) + customTest(twoDigitRegex, 
Concat("Amy", 12345), expected: (nil, ("Amy12", 12), ("Amy12", 12))) + customTest(twoDigitRegex, Concat("Amy", 12), expected: (("Amy12", 12), ("Amy12", 12), ("Amy12", 12))) + customTest(twoDigitRegex, Concat("Amy23 Boba", 17), expected: (nil, firstMatch: ("Amy23", 23), prefixMatch: ("Amy23", 23))) + customTest(twoDigitRegex, Concat("amy23 Boba20", 23), expected:(nil, ("Boba20", 20), nil)) + customTest(twoDigitRegex, Concat("Amy23Boba17", nil), expected:(("Amy23Boba17", 17), ("Amy23Boba17", 17), ("Amy23Boba17", 17))) + customTest(twoDigitRegex, Concat("Amy23Boba", 17), expected:(("Amy23Boba17", 17), ("Amy23Boba17", 17), ("Amy23Boba17", 17))) + + let millennium = Regex { + CharacterClass("A"..."Z") + OneOrMore(CharacterClass("a"..."z")) + Capture { Repeat(.digit, count: 4) } transform: { v -> Int? in + guard let year = Int(v) else { return nil } + return year > 2000 ? year : nil + } + } + + customTest(millennium, Concat("Amy2025", nil), expected: (("Amy2025", 2025), ("Amy2025", 2025), ("Amy2025", 2025))) + customTest(millennium, Concat("Amy", 2025), expected: (("Amy2025", 2025), ("Amy2025", 2025), ("Amy2025", 2025))) + customTest(millennium, Concat("Amy1995", nil), expected: (("Amy1995", nil), ("Amy1995", nil), ("Amy1995", nil))) + customTest(millennium, Concat("Amy", 1995), expected: (("Amy1995", nil), ("Amy1995", nil), ("Amy1995", nil))) + customTest(millennium, Concat("amy2025", nil), expected: (nil, nil, nil)) + customTest(millennium, Concat("amy", 2025), expected: (nil, nil, nil)) + } } + diff --git a/Tests/RegexBuilderTests/MotivationTests.swift b/Tests/RegexBuilderTests/MotivationTests.swift index 22e790e2d..7dd4c77e4 100644 --- a/Tests/RegexBuilderTests/MotivationTests.swift +++ b/Tests/RegexBuilderTests/MotivationTests.swift @@ -9,18 +9,14 @@ // //===----------------------------------------------------------------------===// -// FIXME: macOS CI seems to be busted and Linux doesn't have FormatStyle -// So, we disable this file for now - -#if false - -import _MatchingEngine - import XCTest import _StringProcessing - import RegexBuilder +// FIXME: macOS CI seems to be busted and Linux doesn't have FormatStyle +// So, we disable this larger test for now. +#if false + private struct Transaction: Hashable { enum Kind: Hashable { case credit @@ -140,17 +136,19 @@ private func processWithRuntimeDynamicRegex( ) -> Transaction? { // FIXME: Shouldn't this init throw? let regex = try! Regex(pattern) + let dateStrat = Date.FormatStyle(date: .numeric).parseStrategy + + guard let result = line.wholeMatch(of: regex)?.output, + let kind = Transaction.Kind(result[1].substring!), + let date = try? Date(String(result[2].substring!), strategy: dateStrat), + let account = result[3].substring.map(String.init), + let amount = try? Decimal( + String(result[4].substring!), format: .currency(code: "USD")) else { + return nil + } -// guard let result = line.match(regex) else { return nil } -// -// // TODO: We should have Regex or somesuch and `.1` -// // should be the same as `\1`. 
-// let dynCaps = result.1 -// -// -// let kind = Transaction.Kind(result.1.first!.capture as Substring) - - return nil + return Transaction( + kind: kind, date: date, account: account, amount: amount) } @available(macOS 12.0, *) @@ -239,7 +237,8 @@ extension RegexDSLTests { XCTAssertEqual( referenceOutput, processWithNSRegularExpression(line)) - _ = processWithRuntimeDynamicRegex(line) + XCTAssertEqual( + referenceOutput, processWithRuntimeDynamicRegex(line)) // Static run-time regex XCTAssertEqual( @@ -256,12 +255,104 @@ extension RegexDSLTests { XCTFail() continue } - } - } - } #endif +extension RegexDSLTests { + func testProposalExample() { + let statement = """ + CREDIT 04062020 PayPal transfer $4.99 + CREDIT 04032020 Payroll $69.73 + DEBIT 04022020 ACH transfer $38.25 + DEBIT 03242020 IRS tax payment $52249.98 + """ + let expectation: [(TransactionKind, Date, Substring, Double)] = [ + (.credit, Date(mmddyyyy: "04062020")!, "PayPal transfer", 4.99), + (.credit, Date(mmddyyyy: "04032020")!, "Payroll", 69.73), + (.debit, Date(mmddyyyy: "04022020")!, "ACH transfer", 38.25), + (.debit, Date(mmddyyyy: "03242020")!, "IRS tax payment", 52249.98), + ] + + enum TransactionKind: String { + case credit = "CREDIT" + case debit = "DEBIT" + } + + struct Date: Hashable { + var month: Int + var day: Int + var year: Int + + init?(mmddyyyy: String) { + guard let (_, m, d, y) = mmddyyyy.wholeMatch(of: Regex { + Capture(Repeat(.digit, count: 2), transform: { Int($0)! }) + Capture(Repeat(.digit, count: 2), transform: { Int($0)! }) + Capture(Repeat(.digit, count: 4), transform: { Int($0)! }) + })?.output else { + return nil + } + + self.month = m + self.day = d + self.year = y + } + } + + let statementRegex = Regex { + // First, lets capture the transaction kind by wrapping our ChoiceOf in a + // TryCapture because we want + TryCapture { + ChoiceOf { + "CREDIT" + "DEBIT" + } + } transform: { + TransactionKind(rawValue: String($0)) + } + + OneOrMore(.whitespace) + + // Next, lets represent our date as 3 separate repeat quantifiers. The first + // two will require 2 digit characters, and the last will require 4. Then + // we'll take the entire substring and try to parse a date out. + TryCapture { + Repeat(.digit, count: 2) + Repeat(.digit, count: 2) + Repeat(.digit, count: 4) + } transform: { + Date(mmddyyyy: String($0)) + } + + OneOrMore(.whitespace) + + // Next, grab the description which can be any combination of word characters, + // digits, etc. + Capture { + OneOrMore(.any, .reluctant) + } + + OneOrMore(.whitespace) + + "$" + + // Finally, we'll grab one or more digits which will represent the whole + // dollars, match the decimal point, and finally get 2 digits which will be + // our cents. + TryCapture { + OneOrMore(.digit) + "." + Repeat(.digit, count: 2) + } transform: { + Double($0) + } + } + + for (i, match) in statement.matches(of: statementRegex).enumerated() { + let (_, kind, date, description, amount) = match.output + XCTAssert((kind, date, description, amount) == expectation[i]) + } + } +} diff --git a/Tests/RegexBuilderTests/RegexDSLTests.swift b/Tests/RegexBuilderTests/RegexDSLTests.swift index 10bc4ee35..b646f16f7 100644 --- a/Tests/RegexBuilderTests/RegexDSLTests.swift +++ b/Tests/RegexBuilderTests/RegexDSLTests.swift @@ -11,7 +11,7 @@ import XCTest import _StringProcessing -@testable import RegexBuilder +import RegexBuilder class RegexDSLTests: XCTestCase { func _testDSLCaptures( @@ -445,6 +445,14 @@ class RegexDSLTests: XCTestCase { Repeat(2...) { "e" } Repeat(0...) 
{ "f" } } + + let octoDecimalRegex: Regex<(Substring, Int?)> = Regex { + let charClass = CharacterClass(.digit, "a"..."h")//.ignoringCase() + Capture { + OneOrMore(charClass) + } transform: { Int($0, radix: 18) } + } + XCTAssertEqual("ab12".firstMatch(of: octoDecimalRegex)!.output.1, 61904) } func testAssertions() throws { @@ -742,7 +750,9 @@ class RegexDSLTests: XCTestCase { } do { let regex = try Regex( - #"([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s+;\s+(\w+).*"#) + #""" + (?[0-9A-F]+)(?:\.\.(?[0-9A-F]+))?\s+;\s+(?\w+).* + """#) let line = """ A6F0..A6F1 ; Extend # Mn [2] BAMUM COMBINING MARK KOQNDON..BAMUM \ COMBINING MARK TUKWENTIS @@ -752,13 +762,16 @@ class RegexDSLTests: XCTestCase { let output = match.output XCTAssertEqual(output[0].substring, line[...]) XCTAssertTrue(output[1].substring == "A6F0") + XCTAssertTrue(output["lower"]?.substring == "A6F0") XCTAssertTrue(output[2].substring == "A6F1") + XCTAssertTrue(output["upper"]?.substring == "A6F1") XCTAssertTrue(output[3].substring == "Extend") + XCTAssertTrue(output["desc"]?.substring == "Extend") let typedOutput = try XCTUnwrap(output.as( - (Substring, Substring, Substring?, Substring).self)) + (Substring, lower: Substring, upper: Substring?, Substring).self)) XCTAssertEqual(typedOutput.0, line[...]) - XCTAssertTrue(typedOutput.1 == "A6F0") - XCTAssertTrue(typedOutput.2 == "A6F1") + XCTAssertTrue(typedOutput.lower == "A6F0") + XCTAssertTrue(typedOutput.upper == "A6F1") XCTAssertTrue(typedOutput.3 == "Extend") } } @@ -817,6 +830,38 @@ class RegexDSLTests: XCTestCase { XCTAssertEqual(result[b], 42) } + do { + let key = Reference(Substring.self) + let value = Reference(Int.self) + let input = " " + let regex = Regex { + Capture(as: key) { + Optionally { + OneOrMore(.word) + } + } + ":" + Optionally { + Capture(as: value) { + OneOrMore(.digit) + } transform: { Int($0)! } + } + } + + let result1 = try XCTUnwrap("age:123".wholeMatch(of: regex)) + XCTAssertEqual(result1[key], "age") + XCTAssertEqual(result1[value], 123) + + let result2 = try XCTUnwrap(":567".wholeMatch(of: regex)) + XCTAssertEqual(result2[key], "") + XCTAssertEqual(result2[value], 567) + + let result3 = try XCTUnwrap("status:".wholeMatch(of: regex)) + XCTAssertEqual(result3[key], "status") + // Traps: + // XCTAssertEqual(result3[value], nil) + } + // Post-hoc captured references // #"(?:\w\1|:(\w):)+"# try _testDSLCaptures( diff --git a/Tests/RegexTests/AlgorithmsInternalsTests.swift b/Tests/RegexTests/AlgorithmsInternalsTests.swift new file mode 100644 index 000000000..f0d556744 --- /dev/null +++ b/Tests/RegexTests/AlgorithmsInternalsTests.swift @@ -0,0 +1,47 @@ +//===----------------------------------------------------------------------===// +// +// This source file is part of the Swift.org open source project +// +// Copyright (c) 2021-2022 Apple Inc. and the Swift project authors +// Licensed under Apache License v2.0 with Runtime Library Exception +// +// See https://swift.org/LICENSE.txt for license information +// +//===----------------------------------------------------------------------===// + +@testable import _StringProcessing +import XCTest + +// TODO: Protocol-powered testing +extension AlgorithmTests { + func testAdHoc() { + let r = try! 
Regex("a|b+") + + XCTAssert("palindrome".contains(r)) + XCTAssert("botany".contains(r)) + XCTAssert("antiquing".contains(r)) + XCTAssertFalse("cdef".contains(r)) + + let str = "a string with the letter b in it" + let first = str.firstRange(of: r) + let last = str.lastRange(of: r) + let (expectFirst, expectLast) = ( + str.index(atOffset: 0)..(_ s: @autoclosure () -> T) { } } -class RegexConsumerTests: XCTestCase { +func makeSingleUseSequence(element: T, count: Int) -> UnfoldSequence { + var count = count + return sequence(state: ()) { _ in + defer { count -= 1 } + return count > 0 ? element : nil + } +} + +class AlgorithmTests: XCTestCase { + func testContains() { + XCTAssertTrue("".contains("")) + XCTAssertTrue("abcde".contains("")) + XCTAssertTrue("abcde".contains("abcd")) + XCTAssertTrue("abcde".contains("bcde")) + XCTAssertTrue("abcde".contains("bcd")) + XCTAssertTrue("ababacabababa".contains("abababa")) + + XCTAssertFalse("".contains("abcd")) + + for start in 0..<9 { + for end in start..<9 { + XCTAssertTrue((0..<10).contains(start...end)) + XCTAssertFalse((0..<10).contains(start...10)) + } + } + } + func testRanges() { func expectRanges( _ string: String, @@ -40,6 +66,9 @@ class RegexConsumerTests: XCTestCase { // `IndexingIterator` tests the collection conformance let actualCol: [Range] = string[...].ranges(of: regex)[...].map(string.offsets(of:)) XCTAssertEqual(actualCol, expected, file: file, line: line) + + let firstRange = string.firstRange(of: regex).map(string.offsets(of:)) + XCTAssertEqual(firstRange, expected.first, file: file, line: line) } expectRanges("", "", [0..<0]) @@ -60,6 +89,31 @@ class RegexConsumerTests: XCTestCase { expectRanges("abc", "(a|b)*", [0..<2, 2..<2, 3..<3]) expectRanges("abc", "(b|c)+", [1..<3]) expectRanges("abc", "(b|c)*", [0..<0, 1..<3, 3..<3]) + + func expectStringRanges( + _ input: String, + _ pattern: String, + _ expected: [Range], + file: StaticString = #file, line: UInt = #line + ) { + let actualSeq: [Range] = input.ranges(of: pattern).map(input.offsets(of:)) + XCTAssertEqual(actualSeq, expected, file: file, line: line) + + // `IndexingIterator` tests the collection conformance + let actualCol: [Range] = input.ranges(of: pattern)[...].map(input.offsets(of:)) + XCTAssertEqual(actualCol, expected, file: file, line: line) + + let firstRange = input.firstRange(of: pattern).map(input.offsets(of:)) + XCTAssertEqual(firstRange, expected.first, file: file, line: line) + } + + expectStringRanges("", "", [0..<0]) + expectStringRanges("abcde", "", [0..<0, 1..<1, 2..<2, 3..<3, 4..<4, 5..<5]) + expectStringRanges("abcde", "abcd", [0..<4]) + expectStringRanges("abcde", "bcde", [1..<5]) + expectStringRanges("abcde", "bcd", [1..<4]) + expectStringRanges("ababacabababa", "abababa", [6..<13]) + expectStringRanges("ababacabababa", "aba", [0..<3, 6..<9, 10..<13]) } func testSplit() { @@ -70,15 +124,153 @@ class RegexConsumerTests: XCTestCase { file: StaticString = #file, line: UInt = #line ) { let regex = try! 
   func testSplit() {
@@ -70,15 +124,153 @@
     func expectSplit(
       _ string: String,
       _ regex: String,
       _ expected: [Substring],
       file: StaticString = #file, line: UInt = #line
     ) {
       let regex = try! Regex(regex)
-      let actual = Array(string.split(by: regex))
+      let actual = Array(string.split(separator: regex, omittingEmptySubsequences: false))
       XCTAssertEqual(actual, expected, file: file, line: line)
     }
 
-    expectSplit("", "", ["", ""])
+    expectSplit("", "", [""])
     expectSplit("", "x", [""])
     expectSplit("a", "", ["", "a", ""])
     expectSplit("a", "x", ["a"])
     expectSplit("a", "a", ["", ""])
+    expectSplit("a____a____a", "_+", ["a", "a", "a"])
+    expectSplit("____a____a____a____", "_+", ["", "a", "a", "a", ""])
+
+    XCTAssertEqual("".split(separator: ""), [])
+    XCTAssertEqual("".split(separator: "", omittingEmptySubsequences: false), [""])
+
+    // Test that original `split` functions are still accessible
+    let splitRef = "abcd".split
+    XCTAssert(type(of: splitRef) == ((Character, Int, Bool) -> [Substring]).self)
+    let splitParamsRef = "abcd".split(separator:maxSplits:omittingEmptySubsequences:)
+    XCTAssert(type(of: splitParamsRef) == ((Character, Int, Bool) -> [Substring]).self)
+  }
+
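`testSplit` above fixes the edge cases (empty separators, empty pieces), and `testSplitPermutations` below then fuzzes the regex-based `split(separator:)` against the standard library's `Character`-based overload. For orientation, the basic call looks like this, on a throwaway input:

```swift
import _StringProcessing

func splitOnRegex() throws {
  let underscores = try Regex("_+")

  print("a__b___c".split(separator: underscores))
  // ["a", "b", "c"]

  // Empty subsequences are dropped by default; pass false to keep them.
  print("__a__b".split(separator: underscores, omittingEmptySubsequences: false))
  // ["", "a", "b"]
}
```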
+  func testSplitPermutations() throws {
+    let splitRegex = try Regex(#"\|"#)
+    XCTAssertEqual(
+      "a|a|||a|a".split(separator: splitRegex),
+      ["a", "a", "a", "a"])
+    XCTAssertEqual(
+      "a|a|||a|a".split(separator: splitRegex, omittingEmptySubsequences: false),
+      ["a", "a", "", "", "a", "a"])
+    XCTAssertEqual(
+      "a|a|||a|a".split(separator: splitRegex, maxSplits: 2),
+      ["a", "a", "||a|a"])
+
+    XCTAssertEqual(
+      "a|a|||a|a|||a|a|||".split(separator: "|||"),
+      ["a|a", "a|a", "a|a"])
+    XCTAssertEqual(
+      "a|a|||a|a|||a|a|||".split(separator: "|||", omittingEmptySubsequences: false),
+      ["a|a", "a|a", "a|a", ""])
+    XCTAssertEqual(
+      "a|a|||a|a|||a|a|||".split(separator: "|||", maxSplits: 2),
+      ["a|a", "a|a", "a|a|||"])
+
+    XCTAssertEqual(
+      "aaaa".split(separator: ""),
+      ["a", "a", "a", "a"])
+    XCTAssertEqual(
+      "aaaa".split(separator: "", omittingEmptySubsequences: false),
+      ["", "a", "a", "a", "a", ""])
+    XCTAssertEqual(
+      "aaaa".split(separator: "", maxSplits: 2),
+      ["a", "a", "aa"])
+    XCTAssertEqual(
+      "aaaa".split(separator: "", maxSplits: 2, omittingEmptySubsequences: false),
+      ["", "a", "aaa"])
+
+    // Fuzzing the input and parameters
+    for _ in 1...1_000 {
+      // Make strings that look like:
+      //   "aaaaaaa"
+      //   "|||aaaa||||"
+      //   "a|a|aa|aa|"
+      //   "|a||||aaa|a|||"
+      //   "a|aa"
+      let keepCount = Int.random(in: 0...10)
+      let splitCount = Int.random(in: 0...10)
+      let str = [repeatElement("a", count: keepCount), repeatElement("|", count: splitCount)]
+        .joined()
+        .shuffled()
+        .joined()
+
+      let omitEmpty = Bool.random()
+      let maxSplits = Bool.random() ? Int.max : Int.random(in: 0...10)
+
+      // Use the stdlib behavior as the expected outcome
+      let expected = str.split(
+        separator: "|" as Character,
+        maxSplits: maxSplits,
+        omittingEmptySubsequences: omitEmpty)
+      let regexActual = str.split(
+        separator: splitRegex,
+        maxSplits: maxSplits,
+        omittingEmptySubsequences: omitEmpty)
+      let stringActual = str.split(
+        separator: "|" as String,
+        maxSplits: maxSplits,
+        omittingEmptySubsequences: omitEmpty)
+      XCTAssertEqual(regexActual, expected, """
+        Mismatch in regex split of '\(str)', maxSplits: \(maxSplits), omitEmpty: \(omitEmpty)
+        expected: \(expected.map(String.init))
+        actual: \(regexActual.map(String.init))
+        """)
+      XCTAssertEqual(stringActual, expected, """
+        Mismatch in string split of '\(str)', maxSplits: \(maxSplits), omitEmpty: \(omitEmpty)
+        expected: \(expected.map(String.init))
+        actual: \(stringActual.map(String.init))
+        """)
+    }
+  }
+
+  func testTrim() {
+    func expectTrim(
+      _ string: String,
+      _ regex: String,
+      _ expected: Substring,
+      file: StaticString = #file, line: UInt = #line
+    ) {
+      let regex = try! Regex(regex)
+      let actual = string.trimmingPrefix(regex)
+      XCTAssertEqual(actual, expected, file: file, line: line)
+    }
+
+    expectTrim("", "", "")
+    expectTrim("", "x", "")
+    expectTrim("a", "", "a")
+    expectTrim("a", "x", "a")
+    expectTrim("___a", "_", "__a")
+    expectTrim("___a", "_+", "a")
+
+    XCTAssertEqual("".trimmingPrefix("a"), "")
+    XCTAssertEqual("a".trimmingPrefix("a"), "")
+    XCTAssertEqual("b".trimmingPrefix("a"), "b")
+    XCTAssertEqual("a".trimmingPrefix(""), "a")
+    XCTAssertEqual("___a".trimmingPrefix("_"), "__a")
+    XCTAssertEqual("___a".trimmingPrefix("___"), "a")
+    XCTAssertEqual("___a".trimmingPrefix("____"), "___a")
+    XCTAssertEqual("___a".trimmingPrefix("___a"), "")
+
+    do {
+      let prefix = makeSingleUseSequence(element: "_" as Character, count: 5)
+      XCTAssertEqual("_____a".trimmingPrefix(prefix), "a")
+      XCTAssertEqual("_____a".trimmingPrefix(prefix), "_____a")
+    }
+    do {
+      let prefix = makeSingleUseSequence(element: "_" as Character, count: 5)
+      XCTAssertEqual("a".trimmingPrefix(prefix), "a")
+      // The result of this next call is technically undefined, so this
+      // is just to test that it doesn't crash.
+      XCTAssertNotEqual("_____a".trimmingPrefix(prefix), "")
+    }
+
+    XCTAssertEqual("".trimmingPrefix(while: \.isWhitespace), "")
+    XCTAssertEqual("a".trimmingPrefix(while: \.isWhitespace), "a")
+    XCTAssertEqual(" ".trimmingPrefix(while: \.isWhitespace), "")
+    XCTAssertEqual(" a".trimmingPrefix(while: \.isWhitespace), "a")
+    XCTAssertEqual("a ".trimmingPrefix(while: \.isWhitespace), "a ")
   }
 
   func testReplace() {
@@ -107,37 +299,6 @@
     expectReplace("aab", "a*", "X", "XXbX")
   }
 
-  func testAdHoc() {
-    let r = try! Regex("a|b+")
-
-    XCTAssert("palindrome".contains(r))
-    XCTAssert("botany".contains(r))
-    XCTAssert("antiquing".contains(r))
-    XCTAssertFalse("cdef".contains(r))
-
-    let str = "a string with the letter b in it"
-    let first = str.firstRange(of: r)
-    let last = str.lastRange(of: r)
-    let (expectFirst, expectLast) = (
-      str.index(atOffset: 0)..