[Integration] main (06f40f6) -> swift/main #328

Merged
merged 23 commits into from
Apr 22, 2022
23 commits
89da9f8
Error on unknown character properties
hamishknight Apr 14, 2022
e1604a6
Merge pull request #280 from hamishknight/error-on-unknown-props
hamishknight Apr 19, 2022
3f16170
Don't parse a character property containing a backslash
hamishknight Apr 19, 2022
fa5f2f1
Update Regex Syntax document for `[:...:]` changes
hamishknight Apr 19, 2022
9ccde19
Support obtaining captures by name on `AnyRegexOutput` (#300)
rxwei Apr 19, 2022
182da3b
Untangle `_RegexParser` from `RegexBuilder` (#299)
natecook1000 Apr 19, 2022
8068ea1
Merge pull request #301 from hamishknight/yet-more-posix-quirks
hamishknight Apr 19, 2022
08b7808
Merge pull request #302 from hamishknight/update-syntax
hamishknight Apr 19, 2022
00aa315
Expose `matches`, `ranges` and `split` (#304)
itingliu Apr 19, 2022
15355bf
Convenience quoting (#305)
milseman Apr 19, 2022
46b9a0f
Remove compiling argument label (#306)
milseman Apr 20, 2022
b24d3ea
Move the closure argument to the end of the arg list (#307)
itingliu Apr 21, 2022
f9a4675
Adds RegexBuilder.CharacterClass.anyUnicodeScalar (#315)
natecook1000 Apr 21, 2022
4857bc7
Allow setting any of the three quant behaviors (#311)
natecook1000 Apr 21, 2022
73a5ccf
Add `wholeMatch` and `prefixMatch` (#286)
itingliu Apr 22, 2022
3e2160c
Update local proposal copies (#317)
milseman Apr 22, 2022
53acbb2
Update ProposalOverview.md
milseman Apr 22, 2022
b057c4e
Update ProposalOverview.md
milseman Apr 22, 2022
8dd8470
Unicode for String Processing proposal (#257)
natecook1000 Apr 22, 2022
81bc5d0
Updates for algorithms proposal (#319)
milseman Apr 22, 2022
89b80bf
Preparation for location aware diagnostics in the compiler.
rintaro Apr 11, 2022
06f40f6
Merge pull request #321 from rintaro/diagnostic-swiftcompiler
rintaro Apr 22, 2022
4d198ed
Merge branch 'swift/main' into integration-main-06f40f6
rxwei Apr 22, 2022
503 changes: 0 additions & 503 deletions Documentation/Evolution/CharacterClasses.md

This file was deleted.

12 changes: 6 additions & 6 deletions Documentation/Evolution/ProposalOverview.md
@@ -19,15 +19,15 @@ Covers the result builder approach and basic API.

## Run-time Regex Construction

- [Pitch](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md)
- [Pitch](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexSyntaxRunTimeConstruction.md), [Thread](https://forums.swift.org/t/pitch-2-regex-syntax-and-run-time-construction/56624)
- (old) Pitch thread: [Regex Syntax](https://forums.swift.org/t/pitch-regex-syntax/55711)
+ Brief: Syntactic superset of PCRE2, Oniguruma, ICU, UTS\#18, etc.

Covers the "interior" syntax, extended syntaxes, run-time construction of a regex from a string, and details of `AnyRegexOutput`.

## Regex Literals

- [Draft](https://github.com/apple/swift-experimental-string-processing/pull/187)
- [Draft](https://github.com/apple/swift-experimental-string-processing/pull/187), [Thread](https://forums.swift.org/t/pitch-2-regex-literals/56736)
- (Old) original pitch:
+ [Thread](https://forums.swift.org/t/pitch-regular-expression-literals/52820)
+ [Update](https://forums.swift.org/t/pitch-regular-expression-literals/52820/90)
@@ -39,17 +39,17 @@ Covers the "interior" syntax, extended syntaxes, run-time construction of a rege

Proposes a slew of Regex-powered algorithms.

Introduces `CustomMatchingRegexComponent`, which is a monadic-parser style interface for external parsers to be used as components of a regex.
Introduces `CustomPrefixMatchRegexComponent`, which is a monadic-parser style interface for external parsers to be used as components of a regex.

## Unicode for String Processing

- Draft: TBD
- [Draft](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md)
- (Old) [Character class definitions](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920)

Covers three topics:

- Proposes literal and DSL API for library-defined character classes, Unicode scripts and properties, and custom character classes.
- Proposes literal and DSL API for options that affect matching behavior.
- Proposes regex syntax and `RegexBuilder` API for options that affect matching behavior.
- Proposes regex syntax and `RegexBuilder` API for library-defined character classes, Unicode properties, and custom character classes.
- Defines how Unicode scalar-based classes are extended to grapheme clusters in the different semantic and other matching modes.


4 changes: 2 additions & 2 deletions Documentation/Evolution/RegexLiterals.md
@@ -12,7 +12,7 @@ In *[Regex Type and Overview][regex-type]* we introduced the `Regex` type, which

```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(compiling: pattern)
let regex = try! Regex(pattern)
// regex: Regex<AnyRegexOutput>
```

@@ -366,7 +366,7 @@ However we decided against this because:

### No custom literal

Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex(compiling: "[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean:
Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex("[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean:

- No source tooling support (e.g syntax highlighting, refactoring actions) would be available.
- Parse errors would be diagnosed at run time rather than at compile time.
50 changes: 43 additions & 7 deletions Documentation/Evolution/RegexSyntaxRunTimeConstruction.md
@@ -1,7 +1,12 @@

# Regex Syntax and Run-time Construction

- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
* Proposal: [SE-NNNN](NNNN-filename.md)
* Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
* Review Manager: [Ben Cohen](https://github.com/airspeedswift)
* Status: **Awaiting review**
* Implementation: https://github.com/apple/swift-experimental-string-processing
* Available in nightly toolchain snapshots with `import _StringProcessing`

## Introduction

@@ -50,11 +55,11 @@ We propose run-time construction of `Regex` from a best-in-class treatment of fa

```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(compiling: pattern)
let regex = try! Regex(pattern)
// regex: Regex<AnyRegexOutput>

let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
try! Regex(compiling: pattern)
try! Regex(pattern)
```

### Syntax
@@ -81,11 +86,11 @@ We propose initializers to declare and compile a regex from syntax. Upon failure
```swift
extension Regex {
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
public init(compiling pattern: String, as: Output.Type = Output.self) throws
public init(_ pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
public init(compiling pattern: String) throws
public init(_ pattern: String) throws
}
```
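
As a minimal sketch of how the unlabeled initializer might be used, the snippet below builds a type-erased regex from a string and reads its captures by position. It assumes a Swift 5.7 or later toolchain, where this API shipped; the e-mail pattern, the sample input, and the variable names are illustrative only and do not come from the proposal.

```swift
// Assumes a Swift 5.7+ toolchain. The pattern and input are illustrative.

// Build a regex from a string at run time; the output type is erased.
let emailPattern = #"(\w+)@(\w+)\.com"#
let emailRegex = try Regex(emailPattern)   // Regex<AnyRegexOutput>

var user: Substring? = nil
var host: Substring? = nil
if let match = "alice@example.com".wholeMatch(of: emailRegex) {
  // Element 0 is the whole match; captures start at index 1.
  user = match.output[1].substring
  host = match.output[2].substring
}

// A malformed pattern surfaces as a thrown error at run time.
var malformedWasRejected = false
do {
  _ = try Regex<AnyRegexOutput>("(unclosed")
} catch {
  malformedWasRejected = true
}
```

Because the pattern is only parsed when the initializer runs, a mistake in it cannot be diagnosed at compile time, which is why the initializer is marked `throws`.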

@@ -156,6 +161,20 @@ extension Regex.Match where Output == AnyRegexOutput {
}
```

We propose adding API to query and access captures by name in an existentially typed regex match:

```swift
extension Regex.Match where Output == AnyRegexOutput {
/// If a named-capture with `name` is present, returns its value. Otherwise `nil`.
public subscript(_ name: String) -> AnyRegexOutput.Element? { get }
}

extension AnyRegexOutput {
/// If a named-capture with `name` is present, returns its value. Otherwise `nil`.
public subscript(_ name: String) -> AnyRegexOutput.Element? { get }
}
```

The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.

<details><summary>Grammar Notation</summary>
@@ -392,7 +411,7 @@ For non-Unicode properties, only a value is required. These include:
- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`. Both spellings may be used inside and outside of a custom character class.

#### `\K`

@@ -534,6 +553,7 @@ These operators have a lower precedence than the implicit union of members, e.g

To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand-side is a nested custom character class. We propose following this behavior.

Note that a custom character class may begin with the `:` character, and only becomes a POSIX character property if a closing `:]` is present. For example, `[:a]` is the character class of `:` and `a`.

### Matching options

@@ -863,7 +883,23 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat

### Extended character property syntax

ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We propose supporting this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`. This has two effects:

- They share the same internal grammar, which allows the use of any Unicode character properties in addition to the POSIX properties.
- The POSIX syntax may be used outside of custom character classes, unlike in PCRE and Oniguruma.

We propose following both of these rules. The former is purely additive, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. The latter does conflict with other engines, but we feel it is much more likely that a user would expect e.g `[:space:]` to be a character property rather than the character class `[:aceps]`. We do however feel that a warning might be warranted in order to avoid confusion.

### POSIX character property disambiguation

PCRE, Oniguruma and ICU allow `[:` to be part of a custom character class if a closing `:]` is not present. For example, `[:a]` is the character class of `:` and `a`. However they each have different rules for detecting the closing `:]`:

- PCRE will scan ahead until it hits either `:]`, `]`, or `[:`.
- Oniguruma will scan ahead until it hits either `:]`, `]`, or the length exceeds 20 characters.
- ICU will scan ahead until it hits a known escape sequence (e.g `\a`, `\e`, `\Q`, ...), or `:]`. Note this excludes character class escapes e.g `\d`. It also excludes `]`, meaning that even `[:a][:]` is parsed as a POSIX character property.

We propose unifying these behaviors by scanning ahead until we hit either `[`, `]`, `:]`, or `\`. Additionally, we will stop on encountering `}` or a second occurrence of `=`. These fall out of the fact that they would be invalid contents of the alternative `\p{...}` syntax.
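
The scan-ahead rule can be sketched in a few lines of plain Swift. This is an illustration of the rule as worded above, not the parser's actual implementation; the function name `looksLikePOSIXProperty` is hypothetical.

```swift
/// Hypothetical helper: returns true if `pattern`, which must begin with
/// "[:", would be read as a POSIX-style character property under the
/// proposed scan-ahead rule.
func looksLikePOSIXProperty(_ pattern: String) -> Bool {
  guard pattern.hasPrefix("[:") else { return false }
  var rest = pattern.dropFirst(2)
  var equalsCount = 0
  while let ch = rest.first {
    // A closing ":]" means the contents form a character property.
    if rest.hasPrefix(":]") { return true }
    switch ch {
    case "[", "]", "\\", "}":
      // Stop characters: fall back to an ordinary custom character class.
      return false
    case "=":
      // A second "=" cannot appear in valid property contents.
      equalsCount += 1
      if equalsCount == 2 { return false }
    default:
      break
    }
    rest = rest.dropFirst()
  }
  // Ran out of input without finding ":]".
  return false
}
```

Under this sketch `[:space:]` and `[:script=Latin:]` are treated as properties, while `[:a]` and `[:a][:]` are treated as custom character classes beginning with `:`. The latter differs from the ICU behavior described above.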


### Script properties

43 changes: 31 additions & 12 deletions Documentation/Evolution/RegexTypeOverview.md
@@ -1,6 +1,11 @@
# Regex Type and Overview

- Authors: [Michael Ilseman](https://github.com/milseman)
* Proposal: [SE-0350](0350-regex-type-overview.md)
* Authors: [Michael Ilseman](https://github.com/milseman)
* Review Manager: [Ben Cohen](https://github.com/airspeedswift)
* Status: **Active Review (4 - 28 April 2022)**
* Implementation: https://github.com/apple/swift-experimental-string-processing
* Available in nightly toolchain snapshots with `import _StringProcessing`

## Introduction

@@ -134,11 +139,11 @@ Regexes can be created at run time from a string containing familiar regex synta

```swift
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
let regex = try! Regex(compiling: pattern)
let regex = try! Regex(pattern)
// regex: Regex<AnyRegexOutput>

let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
try! Regex(compiling: pattern)
try! Regex(pattern)
```

*Note*: The syntax accepted and further details on run-time compilation, including `AnyRegexOutput` and extended syntaxes, are discussed in [Run-time Regex Construction][pitches].
@@ -207,7 +212,7 @@ func processEntry(_ line: String) -> Transaction? {
// amount: Substring
// )>

guard let match = regex.matchWhole(line),
guard let match = regex.wholeMatch(line),
let kind = Transaction.Kind(match.kind),
let date = try? Date(String(match.date), strategy: dateParser),
let amount = try? Decimal(String(match.amount), format: decimalParser)
@@ -226,7 +231,7 @@ The result builder allows for inline failable value construction, which particip

Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").

`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:
`CustomPrefixMatchRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:

```swift
func processEntry(_ line: String) -> Transaction? {
@@ -300,7 +305,7 @@ Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_U
```swift
/// A regex represents a string processing algorithm.
///
/// let regex = try Regex(compiling: "a(.*)b")
/// let regex = try Regex("a(.*)b")
/// let match = "cbaxb".firstMatch(of: regex)
/// print(match.0) // "axb"
/// print(match.1) // "x"
@@ -384,21 +389,25 @@ extension Regex.Match {
// Run-time compilation interfaces
extension Regex {
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
public init(compiling pattern: String, as: Output.Type = Output.self) throws
public init(_ pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
public init(compiling pattern: String) throws
public init(_ pattern: String) throws
}
```
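
The strongly-typed form can be sketched alongside the whole, prefix, and first matching entry points. This assumes a Swift 5.7 or later toolchain and uses the `wholeMatch(of:)`, `prefixMatch(of:)`, and `firstMatch(of:)` spellings that shipped there; the snapshot discussed in this PR spells some of these calls differently (for example `regex.wholeMatch(line)`). The pattern and inputs are illustrative.

```swift
// Assumes a Swift 5.7+ toolchain. The pattern and inputs are illustrative.

// Ask for a concrete tuple type: whole match plus two captures.
let rangePattern = #"(\d+)-(\d+)"#
let rangeRegex = try Regex(rangePattern, as: (Substring, Substring, Substring).self)

// wholeMatch requires the entire input to match.
let whole = "12-34".wholeMatch(of: rangeRegex)
let notWhole = "12-34 rest".wholeMatch(of: rangeRegex)

// prefixMatch only requires a match anchored at the start.
let prefix = "12-34 rest".prefixMatch(of: rangeRegex)

// firstMatch searches for the first match anywhere in the input.
let first = "id 12-34".firstMatch(of: rangeRegex)

// Typed output: tuple elements are Substrings, no casting needed.
let lower = whole?.output.1
let upper = whole?.output.2
```

If the requested tuple type does not line up with the captures actually present in the pattern, the initializer throws rather than producing a mistyped regex.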

### Cancellation

Regex is somewhat different from existing standard library operations in that regex processing can be a long-running task.
For this reason regex algorithms may check if the parent task has been cancelled and end execution.

### On severability and related proposals

The proposal split presented is meant to aid focused discussion, while acknowledging that each is interconnected. The boundaries between them are not completely cut-and-dry and could be refined as they enter proposal phase.

Accepting this proposal in no way implies that all related proposals must be accepted. They are severable and each should stand on their own merit.


## Source compatibility

Everything in this proposal is additive. Regex delimiters may have their own source compatibility impact, which is discussed in that proposal.
@@ -422,7 +431,7 @@ Regular expressions have a deservedly mixed reputation, owing to their historica

* "Regular expressions are bad because you should use a real parser"
- In other systems, you're either in or you're out, leading to a gravitational pull to stay in when... you should get out
- Our remedy is interoperability with real parsers via `CustomMatchingRegexComponent`
- Our remedy is interoperability with real parsers via `CustomPrefixMatchRegexComponent`
- Literals with refactoring actions provide an incremental off-ramp from regex syntax to result builders and real parsers
* "Regular expressions are bad because ugly unmaintainable syntax"
- We propose literals with source tools support, allowing for better syntax highlighting and analysis
@@ -488,6 +497,16 @@ The generic parameter `Output` is proposed to contain both the whole match (the

The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familiarity and have the pitfall of introducing off-by-one errors.

### Encoding `Regex`es into the type system

During the initial review period the following comment was made:

> I think the goal should be that, at least for regex literals (and hopefully for the DSL to some extent), one day we might not even need a bytecode or interpreter. I think the ideal case is if each literal was its own function or type that gets generated and optimised as if you wrote it in Swift.

This is an approach that has been tried a few times in a few different languages (including by a few members of the Swift Standard Library and Core teams), and while it can produce attractive microbenchmarks, it has almost always proved to be a bad idea at the macro scale. In particular, even if we set aside witness tables and other associated swift generics overhead, optimizing a fixed pipeline for each pattern you want to match causes significant codesize expansion when there are multiple patterns in use, as compared to a more flexible byte code interpreter. A bytecode interpreter makes better use of instruction caches and memory, and can also benefit from micro architectural resources that are shared across different patterns. There is a tradeoff w.r.t. branch prediction resources, where separately compiled patterns may have more decisive branch history data, but a shared bytecode engine has much more data to use; this tradeoff tends to fall on the side of a bytecode engine, but it does not always do so.

It should also be noted that nothing prevents AOT or JIT compiling of the bytecode if we believe it will be advantageous, but compiling or interpreting arbitrary Swift code at runtime is rather more unattractive, since both the type system and language are undecidable. Even absent this rationale, we would probably not encode regex programs directly into the type system simply because it is unnecessarily complex.

### Future work: static optimization and compilation

Swift's support for static compilation is still developing, and future work here is leveraging that to compile regex when profitable. Many regex describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).
@@ -497,7 +516,7 @@ Regex are compiled into an intermediary representation and fairly simple analysi

### Future work: parser combinators

What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomMatchingRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system.
What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomPrefixMatchRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system.

An issue with traditional parser combinator libraries is the compilation barriers between call-site and definition, which result in excessive and overly-cautious backtracking traffic. These can be eliminated through better [compilation techniques](https://core.ac.uk/download/pdf/148008325.pdf). As mentioned above, Swift's support for custom static compilation is still under development.

@@ -546,7 +565,7 @@ Regexes are often used for tokenization and tokens can be represented with Swift

### Future work: baked-in localized processing

- `CustomMatchingRegexComponent` gives an entry point for localized processors
- `CustomPrefixMatchRegexComponent` gives an entry point for localized processors
- Future work includes (sub?)protocols to communicate localization intent

-->