From b1315ee3076645a2bdd93e011ae76fc9dbef9dc7 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sat, 12 Apr 2014 23:31:42 -0400 Subject: [PATCH 01/12] An RFC for adding a regexp crate to the Rust distribution. --- active/0000-regexps.md | 235 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 235 insertions(+) create mode 100644 active/0000-regexps.md diff --git a/active/0000-regexps.md b/active/0000-regexps.md new file mode 100644 index 00000000000..0e2842eb814 --- /dev/null +++ b/active/0000-regexps.md @@ -0,0 +1,235 @@ +- Start Date: 2014-04-12 +- RFC PR #: (leave this empty) +- Rust Issue #: (leave this empty) + +# Summary + +Add a `regexp` crate to the Rust distribution in addition to a small +`regexp_re` crate that provides a syntax extension for compiling regular +expressions during the compilation of a Rust program. + +The implementation that supports this RFC is ready to receive +feedback: https://github.com/BurntSushi/regexp + +Documentation for the crate can be seen here: +http://burntsushi.net/rustdoc/regexp/index.html + +regex-dna benchmark (vs. Go, Python): +https://github.com/BurntSushi/regexp/tree/master/benchmark/regex-dna + +Other benchmarks (vs. Go): +https://github.com/BurntSushi/regexp/tree/master/benchmark + +(Perhaps the links should be removed if the RFC is accepted, since I can't +guarantee they will always exist.) + +# Motivation + +Regular expressions provide a succinct method of matching patterns against +search text and are frequently used. For example, many programming languages +include some kind of support for regular expressions in its standard library. + +The outcome of this RFC is to include a regular expression library in the Rust +distribution. + +# Detailed design + +(Note: This is describing an existing design that has been implemented. I have +no idea how much of this is appropriate for an RFC.) + +The first choice that most regular expression libraries make is whether or not +to include backreferences in the supported syntax, as this heavily influences +the implementation and the performance characteristics of matching text. + +In this RFC, I am proposing a library that closely models Russ Cox's RE2 +(either its C++ or Go variants). This means that features like backreferences +or generalized zero-width assertions are not supported. In return, we get +`O(mn)` worst case performance (with `m` being the size of the search text and +`n` being the number of instructions in the compiled expression). + +My implementation currently simulates an NFA using something resembling the +Pike VM. Future work could possibly include adding a DFA. (N.B. RE2/C++ +includes both an NFA and a DFA, but RE2/Go only implements an NFA.) + +The primary reason why I chose RE2 was that it seemed to be a popular choice in +issue [#3591](https://github.com/mozilla/rust/issues/3591), and its worst case +performance characteristics seemed appealing. I was also drawn to the limited +set of syntax supported by RE2 in comparison to other regexp flavors. + +With that out of the way, there are other things that inform the design of a +regexp library. + +## Unicode + +Given the already existing support for Unicode in Rust, this is a no-brainer. +Unicode literals should be allowed in expressions and Unicode character classes +should be included (e.g., general categories and scripts). + +Case folding is also important for case insensitive matching. Currently, this +is implemented by converting characters to their uppercase forms and then +comparing them. 
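A small illustration of where the uppercase-and-compare approach runs out of
steam (written in present-day Rust syntax as a sketch, not the crate's
internals):

    // Uppercasing both sides works for the easy cases...
    let a: String = 'k'.to_uppercase().collect();
    let b: String = 'K'.to_uppercase().collect();
    assert_eq!(a, b);
    // ...but Unicode case conversion is not one-to-one: a single character
    // can uppercase to two, which a per-character comparison cannot line up.
    let sharp_s: String = 'ß'.to_uppercase().collect();
    assert_eq!(sharp_s, "SS");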
Future work includes applying at least a simple fold, since +folding one Unicode character can produce multiple characters. + +Normalization is another thing to consider, but like most other regexp +libraries, the one I'm proposing here does not do any normalization. (It seems +the recommended practice is to do normalization before matching if it's +needed.) + +A nice implementation strategy to support Unicode is to implement a VM that +matches characters instead of bytes. Indeed, my implementation does this. +However, the public API of a regular expression library should expose *byte +indices* corresponding to match locations (which ought to be guaranteed to be +UTF8 codepoint boundaries by construction of the VM). My reason for this is +that byte indices result in a lower cost abstraction. If character indices are +desired, then a mapping can be maintained by the client at their discretion. + +## Word boundaries, word characters and Unicode + +The `\w` character class and the zero-width word boundary assertion `\b` are +defined in terms of the ASCII character set. I'm not aware of any +implementation that defines these in terms of proper Unicode character classes. +Do we want to be the first? + +## Leftmost-first + +As of now, my implementation finds the leftmost-first match. This is consistent +with PCRE style regular expressions. + +I've pretty much ignored POSIX, but I think it's very possible to add +leftmost-longest semantics to the existing VM. (RE2 supports this as a +parameter, but I believe still does not fully comply with POSIX with respect to +picking the correct submatches.) + +## Public API + +There are three main questions that can be asked when searching text: + +1. Does the string match this expression? +2. If so, where? +3. Where are its submatches? + +In principle, an API could provide a function to only answer (3). The answers +to (1) and (2) would immediately follow. However, keeping track of submatches +is expensive, so it is useful to implement an optimization that doesn't keep +track of them if it doesn't have to. For example, submatches do not need to be +tracked to answer questions (1) and (2). + +The rabbit hole continues: answering (1) can be more efficient than answering +(2) because you don't have to keep track of *any* capture groups ((2) requires +tracking the position of the full match). More importantly, (1) enables early +exit from the VM. As soon as a match is found, the VM can quit instead of +continuing to search for greedy expressions. + +Therefore, it's worth it to segregate these operations. The performance +difference can get even bigger if a DFA were implemented (which can answer (1) +and (2) quickly and even help with (3)). Moreover, most other regular +expression libraries provide separate facilities for answering these questions +separately. + +Some libraries (like Python's `re` and RE2/C++) distinguish between matching an +expression against an entire string and matching an expression against part of +the string. My implementation favors simplicity: matching the entirety of a +string requires using the `^` and/or `$` anchors. In all cases, an implicit +`.*?` is added the beginning and end of each expression evaluated. (Which is +optimized out in the presence of anchors.) + +Finally, most regexp libraries provide facilities for splitting and replacing +text, usually making capture group names available with some sort of `$var` +syntax. My implementation provides this too. (These are a perfect fit for +Rust's iterators.) 
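To make the three questions concrete, a hypothetical usage sketch follows. The
method names (`is_match`, `find`, `captures`, `replace_all`) are guesses based
on the description above and the linked documentation, not a confirmed API:

    let re = Regexp::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    // (1) Does the text match? No positions or captures need to be tracked.
    assert!(re.is_match("posted 2014-04-12"));
    // (2) Where? Only the overall match position is tracked (byte offsets).
    let pos = re.find("posted 2014-04-12");            // e.g. Some((7, 17))
    // (3) Where are the submatches? Full capture tracking is required.
    let caps = re.captures("posted 2014-04-12").unwrap();
    // Splitting and replacing use `$var`-style references to capture groups.
    let rearranged = re.replace_all("posted 2014-04-12", "$3/$2/$1");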
+ +This basically makes up the entirety of the public API, in addition to perhaps +a `quote` function that escapes a string so that it may be used as a literal in +an expression. + +## The `re!` macro + +With syntax extensions, it's possible to write an `re!` macro that compiles an +expression when a Rust program is compiled. In my case, it seemed simplest to +compile it to *static* data. For example: + + static re: Regexp = re!("a*"); + +At first this seemed difficult to accommodate, but it turned out to be +relatively easy with a type like this: + + pub enum MaybeStatic { + Dynamic(Vec), + Static(&'static [T]), + } + +Another option is for the `re!` macro to produce a non-static value, but I +found this difficult to do with zero-runtime cost. Either way, the ability to +statically declare a regexp is pretty cool I think. + +Note that the syntax extension is the reason for the `regexp_re` crate. It's +very small and contains the macro registration function. I'm not sure how this +fits into the Rust distribution, but my vote is to document the `re!` macro in +the `regexp` crate and hide the `regexp_re` crate from public documentation. +(Or link it to the `regexp` crate.) + +## Untrusted input + +Given worst case `O(mn)` time complexity, I don't think it's worth worrying +about unsafe search text. + +Untrusted regular expressions are another matter. For example, it's very easy +to exhaust a system's resources with nested counted repetitions. For example, +`((a{100}){100}){100}` tries to create `100^3` instructions. My current +implementation does nothing to mitigate against this, but I think a simple hard +limit on the number of instructions allowed would work fine. (Should it be +configurable?) + +## Summary + +My implementation is pretty much a port of most of RE2. The syntax should be +identical or almost identical. I think matching an existing (and popular) +library has benefits, since it will make it easier for people to pick it up and +start using it. There will also be (hopefully) fewer surprises. There is also +plenty of room for performance improvement by implementing a DFA. + +# Alternatives + +I think the single biggest alternative is to provide a backtracking +implementation that supports backreferences and generalized zero-width +assertions. I don't think my implementation precludes this possibility. For +example, a backtracking approach could be implemented and used only when +features like backreferences are invoked in the expression. However, this gives +up the blanket guarantee of worst case `O(mn)` time. I don't think I have the +wisdom required to voice a strong opinion on whether this is a worthwhile +endeavor. + +Another alternative is using a binding to an existing regexp library. I think +this was discussed in issue +[#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people +favor a native Rust implementation if it's to be included in the Rust +distribution. (Does the `re!` macro require it? If so, that's a huge +advantage.) + +Finally, it is always possible to persist without a regexp library. + +# Unresolved questions + +Firstly, I'm not entirely clear on how the `regexp_re` crate will be handled. +I gave a suggestion above, but I'm not sure if it's a good one. Is there any +precedent? + +Secondly, the public API design is fairly simple and straight-forward with no +surprises. 
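One knob that may eventually need to surface in that API is the instruction
limit mentioned under "Untrusted input". A minimal sketch of how such a cap
could be enforced while compiling an expression; the constant and names are
placeholders rather than the crate's actual internals:

    const MAX_INSTRUCTIONS: usize = 100_000;

    fn push_inst<I>(prog: &mut Vec<I>, inst: I) -> Result<(), &'static str> {
        // `((a{100}){100}){100}` wants 100^3 = 1,000,000 instructions and
        // would be rejected long before compilation finishes emitting them.
        if prog.len() >= MAX_INSTRUCTIONS {
            return Err("compiled expression exceeds the instruction limit");
        }
        prog.push(inst);
        Ok(())
    }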
I think most of the unresolved stuff is how the backend is +implemented, which should be changeable without changing the public API (sans +adding features to the syntax). + +I can't remember where I read it, but someone had mentioned defining a *trait* +that declared the API of a regexp engine. That way, anyone could write their +own backend and use the `regexp` interface. My initial thoughts are +YAGNI---since requiring different backends seems like a super specialized +case---but I'm just hazarding a guess here. (If we go this route, then we'd +probably also have to expose the regexp parser and AST and possibly the +compiler and instruction set to make writing your own backend easier. That +sounds restrictive with respect to making performance improvements in the +future.) + +I personally think there's great value in keeping the standard regexp +implementation small, simple and fast. People who have more specialized needs +can always pick one of the existing C or C++ libraries. + From 67f972f598c679903bffab1c3cc87bd198ff9568 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 00:08:35 -0400 Subject: [PATCH 02/12] Mention consistency with std::str with respect to byte indices. --- active/0000-regexps.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index 0e2842eb814..c1f9274cf4f 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -83,6 +83,9 @@ UTF8 codepoint boundaries by construction of the VM). My reason for this is that byte indices result in a lower cost abstraction. If character indices are desired, then a mapping can be maintained by the client at their discretion. +Additionally, this makes it consistent with the `std::str` API, which also +exposes byte indices. + ## Word boundaries, word characters and Unicode The `\w` character class and the zero-width word boundary assertion `\b` are From d45e7c23801c6592be05f6f222258439bf9129fe Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 00:34:26 -0400 Subject: [PATCH 03/12] cc the relevant issue in rust repo --- active/0000-regexps.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index c1f9274cf4f..8f458d1d685 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -30,7 +30,8 @@ search text and are frequently used. For example, many programming languages include some kind of support for regular expressions in its standard library. The outcome of this RFC is to include a regular expression library in the Rust -distribution. +distribution and resolve issue +[#3591](https://github.com/mozilla/rust/issues/3591). # Detailed design From e7add74bac5433434198cfc3f1624c3e8b9444ce Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 00:39:27 -0400 Subject: [PATCH 04/12] sfackler tipped me off to a bug. mentioned. --- active/0000-regexps.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index 8f458d1d685..b0b828c4bff 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -172,6 +172,9 @@ fits into the Rust distribution, but my vote is to document the `re!` macro in the `regexp` crate and hide the `regexp_re` crate from public documentation. (Or link it to the `regexp` crate.) +It seems like the `re!` macro will become a bit nicer to use once +[#11640](https://github.com/mozilla/rust/issues/11640) is fixed. 
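Returning to the byte-index guarantee discussed earlier: because match
positions are byte offsets that fall on UTF-8 boundaries, they can be fed
straight to string slicing. A standard-library-only illustration in current
Rust syntax (no use of the proposed crate):

    let text = "αβ abc";
    // Byte offsets, not character counts: 'α' and 'β' are two bytes each.
    let start = text.find("abc").unwrap();             // 5
    assert_eq!(&text[start..start + 3], "abc");
    // A caller who wants character indices can build the mapping themselves.
    let chars_before = text[..start].chars().count();  // 3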
+ ## Untrusted input Given worst case `O(mn)` time complexity, I don't think it's worth worrying From 08df06ee613a96d91ee91443e383c01a6b8ee09d Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 00:41:03 -0400 Subject: [PATCH 05/12] native implementation is maximally portable --- active/0000-regexps.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index b0b828c4bff..164a10940f8 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -211,7 +211,7 @@ this was discussed in issue [#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people favor a native Rust implementation if it's to be included in the Rust distribution. (Does the `re!` macro require it? If so, that's a huge -advantage.) +advantage.) Also, a native implementation makes it maximally portable. Finally, it is always possible to persist without a regexp library. From 0584d78b1eb6f753135a5f1f651853426bed1d66 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 00:53:14 -0400 Subject: [PATCH 06/12] change regexp_re to regexp_macros --- active/0000-regexps.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index 164a10940f8..ce632fe4436 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -5,7 +5,7 @@ # Summary Add a `regexp` crate to the Rust distribution in addition to a small -`regexp_re` crate that provides a syntax extension for compiling regular +`regexp_macros` crate that provides a syntax extension for compiling regular expressions during the compilation of a Rust program. The implementation that supports this RFC is ready to receive @@ -166,10 +166,10 @@ Another option is for the `re!` macro to produce a non-static value, but I found this difficult to do with zero-runtime cost. Either way, the ability to statically declare a regexp is pretty cool I think. -Note that the syntax extension is the reason for the `regexp_re` crate. It's +Note that the syntax extension is the reason for the `regexp_macros` crate. It's very small and contains the macro registration function. I'm not sure how this fits into the Rust distribution, but my vote is to document the `re!` macro in -the `regexp` crate and hide the `regexp_re` crate from public documentation. +the `regexp` crate and hide the `regexp_macros` crate from public documentation. (Or link it to the `regexp` crate.) It seems like the `re!` macro will become a bit nicer to use once @@ -217,7 +217,7 @@ Finally, it is always possible to persist without a regexp library. # Unresolved questions -Firstly, I'm not entirely clear on how the `regexp_re` crate will be handled. +Firstly, I'm not entirely clear on how the `regexp_macros` crate will be handled. I gave a suggestion above, but I'm not sure if it's a good one. Is there any precedent? From b75713aa6c45cde8dbeec3bf997b05e6cc413d0e Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 11:55:11 -0400 Subject: [PATCH 07/12] Mention that the API could be unstable/experimental. --- active/0000-regexps.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index ce632fe4436..581ce912219 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -240,3 +240,5 @@ I personally think there's great value in keeping the standard regexp implementation small, simple and fast. 
People who have more specialized needs can always pick one of the existing C or C++ libraries. +For now, we could mark the API as `#[unstable]` or `#[experimental]`. + From ae64e8b712529a216633b9381a4f15ecef6ffb67 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 18:05:52 -0400 Subject: [PATCH 08/12] What's in a name? That which we call a regexp. --- active/0000-regexps.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index 581ce912219..9cc4424fb97 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -187,6 +187,22 @@ implementation does nothing to mitigate against this, but I think a simple hard limit on the number of instructions allowed would work fine. (Should it be configurable?) +## Name + +The name of the crate being proposed is `regexp` and the type describing a +compiled regular expression is `Regexp`. I think an equally good name would be +`regex` (and `Regex`). Either name seems to be frequently used, e.g., "regexes" +or "regexps" in colloquial use. I chose `regexp` over `regex` because it +matches the name used for the corresponding package in Go's standard library. + +Other possible names are `regexpr` (and `Regexpr`) or something with +underscores: `reg_exp` (and `RegExp`). However, I perceive these to be more +ugly and less commonly used than either `regexp` or `regex`. + +Finally, we could use `re` (like Python), but I think the name could be +ambiguous since it's so short. `regexp` (or `regex`) unequivocally identifies +the crate as providing regular expressions. + ## Summary My implementation is pretty much a port of most of RE2. The syntax should be From 0439c18a8cd9b7b5c5ea2e6b8e2b419571d631c2 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 20:10:36 -0400 Subject: [PATCH 09/12] Rename re! to regexp./wat --- active/0000-regexps.md | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index 9cc4424fb97..ace32d72fd8 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -146,13 +146,13 @@ This basically makes up the entirety of the public API, in addition to perhaps a `quote` function that escapes a string so that it may be used as a literal in an expression. -## The `re!` macro +## The `regexp!` macro -With syntax extensions, it's possible to write an `re!` macro that compiles an -expression when a Rust program is compiled. In my case, it seemed simplest to -compile it to *static* data. For example: +With syntax extensions, it's possible to write an `regexp!` macro that compiles +an expression when a Rust program is compiled. In my case, it seemed simplest +to compile it to *static* data. For example: - static re: Regexp = re!("a*"); + static re: Regexp = regexp!("a*"); At first this seemed difficult to accommodate, but it turned out to be relatively easy with a type like this: @@ -162,17 +162,17 @@ relatively easy with a type like this: Static(&'static [T]), } -Another option is for the `re!` macro to produce a non-static value, but I +Another option is for the `regexp!` macro to produce a non-static value, but I found this difficult to do with zero-runtime cost. Either way, the ability to statically declare a regexp is pretty cool I think. Note that the syntax extension is the reason for the `regexp_macros` crate. It's very small and contains the macro registration function. 
I'm not sure how this -fits into the Rust distribution, but my vote is to document the `re!` macro in -the `regexp` crate and hide the `regexp_macros` crate from public documentation. -(Or link it to the `regexp` crate.) +fits into the Rust distribution, but my vote is to document the `regexp!` macro +in the `regexp` crate and hide the `regexp_macros` crate from public +documentation. (Or link it to the `regexp` crate.) -It seems like the `re!` macro will become a bit nicer to use once +It seems like the `regexp!` macro will become a bit nicer to use once [#11640](https://github.com/mozilla/rust/issues/11640) is fixed. ## Untrusted input @@ -203,6 +203,9 @@ Finally, we could use `re` (like Python), but I think the name could be ambiguous since it's so short. `regexp` (or `regex`) unequivocally identifies the crate as providing regular expressions. +For consistency's sake, I propose that the syntax extension provided be named +the same as the crate. So in this case, `regexp!`. + ## Summary My implementation is pretty much a port of most of RE2. The syntax should be @@ -226,7 +229,7 @@ Another alternative is using a binding to an existing regexp library. I think this was discussed in issue [#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people favor a native Rust implementation if it's to be included in the Rust -distribution. (Does the `re!` macro require it? If so, that's a huge +distribution. (Does the `regexp!` macro require it? If so, that's a huge advantage.) Also, a native implementation makes it maximally portable. Finally, it is always possible to persist without a regexp library. From 603582d7f8381c341d9d642123dce37ca29dc384 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Sun, 13 Apr 2014 23:18:30 -0400 Subject: [PATCH 10/12] Use Unicode for words and spaces. --- active/0000-regexps.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index ace32d72fd8..59b0ec86300 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -89,10 +89,8 @@ exposes byte indices. ## Word boundaries, word characters and Unicode -The `\w` character class and the zero-width word boundary assertion `\b` are -defined in terms of the ASCII character set. I'm not aware of any -implementation that defines these in terms of proper Unicode character classes. -Do we want to be the first? +At least Python and D define word characters, word boundaries and space +characters with Unicode character classes. I propose we do the same. ## Leftmost-first From 61b023023237a9cdd72b2206a927b408998ff16a Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Mon, 14 Apr 2014 21:15:20 -0400 Subject: [PATCH 11/12] Notes on future work (optimizations and Unicode). --- active/0000-regexps.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index 59b0ec86300..bbbdefb763c 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -259,3 +259,23 @@ can always pick one of the existing C or C++ libraries. For now, we could mark the API as `#[unstable]` or `#[experimental]`. +# Future work + +I think most of the future work for this crate is to increase the performance, +either by implementing different matching algorithms (e.g., a DFA) or by +compiling a regular expression to native Rust code. 
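To make the second idea concrete: "native" compilation would specialize the
matching loop to one fixed expression, so nothing is interpreted at match time.
A sketch of what hand-specialized code for `a*b` might boil down to (purely
illustrative; it is not what any actual generator emits):

    // Leftmost match of `a*b`, returned as byte offsets.
    fn find_a_star_b(text: &str) -> Option<(usize, usize)> {
        let bytes = text.as_bytes();
        for start in 0..bytes.len() {
            let mut i = start;
            // Greedily consume `a*`...
            while i < bytes.len() && bytes[i] == b'a' {
                i += 1;
            }
            // ...then require a literal `b`.
            if i < bytes.len() && bytes[i] == b'b' {
                return Some((start, i + 1));
            }
        }
        None
    }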
+ +With regard to native compilation, there are a few notes: + +* If and when a DFA is implemented, care must be taken, as the size of the code + required can grow rapidly. +* Adding native compilation will very likely change the interface of the crate + in a meaningful way, particularly if we want the interface to be consistent + between natively compiled and dynamically compiled regexps. (i.e., Make + `Regexp` a trait.) + +Other future work (that is probably more important) includes more Unicode +support, specifically for simple case folding. Also, words and word boundaries +should also be Unicode friendly, but I plan to have this done before I submit a +PR. + From c250f8bb0b249eedc0cbb24fbf1238320770fba1 Mon Sep 17 00:00:00 2001 From: Andrew Gallant Date: Fri, 18 Apr 2014 18:22:37 -0400 Subject: [PATCH 12/12] Include information about native regexps. --- active/0000-regexps.md | 76 ++++++++++++++++++++---------------------- 1 file changed, 37 insertions(+), 39 deletions(-) diff --git a/active/0000-regexps.md b/active/0000-regexps.md index bbbdefb763c..a5a3138b458 100644 --- a/active/0000-regexps.md +++ b/active/0000-regexps.md @@ -90,7 +90,9 @@ exposes byte indices. ## Word boundaries, word characters and Unicode At least Python and D define word characters, word boundaries and space -characters with Unicode character classes. I propose we do the same. +characters with Unicode character classes. My implementation does the same +by augmenting the standard Perl character classes `\d`, `\s` and `\w` with +corresponding Unicode categories. ## Leftmost-first @@ -147,31 +149,39 @@ an expression. ## The `regexp!` macro With syntax extensions, it's possible to write an `regexp!` macro that compiles -an expression when a Rust program is compiled. In my case, it seemed simplest -to compile it to *static* data. For example: +an expression when a Rust program is compiled. This includes translating the +matching algorithm to Rust code specific to the expression given. This "ahead +of time" compiling results in a performance increase. Namely, it elides all +heap allocation. - static re: Regexp = regexp!("a*"); +I've called these "native" regexps, whereas expressions compiled at runtime are +"dynamic" regexps. The public API need not impose this distinction on users, +other than requiring the use of a syntax extension to construct a native +regexp. For example: -At first this seemed difficult to accommodate, but it turned out to be -relatively easy with a type like this: + let re = regexp!("a*"); - pub enum MaybeStatic { - Dynamic(Vec), - Static(&'static [T]), - } +After construction, `re` is indistinguishable from an expression created +dynamically: + + let re = Regexp::new("a*").unwrap(); + +In particular, both have the same type. This is accomplished with a +representation resembling: -Another option is for the `regexp!` macro to produce a non-static value, but I -found this difficult to do with zero-runtime cost. Either way, the ability to -statically declare a regexp is pretty cool I think. + enum MaybeNative { + Dynamic(~[Inst]), + Native(fn(MatchKind, &str, uint, uint) -> ~[Option]), + } -Note that the syntax extension is the reason for the `regexp_macros` crate. It's -very small and contains the macro registration function. I'm not sure how this -fits into the Rust distribution, but my vote is to document the `regexp!` macro -in the `regexp` crate and hide the `regexp_macros` crate from public -documentation. (Or link it to the `regexp` crate.) 
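For readability, here are the two representations shown above with their
element types written out in full; the `<T>` and `<uint>` parameters are a
reconstruction and are not confirmed by the snippets as printed:

    pub enum MaybeStatic<T> {
        Dynamic(Vec<T>),
        Static(&'static [T]),
    }

    enum MaybeNative {
        Dynamic(~[Inst]),
        Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
    }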
+This syntax extension requires a second crate, `regexp_macros`, where the +`regexp!` macro is defined. Technically, this could be provided in the `regexp` +crate, but this would introduce a runtime dependency on `libsyntax` for any use +of the `regexp` crate. -It seems like the `regexp!` macro will become a bit nicer to use once -[#11640](https://github.com/mozilla/rust/issues/11640) is fixed. +[@alexcrichton +remarks](https://github.com/rust-lang/rfcs/pull/42#issuecomment-40320112) +that this state of affairs is a wart that will be corrected in the future. ## Untrusted input @@ -234,11 +244,7 @@ Finally, it is always possible to persist without a regexp library. # Unresolved questions -Firstly, I'm not entirely clear on how the `regexp_macros` crate will be handled. -I gave a suggestion above, but I'm not sure if it's a good one. Is there any -precedent? - -Secondly, the public API design is fairly simple and straight-forward with no +The public API design is fairly simple and straight-forward with no surprises. I think most of the unresolved stuff is how the backend is implemented, which should be changeable without changing the public API (sans adding features to the syntax). @@ -247,8 +253,8 @@ I can't remember where I read it, but someone had mentioned defining a *trait* that declared the API of a regexp engine. That way, anyone could write their own backend and use the `regexp` interface. My initial thoughts are YAGNI---since requiring different backends seems like a super specialized -case---but I'm just hazarding a guess here. (If we go this route, then we'd -probably also have to expose the regexp parser and AST and possibly the +case---but I'm just hazarding a guess here. (If we go this route, then we +might want to expose the regexp parser and AST and possibly the compiler and instruction set to make writing your own backend easier. That sounds restrictive with respect to making performance improvements in the future.) @@ -263,19 +269,11 @@ For now, we could mark the API as `#[unstable]` or `#[experimental]`. I think most of the future work for this crate is to increase the performance, either by implementing different matching algorithms (e.g., a DFA) or by -compiling a regular expression to native Rust code. - -With regard to native compilation, there are a few notes: +improving the code generator that produces native regexps with `regexp!`. -* If and when a DFA is implemented, care must be taken, as the size of the code - required can grow rapidly. -* Adding native compilation will very likely change the interface of the crate - in a meaningful way, particularly if we want the interface to be consistent - between natively compiled and dynamically compiled regexps. (i.e., Make - `Regexp` a trait.) +If and when a DFA is implemented, care must be taken when creating a code +generator, as the size of the code required can grow rapidly. Other future work (that is probably more important) includes more Unicode -support, specifically for simple case folding. Also, words and word boundaries -should also be Unicode friendly, but I plan to have this done before I submit a -PR. +support, specifically for simple case folding.
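As background for the `O(mn)` claim that runs through this RFC, here is a
minimal sketch of the Pike-VM idea in current Rust syntax. The instruction set
is a placeholder, captures are omitted, and the search is anchored (the real
thing gets unanchored search from the implicit `.*?` mentioned earlier); the
point is only that each input position considers each instruction at most once,
which is what bounds the running time:

    #[derive(Clone, Copy)]
    enum Inst {
        Char(char),          // consume one character if it is equal
        Split(usize, usize), // try both targets (alternation, repetition)
        Jump(usize),
        Match,
    }

    // Add a thread at `pc`, eagerly following Split/Jump (the epsilon closure).
    fn add(prog: &[Inst], list: &mut Vec<usize>, seen: &mut [bool], pc: usize) {
        if seen[pc] {
            return;
        }
        seen[pc] = true;
        match prog[pc] {
            Inst::Jump(t) => add(prog, list, seen, t),
            Inst::Split(a, b) => {
                add(prog, list, seen, a);
                add(prog, list, seen, b);
            }
            _ => list.push(pc),
        }
    }

    // Anchored match: does `prog` match a prefix of `text`?
    fn is_match(prog: &[Inst], text: &str) -> bool {
        let mut clist = Vec::new();
        let mut seen = vec![false; prog.len()];
        add(prog, &mut clist, &mut seen, 0);

        for c in text.chars() {
            // Early exit: question (1) needs no further work once a Match
            // thread is live.
            if clist.iter().any(|&pc| matches!(prog[pc], Inst::Match)) {
                return true;
            }
            let mut nlist = Vec::new();
            let mut seen = vec![false; prog.len()];
            for &pc in &clist {
                if let Inst::Char(want) = prog[pc] {
                    if want == c {
                        add(prog, &mut nlist, &mut seen, pc + 1);
                    }
                }
            }
            clist = nlist;
        }
        clist.iter().any(|&pc| matches!(prog[pc], Inst::Match))
    }

    fn main() {
        // A hand-compiled program for `a*b`:
        let prog = [
            Inst::Split(1, 3),
            Inst::Char('a'),
            Inst::Jump(0),
            Inst::Char('b'),
            Inst::Match,
        ];
        assert!(is_match(&prog, "aaab"));
        assert!(!is_match(&prog, "aaac"));
    }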