Commit d39e254

Author: Ethan Pailes
Add a onepass DFA.
This patch adds a onepass matcher, which is a DFA that has all the abilities of an NFA! There are lots of expressions that a onepass matcher can't handle, namely those cases where a regex contains non-determinism.

The general approach we take is as follows:

1. Check if a regex is onepass using `src/onepass.rs::is_onepass`.
2. Compile a new regex program using the compiler with the bytes flag set.
3. Compile a onepass DFA from the program produced in step 2. We will roughly map each instruction to a state in the DFA, though instructions like `split` don't get states.
   a. Make a new transition table for the first instruction.
   b. For each child of the first instruction:
      - If it is a bytes instruction, add a transition to the table for every byte class in the instruction.
      - If it is an instruction which consumes zero input (like `EmptyLook` or `Save`), emit a job to a DAG asking to forward the first instruction state to the state for the non-consuming instruction.
      - Push the child instruction to a queue of instructions to process.
   c. Peel off an instruction from the queue and go back to step a, processing the instruction as if it were the first instruction. If the queue is empty, continue with step d.
   d. Topologically sort the forwarding jobs, and shuffle the transitions from the forwarding targets to the forwarding sources in topological order.
   e. Bake the intermediary transition tables down into a single flat vector. States which require some action (`EmptyLook` and `Save`) get an extra entry in the baked transition table that contains metadata instructing them on how to perform their actions.
4. Wait for the user to give us some input.
5. Execute the DFA:
   - The inner loop is basically:

         while at < text.len():
             state_ptr = baked_table[text[at]]
             at += 1

   - There is a lot of window dressing to handle special states.

The idea of a onepass matcher comes from Russ Cox and his RE2 library. I haven't been as good about reading the RE2 source as I should have, but I've gotten the impression that the RE2 onepass matcher is more in the spirit of an NFA simulation without threads than a DFA.

Squashed Patch Notes
====================

There were a few issues and burrs that needed to be sanded down in the original impl. They were fixed in a series of small patches that are described below.

Fix bogus doctest.

The list formatting in the module comment for `src/onepass.rs` was triggering a doctest. English != Rust, so this made `cargo test` grumpy.

Thread only_utf8 through onepass to byte input.

word_boundary_unicode::ascii3 was failing because I wasn't threading the correct only_utf8 value through to the actual input object. This patch fixes that.

Drop empty branch restriction.

When I first noticed the problem with empty branches in alternatives, I added a special case in the fset intersection code to close the loophole. Since then I've implemented a more principled notion of a regex accepting the empty string, so the special case is no longer needed. This patch removes that restriction.

Fix documentation and style issues.

This patch just has a bunch of style and doc fixes that I noticed when going over the diff on GitHub.

Flatten `onepass` member of the OnePassCompiler

Embedding the OnePass DFA to be compiled in the OnePassCompiler caused a few values to be unnecessarily duplicated and added an extra level of indirection. This patch resolves that issue and takes advantage of these move semantics I'm always hearing about.

Factor OnePassCompiler::forwards into local var.

Iteration of a `Forwards` object is destructive, which previously meant that we had to clone it in order to iterate over it. Once the compiler iterates over the forwarding jobs, it never touches them again, so this was an extra copy. This patch plays a little type tetris to get rid of that extra copy.

Filter out STATE_DEAD in eof single step.

STATE_DEAD has the STATE_MATCH flag set even though it does not semantically indicate a match. This means that we have to be very careful about when we check for the STATE_MATCH flag. In the eof single step, just before the eof action drain loop, I was forgetting to filter out the STATE_DEAD case, with predictably bad results. This patch fixes that.

Add an unrollable inner loop

This patch adds an inner loop to the onepass DFA execution which is set up for unrolling. Right now it is unrolled once, which isn't that interesting, but benchmarks will be required to determine the right number of times to unroll the loop. The inner loop does manage to avoid an extra branch around when to increment `at`, which is required for the drain loop.

Clarify forwarding DAG edge situation.

Previously, the forwarding DAG was talked about both in terms of states which need to be forwarded to other states, and in more conventional graph theory terms. Forwarding one state to another makes sense in terms of the DFA, but unfortunately the directionality is exactly opposite the directionality present in the DAG we were dealing with. This patch tries to cut down on the confusion that this might have caused by renaming some variables and adding more comments.

Factor accepts_empty out of fset_of

Previously the only way to determine if a given expression accepts the empty string was to compute the whole first set and then check the flag on the fset. This resulted in a little bit of wasted work because the set of accepting chars was also computed. It is unlikely that there was much of a perf impact, so this patch is mostly just unnecessary gardening. Nevertheless, this patch removes that tiny bit of wasted work.

Update utf8 encoding to use new post regex-1.0 style!
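The inner loop from step 5 can be illustrated with a minimal, self-contained sketch. This is not the code from `src/onepass.rs`: the `FlatDfa` type, the `STRIDE` and `DEAD` constants, and the `build_ab_star_c` helper are all hypothetical names chosen for illustration. The sketch assumes one 256-entry row per state (the real matcher uses byte classes) and ignores `Save` and `EmptyLook` actions entirely.

```rust
/// Hypothetical row width: one slot per possible input byte.
const STRIDE: usize = 256;
/// Hypothetical sentinel meaning "no transition on this byte".
const DEAD: usize = usize::MAX;

/// A toy DFA whose transition tables are baked into one flat vector.
struct FlatDfa {
    /// `table[state_ptr + byte]` holds the offset of the next state's row.
    table: Vec<usize>,
    /// Whether each state (row offset / STRIDE) is accepting.
    accepting: Vec<bool>,
}

impl FlatDfa {
    fn new(n_states: usize) -> Self {
        FlatDfa {
            table: vec![DEAD; n_states * STRIDE],
            accepting: vec![false; n_states],
        }
    }

    /// Record that `from` moves to `to` on `byte`, storing a row offset.
    fn add_transition(&mut self, from: usize, byte: u8, to: usize) {
        self.table[from * STRIDE + byte as usize] = to * STRIDE;
    }

    /// Anchored, whole-input match: one table lookup per input byte.
    fn is_match(&self, text: &[u8]) -> bool {
        let mut state_ptr = 0;
        let mut at = 0;
        while at < text.len() {
            state_ptr = self.table[state_ptr + text[at] as usize];
            if state_ptr == DEAD {
                return false;
            }
            at += 1;
        }
        self.accepting[state_ptr / STRIDE]
    }
}

/// Hand-built table for the regex `ab*c` (three states).
fn build_ab_star_c() -> FlatDfa {
    let mut dfa = FlatDfa::new(3);
    dfa.add_transition(0, b'a', 1);
    dfa.add_transition(1, b'b', 1);
    dfa.add_transition(1, b'c', 2);
    dfa.accepting[2] = true;
    dfa
}
```

Storing row offsets rather than state numbers in the table means the loop body needs no multiplication, which is presumably part of the appeal of baking the tables down into a single vector.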
1 parent 91371de commit d39e254

10 files changed, +1488 -71 lines changed

Cargo.toml

+5
@@ -108,6 +108,11 @@ name = "backtrack-utf8bytes"
 path = "tests/test_backtrack_bytes.rs"
 name = "backtrack-bytes"

+# Run the test suite on the onepass engine.
+[[test]]
+path = "tests/test_onepass.rs"
+name = "onepass"
+
 [profile.release]
 debug = true

src/analysis.rs

+77 -53
@@ -50,20 +50,15 @@ impl IsOnePassVisitor {
         let mut empty_run = vec![];

         for e in NestedConcat::new(es) {
-            // TODO(ethan):yakshaving factor the determination of when
-            // a regex accepts_empty out into a separate function,
-            // so that we don't compute the whole first set when we
-            // don't need to.
-            let fset = fset_of(e);
             let is_rep = match e.kind() {
                 &HirKind::Repetition(_) => true,
                 _ => false,
             };

             empty_run.push(e);
-            if !(fset.accepts_empty || is_rep) {
-                // this is the last one in the run
-                break;
+            if !(accepts_empty(e) || is_rep) {
+                self.0 = self.0 && !fsets_clash(&empty_run);
+                empty_run.clear();
             }
         }

@@ -76,7 +71,7 @@ impl IsOnePassVisitor {
         self.0 = self.0 && !fsets_clash(&es.iter().collect::<Vec<_>>());
     }

-    // Unicode classes are really big alternatives from the byte
+    // Unicode classes are really just big alternatives from the byte
     // oriented point of view.
     //
     // This function translates a unicode class into the
@@ -99,7 +94,7 @@ impl IsOnePassVisitor {
                     }
                 }
             }
-            _ => {} // FALLTHROUGH
+            _ => {}
         }
     }

@@ -115,16 +110,6 @@ fn fsets_clash(es: &[&Hir]) -> bool {
             let mut fset = fset_of(e1);
             let fset2 = fset_of(e2);

-            // For the regex /a|()+/, we don't have a way to
-            // differentiate the branches, so we are not onepass.
-            //
-            // We might be able to loosen this restriction by
-            // considering the expression after the alternative
-            // if there is one.
-            if fset.is_empty() || fset2.is_empty() {
-                return true;
-            }
-
             fset.intersect(&fset2);
             if ! fset.is_empty() {
                 return true;
@@ -138,14 +123,14 @@ fn fsets_clash(es: &[&Hir]) -> bool {

 /// Compute the first set of a given regular expression.
 ///
-/// The first set of a regular expression is the set of all characters
+/// The first set of a regular expression is the set of all bytes
 /// which might begin it. This is a less general version of the
 /// notion of a regular expression preview (the first set can be
 /// thought of as the 1-preview of a regular expression).
 ///
 /// Note that first sets are byte-oriented because the DFA is
 /// byte oriented. This means an expression like /Δ|δ/ is actually not
-/// one-pass, even though there is clearly no non-determinism inherent
+/// onepass, even though there is clearly no non-determinism inherent
 /// to the regex at a unicode code point level (big delta and little
 /// delta start with the same byte).
 fn fset_of(expr: &Hir) -> FirstSet {
@@ -155,7 +140,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
         f
     }

-    match expr.kind() {
+    // First compute the set of characters that might begin
+    // the expression (ignoring epsilon for now).
+    let mut f_char_set = match expr.kind() {
         &HirKind::Empty => FirstSet::epsilon(),
         &HirKind::Literal(ref lit) => {
             match lit {
@@ -191,29 +178,13 @@ fn fset_of(expr: &Hir) -> FirstSet {
         // that such an emptylook could potentially match on any character.
         &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => FirstSet::anychar(),

-        &HirKind::Repetition(ref rep) => {
-            let mut f = fset_of(&*rep.hir);
-            match rep.kind {
-                RepetitionKind::ZeroOrOne => f.accepts_empty = true,
-                RepetitionKind::ZeroOrMore => f.accepts_empty = true,
-                RepetitionKind::OneOrMore => {},
-                RepetitionKind::Range(ref range) => {
-                    match range {
-                        &RepetitionRange::Exactly(0)
-                        | &RepetitionRange::AtLeast(0)
-                        | &RepetitionRange::Bounded(0, _) =>
-                            f.accepts_empty = true,
-                        _ => {}
-                    }
-                }
-            }
-            f
-        },
+        &HirKind::Repetition(ref rep) => fset_of(&rep.hir),
         &HirKind::Group(ref group) => fset_of(&group.hir),

         // The most involved case. We need to strip leading empty-looks
         // as well as take the union of the first sets of the first n+1
-        // expressions where n is the number of leading repetitions.
+        // expressions where n is the number of leading expressions which
+        // accept the empty string.
         &HirKind::Concat(ref es) => {
             let mut fset = FirstSet::empty();
             for (i, e) in es.iter().enumerate() {
@@ -229,13 +200,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
                 let inner_fset = fset_of(e);
                 fset.union(&inner_fset);

-                if !inner_fset.accepts_empty() {
+                if !accepts_empty(e) {
                     // We can stop accumulating after we stop seeing
                     // first sets which contain epsilon.
-                    // Also, a contatination which terminated by
-                    // one or more expressions which do not accept
-                    // epsilon itself does not acceept epsilon.
-                    fset.accepts_empty = false;
                     break;
                 }
             }
@@ -250,13 +217,68 @@ fn fset_of(expr: &Hir) -> FirstSet {
             }
             fset
         }
+    };
+
+    f_char_set.accepts_empty = accepts_empty(expr);
+    f_char_set
+}
+
+fn accepts_empty(expr: &Hir) -> bool {
+    match expr.kind() {
+        &HirKind::Empty => true,
+        &HirKind::Literal(_) => false,
+        &HirKind::Class(_) => false,
+
+        // A naked empty look is a pretty weird thing because we
+        // normally strip them from the beginning of concatinations.
+        // We are just going to treat them like `.`
+        &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => false,
+
+        &HirKind::Repetition(ref rep) => {
+            match rep.kind {
+                RepetitionKind::ZeroOrOne => true,
+                RepetitionKind::ZeroOrMore => true,
+                RepetitionKind::OneOrMore => accepts_empty(&rep.hir),
+                RepetitionKind::Range(ref range) => {
+                    match range {
+                        &RepetitionRange::Exactly(0)
+                        | &RepetitionRange::AtLeast(0)
+                        | &RepetitionRange::Bounded(0, _) => true,
+                        _ => accepts_empty(&rep.hir),
+                    }
+                }
+            }
+        }
+
+        &HirKind::Group(ref group) => accepts_empty(&group.hir),
+
+        &HirKind::Concat(ref es) => {
+            let mut accepts: bool = true;
+            for e in es.iter() {
+                match e.kind() {
+                    &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => {
+                        // Ignore any leading emptylooks.
+                    }
+                    _ => {
+                        accepts = accepts && accepts_empty(&e);
+                    }
+                }
+
+                if !accepts {
+                    break;
+                }
+            }
+            accepts
+        }
+
+        &HirKind::Alternation(ref es) => es.iter().any(accepts_empty)
     }
 }

 /// The first byte of a unicode code point.
 ///
-/// We only ever care about the first byte of a particular character,
-/// because the onepass DFA is implemented in the byte space, not the
+/// We only ever care about the first byte of a particular character
+/// because the onepass DFA is implemented in the byte space not the
 /// character space. This means, for example, that a branch between
 /// lowercase delta and uppercase delta is actually non-deterministic.
 fn first_byte(c: char) -> u8 {
@@ -323,10 +345,6 @@ impl FirstSet {
     fn is_empty(&self) -> bool {
         self.bytes.is_empty() && !self.accepts_empty
     }
-
-    fn accepts_empty(&self) -> bool {
-        self.accepts_empty
-    }
 }

 /// An iterator over a concatenation of expressions which
@@ -544,4 +562,10 @@ mod tests {
         assert!(!is_onepass(&e1));
         assert!(!is_onepass(&e2));
     }
+
+    #[test]
+    fn is_onepass_clash_in_middle_of_concat() {
+        let e = Parser::new().parse(r"ab?b").unwrap();
+        assert!(!is_onepass(&e));
+    }
 }

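To see why the new `is_onepass_clash_in_middle_of_concat` test expects `ab?b` to be rejected, and why `/Δ|δ/` clashes at the byte level, here is a small standalone sketch of the first-set idea. It does not use the real `Hir`, `FirstSet`, or `IsOnePassVisitor` types: the `Expr` enum and the functions below are hypothetical stand-ins that only model literals, `?`, concatenation, and alternation.

```rust
use std::collections::BTreeSet;
use std::iter;

/// Hypothetical miniature expression tree (not the real `Hir`).
enum Expr {
    Lit(char),
    Optional(Box<Expr>),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
}

/// First byte of the UTF-8 encoding of `c`.
fn first_byte(c: char) -> u8 {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes()[0]
}

/// Does the expression match the empty string?
fn accepts_empty(e: &Expr) -> bool {
    match e {
        Expr::Lit(_) => false,
        Expr::Optional(_) => true,
        Expr::Concat(es) => es.iter().all(|e| accepts_empty(e)),
        Expr::Alt(es) => es.iter().any(|e| accepts_empty(e)),
    }
}

/// The set of bytes that might begin a match of the expression.
fn first_set(e: &Expr) -> BTreeSet<u8> {
    match e {
        Expr::Lit(c) => iter::once(first_byte(*c)).collect(),
        Expr::Optional(inner) => first_set(inner),
        Expr::Concat(es) => {
            let mut set = BTreeSet::new();
            for e in es {
                set.extend(first_set(e));
                // Stop at the first element that must consume input.
                if !accepts_empty(e) {
                    break;
                }
            }
            set
        }
        Expr::Alt(es) => es.iter().flat_map(|e| first_set(e)).collect(),
    }
}

/// True if any two expressions could start with the same byte.
fn sets_clash(es: &[&Expr]) -> bool {
    for (i, e1) in es.iter().enumerate() {
        for e2 in &es[i + 1..] {
            if !first_set(e1).is_disjoint(&first_set(e2)) {
                return true;
            }
        }
    }
    false
}

/// Simplified onepass check: alternation branches must not clash, and
/// neither may any run of empty-accepting elements plus its terminator.
fn is_onepass(e: &Expr) -> bool {
    match e {
        Expr::Lit(_) => true,
        Expr::Optional(inner) => is_onepass(inner),
        Expr::Alt(es) => {
            es.iter().all(|e| is_onepass(e))
                && !sets_clash(&es.iter().collect::<Vec<_>>())
        }
        Expr::Concat(es) => {
            let mut run: Vec<&Expr> = Vec::new();
            for e in es {
                if !is_onepass(e) {
                    return false;
                }
                run.push(e);
                if !accepts_empty(e) {
                    if sets_clash(&run) {
                        return false;
                    }
                    run.clear();
                }
            }
            !sets_clash(&run)
        }
    }
}
```

In `ab?b`, the run `b?` followed by `b` has two members whose first sets both contain the byte `b`, so after reading `a` the matcher cannot tell from one byte which branch to take. Clearing the run after each check, as the patched loop does, is what lets a clash in the middle of a concatenation be detected instead of only the first run.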
src/backtrack.rs

+1 -1
@@ -245,7 +245,7 @@ impl<'a, 'm, 'r, 's, I: Input> Bounded<'a, 'm, 'r, 's, I> {
                     ip = inst.goto1;
                 }
                 EmptyLook(ref inst) => {
-                    if self.input.is_empty_match(at, inst) {
+                    if self.input.is_empty_match(at, inst.look) {
                         ip = inst.goto;
                     } else {
                         return false;
