Commit cc907b8

Author: Ethan Pailes
Add a onepass DFA.
This patch adds a onepass matcher, which is a DFA that has all the abilities of an NFA! There are lots of expressions that a onepass matcher can't handle, namely those cases where a regex contains non-determinism.

The general approach we take is as follows:

1. Check if a regex is onepass using `src/onepass.rs::is_onepass`.
2. Compile a new regex program using the compiler with the bytes flag set.
3. Compile a onepass DFA from the program produced in step 2. We will roughly map each instruction to a state in the DFA, though instructions like `split` don't get states.
   a. Make a new transition table for the first instruction.
   b. For each child of the first instruction:
      - If it is a bytes instruction, add a transition to the table for every byte class in the instruction.
      - If it is an instruction which consumes zero input (like `EmptyLook` or `Save`), emit a job to a DAG asking to forward the first instruction state to the state for the non-consuming instruction.
      - Push the child instruction to a queue of instructions to process.
   c. Peel off an instruction from the queue and go back to step a, processing the instruction as if it was the first instruction. If the queue is empty, continue with step d.
   d. Topologically sort the forwarding jobs, and shuffle the transitions from the forwarding targets to the forwarding sources in topological order.
   e. Bake the intermediary transition tables down into a single flat vector. States which require some action (`EmptyLook` and `Save`) get an extra entry in the baked transition table that contains metadata instructing them on how to perform their actions.
4. Wait for the user to give us some input.
5. Execute the DFA:
   - The inner loop is basically:

         while at < text.len():
             state_ptr = baked_table[text[at]]
             at += 1

   - There is a lot of window dressing to handle special states.

The idea of a onepass matcher comes from Russ Cox and his RE2 library.
I haven't been as good about reading the RE2 source as I should have, but I've gotten the impression that the RE2 onepass matcher is more in the spirit of an NFA simulation without threads than a DFA.
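
As a rough illustration of step 5, the sketch below runs a hand-built flat transition table over a byte string. It is not the code from this patch: the `BakedDfa` type, the `STRIDE` and `DEAD` constants, and the one-column-per-byte layout are all assumptions made for the example (the patch talks about byte classes and special states, which are left out here). The sketch also reads the shorthand `baked_table[text[at]]` as "index by the current state offset plus the input byte", which is a guess about the intended meaning.

```rust
/// Columns per state row: one per byte value. This is a simplifying
/// assumption; the patch describes byte classes instead.
const STRIDE: usize = 256;
/// Sentinel for "no transition" (hypothetical, chosen for this sketch).
const DEAD: usize = usize::MAX;

/// A hypothetical flat ("baked") transition table.
struct BakedDfa {
    /// Row-major table: `table[state_ptr + byte]` is the next `state_ptr`.
    table: Vec<usize>,
    start: usize,
    accept: usize,
}

impl BakedDfa {
    /// Hand-built table for the anchored regex `ab*c`.
    fn ab_star_c() -> BakedDfa {
        let n_states = 3;
        let mut table = vec![DEAD; n_states * STRIDE];
        // State pointers are offsets of the first cell of each row.
        let (s0, s1, s2) = (0, STRIDE, 2 * STRIDE);
        table[s0 + b'a' as usize] = s1; // a
        table[s1 + b'b' as usize] = s1; // b*
        table[s1 + b'c' as usize] = s2; // c
        BakedDfa { table, start: s0, accept: s2 }
    }

    /// Returns true if all of `text` is matched by the DFA.
    fn is_match(&self, text: &[u8]) -> bool {
        let mut state_ptr = self.start;
        let mut at = 0;
        while at < text.len() {
            state_ptr = self.table[state_ptr + text[at] as usize];
            if state_ptr == DEAD {
                return false;
            }
            at += 1;
        }
        state_ptr == self.accept
    }
}

fn main() {
    let dfa = BakedDfa::ab_star_c();
    assert!(dfa.is_match(b"ac"));
    assert!(dfa.is_match(b"abbbc"));
    assert!(!dfa.is_match(b"ab"));
    assert!(!dfa.is_match(b"acx"));
}
```

Storing the row offset directly as the state pointer means each step is one addition and one load, which is presumably the point of baking the per-state tables into a single vector.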
1 parent 87cfe7e commit cc907b8

10 files changed, with 1488 additions and 71 deletions.

Cargo.toml

+5
@@ -114,6 +114,11 @@ name = "backtrack-bytes"
 path = "tests/test_crates_regex.rs"
 name = "crates-regex"

+# Run the test suite on the onepass engine.
+[[test]]
+path = "tests/test_onepass.rs"
+name = "onepass"
+
 [profile.release]
 debug = true

src/analysis.rs

+77 -53
@@ -50,20 +50,15 @@ impl IsOnePassVisitor {
         let mut empty_run = vec![];

         for e in NestedConcat::new(es) {
-            // TODO(ethan):yakshaving factor the determination of when
-            // a regex accepts_empty out into a separate function,
-            // so that we don't compute the whole first set when we
-            // don't need to.
-            let fset = fset_of(e);
             let is_rep = match e.kind() {
                 &HirKind::Repetition(_) => true,
                 _ => false,
             };

             empty_run.push(e);
-            if !(fset.accepts_empty || is_rep) {
-                // this is the last one in the run
-                break;
+            if !(accepts_empty(e) || is_rep) {
+                self.0 = self.0 && !fsets_clash(&empty_run);
+                empty_run.clear();
             }
         }

@@ -76,7 +71,7 @@ impl IsOnePassVisitor {
         self.0 = self.0 && !fsets_clash(&es.iter().collect::<Vec<_>>());
     }

-    // Unicode classes are really big alternatives from the byte
+    // Unicode classes are really just big alternatives from the byte
     // oriented point of view.
     //
     // This function translates a unicode class into the
@@ -99,7 +94,7 @@ impl IsOnePassVisitor {
                     }
                 }
             }
-            _ => {} // FALLTHROUGH
+            _ => {}
         }
     }

@@ -115,16 +110,6 @@ fn fsets_clash(es: &[&Hir]) -> bool {
             let mut fset = fset_of(e1);
             let fset2 = fset_of(e2);

-            // For the regex /a|()+/, we don't have a way to
-            // differentiate the branches, so we are not onepass.
-            //
-            // We might be able to loosen this restriction by
-            // considering the expression after the alternative
-            // if there is one.
-            if fset.is_empty() || fset2.is_empty() {
-                return true;
-            }
-
             fset.intersect(&fset2);
             if ! fset.is_empty() {
                 return true;
@@ -138,14 +123,14 @@ fn fsets_clash(es: &[&Hir]) -> bool {

 /// Compute the first set of a given regular expression.
 ///
-/// The first set of a regular expression is the set of all characters
+/// The first set of a regular expression is the set of all bytes
 /// which might begin it. This is a less general version of the
 /// notion of a regular expression preview (the first set can be
 /// thought of as the 1-preview of a regular expression).
 ///
 /// Note that first sets are byte-oriented because the DFA is
 /// byte oriented. This means an expression like /Δ|δ/ is actually not
-/// one-pass, even though there is clearly no non-determinism inherent
+/// onepass, even though there is clearly no non-determinism inherent
 /// to the regex at a unicode code point level (big delta and little
 /// delta start with the same byte).
 fn fset_of(expr: &Hir) -> FirstSet {
@@ -155,7 +140,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
         f
     }

-    match expr.kind() {
+    // First compute the set of characters that might begin
+    // the expression (ignoring epsilon for now).
+    let mut f_char_set = match expr.kind() {
         &HirKind::Empty => FirstSet::epsilon(),
         &HirKind::Literal(ref lit) => {
             match lit {
@@ -191,29 +178,13 @@ fn fset_of(expr: &Hir) -> FirstSet {
         // that such an emptylook could potentially match on any character.
         &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => FirstSet::anychar(),

-        &HirKind::Repetition(ref rep) => {
-            let mut f = fset_of(&*rep.hir);
-            match rep.kind {
-                RepetitionKind::ZeroOrOne => f.accepts_empty = true,
-                RepetitionKind::ZeroOrMore => f.accepts_empty = true,
-                RepetitionKind::OneOrMore => {},
-                RepetitionKind::Range(ref range) => {
-                    match range {
-                        &RepetitionRange::Exactly(0)
-                        | &RepetitionRange::AtLeast(0)
-                        | &RepetitionRange::Bounded(0, _) =>
-                            f.accepts_empty = true,
-                        _ => {}
-                    }
-                }
-            }
-            f
-        },
+        &HirKind::Repetition(ref rep) => fset_of(&rep.hir),
         &HirKind::Group(ref group) => fset_of(&group.hir),

         // The most involved case. We need to strip leading empty-looks
         // as well as take the union of the first sets of the first n+1
-        // expressions where n is the number of leading repetitions.
+        // expressions where n is the number of leading expressions which
+        // accept the empty string.
         &HirKind::Concat(ref es) => {
             let mut fset = FirstSet::empty();
             for (i, e) in es.iter().enumerate() {
@@ -229,13 +200,9 @@ fn fset_of(expr: &Hir) -> FirstSet {
                 let inner_fset = fset_of(e);
                 fset.union(&inner_fset);

-                if !inner_fset.accepts_empty() {
+                if !accepts_empty(e) {
                     // We can stop accumulating after we stop seeing
                     // first sets which contain epsilon.
-                    // Also, a contatination which terminated by
-                    // one or more expressions which do not accept
-                    // epsilon itself does not acceept epsilon.
-                    fset.accepts_empty = false;
                     break;
                 }
             }
@@ -250,13 +217,68 @@ fn fset_of(expr: &Hir) -> FirstSet {
             }
             fset
         }
+    };
+
+    f_char_set.accepts_empty = accepts_empty(expr);
+    f_char_set
+}
+
+fn accepts_empty(expr: &Hir) -> bool {
+    match expr.kind() {
+        &HirKind::Empty => true,
+        &HirKind::Literal(_) => false,
+        &HirKind::Class(_) => false,
+
+        // A naked empty look is a pretty weird thing because we
+        // normally strip them from the beginning of concatinations.
+        // We are just going to treat them like `.`
+        &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => false,
+
+        &HirKind::Repetition(ref rep) => {
+            match rep.kind {
+                RepetitionKind::ZeroOrOne => true,
+                RepetitionKind::ZeroOrMore => true,
+                RepetitionKind::OneOrMore => accepts_empty(&rep.hir),
+                RepetitionKind::Range(ref range) => {
+                    match range {
+                        &RepetitionRange::Exactly(0)
+                        | &RepetitionRange::AtLeast(0)
+                        | &RepetitionRange::Bounded(0, _) => true,
+                        _ => accepts_empty(&rep.hir),
+                    }
+                }
+            }
+        }
+
+        &HirKind::Group(ref group) => accepts_empty(&group.hir),
+
+        &HirKind::Concat(ref es) => {
+            let mut accepts: bool = true;
+            for e in es.iter() {
+                match e.kind() {
+                    &HirKind::Anchor(_) | &HirKind::WordBoundary(_) => {
+                        // Ignore any leading emptylooks.
+                    }
+                    _ => {
+                        accepts = accepts && accepts_empty(&e);
+                    }
+                }
+
+                if !accepts {
+                    break;
+                }
+            }
+            accepts
+        }
+
+        &HirKind::Alternation(ref es) => es.iter().any(accepts_empty)
     }
 }

 /// The first byte of a unicode code point.
 ///
-/// We only ever care about the first byte of a particular character,
-/// because the onepass DFA is implemented in the byte space, not the
+/// We only ever care about the first byte of a particular character
+/// because the onepass DFA is implemented in the byte space not the
 /// character space. This means, for example, that a branch between
 /// lowercase delta and uppercase delta is actually non-deterministic.
 fn first_byte(c: char) -> u8 {
@@ -323,10 +345,6 @@ impl FirstSet {
     fn is_empty(&self) -> bool {
         self.bytes.is_empty() && !self.accepts_empty
     }
-
-    fn accepts_empty(&self) -> bool {
-        self.accepts_empty
-    }
 }

 /// An iterator over a concatenation of expressions which
@@ -544,4 +562,10 @@ mod tests {
         assert!(!is_onepass(&e1));
         assert!(!is_onepass(&e2));
     }
+
+    #[test]
+    fn is_onepass_clash_in_middle_of_concat() {
+        let e = Parser::new().parse(r"ab?b").unwrap();
+        assert!(!is_onepass(&e));
+    }
 }
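
To make the `src/analysis.rs` change easier to follow, here is a minimal standalone sketch of the same idea on a toy AST. The `Re` enum, `first_bytes`, and `concat_is_onepass` are hypothetical stand-ins invented for this example; they are not the `Hir`, `FirstSet`, or `IsOnePassVisitor` types from the patch, and the sketch ignores empty-looks and the epsilon flag. It shows why `ab?b` is rejected (the first sets of `b?` and `b` overlap) and why `Δ|δ` clashes at the byte level. Checking a trailing run after the loop is a choice made in this sketch, not something visible in the diff.

```rust
use std::collections::BTreeSet;

/// A toy regex AST (hypothetical; stands in for the real HIR).
#[allow(dead_code)]
enum Re {
    Empty,
    Lit(char),
    Opt(Box<Re>),  // e?
    Star(Box<Re>), // e*
    Plus(Box<Re>), // e+
    Concat(Vec<Re>),
    Alt(Vec<Re>),
}

/// Can `e` match the empty string?
fn accepts_empty(e: &Re) -> bool {
    match e {
        Re::Empty | Re::Opt(_) | Re::Star(_) => true,
        Re::Lit(_) => false,
        Re::Plus(inner) => accepts_empty(inner),
        Re::Concat(es) => es.iter().all(accepts_empty),
        Re::Alt(es) => es.iter().any(accepts_empty),
    }
}

/// First byte of the UTF-8 encoding of `c`.
fn first_byte(c: char) -> u8 {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes()[0]
}

/// The set of bytes that can begin a non-empty match of `e`.
fn first_bytes(e: &Re) -> BTreeSet<u8> {
    match e {
        Re::Empty => BTreeSet::new(),
        Re::Lit(c) => std::iter::once(first_byte(*c)).collect(),
        Re::Opt(inner) | Re::Star(inner) | Re::Plus(inner) => first_bytes(inner),
        Re::Alt(es) => es.iter().flat_map(first_bytes).collect(),
        Re::Concat(es) => {
            let mut set = BTreeSet::new();
            for e in es {
                set.extend(first_bytes(e));
                // Stop once an element must consume input.
                if !accepts_empty(e) {
                    break;
                }
            }
            set
        }
    }
}

/// Do the first sets of any two expressions in `run` overlap?
fn fsets_clash(run: &[&Re]) -> bool {
    for (i, e1) in run.iter().enumerate() {
        for e2 in &run[i + 1..] {
            if !first_bytes(e1).is_disjoint(&first_bytes(e2)) {
                return true;
            }
        }
    }
    false
}

/// Simplified onepass check for the elements of one concatenation.
fn concat_is_onepass(es: &[Re]) -> bool {
    let mut run: Vec<&Re> = Vec::new();
    for e in es {
        let is_rep = matches!(e, Re::Opt(_) | Re::Star(_) | Re::Plus(_));
        run.push(e);
        if !(accepts_empty(e) || is_rep) {
            // `e` ends the run: check it, then start a new one.
            if fsets_clash(&run) {
                return false;
            }
            run.clear();
        }
    }
    // Assumption of this sketch: a trailing run is checked too.
    !fsets_clash(&run)
}

fn main() {
    let b_opt = || Re::Opt(Box::new(Re::Lit('b')));
    // ab?b: `b?` and `b` both start with b'b'.
    assert!(!concat_is_onepass(&[Re::Lit('a'), b_opt(), Re::Lit('b')]));
    // ab?c: no overlap, so the run is deterministic.
    assert!(concat_is_onepass(&[Re::Lit('a'), b_opt(), Re::Lit('c')]));
    // Δ (U+0394) and δ (U+03B4) share the lead byte 0xCE.
    assert!(fsets_clash(&[&Re::Lit('Δ'), &Re::Lit('δ')]));
}
```

The early exit in the old loop (`break` at the first non-empty element) only examined the first run, which is how a clash in the middle of a concatenation such as `ab?b` could go unnoticed; clearing and restarting the run, as the patch does, checks every run.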

src/backtrack.rs

+1 -1
@@ -245,7 +245,7 @@ impl<'a, 'm, 'r, 's, I: Input> Bounded<'a, 'm, 'r, 's, I> {
                     ip = inst.goto1;
                 }
                 EmptyLook(ref inst) => {
-                    if self.input.is_empty_match(at, inst) {
+                    if self.input.is_empty_match(at, inst.look) {
                         ip = inst.goto;
                     } else {
                         return false;
