Closed Bug 503107 Opened 15 years ago Closed 13 years ago

regexp native compiler can't handle peacekeeper string-filter regex

Categories

(Core :: JavaScript Engine, defect)
Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal
Tracking


RESOLVED DUPLICATE of bug 691797

People

(Reporter: sayrer, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

I've been looking at the Peacekeeper string tests. We've got a few really bad cases; this is one.
Blocks: peacekeeper
Attached file peacekeeper string-filter excerpted (deleted) —
var results = new Array();
var needle = /.*star.*/i;
for (var k = 0; k < this.movies.length; k++) {
  if (needle.test(this.movies[k])) {
    results.push(this.movies[k]);
  }
}
Version: unspecified → Trunk
I'm about to post a patch on bug 406271 that compiles a lot more cases, namely quantifiers and all the built-in classes like '.'. Unfortunately, without changing the architecture to have a stack, I can only compile alternatives/quantifiers that don't need backtracking because the available paths have disjoint sets of first characters. Thus, the patch won't compile /.*star.*/i, because the automaton can't decide whether to keep eating '.' or move on to the 's' in 'star'. If this was an important case, though, I could add just enough backtracking (through a stack) to do this, but not capturing parentheses.
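For concreteness, here is a small illustration of that restriction in plain JavaScript; the specific regexps are editorial examples, not ones taken from the patch:

// Compilable under the restriction: after a*, the next input character
// decides unambiguously whether to keep consuming the quantifier or move
// on, because 'a' and 'b' match disjoint sets of characters.
/a*b/.test("aaab");

// Not compilable under the restriction: '.' also matches 's', so on seeing
// an 's' the automaton cannot tell whether to keep eating '.' or start
// matching the literal "star" -- that decision needs backtracking.
/.*star.*/i.test("Star Wars");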
(In reply to comment #2)
> If this was an important case, though, I could add just enough backtracking
> (through a stack) to do this, but not capturing parentheses.

We need to be competitive on regular expressions that use backtracking. I'm not sure if it's only this restricted case we're losing on, or all backtracking expressions.

Mac comparison:

Operations per second
Safari 4: 2453.4
Chrome 3: 1829.9
Minefield: 140.3
Depends on: 406271
(In reply to comment #3)
> (In reply to comment #2)
> > If this was an important case, though, I could add just enough backtracking
> > (through a stack) to do this, but not capturing parentheses.
>
> We need to be competitive on regular expressions that use backtracking. I'm not
> sure if it's only this restricted case we're losing on, or all backtracking
> expressions.
>
> Mac comparison:
>
> Operations per second
> Safari 4: 2453.4
> Chrome 3: 1829.9
> Minefield: 140.3

Rumor has it that V8 has a pretty complete regexp->native compiler, and I think Nitro does as well now, so they probably do most backtracking regexps better than we do. In order to handle things like /.*star.*/, there are two ways we could go:

- Do WREC-style support for fixed-length quantifiers without capture. The idea is that for [e]*, where [e] is any regexp that must match one character and * is any quantifier, you don't need a stack for backtracking. A single saved character position (one word) is enough. So this can be done fairly easily and without changing the existing regexp compiler much. (A sketch of this idea follows after this comment.)

- Start on more complete support for backtracking. This would require significant reworking of the existing compiler. The reason is that the existing regexp compiler (similar to WREC's design) generates jumps to a single "exit" code path for any mismatch. That doesn't work for general backtracking; instead you need to jump to a compensation block that does backtracking from the point you are at. To do this, we would want to study V8, Nitro, and the design of the regexp compiler in Parrot, then create a design that works well with nanojit, then implement it.
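A minimal sketch of the single-saved-position idea, written as plain JavaScript rather than generated native code; matchesE and matchTail are hypothetical helpers standing in for the compiled subexpressions:

// Match /e*tail/ where e consumes exactly one character. Only one saved
// position is needed: greedily consume e, then back off one character at
// a time until the tail matches or we are back at the start.
function matchStarThenTail(str, pos, matchesE, matchTail) {
  var start = pos;                    // the single word of saved state
  while (pos < str.length && matchesE(str[pos])) {
    pos++;                            // greedy phase: eat as many e's as possible
  }
  for (; pos >= start; pos--) {       // back off one character per failed attempt
    var end = matchTail(str, pos);    // returns end index on success, -1 on failure
    if (end !== -1) {
      return end;                     // the tail matched starting at pos
    }
  }
  return -1;                          // no split point worked: overall failure
}

For /.*star.*/ the "tail" would be /star.*/, so one saved position per quantifier appears to be enough for this benchmark case.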
> - Do WREC-style support for fixed-length quantifiers without capture. The idea
> is that for [e]*, where [e] is any regexp that must match one character and *
> is any quantifier, you don't need a stack for backtracking. A single saved
> character position (one word) is enough. So this can be done fairly easily and
> without changing the existing regexp compiler much.

Even with that limitation, though, consider:

r = /a*aa*aa*aa*a/
r.test("aaaa")

This seems to require nested levels of backtracking. It might be a special case, though, because the example doesn't require nested backtracking if some of the a's are b's.

> - Start on more complete support for backtracking. This would require
> significant reworking of the existing compiler. The reason is that the existing
> regexp compiler (similar to WREC's design) generates jumps to a single "exit"
> code path for any mismatch.

In the patch in bug 406271, I do arbitrary jumps in the trace with loads/stores to simulate phi nodes. I think the major change (assuming no capturing parentheses) would just be the addition and maintenance of a stack.
Assuming no capturing parentheses (are they used in any benchmarks we care about?), another alternative to backtracking is the NFA -> DFA conversion one learns in undergrad automata. It has the potential to explode the number of states (I think 2^n, where n is the original number of states), but a fixed upper bound could cap this. As dmandelin pointed out on his blog a while back, this approach can be significantly faster (http://swtch.com/~rsc/regexp/regexp1.html).
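A minimal JavaScript sketch of the set-of-states idea behind that conversion; the NFA representation (startStates, transitions, accepting) is invented for illustration, and epsilon transitions are ignored:

// Each DFA state corresponds to a set of NFA states; simulating the NFA
// with such a set is the same idea, computed lazily rather than up front.
// Memoizing the sets as explicit DFA states is what risks the 2^n blowup.
function nfaTest(nfa, input) {
  var current = new Set(nfa.startStates);     // one DFA "state" = a set of NFA states
  for (var i = 0; i < input.length; i++) {
    var next = new Set();
    current.forEach(function (state) {
      var targets = (nfa.transitions[state] || {})[input[i]] || [];
      targets.forEach(function (t) { next.add(t); });
    });
    current = next;                            // advance every alternative in lockstep
    if (current.size === 0) return false;      // no live states: cannot match
  }
  var accepted = false;
  current.forEach(function (state) {
    if (nfa.accepting.has(state)) accepted = true;
  });
  return accepted;
}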
(In reply to comment #5)
> > - Do WREC-style support for fixed-length quantifiers without capture. The idea
> > is that for [e]*, where [e] is any regexp that must match one character and *
> > is any quantifier, you don't need a stack for backtracking. A single saved
> > character position (one word) is enough. So this can be done fairly easily and
> > without changing the existing regexp compiler much.
>
> Even with that limitation, though, consider:
>
> r = /a*aa*aa*aa*a/
> r.test("aaaa")
>
> This seems to require nested levels of backtracking. It might be a special
> case, though, because the example doesn't require nested backtracking if some
> of the a's are b's.

I was vague about not needing a stack. Your example does need multiple pieces of backtracking info, but a bounded number of pieces: you just need a position for each active quantifier subexpression. Compare to:

/(?:aa|aba)*a/

where you will need an unbounded amount of storage just to handle the one quantifier expression.

> In the patch in bug 406271, I do arbitrary jumps in the trace with loads/stores
> to simulate phi nodes. I think the major change (assuming no capturing
> parentheses) would just be the addition and maintenance of a stack.

I'm pretty sure the internal API of the regexp compiler will need changing. Currently, the compilation strategy is this:

- Break up the regexp as a sequence of concatenated subexpressions and compile them from left to right.

- Call a function to compile each subexpression. The input to the function is the LIR instruction for the matching |pos| variable (position in the text string). The output is a LIR instruction for the |pos| variable after this expression has matched, plus a list of LIR instructions for match failures.

Thus, calling a compilation function compiles one subexpression, returning an instruction for the success case to be fed into the next subexpression and a list of failure exits to be patched at the end. The key assumption embodied in this strategy is that if a given subexpression fails:

- there is a single next place in the code to go to, and
- there is no state to be passed to that place in the code.

Now consider this:

/(abc|ef)gh/

If we are in the process of matching and we fail during subexpr /gh/, then we should either (1) backtrack and retry subexpr /ef/, or (2) fail entirely, depending on where we came from. Thus, my initial strategy cannot be used. Instead, we need a new strategy (sketched in code after this comment) that will:

- remember the entry points to the /abc/ and /ef/ submatchers as they are generated
- at the end of the /abc/ or /ef/ submatchers, save a value that indicates which submatcher just succeeded
- retain this information as /gh/ is compiled
- patch the /gh/ failure point to a backtracking block
- generate the backtracking block for backtracking back to /(abc|ef)/:
  - first, read the saved value
  - if /abc/ was run last, jump to the start of /ef/
  - otherwise, we are at a failure point; jump to the backtracking block for the last choice subexpr before /(abc|ef)/, or to match failure if there is none.

DFAs are tough because the compilation time is relatively long. Also, the semantics are different for many regexps, although cgjones found the set of regexps that have the same semantics either way.
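Here is a hand-written JavaScript sketch of the control flow described above for /(abc|ef)gh/, assuming an anchored match at pos and no captures; lastAlt plays the role of the saved value, and the function is illustration only, not the LIR the compiler would emit:

function matchAltThenGh(str, pos) {
  var lastAlt = 0;                                 // records which alternative ran last
  var altEnd = -1;

  // submatchers for /(abc|ef)/, each saving which one succeeded
  if (str.startsWith("abc", pos))     { lastAlt = 1; altEnd = pos + 3; }
  else if (str.startsWith("ef", pos)) { lastAlt = 2; altEnd = pos + 2; }
  else return -1;                                  // neither alternative matched

  // submatcher for /gh/
  if (str.startsWith("gh", altEnd)) return altEnd + 2;

  // backtracking block: read the saved value to decide where to resume
  if (lastAlt === 1 && str.startsWith("ef", pos)) {
    altEnd = pos + 2;
    if (str.startsWith("gh", altEnd)) return altEnd + 2;
  }
  return -1;                                       // no earlier choice point: overall failure
}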
> Now consider this:
>
> /(abc|ef)gh/
>
> If we are in the process of matching and we fail during subexpr /gh/, then we
> should either (1) backtrack and retry subexpr /ef/, or (2) fail entirely,
> depending on where we came from. Thus, my initial strategy cannot be used.
> Instead, we need a new strategy that will:
> ...

When "abc" and "ef" accept disjoint sets of strings (as they do in this example), it's a good deal simpler: failure in abc jumps to ef, and failure in gh is a real failure. The same idea works for quantifiers. So it's a middle ground between what is in now and general backtracking.
(In reply to comment #8)
> > Now consider this:
> >
> > /(abc|ef)gh/
> >
> > If we are in the process of matching and we fail during subexpr /gh/, then we
> > should either (1) backtrack and retry subexpr /ef/, or (2) fail entirely,
> > depending on where we came from. Thus, my initial strategy cannot be used.
> > Instead, we need a new strategy that will:
> > ...
>
> When "abc" and "ef" accept disjoint sets of strings (as they do in this example),
> it's a good deal simpler: failure in abc jumps to ef, and failure in gh is a real
> failure. The same idea works for quantifiers. So it's a middle ground between
> what is in now and general backtracking.

Yes, but a general solution must work for any subexpressions. I consider such shortcutting to be an optional optimization.
Hey guys. I was reading this with interest. Just commenting so I'll get e-mails on it, and also note that http://osteele.com/tools/reanimator/ is a nifty visualisation of the stuff discussed in comment #6
Any news here?
Is this still relevant with Yarr?
(In reply to Ryan VanderMeulen from comment #12)
> Is this still relevant with Yarr?

Yes. On the most recent run of http://peacekeeper.futuremark.com/ I get 2670.5 in Firefox nightly and 32311 in Chrome dev. That's a ratio of ~12.
3205 / 22102 here, for a ratio of ~7, but yeah :) Also, I can't remember: aren't those scores computed with a geometric mean or something? I'm not sure it's a pure "number of runs" figure even for the individual string-filter test. Anyway, regardless: a lot slower.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
I guess I should be clear for comment 13 et al: the reason Chrome is 12x faster is that they have a special-case hack that transforms /.*foo/ to /foo/ (so, in this case, /.*star.*/ to /star.*/) which is enormously faster to execute.
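A sketch of that kind of rewrite in JavaScript; the function name is made up, and the rewrite only preserves the boolean test() result (match offsets and capture semantics are ignored):

// For an unanchored "does it match anywhere?" query, a leading greedy .*
// never changes whether the pattern matches, so it can be stripped before
// compiling the regexp.
function stripLeadingDotStar(source) {
  return source.replace(/^\.\*/, "");
}

var needle = new RegExp(stripLeadingDotStar(".*star.*"), "i");  // effectively /star.*/i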