Closed Bug 576822 Opened 14 years ago Closed 13 years ago

Yarr gives different results for /(?:a*?){2,}/

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 627609
Tracking Status
blocking2.0 --- -
status2.0 --- wanted

People

(Reporter: jruderman, Assigned: cdleary)

References

Details

(Keywords: testcase)

Attachments

(1 file)

/(?:a*?){2,}/.exec("a") YARR: [""] OLD: ["a"] I don't know which is correct. By the way, YARR hangs if I use "+" instead of "{2,}".
blocking2.0: --- → ?
Given my current understanding of our regular expression semantics, I'm thinking Yarr is correct. Lazy star matches zero "a"s in each of the first two iterations, and, having satisfied the minimum number of iterations in the greedy numerical quantifier and hitting the empty string once again, the result is the empty string. CC'ing brendan/mrbkap for a double-check/explanation, since v8 also disagrees with me.
Assignee: general → cdleary
Status: NEW → ASSIGNED
Okay, gave the spec a harder read, and I now think comment 1 is incorrect, but I'd like someone to double check me. The key observation is in the definition of RepeatMatcher step 2/step 1, "If min is zero and y's endIndex is equal to x's endIndex, then return failure." Since that's the continuation that the outer quantifier passes in to the inner (lazy) quantifier, we end up executing step 8 substep c, which matches the letter a.
I'm finding it difficult to follow/understand the way the Term definition is written. However, if your (@Chris Leary) second reading is correct and therefore YARR is wrong, then IMHO this is an oddity in the spec that should be corrected in future versions. I believe the correct result should be [""]. Prose from ES5 15.10.2.5 "Term": "NOTE An Atom followed by a Quantifier is repeated the number of times specified by the Quantifier. A Quantifier can be non-greedy, in which case the Atom pattern is repeated as few times as possible while still matching the sequel, or it can be greedy, in which case the Atom pattern is repeated as many times as possible while still matching the sequel. [...] NOTE 4 Step 1 of the RepeatMatcher's d closure states that, once the minimum number of repetitions has been satisfied, any more expansions of Atom that match the empty String are not considered for further repetitions. This prevents the regular expression engine from falling into an infinite loop" According to the expository notes quoted above (note the absence of crazy rule exceptions), the result should be [""]. At least IE 6-8 and Safari 5 already return [""], as would Java and .NET (just for starters). PCRE, too, would return [""], but PCRE is weird in that it can match the empty string and "a" at the same position. I.e., according to PCRE handling, "a".match(/(?:a*?){2,}/g) would return ["", "a", ""] (which is not something I'd argue for in ES).
I wrote a Python script with a bunch of closures and continuations according to the spec, and I got the same result as Chris gives in comment 2. I don't quite understand your reasoning in comment 2, but I am also pretty sure the 'return failure if y.endIndex == x.endIndex' is also the key. Following my debugging printouts and unwrapping the continuation-ness for the benefit of any human readers, it seems that it goes something like this: 1. First, match the expression (?:a*?) twice, which is required to get a match. This just matches the empty string both times. 2. Now, continue the outer match. At this point we are essentially matching (?:a*?)* against the rest of the input text (which is still the entire input text 'a'). 2.1. At this point, we are working on the non-greedy match a*?. The first thing we do with a non-greedy match is to try to match the rest of the regular expression, which in this case means trying more repetitions of the inside of the outer quantifier. But the |min==0 rule and matching endIndexes| means we fail, to avoid the infinite loop. 2.2. Now, in the non-greedy match, we didn't match the sequel, so we should go ahead and try to match one step of the inner expression, /a/. That matches. 2.3. At this point we'll do a bunch of other junk to continue the match, but the only thing that will work is matching empty at the point where we are, so we end up matching there and getting 'a' as the result. Now, can this be made at all comprehensible? Possibly: The above suggests that this test case can be reduced to just /(?:a*?)*/ with the same result, which appears to be so. The spec for non-greedy matches basically says to (1) try to match what comes after the non-greedy expression, and if that fails, then (2a) do match one copy of the inner expression and (2b) continue. So /(?:a*?)*/ translates to something like: (1) (2a) (2b) /(?:a*?)* | a (?:a*?)*/ Part (1) fails by the don't-iloop special rule, so we go on to (2a), match that, and then eventually match empty for (2b) after trying a few things. I'm don't know how to fix the spec if it is in fact wrong, but it appears that "try to match what comes after the non-greedy expression" needs special handling if "what comes after" is more repetitions of the non-greedy expression. If we consider that to match empty instead of failing, then I think we get the "expected" result.
David's description of what's happening is spot-on. The correct result here is "a". I ran the machine-executable ES3 regular expression semantics (which I used when writing the ECMAScript spec) and checked that they produce the same result "a". Those are available on Mozilla at: http://mxr.mozilla.org/mozilla/source/js2/semantics/ Here are a few related test cases. The answers are given in the form (x y z), where x is the offset where the match occurs, y is the matched text, and z are the results of the capturing parentheses (none in this case). ? (run-regexp "(?:a*?){2,}" "a") (0 "a" NIL) ? (run-regexp "(?:a*?){2,}" "") :FAILURE ? (run-regexp "(?:a*?)" "a") (0 "" NIL) ? (run-regexp "(?:a*?)(?:a*?)(?:a*?)" "a") (0 "" NIL) ? (run-regexp "(?:a*?){2}" "a") (0 "" NIL) ? (run-regexp "(?:a*?){2,3}" "a") (0 "a" NIL) ? (run-regexp "(?:a*?)*" "a") (0 "a" NIL)
Also, if you try to "fix" this result by removing the anti-infinite-loop rule in regular expression semantics, then this example will indeed cause an infinite loop.
blocking2.0: ? → betaN+
Hi Waldemar, Many thanks for the test cases, these are really great & have caught a couple of bugs in yarr, I'll incorporate these into our test suite. FYI I just wanted to mention that I think one of the results here is incorrect. > ? (run-regexp "(?:a*?){2,}" "") > :FAILURE Following the spec I don't think this should be a failed match, but rather should match the empty string. Here's how this works out: (A) The RepeatMatcher applied to the subpattern will attempt to match with a minimum of 2, and per step 7 of the RepeatMatcher description will call its internal matcher passing a continuation newly created continuation 'd', (with minimum set to 2). (B) The RepeatMatcher applied to the atom 'a' will attempt to match, and per step 8a of the RepeatMatcher description will call the continuation provided in (A). (C) Continuation 'd' with minimum 2 will call to the RepeatMatcher applied to the subpattern with a minimum of 1. (D) The RepeatMatcher applied to the subpattern will attempt to match with a minimum of 1, and per step 7 of the RepeatMatcher description will call its internal matcher passing a continuation newly created continuation 'd', (with minimum set to 1). (E) The RepeatMatcher applied to the atom 'a' will attempt to match, and per step 8a of the RepeatMatcher description will call the continuation provided in (D). (F) Continuation 'd' with minimum 1 will call to the RepeatMatcher applied to the subpattern with a minimum of 0. (G) The RepeatMatcher applied to the subpattern will attempt to match with a minimum of 0, and per step 9 of the RepeatMatcher description will call its internal matcher passing a continuation newly created continuation 'd', (with minimum set to 0). (H) The RepeatMatcher applied to the atom 'a' will attempt to match, and per step 8a of the RepeatMatcher description will call the continuation provided in (G). (I) Continuation 'd' with minimum 0 will return match failed, since the match is an empty string (per step 1 of the definition for internal continuation 'd'). (J) Having returned to the RepeatMatcher applied to the atom 'a', per step 8c of the RepeatMatcher description the matcher will be called and the result returned - this will be a failed match for an empty input string. (K) Having returned to the The RepeatMatcher applied to the subpattern, per step 11 of the RepeatMatcher description the continuation originally passed in to it will be called. This is the pattern completion continuation described in section 15.10.2.2, and always accepts the current match. Hope I'm not missing something here! cheers, Gavin.
FYI there's a patch with some fixes to these issues on the following WebKit bug, in case it helps: https://bugs.webkit.org/show_bug.cgi?id=48101 Contains a fix to RegexJIT::generateParenthesesSingle to check for empty matches,also contains fixes to RegexJIT::generateParenthesesGreedyNoBacktrack (I forget if this came in before or after you forked), and also fixed to the YARR interpreter (probably not interesting to you, believe you didn't take this). cheers, G.
blocking2.0: betaN+ → -
status2.0: --- → wanted
Depends on: 606882
(In reply to comment #8) > FYI there's a patch with some fixes to these issues on the following WebKit > bug, in case it helps: https://bugs.webkit.org/show_bug.cgi?id=48101 > > Contains a fix to RegexJIT::generateParenthesesSingle to check for empty > matches,also contains fixes to RegexJIT::generateParenthesesGreedyNoBacktrack > (I forget if this came in before or after you forked), and also fixed to the > YARR interpreter (probably not interesting to you, believe you didn't take > this). > > cheers, > G. Thanks for getting back to us on that! (Not to mention fixing the bug.) It not only fixes this bug but some other ones (bug 606882) that we need for Fx4, although I didn't know that until I dug into that one somewhat. But if we had taken the fix as soon as you posted here we would've saved ourselves some work.
Depends on: 599854
No longer depends on: 606882
Depends on: 627609
No longer depends on: 599854
Fixed?
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
No longer depends on: 627609
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: