Closed Bug 526173 Opened 15 years ago Closed 15 years ago

3.5x regression in indexOf performance

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta5-fixed

People

(Reporter: bzbarsky, Assigned: luke)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 5 obsolete files)

See the attached testcase. Apart from the fact that the indexOf tests are a lot slower than the lastIndexOf ones for some reason (and have been going back to Fx 3.5 at least), there's been a recent 3.5x regression on the indexOf tests. The m-c checkin range for that regression (using nightlies) is http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=def6be3d8b6b&tochange=149c3820e8a8 and there's a nice 100-changeset tracemonkey merge in there. indexOf got rewritten in bug 511777, which I'm fingering as the most likely culprit here. Shark just shows all the time spent in str_indexOf (yay inlining) and doesn't give me reasonable line listings for some reason (boo!).
Flags: blocking1.9.2?
Attached file Testcase (obsolete) (deleted) —
Attached file Same testcase without the shark gunk (obsolete) (deleted) —
And I bet the relevant change was this: - if (textlen - i >= 512 && (jsuint)(patlen - 2) <= BMH_PATLEN_MAX - 2) { + if (textlen >= 512 && patlen <= sBMHPatLenMax) { In particular this changed from doing the simple walk along the string for patlen == 1 to using BMH. This is an obvious loss, since BMH does more work and can't possibly win on patlen == 1. If I change the condition to also test patlen > 1, things get much faster, though still about 10-20% slower than Fx 3.5.
Attachment #409878 - Attachment is obsolete: true
Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot error. With that fixed, indexOf is only about 50% slower than lastIndexOf, not many times slower.... Still might be worth a separate bug on that. It looks like our indexOf has about the performance of Safari's lastIndexOf, while their indexOf is about 3x faster.
Attachment #409879 - Attachment is obsolete: true
blocking2.0: --- → ?
(In reply to comment #3) First of all, many thanks for finding the weakened BMH condition! I don't know what I was thinking when I left that out. On that line of though, it is also worth some quick measurements to see if there is a higher k where we don't do BMH for patlen < k. > Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot > error. With that fixed, indexOf is only about 50% slower than lastIndexOf, not > many times slower.... Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko fixed) and here's what I see: (in ms) testIndexOfx: 63 testLastIndexOfx: 50 testIndexOfDash: 874 testLastIndexOfDash: 1587 Now, if I read the construction of testStrGiant, the '-' is equidistant from the beginning and end of the string, hence |testStrGiant.indexOf("-")| should be the same speed as |testStrGiant.lastIndexOf("-")|. However, indexOf is faster; I suspect because of the changes in bug 511777. Finding 'x' from the beginning and end, on the other hand, is not a symmetric task because 'x' is 24 characters from the beginning and 9 characters from the end. That indexOf is only 26% slower is, I believe, also a positive result of bug 511777. > It looks like our indexOf has > about the performance of Safari's lastIndexOf, while their indexOf is about 3x > faster. Wow. Hmm, and we are tracing those loops too. Are you quite sure of these results?
Someone confirm the head-to-head with Safari and dig into the generated machine code. It's time to get down and dirty. /be
> worth some quick measurements to see if there is a higher k Yep. ;) > Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko > fixed) Interesting. I was running in tracemonkey tip opt browser. I just tried this again (with the patlen > 1 added, and in shell I see: testIndexOfDash: 874 testLastIndexOfDash: 1587 In browser I see: testIndexOfDash: 2594 testLastIndexOfDash: 1720 In Safari (browser) I see: testIndexOfDash: 906 testLastIndexOfDash: 2219 This is very consistent. Without the BMH fix, browser is at about 6000 for testIndexOfDash, while shell is at 8600 or so. Sounds like we have a separate bug here... Can you try browser on your end to confirm? > the '-' is equidistant from the beginning and end of the string +-1, yes. That was the goal, at least. > Finding 'x' from the beginning and end, on the other hand, is not a symmetric > task Right. I was mostly looking at the '-' case. > Are you quite sure of these results? See above.
Hmm. I just did more testing, and now shell is consistently coming up with 1500ms or so for testIndexOfDash (???)
Flags: blocking1.9.2? → blocking1.9.2+
(In reply to comment #4) > (In reply to comment #3) > On that line of though, it is also > worth some quick measurements to see if there is a higher k where we don't do > BMH for patlen < k. I ran a matrix of pattern lengths and BMH threshold values. Each row runs with a different value of k in the abovementioned |(patlen - k) < sBMHPatLenMax|, starting at 0. Each column uses a different length pattern, starting at 1. You can see a diagonal starting at k = 2, patlen = 1 and fading away around patlen = 11. Hopefully this remains formatted: 267 134 91 68 56 47 42 35 34 30 28 26 24 24 23 20 22 268 134 91 69 55 47 41 36 33 30 28 26 23 24 24 20 22 29 134 90 69 56 47 41 36 33 30 28 26 24 24 23 20 21 30 28 91 69 55 48 41 35 33 31 28 26 23 24 23 20 21 29 29 28 69 56 47 41 35 33 31 28 26 24 24 23 19 22 29 29 28 29 56 47 41 35 34 30 28 26 23 24 24 19 22 29 29 28 29 28 47 41 36 33 30 28 26 23 24 23 20 22 29 29 29 28 29 29 41 35 33 31 28 26 23 24 23 20 21 28 29 29 28 29 29 28 36 33 30 28 26 24 24 23 19 22 29 29 29 28 30 28 29 29 33 30 29 25 24 24 23 19 22 29 28 30 28 29 29 28 29 28 31 28 26 23 24 23 19 22 29 29 29 29 29 29 28 29 29 28 28 26 23 24 23 20 22 29 28 29 28 29 29 28 29 29 29 28 26 23 24 24 19 22 29 29 29 28 29 28 29 29 28 29 29 29 23 24 23 20 21 30 29 28 29 29 28 29 29 30 28 29 29 29 24 23 20 21 29 28 29 29 28 29 29 29 28 29 29 29 29 29 23 19 22 30 28 29 29 29 28 29 29 29 29 29 29 29 29 29 19 22 28 29 28 29 28 29 29 28 29 29 28 29 29 29 28 29 22 Clearly the numbers are machine specific (Ubuntu w/ Core 2 Duo @ 2.67Ghz, btw), but it does clearly suggest a k higher than 2. Now back to the "why is the browser slower" hunt.
Confound it! One column too many.
(In reply to comment #6) > Sounds like we have a separate bug here... Can you try browser on your end to > confirm? I get pretty much the same relative performance as the js shell in a browser build: testIndexOfDash: 871 testLastIndexOfDash: 1582
Interesting. I wonder what the difference is... I'm just using --enable-optimize --disable-debug; pretty basic Firefox opt build.
Attached patch fix the BMH bug (obsolete) (deleted) — Splinter Review
Well, fix the glaring problem for now.
Assignee: general → lw
Status: NEW → ASSIGNED
Attachment #410036 - Flags: review?(jwalden+bmo)
Explicit cast to jsuint might be nice there. And maybe a comment.
Attached patch with comments and cast (obsolete) (deleted) — Splinter Review
Attachment #410036 - Attachment is obsolete: true
Attachment #410041 - Flags: review?(jwalden+bmo)
Attachment #410036 - Flags: review?(jwalden+bmo)
Comment on attachment 410041 [details] [diff] [review] with comments and cast We try not to overparenthesize subtraction against <= and so on (gal's Pascal or Oberon mistrained fingers excluded :-P). /be
(In reply to comment #8) Split off to bug 526348.
I ran a try-server-built version of your benchmark and indeed there is a horrendous regression from testIndexOfDash being 2x faster to < 2x slower! I will shark this.
So looking at the disassembly with Shark data, it seems Mac gcc doesn't think that 't' and 'p0', the hottest variables of the unrolled loop in StringMatch, deserve registers. Slowness ensues.
Hmm. I just tried making those |register| with no effect, but it could well just be ignoring me....
Yeehaa! register int test_integer asm ("EBX"); Puts performance back to Linux-level for both shell and browser!
We used to have to do this kind of shit in the '80s (with the Portable C Compiler). To quote from the Wicked Witch of the West: "What a world!" /be
Attached patch fix both problems (obsolete) (deleted) — Splinter Review
I tried to make the register/asm hack apply as narrowly as possible (__APPLE__ && __GNUC__ && __i386__); anyone with portability experience, please let me know if I could do better. With this, OS X 32-bit browser and shell have much better indexOf performance than 3.5 and are on par with Ubuntu.
Attachment #410041 - Attachment is obsolete: true
Attachment #410377 - Flags: review?(jwalden+bmo)
Attachment #410041 - Flags: review?(jwalden+bmo)
Attached patch more faster (deleted) — Splinter Review
So WebKit has a hard-coded loop for length-one patterns. This should be slower than an unrolled loop (extra comparison and branch) and on 64-bit OS X, the unrolled loop wins. However, register pressure on 32-bit x86 (both OS X and Linux, no idea for Windows) slows down the unrolled loop due to spilling. testIndexOfDash spends enough time in the loop body for this to make a difference (~60ms). Comparing the result against 1.9.1, I also noticed that testLastIndexOfDash, which is syntactically identical, has dropped from 1141ms to 1511ms on tip. Looking at it under shark, all the time is spent in the 1 loop at the end. I played to forcing things into registers, but to no avail. Tweaking the logic to use pointers and a cached pat[0] recovers old performance almost exactly, so that is also in the patch. End result for 32-bit OS X browser: FF3.5.4 WebKit nightly tip patch textIndexOfx 80 74 263 63 testLastIndexOfx 45 86 43 45 testIndexOfDash 1876 803 5288 787 testLastIndexOfDash 1141 1916 1510 1143
Attachment #410377 - Attachment is obsolete: true
Attachment #410612 - Flags: review?(jwalden+bmo)
Attachment #410377 - Flags: review?(jwalden+bmo)
Attachment #410612 - Flags: review?(jwalden+bmo) → review+
Whiteboard: fixed-in-tracemonkey
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
These bugs landed after b4 was cut. Moving flag out.
blocking2.0: ? → ---
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: