Closed
Bug 526173
Opened 15 years ago
Closed 15 years ago
3.5x regression in indexOf performance
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
status1.9.2 | --- | beta5-fixed |
People
(Reporter: bzbarsky, Assigned: luke)
References
Details
(Whiteboard: fixed-in-tracemonkey)
Attachments
(2 files, 5 obsolete files)
(deleted),
text/html
|
Details | |
(deleted),
patch
|
Waldo
:
review+
|
Details | Diff | Splinter Review |
See the attached testcase. Apart from the fact that the indexOf tests are a lot slower than the lastIndexOf ones for some reason (and have been going back to Fx 3.5 at least), there's been a recent 3.5x regression on the indexOf tests.
The m-c checkin range for that regression (using nightlies) is http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=def6be3d8b6b&tochange=149c3820e8a8 and there's a nice 100-changeset tracemonkey merge in there. indexOf got rewritten in bug 511777, which I'm fingering as the most likely culprit here. Shark just shows all the time spent in str_indexOf (yay inlining) and doesn't give me reasonable line listings for some reason (boo!).
Flags: blocking1.9.2?
Reporter | ||
Comment 1•15 years ago
|
||
Reporter | ||
Comment 2•15 years ago
|
||
And I bet the relevant change was this:
- if (textlen - i >= 512 && (jsuint)(patlen - 2) <= BMH_PATLEN_MAX - 2) {
+ if (textlen >= 512 && patlen <= sBMHPatLenMax) {
In particular this changed from doing the simple walk along the string for patlen == 1 to using BMH. This is an obvious loss, since BMH does more work and can't possibly win on patlen == 1.
If I change the condition to also test patlen > 1, things get much faster, though still about 10-20% slower than Fx 3.5.
Attachment #409878 -
Attachment is obsolete: true
Reporter | ||
Comment 3•15 years ago
|
||
Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot error. With that fixed, indexOf is only about 50% slower than lastIndexOf, not many times slower....
Still might be worth a separate bug on that. It looks like our indexOf has about the performance of Safari's lastIndexOf, while their indexOf is about 3x faster.
Attachment #409879 -
Attachment is obsolete: true
Reporter | ||
Updated•15 years ago
|
blocking2.0: --- → ?
Assignee | ||
Comment 4•15 years ago
|
||
(In reply to comment #3)
First of all, many thanks for finding the weakened BMH condition! I don't know what I was thinking when I left that out. On that line of though, it is also worth some quick measurements to see if there is a higher k where we don't do BMH for patlen < k.
> Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot
> error. With that fixed, indexOf is only about 50% slower than lastIndexOf, not
> many times slower....
Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko fixed) and here's what I see:
(in ms)
testIndexOfx: 63
testLastIndexOfx: 50
testIndexOfDash: 874
testLastIndexOfDash: 1587
Now, if I read the construction of testStrGiant, the '-' is equidistant from the beginning and end of the string, hence |testStrGiant.indexOf("-")| should be the same speed as |testStrGiant.lastIndexOf("-")|. However, indexOf is faster; I suspect because of the changes in bug 511777. Finding 'x' from the beginning and end, on the other hand, is not a symmetric task because 'x' is 24 characters from the beginning and 9 characters from the end. That indexOf is only 26% slower is, I believe, also a positive result of bug 511777.
> It looks like our indexOf has
> about the performance of Safari's lastIndexOf, while their indexOf is about 3x
> faster.
Wow. Hmm, and we are tracing those loops too. Are you quite sure of these results?
Comment 5•15 years ago
|
||
Someone confirm the head-to-head with Safari and dig into the generated machine code. It's time to get down and dirty.
/be
Reporter | ||
Comment 6•15 years ago
|
||
> worth some quick measurements to see if there is a higher k
Yep. ;)
> Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko
> fixed)
Interesting. I was running in tracemonkey tip opt browser. I just tried this again (with the patlen > 1 added, and in shell I see:
testIndexOfDash: 874
testLastIndexOfDash: 1587
In browser I see:
testIndexOfDash: 2594
testLastIndexOfDash: 1720
In Safari (browser) I see:
testIndexOfDash: 906
testLastIndexOfDash: 2219
This is very consistent. Without the BMH fix, browser is at about 6000 for testIndexOfDash, while shell is at 8600 or so.
Sounds like we have a separate bug here... Can you try browser on your end to confirm?
> the '-' is equidistant from the beginning and end of the string
+-1, yes. That was the goal, at least.
> Finding 'x' from the beginning and end, on the other hand, is not a symmetric
> task
Right. I was mostly looking at the '-' case.
> Are you quite sure of these results?
See above.
Reporter | ||
Comment 7•15 years ago
|
||
Hmm. I just did more testing, and now shell is consistently coming up with 1500ms or so for testIndexOfDash (???)
Updated•15 years ago
|
Flags: blocking1.9.2? → blocking1.9.2+
Assignee | ||
Comment 8•15 years ago
|
||
(In reply to comment #4)
> (In reply to comment #3)
> On that line of though, it is also
> worth some quick measurements to see if there is a higher k where we don't do
> BMH for patlen < k.
I ran a matrix of pattern lengths and BMH threshold values. Each row runs with a different value of k in the abovementioned |(patlen - k) < sBMHPatLenMax|, starting at 0. Each column uses a different length pattern, starting at 1. You can see a diagonal starting at k = 2, patlen = 1 and fading away around patlen = 11. Hopefully this remains formatted:
267 134 91 68 56 47 42 35 34 30 28 26 24 24 23 20 22
268 134 91 69 55 47 41 36 33 30 28 26 23 24 24 20 22
29 134 90 69 56 47 41 36 33 30 28 26 24 24 23 20 21
30 28 91 69 55 48 41 35 33 31 28 26 23 24 23 20 21
29 29 28 69 56 47 41 35 33 31 28 26 24 24 23 19 22
29 29 28 29 56 47 41 35 34 30 28 26 23 24 24 19 22
29 29 28 29 28 47 41 36 33 30 28 26 23 24 23 20 22
29 29 29 28 29 29 41 35 33 31 28 26 23 24 23 20 21
28 29 29 28 29 29 28 36 33 30 28 26 24 24 23 19 22
29 29 29 28 30 28 29 29 33 30 29 25 24 24 23 19 22
29 28 30 28 29 29 28 29 28 31 28 26 23 24 23 19 22
29 29 29 29 29 29 28 29 29 28 28 26 23 24 23 20 22
29 28 29 28 29 29 28 29 29 29 28 26 23 24 24 19 22
29 29 29 28 29 28 29 29 28 29 29 29 23 24 23 20 21
30 29 28 29 29 28 29 29 30 28 29 29 29 24 23 20 21
29 28 29 29 28 29 29 29 28 29 29 29 29 29 23 19 22
30 28 29 29 29 28 29 29 29 29 29 29 29 29 29 19 22
28 29 28 29 28 29 29 28 29 29 28 29 29 29 28 29 22
Clearly the numbers are machine specific (Ubuntu w/ Core 2 Duo @ 2.67Ghz, btw), but it does clearly suggest a k higher than 2.
Now back to the "why is the browser slower" hunt.
Assignee | ||
Comment 9•15 years ago
|
||
Confound it! One column too many.
Assignee | ||
Comment 10•15 years ago
|
||
(In reply to comment #6)
> Sounds like we have a separate bug here... Can you try browser on your end to
> confirm?
I get pretty much the same relative performance as the js shell in a browser build:
testIndexOfDash: 871
testLastIndexOfDash: 1582
Reporter | ||
Comment 11•15 years ago
|
||
Interesting. I wonder what the difference is... I'm just using --enable-optimize --disable-debug; pretty basic Firefox opt build.
Assignee | ||
Comment 12•15 years ago
|
||
Well, fix the glaring problem for now.
Reporter | ||
Comment 13•15 years ago
|
||
Explicit cast to jsuint might be nice there. And maybe a comment.
Comment 14•15 years ago
|
||
tests added to sunspider harness:
http://hg.mozilla.org/users/rsayre_mozilla.com/benchmarker/file/d31b875708c1/SunSpider/tests/LIST-MICROSTRING
Assignee | ||
Comment 15•15 years ago
|
||
Attachment #410036 -
Attachment is obsolete: true
Attachment #410041 -
Flags: review?(jwalden+bmo)
Attachment #410036 -
Flags: review?(jwalden+bmo)
Comment 16•15 years ago
|
||
Comment on attachment 410041 [details] [diff] [review]
with comments and cast
We try not to overparenthesize subtraction against <= and so on (gal's Pascal or Oberon mistrained fingers excluded :-P).
/be
Assignee | ||
Comment 17•15 years ago
|
||
(In reply to comment #8)
Split off to bug 526348.
Assignee | ||
Comment 18•15 years ago
|
||
I ran a try-server-built version of your benchmark and indeed there is a horrendous regression from testIndexOfDash being 2x faster to < 2x slower! I will shark this.
Assignee | ||
Comment 19•15 years ago
|
||
So looking at the disassembly with Shark data, it seems Mac gcc doesn't think that 't' and 'p0', the hottest variables of the unrolled loop in StringMatch, deserve registers. Slowness ensues.
Reporter | ||
Comment 20•15 years ago
|
||
Hmm. I just tried making those |register| with no effect, but it could well just be ignoring me....
Assignee | ||
Comment 21•15 years ago
|
||
Yeehaa!
register int test_integer asm ("EBX");
Puts performance back to Linux-level for both shell and browser!
Comment 22•15 years ago
|
||
We used to have to do this kind of shit in the '80s (with the Portable C Compiler). To quote from the Wicked Witch of the West: "What a world!"
/be
Assignee | ||
Comment 23•15 years ago
|
||
I tried to make the register/asm hack apply as narrowly as possible (__APPLE__ && __GNUC__ && __i386__); anyone with portability experience, please let me know if I could do better. With this, OS X 32-bit browser and shell have much better indexOf performance than 3.5 and are on par with Ubuntu.
Attachment #410041 -
Attachment is obsolete: true
Attachment #410377 -
Flags: review?(jwalden+bmo)
Attachment #410041 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 24•15 years ago
|
||
So WebKit has a hard-coded loop for length-one patterns. This should be slower than an unrolled loop (extra comparison and branch) and on 64-bit OS X, the unrolled loop wins. However, register pressure on 32-bit x86 (both OS X and Linux, no idea for Windows) slows down the unrolled loop due to spilling. testIndexOfDash spends enough time in the loop body for this to make a difference (~60ms).
Comparing the result against 1.9.1, I also noticed that testLastIndexOfDash, which is syntactically identical, has dropped from 1141ms to 1511ms on tip. Looking at it under shark, all the time is spent in the 1 loop at the end. I played to forcing things into registers, but to no avail. Tweaking the logic to use pointers and a cached pat[0] recovers old performance almost exactly, so that is also in the patch.
End result for 32-bit OS X browser:
FF3.5.4 WebKit nightly tip patch
textIndexOfx 80 74 263 63
testLastIndexOfx 45 86 43 45
testIndexOfDash 1876 803 5288 787
testLastIndexOfDash 1141 1916 1510 1143
Attachment #410377 -
Attachment is obsolete: true
Attachment #410612 -
Flags: review?(jwalden+bmo)
Attachment #410377 -
Flags: review?(jwalden+bmo)
Updated•15 years ago
|
Attachment #410612 -
Flags: review?(jwalden+bmo) → review+
Assignee | ||
Comment 25•15 years ago
|
||
Whiteboard: fixed-in-tracemonkey
Comment 26•15 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 27•15 years ago
|
||
status1.9.2:
--- → final-fixed
Comment 28•15 years ago
|
||
These bugs landed after b4 was cut. Moving flag out.
Updated•15 years ago
|
blocking2.0: ? → ---
You need to log in
before you can comment on or make changes to this bug.
Description
•