Closed Bug 526173 Opened 15 years ago Closed 15 years ago

3.5x regression in indexOf performance

Tracking

()

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

status1.9.2

---

beta5-fixed

People

(Reporter: bzbarsky, Assigned: luke)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 5 obsolete files)

Testcase 15 years ago Boris Zbarsky [:bzbarsky] (deleted), text/html		Details
Same testcase without the shark gunk 15 years ago Boris Zbarsky [:bzbarsky] (deleted), text/html		Details
Testcase fixed so the lastIndexOf test is actually relevant 15 years ago Boris Zbarsky [:bzbarsky] (deleted), text/html		Details
fix the BMH bug 15 years ago Luke Wagner [:luke] (deleted), patch		Details \| Diff \| Splinter Review
with comments and cast 15 years ago Luke Wagner [:luke] (deleted), patch		Details \| Diff \| Splinter Review
fix both problems 15 years ago Luke Wagner [:luke] (deleted), patch		Details \| Diff \| Splinter Review
more faster 15 years ago Luke Wagner [:luke] (deleted), patch	Waldo : review+	Details \| Diff \| Splinter Review

Boris Zbarsky [:bzbarsky]

Reporter

Description

•

15 years ago

See the attached testcase. Apart from the fact that the indexOf tests are a lot slower than the lastIndexOf ones for some reason (and have been going back to Fx 3.5 at least), there's been a recent 3.5x regression on the indexOf tests. The m-c checkin range for that regression (using nightlies) is http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=def6be3d8b6b&tochange=149c3820e8a8 and there's a nice 100-changeset tracemonkey merge in there. indexOf got rewritten in bug 511777, which I'm fingering as the most likely culprit here. Shark just shows all the time spent in str_indexOf (yay inlining) and doesn't give me reasonable line listings for some reason (boo!).

Flags: blocking1.9.2?

Boris Zbarsky [:bzbarsky]

Reporter

Comment 1

•

15 years ago

Attached file Testcase (obsolete) (deleted) — Details

Boris Zbarsky [:bzbarsky]

Reporter

Comment 2

•

15 years ago

Attached file Same testcase without the shark gunk (obsolete) (deleted) — Details

And I bet the relevant change was this: - if (textlen - i >= 512 && (jsuint)(patlen - 2) <= BMH_PATLEN_MAX - 2) { + if (textlen >= 512 && patlen <= sBMHPatLenMax) { In particular this changed from doing the simple walk along the string for patlen == 1 to using BMH. This is an obvious loss, since BMH does more work and can't possibly win on patlen == 1. If I change the condition to also test patlen > 1, things get much faster, though still about 10-20% slower than Fx 3.5.

Attachment #409878 - Attachment is obsolete: true

Boris Zbarsky [:bzbarsky]

Reporter

Comment 3

•

15 years ago

Attached file Testcase fixed so the lastIndexOf test is actually relevant (deleted) — Details

Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot error. With that fixed, indexOf is only about 50% slower than lastIndexOf, not many times slower.... Still might be worth a separate bug on that. It looks like our indexOf has about the performance of Safari's lastIndexOf, while their indexOf is about 3x faster.

Attachment #409879 - Attachment is obsolete: true

Boris Zbarsky [:bzbarsky]

Reporter

Updated

•

15 years ago

blocking2.0: --- → ?

Luke Wagner [:luke]

Assignee

Comment 4

•

15 years ago

(In reply to comment #3) First of all, many thanks for finding the weakened BMH condition! I don't know what I was thinking when I left that out. On that line of though, it is also worth some quick measurements to see if there is a higher k where we don't do BMH for patlen < k. > Ok, indexOf being a lot slower than lastIndexOf was somewhat due to pilot > error. With that fixed, indexOf is only about 50% slower than lastIndexOf, not > many times slower.... Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko fixed) and here's what I see: (in ms) testIndexOfx: 63 testLastIndexOfx: 50 testIndexOfDash: 874 testLastIndexOfDash: 1587 Now, if I read the construction of testStrGiant, the '-' is equidistant from the beginning and end of the string, hence |testStrGiant.indexOf("-")| should be the same speed as |testStrGiant.lastIndexOf("-")|. However, indexOf is faster; I suspect because of the changes in bug 511777. Finding 'x' from the beginning and end, on the other hand, is not a symmetric task because 'x' is 24 characters from the beginning and 9 characters from the end. That indexOf is only 26% slower is, I believe, also a positive result of bug 511777. > It looks like our indexOf has > about the performance of Safari's lastIndexOf, while their indexOf is about 3x > faster. Wow. Hmm, and we are tracing those loops too. Are you quite sure of these results?

Brendan Eich [:brendan]

Comment 5

•

15 years ago

Someone confirm the head-to-head with Safari and dig into the generated machine code. It's time to get down and dirty. /be

Boris Zbarsky [:bzbarsky]

Reporter

Comment 6

•

15 years ago

> worth some quick measurements to see if there is a higher k Yep. ;) > Hmm, I ran your tests in a tracemonkey tip js shell (with the BMH thinko > fixed) Interesting. I was running in tracemonkey tip opt browser. I just tried this again (with the patlen > 1 added, and in shell I see: testIndexOfDash: 874 testLastIndexOfDash: 1587 In browser I see: testIndexOfDash: 2594 testLastIndexOfDash: 1720 In Safari (browser) I see: testIndexOfDash: 906 testLastIndexOfDash: 2219 This is very consistent. Without the BMH fix, browser is at about 6000 for testIndexOfDash, while shell is at 8600 or so. Sounds like we have a separate bug here... Can you try browser on your end to confirm? > the '-' is equidistant from the beginning and end of the string +-1, yes. That was the goal, at least. > Finding 'x' from the beginning and end, on the other hand, is not a symmetric > task Right. I was mostly looking at the '-' case. > Are you quite sure of these results? See above.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 7

•

15 years ago

Hmm. I just did more testing, and now shell is consistently coming up with 1500ms or so for testIndexOfDash (???)

Robert Sayre

Updated

•

15 years ago

Flags: blocking1.9.2? → blocking1.9.2+

Luke Wagner [:luke]

Assignee

Comment 8

•

15 years ago

(In reply to comment #4) > (In reply to comment #3) > On that line of though, it is also > worth some quick measurements to see if there is a higher k where we don't do > BMH for patlen < k. I ran a matrix of pattern lengths and BMH threshold values. Each row runs with a different value of k in the abovementioned |(patlen - k) < sBMHPatLenMax|, starting at 0. Each column uses a different length pattern, starting at 1. You can see a diagonal starting at k = 2, patlen = 1 and fading away around patlen = 11. Hopefully this remains formatted: 267 134 91 68 56 47 42 35 34 30 28 26 24 24 23 20 22 268 134 91 69 55 47 41 36 33 30 28 26 23 24 24 20 22 29 134 90 69 56 47 41 36 33 30 28 26 24 24 23 20 21 30 28 91 69 55 48 41 35 33 31 28 26 23 24 23 20 21 29 29 28 69 56 47 41 35 33 31 28 26 24 24 23 19 22 29 29 28 29 56 47 41 35 34 30 28 26 23 24 24 19 22 29 29 28 29 28 47 41 36 33 30 28 26 23 24 23 20 22 29 29 29 28 29 29 41 35 33 31 28 26 23 24 23 20 21 28 29 29 28 29 29 28 36 33 30 28 26 24 24 23 19 22 29 29 29 28 30 28 29 29 33 30 29 25 24 24 23 19 22 29 28 30 28 29 29 28 29 28 31 28 26 23 24 23 19 22 29 29 29 29 29 29 28 29 29 28 28 26 23 24 23 20 22 29 28 29 28 29 29 28 29 29 29 28 26 23 24 24 19 22 29 29 29 28 29 28 29 29 28 29 29 29 23 24 23 20 21 30 29 28 29 29 28 29 29 30 28 29 29 29 24 23 20 21 29 28 29 29 28 29 29 29 28 29 29 29 29 29 23 19 22 30 28 29 29 29 28 29 29 29 29 29 29 29 29 29 19 22 28 29 28 29 28 29 29 28 29 29 28 29 29 29 28 29 22 Clearly the numbers are machine specific (Ubuntu w/ Core 2 Duo @ 2.67Ghz, btw), but it does clearly suggest a k higher than 2. Now back to the "why is the browser slower" hunt.

Luke Wagner [:luke]

Assignee

Comment 9

•

15 years ago

Confound it! One column too many.

Luke Wagner [:luke]

Assignee

Comment 10

•

15 years ago

(In reply to comment #6) > Sounds like we have a separate bug here... Can you try browser on your end to > confirm? I get pretty much the same relative performance as the js shell in a browser build: testIndexOfDash: 871 testLastIndexOfDash: 1582

Boris Zbarsky [:bzbarsky]

Reporter

Comment 11

•

15 years ago

Interesting. I wonder what the difference is... I'm just using --enable-optimize --disable-debug; pretty basic Firefox opt build.

Luke Wagner [:luke]

Assignee

Comment 12

•

15 years ago

Attached patch fix the BMH bug (obsolete) (deleted) — Details — Splinter Review

Well, fix the glaring problem for now.

Assignee: general → lw

Status: NEW → ASSIGNED

Attachment #410036 - Flags: review?(jwalden+bmo)

Boris Zbarsky [:bzbarsky]

Reporter

Comment 13

•

15 years ago

Explicit cast to jsuint might be nice there. And maybe a comment.

Robert Sayre

Comment 14

•

15 years ago

tests added to sunspider harness: http://hg.mozilla.org/users/rsayre_mozilla.com/benchmarker/file/d31b875708c1/SunSpider/tests/LIST-MICROSTRING

Luke Wagner [:luke]

Assignee

Comment 15

•

15 years ago

Attached patch with comments and cast (obsolete) (deleted) — Details — Splinter Review

Attachment #410036 - Attachment is obsolete: true

Attachment #410041 - Flags: review?(jwalden+bmo)

Attachment #410036 - Flags: review?(jwalden+bmo)

Brendan Eich [:brendan]

Comment 16

•

15 years ago

Comment on attachment 410041 [details] [diff] [review] with comments and cast We try not to overparenthesize subtraction against <= and so on (gal's Pascal or Oberon mistrained fingers excluded :-P). /be

Luke Wagner [:luke]

Assignee

Comment 17

•

15 years ago

(In reply to comment #8) Split off to bug 526348.

Luke Wagner [:luke]

Assignee

Comment 18

•

15 years ago

I ran a try-server-built version of your benchmark and indeed there is a horrendous regression from testIndexOfDash being 2x faster to < 2x slower! I will shark this.

Luke Wagner [:luke]

Assignee

Comment 19

•

15 years ago

So looking at the disassembly with Shark data, it seems Mac gcc doesn't think that 't' and 'p0', the hottest variables of the unrolled loop in StringMatch, deserve registers. Slowness ensues.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 20

•

15 years ago

Hmm. I just tried making those |register| with no effect, but it could well just be ignoring me....

Luke Wagner [:luke]

Assignee

Comment 21

•

15 years ago

Yeehaa! register int test_integer asm ("EBX"); Puts performance back to Linux-level for both shell and browser!

Brendan Eich [:brendan]

Comment 22

•

15 years ago

We used to have to do this kind of shit in the '80s (with the Portable C Compiler). To quote from the Wicked Witch of the West: "What a world!" /be

Luke Wagner [:luke]

Assignee

Comment 23

•

15 years ago

Attached patch fix both problems (obsolete) (deleted) — Details — Splinter Review

I tried to make the register/asm hack apply as narrowly as possible (__APPLE__ && __GNUC__ && __i386__); anyone with portability experience, please let me know if I could do better. With this, OS X 32-bit browser and shell have much better indexOf performance than 3.5 and are on par with Ubuntu.

Attachment #410041 - Attachment is obsolete: true

Attachment #410377 - Flags: review?(jwalden+bmo)

Attachment #410041 - Flags: review?(jwalden+bmo)

Luke Wagner [:luke]

Assignee

Comment 24

•

15 years ago

Attached patch more faster (deleted) — Details — Splinter Review

So WebKit has a hard-coded loop for length-one patterns. This should be slower than an unrolled loop (extra comparison and branch) and on 64-bit OS X, the unrolled loop wins. However, register pressure on 32-bit x86 (both OS X and Linux, no idea for Windows) slows down the unrolled loop due to spilling. testIndexOfDash spends enough time in the loop body for this to make a difference (~60ms). Comparing the result against 1.9.1, I also noticed that testLastIndexOfDash, which is syntactically identical, has dropped from 1141ms to 1511ms on tip. Looking at it under shark, all the time is spent in the 1 loop at the end. I played to forcing things into registers, but to no avail. Tweaking the logic to use pointers and a cached pat[0] recovers old performance almost exactly, so that is also in the patch. End result for 32-bit OS X browser: FF3.5.4 WebKit nightly tip patch textIndexOfx 80 74 263 63 testLastIndexOfx 45 86 43 45 testIndexOfDash 1876 803 5288 787 testLastIndexOfDash 1141 1916 1510 1143

Attachment #410377 - Attachment is obsolete: true

Attachment #410612 - Flags: review?(jwalden+bmo)

Attachment #410377 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Updated

•

15 years ago

Attachment #410612 - Flags: review?(jwalden+bmo) → review+

Luke Wagner [:luke]

Assignee

Comment 25

•

15 years ago

http://hg.mozilla.org/tracemonkey/rev/f8188f1c00f0

Whiteboard: fixed-in-tracemonkey

Robert Sayre

Comment 26

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/f8188f1c00f0

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Robert Sayre

Comment 27

•

15 years ago

http://hg.mozilla.org/releases/mozilla-1.9.2/rev/1206cbcdd06e

status1.9.2: --- → final-fixed

Samuel Sidler (old account; do not CC)

Comment 28

•

15 years ago

These bugs landed after b4 was cut. Moving flag out.

status1.9.2: beta4-fixed → final-fixed

Robert Sayre

Updated

•

15 years ago

blocking2.0: ? → ---

You need to log in before you can comment on or make changes to this bug.