Open Bug 847160 Opened 12 years ago Updated 2 years ago

IonMonkey: Better code for self-hosted functions and aggressive inlining

Categories

(Core :: JavaScript Engine, defect)


People

(Reporter: evilpie, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

Attached file testcase (deleted) —
Julian Viereck just showed me this http://jsperf.com/iter-array testcase on Twitter. The forEach case is unbelievably slow compared to everything else. We have multiple problems here:
- We don't inline ArrayForEach, which might be worthwhile.
- The generated code for ForEach is very bad: we don't hoist bounds checks in the inner loop, and we use a CallGeneric for the callback.
- The call to the callback in ArrayForEach should be completely inlined.
I attached a shell testcase that shows this problem.
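(The attachment itself is deleted; purely as an illustration of the comparison described above, a shell microbenchmark of roughly this shape shows the gap. The array size, iteration counts, and per-element work are my own guesses, not the original testcase.)

```js
// Hypothetical sketch of the kind of shell testcase described above;
// the real attachment is deleted, so all numbers here are made up.
var arr = [];
for (var i = 0; i < 1000; i++)
  arr.push(i);

function sumForEach(a) {
  var sum = 0;
  a.forEach(function (v) { sum += v; });  // callback call currently not inlined
  return sum;
}

function sumNumeric(a) {
  var sum = 0;
  for (var i = 0; i < a.length; i++)      // bounds checks can be hoisted here
    sum += a[i];
  return sum;
}

var start = Date.now();
for (var j = 0; j < 100000; j++)
  sumForEach(arr);
print("forEach: " + (Date.now() - start) + "ms");

start = Date.now();
for (var j = 0; j < 100000; j++)
  sumNumeric(arr);
print("numeric: " + (Date.now() - start) + "ms");
```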
Also, from the results of that test there is a big performance regression on the different simple for loops, counting up/down:

browser (runs) -- down / up / up with length cache
FF18 (3) -- 839,777 / 839,777 / 839,777
FF19 (5) -- 821,707 / 821,707 / 821,707
FF20 (1) -- 147,997 / 50,489 / 50,489
FF22 (4) -- 207,446 / 119,751 / 199,751

I am not sure what is causing the slowdown from 19 to 20 on those tests.
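(For clarity, the three loop shapes behind the down / up / up-with-length-cache columns are presumably along these lines; this is my paraphrase of the jsperf variants, not copied from the page, and |arr| and |work| are placeholders.)

```js
var arr = [1, 2, 3];     // placeholder array
function work(v) {}      // placeholder per-element work

// Counting down
for (var i = arr.length - 1; i >= 0; i--)
  work(arr[i]);

// Counting up
for (var i = 0; i < arr.length; i++)
  work(arr[i]);

// Counting up, caching the length
for (var i = 0, len = arr.length; i < len; i++)
  work(arr[i]);
```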
That's interesting. I wonder if this regressed since landing the self-hosting or if the slight differences to my performance tests make all the difference here. Back when I last tested this in December, there wasn't any speed difference at all between the forEach and the numeric for loop versions of my tests.
Attached file testcase 2: list-modifying forEach (deleted) —
Yeah, so this regressed. Strongly. The attached testcase takes ~10x as long as it does when commenting out the forEach version and activating the numeric loop version. With methodjit (i.e., --no-ion), it runs slightly faster overall, and there the numeric loop version is only ~2x faster. Maybe the regression is somehow related to the changes for ParallelArrays?
Flags: needinfo?(nmatsakis)
When did this regress? Should we be tracking this for some release?
Keywords: regression
I am not sure what changes would have caused the regression, but of course I wouldn't rule it out.
Flags: needinfo?(nmatsakis)
Mmh, so this didn't regress after all: the very first Nightly containing the self-hosted array forEach implementation[1] is just as fast as today's. Having said that, my testing before the landing clearly showed speed-parity with numeric for loops at some point. Unfortunately, I only did ad hoc testing without really documenting the results or making them reproducible. Luckily, evilpie already identified what needs to be improved to speed this up. [1]: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-23-03-08-27-mozilla-central/
OS: Linux → All
Hardware: x86_64 → All
Um. Both today's build and the builds right after landing show self-hosted forEach comparable to manual numeric loops on the testcase in bug 602132, say. So what's special about the testcase(s) here?
Furthermore, if I take the attachments in this bug and use a manually self-hosted forEach, I don't see them get much faster. That is, the self-hosted forEach does just as well as a hand-rolled forEach _function_, but not nearly as well as numerical iteration with everything inlined. It's entirely possible that we used to only compare against a hand-rolled forEach... The main hit is in fact that we don't inline the call to the callback in ArrayForEach.
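(By "hand-rolled forEach _function_" I mean something of this shape; a sketch, not the exact code used for the comparison, and it deliberately skips the spec's ToObject and argument-checking steps.)

```js
// Plain-JS stand-in for Array.prototype.forEach, used only for comparison;
// simplified relative to the spec.
function handRolledForEach(arr, callback, thisArg) {
  for (var i = 0; i < arr.length; i++) {
    if (i in arr)                          // skip holes, like the real forEach
      callback.call(thisArg, arr[i], i, arr);
  }
}

// Usage: handRolledForEach([1, 2, 3], function (v) { print(v); });
```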
(In reply to Boris Zbarsky (:bz) from comment #7)
> So what's special about the testcase(s) here?

The answer to that is surprisingly simple: these testcases run inside a test() function, whereas the ones in bug 602132 run in the top-level script. If you change the latter to run inside a function, times roughly quadruple. Evilpie, do you know of any heuristics in IM that cause these differences?
Oh, just did another quick variation I hadn't thought of before: Moving just the forEach callback out of the test() function to the top-level script restores the speed. So it looks like inner functions aren't inlined into ArrayForEach and somehow throw off some optimizations like hoisted bounds checks.
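(To make the two variants concrete, the restructuring looks roughly like this; a sketch, not the literal testcase.)

```js
// Slow shape: the callback is an inner function, freshly created inside test()
function testSlow(arr) {
  var sum = 0;
  arr.forEach(function (v) { sum += v; }); // closure over |sum|, not inlined into ArrayForEach
  return sum;
}

// Faster shape: only the callback is hoisted to the top-level script
var total = 0;
function addToTotal(v) { total += v; }
function testFast(arr) {
  total = 0;
  arr.forEach(addToTotal);                 // top-level function, easier for TI to resolve
  return total;
}
```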
Did some investigation into why forEach is so slow (in particular, why we don't inline) beyond the obvious "we don't have enough TI resolution to inline". That is, using callsite cloning or the landing of bug 804676 does *not* solve these problems. This comment also applies to bug 851120.

There are two things we want to be inlineable for self-hosted higher-order operations like forEach to be fast: the call to forEach itself (for nested iteration, like the one in bug 851120), and the call to the kernel.

The call to forEach is not inlineable because:
- For semantic reasons (like keeping functions' .length property spec-compliant), forEach uses |arguments[i]|, which is NYI in Ion.
- jsanalyze.cpp marks functions that contain loops as not inlineable. bhackett and h4writer have said on IRC that this should be enabled sometime in the future?
- Heavyweight functions are not inlineable.

The call to the kernel function is not inlineable because:
- TypeObjects are not inlineable. See bug 847694.
- Closures aren't inlineable. Related to heavyweight functions not being inlineable.
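(For reference, the self-hosted ArrayForEach has roughly this shape; this is a simplified sketch from memory, not the exact builtins/Array.js source, and the intrinsic names are approximate. The comments mark where the blockers listed above come in.)

```js
// Simplified sketch of the self-hosted ArrayForEach (not the exact source).
function ArrayForEach(callbackfn /*, thisArg*/) {
  var O = ToObject(this);
  var len = TO_UINT32(O.length);
  if (!IsCallable(callbackfn))
    ThrowError(JSMSG_NOT_FUNCTION, DecompileArg(0, callbackfn));

  // |thisArg| is read via |arguments| so that ArrayForEach.length stays 1
  // per spec -- but |arguments[i]| is NYI in Ion, blocking inlining of forEach itself.
  var T = arguments.length > 1 ? arguments[1] : void 0;

  // The loop means jsanalyze.cpp marks this function as not inlineable.
  for (var k = 0; k < len; k++) {
    if (k in O) {
      // The kernel call: inlining it is blocked on TypeObjects/closures (bug 847694).
      callFunction(callbackfn, T, O[k], k, O);
    }
  }
  return void 0;
}
```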
Inlining closures probably goes hand in hand with some kind of closure conversion, accounting for deopting on |f.caller| and all that.
FWIW, I don't think there's any technical reason why we couldn't inline heavyweight functions.
So, I looked at this again, and inlining doesn't seem to make much of a difference, for some reason. Adding `SetScriptHints(ArrayForEach, {cloneAtCallsite: true, inline: true});` to builtins/Array.js causes both forEach and the kernel to be inlined, but doesn't change the runtime of the first testcase at all. Interestingly, an opt shell is only a few percent faster than a debug one for the benchmark, and I have no idea why. Will look at it with the tracelogger.
Is this still a major regression from for-loops? I worked on optimizing these a while back. Inlining ArrayForEach directly is not that big of a deal: for most hot loops, we'll spend most of our time inside the ForEach loop, so a higher overhead for entering and leaving the loop is not such a big issue. OTOH, inlining the kernel would have a bigger impact. I do think we are able to inline the kernel these days, given that we're cloning ArrayForEach at the callsite, and TI will be able to pick up the kernel as a singleton type. If you're seeing that opt is only a few percent faster than debug, it just means that we're spending most of our time in tight, jitted code.
(In reply to Kannan Vijayan [:djvj] from comment #15)
> Is this still a major regression from for-loops? I worked on optimizing
> these a while back.

It is, sadly. Changing the first attached testcase to use a numeric loop makes it ~3x faster. More importantly, in the jsperf benchmark from comment 0, the forEach version is 97%(!) slower than the other versions.
Assignee: general → nobody
On the jsperf, Nightly / Chrome 43:

forEach: 249,016 / 9,064
for, counting up: 1,070,746 / 401,057
for, counting up, caching length: 1,062,803 / 426,321
for, counting down: 1,065,931 / 145,648

forEach is 78% slower on Firefox and 98% slower on Chrome.
While this isn't fully solved, it's gotten much better: On the jsperf test, forEach is half as fast (~800kOps/s vs 1600kOps/s) in Nightly, but 1% as fast (~7kOps/s vs. 670kOps/s) on Chrome. Pretty sure the loop version is just completely optimized out for us, making the benchmark largely meaningless. The shell test cases are both as fast with forEach as they are with for loops: 600ms and 62ms for the first and second test, respectively. Other engines don't fare that well, with d8's 62 seconds(!) for the first test being the worst. Is there much left to do here, I wonder?
> On the jsperf test, forEach is half as fast (~800kOps/s vs 1600kOps/s) in Nightly,

Fwiw, it's ~465kOps/s vs ~3400kOps/s for me on a Mac nightly...
(In reply to Boris Zbarsky [:bz] from comment #19)
> Fwiw, it's ~465kOps/s vs ~3400kOps/s for me on a Mac nightly...

Huh, interesting. My results hold up in a fresh profile on a 2012 rMBP with current Nightly.
Oh, in a clean profile with a single tab I see numbers more like yours. In a clean profile with 10 copies of gmail opened in background tabs I see numbers a lot more like mine.
Severity: normal → S3