Closed Bug 1364908 Opened 8 years ago Closed 7 years ago

Support inlining functions that use arguments[x]

Categories

(Core :: JavaScript Engine: JIT, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
firefox56 --- fixed

People

(Reporter: jandem, Assigned: nbp)

References

(Blocks 2 open bugs)

Details

Attachments

(4 files, 2 obsolete files)

We're currently unable to inline functions in Ion that use lazy arguments + arguments[x]. Unfortunately this is pretty common, also in self-hosted code, so we should fix it. When the number of arguments is small, we could get away with a MIR instruction that takes all arguments as operands and then just branches on the index. This doesn't really scale but in practice the number of arguments is often small so doing this for <= 3 arguments or so might be a reasonable short-term improvement. Handling the many-arguments case is a lot more complicated. We have to reserve a range of slots for the arguments somewhere, make sure the arguments are stored there, and then for arguments[x] we could add a MIR instruction to load the argument from slot x.
(In reply to Jan de Mooij [:jandem] from comment #0) > We have to > reserve a range of slots for the arguments somewhere, make sure the > arguments are stored there, and then for arguments[x] we could add a MIR > instruction to load the argument from slot x. For this I think our main options are: (1) Somehow force the register allocator to store the inlined arguments where we want them. Maybe for each inlined argument we could add an LUse with a fixed stack slot. Then ideally the allocator will be smart enough to move these spills out of the loop. (2) Push some kind of JIT frame (JitFrame_IonInlined) when we enter the inlined function, and pop it when we leave. This will complicate codegen (need to add the size of this frame to slot offsets) and frame iterators though.
(In reply to Jan de Mooij [:jandem] from comment #0) > This doesn't really scale but in practice the number of arguments > is often small so doing this for <= 3 arguments or so might be a reasonable > short-term improvement. Hm I checked Speedometer2 and various websites and the data confirms that handling <= 3 arguments will cover the vast majority of cases (like > 97%), so I think I'll start with this when I'm back from PTO (later this week). On Gmail it's also pretty common to abort inlining when we have arguments[x] and 0 actual arguments, that's just silly.
Flags: needinfo?(jdemooij)
Flags: needinfo?(jdemooij)
Whiteboard: [qf]
Flags: needinfo?(jdemooij)
P1 because it impacts Speedometer
Whiteboard: [qf] → [qf:p1]
Attached patch POC for x64 (deleted) — Splinter Review
This allows using an "unlimited" number of arguments by using stack arguments. However after running out of registers the performance sucks. Instead of using a linear number of compares, maybe a jump table could be used? I would probably just limit this to <= 3 as suggested, a small number of compares should still be cheap. Getting this to work on x86 is going to be painful, we need to support Values where the tag or payload can be on the stack. I hope this is useful, I didn't read your last comment about working on this after you are back.
Blocks: 1364854
No longer blocks: TimeToFirstPaint_FB
Attached patch WIP x64 - LoadElementFromState (obsolete) (deleted) — Splinter Review
The most interesting part of this feature, is mostly the fact that we would inline the code of the function in the caller. Still we don't want to make it smaller to access the arguments, so I did a bunch of testing, with a build using this WIP patch, and a build using the backtracking on inlining failures: Benchmark style: function sum() { var res = 0; for (var i = 0; i < arguments.length; i++) res += arguments[i]; return res; } function args10() { var isInIon = false; var start, end; for (var c = 0; c < 1000000; ) { sum(1,2,3,4,5,6,7,8,9,10); if (inIon()) { c++; } else { start = dateNow(); } } end = dateNow(); print("10 arguments:", end - start); } Results: (this patch) (today) 10 arguments: 373.260009765625 396.137939453125 9 arguments: 355.77783203125 400.9169921875 8 arguments: 367.971923828125 410.825927734375 7 arguments: 360.7490234375 393.195068359375 6 arguments: 350.8798828125 394.631103515625 5 arguments: 359.70703125 396.396240234375 4 arguments: 354.60009765625 391.651123046875 3 arguments: 360.801025390625 388.56005859375 2 arguments: 348.429931640625 386.27001953125 1 arguments: 346.49609375 376.010986328125 The current implementation implements a switch case, with an indirection table. Here, the arguments are all constants which are flagged as emitAtUses, which means that they are baked-in the switch case generated in CodeGenerator::LoadElementFromStateV. The reason why I adding MArgumentState and MLoadElementFromState, is that we might be able to use MLoadElementFromState with an MArrayState produced by ScalarReplacement. And later, we could also imagine using the same mechanism for property accesses on MObjectstate. I will split this patch in various refactoring, and a final one which adds MArgumentState and MLoadElementFromState.
Depends on: 1376921
Attachment #8882277 - Flags: review?(jdemooij)
Attachment #8882278 - Flags: review?(jdemooij)
Comment on attachment 8882277 [details] [diff] [review] Move moveValue into the generic MacroAssembler. Review of attachment 8882277 [details] [diff] [review]: ----------------------------------------------------------------- Nice patch. ::: js/src/jit/MacroAssembler-inl.h @@ +337,5 @@ > + moveValue(src.value(), dest); > + return; > + } > + > + if (src.reg().hasValue()) { Wouldn't it be clearer to have the TypedOrValueRegister overload call moveValue(ValueOperand) if needed? Now we can only call it with TypedOrValueRegister if reg.hasTyped() is true, but we could easily handle the ValueOperand case there instead of here.
Attachment #8882277 - Flags: review?(jdemooij) → review+
Comment on attachment 8882278 [details] [diff] [review] Add MacroAssembler::branchToComputedAddress. Review of attachment 8882278 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/arm/MacroAssembler-arm-inl.h @@ +2067,5 @@ > + > + if (addr.offset != 0) { > + SecondScratchRegisterScope scratch2(*this); > + ScratchRegisterScope scratch(*this); > + MOZ_ASSERT(base != pc, "NYI: offset from pc should be shifted."); We assert addr.base == pc at the start of this function. Does that mean this code is dead?
Attachment #8882278 - Flags: review?(jdemooij) → review+
Attachment #8879680 - Attachment is obsolete: true
(In reply to Jan de Mooij [:jandem] from comment #10) > ::: js/src/jit/arm/MacroAssembler-arm-inl.h > @@ +2067,5 @@ > > + > > + if (addr.offset != 0) { > > + SecondScratchRegisterScope scratch2(*this); > > + ScratchRegisterScope scratch(*this); > > + MOZ_ASSERT(base != pc, "NYI: offset from pc should be shifted."); > > We assert addr.base == pc at the start of this function. Does that mean this > code is dead? Yes it is, I will replace the above condition by an assertion.
(In reply to Jan de Mooij [:jandem] from comment #9) > ::: js/src/jit/MacroAssembler-inl.h > @@ +337,5 @@ > > + moveValue(src.value(), dest); > > + return; > > + } > > + > > + if (src.reg().hasValue()) { > > Wouldn't it be clearer to have the TypedOrValueRegister overload call > moveValue(ValueOperand) if needed? Now we can only call it with > TypedOrValueRegister if reg.hasTyped() is true, but we could easily handle > the ValueOperand case there instead of here. I will move the previous condition into the moveValue(TypedOrValueRegister, ValueOperand) functions. Thus removing the assertion at the top of these functions.
Comment on attachment 8882279 [details] [diff] [review] IonMonkey: Add LoadElementFromSate to support argument[x] in inlined functions. Review of attachment 8882279 [details] [diff] [review]: ----------------------------------------------------------------- Looks good but some questions below. ::: js/src/jit/CodeGenerator.cpp @@ +11146,5 @@ > + codegen->visitOutOfLineSwitch(this); > + } > + > + public: > + OutOfLineSwitch(TempAllocator& alloc) Nit: explicit @@ +11169,5 @@ > + if (tableType == SwitchTableType::Inline) { > +#if defined(JS_CODEGEN_ARM) > + base = ::js::jit::pc; > +#else > + MOZ_ASSERT(false, "NYI: SwitchTableType::Inline"); MOZ_CRASH? @@ +11173,5 @@ > + MOZ_ASSERT(false, "NYI: SwitchTableType::Inline"); > +#endif > + } else { > +#if defined(JS_CODEGEN_ARM) || defined(JS_CODEGEN_ARM64) > + MOZ_ASSERT(false, "NYI: SwitchTableType::OutOfLine"); And here @@ +11212,5 @@ > +{ > + jumpTable->setOutOfLine(); > + if (tableType == SwitchTableType::OutOfLine) { > +#if defined(JS_CODEGEN_ARM) || defined(JS_CODEGEN_ARM64) > + MOZ_ASSERT(false, "NYI: SwitchTableType::OutOfLine"); And here ::: js/src/jit/IonBuilder.cpp @@ +7641,5 @@ > MOZ_TRY(getElemTryArguments(&emitted, obj, index)); > if (emitted) > return Ok(); > > + trackOptimizationAttempt(TrackedStrategy::GetElem_ArgumentsInlinedConstant); There are no changes to TrackedOptimizationInfo.h in this patch? @@ +8365,5 @@ > + obj->setImplicitlyUsedUnchecked(); > + > + MOZ_ASSERT(!info().argsObjAliasesFormals()); > + > + // Ensure index is an integer. We should probably check index->type() == MIRType::Int32 or at least IsNumberType(index->type()). @@ +8370,5 @@ > + MInstruction* idInt32 = MToInt32::New(alloc(), index); > + current->add(idInt32); > + index = idInt32; > + > + // Bailouts if we read more than the number of actual arguments. s/Bailouts/Bailout/ Can you add a comment here saying bailing out is okay because reading out of bounds arguments will disable the lazy arguments optimization for this script? That should ensure we don't get stuck in a bailout loop. (I was also thinking LoadElementFromState could return undefined in this case. That's not correct because of getters on the prototype, but we could fix that with TI.) ::: js/src/jit/Lowering.cpp @@ +3245,5 @@ > + temp1 = temp(); > +#endif > + LLoadElementFromStateV* lir = new(alloc()) LLoadElementFromStateV(temp(), temp1, tempDouble()); > + if (!ins->array()->isArgumentState()) { > + abort(AbortReason::Alloc, "LIRGenerator::visitLoadElementFromState: Unsupported state object"); Hm I don't like having non-OOM aborts in the compiler backend, and AbortReason::Alloc seems wrong. Can this actually happen? If yes can you add a comment explaining when, and why this is uncommon enough that aborting is okay? If no, can we MOZ_ASSERT? @@ +3295,5 @@ > + lir->setOperand(1 + BOX_PIECES * i + 1, LAllocation()); > +#endif > + break; > + default: > + abort(AbortReason::Alloc, "LIRGenerator::visitLoadElementFromState: Unsupported element type."); Same here, I'd prefer a MOZ_CRASH. These aborts always show up on some random benchmark or website out there. ::: js/src/jit/MIR.h @@ +3957,5 @@ > } > }; > > +// Hold the arguments of an inlined frame. At the moment this class is not > +// recovered on bailout, but as it does not have an implementation and it should Nit: remove the "but " here? ::: js/src/jit/shared/LIR-shared.h @@ +5737,5 @@ > return getOperand(0); > } > }; > > +// Load a value from a dense array's elements vector. Bail out if it's the hole value. This comment should be updated.
Flags: needinfo?(jdemooij)
Attachment #8882279 - Flags: review?(jdemooij)
Delta: - Apply nits from comment 13. - Add the uncommitted files: - TrackedOptimizationInfo.h - jit-test/tests/ion/inlining/inline-getelem-args.js - Add test cases to check for underflow/overflow bailouts. - Add support for properly loading boolean from the stack: - see StackSlotAllocator::width(LDefinition::TypeFrom(elem->type()))
Attachment #8884831 - Flags: review?(jdemooij)
Attachment #8882279 - Attachment is obsolete: true
Comment on attachment 8884831 [details] [diff] [review] IonMonkey: Add LoadElementFromSate to support argument[x] in inlined functions. Review of attachment 8884831 [details] [diff] [review]: ----------------------------------------------------------------- Thanks. Some gnarly CodeGen stuff but I can't think of a good way to simplify that. It will be interesting to see if this affects Speedometer - hopefully there won't be regressions from increased compilation time :) ::: js/src/jit/CodeGenerator.cpp @@ +11298,5 @@ > + if (elem->type() == MIRType::Double) { > + masm.loadDouble(ToAddress(a), tempD); > + input = TypedOrValueRegister(elem->type(), AnyRegister(tempD)); > + } else if (elem->type() == MIRType::Value) { > + typeReg = temp0; Nit: can remove this line since we also set it before the if-else. @@ +11304,5 @@ > +#ifdef JS_PUNBOX64 > + input = TypedOrValueRegister(ValueOperand(temp0)); > +#endif > + } else { > + typeReg = temp0; Same. ::: js/src/jit/IonBuilder.cpp @@ +8357,5 @@ > + return Ok(); > + > + // Currently, we do not support any arguments vector larger than 10, as this > + // is being translated into code at the call site, and it would be better to > + // the arguments aligned on the stack. s/better to/better to store/?
Attachment #8884831 - Flags: review?(jdemooij) → review+
Pushed by npierron@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/a02a6cb9c94c Move moveValue into the generic MacroAssembler. r=jandem https://hg.mozilla.org/integration/mozilla-inbound/rev/60809bf55571 Add MacroAssembler::branchToComputedAddress. r=jandem https://hg.mozilla.org/integration/mozilla-inbound/rev/b6be7c2be50e IonMonkey: Add LoadElementFromSate to support argument[x] in inlined functions. r=jandem https://hg.mozilla.org/integration/mozilla-inbound/rev/a249596785aa Fix ion/bug1284485.js unreported OOM. r=bbouvier
Backed out for build bustage: undeclared 'StackSlotAllocator' at js/src/jit/CodeGenerator.cpp:11199: https://hg.mozilla.org/integration/mozilla-inbound/rev/273e8ce1c9e38f4ae3527508f9512adfe2d0bec4 https://hg.mozilla.org/integration/mozilla-inbound/rev/5913aca4915039cc0e05e7933922db23c30e4366 https://hg.mozilla.org/integration/mozilla-inbound/rev/9d84a411ea6da7680c5086f29ecdfe20d5c4a688 https://hg.mozilla.org/integration/mozilla-inbound/rev/a51672e13a1b018a41a6e9f6c156605861863293 Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a249596785aa58973113cf830a55a54d0a5257de&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=runnable Build log: https://treeherder.mozilla.org/logviewer.html#?job_id=114912641&repo=mozilla-inbound [task 2017-07-17T17:11:01.326683Z] 17:11:01 INFO - In file included from /home/worker/workspace/build/src/obj-firefox/js/src/Unified_cpp_js_src11.cpp:38:0: [task 2017-07-17T17:11:01.326804Z] 17:11:01 INFO - /home/worker/workspace/build/src/js/src/jit/CodeGenerator.cpp: In member function 'virtual void js::jit::CodeGenerator::visitLoadElementFromStateV(js::jit::LLoadElementFromStateV*)': [task 2017-07-17T17:11:01.326870Z] 17:11:01 INFO - /home/worker/workspace/build/src/js/src/jit/CodeGenerator.cpp:11199:32: error: 'StackSlotAllocator' has not been declared [task 2017-07-17T17:11:01.326925Z] 17:11:01 INFO - size_t width = StackSlotAllocator::width(LDefinition::TypeFrom(elem->type())); [task 2017-07-17T17:11:01.326958Z] 17:11:01 INFO - ^ [task 2017-07-17T17:11:01.327011Z] 17:11:01 INFO - /home/worker/workspace/build/src/config/rules.mk:1051: recipe for target 'Unified_cpp_js_src11.o' failed [task 2017-07-17T17:11:01.327048Z] 17:11:01 INFO - gmake[5]: *** [Unified_cpp_js_src11.o] Error 1
Flags: needinfo?(nicolas.b.pierron)
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #17) > Backed out for build bustage: undeclared 'StackSlotAllocator' at > js/src/jit/CodeGenerator.cpp:11199: Fixed by adding a missing include in CodeGenerator.cpp: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8f98b23446f2eb7dd240fe4fe952f0e43e7df73e
Flags: needinfo?(nicolas.b.pierron)
Pushed by npierron@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/0f61bc51de11 Move moveValue into the generic MacroAssembler. r=jandem https://hg.mozilla.org/integration/mozilla-inbound/rev/8be467b057f6 Add MacroAssembler::branchToComputedAddress. r=jandem https://hg.mozilla.org/integration/mozilla-inbound/rev/07e91ed49793 IonMonkey: Add LoadElementFromSate to support argument[x] in inlined functions. r=jandem https://hg.mozilla.org/integration/mozilla-inbound/rev/09937ba00629 Fix ion/bug1284485.js unreported OOM. r=bbouvier
Pushed by npierron@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/2817d497d4b1 Fix test case basic/bug858097.js, to checking in Compacting GC cases. r=jonco
This regressed the assorted misc-bugs-658844-sha256-bitcoins test by 68% (22 ms -> 38 ms) and the misc-basic-arguments test from 15 to 24 ms. Would you mind taking a look? :)
Flags: needinfo?(nicolas.b.pierron)
(In reply to Jan de Mooij [:jandem] from comment #21) > This regressed the assorted misc-bugs-658844-sha256-bitcoins test by 68% (22 > ms -> 38 ms) and the misc-basic-arguments test from 15 to 24 ms. Would you > mind taking a look? :) Testing with misc-basic-arguments benchmark, and changing the number of actual arguments: before after 10 7.0341796875 10.79296875 9 7.213134765625 9.68408203125 8 7.39208984375 10.72802734375 7 7.550048828125 10.2041015625 6 7.85595703125 10.31298828125 5 8.593017578125 10.824951171875 4 8.97998046875 12.198974609375 3 9.863037109375 11.991943359375 2 11.97412109375 14.195068359375 1 18.015869140625 14.73193359375 Based on these results, it sounds like we should handle the case where there is only one argument to the function separately, and probably look for allocating a vector on the stack otherwise. I will look if there is an easy way to handle that.
Manually unrolling the loop which contains arguments[x], and making sure that these reads are not LICM-ed out of the loop highlights that we can optimize arguments[x] better by only unrolling the loops which are containing these expressions. before after 10 7.76513671875 7.136962890625 9 6.701171875 7.051025390625 8 7.461181640625 6.993896484375 7 7.212158203125 6.819091796875 6 7.501220703125 6.922119140625 5 8.198974609375 7.032958984375 4 9.0849609375 6.64404296875 3 12.39794921875 4.43798828125 2 15.126953125 3.48095703125 1 27.291015625 1.794921875 If the body of a loop is too large to be unrolled, then the cost of the arguments[x] probably does not matter as much compared to the cost of the body of the loop. In these results the cost of the arguments[x] expression (not yet folded into a single argument read, but still implemented as a switch case) seems to confirm that the issue seen in comment 23 & comment 21 comes from the fact that the generated code does not play nicely with the branch predictor (within a loop). Unrolling the loop containing arguments[x] removes the miss-prediction issue and indirect branch is known by the branch predictor to always use the same case statement, leading to the results above. If we were to fold the MLoadElementFromState given a known index, into the proper argument read, then we could potentially get even more performance out of it. I suggest to focus on enabling loop unrolling as a solution for these regressions.
Depends on: 1383828
Flags: needinfo?(nicolas.b.pierron)
Depends on: 1388005
Assignee: nobody → nicolas.b.pierron
Performance Impact: --- → P1
Whiteboard: [qf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: