Closed Bug 864214 Opened 12 years ago Closed 8 years ago

IonMonkey: Generate asm.js heap load/stores when possible

Categories

(Core :: JavaScript Engine, defect, P5)

Other Branch
x86
macOS

Tracking


RESOLVED FIXED

People

(Reporter: bhackett1024, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [leave open])

Attachments

(2 files, 2 obsolete files)

Attached patch patch (obsolete) (deleted) — Splinter Review
asm.js uses specialized MIR nodes for heap accesses, which are faster than what IonMonkey normally generates. It would be good if vanilla IonMonkey also generated these where possible, i.e. when the load/store can execute infallibly on a statically known array. The attached patch does this for x86. ARM should be straightforward, but I don't have a way to test changes to it. x64 is considerably more complicated, as it depends on signal handlers to be able to execute loads/stores without the bounds checks that x86/ARM require. On x86 and (presumably) ARM it would be possible to get things going even faster than these nodes currently allow, as bounds checks could be hoisted out of loops and instructions cut from the loads/stores. Saving that for a followup, maybe.
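To illustrate the fallibility distinction (a sketch of my own, not code from the patch): in plain JS an out-of-bounds typed array load must produce undefined, so the generic Ion load needs a bailout path, while a load whose result is immediately coerced to int32 can execute infallibly, the way asm.js heap loads do.

```javascript
// Illustration only: why an int32-coerced typed array load is infallible.
var heap = new Int32Array(1024);

function fallible(i) {
  return heap[i];     // out of bounds -> undefined; needs a bailout path
}

function infallible(i) {
  return heap[i] | 0; // out of bounds -> undefined | 0 === 0; no bailout
}
```

Here `fallible(2000)` yields undefined while `infallible(2000)` yields 0, so the coerced form never needs to leave int32 code.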
Attachment #740153 - Flags: review?(luke)
Attachment #740153 - Flags: review?(jdemooij)
Comment on attachment 740153 [details] [diff] [review]
patch

Optimizing typed array accesses in general sounds great, but I'd rather not reuse the AsmJS*Heap nodes, since doing so makes them more complicated. Instead, can you add new nodes or extend the existing *TypedArray* nodes? If there is duplication in the backend codegen, it can be factored out.
Attachment #740153 - Flags: review?(luke) → review-
Attached patch x86 patch (obsolete) (deleted) — Splinter Review
Well, this patch adds {Load,Store}TypedArrayElementStatic nodes which are identical in behavior to AsmJS{Load,Store}Heap except for how the typed array is baked in. It's also twice as large, thanks to all the new boilerplate.
Attachment #740153 - Attachment is obsolete: true
Attachment #740153 - Flags: review?(jdemooij)
Attachment #740887 - Flags: review?(luke)
Attachment #740887 - Flags: review?(jdemooij)
Comment on attachment 740887 [details] [diff] [review]
x86 patch

Review of attachment 740887 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/IonBuilder.cpp
@@ +6201,5 @@
> +    // conversion, so that out of bounds accesses do not need to bail out.
> +    bool intConversion;
> +    if (*next == JSOP_POS)
> +        intConversion = false;
> +    else if (*next == JSOP_ZERO && *(next + JSOP_ZERO_LENGTH) == JSOP_BITOR)

Your overall stated goal, which I like, is to optimize these patterns on general (non-asm.js) principles, but this bytecode pattern matching contradicts that. For one thing, it matches a pattern narrower than asm.js: loads produce the type "intish" or "doublish", which can be consumed by many operations that perform ToInt32/ToNumber on their operand. It seems like you'd want to hook into the more general truncation analysis that is already performed.
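For reference, the coercion shapes in question can be sketched in plain JS (variable names are mine; "intish" and "doublish" are the asm.js type-system terms):

```javascript
// Sketch of the bytecode shapes the pattern match covers, versus a
// truncated use that this exact shape would miss.
var f64 = new Float64Array(16);
var i32 = new Int32Array(16);

var a = +f64[0];          // JSOP_POS: ToNumber coercion ("doublish" consumer)
var b = i32[0] | 0;       // JSOP_ZERO + JSOP_BITOR: ToInt32 coercion ("intish")
var c = (i32[0] + 1) | 0; // also truncated, but not this exact bytecode shape
```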
Attachment #740887 - Flags: review?(luke)
(In reply to Brian Hackett (:bhackett) from comment #2)
> Created attachment 740887 [details] [diff] [review]
> x86 patch
>
> Well, this patch adds {Load,Store}TypedArrayElementStatic nodes which are
> identical in behavior to AsmJS{Load,Store}Heap except for how the typed
> array is baked in. It's also twice as large, thanks to all the new
> boilerplate.

I volunteer to implement the ARM support if this helps.

Forgive my ignorance, but it appears that the JIT code is specialized on the typed array size, and that this is given by ins->mir->length()?

Could I also ask whether constant indexes use these code paths? Bug 865516 is optimizing out the bounds checks for small constant indexes in asm.js code. If the array length is known when compiling the JIT code and the index is constant, then the check might be avoided. Further, is the index variable's range type information available to the backend to make code-choice decisions?
Attached patch x86 patch (deleted) — Splinter Review
I think my overall stated goal is that code should run at the same speed whether asm.js is used or not, but, yeah, being robust against the patterns asm.js matches is part of that.

This patch allows LoadTypedArrayElementStatic to be fallible. The fallible and infallible versions should run at about the same speed, provided that the associated snapshots are not keeping things alive longer than necessary (another TODO to get to soon). Both the truncation analysis and the bytecode sniffing are used to mark loads as infallible. The truncation analysis is not sufficient by itself because (a) it doesn't do anything with general numeric conversions, and (b) it interacts badly with the expression folding done in GVN, so the 'x[y]' in 'x[y] | 0' will not usually be marked as truncated. The latter is a more involved fix and outside the scope of this patch, but since it also causes the 'x + y' in '(x + y) | 0' to usually require overflow checks, it's another upcoming TODO.
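A sketch of the '(x + y) | 0' case mentioned above (illustrative only): the | 0 discards any overflow from the add, so an overflow check on the add is semantically unnecessary, but a truncation analysis that loses this fact under GVN folding will still emit one.

```javascript
// The | 0 truncates the double-valued intermediate back to int32,
// so the add could be compiled without an overflow check.
function addTrunc(x, y) {
  return (x + y) | 0;
}
```

For example, addTrunc(0x7fffffff, 1) wraps to -2147483648 rather than ever observing the overflowed double.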
Attachment #740887 - Attachment is obsolete: true
Attachment #740887 - Flags: review?(jdemooij)
Attachment #741817 - Flags: review?(luke)
Attachment #741817 - Flags: review?(jdemooij)
(In reply to Douglas Crosher [:dougc] from comment #4)
> I volunteer to implement the ARM support if this helps.

Sure, this would be great.

> Forgive my ignorance, but it appears that the JIT code is
> specialized on the typed array size and this is given by the
> ins->mir->length()?

Yes, the typed array being specialized on is statically known; ins->mir->length() is the total length (in bytes) of the array.

> Could I also ask if constant indexes use these code paths?
> Bug 865516 is optimizing out the bounds checks for small
> constant indexes in asm.js code.
>
> If the array length is known when compiling the JIT code
> and the index is constant then the check might be avoided.
> Further, is the index variable range type information
> available to the backend to make code choice decisions?

As with AsmJSLoadHeap, LoadTypedArrayElementStatic uses a register to hold the pointer, but by inspecting the MIR you can determine whether that register will always hold a constant value. The backend has range information for the index, which you can use to eliminate the bounds checks even if the index is not constant.
Thank you for the information.

The ARM bounds check implementation uses a logical shift to mask the index and only works for lengths of 2^n. Asm.js imposes some restrictions on the heap size. Would you be happy with the compiler giving up on the optimization if the length is not 2^n for the ARM?

The x64 implementation for asm.js requires the heap array to be prepared - mapped to a specific layout in memory. Did you really want to attempt this? Might it be possible to specialize the JIT code on the array being already prepared? How would the array get prepared? Alternatively the x64 could just copy the x86 approach.
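A rough model (my assumption about what the shift-based check computes) of the 2^n bounds check described above: for a length of 2^n, `index >>> n` is zero exactly when 0 <= index < 2^n, so a single unsigned shift and zero test replaces a compare against the length and also rejects negative indexes.

```javascript
// Shift-based bounds check for a power-of-two length 2^log2Length.
function inBounds(index, log2Length) {
  return (index >>> log2Length) === 0; // >>> treats negatives as large uint32s
}
```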
Attachment #741817 - Flags: review?(luke) → review+
(In reply to Douglas Crosher [:dougc] from comment #7)
> The ARM bounds check implementation uses a logical shift to
> mask the index and only works for lengths of 2^n. Asm.js
> imposes some restrictions on the heap size. Would you be
> happy with the compiler giving up on the optimization if
> the length is not 2^n for the ARM?

This would be fine.

> The x64 implementation for asm.js requires the heap array
> to be prepared - mapped to a specific layout in memory.
> Did you really want to attempt this? Might it be possible
> to specialize the JIT code on the array being already
> prepared? How would the array get prepared? Alternatively
> the x64 could just copy the x86 approach.

I think the x64 layout could be used, though it would be more invasive to the engine in general. Right now I'm mostly interested in x86 and ARM performance, since those are the architectures used by the vast majority of our users currently.
Comment on attachment 741817 [details] [diff] [review]
x86 patch

Review of attachment 741817 [details] [diff] [review]:
-----------------------------------------------------------------

I'm a bit uneasy about making major optimizations x86-only (most of our OS X and Linux users do use x64 builds); please file a bug for adding x64 support.

::: js/src/ion/Lowering.cpp
@@ +2092,5 @@
>
>  bool
> +LIRGenerator::visitLoadTypedArrayElementStatic(MLoadTypedArrayElementStatic *ins)
> +{
> +    LLoadTypedArrayElementStatic *lir =

Nit: JS_ASSERT(ins->ptr()->type() == MIRType_Int32);
Attachment #741817 - Flags: review?(jdemooij) → review+
If no one beats me to it, I'll take care of the x64 backend support by adapting the x86 code. This will allow x64 to also use this optimization for an arbitrary array length. The ARM support will also be written to work with an arbitrary length, but it will be a little faster for a length of 2^n.

Supporting this optimization for a length that is not just 2^n might help support a common pattern. If the length is 2^n+g, then a 2^n-1 mask of a pointer index at the entry to a function would declare a useful range for the pointer and allow the bounds checks to be optimized away even in the case of small positive offsets, up to g. This would fit the pattern of a pointer to a structure stored in the array.

It would help if an 'inRange' flag could be added to the MIR object, because all the backends will want to know whether the index is known to be within the extent of the array length so that they can avoid the bounds check in that case. This could just as well be a follow-up patch.
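The 2^n+g pattern described above might look like this in JS (all names and constants here are mine, purely for illustration):

```javascript
var N = 1 << 16;                  // power-of-two portion of the length
var G = 16;                       // slack 'g' for struct fields
var heap = new Int32Array(N + G); // length 2^n + g

function readField(ptr, field) {  // field is a small constant in [0, G]
  ptr = ptr & (N - 1);            // entry mask: ptr now ranges over [0, N-1]
  return heap[ptr + field] | 0;   // ptr + field < N + G: check removable
}
```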
(In reply to Douglas Crosher [:dougc] from comment #10) > Supporting this optimization for a length that is not just 2^n might help > support a common pattern. If the length is 2^n+g then a 2^n-1 mask of a > pointer index at the entry to a function would declare a useful range for > the pointer and allow the bounds checks to be optimized away even in the > case of small positive offsets, up to 'g'. This would fit a pattern of a > pointer to a structure stored in the array. We already consolidate and loop hoist bounds checks for the other kinds of array and typed array access ops. I'm planning to expand that to these static accesses in a followup; this should improve both the 'embedded structure with many nearby offsets' and the 'embedded array within the typed array' cases. The relevant code is around MBoundsCheck and friends if you're interested.
It looks like this regressed Octane-mandreel on AWFY by at least 5%. Can you take a look?
Depends on: 866784
Attached patch ARM support (deleted) — Splinter Review
This patch adds support for the ARM backend. It handles an arbitrary array length, but will be a little faster for a length of 2^n. The code paths have been tested, but this is not proposed for commit yet because the performance difference is hard to measure. A constant is currently used for the array base, and this requires two instructions, or a constant pool load, on ARM; it might be better to keep the base in a register there. The use of a shift and zero test for the 2^n bounds check could be implemented in general code as an optimization for general comparisons, which might improve performance a little even without this patch. The patch takes advantage of the ARM's conditional instructions, and more general support for these in the ARM assembler could help code in general.
(In reply to Jan de Mooij [:jandem] from comment #14)
> It looks like this regressed Octane-mandreel on AWFY by at least 5%. Can
> you take a look?

Any news on this? The regression on mandreel is still visible on AWFY and is likely affecting various non-asm.js emscripten and mandreel code on the web.
I compared outputs with -D and after this bug we are executing about 8% more Ion instructions on mandreel than before. The actual LIR opcode counts are unchanged except for the ones related to the typed array accesses, so this is just a benchmark where the asm.js loads/stores are less efficient than in the usual typed array case. That seems likely to be because we can hoist/consolidate bounds checks for normal typed array accesses, while asm.js heap accesses (on x86/arm at least) always require bounds checks. For mandreel, we used to eliminate 2/3 of bounds checks with these optimizations; now that is near zero. Fixing this and even improving on the earlier state will happen before long (see comment 11) but right now I'm more interested in other parity issues; if you're happy with asm.js x86 perf as is, I don't see a reason to rush a fix in.
asm.js x86 bounds checking was shoehorned in at the last minute when I had to rip out the use of segmentation because of Win64, so I'm not surprised it's worse than bounds-check hoisting.
(In reply to Brian Hackett (:bhackett) from comment #17)
> I compared outputs with -D and after this bug we are executing about 8% more
> Ion instructions on mandreel than before. The actual LIR opcode counts are
> unchanged except for the ones related to the typed array accesses, so this
> is just a benchmark where the asm.js loads/stores are less efficient than in
> the usual typed array case. That seems likely to be because we can
> hoist/consolidate bounds checks for normal typed array accesses, while
> asm.js heap accesses (on x86/arm at least) always require bounds checks.
> For mandreel, we used to eliminate 2/3 of bounds checks with these
> optimizations; now that is near zero. Fixing this and even improving on the
> earlier state will happen before long (see comment 11) but right now I'm
> more interested in other parity issues; if you're happy with asm.js x86 perf
> as is, I don't see a reason to rush a fix in.

Sounds fine to me; this is important but not urgent IMO.
Depends on: 891400
Assignee: general → nobody
Depends on: 1132290
Is there anything left to do here for now, or can we close this bug?
Flags: needinfo?(bhackett1024)
Priority: -- → P5
We should be doing this for x86 now; any ARM support should be done in another bug.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(bhackett1024)
Resolution: --- → FIXED