Optimize call_indirect calling sequence
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
 | Tracking | Status
firefox97 | --- | fixed
People
(Reporter: lth, Assigned: lth)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(4 files)
With the call_indirect thunks having landed and apparently stuck (bug 1639153, bug 1742053), a plausible null check elimination appearing on bug 1709578, and better bounds check elimination being a small matter of programming (bug 1625891), the call is pretty much reduced to "load type descriptor immediate" + "load code pointer" + "call indirect".
The remaining code for loading the code pointer and calling is not great; there are several things that can be done:
Constant optimization: If the table index is known, we don't need to scale it at runtime; we can scale it at compile time (and I believe we also don't need to pass it in the table call index register, so we save an immediate load there too). It would be nice to do a corpus analysis to find out how often the table index is known, but it's a small investment to just make the change. (A small sketch follows at the end of this description.)
Faster scaling: The current code is x86-oriented because it special-cases scaling by 8 (which can be an LEA by means of a BaseIndex) while everything else is shunted through a shift + add operation.
- On ARM and ARM64 we can merge the shift and the add with a shifted-add instruction.
- On x64 we can also use LEA if we rearrange the table: store all the code pointers together in the first half of the table and the tls pointers (which we still need) together in the second half. The tls pointers are no longer needed during the call itself, so it's fine if an index has to be recomputed to get a tls value when we do table.get and similar. Now the code pointers are 8 bytes apart and we can use LEA.
Merge the code pointer load and the call: On x86 and x64, with the offset into the table represented by a scalable value, we should be able to use a call with a BaseIndex operand.
If necessary, we split the code sequence for asm.js off from wasm so as not to break anything, but we can do much better for wasm than we currently do.
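For illustration, here is a minimal C++ sketch (not SpiderMonkey code; the names are made up) of why a known table index is cheaper: the scaled offset becomes a compile-time immediate and no index register needs to be populated.

```cpp
#include <cstdint>

using CodePtr = void (*)();

// Dynamic index: the byte offset must be computed at runtime, e.g. as a
// shift+add or via a scaled addressing mode.
inline CodePtr loadCodePtrDynamic(CodePtr* codeptrs, uint32_t index) {
  return codeptrs[index];  // load [codeptrs + index*8]
}

// Constant index: the offset Index*8 folds into the load as an immediate
// displacement, and no index register has to be set up at all.
template <uint32_t Index>
inline CodePtr loadCodePtrConstant(CodePtr* codeptrs) {
  return codeptrs[Index];  // load [codeptrs + (Index*8)]
}
```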
Assignee
Comment 1 • 3 years ago
Rollup for Ryan's pending work, DO NOT LAND
Assignee
Comment 2 • 3 years ago
Remove the FunctionTableElem struct entirely and replace the
"functions_" array of these structs in the Table with two arrays
"codeptrs_" and "tlsptrs_" that are independently allocated.
The purpose of the change is to enable the call-indirect sequence to
always directly load the code pointer from the table with a BaseIndex
scale-by-pointer on every platform, tightening that code sequence
considerably on all platforms, albeit in slightly different ways.
The change sets us up for at least two future optimizations:
- Calls to constant indices can be optimized further on all platforms, since the table offset can be precomputed.
- Once the null check in the call-indirect sequence is removed, the code pointer load can be further folded into the call instruction on x86/x64.
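A rough sketch (assumed field and type names, not the actual Table definition) of the split-array layout this patch describes and the scale-by-pointer load it enables:

```cpp
#include <cstdint>
#include <vector>

using CodePtr = void (*)();
struct TlsData;  // stand-in for the per-instance tls block

// Before: one array of two-word elements, so code pointers were two words
// apart and loading one required extra scaling work on some platforms.
struct FunctionTableElemOld {
  CodePtr code;
  TlsData* tls;
};

// After: two independently allocated parallel arrays. Code pointers are now
// sizeof(void*) apart, so the call sequence can fetch codeptrs_[index] with
// a single scale-by-pointer BaseIndex load on every platform.
struct TableSketch {
  std::vector<CodePtr> codeptrs_;
  std::vector<TlsData*> tlsptrs_;

  CodePtr codeForCall(uint32_t index) const {
    return codeptrs_[index];  // load [codeptrs_ + index*8] on a 64-bit target
  }
};
```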
Assignee
Comment 3 • 3 years ago
This fuses the code pointer load into the call instruction on x64,
which we can do because we have a BaseIndex form of the call.
Performance-wise this does not move the needle on my x64 Xeon system,
though it probably makes the code slightly denser.
Added some notes about what we can do if the index is constant - we
don't need to load it into the index register, and slightly denser
code can be generated for the load + call - but before I go further in
that direction I want to do a corpus analysis of actual wasm code to
see how common this case is. (This also plays into bounds check
frequency and BCE strategies.)
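As a minimal illustration (plain C++, not the actual jit code) of what the fusion described above buys: on x64 the call instruction can take a memory operand, so the code-pointer load and the call collapse into a single instruction.

```cpp
#include <cstdint>

using CodePtr = void (*)();

inline void callIndirect(CodePtr* codeptrs, uint32_t index) {
  // Unfused sequence: mov rax, [codeptrs + index*8] ; call rax
  // Fused sequence:   call [codeptrs + index*8]
  codeptrs[index]();
}
```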
Assignee
Comment 4 • 3 years ago
Split the asm.js indirect calling code out from the wasm ditto to open
up for simplifications of the latter.
Improve wasm call_indirect bounds checking in two ways (a small sketch follows below):
- if the table length is known because min=max (it appears this is common), make the bounds check use the known length as an immediate, and not load the length from globals.
- if the table length is not known, emit a compare-with-memory rather than a load-from-memory-and-compare, as this will result in slightly more compact code on Intel and should be neutral on all other platforms.
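A minimal sketch of the two strategies, assuming hypothetical names (TableInstanceData and its field layout) rather than the actual globals representation:

```cpp
#include <cstdint>

struct TableInstanceData {  // hypothetical stand-in for the table's entry in globals
  uint32_t length;
  void** codeptrs;
};

// min == max: the length is a compile-time constant, so the check compares
// the index against an immediate and touches no memory at all.
inline bool boundsCheckKnownLength(uint32_t index, uint32_t knownLength) {
  return index < knownLength;  // cmp index, #knownLength ; branch-if-out-of-bounds
}

// Length not known at compile time: compare the index directly against the
// in-memory length (cmp reg, [mem]) instead of loading the length into a
// register and then comparing.
inline bool boundsCheckDynamicLength(uint32_t index, const TableInstanceData* table) {
  return index < table->length;  // cmp index, [table + offsetof(length)]
}
```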
TODO:
- corpus analysis to check that known table length is actually a thing
- benchmarking to ensure that comparing to a constant length does not regress anything; it might, because more code may be emitted
Assignee
Comment 5 • 3 years ago
Profiling a bunch of wasm content from the web (cf comment 4):
- most content has a fixed table upper bound, which is then invariably the same as the lower bound
- all fixed upper bounds seen fit in 16 bits and can be loaded with a single MOVZ on arm64, so this clearly beats loading the table limit from memory
- some fixed upper bounds fit in the unsigned 12-bit immediate field of a CMP on arm64, which is even better, but generally many don't
- autocad is an outlier (no upper bound, 93K elements in the table, YOW!)
On balance, I would say that it is likely to pay off a little to special-case the known table bound case to avoid loading the table bound from memory on every indirect call.
If, on ARM64, we are worried about the cost of loading a large constant (two instructions) relative to loading the limit from memory (one instruction), we can limit the optimization to the case where at most a MOVZ is needed.
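A small sketch (hypothetical helper, not jit code) of the encoding cases this weighs: bounds that fit the unsigned 12-bit CMP immediate cost nothing extra, 16-bit bounds need one MOVZ, and anything larger takes more than one instruction to materialize.

```cpp
#include <cstdint>

enum class KnownBoundCost {
  CmpImmediate,      // cmp wN, #bound          (bound fits in 12 bits)
  MovzThenCmp,       // movz wM, #bound ; cmp   (bound fits in 16 bits)
  MultiInstruction,  // movz/movk ... ; cmp     (anything larger)
};

inline KnownBoundCost classifyKnownBound(uint64_t bound) {
  if (bound < (uint64_t(1) << 12)) return KnownBoundCost::CmpImmediate;
  if (bound < (uint64_t(1) << 16)) return KnownBoundCost::MovzThenCmp;
  return KnownBoundCost::MultiInstruction;
}
```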
Comment 7 • 3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/92e3147368be
https://hg.mozilla.org/mozilla-central/rev/76819a3bf82d