Optimize call_indirect calling sequence
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
 | Tracking | Status
firefox97 | --- | fixed
People
(Reporter: lth, Assigned: lth)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(4 files)
With the call_indirect thunks having landed and apparently stuck (bug 1639153, bug 1742053), a plausible null check elimination appearing on bug 1709578, and better bounds check elimination being a small matter of programming (bug 1625891), the call is pretty much reduced to "load type descriptor immediate" + "load code pointer" + "call indirect".
The remaining code for loading the code pointer and calling is not great; there are several things that can be done:
Constant optimization: If the table index is known, we don't need to scale it at runtime; we can scale it at compile time (and I believe we also don't need to pass it in the table call index register, so we save an immediate load there too). It would be nice to do a corpus analysis to find out how often the table index is known, but it's a small investment to just make the change. (A small sketch follows at the end of this description.)
Faster scaling: The current code is x86-oriented because it special-cases scaling by 8 (which can be an LEA by means of a BaseIndex) while everything else is shunted through a shift + add operation.
- On ARM and ARM64 we can merge the shift and the add with a shifted-add instruction.
- On x64 we can also use LEA if we rearrange the table: store all the code pointers together in the first half of the table and the tls pointers (which we still need) together in the second half. The tls pointers are no longer needed during the call itself, so it's fine if an index has to be recomputed to get a tls value when we do table.get and similar. Now the code pointers are 8 bytes apart and we can use LEA.
Merge the code pointer load and the call: On x86 and x64, with the offset into the table represented by a scalable value, we should be able to use a call with a BaseIndex operand.
If necessary, we split the code sequence for asm.js off from wasm so as not to break anything, but we can do much better for wasm than we currently do.
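For illustration, here is a minimal C++ sketch (not SpiderMonkey code; the names are made up) of why a known table index is cheaper: the scaled offset becomes a compile-time immediate and no index register needs to be populated.

```cpp
#include <cstdint>

using CodePtr = void (*)();

// Dynamic index: the byte offset must be computed at runtime, e.g. as a
// shift+add or via a scaled addressing mode.
inline CodePtr loadCodePtrDynamic(CodePtr* codeptrs, uint32_t index) {
  return codeptrs[index];  // load [codeptrs + index*8]
}

// Constant index: the offset Index*8 folds into the load as an immediate
// displacement, and no index register has to be set up at all.
template <uint32_t Index>
inline CodePtr loadCodePtrConstant(CodePtr* codeptrs) {
  return codeptrs[Index];  // load [codeptrs + (Index*8)]
}
```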
Assignee
Comment 1 • 3 years ago
Rollup for Ryan's pending work, DO NOT LAND
Assignee
Comment 2 • 3 years ago
Remove the FunctionTableElem struct entirely and replace the
"functions_" array of these structs in the Table with two arrays
"codeptrs_" and "tlsptrs_" that are independently allocated.
The purpose of the change is to enable the call-indirect sequence to
always directly load the code pointer from the table with a BaseIndex
scale-by-pointer on every platform, tightening that code sequence
considerably on all platforms, albeit in slightly different ways.
The change sets us up for at least two future optimizations:
- Calls to constant indices can be optimized further on all platforms, since the table offset can be precomputed.
- Once the null check in the call-indirect sequence is removed, the code pointer load can be further folded into the call instruction on x86/x64.
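A rough sketch (assumed field and type names, not the actual Table definition) of the split-array layout this patch describes and the scale-by-pointer load it enables:

```cpp
#include <cstdint>
#include <vector>

using CodePtr = void (*)();
struct TlsData;  // stand-in for the per-instance tls block

// Before: one array of two-word elements, so code pointers were two words
// apart and loading one required extra scaling work on some platforms.
struct FunctionTableElemOld {
  CodePtr code;
  TlsData* tls;
};

// After: two independently allocated parallel arrays. Code pointers are now
// sizeof(void*) apart, so the call sequence can fetch codeptrs_[index] with
// a single scale-by-pointer BaseIndex load on every platform.
struct TableSketch {
  std::vector<CodePtr> codeptrs_;
  std::vector<TlsData*> tlsptrs_;

  CodePtr codeForCall(uint32_t index) const {
    return codeptrs_[index];  // load [codeptrs_ + index*8] on a 64-bit target
  }
};
```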
Assignee
Comment 3 • 3 years ago
This fuses the code pointer load into the call instruction on x64,
which we can do because we have a BaseIndex form of the call.
Performance-wise this does not move the needle on my x64 Xeon system,
though it probably makes the code slightly denser.
Added some notes about what we can do if the index is constant - we
don't need to load it into the index register, and slightly denser
code can be generated for the load + call - but before I go further in
that direction I want to do a corpus analysis of actual wasm code to
see how common this case is. (This also plays into bounds check
frequency and BCE strategies.)
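As a minimal illustration (plain C++, not the actual jit code) of what the fusion described above buys: on x64 the call instruction can take a memory operand, so the code-pointer load and the call collapse into a single instruction.

```cpp
#include <cstdint>

using CodePtr = void (*)();

inline void callIndirect(CodePtr* codeptrs, uint32_t index) {
  // Unfused sequence: mov rax, [codeptrs + index*8] ; call rax
  // Fused sequence:   call [codeptrs + index*8]
  codeptrs[index]();
}
```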
Assignee
Comment 4 • 3 years ago
Split the asm.js indirect calling code out from the wasm ditto to open
up for simplifications of the latter.
Improve wasm call_indirect bounds checking in two ways (a small sketch follows below):
- if the table length is known because min=max (it appears this is common), make the bounds check use the known length as an immediate, and not load the length from globals.
- if the table length is not known, emit a compare-with-memory rather than a load-from-memory-and-compare, as this will result in slightly more compact code on Intel and should be neutral on all other platforms.
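A minimal sketch of the two strategies, assuming hypothetical names (TableInstanceData and its field layout) rather than the actual globals representation:

```cpp
#include <cstdint>

struct TableInstanceData {  // hypothetical stand-in for the table's entry in globals
  uint32_t length;
  void** codeptrs;
};

// min == max: the length is a compile-time constant, so the check compares
// the index against an immediate and touches no memory at all.
inline bool boundsCheckKnownLength(uint32_t index, uint32_t knownLength) {
  return index < knownLength;  // cmp index, #knownLength ; branch-if-out-of-bounds
}

// Length not known at compile time: compare the index directly against the
// in-memory length (cmp reg, [mem]) instead of loading the length into a
// register and then comparing.
inline bool boundsCheckDynamicLength(uint32_t index, const TableInstanceData* table) {
  return index < table->length;  // cmp index, [table + offsetof(length)]
}
```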
TODO:
- corpus analysis to check that known table length is actually a thing
- benchmarking to ensure that comparing to a constant length does not regress anything; it might, because more code may be emitted
Assignee
Comment 5 • 3 years ago
Profiling a bunch of wasm content from the web (cf comment 4):
- most content has a fixed table upper bound, which is then invariably the same as the lower bound
- all fixed upper bounds seen fit in 16 bits and can be loaded with a single MOVZ on arm64, so this clearly beats loading the table limit from memory
- some fixed upper bounds fit in the unsigned 12-bit immediate field of a CMP on arm64, which is even better, but generally many don't
- autocad is an outlier (no upper bound, 93K elements in the table, YOW!)
On balance, I would say that it is likely to pay off a little to special-case the known table bound case to avoid loading the table bound from memory on every indirect call.
If, on ARM64, we are worried about the cost of loading a large constant (two instructions) relative to loading the limit from memory (one instruction), we can limit the optimization to the case where at most a MOVZ is needed.
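A small sketch (hypothetical helper, not jit code) of the encoding cases this weighs: bounds that fit the unsigned 12-bit CMP immediate cost nothing extra, 16-bit bounds need one MOVZ, and anything larger takes more than one instruction to materialize.

```cpp
#include <cstdint>

enum class KnownBoundCost {
  CmpImmediate,      // cmp wN, #bound          (bound fits in 12 bits)
  MovzThenCmp,       // movz wM, #bound ; cmp   (bound fits in 16 bits)
  MultiInstruction,  // movz/movk ... ; cmp     (anything larger)
};

inline KnownBoundCost classifyKnownBound(uint64_t bound) {
  if (bound < (uint64_t(1) << 12)) return KnownBoundCost::CmpImmediate;
  if (bound < (uint64_t(1) << 16)) return KnownBoundCost::MovzThenCmp;
  return KnownBoundCost::MultiInstruction;
}
```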
Comment 7 • 3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/92e3147368be
https://hg.mozilla.org/mozilla-central/rev/76819a3bf82d