Open Bug 1742930

[meta] Optimize call_indirect

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

People

(Reporter: lth, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: meta)

The call_indirect instruction, in its full glory, loads a function from a table (incurring a bounds check on the table) and, if the entry is not null (incurring a null check), assumes that the call may be cross-module (incurring a context switch) and that the callee is unknown (incurring the cost of an indirect call and a signature check). All of these checks can be optimized to some extent, to greater or lesser benefit. This is the top metabug for tracking all of these optimizations and related ones. It will block bug reports about our indirect calls being slower than those of asm.js or of other browsers.

Of particular interest here is bug 1340235 comment 15 et seq, which breaks down some of the work.

Once bug 1639153 lands (imminent, we hope), the most important work will be removing the null check and cleaning up the bounds check so that the normal bounds-check-elimination (BCE) machinery can go to work. It will probably also be important not to make a purely static determination about whether a table is exposed to other instances, but to allow a table to be downgraded from unexposed to exposed at runtime; this will allow more tables to remain in the unexposed state, and calls through them to avoid the indirect thunks.

There are also representational issues. Since a table-of-funcref is a 16-byte representation, a simple LEA can't be used to load the indirect callee. Getting this down to a one-word representation (which we want for fast call_ref performance anyway) will allow for a faster indirect calling sequence.
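To make the arithmetic concrete: x64 scaled addressing only supports scale factors of 1, 2, 4 and 8, so a 16-byte entry forces an explicit shift-and-add, whereas a one-word entry would fit a single scaled load. A minimal sketch (registers chosen to match the disassembly further down in this bug; not actual codegen):

;; today: 16-byte entries, and a scale factor of 16 is not encodable, so shift and add
shl $0x04, %r12d              ;; index *= 16
add %r12, %rax                ;; entry address = table base + index*16

;; with one-word (8-byte) entries, a single scaled load would do instead
movq (%rax,%r12,8), %rbx      ;; code pointer at table base + index*8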

No longer blocks: 1340235
Blocks: 1711412
No longer depends on: 1711412
Depends on: 1742936
Depends on: 1743581
Depends on: 1743586

(In reply to Lars T Hansen [:lth] from comment #0)

> There are also representational issues. Since a table-of-funcref is a 16-byte representation, a simple LEA can't be used to load the indirect callee. Getting this down to a one-word representation (which we want for fast call_ref performance anyway) will allow for a faster indirect calling sequence.

An LEA (or by extension, any scaled address) can be used on x64 by packing the table differently: all the code pointers together at the beginning, and all the tls pointers together at the end. Now code pointers are 8 bytes apart. This does not have to wait for getting function table entries down to a single word.
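For illustration only, the callee loads under that split layout might look as follows; the 0x78 and register choices are copied from the disassembly below, and TLS_AREA is a hypothetical per-table constant equal to tableLength*8 (ignoring table growth):

;; split layout sketch: all code pointers first, then all tls pointers
movq 0x78(%r14), %rax              ;; table base, as in the listing below
movq (%rax,%r12,8), %rbx           ;; code pointer: base + index*8, one scaled load
movq TLS_AREA(%rax,%r12,8), %rcx   ;; tls pointer: base + TLS_AREA + index*8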

Depends on: 1743756
Depends on: 1743760
Depends on: 1744513
Depends on: 1744711
Depends on: 1744943
Depends on: 1745170
Depends on: 1745939
Depends on: 1748794
No longer blocks: 1711412
No longer blocks: 1340235
Depends on: 1340235
No longer blocks: 1340106
Depends on: 1340106
Blocks: 1340235
No longer depends on: 1340235
Depends on: 1754377
Blocks: wasm-lang
No longer depends on: 1742832
No longer depends on: 1743581
No longer depends on: 1744513
No longer depends on: 1743756

This is what call_indirect to a variable index (which is the common case) looks like now:

;; table bounds check, known table length (the common case)
00000034  41 83 fc 02               cmp $0x02, %r12d
00000038  0f 82 02 00 00 00         jb 0x0000000000000040
0000003E  0f 0b                     ud2
;; set up signature
00000040  41 ba 03 00 00 00         mov $0x03, %r10d
;; load table pointer and compute index of callee
00000046  49 8b 46 78               movq 0x78(%r14), %rax
0000004A  41 c1 e4 04               shl $0x04, %r12d
0000004E  49 03 c4                  add %r12, %rax
;; load callee tls
00000051  48 8b 58 08               movq 0x08(%rax), %rbx
;; compare to caller tls and branch if the same (the common case)
00000055  4c 3b f3                  cmp %rbx, %r14
00000058  0f 84 40 00 00 00         jz 0x000000000000009E
;; slow path: context switch.  note the null check is implicit: a null entry faults (NPE) and is caught as a trap
0000005E  4c 89 74 24 08            movq %r14, 0x08(%rsp)
00000063  4c 8b f3                  mov %rbx, %r14
00000066  4c 89 34 24               movq %r14, (%rsp)
0000006A  4d 8b 3e                  movq (%r14), %r15
0000006D  4d 8b 66 20               movq 0x20(%r14), %r12
00000071  49 8b 5e 18               movq 0x18(%r14), %rbx
00000075  49 89 9c 24 a0 00 00 00   movq %rbx, 0xA0(%r12)
0000007D  48 8b 00                  movq (%rax), %rax
00000080  ff d0                     call %rax
00000082  4c 8b 74 24 08            movq 0x08(%rsp), %r14
00000087  4d 8b 3e                  movq (%r14), %r15
0000008A  4d 8b 56 20               movq 0x20(%r14), %r10
0000008E  4d 8b 66 18               movq 0x18(%r14), %r12
00000092  4d 89 a2 a0 00 00 00      movq %r12, 0xA0(%r10)
00000099  e9 05 00 00 00            jmp 0x00000000000000A3
;; fast path: no context switch.  note no null check required
0000009E  48 8b 00                  movq (%rax), %rax
000000A1  ff d0                     call %rax

There are a couple of things we could do here. The slow case should go out-of-line so that the fast case can just fall through (bug TBD), since forward conditional branches are statically predicted as not-taken. The bounds check should jump out-of-line to the trap (bug 1680243) for the same reason. We could hoist the table pointer load up before the signature setup so that it has more time to execute. And we could hoist the code pointer load up to just after the tls pointer load; this might benefit the fast path especially.
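A sketch of what the fast path might look like with those changes, reusing the offsets and registers from the listing above; label names and the exact scheduling are purely illustrative:

;; bounds check with the trap out of line, table pointer load hoisted early
movq 0x78(%r14), %rax        ;; table pointer
cmp $0x02, %r12d
jae .Ltrap                   ;; out-of-line trap stub (hypothetical label)
shl $0x04, %r12d
add %r12, %rax
movq 0x08(%rax), %rbx        ;; callee tls
movq (%rax), %rax            ;; code pointer load hoisted next to the tls load
mov $0x03, %r10d             ;; signature setup
cmp %rbx, %r14
jnz .Lcontext_switch         ;; out-of-line slow path (hypothetical label)
call %rax                    ;; fast path falls through to the call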

Clearly the shift could be done before the load of the table pointer, and on x64 the table pointer could then be added from memory into the shifted value, provided the shifted value ends up in the register that is to hold the entry address. This might reduce code size a little, though in practice code size doesn't seem to be an issue: there aren't that many call_indirect sites (there should tend to be many more callees than callers).
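Concretely, a sketch of that folding, with the same offsets as above:

shl $0x04, %r12d             ;; scale the index first
add 0x78(%r14), %r12         ;; add the table pointer from memory into the shifted index
movq 0x08(%r12), %rbx        ;; callee tls, as before, now addressed off %r12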

The code pointer load can be folded into the call, if we don't hoist it. This too would reduce code size but might affect the CPU's ability to schedule things.
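That is, a sketch of the folded form, in standard AT&T syntax (the disassembly above omits the '*'):

call *(%rax)                 ;; memory-indirect call replaces: movq (%rax), %rax / call %rax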

Priority: P2 → P3
Depends on: 1680243
Depends on: 1755625
No longer depends on: 1680243