Baldr: Bounds check elimination for indirect calls
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
()
People
(Reporter: lth, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
(deleted),
text/x-phabricator-request
|
Details |
Consider:
(module
(type $t0 (func (result i32)))
(func $f1 (result i32) (i32.const 0))
(func $f2 (result i32) (i32.const 1))
(table $tbl funcref (elem $f1 $f2))
(func $f3 (param i32) (result i32)
(call_indirect (type $t0) $tbl (local.get 0))
drop
(call_indirect (type $t0) $tbl (local.get 0))))`));
The ion code generated for the two indirect calls shows that we bounds check the index against the table length in both calls, yet the index has not changed and tables cannot shrink, so the second test must pass if the first one did. To do better, the bounds check may have to be open-coded so that Ion can see it. At the moment, there is ad-hoc BCE in LIRGenerator::visitWasmCall where we remove the bounds check if the index is constant and below the table minimum, but we want something better.
Reporter | ||
Comment 1•3 years ago
|
||
It would be useful to have an idea about the impact of this optimization before trying to do the work. One thing that's easy to measure is the static ratio of variable-index accesses to constant-index accesses; slightly harder but much more interesting would be the dynamic ratio. It will be hard to generalize this: source language, compilation strategy, optimization level, and the program itself all play a part. But it doesn't hurt to look at code that people care about, such as autocad, zoom, pspdfkit, and probably some rust codebase, tbd. It doesn't have to be a web-only codebase, either.
Reporter | ||
Comment 2•3 years ago
|
||
The evidence is fairly strong that the non-constant-index indirect calls are the rule (profiling code to appear):
web.autocad.com, the roller assembly sample + a single rectangular selection before quitting:
Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 5473719
pspdfkit.com, just click around in the standalone demo for a little while:
Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 7346718
ruffle.rs (flash emulator), running a demo:
Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 6816046
At the same time, 7M calls over, say, 10s is nothing, so these applications will not benefit much from an optimization unless those calls are very heavily clustered.
Raybench, from bug 1340106, might benefit more, though rendering time here was 8.3s so this is just 45M indirect calls per second:
Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 369114997
The benchmark on bug 1340235 is now a 404, unfortunately.
Reporter | ||
Comment 3•3 years ago
|
||
Run this with --disable-e10s in the browser, since the profiling data
are captured only in the main process.
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 4•3 years ago
|
||
Another run on oskarstalberg.com/Townscaper gives broadly consistent results: while during warmup many bounds checks can be eliminated, as the game progresses all calls are to variable indices.
Reporter | ||
Comment 5•3 years ago
|
||
Over in bug 1340235, I noticed that the table has known length - initial and max are the same. We should find out if this is common, because if it is then the table bound can be baked into the table bounds check, it need not be loaded from tls. (Edit: I'm doing this particular change as part of bug 1743586.)
Edit: According to bug 1743586 comment 5, known table length (min==max) is the common case, and code has landed to bake a known limit into the bounds check.
Reporter | ||
Comment 6•3 years ago
|
||
I think the easiest test on whether it pays off to optimize table bounds checks more is to turn them off completely and run some benchmarks and some content, if we can find content where we will be able to observe the difference.
Obvious cases:
- the fib benchmark from bug 1340235
- the raybench benchmark from bug 1340106
If they don't show improvements with bounds checks eliminated then nothing will and we should be done. If they show huge improvements then we should consider further work. We should check this on multiple microarchitectures.
Reporter | ||
Comment 7•3 years ago
|
||
The fib benchmark performs 10 indirect calls in a row in the base case of the recursion, so disabling bounds checks should help quite a bit. It also has the most to gain from BCE, since all calls use the same known table index. BCE would eliminate the last nine checks in this innermost block.
The raybench benchmarks traverses a scene graph and calls a virtual intersect method on each object. Most of these do a fair amount of computation. The index is usually a constant vtable index off a variable index representing the start of the object's vtable, the latter index is loaded from the object (and emscripten then masks it, at least in the code for this benchmark). There should be some benefit from disabling bounds checking, but BCE would probably have a hard time removing the checks in practice.
Results, showing running time reduction for removing the bounds check altoghether:
fib arm64/Apple-M1: 6%
fib x64/Intel-i7: 20% (yeah, actually)
raybench arm64/Apple-M1: 5%
raybench x64/intel-i7: 2% (if we're charitable)
The advantage of removing bounds checks might be reduced if we rewrote our code generator so that we would jump to an out-of-line trap and fall through to success code, as conditional forward branches are statically predicted as not taken (for sure on Intel but I believe even on the M1).
Reporter | ||
Comment 8•3 years ago
|
||
I don't think there's much to be gained from fixing this in practice.
Description
•