Closed Bug 1316808 Opened 8 years ago Closed 3 years ago

Wasm baseline: Allocate some locals to registers (investigation)

Categories

(Core :: JavaScript: WebAssembly, task, P5)

task

Tracking

()

RESOLVED INACTIVE

People

(Reporter: lth, Unassigned)

References

Details

Attachments

(6 files)

Try to allocate some locals to registers on machines where there are enough registers. This is probably hard to do well in a one-pass compiler but it might be that just keeping register arguments and the first few locals in registers is a viable strategy; another (more general) strategy is caching locals in registers in straight-line code. Such caching could also track constant values in registers, if that is deemed valuable. A combination of techniques may be desirable: parameters and the first few locals could be cached on entry to the function but not statically assigned to registers throughout. (On a large corpus of code it should be possible to compute, for every signature comprising the types of parameters and locals, and using a static weight for loops, a list in priority order of which parameters and locals that should be assigned to registers. Or something like that. Wasm makes this simple. Static assignments are desirable because they are not flushed to memory by the pre-block sync() call.) It's likely that we want to delay this until we've done more sync() optimization, notably, attemptint to reduce sync() activity in the register allocator and optimistically delaying sync() at block boundaries. Only when those have been completed will we see the true potential for allocating locals to registers.
Priority: P5 → P3
Here's a stray thought about how to make good use of locals that are cached in registers. (For this we want bug 1318654 to be finished probably, but it would work without it.) Suppose we can cheaply maintain a data structure that tracks locals that are cached in registers and which registers they are in. We now specialize (say) RegI32 and popI32(): let there be a new register type ValI32 that is a base class of RegI32, and let there be a new function popValI32() that returns a ValI32 instead of a RegI32. Now change the signatures of low-level emitters that currently take a RegI32 but do not modify that register to instead take a ValI32. Higher-level emitters that pass an argument to a ValI32 can call popValI32() instead of popI32(). Now when popValI32() wants to pop a local but that local is cached in a register, the register can be returned as-is: no load or copy is needed. popI32(), in contrast, must make a copy or can - in some circumstances, when there's no conflict - choose to destroy the caching by returning the register. The conflict alluded to in the previous paragraph occurs if we pop twice for eg (+ x x) where x is local, first as a ValI32 that returns the register of x and then as a RegI32, should the second pop, which will normally be a popI32(), choose not to make a copy but instead return the local. Food for thought. (We can track constant values in registers in the same way as locals, maybe. Worth it? Who knows. Data will tell.)
Depends on: 1316806
I performed a simple corpus analysis by counting the number of integer and floating point params and locals for the functions of Tanks, AngryBots, and ZenGarden (will attach results). For x64 and ARM64 these sums give an accurate picture of how many statically assigned registers we would need to cover all locals. Somewhat as expected, but it's nice to have this confirmed, we cover the vast majority of functions with only a few statically allocated registers. Here are the top entries for ZenGarden ("ints" are i32 and i64 params and locals; "floats" are f32 and f64 ditto; "count" is the number of functions having those characteristics): ints: 1 floats: 0 count: 18547 ints: 2 floats: 0 count: 8240 ints: 1 floats: 1 count: 7228 ints: 2 floats: 1 count: 6519 ints: 0 floats: 2 count: 5075 ints: 2 floats: 2 count: 4003 ints: 3 floats: 1 count: 3846 ints: 3 floats: 0 count: 3699 ints: 4 floats: 0 count: 2915 ints: 3 floats: 2 count: 2893 ints: 0 floats: 1 count: 2679 ints: 1 floats: 2 count: 2494 ints: 2 floats: 3 count: 2196 ints: 1 floats: 3 count: 2171 ints: 3 floats: 4 count: 1992 ints: 2 floats: 4 count: 1601 ints: 3 floats: 3 count: 1536 ints: 0 floats: 7 count: 1482 ints: 3 floats: 5 count: 1248 ints: 0 floats: 8 count: 1236 ints: 2 floats: 5 count: 1107 ints: 4 floats: 1 count: 1096 ints: 1 floats: 4 count: 1059 ints: 2 floats: 6 count: 927 ints: 5 floats: 0 count: 849 ints: 3 floats: 7 count: 849 ints: 3 floats: 6 count: 843 ints: 0 floats: 9 count: 799 ints: 0 floats: 6 count: 740 ints: 4 floats: 3 count: 734 ints: 1 floats: 5 count: 648 ints: 0 floats: 10 count: 617 ints: 2 floats: 8 count: 589 ints: 3 floats: 8 count: 583 ints: 4 floats: 2 count: 546 ints: 2 floats: 7 count: 544 ints: 0 floats: 11 count: 486 ints: 0 floats: 5 count: 468 ints: 4 floats: 4 count: 422 ints: 5 floats: 1 count: 396 ints: 0 floats: 3 count: 390 ints: 1 floats: 6 count: 375 ints: 5 floats: 2 count: 356 ints: 4 floats: 5 count: 345 ints: 2 floats: 9 count: 323 ints: 1 floats: 8 count: 314 ints: 0 floats: 12 count: 314 ints: 1 floats: 7 count: 302 ints: 1 floats: 12 count: 292 ints: 0 floats: 4 count: 285 ints: 3 floats: 9 count: 278 ints: 6 floats: 0 count: 271 If statically allocated registers can be made to coincide with parameter registers (and why could they not?) then function prologues will be shorter, there will be less load and store traffic, and there will be fewer instructions emitted. With the idea from comment 1 there will also be fewer register-register moves. Statically allocated registers with lazy restore after calls (when we must spill) is a simple bitvector problem and need not slow down compilation much at all.
Attached file Tanks-histo.txt (deleted) —
Attached file AngryBots-histo.txt (deleted) —
Attached file ZenGarden-histo.txt (deleted) —
Attached file wasm-sig (deleted) —
Corpus analysis program (python)
Attached file Cumulative-histo.txt (deleted) —
Postprocssing of the previous text files. This shows that for the trio of programs considered here, 8 statically assigned integer registers cover 99% of all functions, and 14 statically assigned float registers cover 95% of all functions (very long tail, the largest function has 661 float params + locals). (Hello, ARM64.) But even four integer registers cover 92% of all functions, and 8 floating point registers cover 88%. This is excellent news for both ARM and x64. (Output of cat AngryBots-histo.txt Tanks-histo.txt ZenGarden-histo.txt | ./cumulative-histo.py where the program follows.)
Attached file cumulative-histo.py (deleted) —
I'm backburner-working on this so I guess I'll take it.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Component: JavaScript Engine: JIT → Javascript: Web Assembly
Type: defect → enhancement
Type: enhancement → task
Assignee: lhansen → nobody
Status: ASSIGNED → NEW

We'll get to this if it becomes necessary.

Priority: P3 → P5
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: