[meta] Redundant MoveGroups and other Move badness
Categories
(Core :: JavaScript Engine: JIT, enhancement, P3)
Tracking
()
People
(Reporter: lth, Unassigned)
References
(Depends on 3 open bugs, Blocks 3 open bugs)
Details
(Keywords: meta)
A little too often I observe that the regalloc inserts apparently redundant MoveGroups when inputs and outputs to an operation are constrained (param registers, output registers). Frequently this gives rise to code that moves redundantly via a temp that is then immediately dead. I'll dig up some examples and make dependent bugs.
These kinds of redundancies matter more for wasm than for JS probably, but it will increase register pressure and instruction counts for all.
Reporter | ||
Comment 1•4 years ago
|
||
Here's an example:
wasmDis(new WebAssembly.Module(wasmTextToBinary(`
(module
(func (param v128) (param v128) (result v128)
(i32x4.add (local.get 1) (local.get 0))))
`)))
The core of the code generated for this is:
00000024 66 0f 6f d1 movdqa %xmm1, %xmm2
00000028 66 0f 6f ca movdqa %xmm2, %xmm1
0000002C 66 0f fe c8 paddd %xmm0, %xmm1
00000030 66 0f 6f c1 movdqa %xmm1, %xmm0
which is not great. Integer SIMD addition is commutative so the optimal code is just the addition; almost-optimal code has at most one move (the one at the end). But even ignoring that, the first two moves are clearly redundant, as xmm2 is dead and the move accomplishes nothing.
(Note the swapped operand order. If the body were param0 + param1 then the machine code is simply the paddd.)
Regalloc reports that the input is this:
[RegAlloc] [2,3 WasmParameter] [def v1<simd128>:%xmm0.i4]
[RegAlloc] [4,5 WasmParameter] [def v2<simd128>:%xmm1.i4]
[RegAlloc] [6,7 WasmParameter] [def v3<g>:r14]
[RegAlloc] [8,9 WasmBinarySimd128] [def v4<simd128>:tied(0)] [use v2:r?] [use v1:r]
[RegAlloc] [10,11 WasmReturn] [use v4:%xmm0.d] [use v3:r14]
but at the end the IR looks like this:
[RegAlloc] [2,3 WasmParameter] [def v1<simd128>:%xmm0.i4]
[RegAlloc] [4,5 WasmParameter] [def v2<simd128>:%xmm1.i4]
[RegAlloc] [MoveGroup] [%xmm1.i4 -> %xmm2.i4]
[RegAlloc] [6,7 WasmParameter] [def v3<g>:r14]
[RegAlloc] [MoveGroup] [%xmm2.i4 -> %xmm1.i4]
[RegAlloc] [8,9 WasmBinarySimd128] [def v4<simd128>:%xmm1.i4] [use v2:r %xmm1.i4] [use v1:r %xmm0.i4]
[RegAlloc] [MoveGroup] [%xmm1.i4 -> %xmm0.i4]
[RegAlloc] [10,11 WasmReturn] [use v4:%xmm0.d %xmm0.i4] [use v3:r14 r14]
Reporter | ||
Comment 2•4 years ago
|
||
Here's another one:
wasmDis(new WebAssembly.Module(wasmTextToBinary(`
(module
(memory (export "mem") 1)
(func (param i32) (result i32)
(i32.load (local.get 0))))
`)))
Run this with --disable-wasm-huge-memory --large-arraybuffers
and the regalloc is handed this:
[RegAlloc] [2,3 WasmParameter] [def v1<i>:rdi]
[RegAlloc] [4,5 WasmParameter] [def v2<g>:r14]
[RegAlloc] [6,7 WasmLoadTls] [def v3<g>] [use v2:r]
[RegAlloc] [8,9 WasmExtendU32Index] [def v4<g>:tied(0)] [use v1:r?]
[RegAlloc] [10,11 WasmBoundsCheck64] [def v5<g>:tied(0)] [use v4:r?] [use v3:r]
[RegAlloc] [12,13 WasmWrapU32Index] [def v6<i>:tied(0)] [use v5:r?]
[RegAlloc] [14,15 WasmLoad] [def v7<i>] [use v6:r]
[RegAlloc] [16,17 WasmReturn] [use v7:rax] [use v2:r14]
where the nodes at 8 and and 12 emit no code on x64 and use defineReuseInput. And yet redundant moves are inserted by the regalloc:
[RegAlloc] [2,3 WasmParameter] [def v1<i>:rdi]
[RegAlloc] [4,5 WasmParameter] [def v2<g>:r14]
[RegAlloc] [6,7 WasmLoadTls] [def v3<g>:rax] [use v2:r r14]
[RegAlloc] [MoveGroup] [rdi -> rcx]
[RegAlloc] [8,9 WasmExtendU32Index] [def v4<g>:rcx] [use v1:r rcx]
[RegAlloc] [10,11 WasmBoundsCheck64] [def v5<g>:rcx] [use v4:r rcx] [use v3:r rax]
[RegAlloc] [MoveGroup] [rcx -> rax]
[RegAlloc] [12,13 WasmWrapU32Index] [def v6<i>:rax] [use v5:r rax]
[RegAlloc] [14,15 WasmLoad] [def v7<i>:rax] [use v6:r rax] [use bogus]
[RegAlloc] [16,17 WasmReturn] [use v7:rax rax] [use v2:r14 r14]
In this case there could be an interesting complication since the extend and wrap nodes are technically changing the MIR types. But on every 64-bit platform the 32-bit and 64-bit GPRs overlap, so this is not a great explanation.
Reporter | ||
Comment 3•3 years ago
|
||
This bug currently bites really hard on ARM when manipulating 64-bit data. Consider:
(func (param i32) (param $p i64) (result i64)
(i64.sub (i64.const 0) (local.get $p)))
(notice the extra dead parameter) where the body turns into this:
0x47fc302c e1a01003 e1a01003 mov r1, r3
0x47fc3030 e1a00002 e1a00002 mov r0, r2
0x47fc3034 e24dd008 e24dd008 sub sp, sp, #8
0x47fc3038 e58d1000 e58d1000 str r1, [sp, #+0]
0x47fc303c e1a01000 e1a01000 mov r1, r0
0x47fc3040 e59d0000 e59d0000 ldr r0, [sp, #+0]
0x47fc3044 e28dd008 e28dd008 add sp, sp, #8
0x47fc3048 e2711000 e2711000 rsbs r1, r1, #0
0x47fc304c e2e00000 e2e00000 rsc r0, r0, #0
0x47fc3050 e24dd008 e24dd008 sub sp, sp, #8
0x47fc3054 e58d0000 e58d0000 str r0, [sp, #+0]
0x47fc3058 e1a00001 e1a00001 mov r0, r1
0x47fc305c e59d1000 e59d1000 ldr r1, [sp, #+0]
0x47fc3060 e28dd008 e28dd008 add sp, sp, #8
because the function body is elaborated to this by the register allocator:
[RegAlloc] Register Allocation Integrity State:
[RegAlloc] Block 0
[RegAlloc] [2,3 WasmParameter] [def v1<i>:r0]
[RegAlloc] [4,5 WasmParameterI64] [def v2<g>:r2] [def v3<g>:r3]
[RegAlloc] [6,7 WasmParameter] [def v4<g>:r9]
[RegAlloc] [MoveGroup] [r3 -> r1] [r2 -> r0]
[RegAlloc] [MoveGroup] [r1 -> r0] [r0 -> r1]
[RegAlloc] [8,9 NegI64] [def v5<g>:r1] [def v6<g>:r0] [use v2:r r1] [use v3:r r0]
[RegAlloc] [MoveGroup] [r0 -> r1] [r1 -> r0]
[RegAlloc] [10,11 WasmReturnI64] [use v5:r0 r0] [use v6:r1 r1] [use v4:r9 r9]
It looks like 10 of those instructions are basically r0 <-> r1 register swaps through memory with ad-hoc stack allocation, really ugly and something we ought to be able to specialize. But there's no obvious reason why a register would not be used for that, there are plenty available. So maybe there's an underlying bug.
Reporter | ||
Comment 4•3 years ago
|
||
Looks like there's been no attempt to optimize simple reg->reg move cycles on ARM or ARM64, unlike on x86. Filed as bug 1713180.
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Description
•