Closed
Bug 787601
Opened 12 years ago
Closed 12 years ago
IonMonkey: Optimize empty for-loops. (AWFY: 2x slower than JM on x86)
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
DUPLICATE
of bug 765119
People
(Reporter: nbp, Assigned: nbp)
References
(Blocks 1 open bug)
Details
(Whiteboard: [ion:t])
It appears that IonMonkey is still slower on empty for-loops. This for loop optimization might be the key to some remaining sunspider benchmarks.
Assignee | ||
Updated•12 years ago
|
Assignee: general → nicolas.b.pierron
Status: NEW → ASSIGNED
Hardware: x86_64 → x86
Updated•12 years ago
|
Whiteboard: [ion:t]
Assignee | ||
Comment 1•12 years ago
|
||
The performance problem we observe on misc-basic-emptyloop seems to be cause by the allocation of an extra register to do the increment of the loop counter.
move rdx -> rbx
addi rbx, c -> rbx
move rbx -> rdx
This allocation is likely caused by the block resume point which captures the loop counter. This cause the allocation of an extra register because LAddI instruction can bail out and the loop counter should be restored.
We have one solution to handle such cases, which is to ensure all bailout happen before we modify any input register.
This solution alone will only solve the loop-counter issue, but it will not help in case of extra counter registered within the loop, because we might still have to resume before other counters modifications. In addition, we can register when an idempotent MIR/LIR instruction is a bijection (under the non-bailing domain), in which case we can handle multiple counters. This will increase the bailing cost but might reduce the register pressure caused by resume points.
In this case, removing the resume point of the loop-increment block won't help because it capture the same state of the counter, and thus will still imply having an extra register for saving the state of the counter.
Assignee | ||
Comment 2•12 years ago
|
||
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1)
> In addition, we
> can register when an idempotent MIR/LIR instruction is a bijection (under
> the non-bailing domain), in which case we can handle multiple counters.
> This will increase the bailing cost but might reduce the register pressure
> caused by resume points.
Brian Hackett worked on something similar in Bug 773666, where he made an attempt to do so for add and sub.
Assignee | ||
Comment 3•12 years ago
|
||
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1)
> The performance problem we observe on misc-basic-emptyloop seems to be cause
> by the allocation of an extra register to do the increment of the loop
> counter.
>
> move rdx -> rbx
> addi rbx, c -> rbx
> move rbx -> rdx
Before going any deeper … I made a simple hack to test these hypothesis:
1/ using rdx instead of doing the 2 moves: no gain.
2/ using aligned basic block for loop edges: no gain.
3/ removing the overflow check: 120ms --> 90ms
IonMonkey: (hacked: aligned, less registers) 90ms
; rcx = len
0x7ffff7f8aed0: movabs $0x7ffff7fbc010,%r11
0x7ffff7f8aeda: cmpl $0x0,(%r11)
0x7ffff7f8aede: jne 0x7ffff7f8af56
0x7ffff7f8aee4: cmp %ecx,%edx
0x7ffff7f8aee6: jge 0x7ffff7f8aef4
; mov edx, ebx
0x7ffff7f8aeec: add $0x1,%edx ; edx <-> ebx
; mov ebx, edx
; jo <bailout>
0x7ffff7f8aeef: jmpq 0x7ffff7f8aed0
JägerMonkey: 70ms
; r9 = len
0x7ffff7f8a129: int3 // added for dumping this code.
=> 0x7ffff7f8a12a: movabs $0x7ffff7fbc010,%r15
0x7ffff7f8a134: cmpl $0x0,(%r15)
0x7ffff7f8a138: jne 0x7ffff7f8a27e
0x7ffff7f8a13e: sub $0xffffffff,%r12d
; really ?!
0x7ffff7f8a142: movabs $0xfff8800000000000,%r10
0x7ffff7f8a14c: or %r12,%r10
0x7ffff7f8a14f: mov %r10,0x70(%rbx)
0x7ffff7f8a153: cmp %r9d,%r12d
0x7ffff7f8a156: jl 0x7ffff7f8a129
0x7ffff7f8a15c: jmpq 0x7ffff7f8a179
I will look at the order of blocks and see if replacing the conditional forward jump by a conditional backward jump can help. Then it might be interesting to test (2) again.
Assignee | ||
Comment 4•12 years ago
|
||
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3)
>
> 1/ using rdx instead of doing the 2 moves: no gain.
> 2/ using aligned basic block for loop edges: no gain.
> 3/ removing the overflow check: 120ms --> 90ms
4/ Using a conditional jump to jump to the top of the loop.
5/ Using sub -1 instead of add +1
Here are the following test I made, I put the following list after remarking that a configuration which was uninteresting at some point became interesting when combined.
current jm: 70ms
current: 120ms
1: 120ms
1 & 2: 120ms
1 & 2 & 3: 90ms
3 & 4: 90ms
3 & 4 & 5: 90ms
0x7ffff7f8aec9: mov %rdx,%rbx
0x7ffff7f8aecc: sub $0xffffffff,%ebx
0x7ffff7f8aecf: mov %rbx,%rdx
0x7ffff7f8aed2: movabs $0x7ffff7fbc010,%r11
0x7ffff7f8aedc: cmpl $0x0,(%r11)
0x7ffff7f8aee0: jne 0x7ffff7f8af50
0x7ffff7f8aee6: cmp %ecx,%edx
0x7ffff7f8aee8: jl 0x7ffff7f8aec9
*1* & 3 & 4: 60ms
Removing the extra moves statements seems to provide a huge win this time. Changing the moves to be 32 bits move instead of 64 bits does change the previous results. I guess this reduce the number of useful instructions for the loop. The JM equivalent, which does a write within the loop, is likely optimized out by the CPU which analyze that it is the only owner of the cache line, and that nobody can read the data which is constantly erased by the latest iteration.
0x7ffff7f8aeca: add $0x1,%edx
0x7ffff7f8aecd: movabs $0x7ffff7fbc010,%r11
0x7ffff7f8aed7: cmpl $0x0,(%r11)
0x7ffff7f8aedb: jne 0x7ffff7f8af4b
0x7ffff7f8aee1: cmp %ecx,%edx
0x7ffff7f8aee3: jl 0x7ffff7f8aeca
0x7ffff7f8aee9:
1 & 2 & 3 & 4: 60ms
Assignee | ||
Comment 5•12 years ago
|
||
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3)
> 3/ removing the overflow check: 120ms --> 90ms
Enabling range analysis does the same work, and remove the overflow check of this loop. We cannot enable range analysis by default yet because a few patches are missing. See Bug 765119 for details.
Depends on: 765119
Assignee | ||
Comment 6•12 years ago
|
||
This bug is not fully fixed by range analysis, but other hack to reduce the loop size are just kind of optimizations which are not reachable by our current code generation. With usual JS code, these tight loops are unlikely to be triggered.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•