Bug 895127 (Open)
ARM: Chrome faster than Firefox on nsieve asm.js port
Opened 11 years ago | Updated 2 years ago
Categories: Core :: JavaScript Engine, defect
Status: REOPENED
People: Reporter: djvj, Unassigned
References: Depends on 1 open bug
Attachments: 7 files, 1 obsolete file
This shows up on ARM only (run on my Android system). On my Mac, Firefox is faster.
I ported nsieve to C and built with emscripten at -O2. The asm.js code can be accessed at:
vijayan.ca/nsieve/nsieve.html
Chrome on Android performs about 1.4x faster than Firefox (Nightly) on Android. Given that we do better than Chrome on x64, this seems like it might be an ARM codegen issue.
Reporter | Comment 1 • 11 years ago
Full hyperlink:
http://vijayan.ca/nsieve/nsieve.html
Reporter | Updated • 11 years ago
Summary: Chrome faster than Firefox on nsieve asm.js port → ARM: Chrome faster than Firefox on nsieve asm.js port
Reporter | Comment 2 • 11 years ago
Reporter | Comment 3 • 11 years ago
Comment 4 • 11 years ago
What do you see on x86? This could be the issue that ARM/x86 load/store bounds checks aren't hoisted.
Comment 5 • 11 years ago
Profiling results:
30.29% nsieve.html:2570:1221: Function aS - Block 25
13.96% nsieve.html:2570:1010: Function aS - Block 15
12.81% nsieve.html:2570:1296: Function aS - Block 29
6.71% nsieve.html:2570:807: Function aS - Block 5
5.50% nsieve.html:2570:1085: Function aS - Block 19
3.12% nsieve.html:2570:882: Function aS - Block 9
2.73% nsieve.html:2570:1268: Function aS - Block 28
2.71% nsieve.html:2570:1057: Function aS - Block 18
2.18% nsieve.html:0:0: Function aS - Block 26
2.13% nsieve.html:2570:1192: Function aS - Block 23
1.72% nsieve.html:2570:1183: Function aS - Block 22
1.69% nsieve.html:0:0: Function aS - Block 16
-=-
The hottest block.
Original source: for (k = i + i; k <= m; k += i) isPrime[k] = false;
Minified JS: {g=e;do{a[b+g|0]=0;g=g+f|0;}while((g|0)<=8e4)}
400335f8 28 nsieve.html:2570:1221: Function aS - Block 25
0x400335f8: ldr r3, [sp, #8]
0x400335fc: adds r4, r3, r0
0x40033600: mov r5, #0
0x40033604: lsrs r12, r4, #24 <- bounds check
0x40033608: strbeq r5, [r11, r4] <- heap store
0x4003360c: adds r4, r0, r1
0x40033610: movw r12, #14464
0x40033614: movt r12, #1
0x40033618: cmp r4, r12
0x4003361c: bgt 0x4003362c
Note that the low-bit masking is not needed in this case because this is a byte-sized operation.
* Removing the low-bit masking and the bounds check gives around a 10% reduction in run time.
400cb578 24 nsieve.html:2570:1221: Function aS - Block 25
0x400cb578: ldr r3, [sp, #8]
0x400cb57c: adds r4, r3, r0
0x400cb580: mov r5, #0
0x400cb584: strb r5, [r11, r4]
0x400cb588: adds r4, r0, r1
0x400cb58c: movw r12, #14464 ; 0x3880
0x400cb590: movt r12, #1
0x400cb594: cmp r4, r12
0x400cb598: bgt 0x400cb5a8
* Using high-bit masking, rather than the heap bounds check, gives little change. Note that for byte-sized heap accesses the low-bit mask was not necessary, so the bounds check is simply being replaced by the high-bit mask.
4001d5c0 28 nsieve.html:2570:1221: Function aS - Block 25
0x4001d5c0: ldr r3, [sp, #8]
0x4001d5c4: adds r4, r3, r0
0x4001d5c8: bic r5, r4, #-67108864 ; 0xfc000000
0x4001d5cc: mov r4, #0
0x4001d5d0: strb r4, [r11, r5]
0x4001d5d4: adds r4, r0, r1
0x4001d5d8: movw r12, #14464 ; 0x3880
0x4001d5dc: movt r12, #1
0x4001d5e0: cmp r4, r12
0x4001d5e4: bgt 0x4001d5f4
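The two guard strategies compared above can be sketched in C, using the constants from the disassembly (illustrative only, not SpiderMonkey's actual codegen; note the two guards imply different heap sizes: the `lsrs #24` check admits 24-bit indices, while the `bic #0xfc000000` mask clamps to 26 bits):

```c
#include <stdint.h>

/* Heap sized for the wider (26-bit) masked case. */
#define HEAP_BYTES (1u << 26)
static uint8_t heap[HEAP_BYTES];

/* Bounds check (lsrs #24 / strbeq): store only if the index fits in
 * 24 bits; out-of-bounds stores are silently dropped. */
static void store_checked(uint32_t index, uint8_t value) {
    if ((index >> 24) == 0)
        heap[index] = value;
}

/* High-bit masking (bic #0xfc000000 / strb): unconditionally clear
 * the top bits so the store always lands inside the heap. */
static void store_masked(uint32_t index, uint8_t value) {
    heap[index & ~UINT32_C(0xfc000000)] = value;
}
```

The masked form trades the flag-setting shift and predicated store for a single unconditional store, which is why it showed little difference here: both are one extra ALU op per access.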
* Clang generated code:
lsl r3, r2, #1
b .LBB0_5
.LBB0_4:
strb r1, [r4, r3]
add r3, r3, r2
.LBB0_5:
cmp r3, r5
ble .LBB0_4
Note that the constants are hoisted, the loop blocks are better ordered, and the store instruction is able to make better use of its 'base plus index' encoding.
* GCC generated code
.L22:
strb ip, [r1, r3]
add r3, r3, r2
cmp r0, r3
bge .L22
Very similar to the clang code.
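For reference, the hot loop above corresponds to a sieve kernel along these lines (a minimal sketch resembling the original source line, not the exact benchmark file; the harness and problem sizes differ):

```c
#include <stdlib.h>
#include <string.h>

/* Count primes <= m with a simple sieve; the inner loop is the
 * "isPrime[k] = false" loop profiled above. */
static int nsieve(int m) {
    char *isPrime = malloc((size_t)m + 1);
    memset(isPrime, 1, (size_t)m + 1);
    int count = 0;
    for (int i = 2; i <= m; i++) {
        if (isPrime[i]) {
            /* Hot loop: clear every multiple of i. */
            for (int k = i + i; k <= m; k += i)
                isPrime[k] = 0;
            count++;
        }
    }
    free(isPrime);
    return count;
}
```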
Comment 6 • 11 years ago
If you remove bounds checking altogether, how does our perf compare to Chrome's? (I'm wondering if this is due to bounds-check hoisting.)
Comment 7 • 11 years ago
(In reply to Luke Wagner [:luke] from comment #6)
> If you remove bounds checking altogether, how does our perf compare to Chrome's? (I'm wondering if this is due to bounds-check hoisting.)
With the bounds checking removed altogether the performance improved by around 10%. I have not actually compared the results to Chrome, but based on the reported 1.4x speed difference this still leaves a good gap.
For the best performance it might be necessary to consider allowing the heap base to be hoisted too, to allow better use of the instruction encoding within such loops.
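Bounds-check hoisting, as raised in comment 4, would exploit the fact that the induction variable only grows: one check against the loop bound subsumes the per-store checks. A sketch of the idea (not the actual compiler transformation):

```c
#include <stdint.h>

#define HEAP_BYTES (1u << 24)
static uint8_t heap[HEAP_BYTES];

/* Per-iteration checking, as currently generated: one guard per store. */
static void clear_multiples_checked(uint32_t i, uint32_t m) {
    for (uint32_t k = i + i; k <= m; k += i)
        if ((k >> 24) == 0)   /* bounds check inside the hot loop */
            heap[k] = 0;
}

/* Hoisted form: k never exceeds m, so checking m once before entry
 * covers every store in the loop. */
static void clear_multiples_hoisted(uint32_t i, uint32_t m) {
    if (m >= HEAP_BYTES)
        return;               /* single hoisted check */
    for (uint32_t k = i + i; k <= m; k += i)
        heap[k] = 0;          /* unchecked store */
}
```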
Comment 8 • 11 years ago
Already mentioned in IRC, but perhaps we should look into the run time characteristics of malloc/memset.
I replaced the malloc() call with a use of a static array, and here were the times:
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_static.js
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 23.450000 (count=143020000)
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve.10000.js
warning: successfully compiled asm.js code (total compilation time 42ms)
CXX Time: 34.527000 (count=143020000)
I used emscripten to compile the new test, but not the old one, so I should make sure there aren't differences in emscripten that are causing these timing differences.
Comment 9 • 11 years ago
More information!
I had a theory that emscripten simply generates better code for static array accesses than it does for dynamically allocated arrays. To test this, I used the original code, along with a custom malloc implementation (in C) that can basically handle nsieve, and nothing else.
The results surprised me in that the dynamically allocated test was a non-trivial amount *faster* than the statically allocated test.
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_static.js
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 23.754000 (count=143020000)
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 20.274000 (count=143020000)
Comment 10 • 11 years ago
I am confused about why my testing shows that we're spending some absurd amount of time in malloc() (or at least malloc slows us down by an ABSURD amount), and yet profiling doesn't show any time spent there. Does the profiling code not emit information for JS that doesn't have source maps back to C++ code?
Comment 11 • 11 years ago
More finger pointing:
mjrosenb@eve:~/bench$ perf stat js.ion.opt nsieve_malloc.js; perf stat js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 320ms)
CXX Time: 34.346000 (count=143020000)
Performance counter stats for 'js.ion.opt nsieve_malloc.js':
34884.707897 task-clock # 0.995 CPUs utilized
814 context-switches # 0.023 K/sec
0 CPU-migrations # 0.000 K/sec
3,169 page-faults # 0.091 K/sec
58,297,313,868 cycles # 1.671 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
83,104,489,736 instructions # 1.43 insns per cycle
13,868,335,782 branches # 397.548 M/sec
207,125,841 branch-misses # 1.49% of all branches
35.049975289 seconds time elapsed
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 20.271000 (count=143020000)
Performance counter stats for 'js.ion.opt ./nsieve_neutered_malloc.js':
20192.141046 task-clock # 0.993 CPUs utilized
372 context-switches # 0.018 K/sec
0 CPU-migrations # 0.000 K/sec
2,331 page-faults # 0.115 K/sec
34,323,928,941 cycles # 1.700 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
61,163,271,375 instructions # 1.78 insns per cycle
14,229,247,827 branches # 704.692 M/sec
205,919,866 branch-misses # 1.45% of all branches
20.333820067 seconds time elapsed
mjrosenb@eve:~/bench$
It looks like about a quarter of our instructions (83.1 billion vs. 61.2 billion executed) are spent in malloc, which is a totally unreasonable amount.
For comparison, running the same test on x86 gives:
~/bench; perf stat js.ion.opt ./nsieve_malloc.js; perf stat js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 13ms)
CXX Time: 7.029000 (count=143020000)
Performance counter stats for 'js.ion.opt ./nsieve_malloc.js':
7077.914139 task-clock # 1.001 CPUs utilized
28 context-switches # 0.004 K/sec
0 CPU-migrations # 0.000 K/sec
11,953 page-faults # 0.002 M/sec
22,930,948,302 cycles # 3.240 GHz [83.32%]
5,258,175,147 stalled-cycles-frontend # 22.93% frontend cycles idle [83.32%]
3,418,331,293 stalled-cycles-backend # 14.91% backend cycles idle [66.68%]
43,925,367,676 instructions # 1.92 insns per cycle
# 0.12 stalled cycles per insn [83.37%]
8,572,309,463 branches # 1211.135 M/sec [83.37%]
174,790,702 branch-misses # 2.04% of all branches [83.32%]
7.073529927 seconds time elapsed
warning: successfully compiled asm.js code (total compilation time 1ms)
CXX Time: 6.744000 (count=143020000)
Performance counter stats for 'js.ion.opt ./nsieve_neutered_malloc.js':
6779.736561 task-clock # 1.001 CPUs utilized
17 context-switches # 0.003 K/sec
0 CPU-migrations # 0.000 K/sec
10,547 page-faults # 0.002 M/sec
22,163,384,075 cycles # 3.269 GHz [83.33%]
8,100,893,711 stalled-cycles-frontend # 36.55% frontend cycles idle [83.35%]
4,770,117,610 stalled-cycles-backend # 21.52% backend cycles idle [66.70%]
36,908,598,483 instructions # 1.67 insns per cycle
# 0.22 stalled cycles per insn [83.35%]
7,167,986,291 branches # 1057.266 M/sec [83.35%]
179,414,425 branch-misses # 2.50% of all branches [83.30%]
6.775314255 seconds time elapsed
I am concerned that we are executing almost twice the number of instructions on ARM as we are on x86, even in the neutered testcase. Current theories are:
1) codegen for dlmalloc is *awful* on arm
2) codegen for dlmalloc is broken on arm, and we're taking some branches that we shouldn't be (and somehow or other aren't tripping any asserts or corrupting memory)
3) dlmalloc is somehow not a good fit for arm's memory characteristics
I can likely test 2 by dumping memory / an execution trace, and making sure they are the same on x86 and arm. Profiling and/or inspecting the code generated for dlmalloc is probably the only way to distinguish 1 from 3.
Comment 12 • 11 years ago
Perhaps FFI cost is higher on ARM? Did you measure the time spent in FFI calls (Luke has a patch)?
Comment 13 • 11 years ago
It's hard to measure FFI calls now since they've been optimized (to be direct asm.js-to-Ion calls). If you put 'return false' at the top of TryEnablingIon, you can put a counter in InvokeFromAsmJS_(Ignore,ToInt32,ToNumber) and see how many FFI calls there are; if there are more than a small number then we should probably change the benchmark to not do that.
Comment 14 • 11 years ago
I can likely just instrument all of the ffi calls that we can make from dlmalloc, and see how many times we call each of those.
The obvious choice was sbrk. Since only 4 sizes are being allocated, and there is only one thing live at any given time, it is not unreasonable for a malloc implementation to just return all of the freed memory as soon as it is freed.
I instrumented the ffi sbrk function with a print statement, and the result was:
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_malloc_sbrkcnt.js
warning: successfully compiled asm.js code (total compilation time 46ms)
_SBRK
_SBRK
_SBRK
_SBRK
_SBRK
CXX Time: 34.288000 (count=143020000)
mjrosenb@eve:~/bench$
so we call sbrk 5 times total.
Out of curiosity, I poked around in emscripten to see if HAVE_MMAP is unset/set to 0, or MORECORE_CANNOT_TRIM is set. I didn't see either of them. Turning off code that we can't possibly use seems like a good strategy no matter what.
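Disabling those unused paths would be a matter of compile-time defines when building dlmalloc (a sketch of where they would go; the right place to set them is emscripten's build of dlmalloc):

```c
/* Sketch: standard dlmalloc configuration options that drop code
 * paths an asm.js heap can never use. */
#define HAVE_MMAP 0            /* the asm.js heap has no mmap */
#define MORECORE_CANNOT_TRIM 1 /* sbrk'd memory is never returned */
/* #include "dlmalloc.c" */
```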
Comment 15 • 11 years ago
Comment 16 • 11 years ago
Attachment #787315 - Attachment is obsolete: true
Comment 17 • 11 years ago
This example hoists the 'malloc' operation, reusing the flags vector, to help measure the impact of 'malloc' on the run time. Timing results on ARM show little difference, which suggests malloc is not hot; this is consistent with the perf results, which did not show malloc as a hot function.
These were compiled with:
em++ -O2 -g2 -o nsieve.js nsieve.c
em++ -O2 -g2 -o nsieves.js nsieves.c
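The hoisted variant's shape is roughly the following (a hypothetical sketch; the actual nsieves.c may differ): allocate the flags vector once at the largest size and reuse it across passes, rather than a malloc/free pair per problem size.

```c
#include <stdlib.h>
#include <string.h>

/* One sieve pass over a caller-provided buffer. */
static int sieve_pass(char *isPrime, int m) {
    memset(isPrime, 1, (size_t)m + 1);
    int count = 0;
    for (int i = 2; i <= m; i++)
        if (isPrime[i]) {
            for (int k = i + i; k <= m; k += i)
                isPrime[k] = 0;
            count++;
        }
    return count;
}

/* Hoisted malloc: one allocation sized for the largest m, reused
 * for every pass. */
static int nsieve_hoisted(int n) {
    int max_m = 10000 << n;
    char *flags = malloc((size_t)max_m + 1);
    int total = 0;
    for (int i = n; i >= 0 && i > n - 3; i--)
        total += sieve_pass(flags, 10000 << i);
    free(flags);
    return total;
}
```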
Comment 18 • 11 years ago
This test combines four variations. The first two use a static vector for isPrime: the first with a larger offset for isPrime, the second with a small offset. The high impact of loading large integer constants on ARM might be significant; perhaps hoisting such loads needs to be explored, and also hoisting the heap base+offset calculation. The third case hoists the 'malloc', and the last is the original; these two are very close, so malloc does not appear to be hot.
Static vector 1: CXX Time: 121.194000 (count=143020000)
Static vector 2: CXX Time: 103.530000 (count=143020000)
Dynamic hoisted vector: CXX Time: 108.764000 (count=143020000)
Dynamic vector: CXX Time: 109.119000 (count=143020000)
Comment 19 • 11 years ago
It has been noted that if there is less unrolling or perhaps less inlining then the code runs faster. The hot block is as follows, and can be compared with the hot block noted in comment 5.
0xb64f0538: adds r5, r2, r3
0xb64f053c: mov r6, #0
0xb64f0540: lsrs r12, r5, #24
0xb64f0544: strbeq r6, [r11, r5]
0xb64f0548: adds r5, r3, r1
0xb64f054c: cmp r5, r4
0xb64f0550: bgt 0xb64f055c
0xb64f0554: mov r3, r5
0xb64f0558: b 0xb64f0538
A significant difference is that the loop termination count is stored in a register (r4) rather than being loaded on each iteration. There also appears to be less register pressure, as there is not the stack load and store on each iteration. The loop block order could still use some improvement: moving the termination test to the tail would reduce the number of branches.
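Moving the test to the tail is loop rotation: the guarded do/while form, like the clang and gcc loops quoted in comment 5, takes one conditional back-branch per iteration instead of a conditional exit plus an unconditional branch back to the head. In C, the rotated shape looks like this (a sketch of the structure, not generated code):

```c
#include <stdint.h>

static void clear_multiples(uint8_t *p, uint32_t i, uint32_t m) {
    uint32_t k = i + i;
    if (k <= m) {             /* guard hoisted out of the loop */
        do {
            p[k] = 0;         /* loop body */
            k += i;           /* induction step */
        } while (k <= m);     /* single back-branch at the tail */
    }
}
```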
Comment 20 • 11 years ago
Jan has a fix for the issue of storing the variable isPrime back to the heap on every iteration of the loop, but there is also the problem that we *load* it on every iteration of the loop. Some of my findings relayed on IRC:
13:29 < mjrosenb|ARM> jandem: so investigation results: backtracking does a bit better because it pulls the load/store combo out of a tight loop.
13:30 < mjrosenb|ARM> it looks like bt also needs the store elimination logic, although I'm not sure such a simple test will be sufficient.
13:31 < mjrosenb|ARM> lastly, I want to try both of the interval splitting optimizations. BT has a simple form of one, but it could likely be improved
13:32 < mjrosenb|ARM> I think getting these optimizations in will be a bit annoying, since nothing seems to rely on the actual location/nesting level of blocks.
Comment 21 • 11 years ago
thoroughly annotated partial nsieve-unrolled dump, using the backtracking allocator:
0x76892450: push {lr} ; (str lr, [sp, #-4]!)
0x76892454: sub sp, sp, #36 ; 0x24
0x76892458: movw r12, #28404 ; 0x6ef4
0x7689245c: movt r12, #121 ; 0x79
0x76892460: ldr lr, [r12]
0x76892464: cmp lr, sp
0x76892468: bcs 0x76899b54
0x7689246c: movw r0, #20001 ; 0x4e21
0x76892470: tst sp, #7
0x76892474: beq 0x7689247c
0x76892478: bkpt 0x0001
0x7689247c: bl 0x76892988 // call malloc
0x76892480: str r0, [sp, #16] // store result, due to memset
0x76892484: mov r1, r0 // this happens because we *CANNOT* specify the same register on alu ops.
0x76892488: adds r0, r1, #2 // this should just be add r0, r0, 2
0x7689248c: mov r1, #1 // setting each byte to 1
0x76892490: movw r2, #19999 ; 0x4e1f
0x76892494: tst sp, #7
0x76892498: beq 0x768924a0
0x7689249c: bkpt 0x0002
0x768924a0: bl 0x76897800 // call memset
0x768924a4: mov r2, #0 // not exactly like the C code, we start the loop by assuming isPrime[...] is true
0x768924a8: mov r0, #0
0x768924ac: mov r1, #2 // initialize i to 2
outer head:
0x768924b0: cmp r2, #0 // isPrime is inverted before we get to this point.
0x768924b4: beq 0x768924c0 // jump to tight loop prehead
0x768924b8: str r0, [sp, #32] // update stack value of count :(
0x768924bc: b 0x76892510 // jump to outer loopcont.
tight loop prehead:
0x768924c0: lsl r2, r1, #1 // initialize k to 2i
0x768924c4: movw r12, #20000
0x768924c8: cmp r2, r12 // initial check to see if k > m
0x768924cc: bgt 0x76892508
0x768924d0: str r0, [sp, #32] // update stack value of count
0x768924d4: ldr r0, [sp, #16] // load isPrime :(
tight loop head:
0x768924d8: adds r3, r0, r2 // calculate &isPrime[k]
0x768924dc: mov r4, #0 // generate immediate to store.
0x768924e0: lsrs r12, r3, #24 // bounds check!
0x768924e4: strbeq r4, [r11, r3] // store if we passed the bounds check.
0x768924e8: adds r3, r2, r1 // k' <- k+i
0x768924ec: movw r12, #20000 ; 0x4e20
0x768924f0: cmp r3, r12 // see if k' > m
0x768924f4: bgt 0x76892500 // jump to after tight loop.
0x768924f8: mov r2, r3 // k <- k'
0x768924fc: b 0x768924d8 // jump to tight loop head
after tight loop:
0x76892500: str r0, [sp, #16] // sync back isPrime
0x76892504: ldr r0, [sp, #32] //load count
0x76892508: adds r2, r0, #1 // increment count
0x7689250c: str r2, [sp, #32] // sync back count.
outer loop cont.
0x76892510: adds r3, r1, #1 // increment i into new variable
0x76892514: movw r12, #20000 ; 0x4e20
0x76892518: cmp r3, r12 // check if i is larger than m
0x7689251c: bgt 0x76892550 // conditionally jump to post-loop.
0x76892520: ldr r1, [sp, #16] // load isPrime (yet again!), ooh, it has a different register!
0x76892524: adds r0, r1, r3 // compute &isPrime[i]
0x76892528: lsrs r12, r0, #24 // bounds check
0x7689252c: ldrsbeq r1, [r11, r0] // conditional load
0x76892530: movne r1, #0 // default to 0 on oob accesses
0x76892534: and r0, r1, #1 // start godawful tmp = (isPrime[i] & 1) == 0 check
0x76892538: cmp r0, #0 // ?= 0
0x7689253c: mov r2, #0 // default to 0
0x76892540: moveq r2, #1 // x==0 => 1
0x76892544: mov r1, r3 // fix-up i.
0x76892548: ldr r0, [sp, #32] // load count
0x7689254c: b 0x768924b0 // branch to outer head.
post-loop:
0x76892550: ldr r0, [sp, #16] // load isPrime as an argument to free()
0x76892554: tst sp, #7
0x76892558: beq 0x76892560
0x7689255c: bkpt 0x0003
0x76892560: bl 0x76895f38 // call free()
Comment 22 • 11 years ago
annotation for the same chunk of C/JS code, this time compiled with LSRA:
0x76892450: push {lr} ; (str lr, [sp, #-4]!)
0x76892454: sub sp, sp, #28
0x76892458: movw r12, #28404 ; 0x6ef4
0x7689245c: movt r12, #121 ; 0x79
0x76892460: ldr lr, [r12]
0x76892464: cmp lr, sp
0x76892468: bcs 0x76899b50
0x7689246c: movw r0, #20001 ; 0x4e21
0x76892470: tst sp, #7
0x76892474: beq 0x7689247c
0x76892478: bkpt 0x0001
0x7689247c: bl 0x768929e4 // call malloc?
0x76892480: adds r1, r0, #2
0x76892484: mov r2, #1
0x76892488: movw r3, #19999 ; 0x4e1f
0x7689248c: str r0, [sp, #24] // stash the address of isPrime
0x76892490: mov r0, r1
0x76892494: mov r1, r2
0x76892498: mov r2, r3
0x7689249c: tst sp, #7
0x768924a0: beq 0x768924a8
0x768924a4: bkpt 0x0002
0x768924a8: bl 0x76897954 // call memset.
// loop preheader
0x768924ac: mov r0, #0 // initialize temp for holding the current prime?
0x768924b0: mov r1, #0
0x768924b4: mov r2, #2 // initialize i to 2.
// some sort of loop header.
0x768924b8: cmp r0, #0 // test to see if the current number is a prime number // this is inverted?
0x768924bc: beq 0x768924cc
0x768924c0: ldr r3, [sp, #24] // it is not prime! [sp, #24] contains isPrime!!
0x768924c4: mov r0, r1
0x768924c8: b 0x76892518
// it isn't
0x768924cc: lsl r0, r2, #1 // initialize k -- add_i instantiation
0x768924d0: movw r12, #20000 ; 0x4e20
0x768924d4: cmp r0, r12
0x768924d8: ble 0x768924e4 // make sure that k <= m
0x768924dc: ldr r3, [sp, #24] // loading r3 *everywhere*
0x768924e0: b 0x76892514 // don't enter the loop.
// loop head.
0x768924e4: ldr r3, [sp, #24] // load in tight loop is bad.
0x768924e8: adds r4, r3, r0 // compute &isPrime[k]
0x768924ec: mov r5, #0 // make a temp to store 0 -- this may be hoistable, debatable benefits.
0x768924f0: lsrs r12, r4, #24 // heap check
0x768924f4: strbeq r5, [r11, r4] // write 0 into isPrime
0x768924f8: adds r4, r0, r2 // add i to k, rename k to r4.
0x768924fc: movw r12, #20000 ; 0x4e20
0x76892500: cmp r4, r12
0x76892504: bgt 0x76892514 // terminate inner loop
0x76892508: str r3, [sp, #24] // we don't change r3, why do we store it back?
0x7689250c: mov r0, r4 // sync back change to k.
0x76892510: b 0x768924e4
// inner loop termination
0x76892514: adds r0, r1, #1
// JOIN for primality check?
0x76892518: adds r1, r2, #1 // increment i.
0x7689251c: movw r12, #20000 ; 0x4e20
0x76892520: cmp r1, r12
0x76892524: bgt 0x76892568 // terminate whole loop?
0x76892528: adds r2, r3, r1 // compute &isPrime[i]
0x7689252c: lsrs r12, r2, #24 // bounds check
0x76892530: ldrsbeq r2, [r11, r2] // load isPrime[i]
0x76892534: movne r2, #0 // fallback to loading 0
0x76892538: and r4, r2, #1 // make out bottom bit
0x7689253c: cmp r4, #0 // superfluous comparison
0x76892540: mov r2, #0 // oh god, why?
0x76892544: moveq r2, #1 // we *really* should do better than this.
// this looks like a movegroup shuffle.
0x76892548: sub sp, sp, #8
0x7689254c: str r3, [sp, #32]
0x76892550: str r1, [sp]
0x76892554: mov r1, r0
0x76892558: mov r0, r2
0x7689255c: ldr r2, [sp]
0x76892560: add sp, sp, #8
// movegroup over.
0x76892564: b 0x768924b8
0x76892568: str r0, [sp, #20]
0x7689256c: mov r0, r3
0x76892570: tst sp, #7
0x76892574: beq 0x7689257c
0x76892578: bkpt 0x0003
0x7689257c: bl 0x76895e8c // call free.
Comment 23 • 11 years ago
aaand a commented non-unrolled case. I believe this is with the backtracking allocator:
0x76893450: push {lr} ; (str lr, [sp, #-4]!)
0x76893454: sub sp, sp, #28
0x76893458: movw r12, #28404 ; 0x6ef4
0x7689345c: movt r12, #121 ; 0x79
0x76893460: ldr lr, [r12]
0x76893464: cmp lr, sp
0x76893468: bcs 0x7689ab54
0x7689346c: mov r1, #0 // initialize ? count_07?
0x76893470: mov r0, #1 // initialize outer i to 1.
0x76893474: str r0, [sp, #24] // sync ? and outer i to the stack
0x76893478: str r1, [sp, #20]
top of all loops
0x7689347c: movw r0, #10000 // compute 10000
0x76893480: ldr r2, [sp, #24] // grab outer i from the stack-- really?
0x76893484: and r1, r2, #31 // bitmask for shift. bounds check should eliminate this.
0x76893488: lsl r1, r0, r1 // compute m = 10000 * (1 << i) == 10000 << i
0x7689348c: str r1, [sp, #12] // store m to the stack
0x76893490: adds r0, r1, #1 // we allocate m+1 elements
0x76893494: tst sp, #7
0x76893498: beq 0x768934a0
0x7689349c: bkpt 0x0001
0x768934a0: bl 0x768937d0 // call malloc
0x768934a4: str r0, [sp, #16] // store flags to the stack
0x768934a8: ldr r0, [sp, #12] // grab m from the stack-- can we use callee saved regs?
0x768934ac: cmp r0, #2 // m < 2? that will *never* happen.
0x768934b0: bge 0x768934bc
0x768934b4: mov r0, #0 // wtf? why do we only initialize count to 0 here? this is llvm generating absurd code.
0x768934b8: b 0x76893564 // b to end of middle
0x768934bc: ldr r1, [sp, #16] // re-load flags -- argument to memset
0x768934c0: adds r0, r1, #2 // actually start at &flags[2]
0x768934c4: ldr r1, [sp, #12] // third argument is m-1; load m
0x768934c8: subs r2, r1, #1 // subs? we should really generate sub here.
0x768934cc: mov r1, #1 // second argument is 1.
0x768934d0: tst sp, #7
0x768934d4: beq 0x768934dc
0x768934d8: bkpt 0x0002
0x768934dc: bl 0x76898648 // call memset
0x768934e0: mov r0, #0 // zero count (called $count_020_i in js)
0x768934e4: mov r1, #2 // set i_119_i to 2? that sounds like the inner i.
outer loop head:
0x768934e8: ldr r2, [sp, #16] // re-load flags
0x768934ec: adds r3, r2, r1 // compute &flags[inner i]
0x768934f0: lsrs r12, r3, #24 // bounds check
0x768934f4: ldrsbeq r3, [r11, r3] // load flags[inner i]
0x768934f8: movne r3, #0 // oob handler
0x768934fc: and r4, r3, #1 // extract bottom bit, kind of pointless.
0x76893500: cmp r4, #0 // tst... the instruction you want is tst.
0x76893504: beq 0x7689354c // branch to composite case
0x76893508: lsl r3, r1, #1 // set k = i + i (inner i, here)
0x7689350c: ldr r4, [sp, #12] // grab m from the stack
0x76893510: cmp r3, r4 // is k's initial value out of bounds?
0x76893514: bgt 0x76893544 // if it is, go to after tight loop.
0x76893518: ldr r2, [sp, #16] // re-load flags --- again ?!
tight loop head:
0x7689351c: adds r4, r2, r3 // compute flags[k]
0x76893520: mov r5, #0 // load 0 for store-back
0x76893524: lsrs r12, r4, #24 // bounds check
0x76893528: strbeq r5, [r11, r4] // flags[k] = 0
0x7689352c: adds r4, r3, r1 // k' = k + i
0x76893530: ldr r3, [sp, #12] // m, yet again.
0x76893534: cmp r4, r3 // if m > k, abort
0x76893538: bgt 0x76893544 // go to after tight loop
0x7689353c: mov r3, r4 // synchronize variables
0x76893540: b 0x7689351c // always go to tight loop head
after tight loop
0x76893544: adds r2, r0, #1 // increment count
0x76893548: mov r0, r2 // synchronize registers
composite case
0x7689354c: adds r2, r1, #1 // increment i-inner
0x76893550: ldr r1, [sp, #12] // re-load m (AGAIN)
0x76893554: cmp r2, r1 // upper bounds check on i inner
0x76893558: bgt 0x76893564 // branch to end of middle
0x7689355c: mov r1, r2 // sync incremented i back to r1
0x76893560: b 0x768934e8 // go to the outer loop head.
end of middle:
0x76893564: ldr r1, [sp, #20] // load count
0x76893568: adds r2, r0, r1 // add inner count to outer count.
0x7689356c: str r2, [sp, #8] // store outer count back to the stack.
0x76893570: ldr r0, [sp, #16] // load flags for argument to free.
0x76893574: tst sp, #7
0x76893578: beq 0x76893580
0x7689357c: bkpt 0x0003
0x76893580: bl 0x76896d80 // call free
0x76893584: ldr r1, [sp, #24] // load outer i
0x76893588: adds r0, r1, #1 // increment outer i
0x7689358c: cmp r0, #4 // see if the outer loop terminated
0x76893590: bge 0x768935b0 // branch to after all loops
0x76893594: str r0, [sp, #24] // cycle fixup
0x76893598: push {lr} ; (str lr, [sp, #-4]!)
0x7689359c: ldr lr, [sp, #12]
0x768935a0: str lr, [sp, #24]
0x768935a4: ldr lr, [sp]
0x768935a8: add sp, sp, #4
0x768935ac: b 0x7689347c
after all loops:
0x768935b0: ldr r0, [sp, #8] // load outer count for returning
0x768935b4: add sp, sp, #28
0x768935b8: pop {pc} ; (ldr pc, [sp], #4)
Comment 24 • 11 years ago
First column is without this "improvement", second column is with it.
TEST COMPARISON FROM TO DETAILS
=============================================================================
and since octane doesn't come with a diff function:
mjrosenb@eve:~/bench/octane$ BETTER_SPLIT=1 ~/src/central/central-regalloc/js/objs/armhf-opt/shell/js ./run.js; ~/src/central/central-regalloc/js/objs/armhf-opt/shell/js ./run.js;
Richards: 7300
DeltaBlue: 6856
Crypto: 6717
RayTrace: 7357
EarleyBoyer: 6321
RegExp: 894
Splay: 4669
NavierStokes: 9043
PdfJS: 2963
Mandreel: 1377
Gameboy: 7136
CodeLoad: 7199
Box2D: 2028
----
Score (version 8): 4427
Richards: 7442
DeltaBlue: 6850
Crypto: 6175
RayTrace: 7322
EarleyBoyer: 6323
RegExp: 936
Splay: 4425
NavierStokes: 8877
PdfJS: 3003
Mandreel: 1350
Gameboy: 8029
CodeLoad: 7545
Box2D: 1918
----
Score (version 8): 4429
which isn't saying all that much, since evidently, the change from one run to the next is appreciable.
** TOTAL **: ?? 5499.5ms +/- 0.2% 5531.1ms +/- 0.9% not conclusive: might be *1.006x as slow*
=============================================================================
ai: ?? 415.8ms +/- 1.4% 425.9ms +/- 2.4% not conclusive: might be *1.024x as slow*
astar: ?? 415.8ms +/- 1.4% 425.9ms +/- 2.4% not conclusive: might be *1.024x as slow*
audio: *1.013x as slow* 2074.3ms +/- 0.4% 2100.4ms +/- 1.1% significant
beat-detection: ?? 592.7ms +/- 0.6% 607.7ms +/- 3.3% not conclusive: might be *1.025x as slow*
dft: *1.024x as slow* 954.4ms +/- 0.5% 977.4ms +/- 2.2% significant
fft: 1.048x as fast 216.8ms +/- 1.9% 206.9ms +/- 1.5% significant
oscillator: - 310.4ms +/- 1.6% 308.4ms +/- 1.9%
imaging: ?? 1329.4ms +/- 0.2% 1333.4ms +/- 1.5% not conclusive: might be *1.003x as slow*
gaussian-blur: ?? 396.2ms +/- 0.1% 403.7ms +/- 2.5% not conclusive: might be *1.019x as slow*
darkroom: 1.025x as fast 585.9ms +/- 0.1% 571.4ms +/- 0.2% significant
desaturate: *1.032x as slow* 347.3ms +/- 0.8% 358.3ms +/- 3.1% significant
json: ?? 302.9ms +/- 0.5% 303.6ms +/- 0.6% not conclusive: might be *1.002x as slow*
parse-financial: - 148.8ms +/- 1.1% 148.6ms +/- 1.3%
stringify-tinderbox: ?? 154.1ms +/- 0.7% 155.0ms +/- 0.4% not conclusive: might be *1.006x as slow*
stanford: - 1377.1ms +/- 0.4% 1367.8ms +/- 0.7%
crypto-aes: - 334.7ms +/- 1.5% 330.9ms +/- 0.6%
crypto-ccm: - 286.0ms +/- 1.1% 286.1ms +/- 0.7%
crypto-pbkdf2: - 549.7ms +/- 0.6% 546.3ms +/- 1.8%
crypto-sha256-iterative: 1.011x as fast 206.7ms +/- 0.4% 204.5ms +/- 0.4% significant
Comment 25 • 11 years ago
Oh, I forgot to mention, this brings the testcase (run 100x iterations) from 35 seconds down to 23 seconds which is *most* of the benefit that I saw from not unrolling the loop.
Comment 26 • 11 years ago
With this patch applied, it dies on many tests, one of which is:
BETTER_SPLIT=1 ./shell/js --ion-eager --ion-eager /home/mjrosenb/src/central/central-regalloc/js/src/jit-test/tests/jaeger/loops/hoist-08.js
Assertion failure: *def->output() == alloc, at /home/mjrosenb/src/central/central-regalloc/js/src/jit/RegisterAllocator.cpp:207
Segmentation fault
The reason for this is the ordering of movegroups:
Block 0 [successor 1]
[2,3 Label]
[4,5 Parameter] [def v1 arg:0]
[6,7 Parameter] [def v2 arg:8]
[8,9 Parameter] [def v3 arg:16]
[10,11 Start] [use c c] [use v1:* arg:0] [use v2:* arg:8] [use v3:* arg:16] [use c c]
[12,13 CheckOverRecursed]
[14,15 OsiPoint] [use c c] [use v1:* arg:0] [use v2:* arg:8] [use v3:* arg:16] [use c c]
[0,1 MoveGroup] [arg:8 -> =rax]
[16,17 Unbox] [def v4 =rcx] [use v2:r =rax] [use c c] [use v1:* arg:0] [use v2:* =rax] [use v3:* arg:16] [use c c]
[0,1 MoveGroup] [=rcx -> stack:i3]
[18,19 Integer] [def v5 =rdx]
[0,1 MoveGroup] [=rax -> arg:16]
[0,1 MoveGroup] [arg:16 -> =rax]
[20,21 Goto]
for the final Goto, vreg 3 is allocated to rax, and in at least one of the successor blocks, arg:16 is captured by an OsiPoint, so we spill it (We don't need to spill here, in fact, we *never* need to spill arguments, but I suspect this can happen in other cases)
so the per-block register fixup code inserts an rax -> arg:16 store. Later, while the stack is being re-synchronized for each individual instruction, lsra discovers that [20,21 Goto] expects vreg 3 in rax, so it generates a load, which gets inserted immediately before the Goto. Unfortunately, these two movegroups are now in the wrong order.
Jan, you said you could poke at this with the patch + testcase, so is this something that is just a fundamental flaw, and should be worked around, or did I miss something obvious?
Attachment #796360 - Flags: feedback?(jdemooij)
Comment 27 • 11 years ago
Ok, I just went and tested this on the asm.js microbenchmarks. I ran the whole suite 100 times, then computed the improvement:
copy: 1.1293700
corrections: .1252000
fannkuch: 1.6535700
fasta: .6401500
life: 2.5245000
memops: .2542300
primes: .0143900
skinning: .0945500
note these are percentage improved, so primes is 0.01% "faster" with the patch.
So it looks like across-the-board improvements (very tiny) on the microbenchmarks. I'll test the apps next.
Comment 28 • 11 years ago
Comment on attachment 796360 [details] [diff] [review]
bad_better_split.diff
Sorry for the delay. Do you still see the problems with this patch on tip?
Attachment #796360 - Flags: feedback?(jdemooij)
Assignee | Updated • 10 years ago
Assignee: general → nobody
Comment 29 • 7 years ago
Per the policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs, this bug is being resolved as inactive. If this bug is not an enhancement request, or the bug is present in a supported release of Firefox, it may be reopened.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INACTIVE
Updated • 7 years ago
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Updated • 2 years ago
Severity: normal → S3