Open Bug 895127 Opened 11 years ago Updated 2 years ago

ARM: Chrome faster than Firefox on nsieve asm.js port

Categories

(Core :: JavaScript Engine, defect)

ARM
Android
defect

Tracking


REOPENED

People

(Reporter: djvj, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(7 files, 1 obsolete file)

This shows up on ARM only (ran on my Android system). On my Mac, Firefox is faster. I ported nsieve to C and built with emscripten at -O2. The asm.js code can be accessed at: vijayan.ca/nsieve/nsieve.html Chrome on Android performs about 1.4x faster than Firefox (Nightly) on Android. Given that we do better than Chrome on x64, this seems like it might be an ARM codegen issue.
Summary: Chrome faster than Firefox on nsieve asm.js port → ARM: Chrome faster than Firefox on nsieve asm.js port
Attached file C source code. (deleted) —
What do you see for x86? This could be the issue that ARM/x86 loads/store bounds checks aren't hoisted.
Profiling results:

30.29%  nsieve.html:2570:1221: Function aS - Block 25
13.96%  nsieve.html:2570:1010: Function aS - Block 15
12.81%  nsieve.html:2570:1296: Function aS - Block 29
 6.71%  nsieve.html:2570:807:  Function aS - Block 5
 5.50%  nsieve.html:2570:1085: Function aS - Block 19
 3.12%  nsieve.html:2570:882:  Function aS - Block 9
 2.73%  nsieve.html:2570:1268: Function aS - Block 28
 2.71%  nsieve.html:2570:1057: Function aS - Block 18
 2.18%  nsieve.html:0:0:       Function aS - Block 26
 2.13%  nsieve.html:2570:1192: Function aS - Block 23
 1.72%  nsieve.html:2570:1183: Function aS - Block 22
 1.69%  nsieve.html:0:0:       Function aS - Block 16

-=-

The hottest block. Original source:

    for (k = i + i; k <= m; k += i) isPrime[k] = false;

Minified JS:

    {g=e;do{a[b+g|0]=0;g=g+f|0;}while((g|0)<=8e4)}

400335f8 28 nsieve.html:2570:1221: Function aS - Block 25
0x400335f8: ldr r3, [sp, #8]
0x400335fc: adds r4, r3, r0
0x40033600: mov r5, #0
0x40033604: lsrs r12, r4, #24      <- bounds check
0x40033608: strbeq r5, [r11, r4]   <- heap store
0x4003360c: adds r4, r0, r1
0x40033610: movw r12, #14464
0x40033614: movt r12, #1
0x40033618: cmp r4, r12
0x4003361c: bgt 0x4003362c

Note the low bit masking is not needed in this case because this is a byte size operation.

* Removing the low bit masking and the bounds check gives around a 10% reduction in run time.

400cb578 24 nsieve.html:2570:1221: Function aS - Block 25
0x400cb578: ldr r3, [sp, #8]
0x400cb57c: adds r4, r3, r0
0x400cb580: mov r5, #0
0x400cb584: strb r5, [r11, r4]
0x400cb588: adds r4, r0, r1
0x400cb58c: movw r12, #14464 ; 0x3880
0x400cb590: movt r12, #1
0x400cb594: cmp r4, r12
0x400cb598: bgt 0x400cb5a8

* Using high bit masking, rather than the heap bounds check, gives little change. Note that for byte size heap access the low bit mask was not necessary, so the bounds check is just being replaced by the high bit mask.
4001d5c0 28 nsieve.html:2570:1221: Function aS - Block 25
0x4001d5c0: ldr r3, [sp, #8]
0x4001d5c4: adds r4, r3, r0
0x4001d5c8: bic r5, r4, #-67108864 ; 0xfc000000
0x4001d5cc: mov r4, #0
0x4001d5d0: strb r4, [r11, r5]
0x4001d5d4: adds r4, r0, r1
0x4001d5d8: movw r12, #14464 ; 0x3880
0x4001d5dc: movt r12, #1
0x4001d5e0: cmp r4, r12
0x4001d5e4: bgt 0x4001d5f4

* Clang generated code:

    lsl r3, r2, #1
    b .LBB0_5
.LBB0_4:
    strb r1, [r4, r3]
    add r3, r3, r2
.LBB0_5:
    cmp r3, r5
    ble .LBB0_4

Note that the constants are hoisted, the loop blocks are better ordered, and the store instruction is able to make better use of its 'base plus index' encoding.

* GCC generated code:

.L22:
    strb ip, [r1, r3]
    add r3, r3, r2
    cmp r0, r3
    bge .L22

Very similar to the clang code.
If you remove bounds checking altogether, how does our perf compare to Chrome's? (I'm wondering if this is due to bounds-check hoisting.)
(In reply to Luke Wagner [:luke] from comment #6)
> If you remove bounds checking altogether, how does our perf compare to
> Chrome's? (I'm wondering if this is due to bounds-check hoisting.)

With the bounds checking removed altogether, the performance improved by around 10%. I have not actually compared the results to Chrome, but based on the reported 1.4x speed difference this still leaves a good gap. For the best performance it might be necessary to consider allowing the heap base to be hoisted too, to allow better use of the instruction encoding within such loops.
Depends on: 897412
Depends on: 897425
Already mentioned in IRC, but perhaps we should look into the run-time characteristics of malloc/memset. I replaced the malloc() call with a use of a static array, and here were the times:

mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_static.js
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 23.450000 (count=143020000)
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve.10000.js
warning: successfully compiled asm.js code (total compilation time 42ms)
CXX Time: 34.527000 (count=143020000)

I used emscripten to compile the new test, but not the old one, so I should make sure there aren't differences in emscripten that are causing these timing differences.
More information! I had a theory that emscripten simply generates better code for static array accesses than it does for dynamically allocated arrays. To test this, I used the original code, along with a custom malloc implementation (in C) that can basically handle nsieve, and nothing else. The results surprised me, in that the dynamically allocated test was a non-trivial amount *faster* than the statically allocated test.

mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_static.js
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 23.754000 (count=143020000)
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 20.274000 (count=143020000)
I am confused about why my testing shows that we're spending some absurd amount of time in malloc() (or at least that malloc slows us down by an ABSURD amount), and yet profiling doesn't show any time spent there. Does the profiling code not emit information for js that doesn't have source maps back to C++ code?
More finger pointing:

mjrosenb@eve:~/bench$ perf stat js.ion.opt nsieve_malloc.js; perf stat js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 320ms)
CXX Time: 34.346000 (count=143020000)

 Performance counter stats for 'js.ion.opt nsieve_malloc.js':

      34884.707897 task-clock                #    0.995 CPUs utilized
               814 context-switches          #    0.023 K/sec
                 0 CPU-migrations            #    0.000 K/sec
             3,169 page-faults               #    0.091 K/sec
    58,297,313,868 cycles                    #    1.671 GHz
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
    83,104,489,736 instructions              #    1.43  insns per cycle
    13,868,335,782 branches                  #  397.548 M/sec
       207,125,841 branch-misses             #    1.49% of all branches

      35.049975289 seconds time elapsed

warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 20.271000 (count=143020000)

 Performance counter stats for 'js.ion.opt ./nsieve_neutered_malloc.js':

      20192.141046 task-clock                #    0.993 CPUs utilized
               372 context-switches          #    0.018 K/sec
                 0 CPU-migrations            #    0.000 K/sec
             2,331 page-faults               #    0.115 K/sec
    34,323,928,941 cycles                    #    1.700 GHz
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
    61,163,271,375 instructions              #    1.78  insns per cycle
    14,229,247,827 branches                  #  704.692 M/sec
       205,919,866 branch-misses             #    1.45% of all branches

      20.333820067 seconds time elapsed

mjrosenb@eve:~/bench$

It looks like 1/4 of our instructions are spent in malloc, which is a totally unreasonable amount.
For comparison, running the same test on x86 gives:

~/bench; perf stat js.ion.opt ./nsieve_malloc.js; perf stat js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 13ms)
CXX Time: 7.029000 (count=143020000)

 Performance counter stats for 'js.ion.opt ./nsieve_malloc.js':

       7077.914139 task-clock                #    1.001 CPUs utilized
                28 context-switches          #    0.004 K/sec
                 0 CPU-migrations            #    0.000 K/sec
            11,953 page-faults               #    0.002 M/sec
    22,930,948,302 cycles                    #    3.240 GHz                     [83.32%]
     5,258,175,147 stalled-cycles-frontend   #   22.93% frontend cycles idle    [83.32%]
     3,418,331,293 stalled-cycles-backend    #   14.91% backend  cycles idle    [66.68%]
    43,925,367,676 instructions              #    1.92  insns per cycle
                                             #    0.12  stalled cycles per insn [83.37%]
     8,572,309,463 branches                  # 1211.135 M/sec                   [83.37%]
       174,790,702 branch-misses             #    2.04% of all branches         [83.32%]

       7.073529927 seconds time elapsed

warning: successfully compiled asm.js code (total compilation time 1ms)
CXX Time: 6.744000 (count=143020000)

 Performance counter stats for 'js.ion.opt ./nsieve_neutered_malloc.js':

       6779.736561 task-clock                #    1.001 CPUs utilized
                17 context-switches          #    0.003 K/sec
                 0 CPU-migrations            #    0.000 K/sec
            10,547 page-faults               #    0.002 M/sec
    22,163,384,075 cycles                    #    3.269 GHz                     [83.33%]
     8,100,893,711 stalled-cycles-frontend   #   36.55% frontend cycles idle    [83.35%]
     4,770,117,610 stalled-cycles-backend    #   21.52% backend  cycles idle    [66.70%]
    36,908,598,483 instructions              #    1.67  insns per cycle
                                             #    0.22  stalled cycles per insn [83.35%]
     7,167,986,291 branches                  # 1057.266 M/sec                   [83.35%]
       179,414,425 branch-misses             #    2.50% of all branches         [83.30%]

       6.775314255 seconds time elapsed

I am concerned that we are executing almost twice the number of instructions on ARM as we are on x86, even in the neutered testcase.
Theories currently are:

1) codegen for dlmalloc is *awful* on arm
2) codegen for dlmalloc is broken on arm, and we're taking some branches that we shouldn't be (and somehow or other aren't tripping any asserts or corrupting memory)
3) dlmalloc is somehow not a good fit for arm's memory characteristics

I can likely test 2 by dumping memory / an execution trace, and making sure they are the same on x86 and arm. Profiling and/or inspecting the code generated for dlmalloc is probably the only way to distinguish 1 from 3.
Perhaps ffi cost is higher on arm? Did you measure the time spent in ffi calls (luke has a patch)?
It's hard to measure FFI calls now since they've been optimized (to be direct asm.js-to-Ion calls). If you put 'return false' at the top of TryEnablingIon, you can put a counter in InvokeFromAsmJS_(Ignore,ToInt32,ToNumber) and see how many FFI calls there are; if there are more than a small number then we should probably change the benchmark to not do that.
I can likely just instrument all of the ffi calls that we can make from dlmalloc, and see how many times we call each of those. The obvious choice was sbrk. Since only 4 sizes are being allocated, and there is only one thing live at any given time, it is not unreasonable for a malloc implementation to just return all of the freed memory as soon as it is freed. I instrumented the ffi sbrk function with a print statement, and the result was:

mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_malloc_sbrkcnt.js
warning: successfully compiled asm.js code (total compilation time 46ms)
_SBRK
_SBRK
_SBRK
_SBRK
_SBRK
CXX Time: 34.288000 (count=143020000)
mjrosenb@eve:~/bench$

so we call sbrk 5 times total. Out of curiosity, I poked around in emscripten to see if HAVE_MMAP is unset/set to 0, or MORECORE_CANNOT_TRIM is set. I didn't see either of them. Turning off code that we can't possibly use seems like a good strategy no matter what.
Attached file C source - noline. (obsolete) (deleted) —
Attached file C source - noinline (deleted) —
Attachment #787315 - Attachment is obsolete: true
This example hoists the 'malloc' operation, reusing the flags vector, to help measure the impact of 'malloc' on the run time. Timing results on the ARM show little difference, which suggests malloc is not hot; this is consistent with the perf results, which did not show malloc as a hot function. These were compiled with:

em++ -O2 -g2 -o nsieve.js nsieve.c
em++ -O2 -g2 -o nsieves.js nsieves.c
Attached file C source code for a combined test. (deleted) —
This test combines four variations. The first two use a static vector for isPrime; the first had a larger offset for isPrime and the second had a small offset. The high impact of loading large integer constants on the ARM might be significant. Perhaps hoisting such loads needs to be explored, and also hoisting the heap base+offset calculation. The third case hoists the 'malloc', and the last is the original; these are very close, so malloc does not appear to be hot.

Static vector 1:        CXX Time: 121.194000 (count=143020000)
Static vector 2:        CXX Time: 103.530000 (count=143020000)
Dynamic hoisted vector: CXX Time: 108.764000 (count=143020000)
Dynamic vector:         CXX Time: 109.119000 (count=143020000)
It has been noted that if there is less unrolling, or perhaps less inlining, then the code runs faster. The hot block is as follows, and can be compared with the hot block noted in comment 5.

0xb64f0538: adds r5, r2, r3
0xb64f053c: mov r6, #0
0xb64f0540: lsrs r12, r5, #24
0xb64f0544: strbeq r6, [r11, r5]
0xb64f0548: adds r5, r3, r1
0xb64f054c: cmp r5, r4
0xb64f0550: bgt 0xb64f055c
0xb64f0554: mov r3, r5
0xb64f0558: b 0xb64f0538

A significant difference is that the loop termination count is stored in a register (r4) rather than being loaded on each iteration. There also appears to be less register pressure, as there is no stack load and store on each iteration. The loop block order might still use some improvement: move the termination test to the tail to reduce the number of branches.
Jan has a fix for the issue of storing the variable isPrime back to the heap on every iteration of the loop, but there is also the problem that we *load* it on every iteration of the loop. Some of my findings relayed on IRC:

13:29 < mjrosenb|ARM> jandem: so investigation results: backtracking does a bit better because it pulls the load/store combo out of a tight loop.
13:30 < mjrosenb|ARM> it looks like bt also needs the store elimination logic, although I'm not sure such a simple test will be sufficient.
13:31 < mjrosenb|ARM> lastly, I want to try both of the interval splitting optimizations. BT has a simple form of one, but it could likely be improved
13:32 < mjrosenb|ARM> I think getting these optimizations in will be a bit annoying, since nothing seems to rely on the actual location/nesting level of blocks.
thoroughly annotated partial nsieve-unrolled dump, using the backtracking allocator:

0x76892450: push {lr} ; (str lr, [sp, #-4]!)
0x76892454: sub sp, sp, #36 ; 0x24
0x76892458: movw r12, #28404 ; 0x6ef4
0x7689245c: movt r12, #121 ; 0x79
0x76892460: ldr lr, [r12]
0x76892464: cmp lr, sp
0x76892468: bcs 0x76899b54
0x7689246c: movw r0, #20001 ; 0x4e21
0x76892470: tst sp, #7
0x76892474: beq 0x7689247c
0x76892478: bkpt 0x0001
0x7689247c: bl 0x76892988         // call malloc
0x76892480: str r0, [sp, #16]     // store result, due to memset
0x76892484: mov r1, r0            // this happens because we *CANNOT* specify the same register on alu ops.
0x76892488: adds r0, r1, #2       // this should just be add r0, r0, 2
0x7689248c: mov r1, #1            // setting each byte to 1
0x76892490: movw r2, #19999 ; 0x4e1f
0x76892494: tst sp, #7
0x76892498: beq 0x768924a0
0x7689249c: bkpt 0x0002
0x768924a0: bl 0x76897800         // call memset
0x768924a4: mov r2, #0            // not exactly like the C code, we start the loop by assuming isPrime[...] is true
0x768924a8: mov r0, #0
0x768924ac: mov r1, #2            // initialize i to 2
outer head:
0x768924b0: cmp r2, #0            // isPrime is inverted before we get to this point.
0x768924b4: beq 0x768924c0        // jump to tight loop prehead
0x768924b8: str r0, [sp, #32]     // update stack value of count :(
0x768924bc: b 0x76892510          // jump to outer loopcont.
tight loop prehead:
0x768924c0: lsl r2, r1, #1        // initialize k to 2i
0x768924c4: movw r12, #20000
0x768924c8: cmp r2, r12           // initial check to see if k > m
0x768924cc: bgt 0x76892508
0x768924d0: str r0, [sp, #32]     // update stack value of count
0x768924d4: ldr r0, [sp, #16]     // load isPrime :(
tight loop head:
0x768924d8: adds r3, r0, r2       // calculate &isPrime[k]
0x768924dc: mov r4, #0            // generate immediate to store.
0x768924e0: lsrs r12, r3, #24     // bounds check!
0x768924e4: strbeq r4, [r11, r3]  // store if we passed the bounds check.
0x768924e8: adds r3, r2, r1       // k' <- k+i
0x768924ec: movw r12, #20000 ; 0x4e20
0x768924f0: cmp r3, r12           // see if k' > m
0x768924f4: bgt 0x76892500        // jump to after tight loop.
0x768924f8: mov r2, r3            // k <- k'
0x768924fc: b 0x768924d8          // jump to tight loop head
after tight loop:
0x76892500: str r0, [sp, #16]     // sync back isPrime
0x76892504: ldr r0, [sp, #32]     // load count
0x76892508: adds r2, r0, #1       // increment count
0x7689250c: str r2, [sp, #32]     // sync back count.
outer loop cont.
0x76892510: adds r3, r1, #1       // increment i into new variable
0x76892514: movw r12, #20000 ; 0x4e20
0x76892518: cmp r3, r12           // check if i is larger than m
0x7689251c: bgt 0x76892550        // conditionally jump to post-loop.
0x76892520: ldr r1, [sp, #16]     // load isPrime (yet again!), ooh, it has a different register!
0x76892524: adds r0, r1, r3       // compute &isPrime[i]
0x76892528: lsrs r12, r0, #24     // bounds check
0x7689252c: ldrsbeq r1, [r11, r0] // conditional load
0x76892530: movne r1, #0          // default to 0 on oob accesses
0x76892534: and r0, r1, #1        // start godawful tmp = (isPrime[i] & 1) == 0 check
0x76892538: cmp r0, #0            // ?= 0
0x7689253c: mov r2, #0            // default to 0
0x76892540: moveq r2, #1          // x==0 => 1
0x76892544: mov r1, r3            // fix-up i.
0x76892548: ldr r0, [sp, #32]     // load count
0x7689254c: b 0x768924b0          // branch to outer head.
post-loop:
0x76892550: ldr r0, [sp, #16]     // load isPrime as an argument to free()
0x76892554: tst sp, #7
0x76892558: beq 0x76892560
0x7689255c: bkpt 0x0003
0x76892560: bl 0x76895f38         // call free()
annotation for the same chunk of C/js code, this time compiled with lsra:

0x76892450: push {lr} ; (str lr, [sp, #-4]!)
0x76892454: sub sp, sp, #28
0x76892458: movw r12, #28404 ; 0x6ef4
0x7689245c: movt r12, #121 ; 0x79
0x76892460: ldr lr, [r12]
0x76892464: cmp lr, sp
0x76892468: bcs 0x76899b50
0x7689246c: movw r0, #20001 ; 0x4e21
0x76892470: tst sp, #7
0x76892474: beq 0x7689247c
0x76892478: bkpt 0x0001
0x7689247c: bl 0x768929e4         // call malloc?
0x76892480: adds r1, r0, #2
0x76892484: mov r2, #1
0x76892488: movw r3, #19999 ; 0x4e1f
0x7689248c: str r0, [sp, #24]     // stash the address of isPrime
0x76892490: mov r0, r1
0x76892494: mov r1, r2
0x76892498: mov r2, r3
0x7689249c: tst sp, #7
0x768924a0: beq 0x768924a8
0x768924a4: bkpt 0x0002
0x768924a8: bl 0x76897954         // call memset.
// loop preheader
0x768924ac: mov r0, #0            // initialize temp for holding the current prime?
0x768924b0: mov r1, #0
0x768924b4: mov r2, #2            // initialize i to 2.
// some sort of loop header.
0x768924b8: cmp r0, #0            // test to see if the current number is a prime number -- this is inverted?
0x768924bc: beq 0x768924cc
0x768924c0: ldr r3, [sp, #24]     // it is not prime! [sp, #24] contains isPrime!!
0x768924c4: mov r0, r1
0x768924c8: b 0x76892518          // it isn't
0x768924cc: lsl r0, r2, #1        // initialize k -- add_i instantiation
0x768924d0: movw r12, #20000 ; 0x4e20
0x768924d4: cmp r0, r12
0x768924d8: ble 0x768924e4        // make sure that k <= m
0x768924dc: ldr r3, [sp, #24]     // loading r3 *everywhere*
0x768924e0: b 0x76892514          // don't enter the loop.
// loop head.
0x768924e4: ldr r3, [sp, #24]     // load in tight loop is bad.
0x768924e8: adds r4, r3, r0       // compute &isPrime[k]
0x768924ec: mov r5, #0            // make a temp to store 0 -- this may be hoistable, debatable benefits.
0x768924f0: lsrs r12, r4, #24     // heap check
0x768924f4: strbeq r5, [r11, r4]  // write 0 into isPrime
0x768924f8: adds r4, r0, r2       // add i to k, rename k to r4.
0x768924fc: movw r12, #20000 ; 0x4e20
0x76892500: cmp r4, r12
0x76892504: bgt 0x76892514        // terminate inner loop
0x76892508: str r3, [sp, #24]     // we don't change r3, why do we store it back?
0x7689250c: mov r0, r4            // sync back change to k.
0x76892510: b 0x768924e4
// inner loop termination
0x76892514: adds r0, r1, #1       // JOIN for primality check?
0x76892518: adds r1, r2, #1       // increment i.
0x7689251c: movw r12, #20000 ; 0x4e20
0x76892520: cmp r1, r12
0x76892524: bgt 0x76892568        // terminate whole loop?
0x76892528: adds r2, r3, r1       // compute &isPrime[i]
0x7689252c: lsrs r12, r2, #24     // bounds check
0x76892530: ldrsbeq r2, [r11, r2] // load isPrime[i]
0x76892534: movne r2, #0          // fallback to loading 0
0x76892538: and r4, r2, #1        // mask out bottom bit
0x7689253c: cmp r4, #0            // superfluous comparison
0x76892540: mov r2, #0            // oh god, why?
0x76892544: moveq r2, #1          // we *really* should do better than this.
// this looks like a movegroup shuffle.
0x76892548: sub sp, sp, #8
0x7689254c: str r3, [sp, #32]
0x76892550: str r1, [sp]
0x76892554: mov r1, r0
0x76892558: mov r0, r2
0x7689255c: ldr r2, [sp]
0x76892560: add sp, sp, #8
// movegroup over.
0x76892564: b 0x768924b8
0x76892568: str r0, [sp, #20]
0x7689256c: mov r0, r3
0x76892570: tst sp, #7
0x76892574: beq 0x7689257c
0x76892578: bkpt 0x0003
0x7689257c: bl 0x76895e8c         // call free.
aaand a commented non-unrolled case. I believe this is with the backtracking allocator:

0x76893450: push {lr} ; (str lr, [sp, #-4]!)
0x76893454: sub sp, sp, #28
0x76893458: movw r12, #28404 ; 0x6ef4
0x7689345c: movt r12, #121 ; 0x79
0x76893460: ldr lr, [r12]
0x76893464: cmp lr, sp
0x76893468: bcs 0x7689ab54
0x7689346c: mov r1, #0            // initialize ? count_07?
0x76893470: mov r0, #1            // initialize outer i to 1.
0x76893474: str r0, [sp, #24]     // sync ? and outer i to the stack
0x76893478: str r1, [sp, #20]
top of all loops:
0x7689347c: movw r0, #10000       // compute 10000
0x76893480: ldr r2, [sp, #24]     // grab outer i from the stack -- really?
0x76893484: and r1, r2, #31       // bitmask for shift. bounds check should eliminate this.
0x76893488: lsl r1, r0, r1        // compute m = 10000 * (1 << i) == 10000 << i
0x7689348c: str r1, [sp, #12]     // store m to the stack
0x76893490: adds r0, r1, #1       // we allocate m+1 elements
0x76893494: tst sp, #7
0x76893498: beq 0x768934a0
0x7689349c: bkpt 0x0001
0x768934a0: bl 0x768937d0         // call malloc
0x768934a4: str r0, [sp, #16]     // store flags to the stack
0x768934a8: ldr r0, [sp, #12]     // grab m from the stack -- can we use callee saved regs?
0x768934ac: cmp r0, #2            // m < 2? that will *never* happen.
0x768934b0: bge 0x768934bc
0x768934b4: mov r0, #0            // wtf? why do we only initialize count to 0 here? this is llvm generating absurd code.
0x768934b8: b 0x76893564          // b to end of middle
0x768934bc: ldr r1, [sp, #16]     // re-load flags -- argument to memset
0x768934c0: adds r0, r1, #2       // actually start at &flags[2]
0x768934c4: ldr r1, [sp, #12]     // third argument is m-1; load m
0x768934c8: subs r2, r1, #1       // subs? we should really generate sub here.
0x768934cc: mov r1, #1            // second argument is 1.
0x768934d0: tst sp, #7
0x768934d4: beq 0x768934dc
0x768934d8: bkpt 0x0002
0x768934dc: bl 0x76898648         // call memset
0x768934e0: mov r0, #0            // zero count (called $count_020_i in js)
0x768934e4: mov r1, #2            // set i_119_i to 2? that sounds like the inner i.
outer loop head:
0x768934e8: ldr r2, [sp, #16]     // re-load flags
0x768934ec: adds r3, r2, r1       // compute &flags[inner i]
0x768934f0: lsrs r12, r3, #24     // bounds check
0x768934f4: ldrsbeq r3, [r11, r3] // load flags[inner i]
0x768934f8: movne r3, #0          // oob handler
0x768934fc: and r4, r3, #1        // extract bottom bit, kind of pointless.
0x76893500: cmp r4, #0            // tst... the instruction you want is tst.
0x76893504: beq 0x7689354c        // branch to composite case
0x76893508: lsl r3, r1, #1        // set k = i + i (inner i, here)
0x7689350c: ldr r4, [sp, #12]     // grab m from the stack
0x76893510: cmp r3, r4            // is k's initial value out of bounds?
0x76893514: bgt 0x76893544        // if it is, go to after tight loop.
0x76893518: ldr r2, [sp, #16]     // re-load flags --- again?!
tight loop head:
0x7689351c: adds r4, r2, r3       // compute flags[k]
0x76893520: mov r5, #0            // load 0 for store-back
0x76893524: lsrs r12, r4, #24     // bounds check
0x76893528: strbeq r5, [r11, r4]  // flags[k] = 0
0x7689352c: adds r4, r3, r1       // k' = k + i
0x76893530: ldr r3, [sp, #12]     // m, yet again.
0x76893534: cmp r4, r3            // if m > k, abort
0x76893538: bgt 0x76893544        // go to after tight loop
0x7689353c: mov r3, r4            // synchronize variables
0x76893540: b 0x7689351c          // always go to tight loop head
after tight loop:
0x76893544: adds r2, r0, #1       // increment count
0x76893548: mov r0, r2            // synchronize registers
composite case:
0x7689354c: adds r2, r1, #1       // increment i-inner
0x76893550: ldr r1, [sp, #12]     // re-load m (AGAIN)
0x76893554: cmp r2, r1            // upper bounds check on i inner
0x76893558: bgt 0x76893564        // branch to end of middle
0x7689355c: mov r1, r2            // sync incremented i back to r1
0x76893560: b 0x768934e8          // go to the outer loop head.
end of middle:
0x76893564: ldr r1, [sp, #20]     // load count
0x76893568: adds r2, r0, r1       // add inner count to outer count.
0x7689356c: str r2, [sp, #8]      // store outer count back to the stack.
0x76893570: ldr r0, [sp, #16]     // load flags for argument to free.
0x76893574: tst sp, #7
0x76893578: beq 0x76893580
0x7689357c: bkpt 0x0003
0x76893580: bl 0x76896d80         // call free
0x76893584: ldr r1, [sp, #24]     // load outer i
0x76893588: adds r0, r1, #1       // increment outer i
0x7689358c: cmp r0, #4            // see if the outer loop terminated
0x76893590: bge 0x768935b0        // branch to after all loops
0x76893594: str r0, [sp, #24]     // cycle fixup
0x76893598: push {lr} ; (str lr, [sp, #-4]!)
0x7689359c: ldr lr, [sp, #12]
0x768935a0: str lr, [sp, #24]
0x768935a4: ldr lr, [sp]
0x768935a8: add sp, sp, #4
0x768935ac: b 0x7689347c
after all loops:
0x768935b0: ldr r0, [sp, #8]      // load outer count for returning
0x768935b4: add sp, sp, #28
0x768935b8: pop {pc} ; (ldr pc, [sp], #4)
First column is without this "improvement", second column is with it.

TEST                      COMPARISON            FROM                 TO                  DETAILS
=============================================================================

and since octane doesn't come with a diff function:

mjrosenb@eve:~/bench/octane$ BETTER_SPLIT=1 ~/src/central/central-regalloc/js/objs/armhf-opt/shell/js ./run.js; ~/src/central/central-regalloc/js/objs/armhf-opt/shell/js ./run.js;
Richards: 7300
DeltaBlue: 6856
Crypto: 6717
RayTrace: 7357
EarleyBoyer: 6321
RegExp: 894
Splay: 4669
NavierStokes: 9043
PdfJS: 2963
Mandreel: 1377
Gameboy: 7136
CodeLoad: 7199
Box2D: 2028
---- Score (version 8): 4427
Richards: 7442
DeltaBlue: 6850
Crypto: 6175
RayTrace: 7322
EarleyBoyer: 6323
RegExp: 936
Splay: 4425
NavierStokes: 8877
PdfJS: 3003
Mandreel: 1350
Gameboy: 8029
CodeLoad: 7545
Box2D: 1918
---- Score (version 8): 4429

which isn't saying all that much, since evidently, the change from one run to the next is appreciable.

** TOTAL **:              ??                    5499.5ms +/- 0.2%    5531.1ms +/- 0.9%   not conclusive: might be *1.006x as slow*
=============================================================================
  ai:                     ??                    415.8ms +/- 1.4%     425.9ms +/- 2.4%    not conclusive: might be *1.024x as slow*
    astar:                ??                    415.8ms +/- 1.4%     425.9ms +/- 2.4%    not conclusive: might be *1.024x as slow*
  audio:                  *1.013x as slow*      2074.3ms +/- 0.4%    2100.4ms +/- 1.1%   significant
    beat-detection:       ??                    592.7ms +/- 0.6%     607.7ms +/- 3.3%    not conclusive: might be *1.025x as slow*
    dft:                  *1.024x as slow*      954.4ms +/- 0.5%     977.4ms +/- 2.2%    significant
    fft:                  1.048x as fast        216.8ms +/- 1.9%     206.9ms +/- 1.5%    significant
    oscillator:           -                     310.4ms +/- 1.6%     308.4ms +/- 1.9%
  imaging:                ??                    1329.4ms +/- 0.2%    1333.4ms +/- 1.5%   not conclusive: might be *1.003x as slow*
    gaussian-blur:        ??                    396.2ms +/- 0.1%     403.7ms +/- 2.5%    not conclusive: might be *1.019x as slow*
    darkroom:             1.025x as fast        585.9ms +/- 0.1%     571.4ms +/- 0.2%    significant
    desaturate:           *1.032x as slow*      347.3ms +/- 0.8%     358.3ms +/- 3.1%    significant
  json:                   ??                    302.9ms +/- 0.5%     303.6ms +/- 0.6%    not conclusive: might be *1.002x as slow*
    parse-financial:      -                     148.8ms +/- 1.1%     148.6ms +/- 1.3%
    stringify-tinderbox:  ??                    154.1ms +/- 0.7%     155.0ms +/- 0.4%    not conclusive: might be *1.006x as slow*
  stanford:               -                     1377.1ms +/- 0.4%    1367.8ms +/- 0.7%
    crypto-aes:           -                     334.7ms +/- 1.5%     330.9ms +/- 0.6%
    crypto-ccm:           -                     286.0ms +/- 1.1%     286.1ms +/- 0.7%
    crypto-pbkdf2:        -                     549.7ms +/- 0.6%     546.3ms +/- 1.8%
    crypto-sha256-iterative: 1.011x as fast     206.7ms +/- 0.4%     204.5ms +/- 0.4%    significant
Oh, I forgot to mention: this brings the testcase (run 100x iterations) from 35 seconds down to 23 seconds, which is *most* of the benefit that I saw from not unrolling the loop.
Attached patch bad_better_split.diff (deleted) — Splinter Review
With this patch applied, it dies on many tests, one of which is:

BETTER_SPLIT=1 ./shell/js --ion-eager --ion-eager /home/mjrosenb/src/central/central-regalloc/js/src/jit-test/tests/jaeger/loops/hoist-08.js
Assertion failure: *def->output() == alloc, at /home/mjrosenb/src/central/central-regalloc/js/src/jit/RegisterAllocator.cpp:207
Segmentation fault

The reason for this is the ordering of movegroups:

Block 0 [successor 1]
[2,3 Label]
[4,5 Parameter] [def v1 arg:0]
[6,7 Parameter] [def v2 arg:8]
[8,9 Parameter] [def v3 arg:16]
[10,11 Start] [use c c] [use v1:* arg:0] [use v2:* arg:8] [use v3:* arg:16] [use c c]
[12,13 CheckOverRecursed]
[14,15 OsiPoint] [use c c] [use v1:* arg:0] [use v2:* arg:8] [use v3:* arg:16] [use c c]
[0,1 MoveGroup] [arg:8 -> =rax]
[16,17 Unbox] [def v4 =rcx] [use v2:r =rax] [use c c] [use v1:* arg:0] [use v2:* =rax] [use v3:* arg:16] [use c c]
[0,1 MoveGroup] [=rcx -> stack:i3]
[18,19 Integer] [def v5 =rdx]
[0,1 MoveGroup] [=rax -> arg:16]
[0,1 MoveGroup] [arg:16 -> =rax]
[20,21 Goto]

For the final Goto, vreg 3 is allocated to rax, and in at least one of the successor blocks, arg:16 is captured by an OsiPoint, so we spill it (we don't need to spill here; in fact, we *never* need to spill arguments, but I suspect this can happen in other cases), so the per-block register fixup code inserts an rax -> arg:16 store. Later, while the stack is being re-synchronized for each individual instruction, lsra discovers that [20,21 Goto] expects vreg 3 in rax, so it generates a load, which gets inserted immediately before the Goto. Unfortunately, these two movegroups are now in the wrong order. Jan, you said you could poke at this with the patch + testcase, so is this something that is just a fundamental flaw, and should be worked around, or did I miss something obvious?
Attachment #796360 - Flags: feedback?(jdemooij)
Ok, I just went and tested this on the asm.js microbenchmarks. I ran the whole suite 100 times, then computed the improvement:

copy         1.1293700
corrections   .1252000
fannkuch     1.6535700
fasta         .6401500
life         2.5245000
memops        .2542300
primes        .0143900
skinning      .0945500

Note these are percentage improvements, so primes is 0.01% "faster" with the patch. So it looks like across-the-board improvements (very tiny) on the microbenchmarks. I'll test the apps next.
Comment on attachment 796360 [details] [diff] [review] bad_better_split.diff Sorry for the delay. Do you still see the problems with this patch on tip?
Attachment #796360 - Flags: feedback?(jdemooij)
Assignee: general → nobody
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INACTIVE
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Severity: normal → S3