Closed Bug 501189 Opened 15 years ago Closed 11 years ago

Understand empty loop performance

Categories

(Core :: JavaScript Engine, defect, P2)


Tracking


RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: gal)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Attached file Test case (deleted) —
On the attached test case, tracing should be at least as fast as any other technique. But instead, we run in 38 ms (8-9 cycles/iteration) while WebKit runs in 30 ms (6-7 cycles/iteration). (v8 is 104 ms (23 cycles/iteration).) Why? Presumably the generated x86 has most of the answer.
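For reference, the cycles/iteration figures follow from the raw times if we assume a 10,000,000-iteration loop (the bound the later Ion codegen compares against, 0x989680) and a roughly 2.2 GHz test machine; the clock speed is an assumption, not stated in this bug. A minimal sketch of the arithmetic in C:

#include <stdio.h>

/* Back-of-the-envelope check of the cycles/iteration figures.  The 2.2 GHz
 * clock and the 10,000,000-iteration count are assumptions for illustration;
 * only the millisecond times come from the comment above. */
int main(void) {
    const double ghz = 2.2;
    const double iterations = 1e7;
    const double times_ms[] = { 38.0, 30.0, 104.0 };   /* TM, WebKit, V8 */
    const char  *names[]    = { "TM", "WebKit", "V8" };
    for (int i = 0; i < 3; i++) {
        double cycles = times_ms[i] * 1e-3 * ghz * 1e9 / iterations;
        printf("%-7s %6.1f ms  ~%4.1f cycles/iteration\n",
               names[i], times_ms[i], cycles);
    }
    return 0;
}

Under those assumptions this prints roughly 8.4, 6.6, and 22.9 cycles/iteration, matching the 8-9, 6-7, and 23 quoted above.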
I tested the test-case variant where an object is created on each iteration, i.e. ({}) instead of {}, and reduced the loop count to 1,000,000 because otherwise the GC kicks in and we have to recompile everything 5 times. TM runs in 117 ms (234 cycles/iteration), non-JIT in 1426 ms, WebKit in 69 ms (140 cycles/iteration), and V8 in 56 ms (112 cycles/iteration).
Shark says TM spends about 85-90% of the time in js_Object_tn. That method calls js_newGCThing, where we spend about 60% of the time; the NewGCThing function below it in the profile accounts for only about 5%. So we spend about 50% of the time within js_newGCThing itself, excluding called functions. These numbers are relative to the time spent in js_Execute.
Within the js_newGCThing function there is a very expensive division in the THINGS_PER_ARENA(nbytes) macro. According to Shark, we spend about 30% of the time waiting for the result of this division.
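For context, the division is expensive because in js_newGCThing the thing size is a runtime parameter, so the compiler has to emit a real divide instruction on every allocation, whereas a divide by a compile-time-constant power of two reduces to a shift. A rough sketch (the constant and function names here are illustrative, not the real jsgc code):

#include <stddef.h>

#define ARENA_SPACE 4064   /* made-up "usable bytes per arena" constant */

/* Divisor known only at run time: compilers must emit an actual div,
 * which costs tens of cycles on x86 hardware of this era. */
size_t things_per_arena_runtime(size_t nbytes) {
    return ARENA_SPACE / nbytes;
}

/* Divisor is a constant power of two: compilers emit a single shift. */
size_t things_per_arena_const(void) {
    return ARENA_SPACE / 16;
}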
(In reply to comment #3)
> Within the js_newGCThing function there is a very expensive division in the
> THINGS_PER_ARENA(nbytes) macro. According to Shark, we spend about 30% of
> the time waiting for the result of this division.

Division!? My strength-reduction radar must have been broken when reviewing.

/be
Comments 1 and onward seem to be in the wrong bug. This bug is about an empty loop. The bug about object allocation is bug 501186 -- right?

/be
Oops, my fault. I was only linked to this bug, and Andreas wanted numbers for the object-creation loop. Anyway, this macro with the division is called 1,000,386 times for this test case with 1,000,000 iterations.
Patching.
There is also another division in the same function:

nbytes = JS_ROUNDUP(nbytes, sizeof(JSGCThing));

which goes to

#define JS_HOWMANY(x,y)  (((x)+(y)-1)/(y))
#define JS_ROUNDUP(x,y)  (JS_HOWMANY(x,y)*(y))
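For readers unfamiliar with these macros, a small standalone check of what they compute; the 16-byte thing size here is just an example value:

#include <stdio.h>
#include <stddef.h>

#define JS_HOWMANY(x,y)  (((x)+(y)-1)/(y))
#define JS_ROUNDUP(x,y)  (JS_HOWMANY(x,y)*(y))

int main(void) {
    /* Round a few request sizes up to a multiple of an assumed 16-byte
     * JSGCThing: 1 -> 16, 14 -> 16, 27 -> 32, 40 -> 48. */
    size_t sizes[] = { 1, 14, 27, 40 };
    for (int i = 0; i < 4; i++)
        printf("JS_ROUNDUP(%zu, 16) = %zu\n",
               sizes[i], (size_t)JS_ROUNDUP(sizes[i], (size_t)16));
    return 0;
}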
You are still posting in the wrong bug. Let's move this to https://bugzilla.mozilla.org/show_bug.cgi?id=501186
(In reply to comment #8)
> There is also another division in the same function:
>
> nbytes = JS_ROUNDUP(nbytes, sizeof(JSGCThing));
>
> which goes to
>
> #define JS_HOWMANY(x,y)  (((x)+(y)-1)/(y))
> #define JS_ROUNDUP(x,y)  (JS_HOWMANY(x,y)*(y))

Gregor: parting shot in this bug -- that macro with an unsigned power-of-two y (sizeof(JSGCThing) is 8 or 16, and size_t) should not compile to a division. Do you know for sure that a compiler is failing to reduce strength?

/be
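To spell out the strength-reduction point: with y an unsigned compile-time constant power of two, JS_HOWMANY is an add and a shift, and JS_ROUNDUP an add and a mask, so no divide should survive. A minimal sketch verifying the equivalence for y = 16 (the hand-written shift/mask forms are for illustration only):

#include <assert.h>
#include <stddef.h>

#define JS_HOWMANY(x,y)  (((x)+(y)-1)/(y))
#define JS_ROUNDUP(x,y)  (JS_HOWMANY(x,y)*(y))

int main(void) {
    for (size_t x = 0; x < 1000; x++) {
        /* y = 16 = 1 << 4: the division becomes a shift, the round-up a mask. */
        assert(JS_HOWMANY(x, (size_t)16) == (x + 15) >> 4);
        assert(JS_ROUNDUP(x, (size_t)16) == ((x + 15) & ~(size_t)15));
    }
    return 0;
}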
Attached file Empty loop trace asm (deleted) —
Attached file Test harness for trace asm (deleted) —
I've attached a human-readable assembly representation of the trace generated for the empty loop test. I found that I can increase the speed by 2x or more by removing unneeded code.

trace                                    run time (ms)
--------------------------------------   -------------
baseline from js -j                      35
remove SB1                               27
remove SB2                               25
remove SB3                               31
remove SB1,SB2                           25
remove SB1,SB3                           31
remove SB2,SB3                           31
remove SB* (all 3)                       25
hoist sp                                 38
remove SB*, hoist sp                     21
remove SB*, hoist sp,i                   18
remove SB* & last store, hoist sp,i      18
all above + remove oflow guard           14

At least on this test, the on-trace store-backs are hurting us a lot: removing them is a 35/25 = 1.4x speedup. Hoisting doesn't make sense without removing the store-backs first, but if we do both, we are up to a 2x speedup. Removing the overflow guard gets us up to a 2.5x speedup. Doing that is hard in general but should not be too hard for loops with an integer iterator variable that is only set in the loop increment and is bounded by a constant.
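To make the table concrete, here is an illustrative C-level picture of what the store-back and hoisting changes amount to; the names (stack_slot, LIMIT) are made up and this is not actual trace code, just the shape of the transformation:

#include <stdint.h>

#define LIMIT 10000000

/* Baseline: the trace writes the counter back to the interpreter's stack
 * slot on every back edge, roughly what SB1-SB3 do above. */
void loop_with_storebacks(volatile int32_t *stack_slot) {
    int32_t i = 0;
    while (i < LIMIT) {
        i = i + 1;          /* the overflow guard would sit on this add */
        *stack_slot = i;    /* per-iteration store-back */
    }
}

/* "remove SB*, hoist sp,i": keep the counter in a register and write it
 * back once when the loop exits. */
void loop_hoisted(int32_t *stack_slot) {
    int32_t i = 0;
    while (i < LIMIT)
        i = i + 1;
    *stack_slot = i;        /* single store-back after the loop */
}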
Flags: blocking1.9.2+
Priority: -- → P2
-'ing this as dmandelin and gal say it shouldn't block. Sayre, feel free to override.
Assignee: general → gal
Flags: blocking1.9.2+ → blocking1.9.2-
Current JS shell numbers for the attached testcase:

Interp:  1014 ms, 223.08 cycles/obj
-j:      1065 ms, 234.30 cycles/obj
-m:        54 ms,  11.88 cycles/obj
-m -n:     26 ms,   5.72 cycles/obj

JM+TI seems to be ahead of the Webkit numbers given in comment #0. I don't have a d8 Windows binary to test. Looking at comment #13, it appears that there's still more room for improvement, though?
Blocks: 467263
Summary: TM: Understand empty loop performance → Understand empty loop performance
If V8 is using process isolation only, not an interrupt-checking load on the loop edge, then it will win on empty-loop benchmarks. The question is what the cost of that load is on minimal but non-empty loops.

/be
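The load in question is the per-back-edge check of the engine's interrupt flag. A rough sketch of the two loop shapes being compared; the flag name is illustrative, not SpiderMonkey's:

#include <stdint.h>

volatile uint32_t interrupt_requested = 0;   /* stand-in for the engine's flag */

/* Loop edge checks the flag: one extra load + compare + branch per
 * iteration, i.e. the "cmpl $0x0, addr; jne" pair in the codegen below. */
uint64_t sum_with_check(uint32_t n) {
    uint64_t s = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (interrupt_requested)
            break;
        s += i;
    }
    return s;
}

/* No per-edge check: the body is just the add and the induction variable. */
uint64_t sum_without_check(uint32_t n) {
    uint64_t s = 0;
    for (uint32_t i = 0; i < n; i++)
        s += i;
    return s;
}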
I get these numbers:

SM : 18 ms
d8 : 26 ms

Ion emits this for an empty loop in the global scope:

[Codegen] cmpl   $0x0, 0x0x2a12440   // interrupt check
[Codegen] jne    ((513))
[Codegen] movl   0x758(%eax), %ecx
[Codegen] cmpl   $0x989680, %ecx
[Codegen] jge    ((531))
[Codegen] addl   $0x1, %ecx
[Codegen] movl   %ecx, 0x758(%eax)
[Codegen] jmp    ((545))

Same loop inside a function (local instead of global var):

[Codegen] cmpl   $0x0, 0x0x29cc240   // interrupt check
[Codegen] jne    ((159))
[Codegen] cmpl   $0x989680, %eax
[Codegen] jge    ((171))
[Codegen] addl   $0x1, %eax

We can use asm.js-style mprotect to eliminate the interrupt check (bug 864220), but we beat V8 now and I don't think we have to keep this bug open.
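The mprotect idea referenced above (bug 864220) replaces the explicit per-edge compare with a memory-protection fault: to interrupt a running loop, the runtime revokes access to a page and catches the resulting signal. The sketch below only illustrates that signaling mechanism, not SpiderMonkey's implementation (which protects the JIT code itself so the loop carries no check at all), and it glosses over async-signal-safety details:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *guard_page;               /* page the hot loop touches */
static sigjmp_buf interrupt_point;

/* SIGSEGV handler: access to the guard page was revoked, so leave the loop. */
static void on_fault(int sig) {
    (void)sig;
    mprotect(guard_page, 4096, PROT_READ | PROT_WRITE);
    siglongjmp(interrupt_point, 1);
}

/* "Interrupt request": make the guard page inaccessible. */
static void on_alarm(int sig) {
    (void)sig;
    mprotect(guard_page, 4096, PROT_NONE);
}

int main(void) {
    guard_page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guard_page == MAP_FAILED)
        return 1;

    signal(SIGSEGV, on_fault);
    signal(SIGALRM, on_alarm);
    alarm(1);                          /* ask for an interrupt in ~1 second */

    unsigned long long i = 0;
    if (sigsetjmp(interrupt_point, 1) == 0) {
        for (;;) {
            guard_page[0] = 0;         /* no compare/branch on the back edge */
            i++;
        }
    }
    printf("interrupted after %llu iterations\n", i);
    return 0;
}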
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
