Closed
Bug 501189
Opened 15 years ago
Closed 11 years ago
Understand empty loop performance
Categories
(Core :: JavaScript Engine, defect, P2)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
People
(Reporter: dmandelin, Assigned: gal)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
On the attached test case, tracing should be at least as fast as any other technique. But instead, we run in 38 ms (8-9 cycles/iteration) while WebKit runs in 30 ms (6-7 cycles/iteration). (v8 is 104 ms (23 cycles/iteration).) Why? Presumably the generated x86 has most of the answer.
Comment 1•15 years ago
|
||
I tested the test-case version where an object is created each time.
So ({}) instead of {} and reduced the loop count to 1000000 because otherwise the GC kicks in and we have to recompile everything 5 times.
TM runs 117ms (234 cycles/iteration) non jit 1426ms, Webkit 69ms (140 cycles/iteration) and v8 56ms (112 cycles/iteration)
Comment 2•15 years ago
|
||
The shark tool says TM spends about 85-90% of the time in js_Object_tn. This method calls js_newGCThing where we spend about 60% of the time.
The following newGCThing function only needs about 5%.
So we spend about 50% of the time within js_newGCThing excluding called functions.
These numbers are relative to the time spent in js_Execute.
Comment 3•15 years ago
|
||
So within the js_newGCThing function there is a very expensive division in the THINGS_PER_ARENA(nbytes); macro. Due to the shark tool we spend about 30% of the time waiting for the result of this division.
Comment 4•15 years ago
|
||
(In reply to comment #3)
> So within the js_newGCThing function there is a very expensive division in the
> THINGS_PER_ARENA(nbytes); macro. Due to the shark tool we spend about 30% of
> the time waiting for the result of this division.
Division!? My strength-reduction radar must have been broken when reviewing.
/be
Comment 5•15 years ago
|
||
Comment 1 on seem to be in the wrong bug. This bug is about an empty loop. The bug about object allocation is bug 501186 -- right?
/be
Comment 6•15 years ago
|
||
ups my fault. I only got linked to this bug and Andreas wanted to get numbers for the object creation loop.
Anyways this macro with the division is called 1000386 times for this test case with 1000000 iterations.
Assignee | ||
Comment 7•15 years ago
|
||
Patching.
Comment 8•15 years ago
|
||
there is also another division in the same function:
nbytes = JS_ROUNDUP(nbytes, sizeof(JSGCThing));
which goes to
#define JS_HOWMANY(x,y) (((x)+(y)-1)/(y))
#define JS_ROUNDUP(x,y) (JS_HOWMANY(x,y)*(y))
Assignee | ||
Comment 9•15 years ago
|
||
You are still posting in the wrong bug. Lets move this to https://bugzilla.mozilla.org/show_bug.cgi?id=501186
Comment 10•15 years ago
|
||
(In reply to comment #8)
> there is also another division in the same function:
> nbytes = JS_ROUNDUP(nbytes, sizeof(JSGCThing));
>
> which goes to
>
> #define JS_HOWMANY(x,y) (((x)+(y)-1)/(y))
> #define JS_ROUNDUP(x,y) (JS_HOWMANY(x,y)*(y))
Gregor: parting shot in this bug -- that macro with unsigned power of two y (sizeof(JSGCThing) is 8 or 16 and size_t) should not compile to division. Do you know for sure that a compiler is failing to reduce strength?
/be
Reporter | ||
Comment 11•15 years ago
|
||
Reporter | ||
Comment 12•15 years ago
|
||
Reporter | ||
Comment 13•15 years ago
|
||
I've attached a human-readable assembly representation of the trace generated for the empty loop test. I found that I can increase the speed by 2x or more by removing unneeded code.
trace run time (ms)
-------------------------- -------------
baseline from js -j 35
remove SB1 27
remove SB2 25
remove SB3 31
remove SB1,SB2 25
remove SB1,SB3 31
remove SB2,SB3 31
remove SB* (all 3) 25
hoist sp 38
remove SB*, hoist sp 21
remove SB*, hoist sp,i 18
remove SB* & last store, hoist sp,i 18
all above + remove oflow guard 14
At least on this test, the on-trace store-backs are hurting us a lot: removing them is a 35/25=1.4x speedup. Hoisting doesn't make sense without removing the store-backs first, but if we do both, we are up to a 2x speedup.
Removing the overflow guard gets us up to a 2.5x speedup. Doing that is hard in general but should not be too hard for loops with an integer iterator variable that is only set in the loop increment and is bounded by a constant.
Updated•15 years ago
|
Flags: blocking1.9.2+
Priority: -- → P2
Comment 14•15 years ago
|
||
-'ing this as dmandelin and gal say it shouldn't block. Sayre feel free to override.
Assignee: general → gal
Flags: blocking1.9.2+ → blocking1.9.2-
Comment 16•13 years ago
|
||
Current JS shell numbers for the attached testcase:
Interp: 1014 ms, 223.08 cycles/obj
-j: 1065 ms, 234.30 cycles/obj
-m: 54 ms, 11.88 cycles/obj
-m -n: 26 ms, 5.72 cycles/obj
JM+TI seems to be ahead of the Webkit numbers given in comment #0. I don't have a d8 Windows binary to test. Looking at comment #13, it appears that there's still more room for improvement, though?
Blocks: 467263
Summary: TM: Understand empty loop performance → Understand empty loop performance
Comment 17•13 years ago
|
||
If V8 is using process isolation only, not a interrupt checking load on the loop edge, then it will win on empty loop benchmarks. Question is what is the cost of that load on minimal but non-empty loops?
/be
Comment 18•11 years ago
|
||
I get these numbers:
SM : 18 ms
d8 : 26 ms
Ion emits this for an empty loop in the global scope:
[Codegen] cmpl $0x0, 0x0x2a12440 // interrupt check
[Codegen] jne ((513))
[Codegen] movl 0x758(%eax), %ecx
[Codegen] cmpl $0x989680, %ecx
[Codegen] jge ((531))
[Codegen] addl $0x1, %ecx
[Codegen] movl %ecx, 0x758(%eax)
[Codegen] jmp ((545))
Same loop inside a function (local instead of global var):
[Codegen] cmpl $0x0, 0x0x29cc240 // interrupt check
[Codegen] jne ((159))
[Codegen] cmpl $0x989680, %eax
[Codegen] jge ((171))
[Codegen] addl $0x1, %eax
We can use asm.js style mprotect to eliminate the interrupt check (bug 864220), but we beat v8 now and I don't think we have to keep this bug open.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•