Closed Bug 531185 Opened 15 years ago Closed 12 years ago

JM+TI 1.5x slower than v8 in JS-based JPEG encoder

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: blizzard, Assigned: dvander)

References

(Blocks 1 open bug, )

Details

Attachments

(1 file)

This guy ported over a JPEG encoder into JS. TraceMonkey (3.6b1) is about 10x slower than Safari.
We should turn this into a shell test case and then do some analysis.
Bug 531225 will help the worker version, probably.
Depends on: 531225
jitstats for the non-worker version: recorder: started(85), aborted(28), completed(197), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(85), breaks(0), returns(0), merged loop exits(116), unstableInnerCalls(17), blacklisted(8) monitor: exits(6085), timeouts(1), type mismatch(0), triggered(6086), global mismatch(0), flushed(0) Aborts (piped through sort | uniq -c | sort -n -r): 5 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:560@275: No compatible inner tree. 3 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:560@289: Inner tree is trying to grow, abort outer recording. 2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:635@312: No compatible inner tree. 2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:560@289: Inner tree is trying to grow, abort outer recording. 2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:550@213: Inner tree is trying to grow, abort outer recording. 2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:550@203: No compatible inner tree. 2 Abort recording of tree jpegencoder.js:194@30 at jpegencoder.js:195@39: No compatible inner tree. 1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:558@263: No compatible inner tree. 1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:558@263: Inner tree is trying to stabilize, abort outer recording. 1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:536@68: No compatible inner tree. 1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:263@66: Inner tree is trying to grow, abort outer recording. 1 Abort recording of tree jpegencoder.js:627@261 at jpegencoder.js:629@270: No compatible inner tree. 1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:564@341: No compatible inner tree. 1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:263@66: Inner tree is trying to grow, abort outer recording. 1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:258@34: Inner tree is trying to grow, abort outer recording. 1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:257@24: No compatible inner tree. 1 Abort recording of tree jpegencoder.js:219@20 at jpegencoder.js:221@31: No compatible inner tree.
If I make Blacklist(jsbytecode* pc) a no-op, that doesn't change the wall-clock times much as far as I can tell. Just gets rid of that blacklisted(8) and adds a few hundred triggered/exits. Looking at the profile: 35% under js_GetPropertyHelper, called from GetClosureVar. 5% in GetClosureVar itself. 18% under TraceRecorder::monitorRecording (about half of this under closeLoop, 17% under JSOP_GETELEM, 6% under js_GetScrNoteCached, etc). 4% under TraceRecorder::recordLoopEdge, mostly under attemptTreeCall doing the emit, prepare, snapshot dance. 32% under ExecuteTree. The time under here breaks down as (percentages of ExecuteTree time; all times in not under): 10% js_Array_dense_setelem_double 6% js_NewDoubleInRootedValue 5% js_UnboxDouble 4% VisitFrameSlots<FlusNativeStackFrameVisitor> 4% js_DoubleToInt32 3% VisitFrameSlots<BuildNativeFrameVisitor> 3% LeaveTree 2% js_SetReservedSlot 2% js_Array_dense_setelem_int 31% actual jit-generated instructions 2.2% spent in js_Interpret General diagnosis, without having examined the typemaps for those compatible inner tree aborts is that we have int/double mismatch issues in the typemaps, lots of cost in GetClosureVar (maybe bug 530255 will help?). We're also storing doubles in arrays, which is making alarms ring in my head; that ought not to be happening in this testcase, I'd think.
Depends on: 530255
I guess all those doubles might be coming from us reading values out of arrays.... I thought we had a bug on that.
Applying the patch from bug 486213 doesn't seem to make the jitstats or wall-clock time any better. And the profile still shows doubles all over.
Assignee: general → dvander
I made a shell test case out of this (attachment forthcoming). Roughly, where we are with performance here: TM Interp: 424ms (x64) TM JIT: 180ms (x64) v8: 70ms (x86) Nitro: 30ms (x64) We seem to have a few problems. 1. Hoisted undefineds poison trace type stability. 2. We always read numbers from arrays as doubles. 3. Because of #2, we get more type instability. 4. We spend time in the interpreter. So to test the first two theories I edited the JPEG decoder to hoist definitions to the top of functions. This got us down to around 160ms. Then I disabled the Oracle and we got down to very few inner compatible tree problems, and that brought us to 140ms. Clearly those are both big problems we need to tackle, but there's still something else dominating this test case. TraceVis says: interpret 257.881230 monitor 5.255656 record 12.698644 compile 12.710549 execute 1.352516 native 213.188171 ------------------------- Subtotal 503.086766 Non-JS 0.000000 ========================= Total 503.086766 recorder: started(40), aborted(7), completed(81), different header(0), trees trashed(34), slot promoted(0), unstable loop variable(2), breaks(0), returns(0), merged loop exits(43), unstableInnerCalls(4), blacklisted(0) monitor: exits(880), timeouts(0), type mismatch(0), triggered(880), global mismatch(0), flushed(0) So tree connectivity is still not perfect... will investigate further.
We should work on the array unbox path: https://bugzilla.mozilla.org/show_bug.cgi?id=486213 We have a patch, but if I remember correctly it regressed 3d-cube a lot for some unknown reason.
Current js shell results for jpeg_encoder_basic: Interp: 558ms JM: 77ms JM+TI: 71ms d8: 46ms Current js shell results for altered_jpeg_encoder: Interp: 560ms JM: 78ms JM+TI: 72ms d8: 48ms Looks like v8 is still about 1.5x faster than JM+TI.
Blocks: 467263
OS: Mac OS X → All
Hardware: x86 → All
Summary: TraceMonkey 10x slower in JS-based JPEG encoder → JM+TI 1.5x slower than v8 in JS-based JPEG encoder
Version: 1.9.2 Branch → Trunk
We are now as fast as d8 with --ion-parallel-compile=on (20 ms). With baseline we are a few ms faster (17 ms).
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: