Bug 622494 - Investigate performance on IMVU model skinning benchmark
Status: NEW (open; reported 14 years ago, last updated 2 years ago)
Component: Core :: JavaScript Engine (defect)
Reporter: caustin; Assignee: Unassigned
Whiteboard: imvu
Attachments: 1 file (text/plain)
I claimed that JavaScript will never be as fast as native code, and thus that the web needs Native Client or something like it. Shaver called my bluff: "Take a performance-sensitive algorithm that matters to you, write a benchmark, and submit it as a bug."
I took the SSE-accelerated Cal3D skinning loop and implemented four versions of it. I benchmarked them on a MacBook Pro, 2.5 GHz Core i5. C++ is compiled with gcc -O2, and JavaScript is run with Firefox 4 beta 8. Results are in millions of vertices skinned per second.
C++ w/ SSE intrinsics (near theoretical maximum on a single core): 98.3
C++ w/ scalars: 61.2
JavaScript: 5.1
JavaScript w/ typed arrays: 8.4
The code is available at https://github.com/chadaustin/Web-Benchmarks
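For readers who don't want to chase the repo, here is a minimal sketch of the kind of inner loop being measured. The names, the single-influence simplification, and the flat Float32Array layout are assumptions for illustration; the actual benchmark (which blends multiple weighted bone influences per vertex, as Cal3D does) is in the repository above.

function skinVertices(boneMatrices, positions, boneIndices, output, n) {
  for (var i = 0; i < n; ++i) {
    var m = boneIndices[i] * 12;   // each bone transform: 3x4 row-major matrix
    var p = i * 3;
    var x = positions[p], y = positions[p + 1], z = positions[p + 2];
    // Rotate and translate the vertex by its bone matrix.
    output[p]     = boneMatrices[m]     * x + boneMatrices[m + 1] * y +
                    boneMatrices[m + 2] * z + boneMatrices[m + 3];
    output[p + 1] = boneMatrices[m + 4] * x + boneMatrices[m + 5] * y +
                    boneMatrices[m + 6] * z + boneMatrices[m + 7];
    output[p + 2] = boneMatrices[m + 8] * x + boneMatrices[m + 9] * y +
                    boneMatrices[m + 10] * z + boneMatrices[m + 11];
  }
}

// The "typed arrays" variant backs everything with typed arrays, e.g.:
// var positions = new Float32Array(N * 3), output = new Float32Array(N * 3);
// var boneIndices = new Uint16Array(N);
// var boneMatrices = new Float32Array(numBones * 12);
// skinVertices(boneMatrices, positions, boneIndices, output, N);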
vlad did some investigation on IRC:
<vlad> are we really guarding that these things are still functions?
<vlad> is that what's going on here?
<vlad> well, got a 15% win by removing all the inner calls to TransformPoint/TransformVector/ScaleMatrix/AddScaledMatrix and inlining those
<vlad> which is stupid
The JS testcases are trivial to turn into shell testcases by changing dump() to print(). (I think vlad meant alert() there.)
Updated•14 years ago (by reporter)
OS: Mac OS X → All
Whiteboard: imvu
Comment 1•14 years ago
No, he probably meant dump(), which writes to stdout in the browser if the right pref (browser.dom.window.dump.enabled) is enabled.
I'll attach a shell version of the typed-array JS, with profiling hooks added but commented out; running a Shark profile there shows pretty much all of the time spent in tracejit-generated code.
Given that the tracejit obviously inlines method calls, the fact that inlining them manually helped does seem to point to function guards of some sort.
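For anyone unfamiliar with the guard in question, here is a hypothetical sketch of the hazard it protects against (made-up data and a simplified TransformPoint, not actual engine output):

var M = new Float32Array(12);
var X = new Float32Array(100), Y = new Float32Array(100), Z = new Float32Array(100);
var TransformPoint = function (m, x, y, z) {
  return m[0] * x + m[1] * y + m[2] * z + m[3];   // simplified: one output component
};
function hotLoop(n) {
  var acc = 0;
  for (var i = 0; i < n; ++i) {
    // Compiled code that inlined TransformPoint's body must check that the
    // name still resolves to the same function object before using it.
    acc += TransformPoint(M, X[i], Y[i], Z[i]);
  }
  return acc;
}
// Nothing stops a script from rebinding TransformPoint mid-run, which is why
// the engine cannot blindly assume an inlined copy stays valid forever.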
I wonder how this testcase does with the LICM patches...
Comment 2•14 years ago
Comment 3•14 years ago
Fwiw, v8 can't run the typed array testcase, but their tip (with crankshaft) gets about 8e6 vertices per second on the non-typed-array version; we get about 5.5e6 (with -j or -m -j -p; the results are about the same; -m alone is at 3e6). So there's definite headroom here!
Comment 4•14 years ago
Also fwiw, I can't seem to reproduce vlad's 15% speedup here, though maybe my lazy way of inlining (copy-paste, with some var decls to assign the things that got passed to the function before) is responsible... but that seems like the right way to inline in this case anyway, so we only compute the somewhat-complicated arguments once.
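To make the copy-paste-with-var-decls pattern concrete, a hypothetical before/after (the real helper signatures in the testcase may differ):

// Before: helper call in the hot loop.
function ScaleVector(out, o, v, s) {
  out[o] = v[0] * s;
  out[o + 1] = v[1] * s;
  out[o + 2] = v[2] * s;
}
function runCalls(output, v, weights, n) {
  for (var i = 0; i < n; ++i)
    ScaleVector(output, i * 3, v, weights[i]);
}

// After: bind the (somewhat-complicated) argument expressions to vars once,
// then paste the helper's body in place of the call.
function runInlined(output, v, weights, n) {
  for (var i = 0; i < n; ++i) {
    var o = i * 3, s = weights[i];
    output[o] = v[0] * s;
    output[o + 1] = v[1] * s;
    output[o + 2] = v[2] * s;
  }
}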
One other note: we seem to compile two copies of the hot loop here; it looks like something is type-unstable (a bunch of I and O types the first time through; all U the second time through).
And if I inline the ScaleMatrix/AddScaledMatrix calls (not that the latter is even called, in my testing), and then re-roll those loops, performance goes down by a factor of 2, and the number of traces compiled goes from 4 to 10 (presumably due to more type instability?).
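For reference, a hypothetical reduced example of the kind of type instability the tracejit trips over (not taken from this testcase): a variable that holds an integer value on the first pass and a double afterwards forces the loop to be recorded once per type map.

function sumWeights(weights, n) {
  var sum = 0;                  // integer-typed on trace entry
  for (var i = 0; i < n; ++i)
    sum += weights[i];          // Float32Array reads yield doubles, so sum
                                // becomes double-typed after one iteration
  return sum;
}
// The same loop can end up compiled more than once, one trace per type map.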
Comment 5•14 years ago
(In reply to comment #4)
> Also fwiw, I can't seem to reproduce vlad's 15% speedup here, though maybe my
> lazy way of inlining (copy-paste, with some var decls to assign the things that
> got passed to the function before) is responsible... but that seems like the
> right way to inline in this case anyway, so we only compute the
> somewhat-complicated arguments once.
Yeah, that's what I did. After a bunch more runs, the win wasn't consistent, but it showed up.
> One other note: we seem to compile two copies of the hot loop here; it looks
> like something is type-unstable (a bunch of I and O types the first time
> through; all U the second time through).
>
> And if I inline the ScaleMatrix/AddScaledMatrix calls (not like the latter is
> called anyway, in my testing), and then re-roll those loops, performance goes
> down by a factor of 2, and the number of traces compiled goes from 4 to 10
> (presumably due to more type instability?).
One thing that I did was change the timing loop from:
  const start = new Date();
  const now = new Date();
  while ((new Date() - start) < 1000) {
    calculateVerticesAndNormals(bt, v, i, output);
    vertices_skinned += N;
  }
to:
  var start = Date.now();
  while (true) {
    calculateVerticesAndNormals(bt, v, i, output);
    vertices_skinned += N;
    if ((vertices_skinned % 100000) == 0 &&
        (Date.now() - start) >= 1000)
    {
      break;
    }
  }
The point was to avoid constant Date object creation; Date.now() returns a primitive number, so nothing is allocated per check. It made a difference, but not a huge one; it made me feel better, though! (Also, the testcase has some odd const usage; replacing most of those with var didn't change the results, but I think we largely ignore const for this anyway?)
Comment 6•14 years ago
(In reply to comment #0)
>
> <vlad> are we really guarding that these things are still functions?
Sounds like bug 606892 would help?
And I bet type inference (bug 608741) will help a *lot* with this program.
Comment 7•14 years ago
Yeah, the Date thing isn't too bad, especially on non-Windows. And yes, const is pretty much ignored at runtime for optimization purposes.
Marking dependent on the obvious bugs that can help here. Chances are, it'll be a few months before we get far enough on type inference to see an effect here, right?
Comment 8•14 years ago
(In reply to comment #7)
>
> Marking dependent on the obvious bugs that can help here. Chances are, it'll
> be a few months before we get far enough on type inference to see an effect
> here, right?
Yep.
Type inference will cannibalize some of LICM's potential benefits too. E.g., LICM can often hoist the "is this an array?" test that currently occurs on every array access; type inference will typically eliminate that test outright. IMO, type inference is crucial if JS is to have a chance of getting remotely close to C++ on computationally intensive programs like this.
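As a concrete sketch of the difference (hypothetical, expressed as comments on a plain JS loop rather than actual engine output):

function sumArray(a, n) {
  var sum = 0;
  for (var i = 0; i < n; ++i) {
    // Today: a per-iteration guard checks that `a` is still an array
    // of the expected kind before the element load.
    sum += a[i];
  }
  return sum;
}
// With LICM, that loop-invariant guard is hoisted to run once, just before
// the loop. With type inference, the type of `a` is known at compile time,
// so ideally no runtime guard is emitted at all.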
Comment 9•14 years ago
I'm not as convinced, but maybe my methodology is flawed. I went in and stubbed out guardClass to be a no-op -- it only gave back a few percent of perf. Might even have been noise. Given that I don't see any side exits other than the normal loop termination, it should be safe to stub out -all- guards here and see what the perf is like, right? I suspect we'll still be a long ways off.
Comment 10•13 years ago
I wonder how we're doing on this now that type inference is enabled.
Updated•13 years ago
Comment 11•11 years ago
Almost 2 years have passed here. The non-SSE version of this benchmark is tested on
http://arewefastyet.com/#machine=12&view=breakdown&suite=asmjs-ubench
and with Odin it runs at about half the speed of native. Not quite native yet, but getting close.
Updated•10 years ago
Assignee: general → nobody
Updated•2 years ago
Severity: normal → S3