Bug 622494 (Open): Opened 14 years ago, updated 2 years ago

Investigate performance on IMVU model skinning benchmark

Categories

(Core :: JavaScript Engine, defect)

Hardware: x86
OS: All


People

(Reporter: caustin, Unassigned)

References

Details

(Whiteboard: imvu)

Attachments

(1 file)

I claimed that JavaScript will never be as fast as native code, and thus the web needs Native Client or something like it. Shaver called my bluff: "Take a performance-sensitive algorithm that matters to you, write a benchmark, and submit it as a bug."

I took the SSE-accelerated Cal3D skinning loop and implemented four versions of it, benchmarked on a MacBook Pro (2.5 GHz Core i5). The C++ is compiled with gcc -O2, and the JavaScript is run with Firefox 4 beta 8. Results are in millions of vertices skinned per second:

  C++ w/ SSE intrinsics (near theoretical maximum on a single core): 98.3
  C++ w/ scalars: 61.2
  JavaScript: 5.1
  JavaScript w/ typed arrays: 8.4

The code is available at https://github.com/chadaustin/Web-Benchmarks

vlad did some investigation on IRC:

  <vlad> are we really guarding that these things are still functions?
  <vlad> is that what's going on here?
  <vlad> well, got a 15% win by removing all the inner calls to TransformPoint/TransformVector/ScaleMatrix/AddScaledMatrix and inlining those
  <vlad> which is stupid

The JS testcases are trivial to turn into shell testcases by changing dump() to print().

(I think vlad meant alert() there.)
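For concreteness, here is a minimal sketch of the shape of loop under test: a typed-array skinning loop with the per-vertex transform factored into a helper, the kind of inner call vlad inlined. The signatures and data layout here are assumptions for illustration, not the actual benchmark source:

  // Sketch only; the layout is an assumption, not the benchmark's.
  // boneMatrices: Float32Array of 3x4 row-major matrices, 12 floats per bone.
  // positions:    Float32Array of xyz triples, one per vertex.
  // boneIndices:  Uint32Array mapping each vertex to a bone.
  // output:       Float32Array receiving the skinned xyz triples.
  function TransformPoint(output, dst, m, mi, positions, src) {
    var x = positions[src], y = positions[src + 1], z = positions[src + 2];
    output[dst]     = m[mi]     * x + m[mi + 1] * y + m[mi + 2]  * z + m[mi + 3];
    output[dst + 1] = m[mi + 4] * x + m[mi + 5] * y + m[mi + 6]  * z + m[mi + 7];
    output[dst + 2] = m[mi + 8] * x + m[mi + 9] * y + m[mi + 10] * z + m[mi + 11];
  }

  function skinVertices(boneMatrices, positions, boneIndices, output) {
    for (var v = 0, n = boneIndices.length; v < n; ++v) {
      // One helper call per vertex: these are the inner calls at issue.
      TransformPoint(output, v * 3, boneMatrices, boneIndices[v] * 12, positions, v * 3);
    }
  }

The real loop also blends weighted bone matrices (the ScaleMatrix/AddScaledMatrix calls above) and transforms normals (TransformVector); the one-call-per-vertex shape is the part relevant to the guard question.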
OS: Mac OS X → All
Whiteboard: imvu
No, he probably meant dump(), which writes to stdout in the browser if the right pref is enabled.

I'll attach a shell version of the typed array JS, with profiling hooks added but commented out; running a Shark profile there shows pretty much all of the time being spent in tracejit-generated code. Given that the tracejit inlines method calls, the fact that inlining them manually helped does seem to point to function guards of some sort.

I wonder how this testcase does with the LICM patches...
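Incidentally, a small shim lets the same file run in both places without editing; this is a sketch assuming only the shell's print() and the browser's window.dump() (with the browser.dom.window.dump.enabled pref set):

  // Sketch: alias dump() to print() when running in the shell, so the
  // browser testcase runs unmodified. Note that print() appends a newline
  // while dump() does not.
  if (typeof dump === "undefined") {
    var dump = function (s) { print(s); };
  }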
Attached file: Typed array shell testcase (deleted)
Fwiw, v8 can't run the typed array testcase, but their tip (with crankshaft) gets about 8e6 vertices per second on the non-typed-array version; we get about 5.5e6 (with -j or -m -j -p; the results are about the same; -m alone is at 3e6). So there's definite headroom here!
Also fwiw, I can't seem to reproduce vlad's 15% speedup here, though maybe my lazy way of inlining (copy-paste, with some var decls to assign the things that got passed to the function before) is responsible... but that seems like the right way to inline in this case anyway, so we only compute the somewhat-complicated arguments once.

One other note: we seem to compile two copies of the hot loop here; it looks like something is type-unstable (a bunch of I and O types the first time through; all U the second time through).

And if I inline the ScaleMatrix/AddScaledMatrix calls (not that the latter is even called, in my testing) and then re-roll those loops, performance goes down by a factor of 2, and the number of traces compiled goes from 4 to 10 (presumably due to more type instability?).
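To make the copy-paste-with-var-decls inlining concrete: applied to the earlier sketch, it assigns the argument expressions to vars once and then pastes the helper body in place of the call (again an illustration, not the actual edit):

  // Sketch of the manual inlining described above.
  function skinVerticesInlined(boneMatrices, positions, boneIndices, output) {
    for (var v = 0, n = boneIndices.length; v < n; ++v) {
      var dst = v * 3;               // was the position/output offset argument
      var mi = boneIndices[v] * 12;  // was the matrix offset argument
      var x = positions[dst], y = positions[dst + 1], z = positions[dst + 2];
      output[dst]     = boneMatrices[mi]     * x + boneMatrices[mi + 1] * y + boneMatrices[mi + 2]  * z + boneMatrices[mi + 3];
      output[dst + 1] = boneMatrices[mi + 4] * x + boneMatrices[mi + 5] * y + boneMatrices[mi + 6]  * z + boneMatrices[mi + 7];
      output[dst + 2] = boneMatrices[mi + 8] * x + boneMatrices[mi + 9] * y + boneMatrices[mi + 10] * z + boneMatrices[mi + 11];
    }
  }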
(In reply to comment #4)
> Also fwiw, I can't seem to reproduce vlad's 15% speedup here, though maybe my
> lazy way of inlining (copy-paste, with some var decls to assign the things
> that got passed to the function before) is responsible... but that seems like
> the right way to inline in this case anyway, so we only compute the
> somewhat-complicated arguments once.

Yeah, that's what I did. After a bunch more runs, the win wasn't consistent, but it showed up.

> One other note: we seem to compile two copies of the hot loop here; it looks
> like something is type-unstable (a bunch of I and O types the first time
> through; all U the second time through).
>
> And if I inline the ScaleMatrix/AddScaledMatrix calls (not like the latter is
> called anyway, in my testing), and then re-roll those loops, performance goes
> down by a factor of 2, and the number of traces compiled goes from 4 to 10
> (presumably due to more type instability?).

One thing that I did is change the timing loop from:

  const start = new Date();
  const now = new Date();
  while ((new Date() - start) < 1000) {
    calculateVerticesAndNormals(bt, v, i, output);
    vertices_skinned += N;
  }

to:

  var start = Date.now();
  while (true) {
    calculateVerticesAndNormals(bt, v, i, output);
    vertices_skinned += N;
    if ((vertices_skinned % 100000) == 0 && (Date.now() - start) >= 1000) {
      break;
    }
  }

to avoid constant Date object creation. It made a difference, but not a huge one; it made me feel better, though!

(Also, the testcase has some odd const usage; replacing most of those with var didn't change the results, but I think we largely ignore const for this?)
(In reply to comment #0)
> <vlad> are we really guarding that these things are still functions?

Sounds like bug 606892 would help? And I bet type inference (bug 608741) will help a *lot* with this program.
Yeah, the date thing isn't too bad, especially on non-Windows. And yes, const is pretty much ignored at runtime for optimization purposes.

Marking dependent on the obvious bugs that can help here. Chances are, it'll be a few months before we get far enough on type inference to see an effect here, right?
(In reply to comment #7)
> Marking dependent on the obvious bugs that can help here. Chances are, it'll
> be a few months before we get far enough on type inference to see an effect
> here, right?

Yep. Type inference will cannibalize some of LICM's potential benefits, too. E.g., LICM can often hoist the "is this an array?" test that currently occurs on every array access; type inference will typically eliminate that test outright.

IMO, type inference is crucial if JS is to have a chance of getting remotely close to C++ on computationally-intensive programs like this.
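A rough source-level analogue of that LICM point, with isDenseArray() and bailOut() as made-up stand-ins for the guard machinery the jit actually emits:

  // Stand-ins for jit-internal machinery; illustration only.
  function isDenseArray(a) { return Array.isArray(a); }
  function bailOut() { throw new Error("side exit"); }

  // Without LICM: the loop-invariant guard conceptually runs every iteration.
  function sumGuardedPerIteration(a) {
    var total = 0;
    for (var i = 0; i < a.length; ++i) {
      if (!isDenseArray(a)) bailOut();
      total += a[i];
    }
    return total;
  }

  // With LICM: the guard is hoisted once above the loop; the body is guard-free.
  function sumGuardHoisted(a) {
    if (!isDenseArray(a)) bailOut();
    var total = 0;
    for (var i = 0; i < a.length; ++i) {
      total += a[i];
    }
    return total;
  }

Type inference goes further: once the type is known statically, the test is eliminated outright rather than just hoisted.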
I'm not as convinced, but maybe my methodology is flawed. I went in and stubbed out guardClass to be a no-op; it only gave back a few percent of perf, and that might even have been noise. Given that I don't see any side exits other than the normal loop termination, it should be safe to stub out -all- guards here and see what the perf is like, right? I suspect we'll still be a long way off.
I wonder how we're doing on this now that type inference is enabled.
Blocks: 700101
No longer blocks: 700101
Depends on: 700101
Almost 2 years have passed here. The non-SSE version of this benchmark is tested on http://arewefastyet.com/#machine=12&view=breakdown&suite=asmjs-ubench and it runs at about half the speed of native with Odin. Not quite native yet, but getting close.
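For reference, the general shape of an asm.js version of a loop like this, in the style Odin compiles ahead of time. This is a sketch with an assumed layout (doubles, byte-offset pointers into a single heap), not the actual asmjs-ubench source:

  function SkinModule(stdlib, foreign, heap) {
    "use asm";
    var f64 = new stdlib.Float64Array(heap);
    // mat: byte offset of one 3x4 matrix (12 doubles);
    // pos/out: byte offsets of n xyz triples each.
    function skin(mat, pos, out, n) {
      mat = mat | 0;
      pos = pos | 0;
      out = out | 0;
      n = n | 0;
      var i = 0;
      var x = 0.0, y = 0.0, z = 0.0;
      for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
        x = +f64[pos >> 3];
        y = +f64[(pos + 8) >> 3];
        z = +f64[(pos + 16) >> 3];
        f64[out >> 3] = +f64[mat >> 3] * x + +f64[(mat + 8) >> 3] * y +
                        +f64[(mat + 16) >> 3] * z + +f64[(mat + 24) >> 3];
        f64[(out + 8) >> 3] = +f64[(mat + 32) >> 3] * x + +f64[(mat + 40) >> 3] * y +
                              +f64[(mat + 48) >> 3] * z + +f64[(mat + 56) >> 3];
        f64[(out + 16) >> 3] = +f64[(mat + 64) >> 3] * x + +f64[(mat + 72) >> 3] * y +
                               +f64[(mat + 80) >> 3] * z + +f64[(mat + 88) >> 3];
        pos = (pos + 24) | 0;
        out = (out + 24) | 0;
      }
    }
    return { skin: skin };
  }

  // Usage sketch: var skin = SkinModule(this, {}, new ArrayBuffer(1 << 20)).skin;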
Assignee: general → nobody
Severity: normal → S3