Bug 596544
These WebGL samples are slow on Firefox, fast on Chrome
Status: RESOLVED FIXED
Opened 14 years ago, closed 14 years ago
Categories: Core :: Graphics: CanvasWebGL, defect
Tracking: blocking2.0 -
People: Reporter: paul, Assigned: bjacob
Attachments: 1 file (deleted), patch (vlad: review+, joe: approval2.0+)
Assignee
Comment 1•14 years ago
How slow are they for you?
Here Aquarium runs at:
- 40 FPS without GL layers
- 70 FPS with GL layers
That's on my Core i7 + NVIDIA Quadro FX 880M laptop.
What's your platform, do you have accelerated layers enabled, and can you run any slow demo in a profiler to see where it's spending time? Finally, are you using a very recent build with JaegerMonkey?
Reporter
Comment 2•14 years ago
Linux, ATI, nightly (don't know about accelerated layers)
Assignee
Comment 3•14 years ago
What framerate do you get in Aquarium with a window of roughly 1000x1000? What's your CPU and graphics card? For accelerated layers, check the preferences layers.accelerate-all and layers.accelerate-none.
Reporter
Comment 4•14 years ago
3 fps (with and without layers activated)
Assignee
Comment 5•14 years ago
Ah, so we have a real problem here. So keep layers acceleration disabled, and if you have time, please profile it. Here's how you can do that if you have a recent kernel (linux 2.6.33 or newer). Install the 'perf' profiler package from your distro or get perf from
https://perf.wiki.kernel.org/index.php/Main_Page
then follow these steps:
1. Launch Firefox.
2. Go to the demo page and wait until it's fully loaded.
3. Open a terminal.
4. Find the PID of your Firefox process. You can do:
ps aux | grep firefox | grep -v grep
5. Attach the perf profiler to the running Firefox process:
perf record -f -g -p THE_PID_OF_YOUR_FIREFOX
replacing THE_PID_OF_YOUR_FIREFOX with the value you found in step 4.
6. After about 10 seconds, interrupt profiling by hitting Ctrl+C in the terminal.
7. To see the profiler results, you can then do:
perf report
and to get a quick summary:
perf report --sort dso,symbol | head -20
Please paste that quick summary here. You can also compress the perf.data file and attach it here.
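Steps 4 and 5 above can also be scripted. Here is a minimal sketch in Python, assuming the usual 11-column `ps aux` layout. The helper names and the sample `ps` output are hypothetical, and the perf flags are copied verbatim from step 5, so they may need adjusting for other perf versions.

```python
def firefox_pids(ps_output):
    """Return the PIDs of processes whose command line mentions firefox.

    Expects text in the `ps aux` layout: 11 whitespace-separated columns,
    with the PID in the second column and the command in the last one.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        if "firefox" in fields[10]:
            pids.append(int(fields[1]))
    return pids


def perf_record_command(pid):
    """Build the argument list for step 5 (same flags as in the comment)."""
    return ["perf", "record", "-f", "-g", "-p", str(pid)]


# Hypothetical sample output, for illustration only.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
paul      4242 40.0  5.1 900000 200000 ?       Sl   10:00   1:23 /opt/firefox/firefox-bin
paul      4300  0.0  0.1  10000  2000 pts/0    S+   10:05   0:00 bash
"""

for pid in firefox_pids(SAMPLE_PS):
    print(" ".join(perf_record_command(pid)))
```

The resulting argument list could be handed to `subprocess.run`, and the profiler would still be interrupted with Ctrl+C as in step 6.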
Updated•14 years ago
blocking2.0: --- → ?
Comment 6•14 years ago
I'm also seeing this on 2010-09-18 nightly and 4.0b6.
Window size has no major effect on the speed so it doesn't seem to be compositing-bound.
On Chrome (CPU compositing) top shows 70% chrome, 20% GPU process, 12% Xorg.
On Firefox (ditto, accelerated comp doesn't work) top output is 40% firefox-bin, 80% Xorg. Even on a small window Xorg share is 70%.
Also, GL calls seem to take an avg 75 us in Firebug profile (Aquarium does ~2000 per frame = 150 ms).
The high Xorg CPU use is weird.
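The per-call estimate in comment 6 can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the 75 us figure comes from a Firebug profile, which adds its own overhead.

```python
def frame_time_ms(calls_per_frame, microseconds_per_call):
    """Time spent in GL calls per frame, in milliseconds."""
    return calls_per_frame * microseconds_per_call / 1000.0


def max_fps(frame_ms):
    """Upper bound on the framerate if a frame takes frame_ms milliseconds."""
    return 1000.0 / frame_ms


gl_ms = frame_time_ms(2000, 75)  # figures quoted in comment 6
print(gl_ms)                      # 150.0
print(round(max_fps(gl_ms), 1))   # 6.7
```

So GL call overhead alone would cap Aquarium at single-digit FPS, which is the range reported for the affected ATI setups in this bug.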
Assignee
Comment 7•14 years ago
Ah, that. OK. That could be **** XRender support in your drivers. It would have to be very, very **** as you said it isn't fillrate-bound anyway, but it's still worth considering. What driver are you using? I need to know the driver, not just "ATI".
Could you disable XRender acceleration in your xorg.conf? Not sure what the best way to do that is, but you could always use the "vesa" or "fbdev" driver.
Could you run this in a profiler? See comment 5.
Comment 8•14 years ago
Sure, driver is Catalyst 10.9, x86 Linux, Radeon HD 4850, X.Org X Server 1.7.7 Release Date: 2010-05-04. I don't know off-hand how to disable XRender, but if I figure it out I'll post an update.
Perf results for running aquarium demo for ~10 seconds:
perf report --sort dso,symbol | head -20
For Firefox 4.0b6:
# Samples: 36083
#
# Overhead Shared Object Symbol
# ........ ................................... ......
#
9.01% ./downloads/firefox/libxul.so [.] 0x00000000f6177a
|
|--1.60%-- 0xb72cc177
| 0xb716e66a
|
|--1.14%-- 0xb72c9b78
| 0xb716e66a
|
|--1.08%-- 0xb72c9bdc
| 0xb716e66a
|
|--1.01%-- 0xb72c76f7
| 0xb716e66a
|
|--0.98%-- 0xb72c9bc6
Xorg w/ Firefox:
# Samples: 73552
#
# Overhead Shared Object Symbol
# ........ ............................................. ......
#
21.18% /usr/lib/libpixman-1.so.0.16.4 [.] 0x0000000004cbe6
|
|--3.27%-- 0xb778e23a
| 0xb7761dd9
...
| pixman_image_composite
| fbComposite
| 0xb088e817
| 0xb088d7a9
| 0x810654b
| CompositePicture
... followed by repeated entries for pixman_image_composite
8.91% /usr/lib/libpixman-1.so.0.16.4 [.] 0x00000000005b16
...
| pixman_image_composite
...
...
...
0.58% /lib/i686/cmov/libc-2.11.2.so [.] memcpy
...
0.43% /usr/lib/xorg/modules/glesx.so [.] 0x000000000212a3
Xorg output on Chrome with CPU compositing:
# Samples: 12844
#
# Overhead Shared Object Symbol
# ........ ............................................. ......
#
46.71% /lib/i686/cmov/libc-2.11.2.so [.] memcpy
|
|--99.77%-- 0xb088cfab
| 0x810654b
| CompositePicture
| 0x80ff94d
| 0x80fc623
| 0x8080067
| 0x806692a
| __libc_start_main
| 0x8066511
--0.23%-- [...]
6.84% /usr/lib/xorg/modules/glesx.so [.] 0x000000000ecb29
|
|--1.48%-- 0xb089122f
| 0xb0892e5d
| esutDeleteSurf
| glesxDeleteSharedAccelSurf
| atiddxPixmapFreeGARTCacheable
| destroyPixmap
| 0x810912c
Xorg output on Chrome with accelerated compositing:
# Samples: 2619
#
# Overhead Shared Object Symbol
# ........ ............................................. ......
#
5.77% [kernel] [k] _ZN4Asic16Is_WPTR
5.42% [kernel] [k] unix_poll
4.51% [kernel] [k] do_select
|
|--98.31%-- 0x807fda0
| 0x806692a
| __libc_start_main
| 0x8066511
|
--1.69%-- 0xb7830400
0x807fda0
0x806692a
__libc_start_main
0x8066511
4.47% [kernel] [k] _ZN4Asic9WaitUnti
4.43% [kernel] [k] _spin_lock_irqsav
OS: Linux → Windows Server 2003
Assignee
Comment 9•14 years ago
(In reply to comment #8)
> perf report --sort dso,symbol | head -20
Argh, sorry! This prints a call graph, so 20 lines are not enough. Can you please use this command instead:
perf report -g flat --sort dso,symbol | head -20
Assignee
Comment 10•14 years ago
Also, your executables are missing a lot of symbols. Symbols for just Firefox would be enough, but I'm afraid that in order for perf to pick them up, they need to be in the Firefox executable. That means you'd have to build Firefox yourself, unless there's something I'm not aware of.
Comment 11•14 years ago
Alright, I'll roll my own build tomorrow.
I tested with some other full-window scenes. There is a clear connection between GL calls per frame and framerate. A scene with only one object runs smoothly (20-30 fps), a scene with a couple dozen objects runs at 10 fps, a scene with a hundred objects runs at 5 fps.
Here's the symbol-poor output of the flat perf for Aquarium:
firefox-bin 4.0b6
# Samples: 63577
#
# Overhead Shared Object Symbol
# ........ ................................... ......
#
9.03% ./downloads/firefox/libxul.so [.] 0x00000000eca012
7.68% /usr/lib/dri/fglrx_dri.so [.] 0x000000003f20d7
0.93%
0xa1da699a
0.63%
0xa1da69c1
6.58% ./downloads/firefox/libxul.so [.] gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
6.57%
gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
5.93% ./downloads/firefox/libxul.so [.] 0x0000000055184e
2.75% [kernel] [k] copy_from_user
1.57%
For X.org
# Samples: 88543
#
# Overhead Shared Object Symbol
# ........ ............................................. ......
#
18.79% /usr/lib/libpixman-1.so.0.16.4 [.] 0x0000000004023a
11.84% /usr/lib/libpixman-1.so.0.16.4 [.] 0x00000000005b1a
6.62%
0xb7753b00
0xb7761dd9
0xb7787a6d
0xb778f80a
0xb7790a2a
0xb7787763
0xb7762953
0xb778975e
0xb7762953
0xb7794b63
0xb7762953
0xb779b289
0xb7762953
pixman_image_composite
Comment 12•14 years ago
> Window size has no major effect on the speed so it doesn't seem to be
> compositing-bound.
The window size only matters when compositing to screen, since you can then optimize away everything outside the window. When compositing to a canvas (as is the case here, right?), all that matters is the size of the canvas, not the size of the window.
So yeah, crappy Render seems like the most likely cause. God, I hate XOrg. :(
Comment 13•14 years ago
Sorry, I was being unclear. The canvas size changes with the window size in Aquarium. So window size = canvas size = compositing overhead.
If Nvidia drivers perform fine on Linux, that'd suggest that there's some slow path that is getting hammered on ATI. And because the amount of GL calls is correlated with the slowness, I'd guess that each GL call is triggering some extra bit of work that the Nvidia driver shrugs off but ATI relegates to a slow path.
My reasoning here is:
If Nvidia is slow as well, then it'd be something inherently slow in the JS->GL-path. But if not, then there must be something going on that ATI isn't handling well. And it's not unfixable because Chrome doesn't have the slowness. It also seems to be triggered on most GL calls. So it must be something fixable that's affecting every GL call.
But yeah, I have no idea about the exact cause. Are GLXPixmaps accelerated?
Maybe crib this http://src.chromium.org/svn/trunk/src/app/gfx/gl/gl_context_linux.cc
They seem to be using a 1x1 pbuffer for context, fbo in the context for the render target, fallback to glxpixmaps where pbuffers are not supported.
Assignee
Comment 14•14 years ago
*** SCROLL DOWN TO WHERE I MENTION MAKECURRENT FOR THE INTERESTING BIT ***
(In reply to comment #13)
> Sorry, I was being unclear. The canvas size changes with the window size in
> Aquarium. So window size = canvas size = compositing overhead.
>
> If Nvidia drivers perform fine on Linux, that'd suggest that there's some slow
> path that is getting hammered on ATI.
Agree.
> And because the amount of GL calls is correlated with the slowness,
Thanks for finding out this new bit of information. It's very interesting because the amount of GL calls in a WebGL app is NOT correlated to XRender at all. So IF we are looking at ATI driver slowness here, that would be OpenGL implementation slowness, not XRender slowness.
This isn't per se going to be easy to measure because GL calls are asynchronous and return immediately. The GL debug mode I'm coding (argh, I have to finish this today) is calling glFinish() after every GL call, emulating synchronousness, but this is only for debug builds, which aren't very suitable for benchmarking. Still, the slowness that you're experiencing here is so extreme that I guess that profiling a debug build will still make the culprit show up. I'll let you know when it's available.
> I'd guess that each GL call is triggering some
> extra bit of work that the Nvidia driver shrugs off but ATI relegates to a slow path.
Either that, or some function is just hideously slow in the ATI implementation. Either way, we'll know once we profile Firefox in GL debug mode.
> My reasoning here is:
> If Nvidia is slow as well, then it'd be something inherently slow in the
> JS->GL-path.
Again, NVIDIA is /not/ slow here. I get 70 FPS in the Aquarium demo full-screen with default settings.
> But if not, then there must be something going on that ATI isn't
> handling well. And it's not unfixable because Chrome doesn't have the slowness.
Indeed. When we were looking at XRender slowness, I was shrugging this off as we all know that Chrome does in plain optimized software rendering what we do with XRender. But now that we're looking at GL slowness, this is getting interesting.
> It also seems to be triggered on most GL calls. So it must be something fixable that's affecting every GL call.
OOOOOOOhhhhhhhh I know. The only thing that's triggered by every GL call is GLX MakeCurrent(). Let me make you a patch !!
> But yeah, I have no idea about the exact cause. Are GLXPixmaps accelerated?
Yes, since GLXPixmaps live on the server.
>
> Maybe crib this
> http://src.chromium.org/svn/trunk/src/app/gfx/gl/gl_context_linux.cc
>
> They seem to be using a 1x1 pbuffer for context, fbo in the context for the
> render target, fallback to glxpixmaps where pbuffers are not supported.
Yeah, we just create a plain 1x1 window for the context, so we don't have to worry about whether pbuffers are supported. We use a FBO for the actual rendering.
Assignee
Comment 15•14 years ago
Here's a patch making us try to avoid calling MakeCurrent.
Before someone says "**** linux graphics", keep in mind that we're already doing this on other platforms including on Windows (WGL) !
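The attached patch is not reproduced on this page. As a rough illustration of the idea only, here is a sketch in Python of skipping the make-current call when the context is already current. All class and method names are hypothetical stand-ins, not the Firefox code, and `FakeGLX` merely counts calls instead of talking to a driver.

```python
class FakeGLX:
    """Stand-in for the windowing-system GL binding; counts expensive calls."""

    def __init__(self):
        self.current = None
        self.make_current_calls = 0

    def make_current(self, context):
        # In a real driver this may be a costly operation.
        self.make_current_calls += 1
        self.current = context


class Context:
    """Hypothetical GL context wrapper that avoids redundant make-current."""

    def __init__(self, glx, name):
        self.glx = glx
        self.name = name

    def make_current(self, force=False):
        # Skip the call when this context is already the current one.
        if not force and self.glx.current is self:
            return
        self.glx.make_current(self)

    def gl_call(self):
        self.make_current()
        # ... the actual GL call would go here ...


glx = FakeGLX()
ctx = Context(glx, "webgl")
for _ in range(2000):  # roughly one Aquarium frame, per comment 6
    ctx.gl_call()
print(glx.make_current_calls)  # 1 instead of 2000
```

The sketch assumes nothing else changes the current context behind the wrapper's back. A real implementation would more likely compare against what the platform reports as current (for example via `glXGetCurrentContext` on GLX) than rely on a private variable.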
Assignee
Comment 16•14 years ago
Ilmari / Paul: could you please try my patch? It would be good to know ASAP if it fixes it. Otherwise you'll have to wait until someone reviews this.
Attachment #476659 -
Flags: review?(vladimir) → review+
Comment 17•14 years ago
Yep, that did it. Aquarium with accelerated layers: 5 fps with HEAD, 60 fps with patch.
OS: Windows Server 2003 → All
Reporter
Comment 18•14 years ago
Wow, good job guys :)
Ilmari, thanks a lot for the information!
Updated•14 years ago
Attachment #476659 -
Flags: approval2.0+
Assignee
Comment 19•14 years ago
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 20•14 years ago
I just tried on latest hourly. http://hg.mozilla.org/mozilla-central/rev/901fd772c4da
1000 Fish in Aquarium, D2D ON
Fx4 with layers: 25
Fx4 without layers: 20
Chrome 7 dev with all HW acc enabled: 45
Is this performance issue related to this bug, or is this the current Fx speed?
Assignee
Comment 21•14 years ago
With 1000 fish, you are really measuring JavaScript performance, not so much graphics performance. So 25:45 against Chrome 7 seems plausible; see http://www.arewefastyet.com/ and feel free to continue this discussion with the JavaScript people.
But could you try with fewer, e.g. 50 fish? If we're slower than Chrome with 50 fish, that's a graphics bug.
Assignee
Comment 22•14 years ago
Ah, wait, something else --- we are currently using OpenGL for the WebGL rendering and D3D for layer acceleration. Once we switch to D3D for WebGL rendering via ANGLE, that will play much better with D3D layer acceleration. So expect a performance improvement in not too long. But again, 1000 fish is very JS intensive.
Comment 23•14 years ago
Fwiw, testing on more or less current trunk on Mac with GL layers, I get about 40fps on a sorta-recent laptop (though nothing actually paints). About 32% of the time is spent in the JS engine with 1000 fish. The rest is spent in GL code of various sorts (most prominently under DrawElements, Uniform3fv, and Uniform1; those three are 51% between them).
So the time it takes to run the JS for 1000 fish on this machine is about 8ms (one third of the 25ms a 40Hz framerate leaves per frame). If we're ending up at 25fps, that means 40ms per frame. So either Csaba's machine is a lot slower than mine, or there's significant overhead in the non-JS stuff here, or both.
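The frame-budget reasoning in comment 23 can be written out as a short calculation. This is a sketch with made-up variable names; the last figure additionally assumes that the JS work costs the same 8 ms on both machines, which the comment does not claim.

```python
def frame_budget_ms(fps):
    """Milliseconds available per frame at a given framerate."""
    return 1000.0 / fps


js_share = 0.32                       # fraction of time in the JS engine
mac_frame_ms = frame_budget_ms(40)    # 25 ms per frame at 40 fps
js_ms = js_share * mac_frame_ms       # about 8 ms of JS per frame

slow_frame_ms = frame_budget_ms(25)   # 40 ms per frame at 25 fps
non_js_ms = slow_frame_ms - js_ms     # assumption: same JS cost on both machines

print(mac_frame_ms, round(js_ms, 1))
print(slow_frame_ms, round(non_js_ms, 1))
```

Under that assumption, about 32 ms of each 40 ms frame would be going to something other than JS, which is why either a much slower machine or significant non-JS overhead (or both) is suspected.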
Assignee
Comment 24•14 years ago
Ah, OK. I think I talked a bit too fast. Here I get:
70 FPS with 50 fish
40 FPS with 1000 fish
As for profiling, I am starting to wonder if my profiling practices are really appropriate here. Indeed, in both cases, I get above 90% of the time spent in JITted code, and the rest in the kernel. Could it simply be that all the time spent waiting on GPU rendering is attributed to the JIT code?! How then would I go about measuring this time spent waiting on the GPU?
Assignee
Comment 25•14 years ago
Here's the perf output per-DSO:
[bjacob@cahouette 1000fish]$ perf report -g flat --sort dso | head -20
# Samples: 34786750860
#
# Overhead Shared Object
# ........ .................
#
92.77% 7f5ea521be7a
7.21% [kernel]
0.01% [rfcomm]
0.01% [nvidia]
0.01% [cfg80211]
Here's the perf output per-symbol, showing the top 20 symbols:
[bjacob@cahouette 1000fish]$ perf report -g flat --sort dso,symbol | head -20
# Samples: 34786750860
#
# Overhead Shared Object Symbol
# ........ ................. ......
#
92.77% 7f5ea521be7a [.] 0x007f5ea521be7a
0.55% [kernel] [k] hpet_next_event
0.55% [kernel] [k] system_call
0.53% [kernel] [k] audit_syscall_exit
0.53% [kernel] [k] audit_syscall_entry
0.47% [kernel] [k] system_call_after_swapgs
0.39% [kernel] [k] pid_vnr
0.37% [kernel] [k] unroll_tree_refs
0.29% [kernel] [k] kfree
0.28% [kernel] [k] audit_free_names
0.25% [kernel] [k] sys_getpid
0.25% [kernel] [k] sysret_check
0.22% [kernel] [k] sysret_signal
0.20% [kernel] [k] audit_get_context
0.20% [kernel] [k] auditsys
Comment 27•14 years ago
(In reply to comment #21)
>
> but could you try with fewer, e.g. 50 fish ? If we're slower than Chrome with
> 50 fish, that's a graphics bug.
50 fish:
Fx4 with layers: 38 FPS
Fx4 without layers: 29 FPS
Chrome 7 dev with all HW acc: 52 FPS.
Maybe it's a graphics bug, too.
And I have JM+TM enabled; I didn't mention that.
(In reply to comment #23)
>
> So either Csaba's machine is a lot slower
> than mine, or there's significant overhead in the non-JS stuff here, or both.
I tested on my laptop: Core2 T6600 2.2 GHz, 4GB RAM, Mob HD4330 512MB DDR2.
(In reply to comment #22)
>
> Once we switch to D3D for WebGL rendering via ANGLE, that will play much better
> with D3D layer acceleration.
I didn't know that Fx4 will use ANGLE, that's good news. Is there a tracking bug for that? Will it make it into Fx4.0?
Assignee
Comment 28•14 years ago
(In reply to comment #27)
> (In reply to comment #21)
> >
> > but could you try with fewer, e.g. 50 fish ? If we're slower than Chrome with
> > 50 fish, that's a graphics bug.
>
> 50 fish:
>
> Fx4 with layers: 38 FPS
> Fx4 without layers: 29 FPS
> Chrome 7 dev with all HW acc: 52 FPS.
>
> Maybe it's a graphics bug, too.
We'll need actually good profiler results to know... Boris just explained to me on IRC why my above results look strange; I had a couple of things wrong.
Comment 29•14 years ago
Okay. If you need some testing, tell me! (and tell me how to do it :) )
Assignee
Comment 30•14 years ago
(In reply to comment #27)
> I didnt know that Fx4 will use ANGLE, that's good news. Is there tracking bug
> for that? Will it make into Fx4.0?
There's no tracking bug yet, and it probably won't make it into 4.0.0, although it should be done shortly thereafter. I have no idea whether we'll be allowed to backport that (the trick is to claim that it's not a feature, it's just to fix support for Windows machines with **** GL drivers).
Comment 31•14 years ago
Thanks for the info!
Updated•14 years ago
blocking2.0: ? → -