Slow 3d transform rendering on https://keithclark.co.uk/articles/pure-css-parallax-websites/demo3/ in debug mode
Categories
(Core :: Graphics: WebRender, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox65 | --- | affected |
People
(Reporter: mstange, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Updated•6 years ago
|
Comment 2•6 years ago
|
||
Comment 3•6 years ago
|
||
Comment 5•6 years ago
|
||
Comment 6•6 years ago
|
||
Comment 7•6 years ago
|
||
Comment 8•6 years ago
|
||
Comment 9•6 years ago
|
||
We discussed this a bit on IRC. We're going to do a bit more investigation as to whether the clips are being applied when compositing into the scene.
One option we discussed is making the clip mask rendering much more efficient for these cases, where the clip mask is large but the clip is actually a simple rect, and we could perhaps skip rendering the inner part for example.
Comment 10•6 years ago
|
||
So I did the simple thing Glenn suggested (based on comment 7, which it turns out did make sense after all) and dropped the clips in the case where the clip would get applied later in the compositing pipeline. Locally that seemed to help, on my Mac it removed the C_Clip chunk of the GPU times and the mean GPU time went down from ~29ms to ~25ms. On Mac I was getting pretty big B_Blend times but not so on my Wintel setup, so I suspect the patch will help more there. I'll test that next, but I've kicked off a try push with the patch at [1] to check for correctness.
Comment 11•6 years ago
|
||
There's some test failures, likely because I didn't get the intermediate surface condition quite right.
Comment 12•6 years ago
|
||
I looked some more at the clip collector code that Glenn pointed me to yesterday as well. It is structured in a way that should already handle this case, but it doesn't actually. I'm still trying to figure out why - it looks like we're not turning one of the pictures with the perspective transform into a raster root, because the requested_composite_mode is empty. Trying to understand the display list flattener code now to figure out why, but my brain stopped working about an hour ago so I'll likely pick this up tomorrow.
Comment 13•6 years ago
|
||
Ok, so I think I have a handle on what's going on here. When we have a perspective-transformed elements on a page, the GeckoDL looks kind of like this:
nsDisplayPerspective item with perspective transform
nsDisplayTransform item with 3d transform and extends-3d-context flag
...
This produces a WR display list like this:
PushReferenceFrameDisplayListItem { ... transform_style: Flat, perspective: Some(/*perspective transform*/), ... }
PushStackingContextDisplayItem { ... transform_style: Flat, ... }
PushReferenceFrameDisplayListItem { ... transform_style: Preserve3D, transform: Some(/*3d transform*/), ... }
PushStackingContextDisplayItem { ... transform_style: Preserve3D, ... }
...
Since the transform_style of the outer stacking context is Flat, the participating_in_3d_context flag is false for that, which cascades down to WR not turning that into a raster root. The inner stacking context does get turned into a raster root. This means any clips from "outside" the outer stacking context (e.g. the rootmost clip around the entire browser area) will get collected onto the inner stacking context. When we process the clips for the raster root picture, we pick up the collected clips. Those clips, however, are from e.g. the root spatial node, and there is a perspective transform between the raster root and the root spatial node, so we end up creating masks for these clips as described in previous comments.
So one possible fix here should be to also take Flat stacking contexts but whose spatial node has a perspective transform, and turn it into a raster root. I think that in combination with my previous patch would solve the problem (will test this) but it would create extra intermediate surfaces which might incur a performance penalty in other scenarios.
Comment 14•6 years ago
|
||
(In reply to Kartikaya Gupta (away Jan 12-26; email:kats@mozilla.com) from comment #13)
So one possible fix here should be to also take Flat stacking contexts but whose spatial node has a perspective transform, and turn it into a raster root. I think that in combination with my previous patch would solve the problem (will test this) but it would create extra intermediate surfaces which might incur a performance penalty in other scenarios.
I tried this, but it produced some reftest failures. Then I tried just this change (forcing new raster roots) in isolation, and that also produced reftest failures. Which appears to be a bug, so I filed bug 1519420 for that.
Comment 16•6 years ago
|
||
Looked at a GPU capture a bit. The real cost appears to come from the sheer amount of pixel copying/filling that we do.
In the first pass, we target 10 full-screen surfaces, and half of them get most of the pixels filled up with those solid-colored planes. Another half is rendering some thin borders (with solid shaders) and text, which is near-zero cost in terms of GPU time. Interestingly, they only affect the alpha.
In the second pass we render some clip masks that in total may come around a 1x-1.5x of full-screen estate in pixels. This is ok.
In the third pass something unexpected happens: we are basically copying over those giant filled rects into new render targets. This is 10 full-screen slices that we mostly fill out, and here we pay extra costs for the borders in the first pass: at this stage we don't know that only a few pixels were touched when they were rendered, so we end up copying whole giant rects of texels.
Then comes the 4th pass rendering everything to screen, plus blitting back to cached surfaces.
So here are the things to investigate and improve:
- figure out why we even have the 3rd pass, and if we can avoid it
- check if we can (picture) cache the local space rasterized surfaces for plane splitting (and don't try to picture cache the screen tiles)
- look closer at what is rendered in the 1st pass. Perhaps, if we know that some of the things only affect alpha, we can choose the alpha render targets for those and save on fill-rate and sampling cost?
Comment 17•6 years ago
|
||
Talked to :gw, and apparently the WR's GPU profiler disagrees with RenderDoc on what is slow. C_Clip is shown to consume most of GPU time, even if we make the shader trivial, and the opaque pixel coverage doesn't match the one we see in RD.
Comment 18•6 years ago
|
||
FWIW, passing always TransformStyle::Preserve3D to perspective reference frames is the right thing to do, and it used to be that way. It was undone to avoid WR doing unnecessary work long time ago.
And actually if my reading of the display-list up is correct, my patch at bug 1505222 should fix this.
Comment 19•6 years ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #18)
FWIW, passing always TransformStyle::Preserve3D to perspective reference frames is the right thing to do, and it used to be that way. It was undone to avoid WR doing unnecessary work long time ago.
In particular it was undone in bug 1489337. Equivalent Gecko code is:
https://searchfox.org/mozilla-central/source/layout/painting/nsDisplayList.cpp#8435
My patch at bug 1505222 partially undoes this, I think it will catch this case too, but we can always check back.
Updated•6 years ago
|
Comment 20•6 years ago
|
||
I implemented some optimizations to drawing large clip rects for transformed primitives in https://bugzilla.mozilla.org/show_bug.cgi?id=1522758 (as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1521015).
I tested this page with that patch and it runs at 60fps on my machine. It could still be improved in the future, but I suspect it will be good enough to ship with once that lands in m-c.
Updated•6 years ago
|
Comment 22•6 years ago
|
||
It runs much worse to me on 67.0a1 (2019-02-11), it's barely scroll-able now...
Comment 23•6 years ago
|
||
Weird. It still runs okish (60fps) for me on 67.0a1 (2019-02-11)
Comment 24•6 years ago
|
||
We are compositor-bound in the case where it's slow on my machine (NV GPU, linux, 4K screen). We issue about 160 draw calls, occupy about 20-ish render target layers, and waste a whole lot of time (>50% based on the GPU time queries) just clearing the surfaces.
Potential things to look at:
- Policy for re-initializing FBOs. We spend ~12% of total compositor time just doing that...
- Clearing policy. It's quite possible that my platform is just inefficient at clearing when scissor is provided.
- There appear to be quite a few surfaces that we render full-screen without touching any pixel.
- Cache plane splitting surfaces. Currently, we intersect them with the near plane, render a bunch of clip masks, and bake the planes with the local space adjusted accordingly. This approach aims to miminize the number of pixels we fill per frame, but it misses an opportunity to retain those pixels across frames. We could just bake the surfaces without clipping by the near plane, and every frame use only the portions that are visible.
Updated•6 years ago
|
Updated•6 years ago
|
Updated•5 years ago
|
Comment 25•2 years ago
|
||
This appears to be fixed now : https://share.firefox.dev/3M8c8L7
Description
•