Bugzilla

Comment 1

•

6 years ago

Any idea what's going on here glenn?

Flags: needinfo?(gwatson)

Glenn Watson [:gw]

Comment 2

•

6 years ago

The backend CPU time, draw calls, primitive count and vertex count are all low. The main GPU slowness seems to be that we are creating a heap of very large clip masks (typically 4x 2048x2048 masks). Visually it's not obvious why these would be needed - but if there are some tricks going on with perspective I can imagine that might be confusing WR into creating masks. This should certainly be fixable with some fixes and/or optimizations to clip mask generation.

Flags: needinfo?(gwatson)

Comment 3

•

6 years ago

Next step: look at the gecko/WR display list for this page and see where the masks are coming from. If they can be removed easily then we should do that. If not maybe Glenn can provide more details on what sorts of fixes/optimizations on the WR side could mitigate this.

Comment 4

•

6 years ago

I can repro, will take a look.

Assignee: nobody → kats

Comment 5

•

6 years ago

So there's no masks in the gecko or WR display lists. I do see a number of large render targets being created, and I think the 3D transforms are making us create a number of intermediate surfaces to composite everything properly. Still poking around since I'm not too familiar with the code.

Comment 6

•

6 years ago

So after much poking around and some guidance from Glenn, I think I have a handle on the root problem. AFAICT the transformed divs are getting clipped by the root clips (the ones around the e.g. displayport, viewport, browser window) but the clips are in a different coordinate space from the primitives (which are inside the perspective transforms). When we go to build the clip chain instances for the primitives, the clip node infos for these clips get a Transform conversion [1]. This leaves the prim's local_clip_rect untouched because [2]. Then, when processing the clip node infos to actually produce the clip node instances, get_clip_result_complex returns a Partial match [3] because the divs in world space are still large and not fully contained inside the clip/viewport area. The Partial match means we can't discard the clip or cull the prim, and so we keep the clip but also determine that it needs a mask at [4] because of the coordinate space mismatch. I think that the code is basically working as designed (i.e. not a dumb bug somewhere). To improve this scenario, the obvious solutions is to (a) modify [2] so that we clip local_clip_rect using the clip transformed into the prim's coordinate space. This would then give us an Accept result instead of Partial and we wouldn't need to keep the clip around afterwards. But the clip transformed into the prim's coordinate space will be non-rectangular, and I don't think that getting the bounding box would still produce an Accept result. If we could store the prim's "local" clip rect in world space rather than local space that would help but I don't know how feasible that is and what kind of impact that would have. Will think about this more. Glenn, let me know if this makes sense or if I'm missing something. [1] https://searchfox.org/mozilla-central/rev/fef7f858efb695a76010b4c624da5277c16e95b3/gfx/wr/webrender/src/clip.rs#1340 [2] https://searchfox.org/mozilla-central/rev/fef7f858efb695a76010b4c624da5277c16e95b3/gfx/wr/webrender/src/clip.rs#1361 [3] https://searchfox.org/mozilla-central/rev/fef7f858efb695a76010b4c624da5277c16e95b3/gfx/wr/webrender/src/clip.rs#625 [4] https://searchfox.org/mozilla-central/rev/fef7f858efb695a76010b4c624da5277c16e95b3/gfx/wr/webrender/src/clip.rs#677

Comment 7

•

6 years ago

Another option might be discard those clips from the prim's clip chain instance anyway, if we know that the prims are going to get rendered into an intermediate surface (or whatever the WR equivalent is) and the clips will get applied on the whole surface at some later step. I'm imagining that would be the case here because the clips are from an ancestor clip chain, but I haven't checked that yet.

Comment 8

•

6 years ago

Ignore comment 7, that doesn't make much sense. More thoughts though: based on further reading of the code, the clipping actually happens in three steps. The first is on the CPU side (affecting the prim's local_clip_rect) as described in comment 6. The other two happen in the vertex shader [1][2] and the fragment shader [3] respectively. The vertex shader ones basically clamp the vertex inside the clip rect (which is specified in the primitive's local space). The fragment shader one checks the world-space pixel against the mask. But what we really want to do in this case is apply a rectangle clip in world space. So we could do this either in the vertex shader post-transform (although then it might not be as simple as clamping the vertex) or in the fragment shader by allowing it to take a rectangle clip as well as a mask clip. I think this is the most promising approach I've come up with so far. In a nutshell, the code at [4] would be modified so that if the clip in world space can be represented as an axis-aligned rectangle, then we will leave needs_mask as false. Instead, we would propagate that world-space clip rect to the GPU, and then the fragment shader could clip against it in do_clip. Glenn, does that sound reasonable? [1] https://searchfox.org/mozilla-central/rev/a0841a254f67ae6c58a7b03473b6c8a10d928368/gfx/wr/webrender/res/prim_shared.glsl#103 [2] https://searchfox.org/mozilla-central/rev/a0841a254f67ae6c58a7b03473b6c8a10d928368/gfx/wr/webrender/res/prim_shared.glsl#162-163 [3] https://searchfox.org/mozilla-central/rev/a0841a254f67ae6c58a7b03473b6c8a10d928368/gfx/wr/webrender/res/prim_shared.glsl#229 [4] https://searchfox.org/mozilla-central/rev/a0841a254f67ae6c58a7b03473b6c8a10d928368/gfx/wr/webrender/src/clip.rs#669

Flags: needinfo?(gwatson)

Updated

•

6 years ago

Blocks: 1517940

Glenn Watson [:gw]

Comment 9

•

6 years ago

We discussed this a bit on IRC. We're going to do a bit more investigation as to whether the clips are being applied when compositing into the scene.

One option we discussed is making the clip mask rendering much more efficient for these cases, where the clip mask is large but the clip is actually a simple rect, and we could perhaps skip rendering the inner part for example.

Flags: needinfo?(gwatson)

[1] https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=668eb9d926fd268ee4c93c2ad7a46861c986e948

Comment 10

•

6 years ago

So I did the simple thing Glenn suggested (based on comment 7, which it turns out did make sense after all) and dropped the clips in the case where the clip would get applied later in the compositing pipeline. Locally that seemed to help, on my Mac it removed the C_Clip chunk of the GPU times and the mean GPU time went down from ~29ms to ~25ms. On Mac I was getting pretty big B_Blend times but not so on my Wintel setup, so I suspect the patch will help more there. I'll test that next, but I've kicked off a try push with the patch at [1] to check for correctness.

Comment 11

•

6 years ago

There's some test failures, likely because I didn't get the intermediate surface condition quite right.

Comment 12

•

6 years ago

I looked some more at the clip collector code that Glenn pointed me to yesterday as well. It is structured in a way that should already handle this case, but it doesn't actually. I'm still trying to figure out why - it looks like we're not turning one of the pictures with the perspective transform into a raster root, because the requested_composite_mode is empty. Trying to understand the display list flattener code now to figure out why, but my brain stopped working about an hour ago so I'll likely pick this up tomorrow.

Updated

•

6 years ago

Blocks: 1518995

Comment 13

•

6 years ago

Ok, so I think I have a handle on what's going on here. When we have a perspective-transformed elements on a page, the GeckoDL looks kind of like this:

nsDisplayPerspective item with perspective transform
  nsDisplayTransform item with 3d transform and extends-3d-context flag
    ...

This produces a WR display list like this:

PushReferenceFrameDisplayListItem { ... transform_style: Flat, perspective: Some(/*perspective transform*/), ... }
PushStackingContextDisplayItem { ... transform_style: Flat, ... }
  PushReferenceFrameDisplayListItem { ... transform_style: Preserve3D, transform: Some(/*3d transform*/), ... }
  PushStackingContextDisplayItem { ... transform_style: Preserve3D, ... }
    ...

Since the transform_style of the outer stacking context is Flat, the participating_in_3d_context flag is false for that, which cascades down to WR not turning that into a raster root. The inner stacking context does get turned into a raster root. This means any clips from "outside" the outer stacking context (e.g. the rootmost clip around the entire browser area) will get collected onto the inner stacking context. When we process the clips for the raster root picture, we pick up the collected clips. Those clips, however, are from e.g. the root spatial node, and there is a perspective transform between the raster root and the root spatial node, so we end up creating masks for these clips as described in previous comments.

So one possible fix here should be to also take Flat stacking contexts but whose spatial node has a perspective transform, and turn it into a raster root. I think that in combination with my previous patch would solve the problem (will test this) but it would create extra intermediate surfaces which might incur a performance penalty in other scenarios.

Updated

•

6 years ago

Depends on: 1519420

Comment 14

•

6 years ago

(In reply to Kartikaya Gupta (away Jan 12-26; email:kats@mozilla.com) from comment #13)

So one possible fix here should be to also take Flat stacking contexts but whose spatial node has a perspective transform, and turn it into a raster root. I think that in combination with my previous patch would solve the problem (will test this) but it would create extra intermediate surfaces which might incur a performance penalty in other scenarios.

I tried this, but it produced some reftest failures. Then I tried just this change (forcing new raster roots) in isolation, and that also produced reftest failures. Which appears to be a bug, so I filed bug 1519420 for that.

Comment 15

•

6 years ago

Reassigning to kvark while kats is away

Assignee: kats → dmalyshau

Comment 16

•

6 years ago

Looked at a GPU capture a bit. The real cost appears to come from the sheer amount of pixel copying/filling that we do.

In the first pass, we target 10 full-screen surfaces, and half of them get most of the pixels filled up with those solid-colored planes. Another half is rendering some thin borders (with solid shaders) and text, which is near-zero cost in terms of GPU time. Interestingly, they only affect the alpha.

In the second pass we render some clip masks that in total may come around a 1x-1.5x of full-screen estate in pixels. This is ok.

In the third pass something unexpected happens: we are basically copying over those giant filled rects into new render targets. This is 10 full-screen slices that we mostly fill out, and here we pay extra costs for the borders in the first pass: at this stage we don't know that only a few pixels were touched when they were rendered, so we end up copying whole giant rects of texels.

Then comes the 4th pass rendering everything to screen, plus blitting back to cached surfaces.

So here are the things to investigate and improve:

figure out why we even have the 3rd pass, and if we can avoid it
check if we can (picture) cache the local space rasterized surfaces for plane splitting (and don't try to picture cache the screen tiles)
look closer at what is rendered in the 1st pass. Perhaps, if we know that some of the things only affect alpha, we can choose the alpha render targets for those and save on fill-rate and sampling cost?

Emilio Cobos Álvarez (:emilio)

Comment 17

•

6 years ago

Talked to :gw, and apparently the WR's GPU profiler disagrees with RenderDoc on what is slow. C_Clip is shown to consume most of GPU time, even if we make the shader trivial, and the opaque pixel coverage doesn't match the one we see in RD.

Comment 18

•

6 years ago

FWIW, passing always TransformStyle::Preserve3D to perspective reference frames is the right thing to do, and it used to be that way. It was undone to avoid WR doing unnecessary work long time ago.

And actually if my reading of the display-list up is correct, my patch at bug 1505222 should fix this.

Depends on: 1505222

Emilio Cobos Álvarez (:emilio)

Comment 19

•

6 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #18)

FWIW, passing always TransformStyle::Preserve3D to perspective reference frames is the right thing to do, and it used to be that way. It was undone to avoid WR doing unnecessary work long time ago.

In particular it was undone in bug 1489337. Equivalent Gecko code is:

https://searchfox.org/mozilla-central/source/layout/painting/nsDisplayList.cpp#8435

My patch at bug 1505222 partially undoes this, I think it will catch this case too, but we can always check back.

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1489337

Emilio Cobos Álvarez (:emilio)

Updated

•

6 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1521656

Glenn Watson [:gw]

Comment 20

•

6 years ago

I implemented some optimizations to drawing large clip rects for transformed primitives in https://bugzilla.mozilla.org/show_bug.cgi?id=1522758 (as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1521015).

I tested this page with that patch and it runs at 60fps on my machine. It could still be improved in the future, but I suspect it will be good enough to ship with once that lands in m-c.

Darkspirit

Updated

•

6 years ago

Depends on: 1522758

Comment 21

•

6 years ago

Yeah, it seems ok. Moving to P4

Priority: P3 → P4

Updated

•

6 years ago

Assignee: dmalyshau → nobody

Updated

•

6 years ago

Blocks: wr-perf

Comment 22

•

6 years ago

It runs much worse to me on 67.0a1 (2019-02-11), it's barely scroll-able now...

Comment 23

•

6 years ago

Weird. It still runs okish (60fps) for me on 67.0a1 (2019-02-11)

Comment 24

•

6 years ago

We are compositor-bound in the case where it's slow on my machine (NV GPU, linux, 4K screen). We issue about 160 draw calls, occupy about 20-ish render target layers, and waste a whole lot of time (>50% based on the GPU time queries) just clearing the surfaces.

Potential things to look at:

Policy for re-initializing FBOs. We spend ~12% of total compositor time just doing that...
Clearing policy. It's quite possible that my platform is just inefficient at clearing when scissor is provided.
There appear to be quite a few surfaces that we render full-screen without touching any pixel.
Cache plane splitting surfaces. Currently, we intersect them with the near plane, render a bunch of clip masks, and bake the planes with the local space adjusted accordingly. This approach aims to miminize the number of pixels we fill per frame, but it misses an opportunity to retain those pixels across frames. We could just bake the surfaces without clipping by the near plane, and every frame use only the portions that are visible.

Updated

•

6 years ago

Blocks: wr-67
No longer blocks: stage-wr-trains

Priority: P4 → P3

Updated

•

6 years ago

Blocks: wr-68
No longer blocks: wr-67

Updated

•

5 years ago

Blocks: wr-70
No longer blocks: wr-68