Closed Bug 1706561 Opened 4 years ago Closed 3 years ago

5.12 - 3.97% displaylist_mutate / displaylist_mutate + 1 more (OSX) regression on Thu April 15 2021

Categories

(Core :: Graphics: WebRender, defect)

Firefox 90
defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox88 --- unaffected
firefox89 --- wontfix
firefox90 --- fixed

People

(Reporter: alexandrui, Assigned: bradwerth)

References

(Regression)

Details

(4 keywords)

Attachments

(5 files, 2 obsolete files)

Perfherder has detected a talos performance regression from push 3da14f29d4d71054d80b93bad5cef6ed3f110c8c. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
5% displaylist_mutate (docs) macosx1015-64-shippable-qr e10s stylo webrender 2,017.32 -> 2,120.67
4% displaylist_mutate (docs) macosx1015-64-shippable-qr e10s stylo webrender-sw 2,144.85 -> 2,235.38
4% displaylist_mutate (docs) macosx1015-64-shippable-qr e10s stylo webrender-sw 2,146.41 -> 2,231.52

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
5% rasterflood_svg (docs) macosx1015-64-shippable-qr e10s stylo webrender-sw 9,832.74 -> 9,379.64

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(bwerth)

Set release status flags based on info from the regressing bug 1675375

I think this can be fixed by adding logic to define_clip_image_mask to avoid pushing the SetPoints DisplayItem when no points are provided (which is the common case). I will create a patch that does that and see if it fixes the performance regression.

Assignee: nobody → bwerth

This reduces the size of the display list in the common case of an image mask
having no associated hit-test polygon.

This should recover at least some of the performance. I'll start a talos run to check. To get all of the performance back, it may be necessary to special-case the ImageMaskClip DisplayItem so that it doesn't carry the fixed-size PolygonKey when it's not needed. That may be a memory savings rather than a performance savings, but it's still a useful optimization.

Depends on D113000

The patches are not improving performance on inactive_mutate.html so I'll try again and generate profiles. I guess I'll need to generate a profile on a backed-out version, too. Complicated!

Ah, I see I can get the try server to generate profiles. Here's the new try run: https://treeherder.mozilla.org/jobs?repo=try&revision=f8757abbaedfb302f05838ce2947fc36f866a3dc

I was able to generate 3 profiles of interest:

  1. Before the performance-regressing patches of Bug 1675375 landed.
  2. After Bug 1675375 landed.
  3. After the proposed fixes in this bug landed.

But I'm not yet seeing interesting differences in the profiles, particularly between profiles 1 and 2. I'm not sure where to find the calls related to the the display list processing that is the focus of the talos test. The activity in the WRSceneBuilder#4 and WRRenderBackend#4 threads seems quite similar even though profile 2 is creating more display list items.

Matt, you're the creator of the the displaylist_inactive_mutate.html test. Can you help me understand what to look for in these profiles that might explain the talos results? And what might explain why my hypothesis of reducing the number of display list items is having no effect on performance?

Flags: needinfo?(matt.woodrow)

These profiles are showing displaylist_flattened_mutate.html for the PresShell::Paint stack marker on the content process main thread.

I suspect that is why we're not seeing any difference here.

Flags: needinfo?(matt.woodrow)

It's likely that bug 1497758 is causing this.

This continues to confound me. In local builds, including optimized builds, I can't measure any difference in the performance of the displaylist_inactive_mutate.html test with or without the code from Bug 1675375. Locally captured profiles show very similar code paths. But when I push a backout of Bug 1675375 code, I can see a Talos improvement on the try server. Unfortunately, the profiles I can generate from the try server don't show the results from the displaylist_inactive_mutate.html as noted in comment 9 and caused by Bug 1497758. I guess the way forward is to try to fix Bug 1497758 so I can see what the profiles look like on the try server.

Depends on: 1497758

With my no-so-great fix for Bug 1497758 in place, I can compare two profiles more directly. Rollback profile (fast) vs current profile (slow). I see the following differences:

  1. In the slow case, the Renderer thread spends more time in brush_solid_vert::run. Not sure what causes this path to be hit harder.
  2. In the slow case, the WRSceneBuilder thread spends more time in hit_test::add_clips. That would indicate that more clips are being processed. Again, not sure why.
  3. In the slow case, WRRenderBackend threads spends more time in SpatialTree::update_tree calling memmove. This is scaling up though the other calls are not, indicating that the triggering call is outside of the loop in that function. My assumption is that this means that update_tree is being called more frequently on a comparable number of nodes_to_update.

(In reply to Brad Werth [:bradwerth] from comment #12)

  1. In the slow case, the WRSceneBuilder thread spends more time in hit_test::add_clips. That would indicate that more clips are being processed. Again, not sure why.

A local patch I built confirmed that the clip count is not changing between the current (slow) and backed-out (fast) runs. Instead, it seems that each clip is just taking more time to process. However, there are a large amount of clip nodes in the regressing test -- 7012. After conversation with gw, the leading theory is that the slowness is the result of bloating the size of the ClipItemKeyKind enum, which is now much larger since it has a fixed-size array to store POLYGON_CLIP_VERTEX_MAX (currently 16) points.

So the proposed approach is to shrink the size of the ClipItemKeyKind enum by externalizing the storage of those polygon points and only claiming that storage when needed.

Attachment #9217506 - Attachment is obsolete: true

Defining a new interner requires a new profiler tag. Adding a new tag requires
that some ids need to be shifted to maintain continuity with the other
interners. This patch isolates those changes without actually creating the new
polygon interner in the hope that it makes the reviewing easier.

Depends on D113000

This is needed to support the use of intern::Handle structs in hashable enums.

Depends on D114555

HitTestScene will need to access the SceneBuilder interners in order to pull
polygon data when creating HitTestNodes. This patch isolates the function
signature changes and the passthrough of the interners, without actually
using the interners for anything yet.

Depends on D114556

This moves the polygon data out of the ClipItemKind and ClipItemKeyKind enums,
which reduces bloat, which improves access times for scenes with lots of
clips -- even if they don't use image mask clips. The SceneBuilder interns
polygons when they arrive alongside image mask clips, and the HitTestingScene
retrieves them when it's time to build the hit test nodes.

Depends on D114557

This new patch series didn't improve performance on the g4 talos series. The new profile, compared to the rollback (fast) profile shows that the scene builder thread is doing more work, and the "SceneBuilding" marker is further apart. But when I look at the flame graph, it looks like the time and percentages spent in SceneBuilder::build_all and its sub-functions are very similar.

So although the patches did reduce the time spent in hit_test::add_clips to pre-regression levels, that wasn't enough to restore the performance. I'll keep looking to try to determine the next hot code path.

(In reply to Brad Werth [:bradwerth] from comment #18)

So although the patches did reduce the time spent in hit_test::add_clips to pre-regression levels, that wasn't enough to restore the performance. I'll keep looking to try to determine the next hot code path.

I was mistaken: hit_test::add_clips still appears to be the bottleneck. Best guess is that since ClipItemKind::Image was already the biggest enum variant, adding an additional element to that variant is making add_clips slower. A fix will probably have to restore that enum variant to its original composition and size. One way to do that would be to create a dedicated polygon clip, intended only for hit testing, and setup the clip chain such that image mask is used for drawing but the polygon clip is used for hit testing. Complicated!

For what it's worth, displaylist_mutate is a very synthetic micro-benchmark that creates a huge number of simple items and doesn't necessarily reflect real-world performance.

I think it's worth considering the added complexity of solving this, any future planned changes to the clipping code (gw?), and what else we could fix instead if we put this on the back burner for the near future.

The performance of this test has been restored by Bug 1710336, which avoids the problem by greatly reducing the number of clip nodes. We still may want to land some of the proposed patches. I'll confirm the patches work as intended after rebasing (and don't degrade performance), then seek review.

Flags: needinfo?(bwerth)
Attachment #9221216 - Attachment is obsolete: true
Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b9ebbb0e9db0 Part 1: Only push a SetPoints DisplayItem when points have been provided. r=gw https://hg.mozilla.org/integration/autoland/rev/0cadee5455a1 Part 2: Update profiler ids to also track interned polygons. r=gw https://hg.mozilla.org/integration/autoland/rev/62f10a2fceee Part 3: Make intern Handles hashable. r=gw https://hg.mozilla.org/integration/autoland/rev/d6d0a9726fb8 Part 4: Passthrough SceneBuilder Interners to HitTestScene. r=gw https://hg.mozilla.org/integration/autoland/rev/b2aae9afa4a5 Part 5: Intern polygon data for image masks, and retrieve for hit tests. r=gw
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: