5.12 - 3.97% displaylist_mutate / displaylist_mutate + 1 more (OSX) regression on Thu April 15 2021
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox88 | --- | unaffected |
firefox89 | --- | wontfix |
firefox90 | --- | fixed |
People
(Reporter: alexandrui, Assigned: bradwerth)
References
(Regression)
Details
(4 keywords)
Attachments
(5 files, 2 obsolete files)
Perfherder has detected a talos performance regression from push 3da14f29d4d71054d80b93bad5cef6ed3f110c8c. As author of one of the patches included in that push, we need your help to address this regression.
Regressions:
Ratio | Suite | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|---|
5% | displaylist_mutate (docs) | macosx1015-64-shippable-qr | e10s stylo webrender | 2,017.32 -> 2,120.67 | |
4% | displaylist_mutate (docs) | macosx1015-64-shippable-qr | e10s stylo webrender-sw | 2,144.85 -> 2,235.38 | |
4% | displaylist_mutate (docs) | macosx1015-64-shippable-qr | e10s stylo webrender-sw | 2,146.41 -> 2,231.52 |
Improvements:
Ratio | Suite | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|---|
5% | rasterflood_svg (docs) | macosx1015-64-shippable-qr | e10s stylo webrender-sw | 9,832.74 -> 9,379.64 |
Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.
For more information on performance sheriffing please see our FAQ.
Comment 1•4 years ago
|
||
Set release status flags based on info from the regressing bug 1675375
Assignee | ||
Comment 2•4 years ago
|
||
I think this can be fixed by adding logic to define_clip_image_mask
to avoid pushing the SetPoints
DisplayItem when no points are provided (which is the common case). I will create a patch that does that and see if it fixes the performance regression.
Assignee | ||
Comment 3•4 years ago
|
||
This reduces the size of the display list in the common case of an image mask
having no associated hit-test polygon.
Assignee | ||
Comment 4•4 years ago
|
||
This should recover at least some of the performance. I'll start a talos run to check. To get all of the performance back, it may be necessary to special-case the ImageMaskClip
DisplayItem so that it doesn't carry the fixed-size PolygonKey
when it's not needed. That may be a memory savings rather than a performance savings, but it's still a useful optimization.
Assignee | ||
Comment 5•4 years ago
|
||
Depends on D113000
Assignee | ||
Comment 6•4 years ago
|
||
The patches are not improving performance on inactive_mutate.html so I'll try again and generate profiles. I guess I'll need to generate a profile on a backed-out version, too. Complicated!
Assignee | ||
Comment 7•4 years ago
|
||
Ah, I see I can get the try server to generate profiles. Here's the new try run: https://treeherder.mozilla.org/jobs?repo=try&revision=f8757abbaedfb302f05838ce2947fc36f866a3dc
Assignee | ||
Comment 8•4 years ago
|
||
I was able to generate 3 profiles of interest:
- Before the performance-regressing patches of Bug 1675375 landed.
- After Bug 1675375 landed.
- After the proposed fixes in this bug landed.
But I'm not yet seeing interesting differences in the profiles, particularly between profiles 1 and 2. I'm not sure where to find the calls related to the the display list processing that is the focus of the talos test. The activity in the WRSceneBuilder#4
and WRRenderBackend#4
threads seems quite similar even though profile 2 is creating more display list items.
Matt, you're the creator of the the displaylist_inactive_mutate.html test. Can you help me understand what to look for in these profiles that might explain the talos results? And what might explain why my hypothesis of reducing the number of display list items is having no effect on performance?
Comment 9•4 years ago
|
||
These profiles are showing displaylist_flattened_mutate.html
for the PresShell::Paint stack marker on the content process main thread.
I suspect that is why we're not seeing any difference here.
Comment 10•4 years ago
|
||
It's likely that bug 1497758 is causing this.
Assignee | ||
Comment 11•4 years ago
|
||
This continues to confound me. In local builds, including optimized builds, I can't measure any difference in the performance of the displaylist_inactive_mutate.html test with or without the code from Bug 1675375. Locally captured profiles show very similar code paths. But when I push a backout of Bug 1675375 code, I can see a Talos improvement on the try server. Unfortunately, the profiles I can generate from the try server don't show the results from the displaylist_inactive_mutate.html as noted in comment 9 and caused by Bug 1497758. I guess the way forward is to try to fix Bug 1497758 so I can see what the profiles look like on the try server.
Assignee | ||
Comment 12•4 years ago
|
||
With my no-so-great fix for Bug 1497758 in place, I can compare two profiles more directly. Rollback profile (fast) vs current profile (slow). I see the following differences:
- In the slow case, the Renderer thread spends more time in
brush_solid_vert::run
. Not sure what causes this path to be hit harder. - In the slow case, the WRSceneBuilder thread spends more time in
hit_test::add_clips
. That would indicate that more clips are being processed. Again, not sure why. - In the slow case, WRRenderBackend threads spends more time in
SpatialTree::update_tree
callingmemmove
. This is scaling up though the other calls are not, indicating that the triggering call is outside of the loop in that function. My assumption is that this means thatupdate_tree
is being called more frequently on a comparable number ofnodes_to_update
.
Assignee | ||
Comment 13•4 years ago
|
||
(In reply to Brad Werth [:bradwerth] from comment #12)
- In the slow case, the WRSceneBuilder thread spends more time in
hit_test::add_clips
. That would indicate that more clips are being processed. Again, not sure why.
A local patch I built confirmed that the clip count is not changing between the current (slow) and backed-out (fast) runs. Instead, it seems that each clip is just taking more time to process. However, there are a large amount of clip nodes in the regressing test -- 7012. After conversation with gw, the leading theory is that the slowness is the result of bloating the size of the ClipItemKeyKind
enum, which is now much larger since it has a fixed-size array to store POLYGON_CLIP_VERTEX_MAX
(currently 16) points.
So the proposed approach is to shrink the size of the ClipItemKeyKind
enum by externalizing the storage of those polygon points and only claiming that storage when needed.
Updated•4 years ago
|
Assignee | ||
Comment 14•4 years ago
|
||
Defining a new interner requires a new profiler tag. Adding a new tag requires
that some ids need to be shifted to maintain continuity with the other
interners. This patch isolates those changes without actually creating the new
polygon interner in the hope that it makes the reviewing easier.
Depends on D113000
Assignee | ||
Comment 15•4 years ago
|
||
This is needed to support the use of intern::Handle structs in hashable enums.
Depends on D114555
Assignee | ||
Comment 16•4 years ago
|
||
HitTestScene will need to access the SceneBuilder interners in order to pull
polygon data when creating HitTestNodes. This patch isolates the function
signature changes and the passthrough of the interners, without actually
using the interners for anything yet.
Depends on D114556
Assignee | ||
Comment 17•4 years ago
|
||
This moves the polygon data out of the ClipItemKind and ClipItemKeyKind enums,
which reduces bloat, which improves access times for scenes with lots of
clips -- even if they don't use image mask clips. The SceneBuilder interns
polygons when they arrive alongside image mask clips, and the HitTestingScene
retrieves them when it's time to build the hit test nodes.
Depends on D114557
Assignee | ||
Comment 18•4 years ago
|
||
This new patch series didn't improve performance on the g4 talos series. The new profile, compared to the rollback (fast) profile shows that the scene builder thread is doing more work, and the "SceneBuilding" marker is further apart. But when I look at the flame graph, it looks like the time and percentages spent in SceneBuilder::build_all
and its sub-functions are very similar.
So although the patches did reduce the time spent in hit_test::add_clips
to pre-regression levels, that wasn't enough to restore the performance. I'll keep looking to try to determine the next hot code path.
Assignee | ||
Comment 19•4 years ago
|
||
(In reply to Brad Werth [:bradwerth] from comment #18)
So although the patches did reduce the time spent in
hit_test::add_clips
to pre-regression levels, that wasn't enough to restore the performance. I'll keep looking to try to determine the next hot code path.
I was mistaken: hit_test::add_clips
still appears to be the bottleneck. Best guess is that since ClipItemKind::Image
was already the biggest enum variant, adding an additional element to that variant is making add_clips
slower. A fix will probably have to restore that enum variant to its original composition and size. One way to do that would be to create a dedicated polygon clip, intended only for hit testing, and setup the clip chain such that image mask is used for drawing but the polygon clip is used for hit testing. Complicated!
Comment 20•4 years ago
|
||
For what it's worth, displaylist_mutate is a very synthetic micro-benchmark that creates a huge number of simple items and doesn't necessarily reflect real-world performance.
I think it's worth considering the added complexity of solving this, any future planned changes to the clipping code (gw?), and what else we could fix instead if we put this on the back burner for the near future.
Assignee | ||
Comment 21•4 years ago
|
||
Updated•4 years ago
|
Assignee | ||
Comment 22•4 years ago
|
||
The performance of this test has been restored by Bug 1710336, which avoids the problem by greatly reducing the number of clip nodes. We still may want to land some of the proposed patches. I'll confirm the patches work as intended after rebasing (and don't degrade performance), then seek review.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 23•3 years ago
|
||
Comment 24•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/b9ebbb0e9db0
https://hg.mozilla.org/mozilla-central/rev/0cadee5455a1
https://hg.mozilla.org/mozilla-central/rev/62f10a2fceee
https://hg.mozilla.org/mozilla-central/rev/d6d0a9726fb8
https://hg.mozilla.org/mozilla-central/rev/b2aae9afa4a5
Updated•3 years ago
|
Description
•