Closed Bug 1504767 Opened 6 years ago Closed 6 years ago

6.04 - 9.03% displaylist_mutate / tp5o_scroll / tscrollx (linux64-qr, windows10-64-qr) regression on push fb647fa86523da8e9a1173776606d2a0fcae32c2 (Fri Nov 2 2018)

Categories

(Testing :: General, defect)

Version 3
defect
Not set
normal

Tracking

(firefox-esr60 unaffected, firefox63 unaffected, firefox64 unaffected, firefox65 verified)

VERIFIED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- unaffected
firefox64 --- unaffected
firefox65 --- verified

People

(Reporter: jmaher, Assigned: gw)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push: https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=fb647fa86523da8e9a1173776606d2a0fcae32c2 As author of one of the patches included in that push, we need your help to address this regression. Regressions: 9% displaylist_mutate windows10-64-qr opt e10s stylo 5,585.58 -> 6,090.11 8% displaylist_mutate linux64-qr opt e10s stylo 5,277.42 -> 5,694.29 6% tp5o_scroll linux64-qr opt e10s stylo 4.13 -> 4.39 6% tscrollx linux64-qr opt e10s stylo 2.34 -> 2.48 Improvements: 6% tart windows10-64-qr opt e10s stylo 4.15 -> 3.89 4% ts_paint_webext linux64-qr opt e10s stylo 902.62 -> 864.00 4% sessionrestore linux64-qr opt e10s stylo 886.17 -> 850.67 4% sessionrestore_no_auto_restore linux64-qr opt e10s stylo 948.67 -> 911.58 4% ts_paint linux64-qr opt e10s stylo 897.33 -> 864.67 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=17341 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running *** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
:gfx, I see you authored a patch and landed it- it caused some regressions and some improvements, can you look at the regressions to determine if they need to be backed out, accepted as a new baseline, or something we can fix?
:kats, I assume you are :gfx based on the staktrace domain similarities?
Flags: needinfo?(kats)
Looks like this was from [1]. Glenn, WDYT? [1] https://github.com/servo/webrender/pull/3260
Flags: needinfo?(kats) → needinfo?(gwatson)
Is it possible to get a profile for the above with the WR threads included? I think the profile needs to be created with 'WR' added to the config for selecting which threads to sample. Or is there a guide / info on how I can create perf profiles myself from the CI machines?
Flags: needinfo?(gwatson)
(The WR threads are included in the profile, I pointed Glenn to them).
(Restoring the NI for gw).
Flags: needinfo?(gwatson)
Ah, never mind - I can see the stack traces now. This does seem quite unexpected - I would expect the text interning patch to cause some (temporary) regression on tests that have very large amounts of text. However, the dl_mutate test at least doesn't have much text, as far as I know? I'll do some investigating today.
I noticed two things that immediately stood out in the profile. 1) Segment building is often being invoked when it won't produce any segments. This is unrelated to the patch above, but was high in the profile, so I fixed it - https://github.com/servo/webrender/pull/3273. This drops the dl_mutate backend time from 10.8ms to 8.2ms on my machine, which should be quite a significant win on talos. 2) The patch referenced above significantly increases the size of the PrimitiveInstance structure. I believe this is the cause of most of the regression noted above. I'm working on a patch now which should hopefully resolve this - I expect it to be ready by end of today, but it might be tomorrow.
Flags: needinfo?(gwatson)
Assignee: nobody → gwatson
I added a workaround patch for (2) here https://github.com/servo/webrender/pull/3274. https://treeherder.mozilla.org/#/jobs?repo=try&revision=9bb90203cf929dc413a26ce972af6b4554aa7021 contains a try run with both the patches for (1) and (2), so I'll check the performance difference once that completes, but it should hopefully be improved quite a bit.
It looks like these two patches resolve the regression. For example "displaylist_mutate linux64-qr opt e10s stylo" was originally 5277, then regressed to 5694, and with these two patches is 4980. I haven't checked the other results in detail yet. The scroll one looks to have regressed a small amount, which is fine, since the benefits of the text interning have not all landed yet.
igoldan, can you confirm that we can resolve this?
Flags: needinfo?(igoldan)
(In reply to Bobby Holley (:bholley) from comment #12) > igoldan, can you confirm that we can resolve this? Indeed, the displaylist_mutate regressions were more than fixed. They actually got a boost bigger than prior values \0/! As :gw said, the scroll regressions (tp5o_scroll & tscrollx) are still there, out of which tp5o_scroll is on Windows 10 QR. Thus, I'd like to keep this bug open until the benefits of the text interning land and their fix confirmed also.
Flags: needinfo?(igoldan)
(In reply to Glenn Watson [:gw] from comment #11) > I haven't checked the other results in detail yet. The scroll one looks to > have regressed a small amount, which is fine, since the benefits of the text > interning have not all landed yet. Please link the bug you filed for text interning against this bug.
Flags: needinfo?(gwatson)
Depends on: 1507257
Flags: needinfo?(gwatson)
I think this can be closed, now that https://bugzilla.mozilla.org/show_bug.cgi?id=1507257 has brought the dl_mutate time down to ~4700?
Flags: needinfo?(igoldan)
(In reply to Glenn Watson [:gw] from comment #15) > I think this can be closed, now that > https://bugzilla.mozilla.org/show_bug.cgi?id=1507257 has brought the > dl_mutate time down to ~4700? Yes, you are right. All regressions have now been fixed.
Flags: needinfo?(igoldan)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Target Milestone: --- → mozilla65
You need to log in before you can comment on or make changes to this bug.