Closed Bug 1226316 Opened 9 years ago Closed 9 years ago

18-22% Linux 64/Win8/Mac E10S tp5o_scroll regression on Mozilla-Inbound (v.45) on Nov 19, 2015 from push 5326ac1a435a44d335752ed727fa157273b91244

Categories

(Core :: Panning and Zooming, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla45
Tracking Status
firefox45 --- fixed

People

(Reporter: jmaher, Assigned: mchang)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Attachments

(2 files)

Talos has detected a Firefox performance regression from your commit 5326ac1a435a44d335752ed727fa157273b91244 in bug 1208636. We need you to address this regression. This is a list of all known regressions and improvements related to your bug: http://alertmanager.allizom.org:8080/alerts.html?rev=5326ac1a435a44d335752ed727fa157273b91244&showAll=1 On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5o_scroll Reproducing and debugging the regression: If you would like to re-run this Talos test on a potential fix, use try with the following syntax: try: -b o -p linux64,win64 -u none -t g1 # add "mozharness: --spsProfile" to generate profile data To run the test locally and do a more in-depth investigation, first set up a local Talos environment: https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code Then run the following command from the directory where you set up Talos: talos --develop -e <path>/firefox -a tp5o_scroll Making a decision: As the patch author we need your feedback to help us handle this regression. *** Please let us know your plans by Monday, or the offending patch will be backed out! *** Our wiki page oulines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Mason did a great job of giving a heads up in the original bug! Big kudos for that:) Here is a compareview of the changes: https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=57593f27594a&newProject=mozilla-inbound&newRevision=5326ac1a435a&showOnlyConfident=1 I would have thought more regressions would have come from this- but I am seeing the two, possibly a 3rd with tp5o xres: https://treeherder.allizom.org/perf.html#/graphs?timerange=86400&series=[mozilla-inbound,a20264b4683ad3ae2483813370c9a211daf2a348,1]&highlightedRevisions=57593f27594a&highlightedRevisions=5326ac1a435a Mason, as you were expecting some regressions- could you explain why these regressed and confirm this is expected (both the specific tests/platforms and magnitude)?
Flags: needinfo?(mchang)
This is weird. We have test runs for: - linux64 - win7-32 - win8-64 - win-XP The non-e10s results look pretty much unchanged. On the e10s ones: - tp5o-scroll: linux64 and win8-64 regressed ~20%, while the others are unaffected. - tscrollx: seems we have slightly lower noise, but otherwise no change. So: 1. What's common to linux64 and win8-64 here that only they regressed? 2. Bug 1208636 says "Adjust displayport size based on available system memory", maybe these machines have different amount of system memory (compared to the other two), and the amount of available memory affects the regression [magnitude]? 3. Why didn't tscrollx regress too? technically it should be an identical test, except that it runs on synthetic pages instead of "real world" pages [*] [*] they should be identical, and they even use identical code, but their invocation is different. On both tests, the pageloader addon is used to load the pages and receive the scroll performance reports. At tscrollx, the scrolling/measuring code is part of the content page itself (the page is synthetic and we include the scroll js file directly from the page). But at tp5o-scroll, while it's exactly the same code, the scroll/measure function is injected into the page by the pageloader addon. Maybe there's some new performance impact (only with e10s?) when the scroll function is injected into the page, but no impact if the page includes it "normally"? (sorry, no concrete answers, but hopefully we could make progress using this info)
Assignee: nobody → mchang
Flags: needinfo?(mchang)
Some info after a discussion with mchang and kats: 1. mchang says that if the scroll is APZ, then it's possible that since tp5o-scroll uses twice as big scroll steps compared to tscrollx (10px vs 5px), APZ recognizes it as "fast scroll" and activates some mechanisms which might be regressing tp5o-scroll. 2. Both tests were designed to scroll synchronously using scrollBy (or scrollTop+= on frames), so they shouldn't be scrolling async with APZ. However, kats says that if the scroll behavior is set to smooth (a CSS property or DOM API option), then APZ would handle it. However, the scroll commands don't use this option, and I'm pretty sure none of the many tp5o-scroll pages which regressed has its scroll behavior set to smooth using CSS. So the questions are: 1. If APZ is not taking part during neither of these tests: a. how come one test regressed and the other didn't? b. how come two platforms regressed meaningfully and the other two didn't? c. do we know why the regression occur? 2. If APZ does take place in the scrolls: a/b/c same as above. 3. Is APZ involved in these scroll tests? maybe only with some of the tests/pages? I'm pretty sure it shouldn't, but maybe there's a bug which makes APZ take control anyway. For reference, that's the scrolling code (should be identical): - tp5o-scroll: http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/pageloader/chrome/tscroll.js -tscrollx: hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/tests/scroll/scroll-test.js
Flags: needinfo?(bugmail.mozilla)
Attached file CalculateDisplayPort call stack (deleted) —
I got a call stack while running the tp5o_scrolling test case. It's attached. This is from gdb while attached to the parent process (which I think is the right place?). So it looks like even though APZ isn't directly handling the input event from widget/gtk/nsWindow, it's still controlling the DisplayPort size. Is that correct kats? Thanks!
(In reply to Avi Halachmi (:avih) from comment #3) > Some info after a discussion with mchang and kats: Sorry, I was lacking context and was distracted during our IRC conversation, so I may have misinterpreted what you meant by "APZ handling the scroll". What I was thinking of was whether or not APZ is driving the compositor animation for scrolling one place to another. APZ would be doing this if the test was using the CSS smooth scroll behaviour, but this is largely irrelevant. For the purposes of this talos regression we don't care about whether or not APZ is driving the compositor animation, we care about who is setting up the displayport that layout is painting. And yes, that's going to be APZ. > 1. mchang says that if the scroll is APZ, then it's possible that since > tp5o-scroll uses twice as big scroll steps compared to tscrollx (10px vs > 5px), APZ recognizes it as "fast scroll" and activates some mechanisms which > might be regressing tp5o-scroll. We do have some mechanisms that will adjust the displayport based on scroll velocity, but scrollTo/scrollBy calls do not affect this notion of velocity. So I don't believe this is a factor here. > 2. Both tests were designed to scroll synchronously using scrollBy (or > scrollTop+= on frames), so they shouldn't be scrolling async with APZ. Here "scrolling async with APZ" is not really a well-defined term. When you do a scrollBy, layout updates the main-thread scroll position synchronously. It also does a repaint synchronously. It sends the new painted area and the new scroll offset to APZ which (asynchronously from the main thread) updates the compositor scroll offset. It can also trigger additional repaint requests back to the main thread, which should generally be no-ops in this scenario. > However, kats says that if the scroll behavior is set to smooth (a CSS > property or DOM API option), then APZ would handle it. However, the scroll > commands don't use this option, and I'm pretty sure none of the many > tp5o-scroll pages which regressed has its scroll behavior set to smooth > using CSS. See my comment above. But yes I agree that it's unlikely that smooth scrolling is happening here. > So the questions are: > > 1. If APZ is not taking part during neither of these tests: > a. how come one test regressed and the other didn't? > b. how come two platforms regressed meaningfully and the other two didn't? > c. do we know why the regression occur? > > 2. If APZ does take place in the scrolls: > a/b/c same as above. > So looking at the patch, (c) is obvious: the regression occurred because the patch increased the size of the displayport that was being used in some cases, and that resulted in more area being painted. That was the intent of the patch, after all. For (b), I see two possible reasons: the most likely one is just that the patch Mason landed only makes a difference if the machine has > 4GB of RAM. If only the Linux and Windows machines satisfy this, they will be the ones affected. The other reason is that Windows and Linux don't have tiling enabled by default, but OS X does. Tiling prevents us from repainting areas we've already painted, and the larger the displayport, the more the benefit. Therefore having tiling enabled on OS X probably mitigated the regression. I'm leaning towards the RAM thing though, partly because the regressing platforms are both 64-bit rather than 32-bit which again ties in to the 4 GB threshold. I don't have an answer for (a), I'm not sure why one test would regress and the other not. > 3. Is APZ involved in these scroll tests? maybe only with some of the > tests/pages? I'm pretty sure it shouldn't, but maybe there's a bug which > makes APZ take control anyway. > All I can tell you from what information I have so far - it looks like APZ should be the playing the same role for all of these tests. (In reply to Mason Chang [:mchang] from comment #4) > Created attachment 8689849 [details] > CalculateDisplayPort call stack > > I got a call stack while running the tp5o_scrolling test case. It's > attached. This is from gdb while attached to the parent process (which I > think is the right place?). > > So it looks like even though APZ isn't directly handling the input event > from widget/gtk/nsWindow, it's still controlling the DisplayPort size. Is > that correct kats? Thanks! Yes, APZ will still be determining the displayport size. That is expected when APZ is enabled, regardless of how the scrolling is triggered.
Flags: needinfo?(bugmail.mozilla)
I did get a mac regression talos email: Regression: Mozilla-Inbound - TP5 Scroll - MacOSX 10.10.5 (e10s) - 13.3% increase --------------------------------------------------------------------------------- Previous: avg 8.185 stddev 0.152 of 12 runs up to revision 57593f27594a4f0d04960362fd2a74e3052c4c80 New : avg 9.276 stddev 0.156 of 12 runs since revision 5326ac1a435a44d335752ed727fa157273b91244 Change : +1.091 (13.3% / z=7.154) Graph : http://mzl.la/1T0e86B
Summary: 18-22% Linux 64/Win8 E10S tp5o_scroll regression on Mozilla-Inbound (v.45) on Nov 19, 2015 from push 5326ac1a435a44d335752ed727fa157273b91244 → 18-22% Linux 64/Win8/Mac E10S tp5o_scroll regression on Mozilla-Inbound (v.45) on Nov 19, 2015 from push 5326ac1a435a44d335752ed727fa157273b91244
Went back to the patch and noticed a bug in the implementation compared to the discussion in comment 5. We always expanded the displayport on high mem systems, when the intent was only during skating. Since the velocity here should be (0, 0), we were mistakenly expanding the displayport.
Attachment #8690152 - Flags: review?(bugmail.mozilla)
Comment on attachment 8690152 [details] [diff] [review] Fixup to only expand displayport when skating Review of attachment 8690152 [details] [diff] [review]: ----------------------------------------------------------------- Ah, good point. I forgot it was only supposed to take effect during skate.
Attachment #8690152 - Flags: review?(bugmail.mozilla) → review+
Looks like the patch is fixing it: Original: Regression: Mozilla-Inbound-Non-PGO - TP5 Scroll - Ubuntu HW 12.04 x64 (e10s) - 18.8% increase New inbound: Improvement: Mozilla-Inbound-Non-PGO - TP5 Scroll - Ubuntu HW 12.04 x64 (e10s) - 17.1% decrease
Good fix! Thanks.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
These are the final numbers I got from talos emails: Improvement: Mozilla-Inbound-Non-PGO - TP5 Scroll - Ubuntu HW 12.04 x64 (e10s) - 17.1% decrease Improvement: Mozilla-Inbound-Non-PGO - TP5 Scroll - WINNT 6.2 x64 (e10s) - 17.4% decrease Improvement: Mozilla-Inbound - TP5 Scroll - MacOSX 10.10.5 (e10s) - 13.3% decrease
verified on the graphs! Thanks for fixing this.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: