2.2 - 3.82% raptor-tp6- loadtime (windows10-64-shippable, windows10-64-shippable-qr, windows7-32-shippable) regression on push 787a82df49667019d3353c08d2c8a3f707d502cf (Sat April 4 2020)
Categories
(Testing :: Raptor, defect, P2)
Tracking
(firefox76 wontfix, firefox77 wontfix, firefox78 wontfix)
People
(Reporter: Bebe, Assigned: sparky)
References
(Regression)
Details
(Keywords: perf, perf-alert, regression)
Raptor has detected a Firefox performance regression from push:
As the author of one of the patches included in that push, we need your help to address this regression.
Regressions:
4% raptor-tp6-google-mail-firefox-cold loadtime windows10-64-shippable-qr opt 627.04 -> 651.00
4% raptor-tp6-wikipedia-firefox-cold windows10-64-shippable-qr opt 1,356.71 -> 1,404.55
3% raptor-tp6-wikipedia-firefox-cold windows10-64-shippable opt 1,305.28 -> 1,350.84
3% raptor-tp6-wikipedia-firefox-cold fcp windows10-64-shippable opt 1,279.54 -> 1,322.08
3% raptor-tp6-outlook-firefox-cold fcp windows10-64-shippable opt 212.33 -> 219.33
3% raptor-tp6-docs-firefox-cold fcp windows10-64-shippable-qr opt 999.25 -> 1,030.42
3% raptor-tp6-wikipedia-firefox-cold loadtime windows10-64-shippable opt 1,424.42 -> 1,466.92
3% raptor-tp6-docs-firefox-cold fcp windows7-32-shippable opt 1,026.58 -> 1,055.75
3% raptor-tp6-docs-firefox-cold windows7-32-shippable opt 1,088.48 -> 1,118.37
3% raptor-tp6-google-mail-firefox-cold loadtime windows7-32-shippable opt 609.17 -> 625.00
3% raptor-tp6-docs-firefox-cold windows10-64-shippable-qr opt 1,079.91 -> 1,107.85
2% raptor-tp6-slides-firefox-cold loadtime windows7-32-shippable opt 1,948.58 -> 1,991.50
Improvements:
5% raptor-tp6-outlook-firefox-cold-mozproxy-replay windows7-32-shippable opt 87.75 -> 92.25
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=25544
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.
To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/TestEngineering/Performance/Raptor
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling
Updated•5 years ago
Comment 1•5 years ago
I assume that, due to these changes, we now have a more realistic picture of the different tests. Before, the tests overlapped with each other, specifically when a new tab was opened for each individual page cycle. Now we explicitly wait. Greg, did you expect a regression or an improvement when comparing with the results we get from the Browsertime jobs?
Comment 2•5 years ago (Assignee)
If this patch did cause the regression/improvements, then the metrics we have now should become the new baseline.
Between Browsertime and Raptor, the differences in the mean vary from page to page (the error is not systematic), but, on average, the geometric mean of the metrics decreases in Browsertime. I didn't test any of these pages, so that's as much as I can say.
I find it interesting that your code caused such a significant performance change. What's funny is that the changes are not that drastic; we do nearly the same thing as before in the runner. I wonder if we should bring this up with some product team (I don't know which one).
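For reference, the cross-tool comparison above summarizes per-page metrics with a geometric mean. A minimal sketch of that computation (the function name and log-based formulation are mine, not taken from the Raptor code):

```javascript
// Illustrative only: a geometric mean over page load metrics (e.g.
// loadtime values in ms). Summing logs instead of multiplying values
// avoids floating-point overflow when many large values are combined.
function geometricMean(values) {
  const logSum = values.reduce((acc, v) => acc + Math.log(v), 0);
  return Math.exp(logSum / values.length);
}
```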
Comment 3•5 years ago
As discussed in the perftest triage meeting, we are going with the following steps:
- Check some of the affected tp6 page load tests with the Gecko Profiler enabled. Maybe a CC/GC runs during page load and negatively affects it. If that is the case, we could force a GC, or at least add another delay after opening the tab and before starting the test.
- Run try builds for each individual revision of my patch and check which of them triggered the regressions.
Comment 4•5 years ago
Given that only cold tests show this problem, it cannot be related to closing/opening tabs and possibly running garbage collections: in-between each page cycle the browser gets restarted. Also very suspicious is that it only happens for Windows shippable opt builds, and not for plain opt builds or other platforms.
I think we have to push some try builds to check which revision caused this change. I will do that tomorrow.
Comment 5•5 years ago
For the try builds to be triggered for each of the revisions we would need the following Raptor jobs to be run on Windows shippable opt:
tp6-c-5 (docs)
tp6-c-24 (wikipedia)
tp6-c-25 (google-mail)
Comment 6•5 years ago
Comment 7•5 years ago
Due to a bug in the 2nd patch of the series, the Raptor tests were broken. After applying the fix, Alexandru triggered new jobs:
Comment 8•5 years ago
(In reply to Alexandru Ionescu :alexandrui (needinfo me) from bug 1625892 comment #19)
This is the revision that introduced the regression: [raptor] Refactor webextension code for opening and closing tabs.
So this is most likely caused by removing the delay when closing/opening the tab. As a result we run through the test a lot faster, which might give the browser less time to recover and garbage-collect memory from the former page cycle.
I will push a try build shortly with the delay being re-added.
Comment 9•5 years ago
When checking the old code I saw something curious:
https://hg.mozilla.org/integration/autoland/file/d2a334ceab53916e967cbe8e2dbbbd0733956217/testing/raptor/webext/raptor/runner.js#l444
It didn't matter whether we close/open a new tab per page cycle; we always paused the execution of the next page cycle by 1000ms. This is clearly a bug, which got fixed with my commit, and I don't think we should re-introduce it.
Also, we still have the pageCycleDelay of 1000ms, which is executed BEFORE we close/open the tab. I feel that this is a problem, and that it has to be executed right before we start the cycle. That way we give Firefox time to do e.g. garbage collection for the old tab.
With both together we were waiting 2000ms per page cycle, and now it's only 1000ms.
I will move up the code for closing/opening the tab, so that we wait the 1000ms afterward. Here's the try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=837929ab284c46a9c28ca13fdbdc75ab887fc41a
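The reordering described above can be sketched as follows. pageCycleDelay mirrors the identifier in runner.js, but the async scaffolding and callback names around it are hypothetical stand-ins, not the real webextension code:

```javascript
// Sketch of the proposed ordering: recycle the tab first, then wait
// pageCycleDelay so Firefox gets idle time to garbage-collect the old
// page, and only then start measuring. openNewTab/startMeasurement are
// stand-ins for the real webextension calls.
const pageCycleDelay = 1000; // ms, as in runner.js

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function runPageCycle(openNewTab, startMeasurement) {
  await openNewTab();          // close/open the tab first
  await sleep(pageCycleDelay); // quiet time for GC of the old page
  return startMeasurement();   // timers start only after the delay
}
```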
Comment 10•5 years ago
The performance results of this try build are even worse than the current mozilla-central jobs:
There are huge decreases across different page load and benchmark tests.
Greg, I tried to understand when exactly the pageload.js background script starts its timers again for the next page cycle. Maybe I'm missing something, but I cannot spot it right now, and I don't have time for a deep dive today. Any hint from your side would be great.
I feel that runner.js should signal to the pageload.js script when a new measurement has to start, or, even better, the timing should be based on the moment when the page load request is triggered via updateTab().
Comment 11•5 years ago (Assignee)
:whimboo, can you retry your last test with more trials (at least 5-10)? We can't say anything about the results from only one trial - those issues might be outliers.
Comment 12•4 years ago
I suspect the Raptor changes in bug 1625892 also caused a 200% (!!) tp6 warm page load regression (and increased variability) on "raptor-tp6-outlook-firefox":
Setting status-firefox77=affected because these regressions are still present in 77 Nightly.
Comment 13•4 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)
(In reply to Alexandru Ionescu :alexandrui (needinfo me) from bug 1625892 comment #19)
This is the revision that introduced the regression: [raptor] Refactor webextension code for opening and closing tabs.
So this is most likely the cause from removing the delay when closing/opening the tab. As such we run a lot faster through the test, and might give the browser less time to recover and maybe garbage collect memory from the former page cycle.
The delay might not have been the issue here. Over the weekend we got a great response from Matt Woodrow via bug 1602318:
(In reply to Matt Woodrow (:mattwoodrow) from bug 1602318 comment #44)
I took a local profile of the twitch regression to try to understand it: https://bit.ly/3d1H4uI (sorry, xul symbolication didn't seem to work).
Looking at that it appears that the main document load of twitch returns, but we spend ~33ms waiting for the main-thread to become free, so that we can send it to the content process (which is roughly similar to the magnitude of the regression).
The main thread is busy in NeckoParent::RecvGetExtensionStream, and then we run a layout/paint cycle for the parent process content.
Looking at other profiles, it appears that we have this same work on all pages, but twitch has an http response time of ~50ms, which is really quick and leads to it being ready before the other work completes. My local run of instagram had a response time of 180ms, so the main-thread was idle by the time we needed it.
I think this explains the regression, and why it comes and goes, since it wouldn't take much for the scheduling to work out better.
The extension work appears to be handling calls from raptor's 'postToControlServer' (I suspect the one inside updateTab, followed by the "status" message in nextCycle, but hard to be sure).
Henrik, is this something you've looked at?
I suspect reducing (or delaying) the number of postToControlServer calls during the time we're measuring might lead to better and more consistent results.
Given that my patch actually added a couple more calls to postToControlServer, this might have caused the problem.
I will go ahead and push a patch with all of those changed to normal Raptor log calls.
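One way to act on Matt's suggestion above would be to buffer status messages while a measurement window is open and only flush them to the control server afterwards. This is a hypothetical sketch: postToControlServer is a real function in raptor's webext code, but the buffering wrapper, its names, and the `sendToServer` sink below are invented for illustration (the actual patch simply switched the calls to log calls):

```javascript
// Hypothetical sketch: keep extension IPC off the parent main thread
// while a page load is being measured by deferring "status" messages.
// sendToServer is a stand-in for raptor's real websocket send.
const pendingStatus = [];
let measuring = false;
const sentMessages = []; // stand-in sink so the sketch is self-contained

function sendToServer(kind, data) {
  sentMessages.push(`${kind}:${data}`);
}

function statusMessage(text) {
  if (measuring) {
    pendingStatus.push(text); // defer: no IPC mid-measurement
  } else {
    sendToServer("status", text);
  }
}

function startMeasurement() {
  measuring = true;
}

function endMeasurement() {
  measuring = false;
  for (const text of pendingStatus.splice(0)) {
    sendToServer("status", text); // flush deferred messages
  }
}
```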
Comment 14•4 years ago
Here's the try build, which has all the status messages to the control server replaced by simple log calls:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e31f4e625ce002ba3d366b0829e35268e547775a
Comment 15•4 years ago
Matt, could you please try with https://hg.mozilla.org/try/rev/95a75121395bed6bcb5168128aea0b6f8897bfcc and check if that makes a difference for you?
Comment 16•4 years ago
I chatted with Matt via Riot and he mentioned that there are no differences for twitch with the above try build.
As such, I updated my patch to stop logging anything to the console at all. Let's see if that helps:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c44eb9f727e72d3ca5e21ac918794b131bf23362
Comment 17•4 years ago
(In reply to Chris Peterson [:cpeterson] from comment #12)
I suspect the Raptor changes in bug 1625892 also caused a 200% (!!) tp6 warm page load regressions (and increased variability) on "raptor-tp6-outlook-firefox":
It's 100%, which means a jump from 200 to 400, and at the same time the noise was reduced a lot. But as you can see, there is another recent jump to 600, caused by bug 1624813 upgrading mitmproxy to 5.0.1. Interestingly, the alert created by Bebe shows improvements for outlook? How can that be? Maybe that's better discussed on a new alert bug. Can you please file one, Bebe?
Comment 18•4 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #17)
(In reply to Chris Peterson [:cpeterson] from comment #12)
I suspect the Raptor changes in bug 1625892 also caused a 200% (!!) tp6 warm page load regressions (and increased variability) on "raptor-tp6-outlook-firefox":
It's 100% which means a jump from 200 to 400, and at the same time reduced the noise a lot. But as you can see there is another jump recently to 600 as caused by bug 1624813 by upgrading mitmproxy to 5.0.1. Interestingly the alert as created by Bebe shows improvements for outlook? How can this be? Maybe that's better to discuss on a new alert bug. Can you please file one, Bebe?
I also wonder why this huge regression for raptor-tp6-outlook-firefox isn't reflected in the alert referenced in the first comment.
Comment 19•4 years ago (Reporter)
Opened Bug 1636048 - "Investigate performance changes after updating to mitm 5.0.1" - to investigate the performance changes after the mitmproxy 5.0.1 update.
Comment 20•4 years ago
I should have time to continue here by next week.
Comment 21•4 years ago (Assignee)
We'll accept these changes as the new baseline given how long they've been in place now.