Determine Windows configuration options that reduce noise on reference laptop (windows10-64-ref-hw-2017)
Categories
(Testing :: Raptor, task, P1)
Tracking
(Not tracked)
People
(Reporter: acreskey, Assigned: acreskey)
Attachments
(2 files)
Using the reference laptop in CI and locally we see very significant variations in performance results.
This makes getting reproducible results extremely difficult.
The purpose of this bug is to collect Windows configuration options that minimize OS-induced noise.
These are the configuration options that I've been disabling locally:
• Indexing Service (file search)
• Windows Defender (default antivirus)
Assignee | ||
Comment 1•5 years ago
Denis, Mike, I know that you two have had some success in reducing noise on the 2017 reference laptop.
Can you please add any OS features that you are disabling?
Comment 2•5 years ago
In my experience, operating system updates make the system much slower because they trigger a lot of disk activity, both while they are being downloaded and for the next ~10h after they have been installed, while Windows is 'optimizing' stuff on the disk.
Not sure if your scripts already include this, but when I was trying to get numbers automatically from this hardware, I used a script that waited for CPU idle and disk idle before starting Firefox.
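A minimal sketch of the "wait for CPU idle and disk idle" idea, for illustration only. This is not the script referred to above: the function name, thresholds, and the injected sampler callables are all hypothetical. On a real machine the samplers could be backed by the third-party psutil package (for example `psutil.cpu_percent()` and the read/write counts from `psutil.disk_io_counters()`), which is an assumption and not something this bug confirms.

```python
import time
from typing import Callable


def wait_for_idle(
    cpu_percent: Callable[[], float],   # hypothetical sampler: current CPU use in percent
    disk_ops: Callable[[], int],        # hypothetical sampler: cumulative disk reads + writes
    cpu_threshold: float = 5.0,
    max_new_disk_ops: int = 0,
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until CPU and disk both look idle; return False if the timeout expires."""
    deadline = clock() + timeout
    last_ops = disk_ops()
    while clock() < deadline:
        sleep(interval)
        ops = disk_ops()
        new_ops = ops - last_ops  # disk operations seen during the last interval
        last_ops = ops
        if cpu_percent() < cpu_threshold and new_ops <= max_new_disk_ops:
            return True
    return False
```

Passing the samplers, sleep function, and clock in as arguments keeps the loop testable without touching real hardware; the caller decides what to do when the function returns False (start anyway, or fail the run).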
Comment 3•5 years ago
Windows Defender was the big one for me. After I turned that off (I used 3 different ways to do this to make sure it's never on), the machine became quite usable. Other than that, I just make sure disk usage is close to 0% before I begin my tests.
Comment 4•5 years ago
I disable the Superfetch / Prefetch stuff (now called SysMain in Services), because otherwise, I was noticing a big shift in measurement over time as Windows "learned" what I liked to run during start-up.
Comment 5•5 years ago
When I was running some tests locally, to reduce noise, I disabled Bluetooth, enabled metered network connection, and disabled Windows updates.
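As a rough illustration of how the service-level items mentioned so far could be scripted, here is a sketch that only builds the `sc` command lines and prints them by default. The service names in the table are assumptions based on common Windows 10 naming and should be checked on the target image. Windows Defender and the metered-connection setting are deliberately left out, since they are not plain services that `sc config` can reliably switch off.

```python
import subprocess

# Assumed Windows service names (not confirmed by this bug); verify before use.
NOISY_SERVICES = {
    "WSearch": "Windows Search (indexing)",
    "SysMain": "Superfetch / Prefetch",
    "wuauserv": "Windows Update",
    "bthserv": "Bluetooth Support Service",
}


def disable_commands(services=NOISY_SERVICES):
    """Return the sc.exe commands that stop each service and disable its startup."""
    cmds = []
    for name in services:
        cmds.append(["sc", "stop", name])
        # sc expects "start=" and its value as separate arguments.
        cmds.append(["sc", "config", name, "start=", "disabled"])
    return cmds


def apply(cmds, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs an elevated prompt)."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    apply(disable_commands())
```

The dry-run default is intentional: the output can be reviewed or pasted into an administrator prompt before anything is changed on the machine.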
Comment 6•5 years ago
I should note that Windows Defender on is the "default" mode that users will use systems in, so our testing should reflect that.
Comment 7•5 years ago
:jesup, the problem with leaving defaults on is that we introduce false positives/negatives into our data, which makes it harder to attribute a performance change either to a Firefox change or to OS tasks that have intermittently started and whose resource usage interacts poorly with Firefox. However, based on this, I'm thinking it might be worthwhile to look into testing interoperability throughput performance separately from application-only throughput performance.
Assignee | ||
Comment 8•5 years ago
To get an idea of where we stand, I made a fresh baseline here, on windows10-64-ref-hw-2017 and also windows10-64:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=792c62b2ad217ba4c0a49c639d1f6696eac2578b&newProject=try&newRevision=5aec96c94b123de3b4b7d2af1292c86ac20e3e01&framework=10
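For anyone scripting around these compare links, the two pushes being compared can be read straight out of the URL. The sketch below uses only the Python standard library; the helper name is hypothetical, and the parameter names are simply the ones visible in the link above.

```python
from urllib.parse import parse_qs, urlsplit


def compare_params(url):
    """Extract the query parameters that follow '#/compare?' in a compare link."""
    fragment = urlsplit(url).fragment        # e.g. "/compare?originalProject=try&..."
    _, _, query = fragment.partition("?")
    return {key: values[0] for key, values in parse_qs(query).items()}


url = (
    "https://treeherder.mozilla.org/perf.html#/compare?originalProject=try"
    "&originalRevision=792c62b2ad217ba4c0a49c639d1f6696eac2578b"
    "&newProject=try&newRevision=5aec96c94b123de3b4b7d2af1292c86ac20e3e01"
    "&framework=10"
)
params = compare_params(url)
print(params["originalRevision"], params["newRevision"])
```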
Assignee | ||
Comment 9•5 years ago
Noise is still a major problem on the reference laptop.
Comparing 10 runs against 10 runs of the same changeset I see large differences:
• Amazon warm load metrics off by ~10%
• Facebook loadtime off by 13%
• Netflix metrics off by ~10%
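To make the "10 runs against 10 runs" comparison concrete, here is a small sketch that computes the percent change between the medians of two batches and the relative spread within one batch. The numbers are made up for illustration and are not taken from the runs above, and this is not how Perfherder itself decides significance.

```python
from statistics import mean, median, stdev


def percent_change(base, new):
    """Percent change of the median of `new` relative to the median of `base`."""
    b, n = median(base), median(new)
    return 100.0 * (n - b) / b


def relative_spread(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical load times in ms for two batches of 10 runs of the same changeset.
batch_a = [2010, 1980, 2050, 1995, 2100, 1970, 2030, 2005, 1990, 2060]
batch_b = [2210, 2190, 2250, 2180, 2300, 2170, 2230, 2205, 2195, 2260]

print(f"median change: {percent_change(batch_a, batch_b):+.1f}%")
print(f"spread within batch_a: {relative_spread(batch_a):.1f}%")
```

When the same changeset is on both sides, any median change well above the within-batch spread points at the environment rather than the code.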
Assignee | ||
Comment 10•5 years ago
Assignee | ||
Comment 11•5 years ago
This is a particularly interesting replicates view.
Note the batch of loadtimes that come in at ~200ms while the median is about 2000ms.
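A quick way to pull such a fast batch out of a list of replicates is to cut at a fraction of the median. The sketch below is illustrative only: the helper name, the 0.5 ratio, and the sample values are all assumptions, chosen to resemble the ~200ms versus ~2000ms split described here.

```python
from statistics import median


def split_fast_batch(replicates, ratio=0.5):
    """Split replicates into (fast, typical), cutting at ratio * median."""
    cut = ratio * median(replicates)
    fast = [r for r in replicates if r < cut]
    typical = [r for r in replicates if r >= cut]
    return fast, typical


# Hypothetical replicate load times in ms.
replicates = [2000, 2100, 1950, 210, 190, 2050, 205, 1980]
fast, typical = split_fast_batch(replicates)
print(f"{len(fast)} fast replicates, {len(typical)} typical")
```

This only works while the fast replicates are the minority; if they become the majority the median itself moves and a fixed ratio stops separating the two groups.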
Assignee | ||
Comment 12•5 years ago
Ionut, can I ask for your thoughts on the noise in this comparison?
It's a changeset compared against itself on windows10-64-ref-hw-2017 and also windows10-64:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=792c62b2ad217ba4c0a49c639d1f6696eac2578b&newProject=try&newRevision=5aec96c94b123de3b4b7d2af1292c86ac20e3e01&framework=10
Perfherder is picking up two significant changes on windows10-64-ref-hw-2017.
I also see a 9% and a 10% change on raptor-tp6-imgur-firefox and raptor-tp6-outlook-firefox for windows10-64.
I don't have any experience with sheriffing, but I would imagine that all of these are problematic.
If we could solve just the issues that lead to the changes marked as "Significant/Important" by perfherder, would that get us most of the value?
Comment 13•5 years ago
(In reply to Andrew Creskey from comment #12)
Ionut, can I ask for your thoughts on the noise in this comparison?
It's a changeset compared against itself on windows10-64-ref-hw-2017 and also windows10-64:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=792c62b2ad217ba4c0a49c639d1f6696eac2578b&newProject=try&newRevision=5aec96c94b123de3b4b7d2af1292c86ac20e3e01&framework=10
Perfherder is picking up two significant changes on windows10-64-ref-hw-2017.
I also see a 9% and a 10% change on raptor-tp6-imgur-firefox and raptor-tp6-outlook-firefox for windows10-64.
I don't have any experience with sheriffing, but I would imagine that all of these are problematic.
Indeed, this is a weird situation. I actually see more changes here than those you mentioned. They vary from +/- 4% to 10%.
Seems like our Windows platform's environments aren't yet quite suited for properly running perf tests.
If we could solve just the issues that lead to the changes marked as "Significant/Important" by perfherder, would that get us most of the value?
Yes, I see this as a valuable step forward.
Assignee | ||
Comment 14•5 years ago
Thank you Ionut.
I did do some quick tests:
1. Disable OCSP and compare against same revision
This is a known source of noise on the reference laptop. I didn't think it would have any impact here because we connect to mitmproxy and thus don't use the actual site certificates.
Maybe I got lucky on the runs but this comparison doesn't show any flagged perf differences.
So this could be worth looking into.
If this did reduce noise in the test environment, we could argue for disabling OCSP in the perf profile, since OCSP itself will be replaced in the not-too-distant future.
2. Defer setTimeouts() during pageload (would otherwise run on idle)
This is another known source of bimodal behaviour.
Comparing this job against itself also gives a perfherder diff with no flagged perf differences (although there is still quite a bit of noise).
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=411785ebb7de902bc37af52759d4d4e2aef83532&newProject=try&newRevision=71cb7ba08f32eea6891e401174323e62209050bf&framework=10
If these results had been smoother, maybe we could consider a 'deterministic load' preference for the perf profile, but I'm not sure...
Assignee | ||
Comment 15•5 years ago
Back to the bug as logged.
Dave, who could explain to me the differences in the OS setup between the reference laptops in test (i.e. windows10-64-ref-hw-2017) and the devices that, to my understanding, run virtualized on AWS, such as windows10-64?
Presumably the images for windows10-64 on AWS don't allow system updates and Windows Defender to be running?
Comment 16•5 years ago
(In reply to Andrew Creskey from comment #15)
Dave, who could explain to me the differences in the OS setup between the reference laptops in test (i.e. windows10-64-ref-hw-2017) and the devices that, to my understanding, run virtualized on AWS, such as windows10-64?
Kendall: Could someone from your team help explain the differences between these platforms in automation, or point Andrew to the relevant documentation/configuration?
Comment 17•5 years ago
Mark knows the most about the ref laptops, and is familiar with AWS, redirecting NI to him.
Comment 18•5 years ago
The two most significant differences are the quality of the hardware and the Windows 10 build: 1803 for AWS and 1703 for the reference laptops. General configuration items like Windows Update and Windows Defender are disabled for both platforms.
Do we have examples of the noisy tests from the last week? If so I can start looking through papertrail logs and see if anything obvious jumps out.
Also feel free to hit me up on Slack or send a meeting invite to discuss this in further detail.
Assignee | ||
Comment 19•5 years ago
Thank you Mark.
I think raptor-tp6-netflix-firefox loadtime opt is as good as any for a noisy test example.
If you see anything in the papertrail, I would be curious.
My hunch is that the runtime is fighting with the OS for resources like the slow platter drive, but I don't know for sure.
I'll try some local testing and the script and suggestion from Comment 2 and Comment 3 (wait until the disk is quiet before starting tests) to see if that helps.
By the way, disabling OCSP is not helpful here, I was just lucky in Comment 14.
Here are two pushes with ~10 retries of the same revision compared: 4 tests flagged as significant changes (~8% to 13%)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3ebb29d9815219c462e95200016d6fadd84331dc&newProject=try&newRevision=d8fdc966ff39defbe61af3491ffde17fe6983bb8&framework=10
Comment 20•5 years ago
There was nothing significant in Papertrail.
Is this an issue that has progressively gotten worse, or have these tests always been noisy?
Assignee | ||
Comment 21•5 years ago
Thanks for looking, Mark.
As far as I know these tests have always been very noisy.
These are some related bugs:
Bug 1525017 (8 months ago)
Bug 1536090 (7 months ago)
Assignee | ||
Comment 22•5 years ago
I ran a test where I made raptor wait for idle CPU (below 3%) and disk (no new activity), as in :florian's script.
However, within the time given to wait (only 15 seconds), the device never comes close to idle:
For example,
[task 2019-10-03T15:27:53.365Z] 15:27:53 INFO - raptor-main Info: CPU use: 27.7%
[task 2019-10-03T15:27:53.365Z] 15:27:53 INFO - raptor-main Info: AJC - disk reads: 11
[task 2019-10-03T15:27:53.365Z] 15:27:53 INFO - raptor-main Info: AJC - disk writes: 8
and
[task 2019-10-03T15:25:27.378Z] 15:25:27 INFO - raptor-main Info: CPU use: 5.8%
[task 2019-10-03T15:25:27.378Z] 15:25:27 INFO - raptor-main Info: AJC - disk reads: 321
[task 2019-10-03T15:25:27.378Z] 15:25:27 INFO - raptor-main Info: AJC - disk writes: 4
I'll try to relax the conditions and give it a bit more time to wait.
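To check many such samples at once, the log lines can be parsed and tested against the idle criteria. The sketch below assumes the line format shown in the excerpts above (including the ad hoc "AJC" debug lines) and treats the disk numbers as new activity per sample; the function names and defaults are hypothetical.

```python
import re
from collections import defaultdict

# Matches the task timestamp and one of the three metrics seen in the log excerpts.
LINE_RE = re.compile(
    r"\[task (?P<ts>[^\]]+)\].*?(?P<key>CPU use|disk reads|disk writes): (?P<val>[\d.]+)"
)


def parse_samples(lines):
    """Group metrics by task timestamp: {timestamp: {metric: value}}."""
    samples = defaultdict(dict)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            samples[match["ts"]][match["key"]] = float(match["val"])
    return dict(samples)


def is_idle(sample, cpu_threshold=3.0, max_disk_ops=0):
    """True when CPU is below the threshold and disk activity is within the limit.

    Missing metrics count as not idle.
    """
    missing = float("inf")
    disk = sample.get("disk reads", missing) + sample.get("disk writes", missing)
    return sample.get("CPU use", missing) < cpu_threshold and disk <= max_disk_ops


log = [
    "[task 2019-10-03T15:27:53.365Z] 15:27:53 INFO - raptor-main Info: CPU use: 27.7%",
    "[task 2019-10-03T15:27:53.365Z] 15:27:53 INFO - raptor-main Info: AJC - disk reads: 11",
    "[task 2019-10-03T15:27:53.365Z] 15:27:53 INFO - raptor-main Info: AJC - disk writes: 8",
]
for timestamp, sample in parse_samples(log).items():
    print(timestamp, "idle" if is_idle(sample) else "busy")
```

Relaxing the conditions then amounts to raising `cpu_threshold` and `max_disk_ops` and seeing how many samples would have passed.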
Assignee | ||
Comment 23•5 years ago
Mark, I forgot to ask you -- can you tell me if the Windows Indexing Service is disabled on these configurations?
Assignee | ||
Comment 24•5 years ago
I'll investigate this further, but I was able to get the reference laptop to be roughly idle before starting the pageload tests.
I reduced the raptor post_startup_delay from 30 seconds to 1 second and instead made the runner wait for <5% CPU usage and only a handful of disk reads/writes.
The wait for near idle seems to take between 25 and 45 seconds.
I'm now bumping into test timeouts, but it could be an error in how I've set this up.
Comment 25•5 years ago
(In reply to Andrew Creskey from comment #23)
Mark, I forgot to ask you -- can you tell me if the Windows Indexing Service is disabled on these configurations?
It is disabled.
Assignee | ||
Comment 26•5 years ago
I've spun off Bug 1589356 based on Florian's script comment 2 - waiting for the OS to be idle before Raptor starts a test (warm or cold load).
Early results are promising, at least on the other desktop hardware.
Assignee | ||
Comment 27•5 years ago
Mark, I think the last question -- can you tell me if the Windows Superfetch / Prefetch is disabled, as described in comment 4.
This could, at least theoretically, introduce some irregularities into the page load tests.
Comment 28•5 years ago
(In reply to Andrew Creskey from comment #27)
Mark, I think the last question -- can you tell me if the Windows Superfetch / Prefetch is disabled, as described in comment 4.
This could, at least theoretically, introduce some irregularities into the page load tests.
Currently those services are not explicitly disabled. I have asked Bitbar to check one of the laptops to see if the services are running or not.
Comment 29•5 years ago
Bitbar verified that Superfetch was running.
Assignee | ||
Comment 30•5 years ago
Thank you Mark.
Let me ask around for input on this.
Disabling Superfetch could give us more reliable results (again reducing 'realism' in the same way that Windows Update, Windows Defender, and Windows Indexing Service are disabled).
Assignee | ||
Comment 31•5 years ago
The view of the performance team was that Superfetch should not impact pageload performance.
And it turns out that :denispal had done tests that confirm this.
So I'll close this bug -- it doesn't look like there's anything to be done here.
Comment 32•5 years ago
FWIW, Superfetch / Prefetch would definitely impact startup tests. Is that a consideration here, or is this bug strictly about page load?
Assignee | ||
Comment 33•5 years ago
I did create this bug to try and reduce the page load noise but if it could help other tests, that would be great.
Specifically the target was the Windows configuration used in CI (AWS and Bitbar devices).
I know next to nothing about the startup tests -- are they running on AWS and or on the reference laptop in automation?
Comment 34•5 years ago
(In reply to Andrew Creskey from comment #33)
I know next to nothing about the startup tests -- are they running on AWS and or on the reference laptop in automation?
They will eventually be running on the reference laptop in automation.
Assignee | ||
Comment 35•5 years ago
(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #34)
(In reply to Andrew Creskey from comment #33)
I know next to nothing about the startup tests -- are they running on AWS and or on the reference laptop in automation?
They will eventually be running on the reference laptop in automation.
Interesting.
Then let's flip this around: is there any reason to not disable SuperFetch/Sysmain in the automation Windows configurations?
If we're favouring reproducible results in general then I think this can't hurt.
If startup tests are coming to automation then I think this is absolutely necessary.
Mark, I'm leaning on you again for thoughts, next steps?
Comment 36•5 years ago
The start up testing is going to be a very small pool separate from other reference laptops.
I can set up a laptop with a testing workerType in automation, and have Superfetch disabled on that laptop. We will then be able to push tests to it with changes similar to https://hg.mozilla.org/try/rev/c7c581111bdf320defefe476560897c7c810d62e . It is such a small pool of nodes that we would have to stick to one or two testing nodes.
Assignee | ||
Comment 37•5 years ago
Thanks again Mark.
Given that there is already work planned for setting up the separate startup testing pool, I don't see anything else to do here.