Closed
Bug 1406878
Opened 7 years ago
Closed 7 years ago
17.32 - 25.65% perf_reftest_singletons (linux64, osx-10-10, windows10-64, windows7-32) regression on push 3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b (Sat Oct 7 2017)
Categories
(Core :: Layout: Text and Fonts, defect)
Tracking
RESOLVED
WONTFIX
| | Tracking | Status |
|---|---|---|
| firefox58 | --- | affected |
People
(Reporter: igoldan, Unassigned)
Details
(Keywords: perf, regression, talos-regression)
Talos has detected a Firefox performance regression from push:
https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b
Since you authored one of the patches included in that push, we need your help to address this regression.
Regressions:
26% perf_reftest_singletons summary linux64 opt e10s 24.68 -> 31.02
25% perf_reftest_singletons summary linux64 pgo e10s 23.07 -> 28.92
23% perf_reftest_singletons summary windows7-32 opt e10s 25.86 -> 31.67
19% perf_reftest_singletons summary windows7-32 pgo e10s 22.54 -> 26.86
19% perf_reftest_singletons summary windows10-64 pgo e10s 23.45 -> 27.92
19% perf_reftest_singletons summary windows10-64 opt e10s 25.71 -> 30.60
17% perf_reftest_singletons summary osx-10-10 opt e10s 25.44 -> 29.85
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=9869
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Reporter
Updated•7 years ago
Component: Untriaged → Layout: Text
Product: Firefox → Core
Comment 1•7 years ago
Isn't that simply because a new test is added? I think this is an INVALID or WONTFIX.
Comment 2•7 years ago
I think for perf_reftest_singletons, the subtests should be tracked separately rather than bundled together like this. The combined summary is both useless and misleading, and it makes it hard to catch real regressions.
Reporter
Comment 3•7 years ago
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #1)
> Isn't that simply because a new test is added? I think this is an INVALID or
> WONTFIX.
I was just about to ask for that. I'm marking this as WONTFIX then.
Reporter
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Reporter
Comment 4•7 years ago
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #2)
> I think for perf_reftest_singletons, the subtests should be tracked
> separately, rather than bundling together like this. This is both useless
> and misleading, and hard to catch real regressions.
Thanks for this suggestion. I agree and will stick to it.
Comment 5•7 years ago
:xidorn, we discussed this, and it would be too much noise to track individual test results. The summary gives us some signal, and we use a geometric mean, which catches most of the real sustained regressions.
Comment 6•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> :xidorn, we discussed this, and it would be too much noise to track
> individual test results- this gives us some signal and we use a geometric
> mean which catches most of the real sustained regressions.
I don't quite understand. Are you calculating the geometric mean of all subtests and then using the percentage difference of that? That doesn't make sense, because we currently have 16 subtests, which means a single test can regress by up to 17% (1.01^16 - 1) without triggering a 1% regression alert. Maybe you could raise the result difference ratio to the power of the number of subtests; that probably makes more sense.
I also wonder why it would be too much noise. Perhaps the several very quick tests vary significantly?
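To make the dilution concrete, here is a rough sketch (hypothetical subtest times, and assuming the summary really is a plain geometric mean of the 16 subtests, not the exact Perfherder calculation):

# Sketch only: how a geometric-mean summary dilutes a regression in one
# of 16 subtests. Times are hypothetical, not real perf_reftest numbers.
from statistics import geometric_mean

baseline  = [20.0] * 16                    # 16 subtests, ~20 ms each
regressed = [20.0 * 1.17] + [20.0] * 15    # one subtest regresses by 17%

change = geometric_mean(regressed) / geometric_mean(baseline) - 1
print(f"summary change: {change:.2%}")     # ~0.99%, still below a 1% threshold

So a 17% regression in a single subtest barely moves the summary.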
Comment 7•7 years ago
noise == pestering developers more frequently and a lack of sheriffing resources. We track 900 tests so far; it is not sustainable for Mozilla to add 240 more tests to track.
here is an example of adding 1 test to perf_reftests which generated a 25% regression alert:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=52748bb525f2a7aac2d82647a6d41b16c873a245&newProject=autoland&newRevision=3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b&originalSignature=d816936ecd2474b13579e9e9426c4e92c0c4d3a7&newSignature=d816936ecd2474b13579e9e9426c4e92c0c4d3a7&framework=1
here we added 2 new tests and saw a change:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=f5a42ccee0a4bab53f84ddc26ce50515dc3b8f58&newProject=autoland&newRevision=f781d6ffe5e42736e98a254b6a40674136cbb1a2&originalSignature=713420b13030f329dc214532301155532a147631&newSignature=713420b13030f329dc214532301155532a147631&framework=1
this one shows a 69% increase in perfherder:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=b90b6a1fc26b307fff950ca3d39617995eebecea&newProject=autoland&newRevision=a6ffff95554772e0c69c050e6d3b6fb48ce3ec17&originalSignature=a29feea25c232ff228e407c84c80ebfb7e07220e&newSignature=a29feea25c232ff228e407c84c80ebfb7e07220e&framework=1
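For illustration, a similar sketch (again with hypothetical numbers) of how merely adding one new, slower subtest shifts a geometric-mean summary even though no existing test regressed:

# Sketch only: a new subtest changes the geometric-mean summary with no
# real regression. Times are hypothetical, not taken from the links above.
from statistics import geometric_mean

before = [20.0] * 15            # 15 existing subtests, ~20 ms each
after  = before + [800.0]       # one new, much slower subtest lands

change = geometric_mean(after) / geometric_mean(before) - 1
print(f"summary change: {change:.2%}")   # ~26% "increase" from the new test alone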
Comment 8•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #7)
> noise == pestering developers more frequently and a lack of sheriffing
> resources. We track 900 tests so far; it is not sustainable for Mozilla to
> add 240 more tests to track.
I think you misunderstood me. I said that subtests should be tracked separately specifically for perf_reftest_singletons, so that's 16 more, not 240.
Other talos tests are basically general performance tests, and we care about them as a whole. But each single test in perf_reftest_singletons pretty much exercises a very specific optimization, and it doesn't make much sense to mix them together.
I'm not sure I understand the additional burden of finer-grained tracking here. I suppose alerts are triggered by the CI directly, and perf sheriffs would track them in some dashboard? I guess we could have a script scan the history of the perf_reftest_singletons subtests to see whether they are really noisy and would add significantly more tracking work for perf sheriffs. I suspect they are not.
And given that a single test may regress by 17% without even triggering a 1% alert at the moment, I think as a compromise we could set a larger tolerance range for those subtests so that they alert less often.
> here is an example of adding 1 test to perf_reftests which generated a 25%
> regression alert:
> https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=52748bb525f2a7aac2d82647a6d41b16c873a245&newProject=autoland&newRevision=3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b&originalSignature=d816936ecd2474b13579e9e9426c4e92c0c4d3a7&newSignature=d816936ecd2474b13579e9e9426c4e92c0c4d3a7&framework=1
This is a pretty good example of why the current approach is problematic for perf_reftest_singletons.
There isn't any real regression here; it's just that a new test was added. This kind of annoyance will happen whenever someone adds a new subtest, which is expected to happen more often in the future as we make more optimizations.
And having perf sheriffs file regression alerts for this kind of thing is a waste of both sheriffing resources and developer time.
Comment 9•7 years ago
we have 15 for perf_reftest + 15 for singletons so 30 additional data points. We run these on:
* linux64-stylo, linux64-stylo-disabled
* macosx-stylo, macosx-stylo-disabled
* win7-stylo, win7-stylo-disabled
* win10-stylo, win10-stylo-disabled
that is 8 configurations * 30 == 240 new tests to track.