run expected-[fail|crash|timeout] tests once/day as tier-2 instead of all the time
Categories
(Testing :: General, task, P3)
Tracking
(firefox77 fixed)
| | Tracking | Status |
|---|---|---|
| firefox77 | --- | fixed |
People
(Reporter: jmaher, Assigned: jmaher)
References
(Depends on 8 open bugs, Blocks 1 open bug)
Details
22% of our overall web-platform-test execution time is spent on tests which are not expected to pass. Most likely the tests are not well written, or Firefox doesn't fully support what they test. Either way, we run these per commit, for all try pushes, and as tier-1. I would like to run these as tier-2 and only once/day on mozilla-central.
We could do this for xpcshell/mochitest as well; let's start with wpt, as the runtime savings will be significant.
Some things to solve:
- wpt-update scripts will need to account for this (ensure jobs run on try, etc.)
- write better code than this: https://hg.mozilla.org/try/rev/e39120a4e76eeb3b5140fec372fd54ed33f6ba09
- currently Taskcluster scheduling with SETA skips many wpt tests when they are updated; we should add a scheduling rule so that when wpt manifests change, we run all wpt tests (tier-1?); see the sketch after this list
- we should quantify how many regressions we would miss per commit and only find in the new once/day scenario (ignoring wpt updates).
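To make the manifest-change rule in the third item concrete, here is a minimal sketch assuming the standard in-tree wpt paths; the function name and the surrounding taskgraph wiring are hypothetical, not the real Gecko scheduling code:

```python
# Hypothetical sketch: if a push touches wpt tests or metadata, schedule
# the full wpt suite instead of the SETA-reduced set. Only the path check
# is shown; hooking this into the taskgraph optimizer is omitted.
WPT_PREFIXES = (
    "testing/web-platform/meta/",
    "testing/web-platform/tests/",
)

def should_run_all_wpt(changed_files):
    """Return True when any changed file affects wpt tests or metadata."""
    return any(path.startswith(WPT_PREFIXES) for path in changed_files)

print(should_run_all_wpt(["testing/web-platform/meta/fetch/cors.html.ini"]))  # True
print(should_run_all_wpt(["dom/base/Document.cpp"]))  # False
```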
Assignee
Comment 1•5 years ago
Some data for item #4 above:
In 2019-H1, I found references to 102 regressions where wpt tests are the only annotated items. Of those, 80 were resolved by editing test cases, manifests, or tooling/infrastructure.
That leaves 22 regressions that had a product fix or an unknown fix (not straightforward to figure out):
fixed revision, type of fix, bug number
c908cdfffe30, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509575
008355f71ca0, unknown,
c7085705cd57, unknown,
2e610b19a641, unknown,
31b3a7b0f085, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1524648
36ae1038bf8c, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
caab6cb81820, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
903ca27a7aed, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521476
31bb66676827, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529117
3a2ced4fbd98, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529831
66a4a5cb3fc7, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525245
6b8e4909d303, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521923
ede970649f87, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1531294
1df7ae29b121, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1535507
5d179e192915, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1538969
4d450ab98f58, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1539006
45bf2b32a377, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525640
29be14061be5, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1545751
e8766f96041a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1516722
5179e588a94f, product,
0a1d0653490a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1554244
ff649d101ade, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1561773
For these 22 regressions we should see which test failed and whether that test had any expectations in the manifest other than pass/ok. I suspect that list will be short. Knowing what types of expectations we have, and their counts, will help determine how many regressions would be found much more slowly; if the numbers are high, or concentrated on a single type of failure, we could adjust the goals of this bug.
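For the audit suggested above, a rough sketch of how one might scan the in-tree metadata for non-pass expectations. This is a plain text scan under assumed paths, not the real wpt metadata parser (wpt .ini files allow conditional values, which this only catches roughly):

```python
import os
import re

# Match expectation lines such as "expected: FAIL" or "expected: [TIMEOUT]".
NON_PASS = re.compile(r"expected:\s*\[?\s*(FAIL|ERROR|TIMEOUT|CRASH)")

def scan_metadata(meta_root="testing/web-platform/meta"):
    """Yield (path, line number, line) for each non-pass expectation found."""
    for dirpath, _dirs, files in os.walk(meta_root):
        for name in files:
            if not name.endswith(".ini"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                for n, line in enumerate(f, 1):
                    if NON_PASS.search(line):
                        yield path, n, line.strip()

for path, n, line in scan_metadata():
    print(f"{path}:{n}: {line}")
```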
Comment 2•5 years ago
My rough calculations show this would save $100k/yr going forward. Definitely worth prioritizing.
Comment 3•5 years ago
I have some concerns with the approach suggested here:
- The wpt sync relies on having complete results on central in order to detect differences introduced by PRs. Not running all tests on every central commit would break that.
- The sync also relies on try running all tests, and I think that developers do too. Determining whether failing tests are in fact fixed by a patch is one of the most important functions of running wpt through try, so it shouldn't require extra effort to get that correct.
- The suggested semantics ("don't run tests that aren't expected to pass") don't make much sense in detail. There is a difference between reftests and test types with subtests. Reftests have a single status that they report to the top level, whereas for other tests the top-level status is only whether the test completed without an uncaught exception, timeout, or crash. This means we would suppress running failing layout tests, but continue to run testharness tests in which all the subtests fail, even though the former are most likely Gecko bugs but the latter may be features we don't intend to implement.
- It's also possible for a test to have a status of ERROR or TIMEOUT but a number of stably passing subtests. They may also show leaks, asserts, or crashes. So the proposed semantics also don't correspond to only running tests that are able to show regressions.
- I worry about the messaging associated with labelling failing tests "Tier 2". The failing tests, particularly the ones that are passing in Chrome and Safari, are high value web-platform-tests because they are the ones that are identifying issues that we should triage and prioritise for fixing. Although logically we can still do that if they run less frequently, I think we already have a tendency to view tests we aren't already passing as "probably low value" without actually checking that assumption, and this plays into that narrative.
So at the very least we need to always run all wpt tests on central and try and only change things for integration branches.
However I think we should consider different approaches here. Two classes of tests were identified as being "low value" in comment #0:
- Not well written
- Not supported in Firefox
For the first case I think we should improve the tests. There's no sense in having tests that are buggy running in any circumstance (on central, in the upstream CI, etc.). This presumably isn't like intermittents where the issues are often subtle race conditions, so it shouldn't be nearly as hard to work through the backlog.
The tests that aren't supported in Firefox break down into two subsets:
- Features that we intend to support but where we have implementation issues
- Features we don't plan to support at this time
For features we intend to support, the tests are high value in the sense that they're telling us about bugs in our implementation. For all the reasons given above I'd like to keep running these at tier 1. For features we simply don't intend to support, the overall status of the tests isn't so important; a well written WebUSB test might have a status of OK in Firefox, and so run under the above proposal, but be useless to us unless we plan to actually implement that feature. But we could add metadata to mark those directories as tier-2 and only run them on central to check for leaks, crashes, etc. This of course requires some manual work to label the tier-2 directories, but it seems like a useful exercise to record the choices that we've made, and the data can be used for other purposes (e.g. not autogenerating bugs corresponding to changes in those directories).
Assignee
Comment 4•5 years ago
I keep seeing wpt-sync as a reason for this being not ideal. Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?
We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:
- this test specifically has been finding regressions in our code
- we have developers actively working on a new feature which will depend on these tests
- why we cannot wait 24 hours for a nightly build to find the failure and backfill
If we need to help figure out wpt-sync changes before deploying this change, then that should be part of the scope of this bug. Assuming we are running tests at a much lower rate, having a secondary try push seems reasonable from a cost/machine-usage standpoint.
Comment 5•5 years ago
> Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

Yes, probably, but that's additional complexity that we don't have today.

> We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:
Given that "expected pass" as defined here doesn't map logically onto the semantics you want, I don't think that's a reasonable position to take.
I have proposed an alternative solution (deprioritise tests for features we've made a business decision not to support). I think it's reasonable to have a discussion about the different tradeoffs, but I don't think it's reasonable to start from the assumption that the proposal in comment #0 is the correct solution.
Comment 6•5 years ago
Yesterday Joel, James, and I met about this bug. As James pointed out in comment 3, we can't just switch off everything that is expected to fail, since doing so would create a lot of work across all of the engineering productivity tooling.
We wrote out two possible scenarios in https://docs.google.com/document/d/1IaHpE0WVlK_ME_dXqSqrlzOCciHBjGwWFGHFwHPkDwY/edit?usp=sharing
After the meeting, the plan is to:
- add more expectation data describing features we will never implement
- add more expectation data for features we will implement but that aren't on a roadmap
We will have those tests turned off for now.
The cost of updating the tooling and processes, and of re-running jobs to find regressions, would offset the savings from just turning off the tests wholesale.
Comment 7•5 years ago
The next step here is to identify the parts of the wpt suite that are for features we aren't going to implement or don't have plans to implement at this time. I'll put together a spreadsheet with the list of test directories and the results summary for each, and arrange for it to be sent around so that the relevant code owners and people in product/engineering can fill in the implementation status. After that the steps are:
- Add support for implementation status in the expectation metadata files.
- Allow wptrunner to skip tests matching a given status using a command line option like `--skip-implementation-status=not-implementing,backlog` (or whatever the names we use are); see the sketch after this list.
- Update the TC tasks (and mozharness, perhaps) to pass the relevant command line options to the jobs.
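As a sketch of what that skip logic might look like, filtering a test list by implementation status could work along these lines; the data structures here are assumptions for illustration, not wptrunner's real internals:

```python
# Drop tests whose directory metadata carries a skipped implementation
# status; tests with no status recorded are treated as "implementing".
def filter_by_implementation_status(tests, skip_statuses):
    """tests: iterable of (test_id, status-or-None) pairs."""
    skip = set(skip_statuses)
    return [
        test_id
        for test_id, status in tests
        if (status or "implementing") not in skip
    ]

tests = [
    ("/webusb/usb.https.any.html", "not-implementing"),
    ("/web-nfc/scan.https.html", "backlog"),
    ("/fetch/api/basic/request.any.html", None),
]
print(filter_by_implementation_status(tests, {"not-implementing", "backlog"}))
# -> ['/fetch/api/basic/request.any.html']
```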
Comment 8•5 years ago
Looking at https://docs.google.com/spreadsheets/d/1YOlTYJ6cXtXBcdmPVRm0zH6xP6KEJ08q4BrXg0NedTM/edit#gid=0 we have some candidates for areas where we should not be running tests.
Assignee
Comment 9•5 years ago
In bug 1595515 we landed changes to stop running 3 specific directories of tests for features we have not implemented.
I have done a fresh query for the last week: 23.4% of runtime is spent on tests expected to TIMEOUT (18%) or ERROR (5.4%). Looking at our old data for tests that time out, from May 2019:
https://docs.google.com/spreadsheets/d/1N4ix5ZDMSgxzCtMJ5BDGVr6RygeoGnjkko4l9gj_xpY/edit#gid=1212266288
March 2020: 430 test names
May 2019: 353 test names
Of the two sets, 285 are the same tests; a minimal sketch of this overlap check is below.
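The overlap number can be reproduced with a trivial set intersection, assuming the two spreadsheets are exported as one test name per line (the file names here are hypothetical):

```python
def load_names(path):
    """Read a set of test names, one per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

march_2020 = load_names("timeouts-2020-03.txt")  # 430 names
may_2019 = load_names("timeouts-2019-05.txt")    # 353 names
print(len(march_2020 & may_2019))                # 285, per the counts above
```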
A lot of time has elapsed in which we could have fixed these issues, and in bug 1595515 it was hoped that a large chunk of our timeout tests would be disabled as a result; unfortunately, that isn't the case.
:jgraham, should we revisit the tests we skip for no implementation, or should we do something different for expected-timeout bugs?
Comment 10•5 years ago
The plan is still to run things that we don't have plans to implement in the near future less often (the existing change only applies to the much smaller subset of things we regard as harmful to the web and plan to never implement). But doing that requires knowing which features those are. I talked to mconca about providing an initial cut of the data we need here (i.e. filling in the initial information in [1]), but I'm not sure what his timeline is there.
Once we have that data the infrastructure added in bug 1595515 can be applied to a much wider range of directories.
Comment 11•5 years ago
(In reply to James Graham [:jgraham] from comment #10)
> I talked to mconca about providing an initial cut of the data we need here (i.e. filling in the initial information in [1]), but I'm not sure what his timeline is there.
I just asked a few people to look at this. Let's assess the state of that spreadsheet in another two weeks.
Assignee
Comment 12•5 years ago
I see on the spreadsheet all features have a status:
261 - total
90 - unknown
4 - not implemented
4 - not scheduled
2 - not a spec
It sounds like we still have a lot of unknown data to resolve; what will it take to do that? From an outsider's perspective, if we don't know the status, we probably don't get a lot of value from those tests and should be running them once/day (vs. not scheduled).
Comment 13•5 years ago
Unknown is literally the holes in the data from my pass at filling it in. It means "I don't know for sure", not "Mozilla don't know"; lots of the unknowns are probably features we already implement.
Assignee
Comment 14•5 years ago
How do we get answers to that?
Comment 15•5 years ago
That's the question that I'm hoping mconca and colleagues can answer.
Assignee
Comment 16•5 years ago
:mconca, it has been a couple of weeks and I wanted to check in with you. I assumed you were going to fill out the spreadsheet, or twist arms to get it filled out. Right now there are a lot of unknown areas.
Comment 17•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #16)
> :mconca, it has been a couple of weeks and I wanted to check in with you. I assumed you were going to fill out the spreadsheet, or twist arms to get it filled out. Right now there are a lot of unknown areas.
Looks like a bit more of it got filled out, but not significantly. Let me ping more folks (leaving my NI open for now).
Assignee
Comment 18•5 years ago
I see a lot of updates:
10 unknown
8 not implementing
3 not a spec
52 not scheduled
Thanks for the work so far; I assume the last 10 unknowns can be figured out in the coming week.
Comment 19•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #18)
> I see a lot of updates:
> 10 unknown
> 8 not implementing
> 3 not a spec
> 52 not scheduled
> Thanks for the work so far; I assume the last 10 unknowns can be figured out in the coming week.
Much more progress, I agree. I'll try to get the final ones figured out this week.
Comment 20•5 years ago
(In reply to Mike Conca [:mconca] from comment #19)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #18)
> > I see a lot of updates:
> > 10 unknown
> > 8 not implementing
> > 3 not a spec
> > 52 not scheduled
> > Thanks for the work so far; I assume the last 10 unknowns can be figured out in the coming week.
> Much more progress, I agree. I'll try to get the final ones figured out this week.
Done. The list should be complete. Thank you for your patience.
Once we start implementing a new roadmap item, what is the process for moving a WPT test suite from tier 2 to tier 1?
Assignee
Comment 21•5 years ago
First off, thanks for completing this list. The intention here is to run these "not implemented" tests as tier-2, which means they will run on mozilla-central pushes only, not on beta, autoland, or try by default.
These will have a single line in the meta/<dirname>/dir.ini which records the implementation status [1]. Since this is in-tree, when the feature is being worked on (maybe even just scheduled to be worked on in the short term), the implementation status line could be removed from the dir.ini file and the set of tests would run as tier-1 (everywhere).
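A minimal sketch of such a file, following the path convention above; the directory name and the exact key/value spellings are illustrative assumptions:

```ini
; meta/webusb/dir.ini (hypothetical directory)
; Removing this line returns the directory's tests to tier-1 everywhere.
implementation-status: not-implementing
```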
:mconca, does that work for you? I am happy to adjust things if needed
Comment 22•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)
> First off, thanks for completing this list. The intention here is to run these "not implemented" tests as tier-2, which means they will run on mozilla-central pushes only, not on beta, autoland, or try by default.
> These will have a single line in the meta/<dirname>/dir.ini which records the implementation status [1]. Since this is in-tree, when the feature is being worked on (maybe even just scheduled to be worked on in the short term), the implementation status line could be removed from the dir.ini file and the set of tests would run as tier-1 (everywhere).
> :mconca, does that work for you? I am happy to adjust things if needed
Sounds good. Thanks.
Comment 28•5 years ago
We discussed some of the issues around this on Matrix today, and decided that at the least we should stop running tests that are expected to time out on autoland. Ideally we would do the same on try, if we had an easy mechanism to turn them back on for testing.
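In practice this would mean passing a skip flag through the autoland task definitions; a hedged sketch, assuming wptrunner exposes an option along the lines of `--skip-timeout` (the flag name and plumbing here are assumptions):

```
./mach wpt --skip-timeout    # hypothetical invocation: skip tests expected to TIMEOUT
```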