Closed Bug 1572820 Opened 5 years ago Closed 4 years ago

run expected-[fail|crash|timeout] tests once/day as tier-2 instead of all the time

Categories

(Testing :: General, task, P3)

Version: 3

Tracking

(firefox77 fixed)

RESOLVED FIXED
Target Milestone: mozilla77

People

(Reporter: jmaher, Assigned: jmaher)

References

(Depends on 8 open bugs, Blocks 1 open bug)

Details

22% of our overall web-platform-test execution time is spent on tests which are not expected to pass. Most likely the tests are not well written, or Firefox doesn't fully support the feature under test. Either way, we run these per commit, on all try pushes, and as tier-1. I would like to run these as tier-2 and only once/day on mozilla-central.

We could do this for xpcshell/mochitest as well; let's start with wpt, since the runtime savings will be significant.

Some things to solve:

  1. wpt-update scripts will need to account for this (ensure jobs run on try, etc.)
  2. write better code than this: https://hg.mozilla.org/try/rev/e39120a4e76eeb3b5140fec372fd54ed33f6ba09
  3. currently, taskcluster scheduling with SETA skips many wpt tests when they are updated; we should add a scheduling rule so that when wpt manifests are changed we run all wpt jobs (tier-1?) (see the sketch after this list)
  4. we should quantify how many regressions we would miss per commit and only find in the new once/day scenario (ignoring wpt updates).
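
A sketch of the scheduling rule in item 3: the logic could be as simple as forcing the full wpt suite whenever a push touches wpt manifests or expectation metadata. The helper and paths below are illustrative, not the real taskgraph/SETA API:

    # Illustrative sketch only; these names are hypothetical, not the real
    # taskgraph/SETA API. The rule: if a push touches wpt manifests or
    # expectation metadata, schedule the full wpt suite rather than
    # letting SETA skip jobs.

    WPT_METADATA_PREFIXES = (
        "testing/web-platform/meta/",
        "testing/web-platform/tests/",
    )

    def should_run_all_wpt(changed_files):
        """Return True if this push should schedule every wpt job."""
        return any(f.startswith(WPT_METADATA_PREFIXES) for f in changed_files)

    # A push that only updates expectation metadata still runs all wpt:
    assert should_run_all_wpt(["testing/web-platform/meta/fetch/api/__dir__.ini"])
    assert not should_run_all_wpt(["browser/base/content/browser.js"])
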
Blocks: 1573872

Some data for item #4 above:

In 2019 H1, I found references to 102 regressions where wpt tests were the only annotated items. Of those, 80 were resolved by editing test cases, manifests, or tooling/infrastructure.

That leaves 22 regressions that had a product fix or an unknown fix (not straightforward to figure out):
fixed revision, type of fix, bug number
c908cdfffe30, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509575
008355f71ca0, unknown,
c7085705cd57, unknown,
2e610b19a641, unknown,
31b3a7b0f085, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1524648
36ae1038bf8c, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
caab6cb81820, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
903ca27a7aed, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521476
31bb66676827, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529117
3a2ced4fbd98, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529831
66a4a5cb3fc7, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525245
6b8e4909d303, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521923
ede970649f87, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1531294
1df7ae29b121, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1535507
5d179e192915, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1538969
4d450ab98f58, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1539006
45bf2b32a377, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525640
29be14061be5, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1545751
e8766f96041a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1516722
5179e588a94f, product,
0a1d0653490a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1554244
ff649d101ade, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1561773

For these 22 regressions we should check which test failed and whether that test had any expectation in the manifest other than PASS/OK. I suspect that list will be short. Knowing which expectation types are involved, and their counts, will help determine how many regressions would be found much more slowly; if the numbers are high, or concentrated in a single type of failure, we could adjust the goals of this bug.

Priority: -- → P3

My rough calculations show this would save $100k/yr going forward. Definitely worth prioritizing.

I have some concerns with the approach suggested here:

  • The wpt sync relies on having complete results on central in order to detect differences introduced by PRs. Not running all tests on every central commit would break that.
  • The sync also relies on try running all tests, and I think that developers do too. Determining whether failing tests are in fact fixed by a patch is one of the most important functions of running wpt through try, so it shouldn't require extra effort to get that correct.
  • The suggested semantics ("don't run tests that aren't expected to pass") don't make much sense in detail. There is a difference between reftests and test types with subtests. Reftests report a single status at the top level, whereas for other test types the top-level status only records whether the test completed without an uncaught exception, timeout, or crash. This means we would suppress failing layout tests but continue to run testharness tests in which all the subtests fail, even though the former are most likely Gecko bugs while the latter may be features we don't intend to implement.
  • It's also possible for a test to have a top-level status of ERROR or TIMEOUT but a number of stably passing subtests. Such tests may also show leaks, asserts, or crashes. So the proposed semantics don't correspond to running only the tests that are able to show regressions (see the sketch after this list).
  • I worry about the messaging associated with labelling failing tests "Tier 2". The failing tests, particularly the ones that are passing in Chrome and Safari, are high-value web-platform-tests because they identify issues that we should triage and prioritise for fixing. Although logically we can still do that if they run less frequently, I think we already have a tendency to view tests we aren't passing as "probably low value" without actually checking that assumption, and this plays into that narrative.
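
To make the semantics problem concrete, here is a small illustrative sketch (hypothetical data structures, not wptrunner internals) contrasting the proposed filter with the property we actually care about:

    # Hypothetical illustration, not wptrunner internals: why the
    # top-level expected status is a poor proxy for "cannot show a
    # regression".

    from dataclasses import dataclass, field

    @dataclass
    class Test:
        expected: str                               # top-level expected status
        subtest_expected: list = field(default_factory=list)

    def kept_by_proposal(test):
        # Comment 0's rule: only run tests expected to pass at the top level.
        return test.expected in ("PASS", "OK")

    def can_show_regression(test):
        # What matters: anything expected to pass, at the top level or in
        # any subtest, can regress (leaks/asserts/crashes aside).
        return kept_by_proposal(test) or "PASS" in test.subtest_expected

    # A testharness test that times out overall but has stably passing
    # subtests: dropped by the proposal, yet it can still catch regressions.
    t = Test(expected="TIMEOUT", subtest_expected=["PASS", "PASS", "FAIL"])
    assert not kept_by_proposal(t) and can_show_regression(t)

    # A test whose harness completes (OK) but whose subtests all fail,
    # e.g. an unimplemented feature: kept by the proposal regardless.
    u = Test(expected="OK", subtest_expected=["FAIL", "FAIL"])
    assert kept_by_proposal(u) and "PASS" not in u.subtest_expected
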

So at the very least we need to keep running all wpt tests on central and on try, and only change things for integration branches.

However, I think we should consider different approaches here. Two classes of tests were identified as "low value" in comment #0:

  • Not well written
  • Not supported in Firefox

For the first case I think we should improve the tests. There's no sense in having tests that are buggy running in any circumstance (on central, in the upstream CI, etc.). This presumably isn't like intermittents where the issues are often subtle race conditions, so it shouldn't be nearly as hard to work through the backlog.

The tests that aren't supported in Firefox break down into two subsets:

  • Features that we intend to support but where we have implementation issues
  • Features we don't plan to support at this time

For features we intend to support, the tests are high value in the sense that they're telling us about bugs in our implementation. For all the reasons given above I'd like to keep running these at tier 1. For features we simply don't intend to support, the overall status of the tests isn't so important; a well-written WebUSB test might have a status of OK in Firefox, and so run under the above proposal, but be useless to us unless we plan to actually implement that feature. But we could add metadata to mark those directories as tier-2 and only run them on central to check for leaks, crashes, etc. This of course requires some manual work to label the tier-2 directories, but it seems like a useful exercise to record the choices we've made, and the data can be used for other purposes (e.g. not autogenerating bugs corresponding to changes in those directories).
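
As a concrete (and at this point hypothetical) illustration, such a directory-level marker could be a one-line metadata file, e.g. at testing/web-platform/meta/webusb/__dir__.ini; the key name here matches the scheme that eventually landed (see the searchfox link in comment 21), while the path is illustrative:

    implementation-status: not-implementing
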

I keep seeing wpt-sync cited as a reason this is not ideal. Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:

  • has this specific test been finding regressions in our code?
  • do we have developers actively working on a new feature that will depend on these tests?
  • why can we not wait 24 hours for a nightly build to find the failure and backfill?

If we need to help figure out wpt-sync changes before deploying this change, then that should be part of the scope of this bug. Assuming we are running tests at a much lower rate, having a secondary try push seems reasonable from a cost/machine-usage standpoint.

Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

Yes, probably, but that's additional complexity that doesn't exist today.

We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:

Given that "expected pass" as defined here doesn't map logically onto the semantics you want, I don't think that's a reasonable position to take.

I have proposed an alternative solution (deprioritise tests for features we've made a business decision not to support). I think it's reasonable to have a discussion about the different tradeoffs, but I don't think it's reasonable to start from the assumption that the proposal in comment #0 is the correct solution.

Yesterday Joel, James, and I met about this bug. As James pointed out in comment 3, we can't just switch off everything that is expected to fail, as doing so would create a lot of work across the available engineering-productivity tooling.

We wrote out 2 possible scenarios in https://docs.google.com/document/d/1IaHpE0WVlK_ME_dXqSqrlzOCciHBjGwWFGHFwHPkDwY/edit?usp=sharing

After the meeting the plan is to

  • add more expectation data describing features we will never implement
  • add more expectation data for features we plan to implement but that aren't on a roadmap yet

We will have those tests turned off for now.

The cost of updating the tooling and processes, and of re-running jobs to track down regressions, would offset the savings from just turning off the tests wholesale.

The next step here is to identify the parts of the wpt suite that cover features we aren't going to implement, or don't have plans to implement at this time. I'll put together a spreadsheet with the list of test directories and the results summary for each, and arrange for it to be sent around so that the relevant code owners and people in product/engineering can fill in the implementation status. After that the steps are:

  • Add support for implementation status in the expectation metadata files.
  • Allow wptrunner to skip tests matching a given status using a command line option like `--skip-implementation-status=not-implementing,backlog` (or whatever names we end up using); see the sketch after this list.
  • Update the TC tasks (and perhaps mozharness) to pass the relevant command line options to the jobs.
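
As a sketch of what the runner-side filtering could look like (illustrative Python, not wptrunner's actual code; the option name follows the second bullet above):

    # Illustrative sketch, not wptrunner's actual implementation: drop
    # tests whose implementation status (inherited from the nearest
    # __dir__.ini) matches one of the statuses passed on the command
    # line, e.g. --skip-implementation-status=not-implementing,backlog

    from collections import namedtuple

    Test = namedtuple("Test", ["id", "implementation_status"])

    def filter_tests(tests, skip_statuses):
        skip = set(skip_statuses)
        return [t for t in tests if t.implementation_status not in skip]

    tests = [
        Test("/webusb/usb.https.any.html", "not-implementing"),
        Test("/fetch/api/basic/accept-header.any.html", "implementing"),
    ]
    # Only the fetch test survives:
    print(filter_tests(tests, ["not-implementing", "backlog"]))
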

Looking at https://docs.google.com/spreadsheets/d/1YOlTYJ6cXtXBcdmPVRm0zH6xP6KEJ08q4BrXg0NedTM/edit#gid=0, we have some candidates for areas where we should not be running tests.

In bug 1595515 we landed changes to stop running 3 specific directories of tests for features we have not implemented.

I have done a fresh query for the last week: 23.4% of runtime is spent on expected-timeout (18%) and expected-error (5.4%) tests. Looking at our old data for tests that time out, from May 2019:
https://docs.google.com/spreadsheets/d/1N4ix5ZDMSgxzCtMJ5BDGVr6RygeoGnjkko4l9gj_xpY/edit#gid=1212266288

March 2020: 430 test names
May 2019: 353 test names
Of the two sets, 285 are the same tests (roughly 80% of the May 2019 timeouts still time out).

A lot of time has elapsed in which we could have fixed issues, and in bug 1595515 it was hoped that a large chunk of our timeout tests would be disabled as a result; unfortunately, that hasn't happened.

:jgraham, should we revisit the tests we skip for lack of implementation, or should we do something different for expected-timeout tests?

Flags: needinfo?(james)

The plan is still to run tests for things we don't plan to implement in the near future less often (the existing change only applies to the much smaller subset of things we regard as harmful to the web and plan to never implement). But doing that requires knowing which features those are. I talked to mconca about providing an initial cut of the data we need here (i.e. filling in the initial information in [1]), but I'm not sure what his timeline is there.

Once we have that data the infrastructure added in bug 1595515 can be applied to a much wider range of directories.

[1] https://docs.google.com/spreadsheets/d/1xMI7akJqtLfB-CsXI1maKsP57WrAtJbSpmZrsFTZNp0/edit#gid=1132212086

Flags: needinfo?(james) → needinfo?(mconca)

(In reply to James Graham [:jgraham] from comment #10)

I talked to mconca about providing an initial cut of the data we need here (i.e. filling in the initial information in [1]), but I'm not sure what his timeline is there.

I just asked a few people to look at this. Let's assess the state of that spreadsheet in another two weeks.

Flags: needinfo?(mconca)

I see on the spreadsheet that all features have a status:
261 - total
90 - unknown
4 - not implemented
4 - not scheduled
2 - not a spec

It sounds like we still have a lot of unknown data to resolve; what will it take to do that? From an outsider's perspective, if we don't know, we probably don't get a lot of value from it, and it should be running once/day (vs. not scheduled).

Unknown literally marks the holes in the data from my pass at filling it in. It means "I don't know for sure", not "Mozilla don't know"; a lot of the unknowns are probably features we already implement.

How do we get answers to that?

That's the question that I'm hoping mconca and colleagues can answer.

:mconca, it has been a couple of weeks and I wanted to check in with you. I assumed you were going to fill out the spreadsheet, or twist arms to get it filled out. Right now there are a lot of unknown areas.

Flags: needinfo?(mconca)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #16)

:mconca, it has been a couple of weeks and I wanted to check in with you. I assumed you were going to fill out the spreadsheet, or twist arms to get it filled out. Right now there are a lot of unknown areas.

Looks like a bit more of it got filled out, but not significantly. Let me ping more folks (leaving my NI open for now).

I see a lot of updates:
10 unknown
8 not implementing
3 not a spec
52 not scheduled

Thanks for the work so far; I assume the last 10 unknowns can be figured out in the coming week.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #18)

I see a lot of updates:
10 unknown
8 not implementing
3 not a spec
52 not scheduled

Thanks for the work so far; I assume the last 10 unknowns can be figured out in the coming week.

Much more progress, I agree. I'll try to get the final ones figured out this week.

(In reply to Mike Conca [:mconca] from comment #19)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #18)

I see a lot of updates:
10 unknown
8 not implementing
3 not a spec
52 not scheduled

Thanks for the work so far; I assume the last 10 unknowns can be figured out in the coming week.

Much more progress, I agree. I'll try to get the final ones figured out this week.

Done. The list should be complete. Thank you for your patience.

Once we start implementing a new roadmap item, what is the process for moving a WPT test suite from tier 2 to tier 1?

Flags: needinfo?(mconca) → needinfo?(jmaher)

First off, thanks for completing this list. The intention here is to run these "not implemented" tests as tier-2, which means on mozilla-central pushes only: not on beta, autoland, or try by default.

These will have a single line in meta/<dirname>/__dir__.ini giving the implementation status [1]. Since this is in-tree, when the feature is being worked on (or even scheduled to be worked on in the short term), the implementation-status line can be removed from the __dir__.ini file and the set of tests will run as tier-1 (everywhere).

:mconca, does that work for you? I am happy to adjust things if needed

[1] https://searchfox.org/mozilla-central/source/testing/web-platform/meta/PeriodicBackgroundSync/__dir__.ini#1

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #21)

First off, thanks for completing this list. The intention here is to run these "not implemented" tests as tier-2, which means on mozilla-central pushes only: not on beta, autoland, or try by default.

These will have a single line in meta/<dirname>/__dir__.ini giving the implementation status [1]. Since this is in-tree, when the feature is being worked on (or even scheduled to be worked on in the short term), the implementation-status line can be removed from the __dir__.ini file and the set of tests will run as tier-1 (everywhere).

:mconca, does that work for you? I am happy to adjust things if needed

[1] https://searchfox.org/mozilla-central/source/testing/web-platform/meta/PeriodicBackgroundSync/__dir__.ini#1

Sounds good. Thanks.

Depends on: 1631410
Depends on: 1631882
Depends on: 1631922
Depends on: 1631938
Depends on: 1632079
Depends on: 1632083
Depends on: 1632086
Depends on: 1632119
Depends on: 1632122
Depends on: 1632125
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a9ed4f48957c
run wpt components we are not implementing as tier-2 on m-c only. r=jgraham
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/23179 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Assignee: nobody → jmaher
Depends on: 1632811
Upstream PR merged by moz-wptsync-bot
Depends on: 1632869
Depends on: 1632873
Depends on: 1632876
Depends on: 1634337
Depends on: 1634357
Depends on: 1634369
Depends on: 1634389
Depends on: 1634854
Depends on: 1634862

We discussed some of the issues around this on Matrix today, and decided that at the very least we should stop running tests that are expected to time out on autoland. Ideally we would do the same on try, if we had an easy mechanism to turn them back on for testing.
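
For reference, an expected timeout is recorded in a test's expectation metadata file; a minimal example of such an entry (the test name is illustrative):

    [slow-feature.html]
      expected: TIMEOUT
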

Depends on: 1635911
Depends on: 1635919
Depends on: 1635922
Depends on: 1635926
Depends on: 1636069
Depends on: 1636070
Depends on: 1636073
Depends on: 1636079
Depends on: 1636118
Depends on: 1636120
Depends on: 1636178
Depends on: 1636905
Depends on: 1639487
Depends on: 1639490
Depends on: 1639499
Depends on: 1639505
Depends on: 1639507
Depends on: 1639510
Depends on: 1639511
Depends on: 1639538
Depends on: 1639539
Depends on: 1642090
Depends on: 1642101
Depends on: 1642157
Depends on: 1642160
Depends on: 1642163
Depends on: 1642193
Depends on: 1642302