Closed Bug 1638862 Opened 4 years ago Closed 4 years ago

mach try auto selected no tests

Categories

(Developer Infrastructure :: Try, defect, P3)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jgraham, Assigned: marco)

References

(Blocks 2 open bugs)

Details

mach try auto for a marionette change selected no tests [1]. I'd generally expect the wdspec and marionette tests to be selected for such a change.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&selectedTaskRun=RgR332_MRwmrPwdxVSeB_A-0&tier=1%2C2%2C3&revision=3821c1e560fecd26cb4c7aa959ac01889b358fb3

bugbug_disperse_low would have selected some marionette and some web-platform-tests.

My initial guess is that maybe there were very few failures in the past with the files modified in the push, and so nothing was selected.

Assignee: nobody → mcastelluccio
Status: NEW → ASSIGNED

That definitely could be true (although it might be surprising for the driver.js changes), but maybe in that case we should fall back on a less aggressive optimisation. Running try and getting no tests at all seems likely to always be a surprise.

I'm of two minds on this:

  1. Running a try push and getting no tasks selected is definitely surprising.
  2. Running tasks for no reason other than satisfying developer expectations is a waste of resources.

I think the fallback to a less aggressive optimization is a good idea in this instance because we want people to trust the tool. Longer term, though, I think part of the solution will involve a cultural shift. Maybe it's ok that non-risky patches have no try coverage before landing. The current Mozilla culture is that it's not ok.
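For what it's worth, a minimal sketch of what such a fallback might look like on the selection side (the function, thresholds and data shapes here are hypothetical, not the actual bugbug/taskgraph code):

    # Hypothetical sketch: fall back to a less aggressive selection strategy
    # when the primary one returns nothing. Names and thresholds are illustrative.
    AGGRESSIVE_CONFIDENCE = 0.9  # current behaviour: only very likely failures
    RELAXED_CONFIDENCE = 0.7     # fallback: accept lower-confidence test groups

    def select_test_groups(push, classify_test_groups):
        """Return the set of test groups to schedule for a push.

        `classify_test_groups` is assumed to map a push to
        {group_name: estimated_failure_probability}.
        """
        scores = classify_test_groups(push)
        selected = {g for g, p in scores.items() if p >= AGGRESSIVE_CONFIDENCE}
        if not selected:
            # Nothing cleared the aggressive bar; relax the threshold so the
            # developer never ends up with an empty try push.
            selected = {g for g, p in scores.items() if p >= RELAXED_CONFIDENCE}
        return selected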

Severity: -- → S3
Priority: -- → P3

In this case it definitely wasn't OK; once I ran some tests they failed. And my patch reviewer quite reasonably wanted to check that I'd run some tests before approving. So my initial reaction to this experience is "don't use mach try auto again at least for this kind of change". Which isn't what we want in the longer term.

This does suggest a worrying selection bias in the mach try auto test selection code. In modules where reviewers insist on pushing to try, or where the correct set of tests is "obvious", we expect to see fewer failures on integration branches. That means the ML will never see that these tests are catching failures, so they won't be scheduled, so people will learn not to trust the tool since it will miss the "obvious" tests to run. I'm not sure what the solution is here, but if we have to accept a higher failure rate on autoland in the short term in order to teach the classifier the correct associations, it seems unlikely that we will end up with the outcome we want in a reasonable timeframe.

Good point about the bias; we're planning to start using try pushes to help train the model as well, so that should help. Though determining what is/isn't a regression on try is a lot trickier, so the quality of the data there will still be lower than autoland's.

And yeah, as I said, I was thinking more long term about the culture stuff. For now, a less aggressive fallback is a good fix.

> Good point about the bias; we're planning to start using try pushes to help train the model as well, so that should help. Though determining what is/isn't a regression on try is a lot trickier, so the quality of the data there will still be lower than autoland's.

I almost wonder if the bias is a feature, not a bug. Modules that consistently test and have a high bar of quality would effectively be more "trusted" than modules that don't.

(This is the equivalent of a shower thought, don't take it seriously :p).

The bias could indeed be a feature (at least unless people start to get more sloppy :P).
We could train a model for try which uses data from try too and a model for autoland which exclusively uses data from autoland.

So far I haven't seen a need for this, though: a less conservative model would have caught all the failures that have been reported as missed for "mach try auto".
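A rough sketch of that split, assuming a generic scikit-learn-style classifier and stand-in data structures rather than the real bugbug training pipeline:

    # Hypothetical sketch: one failure-prediction model per branch audience.
    from sklearn.ensemble import GradientBoostingClassifier

    def train_failure_model(pushes, featurize):
        """Train a per-test-group failure classifier.

        `pushes` is assumed to be a list of dicts with a "group_outcomes"
        mapping of {group_name: failed_bool}; `featurize(push, group)` is
        assumed to return a numeric feature vector. Both stand in for the
        real bugbug data pipeline.
        """
        X, y = [], []
        for push in pushes:
            for group, failed in push["group_outcomes"].items():
                X.append(featurize(push, group))
                y.append(1 if failed else 0)
        model = GradientBoostingClassifier()
        model.fit(X, y)
        return model

    # The autoland model only sees sheriffed regressions; the try model also
    # folds in the noisier try history:
    # autoland_model = train_failure_model(autoland_pushes, featurize)
    # try_model = train_failure_model(autoland_pushes + try_pushes, featurize)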

(In reply to Andrew Halberstadt [:ahal] from comment #5)

> Good point about the bias; we're planning to start using try pushes to help train the model as well, so that should help. Though determining what is/isn't a regression on try is a lot trickier, so the quality of the data there will still be lower than autoland's.

> And yeah, as I said, I was thinking more long term about the culture stuff. For now, a less aggressive fallback is a good fix.

I was thinking that we could collect data from try only for those pushes where bugbug would not have caught a failure; this way we gain the most with the fewest additional false positives. I would also change the mozci algorithm a bit so that we don't consider intermittents as failures (I assume that on try, developers often retrigger jobs to figure out whether a failure was an intermittent unrelated to their patch); this too would reduce the false positives in the dataset.
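A sketch of that collection rule, with hypothetical helpers and field names standing in for the real mozci/bugbug APIs:

    # Hypothetical sketch of the proposed try data collection:
    # keep a try push only if it has a "real" failure that the current
    # model would have missed.
    def useful_try_pushes(try_pushes, bugbug_would_have_selected):
        """Yield the try pushes worth adding to the training set."""
        for push in try_pushes:
            # Treat a failure as real only if it never went green on
            # retrigger; a green retrigger suggests an intermittent.
            real_failures = {
                task["group"]
                for task in push["tasks"]
                if task["failed"] and not task["passed_on_retrigger"]
            }
            if not real_failures:
                continue
            # Maximum gain, minimum extra false positives: only keep pushes
            # where the current model would have missed the failure.
            if not real_failures & bugbug_would_have_selected(push):
                yield push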

> (This is the equivalent of a shower thought, don't take it seriously :p).

OK I won't :p But I think it's exactly the opposite; the people who care most about testing will notice that the algorithm isn't doing what they expect and will just avoid using it for try.

It would be interesting to train a model on mach try fuzzy test selections and see how that compares to the mach try auto selections. Some discrepancies will be things that people over-select out of caution, and some will be things that authors are correctly selecting but the autoland-based model isn't.
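That comparison could start with something as simple as measuring the per-push overlap between the two selections (the data loading and field names are assumed, not an existing API):

    # Hypothetical sketch: compare what a developer picked with `mach try
    # fuzzy` against what `mach try auto` would have scheduled for the
    # same push.
    def selection_overlap(fuzzy_tasks, auto_tasks):
        """Return (jaccard, only_human, only_model) for one push."""
        fuzzy, auto = set(fuzzy_tasks), set(auto_tasks)
        union = fuzzy | auto
        jaccard = len(fuzzy & auto) / len(union) if union else 1.0
        # Tasks only the human picked are either over-selection out of
        # caution or associations the autoland-trained model never sees.
        return jaccard, fuzzy - auto, auto - fuzzy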

> (I assume that on try, developers often retrigger jobs to figure out whether a failure was an intermittent unrelated to their patch)

With intermittents it's often easy for a human to guess whether the failures are known or not; if there are bug suggestions for all the lines that don't look like logging noise, it's probably all known intermittents. The issue for tooling is that we never managed to expose this as an API (one that returns whether a specific result is likely just an intermittent) at a quality level we actually trust. And real failures may or may not be obvious, so people may or may not retrigger things that are actually permafails. So I'm not sure how much the retrigger status can tell us.
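The human heuristic described above could at least be approximated in code, even if, as noted, nothing like it has ever been exposed at a quality level we trust; a rough sketch with assumed input shapes:

    # Hypothetical sketch of the "every failure line has a known bug
    # suggestion" heuristic. `failure_lines` is assumed to be a list of
    # dicts with "text" and "bug_suggestions" keys, roughly what Treeherder
    # exposes; `is_noise` decides which lines are just logging noise.
    def looks_like_known_intermittent(failure_lines, is_noise):
        meaningful = [l for l in failure_lines if not is_noise(l["text"])]
        if not meaningful:
            return False
        return all(l.get("bug_suggestions") for l in meaningful)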

Yeah, perception is definitely a worry. I agree with everything you've said, so this isn't me trying to argue or anything, but a couple of other points:

  1. Cases where the selection is "obvious" could be better served by a preset. It's not clear to me that we should be pushing |mach try auto| on those modules (maybe eventually we'll get there).
  2. The goal of |mach try auto| isn't to prevent all backouts. It's to maximize resource efficiency, so some level of backouts is to be expected. There are currently many developers who optimize for avoiding backouts (and average 1000+ tasks per try push). Those developers will have to learn to meet us somewhere in the middle. This is partly what I meant by a culture shift above.
  3. My personal thought is that we should focus on making backouts less painful rather than on trying to reach a super high detection rate. I could imagine a (distant) world where we are fine with only detecting 60% of regressions.

(In reply to Andrew Halberstadt [:ahal] from comment #10)

> The goal of |mach try auto| isn't to prevent all backouts. It's to maximize resource efficiency, so some level of backouts is to be expected.

This is not what my mental model of mach try auto is, nor is it what the messaging around mach try auto has been. Quoting from the dev-platform post requesting feedback: "We are approaching our first major milestone, a try selector that can automatically pick which tests are most relevant to your push on your behalf." And the feedback requested has as its very first item "1. Pushes where |mach try auto| failed to detect a regression (you either got backed out or detected it yourself via some other means)."

As a developer, that doesn't communicate "we're going to try and get away with running the fewest jobs we possibly can", it communicates "we're going to try and run all the tests that are likely to fail on your push so you don't get backed out".

I can see how those two sentiments dovetail together, but I don't think people view them as equivalent, and they'd much rather optimize for the second even at the loss of minimality in the number of jobs run.

> Cases where the selection is "obvious" could be better served by a preset. It's not clear to me that we should be pushing |mach try auto| on those modules (maybe eventually we'll get there).

But then I need a mental model for figuring out when mach try auto is good enough and when I'm better off selecting tasks by hand. And I don't have that model or really know how to develop it except in an expensive way (push with auto, see if it does something expected, if not either add tasks, or — given how fun the add tasks UI is — repush with the jobs I actually wanted).

If mach try auto could just prepopulate a fuzzy selection, but give the option to add or remove tasks, and learn from that for future changes, that would remove the need to make a choice; I could always use mach try fuzzy and just re-add things that are obviously missing (or remove things that are obviously unrelated).
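A sketch of the glue that request implies, turning an automatic selection into an editable mach try fuzzy command line (-q is a real mach try fuzzy flag; the selection input and the helper itself are hypothetical, and bug 1635921 tracks doing this properly inside the tool):

    # Hypothetical sketch: print a `mach try fuzzy` command prepopulated with
    # the tasks an automatic selection would have scheduled, so the developer
    # can add or remove entries before pushing.
    def fuzzy_command_from_auto(auto_selected_tasks):
        parts = ["./mach", "try", "fuzzy"]
        for task in sorted(auto_selected_tasks):
            # A leading ' makes the query an exact match in fzf syntax;
            # task labels are assumed not to contain double quotes.
            parts += ["-q", '"' + "'" + task + '"']
        return " ".join(parts)

The developer could then delete or append -q entries before running the printed command.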

> The goal of |mach try auto| isn't to prevent all backouts. It's to maximize resource efficiency, so some level of backouts is to be expected. There are currently many developers who optimize for avoiding backouts (and average 1000+ tasks per try push).

I obviously agree that such big try pushes without good reason aren't sustainable. But developer time is also expensive, and there's a failure scenario where we optimise the easy-to-measure costs without noticing that we've lost overall on hard-to-measure productivity issues.

In general, I think people will trust the tooling once it passes a reasonableness test. If the tool does something that looks reasonable and there's a backout, that's fine, no one will be upset. But if the tooling misses out jobs that are "obviously" relevant to the current push then people are going to be a lot more grumpy about being backed out, and look for ways to avoid the tooling.

(In reply to Nathan Froyd [:froydnj] from comment #11)

> As a developer, that doesn't communicate "we're going to try and get away with running the fewest jobs we possibly can", it communicates "we're going to try and run all the tests that are likely to fail on your push so you don't get backed out".

> I can see how those two sentiments dovetail together, but I don't think people view them as equivalent, and they'd much rather optimize for the second even at the loss of minimality in the number of jobs run.

As you can probably guess from Mozilla's general shift in focus, reducing costs is the driving force behind this project. That said, we think we can improve the regression detection rate while driving down costs at the same time. So yes, we are trying to do both of those things at once; apologies if my dev-platform post wasn't clear on that.

Interestingly, a lot of the early feedback for ./mach try auto was that it ran too many tasks. Some developers felt that it was wasting resources and they could pick tasks better themselves.

(In reply to James Graham [:jgraham] from comment #12)

> But then I need a mental model for figuring out when mach try auto is good enough and when I'm better off selecting tasks by hand. And I don't have that model or really know how to develop it except in an expensive way (push with auto, see if it does something expected, if not either add tasks, or — given how fun the add tasks UI is — repush with the jobs I actually wanted).

Agreed; I'm just saying it will take time to get there, and until it does some developers may prefer to continue using manual selection methods (and that's fine).

> If mach try auto could just prepopulate a fuzzy selection, but give the option to add or remove tasks, and learn from that for future changes, that would remove the need to make a choice; I could always use mach try fuzzy and just re-add things that are obviously missing (or remove things that are obviously unrelated).

Yes, there is a bug on file for this already (bug 1635921); it just won't be a priority for a little while.

> In general, I think people will trust the tooling once it passes a reasonableness test. If the tool does something that looks reasonable and there's a backout, that's fine, no one will be upset. But if the tooling misses out jobs that are "obviously" relevant to the current push then people are going to be a lot more grumpy about being backed out, and look for ways to avoid the tooling.

Again no disagreement here. I think part of the issue is that this tool is still very new :). There's a reason we still print that warning every time it gets used. It'll get better over time.

> grumpy about being backed out

This sentiment has already come up a lot in this conversation. So I just want to expand on a few of my ideas for making backouts less painful:

  1. Ignore backouts / pushes that were backed out using hg's smart-annotate extension. This would remove the issue of backouts interfering with blame (at least for Mercurial; we'd need to investigate git solutions).
  2. Stop running all tasks on backouts, just run the ones that failed (see the sketch after this list).
  3. Sheriffs either just back you out immediately or ping you in a DM if they are unsure (this avoids the embarrassment of being "called out" on a public channel).
  4. Update tooling such that the "merge to central" is the thing that signals patch completion. E.g., close the Phabricator revision once the patch hits central rather than once it hits autoland.
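For idea 2, a minimal sketch of what "only rerun what failed" could mean at the task-selection level (the data shapes are assumed; this is not the real taskgraph optimization code):

    # Hypothetical sketch: on a backout push, schedule only the tasks that
    # actually failed on the push being backed out, instead of the full set
    # that would normally run.
    def tasks_for_backout(backed_out_push, known_task_labels):
        failed = {t["label"] for t in backed_out_push["tasks"] if t["failed"]}
        # Intersect with the current task graph in case labels have changed.
        return failed & set(known_task_labels)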

If those things are put in place, the downsides to a backout become very small. In fact, a backout would pretty much be the equivalent of pushing to try, noticing an error, and then fixing it and re-pushing.

Honestly, I believe this problem will be almost totally solved in a little while.

At the moment, finding a balance between scheduling too much (and developers not using the tool because they think it's wasting resources) and scheduling too little (and developers not using the tool because they think it's not running enough) is not so easy. We've already switched to scheduling a bit more on try vs autoland since you filed this bug, but scheduling more than that would currently lead to the first situation (developers not using the tool because it schedules too much).

With manifest-level scheduling, even when scheduling more relevant tests, we should still be able to keep the number of tasks low.
Also, with a different default view on Treeherder, developers will not "see" a large vertical space being taken up by tasks, and thus will not worry about the number of tasks.
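A toy illustration of why manifest-level scheduling keeps the task count low: selected manifests get packed into a handful of chunks per suite rather than one task each (the names and chunk counts are made up; this is not the actual chunking logic):

    # Purely illustrative: pack selected manifests into a fixed number of
    # chunks per suite, so selecting more manifests barely changes the
    # number of tasks.
    from collections import defaultdict

    def pack_manifests(selected_manifests, chunks_per_suite=4):
        """selected_manifests: iterable of (suite, manifest_path) pairs."""
        by_suite = defaultdict(list)
        for suite, manifest in selected_manifests:
            by_suite[suite].append(manifest)

        tasks = []
        for suite, manifests in by_suite.items():
            n = min(chunks_per_suite, len(manifests))
            for i in range(n):
                tasks.append((f"{suite}-{i + 1}", manifests[i::n]))
        return tasks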

For the time being, we switched a couple of days ago to scheduling a bit more on try vs autoland (I don't know if your failure would have been caught, though). We could potentially also check what enforcing a minimum number of tasks would entail (I have a feeling the average results of the schedulers wouldn't change much, but it would give developers a bit more certainty that at least something is run).
Another possibility is bug 1632572, where we'd allow the developer to choose between running fewer/more relevant tests according to their understanding of the patch risk.

I think the situation is better with manifest-level scheduling.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
Product: Firefox Build System → Developer Infrastructure