"Trigger missing jobs" doesn't trigger all jobs
Categories
(Firefox Build System :: Task Configuration, defect, P3)
Tracking
(Not tracked)
People
(Reporter: NarcisB, Unassigned)
References
(Blocks 1 open bug)
Details
After manifest scheduling was enabled, "Trigger missing jobs" on backouts doesn't trigger all jobs.
It would be very helpful to trigger all jobs because we are using it mostly on backouts and we want to be sure that on that push there are no other perma failures.
Comment 1•4 years ago
Definitely a legit bug, but I'm curious to know why you'd want to run the full set of tasks on backouts? My intuition was that backouts should only run the things that were broken so that you can verify they got fixed. There's actually a bug on file to stop running non-essential things on backouts.
Maybe we can discuss this over on bug 1636440 so this bug stays relevant to the actual issue on hand.
Comment 2•4 years ago
This happens on all pushes where we hit trigger missing jobs, not just on backouts.
For example, we hit trigger missing jobs on: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=success%2Cpending%2Crunning%2Ctestfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=3a9fcbf00f37714e083b447a059ad543e50eee71 and for win 7 opt/debug not all Mochitest (M) jobs ran
vs a push where all jobs ran by default: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Csuccess%2Cretry%2Cusercancel%2Crunnable&revision=dbaa8eaf275379ff77d25dcc0ca1ea70ab74fffe&group_state=expanded
Comment 4•4 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #1)
Definitely a legit bug, but I'm curious to know why you'd want to run the full set of tasks on backouts? My intuition was that backouts should only run the things that were broken so that you can verify they got fixed.
- When a backout gets pushed, at least one issue is known, but not necessarily everything that fails due to the backed-out changes.
- All missing jobs get manually triggered to verify a push is good for merging. Running only the failed tasks misses the unscheduled tasks with unknown results.
- If tasks are omitted because they were successful in previous pushes, we lack an overview of which task ran when, and of which pushes have what coverage from their own or later builds and tests. Armen was looking into this; I can't find a bug for it.
Comment 5•4 years ago
Is this actually a Treeherder bug or an upstream issue?
Comment 6•4 years ago
Yeah, likely needs to be fixed in the taskgraph actions code. Just to clarify: is the Trigger Missing Jobs button calling the "run-missing-tests" action?
Comment 7•4 years ago
Yup, I can see that happening here (and if for some reason that action wasn't found, an error would be thrown): https://github.com/mozilla/treeherder/blob/master/ui/models/push.js#L78
Comment 8•4 years ago
Looking at the implementation it was obvious why this is broken. It's comparing the result of target-tasks.json to what actually ran and filling in the missing tasks. However, due to manifest-scheduling we've already pruned many tests (and many chunks by extension) out by the time we reach the target-tasks phase.
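To illustrate the failure mode described above, here is a minimal sketch. It is not the actual taskgraph action code: the function name and the task labels are hypothetical, and it only models the "diff target tasks against what ran" idea with plain sets.

```python
def missing_tasks(target_tasks, already_ran):
    """Return the labels in target_tasks that have not run yet (sorted)."""
    return sorted(set(target_tasks) - set(already_ran))


# Hypothetical labels, for illustration only.
full_set = {"mochitest-1", "mochitest-2", "mochitest-3", "xpcshell-1"}

# Assumption: manifest scheduling has already pruned chunks before the
# target-tasks phase, so the target set is smaller than the full set.
target_tasks = {"mochitest-1", "xpcshell-1"}
already_ran = {"mochitest-1"}

# Diffing against the pruned target set never sees the pruned chunks.
print(missing_tasks(target_tasks, already_ran))

# Diffing against the full set is what the sheriffs actually expect.
print(missing_tasks(full_set, already_ran))
```

In this toy model the first call can only ever return tasks that survived pruning, which matches the symptom reported in this bug.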
This action needs a complete re-write / replacement. I can think of two ideas:
- Rename this action to make-backstop or similar. Then when pressed, it would retrigger the decision task, except with backstop=True in the parameters. This would be simplest, but has the downside of re-running everything that already ran (including builds and other dependencies).
- Same as above except we add a mechanism to remove test manifests that already ran on the push from the total set. We also try to optimize-by-replacement all the builds and other deps that already exist. Not sure how feasible this is.
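A rough sketch of the second idea, under stated assumptions: the function name, the manifest paths, the build labels and the task IDs are all hypothetical, and real taskgraph optimization is considerably more involved than a dictionary lookup. It only shows the two set operations the idea relies on.

```python
def plan_backstop(all_manifests, ran_manifests, needed_builds, existing_builds):
    """Plan a backstop run that skips work already done on the push.

    all_manifests:   every test manifest a full run would cover
    ran_manifests:   manifests that already ran on this push
    needed_builds:   build labels the remaining tests depend on
    existing_builds: mapping of build label -> existing task id
    """
    # Only schedule manifests that have not run on this push yet.
    manifests = sorted(set(all_manifests) - set(ran_manifests))
    # Reuse builds that already exist instead of re-running them.
    reuse = {b: existing_builds[b] for b in needed_builds if b in existing_builds}
    rebuild = sorted(b for b in needed_builds if b not in existing_builds)
    return {"manifests": manifests, "reuse": reuse, "rebuild": rebuild}


# Hypothetical inputs, for illustration only.
plan = plan_backstop(
    all_manifests=["dom/a.ini", "dom/b.ini", "layout/c.ini"],
    ran_manifests=["dom/a.ini"],
    needed_builds=["build-win64/opt", "build-win64/debug"],
    existing_builds={"build-win64/opt": "task-123"},
)
print(plan)
```

Compared with the first idea, nothing that already ran is scheduled again, at the cost of having to know reliably which manifests and builds exist on the push.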
This bug is going to be a fair amount of work and not something I'll be able to get to before going on leave. I think it might be worth taking a step back and thinking about the use cases this is needed for, and making sure there aren't alternative solutions that can be implemented more easily than this..
Comment 9•4 years ago
and making sure there aren't alternative solutions that can be implemented more easily than this..
Just wanted to call this out..
One idea that Marco and I have had in the past, was writing a script that could automatically determine what the merge candidate is at any point in time. It would do this by using the same regression detection algorithms in mozci that we use for test selection ML. The idea is that when a task passes on a push later on, it's "as if" it passed on the current push (I realize there are exceptions to this that we'd have to be careful of). So rather than physically see every green task on a push before using it as a merge candidate, we could run this script to find the push where every task has passed on some future push (with no failures in-between).
I suspect building such a script might actually be easier than fixing this bug.
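A minimal sketch of what such a script could look like. It does not use mozci or its regression detection; the push data structure, the "pass"/"fail" strings and the function names are assumptions made for illustration. A task counts as passing on a push if the first result recorded on that push or any later one is a pass.

```python
def effective_status(pushes, index, label):
    """Return the first recorded result for label at or after pushes[index]."""
    for push in pushes[index:]:
        result = push["results"].get(label)
        if result is not None:
            return result
    return None  # the task never ran from this push onwards


def find_merge_candidate(pushes, required):
    """Return the newest revision where every required task effectively passed.

    pushes must be ordered oldest to newest.
    """
    for index in range(len(pushes) - 1, -1, -1):
        if all(effective_status(pushes, index, t) == "pass" for t in required):
            return pushes[index]["rev"]
    return None


# Hypothetical push history, oldest first.
pushes = [
    {"rev": "a", "results": {"build": "pass", "mochitest": "pass"}},
    {"rev": "b", "results": {"build": "pass"}},
    {"rev": "c", "results": {"build": "pass", "mochitest": "fail"}},
    {"rev": "d", "results": {"build": "pass", "xpcshell": "pass"}},
    {"rev": "e", "results": {"mochitest": "pass", "xpcshell": "pass"}},
]
required = ["build", "mochitest", "xpcshell"]
print(find_merge_candidate(pushes, required))  # d
```

Here push "e" is rejected because no build result exists at or after it, while "d" qualifies because mochitest passed on "e" with no failure in between. The exceptions mentioned above (cases where a later pass does not imply an earlier pass) are not modelled.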