Open Bug 1670002 Opened 4 years ago Updated 1 year ago

"Trigger missing jobs" doesn't trigger all jobs

Categories

(Firefox Build System :: Task Configuration, defect, P3)

Tracking

(Not tracked)

People

(Reporter: NarcisB, Unassigned)

References

(Blocks 1 open bug)

Details

After manifest scheduling was enabled, "Trigger missing jobs" on backouts doesn't trigger all jobs.
It would be very helpful to trigger all jobs, because we use it mostly on backouts and want to be sure that the push has no other perma failures.

Definitely a legit bug, but I'm curious to know why you'd want to run the full set of tasks on backouts? My intuition was that backouts should only run the things that were broken so that you can verify they got fixed. There's actually a bug on file to stop running non-essential things on backouts.

Maybe we can discuss this over on bug 1636440 so this bug stays relevant to the actual issue on hand.

Updated the comm, removing ni.

Flags: needinfo?(aryx.bugmail)

(In reply to Andrew Halberstadt [:ahal] from comment #1)

Definitely a legit bug, but I'm curious to know why you'd want to run the full set of tasks on backouts? My intuition was that backouts should only run the things that were broken so that you can verify they got fixed.

  1. When a backout gets pushed, at least one issue is known, but not necessarily everything that fails because of the backed-out changes.
  2. All missing jobs get manually triggered to verify a push is good for merging. Running only the failed tasks misses the unscheduled tasks with unknown results.
  3. If tasks are omitted because they were successful in previous pushes, we lack an overview of which task ran when, and of what coverage each push has from its own or later builds and tests. Armen was looking into this; I can't find a bug for it.

Summary: "Trigger missing jobs" should trigger all jobs → "Trigger missing jobs" should trigger all jobs for backouts

Is this actually a Treeherder bug or an upstream issue?

Yeah, this likely needs to be fixed in the taskgraph actions code. Just to clarify: is the Trigger Missing Jobs button calling the "run-missing-tests" action?

Component: Treeherder: Job Triggering & Cancellation → Task Configuration
Flags: needinfo?(sclements)
Product: Tree Management → Firefox Build System
Version: --- → unspecified

Yup, I can see that happening here (and if for some reason that action wasn't found, an error would be thrown): https://github.com/mozilla/treeherder/blob/master/ui/models/push.js#L78

Flags: needinfo?(sclements)
Summary: "Trigger missing jobs" should trigger all jobs for backouts → "Trigger missing jobs" doesn't trigger all jobs
Type: enhancement → defect

Looking at the implementation, it's obvious why this is broken. It compares the contents of target-tasks.json to what actually ran and fills in the missing tasks. However, due to manifest-scheduling we've already pruned many tests (and many chunks by extension) by the time we reach the target-tasks phase, so those tasks never show up as "missing".
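To make the failure mode concrete, here is a minimal sketch of the comparison described above. It is not the actual taskgraph action code: the function name `fill_in_missing` and all task labels are hypothetical, and the sets stand in for target-tasks.json and the tasks already on the push.

```python
def fill_in_missing(target_tasks, already_ran):
    """Return the target tasks that have not run yet on the push."""
    return sorted(set(target_tasks) - set(already_ran))


# Hypothetical labels: everything that could run for this push.
full_task_set = {
    "build-linux64/opt",
    "test-linux64/opt-mochitest-1",
    "test-linux64/opt-mochitest-2",
    "test-linux64/opt-xpcshell-1",
}

# Assumption: manifest scheduling has already pruned two test chunks
# by the time the target task set is written out.
target_tasks = {
    "build-linux64/opt",
    "test-linux64/opt-mochitest-1",
}

already_ran = {"build-linux64/opt"}

# Only tasks that survived pruning can be reported as missing.
triggered = fill_in_missing(target_tasks, already_ran)

# These chunks are absent from the target set, so the comparison never sees them.
never_considered = sorted(full_task_set - target_tasks)

print(triggered)
print(never_considered)
```

In this sketch the action triggers one task, while two pruned chunks stay unscheduled even though the sheriff expected "all jobs".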

This action needs a complete re-write / replacement. I can think of two ideas:

  1. Rename this action to make-backstop or similar. Then when pressed, it would retrigger the decision task, except with backstop=True in the parameters. This would be simplest, but has the downside of re-running everything that already ran (including builds and other dependencies).

  2. Same as above except we add a mechanism to remove test manifests that already ran on the push from the total set. We also try to optimize-by-replacement all the builds and other deps that already exist. Not sure how feasible this is.
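A rough sketch of what idea 2 might compute, assuming we can list the manifests that already ran on the push and the builds that already exist. Nothing here is real taskgraph API: `plan_backstop`, the manifest paths, and the task ids are hypothetical placeholders.

```python
def plan_backstop(all_manifests, ran_manifests, needed_builds, existing_builds):
    """Split a backstop-style run into what still has to be scheduled.

    all_manifests   -- every test manifest a backstop push would run
    ran_manifests   -- manifests that already ran on this push
    needed_builds   -- build labels the remaining tests depend on
    existing_builds -- mapping of build label -> id of a finished build task
    """
    # Only run manifests that have no result on the push yet.
    manifests_to_run = sorted(set(all_manifests) - set(ran_manifests))

    # Replace builds that already exist instead of re-running them.
    reused = {
        label: existing_builds[label]
        for label in needed_builds
        if label in existing_builds
    }

    # Anything left over still has to be built.
    builds_to_run = sorted(
        label for label in needed_builds if label not in existing_builds
    )
    return manifests_to_run, reused, builds_to_run


manifests_to_run, reused, builds_to_run = plan_backstop(
    all_manifests=["a/mochitest.ini", "b/xpcshell.ini", "c/browser.ini"],
    ran_manifests=["a/mochitest.ini"],
    needed_builds=["build-linux64/opt", "build-win64/opt"],
    existing_builds={"build-linux64/opt": "hypothetical-task-id-1"},
)
print(manifests_to_run, reused, builds_to_run)
```

Whether the real optimization phase can be driven this way is exactly the open feasibility question in idea 2; the sketch only shows the set arithmetic involved.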

This bug is going to be a fair amount of work and not something I'll be able to get to before going on leave. I think it might be worth taking a step back, thinking about the use cases this is needed for, and making sure there aren't alternative solutions that can be implemented more easily than this.

and making sure there aren't alternative solutions that can be implemented more easily than this..

Just wanted to call this out..

One idea that Marco and I have had in the past was to write a script that could automatically determine what the merge candidate is at any point in time. It would do this by using the same regression detection algorithms in mozci that we use for test selection ML. The idea is that when a task passes on a later push, it's "as if" it passed on the current push (I realize there are exceptions to this that we'd have to be careful of). So rather than having to physically see every task green on a push before using it as a merge candidate, we could run this script to find the push where every task has passed on that push or on some future push (with no failures in between).

I suspect building such a script might actually be easier than fixing this bug.
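A minimal sketch of the merge-candidate idea, under simplifying assumptions: pushes are plain dicts ordered oldest to newest, each holding only the results of tasks that actually ran, and a task counts as green on a push when its first observed result on that push or a later one is a pass. This does not use mozci; the function names, revisions, and task names are hypothetical.

```python
def effectively_green(pushes, index, tasks):
    """True if every task's first result at or after `index` is a pass."""
    for task in tasks:
        for push in pushes[index:]:
            result = push["results"].get(task)
            if result is None:
                continue  # task did not run on this push, keep looking
            if result == "fail":
                return False  # a failure came before any pass
            break  # first observed result is a pass
        else:
            return False  # task never ran at or after this push
    return True


def merge_candidate(pushes, tasks):
    """Return the newest revision on which all tasks are effectively green."""
    for index in range(len(pushes) - 1, -1, -1):
        if effectively_green(pushes, index, tasks):
            return pushes[index]["rev"]
    return None


# Hypothetical history, oldest first. "b2" stands in for a backout push
# on which mochitest was not scheduled.
pushes = [
    {"rev": "a1", "results": {"build": "pass", "mochitest": "fail"}},
    {"rev": "b2", "results": {"build": "pass"}},
    {"rev": "c3", "results": {"mochitest": "pass"}},
    {"rev": "d4", "results": {"build": "pass"}},
]
tasks = ["build", "mochitest"]

print(merge_candidate(pushes, tasks))
```

Here the backout push "b2" also counts as effectively green, because mochitest passes on "c3" with no failure in between, while "d4" does not, since mochitest has no result on or after it. The exceptions mentioned above (for example intermittent failures) are not modelled.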

Severity: -- → S3
Priority: -- → P3