1670002 - "Trigger missing jobs" doesn't trigger all jobs

Reporter

Description

4 years ago

After manifest scheduling was enabled, "Trigger missing jobs" on backouts doesn't trigger all jobs.
It would be very helpful to trigger all jobs because we are using it mostly on backouts and we want to be sure that on that push there are no other perma failures.

Andrew Halberstadt [:ahal]

Comment 1

4 years ago

Definitely a legit bug, but I'm curious to know why you'd want to run the full set of tasks on backouts? My intuition was that backouts should only run the things that were broken so that you can verify they got fixed. There's actually a bug on file to stop running non-essential things on backouts.

Maybe we can discuss this over on bug 1636440 so this bug stays relevant to the actual issue on hand.

Andreea Pavel [:apavel]

Comment 2

4 years ago

This happens on all pushes where we hit trigger missing jobs, not just on backouts.

For example, we hit trigger missing jobs on: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=success%2Cpending%2Crunning%2Ctestfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=3a9fcbf00f37714e083b447a059ad543e50eee71 and for win 7 opt/debug not all Mochitest (M) jobs ran

vs a push where all jobs ran by default: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Csuccess%2Cretry%2Cusercancel%2Crunnable&revision=dbaa8eaf275379ff77d25dcc0ca1ea70ab74fffe&group_state=expanded

Flags: needinfo?(aryx.bugmail)

Andreea Pavel [:apavel]

Comment 3

4 years ago

Updated the comment, removing the needinfo.

Flags: needinfo?(aryx.bugmail)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 4

4 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #1)

Definitely a legit bug, but I'm curious to know why you'd want to run the full set of tasks on backouts? My intuition was that backouts should only run the things that were broken so that you can verify they got fixed.

When a backout gets pushed, at least one issue is known, but not necessarily everything that fails due to the backed-out changes.
All missing jobs get manually triggered to verify that a push is good for merging. Running only the failed tasks misses the unscheduled tasks, whose results are unknown.
If tasks are omitted because they succeeded in previous pushes, we lack an overview of which task ran when, and of which pushes have what coverage from their own or later builds and tests. Armen was looking into this, but I can't find a bug for it.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

4 years ago

Summary: "Trigger missing jobs" should trigger all jobs → "Trigger missing jobs" should trigger all jobs for backouts

Sarah Clements [:sclements]

Comment 5

4 years ago

Is this actually a Treeherder bug or an upstream issue?

Andrew Halberstadt [:ahal]

Comment 6

4 years ago

Yeah, this likely needs to be fixed in the taskgraph actions code. Just to clarify: is the Trigger Missing Jobs button calling the "run-missing-tests" action?

Component: Treeherder: Job Triggering & Cancellation → Task Configuration

Flags: needinfo?(sclements)

Product: Tree Management → Firefox Build System

Version: --- → unspecified

Sarah Clements [:sclements]

Comment 7

4 years ago

Yup, I can see that happening here (and if for some reason that action wasn't found, an error would be thrown): https://github.com/mozilla/treeherder/blob/master/ui/models/push.js#L78

Flags: needinfo?(sclements)

Andrew Halberstadt [:ahal]

Updated

4 years ago

Summary: "Trigger missing jobs" should trigger all jobs for backouts → "Trigger missing jobs" doesn't trigger all jobs

Andrew Halberstadt [:ahal]

Updated

4 years ago

Type: enhancement → defect

Andrew Halberstadt [:ahal]

Comment 8

4 years ago

Looking at the implementation it was obvious why this is broken. It's comparing the result of target-tasks.json to what actually ran and filling in the missing tasks. However, due to manifest-scheduling we've already pruned many tests (and many chunks by extension) out by the time we reach the target-tasks phase.
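To illustrate the failure mode, here is a minimal sketch of the comparison described above. It is not the real action code: the task labels and the `missing_tasks` helper are hypothetical, and the only assumption carried over from this comment is that pruning has already happened by the time the target task set is produced.

```python
def missing_tasks(target_tasks, already_ran):
    """Return the target task labels that have not run on the push yet."""
    return sorted(set(target_tasks) - set(already_ran))


# Hypothetical labels, for illustration only.
full_task_set = {"build-win", "mochitest-1", "mochitest-2", "mochitest-3"}

# Manifest scheduling has already pruned chunks 2 and 3, so they never
# reach the target task set that the action compares against.
target_tasks = {"build-win", "mochitest-1"}
already_ran = {"build-win"}

# What the action would trigger: only tasks still in the target set.
triggered = missing_tasks(target_tasks, already_ran)

# What the action cannot see: tasks pruned before the comparison.
never_considered = sorted(full_task_set - set(target_tasks))

print(triggered)
print(never_considered)
```

In this sketch the pruned chunks are absent from both sides of the comparison, so no amount of "filling in" can bring them back.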

This action needs a complete re-write / replacement. I can think of two ideas:

1. Rename this action to make-backstop or similar. Then when pressed, it would retrigger the decision task, except with backstop=True in the parameters. This would be simplest, but has the downside of re-running everything that already ran (including builds and other dependencies).
2. Same as above, except we add a mechanism to remove test manifests that already ran on the push from the total set. We also try to optimize-by-replacement all the builds and other deps that already exist. Not sure how feasible this is.
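A rough sketch of what the two ideas amount to, under stated assumptions: the parameters are modelled as a plain dict, and both helper names (`with_backstop`, `manifests_still_needed`) and the manifest paths are hypothetical. Only the `backstop` key comes from the proposal above; none of this is existing taskgraph code.

```python
def with_backstop(parameters):
    """Return a copy of the decision parameters with backstop enabled."""
    updated = dict(parameters)
    updated["backstop"] = True
    return updated


def manifests_still_needed(all_manifests, ran_manifests):
    """Second idea: drop manifests that already ran on this push."""
    return sorted(set(all_manifests) - set(ran_manifests))


# Hypothetical inputs, for illustration only.
original = {"project": "autoland", "backstop": False}
all_manifests = ["dom/a.ini", "dom/b.ini", "layout/c.ini"]
ran_manifests = ["dom/a.ini"]

new_parameters = with_backstop(original)
remaining = manifests_still_needed(all_manifests, ran_manifests)

print(new_parameters)
print(remaining)
```

The first helper alone corresponds to the simple variant, which re-runs everything. The second shows the extra filtering step that would avoid repeating manifests, and leaves open how builds and other dependencies would be reused.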

This bug is going to be a fair amount of work and not something I'll be able to get to before going on leave. I think it might be worth taking a step back and thinking about the use cases this is needed for, and making sure there aren't alternative solutions that can be implemented more easily than this.

Andrew Halberstadt [:ahal]

Comment 9

4 years ago

and making sure there aren't alternative solutions that can be implemented more easily than this..

Just wanted to call this out..

One idea that Marco and I have had in the past was writing a script that could automatically determine what the merge candidate is at any point in time. It would do this by using the same regression detection algorithms in mozci that we use for test selection ML. The idea is that when a task passes on a later push, it's "as if" it passed on the current push (I realize there are exceptions to this that we'd have to be careful of). So rather than needing to see every task green on a push before using it as a merge candidate, we could run this script to find the push where every task has passed on that push or on some future push (with no failures in between).
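As a toy illustration of that rule, the sketch below does not use mozci or its regression detection. It assumes a hypothetical data shape: `pushes` is a list ordered oldest to newest, where each entry maps a task label to "pass" or "fail" for the tasks that actually ran on that push. The function names are also made up for this example.

```python
from typing import Dict, List, Optional, Set


def effectively_green(pushes: List[Dict[str, str]], index: int, all_tasks: Set[str]) -> bool:
    """True if every task's first result at or after `index` is a pass."""
    for task in all_tasks:
        for results in pushes[index:]:
            status = results.get(task)
            if status == "pass":
                break  # passed here or later, with no failure in between
            if status == "fail":
                return False  # a failure came before any pass
        else:
            return False  # the task never ran at or after this push
    return True


def merge_candidate(pushes: List[Dict[str, str]], all_tasks: Set[str]) -> Optional[int]:
    """Return the index of the newest effectively green push, if any."""
    for index in range(len(pushes) - 1, -1, -1):
        if effectively_green(pushes, index, all_tasks):
            return index
    return None


# Hypothetical history: tasks "a" and "b", sparse scheduling per push.
tasks = {"a", "b"}
pushes = [
    {"a": "pass"},
    {"b": "fail"},
    {"a": "pass"},
    {"b": "pass"},
    {"a": "pass"},
]

candidate = merge_candidate(pushes, tasks)
print(candidate)
```

Here the newest push is rejected because task "b" has no result at or after it, while the push before it qualifies. A real version would also need to handle intermittents and the exceptions mentioned above.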

I suspect building such a script might actually be easier than fixing this bug.

Andrew Halberstadt [:ahal]

Updated

4 years ago

Blocks: sheriff-workflow

Julien Cristau [:jcristau]

Updated

1 year ago

Severity: -- → S3

Priority: -- → P3

Categories

(Firefox Build System :: Task Configuration, defect, P3)

Tracking

(Not tracked)

People

(Reporter: NarcisB, Unassigned)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)
