1183528 - Downstream tasks not resolved by scheduler when a task they depend on is a) resolved as failed or b) resolved as exception with no more reruns available

Reporter

Description

9 years ago

Tasks are timing out after a day, for example https://tools.taskcluster.net/task-inspector/#rRq3MGTJSN2dJ3vTIXLssQ/ (Created: a day ago; Deadline: 2 hours ago, i.e. a day later). pmoore has already started looking into this to see if he can find out what's going on.

:glob ✱

Updated

9 years ago

Summary: Task time out after 1day → Task time out after 1 day

Pete Moore [:pmoore][:pete]

Comment 1

9 years ago

It looks to me like the tests were dependent on the b2g build job, and that build job failed: https://tools.taskcluster.net/task-inspector/#vDMjO3q4TyidJXi9Xj2ilA/0

I would guess that the scheduler should then listen on the taskFailed exchange (http://docs.taskcluster.net/queue/exchanges/#taskFailed), and then resolve all dependent jobs (which have already been defined on the queue: http://docs.taskcluster.net/queue/api-docs/#defineTask but not scheduled: http://docs.taskcluster.net/queue/api-docs/#scheduleTask) as either failed or exception (it can be debated which is more appropriate).

It looks like the scheduler did not resolve the downstream tasks, and so the tasks were only resolved once the deadline was exceeded, and this resolution was performed by the queue (since they were not scheduled, no worker claimed them).
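To make the suggestion concrete, here is a minimal sketch of how the set of downstream tasks could be computed once a task is known to have failed. It is not scheduler code: the `requires` mapping, the `downstream_of` helper and the task labels are all hypothetical, chosen only to illustrate the idea.

```python
from collections import deque


def downstream_of(failed_task, requires):
    """Return every task that directly or transitively requires failed_task.

    `requires` maps a task label to the list of labels it depends on.
    This layout is an assumption for illustration, not the scheduler's
    real data model.
    """
    # Invert the mapping so we can walk from a task to its dependents.
    dependents = {}
    for task, parents in requires.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(task)

    # Breadth-first walk over the dependents of the failed task.
    blocked = set()
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked


# Hypothetical task graph: two test tasks depend on a build task.
graph = {
    "build": [],
    "test-a": ["build"],
    "test-b": ["build"],
    "report": ["test-a", "test-b"],
    "lint": [],
}
print(sorted(downstream_of("build", graph)))
```

Whether the tasks found this way should then be resolved as failed, as exception, or left alone is exactly the open question in this bug.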

Component: General → Scheduler

Flags: needinfo?(jopsen)

Summary: Task time out after 1 day → scheduler: downstream tasks were not resolved when a task failed

Comment hidden (Legacy TBPL/Treeherder Robot)

Pete Moore [:pmoore][:pete]

Updated

9 years ago

Summary: scheduler: downstream tasks were not resolved when a task failed → Downstream tasks not resolved by scheduler when a task they depend on is a) resolved as failed or b) resolved as exception with no more reruns available

Jonas Finnemann Jensen (:jonasfj)

Comment 61

9 years ago

> It looks like the scheduler did not resolve the downstream tasks, and so the tasks were only
> resolved once the deadline was exceeded, and this resolution was performed by the queue
> (since they were not scheduled, no worker claimed them).

This is the intended behaviour! The scheduler should **not** resolve dependent tasks. I don't see the problem; these semantics are sane.

If we want treeherder to display something else, maybe not show things before they are scheduled? Or show them as not-going-to-run when a required task is blocking; that is different. We can solve this at the treeherder integration level. But I have no idea what people want in treeherder, and it's not defined what TH supports.

Flags: needinfo?(jopsen)

Pete Moore [:pmoore][:pete]

Comment 62

9 years ago

Once it becomes impossible for a task to run (since it depends on a task that resolved unsuccessfully and has no more available runs) I think it does not make sense to leave it in a pending state, which inherently suggests that it may resolve as successful at some point in the future, since I believe under no circumstances can the task ever be executed. Also when it resolved as deadline-exceeded, you have to go digging to find out why.

Maybe the scheduler could resolve tasks as "exception"/"parent-task-unsuccessful" or "exception"/"pre-condition-failure" or "exception"/"permanently-blocked" (or something similar) in the case where under no circumstances can the tasks that a task depends on all be resolved as successful.

When we see we have "exception"/"deadline-exceeded" we do not know if the deadline was exceeded because:

a) performance was insufficient
   * capacity problems
   * deadline too short
   * parent tasks took too long
   * ...

or

b) regardless of what the deadline was, the task could never have resolved
   * parent task could not resolve as successful

Just capturing this difference in the result would be enormously helpful for:

a) troubleshooting issues (i.e. categorizing the type of failure)
b) prompt resolution (not needing to wait for task to timeout before getting the failure)

Maybe this is partly covered by the task-graph being identified as blocked, but having this information at a task level is also very helpful, since when a task graph is blocked, you don't immediately know which tasks are affected.

Blocked tasks could already be differentiated in the Task Graph Inspector, for example, in a different colour (purple?) - just as an illustration of how differentiating this state could be useful.
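As a rough illustration of the distinction between cases a) and b), the sketch below separates a task that was permanently blocked by its parents from one that merely ran out of time. Everything here is an assumption for illustration: the function names, the state strings, the `runs_left` counter, and the "permanently-blocked" reason, which is only one of the names proposed above and not an existing resolution value. It inspects direct parents only.

```python
# States treated as "did not succeed" (assumed labels for this sketch).
UNSUCCESSFUL = {"failed", "exception"}


def blocking_parents(task, requires, state, runs_left):
    """Direct parents of `task` that can no longer resolve successfully."""
    return [
        parent
        for parent in requires.get(task, [])
        if state[parent] in UNSUCCESSFUL and runs_left.get(parent, 0) == 0
    ]


def resolution_reason(task, requires, state, runs_left):
    """Pick a reason for a task that never ran, plus the parents to blame."""
    blockers = blocking_parents(task, requires, state, runs_left)
    if blockers:
        # Case b): no deadline would have helped.
        return ("permanently-blocked", blockers)
    # Case a): capacity, a short deadline, slow parents, and so on.
    return ("deadline-exceeded", [])


# Hypothetical example data.
requires = {"test-a": ["build"], "test-b": ["lint"]}
state = {
    "build": "failed",
    "lint": "completed",
    "test-a": "unscheduled",
    "test-b": "unscheduled",
}
runs_left = {"build": 0}

print(resolution_reason("test-a", requires, state, runs_left))
print(resolution_reason("test-b", requires, state, runs_left))
```

Returning the list of blocking parents alongside the reason is the part that would save the digging: the user sees which tasks to look at without walking the graph by hand.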

Pete Moore [:pmoore][:pete]

Comment 63

9 years ago

Let's talk about this in our next meeting...

Flags: needinfo?(jopsen)

Jonas Finnemann Jensen (:jonasfj)

Comment 64

9 years ago

> Once it becomes impossible for a task to run (since it depends on a task that resolved unsuccessfully
> and has no more available runs) I think it does not make sense to leave it in a pending state,

It's not "pending", it's "unscheduled". And with reruns and other things, the task we depend on could be resolved later.

We can take it at next meeting, but I doubt this is something we should address at scheduler level. If so we should do it in the next-gen scheduler... Maybe cancel-task, but I would still prefer not to deal with it.

Flags: needinfo?(jopsen)

Pete Moore [:pmoore][:pete]

Comment 65

9 years ago

Any solution must track in the task whether it did not run because an explicit list of parent tasks was not successful, or for some other reason. Between the Scheduler and the Queue, Taskcluster *knows* the list of tasks that blocked it, so the user shouldn't have to go digging. Making them dig is bad design, since it wastes time and resources and raises operational cost. The list could also be an artifact attached to the task. In short, if Taskcluster knows information that is useful for troubleshooting a failure/exception, it should provide it.

Pete Moore [:pmoore][:pete]

Comment 66

9 years ago

Thinking about this some more, we could even make it more obvious in the UI of Task Inspector and Task Graph Inspector when a task graph is blocked and requires human intervention (i.e. a rerun is required to unblock the task graph), highlighting it in such a way that the user can't miss it. Something like: you go to the page, and you are immediately alerted that unless you can fix tasks X, Y, Z before the deadline in N hours and M minutes, your dependent tasks are never going to run.

So there are potential improvements both before the deadline expires, to alert the user to the human intervention required to resolve issues, and after the deadline expires, to give a clearer explanation of what "blockages" caused tasks not to get scheduled.

Jonas Finnemann Jensen (:jonasfj)

Comment 67

9 years ago

@pmoore, I believe in solving this by improving our tools. Especially for next-gen scheduler.

Note: if this is a problem for sheriffs we should handle it at treeherder integration-level. Ahh, I recall something about that... We can get the treeherder integration to not export tasks as failed if they resolved as exception with deadline-exceeded and were never scheduled. I can't recall where I discussed this with jlal but there was a discussion on the subject.

I'll add it to the agenda for Monday with garndt about mozilla-taskcluster split-up.

Jonas Finnemann Jensen (:jonasfj)

Updated

9 years ago

Component: Scheduler → Integration

Carsten Book [:Tomcat]

Reporter

Updated

9 years ago

Blocks: q3-bb-tc-migration

Greg Arndt [:garndt]

Assignee

Comment 126

9 years ago

As discussed in the previous comments, the behavior of leaving downstream tasks as unscheduled until the deadline is exceeded is intentional. Once the deadline is exceeded, the task run is marked with reasonCreated "exception" and reasonResolved "deadline-exceeded".

The solution should be to modify mozilla-taskcluster so that it does not report these jobs to treeherder [1] when this situation occurs.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1148965#c4
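For illustration only, a filter along these lines could look like the sketch below. It is written in Python for brevity and is not the mozilla-taskcluster implementation. The `should_report` helper and the sample run dictionaries are hypothetical; only the `reasonCreated` and `reasonResolved` field names and values come from the description above.

```python
def should_report(run):
    """Return False for runs created only to record an exception.

    A run whose reasonCreated is "exception" never did any work, so
    there is nothing useful to show for it (assumed policy for this
    sketch).
    """
    return run.get("reasonCreated") != "exception"


# Hypothetical run records, shaped after the fields named in this bug.
never_ran = {
    "runId": 0,
    "state": "exception",
    "reasonCreated": "exception",
    "reasonResolved": "deadline-exceeded",
}
did_work = {
    "runId": 0,
    "state": "failed",
    "reasonCreated": "scheduled",
    "reasonResolved": "failed",
}

for run in (never_ran, did_work):
    print(run["reasonResolved"], "->", "report" if should_report(run) else "skip")
```

Keying on reasonCreated rather than on reasonResolved alone keeps reporting intact for a task that was scheduled, did some work and then hit its deadline.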

Selena Deckelmann :selenamarie :selena

Updated

9 years ago

Blocks: 1148965

Selena Deckelmann :selenamarie :selena

Updated

9 years ago

No longer blocks: 1148965, q3-bb-tc-migration

Depends on: 1148965

Greg Arndt [:garndt]

Assignee

Comment 2081

9 years ago

Changes have been made to have mozilla-taskcluster not submit jobs that were created only to report an exception (i.e. no work was done). https://github.com/taskcluster/mozilla-taskcluster/commit/5b23872a666fd7c1b42c9e55c2696e8ab3363079

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Greg Arndt [:garndt]

Assignee

Updated

9 years ago

Blocks: 1182491

Selena Deckelmann :selenamarie :selena

Updated

9 years ago

Assignee: nobody → garndt

Updated

6 years ago

Component: Integration → Services