Closed Bug 1183528 Opened 9 years ago Closed 9 years ago

Downstream tasks not resolved by scheduler when a task they depend on is a) resolved as failed or b) resolved as exception with no more reruns available

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: garndt)

Details

For example: https://tools.taskcluster.net/task-inspector/#rRq3MGTJSN2dJ3vTIXLssQ/ (created a day ago, deadline 2 hours ago, i.e. a day later). pmoore has already looked into this to see if he can find out what's going on.
Summary: Task time out after 1day → Task time out after 1 day
It looks to me like the tests were dependent on the b2g build job, and that build job failed: https://tools.taskcluster.net/task-inspector/#vDMjO3q4TyidJXi9Xj2ilA/0 I would guess that the scheduler should then listen on the taskFailed exchange (http://docs.taskcluster.net/queue/exchanges/#taskFailed), and then resolve all dependent jobs (which have already been defined on the queue: http://docs.taskcluster.net/queue/api-docs/#defineTask but not scheduled: http://docs.taskcluster.net/queue/api-docs/#scheduleTask) as either failed or exception (it can be debated which is more appropriate). It looks like the scheduler did not resolve the downstream tasks, and so the tasks were only resolved once the deadline was exceeded, and this resolution was performed by the queue (since they were not scheduled, no worker claimed them).
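The resolution step proposed above depends on finding every task downstream of the failed one. A minimal sketch of that lookup follows, assuming the task graph is available as a plain mapping from each task to the tasks it requires. The function name `downstream_tasks` and the task names are hypothetical and not part of any Taskcluster API.

```python
from collections import defaultdict, deque


def downstream_tasks(requires, failed_task):
    """Return every task that directly or transitively depends on failed_task.

    `requires` maps a task name to the list of tasks it requires
    (a hypothetical in-memory stand-in for the task graph).
    """
    # Invert "task -> tasks it requires" into "task -> tasks that require it".
    dependents = defaultdict(list)
    for task, parents in requires.items():
        for parent in parents:
            dependents[parent].append(task)

    # Breadth-first walk from the failed task through its dependents.
    blocked = []
    seen = {failed_task}
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in seen:
                seen.add(child)
                blocked.append(child)
                queue.append(child)
    return blocked


# Illustrative graph: two test tasks depend on a build, a report on both tests.
requires = {
    "build": [],
    "test-a": ["build"],
    "test-b": ["build"],
    "report": ["test-a", "test-b"],
    "lint": [],
}
print(downstream_tasks(requires, "build"))
```

Whether these tasks should then be resolved as failed or as exception, or left alone, is the open question in this bug. The sketch only identifies them.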
Component: General → Scheduler
Flags: needinfo?(jopsen)
Summary: Task time out after 1 day → scheduler: downstream tasks were not resolved when a task failed
Summary: scheduler: downstream tasks were not resolved when a task failed → Downstream tasks not resolved by scheduler when a task they depend on is a) resolved as failed or b) resolved as exception with no more reruns available
> It looks like the scheduler did not resolve the downstream tasks, and so the tasks were only
> resolved once the deadline was exceeded, and this resolution was performed by the queue
> (since they were not scheduled, no worker claimed them).

This is the intended behaviour! The scheduler should **not** resolve dependent tasks. I don't see the problem, these semantics are sane. If we want treeherder to display something else, maybe it should not show things before they are scheduled? Or it could show them as not-going-to-run when a required task is blocking; that is a different matter. We can solve this at the treeherder integration level. But I have no idea what people want in treeherder, and it's not defined what TH supports.
Flags: needinfo?(jopsen)
Once it becomes impossible for a task to run (since it depends on a task that resolved unsuccessfully and has no more available runs) I think it does not make sense to leave it in a pending state, which inherently suggests that it may resolve as successful at some point in the future, since I believe under no circumstances can the task ever be executed. Also when it resolved as deadline-exceeded, you have to go digging to find out why.

Maybe the scheduler could resolve tasks as "exception"/"parent-task-unsuccessful" or "exception"/"pre-condition-failure" or "exception"/"permanently-blocked" (or something similar) in the case where under no circumstances can the tasks that a task depends on all be resolved as successful.

When we see we have "exception"/"deadline-exceeded" we do not know if the deadline was exceeded because:

a) performance was insufficient
   * capacity problems
   * deadline too short
   * parent tasks took too long
   * ...

or

b) regardless of what the deadline was, the task could never have resolved
   * parent task could not resolve as successful

Just capturing this difference in the result would be enormously helpful for:

a) troubleshooting issues (i.e. categorizing the type of failure)
b) prompt resolution (not needing to wait for task to timeout before getting the failure)

Maybe this is partly covered by the task-graph being identified as blocked, but having this information at a task level is also very helpful, since when a task graph is blocked, you don't immediately know which tasks are affected. Blocked tasks could already be differentiated in the Task Graph Inspector, for example, in a different colour (purple?) - just as an illustration of how differentiating this state could be useful.
Let's talk about this in our next meeting...
Flags: needinfo?(jopsen)
> Once it becomes impossible for a task to run (since it depends on a task that resolved unsuccessfully
> and has no more available runs) I think it does not make sense to leave it in a pending state,

It's not "pending", it's "unscheduled". And with reruns and other things, the task we depend on could be resolved later. We can take it at next meeting, but I doubt this is something we should address at scheduler level. If so we should do it in the next-gen scheduler... Maybe cancel-task, but I would still prefer not to deal with it.
Flags: needinfo?(jopsen)
Any solution must track in the task whether the task did not run because an explicit list of parent tasks were not successful, or it did not run because of some other reason. Between the Scheduler and the Queue, Taskcluster *knows* the list of tasks that blocked it, so the user shouldn't have to go digging. Making the user dig is bad design, since it wastes time and resources, raising operational cost. The list of blocking tasks could also be an artifact attached to the task. In short, if taskcluster knows information that is useful for troubleshooting a failure/exception, it should provide it.
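To illustrate what recording the blockers could look like, here is a minimal sketch. It assumes a mapping of each task to the tasks it requires, plus a mapping of task to resolved state, and it assumes any unsuccessful parent has no reruns left. The function names, the `blockedBy` field and the use of "permanently-blocked" (one of the names floated earlier in this bug) are hypothetical. None of them exist in the Queue or Scheduler.

```python
def blocking_parents(task, requires, state):
    """List the required tasks that resolved unsuccessfully.

    `requires` maps task -> list of required tasks and `state` maps
    task -> resolved state; both are hypothetical in-memory stand-ins.
    Assumes unsuccessful parents have no reruns left.
    """
    return [
        parent
        for parent in requires.get(task, [])
        if state.get(parent) in ("failed", "exception")
    ]


def explain_unscheduled(task, requires, state):
    """Say why a task never ran: blocked by parents, or some other reason."""
    blockers = blocking_parents(task, requires, state)
    if blockers:
        # Hypothetical reason name taken from the suggestions in this bug.
        return {"reason": "permanently-blocked", "blockedBy": blockers}
    # No unsuccessful parent: fall back to the generic deadline reason.
    return {"reason": "deadline-exceeded", "blockedBy": []}


requires = {"test-a": ["build", "docker-image"], "lint": []}
state = {"build": "failed", "docker-image": "completed"}
print(explain_unscheduled("test-a", requires, state))
print(explain_unscheduled("lint", requires, state))
```

The same dictionary could be attached to the task as an artifact, as suggested above, so the list of blockers is available without digging through the graph.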
Thinking about this some more, we could even make it more obvious in the UI of Task Inspector and Task Graph Inspector, that when a task graph is blocked and requires human intervention (i.e. rerun required to unblock task graph) that this is highlighted in such a way that the user can't miss it. Something like, you go to the page, and you are immediately alerted that unless you can fix tasks X, Y, Z before deadline in X hours and minutes, your dependent tasks are never going to run. So there are both potential improvements to be made before the deadline expires to alert user to required human-interactions to resolve issues, and improvements after deadlines expire, to give a clearer explanation about what "blockages" caused tasks not to get scheduled.
@pmoore, I believe in solving this by improving our tools. Especially for next-gen scheduler. Note: if this is a problem for sheriffs we should handle it at treeherder integration-level. Ahh, I recall something about that... We can get the treeherder integration to not export tasks as failed if they were resolved exception with deadline-exceeded and were never scheduled. I can't recall where I discussed this with jlal but there was a discussion on the subject. I'll add it to the agenda for Monday with garndt about mozilla-taskcluster split-up.
Component: Scheduler → Integration
As discussed in the previous comments, the behavior of leaving downstream tasks as unscheduled until the deadline is exceeded is intentional. At that point the task run will be marked with a reasonCreated of "exception" and a reasonResolved of "deadline-exceeded". The solution should be to modify mozilla-taskcluster so that it does not report these jobs to treeherder [1] when this situation occurs. [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1148965#c4
Changes have been made to have mozilla-taskcluster not submit jobs that were created only to report an exception (i.e. no work was done). https://github.com/taskcluster/mozilla-taskcluster/commit/5b23872a666fd7c1b42c9e55c2696e8ab3363079
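For readers who do not want to open the commit, the idea can be sketched roughly as below. This is not the code from the linked commit (mozilla-taskcluster is a separate code base and its implementation was not consulted here). It is a Python illustration that assumes each run is a dictionary carrying the `reasonCreated` and `reasonResolved` fields named in the previous comment. The function names are hypothetical.

```python
def run_did_no_work(run):
    """True when the run exists only to record an exception.

    Assumes `run` is a dict with a `reasonCreated` field, as described
    in the comment above; the helper itself is hypothetical.
    """
    return run.get("reasonCreated") == "exception"


def runs_to_report(runs):
    """Keep only the runs that should be submitted to treeherder."""
    return [run for run in runs if not run_did_no_work(run)]


# Illustrative data: one run that was scheduled and failed, and one run
# created only to record that the deadline was exceeded.
runs = [
    {"runId": 0, "reasonCreated": "scheduled", "reasonResolved": "failed"},
    {"runId": 0, "reasonCreated": "exception", "reasonResolved": "deadline-exceeded"},
]
print(runs_to_report(runs))
```

Filtering on `reasonCreated` rather than on `reasonResolved` keeps reporting runs that were actually started and later hit the deadline, which matches the "no work was done" wording above.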
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1182491
Assignee: nobody → garndt
Component: Integration → Services