Closed Bug 1458896 Opened 7 years ago Closed 4 years ago

A downstream task blocked by failed dependencies can be run if cancelled then rerun

Categories

(Taskcluster :: Services, defect, P5)

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: jlorenzo, Unassigned)

Details

Disclaimer: This is a behavior several people in releng have used (myself included). It has some value to us (more details below). We want to clarify whether this is intended and whether it represents any risk to the TC team.

STEPS TO REPRODUCE

1. Generate a taskgraph like this one:
   > A <--- B <--- C
   B being a task depending on A, and C on B. A, B, and C all define `"requires": "all-completed"`.
2. Make A fail or end up in an exception state. => Notice B doesn't start at all.
3. With a Taskcluster client that has the right scopes, cancel B. => See B is now reported as "exception".
4. With the same TC client, rerun B.

RESULTS

B gets picked up by a worker and runs to completion. Then C starts. (A sketch of steps 3 and 4 with the Python client follows this comment.)

VALUE PROVIDED TO RELENG

* Testing. This lets us quickly test downstream tasks that don't depend on the result of their dependencies. It can reduce the waiting time from 4 hours to 1 minute.
* Unblocking releases. For instance, 60.0esr has some expected failed tasks[1]. This workaround allows us to trigger the rest of the release without having to modify the taskgraph for this special release.

QUESTIONS

* Is this behavior expected? If so, \o/
* Are there any known limitations?

[1] Failed partials, because 60.0esr isn't meant to be shipped on the ESR channel yet. For instance: https://tools.taskcluster.net/groups/HOffha7JS8eISza1qm5sYA/tasks/AUU_PeHLQoCAkDPOzDg0pA/details
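For concreteness, here is a minimal sketch of steps 3 and 4 using the Python taskcluster client. The root URL, credentials, and taskId are placeholders, and the client is assumed to hold scopes that allow cancelling and rerunning this task.

```python
import taskcluster

# Placeholders: point the client at the relevant deployment with
# credentials that can cancel and rerun the blocked task B.
queue = taskcluster.Queue({
    'rootUrl': 'https://tc.example.com',
    'credentials': {'clientId': '...', 'accessToken': '...'},
})

task_id_b = '...'  # taskId of B, blocked by the failed task A

# B never started: it sits in "unscheduled" because A failed.
print(queue.status(task_id_b)['status']['state'])

# Step 3: cancelling the blocked task resolves it as "exception".
queue.cancelTask(task_id_b)

# Step 4: rerunning the now-resolved task schedules a fresh run,
# sidestepping the failed dependency; a worker then picks B up.
queue.rerunTask(task_id_b)
```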
This behaviour is definitely an interesting case. I think whether this is a security issue depends on whether the permissions to retrigger C overlap with the permission to define a task equivalent to C. This behaviour definitely feels incorrect to me, but that's just a gut instinct.

Let's not build too much on this behaviour, but in the meantime we can use it. The reason I say this is that we're investigating moving the Queue to a Postgres storage system. If we do that, this mechanism might go away, but if we have a supported endpoint, we can preserve the behaviour. For a supported mechanism here, we should probably design something which doesn't require such a complex interface :)

As for testing, is there any way that the taskgraph's shape could change to depend on an earlier task while testing?
QA Contact: jhford
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #2)

> This behaviour is definitely an interesting case. I think whether this is a
> security issue depends on whether the permissions to retrigger C overlap
> with the permission to define a task equivalent to C. This behaviour
> definitely feels incorrect to me, but that's just a gut instinct.

There are also cases where we block releasing a build (e.g., task 10/10) if tests around the build fail (e.g., task 9/10). It's a good thing that we have an override when task 9/10 failed for an ignorable error, and not so great when it failed for a very good reason.

> As for testing, is there any way that the taskgraph's shape could change to
> depend on an earlier task while testing?

It could, but it involves a more complex set of transforms, and the transforms are already complex (a rough sketch of the idea follows below). Also, this means a) we wouldn't test the production graph's shape while testing, so we might miss bustage there, and b) conversely, if we don't use the test-graph configuration for a while, the production graph could be green but the test graph could be busted for test-specific reasons.
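For illustration, a minimal sketch of what such a transform might look like. This is hypothetical, not actual releng code: the `testing` parameter, the labels, and the dependency key are all made up.

```python
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()


@transforms.add
def shortcut_dependency_for_testing(config, tasks):
    """Hypothetical: rewire a slow dependency onto a quick stand-in
    when a (made-up) `testing` parameter is set."""
    for task in tasks:
        deps = task.get('dependencies', {})
        if config.params.get('testing') and 'build' in deps:
            # Point at a cheap dummy task instead of the real 4-hour
            # build, so the downstream task can be exercised in minutes.
            deps['build'] = 'dummy-quick-build'
        yield task
```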
It sounds like there's a case to be made for some sort of way to force a graph to continue. We have a working solution (albeit an unsupported one), and a lot of how the dependencies relate will be modified while moving to Postgres, so I'm going to mark this bug as depending on the move to Postgres and make it a P5 to reflect that we have a workaround for current behaviour and a plan for creating future behaviour.
Depends on: 1436478
Priority: -- → P5
Component: Queue → Services

Docs for the Queue's `rerunTask` method:

This method reruns a previously resolved task, even if it was completed. This is useful if your task completes unsuccessfully, and you just want to run it from scratch again. This will also reset the number of retries allowed.

This method is deprecated in favour of creating a new task with the same task definition (but with a new taskId).

Remember that `retries` in the task status counts the number of runs that the queue has started because the worker stopped responding, for example because a spot node died.

Note that this operation is idempotent: if you try to rerun a task that is neither failed nor completed, this operation will just return the current task status.
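As an aside, the alternative those docs recommend looks roughly like this with the Python client. A sketch with placeholder root URL, credentials, and taskId, assuming the client's `slugId`/`stringDate` helpers:

```python
import datetime

import taskcluster

queue = taskcluster.Queue({
    'rootUrl': 'https://tc.example.com',
    'credentials': {'clientId': '...', 'accessToken': '...'},
})

old_task_id = '...'                  # the resolved task to run again
task_def = queue.task(old_task_id)   # fetch the original definition

# Refresh the timestamps, or the copy may already be past its deadline.
now = datetime.datetime.utcnow()
task_def['created'] = taskcluster.stringDate(now)
task_def['deadline'] = taskcluster.stringDate(now + datetime.timedelta(days=1))

queue.createTask(taskcluster.slugId(), task_def)  # submit under a fresh taskId
```

Note that the copied definition keeps its `dependencies` and `requires`, so unlike the cancel-then-rerun trick above, this route stays blocked behind the failed upstream task.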

The first and last sentences of those docs seem to be inaccurate: a rerun apparently works on an "unscheduled" task as well. So we should decide whether we want to keep that behavior, move it to another method (forceRun?) with different scopes, or disable it altogether.
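To make the discrepancy concrete, a small sketch (placeholder taskId and credentials) of the states involved. The docs only promise a rerun for "failed" or "completed" tasks, while a dependency-blocked task sits in "unscheduled":

```python
import taskcluster

queue = taskcluster.Queue({
    'rootUrl': 'https://tc.example.com',
    'credentials': {'clientId': '...', 'accessToken': '...'},
})

task_id = '...'
state = queue.status(task_id)['status']['state']

if state in ('failed', 'completed'):
    queue.rerunTask(task_id)  # the documented rerun use case
elif state == 'unscheduled':
    # Not covered by the documented behaviour: the workaround from the
    # report above cancels first, then reruns (and, per the observation
    # above, a plain rerun apparently works here too).
    queue.cancelTask(task_id)
    queue.rerunTask(task_id)
```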

This doesn't really relate to Postgres at all at this point, although this work won't get scheduled until the Postgres migration is finished anyway.

No longer depends on: 1436478
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → MOVED