Closed Bug 1398277 Opened 7 years ago Closed 6 years ago

Retriggering a decision task can cause duplicate jobs to be scheduled

Categories

(Taskcluster :: Services, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED
mozilla63

People

(Reporter: garndt, Assigned: dustin)

References

Details

Attachments

(2 files, 2 obsolete files)

Currently when a task is retriggered, mozilla-taskcluster will respond to the action and duplicate that node along with all descendants. Tasks scheduled by a gecko decision task are made to depend on the decision task. In the situation of retriggering, this means that the decision task along with all the dependent tasks are recreated. Once the retriggered decision task runs, it will then schedule another duplicate set of jobs.
I think the fix to this is to make sure that the "regular" retrigger action doesn't apply to the decision task, and define another retrigger action specifically for the decision task that just duplicates it, without duplicating child tasks. Or I guess since that's all implemented in Python, it could just detect action/decision tasks and not duplicate their children.
Mentor: dustin
Keywords: good-first-bug
Priority: -- → P5
Hi, I would like to work on this bug. Can you guide me to the relevant code?
Flags: needinfo?(dustin)
The retrigger action is implemented here: https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py I think the best solution would be to modify that code as suggested in the second paragraph in comment 1.
Flags: needinfo?(dustin)
Hey, This is my first attempt at contributing to an open source project. It would be highly appreciable if you could guide me as to how to get started. Thanks.
Hi! I suspect ydidwania has started working on this. Let's see if they are still working before you jump in.
Flags: needinfo?(didwaniayashvardhan)
I think ydidwania is working on some other issue(1349261). Please confirm Thanks.
(In reply to yuktikirtani from comment #6) > I think ydidwania is working on some other issue(1349261). > Please confirm > Thanks. Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is right, I am still working on this. You are however,free to try out other bugs you find interesting.
Flags: needinfo?(didwaniayashvardhan)
(In reply to ydidwania from comment #7) > (In reply to yuktikirtani from comment #6) > > I think ydidwania is working on some other issue(1349261). > > Please confirm > > Thanks. > > Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is > right, I am still working on this. You are however,free to try out other > bugs you find interesting. Hey, Okay no problem
I think modifying the code as below should work. if task_id != decision_task_id: if input.get('downstream'): to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes to_run = to_run & set(label_to_taskid.keys()) with_downstream = ' (with downstream) ' Basically adding downstream tasks to `to_run` only when the task is not a decision task. Are action tasks different from decision tasks? I didn't find any mention to action task while going through any of the docs. Can you tell me why `to_run` is a set of labels rather than task ids.There can be many tasks with the same label.
Flags: needinfo?(dustin)
That sounds like a good plan! Action tasks are sort of like decision tasks, in that they call queue.createTask -- but they serve to modify an already-existing push, rather than start a new one. It is a set of labels because that represents what we want to retrigger. If you think of a taskId as a specific instance of a label, it's "do this <label> again" rather than repeating the specific instance (which has already failed and can't be further modified).
Flags: needinfo?(dustin)
Are action tasks covered in the condition task_id != decision_task_id
Flags: needinfo?(dustin)
I'm not worried about that right now -- let's just worry about decision tasks :)
Flags: needinfo?(dustin)
Assignee: nobody → didwaniayashvardhan
Comment on attachment 8916790 [details] Bug 1398277 - Fixed typo https://reviewboard.mozilla.org/r/187846/#review193172 Looks good! I triggered a try push, and will try re-triggering that push's decision task.
Attachment #8916790 - Flags: review?(dustin) → review+
I found this in taskcluster/taskgraph/graph.py def transitive_closure(self, nodes, reverse=False): """ Return the transitive closure of <nodes>: the graph containing all specified nodes as well as any nodes reachable from them, and any intervening edges. If `reverse` is true, the "reachability" will be reversed and this will return the set of nodes that can reach the specified nodes. Example ------- a ------> b ------> c | `-------> d transitive_closure([b]).nodes == set([a, b]) transitive_closure([c]).nodes == set([c, b, a]) transitive_closure([c], reverse=True).nodes == set([c]) transitive_closure([b], reverse=True).nodes == set([b, c, d]) """ 1) I went through the implementation. I think it is reversed according to the explanation provided. The examples are correct. There's this line in retrigger.py: to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes which is expected to fetch the set of all downstream tasks from a given task. 2) Regarding the error, I think we need to add the keys as is done when we were duplicating the child tasks as well. Something like to_run = to_run & set(label_to_taskid.keys())
Flags: needinfo?(dustin)
The error is coming from a *different* call to transitive_closure, from what I can tell.
Flags: needinfo?(dustin)
Yeah its because of this call - [task 2017-10-10T15:11:49.623251Z] File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/actions/util.py", line 44, in create_tasks [task 2017-10-10T15:11:49.623313Z] target_graph = full_task_graph.graph.transitive_closure(to_run) I got confused when I saw the examples in transitive_closure. I think they contradict the explanation of transitive closure of a graph written above.
There is some error in `to_run`, the set of labels that we are passing to `create_task() . Is there a way to test the retrigger action locally?
Flags: needinfo?(dustin)
Yes -- if you look at the task details in https://tools.taskcluster.net/groups/TuAbomxORdW52JbuFvKRlQ/tasks/NG1i-Na7S6yJTPpznFC93A/details you will see a bunch of environment variables and a command line. You can set those variables on your command line, and then run the last bit of that command ("./mach --log-no-times taskgraph action-callback").
Flags: needinfo?(dustin)
The Gecko Decision Task does not get added to the task graph, hence the statement assert nodes <= self.nodes in https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/graph.py fails. This failure happens even with the current version of retrigger.py that is in use( I tried it locally), but from the call in line 59 (https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#59) . So my question is why isn't it getting added to the full_task_graph.json or does the problem exist only locally?
Flags: needinfo?(dustin)
That's because the decision task isn't in the task graph -- it *creates* the task graph. So `create_tasks` is the wrong tool for retriggering the decision task. It looks like you want `create_task_from_def`: https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/util.py#55 you will need to replace the `created`, `deadline`, and `expires` attributes in the task definition with new timestamps using `{"relative-datestamp": ".."}`.
Flags: needinfo?(dustin)
How do I obtain the task definition of the decision task since it is not present in the task graph? Will 'get_decision_parameters' in : https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/decision.py#131 be useful here?
Flags: needinfo?(dustin)
You can call https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#task from the action task to get the task definition, passing it the decision task's taskid.
Flags: needinfo?(dustin)
This is where I am at currently when task_id == decision_task_id new_task_definition = requests.get('https://queue.taskcluster.net/v1/task/%s'%(task_id)).json() new_task_definition['created'] = dict({"relative-datestamp": "0 seconds"}) new_task_definition['deadline'] = dict({"relative-datestamp": "4 days"}) new_task_definition['expired'] = dict({"relative-datestamp": "14 days"}) new_task_id=slugid() create_task_from_def(new_task_id, new_task_definition, parameters['level'])t logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times)) I am unable to test it because I don't have the rights( thats what scopes mean, right?) to create a task using the api. A decision task shouldn't be retriggered multiple times hence I did not. Did I miss anything?
Flags: needinfo?(dustin)
I think that looks good. Do you want to push the patch and we can try it out in try?
Flags: needinfo?(dustin)
Pushed. Should I add comments in the code?
Flags: needinfo?(dustin)
Comment on attachment 8923583 [details] Bug 1398277-Retrigger decision task from task definition; https://reviewboard.mozilla.org/r/194714/#review200702 Sorry for the delay -- if you can fix this up I can push the buttons in try and see how it works. ::: taskcluster/taskgraph/actions/retrigger.py:80 (Diff revision 1) > > - logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times)) > + logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times)) > + else: > + print ("hello") > + task_id = 'QAs5KRluQqe6PKtx_DpQLQ' > + print (task_id) Can you remove this? Otherwise it won't work when I push the buttons on the try push :)
Attachment #8923583 - Flags: review?(dustin)
Attachment #8923583 - Attachment is obsolete: true
I made a commit with the lines removed and pushed only that extra commit for review. I thought it will get mapped to the existing patch (now totalling 3 commits) but instead that patch became obsolete. I have a feeling I messed up. If required I will push the entire revset again for review :)
That looks like a mozreview bug to me. I've triggered the try job, and will try re-triggering the decision task on it.
[task 2017-11-03T16:08:48.528026Z] Creating task with taskId L7ATmV-WT--bbaDkqvwZdg for Gecko Decision Task [task 2017-11-03T16:08:48.529131Z] Starting new HTTP connection (1): taskcluster [task 2017-11-03T16:08:49.377623Z] "PUT /queue/v1/task/L7ATmV-WT--bbaDkqvwZdg HTTP/1.1" 400 3848 [task 2017-11-03T16:08:49.378432Z] [task 2017-11-03T16:08:49.378470Z] Schema Validation Failed! [task 2017-11-03T16:08:49.378504Z] Rejecting Schema: http://schemas.taskcluster.net/queue/v1/create-task-request.json# [task 2017-11-03T16:08:49.378525Z] Errors: [task 2017-11-03T16:08:49.378562Z] * data should NOT have additional properties [task 2017-11-03T16:08:49.378574Z] ---- [task 2017-11-03T16:08:49.378586Z] method: createTask [task 2017-11-03T16:08:49.378599Z] errorCode: InputValidationError [task 2017-11-03T16:08:49.378609Z] statusCode: 400 [task 2017-11-03T16:08:49.378622Z] time: 2017-11-03T16:08:49.191Z That error means that the payload to createTask had an extra top-level attribute that it should not. It doesn't say *what* that attribute was, unfortunately. But! I think I see it: it should be `expires` and not `expired`.
Fixed the typo. Sorry for letting that error pass by. Hope it works fine now :)
Hm, that didn't work either: [task 2017-11-05T13:39:55.292866Z] In other words you are missing scopes from one of the options: [task 2017-11-05T13:39:55.292879Z] * Option 0: [task 2017-11-05T13:39:55.292911Z] - "queue:route:notify.email.dmitchell@mozilla.com.*", and [task 2017-11-05T13:39:55.292935Z] - "queue:route:notify.email.dmitchell@mozilla.com.on-failed", and [task 2017-11-05T13:39:55.292953Z] - "queue:route:notify.email.dmitchell@mozilla.com.on-exception" If it's OK with you, I will take this bug and figure out how to finish it up. I suspect we may have started out in the wrong direction here and I need to rethink it. Which is not to diminish your work -- I appreciate it! Hopefully some of the projects we discussed in irc will be more tractable.
Yeah, its fine with me. Thanks for all your help on this. Learnt a lot. I hope you fix this bug soon:) I'll start on the bug you filed just yesterday.
Attachment #8916790 - Flags: review?(dustin)
Assignee: didwaniayashvardhan → dustin
Mentor: dustin
I'm deeply suspicious that the fix in that try job will not have the scopes to run a decision task. At least right now, and possibly once we start assigning more fine-grained scopes. We'll see..
Am I cynical or just accurate? You decide.. [task 2018-07-06T18:16:27.986Z] You do not have sufficient scopes. You are missing the following scopes: [task 2018-07-06T18:16:27.986Z] [task 2018-07-06T18:16:27.986Z] ``` [task 2018-07-06T18:16:27.986Z] { [task 2018-07-06T18:16:27.986Z] "AllOf": [ [task 2018-07-06T18:16:27.986Z] "assume:repo:hg.mozilla.org/try:branch:default", [task 2018-07-06T18:16:27.986Z] "in-tree:hook-action:project-gecko/in-tree-action-1-*" [task 2018-07-06T18:16:27.986Z] ] [task 2018-07-06T18:16:27.986Z] } [task 2018-07-06T18:16:27.986Z] ``` So I think that retriggering a decision task requires an actionPerm, which requires bug 1472792. It also means bug 1470622 can't go ahead yet, as sheriffs need to be able to retrigger decision tasks.
Blocks: 1470622, 1415868
Depends on: 1472792
This still needs to wait until bug 1472792 is deployed in treeherder.
Attached patch bug1398277.patch (deleted) — Splinter Review
Attachment #8916790 - Attachment is obsolete: true
Attachment #8994319 - Flags: review?(bstack)
Comment on attachment 8994319 [details] [diff] [review] bug1398277.patch I wish all of these datestamps would have been relative from the get-go. Assuming you ran flake8 on this, r+!
Attachment #8994319 - Flags: review?(bstack) → review+
Haha, you know me so well. Yes, flakes fixed. This can't land until bug 1472792 lands in treeherder -- that's been waiting for two weeks now, and will wait another three until I'm back from PTO.
Comment on attachment 8994315 [details] Bug 1398277: add a special permission for retriggering decision tasks Brian Stack [:bstack] has approved the revision.
Attachment #8994315 - Flags: review+
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Target Milestone: mozilla61 → mozilla63
Component: Integration → Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: