Closed Bug 1259627 Opened 9 years ago Closed 7 years ago

Stop using TC scheduler API

Categories

(Release Engineering :: Release Automation: Other, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

References

()

Details

Attachments

(5 files)

The scheduler API is deprecated now that TC has added support for task dependencies. Instead of submitting a graph, we should submit individual tasks with dependencies specified in task.dependencies and grouped using taskGroupId. Tasks should be submitted in topological order (TC verifies dependencies). In case of failure we should resubmit only the failed tasks. Sorting and mass submission sound like good candidates for the TC client.
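For context, a minimal sketch of that flow, assuming the taskcluster Python client's Queue.createTask and the third-party toposort module; the helper name and task shapes here are illustrative, not the real release-runner code, and credential handling is omitted.

```python
import taskcluster
from toposort import toposort_flatten

def submit_tasks(tasks, task_group_id):
    """tasks: dict mapping taskId -> task definition (with optional 'dependencies')."""
    queue = taskcluster.Queue()  # credentials/options omitted in this sketch
    # Map each task to the set of tasks it depends on, so a task is only
    # submitted after everything it depends on already exists in the queue.
    dep_map = {task_id: set(defn.get("dependencies", []))
               for task_id, defn in tasks.items()}
    for task_id in toposort_flatten(dep_map):
        if task_id not in tasks:
            continue  # dependency created outside this submission
        defn = dict(tasks[task_id], taskGroupId=task_group_id)  # group the release
        queue.createTask(task_id, defn)
```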
Depends on: 1316340
Priority: -- → P2
It would be great if we could get this done, as this might be the last thing that requires us to keep the scheduler alive (but not well). For anyone looking at this bug that fancies taking a stab, it looks like this is where the scheduler is called: https://hg.mozilla.org/build/tools/file/9193e0759659/buildfarm/release/release-runner.py#l517
Mentor: pmoore
Whiteboard: [good first bug]
Erm, I've just read the dependent bugs and realise this might not be entirely trivial after all! :D
Mentor: pmoore
Whiteboard: [good first bug]
My understanding is that there are two ways to achieve this:

1. move releasetasks in-tree (a long project)
2. rewrite releasetasks to use the queue API instead of scheduler

The latter is a little complicated since the releasetasks graph has tasks with more dependencies than the queue API allows, so insertion of some dummy tasks will be required. There's disagreement as to how difficult that might be :)
I hear Rail has an implementation of a solution.
Flags: needinfo?(rail)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> My understanding is that there are two ways to achieve this:
> 1. move releasetasks in-tree (a long project)
> 2. rewrite releasetasks to use the queue API instead of scheduler
>
> The latter is a little complicated since the releasetasks graph has tasks
> with more dependencies than the queue API allows, so insertion of some dummy
> tasks will be required. There's disagreement as to how difficult that might
> be :)

Rail managed to implement a nice solution for 2) last week during the All-hands, one that uses the Queue API by reorganizing the current releasetasks graph into a large flattened list. There's a topological sort involved, as well as some other processing. As of Friday he had managed to submit a smaller subgraph to TC via our staging environment. I remember he mentioned some other complications he needed to tackle before declaring victory over this. With the current tcmigration project focus going on, I can look at finalizing this if Rail wants to hand it off.
I'll put together 2 different solutions I have. Leaving NI
I ended up with 2 different approaches:

1) Extend the existing tools to convert the generated graph into a list of tasks
* releasetasks patches: https://github.com/mozilla-releng/releasetasks/compare/master...rail:queue_api?expand=1
** convert the existing graph into a list of tasks
** add dummy tasks in between tasks with more than 100 dependent tasks
** add a task (the "atomic" task) to prevent ASAP scheduling in case the whole submission fails for some reason (mistakes, scopes, network, etc.)
** sort the tasks topologically
* releaserunner patch: https://gist.github.com/rail/4074a19313ca3970f9a7c9123b5b74ee
** resolve the atomic task to start the execution
** graph2 generation should be tested

I tested 1) on jamun and it worked fine. It takes longer to submit the tasks (no async!), but we don't have the max 3 partials limit anymore!

2) A py35 CLI tool that does the same using a saved copy of the releasetasks-generated graph
* async submission (to be parallelized)
* release runner should save the generated graph to a file and pass it to this tool
* to be puppetized

I have not tested this version at all. I will be busy with some other stuff until the end of August, but feel free to adapt the patches.
Flags: needinfo?(rail)
Oh, forgot to point to the second tool: https://github.com/rail/graph2tasks
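A hypothetical sketch of the dummy-task trick from approach 1) above: when a task would exceed the queue's per-task dependency limit, fan its dependencies out into no-op dummy tasks and depend on those instead. MAX_DEPS, split_dependencies and make_dummy_task are illustrative names, not the real releasetasks helpers.

```python
import taskcluster

MAX_DEPS = 100  # the per-task dependency limit mentioned above

def split_dependencies(task_def, make_dummy_task):
    """Return (task_def, extra_tasks) with no task exceeding MAX_DEPS dependencies."""
    deps = task_def.get("dependencies", [])
    if len(deps) <= MAX_DEPS:
        return task_def, {}
    extra_tasks = {}
    new_deps = []
    # Chunk the original dependencies under no-op dummy tasks, then make the
    # real task depend on the dummies instead.
    for i in range(0, len(deps), MAX_DEPS):
        dummy_id = taskcluster.slugId()
        extra_tasks[dummy_id] = make_dummy_task(dependencies=deps[i:i + MAX_DEPS])
        new_deps.append(dummy_id)
    return dict(task_def, dependencies=new_deps), extra_tasks
```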
Note to self (context) for later:

* we're to flatten the existing graph and submit it in chunks via the TC Queue
* rail already implemented a good part of this; we still need more (unit) testing of that piece, and to deal with the submission
* currently, the only caveat is that it takes a long time to submit those tasks (about ~20-30min for a release graph), mostly because it's all done sequentially. We need to bring async muscle to this in a smart way
* once we flatten the graph, topological sort and all, we'll have a chunk of tasks with 0 dependencies, another chunk with 1 dep, etc. The chunks shall be submitted sequentially, but the tasks within a chunk can be submitted concurrently (see the sketch below).
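A rough sketch of that chunked submission strategy, assuming the toposort module (toposort() yields "levels": first the tasks with no unresolved dependencies, then the tasks that depend only on those, and so on). The function name and concurrency value are assumptions for illustration.

```python
from concurrent import futures
from toposort import toposort

def submit_in_chunks(queue, tasks, concurrency=10):
    dep_map = {tid: set(defn.get("dependencies", [])) for tid, defn in tasks.items()}
    for level in toposort(dep_map):
        to_submit = [tid for tid in level if tid in tasks]
        # Chunks go strictly in order; tasks within a chunk go in parallel.
        with futures.ThreadPoolExecutor(concurrency) as executor:
            submitted = {executor.submit(queue.createTask, tid, tasks[tid]): tid
                         for tid in to_submit}
            for future in futures.as_completed(submitted):
                future.result()  # surface any submission failure immediately
```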
Hoping to get my hands on this in the next two weeks, now that I'm off releaseduty cycle.
Assignee: nobody → mtabara
Hijacking this because I'm back!
Assignee: mtabara → rail
Attached file PR for releasetask (deleted) —
Attachment #8900706 - Flags: review?(mtabara)
Attachment #8900706 - Flags: review?(mtabara) → review+
Comment on attachment 8900704 [details]
Bug 1259627 - [tools] Use TC Queue API

https://reviewboard.mozilla.org/r/172146/#review177782

Looking good to me - just a few nits and a bunch of questions. Also, the most concerning thing for me - shouldn't we attempt a staging release for Fennec as well? See my comment below.

::: buildfarm/release/release-runner.py:414
(Diff revision 1)
>
>     # XXX: Doesn't work with neither Fennec nor Thunderbird
>     platforms = branchConfig['release_platforms']
>
>     try:
> -       graph_id = slugId()
> +       task_group_id = None

Can't we remove this line since the task group id comes from releasetasks? I doubt there's any warning downstream.

::: buildfarm/release/release-runner.py:523
(Diff revision 1)
>     import pprint
> -   log.debug(pprint.pformat(graph, indent=4, width=160))
> -   print(scheduler.createTaskGraph(graph_id, graph))
> +   for task_id, task_def in tasks.items():
> +       log.debug("%s ->\n%s", task_id,
> +                 pprint.pformat(task_def, indent=4, width=160))
> +   submit_parallelized(queue, tasks)
> +   resolve_task(queue, toplevel_task_id)

Beautiful!

::: buildfarm/release/releasetasks_graph_gen.py:134
(Diff revision 1)
> +       log.debug("%s ->\n%s", task_id,
> +                 pprint.pformat(task_def, indent=4, width=160))
> +
>     if not options.dry_run:
> -       print scheduler.createTaskGraph(graph_id, graph)
> +       submit_parallelized(queue, tasks)
> +       resolve_task(queue, toplevel_task_id)

Wonderful!

::: buildfarm/release/releasetasks_graph_gen.py:218
(Diff revision 1)
>     tc_config = {
>         "credentials": {
>             "clientId": get_config(release_runner_config, "taskcluster", "client_id", None),
>             "accessToken": get_config(release_runner_config, "taskcluster", "access_token", None),
> +       },
> +       "maxRetries": 12,

Question: Since this is global, which task falls into this?

::: lib/python/kickoff/__init__.py:370
(Diff revision 1)
>     kwargs["extra_balrog_submitter_params"] = extra_balrog_submitter_params
>
> -   # don't import releasetasks until required within function impl to avoid global failures
> -   # during nosetests
> -   from releasetasks import make_task_graph
> -   return make_task_graph(**kwargs)
> +   # don't import releasetasks until required within function impl to avoid
> +   # global failures during nosetests
> +   from releasetasks import make_tasks
> +   return make_tasks(**kwargs)

Really nice approach! Easy to back out if we need to. My only concern is that I'd like to see a Fennec staging release as well on jamun. For Fennec we have a decision task within the graph, and I'd feel safer if we tested that case before rolling this out to production.

::: lib/python/kickoff/tc.py:29
(Diff revision 1)
> +   def submit_task(t_id, t_def):
> +       log.info("Submitting %s", t_id)
> +       queue.createTask(t_id, t_def)
> +
> +   with futures.ThreadPoolExecutor(CONCURRENCY) as e:
> +       fs = {}

Nit: maybe use a more intuitive variable name here instead of `fs`, e.g. "submitted_future" or alike.
Attachment #8900704 - Flags: review?(mtabara) → review+
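The review above refers to a resolve_task() helper that unblocks the graph once everything has been submitted. The snippet below is only a hypothetical sketch of what such a helper might do with the Queue API (claim the "atomic" task and immediately report it complete), not the actual kickoff/tc.py code; the workerGroup/workerId values are placeholders.

```python
def resolve_task(queue, task_id):
    """Claim and complete the top-level 'atomic' task so that the rest of the
    graph only starts running after every task has been submitted."""
    # Run 0 is assumed to be the pending run of the dummy/atomic task.
    queue.claimTask(task_id, 0, {"workerGroup": "releaserunner",
                                 "workerId": "releaserunner"})
    queue.reportCompleted(task_id, 0)
```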
Comment on attachment 8900705 [details]
Bug 1259627 - Update releaserunner dependencies

https://reviewboard.mozilla.org/r/172150/#review177834

::: modules/releaserunner/manifests/init.pp:60
(Diff revision 1)
>     'simplejson==2.6.2',
>     'singledispatch==3.4.0.3',
>     'six==1.9.0',
>     'sqlalchemy-migrate==0.7.2',
>     'taskcluster==0.0.24',
> +   'toposort==1.0',

Any reason we're not using the latest 1.5 version of it? https://pypi.python.org/pypi/toposort
Attachment #8900705 - Flags: review?(mtabara) → review+
Attachment #8900706 - Flags: checked-in+
Comment on attachment 8901870 [details]
Bug 1259627 - Keep printing out the task group ID

https://reviewboard.mozilla.org/r/173302/#review178662

Good catch, lgtm!
Attachment #8901870 - Flags: review+
I was chatting with :jlorenzo earlier tonight as he was doing a Fennec staging release to test things out, and we just realized we can maybe get rid of the "graph" scopes from releasetasks. AFAIK, *before*, we'd define scopes in each task plus at the graph level, as needed for the Scheduler API. But now that the graph is just a formal step before flattening it into a list of tasks that are sent individually via the Queue API, I think we no longer need them. :jlorenzo was having trouble with some missing scopes and tweaked only the task-level scope, not the graph one, and it still worked. I think this proves the theory above.

To conclude, I think we can actually remove all scopes from [1]. Might be worth testing this with a staging release tomorrow morning. Leaving myself a NI on this one.

[1]: https://github.com/mozilla-releng/releasetasks/blob/master/releasetasks/templates/mobile/release_graph.yml.tmpl#L21
Flags: needinfo?(mtabara)
yup, we can kill those.
Filed bug 1394852 to track removal of those graph-level scopes mentioned in comment 26.
Flags: needinfo?(mtabara)
Attached file Funsize scheduler PR (deleted) —
Attachment #8902642 - Flags: review?
Attachment #8902642 - Flags: review? → review+
Comment on attachment 8902642 [details]
Funsize scheduler PR

to be deployed soon!
Attachment #8902642 - Flags: checked-in+
We hit some artifact expires errors like:

Aug 30 13:05:48 signingworker-6.srv.releng.use1.mozilla.com python: signingworker taskcluster.exceptions.TaskclusterRestFailure: Artifact expires (2017-09-13T20:05:47.672Z) after the task expiration 2017-09-06T12:48:31.067Z (task.expires < expires) - this is not allowed, artifacts may not expire after the task they belong to expires

We *think* https://github.com/mozilla-releng/funsize/commit/96d4d90350a8a1ee378f1b0dddfcd45acd4bff53 should fix that symptom... we'll see during tonight's nightlies, triggering in <1hr.

But when digging, I found another possible fix:

At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L176 we set the artifact expires to 2 weeks from now, but "now" is well after the graph was created.

At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L71 we get the task definition. `task.get('expires')` should be the task's expiration datestring, if any. We could fall back to an arrow, e.g. `task.get('expires', arrow.now().replace(weeks=2).isoformat())`.

`sign()` has access to the task definition, so we could create `expires` at that point. It calls `download_and_sign_file()`, which calls `create_artifact()`, and it also calls `create_artifact()` directly. If we pass `expires` down as args or as a self.variable, we can then make sure we never try to create an artifact with an `expires` longer than `task.expires`.
Flags: needinfo?(rail)
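A sketch of the clamping idea described above: pick an artifact expiry that never exceeds the task's own expiry. The function name is hypothetical and this is not the shipped fix (the actual fix was removing the offending field in funsize); it uses arrow's shift() rather than the older replace(weeks=2) form quoted above.

```python
import arrow

def artifact_expiry(task_def, default_weeks=2):
    """Return an ISO datestring for the artifact's expires, capped at task.expires."""
    default = arrow.utcnow().shift(weeks=default_weeks)
    task_expires = task_def.get("expires")
    # If the task expires sooner than our default, use the task's expiry so the
    # queue never rejects the artifact with "task.expires < expires".
    if task_expires and arrow.get(task_expires) < default:
        return task_expires
    return default.isoformat()
```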
Yeah, I noticed that issue yesterday and https://github.com/mozilla-releng/funsize/pull/82 fixed it. I shouldn't have added that field in the first place. :)
Flags: needinfo?(rail)
Everything looks good so far, no backouts!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I've removed:

"Scheduler": "http://references.taskcluster.net/scheduler/v1/api.json",
"SchedulerEvents": "http://references.taskcluster.net/scheduler/v1/exchanges.json",

from http://references.taskcluster.net/manifest.json so that the scheduler APIs don't get generated in our clients.
The following scopes are still dotted around our clients/roles:

scheduler:*
scheduler:create-task-graph
scheduler:extend-task-graph
scheduler:extend-task-graph:*
scheduler:route:gaia-taskcluster
scheduler:route:taskcluster-github.*

I intend to remove these - shout out if you see any issue with me doing this.
Sorry, I see now this bug was about not continuing to use the scheduler APIs, rather than about sunsetting the scheduler. Therefore I've moved the sunsetting parts into bug 1399437. Sorry for the noise! :)