Closed Bug 1259627 Opened 9 years ago Closed 7 years ago

Stop using TC scheduler API

Categories

(Release Engineering :: Release Automation: Other, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

References

()

Details

Attachments

(5 files)

The scheduler API is deprecated now that TC has added support for task dependencies. Instead of submitting a graph, we should submit individual tasks with dependencies specified in task.dependencies and grouped using taskGroupId. Tasks should be submitted in topological order (TC verifies dependencies). In case of failure we should resubmit only the failed tasks. Sorting and mass submission sound like good candidates for the TC client.
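For context, a minimal sketch of that flow, assuming the taskcluster Python client's Queue.createTask and the third-party toposort module; the helper name and task shapes here are illustrative, not the real release-runner code, and credential handling is omitted.

```python
import taskcluster
from toposort import toposort_flatten

def submit_tasks(tasks, task_group_id):
    """tasks: dict mapping taskId -> task definition (with optional 'dependencies')."""
    queue = taskcluster.Queue()  # credentials/options omitted in this sketch
    # Map each task to the set of tasks it depends on, so a task is only
    # submitted after everything it depends on already exists in the queue.
    dep_map = {task_id: set(defn.get("dependencies", []))
               for task_id, defn in tasks.items()}
    for task_id in toposort_flatten(dep_map):
        if task_id not in tasks:
            continue  # dependency created outside this submission
        defn = dict(tasks[task_id], taskGroupId=task_group_id)  # group the release
        queue.createTask(task_id, defn)
```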
Depends on: 1316340
Priority: -- → P2
It would be great if we could get this done, as this might be the last thing that requires us to keep the scheduler alive (but not well). For anyone looking at this bug that fancies taking a stab, it looks like this is where the scheduler is called: https://hg.mozilla.org/build/tools/file/9193e0759659/buildfarm/release/release-runner.py#l517
Mentor: pmoore
Whiteboard: [good first bug]
Erm, I've just read the dependent bugs and realise this might not be entirely trivial after all! :D
Mentor: pmoore
Whiteboard: [good first bug]
My understanding is that there are two ways to achieve this:

1. move releasetasks in-tree (a long project)
2. rewrite releasetasks to use the queue API instead of scheduler

The latter is a little complicated since the releasetasks graph has tasks with more dependencies than the queue API allows, so insertion of some dummy tasks will be required. There's disagreement as to how difficult that might be :)
I hear Rail has an implementation of a solution.
Flags: needinfo?(rail)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> My understanding is that there are two ways to achieve this:
> 1. move releasetasks in-tree (a long project)
> 2. rewrite releasetasks to use the queue API instead of scheduler
>
> The latter is a little complicated since the releasetasks graph has tasks
> with more dependencies than the queue API allows, so insertion of some dummy
> tasks will be required. There's disagreement as to how difficult that might
> be :)

Rail managed to implement a nice solution for 2) last week during the All-hands, one that uses the Queue API by reorganizing the current releasetasks graph into a large flattened list. There's a topological sort involved, as well as some other processing. As of Friday he had managed to submit a smaller subgraph to TC via our staging environment. I remember he mentioned some other complications he needed to tackle before declaring victory over this. With the current tcmigration project focus going on, I can look at finalizing this if Rail wants to hand it off.
I'll put together 2 different solutions I have. Leaving NI
I ended up with 2 different approaches:

1) Extend the existing tools to convert the generated graph into a list of tasks
* releasetasks patches: https://github.com/mozilla-releng/releasetasks/compare/master...rail:queue_api?expand=1
** convert the existing graph into a list of tasks
** add dummy tasks in between tasks with more than 100 dependent tasks
** add a task (the "atomic" task) to prevent ASAP scheduling in case the whole submission fails for some reason (mistakes, scopes, network, etc.)
** sort the tasks topologically
* releaserunner patch: https://gist.github.com/rail/4074a19313ca3970f9a7c9123b5b74ee
** resolve the atomic task to start the execution
** graph2 generation should be tested

I tested 1) on jamun and it worked fine. It takes longer to submit the tasks (no async!), but we don't have the max 3 partials limit anymore!

2) A py35 CLI tool that does the same using a saved copy of the releasetasks-generated graph
* async submission (to be parallelized)
* release runner should save the generated graph to a file and pass it to this tool
* to be puppetized

I have not tested this version at all. I will be busy with some other stuff until the end of August, but feel free to adapt the patches.
Flags: needinfo?(rail)
Oh, forgot to point to the second tool: https://github.com/rail/graph2tasks
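A hypothetical sketch of the dummy-task trick from approach 1) above: when a task would exceed the queue's per-task dependency limit, fan its dependencies out into no-op dummy tasks and depend on those instead. MAX_DEPS, split_dependencies and make_dummy_task are illustrative names, not the real releasetasks helpers.

```python
import taskcluster

MAX_DEPS = 100  # the per-task dependency limit mentioned above

def split_dependencies(task_def, make_dummy_task):
    """Return (task_def, extra_tasks) with no task exceeding MAX_DEPS dependencies."""
    deps = task_def.get("dependencies", [])
    if len(deps) <= MAX_DEPS:
        return task_def, {}
    extra_tasks = {}
    new_deps = []
    # Chunk the original dependencies under no-op dummy tasks, then make the
    # real task depend on the dummies instead.
    for i in range(0, len(deps), MAX_DEPS):
        dummy_id = taskcluster.slugId()
        extra_tasks[dummy_id] = make_dummy_task(dependencies=deps[i:i + MAX_DEPS])
        new_deps.append(dummy_id)
    return dict(task_def, dependencies=new_deps), extra_tasks
```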
Note to self (context) for later:

* we're to flatten the existing graph and submit it in chunks via the TC Queue
* rail already implemented a good part of this; we still need more (unit) testing of that piece, and to deal with the submission
* currently, the only caveat is that it takes a long time to submit those tasks (about ~20-30min for a release graph), mostly because it's all done sequentially. We need to bring async muscle to this in a smart way
* once we flatten the graph, topological sort and all, we'll have a chunk of tasks with 0 dependencies, another chunk with 1 dep, etc. The chunks shall be submitted sequentially, but the tasks within a chunk can be submitted concurrently (see the sketch below).
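A rough sketch of that chunked submission strategy, assuming the toposort module (toposort() yields "levels": first the tasks with no unresolved dependencies, then the tasks that depend only on those, and so on). The function name and concurrency value are assumptions for illustration.

```python
from concurrent import futures
from toposort import toposort

def submit_in_chunks(queue, tasks, concurrency=10):
    dep_map = {tid: set(defn.get("dependencies", [])) for tid, defn in tasks.items()}
    for level in toposort(dep_map):
        to_submit = [tid for tid in level if tid in tasks]
        # Chunks go strictly in order; tasks within a chunk go in parallel.
        with futures.ThreadPoolExecutor(concurrency) as executor:
            submitted = {executor.submit(queue.createTask, tid, tasks[tid]): tid
                         for tid in to_submit}
            for future in futures.as_completed(submitted):
                future.result()  # surface any submission failure immediately
```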
Hoping to get my hands on this in the next two weeks, now that I'm off releaseduty cycle.
Assignee: nobody → mtabara
Hijacking this because I'm back!
Assignee: mtabara → rail
Attached file PR for releasetask (deleted) —
Attachment #8900706 - Flags: review?(mtabara)
Attachment #8900706 - Flags: review?(mtabara) → review+
Comment on attachment 8900704 [details]
Bug 1259627 - [tools] Use TC Queue API

https://reviewboard.mozilla.org/r/172146/#review177782

Looking good to me - just a few nits and a bunch of questions. Also, the most concerning thing for me - shouldn't we attempt a staging release for Fennec as well? See my comment below.

::: buildfarm/release/release-runner.py:414
(Diff revision 1)
>
>     # XXX: Doesn't work with neither Fennec nor Thunderbird
>     platforms = branchConfig['release_platforms']
>
>     try:
> -       graph_id = slugId()
> +       task_group_id = None

Can't we remove this line since the task group id comes from releasetasks? I doubt there's any warning downstream.

::: buildfarm/release/release-runner.py:523
(Diff revision 1)
>     import pprint
> -   log.debug(pprint.pformat(graph, indent=4, width=160))
> -   print(scheduler.createTaskGraph(graph_id, graph))
> +   for task_id, task_def in tasks.items():
> +       log.debug("%s ->\n%s", task_id,
> +                 pprint.pformat(task_def, indent=4, width=160))
> +   submit_parallelized(queue, tasks)
> +   resolve_task(queue, toplevel_task_id)

Beautiful!

::: buildfarm/release/releasetasks_graph_gen.py:134
(Diff revision 1)
> +       log.debug("%s ->\n%s", task_id,
> +                 pprint.pformat(task_def, indent=4, width=160))
> +
>     if not options.dry_run:
> -       print scheduler.createTaskGraph(graph_id, graph)
> +       submit_parallelized(queue, tasks)
> +       resolve_task(queue, toplevel_task_id)

Wonderful!

::: buildfarm/release/releasetasks_graph_gen.py:218
(Diff revision 1)
>     tc_config = {
>         "credentials": {
>             "clientId": get_config(release_runner_config, "taskcluster", "client_id", None),
>             "accessToken": get_config(release_runner_config, "taskcluster", "access_token", None),
> +       },
> +       "maxRetries": 12,

Question: Since this is global, which task falls into this?

::: lib/python/kickoff/__init__.py:370
(Diff revision 1)
>     kwargs["extra_balrog_submitter_params"] = extra_balrog_submitter_params
>
> -   # don't import releasetasks until required within function impl to avoid global failures
> -   # during nosetests
> -   from releasetasks import make_task_graph
> -   return make_task_graph(**kwargs)
> +   # don't import releasetasks until required within function impl to avoid
> +   # global failures during nosetests
> +   from releasetasks import make_tasks
> +   return make_tasks(**kwargs)

Really nice approach! Easy to back out if we need to. My only concern is that I'd like to see a Fennec staging release as well on jamun. For Fennec we have a decision task within the graph, and I'd feel safer if we tested that case before rolling this out to production.

::: lib/python/kickoff/tc.py:29
(Diff revision 1)
> +   def submit_task(t_id, t_def):
> +       log.info("Submitting %s", t_id)
> +       queue.createTask(t_id, t_def)
> +
> +   with futures.ThreadPoolExecutor(CONCURRENCY) as e:
> +       fs = {}

Nit: maybe use a more intuitive variable name here instead of `fs`, e.g. "submitted_future" or alike.
Attachment #8900704 - Flags: review?(mtabara) → review+
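The review above refers to a resolve_task() helper that unblocks the graph once everything has been submitted. The snippet below is only a hypothetical sketch of what such a helper might do with the Queue API (claim the "atomic" task and immediately report it complete), not the actual kickoff/tc.py code; the workerGroup/workerId values are placeholders.

```python
def resolve_task(queue, task_id):
    """Claim and complete the top-level 'atomic' task so that the rest of the
    graph only starts running after every task has been submitted."""
    # Run 0 is assumed to be the pending run of the dummy/atomic task.
    queue.claimTask(task_id, 0, {"workerGroup": "releaserunner",
                                 "workerId": "releaserunner"})
    queue.reportCompleted(task_id, 0)
```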
Comment on attachment 8900705 [details]
Bug 1259627 - Update releaserunner dependencies

https://reviewboard.mozilla.org/r/172150/#review177834

::: modules/releaserunner/manifests/init.pp:60
(Diff revision 1)
>     'simplejson==2.6.2',
>     'singledispatch==3.4.0.3',
>     'six==1.9.0',
>     'sqlalchemy-migrate==0.7.2',
>     'taskcluster==0.0.24',
> +   'toposort==1.0',

Any reason we're not using the latest 1.5 version of it? https://pypi.python.org/pypi/toposort
Attachment #8900705 - Flags: review?(mtabara) → review+
Attachment #8900706 - Flags: checked-in+
Comment on attachment 8901870 [details]
Bug 1259627 - Keep printing out the task group ID

https://reviewboard.mozilla.org/r/173302/#review178662

Good catch, lgtm!
Attachment #8901870 - Flags: review+
I was chatting with :jlorenzo earlier tonight as he was doing a Fennec staging release to test things out, and we just realized we can maybe get rid of the "graph" scopes from releasetasks. AFAIK, *before*, we'd define scopes in each task plus at the graph level, as needed for the Scheduler API. But now that the graph is just a formal step before flattening it into a list of tasks that are sent individually via the Queue API, I think we no longer need them. :jlorenzo was having trouble with some missing scopes and tweaked only the task-level scope, not the graph one, and it still worked. I think this proves the theory above.

To conclude, I think we can actually remove all scopes from [1]. Might be worth testing this with a staging release tomorrow morning. Leaving myself a NI on this one.

[1]: https://github.com/mozilla-releng/releasetasks/blob/master/releasetasks/templates/mobile/release_graph.yml.tmpl#L21
Flags: needinfo?(mtabara)
yup, we can kill those.
Filed bug 1394852 to track removal of those graph-level scopes mentioned in comment 26.
Flags: needinfo?(mtabara)
Attached file Funsize scheduler PR (deleted) —
Attachment #8902642 - Flags: review?
Attachment #8902642 - Flags: review? → review+
Comment on attachment 8902642 [details]
Funsize scheduler PR

to be deployed soon!
Attachment #8902642 - Flags: checked-in+
We hit some artifact expires errors like:

Aug 30 13:05:48 signingworker-6.srv.releng.use1.mozilla.com python: signingworker taskcluster.exceptions.TaskclusterRestFailure: Artifact expires (2017-09-13T20:05:47.672Z) after the task expiration 2017-09-06T12:48:31.067Z (task.expires < expires) - this is not allowed, artifacts may not expire after the task they belong to expires

We *think* https://github.com/mozilla-releng/funsize/commit/96d4d90350a8a1ee378f1b0dddfcd45acd4bff53 should fix that symptom... we'll see during tonight's nightlies, triggering in <1hr.

But when digging, I found another possible fix:

At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L176 we set the artifact expires to 2 weeks from now, but "now" is well after the graph was created.

At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L71 we get the task definition. `task.get('expires')` should be the task's expiration datestring, if any. We could fall back to an arrow, e.g. `task.get('expires', arrow.now().replace(weeks=2).isoformat())`.

`sign()` has access to the task definition, so we could create `expires` at that point. It calls `download_and_sign_file()`, which calls `create_artifact()`, and it also calls `create_artifact()` directly. If we pass `expires` down as args or as a self.variable, we can then make sure we never try to create an artifact with an `expires` longer than `task.expires`.
Flags: needinfo?(rail)
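A sketch of the clamping idea described above: pick an artifact expiry that never exceeds the task's own expiry. The function name is hypothetical and this is not the shipped fix (the actual fix was removing the offending field in funsize); it uses arrow's shift() rather than the older replace(weeks=2) form quoted above.

```python
import arrow

def artifact_expiry(task_def, default_weeks=2):
    """Return an ISO datestring for the artifact's expires, capped at task.expires."""
    default = arrow.utcnow().shift(weeks=default_weeks)
    task_expires = task_def.get("expires")
    # If the task expires sooner than our default, use the task's expiry so the
    # queue never rejects the artifact with "task.expires < expires".
    if task_expires and arrow.get(task_expires) < default:
        return task_expires
    return default.isoformat()
```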
Yeah, I noticed that issue yesterday and https://github.com/mozilla-releng/funsize/pull/82 fixed it. I shouldn't have added that field in the first place. :)
Flags: needinfo?(rail)
Everything looks good so far, no backouts!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I've removed:

"Scheduler": "http://references.taskcluster.net/scheduler/v1/api.json",
"SchedulerEvents": "http://references.taskcluster.net/scheduler/v1/exchanges.json",

from http://references.taskcluster.net/manifest.json so that the scheduler APIs don't get generated in our clients.
The following scopes are still dotted around our clients/roles:

scheduler:*
scheduler:create-task-graph
scheduler:extend-task-graph
scheduler:extend-task-graph:*
scheduler:route:gaia-taskcluster
scheduler:route:taskcluster-github.*

I intend to remove these - shout out if you see any issue with me doing this.
Sorry, I see now this bug was about not continuing to use the scheduler APIs, rather than about sunsetting the scheduler. Therefore I've moved the sunsetting parts into bug 1399437. Sorry for the noise! :)