Closed Bug 1574659 Opened 5 years ago Closed 5 years ago

Migrate deepspeech to community taskcluster deployment

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: miles)

Make a plan to move this project to the new community deployment.

Notes:
need to upgrade docker-worker workers quite some distance
generic-worker is at 14.x, can try an upgrade

Pete, FYI regarding the worker updates here. I think Alex was getting started on that already.

Yeah, I started some work but I'm busy with other things. I'll focus on that at the start of next month :)

So, I've updated:

And deepspeech-win/-b for Windows runs generic-worker v15.1.0

I'll work on updating tc.yml to v1. Notably, it uses env vars:

https://github.com/lissyx/taskcluster-github-decision/blob/master/tc-decision.py#L62-L80

Assignee: dustin → pmoore
Assignee: pmoore → miles

Summarizing a conversation with Alexandre in IRC (and notes from above):

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #5)

Summarizing a conversation with Alexandre in IRC (and notes from above):

  • Updating to v1 tc.yml might not be necessary, and seems hard due to use of env vars. We can still specify different provisionerId/workerType using v0.
  • We will manage the project with https://github.com/mozilla/community-tc-config/, so we should figure out
    • What cloud-based worker pools are required (we have docker-worker and win2012r2 right now)

We use:

taskcluster:
  schedulerId: taskcluster-github
  docker:
    provisionerId: aws-provisioner-v1
    workerType: deepspeech-worker
    workerTypeKvm: deepspeech-kvm-worker
    workerTypeWin: deepspeech-win-b
  dockerrpi3:
    provisionerId: deepspeech-provisioner
    workerType: ds-rpi3
  dockerarm64:
    provisionerId: deepspeech-provisioner
    workerType: ds-lepotato
  generic:
    provisionerId: deepspeech-provisioner
    workerType: ds-macos-light
  script:
    provisionerId: deepspeech-provisioner
    workerType: ds-scriptworker
  • Clients that are required (one for each worker)

Not sure I get this one, we only have clientId for the macOS, RPi3 and LePotato workers

Not sure I get this one as well

  • Any hooks, or other things I haven't thought of

He he, no idea :)

For my notes:

workerType: deepspeech-worker

This can be a standard docker-worker pool in GCP.

workerTypeKvm: deepspeech-kvm-worker

This will need to be docker-worker on metal instances in AWS.

workerTypeWin: deepspeech-win-b

This can be a generic-worker windows pool in AWS/GCP.

Essentially, we have to make sure that the same worker images are available in new and separate AWS/GCP accounts, and that corresponding worker-pools exist in the community cluster for this project. That's why we're asking about cloud workers.

  dockerrpi3:
    provisionerId: deepspeech-provisioner
    workerType: ds-rpi3
  dockerarm64:
    provisionerId: deepspeech-provisioner
    workerType: ds-lepotato
  generic:
    provisionerId: deepspeech-provisioner
    workerType: ds-macos-light
  script:
    provisionerId: deepspeech-provisioner
    workerType: ds-scriptworker

Given that you have your own provisioner here and your comment above, we'll need to provide you with credentials / clientIds for these workers, and you'll need to update TASKCLUSTER_ROOT_URL and other configurations.
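As a rough illustration of what the TASKCLUSTER_ROOT_URL change means for client code, here is a minimal Python sketch of how API URLs are derived from a root URL. It assumes the usual Taskcluster convention: the legacy https://taskcluster.net deployment used per-service hostnames, while any other root URL uses an /api/<service>/<version>/ prefix. The helper names and the https://tc.example.com root URL are hypothetical and are not taken from the DeepSpeech configuration.

```python
import os

# Root URL of the deployment being shut down (mentioned in this bug).
LEGACY_ROOT_URL = "https://taskcluster.net"


def root_url_from_env(default: str = LEGACY_ROOT_URL) -> str:
    """Read the root URL from the environment (hypothetical helper)."""
    return os.environ.get("TASKCLUSTER_ROOT_URL", default)


def api_url(root_url: str, service: str, version: str, path: str) -> str:
    """Build an API URL for a service under the given root URL.

    Assumption: legacy deployment -> https://<service>.taskcluster.net/...,
    any other deployment -> <rootUrl>/api/<service>/<version>/...
    """
    root_url = root_url.rstrip("/")
    path = path.lstrip("/")
    if root_url == LEGACY_ROOT_URL:
        return f"https://{service}.taskcluster.net/{version}/{path}"
    return f"{root_url}/api/{service}/{version}/{path}"


if __name__ == "__main__":
    # "https://tc.example.com" is a placeholder, not the real community URL.
    print(api_url(root_url_from_env(), "queue", "v1", "task/abc"))
    print(api_url("https://tc.example.com", "queue", "v1", "task/abc"))
```

Any hard-coded *.taskcluster.net hostnames in worker or decision-task configuration would stop resolving after the switch, which is why deriving every URL from the root URL is the safer pattern.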

This is referring to extra taskcluster scopes that your project will need, i.e. beyond Github. The workers you have will need to be able to claim tasks from the queue, for example.
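To make "able to claim tasks from the queue" concrete, the sketch below lists the scopes a self-hosted worker's client would plausibly need. It assumes the queue's usual scope patterns for claiming work (queue:claim-work:<provisionerId>/<workerType> and queue:worker-id:<workerGroup>/<workerId>). The provisioner and worker type names come from the config quoted above; the worker group and worker id are hypothetical, and the names in the community deployment may differ.

```python
def worker_scopes(provisioner_id: str, worker_type: str,
                  worker_group: str, worker_id: str) -> list:
    """Scopes a worker client would need to claim tasks (assumed patterns)."""
    return [
        f"queue:claim-work:{provisioner_id}/{worker_type}",
        f"queue:worker-id:{worker_group}/{worker_id}",
    ]


# Self-hosted worker types from the config quoted in this bug.
SELF_HOSTED = ["ds-rpi3", "ds-lepotato", "ds-macos-light", "ds-scriptworker"]

if __name__ == "__main__":
    for worker_type in SELF_HOSTED:
        # "ds-lab" and "<type>-1" are made-up group/id values for illustration.
        for scope in worker_scopes("deepspeech-provisioner", worker_type,
                                   "ds-lab", f"{worker_type}-1"):
            print(scope)
```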

In addition to the github repo roles it looks like you have a project admin role that we'll replicate as well: https://tools.taskcluster.net/auth/roles/project%3Adeepspeech%3Aadmin

Miles, we likely don't need the KVM type anymore. For our own provisioner, I think I can even create clientIds myself. At least I did in the past; does the community deployment change that?

The structure of things is changing a bit with the move to the community deployment. The biggest shift is that we're managing taskcluster roles and scopes and other things in a project definition yaml file.

Because of the complexity of this project there are a few things that will need to change:

I've created a bug for these changes in the deepspeech repo: https://github.com/mozilla/DeepSpeech/pull/2485

Here is my PR to community-tc-config that adds the deepspeech project (and corresponding worker-pools): https://github.com/mozilla/community-tc-config/pull/51/. It also creates clients, and users in the github team mozilla/research-machine-learning will be able to reset those clients' accessTokens to set up workers.

There's a bit more to do here, and some verification work to make sure things won't be broken, but the foundation is laid. Outstanding work:

  • I need to test the provisionerId/workerType changes myself on my fork in community-tc
  • I need to mirror the deepspeech changes to the tensorflow repo
  • Once the changes have landed in each repo, a GitHub admin will need to replace the taskcluster integration with the community-tc-integration

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #9)

The structure of things is changing a bit with the move to the community deployment. The biggest shift is that we're managing taskcluster roles and scopes and other things in a project definition yaml file.

Because of the complexity of this project there are a few things that will need to change:

Will this be the TASKCLUSTER_ROOT_URL to use?

  • The community cluster has different provisioners, and worker types are per project

Which ones change? Again, it's not obvious to me from the linked PR.

I've created a bug for these changes in the deepspeech repo: https://github.com/mozilla/DeepSpeech/pull/2485

You need to take care of https://github.com/mozilla/tensorflow as well, on branches master and r1.14 (and potentially others).

Here is my PR to community-tc-config that adds the deepspeech project (and corresponding worker-pools): https://github.com/mozilla/community-tc-config/pull/51/. It also creates clients, and users in the github team mozilla/research-machine-learning will be able to reset those clients' accessTokens to set up workers.

There's a bit more to do here, and some verification work to make sure things won't be broken, but the foundation is laid. Outstanding work:

  • I need to test the provisionerId/workerType changes myself on my fork in community-tc
  • I need to mirror the deepspeech changes to the tensorflow repo
  • Once the changes have landed in each repo, a GitHub admin will need to replace the taskcluster integration with the community-tc-integration

Can we get some planning here? We're getting close to a v0.6 release now, and I'd like to know when we have a hard cut date (this was supposed to be September 21st).

PSA: The existing (https://taskcluster.net) deployment will be shut down a week from today, on November 9. After that point, any CI not migrated to the new community cluster will stop functioning. The TC team is ready and eager to help get everything migrated by that time, but the deadline is firm.

Apologies for failing to communicate this as broadly and loudly as necessary, and for the bugspam now.

Alexandre emailed with a pretty tight timeline, essentially looking to land this on the 5th (Tuesday) at the latest.

I've given the filed PRs a good going-over, and landed the community-tc-config PR. I also filed https://github.com/mozilla/DeepSpeech/pull/2486 to replace https://github.com/mozilla/DeepSpeech/pull/2485. And I filed https://github.com/mozilla/tensorflow/pull/113 as miles suggested in #2485. Thankfully there are no taskcluster.net references in tensorflow to rewrite.

I see that lots of DS configs reference an index path for tensorflow. I hope that putting that path in place is as simple as pushing to the tensorflow repo, and perhaps updating the sha1 in the index path in the DS repo?

I'll check back today (Sunday) during my daylight hours. I'm also available all day Monday to work on this. Pete, if you can help out with any issues Monday before I'm awake, that would be great. Hopefully that's limited to landing and applying community-tc-config patches.

Yeah, the TensorFlow repo is less complicated. Honestly, if it's a change in our repo, it's less of an issue. What I would like to avoid is us being blocked on something you need to do.

Same -- and in general we've architected this so that there are fewer of those, and especially fewer of them that are "behind the scenes" from your perspective -- at worst, you should need to file a PR to community-tc-config and someone can merge/apply.

That said, SimonSapin has encountered a number of bugs that have been solved most efficiently by someone like me, and solving such bugs is a top priority.

So, neither my GitHub account nor :reuben's is working as expected: we are not granted the assume:project-admin:deepspeech scope. In my case, it seems to be because I am not part of mozilla/research-machine-learning, but reuben is.
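For readers unfamiliar with how a scope such as assume:project-admin:deepspeech is checked, here is a minimal sketch of Taskcluster-style scope satisfaction: a held scope satisfies a required scope if the two are equal, or if the held scope ends in * and the required scope starts with everything before that *. The function names are hypothetical, and the sketch deliberately leaves out role expansion (how assume:... scopes pull in a role's scopes), which the auth service handles.

```python
def satisfies(held: str, required: str) -> bool:
    """True if one held scope satisfies the required scope."""
    if held.endswith("*"):
        # A trailing "*" matches any suffix, including the empty one.
        return required.startswith(held[:-1])
    return held == required


def has_scope(held_scopes, required: str) -> bool:
    """True if any scope in held_scopes satisfies the required scope."""
    return any(satisfies(held, required) for held in held_scopes)


if __name__ == "__main__":
    needed = "assume:project-admin:deepspeech"
    # Illustrative scope sets only; not the real grants from this bug.
    print(has_scope(["assume:project-admin:*"], needed))
    print(has_scope(["assume:repo:github.com/mozilla/DeepSpeech:*"], needed))
```

A user who is missing from the group that carries the role simply never receives a scope that satisfies the required one, which matches the symptom described here.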

^^ addressed temporarily in #66

Unblocked: we could manually run dry tasks with true as the payload on all workers. So at least this part is fine.
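A dry task of this kind could look roughly like the sketch below, which only builds a task definition as a Python dict and does not submit it. The top-level fields (provisionerId, workerType, created, deadline, payload, metadata) follow the queue's task definition format as I understand it; the docker-worker style payload, the image name, the owner address, and the helper names are assumptions for illustration, and generic-worker pools expect a differently shaped payload.

```python
from datetime import datetime, timedelta, timezone


def _stamp(moment: datetime) -> str:
    """Format a UTC datetime as an ISO 8601 string with milliseconds."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def dry_task(provisioner_id: str, worker_type: str, owner: str, now=None) -> dict:
    """Build a no-op task definition that just runs `true` (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        "provisionerId": provisioner_id,
        "workerType": worker_type,
        "created": _stamp(now),
        "deadline": _stamp(now + timedelta(hours=1)),
        # Assumed docker-worker payload shape; the image is a placeholder.
        "payload": {
            "image": "ubuntu:18.04",
            "command": ["/bin/true"],
            "maxRunTime": 600,
        },
        "metadata": {
            "name": f"dry run on {worker_type}",
            "description": "No-op task to check that the worker claims work",
            "owner": owner,
            "source": "https://bugzilla.mozilla.org/show_bug.cgi?id=1574659",
        },
    }


if __name__ == "__main__":
    # Names taken from the old config in this bug; the new ones may differ.
    print(dry_task("deepspeech-provisioner", "ds-rpi3", "nobody@example.com"))
```

If such a task resolves as completed on every pool, the worker credentials, root URL, and claim scopes are all at least minimally working.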

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED