Open Bug 1607935 Opened 5 years ago Updated 4 years ago

Transition standalone workers in community-tc to static workers

Category: Taskcluster :: Operations and Service Requests (task)
Severity: normal
Tracking: not tracked
People: Reporter: dustin; Unassigned
References: Blocks 1 open bug

We have a bunch of standalone workers set up in community-tc:

$ grep assume:worker-pool config/projects.yml  | sort -u
        - assume:worker-pool:proj-deepspeech/ds-lepotato
        - assume:worker-pool:proj-deepspeech/ds-macos-heavy
        - assume:worker-pool:proj-deepspeech/ds-macos-light
        - assume:worker-pool:proj-deepspeech/ds-rpi3
        - assume:worker-pool:proj-deepspeech/ds-scriptworker
        - assume:worker-pool:proj-git-cinnabar/osx-10-10
        - assume:worker-pool:proj-git-cinnabar/osx-10-11
        - assume:worker-pool:proj-taskcluster/gw-ci-macos
        - assume:worker-pool:proj-taskcluster/gw-ci-macos-staging
        - assume:worker-pool:proj-taskcluster/gw-ci-raspbian-stretch
        - assume:worker-pool:proj-taskcluster/gw-ci-windows10-amd64
        - assume:worker-pool:proj-taskcluster/gw-ci-windows10-arm
        - assume:worker-pool:proj-webrender/ci-macos
        - assume:worker-pool:test-provisioner/*

Those currently get a hand-created TC client, are configured on-disk, and are invisible to worker-manager. Let's instead use worker-runner to start them, and create worker pools with the static provider to contain them.

Let's start with the proj-taskcluster/gw-* workers. The process is:

  • create a "static" provider (cloudops -- I'll file a bug for this)
  • create worker pools using that provider (PR to community-tc-config), containing the worker configuration currently set on-disk
  • for each worker, create a corresponding worker in worker-manager:
    WORKER_SECRET=<...>
    echo '{"capacity": 1, "expires": "3000-01-01T15:42:34Z", "providerInfo": {"staticSecret": "'"$WORKER_SECRET"'"}}' | \
    taskcluster api workerManager createWorker <workerPoolId> <workerGroup> <workerId>
    
  • install start-worker on the worker host
  • configure it, using the static provider and the configuration appropriate to the worker implementation. Something like
     worker:
         implementation: generic-worker
         path: /usr/local/bin/generic-worker  # or wherever this is installed
         configPath: /etc/generic-worker/config  # or wherever you keep your g-w config
     provider:
         providerType: static
         rootURL: https://community-tc.services.mozilla.com
         providerID: static
         workerPoolID: ...
         workerGroup: ...
         workerID: ...
         staticSecret: ... # WORKER_SECRET
     workerConfig:
         ... # any config that doesn't belong in the worker-pool definition, such as filesystem paths
    
  • set things up so that start-worker starts automatically and generic-worker does not (as appropriate to the system)
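For that last step, the mechanism depends on the host OS (launchd on macOS, a service on Windows, systemd on most Linux hosts). As a minimal sketch for a systemd host such as the raspbian worker -- the unit name and paths here are assumptions, not the actual community-tc layout:

    # Hypothetical systemd unit; adjust paths to wherever start-worker
    # and runner.yml are installed on the host.
    sudo tee /etc/systemd/system/worker-runner.service <<'EOF' >/dev/null
    [Unit]
    Description=Taskcluster worker-runner (start-worker)
    After=network-online.target
    Wants=network-online.target

    [Service]
    ExecStart=/usr/local/bin/start-worker /etc/start-worker/runner.yml
    Restart=always

    [Install]
    WantedBy=multi-user.target
    EOF
    sudo systemctl daemon-reload
    sudo systemctl enable --now worker-runner
    # ...and disable any existing unit that launched generic-worker directly.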
Depends on: 1608141

Oh, I forgot we don't support worker config in worker pools yet. So the worker pools themselves are pretty simple. However, community-tc-config needs some refactoring to be able to generate them (since there's no imageset involved, for example).

Next up: reconfiguring the ci-macos worker to use its workerpool.

Pete, do you want to take a bit to try to set this up? It would be steps 3, 4, and 5 from above.

Flags: needinfo?(pmoore)

I'll add it to the project board. I'm not sure at this point which items I'll take myself, and which others will have, but we should certainly track it. Thanks for raising.

Flags: needinfo?(pmoore)

You're the only one with access, so I'm pretty sure you'll need to do it :)

Kats, do you think we could arrange a way for me to help make this change for the proj-webrender/ci-macos worker(s)? We can either work together for a bit, or if you'd prefer to just give me host access I can take care of it.

Flags: needinfo?(kats)

Dropping needinfo per discussion on Matrix. Giving host access is nontrivial since we'd have to punch through firewalls/NATs in the Toronto office. But :jrmuizel can help you with on-machine access once you have the steps to do this.

Flags: needinfo?(kats)

Pete reminded me that the macos worker and windows worker are both things I have access to, with creds in our secret store. So I'll futz with those and come up with a more specific list of steps than above.

We'll do this in two phases. First, set up with a standalone provider, and once that's working, switch to the static provider.

provider:
  providerType: standalone
  rootURL: (from existing config)
  clientID: (from existing config)
  accessToken: (from existing config)
  workerPoolID: (from existing config, of the form provisionerId/workerType)
  workerGroup: (from existing config)
  workerID: (from existing config)
worker:
  implementation: generic-worker
  path: (path to generic worker)
  configPath: (path to existing generic-worker config)
workerConfig:
  (all of the items from generic-worker config that were not used above, translated from JSON into YAML format)
  • In whatever way is appropriate, set things up to run /path/to/start-worker /path/to/runner.yml instead of /path/to/generic-worker ...
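As a hypothetical example of the JSON-to-YAML translation for workerConfig: if the existing generic-worker config contained {"tasksDir": "/worker/tasks", "cleanUpTaskDirs": true} (keys here are just illustrative), the runner.yml section would become:

    workerConfig:
      tasksDir: /worker/tasks
      cleanUpTaskDirs: true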

That's the end of phase one. Restart the worker by rebooting or whatever is appropriate, and check that it can execute a task. If it doesn't start up, check the stderr logs. It's easy to mistype config, especially the Golang-style capitalization of ID and URL.
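One quick way to see those errors (a sketch; paths are hypothetical) is to run start-worker in the foreground:

    # Runs in the foreground so config errors print straight to stderr.
    sudo /usr/local/bin/start-worker /etc/start-worker/runner.yml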

If there are errors fetching secrets, check the scopes given to the client. They should include assume:worker-pool:<workerPoolId>, which (via https://github.com/mozilla/community-tc-config/pull/222) will grant the necessary scopes.
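To double-check what the client actually has, the shell client can fetch the client record (the clientId here is hypothetical):

    # Inspect the client record, including the scopes it has been granted.
    taskcluster api auth client project/taskcluster/my-standalone-worker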

If there are errors reporting to Sentry, the actual error probably occurred earlier in the log. Phase 2 will grant the Sentry permissions.

Be careful running generic-worker directly from the command line, as it is sensitive to the cwd. If you get errors about the logged-in user not matching the next user, that's the cause.

Depends on: 1616998

Phase 2!

This involves setting up a record of the worker in Taskcluster, using a STATIC_SECRET value, and then configuring the worker with the same secret. If you have multiple workers, you should do this once for each worker, using a different secret for each one.

  • create a worker-pool, similar to this example, with the same workerPoolId you set up in phase 1
  • generate a STATIC_SECRET. The shell client makes this easy: echo $(taskcluster slugid generate)$(taskcluster slugid generate)
  • create a new worker in the given workerPool, using the static secret you just created as well as the info from the config file in phase 1. You may need to eval $(taskcluster signin) first.
EXPIRES=$(taskcluster from-now 31 days)
echo '{"providerInfo": {"staticSecret": "'"$STATIC_SECRET"'"}, "expires": "'"$EXPIRES"'"}' | \
  taskcluster api workerManager createWorker $WORKER_POOL_ID $WORKER_GROUP $WORKER_ID

You should see something like:

{
  "workerPoolId": "proj-taskcluster/gw-ci-macos",
  "workerGroup": "proj-taskcluster",
  "workerId": "63-135-170-250",
  "providerId": "static",
  "created": "2020-02-28T20:32:16.138Z",
  "expires": "3019-07-01T20:30:09.902Z",
  "lastModified": "2020-02-28T20:32:16.138Z",
  "lastChecked": "2020-02-28T20:32:16.138Z",
  "capacity": 1,
  "state": "running"
}
  • Edit runner.yml:
    • change providerType from standalone to static
    • remove clientID and accessToken
    • add providerID: static
    • add staticSecret: <your STATIC_SECRET>
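After those edits, the provider section should look roughly like this (a sketch using the example worker above; the rest of runner.yml stays as in phase 1):

    provider:
      providerType: static
      rootURL: https://community-tc.services.mozilla.com
      providerID: static
      workerPoolID: proj-taskcluster/gw-ci-macos
      workerGroup: proj-taskcluster
      workerID: 63-135-170-250
      staticSecret: <your STATIC_SECRET>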

Reboot and test.
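As a quick sanity check (the IDs are from the example above), you can ask worker-manager for the worker record:

    # Should return the record created above, including its expiry.
    taskcluster api workerManager worker proj-taskcluster/gw-ci-macos proj-taskcluster 63-135-170-250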

NOTE: the '31 days' above is the maximum, and isn't a good practical number. We generally use "1000 years" to mean "forever", but bug 1618983 prevents that right now.

Depends on: 1618983

This is now done for proj-taskcluster/63-135-170-250, but it expires in 31 days. Hopefully bug 1618983 is fixed by that time :)

Assignee: dustin → nobody

(this is no longer blocked, but nobody is doing it)
