[tc-worker-runner] support stopping workers when deployment is updated
Categories
(Taskcluster :: Workers, enhancement)
Tracking
(Not tracked)
People
(Reporter: dustin, Unassigned)
References
Details
This is currently supported by generic-worker, but the deploymentId is included in the workerConfig, which is duplicated for every region or AZ. Worker-manager and tc-worker-runner could work together to handle this more smoothly (perhaps via registerWorker and a subsequent workerCheckin API call or something like that), and for all provider types (including static).
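For illustration, a rough sketch of the status quo described here, assuming the usual worker-runner layout where generic-worker's settings sit under workerConfig.genericWorker.config; field names and values are illustrative, not copied from a real pool. The deploymentId has to be repeated inside every launchConfigs entry:

```python
# Sketch only: an abbreviated worker pool definition in which the generic-worker
# deploymentId is embedded in workerConfig, and workerConfig is duplicated in
# each launchConfigs entry (one per region/AZ), so rolling out a new deployment
# means editing every copy.
worker_pool_config = {
    "minCapacity": 0,
    "maxCapacity": 20,
    "launchConfigs": [
        {
            "region": "us-east-1",  # illustrative
            "workerConfig": {
                "genericWorker": {"config": {"deploymentId": "dep-2019-10-01-a"}},
            },
        },
        {
            "region": "us-west-2",
            "workerConfig": {
                "genericWorker": {"config": {"deploymentId": "dep-2019-10-01-a"}},  # duplicate
            },
        },
    ],
}
```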
Comment 1•5 years ago
> workerConfig, which is duplicated for every region or AZ.
Does it have to? I copied the logic from https://github.com/mozilla/community-tc-config/blob/72ff4b48c4/generate/workers.py#L222-L236, but does worker-manager also support specifying workerConfig once, outside of the launchConfigs array? Or is there a use case for the worker config to vary across launch configs?
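The shape this question is getting at might look something like the following (a sketch only; whether worker-manager accepted a pool-level workerConfig at the time is exactly what is being asked here, and bug 1578922, referenced in comment 3, proposes it):

```python
# Hypothetical alternative: workerConfig hoisted to the pool level so the
# deploymentId appears exactly once, with launchConfigs carrying only the
# per-region/AZ differences.
worker_pool_config = {
    "minCapacity": 0,
    "maxCapacity": 20,
    "workerConfig": {
        "genericWorker": {"config": {"deploymentId": "dep-2019-10-01-a"}},
    },
    "launchConfigs": [
        {"region": "us-east-1"},  # region-specific launch details only
        {"region": "us-west-2"},
    ],
}
```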
Comment 2•5 years ago
The status quo is that either workers with an outdated configuration can continue picking up tasks for an unbounded amount of time, or they have to be terminated with the removeWorker API. If a task is running, the latter makes it fail.
For Servo’s CI, a failing task usually means that a PR doesn’t land and needs to be retried manually, which runs all test suites again from the start. (We should probably start using TC’s queue’s built-in task-level automatic retry support, but we’d need some heuristic to tell intermittent failures apart from permanent ones.)
Ideally, workers with an outdated config would immediately stop picking up new tasks (rather than after a checkForNewDeploymentEverySecs interval) but keep running already-started tasks to completion.
This is exactly what worker-manager’s deleteWorkerPool API does. So maybe we should have versioned worker pool IDs? But that’d require in-tree configuration changes.
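A minimal sketch of the behaviour this comment asks for, with hypothetical helpers (current_deployment_id, claim_task, run_to_completion) standing in for whatever worker-runner/worker-manager calls would actually provide this information:

```python
import time

CHECK_EVERY_SECS = 30  # plays the role of generic-worker's checkForNewDeploymentEverySecs


def current_deployment_id():
    """Hypothetical helper: fetch the pool's current deploymentId, e.g. via a
    checkin/reregister-style worker-manager call."""
    raise NotImplementedError


def claim_task():
    """Hypothetical helper: claim the next pending task, or return None."""
    raise NotImplementedError


def run_to_completion(task):
    """Hypothetical helper: run an already-claimed task to the end."""
    raise NotImplementedError


def worker_loop(my_deployment_id):
    # Stop claiming new work as soon as the pool's deploymentId no longer
    # matches the one this worker started with, but never interrupt a task
    # that is already running.
    while current_deployment_id() == my_deployment_id:
        task = claim_task()
        if task is not None:
            run_to_completion(task)
        else:
            time.sleep(CHECK_EVERY_SECS)
    # Outdated config: exit gracefully so the provider can replace this
    # instance with one running the new configuration.
```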
Comment 3•5 years ago
> specifying workerConfig once outside of the launchConfigs array
This was already proposed in bug 1578922.
Reporter
Comment 4•5 years ago
A few more bits here: if we make this a "core" property of a worker pool, and provide it during registerWorker and reregisterWorker, then we can also track what deploymentId each worker is running, and be able to display that in the UI (at least some kind of indication that it's running an old config).
To do that, we might want more meaningful deploymentIds than just slugids. A datestamp actually seems like a good idea: unique, monotonically increasing. If that's the last-modified time of the worker pool, then it also has some semantic meaning ("running the config from last week").
Currently it's possible to update a worker pool without changing the deploymentId. We could preserve that by adding a ?update-last-modified=0 query parameter to the workerManager.updateWorkerPool API which would skip the update (with the default being to update the datestamp).
A fringe benefit here would be that even deletion of a worker pool would trigger an update to the last-modified timestamp, meaning that properly-configured workers would gracefully shut down.
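A sketch of the semantics proposed in this comment, not of the real worker-manager implementation; the update_last_modified flag stands in for the suggested ?update-last-modified=0 query parameter:

```python
import datetime


def update_worker_pool(pool, new_config, update_last_modified=True):
    # By default, every update bumps the pool's lastModified datestamp, which
    # doubles as the deploymentId that workers compare against at
    # registerWorker / reregisterWorker time ("running the config from last week").
    pool["config"] = new_config
    if update_last_modified:
        pool["lastModified"] = datetime.datetime.utcnow().isoformat() + "Z"
    # With ?update-last-modified=0 the datestamp is left alone, so workers are
    # not asked to shut down for an edit that doesn't affect them.
    return pool
```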
Reporter
Comment 5•5 years ago
This has been added to the Worker Lifecycle Management project.
Comment 6•5 years ago
lastModified is a top-level feature as of a month or so ago. The next step is to use it in reregister.
Reporter
Comment 7•5 years ago
In the discussion we just had, we said we would use a deploymentId with arbitrary content (e.g., a slugid). This is a change from comment 4, and comment 4 describes some benefits of using a lastModified value that are worth reconsidering.
Also, it's worth thinking about how this would interact with ci-admin/tc-admin. Would they update the deploymentId on every write? On no writes? Have some logic for it?
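One possible answer to the ci-admin/tc-admin question, purely as an illustration of "some logic for it": derive the deploymentId from a hash of the rest of the generated config, so re-applying an unchanged config never restarts workers, while any real change produces a new value automatically.

```python
import hashlib
import json


def with_deployment_id(pool_config):
    # Illustrative only: compute deploymentId from the config contents, so
    # tools like ci-admin/tc-admin never have to decide when to bump it.
    body = {k: v for k, v in pool_config.items() if k != "deploymentId"}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "deploymentId": digest[:16]}
```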
Comment 8•4 years ago
This will be worked on in https://github.com/taskcluster/taskcluster/issues/2891