Closed Bug 1593482 (opened 5 years ago, closed 4 years ago)

[tc-worker-runner] support stopping workers when deployment is updated

Categories: Taskcluster :: Workers
Type: enhancement
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED MOVED

Reporter: dustin
Assignee: Unassigned


This is currently supported by generic-worker, but the deploymentId is included in the workerConfig, which is duplicated for every region or AZ. Worker-manager and tc-worker-runner could work together to handle this more smoothly (perhaps via registerWorker and a subsequent workerCheckin API call or something like that), and for all provider types (including static).
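For concreteness, here's a minimal sketch of the status quo (the exact field names and generic-worker nesting are assumptions based on the community-tc-config generator linked below; real launch configs carry much more):

    # Sketch: a worker pool config where deploymentId must be duplicated
    # into every launch config (one per region/AZ).
    worker_pool_config = {
        "minCapacity": 0,
        "maxCapacity": 20,
        "launchConfigs": [
            {
                "region": "us-east-1",
                # ... machine type, image, etc. ...
                "workerConfig": {
                    "genericWorker": {
                        "config": {"deploymentId": "v1"},  # duplicated here...
                    },
                },
            },
            {
                "region": "us-west-2",
                # ... machine type, image, etc. ...
                "workerConfig": {
                    "genericWorker": {
                        "config": {"deploymentId": "v1"},  # ...and here, and so on
                    },
                },
            },
        ],
    }

Rolling out a new deployment means editing every one of those copies.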

> workerConfig, which is duplicated for every region or AZ.

Does it have to? I copied the logic from https://github.com/mozilla/community-tc-config/blob/72ff4b48c4/generate/workers.py#L222-L236 but does worker-manager also support specifying workerConfig once outside of the launchConfigs array? Or is there a use case for the worker config to vary across launch configs?

The status quo is that either workers with an outdated configuration can continue picking up tasks for an unbounded amount of time, or they have to be terminated with the removeWorker API. If a task is running, the latter makes it fail.
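For reference, the forced-termination path looks roughly like this with the Python client (rootUrl, credentials, and IDs are all placeholders):

    # Sketch: forcibly terminating one worker via the worker-manager API.
    # Any task the worker is running at that moment will fail.
    import taskcluster

    wm = taskcluster.WorkerManager({
        "rootUrl": "https://community-tc.services.mozilla.com",  # placeholder
        "credentials": {"clientId": "...", "accessToken": "..."},
    })

    wm.removeWorker("proj-servo/docker", "us-east-1", "i-0123456789abcdef0")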

For Servo’s CI, a failing task usually means that a PR doesn’t land and needs to be retried manually, which runs all test suites again from the start. (We should probably start using the TC queue’s built-in task-level automatic retry support, but we’d need some heuristic to tell intermittent failures apart from permanent ones.)
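For reference, that retry support is the retries property of the task definition; as far as I know the queue only re-runs tasks whose run ends in an infrastructure exception (e.g. worker-shutdown), not ordinary failed runs, which is why a heuristic would be needed. A sketch:

    # Sketch: a task definition using the queue's built-in retry support.
    # The queue re-runs the task automatically only when a run ends in an
    # infrastructure exception, not when it simply fails.
    task = {
        "provisionerId": "proj-servo",   # placeholder
        "workerType": "docker",          # placeholder
        "retries": 5,
        # ... created, deadline, payload, routes, etc. ...
    }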

Ideally, workers with an outdated config would immediately stop picking up new tasks (rather than after a checkForNewDeploymentEverySecs interval) but keep running already-started tasks to completion.

This is exactly what worker-manager’s deleteWorkerPool API does. So maybe we should have versioned worker pool IDs? But that’d require in-tree configuration changes.
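In pseudo-Python, the drain behavior described above looks something like this (every helper here is hypothetical):

    # Sketch of a graceful drain: stop claiming when the config is outdated,
    # but let any claimed task run to completion. All helpers are hypothetical.
    def work_loop(worker):
        while worker.deployment_id == fetch_current_deployment_id():
            task = claim_next_task(worker)   # returns None if queue is empty
            if task is not None:
                run_to_completion(task)      # never interrupted mid-task
        worker.shutdown()

(generic-worker today re-checks on the checkForNewDeploymentEverySecs interval rather than before every claim, hence the delay mentioned above.)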

> specifying workerConfig once outside of the launchConfigs array

This was already proposed in bug 1578922.
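i.e. something like this hypothetical shape, with the shared parts factored out (this is the proposal, not something worker-manager supports today):

    # Hypothetical shape from the bug 1578922 proposal: workerConfig stated
    # once at the top level and shared by every launch config.
    worker_pool_config = {
        "minCapacity": 0,
        "maxCapacity": 20,
        "workerConfig": {
            "genericWorker": {
                "config": {"deploymentId": "v1"},
            },
        },
        "launchConfigs": [
            {"region": "us-east-1"},   # only region-specific bits remain
            {"region": "us-west-2"},
        ],
    }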

Depends on: 1607606

A few more bits here: if we make this a "core" property of a worker pool, and provide it during registerWorker and reregisterWorker, then we can also track what deploymentId each worker is running, and be able to display that in the UI (at least some kind of indication that it's running an old config).
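Sketched against the Python client (reusing the wm client from the earlier sketch), that might look like this; the payload fields other than deploymentId follow the existing registerWorker API as I understand it, and deploymentId itself is the proposed addition:

    # Hypothetical: the worker reports its deploymentId at (re)registration,
    # so worker-manager can track and display which workers are outdated.
    resp = wm.registerWorker({
        "workerPoolId": "proj-servo/docker",            # placeholder
        "providerId": "community-tc-workers-google",    # placeholder
        "workerGroup": "us-east-1",
        "workerId": "i-0123456789abcdef0",
        "workerIdentityProof": {"document": "..."},     # provider-specific
        "deploymentId": "v1",                           # proposed addition
    })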

To do that, we might want more meaningful deploymentIds than just slugids. A datestamp actually seems like a good idea: unique, monotonically increasing. If that's the last-modified time of the worker pool, then it also has some semantic meaning ("running the config from last week").
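A sketch of the difference (slugid is the existing Python library; the exact datestamp format here is just an assumption):

    import datetime
    import slugid

    # Opaque: unique, but meaningless and not ordered by recency.
    deployment_id = slugid.nice()

    # Datestamp: unique in practice, monotonically increasing, and readable
    # ("running the config from last week").
    deployment_id = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")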

Currently it's possible to update a worker pool without changing the deploymentId. Under a lastModified-based scheme, we could preserve that ability by adding a ?update-last-modified=0 query parameter to the workerManager.updateWorkerPool API, which would skip bumping the datestamp (with the default being to update it).
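A hypothetical sketch of that call, again reusing the wm client from earlier (neither the parameter nor this exact client signature exists; it only illustrates the proposal):

    # Hypothetical: update a pool without bumping its last-modified datestamp.
    worker_pool_payload = {
        "providerId": "community-tc-workers-google",  # placeholder
        "description": "Servo CI workers",            # placeholder
        "owner": "servo-ops@mozilla.com",             # placeholder
        "emailOnError": False,
        "config": {},                                 # pool config elided
    }
    wm.updateWorkerPool(
        "proj-servo/docker",                          # placeholder workerPoolId
        worker_pool_payload,
        query={"update-last-modified": "0"},          # the proposed flag
    )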

A fringe benefit here would be that even deletion of a worker pool would trigger an update to the last-modified timestamp, meaning that properly-configured workers would gracefully shut down.

This has been added to the Worker Lifecycle Management project.

lastModified is a top-level feature as of a month or so ago. The next step is to use it in reregisterWorker.

In the discussion we just had, we said we would use a deploymentId with arbitrary content (e.g., a slugid). This is a change from comment 4, which describes some benefits of using a lastModified value that are worth reconsidering.

Also worth thinking about: how would this interact with ci-admin/tc-admin? Would they update the deploymentId on every write? On no writes? Have some logic for it?
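One possible answer for the "some logic" option, purely illustrative: derive the deploymentId from a hash of the rest of the config, so that a write bumps it if and only if something actually changed:

    import hashlib
    import json

    # Illustrative only: a content-derived deploymentId means ci-admin/tc-admin
    # writes change it exactly when the pool config itself changes.
    def deployment_id_for(pool_config):
        canonical = json.dumps(pool_config, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]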

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → MOVED