Closed Bug 1593482 (opened 5 years ago, closed 4 years ago)

[tc-worker-runner] support stopping workers when deployment is updated

Categories: Taskcluster :: Workers
Type: enhancement
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED MOVED

Reporter: dustin
Assignee: Unassigned


This is currently supported by generic-worker, but the deploymentId is included in the workerConfig, which is duplicated for every region or AZ. Worker-manager and tc-worker-runner could work together to handle this more smoothly (perhaps via registerWorker and a subsequent workerCheckin API call or something like that), and for all provider types (including static).
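For concreteness, here's a minimal sketch of the status quo (the exact field names and generic-worker nesting are assumptions based on the community-tc-config generator linked below; real launch configs carry much more):

    # Sketch: a worker pool config where deploymentId must be duplicated
    # into every launch config (one per region/AZ).
    worker_pool_config = {
        "minCapacity": 0,
        "maxCapacity": 20,
        "launchConfigs": [
            {
                "region": "us-east-1",
                # ... machine type, image, etc. ...
                "workerConfig": {
                    "genericWorker": {
                        "config": {"deploymentId": "v1"},  # duplicated here...
                    },
                },
            },
            {
                "region": "us-west-2",
                # ... machine type, image, etc. ...
                "workerConfig": {
                    "genericWorker": {
                        "config": {"deploymentId": "v1"},  # ...and here, and so on
                    },
                },
            },
        ],
    }

Rolling out a new deployment means editing every one of those copies.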

> workerConfig, which is duplicated for every region or AZ.

Does it have to? I copied the logic from https://github.com/mozilla/community-tc-config/blob/72ff4b48c4/generate/workers.py#L222-L236 but does worker-manager also support specifying workerConfig once outside of the launchConfigs array? Or is there a use case for the worker config to vary across launch configs?

The status quo is that either workers with an outdated configuration can continue picking up tasks for an unbounded amount of time, or they have to be terminated with the removeWorker API. If a task is running, the latter makes it fail.
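For reference, the forced-termination path looks roughly like this with the Python client (rootUrl, credentials, and IDs are all placeholders):

    # Sketch: forcibly terminating one worker via the worker-manager API.
    # Any task the worker is running at that moment will fail.
    import taskcluster

    wm = taskcluster.WorkerManager({
        "rootUrl": "https://community-tc.services.mozilla.com",  # placeholder
        "credentials": {"clientId": "...", "accessToken": "..."},
    })

    wm.removeWorker("proj-servo/docker", "us-east-1", "i-0123456789abcdef0")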

For Servo’s CI, a failing task usually means that a PR doesn’t land and needs to be retried manually, which runs all test suites again from the start. (We should probably start using the TC queue’s built-in task-level automatic retry support, but we’d need some heuristic to tell intermittent failures apart from permanent ones.)
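For reference, that retry support is the retries property of the task definition; as far as I know the queue only re-runs tasks whose run ends in an infrastructure exception (e.g. worker-shutdown), not ordinary failed runs, which is why a heuristic would be needed. A sketch:

    # Sketch: a task definition using the queue's built-in retry support.
    # The queue re-runs the task automatically only when a run ends in an
    # infrastructure exception, not when it simply fails.
    task = {
        "provisionerId": "proj-servo",   # placeholder
        "workerType": "docker",          # placeholder
        "retries": 5,
        # ... created, deadline, payload, routes, etc. ...
    }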

Ideally, workers with an outdated config would immediately stop picking up new tasks (rather than after a checkForNewDeploymentEverySecs interval) but keep running already-started tasks to completion.

This is exactly what worker-manager’s deleteWorkerPool API does. So maybe we should have versioned worker pool IDs? But that’d require in-tree configuration changes.
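In pseudo-Python, the drain behavior described above looks something like this (every helper here is hypothetical):

    # Sketch of a graceful drain: stop claiming when the config is outdated,
    # but let any claimed task run to completion. All helpers are hypothetical.
    def work_loop(worker):
        while worker.deployment_id == fetch_current_deployment_id():
            task = claim_next_task(worker)   # returns None if queue is empty
            if task is not None:
                run_to_completion(task)      # never interrupted mid-task
        worker.shutdown()

(generic-worker today re-checks on the checkForNewDeploymentEverySecs interval rather than before every claim, hence the delay mentioned above.)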

> specifying workerConfig once outside of the launchConfigs array

This was already proposed in bug 1578922.
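i.e. something like this hypothetical shape, with the shared parts factored out (this is the proposal, not something worker-manager supports today):

    # Hypothetical shape from the bug 1578922 proposal: workerConfig stated
    # once at the top level and shared by every launch config.
    worker_pool_config = {
        "minCapacity": 0,
        "maxCapacity": 20,
        "workerConfig": {
            "genericWorker": {
                "config": {"deploymentId": "v1"},
            },
        },
        "launchConfigs": [
            {"region": "us-east-1"},   # only region-specific bits remain
            {"region": "us-west-2"},
        ],
    }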

Depends on: 1607606

A few more bits here: if we make this a "core" property of a worker pool, and provide it during registerWorker and reregisterWorker, then we can also track what deploymentId each worker is running, and be able to display that in the UI (at least some kind of indication that it's running an old config).
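Sketched against the Python client (reusing the wm client from the earlier sketch), that might look like this; the payload fields other than deploymentId follow the existing registerWorker API as I understand it, and deploymentId itself is the proposed addition:

    # Hypothetical: the worker reports its deploymentId at (re)registration,
    # so worker-manager can track and display which workers are outdated.
    resp = wm.registerWorker({
        "workerPoolId": "proj-servo/docker",            # placeholder
        "providerId": "community-tc-workers-google",    # placeholder
        "workerGroup": "us-east-1",
        "workerId": "i-0123456789abcdef0",
        "workerIdentityProof": {"document": "..."},     # provider-specific
        "deploymentId": "v1",                           # proposed addition
    })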

To do that, we might want more meaningful deploymentIds than just slugids. A datestamp actually seems like a good idea: unique, monotonically increasing. If that's the last-modified time of the worker pool, then it also has some semantic meaning ("running the config from last week").
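A sketch of the difference (slugid is the existing Python library; the exact datestamp format here is just an assumption):

    import datetime
    import slugid

    # Opaque: unique, but meaningless and not ordered by recency.
    deployment_id = slugid.nice()

    # Datestamp: unique in practice, monotonically increasing, and readable
    # ("running the config from last week").
    deployment_id = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")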

Currently it's possible to update a worker pool without changing the deploymentId. Under a lastModified-based scheme, we could preserve that ability by adding a ?update-last-modified=0 query parameter to the workerManager.updateWorkerPool API, which would skip bumping the datestamp (with the default being to update it).
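A hypothetical sketch of that call, again reusing the wm client from earlier (neither the parameter nor this exact client signature exists; it only illustrates the proposal):

    # Hypothetical: update a pool without bumping its last-modified datestamp.
    worker_pool_payload = {
        "providerId": "community-tc-workers-google",  # placeholder
        "description": "Servo CI workers",            # placeholder
        "owner": "servo-ops@mozilla.com",             # placeholder
        "emailOnError": False,
        "config": {},                                 # pool config elided
    }
    wm.updateWorkerPool(
        "proj-servo/docker",                          # placeholder workerPoolId
        worker_pool_payload,
        query={"update-last-modified": "0"},          # the proposed flag
    )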

A fringe benefit here would be that even deletion of a worker pool would trigger an update to the last-modified timestamp, meaning that properly-configured workers would gracefully shut down.

This has been added to the Worker Lifecycle Management project.

lastModified is a top-level feature as of a month or so ago. The next step is to use it in reregisterWorker.

In the discussion we just had, we said we would use a deploymentId with arbitrary content (e.g., a slugid). This is a change from comment 4, which describes some benefits of using a lastModified value that are worth reconsidering.

Also worth thinking about: how would this interact with ci-admin/tc-admin? Would they update the deploymentId on every write? On no writes? Have some logic for it?
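One possible answer for the "some logic" option, purely illustrative: derive the deploymentId from a hash of the rest of the config, so that a write bumps it if and only if something actually changed:

    import hashlib
    import json

    # Illustrative only: a content-derived deploymentId means ci-admin/tc-admin
    # writes change it exactly when the pool config itself changes.
    def deployment_id_for(pool_config):
        canonical = json.dumps(pool_config, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]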

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → MOVED