Bug 1592844 (Open): Opened 5 years ago; Updated 5 years ago

Improve handling generic-worker config in aws-provider.

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

People

(Reporter: tomprince, Unassigned)

References

Details

There are a couple of improvements that would make configuring generic-worker in ci-config much easier:

  1. taskcluster-worker-runner will expect the worker configuration to be at the top level. It would be good if generic-worker running under worker-manager directly expected the same. In other words, when running under worker-manager, generic-worker should not look in workerConfig.genericWorker.config for its config, but rather expect it at the top level of workerConfig (see the sketch after this list).

  2. Looking at the current config:

"cachesDir": "Z:\\caches"
"cleanUpTaskDirs": true
"deploymentId": "f37dd7ee1cae"
"disableReboots": true
"downloadsDir": "Z:\\downloads"
"ed25519SigningKeyLocation": "C:\\generic-worker\\ed25519-private.key"
"idleTimeoutSecs": 900
"livelogExecutable": "C:\\generic-worker\\livelog.exe"
"livelogPUTPort": 60022
"numberOfTasksToRun": 0
"runAfterUserCreation": "C:\\generic-worker\\task-user-init.cmd"
"runTasksAsCurrentUser": false
"sentryProject": "generic-worker"
"shutdownMachineOnIdle": false
"shutdownMachineOnInternalError": true
"taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe"
"taskclusterProxyPort": 80
"tasksDir": "Z:\\"
"wstAudience": "taskcluster-net"
"wstServerURL": "https://websocktunnel.tasks.build"

most of that configuration is tied to the AMI rather than to a specific worker pool. It would be good if there were a way to have a configuration file on the instance that is merged with the configuration from the cloud provider.
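
To make point 1 concrete, here is a minimal sketch of the two layouts (the value is taken from the config listed above; the flat layout is the proposed shape, not something generic-worker supports today):

Current (nested):

{
  "workerConfig": {
    "genericWorker": {
      "config": {
        "idleTimeoutSecs": 900
      }
    }
  }
}

Proposed (flat, matching what taskcluster-worker-runner will expect):

{
  "workerConfig": {
    "idleTimeoutSecs": 900
  }
}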

Flags: needinfo?(pmoore)

Many thanks for raising this.

I think after I've explained the rationale behind the current design, it will make much more sense.

workerConfig is the only section in the worker pool definition that exists to serve runtime configuration to a worker running a prebaked image. Therefore any system that runs on the worker relies on this single section to retrieve runtime configuration.

One of the pieces of software that runs on the worker is the worker application (generic-worker/docker-worker/script-worker etc). It therefore seems reasonable that the worker application should not claim all of the available configuration for itself, since other software may be installed on the worker that requires workerpool-specific configuration too.

We saw early on that this was useful, for example for OpenCloudConfig (OCC), which also runs on the workers: workerpool-specific OCC configuration could be included in the workerConfig section. If generic-worker had claimed the entire blob, OCC would have had no way to obtain workerpool-specific config at runtime.
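
As a minimal sketch of coexisting sections (the openCloudConfig property name here is hypothetical, purely to illustrate a second consumer alongside generic-worker):

{
  "workerConfig": {
    "genericWorker": {
      "config": { ... }
    },
    "openCloudConfig": { ... }
  }
}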

This explains the reason for the top-level genericWorker config property.

The reason for the config property below that is that the runtime generic-worker configuration is more than just the settings in the generic-worker config file. That file indeed covers most of its configuration; however, generic-worker also provides a mechanism (since its very first release) that allows workers to be bootstrapped with additional files. These can be workerpool-specific files that cannot be baked into an AMI that may be shared with other worker pools, or secrets that we do not want to bake into the machine images for privacy reasons. For the specific case of secrets, this provides a means to rotate secrets without needing to build new machine images, and to make machine images public for shared use.
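
For illustration, a bootstrapped file entry might look roughly like this; the field names are my recollection of the generic-worker bootstrapping format and should be treated as an assumption, and the path and content are hypothetical:

"files": [
  {
    "content": "<base64-encoded file content>",
    "description": "workerpool-specific TLS key",
    "encoding": "base64",
    "format": "file",
    "path": "C:\\secrets\\livelog.key"
  }
]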

Historically, secrets were baked into AWS Provisioner worker type definitions, and worker type definitions weren't public. Then the taskcluster-secrets service was created and the secrets were migrated there, but non-secret workerpool-specific files can still be provided in worker type definitions (or workerpool definitions). Being able to bootstrap workers with workerpool-specific content when the base image is shared across worker pools is a really useful feature.

Even if we removed this feature (which I don't propose we do), there is always the possibility of future features requiring configuration that sits outside the generic-worker config file, so it is useful to apportion one property to the generic-worker config file, leaving other properties free for other features such as this one. Currently generic-worker understands config (for its config file) and files (for bootstrapping files). If we moved the config properties up to the top level, we'd lose this feature.

We can probably do a better job of documenting the bootstrapping features and explaining the reason for the top-level genericWorker property; without this information, it is easy to conclude that the config is nested unnecessarily. However, historically, having both levels has saved us, since it has enabled us to bootstrap workers with worker-pool-specific content and to include config for other systems running on the workers (such as OCC).

The ability to bootstrap workers with worker-pool-specific files will become more important when generic-worker machine images are shared across worker pools, which is happening with the move to ci-admin/tc-admin, so we probably shouldn't rush into changing anything quite yet.

Flags: needinfo?(pmoore)

(In reply to Tom Prince [:tomprince] from comment #0)

> most of that configuration is tied to the AMI rather than to a specific worker pool. It would be good if there were a way to have a configuration file on the instance that is merged with the configuration from the cloud provider.

If possible I'd like to avoid having another configuration file on the worker that is merged with the worker pool config. In practice, there is nothing AMI-specific that is not also worker-pool-specific. E.g. on Windows, we don't currently use AMIs that serve more than one worker type. It's possible that could work, but it's the sort of major change that has a lot of follow-on implications that would need to be addressed. On Linux that's probably more useful because of Docker, but a Windows AMI is currently bootstrapped with dependencies specific to the worker-(type|pool). Decoupling that relationship is more complicated than the generic-worker config.

Also, it's been very nice not to have to manage generic-worker configuration in OCC (we've been there on hardware and it is painful). On EC2 and GCP we currently just run the --configure-for-(ec2|gcp) switch and things just work. Having to instead do some sort of configuration merge between an AMI-specific configuration and a worker-pool-specific configuration is something I'm not excited about.

I think both of these are questions we can revisit when we're using worker-runner. Pete's point makes some sense in that context, where we might want

{
  "workerConfig": {
    "genericWorker": {
      "config": { ... },
      "files": [ ... ],
    },
  },
  "runnerConfig": {
    ...
  }
}

and regarding all the baked-in things, tc-worker-runner already has facilities for merging config files, so something like

worker:
  implementation: generic-worker
  ...
workerConfig:
  downloadsDir: "Z:\\downloads"
  ed25519SigningKeyLocation: "C:\\generic-worker\\ed25519-private.key"
  livelogExecutable: "C:\\generic-worker\\livelog.exe"
  runAfterUserCreation: "C:\\generic-worker\\task-user-init.cmd"
  taskclusterProxyExecutable: "C:\\generic-worker\\taskcluster-proxy.exe"
...
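
A worker pool definition would then only need to carry the pool-specific values, which worker-runner's merge facilities would layer over the baked-in file. A sketch of such a pool-level fragment, using values from the config in comment 0:

workerConfig:
  deploymentId: "f37dd7ee1cae"
  idleTimeoutSecs: 900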

Again, once we have worker-runner working everywhere, I'd like to switch from providing config in userdata, where it may run into size limits and where it is not available for static workers, to providing it in the return value from registerWorker.
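
Hypothetically, the registerWorker response could then carry the worker config alongside the credentials it already returns. A sketch (credentials and expires roughly reflect the current response shape; the workerConfig property is the proposed addition, not the current API):

{
  "expires": "2020-01-01T00:00:00.000Z",
  "credentials": {
    "clientId": "...",
    "accessToken": "...",
    "certificate": "..."
  },
  "workerConfig": {
    "idleTimeoutSecs": 900
  }
}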

So, let's put a pin in this until generic-worker uses worker-runner everywhere (bug 1558532).

> a windows ami is currently bootstrapped with dependencies specific to the worker-(type|pool)

This isn't always the case -- with monopacker, we will support a one-to-many relationship between images and worker pools. As an example, in the community deployment it may be the case that a few community projects need "vanilla" win2012 images that just have git and python on them, and we can share a single image among those projects' worker pools.

Depends on: 1558532

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #2)

> Also, it's been very nice not to have to manage generic-worker configuration in OCC (we've been there on hardware and it is painful). On EC2 and GCP we currently just run the --configure-for-(ec2|gcp) switch and things just work. Having to instead do some sort of configuration merge between an AMI-specific configuration and a worker-pool-specific configuration is something I'm not excited about.

I would like to underline Rob's comment here too. When implementing config negotiation with AWS Provisioner, we made the choice early on that if AWS Provisioner provided generic-worker configuration, it should provide the entire config, and not just a part of it, since it becomes harder to reason about configuration when it comes from multiple sources, potentially spread across systems and repositories. Having a single location for worker config has kept things simple, both for users administering config, and for the code that needs to negotiate the config.

I think the danger of baking config into machine images is that it makes the already-complex systems even more opaque and difficult to reason about, and at the moment I feel we should be trying to make our systems more understandable rather than more complex. It feels like taskcluster is already overwhelmingly complex, so anything we can do to keep things simple is a real benefit.

(In reply to Pete Moore [:pmoore][:pete] from comment #4)

> I think the danger of baking config into machine images is that it makes the already-complex systems even more opaque and difficult to reason about

For me, having configuration like taskclusterProxyExecutable be set per worker pool makes things more complex and harder to reason about. That configuration isn't something that can be meaningfully changed other than at AMI creation time (when the file is put in place), so it should live on the AMI. Otherwise, ci-configuration also needs to carry copies of those values, and they'll need to be updated whenever they change in OCC.

I filed this bug because I found the current setup complex, opaque and difficult to reason about, and wanted to make things simpler.

I'll also note that the current setup probably works well when the configuration is all managed by OCC, but since we are moving to having the worker pools managed by ci-config, the tradeoffs involved are different, and so the clearest solution may have a different division of responsibilities for managing the configuration.

OK, that's a reasonable argument. Let's meet to discuss options. This will also be a good opportunity to formalise the hand-off between OpenCloudConfig and ci-configuration.

For example, I could imagine that when OpenCloudConfig CI runs, it generates a JSON artifact containing not just machine image IDs/names, but also a suitable workerConfig object containing appropriate image-specific settings such as taskclusterProxyExecutable. This information could be stored directly in ci-configuration, keeping the association between machine image names/IDs and the default workerConfig settings. Then, when ci-admin builds worker pool definitions, it would merge workerpool-specific workerConfig settings with the machine-image-specific workerConfig object associated with the images, so that there is a clean separation between workerpool-specific and machine-image-specific workerConfig. Merging the configs then becomes a function of ci-admin, which:

  1. keeps config negotiation in generic-worker simple
  2. provides transparency of settings to people inspecting workerpool definitions
  3. allows control of all config settings in a workerpool definition, for cases when the default isn't appropriate

So my feeling is that the separation should exist, just not low down in the worker implementation. The logical place for config merging to sit is in ci-admin, and if this requires information sharing between OpenCloudConfig and ci-admin, we should formalise that in the design of the interface between OCC and ci-admin. This makes the overall config more transparent, more explicit, easier to reason about, quicker to load, and simpler to change (no config change requires building new machine images). A sketch of such a merge follows.
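
A minimal sketch of that merge step, assuming hypothetical names on the ci-admin side (merge_worker_config, image_config, and pool_config are illustrative, not existing ci-admin APIs):

def merge_worker_config(image_config, pool_config):
    """Recursively merge pool-specific workerConfig over image-specific
    defaults, so a worker pool definition can override any image default."""
    merged = dict(image_config)
    for key, value in pool_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_worker_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Image-specific defaults, as might be generated by OpenCloudConfig CI ...
image_config = {
    "genericWorker": {
        "config": {
            "taskclusterProxyExecutable": "C:\\generic-worker\\taskcluster-proxy.exe",
            "livelogExecutable": "C:\\generic-worker\\livelog.exe",
        }
    }
}

# ... merged with pool-specific settings from ci-configuration.
pool_config = {
    "genericWorker": {
        "config": {
            "idleTimeoutSecs": 900,
        }
    }
}

worker_config = merge_worker_config(image_config, pool_config)
# worker_config now contains both executables plus idleTimeoutSecs.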

Summary: Improve handling generic-worker config in aws-rprovider. → Improve handling generic-worker config in aws-prrovider.
Summary: Improve handling generic-worker config in aws-prrovider. → Improve handling generic-worker config in aws-provider.

Pete, I think you've provided defaults for most of the image-specific things now? Then all that's left is bug 1602960, right?
