Closed Bug 1424042 Opened 7 years ago Closed 6 years ago

Create a new worker type for non-resource-intensive tasks

Categories

(Taskcluster :: Operations and Service Requests, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: ted, Assigned: ted)

Details

gps points out (rightfully so) in bug 1422740 that using the same worker type as builds for the `upload-symbols` task is wasteful because it's not very resource intensive, but having an extra worker type is worse.

I proposed that we create a new worker type for things like this and all the source-test tasks that aren't super resource intensive, and set its capacity to something greater than 1, so that we could run more than one task on an instance. That should give us better resource usage so we don't wind up tying up a whole instance to do things like upload-symbols or run a lint. All of the instance types we use have multiple cores available, so I suspect we wouldn't notice any performance issues.
Ted, can you be more concrete about the request here?  Worker-type name(s), and what worker-type they should be modeled on?  I'll be happy to set that up.
Sure, I can try to flesh this out a bit. We currently use a builder worker type for the `upload-symbols` tasks:
https://dxr.mozilla.org/mozilla-central/rev/6ff60a083701d08c52702daf50f28e8f46ae3a1c/taskcluster/ci/upload-symbols/kind.yml#37

and `gecko-t-linux-xlarge` for upload-generated-sources:
https://dxr.mozilla.org/mozilla-central/rev/6ff60a083701d08c52702daf50f28e8f46ae3a1c/taskcluster/ci/upload-generated-sources/kind.yml#22

The various task types under `source-test` seem to all be using `gecko-t-linux-xlarge` as well:

luser@eye7:/build/mozilla-central$ rg worker-type taskcluster/ci/source-test/
taskcluster/ci/source-test/mozlint.yml
6:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge

taskcluster/ci/source-test/doc.yml
9:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge
34:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge

taskcluster/ci/source-test/mocha.yml
8:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge

taskcluster/ci/source-test/webidl.yml
8:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge

taskcluster/ci/source-test/cram.yml
9:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge

taskcluster/ci/source-test/file-metadata.yml
6:    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge

taskcluster/ci/source-test/python.yml
4:    worker-type:


Almost all of these tasks are using the in-tree `lint` image as well, with the exception of a few of the Python tests that use `desktop1604-test`.
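
For anyone who wants to redo the tally above without `rg`, here is a rough Python sketch. The `count_worker_types` helper and its regular expression are illustrative assumptions, not anything that exists in the tree. It only counts lines that give a worker type inline, so an entry like the bare `worker-type:` key in `python.yml` is skipped.

```python
import re
from collections import Counter
from pathlib import Path

# Matches lines such as "    worker-type: aws-provisioner-v1/gecko-t-linux-xlarge".
# A bare "worker-type:" key with no inline value does not match.
WORKER_TYPE = re.compile(r"^\s*worker-type:\s*(\S+)")


def count_worker_types(root):
    """Count inline worker-type values in every .yml file under root."""
    counts = Counter()
    for path in sorted(Path(root).rglob("*.yml")):
        for line in path.read_text(encoding="utf-8").splitlines():
            match = WORKER_TYPE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Example (run from a mozilla-central checkout):
#   count_worker_types("taskcluster/ci/source-test")
```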

`gecko-t-linux-xlarge` is spec'ed to use m3.xlarge / c3.xlarge, which both have 4 vCPUs, with 15 GB and 7.5 GB of memory, respectively. This worker type has its capacity set to 1 for both of them.

Given that most of these tasks are not hugely resource intensive we would benefit from having a worker type for them that sets capacity > 1 so that we can more fully utilize the instances. I would suggest we either set capacity = (num vCPUs) or at least (num vCPUs) / 2, which if we continue to use m3.xlarge / c3.xlarge would give us capacity for 4 or 2 tasks per instance, respectively. I'm not very up on the current state of the art with EC2 instance types, so if there are more suitable instance types to use here that's also fine. (Some of the c5 instance types might be reasonable choices, since they have even more vCPUs and we could set capacity even higher.)
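
To make the arithmetic concrete, here is a minimal Python sketch of the two capacity options. The `VCPUS` table and the `capacity_options` helper are purely illustrative and not part of any provisioner API. The vCPU counts are the ones quoted above.

```python
# Illustrative only: vCPU counts as quoted above for the two instance types.
VCPUS = {"m3.xlarge": 4, "c3.xlarge": 4}


def capacity_options(vcpus):
    """Return (one task per vCPU, one task per two vCPUs), never below 1."""
    return vcpus, max(1, vcpus // 2)


for instance_type, vcpus in VCPUS.items():
    full, half = capacity_options(vcpus)
    print(f"{instance_type}: capacity {full} or {half}")
```

The same helper would give larger numbers for instance types with more vCPUs, which is the appeal of looking at other instance families.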

As to naming, maybe something like `gecko-t-linux-shared` to indicate that the worker isn't exclusive?

In bug 1422740 comment 35 you suggested that the worker type for the `upload-symbols` task should be level-specific, which `gecko-t-linux-xlarge` is not. Should we make this new worker type level-specific? If so `gecko-{level}-t-linux-shared` would be fine.
Thanks!  I think it should be level-specific.  The testers are not because they do not produce any output that ends up in a release, or handle any secret data.  The uploads do both (ish..)
Flags: needinfo?(dustin)
We already have gecko-{level}-decision and gecko-{level}-images, so I've added gecko-{level}-linux-shared.  The "-t-" suggests it's for tests, which this isn't if it's doing uploads too.  m3.xlarge / c3.xlarge, capacity=4 for both.  maxCapacity 100 (so, 25 instances).  We can raise the maxCapacity of course, but best to start small.
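
As a quick sanity check of those numbers, a small Python sketch follows. Both helpers are hypothetical. The floor division assumes the instance count never pushes total capacity past maxCapacity, and the level values 1 to 3 are an assumption about which levels get a worker type.

```python
def max_instances(max_capacity, capacity_per_instance):
    """Most instances that fit without total capacity exceeding max_capacity."""
    return max_capacity // capacity_per_instance


def worker_types(levels=(1, 2, 3)):
    """Expand the gecko-{level}-linux-shared pattern (levels are assumed)."""
    return [f"gecko-{level}-linux-shared" for level in levels]


print(max_instances(100, 4))  # 25
print(worker_types())
```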

Can you take it from here then?
Flags: needinfo?(dustin)
Yes, thanks! I'll take a crack at this when I'm back from PTO next week.
Assignee: nobody → ted
I think our workers should have as generic names as possible and that workers should be identified by their resource capacity. e.g. workers should be X-large, X-xlarge, X-2xlarge, etc. If we allow multiple tasks on a worker, that should be denoted by the worker name. e.g. X-4xlarge-shared2.

If we need separation for security reasons, we can encode that in the worker name as well. e.g. X-{level}-2xlarge.
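
A rough sketch of how such names could be composed, in Python. The `worker_name` helper and its parameters are hypothetical and only mirror the examples above; nothing like it exists in Taskcluster.

```python
def worker_name(prefix, size, level=None, shared=None):
    """Build names like X-4xlarge-shared2 or X-3-2xlarge (hypothetical helper)."""
    parts = [prefix]
    if level is not None:
        # Security separation: the level comes right after the prefix.
        parts.append(str(level))
    parts.append(size)
    if shared is not None and shared > 1:
        # More than one task per worker is spelled out in the name.
        parts.append(f"shared{shared}")
    return "-".join(parts)


print(worker_name("X", "4xlarge", shared=2))  # X-4xlarge-shared2
print(worker_name("X", "2xlarge", level=3))   # X-3-2xlarge
```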

Also, I'm tempted to say we should make all workers level-specific. Even if tasks running on them don't handle secrets, I'd feel better if there were physical isolation between tasks running for Try and everything else. I don't think a shared pool for e.g. the test workers buys us that much.
Hi Ted,

Have you had a chance to look into this yet? Thanks!
Flags: needinfo?(ted)
I didn't, sorry! For sure we can use this in the upload-symbols and upload-generated-sources tasks:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/upload-symbols/kind.yml
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/upload-generated-sources/kind.yml

We have a bunch of source-test tasks, and I'm sure many of them could run on these instances as well:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/source-test

Even the ones that are running tests that can be parallelized are not going to use the same amount of CPU as a build task.
Flags: needinfo?(ted)
Hey Ted,

This bug is in the taskcluster Service Request bug queue but I think it is blocked by the work you are doing - is there another component we can move it into? Ideally we use the Service Request bugs for short-lived administrative tasks for the taskcluster team.

Many thanks!
Pete
Flags: needinfo?(ted)
Wondering whether this bug could be merged with bug 1496390.
AWS Lambda workers would only know how to do one thing (the lambda function they are configured to run).  This is more of a general-purpose-but-very-quick worker thing.  So no, I don't think the two are related.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Component: Service Request → Operations and Service Requests
Flags: needinfo?(ted)