Create new worker pool for NSS ARM workers
Categories
(Release Engineering :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: coop, Unassigned)
References
(Blocks 1 open bug, Regression)
Details
Attachments
(1 file)
(deleted), text/x-phabricator-request
As part of the current cost reduction efforts, we're trying to reduce the number of cloud providers we support, specifically by exiting packet.net. Over in bug 1632599, we're trying to move Mozilla CI workloads out of packet.net and into AWS now that Amazon supports bare metal instance types. There is currently a single, statically-provisioned ARM machine, set up for NSS testing, that is the last machine we have running in packet.net.
Amazon has a variety of ARM options now, and we'd like to transition NSS from packet.net to AWS.
Here are the specs for the current machine in packet.net:
- PROC: 2 x Cavium ThunderX CN8890 @2GHz
- RAM: 128GB
- DISK: 1 x 340GB SSD
- NIC: 2 x 10Gbps Bonded Ports
I'd humbly suggest that this is over-provisioned for running a single CI worker. I think an a1.large instance in AWS would be sufficient for the new pool. I also think that we can get away with a very small pool here, capping the max # of instances at 5 (or fewer) given that check-ins are infrequent and a single worker is keeping up just fine right now.
Note: I've cc-ed some Taskcluster folks here because I'm unsure about the vintage of docker-worker that's running on ARM in packet.net. We may need to engage with releng to get something running on ARM in AWS.
See also bug 1594891 where releng inherited many of the other NSS worker pools.
Reporter
Comment 1•5 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #0)
I'd humbly suggest that this is over-provisioned for running a single CI worker. I think an a1.large instance in AWS would be sufficient for the new pool. I also think that we can get away with a very small pool here, capping the max # of instances at 5 (or fewer) given that check-ins are infrequent and a single worker is keeping up just fine right now.
Having now verified that I can connect to the NSS machine in packet.net, I can see that it's actually running with capacity=20 which makes more sense given the beefiness of the machine.
I'd suggest stepping down to an a1.medium instance in AWS, but having a max pool size of 20 and running 1 worker per instance.
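For concreteness, here's a sketch of what that pool definition might look like through the worker-manager API, using the Taskcluster Python client. The pool name, provider ID, owner, and AMI are placeholders, and the config fields follow my understanding of the worker-manager AWS provider schema, so treat this as an illustration rather than a ready-to-run definition:

```python
import taskcluster

# Assumes worker-manager credentials/rootUrl are in the environment.
wm = taskcluster.WorkerManager(taskcluster.optionsFromEnvironment())

payload = {
    "providerId": "aws",            # assumption: name of the AWS provider in this deployment
    "description": "NSS aarch64 docker-worker pool",
    "owner": "releng@mozilla.com",  # hypothetical owner
    "emailOnError": False,
    "config": {
        "minCapacity": 0,             # scale to zero when nothing is pending
        "maxCapacity": 20,            # matches the capacity=20 of the packet.net box
        "launchConfigs": [{
            "region": "us-west-2",
            "capacityPerInstance": 1,  # 1 worker per instance
            "launchConfig": {
                "ImageId": "ami-00000000000000000",  # placeholder arm64 docker-worker AMI
                "InstanceType": "a1.medium",
            },
        }],
    },
}
wm.createWorkerPool("nss/aarch64", payload)  # hypothetical pool name
```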
Reporter
Comment 2•5 years ago
:miles is looking to get an ARM64 docker-worker image running in AWS. Once that's in place we can figure out next steps.
IIRC docker-worker needed some code tweaks to work on ARM and I'm not sure those were ever landed. I guess we'll find out soon.
Reporter
Comment 3•5 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #2)
IIRC docker-worker needed some code tweaks to work on ARM and I'm not sure those were ever landed. I guess we'll find out soon.
Found a branch that may help with this: https://github.com/taskcluster/docker-worker/compare/packet-net
Comment 4•5 years ago
I did some hacking on this; here's where that work stands:
- Used a test version of docker, see here: https://www.docker.com/blog/getting-started-with-docker-for-arm-on-linux/
- Compiled worker-runner's start-worker on my test box because our releases don't include arm64 binaries (see the sketch after this list)
- Installed docker-worker dependencies, was able to run docker-worker independently and via worker-runner
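For reference, a minimal sketch of that cross-compile step. It assumes a checkout of the taskcluster monorepo, and the package path for start-worker is an assumption about the repo layout:

```python
import os
import subprocess

# Go cross-compiles natively; build a linux/arm64 start-worker from an x86_64 box.
env = {**os.environ, "GOOS": "linux", "GOARCH": "arm64", "CGO_ENABLED": "0"}
subprocess.run(
    ["go", "build", "-o", "start-worker-linux-arm64",
     "./tools/worker-runner/cmd/start-worker"],  # assumed package path
    cwd="taskcluster",  # a checkout of github.com/taskcluster/taskcluster
    env=env,
    check=True,
)
```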
I have some WIP scripts for this / modifications to monopacker and some notes to pick this back up after the current sprint.
I wasn't able to claim tasks in my dev environment due to docker-worker reporting capacity: 0; this is something we should be able to work around with some small code edits, as NSS doesn't use video/audio loopbacks.
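A hedged sketch of what that workaround might amount to in config rather than code. The deviceManagement field names below are my recollection of docker-worker's config schema and may not match exactly; verify against the actual schema before relying on them:

```python
# Fragment of a docker-worker config (as a Python dict that would be
# serialized into the worker's YAML config file). Assumption: disabling
# the loopback devices stops them from driving the computed capacity to 0.
docker_worker_config = {
    "capacity": 1,
    "deviceManagement": {
        # NSS tasks don't use audio/video loopbacks, so don't require
        # /dev/video* or snd-aloop devices on this ARM image.
        "loopbackAudio": {"enabled": False},
        "loopbackVideo": {"enabled": False},
    },
}
```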
Reporter
Comment 5•5 years ago
(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #4)
I wasn't able to claim tasks in my dev environment due to docker-worker reporting capacity: 0; this is something we should be able to work around with some small code edits, as NSS doesn't use video/audio loopbacks.
:kjacobs - I think this is an important point to clarify. Do any of the NSS workloads that would be running on these workers require the audio and/or video loopback devices that we create for Firefox testing? If we don't need to customize a kernel to add these devices on ARM, our path gets a lot simpler (and quicker), but I'd like confirmation before we proceed.
Comment 6•5 years ago
That's correct - NSS has no need for these devices. Thanks for checking.
Comment hidden (Intermittent Failures Robot)
Comment 8•4 years ago
:coop or :miles, is there any updated ETA on worker availability?
Reporter
Comment 9•4 years ago
(In reply to Kevin Jacobs [:kjacobs] from comment #8)
:coop or :miles, is there any updated ETA on worker availability?
Sorry, I should have updated. :miles will be back to this tomorrow. With the streamlined kernel requirements we should have something for validation shortly (1-2 days).
Comment 10•4 years ago
I've been working on this a bit more, shooting to have a working instance to test on by EOD tomorrow, and will test manually baking an AMI from that instance so that we have a replacement strategy should it fail.
Due to the level of customizations/hackery I haven't built an entirely new set of scripts to automate creating the image; I've taken notes on the changes I've made so we can adapt from there.
Comment 11•4 years ago
I have an arm64 instance that I've claimed some tasks on [0] in stage, so I'm going to create a worker in production now. That should make it easier for NSS to test, given the lack of convenient ways to trigger pushes in stage.
The instance is configured as a static worker using the static provisioner and a test worker-pool / worker-type that I've configured.
[0] https://stage.taskcluster.nonprod.cloudops.mozgcp.net/tasks/OHdGKi-8TN-ufvKvHLnJ_Q
Comment 12•4 years ago
I duplicated my test worker in production but need the client's credentials reset, as they were lost with the old machine.
Tom, could you please reset the static secret for https://firefox-ci-tc.services.mozilla.com/auth/clients/project%2Fnss-nspr%2Faarch64 and send it to me? Not sure if you're the right person to ask, but I figured you would have access.
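For reference, resetting a client's access token is a standard Auth API call; a sketch with the Taskcluster Python client (caller credentials elided; the caller needs the matching auth:reset-access-token:... scope):

```python
import taskcluster

auth = taskcluster.Auth({
    "rootUrl": "https://firefox-ci-tc.services.mozilla.com",
    "credentials": {"clientId": "...", "accessToken": "..."},
})

# Invalidates the lost token and returns a fresh one in the response.
resp = auth.resetAccessToken("project/nss-nspr/aarch64")
print(resp["accessToken"])
```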
Reporter
Comment 13•4 years ago
(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #12)
I duplicated my test worker in production but need the client's credentials reset, as they were lost with the old machine.
None of them are in our password repo?
Comment 14•4 years ago
This also removes the scopes needed to create the client. The clients already exist, and will be managed automatically once Bug 1632009 lands.
Comment 15•4 years ago
I've granted the taskcluster team permissions to reset the access token.
None of them are in our password repo?
I'd generally opt to regenerate access tokens rather than store them separately from the deployment infrastructure for them.
Comment 16•4 years ago
Comment 17•4 years ago
The production worker is up and running; the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other nss-related worker-pools in Firefox CI: they are prefixed with nss/.
Because of the client structure I configured the worker as standalone rather than static (sketch below), which should be fine but means that credentials are long-lived and tied to the client.
Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker as in other nss worker types?
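For readers unfamiliar with the standalone setup, here's roughly what the worker-runner config on the box would look like. The shape follows worker-runner's runner.yml as I understand it; the workerGroup, workerId, and paths are assumptions:

```python
import yaml

# Standalone workers carry their own long-lived Taskcluster credentials,
# unlike static/managed workers whose credentials come from worker-manager.
runner_config = {
    "provider": {
        "providerType": "standalone",
        "rootUrl": "https://firefox-ci-tc.services.mozilla.com",
        "clientId": "project/nss-nspr/aarch64",
        "accessToken": "<long-lived token>",
        "workerPoolId": "localprovisioner/nss-aarch64",
        "workerGroup": "aws",          # assumption
        "workerId": "nss-aarch64-1",   # assumption
    },
    "worker": {
        "implementation": "docker-worker",
        "path": "/opt/docker-worker",            # assumption
        "configPath": "/etc/docker-worker.yml",  # assumption
    },
}

with open("runner.yml", "w") as f:
    yaml.safe_dump(runner_config, f)
```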
Comment 18•4 years ago
I can confirm that the new worker is functioning as expected. Thanks!
I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?
I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.
Reporter
Comment 19•4 years ago
(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #17)
The production worker is up and running; the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other nss-related worker-pools in Firefox CI: they are prefixed with nss/.
Because of the client structure I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.
Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker as in other nss worker types?
Thanks for this, Miles. I'm sure NSS is glad to get their ARM coverage back.
Now that we're on AWS, we do have the opportunity to make these pools flexible, i.e. livestock not pets. The frequency of check-ins on NSS is much lower than for Firefox, so we will absolutely spend less money if we can spin the pool down when there are no jobs pending. It may also depend on whether the NSS team still needs direct access to the worker (see below). I'll file a follow-up bug for that.
(In reply to Kevin Jacobs [:kjacobs] from comment #18)
I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?
The level maps to the trust level of the machine. For NSS, the L1 machines would map to the nss-try tree and L3 would map to the nss tree. The distinction exists because (at least in the Firefox case) the restrictions around who can push to Try are much looser, and you don't want random jobs from Try poisoning future nightly/release builds.
If you're not expecting to have Try coverage for ARM, then we can eliminate the need for L1 builders. However, it looks like the task list is the same on Try.
I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.
What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.
Reporter
Comment 20•4 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #19)
What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.
Pinging :kjacobs ^^
Reporter
Comment 21•4 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #19)
I'll file a follow-up bug for that.
Comment 22•4 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #19)
(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #17)
The production worker is up and running; the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other nss-related worker-pools in Firefox CI: they are prefixed with nss/.
Because of the client structure I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.
Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker as in other nss worker types?
Thanks for this, Miles. I'm sure NSS is glad to get their ARM coverage back.
Now that we're on AWS, we do have the opportunity to make these pools flexible, i.e. livestock not pets. The frequency of check-ins on NSS is much lower than for Firefox, so we will absolutely spend less money if we can spin the pool down when there are no jobs pending. It may also depend on whether the NSS team still needs direct access to the worker (see below). I'll file a follow-up bug for that.
(In reply to Kevin Jacobs [:kjacobs] from comment #18)
I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?
The level maps to the trust level of the machine. For NSS, the L1 machines would map to the nss-try tree and L3 would map to the nss tree. The distinction exists because (at least in the Firefox case) the restrictions around who can push to Try are much looser, and you don't want random jobs from Try poisoning future nightly/release builds.
If you're not expecting to have Try coverage for ARM, then we can eliminate the need for L1 builders. However, it looks like the task list is the same on Try.
Yes, sounds like we'll want to keep both.
I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.
What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.
Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?
Comment 23•4 years ago
(In reply to Kevin Jacobs [:kjacobs] from comment #22)
What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.
Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?
Do you need access to a worker explicitly, or would access to any ARM machine work? Given that AWS has ARM machines, would having access to a different EC2 ARM machine (either always on, or on demand) in a different account work for manual testing?
Comment 24•4 years ago
(In reply to Tom Prince [:tomprince] from comment #23)
(In reply to Kevin Jacobs [:kjacobs] from comment #22)
What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.
Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?
Do you need access to a worker explicitly, or would access to any ARM machine work? Given that AWS has ARM machines, would having access to a different EC2 ARM machine (either always on, or on demand) in a different account work for manual testing?
No, any aarch64 machine should do the trick. The worker was just convenient and available.
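If we go the on-demand route, spinning up a throwaway aarch64 box is a couple of API calls; a sketch with boto3 (the AMI, key pair, and region are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

resp = ec2.run_instances(
    ImageId="ami-00000000000000000",  # placeholder: any arm64 Linux AMI
    InstanceType="a1.medium",
    MinCount=1,
    MaxCount=1,
    KeyName="nss-test",               # hypothetical key pair for SSH access
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "nss-aarch64-manual-test"}],
    }],
)
print(resp["Instances"][0]["InstanceId"])
```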
Comment 25•4 years ago
Tom, is there a process that we should go through in order to get access to one of these arm machines for testing?
Comment 26•4 years ago
This would probably make the most sense outside of taskcluster, so redirecting to :fubar, who I think manages our AWS accounts.
Comment 27•4 years ago
I think we can do this inside the NSS team ourselves, actually. I'll DM you, klibby, if we encounter issues.
Reporter
Comment 28•4 years ago
:miles - what are the next steps here? Are we converting this to a managed pool, or leaving it in localprovisioner?
Comment 29•4 years ago
This worker type has lots of potential beyond NSS. In particular the JS team would be interested in having native AArch64 builds and tests on such machines, at least for the JS shell.
Comment 30•4 years ago
(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #10)
Due to the level of customizations / hackery I haven't built an entirely new set of scripts to automate creating the image, I've taken notes on the changes I've made so we can adapt from there.
Hey Miles, could you put a copy of the notes (or a link to them) in this bug? Thanks!
Comment 31•4 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #30)
Hey Miles, could you put a copy of the notes (or a link to them) in this bug? Thanks!
Notes from Miles: https://github.com/taskcluster/taskcluster/issues/3524#issue-703707552
Comment 32•4 years ago
Coop, should we close this?
I believe the workers are running, and the only caveat is that the machine images were manually created; issue 3524 (see comment 31) tracks automating this process for future linux/aarch64 workers.
Comment 33•4 years ago
👋!
The important piece is that, because that instance was manually created, if it is deleted the data is lost. Two good steps to make things more recoverable/durable:
1. Disable volume deletion on instance termination so that the root volume can be reused on another instance.
2. Take a snapshot of the current instance state (or, for bonus points, make an AMI out of it; note that if you make an AMI, you should verify the docker-worker-relevant services are set to start on boot).
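Both steps are one-liners against the EC2 API; a sketch with boto3 (the instance ID and root device name are placeholders/assumptions):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
instance_id = "i-00000000000000000"  # placeholder for the manually created worker

# 1. Keep the root volume around if the instance is ever terminated.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",  # assumption: the instance's root device name
        "Ebs": {"DeleteOnTermination": False},
    }],
)

# 2. Bake an AMI from the running instance (reboots by default for a
#    consistent filesystem image).
ec2.create_image(InstanceId=instance_id, Name="nss-aarch64-docker-worker")
```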
Reporter
Comment 34•4 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #32)
Coop, should we close this?
Sure, we can close this. I'll file follow-up issues for Miles' call-outs in comment #33.
Reporter
Comment 35•4 years ago
(In reply to Miles Crabill [:miles] from comment #33)
2. Take a snapshot of the current instance state (or, for bonus points, make an AMI out of it; note that if you make an AMI, you should verify the docker-worker-relevant services are set to start on boot).
I took a snapshot of the running instance. The snapshot is in us-west-2: snap-0a41f3f82635cd474.
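To verify the snapshot and restore from it later, a boto3 sketch (the availability zone for the new volume is an assumption):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Confirm the snapshot completed.
snap = ec2.describe_snapshots(SnapshotIds=["snap-0a41f3f82635cd474"])["Snapshots"][0]
print(snap["State"], snap["Progress"])

# To restore, create a volume from the snapshot in the target instance's AZ
# and attach it to a replacement instance.
vol = ec2.create_volume(
    SnapshotId="snap-0a41f3f82635cd474",
    AvailabilityZone="us-west-2a",  # assumption: must match the new instance's AZ
)
print(vol["VolumeId"])
```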