Closed Bug 1636245 Opened 5 years ago Closed 4 years ago

Create new worker pool for NSS ARM workers

Categories

(Release Engineering :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Unassigned)

References

(Blocks 1 open bug, Regression)

Details

Attachments

(1 file)

As part of the current cost reduction efforts, we're trying to reduce the number of cloud providers we support, specifically by exiting packet.net. Over in bug 1632599, we're trying to move Mozilla CI workloads out of packet.net and into AWS now that Amazon supports bare metal instance types. There is currently a single, statically-provisioned ARM machine that was set up for NSS testing; it is the last machine we have running in packet.net.

Amazon has a variety of ARM options now. We'd like to transition NSS from packet.net to AWS.

Here are the specs for the current machine in packet.net:

  • PROC: 2 x Cavium ThunderX CN8890 @2GHz
  • RAM: 128GB
  • DISK: 1 x 340GB SSD
  • NIC: 2 x 10Gbps Bonded Ports

I would humbly suggest that this is over-provisioned for running a single CI worker. I think an a1.large instance in AWS would be sufficient for the new pool. I also think that we can get away with a very small pool here, capping the max # of instances at 5 (or less) given that check-ins are infrequent and a single worker is keeping up just fine right now.

Note: I've cc-ed some Taskcluster folks here because I'm unsure about the vintage of docker-worker that's running on ARM in packet.net. We may need to engage with releng to get something running on ARM in AWS.

See also bug 1594891 where releng inherited many of the other NSS worker pools.

(In reply to Chris Cooper [:coop] pronoun: he from comment #0)

I would humbly suggest that this is over-provisioned for running a single CI worker. I think an a1.large instance in AWS would be sufficient for the new pool. I also think that we can get away with a very small pool here, capping the max # of instances at 5 (or less) given that check-ins are infrequent and a single worker is keeping up just fine right now.

Having now verified that I can connect to the NSS machine in packet.net, I can see that it's actually running with capacity=20, which makes more sense given the beefiness of the machine.

I'd suggest stepping down to an a1.medium instance in AWS, but having a max pool size of 20 and running 1 worker per instance.
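For concreteness, here's a rough sketch of what a worker-manager pool definition along those lines might look like. The field names follow my understanding of the Taskcluster AWS provider's config shape, and the pool name, AMI ID, and region are placeholders, so treat this as illustrative rather than a ready-to-land config:

  # Hypothetical pool definition for the proposed AWS-based NSS ARM pool.
  nss_aarch64_pool = {
      "minCapacity": 0,   # scale to zero between (infrequent) check-ins
      "maxCapacity": 20,  # matches the capacity of the packet.net machine
      "launchConfigs": [
          {
              "region": "us-west-2",
              "capacityPerInstance": 1,  # 1 worker per instance
              "launchConfig": {
                  "InstanceType": "a1.medium",
                  "ImageId": "ami-00000000000000000",  # placeholder arm64 docker-worker AMI
              },
          }
      ],
  }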

Priority: -- → P1
Regressed by: 1636101

:miles is looking to get an ARM64 docker-worker image running in AWS. Once that's in place we can figure out next steps.

IIRC docker-worker needed some code tweaks to work on ARM and I'm not sure those were ever landed. I guess we'll find out soon.

(In reply to Chris Cooper [:coop] pronoun: he from comment #2)

IIRC docker-worker needed some code tweaks to work on ARM and I'm not sure those were ever landed. I guess we'll find out soon.

Found a branch that may help with this: https://github.com/taskcluster/docker-worker/compare/packet-net

I did some hacking on this; here's some of that state:

I have some WIP scripts for this / modifications to monopacker and some notes to pick this back up after the current sprint.

I wasn't able to claim tasks in my dev environment due to docker-worker reporting capacity: 0. This is something we should be able to work around with some small code edits, as NSS doesn't use video/audio loopbacks.
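If memory serves, the relevant knobs are docker-worker's device-management settings, which tie usable capacity to the loopback devices available on the host. A minimal sketch of the kind of config change involved, written as a Python dict; the key names are from my recollection of docker-worker's defaults and should be checked against the docker-worker source:

  # Hypothetical excerpt of a docker-worker config for the NSS pool.
  docker_worker_config = {
      "capacity": 1,  # tasks this worker runs concurrently
      "deviceManagement": {
          # With loopback devices disabled, capacity should no longer be
          # limited by the /dev/video* and /dev/snd* devices present on
          # the ARM host; NSS tasks don't use them anyway.
          "loopbackAudio": {"enabled": False},
          "loopbackVideo": {"enabled": False},
      },
  }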

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #4)

I wasn't able to claim tasks in my dev environment due to docker-worker reporting capacity: 0. This is something we should be able to work around with some small code edits, as NSS doesn't use video/audio loopbacks.

:kjacobs - I think this is an important point to clarify. Do any of the NSS workloads that would be running on these workers require the audio and/or video loopback devices that we create for Firefox testing? If we don't need to customize a kernel to add these devices on ARM, our path gets a lot simpler (and quicker), but I'd like confirmation before we proceed.

Flags: needinfo?(kjacobs.bugzilla)

That's correct - NSS has no need for these devices. Thanks for checking.

Flags: needinfo?(kjacobs.bugzilla)

:coop or :miles, is there any updated ETA on worker availability?

Flags: needinfo?(coop)

(In reply to Kevin Jacobs [:kjacobs] from comment #8)

:coop or :miles, is there any updated ETA on worker availability?

Sorry, I should have updated. :miles will be back to this tomorrow. With the streamlined kernel requirements we should have something for validation shortly (1-2 days).

Assignee: nobody → miles
Status: NEW → ASSIGNED
Flags: needinfo?(coop)

I've been working on this a bit more, shooting to have a working instance to test on by EOD tomorrow, and I will test manually baking an AMI from that instance so that we have a replacement strategy should it fail.

Due to the level of customizations/hackery, I haven't built an entirely new set of scripts to automate creating the image; I've taken notes on the changes I've made so we can adapt from there.

I have an arm64 instance in stage that I've claimed some tasks on [0], so I'm going to create a worker in production now. That should make it easier for NSS to test, given the lack of convenient ways to trigger pushes in stage.

The instance is configured as a static worker using the static provisioner and a test worker-pool / worker-type that I set up.

[0] https://stage.taskcluster.nonprod.cloudops.mozgcp.net/tasks/OHdGKi-8TN-ufvKvHLnJ_Q
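For reference, registering a static worker like this boils down to a single worker-manager createWorker call. A hedged sketch with the Taskcluster Python client, where the pool/worker names and the secret are placeholders and the exact providerInfo shape for the static provider should be double-checked against the worker-manager docs:

  import datetime
  import taskcluster

  # The caller needs worker-manager create-worker scopes for this pool;
  # the credentials here are placeholders.
  wm = taskcluster.WorkerManager({
      "rootUrl": "https://stage.taskcluster.nonprod.cloudops.mozgcp.net",
      "credentials": {"clientId": "...", "accessToken": "..."},
  })

  expires = (datetime.datetime.utcnow() + datetime.timedelta(days=365)).isoformat() + "Z"

  # Register the instance; docker-worker on the box later presents the
  # same static secret when it calls registerWorker.
  wm.createWorker(
      "localprovisioner/nss-aarch64-test",  # hypothetical workerPoolId
      "us-west-2",                          # workerGroup
      "nss-aarch64-1",                      # workerId
      {
          "expires": expires,
          "providerInfo": {"staticSecret": "0" * 44},  # placeholder 44-char secret
      },
  )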

I duplicated my test worker in production, but I need the client's credentials to be reset, as they were lost on the old machine.

Tom, could you please reset the static secret for https://firefox-ci-tc.services.mozilla.com/auth/clients/project%2Fnss-nspr%2Faarch64 and send it to me? Not sure if you're the right person to ask, but I figured you would have access.

Flags: needinfo?(mozilla)

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #12)

I duplicated my test worker in production, but I need the client's credentials to be reset, as they were lost on the old machine.

None of them are in our password repo?

Flags: needinfo?(miles)

This also removes the scopes needed to create the client. The clients already exist, and will be managed automatically once Bug 1632009 lands.

I've granted the taskcluster team permissions to reset the access token.

None of them are in our password repo?

I'd generally opt to regenerate access tokens, rather than store them separately from their deployment infrastructure.

Flags: needinfo?(mozilla)
Flags: needinfo?(miles)
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/ci/ci-configuration/rev/db8d027d3fbd
Grant taskcluster team permission to reset NSS worker access tokens; r=Callek
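With that grant in place, rotating the worker client's token is a single call against the auth service, which both invalidates the old token and returns a fresh one. A minimal sketch with the Taskcluster Python client; the caller's credentials are placeholders and need the auth:reset-access-token:project/nss-nspr/aarch64 scope granted above:

  import taskcluster

  # Placeholder credentials for a client holding the reset scope.
  auth = taskcluster.Auth({
      "rootUrl": "https://firefox-ci-tc.services.mozilla.com",
      "credentials": {"clientId": "...", "accessToken": "..."},
  })

  # Invalidates the old access token and returns a fresh one.
  result = auth.resetAccessToken("project/nss-nspr/aarch64")
  print(result["accessToken"])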

The production worker is up and running; the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other NSS-related worker-pools in Firefox CI: they are prefixed with nss/.

Because of the client structure, I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.

Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker, as there is for the other NSS worker types?

I can confirm that the new worker is functioning as expected. Thanks!

I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?

I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #17)

The production worker is up and running; the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other NSS-related worker-pools in Firefox CI: they are prefixed with nss/.

Because of the client structure, I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.

Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker, as there is for the other NSS worker types?

Thanks for this, Miles. I'm sure NSS is glad to get their ARM coverage back.

Now that we're on AWS, we do have the opportunity to make these pools flexible, i.e. livestock, not pets. The frequency of check-ins on NSS is much lower, so we will absolutely spend less money if we can spin the pool down when there are no jobs pending. It may also depend on whether the NSS team still needs direct access to the worker (see below). I'll file a follow-up bug for that.

(In reply to Kevin Jacobs [:kjacobs] from comment #18)

I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?

The level maps to the trust level of the machine. For NSS, the L1 machines would map to the nss-try tree and L3 would map to the nss tree. The distinction exists because, at least in the Firefox case, the restrictions around who can push to Try are much lower, and you don't want random jobs from Try poisoning future nightly/release builds.

If you're not expecting to have Try coverage for ARM, then we can eliminate the need for L1 builders. However, it looks like the task list is the same on Try.

I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Pinging :kjacobs ^^

Flags: needinfo?(kjacobs.bugzilla)
Blocks: 1648080

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

I'll file a follow-up bug for that.

Bug 1648080

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #17)

The production worker is up and running; the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other NSS-related worker-pools in Firefox CI: they are prefixed with nss/.

Because of the client structure, I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.

Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker, as there is for the other NSS worker types?

Thanks for this, Miles. I'm sure NSS is glad to get their ARM coverage back.

Now that we're on AWS, we do have the opportunity to make these pools flexible, i.e. livestock, not pets. The frequency of check-ins on NSS is much lower, so we will absolutely spend less money if we can spin the pool down when there are no jobs pending. It may also depend on whether the NSS team still needs direct access to the worker (see below). I'll file a follow-up bug for that.

(In reply to Kevin Jacobs [:kjacobs] from comment #18)

I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?

The level maps to the trust level of the machine. For NSS, the L1 machines would map to the nss-try tree and L3 would map to the nss tree. The distinction exists because, at least in the Firefox case, the restrictions around who can push to Try are much lower, and you don't want random jobs from Try poisoning future nightly/release builds.

If you're not expecting to have Try coverage for ARM, then we can eliminate the need for L1 builders. However, it looks like the task list is the same on Try.

Yes, sounds like we'll want to keep both.

I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?

Flags: needinfo?(kjacobs.bugzilla)

(In reply to Kevin Jacobs [:kjacobs] from comment #22)

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?

Do you need access to a worker explicitly, or would access to any ARM machine work? Given that AWS has ARM machines, would having access to a different EC2 ARM machine (either always on, or on demand) in a different account work for manual testing?

(In reply to Tom Prince [:tomprince] from comment #23)

(In reply to Kevin Jacobs [:kjacobs] from comment #22)

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?

Do you need access to a worker explicitly, or would access to any ARM machine work? Given that AWS has ARM machines, would having access to a different EC2 ARM machine (either always on, or on demand) in a different account work for manual testing?

No, any aarch64 machine should do the trick. The worker was just convenient and available.
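For what it's worth, once an account is in place, spinning up a throwaway aarch64 box for this kind of manual testing is a single EC2 call. A sketch with boto3, where the AMI (some arm64 Linux image), key pair name, and instance size are all placeholders:

  import boto3

  ec2 = boto3.client("ec2", region_name="us-west-2")

  # Launch a throwaway aarch64 instance for manually testing patches in
  # a local docker container. AMI ID and key name are placeholders.
  resp = ec2.run_instances(
      ImageId="ami-00000000000000000",
      InstanceType="a1.large",
      KeyName="nss-dev",
      MinCount=1,
      MaxCount=1,
  )
  print(resp["Instances"][0]["InstanceId"])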

Tom, is there a process that we should go through in order to get access to one of these arm machines for testing?

Flags: needinfo?(mozilla)

This would probably make the most sense outside of taskcluster, so redirecting to :fubar, who I think manages our AWS accounts.

Flags: needinfo?(mozilla) → needinfo?(klibby)

I think we can do this inside the NSS team ourselves, actually. I'll DM you, klibby, if we encounter issues.

Flags: needinfo?(klibby)

:miles - what are the next steps here? Are we converting this to a managed pool, or leaving it in localprovisioner?

Flags: needinfo?(miles)

This worker type has lots of potential beyond NSS. In particular, the JS team would be interested in having native AArch64 builds and tests on such machines, at least for the JS shell.

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #10)

Due to the level of customizations/hackery, I haven't built an entirely new set of scripts to automate creating the image; I've taken notes on the changes I've made so we can adapt from there.

Hey Miles, could you put a copy of the notes (or a link to them) in this bug? Thanks!

Flags: needinfo?(miles)
Assignee: miles → nobody
Status: ASSIGNED → NEW

(In reply to Pete Moore [:pmoore][:pete] from comment #30)

Hey Miles, could you put a copy of the notes (or a link to them) in this bug? Thanks!

Notes from Miles: https://github.com/taskcluster/taskcluster/issues/3524#issue-703707552

Coop, should we close this?

I believe the workers are running. The only caveat is that the machine images were manually created, but issue 3524 (see comment 31) tracks automating this process for future linux/aarch64 workers.

Flags: needinfo?(coop)

👋!

The important piece is that because the instance was manually created, if it is deleted the data is lost. Two good steps to take to make things more recoverable/durable would be:

1. Disable volume deletion on instance termination so that the root volume can be reused on another instance.
2. Take a snapshot of the current instance state (or, for bonus points, make an AMI out of it; note that if you make an AMI you should verify the docker-worker-relevant services are set to start on boot).
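Both steps are quick calls against the EC2 API. A sketch with boto3, where the instance ID is a placeholder and the root device name is assumed to be the default:

  import boto3

  ec2 = boto3.client("ec2", region_name="us-west-2")
  instance_id = "i-00000000000000000"  # placeholder for the NSS worker instance

  # 1. Keep the root volume around if the instance is terminated.
  ec2.modify_instance_attribute(
      InstanceId=instance_id,
      BlockDeviceMappings=[{
          "DeviceName": "/dev/xvda",  # assumes the default root device name
          "Ebs": {"DeleteOnTermination": False},
      }],
  )

  # 2. Bake an AMI from the instance; NoReboot avoids restarting the
  #    worker, and the resulting image should be checked so that the
  #    docker-worker services start on boot.
  ec2.create_image(
      InstanceId=instance_id,
      Name="nss-aarch64-docker-worker",
      NoReboot=True,
  )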

(In reply to Pete Moore [:pmoore][:pete] from comment #32)

Coop, should we close this?

Sure, we can close this. I'll file follow-up issues for Miles' call-outs in comment #33.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED

(In reply to Miles Crabill [:miles] from comment #33)
2. Take a snapshot of the current instance state (or, for bonus points, make an AMI out of it; note that if you make an AMI you should verify the docker-worker-relevant services are set to start on boot).

I took a snapshot of the running instance. The snapshot is in us-west-2: snap-0a41f3f82635cd474.
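If the instance is ever lost, that snapshot can be turned back into a bootable root volume for a replacement aarch64 instance. A sketch with boto3; the availability zone is arbitrary within us-west-2 but must match the replacement instance's:

  import boto3

  ec2 = boto3.client("ec2", region_name="us-west-2")

  # Recreate the root volume from the saved snapshot; it can then be
  # attached to a replacement instance as its root device.
  volume = ec2.create_volume(
      AvailabilityZone="us-west-2a",
      SnapshotId="snap-0a41f3f82635cd474",
      VolumeType="gp3",
  )
  print(volume["VolumeId"])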
