1594891 - stand up NSS CI in new Taskcluster Firefox-ci environment, have releng take administration of NSS going forward

Jordan Lund (:jlund)

Reporter

Description

•

5 years ago

NSS is going to be part of FF-CI.

Needs workers, tooltool support, updated tc clients, and to point to the new Taskcluster root url.

Jordan Lund (:jlund)

Reporter

Updated

•

5 years ago

Blocks: 1591152

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

•

5 years ago

https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=b775e9d2d4c027df3b599477571ad0bb7bf12a92 is a start -- replaces hard-coded taskcluster.net references and updates the taskcluster client version. It doesn't update tooltool support or switch it to use new workers.

Miles Crabill [:miles]

Comment 2

•

5 years ago

The changes to .taskcluster.yml happened here: https://hg.mozilla.org/projects/nss/rev/a2bebaad41dd7b8e543c2ba8df1f9b83ab01876c
The changes to add the worker-pools happened here: https://phabricator.services.mozilla.com/D52233

I'm not totally sure what tooltool support entails, unfortunately. Does that mean publishing tooltool manifests for nss artifacts?

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

5 years ago

Attached file Bug 1594891 - Updates to run correctly on the new TC deployment r=kjacobs (deleted) — Details

Update the Taskcluster client used in the decision task to one that
understands Taskcluster rootUrls.
Update scripts that fetch content to use the TASKCLUSTER_ROOT_URL
- the absence of this variable signals an "old" worker, so we use an "old" URL
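
The fallback described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the patch: `api_url` and its `env` parameter are hypothetical names, and the two URL shapes (per-service hostnames under taskcluster.net for the legacy deployment, `<rootUrl>/api/<service>/<version>/...` otherwise) reflect my understanding of the Taskcluster URL conventions rather than anything stated in this bug.

```python
import os

# Assumption: the legacy deployment is identified by this root URL.
LEGACY_ROOT_URL = "https://taskcluster.net"


def api_url(service, version, path, env=None):
    """Build a Taskcluster API URL, falling back to the legacy layout.

    `env` defaults to os.environ; it is a parameter only so the
    behaviour can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    root_url = env.get("TASKCLUSTER_ROOT_URL", "").rstrip("/")
    path = path.lstrip("/")

    # No TASKCLUSTER_ROOT_URL means an "old" worker, so use the "old" URL.
    if not root_url or root_url == LEGACY_ROOT_URL:
        return f"https://{service}.taskcluster.net/{version}/{path}"

    # A rootUrl-aware worker: every service lives under <rootUrl>/api/.
    return f"{root_url}/api/{service}/{version}/{path}"


# Hypothetical usage; "https://tc.example.com" is a placeholder root URL.
old_style = api_url("queue", "v1", "task/abc/artifacts", env={})
new_style = api_url(
    "queue", "v1", "task/abc/artifacts",
    env={"TASKCLUSTER_ROOT_URL": "https://tc.example.com"},
)
```

Keeping the decision in one helper means each fetching script only has to call it, instead of repeating the environment check.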

Dustin J. Mitchell [:dustin] (he/him)

Comment 4

•

5 years ago

OK, miles and I have had a deep look at this.

The nss source is mostly in good shape: the worker pools have been renamed and no further renaming will be required. The patch above needed adjustments for the situations where TASKCLUSTER_ROOT_URL isn't set (which is on the old docker-worker version on the packet.net laptop).

As for tooltool, the hypothesis is that it's fine as-is.

Here's the state of workers:

nss-1/win2012r2 is confirmed to be working in the new deployment (AWS instances)
localprovisioner/nss-macos-10-12 is backed by 10 instances of generic-worker on a macstadium host at 208.52.182.28. That has my SSH keys and miles' on it now, and :kjacobs also has access
localprovisioner/nss-aarch64 is backed by a host in packet.net, to which miles has access. It's got an ancient version of docker-worker on it.

During the tree-closing window, the plan is:

log in to the macstadium host and reconfigure all 10 instances, using the same clientId and accessToken:
- remove all *BaseURL configuration
- update accessToken
- update rootURL
- start a worker manually to verify
- then reboot and see that all workers start correctly with the correct context, etc.
log in to the packet.net host and (upgrade docker and)? reconfigure its credentials
- details TBD
run a try push
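
The per-instance config changes in the plan above (drop the *BaseURL settings, set the new rootURL and accessToken) could be scripted along these lines. This is a sketch under assumptions: it treats the worker config as a flat JSON object, `migrate_worker_config` is a hypothetical helper, and the sample keys and values (`queueBaseURL`, `authBaseURL`, the example root URL, the client ID and tokens) are placeholders rather than the real configuration.

```python
import json


def migrate_worker_config(config, root_url, access_token):
    """Return a copy of a worker config dict prepared for the new deployment."""
    # Remove every legacy per-service endpoint setting (the "*BaseURL" keys).
    migrated = {k: v for k, v in config.items() if not k.endswith("BaseURL")}
    # Point the worker at the new deployment with its new token.
    migrated["rootURL"] = root_url
    migrated["accessToken"] = access_token
    return migrated


# Placeholder input; a real config would be loaded with json.load().
old_config = {
    "clientId": "example-client",
    "accessToken": "old-token",
    "queueBaseURL": "https://queue.example.invalid/v1",
    "authBaseURL": "https://auth.example.invalid/v1",
}
new_config = migrate_worker_config(
    old_config, "https://tc.example.com", "new-token"
)
print(json.dumps(new_config, indent=2, sort_keys=True))
```

The function returns a new dict and leaves its input untouched, so the original config can be kept as a backup before anything is written to disk.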

To get there:

[DONE] verify that mac workers can talk to the new deployment by reconfiguring one of them
verify that an updated docker-worker version can talk to the legacy deployment
verify that an updated docker-worker version can talk to the firefox-ci deployment
with at least one of each kind of worker, run a try push in the firefox-ci deployment
- this will validate the assumption about tooltool

The bottom line is, we should (and we'll know for sure when the "to get there" items are complete) be able to make this transition with no downtime for NSS outside of the TCW, and with no need for NSS staff to be available on Saturday.

Miles Crabill [:miles]

Comment 5

•

5 years ago

While I was reconfiguring the nss-aarch64 worker in packet.net, the box disconnected me and went dark to ping and ssh. I'm not sure of the cause. Here is what I had done up to that point:

Backed up the existing docker-worker checkout to /home/ci/docker-worker-bak
Backed up the existing service definition /lib/systemd/system/docker-worker.service to /home/ci/service.bak
Edited /etc/docker-worker.conf to add TASKCLUSTER_ROOT_URL env var
Used nvm to install node v8.15.0 to test with a checkout of docker-worker v201911061915
Edited /lib/systemd/system/docker-worker.service to point to that checkout

And then the server disconnected me, went dark to ping and ssh for a period, at some point let me back in, then disconnected me again.
I waited ~30 minutes without being able to get back into the box, then triggered a reboot of the instance via the packet.net web UI.

After the reboot the instance responded to ping for a period, then went dark again. I’m going to engage packet.net support to try to resolve this tomorrow.

For now, the impact is that aarch64 tasks will not run.

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

5 years ago

Status Update:

We ran a try push in the new deployment!
https://treeherder-prototype.herokuapp.com/#/jobs?repo=nss-try&revision=cc8b3f93ff090c267734609d1ab2918f1049a58a&selectedJob=274493469
I see some orange there, but I can't tell what's wrong. Any assistance interpreting that information would be helpful! A few thoughts:
- this try push re-generated all of the docker images, so it's possible that one of those images includes an incompatible "latest" version of some dependency
- these workers are identical in configuration to those in this push which succeeded
The packet.net laptop appears to be down, as described in the previous comment.
The mac workers don't work when run from the command-line, and I don't have sudo access to reboot the host to get them to start automatically. @franziskus (as I think the next person to have working hours) can you figure out a way for me to be able to reboot the host? Either a sudoers entry, or admin password, or any other solution. I'll need to do it again on Saturday. 9 of the 10 workers continue to function as usual, so mac CI can continue.

Thanks for any help with the above from any of the NSS folks.

J.C. Jones [:jcj] (he/him)

Comment 7

•

5 years ago

https://hg.mozilla.org/projects/nss/rev/67d630e7cb7cbf943408f9e238d41839028bcfee

Pete Moore [:pmoore][:pete]

Comment 8

•

5 years ago

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #6)

> The mac workers don't work when run from the command-line, and I don't have sudo access to reboot the host to get them to start automatically. @franziskus (as I think the next person to have working hours) can you figure out a way for me to be able to reboot the host? Either a sudoers entry, or admin password, or any other solution. I'll need to do it again on Saturday. 9 of the 10 workers continue to function as usual, so mac CI can continue.
>
> Thanks for any help with the above from any of the NSS folks.

I've restarted worker 3 and I reran task u2GyIr2XQteNJ8s6udPqdw which has now turned green! 🥳

To stop one of the ten workers on 208.52.182.28 (workers numbered from 0 to 9):

sudo launchctl unload /Library/LaunchDaemons/net.generic.worker.[0-9].plist

To start a worker:

sudo launchctl load /Library/LaunchDaemons/net.generic.worker.[0-9].plist

To tail the logs of a worker:

tail -1000f /Users/administrator/worker[0-9]/generic-worker.log

I've added the administrator credentials for both NSS macstadium machines to the taskcluster team password store, together with the instructions above.

Config files of workers:

/Users/administrator/worker[0-9]/generic-worker.config

Note the workers are running simple engine, not multiuser engine:

administrators-Mac-mini-98:~ administrator$ /usr/local/bin/generic-worker --version
generic-worker (simple engine) 15.1.4 [ revision: https://github.com/taskcluster/generic-worker/commits/c407e45e3f019599005971b30993f29eb3c59b0d ]
administrators-Mac-mini-98:~ administrator$
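
The stop and start commands above can be combined into a restart for a single worker, roughly as sketched here. Two assumptions are baked in: that `[0-9]` in the paths stands for one worker's index rather than a shell glob over all ten, and that a restart is simply an unload followed by a load. `restart_commands` is a hypothetical helper that only builds the argument lists; it does not run anything.

```python
# Path pattern taken from the launchctl commands above; {index} is 0..9.
PLIST_TEMPLATE = "/Library/LaunchDaemons/net.generic.worker.{index}.plist"


def restart_commands(index):
    """Return the argv lists to stop and then start one worker."""
    if index not in range(10):
        raise ValueError("worker index must be between 0 and 9")
    plist = PLIST_TEMPLATE.format(index=index)
    return [
        ["sudo", "launchctl", "unload", plist],  # stop the worker
        ["sudo", "launchctl", "load", plist],    # start it again
    ]


# Example: show the commands that would restart worker 3.
for argv in restart_commands(3):
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` on the host; printing first makes it easy to review the commands before executing them.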

Wander Lairson Costa

Comment 9

•

5 years ago

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #5)

> In the process of reconfiguring the nss-aarch64 worker in packet.net the box disconnected me and went dark to ping and ssh. I wasn’t totally sure of the cause, what I had done:
>
> Backed up the existing docker-worker checkout to /home/ci/docker-worker-bak
>
> Backed up the existing service definition /lib/systemd/system/docker-worker.service to /home/ci/service.bak
>
> Edited /etc/docker-worker.conf to add TASKCLUSTER_ROOT_URL env var
>
> Used nvm to install node v8.15.0 to test with a checkout of docker-worker v201911061915
>
> Edited /lib/systemd/system/docker-worker.service to point to that checkout
>
> And then the server disconnected me, went dark for a period to ping and ssh, at some point let me back in, then disconnected me again.
> I waited ~30 minutes unable to get back into the box, was unable to, then triggered a reboot of the instance via the packet.net webui.
>
> After the reboot the instance responded to ping for a period, then went dark again. I’m going to engage packet.net support to try to resolve this tomorrow.
>
> For now, the impact is that aarch64 tasks will not run.

Cross posting with slack:

docker-worker is probably failing to start and then rebooting the machine. See https://github.com/taskcluster/docker-worker/blob/master/deploy/template/usr/local/bin/start-docker-worker#L5

Dustin J. Mitchell [:dustin] (he/him)

Comment 10

•

5 years ago

Fresh try push to see the mac jobs run and to see if those oranges in the previous job reproduce here. Also, once the nss-aarch64 host is reconfigured to the new deployment, we should see those jobs run.

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

5 years ago

Updates:

The latest try push looks good -- there were lots of retries of the aarch64 worker as we worked to get it set up, but it ran some green jobs after that. Mac, Windows, and Linux are all green.

The aarch64 worker is still configured to point to the new (firefox-ci) deployment. Per Kevin we will leave it there to avoid further churn and simplify things tomorrow. So, no coverage on aarch64 until that's over.

We will be transitioning the mac workers during the TCW tomorrow. Otherwise, I think we're good!

Dustin J. Mitchell [:dustin] (he/him)

Comment 12

•

5 years ago

This appears done!

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Tom Prince [:tomprince]

Comment 13

•

5 years ago

Attached file Use tc-proxy for nss tooltool; (deleted) — Details

Jordan Lund (:jlund)

Reporter

Comment 14

•

5 years ago

Sounds like we need to fix up tooltool here. I'm re-opening until the patch in comment 13 lands.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Tom Prince [:tomprince]

Comment 15

•

5 years ago

That landed in
https://hg.mozilla.org/projects/nss/rev/c33b214b2ec8a8f85ac472aaf6dea15ac5cead2b

Status: REOPENED → RESOLVED

Closed: 5 years ago → 5 years ago

Resolution: --- → FIXED

J.C. Jones [:jcj] (he/him)

Updated

•

5 years ago

Regressions: 1598485

Chris Cooper [:coop] (he/him)

Updated

•

5 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1636245

Categories

(Release Engineering :: General, enhancement)

Tracking

(Not tracked)

People

(Reporter: jlund, Unassigned)

Security

(public)

Attachments

(2 files)

Bug 1594891 - Updates to run correctly on the new TC deployment r=kjacobs; Dustin J. Mitchell [:dustin] (he/him); 5 years ago; file name: (deleted); content type: text/x-phabricator-request
Use tc-proxy for nss tooltool; Tom Prince [:tomprince]; 5 years ago; file name: (deleted); content type: text/x-phabricator-request