Remove c4 and m4 workers from gecko-*-b-* provisioners
Categories: Release Engineering :: Release Automation: Other, enhancement
Tracking: Not tracked
People: Reporter: glandium; Assigned: glandium
Attachments: 3 files
Looking at build times graphs across all types of workers shows the m4 and c4 ones are significantly slower than the c5 and m5 ones.
Comment 1 (Assignee) • 5 years ago
I've attached the current diff of infra versus commit https://hg.mozilla.org/ci/ci-configuration/rev/3803a9d8efc7551240fe907a350df4e408b5e00d
Resetting back to commit c178d1dbb4b https://hg.mozilla.org/ci/ci-configuration/rev/c178d1dbb4bf5850bc5a0cd485fb7e49c6ac5fcc drops one of the scope changes:
home:ci-configuration_reset house$ diff ../ci-configuration/ci-conf_diff_2019-08-09-2105.log ci-conf_diff_c178d1dbb4b_2019-08-09-2117.log
2514,2532d2513
< @@ -89517,17 +89693,17 @@ Role=repo:github.com/mozilla-mobile/fenix:pull-request:
< - queue:create-task:highest:proj-autophone/gecko-t-bitbar-gw-perf-p2
< - queue:create-task:highest:scriptworker-prov-v1/mobile-signing-dep-v1
< - queue:route:index.project.fenix.android.preview-builds
< - queue:route:index.project.mobile.fenix.cache.level-1.*
< - queue:route:index.project.mobile.fenix.staging-signed-nightly.*
< - queue:route:index.project.mobile.fenix.v2.staging.*
< - queue:route:index.project.mobile.fenix.v3.staging.*
< - queue:route:notify.email.perftest-alerts@mozilla.com.on-failed
< - - secrets:get:project/mobile/fenix/pr
< + - secrets:get:project/fenix/preview-key-store
<
< Role=repo:github.com/mozilla-mobile/fenix:release:
< roleId: repo:github.com/mozilla-mobile/fenix:release
< description:
< *DO NOT EDIT* - This resource is configured automatically by [ci-admin](https://hg.mozilla.org/ci/ci-admin).
<
< Scopes in this role are defined in [ci-configuration/grants.yml](https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml).
< scopes:
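(For reference, a rough sketch of how a comparison like the one above can be produced. The revisions and log file names come from this bug; how ci-admin is pointed at the local ci-configuration checkout is elided, so treat this as illustrative rather than an exact recipe.)
# In a ci-configuration checkout, generate a ci-admin diff of the deployed
# infra against each revision of interest, then compare the two diffs.
hg update -r 3803a9d8efc7551240fe907a350df4e408b5e00d
ci-admin diff > ci-conf_diff_2019-08-09-2105.log
hg update -r c178d1dbb4bf5850bc5a0cd485fb7e49c6ac5fcc
ci-admin diff > ci-conf_diff_c178d1dbb4b_2019-08-09-2117.log
# Lines present only in the first log are pending changes that resetting
# back to c178d1dbb4b would drop.
diff ci-conf_diff_2019-08-09-2105.log ci-conf_diff_c178d1dbb4b_2019-08-09-2117.log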
Updated • 5 years ago
Comment 4 • 5 years ago
I've hit this error before myself. The aws-provisioner doesn't handle it well when you remove instance types from a worker type while instances of those types are still running:
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: 13:49:26.181Z INFO aws-provisioner-production: determined number of pending tasks (workerType=gecko-3-b-linux, pendingTasks=912)
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: 13:49:26.201Z ERROR aws-provisioner-production: error provisioning this worker type, skipping (workerType=gecko-3-b-linux, err={})
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: reportError - level: warning, tags: {"workerType":"gecko-3-b-linux"}
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: Error: gecko-3-b-linux does not contain c4.4xlarge
glandium has already backed out his config change, and dhouse is going to apply it. That should allow us to start provisioning workers again.
I confirmed this morning that the ci-admin diff was the same as in the other attachment. Attached here is the output from the ci-admin apply, showing the diff-matching updates to worker types and various roles.
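(A hypothetical way to check for the lingering c4/m4 instances that trip the provisioner once the instance types are removed; only c4.4xlarge appears in the log above, so the sizes listed are examples, and the query should be narrowed by whatever tags the provisioner sets on gecko-*-b-* workers before acting on it.)
# List running/pending c4/m4 instances in the current region.
aws ec2 describe-instances \
  --filters "Name=instance-type,Values=c4.4xlarge,m4.4xlarge" \
            "Name=instance-state-name,Values=running,pending" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType]' \
  --output text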
Updated • 5 years ago
Tom, can you take this? I'm not sure of the full context or related bugs/changes.
Comment 8 • 5 years ago
I am on PTO for a couple of weeks; redirecting to :bstack
Comment 9 • 5 years ago
dhouse: what is required here, just applying ci-admin and terminating the currently running instances? I'm not sure I have enough context yet.
Comment 10 • 5 years ago
(In reply to Brian Stack [:bstack] from comment #9)
> dhouse: what is required here, just applying ci-admin and terminating the currently running instances? I'm not sure I have enough context yet.
That sounds like what is needed.
I don't think the discussion about this change is in a bug (maybe it was in a chat?), so I don't know whether we want it applied or not.
Also, when it was applied a week ago, it took a few hours before it became apparent that new instances were not being provisioned. So I think we need some awareness of the change and monitoring afterward to make sure we don't repeat the outage.
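(Condensed, the sequence being described here would look roughly like the following; ci-admin's diff and apply subcommands are the ones already referenced in this bug, and the rest is a sketch.)
ci-admin diff     # confirm the pending change is just the c4/m4 removal
ci-admin apply    # push the updated worker types and roles
# Then terminate any still-running c4/m4 instances (see the describe-instances
# query above) and watch the provisioner for "does not contain <instance type>"
# errors over the next few hours.
aws ec2 terminate-instances --instance-ids <lingering c4/m4 instance ids>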
Comment 11 • 5 years ago
Ok, I'm around if we decide it needs to be applied.
Comment 12 (Assignee) • 5 years ago
Can we coordinate a landing of this early next week?
Comment 13 • 5 years ago
Yeah for sure. Afaict the only thing I'm doing here is applying ci-admin so any time in my schedule works great. Who should be around to validate things work?
Comment 14 (Assignee) • 5 years ago
Who would terminate the running instances?
Comment 15 • 5 years ago
Ok, I'm just landing this now. I still kinda feel like I don't have proper context here but hopefully it goes well.
Comment 16 • 5 years ago
This should be done now. I'm watching provisioner logs and terminating anything that gets in its way. Marking as resolved; please reopen and assign me if there's anything else to do!
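(For the record, a sketch of how the provisioner logs can be watched during a change like this, assuming the provisioner still runs as the Heroku app named in the log lines above; the grep pattern is just an example.)
heroku logs --tail --app taskcluster-aws-provisioner2 | grep 'does not contain'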