Closed Bug 1572312 Opened 5 years ago Closed 5 years ago

Remove c4 and m4 workers from gecko-*-b-* provisioners

Categories

(Release Engineering :: Release Automation: Other, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: glandium)

Details

Attachments

(3 files)

Looking at build time graphs across all worker types shows that the m4 and c4 workers are significantly slower than the c5 and m5 ones.

e.g. https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1930882,1,2&series=autoland,1922228,1,2&series=autoland,1922220,1,2&series=autoland,1921038,1,2&series=autoland,1922597,1,2
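
For context, here is a minimal sketch (not from this bug) of the kind of comparison behind those Perfherder graphs: group build times by EC2 instance family and compare medians. The "build_times.csv" layout is hypothetical; the real numbers live in the linked series.

# Hypothetical comparison of build times by instance family. The
# "build_times.csv" layout (instance_type,seconds) is made up for
# illustration; the real data is in the Perfherder series above.
import csv
import statistics
from collections import defaultdict

durations = defaultdict(list)
with open("build_times.csv") as fh:
    for row in csv.DictReader(fh):
        family = row["instance_type"].split(".")[0]   # "c4.4xlarge" -> "c4"
        durations[family].append(float(row["seconds"]))

for family in sorted(durations):
    median_minutes = statistics.median(durations[family]) / 60
    print(f"{family}: median {median_minutes:.1f} min over {len(durations[family])} builds")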

Attached file ci-conf_diff_2019-08-09-1930.log (deleted) —

Resetting back to commit c178d1dbb4b https://hg.mozilla.org/ci/ci-configuration/rev/c178d1dbb4bf5850bc5a0cd485fb7e49c6ac5fcc drops one of the scope changes:

home:ci-configuration_reset house$ diff ../ci-configuration/ci-conf_diff_2019-08-09-2105.log  ci-conf_diff_c178d1dbb4b_2019-08-09-2117.log 
2514,2532d2513
< @@ -89517,17 +89693,17 @@ Role=repo:github.com/mozilla-mobile/fenix:pull-request:
<        - queue:create-task:highest:proj-autophone/gecko-t-bitbar-gw-perf-p2
<        - queue:create-task:highest:scriptworker-prov-v1/mobile-signing-dep-v1
<        - queue:route:index.project.fenix.android.preview-builds
<        - queue:route:index.project.mobile.fenix.cache.level-1.*
<        - queue:route:index.project.mobile.fenix.staging-signed-nightly.*
<        - queue:route:index.project.mobile.fenix.v2.staging.*
<        - queue:route:index.project.mobile.fenix.v3.staging.*
<        - queue:route:notify.email.perftest-alerts@mozilla.com.on-failed
< -      - secrets:get:project/mobile/fenix/pr
< +      - secrets:get:project/fenix/preview-key-store
< 
<    Role=repo:github.com/mozilla-mobile/fenix:release:
<      roleId: repo:github.com/mozilla-mobile/fenix:release
<      description:
<        *DO NOT EDIT* - This resource is configured automatically by [ci-admin](https://hg.mozilla.org/ci/ci-admin).
< 
<        Scopes in this role are defined in [ci-configuration/grants.yml](https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml).
<      scopes:
Severity: normal → blocker
Summary: Remove c4 and m4 workers from gecko-*-b-* provisioners → [trees closed (for lack of capacity?)] Remove c4 and m4 workers from gecko-*-b-* provisioners

I've hit this error before myself. The aws-provisioner doesn't handle it well when you remove instance types from a worker type while instances of those types are still running:

Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: 13:49:26.181Z  INFO aws-provisioner-production: determined number of pending tasks (workerType=gecko-3-b-linux, pendingTasks=912) 
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: 13:49:26.201Z ERROR aws-provisioner-production: error provisioning this worker type, skipping (workerType=gecko-3-b-linux, err={}) 
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: reportError - level: warning, tags: {"workerType":"gecko-3-b-linux"} 
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1:  Error: gecko-3-b-linux does not contain c4.4xlarge
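
For reference, a minimal sketch (not from this bug) of checking for that condition before removing instance types from a worker type: list instances of the old types that are still pending or running, so they can be terminated first. The region, the worker-type tag key, and the m4 size are assumptions; only c4.4xlarge and gecko-3-b-linux appear in the log above.

# Hypothetical pre-check with boto3. The region, the assumption that the
# provisioner stores the worker type in the "Name" tag, and the m4 size
# are all illustrative; verify against the real tagging before relying on it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
paginator = ec2.get_paginator("describe_instances")

leftovers = []
for page in paginator.paginate(
    Filters=[
        {"Name": "instance-type", "Values": ["c4.4xlarge", "m4.4xlarge"]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
        {"Name": "tag:Name", "Values": ["gecko-3-b-linux"]},
    ]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            leftovers.append(instance["InstanceId"])

print(f"{len(leftovers)} old-type instances still running:", leftovers)
# Only after double-checking the list would one terminate them:
# ec2.terminate_instances(InstanceIds=leftovers)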

glandium has already backed out his config change, and dhouse is going to apply the backout. That should allow us to start provisioning workers again.

Attached file ci-con-apply-2019-08-09-0721.log (deleted) —

I confirmed this morning that the ci-admin diff was the same as in the other attachment. Attached here is the output from the ci-admin apply, showing the updates to worker types and various roles matching that diff.

Severity: blocker → normal
Summary: [trees closed (for lack of capacity?)] Remove c4 and m4 workers from gecko-*-b-* provisioners → Remove c4 and m4 workers from gecko-*-b-* provisioners
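
As a side note, here is a hedged sketch of the diff-then-apply workflow described in the comment above, driven from Python. Only the ci-admin diff and apply steps are taken from the attachments in this bug; the log file names and the absence of extra flags are illustrative.

# Sketch of the diff-then-apply flow: capture the diff for review, then apply.
# Log file names follow the pattern of the attachments here; exact ci-admin
# flags are not taken from this bug.
import subprocess
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")

# Capture the diff first so it can be reviewed (and attached, as done here).
with open(f"ci-conf_diff_{stamp}.log", "w") as log:
    subprocess.run(["ci-admin", "diff"], stdout=log, check=True)

# Apply only once the diff matches what was reviewed.
with open(f"ci-con-apply-{stamp}.log", "w") as log:
    subprocess.run(["ci-admin", "apply"], stdout=log, check=True)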

Are we doomed to never land this?

Flags: needinfo?(dhouse)

Tom, can you take this? I'm not sure of the full context or related bugs/changes.

Flags: needinfo?(dhouse) → needinfo?(mozilla)

I am on PTO for a couple of weeks; redirecting to :bstack

Flags: needinfo?(mozilla) → needinfo?(bstack)

dhouse: what is required here, just applying ci-admin and terminating the currently running instances? I'm not sure I have enough context yet.

Flags: needinfo?(bstack) → needinfo?(dhouse)

(In reply to Brian Stack [:bstack] from comment #9)

> dhouse: what is required here, just applying ci-admin and terminating the currently running instances? I'm not sure I have enough context yet.

That sounds like what is needed.

I don't think the discussion about this change is in a bug (maybe it was in a chat?), so I don't know whether we want it applied or not.

Also, when it was applied a week ago, it took some hours before it became apparent that new instances were not getting provisioned. So I think we need some awareness of the change, and monitoring afterward, to make sure we don't repeat the outage.
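
To make that monitoring concrete, here is a hedged sketch (not from this bug) that watches for the failure mode described above: a growing pending backlog with no instances being provisioned. The root URL, provisionerId, region, tag key, and thresholds are all assumptions.

# Hypothetical watchdog: compare the Taskcluster pending count against the
# number of EC2 instances for the worker type. Root URL, provisionerId,
# region, tag key and thresholds are assumptions, not taken from this bug.
import time
import boto3
import taskcluster

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumption
PROVISIONER_ID = "aws-provisioner-v1"                    # assumption
WORKER_TYPE = "gecko-3-b-linux"

queue = taskcluster.Queue({"rootUrl": ROOT_URL})
ec2 = boto3.client("ec2", region_name="us-east-1")       # region is an assumption

while True:
    pending = queue.pendingTasks(PROVISIONER_ID, WORKER_TYPE)["pendingTasks"]
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [WORKER_TYPE]},  # tag key assumed
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )["Reservations"]
    running = sum(len(r["Instances"]) for r in reservations)
    print(f"pending={pending} running={running}")
    if pending > 500 and running == 0:
        print("WARNING: backlog growing with no capacity; check the provisioner")
    time.sleep(600)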

Flags: needinfo?(dhouse)

Ok, I'm around if we decide it needs to be applied.

Can we coordinate a landing of this early next week?

Flags: needinfo?(bstack)

Yeah, for sure. Afaict the only thing I'm doing here is applying ci-admin, so any time in my schedule works great. Who should be around to validate that things work?

Flags: needinfo?(bstack)

Who would terminate the running instances?

Flags: needinfo?(bstack)

Ok, I'm just landing this now. I still kinda feel like I don't have proper context here, but hopefully it goes well.

Flags: needinfo?(bstack)

This should be done now. I'm watching the provisioner logs and terminating anything that gets in its way. Marking as resolved. Please reopen and assign me if there's anything else to do!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED