Closed Bug 1572312 Opened 5 years ago Closed 5 years ago

Remove c4 and m4 workers from gecko-*-b-* provisioners

Categories

(Release Engineering :: Release Automation: Other, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: glandium)

Details

Attachments

(3 files)

Looking at build time graphs across all worker types shows that the m4 and c4 workers are significantly slower than the c5 and m5 ones.

e.g. https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1930882,1,2&series=autoland,1922228,1,2&series=autoland,1922220,1,2&series=autoland,1921038,1,2&series=autoland,1922597,1,2
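
For context, here is a minimal sketch (not from this bug) of the kind of comparison behind those Perfherder graphs: group build times by EC2 instance family and compare medians. The "build_times.csv" layout is hypothetical; the real numbers live in the linked series.

# Hypothetical comparison of build times by instance family. The
# "build_times.csv" layout (instance_type,seconds) is made up for
# illustration; the real data is in the Perfherder series above.
import csv
import statistics
from collections import defaultdict

durations = defaultdict(list)
with open("build_times.csv") as fh:
    for row in csv.DictReader(fh):
        family = row["instance_type"].split(".")[0]   # "c4.4xlarge" -> "c4"
        durations[family].append(float(row["seconds"]))

for family in sorted(durations):
    median_minutes = statistics.median(durations[family]) / 60
    print(f"{family}: median {median_minutes:.1f} min over {len(durations[family])} builds")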

Attached file ci-conf_diff_2019-08-09-1930.log (deleted) —

Resetting back to commit c178d1dbb4b https://hg.mozilla.org/ci/ci-configuration/rev/c178d1dbb4bf5850bc5a0cd485fb7e49c6ac5fcc drops one of the scope changes:

home:ci-configuration_reset house$ diff ../ci-configuration/ci-conf_diff_2019-08-09-2105.log  ci-conf_diff_c178d1dbb4b_2019-08-09-2117.log 
2514,2532d2513
< @@ -89517,17 +89693,17 @@ Role=repo:github.com/mozilla-mobile/fenix:pull-request:
<        - queue:create-task:highest:proj-autophone/gecko-t-bitbar-gw-perf-p2
<        - queue:create-task:highest:scriptworker-prov-v1/mobile-signing-dep-v1
<        - queue:route:index.project.fenix.android.preview-builds
<        - queue:route:index.project.mobile.fenix.cache.level-1.*
<        - queue:route:index.project.mobile.fenix.staging-signed-nightly.*
<        - queue:route:index.project.mobile.fenix.v2.staging.*
<        - queue:route:index.project.mobile.fenix.v3.staging.*
<        - queue:route:notify.email.perftest-alerts@mozilla.com.on-failed
< -      - secrets:get:project/mobile/fenix/pr
< +      - secrets:get:project/fenix/preview-key-store
< 
<    Role=repo:github.com/mozilla-mobile/fenix:release:
<      roleId: repo:github.com/mozilla-mobile/fenix:release
<      description:
<        *DO NOT EDIT* - This resource is configured automatically by [ci-admin](https://hg.mozilla.org/ci/ci-admin).
< 
<        Scopes in this role are defined in [ci-configuration/grants.yml](https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml).
<      scopes:
Severity: normal → blocker
Summary: Remove c4 and m4 workers from gecko-*-b-* provisioners → [trees closed (for lack of capacity?)] Remove c4 and m4 workers from gecko-*-b-* provisioners

I've hit this error before myself. The aws-provisioner doesn't handle it well when you remove instance types from a worker type while instances of those types are still running:

Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: 13:49:26.181Z  INFO aws-provisioner-production: determined number of pending tasks (workerType=gecko-3-b-linux, pendingTasks=912) 
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: 13:49:26.201Z ERROR aws-provisioner-production: error provisioning this worker type, skipping (workerType=gecko-3-b-linux, err={}) 
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1: reportError - level: warning, tags: {"workerType":"gecko-3-b-linux"} 
Aug 10 09:49:26 taskcluster-aws-provisioner2 app/provisioner.1:  Error: gecko-3-b-linux does not contain c4.4xlarge
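
For reference, a minimal sketch (not from this bug) of checking for that condition before removing instance types from a worker type: list instances of the old types that are still pending or running, so they can be terminated first. The region, the worker-type tag key, and the m4 size are assumptions; only c4.4xlarge and gecko-3-b-linux appear in the log above.

# Hypothetical pre-check with boto3. The region, the assumption that the
# provisioner stores the worker type in the "Name" tag, and the m4 size
# are all illustrative; verify against the real tagging before relying on it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
paginator = ec2.get_paginator("describe_instances")

leftovers = []
for page in paginator.paginate(
    Filters=[
        {"Name": "instance-type", "Values": ["c4.4xlarge", "m4.4xlarge"]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
        {"Name": "tag:Name", "Values": ["gecko-3-b-linux"]},
    ]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            leftovers.append(instance["InstanceId"])

print(f"{len(leftovers)} old-type instances still running:", leftovers)
# Only after double-checking the list would one terminate them:
# ec2.terminate_instances(InstanceIds=leftovers)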

glandium has already backed out his config change, and dhouse is going to apply the backout. That should allow us to start provisioning workers again.

Attached file ci-con-apply-2019-08-09-0721.log (deleted) —

I confirmed this morning that the ci-admin diff was the same as in the other attachment. Attached here is the output from the ci-admin apply, showing the updates to worker types and various roles matching that diff.

Severity: blocker → normal
Summary: [trees closed (for lack of capacity?)] Remove c4 and m4 workers from gecko-*-b-* provisioners → Remove c4 and m4 workers from gecko-*-b-* provisioners
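
As a side note, here is a hedged sketch of the diff-then-apply workflow described in the comment above, driven from Python. Only the ci-admin diff and apply steps are taken from the attachments in this bug; the log file names and the absence of extra flags are illustrative.

# Sketch of the diff-then-apply flow: capture the diff for review, then apply.
# Log file names follow the pattern of the attachments here; exact ci-admin
# flags are not taken from this bug.
import subprocess
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")

# Capture the diff first so it can be reviewed (and attached, as done here).
with open(f"ci-conf_diff_{stamp}.log", "w") as log:
    subprocess.run(["ci-admin", "diff"], stdout=log, check=True)

# Apply only once the diff matches what was reviewed.
with open(f"ci-con-apply-{stamp}.log", "w") as log:
    subprocess.run(["ci-admin", "apply"], stdout=log, check=True)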

Are we doomed to never land this?

Flags: needinfo?(dhouse)

Tom, can you take this? I'm not sure of the full context or related bugs/changes.

Flags: needinfo?(dhouse) → needinfo?(mozilla)

I am on PTO for a couple of weeks; redirecting to :bstack

Flags: needinfo?(mozilla) → needinfo?(bstack)

dhouse: what is required here, just applying ci-admin and terminating the currently running instances? I'm not sure I have enough context yet.

Flags: needinfo?(bstack) → needinfo?(dhouse)

(In reply to Brian Stack [:bstack] from comment #9)

> dhouse: what is required here, just applying ci-admin and terminating the currently running instances? I'm not sure I have enough context yet.

That sounds like what is needed.

I don't think the discussion about this change is in a bug (maybe it was in a chat?), so I don't know whether we want it applied or not.

Also, when it was applied a week ago, it took some hours before it became apparent that new instances were not getting provisioned. So I think we need some awareness of the change, and monitoring afterward, to make sure we don't repeat the outage.
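
To make that monitoring concrete, here is a hedged sketch (not from this bug) that watches for the failure mode described above: a growing pending backlog with no instances being provisioned. The root URL, provisionerId, region, tag key, and thresholds are all assumptions.

# Hypothetical watchdog: compare the Taskcluster pending count against the
# number of EC2 instances for the worker type. Root URL, provisionerId,
# region, tag key and thresholds are assumptions, not taken from this bug.
import time
import boto3
import taskcluster

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumption
PROVISIONER_ID = "aws-provisioner-v1"                    # assumption
WORKER_TYPE = "gecko-3-b-linux"

queue = taskcluster.Queue({"rootUrl": ROOT_URL})
ec2 = boto3.client("ec2", region_name="us-east-1")       # region is an assumption

while True:
    pending = queue.pendingTasks(PROVISIONER_ID, WORKER_TYPE)["pendingTasks"]
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [WORKER_TYPE]},  # tag key assumed
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )["Reservations"]
    running = sum(len(r["Instances"]) for r in reservations)
    print(f"pending={pending} running={running}")
    if pending > 500 and running == 0:
        print("WARNING: backlog growing with no capacity; check the provisioner")
    time.sleep(600)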

Flags: needinfo?(dhouse)

Ok, I'm around if we decide it needs to be applied.

Can we coordinate a landing of this early next week?

Flags: needinfo?(bstack)

Yeah, for sure. Afaict the only thing I'm doing here is applying ci-admin, so any time in my schedule works great. Who should be around to validate that things work?

Flags: needinfo?(bstack)

Who would terminate the running instances?

Flags: needinfo?(bstack)

Ok, I'm just landing this now. I still kinda feel like I don't have proper context here, but hopefully it goes well.

Flags: needinfo?(bstack)

This should be done now. I'm watching the provisioner logs and terminating anything that gets in its way. Marking as resolved. Please reopen and assign me if there's anything else to do!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED