Migrate shippable builds from AWS to GCP
Categories
(Release Engineering :: Firefox-CI Administration, task)
Tracking
(Not tracked)
People
(Reporter: coop, Assigned: masterwayz)
References
Details
Attachments
(17 files, 1 obsolete file)
We have tier 3 builds in GCP on all platforms. Now we need to take stock of what's still missing and what we need to do to migrate the tier 1 builds from AWS to GCP.
Reporter
Comment 1•6 years ago
The #1 blocker right now is the lack of sccache in GCP (bug 1539961). This prevents some debug builds and other slower variants from completing reliably in their existing duration. This is compounded by not yet having access to compute-optimized instances in GCP.
Once sccache is running, there are a bunch of build variants that still need to be set up in GCP:
- spidermonkey (linux/win)
- aarch64 (linux/win)
- asan/asan reporter (linux/mac/win)
- ccov (linux/mac/win)
- noopt (linux/mac/win)
- searchfox (linux/mac/win)
- PGO builds (all platforms)
Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.
We can't start using GCP to produce binaries that we ship to end users (or data we rely on) until we have done a security review of the GCP platform.
We'll also want specific RRAs for the following services that are changing or being created fresh in GCP:
- worker manager
- GCP provider
- helm deployment process of Taskcluster services in GCP
Finally, there are a few accessory services that, like sccache, should live next to workers in any new cloud providers, although these are nice-to-haves in that they will reduce costs and/or speed up the build process by minimizing network transfer. This includes services like:
- hg mirror
- cloud-mirror (will be object service eventually)
I'll add to this bug as I think of other things, and will add bug numbers as I find/file them. Please do the same.
Reporter
Comment 2•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #1)
Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.
We also need to make sure that workers in GCP can generate the proper signatures for chain-of-trust (bug 1545524).
Comment 3•6 years ago
Results from analyzing the builds are in bug 1546414.
Key findings:
- build times are much slower for almost all build types
- intermittent failures in the builds
Comment 4•5 years ago
I'm working through some issues with sccache on Windows in bug 1549346, and I also need to modify our provisioning script to provision builders in the US only.
Reporter
Comment 5•5 years ago
Overall, there are some concerns around performance but no showstoppers.
Builds are slower than expected, perhaps to the tune of 30-40%. In particular, Windows PGO builds are slower, and tasks that have relatively short timeouts do hit those timeouts frequently, e.g. symbol uploads.
I wouldn't be comfortable moving builds over if they are 30% slower, but there are some mitigations we can put in place:
- Optimize hg for GCP the same way we did for AWS - bug 1585133
- Extend timeouts for some jobs (e.g. symbol uploads). I increased the timeout for symbol upload jobs to 30min and they all completed successfully.
- Temporarily increase instance specs (e.g. Windows PGO builds). These are currently running on n1-highcpu-32 so we could bump that up to n1-highcpu-64.
- Wait for n2 instances to become available, or create worker pools in us-central where beta n2s are already available. This might be useful for Windows instances.
I think we should wait for bug 1585133 to land because that will improve things across the board.
Comment 6•5 years ago
Comment 7•5 years ago
Comment 8•5 years ago
Comment 9•5 years ago
Comment 10•5 years ago
Updated•5 years ago
Comment 11•5 years ago
Reporter
Comment 12•5 years ago
:bc - Connor's patch in bug 1585133 seems to have stuck. Have you been able to generate new perf numbers for validation? Is it worth creating a separate bug just to cover that work?
Comment 13•5 years ago
I have been working with a try push where I have been collecting test results from using the Linux builder workers as test workers. I have one iteration of the builds and 5 iterations of the tests so far, and plan to add more build iterations to collect statistics for the builds once I complete the tests. I don't think there is a need for a separate bug for now. I will update this bug with the build performance comparison when I have completed it. I plan to document the test results for the Linux builder/test workers in bug 1577276.
Comment 14•5 years ago
Coop: I did a new try run with 20 builds. The resulting google sheet looks much better than before.
Reporter
Comment 15•5 years ago
I found out last week that both the n2 and c2 instance families have hit general availability (GA) in GCP, so I did some renewed timings with both of them using my previous methodology with plain builds:
n2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65700ea553546df1ef64b032c70a8f7319340f37
Avg task time: 1381.77s (~23.0m)
Std. dev: 48.46
CoV: 0.04
c2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e705f2899f617e1bcba10e7c45b67a63008f1f9
Avg task time: 1272.52s (~21.2m)
Std. dev: 21.75
CoV: 0.02
Synopsis: both instance types are still faster than our current AWS instances, and are several minutes faster than they were just a few months ago, likely due to the hg speed-ups. Google tells me they won't have enough capacity to run our peak workloads on c2, but n2 can shoulder the burden.
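The per-instance summary statistics above (average, standard deviation, and coefficient of variation, i.e. std. dev. divided by the mean) can be reproduced with a short sketch; the durations below are invented for illustration, not the real try-push data:

```python
from statistics import mean, pstdev

def summarize(durations_s):
    """Average, standard deviation, and coefficient of variation
    (std dev / mean) for a list of task durations in seconds."""
    avg = mean(durations_s)
    sd = pstdev(durations_s)
    return {"avg_s": round(avg, 2), "std_dev": round(sd, 2),
            "cov": round(sd / avg, 2)}

# Hypothetical durations, not the timings from the try pushes above.
print(summarize([1300, 1350, 1280, 1310]))
```

A low CoV (as in the c2 runs) means the task times are tightly clustered around the mean.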
We're not planning on migrating anything this week, but I'll roll some patches to:
- Change our current GCP builders to n2 instances. This can land this week
- Migrate POSIX build load from AWS -> GCP, to be landed early next week after the migration/TCW
Reporter
Comment 16•5 years ago
Adding bug 1546801 as a dependency, not for technical reasons but to avoid disrupting that migration by changing the build platform at the same time.
Reporter
Comment 17•5 years ago
I'm driving this, if not doing the work.
Reporter
Comment 18•5 years ago
Didn't mean to close this.
Comment 19•5 years ago
We are switching tier-1 builds to GCP.
Comment 20•5 years ago
Updated•5 years ago
Updated•5 years ago
Comment 21•5 years ago
Comment 22•5 years ago
Comment 23•5 years ago
There was a high rate of tasks failing with claim-expired, and it appears that sccache is not working (the tasks are getting write errors). Because of that and bug 1609568, I've backed it out.
Comment 24•5 years ago
Comment 25•5 years ago
It feels like the worker-pool is lacking the scope auth:gcp:access-token:sccache-3/sccache-l{level}-us-central1@sccache-3.iam.gserviceaccount.com. I believe this is a misconfiguration in ci-configuration, but I could not track down where. Tom, can you figure this out?
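For context on why a missing scope fails here: Taskcluster checks whether a client's scopes satisfy a required scope, with a held scope ending in `*` satisfying any required scope that starts with that prefix. A minimal sketch of that rule (the wildcard pool scope shown is a hypothetical example, not the actual pool configuration):

```python
def satisfies(have_scopes, required):
    """Taskcluster-style scope satisfaction: an exact match, or a held
    scope ending in '*' whose prefix matches the required scope."""
    for scope in have_scopes:
        if scope == required:
            return True
        if scope.endswith("*") and required.startswith(scope[:-1]):
            return True
    return False

# Hypothetical worker-pool scopes; the required scope is the one above.
pool_scopes = ["auth:gcp:access-token:sccache-3/*"]
required = ("auth:gcp:access-token:sccache-3/"
            "sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com")
print(satisfies(pool_scopes, required))
```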
Comment 26•5 years ago
It feels like everything is configured correctly. I opened a new pull request to make sccache report an error when the access-token request fails. The only piece of the puzzle left that I see is taskcluster-auth. :edunham, could you confirm sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the auth allowlist?
Comment 27•5 years ago
Wander, sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the allowedServiceAccounts list configured for the sccache-3 project in gcp_credentials_allowed_projects for the auth service in firefoxci.
Stage previously had some L3 configuration which, it turns out, should never have been there, so that is being removed now. Stage has the same L1 and L2 accounts configured as firefoxci does, though.
Comment hidden (Intermittent Failures Robot)
Comment 29•5 years ago
(In reply to Pulsebot from comment #24)
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/daf3b53c3efa
Backed out changeset 540db822a1d4 for causing bug 1609568
== Change summary for alert #24699 (as of Thu, 16 Jan 2020 07:26:06 GMT) ==
Improvements:
24% build times windows2012-64-noopt debug taskcluster-c5.4xlarge 2,609.43 -> 1,992.32
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=24699
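The 24% improvement can be verified from the before/after build times reported in the alert:

```python
def pct_improvement(before_s, after_s):
    """Percentage reduction in build time, rounded to a whole percent."""
    return round(100 * (before_s - after_s) / before_s)

# Times from the alert: 2,609.43s -> 1,992.32s.
print(pct_improvement(2609.43, 1992.32))
```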
Comment 30•5 years ago
Comment 31•5 years ago
The scopes that are used are managed by the project:taskcluster:{trust_domain}:level-{level}-sccache-buckets role that is added a few lines above.
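For illustration, the parameterized role name expands by substituting its placeholders; the `gecko` trust domain and level 3 below are assumed example values, not taken from this bug:

```python
def expand_role(template, trust_domain, level):
    """Fill in the {trust_domain} and {level} placeholders of a
    parameterized ci-configuration role name."""
    return template.format(trust_domain=trust_domain, level=level)

role = expand_role(
    "project:taskcluster:{trust_domain}:level-{level}-sccache-buckets",
    trust_domain="gecko", level=3)
print(role)
```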
Updated•5 years ago
Comment 32•5 years ago
Updated•5 years ago
Comment 33•5 years ago
bugherder
Comment 34•5 years ago
bugherder uplift
Comment 35•5 years ago
bugherder uplift
Comment 36•5 years ago
The scope was added to tasks using sccache, but does not correspond to an actual bucket. Now that the code that added that scope is gone, we can remove the scope.
Reporter
Comment 37•5 years ago
To be explicit: when we do the cutover, we want the try and release builds (levels 1-3) to switch from AWS to GCP, but we also want the tier 3 validation builds that we've been running in GCP to switch back to AWS. This is important for making sure builds continue to work in AWS in case we need to switch back from GCP to AWS at some point in the future.
Comment 38•5 years ago
Comment 39•5 years ago
bugherder
Comment 40•5 years ago
Comment 41•5 years ago
Backed 9819d9e38727 out from autoland & central:
https://hg.mozilla.org/integration/autoland/rev/7dcbf3debe6a3a527df82c422446ca42a3de78c4
https://hg.mozilla.org/mozilla-central/rev/5fa1a31cb9547a0400f0955707e46b3046680123
Comment 42•5 years ago
Merged backout of 9819d9e38727 to central: https://hg.mozilla.org/mozilla-central/rev/7dcbf3debe6a
Reporter
Comment 43•5 years ago
:fubar is going to drive this going forward.
Comment 44•4 years ago
:tomprince, here is the NI I mentioned yesterday. I verified my test results from April 23 and found a 3% failure rate on bundles, but they were caught immediately (and did not cause a delayed timeout). I'll work on the image update in another bug and the ci-config test here.
- Can you, or can you direct me how to, test moving the linux (+android +macos) builds to GCP in the Taskcluster staging env, or do we just need to test it in production like before? (I am guessing we want to keep some "shadow" builds in AWS for a few release cycles in case things break in GCP and we need to switch back.)
- I also want to test smaller instance sizes to minimize cost. Is that something I could test with ci-configuration in the Taskcluster staging? I want to target n2-standard, pending Google's confirmation that they have the capacity now. I can manually test instances in the Google projects, but I'd like to test it "the right way" with ci-config in staging so that we have more confidence in the results before applying the change to production.
Comment 45•4 years ago
(Tom met with me over Zoom and explained the ci-admin/ci-config staging testing, and we planned the steps for this.)
Comment 46•4 years ago
:fubar, have you heard from our gcp person about capacity for c2 (or if not c2 then n2)? re: coop's numbers from https://docs.google.com/spreadsheets/d/1Qe7MFyvce59Oqtugm62-Fs9H9f3shNKHhcwhLOjuVRc/edit#gid=0 c2-standard-16 looks best, but the note says "capacity limited"
Comment 47•4 years ago
Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).
Interestingly, it's the same rust crate cache hit that yields the max time in both cases, which got me to look further, and the cumulative download time is 483s on AWS (average 0.11s), and 2029.3s on GCP (average 0.49s). It seems to be that overall the sccache bucket on gcp is 4~5 times as slow as the one on AWS.
This has a noticeable impact on build times.
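The kind of comparison described above (share of slow cache requests, plus cumulative and average download time) can be sketched like this; the per-request timings are invented for illustration, not measured data:

```python
def latency_stats(times_s, threshold=1.0):
    """Percentage of cache requests slower than `threshold` seconds,
    plus cumulative and average download time."""
    slow = sum(1 for t in times_s if t > threshold)
    total = sum(times_s)
    return {"pct_slow": round(100 * slow / len(times_s), 2),
            "cumulative_s": round(total, 1),
            "avg_s": round(total / len(times_s), 2)}

# Hypothetical per-request download times in seconds.
print(latency_stats([0.1, 0.2, 1.5, 0.2, 3.0]))
```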
Comment 48•4 years ago
In GCP, we have different GCP projects (represented by different worker-manager providers) for level-1, level-3 and test workers. Thus we need to be able to vary the provider in variant worker-pool families.
Comment 49•4 years ago
Comment 50•4 years ago
Comment 51•4 years ago
(In reply to Mike Hommey [:glandium] from comment #47)
Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).
Interestingly, it's the same rust crate cache hit that yields the max time in both cases, which got me to look further, and the cumulative download time is 483s on AWS (average 0.11s), and 2029.3s on GCP (average 0.49s). It seems to be that overall the sccache bucket on gcp is 4~5 times as slow as the one on AWS.
This has a noticeable impact on build times.
Thanks :glandium for seeing this! I did not look at sccache, and I don't know if anyone else did. Is this an average per task cumulative sum, or across all tasks for a period?
Do you know who has worked on the sccache for gcp (or aws)? If not, I'll need to ask around or search bugs.
Comment 52•4 years ago
(In reply to Dave House [:dhouse] from comment #51)
Thanks :glandium for seeing this! I did not look at sccache, and I don't know if anyone else did. Is this an average per task cumulative sum, or across all tasks for a period?
Cumulative sum for one random build.
Do you know who has worked on the sccache for gcp (or aws)? If not, I'll need to ask around or search bugs.
No idea.
Comment 53•4 years ago
Comment 54•4 years ago
:fubar is checking with Google on the n2-standard-16 instance capacity.
Comment 55•4 years ago
Comment 56•4 years ago
Depends on D77293
Comment 57•4 years ago
:miles do you have a docker-worker gcp image ready we could switch to in ci-config?
I see these from May 27th in the taskcluster-imaging project:
docker-worker-gcp-community-googlecompute-2020-05-27t20-17-36z
docker-worker-gcp-googlecompute-2020-05-27t18-59-51z
Are you running this "-community-" image in community, and could you (or I) switch firefox-ci to use this image, or is there a better one?
https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-images.yml#l12:
monopacker-docker-worker-current: monopacker-docker-worker-2020-02-07t09-14-17z
monopacker-docker-worker-trusted-current: monopacker-docker-worker-gcp-trusted-2020-02-13t03-22-56z
Comment 58•4 years ago
:miles, thank you for contacting me yesterday. Will you build a level 3 image to match the level 1, or can I use the same image? Or who do I need to get the secrets from? (I'm assuming secrets are still baked into the images and require us to make separate images.)
Comment 59•4 years ago
Some related discussion of this appeared in #firefox-ci today starting at 2:10 pacific:
dustin
who typically bakes the docker-worker images in GCP?
I can, but there's a 49% chance I'll get the secrets wrong, so if someone else typically does it, that'd be great
last time was 2020-02-07
for example I see a secret named docker-worker/yaml/firefox-tc-production-l3-new.yaml, but that's from March 5
and I don't see an l1
for production
aki
hm, not sure
https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml#l2300 points to miles and wander
ci-configuration @ tip / grants.yml
Content of grants.yml at revision 523fca0e1e6ab088282cfee1fd0cb15a7e70f8a7 in ci-configuration
miles
dustin: the naming scheme is lacking, looks like we are indeed missing production-l1
-new and -old was from the rotation in march
wander baked some images 5/27 that haven't been entered, we should probably re-do that at this point
because CoT isn't used for L1 I think the staging-l1 yaml has been used for all L1 images
Comment 60•4 years ago
(In reply to Dave House [:dhouse] from comment #58)
:miles thankyou for contacting me yesterday. Will you build a level3 image to match the level1, or can I use the same image or who do I need to get the secrets from? (I'm assuming secrets are still in the images however and require us to make separate images)
:miles, could you build a GCP image for level 1 and level 3 for this? (Or, if I need to do it, who do I get the secrets from? I saw the refactoring and changes in the repo; is monopacker in a ready state to build for GCP?)
Comment 61•4 years ago
bug 1643562 also needs new docker-worker images baked for gcp. I don't have the necessary access.
Updated•4 years ago
Updated•4 years ago
Updated•3 years ago
Assignee
Updated•3 years ago
Assignee
Comment 63•3 years ago
Main = everything except debug. Debug will have its own bug so we can use it for testing and getting started on this work.
Comment 64•3 years ago
Migration of debug builds is almost ready to land over in bug 1757602. In the meantime, I want to figure out what else is blocking shippable builds.
Hal, is there anything left to do on the SecOps side of things around migrating Linux shippable builds from AWS -> GCP?
Comment 65•3 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #64)
Hal, is there anything left to do on the SecOps side of things around migrating Linux shippable builds from AWS -> GCP?
I'm not sure we (secops) have been involved in this yet. :/
If I understand the situation correctly:
- these linux builds would be the first release builds to be shipped from GCP (it looks like the other OS are still pending)
- the scope of this bug is only linux builds (as there are other bugs for the other platforms)
- we haven't yet done any sec eval of tc in gcp, afaik (:ajvb to confirm)
- there appear to be a couple of key elements not yet completed:
- bug 1597771 should be completed first (or we need a longer discussion)
- bug 1587958 is of interest, as these are the hg-mirrors-that-matter, and no review has been done yet
When I say "review" above, I believe the scope is more of a "mini RRA". We (secops) would want to go over any changes in workflow, permissions, and access between the aws tooling and the gcp tooling. The way things work isn't identical between the systems.
:ahal - what, if any, release builds have we been doing in GCP? If none, what other builds are running in gcp?
:ajvb - has secops (i.e. you 😏) done any review of tc-in-gcp yet?
Comment 66•3 years ago
There are currently no shippable builds happening in GCP. We have debug and opt builds running in GCP for Linux, Android, and OSX/Windows (cross-compiled). Come to think of it, all our builds might be on Linux due to cross-compiling. If there are release builds happening on other platforms, we can certainly punt on those and just drill down on Linux for now.
Comment 67•3 years ago
:hwine - Nope! I had been involved in the original conversations (now over 2 years ago! wow) but have not been involved since then. I think your description of a "mini RRA" sounds great.
Comment 68•3 years ago
:ahal, is there a way to turn off interactive tasks for a pool? I tested and was able to create one for the L3 GCP builder pool (I hit errors downloading the web interface because of path length, but it created the interactive task). We'll need to make sure the GCP firewalls are blocking incoming connections for the L3 builders, but could turning it off be "nice" so devs don't expect it to work?
Comment 69•3 years ago
Hm, I'm not aware of any way to do this, no. I think even for AWS-based pools it's possible to create interactive level 3 tasks; it's just not possible to connect to them.
I agree that a scope error, or better yet, not offering the "Create Interactive" button in the first place, would be a much better user experience, though that's probably out of scope for this bug. I think solving this will likely involve changes in Taskcluster itself.
Comment 70•3 years ago
:ajvb -- long ago, ulfr raised this issue, and (name forgotten) was supposed to default this feature to "disabled", and then only explicitly enable for "try" (or some other non-L3 set of machines). Does that ring a bell?
Comment 71•3 years ago
(In reply to Hal Wine [:hwine] (use NI) from comment #70)
:ajvb -- long ago, ulfr raised this issue, and (name forgotten) was supposed to default this feature to "disabled", and then only explicitly enable for "try" (or some other non-L3 set of machines). Does that ring a bell?
No it doesn't :/ - But I do agree that this should be fixed.
Comment 72•2 years ago
So, for the purposes of the tier 1 migration, we only care about blocking access to the interactive feature, but:
- :ahal can you open an appropriate TC bug to allow disabling the interactive task feature, please? CC both AJ & myself on that
- :dhouse - how do you suggest we verify and/or monitor the L3 firewall setting for this, as you mentioned in comment #68?
Comment 73•2 years ago
(In reply to Hal Wine [:hwine] (use NI) from comment #72)
- :ahal can you open an appropriate TC bug to allow disabling the interactive task feature, please? CC both AJ & myself on that
I had previously filed https://github.com/taskcluster/taskcluster/issues/5225 which was similar. I tweaked the title and added some extra context. CC'ed you both.
Updated•2 years ago
Comment 74•2 years ago
I was looking at the RRA doc, and I believe we're good to go here (other than maybe :dhouse's pending needinfo). The CoT key has been updated in the image. There was also a recommendation around artifact storage, but that shouldn't change with this patch (we're just switching from EC2 to GCE without touching the artifact storage). I also attempted to create an interactive task for a non-shippable build (which is running in GCP) and got the following error in my browser console:
Firefox can’t establish a connection to the server at wss://skkf5eiaaaayfup6iccmlbrvg4q2ik7mbtllkknbu7bcrxhq.taskcluster-worker.net:50724/1QXZS924S1axOiJ81jv0pg/shell.sock?...
So looks like the firewall is working \o/
Pending objections and a green light from relman, we'd like to attempt switching this over on Friday. Hal, can you think of any other reasons to hold off?
Comment 75•2 years ago
I chatted with Hal out of band. He has at least two requirements before we proceed here:
- Resolving bug 1597771, either by investigating and deciding that the protections we currently have in place are good enough (maybe things have changed since that bug was filed), or by implementing the same safeguards we have in AWS if they are still missing.
- Making sure all non-SRE access to the project in GCP is revoked.
Comment 76•2 years ago
I'm about to head out on PTO -- get signoff from :rforsythe before flipping the switch, please. Open items are in comment 75.
Comment 77•2 years ago
:aj, do you have admin access to remove Michelle from the relops folder? https://console.cloud.google.com/iam-admin/iam?authuser=1&folder=723902893592 I no longer have full admin.
At this point in the migration, we can remove that access and Michelle can coordinate any needed changes through relops.
Comment 78•2 years ago
:dhouse - I do not have permissions to do this. I'd imagine the owning ops folks of this folder or a GCP admin from SRE can take care of this.
Comment 79•2 years ago
Hi Chris, is Dave's comment 77 something you can help out with?
Comment 80•2 years ago
Nope, I don't have any permissions on that folder. :jason should be able to - or at least discover who does have permissions there.
Comment 81•2 years ago
Filed https://mozilla-hub.atlassian.net/browse/OPST-776 for removal
Assignee
Comment 82•2 years ago
Assignee
Updated•2 years ago
Comment 83•2 years ago
Looks like OPST-776 is complete, and all of Hal's remaining concerns from comment 75 have been addressed.
Since this change just rides the trains, I'm planning to:
- Land it on autoland
- Trigger some shippable builds
- Manually run verify_cot to test it out
If it doesn't work, I'll ask sheriffs to back out. If it does, the next test will be nightlies. Then sometime after all-hands we can uplift this to ESR (probably fine to let this ride the trains to release).
Comment 85•2 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #84)
Hey Hal, did you want to do a final sign-off here?
lgtm
X <== marks the spot
Comment 86•2 years ago
Comment 87•2 years ago
bugherder
Assignee
Comment 88•2 years ago
And done!