Migrate shippable builds from AWS to GCP
Categories
(Release Engineering :: Firefox-CI Administration, task)
Tracking
(Not tracked)
People
(Reporter: coop, Assigned: masterwayz)
References
Details
Attachments
(17 files, 1 obsolete file)
We have tier 3 builds in GCP on all platforms. Now we need to take stock of what's still missing and what we need to do to migrate the tier 1 builds from AWS to GCP.
Reporter
Comment 1•6 years ago
The #1 blocker right now is the lack of sccache in GCP (bug 1539961). This prevents some debug builds and other slower variants from completing reliably in their existing duration. This is compounded by not yet having access to compute-optimized instances in GCP.
Once sccache is running, there are a bunch of build variants that still need to be set up in GCP:
- spidermonkey (linux/win)
- aarch64 (linux/win)
- asan/asan reporter (linux/mac/win)
- ccov (linux/mac/win)
- noopt (linux/mac/win)
- searchfox (linux/mac/win)
- PGO builds (all platforms)
Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.
We can't start using GCP to produce binaries that we ship to end users (or data we rely on) until we have done a security review of the GCP platform.
We'll also want specific RRAs for the following services that are changing or being created fresh in GCP:
- worker manager
- GCP provider
- helm deployment process of Taskcluster services in GCP
Finally, there are a few accessory services that, like sccache, should live next to workers in any new cloud providers, although these are nice-to-haves in that they will reduce costs and/or speed up the build process by minimizing network transfer. This includes services like:
- hg mirror
- cloud-mirror (will be object service eventually)
I'll add to this bug as I think of other things, and will add bug numbers as I find/file them. Please do the same.
Reporter
Comment 2•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #1)
Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.
We also need to make sure that workers in GCP can generate the proper signatures for chain-of-trust (bug 1545524).
Comment 3•6 years ago
Results from analyzing the builds are in bug 1546414.
Key findings:
- build times are much slower for almost all build types
- intermittent failures in the builds
Comment 4•5 years ago
I'm working through some issues with sccache on Windows in bug 1549346, and I also need to modify our provisioning script to provision builders in the US only.
Reporter
Comment 5•5 years ago
Overall, there are some concerns around performance but no showstoppers.
Builds are slower than expected, perhaps to the tune of 30-40%. In particular, Windows PGO builds are slower, and tasks that have relatively short timeouts do hit those timeouts frequently, e.g. symbol uploads.
I wouldn't be comfortable moving builds over if they are 30% slower, but there are some mitigations we can put in place:
- Optimize hg for GCP the same way we did for AWS - bug 1585133
- Extend timeouts for some jobs (e.g. symbol uploads). I increased the timeout for symbol upload jobs to 30min and they all completed successfully.
- Temporarily increase instance specs (e.g. Windows PGO builds). These are currently running on n1-highcpu-32 so we could bump that up to n1-highcpu-64.
- Wait for n2 instances to become available, or create worker pools in us-central where beta n2s are already available. This might be useful for Windows instances.
I think we should wait for bug 1585133 to land because that will improve things across the board.
Comment 6•5 years ago
Comment 7•5 years ago
Comment 8•5 years ago
Comment 9•5 years ago
Comment 10•5 years ago
Updated•5 years ago
Comment 11•5 years ago
Reporter
Comment 12•5 years ago
:bc - Connor's patch in bug 1585133 seems to have stuck. Have you been able to generate new perf numbers for validation? Is it worth creating a separate bug just to cover that work?
Comment 13•5 years ago
I have been working with a try push where I have been collecting test results from using the Linux builder workers as test workers. I have one iteration of the builds and 5 iterations of the tests so far, and plan to add more build iterations to collect statistics for the builds once I complete the tests. I don't think there is a need for a separate bug for now. I will update this bug with the build performance comparison when I have completed it. I plan to document the test results for the Linux builder/test workers in bug 1577276.
Comment 14•5 years ago
Coop: I did a new try run with 20 builds. The resulting google sheet looks much better than before.
Reporter
Comment 15•5 years ago
I found out last week that both the n2 and c2 instance families have hit general availability (GA) in GCP, so I did some renewed timings with both of them using my previous methodology with plain builds:
n2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65700ea553546df1ef64b032c70a8f7319340f37
Avg task time: 1381.77s (~23.0m)
Std. dev: 48.46
CoV: 0.04
c2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e705f2899f617e1bcba10e7c45b67a63008f1f9
Avg task time: 1272.52s (~21.2m)
Std. dev: 21.75
CoV: 0.02
Synopsis: both instance types are still faster than our current AWS instances, and are several minutes faster than they were just a few months ago, likely due to the hg speed-ups. Google tells me they won't have enough capacity to run our peak workloads on c2, but n2 can shoulder the burden.
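The per-instance summary statistics above (average, standard deviation, and coefficient of variation, i.e. std. dev. divided by the mean) can be reproduced with a short sketch; the durations below are invented for illustration, not the real try-push data:

```python
from statistics import mean, pstdev

def summarize(durations_s):
    """Average, standard deviation, and coefficient of variation
    (std dev / mean) for a list of task durations in seconds."""
    avg = mean(durations_s)
    sd = pstdev(durations_s)
    return {"avg_s": round(avg, 2), "std_dev": round(sd, 2),
            "cov": round(sd / avg, 2)}

# Hypothetical durations, not the timings from the try pushes above.
print(summarize([1300, 1350, 1280, 1310]))
```

A low CoV (as in the c2 runs) means the task times are tightly clustered around the mean.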
We're not planning on migrating anything this week, but I'll roll some patches to:
- Change our current GCP builders to n2 instances. This can land this week
- Migrate POSIX build load from AWS -> GCP, to be landed early next week after the migration/TCW
Reporter
Comment 16•5 years ago
Adding bug 1546801 as a dependency, not for technical reasons but to avoid disrupting that migration by changing the build platform at the same time.
Reporter
Comment 17•5 years ago
I'm driving this, if not doing the work.
Reporter
Comment 18•5 years ago
Didn't mean to close this.
Comment 19•5 years ago
We are switching tier-1 builds to GCP.
Comment 20•5 years ago
Updated•5 years ago
Updated•5 years ago
Comment 21•5 years ago
Comment 22•5 years ago
Comment 23•5 years ago
There was a high rate of tasks failing with claim-expired, and it appears that sccache is not working (the tasks are getting write errors). Because of that and bug 1609568, I've backed it out.
Comment 24•5 years ago
Comment 25•5 years ago
It feels like the worker-pool is lacking the scope auth:gcp:access-token:sccache-3/sccache-l{level}-us-central1@sccache-3.iam.gserviceaccount.com. I believe this is a misconfiguration in ci-configuration, but I could not track down where. Tom, can you figure this out?
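For context on why a missing scope fails here: Taskcluster checks whether a client's scopes satisfy a required scope, with a held scope ending in `*` satisfying any required scope that starts with that prefix. A minimal sketch of that rule (the wildcard pool scope shown is a hypothetical example, not the actual pool configuration):

```python
def satisfies(have_scopes, required):
    """Taskcluster-style scope satisfaction: an exact match, or a held
    scope ending in '*' whose prefix matches the required scope."""
    for scope in have_scopes:
        if scope == required:
            return True
        if scope.endswith("*") and required.startswith(scope[:-1]):
            return True
    return False

# Hypothetical worker-pool scopes; the required scope is the one above.
pool_scopes = ["auth:gcp:access-token:sccache-3/*"]
required = ("auth:gcp:access-token:sccache-3/"
            "sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com")
print(satisfies(pool_scopes, required))
```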
Comment 26•5 years ago
It feels like everything is configured correctly. I opened a new pull request to make sccache report an error when the access-token request fails. The only piece of the puzzle left that I see is taskcluster-auth. :edunham, could you confirm sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the auth allowlist?
Comment 27•5 years ago
Wander, sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the allowedServiceAccounts list configured for the sccache-3 project in gcp_credentials_allowed_projects for the auth service in firefoxci.
Stage previously had some L3 configuration which, it turns out, should never have been there, so that is being removed now. Stage has the same L1 and L2 accounts configured as firefoxci does, though.
Comment hidden (Intermittent Failures Robot)
Comment 29•5 years ago
(In reply to Pulsebot from comment #24)
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/daf3b53c3efa
Backed out changeset 540db822a1d4 for causing bug 1609568
== Change summary for alert #24699 (as of Thu, 16 Jan 2020 07:26:06 GMT) ==
Improvements:
24% build times windows2012-64-noopt debug taskcluster-c5.4xlarge 2,609.43 -> 1,992.32
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=24699
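The 24% improvement can be verified from the before/after build times reported in the alert:

```python
def pct_improvement(before_s, after_s):
    """Percentage reduction in build time, rounded to a whole percent."""
    return round(100 * (before_s - after_s) / before_s)

# Times from the alert: 2,609.43s -> 1,992.32s.
print(pct_improvement(2609.43, 1992.32))
```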
Comment 30•5 years ago
Comment 31•5 years ago
The scopes that are used are managed by the project:taskcluster:{trust_domain}:level-{level}-sccache-buckets role that is added a few lines above.
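For illustration, the parameterized role name expands by substituting its placeholders; the `gecko` trust domain and level 3 below are assumed example values, not taken from this bug:

```python
def expand_role(template, trust_domain, level):
    """Fill in the {trust_domain} and {level} placeholders of a
    parameterized ci-configuration role name."""
    return template.format(trust_domain=trust_domain, level=level)

role = expand_role(
    "project:taskcluster:{trust_domain}:level-{level}-sccache-buckets",
    trust_domain="gecko", level=3)
print(role)
```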
Updated•5 years ago
Comment 32•5 years ago
Updated•5 years ago
Comment 33•5 years ago
bugherder
Comment 34•5 years ago
bugherder uplift
Comment 35•5 years ago
bugherder uplift
Comment 36•5 years ago
The scope was added to tasks using sccache, but does not correspond to an actual bucket. Now that the code that added that scope is gone, we can remove the scope.
Reporter
Comment 37•5 years ago
To be explicit: when we do the cutover, we want the try and release builds (levels 1-3) to switch from AWS to GCP, but we also want the tier 3 validation builds that we've been running in GCP to switch back to AWS. This is important for making sure builds continue to work in AWS in case we need to switch back from GCP to AWS at some point in the future.
Comment 38•5 years ago
Comment 39•5 years ago
bugherder
Comment 40•5 years ago
Comment 41•5 years ago
Backed 9819d9e38727 out from autoland & central:
https://hg.mozilla.org/integration/autoland/rev/7dcbf3debe6a3a527df82c422446ca42a3de78c4
https://hg.mozilla.org/mozilla-central/rev/5fa1a31cb9547a0400f0955707e46b3046680123
Comment 42•5 years ago
Merged backout of 9819d9e38727 to central: https://hg.mozilla.org/mozilla-central/rev/7dcbf3debe6a
Reporter
Comment 43•5 years ago
:fubar is going to drive this going forward.
Comment 44•4 years ago
:tomprince, here is the NI I mentioned yesterday. I verified my test results from April 23 and found a 3% failure rate on bundles, but they were caught immediately (and did not cause a delayed timeout). I'll work on the image update in another bug and the ci-config test here.
- Can you, or can you direct me how to, test moving the linux (+android +macos) builds to GCP in the Taskcluster staging env, or do we just need to test it in production like before? (I am guessing we want to keep some "shadow" builds in AWS for a few release cycles in case things break in GCP and we need to switch back.)
- I also want to test smaller instance sizes to minimize cost. Is that something I could test with ci-configuration in the Taskcluster staging? I want to target n2-standard, pending Google's confirmation that they have the capacity now. I can manually test instances in the Google projects, but I'd like to test it "the right way" with ci-config in staging so that we have more confidence in the results before applying the change to production.
Comment 45•4 years ago
(Tom met with me over Zoom and explained the ci-admin/ci-config staging testing, and we planned the steps for this.)
Comment 46•4 years ago
:fubar, have you heard from our gcp person about capacity for c2 (or if not c2 then n2)? re: coop's numbers from https://docs.google.com/spreadsheets/d/1Qe7MFyvce59Oqtugm62-Fs9H9f3shNKHhcwhLOjuVRc/edit#gid=0 c2-standard-16 looks best, but the note says "capacity limited"
Comment 47•4 years ago
Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).
Interestingly, it's the same rust crate cache hit that yields the max time in both cases, which got me to look further, and the cumulative download time is 483s on AWS (average 0.11s), and 2029.3s on GCP (average 0.49s). It seems to be that overall the sccache bucket on gcp is 4~5 times as slow as the one on AWS.
This has a noticeable impact on build times.
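The kind of comparison described above (share of slow cache requests, plus cumulative and average download time) can be sketched like this; the per-request timings are invented for illustration, not measured data:

```python
def latency_stats(times_s, threshold=1.0):
    """Percentage of cache requests slower than `threshold` seconds,
    plus cumulative and average download time."""
    slow = sum(1 for t in times_s if t > threshold)
    total = sum(times_s)
    return {"pct_slow": round(100 * slow / len(times_s), 2),
            "cumulative_s": round(total, 1),
            "avg_s": round(total / len(times_s), 2)}

# Hypothetical per-request download times in seconds.
print(latency_stats([0.1, 0.2, 1.5, 0.2, 3.0]))
```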
Comment 48•4 years ago
In GCP, we have different GCP projects (represented by different worker-manager providers) for level-1, level-3 and test workers. Thus we need to be able to vary the provider in variant worker-pool families.
Comment 49•4 years ago
Comment 50•4 years ago
Comment 51•4 years ago
(In reply to Mike Hommey [:glandium] from comment #47)
Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).
Interestingly, it's the same rust crate cache hit that yields the max time in both cases, which got me to look further, and the cumulative download time is 483s on AWS (average 0.11s), and 2029.3s on GCP (average 0.49s). It seems to be that overall the sccache bucket on gcp is 4~5 times as slow as the one on AWS.
This has a noticeable impact on build times.
Thanks :glandium for seeing this! I did not look at sccache, and I don't know if anyone else did. Is this an average per task cumulative sum, or across all tasks for a period?
Do you know who has worked on the sccache for gcp (or aws)? If not, I'll need to ask around or search bugs.
Comment 52•4 years ago
(In reply to Dave House [:dhouse] from comment #51)
Thanks :glandium for seeing this! I did not look at sccache, and I don't know if anyone else did. Is this an average per task cumulative sum, or across all tasks for a period?
Cumulative sum for one random build.
Do you know who has worked on the sccache for gcp (or aws)? If not, I'll need to ask around or search bugs.
No idea.
Comment 53•4 years ago
Comment 54•4 years ago
:fubar is checking with Google on the n2-standard-16 instance capacity.
Comment 55•4 years ago
Comment 56•4 years ago
Depends on D77293
Comment 57•4 years ago
:miles do you have a docker-worker gcp image ready we could switch to in ci-config?
I see these from May 27th in the taskcluster-imaging project:
docker-worker-gcp-community-googlecompute-2020-05-27t20-17-36z
docker-worker-gcp-googlecompute-2020-05-27t18-59-51z
Are you running this "-community-" image in community, and could you (or I) switch firefox-ci to use this image, or is there a better one?
https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-images.yml#l12:
monopacker-docker-worker-current: monopacker-docker-worker-2020-02-07t09-14-17z
monopacker-docker-worker-trusted-current: monopacker-docker-worker-gcp-trusted-2020-02-13t03-22-56z
Comment 58•4 years ago
:miles, thank you for contacting me yesterday. Will you build a level 3 image to match the level 1, or can I use the same image? Or who do I need to get the secrets from? (I'm assuming secrets are still baked into the images and require us to make separate images.)
Comment 59•4 years ago
Some related discussion of this appeared in #firefox-ci today starting at 2:10 pacific:
dustin
who typically bakes the docker-worker images in GCP?
I can, but there's a 49% chance I'll get the secrets wrong, so if someone else typically does it, that'd be great
last time was 2020-02-07
for example I see a secret named docker-worker/yaml/firefox-tc-production-l3-new.yaml, but that's from March 5
and I don't see an l1
for production
aki
hm, not sure
https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml#l2300 points to miles and wander
ci-configuration @ tip / grants.yml
Content of grants.yml at revision 523fca0e1e6ab088282cfee1fd0cb15a7e70f8a7 in ci-configuration
miles
dustin: the naming scheme is lacking, looks like we are indeed missing production-l1
-new and -old was from the rotation in march
wander baked some images 5/27 that haven't been entered, we should probably re-do that at this point
because CoT isn't used for L1 I think the staging-l1 yaml has been used for all L1 images
Comment 60•4 years ago
(In reply to Dave House [:dhouse] from comment #58)
:miles thankyou for contacting me yesterday. Will you build a level3 image to match the level1, or can I use the same image or who do I need to get the secrets from? (I'm assuming secrets are still in the images however and require us to make separate images)
:miles, could you build a GCP image for level 1 and level 3 for this? (Or, if I need to do it, who do I get the secrets from? I saw the refactoring and changes in the repo; is monopacker in a ready state to build for GCP?)
Comment 61•4 years ago
bug 1643562 also needs new docker-worker images baked for gcp. I don't have the necessary access.
Updated•4 years ago
Updated•4 years ago
Updated•3 years ago
Assignee
Updated•3 years ago
Assignee
Comment 63•3 years ago
Main = everything except debug. Debug will have its own bug so we can use it for testing and getting started on this work.
Comment 64•3 years ago
Migration of debug builds is almost ready to land over in bug 1757602. In the meantime, I want to figure out what else is blocking shippable builds.
Hal, is there anything left to do on the SecOps side of things around migrating Linux shippable builds from AWS -> GCP?
Comment 65•3 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #64)
Hal, is there anything left to do on the SecOps side of things around migrating Linux shippable builds from AWS -> GCP?
I'm not sure we (secops) have been involved in this yet. :/
If I understand the situation correctly:
- these linux builds would be the first release builds to be shipped from GCP (it looks like the other OS are still pending)
- the scope of this bug is only linux builds (as there are other bugs for the other platforms)
- we haven't yet done any sec eval of tc in gcp, afaik (:ajvb to confirm)
- there appear to be a couple of key elements not yet completed:
- bug 1597771 should be completed first (or we need a longer discussion)
- bug 1587958 is of interest, as these are the hg-mirrors-that-matter, and no review has been done yet
When I say "review" above, I believe the scope is more of a "mini RRA". We (secops) would want to go over any changes in workflow, permissions, and access between the aws tooling and the gcp tooling. The way things work isn't identical between the systems.
:ahal - what, if any, release builds have we been doing in GCP? If none, what other builds are running in gcp?
:ajvb - has secops (i.e. you 😏) done any review of tc-in-gcp yet?
Comment 66•3 years ago
There are currently no shippable builds happening in GCP. We have debug and opt builds running in GCP for Linux, Android, and OSX/Windows (cross-compiled). Come to think of it, all our builds might be on Linux due to cross-compiling. If there are release builds happening on other platforms, we can certainly punt on those and just drill down on Linux for now.
Comment 67•3 years ago
:hwine - Nope! I had been involved in the original conversations (now over 2 years ago! wow) but have not been involved since then. I think your description of a "mini RRA" sounds great.
Comment 68•3 years ago
:ahal, is there a way to turn off interactive tasks for a pool? I tested and was able to create one for the L3 GCP builder pool (I hit errors downloading the web interface because of path length, but it created the interactive task). We'll need to make sure the GCP firewalls are blocking incoming connections for the L3 builders, but could turning it off be "nice" so devs don't expect it to work?
Comment 69•3 years ago
Hm, I'm not aware of any way to do this, no. I think even for AWS-based pools it's possible to create interactive level 3 tasks; it's just not possible to connect to them.
I agree that a scope error, or better yet, not offering the "Create Interactive" button in the first place, would be a much better user experience, though that's probably out of scope for this bug. I think solving this will likely involve changes in Taskcluster itself.
Comment 70•3 years ago
:ajvb -- long ago, ulfr raised this issue, and (name forgotten) was supposed to default this feature to "disabled", and then only explicitly enable for "try" (or some other non-L3 set of machines). Does that ring a bell?
Comment 71•3 years ago
(In reply to Hal Wine [:hwine] (use NI) from comment #70)
:ajvb -- long ago, ulfr raised this issue, and (name forgotten) was supposed to default this feature to "disabled", and then only explicitly enable for "try" (or some other non-L3 set of machines). Does that ring a bell?
No it doesn't :/ - But I do agree that this should be fixed.
Comment 72•2 years ago
So, for the purposes of the tier 1 migration, we only care about blocking access to the interactive feature, but:
- :ahal can you open an appropriate TC bug to allow disabling the interactive task feature, please? CC both AJ & myself on that
- :dhouse - how do you suggest we verify and/or monitor the L3 firewall setting for this, as you mentioned in comment #68?
Comment 73•2 years ago
(In reply to Hal Wine [:hwine] (use NI) from comment #72)
- :ahal can you open an appropriate TC bug to allow disabling the interactive task feature, please? CC both AJ & myself on that
I had previously filed https://github.com/taskcluster/taskcluster/issues/5225 which was similar. I tweaked the title and added some extra context. CC'ed you both.
Updated•2 years ago
Comment 74•2 years ago
I was looking at the RRA doc, and I believe we're good to go here (other than maybe :dhouse's pending needinfo). The CoT key has been updated in the image. There was also a recommendation around artifact storage, but that shouldn't change with this patch (we're just switching from EC2 to GCE without touching the artifact storage). I also attempted to create an interactive task for a non-shippable build (which is running in GCP) and got the following error in my browser console:
Firefox can’t establish a connection to the server at wss://skkf5eiaaaayfup6iccmlbrvg4q2ik7mbtllkknbu7bcrxhq.taskcluster-worker.net:50724/1QXZS924S1axOiJ81jv0pg/shell.sock?...
So looks like the firewall is working \o/
Pending objections and a green light from relman, we'd like to attempt switching this over on Friday. Hal, can you think of any other reasons to hold off?
Comment 75•2 years ago
I chatted with Hal out of band. He has at least two requirements before we proceed here:
- Resolving bug 1597771, either by investigating and deciding that the protections we currently have in place are good enough (maybe things have changed since that bug was filed), or by implementing the same safeguards we have in AWS if they are still missing.
- Making sure all non-SRE access to the project in GCP is revoked.
Comment 76•2 years ago
I'm about to head out on PTO -- get signoff from :rforsythe before flipping the switch, please. Open items are in comment 75.
Comment 77•2 years ago
:aj, do you have admin access to remove Michelle from the relops folder? https://console.cloud.google.com/iam-admin/iam?authuser=1&folder=723902893592 I no longer have full admin.
At this point in the migration, we can remove that access and Michelle can coordinate any needed changes through relops.
Comment 78•2 years ago
:dhouse - I do not have permissions to do this. I'd imagine the owning ops folks of this folder or a GCP admin from SRE can take care of this.
Comment 79•2 years ago
Hi Chris, is Dave's comment 77 something you can help out with?
Comment 80•2 years ago
Nope, I don't have any permissions on that folder. :jason should be able to - or at least discover who does have permissions there.
Comment 81•2 years ago
Filed https://mozilla-hub.atlassian.net/browse/OPST-776 for removal
Assignee
Comment 82•2 years ago
Assignee
Updated•2 years ago
Comment 83•2 years ago
Looks like OPST-776 is complete, and all of Hal's remaining concerns from comment 75 have been addressed.
Since this change just rides the trains, I'm planning to:
- Land it on autoland
- Trigger some shippable builds
- Manually run verify_cot to test it out
If it doesn't work, I'll ask sheriffs to back out. If it does, the next test will be nightlies. Then sometime after all-hands we can uplift this to ESR (probably fine to let this ride the trains to release).
Comment 85•2 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #84)
Hey Hal, did you want to do a final sign-off here?
lgtm
X <== marks the spot
Comment 86•2 years ago
Comment 87•2 years ago
bugherder
Assignee
Comment 88•2 years ago
And done!