Closed Bug 1549346 Opened 5 years ago Closed 5 years ago

21.2 - 22.68% build times (windows2012-32, windows2012-64) regression on push 5b08dd3eeec974c6ae229134906fc79dec7ef2ba (Thu May 2 2019)

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, defect)

All
Windows
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: grenade)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

We have detected a build metrics regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=5b08dd3eeec974c6ae229134906fc79dec7ef2ba

As you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

23% build times windows2012-32 debug gcp taskcluster-n1-standard-32 2,255.64 -> 2,767.15
21% build times windows2012-64 debug gcp taskcluster-n1-standard-32 2,279.04 -> 2,764.69
21% build times windows2012-64 debug gcp taskcluster-n1-highcpu-32 2,318.17 -> 2,809.59
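
The percentages in the alert can be checked directly from the before and after values. The following is a minimal sketch, not part of the original report: the dictionary keys are shortened labels chosen here for readability, and `percent_change` is a hypothetical helper name.

```python
# Before/after build times copied from the three alert lines above.
# The keys are shortened labels chosen for this sketch.
ALERTS = {
    "windows2012-32 debug n1-standard-32": (2255.64, 2767.15),
    "windows2012-64 debug n1-standard-32": (2279.04, 2764.69),
    "windows2012-64 debug n1-highcpu-32": (2318.17, 2809.59),
}


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for name, (before, after) in ALERTS.items():
        print(f"{name}: {percent_change(before, after):.2f}%")
```

The three results span roughly 21.2% to 22.68%, which matches the range given in the bug title.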

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=20762

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Component: General → RelOps: OpenCloudConfig
Product: Testing → Infrastructure & Operations
QA Contact: rthijssen
Flags: needinfo?(rthijssen)

the change in question adds sccache support to gcp builders. it should not be backed out as it only affects tier 3 gcp builds and was added specifically because we need to investigate and debug sccache performance on gcp builds.

Flags: needinfo?(rthijssen)

(In reply to Rob Thijssen [:grenade (EET)] from comment #1)

> the change in question adds sccache support to gcp builders. it should not be backed out as it only affects tier 3 gcp builds and was added specifically because we need to investigate and debug sccache performance on gcp builds.

Should we close this as WONTFIX?

Flags: needinfo?(rthijssen)

let's leave it open and i'll update it as we start to understand the performance issues.

Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Flags: needinfo?(rthijssen)

i'm hoping the problem is due to object level permissions being enabled on the gcp buckets.
i'm switching all the gcp bucket object acls off now and hope this will get things working properly.
https://github.com/mozilla-releng/OpenCloudConfig/commit/1ad4b7f

We're also hoping that the GCE compute-optimized instances will help, when we get access to them. Soon-ish.

i think i misunderstood ted's instructions with my earlier patch in bug 1543026, because i assumed that the sccache toolchain build happened as part of the gecko build. i've since learned that it doesn't: there's a dedicated sccache toolchain build, which currently runs on our ec2 builders. those builders don't have the SCCACHE_GCS_KEY_PATH variable set, so sccache was not being built with the gcs feature switch toggled on.

in light of that misunderstanding, i believe i need to back out my change to taskcluster/scripts/misc/build-sccache.sh and just toggle the gcs feature switch to always-on for the windows build, irrespective of the SCCACHE_GCS_KEY_PATH setting.

partial revert of https://hg.mozilla.org/mozilla-central/rev/5b08dd3eeec9: the gcs feature switch should have been applied to the win build only and always enabled, rather than enabled only when the SCCACHE_GCS_KEY_PATH variable is set.
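
The difference between the two behaviours can be modelled in a few lines. This is an illustrative sketch only, not the contents of build-sccache.sh: the function names are hypothetical, and the "after" rule (gcs on for windows, off elsewhere) is one reading of the description above.

```python
from typing import Mapping


def gcs_feature_conditional(env: Mapping[str, str]) -> bool:
    """Earlier behaviour as described: build with gcs only if the key path is set."""
    return bool(env.get("SCCACHE_GCS_KEY_PATH"))


def gcs_feature_always_on(target_os: str) -> bool:
    """Revised behaviour as described: windows builds always get gcs.

    Treating every other platform as "off" is an assumption of this sketch.
    """
    return target_os == "windows"


if __name__ == "__main__":
    # Hypothetical environment for the ec2 toolchain builder: variable unset.
    ec2_toolchain_env: dict = {}
    print(gcs_feature_conditional(ec2_toolchain_env))  # False
    print(gcs_feature_always_on("windows"))            # True
```

Under the conditional rule, a toolchain build running on a builder without the variable produces an sccache binary without gcs support, which is the failure described in the comment above. The always-on rule removes the dependency on the builder's environment.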

Attachment #9064243 - Attachment description: toggle gcs feature on → toggle sccache gcs feature to always on

Comment on attachment 9064243 [details]
toggle sccache gcs feature to always on

cmanchester: i hope you don't mind my requesting your review. i looked for recent changes to this build file and your name came up. feel free to redirect if you think someone else should take a look.

Attachment #9064243 - Flags: review?(cmanchester)

Comment on attachment 9064243 [details]
toggle sccache gcs feature to always on

approved in phabricator

Attachment #9064243 - Flags: review?(cmanchester) → review+

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b23f1b465581
toggle sccache gcs feature to always on r=chmanchester

Keywords: checkin-needed

patch has worked. builds that have picked up the sccache artifact from the autoland toolchain (sccache2.tar.bz2@QhxrnGYjQv2SLC5VW5dHQQ) are successfully reading from and writing to the gcs sccache buckets (eg: taskcluster-level-3-sccache-us-west1).

we should see a build time decrease in the metrics for gcp shortly.

Keywords: leave-open
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

(In reply to Rob Thijssen [:grenade (EET)] from comment #11)

> patch has worked. builds that have picked up the sccache artifact from the autoland toolchain (sccache2.tar.bz2@QhxrnGYjQv2SLC5VW5dHQQ) are successfully reading from and writing to the gcs sccache buckets (eg: taskcluster-level-3-sccache-us-west1).
>
> we should see a build time decrease in the metrics for gcp shortly.

Indeed, I confirm the time decrease. Still, it doesn't look like a total fix.

Is there something more we could do to completely fix these increases?

Flags: needinfo?(rthijssen)

i'm not sure. there is an ongoing discussion (cmanchester, glandium, ajvb, hwine, wcosta & others) about how we will use sccache. it would be useful to know what the actual build time increase is with sccache. when this bug was opened, the title suggested a build time increase of 21.2 - 22.68%. however, the initial builds with sccache in gcp were misconfigured (sccache was enabled but the buckets were unreachable); the patch in comment 12 fixed those issues. later, in bug 1552503, comment 3, we turned off parallel gcp windows builds. so the window of time when gcp sccache was enabled and correctly configured, for m-i and autoland, was 2019-05-14 to 2019-05-19. comparing build times in that window against build times from before this bug was created on 2019-05-06 would tell us what the actual build time increase was.

but in terms of whether we can make sccache faster, i don't know the answer. cmanchester and glandium are probably better placed to answer that. my finger-in-the-wind feeling is that sccache is slow on gcp because of network performance for bucket transfers. they "feel" faster on ec2 but i haven't done any sort of scientific comparison.

the ~20% figure is probably inaccurate because at that time sccache was enabled but the buckets were unavailable, so instances were using local storage for their sccache, which just resulted in overhead for sccache calls and no performance boost from cached artifacts. if it's possible to check the performance of windows gcp autoland and m-i builds between 2019-05-14 and 2019-05-19 (when buckets were available and working correctly), compared to builds before 2019-05-06 (when sccache was completely disabled on gcp), we'd get a better idea of what the performance cost was.
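
The comparison proposed above could be sketched as follows. This is a minimal illustration under stated assumptions, not an existing tool: the two windows are read as 14 to 19 May 2019 (inclusive) and anything before 6 May 2019, the sample format `(date, build_time)` is hypothetical, and the numbers in the usage section are placeholders rather than real measurements.

```python
from datetime import date
from statistics import mean
from typing import Iterable, Tuple

Sample = Tuple[date, float]  # (push date, build time); hypothetical format

# Window boundaries as read from the comment above (an interpretation).
BASELINE_END = date(2019, 5, 6)   # sccache fully disabled on gcp before this
WINDOW_START = date(2019, 5, 14)  # buckets reachable and working from here
WINDOW_END = date(2019, 5, 19)    # treated as inclusive in this sketch


def sccache_cost_percent(samples: Iterable[Sample]) -> float:
    """Percent change in mean build time: working window vs. baseline."""
    samples = list(samples)
    baseline = [t for d, t in samples if d < BASELINE_END]
    working = [t for d, t in samples if WINDOW_START <= d <= WINDOW_END]
    if not baseline or not working:
        raise ValueError("need at least one sample in each window")
    base = mean(baseline)
    return (mean(working) - base) / base * 100.0


if __name__ == "__main__":
    # Placeholder values only; these are NOT real build times.
    placeholder = [
        (date(2019, 5, 1), 100.0),
        (date(2019, 5, 3), 100.0),
        (date(2019, 5, 10), 500.0),  # misconfigured period, ignored
        (date(2019, 5, 15), 110.0),
        (date(2019, 5, 18), 110.0),
    ]
    print(f"{sccache_cost_percent(placeholder):.1f}%")
```

Samples from the misconfigured period (between the two windows) are deliberately excluded, since they measure sccache overhead without any cache hits.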

Flags: needinfo?(rthijssen)