9.22 - 18305% Taskcluster infra change - build times / sccache
Categories
(Taskcluster :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: alexandrui, Assigned: tomprince)
References
(Regression)
Details
(Keywords: perf-alert, regression, Whiteboard: [necko-triaged])
Attachments
(1 file)
(deleted), text/x-phabricator-request
We have detected a build metrics regression from push:
Because you authored one of the patches included in that push, we need your help to address this regression.
Regressions:
18305% sccache cache_write_errors android-5-0-aarch64 opt 3.33 -> 613.50
15167% sccache cache_write_errors android-4-0-armv7-api16 opt 4.00 -> 610.67
14432% sccache cache_write_errors android-4-2-x86 opt 4.17 -> 605.50
14235% sccache cache_write_errors osx-cross-noopt debug 4.50 -> 645.08
12955% sccache cache_write_errors android-5-0-x86_64 debug 4.67 -> 609.25
12929% sccache cache_write_errors osx-cross debug 4.92 -> 640.58
12744% sccache cache_write_errors android-5-0-x86_64 opt 4.75 -> 610.08
10862% sccache cache_write_errors osx-cross asan asan-fuzzing 5.92 -> 648.58
9013% sccache cache_write_errors osx-cross debug fuzzing 7.17 -> 653.08
5798% sccache cache_write_errors linux64-noopt debug 11.25 -> 663.50
5559% sccache cache_write_errors linux64 asan opt 11.50 -> 650.83
5465% sccache cache_write_errors linux64 debug 11.92 -> 663.17
5135% sccache cache_write_errors linux64 opt valgrind 12.50 -> 654.42
4730% sccache cache_write_errors linux64 opt 13.58 -> 656.08
4404% sccache cache_write_errors android-5-0-aarch64 debug 13.50 -> 608.00
3295% sccache cache_write_errors linux64 asan asan-fuzzing 19.50 -> 662.08
2755% sccache cache_write_errors android-4-0-armv7-api16 debug 21.25 -> 606.75
1129% sccache cache_write_errors android-4-2-x86 debug 50.08 -> 615.58
1029% sccache cache_write_errors linux32 debug 59.25 -> 668.67
978% sccache cache_write_errors linux64-aarch64 opt 60.42 -> 651.42
719% sccache cache_write_errors linux64 asan debug 80.29 -> 657.58
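Each percentage in the list above is simply the relative change between the baseline value and the regressed value. A quick sketch (the small mismatch with the reported 18305% on the first row presumably comes from the alert computing against unrounded per-push averages):

```python
def regression_pct(before: float, after: float) -> float:
    """Relative change between the baseline and the regressed value, in percent."""
    return (after - before) / before * 100.0

# First row: 3.33 -> 613.50 gives ~18323%, vs. the reported 18305%
print(round(regression_pct(3.33, 613.50)))
# Second row: 4.00 -> 610.67 matches the reported 15167% exactly
print(round(regression_pct(4.00, 610.67)))
```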
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=23788
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Comment 1•5 years ago
The patch was backed out because of hazard failures (I am not sure yet what that means).
I also don't quite understand what sccache cache_write_errors means. Can you explain what the metric measures?
Comment 2•5 years ago (Reporter)
You're right, it is odd that the patch was backed out and the regression is still there. The sccache tests also react with a bit of a delay, but the regression should have been fixed by now.
igoldan, what do you think?
Comment 3•5 years ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #1)
I also don't quite understand what
sccache cache_write_errors
means. Can you explain what the metric measures?
Kim, these tests used to be owned by Ted. Could you point us to their current owner, to receive some clarifications? Thanks!
Comment 4•5 years ago
This is indeed weird. Under no circumstances could bug 1552176 have caused these regressions.
These regressions are all over the place; all build_metrics have changed their baselines and remained that way, not just sccache. Kim, please bring this heads-up to your team.
Anyway, this actually feels to me like an infra change. But to prove it, we would need to retrigger older jobs, which at the moment cannot be done (more about that in bug 1595359).
Of the considerable infra changes that occurred over the weekend (Nov 9 and 10), after which all the build_metrics baselines changed, I think a pretty likely suspect is bug 1546801.
Dustin, I am not familiar with work done in this area, so I need your confirmation or clarification here. Tom, I believe you also have the required knowledge to answer this.
Comment 5•5 years ago
Valentin, I'd say you can unassign yourself from this bug for now. It's very likely we need to clarify a whole different matter.
Comment 6•5 years ago
I suspect this means that some or all of the sccache credentials aren't in place in the new deployment, so all writes are failing. Wander and Grenade are the most knowledgeable.
Comment 7•5 years ago
I created new credentials for Bug 1595567, but did not delete the old ones. Maybe GCP IAM messed something up.
Comment 8•5 years ago
Looking at the logs, this feels unrelated: my change only affects GCP, and sccache actually can't find AWS credentials. From sccache.log:
Could not load AWS creds: Couldn't find AWS credentials in environment, credentials file, or IAM role.
It might be a configuration problem in taskcluster-auth.
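The error above comes from sccache's AWS client walking its standard credential sources in order. A rough sketch of that lookup order (illustrative Python only, not sccache's actual Rust code; the function name and return values are made up):

```python
import os
from pathlib import Path

def find_aws_creds():
    """Sketch of the lookup order named in sccache's error message."""
    # 1. Environment variables
    if os.environ.get("AWS_ACCESS_KEY_ID") and os.environ.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"
    # 2. Shared credentials file
    if (Path.home() / ".aws" / "credentials").exists():
        return "credentials file"
    # 3. An instance IAM role would be queried via instance metadata here
    #    (omitted in this sketch). If all three fail, sccache logs the
    #    "Could not load AWS creds" error seen above.
    return None
```

On these workers, none of the three sources was available, which is consistent with writes failing on every build.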
Comment 9•5 years ago
Before, we used to apply an IAM role to Windows EC2 instances that gave them access to sccache. We stopped doing that in bug 1562686, when task configuration was modified to use Taskcluster scopes and roles to grant sccache access to Windows workers at the task level.
Workers and infrastructure no longer have anything to do with sccache bucket access permissions. It is managed at the task level, and it is not something I understand now that IAM roles on instances are not used. Furthermore, I have no information about where the GCP buckets are hosted or what IAM roles exist to manage their access.
In bug 1570148 comment 65, I asked for information about how the GCP sccache buckets are configured, what roles and IAM permissions are used, and how to deal with issues like this one, but that information has not been forthcoming. For example, I still don't know which GCP projects even host the sccache buckets, or how to find out what IAM roles exist. Without that information I am unable to troubleshoot sccache issues.
At the very least, I will need to know which project is used, and will need access granted to the GCP project that hosts the sccache buckets and the IAM configuration. Without that, I'm completely in the dark about sccache and unable to assist with this.
I'm happy to help, because I have done sccache configuration and debugging in the past, but I do need the access and information I requested in bug 1570148 in order to do so.
Comment 11•5 years ago (Assignee)
:Callek The project:taskcluster:{trust_domain}:level-{level}-sccache-buckets roles need to be ported to the new cluster. Those roles are directly referenced by the in-tree code, which adds them to the tasks (so the in-tree code doesn't need to be able to enumerate the buckets).
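In Taskcluster, a task gets the permissions attached to a role by carrying the scope `assume:<roleId>`. A sketch of how in-tree code might construct the scope for the role named above (the `gecko` trust domain and the level value are illustrative examples, not taken from this bug):

```python
def sccache_role_scope(trust_domain: str, level: int) -> str:
    # A task that carries "assume:<roleId>" is granted all scopes of that role,
    # so the task itself never needs to enumerate the underlying buckets.
    return f"assume:project:taskcluster:{trust_domain}:level-{level}-sccache-buckets"

print(sccache_role_scope("gecko", 3))
# assume:project:taskcluster:gecko:level-3-sccache-buckets
```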
Comment 12•5 years ago
Comment 13•5 years ago
I just tossed a patch up and applied this anyway; let's see what comes out of review.
Comment 14•5 years ago
Triaging; assigning to Callek since he attached patches.
Comment 15•5 years ago (Assignee)
It looks like sccache is successfully getting credentials from taskcluster, but those credentials are not working. Here is where those credentials are requested, and there is no failure there.
:dustin Can you verify whether the credentials are there? I suspect this should probably eventually belong to relops, but I'm not sure who there I should be talking to (cc: :fubar)
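For reference, the Taskcluster auth service hands out temporary S3 credentials via its awsS3Credentials endpoint. A sketch of the request URL a client would construct (the root URL, bucket, and prefix here are placeholders, and the path shape is my reading of the Auth API, not something confirmed in this bug):

```python
def aws_s3_credentials_url(root_url: str, level: str, bucket: str, prefix: str) -> str:
    # "level" is "read-only" or "read-write" in the Auth API; the returned
    # document contains temporary STS credentials scoped to bucket/prefix.
    return f"{root_url}/api/auth/v1/aws/s3/{level}/{bucket}/{prefix}"

print(aws_s3_credentials_url(
    "https://firefox-ci-tc.services.mozilla.com",  # assumed cluster root URL
    "read-write",
    "example-sccache-bucket",                      # placeholder bucket name
    "some/prefix"))
```

The failure described in this comment would then be the credentials returned by that call lacking access to the actual sccache bucket, as comment 16 goes on to confirm.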
Comment 16•5 years ago
The access_key_id currently configured in the auth service is in the cloudops-taskcluster-aws-prod AWS account, not the mozilla-taskcluster account where we run workers. It has access to the Taskcluster backup bucket, but nothing more.
Longer-term, I think we'd like to configure these AWS credentials the way we configure the GCP credentials, so they could potentially span multiple accounts. But for the moment, the fix is probably to replace that account with one from the mozilla-taskcluster account. I'll send some credentials along to cloudops.
Comment 17•5 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=1598758 for the "longer-term" bit.
Comment 18•5 years ago (Assignee)
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=007007b06317aba21a5d445dbc38dd6fee0811b0&selectedJob=277727358 now shows sccache cache_write_errors 0.