intermittent funsize partials timeouts
Categories
(Release Engineering :: Release Automation: Other, enhancement)
Tracking
(Not tracked)
People
(Reporter: mtabara, Assigned: sfraser)
References
Details
(Keywords: leave-open, Whiteboard: [releaseduty])
Attachments
(5 files)
RelMan pointed out that we've started to see funsize jobs failing with this lately. Subsequent reruns eventually go green, as in this case.
We've seen occurrences of the same issue in bug 1411358, but that bug seems too broad.
Jcristau also suggested that these timeouts should be less permissive: a normal task usually takes <10 minutes, yet the timeout is set to 3600 seconds, so a hung run can block the graph for up to an hour before the automatic rerun starts. Tightening the timeout should reduce the end-to-end time we wait for the jobs to go green, similar to what we did in bug 1531072.
Assignee
Comment 1•6 years ago
Interesting, something inside make_incremental_update.sh is breaking, but we're not getting the logs. I will update the timeout as suggested and try to improve the reporting.
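For illustration, a minimal sketch of capturing the script's output so a failure shows up in the task log instead of disappearing (the wrapper name and invocation are hypothetical, not the actual funsize code):

import logging
import subprocess

log = logging.getLogger(__name__)

def run_make_incremental_update(args, cwd):
    """Run make_incremental_update.sh and log its output when it fails."""
    result = subprocess.run(
        ["bash", "make_incremental_update.sh"] + list(args),
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface stdout/stderr in the task log rather than losing them.
        log.error("make_incremental_update.sh failed (exit %d)\nstdout:\n%s\nstderr:\n%s",
                  result.returncode, result.stdout, result.stderr)
        raise RuntimeError("partial MAR generation failed")
    return result.stdout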
Reporter
Updated•6 years ago
Assignee
Comment 2•6 years ago
Assignee
Updated•6 years ago
Assignee
Updated•6 years ago
Comment 4•6 years ago
bugherder
Comment 5•6 years ago
Turns out 600s isn't enough for Linux64 ASan partial generation, where the job may take up to 36 minutes. Windows ASan is ~20 minutes.
Comment 6•6 years ago
Updated•6 years ago
Comment 8•6 years ago
jlund pointed out that we usually specify parameters like a timeout elsewhere, such as in the kind config. My fix was only a quick follow-up to resolve the bustage in the next nightly.
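For reference, a hedged sketch of what moving the timeout into the kind's configuration might look like, here written as a taskgraph transform (the 2400s value and function name are illustrative, not the landed change):

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

@transforms.add
def set_partials_max_run_time(config, tasks):
    for task in tasks:
        worker = task.setdefault("worker", {})
        # Well below the old 3600s ceiling, but with headroom for the
        # slow ASan partials (~36 minutes).
        worker.setdefault("max-run-time", 2400)
        yield task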
Comment 9•6 years ago
bugherder
Assignee
Comment 10•6 years ago
Comment 11•6 years ago
Assignee
Comment 12•6 years ago
https://tools.taskcluster.net/groups/Jj8j2VzfQ-WCPqsqZHPGbA/tasks/MOuYO-Y2QyyvH0ft4ry44A/runs/0/logs/public%2Flogs%2Flive.log had an error in today's nightly, and it looks to have been diffing libxul.so:
2019-03-05 12:15:29,381 - WARNING - target.partial-1.mar: diffing "libxul.so"
2019-03-05 12:24:15,463 - WARNING - target.partial-1.mar: patch "libxul.so.patch" "libxul.so"
Each of the four partials took nine minutes to process that file. Caching seems disabled.
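For context, this is roughly what the diff cache is meant to buy us (a hedged sketch with hypothetical names, not the in-tree mbsdiff hook): key the patch on the hashes of the two input files, so an identical libxul.so pair is only bsdiff'ed once instead of four times:

import hashlib
import os
import shutil
import subprocess

CACHE_DIR = os.environ.get("MBSDIFF_CACHE_DIR", "/tmp/mbsdiff-cache")  # hypothetical

def file_sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def cached_mbsdiff(old, new, patch):
    """Reuse a previously computed patch for an identical (old, new) pair."""
    key = file_sha256(old) + "-" + file_sha256(new)
    cached = os.path.join(CACHE_DIR, key)
    if os.path.exists(cached):
        shutil.copyfile(cached, patch)
        return
    subprocess.run(["mbsdiff", old, new, patch], check=True)
    os.makedirs(CACHE_DIR, exist_ok=True)
    shutil.copyfile(patch, cached)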
Assignee
Comment 13•6 years ago
Assignee
Comment 14•6 years ago
https://hg.mozilla.org/mozilla-central/rev/a91196fe2eb3#l2.64 introduced a default level, which was a string, not an int, so the caching has been broken for a while.
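For illustration only (hypothetical names, not the actual partials.py code), this is how a string default silently defeats an integer gate on the cache:

def caching_enabled(level):
    # Only high-trust (level 3) trees get the shared diff cache.
    return level == 3

caching_enabled(3)    # True  -> cache used
caching_enabled("3")  # False -> cache silently skipped; every partial rediffs libxul.so

# Fix: normalise the argument before comparing, e.g. int(level) == 3.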
Comment 15•6 years ago
Comment 16•6 years ago
Backed out changeset ce3dfcdb5861 (Bug 1532236) for linting opt failure in partials.py CLOSED TREE
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231926130&repo=autoland&lineNumber=295
Backout: https://hg.mozilla.org/integration/autoland/rev/9fca85ee3084599c0119d126840eb2062f65003d
Comment 17•6 years ago
Comment 18•6 years ago
bugherder
Comment 19•6 years ago
bugherder
Assignee
Updated•6 years ago
Reporter
Comment 20•6 years ago
Found in postmortem triaging.
sfraser: can we close this?
Assignee
Comment 21•6 years ago
There are still a few errors to sort out with this, sorry.
Assignee
Comment 22•6 years ago
Reinsert awscli for partials, which is needed for caching. Also update packages and fix the metrics recording.
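A rough sketch of why the aws CLI matters here (the bucket prefix and helper names are hypothetical; the real values come from the task's configuration): the S3 round-trip shells out to aws, so dropping the package from the image turns every lookup into a cache miss or an outright failure.

import subprocess

S3_PREFIX = "s3://example-partials-cache/mbsdiff"  # hypothetical bucket/prefix

def fetch_cached_patch(key, dest):
    """Return True if a previously computed patch was pulled down from S3."""
    try:
        result = subprocess.run(
            ["aws", "s3", "cp", "%s/%s" % (S3_PREFIX, key), dest],
            capture_output=True,
        )
    except FileNotFoundError:
        # aws CLI missing from the image: behave as a cache miss.
        return False
    return result.returncode == 0

def store_cached_patch(key, src):
    subprocess.run(["aws", "s3", "cp", src, "%s/%s" % (S3_PREFIX, key)],
                   check=False)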
Comment 23•6 years ago
Comment 24•6 years ago
Backed out for lint failure at mbsdiff_hook.sh
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=233055627&repo=autoland&lineNumber=280
Backout: https://hg.mozilla.org/integration/autoland/rev/8076b05b8631a55cc9a9922821e975c4ea4434af
Comment 25•6 years ago
Comment 26•6 years ago
bugherder
Comment 27•6 years ago
https://trello.com/c/roEzdOJx/172-partial-hu-linux-task-failing
Looks like this is failing at least some of the time on the new 67 beta (devedition): aws-cli isn't found. Do we need to uplift, given this missed the merge a few hours earlier?
Comment 28•6 years ago
I think we enabled S3 caching for releases between 66 and 67, where previously we had no caching at all; see the scopes on the last 66 beta and devedition 67.0b1. I don't know whether that was deliberate. We did have local caching for releases until bug 1501113.
Assignee
Comment 29•6 years ago
I think it'll either be a case of increasing the task timeout (even just to 20 minutes) or adding the caching in, as some of the files now take a lot longer to diff than they did at this point last year.
I'd vote for an uplift.
Updated•2 years ago