Closed Bug 1532236 Opened 6 years ago Closed 2 years ago

intermittent funsize partials timeouts

Categories

(Release Engineering :: Release Automation: Other, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: mtabara, Assigned: sfraser)

References

Details

(Keywords: leave-open, Whiteboard: [releaseduty])

Attachments

(5 files)

RelMan pointed out that we've started to see funsize jobs failing with this lately. Subsequent reruns eventually go green, as in this case.

We've seen occurrences of the same issue in bug 1411358, but that bug seems too broad to track this.

Jcristau also suggested that these timeouts should be less permissive. A normal task usually takes <10 min, yet the timeout is set to 3600 seconds. Tightening it should reduce the end-to-end time we wait for the jobs to go green, similar to what we did in bug 1531072.

Interestingly, something inside make_incremental_update.sh is breaking, but we're not getting the logs. I will update the timeout as suggested and try to improve the reporting.
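
Roughly the idea, as a sketch with illustrative names (not the actual funsize code): wrap the make_incremental_update.sh invocation with an explicit per-invocation timeout and capture its output, so failures end up in the task log instead of hanging until the task-level timeout:

import logging
import subprocess

log = logging.getLogger(__name__)

def run_make_incremental_update(cmd, workdir, env, timeout=600):
    """Run the partial-generation script with an explicit timeout and make
    sure its output ends up in the task log when it fails."""
    try:
        proc = subprocess.run(
            cmd,
            cwd=workdir,
            env=env,
            timeout=timeout,
            capture_output=True,
            text=True,
            check=True,
        )
    except subprocess.TimeoutExpired as e:
        # Log whatever the script printed before it was killed.
        log.error("%s timed out after %ss\nstdout:\n%s\nstderr:\n%s",
                  cmd[0], timeout, e.stdout, e.stderr)
        raise
    except subprocess.CalledProcessError as e:
        log.error("%s failed (rc=%s)\nstdout:\n%s\nstderr:\n%s",
                  cmd[0], e.returncode, e.stdout, e.stderr)
        raise
    log.debug(proc.stdout)
    return proc.returncode

Here cmd would be the usual ["bash", "make_incremental_update.sh", ...] invocation.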

Whiteboard: [releaseduty]
Assignee: nobody → sfraser
Keywords: leave-open
Pushed by sfraser@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/817014bcd372 Improve logging and timeouts in partials generation r=mtabara

Turns out 600s isn't enough for Linux64 ASan partial generation, where the job may take up to 36 minutes. Windows ASan is ~20 minutes.

Attachment #9048364 - Attachment description: Bug 1532236 - longer timeout for asan partial generation, r?tomprince → Bug 1532236 - longer timeout for asan partial generation, r=tomprince
Pushed by nthomas@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1cc8b60d8a6b longer timeout for asan partial generation, r=tomprince

jlund pointed out that we usually specify parameters like a timeout elsewhere, such as in the kind config. My fix was only a quick follow-up to resolve the bustage in the next nightly.
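
For illustration, moving the value into the kind config could look something like this in a transform (the field and env names here are hypothetical, not the real schema):

def set_partials_timeout(kind_config, jobs):
    # "partials-timeout" is a hypothetical per-kind setting; the default is
    # picked to cover the slow ASan case rather than the usual <10 min run.
    default_timeout = kind_config.get("partials-timeout", 2400)
    for job in jobs:
        timeout = job.pop("partials-timeout", default_timeout)
        env = job.setdefault("worker", {}).setdefault("env", {})
        # Hypothetical env var the partials script would read.
        env["MAR_TIMEOUT"] = str(timeout)
        yield job

That keeps the platform-specific numbers next to the rest of the kind's configuration instead of hardcoded in the script.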

Pushed by sfraser@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bc4e03f4ea20 Remove extra newlines from partials logging r=mtabara

https://tools.taskcluster.net/groups/Jj8j2VzfQ-WCPqsqZHPGbA/tasks/MOuYO-Y2QyyvH0ft4ry44A/runs/0/logs/public%2Flogs%2Flive.log had an error in today's nightly, and it looks to have been diffing libxul.so:

2019-03-05 12:15:29,381 - WARNING - target.partial-1.mar: diffing "libxul.so"
2019-03-05 12:24:15,463 - WARNING - target.partial-1.mar: patch "libxul.so.patch" "libxul.so"

Each of the four partials took nine minutes to process that file. Caching appears to be disabled.

https://hg.mozilla.org/mozilla-central/rev/a91196fe2eb3#l2.64 introduced a default level, which was a string, not an int, so the caching has been broken for a while.
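
The failure mode is easy to show in miniature (illustrative names, not the actual transform): with a string level, the numeric check never passes, so the cache configuration silently stops being applied; coercing to int, as the follow-up patch does, restores it:

def enable_partials_caching(params, job):
    # Before the fix, level stayed a string (e.g. "3"), so a check like
    # `level == 3` was never true and caching was quietly skipped.
    level = int(params.get("level", 1))
    if level == 3:
        # Hypothetical attribute standing in for the real cache configuration.
        job.setdefault("attributes", {})["enable-partials-caching"] = True
    return job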

Pushed by sfraser@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ce3dfcdb5861 Convert level into integer in partials transform r=mtabara
Pushed by sfraser@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f6d891b25f43 Convert level into integer in partials transform r=mtabara
Flags: needinfo?(sfraser)

Found in postmortem triaging.
sfraser: can we close this?

Flags: needinfo?(sfraser)

There are still a few errors to sort out with this, sorry.

Flags: needinfo?(sfraser)

Reinsert awscli for partials, which is needed for caching. Also update packages and fix the metrics recording.

Pushed by sfraser@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7b2ae2ea0495 Reinsert awscli, required for partials caching r=mtabara
Pushed by sfraser@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e108f9ad99a5 Reinsert awscli, required for partials caching r=mtabara

https://trello.com/c/roEzdOJx/172-partial-hu-linux-task-failing

Looks like this is failing at least some of the time on the new 67 beta (devedition).

aws-cli isn't found. Do we need to uplift, given that this missed the merge a few hours earlier?
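
For context, the kind of guard that would make a missing CLI fail soft rather than hard (a sketch, not the actual partials script):

import logging
import shutil

log = logging.getLogger(__name__)

def caching_available():
    # The S3 cache is driven through the aws CLI, so if the binary isn't in
    # the image, skip caching and say so instead of failing the whole task.
    if shutil.which("aws") is None:
        log.warning("aws CLI not found in image; generating partials without caching")
        return False
    return True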

I think we have enabled S3 caching for releases between 66 and 67, where previously we had no caching at all. See the scopes on the last 66 beta and devedition 67.0b1. I don't know if that was deliberate or not. We did have local caching for releases until bug 1501113.

I think it'll either be a case of increasing the task timeout (even just to 20 minutes) or adding the caching in, as some of the files now take a lot longer to diff than they did at this point last year.

I'd vote for an uplift.

Flags: needinfo?(sfraser)
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME