Closed
Bug 733663
Opened 13 years ago
Closed 8 years ago
Consider limiting how often or for how long we RETRY from hg_errors
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: dholbert, Assigned: catlee)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2670] [retry][hg])
Attachments
(1 file)
(deleted), patch (dustin: feedback+)
A Try push of mine seems to have hit an infra problem of some sort (bug 733658).
That infra problem seems to be of the sort that triggers auto-rebuilds (which are then doomed to failure, and that failure triggers another auto-rebuild, etc).
As a result, I got 13 rapid-fire emails from Try in the span of ~6 minutes, all for the same platform.
It looks like they've stopped now (maybe because the infra issue cleared itself up, or I got a better buildslave?), but if it hadn't cleared up, I suspect I would have been continuously spammed indefinitely.
Each email I received looked exactly like this:
=========
Your Try Server build (2a9660ea4911) had unknown problem (5) on builder try-win32-debug.
The full log for this build run is available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2603.txt.gz.
=========
...with the only difference being the log number. (03.txt.gz up through 15.txt.gz)
Here are the logs for these failures (so far):
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2603.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2604.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2605.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2606.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2607.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2608.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2609.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2610.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2611.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2612.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2613.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2614.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2615.txt.gz
I'm filing this bug on making Try auto-detect this sort of issue, after a few failed autoretried builds, and stop at that point. (Otherwise, it apparently can get into a state where it'll just endlessly spam the submitter.)
Maybe that Try feature already exists and the threshold is 13 failed retries, but I suspect not. :)
Reporter
Comment 1•13 years ago
(In reply to Daniel Holbert [:dholbert] from comment #0)
> I'm filing this bug on making Try auto-detect this sort of issue, after a
> few failed autoretried builds, and stop at that point. (Otherwise, it
> apparently can get into a state where it'll just endlessly spam the
> submitter.)
(and endlessly occupy build resources, which is also bad, of course. I just highlighted the emails because they're more annoying -- I was starting to despair after hearing 13 rapid-fire "Ba-da-ding" email-notifications on my phone over 6 minutes and not knowing if they were ever going to stop. :))
Reporter
Comment 2•13 years ago
(see bug 733658 comment 3 -- looks like the issue here was an hg outage / blip of some sort. So, that's the sort of infra issue that can trigger this sort of perma-cycle/spam (for the duration of the hg outage))
Reporter
Updated•13 years ago
Summary: Try can get into an infinite loop of builds (spamming endless failure emails to the developer) if it hits an infra perma-fail which triggers an autoretry → Try can get into an infinite loop of builds & spamming endless failure emails to the developer if it hits an infra perma-fail which triggers an autoretry
Comment 3•13 years ago
FWIW, I think the right fix here is bug 712205. Slaves shouldn't even make it back into the slave pool if they can't pull/update tools *and* we'd save time on jobs.
Comment 4•13 years ago
Split the merciless spamming part off to bug 733801.
I don't think bug 712205 will save us, because updating the source repo automatically retries too. hg_errors is playing with fire, we know it is, we'll get burned (again, we already did once where I had to catch several infinite jobs before they could retry again, and cancel them), but because we have a thousand five minute outages of hg.m.o for each three hour outage, we want to keep on playing with fire, and passing the knowledge that this RETRY is RETRY12 on to the next job would be awkward, so although that's what's left for this bug to be about, I'll be surprised if that gets done.
Comment 5•13 years ago
(In reply to Phil Ringnalda (:philor) from comment #4)
> I don't think bug 712205 will save us, because updating the source repo
> automatically retries too. hg_errors is playing with fire, we know it is,
> we'll get burned (again, we already did once where I had to catch several
> infinite jobs before they could retry again, and cancel them), but because
> we have a thousand five minute outages of hg.m.o for each three hour outage,
> we want to keep on playing with fire, and passing the knowledge that this
> RETRY is RETRY12 on to the next job would be awkward, so although that's
> what's left for this bug to be about, I'll be surprised if that gets done.
Yes, bug 712205 is only a partial fix.
hg_errors currently looks like this:
hg_errors = ((re.compile("abort: HTTP Error 5\d{2}"), RETRY),
             (re.compile("abort: .*: no match found!"), RETRY),
             (re.compile("abort: Connection reset by peer"), RETRY),
             (re.compile("transaction abort!"), RETRY),
             (re.compile("abort: error:"), RETRY),)
What subset of those (if any) should we simply mark as FAILURE?
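To illustrate the question, here is a minimal sketch of how such a pattern table drives the status decision. The split between RETRY and FAILURE below is purely hypothetical (one possible answer to the question above, not buildbot's actual behavior), and RETRY/FAILURE are stand-ins for buildbot's real status constants:

```python
import re

# Stand-ins for buildbot's status constants.
RETRY, FAILURE = "RETRY", "FAILURE"

# Hypothetical split: transient network-ish errors keep RETRY,
# the rest are demoted to FAILURE. This is an illustration only.
hg_errors = (
    (re.compile(r"abort: HTTP Error 5\d{2}"), RETRY),
    (re.compile(r"abort: Connection reset by peer"), RETRY),
    (re.compile(r"abort: .*: no match found!"), FAILURE),
    (re.compile(r"transaction abort!"), FAILURE),
    (re.compile(r"abort: error:"), FAILURE),
)

def classify_hg_log(log_text):
    """Return the status of the first matching pattern, else None."""
    for pattern, status in hg_errors:
        if pattern.search(log_text):
            return status
    return None
```

Any such split trades off the two failure modes discussed in this bug: too many RETRYs spams submitters during long outages, while too many FAILUREs forces manual retriggers after short blips.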
Comment 6•13 years ago
Slotting into Automation because these hg errors can affect any branch, not just try.
(In reply to Chris Cooper [:coop] from comment #5)
> What subset of those (if any) should we simply mark as FAILURE?
Do we need a new status such as UNRECOVERABLE?
Component: Release Engineering → Release Engineering: Automation
QA Contact: release → catlee
Whiteboard: [retry]
Comment 7•13 years ago
Ugh, no, that cure is vastly worse than the disease. I count 118 jobs on mozilla-inbound alone that I didn't have to manually retrigger after last night's brief hg.m.o outage because we automatically retried them.
Resummarizing, since as filed this was bug 733801 plus invalid: "infinite loop" is only true if hg.m.o is down infinitely long, in which case none of us will be worrying about what happens because we'll be gone.
If you picture a happy world with both bug 712205 and bug 733801 fixed, then we are only talking about build slaves. The only impacts of having them continue retrying when IT has said that hg.m.o will be down for three hours are that we'll use electricity running build slaves that have nothing else to do, and that tbpl will get messy with big long strings of blue.
That would be nice to fix, but the unfixed case just means that tree-watching people (with nothing else to do, since no builds are starting) need to go around killing builds to stop them retrying, or releng needs to take the opportunity to shut down the masters. Killing them is possible because tbpl lists the branches where things are running, unlike the "have to retrigger jobs because they failed when hg.m.o was down for 45 seconds" case, where we have absolutely no way of knowing which trees had failures because of not retrying.
Summary: Try can get into an infinite loop of builds & spamming endless failure emails to the developer if it hits an infra perma-fail which triggers an autoretry → Consider limiting how often or for how long we RETRY from hg_errors
Assignee
Updated•13 years ago
Severity: normal → major
Priority: -- → P2
Assignee
Comment 8•13 years ago
from triage:
this isn't a cause of problems, but it makes a bad situation worse
it's not obvious how to fix this. some hacky ideas:
- have something in the build artificially delay after failing to clone from hg
- rip out/modify buildbot's retry logic and make it support a limited # of retries, with delays in between
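The second idea above (a limited number of retries with delays in between) could look roughly like the following sketch. This is not how buildbot actually structures its retry logic; `step`, `max_retries`, and `base_delay` are all illustrative assumptions:

```python
import time

def run_with_limited_retries(step, max_retries=3, base_delay=30):
    """Run `step` (a callable returning True on success), retrying at
    most `max_retries` times with an exponentially growing delay.

    Sketch only: buildbot's real retry handling is spread across its
    schedulers and status handling, not a helper like this.
    """
    for attempt in range(max_retries + 1):
        if step():
            return True
        if attempt < max_retries:
            # Back off between attempts: 30s, 60s, 120s, ...
            time.sleep(base_delay * (2 ** attempt))
    return False
```

The key property is the hard cap: after `max_retries` failed attempts the job ends in FAILURE instead of cycling forever, which addresses the spam scenario from comment 0.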
Severity: major → normal
Priority: P2 → P3
Whiteboard: [retry] → [retry][hg]
Assignee
Comment 9•13 years ago
One quick fix could be to add a delay of X seconds between a build finishing with RETRY and the next build being kicked off.
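Choosing that X is the whole design question. A common approach (an assumption here, not something buildbot does) is exponential backoff with jitter, so retries from many builders don't all hit hg.m.o at the same moment:

```python
import random

def retry_delay(attempt, base=60, cap=900):
    """Seconds to wait before re-queueing a build that ended in RETRY.

    Exponential backoff with full jitter, capped at `cap` seconds.
    The constants are illustrative, not taken from buildbot.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

With these constants, a brief hg.m.o blip costs at most a minute or two of extra latency, while a multi-hour outage produces only a handful of retries per build instead of a continuous stream.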
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•10 years ago
Whiteboard: [retry][hg] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2662] [retry][hg]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2662] [retry][hg] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2670] [retry][hg]
Assignee
Updated•10 years ago
Assignee: nobody → catlee
Assignee
Comment 10•10 years ago
I couldn't think of a nicer way to do this...maybe you have some other ideas?
Attachment #8529931 - Flags: feedback?(dustin)
Comment 11•10 years ago
Comment on attachment 8529931 [details] [diff] [review]
limit # of retries
That's a lot of synchronous DB queries, and more when builds are collapsed. It seems like that would lead to further performance degradation just when masters fall behind.
Assignee
Comment 12•10 years ago
Hmm...can you think of a better way to do this?
Comment 13•10 years ago
Just running the queries asynchronously, possibly under a DeferredLock so that only one runs at a time, might help. Then when the master is busy the effect will be to delay RETRYs further, and limit the number of DB queries to one at a time.
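The suggestion is Twisted's DeferredLock (buildbot is built on Twisted). As a rough analogue, the same serialize-under-a-lock idea can be sketched with asyncio; the query coroutine here is hypothetical, not buildbot's actual DB API:

```python
import asyncio

# Analogue of the DeferredLock idea: at most one retry-count DB query
# runs at a time; others queue behind the lock.
db_lock = asyncio.Lock()

async def count_prior_retries(build_id, run_query):
    """Run a (hypothetical) retry-count query, serialized by the lock.

    When the master is busy, waiting on the lock simply delays the
    RETRY decision, which is the intended behavior from comment 13.
    """
    async with db_lock:
        return await run_query(build_id)

async def demo():
    async def fake_query(build_id):
        await asyncio.sleep(0)  # simulate an async DB round-trip
        return {"b1": 2}.get(build_id, 0)
    return await count_prior_retries("b1", fake_query)
```

The design choice is deliberate: under load, slower RETRYs are acceptable (they further throttle the retry storm), whereas unbounded concurrent queries would make the master fall behind faster.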
Updated•10 years ago
Attachment #8529931 - Flags: feedback?(dustin) → feedback+
Assignee
Comment 14•8 years ago
This is fixed in Taskcluster. We won't do anything further here for buildbot.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Component: General Automation → General