Closed
Bug 733663
Opened 13 years ago
Closed 8 years ago
Consider limiting how often or for how long we RETRY from hg_errors
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: dholbert, Assigned: catlee)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2670] [retry][hg])
Attachments
(1 file)
(deleted), patch (dustin: feedback+)
A Try push of mine seems to have hit an infra problem of some sort (bug 733658).
That infra problem seems to be of the sort that triggers auto-rebuilds (which are then doomed to failure, and that failure triggers another auto-rebuild, etc).
As a result, I got 13 rapid-fire emails from Try in the span of ~6 minutes, all for the same platform.
It looks like they've stopped now (maybe because the infra issue cleared itself up, or I got a better buildslave?), but if it hadn't cleared up, I suspect I would have been continuously spammed indefinitely.
Each email I received looked exactly like this:
=========
Your Try Server build (2a9660ea4911) had unknown problem (5) on builder try-win32-debug.
The full log for this build run is available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2603.txt.gz.
=========
...with the only difference being the log number. (03.txt.gz up through 15.txt.gz)
Here are the logs for these failures (so far):
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2603.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2604.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2605.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2606.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2607.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2608.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2609.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2610.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2611.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2612.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2613.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2614.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2615.txt.gz
I'm filing this bug on making Try auto-detect this sort of issue, after a few failed autoretried builds, and stop at that point. (Otherwise, it apparently can get into a state where it'll just endlessly spam the submitter.)
Maybe that Try feature already exists and the threshold is 13 failed retries, but I suspect not. :)
Reporter
Comment 1•13 years ago
(In reply to Daniel Holbert [:dholbert] from comment #0)
> I'm filing this bug on making Try auto-detect this sort of issue, after a
> few failed autoretried builds, and stop at that point. (Otherwise, it
> apparently can get into a state where it'll just endlessly spam the
> submitter.)
(and endlessly occupy build resources, which is also bad, of course. I just highlighted the emails because they're more annoying -- I was starting to despair after hearing 13 rapid-fire "Ba-da-ding" email-notifications on my phone over 6 minutes and not knowing if they were ever going to stop. :))
Reporter
Comment 2•13 years ago
(see bug 733658 comment 3 -- looks like the issue here was an hg outage / blip of some sort. So, that's the sort of infra issue that can trigger this sort of perma-cycle/spam (for the duration of the hg outage))
Reporter
Updated•13 years ago
Summary: Try can get into an infinite loop of builds (spamming endless failure emails to the developer) if it hits an infra perma-fail which triggers an autoretry → Try can get into an infinite loop of builds & spamming endless failure emails to the developer if it hits an infra perma-fail which triggers an autoretry
Comment 3•13 years ago
FWIW, I think the right fix here is bug 712205. Slaves shouldn't even make it back into the slave pool if they can't pull/update tools *and* we'd save time on jobs.
Comment 4•13 years ago
Split the merciless spamming part off to bug 733801.
I don't think bug 712205 will save us, because updating the source repo automatically retries too. hg_errors is playing with fire, we know it is, we'll get burned (again, we already did once where I had to catch several infinite jobs before they could retry again, and cancel them), but because we have a thousand five minute outages of hg.m.o for each three hour outage, we want to keep on playing with fire, and passing the knowledge that this RETRY is RETRY12 on to the next job would be awkward, so although that's what's left for this bug to be about, I'll be surprised if that gets done.
Comment 5•13 years ago
(In reply to Phil Ringnalda (:philor) from comment #4)
> I don't think bug 712205 will save us, because updating the source repo
> automatically retries too. hg_errors is playing with fire, we know it is,
> we'll get burned (again, we already did once where I had to catch several
> infinite jobs before they could retry again, and cancel them), but because
> we have a thousand five minute outages of hg.m.o for each three hour outage,
> we want to keep on playing with fire, and passing the knowledge that this
> RETRY is RETRY12 on to the next job would be awkward, so although that's
> what's left for this bug to be about, I'll be surprised if that gets done.
Yes, bug 712205 is only a partial fix.
hg_errors currently looks like this:
hg_errors = ((re.compile("abort: HTTP Error 5\d{2}"), RETRY),
             (re.compile("abort: .*: no match found!"), RETRY),
             (re.compile("abort: Connection reset by peer"), RETRY),
             (re.compile("transaction abort!"), RETRY),
             (re.compile("abort: error:"), RETRY),)
What subset of those (if any) should we simply mark as FAILURE?
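To illustrate the question, here is a minimal sketch of how such a pattern table drives the status decision. The split between RETRY and FAILURE below is purely hypothetical (one possible answer to the question above, not buildbot's actual behavior), and RETRY/FAILURE are stand-ins for buildbot's real status constants:

```python
import re

# Stand-ins for buildbot's status constants.
RETRY, FAILURE = "RETRY", "FAILURE"

# Hypothetical split: transient network-ish errors keep RETRY,
# the rest are demoted to FAILURE. This is an illustration only.
hg_errors = (
    (re.compile(r"abort: HTTP Error 5\d{2}"), RETRY),
    (re.compile(r"abort: Connection reset by peer"), RETRY),
    (re.compile(r"abort: .*: no match found!"), FAILURE),
    (re.compile(r"transaction abort!"), FAILURE),
    (re.compile(r"abort: error:"), FAILURE),
)

def classify_hg_log(log_text):
    """Return the status of the first matching pattern, else None."""
    for pattern, status in hg_errors:
        if pattern.search(log_text):
            return status
    return None
```

Any such split trades off the two failure modes discussed in this bug: too many RETRYs spams submitters during long outages, while too many FAILUREs forces manual retriggers after short blips.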
Comment 6•13 years ago
Slotting into Automation because these hg errors can affect any branch, not just try.
(In reply to Chris Cooper [:coop] from comment #5)
> What subset of those (if any) should we simply mark as FAILURE?
Do we need a new status such as UNRECOVERABLE?
Component: Release Engineering → Release Engineering: Automation
QA Contact: release → catlee
Whiteboard: [retry]
Comment 7•13 years ago
Ugh, no, that cure is vastly worse than the disease. I count 118 jobs on mozilla-inbound alone that I didn't have to manually retrigger after last night's brief hg.m.o outage because we automatically retried them.
Resummarizing, since as filed this was bug 733801 plus invalid: "infinite loop" is only true if hg.m.o is down infinitely long, in which case none of us will be worrying about what happens because we'll be gone.
If you picture a happy world with both bug 712205 and bug 733801 fixed, then we are only talking about build slaves. The only impacts of having them continue retrying when IT has said that hg.m.o will be down for three hours are that we'll use electricity running build slaves that have nothing else to do, and that tbpl will get messy with big long strings of blue.
That would be nice to fix, but the unfixed case just means that tree-watching people (with nothing else to do, since no builds are starting) need to go around killing builds to stop them retrying, or releng needs to take the opportunity to shut down the masters. Killing them is possible because tbpl lists the branches where things are running, unlike the "have to retrigger jobs because they failed when hg.m.o was down for 45 seconds" case, where we have absolutely no way of knowing which trees had failures because of not retrying.
Summary: Try can get into an infinite loop of builds & spamming endless failure emails to the developer if it hits an infra perma-fail which triggers an autoretry → Consider limiting how often or for how long we RETRY from hg_errors
Assignee
Updated•13 years ago
Severity: normal → major
Priority: -- → P2
Assignee
Comment 8•13 years ago
from triage:
this isn't a cause of problems, but it makes a bad situation worse
it's not obvious how to fix this. some hacky ideas:
- have something in the build artificially delay after failing to clone from hg
- rip out/modify buildbot's retry logic and make it support a limited # of retries, with delays in between
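The second idea above (a limited number of retries with delays in between) could look roughly like the following sketch. This is not how buildbot actually structures its retry logic; `step`, `max_retries`, and `base_delay` are all illustrative assumptions:

```python
import time

def run_with_limited_retries(step, max_retries=3, base_delay=30):
    """Run `step` (a callable returning True on success), retrying at
    most `max_retries` times with an exponentially growing delay.

    Sketch only: buildbot's real retry handling is spread across its
    schedulers and status handling, not a helper like this.
    """
    for attempt in range(max_retries + 1):
        if step():
            return True
        if attempt < max_retries:
            # Back off between attempts: 30s, 60s, 120s, ...
            time.sleep(base_delay * (2 ** attempt))
    return False
```

The key property is the hard cap: after `max_retries` failed attempts the job ends in FAILURE instead of cycling forever, which addresses the spam scenario from comment 0.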
Severity: major → normal
Priority: P2 → P3
Whiteboard: [retry] → [retry][hg]
Assignee
Comment 9•13 years ago
One quick fix could be to add a delay of X seconds between a build finishing with RETRY and the next build being kicked off.
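Choosing that X is the whole design question. A common approach (an assumption here, not something buildbot does) is exponential backoff with jitter, so retries from many builders don't all hit hg.m.o at the same moment:

```python
import random

def retry_delay(attempt, base=60, cap=900):
    """Seconds to wait before re-queueing a build that ended in RETRY.

    Exponential backoff with full jitter, capped at `cap` seconds.
    The constants are illustrative, not taken from buildbot.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

With these constants, a brief hg.m.o blip costs at most a minute or two of extra latency, while a multi-hour outage produces only a handful of retries per build instead of a continuous stream.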
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•10 years ago
Whiteboard: [retry][hg] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2662] [retry][hg]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2662] [retry][hg] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2670] [retry][hg]
Assignee
Updated•10 years ago
Assignee: nobody → catlee
Assignee
Comment 10•10 years ago
I couldn't think of a nicer way to do this...maybe you have some other ideas?
Attachment #8529931 - Flags: feedback?(dustin)
Comment 11•10 years ago
Comment on attachment 8529931 [details] [diff] [review]
limit # of retries
That's a lot of synchronous DB queries, and more when builds are collapsed. It seems like that would lead to further performance degradation just when masters fall behind.
Assignee
Comment 12•10 years ago
Hmm...can you think of a better way to do this?
Comment 13•10 years ago
Just running the queries asynchronously, possibly under a DeferredLock so that only one runs at a time, might help. Then when the master is busy the effect will be to delay RETRYs further, and limit the number of DB queries to one at a time.
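The suggestion is Twisted's DeferredLock (buildbot is built on Twisted). As a rough analogue, the same serialize-under-a-lock idea can be sketched with asyncio; the query coroutine here is hypothetical, not buildbot's actual DB API:

```python
import asyncio

# Analogue of the DeferredLock idea: at most one retry-count DB query
# runs at a time; others queue behind the lock.
db_lock = asyncio.Lock()

async def count_prior_retries(build_id, run_query):
    """Run a (hypothetical) retry-count query, serialized by the lock.

    When the master is busy, waiting on the lock simply delays the
    RETRY decision, which is the intended behavior from comment 13.
    """
    async with db_lock:
        return await run_query(build_id)

async def demo():
    async def fake_query(build_id):
        await asyncio.sleep(0)  # simulate an async DB round-trip
        return {"b1": 2}.get(build_id, 0)
    return await count_prior_retries("b1", fake_query)
```

The design choice is deliberate: under load, slower RETRYs are acceptable (they further throttle the retry storm), whereas unbounded concurrent queries would make the master fall behind faster.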
Updated•10 years ago
Attachment #8529931 - Flags: feedback?(dustin) → feedback+
Assignee
Comment 14•8 years ago
This is fixed in Taskcluster. We won't do anything further here for buildbot.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Component: General Automation → General