Closed Bug 1000210 Opened 11 years ago Closed 7 years ago

slaverebooter should file bugs for and/or attempt to reboot machines that have retried more than the last N jobs

Tracking

(Not tracked)

Status:

RESOLVED INCOMPLETE

People

(Reporter: emorley, Unassigned)

References

Details

(Keywords: sheriffing-P1, Whiteboard: slaveapi)

Ed Morley [:emorley]

Reporter

Description

•

11 years ago

Jobs that fail with a buildbot result of "RETRY" are (correctly) ignored for most of the TBPL starring workflow, but this means that bad machines can go on a rampage, chewing through jobs. Unless someone looking at TBPL is not in onlyunstarred=1 mode and is paying attention, this can go overlooked for several hours, if not days.

Ideally slaverebooter would file a bug for, and possibly also reboot, machines that have had a buildbot result of RETRY for more than the last N jobs that machine has performed (some of the retry failure modes, like the tools repo clone not working, can be fixed by a reboot alone). I would imagine an N of 10-20 might be a good place to start.

At the moment, the sheriffs have to manually go through recent blue jobs on TBPL, load the slave health page for each using the link in the bottom left of the UI, and look at the job history table.

Joel is working on something that would be able to help in longer-term bad SD card intermittent failure type cases, but this bug would take care of this "retrying machine chewing through everything in sight over the last hour" case.
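For illustration only, the check requested in the description could look roughly like the sketch below. It assumes the job history is available as a newest-first list of buildbot numeric result codes (RETRY is 5 in buildbot); the function names and the default threshold of 10 are hypothetical and not taken from slaverebooter.

```python
RETRY = 5  # buildbot's numeric result code for RETRY


def trailing_retry_count(results):
    """Count consecutive RETRY results at the head of a newest-first list."""
    count = 0
    for result in results:
        if result != RETRY:
            break
        count += 1
    return count


def needs_attention(results, threshold=10):
    """True if at least `threshold` of the most recent jobs were all RETRY.

    `threshold` plays the role of N from the description; 10 is only the
    low end of the suggested 10-20 starting range.
    """
    return trailing_retry_count(results) >= threshold
```

A machine whose history is `[5, 5, 0, 5]` has a run of two retries and would not be flagged, whereas ten or more leading `5`s would trigger the bug filing and/or reboot.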

Ed Morley [:emorley]

Reporter

Comment 1

•

11 years ago

Ben, is this something that would be easily doable? :-)

Flags: needinfo?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Comment 2

•

11 years ago

(In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> Ben, is this something that would be easily doable? :-)

I'm busy with a high priority B2G item right now, I probably won't be able to respond soon. Callek may be able to help you out in the meantime.

Ed Morley [:emorley]

Reporter

Comment 3

•

11 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #2)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> > Ben, is this something that would be easily doable? :-)
>
> I'm busy with a high priority B2G item right now, I probably won't be able
> to respond soon. Callek may be able to help you out in the meantime.

Np, thank you :-)

Flags: needinfo?(bugspam.Callek)

bhearsum@mozilla.com (:bhearsum)

Comment 4

•

11 years ago

So, slaverebooter already gets the most recent job information for each slave. We _could_ have it pull extra job information if the most recent job is RETRY. That needs a couple of things:

1) A new or modified endpoint to support pulling more than the most recent job. Best way to do that is probably to add a query arg to the /slaves/:slave endpoint (http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/web/slave.py;h=03108a044815dd3623844a5c4b6ad045676f26a0;hb=HEAD#l16)

2) Add the extra call to slaveapi in the slave rebooter script: https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/reboot-idle-slaves.py#l49.
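A rough sketch of how the two steps in comment 4 might fit together on the client side, under stated assumptions: the `recent_jobs` query argument, the `results` key on each job record, and all function names here are hypothetical, not part of slaveapi or reboot-idle-slaves.py as they exist. Only the /slaves/:slave path comes from the comment.

```python
from urllib.parse import quote, urlencode

RETRY = 5  # buildbot's numeric result code for RETRY


def slave_url(api_base, slave, job_count):
    """Build a /slaves/:slave URL asking for more than one recent job.

    "recent_jobs" is an assumed name for the new query argument.
    """
    query = urlencode({"recent_jobs": job_count})
    return "%s/slaves/%s?%s" % (api_base.rstrip("/"), quote(slave), query)


def should_fetch_history(most_recent_job):
    """Only make the extra call when the most recent job was a RETRY."""
    return most_recent_job is not None and most_recent_job.get("results") == RETRY


def all_retries(jobs, threshold):
    """True if the newest `threshold` jobs exist and every one is a RETRY."""
    recent = jobs[:threshold]
    return len(recent) >= threshold and all(
        job.get("results") == RETRY for job in recent
    )
```

The extra request is made only for slaves whose latest job is already a RETRY, so healthy machines would cost no additional slaveapi calls.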

Flags: needinfo?(bhearsum)

Justin Wood (:Callek)

Comment 5

•

10 years ago

I also don't foresee doing this any time very soon. That said, it shouldn't be too hard if I get an inclination/free time sooner rather than later :)

Flags: needinfo?(bugspam.Callek)

Whiteboard: slaveapi

Ed Morley [:emorley]

Reporter

Updated

•

10 years ago

Blocks: byebyebuildduty

Nobody; OK to take it and work on it

Assignee

Updated

•

7 years ago

Component: Tools → General

Ed Morley [:emorley]

Reporter

Comment 6

•

7 years ago

Mass-closing old bugs I filed that have not had recent activity/no longer affect me.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → INCOMPLETE

Categories

(Release Engineering :: General, defect)
