Closed
Bug 463020
Opened 16 years ago
Closed 16 years ago
Talos machines should be automatically rebooted periodically
Categories: Release Engineering :: General (defect, P2)
Tracking: not tracked
Status: RESOLVED FIXED
People
(Reporter: catlee, Assigned: catlee)
Attachments
(4 files, 3 obsolete files)
(deleted), text/plain | anodelman: review+, coop: checked-in+ | Details
(deleted), patch | anodelman: review+, coop: checked-in+ | Details | Diff | Splinter Review
(deleted), patch | anodelman: review+, anodelman: checked-in+ | Details | Diff | Splinter Review
(deleted), patch | anodelman: review+, anodelman: checked-in+ | Details | Diff | Splinter Review
After Talos machines have been up for a while, the performance results start to drift. Rebooting the machines seems to fix this problem.
We should be rebooting the Talos machines regularly so that this drift is not a problem. Two approaches are:
- Reboot after every n Talos runs
- Reboot if uptime > m
Other issues to solve:
- How to perform the reboot? The cltbld account normally doesn't have permission to do this. On Linux and Mac we could give sudo access.
- How to prevent buildbot from freaking out? If the reboot is performed inside a buildbot job, then that job will fail, and then the build slave will lose connectivity.
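The first approach boils down to a persistent counter checked after each run. A minimal sketch (the file path, threshold, and helper name are illustrative assumptions, not the eventual implementation):

```python
import os

def should_reboot(counter_file, max_runs=25):
    """Bump a persistent run counter; return True once max_runs runs have
    completed, resetting the counter so the next cycle starts fresh."""
    count = 0
    if os.path.exists(counter_file):
        with open(counter_file) as f:
            count = int(f.read().strip() or "0")
    count += 1
    reboot = count >= max_runs
    if reboot:
        count = 0
    with open(counter_file, "w") as f:
        f.write(str(count))
    return reboot
```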
Comment 1 • 16 years ago
(I've been thinking about this a lot lately, so bear with me)
I think the "right" way to do this is from inside of Buildbot. It's the only way we can be *certain* we don't interrupt a running job.
Rebooting based on uptime is going to be tough. AFAIK we don't have any unix facilities (cygwin, msys, et al.) on the Windows Talos machines - so we can't use a simple bash one-liner to do it.
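On the unix boxes, though, an uptime check is nearly a one-liner (Linux-specific sketch; /proc/uptime's first field is seconds since boot, and the threshold is arbitrary):

```python
def uptime_exceeds(max_seconds, proc_uptime="/proc/uptime"):
    """Return True if the machine has been up longer than max_seconds."""
    with open(proc_uptime) as f:
        return float(f.read().split()[0]) > max_seconds
```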
Rebooting after every N runs is pretty easy. It'll require a custom BuildStep, but a trivial one:
class Reboot(ShellCommand):
    def start(self):
        buildNum = self.step_status.getBuild().getNumber()
        # reboot every 25 builds
        if (buildNum % 25) != 0:
            return SKIPPED
        ShellCommand.start(self)
...and the TalosFactory would call it thusly:
self.addStep(Reboot, command=['shutdown', '-r', '-t', '0'], flunkOnFailure=False)
(with the proper variant per platform.)
I'm probably missing a detail or two in the step, but I've used the same logic for conditional clobbers on a non-Mozilla Buildbot. The flunkOnFailure is important here -- it will make sure that the build won't turn red when the slave *does* reboot.
One thing I'm not sure of here is whether or not the master will try to give the ghosted slave another job before it rejoins.
Just my thoughts, take them fwiw!
Comment 2 (assignee) • 16 years ago
Comment 3 (assignee) • 16 years ago
Updated (assignee) • 16 years ago
Attachment #346470 - Attachment mime type: application/octet-stream → text/plain
Updated (assignee) • 16 years ago
Priority: -- → P2
Comment 5 • 16 years ago
Looking at http://qm-buildbot01.mozilla.org:2008, it seems that the slaves
qm-pxp-stage01
qm-pvista-stage01
qm-pubuntu-stage01
qm-ptiger-stage01
qm-pleopard-stage01
are for staging only, and could be used to see if this auto-restarting patch works.
Comment 6 (assignee) • 16 years ago
http://graphs-stage.mozilla.org/#show=395345,395333,395459,395497,395511
Machines are rebooting after every 3 Talos runs since Nov 12, ~12pm.
Comment 7 (assignee) • 16 years ago
Attachment #346468 - Attachment is obsolete: true
Attachment #348603 - Flags: review?(anodelman)
Comment 8 (assignee) • 16 years ago
Attachment #346470 - Attachment is obsolete: true
Attachment #348604 - Flags: review?(anodelman)
Comment 9 (assignee) • 16 years ago
Once these are approved, let's make these live on Firefox 3.0 production.
These machines are:
qm-plinux-fast01 (fast)
qm-mini-ubuntu03 (nochrome)
qm-mini-ubuntu01
qm-mini-ubuntu02
qm-mini-ubuntu05
qm-pmac-fast01 (fast)
qm-pmac05 (nochrome)
qm-pmac01
qm-pmac02
qm-pmac03
qm-pleopard-trunk07
qm-pleopard-trunk08
qm-pxp-fast01 (fast)
qm-pxp-jss01 (jss)
qm-pxp-jss02 (jss)
qm-pxp-jss03 (jss)
qm-mini-xp05 (nochrome)
qm-mini-xp01
qm-mini-xp02
qm-mini-xp03
qm-mini-vista05 (nochrome)
qm-mini-vista01
qm-mini-vista02
qm-mini-vista03
qm-pxp-
Comment 10 (assignee) • 16 years ago
ignore that last 'qm-pxp-'
on the mac and linux machines, we need to add this line to /etc/sudoers:
mozqa ALL=NOPASSWD: /sbin/reboot
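A sketch of applying that rule idempotently across the boxes (the staging path is an assumption, and a real rollout should still run visudo -c before installing anything over /etc/sudoers; in practice these edits were made by hand):

```shell
# Append the sudoers rule only if it is not already present, working on a
# staged copy rather than /etc/sudoers directly.
LINE='mozqa ALL=NOPASSWD: /sbin/reboot'
STAGE=/tmp/sudoers.stage
touch "$STAGE"
grep -qxF "$LINE" "$STAGE" || echo "$LINE" >> "$STAGE"
```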
Comment 11 • 16 years ago
You're missing qm-pleopard-trunk06 from your list.
Updated • 16 years ago
Attachment #348604 - Flags: review?(anodelman) → review+
Comment 12 • 16 years ago
Comment on attachment 348603 [details] [diff] [review]
add call to count_and_reboot.py
The code is fine - the downside is that this will affect all production boxes. We need an interim patch that will only end up touching the production Firefox3.0 machines so that we can do a proper test before a full roll out.
Attachment #348603 - Flags: review?(anodelman) → review-
Comment 13 (assignee) • 16 years ago
qm-pubuntu-stage01 and qm-pvista-stage01 both failed to come back after an automated reboot. We need to figure out why before moving forward with this.
Comment 14 • 16 years ago
I think that we can still move forward with a patch limited to affecting Firefox3.0 so that we can get a real feel for how often (or if) machines don't come back from reboot.
This will require a new patch and a plan for applying the necessary sudoers change to the linux/mac machines. I think that we can fold this into the already scheduled downtime for next monday.
Comment 15 (assignee) • 16 years ago
Attachment #348603 - Attachment is obsolete: true
Attachment #350184 - Flags: review?(anodelman)
Comment 16 • 16 years ago
Comment on attachment 350184 [details] [diff] [review]
add call to count_and_reboot.py
This is a good way around this - though, we could also go with passing in an extra variable for turning rebooting on/off in case we want to do this per-factory. But, not necessary for this in between test phase.
Attachment #350184 - Flags: review?(anodelman) → review+
Comment 17 (assignee) • 16 years ago
I've updated all the mac and linux machines except for qm-pleopard-trunk06,08 which are down right now (see bug #466889)
Depends on: 466889
Comment 18 (assignee) • 16 years ago
qm-pleopard-trunk06 and qm-pleopard-trunk08 now have sudoers updated
Updated • 16 years ago
Attachment #348604 - Flags: checked-in+
Updated • 16 years ago
Attachment #350184 - Flags: checked-in+
Comment 19 (assignee) • 16 years ago
This has been put into production for FF3.0 machines
Comment 21 • 16 years ago
Update:
- in production for Firefox3.0
- 3 winxp machines fell over (bug 467608) - no diagnosis yet
- 2 mac machines fell over (bug 467568) - appears to be caused by a fixable configuration issue
This will continue to bake on Firefox3.0 till we've worked out the snags.
Comment 23 • 16 years ago
Has anyone investigated whether the drift is due to the Windows fastload files? It may not be but it might be worth a shot.
Comment 26 • 16 years ago
(In reply to comment #23)
> Has anyone investigated whether the drift is due to the Windows fastload files?
> It may not be but it might be worth a shot.
I would hope that we're starting with a new profile every time.
I'm somewhat disappointed that any bug about machine inconsistency is getting duped to this bug; we really ought to have stable numbers, not slightly increasing and then going down again when we reboot. If rebooting is really necessary to get stable numbers, then it seems like we should be doing it *every* run.
Comment 27 (assignee) • 16 years ago
(In reply to comment #26)
> (In reply to comment #23)
> > Has anyone investigated whether the drift is due to the Windows fastload files?
> > It may not be but it might be worth a shot.
>
> I would hope that we're starting with a new profile every time.
>
>
> I'm somewhat disappointed that any bug about machine inconsistency is getting
> duped to this bug; we really ought to have stable numbers, not slightly
> increasing and then going down again when we reboot. If rebooting is really
> necessary to get stable numbers, then it seems like we should be doing it
> *every* run.
Results being impacted by machine uptime shouldn't be surprising. We're very dependent on all sorts of O/S issues like file system caches and memory fragmentation that we have little control over.
The frequency of reboots will take some time to work out, I imagine. It could be that the first run post-reboot is always faster or slower, and then results flatten out. It could be that running a day's worth of tests before rebooting doesn't show any significant drift. We need more data before saying we should reboot before *every* run.
Comment 29 • 16 years ago
If the first run after a reboot is different, then we need to throw the data from it out. If we push that data to the graph, people will waste time chasing down a regression that doesn't exist. If the first six are different and then it flattens out, then we need to throw the first six out. If we reboot every time, at least we're running the same test each time.
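The discard policy described here is straightforward to express as a filter in front of the graph server; a sketch with hypothetical data shapes and names:

```python
def filter_warmup(results, warmup=1):
    """Drop the first `warmup` results recorded after each reboot.

    results: iterable of (boot_id, value) pairs, where boot_id changes
    every time the machine reboots; returns the remaining values."""
    seen = {}
    kept = []
    for boot_id, value in results:
        seen[boot_id] = seen.get(boot_id, 0) + 1
        if seen[boot_id] > warmup:
            kept.append(value)
    return kept
```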
Comment 30 (assignee) • 16 years ago
Right, but what we don't know is if a machine is sufficiently 'settled' right after a reboot to give meaningful results. We've only been rebooting regularly on FF3.0 for 2 days now, so give it some time!
Comment 31 • 16 years ago
(In reply to comment #27)
> Results being impacted by machine uptime shouldn't be surprising. We're very
> dependent on all sorts of O/S issues like file system caches and memory
> fragmentation that we have little control over.
Where is the data that shows that this randomness is caused by those OS issues? Until we have that data, we should assume it's something in our code and try to fix it. In the specific case of bug 467791, we had a conversation with a developer yesterday where he had an idea on what could be causing that issue.
Comment 32 • 16 years ago
(In reply to comment #26)
> (In reply to comment #23)
> > Has anyone investigated whether the drift is due to the Windows fastload files?
> > It may not be but it might be worth a shot.
>
> I would hope that we're starting with a new profile every time.
These aren't profile files... iirc the files I am referring to are prefetch files and are located at C:\Windows\Prefetch. For Firefox they would be named FIREFOX.EXE-XXXXXX.pf where XXXXXX I believe is a random hex number.
Comment 33 • 16 years ago
The goal of this bug is to generate consistent performance results. It's good testing practice to work from a clean environment - at the high end I would love if we could re-image fresh before each test run.
From my work with talos I've seen the following
- vista numbers for all tests gradually rise along with machine uptime, resolved by rebooting
- leopard boxes show wildly increasing tp times until the box eventually freezes and needs a hard reboot
- leopard/tiger boxes have runaway Terminal apps consuming 100% cpu resulting in series of rapidly failing tests until the box is rebooted
- ubuntu boxes showing fluctuating results, or results consistently too high/too low, resolved by rebooting
Rebooting fixes a lot of problems and gives us far fewer gaps in our knowledge where machines were cycling green but reporting garbage results. We'll play with the system a bit and if it turns out that rebooting after every single test run is beneficial we'll do it.
As to comment #31, if we are interested in learning how firefox behaves after a long period of uptime then we should specifically design a test and a test harness to do that. We shouldn't be relying on a side effect of the original talos design to generate useful data for us.
Comment 35 • 16 years ago
- qm-pmac-fast01 went into a fail state post reboot again, seems to be having trouble restarting apache post-reboot
Comment 36 • 16 years ago
Looking at the graphs from here:
https://wiki.mozilla.org/Buildbot/Talos/Machines#1.8_.26_1.9_.28Firefox3.0_.26_Mozilla1.8.29
post auto-rebooting:
- vista numbers from different machines are more consistent with each other
- ubuntu numbers that were on a gradual upward swing were corrected and are now consistent
- winxp results dropped but look like they may be normalizing, will have to keep watching this
- leopard results haven't done a major swing (ie, their periodic major increases in Tp numbers followed by system freeze/crash hasn't happened), but further watching here is warranted
From my point of view we need to figure out how to ensure that mac boxes come back up and successfully restart apache before starting testing. Other than that, I'd like to track the numbers for another week to ensure that they remain consistent. If that all looks good then we should push this out to 1.9.1 and 1.9.2.
Comment 37 • 16 years ago
(In reply to bug #467791 comment #6)
> Please read https://bugzilla.mozilla.org/show_bug.cgi?id=463020#c33
>
> Relying on a system state that is dirty as a side effect of other tests isn't a
> good way to test how firefox behaves over the long term.
>
> Please direct further discussion to 463020.
>
> *** This bug has been marked as a duplicate of bug 463020 ***
In relation to the prefetch files: they optimize loading of an application that is expected not to change as often as the application under test on talos does. The exact effect this has in relation to prefetch (iirc this is called superfetch on Vista, due to the changes made to prefetch there) is unknown and *might* be adverse. It is also important to note that the way firefox behaves over the long term is not entirely representative on talos, since a user wouldn't be running a new build - at least potentially - as often as talos does.
Comment 38 • 16 years ago
Still having issues with apache not starting on talos tiger boxes post-reboot. Installed this crontab on the affected machines:
@reboot ps -axc | grep httpd || SystemStarter start "Web Server"
*/5 * * * * ps -axc | grep httpd || SystemStarter start "Web Server"
Will give it another day to see if we stop having frozen boxes waiting on web page loads.
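The crontab entries above use a guard-then-recover idiom: look for the process, and run the recovery command only when it is absent. The same shape, sketched with pgrep in place of the ps|grep pipeline (so the grep can't match itself) and placeholder commands:

```shell
# Run the recovery command only when the named process is not running.
check_and_start() {
    pgrep -x "$1" > /dev/null || eval "$2"
}

# On the tiger boxes the invocation would look like:
#   check_and_start httpd 'SystemStarter start "Web Server"'
```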
Comment 39 • 16 years ago
Another issue to deal with: screen dimensions on the winxp boxes seem to occasionally return to 800x600 (instead of 1280x1024) upon reboot, resulting in lower than expected numbers. We'll need a way to ensure that the screen dimensions are correct before beginning testing.
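A sketch of such a pre-test guard (GetSystemMetrics(0)/GetSystemMetrics(1) are the real Win32 calls for primary screen width/height; the policy wrapper around them is illustrative):

```python
import sys

EXPECTED = (1280, 1024)  # resolution the winxp talos boxes are imaged for

def current_resolution():
    """Return (width, height) of the primary display (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("resolution check only implemented for Windows")
    import ctypes
    user32 = ctypes.windll.user32
    return (user32.GetSystemMetrics(0), user32.GetSystemMetrics(1))

def resolution_ok(actual, expected=EXPECTED):
    """True if the screen matches what the tests assume."""
    return tuple(actual) == tuple(expected)
```

A harness would call resolution_ok(current_resolution()) before starting a run and refuse to report numbers on a mismatch.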
Comment 40 • 16 years ago
Found a small command line tool for windows that allows us to force screen dimensions upon reboot. Have installed on all the firefox3.0 winxp machines.
Comment 41 • 16 years ago
Vista TS, TP3, and TSVG have all significantly increased after the reboot. There was almost a 100% difference in TS between 1.9.2 and 1.9.1 prior to the reboot; now they are approximately the same, which leads me to believe that the reboot caused this increase. For the time being I am keeping the tree closed to investigate further.
Comment 42 • 16 years ago
Also noticed that Vista's TS, TP3, and TSVG on 1.9.2 are now much closer to XP's on 1.9.1 and 1.9.2, which also leads me to believe that this is due to the reboot.
Comment 43 • 16 years ago
Seems to have caused bug 467990.
Comment 44 • 16 years ago
Regarding the maintenance on the talos machines today: it would be much better if talos machine maintenance were done during a tree closure such that the cycle before the maintenance and the cycle after it are testing the same code. This allows performance regressions from the maintenance to be separated from those due to the code (assuming the regressions are large enough to be detected in one cycle, at least, which isn't always the case).
Today there was a closure, but (I was told) the talos runs that would have provided the necessary coverage were stopped in the middle of the runs in some cases. The talos runs should have been allowed to test the post-closure code before they were stopped.
What happened today can cast unnecessary suspicion on the code that landed before the closure, potentially requiring its authors to go through cycles of backout and relanding.
Comment 45 (assignee) • 16 years ago
(In reply to comment #44)
> Regarding the maintenance on the talos machines today: it would be much better
> if talos machine maintenance were done during a tree closure such that the
> cycle before the maintenance and the cycle after it are testing the same code.
> This allows performance regressions from the maintenance to be separated from
> those due to the code (assuming the regressions are large enough to be detected
> in one cycle, at least, which isn't always the case).
>
> Today there was a closure, but (I was told) the talos runs that would have
> provided the necessary coverage were stopped in the middle of the runs in some
> cases. The talos runs should have been allowed to test the post-closure code
> before they were stopped.
>
> What happened today can cast unnecessary suspicion on the code that landed
> before the closure, potentially requiring its authors to go through cycles of
> backout and relanding.
This was my fault, my apologies. I wanted to minimize the tree closure period, so interrupted the currently running tests.
Comment 46 • 16 years ago
Apache still not consistently starting on reboot. Going to turn off auto-reboot until this can be fixed.
Comment 47 • 16 years ago
Catlee - can you put a patch together for rebooting after every test run? The winxp talos boxes are showing more peaks/valleys now that auto-rebooting is on - I'd like the line to be as smooth as possible before moving ahead with this for other machines.
Comment 48 (assignee) • 16 years ago
Attachment #352102 - Flags: review?(anodelman)
Updated • 16 years ago
Attachment #352102 - Flags: review?(anodelman) → review+
Comment 49 • 16 years ago
Comment on attachment 352102 [details] [diff] [review]
[Checked in]reboot after every test run
Checking in perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v <-- perfrunner.py
new revision: 1.28; previous revision: 1.27
done
Attachment #352102 - Attachment description: reboot after every test run → [Checked in]reboot after every test run
Attachment #352102 - Flags: checked-in+
Comment 50 (assignee) • 16 years ago
qm-plinux-fast01 fell over around 5am this morning (bug 468827)
Comment 51 • 16 years ago
Now that we are rebooting boxes after every test run, the winxp talos Tp results have gotten as steady as they were pre-autorebooting.
I'm happy with the current set up and think that we have enough confidence to roll this out to other branches.
Comment 52 (assignee) • 16 years ago
qm-mini-vista01, qm-mini-vista02 both failed to come back this morning. Bug 469332
Comment 53 • 16 years ago
qm-plinux-fast01 down again (bug 469404).
Comment 54 (assignee) • 16 years ago
Attachment #353517 - Flags: review?(anodelman)
Comment 55 (assignee) • 16 years ago
Updated sudoers on:
qm-mini-ubuntu01
qm-mini-ubuntu02
qm-mini-ubuntu03
qm-mini-ubuntu05
qm-pleopard-stage01
qm-pleopard-talos01
qm-pleopard-talos02
qm-pleopard-talos04
qm-pleopard-trunk01
qm-pleopard-trunk02
qm-pleopard-trunk03
qm-pleopard-trunk04
qm-pleopard-trunk06
qm-pleopard-trunk07
qm-pleopard-trunk08
qm-plinux-fast01
qm-plinux-fast03
qm-plinux-fast04
qm-plinux-talos01
qm-plinux-talos02
qm-plinux-talos03
qm-plinux-talos04
qm-plinux-trunk01
qm-plinux-trunk03
qm-plinux-trunk04
qm-plinux-trunk05
qm-plinux-trunk06
qm-plinux-trunk07
qm-pmac-fast01
qm-pmac-fast03
qm-pmac-fast04
qm-pmac-talos01
qm-pmac-talos02
qm-pmac-talos03
qm-pmac-talos04
qm-pmac-trunk01
qm-pmac-trunk02
qm-pmac-trunk03
qm-pmac-trunk07
qm-pmac-trunk08
qm-pmac-trunk09
qm-pmac-trunk10
qm-pmac01
qm-pmac02
qm-pmac03
qm-pmac05
qm-ptiger-stage01
qm-ptiger-try01
qm-pubuntu-stage01
qm-pubuntu-try01
qm-pleopard-talos03 is down, so it couldn't be updated.
Updated • 16 years ago
Attachment #353517 - Flags: review?(anodelman) → review+
Comment 56 • 16 years ago
Still missing here:
- updates to sudoers on qm-pleopard-talos03
- addition of force resolution program and update to start batch script on winxp talos boxes
- addition of crontab to restart apache on mac tiger talos boxes
- scheduled downtime to push out the change and watch the results for consistency
Comment 57 (assignee) • 16 years ago
Added crontab to:
qm-pmac-fast01
qm-pmac-fast03
qm-pmac-fast04
qm-pmac-talos01
qm-pmac-talos02
qm-pmac-talos03
qm-pmac-talos04
qm-pmac-trunk01
qm-pmac-trunk02
qm-pmac-trunk03
qm-pmac-trunk07
qm-pmac-trunk08
qm-pmac-trunk09
qm-pmac-trunk10
qm-pmac01
qm-pmac02
qm-pmac03
qm-pmac05
qm-ptiger-stage01
qm-ptiger-try01
Comment 58 (assignee) • 16 years ago
qm-pleopard-talos03 has updated sudoers file now
Comment 59 • 16 years ago
Updated winxp talos machines with resolution setting on startup.
qm-pxp-fast01
qm-pxp-fast03
qm-pxp-fast04
qm-mini-xp01
qm-mini-xp02
qm-mini-xp03
qm-mini-xp05
qm-pxp-talos01
qm-pxp-talos02
qm-pxp-talos03
qm-pxp-talos04
qm-pxp-trunk01
qm-pxp-trunk02
qm-pxp-trunk03
qm-pxp-trunk04
qm-pxp-trunk05
qm-pxp-trunk06
qm-pxp-trunk07
qm-pxp-try01
qm-pxp-stage01
Comment 60 • 16 years ago
also:
qm-pxp-jss01
qm-pxp-jss02
qm-pxp-jss03
This just leaves scheduling a downtime to push this out to production.
Comment 61 • 16 years ago
Comment on attachment 353517 [details] [diff] [review]
[Checked in]Enable rebooting on all branches
Checking in perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v <-- perfrunner.py
new revision: 1.29; previous revision: 1.28
done
Attachment #353517 - Attachment description: Enable rebooting on all branches → [Checked in]Enable rebooting on all branches
Attachment #353517 - Flags: checked-in+
Comment 62 • 16 years ago
Pushed out change to production during downtime this afternoon. Numbers look good, will continue to check them periodically over the next week to ensure that things stay stable.
Otherwise, all done here.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering