Closed
Bug 463020
Opened 16 years ago
Closed 16 years ago
Talos machines should be automatically rebooted periodically
Categories: Release Engineering :: General (defect, P2)
Tracking: not tracked
Status: RESOLVED FIXED
People
(Reporter: catlee, Assigned: catlee)
Attachments
(4 files, 3 obsolete files)
(deleted), text/plain | anodelman: review+, coop: checked-in+ | Details
(deleted), patch | anodelman: review+, coop: checked-in+ | Details | Diff | Splinter Review
(deleted), patch | anodelman: review+, anodelman: checked-in+ | Details | Diff | Splinter Review
(deleted), patch | anodelman: review+, anodelman: checked-in+ | Details | Diff | Splinter Review
After Talos machines have been up for a while, the performance results start to drift. Rebooting the machines seems to fix this problem.
We should be rebooting the Talos machines regularly so that this drift is not a problem. Two approaches are:
- Reboot after every n Talos runs
- Reboot if uptime > m
Other issues to solve:
- How to perform the reboot? The cltbld account normally doesn't have permission to do this. On Linux and Mac we could give sudo access.
- How to prevent buildbot from freaking out? If the reboot is performed inside a buildbot job, then that job will fail, and then the build slave will lose connectivity.
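The first approach boils down to a persistent counter checked after each run. A minimal sketch (the file path, threshold, and helper name are illustrative assumptions, not the eventual implementation):

```python
import os

def should_reboot(counter_file, max_runs=25):
    """Bump a persistent run counter; return True once max_runs runs have
    completed, resetting the counter so the next cycle starts fresh."""
    count = 0
    if os.path.exists(counter_file):
        with open(counter_file) as f:
            count = int(f.read().strip() or "0")
    count += 1
    reboot = count >= max_runs
    if reboot:
        count = 0
    with open(counter_file, "w") as f:
        f.write(str(count))
    return reboot
```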
Comment 1 • 16 years ago
(I've been thinking about this a lot lately, so bear with me)
I think the "right" way to do this is from inside of Buildbot. It's the only way we can be *certain* we don't interrupt a running job.
Rebooting based on uptime is going to be tough. AFAIK we don't have any unix facilities (cygwin, msys, et al.) on the Windows Talos machines - so we can't use a simple bash one-liner to do it.
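On the unix boxes, though, an uptime check is nearly a one-liner (Linux-specific sketch; /proc/uptime's first field is seconds since boot, and the threshold is arbitrary):

```python
def uptime_exceeds(max_seconds, proc_uptime="/proc/uptime"):
    """Return True if the machine has been up longer than max_seconds."""
    with open(proc_uptime) as f:
        return float(f.read().split()[0]) > max_seconds
```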
Rebooting after every N runs is pretty easy. It'll require a custom BuildStep, but a trivial one:
class Reboot(ShellCommand):
    def start(self):
        buildNum = self.step_status.getBuild().getNumber()
        # reboot every 25 builds
        if (buildNum % 25) != 0:
            return SKIPPED
        ShellCommand.start(self)
...and the TalosFactory would call it thusly:
self.addStep(Reboot, command=['shutdown', '-r', '-t', '0'], flunkOnFailure=False)
(with the proper variant per platform.)
I'm probably missing a detail or two in the step, but I've used the same logic for conditional clobbers on a non-Mozilla Buildbot. The flunkOnFailure is important here -- it will make sure that the build won't turn red when the slave *does* reboot.
One thing I'm not sure of here is whether or not the master will try to give the ghosted slave another job before it rejoins.
Just my thoughts, take them fwiw!
Comment 2 (assignee) • 16 years ago
Comment 3 (assignee) • 16 years ago
Updated (assignee) • 16 years ago
Attachment #346470 - Attachment mime type: application/octet-stream → text/plain
Updated (assignee) • 16 years ago
Priority: -- → P2
Comment 5 • 16 years ago
Looking at http://qm-buildbot01.mozilla.org:2008, it seems that the slaves
qm-pxp-stage01
qm-pvista-stage01
qm-pubuntu-stage01
qm-ptiger-stage01
qm-pleopard-stage01
are for staging only, and could be used to see if this auto-restarting patch works.
Comment 6 (assignee) • 16 years ago
http://graphs-stage.mozilla.org/#show=395345,395333,395459,395497,395511
Machines are rebooting after every 3 Talos runs since Nov 12, ~12pm.
Comment 7 (assignee) • 16 years ago
Attachment #346468 - Attachment is obsolete: true
Attachment #348603 - Flags: review?(anodelman)
Comment 8 (assignee) • 16 years ago
Attachment #346470 - Attachment is obsolete: true
Attachment #348604 - Flags: review?(anodelman)
Comment 9 (assignee) • 16 years ago
Once these are approved, let's make these live on Firefox 3.0 production.
These machines are:
qm-plinux-fast01 (fast)
qm-mini-ubuntu03 (nochrome)
qm-mini-ubuntu01
qm-mini-ubuntu02
qm-mini-ubuntu05
qm-pmac-fast01 (fast)
qm-pmac05 (nochrome)
qm-pmac01
qm-pmac02
qm-pmac03
qm-pleopard-trunk07
qm-pleopard-trunk08
qm-pxp-fast01 (fast)
qm-pxp-jss01 (jss)
qm-pxp-jss02 (jss)
qm-pxp-jss03 (jss)
qm-mini-xp05 (nochrome)
qm-mini-xp01
qm-mini-xp02
qm-mini-xp03
qm-mini-vista05 (nochrome)
qm-mini-vista01
qm-mini-vista02
qm-mini-vista03
qm-pxp-
Comment 10 (assignee) • 16 years ago
ignore that last 'qm-pxp-'
on the mac and linux machines, we need to add this line to /etc/sudoers:
mozqa ALL=NOPASSWD: /sbin/reboot
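A sketch of applying that rule idempotently across the boxes (the staging path is an assumption, and a real rollout should still run visudo -c before installing anything over /etc/sudoers; in practice these edits were made by hand):

```shell
# Append the sudoers rule only if it is not already present, working on a
# staged copy rather than /etc/sudoers directly.
LINE='mozqa ALL=NOPASSWD: /sbin/reboot'
STAGE=/tmp/sudoers.stage
touch "$STAGE"
grep -qxF "$LINE" "$STAGE" || echo "$LINE" >> "$STAGE"
```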
Comment 11 • 16 years ago
You're missing qm-pleopard-trunk06 from your list.
Updated • 16 years ago
Attachment #348604 - Flags: review?(anodelman) → review+
Comment 12 • 16 years ago
Comment on attachment 348603 [details] [diff] [review]
add call to count_and_reboot.py
The code is fine - the downside is that this will affect all production boxes. We need an interim patch that will only end up touching the production Firefox3.0 machines so that we can do a proper test before a full roll out.
Attachment #348603 - Flags: review?(anodelman) → review-
Comment 13 (assignee) • 16 years ago
qm-pubuntu-stage01 and qm-pvista-stage01 both failed to come back after an automated reboot. We need to figure out why before moving forward with this.
Comment 14 • 16 years ago
I think that we can still move forward with a patch limited to affecting Firefox3.0 so that we can get a real feel for how often (or if) machines don't come back from reboot.
This will require a new patch and a plan for applying the necessary sudoers change to the linux/mac machines. I think that we can fold this into the already scheduled downtime for next monday.
Comment 15 (assignee) • 16 years ago
Attachment #348603 - Attachment is obsolete: true
Attachment #350184 - Flags: review?(anodelman)
Comment 16 • 16 years ago
Comment on attachment 350184 [details] [diff] [review]
add call to count_and_reboot.py
This is a good way around this - though, we could also go with passing in an extra variable for turning rebooting on/off in case we want to do this per-factory. But, not necessary for this in between test phase.
Attachment #350184 - Flags: review?(anodelman) → review+
Comment 17 (assignee) • 16 years ago
I've updated all the mac and linux machines except for qm-pleopard-trunk06,08 which are down right now (see bug #466889)
Depends on: 466889
Comment 18 (assignee) • 16 years ago
qm-pleopard-trunk06 and qm-pleopard-trunk08 now have sudoers updated
Updated • 16 years ago
Attachment #348604 - Flags: checked-in+
Updated • 16 years ago
Attachment #350184 - Flags: checked-in+
Comment 19 (assignee) • 16 years ago
This has been put into production for FF3.0 machines
Comment 21 • 16 years ago
Update:
- in production for Firefox3.0
- 3 winxp machines fell over (bug 467608) - no diagnosis yet
- 2 mac machines fell over (bug 467568) - appears to be caused by a fixable configuration issue
This will continue to bake on Firefox3.0 till we've worked out the snags.
Comment 23 • 16 years ago
Has anyone investigated whether the drift is due to the Windows fastload files? It may not be but it might be worth a shot.
Comment 26 • 16 years ago
(In reply to comment #23)
> Has anyone investigated whether the drift is due to the Windows fastload files?
> It may not be but it might be worth a shot.
I would hope that we're starting with a new profile every time.
I'm somewhat disappointed that any bug about machine inconsistency is getting duped to this bug; we really ought to have stable numbers, not slightly increasing and then going down again when we reboot. If rebooting is really necessary to get stable numbers, then it seems like we should be doing it *every* run.
Comment 27 (assignee) • 16 years ago
(In reply to comment #26)
> (In reply to comment #23)
> > Has anyone investigated whether the drift is due to the Windows fastload files?
> > It may not be but it might be worth a shot.
>
> I would hope that we're starting with a new profile every time.
>
>
> I'm somewhat disappointed that any bug about machine inconsistency is getting
> duped to this bug; we really ought to have stable numbers, not slightly
> increasing and then going down again when we reboot. If rebooting is really
> necessary to get stable numbers, then it seems like we should be doing it
> *every* run.
Results being impacted by machine uptime shouldn't be surprising. We're very dependent on all sorts of O/S issues like file system caches and memory fragmentation that we have little control over.
The frequency of reboots will take some time to work out, I imagine. It could be that the first run post-reboot is always faster or slower, and then results flatten out. It could be that running a day's worth of tests before rebooting doesn't show any significant drift. We need more data before saying we should reboot before *every* run.
Comment 29 • 16 years ago
If the first run after a reboot is different, then we need to throw the data from it out. If we push that data to the graph, people will waste time chasing down a regression that doesn't exist. If the first six are different and then it flattens out, then we need to throw the first six out. If we reboot every time, at least we're running the same test each time.
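The discard policy described here is straightforward to express as a filter in front of the graph server; a sketch with hypothetical data shapes and names:

```python
def filter_warmup(results, warmup=1):
    """Drop the first `warmup` results recorded after each reboot.

    results: iterable of (boot_id, value) pairs, where boot_id changes
    every time the machine reboots; returns the remaining values."""
    seen = {}
    kept = []
    for boot_id, value in results:
        seen[boot_id] = seen.get(boot_id, 0) + 1
        if seen[boot_id] > warmup:
            kept.append(value)
    return kept
```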
Comment 30 (assignee) • 16 years ago
Right, but what we don't know is if a machine is sufficiently 'settled' right after a reboot to give meaningful results. We've only been rebooting regularly on FF3.0 for 2 days now, so give it some time!
Comment 31 • 16 years ago
(In reply to comment #27)
> Results being impacted by machine uptime shouldn't be surprising. We're very
> dependent on all sorts of O/S issues like file system caches and memory
> fragmentation that we have little control over.
Where is the data that shows that this randomness is caused by those OS issues? Until we have that data, we should assume it's something in our code and try to fix it. In the specific case of bug 467791, we had a conversation with a developer yesterday where he had an idea on what could be causing that issue.
Comment 32 • 16 years ago
(In reply to comment #26)
> (In reply to comment #23)
> > Has anyone investigated whether the drift is due to the Windows fastload files?
> > It may not be but it might be worth a shot.
>
> I would hope that we're starting with a new profile every time.
These aren't profile files... iirc the files I am referring to are prefetch files and are located at C:\Windows\Prefetch. For Firefox they would be named FIREFOX.EXE-XXXXXX.pf where XXXXXX I believe is a random hex number.
Comment 33 • 16 years ago
The goal of this bug is to generate consistent performance results. It's good testing practice to work from a clean environment - at the high end I would love if we could re-image fresh before each test run.
From my work with talos I've seen the following
- vista numbers for all tests gradually rise along with machine uptime, resolved by rebooting
- leopard boxes show wildly increasing tp times until the box eventually freezes and needs a hard reboot
- leopard/tiger boxes have runaway Terminal apps consuming 100% cpu resulting in series of rapidly failing tests until the box is rebooted
- ubuntu boxes showing fluctuating results, or results consistently too high/too low, resolved by rebooting
Rebooting fixes a lot of problems and gives us far fewer gaps in our knowledge where machines were cycling green but reporting garbage results. We'll play with the system a bit and if it turns out that rebooting after every single test run is beneficial we'll do it.
As to comment #31, if we are interested in learning how firefox behaves after a long period of uptime then we should specifically design a test and a test harness to do that. We shouldn't be relying on a side effect of the original talos design to generate useful data for us.
Comment 35 • 16 years ago
- qm-pmac-fast01 went into a fail state post reboot again, seems to be having trouble restarting apache post-reboot
Comment 36 • 16 years ago
Looking at the graphs from here:
https://wiki.mozilla.org/Buildbot/Talos/Machines#1.8_.26_1.9_.28Firefox3.0_.26_Mozilla1.8.29
post auto-rebooting:
- vista numbers from different machines are more consistent with each other
- ubuntu numbers that were on a gradual upward swing were corrected and are now consistent
- winxp results dropped but look like they may be normalizing, will have to keep watching this
- leopard results haven't done a major swing (ie, their periodic major increases in Tp numbers followed by system freeze/crash hasn't happened), but further watching here is warranted
From my point of view we need to figure out how to ensure that mac boxes come back up and successfully restart apache before starting testing. Other than that, I'd like to track the numbers for another week to ensure that they remain consistent. If that all looks good then we should push this out to 1.9.1 and 1.9.2.
Comment 37 • 16 years ago
(In reply to bug #467791 comment #6)
> Please read https://bugzilla.mozilla.org/show_bug.cgi?id=463020#c33
>
> Relying on a system state that is dirty as a side effect of other tests isn't a
> good way to test how firefox behaves over the long term.
>
> Please direct further discussion to 463020.
>
> *** This bug has been marked as a duplicate of bug 463020 ***
In relation to the prefetch files: they optimize loading of an application that is expected not to change as often as the application under test on talos does. The exact effect this has in relation to prefetch (iirc this is called superfetch on Vista, due to the changes made to prefetch there) is unknown and *might* be adverse. It is also important to note that the way firefox behaves over the long term is not entirely representative on talos, since a user wouldn't be running a new build - at least potentially - as often as talos does.
Comment 38 • 16 years ago
Still having issues with apache not starting on talos tiger boxes post-reboot. Installed this crontab on the affected machines:
@reboot ps -axc | grep httpd || SystemStarter start "Web Server"
*/5 * * * * ps -axc | grep httpd || SystemStarter start "Web Server"
Will give it another day to see if we stop having frozen boxes waiting on web page loads.
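The crontab entries above use a guard-then-recover idiom: look for the process, and run the recovery command only when it is absent. The same shape, sketched with pgrep in place of the ps|grep pipeline (so the grep can't match itself) and placeholder commands:

```shell
# Run the recovery command only when the named process is not running.
check_and_start() {
    pgrep -x "$1" > /dev/null || eval "$2"
}

# On the tiger boxes the invocation would look like:
#   check_and_start httpd 'SystemStarter start "Web Server"'
```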
Comment 39 • 16 years ago
Another issue to deal with: screen dimensions on the winxp boxes seem to occasionally return to 800x600 (instead of 1280x1024) upon reboot, resulting in lower than expected numbers. We'll need a way to ensure that the screen dimensions are correct before beginning testing.
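A sketch of such a pre-test guard (GetSystemMetrics(0)/GetSystemMetrics(1) are the real Win32 calls for primary screen width/height; the policy wrapper around them is illustrative):

```python
import sys

EXPECTED = (1280, 1024)  # resolution the winxp talos boxes are imaged for

def current_resolution():
    """Return (width, height) of the primary display (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("resolution check only implemented for Windows")
    import ctypes
    user32 = ctypes.windll.user32
    return (user32.GetSystemMetrics(0), user32.GetSystemMetrics(1))

def resolution_ok(actual, expected=EXPECTED):
    """True if the screen matches what the tests assume."""
    return tuple(actual) == tuple(expected)
```

A harness would call resolution_ok(current_resolution()) before starting a run and refuse to report numbers on a mismatch.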
Comment 40 • 16 years ago
Found a small command line tool for windows that allows us to force screen dimensions upon reboot. Have installed on all the firefox3.0 winxp machines.
Comment 41 • 16 years ago
Vista TS, TP3, and TSVG have all significantly increased after the reboot. There was almost a 100% difference in TS between 1.9.2 and 1.9.1 prior to the reboot; now they are approximately the same, which leads me to believe that the reboot caused this increase. For the time being I am keeping the tree closed to investigate further.
Comment 42 • 16 years ago
Also noticed that Vista's TS, TP3, and TSVG on 1.9.2 are now much closer to XP's on 1.9.1 and 1.9.2, which also leads me to believe that this is due to the reboot.
Comment 43 • 16 years ago
Seems to have caused bug 467990.
Comment 44 • 16 years ago
Regarding the maintenance on the talos machines today: it would be much better if talos machine maintenance were done during a tree closure such that the cycle before the maintenance and the cycle after it are testing the same code. This allows performance regressions from the maintenance to be separated from those due to the code (assuming the regressions are large enough to be detected in one cycle, at least, which isn't always the case).
Today there was a closure, but (I was told) the talos runs that would have provided the necessary coverage were stopped in the middle of the runs in some cases. The talos runs should have been allowed to test the post-closure code before they were stopped.
What happened today can cast unnecessary suspicion on the code that landed before the closure, potentially requiring its authors to go through cycles of backout and relanding.
Comment 45 (assignee) • 16 years ago
(In reply to comment #44)
> Regarding the maintenance on the talos machines today: it would be much better
> if talos machine maintenance were done during a tree closure such that the
> cycle before the maintenance and the cycle after it are testing the same code.
> This allows performance regressions from the maintenance to be separated from
> those due to the code (assuming the regressions are large enough to be detected
> in one cycle, at least, which isn't always the case).
>
> Today there was a closure, but (I was told) the talos runs that would have
> provided the necessary coverage were stopped in the middle of the runs in some
> cases. The talos runs should have been allowed to test the post-closure code
> before they were stopped.
>
> What happened today can cast unnecessary suspicion on the code that landed
> before the closure, potentially requiring its authors to go through cycles of
> backout and relanding.
This was my fault, my apologies. I wanted to minimize the tree closure period, so interrupted the currently running tests.
Comment 46 • 16 years ago
Apache still not consistently starting on reboot. Going to turn off auto-reboot until this can be fixed.
Comment 47 • 16 years ago
Catlee - can you put a patch together for rebooting after every test run? The winxp talos boxes are showing more peaks/valleys now that auto-rebooting is on - I'd like the line to be as smooth as possible before moving ahead with this for other machines.
Comment 48 (assignee) • 16 years ago
Attachment #352102 - Flags: review?(anodelman)
Updated • 16 years ago
Attachment #352102 - Flags: review?(anodelman) → review+
Comment 49 • 16 years ago
Comment on attachment 352102 [details] [diff] [review]
[Checked in]reboot after every test run
Checking in perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v <-- perfrunner.py
new revision: 1.28; previous revision: 1.27
done
Attachment #352102 - Attachment description: reboot after every test run → [Checked in]reboot after every test run
Attachment #352102 - Flags: checked-in+
Comment 50 (assignee) • 16 years ago
qm-plinux-fast01 fell over around 5am this morning (bug 468827)
Comment 51 • 16 years ago
Now that we are rebooting boxes after every test run, the winxp talos Tp results have gotten as steady as they were pre-autorebooting.
I'm happy with the current set up and think that we have enough confidence to roll this out to other branches.
Comment 52 (assignee) • 16 years ago
qm-mini-vista01, qm-mini-vista02 both failed to come back this morning. Bug 469332
Comment 53 • 16 years ago
qm-plinux-fast01 down again (bug 469404).
Comment 54 (assignee) • 16 years ago
Attachment #353517 - Flags: review?(anodelman)
Comment 55 (assignee) • 16 years ago
Updated sudoers on:
qm-mini-ubuntu01
qm-mini-ubuntu02
qm-mini-ubuntu03
qm-mini-ubuntu05
qm-pleopard-stage01
qm-pleopard-talos01
qm-pleopard-talos02
qm-pleopard-talos04
qm-pleopard-trunk01
qm-pleopard-trunk02
qm-pleopard-trunk03
qm-pleopard-trunk04
qm-pleopard-trunk06
qm-pleopard-trunk07
qm-pleopard-trunk08
qm-plinux-fast01
qm-plinux-fast03
qm-plinux-fast04
qm-plinux-talos01
qm-plinux-talos02
qm-plinux-talos03
qm-plinux-talos04
qm-plinux-trunk01
qm-plinux-trunk03
qm-plinux-trunk04
qm-plinux-trunk05
qm-plinux-trunk06
qm-plinux-trunk07
qm-pmac-fast01
qm-pmac-fast03
qm-pmac-fast04
qm-pmac-talos01
qm-pmac-talos02
qm-pmac-talos03
qm-pmac-talos04
qm-pmac-trunk01
qm-pmac-trunk02
qm-pmac-trunk03
qm-pmac-trunk07
qm-pmac-trunk08
qm-pmac-trunk09
qm-pmac-trunk10
qm-pmac01
qm-pmac02
qm-pmac03
qm-pmac05
qm-ptiger-stage01
qm-ptiger-try01
qm-pubuntu-stage01
qm-pubuntu-try01
qm-pleopard-talos03 is down, so it couldn't be updated.
Updated • 16 years ago
Attachment #353517 - Flags: review?(anodelman) → review+
Comment 56 • 16 years ago
Still missing here:
- updates to sudoers on qm-pleopard-talos03
- addition of force resolution program and update to start batch script on winxp talos boxes
- addition of crontab to restart apache on mac tiger talos boxes
- scheduled downtime to push out the change and watch the results for consistency
Comment 57 (assignee) • 16 years ago
Added crontab to:
qm-pmac-fast01
qm-pmac-fast03
qm-pmac-fast04
qm-pmac-talos01
qm-pmac-talos02
qm-pmac-talos03
qm-pmac-talos04
qm-pmac-trunk01
qm-pmac-trunk02
qm-pmac-trunk03
qm-pmac-trunk07
qm-pmac-trunk08
qm-pmac-trunk09
qm-pmac-trunk10
qm-pmac01
qm-pmac02
qm-pmac03
qm-pmac05
qm-ptiger-stage01
qm-ptiger-try01
Comment 58 (assignee) • 16 years ago
qm-pleopard-talos03 has updated sudoers file now
Comment 59 • 16 years ago
Updated winxp talos machines with resolution setting on startup.
qm-pxp-fast01
qm-pxp-fast03
qm-pxp-fast04
qm-mini-xp01
qm-mini-xp02
qm-mini-xp03
qm-mini-xp05
qm-pxp-talos01
qm-pxp-talos02
qm-pxp-talos03
qm-pxp-talos04
qm-pxp-trunk01
qm-pxp-trunk02
qm-pxp-trunk03
qm-pxp-trunk04
qm-pxp-trunk05
qm-pxp-trunk06
qm-pxp-trunk07
qm-pxp-try01
qm-pxp-stage01
Comment 60 • 16 years ago
also:
qm-pxp-jss01
qm-pxp-jss02
qm-pxp-jss03
This just leaves scheduling a downtime to push this out to production.
Comment 61 • 16 years ago
Comment on attachment 353517 [details] [diff] [review]
[Checked in]Enable rebooting on all branches
Checking in perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v <-- perfrunner.py
new revision: 1.29; previous revision: 1.28
done
Attachment #353517 - Attachment description: Enable rebooting on all branches → [Checked in]Enable rebooting on all branches
Attachment #353517 - Flags: checked-in+
Comment 62 • 16 years ago
Pushed out change to production during downtime this afternoon. Numbers look good, will continue to check them periodically over the next week to ensure that things stay stable.
Otherwise, all done here.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering