Closed
Bug 798380
Opened 12 years ago
Closed 12 years ago
Run hardware diagnostics on talos-r4-lion-063.build.scl1
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rail, Unassigned)
References
Details
Description: ... particularly RAM tests
Updated•12 years ago
colo-trip: --- → scl1
Comment 1•12 years ago
Since lion-63 shares the same chassis as lion-64, please take both machines offline so that I can run the diagnostics.
Thanks,
Vinh
Comment 2•12 years ago
Hi Vinh, I have disabled both.
You can go ahead any time after the next 30 minutes or so (one of them is just finishing a job).
Comment 3•12 years ago
Can we go ahead with this now? I don't miss 063 in the least, but losing a good slave for three weeks just because it sits next to a bad one is a bit painful.
Comment 4•12 years ago
Sorry for the delay, guys. Hardware diagnostics did not find any problems with lion-063. I've brought both 063 and 064 back online.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 5•12 years ago
We seem to be hitting a lot of requests for hardware diagnostics that don't show any problems. We should probably be trying reimages first (releng folks) since they seem to clear up a number of crufty filesystem issues.
Comment 6•12 years ago
I'd be in favor of that as soon as someone develops a system for actually testing reimaged slaves other than "throw them back into production, and hope that after a dozen invalid code bugs are filed and several patches are backed out or abandoned because of inexplicable failures on Try, someone will notice that a broken slave is still broken."
In the case of this one, which had absolutely nothing to do with the filesystem (so a reimage smells an awful lot like cargo-culting), about 500 runs of reftest without an unexplained failure ought to do it.
The other options I've come up with so far are to rename the broken slaves we insist on continuing to use, so that people might have a chance of noticing that their test failed on "talos-r4-snow-014-totally-unreliable," or having a pool of slaves which are allowed to say that a test suite passed, but are not allowed to say that it failed, only setting RETRY to let some other unbroken slave rerun it when they fail. Dunno how many tests we have that would get false positives when they should have failed but are run on broken hardware, though.
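A minimal sketch of the "pass-only" quarantine idea above, assuming a hypothetical result-filtering hook; the QUARANTINED_SLAVES set, filter_result(), and the numeric result codes are illustrative stand-ins, not Buildbot's real API:

# Hypothetical result filter for a pass-only quarantine pool.
# The codes mirror buildbot-style result constants but are local here.
SUCCESS, FAILURE, RETRY = 0, 2, 4

QUARANTINED_SLAVES = {
    "talos-r4-lion-063",
    "talos-r4-snow-014",   # e.g. the "totally-unreliable" ones
}

def filter_result(slavename, result):
    """Downgrade failures from quarantined slaves to RETRY so the job
    is rerun on some other, presumably healthy, slave."""
    if slavename in QUARANTINED_SLAVES and result == FAILURE:
        return RETRY
    return result

# A failing run on a quarantined slave comes back as RETRY, not FAILURE;
# a green run is still allowed to count.
assert filter_result("talos-r4-lion-063", FAILURE) == RETRY
assert filter_result("talos-r4-lion-064", SUCCESS) == SUCCESS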
Comment 7•12 years ago
philor: Often machines come back having passed the Apple hardware checks, and operations has no way of validating beyond that point. The hardware looks fine, and the OS passes checks.
Releng: should questionable machines come back in a separate pool to be verified?
Comment 8•12 years ago
IMHO we should design a system that shepherds slaves from preproduction to production automatically, only once they pass a whole set of jobs and/or burn-in tests.
Meanwhile, I will figure out what manual process we can put in place.
We can make use of the preproduction masters and make sure each slave takes a bunch of jobs there first.
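A rough sketch of that burn-in gate, assuming a placeholder run_job callable standing in for whatever the preproduction masters would actually execute; the 500-job default just echoes the figure from comment 6:

def burn_in(run_job, jobs=500):
    """Return True only if `jobs` consecutive preproduction runs all pass.

    `run_job` is a placeholder callable returning True on success; the
    default of 500 echoes comment 6's "about 500 runs of reftest".
    """
    for _ in range(jobs):
        if not run_job():        # one preproduction job
            return False         # any failure: slave stays in preproduction
    return True                  # all green: safe to promote

if __name__ == "__main__":
    import random
    simulated_job = lambda: random.random() > 0.001   # stand-in for a real job
    if burn_in(simulated_job, jobs=100):
        print("promote to production")
    else:
        print("keep in preproduction")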
Comment 9•12 years ago
We want to look into making bug 712206 a way to prevent bad slaves from getting into the pool.
I want to add prospective slaves to the "See Also" field of bug 712206.
We should collect problems, determine how to diagnose them, and add a method to prevent a slave from starting if there are any issues.
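A minimal sketch of such a pre-start gate, with purely illustrative checks and a placeholder master hostname; the real check list would come out of the problems collected under bug 712206:

#!/usr/bin/env python
"""Pre-start gate sketch: refuse to start the slave if basic checks fail.
The checks, thresholds, and hostname below are illustrative placeholders."""
import shutil
import socket
import sys

def enough_disk(path="/", min_free_gb=20):
    # illustrative: low or crufty disks are one thing a reimage tends to fix
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

def can_resolve(host="buildbot-master.example.com"):   # placeholder hostname
    try:
        socket.gethostbyname(host)
        return True
    except socket.error:
        return False

CHECKS = [("free disk space", enough_disk), ("master DNS lookup", can_resolve)]

def main():
    failed = [name for name, check in CHECKS if not check()]
    if failed:
        print("refusing to start slave; failed checks: " + ", ".join(failed))
        sys.exit(1)
    print("all checks passed; starting slave")

if __name__ == "__main__":
    main()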
Updated•11 years ago
Assignee: server-ops → server-ops-dcops
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations