Closed
Bug 629511
Opened 14 years ago
Closed 14 years ago
computers that need physical intervention
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: zandr)
Attachments
(1 file)
patch (deleted) | coop: review+
+++ This bug was initially created as a clone of Bug #620948 +++
The following computers require manual "looking at" to determine why they are offline.
bm-xserve06
linux-ix-slave42
talos-r3-w7-020
talos-r3-w7-036
try-mac-slave42
Comment 1•14 years ago
talos-r3-fed-029
Comment 2•14 years ago
talos-r3-w7-011
Comment 3•14 years ago
moz2-darwin9-slave40 is refusing all SSH and NRPE connections.
Comment 4•14 years ago
(moz2-darwin9-slave40 moved to bug 629763, since it seems quite ill)
Comment 5•14 years ago
(In reply to comment #0)
> talos-r3-w7-020
> talos-r3-w7-036
Can we prevent these two from starting buildbot?
I need them to come online so that I can install DirectX and the Nvidia driver update.
I can't think of any solution other than temporarily removing them from buildbot. (We could try to boot them and make the change very quickly, but that would mean finishing before they pick up a change, and that is not easy AFAIK!)
Attachment #507973 - Flags: review?(dustin)
Comment 6•14 years ago (Reporter)
talos-r3-w7-040 doesn't even seem to be in DNS.
Comment 7•14 years ago
Indeed, there is no such machine in inventory. After a brief look at bug 620948, I'm not sure how that name got on the list.
Comment 8•14 years ago
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36
I'm probably not the right guy to review this patch.
Attachment #507973 - Flags: review?(dustin)
Comment 9•14 years ago (Assignee)
(In reply to comment #6)
> talos-r3-w7-040 doesn't even seem to be in DNS.
There's an empty slot for it on the rack, but I have never seen it. Why we skipped 40 remains a mystery.
Comment 10•14 years ago
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36
Passing it to coop :)
Attachment #507973 - Flags: review?(coop)
Comment 11•14 years ago (Reporter)
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36
>+ range(21,36) + range(36,40) + range(41,54)],
That should be range(37,40) to exclude 36. r+ with that change.
Attachment #507973 - Flags: review?(coop) → review+
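A minimal sketch of why the quoted line still includes slave 36. It assumes, hypothetically, that the patch builds slave names from integer ranges; the name template and variable names below are illustrative and not taken from the patch. Python's range() excludes its end value, so range(21,36) already stops at 35, while range(36,40) starts at 36 again. The list() calls are only there so the sketch also runs on Python 3, where range objects cannot be concatenated with +.

```python
# Illustrative only: the name template and variable names are assumptions,
# not copied from the buildbot-configs patch.
def slave_names(numbers):
    return ["talos-r3-w7-%03d" % n for n in numbers]

# As written in the patch under review: range(36, 40) re-includes 36.
as_submitted = list(range(21, 36)) + list(range(36, 40)) + list(range(41, 54))

# With the requested fix: start the middle range at 37.
as_reviewed = list(range(21, 36)) + list(range(37, 40)) + list(range(41, 54))

print("talos-r3-w7-036" in slave_names(as_submitted))  # True
print("talos-r3-w7-036" in slave_names(as_reviewed))   # False
```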
Comment 12•14 years ago
bm-xserve21 is refusing NRPE and SSH gives:
ssh_exchange_identification: Connection closed by remote host
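A rough sketch of how the two failure modes seen in this bug could be told apart from a script: a host that refuses the TCP connection outright, versus one that accepts it but closes before sending an SSH identification string (which is what the ssh_exchange_identification message above suggests). The function name and return labels are made up for illustration; this is not a tool used in this bug.

```python
import socket

def probe_ssh_banner(host, port=22, timeout=5.0):
    """Return 'unreachable', 'no-banner', or the server's banner line.

    Hypothetical helper: the name and the return labels are illustrative.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # DNS failure, connection refused, or connect timeout.
        return "unreachable"
    with sock:
        try:
            data = sock.recv(256)
        except OSError:
            # Read timed out or the connection was reset.
            return "no-banner"
    if not data:
        # The peer accepted the connection, then closed it without a banner.
        return "no-banner"
    return data.decode("ascii", errors="replace").strip()

# Example call (not run here): probe_ssh_banner("bm-xserve21")
```

The same connect-only check could be pointed at the NRPE port (5666 by default), although NRPE does not send a banner, so only the 'unreachable' result is meaningful there.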
Comment 13•14 years ago
talos-r3-fed64-050
Comment 14•14 years ago
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36
We have to reconfigure the testing masters before slaves talos-r3-w7-020 & talos-r3-w7-036 can come back online.
The change has landed on the "default" branch:
http://hg.mozilla.org/build/buildbot-configs/rev/09e5b421ffe5
I am planning on doing a reconfig in the morning.
Attachment #507973 - Attachment description: disable temporarily slaves 20 & 36 → [checked in] disable temporarily slaves 20 & 36
Comment 15•14 years ago (Assignee)
Ravi says he has kicked:
bm-xserve06
try-mac-slave42
bm-xserve21
Comment 16•14 years ago (Reporter)
(In reply to comment #15)
> bm-xserve21
bm-xserve21 is failing nagios checks again. Can we get someone from IT to run some diagnostics on it since it's failing repeatedly?
Comment 17•14 years ago
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36
Slaves talos-r3-w7-020 and talos-r3-w7-036 can now be brought back online without any worries.
Attachment #507973 - Attachment description: [checked in] disable temporarily slaves 20 & 36 → [deployed] disable temporarily slaves 20 & 36
Comment 18•14 years ago (Reporter)
try-mac-slave28 is still offline pending IT investigation (bug 586892).
Comment 19•14 years ago (Reporter)
talos-r3-w7-032 is unpingable.
Comment 20•14 years ago (Assignee)
talos-r3-w7-020: grey screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-w7-032: grey screen -> reboot
talos-r3-fed64-050: date problem -> fsck
Comment 21•14 years ago
The whole list for this bug, according to the slave spreadsheet, is now:
bm-xserve06 - pingable, but nothing else
linux-ix-slave15
linux-ix-slave42
moz2-darwin9-slave05 - very slow to restart
moz2-darwin9-slave40 - connections refused, very slow login
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-047
talos-r3-w7-011
try-mac-slave42
w32-ix-slave03
w32-ix-slave08 - VNC, IPMI don't work
Comment 22•14 years ago
talos-r3-fed64-011
Comment 23•14 years ago
moz2-darwin10-slave40
Comment 24•14 years ago (Reporter)
talos-r3-snow-009 didn't come back from a software reboot, so it requires a power cycle.
Blocks: 631587
Comment 25•14 years ago (Reporter)
(In reply to comment #24)
> talos-r3-snow-009 didn't come back from a software reboot, so it requires a
> power cycle.
Never mind, this machine eventually came back on its own.
No longer blocks: 631587
Comment 26•14 years ago (Reporter)
I managed to resurrect linux-ix-slave42 today, so it no longer needs intervention.
Comment 27•14 years ago
talos-r3-fed64-004
Comment 28•14 years ago
talos-r3-fed-037
Comment 29•14 years ago (Assignee)
(In reply to comment #26)
> I managed to resurrect linux-ix-slave42 today, so it no longer needs
> intervention.
Hmm, that was on the list that I asked Spencer to reimage, but he told me this morning that he had trouble reimaging it.
If it hasn't been reimaged, I'd expect it to fall over again soon.
Spencer, what's the status of that machine?
Comment 30•14 years ago
w32-ix-slave05 - machine is up, but stuck in the pre-login OPSI step. IPMI is inaccessible.
Comment 31•14 years ago
talos-r3-xp-042 - stuck in a reboot with no VNC access; the shutdown command says "A Shutdown Is In Progress".
Comment 32•14 years ago
Disregard comment 31 regarding talos-r3-xp-042. I forgot to try RDP, which worked.
Comment 33•14 years ago
bkero just reimaged w32-ix-slave03, so there is no need to reboot it.
Comment 34•14 years ago
I just swept the spreadsheet to catch any boxes that were restored without being annotated here (only a few). The latest list of reboots is:
linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin9-slave51
mv-moz2-linux-ix-slave07
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-037
talos-r3-fed-047
talos-r3-fed64-004
talos-r3-fed64-011
talos-r3-fed64-013
talos-r3-fed64-036
talos-r3-w7-011
talos-r3-w7-036
w32-ix-slave05
w32-ix-slave08
The IX boxes' IPMI didn't work, at least not via the OOB IP listed in inventory. Note that linux-ix-slave15 came back and failed again on the 7th, so it may require an extra bit of TLC.
Comment 35•14 years ago
talos-r3-fed-044
Comment 36•14 years ago
talos-r3-fed-003
talos-r3-fed-030
talos-r3-fed64-016
talos-r3-fed64-023
talos-r3-xp-039
Comment 37•14 years ago
talos-r3-fed-036
mv-moz2-linux-ix-slave21
Comment 38•14 years ago (Assignee)
talos-r3-xp-039: frozen solid at desktop (6:29AM in the taskbar) -> rebooted normally
talos-r3-w7-011: gray screen -> rebooted normally
talos-r3-w7-036: gray screen -> rebooted normally
talos-r3-fed-003: blank screen -> rebooted normally
talos-r3-fed-024: looked OK, no network. -> rebooted normally
talos-r3-fed-029: looked OK, no network. -> rebooted normally
talos-r3-fed-030: blank screen -> date problem
talos-r3-fed-036: grey screen -> reboot
talos-r3-fed-037: grey screen -> reboot
talos-r3-fed-047: blank screen -> date problem
talos-r3-fed64-004: blank screen -> hang -> reimaged
talos-r3-fed64-011: looked OK, no network. -> rebooted normally
talos-r3-fed64-013: blank screen -> date problem
talos-r3-fed64-016: blank screen -> date problem
talos-r3-fed64-023: blank screen -> date problem
talos-r3-fed64-036: looked OK, no network. -> rebooted normally
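For anyone tallying this list, here is a small sketch that groups the hosts above by their final outcome (the text after the last "->"). The status lines are copied from this comment; the function and variable names are illustrative only.

```python
from collections import defaultdict

# Status lines copied verbatim from the comment above.
REPORT = """\
talos-r3-xp-039: frozen solid at desktop (6:29AM in the taskbar) -> rebooted normally
talos-r3-w7-011: gray screen -> rebooted normally
talos-r3-w7-036: gray screen -> rebooted normally
talos-r3-fed-003: blank screen -> rebooted normally
talos-r3-fed-024: looked OK, no network. -> rebooted normally
talos-r3-fed-029: looked OK, no network. -> rebooted normally
talos-r3-fed-030: blank screen -> date problem
talos-r3-fed-036: grey screen -> reboot
talos-r3-fed-037: grey screen -> reboot
talos-r3-fed-047: blank screen -> date problem
talos-r3-fed64-004: blank screen -> hang -> reimaged
talos-r3-fed64-011: looked OK, no network. -> rebooted normally
talos-r3-fed64-013: blank screen -> date problem
talos-r3-fed64-016: blank screen -> date problem
talos-r3-fed64-023: blank screen -> date problem
talos-r3-fed64-036: looked OK, no network. -> rebooted normally
"""

def group_by_outcome(report):
    """Map each final outcome (text after the last '->') to its hosts."""
    groups = defaultdict(list)
    for line in report.splitlines():
        host, _, detail = line.partition(": ")
        outcome = detail.rsplit("->", 1)[-1].strip()
        groups[outcome].append(host)
    return dict(groups)

for outcome, hosts in sorted(group_by_outcome(REPORT).items()):
    print("%-18s %2d  %s" % (outcome, len(hosts), ", ".join(hosts)))
```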
Comment 39•14 years ago
Looks like linux-ix-slave15 got reimaged about 5 days ago?
Comment 40•14 years ago
talos-r3-fed-036 is down again; it managed about 20 reboots before failing.
Comment 41•14 years ago
talos-r3-xp-041 is hung at OPSI.
Comment 42•14 years ago
talos-r3-snow-032 is refusing SSH & VNC. Needs a reboot, and possibly a reimage.
Comment 43•14 years ago
talos-r3-fed64-014
Comment 44•14 years ago
talos-r3-w7-036
Comment 45•14 years ago
(In reply to comment #41)
> talos-r3-xp-041 is hung at OPSI.
Cancel this one; it rebooted on its own and is working normally now.
Comment 46•14 years ago
talos-r3-xp-004
Comment 47•14 years ago
talos-r3-fed-028
talos-r3-fed64-054
Comment 48•14 years ago (Assignee)
talos-r3-xp-004: gray screen -> reboot
talos-r3-w7-036: gray screen -> reboot
talos-r3-fed-028: date problem
talos-r3-fed-036: up, but no network lease (this is the "looked OK" state in comment 38).
talos-r3-fed64-014: gray screen -> reboot
talos-r3-fed64-054: gray screen -> reboot
talos-r3-snow-032: blue desktop + pinwheel (looks like a hang while shutting down) -> rebooted a couple of times normally
Comment 49•14 years ago (Assignee)
linux-ix-slave15: reimaged, hostname fixed, in puppetd loop.
moz2-darwin9-slave51: rebooted in bug 634368
moz2-darwin10-slave40: rebooted in bug 634368
Thus endeth this bug. Nothing to carry forward, so please start a new bug for the next interventions.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•14 years ago
Alias: reboots
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations