Closed Bug 629511 Opened 14 years ago Closed 14 years ago

computers that need physical intervention

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: zandr)

Details

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #620948 +++
The following computers require manual "looking at" to determine why they are offline.
bm-xserve06
linux-ix-slave42
talos-r3-w7-020
talos-r3-w7-036
try-mac-slave42
talos-r3-fed-029
talos-r3-w7-011
moz2-darwin9-slave40 is refusing all SSH and NRPE connections.
(moz2-darwin9-slave40 moved to bug 629763, since it seems quite ill)
(In reply to comment #0)
> talos-r3-w7-020
> talos-r3-w7-036
Can we prevent these two from starting buildbot? I need them to come up online so that I can install DirectX and the Nvidia driver update. The only solution I can think of is to remove them temporarily from buildbot. (We could try to boot them and make the change very fast, but that would have to happen before they pick up a change, and that is not easy AFAIK!)
Attachment #507973 - Flags: review?(dustin)
talos-r3-w7-040 doesn't even seem to be in DNS.
Blocks: 624044
Indeed, there is no such machine in inventory. After a brief look at bug 620948, I'm not sure how that name got on the list.
Comment on attachment 507973 [deployed] disable temporarily slaves 20 & 36
I'm probably not the right guy to review this patch.
Attachment #507973 - Flags: review?(dustin)
(In reply to comment #6)
> talos-r3-w7-040 doesn't even seem to be in DNS.
There's an empty slot for it on the rack, but I have never seen it. Why we skipped 40 remains a mystery.
Comment on attachment 507973 [deployed] disable temporarily slaves 20 & 36
Passing it to coop :)
Attachment #507973 - Flags: review?(coop)
Comment on attachment 507973 [deployed] disable temporarily slaves 20 & 36
>+ range(21,36) + range(36,40) + range(41,54)],
That should be range(37,40) to exclude 36. r+ with that change.
Attachment #507973 - Flags: review?(coop) → review+
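For anyone reading the review comment above later, here is a minimal sketch of why the second range needed to change. It is not the actual buildbot-configs code: the variable names and the `slave_names` helper are hypothetical, the zero-padded hostname format is inferred from the names in this bug, and it is written for Python 3 (hence `list(range(...))`). Only the three ranges quoted in the review are shown.

```python
# Hypothetical illustration of the review comment; not the deployed config.
# range(a, b) is half-open: it includes a and excludes b.

# As submitted: range(21, 36) stops at 35, but range(36, 40) starts at 36,
# so slave 36 is still in the list.
as_submitted = list(range(21, 36)) + list(range(36, 40)) + list(range(41, 54))

# As reviewed: starting the second range at 37 leaves 36 out.
# Slave 40 is skipped in both versions (it does not exist).
as_reviewed = list(range(21, 36)) + list(range(37, 40)) + list(range(41, 54))


def slave_names(numbers, prefix="talos-r3-w7-"):
    """Build zero-padded hostnames (format assumed from names in this bug)."""
    return ["%s%03d" % (prefix, n) for n in numbers]


if __name__ == "__main__":
    print("talos-r3-w7-036" in slave_names(as_submitted))
    print("talos-r3-w7-036" in slave_names(as_reviewed))
```

Slave 20 is already excluded by the first range starting at 21, so only the boundary at 36 needed fixing.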
bm-xserve21 is refusing NRPE and SSH gives: ssh_exchange_identification: Connection closed by remote host
talos-r3-fed64-050
Blocks: 630309
Comment on attachment 507973 [deployed] disable temporarily slaves 20 & 36
We have to reconfigure the testing masters before slaves talos-r3-w7-020 & talos-r3-w7-036 can come back online. The change has landed on the "default" branch: http://hg.mozilla.org/build/buildbot-configs/rev/09e5b421ffe5
I am planning on doing a reconfig in the morning.
Attachment #507973 - Attachment description: disable temporarily slaves 20 & 36 → [checked in] disable temporarily slaves 20 & 36
No longer blocks: 624044
Ravi says he has kicked:
bm-xserve06
try-mac-slave42
bm-xserve21
(In reply to comment #15)
> bm-xserve21
bm-xserve21 is failing nagios checks again. Can we get someone from IT to run some diagnostics on it since it's failing repeatedly?
Comment on attachment 507973 [deployed] disable temporarily slaves 20 & 36
Slaves talos-r3-w7-020 and talos-r3-w7-036 can now be brought back online without any worries.
Attachment #507973 - Attachment description: [checked in] disable temporarily slaves 20 & 36 → [deployed] disable temporarily slaves 20 & 36
try-mac-slave28 is still offline pending IT investigation (bug 586892).
talos-r3-w7-032 is unpingable.
talos-r3-w7-020: grey screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-w7-032: grey screen -> reboot
talos-r3-fed64-050: date problem -> fsck
The whole list for this bug, according to the slave spreadsheet, is now:
bm-xserve06 - pingable, but nothing else
linux-ix-slave15
linux-ix-slave42
moz2-darwin9-slave05 - very slow to restart
moz2-darwin9-slave40 - connections refused, very slow login
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-047
talos-r3-w7-011
try-mac-slave42
w32-ix-slave03
w32-ix-slave08 - VNC, IPMI don't work
talos-r3-fed64-011
moz2-darwin10-slave40
talos-r3-snow-009 didn't come back from a software reboot, so it requires a power cycle.
Blocks: 631587
(In reply to comment #24)
> talos-r3-snow-009 didn't come back from a software reboot, so it requires a
> power cycle.
Never mind, this machine came back on its own eventually.
No longer blocks: 631587
I managed to resurrect linux-ix-slave42 today, so it no longer needs intervention.
talos-r3-fed64-004
talos-r3-fed-037
(In reply to comment #26)
> I managed to resurrect linux-ix-slave42 today, so it no longer needs
> intervention.
Hmm, that was on the list that I asked Spencer to reimage, but he told me this morning that he had trouble reimaging it. If it hasn't been reimaged, I'd expect it to fall over again soon. Spencer, what's the status on that machine?
w32-ix-slave05 - machine is up, stuck in the prelogin OPSI stuff. IPMI is inaccessible.
talos-r3-xp-042 - stuck in a reboot, with no VNC access and the shutdown command says "A Shutdown Is In Progress"
Disregard comment 31 regarding talos-r3-xp-042. I forgot to try RDP, which worked.
bkero just reimaged w32-ix-slave03 so no need to reboot it.
I just swept the spreadsheet to catch any boxes that were restored without being annotated here (only a few). The latest list of reboots is:
linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin9-slave51
mv-moz2-linux-ix-slave07
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-037
talos-r3-fed-047
talos-r3-fed64-004
talos-r3-fed64-011
talos-r3-fed64-013
talos-r3-fed64-036
talos-r3-w7-011
talos-r3-w7-036
w32-ix-slave05
w32-ix-slave08
The IX boxes' IPMI didn't work, at least not by following the OOB IP in inventory. Note that linux-ix-slave15 came back and failed again on the 7th, so it may require an extra bit of TLC.
talos-r3-fed-044
talos-r3-fed-003 talos-r3-fed-030 talos-r3-fed64-016 talos-r3-fed64-023 talos-r3-xp-039
talos-r3-fed-036 mv-moz2-linux-ix-slave21
talos-r3-xp-039: frozen solid at desktop (6:29AM in the taskbar) -> rebooted normally
talos-r3-w7-011: gray screen -> rebooted normally
talos-r3-w7-036: gray screen -> rebooted normally
talos-r3-fed-003: blank screen -> rebooted normally
talos-r3-fed-024: looked OK, no network. -> rebooted normally
talos-r3-fed-029: looked OK, no network. -> rebooted normally
talos-r3-fed-030: blank screen -> date problem
talos-r3-fed-036: grey screen -> reboot
talos-r3-fed-037: grey screen -> reboot
talos-r3-fed-047: blank screen -> date problem
talos-r3-fed64-004: blank screen -> hang -> reimaged
talos-r3-fed64-011: looked OK, no network. -> rebooted normally
talos-r3-fed64-013: blank screen -> date problem
talos-r3-fed64-016: blank screen -> date problem
talos-r3-fed64-023: blank screen -> date problem
talos-r3-fed64-036: looked OK, no network. -> rebooted normally
Depends on: 634368
Looks like linux-ix-slave15 got reimaged about 5 days ago?
talos-r3-fed-036 is down again; it managed about 20 reboots before failing.
talos-r3-xp-041 is hung at OPSI.
talos-r3-snow-032 is refusing ssh & vnc. Needs a reboot, and possibly a reimage.
talos-r3-fed64-014
talos-r3-w7-036
(In reply to comment #41)
> talos-r3-xp-041 is hung at OPSI.
Cancel this one, it rebooted on its own and is working normally now.
talos-r3-xp-004
talos-r3-fed-028 talos-r3-fed64-054
talos-r3-xp-004: gray screen -> reboot
talos-r3-w7-036: gray screen -> reboot
talos-r3-fed-028: date problem
talos-r3-fed-036: up, but no network lease (this is the "looked OK" state in comment 38)
talos-r3-fed64-014: gray screen -> reboot
talos-r3-fed64-054: gray screen -> reboot
talos-r3-snow-032: blue desktop + pinwheel (looks like a hang shutting down) -> rebooted a couple of times normally
linux-ix-slave15: reimaged, hostname fixed, in puppetd loop.
moz2-darwin9-slave51: rebooted in bug 634368
moz2-darwin10-slave40: rebooted in bug 634368
Thus endeth this bug. Nothing to carry forward, so please start a new bug for the next interventions.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Alias: reboots
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations