Closed
Bug 461685
Opened 16 years ago
Closed 16 years ago
balsa-18branch is down
Categories
(mozilla.org Graveyard :: Server Operations, task, P2)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 467634
People
(Reporter: nthomas, Assigned: phong)
Attachments
(1 file: text/plain, deleted)
Description
Nagios first alerted that all the service checks on this box were timing out. It turns out something is using all the CPU, but I can't find out what, as ssh sessions won't open and even the VI console is useless. Rebooted, then fixed some errors with fsck.
Reporter
Comment 1•16 years ago
Back up, with a clobber of the build dir for good measure.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Priority: -- → P1
Resolution: --- → FIXED
Reporter
Comment 2•16 years ago
Looks like the same problem just started again, investigating.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 3•16 years ago
100% CPU, can't get an ssh session, and the console doesn't let me move the mouse to the running terminal, so I can't get any top output. Now rebooted; I'll fsck each drive and hope there is something in the system logs.
Reporter
Comment 4•16 years ago
It's back building now.
There's nothing in the system logs at all, but the build log got as far as a cvs co of SeaMonkeyAll (and the last mod time of that file matches the point at which it started using 100% CPU, according to the VI client). So possibly a cvs/ssh/network glitch, or an I/O issue, since a cvs update is very I/O intensive.
I've left a 'watch ps' running in the console terminal, so hopefully we will have some more info if this occurs again. I also left the focus on that terminal, so that we might be able to kill any errant process - I couldn't find the magical keystroke to focus it during the two outages, and the mouse was non-responsive.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Comment 5•16 years ago
It's down *again*, I'm trying to check out the console right now...
Assignee: nthomas → bhearsum
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6•16 years ago
Unfortunately, I'm unable to switch to the system console, nor able to focus the terminal window. I've rebooted the machine again, started tinderbox, and started 'watch ps aux' in the terminal inside of X - which we should be able to see next time.
It's up again for now..resolving this bug.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Reporter
Comment 7•16 years ago
And down again. Nothing showing in the X session, so X must be crashing out and getting restarted by the script. Off to disk check land we go, I'll probably reinstall vmware tools too.
Assignee: bhearsum → nthomas
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 8•16 years ago
It was pulling SeaMonkeyAll from CVS when it died (again), which is pretty I/O intensive. No errors were found by fsck on either drive, so I've enabled syslog on runlevel 3 to try to get some info in the system logs.
I also looked for errors on the host machine, which was bm-vmware07 at the time of the failure. There's nothing at 19:30, but there is this a little earlier in /var/log/vmkwarning:
Nov 4 16:47:57 bm-vmware07 vmkernel: 180:05:53:45.427 cpu1:1035)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Nov 4 16:47:57 bm-vmware07 vmkernel: 180:05:53:45.427 cpu1:1035)WARNING: ScsiDevice: 2740: Failed for vml.020001000060a98000433466344a344969744731474c554e202020: SCSI reservation conflict
mrz, is that netapp-c-001 ?
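(For reference, the check described above amounts to filtering the ESX host's warning log for reservation conflicts. A minimal sketch, run here against a saved sample rather than a live host; the log path /var/log/vmkwarning and the message text are taken from the excerpt above.)

```shell
# Sketch: count SCSI reservation-conflict warnings in a vmkwarning-style log.
# Uses a temp copy of the excerpt above; on a real ESX host the file would
# be /var/log/vmkwarning.
log=$(mktemp)
cat > "$log" <<'EOF'
Nov 4 16:47:57 bm-vmware07 vmkernel: WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Nov 4 16:47:58 bm-vmware07 vmkernel: info: unrelated line
EOF
grep -c "reservation conflict" "$log"   # prints 1
rm -f "$log"
```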
Reporter
Updated•16 years ago
Priority: P1 → P2
Reporter
Comment 9•16 years ago
And again about 3 hours ago. Everything in /var/log is populated but looks clean, and the mtime of X log indicates it only started once. Failed out doing cvs again:
cvs -q -z 3 co -P -r MOZILLA_1_8_BRANCH -D 11/06/2008 07:14 +0000 SeaMonkeyAll
I put a cron job in to dump 'ps auxw --sort -pcpu' to /tmp/ps.log every minute (nagios will warn when / is filling up).
Ideas for more exhaustive checks welcome.
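(The per-minute logger described above presumably looked something like the following crontab entry; the exact form is assumed from the comment, with /tmp/ps.log as stated.)

```shell
# Assumed crontab entry: every minute, append the process list sorted by
# CPU usage to /tmp/ps.log.
#   * * * * * ps auxw --sort -pcpu >> /tmp/ps.log
# The same ps invocation can be run once by hand; the header line comes
# first, followed by the top CPU consumers:
ps auxw --sort -pcpu | head -n 3
```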
Reporter
Comment 10•16 years ago
And again, this time doing "c++ ... mozilla/dom/src/base/nsGlobalWindow.cpp", which is what the ps log shows too. Tinderbox restarted.
Reporter
Comment 11•16 years ago
Went again at around 09:50:00, checking out SeaMonkeyAll. And again when I was just poking around on the box.
Reporter
Comment 12•16 years ago
Reporter
Comment 13•16 years ago
I've set VMware to expect Red Hat Enterprise Linux 2 rather than 4, and done a clean reinstall of the VMware tools. Not really expecting that to fix it, but here's hoping.
Reporter
Comment 14•16 years ago
Fixed for now.
Assignee: nthomas → nobody
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Reporter
Comment 15•16 years ago
Continues to be problematic. As a last-ditch attempt, I'm moving the VM from the netapp-c-001 storage (which also holds the templates for new VMs) to netapp-d-sata-003.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 16•16 years ago
Haven't really been paying attention to this, but I'd actually recommend moving to equallogic (yes, despite yesterday's problems) rather than the netapp. Let me know if you need help doing that.
Reporter
Comment 17•16 years ago
It's running on netapp-d-sata-003 for now, but feel free to move it to a suitable equallogic partition. I'm not sure how much free space you're keeping on each, and I can't see reliable free-space numbers using the VI client (perhaps because only build VMs are visible to me?). This VM only needs 14GB.
Reporter
Comment 18•16 years ago
netapp-d-sata-003 didn't help, trying eql01-bm06.
Reporter
Comment 19•16 years ago
... and it went to 100% CPU shortly after.
More worryingly we now have a second machine hanging "randomly": prometheus-vm went boom at around 1800 PST. It uses eql-bm02, and was hosted on bm-vmware05 at the time; it's running "Red Hat Enterprise Linux AS release 3 (Taroon Update 8)" and a 2.4.21-27.0.4.EL kernel.
mrz, could you please take a look at the ESX and storage array logs for any problems?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Updated•16 years ago
Assignee: server-ops → mrz
Reporter
Comment 21•16 years ago
FYI, balsa-18branch is currently off, prometheus-vm hasn't gone nuts again.
Comment 22•16 years ago
gentle ping... any update?
Assignee
Comment 23•16 years ago
What are you looking for in these logs?
Reporter
Comment 24•16 years ago
e.g. lost connections/dropped transactions between the ESX hosts and the network storage
Comment 25•16 years ago
(In reply to comment #24)
> e.g. lost connections/dropped transactions between the ESX hosts and the network storage
Actual dropped connections, as I know from past experience, result in total chaos and ESX hosts that die. That is not the case here.
Assignee
Comment 26•16 years ago
Did this happen a few weeks ago, when we had issues with the equallogic?
Assignee
Comment 27•16 years ago
I think this is related to the ESX build cluster being under high load; this is covered by bug 467634.
Assignee
Updated•16 years ago
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → DUPLICATE
Reporter
Comment 29•16 years ago
The VM is now restarted; let's see if this canary down the mine will sing again.
Reporter
Comment 30•16 years ago
balsa went boom again at 14:28 today, and started chewing a lot of CPU. I've re-educated the little punk with a reboot.
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard