Closed
Bug 493450
Opened 15 years ago
Closed 15 years ago
Unexpected reboots of cb-seamonkey-osx-*
Categories
(SeaMonkey :: Release Engineering, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kairo, Unassigned)
References
Details
The currently running Mac VMs in the SeaMonkey pool sometimes reboot unexpectedly, possibly due to system/VM crashes.
This could be connected to bug 493321 and be a Parallels issue; this bug is just for tracking the problem, as it blocks moving the pool to production.
Reporter
Comment 1•15 years ago
cb-seamonkey-osx-01 experienced network problems in recent cycles; it looked like it couldn't get new connections to the outside (like checkout, etc.) but it could still send stuff to the buildmaster. I also could still ssh in and forced a reboot at about 6:53 today.
Reporter
Comment 2•15 years ago
cb-seamonkey-osx-02 unexpectedly rebooted between 10:33 ("slave lost" msg) and 10:38 ("slave connected" msg) today.
Reporter
Comment 3•15 years ago
And now cb-seamonkey-osx-01 did it between 14:50 and 14:53.
Reporter
Comment 4•15 years ago
cb-seamonkey-osx-02, between 11:11 and 11:15 today.
Reporter
Comment 5•15 years ago
We also have crashes in almost every mochitest-plain cycle from either of the two slaves, at different places in the test cycles, so they are not related to one specific thing the box is doing.
Reporter
Comment 6•15 years ago
To be clear, comment #5 is about application crashes that are caught and reported by the test/buildbot harness; the rest in this bug are VM crashes/reboots, which we only see through losing the slaves for a few minutes and them coming back online a few minutes later (due to autologin and automatic start of buildbot) with the machine's uptime reset.
Reporter
Comment 7•15 years ago
lost cb-seamonkey-osx-01 between 16:59 and 17:10
Reporter
Comment 8•15 years ago
Another one on -osx-01 between 18:33 and 18:37 yesterday might be related to Phong trying out things for bug 493321, including reboots of -osx-03 and -osx-04, which came online at 18:44 and 18:52, respectively.
03 ran into networking problems very fast, very much the same as I reported in comment #1. 04 disconnected at 00:00 without having got anything to do up to that point.
I rebooted cb-seamonkey-osx-03 at 03:36, it came back online at 03:40.
Reporter
Comment 9•15 years ago
cb-seamonkey-osx-02 lost and regained network at 08:16; it was not a reboot this time.
Comment 10•15 years ago
I think osx-04 is frozen again. I'm going to power it down unless you tell me that it's still up and running.
Reporter
Comment 11•15 years ago
Phong: yes, like I stated in comment #8, osx-04 went away at midnight - it didn't come back from that. If you could bring back win32-02 instead that would be cool as having only one of the Windows VMs makes the "pool" a bit slow for testing config changes.
osx-03 has had network problems for a while, again just like in comment #8, and disconnected and reconnected to buildbot at 17:16. I now rebooted it between 18:04 and 18:06.
Reporter
Comment 12•15 years ago
cb-seamonkey-osx-03 rebooted between 13:03 and 13:18 today.
Reporter
Comment 13•15 years ago
cb-seamonkey-osx-02 rebooted 14:26 to 14:30 today.
Reporter
Comment 14•15 years ago
cb-seamonkey-osx-02 once again 22:49 to 22:53 yesterday.
Comment 15•15 years ago
(In reply to comment #5)
> We also have crashes in almost every mochitest-plain cycle
(In reply to comment #6)
> To be clear, comment #5 is application crashes that are caught and reported by
> the test/buildbot harness
I filed bug 494671 about the mochitest-plain crash(es).
Reporter
Comment 16•15 years ago
cb-seamonkey-osx-01 rebooted between 10:56 and 10:58 today.
Reporter
Comment 17•15 years ago
cb-seamonkey-osx-01 again, 16:46 to 16:50 today.
Reporter
Comment 18•15 years ago
ugh. and cb-seamonkey-osx-01 once again, 21:09 to 21:13 yesterday
Reporter
Comment 19•15 years ago
cb-seamonkey-osx-02 hasn't been crashing for some time, but it now started to report errors such as "FAILED TO GET ASN FROM CORESERVICES so aborting." in mochitests and then ran into failures getting stuff from the network, the same pattern as reported earlier in here, so I rebooted the VM just now.
Reporter
Comment 20•15 years ago
Bang. Now cb-seamonkey-osx-02 did a crash reboot again, between 08:44 and 08:48.
This seems to (almost) always happen during mochitest-plain runs, where we launch SeaMonkey and run a ton of tests on it. It happens at different points during the testing though.
Reporter
Comment 21•15 years ago
cb-seamonkey-osx-01 crashed/rebooted between 18:06 and 18:10; a few video tests failed before it disconnected in /tests/layout/base/tests/test_bug467672-1c.html
Reporter
Comment 22•15 years ago
All test cycles since then failed with a crash/reboot:
cb-seamonkey-osx-02 between 18:51 and 18:55 yesterday, during nsITransactionManager Aggregate Batch Transaction Stress Test (make check).
cb-seamonkey-osx-01 between 21:59 and 22:06 yesterday, two video test failures, a test_jQuery.html failure, lost in /tests/layout/style/test/test_bug391221.html
cb-seamonkey-osx-02 between 23:34 and 23:37 yesterday, one video test failure, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug441782-2b.html
cb-seamonkey-osx-02 between 3:42 and 3:46 today, some video test errors, later lost in /tests/layout/base/tests/test_bug441782-2d.html
Interestingly, the video failures before such a crash/reboot are wrong end times reported by the video element when it says it's done playing.
Reporter
Comment 23•15 years ago
02 made it through a whole test cycle with a video end time failure and test_Scriptaculous.html failures, but without a crash or timeout, neither this one nor bug 494671.
cb-seamonkey-osx-01 did a crash/reboot again though, between 7:24 and 7:28, failing in /tests/content/media/video/test/test_bug482461.html (the only failure before that was an xpcshell test, which we don't usually see)
Reporter
Comment 24•15 years ago
cb-seamonkey-osx-01, between 12:28 and 12:32, two video "currentTime at end" test failures, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug467672-2c.html
At a similar time, cb-seamonkey-osx-02 ran into those network problems again (e.g. hg clones/updates failing, "abort: error: Temporary failure in name resolution") and I'm manually rebooting it right now.
Reporter
Comment 25•15 years ago
cb-seamonkey-osx-01, between 16:23 and 16:38, one video "currentTime at end"
test failure, test_jQuery.html failures, lost in /tests/layout/base/tests/test_bug441782-4a.html
Reporter
Comment 26•15 years ago
cb-seamonkey-osx-01, between 18:18 and 18:22, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-1c.html
cb-seamonkey-osx-02, between 21:59 and 22:03, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-4e.html
Then 01 actually made it through a cycle without an OS crash, but it saw two video "currentTime at end" test failures and crashed the SeaMonkey process after passing /tests/content/media/video/test/test_wav_onloadedmetadata.html
cb-seamonkey-osx-01, between 02:02 and 02:06, three video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_wav_ended1.html
cb-seamonkey-osx-02, between 02:47 and 02:50, while doing a nightly build cycle, somewhere in building mailnews/ for i386 (not that easy to pinpoint due to parallel build process).
Reporter
Comment 27•15 years ago
cb-seamonkey-osx-02, between 10:32 and 10:36, lost in xpcshell-tests after xpcshell/test_mailnewsglobaldb/unit/test_gloda_content.js
cb-seamonkey-osx-02, between 12:08 and 13:13, two video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_volume.html
cb-seamonkey-osx-02, between 13:55 and 13:58, one reftest and one crashtest
failure, two video "currentTime at end" test failures, lost in /tests/layout/base/tests/test_bug441782-1c.html
Updated•15 years ago
Severity: normal → critical
Comment 28•15 years ago
It seems like the problem has been mostly with the OS X VMs. Can we try giving them more RAM?
Comment 29•15 years ago
[mid-air collision with comment 28 ... which I would gladly try first ;-)]
(With all these comments to read,) it is not obvious to me whether we have narrowed down a "trigger" for this reboot behavior or not.
If not, I would suggest trying to disable various jobs:
start with mochitest-plain only, then maybe all tests, then even the main build, up to the whole buildbot (= leaving the VM idle).
I mean:
if this is caused by tests, let's find out which one(s),
...,
if it's an OS issue, no need to lose more time monitoring and commenting about builds/tests.
Reporter
Comment 30•15 years ago
It's pretty clear that it's a virtualization/OS problem and not a test problem. The question is what things really trigger the OS or Parallels problem.
Reporter
Comment 31•15 years ago
And bugs on a non-production setup aren't critical IMHO.
Severity: critical → major
Comment 32•15 years ago
(In reply to comment #30)
> It's pretty clear that it's a virtualization/OS problem and not a test problem.
Yeah.
> The question is what things really trigger the OS or Parallels problem.
The comment 28 and comment 29 suggestions stand, I think.
(In reply to comment #31)
> And bugs on a non-production setup aren't critical IMHO.
(Well, see bug 493449 comment 5...)
Reporter
Comment 33•15 years ago
(In reply to comment #28)
> It seems like the problem has been mostly with the OSX vms. Can we try giving
> it more RAM?
I have top open over ssh while mochitest-plain is running, and I see that it says "1018M used, 6368K free." at this moment - so could we actually try that suggestion of increasing the RAM?
Reporter
Comment 34•15 years ago
BTW, just to get an impression of where the RAM is going, here is a snippet of the top output:
31574 seamonkey- 62.5% 12:22.84 11 188- 1053 122M- 40M+ 165M+ 356M+
31573 ssltunnel 0.0% 0:00.50 5 37 89 420K 2600K 2884K 84M
31560 xpcshell 1.2% 1:00.30 4 66 2805 394M 15M 401M 555M
31559 python 1.1% 1:07.85 1 15 113 2808K 1468K 4584K 79M
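For reference, here is a minimal sketch of how such top snapshots could be collected automatically over ssh, so that the last sample before a crash survives on the monitoring side. The host name is the one from this bug; the sampling interval, log file name, and the choice of Python are assumptions, not what the pool actually ran:

#!/usr/bin/env python3
import subprocess
import time

HOST = "cb-seamonkey-osx-01"   # slave name taken from this bug; adjust as needed
INTERVAL = 60                  # seconds between samples (assumption)
LOGFILE = "osx-01-top.log"     # written locally, so it survives the VM crash

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        # "top -l 1" takes one non-interactive sample on Mac OS X
        result = subprocess.run(["ssh", HOST, "top", "-l", "1"],
                                capture_output=True, text=True, timeout=30)
        sample = result.stdout or result.stderr
    except (subprocess.TimeoutExpired, OSError) as exc:
        sample = "ssh failed: %s" % exc
    with open(LOGFILE, "a") as log:
        log.write("=== %s ===\n%s\n" % (stamp, sample))
    time.sleep(INTERVAL)

Running something like this from a machine outside the Parallels host keeps the log intact even when the VM crash-reboots, which is what makes the hand-captured output in comment 35 useful.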
Reporter
Comment 35•15 years ago
OK, this VM crash-rebooted while I had top open; this was the last top output left in my ssh session:
SharedLibs: num = 7, resident = 31M code, 1992K data, 3656K linkedit.
MemRegions: num = 8317, resident = 414M + 18M private, 257M shared.
PhysMem: 143M wired, 337M active, 290M inactive, 775M used, 249M free.
VM: 3380M + 374M 154296(0) pageins, 12115(0) pageouts
PID COMMAND %CPU TIME #TH #PRTS #MREGS RPRVT RSHRD RSIZE VSIZE
31657 httpd 0.0% 0:00.01 1 11 199 280K 10M 792K 38M
31617 top 20.0% 9:48.14 1 20 34 1368K 188K 1956K 19M
31608 bash 0.0% 0:00.11 1 14 18 188K 184K 828K 18M
31607 sshd 0.0% 0:01.46 1 10 59 116K 884K 508K 22M
31592 sshd 0.0% 0:00.47 1 20 59 204K 884K 1572K 22M
31574 seamonkey- 96.7% 30:37.32 25 629+ 1284 158M- 45M 205M- 697M-
31573 ssltunnel 0.0% 0:02.52 5 37 90 1312K 2600K 3548K 84M
31560 xpcshell 0.0% 1:32.85 4 66 1796 170M 15M 177M 323M
31559 python 0.0% 1:20.28 1 15 113 2800K 1468K 4648K 79M
31557 sh 0.0% 0:00.01 1 13 18 112K 184K 588K 74M
31555 gnumake 0.0% 0:00.03 1 13 18 276K 312K 620K 74M
31553 gnumake 0.0% 0:00.07 1 13 19 188K 312K 532K 18M
19584 ssh-agent 0.0% 0:00.10 1 23 28 436K 200K 1112K 19M
212 python 0.0% 2:49.64 2 27 141 5908K 1468K 5760K 25M
180 Finder 0.0% 0:42.66 7 147 121 1600K 8668K 7164K 96M
179 SystemUISe 0.0% 0:03.10 6 185 192 2032K 9760K 6156K 96M
174 Dock 0.0% 0:02.23 4 102 180 1600K 10M 7776K 69M
173 coreaudiod 0.0% 0:00.23 2 82 25 200K 208K 848K 18M
172 ATSServer 0.0% 0:12.97 2 84 133 1364K 11M 7136K 147M
171 pboard 0.0% 0:00.02 1 15 23 104K 184K 540K 19M
165 UserEventA 0.0% 0:00.43 2 109 86 632K 1196K 2080K 36M
164 Spotlight 0.0% 0:01.21 2 78 82 1008K 4920K 4348K 54M
162 AppleVNCSe 0.0% 0:00.25 2 73 40 364K 2460K 2224K 41M
159 AirPort Ba 0.0% 0:00.28 2 58 67 512K 4176K 2768K 67M
158 dynres 0.0% 0:06.39 1 28 28 200K 224K 1184K 35M
153 launchd 0.0% 0:10.11 3 122 24 188K 296K 520K 18M
137 VNCPrivile 0.0% 0:00.04 1 16 24 100K 188K 576K 19M
126 CoreRAIDSe 0.0% 0:03.57 1 33 28 212K 348K 908K 19M
125 httpd 0.0% 0:00.06 1 11 201 660K 10M 2268K 38M
107 WindowServ 18.2% 16:04.82 5 154 233 4800K 22M 25M 92M
105 krb5kdc 0.0% 0:00.10 1 17 42 88K 372K 712K 18M
92 coreservic 0.0% 0:06.62 4 117 67 1080K 2856K 3448K 25M
90 timesync 0.0% 0:00.49 1 40 29 264K 800K 1472K 19M
89 socketfilt 0.0% 0:04.16 3 36 25 392K 200K 1236K 18M
87 autofsd 0.0% 0:00.09 1 21 18 140K 184K 660K 18M
83 diskarbitr 0.0% 0:10.71 1 104 19 328K 188K 912K 18M
80 dynamic_pa 0.0% 0:00.08 1 17 20 156K 184K 696K 18M
79 emond 0.1% 0:30.21 1 32 22 320K 1764K 1824K 27M
77 fseventsd 0.2% 0:48.15 11 64 48 796K 184K 1292K 23M
75 hidd 0.0% 0:00.04 2 28 20 116K 204K 592K 18M
74 hwmond 0.0% 0:56.09 10 91 47 320K 352K 1260K 23M
72 kdcmond 0.0% 0:06.03 2 24 19 188K 184K 904K 18M
71 KernelEven 0.0% 0:00.04 2 20 19 152K 184K 628K 18M
70 loginwindo 0.0% 0:01.33 3 174 105 1184K 6220K 4136K 56M
69 mds 0.0% 0:43.93 14 185 86 1848K 192K 3716K 28M
68 PasswordSe 0.0% 0:02.37 10 61 115 148K 1188K 852K 23M
67 RFBRegiste 0.0% 0:00.07 1 16 18 176K 184K 1036K 18M
Connection to cb-seamonkey-osx-01 closed by remote host.
Connection to cb-seamonkey-osx-01 closed.
Comment 36•15 years ago
Fwiw, might it be a worse case of bug 494769?
Reporter
Comment 37•15 years ago
(In reply to comment #36)
> Fwiw, might it a worse case of bug 494769?
I wouldn't have thought about your statement here being true in any way, but since you checked in this disabling of that test, we haven't crashed the OS again - yet.
We're still seeing bug 494671 so something's really fishy with the OS or virtualization (I guess the latter) but it definitely feels like that .wav test did strike a chord. Since we're seeing other stuff related to media tests as well, I wonder if the problem is somewhere in that area of media (audio) on those virtualized Leopard boxes.
Comment 38•15 years ago
(In reply to comment #37)
> I wouldn't have thought about your statement here being true in any way,
My initial thought was there might be a (very little) chance the reboot would be triggered by a memory allocation (related) "error".
> We're still seeing bug 494671 so something's really fishy with the OS or
> virtualization (I guess the latter) but it definitely feels like that .wav test
> did strike a chord. Since we're seeing other stuff related to media tests as
Yes, the media area looked most suspicious, but bug 494671 logs did not confirm this.
> well, I wonder if the problem is somewhere in that area of media (audio) on
> those virtualized Leopard boxes.
Now, with the fresh details found by Ted, the probable explanation would be that the media tests simply rely more on the "time", thus hitting the underlying bug more easily.
Reporter
Comment 39•15 years ago
We'll need to watch this for a bit more, but it looks like the Parallels and system upgrades in bug 494462 might have fixed this. I'd like to see a day or so of non-crash data before closing the bug here though.
Reporter
Comment 40•15 years ago
Hrm. cb-seamonkey-osx-01 rebooted again, between 11:20 and 11:24 today, while running /tests/layout/base/tests/test_bug467672-3d.html :(
Reporter
Comment 41•15 years ago
We lost cb-seamonkey-osx-03 after mozilla/layout/reftests/bugs/315920-3a.html today at 02:34 PDT, but the VM didn't come back so I'm unsure if it was a VM crash or a network loss.
Reporter
Comment 42•15 years ago
Phong says the 02:34 loss of osx-03 was a VM crash: "it was at the gray screen telling me to hold down the power button to reboot."
Reporter
Comment 43•15 years ago
cb-seamonkey-osx-03 again crash-rebooted between 13:49 and 13:54 today, last logged test was /tests/layout/base/tests/test_bug467672-3b.html
Reporter
Comment 44•15 years ago
Since the OS X VMs were reduced to one CPU, we haven't seen the machine crashes again, but I'm not completely trusting this yet, as bug 494671 still causes strange crashes in tests, so this machine issue might not be gone completely.
Reporter
Comment 45•15 years ago
From all I see, this apparently has been fixed (temporarily) by reducing the VMs to one CPU. The issue behind it is probably still around in bug 494671 even though that issue has become more rare as well.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Reporter
Updated•15 years ago
Component: Project Organization → Release Engineering
Reporter
Updated•15 years ago
QA Contact: organization → release