Closed Bug 493450 Opened 15 years ago Closed 15 years ago

Unexpected reboots of cb-seamonkey-osx-*

Categories

(SeaMonkey :: Release Engineering, defect)

x86
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Unassigned)

References

Details

The currently running Mac VMs in the SeaMonkey pool sometimes reboot unexpectedly, possibly due to system/VM crashes. This could be connected to bug 493321 and be a Parallels issue; this bug is just for tracking the problem, as it blocks moving the pool to production.
cb-seamonkey-osx-01 experienced network problems in recent cycles; it looked like it couldn't make new connections to the outside (for checkout, etc.), but it could still send stuff to the buildmaster. I could also ssh in and forced a reboot at about 6:53 today.
cb-seamonkey-osx-02 unexpectedly rebooted between 10:33 ("slave lost" msg) and 10:38 ("slave connected" msg) today.
And now cb-seamonkey-osx-01 did it between 14:50 and 14:53.
cb-seamonkey-osx-02, between 11:11 and 11:15 today.
We also have crashes in almost every mochitest-plain cycle from either of the two slaves, at different places in the test cycles, so they're not related to one specific thing the box is doing.
To be clear, comment #5 is about application crashes that are caught and reported by the test/buildbot harness; the rest in this bug are VM crashes/reboots, which we only see because we lose the slaves for a few minutes and they come back online a few minutes later (due to autologin and automatic start of buildbot) with the machine's uptime reset.
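(A minimal sketch, not something that actually runs in the pool: the reset-uptime signal described above could be polled automatically, e.g. by comparing a slave's boot time between checks. The host name, polling interval, and the ssh/sysctl approach are just assumptions for illustration.)

import re
import subprocess
import time

HOST = "cb-seamonkey-osx-01"  # hypothetical target; any of the OS X slaves

def boot_time(host):
    # On OS X, "sysctl -n kern.boottime" prints: { sec = 1243500000, usec = 0 } ...
    out = subprocess.check_output(
        ["ssh", host, "sysctl", "-n", "kern.boottime"]).decode()
    return int(re.search(r"sec = (\d+)", out).group(1))

last = boot_time(HOST)
while True:
    time.sleep(300)  # poll every five minutes
    try:
        current = boot_time(HOST)
    except subprocess.CalledProcessError:
        print("%s unreachable (network problem or mid-reboot?)" % HOST)
        continue
    if current != last:
        print("%s rebooted around %s" % (HOST, time.ctime(current)))
        last = current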
lost cb-seamonkey-osx-01 between 16:59 and 17:10
Another one on -osx-01 between 18:33 and 18:37 yesterday might be related to phong trying out things for bug 493321, including reboots of -osx-03 and -osx-04, which came online at 18:44 and 18:52, respectively. 03 ran into networking problems very fast, very much the same as I reported in comment #1. 04 disconnected at 00:00 without having gotten anything to do up to that point. I rebooted cb-seamonkey-osx-03 at 03:36; it came back online at 03:40.
cb-seamonkey-osx-02 lost and regained network at 08:16; no reboot this time.
I think osx-04 is frozen again. I'm going to power it down unless you tell me that it's still up and running.
Phong: yes, like I stated in comment #8, osx-04 went away at midnight - it didn't come back from that. If you could bring back win32-02 instead, that would be cool, as having only one of the Windows VMs makes the "pool" a bit slow for testing config changes. osx-03 has had network problems for a while, again just like in comment #8, and disconnected and reconnected to buildbot at 17:16. I have now rebooted it, between 18:04 and 18:06.
cb-seamonkey-osx-03 rebooted between 13:03 and 13:18 today.
cb-seamonkey-osx-02 rebooted 14:26 to 14:30 today.
cb-seamonkey-osx-02 once again 22:49 to 22:53 yesterday.
(In reply to comment #5)
> We also have crashes in almost every mochitest-plain cycle
(In reply to comment #6)
> To be clear, comment #5 is application crashes that are caught and reported by
> the test/buildbot harness
I filed bug 494671 about the mochitest-plain crash(es).
cb-seamonkey-osx-01 rebooted between 10:56 and 10:58 today.
cb-seamonkey-osx-01 again, 16:46 to 16:50 today.
ugh. and cb-seamonkey-osx-01 once again, 21:09 to 21:13 yesterday
cb-seamonkey-osx-02 hasn't been crashing for some time but now started to report errors such as "FAILED TO GET ASN FROM CORESERVICES so aborting." in mochitests and then got into failures to get stuff from the network, same pattern as reported earlier in here, so I rebooted the VM just now.
Bang. Now cb-seamonkey-osx-02 did a crash reboot again, between 08:44 and 08:48. This seems to (almost) always happen during mochitest-plain runs, where we launch SeaMonkey and run a ton of tests on it. It happens at different points during the testing though.
cb-seamonkey-osx-01 crashed/rebooted between 18:06 and 18:10; a few video tests failed before it disconnected in /tests/layout/base/tests/test_bug467672-1c.html
All test cycles since then failed with a crash/reboot:
cb-seamonkey-osx-02 between 18:51 and 18:55 yesterday, during nsITransactionManager Aggregate Batch Transaction Stress Test (make check).
cb-seamonkey-osx-01 between 21:59 and 22:06 yesterday, two video test failures, a test_jQuery.html failure, lost in /tests/layout/style/test/test_bug391221.html
cb-seamonkey-osx-02 between 23:34 and 23:37 yesterday, one video test failure, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug441782-2b.html
cb-seamonkey-osx-02 between 3:42 and 3:46 today, some video test errors, later lost in /tests/layout/base/tests/test_bug441782-2d.html
Interestingly, the video failures before such a crash/reboot are wrong end times of the video element when it reports it's done playing.
02 made it through a whole test cycle with a video end time failure and test_Scriptaculous.html failures, but without a crash or timeout - neither this bug nor bug 494671. cb-seamonkey-osx-01 did do a crash/reboot again though, between 7:24 and 7:28, failing in /tests/content/media/video/test/test_bug482461.html (the only failure before that was an xpcshell test failure, which we don't usually see).
cb-seamonkey-osx-01, between 12:28 and 12:32, two video "currentTime at end" test failures, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug467672-2c.html
At a similar time, cb-seamonkey-osx-02 ran into those network problems again (e.g. hg clones/updates failing, "abort: error: Temporary failure in name resolution"), and I'm manually rebooting it right now.
cb-seamonkey-osx-01, between 16:23 and 16:38, one video "currentTime at end" test failure, test_jQuery.html failures, lost in /tests/layout/base/tests/test_bug441782-4a.html
cb-seamonkey-osx-01, between 18:18 and 18:22, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-1c.html
cb-seamonkey-osx-02, between 21:59 and 22:03, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-4e.html
Then 01 actually made it through a cycle without an OS crash, but it saw two video "currentTime at end" test failures and crashed the SeaMonkey process after passing /tests/content/media/video/test/test_wav_onloadedmetadata.html
cb-seamonkey-osx-01, between 02:02 and 02:06, three video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_wav_ended1.html
cb-seamonkey-osx-02, between 02:47 and 02:50, while doing a nightly build cycle, somewhere in building mailnews/ for i386 (not that easy to pinpoint due to the parallel build process).
cb-seamonkey-osx-02, between 10:32 and 10:36, lost in xpcshell-tests after xpcshell/test_mailnewsglobaldb/unit/test_gloda_content.js
cb-seamonkey-osx-02, between 12:08 and 13:13, two video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_volume.html
cb-seamonkey-osx-02, between 13:55 and 13:58, one reftest and one crashtest failure, two video "currentTime at end" test failures, lost in /tests/layout/base/tests/test_bug441782-1c.html
Severity: normal → critical
It seems like the problem has been mostly with the OS X VMs. Can we try giving them more RAM?
[Mid-air collision with comment 28 ... which I would gladly try first ;-)]
With all these comments to read, it is not obvious to me whether we have narrowed down a "trigger" for this reboot behavior or not. If not, I would suggest trying to disable various jobs: start with mochitest-plain only, then maybe all tests, then even the main build, up to the whole buildbot (= leaving the VM idle). I mean: if this is caused by tests, let's find out which one(s); if it's an OS issue, there's no need to lose more time monitoring and commenting on builds/tests.
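(A rough illustration of the bisection idea above, not the actual master.cfg: with buildbot's Python config one could drop builder classes one round at a time and watch whether the VMs still crash. "all_builders" and the name suffixes are assumptions.)

# Hypothetical excerpt from a buildbot master.cfg for the SeaMonkey pool:
skip_suffixes = ("mochitest-plain",)  # next rounds: all tests, then the build

c['builders'] = [b for b in all_builders
                 if not b['name'].endswith(skip_suffixes)]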
It's pretty clear that it's a virtualization/OS problem and not a test problem. The question is what things really trigger the OS or Parallels problem.
And bugs on a non-production setup aren't critical IMHO.
Severity: critical → major
(In reply to comment #30)
> It's pretty clear that it's a virtualization/OS problem and not a test problem.
Yeah.
> The question is what things really trigger the OS or Parallels problem.
The comment 28 and comment 29 suggestions stand, I think.
(In reply to comment #31)
> And bugs on a non-production setup aren't critical IMHO.
(Well, see bug 493449 comment 5...)
(In reply to comment #28)
> It seems like the problem has been mostly with the OSX vms. Can we try giving
> it more RAM?
I just have top open over ssh while mochitest-plain is running, and I see that it says "1018M used, 6368K free." at this moment - so could we actually try that and increase the RAM?
BTW, just to get an impression of where the RAM is going, here is a snippet of the top output:

31574 seamonkey-  62.5% 12:22.84  11  188-  1053   122M-   40M+  165M+  356M+
31573 ssltunnel    0.0%  0:00.50   5   37     89   420K  2600K  2884K    84M
31560 xpcshell     1.2%  1:00.30   4   66   2805   394M    15M   401M   555M
31559 python       1.1%  1:07.85   1   15    113  2808K  1468K  4584K    79M
OK, this VM crash-rebooted while I had top open; this was the last top output left on my ssh session:

SharedLibs: num =    7, resident =   31M code, 1992K data, 3656K linkedit.
MemRegions: num = 8317, resident =  414M +   18M private,  257M shared.
PhysMem:   143M wired,  337M active,  290M inactive,  775M used,  249M free.
VM: 3380M + 374M   154296(0) pageins, 12115(0) pageouts

  PID COMMAND      %CPU   TIME     #TH #PRTS #MREGS  RPRVT  RSHRD  RSIZE  VSIZE
31657 httpd        0.0%  0:00.01    1    11    199   280K    10M   792K    38M
31617 top         20.0%  9:48.14    1    20     34  1368K   188K  1956K    19M
31608 bash         0.0%  0:00.11    1    14     18   188K   184K   828K    18M
31607 sshd         0.0%  0:01.46    1    10     59   116K   884K   508K    22M
31592 sshd         0.0%  0:00.47    1    20     59   204K   884K  1572K    22M
31574 seamonkey-  96.7% 30:37.32   25   629+  1284   158M-   45M   205M-  697M-
31573 ssltunnel    0.0%  0:02.52    5    37     90  1312K  2600K  3548K    84M
31560 xpcshell     0.0%  1:32.85    4    66   1796   170M    15M   177M   323M
31559 python       0.0%  1:20.28    1    15    113  2800K  1468K  4648K    79M
31557 sh           0.0%  0:00.01    1    13     18   112K   184K   588K    74M
31555 gnumake      0.0%  0:00.03    1    13     18   276K   312K   620K    74M
31553 gnumake      0.0%  0:00.07    1    13     19   188K   312K   532K    18M
19584 ssh-agent    0.0%  0:00.10    1    23     28   436K   200K  1112K    19M
  212 python       0.0%  2:49.64    2    27    141  5908K  1468K  5760K    25M
  180 Finder       0.0%  0:42.66    7   147    121  1600K  8668K  7164K    96M
  179 SystemUISe   0.0%  0:03.10    6   185    192  2032K  9760K  6156K    96M
  174 Dock         0.0%  0:02.23    4   102    180  1600K    10M  7776K    69M
  173 coreaudiod   0.0%  0:00.23    2    82     25   200K   208K   848K    18M
  172 ATSServer    0.0%  0:12.97    2    84    133  1364K    11M  7136K   147M
  171 pboard       0.0%  0:00.02    1    15     23   104K   184K   540K    19M
  165 UserEventA   0.0%  0:00.43    2   109     86   632K  1196K  2080K    36M
  164 Spotlight    0.0%  0:01.21    2    78     82  1008K  4920K  4348K    54M
  162 AppleVNCSe   0.0%  0:00.25    2    73     40   364K  2460K  2224K    41M
  159 AirPort Ba   0.0%  0:00.28    2    58     67   512K  4176K  2768K    67M
  158 dynres       0.0%  0:06.39    1    28     28   200K   224K  1184K    35M
  153 launchd      0.0%  0:10.11    3   122     24   188K   296K   520K    18M
  137 VNCPrivile   0.0%  0:00.04    1    16     24   100K   188K   576K    19M
  126 CoreRAIDSe   0.0%  0:03.57    1    33     28   212K   348K   908K    19M
  125 httpd        0.0%  0:00.06    1    11    201   660K    10M  2268K    38M
  107 WindowServ  18.2% 16:04.82    5   154    233  4800K    22M    25M    92M
  105 krb5kdc      0.0%  0:00.10    1    17     42    88K   372K   712K    18M
   92 coreservic   0.0%  0:06.62    4   117     67  1080K  2856K  3448K    25M
   90 timesync     0.0%  0:00.49    1    40     29   264K   800K  1472K    19M
   89 socketfilt   0.0%  0:04.16    3    36     25   392K   200K  1236K    18M
   87 autofsd      0.0%  0:00.09    1    21     18   140K   184K   660K    18M
   83 diskarbitr   0.0%  0:10.71    1   104     19   328K   188K   912K    18M
   80 dynamic_pa   0.0%  0:00.08    1    17     20   156K   184K   696K    18M
   79 emond        0.1%  0:30.21    1    32     22   320K  1764K  1824K    27M
   77 fseventsd    0.2%  0:48.15   11    64     48   796K   184K  1292K    23M
   75 hidd         0.0%  0:00.04    2    28     20   116K   204K   592K    18M
   74 hwmond       0.0%  0:56.09   10    91     47   320K   352K  1260K    23M
   72 kdcmond      0.0%  0:06.03    2    24     19   188K   184K   904K    18M
   71 KernelEven   0.0%  0:00.04    2    20     19   152K   184K   628K    18M
   70 loginwindo   0.0%  0:01.33    3   174    105  1184K  6220K  4136K    56M
   69 mds          0.0%  0:43.93   14   185     86  1848K   192K  3716K    28M
   68 PasswordSe   0.0%  0:02.37   10    61    115   148K  1188K   852K    23M
   67 RFBRegiste   0.0%  0:00.07    1    16     18   176K   184K  1036K    18M

Connection to cb-seamonkey-osx-01 closed by remote host.
Connection to cb-seamonkey-osx-01 closed.
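(A minimal sketch of how the memory headroom could be logged to a file on the slave instead of only being watched in a live ssh session, so the last reading survives a crash-reboot; the log path and the top -l/-n flags are assumptions about OS X's top.)

import subprocess
import time

LOG = "/tmp/physmem.log"  # hypothetical location on the slave

with open(LOG, "a") as log:
    while True:
        # One sample in logging mode, summary only (no per-process lines).
        out = subprocess.check_output(["top", "-l", "1", "-n", "0"]).decode()
        for line in out.splitlines():
            if line.startswith("PhysMem:"):
                log.write("%s %s\n" % (time.strftime("%H:%M:%S"), line))
                log.flush()
        time.sleep(60)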
Fwiw, might it be a worse case of bug 494769?
(In reply to comment #36)
> Fwiw, might it be a worse case of bug 494769?
I wouldn't have thought your statement here could be true in any way, but since you checked in the disabling of that test, we haven't crashed the OS again - yet. We're still seeing bug 494671, so something's really fishy with the OS or the virtualization (I guess the latter), but it definitely feels like that .wav test struck a chord. Since we're seeing other stuff related to media tests as well, I wonder if the problem is somewhere in that area of media (audio) on those virtualized Leopard boxes.
Depends on: 494769
(In reply to comment #37)
> I wouldn't have thought your statement here could be true in any way,
My initial thought was there might be a (very little) chance the reboot would be triggered by a memory allocation (related) "error".
> We're still seeing bug 494671, so something's really fishy with the OS or the
> virtualization (I guess the latter), but it definitely feels like that .wav
> test struck a chord.
Yes, the media area looked most suspicious, but the bug 494671 logs did not confirm this.
> Since we're seeing other stuff related to media tests as well, I wonder if the
> problem is somewhere in that area of media (audio) on those virtualized
> Leopard boxes.
Now, with the fresh details found by Ted, the probable explanation would be that those media tests simply rely more on timing, thus hitting the underlying bug more easily.
We'll need to watch this for a bit more, but it looks like the Parallels and system upgrades in bug 494462 might have fixed this. I'd like to see a day or so of non-crash data before closing the bug here though.
No longer blocks: 494671
Depends on: 494671
Hrm. cb-seamonkey-osx-01 rebooted again, between 11:20 and 11:24 today, while running /tests/layout/base/tests/test_bug467672-3d.html :(
We lost cb-seamonkey-osx-03 after mozilla/layout/reftests/bugs/315920-3a.html today at 02:34 PDT, but the VM didn't come back so I'm unsure if it was a VM crash or a network loss.
Phong says the 02:34 loss of osx-03 was a VM crash: "it was at the gray screen telling me to hold down the power button to reboot."
cb-seamonkey-osx-03 again crash-rebooted between 13:49 and 13:54 today, last logged test was /tests/layout/base/tests/test_bug467672-3b.html
Since the OS X VMs were reduced to one CPU, we haven't seen the machine crashes again, but I'm not completely trusting this yet, as bug 494671 still causes strange crashes in tests, so this machine issue might not be gone completely.
From all I see, this apparently has been fixed (at least temporarily) by reducing the VMs to one CPU. The underlying issue is probably still around in bug 494671, even though that issue has become rarer as well.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: Project Organization → Release Engineering
QA Contact: organization → release