Closed Bug 493450 Opened 15 years ago Closed 15 years ago

Unexpected reboots of cb-seamonkey-osx-*

Categories

(SeaMonkey :: Release Engineering, defect)

x86
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Unassigned)

References

Details

The currently running Mac VMs in the SeaMonkey pool sometimes reboot unexpectedly, possibly due to system/VM crashes. This could be connected to bug 493321 and be a Parallels issue; this bug is just for tracking the problem, as it blocks moving the pool to production.
cb-seamonkey-osx-01 experienced network problems in recent cycles; it looked like it couldn't make new connections to the outside (for checkout, etc.), but it could still send stuff to the buildmaster. I could also ssh in and forced a reboot at about 6:53 today.
cb-seamonkey-osx-02 unexpectedly rebooted between 10:33 ("slave lost" msg) and 10:38 ("slave connected" msg) today.
And now cb-seamonkey-osx-01 did it between 14:50 and 14:53.
cb-seamonkey-osx-02, between 11:11 and 11:15 today.
We also have crashes in almost every mochitest-plain cycle from either of the two slaves, at different places in the test cycles, so they're not related to one specific thing the box is doing.
To be clear, comment #5 is about application crashes that are caught and reported by the test/buildbot harness; the rest in this bug are VM crashes/reboots, which we only see because we lose the slaves for a few minutes and they come back online a few minutes later (due to autologin and automatic start of buildbot) with the machine's uptime reset.
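(A minimal sketch, not something that actually runs in the pool: the reset-uptime signal described above could be polled automatically, e.g. by comparing a slave's boot time between checks. The host name, polling interval, and the ssh/sysctl approach are just assumptions for illustration.)

import re
import subprocess
import time

HOST = "cb-seamonkey-osx-01"  # hypothetical target; any of the OS X slaves

def boot_time(host):
    # On OS X, "sysctl -n kern.boottime" prints: { sec = 1243500000, usec = 0 } ...
    out = subprocess.check_output(
        ["ssh", host, "sysctl", "-n", "kern.boottime"]).decode()
    return int(re.search(r"sec = (\d+)", out).group(1))

last = boot_time(HOST)
while True:
    time.sleep(300)  # poll every five minutes
    try:
        current = boot_time(HOST)
    except subprocess.CalledProcessError:
        print("%s unreachable (network problem or mid-reboot?)" % HOST)
        continue
    if current != last:
        print("%s rebooted around %s" % (HOST, time.ctime(current)))
        last = current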
lost cb-seamonkey-osx-01 between 16:59 and 17:10
Another one on -osx-01 between 18:33 and 18:37 yesterday might be related to phong trying out things for bug 493321, including reboots of -osx-03 and -osx-04, which came online at 18:44 and 18:52, respectively. 03 ran into networking problems very fast, very much the same as I reported in comment #1. 04 disconnected at 00:00 without having gotten anything to do up to that point. I rebooted cb-seamonkey-osx-03 at 03:36; it came back online at 03:40.
cb-seamonkey-osx-02 lost and regained network at 08:16; no reboot this time.
I think osx-04 is frozen again. I'm going to power it down unless you tell me that it's still up and running.
Phong: yes, like I stated in comment #8, osx-04 went away at midnight - it didn't come back from that. If you could bring back win32-02 instead, that would be cool, as having only one of the Windows VMs makes the "pool" a bit slow for testing config changes. osx-03 has had network problems for a while, again just like in comment #8, and disconnected and reconnected to buildbot at 17:16. I have now rebooted it, between 18:04 and 18:06.
cb-seamonkey-osx-03 rebooted between 13:03 and 13:18 today.
cb-seamonkey-osx-02 rebooted 14:26 to 14:30 today.
cb-seamonkey-osx-02 once again 22:49 to 22:53 yesterday.
(In reply to comment #5)
> We also have crashes in almost every mochitest-plain cycle
(In reply to comment #6)
> To be clear, comment #5 is application crashes that are caught and reported by
> the test/buildbot harness
I filed bug 494671 about the mochitest-plain crash(es).
cb-seamonkey-osx-01 rebooted between 10:56 and 10:58 today.
cb-seamonkey-osx-01 again, 16:46 to 16:50 today.
ugh. and cb-seamonkey-osx-01 once again, 21:09 to 21:13 yesterday
cb-seamonkey-osx-02 hasn't been crashing for some time but now started to report errors such as "FAILED TO GET ASN FROM CORESERVICES so aborting." in mochitests and then got into failures to get stuff from the network, same pattern as reported earlier in here, so I rebooted the VM just now.
Bang. Now cb-seamonkey-osx-02 did a crash reboot again, between 08:44 and 08:48. This seems to (almost) always happen during mochitest-plain runs, where we launch SeaMonkey and run a ton of tests on it. It happens at different points during the testing though.
cb-seamonkey-osx-01 crashed/rebooted between 18:06 and 18:10; a few video tests failed before it disconnected in /tests/layout/base/tests/test_bug467672-1c.html
All test cycles since then failed with a crash/reboot:
cb-seamonkey-osx-02 between 18:51 and 18:55 yesterday, during nsITransactionManager Aggregate Batch Transaction Stress Test (make check).
cb-seamonkey-osx-01 between 21:59 and 22:06 yesterday, two video test failures, a test_jQuery.html failure, lost in /tests/layout/style/test/test_bug391221.html
cb-seamonkey-osx-02 between 23:34 and 23:37 yesterday, one video test failure, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug441782-2b.html
cb-seamonkey-osx-02 between 3:42 and 3:46 today, some video test errors, later lost in /tests/layout/base/tests/test_bug441782-2d.html
Interestingly, the video failures before such a crash/reboot are wrong end times of the video element when it reports it's done playing.
02 made it through a whole test cycle with a video end time failure and test_Scriptaculous.html failures, but without a crash or timeout - neither this bug nor bug 494671. cb-seamonkey-osx-01 did do a crash/reboot again though, between 7:24 and 7:28, failing in /tests/content/media/video/test/test_bug482461.html (the only failure before that was an xpcshell test failure, which we don't usually see).
cb-seamonkey-osx-01, between 12:28 and 12:32, two video "currentTime at end" test failures, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug467672-2c.html
At a similar time, cb-seamonkey-osx-02 ran into those network problems again (e.g. hg clones/updates failing, "abort: error: Temporary failure in name resolution"), and I'm manually rebooting it right now.
cb-seamonkey-osx-01, between 16:23 and 16:38, one video "currentTime at end" test failure, test_jQuery.html failures, lost in /tests/layout/base/tests/test_bug441782-4a.html
cb-seamonkey-osx-01, between 18:18 and 18:22, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-1c.html
cb-seamonkey-osx-02, between 21:59 and 22:03, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-4e.html
Then 01 actually made it through a cycle without an OS crash, but it saw two video "currentTime at end" test failures and crashed the SeaMonkey process after passing /tests/content/media/video/test/test_wav_onloadedmetadata.html
cb-seamonkey-osx-01, between 02:02 and 02:06, three video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_wav_ended1.html
cb-seamonkey-osx-02, between 02:47 and 02:50, while doing a nightly build cycle, somewhere in building mailnews/ for i386 (not that easy to pinpoint due to the parallel build process).
cb-seamonkey-osx-02, between 10:32 and 10:36, lost in xpcshell-tests after xpcshell/test_mailnewsglobaldb/unit/test_gloda_content.js
cb-seamonkey-osx-02, between 12:08 and 13:13, two video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_volume.html
cb-seamonkey-osx-02, between 13:55 and 13:58, one reftest and one crashtest failure, two video "currentTime at end" test failures, lost in /tests/layout/base/tests/test_bug441782-1c.html
Severity: normal → critical
It seems like the problem has been mostly with the OS X VMs. Can we try giving them more RAM?
[Mid-air collision with comment 28 ... which I would gladly try first ;-)]
With all these comments to read, it is not obvious to me whether we have narrowed down a "trigger" for this reboot behavior or not. If not, I would suggest trying to disable various jobs: start with mochitest-plain only, then maybe all tests, then even the main build, up to the whole buildbot (= leaving the VM idle). I mean: if this is caused by tests, let's find out which one(s); if it's an OS issue, there's no need to lose more time monitoring and commenting on builds/tests.
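(A rough illustration of the bisection idea above, not the actual master.cfg: with buildbot's Python config one could drop builder classes one round at a time and watch whether the VMs still crash. "all_builders" and the name suffixes are assumptions.)

# Hypothetical excerpt from a buildbot master.cfg for the SeaMonkey pool:
skip_suffixes = ("mochitest-plain",)  # next rounds: all tests, then the build

c['builders'] = [b for b in all_builders
                 if not b['name'].endswith(skip_suffixes)]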
It's pretty clear that it's a virtualization/OS problem and not a test problem. The question is what things really trigger the OS or Parallels problem.
And bugs on a non-production setup aren't critical IMHO.
Severity: critical → major
(In reply to comment #30)
> It's pretty clear that it's a virtualization/OS problem and not a test problem.
Yeah.
> The question is what things really trigger the OS or Parallels problem.
The comment 28 and comment 29 suggestions stand, I think.
(In reply to comment #31)
> And bugs on a non-production setup aren't critical IMHO.
(Well, see bug 493449 comment 5...)
(In reply to comment #28)
> It seems like the problem has been mostly with the OSX vms. Can we try giving
> it more RAM?
I just have top open over ssh while mochitest-plain is running, and I see that it says "1018M used, 6368K free." at this moment - so could we actually try that and increase the RAM?
BTW, just to get an impression of where the RAM is going, here is a snippet of the top output:

31574 seamonkey-  62.5% 12:22.84  11  188-  1053   122M-   40M+  165M+  356M+
31573 ssltunnel    0.0%  0:00.50   5   37     89   420K  2600K  2884K    84M
31560 xpcshell     1.2%  1:00.30   4   66   2805   394M    15M   401M   555M
31559 python       1.1%  1:07.85   1   15    113  2808K  1468K  4584K    79M
OK, this VM crash-rebooted while I had top open; this was the last top output left on my ssh session:

SharedLibs: num =    7, resident =   31M code, 1992K data, 3656K linkedit.
MemRegions: num = 8317, resident =  414M +   18M private,  257M shared.
PhysMem:   143M wired,  337M active,  290M inactive,  775M used,  249M free.
VM: 3380M + 374M   154296(0) pageins, 12115(0) pageouts

  PID COMMAND      %CPU   TIME     #TH #PRTS #MREGS  RPRVT  RSHRD  RSIZE  VSIZE
31657 httpd        0.0%  0:00.01    1    11    199   280K    10M   792K    38M
31617 top         20.0%  9:48.14    1    20     34  1368K   188K  1956K    19M
31608 bash         0.0%  0:00.11    1    14     18   188K   184K   828K    18M
31607 sshd         0.0%  0:01.46    1    10     59   116K   884K   508K    22M
31592 sshd         0.0%  0:00.47    1    20     59   204K   884K  1572K    22M
31574 seamonkey-  96.7% 30:37.32   25   629+  1284   158M-   45M   205M-  697M-
31573 ssltunnel    0.0%  0:02.52    5    37     90  1312K  2600K  3548K    84M
31560 xpcshell     0.0%  1:32.85    4    66   1796   170M    15M   177M   323M
31559 python       0.0%  1:20.28    1    15    113  2800K  1468K  4648K    79M
31557 sh           0.0%  0:00.01    1    13     18   112K   184K   588K    74M
31555 gnumake      0.0%  0:00.03    1    13     18   276K   312K   620K    74M
31553 gnumake      0.0%  0:00.07    1    13     19   188K   312K   532K    18M
19584 ssh-agent    0.0%  0:00.10    1    23     28   436K   200K  1112K    19M
  212 python       0.0%  2:49.64    2    27    141  5908K  1468K  5760K    25M
  180 Finder       0.0%  0:42.66    7   147    121  1600K  8668K  7164K    96M
  179 SystemUISe   0.0%  0:03.10    6   185    192  2032K  9760K  6156K    96M
  174 Dock         0.0%  0:02.23    4   102    180  1600K    10M  7776K    69M
  173 coreaudiod   0.0%  0:00.23    2    82     25   200K   208K   848K    18M
  172 ATSServer    0.0%  0:12.97    2    84    133  1364K    11M  7136K   147M
  171 pboard       0.0%  0:00.02    1    15     23   104K   184K   540K    19M
  165 UserEventA   0.0%  0:00.43    2   109     86   632K  1196K  2080K    36M
  164 Spotlight    0.0%  0:01.21    2    78     82  1008K  4920K  4348K    54M
  162 AppleVNCSe   0.0%  0:00.25    2    73     40   364K  2460K  2224K    41M
  159 AirPort Ba   0.0%  0:00.28    2    58     67   512K  4176K  2768K    67M
  158 dynres       0.0%  0:06.39    1    28     28   200K   224K  1184K    35M
  153 launchd      0.0%  0:10.11    3   122     24   188K   296K   520K    18M
  137 VNCPrivile   0.0%  0:00.04    1    16     24   100K   188K   576K    19M
  126 CoreRAIDSe   0.0%  0:03.57    1    33     28   212K   348K   908K    19M
  125 httpd        0.0%  0:00.06    1    11    201   660K    10M  2268K    38M
  107 WindowServ  18.2% 16:04.82    5   154    233  4800K    22M    25M    92M
  105 krb5kdc      0.0%  0:00.10    1    17     42    88K   372K   712K    18M
   92 coreservic   0.0%  0:06.62    4   117     67  1080K  2856K  3448K    25M
   90 timesync     0.0%  0:00.49    1    40     29   264K   800K  1472K    19M
   89 socketfilt   0.0%  0:04.16    3    36     25   392K   200K  1236K    18M
   87 autofsd      0.0%  0:00.09    1    21     18   140K   184K   660K    18M
   83 diskarbitr   0.0%  0:10.71    1   104     19   328K   188K   912K    18M
   80 dynamic_pa   0.0%  0:00.08    1    17     20   156K   184K   696K    18M
   79 emond        0.1%  0:30.21    1    32     22   320K  1764K  1824K    27M
   77 fseventsd    0.2%  0:48.15   11    64     48   796K   184K  1292K    23M
   75 hidd         0.0%  0:00.04    2    28     20   116K   204K   592K    18M
   74 hwmond       0.0%  0:56.09   10    91     47   320K   352K  1260K    23M
   72 kdcmond      0.0%  0:06.03    2    24     19   188K   184K   904K    18M
   71 KernelEven   0.0%  0:00.04    2    20     19   152K   184K   628K    18M
   70 loginwindo   0.0%  0:01.33    3   174    105  1184K  6220K  4136K    56M
   69 mds          0.0%  0:43.93   14   185     86  1848K   192K  3716K    28M
   68 PasswordSe   0.0%  0:02.37   10    61    115   148K  1188K   852K    23M
   67 RFBRegiste   0.0%  0:00.07    1    16     18   176K   184K  1036K    18M

Connection to cb-seamonkey-osx-01 closed by remote host.
Connection to cb-seamonkey-osx-01 closed.
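(A minimal sketch of how the memory headroom could be logged to a file on the slave instead of only being watched in a live ssh session, so the last reading survives a crash-reboot; the log path and the top -l/-n flags are assumptions about OS X's top.)

import subprocess
import time

LOG = "/tmp/physmem.log"  # hypothetical location on the slave

with open(LOG, "a") as log:
    while True:
        # One sample in logging mode, summary only (no per-process lines).
        out = subprocess.check_output(["top", "-l", "1", "-n", "0"]).decode()
        for line in out.splitlines():
            if line.startswith("PhysMem:"):
                log.write("%s %s\n" % (time.strftime("%H:%M:%S"), line))
                log.flush()
        time.sleep(60)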
Fwiw, might it be a worse case of bug 494769?
(In reply to comment #36)
> Fwiw, might it be a worse case of bug 494769?
I wouldn't have thought your statement here could be true in any way, but since you checked in the disabling of that test, we haven't crashed the OS again - yet. We're still seeing bug 494671, so something's really fishy with the OS or the virtualization (I guess the latter), but it definitely feels like that .wav test struck a chord. Since we're seeing other stuff related to media tests as well, I wonder if the problem is somewhere in that area of media (audio) on those virtualized Leopard boxes.
Depends on: 494769
(In reply to comment #37)
> I wouldn't have thought your statement here could be true in any way,
My initial thought was there might be a (very little) chance the reboot would be triggered by a memory allocation (related) "error".
> We're still seeing bug 494671, so something's really fishy with the OS or the
> virtualization (I guess the latter), but it definitely feels like that .wav
> test struck a chord.
Yes, the media area looked most suspicious, but the bug 494671 logs did not confirm this.
> Since we're seeing other stuff related to media tests as well, I wonder if the
> problem is somewhere in that area of media (audio) on those virtualized
> Leopard boxes.
Now, with the fresh details found by Ted, the probable explanation would be that those media tests simply rely more on timing, thus hitting the underlying bug more easily.
We'll need to watch this for a bit more, but it looks like the Parallels and system upgrades in bug 494462 might have fixed this. I'd like to see a day or so of non-crash data before closing the bug here though.
No longer blocks: 494671
Depends on: 494671
Hrm. cb-seamonkey-osx-01 rebooted again, between 11:20 and 11:24 today, while running /tests/layout/base/tests/test_bug467672-3d.html :(
We lost cb-seamonkey-osx-03 after mozilla/layout/reftests/bugs/315920-3a.html today at 02:34 PDT, but the VM didn't come back so I'm unsure if it was a VM crash or a network loss.
Phong says the 02:34 loss of osx-03 was a VM crash: "it was at the gray screen telling me to hold down the power button to reboot."
cb-seamonkey-osx-03 again crash-rebooted between 13:49 and 13:54 today, last logged test was /tests/layout/base/tests/test_bug467672-3b.html
Since the OS X VMs were reduced to one CPU, we haven't seen the machine crashes again, but I'm not completely trusting this yet, as bug 494671 still causes strange crashes in tests, so this machine issue might not be gone completely.
From all I see, this apparently has been fixed (at least temporarily) by reducing the VMs to one CPU. The underlying issue is probably still around in bug 494671, even though that issue has become rarer as well.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: Project Organization → Release Engineering
QA Contact: organization → release