Closed Bug 478603 Opened 16 years ago Closed 16 years ago

intermittent orange on Windows mozilla-central talos Ts and Tdhtml tests ("failed to initialize browser")

Categories

(Release Engineering :: General, defect, P3)

x86
Windows Vista
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: anodelman)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

The tree is currently closed "due to Windows talos orange." My best guess for wtf that means is the way that starting with qm-pvista-trunk03 at 2009/02/08 18:12:12, then becoming nearly continuous with 03 the 9th and with 02 joining in increasingly often the 9th and through the rest of the week (ignored by everyone, all week long), we've been having "FAIL: Busted: tdhtml FAIL: failed to initialize browser". During that entire time, qm-pvista-trunk01 has been continuously green except for one non-tdhtml failure, which would certainly seem to indicate that it's box trouble rather than code trouble. I think we're going to reopen, since whoever anonymously closed the tree didn't leave any indication of having noticed that it was only two of the three, much less that it had been ignored all week long, but it's still pretty critical since it leaves us with just one.
I was just about to close the tree again, in particular because it's been a while since qm-pvista-trunk01 reported a successful run. Anyone on this?
The logs for qm-pvista-trunk02/03 look fine in the sense that they clean up old builds at the start, and run Ts and Tp happily. There isn't very much more info for the failing Tdhtml in the log, just

Running test tdhtml: Started Mon, 16 Feb 2009 09:48:44
Failed tdhtml: Stopped Mon, 16 Feb 2009 09:49:17
FAIL: Busted: tdhtml
FAIL: failed to initialize browser

Trying to catch the problem on qm-pvista-trunk03 with a nightly build, in the hope of getting some sort of crash report.
I haven't managed to reproduce the error in comment #2 when adjusting the config.yml (test manifest) to test just tdhtml, and I get a counter error if I leave ts and tp in before tdhtml. AFAICT, the "failed to initialize browser" message can only come from initializeProfile() http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ttest.py#117 which is calling InitializeNewProfile http://mxr.mozilla.org/mozilla/source/testing/performance/talos/ffsetup.py#143 to fill out the profile dir (with all the other files that talos doesn't want to explicitly specify). If we're getting an error here, it's because it's taking more than 30 seconds to do that launch and shutdown. Talos has already checked for existing firefox/minefield processes at this point. It could be that the Tp chews up a bunch of memory, and it takes a little while for things to clear up to a useful state for Tdhtml (say if we're well into the swap file). I've observed Working Sets and Private Bytes larger than 400MB after a few cycles of the pageset for Tp, and Commit was as high as 550MB, which doesn't really tally with the ~70MB reported on the graph server. Probably that's measuring different things and I need some schooling, but I mention it because the minis have 1GB of RAM, of which about 150MB was still being used for caching and only a small amount was truly free. I don't have data for the state when it gets to the very end of Tp, which would be useful to confirm or shoot down this guess. Alice will of course know how it really works. At any rate, I've left qm-pvista-trunk03 out of circulation for the moment.
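To make the timeout hypothesis concrete, here is a minimal sketch of a launch-and-wait check. It is not the talos code: the helper name `initialize_profile_once`, the 30-second default, and the use of the Python interpreter as a stand-in for the browser are all assumptions made for illustration. The point is only that a process that is slow to exit gets reported as a failure even though nothing crashed.

```python
import subprocess
import sys

def initialize_profile_once(command, timeout_seconds=30.0):
    """Launch `command` and wait for it to exit on its own.

    Hypothetical helper, not part of talos. Returns True if the process
    exited within timeout_seconds, and False otherwise (the case that
    would surface as "failed to initialize browser").
    """
    process = subprocess.Popen(command)
    try:
        process.wait(timeout=timeout_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Too slow: kill the stand-in process so it cannot linger
        # into whatever runs next, then reap it.
        process.kill()
        process.wait()
        return False

if __name__ == "__main__":
    # Stand-ins for a browser that shuts down quickly and one that does not.
    quick = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(initialize_profile_once(quick, timeout_seconds=10.0))
    print(initialize_profile_once(slow, timeout_seconds=0.5))
```

Under this model, a machine that is swapping heavily after Tp would fail the check purely because launch plus shutdown took longer than the limit.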
I talked to alice about this when I first noticed it. (Although forgot to file a bug, tsk tsk.) I was suspicious that 2/3 were orange, and one solid green, but at the same time, it seems odd that both of those machines would be having trouble at the same time. I didn't wind up feeling confident about it either way.
We have a theory that this is heat related. qm-pvista-trunk01/02/03 are the first three machines in the rack (on their side, in a row of 8), and idling 03 seemed to cause 02 to be green rather than often orange. To rule out code changes in the meantime, qm-pvista-trunk03 is back on for a spell. The next 4 machines in the row are qm-pxp-trunk05/01/02/03, so I'll look at their recent history too. Unthrottling may also be related (bug 468680, jan 30).
qm-pxp-trunk05/01/02/03 were perma-green over the last week. Perhaps Vista is working the mini much harder than XP does.
Since restarting qm-pvista-trunk03 it's been green except for one weird failure http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1234969776.1234975262.13021.gz&fulltext=1 Similarly qm-pvista-trunk02 has been green except for one FAIL: Busted: ts FAIL: previous cycle still running http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1234944070.1234950010.13051.gz&fulltext=1 after doing a few of the startups in the test. Going to be very tempting to resolve this WFM if it continues like this.
They've continued to be green apart from the cross-tree talos bustage for revisions f4800de50e03, d17cb4c725bd, d17cb4c725bd. It's unsatisfying to resolve without knowing what the problem was, please reopen if this starts up again.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME
qm-pvista-trunk03 and 02 are exhibiting similar symptoms again. Repeated failure:

FAIL: Busted: tdhtml
FAIL: failed to initialize browser

01 had the Ts failure again, but seems more stable than 02 and 03. Should I re-open this?
Since the boxes are still going orange, let's re-open.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Alice:
1) So, is this a fair summary: qm-pvista-trunk01 remains green, but qm-pvista-trunk02,03 start going orange at the same time?
2) Last time these machines went green, what exactly did you do? Reboot 02,03? Reimage 02,03? Nothing?
3) It's unclear to me if this is a talos/releng infrastructure problem, or a genuine code bug. The fact it hits two machines at the same time makes me suspect a code/testware problem, but how can we tell where to start looking?
Looks like the continuous green lasted until (using "days" starting at 10:30, since that's what Tinderbox gave me) 2/22. Since then, the daily fails were

02 03
 1  0
 0  0
 0  0
 1  0
 2  0
 0  3
 0  0
 0  2
 1  0
 2  0
 2  1
 4  3

and then 0 and 2 for the half-day today. Not exactly the pattern of the previous episode, but here again, if you want to blame it on code, you'll first need to explain away the fact that during those 24 failures over 12.5 days, 01 failed this way zero times.
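As a quick check of the tally, the sketch below sums the daily counts per machine. It assumes the numbers in the comment alternate between qm-pvista-trunk02 and qm-pvista-trunk03, one pair per day, which is consistent with the stated 24 failures over 12.5 days.

```python
# Daily "failed to initialize browser" counts as (02, 03) pairs, copied
# from the comment. The pairing is an assumption about how the flattened
# list of numbers should be read.
full_days = [(1, 0), (0, 0), (0, 0), (1, 0), (2, 0), (0, 3),
             (0, 0), (0, 2), (1, 0), (2, 0), (2, 1), (4, 3)]
half_day = (0, 2)

def totals(days):
    """Return (total for 02, total for 03) over a list of daily pairs."""
    return sum(day[0] for day in days), sum(day[1] for day in days)

total_02, total_03 = totals(full_days + [half_day])
print(total_02, total_03, total_02 + total_03)  # 13 11 24
```

Both machines contribute a similar share of the failures, while 01 contributes none.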
We can attempt here to give Talos a little more time to open/close the browser and see if we can stop this orange. I've been wary about changing test parameters, but I have an idea for a browser shutdown test in the works that should start collecting data.
Increase allowable time for browser to open/close.
Assignee: nobody → anodelman
Attachment #366017 - Flags: review?(bhearsum)
Comment on attachment 366017 [details] [diff] [review] [Checked in]increase timeout for browser opening/closing wfm
Attachment #366017 - Flags: review?(bhearsum) → review+
Comment on attachment 366017 [details] [diff] [review] [Checked in]increase timeout for browser opening/closing

Checking in sample.config;
/cvsroot/mozilla/testing/performance/talos/sample.config,v <-- sample.config
new revision: 1.25; previous revision: 1.24
done
Checking in ffsetup.py;
/cvsroot/mozilla/testing/performance/talos/ffsetup.py,v <-- ffsetup.py
new revision: 1.8; previous revision: 1.7
done
Attachment #366017 - Attachment description: increase timeout for browser opening/closing → [Checked in]increase timeout for browser opening/closing
Attachment #366017 - Flags: checked-in+
These machines were not re-imaged. Machines were re-imaged in bug 480048 - and it seems to have resulted in green machines. My timeout increase may not have been necessary, we may just need to re-image the orange cycling machines.
After this landed, Ts regressed about 11-14% on Mac 10.5, both mozilla-central and mozilla-1.9.1 with no changes landed on the latter. Possibly also bug 480577 but the timing here matches better.
(In reply to comment #17)
> These machines were not re-imaged. Machines were re-imaged in bug 480048 - and
> it seems to have resulted in green machines. My timeout increase may not have
> been necessary, we may just need to re-image the orange cycling machines.

ok, good to know, thanks Alice. Agreed, sounds like next time we start seeing orange cycling machines, we should start by reimaging quickly, before too many other code changes land and complicate the picture. If a reimage turns the machine back green again, then we at least have narrowed down the problem a bit!
I can't see how changing the browser timeout would have affected Ts - and if it does affect Ts in some way that I'm not seeing I don't understand why it would only affect leopard. I'd be willing to do a backout to see if numbers normalize, but that should happen during a downtime.
Attachment #370539 - Flags: review?(joduinn)
Going back to the old timer settings in an attempt to resolve scattered Ts wonkiness since the timer settings were increased.
Attachment #370539 - Flags: review?(joduinn) → review+
Comment on attachment 370539 [details] [diff] [review] [Checked in]back to the old timer settings From irc, aki already confirmed he manually overrides these settings for Talos-for-mobile. As this doesn't cause mobile any problems, and as it simply reverts to previous values, I'll r+.
Comment on attachment 370539 [details] [diff] [review] [Checked in]back to the old timer settings

Checking in ffsetup.py;
/cvsroot/mozilla/testing/performance/talos/ffsetup.py,v <-- ffsetup.py
new revision: 1.9; previous revision: 1.8
done
Checking in sample.config;
/cvsroot/mozilla/testing/performance/talos/sample.config,v <-- sample.config
new revision: 1.26; previous revision: 1.25
done
Attachment #370539 - Attachment description: back to the old timer settings → [Checked in]back to the old timer settings
Attachment #370539 - Flags: checked-in+
Has this gone live?
Went live yesterday at around 5pm.
Judging from this graph of branch Darwin 9.2.2 boxes, the backout resolved part of the regression: http://graphs-new.mozilla.org/#tests=[{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2240%22},{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2241%22},{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2242%22},{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2243%22}]&sel=1238192959,1238687040 It's early yet, but it looks like we were averaging ~1530 before the regression, 1800 during the regression, and post-backout we're down to 1650.
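The arithmetic behind "resolved part of the regression" can be sketched as follows. The three averages are the approximate figures quoted in the comment; everything else is just derived from them, so the results are only as good as those eyeballed numbers.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Approximate Ts averages read off the graph in the comment.
before, during, after_backout = 1530.0, 1800.0, 1650.0

regression = during - before        # size of the original regression
recovered = during - after_backout  # how much the backout won back

print(round(percent_change(before, during), 1))         # 17.6
print(round(recovered / regression * 100.0, 1))         # 55.6
print(round(percent_change(before, after_backout), 1))  # 7.8
```

On these figures the backout recovered a bit over half of the regression, leaving Ts roughly 8% above its earlier level.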
Summary: Investigate orange on mozilla-central's qm-pvista-trunk02 and qm-pvista-trunk03 → Tdhtml orange ("failed to initialize browser") on mozilla-central's qm-pvista-trunk02 and qm-pvista-trunk03
Whiteboard: [orange]
Summary: Tdhtml orange ("failed to initialize browser") on mozilla-central's qm-pvista-trunk02 and qm-pvista-trunk03 → intermittent orange on mozilla-central's qm-pvista-trunk02/03/04
qm-pvista-trunk03 has been orange with the problem mentioned in comment 9 for a few days now. Is there anything that can be done about it, or can that box be removed from the Firefox tinderbox so we don't have to keep starring it?
Buildbot is stopped on qm-pvista-trunk03, it'll fall off the waterfall from lack of work.
Having talos monitor browser shutdown time should eliminate these oranges, as they are caused by the browser taking a long time to close and thus confusing talos - that is why the initial fix of increasing timeouts made the orange go away.
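A rough sketch of what monitoring shutdown time could look like is below. This is not the Tshutdown patch: `measure_shutdown`, its polling interval, and its limit are hypothetical, and the Python interpreter stands in for the browser. The idea is to wait for the process to really exit and record how long that took, instead of assuming it is gone after a fixed delay.

```python
import subprocess
import sys
import time

def measure_shutdown(process, poll_interval=0.05, max_wait=30.0):
    """Poll until `process` exits and return the seconds waited.

    Hypothetical helper. Returns None if the process is still running
    after max_wait seconds; the caller decides what to do with it.
    """
    start = time.monotonic()
    while process.poll() is None:
        if time.monotonic() - start > max_wait:
            return None
        time.sleep(poll_interval)
    return time.monotonic() - start

if __name__ == "__main__":
    # Stand-in for a browser that takes a moment to close.
    browser = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
    elapsed = measure_shutdown(browser)
    print("shutdown took", elapsed, "seconds")
```

Recording the elapsed time also yields data on slow shutdowns, rather than only a pass or fail result.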
Depends on: Tshutdown
Blocks: 438871
Alice, would the shutdown monitoring address the oranges we're seeing today on

WINNT 6.0 talos mozilla-central nochrome qm-pvista-trunk04
WINNT 6.0 talos mozilla-central qm-pvista-trunk02
WINNT 5.1 talos mozilla-central qm-pxp-trunk02

All three have experienced either

FAIL: Busted: ts
FAIL: failed to initialize browser

or

FAIL: Busted: tdhtml
FAIL: failed to initialize browser

Or could these be something else?
Summary: intermittent orange on mozilla-central's qm-pvista-trunk02/03/04 → intermittent orange on mozilla-central's qm-pvista-trunk02/03/04 ("failed to initialize browser")
(In reply to comment #32)
> FAIL: Busted: ts
> FAIL: failed to initialize browser

Seeing this today too. Is this the same bug, or something else?
I'll take the blame for the annoyance, since I got Alice to back it out while we were hunting the Ts regression on Mac. As I understand it though, the Tshutdown patch that will remedy this just needs review and then scheduled downtime to push. Even with the semi-regular failures, I suspect we don't want that downtime to happen before freeze?
(In reply to comment #35)
> Even with the semi-regular failures, I suspect we don't want that downtime to
> happen before freeze?

Probably not. Good to know this is on course to getting fixed.
A lot of my previous links may have been bug 482575 as well (didn't check machine until now). Maybe that's a dupe of this though...
(In reply to comment #35)
> Even with the semi-regular failures, I suspect we don't want that downtime to
> happen before freeze?

We've been assuming (hopefully correctly) that we should avoid any downtime before FF3.5b4 work is done. However, happy to change plans if that's preferred.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239976792.1239982719.11487.gz&fulltext=1
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/17 06:59:52

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239988477.1239994511.2731.gz&fulltext=1
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/17 10:14:37

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239984953.1239991004.29196.gz&fulltext=1
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/17 09:15:53

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239984953.1239995249.3869.gz&fulltext=1
WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/17 09:15:53
I thought this was supposed to be fixed by this weekend's talos maintenance, but it's not:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240231529.1240242539.14651.gz
WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/20 05:45:29 Tdhtml

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240231529.1240238249.7784.gz
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/20 05:45:29 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240231529.1240239353.9482.gz
WINNT 6.0 talos mozilla-central nochrome qm-pvista-trunk04 on 2009/04/20 05:45:29 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240236984.1240242926.15242.gz
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/20 07:16:24 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240228045.1240233098.26771.gz
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/20 04:47:25 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240218634.1240225467.13349.gz
WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/20 02:10:34 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240224794.1240230173.22414.gz
WINNT 5.1 talos mozilla-central qm-pxp-trunk02 on 2009/04/20 03:53:14 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240224794.1240230172.22410.gz
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/20 03:53:14 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240226189.1240235637.30602.gz
WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/20 04:16:29 Tdhtml

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240226189.1240231365.24217.gz
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 on 2009/04/20 04:16:29 Ts

Those are the "failed to initialize browser" oranges in the past 12 hours. They include XP and Vista machines, Ts and Tdhtml tests, and include the nochrome boxes. Which of those are covered by this bug, and for which should new bugs be filed?
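For anyone trying to split these up, a small tally by machine and by test may help. The (machine, test) pairs below are copied from the ten log entries in this comment; the grouping itself is just an illustration and does not decide which entries belong to this bug.

```python
from collections import Counter

# (machine, test) pairs copied from the ten tinderbox entries in the comment.
failures = [
    ("qm-pvista-trunk02", "Tdhtml"),
    ("qm-pxp-trunk07", "Ts"),
    ("qm-pvista-trunk04", "Ts"),
    ("qm-pxp-trunk07", "Ts"),
    ("qm-pxp-trunk07", "Ts"),
    ("qm-pvista-trunk02", "Ts"),
    ("qm-pxp-trunk02", "Ts"),
    ("qm-pxp-trunk07", "Ts"),
    ("qm-pvista-trunk02", "Tdhtml"),
    ("qm-pxp-trunk07", "Ts"),
]

by_machine = Counter(machine for machine, _ in failures)
by_test = Counter(test for _, test in failures)

for machine, count in by_machine.most_common():
    print(machine, count)
print(dict(by_test))
```

Half of the entries come from qm-pxp-trunk07, and only the two Tdhtml failures on qm-pvista-trunk02 match the symptom originally reported here.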
This weekend's fix was backed out due to reds on some mac boxes. See bug 480413 comment 20
Summary: intermittent orange on mozilla-central's qm-pvista-trunk02/03/04 ("failed to initialize browser") → intermittent orange on Windows mozilla-central talos Ts and Tdhtml tests ("failed to initialize browser")
WINNT 5.1 talos mozilla-central nochrome qm-pxp-trunk07 [testfailed]
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240277387.1240283560.15351.gz&fulltext=1
TinderboxPrint:FAIL: Busted: ts
TinderboxPrint:FAIL: failed to initialize browser
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240408820.1240415112.13072.gz
WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/22 07:00:20 Ts

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240400478.1240407892.28809.gz
WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/22 04:41:18 Ts
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240576888.1240586182.25390.gz WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/24 05:41:28 Tdhtml
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240589285.1240599408.15920.gz WINNT 6.0 talos mozilla-central qm-pvista-trunk02 on 2009/04/24 09:08:05 Tdhtml
I'm seeing a lot of green vista boxes now. Will leave this open for another couple of days to ensure that we are done with this error for good.
Still seeing lots of green, and on moz-central instead of intermittent orange there now appears to be an intermittent crash with stack - so I'm going to call this success.
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Whiteboard: [orange]
Product: mozilla.org → Release Engineering