Bug 921635: Frequent OSX mochitest timeouts in the dom-level* suites
Status: Closed (RESOLVED WORKSFORME)
Opened 11 years ago • Closed 7 years ago
Categories: Core :: General, defect
People: (Reporter: RyanVM, Unassigned)
Whiteboard: [test disabled on OS X] [leave open]
Attachments: 1 file (image/png, deleted)
This has been occurring persistently for more than a week: we are seeing frequent OSX "application timed out after 330 seconds with no output" failures, often in the dom-level* test suites, as in the log below:
https://tbpl.mozilla.org/php/getParsedLog.php?id=28486318&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound opt test mochitest-3 on 2013-09-27 14:11:22 PDT for push 6b9e77c98785
slave: talos-r4-lion-011
14:19:11 INFO - 28705 INFO TEST-START | /tests/dom/tests/mochitest/dom-level1-core/test_nodeappendchildnomodificationallowederrEE.html
14:24:41 WARNING - TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/dom-level1-core/test_nodeappendchildnomodificationallowederrEE.html | application timed out after 330 seconds with no output
14:24:45 INFO - SCREENSHOT: <see attached>
14:24:46 INFO - Traceback (most recent call last):
14:24:46 INFO - File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 1045, in runTests
14:24:46 INFO - onLaunch=onLaunch
14:24:46 INFO - File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 939, in runApp
14:24:46 INFO - self.handleTimeout(timeout, proc, utilityPath, debuggerInfo, browserProcessId)
14:24:46 INFO - File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 1073, in handleTimeout
14:24:46 INFO - self.killAndGetStack(browserProcessId, utilityPath, debuggerInfo, dump_screen=not debuggerInfo)
14:24:46 INFO - File "/builds/slave/talos-slave/test/build/tests/mochitest/runtests.py", line 724, in killAndGetStack
14:24:46 INFO - os.kill(processPID, signal.SIGABRT)
14:24:46 INFO - OSError: [Errno 3] No such process
14:24:46 INFO - Automation Error: Received unexpected exception while running application
14:24:46 INFO - WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
14:24:46 INFO - runtests.py | Running tests: end.
14:24:46 ERROR - Return code: 256
Multiple issues:
1.) Breakpad isn't launching and giving us a stack. Ted, is that a known issue?
2.) I see iCal running in all of the screenshots. Talking to philor, it doesn't sound like this is related, but it's worth pointing out.
These oranges are hitting frequently and we really need some movement on finding the root cause.
Flags: needinfo?(ted)
Reporter
Updated•11 years ago
Component: Buildduty → General
Product: Release Engineering → Core
QA Contact: armenzg
Version: other → unspecified
Comment 1•11 years ago
Dunno what the root cause is, but the proximate cause is one of glandium's "move * to moz.build" patches, where we had a makefile (or multiple makefiles, not sure) with broken loops that caused us to not run most of these tests. Despite the tests having been in the tree forever, these are basically new tests, not a new regression in old tests.
Comment 5•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #0)
> 1.) Breakpad isn't launching and giving us a stack. Ted, is that a known
> issue?
We discussed this in some other bug, but judging from the error message the process no longer exists by the time we try to kill it. I suggested that we should probably just wrap that os.kill in a try/except block to fail more gracefully there. I note that the screenshot doesn't show any trace of a Firefox process either, which lends credence to that theory. If we fix that we might be able to get a stack out of the process, since the error we're hitting here is causing us to skip the check_for_crashes call.
As to why the process is dying in slow motion, I don't know. Note that bug 920640 fixed the underlying cause of us not trying to kill the process for a stack (and bug 717758 made us try at all on Mac).
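A minimal sketch of that try/except guard (hypothetical names and structure, not the actual runtests.py code):

import errno
import os
import signal

def kill_for_stack(pid):
    # Send SIGABRT so Breakpad can capture a stack from the hung browser,
    # but tolerate the process having already exited, which is the
    # "OSError: [Errno 3] No such process" seen in the log above.
    try:
        os.kill(pid, signal.SIGABRT)
    except OSError as e:
        if e.errno != errno.ESRCH:
            raise
        # The process died before we got here; nothing to abort.
        # Continue so the check_for_crashes call still runs.
        print("process %d already gone; skipping SIGABRT" % pid)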
Flags: needinfo?(ted)
Comment 6•11 years ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> We discussed this in some other bug, but judging from the error message the
> process no longer exists by the time we try to kill it.
(Bug 921509)
Comment 21•11 years ago
Talos has a similar problem, which I have documented in bug 923770. While I don't know the root cause, we could probably be smarter in our process management.
Reporter
Comment 46•11 years ago
Can we please get a bug on file (or mark any existing bugs on file as dependencies here) for whatever work needs to be done to get a stack from these hangs? I'm not going to be able to get any traction with the development team if we can't give them a stack.
Flags: needinfo?(ted)
Flags: needinfo?(jmaher)
Comment 47•11 years ago
At first I thought this was related to loading too many pages (a bug I see on 10.6 and 10.7 in talos tp5 only), but after looking at a few logs here, I am not sure it is the same. The tp5 regression appeared in the last couple of weeks, as did this bug. Both bugs fail on 10.6 and 10.7, in test suites that load a lot of small test files which are quick to load (i.e., fast page cycling).
Either our code changed or the test machines changed (possible OS update?). Callek, can you help me figure out whether we have updated any software on our 10.6/10.7 machines in the last 3 weeks?
Flags: needinfo?(jmaher) → needinfo?(bugspam.Callek)
Comment 48•11 years ago
I don't see anything obvious in our puppet history for what could have changed on these test machines.
*) Do retriggers on old jobs exhibit the same talos regressions?
*) Do you have a regression range that is smaller than 3 weeks to look in?
Flags: needinfo?(bugspam.Callek)
Comment 49•11 years ago
What is the frequency of the dom-level* failures? If it's maybe 1:20, we could post to try server and do a bunch of retriggers. A theory I have is that we are loading pages too fast, so slowing down the page loads in mochitest might fix this. While that isn't a permanent solution, it might help us figure out the root cause. I guess I could test this theory in talos :)
I am not sure whether we can get a stack from the non-responsive browser; if there is a way, I would love to get this into talos. Possibly ted has some ideas there, and I could test them out on try server while I shoot for the high score table.
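As a rough sanity check on the retrigger count (back-of-the-envelope arithmetic, assuming independent runs and the guessed 1:20 rate): the chance of seeing at least one failure in n retriggers is 1 - (19/20)^n, so about 59 runs are needed for 95% confidence that a green result is meaningful.

import math

def retriggers_needed(p_fail=1.0 / 20, confidence=0.95):
    # Smallest n such that 1 - (1 - p_fail)**n >= confidence.
    return int(math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail)))

print(retriggers_needed())  # 59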
Comment 50•11 years ago
See comment 1: any attempt at retriggering down to a single regressing push should find the push which took us from running only a fraction of these tests to actually running them all.
Comment 51•11 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/54e4a1ad2b22 if you want to bisect to it quickly.
Comment 54•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #46)
> Can we please get a bug on file (or mark any existing bugs on file as
> dependencies here) for whatever work needs to be done to get a stack from
> these hangs? I'm not going to be able to get any traction with the
> development team if we can't give them a stack.
I thought I commented on this elsewhere, but my original theory seems to be off-base. I think something is going wrong with our process management, since the process is dead before we get to kill it to get a stack. I would tentatively blame the mozbase refactor, since it seems plausible that something could be going wrong inside of mozrunner or mozprocess.
jhammel: what do you think?
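A hypothetical sketch of the kind of debug instrumentation that could narrow this down (illustrative only, not actual mozprocess/mozrunner code): check whether the tracked browser process is still alive when the timeout handler fires, so a died-before-kill race gets logged instead of raising.

import signal

def handle_timeout(proc):
    # proc: a subprocess.Popen-style handle tracking the browser.
    status = proc.poll()  # None while the process is still running
    if status is None:
        print("pid %d still alive at timeout; aborting it for a stack" % proc.pid)
        proc.send_signal(signal.SIGABRT)
    else:
        # The suspected race: the browser exited before the harness
        # noticed, leaving nothing to abort and no stack to collect.
        print("pid %d exited with status %r before the timeout handler ran"
              % (proc.pid, status))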
Flags: needinfo?(ted) → needinfo?(jhammel)
Comment 55•11 years ago
It would be worth checking the test archives. The cause might be missing files because of make fuckage.
Comment 60•11 years ago
> I think something is going wrong with our process management,
:( Sadly seems likely.
> since the process is dead before we get to kill it to get a stack. I
> would tentatively blame the mozbase refactor, since it seems
> plausible that something could be going wrong inside of mozrunner
> or mozprocess.
I'm going to give a strong blanket "maybe".
I'll try to add in a few debug statements as soon as I can, but I can't put this at priority #1 right now.
Reporter
Comment 61•11 years ago
Clint, jst - these frequent timeouts are contributing to an already bad OSX test slave backlog. If we can't find an owner for this, I'm afraid we may need to resort to disabling the dom-level* tests on OSX.
AFAICT, we need:
* Fixes to the test harness to give stacks on hangs/timeouts
* Developer action once usable stacks exist
I'll also point out that per comment 1, these tests were actually accidentally disabled for a long time and nobody seemed to notice or miss them.
Flags: needinfo?(jst)
Flags: needinfo?(ctalbert)
Comment 62•11 years ago
I propose:
* disabling the tests as they were
* filing a bug to enable them (treat them as new tests)
* file a bug to get valid crash stacks
Comment 63•11 years ago
Turns out that trusting my guesses about history rather than actually looking at blame in https://hg.mozilla.org/integration/mozilla-inbound/annotate/45097bc3a578/dom/tests/mochitest/dom-level1-core/Makefile.in isn't a great way to make decisions.
We were missing some, but not all, of them between July 3, 2012 and September 5, 2013. If we're going to decide we don't need them at all on the platform that I think most of the DOM owners and peers use, then we should do so by deciding that we don't need them at all: we made it over a year without regressing any of them, so they only test things that we already test with other tests, and we should stop running them on any platform.
But before that, we should actually check whether I'm even right that 54e4a1ad2b22 is the start (though not on a weekday; we really can't afford the load of an extra few hundred 10.7 mochitests in our 4.5-hour queue).
Comment 65•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #1)
> Dunno what the root cause is, but the proximate cause is one of glandium's
> "move * to moz.build" patches, where we had a makefile (or multiple
> makefiles, not sure) with broken loops that caused us to not run most of
> these tests. Despite the tests having been in the tree forever, these are
> basically new tests, not a new regression in old tests.
It was only 11 of those tests that were disabled. test_HTMLAppletElement01.html to test_HTMLAppletElement11.html in dom/tests/mochitest/dom-level2-html/Makefile.in. The change that enabled them is bug 917622.
Comment 66•11 years ago
Flags: needinfo?(jhammel)
Comment 102•11 years ago
Hope everyone enjoyed that brief respite, which was brought to you by bug 928222 turning these timeouts into the crashes of bug 930495 and friends.
https://tbpl.mozilla.org/php/getParsedLog.php?id=29641756&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=29641903&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=29641938&tree=Mozilla-Central
Comment 120•11 years ago
(In reply to Mike Hommey [:glandium] from comment #65)
> (In reply to Phil Ringnalda (:philor) from comment #1)
> > Dunno what the root cause is, but the proximate cause is one of glandium's
> > "move * to moz.build" patches, where we had a makefile (or multiple
> > makefiles, not sure) with broken loops that caused us to not run most of
> > these tests. Despite the tests having been in the tree forever, these are
> > basically new tests, not a new regression in old tests.
>
> It was only 11 of those tests that were disabled.
> test_HTMLAppletElement01.html to test_HTMLAppletElement11.html in
> dom/tests/mochitest/dom-level2-html/Makefile.in. The change that enabled
> them is bug 917622.
I've disabled these 11 tests on OS X, in:
https://hg.mozilla.org/mozilla-central/rev/f2b317484d08
Let's see if that helps... :-)
Whiteboard: [some tests disabled on OS X] [leave open]
Comment 124•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #122)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=29814900&tree=Mozilla-
> Inbound
Sadly, this was after the test disabling in comment 120.
Comment 125•11 years ago
I think the next step is just to disable all dom-level* tests on OS X until someone actually cares whether they are reliable or not :-(
Comment 147•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #146)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=29775369&tree=B2g-Inbound
This wasn't a timeout - wrong bug?
Flags: needinfo?(cbook)
Comment 148•11 years ago
We've sadly come to the end of the road for now - dom-level1-core, dom-level2-core & dom-level2-html disabled on OS X for too many intermittent timeouts & until someone is able to investigate the failures:
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/66d14f6329ca
Whiteboard: [some tests disabled on OS X] [leave open] → [test disabled on OS X] [leave open]
Comment 149•11 years ago
I should add that these tests are also too OOMy for Android, so they have been disabled there for some time (see bug 910229). Seems like these tests need some TLC all around.
Comment 150•11 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #148)
> We've sadly come to the end of the road for now - dom-level1-core,
> dom-level2-core & dom-level2-html disabled on OS X for too many intermittent
> timeouts & until someone is able to investigate the failures:
> remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/66d14f6329ca
Bizarrely, this exposed issues with layout/generic/test/test_bug448987.html, which has now also been disabled. Bug 932296 is on file to get that sorted.
Reporter
Comment 151•11 years ago
(In reply to Joel Maher (:jmaher) from comment #62)
> I propose:
> * disabling the tests as they were
> * filing a bug to enable them (treat them as new tests)
> * file a bug to get valid crash stacks
So step 1 is done here. Will file the 2 follow-ups and we'll go from there.
Flags: needinfo?(jst)
Flags: needinfo?(jhammel)
Flags: needinfo?(jgriffin)
Flags: needinfo?(ctalbert)
Flags: needinfo?(cbook)
Reporter
Updated•11 years ago
Flags: needinfo?(jgriffin)
Comment 152•11 years ago
Step 2 now needs to be "file a bug to go about fixing and re-enabling dom-level1-core, dom-level2-core & dom-level2-html" rather than just those 11.
Reporter
Comment 153•11 years ago
(In reply to Joel Maher (:jmaher) from comment #62)
> I propose:
> * disabling the tests as they were
> * filing a bug to enable them (treat them as new tests)
Bug 932350
> * file a bug to get valid crash stacks
Bug 932349
Updated•11 years ago
Summary: Frequent OSX mochitest timeouts, often in the dom-level* suites → Frequent OSX mochitest timeouts in the dom-level* suites
Reporter
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(ryanvm)
Resolution: --- → WORKSFORME