Closed
Bug 650535
Opened 14 years ago
Closed 13 years ago
Android tests fail with "Timed out while waiting for server startup"
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: bear)
References
Details
(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][android_tier_1])
Attachments
(1 file)
(deleted), patch | jgriffin: review+
The pattern, for those needing to star something to be able to land or to stop someone yelling at you about your orange, is:
You have a log like http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1302957935.1302959478.28879.gz where a mochitest step died with
INFO | runtests.py | Server pid: 27278
uncaught exception: 2147746065
Timed out while waiting for server startup.
which tends to mean "the server returned NS_ERROR_NOT_AVAILABLE because it's still running from something that died before." So you look at https://build.mozilla.org/buildapi/recent/tegra-nnn with the nnn of your slave, look back through the recent failures for an orange result==1 on a crashtest with no intervening mochitests between it and your run, use the name of the tree and the start time to figure out which run that would have been, and find a log like http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1302953854.1302957705.20676.gz where the crashtest run ended with
REFTEST TEST-START | Shutdown
INFO | automation.py | Application ran for: 0:08:32.671885
INFO | automation.py | Reading PID log: /tmp/tmp1mtwpHpidlog
WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks can't be detected!
REFTEST INFO | runreftest.py | Running tests: end.
command timed out: 2400 seconds without output, killing pid 26215
process killed by signal 9
program finished with exit code -1
which seems to leave the server running, ready to blow up the next job.
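If that reading is right, the stale xpcshell web server is still bound to the port the new run's server wants, so the new server never comes up and the harness times out waiting for it. A minimal diagnostic sketch, not part of the harness, with the port number and helper name made up for illustration:

import socket

def port_in_use(host, port, timeout=2.0):
    """Return True if something is already listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (socket.timeout, OSError):
        return False

# Each tegra gets a fixed httpd.js port derived from its number (see comment 10),
# so a zombie server from an earlier, killed run shows up as the port already
# being busy before the next job even tries to start its own server.
if port_in_use("localhost", 30221):   # 30221 is an illustrative port, not the real scheme
    print("stale test web server still listening; the next run will time out waiting for startup")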
Reporter
Comment 1•14 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1302950674.1302952005.14010.gz is cute, since it's actually "crashtest killed by crashtest dying."
Reporter
Comment 2•14 years ago
Reporter
Comment 3•14 years ago
Reporter
Comment 4•14 years ago
And http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303005755.1303006681.18626.gz as the other shoe dropping when 066 finally got another mochitest run to blow up.
Reporter
Comment 5•14 years ago
It would have been more accurate if I'd filed this as "go ahead and write the patch to check for leftover stuff after anything bad happens, like you already know you need to do," since http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303010194.1303011036.1211.gz seems to be a mochitest dying after http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303005755.1303007172.20105.gz, which was a tsvg hitting bug 650018 after a pair of hg failures.
Reporter
Comment 6•14 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303012774.1303013872.10694.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303012775.1303013570.9762.gz
Despite having what I thought was a world-class appetite for spamming bugs, I don't think I can actually stand to copy-paste every instance of this, if it's really two or three or four every single m-c push.
Comment 7•14 years ago
Yeah, I think killing zombie daemons is the right long term approach.
Reporter
Comment 8•14 years ago
From a single push (which caught the nightly and, for some reason, maybe because it was the m-c nightly, triggered three sets of tests): three mochitests, a crashtest, and a reftest.
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303040375.1303041088.26418.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303041411.1303042486.3338.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303032814.1303033512.15925.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303032814.1303033837.16888.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303040374.1303041214.27341.gz
If killing zombies is the right long term approach, what's the right short term approach? This absolutely blocks showing mobile results on the mozilla-central tree.
Updated•14 years ago
Whiteboard: [orange] → [orange][mobile_unittests]
Comment 9•14 years ago
(In reply to comment #8)
> If killing zombies is the right long term approach, what's the right short term
> approach? This absolutely blocks showing mobile results on the mozilla-central
> tree.
Most likely, hiding the broken unit test results is the correct short-term approach until we can fix them.
Comment 10•14 years ago
We need to figure out a decent solution for this sooner rather than later. The problem is that when a test leaves a zombie xpcshell process running, the next test run on that machine will fail to start up. This is partly because we use the same xpcshell httpd.js port for each tegra (based on the tegra number). FWIW, I think that is a good approach to keep things simple.
A couple possible solutions:
1) use random dynamic ports, not based on tegra name/number
2) in the install/config scripts, search for an xpcshell process with the given port for the tegra.
3) keep track of the PID of the xpcshell process when launching it and, in the cleanup step, ensure that PID is terminated (a sketch of this follows at the end of this comment).
I am not sure why we turned on the mobile unittests for m-c without seeing them run for a week or two without problems on the mobile tree...that is just my opinion.
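A minimal sketch of option 3, with a hypothetical pidfile path and helper names (the actual pidfile work referenced later in this bug may look different):

import os
import signal

PIDFILE = "/tmp/tegra-nnn-xpcshell.pid"   # hypothetical path; the real naming scheme may differ

def record_server_pid(pid):
    """Remember the xpcshell web server's PID when it is launched."""
    with open(PIDFILE, "w") as f:
        f.write(str(pid))

def kill_stale_server():
    """In the cleanup/verify step, terminate any server left over from a previous run."""
    if not os.path.exists(PIDFILE):
        return
    try:
        pid = int(open(PIDFILE).read().strip())
        os.kill(pid, signal.SIGTERM)
    except (ValueError, OSError):
        pass   # garbage in the file, or no such process: nothing left to kill
    finally:
        os.remove(PIDFILE)

That keeps the fixed per-tegra port scheme intact, unlike option 1, and is cheaper than scanning the process table as in option 2.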
Reporter
Comment 11•14 years ago
We didn't put them on mozilla-central (though sort of close, M1-4 and crashtest on mozilla-aurora) - it just seems like it because I'm attacking them on mobile as though people have been threatening to put them on mozilla-central, because people have been threatening to put them on mozilla-central. They got a lot more visibility the last couple of weeks since people pushed things that were totally green on try and totally green on cedar to mozilla-central, and then got yelled at and backed out because they broke startup on mobile.
Comment 12•14 years ago
This should potentially be fixed by bug 650828 and bug 650865 (remote unit test pid files + usage of such).
Priority: -- → P3
Reporter
Comment 13•14 years ago
I hid the mochitests and crashtests (reftests will be permaorange anyway until the next time we merge) on aurora, so someone help me remember to unhide them when 650865 cuts the failure rate down.
Comment 14•14 years ago
This should improve now that bug 650865 has landed (though only new daemons will get pidfiles, so may take a while to shake out old zombie daemons).
Reporter
Comment 15•14 years ago
Did this maybe get broken by bug 652032, or something else that happened Friday afternoon? Things were looking really good for a while, but now we're back to just as many "Timed out while waiting for server startup" failures as ever.
Priority: P3 → --
Reporter
Comment 16•14 years ago
Screw you and your failure to detect midairs and your random changes, Bugzilla.
Priority: -- → P3
Comment 17•14 years ago
It's certainly possible.
I updated sut_tools, which grabbed a new reboot.py.
I also updated talos, which grabbed a new devicemanager that conflicted with the existing [patched] devicemanager.
Comment 18•14 years ago
Any recurrence in the past month?
Reporter
Comment 19•14 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306481774.1306482622.21118.gz (and two others so far today) looks characteristic, a "Timed out while waiting for server startup" following an "uncaught exception: 2147746065", though I find it a little suspicious that all three I saw were in just the content/smil/test hunk of mochitest-1.
Reporter
Comment 20•13 years ago
Unless you meant "does crashtest still fizzle out during shutdown?", in which case: http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306636426.1306640465.27008.gz and http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306590586.1306594368.5519.gz
Reporter
Comment 21•13 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306785048.1306786606.23050.gz (tegra-074)
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306775925.1306781372.1642.gz (tegra-074)
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306759163.1306759877.31750.gz (tegra-050)
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306753846.1306755058.4723.gz (tegra-060)
Maybe it's just a result of the number of runs there, but it feels like when I look at /recent/ for one of these (because I'm seeing way more timed out waiting for startup than I'm seeing crashtest hangs to produce them), there's a try run that blew up preceding the timeout. Is try maybe not doing the same things to clean up after itself?
Comment 22•13 years ago
(In reply to comment #21)
> Maybe it's just a result of the number of runs there, but it feels like when
> I look at /recent/ for one of these (because I'm seeing way more timed out
> waiting for startup than I'm seeing crashtest hangs to produce them),
> there's a try run that blew up preceeding the timeout. Is try maybe not
> doing the same things to clean up after itself?
I think try is blowing up more than other branches, as it should be. We may need to beef up our cleanup algorithms, as well as potentially split the pool to create dedicated try tegras.
Reporter
Comment 23•13 years ago
Reporter
Comment 24•13 years ago
Reporter
Comment 25•13 years ago
Reporter
Comment 26•13 years ago
Reporter
Comment 27•13 years ago
Reporter
Comment 28•13 years ago
Reporter
Comment 29•13 years ago
Reporter
Comment 30•13 years ago
Comment 31•13 years ago
(In reply to comment #22)
> (In reply to comment #21)
> > Maybe it's just a result of the number of runs there, but it feels like when
> > I look at /recent/ for one of these (because I'm seeing way more timed out
> > waiting for startup than I'm seeing crashtest hangs to produce them),
> > there's a try run that blew up preceeding the timeout. Is try maybe not
> > doing the same things to clean up after itself?
>
> I think try is blowing up more than other branches, as it should be. We may
> need to beef up our cleanup algorithms, as well as potentially split the
> pool to create dedicated try tegras.
For desktop, we use one shared pool of machines for both production testing and try testing, which gets us better capacity scaling. I'd expect to be doing the same shared-pool for tegras.
(In reply to comment #21)
> Is try maybe not doing the same things to clean up after itself?
Excellent question.
Assignee
Comment 32•13 years ago
Taking this bug, as I know I can detect the pattern by checking the buildslave log and adjusting/adding some code to "check.py -c".
Assignee: nobody → bear
Updated•13 years ago
Whiteboard: [orange][mobile_unittests] → [orange][mobile_unittests][android_tier_1]
Comment 34•13 years ago
(In reply to Mike Taylor [:bear] from comment #32)
> taking this bug as I know I can detect the pattern by checking the
> buildslave log and adjusting/adding some code to "check.py -c"
No update here in over a month - any progress on this android_tier_1 bug?
Reporter
Comment 35•13 years ago
Reporter
Comment 36•13 years ago
Reporter
Comment 37•13 years ago
Reporter
Comment 38•13 years ago
Reporter
Comment 39•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369488
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369546
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369547
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369491
Seems a bit odd that this failure mode, which is pretty rare now on mozilla-inbound and mozilla-central, was four of the five failures on a tryserver run someone had me look at.
Reporter
Comment 40•13 years ago
Reporter
Comment 41•13 years ago
Reporter
Comment 42•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374647
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375089
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374602
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375023
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374634
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375017
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375088
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375152
Reporter
Comment 43•13 years ago
Reporter
Comment 44•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6377126
https://tbpl.mozilla.org/php/getParsedLog.php?id=6376333
https://tbpl.mozilla.org/php/getParsedLog.php?id=6376340
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374766
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374728
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374767
https://tbpl.mozilla.org/php/getParsedLog.php?id=6374763
Reporter
Comment 45•13 years ago
Apparently just coincidence that I wasn't seeing it off try for a while. And while the summary was accurate once upon a time, I doubt we're even running, much less timing out, enough crashtests to account for all of these.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375306
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375311
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375151
Reporter
Comment 46•13 years ago
Reporter
Comment 47•13 years ago
Summary: Android mochitests fail with "Timed out while waiting for server startup" after the slave previously timed out during crashtest shutdown → Android tests fail with "Timed out while waiting for server startup"
Reporter
Comment 48•13 years ago
Reporter
Comment 49•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384659&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6385117&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384578&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384579&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384762&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384655&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6385146&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384574&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384573&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384668&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6384656&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6385088&tree=Mozilla-Inbound
Reporter
Comment 50•13 years ago
Reporter
Comment 51•13 years ago
Reporter
Comment 52•13 years ago
Reporter
Comment 53•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6392672&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6396355&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6394995&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6396307&tree=Mozilla-Inbound
Reporter
Comment 54•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6397516&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6395136&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6393668&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6393524&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6395849&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6393615&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6398680&tree=Mozilla-Inbound
Comment 55•13 years ago
In remote tegra testing there is an increasingly common exception that we throw, "unable to launch process". This is not caught and we don't return an error code from the test. Consequently we never clean up the web server.
Attachment #560199 - Flags: review?(jgriffin)
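The general shape of that change, as an illustrative sketch rather than the actual attachment (launch, clean_up_servers, and the exit code are placeholder names):

def run_tests(launch, clean_up_servers, log):
    """Illustrative only: make sure a launch failure still shuts the test web
    server down and reports a real exit status instead of dying uncaught."""
    try:
        status = launch()        # may raise the "unable to launch process" exception
    except Exception as e:
        log("Automation Error: %s" % e)
        clean_up_servers()       # stop httpd.js / xpcshell so the next job can start its own server
        return 1                 # nonzero so buildbot records a failure instead of a lost exception
    clean_up_servers()
    return status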
Comment 56•13 years ago
Comment on attachment 560199 [details] [diff] [review]
add exception handling to runapp so we can exit cleanly on tegra mochitest (1.0)
Will this cause the color of the failures of this class to change from purple to orange on TBPL, and if so, does it matter? If it does, you could call a new function in the except clause, something like shutDownServers, and then re-raise the exception.
Attachment #560199 - Flags: review?(jgriffin) → review+
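The alternative suggested above would keep the failure an uncaught exception (and therefore purple on TBPL) while still cleaning up; a rough sketch, with shutDownServers being the suggested placeholder name:

def run_tests_reraise(launch, shutDownServers):
    """Sketch of the cleanup-and-re-raise alternative (names are placeholders)."""
    try:
        return launch()
    except Exception:
        shutDownServers()   # clean up the zombie web server first...
        raise               # ...then re-raise so the run still shows as an infrastructure exception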
Reporter
Comment 57•13 years ago
Color is pretty much meaningless: I'd love to be wrong, but I don't think that |buildduty or |ateamduty trawls through every tree looking at every purple, and orange/red have long since lost all distinction. "Purple" is a failure that's purple, orange, or red, with a bug filed in releng or Testing. ("Purple" is also a couple of test failures that wind up mistakenly purple, or a no-output test timeout on WinXP where we lost the ability to kill the process a few weeks ago, but whatever.)
Comment 58•13 years ago
Yes, these will change from purple to orange, but all cleanup will happen, hopefully leaving the state of the host and tegra clean for future runs.
As :philor mentions, we are sort of in a blur of color. According to bear, we need to have return codes in all situations; this was not happening, nor were we seeing the exception in the public-facing logs :(
:jgriffin, I don't mind implementing your alternative method to clean up and rethrow, but I would request we keep it this way so the buildbot code can handle the error more gracefully.
Comment 59•13 years ago
Just realized jgriffin was not cc'd on this bug, adding him so he can reply to comment 58 above.
Comment 60•13 years ago
(In reply to Joel Maher (:jmaher) from comment #58)
> yes, these will change from purple to orange, but all cleanup will happen
> and hopefully cleaning up the state of the host and tegra for future runs.
>
> As :philor mentions we are sort of in a blur of color. According to bear,
> we need to have return codes in all situations, this was not happening, nor
> were we seeing the exception in the public facing logs :(
>
> :jgriffin, I don't mind implementing your alternative method to cleanup and
> rethrow, but I would request we keep it this way so the builtbot code can
> handle the error more gracefully.
If this helps buildbot handle the error gracefully, that's great. I just wasn't sure whether or not keeping exceptions in the test infrastructure a certain color was important or not.
Updated•13 years ago
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1][inbound]
Reporter
Comment 61•13 years ago
Reporter
Comment 62•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6405093&tree=Mozilla-Inbound (on the push itself, so it didn't yet get to benefit from it)
Comment 63•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [orange][mobile_unittests][android_tier_1][inbound] → [orange][mobile_unittests][android_tier_1]
Comment 64•13 years ago
Reporter
Comment 65•13 years ago
Going to be hard to draw a line between "these are still fine, they're from Tegras that went into a bad state before the patch landed" and "these are a new bug from a different cause of the bad state."
https://tbpl.mozilla.org/php/getParsedLog.php?id=6410814&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6411561&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6412339&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6412341&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6409895&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6409803&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6409888&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6410186&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6410056&tree=Firefox
Reporter
Comment 66•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6413177&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6410896&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=6410642&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=6413705&tree=Mozilla-Beta
https://tbpl.mozilla.org/php/getParsedLog.php?id=6413979&tree=Firefox
Reporter
Comment 67•13 years ago
Comment 68•13 years ago
Comment 69•13 years ago
Updated•12 years ago
Keywords: intermittent-failure
Updated•12 years ago
Whiteboard: [orange][mobile_unittests][android_tier_1] → [mobile_unittests][android_tier_1]
Updated•11 years ago
Product: mozilla.org → Release Engineering