Closed
Bug 809437
Opened 12 years ago
Closed 12 years ago
Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start" or "TypeError: %d format: a number is required, not NoneType"
Categories
(Testing :: Marionette Client and Harness, defect)
Testing
Marionette Client and Harness
Tracking
(firefox17 unaffected, firefox18 affected, firefox19 affected)
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
firefox17 | --- | unaffected |
firefox18 | --- | affected |
firefox19 | --- | affected |
People
(Reporter: ahal, Assigned: ahal)
References
Details
(Keywords: intermittent-failure)
Attachments
(2 files, 6 obsolete files)
(deleted),
patch
|
mdas
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review |
I spied: https://tbpl.mozilla.org/php/getParsedLog.php?id=16822692&tree=Mozilla-Inbound&full=1
It looks like there is a race condition when stopping/starting b2g: http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/emulator.py#404
Assignee | ||
Comment 1•12 years ago
|
||
Another one: https://tbpl.mozilla.org/php/getParsedLog.php?id=16818281&tree=Mozilla-Inbound&full=1
In both cases there was a retry when pushing one of the files (see bug 800102 (sigh))
Assignee | ||
Comment 2•12 years ago
|
||
Nice, a third one that doesn't have a retried push: https://tbpl.mozilla.org/php/getParsedLog.php?id=16823573&tree=Mozilla-Inbound&full=1
I'd bet a sleep would fix the issue, but I'll try and figure out what event we need to be waiting for before we're supposed to restart the emulator.
Updated•12 years ago
|
Blocks: 438871
Summary: Possible race condition when restarting b2g in install_gecko → Intermitent B2G emulator "Could not successfully complete transport of message to Gecko, socket closed?", due to possible race condition when restarting b2g in install_gecko
Whiteboard: [orange]
Comment 3•12 years ago
|
||
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Assignee | ||
Comment 4•12 years ago
|
||
Ed, this shouldn't cause any problems with the tbpl error regexes should it?
Attachment #679232 -
Flags: review?(jgriffin)
Attachment #679232 -
Flags: feedback?(bmo)
Comment 5•12 years ago
|
||
Comment on attachment 679232 [details] [diff] [review]
Patch 1.0 - ensure b2g process has stopped
Review of attachment 679232 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. I think we should add these two new strings to the error list of the mozharness scripts, so that they'll turn the jobs red:
http://hg.mozilla.org/build/mozharness/file/739cfaf3d165/scripts/b2g_emulator_unittest.py#l80
http://hg.mozilla.org/build/mozharness/file/739cfaf3d165/scripts/marionette.py#l80
Attachment #679232 -
Flags: review?(jgriffin) → review+
Comment 6•12 years ago
|
||
Comment on attachment 679232 [details] [diff] [review]
Patch 1.0 - ensure b2g process has stopped
These are fine from a TBPL POV (whilst TBPL won't recognise them as is; with the comment 5 change plus my patch in bug 809436, it will).
Thank you for checking though :-)
Attachment #679232 -
Flags: feedback?(bmo) → feedback+
Comment 7•12 years ago
|
||
Updated•12 years ago
|
Whiteboard: [orange] → [orange][automation-needed-in-aurora]
Assignee | ||
Comment 8•12 years ago
|
||
I saw this yet again on inbound :(
https://tbpl.mozilla.org/php/getParsedLog.php?id=16837096&tree=Mozilla-Inbound&full=1
Whiteboard: [orange][automation-needed-in-aurora] → [orange][automation-needed-in-aurora][leave-open]
Assignee | ||
Comment 9•12 years ago
|
||
This has been a pretty annoying problem as I can't reproduce it locally at all. All I know is that we never had these timeouts before when we stopped b2g before copying the files. Perhaps this will finally be the fix.
Attachment #679426 -
Flags: review?(jgriffin)
Comment 10•12 years ago
|
||
Comment on attachment 679426 [details] [diff] [review]
Patch 1.1 - Possible fix
Review of attachment 679426 [details] [diff] [review]:
-----------------------------------------------------------------
Worth a shot!
Attachment #679426 -
Flags: review?(jgriffin) → review+
Comment 11•12 years ago
|
||
Assignee | ||
Comment 12•12 years ago
|
||
Nope. If anything that last one seems to have made it worse. Next idea is to go back to pushing the files to /data/local first and then copying to /system/b2g like we had it before. Maybe that combined with the retrying of the push files will work.
Comment 13•12 years ago
|
||
Flags: in-testsuite+
Comment 14•12 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #12)
> Nope. If anything that last one seems to have made it worse.
Feel free to backout said broken patches next time. This was causing a lot of orange/red on b2g Arm opt builds.
Backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ab2043a4669
Assignee | ||
Comment 15•12 years ago
|
||
Sorry about that, I was the only one around when I noticed and I don't have level 3 :)
Comment 16•12 years ago
|
||
Merged the current state into aurora: https://hg.mozilla.org/releases/mozilla-aurora/rev/f183e4e73382
Assignee | ||
Comment 17•12 years ago
|
||
Attachment #679775 -
Flags: review?(jgriffin)
Updated•12 years ago
|
Attachment #679775 -
Flags: review?(jgriffin) → review+
Comment 18•12 years ago
|
||
Comment 19•12 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #15)
> Sorry about that, I was the only one around when I noticed and I don't have
> level 3 :)
You can always ping a sheriff on IRC.
(In reply to Jonathan Griffin (:jgriffin) from comment #18)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/53d5080a6749
Backed out for b2g mochitest orange.
https://hg.mozilla.org/integration/mozilla-inbound/rev/960037a36758
13:40:45 WARNING - TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
13:43:54 ERROR - Return code: 1
Assignee | ||
Comment 20•12 years ago
|
||
The previous patch timed out in automation.py (not waiting for marionette like what this bug is about). There was an error loading libxpcom.so in both the reftests and mochitests (though apparently it didn't matter for reftest).
I don't know why this is, libxpcom.so was getting copied to /system/b2g like it was supposed to. Maybe pushing to /data/local vs /system/b2g results in different file permissions?
Assignee | ||
Comment 22•12 years ago
|
||
Updating title to reflect the updated error message from attachment 679232 [details] [diff] [review].
Summary: Intermitent B2G emulator "Could not successfully complete transport of message to Gecko, socket closed?", due to possible race condition when restarting b2g in install_gecko → Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'"
Comment 23•12 years ago
|
||
Comment 24•12 years ago
|
||
Comment 25•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=16935211&tree=Mozilla-Inbound
(and a dozen or so starred by people who think it's a good idea not to paste logs)
Comment 26•12 years ago
|
||
Updated•12 years ago
|
status-firefox17:
--- → unaffected
status-firefox18:
--- → affected
status-firefox19:
--- → affected
Comment 27•12 years ago
|
||
Comment 28•12 years ago
|
||
Inconveniently, we still fail with both the previous summary and the current summary for this bug:
https://tbpl.mozilla.org/php/getParsedLog.php?id=16957633&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=16957322&tree=Mozilla-Inbound
Assignee | ||
Comment 29•12 years ago
|
||
Somehow this patch: https://hg.mozilla.org/try/rev/b5744527e483
Resulted in this orange: https://tbpl.mozilla.org/php/getParsedLog.php?id=16960991&tree=Try
I'm starting to think the easiest way to solve this is to get releng to provide full stack emulator builds.
Comment 30•12 years ago
|
||
Maybe we should just put a time.sleep(10) between 'stop b2g' and 'start b2g'. Yucky, I know.
Comment 31•12 years ago
|
||
Comment 32•12 years ago
|
||
Comment 33•12 years ago
|
||
Assignee | ||
Comment 34•12 years ago
|
||
So I think this patch fixes the issue. I re-triggered it a whole bunch of times on try and haven't seen any timeouts: https://tbpl.mozilla.org/?tree=Try&rev=80318b816089
I also don't think that is enough evidence to call it fixed, but I think we should land this and keep an eye out in the future.
Attachment #680832 -
Flags: review?(jgriffin)
Updated•12 years ago
|
Attachment #680832 -
Flags: review?(jgriffin) → review+
Comment 35•12 years ago
|
||
Comment 36•12 years ago
|
||
Comment 37•12 years ago
|
||
Comment 38•12 years ago
|
||
Comment 39•12 years ago
|
||
Assignee | ||
Comment 40•12 years ago
|
||
status-firefox17:
unaffected → ---
status-firefox18:
affected → ---
status-firefox19:
affected → ---
Assignee | ||
Updated•12 years ago
|
status-firefox17:
--- → unaffected
status-firefox18:
--- → affected
status-firefox19:
--- → affected
Assignee | ||
Updated•12 years ago
|
Whiteboard: [orange][automation-needed-in-aurora][leave-open] → [orange][automation-needed-in-aurora]
Assignee | ||
Comment 41•12 years ago
|
||
Evidently my 40+ try re-triggers didn't actually manage to test the failing codepath, otherwise this typo would have been caught.
https://tbpl.mozilla.org/php/getParsedLog.php?id=16991177&tree=Mozilla-Inbound
Assignee | ||
Updated•12 years ago
|
Attachment #681033 -
Flags: review?(mdas)
Updated•12 years ago
|
Attachment #681033 -
Flags: review?(mdas) → review+
Assignee | ||
Comment 42•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3b7e68ef6a0
So seeing as all those try re-triggers never hit the failing codepath, I have no idea if this is actually fixed or not. I'll keep an eye out.
Assignee | ||
Comment 43•12 years ago
|
||
When we re-tried the restart this happened: https://tbpl.mozilla.org/php/getParsedLog.php?id=16994991&tree=Mozilla-Inbound
Possible solutions:
1) Restart whole emulator instead of b2g process
2) Get releng to provide full stack emulator builds
3) Figure out why this is happening and fix properly (would require dumping logcat to the logs)
Whiteboard: [orange][automation-needed-in-aurora] → [orange][automation-needed-in-aurora][leave-open]
Comment 44•12 years ago
|
||
If we go with option 1), there is some code in emulator.py to handle userdata.img files, but none to handle system.img files, so we'd have to add that before rebooting the emulator would be a viable option.
We should pursue 3) regardless, since if the problem is related to restarting B2G and has nothing to do with installing gecko, then 2) won't help us. If we can verify that the problem only occurs when installing gecko, then we could ask for 2).
Comment 45•12 years ago
|
||
Comment 46•12 years ago
|
||
Comment 47•12 years ago
|
||
Assignee | ||
Comment 48•12 years ago
|
||
Updated•12 years ago
|
Summary: Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" → Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start"
Comment 49•12 years ago
|
||
Assignee | ||
Comment 50•12 years ago
|
||
I tried making a patch that restarts the emulator and preserves system.img, but I run into bug 802877
I have another patch that just does a dirty long sleep in between the b2g start/stop here: https://tbpl.mozilla.org/?tree=Try&rev=b75a849e2840
I'll re-trigger a whole bunch more times before I consider landing it though.
Comment 51•12 years ago
|
||
Comment 52•12 years ago
|
||
Comment 53•12 years ago
|
||
Comment 54•12 years ago
|
||
Comment 55•12 years ago
|
||
Comment 56•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=17055710&tree=Mozilla-Inbound
21:36:13 INFO - restarting B2G (attempt 2 of 10)
21:36:13 INFO - Traceback (most recent call last):
21:36:13 INFO - File "runreftestb2g.py", line 563, in <module>
21:36:13 INFO - sys.exit(main())
21:36:13 INFO - File "runreftestb2g.py", line 483, in main
21:36:13 INFO - marionette = Marionette(**kwargs)
21:36:13 INFO - File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/marionette.py", line 148, in __init__
21:36:13 INFO - self.emulator.install_gecko(self.gecko_path, self)
21:36:13 INFO - File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/emulator.py", line 436, in install_gecko
21:36:13 INFO - self._restart_b2g(marionette)
21:36:13 INFO - File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/emulator.py", line 380, in _restart_b2g
21:36:13 INFO - self.dm.shellCheckOutput(['stop', 'b2g'])
21:36:13 INFO - File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/mozdevice/devicemanager.py", line 65, in shellCheckOutput
21:36:13 INFO - raise DMError("Non-zero return code for command: %s (output: '%s', retval: '%i')" % (cmd, output, retval))
21:36:13 INFO - TypeError: %d format: a number is required, not NoneType
21:36:13 ERROR - Return code: 1
Summary: Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start" → Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start" or "TypeError: %d format: a number is required, not NoneType"
Comment 57•12 years ago
|
||
Comment 58•12 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #56)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=17055710&tree=Mozilla-
> Inbound
>
...
> 21:36:13 INFO - File
> "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/
> mozdevice/devicemanager.py", line 65, in shellCheckOutput
> 21:36:13 INFO - raise DMError("Non-zero return code for command: %s
> (output: '%s', retval: '%i')" % (cmd, output, retval))
> 21:36:13 INFO - TypeError: %d format: a number is required, not NoneType
> 21:36:13 ERROR - Return code: 1
This is bug 811916.
Depends on: 811916
Comment 59•12 years ago
|
||
Comment 60•12 years ago
|
||
Comment 61•12 years ago
|
||
Assignee | ||
Comment 62•12 years ago
|
||
This patch makes me sad. All I can say is that after 60 try retriggers the error hasn't been reproduced: https://tbpl.mozilla.org/?tree=Try&rev=b75a849e2840&noignore=1
I don't think that means much, but we need this bug fixed asap so I will try anything at this point.
Attachment #679232 -
Attachment is obsolete: true
Attachment #679426 -
Attachment is obsolete: true
Attachment #679775 -
Attachment is obsolete: true
Attachment #680832 -
Attachment is obsolete: true
Attachment #681033 -
Attachment is obsolete: true
Attachment #681987 -
Flags: review?(mdas)
Comment 63•12 years ago
|
||
Comment on attachment 681987 [details] [diff] [review]
Patch 4.0 - I will atone for my sins
Review of attachment 681987 [details] [diff] [review]:
-----------------------------------------------------------------
r+'d since we need this working ASAP, but with the caveat that we should file a follow up bug to investigate what we should be waiting for, instead of sleeping for an arbitrary amount of time.
Attachment #681987 -
Flags: review?(mdas) → review+
Assignee | ||
Comment 64•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/473d3748d994
Also filed bug 812190 as a follow up to fix this issue properly.
Whiteboard: [orange][automation-needed-in-aurora][leave-open] → [orange][automation-needed-in-aurora]
Assignee | ||
Comment 65•12 years ago
|
||
Whiteboard: [orange][automation-needed-in-aurora] → [orange][automation-needed-in-aurora][leave-open]
Assignee | ||
Comment 66•12 years ago
|
||
Pretty sure there is some difference on try that makes it less reproducible.
Assignee | ||
Comment 67•12 years ago
|
||
:jgriffin, here's the wip emulator restart patch I had if you want to take a look
Comment 68•12 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #67)
> Created attachment 682067 [details] [diff] [review]
> WIP restart emulator patch
>
> :jgriffin, here's the wip emulator restart patch I had if you want to take a
> look
I got the emulator restart stuff working, and I'll attach it to the bug for posterity, BUT it doesn't help us. Changes made to the system partition do persist between restarts. You can verify this manually by launching the emulator manually (using ./run-emulator.sh) and copying some file to /system/b2g. Then close the emulator, restart it, and the file will be missing. :(
Comment 69•12 years ago
|
||
This has some extra debugging info that would need to be taken out, but I don't intent to land it since it doesn't actually solve any problems.
Updated•12 years ago
|
Attachment #682067 -
Attachment is obsolete: true
Comment 70•12 years ago
|
||
We're basically down to two options, AFAICT:
1 - Get someone on the B2G team to try and figure out why B2G doesn't restart correctly in this situation and fix it
2 - Change the mozharness script to detect this kind of failure and re-run the test from scratch, up to N times
1 will likely take weeks, so that leaves us with 2. This is pretty brutal, but seems likely to work.
Comment 71•12 years ago
|
||
(In reply to Jonathan Griffin (:jgriffin) from comment #70)
> We're basically down to two options, AFAICT:
>
> 1 - Get someone on the B2G team to try and figure out why B2G doesn't
> restart correctly in this situation and fix it
> 2 - Change the mozharness script to detect this kind of failure and re-run
> the test from scratch, up to N times
>
> 1 will likely take weeks, so that leaves us with 2. This is pretty brutal,
> but seems likely to work.
Oh, there is an option 3: Have rel-eng produce full emulator builds, so we don't have to update gecko in the emulator. This may be the best medium-term solution. I think we should probably implement 2 in the interim, however.
Comment 72•12 years ago
|
||
According to the emulator docs (http://developer.android.com/tools/devices/emulator.html):
The emulator creates two writeable images at startup that it deletes at device power-off. The images are:
A writable copy of the Android system image
The /cache partition image
The emulator does not permit renaming the temporary system image or persisting it at device power-off.
Comment 73•12 years ago
|
||
Comment 74•12 years ago
|
||
(In reply to Ryan VanderMeulen from comment #73)
> https://hg.mozilla.org/mozilla-central/rev/473d3748d994
https://hg.mozilla.org/releases/mozilla-aurora/rev/9e49a48f8ef8
Comment 75•12 years ago
|
||
Comment 76•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=17093856&tree=Mozilla-Aurora
(At least it generously waited until the next push after your patch was pushed to aurora, instead of still failing directly on your push.)
Assignee | ||
Comment 77•12 years ago
|
||
(In reply to Jonathan Griffin (:jgriffin) from comment #71)
> medium-term solution. I think we should probably implement 2 in the
> interim, however.
This is bug 812395
Depends on: 812395
Comment 78•12 years ago
|
||
Comment 79•12 years ago
|
||
Comment 80•12 years ago
|
||
Comment 81•12 years ago
|
||
Comment 82•12 years ago
|
||
Also:
https://tbpl.mozilla.org/php/getParsedLog.php?id=17138342&tree=Mozilla-Inbound
{
11:10:59 ERROR - errors.MarionetteException: Could not successfully complete transport of message to Gecko, socket closed?
11:11:00 ERROR - Return code: 1
}
Comment 83•12 years ago
|
||
I just landed bug 812395 on inbound, which could possibly fix this problem.
Updated•12 years ago
|
Keywords: intermittent-failure
Comment 84•12 years ago
|
||
Comment 85•12 years ago
|
||
Assignee | ||
Comment 86•12 years ago
|
||
I landed bug 812395 on aurora and beta as well. I haven't seen this issue anywhere since landing so I'm going to resolve fixed this bug.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•12 years ago
|
Whiteboard: [orange][automation-needed-in-aurora][leave-open] → [orange]
Updated•12 years ago
|
Whiteboard: [orange]
Updated•2 years ago
|
Product: Testing → Remote Protocol
Comment 87•2 years ago
|
||
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing
You need to log in
before you can comment on or make changes to this bug.
Description
•