Closed
Bug 1362293
Opened 8 years ago
Closed 7 years ago
Intermittent IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)
Categories
(Testing :: Marionette Client and Harness, defect, P3)
Tracking
(firefox57 fixed)
RESOLVED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox57 | --- | fixed |
People
(Reporter: robert.strong.bugs, Assigned: whimboo)
References
Details
(Keywords: intermittent-failure)
https://queue.taskcluster.net/v1/task/QhF7g-_ITg2FioQIg0SVjw/runs/0/artifacts/public/logs/live_backing.log
[task 2017-05-05T04:17:14.989160Z] 04:17:14 INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py', '--gecko-log=/home/worker/workspace/build/blobber_upload_dir/gecko.log', '--log-raw=-', '-vv', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/marionette_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/marionette_errorsummary.log', '--log-html=/home/worker/workspace/build/blobber_upload_dir/report.html', '--binary=/home/worker/workspace/build/application/firefox/firefox', '--address=localhost:2828', '--symbols-path=https://queue.taskcluster.net/v1/task/WkysWBE5T7aqhYU4g63dIA/artifacts/public/build/target.crashreporter-symbols.zip', '/home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit-tests.ini'] with output_timeout 1000
[task 2017-05-05T04:17:15.409099Z] 04:17:15 INFO - Using workspace for temporary data: "/home/worker/workspace"
[task 2017-05-05T04:17:15.411063Z] 04:17:15 INFO - mozversion application_buildid: 20170505035020
[task 2017-05-05T04:17:15.412802Z] 04:17:15 INFO - mozversion application_changeset: bd5e30c8066b3a077822c894024bd752e8e9792c
[task 2017-05-05T04:17:15.414355Z] 04:17:15 INFO - mozversion application_display_name: Nightly
[task 2017-05-05T04:17:15.415891Z] 04:17:15 INFO - mozversion application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
[task 2017-05-05T04:17:15.417251Z] 04:17:15 INFO - mozversion application_name: Firefox
[task 2017-05-05T04:17:15.418834Z] 04:17:15 INFO - mozversion application_remotingname: firefox
[task 2017-05-05T04:17:15.420019Z] 04:17:15 INFO - mozversion application_repository: https://hg.mozilla.org/projects/oak
[task 2017-05-05T04:17:15.420578Z] 04:17:15 INFO - mozversion application_vendor: Mozilla
[task 2017-05-05T04:17:15.421303Z] 04:17:15 INFO - mozversion application_version: 55.0a1
[task 2017-05-05T04:17:15.421400Z] 04:17:15 INFO - mozversion platform_buildid: 20170505035020
[task 2017-05-05T04:17:15.422045Z] 04:17:15 INFO - mozversion platform_changeset: bd5e30c8066b3a077822c894024bd752e8e9792c
[task 2017-05-05T04:17:15.422652Z] 04:17:15 INFO - mozversion platform_repository: https://hg.mozilla.org/projects/oak
[task 2017-05-05T04:17:15.422795Z] 04:17:15 INFO - mozversion platform_version: 55.0a1
[task 2017-05-05T04:17:15.423438Z] 04:17:15 INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -no-remote -marionette -profile /tmp/tmpr1nxNw.mozrunner
[task 2017-05-05T04:19:21.850904Z] 04:19:21 ERROR - Failure during harness execution
[task 2017-05-05T04:19:21.861006Z] 04:19:21 ERROR - Traceback (most recent call last):
[task 2017-05-05T04:19:21.861065Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.867418Z] 04:19:21 ERROR - File "/home/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py", line 92, in cli
[task 2017-05-05T04:19:21.867513Z] 04:19:21 ERROR - failed = harness_instance.run()
[task 2017-05-05T04:19:21.872985Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.873080Z] 04:19:21 ERROR - File "/home/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py", line 72, in run
[task 2017-05-05T04:19:21.873140Z] 04:19:21 ERROR - runner.run_tests(tests)
[task 2017-05-05T04:19:21.873185Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.873275Z] 04:19:21 ERROR - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_harness/runner/base.py", line 818, in run_tests
[task 2017-05-05T04:19:21.873347Z] 04:19:21 ERROR - self.marionette = self.driverclass(**self._build_kwargs())
[task 2017-05-05T04:19:21.873393Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.873770Z] 04:19:21 ERROR - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 611, in __init__
[task 2017-05-05T04:19:21.874316Z] 04:19:21 ERROR - self.raise_for_port(timeout=startup_timeout)
[task 2017-05-05T04:19:21.874833Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.875342Z] 04:19:21 ERROR - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-05-05T04:19:21.875885Z] 04:19:21 ERROR - m._handle_socket_failure()
[task 2017-05-05T04:19:21.876393Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.877010Z] 04:19:21 ERROR - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-05-05T04:19:21.877587Z] 04:19:21 ERROR - return func(*args, **kwargs)
[task 2017-05-05T04:19:21.880531Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.881105Z] 04:19:21 ERROR - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 697, in raise_for_port
[task 2017-05-05T04:19:21.881668Z] 04:19:21 ERROR - self.host, self.port))
[task 2017-05-05T04:19:21.882240Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.882884Z] 04:19:21 ERROR - IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)
[task 2017-05-05T04:19:21.883634Z] 04:19:21 ERROR -
[task 2017-05-05T04:19:21.927997Z] 04:19:21 ERROR - Return code: 1
[task 2017-05-05T04:19:21.928795Z] 04:19:21 ERROR - No checks run.
Reporter | ||
Updated•8 years ago
|
Component: Firefox UI Tests → Marionette
QA Contact: hskupin
Assignee | ||
Comment 1•8 years ago
|
||
Robert, do you have the treeherder link for me? I would also need the gecko.log to examine the details. Thanks.
Flags: needinfo?(robert.strong.bugs)
Reporter | ||
Comment 2•8 years ago
|
||
Flags: needinfo?(robert.strong.bugs)
Reporter | ||
Updated•8 years ago
|
Keywords: intermittent-failure
Reporter | ||
Updated•8 years ago
|
Summary: IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!) → Intermittent IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)
Assignee | ||
Comment 3•8 years ago
|
||
Th raw log doesn't contain the output from the gecko process. So here it is:
https://public-artifacts.taskcluster.net/QhF7g-_ITg2FioQIg0SVjw/0/public/test_info//gecko.log
As we can see Marionette started successfully on port 2828 at 1493957903,384 (Fri, 05 May 2017 04:18:23 GMT). The above log tells that Firefox got started at 04:17:15, so it took a bit more than 1 minute to bring it up, and the observer notification "sessionstore-windows-restored" sent.
The Python client timed out at 04:19:21, which is an additional minute. Given that it constantly tries on this port, I could only assume that there are more delays in Firefox itself. But hard to tell. Maybe a try build with enabled socket log might help if this problem is more often occurring in the future.
Assignee | ||
Comment 4•8 years ago
|
||
CC'ing Geoff because he is working on a couple of other investigations when we have a busted Firefox startup, like bug 1352671 or bug 1261598.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 13•7 years ago
|
||
In the case of an ASAN build I see the following output of Marionette in the log:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=111383412&lineNumber=1204
[task 2017-07-01T21:34:32.594790Z] 21:34:32 INFO - 1498944872582 Marionette INFO Listening on port 2828
[task 2017-07-01T21:34:32.692007Z] 21:34:32 INFO - JavaScript error: chrome://marionette/content/server.js, line 350: NS_ERROR_SOCKET_ADDRESS_IN_USE: Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]
This is strange because the `listening` output line should only be visible when the server got correctly started, and then even afterward. I wonder if there is a situation when we try to create the server twice. I will file a Marionette bug for that.
Comment hidden (Intermittent Failures Robot) |
Comment 15•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [partly available 07/10
-07/14] from comment #13)
> In the case of an ASAN build I see the following output of
> Marionette in the log:
>
> https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=111383412&lineNumber=1204
>
> [task 2017-07-01T21:34:32.594790Z] 21:34:32 INFO - 1498944872582 Marionette INFO Listening on port 2828
> [task 2017-07-01T21:34:32.692007Z] 21:34:32 INFO - JavaScript error:
> chrome://marionette/content/server.js, line 350:
> NS_ERROR_SOCKET_ADDRESS_IN_USE: Component returned failure code: 0x804b0036
> (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]
>
> This is strange because the `listening` output line should only
> be visible when the server got correctly started, and then even
> afterward. I wonder if there is a situation when we try to create
> the server twice. I will file a Marionette bug for that.
In
https://searchfox.org/mozilla-central/rev/b7e531410ebc1c7d74a78e86a894e69608db318c/testing/marionette/components/marionette.js#257-280
there might, I suppose, be a race condition if
> if (this.running || !this.enabled || !this.finalUIStartup) {
matches and this.running has not yet been set to true, which only
happens on :275. Perhaps we need to set an initialisation flag
right at the beginning when this function is called to prevent it
from being called if e.g. we receive two calls to init() in close
succession?
There is also a possibility that we already have something listening
on port 2828. Ideally we should use port 0 to have the system
atomically assign a free port number, and read the used port number
from marionette.port before attempting to connect.
Assignee | ||
Comment 16•7 years ago
|
||
Please see bug 1254136 for ongoing work about the double server socket start() call. Sorry for marking the wrong bug as dependency.
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 22•7 years ago
|
||
With bug 1254136 fixed we get the following observer notifications and correctly start the server:
[task 2017-08-17T14:53:48.609571Z] 14:53:48 INFO - 1502981628606 Marionette DEBUG Received observer notification "profile-after-change"
[task 2017-08-17T14:54:05.775492Z] 14:54:05 INFO - 1502981645767 Marionette DEBUG Received observer notification "command-line-startup"
[task 2017-08-17T14:54:05.776573Z] 14:54:05 INFO - 1502981645767 Marionette INFO Enabled via --marionette
[task 2017-08-17T14:55:16.219257Z] 14:55:16 INFO - 1502981716209 Marionette DEBUG Received observer notification "sessionstore-windows-restored"
[task 2017-08-17T14:55:21.041807Z] 14:55:21 INFO - 1502981721035 Marionette INFO Listening on port 2828
It means that whether the client is not constantly trying to connect, or it cannot establish a connection because it gets blocked. I would like to see some more failure reports especially for debug builds over the next days.
Comment hidden (Intermittent Failures Robot) |
Comment 24•7 years ago
|
||
:whimboo- I see this reporting a medium frequency of failures last week and continuing on this week:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1362293&startday=2017-08-14&endday=2017-08-25&tree=all
is there enough data to see a pattern or figure out what is going on?
Flags: needinfo?(hskupin)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 26•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #24)
> :whimboo- I see this reporting a medium frequency of failures last week and
> continuing on this week:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1362293&startday=2017-08-14&endday=2017-08-25&tree=all
>
> is there enough data to see a pattern or figure out what is going on?
The only idea I have here is that we do something similar as what geckodriver does in sending info level logging to stdout for each connection attempt from the client. Then we could see if the client still tries to connect, and why it fails.
I will file a new bug for this particular improvement.
Flags: needinfo?(hskupin)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 28•7 years ago
|
||
With my patch on bug 1394381 landed for a short moment on autoland before it got backed out, I was able to reproduce this bug. So here some logging:
https://treeherder.mozilla.org/logviewer.html#?job_id=129146798&repo=autoland&lineNumber=1431
[task 2017-09-07T06:31:53.982817Z] 06:31:53 INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -no-remote -marionette -profile /tmp/tmpf_6SeT.mozrunner
[task 2017-09-07T06:31:53.985692Z] 06:31:53 INFO - Connection attempt to Marionette failed ([Errno 111] Connection refused)
[..]
[task 2017-09-07T06:32:45.113433Z] 06:32:45 INFO - Connection attempt to Marionette failed ([Errno 111] Connection refused)
[task 2017-09-07T06:32:46.120309Z] 06:32:46 INFO - Connection attempt to Marionette failed ([Errno 111] Connection refused)
[task 2017-09-07T06:32:46.621424Z] 06:32:46 INFO - 1504765966607 Marionette DEBUG Received observer notification "command-line-startup"
[task 2017-09-07T06:32:46.623572Z] 06:32:46 INFO - 1504765966608 Marionette INFO Enabled via --marionette
[..]
[task 2017-09-07T06:32:53.145463Z] 06:32:53 INFO - Connection attempt to Marionette failed ([Errno 111] Connection refused)
[..]
[task 2017-09-07T06:32:58.746529Z] 06:32:58 INFO - 1504765978742 Marionette DEBUG Received observer notification "sessionstore-windows-restored"
[..]
[task 2017-09-07T06:33:12.072290Z] 06:33:12 INFO - 1504765992061 Marionette INFO Listening on port 2828
[..]
[task 2017-09-07T06:34:55.255229Z] 06:34:55 ERROR - Failure during harness execution
As you can see we start Firefox and Marionette client is trying to connect to the server. About a minute later we receive the "sessionstore-windows-restored" observer notification which starts the initialization of Marionette, and the server is up and running another 14s later. Interestingly any connection attempt from the client stopped after 06:32:53, which is exactly 1 minute after we started the application. The startup timeout in use here is by default 120s, and the mozharness config also doesn't override it as I can see here. Even if we would run into this timeout an exception would be thrown. But nothing is seen here.
So something I will try now is to force delaying the initialization of the server component to see if I can reproduce an early abort of `wait_for_port`.
Assignee | ||
Comment 29•7 years ago
|
||
Ok, the problem here is because of the following situation... Because of bug 1397734 the Marionette harness sets a timeout value for startup of the browser until the server is ready to 60s. This differs from what Marionette driver has as default, and perfectly explains what I was seeing in comment 28. Additionally to that bug 1397675 plays into account because we delay the forced shutdown by 120s by waiting for the application to quit, which doesn't make sense at all at this point.
With both bugs fixed this failure should go away when we find a sane default for our jobs in CI.
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Priority: -- → P3
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 32•7 years ago
|
||
There are no more aborts visible since Sep 12th which is nearly a week. Before we had multiple failures a day. So I would say this is fixed now by my patches as referenced in comment 29.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Product: Testing → Remote Protocol
Comment 34•2 years ago
|
||
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing
You need to log in
before you can comment on or make changes to this bug.
Description
•