Sometimes geckodriver kills the browser already 1s after starting it
Categories
(Testing :: geckodriver, defect, P1)
Tracking
(firefox81 fixed)
Tracking | Status | |
---|---|---|
firefox81 | --- | fixed |
People
(Reporter: whimboo, Assigned: whimboo)
References
Details
Attachments
(1 file)
(deleted),
text/x-phabricator-request
|
Details |
I have seen it already in the past but never filed a bug for it. See the following log:
https://taskcluster-artifacts.net/Nm24orzAS_e6ahY8r_IcqQ/0/public/logs/live_backing.log
[task 2019-08-19T14:44:19.391Z] 14:44:19 INFO - PID 5036 | 1566225859381 mozrunner::runner INFO Running command: "Z:\\task_1566220377\\build\\application\\firefox\\firefox.exe" "-marionette" "-foreground" "-no-remote" "-profile" "C:\\Users\\task_1566220377\\AppData\\Local\\Temp\\rust_mozprofileDCDfsW"
[task 2019-08-19T14:44:19.406Z] 14:44:19 INFO - PID 5036 | 1566225859390 geckodriver::marionette DEBUG Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2019-08-19T14:44:19.821Z] 14:44:19 INFO - PID 5036 | [3516, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/permissions/nsPermissionManager.cpp, line 2901
[task 2019-08-19T14:44:20.504Z] 14:44:20 INFO - PID 5036 | 1566225860490 mozrunner::runner DEBUG Killing process 5528
Why is Firefox killed after 1s while we expect to wait 60s for its startup? Could the failed Permission check result in an early exit of Firefox? Mark given that you added that code, mind giving us some feedback?
Comment 1•5 years ago
|
||
I don't recall what code you are referring to - do you have a bug number of revision?
Assignee | ||
Comment 2•5 years ago
|
||
It's for nsPermissionManager.cpp, line 2901
which is part of the above comment. It was implemented as part of bug 506446.
If this cannot cause a very early exit of the process, there is really something wrong with geckodriver. Thanks!
Comment 3•5 years ago
|
||
Sorry, I mis-understood and thought you meant I added code which killed Firefox after some time. The error being referenced is NS_ERROR_FILE_NOT_FOUND trying to import default permissions, and I guess the NS_ENSURE_SUCCESS macro will indeed kill debug versions of Firefox. It's probably not necessary for that failure to be fatal, but it's probably worth trying to understand why the default permissions are missing.
Assignee | ||
Comment 4•5 years ago
|
||
Mark, which file is that actually? One from the profile, or from the application directory? If I would know the name I could have a look at the builds.
Comment 5•5 years ago
|
||
I believe it is whatever the pref permissions.manager.defaultsUrl
points at - I'm afraid I can't remember what that was 5 years ago, and even if I could, it could well have changed since then.
Assignee | ||
Comment 6•5 years ago
|
||
So far this problem has only been manifested on Windows 7 32bit, and specifically for debug builds.
In regards of the pref I had a look at about:config
, and it refers to resource://app/defaults/permissions
. Maybe for Firefox this comes from https://searchfox.org/mozilla-central/source/browser/app/permissions.
But having a look at the import again which happens here:
The file which actually gets imported is hostperm.1
(HOSTPERM_FILE_NAME
) from the user's profile. But it looks like this will only succeed for kinda old profiles, which are still prepared with this file. At least for mozprofile generated profiles we use permissions.sqlite
.
So maybe this is all just red herring, and there might indeed be a geckodriver issue here, and not waiting longer than 1s before killing Firefox.
Assignee | ||
Comment 7•5 years ago
|
||
Indeed when looking at the same log and observing start-ups of Firefox from previous tests the same log entry exists:
[task 2019-08-19T14:27:26.905Z] 14:27:26 INFO - PID 2272 | 1566224846894 geckodriver::marionette DEBUG Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2019-08-19T14:27:27.693Z] 14:27:27 INFO - PID 2272 | [3820, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/permissions/nsPermissionManager.cpp, line 2901
[..]
[task 2019-08-19T14:27:29.450Z] 14:27:29 INFO - PID 2272 | 1566224849443 Marionette TRACE Received observer notification command-line-startup
[task 2019-08-19T14:27:29.451Z] 14:27:29 INFO - PID 2272 | 1566224849443 Marionette TRACE Received observer notification nsPref:changed
[task
[..]
[task 2019-08-19T14:27:34.043Z] 14:27:34 INFO - PID 2272 | 1566224854028 Marionette INFO Listening on port 2828
[task 2019-08-19T14:27:34.043Z] 14:27:34 INFO - PID 2272 | 1566224854029 Marionette DEBUG Remote service is active
So this log entry clearly hasn't something to do with that early kill of the application.
Assignee | ||
Comment 8•5 years ago
|
||
The line in mozrunner, which causes this early kill of Firefox is:
But I fail to parse the marionette.rs code in how we reach that line:
But maybe this is a problem with calling TcpStream::connect()
?
Andreas, any idea?
Assignee | ||
Comment 9•5 years ago
|
||
Maybe the following patch from bug 1525126 might help us to get more information due to the additional added debug output.
Comment 10•5 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)
The line in mozrunner, which causes this early kill of Firefox is:
But I fail to parse the marionette.rs code in how we reach that line:
I think you may be mixing up try_wait()
on std::process::Child
and FirefoxProcess
.
The only time FirefoxProcess::try_wait(time::Duration)
is called is from MarionetteHandler::delete_session()
,
which is when we hit the loop you linked to in testing/mozbase/rust/mozrunner/src/runner.rs:138.
In MarionetteConnection::connect()
we call std::process::Child::try_wait()
directly,
which just tells if the process has exited yet.
The return value is a bit convoluted, but the API documentation is pretty good.
But maybe this is a problem with calling
TcpStream::connect()
?
TcpStream::connect()
shouldn’t cause the Process unexpectedly closed with status
as it ignores any connection error until the timeout has elapsed,
and then stringifies the std::io::Error
which would have a different string.
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)
Maybe the following patch from bug 1525126 might help us to get more information due to the additional added debug output.
Yes, I was about to say the same thing. I think that is worth waiting for.
Assignee | ||
Comment 11•5 years ago
|
||
The fix on bug 1525126 should have fixed that. I will revisit this bug in a couple of weeks, and after the 0.26.0 release went out.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 13•4 years ago
|
||
So this is still not fixed. Here another case from bug 1659967:
[task 2020-08-19T11:17:29.527Z] 11:17:29 INFO - PID 891 | 1597835849515 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "--marionette" "-foreground" "-no-remote" "-profile" "/tmp/rust_mozprofiles0QlUe"
[task 2020-08-19T11:17:29.527Z] 11:17:29 INFO - PID 891 | 1597835849515 geckodriver::marionette DEBUG Waiting 60s to connect to browser on 127.0.0.1:2828
[task 2020-08-19T11:17:29.623Z] 11:17:29 INFO - PID 891 | 1597835849616 mozrunner::runner DEBUG Killing process 900
James, do you have an idea why the connection attempt to Marionette is aborted within 1s? See also Andreas' last comment 10 from a year ago. The related code is here:
https://searchfox.org/mozilla-central/source/testing/geckodriver/src/marionette.rs#1307-1355
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 14•4 years ago
|
||
Actually all the test jobs for that platform failed due to application terminated with exit code -11
(see bug 1659965):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=157db696462d8a98905d0f8697088aa97cb6e08f&searchStr=ccov%2Clinux
So we don't seem to report the error correctly when Firefox dies during the connection attempt.
Comment 15•4 years ago
|
||
Right, from the code it looks like the most likely problem is a crash during startup that gets reported as a UnknownError i.e. a 500 but where we aren't surfacing the error message. I think we should add logging in geckodriver for that case so it appears in the stderr not just in the protocol response.
Assignee | ||
Comment 16•4 years ago
|
||
Oh, well we actually override the error by force-killing the browser even when the process doesn't exist:
As such the original error message is lost and we always report can't kill an exited process
, which totally makes sense.
Will be easy to fix.
Assignee | ||
Comment 17•4 years ago
|
||
Comment 18•4 years ago
|
||
Comment 19•4 years ago
|
||
bugherder |
Description
•