Open Bug 1240830 Opened 9 years ago Updated 2 years ago

[meta] Use system allocated port for Marionette server

Categories

(Testing :: Marionette Client and Harness, defect, P3)

defect

Tracking

(Not tracked)

People

(Reporter: Silne30, Unassigned)

References

(Depends on 3 open bugs)

Details

(Keywords: meta)

We need to have marionette find an open port to run tests. Per ato, we could have port=0 passed to the constructor and have that have special meaning.
OS: Unspecified → All
Hardware: Unspecified → All
To add some context to this: It’s inherently racy to first find a free port, then pass it to the Marionette server in Gecko for it to bind to. A more fail-safe option would be to ask Marionette to bind to port 0, asking the system to allocate a free port, and then somehow communicate that port back to the client for it to connect on.
We discussed this a while ago and I wrote a proof of concept patch in bug 1027022. That patch flips the current model on its head, making the marionette client open a port for listening, pass the port number to the application in a preference, and having the marionette server in the application connect to the client.
Blocks: 1240253
No longer blocks: 1240253
Blocks: 1240253
We had another discussion around this on IRC this afternoon. The patch in bug 1027022 reverses the relationship between the server and the client: Instead of starting the TCP socket in the server, the client binds to socket 0, letting the system allocate a free port using syscalls, and then passes the allocated port as a preference to Gecko, on which Marionette connects. I can think of two alternatives which neither involve reversing the relationship: (1) Assign special meaning to port 0 in the server, and communicate the allocated port back to the client using some IPC mechanism. (Unix domain sockets or a file on the local system was suggested.) (2) Create a Unix domain socket file in a temporary location and pass this to the server through a preference. Then listen for activity on said socket. When Marionette starts writing, the connection has been established. I’m personally leaning towards option 2, but I have not yet investigated the portability aspect of this solution. ted mentioned that named pipes should do the trick in place of domain sockets on Windows, and that these have little- or no performance drawback. The important part here is to not have to poll continuously but to use whatever is the equivalent of the kqueue/epoll syscalls. With the exception of ADB device tests that must happen over TCP, it is always expected that the client and server will live on the same system. There is no “remoteness” concept in Marionette like there is in WebDriver. This is what the wires/geckodriver HTTP frontend is for. With devices, there is always at most a single one; or when there are more they must have ports specifically defined due to the ADB forwarding constraints. Reversing the client-server relationship is difficult because the client explicitly needs to provide backwards compatibility to earlier Gecko versions on the release channels for upgrade-, UI, and LucidDream tests.
Whiteboard: [blocked]
Apparently neither the nsIServerSocket (https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIServerSocket) or the TCPServerSocket (https://developer.mozilla.org/en-US/docs/Web/API/TCPServerSocket) allows you to define the socket family, for which we want the AF_UNIX flag: http://man7.org/linux/man-pages/man7/unix.7.html
No longer blocks: 1240253
The currently larger issue which we have is that if a Firefox process remains open on a machine (for example see bug 1176758), a second testrun executed against a different build would still connect to the first process via port 2828. But that is not the Firefox process we want to connect to. As result we run our tests in a different build which is even harder when you run nightly and release builds on the same machine - that produces completely busted test results.
I would like to ressurect this work to help solve the "can't run multiple webdriver sessions with geckodriver" issue that has been one of the biggest roadblocks for people using that so far. Reading everything, it seems like the shortest path to victory might be to go with the patch in bug 1027022 and ensure that it's opt-in so that older gecko versions continue to work. Is there enough advantage of the approaches presented in this bug that I should start working on one of those from scratch instead?
Flags: needinfo?(ted)
Flags: needinfo?(ato)
(In reply to James Graham [:jgraham] from comment #6) > I would like to ressurect this work to help solve the "can't run multiple > webdriver sessions with geckodriver" issue that has been one of the biggest > roadblocks for people using that so far. > > Reading everything, it seems like the shortest path to victory might be to > go with the patch in bug 1027022 and ensure that it's opt-in so that older > gecko versions continue to work. Is there enough advantage of the approaches > presented in this bug that I should start working on one of those from > scratch instead? The quickest way to victory is to get a free port from the OS and then use that port. That way we have guaranteed random ports and doesnt rely on Firefox starting up before we get a "free port". While it is racy the amount of times I have received bug reports in the Selenium about this race I could probably count on 1 hand.
The critical point is that to get a guaranteed free listen port you have to pass port=0 down to the OS TCP stack and let it give you back a free port, and this has to happen in whatever is calling listen. My patch in bug 1027022 takes care of that by spawning the server in the Python script so it can then pass that port number down to Firefox via a pref. Since the existing Marionette code expects the server to be in the browser process, the only way to make that work is to do as ato has suggested and send that port number back out-of-band in some way. That would avoid having to change the existing marionette client/server code, you'd be adding some additional code around it to listen in some way on the Python side, and send the port number back on the JS side. I'm not personally invested in the Marionette code, so I can't speak to what will be best there. My patch was just the simplest way I could get things to work in this manner.
I agree with dburns here that the path of least resistance is to allocate a free port in the client, close it, and ask Marionette to start a server on that port. This is racy, but doesn’t reverse the client-server relationship. We can achieve this by setting the marionette.defaultPrefs.port preference from geckodriver. It does not require any Marionette-side changes at all. We can add something more clever, like one of the solutions I suggest in comment #3 at a later point.
Flags: needinfo?(ato)
Flags: needinfo?(ted)
So another idea... With bug 1281750 we now transfer the process id of Firefox via the session capabilities to the client. Cannot we simply use the pid to check for listen ports? As I can see psutil.net_connections() should offer that to us. With that we can still let the Marionette server auto-select a port, and we wouldn't have a race condition.
Ups. Please forget the last comment. Without a session we cannot transfer anything. ;)
Ok, so it might work in normal cases without fancy restarts which involve process group changes (bug 1176758). The runner property always has the valid process id handy, so it could be used to query for open listening ports of the process.
Given that I did a bit of work lately for the socket connections I had time to think about this a bit more. Maybe the following solution might be something you like? The server socket connection gets established by Marionette server running inside of Firefox. At this time we always have a profile present. When Marionette server finds a free port, why cannot it simply create a new file in the profile (similar to .parentLock as done by Firefox indicating the process is using this profile) listing the port number? The client knows the profile location and can watch out for that file, read the port number, and finally connect to Marionette server. This method will not be affected by clean restarts which will swap the profile, given that once Marionette created the socket connection, the file gets written. So the client always know about it. We might have to add some new code to handle remote profiles for Fennec, but I'm sure we will be able to make that. Ted, what do you think?
Flags: needinfo?(ted)
Whiteboard: [blocked]
That should work fine. The only unfortunate bit is that the client will wind up polling looking for that file, but that's not the worst thing ever. (We could optimize that if it becomes a problem by using platform-specific file watching APIs like inotify on Linux.) Since we'd only use this when the port is explicitly set to zero in the profile prefs, it shouldn't break backwards-compat with older Marionette clients. It would, however, break newer Marionette clients attempting to run an older Firefox, since the older Firefox would not write the file with the port number in the profile. I suppose you could check the version number of the Firefox you're trying to launch and only use this technique if it's new enough.
Flags: needinfo?(ted)
Priority: -- → P3
Version: unspecified → Trunk

Given the discussion from comment 13 and comment 14 lets follow the approach by using a file in the profile directory to indicate the port as used by Marionette. It's actually backed by our partial CDP implementation where I'm about to land a patch that does the same thing (bug 1706581).

Because there is a bit more work to do here I'm going to transform this bug into a meta bug, and will update the dependency list for all the bits and pieces we have to take care. In general I'm thinking of the following strategy:

  1. [Done] Start Marionette writing the file to the profile directory (bug 1735162) and maybe lets also include the feature for the BiDi implementation (bug 1749673). This patch should be uplifted to the 91ESR branch, and if possible even the 78ESR branch to allow geckodriver sooner than later to make use of the new technique.

  2. Update the Marionette client and harness code to handle the scenario when port=0 is requested by a consumer (bug 1750630). We should still keep 2828 as default to not cause incompatibilities with not yet updated consumers. Here we don't have to uplift patches. Here we do not have to care about backward compatibility (we do not push releases to pypi anymore) and can allow the code to ride the train. But the question is how we have to handle the Android scenario given that we do not have proper Android support for the harness (bug 1607210).

  3. Update web-platform-tests to request port 0 from marionette client instead of finding a free port on its own, which is racy. And for sure looking for that file, and reading the port once it exists.

  4. [Done] Update geckodriver to allow handling of port 0 (bug 1421766), whereby we would first need support for pulling files from the Android device (bug 1725622). By default we should keep port 2828 until we know it runs all stable.

  5. At a later time consider using port 0 as default.

Lets discuss this topic in next week's triage meeting.

Summary: Marionette find open port to run tests → [meta] Use system allocated port for Marionette server
Whiteboard: [webdriver:triage]

We reached consensus for this proposal but should limit the amount of actual work to do here to the bare minimum for now. Means as first step we only get the support for writing out this file added, and at least uplifted to 93 beta. Uplift to 91 will only happen if no further (complex) patch adjustments are necessary.

I'll file a new bug to get the code added.

Whiteboard: [webdriver:triage]
No longer blocks: 1421766
Depends on: 1421766
Severity: normal → S3
Product: Testing → Remote Protocol
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing
You need to log in before you can comment on or make changes to this bug.