Open Bug 1493796 Opened 6 years ago Updated 1 years ago

[mozprocess] Improve the ProcessHandlerMix on Windows to handle detached processes

Categories

(Testing :: Mozbase, enhancement, P3)

enhancement

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: whimboo, Assigned: whimboo)

References

(Blocks 3 open bugs)

Details

Attachments

(1 file)

(In reply to Henrik Skupin (:whimboo) from bug 1438830 comment #10) > So the patch is actually not complete yet. What we would need in terms of > Marionette is that mozprocess would also know about the new and detached > process id. In case of Posix we simply update it with the process id which > Marionette gets via Firefox: > > https://dxr.mozilla.org/mozilla-central/rev/ > a9e339b3e5d8dc3b108d3dc8b85f72b2235fafb2/testing/mozbase/mozprocess/ > mozprocess/processhandler.py#910 > > But how can we overwrite the process id (`self.pid`) on Windows? Aaron, is > there a simple way of doing that? Or any alternative?
Aaron, maybe you have an idea? Not sure if we would have to recreate the whole job object, or if there is a simpler solution. Thanks.
Flags: needinfo?(aklotz)
Blocks: 1434878
Priority: -- → P3
So there is actually a real problem here. Even we can track the process after a restart of Firefox via the job, a call to `GetExitCodeProcess()` returns 0. With that information we currently assume the process has exited, and consumers are misbehaving. See the perma failures on beta for bug 1496759 and bug 1497062. This all started to happen with the following commit on bug 1433905, which assumes that the exit code is always correct: https://hg.mozilla.org/integration/autoland/rev/8793e332890e I think this happens because the handle would still refer to the old pid of the former process id, but not for the Firefox process after the restart! As such it thinks the process doesn't exist anymore. To temporarily unblock us from having those failures on central and beta, I will backout the above commit, and disable the affected mozprocess unit test. Aaron, I hope that you will have a bit of time soon to help us getting this right. If not please also let me know, and I have to dig into all that myself. But I would appreciate at least a direction. Thanks!
Blocks: 1497062, 1496759
Blocks: 1400780
Blocks: 1499717
Originally landed as changeset 8793e332890e via bug 1433905 the patch caused a regression because GetExitCodeProcess() returns 0 for an inside of Firefox restarted process. It can be relanded once the process id of the job object can successfully be tracked.
I messed-up the patch with a wrong commit from bug 1397612. Given that this one has been reverted now, here another try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=748a483a4fb10fbc32337ec0d0d394eec6ebbf0c
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9782e0b1657d [mozprocess] Revert poll() behavior on Windows due to regression. r=gbrown
Status: NEW → ASSIGNED
Keywords: leave-open
Priority: P3 → P2
Sorry for the delay here, I am really slammed at the moment. I don't know if this is a complete solution, but it is possible to receive notifications from a job object by calling SetInformationJobObject with the JobObjectAssociateCompletionPortInformation information class, and registering an I/O completion port on that job object. This would allow you to poll for the JOB_OBJECT_MSG_NEW_PROCESS notification, which tells you when a new process has been created within the job. I would think that a browser restart on Windows would result in the newly created browser process still being within the same job, so I *think* that could be enough? Since this does require you to set up an I/O completion port, it will probably require some extra ctypes work to make happen, but I think it might be your best bet.
Flags: needinfo?(aklotz)
(In reply to Aaron Klotz [:aklotz] from comment #10) > I would think that a browser restart on Windows would result in the > newly created browser process still being within the same job, so I *think* > that could be enough? I guess the other catch is that you will receive that notification for *all* new processes created within the job, so you also somehow need to filter out which one is the browser.
Blocks: 1391545

The leave-open keyword is there and there is no activity for 6 months.
:whimboo, maybe it's time to close this bug?

Flags: needinfo?(hskupin)

No, there is still more work to do here, but I don't have the time doing it at the moment. If anyone has interest feel free to do it.

Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Type: defect → enhancement
Flags: needinfo?(hskupin)
Keywords: leave-open
Priority: P2 → P3
Severity: normal → S3

I started some work to improve the stability of wdspec tests on bug 1824841. When pushing to try I saw a high frequent failures on Windows only when mozprocess is handling the shutdown of the geckodriver binary, which look like:

https://treeherder.mozilla.org/logviewer?job_id=410631876&repo=try&lineNumber=109643-109648

[task 2023-03-29T09:20:31.185Z] 09:20:31     INFO - STDOUT: Exiting with code: 64
[task 2023-03-29T09:20:31.187Z] 09:20:31     INFO - STDOUT: DBG::MOZPROC PID:9140 (Thread-9) | process id 9140 exited normally
[task 2023-03-29T09:20:31.187Z] 09:20:31     INFO - STDOUT: DBG::MOZPROC PID:9140 (Thread-9) | job object msg active processes zero
[task 2023-03-29T09:20:31.188Z] 09:20:31     INFO - STDOUT: DBG::MOZPROC ProcessReader | _read loop exited
[task 2023-03-29T09:20:31.188Z] 09:20:31     INFO - STDOUT: DBG::MOZPROC ProcessReader | _read exited
[task 2023-03-29T09:20:33.201Z] 09:20:33     INFO - STDOUT: DBG::MOZPROC PID:9140 (Thread-7) | No process handle
[task 2023-03-29T09:20:33.203Z] 09:20:33     INFO - STDOUT: ** returncode: None 1680081633.2012012

As it can be seen the geckodriver process itself quits with an exit code of 64, but mozprocess continues to report None as exit code. The problem here is actually part of the Process.poll() handling. We are on Windows and the handle is gone, but the method returns still None and not the expected exit code, which means that mozprocess thinks that the process is still running.

I then remembered my patch from 5 years ago which landed on bug 1433905, but then got backed out here. Applying it again (after fixing the merge conflicts) actually makes it work for the above case as this try build shows. But sadly now there is a perma failure for a Marionette test, where an environment variable as set before a restart is not available after the restart anymore. Investigating this clearly showed me that mozprocess is failing to handle in-application restart scenarios as only done via Marionette. And this is clearly something that we have to get fixed to improve the reliability of tests but also the Marionette client itself.

Here some observations:

  1. On Windows the Firefox binary is launched via a launcher process (2008), which means that the real parent process (4284) is hold within a wrapper process. The client which starts Firefox via mozprocess uses the wrapper process as known PID, and not the real binary. Also all child processes (eg. web content) are attached to the inner PID (4284) and not the launcher.

When Firefox is restarted from within the application itself a new launcher process (7864) is forked/spawned under the existing parent process (4284), which itself then spawns the new Firefox parent process (6392). Right after both the former parent process (4284) and launcher process (2008) exit, as well the new launcher process (7864) so that the new parent process (6392) is now made the top-level one:

 0:00.03 INFO Using workspace for temporary data: "C:\mozilla\mozilla-unified"
 0:00.05 mozversion INFO application_buildid: 20230413093523
 0:00.05 mozversion INFO application_changeset: c35b4a88139530b8c2f2e58e0fededacfe7a8571
 0:00.05 mozversion INFO application_display_name: Nightly
 0:00.05 mozversion INFO application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
 0:00.05 mozversion INFO application_name: Firefox
 0:00.05 mozversion INFO application_remotingname: firefox-default
 0:00.05 mozversion INFO application_vendor: Mozilla
 0:00.05 mozversion INFO application_version: 114.0a1
 0:00.05 mozversion INFO platform_buildid: 20230413093523
 0:00.06 mozversion INFO platform_changeset: c35b4a88139530b8c2f2e58e0fededacfe7a8571
 0:00.06 mozversion INFO platform_version: 114.0a1
 0:00.06 INFO Application command: C:/mozilla/mozilla-unified/obj-x86_64-pc-windows-msvc\dist\bin\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\henrik\AppData\Local\Temp\tmpldsuf5ca.mozrunner
*** initial pid: 2008
*** new process - psutils details: {'ppid': 2008, 'name': 'firefox.exe', 'pid': 4284}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 11160}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 13644}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 3896}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 9568}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 13348}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 6536}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 13436}
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 12996}
1681828049924   Marionette      DEBUG   3 -> [0,23,"Marionette:Quit",{"flags":["eRestart"]}]
1681828050065   Marionette      DEBUG   Marionette stopped listening
1681828050067   Marionette      DEBUG   3 <- [1,23,null,{"cause":"restart","forced":false,"in_app":true}]
DBG::MOZPROC PID:2008 (Thread-3) | process id 9568 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 3896 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 6536 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 12996 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 13436 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 13348 exited normally
*** new process - psutils details: {'ppid': 4284, 'name': 'firefox.exe', 'pid': 7864}
DBG::MOZPROC PID:2008 (Thread-3) | process id 13644 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 11160 exited normally
*** new process - psutils details: {'ppid': 7864, 'name': 'firefox.exe', 'pid': 6392}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 11432}
DBG::MOZPROC PID:2008 (Thread-3) | process id 7864 exited normally
DBG::MOZPROC PID:2008 (Thread-3) | process id 4284 exited normally
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 2444}
DBG::MOZPROC PID:2008 (Thread-3) | process id 2008 exited normally
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 6228}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 12620}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 8744}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 14456}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 6928}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 996}
*** new process - psutils details: {'ppid': 6392, 'name': 'firefox.exe', 'pid': 7088}

Given that we are working within a job object I wonder if mozprocess should not track the launcher process but the real firefox binary process. As least that would make it easier when the process gets restarted. Also it probably shouldn't matter if we kill the launcher process or the real parent process. The poll logic on Windows uses the Job Object and we exit when all processes are gone, and the launcher should be gone as well when the parent process exited.

  1. I tried to add some code via ctypes which can retrieve the pid of the parent process for a new spawned process. But sadly this doesn't work yet, and I'm currently experimenting by using psutil. Hereby I'm not sure if we can actually use that package via mach or in CI. Once I have it all working I might find a way to retrieve the information via the WinAPI.
Assignee: nobody → hskupin
Blocks: 1824841
Status: NEW → ASSIGNED
Summary: [mozprocess] Allow to update process id on Windows if process has been restarted → [mozprocess] Improve the ProcessHandlerMix in Windows to handle detached processes
Summary: [mozprocess] Improve the ProcessHandlerMix in Windows to handle detached processes → [mozprocess] Improve the ProcessHandlerMix on Windows to handle detached processes
Blocks: 1635776

I've pushed a try build for my proposed changes but it turns out that quite a lot of test failures started to happen that need investigation:

https://treeherder.mozilla.org/jobs?repo=try&revision=3675ba08ec911eed3a34a79135a2db1836e4db5e

I'll deprioritize this bug a bit would would still check why there are such failures.

Blocks: 1438009
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: