Open Bug 1404190 Opened 7 years ago Updated 2 years ago

Only the browser process is terminated on content process shutdown hang with NS_FREE_PERMANENT_DATA

Categories: (Core :: IPC, defect, P2)
Platform: Unspecified · All
People: (Reporter: karlt, Unassigned)
References: (Blocks 1 open bug)
Details: (Keywords: regression)

When the content process hangs on shutdown, the browser process will terminate
itself with
MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/karl/moz/dev/toolkit/components/terminator/nsTerminator.cpp:160

Expected:
The content process is also terminated.

Actual:
The content process continues to run after browser process and crashreporter
have exited.
The test harness waits for both browser and content process output to finish,
and so a hung content process leads to the symptoms of bug 1204281, which are
not very helpful for identifying the cause of the hang.
Blocks: 1204281
Priority: -- → P2
Blocks: 1411358
Blocks: 1339568
Depends on: e10s
Keywords: regression
Blocks: e10s
No longer depends on: e10s
Karl, is this only happening on Linux as stated in the platform field or all platforms?
Flags: needinfo?(karlt)
I'm aware of this issue only on Linux.
I don't recall where but I think I saw something to indicate that WINNT child processes were terminated by the OS when their parent exited.
I don't know why MacOS UNIX processes would behave differently from Linux ones, though, so maybe the bug exists there too, or maybe there's a Mac-only workaround somewhere.
Flags: needinfo?(karlt)
Does this happen only on debug builds?  If so, this might be part of what's going on: https://searchfox.org/mozilla-central/rev/919dce54f43356c22d6ff6b81c07ef412b1bf933/ipc/glue/GeckoChildProcessHost.cpp#115-118

IPC will wait forever for child processes to exit if refcount logging is enabled, to make sure the entire log is written, but will kill the process after a few seconds otherwise.
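The two policies described here can be sketched in Python. This is an illustrative, hedged sketch of the POSIX "grim" vs "lax" reaper behavior, not Gecko's actual C++ code; the function name and timeout are invented for the example.

```python
import os
import signal
import time

# Illustrative sketch (not Gecko's real API): with NS_FREE_PERMANENT_DATA
# (refcount/leak logging), IPC blocks until the child exits on its own so
# the whole log is written; otherwise it waits briefly and then kills.

def reap_child(pid, force, timeout=2.0):
    """force=True: grim reaper (SIGKILL after `timeout` seconds);
    force=False: lax reaper (block until the child exits on its own)."""
    if not force:
        os.waitpid(pid, 0)          # may hang forever on a stuck child
        return
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return                  # child exited within the grace period
        time.sleep(0.05)
    os.kill(pid, signal.SIGKILL)    # grace period over: terminate it
    os.waitpid(pid, 0)              # collect the zombie
```

With `force=False` and a hung child, this is exactly the situation the bug describes: the wait never returns.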
Thanks!  Yes, the grim reaper kills the child process as expected (assuming
the browser process does not hang), but the lax reaper does not.
The lax reaper is used when NS_FREE_PERMANENT_DATA is defined.

 * |force=false| is expected to be used when an external entity can be
 * responsible for terminating hung processes, e.g. automated test
 * harnesses.

https://searchfox.org/mozilla-central/rev/797c93d81fe446f78babf20894f0729f15f71ee6/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc#189

Perhaps the test harness once managed the processes differently.
(Or perhaps we didn't have enough plugin-process shutdown hangs to notice
problems.)

The recent test harness waits for both browser and content process output to
finish.  Disabling the first path at
http://searchfox.org/mozilla-central/rev/f54c1723befe6bcc7229f005217d5c681128fcad/testing/mozbase/mozprocess/mozprocess/processhandler.py#843
and instead using self.proc.wait() allows the harness to proceed and kill the
child process with SIGKILL, after nsTerminator kills the browser process.
Summary: Only the browser process is terminated on content process shutdown hang → Only the browser process is terminated on content process shutdown hang with NS_FREE_PERMANENT_DATA
Blocks: 1351655
(In reply to Karl Tomlinson (:karlt) from comment #7)
> The recent test harness waits for both browser and content process output to
> finish.  Disabling the first path at
> http://searchfox.org/mozilla-central/rev/
> f54c1723befe6bcc7229f005217d5c681128fcad/testing/mozbase/mozprocess/
> mozprocess/processhandler.py#843
> and instead using self.proc.wait() allows the harness to proceed and kill the
> child process with SIGKILL, after nsTerminator kills the browser process.

So does that mean we would have to update mozprocess to better handle this situation? This could be the solution we (especially Geoff) are looking for to finally stop jobs from failing due to strange hangs of the processes.
(In reply to Henrik Skupin (:whimboo) from comment #8)
> (In reply to Karl Tomlinson (:karlt) from comment #7)
> > The recent test harness waits for both browser and content process output to
> > finish.  Disabling the first path at
> > http://searchfox.org/mozilla-central/rev/
> > f54c1723befe6bcc7229f005217d5c681128fcad/testing/mozbase/mozprocess/
> > mozprocess/processhandler.py#843
> > and instead using self.proc.wait() allows the harness to proceed and kill
> > the child process with SIGKILL, after nsTerminator kills the browser
> > process.
> 
> So does that mean we would have to update mozprocess to better handle
> this situation? This could be the solution we (especially Geoff) are
> looking for to finally stop jobs from failing due to strange hangs of the
> processes.

I'm not familiar with the reader but I expect whatever is reading output from
the browser could exit when the browser process exits.  IIUC a successful
browser process shutdown will wait for content process shutdown, and so AFAIK
there is no need to wait also for content processes.

I'm guessing the reader is waiting until it gets EOF on its input pipe, but
a hung content process keeps this from ending.
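The guess above can be demonstrated with a small Python experiment (illustrative only): sh exits immediately, but the backgrounded sleep inherits the write end of the stdout pipe, so a reader that waits for EOF blocks until the sleep exits too.

```python
import subprocess
import time

# Demonstration: EOF on the output pipe only arrives once *every* process
# holding the write end has exited, even after the main process is gone.

def time_until_eof():
    proc = subprocess.Popen(
        ["sh", "-c", "echo ready; sleep 2 & exit 0"],
        stdout=subprocess.PIPE,
    )
    start = time.monotonic()
    data = proc.stdout.read()      # blocks ~2s: EOF needs `sleep` to exit too
    eof_after = time.monotonic() - start
    proc.wait()
    return data, eof_after
```

Substitute a hung content process for the `sleep` and the reader blocks indefinitely, which matches the observed harness behavior.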

(FWIW running the harness with --debugger does not suffer this problem, but
instead wait() returns when the browser process is terminated and the harness
kills the child process with SIGKILL.)

This is probably the right thing for the harness to do because it reduces the
time spent waiting after the content process has hung.  I guess this would
allow the harness to start a new browser instance for running additional sets
of tests.  On its own, however, I don't know that such a change would make
such hangs significantly easier to diagnose, because there would still be no
crash stacks for the content processes.

I think an in-product (not harness) solution is also helpful.  Making the lax
reaper more like the grim reaper, but with a longer timeout, would be helpful
for obtaining crash reports in many situations, I assume.

An nsTerminator in the content process might be a more dependable safety net
as it would catch problems even when the browser process crashes.  I don't
know whether or not crash reports would be generated in that situation.
(In reply to Karl Tomlinson (:karlt) from comment #9)
> AFAIK there is no need to wait also for content processes.

Oh, but if there is a separate crash reporter process, then that would need
to be allowed some time to complete.

> On its own, however, I don't know that such a change would make
> such hangs significantly easier to diagnose, because there would still be no
> crash stacks for the content processes.

I assume the harness could first kill remaining child processes with SIGABRT
but I don't know whether or not the crash reporter is still around to collect
stack dumps after the browser process has terminated.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ac2249f1e9e4f5bd7570fee60b63de66b08bf180
confirms that the symptoms on MacOS are like those on Linux, and so PR_SET_PDEATHSIG would not be a general solution.
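For reference, PR_SET_PDEATHSIG is the Linux-only prctl(2) option that lets a child ask to be signalled when its parent dies. A minimal ctypes sketch (the constant's value is from the Linux headers; the function name here is invented) shows why it cannot be a cross-platform fix: it simply does not exist on MacOS.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # value from Linux <linux/prctl.h>

def request_parent_death_signal(sig=signal.SIGKILL):
    """Ask the Linux kernel to deliver `sig` to the calling process when
    its parent dies. Returns False on non-Linux platforms, which is why
    this cannot be a general solution for this bug."""
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    return True
```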
On WINNT, the tests return results much sooner and more tests are run, but there is still no useful content process stack.
OS: Linux → All
Andrew, as far as I remember you implemented the waits for output threads in mozprocess a (long) time ago. Would you mind having a look at this bug to see if there is something we can do at the harness level to avoid the 1000s timeouts of Taskcluster jobs in case of such hangs? Thanks.
Flags: needinfo?(ahalberstadt)
Are you asking me to look at bug 1421289? Or is this different?

I didn't implement that code, though I'm somewhat familiar with it. I don't have any bright ideas beyond gbrown's investigation in the other bug.
Flags: needinfo?(ahalberstadt)
Bug 1421289 was filed after I asked, but yes that is the one. But as you say, Geoff already did some great investigation in the meantime too.
We could change the behavior for leak-checking builds to have a longer timeout than opt builds. Even if it was, say, a few minutes, it would not use quite as much machine time. The tradeoff might be more failures to create logs, but that could be tuned.
It looks like I can hit this problem fairly consistently on MacOS for opt builds of Firefox by running one of the crash unit tests of the Marionette test suite. Therefore attachment 8940193 [details] needs to be applied. In each of the cases the parent process is gone and the web content process is idling.

Is there something I could help with in further investigating this problem?
Flags: needinfo?(karlt)
Flags: needinfo?(continuation)
If you are seeing hangs in opt builds, that sounds unrelated to the original issue, which is about NS_FREE_PERMANENT_DATA builds (debug and ASan).
Flags: needinfo?(continuation)
Ok, then we will continue tracking this on bug 1395504. Thanks.
Flags: needinfo?(karlt)
No longer blocks: 1376773
(In reply to Karl Tomlinson (back Jan 30 :karlt) from comment #9)
> I'm not familiar with the reader but I expect whatever is reading output from
> the browser could exit when the browser process exits.  IIUC a successful
> browser process shutdown will wait for content process shutdown, and so AFAIK
> there is no need to wait also for content processes.

We parse the browser's output for use in things like the shutdown leaks checker, which uses the DOMWINDOW etc. output, but we do have code to handle cases where processes crashed, so this shouldn't be a problem. We also have code in the harnesses to check for and kill orphaned processes (which the harness refers to as "zombies", but which are probably not actually in the zombie state on Linux). So if we fix things so that the mozprocess `wait` doesn't hang in this situation (browser process has exited, child processes are still running), I think the harness will do the right thing.

> I'm guessing the reader is waiting until it gets EOF on its input pipe, but
> a hung content process keeps this from ending.

This sounds very plausible!

> (FWIW running the harness with --debugger does not suffer this problem, but
> instead wait() returns when the browser process is terminated and the harness
> kills the child process with SIGKILL.)

Right. With `--debugger` the harness doesn't use mozprocess at all, it just runs the debugger process with `subprocess.Popen` because debuggers generally need stdin/stdout to be the actual terminal.

> This is probably the right thing for the harness to do because it reduces the
> time spent waiting after the content process has hung.  I guess this would
> allow the harness to start a new browser instance for running additional sets
> of tests.  On its own, however, I don't know that such a change would make
> such hangs significantly easier to diagnose, because there would still be no
> crash stacks for the content processes.
> 
> I think an in-product (not harness) solution is also helpful.  Making the lax
> reaper more like the grim reaper but with a longer timeout would be helpful
> for obtaining crash reports in many situations I assume.
> 
> An nsTerminator in the content process might be a more dependable safety net
> as it would catch problems even when the browser process crashes.  I don't
> know whether or not crash reports would be generated in that situation.

We are not going to get crash reports from child processes if the browser process has already exited. Minidumps for child processes are written in the browser process. When a child process crashes its signal/exception handler simply sends a message to the browser process over some sort of IPC (not our IPDL-based IPC, a different pre-arranged channel) and the browser process writes the dump. If the browser process has exited this doesn't work.

It would probably be useful to change the nsTerminator code to have a mode we could enable in CI where it would do something like:
1) Timeout is hit.
2) If there are any still-running child processes, either kill them in such a way that Breakpad will be triggered and we'll generate a crash report (SIGABRT would work on Linux/Mac) or just run some crashreporter code to write minidumps for them and terminate them (SIGKILL).
3) Wait till all child processes are dead before actually terminating the browser process.
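The three steps above can be sketched in Python (the real change would live in nsTerminator's C++; function names, the grace period, and the child-pid list are illustrative assumptions):

```python
import os
import signal
import time

# Hedged sketch of the proposed CI mode: on shutdown timeout, send each
# surviving child a crash-generating signal (SIGABRT, so a Breakpad-style
# handler could write a minidump), fall back to SIGKILL if the child does
# not die, and only return once every child is reaped -- after which the
# browser process may terminate itself.

def terminate_with_child_dumps(child_pids, grace=5.0):
    # Step 2: try to crash each child so a dump can be generated.
    for pid in child_pids:
        try:
            os.kill(pid, signal.SIGABRT)
        except ProcessLookupError:
            continue                      # already gone
    # Step 3: wait for every child, escalating to SIGKILL on timeout.
    deadline = time.monotonic() + grace
    for pid in child_pids:
        while time.monotonic() < deadline:
            done, _status = os.waitpid(pid, os.WNOHANG)
            if done == pid:
                break                     # child exited, reaped
            time.sleep(0.05)
        else:
            try:
                os.kill(pid, signal.SIGKILL)   # child ignored SIGABRT
            except ProcessLookupError:
                pass
            os.waitpid(pid, 0)
    # All children are dead; the browser may now terminate itself.
```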
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #19)
> It would probably be useful to change the nsTerminator code to have a mode
> we could enable in CI where it would do something like:
> 1) Timeout is hit.
> 2) If there are any still-running child processes, either kill them in such
> a way that Breakpad will be triggered and we'll generate a crash report
> (SIGABRT would work on Linux/Mac) or just run some crashreporter code to
> write minidumps for them and terminate them (SIGKILL).
> 3) Wait till all child processes are dead before actually terminating the
> browser process.

This is definitely doable, file a bug and I'll get to it as soon as I'm done adding crash ping support to Fennec.
Severity: normal → S3