Closed Bug 811636 Opened 12 years ago Closed 11 years ago

Parent process does not handle the situation correctly if its child process is failed at launching

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)

RESOLVED FIXED
1.1 QE5
blocking-b2g leo+
Tracking Status
firefox23 --- wontfix
firefox24 --- wontfix
firefox25 --- fixed
b2g18 + fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix
b2g-v1.1hd --- fixed

People

(Reporter: shelly, Assigned: shelly)

References

Details

(Whiteboard: [tech-bug][LeoVB+])

Attachments

(1 file, 7 obsolete files)

Parent process is fail to get any notification if its child process does not launch or execute itself properly.
Launching a child process involves a fork/exec There are at least 4 very distinct time periods that we need to concern ourselves with: 1 - Failures which occur in the parent before the fork 2 - Failures which occur in the child after the fork but before the exec 3 - Failures which occur in the child after the exec but before the IPC channel is setup 4 - Failures which occur in the child after the IPC channel is setup
This is happened when a child process is forked successfully, but failed after execv, and call _exit(127). The current system misses the situation when a child process exit itself abnormally. This draft of patch is trying to watch the SIGCHLD in an earlier state of creating a child process. Modify the fetched result from DidProcessCrash() (in process_util_posix.cc) in order to defer a child process is exit normally or not. And after receives an abnormal exit of a child process, sends a "oop-frameloader-crashed' to window_manager.js. Or, maybe we should do something else?
Depends on: 823474
Please reference comment 7 from bug 823474 for details of reproduce steps. https://bugzilla.mozilla.org/show_bug.cgi?id=823474#c7
Attachment #684599 - Flags: feedback?(justin.lebar+bug)
Comment on attachment 684599 [details] [diff] [review] draft 1: Handle when a child process exit abnormally I think cjones is probably a better person to look at this; I'm not familiar with any of this code. Chris, if you're swamped, I can figure this stuff out.
Attachment #684599 - Flags: feedback?(justin.lebar+bug) → feedback?(jones.chris.g)
Comment on attachment 684599 [details] [diff] [review] draft 1: Handle when a child process exit abnormally Hi Shelly, this is a very good UNIX-y solution to this problem. I like it a lot :). The problem is, this process mangement code is cross-platform, and we need to keep all of them in mind when designing features like this. We have an existing mechanism to notify us of "channel errors" (which during startup, correspond to process crashes/exits). The first place this code is used is when we launch the subprocess. The subprocess is launched from ipc/glue/GeckoChildProcessHost.cpp. If an error occurs, it will be notified here [1]. The code doesn't properly handle the error notification at the moment; fixing that is bug 773925. The second piece we need is to let ContentParent know when a channel error has occurred during launch. We can set a particular state on the GeckoChildProcessHost, for example. The advantage of this alternate approach is that it will work across all the platforms we support. Let me know if this makes sense. [1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#827
Attachment #684599 - Flags: feedback?(jones.chris.g)
Attached patch draft 2: Close the client fd earlier (obsolete) (deleted) — Splinter Review
Attachment #684599 - Attachment is obsolete: true
Comment on attachment 703172 [details] [diff] [review] draft 2: Close the client fd earlier Hi Chris, Thanks for your feedback! I've worked on another solution to handle this kind of error, with less changes and closer to the outlook of our exist mechanism. When a child process is created, its parent process waits for a hello message and close the client_pipe_. However, in this kind of situation, a system interrupt with signal SIGCHLD is sent, interrupts epoll_dispatch, and thus the hello message is never added to active event list, making the libevent fail to receive any notification. I figured maybe we can close the client_pipe_ eariler in time, in order to find out the connection has been reset already. This change uses the exist mechanism to notify us channel error, but the rest is like you mentioned, where OnChannelError from GeckoChildProcessHost is not yet being called at this moment.
Attachment #703172 - Flags: feedback?(jones.chris.g)
Btw, bug 823474 is no longer needed with this version of fix.
I'm very sorry for the long feedback lag here :(. I'll be clearing out my queue very soon. We need this work for 100% rock-solid stability. When this kind of lmk death occurs, we permanently lose a substantial amount of device memory. That makes the possibility of this crash happening again higher, etc, creating a disastrous feedback loop.
blocking-b2g: --- → leo?
Comment on attachment 703172 [details] [diff] [review] draft 2: Close the client fd earlier Hi Shelly, is it correct that we're not getting the error notification at [1] currently? If so, I /think/ I understand what this patch is doing, but there's still a race condition where the process can die without the notification. We should try to discuss this on IRC this week. (I want to make sure I understand the details here ;).) [1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#833
Attachment #703172 - Flags: feedback?(jones.chris.g)
How often does this occur? How can this be triggered? Is it showing up in any of our or our partners' stability tests?
Needed for competitive stability.
Whiteboard: [tech-bug]
(In reply to Chris Jones [:cjones] [:warhammer] from comment #10) > Comment on attachment 703172 [details] [diff] [review] > draft 2: Close the client fd earlier > > Hi Shelly, is it correct that we're not getting the error notification at > [1] currently? If so, I /think/ I understand what this patch is doing, but > there's still a race condition where the process can die without the > notification. > > We should try to discuss this on IRC this week. (I want to make sure I > understand the details here ;).) > > [1] > http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost. > cpp#833 Hi Chris, Thanks for your feeback, and yes, we're not getting the error notification at [1], I'm not 100 percent sure if it is not called other where else, but at least it's not being called when the app is killed (shutdown abnormally). Please feel free to ping me on IRC, my nickname is ShellyLin. :)
blocking-b2g: leo? → -
tracking-b2g18: --- → +
Hi Justin, Without this fix, we will end up having lots of zombie processes in the system. Since project Nuwa (bug 771765) might refactor most of the ipc implementation, especially the part of creating child processes, I'm not sure this is needed anymore, however, this fix is small and not risky, I still think it is worth trying. I know you have re-directed this issue to cjones, or could you re-direct to someone else? Thanks.
Attachment #703172 - Attachment is obsolete: true
Attachment #761912 - Flags: review?
Attachment #761912 - Flags: feedback?(justin.lebar+bug)
Attachment #761912 - Flags: review?
We should definitely still fix this even though we're trying to do Nuwa. Did you and cjones ever figure out the race condition he mentioned in comment 10?
We need to cover the child dying before it gets the IPC channel opened as well as after it gets the IPC channel opened. Normally, the parent would launch a thread which does wait calls (a reaper thread). This reaps the zombies and allows the parent to be notified regardless when the child dies.
Even under Nuwa, we need the same thing, but it needs to be the template process that does the reaping (you can only reap children, not grand children).
(In reply to Justin Lebar [:jlebar] from comment #16) > We should definitely still fix this even though we're trying to do Nuwa. > > Did you and cjones ever figure out the race condition he mentioned in > comment 10? I didn't get a chance to talk with cjones, but I've discussed this issue with Thinker. Race condition won't happen here, since we are closing the client's fd and the call to |base::LaunchApp()| is synchronized, from the part of fork(), execv() till return.
(In reply to Dave Hylands [:dhylands] from comment #17) > We need to cover the child dying before it gets the IPC channel opened as > well as after it gets the IPC channel opened. > > Normally, the parent would launch a thread which does wait calls (a reaper > thread). This reaps the zombies and allows the parent to be notified > regardless when the child dies. Yes, normally it does. In this abnormal case, child process dies before our ipc mechanism receives the "hello message", that is, the parent won't be notified because this child process does not event "exist".
The hacky steps to produce a zombie process: 1. Launch b2g, and kill the Preallocated-app process. 2. Go to system/b2g, rename |plugin-container| to something else, e.g. |plugin-container.2|. 3. Launch any web-app from Homescreen. 4. Observe a Z process with name "Gecko_IOThread".
(In reply to Shelly Lin [:shelly] from comment #20) > (In reply to Dave Hylands [:dhylands] from comment #17) > > We need to cover the child dying before it gets the IPC channel opened as > > well as after it gets the IPC channel opened. > > > > Normally, the parent would launch a thread which does wait calls (a reaper > > thread). This reaps the zombies and allows the parent to be notified > > regardless when the child dies. > > Yes, normally it does. In this abnormal case, child process dies before our > ipc mechanism receives the "hello message", that is, the parent won't be > notified because this child process does not event "exist". Well the child does, in fact, exist, it just didn't get around to using the ipc mechanism. My point is that we shouldn't be using the IPC mechanism to track the child processes. If the fork succeeds, then we should start tracking the child and be issuing a wait for the child. Otherwise, any child that fails for any reason between the fork and the open of the ipc channel becomes a zombie (because the parent needs to reap it so that the zombie goes away).
It sounds like dhylands understands this better than I do, and it also looks like the patch here doesn't fix the problem dhylands identified in comment 22. Shelly, are you willing to wait a few weeks for Dave to get back from vacation and figure this out with him?
Comment on attachment 761912 [details] [diff] [review] Handle the error of execv() when launching child processes Kicking this over to Dave. If you need this taken care of more quickly, let me know and we can figure that out.
Attachment #761912 - Flags: feedback?(justin.lebar+bug) → feedback?(dhylands)
This problem sounds similar to this issue that was fixed upstream: https://codereview.chromium.org/7870008 "The problem was that when the child process died before connecting to the parent's IPC channel, the parent wouldn't notice the crash because the child end of the IPC pipe was kept open for too long."
(In reply to Dave Hylands [:dhylands] (PTO back July 2) from comment #22) > (In reply to Shelly Lin [:shelly] from comment #20) > > (In reply to Dave Hylands [:dhylands] from comment #17) > > > We need to cover the child dying before it gets the IPC channel opened as > > > well as after it gets the IPC channel opened. > > > > > > Normally, the parent would launch a thread which does wait calls (a reaper > > > thread). This reaps the zombies and allows the parent to be notified > > > regardless when the child dies. > > > > Yes, normally it does. In this abnormal case, child process dies before our > > ipc mechanism receives the "hello message", that is, the parent won't be > > notified because this child process does not event "exist". > > Well the child does, in fact, exist, it just didn't get around to using the > ipc mechanism. > > My point is that we shouldn't be using the IPC mechanism to track the child > processes. > > If the fork succeeds, then we should start tracking the child and be issuing > a wait for the child. Otherwise, any child that fails for any reason between > the fork and the open of the ipc channel becomes a zombie (because the > parent needs to reap it so that the zombie goes away). I see. "Issuing a wait" happens to be my very first solution, but Chris (in comment #5) thinks it's better to use our IPC mechanism to report errors. Attaching that patch back for further discussion.
Attached patch Issuing a wait for child process (obsolete) (deleted) — Splinter Review
Issuing a wait on newly created child process, in order to monitor its life state.
Justin, sounds good to me, thanks. Andrew, it does look exactly the same like this issue! (https://codereview.chromium.org/7870008)
Comment on attachment 761912 [details] [diff] [review] Handle the error of execv() when launching child processes Review of attachment 761912 [details] [diff] [review]: ----------------------------------------------------------------- OK - I spent some time looking at the child/parent IPC socket and I things are a bit clearer. What actually happens is something along these lines: 1 - parent creates a socketpair 2 - parent fork/execs the child, which inherits the open file descriptors So I now believe that this patch is doing the right thing (and this should cover all of the failure points).
Attachment #761912 - Flags: feedback?(dhylands) → feedback+
Thanks for your time very much! I've added and updated some comments, could you give it a reivew?
Attachment #761912 - Attachment is obsolete: true
Attachment #765168 - Attachment is obsolete: true
Attachment #771146 - Flags: review?(dhylands)
Comment on attachment 771146 [details] [diff] [review] Close the FD of client side after parent fork/execs the child Review of attachment 771146 [details] [diff] [review]: ----------------------------------------------------------------- r=me with comments addressed. ::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc @@ +781,5 @@ > > +void Channel::ChannelImpl::CloseClientFileDescriptor() const { > + if (client_pipe_ != -1) { > + Singleton<PipeMap>()->Remove(pipe_name_); > + HANDLE_EINTR(close(client_pipe_)); This function should be setting client_pipe_ to -1 after closing the file descriptor. This then implies that this can't be a const method (so you'll need to remove the const or declare client_pipe_ mutable. @@ +951,5 @@ > return channel_impl_->GetServerFileDescriptor(); > } > > +void Channel::CloseClientFileDescriptor() const { > + return channel_impl_->CloseClientFileDescriptor(); nit: I'd remove the return since this is a void function. ::: ipc/glue/GeckoChildProcessHost.cpp @@ +656,5 @@ > #endif > false, &process, arch); > > + // Close the FD of client side as early as possible, letting our ipc mechanism > + // to detect unexpected error at any stage. nit: I'd reword this slightly. // We're in the parent and the child was launched. Close the child FD // in the parent as soon as possible, which will allow the parent // to detect when the child closes its FD (either due to normal exit or // due to crash).
Attachment #771146 - Flags: review?(dhylands) → review+
Patch updated with nit fixes, and thanks for your review :)
Attachment #771146 - Attachment is obsolete: true
Attachment #771910 - Flags: review?(dhylands)
Fixed compiling error.
Attachment #771910 - Attachment is obsolete: true
Attachment #771910 - Flags: review?(dhylands)
Attachment #771949 - Flags: review?(dhylands)
Comment on attachment 771949 [details] [diff] [review] Close the FD of client side after parent fork/execs the child (w/ nit fix) Review of attachment 771949 [details] [diff] [review]: ----------------------------------------------------------------- ::: ipc/glue/GeckoChildProcessHost.cpp @@ +658,5 @@ > > + // We're in the parent and the child was launched. Close the child FD in the > + // parent as soon as possible, which will allow the parent to detect when the > + // child closes its FD (either due to normal exit or due to crash). > + const_cast<IPC::Channel&>(channel()).CloseClientFileDescriptor(); nit: You should be able to do it without the cast by doing: GetChannel()->CloseClientFileDescriptor(); GetChannel is declared here: https://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.h#83
Attachment #771949 - Flags: review?(dhylands) → review+
Patch to checkin, with nit fixed, and carry r+ from :dhylands.
Attachment #772478 - Flags: review+
Keywords: checkin-needed
Attachment #771949 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 893012
Marking leo? per bug 893012#c54.
blocking-b2g: - → leo?
Blocks a blocker, and already has testing done by Leo.
blocking-b2g: leo? → leo+
As I said in bug 893012, comment 54, the accumulated messages in output queue is solved by this patch. But I observed just now, that a preallocated process is leaked after this patch. I am checking how it happened, and if it is really related to this patch.
Ok, the extra preallocated process in comment 41 is a known issue (bug 867739) and not related to this patch.
(In reply to andre.graziani from comment #44) > Ok, the extra preallocated process in comment 41 is a known issue (bug > 867739) and not related to this patch. Thanks for your confirm. :)
Whiteboard: [tech-bug] → [tech-bug][LeoVB+]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: