1270308 - Remove content-child-shutdown observer for OS.file

Bill McCloskey [inactive unless it's an emergency] (:billm)

Reporter

Description

•

9 years ago

Bug 1216972 started using content-child-shutdown to shut down some media and OS.File stuff. I'd like to remove this code because it's holding up the shutdown of child processes. Here's some context for this bug:

- Child processes are never supposed to write to disk, except maybe to temporary storage.
- The "content-child-shutdown" observer should only be used to send information up to the parent that needs to be saved to disk before shutdown. Right now the only user of this is for telemetry (sending up the final ping). We're going to be removing this use once telemetry starts sending up ping data continuously.
- Child processes run a normal XPCOM shutdown sequence in DEBUG builds to ensure they don't leak memory. In opt builds, they notify "content-child-shutdown" and then _exit(0).

The "content-child-shutdown" observer should never be used to release memory or remove observers. That should be done in an "xpcom-shutdown" observer (or a related one). Otherwise we're pointless spending time releasing stuff in opt builds.

The intention of bug 1216972 was apparently to fix a shutdown hang/crash (see comment 27). I don't really understand the code enough to know how the media manager shuts itself down. Here are the questions I have for the people involved: Why is it necessary to shut down the media manager in the content process? Is the shutdown code writing to disk? Is it notifying the parent process of something? Will anything bad happen if we stop using this observer, or if we use xpcom-shutdown instead (note that we're not able to talk to the parent process or write to disk when xpcom-shutdown happens)?

Flags: needinfo?(rjesup)

Flags: needinfo?(jib)

Flags: needinfo?(dteller)

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 1

•

9 years ago

We need it in DEBUG to not leak memory intermittently, because we have memory across multiple threads that wont deallocate predictably without back-and-forth runnables.

Why not define a new "content-child-shutdown-write-to-disk" or something, and then in opt build, only do that one only and _exit(0), and let us keep "content-child-shutdown" in DEBUG?

Flags: needinfo?(jib)

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1239873, https://bugzilla.mozilla.org/show_bug.cgi?id=1166293

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 2

•

9 years ago

I'm also unsure how the camera code would react to content going _exit(0).

Flags: needinfo?(gpascutto)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Reporter

Comment 3

•

9 years ago

(In reply to Jan-Ivar Bruaroey [:jib] from comment #1)
> We need it in DEBUG to not leak memory intermittently, because we have
> memory across multiple threads that wont deallocate predictably without
> back-and-forth runnables.

Is there a reason you can't use xpcom-shutdown for this?

> Why not define a new "content-child-shutdown-write-to-disk" or something,
> and then in opt build, only do that one only and _exit(0), and let us keep
> "content-child-shutdown" in DEBUG?

You've just described what we have now except with different names. "content-child-shutdown" is the one that runs in opt and DEBUG while xpcom-shutdown runs in DEBUG only.

> I'm also unsure how the camera code would react to content going _exit(0).

Well, we already do _exit(0):
http://searchfox.org/mozilla-central/source/ipc/glue/ProcessChild.cpp#40
So it already has to deal with that.

Randell Jesup [:jesup] (needinfo me)

Comment 4

•

9 years ago

> > We need it in DEBUG to not leak memory intermittently, because we have
> > memory across multiple threads that wont deallocate predictably without
> > back-and-forth runnables.
>
> Is there a reason you can't use xpcom-shutdown for this?

We may be able to.  Before we landed bug 1216972, we were using xpcom-will-shutdown (for non-e10s cases); see bug 1166293.

> > I'm also unsure how the camera code would react to content going _exit(0).
> 
> Well, we already do _exit(0):
> http://searchfox.org/mozilla-central/source/ipc/glue/ProcessChild.cpp#40
> So it already has to deal with that.

Well, the OS will shut down anything accessed directly, but the camera (and soon to be audio) is remote through IPC.  It will see the IPC channel close (much like a crash) and deal with it, though that's probably not an ideal shutdown path.  The "proper" way would be to shut down the media code before IPC is taken down.

An example of what in theory might go wrong on an exit(0) shutdown: If we're playing audio, and the last sample provided to the OS ended with the sample being far from 0, cutting it off at that point may cause a 'pop'.  Now, it's possible the OS/driver will compensate for that and do a smoother handling of a failure to provide data or the driver thread failure.  (It may take a while for the OS to close the audio channel itself as it cleans up resources.)

That said: this is likely ok in practice, and drivers probably try to avoid glitching in this case.  And it's not a major problem.  if/when audio moves to the master process (especially to support multiple content processes) things get more complicated, but we need to handle content crashes anyways.

We are currently writing to disk in some cases from child processes, in particular logs in webrtc and mtransport; we're currently in another bug trying to get that to work more cleanly without pushing a ton of data over IPC, especially non-PBackground IPC (some of these logs are from hard-realtime (audio) code, others are simply high-volume and written off a logging thread).  These uses may go away, and also any data written here is non-critical and can be lost safely.

In general: I'm supportive of using exit() after state has been committed to disk.  It avoids the "I closed it but it's still chewing my system!!" annoyance, especially when you just want to close-and-reopen (or shutdown/poweroff).  We have elaborate ways to cleanly shut down so we can better find leaks; IMHO we might do better in some cases to mark things as being expected to leak and avoid a lot of racy shutdown/cleanup code.  The biggest danger with shutdown is the possibility of producing a sec problem in shutdown as things start to get freed/go away, and "unusual" paths get followed.

Paul - any comments or concerns?

Flags: needinfo?(rjesup) → needinfo?(padenot)

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 5

•

9 years ago

(In reply to Bill McCloskey (:billm) from comment #3)
> Is there a reason you can't use xpcom-shutdown for this?

Sure! By "xpcom-shutdown", do you mean that's the name of another AsyncShutdown barrier that runs in content?

We used "xpcom-will-shutdown" initially (forget the reason, earlier may just have sounded better), and were surprised to learn this only worked in the master process. That's why we waited for "content-child-shutdown" to be added. I wasn't aware that "xpcom-shutdown" existed as an AsyncShutdown barrier already and was available to hook up to in the content process.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Reporter

Comment 6

•

9 years ago

(In reply to Jan-Ivar Bruaroey [:jib] from comment #5)
> (In reply to Bill McCloskey (:billm) from comment #3)
> > Is there a reason you can't use xpcom-shutdown for this?
> 
> Sure! By "xpcom-shutdown", do you mean that's the name of another
> AsyncShutdown barrier that runs in content?

I guess it doesn't exist there, but we could make one.

> We used "xpcom-will-shutdown" initially (forget the reason, earlier may just
> have sounded better), and were surprised to learn this only worked in the
> master process. That's why we waited for "content-child-shutdown" to be
> added. I wasn't aware that "xpcom-shutdown" existed as an AsyncShutdown
> barrier already and was available to hook up to in the content process.

xpcom-will-shutdown does work in the content process. However, it only runs in DEBUG builds. It sounds like that's all you should need.

Andrew McCreight [:mccr8]

Assignee

Comment 7

•

9 years ago

IIRC, WebRTC was using some kind of profile-related observer topic, which is only fired in the parent process.

Andrew Overholt [:overholt]

Comment 8

•

9 years ago

("fixlater" because I'm guessing this is a "next few months but not this week" kind of thing)

Whiteboard: btpp-fixlater

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 9

•

9 years ago

(In reply to Andrew McCreight [:mccr8] from comment #7)
> IIRC, WebRTC was using some kind of profile-related observer topic, which is
> only fired in the parent process.

You're right, we were using profileBeforeChange (non-e10s only) before bug 1166293, not xpcom-will-shutdown, sorry. Makes sense now.

xpcom-will-shutdown is not an AsyncShutdown barrier AFAICT [1]. What else can we use?

[1] http://mxr.mozilla.org/mozilla-central/source/toolkit/components/asyncshutdown/nsIAsyncShutdown.idl?rev=478290d4a81d#160

Flags: needinfo?(wmccloskey)

David Teller [:Yoric] - still alive but not very active

Comment 10

•

8 years ago

Note: I *think* that xpcom-will-shutdown could be exposed as an AsyncShutdown barrier, but I haven't checked.

> In general: I'm supportive of using exit() after state has been committed to disk.  It avoids the "I closed it but it's still chewing my system!!" annoyance, especially when you just want to close-and-reopen (or shutdown/poweroff).

How fast toes the system actually recollect its resources? I seem to remember that for sockets/file descriptors, this can take one minute on Linux, so we may need to be careful not to cause issues for camera & co.

Flags: needinfo?(dteller)

Paul Adenot (:padenot)

Comment 11

•

8 years ago

(In reply to Randell Jesup [:jesup] from comment #4)
> Paul - any comments or concerns?

No, I think you captured the problems and solutions well.

Flags: needinfo?(padenot)

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 12

•

8 years ago

(In reply to Randell Jesup [:jesup] from comment #4)
> IMHO we might do better in some cases to mark things as
> being expected to leak and avoid a lot of racy shutdown/cleanup code.  The
> biggest danger with shutdown is the possibility of producing a sec problem
> in shutdown as things start to get freed/go away, and "unusual" paths get
> followed.

Why do we still maintain this DEBUG only shutdown code? It's gotten quite complicated, we spend a lot of time fixing intermittents that have nothing to do with the code we ship, and we're not eating what we ship.

Aren't there other ways we could find runtime leaks?

Andrew McCreight [:mccr8]

Assignee

Comment 13

•

8 years ago

(In reply to Jan-Ivar Bruaroey [:jib] from comment #12) 
> Aren't there other ways we could find runtime leaks?

Unfortunately, there are none that I know of. In order to check for leaks, you have to know whether every live allocation should be alive or not. Waiting for shutdown and declaring that nothing should be alive is the easiest way to do this.

Randell Jesup [:jesup] (needinfo me)

Comment 14

•

8 years ago

(In reply to Andrew McCreight [:mccr8] from comment #13)
> (In reply to Jan-Ivar Bruaroey [:jib] from comment #12) 
> > Aren't there other ways we could find runtime leaks?
> 
> Unfortunately, there are none that I know of. In order to check for leaks,
> you have to know whether every live allocation should be alive or not.
> Waiting for shutdown and declaring that nothing should be alive is the
> easiest way to do this.

Right, though there are a couple of ways to do that.  One is what we're (mostly) doing, which is to unwind everything and free, then see what's left.  The other way (which we do with a few bits) is to mark allocations or classes are being ok to exist at exit (some of Mutexes for example IIRC). Big advantage there is cleanup/unwind code has risks of it's own and can introduce race conditions.  Downside of this is that the analysis of leaks is trickier; you end up needing basically a GC scanner to see of allocations are tied to a "ok to leak" root.  That said... overall I think that's a better/safer/overall-lower-effort (at least ongoing) than the complexity of the unwind code, just to avoid leaks.  Also: real leaks we care about are either large single allocations, or allocations that can increase in number the more times you do something.  (Which includes leaks-occasionally-due-race).  one-off leaks (leak of a single pre-process instance of something) are at most an annoyance unless large or eating CPU.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Reporter

Updated

•

8 years ago

Blocks: 1271336

Gian-Carlo Pascutto [:gcp]

Comment 15

•

8 years ago

(In reply to Jan-Ivar Bruaroey [:jib] from comment #2)
> I'm also unsure how the camera code would react to content going _exit(0).

I guess you're thinking of a bug where the Mac OS X shmem code didn't like the content process dying? (Can't remember the exact number). We have to fix and make those robust anyway, so it's not relevant to this discussion.

>In general: I'm supportive of using exit() after state has been committed to disk.

Weekly reminder that on Android the processes are just blown out of the sky regardless of state and any code that assumes differently is a [dataloss] bug to be addressed.

Flags: needinfo?(gpascutto)

Jim Blandy :jimb

Comment 16

•

8 years ago

In bug 1268992, Terrence Cole is considering adding an assertion that the JavaScript heap is empty when we do a "shutdown" GC, which is new testing along the lines cited in comment 12. That seems to be showing up reference-counting leaks like the one the Nick Fitzgerald pursued in bug 1261106.

Andrew McCreight [:mccr8]

Assignee

Comment 17

•

8 years ago

I can try fixing this.

Assignee: nobody → continuation

Flags: needinfo?(wmccloskey)

Andrew McCreight [:mccr8]

Assignee

Comment 18

•

8 years ago

I filed bug 1276383 for the WebRTC part, leaving this for the OS.file part, if we can do it.

Summary: Consider removing content-child-shutdown observers introduced in bug 1216972 → Remove content-child-shutdown observer for OS.file

Andrew McCreight [:mccr8]

Assignee

Comment 19

•

8 years ago

Yoric, what exactly does OS.File do in the content process? It seems like OS.File gives low-level access to the file system (like iterating over files in a directory, and reading and writing to files) that in a sandboxed world we don't want to allow from a child process. Ideally, the child process wouldn't be interacting with the file system at all. Might it be possible to just drop OS.File support for the child process? Then we wouldn't need to worry about flushing IO at content-child-shutdown.

Flags: needinfo?(dteller)

David Teller [:Yoric] - still alive but not very active

Comment 20

•

8 years ago

(In reply to Andrew McCreight [:mccr8] from comment #19)
> Yoric, what exactly does OS.File do in the content process? It seems like
> OS.File gives low-level access to the file system (like iterating over files
> in a directory, and reading and writing to files) that in a sandboxed world
> we don't want to allow from a child process. Ideally, the child process
> wouldn't be interacting with the file system at all. Might it be possible to
> just drop OS.File support for the child process? Then we wouldn't need to
> worry about flushing IO at content-child-shutdown.

I *think* that OS.File is not even launched in content processes, but I haven't checked. If someone is interested in enforcing this, that would be a good idea.

Flags: needinfo?(dteller)

Andrew McCreight [:mccr8]

Assignee

Updated

•

8 years ago

Blocks: 1277067

Andrew McCreight [:mccr8]

Assignee

Comment 21

•

8 years ago

Attached patch Remove content-child-shutdown observer for OS.file. (deleted) — Details — Splinter Review

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dbb7fd16bff0

Attachment #8758739 - Flags: review?(dteller)

David Teller [:Yoric] - still alive but not very active

Comment 22

•

8 years ago

Comment on attachment 8758739 [details] [diff] [review]
Remove content-child-shutdown observer for OS.file.

Review of attachment 8758739 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/osfile/modules/osfile_async_front.jsm
@@ +1526,5 @@
>  }
>  
> +// profile-before-change only exists in the parent, and OS.File should
> +// not be used in the child process anyways.
> +if (!isContent) {

Could we take the opportunity to raise an error if we are triggered in the content process?

Attachment #8758739 - Flags: review?(dteller) → review+

Andrew McCreight [:mccr8]

Assignee

Comment 23

•

8 years ago

(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #22)
> Could we take the opportunity to raise an error if we are triggered in the
> content process?

Turns out, this is loaded in the content process. TelemetryController.jsm loads osfile.jsm in the content process. It looks like it doesn't actually use it any more. I'll file a separate bug for this and see if a sandboxing person is interested. The easiest approach might be to make osfile.jsm not load anything in the content process.

Andrew McCreight [:mccr8]

Assignee

Comment 24

•

8 years ago

I filed bug 1277381.

Pulsebot

Comment 25

•

8 years ago

Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cb803c900f83
Remove content-child-shutdown observer for OS.file. r=Yoric

Carsten Book [:Tomcat]

Comment 26

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/cb803c900f83

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox49: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla49