Closed Bug 893012 Opened 11 years ago Closed 11 years ago

We leak a ContentParent when its corresponding process is SIGKIL'ed

Categories

(Core :: IPC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
blocking-b2g leo+
Tracking Status
b2g18 --- fixed

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

(Keywords: meta, Whiteboard: [MemShrink:P1])

Attachments

(10 files, 1 obsolete file)

(deleted), application/x-xz
Details
(deleted), application/x-xz
Details
(deleted), application/x-compressed-tar
Details
(deleted), application/x-compressed-tar
Details
(deleted), application/x-compressed-tar
Details
(deleted), application/x-compressed-tar
Details
(deleted), application/zip
Details
(deleted), application/x-compressed-tar
Details
(deleted), application/x-compressed-tar
Details
(deleted), application/x-compressed-tar
Details
STR: * Send SIGKIL to a B2G child process * ContentParent::ActorDestroy runs, but ContentParent::ShutDownProcess does not, and neither does the ContentParent destructor. This is at least part of the cause of bug 889261 and friends.
Blocks: 886217
Assignee: nobody → justin.lebar+bug
Whiteboard: [MemShrink]
If ShutDownProcess is supposed to be called, it probably needs a slightly different name. But yes, ActorDestroy for the root actor is responsible for making sure that it triggers the actions necessary to destroy that actor, because the IPC layer doesn't own it and is therefore not responsible for destroying it as it does for all the child actors.
We leak the ContentParent because its nsFrameLoader keeps a strong ref to it in in |mChildHost|. Still figuring out why we leak the frame loader.
We leak the frameloader because we leak BrowserElementParent. We leak BEP in part because it registers an event listener on its window. I fixed that and it seems like we still leak the BEP, and I'm not sure yet what's going on. But I'm going to open a separate bug for closing the ContentParent's channel upon ActorDestroy. That should take care of the many leaked messages, and it's a prerequisite for not leaking ContentParent.
Depends on: 893172
With the patch from bug 893172, if I remove the mChildHost ref in nsFrameLoader, my memory reporter from bug 893242 shows that we no longer leak the ContentParent. So that's good confirmation. But removing the mChildHost doesn't really fix the problem, since we're still leaking an nsFrameLoader.
Andrew (or someone), can you help me understand what this means? The nsFrameLoader I care about is 0x476a8ec0: > $ python find_roots.py cc-edges.14646.log 0x476a8ec0 > Parsing cc-edges.14646.log. Done loading graph. Reversing graph. Done. > > 0x44e66600 [JS Object (XPCWrappedNative_NoHelper)] > --[js::GetObjectPrivate(obj)]-> 0x4721baf0 [XPCWrappedNative] > --[mIdentity]-> 0x476a8ec0 [nsFrameLoader] > > Root 0x44e66600 is a marked GC object. So far so good; it's held alive only by JS. $ python find_roots.py gc-edges.14646.log 0x44e66600 Parsing gc-edges.14646.log. Done loading graph. Reversing graph. Done. > via resource://gre/modules/BrowserElementParent.jsm : > 0x4391f0a0 [FakeBackstagePass <no private>] > --[BrowserElementPromptService]-> 0x4393b2b0 [Object <no private>] > --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0] > --[WeakMap entry]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] > > via resource://gre/modules/BrowserElementPromptService.jsm : > 0x4391f0b0 [FakeBackstagePass <no private>] > --[BrowserElementPromptService]-> 0x4393b2b0 [Object <no private>] > --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0] > --[WeakMap entry]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] > > via XPCWrappedNative::mFlatJSObject : > 0x47455940 [ContentFrameMessageManager 460381f0] > --[BrowserElementPromptService]-> 0x4394e0a0 [Proxy <no private>] > --[private]-> 0x4393b2b0 [Object <no private>] > --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0] > --[WeakMap entry]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] > > via jar:file:///system/b2g/omni.ja!/components/Webapps.js : > 0x4391fa20 [FakeBackstagePass <no private>] > --[BrowserElementPromptService]-> 0x4393b2b0 [Object <no private>] > --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0] > --[WeakMap entry]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] > > via XPCWrappedNative::mFlatJSObject : > 0x47455040 [ContentFrameMessageManager 469ea100] > --[BrowserElementPromptService]-> 0x47725460 [Proxy <no private>] > --[private]-> 0x4393b2b0 [Object <no private>] > --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0] > --[WeakMap entry]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] > > via nsXPCWrappedJS::mJSObj : > 0x43944100 [Object <no private>] > --[_bepMap]-> 0x4393c5a0 [WeakMap 460380a0] > --[WeakMap entry]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] > > via Preserved wrapper : > 0x43b72580 [HTMLIFrameElement <no private>] > --[setVisible]-> 0x43b84150 [Proxy <no private>] > --[private]-> 0x47781f40 [Function] > --[fun_callscope]-> 0x47481d30 [Call <no private>] > --[**UNKNOWN SLOT 1**]-> 0x47781f00 [Function BrowserElementParent/defineMethod] > --[fun_callscope]-> 0x47481b50 [Call <no private>] > --[self]-> 0x47481b80 [Object <no private>] > --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0] The only non-weak entry here is the last one. The <iframe mozbrowser> has an expando called setVisible, and that function holds a ref to |self|, which is BrowserElementParent, which holds a ref to _frameLoader. That makes sense. There are a few other expandos on <iframe mozbrowser> which are functions whose scopes also hold a ref to the BrowserElementParent, but I presume they're being elided here. So now, why is the iframe alive? > $ python find_roots.py gc-edges.14646.log 0x43b72580 > Parsing gc-edges.14646.log. Done loading graph. Reversing graph. Done. > > via Preserved wrapper : > 0x43b72580 [HTMLIFrameElement <no private>] Hm...what's a preserved wrapper? The cc log isn't any more help. > $ python find_roots.py cc-edges.14646.log 0x43b72580 > Parsing cc-edges.14646.log. Done loading graph. Reversing graph. Done. > > 0x43b72580 [JS Object (HTMLIFrameElement)] > > Root 0x43b72580 is a marked GC object. I tried to look at the object in gdb, but 0x43b72580 does not appear to be a mozilla::dom::HTMLIFrameElement; its vtable pointers don't line up with anything gdb knows about, and its refcnt is 1136019944. Maybe that pointer is to one of the object's nsISupports's, or something. Anyway, I'm not sure what to do now. I could make the expandos on the iframe hold weak refs to the BrowserElementParent, but then we'd still be leaking the iframe. Any ideas?
Ah, I misunderstood those weakmap refs. They're refs where the BEP is the /value/, which is a strong ref. However, these are all keyed on the iframe, so when the iframe goes away, the BEP should too. So we're still left with the question of why this iframe is alive.
Without this, both the window and the iframe hold the BEP alive. With this patch, only the iframe seems to hold the BEP alive (per the above comments), which is what we want. But we still leak the iframe, it seems.
Could you upload the GC and CC logs you are talking about? I can look at them. 0x43b72580 is a JS Object (HTMLIFrameElement), which means it is the JS reflector for the HTMLIFrameElement, not the element itself. This is an all-traces log (hopefully), so you want to look up the reflector 0x43b72580 in the CC log, and that should contain a pointer to the C++ object. When a JS reflector (aka wrapper in the context of preserved wrapper) is created for a C++ object, it owns the C++ object, but not vice versa. If you set an expando on the reflector (or a few other things), we want to ensure the reflector lives as long as the C++ object, so we make the C++ object own its reflector. That's called a "preserved wrapper".
While a preserved wrapper on a C++ object makes it own its reflector, it shouldn't cause its reflector to be marked black (which means the CC won't collect it), except as a CC optimization (if we decide the C++ object is definitely alive, then if the reflector is preserved, the reflector is definitely alive), so I suspect this is an instance where AllTraces is failing to deoptimize, perhaps in mozilla::dom::TraceBlackJS (I filed bug 893378 for that). The good news though is that this is really just find_roots getting confused, the information should still be in the log. In general, you might need to manually root around (no pun intended) in the log, but here I think it should suffice to edit the log, and replace the [gc.marked] next to the entry for 0x43b72580 with [gc]. This will void your warranty (eg it may mess stuff up, including but not limited to things this reflector owns), but I think it should be okay, because nothing else is pointing to the reflector.
Attached file cc edges from comment 6 (deleted) —
Attached file gc edges from comment 6 (deleted) —
Okay, with the suggested change to the log from comment 10, I now get > $ python cc/find_roots.py cc-edges.log 0x43b72580 > Parsing cc-edges.log. Done loading graph. Reversing graph. Done. > > 0x43b725b0 [JS Object (HTMLDivElement)] > --[UnwrapDOMObject(obj)]-> 0x476a8dd0 [FragmentOrElement (xhtml) div id='appframe0' class='appWindow homescreen active' app://system.gaiamobile.org/index.html] > --[mAttrsAndChildren[i]]-> 0x46b9a880 [FragmentOrElement (xhtml) iframe app://system.gaiamobile.org/index.html] > --[Preserved wrapper]-> 0x43b72580 [JS Object (HTMLIFrameElement)] > > Root 0x43b725b0 is a marked GC object. > > 0x43b72610 [JS Object (HTMLDivElement)] > --[UnwrapDOMObject(obj)]-> 0x47b472e0 [FragmentOrElement (xhtml) div class='screenshot-overlay' app://system.gaiamobile.org/index.html] > --[GetParent()]-> 0x476a8dd0 [FragmentOrElement (xhtml) div id='appframe0' class='appWindow homescreen active' app://system.gaiamobile.org/index.html] > --[mAttrsAndChildren[i]]-> 0x46b9a880 [FragmentOrElement (xhtml) iframe app://system.gaiamobile.org/index.html] > --[Preserved wrapper]-> 0x43b72580 [JS Object (HTMLIFrameElement)] > > Root 0x43b72610 is a marked GC object. > > 0x47756d90 [JS Object (Proxy)] > --[private]-> 0x43b72580 [JS Object (HTMLIFrameElement)] > > Root 0x47756d90 is a marked GC object. > > 0x43b84010 [JS Object (Call)] > --[iframe]-> 0x43b72580 [JS Object (HTMLIFrameElement)] > > Root 0x43b84010 is a marked GC object. > > 0x43b6eeb0 [JS Object (Object)] > --[iframe]-> 0x43b72580 [JS Object (HTMLIFrameElement)] > > Root 0x43b6eeb0 is a marked GC object. I'm not sure what the last three mean, but the first is indicating that we're holding a div element alive, and that div contains the iframe. I think the second is saying that this first div contains a second div, and that second div holds alive the first one by virtue of its parent pointer. It's not clear to me whether this rules out the fact that the second div is held alive by the first one (by virtue of being its child); that is, I don't know if the second div is interesting.
Looking for gc roots of the first div element appears to show that Gaia is leaking the world. > $ python ~/code/moz/heapgraph/gc/find_roots.py gc-edges.log 0x43b725b0 > Parsing gc-edges.log. Done loading graph. Reversing graph. Done. > > via nsXPCWrappedJS::mJSObj : > 0x43b6dac0 [Function AppError/<] > --[fun_callscope]-> 0x43b6eee0 [Call <no private>] > --[self]-> 0x43b6ef10 [Object <no private>] > --[app]-> 0x43b6eeb0 [Object <no private>] > --[frame]-> 0x43b725b0 [HTMLDivElement <no private>] > > via nsXPCWrappedJS::mJSObj : > 0x43b6db40 [Function apploaded] > --[fun_callscope]-> 0x43b84010 [Call <no private>] > --[homescreenFrame]-> 0x43b725b0 [HTMLDivElement <no private>]
In Gaia's apps/system/js/window.js: > window.AppError = function AppError(app) { > var self = this; > this.app = app; > this.app.frame.addEventListener('mozbrowsererror', function(evt) { > if (evt.detail.type != 'other') > return; > > console.warn( > 'app of [' + self.app.origin + '] got a mozbrowsererror event.'); > > if (self.injected) { > self.update(); > } else { > self.render(); > } > self.show(); > self.injected = true; > }); > return this; > }; We register one mozbrowsererror listener for every frame, but we never remove the listener. The mozbrowsererror listener's closure holds alive self, which is |this|, which holds alive |this.app|, which holds alive |frame|, which holds alive our iframe. Sheesh. So that's just gonna leak. In Gaia's apps/system/js/window_manager.js: > app.frame.addEventListener(type, function apploaded(e) { > e.target.removeEventListener(e.type, apploaded, true); > > var evt = document.createEvent('CustomEvent'); > evt.initCustomEvent('apploadtime', true, false, { > time: parseInt(Date.now() - iframe.dataset.start), > type: (e.type == 'appopen') ? 'w' : 'c' > }); > iframe.dispatchEvent(evt); > }, true); This one is a bit weird. If this closure holds alive homescreenFrame (which exists in the closure's scope but is not referenced inside the closure), does that mean the closure is heavyweight? If so, why is it heavyweight? Let us pause to consider how incredibly difficult it is to write JS that doesn't leak.
(In reply to Justin Lebar [:jlebar] from comment #15) > In Gaia's apps/system/js/window.js: > > We register one mozbrowsererror listener for every frame, but we never > remove the listener. The mozbrowsererror listener's closure holds alive > self, which is |this|, which holds alive |this.app|, which holds alive > |frame|, which holds alive our iframe. Sheesh. > > So that's just gonna leak. Does "removeEventListener() before the element is removed" help? Do you suggest that we should remove event listener before an element is removed from DOM tree? > > In Gaia's apps/system/js/window_manager.js: > > > app.frame.addEventListener(type, function apploaded(e) { > > e.target.removeEventListener(e.type, apploaded, true); > > > > var evt = document.createEvent('CustomEvent'); > > evt.initCustomEvent('apploadtime', true, false, { > > time: parseInt(Date.now() - iframe.dataset.start), > > type: (e.type == 'appopen') ? 'w' : 'c' > > }); > > iframe.dispatchEvent(evt); > > }, true); > > This one is a bit weird. If this closure holds alive homescreenFrame (which > exists in the closure's scope but is not referenced inside the closure), > does that mean the closure is heavyweight? If so, why is it heavyweight? This one is not invented by me...however I don't get your point here. Which line is the heavyweight point you mean? Or you are saying the whole WindowManager closure is something bad? > > Let us pause to consider how incredibly difficult it is to write JS that > doesn't leak. I agree.
(In reply to Justin Lebar [:jlebar] from comment #15) > In Gaia's apps/system/js/window_manager.js: > > > app.frame.addEventListener(type, function apploaded(e) { > > e.target.removeEventListener(e.type, apploaded, true); > > > > var evt = document.createEvent('CustomEvent'); > > evt.initCustomEvent('apploadtime', true, false, { > > time: parseInt(Date.now() - iframe.dataset.start), > > type: (e.type == 'appopen') ? 'w' : 'c' > > }); > > iframe.dispatchEvent(evt); > > }, true); > > This one is a bit weird. If this closure holds alive homescreenFrame (which > exists in the closure's scope but is not referenced inside the closure), > does that mean the closure is heavyweight? If so, why is it heavyweight? I believe there are situations where a closure can keep alive things that appear to be unnecessary without being heavyweight. Let's poke at this tomorrow (today? :-P) in the debugger, and rope in some JS folks if we need them. > Let us pause to consider how incredibly difficult it is to write JS that > doesn't leak. *le sigh*
> So that's just gonna leak. Cycle collector should collect this cycle. Why doesn't it? Or is the point that "self" is actually a live window all along or something?
> I don't get your point here. > Which line is the heavyweight point you mean? Or you are saying the whole WindowManager > closure is something bad? In our JS engine, a closure can either be "heavyweight" or "lightweight". The idea is that a closure can access variables in its containing scope, so if you keep the closure alive, it has to keep alive the variables from the containing scope that it can access. Normally the JS engine parses the text of the closure, understands which variables the closure might access, and keeps them all alive. When we can do this, we call the closure lightweight. But sometimes we can't determine statically which variables the closure might keep alive. For example, if the closure calls eval(), the eval'ed code might access anything. In this case, we make the closure "heavyweight" and it keeps alive everything in its containing scope. All of this is to say, I don't see why this particular closure should be heavyweight. But also, it doesn't access homescreenFrame, so if it's lightweight, I don't see why it's holding that variable alive. So obviously my explanation is incomplete in some way. :) khuey is looking into it.
> Does "removeEventListener() before the element is removed" help? In this case, yes. Alternatively, we could register just one event listener and use the target of the event to get the iframe. > Do you suggest that we should remove event listener before an element is removed from DOM > tree? I'm not suggesting this as a general rule. The issue is that this particular event listener holds a reference to the DOM element in question. Therefore if we don't remove the event listener when we remove the DOM element from the tree, then the listener leaks the DOM element.
Again: it shouldn't. The cycle collector knows about event listeners!
(In reply to Boris Zbarsky (:bz) from comment #18) > > So that's just gonna leak. > > Cycle collector should collect this cycle. Why doesn't it? Or is the point > that "self" is actually a live window all along or something? Confusingly enough, |frame| is not the iframe. I think it is an element in the window that stays alive. So I don't think there's a cycle to collect. However, if we registered the event listener on |this.app.iframe|, that might let the CC collect the closure...
Oh, so app.frame sticks around because someone is explicitly keeping it alive? Then yeah, its listeners will leak whatever they close over...
(In reply to Boris Zbarsky (:bz) from comment #23) > Oh, so app.frame sticks around because someone is explicitly keeping it > alive? Then yeah, its listeners will leak whatever they close over... app.frame is the container element, a DIV, of the mozbrowser iframe. <div class="app-window"> <iframe mozbrowser></iframe> </div> So we should use app.iframe to attach the event listener and need not to do remove event listener but just remove the whole frame element and then GC would work correctly? Correct me if I'm wrong. If yes, I'm going to fix this case. (In reply to Justin Lebar [:jlebar] from comment #19) > > I don't get your point here. > > Which line is the heavyweight point you mean? Or you are saying the whole WindowManager > > closure is something bad? > > In our JS engine, a closure can either be "heavyweight" or "lightweight". > > The idea is that a closure can access variables in its containing scope, so > if you keep the closure alive, it has to keep alive the variables from the > containing scope that it can access. > > Normally the JS engine parses the text of the closure, understands which > variables the closure might access, and keeps them all alive. When we can > do this, we call the closure lightweight. > > But sometimes we can't determine statically which variables the closure > might keep alive. For example, if the closure calls eval(), the eval'ed > code might access anything. In this case, we make the closure "heavyweight" > and it keeps alive everything in its containing scope. > > All of this is to say, I don't see why this particular closure should be > heavyweight. But also, it doesn't access homescreenFrame, so if it's > lightweight, I don't see why it's holding that variable alive. So obviously > my explanation is incomplete in some way. :) khuey is looking into it. Thanks for explanation. I'll try to find out why this is relevant to |homescreenFrame| variable from my point of view.
(In reply to Alive Kuo [:alive] from comment #24) > (In reply to Justin Lebar [:jlebar] from comment #19) > > > I don't get your point here. > > > Which line is the heavyweight point you mean? Or you are saying the whole WindowManager > > > closure is something bad? > > > > In our JS engine, a closure can either be "heavyweight" or "lightweight". > > > > The idea is that a closure can access variables in its containing scope, so > > if you keep the closure alive, it has to keep alive the variables from the > > containing scope that it can access. > > > > Normally the JS engine parses the text of the closure, understands which > > variables the closure might access, and keeps them all alive. When we can > > do this, we call the closure lightweight. > > > > But sometimes we can't determine statically which variables the closure > > might keep alive. For example, if the closure calls eval(), the eval'ed > > code might access anything. In this case, we make the closure "heavyweight" > > and it keeps alive everything in its containing scope. > > > > All of this is to say, I don't see why this particular closure should be > > heavyweight. But also, it doesn't access homescreenFrame, so if it's > > lightweight, I don't see why it's holding that variable alive. So obviously > > my explanation is incomplete in some way. :) khuey is looking into it. > > Thanks for explanation. I'll try to find out why this is relevant to > |homescreenFrame| variable from my point of view. I'm digging into this with the help of some JS engine folks. Hopefully I'll have an answer soon.
Yes, I think switching to app.iframe would work. I'm testing a patch now, and I'll send a review your way, if that sounds OK to you.
(In reply to Justin Lebar [:jlebar] from comment #26) > Yes, I think switching to app.iframe would work. I'm testing a patch now, > and I'll send a review your way, if that sounds OK to you. Sure, thank you!
No longer blocks: 870588
Depends on: 894081
Depends on: 894135
Depends on: 894147
Comment on attachment 775147 [details] [diff] [review] Part 1: Don't leak BrowserElementParent due to an event listener on the window which contains it. I'm moving this into its own bug (bug 894147), since the Gaia patches also have their own bugs. That way, if there are any other surprise fixes we need here, we can continue to attach them off this bug.
Attachment #775147 - Attachment is obsolete: true
Steps for testing: 1) Apply patches from bug 894081, bug 894135, bug 894147, and bug 893242. I'm working on landing these, but it will take a day or two longer. 2) Load the browser app, go to example.com 3) Load the gallery app 4) adb shell b2g-ps 5) adb shell kill -9 PIDS where PIDS is equal to the list of all PIDs from step 4 above, not including the main b2g process. For example, if adb shell b2g-ps gives me >APPLICATION USER PID PPID VSIZE RSS WCHAN PC NAME >b2g root 2810 158 160808 71504 ffffffff 400f8430 S /system/b2g/b2g >Usage app_2863 2863 2810 64392 24084 ffffffff 40020430 S /system/b2g/plugin-container >Homescreen app_2882 2882 2810 69212 26684 ffffffff 40059430 S /system/b2g/plugin-container >Gallery app_2962 2962 2810 84700 25088 ffffffff 400dd430 S /system/b2g/plugin-container >Browser app_3004 3004 2810 64072 22204 ffffffff 4002e430 S /system/b2g/plugin-container >(Preallocated a root 3068 2810 62996 21152 ffffffff 400e3430 S /system/b2g/plugin-container Then I would run > $ adb shell kill -9 2863 2882 2962 3004 3068 6) Wait ten seconds, then run $ B2G/tools/get_about_memory.py --minimize where B2G is the root B2G checkout (from https://github.com/mozilla-b2g/B2G). 7) In the resulting memory report, search for "content-parent". If this bug is fixed, you should see only one or two content-parent entries: One for the homescreen, and maybe one for the preallocated process. Neither should have pid=-1, and neither should have a closed channel. If this bug is not fixed, you will see content-parent entries with pid=-1 or closed channels. You may see content-parent entries for Browser, Gallery, or Usage, and you may see multiple content-parent entries for the Homescreen. When I checked with these STR a few minutes ago, everything seemed fixed. But now I'm seeing things which indicate that this bug may not be fixed. Please stand by.
Depends on: 893242
Here's an example of a bad memory report, indicating that the bug still exists. > 0 (100.0%) -- queued-ipc-messages > ├──0 (100.0%) ── content-parent((Preallocated), pid=3809, open channel, 0x46dd8000, refcnt=14) > ├──0 (100.0%) ── content-parent(Homescreen, pid=-1, closed channel, 0x43208c00, refcnt=1) > ├──0 (100.0%) ── content-parent(Homescreen, pid=3756, open channel, 0x45bef800, refcnt=15) > └──0 (100.0%) ── content-parent(Usage, pid=-1, closed channel, 0x46f0a800, refcnt=1) We want it instead to look like this. > 0 (100.0%) -- queued-ipc-messages > ├──0 (100.0%) ── content-parent((Preallocated), pid=3809, open channel, 0x46dd8000, refcnt=14) > └──0 (100.0%) ── content-parent(Homescreen, pid=3756, open channel, 0x45bef800, refcnt=15)
It turns out that we "leak" one or two ContentParents under some circumstances (*). This isn't a big deal -- each one is relatively small. What's important is that the number of ContentParent objects we leak is bounded by a small constant. So to test for this, I need to amend the steps from comment 29 a bit. A) Do step 1 B) Repeat steps 2, 3, 4, 5 five times. C) Do steps 6 and 7 D) When looking through the memory report, we pass if there are at most two ContentParent entries with pid=-1 and/or a closed channel. Otherwise we fail. Let me know if this isn't clear. (*) In our testing, it looks like BrowserElementPanning.js can leak one ContentParent, via its pointerDownTarget member. It also looks like Gaia leaks one ContentParent when you kill the Usage app. It gets a mozbrowsererror when the process crashes, but it apparently does not remove the old, dead iframe until it wants to show the Usage app again. These may be bugs (particularly the second one), but I don't think they're a big deal.
One thing I should add is that I haven't tested any of this in b2g18 yet. All of my experiments have been on trunk. If you're going to test on trunk, be sure that you have bug 887075; without it, we leak like crazy. As far as I know, this isn't a problem on b2g18.
Depends on: 887075
blocking-b2g: --- → leo+
Depends on: 894597
Depends on: 894669
Depends on: 894191
Attached file Memory Reports And DMD After 16 hour (deleted) —
this log applied Bug 894081, Bug 894135, Bug 894147, Bug 893242 this was taken after 16 hour.
this log applied Bug 894081, Bug 894135, Bug 894147, Bug 893242 this was taken after 33 hour.
Thank you; this is very useful information.
Next time you collect this information, it would be helpful if you'd collect it with get_about_memory.py --minimize. --minimize forces a GC right before collecting the information. This helps reduce noise in the test results.
> Memory Reports And DMD After 16 hour Also next time it would be helpful to have a GC/CC log both before and after. We'll work with what we have until you can get us better data, but if that's at all possible, please post the new data to the bug when you get it.
Judging by the DMD logs, there's definitely a ContentParent leak still remaining here. But even if we fix that, it's not going to fix the whole problem. About 50% of the increase in memory usage is due to JS; that is likely a completely unrelated problem. To diagnose that, we really need before and after GC/CC logs. My working theory is that maybe ContentParent::ActorDestroy isn't being run. That would definitely cause us to leak. I think we're going to try to catch this in a debugger tomorrow.
Just to be sure for when we try to reproduce this, the tests from comment 33 and 34 are run on b2g-18 and gaia v1-train (plus the patches)?
Flags: needinfo?(jeremy.kim.leo)
By the way, I just pushed a fix to the get_about_memory.py script that should fix getting memory reports from phones running b2g18. If you git fetch && git merge in the root B2G repository, you should get this new version; hopefully it will work for you! https://github.com/mozilla-b2g/B2G/commit/1f777756bfd1a80381208ef9851315ce1cec3dc3
Base Revision : AU_LINUX_GECKO_ICS_STRAWBERRY.01.01.00.019.149 gecko : 499c1f8ea7ad0cdaa7086214278e2944b8a2ea33 gaia : 0d5a9a7577f16b6a72a982148c6f509ee1714ea2 and add those patches. + Bug 894081, Bug 894135, Bug 894147, Bug 894191 + Bug 893242 : except Part 3, v2: Add a memory reporter counting ContentParents and the number of IPC messages they have outstanding. Part 1.5: Make sContentParents a StaticAutoPtr.
Flags: needinfo?(jeremy.kim.leo)
Digging in the logs some more, from the 33 hour CC/GC logs, I see around a thousand BEPs, but I also see about a thousand iframe elements. These iframes are not present in the CC log, so it seems like maybe the CC log it not an AllTraces log, so we can't tell why the event listeners that are holding the iframes alive are being held alive. But I guess the good news here is that it is a similar problem to one we've looked into before. Kyle is trying to reproduce the problem locally.
Depends on: 895933
> so it seems like maybe the CC log it not an AllTraces log Ack, I must have forgotten to mark bug 890135 as leo+. Sorry. :-/
Depends on: 896088
We were able to reproduce more ContentParent leaks today. These latest leaks appear to be distinct from everything we've fixed so far. We were able to partially address them, but we ran into a bug in our GC heap dumping code where we're not dumping the entire heap (!?!?). We'll pick up on Monday trying to fix the tool so we can see what leaks remain.
Can you please provide steps to verify this fix?- to perform blackbox testing.
Depends on: 896755
Depends on: 896773
Whiteboard: [MemShrink] → [MemShrink:P1]
Attached file memory report before testing. (deleted) —
to gather more debugging infomation. this version applied more patchs. - Bug 893242, Bug 895933, Bug 890135 and this is log of before testing after boot up. memory-report with minimize dmd gc/cc log.
and this is log of after testing for 14 hour memory-report with minimize dmd gc/cc log.
Thanks, jeremy, that's really useful. The log from comment 47 has 827 (yes, really) leaked ContentParent objects. Tomorrow I think we will have patches that fix all the ContentParent leaks we have found so far. Hopefully after you retest with those we'll see a much small (maybe even 0!) number.
No longer depends on: 896088
Hi Jeremy, B2G-18 tip and gaia v1-train tip plus the patches from bug 897701, bug 896825, and bug 896755 fix all of the ContentParent leaks that we have been able to reproduce. It would be great if leo could test with those patches.
Flags: needinfo?(jeremy.kim.leo)
For comparison, I also checked this on the latest nightly on Unagi after 7 Trials Gecko http://hg.mozilla.org/releases/mozilla-b2g18/rev/91fa9bf5e638 Gaia a64f37c3327e4cf2105a16eedaf68727bb2adbde BuildID 20130724070208 Version 18.0
It seems that, at least there, the bug was still occurring.
If you weren't on tip and don't have the patches from comment 49 that's expected.
Hi Justin, This is Andre from Leo. The bug 811636 is observed in Leo as well. After hours of test, the output channel's queue starts accumulating again. The child associated to that channel no longer exists, and the main process seems to hold both sides of the pipe. So no channel error is dispatched. This is exactly what bug 811636 fixes. I tested that patch in Leo, and the result was positive: no messages accumulating in the output queue. So I am putting the current issue as dependent of that, since that causes the ContentParent to leak.
Depends on: 811636
Attached file memory report before testing. (deleted) —
this version applied more patchs. - Bug 897701, Bug 896825, Bug 896755 and this is log of before testing after boot up. memory-report with minimize dmd gc/cc log.
Flags: needinfo?(jeremy.kim.leo)
and this is log of after testing for 3 hour memory-report with minimize dmd gc/cc log.
and this is log of after testing for 15 hours memory-report with minimize dmd gc/cc log. awesome! there are only 3 ContentParent objects. it looks very greats.
That is great news. No leaked ContentParent objects after 15 hours, and RSS has only increased by 1.5 MB in 12 hours. Jeremy if leo is happy with these results I'd like to close this bug as RESOLVED FIXED. But if you want to do more QA testing on these patches we can wait until you are done.
Flags: needinfo?(jeremy.kim.leo)
Keywords: meta
It is also worth noting that in the main process, from 3 hours to 15 hours, the only real increase in explicit memory (around 0.36mb) is from DMD itself.
No longer depends on: 896776
sorry for delay because of weekends. first of all, thanks for efforts of all. i agree with Comment 58. it looks clear leaked issue caused by ContentParents. let's change status of this as RESOLVED-FIXED. in additionally, our official QA testing will start on next monday(05-AUG-2013) for a week. it maybe, the official results is possible to take after two weeks.'
Flags: needinfo?(jeremy.kim.leo)
(In reply to leo.bugzilla.gecko from comment #60) > sorry for delay because of weekends. No problem. Have to take a break sometime ;-) > first of all, thanks for efforts of all. > i agree with Comment 58. it looks clear leaked issue caused by > ContentParents. > let's change status of this as RESOLVED-FIXED. Great! > in additionally, our official QA testing will start on next > monday(05-AUG-2013) for a week. > it maybe, the official results is possible to take after two weeks.' Looking forward to those results.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Kyle, i have some question. currently, i found an abnormal behavior. after 3~4 days testing this, b2g has some delay when i do something on target. (for examples, delayed launch application / delayed response as my action) i checked 'adb shell top -m 10' ---------------------------------------------------------------------------------------------- User 88%, System 11%, IOW 0%, IRQ 0% User 275 + Nice 0 + Sys 35 + Idle 0 + IOW 0 + IRQ 0 + SIRQ 0 = 310 PID PR CPU% S #THR VSS RSS PCY UID Name 140 0 92% R 57 240296K 82372K fg root /system/b2g/b2g 163 0 5% S 6 5908K 436K fg root /system/bin/sensord 9099 0 1% R 1 1060K 452K fg root top 79 0 0% S 1 0K 0K fg root mmcqd/0 8909 0 0% S 1 0K 0K fg root kworker/0:2 8 0 0% S 1 0K 0K fg root suspend 9 0 0% S 1 0K 0K fg root sync_supers 10 0 0% S 1 0K 0K fg root bdi-default 11 0 0% S 1 0K 0K fg root kblockd 12 0 0% S 1 0K 0K fg root khubd ---------------------------------------------------------------------------------------------- sometimes, b2g process has 92% of CPU per 20 seconds. so i guess that b2g is enabled some log to gather memory-report/dmd/gc/cc log. is it possible to cause some delays? (after 3days, cc log is 14mbytes, gc log is 20mbytes) (if it is right, how i can disable to log data?) if i builds user-mode, is it disabled logging? (currently, i'm used userdebug-mode.)
There is no way to disable logging, but having logging enabled causes very little overhead, because Firefox has to maintain the data we use to generate the logs, even if we're not using the logs. We have this sort of logging enabled on all desktop Firefox builds. What you are seeing is very unlikely related to gc/cc/about:memory logs.
You should use a profiler to see why Firefox is spinning. I'm not sure how a userdebug-mode build is made, but if you have debug symbols, you could attach GDB, hit ctrl-c, and observe the backtrace. Do this a bunch of times and you might understand why the build is spinning. It might also be worth trying to use SPS, but that will probably only be helpful if we're spending time running JavaScript. Also I'm not sure you can use SPS without first rebooting the device, which isn't what you want to do. https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler#Profiling_Boot_to_Gecko_%28with_a_real_device%29 This issue is probably worth a separate bug; I can help you make sure you have the right people cc'ed to help you with this issue.
ok. thanks Justin. After attach GDB, When b2g has very high cpu, i hits ctrl-c, at this point, almost, b2g is doing to collect cycle memory. so, i doubt cc logging has a overhead. and also, when i reboots to enable profile, because this will be cleared, so i can't do this yet. i understood, i will file a bug as new after more investigation.
Ah, long cycle collection pauses! That is interesting. Please cc me, :mccr8, and :smaug on the bug. Please also attach the gc and cc logs.
OK to mark this as status-b2g18:fixed? If so, please do :)
Flags: needinfo?(justin.lebar+bug)
Flags: needinfo?(justin.lebar+bug)
Depends on: 739681
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: