Closed Bug 824919 Opened 12 years ago Closed 12 years ago

PeerConnection instances not GC'ed until page gets closed (failure '0x80004005' in IPeerConnection.initialize)

Categories

(Core :: WebRTC, defect, P2)


Tracking


VERIFIED FIXED
mozilla21

People

(Reporter: whimboo, Assigned: abr)

References

(Blocks 1 open bug)

Details

(Keywords: testcase, Whiteboard: [WebRTC], [blocking-webrtc+])

Attachments

(2 files, 5 obsolete files)

If no external network is available, we fail to initialize a peer connection instance. We should not fail, given that localhost is still available and could be used, even if only for testing purposes. I see this problem when no network cable is connected and Wifi is turned on but hasn't connected to a known network.

I get the following output when running a peer connection mochitest:

0:27.28 54 INFO TEST-START | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html
0:30.30 55 INFO Error: Unable to restore focus, expect failures and timeouts.
0:30.41 ++DOMWINDOW == 15 (0x10ba7c530) [serial = 19] [outer = 0x1200c9c30]
0:30.48 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004003: file /Volumes/data/code/firefox/nightly/intl/uconv/src/nsCharsetConverterManager.cpp, line 301
0:30.58 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004003: file /Volumes/data/code/firefox/nightly/intl/uconv/src/nsCharsetConverterManager.cpp, line 301
0:30.77 JavaScript error: file:///Volumes/data/code/firefox/obj/debug/dist/NightlyDebug.app/Contents/MacOS/components/PeerConnection.js, line 278: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [IPeerConnection.initialize]

The code that fails is:

_queueOrRun: function(obj) {
  if (this._closed) {
    return;
  }
  if (!this._pending) {
>   obj.func.apply(this, obj.args);    // <-- the call that throws NS_ERROR_FAILURE
    if (obj.wait) {
      this._pending = true;
    }
  } else {
    this._queue.push(obj);
  }
},

This is most likely the reason why we are failing in bug 823056 with the 330s timeout for test_peerConnection_basicAudio.html.
Blocks: 823056
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #0)
> If no external network is available we fail to initialize a peer connection
> instance. We should not fail given that the localhost is still available and
> could be used even only for testing purposes.

This is an intentional design feature. The chance of loopback addresses working in any operational scenario is vanishingly small, and they just clutter up the ICE candidate list. Moreover, as a general matter, we don't intend for PCs to work in offline mode. You should arrange for the testbed machines to have dummy addresses.
Making this a blocker for now (so it doesn't fall off our radar while we get answers to open questions). We need to know how this will interact with the automation tests. Does this actually cause problems on the test machines? We also want to handle gracefully the error case where the user's machine loses its WiFi connection.
Assignee: nobody → adam
Priority: -- → P2
Whiteboard: [WebRTC] → [WebRTC], [blocking-webrtc+]
Currently, when you go into offline mode, we tear down all peerconnections. This doesn't happen at the ICE layer, which will blissfully keep going, but at the PC layer. Similarly, you can't create PCs in offline mode. We certainly don't want to be handing out 127.x addresses ordinarily. I suppose we could add some testing frob that lets you opt into that.
Do the test servers have network connectivity when running their automated tests?
Flags: needinfo?(hskupin)
If we don't allow that, we should at least throw a better failure message that makes it clear why it's failing. What we are getting right now is not clear. The machines have network access; otherwise we wouldn't be able to hand tasks over to them. But none of the tests are allowed to access external networks; everything is done via httpd.js and the loopback interface. So I still can't say what's happening here, given that it is only happening on a single platform right now. I would wait until my patch on bug 824923 has been merged to m-c and then to alder; at that point we will know what's going on. Further discussion about the tests should move to bug 823056.
Status: NEW → ASSIGNED
Flags: needinfo?(hskupin)
This might not be related to a missing network interface. When I discovered this problem I was running a couple of tests in a row, which might have exceeded the number of peer connection instances that can be created before initialize() starts failing. Same as what Jason has seen in bug 825534. Is there a reason why we are failing after around 20 pc instances?
Flags: needinfo?(ekr)
Summary: Failure '0x80004005' in IPeerConnection.initialize when trying to establish a peer connection if no network interface is present → Failure '0x80004005' in IPeerConnection.initialize when creating too many (>20) peer connection instances
Possibly. If they're not cleaned up properly we start to run out of resources. You'd need to run it with logging to find out which resource, and why it's not being cleaned up.
Flags: needinfo?(ekr)
Try NSPR_LOG_MODULES signaling:5
A testcase which can be used to reproduce this bug is attachment 680451 [details] from bug 801227.
Henrik -- are you sure this is the same bug? When I run that test case, I'm reliably getting a different error from a different component: 0x80040111 in [nsIDOMNavigatorUserMedia.mozGetUserMedia], versus 0x80004005 in [IPeerConnection.initialize] Can you please double-check that you're getting the '005 error rather than the '111 error when you run that test case?
Flags: needinfo?(hskupin)
Tracking down the 0x80040111 I describe above, I've isolated it to code inside Navigator::MozGetUserMedia:

nsCOMPtr<nsPIDOMWindow> win = do_QueryReferent(mWindow);
if (!win || !win->GetOuterWindow() ||
    win->GetOuterWindow()->GetCurrentInnerWindow() != win) {
  return NS_ERROR_NOT_AVAILABLE;
}

The check that is failing here ensures that "win->GetOuterWindow()->GetCurrentInnerWindow()" is equal to "win".

I also notice, when running these tests, that the number of DOMWINDOW instances grows rather steadily as the test progresses. Generally, it will get pretty far before hitting this condition (on the order of 20 - 50 reloads). If one then manually hits 'reload' in the browser interface, it will typically run for approximately the same number of iterations before failing out again.

My hunch, then, is that we're not really exhausting a resource here so much as stimulating a condition that has something like a 2%-5% chance of failure. I suspect, but cannot yet confirm, that the fact that we're leaking DOM windows is an unrelated issue.
Okay, on further digging and simplification of the test case, I can isolate the window leak to getUserMedia. See Bug 826538 for a simplified test case that triggers the leak. Note that the timing difference of that test case (it runs much faster than the test case provided above) appears to reduce or eliminate the window for the race that causes the NS_ERROR_NOT_AVAILABLE to be returned.

On further evaluation, however, the conditions that lead to NS_ERROR_NOT_AVAILABLE being returned appear to be perfectly normal. What appears to happen under such circumstances is that the window reload takes partial effect prior to mozGetUserMedia being invoked. Consequently, the conditional check I cite above is simply doing its job by failing the request to get user media when the window is gone or going away. In other words, the "0x80040111 in [nsIDOMNavigatorUserMedia.mozGetUserMedia]" situation I describe in comment 11 appears to be the code working exactly as intended.

At this point, my attempts to reproduce the "0x80004005 in [IPeerConnection.initialize]" condition using the supplied test case have been unsuccessful. I await clarification from Henrik to confirm that the error he sees from the supplied test case is, in fact, the '005 error and not the '111 error.
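A minimal page along these lines can stimulate the same race; this is an illustrative sketch only (assuming the prefixed navigator.mozGetUserMedia API and the fake-stream constraint used by the mochitests), not the actual test case from Bug 826538:

function iterate() {
  try {
    navigator.mozGetUserMedia(
      { audio: true, fake: true },        // fake:true avoids grabbing a real capture device
      function onSuccess(stream) {
        // Reloading from the success callback means the next call to
        // mozGetUserMedia can land while the old inner window is still being
        // torn down -- the race described above.
        window.location.reload();
      },
      function onError(err) {}
    );
  } catch (e) {
    // NS_ERROR_NOT_AVAILABLE (0x80040111) surfaces here when the inner-window check fails.
    document.body.textContent = "mozGetUserMedia failed: " + e;
  }
}
window.addEventListener("load", iterate, false);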
Yes, it's the following failure:

Timestamp: 1/4/13 10:07:06 AM
Error: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [IPeerConnection.initialize]
Source File: resource://gre/components/PeerConnection.js
Line: 278

Just grab the testcase from Jason on bug 825534, and click more than 20 times on the 'Create Peer Connection 1' button.
Flags: needinfo?(hskupin)
Attached file testcase (obsolete) (deleted) —
Testcase for that issue. Let it run for a moment and it will print the failure message to the document body.
Keywords: testcase
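A minimal sketch of what such a test page can look like (assuming the prefixed mozRTCPeerConnection constructor of this era; this is not the exact attached file):

var pcs = [];    // keep references so nothing is eligible for collection between iterations
var count = 0;

function createOne() {
  try {
    pcs.push(new mozRTCPeerConnection());
    count++;
    setTimeout(createOne, 0);             // keep going until initialization fails
  } catch (e) {
    // After a few dozen instances this reports the 0x80004005 failure from
    // IPeerConnection.initialize described in this bug.
    document.body.textContent = "Failed after " + count + " peer connections: " + e;
  }
}
window.addEventListener("load", createOne, false);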
(In reply to Eric Rescorla (:ekr) from comment #9)
> Try NSPR_LOG_MODULES signaling:5

Not sure why but it doesn't put anything in the log. Given that abr is on it I would leave it in his capable hands!
Just keeping some debugging notes in here before I put this down for the evening... From what I've seen (at least, on my machine -- it may vary by IP interface configuration), it errors out on exactly the 31st iteration through sipcc::PeerConnectionImpl::Initialize, when calling into mMedia->Init(). Chasing down that rabbit hole, the failure appears to originate down in sipcc::PeerConnectionMedia::Init, in the thunk to NrIceCtx::StartGathering(). Inside that method, the error comes from nr_ice_initialize -> nr_ice_media_stream_initialize -> nr_ice_component_initialize, which is returning R_NOT_FOUND when the component->candidates list is found to be of length zero. This happens the 184th time nr_ice_component_initialize is called.
I don't have time to write this up 100%, but the short version is that this is failing as expected. I'll put a better explanation in here tomorrow.

The Cliff's Notes version can be found in nsSocketTransportService2.cpp:

//
// the number of sockets that can be attached at any given time is
// limited. this is done because some operating systems (e.g., Win9x)
// limit the number of sockets that can be created by an application.
// AttachSocket will fail if the limit is exceeded. consumers should
// call CanAttachSocket and check the result before creating a socket.
//
bool CanAttachSocket() {
    return mActiveCount + mIdleCount < gMaxCount;
}

On my platform, gMaxCount is 550. When running these tests, I have easily seen us reach past 520 UDP listener sockets just for ICE checks. And that's just how ICE works.
At a high level, we use something like one socket/component/interface. So, if we have a machine with two NICs and we do audio, video, and datachannel with no mux, we would expect something like 8 sockets used/peerconnection.

So, I think there are two questions here:

1. Are the peerconnections being cleaned up?
2. If they are, are the sockets being closed upon cleanup?
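A rough worked example of that arithmetic, under my own assumptions (audio and video each need separate RTP and RTCP components when nothing is muxed; the data channel is left out of the count):

// Back-of-envelope socket estimate per peer connection.
var interfaces = 2;                                      // e.g. wired NIC + Wifi
var componentsPerInterface = 2 /* audio RTP+RTCP */
                           + 2 /* video RTP+RTCP */;
var socketsPerPC = interfaces * componentsPerInterface;  // 2 * 4 = 8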
(In reply to Eric Rescorla (:ekr) from comment #19)
> 1. Are the peerconnections being cleaned up?
> 2. If they are, are the sockets being closed upon cleanup?

From what I've seen, they're cleaned up when the page goes away, but apparently not before that. I added code to force GC between each iteration, but the problem remains. So it does appear that we may have an issue with GC on a single page. Digging...
That's identical to what I have seen, Adam. Thank you for looking into it.
Summary: Failure '0x80004005' in IPeerConnection.initialize when creating too many (>20) peer connection instances → PeerConnection instances not GC'ed until page gets closed (failure '0x80004005' in IPeerConnection.initialize)
Slightly related to this bug, I've filed Bug 827982 to produce a more useful error response in this situation.
I'm attaching a revised test case that forces garbage collection and cycle collection at the start of each iteration, as this is fundamentally a GC/CC bug. We probably want to also investigate what it would take to more aggressively collect PC instances, since they do tend to take a lot of resources when they're first instantiated.
Attachment #697840 - Attachment is obsolete: true
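For reference, the forced collection in such a test page can look roughly like this (assuming a mochitest-style environment where SpecialPowers is available; a chrome-privileged page could call Components.utils.forceGC() and the cycle-collector equivalent instead):

function collectGarbage() {
  // Force a GC and CC pass so that object lifetime, not collection
  // scheduling, is what the test actually exercises.
  SpecialPowers.forceGC();
  SpecialPowers.forceCC();
}

function runIteration() {
  collectGarbage();
  var pc = new mozRTCPeerConnection();    // same prefixed constructor as in the earlier sketch
  // ... exercise the peer connection, then schedule the next iteration ...
}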
This bug is being treated as the problem with GC/CC not collecting peer connection objects. Once it is fixed, the original test case will still return errors due to network socket exhaustion (the revised test case avoids this by manually triggering GC and CC). Fixing that behavior (if a fix is even possible) will require substantial research and possibly non-trivial new work. I have filed bug 829591 to address that task.
Blocks: 829591
Okay, I have a patch ready that converts two previously strong references into weak references, so as to allow cycle collection to occur correctly. At a high level, the classes involved relate as follows (after the patch):

On the JavaScript side:
- GlobalPCList holds a weak reference to each PeerConnection.
- PeerConnection and PeerConnectionObserver hold strong references to each other.

Across the JavaScript/C++ boundary:
- PeerConnection wraps the C++ IPeerConnection interface; PeerConnectionObserver implements IPeerConnectionObserver.

On the C++ side:
- PeerConnectionImpl is-a IPeerConnection.
- PeerConnectionImpl holds a weak reference to its IPeerConnectionObserver.

Previously, the GlobalPCList had a strong reference to the PeerConnection, which would keep them alive until the DOM window went away. Also, the C++ PeerConnectionImpl objects had a strong reference to the IPeerConnectionObservers, which would keep the entire graph alive. Although this forms a simple cycle, the cycle collector would not find it.

This could have been solved two ways: either by adding the PeerConnectionImpl objects to the cycle collector, or by making the observer/observed relationship weak (which is a common pattern). We elected to do the second, since we initially believed that there were some threading issues that counter-indicated cleaning up the PeerConnectionImpl on the main thread (although, at this point, I believe these concerns may have been misplaced).

This patch now demonstrably solves the GC/CC issue, as the attached test case reveals. However, it either introduces or unmasks two classes of existing warnings (but no crashes!) on process shutdown.

One happens if a PC-attached media element (e.g., nsHTMLAudioElement) is still allocated when the browser closes. These are of the form "WARNING: NS_ENSURE_TRUE(mMainThread) failed: file .../xpcom/threads/nsThreadManager.cpp, line 250" (and a matching warning for the same condition, emitted from nsThreadUtils.cpp), and appear to result from cycle collection occurring after the thread manager has begun tearing the process threads down. The stack for these warnings looks like this (from inner frame to outer frame):

nsThreadManager::GetMainThread
NS_GetMainThread_P
NS_DispatchToMainThread_P
nsHTMLMediaElement::DispatchAsyncEvent
nsHTMLMediaElement::FireTimeUpdate
nsHTMLMediaElement::Pause
nsHTMLAudioElement::Pause
nsHTMLMediaElement::UnbindFromTree
mozilla::dom::Element::UnbindFromTree
nsGenericHTMLElement::UnbindFromTree
mozilla::dom::FragmentOrElement::cycleCollection::UnlinkImpl
nsCycleCollector::CollectWhite
nsCycleCollector::FinishCollection
nsCycleCollector::Collect
nsCycleCollector::Shutdown
nsCycleCollector_shutdown
mozilla::ShutdownXPCOM
NS_ShutdownXPCOM_P
ScopedXPCOMStartup::~ScopedXPCOMStartup
ScopedXPCOMStartup::~ScopedXPCOMStartup
XREMain::XRE_main
XRE_main
do_main
main

The other warning occurs when the window closes (nsGlobalWindow::ReallyCloseWindow), and appears to be due to elements that had an associated script object at some point but no longer do. I don't know enough about the DOM internals to know exactly what this means or whether it is cause for concern.
The specific warning is "WARNING: NS_ENSURE_SUCCESS(rv, false) failed with result 0x8000FFFF: file .../content/base/src/nsContentUtils.cpp, line 2974" (which is also reported from a higher stack frame by nsXBLProtoImplMethod).

As these warnings occur at shutdown and do not appear to cause any collateral damage, I don't think we want to block this leak fix on nailing them down completely. So I'm going to attach my patch momentarily, and try to find someone with DOM knowledge who might be able to comment on the severity of these shutdown warnings.
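For illustration only, the weak-reference pattern on the JavaScript side looks roughly like this. This is a sketch of the general approach, not the literal patch; Cu.getWeakReference is the standard helper for holding a weak reference from chrome JavaScript, and _winID is a hypothetical name for the per-window key:

const Cu = Components.utils;

var GlobalPCListSketch = {
  _list: {},

  addPC: function(pc) {
    // Hold only a weak reference, so a PeerConnection with no other owners
    // can be garbage/cycle collected before its window goes away.
    var winId = pc._winID;                       // hypothetical per-window key
    if (!this._list.hasOwnProperty(winId)) {
      this._list[winId] = [];
    }
    this._list[winId].push(Cu.getWeakReference(pc));
  },

  livePCs: function(winId) {
    // Dereference each weak ref; entries whose target has been collected
    // come back as null and are dropped.
    return (this._list[winId] || [])
      .map(function(ref) { return ref.get(); })
      .filter(function(pc) { return pc !== null; });
  }
};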
Olli would be a good person to ask as he knows a lot about both cycle collection and events. :)
Attachment #701248 - Attachment is obsolete: true
Attachment #701251 - Flags: review?(ekr)
Attachment #701251 - Flags: review?(rjesup)
Attachment #701251 - Flags: review?(bugs)
Attachment #701251 - Flags: review?(rjesup) → review+
/me so much wishes PeerConnection was implemented in C++. Reviewing...
NS_ENSURE_TRUE(mMainThread) and NS_ENSURE_SUCCESS(rv, false) are both ok. (they are just super-useful-for-debugging warnings)
Comment on attachment 701251 [details] [diff] [review]
Weaken key references to PeerConnection and friends

>-    nsresult rv = observerService->AddObserver(this,
>-                                               NS_XPCOM_SHUTDOWN_OBSERVER_ID,
>-                                               false);
>+    nsresult rv = NS_OK;
>+
>+#ifdef MOZILLA_INTERNAL_API
>+    rv = observerService->AddObserver(this,
>+                                      NS_XPCOM_SHUTDOWN_OBSERVER_ID,
>+                                      false);
>     MOZ_ASSERT(rv == NS_OK);
>+#endif
>     (void) rv;
>   }

And one should check success using NS_SUCCEEDED(rv).

Before this patch, do we actually have shutdown leaks, and is that why we're not even getting the NS_ENSURE_TRUE(mMainThread) warning? Have you run this with the XPCOM_MEM_LEAK_LOG env variable set?

(FYI, https://wiki.mozilla.org/Performance:Leak_Tools has all sorts of useful information about leak hunting)
Attachment #701251 - Flags: review?(bugs) → review+
This hasn't appeared to leak at shutdown (though peerconnection mochitests aren't enabled on m-c yet, we do run them locally and on alder). Probably the reason it doesn't leak is that the PC global list is released on XPCOM shutdown.
The reason is that all instances get released once the page (tab) gets closed, and probably also when it gets reloaded.
Attachment #701251 - Attachment is obsolete: true
Attachment #701251 - Flags: review?(ekr)
Comment on attachment 701949 [details] [diff] [review] Weaken key references to PeerConnection and friends, The new patch fixes the nit from smaug's review. Already have r+ from jesup, smaug.
Attachment #701949 - Flags: review?(ekr)
(In reply to Olli Pettay [:smaug] from comment #31)
> Comment on attachment 701251 [details] [diff] [review]
> Weaken key references to PeerConnection and friends
>
> >-    nsresult rv = observerService->AddObserver(this,
> >-                                               NS_XPCOM_SHUTDOWN_OBSERVER_ID,
> >-                                               false);
> >+    nsresult rv = NS_OK;
> >+
> >+#ifdef MOZILLA_INTERNAL_API
> >+    rv = observerService->AddObserver(this,
> >+                                      NS_XPCOM_SHUTDOWN_OBSERVER_ID,
> >+                                      false);
> >     MOZ_ASSERT(rv == NS_OK);
> >+#endif
> >     (void) rv;
> >   }
> And one should check success using NS_SUCCEEDED(rv)

Done.

> Before this patch, do we actually have shutdown leaks and that's why not
> even getting NS_ENSURE_TRUE(mMainThread)?

We've been running without this patch and haven't seen any shutdown leaks. The only thing that's causing problems is that the PeerConnection (and its extended graph) was hanging around until its associated DOM window went away.

> Have you run this with XPCOM_MEM_LEAK_LOG env variable set?

I hadn't, but just did (both with and without the patch). It reported no leaks.

> (FYI, https://wiki.mozilla.org/Performance:Leak_Tools has all sorts of
> useful information about leak hunting)

Thanks for the pointer; it's very helpful. I've bookmarked it for later reference.
Comment on attachment 701949 [details] [diff] [review]
Weaken key references to PeerConnection and friends,

Review of attachment 701949 [details] [diff] [review]:
-----------------------------------------------------------------

I am assuming that smaug and/or jesup reviewed the JS. This is an r+ for the C++ code.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +52,1 @@
>   (void) rv;

What does this last line do? Seems pointless.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +552,3 @@
>   RUN_ON_THREAD(mThread,
> +    WrapRunnable(pco,
> +      &IPeerConnectionObserver::NotifyClosedConnection),

Indentation broken.

@@ +1028,2 @@
>   }
> + runnable->Run();

This appears to leak the runnable if !mThread?
Attachment #701949 - Flags: review?(ekr) → review+
Attachment #701949 - Attachment is obsolete: true
Attachment #702012 - Flags: checkin?(ethanhugg)
Attachment #702012 - Flags: checkin?(ethanhugg) → checkin+
This got backed out of m-i for mochitest oranges. I'll push to try and investigate tomorrow morning.
Whoops, forgot debug builds. Second batch of try run: https://tbpl.mozilla.org/?tree=Try&rev=692fbe241f50
Just for reference before it gets lost, here the tbpl results for inbound: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=cebb008a72f9
Digging through the try failures, all I'm seeing are problems that seem really quite far away from anything this patch touches. The inbound brokenness is pretty catastrophic, and I'm not going to attempt to slog through it all; a few quick spot checks turn up the same thing: nothing pointing to this patch. Also, the builds leading up to this patch landing show quite a bit of orange in the log.

So here's another go at a full try run, using a recent tip of m-i: https://tbpl.mozilla.org/?tree=Try&rev=6916eaae1e60

Unfortunately, neither m-i nor m-c are completely green at the moment, but this should give us a good idea about whether any of the apparently unrelated issues from last night pop up again -- and therefore might be related in some obscure way.
Okay, that try run turned up the same failures (which is really quite odd, since the failing tests should not be touching the code this patch changes). As an isolation step, I'm pushing the same tree to try without the patch here: https://tbpl.mozilla.org/?tree=Try&rev=20ba330545e4
And that push turned up green. So regardless of whether this patch is actually the root of these oranges (which seems quite unlikely), it does demonstrably *trigger* them. Since this bug won't block preffing PC on, I'm going to back-burner it and circle back once I've finished work on a few that will.
Adam, whenever you push a try build, please use the alder branch as the base. Only there do we have our crashtests and additional peer connection mochitests running. Without those we might not see new regressions in our code.
(In reply to Henrik Skupin (:whimboo) from comment #48)
> Adam, whenever you push a try build please use the alder branch as base.
> Only there we have our crashtests and additional peer connection mochitests
> running. Without those we might not see new regressions in our code.

If I'm pushing to test regressions, I'll certainly use the alder branch. These pushes aren't meant to test regressions. These pushes are trying to figure out why m-i produces orange with this patch. The additional crashtests and mochitests from alder would actually interfere with that troubleshooting process by introducing additional variables.
Attachment #702012 - Flags: checkin+
Out of sheer incredulity, I ran another push with this patch applied. It failed. Here is the test run, for the sake of posterity: https://tbpl.mozilla.org/?tree=Try&rev=62b71c2aa79d
Attachment #702012 - Attachment is obsolete: true
The patch has been revised to fix an exception that was thrown when hasActivePeerConnection was called with a page ID for a page that had never had a PeerConnection attached to it. Try run: https://tbpl.mozilla.org/?tree=Try&rev=f0caa7592135
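The sort of guard that fix implies, sketched here with the hypothetical names from the earlier registry sketch (the real method lives in PeerConnection.js):

hasActivePeerConnection: function(winId) {
  // Return false, rather than throwing, for a window that never registered
  // a PeerConnection: its key is simply absent from the table.
  if (!this._list.hasOwnProperty(winId)) {
    return false;
  }
  return this.livePCs(winId).length > 0;
},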
Comment on attachment 703624 [details] [diff] [review] Weaken key references to PeerConnection and friends See https://bugzilla.mozilla.org/attachment.cgi?oldid=702012&action=interdiff&newid=703624&headers=1 for interdiff between this and previous review.
Attachment #703624 - Flags: review?(rjesup)
Comment on attachment 703624 [details] [diff] [review] Weaken key references to PeerConnection and friends Got r+ from jesup on the interdiff via IRC.
Attachment #703624 - Flags: review?(rjesup)
Attachment #703624 - Flags: review+
Attachment #703624 - Flags: checkin?(rjesup)
Comment on attachment 703624 [details] [diff] [review] Weaken key references to PeerConnection and friends Landed on inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/c2636bf84b4b
Attachment #703624 - Flags: checkin?(rjesup) → checkin+
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
Flags: in-testsuite?
Keywords: verifyme
No dice. This doesn't work:

Timestamp: 1/28/2013 7:23:34 PM
Error: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [IPeerConnection.initialize]
Source File: resource://gre/components/PeerConnection.js
Line: (null)

I managed to get the NS_ERROR_FAILURE in my test case in bug 825534 by hammering creation of many peer connections (maybe 20 to 30?).

Followup bug coming.
Depends on: 835657
(In reply to Jason Smith [:jsmith] from comment #57)
> No dice. This doesn't work:
>
> Timestamp: 1/28/2013 7:23:34 PM
> Error: NS_ERROR_FAILURE: Component returned failure code: 0x80004005
> (NS_ERROR_FAILURE) [IPeerConnection.initialize]
> Source File: resource://gre/components/PeerConnection.js
> Line: (null)
>
> I managed to get the NS_ERROR_FAILURE in my test case in bug 825534 by
> hammering creation of many peer connections (maybe 20 to 30?).
>
> Followup bug coming.

Yeah, the follow-up bug that you're about to file is a dupe of Bug 829591. See comment 24 for details.
Okay. I duped my bug against that one. Marking as verified, with the follow-up already on file.
Status: RESOLVED → VERIFIED
Keywords: verifyme