Closed Bug 1122387 Opened 10 years ago Closed 10 years ago

MediaEngineWebRTCVideoSource::mSources is inadequately protected from non-threadsafe access

Categories

(Core :: WebRTC: Audio/Video, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
firefox35 --- wontfix
firefox36 + fixed
firefox37 + fixed
firefox38 + fixed
firefox-esr31 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: bwc, Assigned: jesup)

References

Details

(Keywords: regression, sec-high, Whiteboard: [adv-main36+])

Attachments

(3 files, 1 obsolete file)

In trying to diagnose this failure:

https://treeherder.mozilla.org/logviewer.html#?job_id=5528186&repo=mozilla-inbound

It appears that |mSources| must either be corrupt, or have been accessed from another thread. Neither of the places that write to it lock |mMonitor| first (although |Stop| locks it afterward), and we would expect to hit |Stop| when things are being torn down (where we see the failure above).

I have not determined why this started happening with the multistream push, but it may just be a matter of having more video tracks widening the window of opportunity.
Looking closer, |MediaEngineWebRTCVideoSource::DeliverFrame| is called by the webrtc.org code, which is probably not being run on the same thread as the rest of the media subsystem. There are a bunch of reads of member variables that are going to be racy in here, as well as accesses to |mSources|.
FWIW, the webrtc.org code is very aggressive about using a lot of threads, and this seems likely
Likely this was introduced in 
Bug 1080755: Push video frames into MediaStreamGraph instead of waiting for pulls 
which iterates mSources in DeliverFrame(), instead of letting MSG iterate the active sources (not mSources, note) and makes N calls to NotifyPull from the MSG thread.

mSources should therefor be protected; it *is* under mMonitor in DeliverFrame, but not in other uses. It is protected for Audio, which always accessed it from webrtc threads.  Bug 1080755 changed this.

   // mMonitor protects mImage access/changes, and transitions of mState
   // from kStarted to kStopped (which are combined with EndTrack() and
   // image changes).  Note that mSources is not accessed from other threads
   // for video and is not protected.
Byron - please give this a try.  Also cleans up a few latent TSAN issues with audio I think.  Still need to check fake-video (MediaEngineDefault*)
With this patch I'm seeing a pretty reliable failure in test_peerConnection_basicH264Video.html on OS X e10s

[Child 15953] ###!!! ASSERTION: null mutex: 'mLock', file ../../dist/include/mozilla/Mutex.h, line 163

Unfortunately, there's no stack.
Seems to happen even without the patches, might be that the plugin crashed; the assertion I see in the output seems to come from the crash reporter, and looks like it could be caused by the out-of-process crash reporter trying to pop when the in-process crash reporter was never enabled (which it wouldn't be in a DEBUG build).
So after some try pushes and lots of retriggers, there seem to be the following failures:

https://treeherder.mozilla.org/logviewer.html#?job_id=4289317&repo=try (3 times, all on linux x64 debug)

https://treeherder.mozilla.org/logviewer.html#?job_id=4287190&repo=try (6 times, all on OS X debug)

https://treeherder.mozilla.org/logviewer.html#?job_id=4289269&repo=try

https://treeherder.mozilla.org/logviewer.html#?job_id=4289100&repo=try (3 times on linux ASAN)
Ok, regarding the failures in https://treeherder.mozilla.org/logviewer.html#?job_id=4289317&repo=try, it looks like this can happen if the tracks are just not ready yet. Not sure whether the right course of action is to wait for the tracks available callback to fire (https://dxr.mozilla.org/mozilla-central/source/dom/media/MediaManager.cpp#894) before calling |Activate| (called just before the setup of the tracks callback, which allows |NotifyPull| to be called), or to just silently ignore this error.
Regarding https://treeherder.mozilla.org/logviewer.html#?job_id=4287190&repo=try, it looks like we're leaking three threads of stuff, maybe the GMP, compositor, and MSG threads? Not sure how this happens. I do see this NS_ERROR assertion earlier in the log on the same process, not sure if it is important:

16:02:19     INFO -  [Child 1054] ###!!! ASSERTION: Appshell already destroyed?: 'Error', file /builds/slave/try-osx64-d-000000000000000000/build/src/dom/media/MediaStreamGraph.cpp, line 1746
16:02:48     INFO -  #01: mozilla::MediaStreamGraphImpl::AppendMessage(mozilla::ControlMessage*) [dom/media/MediaStreamGraph.cpp:1801]
16:02:48     INFO -  #02: mozilla::dom::HTMLMediaElement::Pause(mozilla::ErrorResult&) [dom/html/HTMLMediaElement.cpp:1622]
16:02:48     INFO -  #03: mozilla::dom::HTMLMediaElement::Pause() [dom/bindings/ErrorResult.h:139]
16:02:48     INFO -  #04: mozilla::dom::HTMLMediaElement::UnbindFromTree(bool, bool) [xpcom/base/nsRefPtr.h:209]
16:02:48     INFO -  #05: mozilla::dom::FragmentOrElement::cycleCollection::Unlink(void*) [xpcom/glue/nsCOMPtr.h:389]
16:02:48     INFO -  #06: nsCycleCollector::CollectWhite() [xpcom/base/nsCycleCollector.cpp:3306]
16:02:48     INFO -  #07: nsCycleCollector::Collect(ccType, js::SliceBudget&, nsICycleCollectorListener*, bool) [xpcom/base/nsCycleCollector.cpp:3636]
16:02:48     INFO -  #08: nsCycleCollector::Shutdown() [xpcom/base/nsCycleCollector.cpp:3565]
16:02:48     INFO -  #09: nsCycleCollector_shutdown() [xpcom/base/nsRefPtr.h:44]
16:02:48     INFO -  #10: mozilla::ShutdownXPCOM(nsIServiceManager*) [gfx/layers/ipc/AsyncTransactionTracker.h:130]
16:02:48     INFO -  #11: XRE_TermEmbedding [toolkit/xre/nsEmbedFunctions.cpp:207]
16:02:48     INFO -  #12: mozilla::ipc::ScopedXREEmbed::Stop() [ipc/glue/ScopedXREEmbed.cpp:116]
16:02:48     INFO -  #13: XRE_InitChildProcess [toolkit/xre/nsEmbedFunctions.cpp:580]
16:02:48     INFO -  #14: content_process_main(int, char**) [ipc/contentproc/plugin-container.cpp:211]
https://treeherder.mozilla.org/logviewer.html#?job_id=4289269&repo=try looks like the same problem in comment 8.
Bryon's remaining oranges appear to be due to Bug 1109644's landing - now, instead of storing the amount of data provided to the track in the source, it's stored in the track - but that requires the track to be available.  We received NotifyPull and ask all tracks to provide data, but I suspect it's being invoked because of the other track on the stream (we add (via Start()) the audio track before the video track in MediaOperationsTask::Run()).

It's kinda ugly to deal with due to TSAN.  Let me see what I can do
(In reply to Randell Jesup [:jesup] from comment #11)
> Bryon's remaining oranges appear to be due to Bug 1109644's landing - now,
> instead of storing the amount of data provided to the track in the source,
> it's stored in the track - but that requires the track to be available.  We
> received NotifyPull and ask all tracks to provide data, but I suspect it's
> being invoked because of the other track on the stream (we add (via Start())
> the audio track before the video track in MediaOperationsTask::Run()).
> 
> It's kinda ugly to deal with due to TSAN.  Let me see what I can do

That explains the oranges in comment 8 and comment 10, but doesn't seem to explain the orange in comment 9 (nor the remaining orange; I'm still looking into it, but it is definitely an audio problem).
(In reply to Byron Campen [:bwc] from comment #12)
> (In reply to Randell Jesup [:jesup] from comment #11)
> > Bryon's remaining oranges appear to be due to Bug 1109644's landing - now,
> > instead of storing the amount of data provided to the track in the source,
> > it's stored in the track - but that requires the track to be available.  We
> > received NotifyPull and ask all tracks to provide data, but I suspect it's
> > being invoked because of the other track on the stream (we add (via Start())
> > the audio track before the video track in MediaOperationsTask::Run()).
> > 
> > It's kinda ugly to deal with due to TSAN.  Let me see what I can do
> 
> That explains the oranges in comment 8 and comment 10, but doesn't seem to
> explain the orange in comment 9 (nor the remaining orange; I'm still looking
> into it, but it is definitely an audio problem).

For comment 9, try moving AddRemoveSelfReference() in HTMLMediaElement::Pause() to the end - we could be dropping the self-reference before we call ChangeExplicitBlockerCount(), which is where we're hitting the assertion.  I'm not sure that's to blame, but switching to Paused would affect the conditional in AddRemoveSelfReference ("(!mPaused && mSrcStream && !mSrcStream->IsFinished())")

roc/andreas/jw, any ideas?

I'll have a patch for comment 8 and 10 in a bit.  API is a bit of a pain here, since I need to know if I AddTracked it on MediaManager thread from within NotifyPull (MSG thread), which forces me to use locks.
Flags: needinfo?(roc)
Attached patch avoid NotifyPull() for tracks we haven't added yet (obsolete) (deleted) β€” β€” Splinter Review
Untested patch for comment 8 and comment 10; should compile - will test shortly for basic functionality
(In reply to Randell Jesup [:jesup] from comment #13)
> For comment 9, try moving AddRemoveSelfReference() in
> HTMLMediaElement::Pause() to the end - we could be dropping the
> self-reference before we call ChangeExplicitBlockerCount(), which is where
> we're hitting the assertion.  I'm not sure that's to blame, but switching to
> Paused would affect the conditional in AddRemoveSelfReference ("(!mPaused &&
> mSrcStream && !mSrcStream->IsFinished())")
> 
> roc/andreas/jw, any ideas?

I'm blank on this one I'm afraid.

> I'll have a patch for comment 8 and 10 in a bit.  API is a bit of a pain
> here, since I need to know if I AddTracked it on MediaManager thread from
> within NotifyPull (MSG thread), which forces me to use locks.

These issues look like bug 1116925.
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #16)
> 
> These issues look like bug 1116925.

Yeah, they do. I guess we can move the NotifyPull patch over there, and avoid blocking bug 1095218 on it?
(In reply to Byron Campen [:bwc] from comment #17)
> (In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #16)
> > 
> > These issues look like bug 1116925.
> 
> Yeah, they do. I guess we can move the NotifyPull patch over there, and
> avoid blocking bug 1095218 on it?

I've moved it there and asked for review, as that's orthogonal to this bug.  Don't block on it. 

We still have the one from comment 9 IIRC (Pause)
Attachment #8551989 - Attachment is obsolete: true
try with the locking logic fix, and the fix for bug 1123882:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea15c056b1ae

try with the above plus the fix for NotifyPull:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bfde853ca3f

I need to do a final try push with the multistream work too.
Is ESR31 affected by this?
(In reply to Al Billings [:abillings] from comment #20)
> Is ESR31 affected by this?

The bug that caused this is in 34 and up.
Here's that push with multistream. Looking pretty good so far, still waiting on lots of retriggers. Includes patches from some other bugs.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfca214e0135
Also have verified that signaling_unittests works on linux ASAN.
Seeing a bunch of crashes on linux x86 debug mochitest e10s, seems to be a MOZ_CRASH() caused by failing a check here:

https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/shared_memory_posix.cc?from=shared_memory_posix.cc&case=true#221
Going to try to figure out what errno is when that happens.
As I suspected:

16:33:32     INFO -  [Child 1900] ###!!! ABORT: Call to dup failed, errno=24: file /builds/slave/try-lx-d-000000000000000000000/build/src/ipc/chromium/src/base/shared_memory_posix.cc, line 221

We are exhausting the maximum number of fds in the child process. From what I can tell, every frame of video is getting its own fd. Not sure if something similar applies to audio or networking.
(In reply to Randell Jesup [:jesup] from comment #13)
> For comment 9, try moving AddRemoveSelfReference() in
> HTMLMediaElement::Pause() to the end - we could be dropping the
> self-reference before we call ChangeExplicitBlockerCount(), which is where
> we're hitting the assertion.  I'm not sure that's to blame, but switching to
> Paused would affect the conditional in AddRemoveSelfReference ("(!mPaused &&
> mSrcStream && !mSrcStream->IsFinished())")
> 
> roc/andreas/jw, any ideas?

Not really, but it sounds reasonable.
Flags: needinfo?(roc)
not sure if this will help, but I don't think it will hurt.
Comment on attachment 8550510 [details] [diff] [review]
Update locking logic for getUserMedia video

Byron has green tries on this I understand, and this blocks landing the multistream work (which blocks landing the webrtc branch 40 update).  Thanks!
Attachment #8550510 - Flags: review?(roc)
(In reply to Randell Jesup [:jesup] from comment #28)
> Created attachment 8553715 [details] [diff] [review]
> speculative patch for comment 9's assertion
> 
> not sure if this will help, but I don't think it will hurt.

   I don't disagree; the patch seems to be empty though.
oops, edited wrong version of HTMLMediaElement.cpp
Try push that includes the patch from this ticket, along with multistream (bug 1095218) and bug 1116925:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=88eefa79cb8f
https://hg.mozilla.org/integration/mozilla-inbound/rev/38853d802eac

Did not land the speculative patch, since it has not had a try run yet. Need to handle that one separately.
Comment on attachment 8550510 [details] [diff] [review]
Update locking logic for getUserMedia video

Oops, we goofed and jumped the gun here...

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

   Seems like it would be pretty hard.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

   It would be very easy to see what was being fixed.

Which older supported branches are affected by this flaw?

   Looks like everything but esr31, and b2g <= 2.0.

If not all supported branches, which bug introduced the flaw?

   Bug 1080755.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

   No. I don't think they will be particularly hard. jesup?

How likely is this patch to cause regressions; how much testing does it need?

   Some risk of introducing a deadlock bug, but it has been run through try lots of times.
Attachment #8550510 - Flags: sec-approval?
Assignee: nobody → docfaraday
backports should be trivial
Assignee: docfaraday → rjesup
Comment on attachment 8550510 [details] [diff] [review]
Update locking logic for getUserMedia video

sec-approval+. Let's get this on Aurora and Beta.
Attachment #8550510 - Flags: sec-approval? → sec-approval+
Comment on attachment 8550510 [details] [diff] [review]
Update locking logic for getUserMedia video

[Triage Comment]
Attachment #8550510 - Flags: approval-mozilla-beta+
Attachment #8550510 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/38853d802eac
Flags: needinfo?(rjesup)
Target Milestone: --- → mozilla38
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/8bca070f1dc8
https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/095a46f32cd4
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(rjesup)
Resolution: --- → FIXED
Whiteboard: [adv-main36+]
Group: core-security → core-security-release
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: