Closed
Bug 814718
Opened 12 years ago
Closed 12 years ago
Intermittent failure in basic gUM tests | canplaythrough event never fired
Categories
(Core :: WebRTC: Audio/Video, defect)
Tracking
()
VERIFIED
FIXED
mozilla20
People
(Reporter: jsmith, Assigned: roc)
References
Details
(Keywords: intermittent-failure, Whiteboard: [getUserMedia] [blocking-gum+])
Attachments
(2 files)
(deleted),
patch
|
jesup
:
review+
whimboo
:
checkin+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
roc
:
review+
whimboo
:
checkin+
|
Details | Diff | Splinter Review |
Followup on try run from https://bugzilla.mozilla.org/show_bug.cgi?id=781534#c50 - apparently the canplaythrough event never manages to fire in the gum test on OS X 10.7 and 10.8 opt builds. Actual try run results here - https://tbpl.mozilla.org/?tree=Try&rev=f7e037dc5360.
Apparently it's only happening with basic video test.
Reporter | ||
Comment 1•12 years ago
|
||
Blocker cause we can't pref on gum automation without it, which is a blocker as well.
Whiteboard: [getUserMedia] [blocking-gum+]
Reporter | ||
Updated•12 years ago
|
Whiteboard: [getUserMedia] [blocking-gum+] → [getUserMedia] [blocking-gum+] [automation-blocked]
Comment 2•12 years ago
|
||
Maire, can we please get this higher prioritized given that it blocks our automation efforts?
Comment 3•12 years ago
|
||
Taking that per discussion during the 2012/12/04 WebRTC meeting.
If this is indeed MacOS specific, I'll receive a Macbook in a couple days. I won't be able to investigate until then.
Assignee: nobody → paul
Status: NEW → ASSIGNED
Comment 4•12 years ago
|
||
A bit more on this. The "canplaythrough" event fires when I run only the test test_getUserMedia_basicVideo.html. However, when I run the whole suite, I can reproduce. Now that I can repro, I'm going to dig a little further.
Comment 5•12 years ago
|
||
Another update: the root problem is at [1]. We don't go in this if(), and we should: if I log in this function, when we switch from blocked to not blocked, we don't notify the listener, that does not notify the element, and so on. I'm going to try to fix the problem, if this explanation makes sense.
[1]: http://mxr.mozilla.org/mozilla-central/source/content/media/MediaStreamGraph.cpp#813
Assignee | ||
Comment 6•12 years ago
|
||
yep
Reporter | ||
Comment 7•12 years ago
|
||
Paul says he can't reproduce on today's build. he just kicked off a try build with the basic video test prefed on to see what happens:
https://tbpl.mozilla.org/?tree=Try&rev=4a43d50ab3cc
Comment 8•12 years ago
|
||
The try push shows that the problem still occurs, but I still can't reproduce locally. I'll try harder.
Reporter | ||
Comment 9•12 years ago
|
||
So bug 802538's try run reveals this is happening generally with any of the mochitests. And it's not just OS X opt builds either.
Keywords: intermittent-failure
Summary: canplaythrough event on a media element with gum media stream with video fails to fire on OS X 10.7/10.8 opt builds → canplaythrough event on a media element with gum media stream with video or audio fails to fire intermittently
Comment 10•12 years ago
|
||
Here is the current state of understanding of this issue:
What should happen (i.e. how "canplaythrough" is fired):
- Whenever the blocking state of a stream changes (from Blocked to not blocked
or the inverse), in the |MediaStreamGraphImpl::UpdateCurrentTime|, the graph
should call |nsHTMLMediaElement::StreamListener::NotifyBlockingChanged|. This
dispatches an event on the main thread (after all the streams have been
updated), that marks the streams as unblocked in the listener, and call
|UpdateReadyStateForData| on the element, with |NEXT_FRAME_AVAILABLE| as
parameter we it is not blocked anymore, that end up calling
|nsHTMLMediaElement::ChangeReadyState| withe |HAVE_ENOUGH_DATA| as parameter,
that fires "canplaythrough".
What actually happens (i.e. why it is not fired):
- On try, and very rarely on my local machine, and only when running the full
suite (not the individual test), very frequently on MacOS opt, sometimes on
MacOS debug, the MediaStreamGraph experiences a global underrun (the code to
detect that is at the beginning of the |MediaStreamGraphImpl::UpdateCurrentTime|
method). I failed to notice this at the beginning, because LOG macros don't
output on try when doing an opt build.
This moves the |mStateComputedTime| forward to the |nextCurrentTime|, and set
all the streams as blocking from |mStateComputedTime|. The stream becomes
blocked, and everything is dispatched properly to the element.
Next time the MediaStreamGraph thread runs, the stream is logically not blocked
anymore, but should be reported as blocked at |prevCurrentTime|, and we _should_
dispatch an event to the main thread saying that we are not blocked anymore,
since we have advanced the |mCurrentTime| time to compensate the underrun that
made us set the stream as blocked. The problem that seem to happen is that
|prevCurrentTime| is wrong for some reason OR there is a bug in
TimeVarying::GetAt, and this line reports that the stream was not blocked.
> // Save current blocked status
> bool wasBlocked = stream->mBlocked.GetAt(prevCurrentTime);
Because we only notify the element via the listener if there is a blocking state
change from |prevCurrentTime| to |nextCurrentTime| (in the while loop) and
because the streams are marked as not blocked during this period, the
|wasBlocked != blocked| condition is never true, and we don't notify the
element.
When outputing the values, |prevCurrentTime| in the function call where we fail
to notify the element is the |nextCurrentTime| of the previous call, so this
seems alright to me. Then it is normal that the stream is not blocked at that
time, because we specifically updated the graph time to bring it to a point
where it is not blocked to compensate for the underrun, so I believe the
underrun compensation code is wrong: it should set the stream as blocked until
the next time we run the graph, so we report the the stream was blocked at
|prevCurrentTime| in the next run.
Hopefully this makes sense. I have a 3g connection this week, so I'll be able to
check my email, please tell me if you need clarifications or something.
Assignee | ||
Comment 11•12 years ago
|
||
I suspect the problem happens like this:
1) The global underrun happens and we set mCurrentTime and mStateComputedTime to the same value. The stream is marked blocking from prevCurrentTime onward and we notify the element.
2) After that, we recompute blocking status for the stream and decide it should not be blocked from mStateComputedTime onward. So we set that in its mBlocking. Note that mBlocking is now false at mCurrentTime!
3) In the next iteration of UpdateCurrentTime, we proceed normally. wasBlocked is set to false since at prevCurrentTime mBlocking is false.
Basically the problem is that we have forgotten our old blocking status; it's been overwritten when mStateComputedTime == mCurrentTime. We should store it separately on each stream instead of utilizing mBlocked.
Assignee | ||
Comment 12•12 years ago
|
||
Assignee: paul → roc
Attachment #696822 -
Flags: review?(rjesup)
Assignee | ||
Comment 13•12 years ago
|
||
Thanks Paul, I wouldn't have figured that out without your data.
Updated•12 years ago
|
Attachment #696822 -
Flags: review?(rjesup) → review+
Reporter | ||
Comment 14•12 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> Created attachment 696822 [details] [diff] [review]
> fix (hopefully)
>
> https://tbpl.mozilla.org/?tree=Try&rev=05b9fe9c8865
Good news is that the canplaythrough event is firing correctly in the try run. Bad news is that we're still failing in the automation, which I didn't initially expect to see - when I originally ran this try run a while back, the canplaythrough event + leaks were the main issue. But now we're seeing the following with the basicVideo test:
1. [NEW] played length test is failing is consistently
2. [NEW] 1 of the the tests failed with the stream and media element's time not greater than zero
3. [NEW] Lots and lots of leaks in some of the test runs
Any ideas? What's causing these failures?
Assignee | ||
Comment 15•12 years ago
|
||
Maybe the "convert to TrackUnionStream" patch.
Reporter | ||
Comment 16•12 years ago
|
||
Okay. Sounds like it's safe to land this patch then.
May be worth kicking off another try run without the "convert to trackunionstream" patch though to be sure.
Assignee | ||
Comment 17•12 years ago
|
||
Comment 18•12 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=ca0ab4c290b0
With gUM and PeerConnection tests enabled
Comment 19•12 years ago
|
||
So, from that try run with all gum and peerconnections mochitests enabled, it appears that the errors roc saw are in fact real (and new). As validation, I'm running a Try with roc's patch backed out:
https://tbpl.mozilla.org/?tree=Try&rev=1bc99ea0c320
Comment 20•12 years ago
|
||
Backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1322a7cadd56
for:
https://tbpl.mozilla.org/php/getParsedLog.php?id=18382182&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=18383599&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=18382584&tree=Mozilla-Inbound
Comment 21•12 years ago
|
||
Backout was almost certainly due to other patches backed out with it for scalable fonts (since the errors are in SVG). However, we still have issues in the gUM mochitests with this patch applied (see the try pushes)
Reporter | ||
Comment 22•12 years ago
|
||
Comparing the results across each try run, here's what I see:
1. This patch definitely causes the regression with the play length test - didn't happen before this patch, but it happens with this patch applied.
2. There's leaks evident in both try runs, but it doesn't appear to be caused by this patch - my hunch is that these leaks are coming from peer connection. Filed bug 825747 on this.
3. There's crashes in both try runs, but I don't think it's from this patch - probably coming from the peer connection tests. I believe there's some bugs on this already.
4. There's a peer connection test timeout in the try run with this patch, which is a known bug - bug 823056.
So why is the play length test failing with this patch?
Comment 23•12 years ago
|
||
Note from my push with all desktop platforms that only Mac had the new gum Mochitest problems (the Windows failures appear to be known bugs with patches r+'d and waiting to land by jib and ekr)
Interesting....
Comment 24•12 years ago
|
||
Ok, the test that failed on 4/6 Mac runs from my try build was "Played length shall be one - got 0, expected 1". This is tested in the canPlayThrough callback, so my guess is that this is either a timing hole, or (more likely) that it's incorrect to check that data has played in canPlayThrough; that there are instances (underuns) where data may not have played (yet). Roc?
Comment 25•12 years ago
|
||
Ok, checked roc's Try as well. All the errors there appear to be variants of the same problem - data not having played when canplaythrough fires. The leaks are all due to the getUserMedia singleton and audio/video sources, but they appear to only happen occasionally. I'll look at that
Comment 26•12 years ago
|
||
Hmm. Pushing a new try. The tree roc used for his try *seems* to have been last updated on Dec 7th, before my gum leak fixes landed. Hard to tell from Try, so I'll push a new set with jib's and ekr's fixes
Comment 27•12 years ago
|
||
Reporter | ||
Comment 28•12 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #27)
> https://tbpl.mozilla.org/?tree=Try&rev=5ba93a7ba678
Hmm...I don't see a patch on that try run to pref on the mochitests on mozilla central. Am I misreading something or was the try run here setup correctly?
Comment 29•12 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #22)
> Comparing the results across each try run, here's what I see:
>
> 1. This patch definitely causes the regression with the play length test -
> didn't happen before this patch, but it happens with this patch applied.
Right, see above.
> 2. There's leaks evident in both try runs, but it doesn't appear to be
> caused by this patch - my hunch is that these leaks are coming from peer
> connection. Filed bug 825747 on this.
Roc's leak I think is because an old base rev (see above). New push should show.
My leak (one Win debug build) appears to be due to a possible race condition where the PeerConnection leak cleanup patch doesn't finish shutting down Signaling before XPCOM shutdown finishes. Probably need to add synchronization to the shutdown of signaling.
> 3. There's crashes in both try runs, but I don't think it's from this patch
> - probably coming from the peer connection tests. I believe there's some
> bugs on this already.
Yup. Including some ready-to-land patches in the Try push. None seem new (though we don't have a patch for the Mac poisoned-write crash; we need a better backtrace of it).
> 4. There's a peer connection test timeout in the try run with this patch,
> which is a known bug - bug 823056.
>
> So why is the play length test failing with this patch?
Comment 30•12 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #28)
> (In reply to Randell Jesup [:jesup] from comment #27)
> > https://tbpl.mozilla.org/?tree=Try&rev=5ba93a7ba678
>
> Hmm...I don't see a patch on that try run to pref on the mochitests on
> mozilla central. Am I misreading something or was the try run here setup
> correctly?
Oops
https://tbpl.mozilla.org/?tree=Try&rev=33bd2d9bca6e
Comment 31•12 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=8680e3def608
Once more with feeling... and roc's patch too (forgot it was backed out)
Reporter | ||
Comment 32•12 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #31)
> https://tbpl.mozilla.org/?tree=Try&rev=8680e3def608
>
> Once more with feeling... and roc's patch too (forgot it was backed out)
This one has Roc's patch but doesn't have the patch to pref on the tests.
Comment 33•12 years ago
|
||
Try hides any patches that were in your previous Try. You'd have to look at the changesets and then click on the Parent links to know exactly what's in a Try. (That's how I found roc's Try was based in Dec 7th).
Comment 34•12 years ago
|
||
Filed bug 825764 for the new? crash from my try.
No leaks, so I think roc's gum leak was an artifact. There is an intermittent link in windows debug (at least) of sipcc, though it didn't show here.
Reporter | ||
Comment 35•12 years ago
|
||
Still seeing this repeatedly:
ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_getUserMedia_basicVideoAudio.html | Played length shall be one - got 0, expected 1
And saw this once:
39 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_getUserMedia_basicVideoAudio.html | Stream's time should be greater than zero
40 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_getUserMedia_basicVideoAudio.html | MediaElement time shall be greater than zero
Comment 36•12 years ago
|
||
Right. We're waiting on roc to either do further work on canplaythroughevent, or more likely indicate why the tests are wrong to assume that media has played at that point (or without checking something else like the video element state).
All of those appear to be fallout from the canplaythroughevent change (and all the same issue - the media hasn't played yet due to blocking). The other two happen once in a while, the first happens on 50-80% of the Mac try runs.
Assignee | ||
Comment 37•12 years ago
|
||
canplaythrough can fire pretty much right away after the stream has been set up and started playing, but going through the event loop means it can be delayed a long time.
So these tests are not valid:
is(self.mediaElement.ended, false,
"Media element should not have ended");
(canplaythrough might have been delayed until after the media element ended.)
is(self.mediaElement.played.length, 1, "Played length shall be one");
This isn't correct either. We'll receive a message from the MSG when the stream unblocks, which when delivered to the main thread queues dispatch of canplaythrough, but then if the main thread is able to fire canplaythrough before the MSG ticks again, we will still not have been notified that any data has actually been played.
Reporter | ||
Comment 38•12 years ago
|
||
Okay. I'll hack together a patch quickly to remove those tests.
Reporter | ||
Comment 39•12 years ago
|
||
Reporter | ||
Updated•12 years ago
|
Attachment #696900 -
Flags: review?(roc)
Assignee | ||
Comment 40•12 years ago
|
||
Comment on attachment 696900 [details] [diff] [review]
Remove Incorrect Gum Tests
Review of attachment 696900 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/media/tests/mochitest/mediaStreamPlayback.js
@@ -75,5 @@
> - is(self.mediaElement.played.start(0), 0,
> - "Played start shall be zero");
> - is(self.mediaElement.played.end(0), self.mediaElement.currentTime,
> - "End shall be current time");
> - }
You can restore this test to check "length == 0 || (length == 1 && played.end(0) == currentTime)", but removing is good for now.
Attachment #696900 -
Flags: review?(roc) → review+
Reporter | ||
Comment 41•12 years ago
|
||
Sounds good. Let's land each patch now that we know we're good to land after doing some try build checks and fixing the gum tests.
Keywords: checkin-needed
Comment 42•12 years ago
|
||
Comment on attachment 696900 [details] [diff] [review]
Remove Incorrect Gum Tests
https://hg.mozilla.org/integration/mozilla-inbound/rev/b4f5631d7ce7
Attachment #696900 -
Flags: checkin+
Updated•12 years ago
|
Keywords: checkin-needed
Updated•12 years ago
|
Keywords: checkin-needed
Summary: canplaythrough event on a media element with gum media stream with video or audio fails to fire intermittently → Intermittent failure in basic gUM tests | canplaythrough event never fired
Comment 44•12 years ago
|
||
Comment on attachment 696822 [details] [diff] [review]
fix (hopefully)
https://hg.mozilla.org/integration/mozilla-inbound/rev/9f493439be84
Attachment #696822 -
Flags: checkin+
Comment 45•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/b4f5631d7ce7
https://hg.mozilla.org/mozilla-central/rev/9f493439be84
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Updated•12 years ago
|
Whiteboard: [getUserMedia] [blocking-gum+] [automation-blocked] → [getUserMedia] [blocking-gum+]
You need to log in
before you can comment on or make changes to this bug.
Description
•