Closed
Bug 870564
Opened 12 years ago
Closed 11 years ago
Youtube video freezes after a long time (~30 mins)
Categories
(Firefox OS Graveyard :: General, defect, P1)
Tracking
(blocking-b2g:leo+, firefox22 wontfix, firefox23 wontfix, firefox24 fixed, b2g18+ verified, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)
People
(Reporter: diego, Assigned: mikeh)
References
()
Details
(Whiteboard: c=video [TD-26549], MiniWW, [YouTubeCertBlocker+])
Attachments
(8 files, 1 obsolete file)
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
application/x-zip-compressed
|
Details | |
(deleted),
application/x-zip-compressed
|
Details | |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jduell.mcbugs
:
review+
|
Details | Diff | Splinter Review |
This youtube video freezes after a very long time
https://www.youtube.com/watch?v=V1IBH1bz5-c
I've seen it fail in two different ways:
* Green screen. This seems to be a wi-fi connection issue. May be device specific
* Screen freeze. This looks like a problem somewhere in the media graph. I often see a "Timed out waiting for output buffers: 0/0" message for audio buffers.
Updated•12 years ago
|
blocking-b2g: --- → tef?
Updated•12 years ago
|
Assignee: nobody → sotaro.ikeda.g
Comment 1•12 years ago
|
||
By Bug 863441, youtube video playback becomes somewhat stable. But there are still some problems.
Comment 2•12 years ago
|
||
I did not see green. But failed in 2 ways.
- [1] video freeze(but ui works)
- [2] video app crash at
mozilla::net::HttpChannelChild::OnStartRequest ( Bug 868073 )
Comment 3•12 years ago
|
||
I debugged about [1].
[1] happend by nsMediaCacheStream::Read() stall forever. OMXCodec calls the function with grabbing OMXCodec's mutex. Then the other threads(decoder thread and ImageBridge thread) waited forever to get the mutex.
Comment 4•12 years ago
|
||
Comment 5•12 years ago
|
||
It seems that both [1] and [2] are caused by HttpChannelChild.
The following diagram is about ChannelMediaResource.
https://github.com/sotaroikeda/firefox-diagrams/blob/master/media/content_media_ChannelMediaResource_FirefoxOS_1_01.pdf?raw=true
Updated•12 years ago
|
Whiteboard: [TD-26549]
Target Milestone: --- → 1.1 QE2
Comment 6•12 years ago
|
||
Not necessarily a blocker for v1.0.1/v1.1, but we would take a low risk uplift if available. Videos on YouTube that are >30min long are very uncommon, and it's not yet clear if this is specific to long videos or this video (we need more repro cases).
blocking-b2g: tef? → -
tracking-b2g18:
--- → +
Updated•12 years ago
|
Whiteboard: [TD-26549] → c= [TD-26549]
Updated•12 years ago
|
Whiteboard: c= [TD-26549] → c=video [TD-26549]
Updated•12 years ago
|
Status: NEW → ASSIGNED
Updated•12 years ago
|
blocking-b2g: - → leo?
Priority: -- → P1
Assignee | ||
Updated•12 years ago
|
Assignee: sotaro.ikeda.g → mhabicher
Comment 7•12 years ago
|
||
Triage 5/20 -
TEF? per blocking a tef+ bug 871753.
If decided not to be tef+, then please leo+.
blocking-b2g: leo? → tef?
Comment 8•12 years ago
|
||
(In reply to Wayne Chang [:wchang] from comment #7)
> Triage 5/20 -
> TEF? per blocking a tef+ bug 871753.
> If decided not to be tef+, then please leo+.
bug 871753 is already fixed in all branches. Can you clarify how does this one blocks 871753?
Reporter | ||
Comment 10•12 years ago
|
||
Bug 871753 seems to help, but I still saw the time out message around 40 minutes into the video:
E/OMXCodec( 674): [OMX.qcom.audio.decoder.aac] Timed out waiting for output buffers: 0/2
Sotaro,
Are you able to play the youtube video in the description all the way?
Comment 11•12 years ago
|
||
I cannot play youtube at the end of contents.
If I try to play URL in comment 0, the screen freezes at 200 seconds from the start, 70~80% probability.
Comment 12•12 years ago
|
||
(In reply to Daniel Coloma:dcoloma from comment #8)
> (In reply to Wayne Chang [:wchang] from comment #7)
> > Triage 5/20 -
> > TEF? per blocking a tef+ bug 871753.
> > If decided not to be tef+, then please leo+.
>
> bug 871753 is already fixed in all branches. Can you clarify how does this
> one blocks 871753?
This block was set by Sotaro, perhaps we need him to comment if this is still true.
Flags: needinfo?(wchang) → needinfo?(sotaro.ikeda.g)
Comment 13•12 years ago
|
||
The block was set because Youtube freeze happens by some reasons on a same youtube video. And it is difficult to differentiate which freeze happens just from the freezed UI.
Flags: needinfo?(sotaro.ikeda.g)
Assignee | ||
Comment 14•12 years ago
|
||
thread 18, blocked here:
#7 nsMediaCacheStream::Read (this=0x446b8030, aBuffer=<value optimized out>, aCount=<value optimized out>, aBytes=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/nsMediaCache.cpp:2103
#8 0x40899d08 in mozilla::ChannelMediaResource::Read (this=<value optimized out>, aBuffer=0x80 <Address 0x80 out of bounds>, aCount=4294965384, aBytes=0x0)
at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/MediaResource.cpp:754
#9 0x40872a9c in android::MediaStreamSource::readAt (this=0x44011320, offset=5362745, data=0x43bbfc00, size=340) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:101
thread 17 and 16, blocked on 'mutex=0x41f25410':
- thread 17: android::OMXCodec::signalBufferReturned
- thread 16: android::OMXCodec::read
thread 9, blocked here:
#0 poll () at bionic/libc/arch-arm/syscalls/poll.S:10
#1 0x4167036a in _pr_poll_with_poll (pds=0x42ca8020, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920
#2 0x41670546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322
#3 0x404426d2 in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, interval=0x434c2e24) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:397
#4 0x40442bf0 in nsSocketTransportService::DoPollIteration (this=0x41f13100, wait=true) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:781
#5 0x40442ddc in nsSocketTransportService::Run (this=0x41f13100) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:645
- thread 8: XPCJSRuntime::WatchdogMain
- thread 7: js::SourceCompressorThread::threadLoop
- thread 6: js::GCHelperThread::threadLoop
- thread 5: nsCycleCollectorRunner::Run
thread 4, here:
#0 __ioctl () at bionic/libc/arch-arm/syscalls/__ioctl.S:9
#1 0x40107b5c in ioctl (fd=<value optimized out>, request=8) at bionic/libc/bionic/ioctl.c:41
#2 0x41858a6e in android::IPCThreadState::talkWithDriver (this=0x41bfe110, doReceive=true) at frameworks/base/libs/binder/IPCThreadState.cpp:804
#3 0x41859354 in android::IPCThreadState::joinThreadPool (this=0x41bfe110, isMain=false) at frameworks/base/libs/binder/IPCThreadState.cpp:457
thread 2, blocked here:
#5 android::MPEG4DataSource::readAt (this=0x42cd3a20, offset=2, data=0x44030000, size=17040) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:164
#6 0x4173e4da in android::MPEG4Source::read (this=0x41b4f240, out=0x100ffc70, options=<value optimized out>) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:2410
#7 0x4174c6f4 in android::OMXCodec::drainInputBuffer (this=0x41f25360, info=0x42cd29f0) at frameworks/base/media/libstagefright/OMXCodec.cpp:4046
Assignee | ||
Comment 15•12 years ago
|
||
Sotaro, I'm trying to reproduce this bug on Inari, and what I'm seeing is the video play for ~5 to 10 minutes without trouble, but it progressively starts to degrade. It doesn't stop all of a sudden.
Instead, it looks like the video just stops playing; then a few minutes later, there is some renewed network activity (at least according to the arrows in the status bar), then the video resumes (albeit it stutters a lot) with audio; then everything seems to catch up for a while and play smoothly, before the video stops again.
With each stop/restart, it takes longer for the video to resume, and it plays for a much shorter time before stopping again.
logcat shows regular occurrences of these statements while this is happening:
E( 872:0x3ea) [OMX.qcom.audio.decoder.aac] Timed out waiting for output buffers: 0/0
W( 872:0x405) obtainBuffer() track 0x43e794c0 disabled, restarting
E( 117:0xe0) Parsing error in AudioFilter.csv.
W( 872:0x405) obtainBuffer() track 0x43e794c0 disabled, restarting
W( 117:0xe0) Thread AudioOut_1 cannot connect to the power manager service
Flags: needinfo?(sotaro.ikeda.g)
Reporter | ||
Comment 16•12 years ago
|
||
mikeh,
I see the same timeout. I don't see any stutter though.
As a side note a very common problem I've been seeing is MAC address clashing. The reference devices my team has all share the same MAC address by default. Fortunately we were able to override it via a setting file called softmac.
Any chance the inaris have the same problem? Can you compare the MAC address of two inaris?
Assignee | ||
Comment 17•12 years ago
|
||
Relevant code:
- "obtainBuffer() track 0x43e794c0 disabled, restarting"
http://androidxref.com/4.0.4/xref/frameworks/base/media/libmedia/AudioTrack.cpp#915
- comment:
// restart track if it was disabled by audioflinger due to previous underrun
- "Timed out waiting for output buffers: 0/0"
http://androidxref.com/4.0.4/xref/frameworks/base/media/libstagefright/OMXCodec.cpp#3429
Assignee | ||
Comment 18•12 years ago
|
||
(In reply to Diego Wilson [:diego] from comment #16)
>
> As a side note a very common problem I've been seeing is MAC address
> clashing. The reference devices my team has all share the same MAC address
> by default. Fortunately we were able to override it via a setting file
> called softmac.
>
> Any chance the inaris have the same problem? Can you compare the MAC address
> of two inaris?
Diego, I would hope not--I'm running the tests on my home network. :) The MAC on my Inari is 90:1D:27:02:B5:A7.
It's interesting that after a ~10 to 15-minute "freeze", the YouTube video in comment 0 has now resumed playing completely smoothly, with no more pausing.
Assignee | ||
Comment 19•12 years ago
|
||
I also observe that during the periods where the audio plays smoothly, the video is either frozen, or significantly behind the audio track.
Reporter | ||
Comment 20•12 years ago
|
||
mikeh,
Sounds like your seeing bug 863441 and bug 864230. The first one was fixed in gecko a couple of weeks ago. The second one is a fix in gonk. You may need an updated vendor build for that.
Assignee | ||
Comment 21•12 years ago
|
||
diego, I'm running a build from a completely fresh repo that I pulled this morning (~10am, EDT). Would that include the fix in bug 864230?
Flags: needinfo?(dwilson)
Comment 22•12 years ago
|
||
It is not enabled on Mozbuild. Add patch to https://github.com/mozilla-b2g/gonk-patches is necessary. The github hadles patchs only for leo and hamachi now.
Flags: needinfo?(sotaro.ikeda.g)
Reporter | ||
Comment 23•12 years ago
|
||
Maybe there needs to be a new flag for that? Or was it supposed to be marked "Fixed" until it made it there?
Flags: needinfo?(dwilson)
Comment 24•12 years ago
|
||
(In reply to Mike Habicher [:mikeh] from comment #15)
> Sotaro, I'm trying to reproduce this bug on Inari, and what I'm seeing is
> the video play for ~5 to 10 minutes without trouble, but it progressively
> starts to degrade. It doesn't stop all of a sudden.
>
//snip
>
> logcat shows regular occurrences of these statements while this is happening:
>
> E( 872:0x3ea) [OMX.qcom.audio.decoder.aac] Timed out waiting for output
> buffers: 0/0
This error could be reported, when MediaStreamSource::readAt() was blocked for a longer time.
Assignee | ||
Comment 25•12 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #22)
>
> It is not enabled on Mozbuild. Add patch to
> https://github.com/mozilla-b2g/gonk-patches is necessary. The github hadles
> patchs only for leo and hamachi now.
If the fix for these video issues is common to all platforms, it sounds like gonk-patches needs an 'all' subfolder.
Comment 26•12 years ago
|
||
The rationale from comment 13 does not change our blocking decision, and comment 15 is being resolved in other bugs. Not blocking.
blocking-b2g: tef? → leo+
Assignee | ||
Comment 27•12 years ago
|
||
diego, sotaro, with:
- gecko: b2g18:2f8bf6531f39 ...which includes the bug 863441 patch
- gaia: c51082d92f18c4c7cdbc4f96c977be735500562a
...and with the bug 864230 patch applied to gonk, I see YouTube videos freeze.
Not all the time: the video in comment 0 has played through at least twice today, completely, without error; but on the third attempt, I see the incoming TCP connection's receive window collapse and the video stop after ~1 second of playback.
Are there any other patches I could be missing?
blocking-b2g: leo+ → tef?
Flags: needinfo?(dwilson)
Assignee | ||
Comment 28•12 years ago
|
||
- thread 19, mutex=0x41f26510:
#5 android::OMXCodec::read (this=0x41f26460, buffer=0x42da5360, options=0x2) at frameworks/base/media/libstagefright/OMXCodec.cpp:5061
#6 0x40880de8 in android::OmxDecoder::ReadVideo (this=0x42da5280, aFrame=0x426c6cc0, aTimeUs=187238095, aKeyframeSkip=false, aDoSeek=false) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:556
#7 0x4087f59e in nsMediaOmxReader::DecodeVideoFrame (this=0x41f42780, aKeyframeSkip=@0x426c6e1f, aTimeThreshold=2) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/nsMediaOmxReader.cpp:133
- thread 16, also mutex=0x41f26510:
#5 android::OMXCodec::signalBufferReturned (this=0x41f26460, buffer=0x80) at frameworks/base/media/libstagefright/OMXCodec.cpp:5227
#6 0x41773d26 in android::MediaBuffer::release (this=0x43dcb980) at frameworks/base/media/libstagefright/MediaBuffer.cpp:102
#7 0x4087fcac in android::OmxDecoder::ReleaseAllPendingVideoBuffersLocked (this=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:769
- thread 18, cond=0x442c7824, mutex=0x43d77284:
#8 0x408a6a70 in mozilla::ChannelMediaResource::Read (this=<value optimized out>, aBuffer=0x80 <Address 0x80 out of bounds>, aCount=4294964942, aBytes=0x0) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/MediaResource.cpp:754
#9 0x4087fa94 in android::MediaStreamSource::readAt (this=0x442c7a50, offset=15881796, data=0x43d8b000, size=238) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:101
#10 0x4176bc34 in android::MPEG4DataSource::readAt (this=<value optimized out>, offset=15881796, data=0x43d8b000, size=238) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:172
#11 0x4176bc34 in android::MPEG4DataSource::readAt (this=<value optimized out>, offset=15881796, data=0x43d8b000, size=238) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:172
#12 0x4176bee8 in android::MPEG4Source::read (this=0x43d930b0, out=0x45163c70, options=<value optimized out>) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:2329
- thread 9:
#2 0x4167c546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322
#3 0x4044f6d2 in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, interval=0x435dae24) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:397
#4 0x4044fbf0 in nsSocketTransportService::DoPollIteration (this=0x41f13100, wait=true) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:781
#5 0x4044fddc in nsSocketTransportService::Run (this=0x41f13100) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:645
In the PR_Poll call (frame #2), timeout=4294967295 is 0xFFFFFFFF, or -1, which means (for the POSIX version of poll()[1], at least) no timeout.
1. http://pubs.opengroup.org/onlinepubs/007908799/xsh/poll.html
Assignee | ||
Comment 29•12 years ago
|
||
Depending on how much I trust the gdb backtrace, the thread 9 frame 2 call looks completely wrong:
#2 0x4167c546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322
According to this[1], 'pds' is a pointer type, in which case a value of 0xfffffffc (-2 as a uint32_t) is most definitely bogus.
1. https://developer.mozilla.org/en-US/docs/PR_Poll
Assignee | ||
Comment 30•12 years ago
|
||
Further to comment 20, nsSocketTransportService::Poll()[1] sets the 'pds' argument to one of three values:
- if nsSocketTransportService::mPollList->fd is non-zero, pds = mPollList
- if nsSocketTransportService::mPollList->fd is zero, and:
- if nsSocketTransportService::mActiveCount is non-zero, pds = mPollList + 1
- if nsSocketTransportService::mActiveCount is zero, pds = nullptr
Assuming mPollList is a valid pointer, none of the valid paths should lead to pds = 0xfffffffc.
mPollList is initialized in the constructor by calling moz_xmalloc() (which is infallible[2]), and freed in the destructor by calling moz_xfree().
mPollList can also grow if a process uses a LOT of sockets[3] (which watching a long YouTube video seems to do--I've observed one new TCP connection per 800KiB to 1.8MiB of data), by calling moz_xrealloc(). It does this in 100-socket increments, once mActiveListSize exceeds SOCKET_LIMIT_MIN (currently defined as 50)[4].
1. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#361
2. https://developer.mozilla.org/en-US/docs/Infallible_memory_allocation
3. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#318
4. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#40
Assignee | ||
Comment 31•12 years ago
|
||
Assignee | ||
Comment 32•12 years ago
|
||
One thing I've noticed about the all-green screen: it seems to happen when WiFi cycles. I haven't been able to determine if it's due to a complete disconnect, or if it's just a rekeying; but I wonder if there's an error code somewhere that isn't getting propagated properly.
Assignee | ||
Comment 33•12 years ago
|
||
Of course, having posted that, I just managed to get a green-screen without a WiFi cycle.
Comment 34•12 years ago
|
||
Assuming comment 27 is a mid-air collision and setting back to leo+. No new tef block information provided.
blocking-b2g: tef? → leo+
Assignee | ||
Comment 35•11 years ago
|
||
Some gdb debugging:
(gdb) list 3920
3905 pds[index].out_flags = 0;
3906 }
3907 }
3908 if (0 == ready)
3909 {
3910 switch (timeout)
3911 {
3912 case PR_INTERVAL_NO_WAIT: msecs = 0; break;
3913 case PR_INTERVAL_NO_TIMEOUT: msecs = -1; break;
3914 default:
3915 msecs = PR_IntervalToMilliseconds(timeout);
3916 start = PR_IntervalNow();
3917 }
3918
3919 retry:
3920 ready = poll(syspoll, npds, msecs);
3921 if (-1 == ready)
3922 {
3923 PRIntn oserror = errno;
3924
3925 if (EINTR == oserror)
3926 {
3927 if (timeout == PR_INTERVAL_NO_TIMEOUT)
3928 goto retry;
3929 else if (timeout == PR_INTERVAL_NO_WAIT)
3930 ready = 0; /* don't retry, just time out */
3931 else
3932 {
3933 elapsed = (PRIntervalTime) (PR_IntervalNow()
3934 - start);
(gdb) p npds
$1 = 1
(gdb) p msecs
$2 = -1
(gdb) p syspoll
$3 = (struct pollfd *) 0x4359dbcc
(gdb) p *syspoll
$4 = {fd = 21, events = 1, revents = 0}
(gdb) info frame
Stack level 1, frame at 0x4359ddf8:
pc = 0x4015236a in _pr_poll_with_poll (/home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920); saved pc 0x40152546
called by frame at 0x4359de00, caller of frame at 0x4359dbb0
source language c.
Arglist at 0x4359dbb0, args: pds=0x42ba8020, npds=1, timeout=4294967295
Locals at 0x4359dbb0, Previous frame's sp is 0x4359ddf8
Saved registers:
r4 at 0x4359ddd4, r5 at 0x4359ddd8, r6 at 0x4359dddc, r7 at 0x4359dde0, r8 at 0x4359dde4, r9 at 0x4359dde8, r10 at 0x4359ddec, r11 at 0x4359ddf0, lr at 0x4359ddf4
(gdb) p pds
$5 = (PRPollDesc *) 0x42ba8020
(gdb) p *pds
$6 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1}
(gdb) p *fd
Cannot access memory at address 0xffffffff
(gdb) p fd
$7 = -1
(gdb) p pds->fd
$8 = (PRFileDesc *) 0x41a03c00
(gdb) p *pds->fd
warning: can't find linker symbol for virtual table for `(null)' value
$9 = warning: can't find linker symbol for virtual table for `(null)' value
{methods = 0x4015bbe8, secret = 0x41a01270, lower = 0x42ba6820, higher = 0x0, dtor = 0x40143d21 <pl_FDDestructor>, identity = 1}
(gdb) p POLLIN
No symbol "POLLIN" in current context.
(gdb) p (PRInt16*)malloc(sizeof(PRInt16))
$10 = (PRInt16 *) 0x1f83ef0
(gdb) p *(PRInt16*)0x1f83ef0
$11 = 30024
(gdb) p (pds[0].fd->methods->poll)(pds[0].fd, pds[0].in_flags & 0xfffd, (PRInt16*)0x1f83ef0)
warning: can't find linker symbol for virtual table for `(null)' value
$12 = 1
(gdb) p pds[0]
$13 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1}
(gdb) p *(PRInt16*)0x1f83ef0
$14 = 0
(gdb) p 0xd
$15 = 13
(gdb) p (pds[0].fd->methods->poll)(pds[0].fd, pds[0].in_flags & 0xfffe, (PRInt16*)0x1f83ef0)
warning: can't find linker symbol for virtual table for `(null)' value
$16 = 0
(gdb) p *(PRInt16*)0x1f83ef0
$17 = 0
(gdb) p PR_GetIdentitiesLayer(pds[0].fd, (PRDescIdentity)0)
warning: can't find linker symbol for virtual table for `(null)' value
$18 = (PRFileDesc *) 0x42ba6820
(gdb) p pds[0]
$19 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1}
(gdb) p *(PRFileDesc*)0x42ba6820
warning: can't find linker symbol for virtual table for `(null)' value
$20 = warning: can't find linker symbol for virtual table for `(null)' value
{methods = 0x4015be70, secret = 0x41a03c20, lower = 0x0, higher = 0x41a03c00, dtor = 0, identity = 0}
(gdb) p *(PRFileDesc*)0x42ba6820->secret
Attempt to extract a component of a value that is not a structure pointer.
(gdb) p ((PRFileDesc*)0x42ba6820)->secret
warning: can't find linker symbol for virtual table for `(null)' value
$21 = (PRFilePrivate *) 0x41a03c20
(gdb) p *((PRFileDesc*)0x42ba6820)->secret
warning: can't find linker symbol for virtual table for `(null)' value
$22 = {state = -1431655766, nonblocking = 0, inheritable = _PR_TRI_TRUE, next = 0x0, lockCount = 0, md = {osfd = 21, tcp_nodelay = 0}}
(gdb) p/x *((PRFileDesc*)0x42ba6820)->secret
warning: can't find linker symbol for virtual table for `(null)' value
$23 = {state = 0xaaaaaaaa, nonblocking = 0x0, inheritable = 0x1, next = 0x0, lockCount = 0x0, md = {osfd = 0x15, tcp_nodelay = 0x0}}
(gdb) p ready
$24 = 0
(gdb)
blocking-b2g: leo+ → leo?
Assignee | ||
Comment 36•11 years ago
|
||
It also looks like fd=21 has no data waiting:
(gdb) p (unsigned int*)malloc(sizeof(unsigned int))
$25 = (unsigned int *) 0x1f83f00
(gdb) p FIONREAD
No symbol "FIONREAD" in current context.
(gdb) p ioctl(21, 0x541b, (unsigned int*)0x1f83f00)
$26 = 0
(gdb) p *(unsigned int*)0x1f83f00
$27 = 0
This would be inconsistent with the TCP.rwin=0 observed in the video data stream.
Assignee | ||
Comment 37•11 years ago
|
||
Further to comment 29 and comment 30, the value pds=0xfffffffc seems to be incorrect, possibly a by-product of an optimized build. Probing with gdb (see comment 35) suggests that a valid PRPollDesc pointer is ultimately making it into PR_Poll().
Assignee | ||
Comment 38•11 years ago
|
||
Hmm, looks like the socket (in this case, fd=21) is not (really) a socket. I can't make any sense out of the 'st_mode' member of the struct stat returned by fstat():
(gdb) p (struct stat*)malloc(sizeof(struct stat))
$50 = (struct stat *) 0x1f86000
(gdb) p fstat(21, (struct stat*)0x1f86000)
$51 = 0
(gdb) p *(struct stat*)0x1f86000
$52 = {st_dev = 7, __pad0 = "\000\000\000", __st_ino = 2979, st_mode = 4480, st_nlink = 1, st_uid = 0, st_gid = 0, st_rdev = 0, __pad3 = "\000\000\000", st_size = 0, st_blksize = 4096, st_blocks = 0,
st_atime = 1369663045, st_atime_nsec = 661191246, st_mtime = 1369663045, st_mtime_nsec = 661191246, st_ctime = 1369663045, st_ctime_nsec = 661191246, st_ino = 2979}
(gdb) p ((struct stat*)0x1f86000)->st_mode
$53 = 4480
(gdb) p S_IFMT
No symbol "S_IFMT" in current context.
(gdb) p 00170000
$54 = 61440
(gdb) p ((struct stat*)0x1f86000)->st_mode & 61440
$55 = 4096
According to the AOSP version of stat.h, 4096 is 'S_ISVTX', which seems to mean:
"on directories, restricted deletion flag"[1]
1. http://pubs.opengroup.org/onlinepubs/7908799/xsh/sysstat.h.html
Assignee | ||
Comment 39•11 years ago
|
||
Most recent reproduction of this problem occurred after 50m30s of playing the comment 0 video. gdb shows that the socket thread is blocked on the poll() call, while Wireshark shows there is no longer any network traffic between the DUT and YouTube.
Wireshark (a 798MB log--too big to upload) shows:
1. a session on local port 46297 receiving ~1.5MiB of data from YT, its Rwin hitting 0
2. Rwin update of 10880
3. FIN send to YT
4. TCP keep-alives from YT
5. YT sends 14 more packets (up to ACK 1510469, 19304 more bytes following item 1 above), which we respond to with TCP RSTs
6. we open a new TCP connection on local port 46298 (the 3-way handshake completes right away)
7. we send 4 HTTP GETs (TCP retransmissions) which are ignored by YT
8. we get a FIN-ACK, and send a FIN-ACK on port 46928
9. YT sends us a RST packet
10. we repeat steps 6 through 9 above on port 46299
There is no more traffic between the DUT and YT: no new connection attempts. In gdb, the socket is still blocked on poll(). The video player UI is stuck where it was when the connection(s) failed; pressing Pause/Play has no effect on the player, and causes no new traffic to appear in Wireshark.
Reporter | ||
Comment 40•11 years ago
|
||
Is there any test we can do to confirm once and for all if it's a network connection issue or a hardware decoder issue?
Flags: needinfo?(dwilson)
Reporter | ||
Comment 41•11 years ago
|
||
(In reply to Mike Habicher [:mikeh] from comment #27)
> Are there any other patches I could be missing?
Not that I know of. You're probably seeing the same thing I'm seeing now. But does seem to be at least dependent (or even caused by) network conditions.
Assignee | ||
Comment 42•11 years ago
|
||
(In reply to Diego Wilson [:diego] from comment #41)
>
> Not that I know of. You're probably seeing the same thing I'm seeing now.
> But does seem to be at least dependent (or even caused by) network
> conditions.
I'm investigating that now.
I have my DUT sniffing packets into Wireshark over adb[1], and my test AP is plugged into an old-school Ethernet hub[2] between it and my Internet connection, with an extra computer also running Wireshark sniffing packets on the wire.
I'm checking to see if packets are lost when the video playback stalls.
1. https://wiki.mozilla.org/B2G/Packet_Sniffing_with_Wireshark
2. http://en.wikipedia.org/wiki/Ethernet_hub
Updated•11 years ago
|
Whiteboard: c=video [TD-26549] → c=video [TD-26549], MiniWW
Assignee | ||
Comment 43•11 years ago
|
||
Per the stand-up, I have filed some other bugs I have observed while investigating this one:
- bug 876838: Screen goes green when YouTube video stream stalls
- bug 876843: YouTube video display stops but audio continues to play
Comment 45•11 years ago
|
||
Comment 35 unintentionally cleared the flag - resetting leo+
blocking-b2g: leo? → leo+
Comment 46•11 years ago
|
||
mike, can you put the wireshark profile somewhere?
Flags: needinfo?(doug.turner)
Assignee | ||
Comment 47•11 years ago
|
||
After sleeping on comment 39, this occurred to me: we _are_ talking to the YouTube server, or at least some machine with a TCP server socket on port 80, because in step 6, we successfully complete the TCP handshake!
The SYN/SYN+ACK RTT on the port 46298 connection is 26ms, and on port 46299 it's 34ms, both of which are consistent with earlier connections; also, in the SYN+ACK, MSS, window scaling, the lack of timestamps, and Rwin=14600 are consistent with earlier SYN+ACK packets.
Also, after the 4 TCP retransmissions in step 7, the server sends us a FIN+ACK in step 8 to close the connection.
Here's the weird part: we get the FIN+ACK packets with ACK numbers of 1, which means the server sending them never saw the HTTP GETs sent in step 7 (consistent with the SEQ number of 1 in these packets, which indicates the server never set us any data either).
We respond to the FIN+ACK with our own FIN+ACK with SEQ=980 and ACK=2.
So where are the HTTP GETs getting lost, and why? So far, I have been unable to reproduce this case with a wired-side sniffing connection:
DUT <--> AP-test <--> EthHub <--> Router <--> {Internet}
^--> Netbook/Wireshark
Next step: raw wireless sniffing.
Comment 48•11 years ago
|
||
(In reply to Mike Habicher [:mikeh] from comment #47)
>
> So where are the HTTP GETs getting lost, and why? So far, I have been unable
> to reproduce this case with a wired-side sniffing connection:
>
So that's a good question - but its worth thinking about the fact that even if you answer the question for this use case the fact remains the GETs could always get lost for some other reason (radio interference for example) that you aren't experiencing here and which isn't fixable.. because that's true it might make more sense to focus on why retries totally stop at step 9 or 10..
Assignee | ||
Comment 49•11 years ago
|
||
As if to reinforce mcmanus' comment, my most recent reproduction of the YouTube stall occurred 17m48s into the video, where there is an open TCP connection but it is stalled. The video is frozen on the DUT's screen, and Wireshark shows incoming TCP keep-alives from YT (approximately every 10s), to which we respond with TCP ACKs with Rwin=0.
roc, any idea where to look to see where the video path might be stalled?
Flags: needinfo?(roc)
Assignee | ||
Comment 50•11 years ago
|
||
Further to comment 49, looking at the socket service in the parent process:
(gdb) frame 1
#1 0x403bb36a in _pr_poll_with_poll (pds=0x42f9d020, npds=3, timeout=65535000) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920
3920 ready = poll(syspoll, npds, msecs);
(gdb) p npds
$1 = 3
(gdb) p syspoll
$2 = (struct pollfd *) 0x42ef5bcc
(gdb) p *syspoll
$3 = {fd = 33, events = 1, revents = 0}
(gdb) p syspoll[0]
$4 = {fd = 33, events = 1, revents = 0}
(gdb) p syspoll[1]
$5 = {fd = 61, events = 3, revents = 0}
(gdb) p syspoll[2]
$6 = {fd = 62, events = 3, revents = 0}
(gdb) p (unsigned int*)malloc(sizeof(unsigned int))
$7 = (unsigned int *) 0x315e80
(gdb) p ioctl(21, 0x541b, (unsigned int*)0x315e80)
$8 = -1
(gdb) p ioctl(33, 0x541b, (unsigned int*)0x315e80)
$9 = 0
(gdb) p *(unsigned int*)0x315e80
$10 = 1
(gdb) p ioctl(61, 0x541b, (unsigned int*)0x315e80)
$11 = -1
(gdb) p ioctl(62, 0x541b, (unsigned int*)0x315e80)
$12 = -1
(gdb) p __errno()
$13 = (volatile int *) 0x42ef5f08
(gdb) p *__errno()
$14 = 22
(gdb) p msecs
$15 = 65535000
(gdb)
The errno value of 22 is EINVAL, which in conjunction with FIONREAD on bionic, supposedly indicates that fds 61 and 62 are listeners: http://stackoverflow.com/questions/13353146/android-linux-ioctl-fionread-failing (assuming they're even sockets in the first place).
(With both the parent and video player app processes stopped in gdb, Wireshark shows the TCP keep-alives and ACK dance every 10 seconds, seemingly indefinitely.)
You want NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5.
I don't know why data would not be flowing if the HTTP port is open and we're trying to play the video. Definitely sounds like a bug on our side.
Flags: needinfo?(roc)
The media cache does at times suspend reading from stream, but it should only do that if we've cached data ahead of where we're playing from and the cache is full. I can help you interpret the nsMediaCache logs to figure out if and why it stopped reading.
Blocks: 877024
Assignee | ||
Comment 54•11 years ago
|
||
(In reply to Andreas Gal :gal from comment #53)
>
> mikeh, are you attaching logs?
gal, I haven't been attaching Wireshark .pcap logs because they tend to compress down to ~800MiB; hence the descriptive summaries in (e.g.) comment 39 and comment 47. On the other hand, logcat logs don't show anything relevant.
Flags: needinfo?(mhabicher)
Assignee | ||
Comment 55•11 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #51)
>
> You want NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5.
>
> I don't know why data would not be flowing if the HTTP port is open and
> we're trying to play the video. Definitely sounds like a bug on our side.
Is there any way to coax PRLogging out of a build aside from setting DEBUG=1? b2g18 debug builds currently throw assertions like "MOZ_Assert: Assertion failure: status == OK, at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:315"[1]
1. http://mxr.mozilla.org/mozilla-b2g18/source/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp#315
(This issue has already been submitted as bug 841206.)
Comment 56•11 years ago
|
||
mike, can you try to reproduce this on desktop? it sounds like this problem might be reproducible there as well.
Assignee | ||
Comment 57•11 years ago
|
||
(In reply to Doug Turner (:dougt) from comment #56)
>
> mike, can you try to reproduce this on desktop? it sounds like this problem
> might be reproducible there as well.
dougt, how do I go about doing that?
Flags: needinfo?(doug.turner)
Assignee | ||
Comment 58•11 years ago
|
||
As requested. This is with the video on the DUT stalled, and the TCP stack receiving and ACKing TCP keep-alives, with Rwin=0.
The log ends with the following output, more or less continuously:
05-30 13:11:39.721 1277 1295 I Gecko - SYDNEY_AUDIO: 0x4435f340 - get position
05-30 13:11:39.721 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 catching up
Flags: needinfo?(roc)
Thanks for these logs! Very helpful.
OK it looks like the media cache fills up and we start suspending and resuming the connection to throttle the download. This shows up in the log like this:
05-30 13:10:01.635 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 predict next data in 42.297000, current worst block is 42.004089
05-30 13:10:01.645 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 Suspended
That means the next data we read via HTTP is expected to play 42.297 seconds from now, but least-useful data in the cache is for 42.004089 seconds from now, so we want to stop reading temporarily, and we tell Necko to suspend reading from the stream.
You see a sequence of Suspends and Resumes as the video playback advances and we try to slide forward the window of data that we have around the current playback position. This is normal. You will also see some DataReceived notifications signalling that we received data from Necko immediately after a Suspend; this is also normal, and I guess this is expected when Suspend calls race with Necko data delivery.
Here is the last time at which data is received:
05-30 13:10:01.645 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 DataReceived at 105742336 count=59532
There's a couple of Suspends and Resumes after that. Then the media cache does its last Resume call:
05-30 13:10:02.055 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 Resumed
and there are no Suspend calls after that. So we're trying to read data from Necko, but none arrives.
Everything after that shows the window of cached data after the playback position gradually shrinking as the playback position advances, until finally it runs out, with predictable results.
So as far as I can tell, the media cache is behaving as expected.
Based on previous experience we know that if we suspend for too long, some servers will close the HTTP connection. We have logic to reopen the connection automatically in that case, but these logs seem to show that doesn't happen. (MediaChannelResource::CacheClientSeek would be called in that case, and would have been logged.) Perhaps suspending the HTTP channel has messed something up in another way --- e.g. flow control at some level, or maybe the server has reacted incorrectly and can't resume delivering data?
If you still have the wireshark logs I guess you should be able to infer from those whether our TCP receive window closed and then opened again.
Whatever the problem is, we may be able to work around it by detecting when we have received no data for some amount of time (fairly long, say ten seconds), and if our last action was a Resume (and the server supports byte-range requests) automatically close the connection and open a new one. I will try to make a patch for that.
Flags: needinfo?(roc)
Comment 60•11 years ago
|
||
I suspect some state inconsistency happened around IPC. Bug 868073 is similar area bug.
Mike should be able to tell by looking at logs whether that last Resume() actually took effect.
Mike, this patch might help. It should at least prevent the kind of log that you supplied here --- the patch should reset the connection after not receiving data for five seconds after the Resume().
Comment 63•11 years ago
|
||
maybe jduell can write some tests around ipc and pause/resume?
Flags: needinfo?(jduell.mcbugs)
Comment 64•11 years ago
|
||
Sure--we have basic suspend/resume tests, but I'll knock off something that does it repeatedly with a very large file. I doubt it'll turn anything up (I suspect we really need the specific flakiness of the network connection here), but it's worth trying.
Meanwhile I think it would be useful here to get a run where we have both a wireshark log (post it to a website somewhere so we can download it) AND an NSPR HTTP log:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Firefox_OS_phones
(I think using a single log file is fine)
Flags: needinfo?(jduell.mcbugs)
Comment 65•11 years ago
|
||
roc:
> You will also see some DataReceived notifications signalling that we
> received data from Necko immediately after a Suspend; this is also normal,
> and I guess this is expected when Suspend calls race with Necko data delivery.
As long as you suspend a channel on the main thread, Necko should never call OnDataAvailable until you call Resume [1]. I.e no races. That would be a bug. Can you verify for me whether you're actually seeing that in the log?
[1] For single process necko, we suspend the underlying inputStreamPump which handles calling OnDataAvailable. For e10s we send a message back to the parent to suspend the pump, and queue any OnDataAvail IPDL messages that arrive in the meantime.
Flags: needinfo?(roc)
Comment 66•11 years ago
|
||
Sorry, in case it wasn't clear, we want an NSPR log with both media/HTTP logging, ie.
NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5:nsHttp:5,nsSocketTransport:5
>> can you try to reproduce this on desktop?
>
> how do I go about doing that?
I'm guessing forging the UserAgent header to B2G's (via about:config 'general.useragent.override') and going to the URI would be a good start.
Flags: needinfo?(doug.turner) → needinfo?(mhabicher)
Assignee | ||
Comment 67•11 years ago
|
||
roc, with your patch, the video plays until the end; the display stalled quite early on (see bug 877461) but the audio kept on going. Wireshark confirms that we hit the end of the stream.
For posterity, I added a printf_stderr() call into the NoDataDeliveredTimeout() callback in your patch; not sure if 'mOffset' value being the same means anything but there you go.
15:14:52 ➜ bug870564 grep -n "mikeh:" log06_media_timer.txt
64950:05-31 13:42:44.650 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
101956:05-31 13:44:40.844 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
129423:05-31 13:46:03.845 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
143889:05-31 13:46:49.739 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
181937:05-31 13:48:44.471 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
199981:05-31 13:49:40.996 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
210232:05-31 13:50:11.446 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
223187:05-31 13:50:53.077 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
236334:05-31 13:51:34.377 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
292244:05-31 13:54:32.921 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
313632:05-31 13:55:40.497 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
358557:05-31 13:58:45.167 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
420596:05-31 14:05:17.040 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
427522:05-31 14:06:54.675 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
465309:05-31 14:15:19.327 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
491143:05-31 14:22:47.214 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
522485:05-31 14:31:19.034 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
579806:05-31 14:43:09.917 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
629090:05-31 14:53:20.983 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
664792:05-31 15:01:53.533 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
Assignee | ||
Comment 68•11 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #66)
> Sorry, in case it wasn't clear, we want an NSPR log with both media/HTTP
> logging, ie.
>
> NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5:nsHttp:5,
> nsSocketTransport:5
I'll roll back roc's change and add the extra logging options.
Flags: needinfo?(mhabicher)
Assignee | ||
Comment 69•11 years ago
|
||
jduell: here's the logcat output; the corresponding Wireshark output can be found here: http://people.mozilla.com/~mhabicher/log07_dut.pcap
Attachment #756747 -
Flags: feedback?(jduell.mcbugs)
Reporter | ||
Comment 70•11 years ago
|
||
(In reply to Mike Habicher [:mikeh] from comment #67)
> roc, with your patch, the video plays until the end;
Awesome possum! I'll make sure to try it out.
> the display stalled
> quite early on (see bug 877461) but the audio kept on going. Wireshark
> confirms that we hit the end of the stream.
Yeah, I totally agree this is a separate bug. I can actually reproduce it earlier if I leave on the overlay progress bar.
Updated•11 years ago
|
status-b2g18:
--- → affected
status-b2g18-v1.0.0:
--- → wontfix
status-b2g18-v1.0.1:
--- → wontfix
status-b2g-v1.1hd:
--- → affected
Comment 71•11 years ago
|
||
Ran some e10s xpcshell tests on my linux desktop with large files (10 to 100 MB), where at every 5th OnDataAvailable I call suspend, wait a second, then resume. With a localhost http server, 25 MB file downloads fine that way. 100 MB caused a crash (which is a bug, but not this one). Over remote server loading 10 MB file once saw server drop the connection half-way through (but OnStopRequest called), once saw a crash. Clearly work to be done with the crash issue, but nothing here so far points to a cause for this bug. I'll take a look at the wireshark/NSPR log next.
Comment 72•11 years ago
|
||
On Firefox OS, I saw a lot of consecutive pause()s/resume()s were called without OnDataAvailable.
Comment 73•11 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #71)
> Ran some e10s xpcshell tests on my linux desktop with large files (10 to 100
> MB), where at every 5th OnDataAvailable I call suspend, wait a second, then
> resume. With a localhost http server, 25 MB file downloads fine that way.
> 100 MB caused a crash (which is a bug, but not this one). Over remote
> server loading 10 MB file once saw server drop the connection half-way
> through (but OnStopRequest called), once saw a crash. Clearly work to be
> done with the crash issue, but nothing here so far points to a cause for
> this bug. I'll take a look at the wireshark/NSPR log next.
Try reducing the size of the media cache if you haven't already (media.cache_size). On Firefox OS it's something pretty small - 4MB IIRC.
(In reply to Jason Duell (:jduell) from comment #65)
> As long as you suspend a channel on the main thread, Necko should never call
> OnDataAvailable until you call Resume [1]. I.e no races. That would be a
> bug. Can you verify for me whether you're actually seeing that in the log?
From comment #59:
> Then the media cache does its last Resume call:
> 05-30 13:10:02.055 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 Resumed
> and there are no Suspend calls after that. So we're trying to read data from Necko, but none
> arrives.
Mike confirmed on IRC that after our last Resume call, Wireshark shows the TCP receive window never opens. Which is why I think it's a Necko bug.
Flags: needinfo?(roc)
Comment on attachment 756494 [details] [diff] [review]
timer patch
Review of attachment 756494 [details] [diff] [review]:
-----------------------------------------------------------------
It sounds like this patch is working. I think we should take it as well as fix any underlying Necko bug. It gives us some extra protection against possible bugs in the network or server.
::: layout/base/FrameLayerBuilder.cpp
@@ +2947,5 @@
> layer->SetClipRect(nullptr);
> layer->SetMaskLayer(nullptr);
> ImageLayer* imgLayer = static_cast<ImageLayer*>(layer.get());
> layer.forget();
> + printf("### RECYCLING IMAGE LAYER\n");
Er, ignore this.
Attachment #756494 -
Flags: review?(cpearce)
Updated•11 years ago
|
Attachment #756494 -
Flags: review?(cpearce) → review+
Comment 76•11 years ago
|
||
I adopt patch on this version..
Mozilla build ID: 20130526070207
Gaia Revision : 4d10e1297b859cacc174c0a54af61a7678d7c32d
Gecko Revision : 52341e43539a0e8b9aa77a9128cc3871439b8aa6
There's more severe problem occurred.
Play is lagged from the start and then it is stuck after few seconds.
Is there anyone who check this patch?
Reporter | ||
Comment 77•11 years ago
|
||
(In reply to leo.bugzilla.gecko from comment #76)
> There's more severe problem occurred.
> Play is lagged from the start and then it is stuck after few seconds.
Does audio keep playing? If so, that is a separate issue tracked in bug 877461
Comment 78•11 years ago
|
||
(In reply to Diego Wilson [:diego] from comment #77)
> Does audio keep playing? If so, that is a separate issue tracked in bug
> 877461
No, A/V is just stuck forever.
(In reply to leo.bugzilla.gecko from comment #76)
> I adopt patch on this version..
> Mozilla build ID: 20130526070207
> Gaia Revision : 4d10e1297b859cacc174c0a54af61a7678d7c32d
> Gecko Revision : 52341e43539a0e8b9aa77a9128cc3871439b8aa6
Can you give a URL for this revision? I assume this is from git but I don't know which repo...
> There's more severe problem occurred.
> Play is lagged from the start and then it is stuck after few seconds.
>
> Is there anyone who check this patch?
You mean this is a regression? Does it happen on all Youtube videos? Can you give specific steps to reproduce?
Comment 80•11 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #79)
> Can you give a URL for this revision? I assume this is from git but I don't
> know which repo...
Do you mean this URL?
Gaia : http://git.mozilla.org/?p=releases/gaia.git;a=shortlog;h=refs/heads/v1-train
Gecko : http://git.mozilla.org/?p=releases/gecko.git;a=shortlog;h=refs/heads/gecko-18
> You mean this is a regression? Does it happen on all Youtube videos? Can you
> give specific steps to reproduce?
I just checked URL in comment 0, just try to play.
I will check for another youtube videos also.
Comment 81•11 years ago
|
||
I think......Bug 871485 could affect this problem...
Because I check patch here after patching Bug 871485.
(It seems like youtube got more problem with patch Bug 871485)
I am checking ...
Comment 82•11 years ago
|
||
Sorry for confusing. The stuck problem is definitely because of Bug 871485.
After adapting this patch, it's lagged so much and only audio is keep playing.
I think it's same as bug 877461.
Comment 83•11 years ago
|
||
There are so many message from to log like this..when it's lagging
MediaResource Could not trim cache block 268 (destination 127, predicted next use 38.952000, latest predicted use for overflow 198.109000
Comment 84•11 years ago
|
||
(In reply to Diego Wilson [:diego] from comment #77)
> Does audio keep playing? If so, that is a separate issue tracked in bug
> 877461
It fills like worse than before.
Is Bug 877461 same as Bug 863441.
I think this patch reopen Bug 863441.
Comment 85•11 years ago
|
||
More information..
When a network connection is good, it occurred less than before.
If network connection is not good, it occurred whenever DataReceived notifications signalling is show up.
Comment 86•11 years ago
|
||
(In reply to leo.bugzilla.gecko from comment #82)
> Sorry for confusing. The stuck problem is definitely because of Bug 871485.
The patch does not change how video playback is done, but change the timing of codec allocation. I think Bug 871485 does not cause the problem, but hit more often the hidden problem.
Assignee | ||
Comment 87•11 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #71)
>
> Clearly work to be
> done with the crash issue, but nothing here so far points to a cause for
> this bug. I'll take a look at the wireshark/NSPR log next.
jduell, anything I can do to help?
Comment 88•11 years ago
|
||
It seems that on leo device by using MozBuild, we can see more freeze easily. It might be true also on vendor built ROM. Last week, I was in lg b2g work week, lg engineer said, it is easy to generate freeze.
Comment 89•11 years ago
|
||
I've got :mikeh running a build that asserts if the necko channel's suspend/resume methods are called off main thread. We seem to be seeing that (or some other kind of error) in bug 877708, and that could explain the "lost resume" we seem to seeing here.
Comment 90•11 years ago
|
||
So we're not seeing any off-main thread use of suspend/resume. But :mikeh has been running some builds with various diagnostic checks, and we did catch a hang and attached a debugger to it, and we saw
1) nsHttpChannel::mSuspendCount = 1
2) the last printf in our log of suspend/resume is a suspend()
3) wireshark shows TCP window is 0 (i.e. full client TCP buffer because channel is suspended)
This really seems to point to the code above necko not calling resume(), yet expecting data. roc: I smell a media cache bug at this point.
Flags: needinfo?(roc)
(In reply to Jason Duell (:jduell) from comment #90)
> This really seems to point to the code above necko not calling resume(), yet
> expecting data. roc: I smell a media cache bug at this point.
Maybe there is a media cache bug too, but in the log in comment #58, it's clearly a Resume that's the last call. Please don't ignore that.
Flags: needinfo?(roc)
(In reply to Jason Duell (:jduell) from comment #90)
> This really seems to point to the code above necko not calling resume(), yet
> expecting data. roc: I smell a media cache bug at this point.
Did you get the nsMediaCache log for this? If so can you drop a link here?
Hmm, it's possible the resume gets lost somewhere between where nsMediaCache logs it and calling Necko. I'll dig.
Comment 94•11 years ago
|
||
I now suspect bug 877708 is the real culprit here. Taking roc off the hook.
Comment 95•11 years ago
|
||
mikeh: could you try my patch (the simple one for b2g18 branch) from bug 877708 (without roc's patch) and see if it fixes the freeze here?
Alas, there's a good chance my bug 877708 fix won't actually fix this issue. The bug there was that if suspend was called again very shortly after resume, we could wind up flushing events from the parent (like OnDataAvailable) that we have queued on the child, even though we're supposed to be suspended (these were the "racy" OnDataAvailable calls mentioned in comment 65). My patch fixes that, but AFAICT it won't help with any "lost" resume calls (i.e we shouldn't remain suspended if the media code really has issued a matching number of suspends/resumes). But I don't see any obvious problems with the way necko e10s does suspend/resume, so I'm still hoping that not issuing the racy ODA calls will somehow make the media code behave better.
Flags: needinfo?(mhabicher)
Comment 96•11 years ago
|
||
Why does stuck happen at the same position all the time??
and...according to logs, it seems like h/w codec keeps consuming data even though it needs
Reporter | ||
Comment 97•11 years ago
|
||
Does anyone have any objections to landing attachment 756494 [details] [diff] [review] ?
Keywords: checkin-needed
Reporter | ||
Comment 98•11 years ago
|
||
Also, perhaps we can spin off another bug for continuing the media cache debugging discussion?
Assignee | ||
Comment 99•11 years ago
|
||
(In reply to Diego Wilson [:diego] from comment #97)
>
> Does anyone have any objections to landing attachment 756494 [details] [diff] [review] ?
My concern is that while is does unstick the video, it doesn't address the underlying problem that causes the stream to get stuck in the first place. I'm currently testing a patch in bug 877708 that may do that, so I'm going to remove the 'checkin-needed' keyword for now. I'll update both bugs once I have a couple of runs-through to examine.
Flags: needinfo?(mhabicher)
Keywords: checkin-needed
Assignee | ||
Comment 100•11 years ago
|
||
With bug 877708 attachment 757784 [details] [diff] [review] applies (and not attachment 756494 [details] [diff] [review] in this bug), the TCP stream(s) no longer stall. Instead, Wireshark shows periodic TCP window collapses which linger for 1~3s and are then cleared and streaming resumes.
(The video is frozen, however--something that sounds like it is resolved in another bug. Audio was stalled too, despite the network traffic; but when I moved the DUT to temporarily increase WiFi RSSI, the audio resumed with frozen display.)
Assignee | ||
Comment 101•11 years ago
|
||
Further to comment 100, that would probably be bug 879297.
Assignee | ||
Comment 102•11 years ago
|
||
(In reply to Diego Wilson [:diego] from comment #98)
>
> Also, perhaps we can spin off another bug for continuing the media cache
> debugging discussion?
In light of comment 100, I don't think a new bug is needed; I need to conduct more testing, but if the TCP stream stall is fixed, I don't think we need to revisit the MediaCache.
Assignee | ||
Comment 103•11 years ago
|
||
jduell: I've put bug 877708 attachment 757784 [details] [diff] [review] through its paces today, and over several hours of testing, the TCP stream never gets stuck. I think you've done it. :)
Comment 104•11 years ago
|
||
Yay. That's good news! I guess I'll dupe that bug over to this one (so I don't have to ask for leo+ on it).
mikeh: Re: TCP window collapses: IIRC this is normal--when the media code suspends a channel, we let its TCP buffer fill up (and the TCP recv window becomes 0). When we resume we read from the TCP buffer and the window gets bigger again. If that's all you're seeing there's no problem here (but I seem to remember you saying we were sending hard RSTs to server when we talked yesterday? That sounded like a new, different bug).
Comment 106•11 years ago
|
||
This is patch for m-c from bug 870564, already approved by jdm (and nits fixed).
Attachment #758276 -
Flags: review+
Comment 107•11 years ago
|
||
This is patch for mb2g18 from bug 870564, already +r'd by jdm (and nits fixed).
[Approval Request Comment]
Bug caused by (feature/regressing bug #): been in tree for long time.
User impact if declined: youtube on b2g will freeze on long videos
Testing completed: yes see comment 103.
Risk to taking this patch (and alternatives if risky): low.
String or UUID changes made by this patch: none
Attachment #758280 -
Flags: review+
Attachment #758280 -
Flags: approval-mozilla-b2g18?
Comment 108•11 years ago
|
||
Comment 109•11 years ago
|
||
Comment on attachment 756494 [details] [diff] [review]
timer patch
Marking roc's patch obsolete, as it was a bit of a hack and doesn't seem to be needed. Correct me if I'm wrong about that.
Attachment #756494 -
Attachment is obsolete: true
Comment 110•11 years ago
|
||
typo in comment 106 && 107--I moved those patches here from bug 877708.
(In reply to Jason Duell (:jduell) from comment #109)
> Marking roc's patch obsolete, as it was a bit of a hack and doesn't seem to
> be needed. Correct me if I'm wrong about that.
Agreed.
(In reply to Jason Duell (:jduell) from comment #104)
> mikeh: Re: TCP window collapses: IIRC this is normal--when the media code
> suspends a channel, we let its TCP buffer fill up (and the TCP recv window
> becomes 0). When we resume we read from the TCP buffer and the window gets
> bigger again.
This is expected. This is the result of the media cache doing flow control.
Updated•11 years ago
|
Whiteboard: c=video [TD-26549], MiniWW → c=video [TD-26549], MiniWW, [YouTubeCertBlocker]
Updated•11 years ago
|
Whiteboard: c=video [TD-26549], MiniWW, [YouTubeCertBlocker] → c=video [TD-26549], MiniWW, [YouTubeCertBlocker+]
Comment 113•11 years ago
|
||
jason this is awesome.
Comment 114•11 years ago
|
||
For consistency of symptom what I see, please try it via 3G connection.
If you cannot use 3G connection, try it via other device's tethering.
I cannot get any progress even though I applied many patches mentioned in Bug 877461, Bug 878343 and here.
Comment 115•11 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 116•11 years ago
|
||
Comment on attachment 758280 [details] [diff] [review]
less risky patch for b2g18 branch
Has leo+, doesn't need additional approval.
Attachment #758280 -
Flags: approval-mozilla-b2g18?
Comment 117•11 years ago
|
||
WTF happened to the header on this patch?
https://hg.mozilla.org/releases/mozilla-b2g18/rev/176fe1ce4cd2
Assignee | ||
Comment 118•11 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #104)
>
> mikeh: Re: TCP window collapses: IIRC this is normal--when the media code
> suspends a channel, we let its TCP buffer fill up (and the TCP recv window
> becomes 0). When we resume we read from the TCP buffer and the window gets
> bigger again. If that's all you're seeing there's no problem here (but I
> seem to remember you saying we were sending hard RSTs to server when we
> talked yesterday? That sounded like a new, different bug).
jduell, I still see the RSTs from us to the server. Basically it looks like:
1. regular TCP streaming
2. TCP receive window goes to zero, then:
3. http > us TCP keepalive
4. us > http TCP ACK, Rwin=0
5. us > http TCP RST
6. open new stream
The only time this doesn't happen is on the last chunk, where we close it off gracefully with FIN+ACKs.
Comment 119•11 years ago
|
||
(In reply to leo.bugzilla.gecko from comment #114)
> For consistency of symptom what I see, please try it via 3G connection.
> If you cannot use 3G connection, try it via other device's tethering.
> I cannot get any progress even though I applied many patches mentioned in
> Bug 877461, Bug 878343 and here.
FYI, for confirmation a patch in Bug 879529 is also necessary.
Comment 120•11 years ago
|
||
>
> jduell, I still see the RSTs from us to the server. Basically it looks like:
> 1. regular TCP streaming
> 2. TCP receive window goes to zero, then:
> 3. http > us TCP keepalive
> 4. us > http TCP ACK, Rwin=0
> 5. us > http TCP RST
> 6. open new stream
>
> The only time this doesn't happen is on the last chunk, where we close it
> off gracefully with FIN+ACKs.
The RST gets sent because we closed the socket without reading data that was queued in the kernel before doing that. If a stream was paused and then not resumed before being closed that is what I would expect would happen. (it happens on lots of other close scenarios too) Its not necessarily an indication of a problem unless its associated with some other issue.
Comment 121•11 years ago
|
||
> WTF happened to the header on this patch?
Not sure--I didn't tweak it manually, so I'll blame hg.
Comment 122•11 years ago
|
||
Reporter | ||
Comment 123•11 years ago
|
||
(In reply to Jason Duell (:jduell) from comment #107)
> Created attachment 758280 [details] [diff] [review]
> less risky patch for b2g18 branch
>
> This is patch for mb2g18 from bug 870564, already +r'd by jdm (and nits
> fixed).
>
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): been in tree for long time.
> User impact if declined: youtube on b2g will freeze on long videos
> Testing completed: yes see comment 103.
> Risk to taking this patch (and alternatives if risky): low.
> String or UUID changes made by this patch: none
This patch worked for me. Thanks Jason!
Updated•11 years ago
|
Flags: in-moztrap?
Updated•11 years ago
|
Flags: in-moztrap? → in-moztrap+
Comment 124•11 years ago
|
||
Played fine with some long videos, although i did some of the behavior described in bug 880601.
Keywords: verifyme
Updated•11 years ago
|
Attachment #756747 -
Flags: feedback?(jduell.mcbugs)
You need to log in
before you can comment on or make changes to this bug.
Description
•