870564 - Youtube video freezes after a long time (~30 mins)

Reporter

Description

•

12 years ago

This youtube video freezes after a very long time https://www.youtube.com/watch?v=V1IBH1bz5-c I've seen it fail in two different ways: * Green screen. This seems to be a wi-fi connection issue. May be device specific * Screen freeze. This looks like a problem somewhere in the media graph. I often see a "Timed out waiting for output buffers: 0/0" message for audio buffers.

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

blocking-b2g: --- → tef?

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

Assignee: nobody → sotaro.ikeda.g

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

Depends on: 863441

Sotaro Ikeda [:sotaro]

Comment 1

•

12 years ago

By Bug 863441, youtube video playback becomes somewhat stable. But there are still some problems.

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

Depends on: 864230

Sotaro Ikeda [:sotaro]

Comment 2

•

12 years ago

I did not see green. But failed in 2 ways. - [1] video freeze(but ui works) - [2] video app crash at mozilla::net::HttpChannelChild::OnStartRequest ( Bug 868073 )

Sotaro Ikeda [:sotaro]

Comment 3

•

12 years ago

I debugged about [1]. [1] happend by nsMediaCacheStream::Read() stall forever. OMXCodec calls the function with grabbing OMXCodec's mutex. Then the other threads(decoder thread and ImageBridge thread) waited forever to get the mutex.

Sotaro Ikeda [:sotaro]

Comment 4

•

12 years ago

Attached file stack track when video freeze happend (deleted) — Details

Sotaro Ikeda [:sotaro]

Comment 5

•

12 years ago

It seems that both [1] and [2] are caused by HttpChannelChild. The following diagram is about ChannelMediaResource. https://github.com/sotaroikeda/firefox-diagrams/blob/master/media/content_media_ChannelMediaResource_FirefoxOS_1_01.pdf?raw=true

leo.bugzilla.gecko

Updated

•

12 years ago

Whiteboard: [TD-26549]

Target Milestone: --- → 1.1 QE2

Alex Keybl [:akeybl]

Comment 6

•

12 years ago

Not necessarily a blocker for v1.0.1/v1.1, but we would take a low risk uplift if available. Videos on YouTube that are >30min long are very uncommon, and it's not yet clear if this is specific to long videos or this video (we need more repro cases).

blocking-b2g: tef? → -

tracking-b2g18: --- → +

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

Blocks: 871753

Mike Lee [:mlee]

Updated

•

12 years ago

Whiteboard: [TD-26549] → c= [TD-26549]

Hema Koka [:hema]

Updated

•

12 years ago

Whiteboard: c= [TD-26549] → c=video [TD-26549]

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

Status: NEW → ASSIGNED

leo.bugzilla.gecko

Updated

•

12 years ago

blocking-b2g: - → leo?

Priority: -- → P1

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Updated

•

12 years ago

Assignee: sotaro.ikeda.g → mhabicher

Wayne Chang [:wchang]

Comment 7

•

12 years ago

Triage 5/20 - TEF? per blocking a tef+ bug 871753. If decided not to be tef+, then please leo+.

blocking-b2g: leo? → tef?

Daniel Coloma:dcoloma

Comment 8

•

12 years ago

(In reply to Wayne Chang [:wchang] from comment #7) > Triage 5/20 - > TEF? per blocking a tef+ bug 871753. > If decided not to be tef+, then please leo+. bug 871753 is already fixed in all branches. Can you clarify how does this one blocks 871753?

bhavana bajaj [:bajaj]

Comment 9

•

12 years ago

needsinfo wayne here to help with comment #8

Flags: needinfo?(wchang)

Diego Wilson [:diego]

Reporter

Comment 10

•

12 years ago

Bug 871753 seems to help, but I still saw the time out message around 40 minutes into the video: E/OMXCodec( 674): [OMX.qcom.audio.decoder.aac] Timed out waiting for output buffers: 0/2 Sotaro, Are you able to play the youtube video in the description all the way?

leo.bugzilla.gecko

Comment 11

•

12 years ago

I cannot play youtube at the end of contents. If I try to play URL in comment 0, the screen freezes at 200 seconds from the start, 70~80% probability.

Sotaro Ikeda [:sotaro]

Updated

•

12 years ago

Depends on: 874325

Wayne Chang [:wchang]

Comment 12

•

12 years ago

(In reply to Daniel Coloma:dcoloma from comment #8) > (In reply to Wayne Chang [:wchang] from comment #7) > > Triage 5/20 - > > TEF? per blocking a tef+ bug 871753. > > If decided not to be tef+, then please leo+. > > bug 871753 is already fixed in all branches. Can you clarify how does this > one blocks 871753? This block was set by Sotaro, perhaps we need him to comment if this is still true.

Flags: needinfo?(wchang) → needinfo?(sotaro.ikeda.g)

Sotaro Ikeda [:sotaro]

Comment 13

•

12 years ago

The block was set because Youtube freeze happens by some reasons on a same youtube video. And it is difficult to differentiate which freeze happens just from the freezed UI.

Flags: needinfo?(sotaro.ikeda.g)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 14

•

12 years ago

Attached file gdb bt while media is frozen (UI is still responsive) (deleted) — Details

thread 18, blocked here: #7 nsMediaCacheStream::Read (this=0x446b8030, aBuffer=<value optimized out>, aCount=<value optimized out>, aBytes=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/nsMediaCache.cpp:2103 #8 0x40899d08 in mozilla::ChannelMediaResource::Read (this=<value optimized out>, aBuffer=0x80 <Address 0x80 out of bounds>, aCount=4294965384, aBytes=0x0) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/MediaResource.cpp:754 #9 0x40872a9c in android::MediaStreamSource::readAt (this=0x44011320, offset=5362745, data=0x43bbfc00, size=340) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:101 thread 17 and 16, blocked on 'mutex=0x41f25410': - thread 17: android::OMXCodec::signalBufferReturned - thread 16: android::OMXCodec::read thread 9, blocked here: #0 poll () at bionic/libc/arch-arm/syscalls/poll.S:10 #1 0x4167036a in _pr_poll_with_poll (pds=0x42ca8020, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920 #2 0x41670546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322 #3 0x404426d2 in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, interval=0x434c2e24) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:397 #4 0x40442bf0 in nsSocketTransportService::DoPollIteration (this=0x41f13100, wait=true) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:781 #5 0x40442ddc in nsSocketTransportService::Run (this=0x41f13100) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:645 - thread 8: XPCJSRuntime::WatchdogMain - thread 7: js::SourceCompressorThread::threadLoop - thread 6: js::GCHelperThread::threadLoop - thread 5: nsCycleCollectorRunner::Run thread 4, here: #0 __ioctl () at bionic/libc/arch-arm/syscalls/__ioctl.S:9 #1 0x40107b5c in ioctl (fd=<value optimized out>, request=8) at bionic/libc/bionic/ioctl.c:41 #2 0x41858a6e in android::IPCThreadState::talkWithDriver (this=0x41bfe110, doReceive=true) at frameworks/base/libs/binder/IPCThreadState.cpp:804 #3 0x41859354 in android::IPCThreadState::joinThreadPool (this=0x41bfe110, isMain=false) at frameworks/base/libs/binder/IPCThreadState.cpp:457 thread 2, blocked here: #5 android::MPEG4DataSource::readAt (this=0x42cd3a20, offset=2, data=0x44030000, size=17040) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:164 #6 0x4173e4da in android::MPEG4Source::read (this=0x41b4f240, out=0x100ffc70, options=<value optimized out>) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:2410 #7 0x4174c6f4 in android::OMXCodec::drainInputBuffer (this=0x41f25360, info=0x42cd29f0) at frameworks/base/media/libstagefright/OMXCodec.cpp:4046

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 15

•

12 years ago

Sotaro, I'm trying to reproduce this bug on Inari, and what I'm seeing is the video play for ~5 to 10 minutes without trouble, but it progressively starts to degrade. It doesn't stop all of a sudden. Instead, it looks like the video just stops playing; then a few minutes later, there is some renewed network activity (at least according to the arrows in the status bar), then the video resumes (albeit it stutters a lot) with audio; then everything seems to catch up for a while and play smoothly, before the video stops again. With each stop/restart, it takes longer for the video to resume, and it plays for a much shorter time before stopping again. logcat shows regular occurrences of these statements while this is happening: E( 872:0x3ea) [OMX.qcom.audio.decoder.aac] Timed out waiting for output buffers: 0/0 W( 872:0x405) obtainBuffer() track 0x43e794c0 disabled, restarting E( 117:0xe0) Parsing error in AudioFilter.csv. W( 872:0x405) obtainBuffer() track 0x43e794c0 disabled, restarting W( 117:0xe0) Thread AudioOut_1 cannot connect to the power manager service

Flags: needinfo?(sotaro.ikeda.g)

Diego Wilson [:diego]

Reporter

Comment 16

•

12 years ago

mikeh, I see the same timeout. I don't see any stutter though. As a side note a very common problem I've been seeing is MAC address clashing. The reference devices my team has all share the same MAC address by default. Fortunately we were able to override it via a setting file called softmac. Any chance the inaris have the same problem? Can you compare the MAC address of two inaris?

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 17

•

12 years ago

Relevant code: - "obtainBuffer() track 0x43e794c0 disabled, restarting" http://androidxref.com/4.0.4/xref/frameworks/base/media/libmedia/AudioTrack.cpp#915 - comment: // restart track if it was disabled by audioflinger due to previous underrun - "Timed out waiting for output buffers: 0/0" http://androidxref.com/4.0.4/xref/frameworks/base/media/libstagefright/OMXCodec.cpp#3429

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 18

•

12 years ago

(In reply to Diego Wilson [:diego] from comment #16) > > As a side note a very common problem I've been seeing is MAC address > clashing. The reference devices my team has all share the same MAC address > by default. Fortunately we were able to override it via a setting file > called softmac. > > Any chance the inaris have the same problem? Can you compare the MAC address > of two inaris? Diego, I would hope not--I'm running the tests on my home network. :) The MAC on my Inari is 90:1D:27:02:B5:A7. It's interesting that after a ~10 to 15-minute "freeze", the YouTube video in comment 0 has now resumed playing completely smoothly, with no more pausing.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 19

•

12 years ago

I also observe that during the periods where the audio plays smoothly, the video is either frozen, or significantly behind the audio track.

Diego Wilson [:diego]

Reporter

Comment 20

•

12 years ago

mikeh, Sounds like your seeing bug 863441 and bug 864230. The first one was fixed in gecko a couple of weeks ago. The second one is a fix in gonk. You may need an updated vendor build for that.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 21

•

12 years ago

diego, I'm running a build from a completely fresh repo that I pulled this morning (~10am, EDT). Would that include the fix in bug 864230?

Flags: needinfo?(dwilson)

Sotaro Ikeda [:sotaro]

Comment 22

•

12 years ago

It is not enabled on Mozbuild. Add patch to https://github.com/mozilla-b2g/gonk-patches is necessary. The github hadles patchs only for leo and hamachi now.

Flags: needinfo?(sotaro.ikeda.g)

Diego Wilson [:diego]

Reporter

Comment 23

•

12 years ago

Maybe there needs to be a new flag for that? Or was it supposed to be marked "Fixed" until it made it there?

Flags: needinfo?(dwilson)

Sotaro Ikeda [:sotaro]

Comment 24

•

12 years ago

(In reply to Mike Habicher [:mikeh] from comment #15) > Sotaro, I'm trying to reproduce this bug on Inari, and what I'm seeing is > the video play for ~5 to 10 minutes without trouble, but it progressively > starts to degrade. It doesn't stop all of a sudden. > //snip > > logcat shows regular occurrences of these statements while this is happening: > > E( 872:0x3ea) [OMX.qcom.audio.decoder.aac] Timed out waiting for output > buffers: 0/0 This error could be reported, when MediaStreamSource::readAt() was blocked for a longer time.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 25

•

12 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #22) > > It is not enabled on Mozbuild. Add patch to > https://github.com/mozilla-b2g/gonk-patches is necessary. The github hadles > patchs only for leo and hamachi now. If the fix for these video issues is common to all platforms, it sounds like gonk-patches needs an 'all' subfolder.

Alex Keybl [:akeybl]

Comment 26

•

12 years ago

The rationale from comment 13 does not change our blocking decision, and comment 15 is being resolved in other bugs. Not blocking.

blocking-b2g: tef? → leo+

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 27

•

12 years ago

diego, sotaro, with: - gecko: b2g18:2f8bf6531f39 ...which includes the bug 863441 patch - gaia: c51082d92f18c4c7cdbc4f96c977be735500562a ...and with the bug 864230 patch applied to gonk, I see YouTube videos freeze. Not all the time: the video in comment 0 has played through at least twice today, completely, without error; but on the third attempt, I see the incoming TCP connection's receive window collapse and the video stop after ~1 second of playback. Are there any other patches I could be missing?

blocking-b2g: leo+ → tef?

Flags: needinfo?(dwilson)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 28

•

12 years ago

Attached file gdb bt while media is frozen (UI is still responsive), 2013-05-23_16-05 (deleted) — Details

- thread 19, mutex=0x41f26510: #5 android::OMXCodec::read (this=0x41f26460, buffer=0x42da5360, options=0x2) at frameworks/base/media/libstagefright/OMXCodec.cpp:5061 #6 0x40880de8 in android::OmxDecoder::ReadVideo (this=0x42da5280, aFrame=0x426c6cc0, aTimeUs=187238095, aKeyframeSkip=false, aDoSeek=false) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:556 #7 0x4087f59e in nsMediaOmxReader::DecodeVideoFrame (this=0x41f42780, aKeyframeSkip=@0x426c6e1f, aTimeThreshold=2) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/nsMediaOmxReader.cpp:133 - thread 16, also mutex=0x41f26510: #5 android::OMXCodec::signalBufferReturned (this=0x41f26460, buffer=0x80) at frameworks/base/media/libstagefright/OMXCodec.cpp:5227 #6 0x41773d26 in android::MediaBuffer::release (this=0x43dcb980) at frameworks/base/media/libstagefright/MediaBuffer.cpp:102 #7 0x4087fcac in android::OmxDecoder::ReleaseAllPendingVideoBuffersLocked (this=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:769 - thread 18, cond=0x442c7824, mutex=0x43d77284: #8 0x408a6a70 in mozilla::ChannelMediaResource::Read (this=<value optimized out>, aBuffer=0x80 <Address 0x80 out of bounds>, aCount=4294964942, aBytes=0x0) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/MediaResource.cpp:754 #9 0x4087fa94 in android::MediaStreamSource::readAt (this=0x442c7a50, offset=15881796, data=0x43d8b000, size=238) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:101 #10 0x4176bc34 in android::MPEG4DataSource::readAt (this=<value optimized out>, offset=15881796, data=0x43d8b000, size=238) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:172 #11 0x4176bc34 in android::MPEG4DataSource::readAt (this=<value optimized out>, offset=15881796, data=0x43d8b000, size=238) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:172 #12 0x4176bee8 in android::MPEG4Source::read (this=0x43d930b0, out=0x45163c70, options=<value optimized out>) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:2329 - thread 9: #2 0x4167c546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322 #3 0x4044f6d2 in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, interval=0x435dae24) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:397 #4 0x4044fbf0 in nsSocketTransportService::DoPollIteration (this=0x41f13100, wait=true) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:781 #5 0x4044fddc in nsSocketTransportService::Run (this=0x41f13100) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:645 In the PR_Poll call (frame #2), timeout=4294967295 is 0xFFFFFFFF, or -1, which means (for the POSIX version of poll()[1], at least) no timeout. 1. http://pubs.opengroup.org/onlinepubs/007908799/xsh/poll.html

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 29

•

12 years ago

Depending on how much I trust the gdb backtrace, the thread 9 frame 2 call looks completely wrong: #2 0x4167c546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322 According to this[1], 'pds' is a pointer type, in which case a value of 0xfffffffc (-2 as a uint32_t) is most definitely bogus. 1. https://developer.mozilla.org/en-US/docs/PR_Poll

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 30

•

12 years ago

Further to comment 20, nsSocketTransportService::Poll()[1] sets the 'pds' argument to one of three values: - if nsSocketTransportService::mPollList->fd is non-zero, pds = mPollList - if nsSocketTransportService::mPollList->fd is zero, and: - if nsSocketTransportService::mActiveCount is non-zero, pds = mPollList + 1 - if nsSocketTransportService::mActiveCount is zero, pds = nullptr Assuming mPollList is a valid pointer, none of the valid paths should lead to pds = 0xfffffffc. mPollList is initialized in the constructor by calling moz_xmalloc() (which is infallible[2]), and freed in the destructor by calling moz_xfree(). mPollList can also grow if a process uses a LOT of sockets[3] (which watching a long YouTube video seems to do--I've observed one new TCP connection per 800KiB to 1.8MiB of data), by calling moz_xrealloc(). It does this in 100-socket increments, once mActiveListSize exceeds SOCKET_LIMIT_MIN (currently defined as 50)[4]. 1. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#361 2. https://developer.mozilla.org/en-US/docs/Infallible_memory_allocation 3. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#318 4. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#40

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 31

•

12 years ago

Attached patch WIP - logging to try to figure out where pds=0xfffffffc comes from (deleted) — Details — Splinter Review

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 32

•

12 years ago

One thing I've noticed about the all-green screen: it seems to happen when WiFi cycles. I haven't been able to determine if it's due to a complete disconnect, or if it's just a rekeying; but I wonder if there's an error code somewhere that isn't getting propagated properly.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 33

•

12 years ago

Of course, having posted that, I just managed to get a green-screen without a WiFi cycle.

Alex Keybl [:akeybl]

Comment 34

•

12 years ago

Assuming comment 27 is a mid-air collision and setting back to leo+. No new tef block information provided.

blocking-b2g: tef? → leo+

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 35

•

11 years ago

Some gdb debugging: (gdb) list 3920 3905 pds[index].out_flags = 0; 3906 } 3907 } 3908 if (0 == ready) 3909 { 3910 switch (timeout) 3911 { 3912 case PR_INTERVAL_NO_WAIT: msecs = 0; break; 3913 case PR_INTERVAL_NO_TIMEOUT: msecs = -1; break; 3914 default: 3915 msecs = PR_IntervalToMilliseconds(timeout); 3916 start = PR_IntervalNow(); 3917 } 3918 3919 retry: 3920 ready = poll(syspoll, npds, msecs); 3921 if (-1 == ready) 3922 { 3923 PRIntn oserror = errno; 3924 3925 if (EINTR == oserror) 3926 { 3927 if (timeout == PR_INTERVAL_NO_TIMEOUT) 3928 goto retry; 3929 else if (timeout == PR_INTERVAL_NO_WAIT) 3930 ready = 0; /* don't retry, just time out */ 3931 else 3932 { 3933 elapsed = (PRIntervalTime) (PR_IntervalNow() 3934 - start); (gdb) p npds $1 = 1 (gdb) p msecs $2 = -1 (gdb) p syspoll $3 = (struct pollfd *) 0x4359dbcc (gdb) p *syspoll $4 = {fd = 21, events = 1, revents = 0} (gdb) info frame Stack level 1, frame at 0x4359ddf8: pc = 0x4015236a in _pr_poll_with_poll (/home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920); saved pc 0x40152546 called by frame at 0x4359de00, caller of frame at 0x4359dbb0 source language c. Arglist at 0x4359dbb0, args: pds=0x42ba8020, npds=1, timeout=4294967295 Locals at 0x4359dbb0, Previous frame's sp is 0x4359ddf8 Saved registers: r4 at 0x4359ddd4, r5 at 0x4359ddd8, r6 at 0x4359dddc, r7 at 0x4359dde0, r8 at 0x4359dde4, r9 at 0x4359dde8, r10 at 0x4359ddec, r11 at 0x4359ddf0, lr at 0x4359ddf4 (gdb) p pds $5 = (PRPollDesc *) 0x42ba8020 (gdb) p *pds $6 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1} (gdb) p *fd Cannot access memory at address 0xffffffff (gdb) p fd $7 = -1 (gdb) p pds->fd $8 = (PRFileDesc *) 0x41a03c00 (gdb) p *pds->fd warning: can't find linker symbol for virtual table for `(null)' value $9 = warning: can't find linker symbol for virtual table for `(null)' value {methods = 0x4015bbe8, secret = 0x41a01270, lower = 0x42ba6820, higher = 0x0, dtor = 0x40143d21 <pl_FDDestructor>, identity = 1} (gdb) p POLLIN No symbol "POLLIN" in current context. (gdb) p (PRInt16*)malloc(sizeof(PRInt16)) $10 = (PRInt16 *) 0x1f83ef0 (gdb) p *(PRInt16*)0x1f83ef0 $11 = 30024 (gdb) p (pds[0].fd->methods->poll)(pds[0].fd, pds[0].in_flags & 0xfffd, (PRInt16*)0x1f83ef0) warning: can't find linker symbol for virtual table for `(null)' value $12 = 1 (gdb) p pds[0] $13 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1} (gdb) p *(PRInt16*)0x1f83ef0 $14 = 0 (gdb) p 0xd $15 = 13 (gdb) p (pds[0].fd->methods->poll)(pds[0].fd, pds[0].in_flags & 0xfffe, (PRInt16*)0x1f83ef0) warning: can't find linker symbol for virtual table for `(null)' value $16 = 0 (gdb) p *(PRInt16*)0x1f83ef0 $17 = 0 (gdb) p PR_GetIdentitiesLayer(pds[0].fd, (PRDescIdentity)0) warning: can't find linker symbol for virtual table for `(null)' value $18 = (PRFileDesc *) 0x42ba6820 (gdb) p pds[0] $19 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1} (gdb) p *(PRFileDesc*)0x42ba6820 warning: can't find linker symbol for virtual table for `(null)' value $20 = warning: can't find linker symbol for virtual table for `(null)' value {methods = 0x4015be70, secret = 0x41a03c20, lower = 0x0, higher = 0x41a03c00, dtor = 0, identity = 0} (gdb) p *(PRFileDesc*)0x42ba6820->secret Attempt to extract a component of a value that is not a structure pointer. (gdb) p ((PRFileDesc*)0x42ba6820)->secret warning: can't find linker symbol for virtual table for `(null)' value $21 = (PRFilePrivate *) 0x41a03c20 (gdb) p *((PRFileDesc*)0x42ba6820)->secret warning: can't find linker symbol for virtual table for `(null)' value $22 = {state = -1431655766, nonblocking = 0, inheritable = _PR_TRI_TRUE, next = 0x0, lockCount = 0, md = {osfd = 21, tcp_nodelay = 0}} (gdb) p/x *((PRFileDesc*)0x42ba6820)->secret warning: can't find linker symbol for virtual table for `(null)' value $23 = {state = 0xaaaaaaaa, nonblocking = 0x0, inheritable = 0x1, next = 0x0, lockCount = 0x0, md = {osfd = 0x15, tcp_nodelay = 0x0}} (gdb) p ready $24 = 0 (gdb)

blocking-b2g: leo+ → leo?

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 36

•

11 years ago

It also looks like fd=21 has no data waiting: (gdb) p (unsigned int*)malloc(sizeof(unsigned int)) $25 = (unsigned int *) 0x1f83f00 (gdb) p FIONREAD No symbol "FIONREAD" in current context. (gdb) p ioctl(21, 0x541b, (unsigned int*)0x1f83f00) $26 = 0 (gdb) p *(unsigned int*)0x1f83f00 $27 = 0 This would be inconsistent with the TCP.rwin=0 observed in the video data stream.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 37

•

11 years ago

Further to comment 29 and comment 30, the value pds=0xfffffffc seems to be incorrect, possibly a by-product of an optimized build. Probing with gdb (see comment 35) suggests that a valid PRPollDesc pointer is ultimately making it into PR_Poll().

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 38

•

11 years ago

Hmm, looks like the socket (in this case, fd=21) is not (really) a socket. I can't make any sense out of the 'st_mode' member of the struct stat returned by fstat(): (gdb) p (struct stat*)malloc(sizeof(struct stat)) $50 = (struct stat *) 0x1f86000 (gdb) p fstat(21, (struct stat*)0x1f86000) $51 = 0 (gdb) p *(struct stat*)0x1f86000 $52 = {st_dev = 7, __pad0 = "\000\000\000", __st_ino = 2979, st_mode = 4480, st_nlink = 1, st_uid = 0, st_gid = 0, st_rdev = 0, __pad3 = "\000\000\000", st_size = 0, st_blksize = 4096, st_blocks = 0, st_atime = 1369663045, st_atime_nsec = 661191246, st_mtime = 1369663045, st_mtime_nsec = 661191246, st_ctime = 1369663045, st_ctime_nsec = 661191246, st_ino = 2979} (gdb) p ((struct stat*)0x1f86000)->st_mode $53 = 4480 (gdb) p S_IFMT No symbol "S_IFMT" in current context. (gdb) p 00170000 $54 = 61440 (gdb) p ((struct stat*)0x1f86000)->st_mode & 61440 $55 = 4096 According to the AOSP version of stat.h, 4096 is 'S_ISVTX', which seems to mean: "on directories, restricted deletion flag"[1] 1. http://pubs.opengroup.org/onlinepubs/7908799/xsh/sysstat.h.html

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 39

•

11 years ago

Most recent reproduction of this problem occurred after 50m30s of playing the comment 0 video. gdb shows that the socket thread is blocked on the poll() call, while Wireshark shows there is no longer any network traffic between the DUT and YouTube. Wireshark (a 798MB log--too big to upload) shows: 1. a session on local port 46297 receiving ~1.5MiB of data from YT, its Rwin hitting 0 2. Rwin update of 10880 3. FIN send to YT 4. TCP keep-alives from YT 5. YT sends 14 more packets (up to ACK 1510469, 19304 more bytes following item 1 above), which we respond to with TCP RSTs 6. we open a new TCP connection on local port 46298 (the 3-way handshake completes right away) 7. we send 4 HTTP GETs (TCP retransmissions) which are ignored by YT 8. we get a FIN-ACK, and send a FIN-ACK on port 46928 9. YT sends us a RST packet 10. we repeat steps 6 through 9 above on port 46299 There is no more traffic between the DUT and YT: no new connection attempts. In gdb, the socket is still blocked on poll(). The video player UI is stuck where it was when the connection(s) failed; pressing Pause/Play has no effect on the player, and causes no new traffic to appear in Wireshark.

Diego Wilson [:diego]

Reporter

Comment 40

•

11 years ago

Is there any test we can do to confirm once and for all if it's a network connection issue or a hardware decoder issue?

Flags: needinfo?(dwilson)

Diego Wilson [:diego]

Reporter

Comment 41

•

11 years ago

(In reply to Mike Habicher [:mikeh] from comment #27) > Are there any other patches I could be missing? Not that I know of. You're probably seeing the same thing I'm seeing now. But does seem to be at least dependent (or even caused by) network conditions.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 42

•

11 years ago

(In reply to Diego Wilson [:diego] from comment #41) > > Not that I know of. You're probably seeing the same thing I'm seeing now. > But does seem to be at least dependent (or even caused by) network > conditions. I'm investigating that now. I have my DUT sniffing packets into Wireshark over adb[1], and my test AP is plugged into an old-school Ethernet hub[2] between it and my Internet connection, with an extra computer also running Wireshark sniffing packets on the wire. I'm checking to see if packets are lost when the video playback stalls. 1. https://wiki.mozilla.org/B2G/Packet_Sniffing_with_Wireshark 2. http://en.wikipedia.org/wiki/Ethernet_hub

Wayne Chang [:wchang]

Updated

•

11 years ago

Whiteboard: c=video [TD-26549] → c=video [TD-26549], MiniWW

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 43

•

11 years ago

Per the stand-up, I have filed some other bugs I have observed while investigating this one: - bug 876838: Screen goes green when YouTube video stream stalls - bug 876843: YouTube video display stops but audio continues to play

Andreas Gal :gal

Comment 44

•

11 years ago

dougt, can your team help with this?

Flags: needinfo?(doug.turner)

Wayne Chang [:wchang]

Comment 45

•

11 years ago

Comment 35 unintentionally cleared the flag - resetting leo+

blocking-b2g: leo? → leo+

Doug Turner (:dougt)

Comment 46

•

11 years ago

mike, can you put the wireshark profile somewhere?

Flags: needinfo?(doug.turner)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 47

•

11 years ago

After sleeping on comment 39, this occurred to me: we _are_ talking to the YouTube server, or at least some machine with a TCP server socket on port 80, because in step 6, we successfully complete the TCP handshake! The SYN/SYN+ACK RTT on the port 46298 connection is 26ms, and on port 46299 it's 34ms, both of which are consistent with earlier connections; also, in the SYN+ACK, MSS, window scaling, the lack of timestamps, and Rwin=14600 are consistent with earlier SYN+ACK packets. Also, after the 4 TCP retransmissions in step 7, the server sends us a FIN+ACK in step 8 to close the connection. Here's the weird part: we get the FIN+ACK packets with ACK numbers of 1, which means the server sending them never saw the HTTP GETs sent in step 7 (consistent with the SEQ number of 1 in these packets, which indicates the server never set us any data either). We respond to the FIN+ACK with our own FIN+ACK with SEQ=980 and ACK=2. So where are the HTTP GETs getting lost, and why? So far, I have been unable to reproduce this case with a wired-side sniffing connection: DUT <--> AP-test <--> EthHub <--> Router <--> {Internet} ^--> Netbook/Wireshark Next step: raw wireless sniffing.

Patrick McManus [:mcmanus]

Comment 48

•

11 years ago

(In reply to Mike Habicher [:mikeh] from comment #47) > > So where are the HTTP GETs getting lost, and why? So far, I have been unable > to reproduce this case with a wired-side sniffing connection: > So that's a good question - but its worth thinking about the fact that even if you answer the question for this use case the fact remains the GETs could always get lost for some other reason (radio interference for example) that you aren't experiencing here and which isn't fixable.. because that's true it might make more sense to focus on why retries totally stop at step 9 or 10..

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 49

•

11 years ago

As if to reinforce mcmanus' comment, my most recent reproduction of the YouTube stall occurred 17m48s into the video, where there is an open TCP connection but it is stalled. The video is frozen on the DUT's screen, and Wireshark shows incoming TCP keep-alives from YT (approximately every 10s), to which we respond with TCP ACKs with Rwin=0. roc, any idea where to look to see where the video path might be stalled?

Flags: needinfo?(roc)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 50

•

11 years ago

Further to comment 49, looking at the socket service in the parent process: (gdb) frame 1 #1 0x403bb36a in _pr_poll_with_poll (pds=0x42f9d020, npds=3, timeout=65535000) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920 3920 ready = poll(syspoll, npds, msecs); (gdb) p npds $1 = 3 (gdb) p syspoll $2 = (struct pollfd *) 0x42ef5bcc (gdb) p *syspoll $3 = {fd = 33, events = 1, revents = 0} (gdb) p syspoll[0] $4 = {fd = 33, events = 1, revents = 0} (gdb) p syspoll[1] $5 = {fd = 61, events = 3, revents = 0} (gdb) p syspoll[2] $6 = {fd = 62, events = 3, revents = 0} (gdb) p (unsigned int*)malloc(sizeof(unsigned int)) $7 = (unsigned int *) 0x315e80 (gdb) p ioctl(21, 0x541b, (unsigned int*)0x315e80) $8 = -1 (gdb) p ioctl(33, 0x541b, (unsigned int*)0x315e80) $9 = 0 (gdb) p *(unsigned int*)0x315e80 $10 = 1 (gdb) p ioctl(61, 0x541b, (unsigned int*)0x315e80) $11 = -1 (gdb) p ioctl(62, 0x541b, (unsigned int*)0x315e80) $12 = -1 (gdb) p __errno() $13 = (volatile int *) 0x42ef5f08 (gdb) p *__errno() $14 = 22 (gdb) p msecs $15 = 65535000 (gdb) The errno value of 22 is EINVAL, which in conjunction with FIONREAD on bionic, supposedly indicates that fds 61 and 62 are listeners: http://stackoverflow.com/questions/13353146/android-linux-ioctl-fionread-failing (assuming they're even sockets in the first place). (With both the parent and video player app processes stopped in gdb, Wireshark shows the TCP keep-alives and ACK dance every 10 seconds, seemingly indefinitely.)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 51

•

11 years ago

You want NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5. I don't know why data would not be flowing if the HTTP port is open and we're trying to play the video. Definitely sounds like a bug on our side.

Flags: needinfo?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 52

•

11 years ago

The media cache does at times suspend reading from stream, but it should only do that if we've cached data ahead of where we're playing from and the cache is full. I can help you interpret the nsMediaCache logs to figure out if and why it stopped reading.

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

11 years ago

Blocks: 877024

Andreas Gal :gal

Comment 53

•

11 years ago

mikeh, are you attaching logs?

Flags: needinfo?(mhabicher)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 54

•

11 years ago

(In reply to Andreas Gal :gal from comment #53) > > mikeh, are you attaching logs? gal, I haven't been attaching Wireshark .pcap logs because they tend to compress down to ~800MiB; hence the descriptive summaries in (e.g.) comment 39 and comment 47. On the other hand, logcat logs don't show anything relevant.

Flags: needinfo?(mhabicher)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 55

•

11 years ago

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #51) > > You want NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5. > > I don't know why data would not be flowing if the HTTP port is open and > we're trying to play the video. Definitely sounds like a bug on our side. Is there any way to coax PRLogging out of a build aside from setting DEBUG=1? b2g18 debug builds currently throw assertions like "MOZ_Assert: Assertion failure: status == OK, at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:315"[1] 1. http://mxr.mozilla.org/mozilla-b2g18/source/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp#315 (This issue has already been submitted as bug 841206.)

Doug Turner (:dougt)

Comment 56

•

11 years ago

mike, can you try to reproduce this on desktop? it sounds like this problem might be reproducible there as well.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 57

•

11 years ago

(In reply to Doug Turner (:dougt) from comment #56) > > mike, can you try to reproduce this on desktop? it sounds like this problem > might be reproducible there as well. dougt, how do I go about doing that?

Flags: needinfo?(doug.turner)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 58

•

11 years ago

Attached file logcat output with NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5 (deleted) — Details

As requested. This is with the video on the DUT stalled, and the TCP stack receiving and ACKing TCP keep-alives, with Rwin=0. The log ends with the following output, more or less continuously: 05-30 13:11:39.721 1277 1295 I Gecko - SYDNEY_AUDIO: 0x4435f340 - get position 05-30 13:11:39.721 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 catching up

Flags: needinfo?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 59

•

11 years ago

Thanks for these logs! Very helpful. OK it looks like the media cache fills up and we start suspending and resuming the connection to throttle the download. This shows up in the log like this: 05-30 13:10:01.635 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 predict next data in 42.297000, current worst block is 42.004089 05-30 13:10:01.645 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 Suspended That means the next data we read via HTTP is expected to play 42.297 seconds from now, but least-useful data in the cache is for 42.004089 seconds from now, so we want to stop reading temporarily, and we tell Necko to suspend reading from the stream. You see a sequence of Suspends and Resumes as the video playback advances and we try to slide forward the window of data that we have around the current playback position. This is normal. You will also see some DataReceived notifications signalling that we received data from Necko immediately after a Suspend; this is also normal, and I guess this is expected when Suspend calls race with Necko data delivery. Here is the last time at which data is received: 05-30 13:10:01.645 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 DataReceived at 105742336 count=59532 There's a couple of Suspends and Resumes after that. Then the media cache does its last Resume call: 05-30 13:10:02.055 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 Resumed and there are no Suspend calls after that. So we're trying to read data from Necko, but none arrives. Everything after that shows the window of cached data after the playback position gradually shrinking as the playback position advances, until finally it runs out, with predictable results. So as far as I can tell, the media cache is behaving as expected. Based on previous experience we know that if we suspend for too long, some servers will close the HTTP connection. We have logic to reopen the connection automatically in that case, but these logs seem to show that doesn't happen. (MediaChannelResource::CacheClientSeek would be called in that case, and would have been logged.) Perhaps suspending the HTTP channel has messed something up in another way --- e.g. flow control at some level, or maybe the server has reacted incorrectly and can't resume delivering data? If you still have the wireshark logs I guess you should be able to infer from those whether our TCP receive window closed and then opened again. Whatever the problem is, we may be able to work around it by detecting when we have received no data for some amount of time (fairly long, say ten seconds), and if our last action was a Resume (and the server supports byte-range requests) automatically close the connection and open a new one. I will try to make a patch for that.

Flags: needinfo?(roc)

Sotaro Ikeda [:sotaro]

Comment 60

•

11 years ago

I suspect some state inconsistency happened around IPC. Bug 868073 is similar area bug.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 61

•

11 years ago

Mike should be able to tell by looking at logs whether that last Resume() actually took effect.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 62

•

11 years ago

Attached patch timer patch (obsolete) (deleted) — Details — Splinter Review

Mike, this patch might help. It should at least prevent the kind of log that you supplied here --- the patch should reset the connection after not receiving data for five seconds after the Resume().

Patrick McManus [:mcmanus]

Comment 63

•

11 years ago

maybe jduell can write some tests around ipc and pause/resume?

Flags: needinfo?(jduell.mcbugs)

Jason Duell

Comment 64

•

11 years ago

Sure--we have basic suspend/resume tests, but I'll knock off something that does it repeatedly with a very large file. I doubt it'll turn anything up (I suspect we really need the specific flakiness of the network connection here), but it's worth trying. Meanwhile I think it would be useful here to get a run where we have both a wireshark log (post it to a website somewhere so we can download it) AND an NSPR HTTP log: https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Firefox_OS_phones (I think using a single log file is fine)

Flags: needinfo?(jduell.mcbugs)

Jason Duell

Comment 65

•

11 years ago

roc: > You will also see some DataReceived notifications signalling that we > received data from Necko immediately after a Suspend; this is also normal, > and I guess this is expected when Suspend calls race with Necko data delivery. As long as you suspend a channel on the main thread, Necko should never call OnDataAvailable until you call Resume [1]. I.e no races. That would be a bug. Can you verify for me whether you're actually seeing that in the log? [1] For single process necko, we suspend the underlying inputStreamPump which handles calling OnDataAvailable. For e10s we send a message back to the parent to suspend the pump, and queue any OnDataAvail IPDL messages that arrive in the meantime.

Flags: needinfo?(roc)

Jason Duell

Comment 66

•

11 years ago

Sorry, in case it wasn't clear, we want an NSPR log with both media/HTTP logging, ie. NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5:nsHttp:5,nsSocketTransport:5 >> can you try to reproduce this on desktop? > > how do I go about doing that? I'm guessing forging the UserAgent header to B2G's (via about:config 'general.useragent.override') and going to the URI would be a good start.

Flags: needinfo?(doug.turner) → needinfo?(mhabicher)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 67

•

11 years ago

roc, with your patch, the video plays until the end; the display stalled quite early on (see bug 877461) but the audio kept on going. Wireshark confirms that we hit the end of the stream. For posterity, I added a printf_stderr() call into the NoDataDeliveredTimeout() callback in your patch; not sure if 'mOffset' value being the same means anything but there you go. 15:14:52 ➜ bug870564 grep -n "mikeh:" log06_media_timer.txt 64950:05-31 13:42:44.650 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 101956:05-31 13:44:40.844 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 129423:05-31 13:46:03.845 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 143889:05-31 13:46:49.739 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 181937:05-31 13:48:44.471 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 199981:05-31 13:49:40.996 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 210232:05-31 13:50:11.446 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 223187:05-31 13:50:53.077 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 236334:05-31 13:51:34.377 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 292244:05-31 13:54:32.921 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 313632:05-31 13:55:40.497 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 358557:05-31 13:58:45.167 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 420596:05-31 14:05:17.040 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 427522:05-31 14:06:54.675 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 465309:05-31 14:15:19.327 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 491143:05-31 14:22:47.214 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 522485:05-31 14:31:19.034 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 579806:05-31 14:43:09.917 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 629090:05-31 14:53:20.983 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664) 664792:05-31 15:01:53.533 1338 1338 I Gecko : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 68

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #66) > Sorry, in case it wasn't clear, we want an NSPR log with both media/HTTP > logging, ie. > > NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5:nsHttp:5, > nsSocketTransport:5 I'll roll back roc's change and add the extra logging options.

Flags: needinfo?(mhabicher)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 69

•

11 years ago

Attached file logcat with NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5,nsHttp:5,nsSocketTransport:5 (deleted) — Details

jduell: here's the logcat output; the corresponding Wireshark output can be found here: http://people.mozilla.com/~mhabicher/log07_dut.pcap

Attachment #756747 - Flags: feedback?(jduell.mcbugs)

Diego Wilson [:diego]

Reporter

Comment 70

•

11 years ago

(In reply to Mike Habicher [:mikeh] from comment #67) > roc, with your patch, the video plays until the end; Awesome possum! I'll make sure to try it out. > the display stalled > quite early on (see bug 877461) but the audio kept on going. Wireshark > confirms that we hit the end of the stream. Yeah, I totally agree this is a separate bug. I can actually reproduce it earlier if I leave on the overlay progress bar.

Jason Smith [:jsmith]

Updated

•

11 years ago

status-b2g18: --- → affected

status-b2g18-v1.0.0: --- → wontfix

status-b2g18-v1.0.1: --- → wontfix

status-b2g-v1.1hd: --- → affected

Diego Wilson [:diego]

Reporter

Updated

•

11 years ago

Blocks: 878343

Jason Duell

Comment 71

•

11 years ago

Ran some e10s xpcshell tests on my linux desktop with large files (10 to 100 MB), where at every 5th OnDataAvailable I call suspend, wait a second, then resume. With a localhost http server, 25 MB file downloads fine that way. 100 MB caused a crash (which is a bug, but not this one). Over remote server loading 10 MB file once saw server drop the connection half-way through (but OnStopRequest called), once saw a crash. Clearly work to be done with the crash issue, but nothing here so far points to a cause for this bug. I'll take a look at the wireshark/NSPR log next.

Sotaro Ikeda [:sotaro]

Comment 72

•

11 years ago

On Firefox OS, I saw a lot of consecutive pause()s/resume()s were called without OnDataAvailable.

cajbir (:cajbir)

Comment 73

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #71) > Ran some e10s xpcshell tests on my linux desktop with large files (10 to 100 > MB), where at every 5th OnDataAvailable I call suspend, wait a second, then > resume. With a localhost http server, 25 MB file downloads fine that way. > 100 MB caused a crash (which is a bug, but not this one). Over remote > server loading 10 MB file once saw server drop the connection half-way > through (but OnStopRequest called), once saw a crash. Clearly work to be > done with the crash issue, but nothing here so far points to a cause for > this bug. I'll take a look at the wireshark/NSPR log next. Try reducing the size of the media cache if you haven't already (media.cache_size). On Firefox OS it's something pretty small - 4MB IIRC.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 74

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #65) > As long as you suspend a channel on the main thread, Necko should never call > OnDataAvailable until you call Resume [1]. I.e no races. That would be a > bug. Can you verify for me whether you're actually seeing that in the log? From comment #59: > Then the media cache does its last Resume call: > 05-30 13:10:02.055 1277 1277 I PRLog : 1074652408[425c92b0]: Stream 44ccc030 Resumed > and there are no Suspend calls after that. So we're trying to read data from Necko, but none > arrives. Mike confirmed on IRC that after our last Resume call, Wireshark shows the TCP receive window never opens. Which is why I think it's a Necko bug.

Flags: needinfo?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 75

•

11 years ago

Comment on attachment 756494 [details] [diff] [review] timer patch Review of attachment 756494 [details] [diff] [review]: ----------------------------------------------------------------- It sounds like this patch is working. I think we should take it as well as fix any underlying Necko bug. It gives us some extra protection against possible bugs in the network or server. ::: layout/base/FrameLayerBuilder.cpp @@ +2947,5 @@ > layer->SetClipRect(nullptr); > layer->SetMaskLayer(nullptr); > ImageLayer* imgLayer = static_cast<ImageLayer*>(layer.get()); > layer.forget(); > + printf("### RECYCLING IMAGE LAYER\n"); Er, ignore this.

Attachment #756494 - Flags: review?(cpearce)

Chris Pearce [:cpearce (Not reading bugmail)]

Updated

•

11 years ago

Attachment #756494 - Flags: review?(cpearce) → review+

leo.bugzilla.gecko

Comment 76

•

11 years ago

I adopt patch on this version.. Mozilla build ID: 20130526070207 Gaia Revision : 4d10e1297b859cacc174c0a54af61a7678d7c32d Gecko Revision : 52341e43539a0e8b9aa77a9128cc3871439b8aa6 There's more severe problem occurred. Play is lagged from the start and then it is stuck after few seconds. Is there anyone who check this patch?

Diego Wilson [:diego]

Reporter

Comment 77

•

11 years ago

(In reply to leo.bugzilla.gecko from comment #76) > There's more severe problem occurred. > Play is lagged from the start and then it is stuck after few seconds. Does audio keep playing? If so, that is a separate issue tracked in bug 877461

leo.bugzilla.gecko

Comment 78

•

11 years ago

(In reply to Diego Wilson [:diego] from comment #77) > Does audio keep playing? If so, that is a separate issue tracked in bug > 877461 No, A/V is just stuck forever.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 79

•

11 years ago

(In reply to leo.bugzilla.gecko from comment #76) > I adopt patch on this version.. > Mozilla build ID: 20130526070207 > Gaia Revision : 4d10e1297b859cacc174c0a54af61a7678d7c32d > Gecko Revision : 52341e43539a0e8b9aa77a9128cc3871439b8aa6 Can you give a URL for this revision? I assume this is from git but I don't know which repo... > There's more severe problem occurred. > Play is lagged from the start and then it is stuck after few seconds. > > Is there anyone who check this patch? You mean this is a regression? Does it happen on all Youtube videos? Can you give specific steps to reproduce?

leo.bugzilla.gecko

Comment 80

•

11 years ago

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #79) > Can you give a URL for this revision? I assume this is from git but I don't > know which repo... Do you mean this URL? Gaia : http://git.mozilla.org/?p=releases/gaia.git;a=shortlog;h=refs/heads/v1-train Gecko : http://git.mozilla.org/?p=releases/gecko.git;a=shortlog;h=refs/heads/gecko-18 > You mean this is a regression? Does it happen on all Youtube videos? Can you > give specific steps to reproduce? I just checked URL in comment 0, just try to play. I will check for another youtube videos also.

leo.bugzilla.gecko

Comment 81

•

11 years ago

I think......Bug 871485 could affect this problem... Because I check patch here after patching Bug 871485. (It seems like youtube got more problem with patch Bug 871485) I am checking ...

leo.bugzilla.gecko

Comment 82

•

11 years ago

Sorry for confusing. The stuck problem is definitely because of Bug 871485. After adapting this patch, it's lagged so much and only audio is keep playing. I think it's same as bug 877461.

leo.bugzilla.gecko

Comment 83

•

11 years ago

There are so many message from to log like this..when it's lagging MediaResource Could not trim cache block 268 (destination 127, predicted next use 38.952000, latest predicted use for overflow 198.109000

leo.bugzilla.gecko

Comment 84

•

11 years ago

(In reply to Diego Wilson [:diego] from comment #77) > Does audio keep playing? If so, that is a separate issue tracked in bug > 877461 It fills like worse than before. Is Bug 877461 same as Bug 863441. I think this patch reopen Bug 863441.

leo.bugzilla.gecko

Comment 85

•

11 years ago

More information.. When a network connection is good, it occurred less than before. If network connection is not good, it occurred whenever DataReceived notifications signalling is show up.

Sotaro Ikeda [:sotaro]

Comment 86

•

11 years ago

(In reply to leo.bugzilla.gecko from comment #82) > Sorry for confusing. The stuck problem is definitely because of Bug 871485. The patch does not change how video playback is done, but change the timing of codec allocation. I think Bug 871485 does not cause the problem, but hit more often the hidden problem.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 87

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #71) > > Clearly work to be > done with the crash issue, but nothing here so far points to a cause for > this bug. I'll take a look at the wireshark/NSPR log next. jduell, anything I can do to help?

Sotaro Ikeda [:sotaro]

Comment 88

•

11 years ago

It seems that on leo device by using MozBuild, we can see more freeze easily. It might be true also on vendor built ROM. Last week, I was in lg b2g work week, lg engineer said, it is easy to generate freeze.

Jason Duell

Comment 89

•

11 years ago

I've got :mikeh running a build that asserts if the necko channel's suspend/resume methods are called off main thread. We seem to be seeing that (or some other kind of error) in bug 877708, and that could explain the "lost resume" we seem to seeing here.

Jason Duell

Comment 90

•

11 years ago

So we're not seeing any off-main thread use of suspend/resume. But :mikeh has been running some builds with various diagnostic checks, and we did catch a hang and attached a debugger to it, and we saw 1) nsHttpChannel::mSuspendCount = 1 2) the last printf in our log of suspend/resume is a suspend() 3) wireshark shows TCP window is 0 (i.e. full client TCP buffer because channel is suspended) This really seems to point to the code above necko not calling resume(), yet expecting data. roc: I smell a media cache bug at this point.

Flags: needinfo?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 91

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #90) > This really seems to point to the code above necko not calling resume(), yet > expecting data. roc: I smell a media cache bug at this point. Maybe there is a media cache bug too, but in the log in comment #58, it's clearly a Resume that's the last call. Please don't ignore that.

Flags: needinfo?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 92

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #90) > This really seems to point to the code above necko not calling resume(), yet > expecting data. roc: I smell a media cache bug at this point. Did you get the nsMediaCache log for this? If so can you drop a link here?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 93

•

11 years ago

Hmm, it's possible the resume gets lost somewhere between where nsMediaCache logs it and calling Necko. I'll dig.

Jason Duell

Comment 94

•

11 years ago

I now suspect bug 877708 is the real culprit here. Taking roc off the hook.

Jason Duell

Comment 95

•

11 years ago

mikeh: could you try my patch (the simple one for b2g18 branch) from bug 877708 (without roc's patch) and see if it fixes the freeze here? Alas, there's a good chance my bug 877708 fix won't actually fix this issue. The bug there was that if suspend was called again very shortly after resume, we could wind up flushing events from the parent (like OnDataAvailable) that we have queued on the child, even though we're supposed to be suspended (these were the "racy" OnDataAvailable calls mentioned in comment 65). My patch fixes that, but AFAICT it won't help with any "lost" resume calls (i.e we shouldn't remain suspended if the media code really has issued a matching number of suspends/resumes). But I don't see any obvious problems with the way necko e10s does suspend/resume, so I'm still hoping that not issuing the racy ODA calls will somehow make the media code behave better.

Flags: needinfo?(mhabicher)

leo.bugzilla.gecko

Comment 96

•

11 years ago

Why does stuck happen at the same position all the time?? and...according to logs, it seems like h/w codec keeps consuming data even though it needs

Diego Wilson [:diego]

Reporter

Comment 97

•

11 years ago

Does anyone have any objections to landing attachment 756494 [details] [diff] [review] ?

Keywords: checkin-needed

Diego Wilson [:diego]

Reporter

Comment 98

•

11 years ago

Also, perhaps we can spin off another bug for continuing the media cache debugging discussion?

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 99

•

11 years ago

(In reply to Diego Wilson [:diego] from comment #97) > > Does anyone have any objections to landing attachment 756494 [details] [diff] [review] ? My concern is that while is does unstick the video, it doesn't address the underlying problem that causes the stream to get stuck in the first place. I'm currently testing a patch in bug 877708 that may do that, so I'm going to remove the 'checkin-needed' keyword for now. I'll update both bugs once I have a couple of runs-through to examine.

Flags: needinfo?(mhabicher)

Keywords: checkin-needed

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 100

•

11 years ago

With bug 877708 attachment 757784 [details] [diff] [review] applies (and not attachment 756494 [details] [diff] [review] in this bug), the TCP stream(s) no longer stall. Instead, Wireshark shows periodic TCP window collapses which linger for 1~3s and are then cleared and streaming resumes. (The video is frozen, however--something that sounds like it is resolved in another bug. Audio was stalled too, despite the network traffic; but when I moved the DUT to temporarily increase WiFi RSSI, the audio resumed with frozen display.)

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 101

•

11 years ago

Further to comment 100, that would probably be bug 879297.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 102

•

11 years ago

(In reply to Diego Wilson [:diego] from comment #98) > > Also, perhaps we can spin off another bug for continuing the media cache > debugging discussion? In light of comment 100, I don't think a new bug is needed; I need to conduct more testing, but if the TCP stream stall is fixed, I don't think we need to revisit the MediaCache.

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 103

•

11 years ago

jduell: I've put bug 877708 attachment 757784 [details] [diff] [review] through its paces today, and over several hours of testing, the TCP stream never gets stuck. I think you've done it. :)

Jason Duell

Comment 104

•

11 years ago

Yay. That's good news! I guess I'll dupe that bug over to this one (so I don't have to ask for leo+ on it). mikeh: Re: TCP window collapses: IIRC this is normal--when the media code suspends a channel, we let its TCP buffer fill up (and the TCP recv window becomes 0). When we resume we read from the TCP buffer and the window gets bigger again. If that's all you're seeing there's no problem here (but I seem to remember you saying we were sending hard RSTs to server when we talked yesterday? That sounded like a new, different bug).

Jason Duell

Comment 106

•

11 years ago

Attached patch patch for mozilla central (deleted) — Details — Splinter Review

This is patch for m-c from bug 870564, already approved by jdm (and nits fixed).

Attachment #758276 - Flags: review+

Jason Duell

Comment 107

•

11 years ago

Attached patch less risky patch for b2g18 branch (deleted) — Details — Splinter Review

This is patch for mb2g18 from bug 870564, already +r'd by jdm (and nits fixed). [Approval Request Comment] Bug caused by (feature/regressing bug #): been in tree for long time. User impact if declined: youtube on b2g will freeze on long videos Testing completed: yes see comment 103. Risk to taking this patch (and alternatives if risky): low. String or UUID changes made by this patch: none

Attachment #758280 - Flags: review+

Attachment #758280 - Flags: approval-mozilla-b2g18?

Jason Duell

Comment 108

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/850b4e5dc45e

Jason Duell

Comment 109

•

11 years ago

Comment on attachment 756494 [details] [diff] [review] timer patch Marking roc's patch obsolete, as it was a bit of a hack and doesn't seem to be needed. Correct me if I'm wrong about that.

Attachment #756494 - Attachment is obsolete: true

Jason Duell

Comment 110

•

11 years ago

typo in comment 106 && 107--I moved those patches here from bug 877708.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 111

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #109) > Marking roc's patch obsolete, as it was a bit of a hack and doesn't seem to > be needed. Correct me if I'm wrong about that. Agreed.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 112

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #104) > mikeh: Re: TCP window collapses: IIRC this is normal--when the media code > suspends a channel, we let its TCP buffer fill up (and the TCP recv window > becomes 0). When we resume we read from the TCP buffer and the window gets > bigger again. This is expected. This is the result of the media cache doing flow control.

Jason Smith [:jsmith]

Updated

•

11 years ago

Whiteboard: c=video [TD-26549], MiniWW → c=video [TD-26549], MiniWW, [YouTubeCertBlocker]

Jason Smith [:jsmith]

Updated

•

11 years ago

Whiteboard: c=video [TD-26549], MiniWW, [YouTubeCertBlocker] → c=video [TD-26549], MiniWW, [YouTubeCertBlocker+]

Patrick McManus [:mcmanus]

Comment 113

•

11 years ago

jason this is awesome.

leo.bugzilla.gecko

Comment 114

•

11 years ago

For consistency of symptom what I see, please try it via 3G connection. If you cannot use 3G connection, try it via other device's tethering. I cannot get any progress even though I applied many patches mentioned in Bug 877461, Bug 878343 and here.

Ed Morley [:emorley]

Comment 115

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/850b4e5dc45e

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Comment 116

•

11 years ago

Comment on attachment 758280 [details] [diff] [review] less risky patch for b2g18 branch Has leo+, doesn't need additional approval.

Attachment #758280 - Flags: approval-mozilla-b2g18?

Ryan VanderMeulen [:RyanVM]

Comment 117

•

11 years ago

WTF happened to the header on this patch? https://hg.mozilla.org/releases/mozilla-b2g18/rev/176fe1ce4cd2

status-b2g18: affected → fixed

status-firefox22: --- → wontfix

status-firefox23: --- → wontfix

status-firefox24: --- → fixed

Mike Habicher [:mikeh] (high bugzilla latency)

Assignee

Comment 118

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #104) > > mikeh: Re: TCP window collapses: IIRC this is normal--when the media code > suspends a channel, we let its TCP buffer fill up (and the TCP recv window > becomes 0). When we resume we read from the TCP buffer and the window gets > bigger again. If that's all you're seeing there's no problem here (but I > seem to remember you saying we were sending hard RSTs to server when we > talked yesterday? That sounded like a new, different bug). jduell, I still see the RSTs from us to the server. Basically it looks like: 1. regular TCP streaming 2. TCP receive window goes to zero, then: 3. http > us TCP keepalive 4. us > http TCP ACK, Rwin=0 5. us > http TCP RST 6. open new stream The only time this doesn't happen is on the last chunk, where we close it off gracefully with FIN+ACKs.

Sotaro Ikeda [:sotaro]

Comment 119

•

11 years ago

(In reply to leo.bugzilla.gecko from comment #114) > For consistency of symptom what I see, please try it via 3G connection. > If you cannot use 3G connection, try it via other device's tethering. > I cannot get any progress even though I applied many patches mentioned in > Bug 877461, Bug 878343 and here. FYI, for confirmation a patch in Bug 879529 is also necessary.

Patrick McManus [:mcmanus]

Comment 120

•

11 years ago

> > jduell, I still see the RSTs from us to the server. Basically it looks like: > 1. regular TCP streaming > 2. TCP receive window goes to zero, then: > 3. http > us TCP keepalive > 4. us > http TCP ACK, Rwin=0 > 5. us > http TCP RST > 6. open new stream > > The only time this doesn't happen is on the last chunk, where we close it > off gracefully with FIN+ACKs. The RST gets sent because we closed the socket without reading data that was queued in the kernel before doing that. If a stream was paused and then not resumed before being closed that is what I would expect would happen. (it happens on lots of other close scenarios too) Its not necessarily an indication of a problem unless its associated with some other issue.

Jason Smith [:jsmith]

Updated

•

11 years ago

Keywords: verifyme

QA Contact: jsmith

Jason Duell

Comment 121

•

11 years ago

> WTF happened to the header on this patch? Not sure--I didn't tweak it manually, so I'll blame hg.

Ryan VanderMeulen [:RyanVM]

Comment 122

•

11 years ago

https://hg.mozilla.org/releases/mozilla-b2g18_v1_1_0_hd/rev/176fe1ce4cd2

status-b2g-v1.1hd: affected → fixed

Diego Wilson [:diego]

Reporter

Comment 123

•

11 years ago

(In reply to Jason Duell (:jduell) from comment #107) > Created attachment 758280 [details] [diff] [review] > less risky patch for b2g18 branch > > This is patch for mb2g18 from bug 870564, already +r'd by jdm (and nits > fixed). > > [Approval Request Comment] > Bug caused by (feature/regressing bug #): been in tree for long time. > User impact if declined: youtube on b2g will freeze on long videos > Testing completed: yes see comment 103. > Risk to taking this patch (and alternatives if risky): low. > String or UUID changes made by this patch: none This patch worked for me. Thanks Jason!

Jason Smith [:jsmith]

Updated

•

11 years ago

Flags: in-moztrap?

Cody Roesch [:croesch]

Updated

•

11 years ago

URL: https://moztrap.mozilla.org/manage/ca...

Flags: in-moztrap? → in-moztrap+

Jason Smith [:jsmith]

Comment 124

•

11 years ago

Played fine with some long videos, although i did some of the behavior described in bug 880601.

status-b2g18: fixed → verified

Keywords: verifyme

Jason Duell

Updated

•

11 years ago

Attachment #756747 - Flags: feedback?(jduell.mcbugs)

stack track when video freeze happend 12 years ago Sotaro Ikeda [:sotaro] (deleted), text/plain		Details
gdb bt while media is frozen (UI is still responsive) 12 years ago Mike Habicher [:mikeh] (high bugzilla latency) (deleted), text/plain		Details
gdb bt while media is frozen (UI is still responsive), 2013-05-23_16-05 12 years ago Mike Habicher [:mikeh] (high bugzilla latency) (deleted), text/plain		Details
WIP - logging to try to figure out where pds=0xfffffffc comes from 12 years ago Mike Habicher [:mikeh] (high bugzilla latency) (deleted), patch		Details \| Diff \| Splinter Review
logcat output with NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5 11 years ago Mike Habicher [:mikeh] (high bugzilla latency) (deleted), application/x-zip-compressed		Details
timer patch 11 years ago Robert O'Callahan (:roc) (email my personal email if necessary) (deleted), patch	cpearce : review+	Details \| Diff \| Splinter Review
logcat with NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5,nsHttp:5,nsSocketTransport:5 11 years ago Mike Habicher [:mikeh] (high bugzilla latency) (deleted), application/x-zip-compressed		Details
patch for mozilla central 11 years ago Jason Duell (deleted), patch	jduell.mcbugs : review+	Details \| Diff \| Splinter Review
less risky patch for b2g18 branch 11 years ago Jason Duell (deleted), patch	jduell.mcbugs : review+	Details \| Diff \| Splinter Review