Open
Bug 1441563
(gUM_timeout)
Opened 7 years ago
Updated 2 years ago
getUserMedia timeouts causing lots of intermittent failures (mostly on linux, but sometimes on windows)
Categories
(Core :: WebRTC: Audio/Video, defect, P2)
Tracking
()
NEW
People
(Reporter: bwc, Unassigned, NeedInfo)
References
(Blocks 2 open bugs)
Details
Attachments
(9 files)
Many of our intermittent failures are caused by getUserMedia never resolving, but because the failures are spread out it is hard to see how big this problem is.
Reporter | ||
Comment 1•7 years ago
|
||
This is causing (at least some of) the failures in bug 1059240, 1440169, 1432025, 1210206, 1292976, 1300686, 1293910, 1424141, 1296075, 1297611, 1300644, 1309419, 1440669, 1441179, 1227386, 1127828, and 1436171,
Might be related to bug 1432786, 1439030, 1439491, and 1441512 (NotFoundError in getUserMedia).
Alias: gUM_timeout
Blocks: 1059240, 1440169, 1432025, 1210206, 1292976, 1300686, 1293910, 1424141, 1296075, 1297611, 1300644, 1309419, 1440669, 1441179, 1227386, 1127828, 1436171
See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1432786,
https://bugzilla.mozilla.org/show_bug.cgi?id=1439030,
https://bugzilla.mozilla.org/show_bug.cgi?id=1439491,
https://bugzilla.mozilla.org/show_bug.cgi?id=1441512
Summary: getUserMedia timeouts causing lots of intermittent failures → getUserMedia timeouts causing lots of intermittent failures (mostly on linux, but sometimes on windows)
Reporter | ||
Updated•7 years ago
|
Flags: needinfo?(jib)
Comment 2•7 years ago
|
||
This bug blocks 1059240 which has reached 40 failures in the past week.
:jib any updates here?
Updated•7 years ago
|
Rank: 19
Priority: -- → P2
Comment 3•7 years ago
|
||
A lot of the dependent bug reports are years old, but I agree there appears to be a recent pattern. I see two kinds:
In bug 1432025 comment 6: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=163285091&lineNumber=79792
[task 2018-02-20T20:58:48.224Z] 20:58:48 INFO - Call getUserMedia for {"audio":{"autoGainControl":false,"echoCancellation":false,"noiseSuppression":false}}
[task 2018-02-20T20:58:48.225Z] 20:58:48 INFO - Buffered messages finished
[task 2018-02-20T20:58:48.226Z] 20:58:48 ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_transceivers.html | application timed out after 370 seconds with no output
[task 2018-02-20T20:58:48.227Z] 20:58:48 ERROR - Force-terminating active process(es).
and bug 1432786 comment 5: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=163090755&lineNumber=47758
[task 2018-02-19T21:34:25.579Z] 21:34:25 INFO - Call getUserMedia for {"audio":{"autoGainControl":false,"echoCancellation":false,"noiseSuppression":false}}
[task 2018-02-19T21:34:25.582Z] 21:34:25 INFO - Buffered messages finished
[task 2018-02-19T21:34:25.590Z] 21:34:25 INFO - TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_peerConnection_transceivers.html | Error executing test: NotFoundError: The object can not be found here.
[task 2018-02-19T21:34:25.594Z] 21:34:25 INFO - runTestWhenReady/<@dom/media/tests/mochitest/head.js:454:7
[task 2018-02-19T21:34:25.595Z] 21:34:25 INFO - Async*runTestWhenReady@dom/media/tests/mochitest/head.js:452:10
[task 2018-02-19T21:34:25.596Z] 21:34:25 INFO - async*@dom/media/tests/mochitest/test_peerConnection_transceivers.html:2197:3
Alex, can you take a look? cc Andreas and Bryce as well for help.
Assignee: nobody → achronop
Flags: needinfo?(jib)
Comment 4•7 years ago
|
||
This is the big issue we introduced with loopback devices. Unfortunately, there is no easy explanation. I have to create some time to engage PulseAudio guys on this. I will do it right after the push I am currently working on.
There are 2 possible workarounds. The first and most reasonable is not to use Loopback device if you do not need it. For example if the test does not make use of the AudioAnalyzer to verify the receiving tone you can disable it (by setting on the top of `runTest()` method `DISABLE_LOOPBACK_TONE = true). The second and less preferable workaround is to make use of virtual devices.
Comment 5•7 years ago
|
||
Please try to mitigate this as soon as possible, this is getting really annoying and burns quite a lot of time of a lot of people.
Comment 6•7 years ago
|
||
Following command repro the error reliably:
taskset -c 0 ./mach mochitest --use-test-media-devices dom/media/tests/mochitest/test_peerConnection_replaceTrack.html
Comment 7•7 years ago
|
||
Command to repro with a TSAN build:
./mach mochitest --use-test-media-devices dom/media/tests/mochitest/test_peerConnection_replaceTrack.html |& tee ../tsan_log.txt
Comment 8•7 years ago
|
||
This is part of the TSAN log uploaded in previous post
Comment 9•7 years ago
|
||
When I replace the input device test is becoming green. Loopback device setup makes use of "Monitor of Null Output" device in order to capture whatever is going to the output. If change the input device back to the "Sine source at 440 Hz" the test runs successfully. I have already provided logs to the PulseAudio people and I am waiting for their response. I will discuss this extra finding with them.
Comment 10•7 years ago
|
||
Profiling PulseAudio server on failed case:
Samples: 11K of event 'cycles:uppp', Event count (approx.): 9974038159
Overhead Command Shared Object Symbol
31.21% alsa-sink-CS420 [unknown] [.] 0xffffffff9ba0013c
19.25% pulseaudio [unknown] [.] 0xffffffff9ba0013c
4.54% null-sink [unknown] [.] 0xffffffff9ba0013c
1.44% pulseaudio libpulse.so.0.20.2 [.] pa_mainloop_dispatch
1.23% alsa-source-CS4 [unknown] [.] 0xffffffff9ba0013c
1.01% alsa-sink-CS420 libpulsecore-11.1.so [.] _init
0.91% alsa-sink-CS420 libpulsecommon-11.1.so [.] _init
0.81% alsa-sink-CS420 libalsa-util.so [.] _init
0.71% alsa-sink-CS420 libpulsecore-11.1.so [.] pa_rtpoll_run
0.71% alsa-sink-CS420 libalsa-util.so [.] thread_func
0.55% alsa-sink-CS420 libpulse.so.0.20.2 [.] _init
0.42% pulseaudio libpulse.so.0.20.2 [.] _init
0.41% pulseaudio libpulsecommon-11.1.so [.] _init
Comment 11•7 years ago
|
||
If this is an issue on our side it could be triggered by different behavior for the different input devices. I recall when testing the loopback device patches early on that the null monitor's callbacks are much more frequent than those of the sine source.
That could explain a higher orange factor if we're looking at some sort of race involving the audio callback.
Comment 12•7 years ago
|
||
I can't seem to reproduce the problem locally. Would it be possible to get a backtrace from when the deadlock happens? And is the deadlock in PA or the test (does pactl stat work when the deadlock happens)?
Comment 13•7 years ago
|
||
Thanks for looking at it. I think you do not repro because you have an optimized build. I can repro on a debug build only. I tried this morning an opt build and the test passes. I can explain you in IRC how to build with debug enabled. It's very easy.
I tried the `pactl stat` while the deadlock was happening, that's the results:
$ taskset -c 0 pactl stat
Currently in use: 2041 blocks containing 4.1 MiB bytes total.
Allocated during whole lifetime: 1026601 blocks containing 638.6 MiB bytes total.
Sample cache size: 0 B
I also attached the debugger on PA process:
0x00007f1e4669cb36 in __GI_ppoll (fds=0x558ca613eef0, nfds=36, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
39 return SYSCALL_CANCEL (ppoll, fds, nfds, timeout, sigmask, _NSIG / 8);
warning: File "/home/achronop/myrepos/mozilla/firefox/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /home/achronop/myrepos/mozilla/firefox/.gdbinit
line to your configuration file "/home/achronop/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/home/achronop/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
Missing separate debuginfos, use: dnf debuginfo-install alsa-lib-1.1.4.1-1.fc26.x86_64 libgcrypt-1.8.2-1.fc26.x86_64 nss-softokn-freebl-3.35.0-1.0.fc26.x86_64 pcre-8.41-5.fc26.x86_64 speexdsp-1.2-0.10.rc3.fc26.x86_64
(gdb) bt
#0 0x00007f1e4669cb36 in __GI_ppoll (fds=0x558ca613eef0, nfds=36, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x00007f1e498d2641 in ppoll (__ss=0x0, __timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 pa_mainloop_poll (m=m@entry=0x558ca602eb30) at pulse/mainloop.c:852
#3 0x00007f1e498d2c50 in pa_mainloop_iterate (m=0x558ca602eb30, block=<optimized out>, retval=0x7ffffa011dd8) at pulse/mainloop.c:926
#4 0x00007f1e498d2ce0 in pa_mainloop_run (m=0x558ca602eb30, retval=0x7ffffa011dd8) at pulse/mainloop.c:944
#5 0x0000558ca496133e in main (argc=<optimized out>, argv=<optimized out>) at daemon/main.c:1142
Comment 14•7 years ago
|
||
Comment 15•7 years ago
|
||
Comment 16•7 years ago
|
||
Comment 17•7 years ago
|
||
Arun, following your question about which threads are involved in the scenario I uploaded the bt off all threads for the Firefox and PulseAudio when the deadlock was happening. For PA I continued and re-paused once to check if it was "moving". Feel free to ping me in IRC for more questions.
Comment 18•7 years ago
|
||
Comment 19•7 years ago
|
||
I attached to the audio process (sorry, not sure what the correct terminology is) and dumped the backtrace a couple of times (once when it looked deadlocked, then again close to timeout). I see two audio threads.
One of the threads has a different trace in the second backtrace (more or less waiting in the mainloop), suggesting that it is not where the deadlock is.
Will try to unravel where the common backtrace is in the code (it's in a pa_threaded_mainloop_wait()).
Comment 20•7 years ago
|
||
(In reply to Arun Raghavan from comment #19)
[...]
> Will try to unravel where the common backtrace is in the code (it's in a
> pa_threaded_mainloop_wait()).
I meant pa_threaded_mainloop_lock().
Comment 21•7 years ago
|
||
I'm quite puzzled by what's happening. Thread 13 in the backtraces I pasted seems stuck in a wait that should not happen. My machine seems to be quite slow at doing backtraces, so _maybe_ there's something wrong there. I've asked achronop to get me some backtraces from the deadlock.
From what I can tell, the problem is occurring on the client side. Next steps after I have looked at the backtraces (if nothing is different), is to see if the problem occurs with cubeb remote disabled, and finally with rust disabled if it does.
Since this is a bit of a mystery, my current paranoid suspicions include single-threaded weirdness in the Rust async code (since it only happens with taskset -c 0) or a problem in null-sink (maybe if all else fails, we can try to just write a simple loopback and try to make it do the same thing).
I did quickly look at cubeb-pulse-rs context code and all the locking code looked kosher, as expected.
Comment 22•7 years ago
|
||
The other thing I was thinking of was instrumenting all the lock/unlock wait/signal things in cubeb-pulse-rs (ideally with some way to identify the caller), but my worry is that it might be a _lot_ of output to parse (and might affect behaviour of the test).
Comment 24•7 years ago
|
||
This file contains 4 backtracks from all threads of content process. Remote cubeb is disabled.
Flags: needinfo?(achronop)
Comment 25•7 years ago
|
||
I executed the test using two different pairs of Sink Devices and Monitor Sources. Sink devices was real devices, the built in sink and an HDMI sink. In all cases the error is reproducible. I wonder if the the way PA implements the Monitor Sources uncovers that issue.
Comment 26•7 years ago
|
||
This could part of the symptoms we see in fuzzing (quite a few cubeb processes and a lot of processes locked in I/O).
Comment 27•7 years ago
|
||
Are these backtraces from when the process is hung?
Comment 28•7 years ago
|
||
Yes, I paused the debugger 4 times, every 15 secs, started at the point that test was not progressing further.
Comment 29•3 years ago
|
||
The bug assignee didn't login in Bugzilla in the last 7 months.
:jib, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee: achronop → nobody
Flags: needinfo?(jib)
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•