Closed Bug 1255737 Opened 9 years ago Closed 8 years ago

Async shutdown crash with "MediaStreamGraph shutdown: blocking on msg thread"

Categories

(Core :: Audio/Video: MediaStreamGraph, defect, P1)

44 Branch
Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox45 --- wontfix
firefox46 + wontfix
firefox47 + fixed
firefox48 + fixed
firefox49 --- wontfix
firefox50 --- verified
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: mak, Assigned: jesup)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(9 files)

(deleted), patch
padenot
: review+
Details | Diff | Splinter Review
(deleted), image/jpeg
Details
(deleted), patch
jesup
: review+
Details | Diff | Splinter Review
(deleted), patch
jib
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
pehrsons
: review+
Details | Diff | Splinter Review
(deleted), patch
pehrsons
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
padenot
: review+
Details | Diff | Splinter Review
We are crashing on shutdown for some users due to the async shutdown timeout abort.
That indicates this code can block shutdown for a minute.

A couple example crashes from Nightly:
https://crash-stats.mozilla.com/report/index/2a9ac8f3-5ec8-4266-8923-85e9a2160311
https://crash-stats.mozilla.com/report/index/bea387c7-463c-44c9-b11b-cabdf2160310

I've seen this crash in all the recent versions while investigating bug 1250424.
This is the second most common cause of shutdown aborts (20% of shutdown aborts).

It would be worth fixing this and uplifting a fix, crashing on shutdown after blocking firefox for one minute is not a pleasant experience for our users.
Jan-Ivar, this is in you new shutdown code, do you know what would cause this?
Flags: needinfo?(jib)
Severity: normal → critical
(In reply to Marco Bonardo [::mak] from comment #0)
> We are crashing on shutdown for some users due to the async shutdown timeout
> abort. That indicates this code can block shutdown for a minute.
> 
> A couple example crashes from Nightly:
> https://crash-stats.mozilla.com/report/index/2a9ac8f3-5ec8-4266-8923-85e9a2160311
> https://crash-stats.mozilla.com/report/index/bea387c7-463c-44c9-b11b-cabdf2160310

Marco, I see a !mainthread-assert in XPConnect() in both of these. Can you help tie this to Media shutdown for me?

(In reply to Paul Adenot (:padenot) from comment #1)
> Jan-Ivar, this is in you new shutdown code, do you know what would cause
> this?

(I'm looking for clarification on the source of the specific crash above, but) our change to block on media-shutdown on shutdown, will naturally change the look of media-shutdown symptoms, looking at a minute delay and a MOZ_CRASH, whereas before shutdown would resume in spite of media trouble, causing, I don't know, crashes, leaks, asserts and/or UAFs.
Flags: needinfo?(jib) → needinfo?(mak77)
> Marco, I see a !mainthread-assert in XPConnect() in both of these. Can you help tie this to Media shutdown for me?

Look in Metadata > AsyncShutdownTimeout.
Flags: needinfo?(mak77)
To clarify, this is a crash forced by async shutdown when one of the shutdown blockers takes too much time to resolve. All these crashes have a similar stack but different reasons, that is expressed in metadata, as Yoric said.
In this case we forced a crash cause profile-before-change took more than 1 minute to complete, and the currently running blocker when the timeout happened was the mediagraph shutdown blocker.
Thanks, got it. This means that one or more MediaStreamGraphShutDownRunnable failed to run.
http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaStreamGraph.cpp?rev=c0ec7ef7bde9&mark=1517-1517#1490
All crashes for this I see in crash-stats are Windows-only
OS: Unspecified → Windows
Looked at one of the stacks (all threads).  Thread 42 is sitting in a wait with winmm_buffer_thread() on the stack.  Matches that MSG seems to still be up.

https://crash-stats.mozilla.com/signature/?async_shutdown_timeout=~MediaStreamGraph&signature=mozalloc_abort+|+NS_DebugBreak+|+nsDebugImpl%3A%3AAbort+|+NS_InvokeByIndex&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports

That should just get MSG failures.  I see two linux reports, so may not be windows-only.
Paul - can you do some initial investigation?  We'll look to find an owner for it next week.  This is probably more important than the intermittent webaudio leaks, at least to investigate.
Rank: 15
Flags: needinfo?(padenot)
OS: Windows → All
Priority: -- → P1
Depends on: 1256510
I have a testcase in bug 1256510 that trips a similar shutdown abort. Filed separately because I don't know whether it's the same underlying bug that is causing frequent shutdown aborts for Nightly users.
I have just landed bug 1256510's patch on inbound; padenot and I think there's a good chance this is the root cause of the shutdown failures reported here - though without STR it's hard to be sure.  Watching crash-stats for this after it merges to central (and likely uplifts to aurora/beta) should tell us if this is some, none, or all of the cause for the reports here.
Assignee: nobody → rjesup
Flags: needinfo?(padenot)
Blocks: 1239873
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]
Keywords: crash, regression
Version: unspecified → 46 Branch
Version: 46 Branch → 44 Branch
This is the #3 topcrash in beta 2, with 3.15% of the total crashes.   That's high enough that I would call it a release blocker, even though it's a shutdown hang. 
The patch from bug 1256510 may fix this.  It will ship with beta 4 tomorrow.
Reviewing the crash data, In all of February I only see maybe 100 events, which is hard to reason about. 46b1/2 has clearly dramatically increased usage (and maybe shifted where that usage goes).  

This AsyncShutdown blocking was landed to try to avoid a class of crashes and other possible issues; we had considerable evidence from various bugs and crashes that MSG wasn't shutting down when it was supposed to, especially as it's inherently a multiple-thread-operation.

It appears to have stumbled over an existing problem where some graphs (likely WebAudio graphs, but we don't know yet) will not shut down (or like Jesse's, never started and thus aren't shutting down).

If the fix for Jesse's bug in bug 1256510 doesn't cause an obvious reduction in reports in the couple of days following b4, we should consider backing this out of 46 while we try to determine what we tripped over.

I'm going to attach a patch in case we need to back out of 46/beta because the fix for bug 1256510 doesn't get the crash rate low enough.
Attached patch Backout f31612f8f3bb () from beta 46 (deleted) β€” β€” Splinter Review
Beta patch if needed
Attachment #8733203 - Flags: review?(padenot)
Attachment #8733203 - Flags: review?(padenot) → review+
r+ in case we need to land this.
Early results for beta 4 are showing this crash signature is still the #2 top crash, though there is not really enough data to judge yet (219/3630 total crashes)
https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=46.0b4
In 46.0b2, it was 795 of 2849, so roughly the same % of the AsyncShutdown crashes
Comment on attachment 8733203 [details] [diff] [review]
Backout f31612f8f3bb () from beta 46

Approval Request Comment
[Feature/regressing bug #]: Bug 1239873

[User impact if declined]: Shutdown hangs which eventually convert to crashes; rate is high-ish - 15th or 16th Beta crasher, with around 0.75% of all crashes.

[Describe test coverage new/current, TreeHerder]: We're not hitting htis in the automated tests.

[Risks and why]: Low to medium risk; restores MSG to the shutdown behavior it's had all along (which is why it's low-ish).  Risk is that this may be blocking some more serious bugs from being hit.  Only requesting backout from beta so we can continue to work on resolving the issue.

[String/UUID change made/needed]: none
Attachment #8733203 - Flags: approval-mozilla-beta?
[Tracking Requested - why for this release]:
Moving back to 46=? due to the analysis above (15th-ish on crash list, not 3rd).  If we take the backout (which we can do), then we should mark 46 as unaffected or fixed, or wontfix if we decide to live with it without the backout.
This signature, https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozalloc_abort+|+NS_DebugBreak+|+nsDebugImpl%3A%3AAbort+|+NS_InvokeByIndex&version=46.0b5&date=%3C2016-03-28T17%3A02%3A27&date=%3E%3D2016-03-21T17%3A02%3A27   is #4 on the top crash list for beta 5, 2.49% of all crashes reported in beta 5.  I'm looking by specific beta version to see what changes rather than looking at all of the betas at once.   

Are you saying that if we back this out (uplifting your patch here) that it doesn't change much from users' point of view, and they still hit the issue, but we don't see it as clearly in aggregate?
Flags: needinfo?(rjesup)
Comment on attachment 8733203 [details] [diff] [review]
Backout f31612f8f3bb () from beta 46

Backout to mitigate a topcrash, let's try to land this for beta 6 today.
Attachment #8733203 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Adding back the blocking flag.  I don't have the feeling this is resolved and I'd like to understand the underlying issues better before we decide that. 

Please don't change the blocking flag without talking to a release manager; we may lose track of our worst issues if you do that. While the crash rate may be an imperfect way to judge, and I need your expertise to understand what's going on, it's what I've got to work with right now.
Sorry, my point was to put it on your radar to re-evaluate given the info above.

The search you're using is picking up many AsyncShutdown timeout/crashes, not just this bug.  You need to add "async_shutdown_timeout=~MediaStreamGraph" to your searches.  the automated list is lumping them because they're all AsyncShutdown crashes.

Per comment 17, MediaStreamGraph is the source of maybe 30% of the AsyncShutdown timeouts, which would put it around #15 or so on the list.  Still too high, and we need to do something about it (and are).  I'm on PTO today, as is padenot who is the other person who can attack this.

Note that backing this out will only cut your crash rate by maybe 30% for that signature; it will still be in the top 10.  Try adding the async_shutdown_timeout field to the displayed columns, and look at what the other sources are (Places, among others).
Flags: needinfo?(rjesup)
Landed backout on Beta: https://hg.mozilla.org/releases/mozilla-beta/rev/7e37bc747439
A query for crashes on nightly since January 1 with this problem:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&date=%3E%3D2016-01-01&abort_message=mediastreamgraph.cpp&signature=mozalloc_abort+%7C+NS_DebugBreak+%7C+nsDebugImpl%3A%3AAbort+%7C+XPTC__InvokebyIndex
(using a query on abort message).
Happening on Linux nightly too - https://crash-stats.mozilla.com/report/index/5b2c2cc5-cace-4e54-a8b1-2c3492160425
Triggered by enabling WebAudio on music.yandex.ru (music streaming service).
OK, seems that this has shown up under a few different signatures.  Here are all the crashes on nightly this year whose abort message contains mediastreamgraph.cpp:
https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=nightly&abort_message=mediastreamgraph.cpp&date=%3E%3D2016-01-01&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform

What's going to happen to this bug for 47?
Flags: needinfo?(rjesup)
Paul, We may finally have a reproducible test case for this crash by enabling Web Audio on music.yandex.ru.  (See Comment 27.)  Can you take the lead on investigating this new info and let me and jesup know what you find?  

David -- If we do indeed have a reproducible test case, I hope to know what's going on and have a real fix in Nightly early in the cycle.  We'll then either uplift the fix if it's safe to do so or back out like we did for Fx46 until the real fix is in the release.  (I'm also bumping the priority.)
Rank: 15 → 10
Flags: needinfo?(padenot)
I tried to repro on music.yandex.ru on Linux Nightly, but I'm in France, and the website only allows playing music from Russia it seems. Also, would you mind explaining me how to enable the Web Audio playback ? I tried clicking around, but I didn't see anything obvious.
Flags: needinfo?(padenot) → needinfo?(phazon)
We may have a possible fix landed in nightly in bug 1267600.  let's watch the stats for a short while
Flags: needinfo?(rjesup)
Attached image Yandex WebAudio HowTo (deleted) β€”
(In reply to Paul Adenot (:padenot) from comment #30)
> I tried to repro on music.yandex.ru on Linux Nightly, but I'm in France, and
> the website only allows playing music from Russia it seems. Also, would you
> mind explaining me how to enable the Web Audio playback ? I tried clicking
> around, but I didn't see anything obvious.

Ah, yes, it seems that service only allow russian users to play music (obvious copyright issues).
As for how to turn WebAudion on - i attached screenshot from music.yandex.ru
there is equalizer icon above volume control if you click on it (turns to yellow) then equalizer is open where you must set switch to "Π’ΠΊΠ»" and choose preset to use. All that steps actually turns this one (can be seen with WebAudio panel in DevTools)
Another sign - presence of CubebUtils : AudioCallDriver stream in "pavucontrol" linux app, its not present if WebAudio is not turned on.
Flags: needinfo?(phazon)
Flags: needinfo?(padenot)
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ mozalloc_abort | xul.dll@0x77f718 ]
From the crash stats - https://goo.gl/RMFyGl, you can see the signature, there are still more than 8000 crashes for 46. This patch may not work. Change the status-firefox46 to affected. What do you think the next step?
Flags: needinfo?(rjesup)
That search doesn't work the way you think; if you restrict it to 46 release crashes you find none.  You can find 1000's of 46.0bX crashes still happening because people run beta and don't update often/ever.

The Summary pages are misleading; they show 1000's of 46.0.1 crashes, but when you look for those (with the mediastreamgraph restrictions) you find none in the actual reports.

All that said... we still haven't found a way to reproduce this such that we can find the problem causing it.  We'll need to back out of beta again, and will sit down with padenot next week and figure out what can get us the data to resolve this.
Flags: needinfo?(rjesup)
Randal in Gerry's search if you expand the "product" tab it breaks down by version. And, on the topcrash list for 46.0.1 this signature is the #6 top crash. And when I look with supersearch limiting to 46.0.1 and the buildid we released I still see over 8000 crashes.     

https://crash-stats.mozilla.com/search/?product=Firefox&signature=%5B%40+mozalloc_abort+%7C+NS_DebugBreak+%7C+nsDebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex+%5D&version=46.0.1&build_id=20160502172042&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

I want to make sure we're judging crash data correctly on the release channel. How are you limiting your search?
Sorry I just typed your name wrong, argh.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #36)
> Randal in Gerry's search if you expand the "product" tab it breaks down by
> version. And, on the topcrash list for 46.0.1 this signature is the #6 top
> crash. And when I look with supersearch limiting to 46.0.1 and the buildid
> we released I still see over 8000 crashes.     

Right -- but search with a version restriction.  For some reason the Summary page is showing crashes not included in the list of reports for the search.

> 
> https://crash-stats.mozilla.com/search/
> ?product=Firefox&signature=%5B%40+mozalloc_abort+%7C+NS_DebugBreak+%7C+nsDebu
> gImpl%3A%3AAbort+%7C+NS_InvokeByIndex+%5D&version=46.0.
> 1&build_id=20160502172042&_facets=signature&_columns=date&_columns=signature&
> _columns=product&_columns=version&_columns=build_id&_columns=platform#facet-
> signature

This shows 0 bugs for me.

Gerry's search (https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20NS_InvokeByIndex#tab-sigsummary) grabs many things that aren't MediaStreamGraph.

Adding a term for MediaStream in the async shutdown field still shows 8000+ for 46.0.1 in Product -- but clicking Reports shows 622 (and eyeballing them there are no 46 release hits).

Adding version has terms 46.0.1 gives 0 Reports - but still shows 8000 hits in Product under Summary.

> 
> I want to make sure we're judging crash data correctly on the release
> channel. How are you limiting your search?
Flags: needinfo?(rjesup)
Peter, do you know if there's an existing bug for the problem in crash-stats described in comment 38?
Flags: needinfo?(peterbe)
When I click on the link in comment 36 I get 0 results now, but the signature has some extra symbols in it (a bunch of plus signs).  When I reconstruct the same search from scratch again (still limiting by version 46.0.1 )  I still get 8000+ crashes in the results for that one signature ([@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]) . Odd behavior from crash-stats. 

So, even if this isn't a MediaStreamGraph issue we still have a bunch of crashes on release, maybe best followed up in a different bug.
Actually, wontfix for 46. We don't have a fix and it sounds like this specific MediaStreamGraph issue may not even affect 46.  Would you say this is still affecting 47 and other versions?
yes, the issue is still around in 47.0b-49.0a1. it helps when you just search for "MediaStreamGraph" in the async_shutdown_timeout field and not any particular signature. for example regarding the current beta:
https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0b&async_shutdown_timeout=~MediaStreamGraph&_facets=signature&_facets=user_comments&_facets=version&_facets=platform_pretty_version&_facets=async_shutdown_timeout&_columns=date&_columns=signature&_columns=version&_columns=platform#facet-async_shutdown_timeout
Yes, as I indicated we need to do a backout for 47beta, and try to find a way to attack it. If you look at the async shutdown field (add it to your display of reports!) you can see what the real 46.0.1 shutdown hangs are coming from
Liz, Randell, 

We definitely have a bug in SuperSearch. As I see it right now, the bug is that any link, with a signature, with spaces in it, makes it impossible to load it a second time. It's because the whitespace gets "doubly-encoded" to plus characters which gets encoded a second time. So all links to SuperSearch searches with pluses in the URL, when clicked, always yields 0 results. 

We have a fix in staging already but because of a recent snag in our AWS setup we can't deploy our fix to production.

The other thing, the report/list page is deprecated and will completely disappear in a couple of weeks/months. 

Lastly, I saw one link where the signature was `[@ ...]`. That strange prefix and suffix is a relic of the past and predates me so I can't explain why it's presented like that.
Flags: needinfo?(peterbe)
(In reply to Randell Jesup [:jesup] from comment #43)
> Yes, as I indicated we need to do a backout for 47beta, and try to find a
> way to attack it. If you look at the async shutdown field (add it to your
> display of reports!) you can see what the real 46.0.1 shutdown hangs are
> coming from

Hi Jesup, which specific patch(es) did you have in mind when suggesting a backout? Fx47 is 2 weeks away from going live so we need to act on this quickly. Please let me know how I can help.
Flags: needinfo?(rjesup)
Attached patch Backout f31612f8f3bb () from beta 47 (deleted) β€” β€” Splinter Review
MozReview-Commit-ID: J39kgU76VwL
Attachment #8733203 - Attachment is obsolete: true
Attachment #8733203 - Attachment description: Backout f31612f8f3bb () from beta → Backout f31612f8f3bb () from beta 46
Attachment #8733203 - Attachment is obsolete: false
Flags: needinfo?(rjesup)
Comment on attachment 8755521 [details] [diff] [review]
Backout f31612f8f3bb () from beta 47

Approval Request Comment
[Feature/regressing bug #]: 1239873

[User impact if declined]: Shutdown hang/crashes

[Describe test coverage new/current, TreeHerder]: N/A

[Risks and why]: Still haven't found cause of in-the-field shutdown hangs

[String/UUID change made/needed]: none
Attachment #8755521 - Flags: review+
Attachment #8755521 - Flags: approval-mozilla-beta?
Comment on attachment 8755521 [details] [diff] [review]
Backout f31612f8f3bb () from beta 47

Beta patch backout, let's do it!
Attachment #8755521 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
KWierso, fyi, not sure if our queries will catch this one due to target milestone being 44 and this being a patch backout. Thanks!
Flags: needinfo?(wkocher)
https://hg.mozilla.org/releases/mozilla-beta/rev/c126405491b0
I guess that means status-firefox47 for this goes to "fixed"?
Flags: needinfo?(wkocher)
Something rather curious I just noticed in crash-stats:  all of the crashes seem to occur on 64-bit Windows Firefox, and none on 32-bit Windows Firefox or on any other platforms.  That's rather unusual.

https://crash-stats.mozilla.com/signature/?product=Firefox&date=%3E%3D2016-04-01&abort_message=mediastreamgraph.cpp&signature=mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20XPTC__InvokebyIndex#aggregations
(and then add aggregations for "cpu arch")
crashes on 32bit are showing up under the [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex] signature.
We've made progress on this (two patches, fixing a lot of intermittents and
probably shutdown crashes in the wild as well), but we're not quite ready yet,
best to back it out again.
Attachment #8763608 - Flags: review?(jib)
Assignee: rjesup → padenot
Status: NEW → ASSIGNED
Attachment #8763608 - Flags: review?(jib) → review+
Comment on attachment 8763608 [details] [diff] [review]
Backout f31612f8f3bb (Bug 1239873) from beta 48

Approval Request Comment
[Feature/regressing bug #]: 1239873

[User impact if declined]: Shutdown hang/crashes

[Describe test coverage new/current, TreeHerder]: N/A

[Risks and why]: Still haven't found cause of in-the-field shutdown hangs (although we've fixed two bugs that reduced the number of intermittents on treeherder significantly)

[String/UUID change made/needed]: none
Flags: needinfo?(padenot)
Attachment #8763608 - Flags: approval-mozilla-beta?
Comment on attachment 8763608 [details] [diff] [review]
Backout f31612f8f3bb (Bug 1239873) from beta 48

This patch may fix the shutdown hang/crashes, take it in 48 beta 2.
Attachment #8763608 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/cf6b8e808455
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ mozalloc_abort | xul.dll@0x77f718 ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ mozalloc_abort | xul.dll@0x77f718 ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::A…
Maire, instead of backing it out one beta by one beta, is there an alternative?  And if not, we should probably be backing it out of 49 beta just about now...
Flags: needinfo?(mreavy)
Yep, I was talking with Jesup about this exact issue on Friday.  We are looking at alternatives, but we don't have any yet.  We will take action by end of this week for 49.
Flags: needinfo?(mreavy)
Padenot is on vacation until Sept.
Assignee: padenot → rjesup
MozReview-Commit-ID: 3xn3S1ZlEc1
Comment on attachment 8780660 [details] [diff] [review]
Backout f31612f8f3bb (Bug 1239873) from beta 49

Approval Request Comment

Backing out again from beta; we need to resolve this somehow (even if it's yanking the gate) by the next release, or decide we're willing to live with it in release to help protect against shutdown race conditions.
Attachment #8780660 - Attachment description: Backout f31612f8f3bb (Bug 1239873) from beta 48 → Backout f31612f8f3bb (Bug 1239873) from beta 49
Attachment #8780660 - Flags: approval-mozilla-beta?
Comment on attachment 8780660 [details] [diff] [review]
Backout f31612f8f3bb (Bug 1239873) from beta 49

Backout, crash rate too high.
Attachment #8780660 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/ab7b68014a1e
Avoids running into the stalled-shutdown killer if the audio driver/OS is stupid

MozReview-Commit-ID: 58SUg2Xt37C
Attachment #8783598 - Flags: review?(pehrson)
Comment on attachment 8783598 [details] [diff] [review]
don't wait more than 15 seconds for an AudioCallbackDriver to exit

Review of attachment 8783598 [details] [diff] [review]:
-----------------------------------------------------------------

I assume a lot of bells and whistles go off if we introduce a bug in GraphDriver shutdown. At least leak checks will fail, correct?

::: dom/media/MediaStreamGraph.cpp
@@ +1687,5 @@
> +    mTimer = do_CreateInstance(NS_TIMER_CONTRACTID);
> +    if (!mTimer) {
> +      return NS_ERROR_FAILURE;
> +    }
> +    mTimer->InitWithCallback(this, 15*1000, nsITimer::TYPE_ONE_SHOT);

Break the number out and make it const.
Attachment #8783598 - Flags: review?(pehrson) → review+
Comment on attachment 8783598 [details] [diff] [review]
don't wait more than 15 seconds for an AudioCallbackDriver to exit

Approval Request Comment
(for this and the reland of bug 1255737)

[Feature/regressing bug #]: backout of bug 1255737 from beta

[User impact if declined]: many issues in beta, some possible crash/etc risks

[Describe test coverage new/current, TreeHerder]: Running without bug 1255737 was causing near-permafail in automation.  Running with it was causing crashes in the field.

[Risks and why]: Want to both kill shutdown timeout crashes in the field, and avoid all the issues (oranges and other dangerous bugs) of not having this patch.

[String/UUID change made/needed]: none
Attachment #8783598 - Flags: approval-mozilla-beta?
Attachment #8783598 - Flags: approval-mozilla-aurora?
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/418acf685ec2
don't wait more than 15 seconds for an AudioCallbackDriver to exit r=pehrsons
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c612cd6788e0
downgrade NS_ASSERTION to NS_WARNING due to linux32 VMs rs=bustage
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/74668b4ab3ea
increase audio callback shutdown timeout to 45 seconds to reduce oranges/leaks r=kwierso
Backout by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6ee4a4f5bf9
Backed out 3 changesets 74668b4ab3ea c612cd6788e0 418acf685ec2 for shutdown leaks
Attachment #8784364 - Flags: review?(pehrson)
Comment on attachment 8784364 [details] [diff] [review]
don't set a shutdown timer if we don't have a shutdown ticket

Review of attachment 8784364 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/MediaStreamGraph.cpp
@@ +1490,5 @@
>      mGraph->mDriver->Shutdown(); // This will wait until it's shutdown since
>                                   // we'll start tearing down the graph after this
>  
>      // We may be one of several graphs. Drop ticket to eventually unblock shutdown.
> +    if (mTimer && !mGraph->mForceShutdownTicket) {

Why do we need to check mTimer here? Should it be asserted instead?

@@ +1495,2 @@
>        // XXX - this should be NS_ASSERTION, but that kicks out on linux32 test VMs
> +      MOZ_ASSERT(mGraph->mForceShutdownTicket,

`MOZ_ASSERT(false,` per our IRC conversation. Is the comment still accurate?

::: dom/media/MediaStreamGraph.h
@@ +1251,5 @@
>      AUDIO_THREAD_DRIVER,
>      SYSTEM_THREAD_DRIVER,
>      OFFLINE_THREAD_DRIVER
>    };
> +  static const uint32_t AUDIO_CALLBACK_DRIVER_SHUTDOWN_TIMEOUT = 20*1000;

Put this in the previous patch when you've determined how much time we really need.
Attachment #8784364 - Flags: review?(pehrson) → review+
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ebc6b4df664
don't set a shutdown timer if we don't have a shutdown ticket r=pehrsons
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f958f8fcd7a1126c4209346f1a155b185a4b4501
The bc6/bc7 crashes are due to MOZ_LOG in the try, and the mda reds are too much output from MOZ_LOG.
https://hg.mozilla.org/mozilla-central/rev/7ebc6b4df664
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Comment on attachment 8783598 [details] [diff] [review]
don't wait more than 15 seconds for an AudioCallbackDriver to exit

Fix for shutdown crash/hang, please uplift to aurora and beta.
Attachment #8783598 - Flags: approval-mozilla-beta?
Attachment #8783598 - Flags: approval-mozilla-beta+
Attachment #8783598 - Flags: approval-mozilla-aurora?
Attachment #8783598 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-beta/rev/561081fdbcc1
https://hg.mozilla.org/releases/mozilla-beta/rev/198825ee397d
https://hg.mozilla.org/releases/mozilla-aurora/rev/95e4986d9803
the crash pattern is now reappearing on beta starting in 49.0b7 builds: http://bit.ly/2bZXncT
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Crash signatures indicate it isn't hanging waiting for the 20 second timeout on Cubeb shutdown I added.  That means something else is somehow blocking async shutdown from proceeding.  Writing up a document on how shutdown happens, I found one minor hole that might be the root cause.  I'll file a bug on that, and we should re-backout the async shutdown code and this patch from beta,unless we decide that some number of async shutdown timeouts are preferable to the possible shutdown races that occur without this patch.

Ryan, this will mean those oranges will likely recur.

The only alternative would be to use the 20 second timer to stop blocking shutdown on MSG (and if so, block any further attempts to shutdown MSG to avoid races in case one is just slow).
Blocks: 1298698
Comment on attachment 8785760 [details] [diff] [review]
NOT FOR CHECKIN TO 49 -- Backed out changeset b44c72b85a80 198825ee397d 561081fdbcc1 (, Bug 1239873))

Approval Request Comment
[Feature/regressing bug #]: 

[User impact if declined]: Slow shutdowns in small(?) number of cases causing the shutdown timeout to force a crash.  Note: aurora and Nightly are running with this code, so it's safe other than for the shutdown-timeouts.

[Describe test coverage new/current, TreeHerder]: Cause of hung shutdowns unknown; landing this backout will cause oranges in automation.

[Risks and why]: Low risk; we ran most of beta this way.  However, this does expose shutdown races with MSG continuing to shut down past xpcom-shutdown-threads - though in most cases it would just cause leaks in shutdown or maybe some crashes.

If you're willing to live with the crash rate, don't back it out.  Otherwise we'll need to live with the oranges we had after it was backed out earlier.

[String/UUID change made/needed]: none

Note: this is a rolled-up backout of the patches ryan landed on beta
Attachment #8785760 - Flags: approval-mozilla-beta?
From discussion on irc with jesup, wes, and ryan I think we want to leave this as is for 49, accepting some number of shutdown hang/crashes, but keeping our tests meaningful so we can catch other problems. That seems better to me than dismissing entire test suites.  Maybe this will give time for jesup and others to look at other possible fixes in 51 or maybe 50. It is a hard call for now. I appreciate the complicated explanations.  

Marking this "wontfix" for 49, though we landed something, we still have a problem. Hopefully that makes sense to people.
Comment on attachment 8785760 [details] [diff] [review]
NOT FOR CHECKIN TO 49 -- Backed out changeset b44c72b85a80 198825ee397d 561081fdbcc1 (, Bug 1239873))

Leaving this as-is for 49 as discussed.
Attachment #8785760 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8785760 [details] [diff] [review]
NOT FOR CHECKIN TO 49 -- Backed out changeset b44c72b85a80 198825ee397d 561081fdbcc1 (, Bug 1239873))

Per liz's comments, renaming patch to avoid confusion, since the patch was beta-a+'d but should NOT land in 49
Attachment #8785760 - Attachment description: Backed out changeset b44c72b85a80 198825ee397d 561081fdbcc1 (, Bug 1239873)) → NOT FOR CHECKIN TO 49 -- Backed out changeset b44c72b85a80 198825ee397d 561081fdbcc1 (, Bug 1239873))
Crash Signature: nsDebugImpl::Abort | NS_InvokeByIndex ] → nsDebugImpl::Abort | NS_InvokeByIndex ] [@ AsyncShutdownTimeout | profile-before-change | MediaStreamGraph shutdown: blocking on msg thread ]
:jesup, it's a long read, but given the timing and (tenuous) audio connection, any insight on bug 1296630?  We just made it block the release.
Flags: needinfo?(rjesup)
I just commented there.  The pattern of which betas bug 1296630 fails in doesn't match the pattern here (we were/are blocking shutdown in 1, 2, 3 and 7, 8, 9, 10.)
Flags: needinfo?(rjesup)
We'll expect to see this in beta again; we plan to revise the entire shutdown sequence soon, and I'll land a patch for beta and aurora that should completely limit the time we'll wait for MSG shutdown (instead of just limiting cubeb shutdown).
Hi Jesup, the Intel crash (in bug 1296630) is back in 50.b1. The sooner we can get something ready to uplift to Beta50 the better. Please and thank you! :)
Flags: needinfo?(rjesup)
Effectively backs out the previous landing for bug 1255737

MozReview-Commit-ID: 37QfhATw8wU
Attachment #8795849 - Flags: review?(padenot)
Attachment #8795849 - Flags: review?(padenot) → review+
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/637f4d980a67
Move MSG shutdown max-timeout from just only cubeb shutdown to the entire graph r=padenot
https://hg.mozilla.org/mozilla-central/rev/637f4d980a67
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Comment on attachment 8795849 [details] [diff] [review]
Move MSG shutdown max-timeout from just only cubeb shutdown to the entire graph

Approval Request Comment
[Feature/regressing bug #]: N/A

[User impact if declined]: (Longer) Shutdown timeouts and crash reports generated

[Describe test coverage new/current, TreeHerder]: not testable in treeherder; manual using a debugger typically

[Risks and why]: Low risk - moves shutdown timeout for MSG to cover all of shutdown for MSG.  Note there's still a delay

[String/UUID change made/needed]: None
Flags: needinfo?(rjesup)
Attachment #8795849 - Flags: approval-mozilla-beta?
Attachment #8795849 - Flags: approval-mozilla-aurora?
Comment on attachment 8795849 [details] [diff] [review]
Move MSG shutdown max-timeout from just only cubeb shutdown to the entire graph

This might help with the Intel related top crashers we are noticing, Beta50+, Aurora51+
Attachment #8795849 - Flags: approval-mozilla-beta?
Attachment #8795849 - Flags: approval-mozilla-beta+
Attachment #8795849 - Flags: approval-mozilla-aurora?
Attachment #8795849 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/e30a9ec032f5
https://hg.mozilla.org/releases/mozilla-beta/rev/5dd4f8b96696
I still reproduce this intermittently [@ AsyncShutdownTimeout | profile-before-change | MediaStreamGraph shutdown: blocking on msg thread ] (https://crash-stats.mozilla.com/report/index/aba20cd6-2033-4bb9-b130-d10da2161006) following these steps: 
1. Launch Firefox
2. Go to https://opentokrtc.com/ (or to https://talky.io/ or to https://jitsi.org/) and enter in a call
3. Unplug the headset
4. Connect back the headset (the connection can't be restored even after refresh)
5. Restart the browser in order to reestablish the call (the browser doesn't restart; after ~1 min, Mozilla Crash Reporter shows up)

I reproduced the crash on 50.0b4 (20161003155957), using Windows 7 x64, Plantronics Audio 355 headset, Microsoft LifeCam HD 3000 / Logitech HD Pro Webcam C920 and the call peer using Windows 8.1 x86 or Windows 10  x64. 

Also ran intermittently into [@ AsyncShutdownTimeout | profile-before-change | Media shutdown: blocking on media thread ] (https://crash-stats.mozilla.com/report/index/08450d81-d792-497b-8215-d05a92161005), following the above steps, but using Windows 8.1 x86.
Flags: needinfo?(rjesup)
The volume of crashes has seriously decreased under beta but the it's pretty high under release (10197 for last week)

Crash volume for signature 'AsyncShutdownTimeout | profile-before-change | MediaStreamGraph shutdown: blocking on msg thread':
 - Ni.  (v. 52): 77 crashes from 2016-09-19.
 - Aur. (v. 51): 126 crashes from 2016-09-19.
 - Beta (v. 50): 2055 crashes from 2016-09-20.
 - Rel. (v. 49): 37301 crashes from 2016-09-05.
 - Esr  (v. 45): 0 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-31 to 11-06):
              W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - Ni.  (52)       1       0       1      10      37      28
 - Aur. (51)       0       1       4      39      70      12
 - Beta (50)      59      66     123     735     840     196
 - Rel. (49)   10197    6996    4173    3987    4397     897       3
Flags: needinfo?(rkothari)
Flags: needinfo?(lhenry)
I wonder if this is related to the async rendering issue. We shipped that in 49.0.2 on Oct. 20th. But this crash spike happened on Oct. 31st: https://crash-stats.mozilla.com/signature/?signature=AsyncShutdownTimeout%20%7C%20profile-before-change%20%7C%20MediaStreamGraph%20shutdown%3A%20blocking%20on%20msg%20thread&date=%3E%3D2016-10-03T17%3A48%3A47.000Z&date=%3C2016-11-03T17%3A48%3A47.000Z#graphs   

Seems more likely that this is due to some change in a video player or some other external site.
Flags: needinfo?(lhenry)
I see no crashes in 50 beta after 50b4
Crashes in 50b1-b4 were low (1-10/day it appears).
I see a total of 3 in the last week for 51 and 52.
Flags: needinfo?(rjesup)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: