Closed Bug 1405290 Opened 7 years ago Closed 7 years ago

Shutdownhang in mozilla::dom::workers::RuntimeService::Cleanup rising in Firefox 56

Categories

(Core :: DOM: Workers, defect, P2)

56 Branch
All
Windows
defect

Tracking


RESOLVED DUPLICATE of bug 1437575
Tracking Status
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 + wontfix
firefox58 --- wontfix
firefox59 + wontfix
firefox60 + fixed
firefox61 + fixed

People

(Reporter: philipp, Assigned: baku)

References

(Blocks 3 open bugs)

Details

(4 keywords, Whiteboard: [tbird crash])

Crash Data

Attachments

(4 files)

This bug was filed from the Socorro interface and is report bp-6e92b73c-bbec-4833-9b6e-524ec0170929.
=============================================================
Crashing Thread (0)
Frame Module Signature Source
0 ntdll.dll NtWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableCS
2 kernelbase.dll SleepConditionVariableCS
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) mozglue/misc/ConditionVariable_windows.cpp:58
4 xul.dll mozilla::CondVar::Wait(unsigned int) obj-firefox/dist/include/mozilla/CondVar.h:68
5 xul.dll nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsEventQueue.cpp:76
6 xul.dll nsThread::nsChainedEventQueue::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsThread.cpp:839
7 xul.dll nsThread::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsThread.cpp:1297
8 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1380
9 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:480
10 xul.dll mozilla::dom::workers::RuntimeService::Cleanup() dom/workers/RuntimeService.cpp:2239
11 xul.dll mozilla::dom::workers::RuntimeService::Observe(nsISupports*, char const*, char16_t const*) dom/workers/RuntimeService.cpp:2761
12 xul.dll nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsObserverList.cpp:112
13 xul.dll nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsObserverService.cpp:295
14 xul.dll mozilla::ShutdownXPCOM(nsIServiceManager*) xpcom/build/XPCOMInit.cpp:894
15 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/nsAppRunner.cpp:1475
16 xul.dll mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) obj-firefox/dist/include/mozilla/UniquePtr.h:345
17 xul.dll XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4812
18 xul.dll XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4879
19 xul.dll mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/Bootstrap.cpp:45
20 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:115
21 firefox.exe __scrt_common_main_seh f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
22 kernel32.dll BaseThreadInitThunk
23 ntdll.dll __RtlUserThreadStart
24 ntdll.dll _RtlUserThreadStart

these shutdownhangs have been around for a long time, but their frequency has noticeably increased in Firefox 56. in version 56.0, shutdownhangs with mozilla::dom::workers::RuntimeService::Cleanup in their proto signature account for 5.7% of all browser crashes, while they were just 0.7% in 55.0.3 ...
the volume of those crashes during the 56.0b cycle seems to have jumped up with 56.0b12 - the changes in b12 were: https://mzl.la/2vVyxo4

Correlations for Firefox Release
(97.17% in signature vs 07.61% overall) Module "pcacli.dll" = true ("Program Compatibility Assistant Client Module")
(96.83% in signature vs 09.87% overall) Module "sfc_os.dll" = true

Correlations for Firefox Beta
(99.01% in signature vs 01.98% overall) Module "twinapi.dll" = true
(96.06% in signature vs 10.48% overall) Module "mpr.dll" = true [100.0% vs 04.88% if platform_version = 6.3.9600]
(100.0% in signature vs 07.35% overall) Module "explorerframe.dll" = true
(98.52% in signature vs 13.07% overall) Module "urlmon.dll" = true [100.0% vs 06.54% if platform_version = 6.3.9600]
(99.01% in signature vs 07.54% overall) Module "FWPUCLNT.DLL" = true
(96.06% in signature vs 04.56% overall) Module "ntshrui.dll" = true
(96.06% in signature vs 04.57% overall) Module "cscapi.dll" = true
(93.10% in signature vs 01.20% overall) Module "pcacli.dll" = true
Assignee: nobody → amarchesini
Priority: -- → P1
I don't think we can do anything for 56. I'm planning to work on shutdown crashes as my next task (probably), so we are probably talking about 58.
OK, marking wontfix for 56 and 57. Since I'm not even sure we'll get to this in 58, I've marked it as fix-optional.
This seems to be spiking recently. Is there anything we can do for 57?
Flags: needinfo?(amarchesini)
I added one similar signature I have seen in Mac crash stats that is affecting 57/58, but not 56: http://bit.ly/2kyerii
Crash Signature: [@ shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] → [@ shutdownhang | NtWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::dom::workers::RuntimeService::Cl…
Many user comments on these crashes. I think we should reassess this issue for 57 (and maybe even 56) as it is high volume and seems to annoy users. Is there anything we can do here?
Since we are tracking this for 57, it would be great if :baku could weigh in on Comment 3.
Can we get some common URLs people are hitting this with? I know it's a shutdown issue but without an STR or sites to reproduce it's hard to get anywhere here :)
Flags: needinfo?(amarchesini) → needinfo?(lhenry)
The timing of the spike here looks similar to bug 1404105 (stemming from bug 1047098).
Maybe something around http://searchfox.org/mozilla-central/source/browser/base/content/sanitize.js#299. Andrew, you were involved with bug 1047098; any thoughts related to this worker-related shutdown crash?
Flags: needinfo?(bugmail)
The URLs seem to mostly be FB, Gmail, YouTube, and maybe some weather pages I could pick out, but nothing that seems obviously reproducible. Many comments do mention having shut down Firefox and walked away, then coming back to the crash report, but I think even more comments are about Firefox actually crashing.
Flags: needinfo?(lhenry)
This needs more investigation, but the short answer is this does not appear to have anything to do with sanitize.js changes.

- For release and trunk builds, we're ending up with 1 or 2 workers that are hanging out in their run loops at "xpcom-shutdown-threads", not having noticed that they should be dead.
- My best guess is an edge case is resulting in WorkerHolders releasing themselves on the wrong thread. Although we have NS_ASSERT_OWNINGTHREAD and AssertIsOnWorkerThread() checks, they both effectively only happen on DEBUG builds. (NS_ASSERT_OWNINGTHREAD is enabled on nightlies, but only if !MOZ_PROFILING[1], and it so happens that every nightly config does "--enable-profiling", so that's not helping. AssertIsOnWorkerThread is explicitly DEBUG-only.)
- For esr52 builds, I'm seeing more cases where the workers are actively trying to GC. In some cases, the GC appears to be hung waiting on joining GlobalHelperThreadState locks. In at least one case, 8 worker threads were all busy in GC. I'm really hoping that's not bitcoin-mining related.

Options for further diagnosis are:
- Ensure that we trigger a crash on non-DEBUG nightly builds if WorkerHolder::ReleaseWorkerInternal occurs off the worker thread.
- Give the RuntimeService "xpcom-shutdown-threads" Cleanup() function some number of seconds of grace before it triggers a crash that identifies whether HasActiveHolders() is returning true for a given worker, indicating if it's due to mChildWorkers, mTimeouts, or mHolders. If mHolders, identify the holder that betrayed us. The simplest approach for all of these is just to have distinct MOZ_CRASH calls, adding a virtual WorkerHolder::BadWorkerHolderCrashYourself to help.

Options for blind mitigation:
- Have RuntimeService::Cleanup dispatch a ControlRunnable at all still-alive
  workers. The arrival of such a runnable would address the wrong-thread
  RemoveHolder call if that's what is happening.

1: https://searchfox.org/mozilla-central/rev/1285ae3e810e2dbf2652c49ee539e3199fcfe820/xpcom/base/nsDebug.h#361
Flags: needinfo?(bugmail)
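For context on the first diagnostic option above: a release-mode owning-thread check records the thread that created the WorkerHolder and deliberately crashes (not just asserts) when a release happens on any other thread, so non-DEBUG crash reports can pinpoint the misbehaving caller. A minimal sketch of the idea, with hypothetical names (the actual change is attachment 8927819, reviewed further down):

    // Illustrative sketch only; hypothetical class, not the landed patch.
    #include "mozilla/Assertions.h"  // MOZ_CRASH
    #include "mozilla/Likely.h"      // MOZ_UNLIKELY
    #include "nsThreadUtils.h"       // GetCurrentVirtualThread(), as in the patch

    class OwningThreadCheck {
     public:
      OwningThreadCheck() : mOwningThread(GetCurrentVirtualThread()) {}

      void AssertIsOnOwningThread() const {
        // Unlike NS_ASSERT_OWNINGTHREAD / AssertIsOnWorkerThread(), this also
        // fires in non-DEBUG builds, so nightly/beta reports show the caller.
        if (MOZ_UNLIKELY(mOwningThread != GetCurrentVirtualThread())) {
          MOZ_CRASH("WorkerHolder released on the wrong thread");
        }
      }

     private:
      void* mOwningThread;  // opaque thread pointer captured at construction
    };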
(In reply to Andrew Sutherland [:asuth] from comment #11)
> This needs more investigation

baku knows the worker code, so maybe he can help (or a thought will be triggered by comment 11?).

> Options for blind mitigation:
> - Have RuntimeService::Cleanup dispatch a ControlRunnable at all still-alive
>   workers. The arrival of such a runnable would address the wrong-thread
>   RemoveHolder call if that's what is happening.

Is this something we'd want to do in 57?
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)
re: comment 12, off-bug we determined I'd try and look for likely WorkerHolder changes that might be involved.

From revision spelunking, changes to FetchConsumer in https://bugzilla.mozilla.org/show_bug.cgi?id=1381748 got uplifted for beta 11. There was a notable change in behavior in part 3 of the patch at https://hg.mozilla.org/releases/mozilla-beta/rev/335fe07a77ed where the WorkerHolder's Notify method changed from synchronously removing the WorkerHolder via ContinueConsumeBody to asynchronously releasing them; the cases are described at https://bugzilla.mozilla.org/show_bug.cgi?id=1381748#c7. It's quite possible those cases are sound, but something like bug 1395837 violates the assumed invariant (which should hopefully now be fixed by bug 1391693). Note that this is just informed guesswork.

In particular, comment 0 identifies b12 as the likely build where things got worse, but I'm finding it really difficult to contextualize the crash rates using the crash-stats UI beyond the absolute crash numbers. :philipp, do you have any tips or documentation links that could help me better understand your b12 conclusion and help me be better able to do such analyses myself? Thanks! (It's okay if the answer is "hard-won experience" ;)
Flags: needinfo?(madperson)
(In reply to Andrew Sutherland [:asuth] from comment #13)
> In particular, comment 0
> identifies b12 as the likely build where things got worse, but I'm finding
> it really difficult to contextualize the crash rates using the crash-stats
> UI beyond the absolute crash numbers. :philipp, do you have any tips or
> documentation links that could help me better understand your b12 conclusion
> and help me be better able to do such analyses myself?

investigating these shutdownhang crashes is a bit odd since their signatures often shifted around in the past - so for such an edgecase i don't think that there's any existing documentation, but i can try to explain how i came to the conclusion for this particular case:

i was also just starting off with the absolute number of shutdownhang reports on the beta channel with the particular proto signature & faceted by their version number: https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Adom%3A%3Aworkers%3A%3ARuntimeService%3A%3ACleanup&release_channel=beta&submitted_from_infobar=%21__true__&product=Firefox&date=%3E%3D2017-01-01T01%3A00%3A00.000Z&_facets=signature&_facets=version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

in there it's apparent that the number of those crashes was between 500-700 per version in the 56.0b cycle (56.0b6 was higher because we skipped shipping b7 for a different reason), which is already a fair bit higher than in prior cycles. then it jumped up to over 2000 in 56.0b12 & the subsequent rc builds though - incidentally, revisiting this query now shows that things have much improved in 57.0b again.

the most common crash signature for the issue at the time of 56.0b was [@ shutdownhang | NtWaitForAlertByThreadId | RtlSleepConditionVariableCS | SleepConditionVariableCS] - it's also possible to graph out how it developed during the 56.0b cycle, and it's apparent that it started spiking up after 2017-09-17 as well: https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&proto_signature=~mozilla%3A%3Adom%3A%3Aworkers%3A%3ARuntimeService%3A%3ACleanup&submitted_from_infobar=%21__true__&signature=shutdownhang%20%7C%20NtWaitForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-06-01T02%3A00%3A00.000Z#graphs - after bug 1402037 landed on 2017-09-27, those signatures should have mostly morphed into [@ shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup]
Flags: needinfo?(madperson)
Any updates on this crash? It is the #1 top crash signature in 56, even higher than "OOM | small". It only seems to affect Windows. The crash report URLs are the usual collection of facebook.com and youtube.com URLs.
Crash Signature: mozilla::dom::workers::RuntimeService::Cleanup ] → mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup]
I agree with asuth. I would like to fix this bug in 2 steps: I'm writing a couple of patches in order to have more data about what is keeping the workers alive. After that, we can understand which API is blocking the worker shutdown. The first step is going to happen today. My patches are almost ready.
Flags: needinfo?(amarchesini)
These 2 patches are temporary. We need something like them to find the reason for this crash. Then we can revert them, if we want.
Attachment #8927819 - Flags: review?(bugmail)
Attachment #8927821 - Flags: review?(bugmail)
Attachment #8927822 - Flags: review?(bugmail)
Comment on attachment 8927819 [details] [diff] [review]
part 1 - WorkerHolder must assert in release

Review of attachment 8927819 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerHolder.cpp
@@ +14,5 @@
> +void
> +AssertOnOwningThread(void* aThread)
> +{
> +  if (MOZ_UNLIKELY(aThread != GetCurrentVirtualThread())) {
> +    MOZ_CRASH_UNSAFE_OOL("WorkerHolder on the wrong thread.");

Can't this just be a normal MOZ_CRASH()?
Attachment #8927819 - Flags: review?(bugmail) → review+
Attachment #8927822 - Flags: review?(bugmail) → review+
Comment on attachment 8927821 [details] [diff] [review]
part 2 - RuntimeService::CrashIfHanging()

Review of attachment 8927821 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/RuntimeService.cpp
@@ +2198,5 @@
> +  for (uint32_t i = 0; i < workers.Length(); ++i) {
> +    WorkerPrivate* workerPrivate = workers[i];
> +
> +    // BC: Busy Count
> +    msg.AppendPrintf("-BC:%d", workerPrivate->BusyCount());

nit: maybe add some whitespace? Like " -BC:%d "?

@@ +2208,5 @@
> +      msg.Append(runnable->MsgData());
> +    }
> +
> +    // This string will be leaked.
> +    MOZ_CRASH_UNSAFE_OOL(strdup(msg.BeginReading()));

Data review request is here: This string is a computed list of WorkerHolder types (from part 3, all of which are string literals like "XMLHttpRequestWorker") plus the above worker counts *for things that failed to shut down* properly. We'd expect this to look like the following as currently implemented in the patches:

"Workers Hanging - A:2|S:0|Q:0-BC:1XMLHttpRequestWorker-BC2:BroadcastChannelWorkerHolder|WebSocketWorkerHolder"

No origins will be explicitly revealed. This is unlikely to aid in fingerprinting from crash stats unless it turns out there's a single website with a very distinctive set of operations that trigger this phenomenon. Based on the URLs we've seen in the crash reports, we do not believe this to be the case, with this being a somewhat widespread problem, origin-wise.
Attachment #8927821 - Flags: review?(francois)
Attachment #8927821 - Flags: review?(bugmail)
Attachment #8927821 - Flags: review+
Comment on attachment 8927821 [details] [diff] [review]
part 2 - RuntimeService::CrashIfHanging()

Review of attachment 8927821 [details] [diff] [review]:
-----------------------------------------------------------------

datareview+

::: dom/workers/RuntimeService.cpp
@@ +2208,5 @@
> +      msg.Append(runnable->MsgData());
> +    }
> +
> +    // This string will be leaked.
> +    MOZ_CRASH_UNSAFE_OOL(strdup(msg.BeginReading()));

Based on this description, this looks like Category 1 data.
Attachment #8927821 - Flags: review?(francois) → review+
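To summarize what the two reviews above approve: the part 2 diagnostic composes a short annotation describing every worker that failed to shut down and then crashes with it. The sketch below is reconstructed from the review comments only; the A/S/Q counters are shown as opaque placeholders since their exact meaning is not spelled out in this bug, and this is not the landed code:

    // Paraphrased sketch of RuntimeService::CrashIfHanging(), based on the
    // hunks quoted in the reviews above. Placeholder variables, not real code.
    void RuntimeService::CrashIfHanging() {
      // ... collect the WorkerPrivates that have not finished shutting down ...
      nsCString msg;
      msg.AppendPrintf("Workers Hanging - A:%d|S:%d|Q:%d",
                       workerCountA, workerCountS, workerCountQ);
      for (uint32_t i = 0; i < workers.Length(); ++i) {
        WorkerPrivate* workerPrivate = workers[i];
        // BC: Busy Count, followed by the names of the WorkerHolders that are
        // still keeping this worker alive (added in part 3).
        msg.AppendPrintf("-BC:%d", workerPrivate->BusyCount());
      }
      // The annotation must survive the crash machinery, so it is leaked.
      MOZ_CRASH_UNSAFE_OOL(strdup(msg.BeginReading()));
    }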
I think we're all good to land these other than the part 1 nit. Setting NI to land or indicate next step.
Flags: needinfo?(bugmail) → needinfo?(amarchesini)
Last push to try before landing... it should be in m-i tonight, tomorrow morning at the latest.
Flags: needinfo?(amarchesini)
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3d7896bf89f7
Improve logging of workers when shutting down - part 1 - WorkerHolder must assert in release, r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb508b270fab
Improve logging of workers when shutting down - part 2 - Implement RuntimeService::CrashIfHanging(), r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/dec3b1ceaa9b
Improve logging of workers when shutting down - part 3 - WorkerHolder with names, r=asuth
baku suggested we uplift these patches to 52 to get diagnostic messages from there, too, since it's different enough from the majority-e10s-multi users in later releases.
Comment on attachment 8927819 [details] [diff] [review]
part 1 - WorkerHolder must assert in release

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: This set of patches helps debug a workers shutdown hang. They are not really useful for e10s in production, because the crash happens only if the hang is in the parent process. 52 is still non-e10s and it would be nice to collect data there in order to fix the 'real' bug.
Fix Landed on Version: 59
Risk to taking this patch (and alternatives if risky): Low. We collect data and use it to compose a string for the crash report.
String or UUID changes made by this patch: none
Attachment #8927819 - Flags: approval-mozilla-esr52?
We've gotten 3 crashes so far, all Windows:

* bp-b8699baf-24a4-4e98-a24e-a94900171116
  Workers Hanging - A:2|S:0|Q:0-BC:0-BC:0
  Both active workers are in WaitForWorkerEvents(). There's also one "DOM Worker" thread (thread 40) whose WorkerThreadPrimaryRunnable::Run() would seem to have successfully exited and is waiting in the normal thread event loop.

* bp-3e8a8d0c-c816-4606-bbdc-5915d0171116
  Workers Hanging - A:3|S:0|Q:0-BC:0-BC:0-BC:0

* bp-83ec8bb9-241b-4f88-832d-02ef40171116
  Workers Hanging - A:3|S:0|Q:0-BC:1-BC:0-BC:0
  The busy thread appears to be a ChromeWorker in js::ctypes code performing a CreateFileW call that's still pending on the underlying NtCreateFile call.
Crash Signature: mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] → mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [mozilla::dom::workers::RuntimeService::CrashIfHanging]
Crash Signature: mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [mozilla::dom::workers::RuntimeService::CrashIfHanging] → mozilla::dom::workers::RuntimeService::Cleanup ] [@ shutdownhang | ZwWaitForAlertByThreadId | mozilla::dom::workers::RuntimeService::Cleanup] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging]
Should this have been marked leave-open?
Flags: needinfo?(amarchesini)
The signature changed. But yes, let's keep it open.
Flags: needinfo?(amarchesini)
https://crash-stats.mozilla.com/report/index/b8699baf-24a4-4e98-a24e-a94900171116 is interesting:

1. The main thread is spinning the event loop here: https://hg.mozilla.org/mozilla-central/annotate/f41930a869a8/dom/storage/StorageObserver.cpp#l405 - it's waiting for PBackground to nullify a couple of pointers.
2. But the PBackground thread is not processing runnables because it's blocked here: https://hg.mozilla.org/mozilla-central/annotate/f41930a869a8/dom/media/systemservices/CamerasParent.cpp#l200
3. The VideoCapture thread is not processing the runnable because it is blocked here: https://hg.mozilla.org/mozilla-central/annotate/f41930a869a8/media/webrtc/trunk/webrtc/modules/video_capture/device_info_impl.cc#l34

jesup, can you take a look? Note: I don't think this is _the_ reason why we have a worker hanging, but it's important to investigate this deadlock. If you prefer, I can file a separate bug.
Flags: needinfo?(rjesup)
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Target Milestone: mozilla59 → ---
Also https://crash-stats.mozilla.com/report/index/83ec8bb9-241b-4f88-832d-02ef40171116 shows an interesting behavior. Here PBackground is blocked waiting for the QuotaManager IO thread to shut down, and the QuotaManager IO thread is busy doing... IO. janv, can you check whether this is expected to happen during shutdown?
Flags: needinfo?(jvarga)
Munro, jib: this is relevant to Munro's work on CamerasParent. Looks like some form of deadlock. I see no webrtc.org threads alive; the library seems shut down. Looking at the code, device_info_impl.cc appears to have at least one hole, though probably not the cause: in line 146, the return -1; should have ReleaseLockExclusive()/AcquireLockShared() before it (like line 101). That's an easy fix, and CreateCapabilityMap certainly can fail. Worth fixing, and probably uplifting to 58.
Flags: needinfo?(rjesup)
Flags: needinfo?(mchiang)
Flags: needinfo?(jib)
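For context on the hole described in comment 32, the problematic shape is an early return that skips restoring the lock state the caller expects. The sketch below is illustrative only; it approximates the webrtc DeviceInfoImpl code referenced above, using the ReleaseLockExclusive()/AcquireLockShared() calls named in that comment, and is not the actual fix:

    // Illustrative sketch, not the real webrtc code or the eventual patch.
    int32_t NumberOfCapabilities(const char* aDeviceUniqueId) {
      _apiLock.ReleaseLockShared();     // caller entered holding the shared lock
      _apiLock.AcquireLockExclusive();  // need exclusive access to rebuild the map

      if (CreateCapabilityMap(aDeviceUniqueId) != 0) {
        // Without these two calls, the early "return -1;" leaves the exclusive
        // lock held, and later callers (including shutdown paths) block on it.
        _apiLock.ReleaseLockExclusive();
        _apiLock.AcquireLockShared();
        return -1;
      }

      _apiLock.ReleaseLockExclusive();
      _apiLock.AcquireLockShared();
      return 0;  // the success path would return the capability count
    }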
Munro, can you take a look?
Flags: needinfo?(jib)
Created a separate bug 1418871 to track comment 32.
Flags: needinfo?(mchiang)
I think this bug is related to the bugs generating crash signatures of [@ IPCError-browser | ShutDownKill ] with processor notes "Signature replaced with an IPC Channel Error, was ______.". I had the following crash reports generated while updating Firefox Nightly. Therefore, Bug 1405290 blocks Bug 1279293 (IPCError_ShutDownKill) (or this could be changed to "See Also"). And See Also Bug 1424451.

2017-12-10 2138pm Firefox Nightly crashed while updating.
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Firefox 59.0a1 Crash Report [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ]
MOZ_CRASH Reason: Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken

These crash reports were generated at the same time, but were manually submitted a minute apart from each other.
bp-f25eb0e9-d35f-44c8-90f9-e5b7f0171211 12/10/2017 8:46 PM
bp-40531424-889a-42b4-a8f2-7010a0171211 12/10/2017 8:45 PM
bp-e236ffa2-0ab1-453b-bd87-8c2f80171211 12/10/2017 8:44 PM
bp-0000c897-5a75-4397-9cd3-5edd10171211 12/10/2017 8:43 PM
bp-1a1b6cf1-0022-4dd6-b8df-66f400171211 12/10/2017 8:42 PM
bp-ab36ebe6-53f4-4fe0-ba77-b41ee0171211 12/10/2017 8:41 PM
bp-4981d798-cfa1-45a8-8493-38e390171211 12/10/2017 8:38 PM

bp-f25eb0e9-d35f-44c8-90f9-e5b7f0171211 12/10/2017 8:46 PM
Firefox 59.0a1 Crash Report [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ]
Crash Reason: EXCEPTION_BREAKPOINT
MOZ_CRASH Reason: Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken
Crashing Thread (64), Name: Shutdown Hang Terminator
Frame Module Signature Source
0 mozglue.dll MOZ_CrashOOL mfbt/Assertions.cpp:33
1 xul.dll mozilla::dom::workers::RuntimeService::CrashIfHanging() dom/workers/RuntimeService.cpp:2191
2 xul.dll mozilla::`anonymous namespace'::RunWatchdog toolkit/components/terminator/nsTerminator.cpp:162
3 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:397

bp-40531424-889a-42b4-a8f2-7010a0171211 12/10/2017 8:45 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason: EXCEPTION_BREAKPOINT
Signature replaced with an IPC Channel Error, was: "style::rule_tree::StrongRuleNode::gc"
Crashing Thread (0)
Frame Module Signature Source
0 xul.dll style::rule_tree::StrongRuleNode::gc servo/components/style/rule_tree/mod.rs:1061
1 xul.dll geckoservo::glue::Servo_StyleSet_Drop servo/ports/geckolib/glue.rs:2415
2 xul.dll mozilla::ServoStyleSet::Shutdown() layout/style/ServoStyleSet.cpp:180
3 xul.dll mozilla::PresShell::Destroy() layout/base/PresShell.cpp:1360

bp-e236ffa2-0ab1-453b-bd87-8c2f80171211 12/10/2017 8:44 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason: EXCEPTION_BREAKPOINT
Signature replaced with an IPC Channel Error, was: "RequestBehaviour::GetProgressTracker"
Crashing Thread (0)
Frame Module Signature Source
0 xul.dll RequestBehaviour::GetProgressTracker() image/imgRequestProxy.cpp:95
1 xul.dll RequestBehaviour::GetImage() image/imgRequestProxy.cpp:82
2 xul.dll imgRequestProxy::RequestDiscard() image/imgRequestProxy.cpp:604
3 xul.dll mozilla::dom::ImageTracker::RequestDiscardAll() dom/base/ImageTracker.cpp:157

bp-0000c897-5a75-4397-9cd3-5edd10171211 12/10/2017 8:43 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason: EXCEPTION_BREAKPOINT
Signature replaced with an IPC Channel Error, was: "RtlSafeRemoveEntryList | RtlDeleteCriticalSection | mozilla::detail::MutexImpl::~MutexImpl | mozilla::gfx::SourceSurfaceVolatileData::~SourceSurfaceVolatileData"
Crashing Thread (0)
Frame Module Signature Source
0 ntdll.dll RtlSafeRemoveEntryList
1 ntdll.dll RtlDeleteCriticalSection
2 mozglue.dll mozilla::detail::MutexImpl::~MutexImpl() mozglue/misc/Mutex_windows.cpp:36
3 xul.dll mozilla::gfx::SourceSurfaceVolatileData::~SourceSurfaceVolatileData() gfx/layers/SourceSurfaceVolatileData.h:101

bp-1a1b6cf1-0022-4dd6-b8df-66f400171211 12/10/2017 8:42 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason: EXCEPTION_BREAKPOINT
Signature replaced with an IPC Channel Error, was: "servo_arc::Arc<T>::drop_slow<T>"
Crashing Thread (0)
Frame Module Signature Source
0 xul.dll servo_arc::Arc<style::gecko_properties::ComputedValues>::drop_slow<style::gecko_properties::ComputedValues> servo/components/servo_arc/lib.rs:254
1 xul.dll geckoservo::glue::Servo_Element_ClearData servo/ports/geckolib/glue.rs:943
2 xul.dll mozilla::dom::Element::ClearServoData(nsIDocument*) dom/base/Element.cpp:4418

bp-ab36ebe6-53f4-4fe0-ba77-b41ee0171211 12/10/2017 8:41 PM
Firefox 59.0a1 Crash Report [@ IPCError-browser | ShutDownKill ]
Crash Reason: EXCEPTION_BREAKPOINT
Signature replaced with an IPC Channel Error, was: "mozilla::widget::WinUtils::WaitForMessage | nsAppShell::ProcessNextNativeEvent"
Crashing Thread (0)
Frame Module Signature Source
0 ntdll.dll NtWaitForMultipleObjects
1 kernelbase.dll WaitForMultipleObjectsEx
2 kernel32.dll WaitForMultipleObjectsExImplementation
3 user32.dll RealMsgWaitForMultipleObjectsEx

bp-4981d798-cfa1-45a8-8493-38e390171211 12/10/2017 8:38 PM
Firefox 59.0a1 Crash Report [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ]
Crash Reason: EXCEPTION_BREAKPOINT
User Comments: Firefox Nightly crashed while updating.
MOZ_CRASH Reason: Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken
Crashing Thread (64), Name: Shutdown Hang Terminator
Frame Module Signature Source
0 mozglue.dll MOZ_CrashOOL mfbt/Assertions.cpp:33
1 xul.dll mozilla::dom::workers::RuntimeService::CrashIfHanging() dom/workers/RuntimeService.cpp:2191
2 xul.dll mozilla::`anonymous namespace'::RunWatchdog toolkit/components/terminator/nsTerminator.cpp:162
Hi Baku, should we still consider uplifting this to ESR52? I am just reviewing the uplift nom which you made in mid-Nov, and I noticed that this helped but that there may be other crash signatures that still need fixing. Also, do all 3 patches need to be landed in ESR52 or just the one you nom'd? Thanks!
Flags: needinfo?(amarchesini)
(In reply to Ritu Kothari (:ritu) from comment #38)
> Hi Baku, should we still consider uplifting this to ESR52?

You are right. We don't need it. We have enough reports.
Flags: needinfo?(amarchesini)
Attachment #8927819 - Flags: review+
Attachment #8927819 - Flags: approval-mozilla-esr52?
[@ mozilla::dom::workers::RuntimeService::CrashIfHanging] accounts for 25% of browser crashes on 59.0b2.
baku, can you take a look at the crashes in 59.0b ? This is showing up in early (Developer Edition) 59 releases as the top crash by far (as philipp points out). I'll also email Jan.
Flags: needinfo?(amarchesini)
Keywords: topcrash
Ok, I'll take a look too.
Flags: needinfo?(jvarga)
hm, by looking closer into crashes during shutdown progress and comparing the situation with 58.0b it appears that the [@ mozilla::dom::workers::RuntimeService::CrashIfHanging] signature may not be new crashes, but just "swallowing up" nearly all signatures that would have shown up as [@ shutdownhang...] in prior cycles.

this query compares signatures during shutdown in 58.0b99 and 59.0b3: https://mozilla.github.io/stab-crashes/scomp.html?limit=100&common=product%3DFirefox%26submitted_from_infobar%3D!__true__%26shutdown_progress%3D!__null__%26process_type%3Dbrowser&p1=version%3D58.0b99%26date%3D%3E2018-01-01&p2=version%3D59.0b3%26date%3D%3E2018-01-10

the share of crashes during shutdown compared to all other browser crashes doesn't seem to be worse in 59.0b compared to 58.0b. i'm not sure if summing up most shutdownhangs under one signature was the intended consequence of the patch here though...
OK, I'm not considering this a blocker for 59 betas or release, then.
I'm debugging this issue a lot lately. Checking the crash reports, I would say that, often, Firefox crashes because some component blocks the main thread and the worker shutdown is not able to complete. Some important improvement happened with the introduction of nsIThreadManager::spinEventLoopUntilOrShutdown, but there is a lot more to do. For instance:

https://crash-stats.mozilla.com/report/index/8e9a5c50-93c3-4718-971c-3a1290180131
Here https://hg.mozilla.org/releases/mozilla-beta/annotate/64737c752ac4/gfx/layers/ipc/ImageBridgeChild.cpp#l643 is blocking the main thread. The workers are probably already gone, but RuntimeService doesn't receive the notification.

https://crash-stats.mozilla.com/report/index/95de2dc0-f706-4d34-b1a3-5fab10180131#allthreads
Also here we have an event loop being spun while shutting down the transporter thread: https://hg.mozilla.org/releases/mozilla-beta/annotate/64737c752ac4/netwerk/base/nsIOService.cpp#l1122

https://crash-stats.mozilla.com/report/index/93e63bad-dfd4-4830-b523-770200180131#allthreads
Same here: https://hg.mozilla.org/releases/mozilla-beta/annotate/64737c752ac4/dom/media/gmp/GMPServiceParent.cpp#l302

I wonder if the solution is to move the shutting down of workers to a separate thread.
Flags: needinfo?(amarchesini)
Note that these 3 crash-reports are just the 3 most recent ones.
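The pattern baku describes above - a shutdown observer that spins the main-thread event loop while waiting on another thread - looks roughly like the sketch below. The component and member names are made up; this is not code from any of the crash reports above:

    // Illustrative only: if the other thread is itself blocked, this loop
    // never exits, later observers (including the workers' RuntimeService)
    // never receive the notification, and the shutdown-hang terminator
    // eventually fires with the signature tracked in this bug.
    NS_IMETHODIMP
    SomeComponent::Observe(nsISupports* aSubject, const char* aTopic,
                           const char16_t* aData) {
      if (!strcmp(aTopic, "xpcom-shutdown")) {
        mozilla::SpinEventLoopUntil([&]() { return mOtherThreadDone; });
      }
      return NS_OK;
    }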
Attached patch will.patch (deleted) — Splinter Review
There are several issues here, but one of them is that worker shutdown doesn't start because a previous observer spins the event loop and hangs there. Here is a patch where worker shutdown starts on NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID.
Attachment #8947091 - Flags: review?(bugs)
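In sketch form, the idea behind will.patch is to start tearing down workers at the earlier "xpcom-will-shutdown" notification, so that an observer which later spins the event loop at "xpcom-shutdown" can no longer prevent worker shutdown from even starting. This is only an illustration of the approach (the Shutdown() call is a placeholder), not the patch itself:

    // Illustrative sketch; the real change is in will.patch above.
    obs->AddObserver(this, NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID, false);

    NS_IMETHODIMP
    RuntimeService::Observe(nsISupports* aSubject, const char* aTopic,
                            const char16_t* aData) {
      if (!strcmp(aTopic, NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID)) {
        // "xpcom-will-shutdown" fires before "xpcom-shutdown", so worker
        // teardown begins before a later observer can block the main thread.
        Shutdown();  // placeholder for the worker-shutdown entry point
        return NS_OK;
      }
      // ... other topics handled as before ...
      return NS_OK;
    }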
I'm lost here. Something landed two months ago, and now something else is about to land too? Can we please open a new bug?
Comment on attachment 8947091 [details] [diff] [review] will.patch This patch as such should be fine. But please don't land in this bug. It is really confusing to keep landing patches from the same bug over time.
Attachment #8947091 - Flags: review?(bugs) → review+
Depends on: 1434618
Just a side note: there used to be a problem with quota manager shutdown and we solved it by introducing the "profile-before-change-qm" notification (this one is fired after "profile-before-change"). So anything that uses storage should listen for "profile-before-change", and then the quota manager shuts down at "profile-before-change-qm", when nothing should be initiating new storage requests.
(In reply to Jan Varga [:janv] from comment #52)
> Just a side note: there used to be a problem with quota manager shutdown and
> we solved it by introducing the "profile-before-change-qm" notification
> (this one is fired after "profile-before-change"). So anything that uses
> storage should listen for "profile-before-change", and then the quota
> manager shuts down at "profile-before-change-qm", when nothing should be
> initiating new storage requests.

This is basically what I'm doing here. But this is not the final fix, because if some code blocks the main thread, the worker runnables dispatched to the main thread are not processed.
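Put concretely, the ordering janv describes is: storage consumers finish or cancel their work at "profile-before-change", and the QuotaManager itself only shuts down at the later "profile-before-change-qm" notification, by which point nothing should be starting new storage requests. A minimal registration sketch (illustrative names only):

    // Illustrative: a storage consumer cleans up at "profile-before-change";
    // QuotaManager registers its own observer for "profile-before-change-qm".
    nsCOMPtr<nsIObserverService> obs = mozilla::services::GetObserverService();
    if (obs) {
      obs->AddObserver(storageConsumer, "profile-before-change", false);
    }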
Depends on: 1422036
Here is a report about the state of this crash:

There are mainly 2 reasons why we have this crash signature:
a. there is a bug in how worker code is used;
b. there is something blocking the worker shutdown: I often see components receiving the xpcom-shutdown notification and spinning the event loop. This event loop never ends, and workerinternals::RuntimeService doesn't receive the xpcom-shutdown notification. This means that the shutdown of workers, simply, doesn't start.

Related to the first point, I made some improvements. It seems that the main crash reports are related to 'fetch'. In bug 1422036 there is an important patch that should fix them.

About something blocking the worker shutdown, I'm working on bug 1434618 where the shutting down of workers starts a bit before xpcom-shutdown. This should fix the spinning-event-loop issue that we see often. See bug 1435343.

The signature has now changed because I recently removed the 'workers' namespace. We should continue the discussion on bug 1435343.
Depends on: 1436995
Blocks: 1436995
No longer depends on: 1436995
Blocks: 1437575
Whiteboard: [tbird crash]
(In reply to Andrea Marchesini [:baku] from comment #54)
> Here is a report about the state of this crash:
>
> There are mainly 2 reasons why we have this crash signature:
> a. there is a bug in how worker code is used;
> b...
>
> Related to the first point, I made some improvements. It seems that the main
> crash reports are related to 'fetch'. In bug 1422036 there is an important
> patch that should fix them.

bug 1422036 got uplifted to 59.0b8, but there wasn't any apparent decline in the [@ mozilla::dom::workers::RuntimeService::CrashIfHanging ] crash signature.
We don't have any additional reports where a Fetch WorkerHolder keeps workers alive. Unfortunately, that was only 10-15% of the hangs. The rest is related to the dependencies of bug 1435343.
I recently landed bug 1437575 on nightly, which makes this crash report basically disappear. This doesn't mean that the issue is fixed! What is behind this crash report is a component blocking the main thread. With the main thread blocked, workers are not able to complete the shutdown procedure and Firefox hangs/crashes. Before bug 1437575 we didn't have a way to detect this issue, and it was shown as a worker hanging. I filed bugs for each component blocking the main thread, following the crash reports. Here is the list: 1435958, 1435960, 1435961, 1435962, 1435966, 1435963, 1435964
Several (such as the URLClassifier) appear to be due to using SyncRunnables/DISPATCH_SYNC from the main thread, probably because they need to wait for some other thread to do something during shutdown, and if they return from Observe the shutdown sequence will continue. Instead, anything which needs to do other-thread actions or wait should use an AsyncShutdown blocker (nsIAsyncShutdownClient, etc.) - you can see this in action in C++ code in MediaManager.cpp. Then you don't need to use a sync message from the shutdown observer/etc.
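A bare-bones sketch of the blocker approach jesup describes (the class name and strings are made up; MediaManager.cpp is the real in-tree example):

    // Illustrative async shutdown blocker; not production code.
    class MyShutdownBlocker final : public nsIAsyncShutdownBlocker {
     public:
      NS_DECL_ISUPPORTS

      NS_IMETHOD GetName(nsAString& aName) override {
        aName.AssignLiteral("MyComponent: waiting for another thread");
        return NS_OK;
      }

      NS_IMETHOD BlockShutdown(nsIAsyncShutdownClient* aClient) override {
        mClient = aClient;
        // Kick off asynchronous teardown on the other thread here; when it
        // completes, call mClient->RemoveBlocker(this) back on the main
        // thread. Shutdown is held without spinning or blocking meanwhile.
        return NS_OK;
      }

      NS_IMETHOD GetState(nsIPropertyBag** aState) override {
        *aState = nullptr;
        return NS_OK;
      }

     private:
      ~MyShutdownBlocker() = default;
      nsCOMPtr<nsIAsyncShutdownClient> mClient;
    };

    NS_IMPL_ISUPPORTS(MyShutdownBlocker, nsIAsyncShutdownBlocker)

The blocker would be registered on one of the nsIAsyncShutdownService barriers (for example its profile-before-change client) via AddBlocker(), instead of doing a synchronous dispatch from the observer.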
Crash numbers are very high for beta - our top crash. So this may cause problems on release (based on the user comments on these crash signatures)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #59)
> Crash numbers are very high for beta - our top crash. So this may cause
> problems on release (based on the user comments on these crash signatures)

This signature is going down in 60 for 2 reasons:
. it has been renamed to mozilla::dom::workerinternals::RuntimeService::CrashIfHanging, and that signature is also going down quickly - see the graph in bug 1435343;
. bug 1437575 landed to give more information about what is happening during shutdown.

Maybe we want to uplift bug 1437575 to know what is going on in beta.
Flags: needinfo?(lhenry)
Blocks: 1439893
(In reply to Andrew Overholt [:overholt] from comment #7)
> Can we get some common URLs people are hitting this with? I know it's a
> shutdown issue but without an STR or sites to reproduce it's hard to get
> anywhere here :)

2018-03-03> Firefox crashed while shutting down. Menu, Alt+F X (Exit). It took forever to shut down. A few minutes. The firefox process was not responding, and then in the suspended state (as per windows resource monitor). It eventually closed and generated this crash report.
---
During Shutdown: these crash reports span a 4 minute period.

Unsubmitted Crash Reports
Report ID Date Crashed
e18ea154-fab4-4b76-95c3-a93d5d444724 03/03/18 04:14
d634690f-dfce-4ee8-87d1-e5d56b23e723 03/03/18 04:13
90c0b448-1a94-4cb3-b713-e90bdbcc7c8c 03/03/18 04:12
da0e0ec9-8a7c-454b-bbf8-f2d7fd29258f 03/03/18 04:12
dc8c503d-ade1-48a1-97e4-c6ea38ff6231 03/03/18 04:12
ed32682c-8948-4be8-9493-27cc4093764b 03/03/18 04:11

Report ID Date Submitted
bp-5f249501-2f81-4923-8bc2-baa3b0180303 03/03/18 04:51
bp-527a809c-9311-40a9-ace0-9c8880180303 03/03/18 04:50
bp-5f0bf39d-b594-432d-a07f-cad250180303 03/03/18 04:47
bp-ca5c3e00-f062-41cb-b7fb-e515d0180303 03/03/18 04:46
bp-7b606d9f-5861-46c7-8c63-249340180303 03/03/18 04:45
bp-de84a4f6-7020-490b-aaf7-2fab30180303 03/03/18 04:35
---
AvailablePageFile: 4170256384
AvailablePhysicalMemory: 2098941952
AvailableVirtualMemory: 8792240664576
Comments: Firefox crashed while shutting down. Menu, Alt+F X (Exit). It took forever to shut down. A few minutes. The firefox process was not responding, and then in the suspended state (as per windows resource monitor). It eventually closed and generated this crash report.
MozCrashReason: Workers Hanging - A:1|S:0|Q:0-BC:2Dispatch Error
ProductName: Firefox
ReleaseChannel: beta
SafeMode: 0
SecondsSinceLastCrash: 1759336
ShutdownProgress: xpcom-shutdown
TotalPageFile: 8273096704
TotalPhysicalMemory: 4080713728
TotalVirtualMemory: 8796092891136
URL: https://flightaware.com/live/airport_status_bigmap.rvt?airport=CYWG
Version: 59.0
(In reply to [:philipp] from comment #43)
> hm, by looking closer into crashes during shutdown progress and comparing
> the situation with 58.0b it appears that the [@
> mozilla::dom::workers::RuntimeService::CrashIfHanging] signature may not be
> new crashes, but just "swallowing up" nearly all signatures that would have
> shown up as [@ shutdownhang...] in prior cycles.

I agree. And I think this has been a problem for a few years now. It would be great to see it resolved. I think the users affected find it very annoying. I certainly do.

Blocks: 1333409, 1368983, 1308629, 1401481, 1186276, 1419108, 1248837
See Also: 1346860, 1233757

Blocks:
Bug 1333409 Bad interaction between the shutdownhang kill timer and the ContentParent kill timer NEW Unassigned
Bug 1368983 Increased parent process shutdown crash rate in Firefox 53.0.3. Crash in [@ shutdownhang | ____] NEW Assigned to baku
Bug 1308629 Firefox goes Not Responding (or crashes) on Quit NEW Unassigned
Bug 1401481 [Crash in shutdownhang | js::Thread::join] MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) NEW Unassigned
Bug 1186276 Firefox keeps crashing: shutdown hang in the storage service NEW Unassigned
Bug 1419108 Crash in shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn
Bug 1248837 shutdownhang in nsThread::Shutdown() | Firefox crashes, sometimes the restart fails REOPENED Unassigned

See Also:
Bug 1346860 Different signatures for same shutdownhang issues on release & beta NEW Unassigned
Bug 1233757 Make shutdownhang crashes less annoying NEW Unassigned
Blocks: 1415837
I don't think we have anything actionable here for 59, which will release next week. Maybe we'll see some useful results from the diagnostic patch that will be useful for 60.
Nothing to do for 56. The fix landed in 60.
Assignee: amarchesini → nobody
Too late for Fx59 at this point, but definitely want to keep this on the radar for 60/61 still.
Interestingly, we're not seeing any hits for this on 60 on Beta so far. Any idea what might have fixed it?
Flags: needinfo?(amarchesini)
If this is fixed in 60, close this bug as resolved then.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #66)
> Interestingly, we're not seeing any hits for this on 60 on Beta so far. Any
> idea what might have fixed it?

On beta/nightly this issue is not fixed, just correctly reported. Let me explain it a bit more:

There are some components that spin the event loop on the main thread during shutdown. This happens when shutdown notifications are emitted (xpcom-shutdown, xpcom-will-shutdown, xpcom-shutdown-threads and so on). Sometimes, some of these components do not exit from this spinning, which means that other components do not receive the shutdown notifications.

In our hang monitor, there is a specific check for web workers to be sure that all the web workers are correctly terminated on shutdown. But if a component blocks the main thread, spinning the event loop forever, the web worker shutdown doesn't start because the notifications are not received. So the hang monitor reports this as a web worker error.

In bug 1437575 I improved the hang monitor so it reports unterminated shutdown notifications. This showed that it was often not a web worker issue but something else: 1434618, 1435958, 1435960, 1435961, 1435962, 1435966, 1418871, 1422036, 1435963, 1435964. I filed, as you see, several bugs and I alerted the owners/peers of those components. I also fixed a couple of web worker issues.

The situation now is better, but we can do more. In particular bug 1435958.

I'll mark this bug as a dup of bug 1437575.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Flags: needinfo?(amarchesini)
Resolution: --- → DUPLICATE
Thanks for the explanation. Doesn't sound like this bug is going to be actionable for ESR52 either, which is unfortunate. Over 1000 crash reports in the last week there still too :(
Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
I could try to port some of the patches to ESR52, but note that none of those patches fix the problem; they just expose it in a better way.
Flags: needinfo?(amarchesini)
No longer depends on: 1435958
Depends on: 1356853
Blocks: 1633342
No longer blocks: 1633342