Closed Bug 1709421 Opened 3 years ago Closed 3 years ago

Crash in [@ shutdownhang | nsThreadManager::CancelBackgroundDelayedRunnables]

Categories

(Core :: XPCOM, defect)

All
Windows
defect

Tracking

()

RESOLVED FIXED
93 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- wontfix
firefox88 --- unaffected
firefox89 --- fixed
firefox90 --- fixed
firefox91 --- wontfix
firefox92 --- wontfix
firefox93 --- fixed

People

(Reporter: aryx, Assigned: nika)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

89 + 90 affected. This one is predominantly on Windows 10, and on Windows in general.

Crash report: https://crash-stats.mozilla.org/report/index/7670ae56-e92a-4cad-a24f-b21750210503

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
5 xul.dll nsThreadManager::CancelBackgroundDelayedRunnables xpcom/threads/nsThreadManager.cpp:520
6 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:624
7 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1668
8 xul.dll XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:5556
9 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp:5598

This is happening again in 91, can you take a look?

Flags: needinfo?(kwright)

Looks like we are getting stuck cancelling our background runnables. When we do this, we iterate each task queue's runnables in a promise that we dispatched to the task queue. A TaskQueue can't seem to resolve, and I don't see anything in the stack that indicates that one is waiting on a DelayedRunnable::CancelTimer call.

Looking at the threads in the crash stack, it looks like one of our background IO threads is attempting to do a read. My best guess is that we're stuck here and the related TaskQueue can't resolve its shutdown promise. I'm not familiar with this part of necko so I'm not sure if this is because of the delayed runnable shutdown, or if we're stuck because this runnable is stuck here.

Jens, are there any known hangs in this code?

Flags: needinfo?(kwright) → needinfo?(jstutte)

Could be a necko thing then?

Flags: needinfo?(jstutte) → needinfo?(dd.mozilla)

This may have been caused by bug 1714307.

A look a bit at the patch but could not see any bug, but patch is huge.

Or is there some new code that starts BackgroundFileSaver very late during the shutdown?

Valentin can you look at this?

Flags: needinfo?(dd.mozilla) → needinfo?(valentin.gosu)

The changes to BackgroundFileSaver.h/.cpp should not introduce any behaviour difference. I have double checked, and everything looks correct.

We could of course try to revert 453a12cf60be2b315ced8c208c9923fce2974dfa and see if it helps, but the crashes on 91.0a1 started on june 1st, while bug 1714307 landed on June 11th, so I'm around 90% sure it's not the cause.

Flags: needinfo?(valentin.gosu)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #8)

We could of course try to revert 453a12cf60be2b315ced8c208c9923fce2974dfa and see if it helps, but the crashes on 91.0a1 started on june 1st, while bug 1714307 landed on June 11th, so I'm around 90% sure it's not the cause.

It's worth noting that before we introduced DelayedRunnable shutdown, we wouldn't have caught this crash signature. It would have appeared as a shutdown hang at the end of nsThreadManager::Shutdown when nsThreadManager gets stuck trying to shut down its threads (because one is hanging).

Nika, it's a medium crasher in 91, it is too late for 92 but could we get that bug assigned and hopefully fixed in 93/94? Thanks!

Flags: needinfo?(nika)

There were a number of crashes caused due to a hang on a background
TaskQueue, and in many of these bugs there was a read from
BackgroundFileSaver on the stack. It appears that we are stuck in this
loop late in shutdown, when we are attempting to shut down TaskQueues
and XPCOM threads.

I'm unsure if we would cancel these requests already, however I noticed
that we wouldn't abort this read even if we did. This change now
performs a check to see if the BackgroundFileSaver has errored between
each read, which should cause aborts to also abort this loop, and
hopefully avoid potential shutdown hangs.

Assignee: nobody → nika
Status: NEW → ASSIGNED

Had a conversation with :kriswright & we think this is probably caused by the necko download code hanging during shutdown, and preventing us from shutting down its TaskQueue. The patch I've attached tries to allow cancelling that potentially long-running read operation.

Flags: needinfo?(nika)
Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6d0f2956dd13
Abort generating partial digest of existing file when download is cancelled, r=necko-reviewers,dragana
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: