[meta] Remaining issues that may cause QuotaManager shutdown hangs
Categories
(Core :: Storage: Quota Manager, defect, P2)
Tracking
()
People
(Reporter: janv, Assigned: jstutte)
References
(Depends on 12 open bugs, Blocks 1 open bug)
Details
(4 keywords, Whiteboard: [tbird crash])
Crash Data
Attachments
(2 files)
Updated•5 years ago
Comment 1•4 years ago (Assignee)
Very high volume!
[@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
[@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ]
Normal/Low volume
[@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent ]
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
[@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke ]
[@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut ]
[@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
[@ shutdownhang | RtlAcquireSRWLockExclusive | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::Observer::Observe ]
Thunderbird 78.x only
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]
68 only:
[@ mozilla::dom::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
Comment 2•4 years ago (Assignee)
This attachment contains the value of QuotaManagerShutdownTimeout where set.
Comment 3•4 years ago (Assignee)
Apparently the volume spiked here after we unified the Quota client shutdown timeouts in bug 1681670. In turn, the volume in other bugs (like bug 1542485 and bug 1541776) went down by the same order of magnitude.
So we can be fairly confident that we did not regress the overall quality of QuotaManager shutdown; we just moved some signatures around and hopefully now have a better perspective on them.
Is there an update on this?
I really need this bug fixed, as it's severely affecting my Firefox performance, making it really bad under prolonged or heavy use. I've had Firefox be slow, freeze and crash for 3 hours straight due to this bug. I hit this bug every three days to a week.
Are you having problems reproducing this bug? I can sometimes reproduce it on my computer through "stress testing", but it appears sporadically and it appears to have to do with how memory is allocated. It appears more often once the Windows memory caching (Superfetch) is off.
Comment 5•4 years ago (Assignee)
Hi :desbest, we are working hard to make this bug actionable. This bug is supposed to have a noticeable impact only during shutdown, leaving the Firefox process open for some more seconds than normal and then reporting a crash after the shutdown timeout. If this is what you are seeing, then reliable STR (steps to reproduce) or a captured profile could help move this forward. If you have any other concrete problem during use, please file a new bug, ideally with clear STR and/or a captured profile.
In my experience this bug also affects the performance of Firefox before it is shut down: Firefox is slower in sessions where this bug will happen on shutdown.
It's hard to provide reliable STR, because if I follow the same instructions, sometimes the bug will happen and sometimes it won't.
Does a captured profile mean to use the Firefox Profiler? https://profiler.firefox.com/
Comment 8•4 years ago (Assignee)
(In reply to desbest from comment #6)
> In my experience this bug also affects the performance of Firefox before it is shut down: Firefox is slower in sessions where this bug will happen on shutdown.

The browser might be in some weird state that also affects shutdown, of course. Is there any other load (other programs with high CPU usage) on the machine when this happens?

> It's hard to provide reliable STR, because if I follow the same instructions, sometimes the bug will happen and sometimes it won't.
> Does a captured profile mean to use the Firefox Profiler? https://profiler.firefox.com/

Yes, exactly. It could at least give hints about what is slowing down your instance in general. Please choose the "Firefox Platform" setting in the profiler plugin while recording. Or, if you suspect IndexedDB to be involved, you might want to use a custom profile and switch on the feature "All File IO" and maybe the "DOM Worker" thread.
I can't get the profiler to work.
After finishing capturing, when I click the upload button nothing happens, so I can't share my capture with others.
Also, how do I profile the startup and shutdown of Firefox?
https://profiler.firefox.com/docs/#/./guide-startup-shutdown
Where do I put the environment variable? Do I have to change the windows settings like in the image below?
Comment 10•4 years ago
It won't let me post images. The images are here. https://imgur.com/a/uqGwi1C
Comment 11•4 years ago
I've spent 4 days trying to reproduce bugs and I've noticed two things using 2 methods.
===========
For method one, when I use the profiler to create a profile while reproducing the bug, Firefox freezes and never resumes. I waited as long as I could, and it took 12 hours before the freeze ended and Firefox spontaneously closed. So I couldn't get the profiler to work, and I can't reproduce the bug using the first method unless the profiler stops freezing.
===========
For method two: I use the Pocket Tube extension and wait for it to finish loading. If I then open a YouTube video to play it, click the tag icon at the top that Pocket Tube generates, and close Firefox, there will be a crash. However, the crash signature is different each time. Here is a profiler capture: https://share.firefox.dev/3mO77vC
Firefox 87.0 Crash Report [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sqlite3_finalize ]
Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]
Comment 12•4 years ago (Assignee)
(In reply to desbest from comment #11)
> I've spent 4 days trying to reproduce bugs and I've noticed two things using 2 methods.

Thanks for insisting!

> Firefox 87.0 Crash Report [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sqlite3_finalize ]

Here is a corresponding crash report. We try to finalize statements in places.sqlite on the main thread while there is another thread (Thread 66, Name: mozStorage #2) executing something on another sqlite DB.

> Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

There are quite a few reports with similar signatures. Again we see places.sqlite involved here; we run against this shutdown barrier. We might just be waiting on some slow transaction.
:mak, can you see something obvious here?
Comment 13•4 years ago (Assignee)
See also bug 1421562?
Comment 14•4 years ago (Assignee)
BTW: the signatures mentioned in comment #11 should not be considered part of this specific bug; they are yet another kind of async shutdown timeout.
Comment 15•4 years ago (Assignee)
:desbest, do you by chance have the setting "Privacy & Security" -> "History" set to "Never remember history" or "Clear history when Nightly closes"? Then I suspect you are affected by something similar to bug 1700584.
Comment 16•4 years ago
It’s set to remember history.
Comment 17•4 years ago
(In reply to Jens Stutte [:jstutte] from comment #12)
> We try to finalize statements in places.sqlite on the main thread while there is another thread (Thread 66, Name: mozStorage #2) executing something on another sqlite DB.

How can you tell it's another db and not the same? It could be:
- thread 2 is executing some very long and expensive query on places.sqlite, and the main thread is just waiting for it to finish
- there is a deadlock between thread 2 and the main thread, both on places.sqlite, that will never be resolved even waiting forever
- thread 2 is executing Sync and it's taking too long

> Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

This is an interesting signature; we should file a separate bug for it. Since it looks like many of these timeouts are related to removing orphan icons, there may be a common cause. I don't expect that to be related to QuotaManager or indexedDB anyway.
Apparently this is hanging on "DELETE FROM moz_pages_w_icons WHERE page_url_hash NOT IN (SELECT url_hash FROM moz_places) OR id NOT IN (SELECT DISTINCT page_id FROM moz_icons_to_pages)".
There may be various reasons:
- there is some database corruption
- there is an unknown deadlock related to ATTACHed databases
- removing orphan icons is just very slow under certain circumstances
Did we go through the usual troubleshooting steps?
First thing I'd suggest is to run the Places Database Integrity check from about:support, and post the resulting log here or in the newly filed bug. Then check if things improved after a restart.
Another interesting experiment would be to pretty much drop favicons: go to the profile folder while Firefox is closed and move favicons.sqlite (and favicons.sqlite-wal) elsewhere. Launching Firefox should generate a new one and potentially resolve any unexpected perf issue with it, but it will also lose all the favicons on bookmarks until you revisit them. You can always restore the previously moved file after the experiment. If that solves the problem, I may be interested in getting more statistics about that db.
But, as I said, I'd suggest moving the PlacesExpiration.jsm investigation to a separate bug. I don't expect these Places crashes to have any relation to DOM Storage or QuotaManager.
Comment 18•4 years ago (Assignee)
(In reply to Marco Bonardo [:mak] from comment #17)
> But, as I said, I'd suggest moving the PlacesExpiration.jsm investigation to a separate bug. I don't expect these Places crashes to have any relation to DOM Storage or QuotaManager.

Spun off bug 1706298.
Comment 19•3 years ago (Assignee)
Looking at one of the cache related hangs I cannot see any thread doing any work.
The annotation reads:
cache: Managers: 1 (https://aaa.aa: Closing, )
Intermediate steps:
0.000000s: starting
IIUC, the shutdown sequence in QuotaManager::Shutdown is the following:
1. Call StopIdleMaintenance(); (does nothing except for IDB, where it just sets an aborted flag).
2. Call InitiateShutdownWorkThreads(); for each client type. This writes the "starting" message into the annotations.
2.a In the case of cache this synchronously calls the sequence:
Client::InitiateShutdownWorkThreads()
-> CacheQuotaClient::InitiateShutdown()
-> Manager::Factory::ShutdownAll()
-> (for each manager) Manager::Shutdown()
-> (for each Context) Context::CancelAll()
-> Context::ThreadsafeHandle::AllowToClose()
Here it gets interesting, as we might dispatch dom::cache::Context::ThreadsafeHandle::AllowToCloseOnOwningThread to the owning thread. Interestingly, this code is never hit in our test automation?
-> If all goes well we then get into Context::ThreadsafeHandle::AllowToCloseOnOwningThread() and nuke our Context reference, which should destroy the Context and, through its destructor, arrive at Factory::Remove(*this).
-> If this was the last Context, Factory::Remove will destroy the factory singleton and thus unblock the shutdown. This also generates the "Manager removed" annotation.
2.b It then decides via CacheQuotaClient::IsShutdownCompleted() whether we are already completely shut down.
3. In case we still need to wait for someone, enter the SpinEventLoopUntil.
It seems that even after the cache shutdown we still have a living sFactory that prevents the event loop from breaking.
So the doubt would be: we dispatch dom::cache::Context::ThreadsafeHandle::AllowToCloseOnOwningThread with MOZ_ALWAYS_SUCCEEDS. But if this dispatch fails, we might end up with a live lock? From our automation it is not even clear if we ever try to dispatch this message, though.
Comment 20•3 years ago (Assignee)
Another possibility could be that we do not free all Contexts, of course. In that sense I was asking myself whether aOldManager->mContext is ever nulled out if there is an aOldManager passed here?
Comment 21•3 years ago (Assignee)
We want to enforce the (assumed) invariant that after we have started QM shutdown we should not create new contexts and associated threads for cache IO. We do this by generating a runtime error, whose handling might not yet be consistent in all (unexpected) cases. But this is preferable over a shutdown hang crash, for sure.
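The invariant described above can be sketched as a small guard (a hypothetical illustration with invented names, not the actual Gecko implementation): once shutdown has begun, context creation fails with a hard error the caller must handle, instead of silently spawning new IO work.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins: the real cache Context/Factory look different.
struct Context {};

class ContextFactory {
  bool mShutdownStarted = false;

 public:
  void BeginShutdown() { mShutdownStarted = true; }

  std::unique_ptr<Context> Create() {
    if (mShutdownStarted) {
      // Preferable to a shutdown hang: surface a runtime error right away.
      throw std::runtime_error("cache Context creation after shutdown started");
    }
    return std::make_unique<Context>();
  }
};
```

The point of the design is that the failure is loud and immediate at the creation site, rather than a new thread quietly keeping shutdown alive until the watchdog fires.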
Comment 22•3 years ago (Assignee)
This patch is not really meant to fix the hangs.
Comment 23•3 years ago
Comment 24•3 years ago
bugherder
Comment 25•3 years ago
https://support.mozilla.org/en-US/questions/1351032
Issue solved by updating Avast.
bp-b763a58f-97f0-4df9-9205-44ffc0210919
bp-7b2b3a35-6b01-4672-a8d2-9dd940210905
Comment 26•3 years ago
Added a signature
Comment 27•3 years ago (Assignee)
Signature cleanup:
[@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
Appears only on 85.
[@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ]
Very active
[@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent ]
rare but active
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
<= 78
[@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke ]
inactive
[@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut ]
inactive
[@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
inactive
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]
inactive
[@ mozilla::dom::quota::QuotaManager::Shutdown::{lambda#1}::operator() const::{lambda#1}::_FUN ]
new but rare
Comment 28•3 years ago
This is still the #1 overall non-OOM topcrash on release.
Comment 29•3 years ago (Assignee)
This crash report shows a suspicious sequence:
Intermediate steps:
0.047000s: ClearRequestBase: Starting repeated directory removal #0
0.047000s: ClearRequestBase: Completed repeated directory removal #0
0.047000s: ClearRequestBase: Completed deleting files
0.047000s: ClearRequestBase: Starting deleting files
0.063000s: ClearRequestBase: Starting repeated directory removal #0
0.063000s: ClearRequestBase: Completed repeated directory removal #0
0.063000s: ClearRequestBase: Completed deleting files
... (continues until hang) ...
Comment 30•3 years ago (Assignee)
ls: Datastores: 34 (DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:https://(removed)=D|PrivateBrowsingId:0|Closed:1, DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:https://(removed)=DD|PrivateBrowsingId:0|Closed:1, DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:moz-extension://(removed)|PrivateBrowsingId:0|Closed:1)
seems to show un-removed closed LS connections?
Comment 31•3 years ago (Reporter)
When a Datastore is closed we initiate closing of the connection, and we release the directory lock after that. We unregister the Datastore object only when the connection is closed; otherwise a new Datastore could try to touch the same database on disk. I think the log from comment 30 falls into the group of "too much work during shutdown". If there are too many open LS connections, it can happen that we won't have a chance to close them all in time. I filed bug 1592593 for this in the past. I think that one could be a good "first project" bug.
If bug 1592593 doesn't help enough, we might want to investigate adding more threads for I/O stuff. There's only one at the moment.
Anyway, I just noticed that we need to update Datastore::Stringify to include mHasLivePrivateDatastore.
Comment 32•3 years ago (Reporter)
(In reply to Jens Stutte [:jstutte] from comment #29)
> This crash report shows a suspicious sequence:
> Intermediate steps: 0.047000s: ClearRequestBase: Starting repeated directory removal #0 0.047000s: ClearRequestBase: Completed repeated directory removal #0 0.047000s: ClearRequestBase: Completed deleting files 0.047000s: ClearRequestBase: Starting deleting files 0.063000s: ClearRequestBase: Starting repeated directory removal #0 0.063000s: ClearRequestBase: Completed repeated directory removal #0 0.063000s: ClearRequestBase: Completed deleting files ... (continues until hang) ...

Since there's no info about the origin in the message, we can't say if it happens for the same origin directory or some other origin directory. The latter is ok.
Comment 33•3 years ago (Assignee)
The annotations continue to show that we have two main causes (accounting together for ~90% of the cases):
1. cache: Managers: 1 (https://aaaaaaaaaaaaaa.aaaaaaaaaa.aaa: Closing, ) - bug 1727526 did not yield any conclusion, yet
2. QM: 1 normal origin ops pending - this one merits more investigation
In order to investigate 2., it seems there is no way to know which operation is pending here? It would probably make a huge difference if it can be randomly any kind, or if we see only very few types of operations here.
Comment 34•3 years ago (Assignee)
(In reply to Jens Stutte [:jstutte] from comment #33)
> QM: 1 normal origin ops pending - this one merits more investigation

So the first ever crash with the new annotation shows:
- that the annotation works!
- that we directly trapped into a case with a very high number (35947) of pending ops
- that all those pending ops are dom::quota::ClearOriginOp
Now it seems difficult to believe (though not impossible) that we want to delete ~36K distinct origins here?
Comment 35•3 years ago (Reporter)
I'm not that surprised that a shutdown hang is caused by trying to do too much I/O, but 36K is quite a lot.
Bug 1733107 should help a bit.
(The de-duplication is probably a good idea).
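The de-duplication idea can be sketched in a few lines (an illustrative stand-in with invented names, not the real op queue): if tens of thousands of queued ClearOriginOp entries target far fewer distinct origins, keeping at most one pending clear per origin collapses the backlog.

```cpp
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical sketch: queue at most one pending clear per origin.
class ClearOpQueue {
  std::deque<std::string> mPending;         // origins in FIFO order
  std::unordered_set<std::string> mQueued;  // origins already queued

 public:
  // Returns false when the origin already has a pending clear (duplicate dropped).
  bool Enqueue(const std::string& aOrigin) {
    if (!mQueued.insert(aOrigin).second) {
      return false;
    }
    mPending.push_back(aOrigin);
    return true;
  }

  size_t PendingCount() const { return mPending.size(); }
};
```

A real implementation would also have to erase the origin from the set once its clear actually ran, so later clears for the same origin are not lost; the sketch only shows the collapsing effect on a burst of duplicates.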
Comment 36•3 years ago (Reporter)
There's also the repeated directory removal in the log, so it seems origin directories can't be deleted during shutdown, and the number of origins then grows and grows.
Comment 37•3 years ago (Reporter)
Or some third-party app interferes with us and slows down the clearing.
Comment 38•3 years ago (Assignee)
(In reply to Jan Varga [:janv] from comment #35)
> (The de-duplication is probably a good idea).

Only if we also reduce the verbosity of ClearRequestBase::DeleteFiles, which actually brings us to the retry loop (so it is good to have this verbosity!) where we seem to spend up to 10 times 200ms. That seems a bad idea if all removals are always failing, as seems to be the case here. So it is easy to spend some minutes here and hit the timeout while the computer is actually idle.
Comment 39•3 years ago (Assignee)
This comment says it already.
What if we just dispatched the same ClearRequestBaseOp another time if we were not able to remove everything, and moved on? We should have a retry count parameter to be able to break out, though.
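The re-dispatch idea can be sketched like this (a standalone toy with invented names, not Gecko's task queue API): instead of a blocking sleep-and-retry loop, a failed clear pass re-dispatches itself with a decremented retry budget, so the thread stays responsive between attempts.

```cpp
#include <deque>
#include <functional>

// Toy task queue standing in for an event-loop target.
struct TaskQueue {
  std::deque<std::function<void()>> tasks;
  void Dispatch(std::function<void()> t) { tasks.push_back(std::move(t)); }
  void DrainAll() {
    while (!tasks.empty()) {
      auto t = std::move(tasks.front());
      tasks.pop_front();
      t();
    }
  }
};

// tryClear returns true on success. attemptsUsed counts passes that ran.
// On failure we re-dispatch (no sleep) until the budget is exhausted.
void ClearWithRetryBudget(TaskQueue& q, std::function<bool()> tryClear,
                          int budget, int* attemptsUsed) {
  q.Dispatch([&q, tryClear, budget, attemptsUsed] {
    ++*attemptsUsed;
    if (!tryClear() && budget > 0) {
      ClearWithRetryBudget(q, tryClear, budget - 1, attemptsUsed);
    }
  });
}
```

Between two attempts, other queued shutdown work can run, which is the whole advantage over the 10 × 200ms blocking loop discussed above.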
Comment 40•3 years ago (Reporter)
I think we shouldn't repeat the clearing during shutdown. We should mark origin directories as invalid by putting marker files into them and then during storage initialization, remove such origin directories. I think there's a patch somewhere for the markers made by Tom. We might want to revive it ...
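The marker-file approach can be sketched with std::filesystem (paths and the marker name are illustrative, not the actual patch): at shutdown, a failing removal just drops a marker into the origin directory, and the next storage initialization sweeps marked directories before anything opens them.

```cpp
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical marker name; the real patch may use something else.
const char* kRemovalMarker = ".to-be-removed";

// At shutdown: cheap and non-blocking, no retry loop needed.
void MarkForRemoval(const fs::path& originDir) {
  std::ofstream(originDir / kRemovalMarker).put('\n');
}

// At next storage initialization: nothing has opened these directories
// yet, so removal is safe and can take as long as it needs.
void SweepMarkedOrigins(const fs::path& storageRoot) {
  for (const auto& entry : fs::directory_iterator(storageRoot)) {
    if (entry.is_directory() && fs::exists(entry.path() / kRemovalMarker)) {
      fs::remove_all(entry.path());
    }
  }
}
```

The design trade-off: the disk space is reclaimed one session later, but shutdown never blocks on a directory some third-party software is holding open.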
Comment 41•3 years ago (Assignee)
So I think we need to change our logic here. Instead of blindly retrying (and blocking), we should track the severity of the situation during deletion, based on the errors we get:

Errors that can be safely ignored (no retry):
- NS_ERROR_FILE_NOT_FOUND (already gone?)

Errors that might merit a retry:
- NS_ERROR_FILE_IS_LOCKED (could go away)
- NS_ERROR_FILE_DIR_NOT_EMPTY (something has been added while we were operating?)
- NS_ERROR_FILE_DEVICE_TEMPORARY_FAILURE (the name says it all)
- NS_ERROR_MODULE_WIN32 (mostly because we dunno what's in there)

All other errors do not merit a retry:
- NS_ERROR_FILE_ACCESS_DENIED (this might be borderline, but I don't think it is ever a temporary thing without user or third-party SW interaction)
- NS_ERROR_FILE_READ_ONLY (again, will not change)
- NS_ERROR_FILE_NAME_TOO_LONG (will never change)
- NS_ERROR_FILE_NOT_DIRECTORY (will never change)
- NS_ERROR_FILE_FS_CORRUPTED (we should stop torturing the FS)
- NS_ERROR_FILE_DEVICE_FAILURE (we should stop torturing the FS)
- NS_ERROR_FILE_INVALID_PATH (will never change)

Errors that are not supposed to happen in this context, and still do not merit a retry:
- NS_ERROR_OUT_OF_MEMORY
- NS_ERROR_FILE_NO_DEVICE_SPACE
- NS_ERROR_FILE_ALREADY_EXISTS
- NS_ERROR_FILE_INVALID_HANDLE

If we then really want to retry, we should do so by dispatching a new event rather than in a blocking loop with sleep.
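The classification above can be written as a single decision function. The nsresult names in the list are real Gecko codes, but the enum, its values, and the helper below are illustrative stand-ins for how such a triage might look, not the actual implementation:

```cpp
// Possible outcomes for a failed directory removal, per the triage above.
enum class RemovalVerdict { Ignore, Retry, Fail };

// Stand-ins for the nsresult codes discussed in the comment; the real
// code would switch on actual nsresult values.
enum class Err {
  FileNotFound, FileIsLocked, FileDirNotEmpty, DeviceTemporaryFailure,
  Win32Module, AccessDenied, ReadOnly, NameTooLong, NotDirectory,
  FsCorrupted, DeviceFailure, InvalidPath, OutOfMemory, NoDeviceSpace,
  AlreadyExists, InvalidHandle
};

RemovalVerdict ClassifyRemovalError(Err e) {
  switch (e) {
    case Err::FileNotFound:
      return RemovalVerdict::Ignore;  // already gone
    case Err::FileIsLocked:
    case Err::FileDirNotEmpty:
    case Err::DeviceTemporaryFailure:
    case Err::Win32Module:
      return RemovalVerdict::Retry;   // the condition could go away
    default:
      return RemovalVerdict::Fail;    // retrying will not change anything
  }
}
```

Everything not explicitly listed as ignorable or transient falls through to Fail, which matches the conservative stance of the list: only known-transient conditions earn another attempt.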
Comment 42•3 years ago (Assignee)
(In reply to Jan Varga [:janv] from comment #40)
> I think we shouldn't repeat the clearing during shutdown. We should mark origin directories as invalid by putting marker files into them and then during storage initialization, remove such origin directories. I think there's a patch somewhere for the markers made by Tom. We might want to revive it ...

OK, that is also an option, of course.
Comment 43•3 years ago (Assignee)
(In reply to Jens Stutte [:jstutte] from comment #38)
> the retry loop (so it is good to have this verbosity!) where we seem to spend up to 10 times 200ms which seems a bad idea if all removals are always failing as seems to be the case here. So it is easy to spend some minute here and hit the timeout while the computer is actually idle.

After having touched the code in bug 1760302, I understood that in the crash from comment 34 we actually did not wait; we just printed superfluous messages because of how the loop was designed. I still think we should do bug 1760302 in general, but this case seemed to be a different one.
Comment 44•3 years ago (Assignee)
Just a heads up that the patch from bug 1682326 (together with D139795 from bug 1757186) will most likely change the signatures here.
Comment 46•3 years ago (Assignee)
So looking at the most recent nightly crashes, I can see:
1. 24 cases where we do not see any annotation whatsoever. Example: 25785003-d2e6-4727-83ec-0ca7f0220425
2. 43 cases where we seem to be stuck at ClearRequestBase::DeleteFiles (causing gNormalOriginOps to not be cleared).
3. only 3 cases with other, client-specific reasons
The first seems to be just an error in the way we create the annotations: if we arrive there with gNormalOriginOps unset, we will crash unexpectedly before we set the annotations.
Edit: 1. is a regression from bug 1757186.
Comment 47•2 years ago
The bug is linked to topcrash signatures, which match the following criteria:
- Top 5 desktop browser crashes on Mac on release
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release
For more information, please visit auto_nag documentation.
Comment 48•2 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 49•2 years ago (Assignee)
(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #48)
> Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For some reason, most of the relevant signatures moved to bug 1592336, but we want to continue the investigation here. So I am moving them all over here and closing bug 1592336.
Comment 51•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to topcrash signatures, which match the following criteria:
- Top 5 desktop browser crashes on Mac on release
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release
For more information, please visit auto_nag documentation.
Comment 52•2 years ago
The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:
- Bug 1592336: S2
:jstutte, could you consider increasing the severity of this bug to S2?
For more information, please visit auto_nag documentation.
Comment 53•2 years ago (Assignee)
Until bug 1822498 is fixed we should track all signatures here.
Comment hidden (Intermittent Failures Robot)
Comment 56•2 years ago (Assignee)
(In reply to Intermittent Failures Robot from comment #55)
> For more details, see:
> https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1588498&startday=2023-03-27&endday=2023-04-02&tree=all
This specific failure shows we are stuck flushing files from SQLite:
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - Thread 34 QuotaManager IO
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - 0 ntdll.dll!ZwFlushBuffersFile + 0xc
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - eip = 0x77616a1c esp = 0x172fed14 ebp = 0x172feda8 ebx = 0x172fedb8
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - esi = 0x172fed7c edi = 0x00000af4 eax = 0x00000000 ecx = 0x00000000
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - edx = 0x00000000 eflags = 0x00000206
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - Found by: given as instruction pointer in context
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 1 xul.dll!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(void *, _IO_STATUS_BLOCK *) __attribute__((stdcall))>::operator()(void*&, _IO_STATUS_BLOCK*&) const [nsWindowsDllInterceptor.h:aff0b64247011767675227facc902a7968668686 : 150]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: inlining
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 2 xul.dll!(anonymous namespace)::InterposedNtFlushBuffersFile(void*, _IO_STATUS_BLOCK*) [PoisonIOInterposerWin.cpp:aff0b64247011767675227facc902a7968668686 : 399 + 0x7]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x6638080d esp = 0x172fed18 ebp = 0x172feda8 ebx = 0x172fedb8
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x172fed7c edi = 0x00000af4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 3 KERNELBASE.dll!FlushFileBuffers + 0x4b
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x766da95c esp = 0x172fedb0 ebp = 0x172fedc0 ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x00000af4 edi = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 4 nss3.dll!winSync(sqlite3_file*, int) [sqlite3.c:aff0b64247011767675227facc902a7968668686 : 48137 + 0x11]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x7185960a esp = 0x172fedc8 ebp = 0x172fefd0 ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x00000af4 edi = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 5 xul.dll!(anonymous namespace)::xSync(sqlite3_file*, int) [TelemetryVFS.cpp:aff0b64247011767675227facc902a7968668686 : 222 + 0x14]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x66ecb348 esp = 0x172fefd8 ebp = 0x172ff030 ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x0000024e edi = 0x71d56d40
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
I would expect this to become a problem only when there is extremely slow disk I/O; it is a bit weird to see this in CI. Or am I overlooking something, and there is more to investigate here?
Comment hidden (Intermittent Failures Robot)
Comment 58•1 year ago
Jens, these crashes seem to have spikes in the 114 beta cycle compared to the 113 cycle, anything that could explain that we could fix? Thanks
Comment 59•1 year ago (Assignee)
Yeah, comparing the same beta versions from 113 and 114 looks quite bad:
1 114.0b3 339 35.17 %
2 114.0b2 313 32.47 %
3 114.0b4 125 12.97 %
4 114.0b1 86 8.92 %
5 113.0b4 61 6.33 %
6 113.0b3 23 2.39 %
7 113.0b2 14 1.45 %
8 113.0b1 3 0.31 %
We need to take a closer look for sure.
Comment 60•1 year ago (Reporter)
Bug 1733107 should help with reducing shutdown hangs, it's currently our top priority to fix the bug and eventually uplift patches to beta.
Comment 61•1 year ago
Adding bug 1733107 to dependencies then. I am tracking bug 1733107 for a 114 beta uplift.
Comment 62•1 year ago
The bug is linked to topcrash signatures, which match the following criteria:
- Top 5 desktop browser crashes on Mac on release
- Top 20 desktop browser crashes on release (startup)
- Top 20 desktop browser crashes on beta
- Top 10 desktop browser crashes on nightly
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release (startup)
For more information, please visit BugBot documentation.
Comment 63•1 year ago (Assignee)
There is a very weird upwards tendency for ESR crashes only, it seems. I cannot see any correlation to the corresponding normal versions, though. And even inside the 102.x series it is increasing over time, where I would not expect any functional changes landed that could influence here. And it happens also for 115.x, while the release channel seems to have a slow but steady downwards tendency all the time.
Comment 64•1 year ago (Assignee)
(In reply to Jens Stutte [:jstutte] from comment #63)
> There is a very weird upwards tendency for ESR crashes only, it seems.

That's bug 1849352, or rather, that bug would help to "mitigate" this. The upside is that there is no real increase affecting our users.