[meta] Remaining issues that may cause QuotaManager shutdown hangs
Categories
(Core :: Storage: Quota Manager, defect, P2)
Tracking
()
People
(Reporter: janv, Assigned: jstutte)
References
(Depends on 12 open bugs, Blocks 1 open bug)
Details
(4 keywords, Whiteboard: [tbird crash])
Crash Data
Attachments
(2 files)
Updated•5 years ago
Comment 1•4 years ago (Assignee)
Very high volume!
[@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
[@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ]
Normal/Low volume
[@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent ]
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
[@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke ]
[@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut ]
[@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
[@ shutdownhang | RtlAcquireSRWLockExclusive | mozilla::SpinEventLoopUntil<T> | mozilla::dom::quota::QuotaManager::Observer::Observe ]
Thunderbird 78.x only
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]
68 only:
[@ mozilla::dom::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
Comment 2•4 years ago (Assignee)
This attachment contains the value of QuotaManagerShutdownTimeout where set.
Comment 3•4 years ago (Assignee)
Apparently the volume spiked here after we unified the Quota client shutdown timeouts in bug 1681670. In turn, the volume in other bugs (like bug 1542485 and bug 1541776) went down by the same order of magnitude.
So we can be fairly confident that we did not regress the overall quality of QuotaManager shutdown; we just moved some signatures around and hopefully now have a better perspective on them.
Is there an update on this?
I really need this bug fixed, as it's severely affecting my Firefox performance, making it really bad under prolonged or heavy use. I've had Firefox be slow, freeze and crash for 3 hours straight due to this bug. I hit this bug every three days to a week.
Are you having problems reproducing this bug? I can sometimes reproduce it on my computer through "stress testing", but it appears sporadically and it appears to have to do with how memory is allocated. It appears more often once the Windows memory caching (Superfetch) is off.
Comment 5•4 years ago (Assignee)
Hi :desbest, we are working hard to make this bug actionable. This bug is supposed to have a noticeable impact only during shutdown, leaving the Firefox process open for some more seconds than normal and then reporting a crash after the shutdown timeout. If this is what you are seeing, then reliable STR (steps to reproduce) or a captured profile could help move this forward. If you have any other concrete problem during use, please file a new bug, ideally with clear STR and/or a captured profile.
In my experience this bug also affects the performance of Firefox before it is shut down: Firefox is slower in sessions where this bug will happen on shutdown.
It's hard to provide reliable STR, because if I follow the same instructions, sometimes the bug will happen and sometimes it won't.
Does a captured profile mean to use the Firefox Profiler? https://profiler.firefox.com/
Comment 8•4 years ago (Assignee)
(In reply to desbest from comment #6)
> In my experience this bug also affects the performance of Firefox before it is shut down: Firefox is slower in sessions where this bug will happen on shutdown.

The browser might be in some weird state that also affects shutdown, of course. Is there any other load (other programs with high CPU usage) on the machine when this happens?

> It's hard to provide reliable STR, because if I follow the same instructions, sometimes the bug will happen and sometimes it won't.
> Does a captured profile mean to use the Firefox Profiler? https://profiler.firefox.com/

Yes, exactly. It could at least give hints about what is slowing down your instance in general. Please choose the "Firefox Platform" setting in the profiler plugin while recording. Or, if you suspect IndexedDB to be involved, you might want to use a custom profile and switch on the feature "All File IO" and maybe the "DOM Worker" thread.
I can't get the profiler to work.
After finishing capturing, when I click the upload button nothing happens, so I can't share my capture with others.
Also, how do I profile the startup and shutdown of Firefox?
https://profiler.firefox.com/docs/#/./guide-startup-shutdown
Where do I put the environment variable? Do I have to change the windows settings like in the image below?
Comment 10•4 years ago
It won't let me post images. The images are here. https://imgur.com/a/uqGwi1C
Comment 11•4 years ago
I've spent 4 days trying to reproduce bugs and I've noticed two things using 2 methods.
===========
For method one, when I use the profiler to create a profile while reproducing the bug, Firefox freezes and never resumes. I waited as long as I could, and it took 12 hours before the freeze ended and Firefox spontaneously closed. So I couldn't get the profiler to work, and I can't reproduce the bug using the first method unless the profiler stops freezing.
===========
For method two: I use the Pocket Tube extension and wait for it to finish loading. If I then open a YouTube video to play it, click the tag icon at the top that Pocket Tube generates, and close Firefox, there will be a crash. However, the crash signature is different each time. Here is a profiler capture: https://share.firefox.dev/3mO77vC
Firefox 87.0 Crash Report [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sqlite3_finalize ]
Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]
Comment 12•4 years ago (Assignee)
(In reply to desbest from comment #11)
> I've spent 4 days trying to reproduce bugs and I've noticed two things using 2 methods.

Thanks for insisting!

> Firefox 87.0 Crash Report [@ shutdownhang | RtlpWaitOnAddressWithTimeout | RtlpWaitOnAddress | RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | sqlite3_finalize ]

Here is a corresponding crash report. We try to finalize statements in places.sqlite on the main thread while there is another thread (Thread 66, Name: mozStorage #2) executing something on another sqlite DB.

> Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

There are quite a few reports with similar signatures. Again we see places.sqlite involved here; we run against this shutdown barrier. We might just be waiting on some slow transaction.
:mak, can you see something obvious here?
Comment 13•4 years ago (Assignee)
See also bug 1421562?
Comment 14•4 years ago (Assignee)
BTW: the signatures mentioned in comment #11 should not be considered part of this specific bug; they are yet another kind of async shutdown timeout.
Comment 15•4 years ago (Assignee)
:desbest, do you by chance have the setting "Privacy & Security" -> "History" set to "Never remember history" or "Clear history when Nightly closes"? Then I suspect you are affected by something similar to bug 1700584.
Comment 16•4 years ago
It’s set to remember history.
Comment 17•4 years ago
(In reply to Jens Stutte [:jstutte] from comment #12)
> We try to finalize statements in places.sqlite on the main thread while there is another thread (Thread 66, Name: mozStorage #2) executing something on another sqlite DB.

How can you tell it's another db and not the same? It could be:
- thread 2 is executing some very long and expensive query on places.sqlite, and the main thread is just waiting for it to finish
- there is a deadlock between thread 2 and the main thread, both on places.sqlite, that will never be resolved even waiting forever
- thread 2 is executing Sync and it's taking too long

> Firefox 87.0 Crash Report [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | Transaction (114),places.sqlite#1: PlacesExpiration.jsm: expire (115) ]

This is an interesting signature; we should file a separate bug for it. Since it looks like many of these timeouts are related to removing orphan icons, there may be a common cause. I don't expect that to be related to QuotaManager or indexedDB anyway.
Apparently this is hanging on "DELETE FROM moz_pages_w_icons WHERE page_url_hash NOT IN (SELECT url_hash FROM moz_places) OR id NOT IN (SELECT DISTINCT page_id FROM moz_icons_to_pages)".
There may be various reasons:
- there is some database corruption
- there is an unknown deadlock related to ATTACHed databases
- removing orphan icons is just very slow under certain circumstances
Did we go through the usual troubleshooting steps?
First thing I'd suggest is to run the Places Database Integrity check from about:support, and post the resulting log here or in the newly filed bug. Then check if things improved after a restart.
Another interesting experiment would be to pretty much drop favicons: go to the profile folder while Firefox is closed and move favicons.sqlite (and favicons.sqlite-wal) elsewhere. Launching Firefox should generate a new one and potentially resolve any unexpected perf issue with it, but it will also lose all the favicons on bookmarks until you revisit them. You can always restore the previously moved file after the experiment. If that solves the problem, I may be interested in getting more statistics about that db.
But, as I said, I'd suggest moving the PlacesExpiration.jsm investigation to a separate bug. I don't expect these Places crashes to have any relation to DOM Storage or QuotaManager.
Comment 18•4 years ago (Assignee)
(In reply to Marco Bonardo [:mak] from comment #17)
> But, as I said, I'd suggest moving the PlacesExpiration.jsm investigation to a separate bug. I don't expect these Places crashes to have any relation to DOM Storage or QuotaManager.

Spun off bug 1706298.
Comment 19•3 years ago (Assignee)
Looking at one of the cache related hangs I cannot see any thread doing any work.
The annotation reads:
cache: Managers: 1 (https://aaa.aa: Closing, )
Intermediate steps:
0.000000s: starting
IIUC, the shutdown sequence in QuotaManager::Shutdown is the following:
1. Call StopIdleMaintenance(); (does nothing except for IDB, where it just sets an aborted flag).
2. Call InitiateShutdownWorkThreads(); for each client type. This writes the "starting" message into the annotations.
2.a In the case of cache this synchronously calls the sequence:
Client::InitiateShutdownWorkThreads()
-> CacheQuotaClient::InitiateShutdown()
-> Manager::Factory::ShutdownAll()
-> (for each manager) Manager::Shutdown()
-> (for each Context) Context::CancelAll()
-> Context::ThreadsafeHandle::AllowToClose()
Here it gets interesting, as we might dispatch dom::cache::Context::ThreadsafeHandle::AllowToCloseOnOwningThread to the owning thread. Interestingly, this code is never hit in our test automation?
-> If all goes well we then get into Context::ThreadsafeHandle::AllowToCloseOnOwningThread() and nuke our Context reference, which should destroy the Context and, through its destructor, arrive at Factory::Remove(*this).
-> If this was the last Context, Factory::Remove will destroy the factory singleton and thus unblock the shutdown. This also generates the "Manager removed" annotation.
2.b It then decides via CacheQuotaClient::IsShutdownCompleted() whether we are already completely shut down.
3. In case we still need to wait for someone, enter the SpinEventLoopUntil.
It seems that even after the cache shutdown we still have a living sFactory that prevents the event loop from breaking.
So the doubt would be: we dispatch dom::cache::Context::ThreadsafeHandle::AllowToCloseOnOwningThread with MOZ_ALWAYS_SUCCEEDS. But if this dispatch fails, we might end up with a live lock? From our automation it is not even clear if we ever try to dispatch this message, though.
Comment 20•3 years ago (Assignee)
Another possibility could be that we do not free all Contexts, of course. In that sense I was asking myself whether aOldManager->mContext is ever nulled out if there is an aOldManager passed here?
Comment 21•3 years ago (Assignee)
We want to enforce the (assumed) invariant that after we have started QM shutdown we should not create new contexts and associated threads for cache IO. We do this by generating a runtime error, whose handling might not yet be consistent in all (unexpected) cases. But this is preferable over a shutdown hang crash, for sure.
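The invariant described above can be sketched as a small guard (a hypothetical illustration with invented names, not the actual Gecko implementation): once shutdown has begun, context creation fails with a hard error the caller must handle, instead of silently spawning new IO work.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins: the real cache Context/Factory look different.
struct Context {};

class ContextFactory {
  bool mShutdownStarted = false;

 public:
  void BeginShutdown() { mShutdownStarted = true; }

  std::unique_ptr<Context> Create() {
    if (mShutdownStarted) {
      // Preferable to a shutdown hang: surface a runtime error right away.
      throw std::runtime_error("cache Context creation after shutdown started");
    }
    return std::make_unique<Context>();
  }
};
```

The point of the design is that the failure is loud and immediate at the creation site, rather than a new thread quietly keeping shutdown alive until the watchdog fires.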
Comment 22•3 years ago (Assignee)
This patch is not really meant to fix the hangs.
Comment 23•3 years ago
Comment 24•3 years ago
bugherder
Comment 25•3 years ago
https://support.mozilla.org/en-US/questions/1351032
Issue solved by updating Avast.
bp-b763a58f-97f0-4df9-9205-44ffc0210919
bp-7b2b3a35-6b01-4672-a8d2-9dd940210905
Comment 26•3 years ago
Added a signature
Comment 27•3 years ago (Assignee)
Signature cleanup:
[@ static mozilla::dom::quota::Client::InitiateShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
Appears only on 85.
[@ static mozilla::dom::quota::QuotaManager::Shutdown::<T>::operator()::<T>::__invoke ]
Very active
[@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | nsAppShell::ProcessNextNativeEvent ]
rare but active
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownTimedOut ]
<= 78
[@ mozilla::dom::indexedDB::`anonymous namespace'::QuotaClient::ShutdownWorkThreads::<T>::__invoke ]
inactive
[@ mozilla::dom::`anonymous namespace'::QuotaClient::ShutdownTimedOut ]
inactive
[@ static mozilla::dom::quota::Client::ShutdownWorkThreads::<T>::operator()::<T>::__invoke ]
inactive
[@ mozilla::dom::indexedDB::(anonymous namespace)::QuotaClient::ShutdownWorkThreads::$_52::__invoke ]
inactive
[@ mozilla::dom::quota::QuotaManager::Shutdown::{lambda#1}::operator() const::{lambda#1}::_FUN ]
new but rare
Comment 28•3 years ago
This is still the #1 overall non-OOM topcrash on release.
Comment 29•3 years ago (Assignee)
This crash report shows a suspicious sequence:
Intermediate steps:
0.047000s: ClearRequestBase: Starting repeated directory removal #0
0.047000s: ClearRequestBase: Completed repeated directory removal #0
0.047000s: ClearRequestBase: Completed deleting files
0.047000s: ClearRequestBase: Starting deleting files
0.063000s: ClearRequestBase: Starting repeated directory removal #0
0.063000s: ClearRequestBase: Completed repeated directory removal #0
0.063000s: ClearRequestBase: Completed deleting files
... (continues until hang) ...
Comment 30•3 years ago (Assignee)
ls: Datastores: 34 (DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:https://(removed)=D|PrivateBrowsingId:0|Closed:1, DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:https://(removed)=DD|PrivateBrowsingId:0|Closed:1, DirectoryLock:1|Connection:1|QuotaObject:1|PrepareDatastoreOps:0|PreparedDatastores:0|Databases:0|ActiveDatabases:0|Origin:moz-extension://(removed)|PrivateBrowsingId:0|Closed:1)
seems to show un-removed closed LS connections?
Comment 31•3 years ago (Reporter)
When a Datastore is closed we initiate closing of the connection, and we release the directory lock after that. We unregister the Datastore object only when the connection is closed; otherwise a new Datastore could try to touch the same database on disk. I think the log from comment 30 falls into the group of "too much work during shutdown". If there are too many open LS connections, it can happen that we won't have a chance to close them all in time. I filed bug 1592593 for this in the past. I think that one could be a good "first project" bug.
If bug 1592593 doesn't help enough, we might want to investigate adding more threads for I/O stuff. There's only one at the moment.
Anyway, I just noticed that we need to update Datastore::Stringify to include mHasLivePrivateDatastore.
Comment 32•3 years ago (Reporter)
(In reply to Jens Stutte [:jstutte] from comment #29)
> This crash report shows a suspicious sequence:
> Intermediate steps: 0.047000s: ClearRequestBase: Starting repeated directory removal #0 0.047000s: ClearRequestBase: Completed repeated directory removal #0 0.047000s: ClearRequestBase: Completed deleting files 0.047000s: ClearRequestBase: Starting deleting files 0.063000s: ClearRequestBase: Starting repeated directory removal #0 0.063000s: ClearRequestBase: Completed repeated directory removal #0 0.063000s: ClearRequestBase: Completed deleting files ... (continues until hang) ...

Since there's no info about the origin in the message, we can't say if it happens for the same origin directory or some other origin directory. The latter is ok.
Comment 33•3 years ago (Assignee)
The annotations continue to show that we have two main causes (accounting together for ~90% of the cases):
1. cache: Managers: 1 (https://aaaaaaaaaaaaaa.aaaaaaaaaa.aaa: Closing, ) - bug 1727526 did not yield any conclusion, yet
2. QM: 1 normal origin ops pending - this one merits more investigation
In order to investigate 2., it seems there is no way to know which operation is pending here? It would probably make a huge difference if it can be randomly any kind, or if we see only very few types of operations here.
Comment 34•3 years ago (Assignee)
(In reply to Jens Stutte [:jstutte] from comment #33)
> QM: 1 normal origin ops pending - this one merits more investigation

So the first ever crash with the new annotation shows:
- that the annotation works!
- that we directly trapped into a case with a very high number (35947) of pending ops
- that all those pending ops are dom::quota::ClearOriginOp
Now it seems difficult to believe (though not impossible) that we want to delete ~36K distinct origins here?
Comment 35•3 years ago (Reporter)
I'm not that surprised that a shutdown hang is caused by trying to do too much I/O, but 36K is quite a lot.
Bug 1733107 should help a bit.
(The de-duplication is probably a good idea).
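The de-duplication idea can be sketched in a few lines (an illustrative stand-in with invented names, not the real op queue): if tens of thousands of queued ClearOriginOp entries target far fewer distinct origins, keeping at most one pending clear per origin collapses the backlog.

```cpp
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical sketch: queue at most one pending clear per origin.
class ClearOpQueue {
  std::deque<std::string> mPending;         // origins in FIFO order
  std::unordered_set<std::string> mQueued;  // origins already queued

 public:
  // Returns false when the origin already has a pending clear (duplicate dropped).
  bool Enqueue(const std::string& aOrigin) {
    if (!mQueued.insert(aOrigin).second) {
      return false;
    }
    mPending.push_back(aOrigin);
    return true;
  }

  size_t PendingCount() const { return mPending.size(); }
};
```

A real implementation would also have to erase the origin from the set once its clear actually ran, so later clears for the same origin are not lost; the sketch only shows the collapsing effect on a burst of duplicates.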
Comment 36•3 years ago (Reporter)
There's also the repeated directory removal in the log, so it seems origin directories can't be deleted during shutdown, and the number of origins then grows and grows.
Comment 37•3 years ago (Reporter)
Or some third-party app interferes with us and slows down the clearing.
Comment 38•3 years ago (Assignee)
(In reply to Jan Varga [:janv] from comment #35)
> (The de-duplication is probably a good idea).

Only if we also reduce the verbosity of ClearRequestBase::DeleteFiles, which actually brings us to the retry loop (so it is good to have this verbosity!) where we seem to spend up to 10 times 200ms. That seems a bad idea if all removals are always failing, as seems to be the case here. So it is easy to spend some minutes here and hit the timeout while the computer is actually idle.
Comment 39•3 years ago (Assignee)
This comment says it already.
What if we just dispatched the same ClearRequestBaseOp another time if we were not able to remove everything, and moved on? We should have a retry count parameter to be able to break out, though.
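The re-dispatch idea can be sketched like this (a standalone toy with invented names, not Gecko's task queue API): instead of a blocking sleep-and-retry loop, a failed clear pass re-dispatches itself with a decremented retry budget, so the thread stays responsive between attempts.

```cpp
#include <deque>
#include <functional>

// Toy task queue standing in for an event-loop target.
struct TaskQueue {
  std::deque<std::function<void()>> tasks;
  void Dispatch(std::function<void()> t) { tasks.push_back(std::move(t)); }
  void DrainAll() {
    while (!tasks.empty()) {
      auto t = std::move(tasks.front());
      tasks.pop_front();
      t();
    }
  }
};

// tryClear returns true on success. attemptsUsed counts passes that ran.
// On failure we re-dispatch (no sleep) until the budget is exhausted.
void ClearWithRetryBudget(TaskQueue& q, std::function<bool()> tryClear,
                          int budget, int* attemptsUsed) {
  q.Dispatch([&q, tryClear, budget, attemptsUsed] {
    ++*attemptsUsed;
    if (!tryClear() && budget > 0) {
      ClearWithRetryBudget(q, tryClear, budget - 1, attemptsUsed);
    }
  });
}
```

Between two attempts, other queued shutdown work can run, which is the whole advantage over the 10 × 200ms blocking loop discussed above.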
Comment 40•3 years ago (Reporter)
I think we shouldn't repeat the clearing during shutdown. We should mark origin directories as invalid by putting marker files into them and then during storage initialization, remove such origin directories. I think there's a patch somewhere for the markers made by Tom. We might want to revive it ...
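The marker-file approach can be sketched with std::filesystem (paths and the marker name are illustrative, not the actual patch): at shutdown, a failing removal just drops a marker into the origin directory, and the next storage initialization sweeps marked directories before anything opens them.

```cpp
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical marker name; the real patch may use something else.
const char* kRemovalMarker = ".to-be-removed";

// At shutdown: cheap and non-blocking, no retry loop needed.
void MarkForRemoval(const fs::path& originDir) {
  std::ofstream(originDir / kRemovalMarker).put('\n');
}

// At next storage initialization: nothing has opened these directories
// yet, so removal is safe and can take as long as it needs.
void SweepMarkedOrigins(const fs::path& storageRoot) {
  for (const auto& entry : fs::directory_iterator(storageRoot)) {
    if (entry.is_directory() && fs::exists(entry.path() / kRemovalMarker)) {
      fs::remove_all(entry.path());
    }
  }
}
```

The design trade-off: the disk space is reclaimed one session later, but shutdown never blocks on a directory some third-party software is holding open.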
Comment 41•3 years ago (Assignee)
So I think we need to change our logic here. Instead of blindly retrying (and blocking), we should track the severity of the situation during deletion, based on the errors we get:

Errors that can be safely ignored (no retry):
- NS_ERROR_FILE_NOT_FOUND (already gone?)

Errors that might merit a retry:
- NS_ERROR_FILE_IS_LOCKED (could go away)
- NS_ERROR_FILE_DIR_NOT_EMPTY (something has been added while we were operating?)
- NS_ERROR_FILE_DEVICE_TEMPORARY_FAILURE (the name says it all)
- NS_ERROR_MODULE_WIN32 (mostly because we dunno what's in there)

All other errors do not merit a retry:
- NS_ERROR_FILE_ACCESS_DENIED (this might be borderline, but I don't think it is ever a temporary thing without user or third-party SW interaction)
- NS_ERROR_FILE_READ_ONLY (again, will not change)
- NS_ERROR_FILE_NAME_TOO_LONG (will never change)
- NS_ERROR_FILE_NOT_DIRECTORY (will never change)
- NS_ERROR_FILE_FS_CORRUPTED (we should stop torturing the FS)
- NS_ERROR_FILE_DEVICE_FAILURE (we should stop torturing the FS)
- NS_ERROR_FILE_INVALID_PATH (will never change)

Errors that are not supposed to happen in this context, and still do not merit a retry:
- NS_ERROR_OUT_OF_MEMORY
- NS_ERROR_FILE_NO_DEVICE_SPACE
- NS_ERROR_FILE_ALREADY_EXISTS
- NS_ERROR_FILE_INVALID_HANDLE

If we then really want to retry, we should do so by dispatching a new event rather than in a blocking loop with sleep.
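The classification above can be written as a single decision function. The nsresult names in the list are real Gecko codes, but the enum, its values, and the helper below are illustrative stand-ins for how such a triage might look, not the actual implementation:

```cpp
// Possible outcomes for a failed directory removal, per the triage above.
enum class RemovalVerdict { Ignore, Retry, Fail };

// Stand-ins for the nsresult codes discussed in the comment; the real
// code would switch on actual nsresult values.
enum class Err {
  FileNotFound, FileIsLocked, FileDirNotEmpty, DeviceTemporaryFailure,
  Win32Module, AccessDenied, ReadOnly, NameTooLong, NotDirectory,
  FsCorrupted, DeviceFailure, InvalidPath, OutOfMemory, NoDeviceSpace,
  AlreadyExists, InvalidHandle
};

RemovalVerdict ClassifyRemovalError(Err e) {
  switch (e) {
    case Err::FileNotFound:
      return RemovalVerdict::Ignore;  // already gone
    case Err::FileIsLocked:
    case Err::FileDirNotEmpty:
    case Err::DeviceTemporaryFailure:
    case Err::Win32Module:
      return RemovalVerdict::Retry;   // the condition could go away
    default:
      return RemovalVerdict::Fail;    // retrying will not change anything
  }
}
```

Everything not explicitly listed as ignorable or transient falls through to Fail, which matches the conservative stance of the list: only known-transient conditions earn another attempt.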
Comment 42•3 years ago (Assignee)
(In reply to Jan Varga [:janv] from comment #40)
> I think we shouldn't repeat the clearing during shutdown. We should mark origin directories as invalid by putting marker files into them and then during storage initialization, remove such origin directories. I think there's a patch somewhere for the markers made by Tom. We might want to revive it ...

OK, that is also an option, of course.
Comment 43•3 years ago (Assignee)
(In reply to Jens Stutte [:jstutte] from comment #38)
> the retry loop (so it is good to have this verbosity!) where we seem to spend up to 10 times 200ms which seems a bad idea if all removals are always failing as seems to be the case here. So it is easy to spend some minute here and hit the timeout while the computer is actually idle.

After having touched the code in bug 1760302, I understood that in the crash from comment 34 we actually did not wait; we just printed superfluous messages because of how the loop was designed. I still think we should do bug 1760302 in general, but this case seemed to be a different one.
Comment 44•3 years ago (Assignee)
Just a heads up that the patch from bug 1682326 (together with D139795 from bug 1757186) will most likely change the signatures here.
Comment 46•3 years ago (Assignee)
So looking at the most recent nightly crashes, I can see:
1. 24 cases where we do not see any annotation whatsoever. Example: 25785003-d2e6-4727-83ec-0ca7f0220425
2. 43 cases where we seem to be stuck at ClearRequestBase::DeleteFiles (causing gNormalOriginOps to not be cleared).
3. only 3 cases with other, client-specific reasons
The first seems to be just an error in the way we create the annotations: if we arrive there with gNormalOriginOps unset, we will crash unexpectedly before we set the annotations.
Edit: 1. is a regression from bug 1757186.
Comment 47•2 years ago
The bug is linked to topcrash signatures, which match the following criteria:
- Top 5 desktop browser crashes on Mac on release
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release
For more information, please visit auto_nag documentation.
Comment 48•2 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 49•2 years ago (Assignee)
(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #48)
> Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For some reason, most of the relevant signatures moved to bug 1592336, but we want to continue the investigation here. So I am moving them all over here and closing bug 1592336.
Comment 51•2 years ago
Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to topcrash signatures, which match the following criteria:
- Top 5 desktop browser crashes on Mac on release
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release
For more information, please visit auto_nag documentation.
Comment 52•2 years ago
The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:
- Bug 1592336: S2
:jstutte, could you consider increasing the severity of this bug to S2?
For more information, please visit auto_nag documentation.
Comment 53•2 years ago (Assignee)
Until bug 1822498 is fixed we should track all signatures here.
Comment hidden (Intermittent Failures Robot)
Comment 56•2 years ago (Assignee)
(In reply to Intermittent Failures Robot from comment #55)
> For more details, see:
> https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1588498&startday=2023-03-27&endday=2023-04-02&tree=all
This specific failure shows we are stuck flushing files from SQLite:
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - Thread 34 QuotaManager IO
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - 0 ntdll.dll!ZwFlushBuffersFile + 0xc
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - eip = 0x77616a1c esp = 0x172fed14 ebp = 0x172feda8 ebx = 0x172fedb8
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - esi = 0x172fed7c edi = 0x00000af4 eax = 0x00000000 ecx = 0x00000000
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - edx = 0x00000000 eflags = 0x00000206
[task 2023-03-28T06:19:16.423Z] 06:19:16 INFO - Found by: given as instruction pointer in context
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 1 xul.dll!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(void *, _IO_STATUS_BLOCK *) __attribute__((stdcall))>::operator()(void*&, _IO_STATUS_BLOCK*&) const [nsWindowsDllInterceptor.h:aff0b64247011767675227facc902a7968668686 : 150]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: inlining
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 2 xul.dll!(anonymous namespace)::InterposedNtFlushBuffersFile(void*, _IO_STATUS_BLOCK*) [PoisonIOInterposerWin.cpp:aff0b64247011767675227facc902a7968668686 : 399 + 0x7]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x6638080d esp = 0x172fed18 ebp = 0x172feda8 ebx = 0x172fedb8
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x172fed7c edi = 0x00000af4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 3 KERNELBASE.dll!FlushFileBuffers + 0x4b
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x766da95c esp = 0x172fedb0 ebp = 0x172fedc0 ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x00000af4 edi = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 4 nss3.dll!winSync(sqlite3_file*, int) [sqlite3.c:aff0b64247011767675227facc902a7968668686 : 48137 + 0x11]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x7185960a esp = 0x172fedc8 ebp = 0x172fefd0 ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x00000af4 edi = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - 5 xul.dll!(anonymous namespace)::xSync(sqlite3_file*, int) [TelemetryVFS.cpp:aff0b64247011767675227facc902a7968668686 : 222 + 0x14]
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - eip = 0x66ecb348 esp = 0x172fefd8 ebp = 0x172ff030 ebx = 0x04db55d4
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - esi = 0x0000024e edi = 0x71d56d40
[task 2023-03-28T06:19:16.424Z] 06:19:16 INFO - Found by: call frame info
I would expect this to become a problem only when there is extremely slow disk I/O; it is a bit weird to see this in CI. Or am I overlooking something, and there is more to investigate here?
Comment hidden (Intermittent Failures Robot)
Comment 58•1 year ago
Jens, these crashes seem to have spikes in the 114 beta cycle compared to the 113 cycle, anything that could explain that we could fix? Thanks
Comment 59•1 year ago (Assignee)
Yeah, comparing the same beta versions from 113 and 114 looks quite bad:
1 114.0b3 339 35.17 %
2 114.0b2 313 32.47 %
3 114.0b4 125 12.97 %
4 114.0b1 86 8.92 %
5 113.0b4 61 6.33 %
6 113.0b3 23 2.39 %
7 113.0b2 14 1.45 %
8 113.0b1 3 0.31 %
We need to take a closer look for sure.
Comment 60•1 year ago (Reporter)
Bug 1733107 should help with reducing shutdown hangs, it's currently our top priority to fix the bug and eventually uplift patches to beta.
Comment 61•1 year ago
Adding bug 1733107 to dependencies then. I am tracking bug 1733107 for a 114 beta uplift.
Comment 62•1 year ago
The bug is linked to topcrash signatures, which match the following criteria:
- Top 5 desktop browser crashes on Mac on release
- Top 20 desktop browser crashes on release (startup)
- Top 20 desktop browser crashes on beta
- Top 10 desktop browser crashes on nightly
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release (startup)
For more information, please visit BugBot documentation.
Comment 63•1 year ago (Assignee)
There is a very weird upwards tendency for ESR crashes only, it seems. I cannot see any correlation to the corresponding normal versions, though. And even inside the 102.x series it is increasing over time, where I would not expect any functional changes landed that could influence here. And it happens also for 115.x, while the release channel seems to have a slow but steady downwards tendency all the time.
Comment 64•1 year ago (Assignee)
(In reply to Jens Stutte [:jstutte] from comment #63)
> There is a very weird upwards tendency for ESR crashes only, it seems.

That's bug 1849352, or rather, that bug would help to "mitigate" this. The upside is that there is no real increase affecting our users.