XPCOM/XPConnect should do a "canary" load of the async shutdown service (Was: fast startup crash in ServiceWorkerRegistrar::ProfileStarted)
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: sfink, Unassigned)
References
Details
(4 keywords, Whiteboard: [startupcrash] [qa-not-actionable])
Crash Data
Attachments
(3 files)
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Comment 2•6 years ago
|
||
Comment 3•6 years ago
|
||
Comment 4•6 years ago
|
||
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Comment 5•6 years ago
|
||
Reporter | ||
Comment 6•6 years ago
|
||
Reporter | ||
Comment 7•6 years ago
|
||
Reporter | ||
Comment 8•6 years ago
|
||
Comment 9•6 years ago
|
||
Reporter | ||
Comment 10•6 years ago
|
||
Comment 11•5 years ago
|
||
(In reply to Steve Fink [:sfink] [:s:] from comment #10)
Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)
I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?
Comment 12•5 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #11)
(In reply to Steve Fink [:sfink] [:s:] from comment #10)
Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?
I think this amounts to a request to create an infallible do_getService
variant called do_getServiceThatDefinitelyExists
which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.
Needinfo-ing :bholley the XPConnect owner because that's where the bulk of the XPConnect scenario lives.
Another possibility lower down would be that the lower level JAR protocol stuff should be performing some type of integrity check on omni.ja that validates some mixture of:
- not truncated
- passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)
- verifies there's an expected build identifier signature baked in to various places in the file (possibly via sentinel zip/jar entries?)
Needinfo-ing :dragana as owner of Necko where Core::Networking: JAR lives.
:bholley/:dragana note that you may want to take a look at bug 1403348 which is where most of the diagnostic effort related to this problem has previously lived. (And where in https://bugzilla.mozilla.org/show_bug.cgi?id=1403348#c79 there are suggestions the problem isn't just an application update problem because it happens on Android, which potentially also means it's also not rogue anti-virus software.)
Comment 13•5 years ago
|
||
Michal, can you take a look?
Comment 14•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)
I think this amounts to a request to create an infallible
do_getService
variant calleddo_getServiceThatDefinitelyExists
which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.
I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?
Comment 15•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)
- passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)
Corrupted files should be detected when checking CRC at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/modules/libjar/nsJARInputStream.cpp#333
Comment 16•5 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #14)
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)
I think this amounts to a request to create an infallible
do_getService
variant calleddo_getServiceThatDefinitelyExists
which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?
I'm not super-excited about this, just because of the potential for people putting it along paths that eventually get called during shutdown. Or people will start replacing code like:
nsCOMPtr<nsIObserverService> obsService = ...;
if (obsService) { ... }
with:
nsCOMPtr<nsIObserverService> obsService = do_GetServiceInfallible(...);
...stuff...
and then get bitten when that code runs in unexpected places.
If we wanted to make it a super-scary name, I guess it might be OK?
Comment 17•5 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #16)
If we wanted to make it a super-scary name, I guess it might be OK?
I would surely second that, and I find the current naming convention doXInfallible also used elsewhere always very confusing.
Comment 18•5 years ago
|
||
Hm, ok.
Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.
Perhaps right at [1], which occurs after XPConnect has been initialized [2], but before the doStartup call [3] that takes us into service workers?
[1] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4412
[2] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4404
[3] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4418
Comment 20•4 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #18)
Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.
Since I think this proposed fix is in XPCOM, moving the bug there.
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 21•4 years ago
|
||
Should we remove the stalled keyword here since there's a possible way forward described in comment 18?
Comment 22•4 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #21)
Should we remove the stalled keyword here since there's a possible way forward described in comment 18?
Sure.
Comment 23•4 years ago
|
||
Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?
Comment 24•4 years ago
|
||
This signature is ranked #10 for Thunderbird 78. And so is a serious problem.
For example bp-4be5fed4-4b0c-4e2e-93a3-f13ab0201222
Comment 25•4 years ago
|
||
For about 98% of these crashes in the last 6 months, the nsresult error value for these crashes appears to be NS_ERROR_FILE_NOT_FOUND. The Windows error values that can return that are ERROR_FILE_NOT_FOUND, ERROR_PATH_NOT_FOUND, ERROR_INVALID_DRIVE, ERROR_NOT_READY.
Comment 26•4 years ago
|
||
This reminds me a bit of bug 1686041 - we probably would want to know, which of these OS error codes really cause this, in particular ERROR_NOT_READY would be something we cannot do much about, right?
Updated•3 years ago
|
Comment 27•3 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #24)
This signature is ranked #10 for Thunderbird 78. And so is a serious problem.
It is now lower, at #30
Comment 28•3 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #27)
(In reply to Wayne Mery (:wsmwk) from comment #24)
This signature is ranked #10 for Thunderbird 78. And so is a serious problem.
It is now lower, at #30
but higher for version 91. Currently #8 for Thunderbird 91.8.1, and looks like it has been high for all of version 91.
bp-9eadd4a9-3c42-43db-920d-8ee5f0220424 Mac
bp-b05e316c-36a7-4eb1-8598-cc5950220424 Windows
Updated•2 years ago
|
Comment 29•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #23)
Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?
Nika, is still possible?
#15 crash for Firefox 103
#19 crash for Thunderibrd 102.0.3
Comment 30•2 years ago
|
||
There's not a ton actionable we can do here, other than potentially change the signature (like the bug title suggests) by moving the crash earlier during startup. This bug appears to be caused by part of the Firefox/Thunderbird install (the omnijar) not being present on the user's computer, perhaps due to a botched install or similar.
This is somewhat aligned with some of the comments on the bug which say things like "impossible d'ouvrir" (can't open), suggesting this is a persistent broken install issue.
It appears that the only practical way forward here is to somehow communicate to the user that their Firefox/Thunderbird install appears to be broken, and ask them to re-install, without crashing somewhere while trying to show said dialog. We risk losing information about how frequently this is happening if we do that though, as we won't be showing the crash report dialog, and telemetry is fairly dependent on the omnijar to work at all.
Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.
Comment 31•2 years ago
|
||
To double-check what we think the error is (omni.ja
missing) I made a copy of my install & deleted the omni.ja
file from it, getting this crash: https://crash-stats.mozilla.org/report/index/67dd89dc-d583-4b0e-bb62-b13730220802. It appears very similar to the other crashes here, so I'm inclined to believe that it is the problem.
Comment 32•2 years ago
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #30)
Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.
I doubt that there is much that I can help with here, especially without more information about what is going wrong. The background updater probably can't fix a build that is crashing on startup because the background updater runs using the Firefox binary. So if the omnijar is missing, I don't see how it could start either.
If we are considering this to be a potential problem caused by the updater, we do already attempt to recover from bad updates. With staging enabled (the default), we should stage the whole new install into a subdirectory and move it into place when Firefox launches. If patching or extracting a file fails, staging should fail and the update will not be installed. If staging is not enabled we should roll back the changes if the update fails.
I have seen reports where it appears that this process fails. I'm not entirely sure why. I can't find the bug that I'm thinking of, but I remember the update log looking basically like this:
updating normally...
Some file is not patched successfully (permission denied error)
We start rolling back
Some of the backup files mysteriously don't exist so we are not able to roll back all files successfully.
I can only think of two possible reasons why something like this would happen: Some kind of antivirus weirdness, or two Firefox invocations managed to race so closely that they both started an updater and the two updaters interfered with each other. I've filed Bug 1783939 to do better at preventing the latter possibility. I'm not sure what we can do about the former.
I don't know how to explain the fact that the original error was a permission denied error. My best guess is that antivirus was involved. Even if the file was in use, we should still be able to move it out of the way.
Comment 33•2 years ago
|
||
There is some good news. Around April 1, Firefox crashes went into steep decline and by mid-May dropped by more than 30%. It looks to me like the majority of that reduction came from within the Windows user population.
Not possible to say whether Thunderbird will have a similar decline, for users going from esr 91 to esr 102.
There is a not insignficant Mac population
Comment 34•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release (startup)
- Top 20 desktop browser crashes on beta (startup)
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 35•2 years ago
|
||
This bug would not actually fix the startup crashes which are occuring here (due to a missing omni.ja or corrupted omni.ja), and would instead just change the crash signature to something more consistent, so dropping to S3 for now.
There is unlikely to be a way to automatically repair broken installations during startup right now.
Comment 36•2 years ago
|
||
It's also worth noting that bug 1760855 migrating the async shutdown service be implemented in C++ for C++ services would also address/moot this bug.
Description
•