Open Bug 1471720 Opened 6 years ago Updated 2 years ago

XPCOM/XPConnect should do a "canary" load of the async shutdown service (Was: fast startup crash in ServiceWorkerRegistrar::ProfileStarted)

Categories

(Core :: XPCOM, defect)

defect

Tracking

()

People

(Reporter: sfink, Unassigned)

References

Details

(4 keywords, Whiteboard: [startupcrash] [qa-not-actionable])

Crash Data

Attachments

(3 files)

I restarted my Nightly today (I was having problem getting it to recognize my microphone), and I got a profile-independent instant startup crash: #0 0x0000000000414644 in MOZ_CrashPrintf () #1 0x00007f89716fd903 in mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase() const [clone .cold.211] () at /home/sfink/Downloads/firefox/libxul.so #2 0x00007f8972f7f7af in mozilla::dom::ServiceWorkerRegistrar::ProfileStarted() () at /home/sfink/Downloads/firefox/libxul.so #3 0x00007f8972f7f735 in mozilla::dom::ServiceWorkerRegistrar::Observe(nsISupports*, char const*, char16_t const*) () at /home/sfink/Downloads/firefox/libxul.so #4 0x00007f8972573b52 in nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) () at /home/sfink/Downloads/firefox/libxul.so #5 0x00007f897257161c in nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) () at /home/sfink/Downloads/firefox/libxul.so #6 0x00007f8973173faa in nsXREDirProvider::DoStartup() () at /home/sfink/Downloads/firefox/libxul.so #7 0x00007f8973170b20 in XREMain::XRE_mainRun() () at /home/sfink/Downloads/firefox/libxul.so #8 0x00007f897316df92 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/sfink/Downloads/firefox/libxul.so #9 0x00007f897316dbf6 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/sfink/Downloads/firefox/libxul.so #10 0x0000000000422a39 in do_main(int, char**, char**) () #11 0x000000000041732f in main () It appears it was crashing with the string $rdi = 0x634840 <sPrintfCrashReason> "Failed to get async shutdown service: NS_ERROR_FACTORY_NOT_REGISTERED" I have an rr recording of the crash and a backup copy of my install directory.
Priority: -- → P3
Blake: on your radar
Flags: needinfo?(mrbkap)
Flags: needinfo?(mrbkap)
Priority: P3 → P2
Just hit this with ASAN Nightlies. It instantly renders the browser unusable, so help? It looks like it's even crashing before crash reporter gets a chance to run? At least ASAN can't submit.
Attached file ASAN Crash log (deleted) —
The crasher fixed itself (for small values of "fix") after going ASAN->regular Nightly->ASAN Nightly->different profile->regular profile.
Flags: needinfo?(mrbkap)
Assignee: nobody → mrbkap
After a bunch of guessing, I haven't gotten too far on this. Steve, would you happen to still have the recording and backup copy of your install directory that you mentioned in comment 0 lying around? It's probably our best bet at figuring out what's happening at this point.
Flags: needinfo?(mrbkap) → needinfo?(sphink)
mrbkap and I looked at this in a debugger today. The specific problem in my case is due to the toplevel omni.ja being really old and using JS features that were removed in 2014. I'll attempt to attach some files, though they might be too large.
Flags: needinfo?(sphink)
Attached file omni.ja from the toplevel directory (deleted) —
Attached file updates.xml (deleted) —
After looking at this and bug 1403348, it appears that there are a ton of reasons for us to fail to load the async shutdown service and that ServiceWorkerRegistrar::ProfileStarted is becoming a bucket for us to find how many of our users have some sort of fatal JS error on startup. I'm marking this stalled since I don't think there's much actionable to do in the service worker code.
Assignee: mrbkap → nobody
Keywords: stalled
Would it be worth filing a bug to report JS exceptions during startup instead of swallowing them and crashing here? (I forget the exact path, but I know there was an exception thrown that led to this.)

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)

I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?

Flags: needinfo?(bugmail)

(In reply to Jens Stutte [:jstutte] from comment #11)

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)

I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?

I think this amounts to a request to create an infallible do_getService variant called do_getServiceThatDefinitelyExists which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.

Needinfo-ing :bholley the XPConnect owner because that's where the bulk of the XPConnect scenario lives.

Another possibility lower down would be that the lower level JAR protocol stuff should be performing some type of integrity check on omni.ja that validates some mixture of:

  • not truncated
  • passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)
  • verifies there's an expected build identifier signature baked in to various places in the file (possibly via sentinel zip/jar entries?)

Needinfo-ing :dragana as owner of Necko where Core::Networking: JAR lives.

:bholley/:dragana note that you may want to take a look at bug 1403348 which is where most of the diagnostic effort related to this problem has previously lived. (And where in https://bugzilla.mozilla.org/show_bug.cgi?id=1403348#c79 there are suggestions the problem isn't just an application update problem because it happens on Android, which potentially also means it's also not rogue anti-virus software.)

Flags: needinfo?(dd.mozilla)
Flags: needinfo?(bugmail)
Flags: needinfo?(bholley)

Michal, can you take a look?

Flags: needinfo?(dd.mozilla) → needinfo?(michal.novotny)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

I think this amounts to a request to create an infallible do_getService variant called do_getServiceThatDefinitelyExists which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.

I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?

Flags: needinfo?(bholley) → needinfo?(nfroyd)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

  • passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)

Corrupted files should be detected when checking CRC at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/modules/libjar/nsJARInputStream.cpp#333

Flags: needinfo?(michal.novotny)

(In reply to Bobby Holley (:bholley) from comment #14)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

I think this amounts to a request to create an infallible do_getService variant called do_getServiceThatDefinitelyExists which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.

I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?

I'm not super-excited about this, just because of the potential for people putting it along paths that eventually get called during shutdown. Or people will start replacing code like:

nsCOMPtr<nsIObserverService> obsService = ...;
if (obsService) { ... }

with:

nsCOMPtr<nsIObserverService> obsService = do_GetServiceInfallible(...);
...stuff...

and then get bitten when that code runs in unexpected places.

If we wanted to make it a super-scary name, I guess it might be OK?

Flags: needinfo?(nfroyd)

(In reply to Nathan Froyd [:froydnj] from comment #16)

If we wanted to make it a super-scary name, I guess it might be OK?

I would surely second that, and I find the current naming convention doXInfallible also used elsewhere always very confusing.

Hm, ok.

Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.

Perhaps right at [1], which occurs after XPConnect has been initialized [2], but before the doStartup call [3] that takes us into service workers?

[1] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4412
[2] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4404
[3] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4418

(In reply to Bobby Holley (:bholley) from comment #18)

Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.

Since I think this proposed fix is in XPCOM, moving the bug there.

Component: DOM: Service Workers → XPCOM
Summary: fast startup crash in ServiceWorkerRegistrar::ProfileStarted → XPCOM/XPConnect should do a "canary" load of the async shutdown service (Was: fast startup crash in ServiceWorkerRegistrar::ProfileStarted)
Priority: P2 → --
Crash Signature: [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const]

Should we remove the stalled keyword here since there's a possible way forward described in comment 18?

Crash Signature: [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const] → [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const]
Flags: needinfo?(nfroyd)

(In reply to Julien Cristau [:jcristau] from comment #21)

Should we remove the stalled keyword here since there's a possible way forward described in comment 18?

Sure.

Flags: needinfo?(nfroyd)
Keywords: stalled

Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?

Flags: needinfo?(nika)

For about 98% of these crashes in the last 6 months, the nsresult error value for these crashes appears to be NS_ERROR_FILE_NOT_FOUND. The Windows error values that can return that are ERROR_FILE_NOT_FOUND, ERROR_PATH_NOT_FOUND, ERROR_INVALID_DRIVE, ERROR_NOT_READY.

This reminds me a bit of bug 1686041 - we probably would want to know, which of these OS error codes really cause this, in particular ERROR_NOT_READY would be something we cannot do much about, right?

Flags: needinfo?(jstutte)
Whiteboard: [startupcrash] → [startupcrash] [qa-not-actionable]

(In reply to Wayne Mery (:wsmwk) from comment #24)

This signature is ranked #10 for Thunderbird 78. And so is a serious problem.

It is now lower, at #30

(In reply to Wayne Mery (:wsmwk) from comment #27)

(In reply to Wayne Mery (:wsmwk) from comment #24)

This signature is ranked #10 for Thunderbird 78. And so is a serious problem.

It is now lower, at #30

but higher for version 91. Currently #8 for Thunderbird 91.8.1, and looks like it has been high for all of version 91.
bp-9eadd4a9-3c42-43db-920d-8ee5f0220424 Mac
bp-b05e316c-36a7-4eb1-8598-cc5950220424 Windows

Flags: needinfo?(nika)

(In reply to Jens Stutte [:jstutte] from comment #23)

Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?

Nika, is still possible?

#15 crash for Firefox 103
#19 crash for Thunderibrd 102.0.3

Flags: needinfo?(nika)

There's not a ton actionable we can do here, other than potentially change the signature (like the bug title suggests) by moving the crash earlier during startup. This bug appears to be caused by part of the Firefox/Thunderbird install (the omnijar) not being present on the user's computer, perhaps due to a botched install or similar.

This is somewhat aligned with some of the comments on the bug which say things like "impossible d'ouvrir" (can't open), suggesting this is a persistent broken install issue.

It appears that the only practical way forward here is to somehow communicate to the user that their Firefox/Thunderbird install appears to be broken, and ask them to re-install, without crashing somewhere while trying to show said dialog. We risk losing information about how frequently this is happening if we do that though, as we won't be showing the crash report dialog, and telemetry is fairly dependent on the omnijar to work at all.

Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.

Flags: needinfo?(nika) → needinfo?(bytesized)

To double-check what we think the error is (omni.ja missing) I made a copy of my install & deleted the omni.ja file from it, getting this crash: https://crash-stats.mozilla.org/report/index/67dd89dc-d583-4b0e-bb62-b13730220802. It appears very similar to the other crashes here, so I'm inclined to believe that it is the problem.

(In reply to Nika Layzell [:nika] (ni? for response) from comment #30)

Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.

I doubt that there is much that I can help with here, especially without more information about what is going wrong. The background updater probably can't fix a build that is crashing on startup because the background updater runs using the Firefox binary. So if the omnijar is missing, I don't see how it could start either.

If we are considering this to be a potential problem caused by the updater, we do already attempt to recover from bad updates. With staging enabled (the default), we should stage the whole new install into a subdirectory and move it into place when Firefox launches. If patching or extracting a file fails, staging should fail and the update will not be installed. If staging is not enabled we should roll back the changes if the update fails.

I have seen reports where it appears that this process fails. I'm not entirely sure why. I can't find the bug that I'm thinking of, but I remember the update log looking basically like this:

updating normally...
Some file is not patched successfully (permission denied error)
We start rolling back
Some of the backup files mysteriously don't exist so we are not able to roll back all files successfully.

I can only think of two possible reasons why something like this would happen: Some kind of antivirus weirdness, or two Firefox invocations managed to race so closely that they both started an updater and the two updaters interfered with each other. I've filed Bug 1783939 to do better at preventing the latter possibility. I'm not sure what we can do about the former.

I don't know how to explain the fact that the original error was a permission denied error. My best guess is that antivirus was involved. Even if the file was in use, we should still be able to move it out of the way.

Flags: needinfo?(bytesized)

There is some good news. Around April 1, Firefox crashes went into steep decline and by mid-May dropped by more than 30%. It looks to me like the majority of that reduction came from within the Windows user population.

Not possible to say whether Thunderbird will have a similar decline, for users going from esr 91 to esr 102.

There is a not insignficant Mac population

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 20 desktop browser crashes on beta (startup)

For more information, please visit auto_nag documentation.

Crash Signature: [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const] → [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const] [@ nsTString<T>::get]

This bug would not actually fix the startup crashes which are occuring here (due to a missing omni.ja or corrupted omni.ja), and would instead just change the crash signature to something more consistent, so dropping to S3 for now.

There is unlikely to be a way to automatically repair broken installations during startup right now.

Severity: S2 → S3

It's also worth noting that bug 1760855 migrating the async shutdown service be implemented in C++ for C++ services would also address/moot this bug.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: