1471720 - XPCOM/XPConnect should do a "canary" load of the async shutdown service (Was: fast startup crash in ServiceWorkerRegistrar::ProfileStarted)

Reporter

Description

•

6 years ago

I restarted my Nightly today (I was having problem getting it to recognize my microphone), and I got a profile-independent instant startup crash: #0 0x0000000000414644 in MOZ_CrashPrintf () #1 0x00007f89716fd903 in mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase() const [clone .cold.211] () at /home/sfink/Downloads/firefox/libxul.so #2 0x00007f8972f7f7af in mozilla::dom::ServiceWorkerRegistrar::ProfileStarted() () at /home/sfink/Downloads/firefox/libxul.so #3 0x00007f8972f7f735 in mozilla::dom::ServiceWorkerRegistrar::Observe(nsISupports*, char const*, char16_t const*) () at /home/sfink/Downloads/firefox/libxul.so #4 0x00007f8972573b52 in nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) () at /home/sfink/Downloads/firefox/libxul.so #5 0x00007f897257161c in nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) () at /home/sfink/Downloads/firefox/libxul.so #6 0x00007f8973173faa in nsXREDirProvider::DoStartup() () at /home/sfink/Downloads/firefox/libxul.so #7 0x00007f8973170b20 in XREMain::XRE_mainRun() () at /home/sfink/Downloads/firefox/libxul.so #8 0x00007f897316df92 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/sfink/Downloads/firefox/libxul.so #9 0x00007f897316dbf6 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/sfink/Downloads/firefox/libxul.so #10 0x0000000000422a39 in do_main(int, char**, char**) () #11 0x000000000041732f in main () It appears it was crashing with the string $rdi = 0x634840 <sPrintfCrashReason> "Failed to get async shutdown service: NS_ERROR_FACTORY_NOT_REGISTERED" I have an rr recording of the crash and a backup copy of my install directory.

Marion Daly [:mdaly]

Updated

•

6 years ago

Priority: -- → P3

Marion Daly [:mdaly]

Comment 1

•

6 years ago

Blake: on your radar

Flags: needinfo?(mrbkap)

Marion Daly [:mdaly]

Updated

•

6 years ago

Flags: needinfo?(mrbkap)

Marion Daly [:mdaly]

Updated

•

6 years ago

Priority: P3 → P2

Gian-Carlo Pascutto [:gcp]

Comment 2

•

6 years ago

Just hit this with ASAN Nightlies. It instantly renders the browser unusable, so help? It looks like it's even crashing before crash reporter gets a chance to run? At least ASAN can't submit.

Gian-Carlo Pascutto [:gcp]

Comment 3

•

6 years ago

Attached file ASAN Crash log (deleted) — Details

Gian-Carlo Pascutto [:gcp]

Comment 4

•

6 years ago

The crasher fixed itself (for small values of "fix") after going ASAN->regular Nightly->ASAN Nightly->different profile->regular profile.

Blake Kaplan (:mrbkap) (inactive)

Updated

•

6 years ago

Flags: needinfo?(mrbkap)

Andrew Sutherland [:asuth] (he/him)

Updated

•

6 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1403348

Marion Daly [:mdaly]

Updated

•

6 years ago

Assignee: nobody → mrbkap

Blake Kaplan (:mrbkap) (inactive)

Comment 5

•

6 years ago

After a bunch of guessing, I haven't gotten too far on this. Steve, would you happen to still have the recording and backup copy of your install directory that you mentioned in comment 0 lying around? It's probably our best bet at figuring out what's happening at this point.

Flags: needinfo?(mrbkap) → needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Reporter

Comment 6

•

6 years ago

mrbkap and I looked at this in a debugger today. The specific problem in my case is due to the toplevel omni.ja being really old and using JS features that were removed in 2014. I'll attempt to attach some files, though they might be too large.

Flags: needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Reporter

Comment 7

•

6 years ago

Attached file omni.ja from the toplevel directory (deleted) — Details

Steve Fink [:sfink] [:s:]

Reporter

Comment 8

•

6 years ago

Attached file updates.xml (deleted) — Details

Blake Kaplan (:mrbkap) (inactive)

Comment 9

•

6 years ago

After looking at this and bug 1403348, it appears that there are a ton of reasons for us to fail to load the async shutdown service and that ServiceWorkerRegistrar::ProfileStarted is becoming a bucket for us to find how many of our users have some sort of fatal JS error on startup. I'm marking this stalled since I don't think there's much actionable to do in the service worker code.

Assignee: mrbkap → nobody

Keywords: stalled

Steve Fink [:sfink] [:s:]

Reporter

Comment 10

•

6 years ago

Would it be worth filing a bug to report JS exceptions during startup instead of swallowing them and crashing here? (I forget the exact path, but I know there was an exception thrown that led to this.)

Jens Stutte [:jstutte]

Comment 11

•

5 years ago

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)

I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Comment 12

•

5 years ago

(In reply to Jens Stutte [:jstutte] from comment #11)

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)

I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?

I think this amounts to a request to create an infallible do_getService variant called do_getServiceThatDefinitelyExists which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.

Needinfo-ing :bholley the XPConnect owner because that's where the bulk of the XPConnect scenario lives.

Another possibility lower down would be that the lower level JAR protocol stuff should be performing some type of integrity check on omni.ja that validates some mixture of:

not truncated
passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)
verifies there's an expected build identifier signature baked in to various places in the file (possibly via sentinel zip/jar entries?)

Needinfo-ing :dragana as owner of Necko where Core::Networking: JAR lives.

:bholley/:dragana note that you may want to take a look at bug 1403348 which is where most of the diagnostic effort related to this problem has previously lived. (And where in https://bugzilla.mozilla.org/show_bug.cgi?id=1403348#c79 there are suggestions the problem isn't just an application update problem because it happens on Android, which potentially also means it's also not rogue anti-virus software.)

Flags: needinfo?(dd.mozilla)

Flags: needinfo?(bugmail)

Flags: needinfo?(bholley)

Dragana Damjanovic [:dragana]

Comment 13

•

5 years ago

Michal, can you take a look?

Flags: needinfo?(dd.mozilla) → needinfo?(michal.novotny)

Bobby Holley (:bholley)

Comment 14

•

5 years ago

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

I think this amounts to a request to create an infallible do_getService variant called do_getServiceThatDefinitelyExists which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.

I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?

Flags: needinfo?(bholley) → needinfo?(nfroyd)

Michal Novotny [:michal]

Comment 15

•

5 years ago

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)

Corrupted files should be detected when checking CRC at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/modules/libjar/nsJARInputStream.cpp#333

Flags: needinfo?(michal.novotny)

Nathan Froyd [:froydnj]

Comment 16

•

5 years ago

(In reply to Bobby Holley (:bholley) from comment #14)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

I think this amounts to a request to create an infallible do_getService variant called do_getServiceThatDefinitelyExists which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.

I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?

I'm not super-excited about this, just because of the potential for people putting it along paths that eventually get called during shutdown. Or people will start replacing code like:

nsCOMPtr<nsIObserverService> obsService = ...;
if (obsService) { ... }

with:

nsCOMPtr<nsIObserverService> obsService = do_GetServiceInfallible(...);
...stuff...

and then get bitten when that code runs in unexpected places.

If we wanted to make it a super-scary name, I guess it might be OK?

Flags: needinfo?(nfroyd)

Jens Stutte [:jstutte]

Comment 17

•

5 years ago

(In reply to Nathan Froyd [:froydnj] from comment #16)

If we wanted to make it a super-scary name, I guess it might be OK?

I would surely second that, and I find the current naming convention doXInfallible also used elsewhere always very confusing.

Bobby Holley (:bholley)

Comment 18

•

5 years ago

Hm, ok.

Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.

Perhaps right at [1], which occurs after XPConnect has been initialized [2], but before the doStartup call [3] that takes us into service workers?

[1] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4412
[2] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4404
[3] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4418

Andrew Sutherland [:asuth] (he/him)

Comment 20

•

4 years ago

(In reply to Bobby Holley (:bholley) from comment #18)

Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.

Since I think this proposed fix is in XPCOM, moving the bug there.

Component: DOM: Service Workers → XPCOM

Andrew Sutherland [:asuth] (he/him)

Updated

•

4 years ago

Summary: fast startup crash in ServiceWorkerRegistrar::ProfileStarted → XPCOM/XPConnect should do a "canary" load of the async shutdown service (Was: fast startup crash in ServiceWorkerRegistrar::ProfileStarted)

Andrew Sutherland [:asuth] (he/him)

Updated

•

4 years ago

Priority: P2 → --

BugBot [:suhaib / :marco/ :calixte]

Updated

•

4 years ago

Crash Signature: [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const]

Julien Cristau [:jcristau]

Comment 21

•

4 years ago

Should we remove the stalled keyword here since there's a possible way forward described in comment 18?

Crash Signature: [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const] → [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const]

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 22

•

4 years ago

(In reply to Julien Cristau [:jcristau] from comment #21)

Should we remove the stalled keyword here since there's a possible way forward described in comment 18?

Sure.

Flags: needinfo?(nfroyd)

Keywords: stalled

Jens Stutte [:jstutte]

Comment 23

•

4 years ago

Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?

Flags: needinfo?(nika)

Wayne Mery (:wsmwk)

Comment 24

•

4 years ago

This signature is ranked #10 for Thunderbird 78. And so is a serious problem.

For example bp-4be5fed4-4b0c-4e2e-93a3-f13ab0201222

https://crash-stats.mozilla.org/signature/?product=Firefox&product=Thunderbird&signature=mozilla%3A%3Adom%3A%3AServiceWorkerRegistrar%3A%3AGetShutdownPhase&date=%3E%3D2020-06-24T01%3A08%3A00.000Z&date=%3C2020-12-24T01%3A08%3A00.000Z#graphs

Severity: normal → S2

Flags: needinfo?(jstutte)

Keywords: crash, topcrash-thunderbird

Whiteboard: [startupcrash]

Andrew McCreight [:mccr8]

Comment 25

•

4 years ago

For about 98% of these crashes in the last 6 months, the nsresult error value for these crashes appears to be NS_ERROR_FILE_NOT_FOUND. The Windows error values that can return that are ERROR_FILE_NOT_FOUND, ERROR_PATH_NOT_FOUND, ERROR_INVALID_DRIVE, ERROR_NOT_READY.

Jens Stutte [:jstutte]

Comment 26

•

4 years ago

This reminds me a bit of bug 1686041 - we probably would want to know, which of these OS error codes really cause this, in particular ERROR_NOT_READY would be something we cannot do much about, right?

Flags: needinfo?(jstutte)

Rares Doghi, Desktop QA

Updated

•

3 years ago

Whiteboard: [startupcrash] → [startupcrash] [qa-not-actionable]

Wayne Mery (:wsmwk)

Comment 27

•

3 years ago

(In reply to Wayne Mery (:wsmwk) from comment #24)

This signature is ranked #10 for Thunderbird 78. And so is a serious problem.

It is now lower, at #30

Wayne Mery (:wsmwk)

Comment 28

•

3 years ago

(In reply to Wayne Mery (:wsmwk) from comment #27)

(In reply to Wayne Mery (:wsmwk) from comment #24)

This signature is ranked #10 for Thunderbird 78. And so is a serious problem.

It is now lower, at #30

but higher for version 91. Currently #8 for Thunderbird 91.8.1, and looks like it has been high for all of version 91.
bp-9eadd4a9-3c42-43db-920d-8ee5f0220424 Mac
bp-b05e316c-36a7-4eb1-8598-cc5950220424 Windows

Wayne Mery (:wsmwk)

Updated

•

2 years ago

Flags: needinfo?(nika)

Wayne Mery (:wsmwk)

Comment 29

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #23)

Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?

Nika, is still possible?

#15 crash for Firefox 103
#19 crash for Thunderibrd 102.0.3

Flags: needinfo?(nika)

Nika Layzell [:nika] (ni? for response)

Comment 30

•

2 years ago

There's not a ton actionable we can do here, other than potentially change the signature (like the bug title suggests) by moving the crash earlier during startup. This bug appears to be caused by part of the Firefox/Thunderbird install (the omnijar) not being present on the user's computer, perhaps due to a botched install or similar.

This is somewhat aligned with some of the comments on the bug which say things like "impossible d'ouvrir" (can't open), suggesting this is a persistent broken install issue.

It appears that the only practical way forward here is to somehow communicate to the user that their Firefox/Thunderbird install appears to be broken, and ask them to re-install, without crashing somewhere while trying to show said dialog. We risk losing information about how frequently this is happening if we do that though, as we won't be showing the crash report dialog, and telemetry is fairly dependent on the omnijar to work at all.

Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.

Flags: needinfo?(nika) → needinfo?(bytesized)

Nika Layzell [:nika] (ni? for response)

Comment 31

•

2 years ago

To double-check what we think the error is (omni.ja missing) I made a copy of my install & deleted the omni.ja file from it, getting this crash: https://crash-stats.mozilla.org/report/index/67dd89dc-d583-4b0e-bb62-b13730220802. It appears very similar to the other crashes here, so I'm inclined to believe that it is the problem.

Robin Steuber (they/them) [:bytesized]

Comment 32

•

2 years ago

(In reply to Nika Layzell [:nika] (ni? for response) from comment #30)

Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.

I doubt that there is much that I can help with here, especially without more information about what is going wrong. The background updater probably can't fix a build that is crashing on startup because the background updater runs using the Firefox binary. So if the omnijar is missing, I don't see how it could start either.

If we are considering this to be a potential problem caused by the updater, we do already attempt to recover from bad updates. With staging enabled (the default), we should stage the whole new install into a subdirectory and move it into place when Firefox launches. If patching or extracting a file fails, staging should fail and the update will not be installed. If staging is not enabled we should roll back the changes if the update fails.

I have seen reports where it appears that this process fails. I'm not entirely sure why. I can't find the bug that I'm thinking of, but I remember the update log looking basically like this:

updating normally...
Some file is not patched successfully (permission denied error)
We start rolling back
Some of the backup files mysteriously don't exist so we are not able to roll back all files successfully.

I can only think of two possible reasons why something like this would happen: Some kind of antivirus weirdness, or two Firefox invocations managed to race so closely that they both started an updater and the two updaters interfered with each other. I've filed Bug 1783939 to do better at preventing the latter possibility. I'm not sure what we can do about the former.

I don't know how to explain the fact that the original error was a permission denied error. My best guess is that antivirus was involved. Even if the file was in use, we should still be able to move it out of the way.

Flags: needinfo?(bytesized)

Wayne Mery (:wsmwk)

Comment 33

•

2 years ago

There is some good news. Around April 1, Firefox crashes went into steep decline and by mid-May dropped by more than 30%. It looks to me like the majority of that reduction came from within the Windows user population.

Not possible to say whether Thunderbird will have a similar decline, for users going from esr 91 to esr 102.

There is a not insignficant Mac population

See Also: https://bugzilla.mozilla.org/show_bug.cgi?id=1403348 → https://bugzilla.mozilla.org/show_bug.cgi?id=1783939

BugBot [:suhaib / :marco/ :calixte]

Comment 34

•

2 years ago

The bug is linked to a topcrash signature, which matches the following criteria:

Top 20 desktop browser crashes on release (startup)
Top 20 desktop browser crashes on beta (startup)

For more information, please visit auto_nag documentation.

Keywords: topcrash, topcrash-startup

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

2 years ago

Crash Signature: [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const] → [@ AnnotateScriptContents] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase] [@ mozilla::dom::ServiceWorkerRegistrar::GetShutdownPhase const] [@ nsTString<T>::get]

Nika Layzell [:nika] (ni? for response)

Comment 35

•

2 years ago

This bug would not actually fix the startup crashes which are occuring here (due to a missing omni.ja or corrupted omni.ja), and would instead just change the crash signature to something more consistent, so dropping to S3 for now.

There is unlikely to be a way to automatically repair broken installations during startup right now.

Severity: S2 → S3

Andrew Sutherland [:asuth] (he/him)

Comment 36

•

2 years ago

It's also worth noting that bug 1760855 migrating the async shutdown service be implemented in C++ for C++ services would also address/moot this bug.

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1760855

ASAN Crash log 6 years ago Gian-Carlo Pascutto [:gcp] (deleted), text/plain		Details
omni.ja from the toplevel directory 6 years ago Steve Fink [:sfink] [:s:] (deleted), application/octet-stream		Details
updates.xml 6 years ago Steve Fink [:sfink] [:s:] (deleted), text/xml		Details