Closed Bug 1545381 Opened 6 years ago Closed 3 years ago

Startup Crash in [@ nsWidgetWindowsModuleDtor]

Categories

(Core :: XPCOM, defect, P3)

67 Branch
All
Windows
defect

Tracking

()

RESOLVED WORKSFORME
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- fixed
firefox66 --- unaffected
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: philipp, Assigned: froydnj)

References

(Regression)

Details

(Keywords: crash, regression, regressionwindow-wanted)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-5c667c4a-7904-43e1-964d-054af0190418.

Top 10 frames of crashing thread:

0 xul.dll nsWidgetWindowsModuleDtor widget/windows/nsWidgetFactory.cpp:61
1 xul.dll static void mozilla::xpcom::CallUnloadFuncs xpcom/components/StaticComponents.cpp:8020
2 xul.dll nsComponentManagerImpl::Shutdown xpcom/components/nsComponentManager.cpp:962
3 xul.dll static nsresult mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:746
4 xul.dll void ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1243
5 xul.dll int XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:4756
6 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp:4812
7 xul.dll int mozilla::BootstrapImpl::XRE_main toolkit/xre/Bootstrap.cpp:39
8 firefox.exe static int do_main browser/app/nsBrowserApp.cpp:214
9 firefox.exe static int NS_internal_main browser/app/nsBrowserApp.cpp:293

these startup crashes are starting to appear in firefox 67 (in rather low volume so far though) - presumably related to bug 1478124?

This is technically a startup crash, but it is also a shutdown crash. I.e., we're shutting down immediately whether this crash happens or not. It would be good to know why...

Presumably we're trying to release sAppShell at https://searchfox.org/mozilla-central/rev/d302c3058330a57f238be4062fddea629311ce66/widget/nsAppShellSingleton.h#51, presumably because it hasn't actually been initialized yet.

I don't know whether this is related to bug 1478124, but I don't have any reason to think it is.

In theory initialization could have failed as well. Checking for null in the shutdown code could help avoid the crash, but as Kris noted understanding why we're shutting down so quickly would be helpful as well.

Although this is a new startup crash in beta, we are one week away from RC week and this bug remains unprioritized and unassigned so this is wontfix for 67.

Summary: Crash in [@ nsWidgetWindowsModuleDtor] → Startup Crash in [@ nsWidgetWindowsModuleDtor]

Looks like much lower crash rate for 68 beta

Hi Nathan, (looking at this during regression triage) - the crash rate in 68 beta is about the same as in 67. Can you triage this? Thanks!

Flags: needinfo?(nfroyd)

I don't understand how we're crashing here. The static components stuff ensures that we only unload modules whose ctor has been called. So the windows widget module ctor has been called, which calls the appshell initialization, which basically cannot fail. (The failure points are allocating the appshell, which is infallible, and initializing it, which on Windows is also infallible.) So we're not failing out of the widget module ctor, and the appshell initialization will correctly set up sAppShell. (Aside: maybe we shouldn't set the initialization called bit for static modules if the ctor fails? That's probably hard to get right, though...)

But then we're getting to the widget dtor, and shutting down the appshell, and we find that sAppShell is nullptr, i.e. that the appshell has (presumably) been destroyed. Where in the world did that happen?

I don't have any better ideas other than doing the defensive null-check. That change may just make the crash migrate someplace else, ideally to a place where we have greater insight into what's going on?

I suppose we could add a crash annotation to describe what we got back from XRE_mainRun() or something like that, and that might provide some kind of insight?

Flags: needinfo?(nfroyd)
Priority: -- → P3

(In reply to Nathan Froyd [:froydnj] from comment #6)

But then we're getting to the widget dtor, and shutting down the appshell, and we find that sAppShell is nullptr, i.e. that the appshell has (presumably) been destroyed. Where in the world did that happen?

Huh. I guess it's barely possible that we're destroying it in nsAppShellInit because Init() is failing: https://searchfox.org/mozilla-central/rev/d302c3058330a57f238be4062fddea629311ce66/widget/nsAppShellSingleton.h#42-44

The only way that should be possible, though, is if NS_GetCurrentThread returns null: https://searchfox.org/mozilla-central/source/widget/nsBaseAppShell.cpp#44-46

Which, I don't know, is maybe possible in this weird case where we're shutting down in the middle of startup?

(In reply to Nathan Froyd [:froydnj] from comment #6)

I don't have any better ideas other than doing the defensive null-check. That change may just make the crash migrate someplace else, ideally to a place where we have greater insight into what's going on?

Side note: I doubt that this will make the crash migrate to another place. I suspect that it will "fix" these crashes without telling us what's causing us to bail out this early during startup in the first place, though, which isn't great...

Ah, that's a good point about NS_ENSURE_STATE; I missed that while scanning for return.

OK, so that must be what's happening...but how? Did the thread manager just utterly and completely fail to do its job?

From #memshrink:

<John-Galt> froydnj: You may as well just add a release assert that sAppShell->Init() doesn't fail while you're at it. If it does, we're pretty screwed anyway, and we're better off crashing there rather than at shutdown if it does.

I kind of want to make this a MOZ_DIAGNOSTIC_ASSERT, but I'm a bit worried that we wouldn't actually get any crash reports from early beta on this...and a release assert will just crash the same way we're crashing on shutdown here anyway, so a release assert is probably the right thing.

Attachment #9079830 - Attachment description: Bug 1545381 - be defensive in nsAppShellShutdown; r=erahm → Bug 1545381 - be more defensive in nsAppShellSingleton; r=erahm
Pushed by nfroyd@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/87a0c27e1e07 be more defensive in nsAppShellSingleton; r=erahm
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Assignee: nobody → nfroyd

Please nominate this for Beta and ESR68 approval when you get a chance. Even if it only ends up changing the crash signature, it would be good to get that feedback sooner in the cycle.

Flags: needinfo?(nfroyd)

Comment on attachment 9079830 [details]
Bug 1545381 - be more defensive in nsAppShellSingleton; r=erahm

Beta/Release Uplift Approval Request

  • User impact if declined: None? We (Firefox) won't have insight into why things are crashing, though.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This change should just move a crash around.
  • String changes made/needed: None.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: We'd like a wider net for crash reports in understanding what's going wrong in this bug.
  • User impact if declined:
  • Fix Landed on Version: 70
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String or UUID changes made by this patch: None.
Flags: needinfo?(nfroyd)
Attachment #9079830 - Flags: approval-mozilla-esr68?
Attachment #9079830 - Flags: approval-mozilla-beta?

Comment on attachment 9079830 [details]
Bug 1545381 - be more defensive in nsAppShellSingleton; r=erahm

Diagnostic patch which will hopefully help us narrow down what's causing these crashes. Approved for 69.0b8 and 68.1esr.

Attachment #9079830 - Flags: approval-mozilla-esr68?
Attachment #9079830 - Flags: approval-mozilla-esr68+
Attachment #9079830 - Flags: approval-mozilla-beta?
Attachment #9079830 - Flags: approval-mozilla-beta+

Diagnostic work appears to have caused crash bug 1577507 and does not appear to have fixed the original issue.

Status: RESOLVED → REOPENED
Depends on: 1577507
Flags: needinfo?(nfroyd)
Resolution: FIXED → ---

(In reply to Kris Maglione [:kmag] from comment #8)

(In reply to Nathan Froyd [:froydnj] from comment #6)

But then we're getting to the widget dtor, and shutting down the appshell, and we find that sAppShell is nullptr, i.e. that the appshell has (presumably) been destroyed. Where in the world did that happen?

Huh. I guess it's barely possible that we're destroying it in nsAppShellInit because Init() is failing: https://searchfox.org/mozilla-central/rev/d302c3058330a57f238be4062fddea629311ce66/widget/nsAppShellSingleton.h#42-44

The only way that should be possible, though, is if NS_GetCurrentThread returns null: https://searchfox.org/mozilla-central/source/widget/nsBaseAppShell.cpp#44-46

Which, I don't know, is maybe possible in this weird case where we're shutting down in the middle of startup?

(In reply to Jim Mathies [:jimm] from comment #18)

Diagnostic work appears to have caused crash bug 1577507 and does not appear to have fixed the original issue.

The patch was never meant to fix the original issue; the patch was intended to move it to a place where we might have more information about what's going on.

Apparently the plugin process doesn't really set up xpcom, but tries to fake things:

https://searchfox.org/mozilla-central/source/dom/plugins/ipc/PluginProcessChild.cpp#142-148

I'd guess it's not faking things hard enough, and not setting up the main thread properly.

Flags: needinfo?(nfroyd)
Regressions: 1577507

Considering the "regressionwindow-wanted" tag, I could try to find the regression if some steps to reproduce are provided.
Please NI me if some working STR are obtained. Thanks.

QA Whiteboard: [qa-regression-triage]
Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Resolution: --- → WORKSFORME
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: