Closed Bug 1248489 Opened 9 years ago Closed 9 years ago

crash in mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort (with abort message chrome://browser/content/sanitize.js) and Firefox no longer responding on startup

Categories

(Core :: General, defect)

44 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox44 --- wontfix
firefox45 + verified
firefox46 + wontfix
firefox47 + wontfix

People

(Reporter: philipp, Assigned: mak)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 obsolete file)

[Tracking Requested - why for this release]: This bug was filed from the Socorro interface and is report bp-a464ca53-513a-40a9-9393-7c5462160215. ============================================================= Crashing Thread (0) Frame Module Signature Source 0 mozglue.dll mozalloc_abort(char const* const) memory/mozalloc/mozalloc_abort.cpp 1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp 2 xul.dll nsDebugImpl::Abort(char const*, int) xpcom/base/nsDebugImpl.cpp 3 xul.dll NS_InvokeByIndex xpcom/reflect/xptcall/md/win32/xptcinvoke.cpp 4 xul.dll XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) js/xpconnect/src/XPCWrappedNative.cpp 5 xul.dll XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) js/xpconnect/src/XPCWrappedNativeJSOps.cpp 6 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 7 xul.dll Interpret js/src/vm/Interpreter.cpp 8 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp 9 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 10 xul.dll js::CallOrConstructBoundFunction(JSContext*, unsigned int, JS::Value*) js/src/jsfun.cpp 11 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 12 xul.dll js::fun_call(JSContext*, unsigned int, JS::Value*) js/src/jsfun.cpp 13 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 14 xul.dll js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) js/src/vm/Interpreter.cpp 15 xul.dll js::DirectProxyHandler::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) js/src/proxy/DirectProxyHandler.cpp 16 xul.dll js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) js/src/proxy/CrossCompartmentWrapper.cpp 17 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 18 xul.dll Interpret js/src/vm/Interpreter.cpp 19 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp 20 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 21 xul.dll js::CallOrConstructBoundFunction(JSContext*, unsigned int, JS::Value*) js/src/jsfun.cpp 22 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 23 xul.dll js::CallOrConstructBoundFunction(JSContext*, unsigned int, JS::Value*) js/src/jsfun.cpp 24 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 25 xul.dll js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) js/src/vm/Interpreter.cpp 26 xul.dll nsXPCComponents_Utils::CallFunctionWithAsyncStack(JS::Handle<JS::Value>, nsIStackFrame*, nsAString_internal const&, JSContext*, JS::MutableHandle<JS::Value>) js/xpconnect/src/XPCComponents.cpp 27 xul.dll NS_InvokeByIndex xpcom/reflect/xptcall/md/win32/xptcinvoke.cpp 28 xul.dll XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) js/xpconnect/src/XPCWrappedNative.cpp 29 xul.dll XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) js/xpconnect/src/XPCWrappedNativeJSOps.cpp 30 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 31 xul.dll Interpret js/src/vm/Interpreter.cpp 32 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp 33 xul.dll js::Invoke(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp 34 xul.dll JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) js/src/jsapi.cpp 35 xul.dll mozilla::dom::AnyCallback::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>, mozilla::ErrorResult&) obj-firefox/dom/bindings/PromiseBinding.cpp 36 xul.dll mozilla::dom::WrapperPromiseCallback::Call(JSContext*, JS::Handle<JS::Value>) dom/promise/PromiseCallback.cpp AbortMessage [11276] ###!!! ABORT: file chrome://browser/content/sanitize.js, line 98 AsyncShutdownTimeout {"phase":"Places Database shutdown","conditions":[{"name":"sanitize.js: Sanitize","state":{"progress":{"cache":"cleared","cookies":"cleared","history":"ready","formdata":"ready","downloads":"ready","sessions":"ready"}},"filename":"chrome://browser/content/sanitize.js","lineNumber":98,"stack":["chrome://browser/content/sanitize.js:Sanitizer.prototype.sanitize<:98","resource://gre/modules/Task.jsm:TaskImpl_run:314","resource://gre/modules/Task.jsm:TaskImpl:275","resource://gre/modules/Task.jsm:createAsyncFunction/asyncFunction:249","chrome://browser/content/sanitize.js:Sanitizer.onStartup<:832","resource://gre/modules/Task.jsm:TaskImpl_run:314","resource://gre/modules/Task.jsm:TaskImpl:275","resource://gre/modules/Task.jsm:createAsyncFunction/asyncFunction:249","resource://app/components/nsBrowserGlue.js:BG__finalUIStartup:759","resource://app/components/nsBrowserGlue.js:BG_observe:283"]}]} the signature is #11 top crasher on release & this crash variant (abort in sanitize.js, line 98) makes up 2/3 of those crashes. it is regressing in firefox 44. multiple user comments describe that they tried to delete the firefox history and firefox was subsequently stuck in a "not responding" mode after a restart constantly: https://crash-stats.mozilla.com/search/?signature=%3Dmozalloc_abort+|+NS_DebugBreak+|+nsDebugImpl%3A%3AAbort&abort_message=98&user_comments=~history&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=platform_pretty_version&_facets=abort_message&_facets=async_shutdown_timeout&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-user_comments downgrading to 43 can get you out of the freezing condition on startup again apparently according to https://support.mozilla.org/questions/1110002.
There are various bugs in sanitize on shutdown we are currently handling and I think this is just the tip of the iceberg. Bug 1243549 is a part of those, it already landed on all the branches Bug 1244650 is another important part that shows clear on shutdown is broken from a long time Then there's bug 1245065 Here it's also showing async shutdown is killing us while we try to clear history, that means sanitize takes more than 1 minute. I'm not really surprised, 1 minute can be a short time if it's the time to clear *all* the sanitize clients. Especially in the async world. plus we have bug 871908 that is quite bad, but we lack resources to fix it.
Depends on: 871908
Depends on: 1244650
This is one of the top crash of 45, tracking. Marco, are you going to take care of that?
Flags: needinfo?(mak77)
I will have a look, but first I need to do some planning with the team.
Assignee: nobody → mak77
First of all, the hang on startup is easy to explain, if we fail to sanitize on shutdown, we do on startup. We started doing that "properly" with bug 1089695, in the sense we can now detect interrupted shutdown sanitizations . We could work on avoiding jankiness, but it's a long process we already started years ago involving rewrite of sync APIs and better I/O paths. I'd first concentrate on making sanitize on startup a rare event in this case. I looked at the various AsyncShutdownTimeout annotation, on 45 most of them are blocked on "cookies: ready" while on 44 most are on "cookies: failed, history: ready" The difference between 44 and 45 could be Bug 1243549. That bug still left cookies in the previous not-working state, followup bug 1245065 will fix them. On the other side Bug 1244650 could already solve the cookies problem, since it starts sanitization at a safer point during shutdown, when the cookies service can do its job. Now the problem is understanding whether the hang happens due to the cookies issue (we are either at cookies: failed or cookies: ready) or due to history clearing (on 44 we are always at history: ready). Let's look at telemetry: FX_SANITIZE_COOKIES indicates about 1% of users for which sanitizing cookies takes around 30s or more BUT this is not just measuring cookies, indeed by looking at FX_SANITIZE_COOKIES2 it's clear the real culprit is FX_SANITIZE_PLUGINS that is part of FX_SANITIZE_COOKIES FX_SANITIZE_HISTORY indicates about 0.2% of users for which sanitizing history takes around 30s or more These users can easily go past the 1 minute phase timeout arbitrarily imposed by async shutdown. The strange thing is that I'd expect people using clear history on shutdown having mostly an empty history for which sanitization should be quite fast. The assumption is clearly untrue for FX_SANITIZE_PLUGINS, since the time needed to instantiate a plugin and asking it to clear data is not related to it having data or not. Summing up, we have a 1% of users taking about 30s to clear plugins data, that means we only have 30s left for any other shutdown work in the same phase. Since we packed up all sanitization and components shutdown in profile-before-change, that's not a lot of time. Bug 1244650 helps also cause it moves sanitization to profile-change-teardown and since the timeout is per phase, sanitization won't have anymore to share the timeout with components shutdown. So, while the bug itself is not fixed, the currently landed patches (if uplifted) could already largely reduce its impact to a very tiny % of users. What we can do further: 1. evaluate whether the 1 minute timeout was a realistic choice, considering the time some users take just to clear plugins data. Increasing the timeout could make regressions less likely to be noticed, but if the increase is well thought it may work. Something like 90s could do. 2. evaluate if we really want to instantiate plugins on shutdown just to clear them... Maybe we could always delay clearing plugins to startup? Maybe we should just stop waiting for plugins data clearing, hoping it will work? Regardless we are already ignoring any error thrown by them. 3. investigate why it takes so much to clear plugins data (I suspect we need to spawn the plugin-container process and go through IPC, that is a slow process), and what we could do to speed it up, but IIRC we already invested time in this in the recent past. I will file a bug in the proper component. 4. investigate history clearing speed up: bug 871908 is likely involved, may not be trivial and it seems to only affect 0.2% of the users clearing history I will now go through the regressing patch to identify any possible change that could have made us cross the boundary, supposing we were already on the edge of it.
Flags: needinfo?(mak77)
differences after bug 1089695: - cookies were the first thing to be cleared, now cache is cleared before them, telemetry tells me in general cache should take less than 10 seconds. Still it's likely starting cookies clearing before can help, considered plugins clearing time. Bug 1244650 helps here by starting all the clearing tasks in a strict loop. - we were not waiting for most of the async sanitizations, we were basically waiting only for cookies. The previous behavior was just wrong. - we were not yielding to avoid UI jankiness. Probably we should not do this when there's no UI (like on shutdown). I will file a bug for this. - history was cleared synchronously. The amount of work should be the same, priority could differ since it's a different thread. The only thing that can help us here is profiling and bug 871908. Bug 1244650 helps by starting the process earlier and not concentrating all the cleanup work in a single shutdown phase. - we are now properly handling partial sanitizations by reissuing sanitize on startup. This has likely nothing to do with the shutdown crash, is probably related to the hang on startup, but should be handled apart.
Depends on: 1249333
Depends on: 1249336
Here is my crash report: https://crash-stats.mozilla.com/report/index/00e9a6cb-6cf3-4008-a305-72d3b2160219 Firefox's process doesn't get killed after closing Firefox. It started in Firefox 45.0b7, b6 didn't have this problem. Here is some debug output which may help: WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"sanitize.js: Sanitize","state":{"progress":{"cache":"ready","cookies":"ready","offlineApps":"ready","history":"ready","formdata":"ready","downloads":"ready","sessions":"cleared"}},"filename":"chrome://browser/content/sanitize.js","lineNumber":79,"stack":["chrome://browser/content/sanitize.js:Sanitizer.prototype.sanitize<:79","resource://gre/modules/Task.jsm:TaskImpl_run:315","resource://gre/modules/Task.jsm:TaskImpl:276","resource://gre/modules/Task.jsm:createAsyncFunction/asyncFunction:250","chrome://browser/content/sanitizeDialog.js:gSanitizePromptDialog.sanitize:114","chrome://global/content/bindings/dialog.xml line 375 > Function:anonymous:1","chrome://global/content/bindings/dialog.xml:_fireButtonEvent:376","chrome://global/content/bindings/dialog.xml:_doButtonCommand:344","chrome://global/content/bindings/dialog.xml:_handleButtonCommand:332","chrome://browser/content/sanitize.js:Sanitizer.showUI:748","chrome://browser/content/sanitize.js:Sanitizer.sanitize:761","resource://app/components/nsBrowserGlue.js:BG_sanitize:2178","chrome://browser/content/browser.xul:oncommand:1"]}] Barrier: Places Clients shutdown WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"Places Clients shutdown","state":"(none)","filename":"/builds/slave/rel-m-beta-l64_bld-00000000000/build/toolkit/components/places/Database.cpp","lineNumber":488,"stack":["resource://gre/components/nsAsyncShutdown.js:nsAsyncShutdownClient.prototype.addBlocker:163","chrome://browser/content/sanitize.js:Sanitizer.onStartup<:767","resource://gre/modules/Task.jsm:TaskImpl_run:315","resource://gre/modules/Task.jsm:TaskImpl:276","resource://gre/modules/Task.jsm:createAsyncFunction/asyncFunction:250","resource://app/components/nsBrowserGlue.js:BG__finalUIStartup:764","resource://app/components/nsBrowserGlue.js:BG_observe:287"]}] Barrier: profile-change-teardown console.error: Message: Error: SessionFile is closed Stack: SessionFileInternal.write@resource://app/modules/sessionstore/SessionFile.jsm:277:29 this.SessionFile.write@resource://app/modules/sessionstore/SessionFile.jsm:76:12 SessionSaverInternal._writeState@resource://app/modules/sessionstore/SessionSaver.jsm:242:12 SessionSaverInternal._saveState@resource://app/modules/sessionstore/SessionSaver.jsm:213:12 SessionSaverInternal._saveStateAsync@resource://app/modules/sessionstore/SessionSaver.jsm:226:5 SessionSaverInternal.runDelayed/this._timeoutID<@resource://app/modules/sessionstore/SessionSaver.jsm:145:40 setTimeout_timer@resource://gre/modules/Timer.jsm:30:5 Spinner.prototype.observe@resource://gre/modules/AsyncShutdown.jsm:529:9
David, does that "SessionFile is closed" error raise any idea?
Flags: needinfo?(dteller)
(In reply to Markus Popp from comment #7) > Here is my crash report: > > https://crash-stats.mozilla.com/report/index/00e9a6cb-6cf3-4008-a305- > 72d3b2160219 Could you please tell me the builtfrom value from about:buildconfig?
Flags: needinfo?(markus.popp)
Also, did you set Clear history on shutdown, or manually clear from the dialog?
Blocks: 1206393
(In reply to Marco Bonardo [::mak] from comment #8) > David, does that "SessionFile is closed" error raise any idea? Investigating this over irc.
Flags: needinfo?(dteller)
(In reply to Marco Bonardo [::mak] from comment #9) > Could you please tell me the builtfrom value from about:buildconfig? Built from https://hg.mozilla.org/releases/mozilla-beta/rev/3a671727fd14120dd85362997eb84d0a7ce41a11 > Also, did you set Clear history on shutdown, or manually clear from the dialog? I cleared manually from the dialog.
Flags: needinfo?(markus.popp)
I was able to reproduce the hang in beta, in the end the problem is, as I suspected, bug 871908. Removing pages from moz_places table creates a long transaction that causes the wal to grow insanely and doing that asynchronously causes a deadlock with queries on the main thread (the async thread is constantly reading from disk and holds the shared mutex). Once the main thread is blocked it doesn't matter if the other clearings can proceed, we will just timeout and abort. This is the risk when we use the same db connection across 2 threads, and there's nothing we can do until all of Places APIs are async... For example on shutdown we may deadlock with Windows7Jumplists trying to access history.hasHistoryEntries. The startup hang is just the bookmarks service trying to initialize synchronously. Why did we start noticing the cost of bug 871908 only after the fix for bug 1089695? - clear history on startup before was happening rarely, especially cause we were ignoring most shutdown errors. Now that it runs more often on startup, it clearly hangs more often. - on shutdown we execute queries asynchronously instead of synchronously and we wait for them. I'm looking for a low risk fix that will "hide" the issue, that is how it was before, since I don't think we have the space for a correct fix.
Depends on: 1249786
(In reply to Marco Bonardo [::mak] from comment #16) > - clear history on startup before was happening rarely, especially cause we > were ignoring most shutdown errors. Now that it runs more often on startup, > it clearly hangs more often. And the just found bug 1249786.
yes, it's the same cause.
Depends on: 976940
[Tracking Requested - why for this release]: Crashing on shutdown.
Philipp mentioned it's a top crash on 45.0b8. Tracked for 46 and 47.
yep, to fix this we need at least: bug 1249786 - since currently we are sanitizing at every startup, even when not needed bug 976940 - users with disabled form history crash every time bug 871908 - this is the less trivial fix for which I'm trying to find a solution that is small enough to be uplifted.
Depends on: 1250363
No longer depends on: 871908, 1249336
Summary: crash in mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort and Firefox no longer responding on startup → crash in mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort (with abort message chrome://browser/content/sanitize.js) and Firefox no longer responding on startup
b7 looks like this went off the rails. would another short term strategy be to back out the changes in b7 that lead to the dramatic increase in this signature, then try to go back to work on figuring out the complex set of things that are needed to solve the bigger underlying problem and let that set of fixes ride trains. build rank pct-of-crashes 45.b8 1 50.61% new mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b7 1 60.46% new mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b6 6 2.31% new mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b5 7 2.09% 0.0 mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b4 10 1.57% mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b3 9 5 1.54% -0.mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b2 10 1.44% mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 45.b1 3 2.93% mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 44.0.2 12 1.18% mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 44.0.1 23 -14 0.58% mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort 44.0 10 1.06% mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort looks like there are also a number of shutdownhang | WaitForSingleObjectEx .... and shutdownhang | _VEC_memset signatures that were part of 44.0-45.b6 that have reduced in relative volume since this exploded. are any of those related or is it just the case that we crash hear earlier now in shut down and avoid ever getting to the conditions of those shutdownhang bugs?
(In reply to philipp from comment #24) > this signature seems to have become much worse in 45.0b7 (not released) and > 45.0b8 - in very early data for these builds it makes up >50% of all crashes! It also got much worse on nightly (similar massive spike in crashes) starting with the February 18 nightly build. The pushlog for the regression worsening on nightly is: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=709f559b5406&tochange=1150ac4755c7 (I previously thought this bug was about that spike, and didn't notice that it was filed two days before it happened.)
b7 was never released, please ignore it. b8 is what you want to take a look at, and its data is *very* early as it was only released a few hours ago and updates were already disabled again for this bug. In any case, this very probably means that we will not release 45 on time.
bug 1249512 has STR.
(In reply to magicp from comment #29) > bug 1249512 has STR. the "disable form history" hang is simple to fix, bug 976940 will land as soon as the tree reopens. But I doubt all the hangs are just due to people disabling form history. A part of them is very likely due to history clearing (bug 1250363 contains a workaround for the more complex technical debt bug we have from quite some time, needs some additional hours of testing), and a part is due to the fact we are sanitizing on startup also when not needed (bug 1249786, unfortunately hit a, so far, unexplained talos regression). The critical point here is the beta delay, for which we may evaluate a complex backout of various changesets to go back to a situation were issues exists, but they don't cause shutdown crashes. That involves backing out the following changesets (from the newest to the oldest): https://hg.mozilla.org/releases/mozilla-beta/rev/32608d640a8c https://hg.mozilla.org/releases/mozilla-beta/rev/8ade38cf0509 https://hg.mozilla.org/releases/mozilla-beta/rev/fb1653512406 https://hg.mozilla.org/releases/mozilla-beta/rev/46bf92e380db https://hg.mozilla.org/releases/mozilla-beta/rev/b02e8e8f06da https://hg.mozilla.org/releases/mozilla-beta/rev/466f3cc06a67 https://hg.mozilla.org/releases/mozilla-beta/rev/9a6d45264eb2 some backouts may not be trivial cause other stuff landed in the middle. This will bring us back to 44.0a1 that should also help with the spike of crashes in 44 due to Bug 1089695.
Blocks: 1250424
I cloned this bug to bug 1250424, that will track work to properly fix this in Nightly 47 and Aurora 46. Here I'll take care of the backouts in Beta 45.
the affected user at https://support.mozilla.org/questions/1110002 confirmed that the original behaviour from 43 (with no crashes and no hangs) was restored in the latest beta builds for 45 :-)
Thanks for reporting back!
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: