Open Bug 1405521 Opened 7 years ago Updated 10 months ago

Crash in nsGlobalWindow::ClearDocumentDependentSlots: MOZ_CRASH(Unhandlable OOM while clearing document dependent slots.)

Categories

(Core :: DOM: Core & HTML, defect, P3)

Unspecified
All
defect

Tracking


Tracking Status
firefox-esr102 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox105 --- wontfix
firefox106 --- wontfix
firefox107 --- wontfix

People

(Reporter: n.nethercote, Unassigned)

References

Details

(Keywords: crash, topcrash)

Crash Data

This bug was filed from the Socorro interface and is report bp-0901faec-eb60-4ad0-b1ce-0a9e10171003.
=============================================================

We are hitting this:

> MOZ_CRASH(Unhandlable OOM while clearing document dependent slots.)

But in the linked crash report the system has huge amounts of virtual memory, page file and physical memory.

The code looks like this:

>  if (!WindowBinding::ClearCachedDocumentValue(aCx, this) ||
>      !WindowBinding::ClearCachedPerformanceValue(aCx, this)) {
>    MOZ_CRASH("Unhandlable OOM while clearing document dependent slots.");
>  }

It assumes that ClearCachedDocumentValue() and ClearCachedPerformanceValue() failure indicates OOM, but they will also fail if `this->GetWrapper()` returns null. bz, is that possible?
Flags: needinfo?(bzbarsky)
> but they will also fail if `this->GetWrapper()` returns null.

No, they won't; they'll no-op.  The generated code is:

  obj = aObject->GetWrapper();
  if (!obj) {
    return true;
  }

precisely because no wrapper is not a failure condition: it just means that there is nothing to clear.  

False return means that either get_document or get_performance returned false.

Looking at get_document, we're in the case when we just cleared the slot, so we'll make it to the self->GetDocument() call.  Once we do, we will return false only in the following cases:

1)  GetOrCreateDOMReflector(cx, result, args.rval()) returns false.
2)  Wrapping the value the getter returned into the window's compartment returns false.
3)  Wrapping the value the getter returned into the caller's compartment (which is also
    the window compartment in this case) returns false.

That's it.  Looking at get_performance, it can return false also in only those three cases.
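
(To make that control flow easier to follow, here is a hand-written sketch of the shape of the clearing path described above. This is NOT the actual generated binding code; the slot-clearing helper and the getter signature are simplified placeholders.)

  // Illustrative sketch only -- the real ClearCachedDocumentValue() is emitted
  // by the WebIDL codegen and the real get_document() signature is more involved.
  bool ClearCachedDocumentValue(JSContext* aCx, nsGlobalWindowInner* aObject)
  {
    JSObject* obj = aObject->GetWrapper();
    if (!obj) {
      return true;               // No wrapper: nothing to clear, not a failure.
    }
    ClearCachedSlot(obj);        // Placeholder for clearing the reserved slot.
    // The slot is refilled right away, which is why get_document runs here; a
    // false return from the getter is what bubbles out as the "OOM" MOZ_CRASH.
    return get_document(aCx, obj, aObject);
  }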

Wrapping into compartments only fails if JS_WrapValue returns false, which only happens if JSCompartment::wrap() returns false, which can happen in the following ways:

A) CheckSystemRecursionLimit() (inside getNonWrapperObjectForCurrentCompartment) fails.
B) The prewrap callback outputs null.  This should never happen for webidl objects, afaict.
C) JSCompartment::getOrCreateWrapper fails.  I believe this can only happen on "OOM".

As for GetOrCreateDOMReflector, it returns false if:

D) CouldBeDOMBinding() returns false (should never happen for document or performance).
E) The actual WrapObject() call returns false.
F) JS_WrapValue returns false, see above.

WrapObject() _can_ fail in somewhat interesting ways for documents.  See <http://searchfox.org/mozilla-central/source/dom/base/nsINode.cpp#2953-2958>.  But that shouldn't happen in this case, since we're just setting the document up, so all the script handling object state should be fine.  The other thing both WrapObject() implementations call is the relevant binding's Wrap(), which can fail when globals are missing or protos can't be instantiated, but fundamentally none of those should be happening here.  And of course it can fail on "OOM".
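
(For context, a rough paraphrase of the document-specific check at that searchfox link; this is reconstructed from memory and simplified, so the exact code in the tree may differ.)

  // Approximate shape of the nsINode::WrapObject() check referenced above:
  // refuse to create a reflector for a node whose document has never had a
  // script handling object.
  bool hasHadScriptHandlingObject = false;
  if (!OwnerDoc()->GetScriptHandlingObject(hasHadScriptHandlingObject) &&
      !hasHadScriptHandlingObject) {
    Throw(aCx, NS_ERROR_UNEXPECTED);
    return nullptr;
  }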

Now the big caveat: "OOM" for our purposes is "out of SpiderMonkey heap", not "out of memory".  It's totally possible to hit "OOM" without actually being out of memory for malloc() purposes...  I don't know offhand what the cap is on the size of SpiderMonkey's heap, esp because it looks like we have separate heaps for gcthing allocations and JS_malloc allocations or something.

So for this specific crash, my best guess is that we either hit the SpiderMonkey memory cap or CheckSystemRecursionLimit() failed.  Hard to tell about the latter, because the stack is truncated at the first jitframe.
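
(For what it's worth, the recursion check in question is just a native-stack-exhaustion guard: it makes the current operation fail, rather than crash, once the C++ stack gets too deep, which is exactly the kind of non-memory failure that ends up misreported as "OOM" here. A generic sketch of the pattern, not SpiderMonkey's actual implementation:)

  // Generic sketch of a stack-depth guard (not SpiderMonkey's code): report
  // failure once the native stack pointer crosses a precomputed limit, so
  // callers can bail out gracefully instead of overflowing the stack.
  static bool CheckNativeStackSketch(uintptr_t aStackLimit)
  {
    char marker;
    // Assumes the stack grows downward, as it does on the platforms in question.
    return reinterpret_cast<uintptr_t>(&marker) > aStackLimit;
  }
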
Flags: needinfo?(bzbarsky)
> No, they won't; they'll no-op.  The generated code is:
> 
>   obj = aObject->GetWrapper();
>   if (!obj) {
>     return true;
>   }
> 
> precisely because no wrapper is not a failure condition: it just means that
> there is nothing to clear.  

Yes, my bad.

Thank you for the detailed analysis.

jonco, does SM have a heap limit?
Flags: needinfo?(jcoppeard)
(In reply to Nicholas Nethercote [:njn] from comment #2)
> jonco, does SM have a heap limit?

Yes, the GC parameter JSGC_MAX_BYTES is used to set this.  This is set to 0xffffffff in XPConnect.
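
(For reference, that limit is configured through the public GC parameter API; a minimal sketch of the call, assuming the usual JS_SetGCParameter entry point -- the actual XPConnect call site may look slightly different:)

  // Sketch: XPConnect effectively disables the GC heap cap by setting
  // JSGC_MAX_BYTES to the maximum 32-bit value.
  JS_SetGCParameter(cx, JSGC_MAX_BYTES, 0xffffffff);
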
Flags: needinfo?(jcoppeard)
Fwiw, bug 1197540 has a testcase that can trigger this assert.  That testcase definitely looks like an infinite recursion case to me, so it might be running into CheckSystemRecursionLimit() failure...  I'll try catching it in rr.
Depends on: 1197540
Priority: -- → P2
426 crashes in the last week in 57.
OS: Windows 10 → All
Crash Signature: [@ nsGlobalWindow::ClearDocumentDependentSlots] → [@ nsGlobalWindow::ClearDocumentDependentSlots] [@ nsGlobalWindowInner::ClearDocumentDependentSlots]
There are around 1000 crashes per week on release 57 versions. From the duplicate bug 1422313, sounds like this is a shift in signature rather than a new crash.
bp-9d2936eb-65eb-4e92-bd7d-540380180425 with build 20180410100115 @ nsGlobalWindowInner::ClearDocumentDependentSlots
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #4)
> Fwiw, [Mac] bug 1197540 has a testcase that can trigger this assert.  That
> testcase definitely looks like an infinite recursion case to me, so it might
> be running into CheckSystemRecursionLimit() failure...  I'll try catching it
> in rr.

The Mac crashes for that bug report dropped to near zero around mid-March, from whatever shipped after 58.0.2. https://crash-stats.mozilla.com/signature/?_sort=user_comments&_sort=-date&signature=nsGlobalWindow%3A%3AInnerSetNewDocument&date=%3E%3D2017-12-11T04%3A18%3A16.000Z&date=%3C2018-06-11T05%3A18%3A16.000Z#graphs
(In reply to Wayne Mery (:wsmwk) from comment #9)
> Crashes for that bug report (bug 1197540) dropped to near zero around mid-March,
> from whatever shipped after 58.0.2.
> https://crash-stats.mozilla.com/signature/?_sort=user_comments&_sort=-
> date&signature=nsGlobalWindow%3A%3AInnerSetNewDocument&date=%3E%3D2017-12-
> 11T04%3A18%3A16.000Z&date=%3C2018-06-11T05%3A18%3A16.000Z#graphs

However, the crash rate for this (Windows) bug has held fairly steady.  But that is actually not surprising, because the majority of crashes here are on 52.x ESR.

I had another crash: bp-3819de05-b8f6-41a0-9297-1b2980180611
Depends on: 1491313
Depends on: 1491925
Depends on: 1493849
Depends on: 1496805
Depends on: 1499150
Depends on: 1501479
Depends on: 1503659
Depends on: 1503664
Depends on: 1505468
Component: DOM → DOM: Core & HTML

While trying to get an rr trace for bug 1593704 I kept triggering this issue. I created a Pernosco session which can be found here: https://pernos.co/debug/HeNG0Imk-tsJryLVWyqD2g/index.html

Crash Signature: [@ nsGlobalWindow::ClearDocumentDependentSlots] [@ nsGlobalWindowInner::ClearDocumentDependentSlots] → [@ nsGlobalWindow::ClearDocumentDependentSlots] [@ nsGlobalWindowInner::ClearDocumentDependentSlots] [@ nsGlobalWindowOuter::SetNewDocument]

(In reply to Tyson Smith [:tsmith] from comment #11)
> While trying to get an rr trace for bug 1593704 I kept triggering this issue. I created a Pernosco session which can be found here: https://pernos.co/debug/HeNG0Imk-tsJryLVWyqD2g/index.html

I happened to resurrect the Pernosco trace. It seems that GetOrCreateDOMReflector returns false (I added an entry to the notebook there). Due to massive inlining it is less clear (to me) whether this can really only be caused by OOM.


Here is a Pernosco session created with a -O0 build; hopefully this is more helpful. https://pernos.co/debug/wo7vFFam6FDy7kYhiopX5g/index.html

Flags: needinfo?(jstutte)

Thanks a lot! That makes it easier. It seems we are definitely not seeing an OOM here.

The low-level analysis is that on this call stack we rely on GetWrapperMaybeDead to always give us a living wrapper, which is not the case. I was not yet able to check what might cause the wrapper to be "dead and in the process of being finalized", which the comment points out as a possible cause for it being nullptr.

Olli, does this help to understand this better?

Flags: needinfo?(jstutte) → needinfo?(bugs)

(In reply to Jens Stutte [:jstutte] from comment #15)
> I was not yet able to check what might cause the wrapper to be "dead and in the process of being finalized".

This means that the GC has determined that the wrapper is dead, but the wrapper has not been destroyed yet (and the pointer to it has not been set to null).

The stack shows an interesting cycle of:

XMLHttpRequest_Binding::open
...
XMLHttpRequestMainThread::FireReadystatechangeEvent
...
js::RunScript
...
XMLHttpRequest_Binding::send
...
XMLHttpRequestMainThread::ResumeEventDispatching
EventTarget::DispatchEvent
...
js::RunScript
...
XMLHttpRequest_Binding::open

over and over again. So a sync XHR that triggers another sync XHR, and so on.

Eventually we are way down into that stack, in danger of hitting JS engine stack-overflow checks, and processing events under a sync XHR. We land in nsDocumentOpenInfo::OnStartRequest and go from there. We try to create a wrapper for the document, try to create its proto, try to define properties on it, hit the over-recursion check in CallJSAddPropertyOp and fail it, fail to add the property and bubble up the stack failing things.

I added some notes to the Pernosco session for these bits.

(In reply to Karl Dubost 💡 :karlcow from comment #18)
> We had a report on webcompat with regards to this website

This crash is just a symptom of running out of memory. What is more interesting is what is causing the browser to use a lot of memory. You'll want a new bug for that.

Webcompat Priority: --- → ?
Webcompat Priority: ? → ---

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash
Severity: critical → S2
Flags: needinfo?(smaug)

Based on comment 19, I'd assume this is not worth S2?

Flags: needinfo?(continuation)

Comment 19 was mostly relevant to a specific site that was maybe showing this issue.

It is a reasonably common crash, so it might qualify as S2, but we have no real plan of action here. Boris landed a ton of instrumentation in 2018 to try to figure out why this is happening, but he didn't post any kind of conclusion in these bugs as far as I can see, so I guess nothing useful resulted from it.

Although looking at this now, it is possible that bug 1543537 helped here. Weird null derefs in documents are a possible symptom of that issue. I fixed it in 107, and the volume in 107 beta seems to be a lot lower than 106 beta (I think comment 20 was made when 106 was in beta). Maybe we can wait a few weeks and see if the volume on 107 continues to be low, then we could remove the top crash and mark it S3.

Depends on: 1543537
Flags: needinfo?(continuation)

This became frequent with 109.0a1 build 20221129084032: 25-70 crashes per Nightly build. There are no crash reports for the latest Nightly (20221030214707?) so far. The push log lists nothing obvious.

The bug is linked to topcrash signatures, which match the following criteria:

  • Top 5 desktop browser crashes on Mac on beta
  • Top 5 desktop browser crashes on Mac on release
  • Top 20 desktop browser crashes on release (startup)
  • Top 20 desktop browser crashes on beta
  • Top 10 desktop browser crashes on nightly
  • Top 10 content process crashes on beta
  • Top 10 content process crashes on release
  • Top 5 desktop browser crashes on Linux on beta
  • Top 5 desktop browser crashes on Linux on release
  • Top 5 desktop browser crashes on Windows on release (startup)

For more information, please visit auto_nag documentation.

Hello Andrew, we got new reports and this is a current topcrash-startup. Would you please take another look? Thanks.

Flags: needinfo?(continuation)

The set of patches I'm seeing for the build that started crashing a lot is this. That includes bug 1219128, which has already had multiple OOM issues in automation associated with it. I think we should back that patch out if the memory issues can't be resolved very quickly.

Flags: needinfo?(continuation) → needinfo?(nicolas.b.pierron)

Hmm, I guess it got backed out immediately but still got marked fixed somehow, so I guess it can't be to blame?

Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(continuation)

(In reply to Andrew McCreight [:mccr8] from comment #26)
> The set of patches I'm seeing for the build that started crashing a lot is this. That includes bug 1219128, which has already had multiple OOM issues in automation associated with it. I think we should back that patch out if the memory issues can't be resolved very quickly.

The OOM issues related to this patch are caused only by the test suite being configured to aggressively hunt for OOM issues via a loop that simulates OOM, and the associated backout was caused by failing to annotate the affected test cases as being instrumented that way. None of these OOMs are caused by the system actually running out of memory.

Otherwise, bug 1219128 only changes how the Object and Function classes are registered on the GlobalObject, by allocating them eagerly; they would most likely be present anyway as long as the global is actually used.

I filed a new bug for this recent spike as it seems to involve a bunch of unrelated signatures.

Flags: needinfo?(continuation)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on release

For more information, please visit auto_nag documentation.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

This is a symptom of an OOM. There has been extensive investigation that hasn't found anything.

Severity: S2 → S3
Priority: P2 → P3

I don't know if it helps, but I recently stumbled over this a few times. A website that produced this error for me quite reliably is https://www.welt.de/
It's a news website. If you let it "idle" for a while, the site will notify you that there have been news updates. At first I thought that this triggered the crash, but it doesn't. If you just don't touch the website and let it sit a while longer, the tab usually crashes after around an hour (though sometimes it doesn't happen at all). What's pretty interesting to me is that before it crashes I get spammed with "save file to" dialogs for a lot of empty .html files (I don't even know why that happens). After you save them all or cancel the downloads, you then realize that the tab has crashed. It also once gave me a memory read exception with the exact same behaviour.

Sorry for removing the keyword earlier but there is a recent change in the ranking, so the bug is again linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on release

For more information, please visit BugBot documentation.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

