Closed Bug 1691997 Opened 4 years ago Closed 4 years ago

Potential startup Crash in [@ OOM | large | mozalloc_abort | moz_xmalloc | std::allocator<T>::allocate]

Categories

(Core :: Panning and Zooming, defect)

Unspecified
Windows 10
defect

Tracking

()

VERIFIED FIXED
88 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox85 --- unaffected
firefox86 --- unaffected
firefox87 --- verified
firefox88 --- verified

People

(Reporter: aryx, Assigned: hiro)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

11 Nightly crashes on 5+ devices, all on Windows 10, oldest build ID 20210205132939

Hiro, could this be from bug 1687067?

Crash report: https://crash-stats.mozilla.org/report/index/5f8997e6-440e-4521-a681-e25d40210209

MOZ_CRASH Reason: MOZ_CRASH()

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 mozglue.dll mozalloc_handle_oom memory/mozalloc/mozalloc_oom.cpp:51
2 mozglue.dll moz_xmalloc memory/mozalloc/mozalloc.cpp:54
3 xul.dll std::allocator< vs2017_15.8.4/VC/include/xmemory0:997
4 xul.dll std::vector<RefPtr<mozilla::layers::AsyncPanZoomController>, std::allocator<RefPtr<mozilla::layers::AsyncPanZoomController> > >::_Emplace_reallocate<RefPtr<mozilla::layers::AsyncPanZoomController> > vs2017_15.8.4/VC/include/vector:956
5 xul.dll mozilla::layers::OverscrollHandoffChain::Add gfx/layers/apz/src/OverscrollHandoffState.cpp:20
6 xul.dll mozilla::layers::APZCTreeManager::BuildOverscrollHandoffChain gfx/layers/apz/src/APZCTreeManager.cpp:2926
7 xul.dll mozilla::layers::AsyncPanZoomController::BuildOverscrollHandoffChain gfx/layers/apz/src/AsyncPanZoomController.cpp:3374
8 xul.dll mozilla::layers::InputBlockState::InputBlockState gfx/layers/apz/src/InputBlockState.cpp:41
9 xul.dll mozilla::layers::WheelBlockState::WheelBlockState gfx/layers/apz/src/InputBlockState.cpp:276
Flags: needinfo?(hikezoe.birchill)
Summary: Crash in [@ OOM | large | mozalloc_abort | moz_xmalloc | std::allocator<T>::allocate] → Potential startup Crash in [@ OOM | large | mozalloc_abort | moz_xmalloc | std::allocator<T>::allocate]

Yeah, it looks related. the std::vector in question is std::vector<RefPtr<AsyncPanZoomController>> so it shouldn't be so large and it's generated from hit testing tree, in my sense it's pretty small, I am guessing the tree is somewhat broken or (APZC tree is somewhat broken)

Anyway, if the crash is happened by the change, it happens on particular sites,

Aryx, can you see any suspicious URIs in the reports?

Flags: needinfo?(hikezoe.birchill) → needinfo?(aryx.bugmail)

Scrolling on these sites reproduces the issue here:

The program becomes unresponsive and the minidump analyzer eventually launches and the page and program visually look back to normal but when I interacted with it, it froze again. Eventually had to confirm its termination (after clicking onto the window's 'Close' button).

Flags: needinfo?(aryx.bugmail)

Thank you, Aryx! It's quite easy to reproduce it. :/

Anyways, I got this assertion;

Assertion failure: false (item should have finite clip with respect to aASR), at layout/painting/nsDisplayList.cpp:3007

So something definitely goes wrong on the main-thread.

Flags: needinfo?(hikezoe.birchill)

I am also seeing "ASSERTION: Two layers that scroll together have different ancestor transforms: 'false', file gfx/layers/apz/src/APZCTreeManager.cpp:1390".

Assignee: nobody → hikezoe.birchill
Status: NEW → ASSIGNED
Flags: needinfo?(hikezoe.birchill)

A hit testing tree on https://pdf.wondershare.com/buy/pdfelement-windows.html looks like this;

HitTestingTreeNode (0x7f53bec7cd40) APZC (0x7f53bb73a000) g=({ l=0x100000001, p=1, v=3 }) r=({ }) t=([ I ]) c=([0,0,1276,1145])
  HitTestingTreeNode (0x7f53acd2fa60) APZC ((nil)) g=(l=0x100000001) r=({ }) t=([ 1 0; 0 1; 0 74; ]) c=(none)
    HitTestingTreeNode (0x7f53acd2f2e0) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none)
      HitTestingTreeNode (0x7f53af5435c0) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none) scrollbar
        HitTestingTreeNode (0x7f53a9ff9200) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none) scrollbar scrollthumb
      HitTestingTreeNode (0x7f53af2e4a60) APZC (0x7f53acc81000) g=({ l=0x10000000b, p=2, v=3 }) r=({ }) t=([ I ]) c=([0,0,1264,1071])
      HitTestingTreeNode (0x7f53af2e4e20) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
      HitTestingTreeNode (0x7f53a9fd5100) APZC (0x7f53acc81000) g=({ l=0x10000000b, p=2, v=3 }) r=({ }) t=([ I ]) c=([0,0,1264,1071])
      HitTestingTreeNode (0x7f53a9ff9020) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
      HitTestingTreeNode (0x7f53acd2fe20) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
        HitTestingTreeNode (0x7f53aa069200) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ 1 0; 0 1; 0 1071; ]) c=(none)
          HitTestingTreeNode (0x7f53af2e4880) APZC (0x7f53acc81000) g=({ l=0x10000000b, p=2, v=3 }) r=({ }) t=([ I ]) c=([0,0,1264,1071])
            HitTestingTreeNode (0x7f53aa069020) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none)
      HitTestingTreeNode (0x7f53acd2f6a0) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
      HitTestingTreeNode (0x7f53acd2e200) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)

There are multiple nodes having APZC(0x7f53acc81000), we are using depth first search (post order) in GetTargetNode, so we found HitTestingTreeNode (0x7f53af2e4880) first, then found HitTestingTreeNode (0x7f53a9fd5100) which has the same APZC. That results an infinite loop there.

Botond, does it make sense to use breadth first search to find the target node for building scroll handoff chains? (I am thinking it does, but I am quite unsure the hit testing tree is malformed or not)

Flags: needinfo?(botond)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #5)

There are multiple nodes having APZC(0x7f53acc81000), we are using depth first search (post order) in GetTargetNode, so we found HitTestingTreeNode (0x7f53af2e4880) first, then found HitTestingTreeNode (0x7f53a9fd5100) which has the same APZC. That results an infinite loop there.

I don't think the node 0x7f53a9fd5100 comes into play. I think the way the infinite loop arises is:

  • the GetTargetNode() call finds 0x7f53af2e4880, as you said
  • inside GetTargetApzcForNode(), we get to 0x7f53acd2fe20
  • because this is a fixed node, we take this branch, where we look for an APZC with v=3, which gets us back to the APZC we started with
Flags: needinfo?(botond)
Regressed by: 1687067
Has Regression Range: --- → yes

I think using a breadth-first search in GetTargetNode() would be fine.

I do think it's more of a workaround and the underlying problem is with the tree structure. Specifically, having a node with fixed=N (in this case, 0x7f53acd2fe20 with fixed=3) which has a descendant with an APZC with v=N (same N, in this case, 0x7f53af2e4880 with v=3) is suspicious and probably shouldn't happen. I think it's worth filing a follow-up to investigate this further.

Note that using a breadth-first search would not fix a scenario where we have the above pattern, and that descendant is the only node for that APZC. (I don't know if this is possible.) In that case, we would get an infinite loop with a breadth-first search too.

Perhaps a more robust workaround would be in FindHandoffParent:

  while (node) {
    if (auto* apzc = GetTargetApzcForNode(node->GetParent())) {
      if (apzc != aApzc) {        // <======
        return apzc;
      }
    }
    node = node->GetParent();
  }

I think that would handle both cases.

Thank you Botond! Yeah, the way you suggested is the way what I initially tried. :) (I assume you meant aApzc instead of this). Will take the way and file a bug for the investigation of the malformed tree.

So we can avoid an infinite loop in the function when we traverse malformed
trees.

Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/83c9cd9cf143 Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r=botond

Comment on attachment 9203149 [details]
Bug 1691997 - Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r?botond

Beta/Release Uplift Approval Request

  1. Scroll up and down the page
  2. Click somewhere in the page
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): So the crash was caused by an infinite loop (which leads OOM), this change does just bail out from the loop
  • String changes made/needed:
Attachment #9203149 - Flags: approval-mozilla-release?
Attachment #9203149 - Flags: approval-mozilla-beta?
Flags: qe-verify+
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch
QA Whiteboard: [qa-triaged]

Reproduced the issue on affected Beta 87.0b1 on Windows 10 x64, Windows 7 x64 and Windows 8.1 x86 for both sites.
Verified-fixed on the latest Nightly 88.0a1 on the above-mentioned OSes. The browser does not freeze and crash.

However, on Windows 7 only, even if the browser does not crash, memory consumption goes up to 100% (16GB) and persists even after closing Nightly. I had to kill the process from Task Manager in order to free up the memory again. Hiro, should I submit a new bug for this case or should this ticket be re-opened?

Flags: needinfo?(hikezoe.birchill)

Attaching image of the memory consumption I was talking about. This happens only if I reproduce the issue first on a Nightly build without the fix (older Nightly) and then update to the one with the fix. It takes more scrolling and navigating through the pages of this site to reach that but it did happen on my side.
Downloading the latest build and trying to reproduce it with a new profile will not exhibit this issue. The upgrade scenario is more relevant for users here.

This is flagged as regressed by bug 1687067 which landed in 87. Why does it need uplift to release (86)?

(In reply to Julien Cristau [:jcristau] (back March 1) from comment #15)

This is flagged as regressed by bug 1687067 which landed in 87. Why does it need uplift to release (86)?

I think that part is a mistake, only beta uplift should be needed.

Comment on attachment 9203149 [details]
Bug 1691997 - Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r?botond

(Remove approval-mozilla-release flag.)

Attachment #9203149 - Flags: approval-mozilla-release?

Yeah, Botond is right. I was confused by the release version number. I did actually double check which version is affected by this, and what I was thinking somehow is "oh it's in 87 which is the last released version". :/

(In reply to Timea Cernea [:tbabos] from comment #14)

Created attachment 9204782 [details]
Windows 7 Memory Leak after fix. (requires upgrade on same profile)

Attaching image of the memory consumption I was talking about. This happens only if I reproduce the issue first on a Nightly build without the fix (older Nightly) and then update to the one with the fix. It takes more scrolling and navigating through the pages of this site to reach that but it did happen on my side.
Downloading the latest build and trying to reproduce it with a new profile will not exhibit this issue. The upgrade scenario is more relevant for users here.

That sounds weird. It's worth filing a new bug (I have no idea what I can do for it though).

Flags: needinfo?(hikezoe.birchill)

Thanks, Hiro.Will submit a new bug for the win7 issue. Marking is as verified-fixed based on Comment 13 for the latest Nightly.
Waiting for uplift to Beta to verify further.

Comment on attachment 9203149 [details]
Bug 1691997 - Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r?botond

Approved for 87.0b3.

Attachment #9203149 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Verified-fixed on the latest Beta 87.0b4 on Windows 10 x64, Windows 7 x64 and Windows 8.1 x86. Could not reproduce the Win7 issue anymore using the same scenario after updating to the latest Beta.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: