Potential startup Crash in [@ OOM | large | mozalloc_abort | moz_xmalloc | std::allocator<T>::allocate]
Categories
(Core :: Panning and Zooming, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox85 | --- | unaffected |
firefox86 | --- | unaffected |
firefox87 | --- | verified |
firefox88 | --- | verified |
People
(Reporter: aryx, Assigned: hiro)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files)
(deleted),
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details |
(deleted),
image/png
|
Details |
11 Nightly crashes on 5+ devices, all on Windows 10, oldest build ID 20210205132939
Hiro, could this be from bug 1687067?
Crash report: https://crash-stats.mozilla.org/report/index/5f8997e6-440e-4521-a681-e25d40210209
MOZ_CRASH Reason: MOZ_CRASH()
Top 10 frames of crashing thread:
0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 mozglue.dll mozalloc_handle_oom memory/mozalloc/mozalloc_oom.cpp:51
2 mozglue.dll moz_xmalloc memory/mozalloc/mozalloc.cpp:54
3 xul.dll std::allocator< vs2017_15.8.4/VC/include/xmemory0:997
4 xul.dll std::vector<RefPtr<mozilla::layers::AsyncPanZoomController>, std::allocator<RefPtr<mozilla::layers::AsyncPanZoomController> > >::_Emplace_reallocate<RefPtr<mozilla::layers::AsyncPanZoomController> > vs2017_15.8.4/VC/include/vector:956
5 xul.dll mozilla::layers::OverscrollHandoffChain::Add gfx/layers/apz/src/OverscrollHandoffState.cpp:20
6 xul.dll mozilla::layers::APZCTreeManager::BuildOverscrollHandoffChain gfx/layers/apz/src/APZCTreeManager.cpp:2926
7 xul.dll mozilla::layers::AsyncPanZoomController::BuildOverscrollHandoffChain gfx/layers/apz/src/AsyncPanZoomController.cpp:3374
8 xul.dll mozilla::layers::InputBlockState::InputBlockState gfx/layers/apz/src/InputBlockState.cpp:41
9 xul.dll mozilla::layers::WheelBlockState::WheelBlockState gfx/layers/apz/src/InputBlockState.cpp:276
Reporter | ||
Updated•4 years ago
|
Assignee | ||
Comment 1•4 years ago
|
||
Yeah, it looks related. the std::vector in question is std::vector<RefPtr<AsyncPanZoomController>>
so it shouldn't be so large and it's generated from hit testing tree, in my sense it's pretty small, I am guessing the tree is somewhat broken or (APZC tree is somewhat broken)
Anyway, if the crash is happened by the change, it happens on particular sites,
Aryx, can you see any suspicious URIs in the reports?
Reporter | ||
Comment 2•4 years ago
|
||
Scrolling on these sites reproduces the issue here:
- https://pdf.wondershare.com/buy/pdfelement-windows.html (with beta, drag and drop of the scroll bar doesn't work or is disabled).
- https://www.lecho.be/entreprises/general/les-hommes-de-pouvoir-portent-un-brocelet/10168713.html and other articles from that site
The program becomes unresponsive and the minidump analyzer eventually launches and the page and program visually look back to normal but when I interacted with it, it froze again. Eventually had to confirm its termination (after clicking onto the window's 'Close' button).
Assignee | ||
Comment 3•4 years ago
|
||
Thank you, Aryx! It's quite easy to reproduce it. :/
Anyways, I got this assertion;
Assertion failure: false (item should have finite clip with respect to aASR), at layout/painting/nsDisplayList.cpp:3007
So something definitely goes wrong on the main-thread.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 4•4 years ago
|
||
I am also seeing "ASSERTION: Two layers that scroll together have different ancestor transforms: 'false', file gfx/layers/apz/src/APZCTreeManager.cpp:1390".
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 5•4 years ago
|
||
A hit testing tree on https://pdf.wondershare.com/buy/pdfelement-windows.html looks like this;
HitTestingTreeNode (0x7f53bec7cd40) APZC (0x7f53bb73a000) g=({ l=0x100000001, p=1, v=3 }) r=({ }) t=([ I ]) c=([0,0,1276,1145])
HitTestingTreeNode (0x7f53acd2fa60) APZC ((nil)) g=(l=0x100000001) r=({ }) t=([ 1 0; 0 1; 0 74; ]) c=(none)
HitTestingTreeNode (0x7f53acd2f2e0) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none)
HitTestingTreeNode (0x7f53af5435c0) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none) scrollbar
HitTestingTreeNode (0x7f53a9ff9200) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none) scrollbar scrollthumb
HitTestingTreeNode (0x7f53af2e4a60) APZC (0x7f53acc81000) g=({ l=0x10000000b, p=2, v=3 }) r=({ }) t=([ I ]) c=([0,0,1264,1071])
HitTestingTreeNode (0x7f53af2e4e20) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
HitTestingTreeNode (0x7f53a9fd5100) APZC (0x7f53acc81000) g=({ l=0x10000000b, p=2, v=3 }) r=({ }) t=([ I ]) c=([0,0,1264,1071])
HitTestingTreeNode (0x7f53a9ff9020) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
HitTestingTreeNode (0x7f53acd2fe20) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
HitTestingTreeNode (0x7f53aa069200) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ 1 0; 0 1; 0 1071; ]) c=(none)
HitTestingTreeNode (0x7f53af2e4880) APZC (0x7f53acc81000) g=({ l=0x10000000b, p=2, v=3 }) r=({ }) t=([ I ]) c=([0,0,1264,1071])
HitTestingTreeNode (0x7f53aa069020) APZC ((nil)) g=(l=0x10000000b) r=({ }) t=([ I ]) c=(none)
HitTestingTreeNode (0x7f53acd2f6a0) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
HitTestingTreeNode (0x7f53acd2e200) APZC ((nil)) g=(l=0x10000000b) fixed=3 r=({ }) t=([ I ]) c=(none)
There are multiple nodes having APZC(0x7f53acc81000), we are using depth first search (post order) in GetTargetNode, so we found HitTestingTreeNode (0x7f53af2e4880) first, then found HitTestingTreeNode (0x7f53a9fd5100) which has the same APZC. That results an infinite loop there.
Botond, does it make sense to use breadth first search to find the target node for building scroll handoff chains? (I am thinking it does, but I am quite unsure the hit testing tree is malformed or not)
Comment 6•4 years ago
|
||
(In reply to Hiroyuki Ikezoe (:hiro) from comment #5)
There are multiple nodes having APZC(0x7f53acc81000), we are using depth first search (post order) in GetTargetNode, so we found HitTestingTreeNode (0x7f53af2e4880) first, then found HitTestingTreeNode (0x7f53a9fd5100) which has the same APZC. That results an infinite loop there.
I don't think the node 0x7f53a9fd5100 comes into play. I think the way the infinite loop arises is:
- the GetTargetNode() call finds 0x7f53af2e4880, as you said
- inside
GetTargetApzcForNode()
, we get to 0x7f53acd2fe20 - because this is a fixed node, we take this branch, where we look for an APZC with
v=3
, which gets us back to the APZC we started with
Updated•4 years ago
|
Comment 7•4 years ago
|
||
I think using a breadth-first search in GetTargetNode()
would be fine.
I do think it's more of a workaround and the underlying problem is with the tree structure. Specifically, having a node with fixed=N
(in this case, 0x7f53acd2fe20 with fixed=3
) which has a descendant with an APZC with v=N
(same N
, in this case, 0x7f53af2e4880 with v=3
) is suspicious and probably shouldn't happen. I think it's worth filing a follow-up to investigate this further.
Note that using a breadth-first search would not fix a scenario where we have the above pattern, and that descendant is the only node for that APZC. (I don't know if this is possible.) In that case, we would get an infinite loop with a breadth-first search too.
Perhaps a more robust workaround would be in FindHandoffParent
:
while (node) {
if (auto* apzc = GetTargetApzcForNode(node->GetParent())) {
if (apzc != aApzc) { // <======
return apzc;
}
}
node = node->GetParent();
}
I think that would handle both cases.
Assignee | ||
Comment 8•4 years ago
|
||
Thank you Botond! Yeah, the way you suggested is the way what I initially tried. :) (I assume you meant aApzc
instead of this
). Will take the way and file a bug for the investigation of the malformed tree.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 9•4 years ago
|
||
So we can avoid an infinite loop in the function when we traverse malformed
trees.
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Assignee | ||
Comment 11•4 years ago
|
||
Comment on attachment 9203149 [details]
Bug 1691997 - Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r?botond
Beta/Release Uplift Approval Request
- User impact if declined: Crash
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: 1. Open https://pdf.wondershare.com/buy/pdfelement-windows.html
- Scroll up and down the page
- Click somewhere in the page
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): So the crash was caused by an infinite loop (which leads OOM), this change does just bail out from the loop
- String changes made/needed:
Assignee | ||
Updated•4 years ago
|
Comment 12•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Comment 13•4 years ago
|
||
Reproduced the issue on affected Beta 87.0b1 on Windows 10 x64, Windows 7 x64 and Windows 8.1 x86 for both sites.
Verified-fixed on the latest Nightly 88.0a1 on the above-mentioned OSes. The browser does not freeze and crash.
However, on Windows 7 only, even if the browser does not crash, memory consumption goes up to 100% (16GB) and persists even after closing Nightly. I had to kill the process from Task Manager in order to free up the memory again. Hiro, should I submit a new bug for this case or should this ticket be re-opened?
Comment 14•4 years ago
|
||
Attaching image of the memory consumption I was talking about. This happens only if I reproduce the issue first on a Nightly build without the fix (older Nightly) and then update to the one with the fix. It takes more scrolling and navigating through the pages of this site to reach that but it did happen on my side.
Downloading the latest build and trying to reproduce it with a new profile will not exhibit this issue. The upgrade scenario is more relevant for users here.
Comment 15•4 years ago
|
||
This is flagged as regressed by bug 1687067 which landed in 87. Why does it need uplift to release (86)?
Comment 16•4 years ago
|
||
(In reply to Julien Cristau [:jcristau] (back March 1) from comment #15)
This is flagged as regressed by bug 1687067 which landed in 87. Why does it need uplift to release (86)?
I think that part is a mistake, only beta uplift should be needed.
Comment 17•4 years ago
|
||
Comment on attachment 9203149 [details]
Bug 1691997 - Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r?botond
(Remove approval-mozilla-release flag.)
Assignee | ||
Comment 18•4 years ago
|
||
Yeah, Botond is right. I was confused by the release version number. I did actually double check which version is affected by this, and what I was thinking somehow is "oh it's in 87 which is the last released version". :/
(In reply to Timea Cernea [:tbabos] from comment #14)
Created attachment 9204782 [details]
Windows 7 Memory Leak after fix. (requires upgrade on same profile)Attaching image of the memory consumption I was talking about. This happens only if I reproduce the issue first on a Nightly build without the fix (older Nightly) and then update to the one with the fix. It takes more scrolling and navigating through the pages of this site to reach that but it did happen on my side.
Downloading the latest build and trying to reproduce it with a new profile will not exhibit this issue. The upgrade scenario is more relevant for users here.
That sounds weird. It's worth filing a new bug (I have no idea what I can do for it though).
Comment 19•4 years ago
|
||
Thanks, Hiro.Will submit a new bug for the win7 issue. Marking is as verified-fixed based on Comment 13 for the latest Nightly.
Waiting for uplift to Beta to verify further.
Comment 20•4 years ago
|
||
Comment on attachment 9203149 [details]
Bug 1691997 - Walk up the hit testing tree until we find an APZC different from the given |aApzc| in FindHandoffParent. r?botond
Approved for 87.0b3.
Comment 21•4 years ago
|
||
bugherder uplift |
Comment 22•4 years ago
|
||
Verified-fixed on the latest Beta 87.0b4 on Windows 10 x64, Windows 7 x64 and Windows 8.1 x86. Could not reproduce the Win7 issue anymore using the same scenario after updating to the latest Beta.
Description
•