Closed Bug 1466010 Opened 7 years ago Closed 3 years ago

Crash in MergeState::HasMatchingItemInOldList

Categories

(Core :: Web Painting, defect, P5)

Unspecified
All
defect

Tracking

()

RESOLVED WORKSFORME
Webcompat Priority ?
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox60 --- unaffected
firefox61 - wontfix
firefox62 - wontfix
firefox63 - wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix

People

(Reporter: mattwoodrow, Assigned: hiro)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #1462497 +++ Bug 1262497 fixed most causes of this bug, and prevented it from crashing in release bugs. This bug tracks the remaining crashes.
(In reply to [:philipp] from bug 1462497 comment #41) > the signature is still present in 61.0b10 devedition builds - this query > is/will be listing the moz_crash_reason fields: > https://bit.ly/2JnOLAp
Priority: -- → P2
Depends on: 1466954
OS: Windows 10 → All
This is the top non-shutdown hang crash on the 6-13 Windows Nightly with 3.52% of all crashes (which is 30 crashes). Three of the URLs for the crashes are YouTube, and one is Google Docs.
I believe these crashes are hitting diagnostic asserts (given that only aurora channel builds appear to be affected on Beta), so calling this fix-optional for 61.
I might have run into what look like more instances of this bug while triaging nightly builds. I found at least three reports with a similar MOZ_CRASH() reason but with a useless signature. This one for example: https://crash-stats.mozilla.com/report/index/66a891bd-45fb-4194-ba70-d45b10180618 We might be undercounting the crashes.
Depends on: 1469746
Filed bug 1469746 for the specific case of this reported in webcompat bug 17317. (In reply to Gabriele Svelto [:gsvelto] from comment #6) > I might have run into what look like more instances of this bug while > triaging nightly builds. I found at least three reports with a similar > MOZ_CRASH() reason but with a useless signature. This one for example: > > https://crash-stats.mozilla.com/report/index/66a891bd-45fb-4194-ba70- > d45b10180618 > > We might be undercounting the crashes. Missing symbols seems like a generic problem that would cause us to undercount crashes for all issues. It's not obvious why symbols missing would be specific to this crash. Do you know if we have other reported cases of this? That said, if we could link the bug report to the assert message instead of the symbol, we'd be less likely to have this issue.
(In reply to Matt Woodrow (:mattwoodrow) from comment #7) > Missing symbols seems like a generic problem that would cause us to > undercount crashes for all issues. It's not obvious why symbols missing > would be specific to this crash. Do you know if we have other reported cases > of this? Not really, but I wanted to point this out because there were 6 instances of this for a single nightly build which is a lot. > That said, if we could link the bug report to the assert message instead of > the symbol, we'd be less likely to have this issue. Yeah.
Crashed while making a youtube video fullscreen.
Adding 63 as affected.
This is the top non-shutdown hang crash in the 6-27 Windows Nightly, with 19 crashes. URLs are various YouTube videos, plus http://www.pedalsandeffects.com/blog/2018/4/10/rainger-fx-reverb-x (which has an embedded YouTube video).
Depends on: 1472053
(In reply to Andrew McCreight [:mccr8] from comment #11) > This is the top non-shutdown hang crash in the 6-27 Windows Nightly, with 19 > crashes. URLs are various YouTube videos, plus > http://www.pedalsandeffects.com/blog/2018/4/10/rainger-fx-reverb-x (which > has an embedded YouTube video). It looks to be a really complex interaction with our fullscreen transition and an animation finishing in the middle of it. I can reproduce it very very occasionally, but not in rr. I've filed bug 1472053 for fixing what we think the issue is.
Depends on: 1472076
Similar to comment 11, this is the #1 non-shutdown hang crash in the Windows nightly of 20180628130527, with 30+ different installations recording 43 crashes. Note that, despite "almost impossible to reproduce" comments above, one of the installations (2018-06-28 18:46:34) reports 5 crashes, and in total there are 4 installations reporting a crash more than once. Given that, per comments in bug 1472053, this sounds like some kind of race condition, I wonder if it would be easier to reproduce on a very slow machine. The other Windows nightly of the day, 20180628220051, reports 31 crashes including 3 installations that crashed more than once.
We think we already know what happens there on the race condition. Bug 1472053 is an approach to fix the race, and bug 1472076 is another approach for the race. Probably we can land bug 1472076 within a few days.
After bug 1472076 landed, there are still some crashes unfortunately, e.g; https://crash-stats.mozilla.com/report/index/5a7153f7-4e0f-447b-9374-203a20180703 The buildid 20180703100028 also includes bug 1467688, so there are still something flaw we are not aware of.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #15) > After bug 1472076 landed, there are still some crashes unfortunately, e.g; > > https://crash-stats.mozilla.com/report/index/5a7153f7-4e0f-447b-9374- > 203a20180703 > > The buildid 20180703100028 also includes bug 1467688, so there are still > something flaw we are not aware of. Very interesting that this is also from the WillPaintWindow callstack, not nsRefreshDriver::Tick. Is it possible that there's another way for the refresh driver time to change without us notifying our observers? Maybe we switched timers?
Another possibility that changes the returned value of AnimationEffect::IsCurrent (or IsInEffect) I can think of is that GetNavigationTiming call [1] in DocumentTimlineGetCurrentTimeStamp if we don't have refresh driver yet. [1] https://searchfox.org/mozilla-central/rev/403038737ba75af3842ba6b43b6e2fb47eb06609/dom/animation/DocumentTimeline.cpp#96 We should probably fix this too.
Depends on: 1473172
Still happens from WillPaintWindow; bp-235580ba-cb0f-4ad5-8dcd-8df3c0180705 And on normal tick; bp-e2ee2b95-118f-481d-90e8-c65230180705
I've been trying to reproduce the crash on youtube for a few hours but no luck. :/
[Tracking Requested - why for this release]: Pretty high volume of crashes (701) and the signature 'MergeState::HasMatchingItemInOldList' is ranked #2 in content top-crashers.
Keywords: topcrash
Blocks: 1472525
Today, I did setup my old laptop (core2duo machine), and am trying to reproduce the crash, but no luck so far. Do I need to login there? I am going to try with logged-in state.
I thought, finally, I can reproduce this crash, but that was a difference crashe. :/ Filed bug 1474231 for that.
Brian suggested me that modifying nsRefreshDriver to make time changes happens between tick and paint artificially. Attaching patch advances the refresh driver timestamp ALWAYS before paint. Though even with this patch I couldn't reproduce the crash on youtube.com, some test cases caused the crash on a try with this patch, e.g test_bug1354633_media_error.html [1]. This looks pretty much the same as the crash on youtube.com. In this test case, the problematic element is 'scrubberStack' div [2], and the animation in question seems this controlBar animation [3] (note that this is a script animation, and 'scrubberStack' doesn't have any animations). Tomorrow I am going to keep investigating in further detail, but I am wondering it is still possible that we use an advanced refresh driver time for paint process? We've already fixed two possibilities (bug 1472076 and bug 1473172), and I can't find any other suspicious points so far unfortunately. [1] https://treeherder.mozilla.org/logviewer.html#?job_id=187305801&repo=try&lineNumber=3413 [2] https://hg.mozilla.org/mozilla-central/file/a675c5d7eb76/toolkit/content/widgets/videocontrols.xml#l46 [3] https://hg.mozilla.org/mozilla-central/file/a675c5d7eb76/toolkit/content/widgets/videocontrols.xml#l1109
I think I could replicate the crash on youtube.com with below modification. I don't yet understand what's going on there, but anyway I take this. diff --git a/layout/painting/nsDisplayList.cpp b/layout/painting/nsDisplayList.cpp --- a/layout/painting/nsDisplayList.cpp +++ b/layout/painting/nsDisplayList.cpp @@ -2770,7 +2770,9 @@ already_AddRefed<LayerManager> nsDisplay aBuilder->SetIsCompositingCheap(temp); if (document && widgetTransaction) { - TriggerPendingAnimations(document, layerManager->GetAnimationReadyTime()); + TimeStamp now = layerManager->GetAnimationReadyTime() + + TimeDuration::FromMilliseconds(100); + TriggerPendingAnimations(document, now);// layerManager->GetAnimationReadyTime()); }
Assignee: nobody → hikezoe
Status: NEW → ASSIGNED
(In reply to Hiroyuki Ikezoe (:hiro) from comment #23) > Created attachment 8990945 [details] > Advance refresh driver time before paint > > Brian suggested me that modifying nsRefreshDriver to make time changes > happens between tick and paint artificially. Attaching patch advances the > refresh driver timestamp ALWAYS before paint. Though even with this patch I > couldn't reproduce the crash on youtube.com, some test cases caused the > crash on a try with this patch, e.g test_bug1354633_media_error.html [1]. > This looks pretty much the same as the crash on youtube.com. In this test > case, the problematic element is 'scrubberStack' div [2], and the animation > in question seems this controlBar animation [3] (note that this is a script > animation, and 'scrubberStack' doesn't have any animations). This is super confusing! If the changed element doesn't have an animation, how does changing the refresh driver tick time affect it?
(In reply to Matt Woodrow (:mattwoodrow) from comment #25) > (In reply to Hiroyuki Ikezoe (:hiro) from comment #23) > > Created attachment 8990945 [details] > > Advance refresh driver time before paint > > > > Brian suggested me that modifying nsRefreshDriver to make time changes > > happens between tick and paint artificially. Attaching patch advances the > > refresh driver timestamp ALWAYS before paint. Though even with this patch I > > couldn't reproduce the crash on youtube.com, some test cases caused the > > crash on a try with this patch, e.g test_bug1354633_media_error.html [1]. > > This looks pretty much the same as the crash on youtube.com. In this test > > case, the problematic element is 'scrubberStack' div [2], and the animation > > in question seems this controlBar animation [3] (note that this is a script > > animation, and 'scrubberStack' doesn't have any animations). > > This is super confusing! If the changed element doesn't have an animation, > how does changing the refresh driver tick time affect it? In this case, the 'scrubberStack' is a descendant of the controlBar animation, so if the animating frame doesn't call MarkNeedsDisplayItemRebuild(), the crash happens. That's said, after some more digging into fullscreen stuff, I realized that nsRefreshDriver::Freeze/Thaw is called only for chrome window [1], so it shouldn't affect content window, which means the crash on youtube.com is not fixed by either bug 1472076 or bug 1473172. They fixed the cases once we enable layout.display-list.retain.chrome. :) [1] https://searchfox.org/mozilla-central/rev/a80651653faa78fa4dfbd238d099c2aad1cec304/dom/base/nsGlobalWindowOuter.cpp#4338
(In reply to Hiroyuki Ikezoe (:hiro) from comment #26) > In this case, the 'scrubberStack' is a descendant of the controlBar > animation, so if the animating frame doesn't call > MarkNeedsDisplayItemRebuild(), the crash happens. That's said, after some > more digging into fullscreen stuff, I realized that > nsRefreshDriver::Freeze/Thaw is called only for chrome window [1], so it > shouldn't affect content window, which means the crash on youtube.com is not > fixed by either bug 1472076 or bug 1473172. They fixed the cases once we > enable layout.display-list.retain.chrome. :) To be clear, fullscreen stuff is definitly involved with the crash on youtube.com, but I believe the refresh driver timer changes by Freeze/Thaw is not a cause of the crash.
Depends on: 1475769
No longer depends on: 1475155
Note that bug 1475769 landed in buildid: 20180715220110.
Bug 1475769 fixed the crash happened by WillPaint. There are still three reports that caused by normal nsRefreshDriver::Tick. The only one suspicious place for that I can think of is AddRefreshObserver call [1] in nsRefreshDriver::IsWaitingForPaint(). I can't make it work (i.e how to cause this crash) in my brain yet, though I hope the case is pretty rare, I am going to figure out a bit more. [1] https://hg.mozilla.org/mozilla-central/file/2ed1506d1dc7/layout/base/nsRefreshDriver.cpp#l2265
Looks like the big issue here is fixed by bug 1475769, which is headed for beta 62 as well. So, I'll untrack this, and if you do fix the left over issues and want to uplift that fix to beta, please don't hesitate to nominate for uplift again. Thanks!
(In reply to Hiroyuki Ikezoe (:hiro) from comment #31) > I found a crash report [1] commenting the crash happened on > https://theconversation.com/crispr-cas9-gene-editing-scissors-are-less- > accurate-than-we-thought-but-there-are-fixes-100007 , and I can reproduce > the crash locally. :) I am going to debug it. > > [1] > https://crash-stats.mozilla.com/report/index/458861f0-2288-443b-84ce- > 6183c0180717 Hiro, were you able to reproduce and debug the crash? Thanks
Flags: needinfo?(hikezoe)
Yes, I can still reproduce the crash on m-c rev: dd386b5b9fa7f5cd6dc4bbbfa0503b3eb2969af5. And I was debugging it two weeks ago but haven't finished yet.
Flags: needinfo?(hikezoe)
Blocks: 1476971
Crash Signature: [@ MergeState::HasMatchingItemInOldList] [@ nsDisplayItem::GetOldListIndex ] → [@ MergeState::HasMatchingItemInOldList] [@ nsDisplayItem::GetOldListIndex ] [@ bool nsDisplayItem::GetOldListIndex ]
Any progress on this? I think the debug on #c33 might address concerns on bug 1476971 as well.
Flags: needinfo?(hikezoe)
Nothing particular. I haven't taken time for this for a while, unfortunately.
Flags: needinfo?(hikezoe)
Marking as fix-optional for 63 as it only affects the aurora channel.
Attached file opt asan stack (deleted) —
bughunter can reproduce on nightly 64 but not release 62 nor beta 63 on https://www.apple.com/in/ipad-pro/ and various other language localizations on Linux. Opt asan gives ==2987==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000001 (pc 0x555d5790d87a bp 0x7ffdaed9a690 sp 0x7ffdaed9a520 T0) ==2987==The signal is caused by a WRITE memory access. ==2987==Hint: address points to the zero page. #0 0x555d5790d879 in MOZ_CrashPrintf /builds/worker/workspace/build/src/mfbt/Assertions.cpp:55:5 #1 0x7fea7ee49762 in GetOldListIndex /builds/worker/workspace/build/src/layout/painting/nsDisplayList.h:3171:7 #2 0x7fea7ee49762 in MergeState::HasMatchingItemInOldList(nsDisplayItem*, Index<OldListUnits>*) /builds/worker/workspace/build/src/layout/painting/RetainedDisplayListBuilder.cpp:449
Very low-volume crash, so I'm marking this fix-optional.

Updating flags as this is still visible from 66-68, albeit in fairly low volume.

Bughunter can reproduce this on https://www.google.com/flights/ and 3 other urls using opt and debug nightly linux with moz crash reasons similar to

Hit MOZ_CRASH(Item found was in the wrong list! type 27 (outer type was 297 at depth 3, now is 41)) at /builds/worker/workspace/build/src/layout/painting/nsDisplayList.h:2361
#01: MergeState::HasMatchingItemInOldList(nsDisplayItem*, Index<OldListUnits>*) [layout/painting/RetainedDisplayListBuilder.cpp:629]
#02: MergeState::ProcessItemFromNewList(nsDisplayItem*, mozilla::Maybe<Index<MergedListUnits> > const&) [layout/painting/RetainedDisplayListBuilder.cpp:450]
#03: RetainedDisplayListBuilder::MergeDisplayLists(nsDisplayList*, RetainedDisplayList*, RetainedDisplayList*, mozilla::Maybe<mozilla::ActiveScrolledRoot const*>&, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:839]
#04: MergeState::MergeChildLists(nsDisplayItem*, nsDisplayItem*, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:519]
#05: MergeState::ProcessItemFromNewList(nsDisplayItem*, mozilla::Maybe<Index<MergedListUnits> > const&) [layout/painting/RetainedDisplayListBuilder.cpp:485]
#06: RetainedDisplayListBuilder::MergeDisplayLists(nsDisplayList*, RetainedDisplayList*, RetainedDisplayList*, mozilla::Maybe<mozilla::ActiveScrolledRoot const*>&, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:839]
#07: MergeState::MergeChildLists(nsDisplayItem*, nsDisplayItem*, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:519]
#08: MergeState::ProcessItemFromNewList(nsDisplayItem*, mozilla::Maybe<Index<MergedListUnits> > const&) [layout/painting/RetainedDisplayListBuilder.cpp:485]
#09: RetainedDisplayListBuilder::MergeDisplayLists(nsDisplayList*, RetainedDisplayList*, RetainedDisplayList*, mozilla::Maybe<mozilla::ActiveScrolledRoot const*>&, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:839]
#10: MergeState::MergeChildLists(nsDisplayItem*, nsDisplayItem*, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:519]
#11: MergeState::ProcessItemFromNewList(nsDisplayItem*, mozilla::Maybe<Index<MergedListUnits> > const&) [layout/painting/RetainedDisplayListBuilder.cpp:485]
#12: RetainedDisplayListBuilder::MergeDisplayLists(nsDisplayList*, RetainedDisplayList*, RetainedDisplayList*, mozilla::Maybe<mozilla::ActiveScrolledRoot const*>&, nsDisplayItem*) [layout/painting/RetainedDisplayListBuilder.cpp:839]
#13: RetainedDisplayListBuilder::AttemptPartialUpdate(unsigned int, mozilla::DisplayListChecker*) [layout/painting/RetainedDisplayListBuilder.cpp:1523]
#14: nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) [layout/base/nsLayoutUtils.cpp:0]
#15: mozilla::PresShell::Paint(nsView*, nsRegion const&, mozilla::PaintFlags) [layout/base/PresShell.cpp:6088]
#16: nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) [gfx/src/nsRegion.h:478]
#17: nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) [mfbt/RefPtr.h:78]
#18: nsViewManager::ProcessPendingUpdates() [view/nsViewManager.cpp:1020]
#19: nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) [layout/base/nsRefreshDriver.cpp:2149]
#20: mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp, nsTArray<RefPtr<nsRefreshDriver> >&) [layout/base/nsRefreshDriver.cpp:344]
#21: mozilla::RefreshDriverTimer::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) [layout/base/nsRefreshDriver.cpp:369]
#22: mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) [layout/base/nsRefreshDriver.cpp:729]
#23: mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync(mozilla::VsyncEvent const&) [layout/base/nsRefreshDriver.cpp:623]
#24: mozilla::layout::VsyncChild::RecvNotify(mozilla::VsyncEvent const&) [layout/ipc/VsyncChild.cpp:68]
#25: mozilla::layout::PVsyncChild::OnMessageReceived(IPC::Message const&) [s3:gecko-generated-sources:40b57a5f11db1f2975fc13c74f20fea3c72418dd5cc7be16b1724f135b6995163d22588c816f1fb7f6cdadad80e8ed2fcea1ccf234f0788643e6a5e4e1859c1e/ipc/ipdl/PVsyncChild.cpp::187]
#26: mozilla::ipc::PBackgroundChild::OnMessageReceived(IPC::Message const&) [s3:gecko-generated-sources:5579cb101527982d72096be9b6fcb46f6d93a5d14564b8f11e1f6a6e8ccd6278d0b51192bc07b23625f16d1978dd8222d850a46e5779288881548c3e9f02aad4/ipc/ipdl/PBackgroundChild.cpp::5876]
#27: mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) [ipc/glue/MessageChannel.cpp:2209]
#28: mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) [ipc/glue/MessageChannel.cpp:2133]
#29: mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) [ipc/glue/MessageChannel.cpp:0]
#30: mozilla::ipc::MessageChannel::MessageTask::Run() [ipc/glue/MessageChannel.cpp:2005]

This seems to have spiked a bit today.

The new spike from comment 43 was filed as bug 1594381.

Hey Matt,
Can you still reproduce this crash or can we close it?

Flags: needinfo?(matt.woodrow)

It seems like this still happens (very infrequently), so i think we should keep this open to track it.

Status: ASSIGNED → NEW
Flags: needinfo?(matt.woodrow)
Severity: critical → S4
Priority: P2 → P5
QA Whiteboard: [qa-not-actionable]
Webcompat Priority: --- → ?

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: