Crash in mozilla::wr::Moz2DRenderCallback (ExternalSourceSurfaceCreation PLAY failure) - Still on Google Slides?
Categories
(Core :: Graphics: WebRender, defect, P3)
Tracking
Tracking | Status
---|---
firefox-esr60 | unaffected
firefox67 | wontfix
firefox68 | verified
firefox69 | verified
People
(Reporter: jan, Assigned: aosmond)
References
(Blocks 2 open bugs)
Details
(Keywords: crash, nightly-community, regression)
Crash Data
Attachments
(2 files, 1 obsolete file)
(deleted), text/x-phabricator-request | jcristau: approval-mozilla-beta+ | Details
(deleted), text/x-phabricator-request | jcristau: approval-mozilla-beta+ | Details
+++ This bug was initially created as a clone of Bug #1516011 +++
(Andrew Osmond [:aosmond] from bug 1516011 comment 20)
I see https://crash-stats.mozilla.com/report/index/9f210bca-5925-4eb8-8e08-8fdd50190115#tab-details which includes my fix. I am now puzzled as to the root cause.
Still happening on Nvidia/Win10. From yesterday:
This bug is for crash report bp-40087b07-eaf5-48b7-8d80-8c9890190131.
MOZ_RELEASE_ASSERT(false)
GraphicsCriticalError |[G0][GFX1-]: Updating unknown shared surface: 979252543489 (t=548804) |[G1][GFX1-]: Replay failure: ExternalSourceSurfaceCreation PLAY (t=619060) |[G2][GFX1-]: Replay failure: ExternalSourceSurfaceCreation PLAY (t=619060)
Top 10 frames of crashing thread:
0 xul.dll static bool mozilla::wr::Moz2DRenderCallback gfx/webrender_bindings/Moz2DImageRenderer.cpp:435
1 xul.dll wr_moz2d_render_cb gfx/webrender_bindings/Moz2DImageRenderer.cpp:474
2 xul.dll static struct gfx/webrender_bindings/src/moz2d_renderer.rs:526
3 xul.dll static void rayon::iter::plumbing::bridge_producer_consumer::helper<rayon::vec::VecProducer<webrender_bindings::moz2d_renderer::Job>, rayon::iter::map::MapConsumer<rayon::iter::collect::consumer::CollectConsumer< third_party/rust/rayon/src/iter/plumbing/mod.rs:418
4 xul.dll static void rayon_core::job::{{impl}}::execute<rayon_core::latch::SpinLatch, closure, third_party/rust/rayon-core/src/job.rs:113
5 xul.dll static void rayon_core::registry::WorkerThread::wait_until_cold<rayon_core::latch::CountLatch> third_party/rust/rayon-core/src/registry.rs:567
6 xul.dll static void std::sys_common::backtrace::__rust_begin_short_backtrace<closure, z:/libstd/sys_common/backtrace.rs:137
7 xul.dll static void alloc::boxed::{{impl}}::call_box< z:/liballoc/boxed.rs:672
8 xul.dll static void std::sys::windows::thread::{{impl}}::new::thread_start src/libstd/sys/windows/thread.rs:56
9 kernel32.dll BaseThreadInitThunk
Comment 1•6 years ago
A couple of urls:
http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2018-19/20/4243443/livematch_hannover-96-58_rasenballsport-leipzig-15778.html
http://www.kicker.de/news/fussball/intligen/intpokale/copa-del-rey/2018-19/7/4561911/livematch_fc-barcelona_real-madrid.html
https://www.cwb.gov.tw/V7/forecast/town368/towns/6500100.htm?type=Weather&time=7Day
Comment 2•6 years ago
I just had bp-77f1a7fd-b465-4b47-aba4-467cf0190319 on Google Slides (macOS), which seems related.
Comment 3•6 years ago
Comment 4•6 years ago
Per bug 1540853 comment 4 and bug 1540853 comment 6, this is the #2 WR-related crash in the 66 release experiment, and #3 in 67 beta. It would be good to track it down if we can.
Comment 5•6 years ago
Just from looking at the codepaths it seems that if the parent bails out here then that could result in this crash. Can we add some logging or such to catch this scenario? Or maybe insert a dummy entry into the sInstances table so that the replay doesn't just fail?
Assignee
Comment 6•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
Just from looking at the codepaths it seems that if the parent bails out here then that could result in this crash. Can we add some logging or such to catch this scenario? Or maybe insert a dummy entry into the sInstances table so that the replay doesn't just fail?
Good point. If the memory map fails, we will crash. This should not be considered an impossible/unlikely error, especially on . The general image use case is here:
That path fails gracefully, and we do see those warnings in the critical logs from time to time (either for this failure or a different one).
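To make the suggestion from comment 5 concrete, here is a minimal, self-contained sketch of the graceful-failure idea. It is plain standard C++ with illustrative stand-in names (SharedSurface, sSurfaces, ReplayExternalSurfaceCreation), not the actual Gecko code: when the shared-surface lookup misses, the replay logs and skips the item instead of hitting the release assert.

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>
#include <unordered_map>

// Illustrative stand-ins for the real types; not the Gecko API.
struct SharedSurface { int width = 0; int height = 0; };
using ExternalImageId = uint64_t;

// Analogue of the sInstances table from comment 5: external image id ->
// shared surface registered by the parent process.
static std::unordered_map<ExternalImageId, std::shared_ptr<SharedSurface>> sSurfaces;

// Replay one ExternalSourceSurfaceCreation event. Returning false (and
// logging) lets the caller drop the blob item instead of crashing the
// process when the surface was never registered (e.g. the memory map failed).
bool ReplayExternalSurfaceCreation(ExternalImageId aId) {
  auto it = sSurfaces.find(aId);
  if (it == sSurfaces.end()) {
    // This is the situation that currently trips MOZ_RELEASE_ASSERT(false).
    std::fprintf(stderr, "Replay failure: unknown shared surface %llu\n",
                 static_cast<unsigned long long>(aId));
    return false;
  }
  // ... draw using it->second ...
  return true;
}

int main() {
  sSurfaces[42] = std::make_shared<SharedSurface>();
  std::printf("id 42: %s\n", ReplayExternalSurfaceCreation(42) ? "ok" : "skipped");
  std::printf("id 979252543489: %s\n",
              ReplayExternalSurfaceCreation(979252543489ULL) ? "ok" : "skipped");
}
```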
Assignee
Comment 7•6 years ago
Comment 9•5 years ago
bugherder
Comment 10•5 years ago
So https://crash-stats.mozilla.com/report/index/ff27f77c-0f16-41f8-a1f3-a71510190507 and https://crash-stats.mozilla.com/report/index/42a16c3a-5917-4041-ac42-fa9590190508 both have this patch but no relevant messages in the GraphicsCriticalError field, so maybe the crash is coming from something else?
https://crash-stats.mozilla.com/report/index/b028aea6-06fa-451e-9236-123bc0190506#tab-metadata has this:
DataSourceSurface of SharedSurfaces does not exist for extId:21474837290
Maybe that's relevant?
Comment 11•5 years ago
This is also with that patch: bp-62577f07-1f2d-48f1-b78c-21ce40190515
Assignee
Comment 12•5 years ago
Google Slides are entirely blob images, so I suppose it makes sense that they are the most likely place to observe the problem. These do not appear to be animated images, so the following appears to be insufficient or missing some corner case:
Reporter
Comment 13•5 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
https://crash-stats.mozilla.com/report/index/b028aea6-06fa-451e-9236-123bc0190506#tab-metadata has this:
DataSourceSurface of SharedSurfaces does not exist for extId:21474837290
Maybe that's relevant?
Also seen in bug 1553522.
Assignee
Comment 14•5 years ago
I have had some modest, if tedious, success in reproducing.
Assignee
Comment 15•5 years ago
With some tweaks, in particular setting image.mem.surfacecache.min_expiration_ms to something very small, I have been able to reproduce this consistently on nightly and local builds.
STR:
- Go to about:config and set image.mem.surfacecache.min_expiration_ms to 1 and gfx.webrender.blob.paint-flashing to true (the same prefs are shown as a user.js snippet after these steps). Restart the browser.
- Browse to https://www.yr.no/kart/?spr=eng#lat=57.75946&lon=11.89664&zoom=6&laga=nedb%C3%B8rskyer&proj=3575.
- Once the page loads, move the mouse between blob tiles with a weather icon.
- Crashes very quickly.
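If it helps, here are the two prefs from the first step expressed as user.js lines (the pref names and values are exactly those from the STR above; a restart is still required):

```js
// Expire cached shared surfaces almost immediately so the lifetime bug is
// hit quickly, and flash blob-painted tiles so the affected areas are visible.
user_pref("image.mem.surfacecache.min_expiration_ms", 1);
user_pref("gfx.webrender.blob.paint-flashing", true);
```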
Assignee
Comment 16•5 years ago
I did a mozregression run in case it was illuminating, but it mostly just points to bug 1456555, which is an artifact of my STR. I was able to make it crash without bug 1456555, but not consistently.
Assignee
Comment 17•5 years ago
What the logs suggest is that AsyncImagePipelineManager::HoldExternalImage is not keeping the external surfaces alive long enough, which was a key assumption in the lifetime design. I'm not sure it is HoldExternalImage's fault exactly: it is doing the job it was asked to do, and I suspect the real problem is that the blob lives longer than our hold on it, rather than that the hold released the surface too early. I think we should move to an implementation where the blob dependencies are explicit; that would probably prevent similar problems in the future. That's what I'm looking into now.
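As a rough, self-contained illustration of that suspected mismatch (plain standard C++ with made-up types such as BlobRecording and SharedSurface, not the Gecko ones): if the only strong reference to a surface is scoped to the transaction, a blob recording that is replayed after the transaction ends finds the surface already gone.

```cpp
#include <cstdio>
#include <memory>
#include <vector>

struct SharedSurface { /* pixel data would live here */ };

// A blob recording only remembers which surfaces it needs; it does not own
// them (weak_ptr stands in for looking up an external image id at replay).
struct BlobRecording {
  std::vector<std::weak_ptr<SharedSurface>> deps;
  bool Replay() const {
    for (const auto& dep : deps) {
      if (dep.expired()) {
        std::puts("Replay failure: ExternalSourceSurfaceCreation PLAY");
        return false;
      }
    }
    std::puts("replayed OK");
    return true;
  }
};

int main() {
  BlobRecording blob;
  {
    // The "hold" is scoped to the transaction: once the scope ends the
    // surface is released, even though the blob can still be replayed.
    std::vector<std::shared_ptr<SharedSurface>> transactionHold;
    auto surface = std::make_shared<SharedSurface>();
    transactionHold.push_back(surface);
    blob.deps.push_back(surface);
    blob.Replay();  // OK while the transaction still holds the surface
  }
  blob.Replay();  // the blob outlived the hold: replay fails
}
```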
Assignee
Comment 18•5 years ago
Assignee
Comment 19•5 years ago
When a blob image invalidates, it doesn't always repaint the entire blob. When we stored the shared surface references in the DIGroup, it would incorrectly forget about images referenced by items that were not invalidated when the group repainted. As such, it could free them too early and cause a crash when rasterizing the blob in the compositor process.
This did not crash most of the time because the image cache would bail us out. It takes a full minute for the image cache to expire, but the issue was more readily reproducible when that timeout was shortened.
We now store the references in BlobItemData, on a per-display-item basis. This ensures that when any given item is invalidated, it will continue referencing any resources that it needs.
Additionally, we now post the release of the shared surface image keys and external image ID to the main thread. This allows the current transaction to complete before freeing the surface, which guards against cases where the surface is referenced and released within the same transaction.
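A self-contained sketch of the difference the per-item storage makes (plain C++ with hypothetical names, not the actual DIGroup/BlobItemData code): when references are rebuilt at the group level from only the invalidated items, a surface used solely by an untouched item loses its last reference; storing references per display item keeps every surviving item's surfaces alive across a partial repaint.

```cpp
#include <cstdio>
#include <memory>
#include <vector>

struct SharedSurface {};
using SurfaceRef = std::shared_ptr<SharedSurface>;

// New scheme: each display item owns strong references to the surfaces its
// recorded drawing commands use (analogous to BlobItemData).
struct DisplayItem {
  bool invalidated = false;
  std::vector<SurfaceRef> surfaces;
};

int main() {
  std::weak_ptr<SharedSurface> watch;

  // Old scheme: one reference list for the whole group, rebuilt from only
  // the invalidated items on a partial repaint.
  {
    auto icon = std::make_shared<SharedSurface>();
    watch = icon;
    std::vector<SurfaceRef> groupRefs = {icon};  // built on the full paint
    icon.reset();

    // Partial repaint: the item using `icon` was NOT invalidated, so the
    // rebuilt list never mentions it and its last reference disappears.
    groupRefs.clear();  // the "rebuild" only re-adds invalidated items
    std::printf("old scheme, surface alive after partial repaint: %s\n",
                watch.expired() ? "no (replay would crash)" : "yes");
  }

  // New scheme: an item that was not invalidated keeps its own references,
  // so its surfaces survive the partial repaint.
  {
    DisplayItem untouched;
    auto icon = std::make_shared<SharedSurface>();
    watch = icon;
    untouched.surfaces = {icon};
    icon.reset();

    std::printf("new scheme, surface alive after partial repaint: %s\n",
                watch.expired() ? "no" : "yes");
  }
}
```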
Assignee
Comment 20•5 years ago
Assignee
Comment 21•5 years ago
Assignee
Comment 22•5 years ago
We now also post the release of the shared surface image keys and external image ID to the main thread. This allows the current transaction to complete before freeing the surface, which guards against cases where the surface is referenced and released within the same transaction.
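A minimal sketch of the deferral idea (plain standard C++; the task queue below is a hypothetical stand-in for dispatching a runnable to the main thread, not the Gecko API): instead of dropping the last reference inline, the release is queued and only runs after the in-flight transaction has completed, so a blob replayed within that transaction can still find the surface.

```cpp
#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

struct SharedSurface {};

// Stand-in for "post to the main thread": tasks queued here run only after
// the current transaction has been fully processed.
static std::vector<std::function<void()>> gMainThreadTasks;

static void PostReleaseToMainThread(std::shared_ptr<SharedSurface> aSurface) {
  // The lambda owns the last strong reference, so the surface stays alive
  // until the deferred task actually runs.
  gMainThreadTasks.push_back([surface = std::move(aSurface)]() mutable {
    surface.reset();
    std::puts("surface released after the transaction completed");
  });
}

int main() {
  auto surface = std::make_shared<SharedSurface>();
  std::weak_ptr<SharedSurface> watch = surface;

  // Mid-transaction, the surface is no longer needed and its reference is
  // given up, but the actual release is deferred...
  PostReleaseToMainThread(std::move(surface));

  // ...so the same transaction can still replay a blob that uses it.
  std::printf("alive during the transaction: %s\n",
              watch.expired() ? "no" : "yes");

  // Transaction done; now run the deferred tasks.
  for (auto& task : gMainThreadTasks) task();
  std::printf("alive after the deferred release: %s\n",
              watch.expired() ? "no" : "yes");
}
```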
Assignee
Comment 23•5 years ago
Try push after the rewrite of part 2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f466c1d7a0a0e62c299296b8948afe05f1d8b8b
Comment 24•5 years ago
Comment 25•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8b66f48247b6
https://hg.mozilla.org/mozilla-central/rev/4d3bc03ee7a5
Assignee
Comment 26•5 years ago
I verified this with the STR in nightly builds before and after the patch landed. It has been fixed.
Assignee
Comment 27•5 years ago
Comment on attachment 9068095 [details]
Bug 1524280 - Part 1. Ensure we always post when freeing SharedUserData.
Beta/Release Uplift Approval Request
- User impact if declined: Users may experience crashes when browsing sites with a lot of SVG images when using WebRender. In particular, Google Slides is impacted.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
  1) Go to about:config and set image.mem.surfacecache.min_expiration_ms to 1 and gfx.webrender.blob.paint-flashing to true. Restart the browser.
  2) Browse to https://www.yr.no/kart/?spr=eng#lat=57.75946&lon=11.89664&zoom=6&laga=nedb%C3%B8rskyer&proj=3575.
  3) Once the page loads, move the mouse between blob tiles with a weather icon.
  4) Should no longer crash.
- List of other uplifts needed: Needs both parts 1 and 2 from bug 1524280
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): We are making the lifetimes of everything longer. It is unlikely to increase the crash rate, even if done wrong, because the lifetimes are already too short. The image cache has been hiding the issue by retaining the surfaces for an extra 60 seconds (in case the cache is queried again) and will continue to do so.
- String changes made/needed: None
Comment 28•5 years ago
Comment on attachment 9068095 [details]
Bug 1524280 - Part 1. Ensure we always post when freeing SharedUserData.
webrender crash fix, approved for 68.0b7
Comment 29•5 years ago
bugherder uplift
Comment 30•5 years ago
Hello,
Reproduced the issue with Firefox 69.0a1 (20190522152821) on Windows 10 x64, macOS 10.14 and Ubuntu 18.04.
On Ubuntu 18.04 and macOS 10.14, Firefox crashed and the “Crash Reporter” was displayed after following the steps from comment 27.
On Windows 10 x64 no “Crash Reporter” window appeared, but the browser window turned white. In the about:crashes page I observed 4-5 unsubmitted crash reports (depending on how many times the browser window turned white), all having the same signature (“mozilla::wr::Moz2DRenderCallback”, e.g. link).
The fix is verified using Firefox 69.0a1 (20190604214415) and Firefox 68.0b7 (20190603181408) on Windows 10 x64, Ubuntu 18.04 and macOS 10.14 (no crashes or white browser windows were encountered).