Crash in [@ amdgpu_cs_flush]
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox-esr102 | --- | unaffected |
firefox102 | --- | unaffected |
firefox103 | --- | fixed |
firefox104 | --- | fixed |
People
(Reporter: gerard-majax, Assigned: stransky)
References
(Regression)
Details
(Keywords: crash, regression)
Attachments
(3 files)
text/plain
text/x-log
text/x-phabricator-request (dmeehan: approval-mozilla-beta+)
Crash report: https://crash-stats.mozilla.org/report/index/1d1fa165-4999-4f18-8707-d39e10220705
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 libgallium_dri.so amdgpu_cs_flush src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1730
1 libgallium_dri.so si_flush_gfx_cs src/gallium/drivers/radeonsi/si_gfx_cs.c:143
2 libgallium_dri.so si_flush_from_st src/gallium/drivers/radeonsi/si_fence.c:534
3 libgallium_dri.so si_texture_create_object src/gallium/drivers/radeonsi/si_texture.c:1124
4 libgallium_dri.so si_texture_create_with_modifier src/gallium/drivers/radeonsi/si_texture.c:1300
5 libgallium_dri.so st_texture_create src/mesa/state_tracker/st_texture.c:103
6 libgallium_dri.so st_texture_storage src/mesa/state_tracker/st_cb_texture.c:3247
7 libgallium_dri.so st_AllocTextureStorage src/mesa/state_tracker/st_cb_texture.c:3295
8 libgallium_dri.so texture_storage_error.constprop.0 src/mesa/main/texstorage.c:554
9 libgallium_dri.so texstorage_error src/mesa/main/texstorage.c:610
Crashed ~10 times over a 30-minute Zoom Web call.
Reporter | ||
Comment 1•2 years ago
|
||
On the first occurrence of the crash, there was an amdgpu GPU reset:
[76114.322182] [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[76119.375165] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[76119.375165] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[76124.505170] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3447850, emitted seq=3447853
[76124.505355] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox-bin pid 669730 thread firefox-bi:cs0 pid 669844
[76124.505506] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
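A log like the one above can be scanned for ring timeouts and GPU resets when correlating crashes with kernel events. The following is a minimal sketch only: the function name `summarize_amdgpu_log` is hypothetical, and the regular expressions are modelled on the exact wording of this excerpt, which other kernel versions may not share.

```python
import re

# Patterns modelled on the dmesg excerpt above (assumption: other kernels
# may word these messages differently).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s.*amdgpu_job_timedout.*ring (?P<ring>\w+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
RESET_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s.*GPU reset begin!")


def summarize_amdgpu_log(lines):
    """Return (ring timeouts, GPU reset timestamps) found in dmesg lines."""
    timeouts, resets = [], []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append({
                "ts": float(m["ts"]),
                "ring": m["ring"],
                # Jobs emitted to the ring but not yet signaled as done.
                "pending": int(m["emitted"]) - int(m["signaled"]),
            })
            continue
        m = RESET_RE.search(line)
        if m:
            resets.append(float(m["ts"]))
    return timeouts, resets


# Two lines copied from the excerpt above.
SAMPLE = [
    "[76124.505170] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=3447850, emitted seq=3447853",
    "[76124.505506] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!",
]

if __name__ == "__main__":
    print(summarize_amdgpu_log(SAMPLE))
```

In practice the input would be the output of `dmesg` or the journal, read line by line.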
Reporter | ||
Comment 2•2 years ago
|
||
The first crash occurrence on our side: https://crash-stats.mozilla.org/report/index/1d1fa165-4999-4f18-8707-d39e10220705
It was running fine for ~20h54 min and it crashed ~20 min after I started my Zoom Web call.
Comment 3•2 years ago
|
||
Looking at the URLs it seems like this is correlated with WebGL/Accelerated canvas
Reporter | ||
Comment 4•2 years ago
|
||
As per :jrmuizel's suggestion, I have set gfx.canvas.accelerated = false and will see if it still happens.
Reporter | ||
Comment 5•2 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #4)
As per :jrmuizel suggestion, I have set gfx.canvas.accelerated = false and will see if it happens.
This did not help: https://crash-stats.mozilla.org/report/index/313756bb-044a-4407-b947-36cbe0220706
System was stable for more than 24h, and after ~25 min of Zoom Web, kaboom.
This time [@ si_cp_dma_prefetch]
Reporter | ||
Comment 6•2 years ago
|
||
Got a new one, just a few minutes after: https://crash-stats.mozilla.org/report/index/72445795-0cb1-41dd-a84a-a0ae30220706
Reporter | ||
Comment 7•2 years ago
|
||
Was there a change around 20220628191450 that could be even remotely related? Looking at crash-stats, this seems to be the oldest buildid that triggers the issue (and https://crash-stats.mozilla.org/report/index/5d95bbb0-3614-42cc-aacd-f60c60220629, which seems to be the first report, is on Fedora, so it's not my system).
Reporter | ||
Comment 8•2 years ago
|
||
A third crash, likely with a GPU reset, completely messed up my system.
Reporter | ||
Comment 9•2 years ago
|
||
Comment 10•2 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #8)
A third crash, likely with GPU reset, completely messed my system
This means it's at least also a Mesa or Kernel bug - can you open a bug at https://gitlab.freedesktop.org/mesa/mesa/-/issues ?
Reporter | ||
Comment 11•2 years ago
|
||
(In reply to Robert Mader [:rmader] from comment #10)
(In reply to Alexandre LISSY :gerard-majax from comment #8)
A third crash, likely with GPU reset, completely messed my system
This means it's at least also a Mesa or Kernel bug - can you open a bug at https://gitlab.freedesktop.org/mesa/mesa/-/issues ?
I already did ? https://bugzilla.mozilla.org/show_bug.cgi?id=1778114#a1917_41385
Comment 12•2 years ago
|
||
Set release status flags based on info from the regressing bug 1776563
Comment 13•2 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #11)
I already did ? https://bugzilla.mozilla.org/show_bug.cgi?id=1778114#a1917_41385
Oh right, didn't see, thanks!
Comment 14•2 years ago
|
||
(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #12)
Set release status flags based on info from the regressing bug 1776563
Given that bug 1776563 fixed a quite strong regression from bug 1776348, could you double check that builds from before bug 1776563 don't have the issue?
Reporter | ||
Comment 15•2 years ago
|
||
(In reply to Robert Mader [:rmader] from comment #14)
(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #12)
Set release status flags based on info from the regressing bug 1776563
Given that bug 1776563 fixed a quite strong regression from bug 1776348, could you double check that builds from before bug 1776563 don't have the issue?
Well I might need my full profile and I might need several days before knowing.
Reporter | ||
Comment 16•2 years ago
|
||
Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.
Reporter | ||
Comment 17•2 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #16)
Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.
18 mins long meeting this morning, no repro. It might have been too short, though.
Reporter | ||
Comment 18•2 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #17)
(In reply to Alexandre LISSY :gerard-majax from comment #16)
Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.
18 mins long meeting this morning, no repro. It might have been too short, though.
Second meeting, 39 min, no crash.
Reporter | ||
Comment 19•2 years ago
|
||
I'm going back to current buildid, and if I hit again will try disabling widget.dmabuf-webgl.enabled
Comment 20•2 years ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #18)
(In reply to Alexandre LISSY :gerard-majax from comment #17)
(In reply to Alexandre LISSY :gerard-majax from comment #16)
Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.
18 mins long meeting this morning, no repro. It might have been too short, though.
Second meeting, 39 min, no crash.
Duh, sorry, comment 14 was wrong :( It's expected to work there because dmabuf is always disabled. You'd need to test a build without bug 1776348, i.e. before 20220626
Reporter | ||
Comment 21•2 years ago
|
||
(In reply to Robert Mader [:rmader] from comment #20)
(In reply to Alexandre LISSY :gerard-majax from comment #18)
(In reply to Alexandre LISSY :gerard-majax from comment #17)
(In reply to Alexandre LISSY :gerard-majax from comment #16)
Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.
18 mins long meeting this morning, no repro. It might have been too short, though.
Second meeting, 39 min, no crash.
Duh, sorry, comment 14 was wrong :( It's expected to work there because dmabuf is always disabled. You'd need to test a build without bug 1776348, i.e. before 20220626
There still seems to be something related to a landing after 20220626190331: I got multiple crashes in a row, even after reboots, and nothing with 20220626190331.
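Firefox build IDs are timestamps in YYYYMMDDHHMMSS form, so the window between the last build with no crash (20220626190331) and the oldest crashing build mentioned in comment 7 (20220628191450) can be computed directly. A minimal sketch, with a hypothetical helper name `parse_buildid`:

```python
from datetime import datetime


def parse_buildid(buildid: str) -> datetime:
    """Parse a 14-digit build ID (YYYYMMDDHHMMSS) into a datetime."""
    return datetime.strptime(buildid, "%Y%m%d%H%M%S")


# Values taken from the comments in this bug.
last_good = parse_buildid("20220626190331")
first_bad = parse_buildid("20220628191450")

# Any change that landed inside this window is a regression candidate.
window = first_bad - last_good

if __name__ == "__main__":
    print(f"regression window: {window}")
```

Such a window only narrows the search; finding the actual regressing change still needs a bisection over the builds inside it.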
Comment 22•2 years ago
|
||
Yes, that totally makes sense - after bug 1776348 and before bug 1776563 dmabuf webgl usage was disabled completely. And dmabuf usage is most likely the cause for the crash.
Reporter | ||
Comment 23•2 years ago
|
||
We have recently improved debug symbol import, notably for Arch Linux, and we now see a few crashes on some AMD GPUs that seem to match the same signature: https://crash-stats.mozilla.org/signature/?platform_pretty_version=~Arch&date=%3E%3D2022-07-07T08%3A45%3A00.000Z&date=%3C2022-07-08T08%3A45%3A00.000Z&_sort=-date&signature=amdgpu_cs_flush
Comment 24•2 years ago
|
||
Marek suggests in https://gitlab.freedesktop.org/mesa/mesa/-/issues/6796#note_1461520 that it looks like Firefox is using the same GL context from multiple threads. Kelsey/Lee, is that something that could be happening?
Comment 25•2 years ago
|
||
I would suggest flipping the pref webgl.threadsafe-gl.force-disabled to true, restarting Firefox and seeing if you can reproduce. This will guarantee that we always use GL from the Renderer thread (for both general rendering and WebGL). I realize this is about using the same context from multiple threads, but I am curious to see if putting all of the contexts on the same thread helps anyways.
Comment 26•2 years ago
|
||
We should never be using the same context on multiple threads to my knowledge. We should add asserts for this.
We definitely hope that drivers support different contexts on different threads, but we have blocklisting in place for ones who don't handle it right.
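The kind of assert mentioned above can be sketched as a thread-affinity check: the context remembers the first thread that made it current and fails on use from any other thread. This is an illustrative model only; `ThreadBoundContext` and `try_from_other_thread` are hypothetical names and are not Gecko or Mesa APIs.

```python
import threading


class ThreadBoundContext:
    """Hypothetical stand-in for a GL context that must stay on one thread."""

    def __init__(self, name):
        self.name = name
        self._owner = None  # ident of the first thread that made it current

    def make_current(self):
        ident = threading.get_ident()
        if self._owner is None:
            self._owner = ident
        if self._owner != ident:
            # The assert under discussion: same context, second thread.
            raise AssertionError(
                f"context {self.name!r} used from a second thread"
            )


def try_from_other_thread(ctx):
    """Call make_current() on a new thread; return True if no assert fired."""
    result = {}

    def worker():
        try:
            ctx.make_current()
            result["ok"] = True
        except AssertionError:
            result["ok"] = False

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["ok"]


if __name__ == "__main__":
    ctx = ThreadBoundContext("webgl")
    ctx.make_current()                 # bound to the main thread
    print(try_from_other_thread(ctx))  # the check fires on the other thread
```

A check like this would catch the "same context on multiple threads" case, but not two different contexts on two threads, which is a separate scenario.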
Comment 27•2 years ago
|
||
One possibility here is that WR calls out to e.g. Gecko's dmabuf code, which accidentally changes context, and then returns to WR but now the wrong context is active?
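The failure mode hypothesized here, and one common way to guard against it, can be modelled with a save/restore wrapper around the call that might switch contexts. This is a sketch under stated assumptions: the per-thread "current context" is simulated with `threading.local`, and `make_current`, `preserve_current_context` and `export_surface` are hypothetical names, not the actual WebRender or dmabuf code.

```python
import threading
from contextlib import contextmanager

# Simulates the per-thread "current context" that GL keeps.
_state = threading.local()


def current_context():
    return getattr(_state, "ctx", None)


def make_current(ctx):
    _state.ctx = ctx


@contextmanager
def preserve_current_context():
    """Restore whichever context was current when the block was entered."""
    saved = current_context()
    try:
        yield
    finally:
        if current_context() is not saved:
            make_current(saved)


def export_surface():
    """Hypothetical callee that switches context and does not switch back."""
    make_current("dmabuf-export")


if __name__ == "__main__":
    make_current("webrender")
    with preserve_current_context():
        export_surface()
    print(current_context())
```

Without the guard, the caller would continue issuing GL calls against whatever context the callee left current.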
Comment 28•2 years ago
|
||
(In reply to Andrew Osmond [:aosmond] (he/him) from comment #25)
I would suggest flipping the pref webgl.threadsafe-gl.force-disabled to true, restarting Firefox and seeing if you can reproduce. This will guarantee that we always use GL from the Renderer thread (for both general rendering and WebGL). I realize this is about using the same context from multiple threads, but I am curious to see if putting all of the contexts on the same thread helps anyways.
Reporter | ||
Comment 29•2 years ago
|
||
I already did that as soon as I saw Andrew's message. Unfortunately, I don't have any meetings this week to confirm with Zoom Web.
Assignee | ||
Comment 30•2 years ago
|
||
If this is a 'one context on multiple threads' scenario, it does not come from the dmabuf code itself.
The notable libgallium_dri calls from the backtrace are:
Thread 45, Name: Compositor
0 libc.so.6 syscall None
1 libgallium_dri.so si_eliminate_fast_color_clear src/gallium/drivers/radeonsi/si_texture.c:313
2 libgallium_dri.so si_texture_get_handle src/gallium/drivers/radeonsi/si_texture.c:728
3 libgallium_dri.so dri2_query_image src/gallium/frontends/dri/dri2.c:1375
4 libEGL_mesa.so.0 dri2_export_dma_buf_image_query_mesa src/egl/drivers/dri2/egl_dri2.c:3081
5 libEGL_mesa.so.0 _eglReleaseDisplayResources src/egl/main/egldisplay.c:321
6 libxul.so DMABufSurfaceRGBA::Create(mozilla::gl::GLContext*, void*, int, int) widget/gtk/DMABufSurface.cpp:451
7 libxul.so DMABufSurfaceRGBA::CreateDMABufSurface(mozilla::gl::GLContext*, void*, int, int) widget/gtk/DMABufSurface.cpp:896
8 libxul.so mozilla::gl::SharedSurface_DMABUF::Create(mozilla::gl::SharedSurfaceDesc const&) gfx/gl/SharedSurfaceDMABUF.cpp:60
9 libxul.so mozilla::gl::SurfaceFactory_DMABUF::CreateSharedImpl(mozilla::gl::SharedSurfaceDesc const&) gfx/gl/SharedSurfaceDMABUF.h:59
10 libxul.so mozilla::gl::SurfaceFactory::CreateShared(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::ColorSpace2)
Crashing Thread (32), Name: Renderer
Frame Module Signature Source Trust
0 libgallium_dri.so amdgpu_cs_flush src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1730 context
1 libgallium_dri.so si_flush_gfx_cs src/gallium/drivers/radeonsi/si_gfx_cs.c:143 cfi
2 libgallium_dri.so si_flush_from_st src/gallium/drivers/radeonsi/si_fence.c:534 cfi
3 libgallium_dri.so si_texture_create_object src/gallium/drivers/radeonsi/si_texture.c:1124 cfi
4 libgallium_dri.so si_texture_create_with_modifier src/gallium/drivers/radeonsi/si_texture.c:1300 cfi
5 libgallium_dri.so st_texture_create src/mesa/state_tracker/st_texture.c:103 cfi
6 libgallium_dri.so st_texture_storage src/mesa/state_tracker/st_cb_texture.c:3247 cfi
7 libgallium_dri.so st_AllocTextureStorage src/mesa/state_tracker/st_cb_texture.c:3295 cfi
8 libgallium_dri.so texture_storage_error.constprop.0 src/mesa/main/texstorage.c:554 cfi
9 libgallium_dri.so texstorage_error src/mesa/main/texstorage.c:610 cfi
10 libgallium_dri.so _mesa_TexStorage2D src/mesa/main/texstorage.c:715 cfi
11 libxul.so webrender::device::gl::Device::create_texture gfx/wr/webrender/src/device/gl.rs:2521 cfi
12 libxul.so webrender::renderer::Renderer::update_texture_cache gfx/wr/webrender/src/renderer/mod.rs:2379 cfi
13 libxul.so webrender::renderer::Renderer::render_impl gfx/wr/webrender/src/renderer/mod.rs:1975 cfi
14 libxul.so webrender::renderer::Renderer::render gfx/wr/webrender/src/renderer/mod.rs:1737 cfi
15 libxul.so wr_renderer_render gfx/webrender_bindings/src/bindings.rs:620
The GL context for 'Thread 45, Name: Compositor' comes from mozilla::gl::SharedSurfaceDesc, and the GL context for 'Crashing Thread (32), Name: Renderer' comes from Rust.
As we only have AMD / libgallium reports (see Bug 1775258) and nothing on Intel/NVIDIA, I'd say either the fix for https://gitlab.freedesktop.org/mesa/mesa/-/issues/6666 is incomplete or we use the wrong Mesa version for the DMABUF_SURFACE_EXPORT feature check.
Assignee | ||
Comment 31•2 years ago
|
||
This is a variant of Bug 1774075. It was supposed to be fixed in Mesa 22.1.2, but I have Mesa 22.1.3 and it's still broken.
A simple test case is to set a WebGL sample as the homepage and open a bunch of Firefox windows; that crashes reliably.
Assignee | ||
Comment 32•2 years ago
|
||
Assignee | ||
Comment 33•2 years ago
|
||
Let's disable it on all AMD devices until it's fixed reliably on the Mesa side.
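The shape of such a vendor-wide gate can be sketched as below. This is not the actual Firefox blocklist code: the function name is hypothetical, and the only facts relied on are the PCI vendor IDs (0x1002 for AMD, 0x8086 for Intel, 0x10DE for NVIDIA).

```python
# PCI vendor IDs.
AMD_VENDOR_ID = 0x1002
INTEL_VENDOR_ID = 0x8086
NVIDIA_VENDOR_ID = 0x10DE


def dmabuf_surface_export_allowed(vendor_id: int) -> bool:
    """Hypothetical gate: disable the feature on every AMD device.

    The vendor is checked instead of the Mesa version because, per the
    comments above, the crash still reproduces on a Mesa release that was
    expected to contain the fix.
    """
    return vendor_id != AMD_VENDOR_ID


if __name__ == "__main__":
    for vendor in (AMD_VENDOR_ID, INTEL_VENDOR_ID, NVIDIA_VENDOR_ID):
        print(hex(vendor), dmabuf_surface_export_allowed(vendor))
```

Gating on the vendor is coarser than gating on a driver version, but it avoids depending on a version threshold that has already proven unreliable.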
Comment 34•2 years ago
|
||
Those traces from comment 30 are almost certainly two different contexts accessed on two different threads.
Comment 35•2 years ago
|
||
Pushed by stransky@redhat.com: https://hg.mozilla.org/integration/autoland/rev/ea595572b392 Disable DMABUF_SURFACE_EXPORT on AMD r=jgilbert,aosmond
Comment 36•2 years ago
|
||
bugherder |
Comment 38•2 years ago
|
||
Copying crash signatures from duplicate bugs.
Assignee | ||
Comment 39•2 years ago
|
||
Comment on attachment 9285077 [details]
Bug 1778114 Disable DMABUF_SURFACE_EXPORT on AMD r?jgilbert
Beta/Release Uplift Approval Request
- User impact if declined: WebGL crashes on AMD devices on Linux.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Fall back to old (tested) dmabuf code on AMD.
- String changes made/needed:
- Is Android affected?: No
Comment 40•2 years ago
|
||
Comment on attachment 9285077 [details]
Bug 1778114 Disable DMABUF_SURFACE_EXPORT on AMD r?jgilbert
Approved for 103.0b9, thanks.
Comment 41•2 years ago
|
||
bugherder uplift |