Closed Bug 1778114 Opened 2 years ago Closed 2 years ago

Crash in [@ amdgpu_cs_flush]

Categories

(Core :: Graphics: WebRender, defect)

Tracking

RESOLVED FIXED
104 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox102 --- unaffected
firefox103 --- fixed
firefox104 --- fixed

People

(Reporter: gerard-majax, Assigned: stransky)

References

(Regression)

Details

(Keywords: crash, regression)

Attachments

(3 files)

Crash report: https://crash-stats.mozilla.org/report/index/1d1fa165-4999-4f18-8707-d39e10220705

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0 libgallium_dri.so amdgpu_cs_flush src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1730
1 libgallium_dri.so si_flush_gfx_cs src/gallium/drivers/radeonsi/si_gfx_cs.c:143
2 libgallium_dri.so si_flush_from_st src/gallium/drivers/radeonsi/si_fence.c:534
3 libgallium_dri.so si_texture_create_object src/gallium/drivers/radeonsi/si_texture.c:1124
4 libgallium_dri.so si_texture_create_with_modifier src/gallium/drivers/radeonsi/si_texture.c:1300
5 libgallium_dri.so st_texture_create src/mesa/state_tracker/st_texture.c:103
6 libgallium_dri.so st_texture_storage src/mesa/state_tracker/st_cb_texture.c:3247
7 libgallium_dri.so st_AllocTextureStorage src/mesa/state_tracker/st_cb_texture.c:3295
8 libgallium_dri.so texture_storage_error.constprop.0 src/mesa/main/texstorage.c:554
9 libgallium_dri.so texstorage_error src/mesa/main/texstorage.c:610

Crashed ~10 times over a 30-min Zoom Web call.

Attached file: dmesg (deleted)

On the first occurrence of the crash, there was an amdgpu GPU reset:

[76114.322182] [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[76119.375165] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[76119.375165] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[76124.505170] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3447850, emitted seq=3447853
[76124.505355] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox-bin pid 669730 thread firefox-bi:cs0 pid 669844
[76124.505506] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
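For anyone triaging similar logs, here is a minimal sketch of pulling the key numbers out of a dmesg excerpt like the one above. It is only an illustration: `summarize_dmesg` and the regular expressions are made up for this example, and reading "emitted seq minus signaled seq" as the number of fences still pending at the timeout is my assumption about what those fields mean.

```python
import re

# Two lines copied from the dmesg excerpt in this bug, plus the first error.
SAMPLE = """\
[76114.322182] [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[76124.505170] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3447850, emitted seq=3447853
[76124.505506] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
"""

# "[seconds.micros] message" as printed by dmesg.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
RING_RE = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def summarize_dmesg(text):
    """Hypothetical helper: first error time, reset time, pending fences."""
    first_error = None
    reset_at = None
    pending_fences = None
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped dmesg line
        ts = float(match.group("ts"))
        msg = match.group("msg")
        if "*ERROR*" in msg and first_error is None:
            first_error = ts
        ring = RING_RE.search(msg)
        if ring:
            # Assumption: fences emitted but not yet signaled at the timeout.
            pending_fences = int(ring.group("emitted")) - int(ring.group("signaled"))
        if "GPU reset begin" in msg and reset_at is None:
            reset_at = ts
    seconds_to_reset = None
    if first_error is not None and reset_at is not None:
        seconds_to_reset = reset_at - first_error
    return {
        "first_error": first_error,
        "reset_at": reset_at,
        "pending_fences": pending_fences,
        "seconds_to_reset": seconds_to_reset,
    }
```

Calling `summarize_dmesg(SAMPLE)` gives the gap between the first error and the reset, which is roughly ten seconds for the excerpt above.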

The first crash occurrence on our side: https://crash-stats.mozilla.org/report/index/1d1fa165-4999-4f18-8707-d39e10220705

It was running fine for ~20h54 min and it crashed ~20 min after I started my Zoom Web call.

Looking at the URLs it seems like this is correlated with WebGL/Accelerated canvas

As per :jrmuizel suggestion, I have set gfx.canvas.accelerated = false and will see if it happens.

(In reply to Alexandre LISSY :gerard-majax from comment #4)

> As per :jrmuizel suggestion, I have set gfx.canvas.accelerated = false and will see if it happens.

This did not help: https://crash-stats.mozilla.org/report/index/313756bb-044a-4407-b947-36cbe0220706

System was stable for more than 24h, and after ~25 min of Zoom Web, kaboom.

This time [@ si_cp_dma_prefetch]

Crash Signature: [@ amdgpu_cs_flush] → [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch]
Crash Signature: [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch] → [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch] [@ si_emit_streamout_enable ]

Was there a change around 20220628191450 that could be even remotely related? Looking at crash-stats, this seems to be the oldest build ID that triggers the issue. The first report, https://crash-stats.mozilla.org/report/index/5d95bbb0-3614-42cc-aacd-f60c60220629, is from Fedora, so it's not just my system.

Flags: needinfo?(jmuizelaar)

A third crash, likely with GPU reset, completely messed my system

Attached file: dmesg_amdgpu.log (deleted)

(In reply to Alexandre LISSY :gerard-majax from comment #8)

> A third crash, likely with GPU reset, completely messed my system

This means it's at least also a Mesa or Kernel bug - can you open a bug at https://gitlab.freedesktop.org/mesa/mesa/-/issues ?

(In reply to Robert Mader [:rmader] from comment #10)

> This means it's at least also a Mesa or Kernel bug - can you open a bug at https://gitlab.freedesktop.org/mesa/mesa/-/issues ?

I already did ? https://bugzilla.mozilla.org/show_bug.cgi?id=1778114#a1917_41385

Flags: needinfo?(jmuizelaar)
Regressed by: 1776563

Set release status flags based on info from the regressing bug 1776563

(In reply to Alexandre LISSY :gerard-majax from comment #11)

> I already did ? https://bugzilla.mozilla.org/show_bug.cgi?id=1778114#a1917_41385

Oh right, didn't see, thanks!

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #12)

> Set release status flags based on info from the regressing bug 1776563

Given that bug 1776563 fixed a quite strong regression from bug 1776348, could you double check that builds from before bug 1776563 don't have the issue?

(In reply to Robert Mader [:rmader] from comment #14)

> Given that bug 1776563 fixed a quite strong regression from bug 1776348, could you double check that builds from before bug 1776563 don't have the issue?

Well I might need my full profile and I might need several days before knowing.

Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.

(In reply to Alexandre LISSY :gerard-majax from comment #16)

> Moved back to 20220626190331, let's see what tomorrow's meetings are going to yield.

18 mins long meeting this morning, no repro. It might have been too short, though.

(In reply to Alexandre LISSY :gerard-majax from comment #17)

> 18 mins long meeting this morning, no repro. It might have been too short, though.

Second meeting, 39 min, no crash.

I'm going back to current buildid, and if I hit again will try disabling widget.dmabuf-webgl.enabled

(In reply to Alexandre LISSY :gerard-majax from comment #18)

> Second meeting, 39 min, no crash.

Duh, sorry, comment 14 was wrong :( It's expected to work there because dmabuf is always disabled. You'd need to test a build without bug 1776348, i.e. before 20220626

(In reply to Robert Mader [:rmader] from comment #20)

> Duh, sorry, comment 14 was wrong :( It's expected to work there because dmabuf is always disabled. You'd need to test a build without bug 1776348, i.e. before 20220626

There's still something that seems to relate to a landing after 20220626190331, I got multiple crashes in several rows, even after reboots. And nothing with 20220626190331.

Yes, that totally makes sense - after bug 1776348 and before bug 1776563 dmabuf webgl usage was disabled completely. And dmabuf usage is most likely the cause for the crash.
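To put a number on the window being discussed, here is a small sketch that treats the two build IDs quoted in this bug as timestamps (Firefox build IDs have the form YYYYMMDDHHMMSS). The helper names `parse_build_id`, `regression_window` and `in_window` are made up for illustration and are not part of any Mozilla tooling.

```python
from datetime import datetime

# Build IDs quoted in this bug.
LAST_NO_REPRO = "20220626190331"   # no crash seen in two meetings
FIRST_CRASHING = "20220628191450"  # oldest crashing build ID on crash-stats


def parse_build_id(build_id):
    """Turn a YYYYMMDDHHMMSS build ID into a datetime."""
    return datetime.strptime(build_id, "%Y%m%d%H%M%S")


def regression_window():
    """Time between the last no-repro build and the first crashing build."""
    return parse_build_id(FIRST_CRASHING) - parse_build_id(LAST_NO_REPRO)


def in_window(build_id):
    """True if a build falls after the no-repro build, up to the first crash."""
    built = parse_build_id(build_id)
    return parse_build_id(LAST_NO_REPRO) < built <= parse_build_id(FIRST_CRASHING)
```

`regression_window()` comes out at a little over two days, so only a handful of Nightly builds fall inside the range that would need bisecting.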

We have recently improved debug symbol import, namely for ArchLinux, and we now see a few crashes on some AMD GPUs that seem to match the same signature: https://crash-stats.mozilla.org/signature/?platform_pretty_version=~Arch&date=%3E%3D2022-07-07T08%3A45%3A00.000Z&date=%3C2022-07-08T08%3A45%3A00.000Z&_sort=-date&signature=amdgpu_cs_flush

Marek suggests in https://gitlab.freedesktop.org/mesa/mesa/-/issues/6796#note_1461520 that it looks like Firefox is using the same GL context from multiple threads. Kelsey/Lee, is that something that could be happening?

Flags: needinfo?(lsalzman)
Flags: needinfo?(jgilbert)

I would suggest flipping the pref webgl.threadsafe-gl.force-disabled to true, restarting Firefox and seeing if you can reproduce. This will guarantee that we always use GL from the Renderer thread (for both general rendering and WebGL). I realize this is about using the same context from multiple threads, but I am curious to see if putting all of the contexts on the same thread helps anyways.

We should never be using the same context on multiple threads to my knowledge. We should add asserts for this.
We definitely hope that drivers support different contexts on different threads, but we have blocklisting in place for the ones that don't handle it right.
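As a rough illustration of the kind of assert meant here, the sketch below binds a context object to the thread that created it and fails loudly when any other thread touches it. This is not Firefox's GLContext code; `ThreadBoundContext`, `use` and `try_use_from_other_thread` are hypothetical names, and Python's `threading` module merely stands in for the real threading model.

```python
import threading


class ThreadBoundContext:
    """Hypothetical stand-in for a GL context that may only be used on one thread."""

    def __init__(self, name):
        self.name = name
        # Bind the context to the thread that created it.
        self._owner = threading.get_ident()

    def use(self):
        # Fail loudly if any thread other than the owner touches this context.
        current = threading.get_ident()
        if current != self._owner:
            raise RuntimeError(
                f"context {self.name!r} used on thread {current}, "
                f"but it is owned by thread {self._owner}"
            )
        return True


def try_use_from_other_thread(ctx):
    """Call ctx.use() on a fresh thread and return any error messages."""
    errors = []

    def worker():
        try:
            ctx.use()
        except RuntimeError as exc:
            errors.append(str(exc))

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return errors
```

Using the context on its owning thread passes, while `try_use_from_other_thread(ctx)` returns one error message. Different contexts created on different threads are unaffected, which matches the "different contexts on different threads" case that is supposed to work.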

Flags: needinfo?(jgilbert) → needinfo?(stransky)

One possibility here is that WR calls out to e.g. Gecko's dmabuf code, which accidentally changes context, and then returns to WR but now the wrong context is active?

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #25)

> I would suggest flipping the pref webgl.threadsafe-gl.force-disabled to true, restarting Firefox and seeing if you can reproduce. This will guarantee that we always use GL from the Renderer thread (for both general rendering and WebGL). I realize this is about using the same context from multiple threads, but I am curious to see if putting all of the contexts on the same thread helps anyways.

Flags: needinfo?(lissyx+mozillians)

I already did that as soon as I saw Andrew's message. Unfortunately, I don't have any meetings this week to confirm with Zoom Web.

Flags: needinfo?(lissyx+mozillians)

If that's a 'one context on multiple threads' scenario, it does not come from the dmabuf code itself.
The notable libgallium_dri calls from the backtrace are:

Thread 45, Name: Compositor
 0 	libc.so.6 	syscall 	None
1 	libgallium_dri.so 	si_eliminate_fast_color_clear 	src/gallium/drivers/radeonsi/si_texture.c:313
2 	libgallium_dri.so 	si_texture_get_handle 	src/gallium/drivers/radeonsi/si_texture.c:728
3 	libgallium_dri.so 	dri2_query_image 	src/gallium/frontends/dri/dri2.c:1375
4 	libEGL_mesa.so.0 	dri2_export_dma_buf_image_query_mesa 	src/egl/drivers/dri2/egl_dri2.c:3081
5 	libEGL_mesa.so.0 	_eglReleaseDisplayResources 	src/egl/main/egldisplay.c:321
6 	libxul.so 	DMABufSurfaceRGBA::Create(mozilla::gl::GLContext*, void*, int, int) 	widget/gtk/DMABufSurface.cpp:451
7 	libxul.so 	DMABufSurfaceRGBA::CreateDMABufSurface(mozilla::gl::GLContext*, void*, int, int) 	widget/gtk/DMABufSurface.cpp:896
8 	libxul.so 	mozilla::gl::SharedSurface_DMABUF::Create(mozilla::gl::SharedSurfaceDesc const&) 	gfx/gl/SharedSurfaceDMABUF.cpp:60
9 	libxul.so 	mozilla::gl::SurfaceFactory_DMABUF::CreateSharedImpl(mozilla::gl::SharedSurfaceDesc const&) 	gfx/gl/SharedSurfaceDMABUF.h:59
10 	libxul.so 	mozilla::gl::SurfaceFactory::CreateShared(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::ColorSpace2)

Crashing Thread (32), Name: Renderer
Frame 	Module 	Signature 	Source 	Trust
0 	libgallium_dri.so 	amdgpu_cs_flush 	src/gallium/winsys/amdgpu/drm/amdgpu_cs.c:1730 	context
1 	libgallium_dri.so 	si_flush_gfx_cs 	src/gallium/drivers/radeonsi/si_gfx_cs.c:143 	cfi
2 	libgallium_dri.so 	si_flush_from_st 	src/gallium/drivers/radeonsi/si_fence.c:534 	cfi
3 	libgallium_dri.so 	si_texture_create_object 	src/gallium/drivers/radeonsi/si_texture.c:1124 	cfi
4 	libgallium_dri.so 	si_texture_create_with_modifier 	src/gallium/drivers/radeonsi/si_texture.c:1300 	cfi
5 	libgallium_dri.so 	st_texture_create 	src/mesa/state_tracker/st_texture.c:103 	cfi
6 	libgallium_dri.so 	st_texture_storage 	src/mesa/state_tracker/st_cb_texture.c:3247 	cfi
7 	libgallium_dri.so 	st_AllocTextureStorage 	src/mesa/state_tracker/st_cb_texture.c:3295 	cfi
8 	libgallium_dri.so 	texture_storage_error.constprop.0 	src/mesa/main/texstorage.c:554 	cfi
9 	libgallium_dri.so 	texstorage_error 	src/mesa/main/texstorage.c:610 	cfi
10 	libgallium_dri.so 	_mesa_TexStorage2D 	src/mesa/main/texstorage.c:715 	cfi
11 	libxul.so 	webrender::device::gl::Device::create_texture 	gfx/wr/webrender/src/device/gl.rs:2521 	cfi
12 	libxul.so 	webrender::renderer::Renderer::update_texture_cache 	gfx/wr/webrender/src/renderer/mod.rs:2379 	cfi
13 	libxul.so 	webrender::renderer::Renderer::render_impl 	gfx/wr/webrender/src/renderer/mod.rs:1975 	cfi
14 	libxul.so 	webrender::renderer::Renderer::render 	gfx/wr/webrender/src/renderer/mod.rs:1737 	cfi
15 	libxul.so 	wr_renderer_render 	gfx/webrender_bindings/src/bindings.rs:620

The GL context for 'Thread 45, Name: Compositor' comes from mozilla::gl::SharedSurfaceDesc, and the GL context for 'Crashing Thread (32), Name: Renderer' comes from Rust.

As we have AMD / libgallium reports only (see Bug 1775258) and nothing on Intel/NVIDIA, I'd say https://gitlab.freedesktop.org/mesa/mesa/-/issues/6666 is incomplete or we use the wrong Mesa version for the DMABUF_SURFACE_EXPORT feature check.

Flags: needinfo?(stransky)

This is a variant of Bug 1774075. It was supposed to be fixed in Mesa 22.1.2, but I have Mesa 22.1.3 and it's still broken.
A simple test case is to set a WebGL sample as the homepage and open a bunch of Firefox windows; that reliably crashes.

Assignee: nobody → stransky
Status: NEW → ASSIGNED

Let's disable it on all AMD devices until it's fixed reliably on the Mesa side.

Those traces from comment 30 are almost certainly two different contexts accessed on two different threads.

Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/ea595572b392
Disable DMABUF_SURFACE_EXPORT on AMD r=jgilbert,aosmond
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 104 Branch
Depends on: 1779355

Copying crash signatures from duplicate bugs.

Crash Signature: [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch] [@ si_emit_streamout_enable ] → [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch] [@ si_emit_streamout_enable ] [@ amdgpu_add_fence_dependencies_bo_list]

Comment on attachment 9285077 [details]
Bug 1778114 Disable DMABUF_SURFACE_EXPORT on AMD r?jgilbert

Beta/Release Uplift Approval Request

  • User impact if declined: WebGL crashes on AMD devices on Linux.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Fall back to old (tested) dmabuf code on AMD.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9285077 - Flags: approval-mozilla-beta?

Comment on attachment 9285077 [details]
Bug 1778114 Disable DMABUF_SURFACE_EXPORT on AMD r?jgilbert

Approved for 103.0b9, thanks.

Attachment #9285077 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Regressions: 1779425
No longer regressions: 1779425
Crash Signature: [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch] [@ si_emit_streamout_enable ] [@ amdgpu_add_fence_dependencies_bo_list] → [@ amdgpu_cs_flush] [@ si_cp_dma_prefetch] [@ si_emit_streamout_enable ] [@ amdgpu_add_fence_dependencies_bo_list]
Flags: needinfo?(lsalzman)