Open Bug 1788573 Opened 2 years ago Updated 1 year ago

Crash in [@ NvGlEglGetFunctions] after suspend&resume if DMABUF and THREADSAFE_GL are enabled. Fixed in next Nvidia driver 530 release

Categories

(Core :: Graphics: WebRender, defect)

Firefox 106
x86_64
Linux
defect

Tracking

()

People

(Reporter: max, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/d9dfc4cf-fa65-4e51-83ca-5a8e60220828

Reason: SIGSEGV / SI_KERNEL

Top 10 frames of crashing thread:

0 libnvidia-eglcore.so.515.65.01 NvGlEglGetFunctions 
1 libnvidia-eglcore.so.515.65.01 NvGlEglApiInit 
2 libnvidia-eglcore.so.515.65.01 NvGlEglApiInit 
3 libEGL_nvidia.so.0 NvEglwlaf47906in 
4 libEGL_nvidia.so.0 NvEglwlaf47906in 
5 libEGL_nvidia.so.0 <.text ELF section in libEGL_nvidia.so.515.65.01> 
6 libEGL_nvidia.so.0 NvEglwlaf47906in 
7 libEGL_nvidia.so.0 NvEglwlaf47906in 
8 libxul.so DMABufSurfaceRGBA::ReleaseTextures widget/gtk/DMABufSurface.cpp:679
9 libxul.so mozilla::wr::RenderDMABUFTextureHost::ClearCachedResources gfx/webrender_bindings/RenderDMABUFTextureHost.cpp:72

The crash appears to be triggered if an external monitor is connected/disconnected while the computer is sleeping. I can confirm this if desired. I can also supply about 10 other crash reports with the same problem, more system information, or do other debugging steps.

This might be similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1737834, but that seems to be caused by a memory leak and shows up after firefox is left running for a while.

I'm using the 515 version of the NVIDIA drivers, but I believe that this bug was present with the 510 version as well.

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: General → Graphics: WebRender
Product: Firefox → Core
Severity: -- → S3
Flags: needinfo?(stransky)

https://gitlab.gnome.org/GNOME/mutter/-/issues/2045#note_1519500

Erik Kurzinger @ekurzinger · 4 weeks ago
The original cursor leak should definitely be fixed with that driver version, but I guess it's possible there's another leak somewhere. If you run nvidia-smi does it report suspiciously high video memory usage?

Blocks: wr-nv-linux
Keywords: crash
Hardware: ARM64 → x86_64
Summary: Crash in [@ NvGlEglGetFunctions] → Crash in [@ NvGlEglGetFunctions] apparently caused by connecting/disconnecting external monitor while the computer is sleeping

Erik, to me this looks like a driver bug that causes quite a few crashes - can you have a look / do you have an idea what could be happening?

Flags: needinfo?(ekurzinger)

I'm fairly certain the crash is not actually happening in NvGlEglGetFunctions. There's no way for that function to be reached from glDeleteTextures, maybe whatever is generating the backtrace is getting confused? In which case it might not even be the same issue as the other bug you mentioned. It seems like any segfault in libnvidia-eglcore.so gets mistakenly attributed to that function.

That said, there does seem to be a potential driver bug lurking somewhere. The other crash was during glTexImage2D, so it's possible the root cause is related.

Is the issue specific to Wayland like the other bug? Or does it also reproduce on X11? Also, does it reproduce every time, or is it intermittent? I did try connecting an external display while the system was suspended on X11, but didn't see a crash. On Wayland, suspend seems to have been completely broken since version 510 (I was rather disturbed to discover this).

Also, if you have a reliable repro, would you be able to check the video memory usage reported by nvidia-smi? If we do suspect video memory might be getting exhausted that may help to confirm it. Although there are some video memory allocations that wouldn't be tracked by that tool.

Flags: needinfo?(ekurzinger)

I have not tried this on Wayland. All of my crashes are on X. I'll try to get a set of steps to reproduce this in a few minutes. I doubt that it is a memory leak bug, since it triggers immediately on resume. I'll check the memory usage. It's a bit hard to check before the crash since it's on resume though. I can definitely check after, but by that point it's probably too late

Well now I've tried:

sleeping, disconnecting external monitor, resuming
disconnecting monitor, sleeping, connecting, resuming
sleeping, resuming with monitor connected or disconnected

Nothing seems to trigger the crash. So that must have just been a red herring. Every time I can remember, the crash happened right after I resumed, but something else must be triggering it. I'll keep playing around and see if I can get a reliable reproduction.

Flags: needinfo?(stransky)
Flags: needinfo?(gwatson)

This happens to me a few times a day. I also suspect it's a driver issue (using 510 version)
Let me know if I can help with anything.

Crashes when I wake up my workstation from sleep, monitors stay connected.
Driver: 525.60.11
Ubuntu 22.04.1
Nightly 110.0a1 (2022-12-18) (64-bit)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Linux on release (startup)

:gw, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gwatson)

Aleino, any ideas on this one?

Flags: needinfo?(gwatson) → needinfo?(aleino)

(In reply to Glenn Watson [:gw] from comment #12)

Aleino, any ideas on this one?

From the comments above it looks like my colleague, Erik Kurzinger, is already looking into it.

Info that would be helpful:

  1. What is the display connection? (DP, HDMI?)
  2. What's the GPU?
  3. Does it happen on the latest available drivers?
  4. X11 version.
  5. Some repro steps -- a more detailed description of how to trigger the crash.
Flags: needinfo?(aleino)
Duplicate of this bug: 1818077
Summary: Crash in [@ NvGlEglGetFunctions] apparently caused by connecting/disconnecting external monitor while the computer is sleeping → Crash in [@ NvGlEglGetFunctions] after suspend&resume

I don't necessarily have anything to add to this, except I don't have an external monitor per se. My setup is an older 3770k desktop system, with 2 monitors plugged in to an Asus TUF Gaming OC GeForce 1660 Super. Running Linux Pop!_OS 22.04, with NVIDIA driver 525.85.05, and Pop!_OS Linux Kernel version: 6.1.11-76060111-generic

Monitor 1 is an LG Ultrawide (2560x1080) hooked up via an HDMI cable
Monitor 2 is a Dell FP (1600x1200) hooked up via a DVI-D cable.

When it returns from suspend, it will crash Firefox nearly all the time. The only time that I can sort of say it maybe doesn't is if FF is minimized and not on screen, but I can't confirm that that is 100% effective, nor is it much of a solution. Currently that's just an anecdotal theory.

If I can help in any way for testing, triage, let me know.

Given the recent uptick in crash reports should this be S2 and/or have an assignee?

Flags: needinfo?(gwatson)

(In reply to Paul Zühlcke [:pbz] from comment #9)
Does it occur on Wayland and X11 or only on one of them?
Can the crash be prevented by setting widget.dmabuf-webgl.enabled to false and restarting Firefox?

Flags: needinfo?(pbz)
Severity: S3 → S2
Flags: needinfo?(gwatson)

I have not seen this bug since I switched to Wayland. I never could get a reliable reproducible procedure. When it was crashing, it would crash on (almost?) every single resume. Then after a full restart it might not do it (but I don't think this was consistent either). Then it would go back to crashing on every resume. When it was crashing, I could force it to happen with certainty by a sleep/resume cycle.

Update, last night before I suspended, I had 2 open FF windows, one on each monitor. I minimized them so the screen was just empty desktop. I did this consciously to try my theory. After resuming this morning, FF crashed. Apparently it doesn't matter if minimized or not. I'd say it crashes 90%+ of the time my system suspends.

Pop!_OS is X11 to answer the above question. This has been happening for me pretty regularly for several months.

Do you have the "PreserveVideoMemoryAllocations" feature enabled for the NVIDIA driver? If unsure, you can run "cat /proc/driver/nvidia/params" and look for the corresponding line.
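A minimal sketch of that check in Python, assuming the params file consists of "Name: value" lines as in the output quoted in this bug. The function names are hypothetical and only illustrate the lookup.

```python
from pathlib import Path

# Path taken from the comment above; it only exists when the NVIDIA
# kernel module is loaded.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def parse_nvidia_params(text: str) -> dict[str, str]:
    """Parse 'Name: value' lines into a dict (line format is an assumption)."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            params[key.strip()] = value.strip()
    return params


def preserves_video_memory(text: str) -> bool:
    """True if PreserveVideoMemoryAllocations is reported as 1."""
    return parse_nvidia_params(text).get("PreserveVideoMemoryAllocations") == "1"
```

On a machine with the driver loaded, this could be called as `preserves_video_memory(PARAMS_PATH.read_text())`.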

PreserveVideoMemoryAllocations: 0

That's the default for Pop!_OS (System76)/NVIDIA, whoever set that. I haven't changed anything.

What does that feature do?

It will save the contents of video memory when the system suspends and restore it on resume. Otherwise applications need to explicitly re-initialize all of their textures and stuff. But I believe Firefox does have code to do that. I only asked in case it affects the reproducibility of the crash.

(I'm just a user/tester.)

Nvidia driver 525 crashes after suspend&resume
if Dmabuf is enabled
and/or
if Firefox assumes driver thread safety and runs WebGL on a different thread.

https://nightly.mozilla.org, Gnome X11/Ubuntu 22.04 LTS/GTX1060/driver 525.85.05, PreserveVideoMemoryAllocations = 0.
STR:

  1. Open WebGL.
  2. Suspend & resume.
  3. Main process crash

https://webglsamples.org/aquarium/aquarium.html

https://yari-demos.prod.mdn.mozit.cloud/en-US/docs/Web/API/Canvas_API/Tutorial/Basic_animations/_sample_.an_animated_solar_system.html

Here is the logic on which main process thread WebGL runs: https://searchfox.org/mozilla-central/rev/f7edb0b474a1a922f3285107620e802c6e19914d/gfx/ipc/CanvasManagerParent.cpp#52
a) if not threadsafe (bug 1739996 comment 2: so far only the case on Nouveau. webgl.threadsafe-gl.force-disabled=true) = run WebGL on RenderThread
b) if threadsafe and webgl.use-canvas-render-thread=true (bug 1778431) = run WebGL on CanvasRenderThread
c) if threadsafe = run WebGL on CompositorThread. "This appears to have performance benefits, possibly because the renderer thread is too busy"
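The three cases above can be sketched as a small decision function. This is an illustrative Python paraphrase of the list, not the actual CanvasManagerParent.cpp code; the enum and function names are hypothetical.

```python
from enum import Enum


class WebGLThread(Enum):
    RENDER = "RenderThread"
    CANVAS_RENDER = "CanvasRenderThread"
    COMPOSITOR = "CompositorThread"


def pick_webgl_thread(threadsafe_gl: bool, use_canvas_render_thread: bool) -> WebGLThread:
    """Mirror cases a), b) and c) from the list above."""
    if not threadsafe_gl:
        # a) driver not considered thread-safe (e.g. forced via
        #    webgl.threadsafe-gl.force-disabled=true)
        return WebGLThread.RENDER
    if use_canvas_render_thread:
        # b) thread-safe and webgl.use-canvas-render-thread=true
        return WebGLThread.CANVAS_RENDER
    # c) thread-safe, no dedicated canvas render thread
    return WebGLThread.COMPOSITOR
```

With this reading, setting webgl.threadsafe-gl.force-disabled=true always lands WebGL on the RenderThread, regardless of the other pref.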

background

  • Firefox internally accelerates classic Canvas via WebGL (gfx.canvas.accelerated).
  • Firefox has two Dmabuf WebGL modes
    • preferred on proprietary Nvidia, blacklisted for Mesa: bug 1735929 comment 25 (EGL_MESA_image_dma_buf_export)
    • Gbm
  • The crash also occurs with disabled Dmabuf.

Gnome Wayland on Ubuntu 22.04 LTS: Gnome glitches (wild colors) after suspend&resume, can't really see Firefox. Will upgrade Ubuntu and re-test.

Flags: needinfo?(pbz)

Ok, finally got a repro with a debug build of the NV driver and I think I see the problem. It does appear to be a driver bug - our memory book-keeping is getting messed up after we resume from suspend which can cause a segfault at some random point later on.

Thanks Erik. Is there anything we can do from the Firefox side to work around this? I'm assuming not, but figured I'd ask - as it doesn't seem worthwhile blocking that driver version from hw-accel for a suspend/resume problem, but it's also a relatively high crash volume.

Flags: needinfo?(ekurzinger)

The driver bug can be fixed with a fairly low-risk change, so I might be able to get it into the next 530 release which should be fairly soon. In terms of a work-around until then, one option would be to force the use of GLX since the bug is specific to EGL. Another option would be to enable the aforementioned feature that preserves video memory across suspend / resume, which should have the side-effect of avoiding the problem. That can be done by setting the option "NVreg_PreserveVideoMemoryAllocations=1" for the nvidia kernel module.

Darkspirit's earlier comment mentioned some other settings that appear to prevent the crash, although note that it's basically a use-after-free error and therefore somewhat non-deterministic. So it could be the case that they just work due to luck... I'm not sure. The two things mentioned above should work for certain, though.

Flags: needinfo?(ekurzinger)

I guess I should also say that the specific thing triggering the bug is suspending while there are textures bound to EGLImages. I'm not familiar enough with FF internals to know how it uses such textures, but if that can be avoided somehow it should also avoid the crash.

Thanks Erik. Martin, Andrew, thoughts on what might be the easiest workaround? Would it be reasonable to force GLX for this driver version?

Flags: needinfo?(stransky)
Flags: needinfo?(aosmond)

From what Darkspirit indicates in comment 23, I would prefer to disable DMABUF and/or THREADSAFE_GL for NVIDIA binary driver users. Putting them on GLX implies disabling DMABUF anyways.

Right now we require >= 495.44 for DMABUF. Do we have any sense of a driver range we should consider here?

Edit: Based on comment 26, maybe it is insufficient. I think it is much preferred to switching to GLX. We are trying to get away from GLX whenever possible as it has its own threading issues.

I see it crashing in [510.47.3.0, 525.89.2.0] range. I'd say block DMABUF for 510.0 to 530.0 and see if that is sufficient.
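A rough Python sketch of such a range check, assuming dotted numeric version strings and a half-open range (510.0 included, 530.0 and later excluded). The bounds handling and the function names are assumptions for illustration; this is not how the Firefox blocklist is implemented.

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '525.85.05' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Proposed range from the comment above; treating the upper bound as
# exclusive is an assumption.
BLOCK_FROM = parse_driver_version("510.0")
BLOCK_BEFORE = parse_driver_version("530.0")


def dmabuf_blocked(version: str) -> bool:
    """True if the driver version falls inside the proposed blocked range."""
    return BLOCK_FROM <= parse_driver_version(version) < BLOCK_BEFORE
```

Under these assumptions, both ends of the observed crash range (510.47.3.0 and 525.89.2.0) are blocked, while 495.44 and any 530.x release are not.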

Crash volume increased because bug 1806058 increased WebGL usage.
(Lee Salzman [:lsalzman] from bug 1777849 comment 57)

We set up downloadable Blocklist rules for 110 to prevent Linux + X11 users from enabling accelerated canvas2D.

I will now test these driver versions: https://packages.ubuntu.com/search?suite=jammy-updates&searchon=names&keywords=nvidia-driver-

  • nvidia-driver-525: comment 23 = Only widget.dmabuf-webgl.enabled=false + webgl.threadsafe-gl.force-disabled=true prevented the crash after suspend&resume. If I disabled only one of them, the crash still occurred.
  • nvidia-driver-390: black screen after boot.

Tested https://webglsamples.org/aquarium/aquarium.html on Gnome X11, Ubuntu 22.04 LTS:

  • no crash with widget.dmabuf-webgl.enabled=false + webgl.threadsafe-gl.force-disabled=true, neither with multiple windows, but fishes lose their texture on the right and regain it on the left (attached screenshot). Can be fixed with F5.

nvidia-driver-470

MOZ_LOG="Dmabuf:5" firefox/firefox
[Child 13169: Main Thread]: D/Dmabuf We're missing DRM render device!
[Child 13169: Main Thread]: D/Dmabuf nsDMABufDevice::IsDMABufWebGLEnabled: UseDMABuf 0 mUseWebGLDmabufBackend 1 widget_dmabuf_webgl_enabled 1

nvidia-driver-495 is an alias for nvidia-driver-510

MOZ_LOG="Dmabuf:5" firefox/firefox
[Parent 3974: Main Thread]: D/Dmabuf Using DRM device /dev/dri/renderD128
[Parent 3974: Main Thread]: D/Dmabuf nsDMABufDevice::Configure()
[Parent 3974: Main Thread]: D/Dmabuf Loading DMABuf system library libgbm.so.1 ...
[Parent 3974: Main Thread]: D/Dmabuf DMABuf is enabled
[ERROR glean_core] Error setting metrics feature config: Json(Error("EOF while parsing a value", line: 1, column: 0))
[Child 4164: Main Thread]: D/Dmabuf Using DRM device /dev/dri/renderD128
[Child 4164: Main Thread]: D/Dmabuf Failed to open drm render node /dev/dri/renderD128 error Permission denied
[Child 4164: Main Thread]: D/Dmabuf nsDMABufDevice::IsDMABufWebGLEnabled: UseDMABuf 1 mUseWebGLDmabufBackend 1 widget_dmabuf_webgl_enabled 1
[Parent 3974: CanvasRenderer]: D/Dmabuf nsDMABufDevice::IsDMABufWebGLEnabled: UseDMABuf 1 mUseWebGLDmabufBackend 1 widget_dmabuf_webgl_enabled 1
[Parent 3974: CanvasRenderer]: D/Dmabuf nsDMABufDevice::IsDMABufWebGLEnabled: UseDMABuf 1 mUseWebGLDmabufBackend 1 widget_dmabuf_webgl_enabled 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::Create() from EGLImage UID = 1
[Parent 3974: CanvasRenderer]: D/Dmabuf   imported size 1 x 1 format 34324241 planes 1 modifiers 3000000004fe010
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::Serialize() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::ImportSurfaceDescriptor() UID 1 size 1 x 1
[Parent 3974: CanvasRenderer]: D/Dmabuf   imported size 1 x 1 format 34324241 planes 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::CreateTexture() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::ReleaseTextures() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurface::ReleaseDMABuf() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::ReleaseTextures() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::ReleaseTextures() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurface::ReleaseDMABuf() UID 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::Create() from EGLImage UID = 3
[Parent 3974: CanvasRenderer]: D/Dmabuf   imported size 1024 x 1024 format 34324241 planes 1 modifiers 300000000cdb014
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::Serialize() UID 3
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::ImportSurfaceDescriptor() UID 3 size 1024 x 1024
[Parent 3974: CanvasRenderer]: D/Dmabuf   imported size 1024 x 1024 format 34324241 planes 1
[Parent 3974: CanvasRenderer]: D/Dmabuf DMABufSurfaceRGBA::Serialize() UID 3
[Child 4164: Main Thread]: D/Dmabuf nsDMABufDevice::IsDMABufWebGLEnabled: UseDMABuf 1 mUseWebGLDmabufBackend 1 widget_dmabuf_webgl_enabled

nvidia-driver-515


nvidia-driver-520 is an alias for nvidia-driver-525 = comment 23

Crash Signature: [@ NvGlEglGetFunctions] → [@ NvGlEglGetFunctions] [@ libnvidia-eglcore.so.470.161.03@0xf710b0 ] [@ libnvidia-eglcore.so.510.108.03@0xe4bd91 ] [@ libnvidia-eglcore.so.515.86.01@0xe5b281 ]
Duplicate of this bug: 1801892
Crash Signature: [@ NvGlEglGetFunctions] [@ libnvidia-eglcore.so.470.161.03@0xf710b0 ] [@ libnvidia-eglcore.so.510.108.03@0xe4bd91 ] [@ libnvidia-eglcore.so.515.86.01@0xe5b281 ] → [@ NvGlEglGetFunctions] [@ libnvidia-eglcore.so.470.161.03@0xf710b0 ] [@ libnvidia-eglcore.so.510.108.03@0xe4bd91 ] [@ libnvidia-eglcore.so.515.86.01@0xe5b281 ] [@ libnvidia-eglcore.so.510.85.02@0xe4bf31] [@ libnvidia-eglcore.so.510.85.02@0xe4bfbd ] …
Duplicate of this bug: 1761644
Summary: Crash in [@ NvGlEglGetFunctions] after suspend&resume → Crash in [@ NvGlEglGetFunctions] after suspend&resume if DMABUF and THREADSAFE_GL are enabled. Fixed in next Nvidia driver 530 release

I just want to say thank you to everyone who is helping resolve this most-heinous bug, most notably Darkspirit for his troubleshooting, and Erik Kurzinger for his work at Nvidia providing the actual fix. Can't wait for my distribution to roll out 530-series NVIDIA drivers with the actual fix baked in. Cheers!

Flags: needinfo?(stransky)

Has anyone tried enabling the GPU process? I have layers.gpu-process.enabled set to true and this solves the problem completely.
Yes, the GPU process crashes periodically, but it doesn't bring the whole browser down.

This is my example of same crash - https://crash-stats.mozilla.org/report/index/66b6c18c-11c4-4896-8fa4-6d0220230304

It appears our blocklisting efforts have successfully brought the crash rate down.

(In reply to V. Korn from comment #36)

Has anyone tried enabling the GPU process? I have layers.gpu-process.enabled set to true and this solves the problem completely.
Yes, the GPU process crashes periodically, but it doesn't bring the whole browser down.

This is my example of same crash - https://crash-stats.mozilla.org/report/index/66b6c18c-11c4-4896-8fa4-6d0220230304

Unfortunately the GPU process as currently implemented on Linux won't work with Wayland, so we haven't invested effort in shipping it. My understanding is to make it work with Wayland, we would need to do something similar to what we do on Android, where we proxy to the parent process at the final stages of the compositing pipeline.

Flags: needinfo?(aosmond)

(In reply to Erik Kurzinger from comment #26)

The driver bug can be fixed with a fairly low-risk change, so I might be able to get it into the next 530 release which should be fairly soon. In terms of a work-around until then, one option would be to force the use of GLX since the bug is specific to EGL. Another option would be to enable the aforementioned feature that preserves video memory across suspend / resume, which should have the side-effect of avoiding the problem. That can be done by setting the option "NVreg_PreserveVideoMemoryAllocations=1" for the nvidia kernel module.

Darkspirit's earlier comment mentioned some other settings that appear to prevent the crash, although note that it's basically a use-after-free error and therefore somewhat non-deterministic. So it could be the case that they just work due to luck... I'm not sure. The two things mentioned above should work for certain, though.

Hey Erik, Can you confirm if the fix for this bug made it into the 530 release?

https://www.nvidia.com/download/driverResults.aspx/200481/en-us/
Version: 530.41.03

Thanks

Flags: needinfo?(ekurzinger)

No, the fix turned out to be more complicated than I initially thought and I was unable to get it checked in in time for the release. Apologies.

Flags: needinfo?(ekurzinger)

Thanks Erik.

To whoever is responsible for bug https://bugzilla.mozilla.org/show_bug.cgi?id=1820055: it may need to be adjusted, as it is a workaround hard-coded for NVIDIA driver releases 530 and lower.

Duplicate of this bug: 1818178

Donald,

You're going to have to re-open Bug #1820055 to extend the workaround beyond release 530. As per Erik's comment, the fix did not make it into 530, so without changing the workaround the crash will return.

Flags: needinfo?(dmeehan)

Redirecting needinfo to :aosmond for a follow-up on comment 39 - comment 42

Flags: needinfo?(dmeehan) → needinfo?(aosmond)

Landing a fix in bug 1824778. When we have a confirmed working driver, we can unblock.

Flags: needinfo?(aosmond)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

@aosmond so what's the state with this bug? Can you annotate it with fixed/etc if it is so?

Flags: needinfo?(aosmond)

Kelsey, this bug is on hold awaiting a driver fix from Nvidia, which makes Erik Kurzinger from Nvidia the primary point of contact.

After the Nvidia fix comes in then dmabuf can be re-enabled.

Any progress, Erik?

Flags: needinfo?(ekurzinger)

Yep, once we hear from NVIDIA on a specific driver version with the fix, we can make a more fine tuned blocklist rule.

Flags: needinfo?(aosmond)

Sorry for the slow response. The fix will be in the 545 driver release, that'll be the next major version after 535 which went public recently.

Flags: needinfo?(ekurzinger)