Closed Bug 1771382 Opened 2 years ago Closed 2 years ago

VAAPI on dual Intel/Nvidia graphics (crash reports look like Intel-only): Crash in [@ __GI___mknodat] because the sandbox blocks the Nvidia driver

Categories

(Core :: Security: Process Sandboxing, defect, P2)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
103 Branch
Tracking Status
firefox102 --- disabled
firefox103 --- fixed

People

(Reporter: mccr8, Assigned: jld)

References

(Blocks 2 open bugs)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/0d55f0cf-248b-41eb-9cc6-c41b10220526

Reason: SIGSYS / SYS_SECCOMP

Top 10 frames of crashing thread:

0 libc.so.6 __GI___mknodat sysdeps/unix/sysv/linux/mknodat.c:33
1 libnvidia-glsi.so.390.151 libnvidia-glsi.so.390.151@0x00000000000514f9 
2 libnvidia-glsi.so.390.151 libnvidia-glsi.so.390.151@0x0000000000051714 
3 libnvidia-glsi.so.390.151 libnvidia-glsi.so.390.151@0x0000000000051aff 
4 libnvidia-glsi.so.390.151 libnvidia-glsi.so.390.151@0x000000000004e0bf 
5 libnvidia-glsi.so.390.151 libnvidia-glsi.so.390.151@0x000000000004b3f5 
6 libnvidia-glsi.so.390.151 libnvidia-glsi.so.390.151@0x000000000004e6fc 
7 firefox-bin calloc memory/build/malloc_decls.h:52
8 libEGL.so.1 _fini 
9 libEGL.so.1 _fini 
Blocks: 1748460
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Summary: Crash in [@ __GI___mknodat] → nvidia-vaapi-driver: RDD Crash in [@ __GI___mknodat]

(Kevin Locke from bug 1771632 comment 2)
[...]

However, I'm a bit confused. In bug 1751363 comment 67 you mention "MOZ_DISABLE_RDD_SANDBOX=1 is no longer needed for Mesa users." However, I'm experiencing crashes on build 20220529090310. Can you clarify?

(Darkspirit from bug 1771632 comment 3)

Some or all of these reports must be yours then. They are tracked by bug 1771382.

Strange: bp-0d11c47a-3616-4ac4-bc4b-6f9520220529 from crash-stats does not show an Nvidia card in the Telemetry tab, but is caused by the Nvidia driver ("libnvidia-glsi.so.390.151") being blocked by the sandbox.
Do you have an Nvidia graphics card?
Do you just have their driver installed without having an Nvidia card anymore?
Are you trying to use nvidia-vaapi-driver?

(Kevin Locke from bug 1771632 comment 4)

Interesting, thanks! In addition to the Intel 3rd Gen Core Graphics, my T430 also has an NVIDIA NVS 5400M (for Optimus "switchable graphics") which is disabled in the BIOS. The drivers are installed for the rare occasions I need to run something on it. I'm surprised they are being loaded, since glx is currently set to mesa (using update-glx on Debian). Is that expected? I'll see if I can figure out why. Any hints for investigating would be appreciated.

If you can reliably reproduce this crash, please test if starting Firefox with this environment variable fixes the problem:
$ __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json path/to/firefox

Summary: nvidia-vaapi-driver: RDD Crash in [@ __GI___mknodat] → VAAPI on dual Intel/Nvidia graphics (buz crash reports look like Intel-only): Crash in [@ __GI___mknodat] because the sandbox blocks the Nvidia driver
Summary: VAAPI on dual Intel/Nvidia graphics (buz crash reports look like Intel-only): Crash in [@ __GI___mknodat] because the sandbox blocks the Nvidia driver → VAAPI on dual Intel/Nvidia graphics (crash reports look like Intel-only): Crash in [@ __GI___mknodat] because the sandbox blocks the Nvidia driver

Thanks again Darkspirit!

(In reply to Darkspirit from comment #1)

If you can reliably reproduce this crash, please test if starting Firefox with this environment variable fixes the problem:
$ __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json path/to/firefox

When I launch firefox with multiple tabs of https://old.reddit.com/r/wallstreetbets/comments/l8rf4k/times_square_right_now/ I get a crash ~30% of the time. 100% of the time, I get messages similar to Sandbox: seccomp sandbox violation: pid 805237, tid 805246, syscall 259, args 4294967196 140014785990768 8630 50175 1 1. Killing process.

With __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json neither the crash nor the seccomp sandbox violation message has been produced in ~15 attempts. It appears to reliably avoid the issue.

If I understand correctly, the GLVND EGL dispatching tries libEGL_nvidia.so before libEGL_mesa.so because 10_nvidia.json precedes 50_mesa.json. Normally (based on strace of eglgears_wayland) libEGL_nvidia.so loads libnvidia-glsi.so which reads /proc/modules and each /sys/bus/pci/devices/*/config, then forks and execs /usr/bin/nvidia-modprobe (which also reads /proc/modules and each /sys/bus/pci/devices/*/config, then exits with code 1, without attempting mknod or init_module) and gives up after nvidia-modprobe exits, letting libEGL_mesa.so try (and succeed on my system).
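To make the ordering concrete, here is a rough sketch of how vendor files sorted by name would put nvidia ahead of mesa. It is not glvnd's code: the function names are hypothetical, the JSON layout (an "ICD" object with a "library_path") is what I believe the vendor files look like, and the demo writes its own two sample files into a temporary directory instead of reading /usr/share/glvnd/egl_vendor.d.

```python
import json
import os
import tempfile


def vendor_libraries_in_order(vendor_dir):
    """Return (file name, library path) pairs in file-name order."""
    ordered = []
    for name in sorted(os.listdir(vendor_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(vendor_dir, name)) as handle:
            data = json.load(handle)
        ordered.append((name, data.get("ICD", {}).get("library_path")))
    return ordered


def demo():
    """Build a throwaway vendor directory with the two files from this bug."""
    with tempfile.TemporaryDirectory() as vendor_dir:
        # Sample contents; the library names are assumed placeholder values.
        samples = {
            "50_mesa.json": "libEGL_mesa.so.0",
            "10_nvidia.json": "libEGL_nvidia.so.0",
        }
        for name, library in samples.items():
            with open(os.path.join(vendor_dir, name), "w") as handle:
                json.dump({"file_format_version": "1.0.0",
                           "ICD": {"library_path": library}}, handle)
        return vendor_libraries_in_order(vendor_dir)


print(demo())
```

Pointing vendor_libraries_in_order at the real directory on an affected system should show whether an nvidia entry sorts before the mesa one.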

Perhaps if the RDD process is unable to read those files or fork+exec nvidia-modprobe, libnvidia-glsi.so falls back to attempting mknod itself and gets killed (for seccomp sandbox violation: syscall 259, mknodat)? That could explain why the issue only occurs on systems with the nvidia drivers installed and the nvidia module not loaded (because otherwise mknod would not be needed as the device files would already exist).
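As a side check on this hypothesis, the arguments in the violation message quoted above can be decoded by hand. The sketch below is only an illustration: the helper names are made up, and the dev_t split assumes the glibc encoding on Linux.

```python
import stat

# Raw arguments from the violation message for syscall 259 (mknodat on x86_64):
# mknodat(dirfd, pathname, mode, dev); the last two logged values are unused.
ARGS = [4294967196, 140014785990768, 8630, 50175, 1, 1]


def to_signed32(value):
    """Interpret the low 32 bits of a raw argument as a signed int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def split_dev(dev):
    """Split a Linux dev_t into (major, minor), assuming the glibc layout."""
    major = ((dev >> 8) & 0xFFF) | ((dev >> 32) & 0xFFFFF000)
    minor = (dev & 0xFF) | ((dev >> 12) & 0xFFFFFF00)
    return major, minor


def describe_mknodat(args):
    """Return a dict with the decoded dirfd, file type, permissions and device."""
    dirfd, _path_ptr, mode, dev = args[:4]
    major, minor = split_dev(dev)
    return {
        "dirfd": to_signed32(dirfd),  # -100 is AT_FDCWD on Linux
        "is_char_device": stat.S_ISCHR(mode),
        "permissions": oct(stat.S_IMODE(mode)),
        "major": major,
        "minor": minor,
    }


print(describe_mknodat(ARGS))
```

If these assumptions hold, the call is trying to create a character device with mode 0666 and device number 195:255, relative to the current directory. As far as I know that number is the one conventionally assigned to /dev/nvidiactl, which would fit the idea that the driver is creating its own device node because none exists.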

However, that doesn't explain why Firefox sometimes crashes when the RDD process dies (especially when particular videos are loaded at the same time in multiple tabs). Perhaps these are two separate issues:

  1. The presence of the nvidia GLVND EGL driver causes the RDD process to crash when the nvidia kernel module isn't loaded/hardware isn't present.
  2. Killing the RDD process can cause Firefox to crash under certain circumstances.

Would it make sense to split off a separate issue?

For reference, I bisected the Sandbox: seccomp sandbox violation: ... Killing process. message to Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=f6f71994696724abc3b194de6f152944f9a9b0cb&tochange=cd0c2d8c609262d6713d6b20804cf283a0c9c330 (Bug 1769182). It was much easier to bisect than the unreliable crash.

Main vs. RDD process crash:
The main (parent) process can crash with [@ mozilla::PRDDChild::OtherPid ] (bug 1752493, bug 1761942, bug 1761217) if many tabs try to talk to the RDD process at the same time while it's crashing because of a sandbox violation.


Wrong EGL driver selection:

You seem to be right.
But it also seems to affect the main process which doesn't have a sandbox (bug 1745172).

Here's what I found a few days ago:
https://www.reddit.com/r/pop_os/comments/rnergb/firefox_shows_blank_screen_after_upgrading_to_2110/

This only occurs with Integrated graphics mode, not Nvidia graphics mode;
This does not always occur. Around 1 in 3 attempts to run firefox will succeed.
Even if firefox starts successfully, opening Help -> About Firefox will still show a blank screen.

https://askubuntu.com/questions/1380600/firefox-occasionally-not-rendering-its-own-window-when-opened

Turns out it has something to do in the case when you disable the Nvidia card and use integrated graphics. Firefox seems to be trying to render through the disabled GPU. To fix add this to your .profile:

if ! grep -w -q nvidia <(lsmod) ; then
  export __EGL_VENDOR_LIBRARY_FILENAMES="/usr/share/glvnd/egl_vendor.d/50_mesa.json"
fi
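The same check could be done from a small launcher script instead of .profile. The following is a minimal sketch, assuming that a module named exactly nvidia in /proc/modules is the right signal. The function names are hypothetical, and unlike grep -w it only looks at the module-name column.

```python
MESA_VENDOR_FILE = "/usr/share/glvnd/egl_vendor.d/50_mesa.json"


def nvidia_module_loaded(proc_modules_text):
    """True if a module named exactly 'nvidia' is listed (first column)."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "nvidia":
            return True
    return False


def firefox_env(base_env, proc_modules_text):
    """Copy base_env, forcing the Mesa EGL vendor file when nvidia is not loaded."""
    env = dict(base_env)
    if not nvidia_module_loaded(proc_modules_text):
        env["__EGL_VENDOR_LIBRARY_FILENAMES"] = MESA_VENDOR_FILE
    return env


# Example with made-up /proc/modules contents (no nvidia module loaded).
sample = (
    "i915 3000000 10 - Live 0x0000000000000000\n"
    "snd_hda_intel 57344 3 - Live 0x0000000000000000\n"
)
print(firefox_env({"HOME": "/home/user"}, sample))
```

In real use the text would come from reading /proc/modules, and the resulting dict would be passed as the env argument of subprocess.run when starting Firefox.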

The user has 10_nvidia.json and 50_mesa.json.
glvnd just picks the json file with the lowest number.
The Nvidia driver doesn't reliably reject its responsibility when the GPU is disabled, the kernel module not loaded, etc.

(In reply to Darkspirit from comment #4)

Main vs. RDD process crash:
The main (parent) process can crash with [@ mozilla::PRDDChild::OtherPid ] (bug 1752493, bug 1761942, bug 1761217) if many tabs try to talk to the RDD process at the same time while it's crashing because of a sandbox violation.

Bingo. That'd explain it. Thanks!

Let me know if you'd like me to open another bug to track it (since those are all closed as fixed and this is a reasonably reliable reproduction - for the moment at least).

(In reply to Darkspirit from comment #4)

Wrong EGL driver selection:

You seem to be right.
But it also seems to affect the main process which doesn't have a sandbox (bug 1745172).
[...]
The user has 10_nvidia.json and 50_mesa.json.
glvnd just picks the json file with the lowest number.
The Nvidia driver doesn't reliably reject its responsibility when the GPU is disabled, the kernel module not loaded, etc.

Good finds! I haven't observed Firefox opening with a transparent window, but the problem may manifest differently on Wayland than X11. I'll keep an eye out for possibly related symptoms.

Another potential factor is that Debian, Ubuntu, and derivatives provide update-glx to switch between nvidia and mesa drivers. It uses the Debian Alternatives System to update the target of the symlinks for libGL.so, libEGL.so, etc. between nvidia and mesa versions (along with symlinks in /etc/modprobe.d/ to disable loading of the nvidia or nouveau kernel module). My understanding is that it's (mostly?) superseded by GLVND, but perhaps it still has an effect in some cases? For reference, on my system update-glx --display glx | head -n4 shows:

glx - manual mode
  link best version is /usr/lib/nvidia
  link currently points to /usr/lib/mesa-diverted
  link glx is /usr/lib/glx

Indicating that mesa is currently selected (e.g. /usr/lib/x86_64-linux-gnu/libGL.so resolves to /usr/lib/mesa-diverted/x86_64-linux-gnu/libGL.so).

Note: /usr/lib/nvidia is "best" because its alternative is created with priority 100, so that it would be chosen by default (i.e. in "auto" mode) since it's likely what users want if they bothered to install it. It doesn't indicate anything about the specific machine.

Priority: -- → P2

(lilydjwg from bug 1771898 comment 14)

VAAPI works with MOZ_DISABLE_RDD_SANDBOX=1. Without it, I get the following log:

Sandbox: seccomp sandbox violation: pid 366214, tid 366375, syscall 16, args 26 3225962194 140193679276464 140194141494800 140193679276464 0.  Killing process.
Sandbox: seccomp sandbox violation: pid 366388, tid 366417, syscall 16, args 26 3225962194 140458407098800 140458878246416 140458407098800 0.  Killing process.
[Child 365603, MediaDecoderStateMachine #1] WARNING: Decoder=7f7c1cc2c700 Decode error: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005): file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachineBase.cpp:151
Sandbox: seccomp sandbox violation: pid 366435, tid 366445, syscall 16, args 26 3225962194 139975790389680 139976252603920 139975790389680 0.  Killing process.
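For anyone collecting these messages from several runs, the lines can be tallied by syscall. This is only a sketch: the regular expression is inferred from the log lines above, and the name table covers just the two x86_64 syscall numbers that appear in this bug.

```python
import re
from collections import Counter

# x86_64 syscall numbers seen in this bug; not a complete table.
SYSCALL_NAMES = {16: "ioctl", 259: "mknodat"}

VIOLATION = re.compile(
    r"seccomp sandbox violation: pid (\d+), tid (\d+), "
    r"syscall (\d+), args ([\d ]+)\."
)


def parse_violations(log_text):
    """Yield (pid, tid, syscall name, args) for each violation line."""
    for match in VIOLATION.finditer(log_text):
        pid, tid, number = (int(match.group(i)) for i in (1, 2, 3))
        args = [int(value) for value in match.group(4).split()]
        yield pid, tid, SYSCALL_NAMES.get(number, str(number)), args


LOG = """\
Sandbox: seccomp sandbox violation: pid 366214, tid 366375, syscall 16, args 26 3225962194 140193679276464 140194141494800 140193679276464 0.  Killing process.
Sandbox: seccomp sandbox violation: pid 366388, tid 366417, syscall 16, args 26 3225962194 140458407098800 140458878246416 140458407098800 0.  Killing process.
Sandbox: seccomp sandbox violation: pid 366435, tid 366445, syscall 16, args 26 3225962194 139975790389680 139976252603920 139975790389680 0.  Killing process.
"""

counts = Counter(name for _pid, _tid, name, _args in parse_violations(LOG))
print(counts)
```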

Here is a crash report: https://crash-stats.mozilla.org/report/index/4ddd1bd8-05ea-4dd7-8045-7608a0220611

(lilydjwg from bug 1771898 comment 20)

Yes, __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json prevents the crash.

Crash Signature: [@ __GI___mknodat] → [@ __GI___mknodat] [@ ioctl ]

I've tried and failed to reproduce this, on a desktop with an AMD GPU where I've also installed an Nvidia GPU and the proprietary drivers, on Debian. The RDD process has loaded both libEGL_mesa and libEGL_nvidia, as well as libGLdispatch (glvnd), and it has tried to read some nvidia config files I didn't know about (nvidia-application-profiles-rc), but I haven't seen any of the sandbox problems.

But, I did a Try run (link to x86 opt tar.bz2) with a patch to quietly fail the mknod and those 'F' ioctls (e.g. 3225962194 = 0xC04846D2 ⇒ read/write, size 0x0048 = 72, type 0x46 = 'F', number 0xD2 = 210) instead of crashing on Nightly, if someone who's affected wants to try testing it.
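The decoding in the parenthetical can be reproduced with a few shifts. A minimal sketch, assuming the generic Linux _IOC bit layout used on x86_64 (8 bits number, 8 bits type, 14 bits size, 2 bits direction); decode_ioctl is a name made up for illustration.

```python
_DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}


def decode_ioctl(request):
    """Split an ioctl request number into direction, size, type and number."""
    return {
        "dir": _DIRECTIONS[(request >> 30) & 0x3],
        "size": (request >> 16) & 0x3FFF,
        "type": chr((request >> 8) & 0xFF),
        "nr": request & 0xFF,
    }


print(hex(3225962194), decode_ioctl(3225962194))
```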

(I'm also looking at what the nvidia drivers would need in order to actually work — the recent open source kernel driver release defines their ioctls (I don't know if those had any public docs before?), which helps — but I'll need to do more hardware rearranging to test it.)

Also, if the problem is just the Killing process. part, setting MOZ_SANDBOX_CRASH_ON_ERROR=0 in the environment is another workaround.

Crash Signature: [@ __GI___mknodat] [@ ioctl ] → [@ __GI___mknodat] [@ ioctl] [@ mknodat]

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)

I've tried and failed to reproduce this

I don't know what I changed that did it, but it reproduces now, and the patch from the Try run that I linked in the last comment does make the sandbox errors go away.

Assignee: nobody → jld

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #8)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #7)

I've tried and failed to reproduce this

I don't know what I changed that did it, but it reproduces now

I think what happened is that I booted a slightly older kernel (because of an unrelated issue), which meant that the nvidia kernel driver didn't load. That explains the mknod: the device node isn't there, and nvidia-modprobe can't run because we're not allowing either fork or exec. So that's definitely not testing the case from bug 1771898 comment 14, but the patch I have should work for that.

Thanks :jld! I'm unable to reproduce the crash and do not observe the Sandbox: seccomp sandbox violation... messages using your try build. It appears to fix the issue for me.

I can also confirm that the video plays fine on the Intel card, although I'm not sure how to confirm whether VAAPI is being used (debugInfo.decoder.reader.videoHardwareAccelerated in DevTools Media Panel is true, but that is also the case with media.ffmpeg.vaapi.enabled:false so I'm not sure what it means).

The video also plays fine on the Nvidia card with the nvidia driver in X (presumably without VAAPI, since about:support shows "unavailable by runtime: Requires EGL" for VAAPI). I'm unable to test the nvidia driver in Wayland. (Does nvidia-legacy-390xx support EGL on Wayland? I get eglinfo: eglInitialize failed)

The video also plays fine on the Nvidia card with nouveau if VAAPI is disabled. If VAAPI is enabled, the video is garbled, as always (mesa/mesa#3899).

Let me know if there's anything specific you'd like me to test.

The try build doesn't work for me because of the libEGL warning: egl: failed to create dri2 screen error described in bug 1771898.

The try build from comment 7 (Mon, 13 Jun 2022 19:28:21 +0000 (37 hours ago)) does not contain bug 1773377 (Tue, 14 Jun 2022 03:47:07 +0000 (29 hours ago)).

On multi-GPU systems, even though the GPU we're going to use for
accelerated video decoding is driven by Mesa, sometimes the nvidia
proprietary driver can be loaded and attempt to probe devices. This
patch attempts to make the sandbox policy quietly return errors for
those syscalls, instead of treating them as unexpected (and crashing on
Nightly).
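
As a rough mental model only (the real policy is C++ seccomp-bpf code in the sandbox, which is not reproduced here), the change amounts to adding quiet error rules for the probing syscalls. In the toy sketch below, the function name, the action tuples and the specific errno values are all assumptions chosen for illustration.

```python
import errno

SYS_IOCTL = 16     # x86_64
SYS_MKNODAT = 259  # x86_64


def toy_rdd_decision(syscall, args):
    """Toy model: ('errno', code) for the nvidia probes, else ('trap', None)."""
    if syscall == SYS_MKNODAT:
        return ("errno", errno.EPERM)   # assumed errno value
    if syscall == SYS_IOCTL and (args[1] >> 8) & 0xFF == ord("F"):
        return ("errno", errno.ENOTTY)  # assumed errno value
    # Stand-in for "treated as unexpected"; the real policy has many more rules.
    return ("trap", None)


print(toy_rdd_decision(259, [4294967196, 0, 8630, 50175]))
print(toy_rdd_decision(16, [26, 3225962194]))
```

The point of the model is that the driver's probe sees an ordinary failure and can give up, so the dispatcher moves on to Mesa instead of the process being killed.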

Crash Signature: [@ __GI___mknodat] [@ ioctl] [@ mknodat] → [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat]
Crash Signature: [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] → [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>]
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2e18d27a4d70 Adjust the Linux RDD sandbox to handle the nvidia driver being loaded but not used. r=gcp
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 103 Branch

Copying crash signatures from duplicate bugs.

Crash Signature: [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>] → [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>] [@ __GI___ioctl] [@ __ioctl]

Copying crash signatures from duplicate bugs.

Crash Signature: [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>] [@ __GI___ioctl] [@ __ioctl] → [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>] [@ __GI___ioctl] [@ __ioctl] [@ __assert_fail_base | <.text ELF section in libnvidia-glsi.so.510.68.02>]
Crash Signature: [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>] [@ __GI___ioctl] [@ __ioctl] [@ __assert_fail_base | <.text ELF section in libnvidia-glsi.so.510.68.02>] → [@ __GI___mknodat] [@ ioctl] [@ mknodat] [@ __mknodat] [@ <.text ELF section in libnvidia-glsi.so.515.48.07>] [@ __GI___ioctl] [@ __ioctl] [@ __assert_fail_base | <.text ELF section in libnvidia-glsi.so.510.68.02>]