Regression: WebRender support on Nvidia Binary
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox89 | --- | unaffected |
firefox90 | --- | fixed |
firefox91 | --- | fixed |
People
(Reporter: Vash63, Assigned: rmader)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression)
Attachments
(4 files: three text/plain and one text/x-phabricator-request, all deleted; the Phabricator request carries jcristau: approval-mozilla-beta+)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Steps to reproduce:
Updated Nightly. Noticed I was in legacy GL (my default profile has layers.acceleration.force-enabled on) and getting graphical issues.
Actual results:
Checked about:support and confirmed that WebRender was disabled due to glxtest process failure in failure log.
Expected results:
Firefox should have launched with WebRender
Additional info that didn't fit in the template:
I ran mozregression and it gave me: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=42d5dd452cd79ed9aef39ba6a429d839f0344887&tochange=cbbccab13f240baabf0fe2678b4f4b683fd21e5f
Between the two this is probably due to: https://bugzilla.mozilla.org/show_bug.cgi?id=1714897
Also, these are the exact same symptoms I had about a month ago, which were fixed in either https://bugzilla.mozilla.org/show_bug.cgi?id=1706762 or https://bugzilla.mozilla.org/show_bug.cgi?id=1706452
Comment 2•3 years ago
The Bugbug bot thinks this bug should belong to the 'Core::Graphics: WebRender' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Comment 3•3 years ago
Thanks for the bug report and regression range Vash63. Could you please attach your about:support information to this bug?
Glenn, any ideas what could have caused this?
About:support from the latest nightly attached. It notes "glxtest: process failed (received signal 11)" which was the same application updated in https://phabricator.services.mozilla.com/D116957
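For readers unfamiliar with the two failure strings that show up in this bug ("received signal 11" here, "exited with status 1" later in the thread), the following is a minimal sketch of how a parent process might tell a crashed probe from one that failed cleanly. It is not the Firefox implementation; `run_probe`, `crashing_probe` and `failing_probe` are hypothetical names, and the message wording is only modeled on the failure log quoted in this bug.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe that crashes, the way glxtest did in this report. */
static void crashing_probe(void) {
    signal(SIGSEGV, SIG_DFL); /* make sure the default action applies */
    raise(SIGSEGV);
}

/* Hypothetical probe that fails without crashing. */
static void failing_probe(void) {
    _exit(1);
}

/*
 * Run probe() in a child process and describe how the child ended.
 * Returns 0 if the child exited with status 0, -1 otherwise.
 */
static int run_probe(void (*probe)(void), char *out, size_t out_len) {
    assert(probe != NULL && out != NULL && out_len > 0);

    pid_t pid = fork();
    if (pid < 0) {
        snprintf(out, out_len, "fork failed");
        return -1;
    }
    if (pid == 0) {
        probe();
        _exit(0); /* only reached if the probe returned normally */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        snprintf(out, out_len, "waitpid failed");
        return -1;
    }
    if (WIFSIGNALED(status)) {
        /* Child was killed by a signal, e.g. SIGSEGV. */
        snprintf(out, out_len, "process failed (received signal %d)",
                 WTERMSIG(status));
        return -1;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
        /* Child ran to completion but reported an error. */
        snprintf(out, out_len, "process failed (exited with status %d)",
                 WEXITSTATUS(status));
        return -1;
    }
    snprintf(out, out_len, "process succeeded");
    return 0;
}
```

Calling `run_probe(crashing_probe, msg, sizeof msg)` fills `msg` with the signal variant; SIGSEGV is signal 11 on Linux, which matches the number in the log above.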
Comment 5•3 years ago
FWIW here's my about:support, on linux with nvidia driver version 460.73.01, NVIDIA Corporation GP107GL [Quadro P400] [10de:1cb3]
Comment 6•3 years ago
Comment 7•3 years ago
There are two things that patch does: it passes the current X display to eglInitialize, which is what fixed the glxtest code on amdgpu, and it moves the installation of the X error handler slightly earlier.
I wouldn't expect either of those to crash the nvidia proprietary driver, but it sounds like one of them does :|
Robert, have you got any ideas what we should do here?
Comment 8•3 years ago
aosmond, stransky, do either of you have a machine with an nvidia binary driver set up that you could test this on?
Comment 9•3 years ago
I have NVIDIA card available so I can set it up for testing. Which patch do you mean?
Comment 10•3 years ago
Thanks Martin. It's already landed in m-c (https://phabricator.services.mozilla.com/D116957), so I guess just testing to see if glxtest is crashing on your machine (I suspect it will only reproduce in an X11 session, but it might be worth testing on all X/Wayland combinations).
Comment 11•3 years ago
Looks like it's crashing in XCloseDisplay.
(gdb) next
852 XCloseDisplay(dpy);
(gdb) p glxtest_buf
$9 = 0x7fffffffc1d0 "PCI_VENDOR_ID\n0x10de\nPCI_DEVICE_ID\n0x1cb3\nVENDOR\nNVIDIA Corporation\nRENDERER\nQuadro P400/PCIe/SSE2\nVERSION\n4.6.0 NVIDIA 460.73.01\nTFP\nTRUE\nMESA_ACCELERATED\nTRUE\nSCREEN_INFO\n3840x2160:1;\n"
(gdb) next
Thread 2.1 "firefox" received signal SIGSEGV, Segmentation fault.
0x00007fffe960f580 in ?? ()
(gdb) bt
#0 0x00007fffe960f580 in ()
#1 0x00007ffff6282ba2 in XCloseDisplay (dpy=0x7ffff78ca000) at ../../src/ClDisplay.c:65
#2 0x00007ffff1f19d6d in x11_egltest(int) (pci_count=<optimized out>) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:852
#3 childgltest() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1222
#4 0x00007ffff1f1a407 in fire_glxtest_process() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1261
#5 0x00007ffff1f0e752 in XREMain::XRE_mainInit(bool*) (this=<optimized out>, this@entry=0x7fffffffccb0, aExitFlag=aExitFlag@entry=0x7fffffffcc37)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:3614
#6 0x00007ffff1f156db in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)
(this=this@entry=0x7fffffffccb0, argc=argc@entry=4, argv=argv@entry=0x7fffffffdf58, aConfig=...)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5411
#7 0x00007ffff1f15ada in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=-142277752, argv=0x3, aConfig=...)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5496
#8 0x000055555557bdc4 in do_main(int, char**, char**) (argc=-142277752, argv=0x7fffffffdf58, envp=<optimized out>)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:224
#9 main(int, char**, char**) (argc=4, argv=<optimized out>, envp=<optimized out>) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:351
XCloseDisplay is calling the NV-GLX close_display callback:
(gdb) up
#1 0x00007ffff6282ba2 in XCloseDisplay (dpy=0x7ffff78ca000) at ../../src/ClDisplay.c:65
65 ../../src/ClDisplay.c: No such file or directory.
(gdb) p *ext
$11 = {next = 0x7ffff784f480, codes = {extension = 1, major_opcode = 155, first_event = 0, first_error = 0}, create_GC = 0x0, copy_GC = 0x0, flush_GC = 0x0, free_GC = 0x0,
create_Font = 0x0, free_Font = 0x0, close_display = 0x7fffe960f580, error = 0x0, error_string = 0x0, name = 0x7ffff781b178 "NV-GLX", error_values = 0x0,
before_flush = 0x0, next_flush = 0x0}
but it looks like nv-glx has been unloaded before then?
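The `glxtest_buf` string printed in the gdb session above looks like a sequence of alternating key and value lines. Assuming that layout (it is inferred from this one dump, not from the glxtest source), a small lookup helper could be sketched as follows; `glxtest_lookup` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/*
 * Look up `key` in a buffer of alternating "KEY\nVALUE\n" lines and copy
 * the matching value into `out` (truncated to fit). Returns 1 if the key
 * was found, 0 otherwise. Assumes keys and values strictly alternate.
 */
static int glxtest_lookup(const char *buf, const char *key,
                          char *out, size_t out_len) {
    assert(buf != NULL && key != NULL && out != NULL && out_len > 0);

    size_t key_len = strlen(key);
    const char *p = buf;

    while (*p != '\0') {
        const char *key_end = strchr(p, '\n');
        if (key_end == NULL) {
            break; /* trailing key without a value */
        }
        const char *val = key_end + 1;
        const char *val_end = strchr(val, '\n');
        size_t val_len = val_end ? (size_t)(val_end - val) : strlen(val);

        if ((size_t)(key_end - p) == key_len &&
            strncmp(p, key, key_len) == 0) {
            if (val_len >= out_len) {
                val_len = out_len - 1;
            }
            memcpy(out, val, val_len);
            out[val_len] = '\0';
            return 1;
        }
        if (val_end == NULL) {
            break; /* last value had no terminating newline */
        }
        p = val_end + 1; /* skip to the next key */
    }
    return 0;
}
```

With the buffer from the dump, looking up `PCI_VENDOR_ID` yields `0x10de` and `VERSION` yields `4.6.0 NVIDIA 460.73.01`. The point is that the probe had already collected complete data by the time it reached `XCloseDisplay`; only the teardown crashed.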
Comment 12•3 years ago
Set release status flags based on info from the regressing bug 1714897
Assignee
Comment 13•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #11)
Looks like it's crashing in XCloseDisplay.
...
but it looks like nv-glx has been unloaded before then?
Ah, this kinda makes sense. Looking at get_glx_status() [1], it stands out that libgl is opened before and closed after the X11 Display. Now I wonder if this is a bug in the nvidia driver and how we best work around it. We could probably shuffle things around a bit, however it looks like on Wayland it's required to first open the display connection. Another option could be to just not close the display connection - the process will exit directly after anyways.
1: https://searchfox.org/mozilla-central/source/toolkit/xre/glxtest.cpp#652-835
Assignee
Comment 14•3 years ago
FTR, I don't think NV-GLX has any business being there as that's a pure EGL code path :/
Comment 15•3 years ago
Commenting out the call to XCloseDisplay gives me WebRender back (on the second run).
Assignee
Comment 16•3 years ago
Closing the X11 connection is buggy on NVIDIA proprietary drivers. Leave it open, the process will exit anyways.
Assignee
Comment 17•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #15)
Commenting out the call to XCloseDisplay gives me WebRender back (on the second run).
Thanks, was about to ask you. Yeah, if glxtest fails it usually takes two restarts to clean up the blocklist entries or so. So well, I guess let's go with this ugly workaround then. It works well here as well, both on X and Wayland.
Comment 18•3 years ago
(In reply to Robert Mader [:rmader] from comment #14)
FTR, I don't think NV-GLX has any business being there as that's a pure EGL code path :/
FWIW here's where the NV-GLX close_display hook seems to get added:
Thread 2.1 "firefox" hit Breakpoint 2, XESetCloseDisplay (dpy=0x7ffff78ca000, extension=1, proc=0x7fffe960f580) at ../../src/InitExt.c:93
93 in ../../src/InitExt.c
(gdb) bt
#0 XESetCloseDisplay (dpy=0x7ffff78ca000, extension=1, proc=0x7fffe960f580) at ../../src/InitExt.c:93
#1 0x00007fffe960f7fa in () at /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
#2 0x00007fffe96020b8 in () at /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
#3 0x00007fffe9602245 in () at /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
#4 0x00007fffe9888c73 in () at /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#5 0x00007fffe9888da1 in () at /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#6 0x00007fffe98a10a0 in () at /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#7 0x00007fffe9c5698d in () at /usr/lib/x86_64-linux-gnu/libEGL.so.1
#8 0x00007ffff1f1dea7 in get_egl_status(void*, bool, bool) (native_dpy=native_dpy@entry=0x7ffff78ca000, gles_test=false, require_driver=false)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:589
#9 0x00007ffff1f19cd5 in x11_egltest(int) (pci_count=1) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:846
#10 childgltest() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1225
#11 0x00007ffff1f1a417 in fire_glxtest_process() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1264
#12 0x00007ffff1f0e752 in XREMain::XRE_mainInit(bool*) (this=<optimized out>, this@entry=0x7fffffffccb0, aExitFlag=aExitFlag@entry=0x7fffffffcc37)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:3614
#13 0x00007ffff1f156db in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)
(this=this@entry=0x7fffffffccb0, argc=argc@entry=4, argv=argv@entry=0x7fffffffdf58, aConfig=...)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5411
#14 0x00007ffff1f15ada in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=1, argv=0x7fffe960f580, aConfig=...)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5496
#15 0x000055555557bdc4 in do_main(int, char**, char**) (argc=1, argv=0x7fffffffdf58, envp=<optimized out>)
at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:224
#16 main(int, char**, char**) (argc=4, argv=<optimized out>, envp=<optimized out>) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:351
I haven't tracked down where libnvidia-glsi.so is unloaded yet.
Comment 19•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #18)
I haven't tracked down where libnvidia-glsi.so is unloaded yet.
It's from dlclose(libegl); I guess eglTerminate doesn't clean up the display hooks properly?
Assignee
Comment 20•3 years ago
I'm inclined to write this off as another pain to suffer for being one of the first big applications to go full EGL on X11. The good news is that our work has already motivated fixes in mesa and Xwayland, so that GTK4 was able to make the transition more easily [1]. Another one is obs-studio, which recently landed similar work (but didn't enable it by default yet) [2].
So, well, even after all these years things won't be painless. But we're at least not alone in this struggle :)
1: https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/3540
2: https://github.com/obsproject/obs-studio/pull/2478 / https://github.com/obsproject/obs-studio/pull/2484
Comment 21•3 years ago
Comment 22•3 years ago
bugherder
Comment 23•3 years ago
If there is a suspicion of a NVIDIA driver bug, I'd like to take a look. But I do not fully understand from the discussion here what the bug is thought to be.
Is the "glxtest" tool available standalone so I can observe the problem locally?
Thank you
Assignee
Comment 24•3 years ago
(In reply to Arthur Huillet from comment #23)
If there is a suspicion of a NVIDIA driver bug, I'd like to take a look. But I do not fully understand from the discussion here what the bug is thought to be.
Is the "glxtest" tool available standalone so I can observe the problem locally?
Thank you
Thanks, that would be great. I don't think glxtest can be run standalone - Julien, do you know more maybe?
You should be able to observe the issue in a build from yesterday, using mozregression - I can also provide you with one, if that helps.
Assignee
Comment 25•3 years ago
P.S.: The relevant lines are those at https://searchfox.org/mozilla-central/source/toolkit/xre/glxtest.cpp#837-855.
What happens is:
- we open an X11 Display
- we initialize EGL and GL, do some stuff (https://searchfox.org/mozilla-central/source/toolkit/xre/glxtest.cpp#553-632)
- we terminate EGL
- we close the Display
- the Nvidia driver crashes in something related to NV-GLX (comment 11)
Comment 26•3 years ago
Thanks. I'd need to observe the crash locally to determine what, if anything, the NVIDIA driver is doing wrong. There have been bugs found and fixed recently related to EGL resource lifecycle management.
Can you please help me observe it locally? I've never run mozregression, or really anything Mozilla/Firefox that wasn't a distro package in the past.
Comment 27•3 years ago
I wonder if the more proper fix might be to avoid calling dlclose(libegl) at least while the display is open. On the assumption that after we've called eglGetDisplay(native_dpy) it's not safe to unload libEGL (and its dependencies) while the display is alive?
Comment 28•3 years ago
There's uplift requests in flight for this and related bugs, updating status flags.
Assignee
Comment 29•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #27)
I wonder if the more proper fix might be to avoid calling dlclose(libegl) at least while the display is open. On the assumption that after we've called eglGetDisplay(native_dpy) it's not safe to unload libEGL (and its dependencies) while the display is alive?
Well, we call eglTerminate(dpy) on the EGLDisplay we created via eglGetDisplay(). Unless we forgot to release everything properly (looking at eglMakeCurrent right now), it should be safe to close the Display, no?
Comment 30•3 years ago
Robert: Xlib doesn't really expect that, as far as I can tell. If any of the libs pulled in by libEGL registers an extension (with XextAddDisplay), then there doesn't seem to be a reasonable way for it to clean up. In this case nvidia's eglGetDisplay adds the NV-GLX extension, and libXext itself adds Generic Event Extension, both of which register close_display callbacks.
Arthur: here's what glxtest is doing, as far as I can tell (reduced to cause the crash; link with -ldl -lX11):
#include <dlfcn.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <X11/Xlib.h>
#include <EGL/egl.h>

#define LIBEGL_FILENAME "libEGL.so.1"

static bool get_egl_status(EGLNativeDisplayType native_dpy) {
  void* libegl = dlopen(LIBEGL_FILENAME, RTLD_LAZY);
  if (!libegl) {
    return false;
  }

  PFNEGLGETPROCADDRESSPROC eglGetProcAddress =
      (PFNEGLGETPROCADDRESSPROC)(dlsym(libegl, "eglGetProcAddress"));
  if (!eglGetProcAddress) {
    dlclose(libegl);
    return false;
  }

  PFNEGLGETDISPLAYPROC eglGetDisplay =
      (PFNEGLGETDISPLAYPROC)(eglGetProcAddress("eglGetDisplay"));
  PFNEGLTERMINATEPROC eglTerminate =
      (PFNEGLTERMINATEPROC)(eglGetProcAddress("eglTerminate"));
  if (!eglGetDisplay || !eglTerminate) {
    dlclose(libegl);
    return false;
  }

  /* On the nvidia driver this registers a close_display hook on native_dpy. */
  EGLDisplay dpy = eglGetDisplay(native_dpy);
  if (!dpy) {
    dlclose(libegl);
    return false;
  }

  eglTerminate(dpy);
  /* Unloads libEGL and its dependencies while the X display is still open. */
  dlclose(libegl);
  return true;
}

int main() {
  Display* dpy = XOpenDisplay(NULL);
  if (!dpy) {
    return 1;
  }
  if (!get_egl_status(dpy)) {
    return 1;
  }
  /* Crashes here: the close_display hook points into unloaded code. */
  XCloseDisplay(dpy);
  return 0;
}
Comment 31•3 years ago
(In reply to Arthur Huillet from comment #26)
Can you please help me observe it locally? I've never run mozregression, or really anything Mozilla/Firefox that wasn't a distro package in the past.
$ pip3 install --user mozregression
Bad build from comment 5: $ mozregression --launch 20210608091819 -a about:support
- Compositing: WebRender (Software)
- "WebGL creation failed"
- Failure Log:
(#0) Error: No GPUs detected via PCI
(#1) Error: glxtest: process failed (exited with status 1)
Good build from comment 6: $ mozregression --launch 2021-06-01 -a about:support
- Compositing: WebRender
- no WebGL error
- no failures
Assignee
Comment 32•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #30)
Robert: Xlib doesn't really expect that, as far as I can tell. If any of the libs pulled in by libEGL registers an extension (with XextAddDisplay), then there doesn't seem to be a reasonable way for it to clean up. In this case nvidia's eglGetDisplay adds the NV-GLX extension, and libXext itself adds Generic Event Extension, both of which register close_display callbacks.
Well, I don't really see an advantage of trading dlclose(libegl) against XCloseDisplay(dpy) - skipping either of them is somewhat dirty but also doesn't really matter, as the process will exit within milliseconds anyway. Do you have a preference?
In any case, AFAICS NV-GLX should simply not get loaded when initializing EGL. Looks to me like some unfortunate entanglement within the driver that should probably better be avoided.
Comment 33•3 years ago
Thank you Julien for sharing the standalone reproducer. I filed NVIDIA bug 200740810 for this to be investigated. Sadly I am not personally qualified in EGL to say who's wrong here and why.
Comment 34•3 years ago
Comment on attachment 9226143 [details]
Bug 1715245 - Leave X11 connection open, r=aosmond
approved for 90.0b7 to avoid a regression
Comment 35•3 years ago
bugherder uplift
Comment 36•3 years ago
Confirmed NVIDIA driver bug, we're looking into it. A workaround we can suggest is to XCloseDisplay before the dlclose.
Comment 37•3 years ago
Aside from the NVIDIA driver itself, the same bug is also in libXext.so for the generic event extension:
https://gitlab.freedesktop.org/xorg/lib/libxext/-/issues/3
Even after the NVIDIA driver is fixed, the test program in comment #30 would still crash, because when it unloads libEGL.so, that would also unload libXext.so.
The cleanest workaround I can think of would be to call XCloseDisplay first, before you unload libEGL.so. The callbacks are all per-display, so closing the display first ensures that they get cleared out without having to leak anything.
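To make the ordering argument concrete without needing an X server or a GPU driver, here is a toy model in plain C. It is only a sketch: nothing in it is Xlib or EGL, all `toy_*` names are hypothetical, and a return value of -1 stands in for the point where the real `XCloseDisplay` would jump into unmapped code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a shared library that owns a close_display hook. */
typedef struct toy_library {
    int loaded;    /* 1 while the library's code is still mapped */
    int hook_runs; /* how many times its close hook ran safely */
} toy_library;

/* Toy stand-in for an X display with at most one registered hook. */
typedef struct toy_display {
    toy_library *hook_owner; /* library whose hook is registered, or NULL */
    int open;
} toy_display;

/* Models eglGetDisplay() registering a per-display close hook. */
static void toy_register_hook(toy_display *dpy, toy_library *lib) {
    assert(dpy != NULL && lib != NULL && lib->loaded);
    dpy->hook_owner = lib;
}

/* Models dlclose(): the code goes away, the registration does not. */
static void toy_unload(toy_library *lib) {
    lib->loaded = 0;
}

/*
 * Models XCloseDisplay(): runs the registered hook, then clears it.
 * Returns 0 on success, -1 if the hook's library was already unloaded.
 */
static int toy_close_display(toy_display *dpy) {
    int rc = 0;
    if (dpy->hook_owner != NULL) {
        if (dpy->hook_owner->loaded) {
            dpy->hook_owner->hook_runs++;
        } else {
            rc = -1; /* dangling hook: the real call would crash here */
        }
        dpy->hook_owner = NULL;
    }
    dpy->open = 0;
    return rc;
}

/* Order used by the reduced test program: unload, then close. */
static int teardown_unload_first(void) {
    toy_library lib = {1, 0};
    toy_display dpy = {NULL, 1};
    toy_register_hook(&dpy, &lib);
    toy_unload(&lib);
    return toy_close_display(&dpy);
}

/* Order suggested as the workaround: close, then unload. */
static int teardown_close_first(void) {
    toy_library lib = {1, 0};
    toy_display dpy = {NULL, 1};
    toy_register_hook(&dpy, &lib);
    int rc = toy_close_display(&dpy);
    toy_unload(&lib);
    return rc;
}
```

In this model `teardown_unload_first()` reports the dangling hook while `teardown_close_first()` succeeds, which mirrors the suggestion to close the display before unloading libEGL.so.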