Closed Bug 1787182 Opened 2 years ago Closed 1 year ago

Don't fire glxtest process for remote instances (VAAPI test sometimes hangs Firefox on AMD. bug 1813500 hotfix lets users continue but with disabled hardware rendering.)

Categories

(Core :: Widget: Gtk, defect, P3)

Firefox 104
Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
114 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox104 --- wontfix
firefox112 --- wontfix
firefox113 --- wontfix
firefox114 --- fixed

People

(Reporter: o.freyermuth, Assigned: stransky)

References

(Blocks 3 open bugs, Regression)

Details

(Keywords: hang, regression)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0

Steps to reproduce:

Start Firefox, message:

Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: VA-API test failed: process crashed. Please check your VA-API drivers. (t=0.447447) [GFX1-]: glxtest: VA-API test failed: process crashed. Please check your VA-API drivers.

is shown. A coredump is saved, but Firefox functions normally.

Each time I open any link on my system (which in turn calls Firefox and opens the page), a new coredump is saved, leading to many dozens of coredumps each day.

This issue is a side-effect of #1758473 which reruns the test on each Firefox invocation, causing a coredump on systems with affected drivers every time Firefox is started or any URL is opened (filling up journals / hard drives).

My affected system is a Gentoo Linux x86_64, other users on Debian seem affected, too:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1017414

Actual results:

Coredump due to failure of glxtest.

Trace found in coredumps:

#0  XDisplayString (dpy=0x0) at /usr/src/debug/x11-libs/libX11-1.8.1/libX11-1.8.1/src/Macros.c:119
#1  0x00007fccbf6a5fc5 in vdpau_common_Initialize (driver_data=0x7fccbd5df200)
    at /usr/src/debug/x11-libs/libva-vdpau-driver-0.7.4-r5/libva-vdpau-driver-0.7.4/src/vdpau_driver.c:188
#2  vdpau_Initialize_Current (ctx=0x7fcccc12fc40) at /usr/src/debug/x11-libs/libva-vdpau-driver-0.7.4-r5/libva-vdpau-driver-0.7.4/src/vdpau_driver_template.h:561
#3  __vaDriverInit_1_15 (ctx=0x7fcccc12fc40) at /usr/src/debug/x11-libs/libva-vdpau-driver-0.7.4-r5/libva-vdpau-driver-0.7.4/src/vdpau_driver.c:317
#4  0x00007fccbf6bce4c in  () at /usr/lib64/libva.so.2
#5  0x00007fccbf6bdfc6 in vaInitialize () at /usr/lib64/libva.so.2
#6  0x00007fccc78b1d6f in  () at /usr/lib64/firefox/libxul.so
#7  0x00007fccc78b2988 in  () at /usr/lib64/firefox/libxul.so
#8  0x00007fccc78b2a66 in  () at /usr/lib64/firefox/libxul.so
#9  0x00007fccc78a8043 in  () at /usr/lib64/firefox/libxul.so
#10 0x00007fccc78ae846 in  () at /usr/lib64/firefox/libxul.so
#11 0x00007fccc78aecaa in  () at /usr/lib64/firefox/libxul.so
#12 0x000056302fc4d473 in  ()
#13 0x00007fcccc47534a in __libc_start_call_main (main=main@entry=0x56302fc4d0b0, argc=argc@entry=2, argv=argv@entry=0x7ffd51e764f8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#14 0x00007fcccc4753fc in __libc_start_main_impl
     (main=0x56302fc4d0b0, argc=2, argv=0x7ffd51e764f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd51e764e8)
    at ../csu/libc-start.c:389
#15 0x000056302fc4cf51 in _start ()

Expected results:

Do not coredump on each start.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core

Please uninstall deprecated libva-vdpau-driver: It doesn't support vaGetDisplayDRM and crashes.

(Darkspirit from bug 1758473 comment #9)

It might be this one: https://salsa.debian.org/multimedia-team/attic/vdpau-video/-/blob/63450ffea86143d418c6e83cb8d2828d3a7beb25/src/vdpau_driver.c#L188

const char * const x11_dpy_name = XDisplayString(driver_data->x11_dpy);

https://bugs.archlinux.org/task/72241#comments

vaGetDisplayDRM() doesn't fill ->x11_dpy
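
To make the failure mode concrete, here is a minimal sketch of how an initialization path could reject a context that has no X11 display instead of dereferencing a null pointer. All names in it (FakeDisplay, DriverData, DisplayString, CommonInitialize) are hypothetical stand-ins invented for illustration; they are not the driver's real types, and this is not the actual driver fix.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an X11 display handle.
struct FakeDisplay {
  std::string name;
};

// Hypothetical stand-in for the driver's per-context data.
struct DriverData {
  FakeDisplay* x11_dpy = nullptr;  // stays null when the display came from DRM
};

// Stand-in for XDisplayString(): like the real call in the trace above,
// it reads through its argument, so a null display must never reach it.
const char* DisplayString(const FakeDisplay* dpy) {
  assert(dpy != nullptr && "caller must check for a null display first");
  return dpy->name.c_str();
}

// Returns false instead of crashing when no X11 display is attached.
bool CommonInitialize(const DriverData& data, std::string* out_name) {
  if (data.x11_dpy == nullptr) {
    return false;  // DRM-only context: report failure to the caller
  }
  *out_name = DisplayString(data.x11_dpy);
  return true;
}
```

With a guard like this, a DRM-only context would make initialization fail cleanly, and the caller could report an unsupported driver rather than producing a coredump.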

VAAPI should be blocked for vdpau_drv_video.so.
vdpau_drv_video.so is deprecated and has been removed from Debian.
Debian Buster (oldstable) was the last release that had a package for it: https://packages.debian.org/oldstable/vdpau-va-driver
https://tracker.debian.org/pkg/vdpau-video
https://salsa.debian.org/multimedia-team/attic/vdpau-video

(In reply to Oliver Freyermuth from comment #0)

other users on Debian seem affected, too:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1017414

That one has a different problem, it uses experimental nvidia_drv_video.so.

(Darkspirit from bug 1782408 comment 1)

https://github.com/elFarto/nvidia-vaapi-driver requires CUDA which doesn't save you power and is blocked by the media process sandbox. Allowing CUDA would require further media process sandbox exceptions/holes.

https://github.com/elFarto/nvidia-vaapi-driver/issues/74#issuecomment-1100918826

It's a driver issue. As soon as you initialise cuda, it forces the GPU into a higher power state (at least P2 IIRC), and the nvdec implementation forces you to use cuda to interact with it, so this always happens. It means that using nvdec will never save you power unless you were going to use cuda anyway.

(In reply to Darkspirit from comment #2)

Please uninstall deprecated libva-vdpau-driver: It doesn't support vaGetDisplayDRM and crashes.

Thanks, indeed that fixes it for me. Gentoo currently still has libva-vdpau-driver in stable and it's used as dependency of the latest Kodi release (unless VDPAU support is disabled there). I'll also raise a bug in the Gentoo bugtracker then so the issue with / deprecation of libva-vdpau-driver gets known and this can be reconsidered.

Taking into account that two users are affected by the segfaults: Do you consider this "ok" (since it's a driver bug causing the segfault, in the end), or should this bug be kept open to improve on existing glxtest behaviour (e.g. blacklist "harder" / "earlier" so the coredump / segmentation fault is not triggered each time)?

For reference, in case a Gentoo user ends up here, the downstream issue I reported is at:
https://bugs.gentoo.org/866557

The bug has a release status flag that shows some version of Firefox is affected, thus it will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

(In reply to Oliver Freyermuth from comment #4)

Taking into account that two users are affected by the segfaults: Do you consider this "ok" (since it's a driver bug causing the segfault, in the end), or should this bug be kept open to improve on existing glxtest behaviour (e.g. blacklist "harder" / "earlier" so the coredump / segmentation fault is not triggered each time)?

IMHO the warning "Please check your VA-API drivers." should be enough to point affected users to the fact that they are using broken drivers :P Implementing further detection risks overblocking in case someone fixes the bug (which shouldn't be hard by the way).

For reference, in case a Gentoo user ends up here, the downstream issue I reported is at:
https://bugs.gentoo.org/866557

Thanks. For distros shipping this driver by default I think it's reasonable to expect them to

  1. remove it from the default packages, as it's unmaintained
  2. ship a downstream patch for the issue, i.e. maintain it themselves

Oliver: on second thought, I wonder whether we could just do better with the warning. So instead of

Please check your VA-API drivers.

have something more verbose, for example

This was likely caused by outdated or unmaintained drivers installed on your system. You can check the driver by running vainfo - please consider updating or uninstalling the driver.

Do you think that would have helped you/would make things clear enough?

Flags: needinfo?(o.freyermuth)

Thanks for reaching out, Robert!
In fact, I think the proposed more verbose warning would have helped to identify the problematic library faster.

Additionally, though, it took me more time than expected to even see the warning, since it is not always visible:
The coredumps are produced on each invocation of the firefox binary, e.g. handling URLs. However, if one Firefox session is running (e.g. started via GUI) and a URL is opened, while the coredump is still produced, the error message is not shown in the new invocation of firefox (since it is echoed by the original process, started via GUI). So I only really became aware of the message once I closed all Firefox processes and started from scratch.

So my proposal would be (if that can be implemented) to run the glxtests only when the main process starts, and not when handling URLs. This would have several benefits:

  • This approach does not overblock in case the bug is fixed (a very valid point).
  • It prevents filling up hard drives with coredumps. Regular users will not investigate issues in depth, but will blame Firefox for being unstable and coredumping all the time, even though Firefox itself is not at fault.
  • If a user observes the coredump and wants to find the source, he/she will have to restart the main process — which is the process which will print the error message.

This combined with the clearer error message would certainly have made things clear enough for me (and hopefully also for others).

Further ideas to reach even more users:

  • Show the error highlighted on top of about:support.
  • Add some graphical alerting the first time an error with glxtests is encountered (and refer to about:support).

This is just a collection of ideas from a user point of view to improve the visibility of the issue and allow them to fix it more easily.

Flags: needinfo?(o.freyermuth)

Thanks for the input!

So my proposal would be (if that can be implemented) to run the glxtests only when the main process starts, and not when handling URLs. This would have several benefits:

Urgh, I didn't know we do that. It doesn't make sense and we should definitely look into stopping it.

Duplicate of this bug: 1805468
Summary: Coredump with nvidia drivers during each glxtest invocation → Don't fire glxtest process for remote instances

Downstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=2147344
When the glxtest process hangs (rather than crashes), it blocks every Firefox start (even a remote one).

Assignee: nobody → stransky
Status: NEW → ASSIGNED

Right now we fire glxtest on every Firefox start, even if we are going to update, restart or ping a running remote instance.
When we're running on a system with broken/unstable gfx drivers (the drivers/glx freeze or crash), every such action is delayed or generates a coredump.

In this patch we launch the glxtest process later, and only if we know we need it.
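
The idea of deferring the test until it is known to be needed can be sketched as follows. This is an illustration only: StartupKind, NeedsGfxTest and GfxTestLauncher are hypothetical names, and the real Firefox startup code is organised differently.

```cpp
#include <cassert>

// Hypothetical startup modes, mirroring the cases listed above.
enum class StartupKind {
  kNewInstance,  // becomes the main browser process
  kRemotePing,   // hands a URL to an already running instance, then exits
  kApplyUpdate,  // applies a pending update
  kRestart,      // re-executes itself
};

// Only a process that will actually render needs the graphics test.
bool NeedsGfxTest(StartupKind kind) {
  switch (kind) {
    case StartupKind::kNewInstance:
      return true;
    case StartupKind::kRemotePing:
    case StartupKind::kApplyUpdate:
    case StartupKind::kRestart:
      return false;
  }
  assert(false && "unhandled StartupKind");
  return false;
}

// Counts launches instead of spawning a process, to keep the sketch runnable.
struct GfxTestLauncher {
  int launches = 0;
  void OnStartup(StartupKind kind) {
    if (NeedsGfxTest(kind)) {
      ++launches;  // a real implementation would start the test here
    }
  }
};
```

Under this scheme, opening a link while Firefox is already running takes the kRemotePing path and never touches the graphics drivers, so a crashing or hanging driver would affect only a fresh start.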

Depends on D168650

Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/07f14a03f7f5
[Linux] Use MOZ_GFX_DEBUG_FILE to dump output of glxtest process r=emilio
https://hg.mozilla.org/integration/autoland/rev/dca76d15b50b
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio

Backed out 2 changesets (Bug 1787182) for valgrind-test bustages and wr and bc failures.
Backout link
Push with failures <--> V-swr
Failure Log
Also Wr11
Also Wr2
Also bc25
Also wpt19
Also wpt27
Also wpt31

Flags: needinfo?(stransky)
Flags: needinfo?(stransky)
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/7373f1daf3e0
[Linux] Use MOZ_GFX_DEBUG_FILE to dump output of glxtest process r=emilio
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Priority: -- → P3
Target Milestone: 112 Branch → ---

We should not run the VA-API test as part of the OpenGL test, as we want to test VA-API on supported hardware only.

Depends on D168651

Depends on D171993

  • Implement fire_vaapi_process(), which launches the VA-API test utility on a given DRM device.
  • Implement GfxInfo::GetDataVAAPI(), which gets the VA-API test results.
  • Run VA-API tests when FEATURE_HARDWARE_VIDEO_DECODING is probed and only if it's enabled by GfxInfo.
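
The last point above (run the test only when the feature is probed, and only if it is enabled) amounts to a lazy, cached probe. A minimal sketch under stated assumptions: VideoDecodingProbe and its members are hypothetical names, and the callback stands in for launching the real test utility; this is not the GfxInfo implementation.

```cpp
#include <cassert>
#include <functional>
#include <utility>

class VideoDecodingProbe {
 public:
  // run_test is a placeholder for launching the external VA-API test.
  explicit VideoDecodingProbe(std::function<bool()> run_test)
      : run_test_(std::move(run_test)) {
    assert(run_test_ && "a test callback is required");
  }

  // feature_enabled models the blocklist decision made before any test runs.
  bool IsUsable(bool feature_enabled) {
    if (!feature_enabled) {
      return false;  // never launch the test on blocked hardware
    }
    if (!tested_) {
      result_ = run_test_();  // first probe: run the test once
      tested_ = true;
    }
    return result_;  // later probes reuse the cached result
  }

 private:
  std::function<bool()> run_test_;
  bool tested_ = false;
  bool result_ = false;
};
```

The point of the ordering is that a driver which crashes or hangs during the test is never exercised on hardware where the feature is already blocked.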

Depends on D171994

Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/e75ee46f3a6f
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/63710cc6fda4
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/41e45dcc215a
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/8e5214ecc91d
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert

glxtest is now run later, when Firefox has already spawned threads. Currently glxtest runs in a forked process,
which doesn't work correctly in a multi-threaded environment, so we need to move glxtest to a separate binary
and launch it as a standalone program.
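
As a rough illustration of launching a test as a separate program and collecting its output through a pipe, here is a POSIX sketch. RunProbe is a hypothetical helper, not Firefox code. It redirects the child's stdout into the pipe for simplicity, which is an assumption of this sketch and may differ from how the landed patches pass data.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs args[0] as a separate program and collects what it writes to stdout.
// Returns true only if the program could be run and exited with status 0.
bool RunProbe(const std::vector<std::string>& args, std::string* output) {
  assert(!args.empty() && output != nullptr);

  // Build argv before fork(): allocating memory in the child is risky
  // when the parent has other threads.
  std::vector<char*> argv;
  for (const std::string& arg : args) {
    argv.push_back(const_cast<char*>(arg.c_str()));
  }
  argv.push_back(nullptr);

  int fds[2];
  if (pipe(fds) != 0) return false;

  const pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }
  if (pid == 0) {
    // Child: do as little as possible before exec replaces the process image.
    dup2(fds[1], STDOUT_FILENO);
    close(fds[0]);
    close(fds[1]);
    execvp(argv[0], argv.data());
    _exit(127);  // only reached if exec failed
  }

  // Parent: close the write end so read() sees EOF when the child exits.
  close(fds[1]);
  output->clear();
  char buf[256];
  for (;;) {
    const ssize_t n = read(fds[0], buf, sizeof(buf));
    if (n > 0) {
      output->append(buf, static_cast<size_t>(n));
    } else if (n == 0 || errno != EINTR) {
      break;  // EOF or a real error
    }
  }
  close(fds[0]);

  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return false;
  }
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A call such as RunProbe({"echo", "ok"}, &out) would fill out with the child's output. Because the child immediately becomes a fresh program, a crash inside a graphics driver terminates only that helper, and the parent sees a failed probe rather than inheriting the fault. The sketch has no timeout, so a hanging helper would still block the caller.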

Depends on D171995

Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/cb7d83d2253b
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/3cd483e237a1
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/6a359af2ce38
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/0d097f0f6228
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/2fc6b3fd4365
[Linux] Implement glxtest test as binary r=emilio
Flags: needinfo?(stransky)

Updated, Thanks.

Flags: needinfo?(stransky)
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/70439dbc72fd
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/fe70bf105370
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/7e710b6500f0
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/fea9dc3de652
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/34056df4e1d5
[Linux] Implement glxtest test as binary r=emilio
Flags: needinfo?(stransky)

Updated, Thanks.

Flags: needinfo?(stransky)
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/bde24058d87c
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/d11d31ec4d81
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/10299b2e3a6a
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/ade3e3a9ec67
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/c5518abf189e
[Linux] Implement glxtest test as binary r=emilio
Duplicate of this bug: 1804180
Duplicate of this bug: 1826129
Flags: needinfo?(stransky)
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/be1774f1cceb
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/e9a220ba6f47
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/0fe756b5d74d
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/6267b0d60a1f
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/74758d7ccf41
[Linux] Implement glxtest test as binary r=emilio
Regressions: 1826498

Backed out for causing bustage on glxtest.cpp and xpcshell failure on test_gfxBlacklist_Device.js

Backout link

Push with failures - bustage
Push with failures - xpcshell

Failure log - bustage
Failure log - xpcshell

Flags: needinfo?(stransky)
Flags: needinfo?(stransky)
Duplicate of this bug: 1826498
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/6a1b9e363c54
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/e084be47c307
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/21e692c2f871
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/0771c006513a
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/f40c90d3ed12
[Linux] Implement glxtest test as binary r=emilio
Regressions: 1826951
Flags: needinfo?(stransky)

Depends on D173486

Summary:
Before bug 1758473, VAAPI was only tested (in the main process) if VAAPI was enabled.
Since then it has been tested in glxtest unconditionally, which has caused

  • a startup hang/freeze for some AMD users and
  • a glxtest crash for Nvidia users with deprecated/incompatible libva-vdpau-driver which has created a coredump file.

Hotfixes:
bug 1799747 disabled the VAAPI test on Nvidia.
bug 1813500 let AMD users continue but with disabled hardware rendering. This bug would be the actual fix.

Blocks: wr-linux
Depends on: 1813500, 1799747
Keywords: crashhang, regression
Regressed by: 1758473
Hardware: Unspecified → Desktop
Summary: Don't fire glxtest process for remote instances → Don't fire glxtest process for remote instances (VAAPI test sometimes hangs Firefox on AMD. bug 1813500 hotfix lets users continue but with disabled hardware rendering.)

Depends on D174995

Change the expected VA-API test failure from FEATURE_BLOCKED_PLATFORM_TEST to FEATURE_BLOCKED_DRIVER_VERSION, as we now support VA-API, but only on new Mesa.

Depends on D175235

Updated, Thanks.

Flags: needinfo?(stransky)
Duplicate of this bug: 955916
Attachment #9328118 - Attachment description: Bug 1787182 [Linux] Update test_gfxBlacklist_Version to fail with FEATURE_BLOCKED_DRIVER_VERSION r?emilio → Bug 1787182 [Linux] Update test_gfxBlacklist_Version to comply with FEATURE_HARDWARE_VIDEO_DECODING feature state on Linux r?emilio

Depends on D175236

Duplicate of this bug: 1826951
Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/ae30d1ad9c33
[Linux] Don't fire glxtest process unless we know we really want to run r=emilio
https://hg.mozilla.org/integration/autoland/rev/ea656d846a56
[Linux] Remove VA-API test from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/579f0b121b02
[Linux] Implement VA-API test as vaapitest binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/2a37788899e2
[Linux] Run VA-API tests on supported hardware only r=emilio,gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/4b8a03e2bd4f
[Linux] Implement glxtest test as binary r=emilio
https://hg.mozilla.org/integration/autoland/rev/6e0d9d9e1a3d
[Linux] Use pipe instead of stdout to get data from glxtest r=emilio
https://hg.mozilla.org/integration/autoland/rev/561fef5949d5
[Linux] Add logging to glxtest under MOZ_GFX_DEBUG env variable r=emilio
https://hg.mozilla.org/integration/autoland/rev/0cb52a90a1fe
[Linux] Update test_gfxBlacklist_Version to comply with FEATURE_HARDWARE_VIDEO_DECODING feature state on Linux r=emilio
https://hg.mozilla.org/integration/autoland/rev/d87d273845cc
[Linux] Build and run VA-API test on MOZ_WAYLAND builds only r=emilio

The patch landed in nightly and beta is affected.
:stransky, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval. Also, don't forget to request an uplift for the patches in the regressions caused by this fix.
  • If no, please set status-firefox113 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(stransky)
Regressions: 1828107
Flags: needinfo?(stransky)
Regressions: 1828192
Regressions: 1828195
Depends on: 1829461
Regressions: 1829541
Duplicate of this bug: 1739884
Regressions: 1831541