GLX/Nvidia: Firefox spins on CPU after computer wakes from suspend
Categories
(Core :: Graphics, defect)
Tracking
()
Release | Tracking | Status |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox-esr91 | --- | unaffected |
firefox93 | --- | unaffected |
firefox94 | + | fixed |
firefox95 | + | fixed |
People
(Reporter: pbone, Assigned: nical)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: hang, regression)
Crash Data
Attachments
(1 file)
(deleted), text/x-phabricator-request
When my computer wakes from suspend, Firefox windows don't respond to any input. The process list usually shows two Firefox processes spinning forever (I didn't check which ones). I sent one a SIGTRAP and got the following report. This has happened reliably for about the last week; I can probably find a regression range.
I'm using Linux Mint 20.1 on amd64 with nvidia graphics.
Maybe Fission related. (DOMFissionEnabled=1)
Crash report: https://crash-stats.mozilla.org/report/index/7cd4d846-106a-46ba-84d5-a41400211008
Reason: SIGTRAP
Top 10 frames of crashing thread:
0 libpthread.so.0 __pthread_cond_wait /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_cond_wait.c:638
1 firefox-bin mozilla::detail::ConditionVariableImpl::wait_for mozglue/misc/ConditionVariable_posix.cpp:115
2 libxul.so mozilla::ipc::MessageChannel::Send ipc/glue/MessageChannel.cpp:1449
3 libxul.so mozilla::ipc::IProtocol::ChannelSend ipc/glue/ProtocolUtils.cpp:534
4 libxul.so mozilla::layers::PCompositorBridgeChild::SendFlushRendering ipc/ipdl/PCompositorBridgeChild.cpp:1084
5 libxul.so mozilla::layers::WebRenderLayerManager::FlushRendering gfx/layers/wr/WebRenderLayerManager.cpp:709
6 libxul.so nsViewManager::Refresh view/nsViewManager.cpp:314
7 libxul.so nsViewManager::PaintWindow view/nsViewManager.cpp:628
8 libxul.so nsView::PaintWindow view/nsView.cpp:1055
9 libxul.so nsWindow::OnExposeEvent widget/gtk/nsWindow.cpp:3646
Reporter
Comment 1•3 years ago
Well that's a weird looking stack:
https://crash-stats.mozilla.org/report/index/bf5321e9-b2ca-40ea-b862-db3b30211009
Reporter
Comment 2•3 years ago
Mozregression took me to Bug 1733154, and I had to set gfx.webrender.all=true to reproduce it; it didn't have the problem with software WebRender.
Comment 3•3 years ago
Does this problem also occur with gfx.x11-egl.force-enabled=true + gfx.webrender.all=true?
Comment 4•3 years ago
Set release status flags based on info from the regressing bug 1733154
Assignee
Comment 5•3 years ago
I should have known better than to expect that all drivers would implement a queue correctly.
Reporter
Comment 7•3 years ago
(In reply to Darkspirit from comment #3)
> Does this problem also occur with gfx.x11-egl.force-enabled=true + gfx.webrender.all=true?
Yes.
Comment 8•3 years ago
bugherder
Assignee
Comment 9•3 years ago
Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
Reporter
Comment 10•3 years ago
(In reply to Nicolas Silva [:nical] from comment #9)
> Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
Yes, it still reproduces with BuildID: 20211014212856
Here's a new crash report, made after I woke my computer, noticed that Firefox was locked up and spinning on the CPU, and sent SIGSEGV to the process: https://crash-stats.mozilla.org/report/index/094001ff-f20e-4805-a4d0-0a1ce0211015
There are two processes, each sitting at 100% CPU. Are they both busy sending each other IPC messages?
Comment 11•3 years ago
(In reply to Paul Bone [:pbone] from comment #10)
> (In reply to Nicolas Silva [:nical] from comment #9)
> > Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
> Yes, it still reproduces with BuildID: 20211014212856
> Here's a new crash report, made after I woke my computer, noticed that Firefox was locked up and spinning on the CPU, and sent SIGSEGV to the process: https://crash-stats.mozilla.org/report/index/094001ff-f20e-4805-a4d0-0a1ce0211015
> There are two processes, each sitting at 100% CPU. Are they both busy sending each other IPC messages?
This is a variant of Bug 1734368: the main thread is waiting in the synchronous FlushRendering(), which is waiting for the rendering thread to finish the flush, and the rendering thread is stuck in the NV drivers.
We may consider making FlushRendering() async on Linux/X11 (it's already async on Wayland).
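The hang pattern described in this comment can be illustrated with a small, self-contained Python sketch. This is not Firefox code: `render_worker`, `flush_rendering`, and the timeout value are hypothetical stand-ins. It only shows why an unbounded synchronous wait hangs the caller when the worker never acknowledges the flush, and how a bounded wait lets the caller notice. Note that a bounded wait is a different technique from the async FlushRendering() suggested above.

```python
import threading
from typing import Optional


def render_worker(flush_done: threading.Event, stuck: bool) -> None:
    # Hypothetical stand-in for the rendering thread. When `stuck` is True
    # it returns without acknowledging the flush, modelling a driver call
    # that never completes.
    if not stuck:
        flush_done.set()


def flush_rendering(stuck: bool, timeout_s: Optional[float]) -> bool:
    # Returns True if the flush was acknowledged before the timeout.
    # With timeout_s=None the wait is unbounded, which is the behaviour
    # that leaves the caller blocked forever when the worker is stuck.
    flush_done = threading.Event()
    worker = threading.Thread(
        target=render_worker, args=(flush_done, stuck), daemon=True
    )
    worker.start()
    return flush_done.wait(timeout=timeout_s)
```

Calling `flush_rendering(stuck=True, timeout_s=None)` would block indefinitely, which mirrors the main thread sitting in `pthread_cond_wait` in the stack from the description.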
Comment 12•3 years ago
Given that we're down to our final Fx94 beta of the cycle before RC week next week, is backing out bug 1733154 a reasonable option as a short-term fix? It does back out cleanly from Beta.
Reporter
Comment 13•3 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #12)
> Given that we're down to our final Fx94 beta of the cycle before RC week next week, is backing out bug 1733154 a reasonable option as a short-term fix? It does back out cleanly from Beta.
Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config? Or are there people who may have this bug because WebRender was enabled by Firefox's own choice?
Comment 14•3 years ago
(In reply to Paul Bone [:pbone] from comment #13)
> Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config?
I am running Firefox 94.0b8 on Ubuntu 20.04 with the latest stable proprietary Nvidia drivers, and do not have gfx.webrender.all set to true: it's at the default value of false. However, I am still seeing what appears to be the same behaviour: Firefox spins on resume from suspend-to-RAM, chewing up two CPU cores and requiring a kill-restart to fix.
Reporter
Comment 15•3 years ago
(In reply to gareth from comment #14)
> (In reply to Paul Bone [:pbone] from comment #13)
> > Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config?
> I am running Firefox 94.0b8 on Ubuntu 20.04 with the latest stable proprietary Nvidia drivers, and do not have gfx.webrender.all set to true: it's at the default value of false. However, I am still seeing what appears to be the same behaviour: Firefox spins on resume from suspend-to-RAM, chewing up two CPU cores and requiring a kill-restart to fix.
Thanks. That's exactly like the problem I have. I wonder why mine defaults to off. I have an NVidia Quadro card here; maybe Firefox isn't confident about that card. Our WebRender people will know.
Comment 16•3 years ago
Just to add some more detail: I'm running an RTX 2080 and the problem goes away if I go into Settings, untick "Use recommended performance settings", and manually disable "Use hardware acceleration when available" - I can resume from suspend perfectly then.
It does, of course, have an impact on performance - but at least I'm not having to kill Firefox every time I come back from lunch now!
Comment 17•3 years ago
(In reply to Paul Bone [:pbone] from comment #13)
> Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config? Or are there people who may have this bug because WebRender was enabled by Firefox's own choice?
I would assume it'd affect anybody getting hardware webrender, which isn't controlled only via pref.
Comment 18•3 years ago
backout
(In reply to Cristina Cozmuta (:CrissCozmuta) from comment #8)
Backed out along with bug 1733154. The next scheduled Nightly builds starting in ~4hr should have this change. It would be great if we could get confirmation from affected users over the weekend that this works so we can do the same for the Fx94 RC build on Monday.
https://hg.mozilla.org/mozilla-central/rev/a426132614c3584af011efdf8ddc62bad6ac9b4d
Comment 19•3 years ago
bug1731172 EGL
Upgraded the beta to 94.0b9, no change; installed Nightly via the PPA (received 95.0a1 (2021-10-20), which judging by the date may not be the latest build mentioned above), and it does not appear to exhibit the bug at default settings with a new, blank profile. It does, however, massively corrupt the UI on resume from suspend: my tabs are now empty rectangles, the address bar is empty bar four oddly-spaced characters, and all images are missing from the tab that was displayed when I originally suspended.
That's a separate issue, though: it's not spinning, at least!
Comment 20•3 years ago
(In reply to gareth from comment #19)
> Upgraded the beta to 94.0b9, no change; installed Nightly via the PPA (received 95.0a1 (2021-10-20), which judging by the date may not be the latest build mentioned above)
Neither of those builds would have the backout from comment 18. If you have a 2021-10-23 or newer Nightly build, that should have the fix. Nothing on 94 will yet, however.
Comment hidden (offtopic)
Comment 22•3 years ago
backout
Doesn't sound like we're going to be able to get much of a useful verification from Nightly given the overall change in behavior which occurred there in the intervening time. We went ahead and did the backout for 94.0rc1. It should be available for testing on the Beta channel later today or tomorrow.
https://hg.mozilla.org/releases/mozilla-beta/rev/b24d4a876b9f
Comment 23•3 years ago
Firefox 94 (20211025220926) just installed on the desktop: I can confirm that the bug appears to be gone. Re-enabling hardware acceleration, restarting the browser, suspending, then resuming causes no problems at all. Many thanks for the fix!
Comment 24•3 years ago
Thanks for testing!
Comment 25•3 years ago
(In reply to Paul Bone [:pbone] from comment #10)
> (In reply to Nicolas Silva [:nical] from comment #9)
> > Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
> Yes, it still reproduces with BuildID: 20211014212856
Can you check if the problem occurs with
mozregression --launch 20211014212856
but not with
mozregression --launch 20211014212856 --pref layout.frame_rate:60
?
Then we would know that Firefox's usage of GLX_SGI_video_sync was the cause. (GLX vsync also caused bug 1710400, bug 1279309.)
Comment 27•3 years ago
If this is indeed an NV-only problem with the GLX vsync source, we could extend bug 1640779 to only use GLX_SGI_video_sync on Mesa and use Xrandr in all other cases. That might cause problems with swap interval 1, though.
Edit: Alternatively: I think all hardware supported by the 460 driver series (our NV baseline for WR) is also supported by the 470 series. So if we ship EGL for the 470 series (on EGL we always use the Xrandr vsync), we can also consider just bumping the NV baseline to 470.82, making sure that the whole proprietary NV population is either on EGL+HW-WR or SW-WR, which would be the first big step toward deprecating GLX.
Edit2: Given that the 470 driver series marks the line for officially supported hardware (the oldest being almost 10 years old), it's probably fair to only officially support WR on officially supported hardware. So going all EGL on NV sounds like a decent option to me (users of older hardware should use nouveau/mesa).
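The selection rule proposed in this comment can be written down as a small decision function. This is only a sketch of the proposal as stated here, not Firefox's actual code: the function name, the string labels, and the idea of passing the backend and driver vendor as plain strings are all assumptions made for illustration.

```python
def choose_vsync_source(backend: str, driver_vendor: str) -> str:
    # Hypothetical helper; labels are illustrative, not real Firefox identifiers.
    # On EGL the thread states the Xrandr vsync is always used.
    if backend == "egl":
        return "xrandr"
    # Proposal from this comment: keep GLX_SGI_video_sync only on Mesa.
    if backend == "glx" and driver_vendor == "mesa":
        return "glx_sgi_video_sync"
    # Everything else (including proprietary NV on GLX) falls back to Xrandr.
    return "xrandr"
```

Under this rule, a proprietary NV driver on GLX would no longer touch GLX_SGI_video_sync at all, which is the point of the proposal.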
Reporter
Comment 28•3 years ago
(In reply to Darkspirit from comment #25)
> (In reply to Paul Bone [:pbone] from comment #10)
> > (In reply to Nicolas Silva [:nical] from comment #9)
> > > Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
> > Yes, it still reproduces with BuildID: 20211014212856
> Can you check if the problem occurs with
> mozregression --launch 20211014212856
> but not with
> mozregression --launch 20211014212856 --pref layout.frame_rate:60
> ?
> Then we would know that Firefox's usage of GLX_SGI_video_sync was the cause. (GLX vsync also caused bug 1710400, bug 1279309.)
I can't get it to lock up with mozregression at all today, but I was able to earlier when first investigating this problem. I can still do it with my Nightly installation & profile (hrm, I'll try a clean profile too).
Some more info:
- It only locks up when I have gfx.webrender.all = true
- I seem to have quite old NVidia drivers (390.144). I don't know why; I guess I just didn't update them. I would update them now that I know this, except that I want to keep helping test this bug, so I'll leave them at this version until we're done.
Reporter
Comment 29•3 years ago
Yep, my Nightly install with a fresh profile can't reproduce it either. So it's now something in my profile & some builds. It got better there for maybe a week (something was backed out I think). Then worse again.
I don't see anything else changed in my profile. Could this be just me at this point, with a weird combination of drivers and profile?
Comment 31•3 years ago
(In reply to Paul Bone [:pbone] from comment #28)
> - I seem to have quite old NVidia drivers (390.144). I don't know why; I guess I just didn't update them. I would update them now that I know this, except that I want to keep helping test this bug, so I'll leave them at this version until we're done.
What NVidia card is this? 390.144 was released in July 2021, so it's not exactly old per se, but it's for legacy cards.
Reporter
Comment 32•3 years ago
(In reply to Arthur K. [He/Him] from comment #31)
> (In reply to Paul Bone [:pbone] from comment #28)
> > - I seem to have quite old NVidia drivers (390.144). I don't know why; I guess I just didn't update them. I would update them now that I know this, except that I want to keep helping test this bug, so I'll leave them at this version until we're done.
> What NVidia card is this? 390.144 was released in July 2021, so it's not exactly old per se, but it's for legacy cards.
Quadro P400, on my Mozilla desktop. I'm pretty sure it can use a newer driver.
Comment 33•3 years ago
FWIW, we are entering our last week of beta for Firefox 95.
Reporter
Comment 34•3 years ago
So I think this only affects folk with both:
- old NVidia drivers
- force-enabled WebRender.
I can confirm this because I upgraded my drivers and now things work properly.
Comment 35•3 years ago
- comment 18 + comment 22: regressing bug was backed out from 94 beta + 95 nightly (bug 1733154 comment 5)
- These old bugs were about GL context loss: bug 1682876 + Comment 6 seems similar fix as bug 1492580
- crash from comment 0 occurred with disqualified driver 390.144.0.0
- Driver 455.28 fixed an important bug:
  (Dzmitry Malyshau [:kvark] from bug 1656361 comment 9)
  > Looks to be fixed in the latest NV proprietary driver:
  > > Fixed a bug where glGetGraphicsResetStatusARB would incorrectly return GL_PURGED_CONTEXT_RESET_NV immediately after application start-up if the system had previously been suspended.
- bug 1673752 enabled WR by default for >= 460.32.03. It's not recommended to enable WR below it.
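For readers who want to check a driver version against the thresholds mentioned in this summary, here is a minimal Python sketch of a dotted-version comparison. It is not how Firefox's blocklist is implemented; the helper names are hypothetical, and only the 460.32.03 threshold and the example driver versions come from this thread.

```python
from itertools import zip_longest

# Threshold quoted in this bug: WR enabled by default for >= 460.32.03.
WR_DEFAULT_MIN = "460.32.03"


def parse_version(text: str) -> tuple:
    # "390.144.0.0" -> (390, 144, 0, 0); leading zeros such as "03" parse as 3.
    return tuple(int(part) for part in text.split("."))


def at_least(version: str, minimum: str) -> bool:
    # Compare component by component, padding the shorter version with zeros
    # so that "470.82" and "470.82.0" are treated as equal.
    pairs = zip_longest(parse_version(version), parse_version(minimum), fillvalue=0)
    for have, need in pairs:
        if have != need:
            return have > need
    return True
```

With this helper, the reporter's 390.144 driver and the 455.28 driver both fall below the threshold, while 470.82 is above it.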