Closed Bug 1734958 Opened 3 years ago Closed 3 years ago

GLX/Nvidia: Firefox spins on CPU after computer wakes from suspend

Categories

(Core :: Graphics, defect)

x86_64
Linux
defect

RESOLVED FIXED
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 + fixed
firefox95 + fixed

People

(Reporter: pbone, Assigned: nical)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: hang, regression)

Attachments

(1 file)

When my computer wakes from suspend, Firefox windows don't respond to any input. The process list usually shows two Firefox processes spinning forever (I didn't check which ones). I sent one of them a SIGTRAP and got the following report. This has happened reliably for about the last week, so I can probably find a regression range.

I'm using Linux Mint 20.1 on amd64 with nvidia graphics.

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/7cd4d846-106a-46ba-84d5-a41400211008

Reason: SIGTRAP

Top 10 frames of crashing thread:

0 libpthread.so.0 __pthread_cond_wait /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_cond_wait.c:638
1 firefox-bin mozilla::detail::ConditionVariableImpl::wait_for mozglue/misc/ConditionVariable_posix.cpp:115
2 libxul.so mozilla::ipc::MessageChannel::Send ipc/glue/MessageChannel.cpp:1449
3 libxul.so mozilla::ipc::IProtocol::ChannelSend ipc/glue/ProtocolUtils.cpp:534
4 libxul.so mozilla::layers::PCompositorBridgeChild::SendFlushRendering ipc/ipdl/PCompositorBridgeChild.cpp:1084
5 libxul.so mozilla::layers::WebRenderLayerManager::FlushRendering gfx/layers/wr/WebRenderLayerManager.cpp:709
6 libxul.so nsViewManager::Refresh view/nsViewManager.cpp:314
7 libxul.so nsViewManager::PaintWindow view/nsViewManager.cpp:628
8 libxul.so nsView::PaintWindow view/nsView.cpp:1055
9 libxul.so nsWindow::OnExposeEvent widget/gtk/nsWindow.cpp:3646

Mozregression took me to Bug 1733154. I had to set gfx.webrender.all=true to reproduce it; the problem didn't occur with software WebRender.

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3772cc85bffe7f43157aca7a4263381b1bac3252&tochange=0ef9b28ae108513855b5e855f469cd64e278f6e4

Flags: needinfo?(nical.bugzilla)
Regressed by: 1733154
Has Regression Range: --- → yes

Does this problem also occur with gfx.x11-egl.force-enabled=true + gfx.webrender.all=true?

Blocks: wr-nv-linux

Set release status flags based on info from the regressing bug 1733154

I should have known better than to expect that all drivers would implement a queue correctly.

Assignee: nobody → nical.bugzilla
Status: NEW → ASSIGNED
Pushed by nsilva@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/45eeed639240 Don't spin forever trying to get gl errors. r=gfx-reviewers,jnicol
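
The commit title suggests that the loop draining the GL error queue no longer assumes the driver will eventually report "no error". A minimal sketch of that idea follows, written in Python rather than Gecko's C++. The names `drain_gl_errors`, `MAX_ERRORS`, and `stuck_driver` are hypothetical and used only for illustration; the actual patch may work differently.

```python
from typing import Callable, List

GL_NO_ERROR = 0            # value of GL_NO_ERROR in OpenGL
GL_CONTEXT_LOST = 0x0507   # value of GL_CONTEXT_LOST in OpenGL
MAX_ERRORS = 16            # hypothetical cap, not taken from the patch


def drain_gl_errors(get_error: Callable[[], int],
                    limit: int = MAX_ERRORS) -> List[int]:
    """Collect pending GL errors, giving up after `limit` reads.

    An unbounded `while get_error() != GL_NO_ERROR` loop never ends if the
    driver keeps reporting an error on every call, so the loop is capped.
    """
    errors: List[int] = []
    for _ in range(limit):
        code = get_error()
        if code == GL_NO_ERROR:
            break
        errors.append(code)
    return errors


def stuck_driver() -> int:
    """Fake driver (assumption) that reports the same error on every call."""
    return GL_CONTEXT_LOST
```

With `stuck_driver`, the function returns after `MAX_ERRORS` reads instead of spinning; with a well-behaved queue it stops at the first `GL_NO_ERROR`.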

(In reply to Darkspirit from comment #3)

> Does this problem also occur with gfx.x11-egl.force-enabled=true + gfx.webrender.all=true?

Yes.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch

Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.

Flags: needinfo?(nical.bugzilla) → needinfo?(pbone)

(In reply to Nicolas Silva [:nical] from comment #9)

> Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.

Yes, it still reproduces with BuildID: 20211014212856

Here's a new crash report, made after I woke my computer, noticed that Firefox was locked up and spinning on the CPU, and sent SIGSEGV to the process: https://crash-stats.mozilla.org/report/index/094001ff-f20e-4805-a4d0-0a1ce0211015

There are two processes, each sitting at 100% CPU. Are they both busy sending each other IPC messages?

Flags: needinfo?(pbone) → needinfo?(nical.bugzilla)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to Paul Bone [:pbone] from comment #10)

> (In reply to Nicolas Silva [:nical] from comment #9)
> > Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
>
> Yes, it still reproduces with BuildID: 20211014212856
>
> Here's a new crash report, made after I woke my computer, noticed that Firefox was locked up and spinning on the CPU, and sent SIGSEGV to the process: https://crash-stats.mozilla.org/report/index/094001ff-f20e-4805-a4d0-0a1ce0211015
>
> There are two processes, each sitting at 100% CPU. Are they both busy sending each other IPC messages?

This is a variant of Bug 1734368: the main thread is waiting in the synchronous FlushRendering(), which waits for the render thread to finish the flush, and the render thread is stuck in the NV drivers.

We may consider making FlushRendering() async on Linux/X11 (it's already async on Wayland).
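
The dependency described here can be illustrated with a small sketch: a caller that blocks on a completion signal hangs if the signalling thread never returns from the driver, while a bounded wait lets the caller give up. This is not Gecko's IPC code; `flush_rendering`, `render_thread`, and the events are hypothetical stand-ins, and a timeout is only one possible mitigation alongside the async approach suggested above.

```python
import threading


def flush_rendering(render_done: threading.Event, timeout_s: float) -> bool:
    """Wait for the render thread to signal completion, for at most timeout_s.

    Returns True if the flush finished, False if the wait timed out.
    Passing no timeout would block forever when the signal never arrives.
    """
    return render_done.wait(timeout=timeout_s)


def render_thread(render_done: threading.Event,
                  driver_stuck: threading.Event) -> None:
    """Stand-in for the render thread: it signals only if the driver returns."""
    if not driver_stuck.is_set():
        render_done.set()


def run_flush(stuck: bool, timeout_s: float) -> bool:
    """Start a fake render thread and wait for it with a bounded timeout."""
    done = threading.Event()
    driver_stuck = threading.Event()
    if stuck:
        driver_stuck.set()
    worker = threading.Thread(target=render_thread, args=(done, driver_stuck))
    worker.start()
    worker.join()
    return flush_rendering(done, timeout_s)
```

Calling `run_flush(stuck=True, timeout_s=0.05)` returns `False` after the timeout rather than hanging, which is the behaviour an unbounded synchronous wait cannot provide.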

Given that we're down to our final Fx94 beta of the cycle before RC week next week, is backing out bug 1733154 a reasonable option as a short-term fix? It does backout cleanly from Beta.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #12)

> Given that we're down to our final Fx94 beta of the cycle before RC week next week, is backing out bug 1733154 a reasonable option as a short-term fix? It does backout cleanly from Beta.

Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config? Or are there people who may have this bug who have had WebRender enabled by Firefox's choice?

(In reply to Paul Bone [:pbone] from comment #13)

> Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config?

I am running Firefox 94.0b8 on Ubuntu 20.04 with the latest stable proprietary Nvidia drivers, and do not have gfx.webrender.all set to true: it's at the default value of false. However, I am still seeing what appears to be the same behaviour: Firefox spins on resume from suspend-to-RAM, chewing up two CPU cores and requiring a kill-restart to fix.

(In reply to gareth from comment #14)

> (In reply to Paul Bone [:pbone] from comment #13)
> > Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config?
>
> I am running Firefox 94.0b8 on Ubuntu 20.04 with the latest stable proprietary Nvidia drivers, and do not have gfx.webrender.all set to true: it's at the default value of false. However, I am still seeing what appears to be the same behaviour: Firefox spins on resume from suspend-to-RAM, chewing up two CPU cores and requiring a kill-restart to fix.

Thanks. That's exactly like the problem I have. I wonder why mine defaults to off. I have an NVidia Quadro card here; maybe Firefox isn't confident about that card. Our WebRender people will know.

Just to add some more detail: I'm running an RTX 2080 and the problem goes away if I go into Settings, untick "Use recommended performance settings", and manually disable "Use hardware acceleration when available" - I can resume from suspend perfectly then.

It does, of course, have an impact on performance - but at least I'm not having to kill Firefox every time I come back from lunch now!

(In reply to Paul Bone [:pbone] from comment #13)

> Given that I had to set gfx.webrender.all=true, is this actually affecting people who don't mess with their about:config? Or are there people who may have this bug who have had WebRender enabled by Firefox's choice?

I would assume it'd affect anybody getting hardware webrender, which isn't controlled only via pref.

Target Milestone: 95 Branch → ---

(In reply to Cristina Cozmuta (:CrissCozmuta) from comment #8)

> https://hg.mozilla.org/mozilla-central/rev/45eeed639240

Backed out along with bug 1733154. The next scheduled Nightly builds starting in ~4hr should have this change. It would be great if we could get confirmation from affected users over the weekend that this works so we can do the same for the Fx94 RC build on Monday.
https://hg.mozilla.org/mozilla-central/rev/a426132614c3584af011efdf8ddc62bad6ac9b4d

Upgraded the beta to 94.0b9, no change; installed Nightly via the PPA (received 95.0a1 (2021-10-20), which judging by the date may not be the latest build mentioned above), and it does not appear to exhibit the bug at default settings with a new, blank profile. It does, however, massively corrupt the UI on resume from suspend: my tabs are now empty rectangles, the address bar is empty bar four oddly-spaced characters, and all images are missing from the tab that was displayed when I originally suspended.

That's a separate issue, though: it's not spinning, at least!

(In reply to gareth from comment #19)

> Upgraded the beta to 94.0b9, no change; installed Nightly via the PPA (received 95.0a1 (2021-10-20), which judging by the date may not be the latest build mentioned above)

Neither of those builds would have the backout from comment 18. If you have a 2021-10-23 or newer Nightly build, that should have the fix. Nothing on 94 will yet, however.

Flags: needinfo?(gareth)

Doesn't sound like we're going to be able to get much of a useful verification from Nightly given the overall change in behavior which occurred there in the intervening time. We went ahead and did the backout for 94.0rc1. It should be available for testing on the Beta channel later today or tomorrow.
https://hg.mozilla.org/releases/mozilla-beta/rev/b24d4a876b9f

Firefox 94 (20211025220926) just installed on the desktop: I can confirm that the bug appears to be gone. Re-enabling hardware acceleration, restarting the browser, suspending, then resuming causes no problems at all. Many thanks for the fix!

Thanks for testing!

(In reply to Paul Bone [:pbone] from comment #10)

> (In reply to Nicolas Silva [:nical] from comment #9)
> > Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
>
> Yes, it still reproduces with BuildID: 20211014212856

Can you check if the problem occurs with
mozregression --launch 20211014212856
but not with
mozregression --launch 20211014212856 --pref layout.frame_rate:60
?
Then we would know that Firefox's usage of GLX_SGI_video_sync was the cause. (GLX vsync also caused bug 1710400 and bug 1279309.)

Flags: needinfo?(pbone)

If this is indeed an NV-only problem with the GLX vsync source, we could extend bug 1640779 to only use GLX_SGI_video_sync on Mesa and use Xrandr in all other cases. That might cause problems with swap interval 1, though.

Edit: Alternatively: I think all hardware supported by the 460 driver series (our NV baseline for WR) is also supported by the 470 series. So if we ship EGL for the 470 series (on EGL we always use the Xrandr vsync), we can also consider just bumping the NV baseline to 470.82, making sure that the whole proprietary NV population is either on EGL+HW-WR or SW-WR, and taking the first big step toward deprecating GLX.

Edit2: Given that the 470 driver series marks the line for officially supported hardware (the oldest being almost 10 years old), it's probably fair to only officially support WR on officially supported hardware. So going all EGL on NV sounds like a decent option to me (users of older hardware should use nouveau/mesa).
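
A baseline check like the one proposed above comes down to comparing driver version strings component by component. The sketch below is illustrative only: `parse_driver_version`, `meets_baseline`, and `NV_BASELINE` are hypothetical names, and this is not how Firefox's blocklist is implemented. The point is that versions must be compared as integer tuples, since treating "470.103" as a decimal number would rank it below "470.82".

```python
from typing import Tuple

# "470.82", the baseline proposed in the comment above.
NV_BASELINE: Tuple[int, ...] = (470, 82)


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn '390.144' into (390, 144) so components compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_baseline(version: str,
                   baseline: Tuple[int, ...] = NV_BASELINE) -> bool:
    """Return True if the driver version is at or above the baseline."""
    return parse_driver_version(version) >= baseline
```

Under this sketch a legacy-branch driver such as 390.144 falls below the proposed baseline, while 470.82 and later 470-series releases pass.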

(In reply to Darkspirit from comment #25)

> (In reply to Paul Bone [:pbone] from comment #10)
> > (In reply to Nicolas Silva [:nical] from comment #9)
> > > Is this still reproducible on the latest nightly? If not I'll request a 94 uplift.
> >
> > Yes, it still reproduces with BuildID: 20211014212856
>
> Can you check if the problem occurs with
> mozregression --launch 20211014212856
> but not with
> mozregression --launch 20211014212856 --pref layout.frame_rate:60
> ?
> Then we would know that Firefox's usage of GLX_SGI_video_sync was the cause. (GLX vsync also caused bug 1710400 and bug 1279309.)

I can't get it to lock up with mozregression at all today, but I was able to earlier when first investigating this problem. I can still do it with my Nightly installation & profile (hrm, I'll try a clean profile too).

Some more info:

  • It only locks up when I have gfx.webrender.all = true
  • I seem to have quite old NVidia drivers (390.144). I don't know why; I guess I just didn't update them. I would update them now that I know this, except that I want to keep helping test this bug, so I'll leave them at this version until we're done.
Flags: needinfo?(pbone)

Yep, my Nightly install with a fresh profile can't reproduce it either. So it's now something in my profile & some builds. It got better there for maybe a week (something was backed out I think). Then worse again.

I don't see anything else changed in my profile. Could this be just me at this point, with a weird combination of drivers, and profile?

Flags: needinfo?(nical.bugzilla)

(In reply to Paul Bone [:pbone] from comment #28)

>   • I seem to have quite old NVidia drivers (390.144). I don't know why; I guess I just didn't update them. I would update them now that I know this, except that I want to keep helping test this bug, so I'll leave them at this version until we're done.

What NVidia card is this? 390.144 was released in July 2021, so it's not exactly old per se, but it's for legacy cards.

(In reply to Arthur K. [He/Him] from comment #31)

> (In reply to Paul Bone [:pbone] from comment #28)
> >   • I seem to have quite old NVidia drivers (390.144). I don't know why; I guess I just didn't update them. I would update them now that I know this, except that I want to keep helping test this bug, so I'll leave them at this version until we're done.
>
> What NVidia card is this? 390.144 was released in July 2021, so it's not exactly old per se, but it's for legacy cards.

Quadro P400, on my Mozilla desktop. I'm pretty sure it can use a newer driver.

FWIW, we are entering our last week of beta for Firefox 95.

So I think this only affects folk with both:

  • old NVidia drivers
  • force-enabled WebRender.

I can confirm this because I upgraded my drivers and now things work properly.

Status: REOPENED → RESOLVED
Crash Signature: [@ __pthread_cond_wait | mozilla::ipc::MessageChannel::Send | mozilla::ipc::IProtocol::ChannelSend ]
Closed: 3 years ago
Flags: needinfo?(gareth)
Resolution: --- → FIXED
Summary: Firefox spins on CPU after computer wakes from suspend → GLX/Nvidia: Firefox spins on CPU after computer wakes from suspend