Open Bug 1729116 Opened 3 years ago Updated 1 year ago

High CPU usage in main process, seemingly spinning in poll()/recvmsg()=EAGAIN, but really RefreshDriver

Categories

(Core :: Performance, defect)

Firefox 91
x86_64
Linux
defect

Tracking

()

UNCONFIRMED
Performance Impact low

People

(Reporter: kestrel.fink, Unassigned)

References

Details

(Keywords: perf:responsiveness)

Attachments

(1 file, 1 obsolete file)

Attached file bug-constant-polling.txt (obsolete) (deleted) —

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

Start up Firefox.

Actual results:

Firefox worked ok, though sluggish at times, and Zoom calls at the same time have choppy audio unless I exit Firefox.

Also, about:processes shows about 40% CPU usage for the main process, dropping to 15-20% after disabling all addons and restarting.

strace at the same time showed poll() timing out, then a useless recvmsg on fd 4 that returned EAGAIN.

Expected results:

Smooth browser, no choppy audio.

OS: Unspecified → Linux
Hardware: Unspecified → x86_64

The Bugbug bot thinks this bug should belong to the 'Core::Audio/Video: Playback' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Audio/Video: Playback
Product: Firefox → Core

Thanks a lot for reporting the issue.

Could you please provide more information such as what site or page you were visiting? And when you say choppy Zoom audio, is it the Zoom app for Linux or Zoom for Web in Firefox?

Also, it will be very helpful if you can get the troubleshooting info[1] or capture a profile[2] and upload them here. Thank you!

[1] https://support.mozilla.org/en-US/kb/use-troubleshooting-information-page-fix-firefox#w_accessing-the-troubleshooting-information-page
[2] https://profiler.firefox.com/

Flags: needinfo?(kestrel.fink)

Oops, sorry, I thought I had posted the profile: https://share.firefox.dev/38EDcQ3

The choppy Zoom audio is in the separate Zoom app for Linux.

The site doesn't seem to matter much, and if I remember correctly no other processes showed much CPU usage.

(This comment refers to the same problem; I reported it using my son's bugzilla account.)

The thing that looks off to me is the repeated

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=12, events=POLLIN}, {fd=31, events=POLLIN}, {fd=59, events=POLLIN}], 6, 0) = 0 (Timeout)
recvmsg(4, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(4, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)

because it's polling fd 4, getting a timeout, and then calling recvmsg (twice, no less!) when poll just said that there's no input event pending.

Flags: needinfo?(kestrel.fink)

I see various processes doing lots of polling in the profile : parent process, the remote data decoder. These seems to stem from nsIThread machinery and checking for tasks to run -- though I'm no expert on such things. Since it's not Zoom in Fx being flaky, it sounds like it's just Fx eating lots of resources and messing with other programs. And it's not clear to me those resources are being caused by media in Fx.

Could you try setting the pref media.rdd-process.enabled to false (just to test, this will break a whole lot of media if you leave if false) and seeing if the polling still takes place?

Flags: needinfo?(kestrel.fink)
Whiteboard: [qf]

This looks similar to bug 1709175. Nihanth, what do you think?

Flags: needinfo?(nhnt11)

It looks similar but I'm actually not sure it's the same issue. It isn't so clear in the other bug that we have useless recvmsg() calls like the strace shows in comment 0 here. That other bug seems macOS specific and I need to spend more time understanding why we are spending time walking the select_conflict_queue in the kernel.

I would say let's leave this open and see if stronger evidence emerges that these are related. Worth a "see also" though.

Flags: needinfo?(nhnt11)

I dug into this further when I had access to the computer, but didn't keep very good notes. I'll try to dig up what I can now.

The recvmsg is in the bowels of XCB:

#0  0x00007f62845be630 in recvmsg () from /lib64/libpthread.so.0
#1  0x00007f6281ef0658 in _xcb_in_read (c=c@entry=0x7f6283ecd000)
    at /usr/src/debug/libxcb-1.13.1-7.fc34.x86_64/src/xcb_in.c:1000
#2  0x00007f6281ef1758 in poll_for_next_event (c=0x7f6283ecd000, queued=queued@entry=0)
    at /usr/src/debug/libxcb-1.13.1-7.fc34.x86_64/src/xcb_in.c:713
#3  0x00007f6281ef2961 in poll_for_next_event (queued=<optimized out>, c=<optimized out>)
    at /usr/src/debug/libxcb-1.13.1-7.fc34.x86_64/src/xcb_in.c:720
#4  0x00007f62828daaae in poll_for_event (dpy=dpy@entry=0x7f6283ecb000, queued_only=0)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:254
#5  0x00007f62828dac28 in poll_for_response (dpy=dpy@entry=0x7f6283ecb000)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:301
#6  0x00007f62828ddd02 in _XEventsQueued (mode=<optimized out>, dpy=0x7f6283ecb000)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:432
#7  _XEventsQueued (dpy=0x7f6283ecb000, mode=<optimized out>)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:414
#8  0x00007f62828ca431 in XPending (dpy=0x7f6283ecb000)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/Pending.c:55
#9  0x00007f6282a700e9 in gdk_event_source_prepare () from /lib64/libgdk-3.so.0
#10 0x00007f62821d708a in g_main_context_prepare (context=0x7f6283eb4be0,
    priority=0x7ffc2b132338) at ../glib/gmain.c:3657
#11 0x00007f628222b303 in g_main_context_iterate.constprop.0 (
    context=context@entry=0x7f6283eb4be0, block=722674492, block@entry=0,
    dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4111
#12 0x00007f62821d4c03 in g_main_context_iteration (context=0x7f6283eb4be0, may_block=0)
    at ../glib/gmain.c:4196
#13 0x00007f627cc862b3 in non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal\
*, bool) () from /home/kestrel/firefox/libxul.so
#14 0x00007f627ca55175 in nsThread::ProcessNextEvent(bool, bool*) ()
   from /home/kestrel/firefox/libxul.so
#15 0x00007f627ca9f624 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) ()
   from /home/kestrel/firefox/libxul.so
#16 0x00007f627d54190d in MessageLoop::Run() () from /home/kestrel/firefox/libxul.so
#17 0x00007f627dbb1e89 in nsBaseAppShell::Run() () from /home/kestrel/firefox/libxul.so
#18 0x00007f627b8beb91 in nsAppStartup::Run() () from /home/kestrel/firefox/libxul.so
#19 0x00007f627b932d90 in XREMain::XRE_mainRun() () from /home/kestrel/firefox/libxul.so
#20 0x00007f627b933769 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) ()

It happens in a somewhat suspicious line of code, after it has done the poll() that does not get an event on that fd:

       if(!ret && !queued && c->in.reading == 0 && _xcb_in_read(c)) /* _xcb_in_read shuts down the\
 connection on error */

It's almost like it's doing the read in order to shut down the connection? But in my case it doesn't, because as poll() already told it, there are no events pending on the socket. Actually, let me paste in the full body of that function:

static xcb_generic_event_t *poll_for_next_event(xcb_connection_t *c, int queued)
{
    xcb_generic_event_t *ret = 0;
    if(!c->has_error)
    {
        pthread_mutex_lock(&c->iolock);
        /* FIXME: follow X meets Z architecture changes. */
        ret = get_event(c);
        if(!ret && !queued && c->in.reading == 0 && _xcb_in_read(c)) /* _xcb_in_read shuts down the\
 connection on error */
            ret = get_event(c);
        pthread_mutex_unlock(&c->iolock);
    }
    return ret;
}

I previously chased down the poll() call stack, but I don't have time to get it right now.

Hey stransky, wondering if this kind of performance problem from event polling is something you've had to wrestle with before in the X11 stack?

Component: Audio/Video: Playback → Performance
Flags: needinfo?(stransky)

Karl may know.

Flags: needinfo?(stransky) → needinfo?(karlt)

Comment on attachment 9239437 [details]
bug-constant-polling.txt

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=12, events=POLLIN}, {fd=31, events=POLLIN}, {fd=59, events=POLLIN}], 6, 0) = 0 (Timeout)
recvmsg(4, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(4, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)

This is checking whether there are any pending events. Perhaps the recvmsg is not as efficient as it could be, but it is not what is waking the process.

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=12, events=POLLIN}, {fd=31, events=POLLIN}, {fd=59, events=POLLIN}], 6, -1) = 1 ([{fd=31, revents=POLLIN}])
read(31, "\372", 1) = 1
write(32, "\372", 1) = 1

This poll puts the app to sleep.
It is woken by fd 31 which is a pipe connected to fd 32, which is written from ScheduleNativeEventCallback(), probably from queuing a new runnable on the main thread.

The write here is after the poll, so presumably there was another write from a different thread that woke this poll. strace -tt -f -p <pid> will trace and identify other threads.

Flags: needinfo?(karlt)

Ah, thanks, that will help guide my investigation. I still think xcb is being obnoxious here; there's no need to do the recvmsg immediately after polling and being told that there's nothing to receive. But you're right, that appears to be a red herring in this case, since if it weren't for the constant fd 31 wakeups then it's really not different from the initial poll with a timeout of zero.

I'll try to track down what runnable this is and why it's getting queued constantly, though from the bug 508427 you linked to I fear it'll be something with the refresh driver.

Yes, it's getting a nonstop stream of RefreshDriverTick, with reason=HasImageAnimations: https://share.firefox.dev/3ksG1Kw

The system should be idle. Is there a way to track down what is animating?

Summary: High CPU usage in main process, spinning in poll() → High CPU usage in main process, seemingly spinning in poll()/recvmsg()=EAGAIN, but really RefreshDriver

I'll note that the profile is comment 12 (from Firefox 94) is very different from the one in comment 2 (from Firefox 91).

Attached file strace -f -tt output (deleted) —
Attachment #9239437 - Attachment is obsolete: true

(In reply to Steve Fink [:sfink] [:s:] from comment #12)

The system should be idle. Is there a way to track down what is animating?

You could check with the Animations of the DevTools inspector. You can find it on the right side - if needed click the dropdown array to select it. If it's not a particular website use the Browser Toolbox instead.

No, HasImageAnimations won't be visible there - I think it's about animated gifs, mostly. I think we need a new bug to make it easier to debug the source of HasImageAnimations.

Whiteboard: [qf] → [qf:p3:responsiveness]
Performance Impact: --- → P3
Whiteboard: [qf:p3:responsiveness]

Redirect a needinfo that is pending on an inactive user to the triage owner.
:fdoty, since the bug has recent activity, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kestrel.fink) → needinfo?(fdoty)
Flags: needinfo?(fdoty)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: