1729116 - High CPU usage in main process, seemingly spinning in poll()/recvmsg()=EAGAIN, but really RefreshDriver

kestrel.fink

Reporter

Description

•

3 years ago

Attached file bug-constant-polling.txt (obsolete) (deleted) — Details

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

Start up Firefox.

Actual results:

Firefox worked ok, though sluggish at times, and Zoom calls at the same time have choppy audio unless I exit Firefox.

Also, about:processes shows about 40% CPU usage for the main process, dropping to 15-20% after disabling all addons and restarting.

strace at the same time showed poll() timing out, then a useless recvmsg on fd 4 that returned EAGAIN.

Expected results:

Smooth browser, no choppy audio.

kestrel.fink

Reporter

Updated

•

3 years ago

OS: Unspecified → Linux

Hardware: Unspecified → x86_64

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

3 years ago

The Bugbug bot thinks this bug should belong to the 'Core::Audio/Video: Playback' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Audio/Video: Playback

Product: Firefox → Core

John Lin [:jhlin][:jolin]

Comment 2

•

3 years ago

Thanks a lot for reporting the issue.

Could you please provide more information such as what site or page you were visiting? And when you say choppy Zoom audio, is it the Zoom app for Linux or Zoom for Web in Firefox?

Also, it will be very helpful if you can get the troubleshooting info[1] or capture a profile[2] and upload them here. Thank you!

[1] https://support.mozilla.org/en-US/kb/use-troubleshooting-information-page-fix-firefox#w_accessing-the-troubleshooting-information-page
[2] https://profiler.firefox.com/

Flags: needinfo?(kestrel.fink)

Steve Fink [:sfink] [:s:]

Comment 3

•

3 years ago

Oops, sorry, I thought I had posted the profile: https://share.firefox.dev/38EDcQ3

The choppy Zoom audio is in the separate Zoom app for Linux.

The site doesn't seem to matter much, and if I remember correctly no other processes showed much CPU usage.

(This comment refers to the same problem; I reported it using my son's bugzilla account.)

The thing that looks off to me is the repeated

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=12, events=POLLIN}, {fd=31, events=POLLIN}, {fd=59, events=POLLIN}], 6, 0) = 0 (Timeout)
recvmsg(4, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(4, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)

because it's polling fd 4, getting a timeout, and then calling recvmsg (twice, no less!) when poll just said that there's no input event pending.

Flags: needinfo?(kestrel.fink)

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 4

•

3 years ago

I see various processes doing lots of polling in the profile : parent process, the remote data decoder. These seems to stem from nsIThread machinery and checking for tasks to run -- though I'm no expert on such things. Since it's not Zoom in Fx being flaky, it sounds like it's just Fx eating lots of resources and messing with other programs. And it's not clear to me those resources are being caused by media in Fx.

Could you try setting the pref media.rdd-process.enabled to false (just to test, this will break a whole lot of media if you leave if false) and seeing if the polling still takes place?

Flags: needinfo?(kestrel.fink)

Patricia Lawless

Updated

•

3 years ago

Whiteboard: [qf]

Markus Stange [:mstange]

Comment 5

•

3 years ago

This looks similar to bug 1709175. Nihanth, what do you think?

Flags: needinfo?(nhnt11)

(inactive) Nihanth Subramanya [:nhnt11]

Comment 6

•

3 years ago

It looks similar but I'm actually not sure it's the same issue. It isn't so clear in the other bug that we have useless recvmsg() calls like the strace shows in comment 0 here. That other bug seems macOS specific and I need to spend more time understanding why we are spending time walking the select_conflict_queue in the kernel.

I would say let's leave this open and see if stronger evidence emerges that these are related. Worth a "see also" though.

Flags: needinfo?(nhnt11)

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1709175

Steve Fink [:sfink] [:s:]

Comment 7

•

3 years ago

I dug into this further when I had access to the computer, but didn't keep very good notes. I'll try to dig up what I can now.

The recvmsg is in the bowels of XCB:

#0  0x00007f62845be630 in recvmsg () from /lib64/libpthread.so.0
#1  0x00007f6281ef0658 in _xcb_in_read (c=c@entry=0x7f6283ecd000)
    at /usr/src/debug/libxcb-1.13.1-7.fc34.x86_64/src/xcb_in.c:1000
#2  0x00007f6281ef1758 in poll_for_next_event (c=0x7f6283ecd000, queued=queued@entry=0)
    at /usr/src/debug/libxcb-1.13.1-7.fc34.x86_64/src/xcb_in.c:713
#3  0x00007f6281ef2961 in poll_for_next_event (queued=<optimized out>, c=<optimized out>)
    at /usr/src/debug/libxcb-1.13.1-7.fc34.x86_64/src/xcb_in.c:720
#4  0x00007f62828daaae in poll_for_event (dpy=dpy@entry=0x7f6283ecb000, queued_only=0)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:254
#5  0x00007f62828dac28 in poll_for_response (dpy=dpy@entry=0x7f6283ecb000)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:301
#6  0x00007f62828ddd02 in _XEventsQueued (mode=<optimized out>, dpy=0x7f6283ecb000)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:432
#7  _XEventsQueued (dpy=0x7f6283ecb000, mode=<optimized out>)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/xcb_io.c:414
#8  0x00007f62828ca431 in XPending (dpy=0x7f6283ecb000)
    at /usr/src/debug/libX11-1.7.2-3.fc34.x86_64/src/Pending.c:55
#9  0x00007f6282a700e9 in gdk_event_source_prepare () from /lib64/libgdk-3.so.0
#10 0x00007f62821d708a in g_main_context_prepare (context=0x7f6283eb4be0,
    priority=0x7ffc2b132338) at ../glib/gmain.c:3657
#11 0x00007f628222b303 in g_main_context_iterate.constprop.0 (
    context=context@entry=0x7f6283eb4be0, block=722674492, block@entry=0,
    dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4111
#12 0x00007f62821d4c03 in g_main_context_iteration (context=0x7f6283eb4be0, may_block=0)
    at ../glib/gmain.c:4196
#13 0x00007f627cc862b3 in non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal\
*, bool) () from /home/kestrel/firefox/libxul.so
#14 0x00007f627ca55175 in nsThread::ProcessNextEvent(bool, bool*) ()
   from /home/kestrel/firefox/libxul.so
#15 0x00007f627ca9f624 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) ()
   from /home/kestrel/firefox/libxul.so
#16 0x00007f627d54190d in MessageLoop::Run() () from /home/kestrel/firefox/libxul.so
#17 0x00007f627dbb1e89 in nsBaseAppShell::Run() () from /home/kestrel/firefox/libxul.so
#18 0x00007f627b8beb91 in nsAppStartup::Run() () from /home/kestrel/firefox/libxul.so
#19 0x00007f627b932d90 in XREMain::XRE_mainRun() () from /home/kestrel/firefox/libxul.so
#20 0x00007f627b933769 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) ()

It happens in a somewhat suspicious line of code, after it has done the poll() that does not get an event on that fd:

       if(!ret && !queued && c->in.reading == 0 && _xcb_in_read(c)) /* _xcb_in_read shuts down the\
 connection on error */

It's almost like it's doing the read in order to shut down the connection? But in my case it doesn't, because as poll() already told it, there are no events pending on the socket. Actually, let me paste in the full body of that function:

static xcb_generic_event_t *poll_for_next_event(xcb_connection_t *c, int queued)
{
    xcb_generic_event_t *ret = 0;
    if(!c->has_error)
    {
        pthread_mutex_lock(&c->iolock);
        /* FIXME: follow X meets Z architecture changes. */
        ret = get_event(c);
        if(!ret && !queued && c->in.reading == 0 && _xcb_in_read(c)) /* _xcb_in_read shuts down the\
 connection on error */
            ret = get_event(c);
        pthread_mutex_unlock(&c->iolock);
    }
    return ret;
}

I previously chased down the poll() call stack, but I don't have time to get it right now.

Mike Conley (:mconley) (:⚙️)

Comment 8

•

3 years ago

Hey stransky, wondering if this kind of performance problem from event polling is something you've had to wrestle with before in the X11 stack?

Component: Audio/Video: Playback → Performance

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Comment 9

•

3 years ago

Karl may know.

Flags: needinfo?(stransky) → needinfo?(karlt)

Karl Tomlinson (:karlt)

Comment 10

•

3 years ago

Comment on attachment 9239437 [details]
bug-constant-polling.txt

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=12, events=POLLIN}, {fd=31, events=POLLIN}, {fd=59, events=POLLIN}], 6, 0) = 0 (Timeout)
recvmsg(4, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(4, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)

This is checking whether there are any pending events. Perhaps the recvmsg is not as efficient as it could be, but it is not what is waking the process.

poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=12, events=POLLIN}, {fd=31, events=POLLIN}, {fd=59, events=POLLIN}], 6, -1) = 1 ([{fd=31, revents=POLLIN}])
read(31, "\372", 1) = 1
write(32, "\372", 1) = 1

This poll puts the app to sleep.
It is woken by fd 31 which is a pipe connected to fd 32, which is written from ScheduleNativeEventCallback(), probably from queuing a new runnable on the main thread.

The write here is after the poll, so presumably there was another write from a different thread that woke this poll. strace -tt -f -p <pid> will trace and identify other threads.

Flags: needinfo?(karlt)

Karl Tomlinson (:karlt)

Updated

•

3 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=508427

Steve Fink [:sfink] [:s:]

Comment 11

•

3 years ago

Ah, thanks, that will help guide my investigation. I still think xcb is being obnoxious here; there's no need to do the recvmsg immediately after polling and being told that there's nothing to receive. But you're right, that appears to be a red herring in this case, since if it weren't for the constant fd 31 wakeups then it's really not different from the initial poll with a timeout of zero.

I'll try to track down what runnable this is and why it's getting queued constantly, though from the bug 508427 you linked to I fear it'll be something with the refresh driver.

Steve Fink [:sfink] [:s:]

Comment 12

•

3 years ago

Yes, it's getting a nonstop stream of RefreshDriverTick, with reason=HasImageAnimations: https://share.firefox.dev/3ksG1Kw

The system should be idle. Is there a way to track down what is animating?

Summary: High CPU usage in main process, spinning in poll() → High CPU usage in main process, seemingly spinning in poll()/recvmsg()=EAGAIN, but really RefreshDriver

Steve Fink [:sfink] [:s:]

Comment 13

•

3 years ago

I'll note that the profile is comment 12 (from Firefox 94) is very different from the one in comment 2 (from Firefox 91).

Steve Fink [:sfink] [:s:]

Comment 14

•

3 years ago

Attached file strace -f -tt output (deleted) — Details

Attachment #9239437 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 15

•

3 years ago

(In reply to Steve Fink [:sfink] [:s:] from comment #12)

The system should be idle. Is there a way to track down what is animating?

You could check with the Animations of the DevTools inspector. You can find it on the right side - if needed click the dropdown array to select it. If it's not a particular website use the Browser Toolbox instead.

Markus Stange [:mstange]

Comment 16

•

3 years ago

No, HasImageAnimations won't be visible there - I think it's about animated gifs, mostly. I think we need a new bug to make it easier to debug the source of HasImageAnimations.

Daniel Holbert [:dholbert]

Updated

•

3 years ago

Whiteboard: [qf] → [qf:p3:responsiveness]

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

2 years ago

Performance Impact: --- → P3

Keywords: perf:responsiveness

Whiteboard: [qf:p3:responsiveness]

BugBot [:suhaib / :marco/ :calixte]

Comment 17

•

2 years ago

Redirect a needinfo that is pending on an inactive user to the triage owner.
:fdoty, since the bug has recent activity, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kestrel.fink) → needinfo?(fdoty)

Frank Doty [:fdoty|on_leave]

Updated

•

2 years ago

Flags: needinfo?(fdoty)

bug-constant-polling.txt 3 years ago kestrel.fink (deleted), text/plain		Details
strace -f -tt output 3 years ago Steve Fink [:sfink] [:s:] (deleted), application/x-xz		Details