Open
Bug 86632
Opened 23 years ago
Updated 16 years ago
seemingly redundant poll and ioctl system call looping
Categories
(SeaMonkey :: UI Design, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: sarnila, Assigned: jag+mozilla)
Details
(Keywords: perf)
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
Details | Diff | Splinter Review |
/usr/src/mozilla/mozilla/dist/bin/mozilla-bin (traced by strace) keeps calling
in following loop system call obviously without any purpouse.
16:11:40.320120 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5,
events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0
16:11:40.320703 gettimeofday({992956300, 320879}, NULL) = 0
16:11:40.321217 gettimeofday({992956300, 321394}, NULL) = 0
16:11:40.321731 ioctl(7, FIONREAD, [0]) = 0
16:11:40.322250 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5,
events=POLLIN}, {fd=37, events=POLLIN}], 4, 6) = 0
16:11:40.338073 gettimeofday({992956300, 338252}, NULL) = 0
16:11:40.338597 gettimeofday({992956300, 338776}, NULL) = 0
16:11:40.339108 gettimeofday({992956300, 339284}, NULL) = 0
16:11:40.339621 ioctl(7, FIONREAD, [0]) = 0
16:11:40.340137 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5,
events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0
16:11:40.340723 gettimeofday({992956300, 340900}, NULL) = 0
16:11:40.341235 gettimeofday({992956300, 341413}, NULL) = 0
16:11:40.341751 ioctl(7, FIONREAD, [0]) = 0
-------------------------------
Here analysed:
>16:11:40.320120 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5,
>events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0
Here timeout parameter (last one) is 0. Such poll returns immediatly
in any case and is as such redundant. Here it also returns 0 miening
timeout after 0 milliseconds. The call doesn't make sence.
>16:11:40.320703 gettimeofday({992956300, 320879}, NULL) = 0
>16:11:40.321217 gettimeofday({992956300, 321394}, NULL) = 0
Here we get the curret time twice???
>16:11:40.321731 ioctl(7, FIONREAD, [0]) = 0
Here check we check by ioctl FIONREAD call if data is available from handle 7
allthough we just got from poll that there is nothing to be read, and we are
about to check it right with the next poll
>16:11:40.322250 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5,
>events=POLLIN}, {fd=37, events=POLLIN}], 4, 6) = 0
Here to timeout is 6 milliseconds. Other wise same comments as
above.
>16:11:40.338073 gettimeofday({992956300, 338252}, NULL) = 0
>16:11:40.338597 gettimeofday({992956300, 338776}, NULL) = 0
>16:11:40.339108 gettimeofday({992956300, 339284}, NULL) = 0
Here the current time again three times.
>16:11:40.339621 ioctl(7, FIONREAD, [0]) = 0
And here again the same redundant ioctl followed by poll
>16:11:40.340137 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5,
>events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0
And here we start the cycle again. The whole cycle takes about 20 milliseconds.
>16:11:40.340723 gettimeofday({992956300, 340900}, NULL) = 0
>16:11:40.341235 gettimeofday({992956300, 341413}, NULL) = 0
>16:11:40.341751 ioctl(7, FIONREAD, [0]) = 0
---------------------------
And All that is hapening all the time eeven when the program is otherwise doing
nothing (not even mouse on it).
In any case that is a very odd way of using poll command. The whole idea of poll
is that it sleeps till there is event in any of the handles given to it to monitor.
When mouse is moved over mozilla window the poll actually return POLLIN event on
handle 7, after which ioctl FIONREAD is used to inquire the number of bytes
available adn then read is used the read them:
------------
16:34:16.723045 poll([{fd=7, events=POLLIN, revents=POLLIN}, {fd=17,
events=POLLIN}, {fd=5, events=POLLIN}, {fd=37, events=POLLIN}], 4, 7) = 1
16:34:16.733129 gettimeofday({992957656, 733162}, NULL) = 0
16:34:16.733208 ioctl(7, FIONREAD, [32]) = 0
16:34:16.733274 read(7, "\6\0\233\371\234n\3430-\0\0\0m\24
\1\0\0\0\0\3\2\373\2"..., 32) = 32
---------------
Thus I seriously suspect that there is really a bug somewhere in the code (most
probably in some library) that is used to wait for action on the part of the
user or network. It could well be that the timeout parameter to poll call is
quite something else than what it is supposed to be. In any case well coded
programs should sleep and not run in tight do_nothing_loop using machine
resources while the program actually is doing nothing but waiting for an event.
Comment 1•23 years ago
|
||
maybe waterson? I have no idea what this bug is about.
Assignee: asa → waterson
Comment 2•23 years ago
|
||
setting status to NEW and adding qawanted keyword... Looks like the threading
system may be doing something funky, but I don't understand it well enough to
tell.
Comment 3•23 years ago
|
||
Ran into the same system call trace on HP-UX 11.00 (build of milestone 0.9.5).
To give some idea of the impact here: on a relatively slow machine (by today's
standards: a C110, which was current in about 1996) which can do about 1 million
calls/sec to gettimeofday() alone in a loop on a good day with the wind,
mozilla-bin is calling the trio gettimeofday/ioctl(FIONREAD)/poll about 1000
times per second to display "about:blank", iconified in a seperate workspace.
Given that there are no events going in/out of the Moz window here, and that the
ioctl/poll calls are returning immediately with 0 all the time, it's safe to say
we're wasting our time in a loop.
Also, this seems related to bug # 66366 which discusses high usage when mousing
over links, but with similar high CPU usage and syscall trace as I've seen while
displaying only about:blank. That bug is currently listed as a future fix,
because it appears to be considered more an enhancement request than a bug
(mozilla is expected to take-up more CPU, so this may be normal behavior).
Attaching a debugger to this shows that the loop in question is the (Gtk+) glib
main loop, the gettimeofday and poll calls coming from glib polling for events
and the ioctl(FIONREAD) calls are the underpinnings of the XPending() call,
which is looking for pending X events from a gtk-regsitered "prepare" function
-- see gdk_events_pending() in gdk/gdkevents.c. This is all called within the
main loop (glib/gmain.c v1.2.10):
hook = g_hook_first_valid (&source_list, TRUE);
while (hook)
{ [ ... ]
(*prepare)(....)
if (hook->flags & G_SOURCE_READY)
{ [ ... ]
n_ready++;
current_priority = source->priority;
timeout = 0;
}
hook = g_hook_next_valid (&source_list, hook, TRUE);
}
/* poll(), if necessary */
g_main_poll (timeout, n_ready > 0, current_priority);
A well-behaved gtk+ application will sit on this poll() with a -1 (indefinite)
timeout, waiting for a new event to come down the socket. For some reason,
Mozilla will call poll with a 0 timeout, return immediately without any events
being found on the fd's, causing the main loop to repeat. There appears to be
some disconnect between the G_SOURCE_READY flag being set, indicating the source
needs attention, and the poll which never finds anything to do.
I haven't yet compiled gtk/glib debuggable to find out what the events in
question are, but that would seem the next step. Hopefully this should at least
get the ball rolling in the gtk direction and maybe someone with a better
understanding of that UI library can comment as to what might be going on.
Is this because we need to service non-file-descriptor event queues (such as for
our cross-thread proxies) from the GTK main loop, and therefore can't just sit
there until there are new X events to process? I wonder if you also see this
with any "well-behaved" GTK program that uses the main loop hook.
Comment 5•23 years ago
|
||
There are a couple of places that this can happen:
o Someone can make a call that will check the X Window queue for incoming
window events, even checking to see if there's data in the X FD. This might be
causing some of these events.
o Someone posts to the thread event queue. This causes the mainloop to wake up
and check all the file descriptors.
o Someone fires a timer. This will cause the file descriptors to be checked
too, I think. I'm not positive about this, though.
Anyway, I think this bug is complaining about normal mainloop behaviour. As
they say "Nothing to see here, move along, move along."
He's not talking about events being generated, or the main loop leaving its
sleep because it found something or was kicked by a timer (how, if not by an
event?). He's saying that we will _never_ do anything but busy-loop, because we
have a timeout of 0.
The question is why we have the timeout of 0, and if there's another way (ah,
for IRIX's usema) to meet those requirements without that busy-loop cost.
Comment 7•23 years ago
|
||
To clarify the earlier comments: the poll() with timeout 0 is called by the
gmain loop when it finds any source which is in a G_SOURCE_READY state. I
presume this is a safeguard against a race to prevent a hang: where the event is
available when checked, but does not appear on the file descriptor later when
polled -- but that should be the exception case. The glib main loop appears to
do some work to ensure that it's only calling poll() on file descriptors which
appear to have data ready, however, so it appears odd that it's repeatedly
checking on fd's which it "thinks" have data, then timing-out (immediately)
because there's nothing there.
This gets us back to timers and the other non-fd-based uses of the main loop
hook. I have seen some traces in the debugger showing the prepare function as
called from a timer (resulting in the XPending's call to ioctl()), so there's
the possibility that this is just getting called too often (will get those debug
gtk libraries built real soon now...) As for non-fd events: I would have
expected those to be relegated to a seperate non-GUI thread or use a different
blocking event mechanism (e.g. pthread_mutex) to avoid having to check them
1000x/second in a tight loop. But I'll admit my naivete in this area of code,
and defer judgement until I've had time to have a closer look.
Comment 8•23 years ago
|
||
(one set of debug libraries later)
The root of the near-constant polling is the timer implementation for
libtimer_gtk in widget/timer/src/unix/gtk/nsTimerGtk.cpp which schedules a timer
through gtk's timer hook to run every 10ms in nsTimerGtk::Init() ...
#define INTERVAL 10
gtk_timeout_add ( INTERVAL, TimerCallbackFunc, (gpointer)0);
To process timers at 10ms intervals. TimerCallbackFunc() then calls
g_main_pending() to see if "events are pending in the main context.... see
g_main_context_pending()." (from the API ref). g_main_pending() calls
g_main_iterate(), which calls g_main_poll(), and the rest you know.
Obvious (flawed) fix: change the timer INTERVAL in this file to something
larger. Doing this, you definitely get far fewer system calls / CPU usage in
the trace, and the poll call gets used with a positive-ms timeout where it waits
the desired interval. This also means that animations/flashing cursors, etc
don't get updated any faster than the INTERVAL specified, which slows things
down a bit. Certainly an interval such as 250ms could be used for things like
flashing cursors, but animations need faster timeouts to be smooth.
It seems as though there should be a way to bridge these two needs: we don't
need to wake-up every 10ms and poll everything just to know if a timer has
expired. There should be a way to maintain timeouts large and small without
having to gettimeofday()/ioctl()/poll() (+ all the stuff in user-land) every
10ms. It would seem that we could predict our next needed wakeup-time with a
bit more granularity, and schedule the next wakeup for that time, ignoring the
flurry of 10ms intervals in between. Alternately, use of an interval like 100ms
could work in the current scheme, assuming that animations would be sped-up
accordingly or that the system calls for polling fd though this loop could be
avoided except when known to be needed.
Comment 9•23 years ago
|
||
Attached code implements the timer code a bit differently: doesn't schedule a
continual "every 10ms" interrupt, but calculates the time when the next
interrupt is needed, and schedules the one, then after a timeout processing
loop
recomputes the next one needed, schedules that that one and so on. Advantage
is
that you're not continually scanning queues looking for work to do, so
"about:blank" displays without any continual system activity, and a single
flashing cursor only wakes-up 2x a second, rather than the current 100x. Also
simplified some of the storage aspects to reduce redundant math and extra
shifting of elements in the arrays, but stopped short of changing the array
storage to a sorted list, which didn't seem worth the effort given the few
elements being stored.
Apologies in advance for my lack of C++ style, and potential for other problems
introduced due to my ignorance of much of the overall architecture.
Comment 10•23 years ago
|
||
Can you use cvs diff -u to attach a diff?
Comment 11•23 years ago
|
||
Updated version of the same code, fixes a couple of bugs introduced in the
first pass, provided as "cvs diff -u" output, as requested. Note that the
previous attachment should be deleted by someone who has appropriate
permissions -- as it does introduce some odd behavior. I'd be happy to share
what I've learned in that area, should anyone care.
Comment 12•23 years ago
|
||
Comment on attachment 57317 [details] [diff] [review]
modified nsTimerGtk.h and nsTimerGtk.cpp
marking obsolete per author's request
Attachment #57317 -
Attachment is obsolete: true
Comment 13•23 years ago
|
||
Does that patch need updating to deal with pavlov's timer rewrite? If not,
could someone review it?
Comment 14•23 years ago
|
||
The code that the patch touches doesn't exist anymore. I doubt that we're going
to see this behaviour anymore unless pav's timer implementation is really bad.
Updated•23 years ago
|
Target Milestone: --- → Future
Updated•23 years ago
|
Status: NEW → ASSIGNED
Comment 15•23 years ago
|
||
Dan, are you still seeing this? Is this now worksforme?
Comment 16•23 years ago
|
||
Ran tusc on a build of the 0.9.8 milestone, and I no longer see the looping
behavior. Displaying a static page (about:) puts all threads asleep quietly in
a pthread_cond_wait(), and with a flashing cursor it cycles through poll/select
only the 2x per second as needed to toggle the cursor. Definitely appears fixed
with these changes -- certainly much less "busy" than before.
Comment 17•23 years ago
|
||
Is
Comment 18•23 years ago
|
||
Sorry about the last spam. Is this problem fixed by the patch of bug 78611?
Comment 19•23 years ago
|
||
What's the current status of this one ?
Comment 20•22 years ago
|
||
Tried strace on Netscape7.0 pr1 on Linux, still saw the looping of poll() with
timer 0. On the other hand, that does not happen on mozilla build (moz099,
moz100). The looping also exists in Netscape 7.0 on Solaris (built by Sun).
Comment 21•22 years ago
|
||
Adding myself to the cc list.
Comment 22•22 years ago
|
||
Couldn't find anyone on IRC who cared, and this is the nearest bug I can find
to what I just saw, so I'll dump it here. Mozilla (today's trunk CVS, Linux)
is "visibly" idle, but one of its threads is eating up 100% CPU. The top of the
profile looks like :
0000d93c 10174 1.08509 PR_GetIdentitiesLayer
/home/moz/bin/mozill/bin/libnspr4.so
0000d250 10672 1.13821 pl_DefPoll
/home/moz/bin/mozill/bin/libnspr4.so
00069208 11749 1.25307 nsHttpConnection::OnDataAvailable(nsIRequest *,
nsISupports *, nsIInputStream *, unsigned int, unsigned int)
/home/moz/bin/mozill/bin/components/libnecko.so
000d0890 12515 1.33477 __poll /lib/libc-2.2.so
0003397c 13511 1.441 nsSocketTransport::CheckForTimeout(unsigned int)
/home/moz/bin/mozill/bin/components/libnecko.so
00009734 15653 1.66945 pthread_equal /lib/libpthread-0.9.so
00022f94 18715 1.99602 PR_Now
/home/moz/bin/mozill/bin/libnspr4.so
001057a0 20071 2.14064 __divdi3 /lib/libc-2.2.so
00022ffc 21801 2.32515 _PR_UNIX_GetInterval
/home/moz/bin/mozill/bin/libnspr4.so
00016e08 22224 2.37027 PR_IntervalNow
/home/moz/bin/mozill/bin/libnspr4.so
0001c654 22523 2.40216 PR_Lock
/home/moz/bin/mozill/bin/libnspr4.so
00037d38 23254 2.48012 nsSocketReadRequest::OnRead(void)
/home/moz/bin/mozill/bin/components/libnecko.so
00008130 23918 2.55094 __pthread_mutex_unlock /lib/libpthread-0.9.so
00007e14 24313 2.59307 __pthread_mutex_lock /lib/libpthread-0.9.so
0001cc00 24593 2.62293 PR_ExitMonitor
/home/moz/bin/mozill/bin/libnspr4.so
0001cbb0 29559 3.15257 PR_EnterMonitor
/home/moz/bin/mozill/bin/libnspr4.so
0003917c 30116 3.21198 nsSocketTransportService::Run(void)
/home/moz/bin/mozill/bin/components/libnecko.so
00033a24 32087 3.42219 nsSocketTransport::Process(short)
/home/moz/bin/mozill/bin/components/libnecko.so
00034914 33524 3.57545 nsSocketTransport::doReadWrite(short)
/home/moz/bin/mozill/bin/components/libnecko.so
0001c688 34873 3.71933 PR_Unlock
/home/moz/bin/mozill/bin/libnspr4.so
000797f0 37389 3.98767 nsHttpPipeline::OnDataReadable(nsIInputStream *)
/home/moz/bin/mozill/bin/components/libnecko.so
0000b758 62826 6.70062 __pthread_unlock /lib/libpthread-0.9.so
000096c8 78942 8.41945 pthread_self /lib/libpthread-0.9.so
0000b564 83847 8.94258 __pthread_lock /lib/libpthread-0.9.so
00020138 89553 9.55115 _pr_poll_with_poll
/home/moz/bin/mozill/bin/libnspr4.so
000a4ed0 92566 9.8725 __gettimeofday /lib/libc-2.2.so
(fields are : vma, nr. samples, percent of total, symbol, binary/library)
This was causing billions of sys_gettimeofday()/sys_poll() calls
I would have done a strace, but trying to attach to the run away process
stopped the problem (damn).
I wasn't doing anything particularly unusual: news, bugzilla, some website.
Updated•20 years ago
|
Product: Core → Mozilla Application Suite
Comment 23•19 years ago
|
||
This bug seems to be quite ... dead.
Can anyone report this is still (or is not) a perf problem on a current build?
Comment 24•18 years ago
|
||
Bug 371398 seems to be the similiar to this one for Firefox.
Updated•16 years ago
|
Assignee: waterson → jag
Status: ASSIGNED → NEW
QA Contact: jrgmorrison
Target Milestone: Future → ---
You need to log in
before you can comment on or make changes to this bug.
Description
•