Open Bug 86632 Opened 23 years ago Updated 16 years ago

seemingly redundant poll and ioctl system call looping

Categories

(SeaMonkey :: UI Design, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

People

(Reporter: sarnila, Assigned: jag+mozilla)

Details

(Keywords: perf)

Attachments

(1 file, 1 obsolete file)

/usr/src/mozilla/mozilla/dist/bin/mozilla-bin (traced by strace) keeps calling in following loop system call obviously without any purpouse. 16:11:40.320120 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0 16:11:40.320703 gettimeofday({992956300, 320879}, NULL) = 0 16:11:40.321217 gettimeofday({992956300, 321394}, NULL) = 0 16:11:40.321731 ioctl(7, FIONREAD, [0]) = 0 16:11:40.322250 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=37, events=POLLIN}], 4, 6) = 0 16:11:40.338073 gettimeofday({992956300, 338252}, NULL) = 0 16:11:40.338597 gettimeofday({992956300, 338776}, NULL) = 0 16:11:40.339108 gettimeofday({992956300, 339284}, NULL) = 0 16:11:40.339621 ioctl(7, FIONREAD, [0]) = 0 16:11:40.340137 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0 16:11:40.340723 gettimeofday({992956300, 340900}, NULL) = 0 16:11:40.341235 gettimeofday({992956300, 341413}, NULL) = 0 16:11:40.341751 ioctl(7, FIONREAD, [0]) = 0 ------------------------------- Here analysed: >16:11:40.320120 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, >events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0 Here timeout parameter (last one) is 0. Such poll returns immediatly in any case and is as such redundant. Here it also returns 0 miening timeout after 0 milliseconds. The call doesn't make sence. >16:11:40.320703 gettimeofday({992956300, 320879}, NULL) = 0 >16:11:40.321217 gettimeofday({992956300, 321394}, NULL) = 0 Here we get the curret time twice??? >16:11:40.321731 ioctl(7, FIONREAD, [0]) = 0 Here check we check by ioctl FIONREAD call if data is available from handle 7 allthough we just got from poll that there is nothing to be read, and we are about to check it right with the next poll >16:11:40.322250 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, >events=POLLIN}, {fd=37, events=POLLIN}], 4, 6) = 0 Here to timeout is 6 milliseconds. Other wise same comments as above. >16:11:40.338073 gettimeofday({992956300, 338252}, NULL) = 0 >16:11:40.338597 gettimeofday({992956300, 338776}, NULL) = 0 >16:11:40.339108 gettimeofday({992956300, 339284}, NULL) = 0 Here the current time again three times. >16:11:40.339621 ioctl(7, FIONREAD, [0]) = 0 And here again the same redundant ioctl followed by poll >16:11:40.340137 poll([{fd=7, events=POLLIN}, {fd=17, events=POLLIN}, {fd=5, >events=POLLIN}, {fd=37, events=POLLIN}], 4, 0) = 0 And here we start the cycle again. The whole cycle takes about 20 milliseconds. >16:11:40.340723 gettimeofday({992956300, 340900}, NULL) = 0 >16:11:40.341235 gettimeofday({992956300, 341413}, NULL) = 0 >16:11:40.341751 ioctl(7, FIONREAD, [0]) = 0 --------------------------- And All that is hapening all the time eeven when the program is otherwise doing nothing (not even mouse on it). In any case that is a very odd way of using poll command. The whole idea of poll is that it sleeps till there is event in any of the handles given to it to monitor. When mouse is moved over mozilla window the poll actually return POLLIN event on handle 7, after which ioctl FIONREAD is used to inquire the number of bytes available adn then read is used the read them: ------------ 16:34:16.723045 poll([{fd=7, events=POLLIN, revents=POLLIN}, {fd=17, events=POLLIN}, {fd=5, events=POLLIN}, {fd=37, events=POLLIN}], 4, 7) = 1 16:34:16.733129 gettimeofday({992957656, 733162}, NULL) = 0 16:34:16.733208 ioctl(7, FIONREAD, [32]) = 0 16:34:16.733274 read(7, "\6\0\233\371\234n\3430-\0\0\0m\24 \1\0\0\0\0\3\2\373\2"..., 32) = 32 --------------- Thus I seriously suspect that there is really a bug somewhere in the code (most probably in some library) that is used to wait for action on the part of the user or network. It could well be that the timeout parameter to poll call is quite something else than what it is supposed to be. In any case well coded programs should sleep and not run in tight do_nothing_loop using machine resources while the program actually is doing nothing but waiting for an event.
maybe waterson? I have no idea what this bug is about.
Assignee: asa → waterson
setting status to NEW and adding qawanted keyword... Looks like the threading system may be doing something funky, but I don't understand it well enough to tell.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: qawanted
Ran into the same system call trace on HP-UX 11.00 (build of milestone 0.9.5). To give some idea of the impact here: on a relatively slow machine (by today's standards: a C110, which was current in about 1996) which can do about 1 million calls/sec to gettimeofday() alone in a loop on a good day with the wind, mozilla-bin is calling the trio gettimeofday/ioctl(FIONREAD)/poll about 1000 times per second to display "about:blank", iconified in a seperate workspace. Given that there are no events going in/out of the Moz window here, and that the ioctl/poll calls are returning immediately with 0 all the time, it's safe to say we're wasting our time in a loop. Also, this seems related to bug # 66366 which discusses high usage when mousing over links, but with similar high CPU usage and syscall trace as I've seen while displaying only about:blank. That bug is currently listed as a future fix, because it appears to be considered more an enhancement request than a bug (mozilla is expected to take-up more CPU, so this may be normal behavior). Attaching a debugger to this shows that the loop in question is the (Gtk+) glib main loop, the gettimeofday and poll calls coming from glib polling for events and the ioctl(FIONREAD) calls are the underpinnings of the XPending() call, which is looking for pending X events from a gtk-regsitered "prepare" function -- see gdk_events_pending() in gdk/gdkevents.c. This is all called within the main loop (glib/gmain.c v1.2.10): hook = g_hook_first_valid (&source_list, TRUE); while (hook) { [ ... ] (*prepare)(....) if (hook->flags & G_SOURCE_READY) { [ ... ] n_ready++; current_priority = source->priority; timeout = 0; } hook = g_hook_next_valid (&source_list, hook, TRUE); } /* poll(), if necessary */ g_main_poll (timeout, n_ready > 0, current_priority); A well-behaved gtk+ application will sit on this poll() with a -1 (indefinite) timeout, waiting for a new event to come down the socket. For some reason, Mozilla will call poll with a 0 timeout, return immediately without any events being found on the fd's, causing the main loop to repeat. There appears to be some disconnect between the G_SOURCE_READY flag being set, indicating the source needs attention, and the poll which never finds anything to do. I haven't yet compiled gtk/glib debuggable to find out what the events in question are, but that would seem the next step. Hopefully this should at least get the ball rolling in the gtk direction and maybe someone with a better understanding of that UI library can comment as to what might be going on.
Is this because we need to service non-file-descriptor event queues (such as for our cross-thread proxies) from the GTK main loop, and therefore can't just sit there until there are new X events to process? I wonder if you also see this with any "well-behaved" GTK program that uses the main loop hook.
There are a couple of places that this can happen: o Someone can make a call that will check the X Window queue for incoming window events, even checking to see if there's data in the X FD. This might be causing some of these events. o Someone posts to the thread event queue. This causes the mainloop to wake up and check all the file descriptors. o Someone fires a timer. This will cause the file descriptors to be checked too, I think. I'm not positive about this, though. Anyway, I think this bug is complaining about normal mainloop behaviour. As they say "Nothing to see here, move along, move along."
He's not talking about events being generated, or the main loop leaving its sleep because it found something or was kicked by a timer (how, if not by an event?). He's saying that we will _never_ do anything but busy-loop, because we have a timeout of 0. The question is why we have the timeout of 0, and if there's another way (ah, for IRIX's usema) to meet those requirements without that busy-loop cost.
To clarify the earlier comments: the poll() with timeout 0 is called by the gmain loop when it finds any source which is in a G_SOURCE_READY state. I presume this is a safeguard against a race to prevent a hang: where the event is available when checked, but does not appear on the file descriptor later when polled -- but that should be the exception case. The glib main loop appears to do some work to ensure that it's only calling poll() on file descriptors which appear to have data ready, however, so it appears odd that it's repeatedly checking on fd's which it "thinks" have data, then timing-out (immediately) because there's nothing there. This gets us back to timers and the other non-fd-based uses of the main loop hook. I have seen some traces in the debugger showing the prepare function as called from a timer (resulting in the XPending's call to ioctl()), so there's the possibility that this is just getting called too often (will get those debug gtk libraries built real soon now...) As for non-fd events: I would have expected those to be relegated to a seperate non-GUI thread or use a different blocking event mechanism (e.g. pthread_mutex) to avoid having to check them 1000x/second in a tight loop. But I'll admit my naivete in this area of code, and defer judgement until I've had time to have a closer look.
(one set of debug libraries later) The root of the near-constant polling is the timer implementation for libtimer_gtk in widget/timer/src/unix/gtk/nsTimerGtk.cpp which schedules a timer through gtk's timer hook to run every 10ms in nsTimerGtk::Init() ... #define INTERVAL 10 gtk_timeout_add ( INTERVAL, TimerCallbackFunc, (gpointer)0); To process timers at 10ms intervals. TimerCallbackFunc() then calls g_main_pending() to see if "events are pending in the main context.... see g_main_context_pending()." (from the API ref). g_main_pending() calls g_main_iterate(), which calls g_main_poll(), and the rest you know. Obvious (flawed) fix: change the timer INTERVAL in this file to something larger. Doing this, you definitely get far fewer system calls / CPU usage in the trace, and the poll call gets used with a positive-ms timeout where it waits the desired interval. This also means that animations/flashing cursors, etc don't get updated any faster than the INTERVAL specified, which slows things down a bit. Certainly an interval such as 250ms could be used for things like flashing cursors, but animations need faster timeouts to be smooth. It seems as though there should be a way to bridge these two needs: we don't need to wake-up every 10ms and poll everything just to know if a timer has expired. There should be a way to maintain timeouts large and small without having to gettimeofday()/ioctl()/poll() (+ all the stuff in user-land) every 10ms. It would seem that we could predict our next needed wakeup-time with a bit more granularity, and schedule the next wakeup for that time, ignoring the flurry of 10ms intervals in between. Alternately, use of an interval like 100ms could work in the current scheme, assuming that animations would be sped-up accordingly or that the system calls for polling fd though this loop could be avoided except when known to be needed.
Component: Browser-General → XP Apps
QA Contact: doronr → jrgm
Attached patch modified nsTimerGtk.h and nsTimerGtk.cpp (obsolete) (deleted) — Splinter Review
Attached code implements the timer code a bit differently: doesn't schedule a continual "every 10ms" interrupt, but calculates the time when the next interrupt is needed, and schedules the one, then after a timeout processing loop recomputes the next one needed, schedules that that one and so on. Advantage is that you're not continually scanning queues looking for work to do, so "about:blank" displays without any continual system activity, and a single flashing cursor only wakes-up 2x a second, rather than the current 100x. Also simplified some of the storage aspects to reduce redundant math and extra shifting of elements in the arrays, but stopped short of changing the array storage to a sorted list, which didn't seem worth the effort given the few elements being stored. Apologies in advance for my lack of C++ style, and potential for other problems introduced due to my ignorance of much of the overall architecture.
Can you use cvs diff -u to attach a diff?
Updated version of the same code, fixes a couple of bugs introduced in the first pass, provided as "cvs diff -u" output, as requested. Note that the previous attachment should be deleted by someone who has appropriate permissions -- as it does introduce some odd behavior. I'd be happy to share what I've learned in that area, should anyone care.
Comment on attachment 57317 [details] [diff] [review] modified nsTimerGtk.h and nsTimerGtk.cpp marking obsolete per author's request
Attachment #57317 - Attachment is obsolete: true
Does that patch need updating to deal with pavlov's timer rewrite? If not, could someone review it?
Keywords: qawantedpatch, review
The code that the patch touches doesn't exist anymore. I doubt that we're going to see this behaviour anymore unless pav's timer implementation is really bad.
Target Milestone: --- → Future
Status: NEW → ASSIGNED
Dan, are you still seeing this? Is this now worksforme?
Ran tusc on a build of the 0.9.8 milestone, and I no longer see the looping behavior. Displaying a static page (about:) puts all threads asleep quietly in a pthread_cond_wait(), and with a flashing cursor it cycles through poll/select only the 2x per second as needed to toggle the cursor. Definitely appears fixed with these changes -- certainly much less "busy" than before.
Is
Sorry about the last spam. Is this problem fixed by the patch of bug 78611?
Keywords: perf
What's the current status of this one ?
Tried strace on Netscape7.0 pr1 on Linux, still saw the looping of poll() with timer 0. On the other hand, that does not happen on mozilla build (moz099, moz100). The looping also exists in Netscape 7.0 on Solaris (built by Sun).
Adding myself to the cc list.
Couldn't find anyone on IRC who cared, and this is the nearest bug I can find to what I just saw, so I'll dump it here. Mozilla (today's trunk CVS, Linux) is "visibly" idle, but one of its threads is eating up 100% CPU. The top of the profile looks like : 0000d93c 10174 1.08509 PR_GetIdentitiesLayer /home/moz/bin/mozill/bin/libnspr4.so 0000d250 10672 1.13821 pl_DefPoll /home/moz/bin/mozill/bin/libnspr4.so 00069208 11749 1.25307 nsHttpConnection::OnDataAvailable(nsIRequest *, nsISupports *, nsIInputStream *, unsigned int, unsigned int) /home/moz/bin/mozill/bin/components/libnecko.so 000d0890 12515 1.33477 __poll /lib/libc-2.2.so 0003397c 13511 1.441 nsSocketTransport::CheckForTimeout(unsigned int) /home/moz/bin/mozill/bin/components/libnecko.so 00009734 15653 1.66945 pthread_equal /lib/libpthread-0.9.so 00022f94 18715 1.99602 PR_Now /home/moz/bin/mozill/bin/libnspr4.so 001057a0 20071 2.14064 __divdi3 /lib/libc-2.2.so 00022ffc 21801 2.32515 _PR_UNIX_GetInterval /home/moz/bin/mozill/bin/libnspr4.so 00016e08 22224 2.37027 PR_IntervalNow /home/moz/bin/mozill/bin/libnspr4.so 0001c654 22523 2.40216 PR_Lock /home/moz/bin/mozill/bin/libnspr4.so 00037d38 23254 2.48012 nsSocketReadRequest::OnRead(void) /home/moz/bin/mozill/bin/components/libnecko.so 00008130 23918 2.55094 __pthread_mutex_unlock /lib/libpthread-0.9.so 00007e14 24313 2.59307 __pthread_mutex_lock /lib/libpthread-0.9.so 0001cc00 24593 2.62293 PR_ExitMonitor /home/moz/bin/mozill/bin/libnspr4.so 0001cbb0 29559 3.15257 PR_EnterMonitor /home/moz/bin/mozill/bin/libnspr4.so 0003917c 30116 3.21198 nsSocketTransportService::Run(void) /home/moz/bin/mozill/bin/components/libnecko.so 00033a24 32087 3.42219 nsSocketTransport::Process(short) /home/moz/bin/mozill/bin/components/libnecko.so 00034914 33524 3.57545 nsSocketTransport::doReadWrite(short) /home/moz/bin/mozill/bin/components/libnecko.so 0001c688 34873 3.71933 PR_Unlock /home/moz/bin/mozill/bin/libnspr4.so 000797f0 37389 3.98767 nsHttpPipeline::OnDataReadable(nsIInputStream *) /home/moz/bin/mozill/bin/components/libnecko.so 0000b758 62826 6.70062 __pthread_unlock /lib/libpthread-0.9.so 000096c8 78942 8.41945 pthread_self /lib/libpthread-0.9.so 0000b564 83847 8.94258 __pthread_lock /lib/libpthread-0.9.so 00020138 89553 9.55115 _pr_poll_with_poll /home/moz/bin/mozill/bin/libnspr4.so 000a4ed0 92566 9.8725 __gettimeofday /lib/libc-2.2.so (fields are : vma, nr. samples, percent of total, symbol, binary/library) This was causing billions of sys_gettimeofday()/sys_poll() calls I would have done a strace, but trying to attach to the run away process stopped the problem (damn). I wasn't doing anything particularly unusual: news, bugzilla, some website.
Product: Core → Mozilla Application Suite
This bug seems to be quite ... dead. Can anyone report this is still (or is not) a perf problem on a current build?
Bug 371398 seems to be the similiar to this one for Firefox.
Assignee: waterson → jag
Status: ASSIGNED → NEW
QA Contact: jrgmorrison
Target Milestone: Future → ---
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: