Closed Bug 575515 Opened 14 years ago Closed 14 years ago

Resizing or Dragging the window sometimes locks up the browser (apparently under heavy cpu usage)

Categories

(Core :: General, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: mozbugz, Assigned: jimm)

References

Details

(Keywords: regression)

Attachments

(4 files, 4 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:2.0b2pre) Gecko/20100628 Minefield/4.0b2pre Firefox/3.6.6
Build Identifier: 

I am getting frequent lock-ups that usually happen when I try to resize the browser window or move it or sometimes when I click on a caption button.

I think its custom mouse events for the browser

Reproducible: Sometimes

Steps to Reproduce:

I don't have STR unless its try to resize the browser edges while starting the browser or its in CPU churning mode.
Actual Results:  
The mouse can be moved, the cursor stays in drag mode, the browser is locked.
Have to bring up task manager to try to focus another part of the window in order to get it to respond, but I cannot shut down the browser window.

Expected Results:  
Custom mouse cursors behave.

It tends to happen on the top border with the App menu, but sometimes I see it on other edges.. so not sure if its just app menu button specific at this time.
Confirming.  Hard to tell what exactly is the causing this.

Mozilla/5.0 (Windows; U; Windows NT 6.1; Win64; x64; en-US; rv:1.9.3a6pre) Gecko/20100628 Minefield/3.7a6pre
Additional Actual results, if trying resize, the cursor stays in resize cursor mode, but the browser window jitters and locks up and doesn't move.
It just happened to me.  Opening up Performance Monitor, I saw Firefox was eating up 100% of one of my 4 cores, thus using ~25% CPU.

Also, pressing ctrl+shift+esc to open Task Manager un-freezes the browser.  Using the open web page works fine and everything seems normal (although CPU is still at about 25%), but then when I try to drag or resize the window the next time it locks up immediately.
Blocks: 555081
blocking2.0: --- → ?
I've verified that it's not a regression from bug 555081.

STR:
 - open a heavy JS webpage (e.g. http://mrdoob.com/projects/chromeexperiments/depth_of_field/ )
 - resize the browser on any edge

What happens: Browser stops processing events, and other programs can't receive mouse input. Pressing alt+tab unfreezes.
That testcase locks up the browser for me.  But I also lock up when left click drag the browser window about an inch or two and it also locks on a website like bugzilla.
Maybe the latter happens on my original profile that I've had for a while.  I tried a new profile now with this most recent build which doesn't seem to be showing issues with click & drag on bugzilla.
Status: UNCONFIRMED → NEW
Ever confirmed: true
My guess is the hangs are related to our hit testing events getting sent up to content and getting hung up somehow. We've never really tested that.

Also, I believe we currently mouse hit test for the entire browser window, which in thinking about it is probably really hard on performance and is something we can totally avoid.
filed bug 576182.
The hangs still happens without the hit testing. I think the custom handling doesn't have anything to do with that. Probably more the re-layout on resizing (and something else that can be trigged occasionally by moving the window)

Also, the events don't hit content. There's a flag to dispatch only to chrome, and they're only sent to windows with custom drawing (although this will need to be changed to support the glass toolbars on regular drawing windows)


I tried to do debug this but I couldn't find a way to do it.. I can get the browser hung up but when I try to switch to MSVC to pause execution, the browser unfreezes.

Watching Spy++ shows that events stop being delivered. Probably some reentrant event happening? Though I don't know why this would happen only on heavy processing situations. Trying to think up of something
So not only was the issue causing the browser to stop responding in my main profile, after using alt-tab, then [x] to close the window, I was hitting profile lock and in use bug there too and I couldn't move my mouse down over my windows taskbar.
Confirmed hang? Yeah, this blocks Fx4 final.
blocking2.0: ? → final+
I did some debugging on this today. The lock always happen right before the WM_WINDOWPOSCHANGING is received (note: not while processing it).  Unlocking the browser with the STR in comment #4 (alt tabbing) is followed by WM_WINDOWPOSCHANGING and WM_WINDOWPOSCHANGED.

I tried adding SWP_ASYNCWINDOWPOS to every SetWindowPos call, but that didn't make a difference.

It seems that this is a message deadlock, although I'm not sure what other thread could be calling or posting something to the UI thread's queue  (or my guess might be totally off, too...)
Summary: Custom Mouse event Resize or Drag and Click sometimes locks up the browser → Resizing or Dragging the window sometimes locks up the browser (apparently under heavy cpu usage)
Resizing on Windows puts us in the modal resize loop, which has caused us problems before.
I think that this bug could be a duplicate: Bug 598166.

I have found a very simple testcase that only involves 5 lines of JS (a setInterval timer and a big for loop) and no HTML at all. Attaching it to this bug report as well.
Attached file minimal test case (deleted) —
Yeah, they probably have the same underlying issue.
Product: Firefox → Core
QA Contact: general → general
Assigning this to myself as I've been looking into this and other similar bugs, though I'll probably still need help to find a general solution for this.

Here's my theory on how this started:

The reports for these appeared when the work for the new titlebar landed. However, the problem still happens regardless of using the custom titlebar or not.
So I think this is a fallout of attaching the view to the top level window, instead of a child window.

Before, when resizing the window, the modal resize loop would be kicked on the top level window, while Paint and all other events would still go to the child window inside. Now everything goes to the the same message queue and things go wrong (it could be messages incorrectly taking precedence or the thread never answering a message, I think)
Assignee: nobody → felipc
Status: NEW → ASSIGNED
(In reply to comment #15)
> Created attachment 478027 [details]
> minimal test case

What are the STR here? The browser is using a lot of cpu, so everything in the app is lagged. If resize the window, things get screwy with the mouse point. If I let things sit for a second everything seems to get straightened out.

Is this "the bug"?

FWIW, I see the same behavior in 3.6.
(In reply to comment #19)
> (In reply to comment #15)
> > Created attachment 478027 [details] [details]
> > minimal test case
> 
> What are the STR here? The browser is using a lot of cpu, so everything in the
> app is lagged. If resize the window, things get screwy with the mouse point. If
> I let things sit for a second everything seems to get straightened out.
> 
> Is this "the bug"?

Yes, that's it. But it doesn't resolve itself for me, unless the app loses focus (by alt-tabbing or ctrl+alt+del)

> 
> FWIW, I see the same behavior in 3.6.

Oh ok, that's true. So my theory above can be ignored.


Also relevant: bug 491700
(I tried to make nsIWidget::HasPendingInputEvent to always return true to force scheduling reflow off of a timer, but it didn't help)
(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #15)
> > > Created attachment 478027 [details] [details] [details]
> > > minimal test case
> > 
> > What are the STR here? The browser is using a lot of cpu, so everything in the
> > app is lagged. If resize the window, things get screwy with the mouse point. If
> > I let things sit for a second everything seems to get straightened out.
> > 
> > Is this "the bug"?
> 
> Yes, that's it. But it doesn't resolve itself for me, unless the app loses
> focus (by alt-tabbing or ctrl+alt+del)

Indeed. Sometimes it comes back to normal after a few seconds, but sometimes it doesn't. Note that alt+tab is normally enough to restore normal behavior. While it is crazy, not only mouse clicks are ignored system-wide, but the mouse pointer is typically constrained above the taskbar.
Attached file msg log (deleted) —
A very odd bug. We occasionally manage to confuse windows so that it thinks we're still in the sizing event, even though we've release the mouse. All the events eventually get delivered, but not without doing something to reset the desktop, like ctrl-alt-del. 

I bumped the delay up to 100, and added some code to trigger paints on each iteration. The block of dumps in the middle of this log represents about 30 seconds. During which javascript continues to execute, but windows isn't delivering events to our window at all. WM_MOUSELEAVE is the last event we receive before things go south.
Attached file testcase variation (deleted) —
The modal resize loop might change the ordering of delivery of events. So if we keep posting an event it will always get processed before keyboard/mouse input from the OS, say. I think something like that was the cause of bug 491700.
Blocks: 594387
(In reply to comment #23)
> Created attachment 478137 [details]
> testcase variation

I can reproduce this easily in a nightly and 3.6 using this test case and moving the window using a drag on the titlebar. 3.6 though has an additional problem, the text doesn't repaint!

So I think it's safe to rule out all of our recent work.

We might be able to track down a regression range, although who knows how far back this problem goes.
Best guess for a possible culprit would be the modal loop related code we added in bug 420148 - 

http://hg.mozilla.org/mozilla-central/rev/990794df70da

I can't confirm though since the builds back then locked up on my test case. 

I can however confirm that the problem is that widget/src/windows/nsAppShell's ProcessNextNativeEvent is not getting called. We're not dispatch windowing events.
No longer blocks: 594387
Depends on: slowui
Blocks: slowui
No longer depends on: slowui
That was very close, and it pointed in the right direction!

I backed out bug 420128 but it didn't solve the problem. The real culprit is bug 389931.  With it backed out it (and 420128 too, as it depends on it), the bug goes away.

There's some relevant discussion over that bug, and there's this in the patch:

+  if (mEventloopNestingState == eEventloopXPCOM) {
+    mEventloopNestingState = eEventloopOther;
+    // XXX there is a tiny risk we will never get a new NativeEventCallback,
+    // XXX see discussion in bug 389931.


I'm gonna check if this is the case that we're hitting, or if it's something else
Blocks: 389931
Attached patch timer patch (obsolete) (deleted) — Splinter Review
So after reading the discussion on bug 389931 and some of the follow-ups, there was talk about a situation in which the event queue could run out of native events to keep the event loop going (see bug 389931 comment 66).

The idea to fix it would be to set up a timer to recover from that situation, and a follow up bug was created for that (bug 419630), so I tried to write a simple patch to implement that and test the hypothesis. I'm attaching it here, but it didn't solve the problem. It can be either that: a) I got something wrong on the patch, or b) this isn't the situation happening here.

I don't know yet, but I believe it's not quite the same situation. As all the various STR involve something processor intensive, looks like a starvation is happening due to some events being constantly posted and taking higher precendence than the modal native loop that needs to be unwound.
So here's a sketch of the starved situation

User clicks on the resizer border:

  WM_NCLBUTTONDOWN
   -> DefWindowProc
       -> generates WM_SYSCOMMAND (wParam=WC_SIZE)
          -> DefWindowProc
             -> modal resize loop directly calls nsAppShell::EventWindowProc
             ^  -> msg is always "nsAppShell::EventID" which
             |  |  calls NativeEventCallback
             |__|

The modal loop gives priority to posted messages, so it seems that our tick message is always processed before even though there are pending input messages.

Comments 17-19 from bug 491700 basically explains the situation here.
Maybe in a resize loop, we should use a zero-duration timer to drive the Gecko events instead of a posted message.
Blocks: 582950
Unassigning as I don't understand the appshell code well enough to invest more time on this bug
Assignee: felipc → nobody
Status: ASSIGNED → NEW
(In reply to comment #36)
> Unassigning as I don't understand the appshell code well enough to invest more
> time on this bug

How/where can we find somebody who does? This is a really important bug, it feels uncomfortable that it's not assigned!
i'll take a shot at it.
Assignee: nobody → jmathies
Can someone with driver privs mark this as blocking betaN?
blocking2.0: final+ → betaN+
Severity: critical → normal
Why was this bug demoted from 'critical' status? It's been reported many times and has a testcase that shows that 5 lines of general JS can trigger it; and for many users it looks enough like a windows crash that they'll probably just force-shutdown their machines and we'll never know about it.
This has been a real pita, I still don't have a working fix.

I've experimented with timers, but since a timer's Fire is triggered via a gecko event, which get scheduled through ScheduleNativeEventCallback, I can't remove the use of post message during these loops. I've also experimented with processing windows events via a short timer as well, but it seems the windows resize loop filters mouse events. The only event we can process is paint. 

I've also tried using lower priority events in ScheduleNativeEventCallback during resize loops (like wm_timer), but found the latency was too high.

I'll post the experimental work I've done later today. If anyone has any ideas, I'd welcome the input.
Attached patch appshell patch v.1 (obsolete) (deleted) — Splinter Review
I see a glimmer of hope! Need to do some more testing.
Attachment #479325 - Attachment is obsolete: true
Attached patch appshell patch v.2 (obsolete) (deleted) — Splinter Review
One more time, without the unused timer crud.
Attachment #491264 - Attachment is obsolete: true
Attached patch appshell patch v.3 (obsolete) (deleted) — Splinter Review
Will kick off reviews if everything looks ok on a try run.
Attachment #491267 - Attachment is obsolete: true
Comment on attachment 491317 [details] [diff] [review]
appshell patch v.3

Ok, so this worked great, but in testing I found the problem exists in all modal loops, not just window interactions. So I've generalized this down to apply to all cases.  I'll post a final after I do some more testing and get some try runs.
Attachment #491317 - Attachment is obsolete: true
Attached patch patch v.1 (deleted) — Splinter Review
Comment on attachment 491430 [details] [diff] [review]
patch v.1

Mats, you've worked on this code quite a bit in the past, care to do the review?
Attachment #491430 - Flags: review?(matspal)
I'm reviewing this.  This code is hairy already as it is so it may take
a day or two...
(In reply to comment #50)
> I'm reviewing this.  This code is hairy already as it is so it may take
> a day or two...

Now that plugins are out of process, I wonder if some of the past work here could be removed. For example bug 420148 may not be needed anymore.

Maybe after we branch we should do some investigative work on simplifying this code down.
(In reply to comment #51)
> Maybe after we branch we should do some investigative work on simplifying this
> code down.

Yeah, I agree, there's too many layers of band-aids here and it has become
quite unmanageable.  And there's still bugs lurking... found a new one
while testing your patch, filed as bug 614165.
Comment on attachment 491430 [details] [diff] [review]
patch v.1

>+  // See DoProcessNextNativeEvent, mEventloopNestingLevel will be
>+  // one when a modal loop unwinds.
>+  if (mNativeCallbackPending && mEventloopNestingLevel == 1)
>+    DoProcessMoreGeckoEvents();

I think this bug can occur at an arbitrary event loop depth,
ie (mEventloopNestingLevel & 1).
Probably not worth making the patch more complex to handle
that though.

I noted that there are still calls to ScheduleNativeEventCallback()
while mNativeCallbackPending is TRUE, eg from nsThread::PutEvent.
Are you letting those through intentionally?


Nits:
In all comments you add: s/windows/Windows to make it clear
you're not referring to XUL windows or nsWindows or whatever.

>   PRInt32 mSuspendNativeCount;
> 
>+protected:
>+  PRUint32 mEventloopNestingLevel;

You're already in a protected section there.
No need for the line separating it from mSuspendNativeCount.

>+// XXX This is overidden on windows, see comments in nsAppShell for

Drop the XXX (nothing needs fixing there AFAICT)
Attachment #491430 - Flags: review?(matspal) → review+
(In reply to comment #54)
> Comment on attachment 491430 [details] [diff] [review]
> I noted that there are still calls to ScheduleNativeEventCallback()
> while mNativeCallbackPending is TRUE, eg from nsThread::PutEvent.
> Are you letting those through intentionally?

No I didn't, is that the cause of bug 614165? I would have expected this patch to address that, since multiple modal loops would get trapped by mEventloopNestingLevel >= 2.

> 
> 
> Nits:
> In all comments you add: s/windows/Windows to make it clear
> you're not referring to XUL windows or nsWindows or whatever.
> 
> >   PRInt32 mSuspendNativeCount;
> > 
> >+protected:
> >+  PRUint32 mEventloopNestingLevel;
> 
> You're already in a protected section there.
> No need for the line separating it from mSuspendNativeCount.
> 
> >+// XXX This is overidden on windows, see comments in nsAppShell for
> 
> Drop the XXX (nothing needs fixing there AFAICT)

Will do.
> > I noted that there are still calls to ScheduleNativeEventCallback()
> > while mNativeCallbackPending is TRUE, eg from nsThread::PutEvent.
> > Are you letting those through intentionally?
> 
> No I didn't, is that the cause of bug 614165? I would have expected this patch
> to address that, since multiple modal loops would get trapped by
> mEventloopNestingLevel >= 2.

The nesting level isn't the problem here.  The ScheduleNativeEventCallback()
calls I'm referring to comes from non-UI threads posting a gecko event
using nsThread::PutEvent, which then notifies the appshell's
OnDispatchedEvent.  So, if you truly want to block all calls to
ScheduleNativeEventCallback() you need to override OnDispatchedEvent
(or hack ScheduleNativeEventCallback) and do your DoProcessMoreGeckoEvents
magic there.

I'm not saying this is desirable though - I'm worried that blocking out
gecko event processing 100% will cause new problems.

I did a quick hack like so:
nsAppShell::ScheduleNativeEventCallback()
{
  if (mNativeCallbackPending) {
    PR_AtomicSet(&mNativeEventPending, 0);
    return;
  }
...

It feels slightly more responsive while resizing the window, but it also
seems to lead to more occurencies of the problem where the window still
resizes after releasing the button.
(This problem occurs also with your patch BTW, although less frequent.)

So, I'm happy with the patch as is.

(I'm pretty sure bug 614165 is an unrelated bug)
http://hg.mozilla.org/mozilla-central/rev/85800bbf667b

I'll file a follow up on the PutEvent.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 639864
Issue is resolved - clearing old keywords - qa-wanted clean-up
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: