575515 - Resizing or Dragging the window sometimes locks up the browser (apparently under heavy cpu usage)

Reporter

Description

•

14 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:2.0b2pre) Gecko/20100628 Minefield/4.0b2pre Firefox/3.6.6
Build Identifier: 

I am getting frequent lock-ups that usually happen when I try to resize the browser window or move it or sometimes when I click on a caption button.

I think its custom mouse events for the browser

Reproducible: Sometimes

Steps to Reproduce:

I don't have STR unless its try to resize the browser edges while starting the browser or its in CPU churning mode.
Actual Results:  
The mouse can be moved, the cursor stays in drag mode, the browser is locked.
Have to bring up task manager to try to focus another part of the window in order to get it to respond, but I cannot shut down the browser window.

Expected Results:  
Custom mouse cursors behave.

It tends to happen on the top border with the App menu, but sometimes I see it on other edges.. so not sure if its just app menu button specific at this time.

hjalte

Comment 1

•

14 years ago

Confirming.  Hard to tell what exactly is the causing this.

Mozilla/5.0 (Windows; U; Windows NT 6.1; Win64; x64; en-US; rv:1.9.3a6pre) Gecko/20100628 Minefield/3.7a6pre

[not reading bugmail]

Reporter

Comment 2

•

14 years ago

Additional Actual results, if trying resize, the cursor stays in resize cursor mode, but the browser window jitters and locks up and doesn't move.

hjalte

Comment 3

•

14 years ago

It just happened to me.  Opening up Performance Monitor, I saw Firefox was eating up 100% of one of my 4 cores, thus using ~25% CPU.

Also, pressing ctrl+shift+esc to open Task Manager un-freezes the browser.  Using the open web page works fine and everything seems normal (although CPU is still at about 25%), but then when I try to drag or resize the window the next time it locks up immediately.

[not reading bugmail]

Reporter

Updated

•

14 years ago

Blocks: 555081

blocking2.0: --- → ?

:Felipe Gomes (needinfo for replies!)

Comment 4

•

14 years ago

I've verified that it's not a regression from bug 555081.

STR:
 - open a heavy JS webpage (e.g. http://mrdoob.com/projects/chromeexperiments/depth_of_field/ )
 - resize the browser on any edge

What happens: Browser stops processing events, and other programs can't receive mouse input. Pressing alt+tab unfreezes.

[not reading bugmail]

Reporter

Comment 5

•

14 years ago

That testcase locks up the browser for me.  But I also lock up when left click drag the browser window about an inch or two and it also locks on a website like bugzilla.

[not reading bugmail]

Reporter

Comment 6

•

14 years ago

Maybe the latter happens on my original profile that I've had for a while.  I tried a new profile now with this most recent build which doesn't seem to be showing issues with click & drag on bugzilla.

Jim Mathies [:jimm]

Assignee

Updated

•

14 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Jim Mathies [:jimm]

Assignee

Comment 7

•

14 years ago

My guess is the hangs are related to our hit testing events getting sent up to content and getting hung up somehow. We've never really tested that.

Also, I believe we currently mouse hit test for the entire browser window, which in thinking about it is probably really hard on performance and is something we can totally avoid.

Jim Mathies [:jimm]

Assignee

Comment 8

•

14 years ago

filed bug 576182.

:Felipe Gomes (needinfo for replies!)

Comment 9

•

14 years ago

The hangs still happens without the hit testing. I think the custom handling doesn't have anything to do with that. Probably more the re-layout on resizing (and something else that can be trigged occasionally by moving the window)

Also, the events don't hit content. There's a flag to dispatch only to chrome, and they're only sent to windows with custom drawing (although this will need to be changed to support the glass toolbars on regular drawing windows)


I tried to do debug this but I couldn't find a way to do it.. I can get the browser hung up but when I try to switch to MSVC to pause execution, the browser unfreezes.

Watching Spy++ shows that events stop being delivered. Probably some reentrant event happening? Though I don't know why this would happen only on heavy processing situations. Trying to think up of something

[not reading bugmail]

Reporter

Comment 10

•

14 years ago

So not only was the issue causing the browser to stop responding in my main profile, after using alt-tab, then [x] to close the window, I was hitting profile lock and in use bug there too and I couldn't move my mouse down over my windows taskbar.

Dietrich Ayala (:dietrich)

Comment 11

•

14 years ago

Confirmed hang? Yeah, this blocks Fx4 final.

blocking2.0: ? → final+

:Felipe Gomes (needinfo for replies!)

Comment 12

•

14 years ago

I did some debugging on this today. The lock always happen right before the WM_WINDOWPOSCHANGING is received (note: not while processing it).  Unlocking the browser with the STR in comment #4 (alt tabbing) is followed by WM_WINDOWPOSCHANGING and WM_WINDOWPOSCHANGED.

I tried adding SWP_ASYNCWINDOWPOS to every SetWindowPos call, but that didn't make a difference.

It seems that this is a message deadlock, although I'm not sure what other thread could be calling or posting something to the UI thread's queue  (or my guess might be totally off, too...)

Summary: Custom Mouse event Resize or Drag and Click sometimes locks up the browser → Resizing or Dragging the window sometimes locks up the browser (apparently under heavy cpu usage)

Timothy Nikkel (:tnikkel)

Comment 13

•

14 years ago

Resizing on Windows puts us in the modal resize loop, which has caused us problems before.

Benoit Jacob [:bjacob] (mostly away)

Comment 14

•

14 years ago

I think that this bug could be a duplicate: Bug 598166.

I have found a very simple testcase that only involves 5 lines of JS (a setInterval timer and a big for loop) and no HTML at all. Attaching it to this bug report as well.

Benoit Jacob [:bjacob] (mostly away)

Comment 15

•

14 years ago

Attached file minimal test case (deleted) — Details

Timothy Nikkel (:tnikkel)

Comment 16

•

14 years ago

Yeah, they probably have the same underlying issue.

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

14 years ago

Product: Firefox → Core

QA Contact: general → general

:Felipe Gomes (needinfo for replies!)

Comment 18

•

14 years ago

Assigning this to myself as I've been looking into this and other similar bugs, though I'll probably still need help to find a general solution for this.

Here's my theory on how this started:

The reports for these appeared when the work for the new titlebar landed. However, the problem still happens regardless of using the custom titlebar or not.
So I think this is a fallout of attaching the view to the top level window, instead of a child window.

Before, when resizing the window, the modal resize loop would be kicked on the top level window, while Paint and all other events would still go to the child window inside. Now everything goes to the the same message queue and things go wrong (it could be messages incorrectly taking precedence or the thread never answering a message, I think)

Assignee: nobody → felipc

Status: NEW → ASSIGNED

Jim Mathies [:jimm]

Assignee

Comment 19

•

14 years ago

(In reply to comment #15)
> Created attachment 478027 [details]
> minimal test case

What are the STR here? The browser is using a lot of cpu, so everything in the app is lagged. If resize the window, things get screwy with the mouse point. If I let things sit for a second everything seems to get straightened out.

Is this "the bug"?

FWIW, I see the same behavior in 3.6.

:Felipe Gomes (needinfo for replies!)

Comment 20

•

14 years ago

(In reply to comment #19)
> (In reply to comment #15)
> > Created attachment 478027 [details] [details]
> > minimal test case
> 
> What are the STR here? The browser is using a lot of cpu, so everything in the
> app is lagged. If resize the window, things get screwy with the mouse point. If
> I let things sit for a second everything seems to get straightened out.
> 
> Is this "the bug"?

Yes, that's it. But it doesn't resolve itself for me, unless the app loses focus (by alt-tabbing or ctrl+alt+del)

> 
> FWIW, I see the same behavior in 3.6.

Oh ok, that's true. So my theory above can be ignored.


Also relevant: bug 491700
(I tried to make nsIWidget::HasPendingInputEvent to always return true to force scheduling reflow off of a timer, but it didn't help)

Benoit Jacob [:bjacob] (mostly away)

Comment 21

•

14 years ago

(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #15)
> > > Created attachment 478027 [details] [details] [details]
> > > minimal test case
> > 
> > What are the STR here? The browser is using a lot of cpu, so everything in the
> > app is lagged. If resize the window, things get screwy with the mouse point. If
> > I let things sit for a second everything seems to get straightened out.
> > 
> > Is this "the bug"?
> 
> Yes, that's it. But it doesn't resolve itself for me, unless the app loses
> focus (by alt-tabbing or ctrl+alt+del)

Indeed. Sometimes it comes back to normal after a few seconds, but sometimes it doesn't. Note that alt+tab is normally enough to restore normal behavior. While it is crazy, not only mouse clicks are ignored system-wide, but the mouse pointer is typically constrained above the taskbar.

Jim Mathies [:jimm]

Assignee

Comment 22

•

14 years ago

Attached file msg log (deleted) — Details

A very odd bug. We occasionally manage to confuse windows so that it thinks we're still in the sizing event, even though we've release the mouse. All the events eventually get delivered, but not without doing something to reset the desktop, like ctrl-alt-del. 

I bumped the delay up to 100, and added some code to trigger paints on each iteration. The block of dumps in the middle of this log represents about 30 seconds. During which javascript continues to execute, but windows isn't delivering events to our window at all. WM_MOUSELEAVE is the last event we receive before things go south.

Jim Mathies [:jimm]

Assignee

Comment 23

•

14 years ago

Attached file testcase variation (deleted) — Details

Timothy Nikkel (:tnikkel)

Comment 24

•

14 years ago

The modal resize loop might change the ordering of delivery of events. So if we keep posting an event it will always get processed before keyboard/mouse input from the OS, say. I think something like that was the cause of bug 491700.

Scoobidiver (away)

Updated

•

14 years ago

Blocks: 594387

Jim Mathies [:jimm]

Assignee

Comment 25

•

14 years ago

(In reply to comment #23)
> Created attachment 478137 [details]
> testcase variation

I can reproduce this easily in a nightly and 3.6 using this test case and moving the window using a drag on the titlebar. 3.6 though has an additional problem, the text doesn't repaint!

So I think it's safe to rule out all of our recent work.

We might be able to track down a regression range, although who knows how far back this problem goes.

Keywords: regression, regressionwindow-wanted

Jim Mathies [:jimm]

Assignee

Comment 27

•

14 years ago

Best guess for a possible culprit would be the modal loop related code we added in bug 420148 - 

http://hg.mozilla.org/mozilla-central/rev/990794df70da

I can't confirm though since the builds back then locked up on my test case. 

I can however confirm that the problem is that widget/src/windows/nsAppShell's ProcessNextNativeEvent is not getting called. We're not dispatch windowing events.

Scoobidiver (away)

Updated

•

14 years ago

No longer blocks: 594387

[not reading bugmail]

Reporter

Updated

•

14 years ago

Depends on: slowui

[not reading bugmail]

Reporter

Updated

•

14 years ago

Blocks: slowui

No longer depends on: slowui

:Felipe Gomes (needinfo for replies!)

Comment 28

•

14 years ago

That was very close, and it pointed in the right direction!

I backed out bug 420128 but it didn't solve the problem. The real culprit is bug 389931.  With it backed out it (and 420128 too, as it depends on it), the bug goes away.

There's some relevant discussion over that bug, and there's this in the patch:

+  if (mEventloopNestingState == eEventloopXPCOM) {
+    mEventloopNestingState = eEventloopOther;
+    // XXX there is a tiny risk we will never get a new NativeEventCallback,
+    // XXX see discussion in bug 389931.


I'm gonna check if this is the case that we're hitting, or if it's something else

Jim Mathies [:jimm]

Assignee

Updated

•

14 years ago

Blocks: 389931

:Felipe Gomes (needinfo for replies!)

Comment 29

•

14 years ago

Attached patch timer patch (obsolete) (deleted) — Details — Splinter Review

So after reading the discussion on bug 389931 and some of the follow-ups, there was talk about a situation in which the event queue could run out of native events to keep the event loop going (see bug 389931 comment 66).

The idea to fix it would be to set up a timer to recover from that situation, and a follow up bug was created for that (bug 419630), so I tried to write a simple patch to implement that and test the hypothesis. I'm attaching it here, but it didn't solve the problem. It can be either that: a) I got something wrong on the patch, or b) this isn't the situation happening here.

I don't know yet, but I believe it's not quite the same situation. As all the various STR involve something processor intensive, looks like a starvation is happening due to some events being constantly posted and taking higher precendence than the modal native loop that needs to be unwound.

:Felipe Gomes (needinfo for replies!)

Comment 31

•

14 years ago

So here's a sketch of the starved situation

User clicks on the resizer border:

  WM_NCLBUTTONDOWN
   -> DefWindowProc
       -> generates WM_SYSCOMMAND (wParam=WC_SIZE)
          -> DefWindowProc
             -> modal resize loop directly calls nsAppShell::EventWindowProc
             ^  -> msg is always "nsAppShell::EventID" which
             |  |  calls NativeEventCallback
             |__|

The modal loop gives priority to posted messages, so it seems that our tick message is always processed before even though there are pending input messages.

Comments 17-19 from bug 491700 basically explains the situation here.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 32

•

14 years ago

Maybe in a resize loop, we should use a zero-duration timer to drive the Gecko events instead of a posted message.

Dão Gottwald [:dao]

Updated

•

14 years ago

Blocks: 582950

Jim Mathies [:jimm]

Assignee

Comment 35

•

14 years ago

another good test case:


http://cubicvr.org/CubicVR.js/bd3/BeatDetektor3HD.html

:Felipe Gomes (needinfo for replies!)

Comment 36

•

14 years ago

Unassigning as I don't understand the appshell code well enough to invest more time on this bug

Assignee: felipc → nobody

Status: ASSIGNED → NEW

Benoit Jacob [:bjacob] (mostly away)

Comment 37

•

14 years ago

(In reply to comment #36)
> Unassigning as I don't understand the appshell code well enough to invest more
> time on this bug

How/where can we find somebody who does? This is a really important bug, it feels uncomfortable that it's not assigned!

Jim Mathies [:jimm]

Assignee

Comment 38

•

14 years ago

i'll take a shot at it.

Assignee: nobody → jmathies

Jim Mathies [:jimm]

Assignee

Comment 39

•

14 years ago

Can someone with driver privs mark this as blocking betaN?

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

14 years ago

blocking2.0: final+ → betaN+

Jim Mathies [:jimm]

Assignee

Updated

•

14 years ago

Severity: critical → normal

Benoit Jacob [:bjacob] (mostly away)

Comment 40

•

14 years ago

Why was this bug demoted from 'critical' status? It's been reported many times and has a testcase that shows that 5 lines of general JS can trigger it; and for many users it looks enough like a windows crash that they'll probably just force-shutdown their machines and we'll never know about it.

Jim Mathies [:jimm]

Assignee

Comment 41

•

14 years ago

This has been a real pita, I still don't have a working fix.

I've experimented with timers, but since a timer's Fire is triggered via a gecko event, which get scheduled through ScheduleNativeEventCallback, I can't remove the use of post message during these loops. I've also experimented with processing windows events via a short timer as well, but it seems the windows resize loop filters mouse events. The only event we can process is paint. 

I've also tried using lower priority events in ScheduleNativeEventCallback during resize loops (like wm_timer), but found the latency was too high.

I'll post the experimental work I've done later today. If anyone has any ideas, I'd welcome the input.

Benoit Jacob [:bjacob] (mostly away)

Comment 42

•

14 years ago

Users are also reporting it there: https://support.mozilla.com/en-US/questions/764752

Jim Mathies [:jimm]

Assignee

Comment 43

•

14 years ago

Attached patch appshell patch v.1 (obsolete) (deleted) — Details — Splinter Review

I see a glimmer of hope! Need to do some more testing.

Attachment #479325 - Attachment is obsolete: true

Jim Mathies [:jimm]

Assignee

Comment 44

•

14 years ago

Attached patch appshell patch v.2 (obsolete) (deleted) — Details — Splinter Review

One more time, without the unused timer crud.

Attachment #491264 - Attachment is obsolete: true

Jim Mathies [:jimm]

Assignee

Comment 45

•

14 years ago

Attached patch appshell patch v.3 (obsolete) (deleted) — Details — Splinter Review

Will kick off reviews if everything looks ok on a try run.

Attachment #491267 - Attachment is obsolete: true

Jim Mathies [:jimm]

Assignee

Comment 46

•

14 years ago

Comment on attachment 491317 [details] [diff] [review]
appshell patch v.3

Ok, so this worked great, but in testing I found the problem exists in all modal loops, not just window interactions. So I've generalized this down to apply to all cases.  I'll post a final after I do some more testing and get some try runs.

Attachment #491317 - Attachment is obsolete: true

Jim Mathies [:jimm]

Assignee

Comment 47

•

14 years ago

Attached patch patch v.1 (deleted) — Details — Splinter Review

Jim Mathies [:jimm]

Assignee

Comment 48

•

14 years ago

try builds: http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/jmathies@mozilla.com-6e8d2f1d68d0/

Jim Mathies [:jimm]

Assignee

Comment 49

•

14 years ago

Comment on attachment 491430 [details] [diff] [review]
patch v.1

Mats, you've worked on this code quite a bit in the past, care to do the review?

Attachment #491430 - Flags: review?(matspal)

Mats Palmgren (inactive)

Comment 50

•

14 years ago

I'm reviewing this.  This code is hairy already as it is so it may take
a day or two...

Jim Mathies [:jimm]

Assignee

Comment 51

•

14 years ago

(In reply to comment #50)
> I'm reviewing this.  This code is hairy already as it is so it may take
> a day or two...

Now that plugins are out of process, I wonder if some of the past work here could be removed. For example bug 420148 may not be needed anymore.

Maybe after we branch we should do some investigative work on simplifying this code down.

Mats Palmgren (inactive)

Comment 53

•

14 years ago

(In reply to comment #51)
> Maybe after we branch we should do some investigative work on simplifying this
> code down.

Yeah, I agree, there's too many layers of band-aids here and it has become
quite unmanageable.  And there's still bugs lurking... found a new one
while testing your patch, filed as bug 614165.

Mats Palmgren (inactive)

Comment 54

•

14 years ago

Comment on attachment 491430 [details] [diff] [review]
patch v.1

>+  // See DoProcessNextNativeEvent, mEventloopNestingLevel will be
>+  // one when a modal loop unwinds.
>+  if (mNativeCallbackPending && mEventloopNestingLevel == 1)
>+    DoProcessMoreGeckoEvents();

I think this bug can occur at an arbitrary event loop depth,
ie (mEventloopNestingLevel & 1).
Probably not worth making the patch more complex to handle
that though.

I noted that there are still calls to ScheduleNativeEventCallback()
while mNativeCallbackPending is TRUE, eg from nsThread::PutEvent.
Are you letting those through intentionally?


Nits:
In all comments you add: s/windows/Windows to make it clear
you're not referring to XUL windows or nsWindows or whatever.

>   PRInt32 mSuspendNativeCount;
> 
>+protected:
>+  PRUint32 mEventloopNestingLevel;

You're already in a protected section there.
No need for the line separating it from mSuspendNativeCount.

>+// XXX This is overidden on windows, see comments in nsAppShell for

Drop the XXX (nothing needs fixing there AFAICT)

Attachment #491430 - Flags: review?(matspal) → review+

Jim Mathies [:jimm]

Assignee

Comment 55

•

14 years ago

(In reply to comment #54)
> Comment on attachment 491430 [details] [diff] [review]
> I noted that there are still calls to ScheduleNativeEventCallback()
> while mNativeCallbackPending is TRUE, eg from nsThread::PutEvent.
> Are you letting those through intentionally?

No I didn't, is that the cause of bug 614165? I would have expected this patch to address that, since multiple modal loops would get trapped by mEventloopNestingLevel >= 2.

> 
> 
> Nits:
> In all comments you add: s/windows/Windows to make it clear
> you're not referring to XUL windows or nsWindows or whatever.
> 
> >   PRInt32 mSuspendNativeCount;
> > 
> >+protected:
> >+  PRUint32 mEventloopNestingLevel;
> 
> You're already in a protected section there.
> No need for the line separating it from mSuspendNativeCount.
> 
> >+// XXX This is overidden on windows, see comments in nsAppShell for
> 
> Drop the XXX (nothing needs fixing there AFAICT)

Will do.

Mats Palmgren (inactive)

Comment 56

•

14 years ago

> > I noted that there are still calls to ScheduleNativeEventCallback()
> > while mNativeCallbackPending is TRUE, eg from nsThread::PutEvent.
> > Are you letting those through intentionally?
> 
> No I didn't, is that the cause of bug 614165? I would have expected this patch
> to address that, since multiple modal loops would get trapped by
> mEventloopNestingLevel >= 2.

The nesting level isn't the problem here.  The ScheduleNativeEventCallback()
calls I'm referring to comes from non-UI threads posting a gecko event
using nsThread::PutEvent, which then notifies the appshell's
OnDispatchedEvent.  So, if you truly want to block all calls to
ScheduleNativeEventCallback() you need to override OnDispatchedEvent
(or hack ScheduleNativeEventCallback) and do your DoProcessMoreGeckoEvents
magic there.

I'm not saying this is desirable though - I'm worried that blocking out
gecko event processing 100% will cause new problems.

I did a quick hack like so:
nsAppShell::ScheduleNativeEventCallback()
{
  if (mNativeCallbackPending) {
    PR_AtomicSet(&mNativeEventPending, 0);
    return;
  }
...

It feels slightly more responsive while resizing the window, but it also
seems to lead to more occurencies of the problem where the window still
resizes after releasing the button.
(This problem occurs also with your patch BTW, although less frequent.)

So, I'm happy with the patch as is.

(I'm pretty sure bug 614165 is an unrelated bug)

Jim Mathies [:jimm]

Assignee

Comment 57

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/85800bbf667b

I'll file a follow up on the PutEvent.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

:Ms2ger (he/him; ⌚ UTC+1/+2)

Updated

•

13 years ago

Depends on: 639864

Joshua Mitchell (Inactive)

Comment 59

•

10 years ago

Issue is resolved - clearing old keywords - qa-wanted clean-up

Keywords: regressionwindow-wanted

minimal test case 14 years ago Benoit Jacob [:bjacob] (mostly away) (deleted), text/html		Details
msg log 14 years ago Jim Mathies [:jimm] (deleted), text/plain		Details
testcase variation 14 years ago Jim Mathies [:jimm] (deleted), text/html		Details
timer patch 14 years ago :Felipe Gomes (needinfo for replies!) (deleted), patch		Details \| Diff \| Splinter Review
appshell patch v.1 14 years ago Jim Mathies [:jimm] (deleted), patch		Details \| Diff \| Splinter Review
appshell patch v.2 14 years ago Jim Mathies [:jimm] (deleted), patch		Details \| Diff \| Splinter Review
appshell patch v.3 14 years ago Jim Mathies [:jimm] (deleted), patch		Details \| Diff \| Splinter Review
patch v.1 14 years ago Jim Mathies [:jimm] (deleted), patch	MatsPalmgren_bugz : review+	Details \| Diff \| Splinter Review