Closed Bug 953435 Opened 11 years ago Closed 10 years ago

Browser hang on Mac if an AfterProcessNextEvent callback tries to spin the event loop

Categories

(Core :: Widget: Cocoa, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
mozilla29

People

(Reporter: bzbarsky, Assigned: smaug)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(3 files, 2 obsolete files)

Attached file Testcase that shows the problem (deleted) —
The attached testcase seems to completely freeze my browser on Mac: we never respond to clicks on the "OK" button of the alert.  The only weird thing I see in here is that we're spinning the event loop from inside this stack:

#61 0x000000010565565d in nsDOMMutationObserver::HandleMutations () at nsDOMMutationObserver.h:390
#62 0x00000001056457b4 in nsXPConnect::AfterProcessNextEvent (this=0x10060bbe0, aThread=0x100615520, aRecursionDepth=0, aEventWasProcessed=true) at nsXPConnect.cpp:1130
#63 0x0000000105645b84 in non-virtual thunk to nsXPConnect::AfterProcessNextEvent(nsIThreadInternal*, unsigned int, bool) () at nsXPConnect.cpp:1246
#64 0x0000000103b79379 in nsThread::ProcessNextEvent (this=0x100615520, mayWait=false, result=0x7fff5fbfcb13) at nsThread.cpp:651
#65 0x0000000103a8255b in NS_ProcessPendingEvents (thread=0x100615520, timeout=20) at nsThreadUtils.cpp:210

via the DOM mutation observer calling alert().

I'm not sure whether this is cross-platform or Mac-only...  Olli, do you see the same issue on Linux?
Flags: needinfo?(bugs)
Works fine on linux.
Flags: needinfo?(bugs)
And works fine on Windows too.
Somewhat similar to Bug 397439?
Component: XPCOM → Widget: Cocoa
Flags: needinfo?(smichaud)
A possible patch coming, but can't test it since I don't have a (working) mac.
Attached patch something like this (obsolete) (deleted) — Splinter Review
May or may not compile on OSX, and may or may not fix the issue :)

GetIsProcessingEvents could be made thread-safe, if needed (by using atomic counter), but other similar things aren't thread-safe either there.

https://tbpl.mozilla.org/?tree=Try&rev=5bb11e763dc0
Summary: Browser hang (at least on Mac) if an AfterProcessNextEvent callback tries to spin the event loop → Browser hang on Mac if an AfterProcessNextEvent callback tries to spin the event loop
And it doesn't compile.
Attached patch v2 (obsolete) (deleted) — Splinter Review
Maybe this
https://tbpl.mozilla.org/?tree=Try&rev=e07a3ef2a8be
Attachment #8351740 - Attachment is obsolete: true
v2 fixes the bug for me once I fix the presumably-backwards test in the assert in GetIsProcessingEvents.
Oops, the assert in the patch is indeed wrong
Attached patch +asserts fixed (deleted) — Splinter Review
The patch makes Cocoa appshell less fragile, by making it use generic thread API.
isProcessingEvents is similar, but not quite the same to recursionDepth.

(I think we should merge nsIThreadInternal to nsIThread)
Assignee: nobody → bugs
Attachment #8351743 - Attachment is obsolete: true
Attachment #8354929 - Flags: review?(smichaud)
Attachment #8354929 - Flags: review?(nfroyd)
(And yes, isProcessingEvents is a bit odd, given that it isn't threadsafe, but for now it should be enough.)
Comment on attachment 8354929 [details] [diff] [review]
+asserts fixed

Review of attachment 8354929 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the changes below.

::: xpcom/threads/nsIThread.idl
@@ +83,5 @@
>     */
>    boolean processNextEvent(in boolean mayWait);
> +
> +  /**
> +   * true if we're processing runnables or thread observers.

Please add a blurb about only accessing this on the current thread.

I see that nsThread::GetRecursionDepth, which is similar to this, MOZ_ASSERTs for mismatched threads, which seems a little more likely to catch errors.  Please use MOZ_ASSERT instead of warning.  (Bonus points for making this attribute [infallible], then.)

::: xpcom/threads/nsThread.h
@@ +149,5 @@
>    PRThread *mThread;
>    uint32_t  mRunningEvent;  // counter
>    uint32_t  mStackSize;
>  
> +  uint32_t  mIsProcessingEvents;

Naming this mProcessingEvent for parallelism with mRunningEvent seems like a better choice.
Attachment #8354929 - Flags: review?(nfroyd) → review+
nsThread::GetRecursionDepth does *not* assert. That is the reason why I removed
assert from my patch and made it consistent with recursionDepth
(In reply to Olli Pettay [:smaug] from comment #13)
> nsThread::GetRecursionDepth does *not* assert. That is the reason why I
> removed
> assert from my patch and made it consistent with recursionDepth

Oh, whoops, you're right.  I was either looking at an old version or was just confused.  Carry on!
Comment on attachment 8354929 [details] [diff] [review]
+asserts fixed

This looks fine to me, but I'd like to test it.  I should be able to do that later today or tomorrow.
Flags: needinfo?(smichaud)
Comment on attachment 8354929 [details] [diff] [review]
+asserts fixed

I just tested this in current trunk code, and everything went as expected.  This bug's testcase "froze" Firefox without the patch, but didn't with the patch.

Conceptually, using nsIThread::isProcessingEvents is exactly what I wanted to do from the beginning.  But I convinced myself that the "recursion depth" is a good proxy.  And indeed it seems to be, since this code is old, and we don't seem to have had much trouble with it since it landed.

What I don't understand, though, is exactly how/why using nsIThread::isProcessingEvents can sometimes give a different result from using the "recursion depth".  Can anyone here explain?
Attachment #8354929 - Flags: review?(smichaud) → review+
recursionDepth value is decreased before calling the thread observers.
https://hg.mozilla.org/mozilla-central/rev/e176126d2b42
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Depends on: 959281
Depends on: 962942
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e51a10dee23
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 977856
Do we want to track this as a talos regression?  This was backed out and the original patch is a performance imporovement.

I am just following up on all performance bugs that have now migrated to Aurora.
Flags: needinfo?(bugs)
We don't want to track the talos regression.
Flags: needinfo?(bugs)
thanks, I have removed the whiteboard flag.
Whiteboard: [talos_regression]
Smaug, could you reland this bug's patch on trunk (or mozilla-inbound), now that I've landed a bandaid patch for bug 959281?

It will ultimately need to be backed out again when a patch for bug 996848 lands (since that patch will make it redundant).

We can look out for another talos regression -- but I suspect the bandaid patch from bug 959281 will prevent it from happening.
Flags: needinfo?(bugs)
Feel free to land, or I'll land tomorrow.
Go ahead and land it tomorrow.  I'm at the end of my day and tired :-)
Flags: needinfo?(bugs)
Do we still need some variant of this? I don't have access to any OSX devices?

nsAppShell.mm seems to have changed quite a bit, and the patch certainly doesn't apply anymore.
Flags: needinfo?(smichaud)
> Do we still need some variant of this?

As long as my patch for bug 996848 remains in the tree, we don't.

So far that patch has done pretty well.  But it's conceivable that some issue will arise that we can't fix (or can't fix quickly enough).  In that case we'd probably want to back it out and fall back to a combination of my bandaid patch for bug 959281 and your patch for this bug.
Flags: needinfo?(smichaud)
Let's close this WORKSFORME for the time being.
Status: REOPENED → RESOLVED
Closed: 11 years ago10 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: