Closed
Bug 1368286
Opened 7 years ago
Closed 7 years ago
nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending processing
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla55
People
(Reporter: ehsan.akhgari, Assigned: ehsan.akhgari)
References
Details
Attachments
(1 file, 4 obsolete files)
(deleted),
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
STR:
1. Apply my patch in bug 1361461.
2. Apply my patch in bug 1368285.
3. Run `./mach xpcshell-test js/xpconnect/tests/unit/test_nuke_webextension_wrappers.js`
The test times out.
The reason this is happening is that during event processing we continually hit nsBaseAppShell::DispatchDummyEvent() <https://searchfox.org/mozilla-central/rev/596d188f6dbc8cb023e625f0a4310d184875f8fc/widget/nsBaseAppShell.cpp#299> and therefore every time nsThread::GetEvent() gets called for our main thread, we'd have something in mEvents, and we end up never looking at mIdleEvents as a result and starving the idle event that was dispatched.
This dummy event business is ancient and I'm actually not sure if we really need it any more. I posted a patch to the try server to see if we can fix this issue by simply removing this old mechanism: <https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab2ea816712de9dea2a214ad3cc684b8d585e9d0>
Assignee | ||
Comment 1•7 years ago
|
||
Try results look promising!
Assignee | ||
Comment 2•7 years ago
|
||
Attachment #8872109 -
Flags: review?(bugs)
Comment 3•7 years ago
|
||
Why do we end up calling DispatchDummeEvent all the time?
What code passes true as mayWait?
Couldn't DispatchDummyEvent dispatch the event to idle queue?
Comment 4•7 years ago
|
||
Ah, hrm, nsThread::HasPendingEvents doesn't take idle queue into account. I think it should.
Comment 5•7 years ago
|
||
Comment on attachment 8872109 [details] [diff] [review]
Don't dispatch a dummy event at the end of nsBaseAppShell::OnProcessNextEvent()
I don't think the reasons why the code was added have disappeared.
Attachment #8872109 -
Flags: review?(bugs) → review-
Assignee | ||
Comment 6•7 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #4)
> Ah, hrm, nsThread::HasPendingEvents doesn't take idle queue into account. I
> think it should.
Yes, that is indeed the root cause of the problem.
Assignee | ||
Updated•7 years ago
|
Summary: DummyEvent gets in the way of idle events running → nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending process
Assignee | ||
Updated•7 years ago
|
Summary: nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending process → nsThread::HasPendingEvents() will falsely claim there are no pending events, causing a blocking wait for native events while idle queue is pending processing
Assignee | ||
Comment 7•7 years ago
|
||
Attachment #8872420 -
Flags: review?(bugs)
Assignee | ||
Updated•7 years ago
|
Attachment #8872109 -
Attachment is obsolete: true
Updated•7 years ago
|
Attachment #8872420 -
Flags: review?(bugs) → review+
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b814165b471
Take the idle queue into account in nsThread::HasPendingEvents(); r=smaug
Comment 9•7 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Updated•7 years ago
|
Assignee: nobody → ehsan
Assignee | ||
Comment 10•7 years ago
|
||
Backed out because of bug 1369123.
https://hg.mozilla.org/mozilla-central/rev/edffa38ec0c5
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 11•7 years ago
|
||
An alternative approach here would be to make the dummy event special in some way. We could use QI to some interface, but a more performance way would be to do something like:
StaticRefPtr<nsIRunnable> sDummy;
bool
IsDummy(nsIRunnable* aRunnable)
{
return aRunnable == sDummy;
}
nsresult
DispatchDummy(nsIThread* aThread)
{
return aThread->Dispatch(sDummy, nsIThread::DISPATCH_NORMAL);
}
Then make idle processing do something like:
if (!HasPendingEvents() || (LengthIsOne() && IsDummy(FirstEvent())) {
// we're idle
}
That would avoid changing behavior for other code using HasPendingEvents().
Assignee | ||
Comment 12•7 years ago
|
||
[Tracking Requested - why for this release]: Idle queue events may end up being discarded...
tracking-firefox55:
--- → ?
Assignee | ||
Comment 13•7 years ago
|
||
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #12)
> [Tracking Requested - why for this release]: Idle queue events may end up
> being discarded...
Err, I meant to say starved.
Assignee | ||
Comment 14•7 years ago
|
||
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #11)
> An alternative approach here would be to make the dummy event special in
> some way.
Yeah, I guess, but it depends on the root cause of the bug, and I for one still at least don't really understand it. I actually realized I still have my old MacBook at home and spun up a recent build on it but haven't yet gotten a chance to test this myself. I will give it a shot to see if I can figure out where the root cause is. If the root cause is NS_HasPendingEvents() returning true in some cases confusing some other code this alternative idea may or may not help.
Assignee | ||
Updated•7 years ago
|
Tracking this for 55 to make sure we follow up to resolve the issue before release.
Assignee | ||
Comment 16•7 years ago
|
||
(Sorry for the length of the comment but the bug was complex.)
I debugged bug 1369123 on OSX. Here is my findings.
I set a breakpoint in nsThread::HasPendingEvents() based on the condition of mEvents->HasPendingEvent() returning false and mIdleEvents.HasPendingEvent() returning true, since this is the only condition where nsThread::HasPendingEvents() would do something different after my patch. The breakpoint got hit for the first time when I clicked the BuildAPI link after a few tries from the STR in bug 1369123 under this call stack:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
* frame #0: 0x0000000110e23492 XUL`nsThread::HasPendingEvents(this=0x000000010fc20800, aResult=0x00007fff519f3917) at nsThread.cpp:1051
frame #1: 0x0000000110e2c1fa XUL`hasPendingEvents(aThread=0x000000010fc20800) at nsThreadUtils.cpp:430
frame #2: 0x0000000110e2c1cd XUL`NS_HasPendingEvents(aThread=0x000000010fc20800) at nsThreadUtils.cpp:448
frame #3: 0x00000001157c5644 XUL`nsBaseAppShell::NativeEventCallback(this=0x000000010fc85b80) at nsBaseAppShell.cpp:104
frame #4: 0x0000000115862b92 XUL`nsAppShell::ProcessGeckoEvents(aInfo=0x000000010fc85b80) at nsAppShell.mm:399
frame #5: 0x00007fffbd318981 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
frame #6: 0x00007fffbd2f9a7d CoreFoundation`__CFRunLoopDoSources0 + 557
frame #7: 0x00007fffbd2f8f76 CoreFoundation`__CFRunLoopRun + 934
frame #8: 0x00007fffbd2f8974 CoreFoundation`CFRunLoopRunSpecific + 420
frame #9: 0x00007fffbc884a5c HIToolbox`RunCurrentEventLoopInMode + 240
frame #10: 0x00007fffbc884891 HIToolbox`ReceiveNextEventCommon + 432
frame #11: 0x00007fffbc8846c6 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71
frame #12: 0x00007fffbae2a5b4 AppKit`_DPSNextEvent + 1120
frame #13: 0x00007fffbb5a4d6b AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 2789
frame #14: 0x0000000115861684 XUL`::-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x000000010e4ea740, _cmd="nextEventMatchingMask:untilDate:inMode:dequeue:", mask=18446744073709551615, expiration=4001-01-01 00:00:00 UTC, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:130
frame #15: 0x00007fffbae1ef35 AppKit`-[NSApplication run] + 926
frame #16: 0x000000011586360c XUL`nsAppShell::Run(this=0x000000010fc85b80) at nsAppShell.mm:673
frame #17: 0x00000001183e15fb XUL`nsAppStartup::Run(this=0x000000012a9e1a10) at nsAppStartup.cpp:283
frame #18: 0x00000001185361cd XUL`XREMain::XRE_mainRun(this=0x00007fff519f5b88) at nsAppRunner.cpp:4571
frame #19: 0x000000011853775a XUL`XREMain::XRE_main(this=0x00007fff519f5b88, argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at nsAppRunner.cpp:4751
frame #20: 0x0000000118537edc XUL`XRE_main(argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at nsAppRunner.cpp:4846
frame #21: 0x000000011854b2c7 XUL`mozilla::BootstrapImpl::XRE_main(this=0x000000010e413138, argc=5, argv=0x00007fff519f6218, aConfig=0x00007fff519f5d58) at Bootstrap.cpp:45
frame #22: 0x000000010e20a3d6 firefox`do_main(argc=5, argv=0x00007fff519f6218, envp=0x00007fff519f6248) at nsBrowserApp.cpp:236
frame #23: 0x000000010e209eb1 firefox`main(argc=5, argv=0x00007fff519f6218, envp=0x00007fff519f6248) at nsBrowserApp.cpp:309
frame #24: 0x00007fffd2846255 libdyld.dylib`start + 1
frame #25: 0x00007fffd2846255 libdyld.dylib`start + 1
This call comes from this code: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#104>
Previously we would return false here but now we would go ahead and call DoProcessMoreGeckoEvents(). This is what our state looks like at this point:
(lldb) p *this
(nsBaseAppShell) $4 = {
mRefCnt = {
mValue = 8
}
_mOwningThread = (mThread = 0x000000010e4511a0)
mSuspendNativeCount = 0
mEventloopNestingLevel = 1
mDummyEvent = (mRawPtr = 0x0000000000000000)
mBlockedWait = 0x0000000000000000
mFavorPerf = 0
mNativeEventPending = {
mozilla::detail::AtomicBase<unsigned int, mozilla::MemoryOrdering::SequentiallyConsistent> = {
mValue = 1
}
}
mStarvationDelay = 0
mSwitchTime = 81294028
mLastNativeEventTime = 82328514
mEventloopNestingState = eEventloopNone
mRunning = false
mExiting = false
mBlockNativeEvent = false
mProcessedGeckoEvents = true
}
Then we end up returning early from here: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#224> without scheduling another native event callback.
I then looked at what happens when we get to nsBaseAppShell::OnProcessNextEvent() the next time. We start off with this state:
(lldb) p *this
(nsBaseAppShell) $10 = {
mRefCnt = {
mValue = 8
}
_mOwningThread = (mThread = 0x000000010e4511a0)
mSuspendNativeCount = 0
mEventloopNestingLevel = 1
mDummyEvent = (mRawPtr = 0x0000000000000000)
mBlockedWait = 0x0000000000000000
mFavorPerf = 0
mNativeEventPending = {
mozilla::detail::AtomicBase<unsigned int, mozilla::MemoryOrdering::SequentiallyConsistent> = {
mValue = 1
}
}
mStarvationDelay = 0
mSwitchTime = 81294028
mLastNativeEventTime = 82328514
mEventloopNestingState = eEventloopNone
mRunning = false
mExiting = false
mBlockNativeEvent = false
mProcessedGeckoEvents = true
}
We first start to process native events but we give up here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/cocoa/nsAppShell.mm#613>, so moreEvents will remain false and ProcessNextNativeEvent() will return false. So we bail out from processing native events and get to here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#281> and run NS_HasPendingEvents() again. At this point NS_HasPendingEvents() will still return true so the condition here will evaluate to false and we won’t enter the loop. Similarly for the condition here <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#298>.
From this point on, the program never reaches this point <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/cocoa/nsAppShell.mm#621> so no further native event processing happens any more, we will only process XPCOM events from this point on.
I think the problem is that this call site to NS_HasPendingEvents() <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/nsBaseAppShell.cpp#104> is really special. See this comment about its semantics: <http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/windows/nsAppShell.cpp#287> Basically this code is trying to see if we hit the starvation limit and if so it’s trying to schedule another native callback to see try to run the rest of the Gecko events in the next run of the callback. However if we end up scheduling this next native callback in the precise window of time when we only have idle events in our queue and nothing in nsThread::mEvents, we get into this weird scenario on OSX at least (and possibly into other unexpected situations on other OSes, I didn’t try to debug this elsewhere.) I think the right way to fix this is to add a way to query the state of the non-idle queue of the nsThread and use it for this call site.
Assignee | ||
Comment 17•7 years ago
|
||
nsBaseAppShell::NativeEventCallback()'s particular semantics makes it care about
the non-idle event queue specifically so we need to make that particular consumer
not take the idle queue into account.
Attachment #8873923 -
Flags: review?(bugs)
Assignee | ||
Updated•7 years ago
|
Attachment #8872420 -
Attachment is obsolete: true
Comment 18•7 years ago
|
||
I have no idea why the patch makes any sense. Before idle stuff all those runnables were in the normal event queue, why does it matter that they are now in idle queue?
Does the patch fix the issue by accident only?
Comment 19•7 years ago
|
||
Comment on attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately
By default this looks wrong, but please explain why this is the right approach.
The comment in nsBaseAppShell::NativeEventCallback() doesn't explain _why_ this is needed.
Attachment #8873923 -
Flags: review?(bugs)
Assignee | ||
Comment 20•7 years ago
|
||
The last paragraph in comment 16 explains the reason. Please read that entire comment before reviewing, otherwise the patych looks arbitrary.
Assignee | ||
Updated•7 years ago
|
Attachment #8873923 -
Flags: review?(bugs)
Comment 21•7 years ago
|
||
Comment on attachment 8873923 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents() and special case nsBaseAppShell::NativeEventCallback() appropriately
Nothing explained the reason, and I'm pretty the issue is that mIdleEvents in nsThread doesn't use nsEventQueue::eSharedCondVarQueue mode, as the normal queue and queue for vsync.
Attachment #8873923 -
Flags: review?(bugs)
Assignee | ||
Updated•7 years ago
|
Assignee | ||
Comment 22•7 years ago
|
||
Attachment #8874470 -
Flags: review?(bugs)
Assignee | ||
Updated•7 years ago
|
Attachment #8873923 -
Attachment is obsolete: true
Comment 23•7 years ago
|
||
Comment on attachment 8874470 [details] [diff] [review]
Take the idle queue into account in nsThread::HasPendingEvents()
>+TimeStamp
>+nsThread::GetIdleDeadline()
>+{
>+ TimeStamp idleDeadline;
>+ {
>+ // Releasing the lock temporarily since getting the idle period
>+ // might need to lock the timer thread. Unlocking here might make
>+ // us receive an event on the main queue, but we've committed to
>+ // run an idle event anyhow.
>+ MutexAutoUnlock unlock(mLock);
>+ mIdlePeriod->GetIdlePeriodHint(&idleDeadline);
>+ }
>+
>+ // If HasPendingEvents() has been called and it has returned true because of
>+ // pending idle events, there is a risk that we may decide here that we aren't
>+ // idle and return null, in which case HasPendingEvents() has effectively
>+ // lied. Since we can't go back and fix the past, we have to adjust what we
>+ // do here and forcefully pick the idle queue task here. Note that this means
>+ // that we are choosing to run a task from the idle queue when we would
>+ // normally decide that we aren't in an idle period, but this can only happen
>+ // if we fall out of the idle period in between the call to HasPendingEvents()
>+ // and here, which should hopefully be quite rare. We are effectively
>+ // choosing to prioritize the sanity of our API semantics over the optimal
>+ // scheduling.
>+ if (!mHasPendingEventsPromisedIdleEvent &&
>+ (!idleDeadline || idleDeadline < TimeStamp::Now())) {
>+ return TimeStamp();
>+ }
>+ return idleDeadline;
>+}
This can't work. GetIdlePeriodHint may return IsNull timestamp,
so GetIdleEvent might return null after all. You really need to return some non-null timestamp
when mHasPendingEventsPromisedIdleEvent is true. returning TimeStamp::Now() would be probably fine, or some dummy
now + 1ms
> NS_IMETHODIMP
> nsThread::HasPendingEvents(bool* aResult)
> {
>+ mHasPendingEventsPromisedIdleEvent = false;
>+
> if (NS_WARN_IF(PR_GetCurrentThread() != mThread)) {
> return NS_ERROR_NOT_SAME_THREAD;
> }
>
> {
> MutexAutoLock lock(mLock);
Could you move mHasPendingEventsPromisedIdleEvent = false;
here, after autolock
Attachment #8874470 -
Flags: review?(bugs) → review-
Assignee | ||
Comment 24•7 years ago
|
||
Attachment #8874493 -
Flags: review?(bugs)
Assignee | ||
Updated•7 years ago
|
Attachment #8874470 -
Attachment is obsolete: true
Comment 25•7 years ago
|
||
I re-applied the original patch that landed in comment 9 on central, and was easily able to reproduce the hang locally.
I am _not_ able to reproduce the hang with the latest patch. *thumbs up*.
Comment 26•7 years ago
|
||
(To be clear, the hang I'm referring to is the one originally reported in bug 1369123).
Updated•7 years ago
|
Attachment #8874493 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 27•7 years ago
|
||
Once the patch in bug 1370448 gets r+ed I will hopefully be able to land this.
Comment 28•7 years ago
|
||
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/16798a4167a1
Take the idle queue into account in nsThread::HasPendingEvents(); r=smaug
Comment 29•7 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•