Closed Bug 1204784 Opened 9 years ago Closed 9 years ago

Intermittent leakcheck | default process: 64 bytes leaked (nsRunnable, nsThreadShutdownAckEvent)

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
firefox43 --- wontfix
firefox44 --- wontfix
firefox45 --- fixed
firefox46 --- fixed
firefox47 --- fixed

People

(Reporter: nigelb, Assigned: khuey)

References

Details

(Keywords: intermittent-failure, memory-leak)

Attachments

(2 files)

      No description provided.
Keywords: mlk
Regression from bug 1200922, maybe?
Flags: needinfo?(khuey)
Well that bug added this class, so yes, it must be.
Flags: needinfo?(khuey)
ASSERTION: Non-empty event queue being destroyed; events being leaked.: 'IsEmpty()'
*shrug*
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Kyle, this seems to be happening more often starting Jan 11.

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204784&startday=2016-01-01&endday=2016-01-14&tree=all

Any idea what caused this to trigger more often?  Can we clean up the events?
Flags: needinfo?(khuey)
It could have been landing bug 1237371 and bug 1237369, which uses this a lot more. But idk why it would lag a few days.

Not sure what's going on ... I'll think about it some more.
Assignee: nobody → khuey
Flags: needinfo?(khuey)
The only way I could see that we would put an event in a queue and not execute it is if there are nested event queues.  Do you have any other ideas, Nathan?  The Push/PopEventQueue stuff is only used by workers so it seems unlikely to be involved here ...
Flags: needinfo?(nfroyd)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #9)
> The only way I could see that we would put an event in a queue and not
> execute it is if there are nested event queues.  Do you have any other
> ideas, Nathan?  The Push/PopEventQueue stuff is only used by workers so it
> seems unlikely to be involved here ...

But we "promote" events from a popped event queue:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#1176

so Push/PopEventQueue doesn't seem relevant here...?  We also have:

16:23:53     INFO -  [Child 1950] ###!!! ASSERTION: Non-empty event queue being destroyed; events being leaked.: 'IsEmpty()', file /builds/slave/m-in-l64-d-0000000000000000000/build/src/xpcom/threads/nsEventQueue.cpp, line 36
16:23:53     INFO -  #01: nsThread::nsChainedEventQueue::~nsChainedEventQueue() [mfbt/RefPtr.h:55]
16:23:53     INFO -  #02: nsThread::~nsThread() [xpcom/glue/nsTArray.h:2093]
16:23:53     INFO -  #03: nsThread::~nsThread() [xpcom/threads/nsThread.cpp:508]
16:23:53     INFO -  #04: nsThread::Release() [xpcom/threads/nsThread.cpp:179]
16:23:53     INFO -  #05: _PR_DestroyThreadPrivate [nsprpub/pr/src/threads/prtpd.c:237]
16:23:53     INFO -  #06: _pt_thread_death_internal [nsprpub/pr/src/pthreads/ptthread.c:888]
16:23:53     INFO -  #07: _pt_thread_death [nsprpub/pr/src/pthreads/ptthread.c:865]
16:23:53     INFO -  #08: libpthread.so.0 + 0x7c83
16:23:53     INFO -  #09: libpthread.so.0 + 0x7ea8
16:23:53     INFO -  #10: libc.so.6 + 0xf42ed

so the root event queue must be non-empty when the thread dies, because that's the only nsChainedEventQueue we're destroying.

I also think that the ack event must be in the main thread's event queue, as that's where the shutdown ack event from bug 1237371 is going.  The main thread is also where LSPAnnotater::Annotate is running, so the main thread would also be receiving a shutdown ack event there.

(I don't really understand how the stack can be libc -> pthread -> nsThread::Release...does that mean we're not actually releasing the main thread until TLS is emptied?)

In any event shutdown for the main thread looks a little strange:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadManager.cpp#126

There's probably some other IPC event loop stuff involved, but AFAICT the main thread isn't going through the loop here:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#390

which means that we're never setting mEventsAreDoomed on the main thread, which means that we keep accepting events until...always?

I think my bet is on the shutdown ack event coming here or after:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadManager.cpp#161

So the main thread is going through nsThreadManager::Shutdown, but finishing the processing of events before the predictor thread or the annotator thread can send back the shutdown ack...and then once the shutdown ack gets sent there, we leak.

Does that make any sense?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #11)
> I think my bet is on the shutdown ack event coming here or after:
> 
> https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadManager.
> cpp#161

Hmm, it does seem like we need to wait for all the pending shutdown contexts to clear first.
Attached patch Patch (deleted) — Splinter Review
If the above theory is correct, this is the fix.  It does seems like something we need regardless.
Attachment #8708499 - Flags: review?(nfroyd)
Comment on attachment 8708499 [details] [diff] [review]
Patch

Review of attachment 8708499 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine in principle, but...this is assuming that we can never get any events into the main thread's event queue between the first call to NS_ProcessPendingEvents and the WaitForAllAsyncShutdowns in nsThreadManager::Shutdown, right?  Because if events can, then it's possible that some of those events are going to call AsyncShutdown, and there might not be outstanding AsyncShutdown threads.  Then WaitForAllAsyncShutdowns does nothing, and we're back to where we started.

How likely is that?  I'm not really sure.  I don't know when these AsyncShutdown-calling events are getting inserted into the queue, though...

r=me, but I'd really like to know whether we can come up with something more bulletproof, or we can assert that the above doesn't happen.
Attachment #8708499 - Flags: review?(nfroyd) → review+
From irc

[11:16:56] <khuey> froydnj: in your comment, do you mean the *last* call to NS_ProcessPendingEvents?
[11:17:30] <froydnj> khuey: no, I mean the first call:
https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadManager.cpp#126
[11:17:47] <khuey> froydnj: if an event gets inserted that wants to call AsyncShutdown at that
                   point, it doesn't matter
[11:18:03] <froydnj> khuey: why not?
[11:18:07] <khuey> froydnj: because we're going to go through and call Shutdown on all outstanding
                   threads just below that
[11:18:34] <khuey> froydnj: and there will be at least one thread outstanding
[11:18:43] <khuey> froydnj: (because there's something to call AsyncShutdown on)
[11:18:51] <khuey> froydnj: so we'll spin the event loop to do the (sync) Shutdown
[11:19:05] <khuey> froydnj: and that'll execute any pending runnables that want to call
                   AsyncShutdown on other threads
[11:19:42] <khuey> froydnj: similarly once we've called Shutdown on all outstanding threads any
                   pending runnables that want to call AsyncShutdown don't matter, because all
                   threads have been synchronously Shutdown
[11:20:26] <froydnj> khuey: ok, I think that makes sense...worth a comment in
                     nsThreadManager::Shutdown?
[11:20:32] <khuey> froydnj: sure
seems after this landing we got test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=3103586&repo=mozilla-central
Flags: needinfo?(khuey)
I was hoping the assertion would give us a minidump but apparently not :/
Flags: needinfo?(khuey)
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/c395b4777d7d
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Per philor, This is very likely the cause for a very hard-to-star, frequent assertion (which is also the source of the bugs marked as dups of this in a chain) -- bug 1240672
Depends on: 1240672
Flags: needinfo?(khuey)
I can make the assertion non-fatal again (which presumably will turn the orange back into a leak).  Is that easier to star?
Flags: needinfo?(rjesup)
Flags: needinfo?(philringnalda)
Flags: needinfo?(khuey)
Yep, leaks don't have tree-specific paths. I filed bug 1242178 on being able to star assertions, but I don't know how likely to be workable it is. The other thing you can do is to push the path out of the way of the bug search with a longer message - treeherder is searching for

Assertion failure: IsEmpty(), at /builds/slave/try-lx-d-000000000000000000000/build/src/xpcom/thread

so if you turn that into

Assertion failure: IsEmpty() event queue should be empty while we are shutting down because we can't even

then the bug summary can match what it searches for.
Flags: needinfo?(philringnalda)
Status: RESOLVED → REOPENED
Flags: needinfo?(rjesup)
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/dadf5ab56adc (landed in time to the merge to Aurora)
Target Milestone: mozilla46 → ---
I have an idea on how to get us more information on this from automation.  I'll try to whip it up this weekend.
Just an idle question - are we certain the "joiningThread" is MainThread?  Too bad we can't hit this under rr..... if it's frequent enough, maybe we can?  or maybe rr will make it *more* likely (often it makes things easier to hit, and often harder - depends on which way the timing gods got up that morning).
(In reply to Randell Jesup [:jesup] from comment #37)
> Just an idle question - are we certain the "joiningThread" is MainThread?

I don't think it is, judging by:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8569ba92a355

and the failing log:

11:38:41     INFO -  [2032] ###!!! ASSERTION: shouldn't be waiting on other threads to shutdown: 'mRequestedShutdownContexts.IsEmpty()', file /builds/slave/try-l64-d-00000000000000000000/build/src/xpcom/threads/nsThread.cpp, line 522
11:38:41     INFO -  #01: nsThread::~nsThread() [xpcom/threads/nsThread.cpp:523]
11:38:41     INFO -  #02: nsThread::Release() [xpcom/threads/nsThread.cpp:179]
11:38:42     INFO -  #03: _PR_DestroyThreadPrivate [nsprpub/pr/src/threads/prtpd.c:238]
11:38:42     INFO -  #04: _pt_thread_death_internal [nsprpub/pr/src/pthreads/ptthread.c:889]
11:38:42     INFO -  #05: _pt_thread_death [nsprpub/pr/src/pthreads/ptthread.c:863]
11:38:42     INFO -  #06: libpthread.so.0 + 0x7c83
11:38:42     INFO -  #07: libpthread.so.0 + 0x7ea8
11:38:42     INFO -  #08: libc.so.6 + 0xf42ed
11:38:42     INFO -  [2032] ###!!! ASSERTION: Non-empty event queue being destroyed; events being leaked.: 'IsEmpty()', file /builds/slave/try-l64-d-00000000000000000000/build/src/xpcom/threads/nsEventQueue.cpp, line 36
11:38:42     INFO -  #01: nsThread::nsChainedEventQueue::~nsChainedEventQueue() [mfbt/RefPtr.h:55]
11:38:42     INFO -  #02: nsThread::~nsThread() [xpcom/glue/nsTArray.h:2095]
11:38:42     INFO -  #03: nsThread::~nsThread() [xpcom/threads/nsThread.cpp:523]
11:38:42     INFO -  #04: nsThread::Release() [xpcom/threads/nsThread.cpp:179]
11:38:42     INFO -  #05: _PR_DestroyThreadPrivate [nsprpub/pr/src/threads/prtpd.c:238]
11:38:42     INFO -  #06: _pt_thread_death_internal [nsprpub/pr/src/pthreads/ptthread.c:889]
11:38:42     INFO -  #07: _pt_thread_death [nsprpub/pr/src/pthreads/ptthread.c:863]
11:38:42     INFO -  #08: libpthread.so.0 + 0x7c83
11:38:42     INFO -  #09: libpthread.so.0 + 0x7ea8
11:38:42     INFO -  #10: libc.so.6 + 0xf42ed

I *think* that means this is a non-main thread.  There's another variant of the non-empty event queue one that *is* on the main thread, but I think that's a separate bug...the extra assertions I added in that try push don't seem to be firing, though.
And jesup wins the prize for today:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3fa8023b527

14:01:43     INFO -  133 @0x7f06a8f979a0 (0 references; 0 from COMPtrs)
14:01:43     INFO -  allocation stack:
14:01:43     INFO -  #00: nsThread::ShutdownInternal(bool) [xpcom/base/nsAutoPtr.h:34]
14:01:43     INFO -  #01: nsThread::AsyncShutdown() [xpcom/threads/nsThread.cpp:720]
14:01:43     INFO -  #02: nsThreadPool::PutEvent(already_AddRefed<nsIRunnable>&&) [xpcom/threads/nsThreadPool.cpp:125]
14:01:43     INFO -  #03: nsThreadPool::Dispatch(already_AddRefed<nsIRunnable>&&, unsigned int) [xpcom/threads/nsThreadPool.cpp:271]
14:01:43     INFO -  #04: mozilla::TaskQueue::DispatchLocked(already_AddRefed<nsIRunnable>, mozilla::TaskQueue::DispatchMode, mozilla::AbstractThread::DispatchFailureHandling, mozilla::AbstractThread::DispatchReason) [xpcom/threads/TaskQueue.cpp:66]
14:01:43     INFO -  #05: mozilla::TaskQueue::Dispatch(already_AddRefed<nsIRunnable>, mozilla::AbstractThread::DispatchFailureHandling, mozilla::AbstractThread::DispatchReason) [xpcom/threads/TaskQueue.h:47]
14:01:43     INFO -  #06: mozilla::detail::EventTarget<mozilla::AbstractThread>::Dispatch(mozilla::AbstractThread*, already_AddRefed<nsIRunnable>&&) [dom/media/MediaEventSource.h:109]
14:01:43     INFO -  #07: mozilla::detail::ListenerHelper<mozilla::AbstractThread, mozilla::MediaEventSourceImpl<<anonymous>, Ts>::ConnectInternal(Target*, This*, Method) [with Target = mozilla::AbstractThread; This = mozilla::MediaDecoderStateMachine; Method = void (mozilla::MediaDecoderStateMachine::*)(const RefPtr<mozilla::MediaData>&); mozilla::ListenerMode Mode = (mozilla::ListenerMode)1; Es = {RefPtr<mozilla::MediaData>}; typename mozilla::EnableIf<mozilla::detail::TakeArgs<Method>::value, mozilla::MediaEventListener>::Type = mozilla::MediaEventListener]::<lambda(mozilla::MediaEventSourceImpl<(mozilla::ListenerMode)1, RefPtr<mozilla::MediaData> >::ArgType<RefPtr<mozilla::MediaData> >&&)> >::DispatchHelper<mozilla::MediaEventSourceImpl<<anonymous>, Ts>::ConnectInternal(Target*, This*, Method) [with Target = mozilla::AbstractThread; This = mozilla::MediaDecoderStateMachine; Method = void (mozilla::MediaDecoderStateMachine::*)(const RefPtr<mozilla::MediaData>&); mozilla::ListenerMode Mode = (mozilla::ListenerMode)1; Es = {RefPtr<mozilla::MediaData>}; typename mozilla::EnableIf<mozilla::detail::TakeArgs<Method>::value, mozilla::MediaEventListener>::Type = mozilla::MediaEventListener]::<lambda(mozilla::MediaEventSourceImpl<(mozilla::ListenerMode)1, RefPtr<mozilla::MediaData> >::ArgType<RefPtr<mozilla::MediaData> >&&)>, const RefPtr<mozilla::MediaData>&> [dom/media/MediaEventSource.h:177]
14:01:43     INFO -  #08: void mozilla::MediaEventSourceImpl<(mozilla::ListenerMode)1, RefPtr<mozilla::MediaData> >::NotifyInternal<RefPtr<mozilla::MediaData>&>(RefPtr<mozilla::MediaData>&&&) [dom/media/MediaEventSource.h:468]
14:01:43     INFO -  #09: mozilla::MediaQueue<mozilla::MediaData>::PopFront() [mfbt/RefPtr.h:206]
14:01:43     INFO -  #10: mozilla::media::DecodedAudioDataSink::PopFrames(unsigned int) [mfbt/AlreadyAddRefed.h:116]
14:01:43     INFO -  #11: mozilla::AudioStream::GetUnprocessed(mozilla::AudioBufferWriter&) [dom/media/AudioStream.cpp:583]
14:01:43     INFO -  #12: mozilla::AudioStream::DataCallback(void*, long) [dom/media/AudioStream.cpp:658]
14:01:43     INFO -  #13: trigger_user_callback [media/libcubeb/src/cubeb_pulse.c:215]
14:01:43     INFO -  #14: libpulse.so.0 + 0x29ba7
14:01:43     INFO -  #15: libpulsecommon-1.1.so + 0x2d9eb
14:01:43     INFO -  #16: libpulse.so.0 + 0xebdd
14:01:43     INFO -  #17: libpulsecommon-1.1.so + 0x322a9
14:01:43     INFO -  #18: libpulse.so.0 + 0x21a1e
14:01:43     INFO -  #19: libpulse.so.0 + 0x21de5
14:01:43     INFO -  #20: libpulse.so.0 + 0x21e90
14:01:43     INFO -  #21: libpulse.so.0 + 0x3030f
14:01:43     INFO -  #22: libpulsecommon-1.1.so + 0x3fd18
14:01:43     INFO -  #23: libpthread.so.0 + 0x7e9a
14:01:43     INFO -  #24: libc.so.6 + 0xf42ed

That is, we're asynchronously shutting down a thread in a thread pool that we don't need anymore, from a non-main-thread thread (one that I don't think we even control via the thread manager, since it looks like it was started from pulseaudio?).  The source code:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadPool.cpp#122

    // We never dispatched any events to the thread, so we can shut it down
    // asynchronously without worrying about anything.
    MOZ_ALWAYS_TRUE(NS_SUCCEEDED(thread->AsyncShutdown()));

I guess this would be fallout from bug 1200922, then, khuey?
Flags: needinfo?(khuey)
(In reply to Nathan Froyd [:froydnj] from comment #39)
> That is, we're asynchronously shutting down a thread in a thread pool that
> we don't need anymore, from a non-main-thread thread (one that I don't think
> we even control via the thread manager, since it looks like it was started
> from pulseaudio?).

Ugh, jfc.

On the bright side, it does explain why this only happens on Linux.  I don't think things would have worked before my changes if we ended up in this code path.  We would have tried to dispatch a runnable to the current thread that would never run ...  Do we have a different intermittent runnable leak that just happened to go away when I landed my patch?

I think the fix is to make this call nsThreadPool::ShutdownThread with this thread (and that can call AsyncShutdown, because why not), that way we shut down everything from a thread we control (the main thread).  There's a more general problem here of how much of Gecko works when you start calling it from threads we don't control ... I hope this audio code doesn't do anything too crazy!
Flags: needinfo?(khuey)
Attached patch Take two (deleted) — Splinter Review
Untested, but something like this.
Attachment #8714152 - Flags: review?(nfroyd)
Comment on attachment 8714152 [details] [diff] [review]
Take two

Review of attachment 8714152 [details] [diff] [review]:
-----------------------------------------------------------------

This fix looks good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=81c83641606f

I have some patches that helped in debugging this issue; I'll clean them up and post them in a different bug.

::: xpcom/threads/nsThreadPool.cpp
@@ +136,5 @@
>    LOG(("THRD-P(%p) shutdown async [%p]\n", this, aThread));
>  
> +  // This is either called by a threadpool thread that is out of work, or
> +  // a thread that attempted to create a threadpool thread and raced in
> +  // such a way that the newly created thread is no longer necessary.

Thank you for this comment.
Attachment #8714152 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/323f89d56ef3
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Comment on attachment 8714152 [details] [diff] [review]
Take two

Approval Request Comment
[Feature/regressing bug #]: Bug 1200922
[User impact if declined]: High frequency intermittent leaks
[Describe test coverage new/current, TreeHerder]: This is well covered on TreeHerder, because the orange shows up all the time.
[Risks and why]: Low risk
[String/UUID change made/needed]: None
Attachment #8714152 - Flags: approval-mozilla-beta?
Attachment #8714152 - Flags: approval-mozilla-aurora?
Comment on attachment 8708499 [details] [diff] [review]
Patch

See above.

Note that this landed in time to make it to Aurora, so we only need it on Beta.
Attachment #8708499 - Flags: approval-mozilla-beta?
Comment on attachment 8714152 [details] [diff] [review]
Take two

Improve the testsuite, taking it.
Should be in 45 beta 3.
Attachment #8714152 - Flags: approval-mozilla-beta?
Attachment #8714152 - Flags: approval-mozilla-beta+
Attachment #8714152 - Flags: approval-mozilla-aurora?
Attachment #8714152 - Flags: approval-mozilla-aurora+
Attachment #8708499 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: