Closed
Bug 799142
Opened 12 years ago
Closed 12 years ago
Crash because ~TimerEventAllocator() may be called before TimerEventAllocator::Free() is called [@ nsRunnable::Release() ]
Categories
(Core :: XPCOM, defect, P2)
Tracking
()
VERIFIED
FIXED
mozilla22
People
(Reporter: tete009+bugzilla, Assigned: ehsan.akhgari)
References
(Blocks 1 open bug, )
Details
(Keywords: crash, reproducible, Whiteboard: [webrtc][blocking-webrtc+][ietestdrive])
Crash Data
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
benjamin
:
review+
bajaj
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/20100101 Firefox/16.0
Build ID: 20121005155445
Steps to reproduce:
I tried to close Firefox when I'm playing Browser Hunt.
http://ie.microsoft.com/testdrive/performance/browserhunt/default.xhtml
Actual results:
Firefox sometimes crashed because of access violation.
I debugged my builds of Firefox 15.0.1, 16.0 and trunk, and I found out that ~TimerEventAllocator() was called before TimerEventAllocator::Free() was called, when Firefox crashed as mentioned above.
Expected results:
All TimerEventAllocator::Free() calls should be completed before call of ~TimerEventAllocator().
Reporter | ||
Updated•12 years ago
|
Reporter | ||
Updated•12 years ago
|
Crash Signature: [@ nsTimerEvent::`vector deleting destructor''(unsigned int) ]
Reporter | ||
Updated•12 years ago
|
Crash Signature: [@ nsTimerEvent::`vector deleting destructor''(unsigned int) ] → [@ nsTimerEvent::`vector deleting destructor''(unsigned int) ]
[@ nsRunnable::Release() ]
Reporter | ||
Comment 1•12 years ago
|
||
Sorry, ~TimerEventAllocator() isn't explicitly declared in the original source. It was added to my builds explicitly for debugging purpose.
Firefox 16.0 rc build1 seems to show another crash signature [@ nsRunnable::Release() ], so I added it to "Crash Signature".
Reporter | ||
Comment 3•12 years ago
|
||
Crash IDs:
Firefox 18.0a1: 75443141-6066-4ab4-92a9-975ca2121008
Firefox 16.0 rc build1: 142a8001-7c35-4367-b67c-47a4e2121008
Updated•12 years ago
|
Component: Untriaged → XPCOM
Product: Firefox → Core
Comment 4•12 years ago
|
||
This is race condition. thread 0 (gecko main thread) is shutting down after another thread is starting nsRunnable::run(). Then, if run() spends a lot of times (has busy job), thread 0 will release allocator before finishing run().
Although this is possible issue, I don't think that this is frequency. mark as new since there is crash signature...
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 6•12 years ago
|
||
The scenario in Bug 825480 seems to reliably trigger this crash on certain hardware configurations, so we're going to mark this as [blocking-rtcweb+] for the time being.
I'm not familiar with the code involved, so this is nothing more than a casual armchair analysis: The TimerEventAllocator class is used in only one file -- but it's of class nsFixedSizeAllocator, which hits 29 other files. Based on that first-order examination, I'd estimate that a conversion to reference counting (which would solve the race described) is probably realistic, but not trivial.
Whiteboard: [blocking-webrtc+]
Updated•12 years ago
|
Whiteboard: [blocking-webrtc+] → [blocking-webrtc+][ietestdrive]
Updated•12 years ago
|
Whiteboard: [blocking-webrtc+][ietestdrive] → [webrtc][blocking-webrtc+][ietestdrive]
Comment 7•12 years ago
|
||
I'm getting this to reproduce using the data channels test page by sending files around eventually as well.
tracking-firefox21:
--- → ?
Comment 8•12 years ago
|
||
Seems to be easily reproducible with http://mozilla.github.com/webrtc-landing/data_test.html on Win 7.
Keywords: reproducible
Updated•12 years ago
|
status-firefox21:
--- → affected
Comment 9•12 years ago
|
||
:mreavy,:jesup : Can you please help find an assignee for this bug ? Thanks !
Comment 10•12 years ago
|
||
CCing people who've had their fingers in TimerImpl recently...
There's a race condition in the Timer code that has existed and apparently we're now triggering. Could some of you take a look and see what might work as a solution?
Comment 11•12 years ago
|
||
As far as I can tell, the TimerEventAllocator thing added in bug 733277 assume nsITimer is used only from the main thread, right?
Oh boy. I use timers off the main thread all over the place.
Updated•12 years ago
|
Flags: needinfo?(mreavy)
Updated•12 years ago
|
Flags: needinfo?(rjesup)
Updated•12 years ago
|
Assignee: nobody → ehsan
Priority: -- → P2
Comment 13•12 years ago
|
||
Thanks for the assignee, Benjamin - clearing the needsinfo
Flags: needinfo?(rjesup)
Flags: needinfo?(mreavy)
Comment 14•12 years ago
|
||
If bug 733277 assumes nsITimer is only used from the main thread and there are instances of use outside the main thread then should bug 733277 be backed out while FF 21 is still on m-c and re-worked?
(In reply to Lukas Blakk [:lsblakk] from comment #14)
> If bug 733277 assumes nsITimer is only used from the main thread and there
> are instances of use outside the main thread then should bug 733277 be
> backed out while FF 21 is still on m-c and re-worked?
We need ehsan's input here, but if bug 733277 assumes only main thread timers then I would say yes, it should be backed out. It probably caused subtle race bugs in anything involving LazyIdleThread (IndexedDB, JumplistBuilder from a quick MXR search) or Workers.
Comment 16•12 years ago
|
||
Ehsan is out until the 19th.
Comment 17•12 years ago
|
||
How likely are those subtle races causing impossible-to-reproduce/debug crashes/hangs/malfunctions?
Comment 18•12 years ago
|
||
Since the original landed in March, I'm going to wait for ehsan to return and comment before taking other actions on this bug.
Assignee | ||
Comment 19•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) from comment #11)
> As far as I can tell, the TimerEventAllocator thing added in bug 733277
> assume nsITimer is used only from the main thread, right?
Why do you think that?
One mistake that I can see in this code is that we assume all nsTimerEvent objects are deallocated before nsTimerEvent::Shutdown is called. This assumption will probably not be true if somebody holds on to a reference to the runnable object for too long, which can happen if nsTimerImpl::PostTimerEvent returns after nsTimerImpl::Shutdown has been called on the main thread, which can happen since the thread manager is shut down after xpcom timers (for example, see https://crash-stats.mozilla.com/report/index/a9166241-055f-4aeb-a648-34d0b2130213).
I can see two possible fixes here. One would be to call nsTimerEvent::Shutdown() after the thread manager shutdown. The other would be to keep track of how many objects reference the allocator, and only delete the allocator when there are no new objects, and the timer thread has fully shut down. The first solution is simpler, but it wouldn't cover all possible edge cases (in case somebody adds code to hold on to an nsIRunnable even after it's been processed, etc.)
I'm not sure which solution I'd prefer. Benjamin, what do you think?
Assignee | ||
Comment 20•12 years ago
|
||
Also, a detailed set of STRs here would really help. I tried the link in comment 8 but it's not obvious how to reproduce this crash.
Comment 21•12 years ago
|
||
(In reply to :Ehsan Akhgari from comment #21)
> Also, a detailed set of STRs here would really help. I tried the link in
> comment 8 but it's not obvious how to reproduce this crash.
Ehsan: try the STRs from Bug 825480, keeping in mind that they are only known to trigger this bug under Windows 7:
Prerequisite: ensure that pref media.peerconnection.enabled is "true".
1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html
2. Select Start
3. Wait for the message "hip hip hooray"
4. Quit Firefox
Assignee | ||
Comment 22•12 years ago
|
||
(In reply to Adam Roach [:abr] from comment #23)
> (In reply to :Ehsan Akhgari from comment #21)
> > Also, a detailed set of STRs here would really help. I tried the link in
> > comment 8 but it's not obvious how to reproduce this crash.
>
> Ehsan: try the STRs from Bug 825480, keeping in mind that they are only
> known to trigger this bug under Windows 7:
>
> Prerequisite: ensure that pref media.peerconnection.enabled is "true".
>
> 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html
> 2. Select Start
> 3. Wait for the message "hip hip hooray"
> 4. Quit Firefox
I don't have access to a Windows 7 box right now. If you do, could you please post the crash-stats link for one such crash as found in about:crashes? Seeing such a crash stack should be enough for me to see if my analysis in comment 20 is correct or not.
Thanks!
Comment 23•12 years ago
|
||
> Why do you think that?
Well, for one thing it talks about "freed on the main thread", not "freed on an arbitrary thread"...
Assignee | ||
Comment 24•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) from comment #25)
> > Why do you think that?
>
> Well, for one thing it talks about "freed on the main thread", not "freed on
> an arbitrary thread"...
That is my documentation mistake, fixed here: https://hg.mozilla.org/integration/mozilla-inbound/rev/fe5e2415c67c
Comment 25•12 years ago
|
||
(In reply to :Ehsan Akhgari from comment #24)
> (In reply to Adam Roach [:abr] from comment #23)
> > (In reply to :Ehsan Akhgari from comment #21)
> > > Also, a detailed set of STRs here would really help. I tried the link in
> > > comment 8 but it's not obvious how to reproduce this crash.
> >
> > Ehsan: try the STRs from Bug 825480, keeping in mind that they are only
> > known to trigger this bug under Windows 7:
> >
> > Prerequisite: ensure that pref media.peerconnection.enabled is "true".
> >
> > 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html
> > 2. Select Start
> > 3. Wait for the message "hip hip hooray"
> > 4. Quit Firefox
>
> I don't have access to a Windows 7 box right now. If you do, could you
> please post the crash-stats link for one such crash as found in
> about:crashes? Seeing such a crash stack should be enough for me to see if
> my analysis in comment 20 is correct or not.
>
> Thanks!
This is the crash stack I had from the dupe - https://crash-stats.mozilla.com/report/index/a1dea194-7992-4c21-acad-27e282121230.
Assignee | ||
Comment 26•12 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #27)
> (In reply to :Ehsan Akhgari from comment #24)
> > (In reply to Adam Roach [:abr] from comment #23)
> > > (In reply to :Ehsan Akhgari from comment #21)
> > > > Also, a detailed set of STRs here would really help. I tried the link in
> > > > comment 8 but it's not obvious how to reproduce this crash.
> > >
> > > Ehsan: try the STRs from Bug 825480, keeping in mind that they are only
> > > known to trigger this bug under Windows 7:
> > >
> > > Prerequisite: ensure that pref media.peerconnection.enabled is "true".
> > >
> > > 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html
> > > 2. Select Start
> > > 3. Wait for the message "hip hip hooray"
> > > 4. Quit Firefox
> >
> > I don't have access to a Windows 7 box right now. If you do, could you
> > please post the crash-stats link for one such crash as found in
> > about:crashes? Seeing such a crash stack should be enough for me to see if
> > my analysis in comment 20 is correct or not.
> >
> > Thanks!
>
> This is the crash stack I had from the dupe -
> https://crash-stats.mozilla.com/report/index/a1dea194-7992-4c21-acad-
> 27e282121230.
Thanks, this confirms my theory.
Comment 27•12 years ago
|
||
What about the comment about non-mainthread use in nsITimer.idl?
* It is not currently safe to initialize timers on any thread other than the
* main thread (it will cause races on the timers' delay adjustment mechanism,
* which may mess up timings). You can, however, cancel() and/or release a
* timer on a non-main thread (provided that its callback object has a
* thread-safe release() function).
So, do we need to create them on the thread they're going to run on (see nsTimerImpl()), then dispatch to MainThread to Init them so they can fire back on the original thread?? Clearly we're not doing that, so either a) the comment is wrong or incomplete, or b) our usage is drastically wrong, or c) both!
Assignee | ||
Comment 28•12 years ago
|
||
(In reply to comment #29)
> What about the comment about non-mainthread use in nsITimer.idl?
>
> * It is not currently safe to initialize timers on any thread other than the
> * main thread (it will cause races on the timers' delay adjustment mechanism,
> * which may mess up timings). You can, however, cancel() and/or release a
> * timer on a non-main thread (provided that its callback object has a
> * thread-safe release() function).
>
> So, do we need to create them on the thread they're going to run on (see
> nsTimerImpl()), then dispatch to MainThread to Init them so they can fire back
> on the original thread?? Clearly we're not doing that, so either a) the
> comment is wrong or incomplete, or b) our usage is drastically wrong, or c)
> both!
I think we're talking about two different things. The bug in question is about nsTimerEvent which is responsible for "firing" a timer. Basically, when we're ready to call back a timer object, we create an nsTimerEvent and dispatch it to the timer's target thread, which runs this code: <http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#548> to "fire" the timer. nsTimerEvent objects are _always_ created on the timer thread in PostTimerEvent <http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#579>, and once they are dispatched to the target thread, their ownership is transfered to that thread through the nsIEventTarget::Dispatch function. After that, the nsTimerEvent is treated like any other refcounted object and will be deleted when the reference count drops to zero, which almost certainly happens on another thread. Note that the thread creating the nsTimerImpl object (implementing nsITimer) has nothing to do with any of this so far.
The comment you're quoting comes from bug 778296, and is entirely news to me. I was definitely under the impression that you can create nsITimer's from all threads, and that bug doesn't link to the thread in question, but if this is the discussion in question <https://groups.google.com/d/topic/mozilla.dev.platform/OfF8f-UO8Z8/discussion>, then the comment added in attachment 646714 [details] [diff] [review] seems wrong. I think Jason meant "fire" when he said "initialize." Jason, Boris, please correct me if I'm wrong.
And sorry if this is all a bit confusing. :(
Assignee | ||
Comment 29•12 years ago
|
||
(To make things more clear, I was talking about nsTimerEvents, not nsITimers all along here. nsITimer is not relevant to this bug...)
Comment 30•12 years ago
|
||
Ehsan is right: what matters for purposes of that comment is where the timer fires (and hence where it's created), not where init calls happen. Having timers fire anywhere outside the main thread means you get data races on the delay line gunk.
And as things stand the comment is totally correct, as far as it goes. See bug 792920. The failure mode in case of a race is timers firing too early or too late, so not completely fatal...
But yes, all that is completely unrelated to this bug.
Comment 32•12 years ago
|
||
The second option doesn't sound that hard: can't you have a static refcount that is atomically modified in the timer constructor/destructor?
It would have the side benefit of being able to assert if timers are being held alive past XPCOM shutdown.
Flags: needinfo?(benjamin)
Comment 33•12 years ago
|
||
I see this all the time when I close Firefox with an active peer connection. Adjusting summary so it can be found easier.
Severity: normal → critical
Summary: ~TimerEventAllocator() may be called before TimerEventAllocator::Free() is called → Crash because ~TimerEventAllocator() may be called before TimerEventAllocator::Free() is called [@ nsRunnable::Release() ]
Updated•12 years ago
|
status-firefox22:
--- → affected
tracking-firefox22:
--- → ?
Assignee | ||
Comment 34•12 years ago
|
||
Can somebody who can reproduce the crash please test this patch? Thanks!
Attachment #719029 -
Flags: review?(benjamin)
Comment 35•12 years ago
|
||
(In reply to :Ehsan Akhgari from comment #36)
> Created attachment 719029 [details] [diff] [review]
> Patch (v1)
>
> Can somebody who can reproduce the crash please test this patch? Thanks!
Can you kick off a try build with your patch?
Assignee | ||
Comment 36•12 years ago
|
||
This version should be better, it uses atomic inc/dec, and also adds an assertion to make sure nsTimerEvent is always allocated on the right thread.
Attachment #719029 -
Attachment is obsolete: true
Attachment #719029 -
Flags: review?(benjamin)
Attachment #719041 -
Flags: review?(benjamin)
Assignee | ||
Comment 37•12 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #37)
> (In reply to :Ehsan Akhgari from comment #36)
> > Created attachment 719029 [details] [diff] [review]
> > Patch (v1)
> >
> > Can somebody who can reproduce the crash please test this patch? Thanks!
>
> Can you kick off a try build with your patch?
https://tbpl.mozilla.org/?tree=Try&rev=e29be3cd65f7
Comment 38•12 years ago
|
||
So this tryserver build doesn't crash on shutdown for me on Windows 7. But the builds are busted for B2G and Android.
Status: NEW → ASSIGNED
Assignee | ||
Comment 39•12 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #40)
> So this tryserver build doesn't crash on shutdown for me on Windows 7.
Yay!
> But the builds are busted for B2G and Android.
Oh that's my bad, but the fix is trivial so I'll update the patch when landing.
Updated•12 years ago
|
Comment 40•12 years ago
|
||
I think that bug would be great to get a Mozmill restart test, given that we can't do it as mochitest.
Flags: in-qa-testsuite?(hskupin)
Updated•12 years ago
|
Attachment #719041 -
Flags: review?(benjamin) → review+
Assignee | ||
Comment 41•12 years ago
|
||
Assignee | ||
Updated•12 years ago
|
Assignee | ||
Comment 42•12 years ago
|
||
Comment on attachment 719041 [details] [diff] [review]
Patch (v2)
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 733277
User impact if declined: Crashes
Testing completed (on m-c, etc.): Locally
Risk to taking this patch (and alternatives if risky): This patch should be fairly safe for Aurora. The alternative to taking this is to backout bug 733277 which can have performance implications.
String or UUID changes made by this patch: none
Attachment #719041 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 43•12 years ago
|
||
Landed a follow-up to fix builds: http://hg.mozilla.org/integration/mozilla-inbound/rev/a0cf0252cee4
Assignee | ||
Comment 44•12 years ago
|
||
Backed out because of a leak:
12:40:59 WARNING - TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 56 bytes during test execution
12:40:59 WARNING - TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of CondVar with size 32 bytes
12:40:59 WARNING - TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of Mutex with size 24 bytes
https://hg.mozilla.org/integration/mozilla-inbound/rev/c6896fe12fb6
Assignee | ||
Comment 45•12 years ago
|
||
I'm retarded. nsTimerEvent's default ctor will never get called, so the refcount will only go negative, hence the leak. The fix is simple, just move all of that code to the ctor that actually gets used.
Assignee | ||
Comment 46•12 years ago
|
||
Comment 47•12 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Comment 48•12 years ago
|
||
Comment on attachment 719041 [details] [diff] [review]
Patch (v2)
Considering the backout may have a perf impact here, lets try the fwd fix here to see if it helps fix crashes.
Attachment #719041 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee | ||
Comment 49•12 years ago
|
||
Comment 50•12 years ago
|
||
Good news is I can't reproduce this crash. Bad news is that I can reproduce a new XPCOM Shutdown crash during shutdown of the data channel test page. Not sure if this is the regressing bug or not, but linking just in case.
Marking verified with a potential followup in bug 848496.
Updated•11 years ago
|
Blocks: ietestdrive
Comment 51•10 years ago
|
||
It may be hard to create an automated test, without having a proper testcase. The page as referenced might also have been changed meanwhile, so a bit of investigation would be necessary. Nils, do you think it's worth the time? If yes, which framework could we use?
Flags: needinfo?(drno)
Flags: in-testsuite?
Flags: in-qa-testsuite?(hskupin)
Flags: in-qa-testsuite?
Comment 52•10 years ago
|
||
I think it probably not worth trying to write a test case for each problem we had on shutdown. But as we currently don't have anything testing that area I think it would be useful to have one Mozmill test case which establishes a call with a PeerConnection and Audio and Video and then closes FF while that call is still active. That should hopefully give us some basic coverage for cleanup on shutdown.
I'll point to this bug from the other bug with similar problems.
Flags: needinfo?(drno)
Comment 53•9 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #54)
> I think it probably not worth trying to write a test case for each problem
> we had on shutdown. But as we currently don't have anything testing that
> area I think it would be useful to have one Mozmill test case which
> establishes a call with a PeerConnection and Audio and Video and then closes
> FF while that call is still active. That should hopefully give us some basic
> coverage for cleanup on shutdown.
Nils, would you mind to file a Firefox UI Test given that Mozmill is dead!? Not sure if you did that already. At least I cannot find a reference here.
Flags: needinfo?(drno)
Updated•9 years ago
|
Comment 54•9 years ago
|
||
Thanks for the reminder. Filed bug 1201789.
You need to log in
before you can comment on or make changes to this bug.
Description
•