Closed Bug 799142 Opened 12 years ago Closed 12 years ago

Crash because ~TimerEventAllocator() may be called before TimerEventAllocator::Free() is called [@ nsRunnable::Release() ]

Categories

(Core :: XPCOM, defect, P2)

x86_64
Windows 7
defect

Tracking

()

VERIFIED FIXED
mozilla22
Tracking Status
firefox21 + fixed
firefox22 + fixed

People

(Reporter: tete009+bugzilla, Assigned: ehsan.akhgari)

References

(Blocks 1 open bug, )

Details

(Keywords: crash, reproducible, Whiteboard: [webrtc][blocking-webrtc+][ietestdrive])

Crash Data

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/20100101 Firefox/16.0 Build ID: 20121005155445 Steps to reproduce: I tried to close Firefox when I'm playing Browser Hunt. http://ie.microsoft.com/testdrive/performance/browserhunt/default.xhtml Actual results: Firefox sometimes crashed because of access violation. I debugged my builds of Firefox 15.0.1, 16.0 and trunk, and I found out that ~TimerEventAllocator() was called before TimerEventAllocator::Free() was called, when Firefox crashed as mentioned above. Expected results: All TimerEventAllocator::Free() calls should be completed before call of ~TimerEventAllocator().
Crash Signature: [@ nsTimerEvent::`vector deleting destructor''(unsigned int) ]
Crash Signature: [@ nsTimerEvent::`vector deleting destructor''(unsigned int) ] → [@ nsTimerEvent::`vector deleting destructor''(unsigned int) ] [@ nsRunnable::Release() ]
Sorry, ~TimerEventAllocator() isn't explicitly declared in the original source. It was added to my builds explicitly for debugging purpose. Firefox 16.0 rc build1 seems to show another crash signature [@ nsRunnable::Release() ], so I added it to "Crash Signature".
Can you please post a crash ID from about:crashes ?
Keywords: crash
Crash IDs: Firefox 18.0a1: 75443141-6066-4ab4-92a9-975ca2121008 Firefox 16.0 rc build1: 142a8001-7c35-4367-b67c-47a4e2121008
Component: Untriaged → XPCOM
Product: Firefox → Core
This is race condition. thread 0 (gecko main thread) is shutting down after another thread is starting nsRunnable::run(). Then, if run() spends a lot of times (has busy job), thread 0 will release allocator before finishing run(). Although this is possible issue, I don't think that this is frequency. mark as new since there is crash signature...
Status: UNCONFIRMED → NEW
Ever confirmed: true
The scenario in Bug 825480 seems to reliably trigger this crash on certain hardware configurations, so we're going to mark this as [blocking-rtcweb+] for the time being. I'm not familiar with the code involved, so this is nothing more than a casual armchair analysis: The TimerEventAllocator class is used in only one file -- but it's of class nsFixedSizeAllocator, which hits 29 other files. Based on that first-order examination, I'd estimate that a conversion to reference counting (which would solve the race described) is probably realistic, but not trivial.
Whiteboard: [blocking-webrtc+]
Whiteboard: [blocking-webrtc+] → [blocking-webrtc+][ietestdrive]
Whiteboard: [blocking-webrtc+][ietestdrive] → [webrtc][blocking-webrtc+][ietestdrive]
I'm getting this to reproduce using the data channels test page by sending files around eventually as well.
Seems to be easily reproducible with http://mozilla.github.com/webrtc-landing/data_test.html on Win 7.
Keywords: reproducible
:mreavy,:jesup : Can you please help find an assignee for this bug ? Thanks !
CCing people who've had their fingers in TimerImpl recently... There's a race condition in the Timer code that has existed and apparently we're now triggering. Could some of you take a look and see what might work as a solution?
As far as I can tell, the TimerEventAllocator thing added in bug 733277 assume nsITimer is used only from the main thread, right?
Oh boy. I use timers off the main thread all over the place.
Flags: needinfo?(mreavy)
Flags: needinfo?(rjesup)
Assignee: nobody → ehsan
Priority: -- → P2
Thanks for the assignee, Benjamin - clearing the needsinfo
Flags: needinfo?(rjesup)
Flags: needinfo?(mreavy)
Blocks: 733277
If bug 733277 assumes nsITimer is only used from the main thread and there are instances of use outside the main thread then should bug 733277 be backed out while FF 21 is still on m-c and re-worked?
(In reply to Lukas Blakk [:lsblakk] from comment #14) > If bug 733277 assumes nsITimer is only used from the main thread and there > are instances of use outside the main thread then should bug 733277 be > backed out while FF 21 is still on m-c and re-worked? We need ehsan's input here, but if bug 733277 assumes only main thread timers then I would say yes, it should be backed out. It probably caused subtle race bugs in anything involving LazyIdleThread (IndexedDB, JumplistBuilder from a quick MXR search) or Workers.
Ehsan is out until the 19th.
How likely are those subtle races causing impossible-to-reproduce/debug crashes/hangs/malfunctions?
Since the original landed in March, I'm going to wait for ehsan to return and comment before taking other actions on this bug.
(In reply to Boris Zbarsky (:bz) from comment #11) > As far as I can tell, the TimerEventAllocator thing added in bug 733277 > assume nsITimer is used only from the main thread, right? Why do you think that? One mistake that I can see in this code is that we assume all nsTimerEvent objects are deallocated before nsTimerEvent::Shutdown is called. This assumption will probably not be true if somebody holds on to a reference to the runnable object for too long, which can happen if nsTimerImpl::PostTimerEvent returns after nsTimerImpl::Shutdown has been called on the main thread, which can happen since the thread manager is shut down after xpcom timers (for example, see https://crash-stats.mozilla.com/report/index/a9166241-055f-4aeb-a648-34d0b2130213). I can see two possible fixes here. One would be to call nsTimerEvent::Shutdown() after the thread manager shutdown. The other would be to keep track of how many objects reference the allocator, and only delete the allocator when there are no new objects, and the timer thread has fully shut down. The first solution is simpler, but it wouldn't cover all possible edge cases (in case somebody adds code to hold on to an nsIRunnable even after it's been processed, etc.) I'm not sure which solution I'd prefer. Benjamin, what do you think?
Also, a detailed set of STRs here would really help. I tried the link in comment 8 but it's not obvious how to reproduce this crash.
(In reply to :Ehsan Akhgari from comment #21) > Also, a detailed set of STRs here would really help. I tried the link in > comment 8 but it's not obvious how to reproduce this crash. Ehsan: try the STRs from Bug 825480, keeping in mind that they are only known to trigger this bug under Windows 7: Prerequisite: ensure that pref media.peerconnection.enabled is "true". 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html 2. Select Start 3. Wait for the message "hip hip hooray" 4. Quit Firefox
(In reply to Adam Roach [:abr] from comment #23) > (In reply to :Ehsan Akhgari from comment #21) > > Also, a detailed set of STRs here would really help. I tried the link in > > comment 8 but it's not obvious how to reproduce this crash. > > Ehsan: try the STRs from Bug 825480, keeping in mind that they are only > known to trigger this bug under Windows 7: > > Prerequisite: ensure that pref media.peerconnection.enabled is "true". > > 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html > 2. Select Start > 3. Wait for the message "hip hip hooray" > 4. Quit Firefox I don't have access to a Windows 7 box right now. If you do, could you please post the crash-stats link for one such crash as found in about:crashes? Seeing such a crash stack should be enough for me to see if my analysis in comment 20 is correct or not. Thanks!
> Why do you think that? Well, for one thing it talks about "freed on the main thread", not "freed on an arbitrary thread"...
(In reply to Boris Zbarsky (:bz) from comment #25) > > Why do you think that? > > Well, for one thing it talks about "freed on the main thread", not "freed on > an arbitrary thread"... That is my documentation mistake, fixed here: https://hg.mozilla.org/integration/mozilla-inbound/rev/fe5e2415c67c
(In reply to :Ehsan Akhgari from comment #24) > (In reply to Adam Roach [:abr] from comment #23) > > (In reply to :Ehsan Akhgari from comment #21) > > > Also, a detailed set of STRs here would really help. I tried the link in > > > comment 8 but it's not obvious how to reproduce this crash. > > > > Ehsan: try the STRs from Bug 825480, keeping in mind that they are only > > known to trigger this bug under Windows 7: > > > > Prerequisite: ensure that pref media.peerconnection.enabled is "true". > > > > 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html > > 2. Select Start > > 3. Wait for the message "hip hip hooray" > > 4. Quit Firefox > > I don't have access to a Windows 7 box right now. If you do, could you > please post the crash-stats link for one such crash as found in > about:crashes? Seeing such a crash stack should be enough for me to see if > my analysis in comment 20 is correct or not. > > Thanks! This is the crash stack I had from the dupe - https://crash-stats.mozilla.com/report/index/a1dea194-7992-4c21-acad-27e282121230.
(In reply to Jason Smith [:jsmith] from comment #27) > (In reply to :Ehsan Akhgari from comment #24) > > (In reply to Adam Roach [:abr] from comment #23) > > > (In reply to :Ehsan Akhgari from comment #21) > > > > Also, a detailed set of STRs here would really help. I tried the link in > > > > comment 8 but it's not obvious how to reproduce this crash. > > > > > > Ehsan: try the STRs from Bug 825480, keeping in mind that they are only > > > known to trigger this bug under Windows 7: > > > > > > Prerequisite: ensure that pref media.peerconnection.enabled is "true". > > > > > > 1. Go to http://mozilla.github.com/webrtc-landing/pc_test.html > > > 2. Select Start > > > 3. Wait for the message "hip hip hooray" > > > 4. Quit Firefox > > > > I don't have access to a Windows 7 box right now. If you do, could you > > please post the crash-stats link for one such crash as found in > > about:crashes? Seeing such a crash stack should be enough for me to see if > > my analysis in comment 20 is correct or not. > > > > Thanks! > > This is the crash stack I had from the dupe - > https://crash-stats.mozilla.com/report/index/a1dea194-7992-4c21-acad- > 27e282121230. Thanks, this confirms my theory.
What about the comment about non-mainthread use in nsITimer.idl? * It is not currently safe to initialize timers on any thread other than the * main thread (it will cause races on the timers' delay adjustment mechanism, * which may mess up timings). You can, however, cancel() and/or release a * timer on a non-main thread (provided that its callback object has a * thread-safe release() function). So, do we need to create them on the thread they're going to run on (see nsTimerImpl()), then dispatch to MainThread to Init them so they can fire back on the original thread?? Clearly we're not doing that, so either a) the comment is wrong or incomplete, or b) our usage is drastically wrong, or c) both!
(In reply to comment #29) > What about the comment about non-mainthread use in nsITimer.idl? > > * It is not currently safe to initialize timers on any thread other than the > * main thread (it will cause races on the timers' delay adjustment mechanism, > * which may mess up timings). You can, however, cancel() and/or release a > * timer on a non-main thread (provided that its callback object has a > * thread-safe release() function). > > So, do we need to create them on the thread they're going to run on (see > nsTimerImpl()), then dispatch to MainThread to Init them so they can fire back > on the original thread?? Clearly we're not doing that, so either a) the > comment is wrong or incomplete, or b) our usage is drastically wrong, or c) > both! I think we're talking about two different things. The bug in question is about nsTimerEvent which is responsible for "firing" a timer. Basically, when we're ready to call back a timer object, we create an nsTimerEvent and dispatch it to the timer's target thread, which runs this code: <http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#548> to "fire" the timer. nsTimerEvent objects are _always_ created on the timer thread in PostTimerEvent <http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#579>, and once they are dispatched to the target thread, their ownership is transfered to that thread through the nsIEventTarget::Dispatch function. After that, the nsTimerEvent is treated like any other refcounted object and will be deleted when the reference count drops to zero, which almost certainly happens on another thread. Note that the thread creating the nsTimerImpl object (implementing nsITimer) has nothing to do with any of this so far. The comment you're quoting comes from bug 778296, and is entirely news to me. I was definitely under the impression that you can create nsITimer's from all threads, and that bug doesn't link to the thread in question, but if this is the discussion in question <https://groups.google.com/d/topic/mozilla.dev.platform/OfF8f-UO8Z8/discussion>, then the comment added in attachment 646714 [details] [diff] [review] seems wrong. I think Jason meant "fire" when he said "initialize." Jason, Boris, please correct me if I'm wrong. And sorry if this is all a bit confusing. :(
(To make things more clear, I was talking about nsTimerEvents, not nsITimers all along here. nsITimer is not relevant to this bug...)
Ehsan is right: what matters for purposes of that comment is where the timer fires (and hence where it's created), not where init calls happen. Having timers fire anywhere outside the main thread means you get data races on the delay line gunk. And as things stand the comment is totally correct, as far as it goes. See bug 792920. The failure mode in case of a race is timers firing too early or too late, so not completely fatal... But yes, all that is completely unrelated to this bug.
Benjamin, please see comment 20.
Flags: needinfo?(benjamin)
The second option doesn't sound that hard: can't you have a static refcount that is atomically modified in the timer constructor/destructor? It would have the side benefit of being able to assert if timers are being held alive past XPCOM shutdown.
Flags: needinfo?(benjamin)
I see this all the time when I close Firefox with an active peer connection. Adjusting summary so it can be found easier.
Severity: normal → critical
Summary: ~TimerEventAllocator() may be called before TimerEventAllocator::Free() is called → Crash because ~TimerEventAllocator() may be called before TimerEventAllocator::Free() is called [@ nsRunnable::Release() ]
Attached patch Patch (v1) (obsolete) (deleted) — Splinter Review
Can somebody who can reproduce the crash please test this patch? Thanks!
Attachment #719029 - Flags: review?(benjamin)
(In reply to :Ehsan Akhgari from comment #36) > Created attachment 719029 [details] [diff] [review] > Patch (v1) > > Can somebody who can reproduce the crash please test this patch? Thanks! Can you kick off a try build with your patch?
Attached patch Patch (v2) (deleted) — Splinter Review
This version should be better, it uses atomic inc/dec, and also adds an assertion to make sure nsTimerEvent is always allocated on the right thread.
Attachment #719029 - Attachment is obsolete: true
Attachment #719029 - Flags: review?(benjamin)
Attachment #719041 - Flags: review?(benjamin)
(In reply to Jason Smith [:jsmith] from comment #37) > (In reply to :Ehsan Akhgari from comment #36) > > Created attachment 719029 [details] [diff] [review] > > Patch (v1) > > > > Can somebody who can reproduce the crash please test this patch? Thanks! > > Can you kick off a try build with your patch? https://tbpl.mozilla.org/?tree=Try&rev=e29be3cd65f7
So this tryserver build doesn't crash on shutdown for me on Windows 7. But the builds are busted for B2G and Android.
Status: NEW → ASSIGNED
(In reply to Henrik Skupin (:whimboo) from comment #40) > So this tryserver build doesn't crash on shutdown for me on Windows 7. Yay! > But the builds are busted for B2G and Android. Oh that's my bad, but the fix is trivial so I'll update the patch when landing.
I think that bug would be great to get a Mozmill restart test, given that we can't do it as mochitest.
Flags: in-qa-testsuite?(hskupin)
Attachment #719041 - Flags: review?(benjamin) → review+
Comment on attachment 719041 [details] [diff] [review] Patch (v2) [Approval Request Comment] Bug caused by (feature/regressing bug #): bug 733277 User impact if declined: Crashes Testing completed (on m-c, etc.): Locally Risk to taking this patch (and alternatives if risky): This patch should be fairly safe for Aurora. The alternative to taking this is to backout bug 733277 which can have performance implications. String or UUID changes made by this patch: none
Attachment #719041 - Flags: approval-mozilla-aurora?
Backed out because of a leak: 12:40:59 WARNING - TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 56 bytes during test execution 12:40:59 WARNING - TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of CondVar with size 32 bytes 12:40:59 WARNING - TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of Mutex with size 24 bytes https://hg.mozilla.org/integration/mozilla-inbound/rev/c6896fe12fb6
I'm retarded. nsTimerEvent's default ctor will never get called, so the refcount will only go negative, hence the leak. The fix is simple, just move all of that code to the ctor that actually gets used.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Keywords: verifyme
QA Contact: jsmith
Comment on attachment 719041 [details] [diff] [review] Patch (v2) Considering the backout may have a perf impact here, lets try the fwd fix here to see if it helps fix crashes.
Attachment #719041 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Good news is I can't reproduce this crash. Bad news is that I can reproduce a new XPCOM Shutdown crash during shutdown of the data channel test page. Not sure if this is the regressing bug or not, but linking just in case. Marking verified with a potential followup in bug 848496.
Status: RESOLVED → VERIFIED
Depends on: 848496
Keywords: verifyme
No longer depends on: 848496
It may be hard to create an automated test, without having a proper testcase. The page as referenced might also have been changed meanwhile, so a bit of investigation would be necessary. Nils, do you think it's worth the time? If yes, which framework could we use?
Flags: needinfo?(drno)
Flags: in-testsuite?
Flags: in-qa-testsuite?(hskupin)
Flags: in-qa-testsuite?
I think it probably not worth trying to write a test case for each problem we had on shutdown. But as we currently don't have anything testing that area I think it would be useful to have one Mozmill test case which establishes a call with a PeerConnection and Audio and Video and then closes FF while that call is still active. That should hopefully give us some basic coverage for cleanup on shutdown. I'll point to this bug from the other bug with similar problems.
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #54) > I think it probably not worth trying to write a test case for each problem > we had on shutdown. But as we currently don't have anything testing that > area I think it would be useful to have one Mozmill test case which > establishes a call with a PeerConnection and Audio and Video and then closes > FF while that call is still active. That should hopefully give us some basic > coverage for cleanup on shutdown. Nils, would you mind to file a Firefox UI Test given that Mozmill is dead!? Not sure if you did that already. At least I cannot find a reference here.
Flags: needinfo?(drno)
Thanks for the reminder. Filed bug 1201789.
Clearing NI.
Flags: needinfo?(drno)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: