Closed
Bug 458998
Opened 16 years ago
Closed 16 years ago
AsyncExecute helper classes can complete() before they are added to mPendingEvents
Categories
(Toolkit :: Storage, defect)
Toolkit
Storage
Tracking
()
RESOLVED
FIXED
mozilla1.9.1b2
People
(Reporter: asuth, Assigned: sdwilsh)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file, 4 obsolete files)
(deleted),
patch
|
bent.mozilla
:
review+
|
Details | Diff | Splinter Review |
davida has seen the AsyncExecute destructor's NS_ASSERTION(mPendingEvents.Count() == 0, "Still pending events!") fire a bunch of times in succession. Based on the fact that there was high latency at the time I am inclined to believe that a full GC with cycle collector was happening. I assume some AsyncExecutes had completed execution, so were no longer directly alive, but were being sustained by a CompletionNotifier's reference count. Based on my understanding of the rather undocumented nsRefPtr's source, I doubt that the cycle collector is able to follow the nsRefPtr link from the 'live' CompletionNotifier to the AsyncExecute. I don't have ironclad proof of this, but my previous auditing of the code had suggested it was otherwise impossible for this to happen without memory corruption.
Assignee | ||
Comment 1•16 years ago
|
||
We shouldn't be touched by the cycle collector because we don't tell it about us. I'm quite confused as to how we managed to have mPendingEvents.Count() be greater non-zero when we hit the constructor. Every single pending event we post ends up holding a reference to the AsyncExecute object...
Reporter | ||
Comment 2•16 years ago
|
||
So, I never realized that mCallingThread->Dispatch is called before the mPendingEventsMutex is taken and mPendingEvents.AppendObject is called. I had the debugger trigger on the assertion. mPendingEvents has 3 completed CallbackResultNotifiers in it (they are not canceled and mCompletionNotifier was nulled out). Each of their reference counts was 1. This suggests that they must have run to completion before AsyncExecute got a chance to add them to mPendingEvents. Is there any reason not to move the call to dispatch inside the mutex? ps: Apologies for crying wolf, I confess I was hoping to get the problem solved for me without firing up a debugger or taking a break to properly read up on the cycle collector ;). (It turns out I only had to read a few lines, of course...)
Summary: Notifier use of nsRefPtr rather than nsCOMPtr can result in live AsyncExecute being cycled collected → AsyncExecute helper classes can complete() before they are added to mPendingEvents
(In reply to comment #2) > Is there any reason not to move the call to dispatch inside the mutex? Yes, you shouldn't call out to other modules while holding a lock as it is a deadlock hazard. But there's definitely a race here that needs to be addressed.
Maybe as simple as this? + nsRefPtr<CallbackResultNotifier> notifier = ... + { + nsAutoLock mutex(mPendingEventsMutex); + (void)mPendingEvents.AppendObject(notifier); + } + + nsresult status = mCallingThread->Dispatch(notifier, NS_DISPATCH_NORMAL); + + if (NS_FAILED(status)) { + nsAutoLock mutex(mPendingEventsMutex); + (void)mPendingEvents.RemoveObject(notifier); + }
Assignee | ||
Comment 6•16 years ago
|
||
in fact, r=sdwilsh for a patch that does what comment 4 says, but with no newline after mCallingThread->Dispatch(...)
Assignee | ||
Comment 7•16 years ago
|
||
This is what bent said, but in patch form. Needs bug 458756 applied to apply cleanly I suspect.
Attachment #342474 -
Flags: review+
Assignee | ||
Updated•16 years ago
|
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
OS: Mac OS X → All
Hardware: PC → All
Target Milestone: --- → mozilla1.9.1b2
Reporter | ||
Comment 8•16 years ago
|
||
Shouldn't the patch also address the identical case for CompletionNotifier and the similar, but more complicated, case for ErrorNotifier? (ErrorNotifier::Dispatch creates the ErrorNotifier instance and dispatches it before actually returning control to the AsyncExecute instance who maintains mPendingEvents.)
Assignee | ||
Comment 10•16 years ago
|
||
I think tinderbox is seeing this occasionally.
Assignee: bent.mozilla → sdwilsh
Assignee | ||
Updated•16 years ago
|
Whiteboard: [needs patch]
Assignee | ||
Comment 11•16 years ago
|
||
OK. This is a much better solution. All event dispatching now goes through Notify, and I've added NotifyError and NotifyResults. Also renamed Complete to NotifyComplete. There is now only one place in the code where pending events are added, and two where they are removed, making this logic much simpler to follow.
Attachment #342474 -
Attachment is obsolete: true
Attachment #343473 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 12•16 years ago
|
||
Comment on attachment 343473 [details] [diff] [review] v2.0 Just to be clear... >- { >- nsAutoLock mutex(mStateMutex); >- mState = ERROR; >- } >+ nsAutoLock mutex(mStateMutex); >+ mState = ERROR; This change is unrelated, but I noticed we were not holding the state mutex, which is an invariant of calling NotifyComplete.
Assignee | ||
Updated•16 years ago
|
Whiteboard: [needs patch] → [has patch][needs review bent]
Comment on attachment 343473 [details] [diff] [review] v2.0 I would recommend using the atomic set operations rather than locking all over just to set an integer value. That would let you always lock inside the methods you'd like to rather than assuming the caller does the right thing.
Assignee | ||
Comment 14•16 years ago
|
||
(In reply to comment #13) > (From update of attachment 343473 [details] [diff] [review]) > I would recommend using the atomic set operations rather than locking all over > just to set an integer value. That would let you always lock inside the methods > you'd like to rather than assuming the caller does the right thing. I hadn't done that originally because I wasn't sure if we really have atomic reads on all platforms (I'm pretty sure we don't actually unless we use the volatile keyword, which we can't actually use with PR_AtomicSet), but I've got a patch coming with that.
Assignee | ||
Comment 15•16 years ago
|
||
Attachment #343473 -
Attachment is obsolete: true
Attachment #344011 -
Flags: review?(bent.mozilla)
Attachment #343473 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 17•16 years ago
|
||
OK - this bug really sucks, for the record. It is quite possible to call cancel when we've already finished. In the past, it would throw (I'd say that is not desirable), but with the approach that bent suggested to me, it means we end up having it be quite possible to say that our reason for completion in handleCompletion is REASON_FINISHED instead of REASON_CANCELED. I almost think that we have to keep the mState lock and throw if we try to cancel when we are not currently in a pending state (or at least atomically set on the background, and read only on the calling thread). Regardless, the test needs to be fixed to reflect that...
Assignee | ||
Comment 18•16 years ago
|
||
But that also seems like a really bad idea. I'm not sure what the right behavior is here...
Assignee | ||
Comment 19•16 years ago
|
||
This stuff is hard... This has an interface change for mozIStoragePendingStatement::cancel. It now returns a boolean indicating if it successfully canceled or not. Discussed this at length with bent, and it makes our tests more reliable.
Attachment #344011 -
Attachment is obsolete: true
Attachment #344560 -
Flags: review?(bent.mozilla)
Attachment #344011 -
Flags: review?(bent.mozilla)
Updated•16 years ago
|
Attachment #344560 -
Flags: review?(bent.mozilla) → review-
Comment on attachment 344560 [details] [diff] [review] v3.0 >+ * Cancels a pending statement, if possible. This will only fail if you try >+ * cancel more than once. Typo, should be "try to cancel". > for (PRUint32 i = 0; i < mStatements.Length(); i++) { So, for the record, I think this while-within-a-for-loop construct is tough to follow and that it could be simplified by breaking the while loop into a separate function. I further think that would let you use the scoped locks more effectively. But since we're tight on time and it looks like you've done everything correctly I guess that can wait for a followup. >+#ifdef DEBUG >+ PRBool onCallingThread = PR_FALSE; >+ (void)mCallingThread->IsOnCurrentThread(&onCallingThread); >+ NS_ASSERTION(onCallingThread, "Not canceling from the calling thread!"); >+#endif Handy! Maybe file a followup about adding these in more places :) >+ // If we have already canceled, we have an error, but always indicate that >+ // we are trying to cancel. >+ NS_ENSURE_FALSE(PR_AtomicSet(&mTryToCancel, PR_TRUE), NS_ERROR_UNEXPECTED); I'd avoid calling functions within any of the nsDebug macros. Just make a stack variable. And don't use the PR_AtomicSet on a PRBool. That will break when PRBool and PRInt32 are no longer typedef'd to the same thing. >+ // Establish if we can cancel > { >+ nsAutoLock mutex(mLock); >+ *_successful = (mState == PENDING); > } Your cancel could have already succeeded, need to also check mState == CANCELED. Or move the set for mTryToCancel inside the mutex after you've tested. >+ nsAutoLock::DestroyLock(mLock); >+ mLock = nsnull; Nulling this out in the destructor isn't necessary > if (mCallback) { > nsRefPtr<CompletionNotifier> completionEvent = >+ new CompletionNotifier(mCallback, mState); >+ NS_ENSURE_TRUE(completionEvent, NS_ERROR_OUT_OF_MEMORY); >+ >+ (void)mCallingThread->Dispatch(completionEvent, NS_DISPATCH_NORMAL); > > // We no longer own mCallback (the CompletionNotifier takes ownership). > mCallback = nsnull; I'd move this above the dispatch, then, to avoid any funny races. >+ nsresult NotifyError(PRInt32 aErrorCode, const char *aMessage) I can't find a single place where you care about the return value in any of the Notify functions except NotifyComplete... You're always discarding the return. Maybe make them return void? I'd also document that all these Notify functions expect to *not* have the lock in place when called. And I was thinking about it, it would probably be trivial to subclass nsAutoLock in debug builds to expose the locked/unlocked status so you could assert that. Followup fodder. >+ PRBool mTryToCancel; I think I like 'mCancelRequested' or 'mCancelAttempted' better... Your call though. >+ * This is the lock that protects our state from changing. Maybe list the variables it protects instead since that's pretty vague.
Assignee | ||
Comment 21•16 years ago
|
||
Addresses review comments
Attachment #344560 -
Attachment is obsolete: true
Attachment #344679 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 22•16 years ago
|
||
er...with the exception of the nulling out in the destructor which I've fixed locally.
Comment on attachment 344679 [details] [diff] [review] v3.1 >+ // We need to indicate that we want to try and cancel now. >+ mCancelRequested = PR_TRUE; >+ >+ // Establish if we can cancel >+ *_successful = (mState == PENDING); Add some comments here about why you're canceling the previously queued results even if you're going to return false.
Attachment #344679 -
Flags: review?(bent.mozilla) → review+
Assignee | ||
Comment 24•16 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/d3cdb61f7d29
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: [has patch][needs review bent]
Reporter | ||
Comment 25•16 years ago
|
||
It looks like CompletionNotifier::cancel was accidentally left around?
Assignee | ||
Comment 26•16 years ago
|
||
huh, missed one then. Could you please file a bug to remove that?
Assignee | ||
Comment 28•16 years ago
|
||
So, I do not think we managed to get this race condition nailed out given a recent tinderbox failure: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1225387432.1225397005.17478.gz#err0 The test in question is setup to be fault tolerant of our reason being either finished or canceled. It's that, or we've managed to say we can cancel, but do not properly update state somewhere.
Assignee | ||
Comment 29•16 years ago
|
||
I know what's up with comment 28 - I'll file a bug about it later. For now there is a sticky note on my desk on what to do.
Updated•16 years ago
|
Whiteboard: [orange]
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [orange]
You need to log in
before you can comment on or make changes to this bug.
Description
•