Closed Bug 468736 Opened 16 years ago Closed 13 years ago

PSM nsPSMBackgroundThread::requestExit is racy and not threadsafe

Categories

(Core Graveyard :: Security: UI, defect)

defect
Not set
critical

Tracking

(firefox5-, firefox6-, blocking2.0 .x+)

RESOLVED FIXED
mozilla7
Tracking Status
firefox5 - ---
firefox6 - ---
blocking2.0 --- .x+

People

(Reporter: timeless, Assigned: briansmith)

References

Details

(Keywords: hang, Whiteboard: [inbound])

Attachments

(4 files, 7 obsolete files)

(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
briansmith
: review+
Details | Diff | Splinter Review
Attached file profiled stack trace from camino (deleted) —
ok, so the main thread is here: http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/security/manager/ssl/src/nsPSMBackgroundThread.cpp?rev=23dd66069c44&mark=88-97#82 - i think this code needs a stack object. it looks like a race http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/security/manager/ssl/src/nsNSSIOLayer.cpp?rev=23dd66069c44&mark=341-375#334 - is the other thread - the second thread enters passes 340 and then gets dangerous. the main thread enters and does stuff, but it's too late.
This causes unpredictable hangs (see bug 468234) and is thusly nominated for blocking.
Flags: blocking1.9.1?
I don't think this is prevalent enough to be a blocker. Definitely wanted.
Flags: wanted1.9.1+
Flags: blocking1.9.1?
Flags: blocking1.9.1-
I can reproduce this almost everytime I use Camino (Version 2.0b1 (1.9.0.6pre 2008121820)) (See Duplicate bug 471080). It seems to be more prevalent when I use GMail before I try to quit.
(In reply to comment #6) > *** Bug 484367 has been marked as a duplicate of this bug. *** nsPSMBackgroundThread::requestExit is innocent here. See bug 484367, it happened to me right after start, that bug also contains some local variables dump on the dead thread, hope it might move us forward a bit. It seems to me that this is some event queue problem, but it's a guess. I could reproduce just ones.
could someone who hits but try running w/ this? (you can tryserver it)
Assignee: kaie → timeless
Status: NEW → ASSIGNED
Attachment #371365 - Flags: review?
Attachment #371365 - Flags: superreview?(kaie)
Attachment #371365 - Flags: review?(stewart.campbell)
Attachment #371365 - Flags: review?(honzab.moz)
Attachment #371365 - Flags: review?
Comment on attachment 371365 [details] [diff] [review] Patch v1 [i think this might address the race i described] In your new while cycle in nsPSMBackgroundThread::requestExit you should access mExitState under the lock as you modify it on the other thread. Why did you remove NS_ASSERTION(mThreadHandle, "Could not create nsPSMBackgroundThread\n"); ? Nit: if (ssl_thread_singleton->mExitState): I'd rather see if (ssl_thread_singleton->mExitState != ePSMThreadRunning), just to make more clear what the condition is about Please don't include the only-formatting changes in nsNSSComponent, it's unrelated to this bug. Otherwise it looks good, we just have to take care to test shutdown well, there might be races introduced.
Attachment #371365 - Flags: review?(honzab.moz) → review-
Attached file another possible sample of this hang (deleted) —
I have had another hang after it was suggested in bug 466148 that my crashes are due to this bug. I attached the sample. If it is the same bug, I would be happy to test out builds with fixes, as Smokey said it was difficult to reproduce.
Oscar van Vliet wrote: > Smokey said it was difficult to reproduce. I really don't understand this. My bug, comment #1 (which existed before this one was reported, yet somehow got closed as a dupe of this one?!?) has a testcase that triggers the problem every single time, and a website that I visit frequently brings Firefox 3.5 to its knees EVERY SINGLE TIME, without fail. I can't (and won't) upgrade from 3.0.11 until this bug is resolved. I'd just like to point out that there may be information in the comments under my bug that could prove helpful for someone working on this one (if anyone), and also that I can reproduce the problem consistently, so I'd be happy to help with testing patches or capture samples from the crashing, if it would be helpful.
(In reply to comment #12) > Created an attachment (id=386436) [details] > A sample of a hang from Firefox 3.5 final AFAICT, this sample does not have nsPSMBackgroundThread::requestExit in it, and so is a different bug.
Severity: major → critical
Attachment #371365 - Attachment description: i think this might address the race i described → Patch v1 [i think this might address the race i described]
Attached patch whitespace-ignore version of patch v1 (obsolete) (deleted) — Splinter Review
timeless, maybe it's just me, but I find it very difficult to separate "bug fix meat" from "white space changes" when looking through a patch, especially when trying to understand fixes for tricky code. This is timeless' patch, ignoring all whitespace changes (which doesn't work for the places where you split code into multiple lines).
Attached patch smaller whitespace-ignore vresion of patch v1 (obsolete) (deleted) — Splinter Review
I've removed additional unnecessary chunks from the patch.
Attachment #439892 - Attachment is obsolete: true
Comment on attachment 371365 [details] [diff] [review] Patch v1 [i think this might address the race i described] clearing review requests, because Honza already gave r- waiting for answers to Honza's comments and a new patch
Attachment #371365 - Flags: superreview?(kaie)
Attachment #371365 - Flags: review?(stewart.campbell)
So, what's needed to make progress on this bug? Just addressing the comments in comment 9?
(In reply to comment #17) > So, what's needed to make progress on this bug? Just addressing the comments > in comment 9? Probably yes. Also some kind of automated testing for this would be excelent, but probably unachievable.
OK, then. If this is deemed to be blocking, I can pick it up and work on it. If not, I'm afraid that I can take a look at it post-2.0.
blocking2.0: --- → ?
looks like this one might have missed the grand re-triage. have we decided to block on firefox 4, or not?
Not a regression from 3.6, not a blocker.
blocking2.0: ? → .x
It looks like this might be causing bug 651665. Ehsan: any chance you can pick this up now that we're post-2.0?
Attached patch Updated patch (obsolete) (deleted) — Splinter Review
This patch updates the no-whitespace-change version to apply cleanly on trunk, and then addresses review comments in comment 9. I can't reproduce the problem, so I don't really know how to test this. Running the browser and going to SSL sites doesn't seem to break anything obvious. I've posted this to the try server as well.
Attachment #527557 - Flags: review?(honzab.moz)
One indication that your patch resolves bug 651665 is that the Add-on SDK tests pass on Linux Debug with your push to try server: http://tbpl.mozilla.org/?tree=MozillaTry&noignore=1&pusher=eakhgari@mozilla.com Compare with this random push of yours to mozilla-central: http://tbpl.mozilla.org/?noignore=1&pusher=eakhgari@mozilla.com That isn't determinative, since it sometimes happens anyway (f.e. in that random push of yours, one of the four debug builds that have run the tests have passed them). But most of the time the tests fail on debug builds when tested by the test automation, so it's a good sign. It'd be useful to have Linux64 Debug, OS X Debug, and OS X64 Debug results for that try server push as well.
(In reply to comment #24) > It'd be useful to have Linux64 Debug, OS X Debug, and OS X64 Debug results for > that try server push as well. The results are now in, and all those builds passed the SDK tests, so it looks like your patch does indeed fix the problem.
And in general try server was extremely happy with my push: <http://tbpl.mozilla.org/?tree=MozillaTry&rev=b53e241153e5>. (The two windows reftest failures are caused by another patch in my queue, and the OS X opt Moth orange is intermittent.)
(In reply to comment #25) > The results are now in, and all those builds passed the SDK tests, so it looks > like your patch does indeed fix the problem. To make extra sure, I re-ran all the SDK tests on the four Mac/Linux debug builds, and they all passed the second time as well: http://tbpl.mozilla.org/?tree=MozillaTry&rev=b53e241153e5&noignore=1 We don't have test results on Windows debug builds because the SDK tests aren't being run on those yet (bug 651806).
Comment on attachment 527557 [details] [diff] [review] Updated patch >+++ b/security/manager/ssl/src/nsCertVerificationThread.cpp >+ while (!mExitState && (0 == verification_thread_singleton->mJobQ.GetSize())) { mExitState == ePSMThreadRunning please. >+ if (mExitState) > break; mExitState != ePSMThreadRunning please. -1917,21 +1915,29 @@ nsNSSComponent::Init() > mSSLThread = new nsSSLThread(); >- if (mSSLThread) >- mSSLThread->startThread(); >+ if (mSSLThread) { >+ if (NS_FAILED(mSSLThread->startThread())) { >+ delete mSSLThread; >+ mSSLThread = nsnull; >+ } >+ } > mCertVerificationThread = new nsCertVerificationThread(); >- if (mCertVerificationThread) >- mCertVerificationThread->startThread(); >+ if (mCertVerificationThread) { >+ if (NS_FAILED(mCertVerificationThread->startThread())) { >+ delete mCertVerificationThread; >+ mCertVerificationThread = nsnull; >+ } >+ } > Do the same in nsNSSComponent::DoProfileChangeNetRestore or rather remove this change (I prefer the letter). I don't think we have non-null checks on all places in the code. > void nsPSMBackgroundThread::requestExit() >- if (mExitRequested) >+ if (mExitState) > return; mExitState != ePSMThreadRunning please. >+ while (GetExitStateThreadSafe() < ePSMThreadStopped) { >+ NS_ProcessNextEvent(nsnull, PR_FALSE); >+ if (GetExitStateThreadSafe() < ePSMThreadStopped) >+ PR_Sleep(PR_SecondsToInterval(5)); >+ } We should add a comment here that we cannot rely on posting/processing events since there is no event loop that late when this code gets called. > PRBool nsSSLThread::exitRequested() >+ return ssl_thread_singleton->mExitState; mExitState != ePSMThreadRunning please. r=honzab with these fixes
Attachment #527557 - Flags: review?(honzab.moz) → review+
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla6
Depends on: 653161
I've backed this out on the suspicion of being the culprit for the 400% Tshutdown regression: http://hg.mozilla.org/mozilla-central/rev/da816b446ffa If this is the cause, we'll need a new owner for this bug, as I won't have enough time to diagnose the regression. If it's not the cause, I'll reland it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The backout fixed the regression. Somebody more familiar with the code here should pick it up and investigate the regression.
Assignee: timeless → nobody
Brian & Honza: would one of you be able to pick this up?
To be clear, this regressed Tp4_shutdown.
I will do that.
Assignee: nobody → honzab.moz
Status: REOPENED → ASSIGNED
Blocks: cmd-q
What's blocking this from moving forward? I'm hitting this about every other time I shut down a Nightly ...
I do not think the Tshutdown regression is worse than this hang; the hang is actually a worse-than-400% shutdown time regression and lots of people are running into this. The patch can and should be improved. I will help Honza with it.
(In reply to comment #38) > I do not think the Tshutdown regression is worse than this hang I beg to differ. The Tshutdown regression might affect all of our users, but the hang definitely only affects some of them. That being said, I also definitely think that this could be improved, but I didn't know enough about the code in question to do that myself.
We know this old bug is happening much more frequently in nightly. Do we know why? Is this occurring more frequently on FF5 beta too? If so, then we should try to address it for FF5. Did patches related to SYN retry and/or pipelining cause this to happen more frequently? I know SYN retry itself doesn't cause this, because I have seen it even when the SYN retry timeout is zero (which disables the actual retry). But, I wouldn't be surprised if the associated changes to support SYN retry had some effect in increasing the frequency of occurrence.
Comment on attachment 527557 [details] [diff] [review] Updated patch Review of attachment 527557 [details] [diff] [review]: ----------------------------------------------------------------- ::: security/manager/ssl/src/nsPSMBackgroundThread.cpp @@ +94,5 @@ > } > + while (GetExitStateThreadSafe() < ePSMThreadStopped) { > + NS_ProcessNextEvent(nsnull, PR_FALSE); > + if (GetExitStateThreadSafe() < ePSMThreadStopped) > + PR_Sleep(PR_SecondsToInterval(5)); Why not PR_Sleep(PR_INTERVAL_NO_WAIT) or a lower-than-5-seconds sleep interval here? Five seconds seems ridiculously long.
(In reply to comment #41) > Why not PR_Sleep(PR_INTERVAL_NO_WAIT) or a lower-than-5-seconds sleep > interval here? Five seconds seems ridiculously long. Because otherwise we would have 100% load on that thread w/o any context switches. It would lead to hang of UI thread as well. (In reply to comment #40) > Did patches related to SYN retry and/or pipelining cause this to happen more > frequently? I strongly believe there is no relation. I was experiencing this bug also pretty often year or longer ago. Probably some pattern makes this appear more often, therefor people experience it often when working on a bug and repeating the same behavior. Brian, if you have time to investigate the regression, then please do so. If not, I can get to this maybe at the end of this week.
We potentially could: - if there is an event queue (e.g. during profile change) then just process messages with canWait = true and do not PR_Sleep - if there is no event queue (usual case during shutdown) then we could use a monitor or cond var bound to the existing thread mutex notified when the state changes I would just need to figure out how to check whether there's still the event queue.
(In reply to comment #42) > Because otherwise we would have 100% load on that thread w/o > any context switches. It would lead to hang of UI thread as > well. It would still yield. I can see why it might be better to sleep a little, but less than five seconds. I am going to investigate trying a smaller interval first. > I strongly believe there is no relation. I was experiencing this bug also > pretty often year or longer ago. Probably some pattern makes this appear > more often, therefor people experience it often when working on a bug and > repeating the same behavior. There are 4 people (including me) sitting within 10 meters of me that are regularly running into this for the last four weeks, so I think something has changed recently that makes this occur much more frequently. Maybe it's even IPV6 related, because I think IPV6 was turned on our office recently. > Brian, if you have time to investigate the regression, then please do so. > If not, I can get to this maybe at the end of this week. I am working on it now.
Assignee: honzab.moz → bsmith
> PR_SecondsToInterval Oh! Right, not 5 SECONDS, but rather MILLISECONDS... I completely overlooked that...
(In reply to comment #45) > > PR_SecondsToInterval > > Oh! Right, not 5 SECONDS, but rather MILLISECONDS... I completely > overlooked that... :) as a mostly unimportant aside, I would make it 15ms (still a massive drop from 5000) instead of 5ms as a lot of the older NT derivations had a 15ms tick by default.. if you specify less it sometimes just ends up busy waiting anyhow for you.
too late for us to do anything in 5.
I made every change of mExitState cause a notification on mCond; requestExit will then wait until mExitState changes, while processing an event every 25ms.
Attachment #527557 - Attachment is obsolete: true
Attachment #538139 - Flags: review?(honzab.moz)
I have not figured out yet why this is happening more often recently for many of us.
(In reply to comment #49) > I have not figured out yet why this is happening more often recently for > many of us. Oh, you missed the news: http://www.nasa.gov/mission_pages/sunearth/news/News060711-blast.html
The previous patch wasn't folded into the patch I based it on, so it was really an interdiff against the patch that got backed out. Here is the folded patch.
Attachment #538139 - Attachment is obsolete: true
Attachment #538307 - Flags: review?(honzab.moz)
Attachment #538139 - Flags: review?(honzab.moz)
Comment on attachment 538307 [details] [diff] [review] Wait for 25ms or until notified by background thread, fixed Review of attachment 538307 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for the updated patch, Brian. But, I was wrong about one major thing. I was convinced that during shutdown there is no event queue anymore at the point of requestExit being called. But there is. We shut the main thread loop later in ShutdownXPCOM function. Based on that and fact that nsNSSComponent is thanks nsPSMInitPanic only instantiated on the main thread I propose: - remove the code that is waiting for the cond var with 25ms delays until the thread sets the state to 'stopped' - let the thread we want to stop set the state to 'stopped' and then post just an empty event to the main thread after it has finished - main thread waiting for shutdown will do: while (GetExitState(threadLock) != ePSMThreadStopped) { threadLock.unlock(); NS_ProcessPendingEvents(do_GetCurrentThread(), PR_IntervalToMilliseconds(100)); threadLock.lock(); } .. or what ever form of this code - the timeout should be there just for safety, but I will r+ patch even w/o it This gives us no delays and fluent dispatch of all events pending on the main thread. Time will prove if processing events from this place won't cause any trouble with reentering. Dropping the r flag since the patch will be adjusted anyway. General comments: ::: security/manager/ssl/src/nsCertVerificationThread.cpp @@ +132,5 @@ > > { > MutexAutoLock threadLock(verification_thread_singleton->mMutex); > + > + while (GetExitState(threadLock) == ePSMThreadRunning && Repeating this on and on is unreadable and tends to mistakes. Maybe change this to bool IsRunning() method? Also could have IsStopped() method for the check in the request exiting thread. ::: security/manager/ssl/src/nsNSSComponent.cpp @@ +432,5 @@ > + delete mCertVerificationThread; > + mCertVerificationThread = nsnull; > + } > +} > + Let's make these two functions members of nsNSSComponent, no need to separate them. Also have the two threads as arguments suggests these are one-purpose functions. Nit: prefixing arguments with 'm' is generally quit unusual. ::: security/manager/ssl/src/nsPSMBackgroundThread.cpp @@ +77,5 @@ > + return mExitState; > +} > + > +nsPSMBackgroundThread::ExitState > +nsPSMBackgroundThread::GetExitStateWithoutLock() const Maybe rather GetExitStateUnlocked? Kai: why exactly we need access to this w/o a lock? exitRequest seems not to be called under the thread's lock, or it is? http://mxr.mozilla.org/mozilla-central/search?string=exitRequested @@ +109,5 @@ > + if (GetExitState(threadLock) == ePSMThreadStopped) { > + break; > + } > + } > + NS_ProcessNextEvent(nsnull, PR_FALSE); NS_ProcessPendingEvents(do_GetCurrentThread()) is better on this place. It handles all events that might get posted, not just the first one. ::: security/manager/ssl/src/nsPSMBackgroundThread.h @@ +67,5 @@ > + ePSMThreadStopRequested = 1, > + ePSMThreadStopped = 2 > + }; > + > + ExitState GetExitState(::mozilla::MutexAutoLock const & proofOfLock) const; Little bit knotty this 'proofOfLock' but let's take it. @@ +84,5 @@ > nsresult startThread(); > void requestExit(); > + > +private: > + PRUint32 GetExitStateThreadSafe(); You don't need this anymore.
Attachment #538307 - Flags: review?(honzab.moz)
Minusing this for tracking - would like to see an approval request for this to land on aurora once we have a safe, reviewed patch baking on central.
I used to be able to easily reproduce this on Windows with Thunderbird on my desktop, but not so much on my windows laptop for some reason. But FTR, if I got IMAP autosync going on SSL connection and shutdown in the middle, I would generally hit this hang.
Here is the fix that we talked before. The background thread will post a message to the main thread, as the main thread spins the event loop in requestExit waiting for it.
Attachment #371365 - Attachment is obsolete: true
Attachment #439893 - Attachment is obsolete: true
Attachment #538307 - Attachment is obsolete: true
Attachment #541211 - Flags: review?(honzab.moz)
Comment on attachment 541211 [details] [diff] [review] Have background threads send exit notification to requestExit in the main thread Review of attachment 541211 [details] [diff] [review]: ----------------------------------------------------------------- Looks nice. r+ after following comments gets addressed, manly the last one: ::: security/manager/ssl/src/nsNSSComponent.cpp @@ +397,2 @@ > { > + if (mSSLThread) { Why not left in the following form? if () { } You are against the file style. @@ +409,5 @@ > + > +void > +nsNSSComponent::createBackgroundThreads() > +{ > + NS_ASSERTION(mSSLThread == nsnull, "SSL thread already created."); New line after this line please. Only one space between SSL and thread. @@ +419,5 @@ > + return; > + } > + > + NS_ASSERTION(mCertVerificationThread == nsnull, > + "Cert verification thread already created."); New line as well here. Maybe put this at the top of the method to get assertion/warning even we fail to create the ssl thread? ::: security/manager/ssl/src/nsPSMBackgroundThread.cpp @@ +73,5 @@ > > +PRBool > +nsPSMBackgroundThread::exitRequested(const MutexAutoLock & /*proofOfLock*/) const > +{ > + return mExitState != ePSMThreadRunning; Maybe let this method return exitRequestedNoLock() ? exitRequestedNoLock then could be explicitly inline. @@ +81,5 @@ > +nsPSMBackgroundThread::postStoppedEventToMainThread( > + MutexAutoLock const & /*proofOfLock*/) > +{ > + NS_ASSERTION(PR_GetCurrentThread() == mThreadHandle, > + "Background thread stopped from another thread"); New line after this line please. ::: security/manager/ssl/src/nsPSMBackgroundThread.h @@ +64,4 @@ > mozilla::CondVar mCond; > > + PRBool exitRequested(::mozilla::MutexAutoLock const & proofOfLock) const; > + PRBool exitRequestedNoLock() const { return mExitState; } return mExitState != ePSMThreadRunning; @@ +81,5 @@ > nsresult startThread(); > void requestExit(); > + > +private: > + PRUint32 GetExitStateThreadSafe(); What is this used for? ::: security/manager/ssl/src/nsSSLThread.cpp @@ +1143,3 @@ > { > + return !ssl_thread_singleton || > + ssl_thread_singleton->exitRequestedNoLock(); Why exactly are you changing this code, I mean the result value here? This code has been introduced in bug 420187. I understand the code looks wrong (we should return PR_TRUE for !ssl_thread_singleton). Not sure if it really is wrong or not. However, I would change the method's result in a different bug, not here, where we solve a quit different issue.
Attachment #541211 - Flags: review?(honzab.moz) → review+
Comment on attachment 541736 [details] [diff] [review] v2 Have background threads send exit notification to requestExit in the main thread Carrying forward r+. Review comments addressed. Bug 666999 was filed for the bug in nsSSLThread::stoppedOrStopping.
Attachment #541736 - Flags: review+
Whiteboard: [inbound]
Status: ASSIGNED → RESOLVED
Closed: 14 years ago13 years ago
Resolution: --- → FIXED
Target Milestone: mozilla6 → mozilla7
Depends on: 669050
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: