729536 - Deadlock by xul!nsHttpConnectionMgr::Shutdown

Reporter

Description

•

13 years ago

When shuting down Firefox Nightly, process isn't terminated. thread 0 and thread 8 has dead lock. Although thread 0 tries shutdown, lock has thread 8. and thread 8 tries dispatch runnable object, but thread 0 is waiting lock. Also, I don't know how to reproduce this. 0:007> ~0k 100 Child-SP RetAddr Call Site 00000000`0040e868 000007fe`fdd910dc ntdll!ZwWaitForSingleObject+0xa 00000000`0040e870 000007fe`fafea312 KERNELBASE!WaitForSingleObjectEx+0x79 00000000`0040e910 000007fe`ed396159 nspr4!PR_Wait+0x102 00000000`0040e960 000007fe`ed35a900 xul!mozilla::ReentrantMonitor::Wait+0xd 00000000`0040e990 000007fe`ed35a990 xul!nsHttpConnectionMgr::Shutdown+0xa4 00000000`0040e9c0 000007fe`ed16f919 xul!nsHttpHandler::Observe+0x8c 00000000`0040ea60 000007fe`ed172c81 xul!nsObserverList::NotifyObservers+0x79 00000000`0040eb30 000007fe`ed32977b xul!nsObserverService::NotifyObservers+0xa1 00000000`0040eb60 000007fe`ed340272 xul!nsXREDirProvider::DoShutdown+0x7f 00000000`0040eb90 000007fe`ed27324d xul!ScopedXPCOMStartup::~ScopedXPCOMStartup+ 0x56 00000000`0040ebc0 00000001`3f7a17ba xul!XRE_main+0x1261 00000000`0040f0e0 00000001`3f7a1b3a firefox!wmain+0x75a 00000000`0040fc60 00000000`775c652d firefox!__tmainCRTStartup+0x11a 00000000`0040fc90 00000000`776fc521 kernel32!BaseThreadInitThunk+0xd 00000000`0040fcc0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d 0:007> ~8k 100 Child-SP RetAddr Call Site 00000000`05f7f158 000007fe`fdd910dc ntdll!ZwWaitForSingleObject+0xa 00000000`05f7f160 000007fe`fafe1472 KERNELBASE!WaitForSingleObjectEx+0x79 00000000`05f7f200 000007fe`ed3200ce nspr4!PR_WaitCondVar+0x122 00000000`05f7f240 000007fe`edb1463c xul!mozilla::CondVar::Wait+0xe 00000000`05f7f270 000007fe`edb8f299 xul!mozilla::psm::SyncRunnableBase::DispatchToMainThreadAndWait+0x4c 00000000`05f7f2a0 000007fe`edf55f78 xul!nsNSSSocketInfo::GetPreviousCert+0x59 00000000`05f7f2d0 000007fe`fad9538b xul!HandshakeCallback+0x3f8 00000000`05f7f600 000007fe`fad98938 ssl3!ssl3_FinishHandshake+0x73 00000000`05f7f630 000007fe`fad9b83d ssl3!ssl3_HandleFinished+0x400 00000000`05f7f6b0 000007fe`fad9ba25 ssl3!ssl3_HandleHandshakeMessage+0x2c9 00000000`05f7f740 000007fe`fad9bf63 ssl3!ssl3_HandleHandshake+0xe1 00000000`05f7f770 000007fe`fad9c201 ssl3!ssl3_HandleRecord+0x41f 00000000`05f7f880 000007fe`fad9d7d1 ssl3!ssl3_GatherCompleteHandshake+0x205 00000000`05f7f8e0 000007fe`fada3432 ssl3!ssl_GatherRecord1stHandshake+0x35 00000000`05f7f910 000007fe`fada4518 ssl3!ssl_Do1stHandshake+0x56 00000000`05f7f940 000007fe`fada7fbf ssl3!ssl_SecureSend+0x158 00000000`05f7f980 000007fe`edb8f65c ssl3!ssl_Send+0x6b 00000000`05f7f9b0 000007fe`edb90e11 xul!PSMSend+0x8c 00000000`05f7fa00 000007fe`ed2ae8ae xul!nsSSLIOLayerWrite+0x11 00000000`05f7fa40 000007fe`ed66f8a4 xul!nsSocketOutputStream::Write+0xbe 00000000`05f7fa80 000007fe`ed25701a xul!nsHttpConnection::EnsureNPNComplete+0x406878 00000000`05f7fb60 000007fe`ed269015 xul!nsHttpConnection::OnSocketWritable+0x42 00000000`05f7fba0 000007fe`ed1ff41a xul!nsHttpConnection::OnOutputStreamReady+0x21 00000000`05f7fbd0 000007fe`ed1ff52f xul!nsSocketOutputStream::OnSocketReady+0xa6 00000000`05f7fc00 000007fe`ed1da0f1 xul!nsSocketTransport::OnSocketReady+0xf3 00000000`05f7fc30 000007fe`ed04d417 xul!nsSocketTransportService::DoPollIteration+0x141 00000000`05f7fc90 000007fe`ed0f46c6 xul!nsSocketTransportService::Run+0xc7 00000000`05f7fce0 000007fe`ed194a2c xul!nsThread::ProcessNextEvent+0x1c6 00000000`05f7fdb0 000007fe`fafe2508 xul!nsThread::ThreadFunc+0x8c 00000000`05f7fde0 000007fe`fafe128d nspr4!_PR_NativeRunThread+0x158 00000000`05f7fe50 00000000`73f32fdf nspr4!pr_root+0xd 00000000`05f7fe80 00000000`73f33080 MSVCR90!endthreadex+0x47 00000000`05f7feb0 00000000`775c652d MSVCR90!endthreadex+0xe8 00000000`05f7fee0 00000000`776fc521 kernel32!BaseThreadInitThunk+0xd 00000000`05f7ff10 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

Honza Bambas (:mayhemer)

Assignee

Comment 1

•

13 years ago

nsHttpConnectionMgr::Shutdown may let the event queue loop while waiting for the monitor (using mon.Wait(e.g. 100ms);).

Severity: normal → critical

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 2

•

13 years ago

What do you guys think about Honza's proposed solution in comment 1? Previously, we have already established the pattern that, if we want the Socket Transport Thread and the main thread to be synced, then we must first block the socket transport thread and then execute the main thread option. This is what CertErrorRunnableRunnable (not a typo) and other code does. If nsHttpConnectionMgr were to follow this pattern, it would send an event to the socket transport service thread, and then use SyncRunnableBase (or similar) to execute the stuff on the main thread that must be executed on the main thread. If Honza's proposed solution will not work, then I have another potential solution that would work, but which would be more effort. This would require all of the following: (a) Dropping support for main-thread-only (i.e. JS) nsICertBadCertListener2 and nsISSLErrorListener. Every single nsICertBadCertListener2 and nsISSLErrorListener in mozilla-central and comm-central does "return true;" to suppress the prompts that got turned off by default in bug 682329. We can just stop calling bad cert handlers and SSL error listeners completely without having any notable negative effect on Firefox or Thunderbird themselves. The following addons on AMO would be affected: ExQuilla for Microsoft Exchange (for Thunderbird) MiTM Me - alternate (click-through) cert error UI Lightning for Thunderburd -- emulating Thunderbird's unsafe cert error exception dialog box. KeeFox (A password manager)--AFACT, this one is *completely* and silently vulnerable to MITM attacks because of how it implements nsIBadCertListener2. Maybe Chatzilla (didn't read the code) Obviously, the authors of these extensions want to keep their custom override behavior. At the same time, they are all implementing behavior that we (Mozilla) already rejected as unsafe. (In fact, the interface change from nsIBadCertListener to nsIBadCertListener2 was done explicitly to avoid the behavior that these addons are implementing.) (b) Dropping support for main-thread-only nsICertOverrideService. Our built-in implementation is already thread-safe. I verified on mxr.mozilla.org/addons that no extensions on AMO would be affected by this. The main effect of making this change is that addons like Convergence and Selenium would have to replace their JS-based implementations with native code implementations. IMO, this is not an unreasonable requirement, but bz objected to it before. (I am willing to literally write the C++ code for them.) (c) Fixing 679144 (d) Reworking the logic that handles PKCS#11 login, so that we don't need SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with the master password before we start any SSL connections in FIPS mode. (This would solve other problems too, like bug 584014 and related bugs.)

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 3

•

13 years ago

(In reply to Brian Smith (:bsmith) from comment #2) > If Honza's proposed solution will not work, then I have another potential > solution that would work, but which would be more effort. This would require > all of the following: > > (a) Dropping support for main-thread-only (i.e. JS) nsICertBadCertListener2 > and nsISSLErrorListener. I have figured out a way to avoid this, I think, that doesn't require changing our support for nsICertBadCertListener2 and nsISSLErrorListener. > (b) Dropping support for main-thread-only nsICertOverrideService. Similarly, I was just wrong about this. It is a non-issue, AFAICT. > (c) Fixing 679144 This will happen anyway. > (d) Reworking the logic that handles PKCS#11 login, so that we don't need > SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with > the master password before we start any SSL connections in FIPS mode. (This > would solve other problems too, like bug 584014 and related bugs.) This would be the main remaining issue to solving this by reworking code in PSM.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 4

•

13 years ago

>> (d) Reworking the logic that handles PKCS#11 login, so that we don't need >> SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with >> the master password before we start any SSL connections in FIPS mode. (This >> would solve other problems too, like bug 584014 and related bugs.) > This would be the main remaining issue to solving this by reworking code > in PSM. In English: This would be the main issue blocking the solving of this by reworking code in PSM (vs. spinning the event loop).

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Updated

•

13 years ago

Depends on: 696976, 679144

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Updated

•

13 years ago

Depends on: 754374

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 6

•

13 years ago

(In reply to Brian Smith (:bsmith) from comment #3) > I have figured out a way to avoid this, I think, that doesn't require > changing our support for nsICertBadCertListener2 and nsISSLErrorListener. Bug 754374. > > (d) Reworking the logic that handles PKCS#11 login, so that we don't need > > SyncRunnableBase in PK11PasswordPrompt; e.g. by having the user log in with > > the master password before we start any SSL connections in FIPS mode. (This > > would solve other problems too, like bug 584014 and related bugs.) > > This would be the main remaining issue to solving this by reworking code in > PSM. The PKCS#11 password prompt is a modal dialog that should (AFAICT) block us from shutting down. That is bad UI, but AFAICT it means it is a non-issue for this bug.

Matthew N. [:MattN]

Updated

•

12 years ago

OS: Windows 7 → All

Hardware: x86_64 → All

Matthew N. [:MattN]

Comment 7

•

12 years ago

Attached file OS X stack (deleted) — Details

I'm experiencing what appears to be the same issue on OS X 10.7.4.

Matthew N. [:MattN]

Updated

•

12 years ago

status-firefox13: --- → affected

status-firefox14: --- → affected

status-firefox15: --- → affected

status-firefox16: --- → affected

Keywords: hang

Whiteboard: [snappy]

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 9

•

12 years ago

I hit this shutting down today. Can we implement the solution from comment 1 now and rework this code later?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 10

•

12 years ago

Also doesn't this code need to handle spurious notifications of the condvar?

(dormant account)

Updated

•

12 years ago

Whiteboard: [snappy] → [snappy:p1]

Matthew N. [:MattN]

Comment 11

•

12 years ago

I've been hitting this hang again quite a few times lately. Is someone able to take this snappy:p1?

:Irving Reid (No longer working on Firefox)

Comment 13

•

12 years ago

I saw this several times, with exactly the same OS X stack trace as in MattN's attachment. Most of the times were while shutting down to install Aurora updates, but the most recent time was immediately after waking up from sleep.

Honza Bambas (:mayhemer)

Assignee

Comment 14

•

12 years ago

I'll try to fix this.

Assignee: nobody → honzab.moz

Status: NEW → ASSIGNED

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 15

•

12 years ago

Attached patch remove GetPreviousCert to avoid most common case of deadlock (obsolete) (deleted) — Details — Splinter Review

Honza, here's the patch I wrote for this. It removes the "previous cert" caching. I don't know yet what the performance impact of this would be. The case where it would be problematic is where the top-level document is EV and the session has been resumed. In that case, the SecureBrowserUI and/or browser.js will eventually call "IsExtendedValidation()" on the main thread, which will do a certificate validation, which can block on network I/O and/or disk I/O. Note that this problematic case can happen anyway, because this optimization only works (if ever) in cases where you navigate from one page to another on the same site. And, it's only for EV, AFAICT.

Attachment #667096 - Flags: review?(honzab.moz)

Lukas Blakk [:lsblakk] use ?needinfo

Comment 16

•

12 years ago

tracking since this is showing up a lot lately and we'd consider an uplift if the risk/reward ratio was positive.

status-firefox17: --- → affected

status-firefox18: --- → affected

tracking-firefox17: --- → +

tracking-firefox18: --- → +

Honza Bambas (:mayhemer)

Assignee

Comment 17

•

12 years ago

(In reply to Brian Smith (:bsmith) from comment #15) > Created attachment 667096 [details] [diff] [review] > remove GetPreviousCert to avoid most common case of deadlock > > Honza, here's the patch I wrote for this. It removes the "previous cert" > caching. > > I don't know yet what the performance impact of this would be. The case > where it would be problematic is where the top-level document is EV and the > session has been resumed. In that case, the SecureBrowserUI and/or > browser.js will eventually call "IsExtendedValidation()" on the main thread, > which will do a certificate validation, which can block on network I/O > and/or disk I/O. > > Note that this problematic case can happen anyway, because this optimization > only works (if ever) in cases where you navigate from one page to another on > the same site. And, it's only for EV, AFAICT. Sorry, I missed this review requirement first. Brian, when you start working on a bug or steal it from somebody (w/o asking him, btw) then at least please change the assignee field to let people around you know. Then, why aren't you fixing this bug according my comment 1 that would be a general fix?

Honza Bambas (:mayhemer)

Assignee

Comment 18

•

12 years ago

Attached patch v1 (deleted) — Details — Splinter Review

- ::Shutdown() no longer indefinitely waits for monitor notification - introduced a new event handler called on the main thread that sets a flag ; posted from OnMsgShutdown() handler on the socket thread - ::Shutdown() loops with ProcessNextEvent until the flag is set (-w patch for simpler review coming)

Attachment #668452 - Flags: review?(cbiesinger)

Honza Bambas (:mayhemer)

Assignee

Comment 19

•

12 years ago

Attached patch v1 ignored whitespaces (deleted) — Details — Splinter Review

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 20

•

12 years ago

(In reply to Honza Bambas (:mayhemer) from comment #17) > Brian, when you start working on a bug or steal it from somebody (w/o asking > him, btw) then at least please change the assignee field to let people > around you know. > > Then, why aren't you fixing this bug according my comment 1 that would be a > general fix? This patch is in my queue for an unrelated reason. It just happens to resolve the most common cause of this bug. I just posted it to be helpful. I'm not trying to steal the bug.

Honza Bambas (:mayhemer)

Assignee

Comment 21

•

12 years ago

(In reply to Brian Smith (:bsmith) from comment #20) > This patch is in my queue for an unrelated reason. It just happens to > resolve the most common cause of this bug. I just posted it to be helpful. > I'm not trying to steal the bug. Aha, then sorry for my overreaction. Do you intend to remove more of this sync stuff? If so, then I think it would be good to move it to a different bug. For this bug, so close to release bounce, I think it is safer to take a simpler and more general patch in Necko then touch relatively non-transparent and sensitive PSM code.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 22

•

12 years ago

(In reply to Honza Bambas (:mayhemer) from comment #21) > Do you intend to remove more of this sync stuff? If so, then I think it > would be good to move it to a different bug. For this bug, so close to > release bounce, I think it is safer to take a simpler and more general patch > in Necko then touch relatively non-transparent and sensitive PSM code. If you think that spinning the event loop is not going to be problematic, then I agree that is a reasonable short-term solution. Regardless of this bug, I want to remove GetPreviousCert, because it can actually be bad for performance (due to event dispatching/queuing overhead), because of the nsIDocShell dependency it adds, and because there is a better, more effective way to cache results. However, that better way to cache results requires several more patches to implement. My intention is to eventually remove all the SyncRunnableBase stuff from PSM. I have a plan for that, which we should discuss. However, it requires make significant changes, including even UI changes for master password. So, definitely we shouldn't block progress on this bug on that.

Honza Bambas (:mayhemer)

Assignee

Comment 23

•

12 years ago

(In reply to Brian Smith (:bsmith) from comment #22) > (In reply to Honza Bambas (:mayhemer) from comment #21) > > Do you intend to remove more of this sync stuff? If so, then I think it > > would be good to move it to a different bug. For this bug, so close to > > release bounce, I think it is safer to take a simpler and more general patch > > in Necko then touch relatively non-transparent and sensitive PSM code. > > If you think that spinning the event loop is not going to be problematic, > then I agree that is a reasonable short-term solution. I've been checking the code and it is clear we do the event loop spin short after manager shutdown anyway. During the looping I've introduced the manager is already in an unconditional shutdown state (there is no ref to sts that prevents using the manager mostly). So I don't feel there would be a regression unless well hidden. And I also think this way of shutdown is more proper then indefinite waiting for a monitor notification. PSM may not be the only code that blocks the socket thread by a sync event for the main thread. > > Regardless of this bug, I want to remove GetPreviousCert, because it can > actually be bad for performance (due to event dispatching/queuing overhead), > because of the nsIDocShell dependency it adds, and because there is a > better, more effective way to cache results. However, that better way to > cache results requires several more patches to implement. > > My intention is to eventually remove all the SyncRunnableBase stuff from > PSM. I have a plan for that, which we should discuss. However, it requires > make significant changes, including even UI changes for master password. So, > definitely we shouldn't block progress on this bug on that. OK, I agree.

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Updated

•

12 years ago

Attachment #667096 - Attachment is obsolete: true

Attachment #667096 - Flags: review?(honzab.moz)

Christian :Biesinger (don't email me, ping me on IRC)

Updated

•

12 years ago

Attachment #668452 - Flags: review?(cbiesinger) → review+

Lukas Blakk [:lsblakk] use ?needinfo

Comment 24

•

12 years ago

Is the reviewed patch covering the complete fix here? If so, please get this tested and landed to trunk. To take this fix on Beta we'll want it in earlier so there's time to watch for any regressions. That means having this assessed for the risks of uplift in the coming week.

Honza Bambas (:mayhemer)

Assignee

Comment 25

•

12 years ago

Comment on attachment 668452 [details] [diff] [review] v1 https://hg.mozilla.org/integration/mozilla-inbound/rev/8141e5eb9f0b

Attachment #668452 - Flags: checkin+

Ryan VanderMeulen [:RyanVM]

Comment 26

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/8141e5eb9f0b

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Flags: in-testsuite-

Resolution: --- → FIXED

Target Milestone: --- → mozilla19

Lukas Blakk [:lsblakk] use ?needinfo

Comment 27

•

12 years ago

We're looking for beta uplift here - please nominate asap

Honza Bambas (:mayhemer)

Assignee

Comment 28

•

12 years ago

Comment on attachment 668452 [details] [diff] [review] v1 [Approval Request Comment] Bug caused by (feature/regressing bug #): None User impact if declined: Shutdown deadlock Testing completed (on m-c, etc.): Spent 6 days of baking on m-c, seems not to cause any trouble Risk to taking this patch (and alternatives if risky): Low, patch is simple and very straightforward String or UUID changes made by this patch: None

Attachment #668452 - Flags: approval-mozilla-beta?

Attachment #668452 - Flags: approval-mozilla-aurora?

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

12 years ago

Attachment #668452 - Flags: approval-mozilla-beta?

Attachment #668452 - Flags: approval-mozilla-beta+

Attachment #668452 - Flags: approval-mozilla-aurora?

Attachment #668452 - Flags: approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 29

•

12 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/4337f296f25e

status-firefox17: affected → fixed

status-firefox19: --- → fixed

Ryan VanderMeulen [:RyanVM]

Comment 30

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/ebb3bb2f1d6c

status-firefox18: affected → fixed

Honza Bambas (:mayhemer)

Assignee

Comment 31

•

12 years ago

Thanks for branch-landing this, Ryan.

u279076

Comment 32

•

12 years ago

Is there anything in particular QA can do to verify this is fixed? Any advice on exploratory regression testing we can perform?

Whiteboard: [snappy:p1] → [snappy:p1][qa?]

Honza Bambas (:mayhemer)

Assignee

Comment 33

•

12 years ago

(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #32) > Is there anything in particular QA can do to verify this is fixed? Any > advice on exploratory regression testing we can perform? I don't think this is simple to reproduce. You could try visit an https page during shutdown, but the race is so close you probably won't be lucky. I'd just leave this be verified by feedback from users.

u279076

Comment 34

•

12 years ago

Lukas, is there any Beta feedback we can look at to inform the verification? Otherwise there's nothing QA can really do here.

Whiteboard: [snappy:p1][qa?] → [snappy:p1][qa-]

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 35

•

12 years ago

So I still see this issue with Thunderbird 17.0 beta. Today I got an upgrade notice to the latest beta. After applying the downloaded update I triggered a restart and now a thunderbird process is hanging around for about 15 minutes. This is on OS X 10.7.5.

:Irving Reid (No longer working on Firefox)

Comment 36

•

12 years ago

Henrik, if you can attach to a hung Thunderbird with the OS X Activity Monitor and do a "Sample Process", please file a new bug, copied to me, with the process sample attached.

Flags: needinfo?(hskupin)

Honza Bambas (:mayhemer)

Assignee

Comment 37

•

12 years ago

Yes please, provide stack if you can. Thanks.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 38

•

12 years ago

I killed the process already. But I will remember and do it when I see it the next time. It's something which happens occasionally after a couple of days using hibernation mode. So it will take a bit until I can come back with further information.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 39

•

12 years ago

Btw. could that be bug 803983?

:Irving Reid (No longer working on Firefox)

Comment 40

•

12 years ago

(In reply to Henrik Skupin (:whimboo) from comment #39) > Btw. could that be bug 803983? Yes, let's follow any further Thunderbird shutdown hangs on bug 803983.

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

•

12 years ago

Blocks: 803983

OS X stack 12 years ago Matthew N. [:MattN] (deleted), text/plain		Details
remove GetPreviousCert to avoid most common case of deadlock 12 years ago Brian Smith (:briansmith, :bsmith, use NEEDINFO?) (deleted), patch		Details \| Diff \| Splinter Review
v1 12 years ago Honza Bambas (:mayhemer) (deleted), patch	Biesinger : review+ lsblakk : approval-mozilla-aurora+ lsblakk : approval-mozilla-beta+ mayhemer : checkin+	Details \| Diff \| Splinter Review
v1 ignored whitespaces 12 years ago Honza Bambas (:mayhemer) (deleted), patch		Details \| Diff \| Splinter Review