[meta] Crash in [mozilla::net::nsHttpConnectionMgr::Shutdown] and other net related places. Shutdown hang.
Categories
(Core :: Networking, defect, P3)
People
(Reporter: jstutte, Unassigned)
References
(Depends on 3 open bugs, Blocks 4 open bugs)
Details
(5 keywords, Whiteboard: [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged][necko-monitor])
Crash Data
+++ This bug was initially created as a clone of Bug #1435343 +++
Extracts the cases 1.1 to 1.4 from comment 70 of bug 1435343.
Comment 1•5 years ago
Cleaned dependencies and blocker as I expect all these to be quite outdated.
Comment 2•5 years ago
Restoring dependencies for active bugs for further investigation.
Comment 3•5 years ago
I am not sure about the (causing) component here at all.
Comment 4•5 years ago
The cause is similar for each of those components: during shutdown, many components spin the event loop when they receive notification X ("xpcom-shutdown" for instance, but we have a few others). They do so because they want to keep receiving IPC calls, or because they want to wait until a subcomponent/object is released. Often this spinning never terminates, and because of this, other components never receive the same notification X.
For instance, https://bugzilla.mozilla.org/show_bug.cgi?id=1435962#c0 describes what happens for mozilla::net::nsHttpConnectionMgr::Shutdown().
The fix is to remove the event loop spinning from every component that does it during shutdown, because that spinning can trigger race conditions.
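For illustration, here is a minimal sketch of the pattern being criticized, assuming Gecko's SpinEventLoopUntil helper from nsThreadUtils.h (single-argument overload); the MyComponent class and mPendingOps counter are hypothetical:

    #include <cstring>

    #include "nsIObserver.h"
    #include "nsThreadUtils.h"  // mozilla::SpinEventLoopUntil

    // Hypothetical component showing the anti-pattern: the observer blocks the
    // "xpcom-shutdown" notification by spinning a nested main-thread event loop
    // until its own async work drains. If the predicate never flips, observers
    // registered after this one never see the notification at all.
    class MyComponent final : public nsIObserver {
     public:
      NS_DECL_ISUPPORTS

      NS_IMETHOD Observe(nsISupports*, const char* aTopic,
                         const char16_t*) override {
        if (!strcmp(aTopic, "xpcom-shutdown")) {
          // Problematic: nested event loop inside a shutdown notification.
          mozilla::SpinEventLoopUntil([&]() { return mPendingOps == 0; });
        }
        return NS_OK;
      }

     private:
      ~MyComponent() = default;
      uint32_t mPendingOps = 0;  // decremented by async callbacks (hypothetical)
    };

    NS_IMPL_ISUPPORTS(MyComponent, nsIObserver)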
Comment 5•4 years ago
S1 or S2 bugs need an assignee - could you find someone for this bug?
Comment 6•4 years ago
:junior - do you think you will have a patch on this S1 bug in 79?
Comment 7•4 years ago
No, this has been lingering for years; we fought it very hard before but failed to find a solution (see also bug 1158189).
That's the reason we put it at P3.
An STR would be very helpful, but a fix in 79 is not something to expect.
Comment 8•4 years ago
(In reply to Junior [:junior] from comment #7)
An STR would be very helpful
Hi Junior, I do not expect to have a clear STR for this anytime soon, but :baku gave a plausible call to action in comment 4:
The fix is to remove the spinning of the event loop in every component that does it during the shutdown because that can trigger race conditions.
Though I have a very limited understanding of this, it sounds very plausible to me that we should avoid it, and not only because of the hangs (spinning the event loop on the main thread can sometimes also lead to racy off-thread destruction of objects, it seems). So if you were able to identify the places where your components spin the event loop, you could file a (set of) bug(s) to improve that aspect and see if it also helps here.
Comment 9•4 years ago
We spin the event loop because the socket thread can't finish its job due to a hanging PR_POLL on the socket thread.
bug 1435962 comment 3 indicates a possible network driver bug, and all the crashes happen on Windows.
We can't simply remove the event loop spinning without closing the connections when we receive the offline notification during the shutdown process.
I don't have a way to remove the event loop spinning off the top of my head (and I believe we can't).
Comment 10•4 years ago
nsHttpConnectionMgr::Shutdown() must not spin the event loop because it is called from nsIObserver::Observe(). Spinning the event loop permits calls to nsIObserver::Remove(), for example, which is forbidden during notifications, so Observe() implementations must not spin the event loop.
It is also called from a destructor. I don't know what ensures that nsHttpHandlers are destroyed only at times when it is safe to run arbitrary code.
Anything that must spin an event loop should dispatch a runnable to do that, so that it happens at a safe time.
(I'm not clear whether or not that would be sufficient to resolve the issues here.)
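As an illustration of "dispatch a runnable" instead of spinning inside Observe(), here is a rough sketch; the MyComponent class and FinishShutdown helper are hypothetical, while NS_DispatchToMainThread and NS_NewRunnableFunction from nsThreadUtils.h are assumed to be available:

    #include <cstring>

    #include "mozilla/RefPtr.h"
    #include "nsThreadUtils.h"

    // Instead of spinning inside the notification, post a runnable and do the
    // (possibly blocking) work there, after the observer service has finished
    // delivering "xpcom-shutdown" to all observers.
    NS_IMETHODIMP
    MyComponent::Observe(nsISupports*, const char* aTopic, const char16_t*) {
      if (!strcmp(aTopic, "xpcom-shutdown")) {
        RefPtr<MyComponent> self = this;
        NS_DispatchToMainThread(NS_NewRunnableFunction(
            "MyComponent::DeferredShutdown", [self]() {
              self->FinishShutdown();  // hypothetical helper; runs at a safer time
            }));
      }
      return NS_OK;
    }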
Comment 11•4 years ago
Added a macOS-specific signature.
Comment 12•4 years ago
We can't simply remove the event loop spinning without closing the connections when we receive the offline notification during the shutdown process.
I don't have a way to remove the event loop spinning off the top of my head (and I believe we can't).
I'm not super familiar with this code but can we do the following?
- ConnectionManager has a boolean flag: mShuttingDown(false). When ::Shutdown() is called, we set it to true. No extra connections are accepted.
- A runnable + nsIAsyncShutdownBlocker is dispatched to complete the operation.
- When OnMsgShutdownConfirm() is called, the blocker is removed and the shutdown can continue.
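A rough sketch of this proposal; only the nsIAsyncShutdownBlocker/nsIAsyncShutdownClient interfaces are real, the ConnShutdownBlocker class and ConnectionsClosed method are hypothetical, and registering the blocker with an appropriate phase via nsIAsyncShutdownService is left out:

    #include "nsCOMPtr.h"
    #include "nsIAsyncShutdown.h"

    // Blocks an async shutdown phase until the socket thread has confirmed
    // that all connections are closed, instead of spinning the event loop.
    class ConnShutdownBlocker final : public nsIAsyncShutdownBlocker {
     public:
      NS_DECL_ISUPPORTS

      NS_IMETHOD GetName(nsAString& aName) override {
        aName.AssignLiteral("nsHttpConnectionMgr: waiting for connections");
        return NS_OK;
      }
      NS_IMETHOD GetState(nsIPropertyBag** aState) override {
        *aState = nullptr;
        return NS_OK;
      }
      NS_IMETHOD BlockShutdown(nsIAsyncShutdownClient* aClient) override {
        mClient = aClient;  // keep the phase open until we are done
        return NS_OK;
      }

      // Called on the main thread once OnMsgShutdownConfirm has run.
      void ConnectionsClosed() {
        if (mClient) {
          mClient->RemoveBlocker(this);
          mClient = nullptr;
        }
      }

     private:
      ~ConnShutdownBlocker() = default;
      nsCOMPtr<nsIAsyncShutdownClient> mClient;
    };

    NS_IMPL_ISUPPORTS(ConnShutdownBlocker, nsIAsyncShutdownBlocker)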
Comment 13•4 years ago
(In reply to Andrea Marchesini [:baku] from comment #12)
We can't simply remove the event loop spinning without closing the connections when we receive the offline notification during the shutdown process.
I don't have a way to remove the event loop spinning off the top of my head (and I believe we can't).
I'm not super familiar with this code but can we do the following?
- ConnectionManager has a boolean flag: mShuttingDown(false). When ::Shutdown() is called, we set it to true. No extra connections are accepted.
- A runnable + nsIAsyncShutdownBlocker is dispatched to complete the operation.
- When OnMsgShutdownConfirm() is called, the blocker is removed and the shutdown can continue.
It looks like a non-blocking way to do the same thing and could work. I'll investigate whether we can do this.
Keeping the ni? to follow up. Thanks, baku!
Comment 15•4 years ago
:nhi Triaging as REO for 79 - is this intended to be looked at for 79? It looks like this issue has gotten worse in beta.
Comment 16•4 years ago
Hello Kim,
Thanks for weighing in and pointing out the issues in 79. It helps to identify some regressions.
I took a look at the stacks of the first 50 crashes for @shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown in beta.
I believe the HTTP3 and nss ones are new crashes which are not covered in bug 1158189.
These crashes could be reduced to <20% once we have those fixed.
Another 10% of the crashes come from hanging in poll.
bug 1435962 comment 3 shows it could be a network driver bug.
We can address that last mile later. However, there isn't much we can do about a hanging poll.
I don't have a strong feeling that this would be solved by not spinning the main thread.
The HTTP3 crashes are mostly due to slow string operations, and nss lives on a non-main thread.
If we don't spin the main thread's loop, it would still crash or hang elsewhere.
I'll ni? dragana, who has been chasing these crashes for a long time, in the next comment for more input.
And I'll file the corresponding bugs for each component.
I list the crash reports I triaged at the bottom of this comment.
I also took another look at the crashes in nightly. All 7 crashes are H3, with different symbols.
I also took a look at the release crashes:
- fflush hangs, much more than beta
https://crash-stats.mozilla.org/report/index/eab6ec5b-76c9-487f-b3ba-43dfa0200708#allthreads - deadlock
nsNSSSocketInfo::CloseSocketAndDestroy(), which disappears or turns into another symbol in beta
https://crash-stats.mozilla.org/report/index/a2ec85a3-383a-4925-993f-5f6910200708#allthreads - PR_POLL
- other nss symbols
- random
To me, focusing on beta could be a good idea at this stage.
As for the other crash signatures:
@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask and the other mozilla::TaskController::GetRunnableForMTTask signatures are mostly crashes in QuotaManager (some of them don't even have a socket thread). Only a few of them are about H3 and nss, which are the known cases.
The @ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait crashes are about CacheFileIOManager::Shutdown, which is a different issue.
shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread is interesting. Half of them are known nss issues. The other half happens due to a deadlock in the shutdown thread for ssl here.
Here are some reports:
https://crash-stats.mozilla.org/report/index/a071e81d-0955-497f-bc3d-036210200708#allthreads
https://crash-stats.mozilla.org/report/index/011da31f-1aeb-46cd-a81f-eac870200708#allthreads
https://crash-stats.mozilla.org/report/index/3d063176-b728-437b-a4c9-c00410200708#allthreads
This is the analysis for @shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown in beta.
- HTTP3 (23)
- neqo_glue::neqo_http3conn_fetch
https://crash-stats.mozilla.org/report/index/232d2d2e-e059-416c-84fc-4896e0200708#allthreads
https://crash-stats.mozilla.org/report/index/27c83b2b-1977-45ca-8947-203520200708#allthreads
https://crash-stats.mozilla.org/report/index/ce130f9f-2137-4d7f-b791-7b51d0200708#allthreads
https://crash-stats.mozilla.org/report/index/4617e7fb-b701-40b0-871f-143f70200708#allthreads
https://crash-stats.mozilla.org/report/index/ee901bb2-abc4-489e-9f7d-79f3f0200708#allthreads
https://crash-stats.mozilla.org/report/index/76dc9ec3-abac-4490-8bea-5e55c0200708#allthreads
https://crash-stats.mozilla.org/report/index/e369ceab-448b-4d14-8aba-899980200708#allthreads
https://crash-stats.mozilla.org/report/index/cd15b360-f85c-44f4-a272-01a6d0200708#allthreads
https://crash-stats.mozilla.org/report/index/a63fdd66-6a6d-4f9b-a658-6745b0200708#allthreads
https://crash-stats.mozilla.org/report/index/69fdc1e0-912b-46ec-b278-7af360200708#allthreads
https://crash-stats.mozilla.org/report/index/273cc629-48db-4f7f-b817-303970200708#allthreads
https://crash-stats.mozilla.org/report/index/2be978fa-1e54-442d-ba98-4051f0200708#allthreads
https://crash-stats.mozilla.org/report/index/eee87582-2bdb-4398-bcce-043c80200708#allthreads - slow nsTString<char>::Find in mozilla::net::Http3Stream::FindRequestContentLength()
https://crash-stats.mozilla.org/report/index/e74302ff-a5bc-456b-9f62-7456b0200708#allthreads
https://crash-stats.mozilla.org/report/index/f4d7beae-1758-4f7b-8ddf-cff990200708#allthreads
https://crash-stats.mozilla.org/report/index/94ad5e0f-cdaa-4039-ae04-c680b0200708#allthreads
https://crash-stats.mozilla.org/report/index/1546f888-d3c2-48b5-b1c5-c08b30200708#allthreads
https://crash-stats.mozilla.org/report/index/5a584924-3e59-45f2-a637-0c6800200708#allthreads
https://crash-stats.mozilla.org/report/index/b7159d2d-1570-4ffc-87f5-2b2a60200708#allthreads
https://crash-stats.mozilla.org/report/index/05024233-a580-4315-af02-8580d0200708#allthreads - slow nsTString<char>::Find in mozilla::net::Http3Stream::GetHeadersString
https://crash-stats.mozilla.org/report/index/2595aabb-e2d5-44c0-9479-bce420200708#allthreads
https://crash-stats.mozilla.org/report/index/50b9aaf2-419d-4832-9dfd-368850200708#allthreads
https://crash-stats.mozilla.org/report/index/f447bc10-2cf9-4ad2-9ff9-9fe760200708#allthreads
- neqo_glue::neqo_http3conn_fetch
- Deadlock with nss (17): Note to myself: search RtlpWaitOnCriticalSection for deadlock
https://crash-stats.mozilla.org/report/index/0c967088-a97e-4dde-8bbe-e23570200708#allthreads
https://crash-stats.mozilla.org/report/index/7bb239d5-c863-4ca8-9ad3-128ac0200708#allthreads
https://crash-stats.mozilla.org/report/index/053f0a6b-b7d1-4ce4-bbc6-beb4c0200708#allthreads
https://crash-stats.mozilla.org/report/index/763ae655-0a70-4422-97f7-0b86b0200708#allthreads
https://crash-stats.mozilla.org/report/index/749aea7f-cfdc-4b49-8373-ed8db0200708#allthreads
https://crash-stats.mozilla.org/report/index/6cb693ad-b88c-42e8-afef-5eceb0200708#allthreads
https://crash-stats.mozilla.org/report/index/915564b0-e7b7-4ea1-bc55-1d7a20200708#allthreads
https://crash-stats.mozilla.org/report/index/f2523d26-c4d4-4922-8231-5c7510200708#allthreads
https://crash-stats.mozilla.org/report/index/88299646-ecf5-47e3-adb5-39b6b0200708#allthreads
https://crash-stats.mozilla.org/report/index/b1286c47-7a56-4d38-ab7f-4c3530200708#allthreads
https://crash-stats.mozilla.org/report/index/ec485865-dedb-4d43-b906-ac6320200708#allthreads
https://crash-stats.mozilla.org/report/index/ba379840-e967-4e7c-bb37-962e90200708#allthreads
https://crash-stats.mozilla.org/report/index/a1bf9f5c-ebe3-41e7-809b-a38ff0200708#allthreads
https://crash-stats.mozilla.org/report/index/c6a3fff4-264e-4ffb-9b7a-2f55c0200708#allthreads
https://crash-stats.mozilla.org/report/index/15cc0996-9cdd-47f3-a8cf-5c8500200708#allthreads
https://crash-stats.mozilla.org/report/index/208140d4-60a6-403c-ba59-334a30200708#allthreads
https://crash-stats.mozilla.org/report/index/62db5f81-1e2a-4eb9-bc8a-2b0780200708#allthreads - fflush hang in ssl
https://crash-stats.mozilla.org/report/index/a524b967-5c38-46f9-885b-80b0b0200708#allthreads - Good old poll
https://crash-stats.mozilla.org/report/index/dd4b29f7-99eb-4775-ad10-770fc0200708#allthreads
https://crash-stats.mozilla.org/report/index/d9b215a1-bdab-4ed4-87df-04ceb0200708#allthreads
https://crash-stats.mozilla.org/report/index/bbe33082-cc99-494c-b38c-930b40200708#allthreads
https://crash-stats.mozilla.org/report/index/9e64769b-e785-4681-a319-e13320200708#allthreads
https://crash-stats.mozilla.org/report/index/0c869f8c-f2e7-4754-8997-b6b9f0200708#allthreads
https://crash-stats.mozilla.org/report/index/9198e493-305a-4eeb-86a0-c35020200708#allthreads - Random
- Hang in PR_Write https://crash-stats.mozilla.org/report/index/ea70ecc5-30e1-442b-8743-deea60200708#allthreads
- doing its shutdown job, not hanging: https://crash-stats.mozilla.org/report/index/c8f012d0-5121-488a-a6f9-7d5070200708#allthreads
- je_realloc https://crash-stats.mozilla.org/report/index/7e0cf42f-852c-4150-a2e0-0ec2d0200708#allthreads
Comment 17•4 years ago
We're not fixing this for 79. We will try to fix what we can for 80.
Comment 18•4 years ago
I've looked into some of the crash reports in comment 16.
We do have some new culprits causing the shutdown hangs (bug 1651564, bug 1651565).
However, removing the main-thread spin loop does not seem able to solve them.
What do you think, Dragana?
Comment 19•4 years ago
(In reply to Junior [:junior] from comment #18)
However, removing the main-thread spin loop does not seem able to solve them.
Is there a reason the spin loop can't be an async shutdown blocker?
Comment 20•4 years ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #19)
(In reply to Junior [:junior] from comment #18)
However, removing the main-thread spin loop does not seem able to solve them.
Is there a reason the spin loop can't be an async shutdown blocker?
We need to be very careful about this. We need to make sure some things happen in a certain order, e.g. transactions need to be canceled before some other things happen. We can explore this, but that change will not fix the crashes.
Comment 21•4 years ago
(In reply to Junior [:junior] from comment #18)
I've looked into some of the crash reports in comment 16.
We do have some new culprits causing the shutdown hangs (bug 1651564, bug 1651565).
However, removing the main-thread spin loop does not seem able to solve them.
What do you think, Dragana?
I will ask for an uplift; the patch is very isolated and it does not touch other code except HTTP3, which is disabled by default.
Comment 22•4 years ago
The current beta should be in good shape.
The crashes from bug 1651564 are disabled in late beta and release.
Bug 1651565 is uplifted.
That covers 80% of the crashes, and most of the other crashes are PR_POLL or not even hanging.
Now let's take a look at central.
- more hangs or unfinished jobs writing files/closing sockets/socket sends in ntdll.dll with nss/ssl stacks (21)
https://crash-stats.mozilla.org/report/index/d4e6d854-cf77-4937-96e6-be2690200717#allthreads
https://crash-stats.mozilla.org/report/index/ab24abcc-5d1f-4c14-8730-9a3680200716#allthreads
https://crash-stats.mozilla.org/report/index/7e6273f6-1974-46ad-b9b9-b68710200716#allthreads
https://crash-stats.mozilla.org/report/index/0b66c52b-ed1f-4daa-a8a4-370f80200716#allthreads - PR_POLL 32
- nss deadlock
https://crash-stats.mozilla.org/report/index/445fdd84-4291-450b-9196-759ee0200716
https://crash-stats.mozilla.org/report/index/69355a0f-51ca-4d13-9240-2e05c0200716#allthreads
https://crash-stats.mozilla.org/report/index/f9c98299-b1a1-4452-8e20-c3a250200716#allthreads - Other
ffm64.dll. https://crash-stats.mozilla.org/report/index/eeb6dea7-649f-4ae2-88a2-23fbf0200716#allthreads
ntdll.dll RtlVirtualUnwind https://crash-stats.mozilla.org/report/index/93d649ac-2610-4e9c-8505-209040200716#allthreads
https://crash-stats.mozilla.org/report/index/9be62ace-b161-42d6-b404-a5f410200716#allthreads
https://crash-stats.mozilla.org/report/index/da6d77d9-f5f2-4ab2-ac54-b23780200716#allthreads
Comment 23•4 years ago
Just had a Thunderbird bug with NSS/SSL issues:
Bug 1655068
Crash in [@ shutdownhang | nssList_Remove | nssCertificateStore_RemoveCertLOCKED | nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | ssl_DestroySID | SSL_ClearSessionCache | mozilla::ShutdownXPCOM]
Comment 24•4 years ago
[Tracking Requested - why for this release]:
these shutdownhangs seem to get more common during the 80.0b cycle again - the various crash signatures containing mozilla::TaskController::GetRunnableForMTTask now account for 8% of all browser crashes there.
Comment 25•4 years ago
(In reply to [:philipp] from comment #24)
[Tracking Requested - why for this release]:
these shutdownhangs seem to get more common during the 80.0b cycle again - the various crash signatures containing mozilla::TaskController::GetRunnableForMTTask now account for 8% of all browser crashes there.
I did a quick investigation of the reports for GetRunnableForMTTask, which contain:
- @ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask
New issue, bug 1656992: looks like mozilla::psm::StopSSLServerCertVerificationThreads() introduces this.
examples:
https://crash-stats.mozilla.org/report/index/06383cb3-0c3d-4b03-b5c9-235330200802#allthreads
https://crash-stats.mozilla.org/report/index/90d037e6-b97d-49af-b2ee-072010200802#allthreads
- [shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask]
bug 1620157: the stack shows CompositorThreadHolder::Shutdown
examples:
https://crash-stats.mozilla.org/report/index/877aaa93-46ca-49b7-b021-0db220200803
https://crash-stats.mozilla.org/report/index/fafa6625-31b5-478b-b963-8c7ba0200803
- [@ IPCError-browser | ShutDownKill | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ]
These are the PR_POLL hangs, a known issue.
- [@ shutdownhang | trunc | mozilla::TaskController::GetRunnableForMTTask ]
This is a combination of known issues like bug 1542485 and bug 1651564.
Comment 26•4 years ago
I took a closer look at the #4 crash signature in beta 80.4b, which is @ shutdownhang | mozilla::TaskController::GetRunnableForMTTask.
It shows 46 crashes, at 4.59%.
I would say the signature mozilla::TaskController::GetRunnableForMTTask goes way beyond networking.
It includes:
- deadlock in nsThread::Shutdown
- unfinished spineventloop
Actually, I investigated 30 of the reports, and only 1 crash belongs to necko.
Around half of them are QuotaManager, and a quarter are nsThread::Shutdown.
As for bug 1656992, bug 1656992 comment 5 shows that it is not relevant for beta.
Here's the triaged list.
Bug 1542485 QuotaManager: 14
https://crash-stats.mozilla.org/report/index/9f6b7760-93e0-4faa-a51d-c05d70200805
https://crash-stats.mozilla.org/report/index/da9b07bf-7e42-477a-b142-8727f0200805
Bug 1629669 Bug 1505660 nsThread::Shutdown 7
https://crash-stats.mozilla.org/report/index/dc459fc0-e2ee-4aa7-977b-b6a830200805
https://crash-stats.mozilla.org/report/index/a553423b-171a-4c04-978e-bb8620200805
https://crash-stats.mozilla.org/report/index/ca81d0ae-b048-484b-8a32-287c70200805
https://crash-stats.mozilla.org/report/index/6968de7a-4aa2-4f86-9567-2f4b30200805
js::jit 3
https://crash-stats.mozilla.org/report/index/484c4a97-230a-4db7-8b55-e50020200805
https://crash-stats.mozilla.org/report/index/48982758-5bc2-4ef2-9810-336310200805
https://crash-stats.mozilla.org/report/index/1c06bbc4-68e7-4a0f-9cbd-5fcd90200805
NSS nssTrustDomain 2
https://crash-stats.mozilla.org/report/index/cd70867d-b0de-40a3-8274-b6d1f0200805#allthreads
https://crash-stats.mozilla.org/report/index/39a412e6-cd15-4caa-81c4-a7fc60200805#allthreads
spineventloop in PreferencesWriter::Flush 2
https://crash-stats.mozilla.org/report/index/3dd5e055-51da-41c5-b006-b0cb20200805
https://crash-stats.mozilla.org/report/index/28c20ba4-0006-44cb-9dcf-f7eee0200805
workerinternals::RuntimeService::Cleanup
https://crash-stats.mozilla.org/report/index/ae84df84-0da1-4261-bbd6-001c30200805#allthreads
Here are more examples with different signatures:
PR_POLL
https://crash-stats.mozilla.org/report/index/667e805a-5033-4af2-b2b9-d8bfb0200805#allthreads
Comment 27•4 years ago
Just found bug 1500861.
Do you think ShutdownWithTimeout could resolve the nsThread::Shutdown hangs like the following examples, valentin?
https://crash-stats.mozilla.org/report/index/dc459fc0-e2ee-4aa7-977b-b6a830200805
https://crash-stats.mozilla.org/report/index/a553423b-171a-4c04-978e-bb8620200805
https://crash-stats.mozilla.org/report/index/ca81d0ae-b048-484b-8a32-287c70200805
https://crash-stats.mozilla.org/report/index/6968de7a-4aa2-4f86-9567-2f4b30200805
Comment 28•4 years ago
(In reply to Junior [:junior] from comment #27)
Just found bug 1500861.
Do you think ShutdownWithTimeout could resolve the nsThread::Shutdown hangs like the following examples, valentin?
https://crash-stats.mozilla.org/report/index/dc459fc0-e2ee-4aa7-977b-b6a830200805
https://crash-stats.mozilla.org/report/index/a553423b-171a-4c04-978e-bb8620200805
https://crash-stats.mozilla.org/report/index/ca81d0ae-b048-484b-8a32-287c70200805
https://crash-stats.mozilla.org/report/index/6968de7a-4aa2-4f86-9567-2f4b30200805
I don't think it's easy to make it work in all cases. It's mostly meant to be used with nsIThreadPool specifically when doing blocking tasks, and I don't know if that's the case here.
If we're looking at nsHttpConnectionMgr::Shutdown, we could make SpinEventLoopUntil work until a timeout expires.
This might work unless there's an event in the loop that is blocked.
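A minimal sketch of the "SpinEventLoopUntil with a timeout" idea; the three-second budget, the function name, and the aShutdownDone flag are hypothetical, and the single-argument SpinEventLoopUntil overload from nsThreadUtils.h is assumed:

    #include "mozilla/TimeStamp.h"  // TimeStamp, TimeDuration
    #include "nsThreadUtils.h"      // mozilla::SpinEventLoopUntil

    void ShutdownWithSpinTimeout(bool& aShutdownDone) {
      const mozilla::TimeStamp deadline =
          mozilla::TimeStamp::Now() + mozilla::TimeDuration::FromSeconds(3);

      mozilla::SpinEventLoopUntil([&]() {
        // Stop spinning when the socket thread has confirmed shutdown, or
        // when the time budget is exhausted; in the latter case we give up
        // and let the rest of shutdown proceed instead of hanging forever.
        return aShutdownDone || mozilla::TimeStamp::Now() > deadline;
      });
    }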
Comment 29•4 years ago
Removing the generic signatures that stop at mozilla::TaskController::GetRunnableForMTTask after bug 1658729, and adding more specific signatures that include mozilla::net::nsHttpConnectionMgr::Shutdown.
Comment hidden (obsolete)
Comment hidden (obsolete)
Comment 32•4 years ago
Regarding comment 30, I just filed bug 1660950.
Comment 33•4 years ago
--- Sorting out the signatures: ---
shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
Happens rarely and only between 78.0.1 and 79.0
shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
Many occurrences, Windows only, up to 68.12.0esr
shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
No (more) occurrences.
shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown
Many occurrences, Windows only, up to 79
shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
Rare, Linux only, 75 and 68.10.0
shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
No (more) occurrences
shutdownhang | PR_CallOnceWithArg | mozilla::net::nsHttpConnectionMgr::Shutdown
Rare, Windows only, 78.0.x and 79
--- The following three signatures are the most relevant for current versions: ---
shutdownhang | mozilla::net::ShutdownEvent::PostAndWait
Happens since 80 and often, only Windows (all flavors)
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | mozilla::net::nsHttpConnectionMgr::Shutdown
Happens since 80 to date, Windows only, often
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
Newest flavor, Windows only, 80 to date.
Comment 34•4 years ago
Looking at mozilla::net::ShutdownEvent::PostAndWait I see:
    rv = CacheFileIOManager::gInstance->mIOThread->Dispatch(
        this,
        CacheIOThread::WRITE);  // When writes and closing of handles is done
    MOZ_ASSERT(NS_SUCCEEDED(rv));

    TimeDuration waitTime = TimeDuration::FromSeconds(1);
    while (!mNotified) {
      ...

Shouldn't we return here if the mIOThread->Dispatch did not succeed, without even entering the while loop (instead of just doing MOZ_ASSERT)?
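A minimal sketch of that early return, based on the snippet above; it assumes the surrounding ShutdownEvent method can simply stop when the dispatch fails, with NS_WARN_IF used only for logging:

    nsresult rv = CacheFileIOManager::gInstance->mIOThread->Dispatch(
        this, CacheIOThread::WRITE);
    if (NS_WARN_IF(NS_FAILED(rv))) {
      // Dispatch failed, so mNotified can never be set; bail out instead of
      // entering the wait loop and spinning until the shutdown watchdog fires.
      return;
    }

    TimeDuration waitTime = TimeDuration::FromSeconds(1);
    while (!mNotified) {
      // ... wait as before ...
    }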
Comment 35•4 years ago
" Version: 44 Branch"
Should this be changed?
Comment 36•4 years ago
(In reply to Jens Stutte [:jstutte] (REO for FF 81) from comment #34)
Shouldn't we return here if the mIOThread->Dispatch did not succeed, without even entering the while loop (instead of just doing MOZ_ASSERT)?
That's a great point. I'll submit a patch in a separate bug. I'm not sure the connection manager/socket thread waits on the cache thread, so it's not likely to make an impact on this bug (unless I'm missing the code path that does so).
Comment 38•4 years ago
Looking at one crash from mozilla::net::ShutdownEvent::PostAndWait, I see the Socket Thread stuck here, probably waiting for some data to arrive, such that the shutdown event posted here never even starts to be processed, and thus this SpinEventLoopUntil does not return before the timeout.
Comment 39•4 years ago
Cleaning up signatures.
FF Active
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
shutdownhang | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozglue.dll | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozilla::net::ShutdownEvent::PostAndWait
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
shutdownhang | ntdll.dll | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | PR_CallOnceWithArg | mozilla::net::nsHttpConnectionMgr::Shutdown
Thunderbird only active
shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
Inactive / unsupported versions only
shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | _PR_MD_WAIT_CV | _PR_WaitCondVar | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | kernelbase.dll | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | ntdll.dll | kernel32.dll | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | ntdll.dll | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
Comment 40•4 years ago
Sorting signatures by frequency:
Signature | Count |
---|---|
shutdownhang mozilla::TaskController::GetRunnableForMTTask mozilla::net::nsHttpConnectionMgr::Shutdown | 3269 |
shutdownhang mozilla::net::nsHttpConnectionMgr::Shutdown | 2380 |
shutdownhang mozilla::net::ShutdownEvent::PostAndWait | 1625 |
shutdownhang nsThread::Shutdown mozilla::net::nsSocketTransportService::ShutdownThread | 1315 |
shutdownhang mozilla::TaskController::GetRunnableForMTTask nsThread::Shutdown mozilla::net::nsSocketTransportService::ShutdownThread | 240 |
shutdownhang kernelbase.dll mozilla::net::nsHttpConnectionMgr::Shutdown | 164 |
shutdownhang mozglue.dll mozilla::net::nsHttpConnectionMgr::Shutdown | 109 |
shutdownhang PR_CallOnceWithArg mozilla::net::nsHttpConnectionMgr::Shutdown | 73 |
shutdownhang ntdll.dll kernelbase.dll mozilla::net::nsHttpConnectionMgr::Shutdown | 39 |
shutdownhang __pthread_cond_wait <name omitted> mozilla::net::nsHttpConnectionMgr::Shutdown | 2 |
Comment 41•4 years ago
In all the reports I clicked on, the SocketThread is stuck while shutting down the SSL Cert threadpool.
Looking at the shutdown function, it seems we shut down the threads in the order we created them (and wait for each single thread before we loop). In the first three reports I clicked on I see that SSL Cert #1 is still alive, and I am assuming that it is processing some long-lasting event when the shutdown event comes in, such that we never get to process the shutdown event.
In two cases it is stuck in nsNSSComponent::BlockUntilLoadableCertsLoaded(); in the other case mozilla::psm::NSSCertDBTrustDomain::GetCertTrust seems stuck.
Comment 42•4 years ago
For this stack trace I see the SSL Cert threads stuck in nsNSSComponent::CheckForSmartCardChanges() security/manager/ssl/nsNSSComponent.cpp:900
Unfortunately I don't know if we can make them break out any faster - AFAIK this calls into the driver of these smart cards, which may be slow or badly written.
In this specific case I think we could use ShutdownWithTimeout in StopSSLServerCertVerificationThreads.
Not the most elegant fix, but it should improve the shutdown case.
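A rough sketch of that idea, assuming the verification threads live in an nsIThreadPool; the gCertVerificationThreadPool name and the 3000 ms budget are hypothetical, while nsIThreadPool::ShutdownWithTimeout is the API added in bug 1500861:

    #include "nsCOMPtr.h"
    #include "nsIThreadPool.h"

    static nsCOMPtr<nsIThreadPool> gCertVerificationThreadPool;  // hypothetical

    void StopSSLServerCertVerificationThreadsWithTimeout() {
      if (gCertVerificationThreadPool) {
        // Give slow smart-card driver calls a bounded amount of time, then
        // abandon the worker threads rather than hanging shutdown forever.
        gCertVerificationThreadPool->ShutdownWithTimeout(3000);
        gCertVerificationThreadPool = nullptr;
      }
    }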
Comment 43•4 years ago
So I clicked on the first 15 reports (on latest versions) of @ shutdownhang mozilla::TaskController::GetRunnableForMTTask mozilla::net::nsHttpConnectionMgr::Shutdown .
They all are doing something on the socket thread which probably prevents the shutdown event from being processed.
They fall into 3 buckets, it seems:
- _PR_MD_PR_POLL(PRPollDesc*, int, unsigned int)
It might be worth checking whether we use unsuitably long timeouts on the poll.
It is not clear whether we are just unlucky that there is no more time left on the shutdown timer when we send the shutdown event, or whether those events on the socket thread are really much slower than they should be, or even blocking.
Comment 44•4 years ago
(In reply to Jens Stutte [:jstutte] from comment #43)
It is not clear whether we are just unlucky that there is no more time left on the shutdown timer when we send the shutdown event, or whether those events on the socket thread are really much slower than they should be, or even blocking.
I proposed a patch on bug 1505660 that explicitly adds some more shutdown phases to the shutdown watchdog logic. That might mitigate the case of a late start of the network shutdown due to previous delays, as the timer will be reset more often.
Comment 45•3 years ago
Looking at some of the nsHttpConnectionMgr::Shutdown hangs.
IIUC, the intended sequence is as follows:
- We call nsHttpConnectionMgr::Shutdown() on the main thread.
- This dispatches OnMsgShutdown to the socket thread, passing a boolean. After clearing mSocketThreadTarget and setting mIsShuttingDown we enter a SpinEventLoopUntil that waits for the passed boolean to flip.
- OnMsgShutdown closes everything and dispatches OnMsgShutdownConfirm to the same socket thread with the same boolean.
- OnMsgShutdownConfirm finally sets that boolean.
What seems to happen is that the socket thread never reaches the OnMsgShutdown event, doing different things that are already in the queue. As we are in the same process, we might just want to share the mIsShuttingDown information directly with the socket thread and abort any event processing there immediately.
This is obviously not a good solution once we have the socket process, but it might still paper over some of the hangs for now.
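A rough sketch of sharing the shutdown flag directly with the socket thread; everything except the mIsShuttingDown name is hypothetical, and mozilla::Atomic is used so the flag can be read off the main thread without locking:

    #include "mozilla/Atomics.h"

    class ConnectionMgrLike {
     public:
      // Main thread: flip the flag before dispatching OnMsgShutdown, so the
      // socket thread can observe it without waiting for that event to arrive.
      void BeginShutdown() { mIsShuttingDown = true; }

      // Socket thread: checked at the top of every event handler; bail out
      // early instead of doing regular work once shutdown has started.
      bool ShouldAbortEvent() const { return mIsShuttingDown; }

     private:
      mozilla::Atomic<bool> mIsShuttingDown{false};
    };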
Comment 46•3 years ago
(In reply to Jens Stutte [:jstutte] from comment #45)
Looking at some of the nsHttpConnectionMgr::Shutdown hangs.
IIUC, the intended sequence is as follows:
- We call nsHttpConnectionMgr::Shutdown() on the main thread.
- This dispatches OnMsgShutdown to the socket thread, passing a boolean. After clearing mSocketThreadTarget and setting mIsShuttingDown we enter a SpinEventLoopUntil that waits for the passed boolean to flip.
- OnMsgShutdown closes everything and dispatches OnMsgShutdownConfirm to the same socket thread with the same boolean.
- OnMsgShutdownConfirm finally sets that boolean.
What seems to happen is that the socket thread never reaches the OnMsgShutdown event, doing different things that are already in the queue. As we are in the same process, we might just want to share the mIsShuttingDown information directly with the socket thread and abort any event processing there immediately.
This is obviously not a good solution once we have the socket process, but it might still paper over some of the hangs for now.
nsSocketTransportService gets information about a shutdown in a different way, from nsIOService, by calling gIOService->IsNetTearingDown().
As soon as gIOService->IsNetTearingDown() is true, nsSocketTransportService: 1) does not call PR_Poll (the check is here), 2) does not create new sockets (here), 3) starts leaking sockets (not closing them, to avoid calling PR_Close here), 4) does not call PR_ConnectContinue here, etc. It also has some logic to try to wake up PR_Poll.
Most of these crashes are in PR_Poll, PR_Close, PR_Connect and PR_ConnectContinue, and before we call these functions we check gIOService->IsNetTearingDown(). I assume that the socket thread is already hanging in one of these functions when shutdown is called.
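A simplified sketch of the guard pattern described here; the real checks live in nsSocketTransportService, and the PollWithShutdownGuard wrapper name is hypothetical:

    #include "nsIOService.h"  // mozilla::net::gIOService
    #include "prio.h"         // PR_Poll

    int32_t PollWithShutdownGuard(PRPollDesc* aDescs, PRIntn aCount,
                                  PRIntervalTime aTimeout) {
      if (mozilla::net::gIOService &&
          mozilla::net::gIOService->IsNetTearingDown()) {
        // Shutdown has started: skip the potentially blocking PR_Poll call
        // entirely, mirroring the behavior described above.
        return 0;
      }
      return PR_Poll(aDescs, aCount, aTimeout);
    }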
Comment 47•3 years ago
There is some increase in the volume of these hangs.
I had a look at some hangs (about 50 of them) and most of them are in PR_Close for UDP sockets. Recently we rolled out QUIC, and this explains why there are more hangs with UDP sockets (previously there were none or almost none). I have not found any other new hang signature except this one, and that was only 3 out of 50.
Maybe UDP sockets hang more often in PR_Close than TCP sockets. I found these related bugs:
Bug 1124880 and
this Chrome bug
Comment 48•3 years ago
Got this 2x in an hour. Both times I was watching a YouTube video (that being most of my network traffic); then networking in Firefox stopped working (networking seemed fine in other applications). I quit Firefox and got this shutdown hang.
Comment 49•3 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #48)
Got this 2x in an hour. Both times I was watching a YouTube video (that being most of my network traffic); then networking in Firefox stopped working (networking seemed fine in other applications). I quit Firefox and got this shutdown hang.
Also just got this multiple times over the past few hours, and was watching YouTube at the time of the first occurrence. Ran across this bugzilla # via the about:crashes related links after I filed #1749920, and I'm not sure whether that should actually be duped to this since the shutdown hang is a symptom of the real problem (networking died).
Comment 50•3 years ago
Updating the signatures with the 10 most frequent ones.
Comment 51•3 years ago
Note that the graph above shows an unreasonable +400k cases on January 13, which is not confirmed when I repeat the query in crash-stats.
Comment 52•3 years ago
Those are from the foxstuck incident.
Comment 54•2 years ago
Thunderbird is no longer a significant presence in any of these signatures.
Comment 55•2 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 56•2 years ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 20 desktop browser crashes on release
For more information, please visit auto_nag documentation.
Comment 57•2 years ago
The socket process should really help here.
We could just kill the socket process instead of waiting for the sockets to close.
Comment 58•2 years ago
keyword: Perf ?
Comment 59•2 years ago
¡Hola y'all!
Happy 🌮 Tuesday!
Crashed like
bp-d6f1b34d-91a0-486c-b5a7-3e42b0230328
on 113.0a1 (2023-03-28) (64-bit)
Updating flags FWIW.
¡Gracias!
Alex
Comment 60•2 years ago
(In reply to Worcester12345 from comment #58)
keyword: Perf ?