Closed
Bug 1124880
Opened 10 years ago
Closed 10 years ago
shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown() on pending UDP socket close
Categories
(Core :: Networking: HTTP, defect)
Tracking
()
People
(Reporter: kairo, Assigned: mayhemer)
References
Details
(Keywords: crash, topcrash-win)
Crash Data
Attachments
(4 files, 6 obsolete files)
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
image/png
|
Details | |
(deleted),
patch
|
mayhemer
:
review+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-35103a28-d12f-4999-9066-4f6ed2150122.
=============================================================
This is probably what a good part of bug 1103833 is mutating to after the fix of bug 1104317 - at least from the looks of early 36.0b2 data where this constitutes >1% of all crashes.
Find stats and more of those reports at https://crash-stats.mozilla.com/report/list?signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_Wait+%7C+nsThread%3A%3AProcessNextEvent%28bool%2C+bool%2A%29+%7C+NS_ProcessNextEvent%28nsIThread%2A%2C+bool%29+%7C+mozilla%3A%3Anet%3A%3AnsHttpConnectionMgr%3A%3AShutdown%28%29
Comment 1•10 years ago
|
||
Important crash, tracking.
status-firefox36:
--- → affected
tracking-firefox36:
--- → +
This is (for me) exactly the same crash, this time on 38.0 a1
https://crash-stats.mozilla.com/report/index/6ac6325d-a96b-424d-b994-3e14b2150126
but the signature is a little bit different to this
https://crash-stats.mozilla.com/report/index/6b24ad47-c392-406b-ad63-6bf7c2150126
and so this crash on 38.0 a1 is not linked to an 'Related Bug' like this one here
Patrick, any idea why nsHttpConnectionMgr::Shutdown is taking so long? (For context: bug 1038342 added a watchdog that turns hung shutdowns into crashes)
Here's the stack of the main thread:
WaitForSingleObject
PR_Wait
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
mozilla::net::nsHttpConnectionMgr::Shutdown()
mozilla::net::nsHttpHandler::Observe(nsISupports*, char const*, wchar_t const*)
nsObserverList::NotifyObservers(nsISupports*, char const*, wchar_t const*)
nsObserverService::NotifyObservers(nsISupports*, char const*, wchar_t const*)
nsObserverService::Release()
ScopedXPCOMStartup::~ScopedXPCOMStartup()
XREMain::XRE_main(int, char** const, nsXREAppData const*)
XRE_main
Flags: needinfo?(dmajor) → needinfo?(mcmanus)
Comment 5•10 years ago
|
||
its pretty generic.. it just means the socket thread isn't completing.
maybe honza has more insight
Flags: needinfo?(mcmanus) → needinfo?(honzab.moz)
Assignee | ||
Comment 6•10 years ago
|
||
The report this bug was for filed for:
Thread 4 @kernel32.dll!CloseHandleImplementation|ntdll.dll!NtClose
Thread 5 @ntdll.dll!ZwRemoveIoCompletion
Thread 9 (sts) @ws2_32.dll!closesocket|ntdll.dll!NtClose
Thread 35 (dns) @ws2_32.dll!closesocket|ntdll.dll!NtClose
Arbitrary reports from the list:
https://crash-stats.mozilla.com/report/index/ade2c3b2-0397-4061-878b-7414b2150122
Thread 4 (sts) @ws2_32.dll!connect|ntdll.dll!ZwDeviceIoControlFile
Thread 1 (google breakpad) @ntdll.dll!NtClose
https://crash-stats.mozilla.com/report/index/8cdee535-2dad-4261-bb34-869072150122
Thread 2 @ntdll.dll!NtRemoveIoCompletion
Thread 4 (sts) @ntdll.dll!NtClose
https://crash-stats.mozilla.com/report/index/4234021a-bc59-4757-aab6-4a8952150122
Thread 4 (sts) @ntdll.dll!NtClose
Thread 36 @ntdll.dll!NtRemoveIoCompletion
There is usually NtClose on the stack.
Seems like a windows bug.
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 7•10 years ago
|
||
Also, in cases we are stuck on closing a socket in sts it is always a UDP socket.
There are reports with no stuck (being inside poll) on sts too, tho:
https://crash-stats.mozilla.com/report/index/60d37d17-db0c-4db0-bbe3-ab84b2150122
https://crash-stats.mozilla.com/report/index/d1c60009-08a3-4bac-91cb-9fae72150123
Comment 8•10 years ago
|
||
Honza, any idea if we could workaround this windows bug? thanks
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 9•10 years ago
|
||
Tentatively assigning to me, this needs some deep investigation. No idea for a workaround, no idea of the cause at all.
Assignee: nobody → honzab.moz
Flags: needinfo?(honzab.moz)
Reporter | ||
Comment 10•10 years ago
|
||
Also note that the comments in the URL given in comment #0 sound like those are really troubled installations, sounds like we would not only run into hangs on shutdown but all over the place.
Reporter | ||
Updated•10 years ago
|
Summary: crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::net::nsHttpConnectionMgr::Shutdown() → shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown()
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()]
[@ shutdownhang | WaitForSingleObjectEx | WaitForSi…
Crash Signature: , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()]
[@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread* → , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()]
[@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*
Comment 11•10 years ago
|
||
I don't see that bug fixed for 36. Wontfix but tracking it for the next releases.
status-firefox37:
--- → affected
status-firefox38:
--- → affected
tracking-firefox37:
--- → +
tracking-firefox38:
--- → +
Release Note Request (optional, but appreciated)
[Why is this notable]: Users will see crash notifications more often.
[Suggested wording]: Firefox hangs on shutdown are now reported as crashes, in order to help Mozilla diagnose and fix the specific issues during shutdown.
[Links (documentation, blog post, etc)]: (not sure. maybe bug 1038342, bug 1104317. a blog post might help explain this.)
relnote-firefox:
--- → ?
Reporter | ||
Updated•10 years ago
|
Crash Signature: , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()] → , bool) | mozilla::net::nsHttpConnectionMgr::Shutdown()]
[@ shutdownhang | ntdll.dll@0x3c6bc]
Assignee | ||
Updated•10 years ago
|
Version: Trunk → 36 Branch
Comment 14•10 years ago
|
||
Any idea to hang out (kill) in the Windows TM permanent task Firefox.exe or Nightly.exe which remains after this crash?
Assignee | ||
Comment 15•10 years ago
|
||
The slow UDP close (most common cause of this hang) is also known to chromium guys [1], but there is no solution right now on their side (AFAIK, no deeper check on that).
According [2] closesocket tries to send unsent data and when network has changed (cable unplugged, wifi changed) it might take up to 5 secs. The blog post claims this is by default since Vista, but I can see reports from version 5.1.2600 (windows XP) tho. Anyway, one of the reports says Fx crashes when network is switched.
According our shutdown telemetry [3] seems like there is some distribution of slow shutdowns already. It may be that when some amount of UDP sockets piles up, we are slower and go over the 60s shutdown watchdog limit.
My vote is to force abortive close on UDP sockets, first just on Win, according [4]:
"If the l_onoff member of the linger structure is nonzero and l_linger member is zero, closesocket is not blocked even if queued data has not yet been sent or acknowledged. This is called a hard or abortive close, because the socket's virtual circuit is reset immediately, and any unsent data is lost."
[1] https://code.google.com/p/chromium/issues/detail?id=165382#c81
[2] http://blogs.msdn.com/b/winsdk/archive/2013/10/09/udp-closesocket-takes-upto-5-seconds-to-return-in-disconnect-remote-host-down-scenario-due-to-pending-data-to-send.aspx
[3] http://telemetry.mozilla.org/#filter=beta%2F36%2FSHUTDOWN_PHASE_DURATION_TICKS_PROFILE_CHANGE_TEARDOWN%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Table
[4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms737582%28v=vs.85%29.aspx
Status: NEW → ASSIGNED
Assignee | ||
Comment 16•10 years ago
|
||
SO_LINGER is not purposed for UDP sockets, at least not on Windows.
Assignee | ||
Comment 17•10 years ago
|
||
One way to at least mitigate this would be to create a separate thread for *each* PR_Close() call on each UDP socket. Then we just need to find a spot to join all the treads during shutdown. Sounds a bit crazy but I can imagine this be upliftable.
Assignee | ||
Comment 18•10 years ago
|
||
I so far haven't found any general way to avoid blocks on the socket thread. I believe select() call may also be slow from time to time (few reports are inside select).
This should mitigate the udp close delay, if it can be avoided by parallelization of close() calls on respective sockets.
Roughly tested.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d9d59454a466
Attachment #8566308 -
Flags: review?(mcmanus)
Comment 19•10 years ago
|
||
Comment on attachment 8566308 [details] [diff] [review]
v1
Review of attachment 8566308 [details] [diff] [review]:
-----------------------------------------------------------------
the try build is red.
You've done a nice job documenting that this is a known issue and parallelism seems the right mitigation. I wonder if we should be using the existing threadpool mechanism here - maybe unlimited isn't the right thing either.
did you see the one blocked in connect? I wonder if we should be testing sockets at connect time to see if they are blocking or not.
I think we should do something like this, but only expect it to be a partial mitigation. I think it should get a lot of pre-release air time - implications of spawning a lot of threads might take a while to see.
::: netwerk/base/nsUDPSocket.cpp
@@ +114,5 @@
> + void AddObserver();
> + void JoinAndRemove();
> +
> + PRFileDesc *mFd;
> + PRThread* mThread;
PRThread *mThread (throughout)
@@ +156,5 @@
> + // Released after the thread is done.
> + mSelf = this;
> + mThread = PR_CreateThread(PR_USER_THREAD, ThreadFunc, this,
> + PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
> + PR_JOINABLE_THREAD, 128 * 1024);
that stack seems huge.. and given that we might have a lot in parallel and VM space is an issue on 32 bit platforms...
@@ +165,5 @@
> +
> + // Observer service must be worked with on the main thread only.
> + // This method is called usually on the socket thread.
> + nsCOMPtr<nsIRunnable> event = NS_NewRunnableMethod(
> + this, &nsUDPSocketCloseThread::AddObserver);
can this deadlock if the main thread is trying to shut down the connection manager?
Attachment #8566308 -
Flags: review?(mcmanus) → feedback+
Assignee | ||
Comment 20•10 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #19)
> Comment on attachment 8566308 [details] [diff] [review]
> v1
>
> Review of attachment 8566308 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> the try build is red.
>
> You've done a nice job documenting that this is a known issue and
> parallelism seems the right mitigation. I wonder if we should be using the
> existing threadpool mechanism here - maybe unlimited isn't the right thing
> either.
I don't think there will be that much sockets to close, they are also closing during browser lifetime. We can have our favorite telemetry (I love it :)) to check how far with the numbers we go (can go in with this patch in)
>
> did you see the one blocked in connect? I wonder if we should be testing
> sockets at connect time to see if they are blocking or not.
I've seen blocks in connect(), select() and some others I think. I believe there is a general problem with winsock, but google is silent about it.
>
> I think we should do something like this, but only expect it to be a partial
> mitigation. I think it should get a lot of pre-release air time -
> implications of spawning a lot of threads might take a while to see.
Yes, I am afraid we have to.
>
> ::: netwerk/base/nsUDPSocket.cpp
> @@ +114,5 @@
> > + void AddObserver();
> > + void JoinAndRemove();
> > +
> > + PRFileDesc *mFd;
> > + PRThread* mThread;
>
> PRThread *mThread (throughout)
I never know..
>
> @@ +156,5 @@
> > + // Released after the thread is done.
> > + mSelf = this;
> > + mThread = PR_CreateThread(PR_USER_THREAD, ThreadFunc, this,
> > + PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
> > + PR_JOINABLE_THREAD, 128 * 1024);
>
> that stack seems huge.. and given that we might have a lot in parallel and
> VM space is an issue on 32 bit platforms...
Yep, something to research tho. Nightly is our friend here probably.
>
> @@ +165,5 @@
> > +
> > + // Observer service must be worked with on the main thread only.
> > + // This method is called usually on the socket thread.
> > + nsCOMPtr<nsIRunnable> event = NS_NewRunnableMethod(
> > + this, &nsUDPSocketCloseThread::AddObserver);
>
> can this deadlock if the main thread is trying to shut down the connection
> manager?
No, we spin the MT loop when shutting down. Also, this just adds an observer, nothing else.
Comment 21•10 years ago
|
||
> > VM space is an issue on 32 bit platforms...
> Yep, something to research tho. Nightly is our friend here probably.
FWIW we don't see a huge amount of address space exhaustion on Nightly these days. (I would guess that 64-bit builds are a big reason, and the daily restarts probably help too)
Assignee | ||
Comment 22•10 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c41b1761691
Now includes telemetry.
Attachment #8566308 -
Attachment is obsolete: true
Attachment #8567077 -
Flags: review?(mcmanus)
Assignee | ||
Comment 23•10 years ago
|
||
Improved stack size:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f61e11563d3
Assignee | ||
Comment 24•10 years ago
|
||
And attempt to fix linux-static-analyzes build fail:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9546d95b460
Assignee | ||
Comment 25•10 years ago
|
||
Few failures (comment 23):
- netwerk/test/unit/test_udp_multicast.js, socket is expected to be closed on return of nsUDPSocket::Close() ; fixed as: no use of the thread in this method
- Assertion failure: sActiveThreadsCount > 0, on x64 - sActiveThreadsCount is 32bit, hence unaligned ; fixed as: using uintptr_t
- [1], Main app process: killed by out-of-range signal, number 117 + some UDP socket warning in the console ; probably coming from dom\network\tests\test_udpsocket.html and is probably related to the first error in this list
[1] https://treeherder.mozilla.org/logviewer.html#?job_id=5074251&repo=try
Comment 26•10 years ago
|
||
Just for your information.
I will open a new bug for the test_udp_multicast.js failure. The patch from bug 1131557 makes it fail reliably on 32bit linux, but the patch is not the cause. Something strange is going on, i am still not sure what is the problem. I am working on it...
Updated•10 years ago
|
Attachment #8567077 -
Flags: review?(mcmanus) → review+
Assignee | ||
Comment 27•10 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #25)
> Few failures (comment 23):
> - netwerk/test/unit/test_udp_multicast.js, socket is expected to be closed
> on return of nsUDPSocket::Close() ; fixed as: no use of the thread in this
> method
> - [1], Main app process: killed by out-of-range signal, number 117 + some
> UDP socket warning in the console ; probably coming from
> dom\network\tests\test_udpsocket.html and is probably related to the first
> error in this list
Looking more at the code, it seems like calling PR_Close directly from ::Close() is nice but won't fix these problems. I believe there is a listener on the socket class at that time and we go to the socket thread with OnMsgClose that ends up in OnSocketDetached.
So, Dragana, I'll wait for your patch of the failing test first.
Assignee | ||
Updated•10 years ago
|
Attachment #8567077 -
Flags: review+
Comment 28•10 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #25)
> - [1], Main app process: killed by out-of-range signal, number 117 + some
> UDP socket warning in the console ; probably coming from
> dom\network\tests\test_udpsocket.html and is probably related to the first
> error in this list
>
>
> [1] https://treeherder.mozilla.org/logviewer.html#?job_id=5074251&repo=try
the udp socket warning can be cause by
https://hg.mozilla.org/integration/fx-team/annotate/37e5f630049b/toolkit/modules/secondscreen/SimpleServiceDiscovery.jsm#l163
SimpleServiceDiscovery.jsm does not shut down at all. This service was causing bug 1085266 (comment 13 explains it a bit)
Comment 29•10 years ago
|
||
I believe that the following three crash reports are all instances of this bug:
bp-163de6ab-bc47-4e3d-90d5-f91992150304 3/4/2015 16:53
bp-5a081db2-ffde-4d6c-857f-3924c2150304 3/4/2015 10:02
bp-311f753d-21d3-4164-83e8-fe9c92150303 3/3/2015 11:17
The attached debug log is according to the instructions at http://blogs.technet.com/b/markrussinovich/archive/2005/08/17/unkillable-processes.aspx. (Apologies for the initial flailing.)
I've been hitting this bug a couple of times a day for the last few days; let me know if I can gather any more information to help! :)
Comment 31•10 years ago
|
||
so has this been introduced by bug 1054959?
Comment 32•10 years ago
|
||
I looked for a preference to disable device discovery by UDP (I'm assuming that's the part of that bug you're suspecting), but couldn't find anything. (I searched for "device", "video", and "discover" in about:config.) If you can point me to it, I'd be happy to turn it off and report back after a week or so. It's also not obvious to me whether that feature exists in the build I'm using or was backed out; I'm on:
> Build identifier: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0
Comment 33•10 years ago
|
||
This corresponds to https://crash-stats.mozilla.com/report/index/2b690fb1-b9a5-4ce5-9c7f-c64192150309. It happened after being in a 3.5-hour meeting, so on wireless and with a few suspend-resumes thrown in. But Firefox was shut down during most of the time (I'm almost positive it was shut down during all of the suspend/resume cycles).
In general I've been able to reduce the frequency of this issue by exiting FF whenever I would normally alt-tab away from it, only leaving it active when I'm directly using it.
Comment 35•10 years ago
|
||
I dug around a bit in the code in bug 1054959, and found that SimpleServiceDiscovery.jsm uses UDP port 1900. I created a Windows Firewall rule to block outbound connections on that remote port, and haven't seen this bug since then. I never would've thought of that -- thank you philipp!
Comment 36•10 years ago
|
||
My problem with Firefox not connecting and not restarting is now gone. I had some trouble with the motherboard on my computer. I reflashed it, as well it will now only work with one stick of ram, but at least I have no problem with Firefox. The motherboard uses the Intel G33/G31 chipset. I don't know if it had gone bad somehow. I did have some trouble with Window Media Player. Hope this may help.
Comment 37•10 years ago
|
||
(In reply to Matt McHenry from comment #35)
> Created attachment 8577239 [details]
I am unable to see the properties window (digging into firewall outbound rule,but not able to get the properties window of firefox)
if you able to gives the steps,then it would be great for me.
> Possible work-around: Windows Firewall rule to block SSDP UDP connections
>
> I dug around a bit in the code in bug 1054959, and found that
> SimpleServiceDiscovery.jsm uses UDP port 1900. I created a Windows Firewall
> rule to block outbound connections on that remote port, and haven't seen
> this bug since then. I never would've thought of that -- thank you philipp!
Comment 38•10 years ago
|
||
(In reply to Matt McHenry from comment #35)
> Created attachment 8577239 [details]
> Possible work-around: Windows Firewall rule to block SSDP UDP connections
>
> I dug around a bit in the code in bug 1054959, and found that
> SimpleServiceDiscovery.jsm uses UDP port 1900. I created a Windows Firewall
> rule to block outbound connections on that remote port, and haven't seen
> this bug since then. I never would've thought of that -- thank you philipp!
If this is correct, this bug should be fixed in 37+ as tab casting will be disabled by default via bug 1142521. We should be able to tell by the crash volume in 37 Beta 6.
Comment 39•10 years ago
|
||
(In reply to Chandan Kumar Baba from comment #37)
> (In reply to Matt McHenry from comment #35)
> > Possible work-around: Windows Firewall rule to block SSDP UDP connections
>
> I am unable to see the properties window (digging into firewall outbound
> rule,but not able to get the properties window of firefox)
> if you able to gives the steps,then it would be great for me.
At this point it sounds like you might be better off grabbing a FF 37 beta 6+ build. But if you still want to create a firewall rule, here's how:
Start -> Windows Firewall
right-click on Outbound Rules in the left-hand pane and select "New Rule ..."
select the "Port" radio button and click next
select the "UDP" radio button, enter 1900 in the "Specific remote ports" text box, and click next
select the "block the connection" radio button and click next
select all three of the "Domain", "Private" and "Public" check boxes and click next
name your rule, paste the URL of this bug into the description, and click finish
find your new rule in the list and double-click to open it
on the "Programs and Services" tab, select the "This program" radio button
click the browse button and locate firefox.exe
Comment 41•10 years ago
|
||
I still see this crash in 37.0b6 so looks like bug 1142521 did not resolve this issue. We're now too late in 37 to take a fix so I've marked 37 as wontfix.
Assignee | ||
Comment 42•10 years ago
|
||
Attachment #8567077 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #8581758 -
Attachment filename: 1124880-new-thread.patch → v1.2
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Comment 43•10 years ago
|
||
Comment on attachment 8581758 [details] [diff] [review]
v1.2
Patrick, I've added telemetry and few other cosmetic changes. It's not for carrying r+, so asking again for a full r.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a800b82c5725
Attachment #8581758 -
Flags: review?(mcmanus)
Assignee | ||
Updated•10 years ago
|
Attachment #8581758 -
Attachment description: 1124880-new-thread.patch → v1.2
Attachment #8581758 -
Attachment filename: v1.2 → 1124880-new-thread.patch
Assignee | ||
Comment 44•10 years ago
|
||
Fixed (finally!) the thread count assertion failure.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5743ef11575
Assignee | ||
Updated•10 years ago
|
Attachment #8582498 -
Flags: review?(mcmanus)
Assignee | ||
Updated•10 years ago
|
Attachment #8581758 -
Attachment is obsolete: true
Attachment #8581758 -
Flags: review?(mcmanus)
Comment 45•10 years ago
|
||
Comment on attachment 8582498 [details] [diff] [review]
v1.3
Review of attachment 8582498 [details] [diff] [review]:
-----------------------------------------------------------------
if the problem persists because a single thread an block too long I guess we can move to a timeout and kill the thread (and leak the socket) approach, right?
::: netwerk/base/nsUDPSocket.cpp
@@ +137,5 @@
> + static bool sPastShutdown;
> +
> + // Active threads (roughly) counter, modified only on the main thread
> + // and used only for telemetry reports.
> + static uintptr_t sActiveThreadsCount;
not really a review comment, but why uintptr_t if the var never holds a ptr? it seems accumulate just wants uint32_t
Attachment #8582498 -
Flags: review?(mcmanus) → review+
Assignee | ||
Comment 46•10 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #45)
> Comment on attachment 8582498 [details] [diff] [review]
> v1.3
>
> Review of attachment 8582498 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> if the problem persists because a single thread an block too long I guess we
> can move to a timeout and kill the thread (and leak the socket) approach,
> right?
Yes, probably have it win specific. TerminateThread would be the way.
>
> ::: netwerk/base/nsUDPSocket.cpp
> @@ +137,5 @@
> > + static bool sPastShutdown;
> > +
> > + // Active threads (roughly) counter, modified only on the main thread
> > + // and used only for telemetry reports.
> > + static uintptr_t sActiveThreadsCount;
>
> not really a review comment, but why uintptr_t if the var never holds a ptr?
> it seems accumulate just wants uint32_t
My poor attempt for alignment enforcement. uint32_t it is.
Assignee | ||
Comment 47•10 years ago
|
||
Attachment #8582498 -
Attachment is obsolete: true
Attachment #8583239 -
Flags: review+
Comment 48•10 years ago
|
||
Don't ever use TerminateThread. If the thread is in any critical section (including the allocator) the whole process will deadlock.
Assignee | ||
Comment 49•10 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #48)
> Don't ever use TerminateThread. If the thread is in any critical section
> (including the allocator) the whole process will deadlock.
And I think that is our case, I believe NtClose keeps one. So, we should slowly turn to MS for help here...
Assignee | ||
Updated•10 years ago
|
Keywords: checkin-needed
Comment 50•10 years ago
|
||
Keywords: checkin-needed
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/5044ad98d160 for build bustage:
https://treeherder.mozilla.org/logviewer.html#?job_id=8086380&repo=mozilla-inbound
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 52•10 years ago
|
||
Attachment #8583239 -
Attachment is obsolete: true
Attachment #8584524 -
Flags: review+
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Assignee | ||
Comment 54•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #53)
> Something's seriously wrong with that attachment.
Hmm.. how that happened... Re-attaching, sorry.
Assignee | ||
Comment 55•10 years ago
|
||
Attachment #8584524 -
Attachment is obsolete: true
Attachment #8584632 -
Flags: review+
Comment 57•10 years ago
|
||
Keywords: checkin-needed
Comment 58•10 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Comment 59•10 years ago
|
||
FWIW, there is a number of bugs saying "Firefox can't connect, hangs on shutdown, cannot be killed" recently, and at least one crash id linking the hang to the stack here.
bug 1125686 (I CCed everyone there), bug 1137994, bug 1105577, bug 1142981, bug 1141627, bug 1143969, bug 646189 comment 13.
...and a forum thread saying this is new in 36 http://forums.mozillazine.org/viewtopic.php?f=38&t=2916383
Reporter | ||
Comment 60•10 years ago
|
||
Looks like we still have crashes in 39 dev edition and Nightly 40 which all came out after this patch landed in 39: https://crash-stats.mozilla.com/search/?product=Firefox&version=39.0a2&version=40.0a1&process_type=browser&process_type=content&signature=~nsHttpConnectionMgr%3A%3AShutdown&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Assignee | ||
Comment 62•10 years ago
|
||
1.
Some of the reports in comment 60 show we are stuck in PR_Poll. This could be an effect of a known bug with signaling through the pollable event. It may be that there is infinite timeout put to PR_Poll.
We could modify nsSocketTransportService::PollTimeout() to return no more then few seconds timeout (5/7/10 or so) to let the STS thread loop occasionally. With Dragana's patch for event processing we might fix this bug.
2.
https://crash-stats.mozilla.com/report/index/5ba8f826-a4b2-4131-a63c-e4b112150404#allthreads, thread 11 (socket thread) shows a dialog via nsAutodial::DialDefault:
RasDialDlgW
nsAutodial::DialDefault(wchar_t const*)
nsNativeConnectionHelper::OnConnectionFailed(wchar_t const*)
nsSocketTransport::RecoverFromError()
mozilla::net::nsHttpConnectionMgr::nsHalfOpenSocket::OnTransportStatus(nsITransport*, nsresult, __int64, __int64)
3.
Some reports show we are stuck in connect(), few of them show we are stuck in close() (nsNSSSocketInfo::CloseSocketAndDestroy).
4.
https://crash-stats.mozilla.com/report/index/5acfd3bf-631b-46fd-b67b-731e42150407#allthreads, thread 9
https://crash-stats.mozilla.com/report/index/73b21efe-1098-4285-b57d-771342150402#allthreads, thread 8
NtWaitForSingleObject
ntdll.dll@0x1e1d6
RtlpEnterCriticalSectionContended
RtlpEnterCriticalSectionContended
PR_Lock
mozilla::net::nsWSAdmissionManager::GetSessionCount(int&)
mozilla::net::WebSocketChannel::StopSession(nsresult)
mozilla::net::WebSocketChannel::ReleaseSession()
mozilla::net::WebSocketChannel::ProcessInput(unsigned char*, unsigned int)
mozilla::net::WebSocketChannel::OnInputStreamReady(nsIAsyncInputStream*)
nsInputStreamReadyEvent::Run()
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
nsSocketTransportService::Run()
5.
https://crash-stats.mozilla.com/report/index/27ae8618-8676-436d-8056-c6db72150406#allthreads, thread 5
https://crash-stats.mozilla.com/report/index/bb6d4bd5-75b5-475d-a926-4207d2150404#allthreads, thread 4
PR_Lock
nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int)
http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l128
nsSocketTransportService::Run()
http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l815
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
I'll file bugs.
mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)
MessageLoop::RunHandler()
MessageLoop::Run()
nsThread::ThreadFunc(void*)
_PR_NativeRunThread
pr_root
_callthreadstartex
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 63•10 years ago
|
||
(In reply to Sylvestre Ledru [:sylvestre] from comment #61)
> Honza, can we have an uplift to beta? Thanks
I can, what is the deadline for it? (kinda busy now)
Comment 64•10 years ago
|
||
Honzam, today please (or tomorrow morning if you can't). Stability is not good in 38, any fix is good to take! Thanks
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 65•10 years ago
|
||
There needs to be uplifted also patch from bug 1135405 then, AFAIK.
Assignee | ||
Comment 66•10 years ago
|
||
Comment on attachment 8584632 [details] [diff] [review]
v1.3.2
applies clearly on Beta but not tested on try.
LAND AFTER BUG 1135405!
Approval Request Comment
[Feature/regressing bug #]: none
[User impact if declined]: shutdown hang, inability to browse
[Describe test coverage new/current, TreeHerder]: landed on m-c on 2015-03-29, currently on m-a
[Risks and why]: this not a simple patch and carries some risk, tho should be small. This creates new threads, there is no limit for it.
[String/UUID change made/needed]: none
Flags: needinfo?(honzab.moz)
Attachment #8584632 -
Flags: approval-mozilla-beta?
Comment 67•10 years ago
|
||
Comment on attachment 8584632 [details] [diff] [review]
v1.3.2
Shutdown hangs have been an important issue lately. I am taking the risk as we are early in the beta cycle and we can always backout this change if it introduces more regressions that it fixes.
Should be in 38 beta 3
Thanks Honza!
Attachment #8584632 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 68•10 years ago
|
||
Comment 69•10 years ago
|
||
Robert, you mentioned during that the last channel meeting that we still have crashes caused by this bug. Do you confirm?
Flags: needinfo?(kairo)
Reporter | ||
Comment 70•10 years ago
|
||
Yes, it should have been fixed in 38.0b3 but it's still at #9 with 1.4% of all crashes in 38.0b4. I'm reopening.
Status: RESOLVED → REOPENED
Flags: needinfo?(kairo)
Resolution: FIXED → ---
Assignee | ||
Comment 71•10 years ago
|
||
kairo, I filed new bugs I can see stacks for, see blocks. This one is fixed.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Summary: shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown() → shutdownhang in mozilla::net::nsHttpConnectionMgr::Shutdown() on pending UDP socket close
Assignee | ||
Comment 72•10 years ago
|
||
If you want an umbrella bug for this issue, please open a new one and move the blocking bugs there (and make this one blocking it too, left closed). Thanks.
Reporter | ||
Comment 73•10 years ago
|
||
New bugs are fine as long as they have the right entries in crash signatures and the crash keywords set. Thanks for looking into this!
Comment 74•10 years ago
|
||
Honza, this bug is still presented during our channel meeting and I would like to see it fixed for 38.
I looked at the "block" and it seems unrelated (send video to device). Are you sure about the bug number? Thanks
Flags: needinfo?(honzab.moz)
Reporter | ||
Updated•10 years ago
|
Keywords: topcrash-win
Assignee | ||
Comment 75•10 years ago
|
||
(In reply to Sylvestre Ledru [:sylvestre] from comment #74)
> Honza, this bug is still presented during our channel meeting and I would
> like to see it fixed for 38.
> I looked at the "block" and it seems unrelated (send video to device). Are
> you sure about the bug number? Thanks
It probably means someone believes this bug is caused by that one (is its regression). What do you want me to do?
The patch here is on beta (38). Other bugs with different cause but identical symptoms are tracked. I don't have time to work on them right now tho.
Flags: needinfo?(honzab.moz)
Reporter | ||
Comment 76•10 years ago
|
||
It means that this is the only bug on file that is connected with the signature (see comment #73), and the signature is still happening in large enough volume to be a concern.
Is there any bug filed that should be about the majority of the issues left with this signature?
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 77•10 years ago
|
||
What about all of the open "Depends on" bugs?
Flags: needinfo?(honzab.moz)
Reporter | ||
Comment 78•10 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #77)
> What about all of the open "Depends on" bugs?
None of them have the crash signature and crash keyword set, so they do not appear on crash-stats for the signature.
Reporter | ||
Comment 79•10 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #75)
> The patch here is on beta (38). Other bugs with different cause but
> identical symptoms are tracked. I don't have time to work on them right now
> tho.
Also none of them are tracked for any release via tracking flags. So the meaning of "tracked" is an issue as well. ;-)
Assignee | ||
Comment 80•10 years ago
|
||
I cloned this to bug 1158189. Please set the flags, I don't have perms.
Assignee | ||
Updated•10 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•