Closed
Bug 1152046
Opened 10 years ago
Closed 9 years ago
[Win] close() and connect() on a socket may block for a very long time.
Categories
(Core :: Networking, defect)
Tracking
People
(Reporter: mayhemer, Assigned: dragana)
References
Details
(Whiteboard: [necko-active])
Attachments
(5 files, 11 obsolete files)
(deleted), patch - dragana: review+ (×5; Details/Diff/Splinter Review links omitted)
This is a nasty, hard-to-reproduce bug.
There is one known and described case with UDP sockets (worked around in bug 1124880): http://blogs.msdn.com/b/winsdk/archive/2013/10/09/udp-closesocket-takes-upto-5-seconds-to-return-in-disconnect-remote-host-down-scenario-due-to-pending-data-to-send.aspx
However, apparently this can happen for any socket, see some of the reports under https://crash-stats.mozilla.com/search/?product=Firefox&version=39.0a2&version=40.0a1&process_type=browser&process_type=content&signature=~nsHttpConnectionMgr%3A%3AShutdown&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Google turns up nothing on this, so it could be specific to how we set up and handle our sockets.
Comment 1•10 years ago
Honza - what do you think of the theory that this is caused by a locally installed LSP? Perhaps the LSP is blocking on a non-blocking socket.
Here are three reports that look like that is what is happening:
https://crash-stats.mozilla.com/report/index/cd6dc70f-1afc-4596-94b6-8d26c2150404#allthreads
https://crash-stats.mozilla.com/report/index/77bf1520-1759-437b-a209-885922150401#allthreads
https://crash-stats.mozilla.com/report/index/9adc4168-b377-4a7c-801d-546d32150402#allthreads
In each case the Winsock LSP list is considerably more complicated than the Winsock LSP list you find in a crash caused by something in Gecko (e.g. the dispatch deadlock). I took all the samples from XP SP3 just because I thought that might make the LSP lists more comparable - but it happens on multiple versions of Windows.
Reporter
Comment 2•10 years ago
If you mean the stack:
KiFastSystemCallRet
NtDeviceIoControlFile
WSPBind
WSPAddressToString
WSPConnect
connect
_PR_MD_CONNECT
SocketConnect
then it looks to me like a pure native call (no LSP involved). Being in bind() suggests it is trying to find a local interface to bind to. It may be the same cause as in the UDP close() case: waiting for an ARP or (more likely) a DHCP response. Those are guesses, though, not supported by any research so far.
It could also be OEM-specific code misbehaving here... something very hard to find out :(
There is, though, 'rsvpsp.dll' in the LSP list. I don't have that file on my Win7 machine.
Updated•10 years ago
Assignee: nobody → dd.mozilla
Assignee
Comment 3•10 years ago
rsvpsp.dll is also from Windows (used for QoS/priority features). (Related to comment #2.)
Assignee
Comment 4•10 years ago
In most of these crashes the socket thread is inside PR_Connect.
This patch prevents connects during shutdown.
Let's see if it helps.
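For reference, a minimal sketch of the idea (the real hunk is in the patch reviewed in comment 7 below; the surrounding function context here is assumed):

    // In the socket transport's connect path (sketch):
    if (gIOService->IsShuttingDown()) {
        // Don't start a new connect during shutdown - PR_Connect() may
        // block for a very long time on some systems.
        return NS_ERROR_ABORT;
    }
    // ...otherwise fall through to PR_Connect() as before...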
Attachment #8610603 -
Flags: review?(honzab.moz)
Reporter
Comment 5•10 years ago
What's this 'swp' thing?
Assignee
Comment 6•10 years ago
Sorry.
Attachment #8610603 -
Attachment is obsolete: true
Attachment #8610603 -
Flags: review?(honzab.moz)
Attachment #8611372 -
Flags: review?(honzab.moz)
Reporter
Comment 7•10 years ago
Comment on attachment 8611372 [details] [diff] [review]
bug_1152046_v1.patch
Review of attachment 8611372 [details] [diff] [review]:
-----------------------------------------------------------------
This is good to have, but I don't think it's going to help much with the root cause. The delays/blocks in connect()/close() are IMO happening not just after shutdown. The thing is that we try to shut down while the socket thread has already been hanging inside connect()/close() for a while. Maybe that is even a reason people restart the browser - since pages are not loading, the socket thread is deadlocked.
There are two reasons this could happen IMO: A/V software interference or LSP malware.
Some code that detects the time spent in connect()/close() would be good to have. At least collect telemetry: the time spent in the call, how often it happens to spend more than some 2 seconds, whether it happens after a network change, whether it happens after shutdown only - whatever would be useful.
If long delays happen all the time (based on telemetry), we could do a test at startup or after any network change we detect and try something like manual local binding, or even let connect()/close() happen on new threads (similar to what I did for UDP close), and maybe warn the user that "your OS performs slowly when doing network access" or something...
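A sketch of what such instrumentation could look like, wrapped around the existing PR_Connect() call (the probe and helper names follow the telemetry patches that land later in this bug; the exact call site is an assumption):

    PRIntervalTime connectStarted = 0;
    if (gSocketTransportService->IsTelemetryEnabled()) {
        connectStarted = PR_IntervalNow();
    }

    PRStatus status = PR_Connect(fd, &prAddr, NS_SOCKET_CONNECT_TIMEOUT);

    if (connectStarted) {
        // Record how long the supposedly non-blocking call blocked us.
        Telemetry::Accumulate(Telemetry::PRCONNECT_BLOCKING_TIME_NORMAL,
                              PR_IntervalToMilliseconds(PR_IntervalNow() -
                                                        connectStarted));
    }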
::: netwerk/base/nsIOService.h
@@ +77,5 @@
> uint32_t flags,
> nsAsyncRedirectVerifyHelper *helper);
>
> bool IsOffline() { return mOffline; }
> + bool IsShuttingDown() { return mShutdown; }
Maybe call this IsShutdown()? That would conform to the name of the member.
::: netwerk/base/nsSocketTransport2.cpp
@@ +1217,5 @@
> bool isLocal;
> IsLocal(&isLocal);
>
> + if (gIOService->IsShuttingDown()) {
> + return NS_ERROR_ABORT;
nit: the file uses 4-space indentation
Attachment #8611372 -
Flags: review?(honzab.moz) → review+
Assignee
Comment 8•10 years ago
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Comment on attachment 8611372 [details] [diff] [review]
> bug_1152046_v1.patch
>
> Review of attachment 8611372 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> This is good to have, but I don't think it's going to help much with the
> root cause. The delays/blocks in connect()/close() are IMO happening not
> just after shutdown. The thing is that we try to shut down while the socket
> thread has already been hanging inside connect()/close() for a while. Maybe
> that is even a reason people restart the browser - since pages are not
> loading, the socket thread is deadlocked.
>
> There are two reasons this could happen IMO: A/V software interference or
> LSP malware.
>
> Some code that detects the time spent in connect()/close() would be good to
> have. At least collect telemetry: the time spent in the call, how often it
> happens to spend more than some 2 seconds, whether it happens after a
> network change, whether it happens after shutdown only - whatever would be
> useful.
>
I intended to add telemetry as well; it is coming soon...
Assignee
Comment 9•10 years ago
Assignee
Comment 10•10 years ago
Attachment #8611372 -
Attachment is obsolete: true
Attachment #8612760 -
Flags: review+
Assignee
Updated•10 years ago
Keywords: checkin-needed,
leave-open
Comment 11•10 years ago
Keywords: checkin-needed
Comment 12•10 years ago
Assignee
Comment 13•9 years ago
Attachment #8613916 -
Flags: review?(honzab.moz)
Reporter
Comment 14•9 years ago
Comment on attachment 8613916 [details] [diff] [review]
bug_1152046_telemetry_v1.patch
Review of attachment 8613916 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/nsSocketTransport2.cpp
@@ +53,5 @@
> #endif
> /* End keepalive config inclusions. */
>
> +// If PR_Connect, PR_Close and PR_ContinueConnect are longer than
> +// STS_HANG_DURATION ms, consired it a hang and collect data about the state we
consired (typo)
@@ +1410,5 @@
>
> NetAddrToPRNetAddr(&mNetAddr, &prAddr);
>
> MOZ_EVENT_TRACER_EXEC(this, "net::tcp::connect");
> + PRIntervalTime connectStarted;
please explain why you prefer PR intervals here (atomicity)
@@ +3100,5 @@
> + STS_STATE_SHUTDOWN);
> +
> + } else if ((gIOService->LastConnectivityChange() >= closeStarted)
> + &&
> + (gIOService->LastOfflineStateChange() >= closeStarted)){
I think we should include some window after the change, like "has there been a change in the net config in, say, the last 60 seconds?". If so, add a 'state-changed' telemetry; otherwise, add 'normal-state' telemetry.
Then, I'm not sure a change in the 'connectivity yes/no' state is what you want... Maybe check with :badger on this first. There can be a wire connection, then for a while both wired and wifi, then the wire goes away (and that will not be seen as a change in connectivity).
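A sketch of the suggested window check (the 60-second window and the STS_STATE_CONNECTIVITY_CHANGE name are illustrative; the accessors are the ones from the patch):

    // Was there a network change within the last 60 seconds?
    PRIntervalTime window = PR_SecondsToInterval(60);
    PRIntervalTime now = PR_IntervalNow();
    bool recentChange =
        (now - gIOService->LastConnectivityChange()) < window ||
        (now - gIOService->LastOfflineStateChange()) < window;
    Telemetry::Accumulate(Telemetry::STS_LONG_PRCLOSE_STATE,
                          recentChange ? STS_STATE_CONNECTIVITY_CHANGE
                                       : STS_STATE_NORMAL);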
@@ +3116,5 @@
> + } else {
> + Telemetry::Accumulate(Telemetry::STS_LONG_PRCLOSE_STATE,
> + STS_STATE_NORMAL);
> + }
> + }
please share the code, it's easy
::: toolkit/components/telemetry/Histograms.json
@@ +2462,5 @@
> "high": "60000",
> "n_buckets": 1000,
> "description": "Time spend blocked on poll (ms)."
> },
> + "STS_PRCONNECT_TIME": {
The name should be "PRCONNECT_BLOCKING_TIME"
@@ +2500,5 @@
> + "expires_in_version": "never",
> + "kind": "enumerated",
> + "n_values": 5,
> + "description": "If it is in PR_Close more than 2000 ms, collect telemetry about the state: 4: shutdown, 3: offline and connectivity state changed 2: offline state changed, 1: connectivity change and 0: none."
> + },
I'd rather have telemetry like:
PRCONNECT_NORMAL
PRCONNECT_AFTER_NETWORK_CHANGE
PRCONNECT_AFTER_SHUTDOWN
etc., collecting time (exponential).
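For illustration, an exponential-time entry in Histograms.json along those lines could look like this (the field values mirror the STS poll-time entry quoted above; the final names and descriptions are per the later patch):

    "PRCONNECT_BLOCKING_TIME_NORMAL": {
      "expires_in_version": "never",
      "kind": "exponential",
      "high": "60000",
      "n_buckets": 1000,
      "description": "Time spent blocked in PR_Connect, normal case (ms)."
    },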
Attachment #8613916 -
Flags: review?(honzab.moz) → feedback-
Assignee
Comment 15•9 years ago
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Comment on attachment 8613916 [details] [diff] [review]
> bug_1152046_telemetry_v1.patch
>
> Review of attachment 8613916 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +3100,5 @@
> > + STS_STATE_SHUTDOWN);
> > +
> > + } else if ((gIOService->LastConnectivityChange() >= closeStarted)
> > + &&
> > + (gIOService->LastOfflineStateChange() >= closeStarted)){
>
> I think we should include some window after the change, like "has there
> been a change in the net config in, say, the last 60 seconds?". If so, add a
> 'state-changed' telemetry; otherwise, add 'normal-state' telemetry.
>
> Then, I'm not sure a change in the 'connectivity yes/no' state is what you
> want... Maybe check with :badger on this first. There can be a wire
> connection, then for a while both wired and wifi, then the wire goes away
> (and that will not be seen as a change in connectivity).
>
Looking at the code, I am sure that connectivity means we have some connection or we do not have any. I will check this with :badger.
I want to separate connectivity and offline state, which are 2 separate things. Looking at the code, a connectivity change will update the offline state if the offline state has not been forced for some other reason.
So I want to have:
PRCONNECT_BLOCKING_NORMAL - no shutdown, no network change, no offline state change
PRCONNECT_BLOCKING_SHUTDOWN - we are in shutdown
PRCONNECT_BLOCKING_NETWORK_CHANGE - connectivity changed
PRCONNECT_BLOCKING_OFFLINE - set offline (this will probably be the same as normal, except if we have some error in necko code)
Assignee
Comment 16•9 years ago
Can you take a look at the connectivity bit?
I wanted to save the time of the last connectivity change - connected/disconnected.
Attachment #8613916 -
Attachment is obsolete: true
Attachment #8614624 -
Flags: feedback?(daniel)
Tracking for 40. Are other versions affected?
tracking-firefox40:
--- → +
See https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c7 for why I tracked this.
Updated•9 years ago
Attachment #8614624 -
Flags: feedback?(daniel) → feedback+
Comment 21•9 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #12)
> https://hg.mozilla.org/mozilla-central/rev/1badbdeca325
Landed on Aurora a=sledru to fix uplift bustage from bug 1169554.
https://hg.mozilla.org/releases/mozilla-aurora/rev/6b9bf28f9b8a
Assignee
Comment 22•9 years ago
Attachment #8614624 -
Attachment is obsolete: true
Attachment #8628777 -
Flags: review?(honzab.moz)
Reporter
Comment 23•9 years ago
Comment on attachment 8628777 [details] [diff] [review]
bug_1152046_telemetry_v2.patch
Review of attachment 8628777 [details] [diff] [review]:
-----------------------------------------------------------------
Please fix a few of the nits and, if there are no major changes you want to make, just land it! Good work.
::: netwerk/base/nsIOService.cpp
@@ +1521,5 @@
> }
>
> bool isUp = true;
> if (!strcmp(data, NS_NETWORK_LINK_DATA_CHANGED)) {
> + mLastNetworkLinkChange = PR_IntervalNow();
please double check this. I'm not sure this fires every time..
::: netwerk/base/nsIOService.h
@@ +173,5 @@
> static bool sTelemetryEnabled;
> +
> + // These are needed for the telemetry about PR_Connect anf PR_Close
> + // duration. If PRConnect or PR_Close last to long check whether
> + // connectivity or offline state has change in between.
Rather:
"These timestamps are needed for collecting telemetry on PR_Connect, PR_ConnectContinue and PR_Close blocking time. On too long time spent in any of these functions we want to know what network change happened shortly before."
BTW, it might be smarter to collect only times that take more than, say, 1 second? I am really not sure about this, but I don't want to see a graph where 0-0.1 ms makes up 99% and the rest (which we are interested in!) is hidden in the remaining 1% of all the captures.
::: netwerk/base/nsSocketTransport2.cpp
@@ +1406,3 @@
> status = PR_Connect(fd, &prAddr, NS_SOCKET_CONNECT_TIMEOUT);
> +
> + if (gSocketTransportService->IsTelemetryEnabled() && connectStarted) {
if you want the "&& connectStarted" part then you must initialize connectStarted to 0. Maybe have the condition just be "if (connectStarted) {"?
@@ +1881,3 @@
> PRStatus status = PR_ConnectContinue(fd, outFlags);
> +
> + if (gSocketTransportService->IsTelemetryEnabled() && connectStarted) {
same here
::: netwerk/base/nsSocketTransportService2.h
@@ +109,5 @@
>
> // Returns true if keepalives are enabled in prefs.
> bool IsKeepaliveEnabled() { return mKeepaliveEnabledPref; }
> +
> + bool IsTelemetryEnabled() { return mTelemetryEnabledPref; }
some small comment would be nice
Attachment #8628777 -
Flags: review?(honzab.moz) → review+
Assignee
Comment 24•9 years ago
Attachment #8628777 -
Attachment is obsolete: true
Attachment #8630588 -
Flags: review+
Assignee
Comment 25•9 years ago
Checkin needed only for patch bug_1152046_telemetry_v2.patch.
Keywords: checkin-needed
Comment 26•9 years ago
Keywords: checkin-needed
Comment 27•9 years ago
Assignee
Updated•9 years ago
Status: NEW → ASSIGNED
Honza, just wanted to ping you on whether we have a fix in the works here or a patch ready for uplift to beta. Bug 1158189 is a top crash on the beta channel and is blocked by this bug. Thanks!
Flags: needinfo?(honzab.moz)
Assignee
Comment 29•9 years ago
I will answer and Honza can correct me.
We are currently collecting telemetry to see when it is happening. We assume it is after a network connection change, or maybe just when there is no connectivity.
For Honza:
From the telemetry that we have collected so far I see this happening for all cases sometimes. Some values are pretty high on Windows, and on OS X too, but not as high as on Windows. I could do a fix to split connect onto a separate thread. (I need to look closely into the data to see if we can skip it in some cases.)
Reporter
Comment 30•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #29)
> I will answer and Honza can correct me.
>
> We are currently collecting telemetry to see when it is happening. We assume
> it is after a network connection change, or maybe just when there is no
> connectivity.
>
> For Honza:
> From the telemetry that we have collected so far I see this happening for
> all cases sometimes. Some values are pretty high on Windows, and on OS X
> too, but not as high as on Windows. I could do a fix to split connect onto a
> separate thread. (I need to look closely into the data to see if we can skip
> it in some cases.)
As Dragana says, the only fix here is to either do more investigation, like contacting the people this happens to (not a usual way we work at Mozilla, though), or have separate thread(s) handling these function calls.
I think having a thread pool (limited to something like 5 threads) is the way to go here as a general workaround.
I personally believe there is some antivirus or a virus (how ironic :)) involved here, but that is more of a guess.
Flags: needinfo?(honzab.moz)
Assignee
Comment 31•9 years ago
A peek into telemetry between 8 and 20 July 2015.
The numbers next to the platform names show the maximal value seen.
Linux values are very low in comparison to OS X and Windows, and Windows has much higher numbers than OS X; this is why most of the crashes are on Windows.
PRCONNECT_BLOCKING_TIME_CONNECTIVITY_CHANGE:
linux: 26 ms
osx: 23 s
windows: even 2 samples above 60 s
PRCONNECT_BLOCKING_TIME_LINK_CHANGE:
linux: -
osx: 700 ms
windows: 9 s
PRCONNECT_BLOCKING_TIME_NORMAL:
linux: 116 ms
osx: 9 s
windows: 19 samples above 60 s
PRCONNECT_BLOCKING_TIME_OFFLINE (small sample size):
linux: 1-2 ms (very small sample size, ~40)
osx: 1-2 ms (very small sample size, ~200)
windows: 43 s
PRCONNECTCONTINUE_BLOCKING_TIME_CONNECTIVITY_CHANGE:
linux: 23 ms
osx: 97 ms
windows: 11 s
PRCONNECTCONTINUE_BLOCKING_TIME_LINK_CHANGE:
linux: -
osx: 46 ms
windows: 10 s
PRCONNECTCONTINUE_BLOCKING_TIME_NORMAL:
linux: 73 ms
osx: 133 ms
windows: 2 samples above 60 s
PRCONNECTCONTINUE_BLOCKING_TIME_OFFLINE (small sample size):
windows: 16 ms
PRCLOSE_BLOCKING_TIME_CONNECTIVITY_CHANGE:
linux: 53 ms
osx: 16 s
windows: 1 sample above 60 s
PRCLOSE_BLOCKING_TIME_LINK_CHANGE:
linux: -
osx: 1 s
windows: 48 s
PRCLOSE_BLOCKING_TIME_NORMAL:
linux: 11 s
osx: 27 s
windows: 9 samples above 60 s
PRCLOSE_BLOCKING_TIME_OFFLINE (small sample size):
linux: 6 s
osx: 86 ms
windows: 16 s
Comment 32•9 years ago
damn. those are crazy numbers. Dragana do these numbers look different at the 98th or 99th percentile rather than max? If not, then this is pretty nuts. (If so, even simple things like some kind of massive memory thrashing can explain .0001 outliers which would be irrelevant to us)
I think it is interesting that the close_blocking_time_normal is so bad cross platform.
Honza, do you think there is any reason to have this done synchronously (to the rest of gecko)?
can we just implement pr_close at the bottom of the stack to return true without closing anything, but pass the OS fd to a thread that queues them up and closes them as fast as possible. I suggest this rather than a thread pool as a possibility because the thread pool still creates a slow sync point for the outlier cases (though improves HoL blocking).
at least we have a good metric we could track now. Thanks Dragana.
Flags: needinfo?(honzab.moz)
Flags: needinfo?(dd.mozilla)
Reporter
Comment 33•9 years ago
(In reply to Patrick McManus [:mcmanus] from comment #32)
> damn. those are crazy numbers. Dragana do these numbers look different at
> the 98th or 99th percentile rather than max? If not, then this is pretty
> nuts. (If so, even simple things like some kind of massive memory thrashing
> can explain .0001 outliers which would be irrelevant to us)
Yep, it would be good to check what percentile of blocks takes more than, say, 4 seconds - or any other reasonable number of seconds.
>
> I think it is interesting that the close_blocking_time_normal is so bad
> cross platform.
>
> Honza, do you think there is any reason to have this done synchronously (to
> the rest of gecko)?
>
> can we just implement pr_close at the bottom of the stack to return true
> without closing anything, but pass the OS fd to a thread that queues them up
> and closes them as fast as possible. I suggest this rather than a thread
> pool as a possibility
"because the thread pool still creates a slow sync point for the outlier cases"
I don't follow this argument. Are you suggesting we have one thread per one close (connect)? I did the same for udp socket close - no pool, spawn a new thread for any invocation of nsUDPSocket::Close() ending up at PR_Close(). But we use udp more rarely than tcp so to have a tun of threads for each connect we call seems to me like waste of resources. But I can be persuaded otherwise. Both solutions are good. Only problem is that impact of raising number of threads is hard to measure, it's kind of a slow death...
> (though improves HoL blocking).
>
> at least we have a good metric we could track now. Thanks Dragana.
Yep, thanks!
Flags: needinfo?(honzab.moz)
Comment 34•9 years ago
roughly:
1] put an NSPR layer where layer->lower == tcp/udp
2] implement pr_close(aFD) {
       int fd = GetOsFD(aFD); // I forget what this is really called
       aFD->lower->dtor(aFD->lower); // or maybe set fd to -1 and call lower->close()
       postmessage(closingThread, fd);
       return 0;
   }
on closingThread:
   while (!shutdown) {
       recvmessage(&fd);
       close(fd);
   }
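Filling the sketch in with the actual NSPR stub-layer API (still only a sketch: PostToClosingThread() is a hypothetical queue helper, and the one-time identity/methods setup is shown inline for brevity):

    static PRDescIdentity sCloseLayerId;
    static PRIOMethods sCloseLayerMethods;

    static PRStatus CloseLayerClose(PRFileDesc *aFd)
    {
        // Pop our stub layer off; aFd then heads the remaining stack.
        PRFileDesc *layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
        layer->dtor(layer);
        // Hand the rest of the stack to the closing thread; the real
        // PR_Close(aFd) happens there, off the socket thread.
        PostToClosingThread(aFd); // hypothetical
        return PR_SUCCESS;
    }

    static void InstallCloseLayer(PRFileDesc *aFd)
    {
        // One-time setup (guard against re-running in real code).
        sCloseLayerId = PR_GetUniqueIdentity("close layer");
        sCloseLayerMethods = *PR_GetDefaultIOMethods();
        sCloseLayerMethods.close = CloseLayerClose;

        PRFileDesc *layer =
            PR_CreateIOLayerStub(sCloseLayerId, &sCloseLayerMethods);
        PR_PushIOLayer(aFd, PR_TOP_IO_LAYER, layer);
    }

The ClosingService patches reviewed below follow this shape, with a queue guarded by a lock and shutdown handling on top.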
Assignee
Comment 35•9 years ago
(In reply to Patrick McManus [:mcmanus] from comment #32)
> damn. those are crazy numbers. Dragana do these numbers look different at
> the 98th or 99th percentile rather than max? If not, then this is pretty
> nuts. (If so, even simple things like some kind of massive memory thrashing
> can explain .0001 outliers which would be irrelevant to us)
>
I should have added those stats too; I was just surprised by the max values, they are crazy.
The 95th and even the 99th percentile are only 1 ms.
There are 3365 samples larger than 1 second and 1453 larger than 4 s.
Dispatching this to a pool of threads or to separate threads is going to make this worse for everybody else. I am not sure we should do this.
> I think it is interesting that the close_blocking_time_normal is so bad
> cross platform.
>
> Honza, do you think there is any reason to have this done synchronously (to
> the rest of gecko)?
>
> can we just implement pr_close at the bottom of the stack to return true
> without closing anything, but pass the OS fd to a thread that queues them up
> and closes them as fast as possible. I suggest this rather than a thread
> pool as a possibility because the thread pool still creates a slow sync
> point for the outlier cases (though improves HoL blocking).
>
> at least we have a good metric we could track now. Thanks Dragana.
Flags: needinfo?(dd.mozilla)
Assignee
Comment 36•9 years ago
My message crossed with yours.
The suggestion from comment 34 could work.
Reporter
Comment 37•9 years ago
(In reply to Patrick McManus [:mcmanus] from comment #34)
> roughly:
>
>
> 1] put a nspr layer where layer->lower == tcp/udp
>
> 2] implement pr_close(aFD) {
> int fd = GetOsFD(aFD); // I forget what this is really called
> aFD->lower->dtor(aFD->lower); // or maybe set fd to -1 and call
> lower->close()
> postmessage(closingThread, fd);
> return 0;
> }
>
> on closingThread
> while(!shutdown) {
> recvmessage(&fd);
> close(fd);
> }
If we just leak any queued (yet not closed) sockets after shutdown, then this is a good solution. It should also replace my patch for UDP close.
How about connect/cont-connect, though?
Assignee
Comment 38•9 years ago
Here is one version. Please take a look...
Attachment #8641199 -
Flags: feedback?(mcmanus)
Reporter
Comment 39•9 years ago
Comment on attachment 8641199 [details] [diff] [review]
bug_1152046_prclose_v1.patch
Review of attachment 8641199 [details] [diff] [review]:
-----------------------------------------------------------------
Nice! Just a few internals to fix.
::: netwerk/base/nsClosingService.cpp
@@ +16,5 @@
> +TcpUdpPRCloseLayerClose(PRFileDesc *aFd)
> +{
> + PRFileDesc *fd = aFd->lower;
> + nsClosingService::PostCloseRequest(fd);
> + aFd = 0;
you probably don't need to do this
@@ +36,5 @@
> +{
> + PR_DestroyLock(mLock);
> + PR_DestroyCondVar(mQueueCondVar);
> + gInstance = nullptr;
> + mQueue.Clear();
hmm.. the queue must either be accessed only on a single thread (which I think it cannot be) or always be accessed unconditionally under the lock
@@ +40,5 @@
> + mQueue.Clear();
> +}
> +
> +void
> +nsClosingService::Start()
put // static before the method definition (applies to all static methods here)
@@ +105,5 @@
> +void
> +nsClosingService::PostCloseRequest(PRFileDesc *aFd)
> +{
> + // Check if shutdown is called.
> + if (!sShutdown) {
nicer would be:
if (sShutdown) {
// Let the socket leak, we want the system take care... bla bla bla
return;
}
if (..)..
@@ +110,5 @@
> + // Check if gInstance is present, if not call in the standard way.
> + // Theoretically this can happen before nsIOService initializes
> + // nsClosingService.
> + if (gInstance) {
> + gInstance->PostRequest(aFd);
so, I'm missing one major thing here: you are not correctly handling the service lifetime
each layer instance you create must have its 'secret' set to a struct that has a nsRefPtr<nsClosingService> that refers to the service.
here you will pick the refptr from your secret struct and use it instead of gInstance and then just nullify it. This will make sure the queue and the locks exist. The code you have may nicely race.
the other option is a StaticMutex or StaticMonitor over gInstance
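A sketch of the secret struct being asked for here (the name matches the later patch; the PRFilePrivate* cast is how NSPR stores layer-private data):

    // Keeps the service alive for as long as any layer instance exists.
    class ClosingLayerSecret
    {
    public:
        explicit ClosingLayerSecret(nsClosingService *aService)
            : mClosingService(aService) {}
        nsRefPtr<nsClosingService> mClosingService;
    };

    // When pushing the layer:
    //   layer->secret =
    //       reinterpret_cast<PRFilePrivate *>(new ClosingLayerSecret(service));
    // In close(): take the secret, use its mClosingService, then delete it.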
@@ +115,5 @@
> + } else {
> + PR_Close(aFd);
> + }
> + }
> + aFd = nullptr;
why?
@@ +179,5 @@
> + PR_Unlock(mLock);
> +}
> +
> +void
> +nsClosingService::ResolveRequest()
maybe ProcessRequest would be better (if you really want to have a method for this, inlining into ThreadFunc would work well too)
@@ +186,5 @@
> + if (mQueue.Length() == 0) {
> + PR_WaitCondVar(mQueueCondVar, PR_INTERVAL_NO_TIMEOUT);
> + }
> + if (sShutdown) {
> + PR_Unlock(mLock);
grrr....
::: netwerk/base/nsClosingService.h
@@ +62,5 @@
> + mozilla::Telemetry::ID aIDLinkChange,
> + mozilla::Telemetry::ID aIDOffline);
> +
> + static bool sShutdown;
> + static nsClosingService *gInstance;
sInstance (or preferably sSelf) since it's an object's static not a global static
@@ +65,5 @@
> + static bool sShutdown;
> + static nsClosingService *gInstance;
> + nsTArray<PRFileDesc *> mQueue;
> + PRLock *mLock;
> + PRCondVar *mQueueCondVar;
use mozilla::Monitor please; if there is some reason not to, please add a comment explaining why
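A sketch of the queue rebuilt around mozilla::Monitor (a single lock + condvar pair; names follow the patch, details illustrative):

    class ClosingService
    {
        mozilla::Monitor mMonitor;
        nsTArray<PRFileDesc *> mQueue;
        bool mShutdown;

    public:
        ClosingService()
            : mMonitor("ClosingService.mMonitor"), mShutdown(false) {}

        void PostRequest(PRFileDesc *aFd)
        {
            mozilla::MonitorAutoLock mon(mMonitor);
            mQueue.AppendElement(aFd);
            mon.Notify(); // wake the closing thread if it is waiting
        }

        void ThreadFunc()
        {
            mozilla::MonitorAutoLock mon(mMonitor);
            while (!mShutdown) {
                while (mQueue.IsEmpty() && !mShutdown) {
                    mon.Wait();
                }
                // Take descriptors off mQueue and PR_Close() them here,
                // dropping the monitor around the actual close call.
            }
        }
    };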
::: netwerk/base/nsIncrementalDownload.cpp
@@ +17,5 @@
> #include "nsIFile.h"
> #include "nsITimer.h"
> +#include "nsIStreamListener.h"
> +#include "nsIURI.h"
> +#include "nsIInputStream.h"
hmm?
::: netwerk/base/nsSimpleNestedURI.cpp
@@ +7,5 @@
>
> #include "nsSimpleNestedURI.h"
> #include "nsIObjectInputStream.h"
> #include "nsIObjectOutputStream.h"
> +#include "nsNetUtil.h"
hmm2?
::: netwerk/base/nsURLHelper.cpp
@@ +12,5 @@
> #include "nsCOMPtr.h"
> #include "nsCRT.h"
> #include "nsNetCID.h"
> #include "prnetdb.h"
> +#include "mozilla/Preferences.h"
hmm3?
Attachment #8641199 -
Flags: feedback+
Assignee
Updated•9 years ago
Attachment #8641199 -
Flags: feedback?(mcmanus)
Comment 40•9 years ago
We're at the end of the 40 cycle and this bug is now wontfix for 40.
As this bug is visible in 40 Betas and 41 Aurora, I'm tracking for 41+.
status-firefox41:
--- → affected
status-firefox42:
--- → affected
tracking-firefox41:
--- → +
tracking-firefox42:
--- → +
Comment hidden (obsolete)
Assignee
Comment 42•9 years ago
Attachment #8641199 -
Attachment is obsolete: true
Assignee
Comment 43•9 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=873fff73ca52
Honza, can you take a look please?
There is a socket leak that is expected to happen. A possibility would be to allow PR_Close to run for, let's say, 200-500 ms and then let the remaining sockets leak?
Flags: needinfo?(honzab.moz)
Reporter
Comment 44•9 years ago
I'm not sure what exactly you want from me, can you restate please?
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
Assignee
Comment 45•9 years ago
May I ask you for a review? Thanks.
Attachment #8650560 -
Attachment is obsolete: true
Flags: needinfo?(dd.mozilla) → needinfo?(honzab.moz)
Jason, would you be able to review this? I noticed Honza's bugzilla email says "not reviewing". This bug is tracking for FF41, so if there is a fix that is safe, we would want to uplift to Beta41 soon.
Flags: needinfo?(jduell.mcbugs)
Comment 47•9 years ago
Honza has been reviewing previous patches here so I'm going to let him keep doing that. Bounce it back to me if he hasn't gotten to it in a few days.
Flags: needinfo?(jduell.mcbugs)
Reporter
Comment 48•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #44)
> I'm not sure what exactly you want from me, can you restate please?
I was actually asking about:
"There is a leak on socket that is expected to happend. A possibilty is to allow PR_Close for let's say 200-500 ms and then let the rest sockets leak?"
what does that mean?
Flags: needinfo?(dd.mozilla)
Assignee
Comment 49•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #48)
> (In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #44)
> > I'm not sure what exactly you want from me, can you restate please?
>
> I was actually asking about:
>
> "There is a leak on socket that is expected to happend. A possibilty is to
> allow PR_Close for let's say 200-500 ms and then let the rest sockets leak?"
>
> what does that mean?
Sorry, I have fixed that - I added PR_Free(fd), which I had forgotten before, and I made some other fixes, so there is no leak anymore.
Flags: needinfo?(dd.mozilla)
Assignee
Comment 50•9 years ago
Reporter
Comment 51•9 years ago
Comment on attachment 8654487 [details] [diff] [review]
bug_1152046_prclose_v1.patch
Review of attachment 8654487 [details] [diff] [review]:
-----------------------------------------------------------------
nsClosingService -> mozilla::net::ClosingService
::: netwerk/base/nsClosingService.cpp
@@ +34,5 @@
> + if (!aFd) {
> + return PR_FAILURE;
> + }
> +
> + ClosingLayerSecret *closingService =
s/closingService/closingLayerSecret/
@@ +35,5 @@
> + return PR_FAILURE;
> + }
> +
> + ClosingLayerSecret *closingService =
> + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
you must do this on your layer! how can you be sure aFd is what you expect?
@@ +39,5 @@
> + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> + aFd->secret = nullptr;
> +
> + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> + NS_ASSERTION(layer &&
MOZ_ASSERT, maybe actually MOZ_RELEASE_ASSERT
@@ +42,5 @@
> + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> + NS_ASSERTION(layer &&
> + layer->identity == sTcpUdpPRCloseLayerId,
> + "Closing Layer not on top of stack");
> + layer->dtor(layer);
are you sure about this? doesn't PR_Close on the layer do that? I think this just delegates to the layer below.
@@ +46,5 @@
> + layer->dtor(layer);
> +
> + PRStatus status = PR_SUCCESS;
> +
> + if (aFd) {
will always be true
@@ +60,5 @@
> + }
> + } else {
> + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> + // we are not going to leak anything, but the PR_Close will not be called.
> + PR_Free(aFd);
shouldn't you use dtor() here? I'm no expert (it's been too long since the last time I played with this stuff) so rather double check. but that would make more sense to me. same in the shutdown case in ThreadFunc
@@ +69,5 @@
> + return status;
> +}
> +
> +mozilla::Atomic<bool> nsClosingService::sShutdown(false);
> +nsRefPtr<nsClosingService> nsClosingService::sInstance = nullptr;
these are both static initializers, something we don't want
you can have a raw sInstance ptr and NS_ADDREF in the ctor and NS_RELEASE in the dtor. Be careful with multiple threads tho..
@@ +158,5 @@
> + PR_Free(aFd);
> + return;
> + }
> +
> + bool notify = (mQueue.Length() == 0)? true : false;
)?
..missing space
@@ +161,5 @@
> +
> + bool notify = (mQueue.Length() == 0)? true : false;
> +
> + mQueue.AppendElement(aFd);
> + if (notify) {
if (mQueue.Length() == 1) ?
@@ +174,5 @@
> + MOZ_ASSERT(NS_IsMainThread());
> +
> + if (sInstance) {
> + sInstance->ShutdownInternal();
> + sInstance = nullptr;
so, other thread may race and take ref to sInstance after it has been shut down? Is that OK? Maybe it is, just checking..
@@ +200,5 @@
> +
> +void
> +nsClosingService::ThreadFunc()
> +{
> + while (1) {
for (;;) or while (true) are prefered
@@ +217,5 @@
> + // PR_NSPR_IO_LAYER we are not going to leak anything, but PR_Close
> + // will not be called.
> + PR_Free(fd);
> + }
> + mQueue.Clear();
ws
::: netwerk/base/nsClosingService.h
@@ +28,5 @@
> +
> +namespace mozilla {
> +namespace net {
> +
> +class nsClosingService final : public nsISupports
to make nsRefPtr work with this you just need to define AddRef and Release. There is a macro in nsISupportsImpl.h, NS_INLINE_DECL_THREADSAFE_REFCOUNTING, that declares those two inline for you. Don't derive from nsISupports when it's not needed (and it is not needed in this case).
@@ +58,5 @@
> + mozilla::Telemetry::ID aIDConnectivityChange,
> + mozilla::Telemetry::ID aIDLinkChange,
> + mozilla::Telemetry::ID aIDOffline);
> +
> + static nsRefPtr<nsClosingService> sInstance;
hm... this is a static initializer. not good..
::: netwerk/base/nsIOService.cpp
@@ +43,5 @@
> #include "nsIWidget.h"
> #include "nsThreadUtils.h"
> #include "mozilla/LoadInfo.h"
> #include "mozilla/net/NeckoCommon.h"
> +#include "mozilla/Services.h"
why are all the includes here?
::: netwerk/base/nsSocketTransportService2.cpp
@@ +557,5 @@
>
> +#if !defined(MOZILLA_XPCOMRT_API)
> + // Start closing service. The PRClose functions will be carried out on
> + // the closeService thread. Started here so that it is started just
> + // once.
Start the closing service. Actual PR_Close() will be carried out on a separate "closing" thread. Start the closing service here since this point is executed only once per session.
::: netwerk/base/nsURLHelper.cpp
@@ +11,5 @@
> #include "nsIURLParser.h"
> #include "nsCOMPtr.h"
> #include "nsCRT.h"
> #include "nsNetCID.h"
> +#include "mozilla/Preferences.h"
why this include?
Attachment #8654487 -
Flags: review-
Reporter
Updated•9 years ago
Flags: needinfo?(honzab.moz)
I am not sure the fix will come in time for 41. We have lived with this issue for one release (40) and we may have to live with it for another (41). Updating status-firefox41 to wontfix.
If the fix is ready in the next 3-4 days, and is safe to uplift to Beta, please nominate.
Comment hidden (obsolete)
Assignee
Comment 54•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #51)
> Comment on attachment 8654487 [details] [diff] [review]
> bug_1152046_prclose_v1.patch
>
> Review of attachment 8654487 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> @@ +35,5 @@
> > + return PR_FAILURE;
> > + }
> > +
> > + ClosingLayerSecret *closingService =
> > + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
>
> you must do this on your layer! how can you be sure aFd is what you expect?
If we are not on the right layer something is really wrong, but you are right, it should be done in the correct order.
> @@ +42,5 @@
> > + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> > + NS_ASSERTION(layer &&
> > + layer->identity == sTcpUdpPRCloseLayerId,
> > + "Closing Layer not on top of stack");
> > + layer->dtor(layer);
>
> are you sure about this? doesn't PR_Close on the layer do that? I think
> this just delegates to the layer below.
>
We do not want to call PR_Close. PR_Close calls the close function of this layer, which is the function that we are in :)
layer->dtor and PR_Free(layer) will end up doing the same thing because the layer is popped off and the stack connections are cut (lower and higher are set to null).
> @@ +46,5 @@
> > + layer->dtor(layer);
> > +
> > + PRStatus status = PR_SUCCESS;
> > +
> > + if (aFd) {
>
> will always be true
>
aFd does not have the same value as at the beginning of the function. Call to PR_PopIOLayer has set aFd to the next layer in the chain.
aFd should be non-null, but I have added this check just to be sure.
> @@ +60,5 @@
> > + }
> > + } else {
> > + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> > + // we are not going to leak anything, but the PR_Close will not be called.
> > + PR_Free(aFd);
>
> shouldn't you use dtor() here? I'm no expert (it's been too long since the
> last time I played with this stuff) so rather double check. but that would
> make more sense to me. same in the shutdown case in ThreadFunc
>
The only difference between PR_Free and dtor is that dtor cuts the connections in the layer chain (sets lower and higher to null), but in this case it does not matter which one we use.
>
> @@ +174,5 @@
> > + MOZ_ASSERT(NS_IsMainThread());
> > +
> > + if (sInstance) {
> > + sInstance->ShutdownInternal();
> > + sInstance = nullptr;
>
> so, other thread may race and take ref to sInstance after it has been shut
> down? Is that OK? Maybe it is, just checking..
>
it is ok because mShutdown is checked before putting a request in the queue.
> ::: netwerk/base/nsIOService.cpp
> @@ +43,5 @@
> > #include "nsIWidget.h"
> > #include "nsThreadUtils.h"
> > #include "mozilla/LoadInfo.h"
> > #include "mozilla/net/NeckoCommon.h"
> > +#include "mozilla/Services.h"
>
> why are all the includes here?
>
> ::: netwerk/base/nsURLHelper.cpp
> @@ +11,5 @@
> > #include "nsIURLParser.h"
> > #include "nsCOMPtr.h"
> > #include "nsCRT.h"
> > #include "nsNetCID.h"
> > +#include "mozilla/Preferences.h"
>
> why this include?
this fixes compile errors (caused by adding a new file to moz.build, which probably moved files from one unified compilation file to another).
Assignee
Comment 55•9 years ago
Attachment #8654487 -
Attachment is obsolete: true
Assignee
Updated•9 years ago
Flags: needinfo?(honzab.moz)
Reporter
Comment 56•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #54)
> (In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #51)
> > Comment on attachment 8654487 [details] [diff] [review]
> > bug_1152046_prclose_v1.patch
> >
> > Review of attachment 8654487 [details] [diff] [review]:
> > -----------------------------------------------------------------
> >
> > @@ +35,5 @@
> > > + return PR_FAILURE;
> > > + }
> > > +
> > > + ClosingLayerSecret *closingService =
> > > + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> >
> > you must do this on your layer! how can you be sure aFd is what you expect?
>
>
> If we are not on the right layer something is really wrong, but you are
> right, it should be done in the correct order.
If the code ensures you are on the right level then OK. But some kind of assertion, at least, would be appropriate.
>
>
> >
> > @@ +42,5 @@
> > > + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> > > + NS_ASSERTION(layer &&
> > > + layer->identity == sTcpUdpPRCloseLayerId,
> > > + "Closing Layer not on top of stack");
> > > + layer->dtor(layer);
> >
> > are you sure about this? doesn't PR_Close on the layer do that? I think
> > this just delegates to the layer below.
> >
>
>
> We do not want to call PR_Close. PR_Close calls the close function of this
> layer, which is the function that we are in :)
> layer->dtor and PR_Free(layer) will end up doing the same thing because the
> layer is popped off and the stack connections are cut (lower and higher are
> set to null)
Aha :)
>
>
> > @@ +46,5 @@
> > > + layer->dtor(layer);
> > > +
> > > + PRStatus status = PR_SUCCESS;
> > > +
> > > + if (aFd) {
> >
> > will always be true
> >
>
> aFd does not have the same value as at the beginning of the function. Call
> to PR_PopIOLayer has set aFd to the next layer in the chain.
> aFd should be non-null, but I have added this check just to be sure.
OK, leave it as is.
>
>
> > @@ +60,5 @@
> > > + }
> > > + } else {
> > > + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> > > + // we are not going to leak anything, but the PR_Close will not be called.
> > > + PR_Free(aFd);
> >
> > shouldn't you use dtor() here? I'm no expert (it's been too long since the
> > last time I played with this stuff) so rather double check. but that would
> > make more sense to me. same in the shutdown case in ThreadFunc
> >
>
> The only difference between PR_Free and dtor is that dtor cuts the
> connections in the layer chain (sets lower and higher to null), but in this
> case it does not matter which one we use.
If that is the only difference then I would prefer calling dtor here. It may also release any secret structs that PR_Free has no chance to ever release.
I'll do the review of the new version now.
Reporter
Comment 57•9 years ago
Comment on attachment 8657801 [details] [diff] [review]
bug_1152046_prclose_v2.patch
Review of attachment 8657801 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/ClosingService.cpp
@@ +6,5 @@
> +
> +#include "ClosingService.h"
> +#include "nsIOService.h"
> +
> +using namespace mozilla::net;
instead of 'using', put all of this in the mozilla::net namespace.
@@ +12,5 @@
> +static PRIOMethods sTcpUdpPRCloseLayerMethods;
> +static PRIOMethods *sTcpUdpPRCloseLayerMethodsPtr = nullptr;
> +static PRDescIdentity sTcpUdpPRCloseLayerId;
> +
> +class ClosingLayerSecret
this must be in an anonymous namespace
@@ +38,5 @@
> +
> + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> + MOZ_RELEASE_ASSERT(layer &&
> + layer->identity == sTcpUdpPRCloseLayerId,
> + "Closing Layer not on top of stack");
aha, it's here! good.
@@ +51,5 @@
> + // let it leak.
> + if (gIOService->IsShutdown()) {
> + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> + // we are not going to leak anything, but the PR_Close will not be called.
> + PR_Free(aFd);
I'd prefer using dtor here, but if there is a problem with that, leave PR_Free
@@ +57,5 @@
> + closingLayerSecret->mClosingService->PostRequest(aFd);
> + } else {
> + // Socket is created before closing service has been started or there was
> + // a problem with starting it.
> + status = aFd->methods->close(aFd);
PR_Close(aFd)?
@@ +73,5 @@
> +ClosingService::ClosingService()
> + : mShutdown(false)
> + , mMonitor("ClosingService.mMonitor")
> +{
> + NS_ASSERTION(sInstance==nullptr,
MOZ_ASSERT(!sInstance) - preferred - or sInstance == nullptr
@@ +87,5 @@
> + PR_ASSERT(PR_INVALID_IO_LAYER != sTcpUdpPRCloseLayerId);
> +
> + sTcpUdpPRCloseLayerMethods = *PR_GetDefaultIOMethods();
> + sTcpUdpPRCloseLayerMethods.close = TcpUdpPRCloseLayerClose;
> + sTcpUdpPRCloseLayerMethodsPtr = &sTcpUdpPRCloseLayerMethods;
nit: why is sTcpUdpPRCloseLayerMethodsPtr needed?
@@ +91,5 @@
> + sTcpUdpPRCloseLayerMethodsPtr = &sTcpUdpPRCloseLayerMethods;
> + }
> +
> + if (!sInstance) {
> + ClosingService* service = new ClosingService();
it was better in the previous version (to use nsRefPtr here)
@@ +128,5 @@
> + layer = PR_CreateIOLayerStub(sTcpUdpPRCloseLayerId,
> + sTcpUdpPRCloseLayerMethodsPtr);
> +
> + if (!layer) {
> + return NS_ERROR_FAILURE;
this may IMO not be fatal.
@@ +139,5 @@
> +
> + if (status == PR_FAILURE) {
> + delete secret;
> + PR_DELETE(layer);
> + return NS_ERROR_FAILURE;
I would not make this fatal either... but it depends. Your call.
@@ +156,5 @@
> + // time. To prevent shutdown crash (bug 1152046) do not accept sockets any
> + // more.
> + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> + // we are not going to leak anything, but PR_Close will not be called.
> + PR_Free(aFd);
dtor here as well?
@@ +176,5 @@
> + if (sInstance) {
> + sInstance->ShutdownInternal();
> + ClosingService* service = sInstance;
> + sInstance = nullptr;
> + NS_RELEASE(service);
I think you can just NS_RELEASE(sInstance). It will nullify sInstance for you.
::: netwerk/base/nsSocketTransportService2.cpp
@@ +555,5 @@
> obsSvc->AddObserver(this, "last-pb-context-exited", false);
> }
>
> +#if !defined(MOZILLA_XPCOMRT_API)
> + // Start the closing service. Actual PRClose() will be carried out on
PR_Close()
Attachment #8657801 -
Flags: review+
Reporter
Updated•9 years ago
Flags: needinfo?(honzab.moz)
Assignee
Comment 58•9 years ago
Assignee
Comment 59•9 years ago
Assignee
Comment 60•9 years ago
->dtor does not work; it must be PR_Free.
Attachment #8657801 -
Attachment is obsolete: true
Attachment #8659673 -
Flags: review+
Assignee
Updated•9 years ago
Keywords: checkin-needed
Comment 61•9 years ago
Hi, the first patch didn't apply cleanly:
renamed 1152046 -> bug_1152046_v1.patch
applying bug_1152046_v1.patch
patching file netwerk/base/nsIOService.h
Hunk #1 FAILED at 72
1 out of 1 hunks FAILED -- saving rejects to file netwerk/base/nsIOService.h.rej
patching file netwerk/base/nsSocketTransport2.cpp
Hunk #1 FAILED at 1211
1 out of 1 hunks FAILED -- saving rejects to file netwerk/base/nsSocketTransport2.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory
Is this checkin-needed request for all patches?
Flags: needinfo?(dd.mozilla)
Keywords: checkin-needed
Assignee
Comment 62•9 years ago
Comment on attachment 8659673 [details] [diff] [review]
bug_1152046_prclose_v2.patch
Only for patch bug_1152046_prclose_v2.patch.
Thanks.
Flags: needinfo?(dd.mozilla)
Attachment #8659673 -
Flags: checkin+
Assignee
Updated•9 years ago
Keywords: checkin-needed
Comment 63•9 years ago
Comment 65•9 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/d5d0b5994149bc22ee38b1ae14acb69853a9646d
bug 1152046 - Revert Make separate thread only for PRClose. r=backout
Comment 66•9 years ago
Backed out comment 63 due to crashes in bug 1205016.
Updated•9 years ago
Attachment #8659673 -
Flags: checkin+
Comment 67•9 years ago
Comment 68•9 years ago
Something that landed on top of the original landing isn't playing well with the backout, so I've had to back out the backout. We were seeing b2g debug shutdown hangs.
Comment 69•9 years ago
This bug is causing a performance regression on b2g. We need a forward fix in bug 1204837 with this patch on m-c.
Since this bug is specific to Windows, will the fix be limited to Windows in the next revision? Or will we still have this on all platforms even though neither Linux nor Mac OS has this problem?
Flags: needinfo?(dd.mozilla)
Assignee
Comment 70•9 years ago
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #69)
> This bug is causing a performance regression on b2g. We need a forward fix
> in bug 1204837 with this patch on m-c.
>
> Since this bug is specific to Windows, will the fix be limited to Windows in
> the next revision? Or will we still have this on all platforms even though
> neither Linux nor Mac OS has this problem?
Sometimes PR_Close blocks for a long time; it is rare, but it happens. The blocking times on Windows are longer than on the other platforms (Linux, OS X), but long blocking times happen on all of them, so we will keep this change for all platforms.
Sorry, I did not know about bug 1204837.
Flags: needinfo?(dd.mozilla)
Assignee
Comment 71•9 years ago
I moved the ClosingService starting/stopping into nsIOService to be sure that it happens only once.
I think this is not going to solve the crash, but I want to change this.
Honza, may I ask you for a review, please?
Flags: needinfo?(honzab.moz)
So this landed and broke Linux stuff.
It was reverted, which broke b2g debug stuff.
That revert was then reverted (so we are back to this being landed), so b2g debug stuff is working, but Linux stuff is broken again.
So what's the plan here? We can't leave either Linux stuff or b2g debug stuff broken like this.
Flags: needinfo?(mcmanus)
Flags: needinfo?(dd.mozilla)
Comment 73•9 years ago
Dragana is actively working on it. No idea why the b2g stuff came to depend on it within 24 hours.
Flags: needinfo?(mcmanus)
Comment 74•9 years ago
Note: I backed out bug 1152046 (again) locally, since I was insta-crashing with it in. That lets me run, but testing https://mozilla.github.com/webrtc-landing/pc_test.html crashes in the same place (jemalloc/internal/bitmap.h) when in e10s - but not in a non-e10s profile. So it's very touchy and machine/perf/allocation/? dependent.
This also implies to me the actual bug is elsewhere (jemalloc?)
Assignee
Comment 75•9 years ago
I am working on the Linux problem.
I do not know why b2g is broken. I checked: the files that this patch changes have not been modified since this patch landed.
This patch does not call PR_Close after we enter shutdown; it just cleans up but does not close the socket. The b2g failure is a shutdown hang; we should figure out in any case why it hangs when this patch gets backed out.
Nigel, do you know what has changed in b2g, probably something about sockets?
Flags: needinfo?(dd.mozilla) → needinfo?(nigelbabu)
Comment 76•9 years ago
Comment 77•9 years ago
I have no idea. I'm probably not the best person to answer this question.
Flags: needinfo?(nigelbabu)
Assignee
Comment 78•9 years ago
Attachment #8664191 -
Flags: review?(mcmanus)
Assignee
Comment 79•9 years ago
Updated•9 years ago
Attachment #8664191 -
Flags: review?(mcmanus) → review+
Assignee
Updated•9 years ago
Attachment #8664191 -
Flags: checkin+
Assignee
Updated•9 years ago
Keywords: checkin-needed
Assignee
Comment 80•9 years ago
Checkin needed only for patch bug_1152046_disable_on_ffos.patch.
Comment 81•9 years ago
Keywords: checkin-needed
Comment 82•9 years ago
Backed out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=14559318&repo=mozilla-inbound
Flags: needinfo?(dd.mozilla)
Comment 83•9 years ago
Assignee
Comment 84•9 years ago
Fixed the bustage. Sorry.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a34a9dbaf39
Attachment #8664191 -
Attachment is obsolete: true
Flags: needinfo?(dd.mozilla)
Attachment #8664957 -
Flags: review+
Attachment #8664957 -
Flags: checkin+
Assignee
Comment 85•9 years ago
Checkin needed only for patch bug_1152046_disable_on_ffos.patch.
Keywords: checkin-needed
Comment 86•9 years ago
Keywords: checkin-needed
Comment 87•9 years ago
sorry had to back this out again for another bustage, this time b2g https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-inbound
Flags: needinfo?(dd.mozilla)
Comment 88•9 years ago
Comment 89•9 years ago
(In reply to Carsten Book [:Tomcat] from comment #87)
> sorry had to back this out again for another bustage, this time b2g
> https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> inbound
This looks like a shutdown deadlock on the CompositorThread as in https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
Assignee
Comment 90•9 years ago
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> (In reply to Carsten Book [:Tomcat] from comment #87)
> > sorry had to back this out again for another bustage, this time b2g
> > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > inbound
>
> This looks like a shutdown deadlock on the CompositorThread as in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
Patch bug_1152046_disable_on_ffos.patch only disables use of the ClosingService thread on b2g (until bug 1204837).
I think there is another bug, a deadlock, and this enabling/disabling of the ClosingService only exposes it.
Currently all calls to PR_Close are dispatched to a ClosingService thread, and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so PR_Close is called on the socket thread (as it used to be before bug_1152046_prclose_v2.patch landed, around 15 September). It is probably a timing issue that makes this patch expose another bug. I do not see any other connection between the CompositorThread and this bug.
What do you think?
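For reference, the disabling presumably amounts to a build-time guard around starting the service - a sketch, assuming b2g builds are identified by MOZ_WIDGET_GONK (the actual patch is not quoted in this bug):

    // Start the closing service everywhere except b2g (MOZ_WIDGET_GONK),
    // where it stays disabled until bug 1204837 is resolved.
    #if !defined(MOZ_WIDGET_GONK)
        ClosingService::Start();
    #endif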
Flags: needinfo?(dd.mozilla) → needinfo?(cyu)
Comment 91•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #90)
>
> Patch bug_1152046_disable_on_ffos.patch only disables use of the
> ClosingService thread on b2g (until bug 1204837).
>
> I think there is another bug, a deadlock, and this enabling/disabling of the
> ClosingService only exposes it.
> Currently all calls to PR_Close are dispatched to a ClosingService thread,
> and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so
> PR_Close is called on the socket thread (as it used to be before
> bug_1152046_prclose_v2.patch landed, around 15 September). It is probably a
> timing issue that makes this patch expose another bug. I do not see any
> other connection between the CompositorThread and this bug.
> What do you think?
Exactly. Some patch introduced this regression after the ClosingService landed, and it was hidden; now it's revealed. I don't think the ClosingService has anything to do with the deadlock on mochitest shutdown.
Flags: needinfo?(cyu)
Reporter
Comment 92•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #90)
> (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> > (In reply to Carsten Book [:Tomcat] from comment #87)
> > > sorry had to back this out again for another bustage, this time b2g
> > > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > > inbound
> >
> > This looks like a shutdown deadlock on the CompositorThread as in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
>
>
> Patch bug_1152046_disable_on_ffos.patch only disables use of the
> ClosingService thread on b2g (until bug 1204837).
>
> I think there is another bug, a deadlock, and this enabling/disabling of the
> ClosingService only exposes it.
> Currently all calls to PR_Close are dispatched to a ClosingService thread,
> and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so
> PR_Close is called on the socket thread (as it used to be before
> bug_1152046_prclose_v2.patch landed, around 15 September). It is probably a
> timing issue that makes this patch expose another bug. I do not see any
> other connection between the CompositorThread and this bug.
> What do you think?
If you are looking for a quick fix, just go with disabling the code specifically on the platforms where it causes issues. We can investigate later what's actually wrong.
Personally I think your service shutdown code and some of my 'concurrent access' concerns might be worth checking on.
Flags: needinfo?(honzab.moz)
Assignee
Comment 93•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #92)
> (In reply to Dragana Damjanovic [:dragana] from comment #90)
> > (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> > > (In reply to Carsten Book [:Tomcat] from comment #87)
> > > > sorry had to back this out again for another bustage, this time b2g
> > > > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > > > inbound
> > >
> > > This looks like a shutdown deadlock on the CompositorThread as in
> > > https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
> >
> >
> > Patch bug_1152046_disable_on_ffos.patch only disables use of the
> > ClosingService thread on b2g (until bug 1204837).
> >
> > I think there is another bug, a deadlock, and this enabling/disabling of
> > the ClosingService only exposes it.
> > Currently all calls to PR_Close are dispatched to a ClosingService thread,
> > and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so
> > PR_Close is called on the socket thread (as it used to be before
> > bug_1152046_prclose_v2.patch landed, around 15 September). It is probably
> > a timing issue that makes this patch expose another bug. I do not see any
> > other connection between the CompositorThread and this bug.
> > What do you think?
>
> If you are looking for a quick fix, just go with disabling the code
> specifically on the platforms where it causes issues. We can investigate
> later what's actually wrong.
>
> Personally I think your service shutdown code and some of my 'concurrent
> access' concerns might be worth checking on.
I need-info'd you for a review of patch bug_1152046_move_CLosingService_start_toioservice.patch,
and the issue you responded to has grown complications: see bugs 1207979 and 1208019.
Flags: needinfo?(honzab.moz)
Comment 94•9 years ago
Please note bug 1204837 lands a fix for ClosingService on b2g so we don't need to disable it on b2g anymore.
Reporter
Comment 95•9 years ago
Dragana, it wasn't clear what you needed.
I'll review the patch.
I don't follow what you need from me on "the issues".
- according to comment 94 it seems we no longer need bug 1208019, right?
- the same may apply to bug 1207979, or not?
Please state more clearly how I can help here.
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
Assignee
Comment 96•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #95)
> Dragana, it wasn't clear what you needed.
>
> I'll review the patch.
>
> I don't follow what you need from me on "the issues".
>
> - according to comment 94 it seems we no longer need bug 1208019, right?
> - the same may apply to bug 1207979, or not?
>
> Please state more clearly how I can help here.
Yes, we do not need them, but there is another bug that I do not want to get missed, because it is going to show up at some point.
Explanation:
Patch bug_1152046_prclose_v2.patch (the implementation of ClosingService) was pushed on 14 September. A couple of hours later the same day they tried to back it out, but the backout did not work because of:
https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14
The thing is that a patch from bug 1127270 was pushed on the same day. If you disable ClosingService or back it out, that patch causes problems; if ClosingService is used, everything is fine. This bug and bug 1127270 do not have anything in common, so I believe there is a bug in the patch for bug 1127270 that only shows up when ClosingService is not used.
So we do not need bug 1208019, but I want someone to look at bug 1127270 (ImageBridgeParent) and find an explanation for why it fails when ClosingService is not used.
Flags: needinfo?(dd.mozilla)
Reporter | ||
Comment 97•9 years ago
|
||
Comment on attachment 8661965 [details] [diff] [review]
bug_1152046_move_CLosingService_start_toioservice.patch
Review of attachment 8661965 [details] [diff] [review]:
-----------------------------------------------------------------
Hmm.. are you sure that nsSocketTransportService::Init is called just once?
If there are continuing issues around this code I can take a deeper look next week, when I'm back at the office.
::: netwerk/base/ClosingService.cpp
@@ +184,5 @@
> {
> {
> mozilla::MonitorAutoLock mon(mMonitor);
> + if (mShutdown) {
> + // This should not happened.
"This should never happen."
OTOH, I think it may and having a check like this is good.
Worth adding a similar "only-once" check to the start method?
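For concreteness, a minimal sketch of what such an only-once check could look like, mirroring the shutdown guard above (mStarted is a hypothetical flag and the method body is assumed, not the actual patch):

void
ClosingService::Start()
{
  mozilla::MonitorAutoLock mon(mMonitor);
  if (mStarted || mShutdown) {
    // Only-once: a second Start(), or a Start() after Shutdown(),
    // is silently ignored.
    return;
  }
  mStarted = true;
  // ... spawn the closing thread here ...
}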
::: netwerk/base/nsIOService.cpp
@@ +1064,5 @@
> NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");
> }
> + if (mShutdown) {
> + ClosingService::Shutdown();
> + }
maybe rather put this into the NS_XPCOM_SHUTDOWN_OBSERVER_ID topic handling code directly?
if there is some reason not to (and to keep it here instead), please add a comment.
Attachment #8661965 -
Flags: review+
Assignee | ||
Comment 98•9 years ago
|
||
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #97)
> Comment on attachment 8661965 [details] [diff] [review]
> bug_1152046_move_CLosingService_start_toioservice.patch
>
> Review of attachment 8661965 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Hmm.. are you sure that nsSocketTransportService::Init is called just
> once?
I noticed that nsSocketTransportService::Shutdown can be called more than once; it is called for a profile change as well, and that is a problem :). (Init is called just once.)
>
> ::: netwerk/base/nsIOService.cpp
> @@ +1064,5 @@
> > NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");
> > }
> > + if (mShutdown) {
> > + ClosingService::Shutdown();
> > + }
>
> maybe rather put this into the NS_XPCOM_SHUTDOWN_OBSERVER_ID topic handling
> code directly?
>
I have put it there, after the SocketTransportService shutdown, to let more PR_Close calls get through, because most of the time they do not cause a problem.
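To make the intended ordering explicit, here is the fragment from the diff above with annotations added (the comments are mine, not part of the patch):

// 1. Shut down the socket transport service first; the PR_Close calls
//    it issues for the remaining sockets can still be served by the
//    ClosingService thread.
rv = mSocketTransportService->Shutdown();
NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");

// 2. Only then stop the ClosingService, and only on real XPCOM
//    shutdown (mShutdown), not on a profile change.
if (mShutdown) {
  ClosingService::Shutdown();
}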
Assignee | ||
Comment 99•9 years ago
|
||
Attachment #8661965 -
Attachment is obsolete: true
Attachment #8669470 -
Flags: review+
Assignee | ||
Comment 100•9 years ago
|
||
check in only for
bug_1152046_move_CLosingService_start_toioservice.patch
Keywords: checkin-needed
Assignee | ||
Updated•9 years ago
|
Attachment #8664957 -
Flags: checkin+
Comment 101•9 years ago
|
||
Keywords: checkin-needed
Comment 103•9 years ago
|
||
Dragana, do you have plans wrt this bug for 42? Thanks
Assignee | ||
Comment 104•9 years ago
|
||
(In reply to Sylvestre Ledru [:sylvestre] from comment #103)
> Dragana, do you have plans wrt to this bug for 42? Thanks
I was not planning to uplift the last two patches, although there has been no problem with them (since 17 September or so).
Flags: needinfo?(dd.mozilla)
Comment 105•9 years ago
|
||
OK, I will mark 42 & 43 as wontfix then. Thanks
Assignee | ||
Comment 106•9 years ago
|
||
I was looking at some crash logs again:
I have not seen any PR_Close hangs.
I have seen PR_Connect hangs even though we check gIOService->IsShutdown() before trying to connect; we hang there for a long time.
There is also an nsAutodial hang (we also do not try to reconnect if we are in shutdown).
And there are some poll hangs.
Things that can have an influence on these hangs:
avcuf32.dll
rlls.dll
winhadnt.dll
(but not all of the crashes have one of these)
Windows sockets are made nonblocking using rv = _md_MakeNonblock((HANDLE)osfd); just before calling connect():
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
I could check whether we set the nonblocking flag correctly, but it is hard to verify that it is really set on the OS socket.
I have also read some of the comments people left: there were 2-3 saying that Firefox is really slow. I assume Firefox was slow, they tried to shut it down, and this crash happened, so there was probably already a problem before they clicked shutdown.
Assignee | ||
Updated•9 years ago
|
Comment 107•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #106)
> Windows sockets are made nonblocking using rv = _md_MakeNonblock((HANDLE)osfd);
> just before calling connect():
> http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
> I could check whether we set the nonblocking flag correctly, but it is hard
> to verify that it is really set on the OS socket.
Heads up: ntio.c isn't used by our codebase, as NSPR is configured with the WIN95 configuration. See https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/NSPR_build_instructions#--enable-win32-target.3DWIN95
It looks like the WIN95 configuration does not support this at all.
Reporter | ||
Comment 108•9 years ago
|
||
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #107)
> (In reply to Dragana Damjanovic [:dragana] from comment #106)
> > Windows sockets are made nonblocking using rv =
> > _md_MakeNonblock((HANDLE)osfd); just before calling connect():
> > http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
> > I could check whether we set the nonblocking flag correctly, but it is
> > hard to verify that it is really set on the OS socket.
>
> Heads up: ntio.c isn't used by our codebase, as NSPR is configured with the
> WIN95 configuration. See
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/NSPR_build_instructions#--enable-win32-target.3DWIN95
>
> It looks like the WIN95 configuration does not support this at all.
We do set PR_SockOpt_Nonblocking = true on the socket in nsSocketTransport::InitiateSocket(), but looking into how that option is handled, I cannot find that we actually set anything directly on the system socket. The only thing we do is set the secret->nonblocking = true flag on our FD in prmapopt.c!_PR_SocketSetSocketOption. This flag, however, doesn't lead to any other setting being made on the socket (w95sock.c).
Dragana, we could have a private patch for NSPR that copies that particular piece of code from ntio.c to w95sock.c, to try it out. Submitting it directly to NSPR should not be that hard if it is found to work.
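For reference, the Winsock call that flips a socket into nonblocking mode is ioctlsocket() with FIONBIO; a minimal standalone sketch of what such a helper could look like (the function name is mine, not NSPR's):

#include <winsock2.h>

// Returns true if the OS-level socket was switched to nonblocking mode.
static bool MakeNonblocking(SOCKET osfd)
{
  u_long one = 1;  // nonzero enables nonblocking mode
  return ioctlsocket(osfd, FIONBIO, &one) == 0;
}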
Assignee | ||
Comment 109•9 years ago
|
||
This is a short update.
w95sock.c always makes sockets nonblocking:
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95sock.c#63
Blocking mode is emulated inside NSPR: for each read/write on an FD marked blocking, it loops until the operation completes.
I will dig into this further.
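A simplified sketch of that emulation pattern (my reconstruction, not the actual NSPR code): the OS socket is always nonblocking, and for an FD marked blocking the read is retried until it completes:

#include <winsock2.h>

// Sketch only: emulate a blocking recv() on a nonblocking OS socket.
static int EmulatedRecv(SOCKET osfd, char* buf, int amount, bool fdIsBlocking)
{
  for (;;) {
    int rv = recv(osfd, buf, amount, 0);
    if (rv >= 0) {
      return rv;  // got data (or EOF) -- done
    }
    if (!fdIsBlocking || WSAGetLastError() != WSAEWOULDBLOCK) {
      return -1;  // real error, or a genuinely nonblocking FD
    }
    // Blocking FD on a nonblocking OS socket: wait until readable,
    // then retry (NSPR uses a select()-style wait like this one).
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(osfd, &readable);
    select(0, &readable, NULL, NULL, NULL);  // first arg ignored on Windows
  }
}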
Reporter | ||
Comment 110•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #109)
> This is a short update.
>
> w95sock.c always makes sockets nonblocking:
> http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95sock.c#63
I knew it was somewhere! Thanks Dragana.
Assignee | ||
Comment 111•9 years ago
|
||
I noticed a couple of crashes in PR_ConnectContinue:
https://crash-stats.mozilla.com/report/index/7ed839dd-92fe-4b40-8586-d1abb2160114#allthreads
https://crash-stats.mozilla.com/report/index/32b4adec-87cc-4560-8373-e063c2160114#allthreads
https://crash-stats.mozilla.com/report/index/59eb432d-e0bb-4a8c-98fb-60eab2160114#allthreads
We can skip PR_ConnectContinue if we are in shutdown.
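A hedged sketch of that guard (the placement and the exact error code are assumptions; gIOService->IsShutdown() is the check we already use before PR_Connect):

// In the connect-continue path of the socket transport (sketch, not
// the actual patch): bail out before touching the OS socket once the
// IO service is shutting down.
if (gIOService->IsShutdown()) {
  // Fail fast instead of entering PR_ConnectContinue(), whose
  // getsockopt() call is where the reported hangs sit.
  return NS_ERROR_ABORT;
}
PRStatus status = PR_ConnectContinue(fd, outFlags);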
Comment 112•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #111)
> I noticed a couple of crashes in PR_ConnectContinue:
In these three cases it seems to have stopped here: http://hg.mozilla.org/releases/mozilla-beta/annotate/7ccdff6c2b81/nsprpub/pr/src/io/prsocket.c#l294 .. which is a getsockopt() call on the actual underlying socket descriptor.
Assignee | ||
Comment 113•9 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #112)
> (In reply to Dragana Damjanovic [:dragana] from comment #111)
> > I noticed a couple of crashes in PR_ConnectContinue:
>
> In these three cases it seems to have stopped here:
> http://hg.mozilla.org/releases/mozilla-beta/annotate/7ccdff6c2b81/nsprpub/pr/src/io/prsocket.c#l294
> .. which is a getsockopt() call on the actual underlying socket descriptor.
I know, but when we are in shutdown we can just make it fail here.
Assignee | ||
Comment 114•9 years ago
|
||
I noticed that we are doing some reads on the socket when we are closing an nsHttpConnection:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnection.cpp#599
If we need to reduce the amount of work we do during shutdown, we can skip this. I have not seen any crash on it, but I am writing it down here to keep it in mind in case we need it.
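If we ever want that, a hedged sketch of the skip (the placement inside the close path and the shutdown check are assumptions; mSocketIn is the connection's input stream):

// Sketch only: skip the final best-effort drain of the socket when
// the browser is shutting down.
if (mSocketIn && !gIOService->IsShutdown()) {
  char buf[1024];
  uint32_t countRead;
  // Drain whatever is buffered so the close is clean; safe to skip
  // during shutdown, when we no longer care about the data.
  mSocketIn->Read(buf, sizeof(buf), &countRead);
}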
Updated•9 years ago
|
Whiteboard: [necko-active]
Assignee | ||
Comment 115•9 years ago
|
||
I am going to close this bug. Bug 1158189 is used as the meta bug. We are going to open separate bugs for similar issues for simplicity, since this one has become flooded.
Updated•7 years ago
|
Target Milestone: --- → mozilla44