Closed
Bug 1152046
Opened 10 years ago
Closed 9 years ago
[Win] close() and connect() on a socket may block for a very long time.
Categories
(Core :: Networking, defect)
Tracking
People
(Reporter: mayhemer, Assigned: dragana)
References
Details
(Whiteboard: [necko-active])
Attachments
(5 files, 11 obsolete files)
(deleted), patch - dragana: review+ (×5; Details/Diff/Splinter Review links omitted)
This is a nasty, hard-to-reproduce bug.
There is one known and described case with UDP sockets (worked around in bug 1124880): http://blogs.msdn.com/b/winsdk/archive/2013/10/09/udp-closesocket-takes-upto-5-seconds-to-return-in-disconnect-remote-host-down-scenario-due-to-pending-data-to-send.aspx
However, apparently this can happen for any socket, see some of the reports under https://crash-stats.mozilla.com/search/?product=Firefox&version=39.0a2&version=40.0a1&process_type=browser&process_type=content&signature=~nsHttpConnectionMgr%3A%3AShutdown&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Google turns up nothing on this, so it could be specific to how we set up and handle our sockets.
Comment 1•10 years ago
Honza - what do you think of the theory that this is caused by a locally installed LSP? Perhaps the LSP is blocking on a non-blocking socket.
Here are three reports that look like that is what is happening:
https://crash-stats.mozilla.com/report/index/cd6dc70f-1afc-4596-94b6-8d26c2150404#allthreads
https://crash-stats.mozilla.com/report/index/77bf1520-1759-437b-a209-885922150401#allthreads
https://crash-stats.mozilla.com/report/index/9adc4168-b377-4a7c-801d-546d32150402#allthreads
In each case the Winsock LSP list is considerably more complicated than the Winsock LSP list you find in a crash caused by something in Gecko (e.g. the dispatch deadlock). I took all the samples from XP SP3 just because I thought that might make the LSP lists more comparable - but it happens on multiple versions of Windows.
Reporter
Comment 2•10 years ago
If you mean the stack:
KiFastSystemCallRet
NtDeviceIoControlFile
WSPBind
WSPAddressToString
WSPConnect
connect
_PR_MD_CONNECT
SocketConnect
then it looks to me like a pure native call (no LSP involved). Being in bind() suggests it is trying to find a local interface to bind to. It may be the same cause as in the UDP close() case: waiting for an ARP or (more likely) a DHCP response. Those are guesses, though, not supported by any research so far.
It could also be OEM-specific code misbehaving here... something very hard to find out :(
There is, though, 'rsvpsp.dll' in the LSP list. I don't have that file on my Win7 machine.
Updated•10 years ago
Assignee: nobody → dd.mozilla
Assignee
Comment 3•10 years ago
rsvpsp.dll is also from Windows (used for QoS/priority features). (Related to comment #2.)
Assignee
Comment 4•10 years ago
In most of these crashes the socket thread is inside PR_Connect.
This patch prevents connects during shutdown.
Let's see if it helps.
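For reference, a minimal sketch of the idea (the real hunk is in the patch reviewed in comment 7 below; the surrounding function context here is assumed):

    // In the socket transport's connect path (sketch):
    if (gIOService->IsShuttingDown()) {
        // Don't start a new connect during shutdown - PR_Connect() may
        // block for a very long time on some systems.
        return NS_ERROR_ABORT;
    }
    // ...otherwise fall through to PR_Connect() as before...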
Attachment #8610603 -
Flags: review?(honzab.moz)
Reporter
Comment 5•10 years ago
What's this 'swp' thing?
Assignee
Comment 6•10 years ago
Sorry.
Attachment #8610603 -
Attachment is obsolete: true
Attachment #8610603 -
Flags: review?(honzab.moz)
Attachment #8611372 -
Flags: review?(honzab.moz)
Reporter
Comment 7•10 years ago
Comment on attachment 8611372 [details] [diff] [review]
bug_1152046_v1.patch
Review of attachment 8611372 [details] [diff] [review]:
-----------------------------------------------------------------
This is good to have, but I don't think it's going to help much with the root cause. The delays/blocks in connect()/close() are IMO happening not just after shutdown. The thing is that we try to shut down while the socket thread has already been hanging inside connect()/close() for a while. Maybe that is even a reason people restart the browser - since pages are not loading, the socket thread is deadlocked.
There are two reasons this could happen IMO: A/V software interference or LSP malware.
Some code that detects the time spent in connect()/close() would be good to have. At least collect telemetry: the time spent in the call, how often it happens to spend more than some 2 seconds, whether it happens after a network change, whether it happens after shutdown only - whatever would be useful.
If long delays happen all the time (based on telemetry), we could do a test at startup or after any network change we detect and try something like manual local binding, or even let connect()/close() happen on new threads (similar to what I did for UDP close), and maybe warn the user that "your OS performs slowly when doing network access" or something...
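A sketch of what such instrumentation could look like, wrapped around the existing PR_Connect() call (the probe and helper names follow the telemetry patches that land later in this bug; the exact call site is an assumption):

    PRIntervalTime connectStarted = 0;
    if (gSocketTransportService->IsTelemetryEnabled()) {
        connectStarted = PR_IntervalNow();
    }

    PRStatus status = PR_Connect(fd, &prAddr, NS_SOCKET_CONNECT_TIMEOUT);

    if (connectStarted) {
        // Record how long the supposedly non-blocking call blocked us.
        Telemetry::Accumulate(Telemetry::PRCONNECT_BLOCKING_TIME_NORMAL,
                              PR_IntervalToMilliseconds(PR_IntervalNow() -
                                                        connectStarted));
    }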
::: netwerk/base/nsIOService.h
@@ +77,5 @@
> uint32_t flags,
> nsAsyncRedirectVerifyHelper *helper);
>
> bool IsOffline() { return mOffline; }
> + bool IsShuttingDown() { return mShutdown; }
Maybe call this IsShutdown()? That would conform to the name of the member.
::: netwerk/base/nsSocketTransport2.cpp
@@ +1217,5 @@
> bool isLocal;
> IsLocal(&isLocal);
>
> + if (gIOService->IsShuttingDown()) {
> + return NS_ERROR_ABORT;
nit: the file uses 4-space indentation
Attachment #8611372 -
Flags: review?(honzab.moz) → review+
Assignee
Comment 8•10 years ago
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Comment on attachment 8611372 [details] [diff] [review]
> bug_1152046_v1.patch
>
> Review of attachment 8611372 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> This is good to have, but I don't think it's going to help much with the
> root cause. The delays/blocks in connect()/close() are IMO happening not
> just after shutdown. The thing is that we try to shut down while the socket
> thread has already been hanging inside connect()/close() for a while. Maybe
> that is even a reason people restart the browser - since pages are not
> loading, the socket thread is deadlocked.
>
> There are two reasons this could happen IMO: A/V software interference or
> LSP malware.
>
> Some code that detects the time spent in connect()/close() would be good to
> have. At least collect telemetry: the time spent in the call, how often it
> happens to spend more than some 2 seconds, whether it happens after a
> network change, whether it happens after shutdown only - whatever would be
> useful.
>
I intended to add telemetry as well; it is coming soon...
Assignee
Comment 9•10 years ago
Assignee
Comment 10•10 years ago
Attachment #8611372 -
Attachment is obsolete: true
Attachment #8612760 -
Flags: review+
Assignee
Updated•10 years ago
Keywords: checkin-needed,
leave-open
Comment 11•10 years ago
Keywords: checkin-needed
Comment 12•10 years ago
Assignee
Comment 13•9 years ago
Attachment #8613916 -
Flags: review?(honzab.moz)
Reporter
Comment 14•9 years ago
Comment on attachment 8613916 [details] [diff] [review]
bug_1152046_telemetry_v1.patch
Review of attachment 8613916 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/nsSocketTransport2.cpp
@@ +53,5 @@
> #endif
> /* End keepalive config inclusions. */
>
> +// If PR_Connect, PR_Close and PR_ContinueConnect are longer than
> +// STS_HANG_DURATION ms, consired it a hang and collect data about the state we
consired (typo)
@@ +1410,5 @@
>
> NetAddrToPRNetAddr(&mNetAddr, &prAddr);
>
> MOZ_EVENT_TRACER_EXEC(this, "net::tcp::connect");
> + PRIntervalTime connectStarted;
please explain why you prefer PR intervals here (atomicity)
@@ +3100,5 @@
> + STS_STATE_SHUTDOWN);
> +
> + } else if ((gIOService->LastConnectivityChange() >= closeStarted)
> + &&
> + (gIOService->LastOfflineStateChange() >= closeStarted)){
I think we should include some window after the change, like "has there been a change in the net config in, say, the last 60 seconds?". If so, add a 'state-changed' telemetry; otherwise, add 'normal-state' telemetry.
Then, I'm not sure a change in the 'connectivity yes/no' state is what you want... Maybe check with :badger on this first. There can be a wire connection, then for a while both wired and wifi, then the wire goes away (and that will not be seen as a change in connectivity).
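A sketch of the suggested window check (the 60-second window and the STS_STATE_CONNECTIVITY_CHANGE name are illustrative; the accessors are the ones from the patch):

    // Was there a network change within the last 60 seconds?
    PRIntervalTime window = PR_SecondsToInterval(60);
    PRIntervalTime now = PR_IntervalNow();
    bool recentChange =
        (now - gIOService->LastConnectivityChange()) < window ||
        (now - gIOService->LastOfflineStateChange()) < window;
    Telemetry::Accumulate(Telemetry::STS_LONG_PRCLOSE_STATE,
                          recentChange ? STS_STATE_CONNECTIVITY_CHANGE
                                       : STS_STATE_NORMAL);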
@@ +3116,5 @@
> + } else {
> + Telemetry::Accumulate(Telemetry::STS_LONG_PRCLOSE_STATE,
> + STS_STATE_NORMAL);
> + }
> + }
please share the code, it's easy
::: toolkit/components/telemetry/Histograms.json
@@ +2462,5 @@
> "high": "60000",
> "n_buckets": 1000,
> "description": "Time spend blocked on poll (ms)."
> },
> + "STS_PRCONNECT_TIME": {
The name should be "PRCONNECT_BLOCKING_TIME"
@@ +2500,5 @@
> + "expires_in_version": "never",
> + "kind": "enumerated",
> + "n_values": 5,
> + "description": "If it is in PR_Close more than 2000 ms, collect telemetry about the state: 4: shutdown, 3: offline and connectivity state changed 2: offline state changed, 1: connectivity change and 0: none."
> + },
I'd rather have telemetry like:
PRCONNECT_NORMAL
PRCONNECT_AFTER_NETWORK_CHANGE
PRCONNECT_AFTER_SHUTDOWN
etc., collecting time (exponential).
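For illustration, an exponential-time entry in Histograms.json along those lines could look like this (the field values mirror the STS poll-time entry quoted above; the final names and descriptions are per the later patch):

    "PRCONNECT_BLOCKING_TIME_NORMAL": {
      "expires_in_version": "never",
      "kind": "exponential",
      "high": "60000",
      "n_buckets": 1000,
      "description": "Time spent blocked in PR_Connect, normal case (ms)."
    },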
Attachment #8613916 -
Flags: review?(honzab.moz) → feedback-
Assignee
Comment 15•9 years ago
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Comment on attachment 8613916 [details] [diff] [review]
> bug_1152046_telemetry_v1.patch
>
> Review of attachment 8613916 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +3100,5 @@
> > + STS_STATE_SHUTDOWN);
> > +
> > + } else if ((gIOService->LastConnectivityChange() >= closeStarted)
> > + &&
> > + (gIOService->LastOfflineStateChange() >= closeStarted)){
>
> I think we should include some window after the change, like "has there
> been a change in the net config in, say, the last 60 seconds?". If so, add a
> 'state-changed' telemetry; otherwise, add 'normal-state' telemetry.
>
> Then, I'm not sure a change in the 'connectivity yes/no' state is what you
> want... Maybe check with :badger on this first. There can be a wire
> connection, then for a while both wired and wifi, then the wire goes away
> (and that will not be seen as a change in connectivity).
>
Looking at the code, I am sure that connectivity means we have some connection or we do not have any. I will check this with :badger.
I want to separate connectivity and offline state, which are 2 separate things. Looking at the code, a connectivity change will update the offline state if the offline state has not been forced for some other reason.
So I want to have:
PRCONNECT_BLOCKING_NORMAL - no shutdown, no network change, no offline state change
PRCONNECT_BLOCKING_SHUTDOWN - we are in shutdown
PRCONNECT_BLOCKING_NETWORK_CHANGE - connectivity changed
PRCONNECT_BLOCKING_OFFLINE - set offline (this will probably be the same as normal, except if we have some error in necko code)
Assignee
Comment 16•9 years ago
Can you take a look at the connectivity bit?
I wanted to save the time of the last connectivity change - connected/disconnected.
Attachment #8613916 -
Attachment is obsolete: true
Attachment #8614624 -
Flags: feedback?(daniel)
Tracking for 40. Are other versions affected?
tracking-firefox40:
--- → +
See https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c7 for why I tracked this.
Updated•9 years ago
Attachment #8614624 -
Flags: feedback?(daniel) → feedback+
Comment 21•9 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #12)
> https://hg.mozilla.org/mozilla-central/rev/1badbdeca325
Landed on Aurora a=sledru to fix uplift bustage from bug 1169554.
https://hg.mozilla.org/releases/mozilla-aurora/rev/6b9bf28f9b8a
Assignee
Comment 22•9 years ago
Attachment #8614624 -
Attachment is obsolete: true
Attachment #8628777 -
Flags: review?(honzab.moz)
Reporter
Comment 23•9 years ago
Comment on attachment 8628777 [details] [diff] [review]
bug_1152046_telemetry_v2.patch
Review of attachment 8628777 [details] [diff] [review]:
-----------------------------------------------------------------
Please fix a few of the nits and, if there are no major changes you want to make, just land it! Good work.
::: netwerk/base/nsIOService.cpp
@@ +1521,5 @@
> }
>
> bool isUp = true;
> if (!strcmp(data, NS_NETWORK_LINK_DATA_CHANGED)) {
> + mLastNetworkLinkChange = PR_IntervalNow();
please double check this. I'm not sure this fires every time..
::: netwerk/base/nsIOService.h
@@ +173,5 @@
> static bool sTelemetryEnabled;
> +
> + // These are needed for the telemetry about PR_Connect anf PR_Close
> + // duration. If PRConnect or PR_Close last to long check whether
> + // connectivity or offline state has change in between.
Rather:
"These timestamps are needed for collecting telemetry on PR_Connect, PR_ConnectContinue and PR_Close blocking time. On too long time spent in any of these functions we want to know what network change happened shortly before."
BTW, it might be smarter to collect only times that take more than, say, 1 second? I am really not sure about this, but I don't want to see a graph where 0-0.1 ms makes up 99% and the rest (which we are interested in!) is hidden in the remaining 1% of all the captures.
::: netwerk/base/nsSocketTransport2.cpp
@@ +1406,3 @@
> status = PR_Connect(fd, &prAddr, NS_SOCKET_CONNECT_TIMEOUT);
> +
> + if (gSocketTransportService->IsTelemetryEnabled() && connectStarted) {
if you want the "&& connectStarted" part then you must initialize connectStarted to 0. Maybe have the condition just be "if (connectStarted) {"?
@@ +1881,3 @@
> PRStatus status = PR_ConnectContinue(fd, outFlags);
> +
> + if (gSocketTransportService->IsTelemetryEnabled() && connectStarted) {
same here
::: netwerk/base/nsSocketTransportService2.h
@@ +109,5 @@
>
> // Returns true if keepalives are enabled in prefs.
> bool IsKeepaliveEnabled() { return mKeepaliveEnabledPref; }
> +
> + bool IsTelemetryEnabled() { return mTelemetryEnabledPref; }
some small comment would be nice
Attachment #8628777 -
Flags: review?(honzab.moz) → review+
Assignee
Comment 24•9 years ago
Attachment #8628777 -
Attachment is obsolete: true
Attachment #8630588 -
Flags: review+
Assignee
Comment 25•9 years ago
Checkin needed only for patch bug_1152046_telemetry_v2.patch.
Keywords: checkin-needed
Comment 26•9 years ago
Keywords: checkin-needed
Comment 27•9 years ago
Assignee
Updated•9 years ago
Status: NEW → ASSIGNED
Honza, just wanted to ping you on whether we have a fix in the works here or a patch ready for uplift to beta. Bug 1158189 is a top crash on the beta channel and is blocked by this bug. Thanks!
Flags: needinfo?(honzab.moz)
Assignee
Comment 29•9 years ago
I will answer and Honza can correct me.
We are currently collecting telemetry to see when it is happening. We assume it is after a network connection change, or maybe just when there is no connectivity.
For Honza:
From the telemetry that we have collected so far I see this happening for all cases sometimes. Some values are pretty high on Windows, and on OS X too, but not as high as on Windows. I could do a fix to split connect onto a separate thread. (I need to look closely into the data to see if we can skip it in some cases.)
Reporter
Comment 30•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #29)
> I will answer and Honza can correct me.
>
> We are currently collecting telemetry to see when it is happening. We assume
> it is after a network connection change, or maybe just when there is no
> connectivity.
>
> For Honza:
> From the telemetry that we have collected so far I see this happening for
> all cases sometimes. Some values are pretty high on Windows, and on OS X
> too, but not as high as on Windows. I could do a fix to split connect onto a
> separate thread. (I need to look closely into the data to see if we can skip
> it in some cases.)
As Dragana says, the only fix here is to either do more investigation, like contacting the people this happens to (not a usual way we work at Mozilla, though), or have separate thread(s) handling these function calls.
I think having a thread pool (limited to something like 5 threads) is the way to go here as a general workaround.
I personally believe there is some antivirus or a virus (how ironic :)) involved here, but that is more of a guess.
Flags: needinfo?(honzab.moz)
Assignee
Comment 31•9 years ago
A peek into telemetry between 8 and 20 July 2015.
The numbers next to the platform names show the maximal value seen.
Linux values are very low in comparison to OS X and Windows, and Windows has much higher numbers than OS X; this is why most of the crashes are on Windows.
PRCONNECT_BLOCKING_TIME_CONNECTIVITY_CHANGE:
linux: 26 ms
osx: 23 s
windows: even 2 samples above 60 s
PRCONNECT_BLOCKING_TIME_LINK_CHANGE:
linux: -
osx: 700 ms
windows: 9 s
PRCONNECT_BLOCKING_TIME_NORMAL:
linux: 116 ms
osx: 9 s
windows: 19 samples above 60 s
PRCONNECT_BLOCKING_TIME_OFFLINE (small sample size):
linux: 1-2 ms (very small sample size, ~40)
osx: 1-2 ms (very small sample size, ~200)
windows: 43 s
PRCONNECTCONTINUE_BLOCKING_TIME_CONNECTIVITY_CHANGE:
linux: 23 ms
osx: 97 ms
windows: 11 s
PRCONNECTCONTINUE_BLOCKING_TIME_LINK_CHANGE:
linux: -
osx: 46 ms
windows: 10 s
PRCONNECTCONTINUE_BLOCKING_TIME_NORMAL:
linux: 73 ms
osx: 133 ms
windows: 2 samples above 60 s
PRCONNECTCONTINUE_BLOCKING_TIME_OFFLINE (small sample size):
windows: 16 ms
PRCLOSE_BLOCKING_TIME_CONNECTIVITY_CHANGE:
linux: 53 ms
osx: 16 s
windows: 1 sample above 60 s
PRCLOSE_BLOCKING_TIME_LINK_CHANGE:
linux: -
osx: 1 s
windows: 48 s
PRCLOSE_BLOCKING_TIME_NORMAL:
linux: 11 s
osx: 27 s
windows: 9 samples above 60 s
PRCLOSE_BLOCKING_TIME_OFFLINE (small sample size):
linux: 6 s
osx: 86 ms
windows: 16 s
Comment 32•9 years ago
damn. those are crazy numbers. Dragana do these numbers look different at the 98th or 99th percentile rather than max? If not, then this is pretty nuts. (If so, even simple things like some kind of massive memory thrashing can explain .0001 outliers which would be irrelevant to us)
I think it is interesting that the close_blocking_time_normal is so bad cross platform.
Honza, do you think there is any reason to have this done synchronously (to the rest of gecko)?
can we just implement pr_close at the bottom of the stack to return true without closing anything, but pass the OS fd to a thread that queues them up and closes them as fast as possible. I suggest this rather than a thread pool as a possibility because the thread pool still creates a slow sync point for the outlier cases (though improves HoL blocking).
at least we have a good metric we could track now. Thanks Dragana.
Flags: needinfo?(honzab.moz)
Flags: needinfo?(dd.mozilla)
Reporter
Comment 33•9 years ago
(In reply to Patrick McManus [:mcmanus] from comment #32)
> damn. those are crazy numbers. Dragana do these numbers look different at
> the 98th or 99th percentile rather than max? If not, then this is pretty
> nuts. (If so, even simple things like some kind of massive memory thrashing
> can explain .0001 outliers which would be irrelevant to us)
Yep, it would be good to check what percentile of blocks takes more than, say, 4 seconds - or any other reasonable number of seconds.
>
> I think it is interesting that the close_blocking_time_normal is so bad
> cross platform.
>
> Honza, do you think there is any reason to have this done synchronously (to
> the rest of gecko)?
>
> can we just implement pr_close at the bottom of the stack to return true
> without closing anything, but pass the OS fd to a thread that queues them up
> and closes them as fast as possible. I suggest this rather than a thread
> pool as a possibility
"because the thread pool still creates a slow sync point for the outlier cases"
I don't follow this argument. Are you suggesting we have one thread per one close (connect)? I did the same for udp socket close - no pool, spawn a new thread for any invocation of nsUDPSocket::Close() ending up at PR_Close(). But we use udp more rarely than tcp so to have a tun of threads for each connect we call seems to me like waste of resources. But I can be persuaded otherwise. Both solutions are good. Only problem is that impact of raising number of threads is hard to measure, it's kind of a slow death...
> (though improves HoL blocking).
>
> at least we have a good metric we could track now. Thanks Dragana.
Yep, thanks!
Flags: needinfo?(honzab.moz)
Comment 34•9 years ago
roughly:
1] put an NSPR layer where layer->lower == tcp/udp
2] implement pr_close(aFD) {
       int fd = GetOsFD(aFD); // I forget what this is really called
       aFD->lower->dtor(aFD->lower); // or maybe set fd to -1 and call lower->close()
       postmessage(closingThread, fd);
       return 0;
   }
on closingThread:
   while (!shutdown) {
       recvmessage(&fd);
       close(fd);
   }
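Filling the sketch in with the actual NSPR stub-layer API (still only a sketch: PostToClosingThread() is a hypothetical queue helper, and the one-time identity/methods setup is shown inline for brevity):

    static PRDescIdentity sCloseLayerId;
    static PRIOMethods sCloseLayerMethods;

    static PRStatus CloseLayerClose(PRFileDesc *aFd)
    {
        // Pop our stub layer off; aFd then heads the remaining stack.
        PRFileDesc *layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
        layer->dtor(layer);
        // Hand the rest of the stack to the closing thread; the real
        // PR_Close(aFd) happens there, off the socket thread.
        PostToClosingThread(aFd); // hypothetical
        return PR_SUCCESS;
    }

    static void InstallCloseLayer(PRFileDesc *aFd)
    {
        // One-time setup (guard against re-running in real code).
        sCloseLayerId = PR_GetUniqueIdentity("close layer");
        sCloseLayerMethods = *PR_GetDefaultIOMethods();
        sCloseLayerMethods.close = CloseLayerClose;

        PRFileDesc *layer =
            PR_CreateIOLayerStub(sCloseLayerId, &sCloseLayerMethods);
        PR_PushIOLayer(aFd, PR_TOP_IO_LAYER, layer);
    }

The ClosingService patches reviewed below follow this shape, with a queue guarded by a lock and shutdown handling on top.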
Assignee
Comment 35•9 years ago
(In reply to Patrick McManus [:mcmanus] from comment #32)
> damn. those are crazy numbers. Dragana do these numbers look different at
> the 98th or 99th percentile rather than max? If not, then this is pretty
> nuts. (If so, even simple things like some kind of massive memory thrashing
> can explain .0001 outliers which would be irrelevant to us)
>
I should have added those stats too; I was just surprised by the max values, they are crazy.
The 95th and even the 99th percentile are only 1 ms.
There are 3365 samples larger than 1 second and 1453 larger than 4 s.
Dispatching this to a pool of threads or to separate threads is going to make this worse for everybody else. I am not sure we should do this.
> I think it is interesting that the close_blocking_time_normal is so bad
> cross platform.
>
> Honza, do you think there is any reason to have this done synchronously (to
> the rest of gecko)?
>
> can we just implement pr_close at the bottom of the stack to return true
> without closing anything, but pass the OS fd to a thread that queues them up
> and closes them as fast as possible. I suggest this rather than a thread
> pool as a possibility because the thread pool still creates a slow sync
> point for the outlier cases (though improves HoL blocking).
>
> at least we have a good metric we could track now. Thanks Dragana.
Flags: needinfo?(dd.mozilla)
Assignee
Comment 36•9 years ago
My message crossed with yours.
The suggestion from comment 34 could work.
Reporter
Comment 37•9 years ago
(In reply to Patrick McManus [:mcmanus] from comment #34)
> roughly:
>
>
> 1] put a nspr layer where layer->lower == tcp/udp
>
> 2] implement pr_close(aFD) {
> int fd = GetOsFD(aFD); // I forget what this is really called
> aFD->lower->dtor(aFD->lower); // or maybe set fd to -1 and call
> lower->close()
> postmessage(closingThread, fd);
> return 0;
> }
>
> on closingThread
> while(!shutdown) {
> recvmessage(&fd);
> close(fd);
> }
If we just leak any queued (yet not closed) sockets after shutdown, then this is a good solution. It should also replace my patch for UDP close.
How about connect/cont-connect, though?
Assignee
Comment 38•9 years ago
Here is one version. Please take a look...
Attachment #8641199 -
Flags: feedback?(mcmanus)
Reporter
Comment 39•9 years ago
Comment on attachment 8641199 [details] [diff] [review]
bug_1152046_prclose_v1.patch
Review of attachment 8641199 [details] [diff] [review]:
-----------------------------------------------------------------
Nice! Just a few internals to fix.
::: netwerk/base/nsClosingService.cpp
@@ +16,5 @@
> +TcpUdpPRCloseLayerClose(PRFileDesc *aFd)
> +{
> + PRFileDesc *fd = aFd->lower;
> + nsClosingService::PostCloseRequest(fd);
> + aFd = 0;
you probably don't need to do this
@@ +36,5 @@
> +{
> + PR_DestroyLock(mLock);
> + PR_DestroyCondVar(mQueueCondVar);
> + gInstance = nullptr;
> + mQueue.Clear();
hmm.. the queue must either be accessed only on a single thread (which I think it cannot be) or always be accessed unconditionally under the lock
@@ +40,5 @@
> + mQueue.Clear();
> +}
> +
> +void
> +nsClosingService::Start()
put // static before the method definition (applies to all static methods here)
@@ +105,5 @@
> +void
> +nsClosingService::PostCloseRequest(PRFileDesc *aFd)
> +{
> + // Check if shutdown is called.
> + if (!sShutdown) {
nicer would be:
if (sShutdown) {
// Let the socket leak, we want the system take care... bla bla bla
return;
}
if (..)..
@@ +110,5 @@
> + // Check if gInstance is present, if not call in the standard way.
> + // Theoretically this can happen before nsIOService initializes
> + // nsClosingService.
> + if (gInstance) {
> + gInstance->PostRequest(aFd);
so, I'm missing one major thing here: you are not correctly handling the service lifetime
each layer instance you create must have its 'secret' set to a struct that has a nsRefPtr<nsClosingService> that refers to the service.
here you will pick the refptr from your secret struct and use it instead of gInstance and then just nullify it. This will make sure the queue and the locks exist. The code you have may nicely race.
the other option is a StaticMutex or StaticMonitor over gInstance
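A sketch of the secret struct being asked for here (the name matches the later patch; the PRFilePrivate* cast is how NSPR stores layer-private data):

    // Keeps the service alive for as long as any layer instance exists.
    class ClosingLayerSecret
    {
    public:
        explicit ClosingLayerSecret(nsClosingService *aService)
            : mClosingService(aService) {}
        nsRefPtr<nsClosingService> mClosingService;
    };

    // When pushing the layer:
    //   layer->secret =
    //       reinterpret_cast<PRFilePrivate *>(new ClosingLayerSecret(service));
    // In close(): take the secret, use its mClosingService, then delete it.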
@@ +115,5 @@
> + } else {
> + PR_Close(aFd);
> + }
> + }
> + aFd = nullptr;
why?
@@ +179,5 @@
> + PR_Unlock(mLock);
> +}
> +
> +void
> +nsClosingService::ResolveRequest()
maybe ProcessRequest would be better (if you really want to have a method for this, inlining into ThreadFunc would work well too)
@@ +186,5 @@
> + if (mQueue.Length() == 0) {
> + PR_WaitCondVar(mQueueCondVar, PR_INTERVAL_NO_TIMEOUT);
> + }
> + if (sShutdown) {
> + PR_Unlock(mLock);
grrr....
::: netwerk/base/nsClosingService.h
@@ +62,5 @@
> + mozilla::Telemetry::ID aIDLinkChange,
> + mozilla::Telemetry::ID aIDOffline);
> +
> + static bool sShutdown;
> + static nsClosingService *gInstance;
sInstance (or preferably sSelf) since it's an object's static not a global static
@@ +65,5 @@
> + static bool sShutdown;
> + static nsClosingService *gInstance;
> + nsTArray<PRFileDesc *> mQueue;
> + PRLock *mLock;
> + PRCondVar *mQueueCondVar;
use mozilla::Monitor please; if there is some reason not to, please add a comment explaining why
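A sketch of the queue rebuilt around mozilla::Monitor (a single lock + condvar pair; names follow the patch, details illustrative):

    class ClosingService
    {
        mozilla::Monitor mMonitor;
        nsTArray<PRFileDesc *> mQueue;
        bool mShutdown;

    public:
        ClosingService()
            : mMonitor("ClosingService.mMonitor"), mShutdown(false) {}

        void PostRequest(PRFileDesc *aFd)
        {
            mozilla::MonitorAutoLock mon(mMonitor);
            mQueue.AppendElement(aFd);
            mon.Notify(); // wake the closing thread if it is waiting
        }

        void ThreadFunc()
        {
            mozilla::MonitorAutoLock mon(mMonitor);
            while (!mShutdown) {
                while (mQueue.IsEmpty() && !mShutdown) {
                    mon.Wait();
                }
                // Take descriptors off mQueue and PR_Close() them here,
                // dropping the monitor around the actual close call.
            }
        }
    };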
::: netwerk/base/nsIncrementalDownload.cpp
@@ +17,5 @@
> #include "nsIFile.h"
> #include "nsITimer.h"
> +#include "nsIStreamListener.h"
> +#include "nsIURI.h"
> +#include "nsIInputStream.h"
hmm?
::: netwerk/base/nsSimpleNestedURI.cpp
@@ +7,5 @@
>
> #include "nsSimpleNestedURI.h"
> #include "nsIObjectInputStream.h"
> #include "nsIObjectOutputStream.h"
> +#include "nsNetUtil.h"
hmm2?
::: netwerk/base/nsURLHelper.cpp
@@ +12,5 @@
> #include "nsCOMPtr.h"
> #include "nsCRT.h"
> #include "nsNetCID.h"
> #include "prnetdb.h"
> +#include "mozilla/Preferences.h"
hmm3?
Attachment #8641199 -
Flags: feedback+
Assignee
Updated•9 years ago
Attachment #8641199 -
Flags: feedback?(mcmanus)
Comment 40•9 years ago
We're at the end of the 40 cycle and this bug is now wontfix for 40.
As this bug is visible in 40 Betas and 41 Aurora, I'm tracking for 41+.
status-firefox41:
--- → affected
status-firefox42:
--- → affected
tracking-firefox41:
--- → +
tracking-firefox42:
--- → +
Comment hidden (obsolete)
Assignee
Comment 42•9 years ago
Attachment #8641199 -
Attachment is obsolete: true
Assignee
Comment 43•9 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=873fff73ca52
Honza, can you take a look please?
There is a socket leak that is expected to happen. A possibility would be to allow PR_Close to run for, let's say, 200-500 ms and then let the remaining sockets leak?
Flags: needinfo?(honzab.moz)
Reporter
Comment 44•9 years ago
I'm not sure what exactly you want from me, can you restate please?
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
Assignee
Comment 45•9 years ago
May I ask you for a review? Thanks.
Attachment #8650560 -
Attachment is obsolete: true
Flags: needinfo?(dd.mozilla) → needinfo?(honzab.moz)
Jason, would you be able to review this? I noticed Honza's bugzilla email says "not reviewing". This bug is tracking for FF41, so if there is a fix that is safe, we would want to uplift to Beta41 soon.
Flags: needinfo?(jduell.mcbugs)
Comment 47•9 years ago
Honza has been reviewing previous patches here so I'm going to let him keep doing that. Bounce it back to me if he hasn't gotten to it in a few days.
Flags: needinfo?(jduell.mcbugs)
Reporter
Comment 48•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #44)
> I'm not sure what exactly you want from me, can you restate please?
I was actually asking about:
"There is a leak on socket that is expected to happend. A possibilty is to allow PR_Close for let's say 200-500 ms and then let the rest sockets leak?"
what does that mean?
Flags: needinfo?(dd.mozilla)
Assignee
Comment 49•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #48)
> (In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #44)
> > I'm not sure what exactly you want from me, can you restate please?
>
> I was actually asking about:
>
> "There is a leak on socket that is expected to happend. A possibilty is to
> allow PR_Close for let's say 200-500 ms and then let the rest sockets leak?"
>
> what does that mean?
Sorry, I have fixed that - I added PR_Free(fd), which I had forgotten before, and I made some other fixes, so there is no leak anymore.
Flags: needinfo?(dd.mozilla)
Assignee
Comment 50•9 years ago
Reporter
Comment 51•9 years ago
Comment on attachment 8654487 [details] [diff] [review]
bug_1152046_prclose_v1.patch
Review of attachment 8654487 [details] [diff] [review]:
-----------------------------------------------------------------
nsClosingService -> mozilla::net::ClosingService
::: netwerk/base/nsClosingService.cpp
@@ +34,5 @@
> + if (!aFd) {
> + return PR_FAILURE;
> + }
> +
> + ClosingLayerSecret *closingService =
s/closingService/closingLayerSecret/
@@ +35,5 @@
> + return PR_FAILURE;
> + }
> +
> + ClosingLayerSecret *closingService =
> + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
you must do this on your layer! how can you be sure aFd is what you expect?
@@ +39,5 @@
> + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> + aFd->secret = nullptr;
> +
> + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> + NS_ASSERTION(layer &&
MOZ_ASSERT, maybe actually MOZ_RELEASE_ASSERT
@@ +42,5 @@
> + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> + NS_ASSERTION(layer &&
> + layer->identity == sTcpUdpPRCloseLayerId,
> + "Closing Layer not on top of stack");
> + layer->dtor(layer);
are you sure about this? doesn't PR_Close on the layer do that? I think this just delegates to the layer below.
@@ +46,5 @@
> + layer->dtor(layer);
> +
> + PRStatus status = PR_SUCCESS;
> +
> + if (aFd) {
will always be true
@@ +60,5 @@
> + }
> + } else {
> + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> + // we are not going to leak anything, but the PR_Close will not be called.
> + PR_Free(aFd);
shouldn't you use dtor() here? I'm no expert (it's been too long since the last time I played with this stuff) so rather double check. but that would make more sense to me. same in the shutdown case in ThreadFunc
@@ +69,5 @@
> + return status;
> +}
> +
> +mozilla::Atomic<bool> nsClosingService::sShutdown(false);
> +nsRefPtr<nsClosingService> nsClosingService::sInstance = nullptr;
these are both static initializers, something we don't want
you can have a raw sInstance ptr and NS_ADDREF in the ctor and NS_RELEASE in the dtor. Be careful with multiple threads tho..
@@ +158,5 @@
> + PR_Free(aFd);
> + return;
> + }
> +
> + bool notify = (mQueue.Length() == 0)? true : false;
)?
..missing space
@@ +161,5 @@
> +
> + bool notify = (mQueue.Length() == 0)? true : false;
> +
> + mQueue.AppendElement(aFd);
> + if (notify) {
if (mQueue.Length() == 1) ?
@@ +174,5 @@
> + MOZ_ASSERT(NS_IsMainThread());
> +
> + if (sInstance) {
> + sInstance->ShutdownInternal();
> + sInstance = nullptr;
so, other thread may race and take ref to sInstance after it has been shut down? Is that OK? Maybe it is, just checking..
@@ +200,5 @@
> +
> +void
> +nsClosingService::ThreadFunc()
> +{
> + while (1) {
for (;;) or while (true) are prefered
@@ +217,5 @@
> + // PR_NSPR_IO_LAYER we are not going to leak anything, but PR_Close
> + // will not be called.
> + PR_Free(fd);
> + }
> + mQueue.Clear();
ws
::: netwerk/base/nsClosingService.h
@@ +28,5 @@
> +
> +namespace mozilla {
> +namespace net {
> +
> +class nsClosingService final : public nsISupports
to make nsRefPtr work with this you just need to define AddRef and Release. There is a macro in nsISupportsImpl.h, NS_INLINE_DECL_THREADSAFE_REFCOUNTING, that declares those two inline for you. Don't derive from nsISupports when it's not needed (and it is not needed in this case).
@@ +58,5 @@
> + mozilla::Telemetry::ID aIDConnectivityChange,
> + mozilla::Telemetry::ID aIDLinkChange,
> + mozilla::Telemetry::ID aIDOffline);
> +
> + static nsRefPtr<nsClosingService> sInstance;
hm... this is a static initializer. not good..
::: netwerk/base/nsIOService.cpp
@@ +43,5 @@
> #include "nsIWidget.h"
> #include "nsThreadUtils.h"
> #include "mozilla/LoadInfo.h"
> #include "mozilla/net/NeckoCommon.h"
> +#include "mozilla/Services.h"
why are all the includes here?
::: netwerk/base/nsSocketTransportService2.cpp
@@ +557,5 @@
>
> +#if !defined(MOZILLA_XPCOMRT_API)
> + // Start closing service. The PRClose functions will be carried out on
> + // the closeService thread. Started here so that it is started just
> + // once.
Start the closing service. Actual PR_Close() will be carried out on a separate "closing" thread. Start the closing service here since this point is executed only once per session.
::: netwerk/base/nsURLHelper.cpp
@@ +11,5 @@
> #include "nsIURLParser.h"
> #include "nsCOMPtr.h"
> #include "nsCRT.h"
> #include "nsNetCID.h"
> +#include "mozilla/Preferences.h"
why this include?
Attachment #8654487 -
Flags: review-
Reporter
Updated•9 years ago
Flags: needinfo?(honzab.moz)
I am not sure the fix will come in time for 41. We have lived with this issue for one release (40) and we may have to live with it for another (41). Updating status-firefox41 to wontfix.
If the fix is ready in the next 3-4 days, and is safe to uplift to Beta, please nominate.
Comment hidden (obsolete)
Assignee
Comment 54•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #51)
> Comment on attachment 8654487 [details] [diff] [review]
> bug_1152046_prclose_v1.patch
>
> Review of attachment 8654487 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> @@ +35,5 @@
> > + return PR_FAILURE;
> > + }
> > +
> > + ClosingLayerSecret *closingService =
> > + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
>
> you must do this on your layer! how can you be sure aFd is what you expect?
If we are not on the right layer something is really wrong, but you are right, it should be done in the correct order.
> @@ +42,5 @@
> > + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> > + NS_ASSERTION(layer &&
> > + layer->identity == sTcpUdpPRCloseLayerId,
> > + "Closing Layer not on top of stack");
> > + layer->dtor(layer);
>
> are you sure about this? doesn't PR_Close on the layer do that? I think
> this just delegates to the layer below.
>
We do not want to call PR_Close. PR_Close calls the close function of this layer, which is the function that we are in :)
layer->dtor and PR_Free(layer) will end up doing the same thing because the layer is popped off and the stack connections are cut (lower and higher are set to null).
> @@ +46,5 @@
> > + layer->dtor(layer);
> > +
> > + PRStatus status = PR_SUCCESS;
> > +
> > + if (aFd) {
>
> will always be true
>
aFd does not have the same value as at the beginning of the function. Call to PR_PopIOLayer has set aFd to the next layer in the chain.
aFd should be non-null, but I have added this check just to be sure.
> @@ +60,5 @@
> > + }
> > + } else {
> > + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> > + // we are not going to leak anything, but the PR_Close will not be called.
> > + PR_Free(aFd);
>
> shouldn't you use dtor() here? I'm no expert (it's been too long since the
> last time I played with this stuff) so rather double check. but that would
> make more sense to me. same in the shutdown case in ThreadFunc
>
The only difference between PR_Free and dtor is that dtor cuts the connections in the layer chain (sets lower and higher to null), but in this case it does not matter which one we use.
>
> @@ +174,5 @@
> > + MOZ_ASSERT(NS_IsMainThread());
> > +
> > + if (sInstance) {
> > + sInstance->ShutdownInternal();
> > + sInstance = nullptr;
>
> so, other thread may race and take ref to sInstance after it has been shut
> down? Is that OK? Maybe it is, just checking..
>
it is ok because mShutdown is checked before putting a request in the queue.
> ::: netwerk/base/nsIOService.cpp
> @@ +43,5 @@
> > #include "nsIWidget.h"
> > #include "nsThreadUtils.h"
> > #include "mozilla/LoadInfo.h"
> > #include "mozilla/net/NeckoCommon.h"
> > +#include "mozilla/Services.h"
>
> why are all the includes here?
>
> ::: netwerk/base/nsURLHelper.cpp
> @@ +11,5 @@
> > #include "nsIURLParser.h"
> > #include "nsCOMPtr.h"
> > #include "nsCRT.h"
> > #include "nsNetCID.h"
> > +#include "mozilla/Preferences.h"
>
> why this include?
this fixes compile errors (caused by adding a new file to moz.build, which probably moved files from one unified compilation file to another).
Assignee
Comment 55•9 years ago
Attachment #8654487 -
Attachment is obsolete: true
Assignee
Updated•9 years ago
Flags: needinfo?(honzab.moz)
Reporter
Comment 56•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #54)
> (In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #51)
> > Comment on attachment 8654487 [details] [diff] [review]
> > bug_1152046_prclose_v1.patch
> >
> > Review of attachment 8654487 [details] [diff] [review]:
> > -----------------------------------------------------------------
> >
> > @@ +35,5 @@
> > > + return PR_FAILURE;
> > > + }
> > > +
> > > + ClosingLayerSecret *closingService =
> > > + reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> >
> > you must do this on your layer! how can you be sure aFd is what you expect?
>
>
> If we are not on the right layer something is really wrong, but you are
> right, it should be done in the correct order.
If the code ensures you are on the right level then OK. But some kind of assertion, at least, would be appropriate.
>
>
> >
> > @@ +42,5 @@
> > > + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> > > + NS_ASSERTION(layer &&
> > > + layer->identity == sTcpUdpPRCloseLayerId,
> > > + "Closing Layer not on top of stack");
> > > + layer->dtor(layer);
> >
> > are you sure about this? doesn't PR_Close on the layer do that? I think
> > this just delegates to the layer below.
> >
>
>
> We do not want to call PR_Close. PR_Close calls the close function of this
> layer, which is the function that we are in :)
> layer->dtor and PR_Free(layer) will end up doing the same thing because the
> layer is popped off and the stack connections are cut (lower and higher are
> set to null)
Aha :)
>
>
> > @@ +46,5 @@
> > > + layer->dtor(layer);
> > > +
> > > + PRStatus status = PR_SUCCESS;
> > > +
> > > + if (aFd) {
> >
> > will always be true
> >
>
> aFd does not have the same value as at the beginning of the function. Call
> to PR_PopIOLayer has set aFd to the next layer in the chain.
> aFd should be non-null, but I have added this check just to be sure.
OK, leave it as is.
>
>
> > @@ +60,5 @@
> > > + }
> > > + } else {
> > > + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> > > + // we are not going to leak anything, but the PR_Close will not be called.
> > > + PR_Free(aFd);
> >
> > shouldn't you use dtor() here? I'm no expert (it's been too long since the
> > last time I played with this stuff) so rather double check. but that would
> > make more sense to me. same in the shutdown case in ThreadFunc
> >
>
> The only difference between PR_Free and dtor is that dtor cuts the
> connections in the layer chain (sets lower and higher to null), but in this
> case it does not matter which one we use.
If that is the only difference then I would prefer calling dtor here. It may also release any secret structs that PR_Free has no chance to ever release.
I'll do the review of the new version now.
Reporter
Comment 57•9 years ago
Comment on attachment 8657801 [details] [diff] [review]
bug_1152046_prclose_v2.patch
Review of attachment 8657801 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/ClosingService.cpp
@@ +6,5 @@
> +
> +#include "ClosingService.h"
> +#include "nsIOService.h"
> +
> +using namespace mozilla::net;
instead of 'using', put all of this in the mozilla::net namespace.
@@ +12,5 @@
> +static PRIOMethods sTcpUdpPRCloseLayerMethods;
> +static PRIOMethods *sTcpUdpPRCloseLayerMethodsPtr = nullptr;
> +static PRDescIdentity sTcpUdpPRCloseLayerId;
> +
> +class ClosingLayerSecret
this must be in an anonymous namespace
@@ +38,5 @@
> +
> + PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> + MOZ_RELEASE_ASSERT(layer &&
> + layer->identity == sTcpUdpPRCloseLayerId,
> + "Closing Layer not on top of stack");
aha, it's here! good.
@@ +51,5 @@
> + // let it leak.
> + if (gIOService->IsShutdown()) {
> + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> + // we are not going to leak anything, but the PR_Close will not be called.
> + PR_Free(aFd);
I'd prefer using dtor here, but if there is a problem with that, leave PR_Free
@@ +57,5 @@
> + closingLayerSecret->mClosingService->PostRequest(aFd);
> + } else {
> + // Socket is created before closing service has been started or there was
> + // a problem with starting it.
> + status = aFd->methods->close(aFd);
PR_Close(aFd)?
@@ +73,5 @@
> +ClosingService::ClosingService()
> + : mShutdown(false)
> + , mMonitor("ClosingService.mMonitor")
> +{
> + NS_ASSERTION(sInstance==nullptr,
MOZ_ASSERT(!sInstance) - preferred - or sInstance == nullptr
@@ +87,5 @@
> + PR_ASSERT(PR_INVALID_IO_LAYER != sTcpUdpPRCloseLayerId);
> +
> + sTcpUdpPRCloseLayerMethods = *PR_GetDefaultIOMethods();
> + sTcpUdpPRCloseLayerMethods.close = TcpUdpPRCloseLayerClose;
> + sTcpUdpPRCloseLayerMethodsPtr = &sTcpUdpPRCloseLayerMethods;
nit: why is sTcpUdpPRCloseLayerMethodsPtr needed?
@@ +91,5 @@
> + sTcpUdpPRCloseLayerMethodsPtr = &sTcpUdpPRCloseLayerMethods;
> + }
> +
> + if (!sInstance) {
> + ClosingService* service = new ClosingService();
it was better in the previous version (to use nsRefPtr here)
@@ +128,5 @@
> + layer = PR_CreateIOLayerStub(sTcpUdpPRCloseLayerId,
> + sTcpUdpPRCloseLayerMethodsPtr);
> +
> + if (!layer) {
> + return NS_ERROR_FAILURE;
this may IMO not be fatal.
@@ +139,5 @@
> +
> + if (status == PR_FAILURE) {
> + delete secret;
> + PR_DELETE(layer);
> + return NS_ERROR_FAILURE;
I would not make this fatal either... but it depends. Your call.
@@ +156,5 @@
> + // time. To prevent shutdown crash (bug 1152046) do not accept sockets any
> + // more.
> + // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> + // we are not going to leak anything, but PR_Close will not be called.
> + PR_Free(aFd);
dtor here as well?
@@ +176,5 @@
> + if (sInstance) {
> + sInstance->ShutdownInternal();
> + ClosingService* service = sInstance;
> + sInstance = nullptr;
> + NS_RELEASE(service);
I think you can just NS_RELEASE(sInstance). It will nullify sInstance for you.
::: netwerk/base/nsSocketTransportService2.cpp
@@ +555,5 @@
> obsSvc->AddObserver(this, "last-pb-context-exited", false);
> }
>
> +#if !defined(MOZILLA_XPCOMRT_API)
> + // Start the closing service. Actual PRClose() will be carried out on
PR_Close()
Attachment #8657801 -
Flags: review+
Reporter
Updated•9 years ago
Flags: needinfo?(honzab.moz)
Assignee
Comment 58•9 years ago
Assignee
Comment 59•9 years ago
Assignee
Comment 60•9 years ago
->dtor does not work; it must be PR_Free.
Attachment #8657801 -
Attachment is obsolete: true
Attachment #8659673 -
Flags: review+
Assignee
Updated•9 years ago
Keywords: checkin-needed
Comment 61•9 years ago
Hi, the first patch didn't apply cleanly:
renamed 1152046 -> bug_1152046_v1.patch
applying bug_1152046_v1.patch
patching file netwerk/base/nsIOService.h
Hunk #1 FAILED at 72
1 out of 1 hunks FAILED -- saving rejects to file netwerk/base/nsIOService.h.rej
patching file netwerk/base/nsSocketTransport2.cpp
Hunk #1 FAILED at 1211
1 out of 1 hunks FAILED -- saving rejects to file netwerk/base/nsSocketTransport2.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory
Is this checkin-needed request for all patches?
Flags: needinfo?(dd.mozilla)
Keywords: checkin-needed
Assignee
Comment 62•9 years ago
Comment on attachment 8659673 [details] [diff] [review]
bug_1152046_prclose_v2.patch
Only for patch bug_1152046_prclose_v2.patch.
Thanks.
Flags: needinfo?(dd.mozilla)
Attachment #8659673 -
Flags: checkin+
Assignee
Updated•9 years ago
Keywords: checkin-needed
Comment 63•9 years ago
Comment 65•9 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/d5d0b5994149bc22ee38b1ae14acb69853a9646d
bug 1152046 - Revert Make separate thread only for PRClose. r=backout
Comment 66•9 years ago
Backed out comment 63 due to crashes in bug 1205016.
Updated•9 years ago
Attachment #8659673 -
Flags: checkin+
Comment 67•9 years ago
Comment 68•9 years ago
Something that landed on top of the original landing isn't playing well with the backout, so I've had to back out the backout. We were seeing b2g debug shutdown hangs.
Comment 69•9 years ago
This bug is causing a performance regression on b2g. We need a forward fix in bug 1204837 with this patch on m-c.
Since this bug is specific to Windows, will the fix be limited to Windows in the next revision? Or will we still have this on all platforms even though neither Linux nor Mac OS has this problem?
Flags: needinfo?(dd.mozilla)
Assignee
Comment 70•9 years ago
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #69)
> This bug is causing a performance regression on b2g. We need a forward fix
> in bug 1204837 with this patch on m-c.
>
> Since this bug is specific to Windows, will the fix be limited to Windows in
> the next revision? Or will we still have this on all platforms even though
> neither Linux nor Mac OS has this problem?
Sometimes PR_Close blocks for a long time; it is rare, but it happens. The blocking times on Windows are longer than on the other platforms (Linux, OS X), but long blocking times happen on all of them, so we will keep this change for all platforms.
Sorry, I did not know about bug 1204837.
Flags: needinfo?(dd.mozilla)
Assignee
Comment 71•9 years ago
I moved the ClosingService starting/stopping into nsIOService to be sure that it happens only once.
I think this is not going to solve the crash, but I want to change this.
Honza, may I ask you for a review, please?
Flags: needinfo?(honzab.moz)
So this landed and broke Linux stuff.
It was reverted, which broke b2g debug stuff.
That revert was then reverted (so we are back to this being landed), so b2g debug stuff is working, but Linux stuff is broken again.
So what's the plan here? We can't leave either Linux stuff or b2g debug stuff broken like this.
Flags: needinfo?(mcmanus)
Flags: needinfo?(dd.mozilla)
Comment 73•9 years ago
Dragana is actively working on it. No idea why the b2g stuff came to depend on it within 24 hours.
Flags: needinfo?(mcmanus)
Comment 74•9 years ago
Note: I backed out bug 1152046 (again) locally, since I was insta-crashing with it in. That lets me run, but testing https://mozilla.github.com/webrtc-landing/pc_test.html crashes in the same place (jemalloc/internal/bitmap.h) when in e10s - but not in a non-e10s profile. So it's very touchy and machine/perf/allocation/? dependent.
This also implies to me the actual bug is elsewhere (jemalloc?)
Assignee
Comment 75•9 years ago
I am working on the Linux problem.
I do not know why b2g is broken. I checked: the files that this patch changes have not been modified since this patch landed.
This patch does not call PR_Close after we enter shutdown; it just cleans up but does not close the socket. The b2g failure is a shutdown hang; we should figure out in any case why it hangs when this patch gets backed out.
Nigel, do you know what has changed in b2g, probably something about sockets?
Flags: needinfo?(dd.mozilla) → needinfo?(nigelbabu)
Comment 76•9 years ago
Comment 77•9 years ago
I have no idea. I'm probably not the best person to answer this question.
Flags: needinfo?(nigelbabu)
Assignee
Comment 78•9 years ago
Attachment #8664191 -
Flags: review?(mcmanus)
Assignee
Comment 79•9 years ago
Updated•9 years ago
Attachment #8664191 -
Flags: review?(mcmanus) → review+
Assignee
Updated•9 years ago
Attachment #8664191 -
Flags: checkin+
Assignee
Updated•9 years ago
Keywords: checkin-needed
Assignee
Comment 80•9 years ago
Checkin needed only for patch bug_1152046_disable_on_ffos.patch.
Comment 81•9 years ago
Keywords: checkin-needed
Comment 82•9 years ago
Backed out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=14559318&repo=mozilla-inbound
Flags: needinfo?(dd.mozilla)
Comment 83•9 years ago
Assignee
Comment 84•9 years ago
Fixed the bustage. Sorry.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a34a9dbaf39
Attachment #8664191 -
Attachment is obsolete: true
Flags: needinfo?(dd.mozilla)
Attachment #8664957 -
Flags: review+
Attachment #8664957 -
Flags: checkin+
Assignee
Comment 85•9 years ago
Checkin needed only for patch bug_1152046_disable_on_ffos.patch.
Keywords: checkin-needed
Comment 86•9 years ago
Keywords: checkin-needed
Comment 87•9 years ago
sorry had to back this out again for another bustage, this time b2g https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-inbound
Flags: needinfo?(dd.mozilla)
Comment 88•9 years ago
Comment 89•9 years ago
(In reply to Carsten Book [:Tomcat] from comment #87)
> sorry had to back this out again for another bustage, this time b2g
> https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> inbound
This looks like a shutdown deadlock on the CompositorThread as in https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
Assignee
Comment 90•9 years ago
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> (In reply to Carsten Book [:Tomcat] from comment #87)
> > sorry had to back this out again for another bustage, this time b2g
> > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > inbound
>
> This looks like a shutdown deadlock on the CompositorThread as in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
Patch bug_1152046_disable_on_ffos.patch only disables use of the ClosingService thread on b2g (until bug 1204837).
I think there is another bug, a deadlock, and this enabling/disabling of the ClosingService only exposes it.
Currently all calls to PR_Close are dispatched to a ClosingService thread, and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so PR_Close is called on the socket thread (as it used to be before bug_1152046_prclose_v2.patch landed, around 15 September). It is probably a timing issue that makes this patch expose another bug. I do not see any other connection between the CompositorThread and this bug.
What do you think?
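For reference, the disabling presumably amounts to a build-time guard around starting the service - a sketch, assuming b2g builds are identified by MOZ_WIDGET_GONK (the actual patch is not quoted in this bug):

    // Start the closing service everywhere except b2g (MOZ_WIDGET_GONK),
    // where it stays disabled until bug 1204837 is resolved.
    #if !defined(MOZ_WIDGET_GONK)
        ClosingService::Start();
    #endif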
Flags: needinfo?(dd.mozilla) → needinfo?(cyu)
Comment 91•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #90)
>
> Patch bug_1152046_disable_on_ffos.patch only disables use of the
> ClosingService thread on b2g (until bug 1204837).
>
> I think there is another bug, a deadlock, and this enabling/disabling of the
> ClosingService only exposes it.
> Currently all calls to PR_Close are dispatched to a ClosingService thread,
> and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so
> PR_Close is called on the socket thread (as it used to be before
> bug_1152046_prclose_v2.patch landed, around 15 September). It is probably a
> timing issue that makes this patch expose another bug. I do not see any
> other connection between the CompositorThread and this bug.
> What do you think?
Exactly. Some patch introduced this regression after the ClosingService landed, and it was hidden; now it's revealed. I don't think the ClosingService has anything to do with the deadlock on mochitest shutdown.
Flags: needinfo?(cyu)
Reporter
Comment 92•9 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #90)
> (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> > (In reply to Carsten Book [:Tomcat] from comment #87)
> > > sorry had to back this out again for another bustage, this time b2g
> > > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > > inbound
> >
> > This looks like a shutdown deadlock on the CompositorThread as in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
>
>
> Patch bug_1152046_disable_on_ffos.patch only disables use of the
> ClosingService thread on b2g (until bug 1204837).
>
> I think there is another bug, a deadlock, and this enabling/disabling of the
> ClosingService only exposes it.
> Currently all calls to PR_Close are dispatched to a ClosingService thread,
> and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so
> PR_Close is called on the socket thread (as it used to be before
> bug_1152046_prclose_v2.patch landed, around 15 September). It is probably a
> timing issue that makes this patch expose another bug. I do not see any
> other connection between the CompositorThread and this bug.
> What do you think?
If you are looking for a quick fix, just go with disabling the code specifically on the platforms where it causes issues. We can investigate later what's actually wrong.
Personally I think your service shutdown code and some of my 'concurrent access' concerns might be worth checking on.
Flags: needinfo?(honzab.moz)
Assignee
Comment 93•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #92)
> (In reply to Dragana Damjanovic [:dragana] from comment #90)
> > (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> > > (In reply to Carsten Book [:Tomcat] from comment #87)
> > > > sorry had to back this out again for another bustage, this time b2g
> > > > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > > > inbound
> > >
> > > This looks like a shutdown deadlock on the CompositorThread as in
> > > https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
> >
> >
> > Patch bug_1152046_disable_on_ffos.patch only disables use of the
> > ClosingService thread on b2g (until bug 1204837).
> >
> > I think there is another bug, a deadlock, and this enabling/disabling of
> > the ClosingService only exposes it.
> > Currently all calls to PR_Close are dispatched to a ClosingService thread,
> > and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so
> > PR_Close is called on the socket thread (as it used to be before
> > bug_1152046_prclose_v2.patch landed, around 15 September). It is probably
> > a timing issue that makes this patch expose another bug. I do not see any
> > other connection between the CompositorThread and this bug.
> > What do you think?
>
> If you are looking for a quick fix, just go with disabling the code
> specifically on the platforms where it causes issues. We can investigate
> later what's actually wrong.
>
> Personally I think your service shutdown code and some of my 'concurrent
> access' concerns might be worth checking on.
I need-info'd you for a review of patch bug_1152046_move_CLosingService_start_toioservice.patch,
and the issue you responded to has grown complications: see bugs 1207979 and 1208019.
Flags: needinfo?(honzab.moz)
Comment 94•9 years ago
Please note bug 1204837 lands a fix for ClosingService on b2g so we don't need to disable it on b2g anymore.
Reporter
Comment 95•9 years ago
Dragana, it wasn't clear what you needed.
I'll review the patch.
I don't follow what you need from me on "the issues".
- according to comment 94 it seems we no longer need bug 1208019, right?
- the same may apply to bug 1207979, or not?
Please state more clearly how I can help here.
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
Assignee
Comment 96•9 years ago
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #95)
> Dragana, it wasn't clear what you needed.
>
> I'll review the patch.
>
> I don't follow what you need from me on "the issues".
>
> - according to comment 94 it seems we no longer need bug 1208019, right?
> - the same may apply to bug 1207979, or not?
>
> Please state more clearly how I can help here.
Yes, we do not need them, but there is another bug that I do not want to get missed, because it is going to show up at some point.
Explanation:
Patch bug_1152046_prclose_v2.patch (the implementation of ClosingService) was pushed on 14 September. A couple of hours later the same day they tried to back it out, but the backout did not work because of:
https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14
The thing is that a patch from bug 1127270 was pushed on the same day. If you disable ClosingService or back it out, that patch causes problems; if ClosingService is used, everything is fine. This bug and bug 1127270 do not have anything in common, so I believe there is a bug in the patch for bug 1127270 that only shows up when ClosingService is not used.
So we do not need bug 1208019, but I want someone to look at bug 1127270 (ImageBridgeParent) and find an explanation for why it fails when ClosingService is not used.
Flags: needinfo?(dd.mozilla)
Reporter | ||
Comment 97•9 years ago
|
||
Comment on attachment 8661965 [details] [diff] [review]
bug_1152046_move_CLosingService_start_toioservice.patch
Review of attachment 8661965 [details] [diff] [review]:
-----------------------------------------------------------------
Hmm.. are you sure that nsSocketTransportService::Init is called just once?
If there are continuing issues around this code I can take a deeper look next week, when I'm back at the office.
::: netwerk/base/ClosingService.cpp
@@ +184,5 @@
> {
> {
> mozilla::MonitorAutoLock mon(mMonitor);
> + if (mShutdown) {
> + // This should not happened.
"This should never happen."
OTOH, I think it may and having a check like this is good.
Worth adding a similar "only-once" check to the start method?
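For concreteness, a minimal sketch of what such an only-once check could look like, mirroring the shutdown guard above (mStarted is a hypothetical flag and the method body is assumed, not the actual patch):

void
ClosingService::Start()
{
  mozilla::MonitorAutoLock mon(mMonitor);
  if (mStarted || mShutdown) {
    // Only-once: a second Start(), or a Start() after Shutdown(),
    // is silently ignored.
    return;
  }
  mStarted = true;
  // ... spawn the closing thread here ...
}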
::: netwerk/base/nsIOService.cpp
@@ +1064,5 @@
> NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");
> }
> + if (mShutdown) {
> + ClosingService::Shutdown();
> + }
maybe rather put this into the NS_XPCOM_SHUTDOWN_OBSERVER_ID topic handling code directly?
if there is some reason not to (and to keep it here instead), please add a comment.
Attachment #8661965 -
Flags: review+
Assignee | ||
Comment 98•9 years ago
|
||
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #97)
> Comment on attachment 8661965 [details] [diff] [review]
> bug_1152046_move_CLosingService_start_toioservice.patch
>
> Review of attachment 8661965 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Hmm.. are you sure that nsSocketTransportService::Init is called just
> once?
I noticed that nsSocketTransportService::Shutdown can be called more than once; it is called for a profile change as well, and that is a problem :). (Init is called just once.)
>
> ::: netwerk/base/nsIOService.cpp
> @@ +1064,5 @@
> > NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");
> > }
> > + if (mShutdown) {
> > + ClosingService::Shutdown();
> > + }
>
> maybe rather put this into the NS_XPCOM_SHUTDOWN_OBSERVER_ID topic handling
> code directly?
>
I have put it there, after the SocketTransportService shutdown, to let more PR_Close calls get through, because most of the time they do not cause a problem.
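To make the intended ordering explicit, here is the fragment from the diff above with annotations added (the comments are mine, not part of the patch):

// 1. Shut down the socket transport service first; the PR_Close calls
//    it issues for the remaining sockets can still be served by the
//    ClosingService thread.
rv = mSocketTransportService->Shutdown();
NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");

// 2. Only then stop the ClosingService, and only on real XPCOM
//    shutdown (mShutdown), not on a profile change.
if (mShutdown) {
  ClosingService::Shutdown();
}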
Assignee | ||
Comment 99•9 years ago
|
||
Attachment #8661965 -
Attachment is obsolete: true
Attachment #8669470 -
Flags: review+
Assignee | ||
Comment 100•9 years ago
|
||
check in only for
bug_1152046_move_CLosingService_start_toioservice.patch
Keywords: checkin-needed
Assignee | ||
Updated•9 years ago
|
Attachment #8664957 -
Flags: checkin+
Comment 101•9 years ago
|
||
Keywords: checkin-needed
Comment 103•9 years ago
|
||
Dragana, do you have plans wrt this bug for 42? Thanks
Assignee | ||
Comment 104•9 years ago
|
||
(In reply to Sylvestre Ledru [:sylvestre] from comment #103)
> Dragana, do you have plans wrt to this bug for 42? Thanks
I was not planning to uplift the last two patches, although there has been no problem with them (since 17 September or so).
Flags: needinfo?(dd.mozilla)
Comment 105•9 years ago
|
||
OK, I will mark 42 & 43 as wontfix then. Thanks
Assignee | ||
Comment 106•9 years ago
|
||
I was looking at some crash logs again:
I have not seen any PR_Close hangs.
I have seen PR_Connect hangs even though we check gIOService->IsShutdown() before trying to connect; we hang there for a long time.
There is also an nsAutodial hang (we also do not try to reconnect if we are in shutdown).
And there are some poll hangs.
Things that can have an influence on these hangs:
avcuf32.dll
rlls.dll
winhadnt.dll
(but not all of the crashes have one of these)
Windows sockets are made nonblocking using rv = _md_MakeNonblock((HANDLE)osfd); just before calling connect():
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
I could check whether we set the nonblocking flag correctly, but it is hard to verify that it is really set on the OS socket.
I have also read some of the comments people left: there were 2-3 saying that Firefox is really slow. I assume Firefox was slow, they tried to shut it down, and this crash happened, so there was probably already a problem before they clicked shutdown.
Assignee | ||
Updated•9 years ago
|
Comment 107•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #106)
> Windows sockets are made nonblocking using rv = _md_MakeNonblock((HANDLE)osfd);
> just before calling connect():
> http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
> I could check whether we set the nonblocking flag correctly, but it is hard
> to verify that it is really set on the OS socket.
Heads up: ntio.c isn't used by our codebase, as NSPR is configured with the WIN95 configuration. See https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/NSPR_build_instructions#--enable-win32-target.3DWIN95
It looks like the WIN95 configuration does not support this at all.
Reporter | ||
Comment 108•9 years ago
|
||
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #107)
> (In reply to Dragana Damjanovic [:dragana] from comment #106)
> > Windows sockets are made nonblocking using rv =
> > _md_MakeNonblock((HANDLE)osfd); just before calling connect():
> > http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
> > I could check whether we set the nonblocking flag correctly, but it is
> > hard to verify that it is really set on the OS socket.
>
> Heads up: ntio.c isn't used by our codebase, as NSPR is configured with the
> WIN95 configuration. See
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/NSPR_build_instructions#--enable-win32-target.3DWIN95
>
> It looks like the WIN95 configuration does not support this at all.
We do set PR_SockOpt_Nonblocking = true on the socket in nsSocketTransport::InitiateSocket(), but looking into how that option is handled, I cannot find that we actually set anything directly on the system socket. The only thing we do is set the secret->nonblocking = true flag on our FD in prmapopt.c!_PR_SocketSetSocketOption. This flag, however, doesn't lead to any other setting being made on the socket (w95sock.c).
Dragana, we could have a private patch for NSPR that copies that particular piece of code from ntio.c to w95sock.c, to try it out. Submitting it directly to NSPR should not be that hard if it is found to work.
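For reference, the Winsock call that flips a socket into nonblocking mode is ioctlsocket() with FIONBIO; a minimal standalone sketch of what such a helper could look like (the function name is mine, not NSPR's):

#include <winsock2.h>

// Returns true if the OS-level socket was switched to nonblocking mode.
static bool MakeNonblocking(SOCKET osfd)
{
  u_long one = 1;  // nonzero enables nonblocking mode
  return ioctlsocket(osfd, FIONBIO, &one) == 0;
}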
Assignee | ||
Comment 109•9 years ago
|
||
This is a short update.
w95sock.c always makes sockets nonblocking:
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95sock.c#63
Blocking mode is emulated inside NSPR: for each read/write on an FD marked blocking, it loops until the operation completes.
I will dig into this further.
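A simplified sketch of that emulation pattern (my reconstruction, not the actual NSPR code): the OS socket is always nonblocking, and for an FD marked blocking the read is retried until it completes:

#include <winsock2.h>

// Sketch only: emulate a blocking recv() on a nonblocking OS socket.
static int EmulatedRecv(SOCKET osfd, char* buf, int amount, bool fdIsBlocking)
{
  for (;;) {
    int rv = recv(osfd, buf, amount, 0);
    if (rv >= 0) {
      return rv;  // got data (or EOF) -- done
    }
    if (!fdIsBlocking || WSAGetLastError() != WSAEWOULDBLOCK) {
      return -1;  // real error, or a genuinely nonblocking FD
    }
    // Blocking FD on a nonblocking OS socket: wait until readable,
    // then retry (NSPR uses a select()-style wait like this one).
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(osfd, &readable);
    select(0, &readable, NULL, NULL, NULL);  // first arg ignored on Windows
  }
}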
Reporter | ||
Comment 110•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #109)
> This is a short update.
>
> w95sock.c always makes sockets nonblocking:
> http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95sock.c#63
I knew it was somewhere! Thanks Dragana.
Assignee | ||
Comment 111•9 years ago
|
||
I noticed a couple of crashes in PR_ConnectContinue:
https://crash-stats.mozilla.com/report/index/7ed839dd-92fe-4b40-8586-d1abb2160114#allthreads
https://crash-stats.mozilla.com/report/index/32b4adec-87cc-4560-8373-e063c2160114#allthreads
https://crash-stats.mozilla.com/report/index/59eb432d-e0bb-4a8c-98fb-60eab2160114#allthreads
We can skip PR_ConnectContinue if we are in shutdown.
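A hedged sketch of that guard (the placement and the exact error code are assumptions; gIOService->IsShutdown() is the check we already use before PR_Connect):

// In the connect-continue path of the socket transport (sketch, not
// the actual patch): bail out before touching the OS socket once the
// IO service is shutting down.
if (gIOService->IsShutdown()) {
  // Fail fast instead of entering PR_ConnectContinue(), whose
  // getsockopt() call is where the reported hangs sit.
  return NS_ERROR_ABORT;
}
PRStatus status = PR_ConnectContinue(fd, outFlags);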
Comment 112•9 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #111)
> I noticed a couple of crashes in PR_ConnectContinue:
In these three cases it seems to have stopped here: http://hg.mozilla.org/releases/mozilla-beta/annotate/7ccdff6c2b81/nsprpub/pr/src/io/prsocket.c#l294 .. which is a getsockopt() call on the actual underlying socket descriptor.
Assignee | ||
Comment 113•9 years ago
|
||
(In reply to Daniel Stenberg [:bagder] from comment #112)
> (In reply to Dragana Damjanovic [:dragana] from comment #111)
> > I noticed a couple of crashes in PR_ConnectContinue:
>
> In these three cases it seems to have stopped here:
> http://hg.mozilla.org/releases/mozilla-beta/annotate/7ccdff6c2b81/nsprpub/pr/src/io/prsocket.c#l294
> .. which is a getsockopt() call on the actual underlying socket descriptor.
I know, but when we are in shutdown we can just make it fail here.
Assignee | ||
Comment 114•9 years ago
|
||
I noticed that we are doing some reads on the socket when we are closing an nsHttpConnection:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnection.cpp#599
If we need to reduce the amount of work we do during shutdown, we can skip this. I have not seen any crash on it, but I am writing it down here to keep it in mind in case we need it.
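If we ever want that, a hedged sketch of the skip (the placement inside the close path and the shutdown check are assumptions; mSocketIn is the connection's input stream):

// Sketch only: skip the final best-effort drain of the socket when
// the browser is shutting down.
if (mSocketIn && !gIOService->IsShutdown()) {
  char buf[1024];
  uint32_t countRead;
  // Drain whatever is buffered so the close is clean; safe to skip
  // during shutdown, when we no longer care about the data.
  mSocketIn->Read(buf, sizeof(buf), &countRead);
}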
Updated•9 years ago
|
Whiteboard: [necko-active]
Assignee | ||
Comment 115•9 years ago
|
||
I am going to close this bug. Bug 1158189 is used as the meta bug. We are going to open separate bugs for similar issues for simplicity, since this one has become flooded.
Updated•7 years ago
|
Target Milestone: --- → mozilla44