Open Bug 421128 Opened 17 years ago Updated 2 years ago

Independent windows/tabs should not starve each other for network connections

Categories: Core :: Networking, defect, P3
Platform: x86 Linux

Tracking: Target Milestone: Future

People: Reporter: BenB; Assignee: Unassigned

Whiteboard: [Snappy:p1][lame-network] repro see comment 13 [necko-backlog]

Attachments: 2 files

Reproduction:
Have a server or server group that is not responding at all (but does accept connections), or responds *very* slowly. Open a page on these servers with many (a few dozen) images (these can be the small icons etc. of a typical heavy webpage). Observe that the page doesn't load at all; get bored. Open a new browser window, go to your favourite news site and/or Google.

Actual result:
The Google page does not load. It looks like your network connection (or browser) is entirely broken. Once the connections to the slow/unresponsive server time out (by default after 60-120s, i.e. longer than the most patient user will wait), you can go to Google again.

Expected result:
A request to domain B in window B is entirely independent from a request to domain A in window A.

Implementation:
There is a limit on the number of connections per server and overall. Group network connections by page and by domain, restrict network connections only based on these (e.g. 10 per page, 50 per domain), and remove or seriously increase (200) the absolute overall connection limit for the whole browser.
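To make the reproduction possible without a real broken host, a minimal "black hole" server can be sketched in a few lines of Python; this is only an illustration (host, port, and backlog are arbitrary values, not anything specified in this bug). It accepts TCP connections but never reads or responds, so requests simply hang until the client times out.

import socket

def black_hole_server(host="127.0.0.1", port=8080):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(128)
    print(f"black-hole server listening on {host}:{port}")
    held = []  # keep accepted sockets alive so the connections never close
    while True:
        conn, _addr = srv.accept()
        held.append(conn)  # accept the connection, then never read or respond

if __name__ == "__main__":
    black_hole_server()

Pointing a page full of image URLs at a couple of instances of this is enough to tie up the browser's connection pool in the way described above.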
See also bug 421125. When I ran into this, I assumed that my Internet connection was broken, or very flaky (coming and going), and didn't realize it was only one server, which made diagnosing and solving the problem a lot more time-consuming. I can also see ISP customers calling their ISP hotline over this.
Blocks: 384323
Patrick, is this still an issue, or can this bug be resolved?
Assignee: nobody → mcmanus
Whiteboard: [Snappy]
Please don't assign bugs to me that I'm not actively working on, as I understand that is what the assignment field means.

I suspect the original 'slow' site in this case was significantly sharded and consumed the entire 30-connection pool, leaving nothing for opening a new tab. It's also possible that it just exhausted the 6-per-host limitation of HTTP and the new tab used the same site, though that seems less likely from the description.

We have landed code that increased the total connection limit to 256 for just this reason, though it has been backed off to something smaller (64? I forget) for Windows due to some problems with LSPs. If the higher overall limit is not sufficient, we could build in a limit per load group.
Assignee: mcmanus → nobody
> site ... was significantly sharded and consumed the entire 30 connection pool,
> leaving nothing for opening a new tab.

That's what this bug is about: that should never happen. One site must not interfere with other sites.
> we could build in a limit per load group.

That sounds like the solution.
(In reply to Ben Bucksch (:BenB) from comment #5)
> > we could build in a limit per load group.
>
> That sounds like the solution

It would be a net positive. It still leaves in place the global 6-per-hostname limitation; changing that to 6 per tab would drift farther from the standard and is probably not usually relevant to the observed behavior. Recent changes have increased the size of the global pool, so, triaging this, it might not be a big problem currently.
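To illustrate what a per-load-group cap could mean in practice, here is a minimal sketch of the admission check (not Necko code; the constants, names, and data structure are made up for illustration): the global pool is still bounded, but no single load group (roughly, one page/tab) may take more than its share, and the usual per-host limit still applies.

MAX_GLOBAL = 256     # overall pool (illustrative)
MAX_PER_GROUP = 32   # hypothetical per-load-group cap
MAX_PER_HOST = 6     # existing per-host limit

active = []  # list of (load_group, host) pairs for currently open connections

def may_open_connection(load_group, host):
    if len(active) >= MAX_GLOBAL:
        return False  # pool exhausted
    if sum(1 for g, _ in active if g == load_group) >= MAX_PER_GROUP:
        return False  # this page already holds its share; other tabs stay unaffected
    if sum(1 for _, h in active if h == host) >= MAX_PER_HOST:
        return False  # standard per-host limit still applies
    return True

Under such a partition, a stalled site could at worst pin MAX_PER_GROUP connections, leaving the rest of the pool free for other tabs.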
Patrick, sorry about ruining your assignment flow. I'll avoid that in future. Would it be hard to add telemetry so we know whether this is a problem that needs addressing?
I don't think any site in any circumstance should be allowed to exhaust all resources of a certain kind and thus block my browser. I didn't even run into this in a deliberate attack scenario, but just with normal sites, when my Internet connection broke.

When I ran into this, I thought my Internet connection was broken. I tried to fix the connection, but I still couldn't load websites, not any website, so I assumed the connection was still broken. I would never have thought that when all sites in my browser are stalled, it could be because of one single site. It took me considerable time to figure out what was going on.
Hey Ben, did you run into this in the field recently (the last release, maybe two) or is the anecdote older than that? I ask because the overall pool size was recently increased significantly, but the algorithm hasn't changed. It's not clear how much of a problem it is in practice now, as it's harder for one page to saturate the total pool. I agree a partition makes sense, at least for the global limit, but I ask for reasons of project prioritization.
> Hey Ben, did you run into this in the field recently (the last release,
> maybe two) or is the anecdote older than that?

No, because I never wanted to run into this again, so I set the limits in my profile very high:

network.http.max-connections = 500
network.http.max-connections-per-server = 50
network.http.max-persistent-connections-per-proxy = 200
network.http.max-persistent-connections-per-server = 50

When I had the values a little lower (but still higher than default), I was still running into it, IIRC. But I considered that only a band-aid (not guaranteed to fix it), not a real fix; that's why I filed this bug.
I'm using a proxy server. For the record, the defaults currently seem to be:

network.http.max-connections = 256
network.http.max-connections-per-server = 15
network.http.max-persistent-connections-per-proxy = 8
network.http.max-persistent-connections-per-server = 6
network.http.pipelining.maxrequests = 4
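For reference, these limits are ordinary prefs, so the workaround values quoted above can be applied from a profile's user.js; this snippet simply mirrors those values and is not a recommended configuration:

user_pref("network.http.max-connections", 500);
user_pref("network.http.max-connections-per-server", 50);
user_pref("network.http.max-persistent-connections-per-proxy", 200);
user_pref("network.http.max-persistent-connections-per-server", 50);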
Will mark this snappy:p3 until I see evidence that this is reproducible.
Whiteboard: [Snappy] → [Snappy:p3]
You can probably easily reproduce it like this:

1. Make a webpage with 1000 pictures, all with different URLs, from 10 different servers.
2.a. Make these webservers act like a black hole: they don't refuse the connection (!), they simply don't respond at all.
2.b. ALTERNATIVELY, set up your network connection so that DNS works (cached), but all requests go into a black hole. They are not rejected on the IP or TCP level, they just go nowhere. This is a realistic scenario; that's how I ran into the bug.
3. Go to Google.
4. Fix the broken servers or your broken network connection.
5. Go to Google.

Actual result:
Steps 3 and 5: Google doesn't work.

Expected result:
Step 5: Google works.
Step 3 with case 2.a. (broken server): Google works.

Importance: The real problem with this bug is:
- Even if you have case 2.a. (broken server), the user will assume case 2.b. (his network connection is broken), because Google doesn't work.
- Even if you really had case 2.b. (broken network connection acting as a black hole), and you have successfully fixed it, it looks to you as if it's still broken, because Google doesn't work. So you tear it down again, or try to change something else, and in the process of trying and trying you make a mistake and actually *really* break your network connection permanently. True story! ;-)

Taras wrote:
> Will mark this snappy:p3 until I see evidence that this is reproducible.

Removing P3 marker for re-triage.
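To make step 1 concrete, a throwaway generator for such a test page might look like this (hostnames, counts, and the output filename are placeholders; point the hosts at whatever black-hole servers or blackholed network you set up):

hosts = [f"blackhole{i}.test" for i in range(10)]  # 10 placeholder servers

with open("repro.html", "w") as f:
    f.write("<!DOCTYPE html><html><body>\n")
    for n in range(1000):
        host = hosts[n % len(hosts)]
        # 1000 distinct image URLs spread across the 10 hosts
        f.write(f'<img src="http://{host}/img{n}.png" width="16" height="16">\n')
    f.write("</body></html>\n")

Loading repro.html and then opening Google in a second window should show whether the connection pool is being starved.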
Whiteboard: [Snappy:p3] → [Snappy]
More of a [lame-network] bug (i.e. networking in very lame conditions) than a [snappy] one, imo. lame-network is a down-the-road project for me to even start (6 or 8 weeks, maybe); feel free to add the whiteboard tag to anything that might apply.
Whiteboard: [Snappy] → [Snappy][lame-network]
Perfect. Restoring [Snappy:p3].
Whiteboard: [Snappy][lame-network] → [Snappy:p3][lame-network]
Just to be clear: weirdly slow pageloading is on the snappy radar. Patrick, thanks for taking this on.
I think I'm seeing this right now as a side effect of bug 731130. It again seems associated with google sites. Specifically, I am getting a busywait on an SSL connection to pz-in-f84.1e100.net, which appears to be part of google.com's CDN. This did not happen yesterday when I was on a fast network. Today I am on a home DSL line.
Attached file: per-server connection count (deleted)
This is the output of

perl -lne 'print $1 if /TCP \S*->([^:]*)/' /tmp/lsof.txt | sort | uniq -c | sort -n

on the previous attachment. It actually doesn't show a high connection count for any one server, which is now making me doubt this is the same thing. The total number of connections is 143, though.
Mine turned out to be bug 710176, which has nothing to do with connection exhaustion. Sorry for the noise.
This made firefox nearly useless on public wifi during our FOSDEM workweek.
Whiteboard: [Snappy:p3][lame-network] → [Snappy:p1][lame-network]
Blocks: 725023
Whiteboard: [Snappy:p1][lame-network] → [Snappy:p1][lame-network] repro see comment 13
Assignee: nobody → mcmanus
Target Milestone: --- → Future
Worth noting that the per-proxy limit is also a likely bottleneck.
Blocks: 778884
Whiteboard: [Snappy:p1][lame-network] repro see comment 13 → [Snappy:p1][lame-network] repro see comment 13[necko-backlog]
Priority: -- → P1
Priority: P1 → P3

Unassigning bugs owned by Patrick.

Assignee: mcmanus → nobody

Hi Ben,
This is an old issue, but do you know whether it is still reproducible or relevant on the latest Firefox version?
If it is not, it should be closed. Please take a look at this when you have the time.

Flags: needinfo?(ben.bucksch)

Hi Timea, this happens in a specific situation with an overloaded web server, which in turn makes Firefox block all web requests, even to other servers. Testing it requires a complex setup to imitate an overloaded web server. I have a company to run and don't have time to re-test this right now. Can you please assign someone from your QA team to try to follow the reproduction steps?

Flags: needinfo?(ben.bucksch)

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: major → --
Severity: -- → S3