Closed Bug 352853 Opened 18 years ago Closed 18 years ago

64k update chunks are maxing out mirrors' connection limits before they use up their bandwidth

Categories

(Toolkit :: Application Update, defect)

1.8.0 Branch
defect
Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: justdave, Assigned: darin.moz)

References


Details

(Keywords: fixed1.8.0.8, fixed1.8.1)

Attachments

(1 file, 2 obsolete files)

We had a complaint from one of the mirrors today about the 64k chunk size used by the update service. It seems the number of people updating nowadays means the number of clients hitting simultaneously to grab these little tiny chunks is growing fast, and because of the cost of setup/teardown on each connection, all of the available connection slots on the server are getting used up long before they max out their bandwidth.

From IRC:

10:12:09 < maswan> justdave: anyway, can you please pass on that 1-meg chunks might be fine, but 64k really is killing us and not sustainable for larger releases. If this goes on we'll probably have to start aiming lower for the ACC mirror, which would be sad given that we have plenty of bandwidth to spare.
I got the same request from multiple other mirrors so I second this...
The fix itself would be trivial, but what's the risk? Seems like something worth considering for Fx2.
Flags: blocking-firefox2?
I don't see what the risk could be, but then I don't understand why small chunks are used in the first place. Anyone care to fill me in? I'm curious.
The risk is the impact this may have on dial-up users. If their modem connection is saturated with a 1-meg download, then they will be very unhappy users. 64k was as large as I thought we could go without impairing dial-up users significantly. The downloader (nsIncrementalDownload.cpp) could really benefit from some adaptive logic that figures out the available bandwidth and adjusts accordingly.
Actually, perhaps an easy way to solve this problem would be for the downloader to increase its fetch size but read it at a slower rate. For example, after each 64k chunk, the downloader could suspend the network channel for a couple of seconds to relieve pressure on the user's network connection. This would cause some extra ACKs across the network as the two TCP ends recognize the slowdown, which might also suck. Other ideas?
(I'm maswan in the IRC paste.) Well, instead of hanging up after 64k, how about sleeping half a second every second, or some similar approach? Or, for that matter, if you want to do it by bytes: read N k from the net, then sleep for as long as that took to read. (Sleeping as long as each read took caps the transfer at roughly half the link's available rate, which is the point.) If you only do 64k before hanging up, you'll never get good rates for those of us on good bandwidth either, considering that it takes much more than that to get a decent-sized TCP window. We were quite prepared to handle a 1+ Gbit/s burst in traffic, but 1k+ requests per second was quite the surprise, and not what our mirror was tuned for.
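A rough sketch of the suspend-and-sleep idea from the last two comments, for discussion only (this is not the patch): nsIRequest.suspend()/resume() and nsITimer are real Gecko APIs, but the surrounding names and the pacing policy are invented here.

  // Sketch: after each ~64k block, suspend the channel for about as long
  // as the block took to arrive, capping us near half the link's rate
  // while keeping a single connection open instead of re-connecting.
  const BLOCK_SIZE = 65536;      // pace in ~64k blocks
  var blockStart = Date.now();
  var bytesThisBlock = 0;
  var pacingTimer = null;        // hold a reference so the timer isn't GC'd

  function onDataAvailable(request, context, stream, offset, count) {
    bytesThisBlock += count;
    if (bytesThisBlock >= BLOCK_SIZE) {
      var elapsedMs = Date.now() - blockStart;
      bytesThisBlock = 0;
      request.suspend();         // stop reading; TCP flow control takes over
      pacingTimer = Components.classes["@mozilla.org/timer;1"]
                              .createInstance(Components.interfaces.nsITimer);
      pacingTimer.initWithCallback({ notify: function(timer) {
                                       blockStart = Date.now();
                                       request.resume();
                                     } },
                                   elapsedMs,
                                   Components.interfaces.nsITimer.TYPE_ONE_SHOT);
    }
    // ... copy the `count` bytes from `stream` to the .mar file as usual ...
  }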
(In reply to comment #6)
> We were quite prepared to handle a 1+ Gbit/s burst in traffic, but 1k+
> requests per second was quite the surprise, and not what our mirror was
> tuned for.

And we are getting log management problems due to a gazillion 64k requests (like 30 million or so in a day).
So, currently FF fetches a 64k chunk every minute. Over a 50 kbps modem, that takes about 10 seconds to download. A typical partial update is around 500k, which takes only 8 chunks to download (or ~8 minutes).

Ignoring the logs issue for a moment, have you guys tried increasing the keep-alive interval of your HTTP connections to over a minute? That would eliminate the setup and teardown cost associated with the TCP connections.

We can also very easily tune FF to fetch the 64k chunk less frequently. If we aim for having partial updates delivered within the span of an hour, then we could wait about 7-8 minutes between chunks.

We can and should also make the downloader use a more sophisticated method of minimizing bandwidth utilization, so that we can avoid 64k chunks in the first place. However, that will entail more risk.
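The 7-8 minute figure falls out of simple arithmetic; a throwaway illustration, not code from the tree:

  // ceil(500000 / 65536) = 8 chunks; 3600 s / 8 chunks = 450 s, ~7.5 min.
  function chunkIntervalSeconds(updateBytes, chunkBytes, targetSeconds) {
    var chunks = Math.ceil(updateBytes / chunkBytes);
    return Math.floor(targetSeconds / chunks);
  }
  chunkIntervalSeconds(500000, 65536, 3600);  // => 450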
Attached patch v1 patch (obsolete) (deleted) — Splinter Review
OK, this patch bumps the chunk size up to 100000 bytes and the interval up to 10 minutes. This is hopefully a decent short-term solution.
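The patch itself isn't inlined in this page. Given the constants quoted from nsUpdateService.js.in later in this bug, it presumably amounts to something like the following; the constant names are real, but the original values and units are assumptions here (the interval is taken to be in seconds):

  // toolkit/mozapps/update/src/nsUpdateService.js.in (sketch, not the diff)
  const DOWNLOAD_CHUNK_SIZE          = 100000;  // bytes; was 65536
  const DOWNLOAD_BACKGROUND_INTERVAL = 600;     // seconds; was 60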
Assignee: nobody → darin
Status: NEW → ASSIGNED
Attachment #238950 - Flags: review?
Attachment #238950 - Flags: review? → review?(robert.bugzilla)
Attachment #238950 - Flags: review?(robert.bugzilla) → review+
Darin - why 100k? Why not 256K or bigger? On Windows, update sizes and the number of requests per update at various chunk sizes have been:

         size   65K  100K  200K  300K
  0.1:   751K    12     8     4     3
  0.2:   557K     9     6     3     2
  0.3:   232K     4     3     2     1
  0.4:   511K     8     6     3     2
  0.5:   565K     9     6     3     2
  0.6:    49K     1     1     1     1
  0.7:   531K     9     6     3     2

Do we deal gracefully if the partial chunk download is aborted? What's the advantage to keeping it small?
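(For reference, the request counts above are just ceil(update size / chunk size); e.g. a 751K update at 300K chunks is ceil(751/300) = 3 requests.)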
(In reply to comment #8)
> Ignoring the logs issue for a moment, have you guys tried increasing the
> keep-alive interval of your HTTP connections to over a minute? That would
> eliminate the setup and teardown cost associated with the TCP connections.

We actually dropped max keep-alive from 5 seconds to 1 second, since we only have a finite number of threads. Over a minute would require something like 100k MaxClients (at ~1,000-1,200 requests/s, connections held open for 60+ seconds works out to 60,000-72,000 simultaneous connections, plus headroom), and I'm not sure either our machines or Apache would be happy with 100k threads, even if most of them are just idling in keep-alive.

> We can also very easily tune FF to fetch the 64k chunk less frequently. If we
> aim for having partial updates delivered within the span of an hour, then we
> could wait about 7-8 minutes between chunks.

Does this really matter? In total the same number of chunks are going to be fetched; it will just be more spread out over time per client (but likely, more different clients will be fetching at every moment).

/Mattias Wadenstein
> Darin - why 100k? Why not 256K or bigger?

100k takes about 15 seconds to download over a 50 kbps modem (100 KB is ~800 kbits; at 50 kbps that's ~16 s). During that time the user's network is saturated with this download, and they will be unable to use their network connection for much of anything else. Their OS will ensure that other TCP connections are not entirely blocked out, but it will certainly seem like their connection is suddenly much slower for that interval of time. So, I think we should strive to keep that interval small-ish.

> Do we deal gracefully if the partial chunk download is aborted?

Yes, the downloader is designed for that. It'll pick up where it left off the next time the browser is started.

> Does this really matter? In total the same number of chunks are going to be
> fetched; it will just be more spread out over time per client (but likely,
> more different clients will be fetching at every moment).

I was assuming that a big part of the problem was the swell of downloads after a release, so if we spread that swell out a little more, it would help. Instead of a big spike in connections, with this change you'd see a softer spike in connections, no?
> I was assuming that a big part of the problem was the swell of downloads after
> a release, so if we spread that swell out a little more, it would help.
> Instead of a big spike in connections, with this change you'd see a softer
> spike in connections, no?

I think the primary problem is not the swell in downloads (as Mattias says in comment 6, they were "quite prepared to handle 1+ Gbit/s burst in traffic") but rather inefficient utilization of capacity. Currently, a significant portion of mirror bandwidth isn't being used because the servers hit their connection limits before they hit their bandwidth capacity. If you spread out the swell by pausing longer between requests, so the update gets downloaded more slowly, then more users will be able to start downloading the software sooner, but you still won't take any better advantage of the mirrors' additional bandwidth capacity.

The secondary problem is that many small requests create overlarge logs, increasing the log management burden for mirrors. This problem also won't be mitigated by spreading out downloads, since the additional users who become able to connect will keep the servers at their connection limits and thus keep generating the same number of log entries.
> I was assuming that a big part of the problem was the swell of downloads after
> a release, so if we spread that swell out a little more, it would help.
> Instead of a big spike in connections, with this change you'd see a softer
> spike in connections, no?

Spreading out the spike by a few hours won't make much of a difference. We're still seeing >500 requests per second today, several days after the release. And while this is less than half the peak, it is still way more than we find reasonable.

http://www.acc.umu.se/technical/statistics/ftp/monitordata/index.html.en

For reference, look at saimei on that graph: it does a third of the bandwidth of the other two hosts, is not listed in mozilla bouncer, and just handles the other things we mirror. It does 4 requests per second right now.

Anyway, I'd really appreciate it if you could do some different kind of throttling than splitting it up into tiny chunks. In the long run, the kind of manual intervention that we're doing to keep up with this load is not really sustainable, and it might make mirrors reconsider their commitment to Mozilla mirroring. It would also be rather sad to remove all the request logging for Mozilla downloads; both we and some Mozilla people seem to want to have that around.
Darin, my concern is that 100K doesn't do enough to substantially reduce the number of requests/s for our mirrors. At 300K we'd hold the line for ~45s on dial-up (300 KB is ~2,400 kbits; at 50 kbps that's about 48 s). It's not like it is totally unusable - just slow. Combine that with 74% US broadband penetration (90% at work, and the US is 20th in the world in BB penetration) and the fact that 300k chunks will, on average, reduce the number of requests by 4x, and we think we should use that number. Given we are holding the line for 45s, the 10-minute interval also seems reasonable. Make sense?

We should open a second bug to implement a better form of bandwidth throttling for future releases.
Were there signs of this problem when the last update was released (which I wasn't around for), or on other mirrors? That was just two months ago, so one would think the traffic profile for this update would be very similar. Maybe we just reached a tipping point, but I'd want to rule out any kind of regression. That's not to say we shouldn't try to improve the download behavior, though.

I wonder if we could try to guesstimate the capability of the user's network connection by gathering some metrics at update time... for example: the time to establish the TCP connection, any delay before the server responds to the request, the download speed of the first chunk, etc. [The server's capabilities would affect these numbers, but that's not a bad thing, since we want to be nice to both ends.] The updater could then decide if it should just go ahead and snarf the rest of the update immediately, or if it should back off by trying again later.
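A sketch of that decision, with invented names and an invented threshold (only Date.now() is a real API here):

  // Time the first chunk, estimate throughput, then either keep
  // streaming or fall back to the chunk-every-N-minutes behaviour.
  const FAST_LINK_BYTES_PER_SEC = 30000;  // ~240 kbps; assumed cutoff

  var firstChunkStart = Date.now();

  function onFirstChunkDone(byteCount) {
    var secs = (Date.now() - firstChunkStart) / 1000;
    var bytesPerSec = byteCount / Math.max(secs, 0.001);
    if (bytesPerSec >= FAST_LINK_BYTES_PER_SEC) {
      downloadRemainderNow();   // hypothetical: snarf the rest at once
    } else {
      scheduleNextChunk();      // hypothetical: back off, retry later
    }
  }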
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
(oops, dunno how the "mark FIXED" field got checked. sorry!)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → NEW
Attached patch v1.1 patch - 300k chunks (deleted) — Splinter Review
OK, per discussion at the bonecho triage meeting this morning, we're going to go with a 300k chunk size and an interval of 10 minutes between chunks. (Sucks to be a modem user.) I'm also going to pursue teaching the downloader to throttle downloads itself.
Attachment #238950 - Attachment is obsolete: true
Attachment #239048 - Flags: approval1.8.1?
Attachment #239048 - Flags: approval1.8.0.8?
Comment on attachment 239048 [details] [diff] [review]
v1.1 patch - 300k chunks

a=schrep - thanks Darin!
Attachment #239048 - Flags: approval1.8.1? → approval1.8.1+
I filed bug 353182 for making nsIncrementalDownload smarter.
fixed-on-trunk, fixed1.8.1
Status: NEW → RESOLVED
Closed: 18 years ago
Keywords: fixed1.8.1
Resolution: --- → FIXED
Flags: blocking-firefox2? → blocking-firefox2+
(In reply to comment #15)
> Combine that with 74% US broadband penetration (90% at work, and the US is
> 20th in the world in BB penetration)

This product isn't a US-only product...
(I'm not saying that I disagree with that change, I'm just saying that just considering US statistics is perhaps not such a good idea)
> > Combine that with 74% US broadband penetration (90% at work, and the US is
> > 20th in the world in BB penetration)
>
> This product isn't a US-only product...
> ...
> (I'm not saying that I disagree with that change, I'm just saying that just
> considering US statistics is perhaps not such a good idea)

We weren't considering only US statistics. As schrep pointed out in his comment, the US is 20th in the world in broadband penetration, which means many other countries have an even higher percentage of Internet users on broadband, so the implemented fix is even better for users in those countries.
Ah, ok. sorry, I didn't notice that part of the comment.
We need this for 1.5.0.8 or the major update to 2.0 will kill mirrors.
Flags: blocking1.8.0.8+
Attachment #239048 - Flags: approval1.8.0.8?
Attached patch patch for the MOZILLA_1_8_0_BRANCH (obsolete) (deleted) — Splinter Review
Attachment #243439 - Flags: review?(mconnor)
Attachment #243439 - Flags: approval1.8.0.8?
Attachment #243439 - Attachment is obsolete: true
Attachment #243439 - Flags: review?(mconnor)
Attachment #243439 - Flags: approval1.8.0.8?
Attachment #239048 - Flags: approval1.8.0.9?
Attachment #239048 - Flags: approval1.8.0.8?
I have tested this patch on my 1508 tree, and it does what we expect. From my JavaScript console:

before: onProgress: http://www.sspitzer.org/darin/update-test-3/1.mar, 65536/7876331
after:  onProgress: http://www.sspitzer.org/darin/update-test-3/1.mar, 300000/7876331

Waiting for approval before I land.
To answer a question Dan raised in the 150x meeting, see http://lxr.mozilla.org/mozilla1.8.0/source/toolkit/mozapps/update/src/nsUpdateService.js.in#2212

2212   var interval = this.background ? DOWNLOAD_BACKGROUND_INTERVAL
2213                                  : DOWNLOAD_FOREGROUND_INTERVAL;
2214   this._request.init(uri, patchFile, DOWNLOAD_CHUNK_SIZE, interval);

All that differs between the background and foreground cases is the interval.
Comment on attachment 239048 [details] [diff] [review] v1.1 patch - 300k chunks approved for 1.8.0 branch, a=dveditz for drivers
Attachment #239048 - Flags: approval1.8.0.8? → approval1.8.0.8+
fix landed on the MOZILLA_1_8_0_BRANCH
Keywords: fixed1.8.0.8
Behaviour has changed from 1507 to 1508rc2.

I set prefs to check for an update soon after app start, and download in the background. 1507 downloads one 64K chunk each minute. 1508rc2 downloads one 300000-byte chunk every 10 minutes. This is as expected.

I selected Help -> Downloading ... Update from the menu. When the UI appeared, the download went to line speed for both 1507 and 1508rc2.

I clicked on the "Hide" button. After the dialog was dismissed, 1507 went back to the one-64K-chunk/min rate. 1508rc2 remained at line speed.

I measured the rate of the download by inspecting updates/0/update.mar at regular intervals (read: |ls -l| over and over).

Pref changes as follows:

app.update.interval: 60
app.update.timer: 600
app.update.url.override: file:///Users/davel/update.xml

update.xml:

<?xml version="1.0"?>
<updates>
  <update type="minor" version="2.0mt2" extensionVersion="2.0"
          buildID="2006092114">
    <patch type="complete"
           URL="http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/1.5.0.8-candidates/rc2/firefox-1.5.0.8.en-US.mac.dmg"
           hashFunction="SHA1"
           hashValue="10c287df3d0479b0579a61531498addc4c325746"
           size="16507599"/>
  </update>
</updates>
Dave, here's the answer: in my backport from the MOZILLA_1_8_BRANCH to the MOZILLA_1_8_0_BRANCH, I included jwalden's fix for https://bugzilla.mozilla.org/show_bug.cgi?id=304381

See http://lxr.mozilla.org/mozilla1.8.0/source/toolkit/mozapps/update/content/updates.js#1359
Product: Firefox → Toolkit