Closed Bug 1223615 Opened 9 years ago Closed 9 years ago

gaia-central is responsible for ~50% of hg.mozilla.org load

Categories: Taskcluster :: General, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: gps
Assignee: Unassigned
Attachments: 1 file

+++ This bug was initially created as a clone of Bug #1096653 +++

gaia-central is currently responsible for ~50% of bytes transferred and ~30% of CPU on hg.mozilla.org. On UTC day November 9, integration/gaia-central was responsible for 448,226 HTTP requests consuming 1,291,353,296,879 bytes and 200,139 CPU seconds. mozilla-central (which should be our most "expensive" repository) was responsible for 183,605 HTTP requests consuming 91,667,323,258 bytes and 133,723 CPU seconds.

The data at https://docs.google.com/spreadsheets/d/1inZ1cj_e9vQFMOyQuvaZqiA2LU9_vPAQAsXzLqclW14/edit?usp=sharing says gaia-central traffic was pretty reasonable through September 8. It increased significantly on each of the following 2 days and has remained a high-traffic consumer since. It has been especially outrageous since November 2, after which it has consumed 1+ TB/day on each weekday.

The last time I complained about gaia-central, we didn't have support for cloning from pre-generated bundles. We have that now, and I consider it a major bug for *any* part of automation to `hg clone` without the clone bundles feature (hg 3.6+ only) or the bundleclone extension installed. See https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/bundleclone.html for configuration instructions.

While automation should ideally not be cloning so much, the thing I care most about is that clones don't perform large fetches from hg.mozilla.org directly: if the bulk of clones is serviced by S3/CloudFront, I'm happy, since that moves 99% of clone load from hg.mozilla.org to Amazon. And it's much faster for clients. Wins all around.

While I'm here, please read http://gregoryszorc.com/blog/2015/10/22/cloning-improvements-in-mercurial-3.6/ so you can put upgrading to Mercurial 3.6 on your radar.
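For illustration only, here is a minimal sketch of the kind of client-side configuration being asked for. The option names below (the bundleclone extension path and the bundleclone.prefers key) are written from memory and should be treated as assumptions; the hgmo docs linked above are the authoritative reference.

    # Sketch of an hgrc for automation hosts; option names assumed, not verified.
    [extensions]
    # Mozilla's bundleclone extension from version-control-tools
    # (only needed on Mercurial < 3.6; 3.6+ ships clone bundles support).
    bundleclone = /path/to/version-control-tools/hgext/bundleclone

    [bundleclone]
    # Prefer stream ("packed") bundles hosted in the local EC2 region so the
    # bulk of clone data is served from S3/CloudFront instead of hg.mozilla.org.
    prefers = ec2region=us-west-2, stream=revlogv1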
Here are some IPs responsible for 100+ MB "getbundle" requests today ("getbundle" is how hg powers both `hg clone` and `hg pull`, which are the same operation under the covers):

52.33.166.194
52.27.57.199
52.33.148.53
52.33.160.253
52.33.153.117
52.33.159.112
52.10.40.135
52.26.144.50
52.33.113.21
52.33.181.173
52.33.143.186

Most of these requests are in the 100-115 MB range. If I had to venture a guess, I'd say there is a relatively old snapshot of gaia-central somewhere and each newly-imaged machine `hg pull`s a similarly large payload to bring it current.

A possible solution is to ensure gaia-central snapshots are up to date. Alternatively, don't bake the snapshot into the image at all and instead `hg clone` with clonebundles/bundleclone enabled and configured to prefer "packed"/"stream" clone bundles, which are almost as fast as `curl | tar` if you are inside EC2.
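As an aside, the per-IP accounting above can be approximated with a short script. This is only a sketch: the log path, the combined access-log layout (client IP in field 1, response bytes in field 10), and the presence of a cmd=getbundle query string in the request line are assumptions, not details of hg.mozilla.org's actual logging.

    #!/usr/bin/env python
    # Illustrative sketch: sum getbundle response bytes per client IP from a
    # combined-format httpd access log (path and field positions assumed).
    import sys
    from collections import defaultdict

    totals = defaultdict(int)
    with open(sys.argv[1]) as fh:
        for line in fh:
            if 'cmd=getbundle' not in line:
                continue
            parts = line.split()
            try:
                totals[parts[0]] += int(parts[9])  # field 10: bytes sent
            except (IndexError, ValueError):
                continue  # malformed line, or size logged as '-'

    # Largest consumers first; many identical sizes hint at a shared stale snapshot.
    for ip, nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:25]:
        print('%15d  %s' % (nbytes, ip))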
Needinfo dustin to see if the IPs belong to Buildbot land. Needinfo garndt to see if they belong to TaskCluster. Please cancel the other's needinfo if you have insight into both systems.
Flags: needinfo?(garndt)
Flags: needinfo?(dustin)
Just to be sure they are not buildbot-based instances, I checked with https://gist.github.com/rail/58e6e459c92131dff9d0. Nope, nope, nope...
Flags: needinfo?(dustin)
Sorry about this, :gps. gaia-central is updated. I will get something nailed up tomorrow as an interim fix in automation. This is my fault for not staying on top of the tc-vcs cache situation. We made some significant changes in the last week to make the use of scopes more secure, which required updates to quite a few systems.
Flags: needinfo?(garndt)
I ran some queries in papertrail to see if we could find these systems. Two IPs [1] did match systems in the list. I did not see the other IPs as having claimed any work in the past couple of days, nor any system coming up using those IPs. The two queries [2,3], one for instances coming up with a given IP and one for IPs that claimed a task, showed the same two IPs.

Looking into the tasks for these instances: 52.33.159.112 only claimed one task, and that was on the 8th. I looked at 52.10.40.135, and the one task it claimed today didn't mention anything about gaia-central in the task logs [4]. Perhaps it's pulling from gaia-central somewhere that isn't logged. It does pull down gaia from git.m.o, though (it seeded it with our cached copy first).

I looked at our tasks that do explicitly state gaia-central as the gaia repo (they are test tasks for mulet and emulator unittests) and tried to reproduce. Using the copy of the gaia-central cache from Monday night (before any new caches were made), there were around 84 changesets applied, which resulted in <1 MB of network traffic (as reported by "docker stats").

I'll keep digging for the source of this in case there are some taskcluster tasks out there causing it.

[1] 52.33.159.112 and 52.10.40.135
[2] https://papertrailapp.com/groups/853883/events?q=%28%2252.33.166.194%22+OR+%2252.27.57.199%22+OR+%2252.33.148.53%22+OR+%2252.33.160.253%22+OR+%2252.33.153.117%22+OR+%2252.33.159.112%22+OR+%2252.10.40.135%22+OR+%2252.26.144.50%22+OR+%2252.33.113.21%22+OR%2252.33.181.173%22+OR%2252.33.143.186%22%29+%22publicip%22
[3] https://papertrailapp.com/groups/853883/events?q=%28%2252.33.166.194%22+OR+%2252.27.57.199%22+OR+%2252.33.148.53%22+OR+%2252.33.160.253%22+OR+%2252.33.153.117%22+OR+%2252.33.159.112%22+OR+%2252.10.40.135%22+OR+%2252.26.144.50%22+OR+%2252.33.113.21%22+OR%2252.33.181.173%22+OR%2252.33.143.186%22%29+%22%2Fclaim%22
[4] https://public-artifacts.taskcluster.net/M5DHEiLBTGqBHGd6UoWRfg/0/public/logs/live_backing.log
Also, I forgot to mention that the only task 52.33.159.112 claimed [1] was a Firefox OS X cross-compile build.

[1] https://tools.taskcluster.net/task-inspector/#LWeOL5L4RAG70MPx_7BsSg/
Still seeing lots of gaia-central traffic:

UTC Day     Requests  Bytes
11/3/2015   592,611   1,117,875,401,384
11/4/2015   540,077   1,147,975,352,297
11/5/2015   636,249   1,276,805,807,191
11/6/2015   706,717   1,028,803,925,726
11/7/2015   179,250   484,798,738,336
11/8/2015   52,845    357,693,456,880
11/9/2015   448,226   1,291,353,296,879
11/10/2015  641,082   871,462,859,892
11/11/2015  605,366   873,590,871,172
11/12/2015  506,145   771,808,117,110

Still hovering around 50% of bytes served from hg.mozilla.org.

IPs from within the last 24 hours responsible for large pulls:

Bytes      IP
116782356  54.209.150.246
116782356  54.209.227.2
116782356  54.209.55.82
116782356  54.209.66.129
116782356  54.209.71.209
116782356  54.209.7.52
116782356  54.210.169.208
116782356  54.210.169.34
116782356  54.210.169.52
116782356  54.210.169.91
116782356  54.210.184.146
116782356  54.210.189.170
116782356  54.210.190.253
116782356  54.210.191.172
116782356  54.210.217.187
116782356  54.210.218.25
116782356  54.210.228.121
116782356  54.210.231.131
116782356  54.210.96.245
116782356  54.236.204.26
116782356  54.236.205.61
116782356  54.85.153.247
116825820  52.32.163.120
116825820  52.34.73.107
116825820  54.175.22.76
116825820  54.210.234.17
116826122  52.27.222.219
116826122  52.27.78.104
116826122  52.33.153.23
116826122  52.33.222.72
116826122  52.91.33.92
116826122  54.152.107.249
116826122  54.165.9.219
116826122  54.174.92.253
116826122  54.210.122.9
So far my searching for those IPs in papertrail yields these results:

54.209.66.129 - taskcluster worker, workerID: i-26f95498. Same behavior as the other observed instances: it pulled down 40 changesets in 1.5 seconds with no huge spikes in network i/o that would have added up to >100 MB pulls. Need to track this more.

54.210.169.34 - I assume a buildbot worker, since it requested an artifact built by buildbot-try.

52.33.222.72 - workerType: buildbot - task: https://tools.taskcluster.net/task-inspector/#T_GCT_blT1OlukIJwug4vQ/0
> 52.33.222.72 - workerType: buildbot - task:
> https://tools.taskcluster.net/task-inspector/#T_GCT_blT1OlukIJwug4vQ/0

Hm, weird, that task doesn't do anything with gaia-central.
(In reply to Chris AtLee [:catlee] from comment #9)
> > 52.33.222.72 - workerType: buildbot - task:
> > https://tools.taskcluster.net/task-inspector/#T_GCT_blT1OlukIJwug4vQ/0
>
> hm, weird, that task doesn't do anything with gaia-central

Here is just one line from the heroku logs containing that IP, which led me to that task ID:

Nov 12 05:27:02 taskcluster-queue heroku/router: at=info method=POST path="/v1/task/T_GCT_blT1OlukIJwug4vQ/runs/0/artifacts/public/build/log_error.log" host=queue.taskcluster.net request_id=74bdc08f-2cfb-47f7-ac21-d1e7a376b936 fwd="52.33.222.72" dyno=web.1 connect=1ms service=393ms status=200 bytes=749
Could the gaia-central interaction be part of some kind of "bootstrap" process that occurs when the host is created?
Yes, that's very possible. I think we include a gaia-central bundle on the releng test instances, but am not sure if a) it's still required or b) how out of date it is.
Flags: needinfo?(rail)
Attached file Stop populating gaia-central (deleted)
I skimmed treeherder logs on m-c and it looks like we don't use this repo anymore.
Flags: needinfo?(rail)
Attachment #8688542 - Flags: review?(catlee)
Comment on attachment 8688542 [details]
Stop populating gaia-central

merged
Attachment #8688542 - Flags: review?(catlee) → review+
I need to wait a few hours for UTC day November 19, but it looks like gaia-central usage plunged yesterday:

UTC Day     Requests  Bytes
11/16/2015  494,277   1,033,520,289,014
11/17/2015  500,482   824,183,083,834
11/18/2015  360,803   361,981,959,816

That's the lowest it's been since November 8, which was a Sunday. Setting needinfo on me to remind me to collect the data tonight or tomorrow.
Flags: needinfo?(gps)
11/19/2015  323,673  99,491,368,952

Well, I'd say there was a significant reduction! The HTTP request count is still high, but bandwidth and CPU usage (see the spreadsheet URL in comment #0) are drastically reduced.

The elevated request count appears to be from `hg pull` operations. Here's a breakdown from one server of the counts of hg commands issued:

3969   batch
6360   capabilities
2771   getbundle
4724   git
26     hgweb-json-pushes
5      hgweb-rev
10     hgweb-static
352    known
13994  listkeys
8999   lookup

What's surprising there is the 4724 "git" commands. These are typically git(-cinnabar?) clients probing for a Git server. Those requests are coming from all over EC2: 52.33.61.76, 54.193.80.1, 54.165.38.57, 54.219.130.154, 52.34.143.183.

Anyway, things are looking really good. I would like to wait another day to see if request volume goes down. It still feels a bit high to me (2.5x more than mozilla-central); I'd prefer to see things under mozilla-central's request count. Let's see what happens...
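For what it's worth, a per-command breakdown like the one above can be approximated by counting the cmd= query parameter that hg's HTTP protocol puts in request URLs. This is only a sketch: the log path and format are assumptions, and it won't reproduce the hgweb-* rows, which come from web UI requests rather than protocol commands.

    #!/usr/bin/env python
    # Illustrative sketch: count hg wire-protocol commands in an access log by
    # extracting the "cmd=" query parameter (e.g. "?cmd=getbundle") per request.
    import re
    import sys
    from collections import Counter

    cmd_re = re.compile(r'[?&]cmd=([A-Za-z0-9_-]+)')
    counts = Counter()
    with open(sys.argv[1]) as fh:
        for line in fh:
            m = cmd_re.search(line)
            if m:
                counts[m.group(1)] += 1

    # Most frequent protocol commands first.
    for cmd, n in counts.most_common():
        print('%8d  %s' % (n, cmd))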
Good news: gaia-central bandwidth is way, way down:

UTC Day     Requests  Bytes              Wall Time  CPU Time
11/16/2015  494,277   1,033,520,289,014  429,056    162,795
11/17/2015  500,482   824,183,083,834    366,779    134,434
11/18/2015  360,803   361,981,959,816    155,595    62,247
11/19/2015  323,673   99,491,368,952     41,859     20,829
11/20/2015  190,824   32,204,888,721     12,207     7,043
11/21/2015  42,780    8,395,335,843      3,226      1,831
11/22/2015  24,897    4,904,122,770      2,232      1,106
11/23/2015  153,141   32,226,620,472     10,922     7,014

The bad news is we're still seeing 150,000 requests/day. mozilla-central got 97,893 and inbound got 90,522 on UTC day 2015-11-23. I think there's some rogue process querying gaia-central more than necessary. But since these requests are only responsible for 1.86% of total CPU usage on the cluster, and since we just reduced gaia-central CPU usage to ~5% of the original, I'm not too worried: we have bigger fish to fry. Thank you for stopping the excessive cloning.

By the way, with gaia-central off our backs, the percentage of overall hg.mozilla.org traffic served from S3 shot up from 35-50% to 58-70%! Getting the rest of automation to use bundleclone/clonebundles will likely increase the S3 offload to >90%. But I'm very happy with where we are!
Flags: needinfo?(gps)
Meant to close this.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 1234708
No longer depends on: 1234708