Bug 1223615 (Closed) - gaia-central is responsible for ~50% of hg.mozilla.org load
Product: Taskcluster :: General (defect, P1)
Status: RESOLVED FIXED
Opened: 9 years ago; Closed: 9 years ago
Tracking: (Not tracked)
Reporter: gps; Assignee: Unassigned
Attachments: 1 file
+++ This bug was initially created as a clone of Bug #1096653 +++
gaia-central is currently responsible for ~50% of bytes transferred and ~30% of CPU on hg.mozilla.org.
On UTC day November 9, integration/gaia-central was responsible for 448,226 HTTP requests consuming 1,291,353,296,879 bytes and 200,139 CPU seconds. mozilla-central (which should be our most "expensive" repository) was responsible for 183,605 HTTP requests consuming 91,667,323,258 bytes and 133,723 CPU seconds.
The data at https://docs.google.com/spreadsheets/d/1inZ1cj_e9vQFMOyQuvaZqiA2LU9_vPAQAsXzLqclW14/edit?usp=sharing says gaia-central traffic was pretty reasonable through September 8. It increased significantly on each of the following two days and has remained a high-traffic consumer since. It has been especially outrageous since November 2, after which it has consumed 1+ TB/day every weekday.
The last time I complained about gaia-central, we didn't have support for cloning from pre-generated bundles. We have that now and I consider it a major bug for *any* part of automation to `hg clone` without the clone bundles feature (hg 3.6+ only) or the bundleclone extension installed. See https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/bundleclone.html for configuration instructions. While automation should ideally not be cloning so much, the thing I care most about is that clones don't perform large fetches from hg.mozilla.org directly: if the bulk of clones is serviced by S3/CloudFront, I'm happy, since that moves 99% of clone load from hg.mozilla.org to Amazon. And it's much faster for clients. Wins all around.
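For reference, the client-side setup described in those docs is small. A minimal sketch of an hgrc fragment follows; the extension path is illustrative, so use the path from the linked instructions rather than this one:

```ini
# Sketch of a client hgrc enabling the bundleclone extension
# (pre-3.6 clients; path below is illustrative, not the real location).
[extensions]
bundleclone = /path/to/version-control-tools/hgext/bundleclone
```

On Mercurial 3.6+, no extension is needed: the built-in clone bundles feature activates automatically when the server advertises bundle manifests.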
While I'm here, please read http://gregoryszorc.com/blog/2015/10/22/cloning-improvements-in-mercurial-3.6/ so you can put upgrading to Mercurial 3.6 on your radar.
Comment 1 • 9 years ago (Reporter)
Here are some IPs responsible for 100+ MB "getbundle" requests today (getbundle is the wire protocol command that powers both `hg clone` and `hg pull`, which are the same operation under the covers):
52.33.166.194
52.27.57.199
52.33.148.53
52.33.160.253
52.33.153.117
52.33.159.112
52.10.40.135
52.26.144.50
52.33.113.21
52.33.181.173
52.33.143.186
Most of these requests are in the 100-115 MB range. If I had to venture a guess, I'd say there is a relatively old snapshot of gaia-central somewhere into which each newly-imaged machine `hg pull`s a similarly large payload. A possible solution is to ensure gaia-central snapshots are up to date. Alternatively, don't bake the snapshot into the image; instead, `hg clone` with clonebundles/bundleclone enabled and configured to prefer "packed"/"stream" clone bundles, which are almost as fast as `curl | tar` if you are inside EC2.
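Hedging on exact option names (on Mercurial 3.6 this preference lived in the experimental section, per the hgmo docs), configuring a client to prefer stream/packed clone bundles looks roughly like:

```ini
# Sketch: ask the client to prefer "packed1" (stream) clone bundles,
# which skip bundle application and write store files directly.
# Section/option names are per the hg 3.6 era docs; treat as illustrative.
[experimental]
clonebundleprefers = VERSION=packed1
```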
Comment 2 • 9 years ago (Reporter)
Needinfo dustin to see if IPs belong to Buildbot land.
Needinfo garndt to see if IPs belong to TaskCluster.
Please cancel the other's needinfo if you have insight into both systems.
Flags: needinfo?(garndt)
Flags: needinfo?(dustin)
Comment 3 • 9 years ago
Just to be sure these are not Buildbot-based instances, I checked with https://gist.github.com/rail/58e6e459c92131dff9d0.
Nope, nope, nope...
Updated • 9 years ago
Flags: needinfo?(dustin)
Comment 4 • 9 years ago
Sorry about this, :gps.
gaia-central is updated. I will get something nailed up tomorrow as an interim fix in automation.
This is my fault for not staying on top of the tc-vcs cache situation. We made some significant changes to make use of scopes more secure in the last week, requiring updates to quite a few systems.
Flags: needinfo?(garndt)
Comment 5 • 9 years ago
I ran some queries in Papertrail to see if we could find these systems. Two IPs [1] did match systems in the list. I did not see the other IPs as having claimed any work in the past couple of days, nor a system coming up using those IPs.
The two queries [2,3], one for instances coming up with a given IP and one for IPs that claimed a task, showed the same two IPs.
Looking into the tasks for these instances, 52.33.159.112 only claimed one task, and that was on the 8th.
I looked at 52.10.40.135, and the one task it claimed today didn't mention anything about gaia-central in the task logs [4]. Perhaps it's pulling from gaia-central somewhere that isn't logged. It does pull down gaia from git.m.o, though (it seeded it with our cached copy first).
I looked at our tasks that explicitly state gaia-central as the gaia repo (they are test tasks for mulet and emulator unittests) and tried to reproduce. Using the copy of the gaia-central cache from Monday night (before any new caches were made), there were around 84 changesets applied, which resulted in <1 MB of network traffic (as reported by "docker stats").
I'll keep digging as to the source of this in case there are some taskcluster tasks out there causing it.
[1] 52.33.159.112 and 52.10.40.135
[2] https://papertrailapp.com/groups/853883/events?q=%28%2252.33.166.194%22+OR+%2252.27.57.199%22+OR+%2252.33.148.53%22+OR+%2252.33.160.253%22+OR+%2252.33.153.117%22+OR+%2252.33.159.112%22+OR+%2252.10.40.135%22+OR+%2252.26.144.50%22+OR+%2252.33.113.21%22+OR%2252.33.181.173%22+OR%2252.33.143.186%22%29+%22publicip%22
[3] https://papertrailapp.com/groups/853883/events?q=%28%2252.33.166.194%22+OR+%2252.27.57.199%22+OR+%2252.33.148.53%22+OR+%2252.33.160.253%22+OR+%2252.33.153.117%22+OR+%2252.33.159.112%22+OR+%2252.10.40.135%22+OR+%2252.26.144.50%22+OR+%2252.33.113.21%22+OR%2252.33.181.173%22+OR%2252.33.143.186%22%29+%22%2Fclaim%22
[4] https://public-artifacts.taskcluster.net/M5DHEiLBTGqBHGd6UoWRfg/0/public/logs/live_backing.log
Comment 6 • 9 years ago
Also forgot to mention that the only task that 52.33.159.112 claimed [1] was a firefox os x cross compile build.
[1] https://tools.taskcluster.net/task-inspector/#LWeOL5L4RAG70MPx_7BsSg/
Comment 7 • 9 years ago (Reporter)
Still seeing lots of gaia-central traffic:
UTC Day Requests Bytes
11/3/2015 592,611 1,117,875,401,384
11/4/2015 540,077 1,147,975,352,297
11/5/2015 636,249 1,276,805,807,191
11/6/2015 706,717 1,028,803,925,726
11/7/2015 179,250 484,798,738,336
11/8/2015 52,845 357,693,456,880
11/9/2015 448,226 1,291,353,296,879
11/10/2015 641,082 871,462,859,892
11/11/2015 605,366 873,590,871,172
11/12/2015 506,145 771,808,117,110
Still hovering around 50% of bytes served from hg.mozilla.org.
IPs from within 24 hours responsible for large pulls:
Bytes IP
116782356 54.209.150.246
116782356 54.209.227.2
116782356 54.209.55.82
116782356 54.209.66.129
116782356 54.209.71.209
116782356 54.209.7.52
116782356 54.210.169.208
116782356 54.210.169.34
116782356 54.210.169.52
116782356 54.210.169.91
116782356 54.210.184.146
116782356 54.210.189.170
116782356 54.210.190.253
116782356 54.210.191.172
116782356 54.210.217.187
116782356 54.210.218.25
116782356 54.210.228.121
116782356 54.210.231.131
116782356 54.210.96.245
116782356 54.236.204.26
116782356 54.236.205.61
116782356 54.85.153.247
116825820 52.32.163.120
116825820 52.34.73.107
116825820 54.175.22.76
116825820 54.210.234.17
116826122 52.27.222.219
116826122 52.27.78.104
116826122 52.33.153.23
116826122 52.33.222.72
116826122 52.91.33.92
116826122 54.152.107.249
116826122 54.165.9.219
116826122 54.174.92.253
116826122 54.210.122.9
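Tables like the one above can be produced by aggregating server logs by client IP. Here's a minimal sketch in Python; it assumes a simplified log format of one `<ip> <bytes>` pair per getbundle response (the real hg.mozilla.org log format is richer and would need proper parsing):

```python
from collections import defaultdict

def bytes_by_ip(lines, threshold=100 * 1024 * 1024):
    """Sum response bytes per client IP; return (total, ip) pairs at or
    above the threshold, sorted by total.

    Assumes simplified lines of the form "<ip> <bytes>"; a real server
    log would require format-specific parsing first.
    """
    totals = defaultdict(int)
    for line in lines:
        ip, nbytes = line.split()
        totals[ip] += int(nbytes)
    return sorted(
        (total, ip) for ip, total in totals.items() if total >= threshold
    )

# Illustrative input using byte counts from the table above.
log = [
    "54.209.150.246 116782356",
    "52.32.163.120 116825820",
    "10.0.0.1 512",  # small request, filtered out by the threshold
]
for total, ip in bytes_by_ip(log):
    print(total, ip)
```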
Comment 8 • 9 years ago
So far my searching for those IPs in papertrail yields these results:
54.209.66.129 - taskcluster worker, workerID: i-26f95498. Same as the other observed instances: pulling down 40 changesets in 1.5 seconds, with no huge spikes in network I/O that would add up to >100 MB pulls. Need to track more.
54.210.169.34 - I assume a buildbot worker since it requested an artifact built by buildbot-try
52.33.222.72 - workerType: buildbot - task: https://tools.taskcluster.net/task-inspector/#T_GCT_blT1OlukIJwug4vQ/0
Comment 9 • 9 years ago
> 52.33.222.72 - workerType: buildbot - task:
> https://tools.taskcluster.net/task-inspector/#T_GCT_blT1OlukIJwug4vQ/0
hm, weird, that task doesn't do anything with gaia-central
Comment 10 • 9 years ago
(In reply to Chris AtLee [:catlee] from comment #9)
> > 52.33.222.72 - workerType: buildbot - task:
> > https://tools.taskcluster.net/task-inspector/#T_GCT_blT1OlukIJwug4vQ/0
>
> hm, weird, that task doesn't do anything with gaia-central
Here is just one line from the Heroku logs containing that IP, which led me to that task ID:
Nov 12 05:27:02 taskcluster-queue heroku/router: at=info method=POST path="/v1/task/T_GCT_blT1OlukIJwug4vQ/runs/0/artifacts/public/build/log_error.log" host=queue.taskcluster.net request_id=74bdc08f-2cfb-47f7-ac21-d1e7a376b936 fwd="52.33.222.72" dyno=web.1 connect=1ms service=393ms status=200 bytes=749
Comment 11 • 9 years ago (Reporter)
Could the gaia-central interaction be part of some kind of "bootstrap" process that occurs when the host is created?
Comment 12 • 9 years ago
Yes, that's very possible. I think we include a gaia-central bundle on the releng test instances, but I'm not sure a) whether it's still required or b) how out of date it is.
Flags: needinfo?(rail)
Comment 13 • 9 years ago
Yup, gaia-central is there https://github.com/mozilla/build-cloud-tools/blob/master/instance_data/us-east-1.instance_data_tests.json#L8
Still need to figure out a) and b).
Comment 14 • 9 years ago
I skimmed treeherder logs on m-c and it looks like we don't use this repo anymore.
Flags: needinfo?(rail)
Attachment #8688542 - Flags: review?(catlee)
Comment 15 • 9 years ago
Comment on attachment 8688542 [details]
Stop populating gaia-central
merged
Attachment #8688542 - Flags: review?(catlee) → review+
Comment 16 • 9 years ago (Reporter)
I need to wait a few hours for UTC day November 19, but it looks like gaia-central usage plunged yesterday:
11/16/2015 494,277 1,033,520,289,014
11/17/2015 500,482 824,183,083,834
11/18/2015 360,803 361,981,959,816
That's the lowest it's been since November 8, which was a Sunday.
Setting needinfo on me to remind me to collect the data tonight or tomorrow.
Flags: needinfo?(gps)
Comment 17 • 9 years ago (Reporter)
11/19/2015 323,673 99,491,368,952
Well, I'd say there was a significant reduction!
HTTP request count is still high, but bandwidth and CPU usage (see the spreadsheet URL in comment #0) are drastically reduced. The elevated request count appears to come from `hg pull` operations. Here's a breakdown from one server of the counts of hg commands issued:
3969 batch
6360 capabilities
2771 getbundle
4724 git
26 hgweb-json-pushes
5 hgweb-rev
10 hgweb-static
352 known
13994 listkeys
8999 lookup
What's surprising there is the 4724 "git" commands. These are typically git(-cinnabar?) clients probing for a Git server. Those requests are coming from all over EC2. 52.33.61.76, 54.193.80.1, 54.165.38.57, 54.219.130.154, 52.34.143.183.
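A per-command breakdown like the one above can be tallied from request URLs. The following is a sketch under the assumption that wire protocol requests carry the command in a `cmd` query parameter (as hgweb does) and that plain hgweb page views are bucketed by their last path component; the real reporting pipeline differs:

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

def command_counts(request_paths):
    """Count hg wire-protocol commands seen in request URLs.

    Protocol requests look like "/repo?cmd=getbundle"; anything without
    a "cmd" parameter is treated as an hgweb page view and bucketed by
    its final path component (a simplification for illustration).
    """
    counts = Counter()
    for path in request_paths:
        parsed = urlparse(path)
        qs = parse_qs(parsed.query)
        if "cmd" in qs:
            counts[qs["cmd"][0]] += 1
        else:
            last = parsed.path.strip("/").split("/")[-1] or "root"
            counts["hgweb-" + last] += 1
    return counts

# Illustrative request mix, echoing command names from the breakdown above.
reqs = [
    "/integration/gaia-central?cmd=getbundle",
    "/integration/gaia-central?cmd=listkeys",
    "/integration/gaia-central?cmd=listkeys",
    "/integration/gaia-central/json-pushes",
]
print(command_counts(reqs))
```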
Anyway, things are looking really good. I would like to wait another day to see if request volume goes down. It still feels a bit high to me (2.5x more than mozilla-central). I'd prefer to see things under mozilla-central's request count. Let's see what happens...
Comment 18 • 9 years ago (Reporter)
Good news: gaia-central bandwidth is way, way down:
UTC Day Requests Bytes Wall Time (s) CPU Time (s)
11/16/2015 494,277 1,033,520,289,014 429,056 162,795
11/17/2015 500,482 824,183,083,834 366,779 134,434
11/18/2015 360,803 361,981,959,816 155,595 62,247
11/19/2015 323,673 99,491,368,952 41,859 20,829
11/20/2015 190,824 32,204,888,721 12,207 7,043
11/21/2015 42,780 8,395,335,843 3,226 1,831
11/22/2015 24,897 4,904,122,770 2,232 1,106
11/23/2015 153,141 32,226,620,472 10,922 7,014
The bad news is we're still sending 150,000 requests/day.
mozilla-central got 97,893 and inbound got 90,522 on UTC day 2015-11-23.
I think there's some rogue process querying gaia-central more than necessary. But since these requests are only responsible for 1.86% of total CPU usage on the cluster and since we just reduced gaia-central CPU usage to ~5% of original, I'm not too worried: we have bigger fish to fry.
Thank you for stopping the excessive cloning.
By the way, with gaia-central off our backs, the percentage of overall hg.mozilla.org traffic served from S3 shot up from 35-50% to 58-70%! Getting the rest of automation to use bundleclone/clonebundles will likely increase the S3 offload to >90%. But I'm very happy with where we are!
Flags: needinfo?(gps)
Comment 19 • 9 years ago (Reporter)
Meant to close this.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED