Closed
Bug 1381768
Opened 7 years ago
Closed 7 years ago
Some gecko-1-b-win2012 workers take forever to checkout mercurial files
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: glandium, Assigned: grenade)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
(deleted),
patch
|
pmoore
:
review+
|
Details | Diff | Splinter Review |
The following tasks, the first of which is still currently running, have taken more than 30 minutes to checkout the mercurial data (after the clone itself, over the network or possibly using the local shared data):
- H_uVPziJR0KEA3BF6T3Wsw
- C-wllbZJSLCYXwfHCe66-Q
- D1I8p43iROeArOJQoF728g
- VS0FG7L4SD2Mt-0aIlNJww
- GSmFf3LNTvK9ZE_yBTm5AA
- c9nbnarpShGtFSOwE9-yHw
- Im9Y7NclTVK0TQTkzcZUvg
Reporter | ||
Comment 1•7 years ago
|
||
Please note that even when it's "fast", it still takes several minutes. For example O3OsQKhIQ6aKF-oLuCPKzQ still took almost 5 minutes.
Comment 2•7 years ago
|
||
Persisting key logging from one of the logs, as the task logs will expire...
Here we see a wall time of 44m46.6840004s for the hg robustcheckout command.
[taskcluster 2017-07-15T00:34:18.644Z] Worker Type (gecko-1-b-win2012) settings:
[taskcluster 2017-07-15T00:34:18.644Z] {
[taskcluster 2017-07-15T00:34:18.644Z] "aws": {
[taskcluster 2017-07-15T00:34:18.644Z] "ami-id": "ami-ccc7f5da",
[taskcluster 2017-07-15T00:34:18.644Z] "availability-zone": "us-east-1e",
[taskcluster 2017-07-15T00:34:18.644Z] "instance-id": "i-0bc1c310290ab3cc4",
[taskcluster 2017-07-15T00:34:18.644Z] "instance-type": "c4.4xlarge",
[taskcluster 2017-07-15T00:34:18.644Z] "local-ipv4": "172.31.57.209",
[taskcluster 2017-07-15T00:34:18.644Z] "public-hostname": "ec2-54-145-239-43.compute-1.amazonaws.com",
[taskcluster 2017-07-15T00:34:18.647Z] "public-ipv4": "54.145.239.43"
[taskcluster 2017-07-15T00:34:18.647Z] },
[taskcluster 2017-07-15T00:34:18.647Z] "config": {
[taskcluster 2017-07-15T00:34:18.647Z] "deploymentId": "1f3ea73960ea",
[taskcluster 2017-07-15T00:34:18.647Z] "runTasksAsCurrentUser": false
[taskcluster 2017-07-15T00:34:18.647Z] },
[taskcluster 2017-07-15T00:34:18.647Z] "generic-worker": {
[taskcluster 2017-07-15T00:34:18.647Z] "go-arch": "amd64",
[taskcluster 2017-07-15T00:34:18.647Z] "go-os": "windows",
[taskcluster 2017-07-15T00:34:18.647Z] "go-version": "go1.7.5",
[taskcluster 2017-07-15T00:34:18.647Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.0.5",
[taskcluster 2017-07-15T00:34:18.647Z] "version": "10.0.5"
[taskcluster 2017-07-15T00:34:18.647Z] },
[taskcluster 2017-07-15T00:34:18.647Z] "machine-setup": {
[taskcluster 2017-07-15T00:34:18.647Z] "ami-created": "2017-06-29 18:37:12.559Z",
[taskcluster 2017-07-15T00:34:18.647Z] "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/1f3ea73960ea811c1f0e79729be21e67732bd5b1/userdata/Manifest/gecko-1-b-win2012.json"
[taskcluster 2017-07-15T00:34:18.647Z] }
[taskcluster 2017-07-15T00:34:18.647Z] }
[taskcluster 2017-07-15T00:34:18.647Z] Task ID: D1I8p43iROeArOJQoF728g
[taskcluster 2017-07-15T00:34:18.647Z] === Task Starting ===
[taskcluster 2017-07-15T00:34:19.156Z] Uploading file public/logs/live.log as artifact public/logs/live.log
[taskcluster 2017-07-15T00:34:19.651Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision %GECKO_HEAD_REV% %GECKO_HEAD_REPOSITORY% .\build\src
Z:\task_1500078434>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 05d3cf8977013e21a5926e15e2bd5386e16c167f https://hg.mozilla.org/try/ .\build\src
manifests [> ] 1/48
files [=> ] 57/1194 42s
files [==> ] 64/1194 57s
...
...
...
<snip/>
...
...
...
files [==========================================> ] 924/1194 06s
files [==========================================> ] 932/1194 06s
ensuring https://hg.mozilla.org/try/@05d3cf8977013e21a5926e15e2bd5386e16c167f is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
adding manifests
adding file changes
added 48 changesets with 928 changes to 936 files
searching [ <=> ] 2
changesets [============================================================>] 1/1
adding remote bookmark aurora
adding remote bookmark beta
adding remote bookmark central
adding remote bookmark esr10
adding remote bookmark esr17
adding remote bookmark esr24
adding remote bookmark esr31
adding remote bookmark esr38
adding remote bookmark esr45
adding remote bookmark esr52
adding remote bookmark fx-team
adding remote bookmark inbound
adding remote bookmark release
(pulling to obtain 05d3cf8977013e21a5926e15e2bd5386e16c167f)
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 4 files (+1 heads)
updating [ ] 100/196175
updating [ ] 200/196175 1h25m
...
...
...
<snip/>
...
...
...
updating [===============================================> ] 196000/196175 03s
updating [================================================>] 196175/196175 01s
196175 files updated, 0 files merged, 0 files removed, 0 files unresolved
updated to 05d3cf8977013e21a5926e15e2bd5386e16c167f
[taskcluster 2017-07-15T01:19:06.338Z] Exit Code: 0
[taskcluster 2017-07-15T01:19:06.338Z] Success Code: 0x0
[taskcluster 2017-07-15T01:19:06.339Z] User Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z] Kernel Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z] Wall Time: 44m46.6840004s
[taskcluster 2017-07-15T01:19:06.339Z] Peak Memory: 2797568
[taskcluster 2017-07-15T01:19:06.339Z] Result: SUCCEEDED
Updated•7 years ago
|
Assignee: relops → rthijssen
Assignee | ||
Comment 3•7 years ago
|
||
this change ensures that y:\hg-shared contains a prepopulated robustcheckout sharebase during ami creation and hopefully reduces checkout time on the first task checkout for each spot instance derived from that ami.
Attachment #8887956 -
Flags: review?(pmoore)
Comment 4•7 years ago
|
||
Comment on attachment 8887956 [details] [diff] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/81
Review of attachment 8887956 [details] [diff] [review]:
-----------------------------------------------------------------
Looks great, thanks Rob!
Attachment #8887956 -
Attachment is patch: true
Attachment #8887956 -
Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8887956 -
Flags: review?(pmoore) → review+
Reporter | ||
Comment 5•7 years ago
|
||
Note the time was not spent on *cloning*, which hg-shared would affect, but on *checkout*, which is taking the data out of the .hg directory and laying it out on disk outside of it.
Comment 6•7 years ago
|
||
Probably something going wrong in the format of the ebs volume, such that all sectors aren't touched - but we are hoping to fix that issue in bug 1378381, which should solve this issue.
I think grenade's patch is still good, as it addresses a related issue (the clone should be faster now), but it sounds like we'll need bug 1378381 solved too.
Depends on: 1378381
Comment 7•7 years ago
|
||
I'm reasonably confident that baking the Mercurial share "cache" into the AMI is counter-productive. We've done this before with other instances and it was a net loss.
By baking it into the AMI you are creating the need to touch sectors on instance initialization to mitigate copy-on-first-access with EBS volumes initialized from an AMI. It is faster to initialize an empty EBS volume, not touch sectors on instance initialization, and populate the Mercurial clone on a fresh EBS volume.
Of course, as long as you are accessing sectors on volumes initialized from AMIs on instance initialization, it doesn't matter if the sector has data or not: it will still be slow. So until we stop using AMIs, adding the Mercurial data to the AMI shouldn't have a significant negative impact (other than increasing AMI size and corresponding overhead, which I assume is small). But if the Mercurial data is the reason for using an AMI, then this will likely be slower than no AMI.
FWIW, my measurements have shown that reading sectors on an AMI-initialized volume results in throughput of 2-4 MB/s. An uncompressed Mercurial clone (what robustcheckout will do in EC2) should be >10 MB/s for Windows EC2 instances.
If we want perf wins, I'd focus on not initializing EBS volumes from AMIs.
Reporter | ||
Comment 8•7 years ago
|
||
While not directly relevant to windows builds, I was looking at the memory usage during a linux build the other day, and saw that there was something like a peak of 16GB of caches, and never more than 9GB memory used by processes. On linux, this means we could actually get around any I/O issues if we have any by having /home/worker/workspace on a tmpfs. I wonder if we'd get benefits from using a large RAM disk for the build directory on Windows.
Reporter | ||
Comment 9•7 years ago
|
||
BTW, this bug is still very much a problem. I regularly get Windows builds that spend way more than an hour on `hg update`.
Assignee | ||
Comment 10•7 years ago
|
||
jhford: i'm working on updates to occ that would attach fresh ebs volumes to spot instances at boot. to support this, i have tried to change the launchSpec in AWS provisioner config from:
"launchSpec": {
"IamInstanceProfile": {
"Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
}
}
to:
"launchSpec": {
"IamInstanceProfile": {
"Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
},
"BlockDeviceMapping": [
{
"DeviceName":"/dev/sda1",
"Ebs":{
"VolumeSize": 120,
"DeleteOnTermination": true
}
},
{
"DeviceName":"/dev/sdb",
"Ebs":{
"VolumeSize": 120,
"DeleteOnTermination": true
}
}
]
}
the provisioner ui shows this error when i attempt to update:
Unknown Error!
Invalid workerType: ["Error: Launch specifications are invalid\n at Function.WorkerType.testLaunchSpecs (/app/src/worker-type.js:637:15)\n at AwsManager._callee6$ (/app/src/aws-manager.js:323:34)\n at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)\n at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)\n at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)\n at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)\n at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14\n at F (/app/node_modules/core-js/library/modules/_export.js:35:28)\n at AwsManager.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)\n at AwsManager.workerTypeCanLaunch (/app/lib/aws-manager.js:671:22)\n at Object._callee4$ (/app/src/api-v1.js:293:42)\n at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)\n at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)\n at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)\n at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)\n at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14\n at F (/app/node_modules/core-js/library/modules/_export.js:35:28)\n at Object.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)\n at Object.<anonymous> (/app/src/api-v1.js:257:1)\n at /app/node_modules/taskcluster-lib-api/src/api.js:538:22\n at tryCallOne (/app/node_modules/promise/lib/core.js:37:12)\n at /app/node_modules/promise/lib/core.js:123:15\n at flush (/app/node_modules/asap/raw.js:50:29)\n at _combinedTickCallback (internal/process/next_tick.js:73:7)\n at process._tickDomainCallback (internal/process/next_tick.js:128:9)"]
i'm basing my launchSpec config on what i understood from http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_LaunchSpecification.html
do you think i have the syntax wrong or is this just unsupported in the current aws provisioner code?
Flags: needinfo?(jhford)
Assignee | ||
Comment 11•7 years ago
|
||
ignore comment 10. it should have been BlockDeviceMappings rather than BlockDeviceMapping
Flags: needinfo?(jhford)
Comment hidden (Intermittent Failures Robot) |
Comment 13•7 years ago
|
||
FWIW, this bug is particularly bad because it breaks Windows builds that then can't be retriggered from Treeherder due to chain of trust issues. So literally every failure reported under this bug requires manual intervention with appropriate scopes to retrigger a lost job. Is there anything we can do to bump the priority here?
Comment 14•7 years ago
|
||
This is the top thing rob has been working on. He's on pto this week and will return to it when he's back on monday.
Comment hidden (Intermittent Failures Robot) |
Comment 16•7 years ago
|
||
Hey Rob, welcome back from PTO! I was wondering if there was any progress to report in this bug?
Flags: needinfo?(rthijssen)
Comment 17•7 years ago
|
||
ryanvm: see the blocking bug, that's where the OCC work is being tracked.
Flags: needinfo?(rthijssen)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 19•7 years ago
|
||
when using fresh ebs volumes, hg robustcheckout fails with the following error:
abort: The filename or extension is too long: 'y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\data/mobile/android/tests/browser/chrome/tp5/amazon.com/www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B002Y27P3M/%5C%22http%3A/g-ecx.images-amazon.com/images/G/01/kindle/shasta/photos'
not sure its related, but my patch to use fresh ebs volumes on try was reverted due to the error above (https://treeherder.mozilla.org/#/jobs?repo=try&revision=074acb44df33719de2e47ee941887fc8da6e61e4&selectedJob=123961923).
Comment 20•7 years ago
|
||
The robustcheckout failure will be fixed by bug 1391424.
But, the conditions leading to that error should not occur! If I had to venture a guess, it would be that the population of y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 during instance "bootstrap" is failing somehow.
I would remove anything to do with y:\hg-shared from OCC except for creating an empty directory and setting its permissions. That will force y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 to be created at first task time. And it should avoid the path too long issues.
Comment 21•7 years ago
|
||
Bug 1391424 just landed. So if you take the latest version of robustcheckout from the version-control-tools repo (https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/134574b64ddf/hgext/robustcheckout/__init__.py), this error will be worked around. I'd still like to not populate y:\hg-shared during instance "bootstrap." But that could be done as a follow-up.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 27•7 years ago
|
||
ebs patch has landed. as have the robustcheckout and mercurial upgrades
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•