Bug 1381768 (Closed) - Opened 7 years ago, Closed 7 years ago

Some gecko-1-b-win2012 workers take forever to checkout mercurial files

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: grenade)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The following tasks, the first of which is still currently running, have taken more than 30 minutes to check out the Mercurial data (after the clone itself, over the network or possibly using the local shared data):
- H_uVPziJR0KEA3BF6T3Wsw
- C-wllbZJSLCYXwfHCe66-Q
- D1I8p43iROeArOJQoF728g
- VS0FG7L4SD2Mt-0aIlNJww
- GSmFf3LNTvK9ZE_yBTm5AA
- c9nbnarpShGtFSOwE9-yHw
- Im9Y7NclTVK0TQTkzcZUvg
Please note that even when it's "fast", it still takes several minutes. For example, O3OsQKhIQ6aKF-oLuCPKzQ still took almost 5 minutes.
Persisting key logging from one of the logs, as the task logs will expire... Here we see a wall time of 44m46.6840004s for the hg robustcheckout command.

[taskcluster 2017-07-15T00:34:18.644Z] Worker Type (gecko-1-b-win2012) settings:
[taskcluster 2017-07-15T00:34:18.644Z] {
[taskcluster 2017-07-15T00:34:18.644Z]   "aws": {
[taskcluster 2017-07-15T00:34:18.644Z]     "ami-id": "ami-ccc7f5da",
[taskcluster 2017-07-15T00:34:18.644Z]     "availability-zone": "us-east-1e",
[taskcluster 2017-07-15T00:34:18.644Z]     "instance-id": "i-0bc1c310290ab3cc4",
[taskcluster 2017-07-15T00:34:18.644Z]     "instance-type": "c4.4xlarge",
[taskcluster 2017-07-15T00:34:18.644Z]     "local-ipv4": "172.31.57.209",
[taskcluster 2017-07-15T00:34:18.644Z]     "public-hostname": "ec2-54-145-239-43.compute-1.amazonaws.com",
[taskcluster 2017-07-15T00:34:18.647Z]     "public-ipv4": "54.145.239.43"
[taskcluster 2017-07-15T00:34:18.647Z]   },
[taskcluster 2017-07-15T00:34:18.647Z]   "config": {
[taskcluster 2017-07-15T00:34:18.647Z]     "deploymentId": "1f3ea73960ea",
[taskcluster 2017-07-15T00:34:18.647Z]     "runTasksAsCurrentUser": false
[taskcluster 2017-07-15T00:34:18.647Z]   },
[taskcluster 2017-07-15T00:34:18.647Z]   "generic-worker": {
[taskcluster 2017-07-15T00:34:18.647Z]     "go-arch": "amd64",
[taskcluster 2017-07-15T00:34:18.647Z]     "go-os": "windows",
[taskcluster 2017-07-15T00:34:18.647Z]     "go-version": "go1.7.5",
[taskcluster 2017-07-15T00:34:18.647Z]     "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.0.5",
[taskcluster 2017-07-15T00:34:18.647Z]     "version": "10.0.5"
[taskcluster 2017-07-15T00:34:18.647Z]   },
[taskcluster 2017-07-15T00:34:18.647Z]   "machine-setup": {
[taskcluster 2017-07-15T00:34:18.647Z]     "ami-created": "2017-06-29 18:37:12.559Z",
[taskcluster 2017-07-15T00:34:18.647Z]     "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/1f3ea73960ea811c1f0e79729be21e67732bd5b1/userdata/Manifest/gecko-1-b-win2012.json"
[taskcluster 2017-07-15T00:34:18.647Z]   }
[taskcluster 2017-07-15T00:34:18.647Z] }
[taskcluster 2017-07-15T00:34:18.647Z] Task ID: D1I8p43iROeArOJQoF728g
[taskcluster 2017-07-15T00:34:18.647Z] === Task Starting ===
[taskcluster 2017-07-15T00:34:19.156Z] Uploading file public/logs/live.log as artifact public/logs/live.log
[taskcluster 2017-07-15T00:34:19.651Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision %GECKO_HEAD_REV% %GECKO_HEAD_REPOSITORY% .\build\src

Z:\task_1500078434>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 05d3cf8977013e21a5926e15e2bd5386e16c167f https://hg.mozilla.org/try/ .\build\src
manifests [> ] 1/48
files [=> ] 57/1194 42s
files [==> ] 64/1194 57s
... ... ... <snip/> ... ... ...
files [==========================================> ] 924/1194 06s
files [==========================================> ] 932/1194 06s
ensuring https://hg.mozilla.org/try/@05d3cf8977013e21a5926e15e2bd5386e16c167f is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
adding manifests
adding file changes
added 48 changesets with 928 changes to 936 files
searching [ <=> ] 2
changesets [============================================================>] 1/1
adding remote bookmark aurora
adding remote bookmark beta
adding remote bookmark central
adding remote bookmark esr10
adding remote bookmark esr17
adding remote bookmark esr24
adding remote bookmark esr31
adding remote bookmark esr38
adding remote bookmark esr45
adding remote bookmark esr52
adding remote bookmark fx-team
adding remote bookmark inbound
adding remote bookmark release
(pulling to obtain 05d3cf8977013e21a5926e15e2bd5386e16c167f)
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 4 files (+1 heads)
updating [ ] 100/196175
updating [ ] 200/196175 1h25m
... ... ... <snip/> ... ... ...
updating [===============================================> ] 196000/196175 03s
updating [================================================>] 196175/196175 01s
196175 files updated, 0 files merged, 0 files removed, 0 files unresolved
updated to 05d3cf8977013e21a5926e15e2bd5386e16c167f
[taskcluster 2017-07-15T01:19:06.338Z] Exit Code: 0
[taskcluster 2017-07-15T01:19:06.338Z] Success Code: 0x0
[taskcluster 2017-07-15T01:19:06.339Z] User Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z] Kernel Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z] Wall Time: 44m46.6840004s
[taskcluster 2017-07-15T01:19:06.339Z] Peak Memory: 2797568
[taskcluster 2017-07-15T01:19:06.339Z] Result: SUCCEEDED
Assignee: relops → rthijssen
this change ensures that y:\hg-shared contains a prepopulated robustcheckout sharebase during ami creation and hopefully reduces checkout time on the first task checkout for each spot instance derived from that ami.
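For context, a minimal sketch of what such a prepopulation step during AMI creation could look like, assuming the build environment already has Mercurial with the robustcheckout extension installed. The command mirrors the task invocation in the log above; the destination directory and branch are illustrative, since only the share pool under y:\hg-shared is kept in the image:

import subprocess

# Hypothetical AMI-build step: warm y:\hg-shared with a pooled clone of
# mozilla-unified so the first task on each new spot instance can share
# from it instead of cloning over the network.
HG = r"c:\Program Files\Mercurial\hg.exe"

subprocess.check_call([
    HG, "robustcheckout",
    "--sharebase", r"y:\hg-shared",
    "--purge",
    "--upstream", "https://hg.mozilla.org/mozilla-unified",
    "--branch", "default",              # illustrative; any recent head works
    "https://hg.mozilla.org/mozilla-unified",
    r"c:\scratch\prepopulate-checkout", # throwaway working dir
])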
Attachment #8887956 - Flags: review?(pmoore)
Comment on attachment 8887956 [details] [diff] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/81

Review of attachment 8887956 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thanks Rob!
Attachment #8887956 - Attachment is patch: true
Attachment #8887956 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8887956 - Flags: review?(pmoore) → review+
Note the time was not spent on *cloning*, which hg-shared would affect, but on *checkout*, which takes the data out of the .hg directory and lays it out on disk as a working copy.
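To make the distinction concrete, a rough sketch (not how the builders invoke it) that times the two phases separately: hg clone -U fetches history into .hg without creating a working copy, and hg update then performs the checkout step that is slow here:

import subprocess
import time

# Time the store/network phase and the working-copy phase separately to
# see where the wall time goes. "central" is a bookmark in mozilla-unified
# (visible in the log above).
def timed(label, args):
    start = time.monotonic()
    subprocess.check_call(args)
    print(f"{label}: {time.monotonic() - start:.1f}s")

timed("clone (store only, no working copy)",
      ["hg", "clone", "-U", "https://hg.mozilla.org/mozilla-unified", "src"])
timed("checkout (lay out working copy on disk)",
      ["hg", "-R", "src", "update", "-r", "central"])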
Probably something is going wrong in the formatting of the EBS volume, such that not all sectors are touched - but we are hoping to fix that in bug 1378381, which should solve this issue. I think grenade's patch is still good, as it addresses a related issue (the clone should be faster now), but it sounds like we'll need bug 1378381 solved too.
Depends on: 1378381
I'm reasonably confident that baking the Mercurial share "cache" into the AMI is counter-productive. We've done this before with other instances and it was a net loss. By baking it into the AMI you are creating the need to touch sectors on instance initialization to mitigate copy-on-first-access with EBS volumes initialized from an AMI. It is faster to initialize an empty EBS volume, not touch sectors on instance initialization, and populate the Mercurial clone on a fresh EBS volume.

Of course, as long as you are accessing sectors on volumes initialized from AMIs on instance initialization, it doesn't matter if the sector has data or not: it will still be slow. So until we stop using AMIs, adding the Mercurial data to the AMI shouldn't have a significant negative impact (other than increasing AMI size and corresponding overhead, which I assume is small). But if the Mercurial data is the reason for using an AMI, then this will likely be slower than no AMI.

FWIW, my measurements have shown that reading sectors on an AMI-initialized volume results in throughput of 2-4 MB/s. An uncompressed Mercurial clone (what robustcheckout will do in EC2) should be >10 MB/s for Windows EC2 instances. If we want perf wins, I'd focus on not initializing EBS volumes from AMIs.
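For anyone wanting to reproduce those numbers, a small sketch that measures sequential read throughput (the path is a placeholder; on an AMI-initialized EBS volume the first read of each block is hydrated lazily from the snapshot, which is where the 2-4 MB/s shows up, while a second pass over the same file should be much faster):

import time

# Sequentially read a large file and report MB/s. Run once on a freshly
# attached AMI-initialized volume, then again, to observe lazy hydration.
PATH = r"y:\hg-shared\some-large-file"  # placeholder
CHUNK = 1024 * 1024

total = 0
start = time.monotonic()
with open(PATH, "rb", buffering=0) as f:
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        total += len(block)
elapsed = time.monotonic() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f}s ({total / 1e6 / elapsed:.1f} MB/s)")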
While not directly relevant to windows builds, I was looking at the memory usage during a linux build the other day, and saw that there was something like a peak of 16GB of caches, and never more than 9GB memory used by processes. On linux, this means we could actually get around any I/O issues if we have any by having /home/worker/workspace on a tmpfs. I wonder if we'd get benefits from using a large RAM disk for the build directory on Windows.
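A quick way to check that observation on a Linux builder, as a sketch that samples /proc/meminfo while a build runs (field names are the standard kernel ones; a large Cached figure alongside modest process usage suggests the build tree would fit in a tmpfs or RAM disk):

import time

# Sample page-cache size vs. memory actually used by processes.
def meminfo():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])  # values are in kB
    return fields

for _ in range(30):
    m = meminfo()
    used = m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
    print(f"cached={m['Cached'] / 1048576:.1f} GB  used={used / 1048576:.1f} GB")
    time.sleep(60)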
BTW, this bug is still very much a problem. I regularly get Windows builds that spend way more than an hour on `hg update`.
jhford: i'm working on updates to occ that would attach fresh ebs volumes to spot instances at boot. to support this, i have tried to change the launchSpec in AWS provisioner config from:

"launchSpec": {
  "IamInstanceProfile": {
    "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
  }
}

to:

"launchSpec": {
  "IamInstanceProfile": {
    "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
  },
  "BlockDeviceMapping": [
    {
      "DeviceName": "/dev/sda1",
      "Ebs": {
        "VolumeSize": 120,
        "DeleteOnTermination": true
      }
    },
    {
      "DeviceName": "/dev/sdb",
      "Ebs": {
        "VolumeSize": 120,
        "DeleteOnTermination": true
      }
    }
  ]
}

the provisioner ui shows this error when i attempt to update:

Unknown Error! Invalid workerType: ["Error: Launch specifications are invalid
  at Function.WorkerType.testLaunchSpecs (/app/src/worker-type.js:637:15)
  at AwsManager._callee6$ (/app/src/aws-manager.js:323:34)
  at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)
  at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)
  at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)
  at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)
  at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14
  at F (/app/node_modules/core-js/library/modules/_export.js:35:28)
  at AwsManager.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)
  at AwsManager.workerTypeCanLaunch (/app/lib/aws-manager.js:671:22)
  at Object._callee4$ (/app/src/api-v1.js:293:42)
  at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)
  at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)
  at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)
  at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)
  at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14
  at F (/app/node_modules/core-js/library/modules/_export.js:35:28)
  at Object.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)
  at Object.<anonymous> (/app/src/api-v1.js:257:1)
  at /app/node_modules/taskcluster-lib-api/src/api.js:538:22
  at tryCallOne (/app/node_modules/promise/lib/core.js:37:12)
  at /app/node_modules/promise/lib/core.js:123:15
  at flush (/app/node_modules/asap/raw.js:50:29)
  at _combinedTickCallback (internal/process/next_tick.js:73:7)
  at process._tickDomainCallback (internal/process/next_tick.js:128:9)"]

i'm basing my launchSpec config on what i understood from http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_LaunchSpecification.html

do you think i have the syntax wrong or is this just unsupported in the current aws provisioner code?
Flags: needinfo?(jhford)
ignore comment 10. it should have been BlockDeviceMappings rather than BlockDeviceMapping
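For reference, a minimal sketch of the corrected shape, expressed as a boto3 call (illustrative only: the provisioner takes raw launchSpec JSON, but the key names mirror the EC2 RequestSpotInstances API, which expects the plural BlockDeviceMappings; the ImageId and InstanceType values are copied from the task log earlier in this bug):

import boto3

# EC2 spot launch specifications use "BlockDeviceMappings" (plural);
# the singular key in comment 10 is what the validation rejected.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.request_spot_instances(
    LaunchSpecification={
        "ImageId": "ami-ccc7f5da",
        "InstanceType": "c4.4xlarge",
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache",
        },
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sda1",
             "Ebs": {"VolumeSize": 120, "DeleteOnTermination": True}},
            {"DeviceName": "/dev/sdb",
             "Ebs": {"VolumeSize": 120, "DeleteOnTermination": True}},
        ],
    },
)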
Flags: needinfo?(jhford)
FWIW, this bug is particularly bad because it breaks Windows builds that then can't be retriggered from Treeherder due to chain of trust issues. So literally every failure reported under this bug requires manual intervention with appropriate scopes to retrigger a lost job. Is there anything we can do to bump the priority here?
This is the top thing Rob has been working on. He's on PTO this week and will return to it when he's back on Monday.
Blocks: fastci
Hey Rob, welcome back from PTO! I was wondering if there was any progress to report in this bug?
Flags: needinfo?(rthijssen)
ryanvm: see the blocking bug; that's where the OCC work is being tracked.
Flags: needinfo?(rthijssen)
when using fresh ebs volumes, hg robustcheckout fails with the following error:

abort: The filename or extension is too long: 'y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\data/mobile/android/tests/browser/chrome/tp5/amazon.com/www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B002Y27P3M/%5C%22http%3A/g-ecx.images-amazon.com/images/G/01/kindle/shasta/photos'

not sure it's related, but my patch to use fresh ebs volumes on try was reverted due to the error above (https://treeherder.mozilla.org/#/jobs?repo=try&revision=074acb44df33719de2e47ee941887fc8da6e61e4&selectedJob=123961923).
The robustcheckout failure will be fixed by bug 1391424. But the conditions leading to that error should not occur! If I had to venture a guess, it would be that the population of y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 during instance "bootstrap" is failing somehow. I would remove anything to do with y:\hg-shared from OCC except for creating an empty directory and setting its permissions. That will force y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 to be created at first task time. And it should avoid the path-too-long issues.
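Background on the abort itself, in case it bites elsewhere: classic Win32 file APIs reject paths longer than 260 characters (MAX_PATH) unless they carry the \\?\ extended-length prefix. A hedged sketch of the general workaround below; this is not necessarily what bug 1391424 does, just the standard Windows technique:

# Win32 MAX_PATH is 260 characters; longer paths need the \\?\ prefix
# (and backslash separators) to be accepted by classic file APIs.
MAX_PATH = 260

def safe_windows_path(path: str) -> str:
    """Return a form of `path` usable even when it exceeds MAX_PATH."""
    if len(path) < MAX_PATH or path.startswith("\\\\?\\"):
        return path
    return "\\\\?\\" + path.replace("/", "\\")

# Truncated example from the error above (hg stores tp5 test files with
# very long URL-derived names under the share's .hg\data directory):
store = (r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg"
         "/data/mobile/android/tests/browser/chrome/tp5/amazon.com/...")
print(len(store), safe_windows_path(store))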
Bug 1391424 just landed. So if you take the latest version of robustcheckout from the version-control-tools repo (https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/134574b64ddf/hgext/robustcheckout/__init__.py), this error will be worked around. I'd still like to not populate y:\hg-shared during instance "bootstrap." But that could be done as a follow-up.
ebs patch has landed, as have the robustcheckout and mercurial upgrades.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED