Bug 1381768 (Closed) - Opened 7 years ago, Closed 7 years ago

Some gecko-1-b-win2012 workers take forever to checkout mercurial files

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: grenade)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The following tasks, the first of which is still currently running, have taken more than 30 minutes to check out the Mercurial data (after the clone itself, over the network or possibly using the local shared data):
- H_uVPziJR0KEA3BF6T3Wsw
- C-wllbZJSLCYXwfHCe66-Q
- D1I8p43iROeArOJQoF728g
- VS0FG7L4SD2Mt-0aIlNJww
- GSmFf3LNTvK9ZE_yBTm5AA
- c9nbnarpShGtFSOwE9-yHw
- Im9Y7NclTVK0TQTkzcZUvg
Please note that even when it's "fast", it still takes several minutes. For example, O3OsQKhIQ6aKF-oLuCPKzQ still took almost 5 minutes.
Persisting key logging from one of the logs, as the task logs will expire... Here we see a wall time of 44m46.6840004s for the hg robustcheckout command.

[taskcluster 2017-07-15T00:34:18.644Z] Worker Type (gecko-1-b-win2012) settings:
[taskcluster 2017-07-15T00:34:18.644Z] {
[taskcluster 2017-07-15T00:34:18.644Z]   "aws": {
[taskcluster 2017-07-15T00:34:18.644Z]     "ami-id": "ami-ccc7f5da",
[taskcluster 2017-07-15T00:34:18.644Z]     "availability-zone": "us-east-1e",
[taskcluster 2017-07-15T00:34:18.644Z]     "instance-id": "i-0bc1c310290ab3cc4",
[taskcluster 2017-07-15T00:34:18.644Z]     "instance-type": "c4.4xlarge",
[taskcluster 2017-07-15T00:34:18.644Z]     "local-ipv4": "172.31.57.209",
[taskcluster 2017-07-15T00:34:18.644Z]     "public-hostname": "ec2-54-145-239-43.compute-1.amazonaws.com",
[taskcluster 2017-07-15T00:34:18.647Z]     "public-ipv4": "54.145.239.43"
[taskcluster 2017-07-15T00:34:18.647Z]   },
[taskcluster 2017-07-15T00:34:18.647Z]   "config": {
[taskcluster 2017-07-15T00:34:18.647Z]     "deploymentId": "1f3ea73960ea",
[taskcluster 2017-07-15T00:34:18.647Z]     "runTasksAsCurrentUser": false
[taskcluster 2017-07-15T00:34:18.647Z]   },
[taskcluster 2017-07-15T00:34:18.647Z]   "generic-worker": {
[taskcluster 2017-07-15T00:34:18.647Z]     "go-arch": "amd64",
[taskcluster 2017-07-15T00:34:18.647Z]     "go-os": "windows",
[taskcluster 2017-07-15T00:34:18.647Z]     "go-version": "go1.7.5",
[taskcluster 2017-07-15T00:34:18.647Z]     "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.0.5",
[taskcluster 2017-07-15T00:34:18.647Z]     "version": "10.0.5"
[taskcluster 2017-07-15T00:34:18.647Z]   },
[taskcluster 2017-07-15T00:34:18.647Z]   "machine-setup": {
[taskcluster 2017-07-15T00:34:18.647Z]     "ami-created": "2017-06-29 18:37:12.559Z",
[taskcluster 2017-07-15T00:34:18.647Z]     "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/1f3ea73960ea811c1f0e79729be21e67732bd5b1/userdata/Manifest/gecko-1-b-win2012.json"
[taskcluster 2017-07-15T00:34:18.647Z]   }
[taskcluster 2017-07-15T00:34:18.647Z] }
[taskcluster 2017-07-15T00:34:18.647Z] Task ID: D1I8p43iROeArOJQoF728g
[taskcluster 2017-07-15T00:34:18.647Z] === Task Starting ===
[taskcluster 2017-07-15T00:34:19.156Z] Uploading file public/logs/live.log as artifact public/logs/live.log
[taskcluster 2017-07-15T00:34:19.651Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision %GECKO_HEAD_REV% %GECKO_HEAD_REPOSITORY% .\build\src

Z:\task_1500078434>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 05d3cf8977013e21a5926e15e2bd5386e16c167f https://hg.mozilla.org/try/ .\build\src
manifests [> ] 1/48
files [=> ] 57/1194 42s
files [==> ] 64/1194 57s
... ... ... <snip/> ... ... ...
files [==========================================> ] 924/1194 06s
files [==========================================> ] 932/1194 06s
ensuring https://hg.mozilla.org/try/@05d3cf8977013e21a5926e15e2bd5386e16c167f is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
adding manifests
adding file changes
added 48 changesets with 928 changes to 936 files
searching [ <=> ] 2
changesets [============================================================>] 1/1
adding remote bookmark aurora
adding remote bookmark beta
adding remote bookmark central
adding remote bookmark esr10
adding remote bookmark esr17
adding remote bookmark esr24
adding remote bookmark esr31
adding remote bookmark esr38
adding remote bookmark esr45
adding remote bookmark esr52
adding remote bookmark fx-team
adding remote bookmark inbound
adding remote bookmark release
(pulling to obtain 05d3cf8977013e21a5926e15e2bd5386e16c167f)
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 4 files (+1 heads)
updating [ ] 100/196175
updating [ ] 200/196175 1h25m
... ... ... <snip/> ... ... ...
updating [===============================================> ] 196000/196175 03s
updating [================================================>] 196175/196175 01s
196175 files updated, 0 files merged, 0 files removed, 0 files unresolved
updated to 05d3cf8977013e21a5926e15e2bd5386e16c167f
[taskcluster 2017-07-15T01:19:06.338Z] Exit Code: 0
[taskcluster 2017-07-15T01:19:06.338Z] Success Code: 0x0
[taskcluster 2017-07-15T01:19:06.339Z] User Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z] Kernel Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z] Wall Time: 44m46.6840004s
[taskcluster 2017-07-15T01:19:06.339Z] Peak Memory: 2797568
[taskcluster 2017-07-15T01:19:06.339Z] Result: SUCCEEDED
Assignee: relops → rthijssen
this change ensures that y:\hg-shared contains a prepopulated robustcheckout sharebase during ami creation and hopefully reduces checkout time on the first task checkout for each spot instance derived from that ami.
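For context, a minimal sketch of what such a prepopulation step during AMI creation could look like, assuming the build environment already has Mercurial with the robustcheckout extension installed. The command mirrors the task invocation in the log above; the destination directory and branch are illustrative, since only the share pool under y:\hg-shared is kept in the image:

import subprocess

# Hypothetical AMI-build step: warm y:\hg-shared with a pooled clone of
# mozilla-unified so the first task on each new spot instance can share
# from it instead of cloning over the network.
HG = r"c:\Program Files\Mercurial\hg.exe"

subprocess.check_call([
    HG, "robustcheckout",
    "--sharebase", r"y:\hg-shared",
    "--purge",
    "--upstream", "https://hg.mozilla.org/mozilla-unified",
    "--branch", "default",              # illustrative; any recent head works
    "https://hg.mozilla.org/mozilla-unified",
    r"c:\scratch\prepopulate-checkout", # throwaway working dir
])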
Attachment #8887956 - Flags: review?(pmoore)
Comment on attachment 8887956 [details] [diff] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/81

Review of attachment 8887956 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thanks Rob!
Attachment #8887956 - Attachment is patch: true
Attachment #8887956 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8887956 - Flags: review?(pmoore) → review+
Note the time was not spent on *cloning*, which hg-shared would affect, but on *checkout*, which takes the data out of the .hg directory and lays it out on disk as a working copy.
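To make the distinction concrete, a rough sketch (not how the builders invoke it) that times the two phases separately: hg clone -U fetches history into .hg without creating a working copy, and hg update then performs the checkout step that is slow here:

import subprocess
import time

# Time the store/network phase and the working-copy phase separately to
# see where the wall time goes. "central" is a bookmark in mozilla-unified
# (visible in the log above).
def timed(label, args):
    start = time.monotonic()
    subprocess.check_call(args)
    print(f"{label}: {time.monotonic() - start:.1f}s")

timed("clone (store only, no working copy)",
      ["hg", "clone", "-U", "https://hg.mozilla.org/mozilla-unified", "src"])
timed("checkout (lay out working copy on disk)",
      ["hg", "-R", "src", "update", "-r", "central"])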
Probably something is going wrong in the formatting of the EBS volume, such that not all sectors are touched - but we are hoping to fix that in bug 1378381, which should solve this issue. I think grenade's patch is still good, as it addresses a related issue (the clone should be faster now), but it sounds like we'll need bug 1378381 solved too.
Depends on: 1378381
I'm reasonably confident that baking the Mercurial share "cache" into the AMI is counter-productive. We've done this before with other instances and it was a net loss. By baking it into the AMI you are creating the need to touch sectors on instance initialization to mitigate copy-on-first-access with EBS volumes initialized from an AMI. It is faster to initialize an empty EBS volume, not touch sectors on instance initialization, and populate the Mercurial clone on a fresh EBS volume.

Of course, as long as you are accessing sectors on volumes initialized from AMIs on instance initialization, it doesn't matter if the sector has data or not: it will still be slow. So until we stop using AMIs, adding the Mercurial data to the AMI shouldn't have a significant negative impact (other than increasing AMI size and corresponding overhead, which I assume is small). But if the Mercurial data is the reason for using an AMI, then this will likely be slower than no AMI.

FWIW, my measurements have shown that reading sectors on an AMI-initialized volume results in throughput of 2-4 MB/s. An uncompressed Mercurial clone (what robustcheckout will do in EC2) should be >10 MB/s for Windows EC2 instances. If we want perf wins, I'd focus on not initializing EBS volumes from AMIs.
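For anyone wanting to reproduce those numbers, a small sketch that measures sequential read throughput (the path is a placeholder; on an AMI-initialized EBS volume the first read of each block is hydrated lazily from the snapshot, which is where the 2-4 MB/s shows up, while a second pass over the same file should be much faster):

import time

# Sequentially read a large file and report MB/s. Run once on a freshly
# attached AMI-initialized volume, then again, to observe lazy hydration.
PATH = r"y:\hg-shared\some-large-file"  # placeholder
CHUNK = 1024 * 1024

total = 0
start = time.monotonic()
with open(PATH, "rb", buffering=0) as f:
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        total += len(block)
elapsed = time.monotonic() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f}s ({total / 1e6 / elapsed:.1f} MB/s)")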
While not directly relevant to windows builds, I was looking at the memory usage during a linux build the other day, and saw that there was something like a peak of 16GB of caches, and never more than 9GB memory used by processes. On linux, this means we could actually get around any I/O issues if we have any by having /home/worker/workspace on a tmpfs. I wonder if we'd get benefits from using a large RAM disk for the build directory on Windows.
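A quick way to check that observation on a Linux builder, as a sketch that samples /proc/meminfo while a build runs (field names are the standard kernel ones; a large Cached figure alongside modest process usage suggests the build tree would fit in a tmpfs or RAM disk):

import time

# Sample page-cache size vs. memory actually used by processes.
def meminfo():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])  # values are in kB
    return fields

for _ in range(30):
    m = meminfo()
    used = m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
    print(f"cached={m['Cached'] / 1048576:.1f} GB  used={used / 1048576:.1f} GB")
    time.sleep(60)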
BTW, this bug is still very much a problem. I regularly get Windows builds that spend way more than an hour on `hg update`.
jhford: i'm working on updates to occ that would attach fresh ebs volumes to spot instances at boot. to support this, i have tried to change the launchSpec in AWS provisioner config from:

"launchSpec": {
  "IamInstanceProfile": {
    "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
  }
}

to:

"launchSpec": {
  "IamInstanceProfile": {
    "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
  },
  "BlockDeviceMapping": [
    {
      "DeviceName": "/dev/sda1",
      "Ebs": {
        "VolumeSize": 120,
        "DeleteOnTermination": true
      }
    },
    {
      "DeviceName": "/dev/sdb",
      "Ebs": {
        "VolumeSize": 120,
        "DeleteOnTermination": true
      }
    }
  ]
}

the provisioner ui shows this error when i attempt to update:

Unknown Error! Invalid workerType: ["Error: Launch specifications are invalid
  at Function.WorkerType.testLaunchSpecs (/app/src/worker-type.js:637:15)
  at AwsManager._callee6$ (/app/src/aws-manager.js:323:34)
  at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)
  at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)
  at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)
  at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)
  at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14
  at F (/app/node_modules/core-js/library/modules/_export.js:35:28)
  at AwsManager.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)
  at AwsManager.workerTypeCanLaunch (/app/lib/aws-manager.js:671:22)
  at Object._callee4$ (/app/src/api-v1.js:293:42)
  at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)
  at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)
  at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)
  at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)
  at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14
  at F (/app/node_modules/core-js/library/modules/_export.js:35:28)
  at Object.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)
  at Object.<anonymous> (/app/src/api-v1.js:257:1)
  at /app/node_modules/taskcluster-lib-api/src/api.js:538:22
  at tryCallOne (/app/node_modules/promise/lib/core.js:37:12)
  at /app/node_modules/promise/lib/core.js:123:15
  at flush (/app/node_modules/asap/raw.js:50:29)
  at _combinedTickCallback (internal/process/next_tick.js:73:7)
  at process._tickDomainCallback (internal/process/next_tick.js:128:9)"]

i'm basing my launchSpec config on what i understood from http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_LaunchSpecification.html

do you think i have the syntax wrong or is this just unsupported in the current aws provisioner code?
Flags: needinfo?(jhford)
ignore comment 10. it should have been BlockDeviceMappings rather than BlockDeviceMapping
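For reference, a minimal sketch of the corrected shape, expressed as a boto3 call (illustrative only: the provisioner takes raw launchSpec JSON, but the key names mirror the EC2 RequestSpotInstances API, which expects the plural BlockDeviceMappings; the ImageId and InstanceType values are copied from the task log earlier in this bug):

import boto3

# EC2 spot launch specifications use "BlockDeviceMappings" (plural);
# the singular key in comment 10 is what the validation rejected.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.request_spot_instances(
    LaunchSpecification={
        "ImageId": "ami-ccc7f5da",
        "InstanceType": "c4.4xlarge",
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache",
        },
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sda1",
             "Ebs": {"VolumeSize": 120, "DeleteOnTermination": True}},
            {"DeviceName": "/dev/sdb",
             "Ebs": {"VolumeSize": 120, "DeleteOnTermination": True}},
        ],
    },
)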
Flags: needinfo?(jhford)
FWIW, this bug is particularly bad because it breaks Windows builds that then can't be retriggered from Treeherder due to chain of trust issues. So literally every failure reported under this bug requires manual intervention with appropriate scopes to retrigger a lost job. Is there anything we can do to bump the priority here?
This is the top thing Rob has been working on. He's on PTO this week and will return to it when he's back on Monday.
Blocks: fastci
Hey Rob, welcome back from PTO! I was wondering if there was any progress to report in this bug?
Flags: needinfo?(rthijssen)
ryanvm: see the blocking bug; that's where the OCC work is being tracked.
Flags: needinfo?(rthijssen)
when using fresh ebs volumes, hg robustcheckout fails with the following error:

abort: The filename or extension is too long: 'y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\data/mobile/android/tests/browser/chrome/tp5/amazon.com/www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B002Y27P3M/%5C%22http%3A/g-ecx.images-amazon.com/images/G/01/kindle/shasta/photos'

not sure it's related, but my patch to use fresh ebs volumes on try was reverted due to the error above (https://treeherder.mozilla.org/#/jobs?repo=try&revision=074acb44df33719de2e47ee941887fc8da6e61e4&selectedJob=123961923).
The robustcheckout failure will be fixed by bug 1391424. But the conditions leading to that error should not occur! If I had to venture a guess, it would be that the population of y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 during instance "bootstrap" is failing somehow. I would remove anything to do with y:\hg-shared from OCC except for creating an empty directory and setting its permissions. That will force y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 to be created at first task time. And it should avoid the path-too-long issues.
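Background on the abort itself, in case it bites elsewhere: classic Win32 file APIs reject paths longer than 260 characters (MAX_PATH) unless they carry the \\?\ extended-length prefix. A hedged sketch of the general workaround below; this is not necessarily what bug 1391424 does, just the standard Windows technique:

# Win32 MAX_PATH is 260 characters; longer paths need the \\?\ prefix
# (and backslash separators) to be accepted by classic file APIs.
MAX_PATH = 260

def safe_windows_path(path: str) -> str:
    """Return a form of `path` usable even when it exceeds MAX_PATH."""
    if len(path) < MAX_PATH or path.startswith("\\\\?\\"):
        return path
    return "\\\\?\\" + path.replace("/", "\\")

# Truncated example from the error above (hg stores tp5 test files with
# very long URL-derived names under the share's .hg\data directory):
store = (r"y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg"
         "/data/mobile/android/tests/browser/chrome/tp5/amazon.com/...")
print(len(store), safe_windows_path(store))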
Bug 1391424 just landed. So if you take the latest version of robustcheckout from the version-control-tools repo (https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/134574b64ddf/hgext/robustcheckout/__init__.py), this error will be worked around. I'd still like to not populate y:\hg-shared during instance "bootstrap." But that could be done as a follow-up.
ebs patch has landed, as have the robustcheckout and mercurial upgrades.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED