Add some sort of caching for gecko checkouts on windows generic-worker jobs
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(firefox67 fixed)
Version | Tracking | Status |
---|---|---|
firefox67 | --- | fixed |
People
(Reporter: kats, Assigned: ahal)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: leave-open)
Attachments
(5 files)
We're running some webrender CI stuff on Windows via generic-worker (this task), and a good chunk of the time is spent checking out and updating the Mercurial repo.
I recall seeing a comment by :ahal somewhere that this was an item on a to-do list, but I'm not sure whether there's already a bug on file for optimizing it, so I'm filing one.
Assignee
Comment 1•6 years ago
See:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/common.py#89
It's possible that comment is out of date; I see a "WritableDirectoryCache" in:
https://docs.taskcluster.net/docs/reference/workers/generic-worker/docs/payload
Gps was the one who set this all up for docker-worker. Unfortunately he's not around anymore, and I'm a bit out of my element.
Brian, is the comment in the first link still accurate? If so, is there a bug we can depend on that tracks implementing it? If not, do you have any pointers that can help us set this up?
Comment 2•6 years ago
Good question. I think pmoore is best able to answer this sort of thing.
Comment 3•6 years ago
Yes, generic-worker also has caches. The WritableDirectoryCache link from ahal is the correct one, and the feature is essentially equivalent to the docker-worker cache directive. You give the cache a name, which your task needs the corresponding scope to use. Any content in that cache directory is preserved at the end of the task run, and if a new task on that worker declares the same cache name, the content from the previous task run is mounted for it.
The link should have all the info you need, but let me know if you are missing anything.
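For illustration, here is a minimal sketch of what declaring such a cache could look like when a task payload is built in Python. The helper names (with_cache, required_scope) are hypothetical, the cache name and directory are borrowed from the logs later in this bug, and the scope format is my understanding of the generic-worker docs rather than something confirmed in this thread.

```python
def with_cache(payload, cache_name, directory):
    """Return a copy of a generic-worker payload with a writable
    directory cache mounted at `directory` (hypothetical helper)."""
    new_payload = dict(payload)
    mounts = list(payload.get("mounts", []))
    # A writable directory cache is identified by its name and the
    # task-relative directory it is mounted into.
    mounts.append({"cacheName": cache_name, "directory": directory})
    new_payload["mounts"] = mounts
    return new_payload


def required_scope(cache_name):
    """Scope the task needs in order to use the cache.
    Assumption: scopes follow the generic-worker:cache:<name> pattern."""
    return "generic-worker:cache:" + cache_name


task_payload = {"command": ["hg --version"], "maxRunTime": 3600}
cached_payload = with_cache(task_payload, "level-3-checkouts", "build")
scopes = [required_scope("level-3-checkouts")]
```

The original payload is left untouched, so the same base definition can be reused for tasks that should not share the cache.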
Assignee
Comment 4•6 years ago
Assignee
Comment 5•6 years ago
This also adds a cache for the gecko checkout if present.
Depends on D17689
Assignee
Comment 6•6 years ago
Thanks Pete. Here's an initial stab:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e866038733799f6f6be0cad05c215f55e3b50f3
Here is an example task definition with the cache:
https://tools.taskcluster.net/groups/WeV1wa7ZTBWL1BL1XoceEQ/tasks/LrB2cpcFQyKajBQahdnGNg/details
All those term jobs re-clone mozilla-central, but I guess that's expected because we'd need to wait for a host that has the cache to run a second time and the pool is probably much larger than my handful of retriggers, right? Is there a better way for me to test to see this working? Or should I just land and monitor it after the fact?
Comment 7•6 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #6)
Thanks Pete. Here's an initial stab:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e866038733799f6f6be0cad05c215f55e3b50f3
Looks good!
Here is an example task definition with the cache:
https://tools.taskcluster.net/groups/WeV1wa7ZTBWL1BL1XoceEQ/tasks/LrB2cpcFQyKajBQahdnGNg/details
This looks correct. See e.g. these log lines.
All those term jobs re-clone mozilla-central, but I guess that's expected because we'd need to wait for a host that has the cache to run a second time and the pool is probably much larger than my handful of retriggers, right? Is there a better way for me to test to see this working? Or should I just land and monitor it after the fact?
You could make a try push changing worker type gecko-t-win10-64 to gecko-t-win10-64-beta (which is the staging pool for that worker type). The only reason this might help is it is a much more constrained pool, and you could explicitly set it to 2 or 3 workers, for example. I'd be tempted to just land it though, as the jobs are green and the logs look like the cache is being used correctly.
FWIW you can use a patch similar to this one to "swap out" the worker type:
https://hg.mozilla.org/try/rev/28d524a23fd07c16dac5537d84313a70ec4fa830
Note, you'll probably want to shrink the find_replace_dict array to just have the worker type(s) you are enabling caches on.
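As a rough idea of what such a swap could look like (this is a hypothetical sketch, not the contents of the linked try patch), the function below walks a task definition and replaces worker type names using a find_replace_dict mapping. The function name swap_worker_types and the sample task are assumptions made for illustration.

```python
def swap_worker_types(value, find_replace_dict):
    """Recursively replace strings that exactly match a key of
    find_replace_dict (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: swap_worker_types(v, find_replace_dict) for k, v in value.items()}
    if isinstance(value, list):
        return [swap_worker_types(v, find_replace_dict) for v in value]
    if isinstance(value, str):
        # Exact match only, so longer names that merely start with
        # "gecko-t-win10-64" are left alone.
        return find_replace_dict.get(value, value)
    return value


# Shrunk to just the worker type that has caches enabled.
find_replace_dict = {"gecko-t-win10-64": "gecko-t-win10-64-beta"}

task = {
    "workerType": "gecko-t-win10-64",
    "payload": {"command": ["echo gecko-t-win10-64-gpu"]},
}
swapped = swap_worker_types(task, find_replace_dict)
```

Matching whole strings rather than substrings is a deliberate choice here: a plain substring replace would also rewrite other worker types that share the same prefix.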
Also, please check with :grenade whether he is using the staging worker types for anything at the same time, so that you don't collide. He or I can also assist with setting the max capacity (worker pool size) for the staging worker type(s) you want to test with to something small (e.g. 1-3 workers).
Good luck!
Assignee
Comment 8•6 years ago
Ok, I'm more than happy to just land, especially knowing it looks to be working. I'll keep an eye on it over the coming days and if it looks like something is wrong I'll use the worker-type trick then to test it out.
Thanks!
Assignee
Comment 9•6 years ago
Nuts, looks like this causes build failures due to insufficient disk space:
https://tools.taskcluster.net/groups/VHZWELzjSqGJFmm3l51bXg/tasks/WbD7Hwv_TOmBhzmy7txGCQ/runs/0/logs/public%2Flogs%2Flive.log
Not sure if it's intermittent or not. I didn't see this with artifact builds. I guess one easy way to fix this will be to implement a "use_caches" key in the job schema and set it to false for build tasks.
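A minimal sketch of the "use_caches" idea, assuming the key lives in the job's run section and defaults to true. The function name checkout_mounts is hypothetical, and the cache name pattern and directory are only inferred from the task logs in this bug; the real transform code may look quite different.

```python
def checkout_mounts(run, level):
    """Return the cache mounts for a job's gecko checkout, or an empty
    list when the job opts out via use_caches (hypothetical helper)."""
    # Assumption: caches stay enabled unless a job explicitly disables them.
    if not run.get("use_caches", True):
        return []
    return [{
        "cacheName": "level-{}-checkouts".format(level),
        "directory": "build",
    }]


# A build task that opts out because its hosts lack the disk space.
build_mounts = checkout_mounts({"use_caches": False}, level=3)

# Any other task keeps the default behaviour.
default_mounts = checkout_mounts({}, level=3)
```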
Assignee
Comment 10•6 years ago
The hosts don't have enough disk space to cache mozilla-central.
Depends on D17689
Comment 11•6 years ago
Comment 12•6 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0b8097689bb5
https://hg.mozilla.org/mozilla-central/rev/b6e19a5b0ab9
https://hg.mozilla.org/mozilla-central/rev/2ceeee1915ae
Comment 13•6 years ago
I've just seen comment 9, and realise this is probably a bug in the generic-worker implementation, which I think we'll need to fix before we can roll this out (I think we'll have to back this out).
We have a mechanism that checks we have enough disk space on the partition where task directories are stored, before each task runs. If not, we repeatedly delete caches until we have freed up enough space.
However, in the case that the caches are on a different partition to the task directories, this doesn't prevent the cache partition from filling up.
So we should check that both the partition with caches and the partition with task directories have enough space at the start of every task.
I'll create a bug for this in the generic worker bugzilla component and make it a blocker for this bug.
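To make the proposed check concrete, here is a small Python sketch of the logic (generic-worker itself is not written in Python, and this is not its code). It checks every listed partition before a task starts and deletes caches one at a time until all of them have enough room. The names low_partitions and ensure_free_space, the drive labels and the sizes in the simulated run are all made up for illustration.

```python
import shutil
from collections import namedtuple


def low_partitions(partitions, required_bytes, disk_usage=shutil.disk_usage):
    """Return the partitions whose free space is below required_bytes."""
    return [p for p in partitions if disk_usage(p).free < required_bytes]


def ensure_free_space(partitions, required_bytes, caches, delete_cache,
                      disk_usage=shutil.disk_usage):
    """Delete caches, oldest first, until every partition has at least
    required_bytes free. Returns False if the caches run out first."""
    remaining = list(caches)
    while low_partitions(partitions, required_bytes, disk_usage):
        if not remaining:
            return False
        delete_cache(remaining.pop(0))
    return True


# Simulated run: task directories on Z:, caches on Y: (made-up numbers).
Usage = namedtuple("Usage", ["total", "used", "free"])
free = {"Z:": 50, "Y:": 5}
cache_sizes = {"cache-a": 10, "cache-b": 20}


def fake_disk_usage(partition):
    return Usage(100, 100 - free[partition], free[partition])


def fake_delete_cache(name):
    # Deleting a cache frees its space on the cache partition.
    free["Y:"] += cache_sizes[name]


ok = ensure_free_space(["Z:", "Y:"], 30, ["cache-a", "cache-b"],
                       fake_delete_cache, fake_disk_usage)
```

Checking only Z: in this simulation would report plenty of space and delete nothing, which is the failure mode described above.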
Comment 14•6 years ago
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227129866&repo=mozilla-central&lineNumber=2190
Backout link: https://hg.mozilla.org/mozilla-central/rev/4bc31addf415cc076ac42ab9ed64002160f57f86
[taskcluster 2019-02-08T10:32:16.918Z] Exit Code: 0
[taskcluster 2019-02-08T10:32:16.918Z] User Time: 0s
[taskcluster 2019-02-08T10:32:16.918Z] Kernel Time: 0s
[taskcluster 2019-02-08T10:32:16.918Z] Wall Time: 39m17.6778452s
[taskcluster 2019-02-08T10:32:16.918Z] Result: SUCCEEDED
[taskcluster 2019-02-08T10:32:16.918Z] === Task Finished ===
[taskcluster 2019-02-08T10:32:16.918Z] Task Duration: 39m17.6788022s
[taskcluster 2019-02-08T10:32:16.920Z] [mounts] Preserving cache: Moving "Z:\task_1549618301\build" to "Y:\caches\QPnek7u7Qe2ICSvyeudjPw"
[taskcluster 2019-02-08T10:39:34.180Z] [mounts] Removing cache level-3-checkouts from cache table
[taskcluster 2019-02-08T10:39:34.180Z] [mounts] Deleting cache level-3-checkouts file(s) at Y:\caches\QPnek7u7Qe2ICSvyeudjPw
[taskcluster:error] [mounts] Could not unmount <nil> due to: 'Could not persist cache "level-3-checkouts" due to mkdir Y:\caches\QPnek7u7Qe2ICSvyeudjPw\src\vs2017_15.8.4\SDK\bin\10.0.17134.0\x64\en-US: There is not enough space on the disk.'
[taskcluster 2019-02-08T10:39:59.316Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/I9UVQSBISBW3F2h2znaPcA/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-02-08T09:51:01.596Z
[taskcluster:error] Could not persist cache "level-3-checkouts" due to mkdir Y:\caches\QPnek7u7Qe2ICSvyeudjPw\src\vs2017_15.8.4\SDK\bin\10.0.17134.0\x64\en-US: There is not enough space on the disk.
Assignee
Comment 15•6 years ago
No need to block on the fix, I can disable caches for those wrench tasks. That way we'll still get the benefit of caches in tasks that don't have this configuration in the meantime. We could also see if moving those srcdir checkout caches to the same partition on Windows is feasible. I'm not really sure why they're placed where they are.
But either way the bug in generic-worker should be fixed, just saying that it doesn't need to block this one.
Reporter
Comment 16•6 years ago
If you're disabling the caches for wrench and relanding please also disable it for the searchfox build job which failed similarly: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227131320&repo=mozilla-central&lineNumber=251291
Assignee
Comment 17•6 years ago
Fixed by disabling caches for Windows wrench and idx:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=11c9cbca902009c053dcfc57921dd837c6992366
Comment 18•6 years ago
Comment 19•6 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/036604abf1e5
https://hg.mozilla.org/mozilla-central/rev/2053a035eee6
https://hg.mozilla.org/mozilla-central/rev/887cc76ba189
Comment 20•6 years ago
Disables caching on generic-worker based Windows builds for Thunderbird
due to insufficient disk space on the build hosts.
Comment 21•6 years ago
Hmm, the patch in Phabricator is NOT what you've been running on try :-( - I'll land the latter.
Comment 22•6 years ago
Comment 23•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)
If you're disabling the caches for wrench and relanding please also disable it for the searchfox build job which failed similarly: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227131320&repo=mozilla-central&lineNumber=251291
There are similar problems on cbindgen jobs: https://queue.taskcluster.net/v1/task/G0j-vl9PS861uEMQ7u-6Dg/runs/0/artifacts/public/logs/live_backing.log
Comment 24•6 years ago
Actually, that one is different: it's not a disk space problem, but a "there's a process still running that uses a file in the checkout" problem.
Comment 25•6 years ago
Ironically, this makes windows tasks using a cache slow to finish. See https://taskcluster-artifacts.net/FulbQos7T0ewTbXv746i-Q/0/public/logs/live_backing.log
[taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"
[taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'
[taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/FulbQos7T0ewTbXv746i-Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-15T05:06:20.013Z
That's 5 minutes of delay before any tasks depending on this one can start (well, in this case the task failed, so nothing was going to start anyway).
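The delay can be read straight off the timestamps. Below is a small Python sketch that parses the three log lines above (the last one abbreviated) and computes how long the cache move and the whole post-task phase took. The regular expression and the helper name stamp are my own and are not part of any Taskcluster tooling.

```python
import re
from datetime import datetime

# Matches the "[taskcluster <timestamp>Z]" prefix of a log line.
STAMP = re.compile(r"^\[taskcluster (\S+)Z\]")


def stamp(line):
    """Parse the leading taskcluster timestamp of a log line, if any."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


log = [
    r'[taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"',
    r"[taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'",
    "[taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log",
]

times = [stamp(line) for line in log]
move_seconds = (times[1] - times[0]).total_seconds()   # cache move only
total_seconds = (times[2] - times[0]).total_seconds()  # until log upload starts
```

For these lines the cache move alone accounts for about 269 seconds, and roughly 324 seconds pass before the log upload begins.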
Comment 26•6 years ago
I'm going to mark this bug as blocked by bug 1528198 and bug 1526311.
Once both of these bugs are resolved, we should have a much better time using generic-worker caches on Windows.
My feeling is that it isn't wise for us to use this worker feature while these issues are still open.
Comment 27•6 years ago
(In reply to Mike Hommey [:glandium] from comment #25)
Ironically, this makes windows tasks using a cache slow to finish. See https://taskcluster-artifacts.net/FulbQos7T0ewTbXv746i-Q/0/public/logs/live_backing.log
[taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"
[taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'
[taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/FulbQos7T0ewTbXv746i-Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-15T05:06:20.013Z
5 minutes for tasks possibly depending on the task to start (well, in that case the task failed so nothing was going to start anyways)
The slowness here will be addressed by bug 1528198.
Comment 28•6 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #26)
I'm going to mark this bug as blocked by bug 1528198 and bug 1526311.
Once both of these bugs are resolved, we should have a much better time using generic-worker caches on Windows.
My feeling is that it isn't wise for us to use this worker feature while these issues are still open.
Ah, I see I should have done this in bug 1527313 - closing this one again. Sorry for the noise.
Comment 29•6 years ago
The caches appear to be causing tasks to take several hours to complete.
Comment 30•6 years ago
Comment 31•6 years ago
https://hg.mozilla.org/releases/comm-esr60/rev/307afdf37c55d0efc5ae2d48506e51dcaa8625e9
Uplifted to c-esr60 to fix busted Windows builds