Closed Bug 1519472 Opened 6 years ago Closed 6 years ago

Add some sort of caching for gecko checkouts on windows generic-worker jobs

Tracking

(firefox67 fixed)

Status:

RESOLVED FIXED

Milestone:

mozilla67

Tracking Flags:

firefox67: Tracking ---, Status fixed

People

(Reporter: kats, Assigned: ahal)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(5 files)

Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?tomprince 6 years ago Andrew Halberstadt [:ahal] (deleted), text/x-phabricator-request
Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?tomprince 6 years ago Andrew Halberstadt [:ahal] (deleted), text/x-phabricator-request
Bug 1519472 - [ci] Opt out of caching for generic-worker based Windows builds, r?tomprince 6 years ago Andrew Halberstadt [:ahal] (deleted), text/x-phabricator-request
Port Bug 1519472 - Disable caching on TB Windows builds. r?jorgk 6 years ago Rob Lemley [:rjl] (deleted), text/x-phabricator-request
Bug 1519472: Disable caches on windows repackage builds; r?aki 6 years ago Tom Prince [:tomprince] (deleted), text/x-phabricator-request

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Description

•

6 years ago

We're running some webrender CI stuff on windows via generic-worker (this task) and a good chunk of time is spent checking out and updating the mercurial repo.

I recall seeing a comment by :ahal somewhere that this was an item on a to-do list, but I'm not sure if there's a bug already on file for optimizing that so I'm filing one.

Andrew Halberstadt [:ahal]

Assignee

Comment 1

•

6 years ago

See:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/common.py#89

It's possible that comment is out of date; I see a "WritableDirectoryCache" in:
https://docs.taskcluster.net/docs/reference/workers/generic-worker/docs/payload

Gps was the one who set this all up for docker-worker. Unfortunately he's not around anymore, and I'm a bit out of my element.

Brian, is the comment in the first link still accurate? If so is there a bug we can depend on that tracks implementing it? If not, do you have any pointers that can help us set this up?

Flags: needinfo?(bstack)

Brian Stack [:bstack]

Comment 2

•

6 years ago

Good question. I think pmoore is best able to answer this sort of thing.

Flags: needinfo?(bstack) → needinfo?(pmoore)

Pete Moore [:pmoore][:pete]

Comment 3

•

6 years ago

Yes, generic-worker also has caches. The WritableDirectoryCache link from ahal is the correct one, and it is essentially equivalent to the docker-worker cache directive. You simply give the cache a name (the task needs the corresponding scope to use it). Any content in that cache directory is then preserved at the end of the task run, and if a new task that declares the same cache name comes in on that worker, the content from the previous task run is mounted for it.

The link should have all the info you need, but let me know if you are missing anything.
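
For readers unfamiliar with the payload format, here is a minimal sketch of the setup described above, written as a Python dict. The `mounts`, `cacheName` and `directory` keys and the `generic-worker:cache:<name>` scope pattern are as I understand them from the generic-worker payload docs linked in comment 1. The cache name `level-3-checkouts` and directory `build` are borrowed from the logs in comment 14, and the helper `with_cache` is hypothetical, not part of taskgraph.

```python
import copy


def with_cache(task, cache_name, directory):
    """Return a copy of `task` that mounts a writable directory cache.

    Hypothetical helper for illustration only.
    """
    task = copy.deepcopy(task)
    payload = task.setdefault("payload", {})
    # Declare the cache: the worker preserves `directory` under this name.
    payload.setdefault("mounts", []).append(
        {"cacheName": cache_name, "directory": directory}
    )
    # The task needs a scope for the named cache in order to use it.
    scope = "generic-worker:cache:{}".format(cache_name)
    scopes = task.setdefault("scopes", [])
    if scope not in scopes:
        scopes.append(scope)
    return task


task = with_cache({"payload": {}}, "level-3-checkouts", "build")
```

Two tasks that land on the same worker and declare the same `cacheName` would then share the contents of that directory.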

Flags: needinfo?(pmoore)

Andrew Halberstadt [:ahal]

Assignee

Comment 4

•

6 years ago

Attached file Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?tomprince (deleted) — Details

Andrew Halberstadt [:ahal]

Assignee

Comment 5

•

6 years ago

Attached file Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?tomprince (deleted) — Details

This also adds a cache for the gecko checkout if present.

Depends on D17689

Andrew Halberstadt [:ahal]

Assignee

Comment 6

•

6 years ago

Thanks Pete. Here's an initial stab:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e866038733799f6f6be0cad05c215f55e3b50f3

Here is an example task definition with the cache:
https://tools.taskcluster.net/groups/WeV1wa7ZTBWL1BL1XoceEQ/tasks/LrB2cpcFQyKajBQahdnGNg/details

All those term jobs re-clone mozilla-central, but I guess that's expected because we'd need to wait for a host that has the cache to run a second time and the pool is probably much larger than my handful of retriggers, right? Is there a better way for me to test to see this working? Or should I just land and monitor it after the fact?

Flags: needinfo?(pmoore)

Pete Moore [:pmoore][:pete]

Comment 7

•

6 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #6)

Thanks Pete. Here's an initial stab:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e866038733799f6f6be0cad05c215f55e3b50f3

Looks good!

Here is an example task definition with the cache:
https://tools.taskcluster.net/groups/WeV1wa7ZTBWL1BL1XoceEQ/tasks/LrB2cpcFQyKajBQahdnGNg/details

This looks correct. See e.g. these log lines.

All those term jobs re-clone mozilla-central, but I guess that's expected because we'd need to wait for a host that has the cache to run a second time and the pool is probably much larger than my handful of retriggers, right? Is there a better way for me to test to see this working? Or should I just land and monitor it after the fact?

You could make a try push changing worker type gecko-t-win10-64 to gecko-t-win10-64-beta (which is the staging pool for that worker type). The only reason this might help is it is a much more constrained pool, and you could explicitly set it to 2 or 3 workers, for example. I'd be tempted to just land it though, as the jobs are green and the logs look like the cache is being used correctly.

FWIW you can use a patch similar to this one to "swap out" the worker type:
https://hg.mozilla.org/try/rev/28d524a23fd07c16dac5537d84313a70ec4fa830

Note, you'll probably want to shrink the find_replace_dict array to just have the worker type(s) you are enabling caches on.
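
As a rough illustration of the "swap out" idea (not the contents of the linked patch, which may work quite differently), the mapping could be applied to task definitions like this. The name `find_replace_dict` comes from the comment above; the function `swap_worker_types` and the shape of the task dicts are assumptions.

```python
# Only the worker type(s) that have caches enabled, per the note above.
find_replace_dict = {
    "gecko-t-win10-64": "gecko-t-win10-64-beta",
}


def swap_worker_types(task_definitions, mapping):
    """Point matching tasks at their staging worker type (hypothetical helper)."""
    for task in task_definitions:
        worker_type = task.get("workerType")
        if worker_type in mapping:
            task["workerType"] = mapping[worker_type]
    return task_definitions


tasks = swap_worker_types(
    [{"workerType": "gecko-t-win10-64"}, {"workerType": "gecko-t-linux-large"}],
    find_replace_dict,
)
```

Tasks on worker types that are not in the mapping are left untouched.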

Also please check with :grenade if he is using the staging worker types for anything at the same time, just so you don't collide. He or I can also assist with setting the max capacity (worker pool size) for the staging worker type(s) you want to test with to something small (e.g. 1-3 workers).

Good luck!

Flags: needinfo?(pmoore)

Andrew Halberstadt [:ahal]

Assignee

Comment 8

•

6 years ago

Ok, I'm more than happy to just land, especially knowing it looks to be working. I'll keep an eye on it over the coming days and if it looks like something is wrong I'll use the worker-type trick then to test it out.

Thanks!

Andrew Halberstadt [:ahal]

Assignee

Updated

•

6 years ago

Assignee: nobody → ahal

Status: NEW → ASSIGNED

Phabricator Automation

Updated

•

6 years ago

Attachment #9039217 - Attachment description: Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function → Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?dustin

Phabricator Automation

Updated

•

6 years ago

Attachment #9039218 - Attachment description: Bug 1519472 - [taskgraph] Support generic-worker caches in run_task → Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?dustin

Andrew Halberstadt [:ahal]

Assignee

Comment 9

•

6 years ago

Nuts, looks like this causes build failures due to insufficient disk space:
https://tools.taskcluster.net/groups/VHZWELzjSqGJFmm3l51bXg/tasks/WbD7Hwv_TOmBhzmy7txGCQ/runs/0/logs/public%2Flogs%2Flive.log

Not sure if it's intermittent or not. I didn't see this with artifact builds. I guess one easy way to fix this will be to implement a "use_caches" key in the job schema and set it to false for build tasks.
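
A minimal sketch of what such an opt-out could look like, assuming a `use_caches` key under the job's `run` section that defaults to true. The function name `maybe_add_cache` and the dict layout are hypothetical and are not the code that landed.

```python
def maybe_add_cache(job, taskdesc, name, mount_point):
    """Add a cache to `taskdesc` unless the job opts out via `use_caches`.

    Hypothetical sketch: key names and structure are assumptions.
    """
    if not job.get("run", {}).get("use_caches", True):
        # Build tasks can set use_caches to False to avoid filling the disk.
        return taskdesc
    worker = taskdesc.setdefault("worker", {})
    worker.setdefault("mounts", []).append(
        {"cache-name": name, "directory": mount_point}
    )
    return taskdesc


build_task = maybe_add_cache({"run": {"use_caches": False}}, {}, "checkouts", "build")
test_task = maybe_add_cache({"run": {}}, {}, "checkouts", "build")
```

Defaulting to true keeps caching on for every task that does not explicitly opt out.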

Phabricator Automation

Updated

•

6 years ago

Attachment #9039217 - Attachment description: Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?dustin → Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?tomprince

Andrew Halberstadt [:ahal]

Assignee

Comment 10

•

6 years ago

Attached file Bug 1519472 - [ci] Opt out of caching for generic-worker based Windows builds, r?tomprince (deleted) — Details

The hosts don't have enough disk space to cache mozilla-central.

Depends on D17689

Phabricator Automation

Updated

•

6 years ago

Attachment #9039218 - Attachment description: Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?dustin → Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?tomprince

Andrew Halberstadt [:ahal]

Assignee

Updated

•

6 years ago

Blocks: 1526028

Pulsebot

Comment 11

•

6 years ago

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0b8097689bb5
[taskgraph] Factor logic for adding a cache in job.common to a new function, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/b6e19a5b0ab9
[ci] Opt out of caching for generic-worker based Windows builds, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/2ceeee1915ae
[taskgraph] Support generic-worker caches in run_task, r=tomprince

Oana Pop-Rus

Comment 12

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0b8097689bb5
https://hg.mozilla.org/mozilla-central/rev/b6e19a5b0ab9
https://hg.mozilla.org/mozilla-central/rev/2ceeee1915ae

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox67: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla67

Pete Moore [:pmoore][:pete]

Comment 13

•

6 years ago

I've just seen comment 9, and realise this is probably a bug in the generic-worker implementation, which I think we'll need to fix before we can roll this out (I think we'll have to back this out).

We have a mechanism that checks we have enough disk space on the partition where task directories are stored, before each task runs. If not, we repeatedly delete caches until we have freed up enough space.

However, in the case that the caches are on a different partition to the task directories, this doesn't prevent the cache partition from filling up.

So we should check that both the partition with caches and the partition with task directories have enough space at the start of every task.
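
To make the difference concrete, here is a small Python sketch of the eviction loop described above. It is not generic-worker's implementation (generic-worker is not written in Python); the function names, the eviction order, and the simulated drive sizes are all assumptions. Only `shutil.disk_usage` and `shutil.rmtree` are real library calls.

```python
import shutil


def free_bytes(path):
    """Free bytes on the partition that holds `path`."""
    return shutil.disk_usage(path).free


def ensure_space(partitions, caches, required, free=free_bytes, delete=shutil.rmtree):
    """Delete caches, in list order, until every partition has `required` bytes free.

    Returns the list of evicted cache paths.
    """
    evicted = []
    remaining = list(caches)
    while any(free(p) < required for p in partitions):
        if not remaining:
            raise RuntimeError("not enough disk space even with all caches deleted")
        victim = remaining.pop(0)
        delete(victim)
        evicted.append(victim)
    return evicted


def simulate(partitions):
    """Simulated worker: task directories on Z:, caches on Y: (made-up sizes)."""
    free_space = {"Z:": 100, "Y:": 10}
    cache_sizes = {"Y:/caches/a": 30, "Y:/caches/b": 40}

    def fake_delete(path):
        # Deleting a cache frees its size on the cache partition.
        free_space["Y:"] += cache_sizes[path]

    return ensure_space(partitions, list(cache_sizes), 50,
                        free=free_space.__getitem__, delete=fake_delete)


only_task_partition = simulate(["Z:"])    # checks Z: only, so nothing is evicted
both_partitions = simulate(["Z:", "Y:"])  # also notices that Y: is nearly full
```

Checking only the task-directory partition leaves the nearly full cache partition alone, which matches the failure mode described here; passing both partitions evicts caches until each has room.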

I'll create a bug for this in the generic worker bugzilla component and make it a blocker for this bug.

Pete Moore [:pmoore][:pete]

Updated

•

6 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Cosmin Sabou [:CosminS]

Comment 14

•

6 years ago

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=windows%2C10%2Cx64%2Cquantumrender%2Crelease%2Cwebrender%2Cstandalone%2Cwebrender-windows%2Cwr%28wrench%29&fromchange=da71b4d4ad402c64c19f686ed6014ec559c1844c&tochange=4bc31addf415cc076ac42ab9ed64002160f57f86&selectedJob=227129866

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227129866&repo=mozilla-central&lineNumber=2190

Backout link: https://hg.mozilla.org/mozilla-central/rev/4bc31addf415cc076ac42ab9ed64002160f57f86

[taskcluster 2019-02-08T10:32:16.918Z] Exit Code: 0
[taskcluster 2019-02-08T10:32:16.918Z] User Time: 0s
[taskcluster 2019-02-08T10:32:16.918Z] Kernel Time: 0s
[taskcluster 2019-02-08T10:32:16.918Z] Wall Time: 39m17.6778452s
[taskcluster 2019-02-08T10:32:16.918Z] Result: SUCCEEDED
[taskcluster 2019-02-08T10:32:16.918Z] === Task Finished ===
[taskcluster 2019-02-08T10:32:16.918Z] Task Duration: 39m17.6788022s
[taskcluster 2019-02-08T10:32:16.920Z] [mounts] Preserving cache: Moving "Z:\task_1549618301\build" to "Y:\caches\QPnek7u7Qe2ICSvyeudjPw"
[taskcluster 2019-02-08T10:39:34.180Z] [mounts] Removing cache level-3-checkouts from cache table
[taskcluster 2019-02-08T10:39:34.180Z] [mounts] Deleting cache level-3-checkouts file(s) at Y:\caches\QPnek7u7Qe2ICSvyeudjPw
[taskcluster:error] [mounts] Could not unmount <nil> due to: 'Could not persist cache "level-3-checkouts" due to mkdir Y:\caches\QPnek7u7Qe2ICSvyeudjPw\src\vs2017_15.8.4\SDK\bin\10.0.17134.0\x64\en-US: There is not enough space on the disk.'
[taskcluster 2019-02-08T10:39:59.316Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/I9UVQSBISBW3F2h2znaPcA/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-02-08T09:51:01.596Z
[taskcluster:error] Could not persist cache "level-3-checkouts" due to mkdir Y:\caches\QPnek7u7Qe2ICSvyeudjPw\src\vs2017_15.8.4\SDK\bin\10.0.17134.0\x64\en-US: There is not enough space on the disk.

status-firefox67: fixed → ---

Flags: needinfo?(ahal)

Target Milestone: mozilla67 → ---

Andrew Halberstadt [:ahal]

Assignee

Comment 15

•

6 years ago

No need to block on the fix, I can disable caches for those wrench tasks. That way we'll still get the benefit of caches in tasks that don't have this configuration in the meantime. We could also see if moving those srcdir checkout caches to the same partition on Windows is feasible. I'm not really sure why they're placed where they are.

But either way the bug in generic-worker should be fixed, just saying that it doesn't need to block this one.

Flags: needinfo?(ahal)

Pete Moore [:pmoore][:pete]

Updated

•

6 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1526311

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Comment 16

•

6 years ago

If you're disabling the caches for wrench and relanding please also disable it for the searchfox build job which failed similarly: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227131320&repo=mozilla-central&lineNumber=251291

Andrew Halberstadt [:ahal]

Assignee

Comment 17

•

6 years ago

Fixed by disabling caches for Windows wrench and idx:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=11c9cbca902009c053dcfc57921dd837c6992366

Pulsebot

Comment 18

•

6 years ago

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/036604abf1e5
[taskgraph] Factor logic for adding a cache in job.common to a new function, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/2053a035eee6
[ci] Opt out of caching for generic-worker based Windows builds, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/887cc76ba189
[taskgraph] Support generic-worker caches in run_task, r=tomprince

Bogdan Tara[:bogdan_tara | bogdant]

Comment 19

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/036604abf1e5
https://hg.mozilla.org/mozilla-central/rev/2053a035eee6
https://hg.mozilla.org/mozilla-central/rev/887cc76ba189

Status: REOPENED → RESOLVED

Closed: 6 years ago → 6 years ago

status-firefox67: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla67

Andrew Halberstadt [:ahal]

Assignee

Updated

•

6 years ago

Blocks: 1527313

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1527313

Andrew Halberstadt [:ahal]

Assignee

Updated

•

6 years ago

No longer blocks: 1527313

Rob Lemley [:rjl]

Comment 20

•

6 years ago

Attached file Port Bug 1519472 - Disable caching on TB Windows builds. r?jorgk (deleted) — Details

Disables caching on generic-worker based Windows builds for Thunderbird
due to insufficient disk space on the build hosts.

Phabricator Automation

Updated

•

6 years ago

Attachment #9043330 - Attachment description: Port Bug 1519472 - Disable caching on TB Windows builds. r=jorgk → Port Bug 1519472 - Disable caching on TB Windows builds. r?jorgk

Jorg K (CEST = GMT+2)

Comment 21

•

6 years ago

Hmm, the patch in Phabricator is NOT what you've been running on try :-( - I'll land the latter.

Pulsebot

Comment 22

•

6 years ago

Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/d2574be5d927 Port Bug 1519472 - Disable caching on Windows builds. r=jorgk

Mike Hommey [:glandium]

Comment 23

•

6 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)

If you're disabling the caches for wrench and relanding please also disable it for the searchfox build job which failed similarly: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227131320&repo=mozilla-central&lineNumber=251291

There are similar problems on cbindgen jobs: https://queue.taskcluster.net/v1/task/G0j-vl9PS861uEMQ7u-6Dg/runs/0/artifacts/public/logs/live_backing.log

Mike Hommey [:glandium]

Comment 24

•

6 years ago

Actually, that one is different: it's not a disk space problem, but a "there's a process still running that uses a file in the checkout" problem.

Mike Hommey [:glandium]

Updated

•

6 years ago

Depends on: 1527798

Mike Hommey [:glandium]

Updated

•

6 years ago

Depends on: 1527799

Mike Hommey [:glandium]

Comment 25

•

6 years ago

Ironically, this makes windows tasks using a cache slow to finish. See https://taskcluster-artifacts.net/FulbQos7T0ewTbXv746i-Q/0/public/logs/live_backing.log

[taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"
[taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'
[taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/FulbQos7T0ewTbXv746i-Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-15T05:06:20.013Z

That's 5 minutes before any task depending on this one can start (well, in that case the task failed, so nothing was going to start anyway).
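
For reference, the delay can be worked out from the three timestamps in the log excerpt above. The variable names are mine; the timestamps are copied from the log.

```python
from datetime import datetime


def parse(ts):
    """Parse a taskcluster log timestamp such as 2019-02-15T05:24:03.500Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


preserve_start = parse("2019-02-15T05:24:03.500Z")  # "Preserving cache: Moving ..."
deny_access = parse("2019-02-15T05:28:32.966Z")     # "Denying ... access to ..."
upload_log = parse("2019-02-15T05:29:27.546Z")      # "Uploading redirect artifact ..."

move_seconds = (deny_access - preserve_start).total_seconds()
total_seconds = (upload_log - preserve_start).total_seconds()
```

That gives roughly 269 seconds between the start of the cache move and the next log line, and roughly 324 seconds (about 5 minutes 24 seconds) until the log upload begins.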

Pete Moore [:pmoore][:pete]

Comment 26

•

6 years ago

I'm going to mark this bug as blocked by bug 1528198 and bug 1526311.

Once both of these bugs are resolved, we should have a much better time using generic-worker caches on Windows.

My feeling is that it isn't wise for us to use this worker feature while these issues are still open.

Status: RESOLVED → REOPENED

Depends on: 1528198, 1526311

Resolution: FIXED → ---

Pete Moore [:pmoore][:pete]

Comment 27

•

6 years ago

(In reply to Mike Hommey [:glandium] from comment #25)

Ironically, this makes windows tasks using a cache slow to finish. See https://taskcluster-artifacts.net/FulbQos7T0ewTbXv746i-Q/0/public/logs/live_backing.log

[taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"
[taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'
[taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/FulbQos7T0ewTbXv746i-Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-15T05:06:20.013Z

That's 5 minutes before any task depending on this one can start (well, in that case the task failed, so nothing was going to start anyway).

The slowness here will be addressed by bug 1528198.

Pete Moore [:pmoore][:pete]

Comment 28

•

6 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #26)

I'm going to mark this bug as blocked by bug 1528198 and bug 1526311.

Once both of these bugs are resolved, we should have a much better time using generic-worker caches on Windows.

My feeling is that it isn't wise for us to use this worker feature while these issues are still open.

Ah, I see I should have done this in bug 1527313 - closing this one again. Sorry for the noise.

Status: REOPENED → RESOLVED

Closed: 6 years ago → 6 years ago

No longer depends on: 1526311, 1528198

Resolution: --- → FIXED

Mike Hommey [:glandium]

Updated

•

6 years ago

Depends on: 1528422

Mike Hommey [:glandium]

Updated

•

6 years ago

Depends on: 1528891

Tom Prince [:tomprince]

Updated

•

6 years ago

Keywords: leave-open

Tom Prince [:tomprince]

Comment 29

•

6 years ago

Attached file Bug 1519472: Disable caches on windows repackage builds; r?aki (deleted) — Details

They appear to be causing tasks to take several hours to complete.

Pulsebot

Comment 30

•

6 years ago

Pushed by mozilla@hocat.ca: https://hg.mozilla.org/mozilla-central/rev/3b08a133c893 Disable caches on windows repackage builds; r=aki a=tomprince

Jorg K (CEST = GMT+2)

Comment 31

•

6 years ago

https://hg.mozilla.org/releases/comm-esr60/rev/307afdf37c55d0efc5ae2d48506e51dcaa8625e9
Uplifted to c-esr60 to fix busted Windows builds
