Caching system for lint and static analysis jobs isn't working correctly
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(firefox77 fixed)
| | Tracking | Status |
| --- | --- | --- |
| firefox77 | --- | fixed |
People
(Reporter: Sylvestre, Assigned: Callek)
References
(Depends on 2 open bugs, Blocks 1 open bug)
Details
(Whiteboard: [ci-costs-2020:done])
Attachments
(5 files)
Now that we are doing a try job for every review, we should probably spend more time on cloning optimization.
For example, in these jobs, we are spending 5 minutes just to clone the repo.
https://treeherder.mozilla.org/logviewer.html#?job_id=271918576&repo=try — here cloning is about 20% of the job time
https://treeherder.mozilla.org/logviewer.html#?job_id=271917591&repo=try — here it is over 80%
Maybe we should generate a cached image every day with the current checkout.
Maybe a shallow clone would help.
Comment 1•5 years ago
The repo caching system is supposed to work by checking out a copy of the repo on the worker (in the case of linting jobs this means checking out mozilla-unified) on the first job the worker takes, then performing a working directory checkout (update) to complete the given task. Subsequent jobs should be able to re-use the mozilla-unified repo, so instead of doing a full clone, they just pull the new heads from try into the previously cloned repo and then check them out into the working directory.
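Roughly, the intended flow looks like this sketch (illustrative Python, not the actual run-task code; the helper name and the exact hg invocations are just to show the clone-vs-pull decision):

```python
import os
import subprocess

def ensure_checkout(cache_dir, upstream, head_rev):
    """Illustrative sketch of the caching flow described above (not the real
    run-task implementation): reuse a previously cloned repo when the cache is
    populated, otherwise pay for the full clone, then update to the revision."""
    if os.path.isdir(os.path.join(cache_dir, ".hg")):
        # Cache hit: only pull the new head (e.g. the try push) into the existing repo.
        subprocess.run(["hg", "-R", cache_dir, "pull", upstream, "-r", head_rev], check=True)
    else:
        # Cache miss: the expensive full clone of mozilla-unified (the ~5 minutes seen in the logs).
        subprocess.run(["hg", "clone", "--noupdate", upstream, cache_dir], check=True)
        subprocess.run(["hg", "-R", cache_dir, "pull", upstream, "-r", head_rev], check=True)
    # Check the requested revision out into the working directory.
    subprocess.run(["hg", "-R", cache_dir, "update", "-r", head_rev], check=True)
```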
Are the workers recognizing existing checkouts as valid caches? From the two examples given in the bug description it doesn't seem they are. How often are linting workers recycled in this setup? If it's rather often, it could be causing the tasks to reclone frequently. Assuming I'm not misinterpreting the 20% + 80% comment above, we're doing a full re-clone on every task?
A cached image or shallow clone might help here as well. I've also thought about triggering a mozilla-unified clone as part of the worker setup before exposing it as available to run tasks in Taskcluster, which would at least move the full clone timings out of task execution times. But let's start by making sure we're taking advantage of the current caching system effectively.
Comment 2•5 years ago
Yes, as far as I can tell every lint task clones mozilla-unified every time. It's likely even worse: the lint tasks don't have any specific caching configuration; this is all supposed to be handled automatically by the run-task script (and the related taskgraph configs). So I believe other types of tasks are affected as well, possibly anything that uses the lint image.
The strange thing is that it all appears to be configured properly. The tasks have a cache defined at /builds/worker/checkouts, yet every log that I've seen has this line:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272396788&repo=autoland&lineNumber=20
That's being logged here:
https://searchfox.org/mozilla-central/rev/55aa17110091deef24b913d033ccaf58f9c6d337/taskcluster/scripts/run-task#275
So in other words, /builds/worker/checkouts exists but is empty. The run-task script then proceeds to clone into /builds/worker/checkouts, yet the next task again detects that it's empty.
By contrast, here's a build task (where checkout cache is working):
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272396353&repo=autoland&lineNumber=15
They both use docker-worker and they both use the same taskgraph codepath to setup the cache (which is why I suspect something in the lint image).
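For reference, the relevant part of the generated task definition looks roughly like this (illustrative Python dict; the cache name and scope below are placeholders, not the exact values taskgraph emits):

```python
# Illustrative fragment of a docker-worker task definition that uses a
# persistent checkout cache. The cache name and scope are placeholders.
task_fragment = {
    "payload": {
        "cache": {
            # Mounted into the container across tasks on the same worker;
            # run-task clones into it on a miss and reuses it on a hit.
            "example-checkouts-cache": "/builds/worker/checkouts",
        },
    },
    "scopes": [
        # The task also needs a scope matching the cache name.
        "docker-worker:cache:example-checkouts-cache",
    ],
}
```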
Reporter
Comment 3•5 years ago
Thanks for investigating!
Any idea who could help us fix that?
Comment 4•5 years ago
I did discover that the way we set up the workspace differs between the build and lint images, notably the chown. No idea why that would matter, but I have a try run to test it out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=86138a6dea1a400a6bc3aa79cbc279012eb490c1
(Relatedly, I also don't know how to test caching changes beyond retriggering a lot and hoping a worker near the end previously ran a task with my changes).
Maybe Tom has some additional insight? Tl;dr: lint tasks are cloning Gecko every time despite a cache set up at /builds/worker/checkouts.
Comment 5•5 years ago
Ah, here's a lint task on mozilla-central where the caching is working:
https://treeherder.mozilla.org/logviewer.html#?job_id=272407760&repo=mozilla-central
So it's working sometimes, but very rarely.
Comment 6•5 years ago
It looks like these tasks run on test workers; given that most tests don't have a checkout, there is a high likelihood that these tasks will run on an instance that doesn't have the tree cloned.
It probably makes sense to have a new instance type for these jobs. (I also wonder if we can get away with a smaller instance for these jobs than the instance type we use to run the Firefox tests themselves.)
Comment 7•5 years ago
I think Tom's hypothesis about running the job on a suboptimal worker is probably right. As an alternative to running the jobs on a new worker type, maybe we could run lint jobs on the builders, where there is a higher chance of a cached repo. Just an idea in case standing up a new worker pool is too invasive of a change.
Comment 8•5 years ago
Standing up a new pool seems better. We can share this pool with other lightweight source-test tasks (e.g. Python unittests, docs, node tests, etc.), so we should be able to get decent usage out of it. I also agree we should try to move to a cheaper worker type.
Reporter
Comment 9•5 years ago
Coop, I have been told that you might be able to help with this.
We would need a pool for linting/static analysis, plus making sure that the caching works.
We are spending $$$ cloning the repo for every job here!
Merci
Comment 10•5 years ago
(In reply to Sylvestre Ledru [:Sylvestre] from comment #9)
Coop, I have been told that you might be able to help with this. We would need a pool for linting/static analysis, plus making sure that the caching works. We are spending $$$ cloning the repo for every job here!
For context, the linting jobs only cost us $1,377.73 in September.
Yes, we can help, and can get you onto a properly-sized instance type, but it will need to wait until later in November when we're done with existing migrations.
Comment 11•5 years ago
I agree this is likely lower priority, but I just want to point out that the bigger cost here is developer time wasted due to issues showing up later on Phabricator reviews.
Comment 12•5 years ago
(In reply to Connor Sheehan [:sheehan] from comment #1)
The repo caching system is supposed to work by checking out a copy of the repo on the worker (in the case of linting jobs this means checking out mozilla-unified) on the first job the worker takes, then performing a working directory checkout (update) to complete the given task. Subsequent jobs should be able to re-use the mozilla-unified repo, so instead of doing a full clone, they just pull the new heads from try into the previously cloned repo and then check them out into the working directory.
Is this supposed to happen on all task types? The only ones I have experience looking at are Windows build jobs, and I've never seen this incremental-clone behavior happen there, it's always been a painfully slow full clone. (Especially painful when I'm watching the task live and waiting to see whether configure succeeds at all)
Comment 13•5 years ago
(In reply to :dmajor from comment #12)
Is this supposed to happen on all task types? The only ones I have experience looking at are Windows build jobs, and I've never seen this incremental-clone behavior happen there, it's always been a painfully slow full clone. (Especially painful when I'm watching the task live and waiting to see whether configure succeeds at all)
See bug 1527313 and bug 1528891.
Comment 14•5 years ago
Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.
Comment 15•5 years ago
(In reply to Marco Castelluccio [:marco] from comment #14)
Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.
Still happy to help as required, but I don't think this work is blocked on the Taskcluster team any more. After the Firefox CI cluster split in November, I believe releng is the correct group to implement a new, dedicated worker type optimized for running linting jobs in CI. Redirecting NI to :jlund
We also now have access to better data about concurrent tasks, so it should be pretty easy to get a dedicated pool set up with a minimum capacity to make sure our cache hit rate is as high as possible.
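Something along these lines is what I have in mind for the dedicated pool (a sketch only; the pool name, field names, and values are placeholders, not a real worker-manager config):

```python
# Hypothetical shape of a dedicated lint/source-test worker pool.
# All names and numbers are placeholders for illustration.
lint_pool_sketch = {
    "pool_id": "gecko-t/t-linux-lint",  # placeholder pool name
    "config": {
        # Keep a floor of warm workers so the checkout caches stay populated
        # (higher hit rate), while capping the maximum to control cost.
        "minCapacity": 2,
        "maxCapacity": 50,
        # A smaller/cheaper instance type than the ones used for Firefox tests.
        "instanceTypes": [{"instanceType": "m5.large"}],
    },
}
```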
Comment 16•5 years ago
(In reply to Marco Castelluccio [:marco] from comment #14)
Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.
Added to triage; I will reply back later this week with an update. Leaving the needinfo open.
Assignee
Comment 17•5 years ago
I didn't change anything from the original workers (gecko-3/t-linux-xlarge and gecko-t/t-win10-64) except the naming and reducing the capacity on win10 to half of what it was. I'm open to review comments asking me to make other changes.
Assignee
Comment 19•5 years ago
Assignee
Comment 20•5 years ago
Assignee
Comment 21•5 years ago
(In reply to Marco Castelluccio [:marco] from comment #18)
What's left to do here?
I thought I had pushed these patches -- apparently I didn't, I'm sorry.
c#19 is one follow-up based on a review comment, and c#20 is what I think is necessary for the lint tasks to use a checkout appropriately. I'm not sure offhand if there are other jobs triggered by the Phabricator review flow that would benefit from the new worker type.
I also noticed there are some tasks that run shell scripts first, which might benefit from assuming a checkout entirely; that would be a good follow-up, but I don't know enough about them to say for sure. Either way, this patch should round out this bug's needs as specified.
Comment 22•5 years ago
Comment 23•5 years ago
Backed out changeset b02f0d3d7477 (Bug 1589712) for causing a gecko decision task bustage CLOSED TREE
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298415274&repo=autoland&lineNumber=902
Comment 24•5 years ago
Comment 25•5 years ago
bugherder
Assignee
Comment 26•5 years ago
Assignee
Comment 27•5 years ago
So, at a glance and with some back-of-the-napkin math, this looks like it was a win, although I don't know whether further tweaking of the number of instances, the TTL of each instance, etc. could improve matters.
The source for this screenshot is https://datastudio.google.com/u/2/reporting/15a6j_NSmFOYk6ZZ5efTGmx3LUxZFu9I4/page/rFKOB if anyone can access it.
Thanks go to Simon Fraser for creating these charts. We didn't spend too much time trying to measure it, and overall the end-to-end times still seem within the noise limits of the pre-change numbers.
Reporter
Comment 28•5 years ago
Many thanks Justin! :)