Closed
Bug 1426445
Opened 7 years ago
Closed 7 years ago
Cache poisoning introduced by tasks on maple repository
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(firefox60 fixed)
RESOLVED
FIXED
mozilla60
| | Tracking | Status |
|---|---|---|
| firefox60 | --- | fixed |
People
(Reporter: gps, Assigned: tomprince)
Attachments
(2 files)
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=62b281c39548aa349fd1141caed5d4340700bbb6 has a number of toolchain failures due to cache poisoning. e.g. https://public-artifacts.taskcluster.net/e2ZuCgLVS7K8vIXp4YhFIQ/0/public/logs/live_backing.log says:
[cache 2017-12-20T18:09:18.498Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
error: requirements for populated cache /builds/worker/checkouts differ from this task
cache requirements: gid=1000 uid=1000 version=1
our requirements: gid=500 uid=500 version=1
There is a UID/GID mismatch on the cache. This likely means:
a) different tasks are running as a different user/group
b) different Docker images have different UID/GID for the same user/group
Our cache policy is that the UID/GID for ALL tasks must be consistent
for the lifetime of the cache. This eliminates permissions problems due
to file/directory user/group ownership.
To make this error go away, ensure that all Docker images are use
a consistent UID/GID and that all tasks using this cache are running as
the same user/group.
audit log:
[2017-12-20T17:04:50.384901Z HTSWdLPsTTye45MzLzGSZQ] created; requirements: gid=1000, uid=1000, version=1
[2017-12-20T18:07:28.926290Z JBUD_5QMT9uqY3Cd5-dyvQ] requirements mismatch; wanted: gid=500, uid=500, version=1
[2017-12-20T18:09:18.498914Z e2ZuCgLVS7K8vIXp4YhFIQ] requirements mismatch; wanted: gid=500, uid=500, version=1
If we follow the audit log, HTSWdLPsTTye45MzLzGSZQ was a task on maple. Every other failing task seems to have a maple task as the root task.
We are supposed to be using uid/gid 500:500 for the worker:worker user:group. However, some maple tasks seem to be using 1000:1000.
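For readers following the log above, here is a minimal sketch of the kind of comparison that produces the "requirements mismatch" error. It is not the actual run-task code; the function names `parse_requirements` and `requirements_match` are hypothetical, and the only thing taken from the log is the `key=value` format of the requirement strings.

```python
def parse_requirements(text):
    """Parse a string such as 'gid=1000 uid=1000 version=1' into a dict.

    Commas are treated as separators too, since the audit log prints
    'gid=1000, uid=1000, version=1'.
    """
    pairs = (item.split("=", 1) for item in text.replace(",", " ").split())
    return {key: value for key, value in pairs}


def requirements_match(cache_requirements, task_requirements):
    """Return (ok, differing_keys) for two requirement strings."""
    cached = parse_requirements(cache_requirements)
    wanted = parse_requirements(task_requirements)
    differing = sorted(
        key
        for key in cached.keys() | wanted.keys()
        if cached.get(key) != wanted.get(key)
    )
    return (not differing, differing)
```

Feeding it the two requirement lines from the log reports `gid` and `uid` as the differing keys, which is the situation described here: the cache was created by a 1000:1000 task and later used by 500:500 tasks.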
Reporter
Comment 1•7 years ago
There is a google-play-strings docker image on maple using ubuntu:16.04 as the base image. This image produces a worker:worker user:group with uid:gid 1000:1000 instead of 500:500. This is the source of our cache poisoning.
This Dockerfile will need to do the following:
groupadd -g 500 worker
useradd -u 500 -g 500 worker
I'd do this, but I have to run off to a meeting.
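As a rough way to confirm that an image ends up with the intended IDs after the `groupadd`/`useradd` change, one could inspect the image's `/etc/passwd`. The sketch below assumes the file's contents have already been read into a string; `find_ids` is a hypothetical helper, not part of any existing tooling.

```python
def find_ids(passwd_text, user):
    """Return (uid, gid) for `user` from /etc/passwd-formatted text.

    Each line has the form name:password:UID:GID:GECOS:home:shell.
    Returns None when the user is not listed.
    """
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == user:
            return int(fields[2]), int(fields[3])
    return None


# Example input resembling an image where useradd picked the default IDs.
sample = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "worker:x:1000:1000::/home/worker:/bin/sh\n"
)
ids = find_ids(sample, "worker")
if ids != (500, 500):
    print("worker has uid:gid %d:%d, expected 500:500" % ids)
```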
Flags: needinfo?(bhearsum)
Comment 2•7 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> There is a google-play-strings docker image on maple using ubuntu:16.04 as
> the base image. This image produces a worker:worker user:group with uid:gid
> 1000:1000 instead of 500:500. This is the source of our cache poisoning.
>
> This Dockerfile will need to do the following:
>
> groupadd -g 500 worker
> useradd -u 500 -g 500 worker
>
> I'd do this, but I have to run off to a meeting.
https://hg.mozilla.org/projects/maple/rev/f5047440978ee8f81507dde1119d3e5dd7e7d03f
Original bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1385401, which I've commented in.
Flags: needinfo?(bhearsum)
Comment 3•7 years ago
3 toolchain jobs on glandium's push of bug 1426324 also showed uid:gid mismatches. The retriggers succeeded.
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=f41ca59052be9d0a74e3c1d03695dab446a021ae&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&filter-searchStr=toolchains
[taskcluster 2017-12-22 12:57:53.157Z] === Task Starting ===
[setup 2017-12-22T12:57:53.474Z] run-task started
[cache 2017-12-22T12:57:53.476Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
error: requirements for populated cache /builds/worker/checkouts differ from this task
cache requirements: gid=1000 uid=1000 version=1
our requirements: gid=500 uid=500 version=1
There is a UID/GID mismatch on the cache. This likely means:
a) different tasks are running as a different user/group
b) different Docker images have different UID/GID for the same user/group
Our cache policy is that the UID/GID for ALL tasks must be consistent
for the lifetime of the cache. This eliminates permissions problems due
to file/directory user/group ownership.
To make this error go away, ensure that all Docker images are use
a consistent UID/GID and that all tasks using this cache are running as
the same user/group.
audit log:
[2017-12-22T12:01:15.274135Z Ny19mzYSTgyh1UJne9dXnw] created; requirements: gid=1000, uid=1000, version=1
[2017-12-22T12:57:53.476423Z G9MIn2jFQIKWK4LUdmW0DA] requirements mismatch; wanted: gid=500, uid=500, version=1
Assignee
Comment 4•7 years ago
I think the problem in Comment 3 is due to using the `lint` image on a gecko-N-b-linux worker. The lint image uses ubuntu:16.04 as a base, which, as Comment 1 suggests, uses 1000:1000 as the UID/GID. This isn't a problem on gecko-t-linux-* workers, as all the images there are based on ubuntu:16.04, so presumably have the same UID/GID. The task in question (Ny19mzYSTgyh1UJne9dXnw[1]) runs on gecko-N-b-linux, though, and the main image used there is `desktop-build`, which is based on a centos6 image.
The solution is probably to adjust all the images based on ubuntu:16.04 to explicitly set the UID/GID. Doing this will require purging all the caches those images use. Alternatively, `run-task` can be changed to verify that the UID/GID of `worker` is always 500:500 (which would incidentally cause the names of the caches used to change).
[1] https://tools.taskcluster.net/groups/YywkNiUjT7CaVXtRvZiF4Q/tasks/Ny19mzYSTgyh1UJne9dXnw/details
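The `run-task` verification suggested above could look roughly like the following. This is a sketch under stated assumptions, not the patch that landed: `check_worker_ids`, `system_uid`, and `system_gid` are hypothetical names, and the expected IDs are parameters because the thread discusses both 500:500 and 1000:1000. The lookups use the standard `pwd` and `grp` modules, which raise `KeyError` for unknown names.

```python
import grp
import pwd


def system_uid(user):
    """Look up a user's uid in the system password database."""
    return pwd.getpwnam(user).pw_uid


def system_gid(group):
    """Look up a group's gid in the system group database."""
    return grp.getgrnam(group).gr_gid


def check_worker_ids(expected_uid, expected_gid, user="worker", group="worker",
                     lookup_uid=system_uid, lookup_gid=system_gid):
    """Return a list of problems; an empty list means the IDs are as expected."""
    problems = []
    try:
        uid = lookup_uid(user)
        if uid != expected_uid:
            problems.append(
                "user %s has uid %d; expected %d" % (user, uid, expected_uid))
    except KeyError:
        problems.append("user %s does not exist" % user)
    try:
        gid = lookup_gid(group)
        if gid != expected_gid:
            problems.append(
                "group %s has gid %d; expected %d" % (group, gid, expected_gid))
    except KeyError:
        problems.append("group %s does not exist" % group)
    return problems
```

Running such a check before touching any cache lets a task fail early instead of creating a cache with the wrong ownership. The lookup functions are injectable only so the check can be exercised without a real `worker` account.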
Comment 5•7 years ago
Note we're not far from switching off centos for builds, so we should probably switch the centos images to 1000:1000 instead. Or wait for bug 1399679, which, at this point, is almost a review away (I have it all working; I just need to split my patch queue into reviewable form and put it up for review).
Assignee
Comment 7•7 years ago
I implemented https://github.com/taskcluster/docker-worker/pull/347 to allow run-task to request that the cache be destroyed when it notices inconsistencies like this.
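To illustrate the idea, here is a minimal sketch of how a task wrapper might signal that a cache should be destroyed. It assumes the worker is configured to purge caches when the task exits with a designated status. The constant `EXIT_PURGE_CACHE`, its value, and the function name are placeholders for illustration, not details taken from the pull request.

```python
# Placeholder status; the value the worker actually watches for is an
# assumption in this sketch.
EXIT_PURGE_CACHE = 72


def exit_status_for_cache(cache_requirements, task_requirements,
                          purge_status=EXIT_PURGE_CACHE):
    """Return (status, message) for a populated cache.

    Both arguments are dicts such as {"uid": 500, "gid": 500, "version": 1}.
    A status of 0 means the cache can be used as-is.
    """
    if cache_requirements == task_requirements:
        return 0, "cache requirements match"
    mismatched = sorted(
        key
        for key in set(cache_requirements) | set(task_requirements)
        if cache_requirements.get(key) != task_requirements.get(key)
    )
    return purge_status, "requesting cache purge; mismatched: " + ", ".join(mismatched)
```

The caller would print the message and exit with the returned status, so that a later task on the same worker starts from an empty cache rather than failing on the same mismatch.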
Comment 10•7 years ago
mozreview-review
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;
https://reviewboard.mozilla.org/r/209752/#review215482
This totally makes sense, but I feel like there was some reason we didn't do this already....
Comment 11•7 years ago
mozreview-review
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;
https://reviewboard.mozilla.org/r/209752/#review215522
Attachment #8939416 - Flags: review?(dustin) → review+
Reporter
Comment 12•7 years ago
mozreview-review
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;
https://reviewboard.mozilla.org/r/209752/#review216386
I didn't implement this because a) it wasn't needed and b) it is somewhat architecturally impure (ideally we shouldn't have low-level details like the "worker" user/group embedded in run-task). But cache poisoning can cause major headaches and run-task is [currently] Firefox-centric, so I'm OK with teaching run-task about the "worker" user and group so we can fail faster and not poison caches in the process.
Attachment #8939416 - Flags: review?(gps) → review+
Comment 13•7 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/15a9e149f2db
Add sanity check that worker uid/gid is 1000 in run-task; r=dustin,gps
Assignee
Updated•7 years ago
Keywords: leave-open
Comment 14•7 years ago
Backout by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a77c974e4c75
Backed out changeset 15a9e149f2db for build bustage
Comment 18•7 years ago
Comment on attachment 8942814 [details]
Bug 1426445: Purge task caches, when an incompatible cache is found; r=gps
Gregory Szorc [:gps] has approved the revision.
https://phabricator.services.mozilla.com/D395#9589
Attachment #8942814 - Flags: review+
Assignee
Comment 20•7 years ago
Note that this won't show up on try, as incorrect permissions are ignored there.
Comment 23•7 years ago
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/mozilla-inbound/rev/69b3883f83a4
Purge task caches, when an incompatible cache is found; r=gps
Comment 26•7 years ago
bugherder
Reporter
Comment 28•7 years ago
mozreview-review
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;
https://reviewboard.mozilla.org/r/209752/#review224688
I'm fine with doing this and with the value 1000:1000. However, unless I'm not seeing a patch that has landed already, debian-base is still using 500:500.
Also, when switching to 1000:1000, I prefer we bump the cache key to force a new cache. Otherwise we'll just incur cache eviction on the VCS cache, since it is shared across repos.
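The cache key bump mentioned above can be pictured with a small sketch. The naming scheme here (`level-<n>-<purpose>-v<version>`) and the `cache_name` helper are hypothetical and only illustrate the principle: if the version is part of the name, changing it makes tasks ask for a cache that does not exist yet, so nothing created under the old UID/GID is ever reused.

```python
def cache_name(level, purpose, version):
    """Build a cache name whose version component can be bumped.

    The scheme is illustrative only; real cache names may be derived
    differently.
    """
    return "level-%d-%s-v%d" % (level, purpose, version)


# Tasks built for 500:500 and tasks built for 1000:1000 would then
# request differently named caches and never see each other's files.
old_name = cache_name(3, "checkouts", 1)
new_name = cache_name(3, "checkouts", 2)
```

With distinct names, old and new tasks each keep their own populated cache instead of repeatedly evicting a shared one.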
Reporter
Comment 30•7 years ago
mozreview-review
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;
https://reviewboard.mozilla.org/r/209752/#review224694
Ship it!
Comment 31•7 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ac157b31db6e
Add sanity check that worker uid/gid is 1000 in run-task; r=dustin,gps
Comment 32•7 years ago
bugherder
Updated•7 years ago
Product: TaskCluster → Firefox Build System
Assignee
Comment 34•7 years ago
I think this is adequately protected against now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
Keywords: leave-open