Closed Bug 1132346 (Opened 10 years ago, Closed 9 years ago)

Automatically build (and rebuild) docker images based on the contents of the tree

Categories

(Taskcluster :: General, defect)

Platform: x86 macOS
Priority: Not set
Severity: major
Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: jlal, Assigned: garndt)

References

Details

Some background first: currently in taskcluster+gecko we rely on users to manually build / define / specify the docker image they wish to use. While this is very flexible, it is also a pain for those who maintain the images and completely impossible to scale within a single namespace (if I version-bump an image at the same time as someone else, problems happen!). Additionally, it is important to note that the docker-worker itself is only in a consistent state the _first_ time it sees an image of a given tag. For example, if you specify foobar:latest, the docker-worker will cache this and never check for updates.

The proposal here is to define a hashing / building system which will fingerprint each image based on the contents of its individual folder + "branch" + "remote". (Note: in quotes because this system must work in both the hg and git worlds.) The way this will work is that the decision task, as it creates the graph, will attempt to identify the correct image to use based on the fingerprint. If the fingerprint is present (whether in the index or a docker registry is left to be defined), that specific version of the image will be used. In the case where we cannot identify an image based on its fingerprint, we will schedule (as a parent node of any tasks which require the image) a task which will construct this docker image.

Some additional implementation-related notes:

- It is very important (to me) to index by both the contents _and_ by the repository + revision. This is an added level of security if somehow the machines which create these images are compromised.
- The machines which build these images must be separated based on security requirements (your "create docker image" worker type that runs on gaia/try is not the same one that runs on b2g-inbound or release).
- The docker image format has changed significantly (with backwards-compatible support) and supports direct pulls from s3 and pushes to s3 with special tooling. We should use this, but be aware that it can race _wildly_ in theory if we have missing docker images with the same fingerprint/namespace.
- This also serves as a form of test: if this process fails in automation, we know the images are broken locally.
- This is a stepping stone to tying the entire state of our infra to particular commits.
There are lots of reasons why we should do this, but in particular I want to address the concerns of early users who had to build / rebuild / fuss with docker images. Aside from better security, etc., my aim here is to make it very easy for people to play with the docker images on try to get the desired state. If you have any other pain points around updating your images (or suggestions for how we fingerprint, etc.) please let me know.
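To make the fingerprinting idea above concrete, here is a minimal sketch in Python. It assumes a hypothetical layout where each image lives in its own folder in the tree, and hypothetical index/graph helpers; whether lookup goes through the index or a docker registry is still left to be defined, as noted above.

    import hashlib
    import os

    def image_fingerprint(image_dir, branch, remote):
        # Hash the folder contents plus the "branch" and "remote" so the same
        # sources always map to the same fingerprint, in both the hg and git worlds.
        h = hashlib.sha256()
        h.update(branch.encode())
        h.update(remote.encode())
        for root, dirs, files in os.walk(image_dir):
            dirs.sort()                       # deterministic traversal order
            for name in sorted(files):
                path = os.path.join(root, name)
                h.update(os.path.relpath(path, image_dir).encode())
                with open(path, 'rb') as f:
                    h.update(f.read())
        return h.hexdigest()

    def resolve_image(graph, index, image_dir, branch, remote):
        # Decision-task side: reuse an already-built image if the fingerprint is
        # known, otherwise schedule a build task as a parent of anything needing it.
        # `graph` and `index` are stand-ins for whatever the real APIs end up being.
        fp = image_fingerprint(image_dir, branch, remote)
        existing = index.lookup(fp)           # hypothetical lookup (index or registry)
        if existing is not None:
            return existing
        return graph.add_parent_build_task(image_dir, fp)   # hypothetical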
Flags: needinfo?(wcosta)
Flags: needinfo?(garndt)
Flags: needinfo?(dustin)
Ah, additionally I want to note that the work to actually build/publish these docker images would itself be a worker type (and a new worker); that piece is likely more generic, and a blocker to this.
Flags: needinfo?(dustin)
I have been batting around the idea of something very similar to this, but with a bit wider scope. Morgan's been working on building docker images of the current desktop build environment, too, so she and I have talked about it quite a bit. We've been calling it a "bakery". Beyond what you've outlined, it would need to deal with:

* Building AMIs for spot and on-demand instances -- this involves a lot of prep for actually booting the instance, beyond just creating a glorified tarball.
* Building other image formats (WinJail ZIPs? OS X chroots? VHDs? QCOW2s? whatever..)
* "Chained" dependencies, where one image is based on another (e.g., TC docker images now, or releng's golden AMIs and base AMIs).
* Image parameters, for example disk size. The idea is that, say, the linux builder AMI could be based on base-centos[centos=65, disk=50], and the bakery would resolve that by following the base-centos recipe with the given parameters passed in (a rough sketch of parsing such a spec follows this comment).
* Regular rebuilds. Releng has demonstrated pretty conclusively that the "freeze on this version and plan to upgrade Next Quarter" plan is ineffective and leaves you running a 3-year-old operating system with an enormous jump to upgrade. Ideally we can rebuild most of our images on a daily or weekly basis, against the upstream repositories every time. Then we get the benefit of upstream updates (including security) when they occur, and we're never too far behind. If we run into an image that fails, a revert is as easy as using last week's image and pausing rebuilds until we've solved the issue.
* Ability to automatically run PuppetAgain. It really is a cool and powerful system for building hosts -- just not so useful for building build environments.
* A simple API and UI to trace the provenance of an image. I've gone half-blind over the last few weeks trying to remember the difference between ami-83a5912 and ami-8390a82! I'm actually stuck searching bugzilla for the string because the image itself has very little associated metadata as to what revision built it. This API could easily be based on the index API, or we could create something within relengapi to track it.
* Ability for users -- at least those with a suitable level of trust -- to specify images in a "try" scenario (Github PRs? HG Try? Whatever..) and then run jobs against those images.
* Ability to clean up unused images -- we're going to generate 100's of GB a day, which will get expensive quickly.

I'd been considering basing the whole thing on running packer in TC tasks, but now that I've seen up-close how releng builds images, that might be tricky. We build "base" AMIs from scratch in a chroot (with the host based on an Amazon Linux AMI), then snapshot the resulting EBS volume and do some fancy stuff to make it boot. For some hosts, we then start an instance with that AMI and run puppet; for others, we start an instance, run puppet, futz a little, and then capture a snapshot into a "golden AMI" which is used for lots of spot and on-demand instances.

Basically, this is a great idea and I'd been hoping to find some time to work on it next quarter, but if stuff happens sooner, even better!
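For the image-parameters bullet, here is a rough sketch (Python) of how a spec like base-centos[centos=65, disk=50] could be split into a recipe name and parameters before the bakery resolves it. Only the spec syntax comes from the comment above; the parsing itself is made up for illustration.

    import re

    def parse_image_spec(spec):
        # Split "name[key=value, ...]" into a recipe name and a parameter dict.
        m = re.match(r'^(?P<name>[\w.-]+)(\[(?P<params>[^\]]*)\])?$', spec)
        if not m:
            raise ValueError('bad image spec: %r' % spec)
        params = {}
        for pair in filter(None, (m.group('params') or '').split(',')):
            key, _, value = pair.strip().partition('=')
            params[key] = value
        return m.group('name'), params

    print(parse_image_spec('base-centos[centos=65, disk=50]'))
    # -> ('base-centos', {'centos': '65', 'disk': '50'})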
I think much of the effort that goes into producing RelEng's AMIs for builds goes away once we are doing work inside docker containers. Similar issues remain for producing the docker images, but at least you don't have to worry about bizarre AWS grub configurations.
For builders, yes, but for AMIs for servers in autoscaling groups and such, and for host images to run workers in, we still need AMI creation.
> * "chained" dependencies, where one image is based on another (e.g., TC docker images now, or releng's golden AMIs and base AMIs)

Yeah, this one is a requirement for us too (I did not specify the details here), but I believe this is pretty easy for the packer/docker case (though we end up with some domain-specific knowledge outside of packer/docker to make this work).

> Ability for users -- at least those with a suitable level of trust -- to specify images in a "try" scenario (Github PRs? HG Try? Whatever..) and then run jobs against those images.

Even for AMIs? (Not for or against, just curious how this would work.) The primary motivation for this bug (for me) is to support this workflow for docker images.

----

Again, I don't claim to understand all the cases, but I am curious how much of your non-build/test cases could be solved in the "docker" space, or in a similar space to that of heroku / etc. I fully expect that even with super nice AMI-building tools we will still be somewhat sad when we build them, as the overall process is slow even if you have a running machine (1-5 min for us, depending on what machine we are using and its network capabilities).
I'd like it for AMIs, too, yes, but that doesn't have to be possible at the outset. I've been stuck in the funny position this week of generating AMIs from uncommitted code, and the generation process automatically puts them in production. So we end up running production builds on un-committed code in a working directory on my laptop. You may be right that some of our requirements can be better met by Docker - it's hard to say right now. And yes, they take a long time to build!
So, I've been working on publishing our build environments as docker containers here: https://github.com/mrrrgn/build-mozilla-build-environments and I've also built a docker container which builds the docker containers in that repo and publishes them.

I've been looking at scheduling my container to run with TC, but there's a problem: I need it to run in privileged mode so that I can run a docker daemon inside of it: https://blog.docker.com/tag/inception/ The other option would be to mount /var/run/docker.sock, but that's super dangerous! :p

If we'd like to use TC to run docker-building docker images (oy vey, the recursion), we'll need to support this flag. The flag should also only be allowed for privileged users. How do we feel about me going ahead and cracking on a bug for this so that I can push my own work forward?
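For reference, the two options described above boil down to these stock docker invocations (--privileged, and -v for the socket bind-mount). The Python wrapper below is purely illustrative, and "image-builder" is a placeholder image name, not anything that exists in TC today.

    import subprocess

    def run_builder(privileged=True):
        if privileged:
            # Option 1: a privileged container, so a docker daemon can run inside
            # it (the docker-in-docker approach from the linked blog post).
            cmd = ['docker', 'run', '--privileged', 'image-builder']
        else:
            # Option 2: bind-mount the host daemon's socket into the container.
            # The container then has full control of the host's docker, which is
            # why this option is "super dangerous".
            cmd = ['docker', 'run',
                   '-v', '/var/run/docker.sock:/var/run/docker.sock',
                   'image-builder']
        subprocess.check_call(cmd)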
Depends on: 1133783
Flags: needinfo?(wcosta)
Component: TaskCluster → General
Product: Testing → Taskcluster
Flags: needinfo?(garndt)
Worth noting: this docker-image-building script should (at least optionally) do a `docker export` of the result as an artifact, to support things like hackfests where we'll have tens of people all wanting to get their hands on the same image over a basic 'net connection. Then someone can download the export tarball once and put it on some USB sticks.
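A sketch of the hand-off this enables, again as illustrative Python around the stock docker CLI. The comment names `docker export` (which dumps a container's filesystem); the `docker save`/`docker load` pair shown here keeps the image's layers and tag instead, which is an assumption about what the artifact would need; "built-image" is a placeholder name.

    import subprocess

    # On the machine with connectivity: write the built image out as a tarball artifact.
    subprocess.check_call(['docker', 'save', '-o', 'built-image.tar', 'built-image:latest'])

    # On a hackfest laptop, after copying the tarball from a USB stick:
    subprocess.check_call(['docker', 'load', '-i', 'built-image.tar'])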
Severity: normal → major
This has been implemented in bug 1226413, and the worker side in bug 1210039.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → garndt