Closed Bug 1170999 Opened 9 years ago Closed 9 years ago

[docker-worker] Retry image pulls with backoff

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: garndt)

References

Details

Attachments

(1 file)

(deleted), text/x-github-pull-request
jonasfj: review+
Details
For image pulls that fail for reasons we could retry, we should back off and retry up to N times. Sometimes when pulling from Docker Hub we receive a 4xx error that could be retried. These errors occur on public images that shouldn't hit authorization problems and that do exist in the hub, so for some reason the hub is returning these errors anyway. A Docker support ticket has been filed for this and their ops team is looking into it, but they have not yet determined the cause. They have explained that over the last couple of weeks they have been making backend changes that could be responsible. Either way, we should be smart about these errors and retry where possible.
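Roughly, the shape of the retry being asked for here (a sketch only; pullWithBackoff, the retryable check, and the backoff numbers are hypothetical placeholders, not the actual docker-worker code, which is JavaScript rather than Go):

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/rand"
	"time"
)

// retryable reports whether a failed pull attempt is worth retrying. A real
// check would inspect the HTTP status code (5xx and the spurious 4xx
// responses described above); this placeholder retries everything.
func retryable(err error) bool {
	return err != nil
}

// pullWithBackoff calls pull up to maxAttempts times, sleeping with
// exponential backoff plus jitter between attempts. pull stands in for
// whatever the worker invokes to pull the image.
func pullWithBackoff(pull func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = pull(); err == nil {
			return nil
		}
		if !retryable(err) || attempt == maxAttempts {
			break
		}
		// Sleep 2^attempt * 100ms, plus up to 100ms of jitter.
		delay := time.Duration(math.Pow(2, float64(attempt))*100) * time.Millisecond
		delay += time.Duration(rand.Intn(100)) * time.Millisecond
		time.Sleep(delay)
	}
	return fmt.Errorf("image pull failed, giving up: %w", err)
}

func main() {
	calls := 0
	err := pullWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("503 Service Unavailable from registry")
		}
		return nil
	}, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```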
It would also be nice if the docker client had retries built in, at least as a configurable option... We have a Go library for HTTP backoff (with interpretation of HTTP response status codes): https://godoc.org/github.com/taskcluster/httpbackoff I wonder if it is worth integrating it into docker directly.
(warning: I might be reading the docker code base wrong) They have some retries and backoff when pulling an image layer, but it seems they are not doing it at the request level, only when pulling image layers [1]. We do have 504 failures that, based on reading their code, are being retried but still end up erroring out. We also have issues with fetching remote history [2], which does not implement retries, and some of the problems we're seeing are with fetching the history of an image from the registry in the first place. Retrying again on our part is a band-aid, but it might work. Perhaps there are some status checks we could do on our side when this situation happens before retrying, spit that out to papertrail, and submit an alert whenever our retries fail.
[1] https://github.com/docker/docker/blob/278798236bdf073dd7c66e32e21d81bbf9243656/registry/session.go#L241
[2] https://github.com/docker/docker/blob/278798236bdf073dd7c66e32e21d81bbf9243656/registry/session.go#L167-187
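For the "log each failure to papertrail and alert when retries run out" idea, something along these lines (a sketch under assumptions: the alert hook, image name, and pull callback are all hypothetical, and backoff between attempts is omitted for brevity):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// alert is a hypothetical hook for whatever is watching papertrail.
func alert(msg string) { log.Printf("ALERT: %s", msg) }

// pullAndReport runs pull up to maxAttempts times, logging each failure
// (papertrail collects the worker's log output) and raising an alert only
// once every attempt has failed.
func pullAndReport(image string, pull func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = pull(); err == nil {
			return nil
		}
		log.Printf("pull of %s failed (attempt %d/%d): %v", image, attempt, maxAttempts, err)
	}
	alert(fmt.Sprintf("pull of %s failed after %d attempts", image, maxAttempts))
	return err
}

func main() {
	_ = pullAndReport("ubuntu:14.04", func() error {
		return errors.New("504 from registry")
	}, 3)
}
```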
Very interesting! It looks like they are using a custom transport for adding authentication. A neat fix might be to build the retry logic directly into that transport, to make it entirely transparent. Or maybe this is over-engineering. Here is the custom transport: https://github.com/docker/docker/blob/master/registry/session.go#L72 I'm wondering whether I might do something similar for the Go taskcluster client (for authentication, and possibly also for exponential backoff)... I agree with your analysis that the remote history fetch is currently not retrying. :( Pete
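A rough sketch of what retries built into a transport could look like, using plain net/http. This is not docker's code: a real version would need to reuse the authentication behaviour of docker's existing transport and handle requests with bodies (which cannot simply be replayed), so it is only safe as written for GET-style requests. The status-code classification and the example URL are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryTransport wraps another RoundTripper and retries requests that fail
// at the network level or return a retryable status code.
type retryTransport struct {
	inner       http.RoundTripper
	maxAttempts int
}

// retryableStatus flags server errors and 429; the spurious 4xx responses
// from the hub described in this bug could be added here as a stopgap.
func retryableStatus(code int) bool {
	return code >= 500 || code == http.StatusTooManyRequests
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error
	delay := 200 * time.Millisecond
	for attempt := 1; attempt <= t.maxAttempts; attempt++ {
		resp, err = t.inner.RoundTrip(req)
		if err == nil && !retryableStatus(resp.StatusCode) {
			return resp, nil
		}
		if attempt == t.maxAttempts {
			break
		}
		if resp != nil {
			resp.Body.Close() // discard the failed response before retrying
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff between attempts
	}
	return resp, err
}

func main() {
	client := &http.Client{
		Transport: &retryTransport{inner: http.DefaultTransport, maxAttempts: 5},
	}
	resp, err := client.Get("https://registry.example.com/v1/_ping")
	if err != nil {
		fmt.Println("request failed after retries:", err)
		return
	}
	fmt.Println("status:", resp.StatusCode)
	resp.Body.Close()
}
```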
Attached file Docker retry (deleted) —
Attachment #8621734 - Flags: review?(jopsen)
Assignee: nobody → garndt
Status: NEW → ASSIGNED
Comment on attachment 8621734 [details] Docker retry Looks good to me; comments are just nits...
Attachment #8621734 - Flags: review?(jopsen) → review+
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1165759
Blocks: 1156326
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Component: Docker-Worker → Workers