Bug 1170999 - [docker-worker] Retry image pulls with backoff
Opened 9 years ago · Closed 9 years ago
Product/Component: Taskcluster :: Workers (defect)
Status: RESOLVED FIXED
Tracking: (Not tracked)
People: Reporter: garndt; Assigned: garndt
Attachments: (1 file)
For image pulls that fail for reasons we could retry, we should backoff and retry up to N times.
Sometimes when pulling from Docker Hub we receive a 4xx error that could be retried. These are errors on public images that should not produce authorization errors, and the images do exist on the hub, so for some reason the hub is returning these errors anyway.
A Docker support ticket has been filed for this and their ops team is looking into it, but they have not yet determined the cause. They have explained that over the last couple of weeks they have been making backend changes that could cause this.
Either way, we should be smart about the errors and retry where possible.
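The backoff-and-retry idea can be sketched as below. This is a hedged illustration, not docker-worker's actual implementation: `pullWithRetry`, `isRetryable`, and the delay constants are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// isRetryable reports whether an HTTP status from the registry is worth
// retrying: 5xx and 429 are treated as transient. Per this bug, the hub
// sometimes returns spurious 4xx errors on public images, so a deployment
// could choose to widen this set.
func isRetryable(status int) bool {
	return status >= 500 || status == 429
}

// pullWithRetry invokes pull up to maxAttempts times, sleeping with
// exponential backoff plus jitter between attempts. pull stands in for
// the real image-pull call and returns the HTTP status it observed.
func pullWithRetry(pull func() (int, error), maxAttempts int) error {
	delay := 100 * time.Millisecond
	for attempt := 1; ; attempt++ {
		status, err := pull()
		if err == nil && status < 400 {
			return nil // success
		}
		if attempt == maxAttempts || (err == nil && !isRetryable(status)) {
			return fmt.Errorf("image pull failed after %d attempt(s): status %d, err: %v",
				attempt, status, err)
		}
		// Sleep for the current delay plus up to 50% jitter, then double it.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)/2)))
		delay *= 2
	}
}

func main() {
	// Simulate a registry that returns 504 twice before succeeding.
	calls := 0
	err := pullWithRetry(func() (int, error) {
		calls++
		if calls < 3 {
			return 504, nil
		}
		return 200, nil
	}, 5)
	fmt.Println("attempts:", calls, "err:", err)
}
```

The jitter matters in practice: without it, many workers that failed at the same time would all retry in lockstep against an already struggling registry.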
Comment 1 • 9 years ago
It would also be nice if the docker client had retries built in, at least as a configurable option...
We have a go library for http backoff (with interpretation of http response status codes):
https://godoc.org/github.com/taskcluster/httpbackoff
I wonder if it is worth integrating it into docker directly.
Comment 2 • 9 years ago (Assignee)
(warning: I might be reading the docker code base wrong)
They have some retry-and-backoff logic when pulling an image layer, but it seems they are not doing it at the request level, only for pulling image layers [1]. We do see 504 failures that, based on reading their code, should be retried but still end up erroring out.
We also have issues with fetching remote history [2], which does not implement retries; some of the failures we're seeing happen while fetching an image's history from the registry in the first place.
I think retrying on our side is a bandaid, but it might work. Perhaps there are some status checks we could do on our end when this situation happens before retrying, log that to papertrail, and submit an alert whenever our retries fail.
[1] https://github.com/docker/docker/blob/278798236bdf073dd7c66e32e21d81bbf9243656/registry/session.go#L241
[2] https://github.com/docker/docker/blob/278798236bdf073dd7c66e32e21d81bbf9243656/registry/session.go#L167-187
Comment 3 • 9 years ago
Very interesting!
It looks like they are using a custom transport for adding authentication. A neat fix might be to build retry logic directly into the transport, making it entirely transparent. Or maybe that is over-engineering.
Here is the custom transport:
https://github.com/docker/docker/blob/master/registry/session.go#L72
I'm wondering whether I might do something similar for the go taskcluster client (for authentication, and possibly also for exponential backoff)...
I agree with your analysis that the remote history is currently not retrying. :(
Pete
Comment 4 • 9 years ago (Assignee)
Attachment #8621734 - Flags: review?(jopsen)
Updated • 9 years ago
Assignee: nobody → garndt
Updated • 9 years ago
Status: NEW → ASSIGNED
Comment 5 • 9 years ago
Comment on attachment 8621734 [details]
Docker retry
Looks good to me; the comments are just nits...
Attachment #8621734 - Flags: review?(jopsen) → review+
Comment 6 • 9 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 9 years ago
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Updated • 6 years ago
Component: Docker-Worker → Workers