Closed Bug 1170999 Opened 9 years ago Closed 9 years ago

[docker-worker] Retry image pulls with backoff

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: garndt)

References

Details

Attachments

(1 file)

(deleted), text/x-github-pull-request
jonasfj: review+
Details
For image pulls that fail for reasons we could retry, we should back off and retry up to N times. Sometimes when pulling from Docker Hub we receive a 4xx error that could be retried. These errors occur on public images that shouldn't hit authorization problems and that do exist in the hub, so for some reason the hub is returning these errors anyway. A Docker support ticket has been filed for this and their ops team is looking into it, but they have not yet determined the cause. They have explained that over the last couple of weeks they have been making backend changes that could be responsible. Either way, we should be smart about these errors and retry where possible.
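Roughly, the shape of the retry being asked for here (a sketch only; pullWithBackoff, the retryable check, and the backoff numbers are hypothetical placeholders, not the actual docker-worker code, which is JavaScript rather than Go):

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/rand"
	"time"
)

// retryable reports whether a failed pull attempt is worth retrying. A real
// check would inspect the HTTP status code (5xx and the spurious 4xx
// responses described above); this placeholder retries everything.
func retryable(err error) bool {
	return err != nil
}

// pullWithBackoff calls pull up to maxAttempts times, sleeping with
// exponential backoff plus jitter between attempts. pull stands in for
// whatever the worker invokes to pull the image.
func pullWithBackoff(pull func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = pull(); err == nil {
			return nil
		}
		if !retryable(err) || attempt == maxAttempts {
			break
		}
		// Sleep 2^attempt * 100ms, plus up to 100ms of jitter.
		delay := time.Duration(math.Pow(2, float64(attempt))*100) * time.Millisecond
		delay += time.Duration(rand.Intn(100)) * time.Millisecond
		time.Sleep(delay)
	}
	return fmt.Errorf("image pull failed, giving up: %w", err)
}

func main() {
	calls := 0
	err := pullWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("503 Service Unavailable from registry")
		}
		return nil
	}, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```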
It would also be nice if the docker client had retries built in, at least as a configurable option... We have a Go library for HTTP backoff (with interpretation of HTTP response status codes): https://godoc.org/github.com/taskcluster/httpbackoff I wonder if it is worth integrating it into docker directly.
(warning: I might be reading the docker code base wrong) They have some retries and backoff when pulling an image layer, but it seems they are not doing it at the request level, only when pulling image layers [1]. We do have 504 failures that, based on reading their code, are being retried but still end up erroring out. We also have issues with fetching remote history [2], which does not implement retries, and some of the problems we're seeing are with fetching the history of an image from the registry in the first place. Retrying again on our part is a band-aid, but it might work. Perhaps there are some status checks we could do on our side when this situation happens before retrying, spit that out to papertrail, and submit an alert whenever our retries fail.
[1] https://github.com/docker/docker/blob/278798236bdf073dd7c66e32e21d81bbf9243656/registry/session.go#L241
[2] https://github.com/docker/docker/blob/278798236bdf073dd7c66e32e21d81bbf9243656/registry/session.go#L167-187
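For the "log each failure to papertrail and alert when retries run out" idea, something along these lines (a sketch under assumptions: the alert hook, image name, and pull callback are all hypothetical, and backoff between attempts is omitted for brevity):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// alert is a hypothetical hook for whatever is watching papertrail.
func alert(msg string) { log.Printf("ALERT: %s", msg) }

// pullAndReport runs pull up to maxAttempts times, logging each failure
// (papertrail collects the worker's log output) and raising an alert only
// once every attempt has failed.
func pullAndReport(image string, pull func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = pull(); err == nil {
			return nil
		}
		log.Printf("pull of %s failed (attempt %d/%d): %v", image, attempt, maxAttempts, err)
	}
	alert(fmt.Sprintf("pull of %s failed after %d attempts", image, maxAttempts))
	return err
}

func main() {
	_ = pullAndReport("ubuntu:14.04", func() error {
		return errors.New("504 from registry")
	}, 3)
}
```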
Very interesting! It looks like they are using a custom transport for adding authentication. A neat fix might be to build the retry logic directly into that transport, to make it entirely transparent. Or maybe this is over-engineering. Here is the custom transport: https://github.com/docker/docker/blob/master/registry/session.go#L72 I'm wondering whether I might do something similar for the Go taskcluster client (for authentication, and possibly also for exponential backoff)... I agree with your analysis that the remote history fetch is currently not retrying. :( Pete
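A rough sketch of what retries built into a transport could look like, using plain net/http. This is not docker's code: a real version would need to reuse the authentication behaviour of docker's existing transport and handle requests with bodies (which cannot simply be replayed), so it is only safe as written for GET-style requests. The status-code classification and the example URL are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryTransport wraps another RoundTripper and retries requests that fail
// at the network level or return a retryable status code.
type retryTransport struct {
	inner       http.RoundTripper
	maxAttempts int
}

// retryableStatus flags server errors and 429; the spurious 4xx responses
// from the hub described in this bug could be added here as a stopgap.
func retryableStatus(code int) bool {
	return code >= 500 || code == http.StatusTooManyRequests
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	var err error
	delay := 200 * time.Millisecond
	for attempt := 1; attempt <= t.maxAttempts; attempt++ {
		resp, err = t.inner.RoundTrip(req)
		if err == nil && !retryableStatus(resp.StatusCode) {
			return resp, nil
		}
		if attempt == t.maxAttempts {
			break
		}
		if resp != nil {
			resp.Body.Close() // discard the failed response before retrying
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff between attempts
	}
	return resp, err
}

func main() {
	client := &http.Client{
		Transport: &retryTransport{inner: http.DefaultTransport, maxAttempts: 5},
	}
	resp, err := client.Get("https://registry.example.com/v1/_ping")
	if err != nil {
		fmt.Println("request failed after retries:", err)
		return
	}
	fmt.Println("status:", resp.StatusCode)
	resp.Body.Close()
}
```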
Attached file Docker retry (deleted) —
Attachment #8621734 - Flags: review?(jopsen)
Assignee: nobody → garndt
Status: NEW → ASSIGNED
Comment on attachment 8621734 [details] Docker retry Looks good to me; comments are just nits...
Attachment #8621734 - Flags: review?(jopsen) → review+
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1165759
Blocks: 1156326
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Component: Docker-Worker → Workers