Closed
Bug 1276813
Opened 9 years ago
Closed 7 years ago
Intermittent "Failed to download from all available URLs, aborting" downloading from queue.taskcluster.net
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: philor, Unassigned)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell infra])
Attachments
(2 files)
(deleted), image/png
(deleted), patch | armenzg: review+
+++ This bug was initially created as a clone of Bug #1260584 +++
https://treeherder.mozilla.org/logviewer.html#?job_id=29062156&repo=mozilla-inbound
Comment 1•9 years ago
Worth noting that these are 500 errors.
I can open a new case with AWS. Before we do that, it's worth looking at the CloudFront logs, though, to give the AWS support folks some request-ids to look at. I'm at PyCon and don't have the resources to pull that data.
Comment 7•8 years ago
It appears this issue hasn't occurred again in over a week. If it happens again, we need to reopen this bug and attach logs and request IDs that can be sent to AWS.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Comment 9•8 years ago
(In reply to Greg Arndt [:garndt] from comment #7)
> It appears this issue hasn't occurred again in over a week. If it happens
> again, we need to reopen this bug and attach logs and request IDs that can
> be sent to AWS.
https://treeherder.mozilla.org/logviewer.html#?job_id=67211628&repo=autoland&lineNumber=315
Status: RESOLVED → REOPENED
Flags: needinfo?(garndt)
Resolution: WORKSFORME → ---
Comment 10•8 years ago
Looking over the errors, they were all from datacenter OS X machines and showed a variety of failures (some connection, some SSL). I'm guessing there was a hiccup here that resolved quickly on its own, and it seems to be different from what this bug was originally about.
Flags: needinfo?(garndt)
Comment 19•8 years ago
This picked up again on the 19th. :garndt, can you help figure out what is going on here? It looks to be OS X-specific.
Flags: needinfo?(garndt)
Comment 20•8 years ago
Looking over the results for the last month, outside of a few Linux tests that were incorrectly starred under this bug and some 503s reported by the Windows machines early in the month, the remaining failures are entirely from Buildbot OS X machines in the datacenter.
Unfortunately, most of the errors reported display one of the following:
Caught exception: ''
OR
Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
Neither of these gives us much information about what's going on. I suspect these are blips, considering it's not happening on every push and not on the thousands of other machines running Linux and Windows. Is there a way we can get more information from mozharness when this fails?
Also, looking at our logs from the queue for that artifact, I see plenty of standard responses (redirects to S3), but no unexpected ones (like internal server errors).
Flags: needinfo?(garndt)
Comment 21•8 years ago
Just a heads up: you're going to see hundreds of these come in today. There was a transient issue with downloading artifacts that Dustin and John have been looking into. It appears to be different from the OS X failures because an actual exception (a timeout) was printed to the logs.
Updated•8 years ago
Whiteboard: [stockwell infra]
Comment 24•8 years ago
As noted in comment 21, we were expecting this uptick due to some timeout issues experienced on the 22nd. There were only a few failures after that, all on OS X machines as mentioned in comment 20. I am not sure why just the OS X machines hit errors like that.
Comment 26•8 years ago
oh another big spike today :(
Comment 27•8 years ago
I spot-checked a dozen of the jobs that were starred, and they were failing to download a file from GitHub, not queue.taskcluster.net. We should probably host that file somewhere else so we're not depending on GitHub.
Socket error when accessing https://raw.githubusercontent.com/mozilla/build-tooltool/master/tooltool.py: The read operation timed out
Comment 30•8 years ago
Almost all of the failures listed in comment 29 are due to the GitHub issue described in comment 27.
Comment 32•8 years ago
this is behaving better lately :)
Comment 33•8 years ago
It's great that it's a little better. It's still troubling that this is primarily happening on the OS X machines using Buildbot in the datacenter, and that the logging does not indicate what the problem is.
I do not know who we need to loop in to check out those machines and put together some information about what is failing, but because it's largely limited to that set of machines, I think this is isolated to those datacenter machines rather than a larger Taskcluster issue.
Comment 38•8 years ago
Kim, could you have a look at this, or redirect to someone who might know what's happening here?
This bug likely started out as multiple problems, but now it is only about intermittent download failures on OS X. We've already run it by the Taskcluster folks (comment 33).
Flags: needinfo?(kmoir)
Comment 39•8 years ago
The code involved in the failure is _download_file() rather than _fetch_into_memory() [1]
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#413-485
It generally seems to involve the .dmg file, after other files from that same task ID have downloaded successfully.
Is there something special about how cloud-mirror stores or copies .dmg files around that could fail intermittently?
I've noticed something useful (I hope) in one of the logs [2]:
04:24:35 INFO - Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
Here's also the try/except block related to this:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1102-1111
[1] https://docs.google.com/document/d/1HRIvVBqTDsSebOkXqMPxTRw3nPwsYYkonooHrC0ksvA/edit
[2]
04:21:21 INFO - Downloading https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg to /builds/slave/test/installer.dmg
04:21:21 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #1
04:21:53 INFO - retry: attempt #1 caught exception: ''
04:21:53 INFO - retry: Failed, sleeping 30 seconds before retrying
04:22:23 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #2
04:22:57 INFO - retry: attempt #2 caught exception: ''
04:22:57 INFO - retry: Failed, sleeping 60 seconds before retrying
04:23:57 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #3
04:24:35 WARNING - URL Error: https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg
04:24:35 INFO - Running command: ['nslookup', 'queue.taskcluster.net']
04:24:35 INFO - Copy/paste: nslookup queue.taskcluster.net
04:24:35 INFO - Server: 10.26.75.40
04:24:35 INFO - Address: 10.26.75.40#53
04:24:35 INFO - queue.taskcluster.net canonical name = toyama-73636.herokussl.com.
04:24:35 INFO - toyama-73636.herokussl.com canonical name = elb081832-1105536650.us-east-1.elb.amazonaws.com.
04:24:35 INFO - Name: elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35 INFO - Address: 54.243.65.185
04:24:35 INFO - Name: elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35 INFO - Address: 23.21.42.179
04:24:35 INFO - Name: elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35 INFO - Address: 54.243.87.215
04:24:35 INFO - Return code: 0
04:24:35 INFO - retry: attempt #3 caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
04:24:35 INFO - Can't download from https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg to /builds/slave/test/installer.dmg!
04:24:35 INFO - Caught exception: ''
04:24:35 INFO - Caught exception: ''
04:24:35 INFO - Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
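The empty "Caught exception: ''" messages above come from logging only the exception's string form, which is blank for some socket and SSL errors. Below is a minimal sketch of the kind of diagnostic logging that would help here; the function and variable names are illustrative, not the actual mozharness code (mozharness at the time used Python 2's urllib2, while this example uses Python 3's urllib):

# Illustrative sketch only -- not the mozharness implementation.
# Logging the exception class and repr() keeps the SSL/socket details
# that str(e) drops, which is why the logs above show only ''.
import urllib.error
import urllib.request


def download_with_diagnostics(url, dest, timeout=30):
    """Download url to dest, reporting the exception type on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            with open(dest, "wb") as fh:
                fh.write(resp.read())
    except (urllib.error.URLError, OSError) as e:
        # repr(e) includes the class name and any wrapped errno/SSL
        # reason, even when str(e) is an empty string.
        print("Caught %s exception: %r" % (type(e).__name__, e))
        raise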
Comment 40•8 years ago
Since Armen responded, I'm going to clear my needinfo. If you need someone from buildduty to look at this, let me know and I can redirect it.
Flags: needinfo?(kmoir)
Comment 41•8 years ago
Yes, please! (I don't think Armen will be following up...he was just adding some potentially helpful info. We need someone to dig in here and try to resolve these failures.)
Flags: needinfo?(kmoir)
Comment 42•8 years ago
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #39)
> It generally seems to involve the .dmg file, after other files from that
> same task ID have downloaded successfully.
> Is there something special about how cloud-mirror stores or copies .dmg
> files around that could fail intermittently?
>
These are machines downloading from the datacenter (outside of AWS), so cloud-mirror is not a factor. Cloud-mirror only copies things between AWS regions as machines within AWS request them. All machines outside of AWS download those files through CloudFront in front of our us-west-2 S3 bucket.
Comment 43•8 years ago
Alin, could you get someone from the buildduty team to investigate this bug when you have a chance?
Flags: needinfo?(kmoir) → needinfo?(aselagea)
Comment 46•8 years ago
I did some investigation here but couldn't find any reason why the download from [url] sometimes fails and sometimes succeeds.
The mochitest e10s jobs in [1] show the same jobs sometimes failing and sometimes managing to download [url]. Jobs [2] and [3] are examples of jobs that ran at almost the same time but had different outcomes. The logs indicate that the download sometimes does not work when fetching data from [url]. I attached three log examples of the same job at the point where it begins downloading from [url] in 3status.png.
[url] https://queue.taskcluster.net/v1/task/JAmeJ_9LSL6fKECXBE-gFA/artifacts/public/build/target.test_packages.json
[1] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2
[2] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2&selectedJob=89126373
[3] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2&selectedJob=89126364
Comment 50•8 years ago
I'd like to try increasing some timeouts or retries -- maybe we can recover if we wait long enough?
Flags: needinfo?(gbrown)
Comment 51•8 years ago
Also, I should check into diagnostics; this seems sub-optimal:
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64/1491990368/mozilla-inbound_yosemite_r7_test-mochitest-e10s-devtools-chrome-3-bm106-tests1-macosx-build1033.txt.gz
04:14:11 INFO - retry: attempt #1 caught exception: ''
Comment 53•8 years ago
Looking at failures such as https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=83e7c9bfbd76715d6e6a73f0ac45bb3cd4305d7e&selectedJob=91888786&filter-searchStr=osx%20debug, I notice that most (maybe all) of the failures happened between 05:11 and 05:13 - a fairly short window.
There are successful jobs at about the same time, but most of those initially fail and then succeed on retry. For example,
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/mozilla-inbound_yosemite_r7-debug_test-crashtest-e10s-bm136-tests1-macosx-build180.txt.gz
05:11:03 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:11:35 INFO - retry: attempt #1 caught exception: ''
05:11:35 INFO - retry: Failed, sleeping 60 seconds before retrying
05:12:35 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip'}, attempt #2
05:12:35 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:13:07 INFO - retry: attempt #2 caught exception: ''
05:13:07 INFO - retry: Failed, sleeping 120 seconds before retrying
05:15:07 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip'}, attempt #3
05:15:07 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:15:13 INFO - Expected file size: 35872624
05:15:13 INFO - Obtained file size: 35872624
and
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/mozilla-inbound_yosemite_r7-debug_test-gtest-bm106-tests1-macosx-build199.txt.gz
05:12:35 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:13:06 INFO - retry: attempt #1 caught exception: ''
05:13:06 INFO - retry: Failed, sleeping 60 seconds before retrying
05:14:06 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #2
05:14:06 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:14:40 INFO - retry: attempt #2 caught exception: ''
05:14:40 INFO - retry: Failed, sleeping 120 seconds before retrying
05:16:40 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #3
05:16:40 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:16:44 INFO - Expected file size: 134502514
05:16:44 INFO - Obtained file size: 134502514
Comment 54•8 years ago
Also, I notice that most recent OS X and Windows failures occur while downloading the test_packages.json file, whose download sleeps less between attempts than the downloads of the larger files fetched with fetch_url_into_memory -- we seem to fail less often when we retry over a longer time period.
Comment 55•8 years ago
Based on comment 53 and comment 54, I think at least some of the failures here would be avoided if mozharness retried downloads over a longer time period.
In download_proxied_file() (where most failures occur), I increased retries from 3 to 5 and kept the initial sleep time at 30 seconds. Instead of the current behavior:
- attempt #1 (might take up to 30 seconds)
- sleep 30
- attempt #2 (up to 30 seconds)
- sleep 60
- attempt #3 (...)
Pathological worst case time: 180 seconds
Now I will expect:
- attempt #1 (might take up to 30 seconds)
- sleep 30
- attempt #2
- sleep 60
- attempt #3
- sleep 120
- attempt #4
- sleep 240
- attempt #5
Pathological worst case time: 600 seconds
fetch_url_into_memory failures are fewer. The current behavior there is:
- attempt #1 (might take up to 30 seconds)
- sleep 60
- attempt #2 (up to 30 seconds)
- sleep 120
- attempt #3 (...)
Pathological worst case time: 270 seconds
and I'm changing those retry parameters to be the same as download_proxied_file() - initial sleep of 30 seconds (might recover faster) and 5 retries.
There is at least one disadvantage to more retries: It will take longer for jobs to fail, so when there is a long-term, non-recoverable outage, it may take longer to notice the problem and take corrective action. Still, given the persistent failures we are seeing, I think this change is appropriate.
(There is one minor ride-along change here, to log the exception name - see comment 51.)
My plan is to leave this bug open and monitor it; if failures persist, I'll likely back out this patch.
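For reference, here is a minimal sketch of the retry-with-backoff schedule described above; the function name, parameter names, and defaults are assumptions for illustration, not the actual mozharness retry() API:

# Illustrative sketch of the retry schedule above -- not the mozharness
# retry() implementation; names and defaults are assumed.
import time


def retry_with_backoff(action, attempts=5, initial_sleep=30):
    """Call action() up to `attempts` times, doubling the sleep after
    each failure. With attempts=5, initial_sleep=30, and a 30-second
    per-attempt timeout inside action(), the pathological worst case is
    5*30 + (30 + 60 + 120 + 240) = 600 seconds, matching the numbers above.
    """
    sleep_time = initial_sleep
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as e:
            # Log the exception class as well, per comment 51.
            print("retry: attempt #%d caught %s exception: %r"
                  % (attempt, type(e).__name__, e))
            if attempt == attempts:
                raise
            print("retry: Failed, sleeping %d seconds before retrying"
                  % sleep_time)
            time.sleep(sleep_time)
            sleep_time *= 2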
Updated•8 years ago
Attachment #8858919 - Flags: review?(armenzg) → review+
Updated•8 years ago
Keywords: leave-open
Comment 56•8 years ago
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5317782b6984
Retry mozharness failed downloads more persistently; r=armenzg
Comment 57•8 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/5317782b6984
Comment 58•8 years ago
The change from comment 57 seems to have made a world of difference: no failures since it landed!
I expect occasional failures due to extended network/service outages, so let's keep this open.
Assignee: gbrown → nobody
Comment 59•8 years ago
uplift |
(In reply to Wes Kocher (:KWierso) from comment #57)
> https://hg.mozilla.org/mozilla-central/rev/5317782b6984
And to Beta.
https://hg.mozilla.org/releases/mozilla-beta/rev/d358eac1d410
Comment 60•8 years ago
I am seeing a few failures today on OS X:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=99473976&filter-searchStr=osx%20other
09:05:57 INFO - retry: attempt #5 caught HTTPError exception: HTTP Error 404: Not Found
09:05:57 INFO - Can't download from https://queue.taskcluster.net/v1/task/a6-fWNwlTnKG2YkAJaSDMQ/artifacts/public/build/firefox-55.0a1.en-US.mac.test_packages.json to /builds/slave/test/build/firefox-55.0a1.en-US.mac.test_packages.json!
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 FATAL - Failed to download from all available URLs, aborting
09:05:57 FATAL - Running post_fatal callback...
09:05:57 FATAL - Exiting 3
:garndt, can you check if all services are running ok?
Flags: needinfo?(garndt)
Comment 61•8 years ago
This appears to be a different problem, and I have opened a separate bug for it: bug 1365350.
Flags: needinfo?(garndt)
Comment 68•7 years ago
This bug still happens from time to time; it is mainly concentrated on OS X machines and resolves on its own.
Comment 72•7 years ago
:garndt, is there any further work we could do to reduce this issue?
Flags: needinfo?(garndt)
Comment 73•7 years ago
I'm not sure what further work could be done here, and I'm open to suggestions. This always seems to be limited to the datacenter machines, and it appears the clients trying to download the files are already retrying with backoff.
Flags: needinfo?(garndt)
Comment 74•7 years ago
Also, in many cases the exception being raised is not helpful:
Caught exception: ''
Comment 76•7 years ago
There appears to have been a period of time when the queue was returning internal server errors more than usual.
Comment 78•7 years ago
I have set up alerts in Heroku to help us identify periods when the queue might be part of the problem (a higher rate of failed requests).
Comment 82•7 years ago
This spike was caused by the problems AWS was having with S3, which was returning 503 (slow down) responses.
Comment 90•7 years ago
:gbrown, a year ago you marked this as leave-open. We have had little to no failures since then; can we close this?
Flags: needinfo?(gbrown)
Updated•7 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WORKSFORME
Updated•7 years ago
Keywords: leave-open
Updated•6 years ago
Component: Operations → Operations and Service Requests