Closed
Bug 1276813
Opened 9 years ago
Closed 7 years ago
Intermittent "Failed to download from all available URLs, aborting" downloading from queue.taskcluster.net
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: philor, Unassigned)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell infra])
Attachments
(2 files)
(deleted), image/png
(deleted), patch | armenzg: review+
+++ This bug was initially created as a clone of Bug #1260584 +++
https://treeherder.mozilla.org/logviewer.html#?job_id=29062156&repo=mozilla-inbound
Comment 1•9 years ago
Worth noting that these are 500 errors.
I can open a new case with AWS. Before we do that, it's worth looking at the CloudFront logs, though, to give the AWS support folks some request-ids to look at. I'm at PyCon and don't have the resources to pull that data.
Comment 7•8 years ago
It appears this issue hasn't occurred again in over a week. If it happens again, we need to reopen this bug and attach logs and request IDs that can be sent to AWS.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Comment 9•8 years ago
(In reply to Greg Arndt [:garndt] from comment #7)
> It appears this issue hasn't occurred again in over a week. If it happens
> again, we need to reopen this bug and attach logs and request IDs that can
> be sent to AWS.
https://treeherder.mozilla.org/logviewer.html#?job_id=67211628&repo=autoland&lineNumber=315
Status: RESOLVED → REOPENED
Flags: needinfo?(garndt)
Resolution: WORKSFORME → ---
Comment 10•8 years ago
Looking over the errors, they were all from datacenter OS X machines and showed a variety of failures (some connection, some SSL). I'm guessing there was a hiccup here that resolved quickly on its own, and it seems to be different from what this bug was originally about.
Flags: needinfo?(garndt)
Comment 19•8 years ago
This picked up again on the 19th. :garndt, can you help figure out what is going on here? It looks to be OS X-specific.
Flags: needinfo?(garndt)
Comment 20•8 years ago
Looking over the results for the last month, outside of a few Linux tests that were incorrectly starred under this bug and some 503s reported by the Windows machines early in the month, the remaining failures are entirely from Buildbot OS X machines in the datacenter.
Unfortunately, most of the errors reported display one of the following:
Caught exception: ''
OR
Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
Neither of these gives us much information about what's going on. I suspect these are blips, considering it's not happening on every push and not on the thousands of other machines running Linux and Windows. Is there a way we can get more information from mozharness when this fails?
Also, looking at our logs from the queue for that artifact, I see plenty of standard responses (redirects to S3), but no unexpected ones (like internal server errors).
Flags: needinfo?(garndt)
Comment 21•8 years ago
Just a heads up: you're going to see hundreds of these come in today. There was a transient issue with downloading artifacts that Dustin and John have been looking into. It appears to be different from the OS X failures because an actual exception (a timeout) was printed to the logs.
Updated•8 years ago
Whiteboard: [stockwell infra]
Comment 24•8 years ago
As noted in comment 21, we were expecting this uptick due to some timeout issues experienced on the 22nd. There were only a few failures after that, all on OS X machines as mentioned in comment 20. I am not sure why just the OS X machines hit errors like that.
Comment 26•8 years ago
oh another big spike today :(
Comment 27•8 years ago
I spot-checked a dozen of the jobs that were starred, and they were failing to download a file from GitHub, not queue.taskcluster.net. We should probably host that file somewhere else so we're not depending on GitHub.
Socket error when accessing https://raw.githubusercontent.com/mozilla/build-tooltool/master/tooltool.py: The read operation timed out
Comment 30•8 years ago
Almost all of the failures listed in comment 29 are due to the GitHub issue described in comment 27.
Comment 32•8 years ago
this is behaving better lately :)
Comment 33•8 years ago
It's great that it's a little better. It's still troubling that this is primarily happening on the OS X machines using Buildbot in the datacenter, and that the logging does not indicate what the problem is.
I do not know who we need to loop in to check out those machines and put together some information about what is failing, but because it's largely limited to that set of machines, I think this is isolated to those datacenter machines rather than a larger Taskcluster issue.
Comment 38•8 years ago
Kim, could you have a look at this, or redirect to someone who might know what's happening here?
This bug likely started out as multiple problems, but now it is only about intermittent download failures on OS X. We've already run it by the Taskcluster folks (comment 33).
Flags: needinfo?(kmoir)
Comment 39•8 years ago
The code involved in the failure is _download_file() rather than _fetch_into_memory() [1]
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#413-485
It generally seems to involve the .dmg file, after other files from that same task ID have downloaded successfully.
Is there something special about how cloud-mirror stores or copies .dmg files around that could fail intermittently?
I've noticed something useful (I hope) in one of the logs [2]:
04:24:35 INFO - Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
Here's also the try/except block related to this:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1102-1111
[1] https://docs.google.com/document/d/1HRIvVBqTDsSebOkXqMPxTRw3nPwsYYkonooHrC0ksvA/edit
[2]
04:21:21 INFO - Downloading https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg to /builds/slave/test/installer.dmg
04:21:21 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #1
04:21:53 INFO - retry: attempt #1 caught exception: ''
04:21:53 INFO - retry: Failed, sleeping 30 seconds before retrying
04:22:23 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #2
04:22:57 INFO - retry: attempt #2 caught exception: ''
04:22:57 INFO - retry: Failed, sleeping 60 seconds before retrying
04:23:57 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg', 'file_name': '/builds/slave/test/installer.dmg'}, attempt #3
04:24:35 WARNING - URL Error: https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg
04:24:35 INFO - Running command: ['nslookup', 'queue.taskcluster.net']
04:24:35 INFO - Copy/paste: nslookup queue.taskcluster.net
04:24:35 INFO - Server: 10.26.75.40
04:24:35 INFO - Address: 10.26.75.40#53
04:24:35 INFO - queue.taskcluster.net canonical name = toyama-73636.herokussl.com.
04:24:35 INFO - toyama-73636.herokussl.com canonical name = elb081832-1105536650.us-east-1.elb.amazonaws.com.
04:24:35 INFO - Name: elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35 INFO - Address: 54.243.65.185
04:24:35 INFO - Name: elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35 INFO - Address: 23.21.42.179
04:24:35 INFO - Name: elb081832-1105536650.us-east-1.elb.amazonaws.com
04:24:35 INFO - Address: 54.243.87.215
04:24:35 INFO - Return code: 0
04:24:35 INFO - retry: attempt #3 caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
04:24:35 INFO - Can't download from https://queue.taskcluster.net/v1/task/U7fRZclaQlSrXZl7tyjlnQ/artifacts/public/build/firefox-53.0.en-US.mac.dmg to /builds/slave/test/installer.dmg!
04:24:35 INFO - Caught exception: ''
04:24:35 INFO - Caught exception: ''
04:24:35 INFO - Caught exception: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
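The empty "Caught exception: ''" messages above come from logging only the exception's string form, which is blank for some socket and SSL errors. Below is a minimal sketch of the kind of diagnostic logging that would help here; the function and variable names are illustrative, not the actual mozharness code (mozharness at the time used Python 2's urllib2, while this example uses Python 3's urllib):

# Illustrative sketch only -- not the mozharness implementation.
# Logging the exception class and repr() keeps the SSL/socket details
# that str(e) drops, which is why the logs above show only ''.
import urllib.error
import urllib.request


def download_with_diagnostics(url, dest, timeout=30):
    """Download url to dest, reporting the exception type on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            with open(dest, "wb") as fh:
                fh.write(resp.read())
    except (urllib.error.URLError, OSError) as e:
        # repr(e) includes the class name and any wrapped errno/SSL
        # reason, even when str(e) is an empty string.
        print("Caught %s exception: %r" % (type(e).__name__, e))
        raise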
Comment 40•8 years ago
Since Armen responded, I'm going to clear my needinfo. If you need someone from buildduty to look at this, let me know and I can redirect it.
Flags: needinfo?(kmoir)
Comment 41•8 years ago
Yes, please! (I don't think Armen will be following up...he was just adding some potentially helpful info. We need someone to dig in here and try to resolve these failures.)
Flags: needinfo?(kmoir)
Comment 42•8 years ago
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #39)
> It generally seems to involve the .dmg file, after other files from that
> same task ID have downloaded successfully.
> Is there something special about how cloud-mirror stores or copies .dmg
> files around that could fail intermittently?
>
These are machines downloading from the datacenter (outside of AWS), so cloud-mirror is not a factor. Cloud-mirror only copies things between AWS regions as machines within AWS request them. All machines outside of AWS download those files through CloudFront in front of our us-west-2 S3 bucket.
Comment 43•8 years ago
Alin, could you get someone from the buildduty team to investigate this bug when you have a chance?
Flags: needinfo?(kmoir) → needinfo?(aselagea)
Comment 46•8 years ago
I did some investigation here but couldn't find any reason why the download from [url] sometimes fails and sometimes succeeds.
The mochitest e10s jobs in [1] show the same jobs sometimes failing and sometimes managing to download [url]. Jobs [2] and [3] are examples of jobs that ran at almost the same time but had different outcomes. The logs indicate that the download sometimes does not work when fetching data from [url]. I attached three log examples of the same job at the point where it begins downloading from [url] in 3status.png.
[url] https://queue.taskcluster.net/v1/task/JAmeJ_9LSL6fKECXBE-gFA/artifacts/public/build/target.test_packages.json
[1] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2
[2] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2&selectedJob=89126373
[3] https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=mochitest%20e10s&tochange=feaeb4c4a1149a7925e9d0e32a61fde7ad74b8f2&selectedJob=89126364
Comment 50•8 years ago
I'd like to try increasing some timeouts or retries -- maybe we can recover if we wait long enough?
Flags: needinfo?(gbrown)
Comment 51•8 years ago
Also, I should check into diagnostics; this seems sub-optimal:
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64/1491990368/mozilla-inbound_yosemite_r7_test-mochitest-e10s-devtools-chrome-3-bm106-tests1-macosx-build1033.txt.gz
04:14:11 INFO - retry: attempt #1 caught exception: ''
Comment 53•8 years ago
Looking at failures such as https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=83e7c9bfbd76715d6e6a73f0ac45bb3cd4305d7e&selectedJob=91888786&filter-searchStr=osx%20debug, I notice that most (maybe all) of the failures happened between 05:11 and 05:13 - a fairly short window.
There are successful jobs at about the same time, but most of those initially fail and then succeed on retry. For example,
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/mozilla-inbound_yosemite_r7-debug_test-crashtest-e10s-bm136-tests1-macosx-build180.txt.gz
05:11:03 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:11:35 INFO - retry: attempt #1 caught exception: ''
05:11:35 INFO - retry: Failed, sleeping 60 seconds before retrying
05:12:35 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip'}, attempt #2
05:12:35 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:13:07 INFO - retry: attempt #2 caught exception: ''
05:13:07 INFO - retry: Failed, sleeping 120 seconds before retrying
05:15:07 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip'}, attempt #3
05:15:07 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.reftest.tests.zip into memory
05:15:13 INFO - Expected file size: 35872624
05:15:13 INFO - Obtained file size: 35872624
and
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/mozilla-inbound_yosemite_r7-debug_test-gtest-bm106-tests1-macosx-build199.txt.gz
05:12:35 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:13:06 INFO - retry: attempt #1 caught exception: ''
05:13:06 INFO - retry: Failed, sleeping 60 seconds before retrying
05:14:06 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #2
05:14:06 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:14:40 INFO - retry: attempt #2 caught exception: ''
05:14:40 INFO - retry: Failed, sleeping 120 seconds before retrying
05:16:40 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #3
05:16:40 INFO - Fetch https://queue.taskcluster.net/v1/task/GAOVP3HbSDK-kCbHlJTrLg/artifacts/public/build/target.crashreporter-symbols.zip into memory
05:16:44 INFO - Expected file size: 134502514
05:16:44 INFO - Obtained file size: 134502514
Comment 54•8 years ago
Also, I notice that most recent OS X and Windows failures occur while downloading the test_packages.json file, whose download sleeps less between attempts than the downloads of the larger files fetched with fetch_url_into_memory -- we seem to fail less often when we retry over a longer time period.
Comment 55•8 years ago
Based on comment 53 and comment 54, I think at least some of the failures here would be avoided if mozharness retried downloads over a longer time period.
In download_proxied_file() (where most failures occur), I increased retries from 3 to 5 and kept the initial sleep time at 30 seconds. Instead of the current behavior:
- attempt #1 (might take up to 30 seconds)
- sleep 30
- attempt #2 (up to 30 seconds)
- sleep 60
- attempt #3 (...)
Pathological worst case time: 180 seconds
Now I will expect:
- attempt #1 (might take up to 30 seconds)
- sleep 30
- attempt #2
- sleep 60
- attempt #3
- sleep 120
- attempt #4
- sleep 240
- attempt #5
Pathological worst case time: 600 seconds
fetch_url_into_memory failures are fewer. The current behavior there is:
- attempt #1 (might take up to 30 seconds)
- sleep 60
- attempt #2 (up to 30 seconds)
- sleep 120
- attempt #3 (...)
Pathological worst case time: 270 seconds
and I'm changing those retry parameters to be the same as download_proxied_file() - initial sleep of 30 seconds (might recover faster) and 5 retries.
There is at least one disadvantage to more retries: It will take longer for jobs to fail, so when there is a long-term, non-recoverable outage, it may take longer to notice the problem and take corrective action. Still, given the persistent failures we are seeing, I think this change is appropriate.
(There is one minor ride-along change here, to log the exception name - see comment 51.)
My plan is to leave this bug open and monitor it; if failures persist, I'll likely back out this patch.
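For reference, here is a minimal sketch of the retry-with-backoff schedule described above; the function name, parameter names, and defaults are assumptions for illustration, not the actual mozharness retry() API:

# Illustrative sketch of the retry schedule above -- not the mozharness
# retry() implementation; names and defaults are assumed.
import time


def retry_with_backoff(action, attempts=5, initial_sleep=30):
    """Call action() up to `attempts` times, doubling the sleep after
    each failure. With attempts=5, initial_sleep=30, and a 30-second
    per-attempt timeout inside action(), the pathological worst case is
    5*30 + (30 + 60 + 120 + 240) = 600 seconds, matching the numbers above.
    """
    sleep_time = initial_sleep
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as e:
            # Log the exception class as well, per comment 51.
            print("retry: attempt #%d caught %s exception: %r"
                  % (attempt, type(e).__name__, e))
            if attempt == attempts:
                raise
            print("retry: Failed, sleeping %d seconds before retrying"
                  % sleep_time)
            time.sleep(sleep_time)
            sleep_time *= 2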
Updated•8 years ago
Attachment #8858919 - Flags: review?(armenzg) → review+
Updated•8 years ago
Keywords: leave-open
Comment 56•8 years ago
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5317782b6984
Retry mozharness failed downloads more persistently; r=armenzg
Comment 57•8 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/5317782b6984
Comment 58•8 years ago
The change from comment 57 seems to have made a world of difference: no failures since it landed!
I expect occasional failures due to extended network/service outages, so let's keep this open.
Assignee: gbrown → nobody
Comment 59•8 years ago
uplift |
(In reply to Wes Kocher (:KWierso) from comment #57)
> https://hg.mozilla.org/mozilla-central/rev/5317782b6984
And to Beta.
https://hg.mozilla.org/releases/mozilla-beta/rev/d358eac1d410
Comment 60•8 years ago
I am seeing a few failures today on OS X:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&selectedJob=99473976&filter-searchStr=osx%20other
09:05:57 INFO - retry: attempt #5 caught HTTPError exception: HTTP Error 404: Not Found
09:05:57 INFO - Can't download from https://queue.taskcluster.net/v1/task/a6-fWNwlTnKG2YkAJaSDMQ/artifacts/public/build/firefox-55.0a1.en-US.mac.test_packages.json to /builds/slave/test/build/firefox-55.0a1.en-US.mac.test_packages.json!
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 INFO - Caught exception: HTTP Error 404: Not Found
09:05:57 FATAL - Failed to download from all available URLs, aborting
09:05:57 FATAL - Running post_fatal callback...
09:05:57 FATAL - Exiting 3
:garndt, can you check if all services are running ok?
Flags: needinfo?(garndt)
Comment 61•8 years ago
This appears to be a different problem, and I have opened a separate bug for it: bug 1365350.
Flags: needinfo?(garndt)
Comment 68•7 years ago
This bug still happens from time to time; it is mainly concentrated on OS X machines and resolves on its own.
Comment 72•7 years ago
:garndt, is there any further work we could do to reduce this issue?
Flags: needinfo?(garndt)
Comment 73•7 years ago
I'm not sure what further work could be done here, and I'm open to suggestions. This always seems to be limited to the datacenter machines, and it appears the clients trying to download the files are already retrying with backoff.
Flags: needinfo?(garndt)
Comment 74•7 years ago
Also, in many cases the exception being raised is not helpful:
Caught exception: ''
Comment 76•7 years ago
There appears to have been a period of time when the queue was returning internal server errors more than usual.
Comment 78•7 years ago
I have set up alerts in Heroku to help us identify periods when the queue might be part of the problem (a higher rate of failed requests).
Comment 82•7 years ago
This spike was caused by the problems AWS was having with S3, which was returning 503 (slow down) responses.
Comment 90•7 years ago
:gbrown, a year ago you marked this as leave-open. We have had little to no failures since then; can we close this?
Flags: needinfo?(gbrown)
Updated•7 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WORKSFORME
Updated•7 years ago
Keywords: leave-open
Updated•6 years ago
Component: Operations → Operations and Service Requests