Closed Bug 1416270 Opened 7 years ago Closed 6 years ago

Queue: artifacts served with gzip encoding regardless of Accept-Encoding

Categories

(Taskcluster :: Services, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Pike, Assigned: pmoore)

Details

When downloading the artifact in the URL, sometimes we get the installer.exe, and sometimes we get the same file gzip'ed. Looks like transfer encoding is sometimes on and sometimes off, or not advertized.

Example output of gzip'ed output:

[axel@l10n-merge1.private.scl3 ~]$ wget https://queue.taskcluster.net/v1/task/TuT79epdQpSH033sMQrY6Q/runs/0/artifacts/public/build/de/target.installer.exe
--2017-11-10 07:52:06-- https://queue.taskcluster.net/v1/task/TuT79epdQpSH033sMQrY6Q/runs/0/artifacts/public/build/de/target.installer.exe
Resolving queue.taskcluster.net... 23.23.74.217, 54.197.255.25, 174.129.218.85
Connecting to queue.taskcluster.net|23.23.74.217|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe [following]
--2017-11-10 07:52:06-- https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
Resolving public-artifacts.taskcluster.net... 54.230.53.225
Connecting to public-artifacts.taskcluster.net|54.230.53.225|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41579315 (40M) [application/x-msdownload]
Saving to: “target.installer.exe”

100%[======================================>] 41,579,315 25.3M/s in 1.6s

2017-11-10 07:52:08 (25.3 MB/s) - “target.installer.exe” saved [41579315/41579315]

[axel@l10n-merge1.private.scl3 ~]$ file target.installer.exe
target.installer.exe: gzip compressed data, was "target.installer.exe"
[axel@l10n-merge1.private.scl3 ~]$ wget --version
GNU Wget 1.12 built on linux-gnu.

+digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl -iri

Wgetrc: /etc/wgetrc (system)
Locale: /usr/share/locale
Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc" -DLOCALEDIR="/usr/share/locale" -I. -I../lib -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -fno-strict-aliasing
Link: gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -fno-strict-aliasing -Wl,-z,relro -lssl -lcrypto /usr/lib64/libssl.so /usr/lib64/libcrypto.so -ldl -lrt ftp-opie.o openssl.o http-ntlm.o gen-md5.o ../lib/libgnu.a

Copyright © 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Originally written by Hrvoje Nikšić <hniksic@xemacs.org>.
Currently maintained by Micah Cowan <micah@cowan.name>.
Please send bug reports and questions to <bug-wget@gnu.org>.

Example of non-gzip'ed output:

Fuchsia:zzz axelhecht$ wget https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
--2017-11-10 17:40:48-- https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
Resolving public-artifacts.taskcluster.net... 52.84.158.254
Connecting to public-artifacts.taskcluster.net|52.84.158.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41579315 (40M) [application/x-msdownload]
Saving to: 'target.installer.exe'

target.installer.ex 100%[===================>] 39.65M 5.79MB/s in 6.9s

2017-11-10 17:40:55 (5.78 MB/s) - 'target.installer.exe' saved [41583193]

Fuchsia:zzz axelhecht$ file target.installer.exe
target.installer.exe: PE32 executable (GUI) Intel 80386, for MS Windows, UPX compressed
Fuchsia:zzz axelhecht$ wget --version
GNU Wget 1.19.2 built on darwin16.7.0.

-cares +digest -gpgme +https +ipv6 -iri +large-file -metalink -nls +ntlm +opie -psl +ssl/openssl

Wgetrc: /usr/local/etc/wgetrc (system)
Compile: clang -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/usr/local/etc/wgetrc" -DLOCALEDIR="/usr/local/Cellar/wget/1.19.2/share/locale" -I. -I../lib -I../lib -I/usr/local/opt/openssl@1.1/include -DNDEBUG
Link: clang -DNDEBUG -L/usr/local/opt/openssl@1.1/lib -lssl -lcrypto -ldl -lz ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a -liconv

Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Originally written by Hrvoje Niksic <hniksic@xemacs.org>.
Please send bug reports and questions to <bug-wget@gnu.org>.
More with headers, tried wget --save-headers, and then do a less.

Working:

HTTP/1.1 200 OK
Content-Type: application/x-msdownload
Content-Length: 41579315
Connection: keep-alive
Date: Fri, 10 Nov 2017 14:30:54 GMT
Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
ETag: "137e76feda1ad0a8803f40634e72e16d"
Content-Encoding: gzip
x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
Accept-Ranges: bytes
Server: AmazonS3
Age: 8261
X-Cache: Hit from cloudfront
Via: 1.1 81c085110a4ab1cc157a3023ea302f38.cloudfront.net (CloudFront)
X-Amz-Cf-Id: tBIF4GSN5IUBCA8YNFCsR93rqulX64S34_21U-uSVB4_k6z9EctYIg==

MZ<90>^@^C^@^@^@^D^@^@^@<FF><FF>^@^@<B8>^@^@^@^@^@^@^@@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@<E0>^@^@^@^N^_<BA>^N^@<B4> <CD>! <B8>^AL<CD>!This program cannot be run in DOS mode.

Non-working (file gzip'ed locally):

HTTP/1.1 200 OK
Content-Type: application/x-msdownload
Content-Length: 41579315
Connection: keep-alive
Date: Fri, 10 Nov 2017 14:58:06 GMT
Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
ETag: "137e76feda1ad0a8803f40634e72e16d"
Content-Encoding: gzip
x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
Accept-Ranges: bytes
Server: AmazonS3
Age: 6669
X-Cache: Hit from cloudfront
Via: 1.1 ec7268fa1110683dbc457e57c2be1475.cloudfront.net (CloudFront)
X-Amz-Cf-Id: rf8ZvNoZWH2SCkTtqEY4Fl8zGJHMBclBdt5SvHnq_1sVu0wD3ZjoPw==

^_<8B>^H^H^@^@^@^@^@<FF>target.installer.exe^@<EC><B2>y4<9B><FF><FB>><F8>d^Q!!A ^P^DQ<A1>A<90><92>ڢ^Z<B1><A5><B5>%v<B5>+<AA><C4>R^R<B4><B5><EF>^QA7<A5>U<D5>ֻ+
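For anyone who needs to tell the two cases apart on disk, here is a minimal Python sketch of the same check that `file` performs in the transcripts above: the gzip'ed download starts with the bytes 1f 8b (shown as ^_<8B> in the dump), the real installer starts with "MZ". The function names `sniff_payload` and `ensure_decompressed` are made up for this illustration and are not part of any Taskcluster tooling. Note the assumption baked in: an artifact that is legitimately a .gz file would also be unpacked by this, so it only makes sense when you know the artifact itself is not supposed to be gzip.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
PE_MAGIC = b"MZ"          # first two bytes of a Windows executable


def sniff_payload(data: bytes) -> str:
    """Classify a downloaded payload by its leading magic bytes."""
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    if data.startswith(PE_MAGIC):
        return "pe"
    return "unknown"


def ensure_decompressed(data: bytes) -> bytes:
    """Return the payload with one layer of gzip removed, if present.

    Assumption: the artifact itself is not meant to be a gzip file.
    """
    if sniff_payload(data) == "gzip":
        return gzip.decompress(data)
    return data
```

Sniffing the body is only a fallback for files that were already saved without their headers; when the response is still at hand, the Content-Encoding header is the more reliable signal.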
Assignee: nobody → pmoore
Component: Queue → Generic-Worker
Got it, the issue is that we always gzip, whether or not the client supports it:

Fuchsia:zzz axelhecht$ curl -vO https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 52.84.158.254...
* TCP_NODELAY set
* Connected to public-artifacts.taskcluster.net (52.84.158.254) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: auth.taskcluster.net
* Server certificate: DigiCert SHA2 Secure Server CA
* Server certificate: DigiCert Global Root CA
> GET /TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe HTTP/1.1
> Host: public-artifacts.taskcluster.net
> User-Agent: curl/7.54.0
> Accept: */*
>
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0< HTTP/1.1 200 OK
< Content-Type: application/x-msdownload
< Content-Length: 41579315
< Connection: keep-alive
< Date: Fri, 10 Nov 2017 17:07:30 GMT
< Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
< ETag: "137e76feda1ad0a8803f40634e72e16d"
< Content-Encoding: gzip
< x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< Via: 1.1 eefd24fb23003934ecf16bb607089417.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: Rr_cO8rFprjU9wtR4X06sYWY6Zhb_ngT1Dox5L3gQ2MamxARljkP6A==
<
{ [16384 bytes data]
100 39.6M 100 39.6M 0 0 2652k 0 0:00:15 0:00:15 --:--:-- 4280k
* Connection #0 to host public-artifacts.taskcluster.net left intact
Fuchsia:zzz axelhecht$ shasum -a 256 target.installer.exe
ba9d1b01a9dd59ce669301bee880b1affa84a9b01563bf7206e0b38f1f7c07b2 target.installer.exe

Fuchsia:zzz axelhecht$ curl --compressed -vO https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 52.84.158.254...
* TCP_NODELAY set
* Connected to public-artifacts.taskcluster.net (52.84.158.254) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: auth.taskcluster.net
* Server certificate: DigiCert SHA2 Secure Server CA
* Server certificate: DigiCert Global Root CA
> GET /TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe HTTP/1.1
> Host: public-artifacts.taskcluster.net
> User-Agent: curl/7.54.0
> Accept: */*
> Accept-Encoding: deflate, gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/x-msdownload
< Content-Length: 41579315
< Connection: keep-alive
< Date: Fri, 10 Nov 2017 14:30:54 GMT
< Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
< ETag: "137e76feda1ad0a8803f40634e72e16d"
< Content-Encoding: gzip
< x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
< Accept-Ranges: bytes
< Server: AmazonS3
< Age: 9438
< X-Cache: Hit from cloudfront
< Via: 1.1 635d6b64075ae1410e6cbc26907c7141.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: 3pikUAk7s-t4ht6ikVhrOeiuf6FbmepAM0IX_4EguyQAXciQJFfE5A==
<
{ [5792 bytes data]
100 39.6M 100 39.6M 0 0 5863k 0 0:00:06 0:00:06 --:--:-- 5907k
* Connection #0 to host public-artifacts.taskcluster.net left intact
Fuchsia:zzz axelhecht$ shasum -a 256 target.installer.exe
112e6c8ee7579f43d582933f34e5cf81c8d93d4b76c74ac7437542eaf1696ddd target.installer.exe
Summary: artifacts served with and without gzip encoding → artifacts served with gzip encoding regardless of Accept-Encoding
I think the problem here is that we are serving compressed content from our origin:

http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html#compressed-content-custom-origin

As I see it, CloudFront doesn't provide a mechanism to *decompress* content that it serves, so since we only provide a compressed version, that is all that is available. I'm not sure of the mechanics of interaction between our cloudfront web front ends and our s3 buckets, but maybe this is something that can be resolved in the Queue? As far as the worker is concerned, it publishes content to S3 with the correct encoding headers, so I think it is up to something in Queue/CloudFront/S3 to handle decompressing content as appropriate.

@John/Jonas, what are your thoughts on this? Is there a way some service in the chain can decompress, or should just return a http 406 if gzip compression is not listed in the request Accept-Encoding field? (as per https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.3)
Flags: needinfo?(jopsen)
Flags: needinfo?(jhford)
So, what's happening is that S3 really takes Simple to extremes. If you upload a resource with Content-Encoding: gzip, it gets sent out with Content-Encoding: gzip regardless of whether you specify Accept-Encoding: gzip, Accept-Encoding: ??? or no Accept-Encoding. This is done because S3 doesn't understand how to do gzip compression dynamically based on content negotiation.

I believe CloudFront does support automatic gzip and content-negotiation, but that'd likely require not uploading with content-encoding. Even so, it might even end up doing double gzip-encoding -- I'm not sure. I'd need to do some tests to confirm. The docs aren't super clear to me: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html

We do this so that we can serve gzip-content inside of the AWS regions and direct from S3 to browsers in a way that they will automatically decompress (think logs, etc). This mainly has an impact on non-browser consumers. It's definitely in violation of the spec, but it is what we need to do to enable serving gzip artifacts directly to the browser.

There's good news coming though! The new artifact API and tooling that I've been working on will help to address this. I've written a JavaScript library (https://github.com/taskcluster/taskcluster-lib-artifact) and a Go library and CLI tool (https://github.com/taskcluster/taskcluster-lib-artifact-go) which automate all of the logic around uploading and downloading artifacts in a safe and reliable way. These libraries are about to be deployed into workers. The JS library is complete but the Go one is undergoing final review.

It turns out that downloading artifacts safely is not really a simple task and requires a decent amount of work to do correctly. Because of this, the supported interface to blob artifacts will become one of these libraries or the CLI tool. These libraries do a bunch of extra verifications to ensure that the file you download is the exact bytes that were created originally.

In the meantime, I would suggest using an HTTP library to do the request against the queue (following redirects) with a HEAD, then checking the content-encoding. To do this, what you want is to use one of the library clients (js, py, go, java) to generate a Signed URL for the method, then run the HEAD request against that. I know it's not ideal, but we're working to resolve this!
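A rough Python sketch of that interim workaround, using only the standard library. It is a variation on the suggestion rather than a literal implementation: instead of a separate HEAD request it reads Content-Encoding from the GET response itself, and it skips the signed-URL step, which should only matter for non-public artifacts. The helper names `decode_body` and `fetch_artifact` are invented for this example; they are not from taskcluster-lib-artifact or any Taskcluster client.

```python
import gzip
import urllib.request


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the content-coding the server says it applied to the body."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "identity":
        return body
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


def fetch_artifact(url: str) -> bytes:
    """GET an artifact URL and return the decoded bytes.

    urllib follows the redirect from the queue by default and does not
    decompress responses on its own, so the decoding is done here.
    """
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        encoding = response.headers.get("Content-Encoding", "identity")
        return decode_body(response.read(), encoding)
```

This reads the whole artifact into memory and does no hash verification, so it is a stopgap for small scripts, not a replacement for the artifact libraries mentioned above.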
Flags: needinfo?(jhford)
> @John/Jonas, what are your thoughts on this? Is there a way some service in the chain can decompress, or should just return a http 406 if gzip compression is not listed in the request Accept-Encoding field?

At some point early on I think we agreed that we require all clients to support redirects and gzip. I guess 406'ing the requests if they don't support gzip would be nice. But we don't currently have the content-encoding in the queue datastore. But this might be something we can explore.

@pmoore, note however, that except for generic-worker all other worker implementations will only gzip the logs. In many cases gzipping isn't worth the overhead, and we might ask ourselves if it's worth the trouble here?
Flags: needinfo?(jopsen)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #5)
> At some point early on I think we agreed that we require all clients to
> support redirects and gzip.

I'm only aware of such a specific agreement in regards to the new artifact api.

> I guess 406'ing the requests if they don't support gzip would be nice. But
> we don't currently have the content-encoding in
> the queue datastore. But this might be something we can explore.

We explored it, and it's part of the new artifact api. I think the path forward is to integrate taskcluster-lib-artifact{-go,} into workers and use the existing support for that.

> @pmoore, note however, that except for generic-worker all other worker
> implementations will only gzip the logs.
> In many cases gzipping isn't worth the overhead, and we might ask ourselves
> if it's worth the trouble here?

In the context of the new artifact api, content-encoding support was considered a blocking feature. Since we now have tools specifically to facilitate easy uploading and downloading of artifacts, with basically transparent Gzip encoding/decoding, I think there's no reason not to support it.

We should be suggesting the taskcluster-lib-artifact{-go,} libraries and CLI as the way to get artifacts, since doing it manually is a huge source of errors.
I don't agree that our download assets break spec-compliant http clients. Regardless of what any kind of library does on top, we shouldn't break the web.
(In reply to Axel Hecht [:Pike] from comment #7)
> I don't agree that our download assets break spec-compliant http clients.
> Regardless of what any kind of library does on top, we shouldn't break the
> web.

I agree, any public-facing urls we expose should be http compliant.

As I understand it (disclaimer: mostly based on guesswork), when you download an artifact, you hit the queue service which redirects you to a cloudfront url, which acts as a proxy front end to S3 buckets. In a perfect world, all three services would be http compliant and respect Accept-Encoding request header.

I think it should be relatively straightforward for the queue to return an HTTP 406 if the client doesn't accept gzip content encoding, and this would be compliant with the spec.

Whether we can (or should) do this at all three levels (the queue, cloudfront, and S3) is a good question. Certainly it would be nice if all services did this, but it could be argued that only the queue url is client-facing. However, in reality the user sees the cloudfront url - so it arguably should also be compliant, because it is a discoverable url.

My two cents - I welcome others to pitch in with comments/ideas.
(based on the assumption that the S3 URL that cloudfront proxies is discoverable/reachable for clients)
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
> (In reply to Axel Hecht [:Pike] from comment #7)
> I think it should be relatively straightforward for the queue to return an
> HTTP 406 if the client doesn't accept gzip content encoding, and this would
> be compliant with the spec.

That's easy at the Queue level, but impossible at the S3 URL level. S3 urls have their headers and content set one time, at creation. The only way to do this would be to turn off public access to the S3 objects, do the content-negotiation in the Queue and change the queue to redirect to a *signed* S3 URL. The S3 URL would still be what it is now, but we'd have done the content-negotiation in the queue and would not allow creating the URL for clients which do not set Accept-Encoding: gzip. The thing which follows redirects would still need to understand how to parse a resource with Content-Encoding: gzip without an accept-encoding request header set.

> Whether we can (or should) do this at all three levels (the queue,
> cloudfront, and S3) is a good question. Certainly it would be nice if all
> services did this, but it could be argued that only the queue url is
> client-facing. However, in reality the user sees the cloudfront url - so it
> arguably should also be compliant, because it is a discoverable url.
>
> My two cents - I welcome others to pitch in with comments/ideas.

(In reply to Axel Hecht [:Pike] from comment #7)
> I don't agree that our download assets break spec-compliant http clients.
> Regardless of what any kind of library does on top, we shouldn't break the
> web.

Reading through the Accept/Content-Encoding header sections in the http spec, it doesn't actually say anywhere that the use of Content-Encoding in a response requires an Accept-Encoding in the request, or that Accept-Encoding is anything more than a preference.

https://tools.ietf.org/html/rfc7231#section-3.1.2.2

If one or more encodings have been applied to a representation, the sender that applied the encodings MUST generate a Content-Encoding header field that lists the content codings in the order in which they were applied. Additional information about the encoding parameters can be provided by other header fields not defined by this specification.

https://tools.ietf.org/html/rfc7231#section-5.3.4

A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.

I guess I was premature in saying we're breaking the spec. I didn't actually find anywhere in the http spec that says it is required that the Content-Encoding header has to be one of the specified Accept-Encoding header options. The Content-Encoding has to be set if the content has content coding, but the Accept-Encoding header is just stating preference.
From https://tools.ietf.org/html/rfc7231#section-5.3.4 I think the two exceptions would be: 1) An Accept-Encoding header that doesn't include '*' or 'gzip' (either an empty value, or a list of things that aren't either of these) 2) An Accept-Encoding header that contains 'gzip;q=0' So technically Axel's example is compliant, but I believe a modified form of it could demonstrate non-compliance (such as "Accept-Encoding: gzip;q=0" / "Accept-Encoding:", "Accept-Encoding: foo").
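To make those two exceptions concrete, here is a small Python sketch of how a server could decide whether a gzip-coded response is acceptable for a given Accept-Encoding value, following my reading of RFC 7231 section 5.3.4. The function name `gzip_acceptable` is hypothetical, and nothing here is taken from the queue's code; `None` stands for "header not sent", which is different from an empty header value.

```python
from typing import Optional


def gzip_acceptable(accept_encoding: Optional[str]) -> bool:
    """Return True if a gzip-coded response is acceptable to the client.

    accept_encoding is the raw header value, or None if the header is absent.
    """
    if accept_encoding is None:
        # No header at all: any content-coding is considered acceptable.
        return True

    qvalues = {}
    for item in accept_encoding.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # default weight when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        qvalues[coding.strip().lower()] = q

    if "gzip" in qvalues:
        return qvalues["gzip"] > 0
    if "*" in qvalues:
        return qvalues["*"] > 0
    # Header present but neither gzip nor * listed (including an empty value).
    return False
```

With this, Axel's plain curl request (no header) and the --compressed one ("deflate, gzip") both come out acceptable, while "gzip;q=0", an empty value and "foo" do not.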
QA Contact: pmoore
Suggestion: if the request doesn't carry "Accept-Encoding: gzip" and the resource is gzipped, we reply 406 "Not Acceptable". From Wikipedia: "The requested resource is capable of generating only content not acceptable according to the Accept headers sent in the request."
We can't do this without breaking existing clients that handle gzip but don't specify an Accept-Encoding header. But we could probably do this for the new blob artifact type.
PR for this fix: https://github.com/taskcluster/taskcluster-queue/pull/264 (credits to jhford for originally suggesting this in Berlin). Note: this will only affect the 'blob' type, so as we switch workers to use this artifact type, we are liable to see some breakages. That might be acceptable since we want consumers to start verifying hashes anyway.
This will start working as we transition more artifacts to the blob type.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Component: Generic-Worker → Workers
Component: Workers → Services
Summary: artifacts served with gzip encoding regardless of Accept-Encoding → Queue: artifacts served with gzip encoding regardless of Accept-Encoding