Closed Bug 1155645 Opened 10 years ago Closed 7 years ago

[queue] Create artifact with explicitly marking if it is published

Categories

(Taskcluster :: Services, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: jhford)

References

Details

(Whiteboard: taskcluster-q32015-meeting)

There are situations where we would not want to publish that an artifact is available until it has been completely uploaded, such as when an artifact upload has been aborted. Currently a link is available right away in the task inspector and an exchange message is published saying that the artifact has been created, which might not be true if the upload was aborted. Clicking the artifact link in the inspector results in an "access denied" message.

Spoke with Jonas, and perhaps there is a way to maintain the current implementation but add an additional "publish" switch that defaults to true, so all existing clients keep working the same way, while one could specify "publish: false" to upload an artifact without publicizing it yet. Once the worker is done uploading the artifact, the same endpoint could be called again with "publish: true", and then the artifact would be listed in the artifacts for the task and an exchange message published.

Example: createArtifact(..., name, {storageType: 's3', ..., publish: false})
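For context, a rough sketch of what such a call could look like from the Node taskcluster-client; the `publish` property is only the proposal from this bug, not an existing parameter, and the other payload fields are illustrative:

```js
// Sketch only: `publish` is the flag proposed in this bug, not a real queue parameter.
const taskcluster = require('taskcluster-client');

async function reserveArtifact(queue, taskId, runId) {
  const response = await queue.createArtifact(taskId, runId, 'public/build/target.tar.gz', {
    storageType: 's3',
    contentType: 'application/x-tar',
    expires: taskcluster.fromNow('1 year'),
    publish: false,   // hypothetical: don't list the artifact or publish a pulse message yet
  });
  return response.putUrl;  // signed S3 PUT URL for the actual upload
}
```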
Summary: [queue] Create artifact without publishing that it's available → [queue] Create artifact with explicitly marking if it is published
Blocks: 1154248
Basically, we make it a 3-step process:

1) queue.createArtifact(..., name, {storageType: 's3', ..., publish: false})
2) Upload artifact to signed PUT-url
3) queue.createArtifact(..., name, {storageType: 's3', ..., publish: true})

Both step (1) and step (3) return a signed PUT-url. But only step (3):
- Makes the artifact visible in queue.listArtifacts,
- Publishes a pulse message about the artifact.

### Backwards compatibility
* Set the default value for "publish" to true,
* Existing uploaders work as they do now (because they don't specify "publish").

### Future Clean-up
When all uploaders are converted to the 3-step process, we:
- Remove the default value from "publish" (and force people to specify it),
- Stop returning the signed PUT-url from step (3),
- Validate existence, contentType and contentLength of the artifact in step (3),
- Make step (3) return 400 if step (1) wasn't executed first.

After this clean-up we will be able to promise that all visible artifacts are available. I think such a promise is extremely valuable; hence, it's worth breaking backwards compatibility as long as we make a slow transition.

-----

These changes are generally small, but we have quite a few things that upload artifacts: mozharness, docker-worker, and perhaps even some of the caching tasks. So the breakage has to be gradual.

CC'ing mshal, because this will eventually affect mozharness, granted there will be a long transition period.
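A sketch of how a worker might drive this 3-step flow end to end (again, `publish` is hypothetical; the upload uses Node 18+'s global fetch and error handling is simplified):

```js
// Sketch of the proposed 3-step flow; `publish` does not exist in the current API.
const fs = require('fs');
const taskcluster = require('taskcluster-client');

async function uploadArtifact(queue, taskId, runId, name, filePath, contentType) {
  const payload = {
    storageType: 's3',
    contentType,
    expires: taskcluster.fromNow('1 year'),
  };

  // Step 1: create the artifact without publishing it (hypothetical flag).
  const {putUrl} = await queue.createArtifact(taskId, runId, name, {...payload, publish: false});

  // Step 2: upload the file to the signed PUT URL.
  const body = fs.readFileSync(filePath);
  const res = await fetch(putUrl, {
    method: 'PUT',
    headers: {'Content-Type': contentType},  // must match what was signed into the URL
    body,
  });
  if (!res.ok) {
    throw new Error(`artifact upload failed with HTTP ${res.status}`);
  }

  // Step 3: call createArtifact again with publish: true so the artifact shows up
  // in listArtifacts and the pulse message is sent.
  await queue.createArtifact(taskId, runId, name, {...payload, publish: true});
}
```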
I'm not sure if this would be cleaner from an endpoint point of view, but would it help to keep the createArtifact call (defaulted to publish=true for now), and then instead of step #3 have a separate publishArtifact call? I didn't know if that would help with clarity, given the minor difference in signature between steps 1 and 3.
Blocks: 1158934
@garndt, It's possible that we should use different method names. I'm not sure how we would handle the path-pattern side of that, though... maybe:

POST /task/<taskId>/runs/<runId>/artifacts/<name>
PUT  /task/<taskId>/runs/<runId>/artifacts/<name>

I'm not sure... let's look at those options when we implement this. I've been withholding the "PUT" method for a redirect idea, such that one could do uploads like:

curl -X PUT --upload-file ... taskcluster/queue/task/<taskId>/runs/<runId>/artifacts/<name>

using the auth proxy and a server-side redirect. But this might be crazy, or maybe such a hack should use the S3 browser/POST upload flow.
I'll add to this bug that we should check/require that the following S3 headers are set:

Content-Type: ...
Cache-Control: public, no-transform   (public artifact bucket)
Cache-Control: private, no-transform  (private artifact bucket)
Expires: <date-time of artifact expiration, ie. when we plan to delete it>

Probably check that the following header isn't set:

Content-Disposition

Possibly also check/require the following headers:

Content-MD5: ...
Content-Encoding: gzip   (maybe this is optional)

Making these headers required (if we want to) is probably a gradual process, as we slowly migrate all uploaders. Certainly things like Cache-Control and Expires are pure win. Content-Type and Content-MD5 seem like good practice. One can argue over whether we should gzip everything (possibly once more), but it might be worth it. I don't know what Content-Disposition is good for, so I feel like forbidding it; then we can always allow it later.

For docker-worker, see:
- bug 1164221 (cache-control headers)
- bug 1164224 (content-encoding: gzip, discussion on optional or not)
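A rough sketch of an uploader setting these headers on the PUT to the signed URL (Node 18+ global fetch assumed; whether S3 accepts each header depends on what was included in the signature, and the values just mirror the list above):

```js
// Sketch: upload to a signed S3 PUT URL with the headers discussed above.
// Whether S3 accepts/enforces each header depends on what was signed into the URL.
const fs = require('fs');
const crypto = require('crypto');

async function putWithHeaders(putUrl, filePath, contentType, expiresAt) {
  const body = fs.readFileSync(filePath);
  const md5 = crypto.createHash('md5').update(body).digest('base64');

  const res = await fetch(putUrl, {
    method: 'PUT',
    headers: {
      'Content-Type': contentType,
      'Content-MD5': md5,
      // public artifact bucket; use 'private, no-transform' for the private bucket
      'Cache-Control': 'public, no-transform',
      'Expires': expiresAt.toUTCString(),   // planned deletion time of the artifact
    },
    body,
  });
  if (!res.ok) {
    throw new Error(`S3 PUT failed with HTTP ${res.status}`);
  }
}
```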
Component: TaskCluster → Queue
Product: Testing → Taskcluster
Note for this discussion, when "Content-Type" is specified in the signed S3 put URL, you cannot overwrite it!
Also, Content-MD5 can be specified in the signed S3 PUT URL... But content-length cannot be specified in the signed URL, so we cannot prevent workers from uploading huge files. Granted, an upload with a single URL is limited to 5GB (which isn't that big).

So maybe we don't actually need the explicit publish step... Instead we just validate uploaded artifacts when the task is reported completed?
So investigating what we can force with a signed URL led to this.

Can be forced:
- content-md5: base64-encoded md5
- content-type: mimetype
- x-amz-meta-<key>: value (can't be specified if not in signature)

Can't be controlled with signed URLs:
- content-length
- content-encoding
- cache-control

We should probably use something like x-amz-meta-content-sha256 to specify a sha256 checksum. That way we can verify the sha256 sum when downloading. The etag given by AWS also works (it's md5), but it doesn't work for multi-part uploads, so if we want to support that in the future maybe we should just always calculate sha256 too. There is also some security benefit to having the sha256 sum, as we can store it in Azure table storage too.
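A sketch of how the queue side might sign those constraints into the URL, assuming aws-sdk v2's s3.getSignedUrl; which parameters actually end up enforced should be double-checked against S3's signing rules:

```js
// Sketch: sign a PUT URL that pins content-type, content-md5 and a sha256 metadata header.
// Assumes aws-sdk v2; parameter handling in getSignedUrl should be verified.
const AWS = require('aws-sdk');
const s3 = new AWS.S3({region: 'us-west-2'});

function signedPutUrl(bucket, key, contentType, md5Base64, sha256Hex) {
  return s3.getSignedUrl('putObject', {
    Bucket: bucket,
    Key: key,
    Expires: 45 * 60,              // URL validity in seconds, not object expiration
    ContentType: contentType,      // forced: the PUT must send this exact Content-Type
    ContentMD5: md5Base64,         // forced: the PUT must send this exact Content-MD5
    Metadata: {
      'content-sha256': sha256Hex, // becomes x-amz-meta-content-sha256 on the object
    },
  });
}
```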
The more I think about it, I'm realizing that maybe we should just validate the existence of all artifacts when the task is resolved. On calls to:

queue.reportCompleted
queue.reportFailed

we validate the existence of artifacts, and resolve the task as exception / internal-error if a worker called queue.createArtifact but didn't successfully upload the artifact to the signed URL.

Downsides:
- Won't scale if we have lots of artifacts (as it's one HEAD request per artifact)
- We still won't be able to send the artifact-created message as soon as the artifact is available

Upsides:
+ No breaking of the current API
+ We can still implement the two-step process later...

---

Thoughts? Something also tells me we should do it correctly now, rather than postponing it again.
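A sketch of what that validation could look like on reportCompleted/reportFailed: one HEAD request per artifact, which is exactly the scaling downside noted above (function and field names are illustrative):

```js
// Sketch: verify each s3 artifact actually exists before resolving the task.
// One HEAD request per artifact, which is the scaling concern mentioned above.
async function validateArtifacts(artifacts) {
  const missing = [];
  for (const artifact of artifacts) {
    if (artifact.storageType !== 's3') continue;
    const res = await fetch(artifact.url, {method: 'HEAD'});
    if (!res.ok) missing.push(artifact.name);
  }
  // If non-empty, resolve the task as exception / internal-error.
  return missing;
}
```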
Flags: needinfo?(pmoore)
Flags: needinfo?(garndt)
Downside:
* Everything in existence that has abused the fact that a task won't be marked as failed if the artifact isn't uploaded needs to be changed. This bit me recently, and there are a few things out there where this would be a problem.

I do think that this issue has been open long enough that we just want to do it correctly to begin with. I'm not against validating the artifact on the queue side once someone calls one of the report* endpoints, but hopefully we block until that validation has been done and don't publish an exchange message until then.

I think the validation can solve some of the issues that led to the creation of this bug, but as described in one of the downsides above, createArtifact events don't really mean an artifact is created and available, but rather that something just created a placeholder for it. Consumers will need to listen for createArtifact and then listen for the task to be resolved before knowing an artifact was indeed uploaded.
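A sketch of the consumer-side bookkeeping this implies, with artifact-created messages treated as tentative until the task resolves (plain in-memory tracking; no particular pulse library is assumed, and processArtifact is a placeholder):

```js
// Sketch: buffer artifact-created messages and only trust them once the task resolves.
const pendingArtifacts = new Map();  // taskId -> Set of artifact names

function onArtifactCreated({taskId, artifactName}) {
  if (!pendingArtifacts.has(taskId)) pendingArtifacts.set(taskId, new Set());
  pendingArtifacts.get(taskId).add(artifactName);  // placeholder, not yet verified
}

function onTaskResolved({taskId}) {
  const names = pendingArtifacts.get(taskId) || new Set();
  pendingArtifacts.delete(taskId);
  for (const name of names) {
    // Only now is it safe to assume the artifact was actually uploaded.
    processArtifact(taskId, name);  // placeholder for whatever the consumer does next
  }
}
```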
Flags: needinfo?(garndt)
@garndt, Yeah, the workers need to implement things correctly first, ie. fail tasks when artifacts fail to upload. Doing it on the queue side is mostly about forcing workers to do the right thing. As the current situation clearly shows, expecting even ourselves to implement things correctly is a long shot. Forcing things to be done right at multiple levels is always a good guideline :)
I think if we go with the three-step publish process, we could consider implementing this in the taskcluster client libraries rather than directly in all workers/pseudo-workers. Until now taskcluster-client has been a lightweight, easily-generatable wrapper around the HTTP/AMQP APIs, so a downside is that the client becomes more intelligent/bespoke to our particular APIs. The advantage is that the clients are an isolated entry point into the APIs (at least, the only "supported" route into our APIs), so updated libraries could be provided to consumers to ease upgrade pain (e.g. mozharness, buildbot bridge(?), docker-worker, generic-worker, ...).
Flags: needinfo?(pmoore)
@pmoore, I'm okay with adding auxiliary methods to the client libraries. But we have to use the raw APIs when streaming up from docker-worker and other places where we only have a stream and not a file object.
Whiteboard: taskcluster-q32015-meeting
No longer blocks: 1225116
jhford is working on this.
Assignee: nobody → jhford
The blob artifact type solves this; we need to deprecate the s3 storage type.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Queue → Services