Closed Bug 1348121 Opened 8 years ago Closed 7 years ago

Cancellation notifications that are faster

Categories: Taskcluster :: Services
Type: enhancement
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: jhford
Assignee: Unassigned
Right now, aiui, we might wait up to 20m before triggering a cancellation of an in-progress task. In the past, I believe we used a Pulse exchange to send out messages and that caused a meltdown of our AMQP broker. I was thinking another option could be storing this information in a redis cache. The idea was:

1. Create a taskcluster-lib-api service that has a /cancel/:taskId/:runId endpoint (or w/e is appropriate). This endpoint would store a boolean value at the key "${taskId}_${runId}_CANCEL" as "true", TTL'd to a short time period.
2. Create a barebones (e.g. raw http.createServer()) service that receives requests at /is-cancelled/:taskId/:runId. This service would look up the redis key and, if the key exists, return its value as the string "true" or "false"; if not, just "false".

The idea is that the service which inserts cancellations would be called very infrequently and would require auth. Because the taskId/runId combo is not really all that exploitable information, having a read-only service which returns "true" or "false" isn't really a security issue. We should be able to get quite a high number of requests through such a service. Scaling the node part of the service would be trivial, and I'm sure redis scaling is a solved-ish problem, especially for this best-effort kind of service. We could add more than just cancellation notifications using this scheme.
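For illustration only, a minimal sketch of what the read-only lookup service described above could look like. The names and config (REDIS_URL, PORT) are assumptions, it uses the classic node_redis callback API, and it assumes the authenticated write side stores the string "true" at "${taskId}_${runId}_CANCEL" with a short TTL (e.g. via SETEX):

    const http = require('http');
    const url = require('url');
    const redis = require('redis');

    // Assumption: REDIS_URL points at the cache; defaults to localhost if unset.
    const client = redis.createClient(process.env.REDIS_URL);

    http.createServer((req, res) => {
      // Expect paths like /is-cancelled/<taskId>/<runId>
      const parts = url.parse(req.url).pathname.split('/').filter(Boolean);
      if (parts.length !== 3 || parts[0] !== 'is-cancelled') {
        res.writeHead(404);
        return res.end();
      }
      const [, taskId, runId] = parts;
      client.get(`${taskId}_${runId}_CANCEL`, (err, value) => {
        res.writeHead(200, {'Content-Type': 'text/plain'});
        // A missing key or a redis error both degrade to "false": this is
        // best-effort, so the worker just keeps running until the normal
        // reclaim path catches the cancellation.
        res.end(!err && value === 'true' ? 'true' : 'false');
      });
    }).listen(process.env.PORT || 8080);

Degrading to "false" on errors matches the best-effort nature of the scheme: a worker that cannot learn about a cancellation simply keeps going until reclaim notices.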
+1 to the idea. Here are my only thoughts on it:

I think the general case is that we have very few cancelled tasks. If that's true, then maybe what we want is a single endpoint, GET /cancelled_tasks, that returns a little list of taskIds that should be dead. I expect 90% of the day it will actually just be an empty 200 response. I think this is better because we could do it from the queue directly, without the need for a new service or redis.

Currently we don't use a lot of http caching afaik, but this would be a noop request unless we have a new cancellation, and we can set an etag to indicate as much. In this world, the endpoint in the queue gets hit almost never and the proxy sitting in front of it handles the static asset load all on its own. You'll notice I've slyly added here that we have a caching http proxy in front of the queue (which we currently don't, right?).

So the question here basically becomes: do we want to add redis or an http proxy? I think I might be in the proxy camp, since I expect it could give us performance wins across the board and isn't an extra datastore to support. Did what I say make any sense? I might just be talking bstack here; please tell me so if true.
Obviously we need to prune that list of cancelled tasks somehow. I think we could keep things in there for 24 hours since that's the max runtime anyway, or perhaps have a smarter way of clearing it?
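To make the etag idea above concrete, here is a minimal, self-contained sketch. The in-memory map and the standalone server are purely illustrative assumptions; in the proposal this would be an endpoint on the queue reading its own storage, with the caching proxy in front answering conditional requests:

    const http = require('http');
    const crypto = require('crypto');

    // taskId -> cancellation timestamp. Populated by the cancellation path
    // (elided here); entries are pruned after 24h, the max task runtime.
    const cancelled = new Map();
    const MAX_AGE = 24 * 60 * 60 * 1000;

    http.createServer((req, res) => {
      if (req.url !== '/cancelled_tasks') {
        res.writeHead(404);
        return res.end();
      }
      const now = Date.now();
      for (const [taskId, when] of cancelled) {
        if (now - when > MAX_AGE) {
          cancelled.delete(taskId);
        }
      }
      const body = JSON.stringify([...cancelled.keys()]);
      // Same list -> same etag, so a caching proxy in front can answer most
      // requests with 304s while nothing has changed.
      const etag = '"' + crypto.createHash('sha1').update(body).digest('hex') + '"';
      if (req.headers['if-none-match'] === etag) {
        res.writeHead(304);
        return res.end();
      }
      res.writeHead(200, {'Content-Type': 'application/json', 'ETag': etag});
      res.end(body);
    }).listen(process.env.PORT || 8080);

Because the etag only changes when the list changes, a proxy can serve the common empty-list case from cache and revalidate with cheap 304s, which is what would keep the queue itself nearly idle.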
Found in triage! Should we still do this? Thanks!
Flags: needinfo?(jhford)
I'm not sure, is it still a problem? I'll defer to Brian on whether this is still needed.
Flags: needinfo?(jhford) → needinfo?(bstack)
I don't know that this is causing issues right now. I know that I've talked with jonas about increasing the frequency of claimwork, which would decrease the amount of time we might wait before cancelling something. Maybe we should do that instead of adding a parallel mechanism for cancelling?
Flags: needinfo?(bstack) → needinfo?(jopsen)
Increasing the frequency of ReclaimTask seems like the best solution. It also gives the queue more recent information on whether a worker has died. IMO there is no rush to do it now, as it puts more stress on our systems, and bad implementations are liable to break and need fixing. I filed bug 1453656 to replace this bug.
Flags: needinfo?(jopsen)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: Platform and Services → Services