Closed Bug 1348121 Opened 8 years ago Closed 7 years ago

Cancellation notifications that are faster

Categories: Taskcluster :: Services
Type: enhancement
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: jhford
Assignee: Unassigned
Right now, aiui, we might wait up to 20m before triggering a cancellation of an in-progress task. In the past, I believe we used a Pulse exchange to send out messages and that caused a meltdown of our AMQP broker. I was thinking another option could be storing this information in a redis cache. The idea was:

1. Create a taskcluster-lib-api service that has a /cancel/:taskId/:runId endpoint (or w/e is appropriate). This endpoint would store a boolean value at the key "${taskId}_${runId}_CANCEL" as "true", TTL'd to a short time period.
2. Create a barebones (e.g. raw http.createServer()) service that receives requests at /is-cancelled/:taskId/:runId. This service would look up the redis key and, if the key exists, return its value as the string "true" or "false"; if not, just "false".

The idea is that the service which inserts cancellations would be called very infrequently and would require auth. Because the taskId/runId combo is not really all that exploitable information, having a read-only service which returns "true" or "false" isn't really a security issue. We should be able to get quite a high number of requests through such a service. Scaling the node part of the service would be trivial, and I'm sure redis scaling is a solved-ish problem, especially for this best-effort kind of service. We could add more than just cancellation notifications using this scheme.
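For illustration only, a minimal sketch of what the read-only lookup service described above could look like. The names and config (REDIS_URL, PORT) are assumptions, it uses the classic node_redis callback API, and it assumes the authenticated write side stores the string "true" at "${taskId}_${runId}_CANCEL" with a short TTL (e.g. via SETEX):

    const http = require('http');
    const url = require('url');
    const redis = require('redis');

    // Assumption: REDIS_URL points at the cache; defaults to localhost if unset.
    const client = redis.createClient(process.env.REDIS_URL);

    http.createServer((req, res) => {
      // Expect paths like /is-cancelled/<taskId>/<runId>
      const parts = url.parse(req.url).pathname.split('/').filter(Boolean);
      if (parts.length !== 3 || parts[0] !== 'is-cancelled') {
        res.writeHead(404);
        return res.end();
      }
      const [, taskId, runId] = parts;
      client.get(`${taskId}_${runId}_CANCEL`, (err, value) => {
        res.writeHead(200, {'Content-Type': 'text/plain'});
        // A missing key or a redis error both degrade to "false": this is
        // best-effort, so the worker just keeps running until the normal
        // reclaim path catches the cancellation.
        res.end(!err && value === 'true' ? 'true' : 'false');
      });
    }).listen(process.env.PORT || 8080);

Degrading to "false" on errors matches the best-effort nature of the scheme: a worker that cannot learn about a cancellation simply keeps going until reclaim notices.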
+1 to the idea. Here are my only thoughts on it:

I think the general case is that we have very few cancelled tasks. If that's true, then maybe what we want is a single endpoint, GET /cancelled_tasks, that returns a little list of taskIds that should be dead. I expect 90% of the day it will actually just be an empty 200 response. I think this is better because we could do it from the queue directly, without the need for a new service or redis.

Currently we don't use a lot of http caching afaik, but this would be a noop request unless we have a new cancellation, and we can set an etag to indicate as much. In this world, the endpoint in the queue gets hit almost never and the proxy sitting in front of it handles the static asset load all on its own. You'll notice I've slyly added here that we have a caching http proxy in front of the queue (which we currently don't, right?).

So the question here basically becomes: do we want to add redis or an http proxy? I think I might be in the proxy camp, since I expect it could give us performance wins across the board and isn't an extra datastore to support. Did what I say make any sense? I might just be talking bstack here; please tell me so if true.
Obviously we need to prune that list of cancelled tasks somehow. I think we could keep things in there for 24 hours since that's the max runtime anyway, or perhaps have a smarter way of clearing it?
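To make the etag idea above concrete, here is a minimal, self-contained sketch. The in-memory map and the standalone server are purely illustrative assumptions; in the proposal this would be an endpoint on the queue reading its own storage, with the caching proxy in front answering conditional requests:

    const http = require('http');
    const crypto = require('crypto');

    // taskId -> cancellation timestamp. Populated by the cancellation path
    // (elided here); entries are pruned after 24h, the max task runtime.
    const cancelled = new Map();
    const MAX_AGE = 24 * 60 * 60 * 1000;

    http.createServer((req, res) => {
      if (req.url !== '/cancelled_tasks') {
        res.writeHead(404);
        return res.end();
      }
      const now = Date.now();
      for (const [taskId, when] of cancelled) {
        if (now - when > MAX_AGE) {
          cancelled.delete(taskId);
        }
      }
      const body = JSON.stringify([...cancelled.keys()]);
      // Same list -> same etag, so a caching proxy in front can answer most
      // requests with 304s while nothing has changed.
      const etag = '"' + crypto.createHash('sha1').update(body).digest('hex') + '"';
      if (req.headers['if-none-match'] === etag) {
        res.writeHead(304);
        return res.end();
      }
      res.writeHead(200, {'Content-Type': 'application/json', 'ETag': etag});
      res.end(body);
    }).listen(process.env.PORT || 8080);

Because the etag only changes when the list changes, a proxy can serve the common empty-list case from cache and revalidate with cheap 304s, which is what would keep the queue itself nearly idle.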
Found in triage! Should we still do this? Thanks!
Flags: needinfo?(jhford)
I'm not sure, is it still a problem? I'll defer to Brian on whether this is still needed.
Flags: needinfo?(jhford) → needinfo?(bstack)
I don't know that this is causing issues right now. I know that I've talked with jonas about increasing the frequency of claimwork, which would decrease the amount of time we might wait before cancelling something. Maybe we should do that instead of adding a parallel mechanism for cancelling?
Flags: needinfo?(bstack) → needinfo?(jopsen)
Increasing the frequency of ReclaimTask seems like the best solution. It also gives the queue more recent information on whether a worker has died. IMO there is no rush to do it now, as it puts more stress on our systems, and bad implementations are liable to break and need fixing. I filed bug 1453656 to replace this bug.
Flags: needinfo?(jopsen)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: Platform and Services → Services