Closed Bug 1434851 Opened 7 years ago Closed 4 years ago

Pending count doesn't drop when jobs are cancelled and nothing is claiming

Categories

(Taskcluster :: Services, enhancement, P5)


Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: pmoore, Unassigned)

References

Details

This push[1] triggered 2400 jobs for gecko-t-win7-32-gpu-b which weren't claimed because of a problem with the worker type. Those jobs were cancelled four hours ago[2], yet the pending count still shows 2400:

> $ curl -L 'https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-t-win7-32-gpu-b'
> {
>   "provisionerId": "aws-provisioner-v1",
>   "workerType": "gecko-t-win7-32-gpu-b",
>   "pendingTasks": 2400
> }

So it appears that cancelling jobs does not reduce the pending count.

Note - this is a special worker type that nobody else is using, so we can be pretty confident these really are the only jobs that were submitted.

--
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c1ec249d084f85da207b979e8ea87a697339193a
[2] https://tools.taskcluster.net/groups/eU--MR1aTfeRVHseiFR99g
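To watch whether that number ever drops (rather than taking a one-shot reading), here's a minimal Go sketch against the same public pending endpoint used in the curl above. Nothing beyond that endpoint is assumed, and no credentials are required since it is a public read-only call.

// Minimal sketch: poll the public pending-count endpoint shown above and print
// the estimate over time. Only that endpoint is assumed; no credentials needed.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type pendingResponse struct {
	ProvisionerID string `json:"provisionerId"`
	WorkerType    string `json:"workerType"`
	PendingTasks  int    `json:"pendingTasks"`
}

func main() {
	url := "https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-t-win7-32-gpu-b"
	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		var p pendingResponse
		if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		fmt.Printf("%s %s/%s pendingTasks (estimate): %d\n",
			time.Now().Format(time.RFC3339), p.ProvisionerID, p.WorkerType, p.PendingTasks)
		time.Sleep(30 * time.Second)
	}
}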
The pending count is an estimate... it's intended to be an over-estimate, but in a distributed system I won't make such a promise.

It will go down as queue.claimWork is called; the hints in the queue get removed without the tasks being executed.

If you see tasks that were cancelled getting claimed a long time after the cancellation, I'm interested.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
But when the over-estimate is the entire 2400 jobs, lasts for hours, costs us serious money in 50+ EC2 GPU instances, and starves other pools of GPU instances, we need to seriously reconsider the design. If we cannot make guarantees with the current backing storage, we need different backing storage. If something needs to periodically check the queues to update these numbers, so be it.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> The pending count is an estimate... it's intended to be an over-estimate,
> but in a distributed system I won't make such a promise.
>
> It will go down as queue.claimWork is called; the hints in the queue get
> removed without the tasks being executed.
>
> If you see tasks that were cancelled getting claimed a long time after the
> cancellation, I'm interested.

IIUC, the pending count is really the total of all pending jobs plus all cancelled jobs that would still have been pending had they not been cancelled.

The problem I see is that claimWork won't be called for a given task until that task reaches the front of the queue, so the pending count will be inaccurate until the workers have "caught up". The pending count is only really useful when the workers are running behind (as opposed to when they have caught up), so there is less value in a pending count that may include a large proportion of cancelled tasks.

An approximate pending count is important data for sheriffs deciding when to close or reopen trees, so I think it is worth at least considering how we could offset cancelled tasks from the total, to get a more accurate picture.
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> An approximate pending count is important data for sheriffs deciding when
> to close or reopen trees, so I think it is worth at least considering how
> we could offset cancelled tasks from the total, to get a more accurate
> picture.

It's also the only signal the provisioner has for deciding whether or not to spin up instances.
The broken thing here was having 60+ instances that weren't claiming jobs. That is always going to cause expense and (if in production) unhappiness. That issue then exposed this documented over-estimation of queue length.

Unfortunately, this can't be fixed with Azure. Whenever we transition to Postgres, it will be fixed implicitly (at the cost of counting becoming quite a bit more expensive). There are a number of other reasons to switch to Postgres too, although at the moment that project is on hold.

Something that might help is a tool to "drain" a queue. It'd be useful for beta workerTypes like this one. Basically just loop calling claimWork and then resolving each task as an exception. Maybe `taskcluster drain-queue` in taskcluster-cli?

Pete, you can write Go super-quickly... do you want to work on that?
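For concreteness, a rough sketch of what such a drain-queue loop might look like in Go. This is not a definitive implementation: it assumes the taskcluster-client-go queue package exposes ClaimWork and ReportException wrappers shaped like the corresponding REST calls, and the exact package path, struct fields, method signatures, and handling of the per-task credentials returned with each claim would all need to be checked against the real client.

// Rough sketch of a drain-queue tool: keep claiming work for the given
// provisioner/workerType and resolve every claimed task as an exception,
// until claimWork comes back empty. Client package and field names are
// assumptions, not verified against the real taskcluster-client-go API.
package main

import (
	"fmt"
	"log"
	"os"

	tcclient "github.com/taskcluster/taskcluster-client-go"
	"github.com/taskcluster/taskcluster-client-go/queue"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatal("usage: drain-queue <provisionerId> <workerType>")
	}
	provisionerID, workerType := os.Args[1], os.Args[2]

	q := queue.New(&tcclient.Credentials{
		ClientID:    os.Getenv("TASKCLUSTER_CLIENT_ID"),
		AccessToken: os.Getenv("TASKCLUSTER_ACCESS_TOKEN"),
	})

	for {
		// claimWork long-polls (~20s) only when nothing is pending; when tasks
		// are available it returns immediately, so the loop drains quickly.
		resp, err := q.ClaimWork(provisionerID, workerType, &queue.ClaimWorkRequest{
			WorkerGroup: "drain-queue",
			WorkerID:    "drain-queue",
			Tasks:       32, // claim in batches; assumed request field
		})
		if err != nil {
			log.Fatalf("claimWork failed: %v", err)
		}
		if len(resp.Tasks) == 0 {
			fmt.Println("no more pending tasks; queue drained")
			return
		}
		for _, claim := range resp.Tasks {
			// In practice this call must use the task-specific credentials
			// returned with the claim. The reason also matters: reasons like
			// worker-shutdown trigger a fresh run, while malformed-payload is
			// terminal; pick per intent (terminal assumed here for a drain).
			_, err := q.ReportException(claim.Status.TaskID, fmt.Sprintf("%d", claim.RunID),
				&queue.TaskExceptionRequest{Reason: "malformed-payload"})
			if err != nil {
				log.Printf("could not resolve %s run %d: %v", claim.Status.TaskID, claim.RunID, err)
			}
		}
	}
}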
Summary: Pending count doesn't drop when jobs are cancelled → Pending count doesn't drop when jobs are cancelled and nothing is claiming
I think, if I'm understanding Jonas correctly, that the queue will train the entire backlog the first time a claimWork call is made, because it will at that point retrieve tasks from azure until it finds one that hasn't been cancelled (so the cancelled tasks don't ever reach the worker).

If that is the case, I think just a single claimWork call needs to be made, and it will return no tasks, and in the process will drain the queue itself. I guess the script could try to claim one task, and if it succeeds, query the worker type's current maxCapacity, update the worker type to have 0 max capacity, and then resolve it with an exception reason that causes it to get a new run (hoping that all the runs have not been used up). It could then output the previous max capacity so that when the worker type is functional again, it can be reset. That way, if for some reason there were non-cancelled tasks in the queue, they wouldn't get cancelled, but the provisioner wouldn't keep spawning new instances, as maxCapacity would be 0.

So - I guess I could, but it might not be as simple as it first sounded, and would require handling to reset the maxCapacity back to a reasonable value when the worker type is functional again.

This probably needs some more thought. Maybe we just wait for the postgres implementation, when this problem goes away. :-)
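To make the shape of that flow concrete, here is a control-flow sketch only: all four helpers below are hypothetical placeholders standing in for whatever aws-provisioner and queue calls would actually be used, so only the ordering of the steps is meaningful.

// Control-flow sketch of the approach above: claim one task, zero out the
// pool's maxCapacity so the provisioner stops spawning instances, then resolve
// the claimed task so it gets a fresh run later. Every helper below is a
// hypothetical placeholder, NOT a real provisioner or queue API.
package main

import (
	"fmt"
	"log"
)

// Placeholder stand-ins for whatever aws-provisioner client calls are used.
func getMaxCapacity(workerType string) (int, error)        { return 0, nil }
func setMaxCapacity(workerType string, capacity int) error { return nil }

// Placeholder stand-ins for the queue calls (claimWork with tasks=1, and
// reportException with a reason that triggers a new run).
func claimOneTask(provisionerID, workerType string) (taskID string, runID int64, claimed bool, err error) {
	return "", 0, false, nil
}
func resolveWithNewRun(taskID string, runID int64) error { return nil }

func main() {
	provisionerID, workerType := "aws-provisioner-v1", "gecko-t-win7-32-gpu-b"

	taskID, runID, claimed, err := claimOneTask(provisionerID, workerType)
	if err != nil {
		log.Fatal(err)
	}
	if !claimed {
		// Nothing pending: the single claimWork call has already consumed the
		// cancelled-task hints, so there is nothing left to do.
		fmt.Println("nothing pending; queue hints already drained")
		return
	}

	// Remember the old capacity so it can be restored once the pool is healthy.
	oldCap, err := getMaxCapacity(workerType)
	if err != nil {
		log.Fatal(err)
	}
	if err := setMaxCapacity(workerType, 0); err != nil {
		log.Fatal(err)
	}

	// Hand the claimed task back (assuming it still has runs left).
	if err := resolveWithNewRun(taskID, runID); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("maxCapacity was %d; restore it once %s is healthy again\n", oldCap, workerType)
}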
s/train/drain/ *sigh*
Or we do the simpler thing and also drain the queue of non-cancelled tasks... But then we might not know who we are impacting.
Question: does this mean tasks that have exceeded their deadline but not yet reached the front of the azure queue will also be included in pending count?
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> I think, if I'm understanding Jonas correctly, that the queue will train the
> entire backlog the first time a claimWork call is made, because it will at
> that point retrieve tasks from azure until it finds one that hasn't been
> cancelled (so the cancelled tasks don't ever reach the worker).

That depends on the implementation (and I haven't looked either). It could be that the queue pulls a few elements off the queue and gives up, returning no claimed work, in order to avoid a single API call burning through lots of queue entries. Also, if there are non-cancelled tasks in the queue, then a single call will only consume queue elements until it finds such a non-cancelled task -- even if there are more cancelled tasks later in the queue.

> If that is the case, I think just a single claimWork call needs to be made,
> and it will return no tasks, and in the process will drain the queue itself.
> I guess the script could try to claim one task, and if it succeeds, query
> the worker type's current maxCapacity, update the worker type to have 0 max
> capacity, and then resolve it with an exception reason that causes it to get
> a new run (hoping that all the runs have not been used up). It could then
> output the previous max capacity so that when the worker type is functional
> again, it can be reset. That way, if for some reason there were
> non-cancelled tasks in the queue, they wouldn't get cancelled, but the
> provisioner wouldn't keep spawning new instances, as maxCapacity would be 0.
>
> So - I guess I could, but it might not be as simple as it first sounded, and
> would require handling to reset the maxCapacity back to a reasonable value
> when the worker type is functional again.

The tool I suggested would be useful for more situations than just workers that don't claim tasks. For example, occasionally we're asked to cancel all of the pending deepspeech tasks. The advantage of claiming everything is that it avoids the need to cancel everything separately.

Messing with the workerType definition seems like scope creep: it would only apply to this case of workers that start and do not claim, and it provides only one solution. Other solutions might involve rolling back the workerType or, for a beta workerType like this, just draining the queue and terminating the instances. You're right that this isn't something we'd do lightly on a production queue like gecko-t-linux-xlarge, but for testing queues like ami-test or the beta windows workers it makes sense. And for users standing up new hardware workers, it may also be useful.

> Maybe we just wait for the postgres implementation, when this problem goes
> away. :-)

Mostly -- that's really up to you. I think we should either use this bug to write a drain-queue tool, or close it.

(In reply to Pete Moore [:pmoore][:pete] from comment #9)
> Question: does this mean tasks that have exceeded their deadline but not yet
> reached the front of the azure queue will also be included in pending count?

Yes.
OK, so "drain queue" really does mean drain everything, not just cancelled tasks. That was me misunderstanding the use case this tool was meant to solve. Misbehaving workers are probably best handled by setting maxCapacity to 0, so your proposal for an aggressive drain does make a lot of sense.

There is another complication, though. If we need to make multiple claimWork calls to drain the queue, we're stuck with 20s long polling. It could take an awfully long time to drain the queue, or require a huge amount of parallelism.
If there's work available, claimWork returns immediately. So it would only block for the full 20s on the final call, which returns nothing, at which point the utility can exit.
Indeed, of course! Silly me, thanks Dustin.
The pending count will be more accurate when bug 1436478 is complete.
Depends on: 1436478
P5 pending bug 1436478
Priority: -- → P5
QA Contact: jhford
Component: Queue → Services
Status: REOPENED → RESOLVED
Closed: 7 years ago → 4 years ago
Resolution: --- → MOVED