Closed Bug 1088916 Opened 10 years ago Closed 8 years ago

scheduler: Implement reversible state and add rerun/retrigger task API

Categories

(Taskcluster Graveyard :: Scheduler, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jonasfj, Unassigned)

References

Details

People should not have scopes to rerun a task defined by the task-graph scheduler, as this requires the scope "assume:scheduler-id:task-graph-scheduler". However, we should add an API end-point to the scheduler allowing tasks in a task-graph to be rerun. At this level, we should probably call it retrigger instead of rerun, to keep terminology clear. This API end-point also serves as a way for the scheduler state to reverse from blocked or finished to running. So we need to track which tasks block the task-graph, to see if a rerun/retrigger causes the state to reverse. Note: the state should only reverse from blocked if all blocking tasks are rerun, i.e., once the set of blocking tasks is empty.
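A minimal sketch of that state-reversal rule (all names here — TaskGraphState, blockingTasks, onRetrigger — are hypothetical illustrations, not the scheduler's actual API):

  // Hypothetical sketch of the state-reversal rule described above.
  type TaskGraphState = 'running' | 'blocked' | 'finished';

  interface TaskGraph {
    state: TaskGraphState;
    blockingTasks: Set<string>;  // taskIds currently blocking the graph
  }

  // Invoked when a task in the graph is rerun/retriggered.
  function onRetrigger(graph: TaskGraph, taskId: string): void {
    graph.blockingTasks.delete(taskId);
    if (graph.state === 'finished') {
      // rerunning any task in a finished graph reverses it to running
      graph.state = 'running';
    } else if (graph.state === 'blocked' && graph.blockingTasks.size === 0) {
      // blocked only reverses once *all* blocking tasks have been rerun
      graph.state = 'running';
    }
  }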
For details, see the conclusion at: https://etherpad.mozilla.org/jonasfj-taskcluster-task-graph-semantics

@lightsofapollo, I think it's sane to have some hard upper limit on how many tasks a task-graph can contain. We could code something that doesn't have such a limit, but it would take a lot more effort. Hence, I suggest we agree on an upper limit for the number of tasks in a task-graph. I suggest either:
A) 4000 tasks
B) 24000 tasks
(A) is a tiny bit easier to do, but if there is a risk that limit is too small, I suggest 24000 tasks; as a limit it'll take a little more work, but not much. We can also do more than 24000, but beyond 32000 tasks it would be very painful to add more; in fact, 28000 wouldn't leave much space for other things. Hence, I suggest 24000, to leave some wiggle room in our azure table entities while still being high enough that we'll likely never hit it.

Besides, if we have 24000 tasks in a single task-graph, we really can't handle many such task-graphs at once before AMQP starts getting mad at us... Imagine posting 24000 messages to the task-defined exchange at once. Note: if 4000 is a limit we'll never hit, that would be by far the best.

Remark: I haven't tested whether I can fit this much data into azure table storage, but docs + pocket calculator give me these numbers if I encode things efficiently. Of course, decoding 4000 taskIds from a binary string into slugid-encoded strings and returning them as an array will never be a memory-efficient thing to do in the scheduler. Neither will loading all that data from azure table storage be...

@lightsofapollo, what hard upper limit will make you happy? Choose something you can live with for the rest of your life, please :)
Flags: needinfo?(jlal)
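For reference, the back-of-the-envelope arithmetic behind options (A) and (B), assuming 16 raw bytes per taskId (see the encoding discussion further down) and Azure table storage's documented 64 KB-per-property and 1 MB-per-entity limits:

  // Illustrative arithmetic only.
  const BYTES_PER_TASKID = 16;          // one slugid decoded to a raw 128-bit UUID
  const PROPERTY_LIMIT = 64 * 1024;     // max bytes per key,value pair
  const perProperty = Math.floor(PROPERTY_LIMIT / BYTES_PER_TASKID);
  console.log(perProperty);             // 4096  -> option (A): one property
  console.log(6 * perProperty);         // 24576 -> option (B): six properties
  console.log(2 * 24000 * BYTES_PER_TASKID); // 768000 bytes for two full lists,
                                             // still under the 1 MB entity limit;
                                             // at 32000 tasks the two lists alone
                                             // would essentially exhaust it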
Right now, we'll likely get weird internal errors and undefined behaviour around 1200+ tasks...
tl;dr: 24000.

Longer version: we are running at about 100 chunks for the gaia tests so far, which tend to have more overhead than the gecko tests we will be optimizing, so 500+ tasks is super easy to hit per platform (and we have at least ~5 platforms). Storing the dependencies/requirements as a JSON-encoded string probably does not scale well... Have you investigated other options? (Alternatively, blob storage could actually do this, but that is probably not going to be too fun.)
Flags: needinfo?(jlal)
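To put the scaling concern in rough numbers (illustrative; azure table string properties are UTF-16 encoded, so a 64 KB string property holds about 32K characters):

  // Rough cost per taskId in each representation:
  //   JSON:   '"<22-char slugid>",'  -> ~25 characters per id
  //   binary: 16 raw bytes per id (a slugid is a base64url-encoded 128-bit UUID)
  const CHARS_PER_ID_JSON = 25;
  console.log(Math.floor(32 * 1024 / CHARS_PER_ID_JSON)); // ~1310 ids before one
                                                          // string property fills up;
                                                          // possibly where the ~1200-task
                                                          // ceiling noted above comes from
  console.log(24000 * CHARS_PER_ID_JSON); // ~600000 chars as JSON text
  console.log(24000 * 16);                // 384000 bytes as raw binary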
Question: will a 24k task limit ever be something we get anywhere near?

--- longer ramble ---
I wouldn't store 24000 taskIds as a JSON-encoded string array. Instead I would convert to binary; every slugid is 16 bytes. Each azure table key,value pair can be at most 64kb and the value can be binary, so using 6 key,value pairs to store the entire list would do the trick and give us 24000 taskIds. It would also take up about 400kb of storage, which gets us close to the 1mb limit for azure table entities if we need two lists (one for tasks not completed yet and one for blocking tasks). Well, I guess in theory those two would counter each other, so maybe we could go to 50k taskIds, but I like to have a decent margin :)

@jlal, have you considered the cost overhead of each task, when you say 4k tasks per push might not be enough? It's certainly worth considering whether 24k tasks per push is even a thinkable limit. Keep in mind that there are azure table storage entities for each task, blob storage entries in the queue, and a log on S3... It's tempting and easy to abuse trivial parallelism, but it can hinder both efficiency and performance.
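A minimal sketch of the binary packing described here, using Node's Buffer (the helper names are illustrative; the scheduler's actual entity layout is not shown in this bug):

  // A slugid is a URL-safe base64 encoding of a 128-bit UUID: 22 chars, no padding.
  function slugToBuffer(slug: string): Buffer {
    // restore the standard base64 alphabet and padding before decoding
    const b64 = slug.replace(/-/g, '+').replace(/_/g, '/') + '==';
    return Buffer.from(b64, 'base64');  // 16 raw bytes
  }

  function bufferToSlug(buf: Buffer): string {
    return buf.toString('base64')
      .replace(/\+/g, '-')
      .replace(/\//g, '_')
      .replace(/=+$/, '');  // back to 22 chars
  }

  // Pack a taskId list into 64 KB slices, one per table property:
  // 64 KB / 16 bytes = 4096 ids per property, so 6 properties hold 24576 ids.
  function packTaskIds(taskIds: string[]): Buffer[] {
    const all = Buffer.concat(taskIds.map(slugToBuffer));
    const slices: Buffer[] = [];
    for (let i = 0; i < all.length; i += 64 * 1024) {
      slices.push(all.subarray(i, i + 64 * 1024));
    }
    return slices;
  }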
Overhead aside, if our goal is to model one push in one graph, mozilla-central pushes are about 1000 jobs today. We still need to tune the number of chunks, but even if we never increased chunks we would hit that ~1200 mark pretty easily in the next few months alone. While ~100 chunks per test suite may not be the right number, we certainly should increase the chunks (which would put us much closer to 2-3k).

--- overhead ---
Overhead does matter, but it is fairly minimal when we are in an optimal state (i.e. the worker already has the docker image): in practice the overhead in the docker-worker amounts to the cost of starting the docker container (almost nothing) plus whatever time it takes the message to arrive (about 1s last time I measured). On top of this is the overhead of starting the tests (currently anywhere between 30s and 3 min, without optimizations). You can draw your own conclusions, but for tests we will continue to drive overhead down to almost nothing... As long as 10-20s of overhead is worth splitting up tasks, we will continue to do so.
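A toy model of that tradeoff (hypothetical numbers; overhead is paid once per chunk, while chunks run in parallel across workers):

  // Toy wall-clock model: each chunk pays a fixed startup overhead,
  // but chunks of the suite run in parallel. Numbers are hypothetical.
  const wallClock = (suiteRuntime: number, chunks: number, overhead: number) =>
    overhead + suiteRuntime / chunks;

  console.log(wallClock(3600, 1, 20));    // 3620 s: one unchunked task
  console.log(wallClock(3600, 100, 20));  // 56 s: 100 chunks at 20 s overhead each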
An even better argument here: our current bb infra already has 4k+ jobs (not all in one "graph" [yet, anyway]) and has problems because of very similar (soft) limitations around 4k jobs.
Hmm... I suspect my second comment and the subsequent discussion belong on bug 1088920. I filed a lot of bugs that day...
Follow-up from discussions last week: we're basically not sure we want to do this. In fact, graph state seems more immutable now... We should explore the big-graph scheduler before we fully decide on anything.
Component: TaskCluster → Scheduler
Product: Testing → Taskcluster
The scheduler is now deprecated in favor of task-groups and task.dependencies.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: Taskcluster → Taskcluster Graveyard