[meta] Support for manifest based scheduling
Categories
(Tree Management :: Treeherder: Frontend, task, P2)
Tracking
(Not tracked)
People
(Reporter: armenzg, Assigned: armenzg)
References
Details
(Whiteboard: [manifest-scheduling])
Attachments
(2 files)
For smart scheduling we're looking at what pieces are needed to move from task based scheduling to manifest based scheduling.
We currently schedule a task based on whether the manifests it would run as part of the task are considered high value or not.
We believe that there's currently a backfill-and-filter workflow where a sheriff follows these steps:
- A task has regressed and the sheriff uses the backfill action
- The sheriff then uses either the signature or the extended task label to look at the backfilled tasks [1] (there's even a hotkey shortcut to get to it)
Please correct me if that workflow is incorrect.
In the new model of scheduling by manifest we cannot guarantee that the manifest will have the same extended task label (Linux 18.04 x64 asan opt Mochitests test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3 M(dt3)) or signature (c15377f1f0ac8c097f7cd61753999deee12596aa). This is because chunking will be dynamic, thus manifests can change from push to push.
It might be possible to adjust the job signature to take manifests into consideration and exclude the symbol:
https://github.com/mozilla/treeherder/blob/368c112266f4f276251a4886a5337fcb17b3a1e9/treeherder/etl/jobs.py#L147-L171
[
build_system_type,
repository.name,
build_platform.os_name,
build_platform.platform,
build_platform.architecture,
machine_platform.os_name,
machine_platform.platform,
machine_platform.architecture,
job_group.name,
job_group.symbol,
job_type.name,
job_type.symbol,
option_collection_hash,
reference_data_name,
],
Perhaps we need a new signature as a stepping stone to deprecate the current one.
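As a rough sketch (an assumption, not the actual Treeherder code), a new signature could hash the tuple above minus the group/type symbols, together with the sorted list of manifests the task runs, so that tasks with the same manifest set and platform configuration hash identically even when their chunk symbol differs:
import hashlib


def manifest_aware_signature(properties, manifests):
    # properties: the signature tuple above, excluding job_group.symbol and job_type.symbol
    # manifests: the manifest paths the task executes (dynamic from push to push)
    data = "".join(str(p) for p in properties) + "".join(sorted(manifests))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()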
In the new model, we will backfill new tasks with the list of manifests executed in the backfilled task rather than based on the task label. We will be able to filter those tasks by looking at tasks that have that set of manifests and that platform configuration.
Now, we currently have the ability to filter by manifest by appending &test_paths, however, the platform will also need to match.
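For illustration only (the revision and path here are hypothetical, and the platform-related parameters would still need to be appended), such a filter looks roughly like:
https://treeherder.mozilla.org/jobs?repo=autoland&revision=<revision>&test_paths=devtools/client/inspector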
Originally I thought we could reach the maximum URL length, however, upon further investigation it is unlikely.
Perhaps we can add a link that will adjust the &test_paths and platform related parameters to match the backfilled tasks. The information is extractable from the taskgraph.json, but there's no convenient place / answer. marco is trying to solve the same problem.
If we use &test_paths we should fix bug 1626623 by moving all the frontend fetching and data manipulation to the backend. That will probably require defining a new Django model to represent a Manifest. A task would probably need a reference to a ManifestGroup which refers to an N number of Manifests. We will need to modify the jobs endpoint to include the test_paths property. I wonder if we would need a TestPath model as well, which would probably lead to the need for a TestGroup. In short:
- Task -1:1-> ManifestGroup -1:N-> Manifest -1:1-> TestGroup -1:N-> TestPath
Such a model would probably be the least amount of data needed to store this. We need to evaluate the storage cost. We would also need to verify that cycling data would delete these. I wonder how a test path could be stored compressed rather than as plain text.
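A minimal sketch of what those models could look like, assuming the Django ORM; the names and fields are illustrative only, not an actual schema:
from django.db import models


class ManifestGroup(models.Model):
    """Set of manifests executed by a task; a task would point at this with a OneToOneField."""


class Manifest(models.Model):
    group = models.ForeignKey(ManifestGroup, on_delete=models.CASCADE, related_name="manifests")
    path = models.CharField(max_length=512)


class TestGroup(models.Model):
    manifest = models.OneToOneField(Manifest, on_delete=models.CASCADE)


class TestPath(models.Model):
    group = models.ForeignKey(TestGroup, on_delete=models.CASCADE, related_name="test_paths")
    path = models.CharField(max_length=512)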
This is probably good enough of a description to discuss things.
[1]
Job: (sig): Linux 18.04 x64 asan opt Mochitests test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3 M(dt3)
Comment 1•5 years ago
I believe we need "add new jobs" support as well, likewise "retrigger" support. These workflows would use what is outlined in comment 0.
I like the idea of changing the table sooner to get into a hybrid state; we will need 4 months of data before expiring. We will probably need both models supported (current vs new) since 4 months is a long time, although I think having 4 weeks is enough time to cover almost all, if not all, scenarios for backfill, retrigger and add new job.
Comment 2•5 years ago
Ahal and I did not believe there was a technical reason that would make adding new jobs or retriggers require any special changes. As far as we understand, we would schedule what the Gecko decision task would have scheduled (the artifact contains all the data), and retriggers re-run a clone of the task that got scheduled. Please let me know if we overlooked something. In any case, if we encounter any issues we missed later on, we will tackle them.
We might consider backfilling data if we need to. We can backfill back to when Andrew got the artifacts generated by the Gecko decision task (sometime in Q1).
Comment 3•5 years ago
I guess there is an 'add new jobs' feature now, but what would you be adding: mochitest-1 or dom/indexedDB? How will we display the results of manifest jobs M(m1 m2 m3) where m1 = a set of manifests? Then is there a purpose to running the original M(1) job? It is OK if we duplicate tests, but I want to think this through.
Comment 4•5 years ago
So this signature has a couple of uses:
- We can filter with it by clicking the (sig) link. That will show only those jobs in all the loaded pushes that have the same signature. The link of text strings next to it is likely ALMOST as precise as that. So the filtering aspect of it may not be super crucial.
- We use it to trigger "add new jobs". But the signatures in there are not the same as what we store in the DB. The signatures for Runnable Jobs are like this: addon-tps-xpi, condprof-linux64-firefox, searchfox-linux64-searchfox/debug. These signatures do not match the Treeherder signatures. They come from Taskcluster, from the URL we get by calling getRunnableJobsURL.
History: these signatures were originally created for when the sheriffs managed a list of jobs to hide. We now use Tier-3 for that, so that functionality is no longer needed.
I think the only thing that field on the jobs table is used for is that sig filter link.
Comment 5•5 years ago
Asked Sebastian to comment on the value of that sig link.
Comment 6•5 years ago
The sig link will show if the task config changed and also has the benefit of creating a shorter url than the task name. The task name will always show the tasks which match the name even if their config changed.
RyanVM, how is your usage of the 'sig' link at the bottom left?
Comment 8•5 years ago
Ok to remove it from the sheriffing side.
Comment 9•5 years ago
OK, cool. Then it sounds like we can just remove that field from the job table completely at some point. I'm happy to remove the field from the UI.
Comment 12•5 years ago
I'm turning this bug into a meta bug because there are various components involved in getting this working.
bug 1633866 takes care of the Firefox build changes to support dynamic scheduling.
Here are some steps I documented in the PR of what is needed (I will update the information there as I think of it):
- Store transformed joint data of manifests-by-task.json and tests-by-manifest.json in the proposed models
  - https://github.com/mozilla/treeherder/blob/master/ui/job-view/pushes/Push.jsx#L39-L57
- Create API to expose that data (project/<repo>/<revision>/task-to-test-paths)
  - This would not be used by the UI since the jobs endpoint would include manifest/test paths, however, it is cheap to add and can be of use for some other systems
- Script to backfill past data?
- Bug 1636506 - API to return test_paths as a property
- Switch UI to use new API and deprecate JS code
  - This will improve the memory usage when test_paths is used (bug 1626623)
- Add link to the UI that will filter out tasks that run the same platform config and the same set of manifests
  - We can take advantage of concatenating &test_paths
  - We need to determine if we need to show tasks matching the same manifest set OR tasks that match one of the manifests
    - I'm leaning toward the latter
    - The question is whether we want to show past tasks that cover some of the manifests from the originating task OR only show tasks that have been backfilled
    - This is important to get right on the first PR
  - This link should show the tasks that a backfill will schedule
- Change behaviour of backfills to trigger tasks with the same manifest set
  - The backfilled tasks need to run the same manifest set as the originating task
A separate project is to return the test_paths property as part of the jobs endpoint. All of the above will enable us to make that change.
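For illustration only (the field names and path values are assumptions, not the final API), a job returned by the endpoint might then carry the property along these lines:
job = {
    "id": 123456789,
    "job_type_name": "test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3",
    "platform": "linux1804-64-asan",
    "test_paths": [
        "devtools/client/inspector",
        "devtools/client/netmonitor",
    ],
}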
Comment 13•4 years ago
NOTE to self: the source of truth for what manifests and test paths a task executes is the MOZHARNESS_TEST_PATH env var.
Tasks scheduled out of band would not be running what the artifacts generated by the Gecko decision task say they should execute.
Maybe we need to store the value of MOZHARNESS_TEST_PATH in the DB or Redis.
I was planning on returning the tasks' test paths via the jobs endpoint; however, we might want to stop using test_paths as a filtering method and instead use a hyperlink from a selected job.
I'm trying to avoid storing such a piece of data in the DB if we don't have to.
I have not thought about this profoundly, so don't read too much into it.
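A rough sketch of the Redis option (an assumption, not existing Treeherder code); it reads the task definition from the Taskcluster queue and caches the raw env value:
import redis
import requests

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumed Firefox CI root URL


def cache_test_paths(task_id, redis_client=None):
    # Fetch the task definition and pull out the env var described above
    redis_client = redis_client or redis.Redis()
    task = requests.get(f"{ROOT_URL}/api/queue/v1/task/{task_id}").json()
    value = task.get("payload", {}).get("env", {}).get("MOZHARNESS_TEST_PATH", "")
    # Store the raw value; consumers can parse it into manifests/test paths
    redis_client.set(f"test_paths:{task_id}", value)
    return value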
Comment 14•4 years ago
On another note, we may want to remove the job signature:
The job signature can be replaced by a (albeit long) unique tuple, as demonstrated in the code. This signature is a design flaw that is hindering our use of the Treeherder data in other ways. Specifically, the signature is a very specific optimization that will get in the way of manifest scheduling.
I suggest the columns in the signature table be merged into the job table, and that code which uses a signature be replaced with code that uses the tuple-of-values it represents. The immediate benefit is that the tuple describes the class of tasks better than a hash value does. Using tuples-of-values will also allow shorter tuples (we only need to specify job_type.name or job_type.symbol, not both). A bigger benefit comes from other use cases, like manifest scheduling:
The manifest_name will be unique; the job_type.name is irrelevant and the job_group.name is functionally dependent on the manifest_name (if you know the manifest_name, you will be able to conclude the suite). This means the class of jobs which run a manifest is best described by this tuple (depending on how specific you want to be):
[
repository.name,
machine_platform.platform,
manifest_name
]
By simply storing the signature properties in the job table, we can use different tuples to select jobs in different ways.
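As a hedged sketch (assuming the Django ORM; the job_manifests relation and the filter values are hypothetical), selecting the class of jobs for a manifest with such a tuple could look like:
from treeherder.model.models import Job  # assumed location of the job model

jobs = Job.objects.filter(
    repository__name="autoland",
    machine_platform__platform="linux1804-64-asan",
    job_manifests__manifest_name="devtools/client/inspector/test/browser.ini",
)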
Comment 15•4 years ago
From https://github.com/mozilla/treeherder/pull/6384/files
I'm debating between these two relationships:
- Job -1:1-> ManifestSet -1:N-> Manifest -1:1-> TestSet -1:N-> TestPath
- Job -1:N-> Manifest -1:N-> TestPath
The second is preferred; jobs, manifests and tests are the only entities we are dealing with.
All 1:1 relations are "annotations": a 1:1 relation is logically no different from merging the columns from both tables into one: any columns you may have in ManifestSet can be added to Job for the same effect. 1:1 relations also increase query complexity. That said, 1:1 relations can be useful for avoiding an ALTER TABLE command on the main table.
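A minimal sketch of the second, preferred shape, assuming the Django ORM; the names are illustrative, and in practice these could be ManyToManyFields since the same manifest is run by many jobs:
from django.db import models


class Manifest(models.Model):
    # "Job" refers to the existing job model (assumed)
    job = models.ForeignKey("Job", on_delete=models.CASCADE, related_name="manifests")
    path = models.CharField(max_length=512)


class TestPath(models.Model):
    manifest = models.ForeignKey(Manifest, on_delete=models.CASCADE, related_name="test_paths")
    path = models.CharField(max_length=512)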
Comment 16•4 years ago
There are improvements that can be made, but for now we have shipped this.