[meta] Support for manifest based scheduling
Categories
(Tree Management :: Treeherder: Frontend, task, P2)
Tracking
(Not tracked)
People
(Reporter: armenzg, Assigned: armenzg)
References
Details
(Whiteboard: [manifest-scheduling])
Attachments
(2 files)
For smart scheduling we're looking at what pieces are needed to move from task based scheduling to manifest based scheduling.
We currently schedule a task based on whether the manifests it would run as part of the task are considered high value or not.
We believe that there's currently a backfill-and-filter workflow where a sheriff follows these steps:
- A task has regressed and the sheriff uses the backfill action
- The sheriff then uses either the signature or the extended task label to look at the backfilled tasks [1] (there's even a hotkey shortcut to get to it)
Please correct me if that workflow is incorrect.
In the new model of scheduling by manifest we cannot guarantee that the manifest will have the same extended task label (Linux 18.04 x64 asan opt Mochitests test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3 M(dt3)) or signature (c15377f1f0ac8c097f7cd61753999deee12596aa). This is because chunking will be dynamic, thus manifests can change from push to push.
It might be possible to adjust the job signature to take manifests into consideration and exclude the symbol:
https://github.com/mozilla/treeherder/blob/368c112266f4f276251a4886a5337fcb17b3a1e9/treeherder/etl/jobs.py#L147-L171
[
build_system_type,
repository.name,
build_platform.os_name,
build_platform.platform,
build_platform.architecture,
machine_platform.os_name,
machine_platform.platform,
machine_platform.architecture,
job_group.name,
job_group.symbol,
job_type.name,
job_type.symbol,
option_collection_hash,
reference_data_name,
],
Perhaps we need a new signature as a stepping stone to deprecate the current one.
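As a rough sketch (an assumption, not the actual Treeherder code), a new signature could hash the tuple above minus the group/type symbols, together with the sorted list of manifests the task runs, so that tasks with the same manifest set and platform configuration hash identically even when their chunk symbol differs:
import hashlib


def manifest_aware_signature(properties, manifests):
    # properties: the signature tuple above, excluding job_group.symbol and job_type.symbol
    # manifests: the manifest paths the task executes (dynamic from push to push)
    data = "".join(str(p) for p in properties) + "".join(sorted(manifests))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()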
In the new model, we will backfill new tasks with the list of manifests executed in the backfilled task rather than based on the task label. We will be able to filter those tasks by looking at tasks that have that set of manifests and that platform configuration.
Now, we currently have the ability to filter by manifest by appending &test_paths, however, the platform will also need to match.
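For illustration only (the revision and path here are hypothetical, and the platform-related parameters would still need to be appended), such a filter looks roughly like:
https://treeherder.mozilla.org/jobs?repo=autoland&revision=<revision>&test_paths=devtools/client/inspector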
Originally I thought we could reach the maximum URL length, however, upon further investigation it is unlikely.
Perhaps we can add a link that will adjust the &test_paths and platform related parameters to match the backfilled tasks. The information is extractable from the taskgraph.json, but there's no convenient place / answer. marco is trying to solve the same problem.
If we use &test_paths we should fix bug 1626623 by moving all the frontend fetching and data manipulation to the backend. That will probably require defining a new Django model to represent a Manifest. A task would probably need a reference to a ManifestGroup which refers to an N number of Manifests. We will need to modify the jobs endpoint to include the test_paths property. I wonder if we would need a TestPath model as well, which would probably lead to the need for a TestGroup. In short:
- Task -1:1-> ManifestGroup -1:N-> Manifest -1:1-> TestGroup -1:N-> TestPath
Such a model would probably be the least amount of data needed to store this. We need to evaluate the storage cost. We would also need to verify that cycling data would delete these. I wonder how a test path could be stored compressed rather than as plain text.
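A minimal sketch of what those models could look like, assuming the Django ORM; the names and fields are illustrative only, not an actual schema:
from django.db import models


class ManifestGroup(models.Model):
    """Set of manifests executed by a task; a task would point at this with a OneToOneField."""


class Manifest(models.Model):
    group = models.ForeignKey(ManifestGroup, on_delete=models.CASCADE, related_name="manifests")
    path = models.CharField(max_length=512)


class TestGroup(models.Model):
    manifest = models.OneToOneField(Manifest, on_delete=models.CASCADE)


class TestPath(models.Model):
    group = models.ForeignKey(TestGroup, on_delete=models.CASCADE, related_name="test_paths")
    path = models.CharField(max_length=512)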
This is probably good enough of a description to discuss things.
[1]
Job: (sig): Linux 18.04 x64 asan opt Mochitests test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3 M(dt3)
Comment 1•5 years ago
I believe we need "add new jobs" support as well, likewise "retrigger" support. These workflows would use what is outlined in comment 0.
I like the idea of changing the table sooner to get into a hybrid state; we will need 4 months of data before expiring. We will probably need both models supported (current vs new) since 4 months is a long time, although I think having 4 weeks is enough time to cover almost all, if not all, scenarios for backfill, retrigger and add new job.
Comment 2•5 years ago
Ahal and I did not believe there was a technical reason that would make adding new jobs or retriggers require any special changes. As far as we understand, we would schedule what the Gecko decision task would have scheduled (the artifact contains all the data), and retriggers re-run a clone of the task that got scheduled. Please let me know if we overlooked something. In any case, if we encounter any issues we missed later on, we will tackle them.
We might consider backfilling data if we need to. We can backfill back to when Andrew got the artifacts generated by the Gecko decision task (sometime in Q1).
Comment 3•5 years ago
I guess there is an 'add new jobs' feature now, but what would you be adding: mochitest-1 or dom/indexedDB? How will we display the results of manifest jobs M(m1 m2 m3) where m1 = a set of manifests? Then is there a purpose to running the original M(1) job? It is OK if we duplicate tests, but I want to think this through.
Comment 4•5 years ago
So this signature has a couple of uses:
- We can filter with it by clicking the (sig) link. That will show only those jobs in all the loaded pushes that have the same signature. The link of text strings next to it is likely ALMOST as precise as that. So the filtering aspect of it may not be super crucial.
- We use it to trigger "add new jobs". But the signatures in there are not the same as what we store in the DB. The signatures for Runnable Jobs are like this: addon-tps-xpi, condprof-linux64-firefox, searchfox-linux64-searchfox/debug. These signatures do not match the Treeherder signatures. They come from Taskcluster, from the URL we get by calling getRunnableJobsURL.
History: these signatures were originally created for when the sheriffs managed a list of jobs to hide. We now use Tier-3 for that, so that functionality is no longer needed.
I think the only thing that field on the jobs table is used for is that sig filter link.
Comment 5•5 years ago
Asked Sebastian to comment on the value of that sig link.
Comment 6•5 years ago
The sig link will show if the task config changed and also has the benefit of creating a shorter url than the task name. The task name will always show the tasks which match the name even if their config changed.
RyanVM, how is your usage of the 'sig' link at the bottom left?
Comment 8•5 years ago
Ok to remove it from the sheriffing side.
Comment 9•5 years ago
OK, cool. Then it sounds like we can just remove that field from the job table completely at some point. I'm happy to remove the field from the UI.
Comment 12•5 years ago
I'm turning this bug into a meta bug because there are various components involved in getting this working.
bug 1633866 takes care of the Firefox build changes to support dynamic scheduling.
Here are some steps I documented in the PR of what is needed (I will update the information there as I think of it):
- Store transformed joint data of manifests-by-task.json and tests-by-manifest.json in the proposed models
  - https://github.com/mozilla/treeherder/blob/master/ui/job-view/pushes/Push.jsx#L39-L57
- Create API to expose that data (project/<repo>/<revision>/task-to-test-paths)
  - This would not be used by the UI since the jobs endpoint would include manifest/test paths, however, it is cheap to add and can be of use for some other systems
- Script to backfill past data?
- Bug 1636506 - API to return test_paths as a property
- Switch UI to use new API and deprecate JS code
  - This will improve the memory usage when test_paths is used (bug 1626623)
- Add link to the UI that will filter out tasks that run the same platform config and the same set of manifests
  - We can take advantage of concatenating &test_paths
  - We need to determine if we need to show tasks matching the same manifest set OR tasks that match one of the manifests
    - I'm leaning toward the latter
    - The question is whether we want to show past tasks that cover some of the manifests from the originating task OR only show tasks that have been backfilled
    - This is important to get right on the first PR
  - This link should show the tasks that a backfill will schedule
- Change behaviour of backfills to trigger tasks with the same manifest set
  - The backfilled tasks need to run the same manifest set as the originating task
A separate project is to return the test_paths property as part of the jobs endpoint. All of the above will enable us to make that change.
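For illustration only (the field names and path values are assumptions, not the final API), a job returned by the endpoint might then carry the property along these lines:
job = {
    "id": 123456789,
    "job_type_name": "test-linux1804-64-asan/opt-mochitest-devtools-chrome-e10s-3",
    "platform": "linux1804-64-asan",
    "test_paths": [
        "devtools/client/inspector",
        "devtools/client/netmonitor",
    ],
}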
Comment 13•4 years ago
NOTE to self: the source of truth for what manifests and test paths a task executes is the MOZHARNESS_TEST_PATH env var.
Tasks scheduled out of band would not be running what the artifacts generated by the Gecko decision task say they should execute.
Maybe we need to store the value of MOZHARNESS_TEST_PATH in the DB or Redis.
I was planning on returning the tasks' test paths via the jobs endpoint; however, we might want to stop using test_paths as a filtering method and instead use a hyperlink from a selected job.
I'm trying to avoid storing such a piece of data in the DB if we don't have to.
I have not thought about this profoundly, so don't read too much into it.
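A rough sketch of the Redis option (an assumption, not existing Treeherder code); it reads the task definition from the Taskcluster queue and caches the raw env value:
import redis
import requests

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumed Firefox CI root URL


def cache_test_paths(task_id, redis_client=None):
    # Fetch the task definition and pull out the env var described above
    redis_client = redis_client or redis.Redis()
    task = requests.get(f"{ROOT_URL}/api/queue/v1/task/{task_id}").json()
    value = task.get("payload", {}).get("env", {}).get("MOZHARNESS_TEST_PATH", "")
    # Store the raw value; consumers can parse it into manifests/test paths
    redis_client.set(f"test_paths:{task_id}", value)
    return value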
Comment 14•4 years ago
On another note, we may want to remove the job signature:
The job signature can be replaced by a (albeit long) unique tuple, as demonstrated in the code. This signature is a design flaw that is hindering our use of the Treeherder data in other ways. Specifically, the signature is a very specific optimization that will get in the way of manifest scheduling.
I suggest the columns in the signature table be merged into the job table, and that code which uses a signature be replaced with code that uses the tuple-of-values it represents. The immediate benefit is that the tuple describes the class of tasks better than a hash value does. Using tuples-of-values will also allow shorter tuples (we only need to specify job_type.name or job_type.symbol, not both). A bigger benefit comes from other use cases, like manifest scheduling:
The manifest_name will be unique; the job_type.name is irrelevant and the job_group.name is functionally dependent on the manifest_name (if you know the manifest_name, you will be able to conclude the suite). This means the class of jobs which run a manifest is best described by this tuple (depending on how specific you want to be):
[
repository.name,
machine_platform.platform,
manifest_name
]
By simply storing the signature properties in the job table, we can use different tuples to select jobs in different ways.
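As a hedged sketch (assuming the Django ORM; the job_manifests relation and the filter values are hypothetical), selecting the class of jobs for a manifest with such a tuple could look like:
from treeherder.model.models import Job  # assumed location of the job model

jobs = Job.objects.filter(
    repository__name="autoland",
    machine_platform__platform="linux1804-64-asan",
    job_manifests__manifest_name="devtools/client/inspector/test/browser.ini",
)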
Comment 15•4 years ago
From https://github.com/mozilla/treeherder/pull/6384/files
I'm debating between these two relationships:
- Job -1:1-> ManifestSet -1:N-> Manifest -1:1-> TestSet -1:N-> TestPath
- Job -1:N-> Manifest -1:N-> TestPath
The second is preferred; jobs, manifests and tests are the only entities we are dealing with.
All 1:1 relations are "annotations": a 1:1 relation is logically no different from merging the columns from both tables into one: any columns you may have in ManifestSet can be added to Job for the same effect. 1:1 relations also increase query complexity. That said, 1:1 relations can be useful for avoiding an ALTER TABLE command on the main table.
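A minimal sketch of the second, preferred shape, assuming the Django ORM; the names are illustrative, and in practice these could be ManyToManyFields since the same manifest is run by many jobs:
from django.db import models


class Manifest(models.Model):
    # "Job" refers to the existing job model (assumed)
    job = models.ForeignKey("Job", on_delete=models.CASCADE, related_name="manifests")
    path = models.CharField(max_length=512)


class TestPath(models.Model):
    manifest = models.ForeignKey(Manifest, on_delete=models.CASCADE, related_name="test_paths")
    path = models.CharField(max_length=512)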
Comment 16•4 years ago
There are improvements that can be made, but for now we have shipped this.