[meta] New/modified test verification
Categories
(Testing :: General, enhancement, P3)
Tracking
(Not tracked)
People
(Reporter: gbrown, Unassigned)
References
(Depends on 18 open bugs, Blocks 2 open bugs)
Details
(Keywords: meta)
Comment 4 (Reporter) • 5 years ago
One weakness of TV is that the TV task may not run with the same task configuration as the normal test task it is verifying. For instance, if the xpcshell test task and the mochitest test task for a particular platform use different builds (e.g. Windows xpcshell tests may run against a signed build), then TV can be configured to match one or the other, but not both. This is why TVg was introduced: so that different virtualizations could be used in TV vs. TVg; but deciding which tests apply to TV vs. TVg has also been tricky. And task configurations are always changing.
:bc's recent work on "test isolation" suggests a different approach -- https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=bc8f78e7ea0b947c07b6a6c4c502882faa1b973f -- where existing task definitions are cloned. There would be additional challenges for TV, but I'm thinking TV could identify test files from the hg log (as it does today), then spawn new tasks for each supported suite affected by the push. If a push modified a mochitest and an xpcshell test, TV would notice that, then spawn M-tv and X-tv tasks, each cloned from the appropriate existing task definition.
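The idea above can be sketched roughly as follows. This is a hypothetical illustration, not the real taskgraph API: the pattern table (SUITE_PATTERNS) and the helper names (suites_for_push, spawn_tv_tasks) are invented for this sketch, and real suite detection would consult test manifests rather than filename patterns.

```python
# Hypothetical sketch of comment 4's proposal: group the test files touched
# by a push by suite, then clone each affected suite's existing task
# definition into a corresponding -tv task.
import fnmatch

# Illustrative file patterns per harness; real TV would read test manifests.
SUITE_PATTERNS = {
    "mochitest": ["*/test_*.html", "*/browser_*.js"],
    "xpcshell": ["*/unit/test_*.js"],
    "web-platform-tests": ["testing/web-platform/tests/*"],
}

def suites_for_push(changed_files):
    """Group changed test files by the suite that would verify them."""
    affected = {}
    for path in changed_files:
        for suite, patterns in SUITE_PATTERNS.items():
            if any(fnmatch.fnmatch(path, pat) for pat in patterns):
                affected.setdefault(suite, []).append(path)
    return affected

def spawn_tv_tasks(changed_files, task_definitions):
    """Clone the existing task definition of each affected suite,
    so the -tv task inherits that suite's normal configuration."""
    tasks = []
    for suite, files in suites_for_push(changed_files).items():
        task = dict(task_definitions[suite])  # shallow clone of the base task
        task["label"] = task["label"] + "-tv"
        task["verify-files"] = files          # the files this task verifies
        tasks.append(task)
    return tasks
```

The key design point is that each -tv task starts from the same definition (build type, platform, virtualization) as the suite it verifies, which removes the need to choose a single configuration for all of TV, or to split tests between TV and TVg.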
Comment 5 • 5 years ago
:gbrown, given the upcoming changes in test scheduling (test manifest level) as well as recent fixes to retain metadata while retriggering, do you think fixing some of the scheduling issues for test-verify is accurate in the coming months?
I would like to know that test-verify works for all our major test harnesses and configs and that it is scheduled properly. Maybe a stretch goal is to treat tests that do not pass test-verify as something we only run on m-c and not on try by default (i.e. lower value). I don't think we can consider something like that without knowing if test-verify is accurate.
Based on the dependencies:
https://bugzilla.mozilla.org/showdependencytree.cgi?id=1357513&hide_resolved=1
it looks like there is some work to do here but not a lot.
Comment 6 (Reporter) • 5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #5)
> :gbrown, given the upcoming changes in test scheduling (test manifest level) as well as recent fixes to retain metadata while retriggering, do you think fixing some of the scheduling issues for test-verify is accurate in the coming months?

Since https://bugzilla.mozilla.org/show_bug.cgi?id=1522113#c14, most of my TV scheduling concerns have been addressed. Do you have TV scheduling concerns? How would test-manifest-level scheduling affect TV scheduling?

> I would like to know that test-verify works for all our major test harnesses and configs and that it is scheduled properly.

test-verify supports wpt, mochitest (including subsuites, etc.), reftest/crashtest/jsreftest, and xpcshell; nothing else.

> Maybe a stretch goal is to treat tests that do not pass test-verify as something we only run on m-c and not on try by default (i.e. lower value). I don't think we can consider something like that without knowing if test-verify is accurate.

TV is intended as an early warning system that draws attention to test vulnerabilities which can lead to intermittent failures; it also provides a fast and convenient way to reproduce many intermittent failures. I don't think it is appropriate to modify test scheduling based on TV results; intermittent-failure history is a more direct, simple, and fair metric for such purposes. (This is part of why I keep saying that tier-1 TV should be a non-goal.)
I believe that TV is mostly accurate: it finds genuine vulnerabilities in tests, it reproduces the most frequent intermittent failures, and it very rarely fails without good reason. There is sometimes a perception that TV is not accurate because it reports failures in tests that rely on state established by other tests (e.g. tests that cannot run standalone cannot pass TV).

> Based on the dependencies:
> https://bugzilla.mozilla.org/showdependencytree.cgi?id=1357513&hide_resolved=1
> it looks like there is some work to do here but not a lot.

Bug dependencies here reflect a mixture of in-progress work and imminent plans that have been unexpectedly postponed. More extensive, longer-term plans for TV were proposed in planning documents in Q4 2018 and Q1 2019, and a trimmed-down version ("smart" TV) was proposed again recently, but none of these proposals have been supported. Given the ongoing lack of investment in TV, I am considering de-scheduling it entirely in 2020.