Open Bug 1708763 Opened 4 years ago Updated 2 years ago

run test-verify ("TV") tests as part of Phabricator but not on autoland

Categories

(Developer Infrastructure :: Source Code Analysis, defect, P3)

Tracking

(Not tracked)

People

(Reporter: aryx, Unassigned)

References

(Blocks 1 open bug)

Details

The Test-Verify tasks which run modified or added tests several times are set as tier-2 which don't have a backout as consequence if they fail. They run as tier-2 because they frequently fail after earlier changes to non-test files on which the test relies (e.g. frontend, storage systems).

New bugs created for test-verify failures over the last year: 381
40 got set to "fixed" but only 24 have an assignee (= had code landed, and some might have annotate the test to get skipped with test-verify).

There are actually 24 FIXED bugs with an assignee, the others e.g. got fixed by backout.

Test-Verify should stop running on autoland but for the pushes to Phabricator. Andrew, please add the blocker bug about warning for still running tasks when a developer tries to land a patch.

Flags: needinfo?(ahal)

Done, and this proposal makes sense to me.

Depends on: 1699137
Flags: needinfo?(ahal)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #0)

The Test-Verify tasks which run modified or added tests several times are set as tier-2 which don't have a backout as consequence if they fail. They run as tier-2 because they frequently fail after earlier changes to non-test files on which the test relies (e.g. frontend, storage systems).

They also frequently fail because as soon as one failure slips through, TV will forever be orange anytime that test is modified.

(In reply to Andrew Halberstadt [:ahal] from comment #2)
I actually think we are all saying the same thing, but to put it a third way: A TV failure on a push may or may not indicate that the push has introduced a test failure.

Despite the dismal stats from comment 0, I think there is value in TV, at least in that it draws attention to the reliability of recently modified tests. I don't mind it moving from autoland to phabricator, but hope it will be running on one or the other at any point in time.

will we want this on try server?
do we want this on all platforms and configs?

I want to make sure we think of the whole solution, not just moving to phabricator and let it be ignored. I am not aware of phabricator generating all the builds we need in order to run tests.

(In reply to Geoff Brown [:gbrown] from comment #3)

Despite the dismal stats from comment 0, I think there is value in TV, at least in that it draws attention to the reliability of recently modified tests. I don't mind it moving from autoland to phabricator, but hope it will be running on one or the other at any point in time.

Agreed! We even blocked this bug on implementing additional Phabricator checks to make reviewbot failures / pending tasks more noticeable. So I think we all agree there is value, it just doesn't necessarily make sense to run it on autoland.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #4)

will we want this on try server?
do we want this on all platforms and configs?

Good call! Yes we'd still want it in try, but removing it from central would mean requiring --full. So we'd need some way to add these tasks in to the default try set.

I want to make sure we think of the whole solution, not just moving to phabricator and let it be ignored. I am not aware of phabricator generating all the builds we need in order to run tests.

It has the ability to do this, though I'm unsure if it's doing it anywhere atm. I know awhile back there was a plan to run regular test tasks with reviewbot. Unsure if that fell through or not though.

(In reply to Andrew Halberstadt [:ahal] from comment #5)

I want to make sure we think of the whole solution, not just moving to phabricator and let it be ignored. I am not aware of phabricator generating all the builds we need in order to run tests.

It has the ability to do this, though I'm unsure if it's doing it anywhere atm. I know awhile back there was a plan to run regular test tasks with reviewbot. Unsure if that fell through or not though.

It does have the ability to do this, but might have bitrotten since I started that work (https://github.com/mozilla/code-review/projects/2).

The only drawback I see is the need to run builds on Phabricator revisions, which could increase the cost.

Note: if we end up deciding to accept the cost of the builds, we might as well also run some additional tests selected by test selection (maybe with a high confidence threshold), so we don't pay the price of the builds just for test verify.

Moving to Firefox Build System::Source Code Analysis because that's where reviewbot bugs live.

Component: Phabricator → Source Code Analysis
Product: Conduit → Firefox Build System

Other aspects that come to mind are:

  1. what if the policy around test verify was different? What if it was tier 1? Was this ever considered?

  2. what if we used the test-verify data as another data point to consider when identifying intermittents?

E.g. we could keep a list of tests that pass verfication and so are in a "consistent" state and tests that don't pass verification and so are in a "precarious" state.

Tests in a "precarious" state are more likely to intermittently fail. If a sheriff sees a failure in a test that is in a "consistent" state, they'll know it's likely to be a real failure. If they see a failure in a test that is in a "precarious" state, they'll know it's could possibly be an intermittent failure.

We could surface this data in treeherder, and even use it as part of mozci for regression identification.

If we did this, test-verify value wouldn't only be measured in how many test-verify failure bugs are fixed.

basically test-verify should be close to green with a very low rate of intermittent- otherwise the test itself should be skipped (as are hundreds in the manifests skip-if = verify).

The problem would be much simpler if we had a smaller list of configurations we would run on (i.e. linux64-debug/win10-debug), then we wouldn't have to account for android, osx, asan, tsan, ccov, etc.

(In reply to Marco Castelluccio [:marco] from comment #8)

  1. what if the policy around test verify was different? What if it was tier 1? Was this ever considered?
    We never checked the current state of every test using test-verify (as far as I know) and don't catch regressions if the test didn't get modified. Developers are looking for a workflow which provides confidence a patch will stick - backing changes out for regressions unrelated to their change sounds like a non-starter.
  1. what if we used the test-verify data as another data point to consider when identifying intermittents?

E.g. we could keep a list of tests that pass verfication and so are in a "consistent" state and tests that don't pass verification and so are in a "precarious" state.

test-verify is not often intermittent but the repeated or parallel execution causes the issues (that's why a change outside the test can also go undetected, e.g. a switch from synchronous to asynchronous behavior).

Sorry, my comment wasn't clear.
I'm not concerned with the intermittency of test-verify, I'm saying we could use test-verify data to identify intermittent failures in normal test suite runs.

Say you have a normal, non test-verify, test failure on a push.
It is more likely to be an intermittent failure, if it is a test that is known to fail in verification mode.
It is less likely to be an intermittent failure, if it is a test that is known to pass in verification mode.

Basically, the "verification status" of a test is a measure of its probability to be intermittent.

that was the original goal of test-verify, but in practice that is not what is happening :(

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #12)

that was the original goal of test-verify, but in practice that is not what is happening :(

Maybe because this information is not surfaced directly to sheriffs in the Treeherder UI?

I'm not sure that verify as a tool to make starring failures easier makes much sense. It doesn't run the tests in exactly the same way as a typical run and a result of this is that it can flag things that aren't problematic in normal cases ("this test always fails after a reload") or miss things that are problematic in normal cases ("this test can be affected by state that leaks from the previous test"). So although the amount of signal isn't zero, it's also not something I'd expect to make a big difference on top of the current system where "is there an open intermittent bug for this test" is the main heuristic.

Giving developers feedback from verify makes sense because they're often in a position to fix the underlying problems and so prevent the landing of new unstable tests (at the level verify can detect). I wonder if we could experiment with first running verify jobs on phabricator iff the change doesn't touch any compiled files so we could use an artifact build. That would exclude a lot of platform changes, but would provide a means to prove out the idea without comitting to phabricator running a full build on multiple platforms for all pushes. Doing more than that might look like having a way to reuse an existing try push with developer-scheduled jobs, perhaps.

The original goal of test-verify was to draw attention to problems in recently-modified tests which might lead to intermittent failures. TV was intended to say to developers, "Hey, since you are modifying this test, let's have a closer look at it to see if it could be better." Phabricator seems like a good place for that nudge, if can accept the multi-platform build cost.

TV failures indicate either that the test fails intermittently, or that the test relies on browser state in such a way that the test cannot be run standalone, or cannot be repeated; arguably, all of these conditions should be addressed to make "good" tests. But many existing tests do not meet all of these conditions, and we haven't ever required them to be met.

I have sympathy for extended goals for TV, like using it in a tier-1 fashion, but I don't know that those goals are realistic, certainly not without significant investment. At any rate, those goals are beyond the scope of this bug.

(In reply to James Graham [:jgraham] from comment #14)

I'm not sure that verify as a tool to make starring failures easier makes much sense. It doesn't run the tests in exactly the same way as a typical run and a result of this is that it can flag things that aren't problematic in normal cases ("this test always fails after a reload") or miss things that are problematic in normal cases ("this test can be affected by state that leaks from the previous test"). So although the amount of signal isn't zero, it's also not something I'd expect to make a big difference on top of the current system where "is there an open intermittent bug for this test" is the main heuristic.

We could try collecting some data to see if there is a correlation between verification status and intermittency, so we can tell for sure whether it could be useful or not for this purpose.

could we just look at existing skip-if commands that have both a verify and a !verify condition on the same platform?

We could try collecting some data to see if there is a correlation between verification status and intermittency, so we can tell for sure whether it could be useful or not for this purpose.

Yes, although the question wouldn't be "does the verify result correlate with the test being observed as intermittent"; I'd be very surprised if it didn't, even though as mentioned there will be false positives and negatives. The question is "does observing a failure in a test that previously failed verify allow more accurate classification decisions than just using retriggers+previously filed intermittents".

On the topic of running this in Phabricator, I think that fundamentally givign developers feedback about flakiness at the point when they're editing a test anyway, before it lands, is just more useful than either a) improving the ability to classify failures as intermittent or b) telling them in retrospect once the patch is "done". Having fewer intermittents is more helpful overall than just being good at knowing which failures are intermittent, and telling people to take action before a patch lands is more effective than asking them to go back and look again at "finished" patches.

Severity: -- → S3
Priority: -- → P3
Product: Firefox Build System → Developer Infrastructure
You need to log in before you can comment on or make changes to this bug.