Bug 918922 (Open)

Retry failing tests in Reftest

Categories: Testing :: Reftest (defect)

Tracking: Not tracked

People: Reporter: ted; Assignee: Unassigned

References: Blocks 1 open bug

Details

In bug 906510, as part of the "parallel xpcshell tests" work, mihneadb made the xpcshell harness retry tests that fail, and only count them as failures if they fail again. I think we should expand this to Reftest as well, to help with our intermittent failure problem.
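
For illustration, here is a minimal sketch of the retry-then-reclassify idea described above. It is not the actual xpcshell or reftest harness code; the "run-one-test" command and the test file names are made up:

import subprocess

def run_test(test_path):
    # Run a single test via a placeholder command; treat exit code 0 as a pass.
    result = subprocess.run(["run-one-test", test_path])
    return result.returncode == 0

def run_suite(tests):
    # First pass: run everything once.
    first_failures = [t for t in tests if not run_test(t)]
    # Second pass: rerun only the tests that failed, and count a test as a
    # failure only if it fails again.
    confirmed_failures = [t for t in first_failures if not run_test(t)]
    intermittent = sorted(set(first_failures) - set(confirmed_failures))
    return confirmed_failures, intermittent

if __name__ == "__main__":
    failures, flaky = run_suite(["test_a.html", "test_b.html"])
    print("confirmed failures:", failures)
    print("passed on retry (would have been intermittent oranges):", flaky)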

I strongly believe we should NOT do this.  If tests are failing intermittently, there's a bug.
I'd also note that we sometimes have to write tests that aren't guaranteed to fail every time if the bug is present, but they'll fail sometimes if the bug is present.  So the changes that you've already made to xpcshell probably disabled some tests, and I think should have been discussed more widely.


Also, on a more philosophical note -- I think whatever the pass/fail threshold is, you'll have bugs where tests intermittently cross that threshold.  I don't think the answer to the problem is to move the line.

(In reply to David Baron [:dbaron] (needinfo? me) from comment #2)
> I'd also note that we sometimes have to write tests that aren't guaranteed
> to fail every time if the bug is present, but they'll fail sometimes if the
> bug is present.  So the changes that you've already made to xpcshell
> probably disabled some tests, and I think should have been discussed more
> widely.

I don't understand this. Can you explain this further?

> Also, on a more philosophical note -- I think whatever the pass/fail
> threshold is, you'll have bugs where tests intermittently cross that
> threshold.  I don't think the answer to the problem is to move the line.

Our current solution for most harnesses is "manually click through some things on TBPL". Involving a human in the loop sucks. I think you may be right that this is not a great fit for Reftest, since we already have the "random/random-if" qualifier, so we can mark known-flaky tests in the manifest itself. We don't have that sort of support in other harnesses, so we simply live with flaky tests.
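
For reference, manifest annotations along the lines mentioned above look roughly like this (the test file names here are made up; "winWidget" is one of the usual platform conditions):

# treat mismatches as known-random on every platform
random == flaky-everywhere.html flaky-everywhere-ref.html
# treat mismatches as known-random only on Windows
random-if(winWidget) == flaky-on-windows.html flaky-on-windows-ref.html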

I strongly believe that our current system isn't working well. We have lots of intermittent failures that nobody is looking at, so those tests are providing very little value.

(In reply to David Baron [:dbaron] (needinfo? me) from comment #2)
> I'd also note that we sometimes have to write tests that aren't guaranteed
> to fail every time if the bug is present, but they'll fail sometimes if the
> bug is present.  So the changes that you've already made to xpcshell
> probably disabled some tests, and I think should have been discussed more
> widely.
> 
> 
> Also, on a more philosophical note -- I think whatever the pass/fail
> threshold is, you'll have bugs where tests intermittently cross that
> threshold.  I don't think the answer to the problem is to move the line.

In general I tend to agree. Though we've seen cases where the intermittent failure has nothing to do with the test, and where no one knows how to fix it (these types of intermittents have happened a few times for B2G where the problem was somewhere in the AOSP emulator or similar). For these types of failures our options are limited to "ignore them indefinitely", "turn off the harness", or "implement something like this".

I don't think there really is a good solution for these cases. Maybe if we had good statistics on whether an orange is a true intermittent or just a blip in the underlying infrastructure, we'd be able to make smarter decisions.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3)
> (In reply to David Baron [:dbaron] (needinfo? me) from comment #2)
> > I'd also note that we sometimes have to write tests that aren't guaranteed
> > to fail every time if the bug is present, but they'll fail sometimes if the
> > bug is present.  So the changes that you've already made to xpcshell
> > probably disabled some tests, and I think should have been discussed more
> > widely.
> 
> I don't understand this. Can you explain this further?

What's unclear?

> > Also, on a more philosophical note -- I think whatever the pass/fail
> > threshold is, you'll have bugs where tests intermittently cross that
> > threshold.  I don't think the answer to the problem is to move the line.
> 
> Our current solution for most harnesses is "manually click through some
> things on TBPL". Involving a human in the loop sucks. I think you may be
> right that this is not a great fit for Reftest, since we already have the
> "random/random-if" qualifier, so we can mark known-flaky tests in the
> manifest itself. We don't have that sort of support in other harnesses, so
> we simply live with flaky tests.

We're also not very good about finding regression ranges.  There was a recent example where a bunch of style system mochitests started failing (timing out?) because of a JS JIT change, and the only reason we found the regression range at all was because I was persistent about asking one of the sheriffs to bisect.

The point of having test coverage is to catch regressions.  It catches regressions.  Sometimes people are slow to identify the source of those regressions, or to fix them.

This change would just make it so that we don't catch many of the regressions that we're currently catching.

> I strongly believe that our current system isn't working well. We have lots
> of intermittent failures that nobody is looking at, so those tests are
> providing very little value.

I disagree with that.  The number of tests has been increasing, and the number of intermittent failures per test run has not been increasing proportionally (particularly on the established platforms), which means we have a rapidly increasing number of stable tests.

I think we need stronger owners for bringing up tests on new platforms, and we need to put more resources into both finding the sources of regressions that lead to our intermittent oranges and fixing those regressions.

But I think changing the orange/green line will either (a) provide only temporary rather than permanent relief and/or (b) improve the intermittent orange situation only by worsening our rate of detecting regressions.

(In reply to David Baron [:dbaron] (needinfo? me) from comment #5)
> We're also not very good about finding regression ranges.  There was a
> recent example where a bunch of style system mochitests started failing
> (timing out?) because of a JS JIT change, and the only reason we found the
> regression range at all was because I was persistent about asking one of the
> sheriffs to bisect.

I think that somewhat highlights the problem: devs/test creators don't see their tests as something they should own. The sheriffs will attempt to find regression ranges for intermittent failures where possible, but with a <1% occurrence rate it's sometimes just not feasible (or fair to expect us to; digging quite that deep isn't really in our remit. I'm happy for that to change, but someone will have to decide what else we drop, since we already regularly work all hours of the day as it is). Yes, we need better tooling, but more importantly we need people to actually start caring about intermittent failures and managers to prioritise allocating resources appropriately :-)

Regarding tools, I think it would help if we had:
 * the ability to run a new reftest against an existing build without triggering a new compilation cycle
 * the ability to rerun a particular reftest (not the whole suite) a large number of times (see the sketch below)
(and similar for mochitests, etc.)
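
As a rough sketch of that second tool, assuming a local build already exists and that "./mach reftest <path>" runs just that test (the exact command and the reliance on its exit code are assumptions here; a real tool would parse the harness log instead):

import subprocess
import sys

def rerun_reftest(test_path, iterations=100):
    # Rerun one reftest many times and count how often it fails.
    failures = 0
    for i in range(iterations):
        result = subprocess.run(["./mach", "reftest", test_path],
                                capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            print("iteration %d failed" % i)
    print("%d failures out of %d runs" % (failures, iterations))
    return failures

if __name__ == "__main__":
    test = sys.argv[1]
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 100
    rerun_reftest(test, count)
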
Blocks: 996504
Severity: normal → S3