Open Bug 657738 Opened 13 years ago Updated 2 years ago

Help detect orange randomness

Categories

(Testing :: Reftest, defect)


Tracking

(Not tracked)

People

(Reporter: glandium, Unassigned)

References

(Blocks 1 open bug)

Details

Copy/pasted from http://glandium.org/blog/?p=1998:
I think we lack one important piece of information when we have a test failure: does it reliably happen with a given build? Chances are that most random oranges don't (like the two I mentioned further above), but those that do may point out subtle problems of compiler optimizations breaking some of our assumptions (though so far, most of the time, they just turn into permanent oranges). The self-serve API does help in that regard, allowing us to re-trigger a given test suite on the same build, but I think we should enhance our test harnesses to automatically retry failing tests.

We should do that on all our harnesses (xpcshell, reftest, make check).
We'd probably need to add something to this to flag known random oranges as tests that may fail randomly, and possibly set a threshold (e.g. a test that fails up to 1 in 25 runs is tolerated; more often than that, we should report failure).  Known random oranges would then avoid turning the tree orange...
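To make that threshold idea concrete, here is a minimal Python sketch; the annotation table, the example test path, and the function name are all hypothetical, not part of any existing harness:

```python
KNOWN_RANDOM = {
    # hypothetical test path -> highest tolerated failure rate for that test
    "layout/reftests/bugs/example-1.html": 1.0 / 25,
}

def classify_failure(test_path, observed_failures, observed_runs):
    """Return 'orange' if the failure should be reported on the tree,
    or 'known-random' if the test is annotated and below its threshold."""
    threshold = KNOWN_RANDOM.get(test_path)
    if threshold is None:
        return "orange"              # not annotated: always report
    rate = float(observed_failures) / observed_runs
    if rate > threshold:
        return "orange"              # failing more often than tolerated
    return "known-random"            # annotated intermittent, within budget
```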
We have data on this in the OrangeFactor web tool <http://brasstacks.mozilla.com/orangefactor/>, so I'm not sure if we need to modify the harness to get that data (in fact, I'm pretty sure that we don't!).

I don't get the rest of your proposal: what should we do with that data?  We don't have a reliable way to determine failure types for all of our intermittent oranges, and associating failures with test files is not correct.  I like the general idea, but I think it should be way more concrete before we can do anything useful with it.
OrangeFactor / Bugzilla + TBPL's tools are helpful for tests that have been intermittently failing for a while.

But we're always going to have new tests added to the tree (one hopes ;), and new intermittent oranges (either from new tests, or existing tests that start to go intermittent for one of a variety of reasons). Rerunning a failed test would be helpful the first few times something goes wonky.

A concrete example: A change is pushed, and a new test goes orange. (Unexpectedly, since I have _of course_ passed on try!) I could immediately back out, but if I can't reproduce it what then? Try relanding later, and hope that I luck out and miss the new orange?

If the test boxes immediately reran the specific failed test a couple times, it would help people watching the tree to know that either (1) the test is intermittent, and someone needs to start looking at why or (2) the test is _not_ intermittent, and the committer should back out (or otherwise close the tree).

[I wouldn't undersell #2 -- people don't like to think their change could have caused the problem, and having immediate data that it failed 3x would dash any hope that it might go green on the next cycle.]
Just a thought: why don't we also flag our known intermittent oranges in the test suites that allow such flagging (reftest and mochitest), such that instead of TEST-UNEXPECTED-FAIL, we'd get TEST-INTERMITTENT-FAIL + a bug number?

This would:
- help tbpl in all the cases where it can't find the corresponding bug (and seeing how many times I had to do some bugzilla searching or tbpl digging to find a corresponding bug, that'd be a clear win)
- help people who hit these intermittent failures on their local builds

The downside is that we may possibly be getting a different failure from the one that is already known.
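A rough Python sketch of how a harness could pick the log line based on such a per-test annotation; the annotation table, test path, bug number, and function names are made up for illustration. Note that the flagged case is still reported as a failure; the tag only makes it easier to match against a known bug.

```python
import logging

INTERMITTENT_BUGS = {
    # hypothetical test path -> bug number of a known intermittent failure
    "dom/tests/mochitest/general/test_example.html": 654321,
}

def report_failure(log, test_path, message):
    bug = INTERMITTENT_BUGS.get(test_path)
    if bug is not None:
        # Still a failure, but tagged so tbpl (and people) can match it to a bug.
        log.error("TEST-INTERMITTENT-FAIL | %s | %s (bug %d)"
                  % (test_path, message, bug))
    else:
        log.error("TEST-UNEXPECTED-FAIL | %s | %s" % (test_path, message))

# Example use, with a plain Python logger standing in for the harness log:
report_failure(logging.getLogger("harness"),
               "dom/tests/mochitest/general/test_example.html",
               "timed out waiting for onload")
```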
Used in conjunction with this approach, I think it's a great idea (if a test is flagged as intermittent but the failure isn't actually the known intermittent one, it should still fail).
(In reply to comment #3)
> OrangeFactor / Bugzilla + TBPL's tools are helpful for tests that have been
> intermittently failing for a while.
> 
> But we're always going to have new tests added to the tree (one hopes ;), and
> new intermittent oranges (either from new tests, or existing tests that start to
> go intermittent for one of a variety of reasons). Rerunning a failed test would
> be helpful the first few times something goes wonky.
> 
> A concrete example: A change is pushed, and a new test goes orange.
> (Unexpectedly, since I have _of course_ passed on try!) I could immediately
> back out, but if I can't reproduce it what then? Try relanding later, and hope
> that I luck out and miss the new orange?
> 
> If the test boxes immediately reran the specific failed test a couple times, it
> would help people watching the tree to know that either (1) the test is
> intermittent, and someone needs to start looking at why or (2) the test is
> _not_ intermittent, and the committer should back out (or otherwise close the
> tree).
> 
> [I wouldn't undersell #2 -- people don't like to think their change could have
> caused the problem, and having immediate data that it failed 3x would dash any
> hope that it might go green on the next cycle.]

OK, this proposal makes sense to me.  The only problem with it is that I don't necessarily think that we need additional test runs for intermittent oranges that TBPL knows about (which are the most common type of intermittent oranges).  I think for those cases, rerunning the tests just wastes everyone's time.

Also, right now, we're almost 99% there, since we're already able to rerun a test job from TBPL.  I use this very technique quite often when I see a new orange.  So I guess this proposal is about making it automated, right?
(In reply to comment #4)
> The downside is that possibly, we may be getting a different failure from the
> one that is already known.

Which is pretty serious!

At least on a few occasions, I've come across test failures on my own patches which looked the same as an intermittent orange.  In one case my patch made an existing intermittent orange much worse (nearly perma-orange); in another case I mistakenly pushed the patch to m-c without realizing that I was actually seeing a perma-orange; and in the rest of the cases, my patch was triggering similar failures for very different reasons (bugs in my patch).

This proposal makes detecting these cases before pushing to m-c nearly impossible.
(In reply to comment #7)
> This proposal makes detecting these cases before pushing to m-c nearly
> impossible.

I'm not saying they shouldn't be orange. I'm saying it might help to have them tagged somehow.
I hinted at this a year ago in a blog post:
https://elvis314.wordpress.com/2010/07/05/improving-personal-hygiene-by-adjusting-mochitests/

Basically, if we had metadata for each test we could know its history and report a failure as an orange or some other color.  This metadata could come from a webservice (think: querying the OrangeFactor database) to determine whether this is a known failure on that given platform.
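A hedged sketch of what such a webservice lookup could look like; the URL, query parameters, and response shape below are assumptions for illustration, not the actual OrangeFactor API:

```python
import json
import urllib.parse
import urllib.request

def known_failure_count(test_path, platform,
                        base_url="https://example.invalid/orangefactor/query"):
    """Ask a (hypothetical) web service how many times this test has already
    failed on this platform; returns 0 if it has no record of it."""
    params = urllib.parse.urlencode({"test": test_path, "platform": platform})
    with urllib.request.urlopen("%s?%s" % (base_url, params)) as response:
        data = json.load(response)
    return data.get("failure_count", 0)

def classify(test_path, platform):
    # Known failures could be reported in "some other color"; new ones stay orange.
    if known_failure_count(test_path, platform) > 0:
        return "known-intermittent"
    return "orange"
```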

In the past I have had test harnesses rerun a single test case (not the whole test suite) when it fails, to verify that the failure reproduces.  I saw about 75% of the noise removed from my automation by doing that.  Almost every test file runs in seconds, and having the harness rerun it to verify it fails could save us a lot of time.  Having to rerun the whole test suite would be time consuming and add more burden to our already backlogged machine pool.
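A minimal sketch of that rerun-on-failure idea; `run_single_test` is a stand-in for whatever entry point the harness uses to execute one test file, not a real function:

```python
def verify_failure(run_single_test, test_path, retries=2):
    """Rerun a failed test file up to `retries` extra times.
    Returns 'intermittent' if any rerun passes, 'reproducible' otherwise."""
    for _ in range(retries):
        if run_single_test(test_path):   # assume True means the rerun passed
            return "intermittent"
    return "reproducible"
```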
Blocks: 996504
Severity: normal → S3