Closed Bug 906510 (opened 11 years ago; closed 11 years ago)

Enhance parallel XPCShell harness to retry failed tests sequentially

Category: Testing :: XPCShell Harness
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla26
People: Reporter: mihneadb, Assigned: mihneadb
References: Blocks 1 open bug
Attachments: 1 file, 5 obsolete files

Working on bug 898658, I found that I ended up with ~200 tests marked to run sequentially just because I saw them fail once or twice across tens of try X test runs. This will not scale well.

I suggest a new approach: we make the harness smarter (some might say "more complex") so that instead of marking flaky tests as "run-sequentially", we watch for failures in the parallel run and then rerun those failures sequentially at the end of the run.
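
Roughly, the idea looks like this (a minimal, self-contained sketch, not the actual patch; run_one() is a made-up stand-in for the real per-test runner):

    from concurrent.futures import ThreadPoolExecutor

    def run_all(tests, run_one, max_workers=4):
        failed = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # First pass: everything in parallel; collect the failures.
            for test, passed in zip(tests, pool.map(run_one, tests)):
                if not passed:
                    failed.append(test)  # mark for a sequential retry
        # Second pass: rerun each failure alone, so failures caused only by
        # parallel execution (resource contention, shared state) drop out.
        still_failing = [t for t in failed if not run_one(t)]
        return failed, still_failing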

I did "some" try runs [1] and found this works pretty well. As you can see from the logs, on average there's not many tests that get rerun and they don't take long. I also talked a bit to philor about this in #ateam and (short version [2]) he supports this approach.


The only platform that is really flaky for parxpc (parallel xpcshell) and might not necessarily benefit is Win8, from what I can see in the try runs. We could keep the current way of running tests on Win8 by adding a check against mozinfo (sketched below, after the links).

[1] https://tbpl.mozilla.org/?tree=Try&rev=6c50d2ebcada
[2] http://logbot.glob.com.au/?c=mozilla%23ateam#c661779
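
For reference, the mozinfo check could look roughly like this (a sketch; the Win8 os_version value, "6.2", is my assumption):

    import mozinfo

    def retries_enabled():
        # Keep the old fully-sequential behavior on Win8 instead of retrying.
        if mozinfo.os == "win" and mozinfo.os_version == "6.2":
            return False
        return True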
Blocks: 660788
Attached patch safetynet (obsolete) (deleted) — Splinter Review
This is the patch I used for the try runs. We could probably make it cleaner, but I wanted to get a proof of concept up quickly so I could do the test runs.

I'd love some feedback on:

1. should we pursue this approach (it seems to me the only reasonable / fast way to get this rolling into production asap)
2. [if 1. is yes] how should I structure this patch (I'm not sure if there's a nicer approach than the early returns)

Thanks
Assignee: nobody → mihneadb
Attachment #791879 - Flags: feedback?(ted)
My only concern is that by doing this we're wallpapering over real test problems that need fixing. As we know from hiding/skipping tests, obscuring a test's problems tends to /dev/null the issue in the minds of any interested parties.

But I also know that getting flaky tests fixed is a much longer-term issue and I wouldn't want to see this work (and the benefits it provides) held up in the meantime. Can we at least write a list of retried tests to the log so that they're easily identifiable? That way, we can at least work towards eliminating them and hopefully remove this workaround in the long run.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2)
> My only concern is that by doing this we're wallpapering over real test
> problems that need fixing. As we know from hiding/skipping tests,
> obscuring a test's problems tends to /dev/null the issue in the minds of
> any interested parties.
> 
> But I also know that getting flaky tests fixed is a much longer-term issue
> and I wouldn't want to see this work (and the benefits it provides) held
> up in the meantime. Can we at least write a list of retried tests to the
> log so that they're easily identifiable? That way, we can at least work
> towards eliminating them and hopefully remove this workaround in the long
> run.

Yes, absolutely, we would be logging them.
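
Something along these lines (a sketch; assumes each test object is a dict that carries its manifest path under "path"):

    def log_retried_tests(log, try_again_list):
        # Surface the retried tests in the log so flaky tests stay visible
        # instead of being silently wallpapered over.
        if not try_again_list:
            return
        log.info("The following tests failed when run in parallel and were retried:")
        for test in try_again_list:
            log.info("  %s" % test.get("path", test))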
Attached patch Retry tests that timed out when run in parallel (obsolete) (deleted) — Splinter Review
It might be enough to just retry the tests that time out. I'm not sure; doing
a try run to confirm.

https://tbpl.mozilla.org/?tree=Try&rev=9363edb097b1
Comment on attachment 792640 [details] [diff] [review]
Retry tests that timed out when run in parallel

Ok, this is clearly not enough. :(
Attachment #792640 - Attachment is obsolete: true
Attached patch Rerun tests that fail when run in parallel. (obsolete) (deleted) — Splinter Review
Cleaned the code up a bit and improved logging of reruns.

https://tbpl.mozilla.org/?tree=Try&rev=2bb0d6ce7b96
Attachment #791879 - Attachment is obsolete: true
Attachment #791879 - Flags: feedback?(ted)
Attachment #792920 - Flags: feedback?(ted)
It would be great to come up with a way to tie the retry logs into orangefactor, but I don't think we currently have support for uploading anything from tbpl to somewhere on the internet. Could be a follow-up.
Note that bug 749421 just went live on Cedar.
Great, I'll file a follow-up bug after this lands.
Comment on attachment 792920 [details] [diff] [review]
Rerun tests that fail when run in parallel.

Review of attachment 792920 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is a net positive. We should probably do this in our other harnesses as well to reduce the amount of intermittent orange (but ideally we'd have a way to list the tests that are intermittently failing).

::: testing/xpcshell/runxpcshelltests.py
@@ +79,5 @@
>          self.test_object = test_object
>          self.cleanup_dir_list = cleanup_dir_list
> +        self.try_again_list = try_again_list
> +        self.first_try = first_try
> +        self.retry = False

It feels like you've got too many variables here. I'd probably just make retry a list, and append its contents to try_again_list in the main runTests function.

@@ +129,5 @@
>          self.event.set()
>  
> +    def mark_for_retry(self):
> +        self.try_again_list.append(self.test_object)
> +        self.retry = True

So this could just be:
self.retry.append(self.test_object)

@@ +575,5 @@
> +
> +                if self.first_try:
> +                    self.mark_for_retry()
> +                    self.clean_temp_dirs(name, stdout)
> +                    return

All these repeated blocks are kind of ugly. Can you just set retry = True here, and then at the bottom of this method do:
if retry:
  self.mark_for_retry()
?
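
In other words, something like this shape for run_test (a simplified sketch with hypothetical helper names; the real method is much larger):

    def run_test(self):
        retry = False
        if self.timed_out():
            self.report_timeout()
            retry = self.first_try   # only retry failures from the first pass
        elif not self.passed():
            self.report_failure()
            retry = self.first_try
        # Single exit point: one place to mark the test and clean up.
        if retry:
            self.mark_for_retry()
        self.clean_temp_dirs(self.test_name, self.stdout)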

@@ +1302,5 @@
> +                    verbose=verbose, pStdout=pStdout, pStderr=pStderr,
> +                    keep_going=keepGoing, log=self.log, mobileArgs=mobileArgs,
> +                    first_try=False, **kwargs)
> +            test.start()
> +            test.join()

I'd like to see this factored into a function to share with the other call to testClass above.
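
For example (a sketch following the names in the quoted diff; hypothetical helper, argument plumbing abbreviated):

    def make_test_runner(self, test_object, first_try, **kwargs):
        # One place to build a testClass instance, shared by the parallel
        # pass and the sequential retry pass so the call sites can't drift.
        return self.testClass(test_object,
                              verbose=self.verbose,
                              log=self.log,
                              mobileArgs=self.mobileArgs,
                              first_try=first_try,
                              **kwargs)

    # parallel pass: test = self.make_test_runner(test_object, first_try=True)
    # retry pass:    test = self.make_test_runner(test_object, first_try=False)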
Attachment #792920 - Flags: feedback?(ted) → feedback+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I think this is a net positive. We should probably do this in our other
> harnesses as well to reduce the amount of intermittent orange (but ideally
> we'd have a way to list the tests that are intermittently failing).

And ideally the frequency of tests in this list would be hooked up directly to orangefactor. Otherwise, doing this on a larger scale would make orangefactor more or less useless.
While that's true, orange factor exists as a measure of the pain of sheriffing the tree. If we're retrying failures and presenting a green run for tests that don't fail repeatably, we're removing some of that pain, which makes it less important.
Attached patch Rerun tests that fail when run in parallel. (obsolete) (deleted) — Splinter Review
Cleaned up, feedback welcome.
Attachment #792920 - Attachment is obsolete: true
Attached patch Rerun tests that fail when run in parallel. (obsolete) (deleted) — Splinter Review
Attachment #796169 - Attachment is obsolete: true
Comment on attachment 796236 [details] [diff] [review]
Rerun tests that fail when run in parallel.

https://tbpl.mozilla.org/?tree=Try&rev=6c43ba6cfd2b

I'm not sure about refactoring that init call; could you please explain what you meant?

Not sure how I could make this cleaner.
Attachment #796236 - Flags: feedback?(ted)
Rebased on latest m-c changes. run_test already has ~3 exit points because
of the way we handle hangs, test failures and crashes, so the patch has to
affect all those.

Changing to r?.
Attachment #796809 - Flags: review?(ted)
Attachment #796236 - Attachment is obsolete: true
Attachment #796236 - Flags: feedback?(ted)
Comment on attachment 796809 [details] [diff] [review]
Rerun tests that fail when run in parallel.

Review of attachment 796809 [details] [diff] [review]:
-----------------------------------------------------------------

Looks a lot better, thanks!
Attachment #796809 - Flags: review?(ted) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/fa1e79afefca
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
I think this change should be reverted.  Tests were written with a particular pass/fail threshold in mind,  and moving the threshold might change whether they're still testing what they're intended to test.  For example, a test might have been written that would pass all the time without a bug, but fail intermittently when the bug is present.  (I've definitely written tests of that form, and I suspect others have as well.)

Moving the line might relieve the pain temporarily, but it won't solve the problem that there will always be some bugs (whether bugs in the code or the tests) that cause tests to intermittently cross the pass/fail line.

The point of running tests is to prevent regressing the things that they're testing.  The way to deal with an intermittent failure problem is to put resources into fixing the tests that fail intermittently (particularly into finding the changes that caused them to start failing intermittently).
David: Correct me if I'm wrong, but your complaint has nothing to do with the fact we re-run tests that failed but rather that we report a failed-then-passed test as passed, right? Presumably, we could re-run tests all we want and as long as the output of the job is "proper" then it doesn't matter what the harness does behind the scenes.

Perhaps we should add a new color to differentiate between "always failed" and "intermittently failed"?
(In reply to Gregory Szorc [:gps] from comment #21)
> David: Correct me if I'm wrong, but your complaint has nothing to do with
> the fact we re-run tests that failed but rather that we report a
> failed-then-passed test as passed, right?

Right.

> Presumably, we could re-run tests
> all we want and as long as the output of the job is "proper" then it doesn't
> matter what the harness does behind the scenes.
> 
> Perhaps we should add a new color to differentiate between "always failed"
> and "intermittently failed"?

I think it should be orange, just like other failed tests.

(Also see the somewhat more extended discussion in bug 918922.)
Blocks: 996504