Closed Bug 1612549 Opened 5 years ago Closed 2 years ago

Automatically retry failing tests

Categories

(Testing :: General, enhancement, P3)

Version 3

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: marco, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Once the smart scheduling work is done, it will unlock a few nice enhancements. For example: automatically retrying failing tests to check whether they are intermittent.

NOTE: Doing this will in turn also improve the data we have and thus the results of the ML algorithm.

Priority: -- → P3

We can schedule these tasks with lower priority.

I think this is a great idea. There are a few ways to tackle this:

  1. retry inside the job: this saves a lot of resources, but it won't help if the machine is bad or there are infra issues
  2. retry the same job with the same set of tests
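As a minimal sketch of approach 1 (retry inside the job), the logic could look like the following. This is illustrative only; `run_test`, `classify_failure`, and the result labels are hypothetical names, not an existing Taskcluster or Treeherder API:

```python
# Hypothetical in-job retry sketch: re-run a failing test within the
# same job to decide whether the failure is intermittent.

def classify_failure(run_test, max_retries=3):
    """Run `run_test` (a callable returning True on pass) once; on
    failure, retry up to `max_retries` times inside the same job.

    Returns "pass", "intermittent" (failed once but passed on a
    retry), or "failure" (failed on every attempt).
    """
    if run_test():
        return "pass"
    for _ in range(max_retries):
        if run_test():
            return "intermittent"
    return "failure"


# Example: a test that fails twice and then passes is flagged
# as intermittent rather than a real failure.
attempts = iter([False, False, True])
print(classify_failure(lambda: next(attempts)))  # intermittent
```

Note the caveat from point 1 above: because the retries run on the same machine, this cannot distinguish an intermittent test from a bad machine or an infra issue, which is why approach 2 (a separate job) is still needed in those cases.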

The risk here is that if we accidentally trigger retries for most if not all intermittent test failures, we could end up scheduling many extra jobs, which would be costly on platforms like OSX and Android hardware.

As this is for sheriffs, we should focus on trees like autoland and m-c, and keep this to windows10-64 and linux64 only. I would hesitate to put restrictions on the number of retriggers per revision, because then the sheriffs won't know what has been retriggered and what hasn't; that adds more work for the sheriffs, effectively cancelling out the savings from retriggering.

On the point of time savings vs. added time: this shifts when the extra data becomes available. Sheriffs are looking at failures within 30 minutes, and in most if not all cases they have made decisions and added annotations by then. In some cases they need to wait for retriggers/backfills/future data, and that can take a few hours. So any work we do here needs to account for the sheriff workflow and how we would present the data to them via Treeherder.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → MOVED