Closed Bug 648734 Opened 14 years ago Closed 13 years ago

Effect of measurement fluctuations on Talos should be minimized

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ecfbugzilla, Unassigned)

References

()

Details

It seems that in the performance test run from April 2nd, the results for about 20% of the add-ons were significantly skewed by performance fluctuations on the Talos machines (noticeable by unusually high standard deviation). The effect on the overall score of the add-ons in question was in the range of 2-4% of the browser start-up time. This in itself is already a significant value, given that add-on authors are supposed to target less than a 5% slowdown. However, the probability of such errors is far too high, so errors will inevitably accumulate in some cases and inflate the score even more (the current data suggests that measurement errors adding 10% to an extension's score are anything but unrealistic). There is also the suspicion that the probability of such errors increases with the load on the Talos machines.

So far I've seen two scenarios.

1) A few significant outliers:
http://tinderbox.mozilla.org/showlog.cgi?log=AddonTester/1301788791.1301789735.5361.gz
http://tinderbox.mozilla.org/showlog.cgi?log=AddonTester/1301754681.1301755144.13607.gz

The current approach to dealing with outliers is to remove the highest result - which usually removes the first measurement, which is always significantly higher (no idea why, given that this is actually the *second* run with this profile). This doesn't help much if the data contains more than one outlier, as in the results linked above. I would suggest removing up to five measurements if they look like outliers. The algorithm could be:

* Sort the results
* Calculate the average and standard deviation for the lowest 15 results
* Remove any of the highest 5 results that deviate from the average by more than twice (?) the standard deviation
* Calculate the average of the remaining results and return it as the final result

2) Normal results when the test starts, but a significant increase and strong variation towards the end of the test:
http://tinderbox.mozilla.org/showlog.cgi?log=AddonTester/1301788492.1301789407.4034.gz
http://tinderbox.mozilla.org/showlog.cgi?log=AddonTester/1301787396.1301788318.647.gz
http://tinderbox.mozilla.org/showlog.cgi?log=AddonTester/1301789550.1301790483.7450.gz

The results I linked to are all from the same Talos machine; however, the same machine tested another add-on earlier without any abnormalities. So I don't have an explanation for this strange behavior, and I think it is worth investigating (it should be possible to find out what other jobs were running on the same machine at the time). If the investigation fails, or if the source of such errors cannot be eliminated, there is still the option of detecting suspicious test runs (the standard deviation for normal test runs stays below 10ms; the test runs above were at 70ms and more) and re-running the test on a different machine. It *could* be that the add-on behaves non-deterministically, but it is better to verify that it does and that the problem isn't the machine it is being tested on.
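The four steps proposed above could be sketched roughly as follows. This is an illustrative sketch, not actual Talos code; the function name and parameters (`keep`, `threshold`) are invented for the example, and it assumes a run of 20 measurements so that the lowest 15 form the baseline and the highest 5 are the removal candidates:

```python
from statistics import mean, stdev


def prune_outliers(results, keep=15, threshold=2.0):
    """Average a run of measurements after pruning high outliers.

    Hypothetical sketch of the algorithm proposed in comment 0:
    the lowest `keep` results define a baseline mean and standard
    deviation; any higher result deviating from that mean by more
    than `threshold` standard deviations is discarded.
    """
    ordered = sorted(results)
    baseline = ordered[:keep]
    avg = mean(baseline)
    sd = stdev(baseline)
    # Keep a high result only if it stays within `threshold` standard
    # deviations of the baseline average.
    kept = baseline + [r for r in ordered[keep:] if abs(r - avg) <= threshold * sd]
    return mean(kept)
```

Unlike the current remove-the-single-highest-result rule, this would also handle runs with several outliers, at the cost of assuming the lowest 15 measurements are "clean".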
Outlier-pruning is notoriously difficult. I'd recommend switching to median and quartiles rather than trying to bandage up anything based on the mean.
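For comparison, a median/quartile-based summary (standard robust statistics using Tukey's fences; not anything Talos actually implements, and the function name and `k` parameter are invented for the example) could look like:

```python
from statistics import median


def robust_summary(results, k=1.5):
    """Summarize measurements via the median, after discarding points
    outside k * IQR of the quartiles (Tukey's fences).

    Illustrative sketch only; uses the simple split-halves quartile
    definition rather than any particular interpolation scheme.
    """
    ordered = sorted(results)
    n = len(ordered)
    q1 = median(ordered[: n // 2])          # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    kept = [r for r in ordered if lo <= r <= hi]
    return median(kept)
```

The appeal is that the median and quartiles are insensitive to a few wild values, so no tuning of "how many outliers to remove" is needed.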
This is a difficult problem that we have across all Talos testing. Usually if a machine is giving suspicious results it is sidebarred for re-imaging - i.e., our approach is to work with the given test slave instead of redoing the stats. For our other testing we run so many tests that outlier results can be more easily ignored, as we get a more general 'idea' of performance based upon clusters of results on a graph. I believe that the on-demand Talos testing for add-ons will greatly help with this issue, as you would be able to request re-tests if the machine randomly selected to do your testing appeared unreliable. As is, if you see a machine behave oddly, please file a bug against Release Engineering to get the box in question pulled from the testing pool for re-imaging.
Unfortunately, I see lots of suspicious results; these are not isolated cases. Add-on authors are told to aim for a 5% slowdown, yet a disabled extension (Torbutton) scores 4.5% thanks to measurement errors. See https://adblockplus.org/trash/slowOverview.html - everything that has a yellow or red entry in one of the "standard deviation" columns.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX