Closed Bug 1502036 Opened 6 years ago Closed 6 years ago

Subtest results for raptor speedometer have incorrect lower_is_better setting

Categories

(Testing :: Raptor, defect)

defect
Not set
normal

Tracking

(firefox65 fixed)

VERIFIED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: mgaudet, Assigned: rwood)

References

Details

Attachments

(2 files)

Attached image Web tooling Benchmarks Results (deleted) —
Looking at the graphs for web-tooling [1] and Speedometer [2], we see a peculiar oddity.

In the case of web-tooling, the aggregate results are reported as a Score, where higher is better, and each subtest is reported as execution time in ms, where lower is better. However, if you compare Chrome and SpiderMonkey, the score and the subtests disagree about which is better: Chrome shows a lower (worse) aggregate score than SpiderMonkey, but a lower (better) execution time for every subtest, which should imply a higher score. This looks like it could result from incorrect labelling of results and incorrect use of lower-is-better.

Speedometer seems to show a similar problem, though less clearly: Chrome has a better (higher) aggregate score, yet its subtests show lower scores than Firefox in most (though not all) cases.

[1]: https://arewefastyet.com/linux64/web-tooling?numDays=90
[2]:
For Speedometer, we specifically set the subtests as ms (lower is better) and report a score (higher is better) as of bug 1486789. We parse the output the same way for Firefox and Chrome, so maybe the subtests are weighted?
It seems that Perfherder has not noticed the change and still reports the subtests with the old values: https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?parent_signature=bf4fcd6df34988c8aea87ccda51b0f70b4141643
Component: General → Raptor
@mgaudet: can someone confirm that the information below for web-tooling is correct?

The web-tooling parent score is marked as *Score* "higher is better" (by specifying "lower_is_better: false"):
https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?parent_signature=620f90feca6c0916ca8e593737eb6bf6d0301f68
> "620f90feca6c0916ca8e593737eb6bf6d0301f68": {
>   "option_collection_hash": "102210fe594ee9b33d82058545b1ed14f4c8206e",
>   "suite": "web-tooling-benchmark-v8",
>   "id": 1760218,
>   "lower_is_better": false,
>   "has_subtests": true,
>   "framework_id": 11,
>   "machine_platform": "linux64"
> },

web-tooling subtests are marked as *ms* "lower is better" (by *not* specifying "lower_is_better: false"):
https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?parent_signature=620f90feca6c0916ca8e593737eb6bf6d0301f68

I have filed an issue on the UI to overwrite what Perfherder says for subtests until we can fix the metadata:
https://github.com/mozilla-frontend-infra/firefox-performance-dashboard/issues/102

I have filed a bug to fix the Speedometer subtest metadata (bug 1502073).
Flags: needinfo?(mgaudet)
So looking at a log, this appears to be an incorrect labelling issue:

[task 2018-10-25T15:14:38.915Z] Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[task 2018-10-25T15:14:40.130Z] Running Web Tooling Benchmark 0.3.2...
[task 2018-10-25T15:14:40.130Z] --------------------------------------
[task 2018-10-25T15:14:46.432Z] acorn: 6.68 runs/sec
[task 2018-10-25T15:14:53.795Z] babel: 6.55 runs/sec
[task 2018-10-25T15:15:00.722Z] babylon: 6.38 runs/sec
[task 2018-10-25T15:15:09.275Z] buble: 4.90 runs/sec
[task 2018-10-25T15:15:13.275Z] chai: 10.66 runs/sec
[task 2018-10-25T15:15:24.105Z] coffeescript: 3.65 runs/sec
[task 2018-10-25T15:15:36.498Z] espree: 3.49 runs/sec
[task 2018-10-25T15:15:42.793Z] esprima: 6.54 runs/sec
[task 2018-10-25T15:15:56.339Z] jshint: 3.00 runs/sec
[task 2018-10-25T15:16:03.475Z] lebab: 5.87 runs/sec
[task 2018-10-25T15:16:11.560Z] prepack: 5.04 runs/sec
[task 2018-10-25T15:16:19.193Z] prettier: 5.32 runs/sec
[task 2018-10-25T15:16:23.681Z] source-map: 13.82 runs/sec
[task 2018-10-25T15:16:29.214Z] typescript: 8.19 runs/sec
[task 2018-10-25T15:16:33.845Z] uglify-es: 12.25 runs/sec
[task 2018-10-25T15:16:43.891Z] uglify-js: 4.10 runs/sec
[task 2018-10-25T15:16:43.891Z] --------------------------------------
[task 2018-10-25T15:16:43.891Z] Geometric mean: 6.05 runs/sec
[task 2018-10-25T15:16:44.171Z] PERFHERDER_DATA: {"framework": {"name": "js-bench"}, "suites": [{"name": "web-tooling-benchmark-sm", "lowerIsBetter": false, "value": 6.05, "shouldAlert": false, "units": "score", "subtests": [{"name": "web-tooling-benchmark-source-map", "value": 13.82}, {"name": "web-tooling-benchmark-lebab", "value": 5.87}, {"name": "web-tooling-benchmark-typescript", "value": 8.19}, {"name": "web-tooling-benchmark-uglify-es", "value": 12.25}, {"name": "web-tooling-benchmark-babel", "value": 6.55}, {"name": "web-tooling-benchmark-chai", "value": 10.66}, {"name": "web-tooling-benchmark-uglify-js", "value": 4.1}, {"name": "web-tooling-benchmark-coffeescript", "value": 3.65}, {"name": "web-tooling-benchmark-jshint", "value": 3.0}, {"name": "web-tooling-benchmark-prettier", "value": 5.32}, {"name": "web-tooling-benchmark-acorn", "value": 6.68}, {"name": "web-tooling-benchmark-buble", "value": 4.9}, {"name": "web-tooling-benchmark-esprima", "value": 6.54}, {"name": "web-tooling-benchmark-prepack", "value": 5.04}, {"name": "web-tooling-benchmark-babylon", "value": 6.38}, {"name": "web-tooling-benchmark-espree", "value": 3.49}, {"name": "web-tooling-benchmark-mean", "value": 6.05}]}]}
[fetches 2018-10-25T15:16:44.191Z] removing /home/cltbld/fetches

All of the values are scores, and therefore higher-is-better.
Flags: needinfo?(mgaudet)
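As a quick sanity check that the web-tooling subtest values are rates (runs/sec) feeding a higher-is-better score, the reported aggregate can be recomputed from the subtest values in the log above. This is only a sketch: the `geometric_mean` helper and the `runs_per_sec` dictionary are written for illustration and are not taken from the benchmark harness.

```python
import math

# Subtest values in runs/sec, copied from the web-tooling log in this bug.
runs_per_sec = {
    "acorn": 6.68, "babel": 6.55, "babylon": 6.38, "buble": 4.90,
    "chai": 10.66, "coffeescript": 3.65, "espree": 3.49, "esprima": 6.54,
    "jshint": 3.00, "lebab": 5.87, "prepack": 5.04, "prettier": 5.32,
    "source-map": 13.82, "typescript": 8.19, "uglify-es": 12.25,
    "uglify-js": 4.10,
}


def geometric_mean(values):
    """Return the geometric mean of an iterable of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


score = geometric_mean(runs_per_sec.values())
print(f"Geometric mean: {score:.2f} runs/sec")
```

The result should land close to the 6.05 runs/sec reported in the log (the inputs here are already rounded to two decimals), which is consistent with every subtest being a higher-is-better rate rather than a time in ms.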
jmaher: according to what mgaudet mentions, IIUC we should file another bug like bug 1486789 to update the metric for web-tooling's subtests; do you agree? I've updated bug 1502073 to also update web-tooling.
:armen, yeah, that makes sense
Depends on: 1502073, 1486789
Depends on: 1502116
This is a valid bug for Raptor. Bug 1486789 fixed the 'unit' for the subtests; however, we also need test.lower_is_better to be adjusted.

Speedometer subtests should be lower is better (do not define "lower_is_better": false).
Web-tooling subtests should be higher is better (define "lower_is_better": false).

In this Speedometer output we can see that "lowerIsBetter": false is defined for the subtests, which is incorrect:
> "suites": [{
...
> "name": "raptor-speedometer-firefox",
> "lowerIsBetter": false,
...
> "subtests": [{
> "name": "jQuery-TodoMVC/DeletingAllItems/Sync",
> "lowerIsBetter": false,

I believe the fix is needed in `testing/raptor/raptor/output.py`.
Hi @rwood, is this something you or someone else could tackle? It is not urgent but at least documenting some pointers as to what files need fixing could be useful for a potential contributor.
Flags: needinfo?(rwood)
(In reply to Armen [:armenzg] from comment #8)
> Hi @rwood, is this something you or someone else could tackle?
> It is not urgent but at least documenting some pointers as to what files
> need fixing could be useful for a potential contributor.

I'm a bit confused though: why can't we just change 'lowerIsBetter' to 'true' in the raptor-speedometer test INI? Or does that not work (is that the point, hah)?
Flags: needinfo?(rwood)
> Speedometer subtests should be lower is better (do not define "lower_is_better": false).

It seems that the ini says that already; however, Perfherder does not agree:
> subtest_lower_is_better = true
https://searchfox.org/mozilla-central/source/testing/raptor/raptor/tests/raptor-speedometer.ini#15

If you load one of the Speedometer subtests:
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,1708404,1,10&selected=mozilla-central,1708404,400133,633925436,10
you can see that Perfherder believes that "higher is better":
https://www.dropbox.com/s/sts16ihd5y8uqo7/Screenshot%202018-11-08%2016.45.48.png?dl=0
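To make the ini side of this concrete, here is a minimal sketch of reading such a setting with Python's standard `configparser`. Only the `subtest_lower_is_better = true` line comes from the file linked above; the section name and the `lower_is_better` key in `SAMPLE_INI` are assumptions made for illustration, and this is not how Raptor itself loads its test manifests.

```python
import configparser

# Hypothetical ini fragment. Only "subtest_lower_is_better = true" is taken
# from this bug; the section name and "lower_is_better" key are assumed.
SAMPLE_INI = """
[raptor-speedometer]
lower_is_better = false
subtest_lower_is_better = true
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)
section = parser["raptor-speedometer"]

# getboolean() accepts true/false, yes/no, on/off and 1/0 (case-insensitive).
suite_lower_is_better = section.getboolean("lower_is_better", fallback=True)
subtest_lower_is_better = section.getboolean(
    "subtest_lower_is_better", fallback=True
)

print("suite lower_is_better:", suite_lower_is_better)
print("subtest lower_is_better:", subtest_lower_is_better)
```

If the ini already carries the right value, as it appears to here, the mismatch has to come from a later step that turns the setting into the PERFHERDER_DATA output.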
Here's a Perfherder data artifact:
https://queue.taskcluster.net/v1/task/R6GGR5e6TsO9B9c1qhmtnA/runs/0/artifacts/public/test_info/perfherder-data.json

For the main score:
> "lowerIsBetter": false,

For the subtests:
> "lowerIsBetter": false,
Sample Raptor output of PERFHERDER_DATA for Speedometer:

06:56:31 INFO - raptor-output PERFHERDER_DATA: {"framework": {"name": "raptor"}, "suites": [{"extraOptions": [], "name": "raptor-speedometer-firefox", "lowerIsBetter": false, "alertThreshold": 2.0, "value": 71.37732855708006, "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync", "lowerIsBetter": false, "alertThreshold": 2.0, "replicates": [136.92, 185.82, 134.8, 136.16, 137.66, 134.16, 132.84, 132.72, 181.26, 136.08, 136.52, 140.96, 138.06, 134.78, 135.54, 133.54, 177.44, 139.88, 133.76, 135.08, 135.56, 136.3, 134.18, 133.36, 133.72], "value": 135.56, "unit": "ms"}, {"name": "jQuery-TodoMVC/DeletingAllItems/Async", "lowerIsBetter": false, ...

It has 'lowerIsBetter': false for both the main score ('raptor-speedometer-firefox') and all the subtests.
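A check like the following could flag this kind of mislabelling automatically. It is a rough sketch, not part of Raptor or Perfherder: the `find_mislabelled_subtests` helper is hypothetical, the sample is trimmed from the output above (replicates omitted, only the first subtest kept), and it assumes that a subtest reported in ms should never be marked higher-is-better.

```python
import json

# Trimmed copy of the Speedometer PERFHERDER_DATA shown in this bug.
SAMPLE = json.loads("""
{"framework": {"name": "raptor"},
 "suites": [{"name": "raptor-speedometer-firefox",
             "lowerIsBetter": false,
             "value": 71.37732855708006,
             "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync",
                           "lowerIsBetter": false,
                           "value": 135.56,
                           "unit": "ms"}]}]}
""")


def find_mislabelled_subtests(perfherder_data):
    """Return (suite, subtest) names where a time in ms is marked
    higher-is-better, i.e. "lowerIsBetter" is explicitly false."""
    problems = []
    for suite in perfherder_data.get("suites", []):
        for subtest in suite.get("subtests", []):
            is_time = subtest.get("unit") == "ms"
            if is_time and subtest.get("lowerIsBetter", True) is False:
                problems.append((suite["name"], subtest["name"]))
    return problems


for suite_name, subtest_name in find_mislabelled_subtests(SAMPLE):
    print(f"{suite_name}: {subtest_name} is in ms but marked higher-is-better")
```

The suite-level score is deliberately not checked, since a higher-is-better aggregate is the intended behaviour for Speedometer.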
I will file a different bug for web-tooling since it is a different framework.
Blocks: 1473078
Summary: Subtest results for Web-tooling and Speedometer seem incorrect → Subtest results for Speedometer seem incorrect
Thanks Armen.
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Summary: Subtest results for Speedometer seem incorrect → Subtest results for raptor speedometer have incorrect lower_is_better setting
No longer depends on: 1502116
This is what we have:
> "lowerIsBetter": false,
> "name": "raptor-speedometer-chrome",
> "subtests": [
> {
> "alertThreshold": 2.0,
> "lowerIsBetter": true,
> "name": "jQuery-TodoMVC/DeletingAllItems/Sync",

This is what we want:
> "lowerIsBetter": false,
> "name": "raptor-speedometer-chrome",
> "subtests": [
> {
> "alertThreshold": 2.0,
> "name": "jQuery-TodoMVC/DeletingAllItems/Sync",

The main score is higher is better while subtests are lower is better.
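The two snippets differ only in whether the subtest carries an explicit "lowerIsBetter": true. Assuming, as described earlier in this bug, that an entry with no "lowerIsBetter" key is treated as lower is better, both forms resolve to the same direction. A small sketch of that reasoning, with a hypothetical `effective_lower_is_better` helper:

```python
def effective_lower_is_better(entry):
    """Resolve the direction of a suite or subtest entry.

    Assumption taken from this bug thread: when "lowerIsBetter" is absent,
    the entry is treated as lower-is-better.
    """
    return entry.get("lowerIsBetter", True)


# "What we have": the subtest sets the flag explicitly.
have = {
    "name": "raptor-speedometer-chrome",
    "lowerIsBetter": False,
    "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync",
                  "alertThreshold": 2.0,
                  "lowerIsBetter": True}],
}

# "What we want": the subtest omits the flag and relies on the default.
want = {
    "name": "raptor-speedometer-chrome",
    "lowerIsBetter": False,
    "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync",
                  "alertThreshold": 2.0}],
}

for label, suite in (("have", have), ("want", want)):
    suite_dir = effective_lower_is_better(suite)
    subtest_dir = effective_lower_is_better(suite["subtests"][0])
    print(f"{label}: suite lower-is-better={suite_dir}, "
          f"subtest lower-is-better={subtest_dir}")
```

Under that assumption either form gives a higher-is-better main score with lower-is-better subtests.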
(In reply to Armen [:armenzg] from comment #17) Yep, we are good to go: 08:23:15 INFO - raptor-output PERFHERDER_DATA: {"framework": {"name": "raptor"}, "suites": [{"extraOptions": [], "name": "raptor-speedometer-chrome", "lowerIsBetter": false, "alertThreshold": 2.0, "value": 92.2807915395984, "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync", "lowerIsBetter": true,...
Pushed by rwood@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/aef6ff7f100d Subtest results for raptor speedometer have incorrect lower_is_better setting; r=jmaher
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Flags: needinfo?(mgaudet)
This appears good.
Status: RESOLVED → VERIFIED
Flags: needinfo?(mgaudet)