Closed
Bug 1502036
Opened 6 years ago
Closed 6 years ago
Subtest results for raptor speedometer have incorrect lower_is_better setting
Categories
(Testing :: Raptor, defect)
Tracking
(firefox65 fixed)
VERIFIED
FIXED
mozilla65
Release | Tracking | Status
---|---|---
firefox65 | --- | fixed
People
(Reporter: mgaudet, Assigned: rwood)
Attachments
(2 files)
Looking at the graphs for webtooling [1] and speedometer [2] we see a peculiar oddity.
In the case of web-tooling, the aggregate results are being reported as a Score, where higher-is-better, and each subtest is being reported as execution time in ms, where lower-is-better.
However, if you look at the score and subtests, comparing Chrome and SpiderMonkey, they disagree about what is better: Chrome shows a lower (worse) score than SpiderMonkey in the aggregate score, but for every subtest shows lower (better) execution time, which should imply a higher score.
This looks like it could result from incorrect labelling of results and an incorrect lower-is-better setting.
Speedometer seems to show (though less clearly) a similar problem: Chrome has a better (higher) score in the aggregate, yet subtests show lower scores than Firefox in most (though not all) cases.
[1]: https://arewefastyet.com/linux64/web-tooling?numDays=90
[2]:
Comment 1•6 years ago
For Speedometer we specifically set subtests as ms (lower is better) and report a score (higher is better) as of bug 1486789. We parse the output the same way for Firefox and Chrome, so maybe the subtests are weighted?
Comment 2•6 years ago
It seems that Perfherder has not noticed the change and still reports the subtests with the old values:
https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?parent_signature=bf4fcd6df34988c8aea87ccda51b0f70b4141643
Updated•6 years ago
Component: General → Raptor
Comment 3•6 years ago
@mgaudet: can someone confirm that the information below for web-tooling is correct?
The web-tooling parent score is marked as *Score* "higher is better" (by specifying "lower_is_better: false"):
https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?parent_signature=620f90feca6c0916ca8e593737eb6bf6d0301f68
> "620f90feca6c0916ca8e593737eb6bf6d0301f68": {
> "option_collection_hash": "102210fe594ee9b33d82058545b1ed14f4c8206e",
> "suite": "web-tooling-benchmark-v8",
> "id": 1760218,
> "lower_is_better": false,
> "has_subtests": true,
> "framework_id": 11,
> "machine_platform": "linux64"
> },
web-tooling subtests are marked as *ms* "lower is better" (by *not* specifying "lower_is_better: false"):
https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?parent_signature=620f90feca6c0916ca8e593737eb6bf6d0301f68
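A minimal sketch of how these signature entries might be read. It assumes, as described in this comment, that a signature counts as lower-is-better unless it carries `"lower_is_better": false`. The helper name `describe_direction` is hypothetical, the parent dict copies the entry quoted above, and the subtest dict is an invented placeholder with no `lower_is_better` key.

```python
def describe_direction(signature):
    """Return the direction of improvement for one signature dict."""
    # Assumption: a missing "lower_is_better" key means lower is better.
    lower = signature.get("lower_is_better", True)
    return "lower is better" if lower else "higher is better"


# Parent entry as quoted in this comment.
parent = {
    "option_collection_hash": "102210fe594ee9b33d82058545b1ed14f4c8206e",
    "suite": "web-tooling-benchmark-v8",
    "id": 1760218,
    "lower_is_better": False,
    "has_subtests": True,
    "framework_id": 11,
    "machine_platform": "linux64",
}

# Hypothetical subtest entry: it does not specify "lower_is_better" at all.
subtest = {"suite": "web-tooling-benchmark-v8", "test": "example-subtest"}

print("parent: ", describe_direction(parent))
print("subtest:", describe_direction(subtest))
```

Under that assumption the parent reads as a score and the subtest as a time, which is the pairing the graphs appear to be using.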
I have filed an issue on the UI to overwrite what Perfherder says for subtests until we can fix the metadata:
https://github.com/mozilla-frontend-infra/firefox-performance-dashboard/issues/102
I have filed a bug to fix the Speedometer subtest metadata (bug 1502073).
Flags: needinfo?(mgaudet)
Reporter
Comment 4•6 years ago
So, looking at a log, this appears to be an incorrect labelling issue:
[task 2018-10-25T15:14:38.915Z] Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[task 2018-10-25T15:14:40.130Z] Running Web Tooling Benchmark 0.3.2...
[task 2018-10-25T15:14:40.130Z] --------------------------------------
[task 2018-10-25T15:14:46.432Z] acorn: 6.68 runs/sec
[task 2018-10-25T15:14:53.795Z] babel: 6.55 runs/sec
[task 2018-10-25T15:15:00.722Z] babylon: 6.38 runs/sec
[task 2018-10-25T15:15:09.275Z] buble: 4.90 runs/sec
[task 2018-10-25T15:15:13.275Z] chai: 10.66 runs/sec
[task 2018-10-25T15:15:24.105Z] coffeescript: 3.65 runs/sec
[task 2018-10-25T15:15:36.498Z] espree: 3.49 runs/sec
[task 2018-10-25T15:15:42.793Z] esprima: 6.54 runs/sec
[task 2018-10-25T15:15:56.339Z] jshint: 3.00 runs/sec
[task 2018-10-25T15:16:03.475Z] lebab: 5.87 runs/sec
[task 2018-10-25T15:16:11.560Z] prepack: 5.04 runs/sec
[task 2018-10-25T15:16:19.193Z] prettier: 5.32 runs/sec
[task 2018-10-25T15:16:23.681Z] source-map: 13.82 runs/sec
[task 2018-10-25T15:16:29.214Z] typescript: 8.19 runs/sec
[task 2018-10-25T15:16:33.845Z] uglify-es: 12.25 runs/sec
[task 2018-10-25T15:16:43.891Z] uglify-js: 4.10 runs/sec
[task 2018-10-25T15:16:43.891Z] --------------------------------------
[task 2018-10-25T15:16:43.891Z] Geometric mean: 6.05 runs/sec
[task 2018-10-25T15:16:44.171Z] PERFHERDER_DATA: {"framework": {"name": "js-bench"}, "suites": [{"name": "web-tooling-benchmark-sm", "lowerIsBetter": false, "value": 6.05, "shouldAlert": false, "units": "score", "subtests": [{"name": "web-tooling-benchmark-source-map", "value": 13.82}, {"name": "web-tooling-benchmark-lebab", "value": 5.87}, {"name": "web-tooling-benchmark-typescript", "value": 8.19}, {"name": "web-tooling-benchmark-uglify-es", "value": 12.25}, {"name": "web-tooling-benchmark-babel", "value": 6.55}, {"name": "web-tooling-benchmark-chai", "value": 10.66}, {"name": "web-tooling-benchmark-uglify-js", "value": 4.1}, {"name": "web-tooling-benchmark-coffeescript", "value": 3.65}, {"name": "web-tooling-benchmark-jshint", "value": 3.0}, {"name": "web-tooling-benchmark-prettier", "value": 5.32}, {"name": "web-tooling-benchmark-acorn", "value": 6.68}, {"name": "web-tooling-benchmark-buble", "value": 4.9}, {"name": "web-tooling-benchmark-esprima", "value": 6.54}, {"name": "web-tooling-benchmark-prepack", "value": 5.04}, {"name": "web-tooling-benchmark-babylon", "value": 6.38}, {"name": "web-tooling-benchmark-espree", "value": 3.49}, {"name": "web-tooling-benchmark-mean", "value": 6.05}]}]}
[fetches 2018-10-25T15:16:44.191Z] removing /home/cltbld/fetches
All of the values are scores, and therefore higher-is-better.
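As a quick cross-check, the sketch below recomputes the aggregate from the per-tool rates in the log. It assumes the reported score is the plain geometric mean of the sixteen rounded runs/sec values, which the log's "Geometric mean" line suggests; the function name `geometric_mean` is just a local helper.

```python
import math

# Per-tool rates in runs/sec, copied from the log in this comment.
rates = {
    "acorn": 6.68, "babel": 6.55, "babylon": 6.38, "buble": 4.90,
    "chai": 10.66, "coffeescript": 3.65, "espree": 3.49, "esprima": 6.54,
    "jshint": 3.00, "lebab": 5.87, "prepack": 5.04, "prettier": 5.32,
    "source-map": 13.82, "typescript": 8.19, "uglify-es": 12.25,
    "uglify-js": 4.10,
}


def geometric_mean(values):
    """Geometric mean computed via the mean of natural logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


score = geometric_mean(rates.values())
print(f"recomputed geometric mean: {score:.2f} runs/sec")
```

If the recomputed value agrees with the logged mean, the aggregate and the subtests share the same unit and direction, so both would be higher-is-better.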
Flags: needinfo?(mgaudet)
Comment 5•6 years ago
jmaher: according to what mgaudet mentions, IIUC we should file another bug like bug 1486789 to update the metric for webtooling's subtests; do you agree?
I've updated bug 1502073 to also update webtooling.
Comment 6•6 years ago
:armen, yeah, that makes sense
Comment 7•6 years ago
This is a valid bug for Raptor.
Bug 1486789 fixed the 'unit' for the subtests; however, we also need test.lower_is_better to be adjusted.
Speedometer subtests should be lower is better (do not define "lower_is_better": false).
Web-tooling subtests should be higher is better (define "lower_is_better": false).
In this Speedometer output we can see that "lowerIsBetter": false is also defined for the subtest, which is incorrect.
> "suites": [{
...
> "name": "raptor-speedometer-firefox",
> "lowerIsBetter": false,
...
> "subtests": [{
> "name": "jQuery-TodoMVC/DeletingAllItems/Sync",
> "lowerIsBetter": false,
I believe the fix is needed in `testing/raptor/raptor/output.py`.
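A rough sketch of the kind of change this implies, not the actual code in `output.py`: the subtest entries take their direction from a subtest-specific setting instead of inheriting the suite-level one. The function `build_suite`, the plain-dict test description, and the `subtest_unit` key are all hypothetical stand-ins; only the idea of separate suite and subtest flags comes from this bug.

```python
def build_suite(test, subtest_values):
    """Build a Perfherder-style suite dict from a test description.

    `test` is a plain dict standing in for Raptor's test settings
    (hypothetical shape, for illustration only).
    """
    suite = {
        "name": test["name"],
        "lowerIsBetter": test["lower_is_better"],
        "subtests": [],
    }
    for name, value in sorted(subtest_values.items()):
        suite["subtests"].append({
            "name": name,
            "value": value,
            "unit": test["subtest_unit"],
            # Use the subtest-specific flag, not the suite-level one.
            "lowerIsBetter": test["subtest_lower_is_better"],
        })
    return suite


test = {
    "name": "raptor-speedometer-firefox",
    "lower_is_better": False,         # overall score: higher is better
    "subtest_unit": "ms",             # hypothetical key
    "subtest_lower_is_better": True,  # subtest times: lower is better
}
suite = build_suite(test, {"jQuery-TodoMVC/DeletingAllItems/Sync": 135.56})
print(suite)
```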
Comment 8•6 years ago
Hi @rwood, is this something you or someone else could tackle?
It is not urgent but at least documenting some pointers as to what files need fixing could be useful for a potential contributor.
Flags: needinfo?(rwood)
Assignee
Comment 9•6 years ago
(In reply to Armen [:armenzg] from comment #8)
> Hi @rwood, is this something you or someone else could tackle?
> It is not urgent but at least documenting some pointers as to what files
> need fixing could be useful for a potential contributor.
I'm a bit confused though: why can't we just change the 'lowerIsBetter' to 'true' in the raptor-speedometer test INI? Or does that not work (is that the point, hah)?
Flags: needinfo?(rwood)
Comment 10•6 years ago
> Speedometer subtests should be lower is better (do not define "lower_is_better": false).
It seems that the ini already says that; however, Perfherder does not agree:
> subtest_lower_is_better = true
https://searchfox.org/mozilla-central/source/testing/raptor/raptor/tests/raptor-speedometer.ini#15
If you load one of the Speedometer subtests:
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-central,1708404,1,10&selected=mozilla-central,1708404,400133,633925436,10
you can see that Perfherder believes that "higher is better":
https://www.dropbox.com/s/sts16ihd5y8uqo7/Screenshot%202018-11-08%2016.45.48.png?dl=0
Comment 11•6 years ago
Here's a perfherder data artifact:
https://queue.taskcluster.net/v1/task/R6GGR5e6TsO9B9c1qhmtnA/runs/0/artifacts/public/test_info/perfherder-data.json
For the main score:
> "lowerIsBetter": false,
for the subtests:
> "lowerIsBetter": false,
Assignee
Comment 12•6 years ago
Sample Raptor output of PERFHERDER_DATA for speedometer:
06:56:31 INFO - raptor-output PERFHERDER_DATA: {"framework": {"name": "raptor"}, "suites": [{"extraOptions": [], "name": "raptor-speedometer-firefox", "lowerIsBetter": false, "alertThreshold": 2.0, "value": 71.37732855708006, "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync", "lowerIsBetter": false, "alertThreshold": 2.0, "replicates": [136.92, 185.82, 134.8, 136.16, 137.66, 134.16, 132.84, 132.72, 181.26, 136.08, 136.52, 140.96, 138.06, 134.78, 135.54, 133.54, 177.44, 139.88, 133.76, 135.08, 135.56, 136.3, 134.18, 133.36, 133.72], "value": 135.56, "unit": "ms"}, {"name": "jQuery-TodoMVC/DeletingAllItems/Async", "lowerIsBetter": false, ...
It has 'lowerIsBetter': false for the main score ('raptor-speedometer-firefox') and for all the subtests.
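The mismatch can be checked mechanically. The sketch below flags subtests that report a time unit but are marked higher-is-better. The function name `find_suspect_subtests` is hypothetical, the sample is trimmed from the output in this comment (replicates dropped, one subtest kept), and treating a missing `lowerIsBetter` key as lower-is-better is an assumption.

```python
import json


def find_suspect_subtests(perfherder_data):
    """Return (suite, subtest) names where unit is ms but lowerIsBetter is false."""
    suspects = []
    for suite in perfherder_data.get("suites", []):
        for subtest in suite.get("subtests", []):
            is_time = subtest.get("unit") == "ms"
            # Assumption: an absent key means lower is better.
            lower = subtest.get("lowerIsBetter", True)
            if is_time and not lower:
                suspects.append((suite["name"], subtest["name"]))
    return suspects


# Trimmed from the PERFHERDER_DATA sample in this comment.
sample = json.loads("""
{"framework": {"name": "raptor"},
 "suites": [{"name": "raptor-speedometer-firefox",
             "lowerIsBetter": false,
             "alertThreshold": 2.0,
             "value": 71.37732855708006,
             "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync",
                           "lowerIsBetter": false,
                           "alertThreshold": 2.0,
                           "value": 135.56,
                           "unit": "ms"}]}]}
""")

for suite_name, subtest_name in find_suspect_subtests(sample):
    print(f"{suite_name}: {subtest_name} is in ms but marked higher-is-better")
```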
Comment 13•6 years ago
I will file a different bug for web-tooling since it is a different framework.
Blocks: 1473078
Summary: Subtest results for Web-tooling and Speedometer seem incorrect → Subtest results for Speedometer seem incorrect
Assignee
Comment 14•6 years ago
Thanks Armen.
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Summary: Subtest results for Speedometer seem incorrect → Subtest results for raptor speedometer have incorrect lower_is_better setting
Assignee
Comment 15•6 years ago
Assignee
Comment 16•6 years ago
Comment 17•6 years ago
This is what we have:
> "lowerIsBetter": false,
> "name": "raptor-speedometer-chrome",
> "subtests": [
> {
> "alertThreshold": 2.0,
> "lowerIsBetter": true,
> "name": "jQuery-TodoMVC/DeletingAllItems/Sync",
This is what we want:
> "lowerIsBetter": false,
> "name": "raptor-speedometer-chrome",
> "subtests": [
> {
> "alertThreshold": 2.0,
> "name": "jQuery-TodoMVC/DeletingAllItems/Sync",
The main score is higher is better while subtests are lower is better.
Assignee
Comment 18•6 years ago
(In reply to Armen [:armenzg] from comment #17)
Yep, we are good to go:
08:23:15 INFO - raptor-output PERFHERDER_DATA: {"framework": {"name": "raptor"}, "suites": [{"extraOptions": [], "name": "raptor-speedometer-chrome", "lowerIsBetter": false, "alertThreshold": 2.0, "value": 92.2807915395984, "subtests": [{"name": "jQuery-TodoMVC/DeletingAllItems/Sync", "lowerIsBetter": true,...
Comment 19•6 years ago
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aef6ff7f100d
Subtest results for raptor speedometer have incorrect lower_is_better setting; r=jmaher
Comment 20•6 years ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
status-firefox65:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Comment 21•6 years ago
Reporter
Updated•6 years ago
Flags: needinfo?(mgaudet)
Reporter
Comment 22•6 years ago
This appears good.
Status: RESOLVED → VERIFIED
Flags: needinfo?(mgaudet)