Closed Bug 1138953 Opened 9 years ago Closed 9 years ago

compare.py --compare-e10s should default to compare only a specific revision, not a range in history

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(e10s-)

RESOLVED FIXED
Tracking Status
e10s - ---

People

(Reporter: jmaher, Assigned: jmaher)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 4 obsolete files)

      No description provided.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
it is also noted that we should have a mode where we print out:
<geomean> (<num-datapoints>) +/- <stddev> (<stddev-as-%-of-geomean>) (<min> - <max>)

instead of just:
<min> - <max>

this is useful for comparing specific revisions and history.  I am thinking it should be --full-stats or something like that.

Likewise we should have a: --master-rev which does what this patch is doing, then we could easily compare between two types of test for a given rev, or compare anything to a different branch overtime.
(In reply to Joel Maher (:jmaher) from comment #2)
> it is also noted that we should have a mode where we print out:
> <geomean> (<num-datapoints>) +/- <stddev> (<stddev-as-%-of-geomean>) (<min>
> - <max>)

Er.. (<stddev-as-%-of-geomean>%) (added the '%' to make it more readable).

Also we really don't have use for so much accuracy when the values are typically so noisy. Limiting the displayed values to 3-4 significant digits would also make it more readable.
Also, this addition is not only for fun. Being able to compare apples to apples (same kind of data on both sides), including the number of datapoints each side of the comparison uses will enable us to understand better how valid the comparison is.

If, for instance one side uses 100 data points and the other side has one, it's a valuable information to have when assessing how valid the comparison is, and the same goes for the other types of data this addition brings.
Blocks: e10s-tests
tracking-e10s: --- → -
this is what is looks like with my latest bits:
python compare.py --compare-e10s --rev eea6188b9b05 --pgo --verbose --branch Firefox --platform Linux --test tresize
   test		Master geo (count) +/- stddev (min, max)	:test	geo (count) +/- stddev (min, max)

Linux:
:) tresize           	 16.8 (48) +/-   0.3 ( 16.4,  18.6)	:	 13.4 (3) +/-   0.2 ( 13.2,  13.6)
Thanks.

So, I thought a bit more about it. Personally I don't think the min/max values are valuable, especially when we have stddev which gives some representation of the distribution, and also min/max are highly affected by outliers, while stddev is much less so - while still taking all values into account.

Maybe to simplify it and also to help interpret the output better, each side will show only geomean, stddev (possibly only as % of geomean), # of data points, and between the two - the change in percentage between the two geomeans. I could also leave without the smiley since with this suggestion the improvement/regression would appear clearly at the middle as percentage, but we could also leave it for fun :)

e.g.

test     g.mean stddev  points   change    g.mean stddev points

tXYZ     16.8   +/- 10% (48#)    [+20%]    20.2   +/- 9% (3#)

We can also probably round the percentage values to integers, since even for low percentage values we don't really care much if it's 2.2% or 2.6%.

Joel, what do you think?
Also, for canvasmark/dromaeo_dom/dromaeo_css/V8 - higher is better. Does the smiley take that into account? maybe also the change% at the middle should add a '*' for tests where higher is better?
yes, we have a reverse tests coded into compare.py.  I agree on dropping the digits and making these whole numbers.
Attached patch updated with stddev/geomean, etc. (0.92) (obsolete) (deleted) — Splinter Review
jmaher@jmaher-ThinkPad-X230:~/mozilla/talos/talos$ python compare.py --compare-e10s --rev eea6188b9b05 --pgo --verbose --branch Firefox --platform Linux --test tresize
   test		Master gmean stddev points	change	Test	gmean stddev points

Linux:
:) tresize           	   16 +/-  0 (48#)	[-20.4%]	   13 +/-  0 (3#)
jmaher@jmaher-ThinkPad-X230:~/mozilla/talos/talos$ python compare.py --compare-e10s --rev eea6188b9b05 --pgo --verbose --branch Firefox --platform Linux --test tresize --master-revision eea6188b9b05
   test		Master gmean stddev points	change	Test	gmean stddev points

Linux:
:) tresize           	   16 +/-  0 (3#)	[-20.8%]	   13 +/-  0 (3#)
Attachment #8571967 - Attachment is obsolete: true
python compare.py --compare-e10s --rev eea6188b9b05 --pgo --verbose --branch Firefox --platform Win7 --test tresize --master-revision eea6188b9b05 --print-graph-url 
   test		Master gmean stddev points	change	Test	gmean stddev points

Win7:
:( tresize           	   14 +/-  0 (3#)	[1.8%]	   15 +/-  0 (3#)	http://goo.gl/YvHlcI
Attachment #8573316 - Attachment is obsolete: true
this has been verified by avih to produce some useful results.
Attachment #8573324 - Attachment is obsolete: true
Attachment #8573375 - Flags: review?(wlachance)
Comment on attachment 8573375 [details] [diff] [review]
update compare.py that supports per revision comparison for e10s (1.0)

I don't really have a lot to say about this, if it works for you, by all means land it. :)
Attachment #8573375 - Flags: review?(wlachance) → review+
Attached patch bug1138953.v5.patch (deleted) — Splinter Review
This only aligns the formatting compared to the previous patch.
Attachment #8573375 - Attachment is obsolete: true
Attachment #8573435 - Flags: review?(jmaher)
Comment on attachment 8573435 [details] [diff] [review]
bug1138953.v5.patch

Review of attachment 8573435 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8573435 - Flags: review?(jmaher) → review+
Depends on: 1151466
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: