Retriggered browsertime vismet tasks only report results from the first task
Categories
(Testing :: Raptor, defect, P2)
Tracking
(Not tracked)
People
(Reporter: sparky, Unassigned)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [perf:workflow])
We currently have visual metrics running on browsertime, but when we retrigger those browsertime tests, no new vismet tasks are created for them. See this sample "add-new" task, which retriggers a browsertime test but doesn't add a new vismet task for it: https://firefox-ci-tc.services.mozilla.com/tasks/P6v8YHTOTzO6do85ER3RXg/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FP6v8YHTOTzO6do85ER3RXg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log
It came from this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&revision=5ba6228ac0ee383b783734c386631856a339f0f2&searchStr=add-new&selectedJob=280540364
:ahal, I've added you as a CC to this bug in case you have any thoughts/ideas about this issue. For context, the vismet tasks are currently created dynamically with this transform: https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/visual_metrics_dep.py#21
A `run-visual-metrics` attribute dictates which tasks should have a vismet task created for them: https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/raptor.yml#1023
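As a rough illustration of how an attribute-keyed dependent-task transform like this could work (the task dicts here are simplified and hypothetical, not the actual `visual_metrics_dep.py` code), a sketch might look like:

```python
# Hedged sketch: generate a dependent "vismet" task for every task whose
# attributes opt in via run-visual-metrics. The real transform operates on
# full taskgraph task definitions; these dicts are illustrative only.

def make_vismet_tasks(tasks):
    """Yield a vismet task for each task whose attributes request one."""
    for task in tasks:
        if not task.get("attributes", {}).get("run-visual-metrics"):
            continue
        yield {
            "label": task["label"].replace("browsertime", "vismet"),
            # Depend on the originating browsertime task so the vismet
            # task can fetch its video artifacts.
            "dependencies": {"browsertime": task["label"]},
        }

tasks = [
    {"label": "test-linux64/opt-browsertime-tp6",
     "attributes": {"run-visual-metrics": True}},
    {"label": "test-linux64/opt-raptor-other", "attributes": {}},
]
vismet = list(make_vismet_tasks(tasks))
```

This also hints at the bug's cause: the vismet tasks are generated at decision-task time from the original task graph, so a later retrigger of a browsertime task has no transform run against it.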
Reporter
Comment 1 • 5 years ago
This problem is worse than I originally thought. I can schedule multiple vismet tasks, but each of them only processes the results from the first browsertime task that was created for them. (We can't retrigger, and we can't schedule multiple runs.) So to be able to analyze the vismet data, I would have to make one push per trial, which is a bit much. Here's a push where I tried it with 50 retriggers: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=20528cc4f915039c970db9adacbb784491e2c01a
Updated • 5 years ago
Reporter
Comment 2 • 5 years ago
We have a partial solution here (thanks to :aki for the help):
- Select a browsertime test task (not a vismet task) that you want to retrigger.
- In the pop-up menu, select the `...` and click on `Custom Action`.
- Pick the `retrigger` action if it wasn't already selected.
- Set `downstream` to `true`.
- Enter the number of times it should be retriggered in `times`.
- Retrigger!

This is the only method we have to retrigger these tasks. The `--rebuild` flag doesn't work, but it would be cool if we could eventually get that to work with these tasks. Also note that the drop-down arrow on the push that provides the `Custom Push Action` option doesn't work for this either.
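For reference, the custom retrigger action described in the steps above effectively submits an action input with the `downstream` and `times` fields; a minimal sketch of that payload (the exact schema lives in the in-tree taskgraph retrigger action, and `times=5` is just an example value):

```python
# Sketch of the input the Treeherder "Custom Action" dialog sends to the
# taskgraph "retrigger" action, matching the fields named in the steps above.
retrigger_input = {
    "downstream": True,  # also re-create dependent tasks (the vismet tasks)
    "times": 5,          # how many retriggers to schedule
}
```

Setting `downstream` to `true` is the crucial part: without it, only the browsertime task is re-run and no fresh vismet task is created for it.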
Reporter
Comment 3 • 5 years ago
I've added that information to the wiki as well: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime
Comment 4 • 4 years ago
I don't know if this workflow still works, because you can't retrigger jobs that succeed anymore.
Reporter
Comment 5 • 4 years ago
Last I checked (last week), it still works for the browsertime tests. Are you referring to raptor-browsertime or mozperftest?
Comment hidden (Intermittent Failures Robot)
Comment 7 • 4 years ago
Using `./mach try fuzzy --retry N` and getting stale data is still a big footgun.
Comment 8 • 4 years ago
Can we retitle this bug so it shows up in searches? Maybe "retriggered browsertime tasks only report results from the first task" or "retriggered browsertime tasks have invalid data in perfherder" or something? (I didn't know the significance of "visual metrics", so did not realize this bug was describing my problem.)
I forget about this and run into it all the time. And I'm still not sure what scenarios run into this problem, so I'm never sure whether to trust e.g. regression bugs when they're complaining about massive numbers of browsertime changes; Perfherder will report the number of tasks run, so everything appears to be fine.
Reporter
Comment 9 • 4 years ago
Sure thing, I've updated the title.
The only time you would hit this issue is when you retrigger the `*-vismet` tasks in Treeherder - the correct way of doing this is described here: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime/VisualMetrics
The perf sheriffs should know how to retrigger the vismet tasks correctly, so you shouldn't worry about this happening in regression/improvement bugs. When you look at the Perfherder Compare view, if you see a metric with multiple runs that only report a single value, then you know it wasn't retriggered correctly.
We're looking at getting visual-metrics running in the test tasks themselves to get around this issue, but we need to get FFmpeg and ImageMagick installed on the machines first.
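The symptom described above (multiple runs collapsing to a single value, shown as "± 0" in the Compare view) can be checked mechanically. A minimal sketch, assuming the per-run replicate values have already been extracted from Perfherder:

```python
def looks_misretriggered(replicates):
    """Flag a metric whose runs all report one identical value.

    Multiple runs that collapse to a single distinct value (stddev 0,
    "± 0" in the Compare view) suggest every vismet task processed the
    same first browsertime run rather than its own retrigger.
    """
    return len(replicates) > 1 and len(set(replicates)) == 1

# Suspicious: three "runs" but only one distinct value.
assert looks_misretriggered([230.0, 230.0, 230.0])
# Healthy retriggers show normal run-to-run noise.
assert not looks_misretriggered([230.0, 228.5, 231.2])
```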
Comment 10 • 4 years ago
(In reply to Greg Mierzwinski [:sparky] from comment #9)
> The only time you would hit this issue is when you retrigger the `*-vismet` tasks in Treeherder - the correct way of doing this is described here: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime/VisualMetrics

Right, but that's the common case if you're using `mach try fuzzy --rebuild N`.

> The perf sheriffs should know how to retrigger the vismet tasks correctly, so you shouldn't worry about this happening in regression/improvement bugs. When you look at the Perfherder Compare view, if you see a metric with multiple runs that only report a single value, then you know it wasn't retriggered correctly.

Right, so in my example I have to either know what "± 0" means, or notice that confidence is either blank (?) or "Infinity", despite "Total Runs" being > 1.

And as a result of looking at this, I'm noticing that my --rebuild 7 pushes are reporting some useful data -- specifically, the non-vismet metrics. Which apparently I can pattern-match because the vismet metrics have spelled-out names like FirstVisualChange and the non-vismet ones don't. Or, if looking at the results from individual tasks, the non-vismet ones have abbreviated names like "fcp". Ah, I see now that those show up in the comparison view as "subtests". It would be nice if the vismet ones were grouped the same way, or if neither was grouped. (Or at least, with my limited understanding I think it would be nice. I'm not really understanding the overall picture, so I could be way off base.)

> We're looking at getting visual-metrics running in the test tasks themselves to get around this issue, but we need to get FFmpeg and ImageMagick installed on the machines first.

Right, I saw that in the dependent bug, though it seems a little unfortunate to need that. It seems like the taskgraph should have the necessary smarts added for this. Still, whatever works. Expediency is good.
Reporter
Comment 11 • 4 years ago
(In reply to Steve Fink [:sfink] [:s:] from comment #10)
> (In reply to Greg Mierzwinski [:sparky] from comment #9)
> > The only time you would hit this issue is when you retrigger the `*-vismet` tasks in Treeherder - the correct way of doing this is described here: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime/VisualMetrics
>
> Right, but that's the common case if you're using `mach try fuzzy --rebuild N`.
>
> Ah, I see now that those show up in the comparison view as "subtests". It would be nice if the vismet ones were grouped the same way, or if neither was grouped.

Oh right, sorry; the `--rebuild` doesn't work either. Essentially, every standard method except the one mentioned in the wiki will not work. I think what you're describing regarding the subtests vs. non-subtests is something we were looking into; :davehunt, can you provide more information?

> Right, I saw that in the dependent bug, though it seems a little unfortunate to need that. It seems like the taskgraph should have the necessary smarts added for this. Still, whatever works. Expediency is good.

I fully agree. I'm getting quite annoyed by this Taskcluster limitation, so I'm going to try to fix it another way - posting a patch soon.
Reporter
Comment 12 • 4 years ago
I'll post the patch in bug 1686118 since it solves the majority of the issue, but it won't fix the issues with `--rebuild`, since that would involve modifying Taskcluster code and I'm not sure what the best way to change this may be: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/create.py#89-92
Comment 13 • 4 years ago
(In reply to Greg Mierzwinski [:sparky] from comment #11)
> I think what you're describing regarding the subtests vs. non-subtests is something we were looking into; :davehunt, can you provide more information?

Indeed. In bug 1672794 we will stop summarising results as a geometric mean, which means they will report directly in the compare view and not under the "subtests" view.
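For context, the geometric-mean summary mentioned here combines the per-subtest values into one number like so (a minimal sketch of the formula, not Perfherder's actual implementation; the example values are made up):

```python
import math

def geometric_mean(values):
    """Summarize subtest values as the nth root of their product,
    computed via logs to avoid overflow on large inputs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# e.g. two vismet subtest values in milliseconds: the geometric mean
# (sqrt of 100 * 400) sits at 200, not the arithmetic mean of 250.
summary = geometric_mean([100.0, 400.0])
```

Dropping this summarization means each metric reports its own value in the compare view instead of being folded into one "parent" number with the others listed as subtests.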
Comment 15 • 4 years ago
I think this bug, as titled, "Retriggered browsertime vismet tasks only report results from the first task", can now be closed. (?)

Well, if the retriggering is done via the Taskcluster UI, then this can be closed, because I was doing that today and it worked fine.

But it would be really handy to have the `--rebuild` option, because it can be extremely time consuming to add the jobs if you want to test, for instance, 10 sites on 3 platforms.

And maybe bug 1684946 can be updated to explicitly be about the `--rebuild` case?
Comment 16 • 4 years ago
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #15)
> I think this bug, as titled, "Retriggered browsertime vismet tasks only report results from the first task", can now be closed. (?)
>
> Well, if the retriggering is done via the Taskcluster UI, then this can be closed, because I was doing that today and it worked fine.

True. I guess I have a bad habit of using "retrigger" to mean passing the `--rebuild` flag.

> But it would be really handy to have the `--rebuild` option, because it can be extremely time consuming to add the jobs if you want to test, for instance, 10 sites on 3 platforms.
>
> And maybe bug 1684946 can be updated to explicitly be about the `--rebuild` case?

Yeah, I don't know what the best way to arrange the bugs is. Retriggering was fixed in bug 1686118. We could either use this bug or bug 1684946 for `--rebuild`. This bug has the best description of what happens when things go wrong.

I guess I can just add the additional info to bug 1684946 and use it.
Updated • 4 years ago