Open
Bug 1164925
Opened 9 years ago
Updated 2 years ago
31.1% win7 main normal file io bytes regression on inbound (v.40) from revision b3d61b6150fd
Categories
(Core :: Audio/Video: Playback, defect, P3)
Core
Audio/Video: Playback
Tracking
()
NEW
| Version | Tracking | Status |
|---|---|---|
| firefox40 | --- | unaffected |
| firefox41 | --- | affected |
| firefox42 | --- | ? |
| firefox43 | --- | ? |
People
(Reporter: jmaher, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: perf, regression, Whiteboard: [talos_regression])
In the xperf job we collect a lot of metrics about file and network I/O; most of these indicate when we are loading a new DLL, or doing something that we either meant to do or did accidentally. Regressions here are not that common compared to other test regressions. We actually saw this first on mozilla-aurora, since the code landed on mozilla-inbound the weekend before the uplift, but it is easy to reproduce on inbound.

Here is a graph showing non-PGO and PGO on inbound:
http://graphs.mozilla.org/graph.html#tests=[[244,131,25],[244,63,25]]&sel=1430643239805.1375,1431599575683&displayrange=90&datatype=geo

This test is bi-modal. Not sure why, but it is. The shift is easy to see, and via retriggers it shows itself well:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=xperf&fromchange=fa8214db7c82&tochange=3e46e2f11769

xperf only runs on Windows 7. I noticed there are 4 commits in the push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=b3d61b6150fd

It is hard to tell which one is the culprit; I am just thankful that they all have the same author!
Reporter
Comment 1•9 years ago
:cpearce, do you have any idea what could be causing main-thread file I/O in the set of patches above? No need to jump on the backout wagon; maybe this is intentional or makes sense.
Flags: needinfo?(cpearce)
Comment 2•9 years ago
In b3d61b6150fd we check whether a plugin DLL exists before reporting that we can instantiate the plugin. That check happens in a function called by JavaScript. We don't actually read the file, but we do check that it's on disk, so I suppose that could have caused the regression. We suspect that anti-virus is deleting the plugin, so until we can confirm that is not the case, I think we need to keep this check in.
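The check described here can be sketched roughly like this (a hypothetical illustration, not the actual Gecko code; the function name and the use of std::filesystem are made up for the sketch):

```cpp
#include <filesystem>
#include <string>

// Hypothetical sketch of the check comment 2 describes: before telling
// JavaScript the plugin can be instantiated, confirm its DLL is still
// on disk (anti-virus may have deleted it after install).
bool PluginLibraryStillOnDisk(const std::string& aLibraryPath) {
  std::error_code ec;
  // exists() only stats the file; it does not read or load it, so the
  // extra main-thread I/O is one metadata access per call.
  return std::filesystem::exists(aLibraryPath, ec) && !ec;
}
```

std::filesystem is C++17; the real code would go through Gecko's own file APIs, but the cost profile (one stat per canPlayType-style query) is the same idea, which is why a pure existence check should not account for megabytes of extra reads.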
Flags: needinfo?(cpearce)
Comment 3•9 years ago
I should also point out that we're not sure *when* the plugin DLL disappears, which is why we want to check on every call, not just the first. The plugin could be deleted at runtime, for example.
Reporter
Comment 4•9 years ago
cool, thanks for the info! vladan, do you have any other questions or thoughts on possibly closing the bug out?
Reporter
Updated•9 years ago
Flags: needinfo?(vdjeric)
Comment 5•9 years ago
(In reply to Joel Maher (:jmaher) from comment #4)
> cool, thanks for the info! vladan, do you have any other questions or
> thoughts on possibly closing the bug out?

It doesn't make sense for the missing-DLL check to cause a 30% increase (~3MB) in main-thread bytes read. Chris, can you confirm the increase by comparing the startupSessionRestoreReadBytes & startupSessionRestoreWriteBytes fields in about:telemetry in builds with and without the patch?
Flags: needinfo?(vdjeric)
Comment 6•9 years ago
.. these fields are in the Simple Measurements section
Comment 7•9 years ago
Sure.

Current m-c tip (633684c4ebc4):
startupSessionRestoreReadBytes: 143593167
startupSessionRestoreWriteBytes: 4558

Current m-c tip (633684c4ebc4) with b3d61b6150fd reverted:
startupSessionRestoreReadBytes: 143593385
startupSessionRestoreWriteBytes: 4030

That seems to indicate b3d61b6150fd isn't the culprit. I would not expect the EME code to be hit on startup anyway, unless the page shown on startup was playing Adobe EME video.
Reporter
Comment 8•9 years ago
maybe our measurement of xperf is doing something odd. I do know that the specific set of patches in push b3d61b6150fd did increase the xperf number; we have a series of retriggers, and it is pretty clear.
Comment 9•9 years ago
What HTML content does the xperf test run Firefox over? Is it just the default startup page, or do we actually load some web pages as part of the profiling?
Reporter
Comment 10•9 years ago
we cycle over the tp5 pageset, so there are a lot of pages we cover.
Reporter
Comment 11•9 years ago
cpearce, any thoughts here? I think the next steps are unclear: our automation shows something, but other data from running locally doesn't confirm it.
Reporter
Updated•9 years ago
Flags: needinfo?(cpearce)
Comment 12•9 years ago
jmaher: We should push one run to try with media.eme.enabled=false and another without it, to verify that it is indeed EME that's causing this, and also do try runs with and without these patches backed out to confirm it's these patches. I have trouble believing EME could cause this, but I can't argue against hard data. ;) I think we've assumed it's b3d61b6150fd, but it could be others in the regressing range.

The main reason I have trouble believing it's EME is that not many sites use it yet, particularly since we implement a newer version of the spec than what Chrome, IE and Safari implement, so sites that use EME probably don't use the version we implement.

One theory: this patch makes us try to re-download the Adobe EME plugin from Adobe's servers if it has disappeared from disk after we've installed it. Could something be deleting the EME plugin or otherwise causing us to re-download it? For example, do we load each new page with a clean profile (which would require a new CDM download)? Or perhaps there's anti-virus running on the test machines?
Blocks: EME
Flags: needinfo?(cpearce) → needinfo?(jmaher)
Reporter
Comment 13•9 years ago
keep in mind that b3d61b6150fd is the revision that 4 patches landed on; I use it as a marker since that is the granularity we have. When I reference b3d61b6150fd, I mean all the changes.

I have done a few pushes to try (all building upon each other):
backout b3d61b6150fd: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ad1581db1ba
backout 34d3296e6980: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c83bfc0f3cf
backout b0ebc8d237f0: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e44f209eeb60

Unfortunately 'hg backout 03dc784d5456' is not clean, so I didn't do it. Let's see what this turns up; I can dig further if needed.
Updated•9 years ago
status-firefox40:
--- → unaffected
Priority: -- → P1
Reporter
Comment 14•9 years ago
it looks like 34d3296e6980 is the actual changeset of concern. Is this the one we could test with/without the media.eme.enabled=false preference?

Looking at the backout of b3d61b6150fd, we have similar numbers, but for both 34d3296e6980 and b0ebc8d237f0 we see a drop of about 20MB in the overall results.

I am not sure why changing line endings for 1 file did it:
https://hg.mozilla.org/integration/mozilla-inbound/rev/34d3296e6980

Maybe I am reading the results incorrectly, but the try pushes above help show the difference. I really wish try server would show the full queue of pushes.

:cpearce, any thoughts on this?
Flags: needinfo?(jmaher) → needinfo?(cpearce)
Comment 15•9 years ago
(In reply to Joel Maher (:jmaher) from comment #14)
> it looks like 34d3296e6980 is the actual changeset of concern. Is this the
> one we could test with/without the media.eme.enabled=false preference?
>
> looking at the backout of b3d61b6150fd, we have similar numbers, but for
> both 34d3296e6980 and b0ebc8d237f0 we see a drop of about 20MB in the
> overall results.
>
> I am not sure why changing line endings for 1 file did it:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/34d3296e6980
>
> maybe I am reading the results incorrectly, but the try pushes above help
> show the difference. I really wish try server would show the full queue of
> pushes.
>
> :cpearce, any thoughts on this?

I'm pretty unconvinced that whitespace changes caused this. ;) You should spin up more runs on those pushes; your sample size is 4 there, and my high school statistics teacher taught me you need at least 30 runs for statistically significant results.
Flags: needinfo?(cpearce)
Reporter
Comment 16•9 years ago
I did a few more retriggers and mapped out the data. The results come in tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the regressed range. The prior range was ~260MB → ~340MB, and that is what I wanted to see.

Which leaves 1 more patch in the series:
https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456

Since I had trouble with 'hg backout' of this, could you push to try with just that backed out? Of course we could 'hg update' to the revision prior, but that would be served by just looking at the existing data.
Flags: needinfo?(cpearce)
Reporter
Comment 17•9 years ago
this bug has been going on for 2 weeks with no real action. If this were a unittest failure, it would have been backed out or the test would have been disabled. cpearce, can you please comment on this bug about what the next steps are and the timeline for them? Otherwise we are left with backing all of the patches out.
Comment 18•9 years ago
I'm pretty swamped at the moment. This is on our list of EME P1s, so we'll get to it as soon as we can.
Comment 19•9 years ago
This is going to be on Beta in a couple of weeks. Do we have any indication whether this patch fixes or doesn't fix the issue in bug 1160321?
Comment 20•9 years ago
Actually, it was already uplifted to Beta, so it will be on Release on June 29th
Comment 21•9 years ago
(In reply to Joel Maher (:jmaher) from comment #16)
> I did a few more retriggers and mapped out the data. The results come in
> tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the
> regressed range.
>
> The prior range was ~260MB -> ~340MB and that is what I wanted to see.
>
> Which leaves 1 more patch in the series:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456

This changeset also changes HTMLMediaElement.canPlayType() to load the Windows Media Foundation DLLs and ensure they're on disk and usable before reporting to JavaScript that we can play audio/video. That may explain the I/O spike you're seeing; we're loading those DLLs.
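As a rough model of the behavior described above (hypothetical names throughout; the probe callback stands in for the actual WMF DLL loads, which are the file I/O the xperf test observes):

```cpp
#include <functional>
#include <string>

// Hypothetical model of the canPlayType() change: only report that a
// type is playable after confirming the platform decoder libraries
// actually load. aProbeDecoderLibraries stands in for the real WMF
// DLL loads.
std::string CanPlayTypeSketch(
    const std::string& aMimeType,
    const std::function<bool()>& aProbeDecoderLibraries) {
  if (aMimeType != "video/mp4" && aMimeType != "audio/mp4") {
    return "";  // containers this sketch never routes through WMF
  }
  // The probe runs on every call, so each page that calls
  // canPlayType() pays the DLL-load I/O again.
  return aProbeDecoderLibraries() ? "maybe" : "";
}
```

Since the tp5 pageset cycles through many pages, any page that calls canPlayType() would trigger this probe, which fits a regression that shows up across the whole run rather than only at startup.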
Comment 22•9 years ago
(In reply to Joel Maher (:jmaher) from comment #16)
> I did a few more retriggers and mapped out the data. The results come in
> tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the
> regressed range.
>
> The prior range was ~260MB -> ~340MB and that is what I wanted to see.
>
> Which leaves 1 more patch in the series:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456
>
> Since I had trouble with 'hg backout' of this, could you push to try with
> just that backed out? of course we could 'hg update' to the revision prior,
> but that would be served by just looking at the existing data.

Joel, please look at this push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=96d7aef0fba9

This removes the check that ensures we can use Windows Media Foundation to decode before reporting to JavaScript that we can play audio and video. If this is the regressing change, I think we should keep it, as it's critical for YouTube to work correctly: there are plenty of Windows machines out in the wild which don't have a working WMF for a variety of reasons.

When we initialize WMF we load this list of DLLs:
https://dxr.mozilla.org/mozilla-central/source/dom/media/wmf/WMFUtils.cpp#430

We can drop propsys.dll and mfreadwrite.dll from that list, since the code that needs them has been preffed off, and I have a patch lying around to remove that preffed-off code anyway.
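The suggested trim could look roughly like this (a sketch only; the first three DLL names are placeholders rather than the exact WMFUtils.cpp list, and only propsys.dll and mfreadwrite.dll are taken from the comment above):

```cpp
#include <string>
#include <vector>

// Sketch of the DLL-list trim suggested in comment 22. Fewer entries
// means fewer libraries loaded (and less file I/O) on each WMF
// initialization. The first three names are placeholders.
std::vector<std::string> RequiredWmfDlls(bool aIncludePreffedOff) {
  std::vector<std::string> dlls = {"mfplat.dll", "mf.dll", "dxva2.dll"};
  if (aIncludePreffedOff) {
    // Needed only by code paths that have since been preffed off.
    dlls.push_back("propsys.dll");
    dlls.push_back("mfreadwrite.dll");
  }
  return dlls;
}
```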
Flags: needinfo?(cpearce) → needinfo?(jmaher)
Reporter
Comment 23•9 years ago
I did a series of retriggers on that try push; unfortunately we are not seeing improvements. It sounds like we might not be able to fix much here.
Flags: needinfo?(jmaher)
Comment 24•9 years ago
cpearce, is this talos regression still an issue? Can we close this bug?
Flags: needinfo?(cpearce)
Comment 25•9 years ago
I would like to circle back after we've shipped and investigate whether there are changes we can make to prevent the regression. I suspect there won't be, however. We could make this P2 to reflect that it doesn't need to block shipping.
Flags: needinfo?(cpearce)
Updated•9 years ago
Updated•9 years ago
Component: Audio/Video → Audio/Video: Playback
Mass change P2 → P3
Priority: P2 → P3
Updated•2 years ago
Severity: normal → S3