Open Bug 1164925 Opened 9 years ago Updated 2 years ago

31.1% win7 main normal file io bytes regression on inbound (v.40) from revision b3d61b6150fd

Categories

(Core :: Audio/Video: Playback, defect, P3)


Tracking


Tracking Status
firefox40 --- unaffected
firefox41 --- affected
firefox42 --- ?
firefox43 --- ?

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

In the xperf job, we collect a lot of metrics about file and network I/O; most of these indicate when we are loading a new DLL or doing something we either meant to do or did accidentally. Regressions here are not as common as in other tests.

In fact we saw this first on mozilla-aurora, since the code landed on mozilla-inbound the weekend before the uplift; it is easy to reproduce on inbound as well. Here is a graph showing non-PGO and PGO on inbound:
http://graphs.mozilla.org/graph.html#tests=[[244,131,25],[244,63,25]]&sel=1430643239805.1375,1431599575683&displayrange=90&datatype=geo

This test is bimodal.  Not sure why, but it is.  The shift is easy to see, and via retriggers it shows itself well:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=xperf&fromchange=fa8214db7c82&tochange=3e46e2f11769

xperf only runs on Windows 7.  I noticed there are 4 commits in the push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=b3d61b6150fd

It is hard to tell which one is the culprit; I am just thankful that they all have the same author!
:cpearce, do you have any idea what could be causing main-thread file I/O in the set of patches above?  No need to jump on the backout wagon; maybe this is intentional or makes sense.
Flags: needinfo?(cpearce)
In b3d61b6150fd we check whether a plugin DLL exists before reporting that we can instantiate the plugin. That check happens in a function called by JavaScript. We don't actually read the file, but we do check that it's on disk, so I suppose that could have caused the regression.

We suspect that anti-virus is deleting the plugin, so until we can confirm that is not the case, I think we need to keep this check in.
Flags: needinfo?(cpearce)
I should also point out we're not sure *when* the plugin DLL disappears, which is why we want to check on every call, not just on the first. The plugin could be deleted at runtime, for example.
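
For reference, the check involved is roughly of this shape (a minimal sketch using plain Win32, not the actual Gecko code):

#include <windows.h>

// Illustrative only: stat the plugin DLL without reading its contents. This
// has to run on every query because the file could be deleted (e.g. by
// anti-virus) at any point after install.
static bool PluginDllStillOnDisk(const wchar_t* aDllPath) {
  DWORD attrs = GetFileAttributesW(aDllPath);
  return attrs != INVALID_FILE_ATTRIBUTES &&
         !(attrs & FILE_ATTRIBUTE_DIRECTORY);
}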
cool, thanks for the info!  vladan, do you have any other questions or thoughts on possibly closing the bug out?
Flags: needinfo?(vdjeric)
(In reply to Joel Maher (:jmaher) from comment #4)
> cool, thanks for the info!  vladan, do you have any other questions or
> thoughts on possibly closing the bug out?

It doesn't make sense for the missing DLL check to cause a 30% increase (~3MB) in main-thread bytes read.
Chris, can you confirm the increase by comparing the startupSessionRestoreReadBytes & startupSessionRestoreWriteBytes fields in about:telemetry in builds with and without the patch?
Flags: needinfo?(vdjeric)
.. these fields are in the Simple Measurements section
Sure.

Current m-c tip (633684c4ebc4):

startupSessionRestoreReadBytes 	143593167
startupSessionRestoreWriteBytes 	4558 

Current m-c tip (633684c4ebc4) with b3d61b6150fd reverted:

startupSessionRestoreReadBytes 	143593385
startupSessionRestoreWriteBytes 	4030 

That seems to indicate b3d61b6150fd isn't the culprit. However, I would not expect the EME code to be hit on startup unless the page shown on startup was playing Adobe EME video.
maybe our measurement of xperf is doing something odd.  I do know that the specific set of patches in the push b3d61b6150fd did increase the xperf number; we have a series of retriggers and it is pretty clear.
What HTML content does the xperf test run Firefox over? Is it just the default startup page, or do we actually load some web pages as part of the profiling?
we cycle over the tp5 pageset, so there are a lot of pages we cover.
cpearce, any thoughts here?  I think the next steps are unclear: our automation shows something, but other data from running locally doesn't confirm it.
Flags: needinfo?(cpearce)
jmaher: We should push one run to try with media.eme.enabled=false and another without it, to verify that it is indeed EME that's causing this. We should also do a try run with and without these patches backed out to confirm it's these patches. I have trouble believing EME could cause this, but I can't argue against hard data. ;)

I think we've assumed it's b3d61b6150fd, but it could be others in the regressing range.

The main reason I have trouble believing it's EME is that not many sites use it yet, particularly since we implement a newer version of the spec than Chrome, IE and Safari do, so sites that use EME probably don't use the version we implement.

One theory: this patch makes us try to re-download the Adobe EME plugin from Adobe's servers if it has disappeared from disk after we've installed it. Could something be deleting the EME plugin or otherwise causing us to re-download it? For example, do we load each new page with a clean profile (which would require a new CDM download)? Or perhaps there's anti-virus running on the test machines?
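
The theory amounts to something like the following (hypothetical names, not the Gecko code):

#include <windows.h>
#include <cstdio>

// Placeholder for whatever would actually queue the network fetch.
static void ScheduleCdmDownload() { std::puts("re-downloading CDM..."); }

// If the CDM plugin file has vanished from the profile (anti-virus, a fresh
// profile per page load, ...), trigger a re-download, which would show up as
// extra file and network I/O in the xperf numbers.
static void EnsureCdmPresent(const wchar_t* aCdmPath) {
  if (GetFileAttributesW(aCdmPath) == INVALID_FILE_ATTRIBUTES) {
    ScheduleCdmDownload();
  }
}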
Blocks: EME
Flags: needinfo?(cpearce) → needinfo?(jmaher)
keep in mind that b3d61b6150fd is the revision that the 4 patches landed on. I use it as a marker since that is the granularity we have; when I reference b3d61b6150fd, I mean all of those changes.

I have done a few pushes to try (all building upon each other):
backout b3d61b6150fd: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ad1581db1ba
backout 34d3296e6980: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c83bfc0f3cf
backout b0ebc8d237f0: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e44f209eeb60

unfortunately, 'hg backout 03dc784d5456' does not apply cleanly, so I didn't do it.

let's see what this turns up; I can dig further if needed.
Priority: -- → P1
it looks like 34d3296e6980 is the actual changeset of concern.  Is this the one we could test with/without the media.eme.enabled=false preference?

looking at the backout of b3d61b6150fd, we have similar numbers, but for both 34d3296e6980 and b0ebc8d237f0 we see a drop of about 20MB in the overall results.

I am not sure why changing line endings for 1 file did it:
https://hg.mozilla.org/integration/mozilla-inbound/rev/34d3296e6980

maybe I am reading the results incorrectly, but the try pushes above help show the difference. I really wish try server would show the full queue of pushes.

:cpearce, any thoughts on this?
Flags: needinfo?(jmaher) → needinfo?(cpearce)
(In reply to Joel Maher (:jmaher) from comment #14)
> it looks like 34d3296e6980 is the actual changeset of concern.  Is this the
> one we could test with/without the media.eme.enabled=false preference?
> 
> looking at the backout of b3d61b6150fd, we have similar numbers, but for
> both 34d3296e6980 and b0ebc8d237f0 we see a drop of about 20MB in the
> overall results.
> 
> I am not sure why changing line endings for 1 file did it:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/34d3296e6980
> 
> maybe I am reading the results incorrectly, but the try pushes above help
> show the difference. I really wish try server would show the full queue of
> pushes.
> 
> :cpearce, any thoughts on this?

I'm pretty unconvinced that whitespace changes caused this. ;)

You should spin up more runs on those pushes; your sample size is 4 there, and my high school statistics teacher taught me you need at least 30 runs for statistically significant results.
Flags: needinfo?(cpearce)
I did a few more retriggers and mapped out the data.  The results come in tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the regressed range.

The prior range was ~260MB -> ~340MB and that is what I wanted to see.

Which leaves 1 more patch in the series:
https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456

Since I had trouble with 'hg backout' of this, could you push to try with just that backed out?  of course we could 'hg update' to the revision prior, but that would be served by just looking at the existing data.
Flags: needinfo?(cpearce)
this bug has been going on for 2 weeks with no real action.  If this were a unit test failure, it would have been backed out or the test would have been disabled.

cpearce: can you please comment on this bug about what the next steps are and the timeline for them?  Otherwise we are left with backing all of the patches out.
I'm pretty swamped at the moment. This is on our list of EME P1s, so we'll get to it as soon as we can.
This is going to be on Beta in a couple of weeks. Do we have any indication whether or not this patch fixes the issue in bug 1160321?
Actually, it was already uplifted to Beta, so it will be on Release on June 29th.
(In reply to Joel Maher (:jmaher) from comment #16)
> I did a few more retriggers and mapped out the data.  The results come in
> tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the
> regressed range.
> 
> The prior range was ~260MB -> ~340MB and that is what I wanted to see.
> 
> Which leaves 1 more patch in the series:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456

This changeset also changes HTMLMediaElement.canPlayType() to load the Windows Media Foundation DLLs and ensure they're on disk and usable before reporting to JavaScript that we can play audio/video. That may explain the I/O spike you're seeing; we're loading those DLLs.
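
Roughly, that check is of this shape (a sketch with an illustrative DLL subset, not the actual patch):

#include <windows.h>

// Sketch: only report from canPlayType() that audio/video is playable if the
// WMF DLLs can actually be loaded. The LoadLibraryW() calls are what hit disk.
static bool CanLoadWmfDlls() {
  static const wchar_t* const kDlls[] = {L"mfplat.dll", L"mf.dll",
                                         L"dxva2.dll"};  // illustrative subset
  for (const wchar_t* dll : kDlls) {
    HMODULE module = LoadLibraryW(dll);
    if (!module) {
      return false;  // Missing or unusable DLL -> report that we can't play.
    }
    FreeLibrary(module);
  }
  return true;
}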
(In reply to Joel Maher (:jmaher) from comment #16)
> I did a few more retriggers and mapped out the data.  The results come in
> tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the
> regressed range.
> 
> The prior range was ~260MB -> ~340MB and that is what I wanted to see.
> 
> Which leaves 1 more patch in the series:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456
> 
> Since I had trouble with 'hg backout' of this, could you push to try with
> just that backed out?  of course we could 'hg update' to the revision prior,
> but that would be served by just looking at the existing data.

Joel, please look at this push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=96d7aef0fba9

This removes the check that ensures we can use Windows Media Foundation to decode before reporting to JavaScript that we can play audio and video.

If this is the regressing change, I think we should keep the check anyway, as it's critical for YouTube to work correctly: there are plenty of Windows machines out in the wild that don't have a working WMF for a variety of reasons.

When we initialize WMF we load this list of DLLs:
https://dxr.mozilla.org/mozilla-central/source/dom/media/wmf/WMFUtils.cpp#430

We can drop propsys.dll and mfreadwrite.dll from that list, since the code that needs them has been preffed off, and I have a patch lying around to remove that preffed-off code anyway.
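
As an illustration (the real list lives at the WMFUtils.cpp link above; the first few names here are assumptions), the trimmed list would look something like this:

// Illustrative WMF DLL preload list; only the two dropped entries are
// confirmed by the comment above, the remaining names are assumed.
static const wchar_t* const kWmfDllsToPreload[] = {
    L"mfplat.dll",
    L"mf.dll",
    L"dxva2.dll",
    L"evr.dll",
    // propsys.dll and mfreadwrite.dll dropped: only needed by code that is
    // already preffed off, so skipping them saves two LoadLibraryW() reads.
};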
Flags: needinfo?(cpearce) → needinfo?(jmaher)
I did a series of retriggers on that try push; unfortunately, we are not seeing improvements.  It sounds like we might not be able to fix much here.
Flags: needinfo?(jmaher)
cpearce, is this talos regression still an issue? Can we close this bug?
Flags: needinfo?(cpearce)
I would like to circle back after we've shipped and investigate whether there are changes we can make to prevent the regression. I suspect there won't be, however.

We could P2 it to reflect that we don't need to block shipping.
Flags: needinfo?(cpearce)
Priority: P1 → P2
Component: Audio/Video → Audio/Video: Playback
Mass change P2 -> P3
Priority: P2 → P3
Severity: normal → S3