Open Bug 1164925 Opened 9 years ago Updated 2 years ago

31.1% win7 main normal file io bytes regression on inbound (v.40) from revision b3d61b6150fd

Categories

(Core :: Audio/Video: Playback, defect, P3)


Tracking


Tracking Status
firefox40 --- unaffected
firefox41 --- affected
firefox42 --- ?
firefox43 --- ?

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

In the xperf job, we collect a lot of metrics about file and network I/O; most of these indicate when we are loading a new DLL or doing something we either meant to do or did accidentally. Regressions here are not as common as in other tests.

In fact we saw this first on mozilla-aurora, since the code landed on mozilla-inbound the weekend before the uplift; it is easy to reproduce on inbound as well. Here is a graph showing non-PGO and PGO on inbound:
http://graphs.mozilla.org/graph.html#tests=[[244,131,25],[244,63,25]]&sel=1430643239805.1375,1431599575683&displayrange=90&datatype=geo

This test is bimodal.  Not sure why, but it is.  The shift is easy to see, and via retriggers it shows itself well:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=xperf&fromchange=fa8214db7c82&tochange=3e46e2f11769

xperf only runs on Windows 7.  I noticed there are 4 commits in the push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=b3d61b6150fd

It is hard to tell which one is the culprit; I am just thankful that they all have the same author!
:cpearce, do you have any idea what could be causing main-thread file I/O in the set of patches above?  No need to jump on the backout wagon; maybe this is intentional or makes sense.
Flags: needinfo?(cpearce)
In b3d61b6150fd we check whether a plugin DLL exists before reporting that we can instantiate the plugin. That check happens in a function called by JavaScript. We don't actually read the file, but we do check that it's on disk, so I suppose that could have caused the regression.

We suspect that anti-virus is deleting the plugin, so until we can confirm that is not the case, I think we need to keep this check in.
Flags: needinfo?(cpearce)
I should also point out we're not sure *when* the plugin DLL disappears, which is why we want to check on every call, not just on the first. The plugin could be deleted at runtime, for example.
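
For reference, the check involved is roughly of this shape (a minimal sketch using plain Win32, not the actual Gecko code):

#include <windows.h>

// Illustrative only: stat the plugin DLL without reading its contents. This
// has to run on every query because the file could be deleted (e.g. by
// anti-virus) at any point after install.
static bool PluginDllStillOnDisk(const wchar_t* aDllPath) {
  DWORD attrs = GetFileAttributesW(aDllPath);
  return attrs != INVALID_FILE_ATTRIBUTES &&
         !(attrs & FILE_ATTRIBUTE_DIRECTORY);
}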
cool, thanks for the info!  vladan, do you have any other questions or thoughts on possibly closing the bug out?
Flags: needinfo?(vdjeric)
(In reply to Joel Maher (:jmaher) from comment #4)
> cool, thanks for the info!  vladan, do you have any other questions or
> thoughts on possibly closing the bug out?

It doesn't make sense for the missing DLL check to cause a 30% increase (~3MB) in main-thread bytes read.
Chris, can you confirm the increase by comparing the startupSessionRestoreReadBytes & startupSessionRestoreWriteBytes fields in about:telemetry in builds with and without the patch?
Flags: needinfo?(vdjeric)
.. these fields are in the Simple Measurements section
Sure.

Current m-c tip (633684c4ebc4):

startupSessionRestoreReadBytes 	143593167
startupSessionRestoreWriteBytes 	4558 

Current m-c tip (633684c4ebc4) with b3d61b6150fd reverted:

startupSessionRestoreReadBytes 	143593385
startupSessionRestoreWriteBytes 	4030 

That seems to indicate b3d61b6150fd isn't the culprit. However, I would not expect the EME code to be hit on startup unless the page shown on startup was playing Adobe EME video.
maybe our measurement of xperf is doing something odd.  I do know that the specific set of patches in the push b3d61b6150fd did increase the xperf number; we have a series of retriggers and it is pretty clear.
What HTML content does the xperf test run Firefox over? Is it just the default startup page, or do we actually load some web pages as part of the profiling?
we cycle over the tp5 pageset, so there are a lot of pages we cover.
cpearce, any thoughts here?  I think the next steps are unclear: our automation shows something, but other data from running locally doesn't confirm it.
Flags: needinfo?(cpearce)
jmaher: We should push one run to try with media.eme.enabled=false and another without it, to verify that it is indeed EME that's causing this. We should also do a try run with and without these patches backed out to confirm it's these patches. I have trouble believing EME could cause this, but I can't argue against hard data. ;)

I think we've assumed it's b3d61b6150fd, but it could be others in the regressing range.

The main reason I have trouble believing it's EME is that not many sites use it yet, particularly since we implement a newer version of the spec than Chrome, IE and Safari do, so sites that use EME probably don't use the version we implement.

One theory: this patch makes us try to re-download the Adobe EME plugin from Adobe's servers if it has disappeared from disk after we've installed it. Could something be deleting the EME plugin or otherwise causing us to re-download it? For example, do we load each new page with a clean profile (which would require a new CDM download)? Or perhaps there's anti-virus running on the test machines?
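
The theory amounts to something like the following (hypothetical names, not the Gecko code):

#include <windows.h>
#include <cstdio>

// Placeholder for whatever would actually queue the network fetch.
static void ScheduleCdmDownload() { std::puts("re-downloading CDM..."); }

// If the CDM plugin file has vanished from the profile (anti-virus, a fresh
// profile per page load, ...), trigger a re-download, which would show up as
// extra file and network I/O in the xperf numbers.
static void EnsureCdmPresent(const wchar_t* aCdmPath) {
  if (GetFileAttributesW(aCdmPath) == INVALID_FILE_ATTRIBUTES) {
    ScheduleCdmDownload();
  }
}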
Blocks: EME
Flags: needinfo?(cpearce) → needinfo?(jmaher)
keep in mind that b3d61b6150fd is the revision that the 4 patches landed on. I use it as a marker since that is the granularity we have; when I reference b3d61b6150fd, I mean all of those changes.

I have done a few pushes to try (all building upon each other):
backout b3d61b6150fd: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ad1581db1ba
backout 34d3296e6980: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c83bfc0f3cf
backout b0ebc8d237f0: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e44f209eeb60

unfortunately, 'hg backout 03dc784d5456' does not apply cleanly, so I didn't do it.

let's see what this turns up; I can dig further if needed.
Priority: -- → P1
it looks like 34d3296e6980 is the actual changeset of concern.  Is this the one we could test with/without the media.eme.enabled=false preference?

looking at the backout of b3d61b6150fd, we have similar numbers, but for both 34d3296e6980 and b0ebc8d237f0 we see a drop of about 20MB in the overall results.

I am not sure why changing line endings for 1 file did it:
https://hg.mozilla.org/integration/mozilla-inbound/rev/34d3296e6980

maybe I am reading the results incorrectly, but the try pushes above help show the difference. I really wish try server would show the full queue of pushes.

:cpearce, any thoughts on this?
Flags: needinfo?(jmaher) → needinfo?(cpearce)
(In reply to Joel Maher (:jmaher) from comment #14)
> it looks like 34d3296e6980 is the actual changeset of concern.  Is this the
> one we could test with/without the media.eme.enabled=false preference?
> 
> looking at the backout of b3d61b6150fd, we have similar numbers, but for
> both 34d3296e6980 and b0ebc8d237f0 we see a drop of about 20MB in the
> overall results.
> 
> I am not sure why changing line endings for 1 file did it:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/34d3296e6980
> 
> maybe I am reading the results incorrectly, but the try pushes above help
> show the difference. I really wish try server would show the full queue of
> pushes.
> 
> :cpearce, any thoughts on this?

I'm pretty unconvinced that whitespace changes caused this. ;)

You should spin up more runs on those pushes; your sample size is 4 there, and my high school statistics teacher taught me you need at least 30 runs for statistically significant results.
Flags: needinfo?(cpearce)
I did a few more retriggers and mapped out the data.  The results come in tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the regressed range.

The prior range was ~260MB -> ~340MB and that is what I wanted to see.

Which leaves 1 more patch in the series:
https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456

Since I had trouble with 'hg backout' of this, could you push to try with just that backed out?  of course we could 'hg update' to the revision prior, but that would be served by just looking at the existing data.
Flags: needinfo?(cpearce)
this bug has been going on for 2 weeks with no real action.  If this were a unit test failure, it would have been backed out or the test would have been disabled.

cpearce: can you please comment on this bug about what the next steps are and the timeline for them?  Otherwise we are left with backing all of the patches out.
I'm pretty swamped at the moment. This is on our list of EME P1s, so we'll get to it as soon as we can.
This is going to be on Beta in a couple of weeks. Do we have any indication whether or not this patch fixes the issue in bug 1160321?
Actually, it was already uplifted to Beta, so it will be on Release on June 29th.
(In reply to Joel Maher (:jmaher) from comment #16)
> I did a few more retriggers and mapped out the data.  The results come in
> tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the
> regressed range.
> 
> The prior range was ~260MB -> ~340MB and that is what I wanted to see.
> 
> Which leaves 1 more patch in the series:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456

This changeset also changes HTMLMediaElement.canPlayType() to load the Windows Media Foundation DLLs and ensure they're on disk and usable before reporting to JavaScript that we can play audio/video. That may explain the I/O spike you're seeing; we're loading those DLLs.
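
Roughly, that check is of this shape (a sketch with an illustrative DLL subset, not the actual patch):

#include <windows.h>

// Sketch: only report from canPlayType() that audio/video is playable if the
// WMF DLLs can actually be loaded. The LoadLibraryW() calls are what hit disk.
static bool CanLoadWmfDlls() {
  static const wchar_t* const kDlls[] = {L"mfplat.dll", L"mf.dll",
                                         L"dxva2.dll"};  // illustrative subset
  for (const wchar_t* dll : kDlls) {
    HMODULE module = LoadLibraryW(dll);
    if (!module) {
      return false;  // Missing or unusable DLL -> report that we can't play.
    }
    FreeLibrary(module);
  }
  return true;
}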
(In reply to Joel Maher (:jmaher) from comment #16)
> I did a few more retriggers and mapped out the data.  The results come in
> tri-modal (~300MB, ~350MB, ~375MB), all of them showing data in the
> regressed range.
> 
> The prior range was ~260MB -> ~340MB and that is what I wanted to see.
> 
> Which leaves 1 more patch in the series:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/03dc784d5456
> 
> Since I had trouble with 'hg backout' of this, could you push to try with
> just that backed out?  of course we could 'hg update' to the revision prior,
> but that would be served by just looking at the existing data.

Joel, please look at this push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=96d7aef0fba9

This removes the check that ensures we can use Windows Media Foundation to decode before reporting to JavaScript that we can play audio and video.

If this is the regressing change, I think we should keep the check anyway, as it's critical for YouTube to work correctly: there are plenty of Windows machines out in the wild that don't have a working WMF for a variety of reasons.

When we initialize WMF we load this list of DLLs:
https://dxr.mozilla.org/mozilla-central/source/dom/media/wmf/WMFUtils.cpp#430

We can drop propsys.dll and mfreadwrite.dll from that list, since the code that needs them has been preffed off, and I have a patch lying around to remove that preffed-off code anyway.
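
As an illustration (the real list lives at the WMFUtils.cpp link above; the first few names here are assumptions), the trimmed list would look something like this:

// Illustrative WMF DLL preload list; only the two dropped entries are
// confirmed by the comment above, the remaining names are assumed.
static const wchar_t* const kWmfDllsToPreload[] = {
    L"mfplat.dll",
    L"mf.dll",
    L"dxva2.dll",
    L"evr.dll",
    // propsys.dll and mfreadwrite.dll dropped: only needed by code that is
    // already preffed off, so skipping them saves two LoadLibraryW() reads.
};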
Flags: needinfo?(cpearce) → needinfo?(jmaher)
I did a series of retriggers on that try push; unfortunately, we are not seeing improvements.  It sounds like we might not be able to fix much here.
Flags: needinfo?(jmaher)
cpearce, is this talos regression still an issue? Can we close this bug?
Flags: needinfo?(cpearce)
I would like to circle back after we've shipped and investigate whether there are changes we can make to prevent the regression. I suspect there won't be, however.

We could P2 it to reflect that we don't need to block shipping.
Flags: needinfo?(cpearce)
Priority: P1 → P2
Component: Audio/Video → Audio/Video: Playback
Mass change P2 -> P3
Priority: P2 → P3
Severity: normal → S3