Closed Bug 1137404 Opened 9 years ago Closed 9 years ago

Intermittent talos xperf mainthreadio | File 'c:\windows\system32\en-us\dwrite.dll.mui' was accessed and we were not expecting it

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(firefox38 unaffected, firefox39 fixed, firefox40 fixed, firefox-esr31 unaffected)

RESOLVED FIXED
Firefox 40
Tracking Status
firefox38 --- unaffected
firefox39 --- fixed
firefox40 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: bugzilla, Assigned: Dexter)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(3 files, 1 obsolete file)

See https://bugzilla.mozilla.org/show_bug.cgi?id=1136351#c31. It sounds like some new code is accessing nsGfxInfo in a way that is triggering this I/O.

Probably added in bug 1122052.

We should look at nsGfxInfo to see what's going on there. Otherwise we'll have to adjust how we access this in FHR/telemetry.
:aklotz, we are almost to the limit where we need to hide xperf (i.e. it won't get sheriff coverage).  Can we push to get a fix for this, or should we temporarily adjust the mtio-whitelist.json until this can get resolved?

Historically we have been pretty diligent about not modifying the file unless needed, not sure we should be making exceptions here.
Flags: needinfo?(aklotz)
It's the DWriteVersion field that is causing the problem here. Is there anything that we can to do postpone gathering this field until a later time?
See comment 143.
Flags: needinfo?(alessio.placitelli)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #143)
> It's the DWriteVersion field that is causing the problem here. Is there
> anything that we can to do postpone gathering this field until a later time?

Given [1], we shouldn't be accessing gfxData from TelemetryEnvironment (and so DWriteVersion) at startup, at least not before 60s (TELEMETRY_DELAY). Even after Telemetry starts up, we only gather gfx data if something triggers an important change in the environment (new theme, new plugin, new addon, etc.). Are these tests triggering any of those changes?

[1] https://hg.mozilla.org/mozilla-central/annotate/0b3c520002ad/toolkit/components/telemetry/TelemetryPing.jsm#l712
Flags: needinfo?(alessio.placitelli)
we already had this in the file, our access times have gone up, almost all of them are <60ms.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Flags: needinfo?(aklotz)
Attachment #8572737 - Flags: review?(vdjeric)
Comment on attachment 8572737 [details] [diff] [review]
ignore duration of dwrite.dll.mui up to 60ms (1.0)

Review of attachment 8572737 [details] [diff] [review]:
-----------------------------------------------------------------

r+ for a temporary fix until i figure out why DWrite version is being fetched earlier in the unified Telemetry/FHR
Attachment #8572737 - Flags: review?(vdjeric) → review+
As per IRC, ni? for a stack for these accesses from Windows to see what actually triggers this.
Flags: needinfo?(alessio.placitelli)
That's the call stack for the first call to |TelemetryEnvironment.getGfxField|, which should make us hit "dwrite.dll.mui".

getGfxAdapter@resource://gre/modules/TelemetryEnvironment.jsm:255:11
EnvironmentCache.prototype._getGFXData@resource://gre/modules/TelemetryEnvironment.jsm:1029:27
EnvironmentCache.prototype._getSystem@resource://gre/modules/TelemetryEnvironment.jsm:1068:12
EnvironmentCache@resource://gre/modules/TelemetryEnvironment.jsm:667:13
getGlobal@resource://gre/modules/TelemetryEnvironment.jsm:48:26
TelemetryEnvironment.registerChangeListener@resource://gre/modules/TelemetryEnvironment.jsm:63:12
setupChromeProcess/this._delayedInitTask<@resource://gre/modules/TelemetrySession.jsm:1576:9
TaskImpl_run@resource://gre/modules/Task.jsm:314:40
Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:867:23
this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:746:7
this.PromiseWalker.scheduleWalkerLoop/<@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:688:37
Promise*this.PromiseWalker.scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:688:5
this.PromiseWalker.schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:709:7
this.PromiseWalker.completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:671:7
this.BasePromiseWorker.prototype._worker/worker.onmessage@resource://gre/modules/PromiseWorker.jsm:231:9
Spinner.prototype.observe@resource://gre/modules/AsyncShutdown.jsm:464:9
Flags: needinfo?(alessio.placitelli)
The "dwrite.dll.mui" hit is definitely caused by [1] (see comment 400 for a call stack). Thanks to :jmaher, I removed the dwrite dll whitelisting from Talos in a custom repository and removed the offending line in [1] from Firefox: we are not hitting dwrite.dll.mui anymore.

With [1], the caused jank goes from 2ms to 60ms, mostly around 10ms.

Is this delay acceptable (i.e., should we keep whitelisting "dwrite.dll.mui") or should we devise a different solution?

[1] - https://hg.mozilla.org/mozilla-central/annotate/8f57f60ee58a/toolkit/components/telemetry/TelemetryEnvironment.jsm#l1022
[2] - https://treeherder.mozilla.org/#/jobs?repo=try&revision=24db20615901
Flags: needinfo?(vdjeric)
(In reply to Alessio Placitelli [:Dexter] from comment #401)
> Is this delay acceptable (i.e., should we keep whitelisting
> "dwrite.dll.mui") or should we devise a different solution?

If we need to, the simplest option would be to just not collect this during the first N seconds after startup.
(In reply to Georg Fritzsche [:gfritzsche] from comment #402)
> (In reply to Alessio Placitelli [:Dexter] from comment #401)
> > Is this delay acceptable (i.e., should we keep whitelisting
> > "dwrite.dll.mui") or should we devise a different solution?
> 
> If we need to, the simplest option would be to just not collect this during
> the first N seconds after startup.

The I/O duration on Talos generally isn't interesting because on user machines, the durations are going to be different under different circumstances.

As I see the options are:

1) tolerate the regression and chalk it up to startup cost <-- it's really not necessary for startup though
2) move the jank until a minute after startup using delayed-init <-- not ideal
3) use a ChromeWorker and ctypes to get the info off the main thread during delayed-init <-- best performance but kinda ugly
4) if dwrite.dll.mui gets read when DirectWrite is enabled regardless of Telemetry, we can just pospone the read further (so that it becomes a cached read)

We should check if dwrite.dll.mui gets accessed on its own after startup (using XPerf, Logger etc)
Flags: needinfo?(vdjeric) → needinfo?(aklotz)
It doesn't look like it is being accessed during startup with telemetry either on or off. OTOH if I point the browser to about:support, then it is accessed.
Flags: needinfo?(aklotz)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #403)
> 2) move the jank until a minute after startup using delayed-init <-- not
> ideal

As of today, it's already like that (but we will hit dwrite much sooner when bug 1140132 lands).
Bug 1140132 collects the TelemetryEnvironment from "profile-after-change", so we have to decide whether we can live with the dwrite load times or not.
Vladan, can you make the call here?
Flags: needinfo?(vdjeric)
Confirmed that it was accessed during shutdown.
Depends on: 1154500
(In reply to Alessio Placitelli [:Dexter] from comment #405)
> As of today, it's already like that (but we will hit dwrite much sooner when
> bug 1140132 lands).

Right, I missed the fact this I/O is happening during deferred initialization and that the initialization is forced during browser shutdown in tests.

I expect we will add other gfx etc measurements in the future which will require reading files, so I'd like to address this issue comprehensively. We should be fetching such measurements off the main thread. I filed bug 1154500 for this and enumerated some options.

We can do this in phase 3. For now, I'd like to remove the DWriteVersion reporting, since it's not being used by anyone for anything. We should remove it from the whitelist as well.
Flags: needinfo?(vdjeric)
Comment on attachment 8593350 [details] [diff] [review]
part 1 - Remove DWriteVersion from the Telemetry Environment (and the tests)

Review of attachment 8593350 [details] [diff] [review]:
-----------------------------------------------------------------

I would prefer to not completely remove it but temporarily comment it out or at least leave a comment.

::: toolkit/components/telemetry/TelemetryEnvironment.jsm
@@ -1018,5 @@
>    _getGFXData: function () {
>      let gfxData = {
>        D2DEnabled: getGfxField("D2DEnabled", null),
>        DWriteEnabled: getGfxField("DWriteEnabled", null),
> -      DWriteVersion: getGfxField("DWriteVersion", null),

Let's keep a comment line here mentioning that we temporarily removed DWriteVersion pending bug XXX.

::: toolkit/components/telemetry/docs/environment.rst
@@ -111,5 @@
>          },
>          gfx: {
>              D2DEnabled: <bool>, // null on failure
>              DWriteEnabled: <bool>, // null on failure
> -            DWriteVersion: <string>, // null on failure

Let's not remove the line, instead let's comment that it's temporarily removed pending bug xxx.

::: toolkit/components/telemetry/tests/unit/test_TelemetryEnvironment.js
@@ -393,5 @@
>  
>    let gfxData = data.system.gfx;
>    Assert.ok("D2DEnabled" in gfxData);
>    Assert.ok("DWriteEnabled" in gfxData);
> -  Assert.ok("DWriteVersion" in gfxData);

Also, comment here and below.
Attachment #8593350 - Flags: review?(gfritzsche)
Thanks for the review, I've restored those lines and commented them.
Attachment #8593350 - Attachment is obsolete: true
Attachment #8593794 - Flags: review?(gfritzsche)
Attachment #8593794 - Flags: review?(gfritzsche) → review+
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #409)
> We can do this in phase 3. For now, I'd like to remove the DWriteVersion
> reporting, since it's not being used by anyone for anything. We should
> remove it from the whitelist as well.

Vladan's previous patch just added |ignoreDuration|, the entry itself was already there.
The ignoreDuration part was recently removed in [1], because of a change in Talos.

In order to restore the behaviour we had before Vladan's, should I completely remove dwrite.dll.mui from the whitelist?

[1] - http://hg.mozilla.org/build/talos/annotate/3ff845dafa72/talos/mtio-whitelist.json#l121
Flags: needinfo?(jmaher)
I would do that, it would be best.
Flags: needinfo?(jmaher)
Assignee: jmaher → alessio.placitelli
Attachment #8594637 - Flags: review?(jmaher)
Comment on attachment 8594637 [details] [diff] [review]
part 2 - Remove DWriteVersion I/O whitelisting from Talos

Review of attachment 8594637 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8594637 - Flags: review?(jmaher) → review+
let me know if you want me to land this on talos.
Thanks jmaher, would you mind landing this on Talos?

I've try-pushed both Firefox and Talos patches here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c73d74c3404a
I've marked this as checkin needed, please just check the "part 1" in. We'll check part 2 (Talos) after part 1 hits m-c so we don't get any funky Talos failure.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/e6478565338c
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 40
Comment on attachment 8593794 [details] [diff] [review]
part 1 - Remove DWriteVersion from the Telemetry Environment (and the tests) - v2

Approved for Aurora. For approval request see bug 1139460 comment 42. For approval comments see bug 1139460 comment 43.
Attachment #8593794 - Flags: approval-mozilla-aurora+
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: