Narrow down the zero_byte_load probe to tailor results for YSOD, part 4
Categories
(Core :: Networking: JAR, task, P2)
Tracking
()
People
(Reporter: zbraniecki, Assigned: kershaw)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [necko-triaged])
Attachments
(2 files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details |
As a follow up to bug 1693955 and bug 1694263 I believe we have now enough data from Beta to place additional filters.
Looking at the dashboard at https://sql.telemetry.mozilla.org/dashboard/ui-resource-network-errors I see several potentials:
Status as of Feb 28 2021
Nightly stands at ~1.8m events per day. 900k/day of them are HTML and then PNG, SVG and Others with +200k/day.
Beta stands at ~14m events per day. 6.6m are HTML, then 2m SVG and Others, and then 1.5m XHTML and PNG.
Proposed new filters
1) Filter out non-omni.ja HTML
- Scope: all channels
- Impact:
** Nightly: 900k -> 300k
** Beta: 6.6m -> 3.9m
This seems quite obvious. Lots of volume from extensions, and we are not currently aiming to fix that.
2) Filter out omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscript
/NS_ERROR_FAILURE
- Scope: all but Nightly (keep it on Nightly)
- Impact:
** Nightly: 300k -> 60k
** Beta: 3.9m -> 300k
This is a bit more tricky. The error indicates that something is going wrong, and it is on a primary UI. Today we saw 3.6m events from Beta from 380k unique clients, and on Nightly from 25k clients.
Those numbers represent a substantial portion of the DAU!
What's also worth noticing is that on average a client which experiences it, reports it 10 times a day.
I would be concerned about filtering it out, because both, it can be useful in our investigation of YSOD sources, and it may be a valueable Q&A artifact.
Unfortunately, since the volume is really high, I think it would be sensible to restrict it to Nightly and monitor there.
Reporter | ||
Comment 1•4 years ago
|
||
3) Filter out omni.ja!/chrome/browser/search-extensions/google/favicon.ico
/ NS_BINDING_ABORTED
- Scope: all channels
- Impact:
** Nightly: -
** Beta: 2m -> 1m
While this file seems to be of importance, it feels to me like it should have the same underlying root cause as other search_extensions/*/favicon.ico
with NS_BINDING_ABORTED
, all of which are in top 10, but none of which has the same 1m events per day volume.
So by filtering this out, we're not ignoring the problem, we just have to scale the other analogous volume by 20x to include this one.
And if we fix the icon for ddg, bing etc. we're likely going to fix this one as well.
Another thing to notice is that the problem seems to be non-existing in Nightly.
Also, we should file a bug about it and CC someone from the Search Team to investigate this particular error's root cause.
4) Filter out omni.ja!/update.locale
/ NS_ERROR_FILE_NOT_FOUND
- Scope: all channels
- Impact:
** Nightly: 200k -> 145k
** Beta: 1m -> 400k
This is a known problem that building Firefox without an installer leads to this file being missing.
We have two places where we ask for this file, in LocaleService where I think we stat it, and in JS here: https://searchfox.org/mozilla-central/rev/6a023272d590409c80458d373986e379b3ad86f4/toolkit/modules/UpdateUtils.jsm#146
Since JS doesn't have an easy "if (fileExists(path)), they just
fetch` and check the ouput, which leads to NS_ERROR_FILE_NOT_FOUND.
Monitoring this error may be useful for general Q&A as we can then verify that the source code handles missing result, but for this particular file we already did, and I think we can filter it out now.
Reporter | ||
Comment 2•4 years ago
|
||
- Filter out non-omni.ja SVG&PNG
Scope: all channels
Impact:
** Nightly: 470k -> 200k
** Beta: 3.4m -> 2.3m
I know it sounds like overly cautious but I prefer to slowly filter out non-omni.ja
per type rather than all at once because it allows me to see if there's any suspicous non-omni.ja call that may lead us to a quality problem in code.
But I also want to reduce the volume, so
Reporter | ||
Comment 3•4 years ago
|
||
Those 5 target all that I can see standing out in the dashboard, and reduces Beta volume from 14m to 5m, which I think is appropriate.
I am in particular not cutting on events for XHTML which I think is significant for YSOD and I'd like to monitor it fully for a couple weeks more.
This is also the first time we're cutting out what I suspect may be a legitimate error, but is at a volume that justifies cutting it since we have lower-volume equivalents that we can study and will likely be fixed by fixing the same root cause.
It'll be important to remember to add the volume of those ((2) on Beta and (3) on all platforms) when evaluating severity and scope of the problem.
Reporter | ||
Comment 4•4 years ago
|
||
Does the list look good to you? If so, would you have time to write a PR?
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 5•4 years ago
|
||
I also think that at this point we'll likely be done with Beta filters, and we'll switch fully to evaluating the correlations with the data at this level.
And waiting for Stable to hit, which may mean 60x volume.
I think :kershaw's idea to implement a perf that we could use to turn it off (or at least clamp?) may be important if we the data from release surprised us and we'd get much more than 300m events per day.
From what I understand from #telemetry Matrix channel, 300m events a day wouldn't overload the system, but I'd still prefer to only collect such volume for a week or so, and then maybe turn it off on Release, fine-tune on Nightly+Beta and then re-enable for the next release with new filters.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 6•4 years ago
|
||
2) Filter out
omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscript
/NS_ERROR_FAILURE
- Scope: all but Nightly (keep it on Nightly)
- Impact:
** Nightly: 300k -> 60k
** Beta: 3.9m -> 300kThis is a bit more tricky. The error indicates that something is going wrong, and it is on a primary UI. Today we saw 3.6m events from Beta from 380k unique clients, and on Nightly from 25k clients.
Those numbers represent a substantial portion of the DAU!
What's also worth noticing is that on average a client which experiences it, reports it 10 times a day.
I would be concerned about filtering it out, because both, it can be useful in our investigation of YSOD sources, and it may be a valueable Q&A artifact.Unfortunately, since the volume is really high, I think it would be sensible to restrict it to Nightly and monitor there.
This can be easily reproduced locally on my side.
It seems the channel is cancelled here.
I also captured the log and stack trace below.
Since this is easy to reproduce, I think we don't need to record this on Nightly and we can filter out this completely.
#01: mozilla::net::DocumentLoadListener::FinishReplacementChannelSetup(nsresult)[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8a4868]
#02: mozilla::net::DocumentLoadListener::RedirectToRealChannelFinished(nsresult)[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8a42fb]
#03: mozilla::MozPromise<nsresult, mozilla::ipc::ResponseRejectReason, true>::ThenValue<mozilla::net::DocumentLoadListener::TriggerRedirectToRealChannel(mozilla::Maybe<unsigned long long> const&)::$_24, mozilla::net::DocumentLoadListener::TriggerRedirectToReal[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8c9ae4]
#04: mozilla::MozPromise<nsresult, mozilla::ipc::ResponseRejectReason, true>::ThenValueBase::ResolveOrRejectRunnable::Run()[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x8c11c4]
#05: mozilla::SimpleTaskQueue::DrainTasks()[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x1898ad]
#06: nsThread::DrainDirectTasks()[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x199531]
#07: nsThread::ProcessNextEvent(bool, bool*)[/Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/MacOS/XUL +0x197fca]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener ctor [this=0x12d815940]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener [0x12d815940] OpenDocument [uri=about:home]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::nsJARChannel [this=0x12ccc08d0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::Init [this=0x12ccc08d0]
[Parent 7283: Main Thread]: V/DocumentChannel ParentProcessDocumentOpenInfo ctor [this=0x12d76c900]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12ccc08d0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12ccc08d0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::LookupFile [this=0x12ccc08d0 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/browser/omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscripts.html]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OpenLocalFile [this=0x12ccc08d0]
[Parent 7283: StreamTrans #5]: D/nsJarProtocol nsJARChannel::CreateLocalJarInput [aJarCache=0x10eb11940, , chrome/browser/res/activity-stream/prerendered/activity-stream-noscripts.html]
[Parent 7283: StreamTrans #1]: D/nsJarProtocol nsJARChannel::CreateLocalJarInput [aJarCache=0x10eb11940, , update.locale]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::ContinueOpenLocalFile [this=0x12d14e030 0x111670ca0]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OnOpenLocalFileComplete [this=0x12d14e030 00000000]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::ContinueOpenLocalFile [this=0x12ccc08d0 0x129436d60]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OnOpenLocalFileComplete [this=0x12ccc08d0 00000000]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OnStartRequest [this=0x12ccc08d0 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/browser/omni.ja!/chrome/browser/res/activity-stream/prerendered/activity-stream-noscripts.html]
[Parent 7283: Main Thread]: V/DocumentChannel ParentProcessDocumentOpenInfo OnStartRequest [this=0x12d76c900]
[Parent 7283: Main Thread]: V/DocumentChannel ParentProcessDocumentOpenInfo OnDocumentStartRequest [this=0x12d76c900]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener OnStartRequest [this=0x12d815940]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener MaybeTriggerProcessSwitch [this=0x12d815940]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener GetRemoteTypeForPrincipal [this=0x12d815940, contentParent=privilegedabout, preferredRemoteType=privilegedabout]
[Parent 7283: Main Thread]: V/DocumentChannel GetRemoteTypeForPrincipal -> current:privilegedabout remoteType:privilegedabout
[Parent 7283: Main Thread]: V/DocumentChannel Process Switch Abort: type (privilegedabout) is compatible
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener::TriggerRedirectToRealChannel [this=0x12d815940] aDestinationProcess=-1
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener RedirectToRealChannel [this=0x12d815940] aRedirectFlags=4, aLoadFlags=3d0004
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::nsJARChannel [this=0x12d14ea40]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::Init [this=0x12d14ea40]
[Child 7285: Main Thread]: V/DocumentChannel DocumentChannelChild RecvRedirectToRealChannel [this=0x10e3c7280, uri=about:home]
[Child 7285: Main Thread]: V/DocumentChannel DocumentChannelChild OnRedirectVerifyCallback [this=0x10e3c7280, aRv=0x00000000 ]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::nsJARChannel [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::Init [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::Open [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::Open [this=0x1168aed20]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::LookupFile [this=0x1168aed20 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/browser/omni.ja!/chrome/devtools/modules/devtools/client/jsonview/Sniffer.jsm]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::CreateJarInput [this=0x1168aed20]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12d14ea40]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::AsyncOpen [this=0x12d14ea40]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::LookupFile [this=0x12d14ea40 jar:file:///Users/changkershaw/work/central/objdir/dist/firefox/NightlyDebug.app/Contents/Resources/omni.ja!/modules/workers/PromiseWorker.js]
[Parent 7283: Main Thread]: D/nsJarProtocol nsJARChannel::OpenLocalFile [this=0x12d14ea40]
[Parent 7283: StreamTrans #3]: D/nsJarProtocol nsJARChannel::CreateLocalJarInput [aJarCache=0x10eb11940, , modules/workers/PromiseWorker.js]
[Child 7285: Main Thread]: D/nsJarProtocol nsJARChannel::~nsJARChannel [this=0x1168aed20]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener RedirectToRealChannelFinished [this=0x12d815940, aRv=0 ]
[Parent 7283: Main Thread]: V/DocumentChannel DocumentLoadListener FinishReplacementChannelSetup [this=0x12d815940, aResult=80004005]
Assignee | ||
Comment 7•4 years ago
|
||
Reporter | ||
Comment 8•4 years ago
|
||
Ooh, thank you!
So, if I read your stack correctly, this is not an error, we are just redirecting to another document while loading one?
Do you think more of HTML/XHTML loads are caused by such behavior? Is there a way for us to recognize that as an always "correct" behavior and not include in the probe (should it even be an error?)? - in a follow-up bug of course.
Assignee | ||
Comment 9•4 years ago
|
||
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #8)
Ooh, thank you!
So, if I read your stack correctly, this is not an error, we are just redirecting to another document while loading one?Do you think more of HTML/XHTML loads are caused by such behavior? Is there a way for us to recognize that as an always "correct" behavior and not include in the probe (should it even be an error?)? - in a follow-up bug of course.
Honestly, I am not sure. It seems wrong to me that redirection involves loading a file from omni.ja locally. I'll file a bug for this.
Comment 10•4 years ago
|
||
Assignee | ||
Comment 11•4 years ago
|
||
Comment on attachment 9206061 [details]
Bug 1695560 - Narrow down the zero_byte_load probe
Beta/Release Uplift Approval Request
- User impact if declined: No impact. However, this patch is about filtering out some unnecessary events for YSOD. If we don't uplift this, the telemetry data will be inconsistent with the data on nightly. To make it easier for analyzing this data, it'd be great if we can uplift this patch.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch only affects how we send the zero_byte load event.
- String changes made/needed: N/A
Reporter | ||
Comment 12•4 years ago
|
||
This is important because we seem a continuous increase in the event volume from the Beta channel. As of yesterday we passed 18m events per day. This PR will allow us to lower that to ~5m.
Comment 13•4 years ago
|
||
bugherder |
Reporter | ||
Comment 15•4 years ago
|
||
Thank you! Bug 1696151 just landed. Should we add request beta flag on that follow up, or would you prefer we folded the two PRs for beta uplift?
Reporter | ||
Comment 16•4 years ago
|
||
:kershaw said he's working on a folded one for uplift.
Assignee | ||
Comment 17•4 years ago
|
||
Assignee | ||
Comment 18•4 years ago
|
||
Comment on attachment 9206674 [details]
Bug 1695560 - Narrow down the zero_byte_load probe (beta uplift)
Beta/Release Uplift Approval Request
- User impact if declined: No impact. However, this patch is about filtering out some unnecessary events for YSOD. If we don't uplift this, the telemetry data will be inconsistent with the data on nightly. To make it easier for analyzing this data, it'd be great if we can uplift this patch.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch only affects how we send the zero_byte load event.
- String changes made/needed: N/A
Assignee | ||
Comment 19•4 years ago
|
||
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #15)
Thank you! Bug 1696151 just landed. Should we add request beta flag on that follow up, or would you prefer we folded the two PRs for beta uplift?
I can't remove the previous approval-mozilla-beta
flag. Please make sure to uplift the latest patch. Thanks.
Reporter | ||
Updated•4 years ago
|
Comment 20•4 years ago
|
||
Comment on attachment 9206674 [details]
Bug 1695560 - Narrow down the zero_byte_load probe (beta uplift)
approved for 87.0b6
Comment 21•4 years ago
|
||
bugherder uplift |
Description
•