FOG doesn't always init, especially on short sessions with no idle
Categories
(Toolkit :: Telemetry, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox118 | --- | fixed |
People
(Reporter: chutten, Assigned: chutten)
References
Details
Attachments
(2 files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/plain
|
travis_
:
data-review+
|
Details |
BrowserGlue's _scheduleStartupIdleTasks isn't guaranteed to run. This is only really a problem in sessions where the session doesn't finish restoring (which are very short or are on very slow hardware or both), but it's a gap where Telemetry is able to record information and FOG isn't.
Without FOG being init we never init Glean and never shut it down. All the data that's going into the preinit queue lives and dies in memory.
Telemetry, meanwhile, is able to persist and often upload pings at shutdown because it performs sizable portions of its initialization quite early.
I think the solution to this is to split FOG init and shutdown into three pieces:
- FOG component init - All JS calls to
Telemetry
instantiate the global Telemetry Service, but no JS calls to theGlean
global instantiateServices.fog
. We need FOG to be init nice and early to hold some state and register the shutdown handler for #3 - FOG+Glean init - Glean init might be slow and intensive, so scheduling it opportunistically on idle via BrowserGlue is still a solid plan.
XPCOMShutdown
- The latest time we accept data, this is where we'll shut down FOG and Glean... but only after ensuring that they've init'd. If BrowserGlue didn't get to us, then we'll have to do it ourselves.- This will require some Glean SDK work to allow for blocking on init (with timeout)
( There's a hidden 2.5 for XPCOMWillShutdown
for the Nimbus-powered glean.finalInactive
feature work. It'll remain called by callback registered in #2 since if we don't get to #2 then we didn't init early enough to handle a client inactive before the end of XPCOMWillShutdown
. )
This isn't a dupe of bug 1732989 which has IPC interactions we're not dealing with. This comes out of investigations detailed in my rambling style in bug 1837230.
Assignee | ||
Comment 1•1 year ago
|
||
A wrinkle: Not all processes and not all applications will want any of this to happen. One of the reasons we're in BrowserGlue
is because non-Firefox-Desktop users of Gecko-embedded Glean (like Fenix) don't necessarily init it via FOG (but some of them, like the Background Update Agent, do).
Assignee | ||
Comment 2•1 year ago
|
||
We can't rely on BrowserGlue to call initializeFOG in short, busy sessions.
But if we register the shutdown callback early by instantiating the FOG
singleton early, we can catch that case and init while we're shutting down.
(This is important for catching early data in short, busy sessions.
In most cases this code shouldn't run)
THIS IS NOT READY TO LAND. It is missing:
- A test (though I'm not sure it's reliably testable)
- A proper API to call for init'ing at shutdown (see bug 1839433)
- Instrumentation (I claim this should be rare, but we should check that)
Assignee | ||
Comment 3•1 year ago
|
||
Oh, and another thing it's missing: an explanation for why it always waits out the full 30s on waiting for upload during shutdown. Logs aren't that helpful as we just get Metrics Ping Scheduler cancelled. Exiting
followed by Waiting for upload failed. We're shutting down
. Perhaps fixes to bug 1838453 or bug 1839428 or bug 1777233 will take care of it. (Or perhaps I should expend some effort figuring out why it's happening in the first place.)
But for now, this patch is a good enough illustration while we wait for the API from bug 1839433
Assignee | ||
Comment 4•1 year ago
|
||
Guess what: the patch is pretty okay, except it triggers some "you're init'ing things too early in startup" tests (specifically browser_startup.js
). So I reached out on the #fx-desktop-dev channel and :Mossop kindly worked through the problem with me.
So instead of doing what the patch presently does (instantiating the FOG service singleton in BG__init
), we're going to just ensure that the service singleton's been instantiated early in shutdown (by instantiating it in BG__dispose
). Now, this is going to look weird (it'll literally be Services.fog;
. That's it) so we'll need a good comment.
We think this'll work because a) BrowserGlue registers _dispose
against profile-before-change
in BG_init
, so we have the same guarantees that this code will run, b) profile-before-change
is before XPCOM shutdown, so we can register our shutdown handler in instantiation, b) it has zero effect in normal startups that have had a time to run _scheduleStartupIdleTasks
(since the service will have already instantiated in those cases and GetSingleton
will early-return), and c) Since it's not happening during early init, it should appease tests.
Let's give it a go, shall we?
Updated•1 year ago
|
Assignee | ||
Comment 5•1 year ago
|
||
Getting NS_ERROR_OUT_OF_MEMORY
from xpcshell with this change. Happens locally, too, so maybe I can track it down.
It's happening on the line Services.fog;
inside BrowserGlue
's __dispose()
. Not ideal.
Updated•1 year ago
|
Comment 7•1 year ago
|
||
Backed out for causing failures on /test_backgroundtask_experiments.js
- backout: https://hg.mozilla.org/integration/autoland/rev/b65483268e213e83c6dacca75190c1eab1cf03b8
- push: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=T2R6tIOHTrWhM61db7VGhA.0&revision=59fa356e72f68dd9ec58c6f2c0307c531da8d59d&group_state=expanded
- failure logs:
- TEST-UNEXPECTED-FAIL | toolkit/components/backgroundtasks/tests/xpcshell/test_backgroundtask_experiments.js | test_backgroundtask_caps - [test_backgroundtask_caps : 102] The message background task exited with exit code 0 - 0 == -6
- TEST-UNEXPECTED-FAIL | devtools/client/webconsole/test/browser/browser_toolbox_console_new_process.js | The remote debugger process died cleanly - Got -6, expected +0
- 4016> SUMMARY: ThreadSanitizer: thread leak /builds/worker/fetches/rustc/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:87:19 in std::sys::unix::thread::Thread::new::ha7b8e10c02a8e195
Assignee | ||
Comment 8•1 year ago
|
||
This is still blocked on the new Glean SDK release being vendored, at which point I expect the test to pass.
Assignee | ||
Comment 9•1 year ago
|
||
Comment 10•1 year ago
|
||
Comment 11•1 year ago
|
||
bugherder |
Comment 12•1 year ago
|
||
Comment on attachment 9346249 [details]
data collection review request
Data Review
- Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?
Yes, through the metrics.yaml file and the Glean Dictionary.
- Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, through the telemetry preferences in the application settings.
- If the request is for permanent data collection, is there someone who will monitor the data over time?
Permanent collection to be monitored by chutten@mozilla.com and fallback to the glean-team@mozlla.com
- Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1 technical
- Is the data collection request for default-on or default-off?
Default-on
- Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
No
- Is the data collection covered by the existing Firefox privacy notice?
Yes
- Does the data collection use a third-party collection tool?
No
Result
data-review+
Description
•