Closed Bug 1839426 Opened 1 year ago Closed 1 year ago

FOG doesn't always init, especially on short sessions with no idle

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
118 Branch
Tracking Status
firefox118 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

Details

Attachments

(2 files)

BrowserGlue's _scheduleStartupIdleTasks isn't guaranteed to run. This is only really a problem in sessions where the session doesn't finish restoring (which are very short or are on very slow hardware or both), but it's a gap where Telemetry is able to record information and FOG isn't.

Without FOG being init we never init Glean and never shut it down. All the data that's going into the preinit queue lives and dies in memory.

Telemetry, meanwhile, is able to persist and often upload pings at shutdown because it performs sizable portions of its initialization quite early.

I think the solution to this is to split FOG init and shutdown into three pieces:

  1. FOG component init - All JS calls to Telemetry instantiate the global Telemetry Service, but no JS calls to the Glean global instantiate Services.fog. We need FOG to be init nice and early to hold some state and register the shutdown handler for #3
  2. FOG+Glean init - Glean init might be slow and intensive, so scheduling it opportunistically on idle via BrowserGlue is still a solid plan.
  3. XPCOMShutdown - The latest time we accept data, this is where we'll shut down FOG and Glean... but only after ensuring that they've init'd. If BrowserGlue didn't get to us, then we'll have to do it ourselves.
    • This will require some Glean SDK work to allow for blocking on init (with timeout)

( There's a hidden 2.5 for XPCOMWillShutdown for the Nimbus-powered glean.finalInactive feature work. It'll remain called by callback registered in #2 since if we don't get to #2 then we didn't init early enough to handle a client inactive before the end of XPCOMWillShutdown. )

This isn't a dupe of bug 1732989 which has IPC interactions we're not dealing with. This comes out of investigations detailed in my rambling style in bug 1837230.

Depends on: 1839433

A wrinkle: Not all processes and not all applications will want any of this to happen. One of the reasons we're in BrowserGlue is because non-Firefox-Desktop users of Gecko-embedded Glean (like Fenix) don't necessarily init it via FOG (but some of them, like the Background Update Agent, do).

We can't rely on BrowserGlue to call initializeFOG in short, busy sessions.
But if we register the shutdown callback early by instantiating the FOG
singleton early, we can catch that case and init while we're shutting down.

(This is important for catching early data in short, busy sessions.
In most cases this code shouldn't run)

THIS IS NOT READY TO LAND. It is missing:

  • A test (though I'm not sure it's reliably testable)
  • A proper API to call for init'ing at shutdown (see bug 1839433)
  • Instrumentation (I claim this should be rare, but we should check that)

Oh, and another thing it's missing: an explanation for why it always waits out the full 30s on waiting for upload during shutdown. Logs aren't that helpful as we just get Metrics Ping Scheduler cancelled. Exiting followed by Waiting for upload failed. We're shutting down. Perhaps fixes to bug 1838453 or bug 1839428 or bug 1777233 will take care of it. (Or perhaps I should expend some effort figuring out why it's happening in the first place.)

But for now, this patch is a good enough illustration while we wait for the API from bug 1839433

Guess what: the patch is pretty okay, except it triggers some "you're init'ing things too early in startup" tests (specifically browser_startup.js). So I reached out on the #fx-desktop-dev channel and :Mossop kindly worked through the problem with me.

So instead of doing what the patch presently does (instantiating the FOG service singleton in BG__init), we're going to just ensure that the service singleton's been instantiated early in shutdown (by instantiating it in BG__dispose). Now, this is going to look weird (it'll literally be Services.fog;. That's it) so we'll need a good comment.

We think this'll work because a) BrowserGlue registers _dispose against profile-before-change in BG_init, so we have the same guarantees that this code will run, b) profile-before-change is before XPCOM shutdown, so we can register our shutdown handler in instantiation, b) it has zero effect in normal startups that have had a time to run _scheduleStartupIdleTasks (since the service will have already instantiated in those cases and GetSingleton will early-return), and c) Since it's not happening during early init, it should appease tests.

Let's give it a go, shall we?

Depends on: 1841146
Depends on: 1840044
Attachment #9340121 - Attachment description: WIP: Bug 1839426 - DO NOT LAND - If FOG shutdown happens before init, enforce init r?janerik! → Bug 1839426 - If FOG shutdown happens before init, enforce init r?janerik!

Getting NS_ERROR_OUT_OF_MEMORY from xpcshell with this change. Happens locally, too, so maybe I can track it down.

It's happening on the line Services.fog; inside BrowserGlue's __dispose(). Not ideal.

Attachment #9340121 - Attachment description: Bug 1839426 - If FOG shutdown happens before init, enforce init r?janerik! → Bug 1839426 - If FOG shutdown happens before init, enforce init r?janerik!,Mossop!
Pushed by jrediger@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/59fa356e72f6 If FOG shutdown happens before init, enforce init r=janerik

This is still blocked on the new Glean SDK release being vendored, at which point I expect the test to pass.

Flags: needinfo?(jrediger)
Attached file data collection review request (deleted) —
Attachment #9346249 - Flags: data-review?(tlong)
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b33a0ed0e057 If FOG shutdown happens before init, enforce init r=janerik
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 118 Branch

Comment on attachment 9346249 [details]
data collection review request

Data Review

  1. Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, through the metrics.yaml file and the Glean Dictionary.

  1. Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, through the telemetry preferences in the application settings.

  1. If the request is for permanent data collection, is there someone who will monitor the data over time?

Permanent collection to be monitored by chutten@mozilla.com and fallback to the glean-team@mozlla.com

  1. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1 technical

  1. Is the data collection request for default-on or default-off?

Default-on

  1. Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?

No

  1. Is the data collection covered by the existing Firefox privacy notice?

Yes

  1. Does the data collection use a third-party collection tool?

No

Result

data-review+

Attachment #9346249 - Flags: data-review?(tlong) → data-review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: