Closed Bug 1728461 Opened 3 years ago Closed 3 years ago

GeckoView unconditionally clears the startup cache of new profiles (less than 6 hours old)

Categories

(GeckoView :: General, defect)

Unspecified
All
defect

Tracking

(firefox94 fixed)

RESOLVED FIXED
94 Branch
Tracking Status
firefox94 --- fixed

People

(Reporter: mcomella, Assigned: robwu)

References

(Blocks 1 open bug)

Details

(Whiteboard: [addons-jira])

Attachments

(1 file)

On the Moto G5 in fenix, we spend 231ms (11% of time to navigation start in the parent process) in InterpretGeneratorResume, which does things like _getNormalizedManifest and extensionById https://share.firefox.dev/3mUQ5y1: this seems like a lot of time to parse the JSON associated with the manifests from web extensions. We identified a perf issue fenix#20934 and further fixed it up bug 1727272 but I'd be curious if we could further improve the performance.

Notes:

I doubt there's something different that we're doing in GV that makes this particularly slow, so I'm moving this to the Extension component. This likely also has nothing really to do with Android (but it's a lot more important there where cold start of Gecko is pretty frequent).

Feel free to move back if needed.

Component: Extensions → Android
Product: GeckoView → WebExtensions

We don't parse extension manifests at startup, though, so if we are, it likely is something specific to GV.

How much earlier on startup the android-components may end up calling the GeckoView's WebExtensionController method ensureBuiltin?

Flags: needinfo?(agi)

(In reply to Kris Maglione [:kmag] from comment #2)

We don't parse extension manifests at startup, though, so if we are, it likely is something specific to GV.

That's interesting, does that include the case where extensions need to be started up before the first network request has executed (e.g. for uBlock)?

I'm assuming what's causing us to parse the extension manifest is that the app calls ensureBuiltIn at startup every time, but I thought that this change from Rob was supposed to avoid parsing the manifest: https://hg.mozilla.org/mozilla-central/rev/19fe145c3b6214d038e053f7f95b1114bb50f03f is that not the case?

We're toyed with the idea of having the app know when the built in extension is installed and not do anything instead of trying to install it again, it sounds like that might help here? but I'm still confused why the difference since we don't explicitly parse the manifest anymore.

How much earlier on startup the android-components may end up calling the GeckoView's WebExtensionController method ensureBuiltin?

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #3)

How much earlier on startup the android-components may end up calling the GeckoView's WebExtensionController method ensureBuiltin?

For this discussion we can assume that the call is made as early as possible at startup.

Flags: needinfo?(agi)

(In reply to Agi Sferro | :agi | ni? for questions | ⏰ PST | he/him from comment #4)

(In reply to Kris Maglione [:kmag] from comment #2)

We don't parse extension manifests at startup, though, so if we are, it likely is something specific to GV.

That's interesting, does that include the case where extensions need to be started up before the first network request has executed (e.g. for uBlock)?

On desktop, we handle that case with what we call "persistent listeners" and "delayed background page startup",
we basically delay starting the background page until after the first paint, the addon is technically started a bit ealier than
that but after having installed it we should not be parsing the manifest again but use what we parsed before at install time.

I'm wondering if one of the "builtin" extensions installed from the android-components (e.g. maybe webcompat?) may be getting a startupReason that isn't set to "APP_STARTUP" and we end up starting the background page sooner than we do on desktop, see the check here:

In Github, you wrote:

Could this be related to the start up cache? I think I cleared data for the app though before I took the additional measurements...

What did you do exactly? Clearing the cache or really wiping all app data?
The parsed extension manifests are saved in ProfD/StartupCache/webext.sc.lz4 (source, note ProfLD = ProfD on Android). This directory is cleared on version changes, but other than that it will be there as long as the profile exists.

(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #0)

On the Moto G5 in fenix, we spend 231ms (11% of time to navigation start in the parent process) in InterpretGeneratorResume, which does things like _getNormalizedManifest and extensionById https://share.firefox.dev/3mUQ5y1

Some remarks (after analyzing the full range of the profile):

P.S. I typed this comment at the start of my day, but had to drop it in favor of another priority. This comment doesn't relate to the discussion since comment 3, but I wanted to leave the comment anyway in case it offers more context.

See question at the top of my previous comment.

Flags: needinfo?(michael.l.comella)

(Haven't had a chance to thoroughly understand the other comments but to respond to the NI...)

(In reply to Rob Wu [:robwu] from comment #6)

In Github, you wrote:

Could this be related to the start up cache? I think I cleared data for the app though before I took the additional measurements...

What did you do exactly? Clearing the cache or really wiping all app data?

The process I try to follow for taking start up profiles and benchmarks is:

  1. Clear all app data: adb shell pm clear org.mozilla.fenix
  2. Launch the app once to ensure we're not measuring a first run. I only let the app idle for a few seconds
  3. Force-stop the app: adb shell am force-stop org.mozilla.fenix
  4. Launch the app, taking a profile/benchmark

It sounds like this process might not give the app enough time to populate the start up cache.

:robwu, how long does the app need to be running before the start up cache is populated? I can try waiting for that and taking another profile.

Flags: needinfo?(michael.l.comella) → needinfo?(rob)

(In reply to Rob Wu [:robwu] from comment #6)

In Github, you wrote:

Could this be related to the start up cache? I think I cleared data for the app though before I took the additional measurements...

What did you do exactly? Clearing the cache or really wiping all app data?

The process I try to follow for taking start up profiles and benchmarks is:

  1. Clear all app data: adb shell pm clear org.mozilla.fenix
  2. Launch the app once to ensure we're not measuring a first run. I only let the app idle for a few seconds
  3. Force-stop the app: adb shell am force-stop org.mozilla.fenix
  4. Launch the app, taking a profile/benchmark

It sounds like this process might not give the app enough time to populate the start up cache.

:robwu, how long does the app need to be running before the start up cache is populated? I can try waiting for that and taking another profile.

In the profile that you've shared, it appears that all extensions have finished initializing after 3-4 seconds (not counting the load of the background script, which takes another second).
The extension's StartupCache is populated with a delay, 5 seconds after the first attempt to write to the StartupCache: https://searchfox.org/mozilla-central/rev/49b6e60550243b4b4d71d6ab35a3ff2b9a9f7c69/toolkit/components/extensions/ExtensionParent.jsm#1753,1757

... all together, if you wait for 10 seconds, that should be plenty to have initialized the extension's StartupCache.

Flags: needinfo?(rob)

Changing severity to S3 because this looks like a testing method issue.

Severity: -- → S3
Priority: -- → P3

... all together, if you wait for 10 seconds, that should be plenty to have initialized the extension's StartupCache.

I took a new profile that I think should have hit the start up cache but still has the same problem: https://share.firefox.dev/3DWSSNd Here were my steps:

  • I took some perf tests on yesterdays nightly (25 replicates), left the device screen open for many minutes while I did other things
  • Hit back to background the app
  • Waited about 20 seconds (no stopwatch)
  • Force-stopped the app
  • Took a start up profile

And I followed similar steps (excluding the perf test) to take this one and see it too: https://share.firefox.dev/3hb01PX

Changing severity to S3 because this looks like a testing method issue.

Since I can reproduce after fixing the methodology, I'm reverting the severity.

Severity: S3 → N/A
Priority: P3 → --

:_:

I discovered what's happening. The startup caches are purged at startup by logic from bug 1422087, because the application never exits gracefully on Android. Bug 1624724 recognized the issue and attempted to fix it, but the logic doesn't work as explained in more detail below. This doesn't only affect extensions, but anything that's stored in the startup cache.
The "good" news is that this issue lasts for 6 hours; any profile whose age is older than 6 hours won't suffer from this bug.

Cause of purged caches

At startup, the existence of a dummy canary file called .startup-incomplete is checked.
When the file exists and is younger than 6 hours, the logic assumes that a crash has happened, which triggers the removal of the caches (including the extension's startup cache).
The file is created if it doesn't exist (which is the expected result). This created file is removed when the application exits gracefully (which doesn't happen on mobile), or removed "shortly after startup", which is triggered via browser-idle-startup-tasks-finished in GeckoView). The latter logic was introduced in bug 1624724, but it doesn't seem to work (as seen below, the files still exist and are many years old).

* I just opened about:debugging on desktop, and connected the debugger to two of the many old Firefox browsers that I have on my phone, and can confirm that the file is never actually removed, based on the observation that the files are from 2018 (when the feature was introduced in bug 1422087) and 2019 (when I started using Fenix Nightly):

new Date((await IOUtils.stat(Services.dirsvc.get("ProfLD", Ci.nsIFile).path + "/.startup-incomplete")).lastModified)

Startup canary not removed via browser-idle-startup-tasks-finished

As mentioned before, the nsAppStartup::trackStartupCrashEnd method is supposedly called by a browser-idle-startup-tasks-finished observer in GeckoViewStartup.jsm.
This observer was introduced in bug 1624724 and is not triggered however. In the profile (e.g. https://share.firefox.dev/3hb01PX from comment 11), you can try to search for "crash" in the Marker Table. startupCrashDetectionBegin (from TrackStartupCrashBegin) shows up, but startupCrashDetectionEnd (from TrackStartupCrashEnd does not. This is despite the fact that the notification itself is triggered by geckoview.js (seen in the Marker Table as NotifyObservers - browser-idle-startup-tasks-finished). So the issue is that the observer in GeckoViewStartup.jsm is not registered.

The only sign of an attempt at registering the observer is in GeckoView.manifest, which contains the following:

# GeckoViewStartup.js
category app-startup GeckoViewStartup service,@mozilla.org/geckoview/startup;1
category profile-after-change GeckoViewStartup @mozilla.org/geckoview/startup;1 process=main
category browser-idle-startup-tasks-finished GeckoViewStartup @mozilla.org/geckoview/startup;1 process=main

... therefore the patch for bug 1624724 never worked.

I'll check if I can create a patch with a regression test.

Assignee: nobody → rob
Status: NEW → ASSIGNED
Summary: Investigate perf improvements in parsing web extension manifests on start up → GeckoView unconditionally clears the startup cache of new profiles (less than 6 hours old)
Whiteboard: [addons-jira]

(In reply to Rob Wu [:robwu] from comment #12)

... therefore the patch for bug 1624724 never worked.

It's been awhile since I worked on this, but I'm pretty sure I manually verified the patch was working. Isn't browser-idle-startup-tasks-finished triggered from https://searchfox.org/mozilla-central/rev/aa46c2dcccbc6fd4265edca05d3d00cccdfc97b9/mobile/android/chrome/geckoview/geckoview.js#792

(In reply to Brendan Dahl [:bdahl] from comment #13)

(In reply to Rob Wu [:robwu] from comment #12)

... therefore the patch for bug 1624724 never worked.

It's been awhile since I worked on this, but I'm pretty sure I manually verified the patch was working. Isn't browser-idle-startup-tasks-finished triggered from https://searchfox.org/mozilla-central/rev/aa46c2dcccbc6fd4265edca05d3d00cccdfc97b9/mobile/android/chrome/geckoview/geckoview.js#792

How did you manually verify the patch?

That line indeed notifies any listening observers, and I confirmed that the call appears in the profiler from comment 11 (and also a local profile after waiting for 5 minutes; the notification was triggered after 2 seconds).

The file is however not removed (see second part of comment 12), and when I look at the implementation of nsAppStartup::TrackStartupCrashEnd the file should have been removed along with recording a profiler marker.

In the last part of comment 12 I explained why I think that the observer listener is not registered (hence not triggered).

Here is a profile collected from a local Fenix build where the startup cache isn't discarded at every GeckoView startup (as a quick test I did just add an explicit call to Services.startup.trackStartupCrashEnd from here in geckoview.js):

the profile data collected on the second startup is definitely quicker (and looking closely I can see that in this profile the API schema data has been loaded from the startupCache as expected).

It would be nice if we could add an automated test case that verifies that the startupCache is retained when is expected to along with the fix, I'm not sure if we can run marionette (or marionette-like) tests on the GeckoView example app that is part of mozilla-central, but that could be something to also double-check (either as part of fixing this issue, or as a follow up).

Note that because the gecko profiler is a sampling profiler, just because a function doesn't appear in the profile it doesn't mean it wasn't called. You need a profile marker if you want to be certain of that.

(In reply to Kris Maglione [:kmag] from comment #16)

Note that because the gecko profiler is a sampling profiler, just because a function doesn't appear in the profile it doesn't mean it wasn't called. You need a profile marker if you want to be certain of that.

Thanks for emphasizing that. I did rely on markers for the analysis:

(In reply to Rob Wu [:robwu] from comment #12)

In the profile (e.g. https://share.firefox.dev/3hb01PX from comment 11), you can try to search for "crash" in the Marker Table. startupCrashDetectionBegin (from TrackStartupCrashBegin) shows up, but startupCrashDetectionEnd (from TrackStartupCrashEnd does not. This is despite the fact that the notification itself is triggered by geckoview.js (seen in the Marker Table as NotifyObservers - browser-idle-startup-tasks-finished).

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #15)

It would be nice if we could add an automated test case that verifies that the startupCache is retained when is expected to along with the fix, I'm not sure if we can run marionette (or marionette-like) tests on the GeckoView example app that is part of mozilla-central, but that could be something to also double-check (either as part of fixing this issue, or as a follow up).

I'll attach a patch that confirms the bug, and verifies the fix.

Component: Android → General
Product: WebExtensions → GeckoView
Pushed by rob@robwu.nl: https://hg.mozilla.org/integration/autoland/rev/be0435c60f56 Properly remove startup canary on Android to avoid purged caches r=agi,rpl,mossop
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch

Perf analysis: I ran our cold view (app link) time to nav start start up test, modifying it to sleep 60 seconds after the "warm up" run before beginning the test. From these nightlies:

  • 9/15: 2441ms
  • 9/20 (5am): 1933ms
  • delta: 508ms

So it looks like it worked! :) To be fair, there may have been improvements other than the start up cache that happened in this time frame. Here are profiles:

We still see some manifest work being sampled but it's not as long. Thank you for identifying this and addressing it!

Thanks Michael for linking the new profiles.

As a side note: the 3 new profiles linked from comment 21 do now include the "ExtensionSchemas" profiler marker added from Bug 1728245,
that marker may be a useful one to look at in the future startup perf analysis (especially if we suspect there may be a regression due to impact of extensions during the startup path):

  • the "ExtensionSchemas" markers from profile 9/15 nightly shows that the Schemas.load was loading all the data from file (and as an additional side effect of that the "TelemetryStopwatch" marker for WEBEXT_EXTENSION_STARTUP_MS shows that starting all the builtin addons did take longer than the other two profiles, as expected due to the additional work needed to load all the schema data from files)

  • the "ExtensionSchemas" markers from both the profiles 9/20 nightly shows that the Schemas.load was loading all the data from file (and as an additional side effect the "TelemetryStopwatch" marker for WEBEXT_EXTENSION_STARTUP_MS shows that starting all the builtin addons has been faster)

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: