Closed Bug 1364235 Opened 7 years ago Closed 6 years ago

Collect Telemetry stats on startup cache hits and misses

Categories

(Core :: XPConnect, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox67 --- fixed

People

(Reporter: ehsan.akhgari, Assigned: dthayer)

References

Details

(Keywords: perf, Whiteboard: [fxperf:p1])

Attachments

(3 files)

In bug 1363384 comment 1, Florian observed some main thread IO when Cu.import()ing some JS module.  Looking at the call stack, I noticed that he isn't getting a working startup cache.  I checked with him and he wasn't on a new profile, which raised some alarms.  We looked into his profile and his startupCache.4.little file was very small (about 45KB.)  He'll attach it to the bug soon.  We looked inside the zip package and there were only a few entries corresponding to a few scripts that should be in the startup cache but almost everything else was missing.

In the mean time there is also bug 1351071 which was backed out recently on the suspicion that the generation of startup cache may somehow be busted and therefore users may not be getting a startup cache after that bug landed.  Florian's Nightly on his machine was the latest build so he already have the backout for bug 1351071 in it.

As we were looking at things on Florian's machine, we noticed his startupCache.4.little file changed again and this time grew to somewhere around 700KB or something like that which sounds more like what I would expect a typical startup cache to be around.

We need to investigate why this is happening (and what is really happening...)
Attached file startupCache.4.little.zip (deleted) β€”
Attached file scriptCache-new.bin (deleted) β€”
Priority: -- → P2
It would be really helpful if we can find someone to dig into this sooner than later.
Flags: needinfo?(nihsanullah)
Kris worked on Bug 1361900 and Bug 1359653 so it might be worth to check if those patches fixed/caused this problem. We don't store all the cached scripts in the same file any longer if I recall it correctly. Is the issue reproducible to verify?
Whiteboard: [qf] → [qf:investigate:p1]
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #4)
> Kris worked on Bug 1361900 and Bug 1359653 so it might be worth to check if
> those patches fixed/caused this problem. We don't store all the cached
> scripts in the same file any longer if I recall it correctly. Is the issue
> reproducible to verify?

I don't know much more about this besides comment 0, unfortunately.  Someone needs to look into the reproducibility aspect as part of the investigation too...
Hi Ting-yu,

Is there any chance you can please take a look at this bug and investigate it please?  I'd like to get a sense of whether this is a real regression and what causes it.  We don't have a lot of information about this bug yet unfortunately, so I suggest starting by trying to reproduce the bug from a Nightly build around the date of comment 0 or when bug 1363384 got filed and seeing if you can reproduce it...
Flags: needinfo?(janus926)
Sure, bug 1351071 was backed out at 2017-05-09 and relanded at 2017-05-19, which seems match to the date that Florian profiled and when the cache went back to normal.

If bug 1351071 really causes that, Florian should see a smaller cache file now, do you? Also are you able to reproduce the profile of bug 1363384 comment 1?

On my Ubuntu, Nightly 20170509222706 (I am not sure whether it includes the backout for bug 1351071, but based on the timestamp it should) has ~/.cache/.mozilla/firefox-trunk/x.default/startupCache/startupCache.8.little ~1770k. After updating to Nightly 20170523165845, the file size is ~1340k which seems still normal with bug 1351071 in it.

I assume the cache grows over time, could it be possible that the profile was captured when the cache was just created?
Flags: needinfo?(janus926) → needinfo?(florian)
Or maybe StartupCache::Init() was just failed, so it went to the slow path. There are cases that it could fail.
Florian: ping?
(In reply to Ting-Yu Chou [:ting] from comment #7)

> If bug 1351071 really causes that, Florian should see a smaller cache file
> now, do you? Also are you able to reproduce the profile of bug 1363384
> comment 1?

Not exactly sure of what you are asking, but nsZipArchive::GetDataOffset is still expensive during cold startup, see https://perfht.ml/2tAdjgA

The scriptCache.* files I'm seeing today in the "startupCache" folder of the profile all weight more than 4MB, and the scriptCache-child.* files weight ~1.5MB.


Here is a startup profile if I remove the "startupCache" folder from the profile folder: https://perfht.ml/2tAN2ij

The interesting thing to notice is that the main thread is idle twice before first paint, while loading the browser.xul window and its dependencies. nsZipArchive::GetDataOffset isn't visible there.



These 2 periods of idleness on the main thread during startup puzzled me a lot since I first saw them, back around the time this bug was filed. I could reproduce consistently with warm startup on the reference hardware during the SF all hands. Last time I reproduced for sure was Thursday last week (July 6th). Today I can't reproduce.

I had noticed in SF that I could get a similar startup profile on my macbook if I packaged a local build and started it with the -purgecaches flag. And now I just realized that I can also force the reference hardware to generate such a profile by deleting the startupCache folder.

So... I don't have any proof, but I think this all gives strong evidence that the startup cache either isn't working correctly, or wasn't until at least Thursday last week.
Flags: needinfo?(florian)
Keywords: perf
Hey guys,

Since Firefox 56 we are experiencing random weird behaviours of the Cliqz extension and the only thing we knew so far was that it happens mostly after the extension updates.
We finally managed to pin it down and we think it is related with the issue described in this bug.
When a user runs version X of our extension and updates to version X+1 he has a chance (20-30% from our telemetry) to run a version which uses files from both versions.
More precisely the issue seems to be in β€œServices.scriptloader.loadSubScript” which will load a cached version of some files (even bootstrap.js!) from an older version from "/Users/YYYYY/Library/Caches/Firefox/Profiles/XXXXX/startupCache/scriptCache.bin” or ".../scriptCache-current.bin"

It is not 100% reproducible but after a few tries it happens. Here are the steps:
 - new profile in latest Firefox release (58.0.2) probably happens in more versions, likely after Firefox 56
 - install any version from https://addons.mozilla.org/de/firefox/addon/cliqz/versions/   (Eg 2.23.3)
 - use the search dropdown and the browser in general
 - restart the browser, use more the search, restart again (maybe? :) )
 - install another version (eg 2.24.5)
 - in this moment all the resources loaded with β€œServices.scriptloader.loadSubScript” are from the old version (2.23.3) including our bootstrap.js file which is the entry to our extension although we upgraded to 2.24.5
 - and all the other files are updated (eg process scripts, or iframes or whatever) 
 - it seems to happen more often on downgrade (from 2.24.5 to 2.23.3) but not exclusively so try to alternate upgrade and downgrade to reproduce

To prove that we prepared a small snippet:
  // this will return the wrong version (2.24.5 in the screenshot)
  var win = { global: {} }; 
  Services.scriptloader.loadSubScript('chrome://cliqz/content/runloop.js', win);
  Services.scriptloader.loadSubScript('chrome://cliqz/content/core/app.bundle.js', win);
  console.log(win.global.config.EXTENSION_VERSION); 

  // and this will return the right version (2.23.3 in the screenshot)
  var win = { global: {} }; 
  Services.scriptloader.loadSubScriptWithOptions('chrome://cliqz/content/runloop.js',  { target: win, ignoreCache: true });
  Services.scriptloader.loadSubScriptWithOptions('chrome://cliqz/content/core/app.bundle.js',  { target: win, ignoreCache: true });
  console.log(win.global.config.EXTENSION_VERSION); 

Also a screenshot: https://screenshots.firefox.com/DtUTAx7fKo585FQp/null 
Behind the Browser Console you can see the right version of the code loaded via the chrome url.

We temporarily moved towards "loadSubScriptWithOptions" where we are able to ignore the Cache but it would be great to find a better solution.  https://github.com/cliqz-oss/browser-core/compare/2.24.5...2.24.7#diff-33f3f1d203440162225e08f676fa1d4eR20

Please let me know if these findings make sense.

Cheers!
Lucian
PS: probably important - we are a bootstrapped extension
(In reply to :Ehsan Akhgari from comment #0)
> We looked into his profile and his
> startupCache.4.little file was very small (about 45KB.)  He'll attach it to
> the bug soon.  We looked inside the zip package and there were only a few
> entries corresponding to a few scripts that should be in the startup cache
> but almost everything else was missing.
> 
> As we were looking at things on Florian's machine, we noticed his
> startupCache.4.little file changed again and this time grew to somewhere
> around 700KB or something like that which sounds more like what I would
> expect a typical startup cache to be around.

Interesting. I've actually been thinking about fixing the opposite problem. The startup cache tends to be way too big, because it still duplicates files from the preloader cache. And on Windows, last I checked, we still open that file with aggressive read-ahead enabled, which means lots of wasted IO.

> We need to investigate why this is happening (and what is really
> happening...)

It's worth noting that we still purge the startup caches in a bunch of situations, including after every app or extension update.
(In reply to lucian from comment #11)
> Since Firefox 56 we are experiencing random weird behaviours of the Cliqz
> extension and the only thing we knew so far was that it happens mostly after
> the extension updates.

This is completely unrelated to this bug. In fact, if anything, the behavior you describe is the opposite of the behavior described here. Please file a separate bug for this.

Thanks.
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #14)
> This is completely unrelated to this bug. In fact, if anything, the behavior
> you describe is the opposite of the behavior described here. Please file a
> separate bug for this.

Thanks for your feedback Kris!
Submitted a new bug for our issue - https://bugzilla.mozilla.org/show_bug.cgi?id=1444957
Clearly ni on Naveed as he no longer works at Mozilla.
Flags: needinfo?(nihsanullah)
Whiteboard: [qf:investigate:p1] → [fxperf][qf]

It's not clear to me if this is still an issue. Reading through this bug again, Florian saw it once, and it may have been fixed by some of kmag's work.

It'd be great if we had information about cache hits and misses when profiling start-up.

Summary: The startup cache is probably broken → Collect Telemetry stats on startup cache hits and misses
Whiteboard: [fxperf][qf] → [fxperf]

We should do this sooner rather than later. We've definitely seen this in more recent performance profiles on the reference device. Mossop found one issue on Windows (bug 1519184) but without telemetry we have no idea what the size of the problem is.

Whiteboard: [fxperf] → [fxperf:p1]

Taking a look at this.

Assignee: nobody → dothayer
Status: NEW → ASSIGNED

Do we know what we're looking for with this data? What are we expecting to see if there are no bugs? We can filter out in our analysis the cases where we know the cache is invalidated like app and extension updates, but do we know what we want to see after we've filtered out what we can, or are we just going to look at the hit/miss rates and say "hmm, does that look right?"

We're going to stop flushing the cache after extension updates. Except in the case of local builds, we really only want to see it flushed after app updates.

I'm not sure how much we want to filter even that, though, given that we've seen bugs where we trigger that code path when we don't want to...

In bug 1264235 we have some indication that observed bugs with the
startup cache might have been resolved, but we don't really know
until we collect data. Collecting these stats will give us the
ability to have more certainty that the startup cache is functioning
correctly in the wild.

In case it's relevant, we used to have some telemetry, added in bug 711297, but it seems to have been removed since then.

Attachment #9043412 - Attachment description: Bug 1364235 - Collect telemetry stats on startup cache hits and misses r?erahm → Bug 1364235 - Collect telemetry stats on startup cache hits and misses r?kmag

(Hopefully a needinfo is sufficient without review flags. Should https://wiki.mozilla.org/Firefox/Data_Collection#Requesting_Approval be updated?)

  1. What questions will you answer with this data?
  • To what extent are our caches improving startup performance?
  • How often do we see startup caches not aligning with our expectations? I.e., do we see any users who consistently encounter mostly misses w/ their startup caches (accounting for invalidations from app updates)
  • Ongoing: have startup caches regressed in some way?
  1. Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
  • Establish baselines or measure changes in product or platform quality or performance.

  • We have a bug report (in this bug) indicating that this was at one time a problem. We want to try to assess if it is still a problem.

  1. What alternative methods did you consider to answer these questions? Why were they not sufficient?
  • Eyeballing the code / trying to reproduce cache problems ourselves. Neither of these can show conclusively that there's not a significant problem in the wild.
  1. Can current instrumentation answer these questions?

No.

  1. List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
  • Number of hits / misses to the StartupCache

  • Number of hits / misses to the ScriptPreloader

  • Distribution of amount of time spent waiting for off-thread compiles in the ScriptPreloader

  • How many times we ended up recompiling a script from the script preloader on the main thread.

  1. How long will this data be collected? Choose one of the following:
  • I want this data to be collected for 6 months initially (potentially renewable).
  1. What populations will you measure?
  • Which release channels?

All

  • Which countries?

All

  • Which locales?

All

  • Any other filters? Please describe in detail below.
  1. If this data collection is default on, what is the opt-out mechanism for users?

The usual technical data opt-out in about:preferences

  1. Please provide a general description of how you will analyze this data.

Viewing the histograms via TMO, and deeper analysis via SQL or Spark

  1. Where do you intend to share the results of your analysis?

Likely in this bug or a follow-up.

  1. Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection? If so:

No.

Flags: needinfo?(chutten)

Preliminary Notes:

Sorry about that. I've updated the docs to specify that you should use the new data-revew flag on attachments.

Also, your review mentions you'd like to collect on all channels, but your definitions will only record on pre-release channels. For the purposes of Data Review I'll assume you want to collect on all channels and will update the definitions accordingly.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions files (Histograms.json and Scalars.yaml), the Probe Dictionary, and on telemetry.mozilla.org's Measurement Dashboards.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

N/A this collection expires in Firefox 73.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :dthayer is responsible for removing or renewing the collection before it expires in Firefox 73.


Result: datareview+

Flags: needinfo?(chutten)
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a681dc60b60
Collect telemetry stats on startup cache hits and misses r=kmag

(In reply to Pulsebot from comment #26)

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a681dc60b60
Collect telemetry stats on startup cache hits and misses r=kmag

Why was the startup test as written by using Marionette put under testing/marionette/../unit/? That folder is actually the wrong place given that it only contains the unit tests for Marionette itself. Can you please move it to the same location as the other files. /startup/tests/marionette would be great. You only would have to reference it in:

https://searchfox.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit-tests.ini

Flags: needinfo?(dothayer)

Ack - my mistake, sorry. I will move it shortly. Leaving the needinfo.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
Depends on: 1533056
Flags: needinfo?(dothayer)

Preliminary analysis:

ScriptPreloader:

Hit         116977763
HitChild      2559619
Miss        160545632

StartupCache:

HitMemory      153288
HitDisk      60646842
Miss          9053288

This is just a straightforward analysis of the data on Nightly for cases where this isn't the first run after an update/install, which unfortunately on Nightly is relatively rare and might deviate quite a bit from the release situation.

Nothing in these numbers seems very surprising to me.

I think the next steps are to do an analysis to see if any profiles are somehow stuck in a pathological case where the hit/miss rates are consistently bad.

Numbers on Beta look qualitatively similar:

ScriptPreloader:

Hit         1060675539    47%
HitChild      26926693    01%
Miss        1152341886    52%

Total       2239944118

StartupCache:

HitMemory       572734    00%
HitDisk      674959366    93%
Miss          48281907    07%

Total        723814007
Depends on: 1590385
Regressions: 1590439
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: