Bugzilla

Gabor Krizsanits (INACTIVE)

Reporter

Comment 3

•

7 years ago

It would be really helpful if we can find someone to dig into this sooner than later.

Flags: needinfo?(nihsanullah)

Comment 4

•

7 years ago

Kris worked on Bug 1361900 and Bug 1359653 so it might be worth to check if those patches fixed/caused this problem. We don't store all the cached scripts in the same file any longer if I recall it correctly. Is the issue reproducible to verify?

Andrew Overholt [:overholt]

Updated

•

7 years ago

Whiteboard: [qf] → [qf:investigate:p1]

Reporter

Comment 5

•

7 years ago

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #4)
> Kris worked on Bug 1361900 and Bug 1359653 so it might be worth to check if
> those patches fixed/caused this problem. We don't store all the cached
> scripts in the same file any longer if I recall it correctly. Is the issue
> reproducible to verify?

I don't know much more about this besides comment 0, unfortunately.  Someone needs to look into the reproducibility aspect as part of the investigation too...

Ting-Yu Chou [:ting] (away)

Reporter

Comment 6

•

7 years ago

Hi Ting-yu,

Is there any chance you can please take a look at this bug and investigate it please?  I'd like to get a sense of whether this is a real regression and what causes it.  We don't have a lot of information about this bug yet unfortunately, so I suggest starting by trying to reproduce the bug from a Nightly build around the date of comment 0 or when bug 1363384 got filed and seeing if you can reproduce it...

Flags: needinfo?(janus926)

Comment 7

•

7 years ago

Sure, bug 1351071 was backed out at 2017-05-09 and relanded at 2017-05-19, which seems match to the date that Florian profiled and when the cache went back to normal.

If bug 1351071 really causes that, Florian should see a smaller cache file now, do you? Also are you able to reproduce the profile of bug 1363384 comment 1?

On my Ubuntu, Nightly 20170509222706 (I am not sure whether it includes the backout for bug 1351071, but based on the timestamp it should) has ~/.cache/.mozilla/firefox-trunk/x.default/startupCache/startupCache.8.little ~1770k. After updating to Nightly 20170523165845, the file size is ~1340k which seems still normal with bug 1351071 in it.

I assume the cache grows over time, could it be possible that the profile was captured when the cache was just created?

Flags: needinfo?(janus926) → needinfo?(florian)

Ting-Yu Chou [:ting] (away)

Comment 8

•

7 years ago

Or maybe StartupCache::Init() was just failed, so it went to the slow path. There are cases that it could fail.

Firefox Bug Husbandry Bot

Reporter

Comment 9

•

7 years ago

Florian: ping?

Florian Quèze [:florian]

Comment 10

•

7 years ago

(In reply to Ting-Yu Chou [:ting] from comment #7)

> If bug 1351071 really causes that, Florian should see a smaller cache file
> now, do you? Also are you able to reproduce the profile of bug 1363384
> comment 1?

Not exactly sure of what you are asking, but nsZipArchive::GetDataOffset is still expensive during cold startup, see https://perfht.ml/2tAdjgA

The scriptCache.* files I'm seeing today in the "startupCache" folder of the profile all weight more than 4MB, and the scriptCache-child.* files weight ~1.5MB.


Here is a startup profile if I remove the "startupCache" folder from the profile folder: https://perfht.ml/2tAN2ij

The interesting thing to notice is that the main thread is idle twice before first paint, while loading the browser.xul window and its dependencies. nsZipArchive::GetDataOffset isn't visible there.



These 2 periods of idleness on the main thread during startup puzzled me a lot since I first saw them, back around the time this bug was filed. I could reproduce consistently with warm startup on the reference hardware during the SF all hands. Last time I reproduced for sure was Thursday last week (July 6th). Today I can't reproduce.

I had noticed in SF that I could get a similar startup profile on my macbook if I packaged a local build and started it with the -purgecaches flag. And now I just realized that I can also force the reference hardware to generate such a profile by deleting the startupCache folder.

So... I don't have any proof, but I think this all gives strong evidence that the startup cache either isn't working correctly, or wasn't until at least Thursday last week.

Flags: needinfo?(florian)

:Gijs (he/him)

Updated

•

7 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1389778

Updated

•

7 years ago

Keywords: perf

lucian

Comment 11

•

7 years ago

Hey guys,

Since Firefox 56 we are experiencing random weird behaviours of the Cliqz extension and the only thing we knew so far was that it happens mostly after the extension updates.
We finally managed to pin it down and we think it is related with the issue described in this bug.
When a user runs version X of our extension and updates to version X+1 he has a chance (20-30% from our telemetry) to run a version which uses files from both versions.
More precisely the issue seems to be in “Services.scriptloader.loadSubScript” which will load a cached version of some files (even bootstrap.js!) from an older version from "/Users/YYYYY/Library/Caches/Firefox/Profiles/XXXXX/startupCache/scriptCache.bin” or ".../scriptCache-current.bin"

It is not 100% reproducible but after a few tries it happens. Here are the steps:
 - new profile in latest Firefox release (58.0.2) probably happens in more versions, likely after Firefox 56
 - install any version from https://addons.mozilla.org/de/firefox/addon/cliqz/versions/   (Eg 2.23.3)
 - use the search dropdown and the browser in general
 - restart the browser, use more the search, restart again (maybe? :) )
 - install another version (eg 2.24.5)
 - in this moment all the resources loaded with “Services.scriptloader.loadSubScript” are from the old version (2.23.3) including our bootstrap.js file which is the entry to our extension although we upgraded to 2.24.5
 - and all the other files are updated (eg process scripts, or iframes or whatever) 
 - it seems to happen more often on downgrade (from 2.24.5 to 2.23.3) but not exclusively so try to alternate upgrade and downgrade to reproduce

To prove that we prepared a small snippet:
  // this will return the wrong version (2.24.5 in the screenshot)
  var win = { global: {} }; 
  Services.scriptloader.loadSubScript('chrome://cliqz/content/runloop.js', win);
  Services.scriptloader.loadSubScript('chrome://cliqz/content/core/app.bundle.js', win);
  console.log(win.global.config.EXTENSION_VERSION); 

  // and this will return the right version (2.23.3 in the screenshot)
  var win = { global: {} }; 
  Services.scriptloader.loadSubScriptWithOptions('chrome://cliqz/content/runloop.js',  { target: win, ignoreCache: true });
  Services.scriptloader.loadSubScriptWithOptions('chrome://cliqz/content/core/app.bundle.js',  { target: win, ignoreCache: true });
  console.log(win.global.config.EXTENSION_VERSION); 

Also a screenshot: https://screenshots.firefox.com/DtUTAx7fKo585FQp/null 
Behind the Browser Console you can see the right version of the code loaded via the chrome url.

We temporarily moved towards "loadSubScriptWithOptions" where we are able to ignore the Cache but it would be great to find a better solution.  https://github.com/cliqz-oss/browser-core/compare/2.24.5...2.24.7#diff-33f3f1d203440162225e08f676fa1d4eR20

Please let me know if these findings make sense.

Cheers!
Lucian

lucian

Comment 12

•

7 years ago

PS: probably important - we are a bootstrapped extension

Kris Maglione [:kmag]

Comment 13

•

7 years ago

(In reply to :Ehsan Akhgari from comment #0)
> We looked into his profile and his
> startupCache.4.little file was very small (about 45KB.)  He'll attach it to
> the bug soon.  We looked inside the zip package and there were only a few
> entries corresponding to a few scripts that should be in the startup cache
> but almost everything else was missing.
> 
> As we were looking at things on Florian's machine, we noticed his
> startupCache.4.little file changed again and this time grew to somewhere
> around 700KB or something like that which sounds more like what I would
> expect a typical startup cache to be around.

Interesting. I've actually been thinking about fixing the opposite problem. The startup cache tends to be way too big, because it still duplicates files from the preloader cache. And on Windows, last I checked, we still open that file with aggressive read-ahead enabled, which means lots of wasted IO.

> We need to investigate why this is happening (and what is really
> happening...)

It's worth noting that we still purge the startup caches in a bunch of situations, including after every app or extension update.

Kris Maglione [:kmag]

Comment 14

•

7 years ago

(In reply to lucian from comment #11)
> Since Firefox 56 we are experiencing random weird behaviours of the Cliqz
> extension and the only thing we knew so far was that it happens mostly after
> the extension updates.

This is completely unrelated to this bug. In fact, if anything, the behavior you describe is the opposite of the behavior described here. Please file a separate bug for this.

Thanks.

lucian

Comment 15

•

7 years ago

(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #14)
> This is completely unrelated to this bug. In fact, if anything, the behavior
> you describe is the opposite of the behavior described here. Please file a
> separate bug for this.

Thanks for your feedback Kris!
Submitted a new bug for our issue - https://bugzilla.mozilla.org/show_bug.cgi?id=1444957

Sylvestre Ledru [:Sylvestre]

Comment 16

•

6 years ago

Clearly ni on Naveed as he no longer works at Mozilla.

Flags: needinfo?(nihsanullah)

Updated

•

6 years ago

Whiteboard: [qf:investigate:p1] → [fxperf][qf]

Comment 17

•

6 years ago

It's not clear to me if this is still an issue. Reading through this bug again, Florian saw it once, and it may have been fixed by some of kmag's work.

It'd be great if we had information about cache hits and misses when profiling start-up.

Updated

•

6 years ago

Summary: The startup cache is probably broken → Collect Telemetry stats on startup cache hits and misses

Whiteboard: [fxperf][qf] → [fxperf]

Updated

•

6 years ago

Blocks: 1522877

:Gijs (he/him)

Comment 18

•

6 years ago

We should do this sooner rather than later. We've definitely seen this in more recent performance profiles on the reference device. Mossop found one issue on Windows (bug 1519184) but without telemetry we have no idea what the size of the problem is.

Whiteboard: [fxperf] → [fxperf:p1]

Assignee

Comment 19

•

6 years ago

Taking a look at this.

Assignee: nobody → dothayer

Status: NEW → ASSIGNED

Assignee

Comment 20

•

6 years ago

Do we know what we're looking for with this data? What are we expecting to see if there are no bugs? We can filter out in our analysis the cases where we know the cache is invalidated like app and extension updates, but do we know what we want to see after we've filtered out what we can, or are we just going to look at the hit/miss rates and say "hmm, does that look right?"

Kris Maglione [:kmag]

Comment 21

•

6 years ago

We're going to stop flushing the cache after extension updates. Except in the case of local builds, we really only want to see it flushed after app updates.

I'm not sure how much we want to filter even that, though, given that we've seen bugs where we trigger that code path when we don't want to...

Assignee

Comment 22

•

6 years ago

Attached file Bug 1364235 - Collect telemetry stats on startup cache hits and misses r?kmag (deleted) — Details

In bug 1264235 we have some indication that observed bugs with the
startup cache might have been resolved, but we don't really know
until we collect data. Collecting these stats will give us the
ability to have more certainty that the startup cache is functioning
correctly in the wild.

:Gijs (he/him)

Comment 23

•

6 years ago

In case it's relevant, we used to have some telemetry, added in bug 711297, but it seems to have been removed since then.

Phabricator Automation

Updated

•

6 years ago

Attachment #9043412 - Attachment description: Bug 1364235 - Collect telemetry stats on startup cache hits and misses r?erahm → Bug 1364235 - Collect telemetry stats on startup cache hits and misses r?kmag

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 24

•

6 years ago

(Hopefully a needinfo is sufficient without review flags. Should https://wiki.mozilla.org/Firefox/Data_Collection#Requesting_Approval be updated?)

What questions will you answer with this data?

To what extent are our caches improving startup performance?
How often do we see startup caches not aligning with our expectations? I.e., do we see any users who consistently encounter mostly misses w/ their startup caches (accounting for invalidations from app updates)
Ongoing: have startup caches regressed in some way?

Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?

Establish baselines or measure changes in product or platform quality or performance.
We have a bug report (in this bug) indicating that this was at one time a problem. We want to try to assess if it is still a problem.

What alternative methods did you consider to answer these questions? Why were they not sufficient?

Eyeballing the code / trying to reproduce cache problems ourselves. Neither of these can show conclusively that there's not a significant problem in the wild.

Can current instrumentation answer these questions?

No.

List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.

Number of hits / misses to the StartupCache
- Category 1
- Bug 1364235
Number of hits / misses to the ScriptPreloader
- Category 1
- Bug 1364235
Distribution of amount of time spent waiting for off-thread compiles in the ScriptPreloader
- Category 1
- Bug 1364235
How many times we ended up recompiling a script from the script preloader on the main thread.
- Category 1
- Bug 1364235

How long will this data be collected? Choose one of the following:

I want this data to be collected for 6 months initially (potentially renewable).

What populations will you measure?

Which release channels?

All

Which countries?

All

Which locales?

All

Any other filters? Please describe in detail below.

If this data collection is default on, what is the opt-out mechanism for users?

The usual technical data opt-out in about:preferences

Please provide a general description of how you will analyze this data.

Viewing the histograms via TMO, and deeper analysis via SQL or Spark

Where do you intend to share the results of your analysis?

Likely in this bug or a follow-up.

Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection? If so:

No.

Flags: needinfo?(chutten)

Chris H-C :chutten

Comment 25

•

6 years ago

Preliminary Notes:

Sorry about that. I've updated the docs to specify that you should use the new data-revew flag on attachments.

Also, your review mentions you'd like to collect on all channels, but your definitions will only record on pre-release channels. For the purposes of Data Review I'll assume you want to collect on all channels and will update the definitions accordingly.

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is Telemetry so is documented in its definitions files (Histograms.json and Scalars.yaml), the Probe Dictionary, and on telemetry.mozilla.org's Measurement Dashboards.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

N/A this collection expires in Firefox 73.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :dthayer is responsible for removing or renewing the collection before it expires in Firefox 73.

Result: datareview+

Flags: needinfo?(chutten)

Pulsebot

Comment 26

•

6 years ago

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a681dc60b60
Collect telemetry stats on startup cache hits and misses r=kmag

Comment 27

•

6 years ago

(In reply to Pulsebot from comment #26)

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1a681dc60b60
Collect telemetry stats on startup cache hits and misses r=kmag

Why was the startup test as written by using Marionette put under testing/marionette/../unit/? That folder is actually the wrong place given that it only contains the unit tests for Marionette itself. Can you please move it to the same location as the other files. /startup/tests/marionette would be great. You only would have to reference it in:

https://searchfox.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit-tests.ini

Flags: needinfo?(dothayer)

Stefan Hindli [:stefan_hindli]

Assignee

Comment 28

•

6 years ago

Ack - my mistake, sorry. I will move it shortly. Leaving the needinfo.

Comment 29

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1a681dc60b60

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox67: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla67

Assignee

Updated

•

6 years ago

Depends on: 1533056

Assignee

Updated

•

6 years ago

Flags: needinfo?(dothayer)

Assignee

Comment 30

•

6 years ago

Preliminary analysis:

ScriptPreloader:

Hit         116977763
HitChild      2559619
Miss        160545632

StartupCache:

HitMemory      153288
HitDisk      60646842
Miss          9053288

This is just a straightforward analysis of the data on Nightly for cases where this isn't the first run after an update/install, which unfortunately on Nightly is relatively rare and might deviate quite a bit from the release situation.

Nothing in these numbers seems very surprising to me.

I think the next steps are to do an analysis to see if any profiles are somehow stuck in a pathological case where the hit/miss rates are consistently bad.