Bug 1543776 (Open) — Opened 6 years ago, Updated 2 years ago

[meta] 5+ second delays seen while loading www.allrecipes.com on geckoview_example

Categories

(Core :: Performance, defect, P2)


Tracking


Performance Impact medium
Tracking Status
firefox68 --- affected

People

(Reporter: acreskey, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: meta, perf:pageload)

We noticed significant variations in load time in the raptor tp6m test raptor-tp6m-allrecipes-geckoview.

When reloading the site (https://www.allrecipes.com) live on a Pixel 3 (geckoview_example.apk from taskcluster, 2019.04.10) we saw huge delays where seemingly nothing was happening for 5+ seconds.

e.g.
https://perfht.ml/2Z5PTxq

and also here:
https://perfht.ml/2Z7Ovum

In both those cases the script execution and remaining network requests resumed as soon as the GC major completed.

In this profile the remaining scripts and network requests resume while the GC major is still running:
https://perfht.ml/2Z2s8pT

Also note: this site appears to be perpetually running JS, the following capture was made at least 2 minutes after the page had completed loading:
https://perfht.ml/2Z2dUFy

The major gap when loading this page is related to blocking async script parsing while we're in a GC (specifically one collecting atoms; see https://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#700 -- js::OffThreadParsingMustWaitForGC()).

This is due to bug 875125; see also bug 961318 and bug 1374797.
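
For illustration only (this is not code from allrecipes.com or from this bug; the URL is a placeholder), the page-side pattern that hits this path is an async external script. A large enough async script may be parsed on a helper thread, and while a GC is collecting the atoms zone that parse task sits in the parseWaitingOnGC queue:

// Illustrative only -- placeholder script URL, not a resource from the page.
// A sufficiently large async script may be compiled off the main thread;
// if a major GC is collecting the atoms zone when the parse is scheduled,
// the parse task waits (js::OffThreadParsingMustWaitForGC()) until the
// possibly multi-second incremental GC finishes.
const s = document.createElement("script");
s.src = "https://example.com/big-bundle.js"; // placeholder URL
s.async = true;
document.head.appendChild(s);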

Thank you :jesup.
When looking into noise in tp6 a few months back, I observed that the slow loads would frequently coincide with GCs.
Fixing this will help Bug 1502138.

Jon, it seems like when we fail OffThreadParsingMustWaitForGC(), the parse gets queued in the parseWaitingOnGC queue until the end of the GC. It seems, though, that we end up waiting in this queue until the major GC finishes, even if it is incremental and takes many seconds.

In this example, 'async' script loads result in an off-thread parse being scheduled and never finishing. Another option might be to force a sync parse and cancel the off-thread one, although that still means we are wasting the off-thread opportunity for silly reasons.

Flags: needinfo?(jcoppeard)

Here is a jsshell microbenchmark. If I remove the incremental GC, or simply move finishgc() to before the busy loop, I get concurrent operation.

var doGC = true;

// Generate a long script to parse
var fragment = " + a";
for (var i = 0; i < 19; ++i) {
    fragment = fragment + fragment;
}
var offThreadScript = "var a = 2; var x = " + fragment + ";";

// Start incremental GC
if (doGC) {
  startgc(1);
}

// Start the off-thread
var jid = offThreadCompileScript(offThreadScript);

// Spin main-thread
for (var i = 0; i < 1000000; ++i) {
    with({}){} // Prevent IonMonkey
}

// Finish incremental GC
if (doGC) {
  finishgc();
}

// Wait for parse result
runOffThreadScript(jid);
Depends on: 1543806

(Duped the microbenchmark to the JS component so patches get tracked by the JS team.)

Whiteboard: [qf] → [qf:p1:pageload]

(In reply to Ted Campbell [:tcampbell] from comment #3)

Jon, it seems like when we fail OffThreadParsingMustWaitForGC(), the parse gets queued in the parseWaitingOnGC queue until the end of the GC. It seems, though, that we end up waiting in this queue until the major GC finishes, even if it is incremental and takes many seconds.

This is correct, but we also try not to parse off thread while this is happening. See the comments here:

https://searchfox.org/mozilla-central/source/js/src/vm/OffThreadScriptCompilation.cpp#45

It's possible that the script in question is large enough that we decide to parse it off thread anyway.

This restriction is in place because off-thread parsing creates atoms, so it's not safe while a GC is collecting the atoms zone. GCs that don't collect atoms are not affected.

I think it's possible to remove this restriction if we allow the main thread to pause off thread parsing and mark in-use atoms, but this is not trivial.

Another question to ask is why this incremental GC is running for so long. Ideally incremental GCs would not take many seconds.

Blocks: GCScheduling
Flags: needinfo?(jcoppeard)

I'd say there are a few different considerations to look at here:

  1. Why is iGC taking so many seconds?
  • This is being investigated by smaug/jonco.
  • E.g. Bug 1543961
  2. Can the GC be made to support off-thread atoms access while doing a GC?
  • This is likely to be complicated, but perhaps there are partial steps in this direction.
  3. Can we remove the parser dependency on GC?
  • If the parser didn't use GC things until the final step, we could avoid worrying about the GC.
  • This may also be beneficial for avoiding lock contention.
  • We could likely avoid the mergeRealms/parser-Zone logic.
  • The SpiderMonkey team should discuss the practicality of this (Bug 1544117).
  4. What is our async-script scheduling policy?
  • Experiments with CanDoOffThread fall into this pattern.
  • More aggressive options include detecting that scripts have been in the HelperThreadState().parseWaitingOnGC queue for a long time and deciding to pull them to the main thread instead (a rough sketch of this idea follows below).
Depends on: 1544117
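
A rough, illustrative sketch of that last idea (this is not real Gecko/SpiderMonkey code; the queue shape, the helper names, and the 250 ms threshold are all hypothetical stand-ins for the HelperThreadState machinery):

// Hypothetical sketch only: rescue parse tasks that have been stuck behind a
// GC for too long by parsing them synchronously on the main thread instead.
const MAX_WAIT_ON_GC_MS = 250; // hypothetical threshold; would need tuning

function drainStuckParseTasks(waitingQueue, now, parseOnMainThread) {
  // waitingQueue: array of { enqueuedAt, source } records (hypothetical shape).
  // Returns the tasks that were pulled to the main thread.
  const rescued = [];
  const stillWaiting = [];
  for (const task of waitingQueue) {
    if (now - task.enqueuedAt >= MAX_WAIT_ON_GC_MS) {
      parseOnMainThread(task.source); // eat the parse cost synchronously
      rescued.push(task);
    } else {
      stillWaiting.push(task);
    }
  }
  waitingQueue.length = 0;
  waitingQueue.push(...stillWaiting);
  return rescued;
}

// Toy usage: the first task has waited 300 ms, the second only 10 ms,
// so only the first one is rescued.
const queue = [
  { enqueuedAt: 0, source: "var a = 1;" },
  { enqueuedAt: 290, source: "var b = 2;" },
];
drainStuckParseTasks(queue, 300, (src) => eval(src));

The trade-off, as noted above, is that a rescued script loses the off-thread parse entirely, so the threshold would have to be large enough that we only pay that cost when the GC is genuinely long-running.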

In general (1) will not help much, since we want GC and CC to run during idle time, and that idle time comes after deferred-timers time, so all the setTimeouts would get run before the load-blocking async script gets a chance to run. And we know that running setTimeouts during page load has a massive negative effect.
However, on this particular page (1) should help some.

Flags: needinfo?(acreskey)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)

Andrew, is it possible this initial patch from bug 1529044 could have caused such an issue? I noticed this pattern on Google Maps also:

P2: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1916598,1,10&series=mozilla-central,1919889,1,10&series=mozilla-central,1955057,1,10

G5: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1916573,1,10&series=mozilla-central,1918103,1,10&series=mozilla-central,1955228,1,10

Ionut, I think it's definitely possible that the initial patch from bug 1529044 could affect the pageload timing and thus make this behaviour show up in our recorded data.
But I wouldn't say that the initial patch caused the noise, since we saw this noise prior to the patch (Comment 6) and the root cause has been determined (off-thread parsing blocked on a major GC). Really, anything that shifts when the GCs occur, or when the script parsing can run, can move the noise around.

Flags: needinfo?(acreskey)

:smaug gave me the pref to disable off-thread parsing: "javascript.options.parallel_parsing".
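
For anyone reproducing this, a testing-only user.js sketch (the pref name is the one given above; false disables off-thread parsing):

// Testing only: disable off-thread (parallel) script parsing so page load
// can be compared against the default configuration.
user_pref("javascript.options.parallel_parsing", false);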

This is a performance comparison between the baseline (left-hand side) and off-thread parsing disabled (right-hand side).

Not all the tests have sufficient results yet, but from those that do we can see that off-thread parsing is in general extremely valuable, since disabling it leads to some huge regressions (e.g. 20% on Instagram over 20 jobs):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c059749d487e8668fb006b1f247de5f34edc5897&newProject=try&newRevision=ae54c2fe6dc90981905dcb3209de636c51d1dbd3&framework=10

From the test in Comment 13, only looking at tests with 20 retries:

Regressions:
• 14% regression on raptor-tp6m-espn-geckoview opt
• 11% regression on raptor-tp6m-google-maps-geckoview opt
• 22% regression on raptor-tp6m-instagram-geckoview opt
• 61% regression on raptor-tp6m-reddit-geckoview opt
• 15% regression on raptor-tp6m-stackoverflow-geckoview opt
• 5% regression on raptor-tp6m-web-de-geckoview opt (maybe)

Improvements:
• 7% improvement on raptor-tp6m-amazon-geckoview opt
• 6% improvement on raptor-tp6m-allrecipes-geckoview opt
• 19% improvement on raptor-tp6m-bbc-geckoview opt

Looking into the reddit regression subtests, it's just DCF that regresses; load time is actually improved by ~5%.

The comment 13 Try push (disabling OMT parsing) has some serious load time regressions, and lots of FCP/DCF/etc. regressions, but they're mixed. Instagram shows 20% in the geomean, but that's a 100% regression on DCF and only 0-3% on the others. Amazon search, on the other hand, has major regressions (40-60%) on everything, while Amazon has 5-7% improvements on everything.

There's an implication that in some cases not using it could help, but I'm not sure we can isolate those cases. Also, the more cores you have, the more OMT parsing probably helps -- if it doesn't conflict with GC. I wonder how this plays out on 2-core Windows laptops...

Priority: -- → P2

When loading https://www.allrecipes.com on Android, we do relatively poorly on visual metrics like SpeedIndex compared to Chrome.

Our visual behaviour is bimodal -- sometimes the large images load relatively early, and other times, like in this profile, they come in late:
https://perfht.ml/2Zdoqsl

I've noticed long (e.g. 2-second) GC majors before the image requests are made in the cases where the page is slow to load.

So I'm trying to construct a test where we defer all GC majors for as long as possible.

I've tried setting a high growth percentage on the heaps (e.g. 500) via the following prefs, but I still see GC majors:

javascript.options.mem.gc_high_frequency_heap_growth_min
javascript.options.mem.gc_high_frequency_heap_growth_max
javascript.options.mem.gc_low_frequency_heap_growth

I've also tried setting very high values for
javascript.options.mem.gc_high_frequency_low_limit_mb (1000) and
javascript.options.mem.gc_high_frequency_high_limit_mb (2000)
but I'm still seeing GC majors during pageload.
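
For reference, a testing-only user.js sketch consolidating the overrides described in this comment (the values are the ones mentioned above; the growth prefs are percentages and the limit prefs are in MB):

// Testing only: try to defer major GCs by allowing much more heap growth.
// These did not eliminate the GC majors seen during the allrecipes pageload.
user_pref("javascript.options.mem.gc_high_frequency_heap_growth_min", 500);
user_pref("javascript.options.mem.gc_high_frequency_heap_growth_max", 500);
user_pref("javascript.options.mem.gc_low_frequency_heap_growth", 500);
user_pref("javascript.options.mem.gc_high_frequency_low_limit_mb", 1000);
user_pref("javascript.options.mem.gc_high_frequency_high_limit_mb", 2000);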

Does anyone know of another way to push GC majors off? (just for testing purposes)

Smaug pointed me to the GC requests made by DOM and Layout (PokeGC),
e.g. for reasons such as CC_WAITING, PAGE_HIDE, SET_DOC_SHELL.

These are generally the source of the GCs during the allrecipes pageload.
I've converted the timing heuristics e.g. NS_GC_DELAY to StaticPrefs so I can test variations.
Giving the delays much longer values drastically improves onload event timing (~45% in a smaller run), although I haven't yet seen any improvements to visual metrics.

I found out that the GC delay timers can also be forced to fire early by the RunNextCollectorTimer mechanism.

That frequency is dictated by a different timing constant, NS_USER_INTERACTION_INTERVAL.

I made two bugs to track this:
Bug 1575938 - convert GC timing #defines to StaticPrefs
Bug 1575943 - determine best app-level GC scheduling heuristics

The RunNextCollectorTimer mechanism doesn't kick in while the top-level page is still loading (visual completeness may of course happen after that).
https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/dom/base/nsJSEnvironment.cpp#1963-1968
And not running GC often enough may lead to regressions in other tests: https://bugzilla.mozilla.org/show_bug.cgi?id=1557771#c16
(Initially MaybeRunNextCollectorSlice was added for Speedometer)

Ahh, thanks Olli.
I believe that in my profiles the load event fired before the page was visually complete, but I'll double check that.

Looking at this again, I think a way forward is to limit the total time an incremental GC runs for by increasing the slice time if it is taking too long. Trying to keep it under, say, 2 seconds would be ideal. This is relatively straightforward, and is what's mentioned in bug 1434542 comment 1.

That would be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1368972, but I guess the current setup may not kick in in all the cases.

(In reply to Olli Pettay [:smaug] from comment #23)
Oh right, I forgot I had originally filed that. I filed bug 1579426 to try something more drastic: increase slice budgets for all collections when they go past a certain point, with the aim of limiting incremental GCs to ~2 seconds (97% of GCs already finish in less than 2 seconds).
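
For intuition, a minimal sketch of the general idea (this is not the actual heuristic from bug 1579426; the base budget, target, and ramp are made-up numbers):

// Sketch: once an incremental GC has run past a target total duration, grow
// the per-slice budget so the collection is driven to completion sooner.
const BASE_SLICE_MS = 5;      // assumed default slice budget (illustrative)
const TARGET_TOTAL_MS = 2000; // "keep it under, say, 2 seconds"

function sliceBudgetMs(gcStartTime, now) {
  const elapsed = now - gcStartTime;
  if (elapsed <= TARGET_TOTAL_MS) {
    return BASE_SLICE_MS;
  }
  // Past the target: double the budget for each additional second overdue.
  const overrunSeconds = Math.floor((elapsed - TARGET_TOTAL_MS) / 1000);
  return BASE_SLICE_MS * Math.pow(2, 1 + overrunSeconds);
}

// Example: 3.5 seconds into a GC, slices get a 20 ms budget instead of 5 ms.
// sliceBudgetMs(0, 3500) === 20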

Has there been any change since bug 1579426 landed? This won't have fixed the problem completely, but hopefully reduced it somewhat.

Flags: needinfo?(acreskey)

I might not be able to get to this for a little bit, but let me test before and after that patch.

I've kicked off a large perfherder comparison of bug 1579426 and its parent changeset.

I did some profiling before and after the GC slice budget change landed.
(For reasons unknown to me, the Android aarch64 PGO build from your merge would only give me a few threads from the parent process.)

So these profiles are from nightly geckoview_example, before and after your change landed.

It's hard to tell whether there's a performance impact because the site is so noisy. (The Perfherder run should tell me that, at least for the onload event.)
I am still seeing GCMajor markers exceed the 2-second limit with the change.

Before, Sept 8
cold load, with a 5297ms GCMajor
https://perfht.ml/2ZYw5Ae

warm load, with a 6102ms GCMajor
https://perfht.ml/2ZZyGtq

warm load - 1333ms GCMajor
https://perfht.ml/2ZWfOvq

After bug 1579426 landed, Sept 10
cold load: longest GCMajor is 2755ms
https://perfht.ml/2ZY6JT7

warm load - 2499ms GC and 3891ms GC
https://perfht.ml/305jZW9

warm load - 11,222ms GCMajor
https://perfht.ml/2ZY0v5p

(In reply to Andrew Creskey from comment #27)
Thanks for testing this! From the profile we can see the slice time is being increased for long-running GCs, as intended. However, we are going a long time between triggering slices, so things are still not proceeding. I thought there was some maximum time we waited for an idle period in which to run a slice, after which we ran it anyway, but that doesn't look like the case from these profiles. Also, the extra slice time seems like too little, too late - maybe we need to ramp up the slice time earlier if we want to keep the total duration under 2 seconds.

There seems to be plenty of other stuff running too - so I guess GC slices don't get a chance to run.
https://searchfox.org/mozilla-central/rev/d1e33e3e11f559952d7d80e722d26a6cf5dd80ac/dom/base/nsJSEnvironment.cpp#1777 should control the time between slices, if there is no idle time.

(In reply to Olli Pettay [:smaug] from comment #29)
I talked to Olli about this on IRC. To add a bit more detail, it looks like JS running on the main thread is stopping the GC slices from running. If there is no idle time then a timer fires to trigger the slice, but it may not have any effect if the main thread is running higher priority runnables (I think GC ones are low priority?).

The real fix for this issue is to remove the restriction where we block off-thread parsing while a GC is active in the atoms zone. This depends on bug 1544117, which is actively being worked on.

Thanks for looking into those profiles.

This is the before (left) and after (right) performance compare for bug 1579426
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1f7d73986e7da2e68e69675ca04861b08c196a97&newProject=try&newRevision=67743c62cc2f4004140df625866bcc5c278516e0&framework=10#table-row-4109543556

It looks like allrecipes.com is improved, particularly on Pixel 2.
Possibly bbc as well.

But this may hurt some sites like aframeio.
(Device lab must be very busy - waiting on a lot of results).

Flags: needinfo?(acreskey)
Blocks: 1502138
Keywords: meta
Summary: 5+ second delays seen while loading www.allrecipes.com on geckoview_example → [meta] 5+ second delays seen while loading www.allrecipes.com on geckoview_example
Depends on: stencil-mvp
No longer depends on: 1544117

I spoke with Bas: he noted that in the profile from bug 1543776 comment 27 ("after" section, cold load) there is a series of reflows that takes a long time (~1s), followed by long-running work on the compositor thread (a slow display list build?).

To get an updated profile, I took a cold page load profile in the latest geckoview_example – https://share.firefox.dev/3nqIXt5 – and see a similar trend: a 451ms Styles block followed by 616ms Reflow followed by some work on the Graphics thread (the displaylist markers don't seem as prominent though).

:mstange, do you know if this style/reflow -> graphics/displaylist issue could be related to the slow flexbox grid stuff? Bas mentioned you had talked about it on Monday

Flags: needinfo?(mstange.moz)

(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #32)

To get an updated profile, I took a cold page load profile in the latest geckoview_example – https://share.firefox.dev/3nqIXt5 – and see a similar trend: a 451ms Styles block followed by 616ms Reflow followed by some work on the Graphics thread (the displaylist markers don't seem as prominent though).

:mstange, do you know if this style/reflow -> graphics/displaylist issue could be related to the slow flexbox grid stuff? Bas mentioned you had talked about it on Monday

Hmm, it doesn't look like there's anything in particular about flexbox / grid that's taking up time. Sure, the page makes use of flexbox, but it's not where the time goes. I think the reason it's slow is mostly just because "there's a lot of stuff on this page" - there's no particular hotspot, just a lot of elements to chug through.

Looking at the specific sections of this profile:

  • Styling: https://share.firefox.dev/3qOOQSS Most of this is actually frame construction. The only common theme I see is that we spend an appreciable amount of time checking animation values: 19 out of 65 samples are in nsAnimationManager::UpdateAnimations, most of that in KeyframeEffect::CalculateCumulativeChangeHint. https://share.firefox.dev/3HE3gel
  • Reflow: https://share.firefox.dev/3DCTMO2 Nothing stands out. There's some time in gfxFont::ShapeText but that's expected.
  • Compositing: https://share.firefox.dev/3HB87gt Shader compilation - this should only happen once per Firefox profile. The next time this page is loaded, even after a Firefox restart, this composite should be fast.
Flags: needinfo?(mstange.moz)
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload]
Performance Impact: P1 → P2
Severity: normal → S3