Bug 1543776 (Open) — Opened 6 years ago, Updated 2 years ago

[meta] 5+ second delays seen while loading www.allrecipes.com on geckoview_example

Categories

(Core :: Performance, defect, P2)


Tracking


Performance Impact medium
Tracking Status
firefox68 --- affected

People

(Reporter: acreskey, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: meta, perf:pageload)

We noticed significant variations in load time in the raptor tp6m test raptor-tp6m-allrecipes-geckoview.

When reloading the site (https://www.allrecipes.com) live on a Pixel 3 (geckoview_example.apk from taskcluster, 2019.04.10) we saw huge delays where seemingly nothing was happening for 5+ seconds.

e.g.
https://perfht.ml/2Z5PTxq

and also here:
https://perfht.ml/2Z7Ovum

In both those cases the script execution and remaining network requests resumed as soon as the GC major completed.

In this profile the remaining scripts and network requests resume while the GC major is still running:
https://perfht.ml/2Z2s8pT

Also note: this site appears to be perpetually running JS, the following capture was made at least 2 minutes after the page had completed loading:
https://perfht.ml/2Z2dUFy

The major gap when loading this page is related to blocking async script parsing while we're in a GC (specifically one collecting atoms; see https://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#700 -- js::OffThreadParsingMustWaitForGC()).

This is due to bug 875125; see also bug 961318 and bug 1374797.
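
For illustration only (this is not code from allrecipes.com or from this bug; the URL is a placeholder), the page-side pattern that hits this path is an async external script. A large enough async script may be parsed on a helper thread, and while a GC is collecting the atoms zone that parse task sits in the parseWaitingOnGC queue:

// Illustrative only -- placeholder script URL, not a resource from the page.
// A sufficiently large async script may be compiled off the main thread;
// if a major GC is collecting the atoms zone when the parse is scheduled,
// the parse task waits (js::OffThreadParsingMustWaitForGC()) until the
// possibly multi-second incremental GC finishes.
const s = document.createElement("script");
s.src = "https://example.com/big-bundle.js"; // placeholder URL
s.async = true;
document.head.appendChild(s);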

Thank you :jesup.
When looking into noise in tp6 a few months back, I observed that the slow loads would frequently coincide with GCs.
Fixing this will help Bug 1502138.

Jon, it seems like when we fail OffThreadParsingMustWaitForGC(), the parse gets queued in the parseWaitingOnGC queue until the end of the GC. It seems, though, that we end up waiting in this queue until the major GC finishes, even if it is incremental and takes many seconds.

In this example, 'async' script loads result in an off-thread parse being scheduled and never finishing. Another option might be to force a sync parse and cancel the off-thread one, although that still means we are wasting the off-thread opportunity for silly reasons.

Flags: needinfo?(jcoppeard)

Here is a jsshell microbenchmark. If I remove the incremental GC, or simply move finishgc() to before the busy loop, I get concurrent operation.

var doGC = true;

// Generate a long script to parse
var fragment = " + a";
for (var i = 0; i < 19; ++i) {
    fragment = fragment + fragment;
}
var offThreadScript = "var a = 2; var x = " + fragment + ";";

// Start incremental GC
if (doGC) {
  startgc(1);
}

// Start the off-thread
var jid = offThreadCompileScript(offThreadScript);

// Spin main-thread
for (var i = 0; i < 1000000; ++i) {
    with({}){} // Prevent IonMonkey
}

// Finish incremental GC
if (doGC) {
  finishgc();
}

// Wait for parse result
runOffThreadScript(jid);
Depends on: 1543806

(Duped the microbenchmark to the JS component so patches get tracked by the JS team.)

Whiteboard: [qf] → [qf:p1:pageload]

(In reply to Ted Campbell [:tcampbell] from comment #3)

Jon, it seems like when we fail OffThreadParsingMustWaitForGC(), the parse gets queued in the parseWaitingOnGC queue until the end of the GC. It seems, though, that we end up waiting in this queue until the major GC finishes, even if it is incremental and takes many seconds.

This is correct, but we also try not to parse off thread while this is happening. See the comments here:

https://searchfox.org/mozilla-central/source/js/src/vm/OffThreadScriptCompilation.cpp#45

It's possible that the script in question is large enough that we decide to parse it off thread anyway.

This restriction is in place because off-thread parsing creates atoms, so it's not safe while a GC is collecting the atoms zone. GCs that don't collect atoms are not affected.

I think it's possible to remove this restriction if we allow the main thread to pause off thread parsing and mark in-use atoms, but this is not trivial.

Another question to ask is why this incremental GC is running for so long. Ideally incremental GCs would not take many seconds.

Blocks: GCScheduling
Flags: needinfo?(jcoppeard)

I'd say there are a few different considerations to look at here:

  1. Why is iGC taking so many seconds?
  • This is being investigated by smaug/jonco.
  • E.g. Bug 1543961
  2. Can the GC be made to support off-thread atoms access while doing a GC?
  • This is likely to be complicated, but perhaps there are partial steps in this direction.
  3. Can we remove the parser dependency on GC?
  • If the parser didn't use GC things until the final step, we could avoid worrying about the GC.
  • This may also be beneficial for avoiding lock contention.
  • We could likely avoid the mergeRealms/parser-Zone logic.
  • The SpiderMonkey team should discuss the practicality of this (Bug 1544117).
  4. What is our async-script scheduling policy?
  • Experiments with CanDoOffThread fall into this pattern.
  • More aggressive options include detecting that scripts have been in the HelperThreadState().parseWaitingOnGC queue for a long time and deciding to pull them to the main thread instead (a rough sketch of this idea follows below).
Depends on: 1544117
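
A rough, illustrative sketch of that last idea (this is not real Gecko/SpiderMonkey code; the queue shape, the helper names, and the 250 ms threshold are all hypothetical stand-ins for the HelperThreadState machinery):

// Hypothetical sketch only: rescue parse tasks that have been stuck behind a
// GC for too long by parsing them synchronously on the main thread instead.
const MAX_WAIT_ON_GC_MS = 250; // hypothetical threshold; would need tuning

function drainStuckParseTasks(waitingQueue, now, parseOnMainThread) {
  // waitingQueue: array of { enqueuedAt, source } records (hypothetical shape).
  // Returns the tasks that were pulled to the main thread.
  const rescued = [];
  const stillWaiting = [];
  for (const task of waitingQueue) {
    if (now - task.enqueuedAt >= MAX_WAIT_ON_GC_MS) {
      parseOnMainThread(task.source); // eat the parse cost synchronously
      rescued.push(task);
    } else {
      stillWaiting.push(task);
    }
  }
  waitingQueue.length = 0;
  waitingQueue.push(...stillWaiting);
  return rescued;
}

// Toy usage: the first task has waited 300 ms, the second only 10 ms,
// so only the first one is rescued.
const queue = [
  { enqueuedAt: 0, source: "var a = 1;" },
  { enqueuedAt: 290, source: "var b = 2;" },
];
drainStuckParseTasks(queue, 300, (src) => eval(src));

The trade-off, as noted above, is that a rescued script loses the off-thread parse entirely, so the threshold would have to be large enough that we only pay that cost when the GC is genuinely long-running.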

In general (1) will not help much, since we want GC and CC to run during idle time, and that idle time comes after deferred-timers time, so all the setTimeouts would get run before the load-blocking async script gets a chance to run. And we know that running setTimeouts during page load has a massive negative effect.
However, on this particular page (1) should help some.

Flags: needinfo?(acreskey)

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)

Andrew, is it possible this initial patch from bug 1529044 could have caused such an issue? I noticed this pattern on Google Maps also:

P2: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1916598,1,10&series=mozilla-central,1919889,1,10&series=mozilla-central,1955057,1,10

G5: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1916573,1,10&series=mozilla-central,1918103,1,10&series=mozilla-central,1955228,1,10

Ionut, I think it's definitely possible that the initial patch from bug 1529044 could affect the pageload timing and thus make this behaviour show up in our recorded data.
But I wouldn't say that the initial patch caused the noise, since we saw this noise prior to the patch (Comment 6) and the root cause has been determined (off-thread parsing blocked on a major GC). Really, anything that shifts when the GCs occur, or when the script parsing can run, can move the noise around.

Flags: needinfo?(acreskey)

:smaug gave me the pref to disable off-thread parsing: "javascript.options.parallel_parsing".
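
For anyone reproducing this, a testing-only user.js sketch (the pref name is the one given above; false disables off-thread parsing):

// Testing only: disable off-thread (parallel) script parsing so page load
// can be compared against the default configuration.
user_pref("javascript.options.parallel_parsing", false);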

This is a performance comparison between the baseline (left-hand side) and off-thread parsing disabled (right-hand side).

Not all the tests have sufficient results yet, but from those that do we can see that off-thread parsing is in general extremely valuable, since disabling it leads to some huge regressions (e.g. 20% on Instagram over 20 jobs):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c059749d487e8668fb006b1f247de5f34edc5897&newProject=try&newRevision=ae54c2fe6dc90981905dcb3209de636c51d1dbd3&framework=10

From the test in Comment 13, only looking at tests with 20 retries:

Regressions:
• 14% regression on raptor-tp6m-espn-geckoview opt
• 11% regression on raptor-tp6m-google-maps-geckoview opt
• 22% regression on raptor-tp6m-instagram-geckoview opt
• 61% regression on raptor-tp6m-reddit-geckoview opt
• 15% regression on raptor-tp6m-stackoverflow-geckoview opt
• 5% regression on raptor-tp6m-web-de-geckoview opt (maybe)

Improvements:
• 7% improvement on raptor-tp6m-amazon-geckoview opt
• 6% improvement on raptor-tp6m-allrecipes-geckoview opt
• 19% improvement on raptor-tp6m-bbc-geckoview opt

Looking into the reddit regression subtests, it's just DCF that regresses; load time is actually improved by ~5%.

The comment 13 Try push (disabling OMT parsing) has some serious load time regressions, and lots of FCP/DCF/etc. regressions, but they're mixed. Instagram shows 20% in the geomean, but that's a 100% regression on DCF and only 0-3% on the others. Amazon search, on the other hand, has major regressions (40-60%) on everything, while Amazon has 5-7% improvements on everything.

There's an implication that in some cases not using it could help, but I'm not sure we can isolate those cases. Also, the more cores you have, the more OMT parsing probably helps -- if it doesn't conflict with GC. I wonder how this plays out on 2-core Windows laptops...

Priority: -- → P2

When loading https://www.allrecipes.com on Android, we do relatively poorly on visual metrics like SpeedIndex compared to Chrome.

Our visual behaviour is bimodal -- sometimes the large images load relatively early, and other times, like in this profile, they come in late:
https://perfht.ml/2Zdoqsl

I've noticed long (e.g. 2-second) GC majors before the image requests are made in the cases where the page is slow to load.

So I'm trying to construct a test where we defer all GC majors for as long as possible.

I've tried setting a high growth percentage on the heaps (e.g. 500) via the following prefs, but I still see GC majors:

javascript.options.mem.gc_high_frequency_heap_growth_min
javascript.options.mem.gc_high_frequency_heap_growth_max
javascript.options.mem.gc_low_frequency_heap_growth

I've also tried setting very high values for
javascript.options.mem.gc_high_frequency_low_limit_mb (1000) and
javascript.options.mem.gc_high_frequency_high_limit_mb (2000)
but I'm still seeing GC majors during pageload.
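
For reference, a testing-only user.js sketch consolidating the overrides described in this comment (the values are the ones mentioned above; the growth prefs are percentages and the limit prefs are in MB):

// Testing only: try to defer major GCs by allowing much more heap growth.
// These did not eliminate the GC majors seen during the allrecipes pageload.
user_pref("javascript.options.mem.gc_high_frequency_heap_growth_min", 500);
user_pref("javascript.options.mem.gc_high_frequency_heap_growth_max", 500);
user_pref("javascript.options.mem.gc_low_frequency_heap_growth", 500);
user_pref("javascript.options.mem.gc_high_frequency_low_limit_mb", 1000);
user_pref("javascript.options.mem.gc_high_frequency_high_limit_mb", 2000);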

Does anyone know of another way to push GC majors off? (just for testing purposes)

Smaug pointed me to the GC requests made by DOM and Layout (PokeGC),
e.g. for reasons such as CC_WAITING, PAGE_HIDE, SET_DOC_SHELL.

These are generally the source of the GCs during the allrecipes pageload.
I've converted the timing heuristics e.g. NS_GC_DELAY to StaticPrefs so I can test variations.
Giving the delays much longer values drastically improves onload event timing (~45% in a smaller run), although I haven't yet seen any improvements to visual metrics.

I found out that the GC delay timers can also be forced to fire early by the RunNextCollectorTimer mechanism.

That frequency is dictated by a different timing constant, NS_USER_INTERACTION_INTERVAL.

I made two bugs to track this:
Bug 1575938 - convert GC timing #defines to StaticPrefs
Bug 1575943 - determine best app-level GC scheduling heuristics

The RunNextCollectorTimer mechanism doesn't kick in while the top-level page is still loading (visual completeness may of course happen after that).
https://searchfox.org/mozilla-central/rev/597a69c70a5cce6f42f159eb54ad1ef6745f5432/dom/base/nsJSEnvironment.cpp#1963-1968
And not running GC often enough may lead to regressions in other tests: https://bugzilla.mozilla.org/show_bug.cgi?id=1557771#c16
(Initially MaybeRunNextCollectorSlice was added for Speedometer)

Ahh, thanks Olli.
I believe that in my profiles the load event fired before the page was visually complete, but I'll double check that.

Looking at this again, I think a way forward is to limit the total time an incremental GC runs for by increasing the slice time if it is taking too long. Trying to keep it under, say, 2 seconds would be ideal. This is relatively straightforward, and is what's mentioned in bug 1434542 comment 1.

That would be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1368972, but I guess the current setup may not kick in in all the cases.

(In reply to Olli Pettay [:smaug] from comment #23)
Oh right, I forgot I had originally filed that. I filed bug 1579426 to try something more drastic: increase slice budgets for all collections when they go past a certain point, with the aim of limiting incremental GCs to ~2 seconds (97% of GCs already finish in less than 2 seconds).
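
For intuition, a minimal sketch of the general idea (this is not the actual heuristic from bug 1579426; the base budget, target, and ramp are made-up numbers):

// Sketch: once an incremental GC has run past a target total duration, grow
// the per-slice budget so the collection is driven to completion sooner.
const BASE_SLICE_MS = 5;      // assumed default slice budget (illustrative)
const TARGET_TOTAL_MS = 2000; // "keep it under, say, 2 seconds"

function sliceBudgetMs(gcStartTime, now) {
  const elapsed = now - gcStartTime;
  if (elapsed <= TARGET_TOTAL_MS) {
    return BASE_SLICE_MS;
  }
  // Past the target: double the budget for each additional second overdue.
  const overrunSeconds = Math.floor((elapsed - TARGET_TOTAL_MS) / 1000);
  return BASE_SLICE_MS * Math.pow(2, 1 + overrunSeconds);
}

// Example: 3.5 seconds into a GC, slices get a 20 ms budget instead of 5 ms.
// sliceBudgetMs(0, 3500) === 20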

Has there been any change since bug 1579426 landed? This won't have fixed the problem completely, but hopefully reduced it somewhat.

Flags: needinfo?(acreskey)

I might not be able to get to this for a little bit, but let me test before and after that patch.

I've kicked off a large perfherder comparison of bug 1579426 and its parent changeset.

I did some profiling before and after the GC slice budget change landed.
(For reasons unknown to me, the Android aarch64 PGO build from your merge would only give me a few threads from the parent process.)

So these profiles are from nightly geckoview_example, before and after your change landed.

It's hard to tell whether there's a performance impact because the site is so noisy. (The Perfherder run should tell me that, at least for the onload event.)
I am still seeing GCMajor markers exceed the 2-second limit with the change.

Before, Sept 8
cold load, with a 5297ms GCMajor
https://perfht.ml/2ZYw5Ae

warm load, with a 6102ms GCMajor
https://perfht.ml/2ZZyGtq

warm load - 1333ms GCMajor
https://perfht.ml/2ZWfOvq

After bug 1579426 landed, Sept 10
cold load: longest GCMajor is 2755ms
https://perfht.ml/2ZY6JT7

warm load - 2499ms GC and 3891ms GC
https://perfht.ml/305jZW9

warm load - 11,222ms GCMajor
https://perfht.ml/2ZY0v5p

(In reply to Andrew Creskey from comment #27)
Thanks for testing this! From the profile we can see the slice time is being increased for long-running GCs, as intended. However, we are going a long time between triggering slices, so things are still not proceeding. I thought there was some maximum time we waited for an idle period in which to run a slice, after which we ran it anyway, but that doesn't look like the case from these profiles. Also, the extra slice time seems like too little, too late - maybe we need to ramp up the slice time earlier if we want to keep the total duration under 2 seconds.

There seems to be plenty of other stuff running too - so I guess GC slices don't get a chance to run.
https://searchfox.org/mozilla-central/rev/d1e33e3e11f559952d7d80e722d26a6cf5dd80ac/dom/base/nsJSEnvironment.cpp#1777 should control the time between slices, if there is no idle time.

(In reply to Olli Pettay [:smaug] from comment #29)
I talked to Olli about this on IRC. To add a bit more detail, it looks like JS running on the main thread is stopping the GC slices from running. If there is no idle time then a timer fires to trigger the slice, but it may not have any effect if the main thread is running higher priority runnables (I think GC ones are low priority?).

The real fix for this issue is to remove the restriction where we block off-thread parsing while a GC is active in the atoms zone. This depends on bug 1544117, which is actively being worked on.

Thanks for looking into those profiles.

This is the before (left) and after (right) performance compare for bug 1579426
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1f7d73986e7da2e68e69675ca04861b08c196a97&newProject=try&newRevision=67743c62cc2f4004140df625866bcc5c278516e0&framework=10#table-row-4109543556

It looks like allrecipes.com is improved, particularly on Pixel 2.
Possibly bbc as well.

But this may hurt some sites like aframeio.
(Device lab must be very busy - waiting on a lot of results).

Flags: needinfo?(acreskey)
Blocks: 1502138
Keywords: meta
Summary: 5+ second delays seen while loading www.allrecipes.com on geckoview_example → [meta] 5+ second delays seen while loading www.allrecipes.com on geckoview_example
Depends on: stencil-mvp
No longer depends on: 1544117

I spoke with Bas: he noted that in the profile from bug 1543776 comment 27 ("after" section, cold load) there is a series of reflows that takes a long time (~1s), followed by long-running work on the compositor thread (a slow display list build?).

To get an updated profile, I took a cold page load profile in the latest geckoview_example – https://share.firefox.dev/3nqIXt5 – and see a similar trend: a 451ms Styles block followed by 616ms Reflow followed by some work on the Graphics thread (the displaylist markers don't seem as prominent though).

:mstange, do you know if this style/reflow -> graphics/displaylist issue could be related to the slow flexbox grid stuff? Bas mentioned you had talked about it on Monday

Flags: needinfo?(mstange.moz)

(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #32)

To get an updated profile, I took a cold page load profile in the latest geckoview_example – https://share.firefox.dev/3nqIXt5 – and see a similar trend: a 451ms Styles block followed by 616ms Reflow followed by some work on the Graphics thread (the displaylist markers don't seem as prominent though).

:mstange, do you know if this style/reflow -> graphics/displaylist issue could be related to the slow flexbox grid stuff? Bas mentioned you had talked about it on Monday

Hmm, it doesn't look like there's anything in particular about flexbox / grid that's taking up time. Sure, the page makes use of flexbox, but it's not where the time goes. I think the reason it's slow is mostly just because "there's a lot of stuff on this page" - there's no particular hotspot, just a lot of elements to chug through.

Looking at the specific sections of this profile:

  • Styling: https://share.firefox.dev/3qOOQSS Most of this is actually frame construction. The only common theme I see is that we spend an appreciable amount of time checking animation values: 19 out of 65 samples are in nsAnimationManager::UpdateAnimations, most of that in KeyframeEffect::CalculateCumulativeChangeHint. https://share.firefox.dev/3HE3gel
  • Reflow: https://share.firefox.dev/3DCTMO2 Nothing stands out. There's some time in gfxFont::ShapeText but that's expected.
  • Compositing: https://share.firefox.dev/3HB87gt Shader compilation - this should only happen once per Firefox profile. The next time this page is loaded, even after a Firefox restart, this composite should be fast.
Flags: needinfo?(mstange.moz)
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload]
Performance Impact: P1 → P2
Severity: normal → S3