Closed Bug 1674305 Opened 4 years ago Closed 4 years ago

Leave stencil instantiation off-main-thread for script decode task

Categories

(Core :: JavaScript Engine, task, P1)

task

Tracking

()

RESOLVED FIXED
86 Branch
Tracking Status
firefox86 --- fixed

People

(Reporter: arai, Assigned: arai)

References

Details

Attachments

(2 files)

script instantiation for script decode task takes longer, and it blocks main thread execution if we move script instantiation for script decode task to main thread.

Depends on: 1674465

I'll defer the decision for this bug until bug 1674306 gets fixed.
If we can improve instantiation performance as well, we might be able to move instantiation on main thread.

After applying local patches,
if we leave instantiation off-main-thread, there's no regression,
so if we take this bug's way, we can flip the pref:
https://treeherder.mozilla.org/perfherder/compare?originalProject=autoland&originalRevision=a7a369259f7485fa3bef6c832ac7967740faf131&newProject=try&newRevision=9e2e3c731391efa3acec83c437713963b329b9d4&framework=10

but if we move instantiation to main-thread, there are 50% regression on youtube, and similar on some other sites:
https://treeherder.mozilla.org/perfherder/compare?originalProject=autoland&originalRevision=a7a369259f7485fa3bef6c832ac7967740faf131&newProject=try&newRevision=c2f44e1b80ed1d17a110b8c7a6bef882b1834bf4&framework=10

according to subtests result, it's mostly because of bad score for fcp (First Contentful Paint) and fnbpaint (First Non-Blank Paint), for most sites
https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=autoland&newProject=try&newRevision=c2f44e1b80ed1d17a110b8c7a6bef882b1834bf4&originalSignature=1927870&newSignature=1907787&framework=10&originalRevision=a7a369259f7485fa3bef6c832ac7967740faf131

for example, fnbpaint for youtube:
https://treeherder.mozilla.org/perfherder/comparesubtestdistribution?originalProject=autoland&newProject=try&originalRevision=a7a369259f7485fa3bef6c832ac7967740faf131&newRevision=c2f44e1b80ed1d17a110b8c7a6bef882b1834bf4&originalSubtestSignature=1927875&newSubtestSignature=1907790

(the result graph highly overlaps and hard to read, applying [...document.getElementsByTagName("path")].forEach(x => { x.style.strokeWidth = "1px"; x.style.stroke = "black"; }) shows clearer result)

according to that, there's not much difference in best and worst time, but the frequency of the bad score increases.

I'll check how and when such bad score happens.

Tested locally, I see some correlation between bad fnbpaint score and the following time:

  • T1 = the start of DecodeOffThreadScript for https://www.youtube.com/yts/jsbin/scheduler-vfl-N0eGo/scheduler.js (the first file passed to decode task)
  • T2 = the start of DecodeOffThreadScript for https://www.youtube.com/yts/jsbin/desktop_polymer_inlined_html_polymer_flags-vflV1Se-W/desktop_polymer_inlined_html_polymer_flags.js (the last file passed to decode task)
  • T3 = the start of FinishOffThreadScriptDecoder for https://www.youtube.com/yts/jsbin/desktop_polymer_inlined_html_polymer_flags-vflV1Se-W/desktop_polymer_inlined_html_polymer_flags.js (the last file)
  • T4 = the end of FinishOffThreadScriptDecoder for https://www.youtube.com/yts/jsbin/desktop_polymer_inlined_html_polymer_flags-vflV1Se-W/desktop_polymer_inlined_html_polymer_flags.js (the last file)

(where 140ms < T1 < T2 < T3 < T4 < 350ms)

In the good score (fnbpaint < 300ms) case:

  • T4 - T2 > 40ms (time taken by decoding+instantiating desktop_polymer*.js)
  • T4 - T1 > 80ms (time taken by decoding+instantiating all files for off-thread task)
  • fnbpaint < T3

and in the bad score (fnbpaint > 900ms) case:

  • T4 - T2 < 40ms
  • T4 - T1 < 60ms
  • fnbpaint > T3

So, basically, when the decode task (+ instantiation) takes less time, the score is bad,
and in the good case, the first non-blank paint happens before instantiating desktop_polymer*.js.

I wonder if that's because of that, if scripts are decoded quickly, more code (especially desktop_polymer*.js maybe?) runs and it delays the first non-blank paint.
I'll check what's running, and what the first non-blank paint result is for each case.

and when I leave the instantiation off-main-thread, fnbpaint < T3 also holds for good score case.

Attached image raptor youtube frames (deleted) —

I've checked the screen recording.

in good case, the content transits in the following:

  • at the point of start fetching, the content shows the previous page (fully loaded youtube page)
  • at the point of fnbpaint, the left side of the attachment is shown
  • after ~1 second, the right side of the attachment is shown
  • at the point of pageload, still the right side of the attachment is shown

and in bad case:

  • at the point of start fetching, the content shows the previous page (fully loaded youtube page)
  • at the point of fnbpaint, sometimes the left side of the attachment is shown, and sometimes it skips the left side and the right side of the attachment is shown
    • one possibility is that the left side is shown for a moment but is not captured in the screen recording
    • another possibility is that the left side completely gets skipped sometimes
  • if the left side is shown, the right side is shown after ~200ms
  • at the point of pageload, still the right side of the attachment is shown

I'll check the relation between those 2 frames and JS files

The left side of the screenshot is HTML+CSS-only.
the HTML part is embedded in the index HTML file (https://www.youtube.com/) returned by the server,
and CSS is defined in the external file.
(https://www.youtube.com/yts/cssbin/www-main-desktop-home-page-skeleton-webp-vflTRw70F.css, referred by link element inside head)

so, as soon as the CSS file gets loaded, the screen can be rendered.

      <div id="home-page-skeleton" class="">
    <div id="rich-grid-guide-skeleton"  class="mini-mode"></div>
    <div id="home-container-skeleton">
      <div id="rich-shelves">
        
  <div class="shelf-skeleton">
    <div class="rich-shelf-videos">
          <div class="rich-grid-video-skeleton mini-mode">
    <div class="video-details">
      <div class="rich-thumbnail skeleton-bg-color"></div>
      <div class="details">
        <div class="channel-avatar skeleton-bg-color"></div>
        <div class="details-text-shell">
          <div class="rich-video-title text-shell skeleton-bg-color"></div>
          <div class="video-meta rich-video-meta text-shell skeleton-bg-color"></div>
        </div>
      </div>
    </div>
  </div>
...

and #home-page-skeleton element gets removed by the following file:
https://www.youtube.com/yts/jsbin/desktop_polymer_inlined_html_polymer_flags-vflV1Se-W/desktop_polymer_inlined_html_polymer_flags.js

prettified code:

        wi = function(a) {
            return a && a.parentNode ? a.parentNode.removeChild(a) : null
        },
...
        hideSkeleton_: function() {
            this.skeletonHidden_ = !0;
            var a = document.querySelector("#home-page-skeleton");
            a && (zr("hgcr"), wi(a))
        },

and the script seems to be loaded by the following element, in index file (prettified), that's after link element for CSS and "#home-page-skeleton" element:

<script
  src="/yts/jsbin/desktop_polymer_inlined_html_polymer_flags-vflV1Se-W/desktop_polymer_inlined_html_polymer_flags.js"
  type="text/javascript"
  name="desktop_polymer_inlined_html_polymer_flags/desktop_polymer_inlined_html_polymer_flags"
  class="js-httpswwwyoutubecomytsjsbindesktop_polymer_inlined_html_polymer_flagsvflV1SeWdesktop_polymer_inlined_html_polymer_flagsjs">
</script>

and the script is decoded off-main-thread.
looks like it passes the following condition?
https://searchfox.org/mozilla-central/rev/66547980e8e8ca583473c74f207cae5bac1ed541/dom/script/ScriptLoader.cpp#3843-3866

so, the following 2 seems to be racing:

  • loading the above "skeleton" CSS file, layout, rendering the first frame with pure HTML+CSS
  • compiling/decoding "polymer" JS file, running the code that removes "skeleton" element

and if "skeleton" win, the first frame with "skeleton" gets rendered early,
and if "polymer" win, the first frame becomes non-skeleton

and if we move instantiation to main-thread, the time taken by off-main-thread task gets reduced,
and, it results in more "polymer" win case, even if it needs to instantiate before running the code.

profiles with screenshots (thanks to mstange!):

good case: https://share.firefox.dev/2HUnoiz

dcf      : 1249
fcp      : 237
fnbpaint : 217
loadtime : 1319

bad case: https://share.firefox.dev/37lLRWL

dcf      : 1319
fcp      : 1241
fnbpaint : 1226
loadtime : 1321

in good case, skeleton screen is shown at fnbpaint, and screen with polymer result is shown later.
(interestingly, the complete screen isn't yet shown even at "loadtime")

in bad case, skeleton screen isn't shown at all, and the polymer result is shown at fnbpaint.

this matches to the above comments.

here's local test result:
(for tests that shown some regression on automation)

raw data and graphs: https://arai-a.github.io/raptor-graph/graph.html (left/right key to move between sites)

apple

  • loadtime shows bimodal behavior (~200ms and ~250ms)
    • not much difference for good/bad frequency before/afrer
  • on-main-thread instantiation takes 5~10ms
    • the amount of regression (shift) on all subtest matches this

docs

  • fnbpaint/fcp show bimodal behavior (~200ms and ~500ms)
    • bad case rarely happen before stencil
    • bad case happens more than good case after stencil
  • dcf/loadtime aren't regressed
    • dcf somehow improves by 30ms, and happens before fcp/fnbpaint
  • on-main-thread instantiation takes 30ms
    • not directly reflected to any subtest

facebook

  • fnbpaint/fcp shows bimodal behavior (~250ms and ~320ms)
    • bad case slightly increases after stencil
    • also there's slight shift (10ms regression) after stencil
  • on-main-thread instantiation takes 80ms
    • the amount of dcf/loadtime regression (shift) matches this

fandom

  • fcp/fnbpaint shows multimodal behavior
    • bad case increases after stencil
  • loadtime scatters from 300ms to 800ms
  • on-main-thread instantiation takes 20~30ms
  • not much directly reflected to any subtest

imdb

  • all subtest show multimodal behavior (250-300ms and ~450ms for fcp/fnbpaint/dcf)
    • bad case increases after stencil
    • bad case (or peak after instantiation) shifts by ~30ms
  • on-main-thread instantiation takes 20~30ms
    • the amount of regression matches this

imgur

  • fnbpaint/fcp show almost bimodal behavior (~100ms and ~200ms)
    • bad case increases after stencil
  • loadtime shows multimodal behavior
    • no meaningful difference between before/after
  • on-main-thread instantiation takes 40-90ms
    • not directly reflected to any subtest

microsoft

  • all subtest has single peak
  • on-main-thread instantiation takes ~5ms
    • the amount of regression matches this

netflix

  • loadtime shows bimodal behavior (~1350ms and ~1400ms)
    • the case between good and bad increases after stencil
  • no difference for fcp/fnbpaint
  • on-main-thread instantiation takes 35ms
    • the loadtime/dcf regression (shift) matches this

sheets

  • fcp/fnbpaint show bi- or tri- modal behavior before stencil
  • fcp/fnbpaint has single peak after stencil
    • good/bad case in "before" disappear
  • loadtime scatters from 600ms to 1100ms
    • bad case increases after stencil
  • on-main-thread instantiation takes 40-65ms
    • the dcf regression (shift) matches this

slides

  • all substest shows bimodal behavior
    • bad case increases on fcp/fnbpaint after stencil
    • good case increases on dcf after stencil
  • loadtime shows regression (~100ms peak shift) after stencil
  • on-main-thread instantiation takes 15ms
    • not directly reflected to any subtest

tumblr

  • loadtime scatters from 600ms to 1800ms
  • on-main-thread instantiation takes 23ms
    • dcf shift maybe match this

twitch

  • fcp/fnbpaint shows no difference (they happen very early)
  • on-main-thread instantiation takes 55ms
  • the loadtime/dcf regression (shift) matches this

twitter

  • fcp/fnbpaint show bimodal behavior (40ms and 150ms)
    • bad case slightly increases after stencil
  • loadtime shows bimodal behavior
    • good case somehow shifts down (~150ms decrease) after stencil
  • on-main-thread instantiation takes 25ms
  • not directly reflected to any subtest

wikipedia

  • fcp/fnbpaint show bimodal behavior (200ms and 950ms)
    • bad case increases after stencil
  • dcf/loadtime scatters from 700ms to 1100ms
  • on-main-thread instantiation takes 10ms
    • not much reflected to any subtest

yahoo_mail

  • all subtest has almost single peak
  • on-main-thread instantiation takes 30-40ms
    • dcf/loadtime regression matches this

yahoo_news

  • all subtest has almost single peak
  • no notable regression
  • on-main-thread instantiation takes 10-30ms
    • not much reflected to any subtest

for the cases that the time taken by on-main-thread instantiation directly affects loadtime, improving the instantiation perf should help.
(apple, facebook, imdb, microsoft, netflix, twitch, yahoo-mail)

I'll look into other cases what's happening, and how on-main-thread instantiation affects the behavior
(docs, fandom, imgur, sheets slides, tumblr, twitter, wikipedia)

one option here is to conditionally use on-main-thread instantiation and off-main-thread instantiation, maybe based on XDR buffer size.
so that small XDR-encoded script is decoded without parse-global, to avoid GC lock, and large XDR-encoded script is decoded/instantiated off-main-thread in order to avoid regressing main-thread perf

google docs also looks like a race.

there the following 2 are racing:

  • loading CSS file, and painting toolbar embedded in HTML file
  • compiling/running inline scripts in line 218, that generates document content

line 218 of HTML file contains 128 script elements and the script is 1MB in total, and takes ~300ms to execute.
and in either case, one external script (takes 46ms to instantiate and execute) is decoded and executed before them.

for both case, bug 1667903 will help scheduling JS execution and paint in better way.

I'll go with this, and leave "move instantiation to main thread" to post-stencil-mvp.

Assignee: nobody → arai.unmht
Status: NEW → ASSIGNED
Pushed by arai_a@mac.com: https://hg.mozilla.org/integration/autoland/rev/1ba928d12818 Always use off-thread parse global in decode task and instantiate off-thread. r=tcampbell
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: