Closed Bug 1537694 Opened 6 years ago Closed 5 years ago

2.34 - 10.81% Explicit Memory / Heap Unclassified / JS / Resident Memory (linux64, linux64-pgo-qr, osx-10-10, windows10-64, windows10-64-pgo-qr, windows7-32) regression on push 26973fbde8f49f878e3f729cc4136085075cf5ae

Categories

(Firefox :: New Tab Page, defect, P2)

Tracking

RESOLVED WONTFIX
Tracking Status
firefox68 --- affected
firefox69 --- affected

People

(Reporter: igoldan, Assigned: Gijs)

References

Details

(Keywords: perf, regression)

Attachments

(4 files, 1 obsolete file)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7981b7ae2fafca9d8fc79f12ecbda06b220e8627&tochange=26973fbde8f49f878e3f729cc4136085075cf5ae

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

11% Heap Unclassified windows10-64 pgo stylo 42,957,697.51 -> 47,601,249.48
8% Heap Unclassified windows7-32 pgo stylo 34,966,968.65 -> 37,723,145.86
8% Heap Unclassified linux64 pgo stylo 75,723,944.15 -> 81,494,662.27
6% Heap Unclassified osx-10-10 opt stylo 72,322,178.44 -> 76,904,435.80
6% Explicit Memory windows10-64-pgo-qr opt stylo 329,293,580.78 -> 349,757,924.75
6% Explicit Memory windows10-64 pgo stylo 314,728,912.71 -> 333,568,714.99
5% JS windows10-64-pgo-qr opt stylo 106,381,200.25 -> 111,717,702.58
5% JS windows10-64 pgo stylo 105,789,211.64 -> 110,922,889.98
5% Explicit Memory windows7-32 pgo stylo 255,950,371.03 -> 267,923,539.37
4% Resident Memory windows10-64 pgo stylo 497,483,100.49 -> 519,198,368.39
4% Resident Memory windows10-64-pgo-qr opt stylo 520,196,452.76 -> 541,517,738.88
4% Resident Memory windows7-32 pgo stylo 559,208,271.72 -> 581,055,484.53
4% JS osx-10-10 opt stylo 105,987,404.86 -> 109,899,674.23
4% Explicit Memory osx-10-10 opt stylo 347,069,816.94 -> 359,786,776.45
4% JS linux64 pgo stylo 104,977,757.34 -> 108,723,091.37
3% JS linux64-pgo-qr opt stylo 105,289,066.99 -> 108,807,943.87
3% Explicit Memory linux64-pgo-qr opt stylo 459,214,830.62 -> 474,515,151.02
3% JS windows7-32 pgo stylo 78,483,172.34 -> 81,089,941.98
3% Heap Unclassified linux64-pgo-qr opt stylo 184,821,186.74 -> 190,398,438.34
2% Resident Memory linux64-pgo-qr opt stylo 928,632,154.45 -> 950,355,356.51

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=19984

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests

Component: General → New Tab Page
Product: Testing → Firefox
Flags: needinfo?(gijskruitbosch+bugs)

How do I get the detailed memory reports pre/post change, and where can I see the source code of the test that takes these memory snapshots, in particular to work out when it does so (which doesn't seem to be documented on the wiki page for the test)?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(igoldan)

(In reply to :Gijs (he/him) from comment #1)

How do I get the detailed memory reports pre/post change, and where can I see the source code of the test that takes these memory snapshots, in particular to work out when it does so (which doesn't seem to be documented on the wiki page for the test)?

For AWSY regressions you'll want to follow the comparison link provided, hover over the alert and click 'subtests'. In this case for the explicit measurement the 'Fresh Start' and 'Fresh Start [+30s]' metrics are heavily regressed, which as the names imply are measured after initial startup to a blank page, and then 30 seconds later. The 'Tabs Closed' metrics are also regressed, which again as the name implies, are taken after closing the 30 tabs that are opened.

re: looking at memory reports, here are some details:

  1. click the alert link
  2. hover over the alert and click "graph"
  3. select the alert in the graph view and click 'job'
  4. now we're in treeherder with the job selected, go to the 'Job Details' tab
  5. download the 'memory-report-....' artifact and load that in about:memory
Flags: needinfo?(igoldan)

(In reply to Eric Rahm [:erahm] from comment #2)

(In reply to :Gijs (he/him) from comment #1)

How do I get the detailed memory reports pre/post change, and where can I see the source code of the test that takes these memory snapshots, in particular to work out when it does so (which doesn't seem to be documented on the wiki page for the test)?

For AWSY regressions you'll want to follow the comparison link provided, hover over the alert and click 'subtests'. In this case for the explicit measurement the 'Fresh Start' and 'Fresh Start [+30s]' metrics are heavily regressed, which as the names imply are measured after initial startup to a blank page,

No, but really, when exactly? At what technical point in the startup timeline? What exact event fires off this memory report? (ni for this)

Looks like it is somewhere in test_base_memory_usage.py and test_memory_usage.py, but I can't figure out what that bases things off of, or which of those files it is.

On the whole, from a quick look at this, the main cause of the regression is that we now preopen a new tab page off an idle task, and that takes memory (ie we open a preloaded tab). It takes too much memory (cf. bug 1529309) but that's mostly something for the new tab page / activity stream team to fix, I expect. So I think that 'regression' is something we should accept - though it's somewhat bigger than I expected, so maybe that warrants looking into. Prior to this patch, it would have happened as soon as you opened a new tab (not just that tab; the tab opening triggered opening a(nother) preloaded browser).

What I'm somewhat more confused about is that given that the closed subtests call removePreloadedBrowser, I'd expect them not to show a regression, but it seems they do, and it's not clear why.

Flags: needinfo?(erahm)

(In reply to :Gijs (he/him) from comment #3)

No, but really, when exactly? At what technical point in the startup timeline? What exact event fires off this memory report? (ni for this)

It's not based on an event; it's just when marionette finishes starting up. whimboo, do we wait for a certain event before considering startup complete?

If you'd like to test locally you can simply run:

./mach awsy-test --quick

Looks like it is somewhere in test_base_memory_usage.py and test_memory_usage.py, but I can't figure out what that bases things off of, or which of those files it is.

For this regression we're looking at test_memory_usage.py, which is the original AWSY test aka SY-e10s (sy) in treeherder.

On the whole, from a quick look at this, the main cause of the regression is that we now preopen a new tab page off an idle task, and that takes memory (ie we open a preloaded tab). It takes too much memory (cf. bug 1529309) but that's mostly something for the new tab page / activity stream team to fix, I expect. So I think that 'regression' is something we should accept - though it's somewhat bigger than I expected, so maybe that warrants looking into. Prior to this patch, it would have happened as soon as you opened a new tab (not just that tab; the tab opening triggered opening a(nother) preloaded browser).

This analysis seems correct. Previously we had a chrome process, 2 web content processes, and a web extension process. Now there's an additional privileged content process. Accepting the fresh start regression is probably okay, but I agree we should encourage the AS folks to look into memory usage.

What I'm somewhat more confused about is that given that the closed subtests call removePreloadedBrowser, I'd expect them not to show a regression, but it seems they do, and it's not clear why.

Yes, this should be investigated or spun off before we resolve this. From what I can tell the privileged CP seems to stick around after calling removePreloadedBrowser, but it looks like this has been broken for a while.

Flags: needinfo?(erahm) → needinfo?(hskupin)

(In reply to Eric Rahm [:erahm] from comment #4)

Yes, this should be investigated or spun off before we resolve this. From what I can tell the privileged CP seems to stick around after calling removePreloadedBrowser, but it looks like this has been broken for a while.

Right, though this alone doesn't explain the regression as I'd expect this to happen prior to this patch as well (ie a started new tab preload + privileged content process should behave the same, independent of when it's started). The privileged CP is still nightly-only, so in that sense we probably won't actually be paying as much of a penalty here as the regressions suggest once we get to beta/release, but even so...

Hopefully I have time to dive into the memory reports next week. Would still be useful to have answers from :whimboo for the questions from comment 3/4 re: when marionette starts though.

Separately, I wonder if it'd be productive to make the AWSY code never load the new tab page, and not preload it (from the fact that it opens a blank page on startup, and removes the preloaded browser after closing tabs, I assume that was originally the goal?) as long as we also have a separate, specific AWSY test for how much memory that new tab page takes up, so we don't lose sight of its impact. That would also make regressions/improvements in the new tab page itself show up properly, where they might not currently meet thresholds for reporting on the "main" AWSY tests. Eric, does that sound sane? How hard would it be to get to that stage?

Flags: needinfo?(erahm)

Assigning this so I can look at the close regression next week.

Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED

I also noticed a Talos regression:

== Change summary for alert #20026 (as of Thu, 21 Mar 2019 05:48:40 GMT) ==

Regressions:

6% startup_about_home_paint windows10-64-pgo-qr opt e10s stylo 676.08 -> 714.79

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20026

These are the Gecko profiles for it:

before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FSm_S39pSQUCNXX7KAA2Log%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_startup_about_home_paint.zip

after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FHXvT1X7nRe6qmNh0QpOQwQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_startup_about_home_paint.zip

Maybe the patch on bug 1530979 made the behavior flaky for idle tasks during startup? We are still waiting for the startup recorder to be finished recording all scripts.

Flags: needinfo?(hskupin)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #8)

Maybe the patch on bug 1530979 made the behavior flaky for idle tasks during startup? We are still waiting for the startup recorder to be finished recording all scripts.

The question is, what exactly governs when marionette thinks startup is finished? Is that "the startup recorder to be finished"? Where does that code live?

Flags: needinfo?(hskupin)

(In reply to :Gijs (he/him) from comment #9)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #8)

Maybe the patch on bug 1530979 made the behavior flaky for idle tasks during startup? We are still waiting for the startup recorder to be finished recording all scripts.

The question is, what exactly governs when marionette thinks startup is finished? Is that "the startup recorder to be finished"? Where does that code live?

It looks like at the end of BrowserGlue._scheduleStartupIdleTasks we dispatch a marionette-startup-requested event.
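For reference, the tail of that method boils down to something like this (a rough sketch, not the exact BrowserGlue code):

  // Last of the idle tasks queued by BrowserGlue._scheduleStartupIdleTasks:
  // tell Marionette that startup is considered done.
  Services.tm.idleDispatchToMainThread(() => {
    Services.obs.notifyObservers(null, "marionette-startup-requested");
  });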

(In reply to :Gijs (he/him) from comment #5)

Separately, I wonder if it'd be productive to make the AWSY code never load the new tab page, and not preload it (from the fact that it opens a blank page on startup, and removes the preloaded browser after closing tabs, I assume that was originally the goal?) as long as we also have a separate, specific AWSY test for how much memory that new tab page takes up, so we don't lose sight of its impact. That would also make regressions/improvements in the new tab page itself show up properly, where they might not currently meet thresholds for reporting on the "main" AWSY tests. Eric, does that sound sane? How hard would it be to get to that stage?

That sounds reasonable. Someone would have to add a separate test that loads the new tab page. That should be pretty simple, it's basically just the startup portion of the test. Then probably a front-end/marionette dev would need to modify the test_memory_usage test to somehow open a page in a new tab without first opening a new tab to a blank page within the constraints of the marionette framework.

Flags: needinfo?(erahm)

(In reply to Eric Rahm [:erahm] from comment #10)

It looks like at the end of BrowserGlue._scheduleStartupIdleTasks we dispatch a marionette-startup-requested event.

Right, and that goes to [1], which usually directly jumps to [2] where we wait for the startup recorder to be finished. Note that before my patch Marionette was waiting for sessionstore-windows-restored, but we wanted to have a unique entry point for Marionette across different applications. So maybe sessionstore-windows-restored fires after the marionette-startup-requested notification?

[1] https://searchfox.org/mozilla-central/rev/2c912888e3b7ae4baf161d98d7a01434f31830aa/testing/marionette/components/marionette.js#393
[2] https://searchfox.org/mozilla-central/rev/2c912888e3b7ae4baf161d98d7a01434f31830aa/testing/marionette/components/marionette.js#439-454

Flags: needinfo?(hskupin)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #12)

(In reply to Eric Rahm [:erahm] from comment #10)

It looks like at the end of BrowserGlue._scheduleStartupIdleTasks we dispatch a marionette-startup-requested event.

Right, and that goes to [1], which usually directly jumps to [2] where we wait for the startup recorder to be finished. Note that before my patch Marionette was waiting for sessionstore-windows-restored, but we wanted to have a unique entry point for Marionette across different applications. So maybe sessionstore-windows-restored fires after the marionette-startup-requested notification?

No, sessionstore-windows-restored is what trips _scheduleStartupIdleTasks(), see https://searchfox.org/mozilla-central/rev/a7315d78417179b151fef6108f2bce14786ba64d/browser/components/BrowserGlue.jsm#1589 .

The comment above the method indicates that these tasks should run before the per-window tasks ( https://searchfox.org/mozilla-central/rev/a7315d78417179b151fef6108f2bce14786ba64d/browser/components/BrowserGlue.jsm#1616-1620 ). I don't think that can be true given what we're seeing here: the preloaded browser is created off the per-window task, which is scheduled off the same sessionstore-windows-restored notification ( https://searchfox.org/mozilla-central/rev/a7315d78417179b151fef6108f2bce14786ba64d/browser/base/content/browser.js#1717-1720 ), while the marionette task runs off the BrowserGlue one - yet the marionette-based AWSY test detects a regression here, so its measurement must have run after the preloaded browser was created.

But those tasks are dispatched differently (one uses the thread manager to dispatch idle to mainthread, the other uses a per-window webidl API). Felipe, you seem to have written some of this - should we just remove the comment or does the code/reality need to match the comment?
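For clarity, the two dispatch paths are roughly (illustrative sketch):

  // BrowserGlue idle tasks: process-wide, via the thread manager.
  Services.tm.idleDispatchToMainThread(() => {
    // runs whenever the main thread is idle
  });

  // Per-window idle tasks in browser.js: the DOM requestIdleCallback API.
  window.requestIdleCallback(deadline => {
    // runs when this window's event loop is idle; deadline.timeRemaining()
    // reports how much of the idle slice is left
  });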

Flags: needinfo?(felipc)

(In reply to :Gijs (he/him) from comment #9)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #8)

Maybe the patch on bug 1530979 made the behavior flaky for idle tasks during startup? We are still waiting for the startup recorder to be finished recording all scripts.

FWIW, the startup recorder doesn't run in general - and I don't see AWSY setting the prefs that would cause it to run, so I don't think this makes any difference.

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #7)

I also noticed a Talos regression:

== Change summary for alert #20026 (as of Thu, 21 Mar 2019 05:48:40 GMT) ==

Regressions:

6% startup_about_home_paint windows10-64-pgo-qr opt e10s stylo 676.08 -> 714.79

So this is interesting. I looked at the profiles but quite honestly I don't know how to get any useful information out of them. You'd think that the nearly 40ms have to go somewhere, but it seems to be pretty spread out. :-\

I'm assuming that this boils down to a regression to do with how quickly we think we have a timeslice to do the preloading in, leading to us trying to do preloading while still loading the initial about:home tab. At least, that is what Mike suggested earlier today. I guess we could wait for the final onStopRequest for the selected tab if it's loading about:home/newtab (and then queue an idle task), instead of creating the preloaded browser immediately.
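Roughly what I have in mind (a sketch only; assuming we can hang a web progress listener off gBrowser in browser.js, and leaving the actual preload call as a placeholder):

  let listener = {
    QueryInterface: ChromeUtils.generateQI([Ci.nsIWebProgressListener,
                                            Ci.nsISupportsWeakReference]),
    onStateChange(webProgress, request, stateFlags, status) {
      const { STATE_STOP, STATE_IS_NETWORK } = Ci.nsIWebProgressListener;
      if (webProgress.isTopLevel &&
          stateFlags & STATE_STOP &&
          stateFlags & STATE_IS_NETWORK) {
        gBrowser.removeProgressListener(listener);
        // Only once the initial about:home/newtab load has finished do we
        // queue an idle task to create the preloaded browser.
        window.requestIdleCallback(() => {
          // ...whatever actually creates the preloaded browser goes here...
        });
      }
    },
  };
  gBrowser.addProgressListener(listener);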

Mike, does that sound like a likely explanation for the time increase here, and how would you feel about that as a fix? (and/or, do you see something in these profiles that I don't?)

Flags: needinfo?(mconley)

So the sessionstore worker after the run has around 2.4mb worth of "gc/nursery-malloced-buffers" more post-patch than pre-patch.

It's... really not clear to me what that means or what would cause it.

It's also hanging on to a bunch of bogus looking strings.

:erahm, sorry to be bothering you so much, but I'm not sure how to proceed from here - is there some way of getting more information out of the memory report? Do I just file a follow-up bug and ping sessionrestore folks? From looking at the sessionworker, I don't see anywhere obvious where it'd deliberately hang on to things (nevermind anywhere specifically related to the regressing bug), so perhaps we're "catching" the worker mid-write due to timing differences, meaning a bunch more stuff is in memory?

Flags: needinfo?(erahm)

Looks like the copied strings that aren't icons are serialized copies of principal objects, for localhost:80** ports, which I'm guessing is what we're using to serve test pages for AWSY.

Oh, and it might bear pointing out that the TabsClosed regression is noticeable on Linux, but not so on some other platforms, which would seem to support the hypothesis that the sessionworker diff is to do with timing.

So, looking at how bug 1353013 is implemented, I have a few thoughts:

  1. It looks to me like we're still preloading after every tab open, which I found a little surprising - I assumed that we were going to make every preload occur when idle, and not just the initial one.

  2. The profiles don't line up with the reported time delta. I wonder if this is because the about:home paint test is recording a probe that's measuring from process start time to about:home paint, and there are still some missing time slices between process start and when we start profiling. A similar issue was noted elsewhere in bug 1505944 (see bug 1505944 comment 17 and down).

  3. Focusing in on the time region in the profiles that matter, here's what I have:

Good: http://bit.ly/2TzE0vD
Bad: http://bit.ly/2TG6GmZ

What stands out is that I don't see the code for creating the preloaded browser in the "bad" profile until loooooong after the top sites report painting in the content process (I see it at the 0.7365s mark, near the very end of the profile: http://bit.ly/2TCNzKj)

Nothing else really stands out, and this makes me instantly think of https://bugzilla.mozilla.org/show_bug.cgi?id=1458375#c28.

Are we certain that the newtab preloading is causing this about_home regression? What happens if we back out bug 1530979?

Flags: needinfo?(mconley)

(In reply to :Gijs (he/him) from comment #18)

Created attachment 9053879 [details]
Diff - TabsClosed - memory report, excerpt, sessionworker

So the sessionstore worker after the run has around 2.4mb worth of "gc/nursery-malloced-buffers" more post-patch than pre-patch.

It's... really not clear to me what that means or what would cause it.

It's also hanging on to a bunch of bogus looking strings.

:erahm, sorry to be bothering you so much, but I'm not sure how to proceed from here - is there some way of getting more information out of the memory report? Do I just file a follow-up bug and ping sessionrestore folks? From looking at the sessionworker, I don't see anywhere obvious where it'd deliberately hang on to things (nevermind anywhere specifically related to the regressing bug), so perhaps we're "catching" the worker mid-write due to timing differences, meaning a bunch more stuff is in memory?

If it's in gc memory we probably need to talk to a worker expert.

:asuth, do you have any thoughts re: tracking down why the session store worker is keeping so much more memory in gc/nursery-malloced-buffers?

Flags: needinfo?(erahm) → needinfo?(bugmail)
Attached patch check-idle.patch (obsolete) (deleted) — Splinter Review

(In reply to :Gijs (he/him) from comment #13)

But those tasks are dispatched differently (one uses the thread manager to
dispatch idle to mainthread, the other uses a per-window webidl API).
Felipe, you seem to have written some of this - should we just remove the
comment or does the code/reality need to match the comment?

I did a test here by adding dump() statements to tasks from both BrowserGlue and the per-window ones, and it seems that what is stated in the comment holds; that is, the BrowserGlue tasks ran before the per-window ones.

Can you test this patch to see if this same behavior holds under the test conditions?
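The instrumentation is roughly of this shape (illustrative only; the actual changes are in the attached patch):

  // In BrowserGlue._scheduleStartupIdleTasks:
  Services.tm.idleDispatchToMainThread(() => dump("First BrowserGlue idle task\n"));
  // ...existing idle tasks...
  Services.tm.idleDispatchToMainThread(() => dump("Last BrowserGlue idle task\n"));

  // In the per-window startup code in browser.js:
  window.requestIdleCallback(() => dump("First window idle task\n"));
  // ...existing per-window idle tasks...
  window.requestIdleCallback(() => dump("Last window idle task\n"));

  // Plus an observer for the notification Marionette waits on:
  Services.obs.addObserver(() => dump("Marionette startup requested\n"),
                           "marionette-startup-requested");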

Flags: needinfo?(felipc)
Attached patch check-idle.patch (deleted) — Splinter Review

More detailed patch. I got the expected results:

First BrowserGlue idle task
Last BrowserGlue idle task
Marionette startup requested
First window idle task
Last window idle task

Attachment #9054053 - Attachment is obsolete: true

(In reply to Eric Rahm [:erahm] from comment #22)

:asuth, do you have any thoughts re: tracking down why the session store worker is keeping so much more memory in gc/nursery-malloced-buffers?

As I understand the operation of SessionWorker.jsm/SessionWorker.js it:

  1. Receives a very large data structure via postMessage, so a potentially large object graph is structured-cloned into its global. This state comes from SessionStore.getCurrentState()[1] with some minor post-processing.
  2. It JSON stringifies this data structure.
  3. It UTF-8 encodes the resulting JSON string.
  4. It does some file rotation and uses OS.File.writeAtomic.
  5. It doesn't trigger a GC because there's no method it could use to do this, although ChromeUtils probably could expose this.
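
In code terms, that save path is roughly the following (a minimal sketch assuming a ChromeWorker; the message shape with data.state/data.path is illustrative, the real code is browser/components/sessionstore/SessionWorker.js):

  importScripts("resource://gre/modules/osfile.jsm");

  self.onmessage = function ({ data }) {
    // 1. data.state arrived via structured clone and now lives in this global.
    // 2./3. Stringify and UTF-8 encode it; both copies are freshly allocated.
    let json = JSON.stringify(data.state);
    let bytes = new TextEncoder().encode(json);
    // 4. Write atomically; OS.File is synchronous on a worker thread.
    OS.File.writeAtomic(data.path, bytes, { tmpPath: data.path + ".tmp" });
    // 5. There is no way to request a GC from here, so the garbage sits
    //    around until the worker's own heuristics collect it.
    postMessage({ ok: true });
  };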

As such, I think we expect an active SessionWorker to quite probably have a bunch of garbage in its nursery when active. Bug 1216175 covers improving worker GC/CC, but its behavior should not have changed recently and it's not clear our heuristics would ever optimize for bursty Worker memory usage that can immediately go to zero.

As I understand its caller, SessionSaver.jsm has a state machine that attempts to reduce the rate at which it calls save when idle, but allows a more frequent rate when active. Notably, it uses observer notifications to detect a transition to "active", and in this case appears to schedule a save with a 2-second delay[2].

It seems believable to me that whatever preloading is now happening could perhaps be triggering the session saver to run earlier than it previously would have run?

I tried to use devtools' worker debugging facilities to use the "memory" tab to look into what's going on better, but bug 1506548 disabled the memory tab for workers and trivially reverting the isTargetSupported() check in https://phabricator.services.mozilla.com/D11762 for that functionality didn't make it work.

I would suggest augmenting SessionFile.jsm and/or SessionWorker.jsm with logging to compare when session saves occur in the 2 different scenarios. Alternately, capture profiles with "DOM Worker" enabled for capture, as that should also show what the ChromeWorker gets up to. (Chrome Workers currently inherit the default Thread Name, I believe.)

1: https://searchfox.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#3408
2: https://searchfox.org/mozilla-central/source/browser/components/sessionstore/SessionSaver.jsm#229

Flags: needinfo?(bugmail)
Depends on: 1542232
Depends on: 1542234

OK, too much in one bug at this point. I'm splitting things out:

  1. bug 1542232 for the about home paint win10-pgo-qr-only regression
  2. bug 1542234 for measuring about:newtab/about:home separately

Let's use this bug to see if there's something we can figure out about the "closed" regression... Unfortunately this is too noisy for me locally. I could try to use try to see if we could figure out what was up there... but taking a step back, if this is purely about when we start the session restore worker and the memory would be released at the next GC anyway, is that worth investigating further? I mean, timings in the browser change all the time and I'm not sure if it's worth digging into that too deeply - if anything, I'd want to disable the worker or somehow make it predictable for AWSY's benefit, but that's not really related to the regression itself. :erahm, what do you think?

Flags: needinfo?(erahm)

Since this bug is going a little meta I'm going to go off-topic but related...

:baku was saying yesterday that he saw Remote Settings open its IndexedDB database something like 3 or 4 times during the startup of Firefox when investigating an unrelated issue recently. It seems clear this is sub-optimal but also something that likely just shows up as noise in most of our benchmarks because the information is so inaccessible.

These are also costs that IndexedDB/QuotaManager can absolutely apportion to specific databases/origins/globals(/JSM scripts?). And further, the per-global data can be flowed back from Worker globals to the global that created them. Is there a set of benchmarks and/or related about:performance(-style) improvement efforts that the DOM: Workers & Storage sub-team can help out with to surface this data in a usable fashion? IndexedDB and QuotaManager performance improvements are on deck and while we already have some relevant telemetry and will be adding more telemetry that will help us quantify these improvements, those numbers will similarly be nebulous aggregates. (And of course, these are things we'd also want to expose to devtools, which would likely be the architectural driver absent anything else.)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #27)

Since this bug is going a little meta I'm going to go off-topic but related...

:baku was saying yesterday that he saw Remote Settings open its IndexedDB database something like 3 or 4 times during the startup of Firefox when investigating an unrelated issue recently. It seems clear this is sub-optimal but also something that likely just shows up as noise in most of our benchmarks because the information is so inaccessible.

These are also costs that IndexedDB/QuotaManager can absolutely apportion to specific databases/origins/globals(/JSM scripts?). And further, the per-global data can be flowed back from Worker globals to the global that created them. Is there a set of benchmarks and/or related about:performance(-style) improvement efforts that the DOM: Workers & Storage sub-team can help out with to surface this data in a usable fashion? IndexedDB and QuotaManager performance improvements are on deck and while we already have some relevant telemetry and will be adding more telemetry that will help us quantify these improvements, those numbers will similarly be nebulous aggregates. (And of course, these are things we'd also want to expose to devtools, which would likely be the architectural driver absent anything else.)

Can you file a separate bug and needinfo Florian? Perhaps he has ideas about the telemetry, or the automated testing, or integration with about:performance or profiling markers or similar - but probably slightly off-topic to go over all of that in this bug. :-)

Flags: needinfo?(bugmail)

(In reply to :Gijs (he/him) from comment #26)

OK, too much in one bug at this point. I'm splitting things out:

  1. bug 1542232 for the about home paint win10-pgo-qr-only regression
  2. bug 1542234 for measuring about:newtab/about:home separately

Let's use this bug to see if there's something we can figure out about the "closed" regression... Unfortunately this is too noisy for me locally. I could try to use try to see if we could figure out what was up there... but taking a step back, if this is purely about when we start the session restore worker and the memory would be released at the next GC anyway, is that worth investigating further? I mean, timings in the browser change all the time and I'm not sure if it's worth digging into that too deeply - if anything, I'd want to disable the worker or somehow make it predictable for AWSY's benefit, but that's not really related to the regression itself. :erahm, what do you think?

I'm not convinced the memory would be released at the next GC, for one of the measurements we force a GC after 30s and then measure.
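
For reference, the forced collection is essentially the standard chrome-side calls (illustrative; the actual harness lives under testing/awsy/ and drives this via marionette):

  Cu.forceGC();
  Cu.forceCC();
  Cu.forceShrinkingGC();
  // ...then collect an about:memory-style report and record the numbers.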

Flags: needinfo?(erahm)

(In reply to Eric Rahm [:erahm] from comment #29)

(In reply to :Gijs (he/him) from comment #26)

OK, too much in one bug at this point. I'm splitting things out:

  1. bug 1542232 for the about home paint win10-pgo-qr-only regression
  2. bug 1542234 for measuring about:newtab/about:home separately

Let's use this bug to see if there's something we can figure out about the "closed" regression... Unfortunately this is too noisy for me locally. I could try to use try to see if we could figure out what was up there... but taking a step back, if this is purely about when we start the session restore worker and the memory would be released at the next GC anyway, is that worth investigating further? I mean, timings in the browser change all the time and I'm not sure if it's worth digging into that too deeply - if anything, I'd want to disable the worker or somehow make it predictable for AWSY's benefit, but that's not really related to the regression itself. :erahm, what do you think?

I'm not convinced the memory would be released at the next GC, for one of the measurements we force a GC after 30s and then measure.

Based on comment #25, I'm not sure the GC actually affects the worker, and/or if the GC call could return before it's taken effect on the worker, and/or if the GC could just happen to run while the worker is mid-call, in which case nothing would be GC'd because everything is still reachable in the worker's scope - we'd want to do a GC once the worker had finished.

To me, this just looks like unfortunate timing that happened to be tripped by my patch (which otherwise shouldn't be affecting the session worker at all, as the preloaded tab shouldn't have its info collected by session restore anyway). Are you saying you think the session worker is leaking? From having looked at the code, I could well be wrong but I'm not sure how/why it would... Or are you saying something else, and if so, what? :-)

Flags: needinfo?(erahm)
Priority: -- → P2

(In reply to :Gijs (he/him) from comment #30)

(In reply to Eric Rahm [:erahm] from comment #29)

(In reply to :Gijs (he/him) from comment #26)

OK, too much in one bug at this point. I'm splitting things out:

  1. bug 1542232 for the about home paint win10-pgo-qr-only regression
  2. bug 1542234 for measuring about:newtab/about:home separately

Let's use this bug to see if there's something we can figure out about the "closed" regression... Unfortunately this is too noisy for me locally. I could try to use try to see if we could figure out what was up there... but taking a step back, if this is purely about when we start the session restore worker and the memory would be released at the next GC anyway, is that worth investigating further? I mean, timings in the browser change all the time and I'm not sure if it's worth digging into that too deeply - if anything, I'd want to disable the worker or somehow make it predictable for AWSY's benefit, but that's not really related to the regression itself. :erahm, what do you think?

I'm not convinced the memory would be released at the next GC, for one of the measurements we force a GC after 30s and then measure.

Based on comment #25, I'm not sure the GC actually affects the worker, and/or if the GC call could return before it's taken effect on the worker, and/or if the GC could just happen to run while the worker is mid-call, in which case nothing would be GC'd because everything is still reachable in the worker's scope - we'd want to do a GC once the worker had finished.

To me, this just looks like unfortunate timing that happened to be tripped by my patch (which otherwise shouldn't be affecting the session worker at all, as the preloaded tab shouldn't have its info collected by session restore anyway). Are you saying you think the session worker is leaking? From having looked at the code, I could well be wrong but I'm not sure how/why it would... Or are you saying something else, and if so, what? :-)

Ah right, this is session-worker related; I can see how our standard GC attempts might not work. What's not clear to me is whether this memory would be properly cleaned up even if we could magically make GC better for session workers. At this point let's just accept the regression, as I don't think there's much we can do.

Flags: needinfo?(erahm)

Alright, let's accept this then. Thanks!

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(bugmail)
Resolution: --- → WONTFIX