Open Bug 1227663 Opened 9 years ago Updated 4 years ago

TART should not create and destroy many content processes

Categories

(Core :: DOM: Core & HTML, defect, P5)


Tracking


Tracking Status
e10s - ---

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Talos has detected a Firefox performance regression from your commit 4d8cadcfe408d66bdbcdf4d9b01cdb373c0fd44f in bug 1196159.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=4d8cadcfe408d66bdbcdf4d9b01cdb373c0fd44f&showAll=1

On the page above you can see a Talos alert for each affected platform, as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#TART.2FCART

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64 -u none -t svgr  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tart

Making a decision:
As the patch author, you are in the best position to help us handle this regression, so we need your feedback.
*** Please let us know your plans by Friday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
The compare view in Perfherder shows the specific subtests of tart which have regressed:
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=e665ea64eb15&newProject=mozilla-inbound&newRevision=4d8cadcfe408&originalSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc&newSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc

tart simple-open-DPI1.all.TART opt e10s 
tart simple-open-DPI1.error.TART opt e10s 
tart simple-open-DPI1.half.TART opt e10s 


I have also done a series of retriggers:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=9a6ee2a16bd6&tochange=0aec4fd3842f&filter-searchStr=ubuntu%20hw%2012.04%20x64%20mozilla-inbound%20talos


:mrbkap, can you look at your patch and see why this might affect the TART test?  I can push to try with a backout if we don't trust the re-triggers in the pushlog on inbound.
Flags: needinfo?(mrbkap)
(In reply to Joel Maher (:jmaher) from comment #1)
> the compare view in perfherder shows the specific subtests of tart which
> have regressed:
> https://treeherder.allizom.org/perf.html#/
> comparesubtest?originalProject=mozilla-
> inbound&originalRevision=e665ea64eb15&newProject=mozilla-
> inbound&newRevision=4d8cadcfe408&originalSignature=65e173fb1a511a28979c769ad2
> c8a1b31dd358bc&newSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc
> 
> tart simple-open-DPI1.all.TART opt e10s 
> tart simple-open-DPI1.error.TART opt e10s 
> tart simple-open-DPI1.half.TART opt e10s 

You can see a sustained, stark change on these subtests by examining the graphs linked from the compare view, e.g.:

https://treeherder.mozilla.org/perf.html#/graphs?series=[mozilla-inbound,1bdaef876f8675b022871adb9f82c98df373cce0,1]&highlightedRevisions=e665ea64eb15&highlightedRevisions=4d8cadcfe408&timerange=604800
:billm, can you take a look at this? :mrbkap hasn't responded; I don't know if he is on PTO, but I don't want to let this get ignored either.
Flags: needinfo?(wmccloskey)
It was Thanksgiving last week in the US. Please needinfo me if Blake hasn't looked at it in the next few days.
Flags: needinfo?(wmccloskey)
It is so weird that a patch that should only affect window.close seems to have caused a regression for a test that exercises window.open...
I just tried running these tests locally with and without my patch by installing the addon and running the default set of tests. The results I got were identical to each other.

I then put a dump() where I removed the preventDefault (and printfs where we dispatch that event) and confirmed that we never hit that code while running the "simple" test. I don't know how to explain the graph linked to here. Joel, do you have any ideas?
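
(For anyone following along, a rough JS sketch of that kind of diagnostic; the listener placement here is hypothetical, and the real printfs were added in C++ where the event is dispatched:)

// Throwaway check: does DOMWindowClose ever get dispatched while the "simple"
// TART test runs? dump() output shows up in the terminal when the
// browser.dom.window.dump.enabled pref is true.
window.addEventListener("DOMWindowClose", function (event) {
  dump("DOMWindowClose dispatched, target: " + event.target + "\n");
}, true);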
Flags: needinfo?(mrbkap) → needinfo?(jmaher)
:mrbkap, thanks for looking into this. I double-checked things and decided to push to try with the latest tip from inbound, then again with the same revision but with 4d8cadcfe408 backed out:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=aae3edac20db&tochange=51449764d5d4

You can see a comparison of the data here:
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=51449764d5d4&newProject=try&newRevision=aae3edac20db&showOnlyConfident=1

This shows an improvement on the exact subtests which caused the regression earlier, now that the patch is backed out:
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=try&originalRevision=51449764d5d4&newProject=try&newRevision=aae3edac20db&originalSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc&newSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc

So why are you not seeing this locally, and why does the code not appear to be touched at all? Usually, not being able to reproduce things locally is related to machine differences, or to slight differences between running mozharness and raw talos. In this case, I am not sure that applies, since the code doesn't seem to be touched.

Given this recent data combined with the original, this does look valid. I'm not sure what is going on; maybe we could get some profiling with the talos profiler and compare things, if that helps?
Flags: needinfo?(jmaher)
I can help gather profiles here.
Flags: needinfo?(mconley)
I have a theory (feel free to blow this out of the water, billm, if I've misunderstood how our event shim works):

TART works by installing an add-on, and looking at the add-on's install.rdf, it hasn't been marked multiprocess compatible, which means that the shims are enabled for it.

I suspect that after a previous tab has finished closing and TART starts opening a new tab, the child realizes that the first tab has closed and fires the DOMWindowClose event, which is sent synchronously to the parent over the shims. Handling that shim message delays the next test.

I have no idea why this would only impact the one subtest (simple-open-DPI1), but I think it'll be pretty easy to rule this hypothesis out. I've pushed to try a build that disables the shims for the TART tests (I verified locally that the test should run properly without them, so maybe we should just mark the add-on multiprocess compatible at some point).

Let's see if this changes things.

Baseline: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae3882699bf2

Disable shims patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9147c3e4b9b4
Flags: needinfo?(mconley)
That seems entirely possible. Either way, we should disable the shims in TART! Thanks Mike.
Okay, so I don't know if I fully understand what's going on here, but I have a general theory with some evidence to back it up.

First off, my original theory that this is caused by the DOMWindowClose event somehow going synchronously over the shims is bunk. Disabling the shims globally did not help; there was no performance change at all.

One thing I noticed is that TART runs its tests one right after the other, and I think this might be a case of the last test (newtabYesPreload) running just before the first test (simple) of the next cycle.

The content process is shared between remote tabs and the thumbnail service. By default, any tab opened to about:blank loads it in the content process (which starts a content process if one isn't already running). The simple TART test opens an about:blank page, so presumably this requires a content process (subprocess profiling seems to back this up).

When that tab closes as the last remote tab, the content process is scheduled for shutdown and goes away after a short while. I suspect we're seeing a slight change in the behaviour of content process shutdown which results in the parent having to deal with ActorDestroy for that process at a bad time during the icon-dpi1 test.
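
(As an aside, a hypothetical way to watch for this from the parent side, not something the test itself does: log the content-process lifecycle notifications while the suite runs.)

// Hypothetical diagnostic, run in the parent process (e.g. the Browser
// Console): log when content processes start and when their ActorDestroy runs,
// to see whether a shutdown lands in the middle of a subtest.
Components.utils.import("resource://gre/modules/Services.jsm");

var processLogger = {
  observe: function (subject, topic, data) {
    dump("[" + Date.now() + "] " + topic + " data=" + data + "\n");
  }
};
Services.obs.addObserver(processLogger, "ipc:content-created", false);
Services.obs.addObserver(processLogger, "ipc:content-shutdown", false);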

Evidence! Here's a comparison between a baseline with mrbkap's patch, and me forcing the content process to be always on throughout the test:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ae3882699bf2&newProject=try&newRevision=d131eb14d0ad

And here's a comparison between a baseline with mrbkap's patch, and mrbkap's patch backed out:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ae3882699bf2&newProject=try&newRevision=b8f5873c518c

Verrrrry similar.

One thing to also note is that the backout (and keeping the content process alive) results in a pretty substantial CART regression! Lovely news.

I suggest we take this regression. I don't think we should investigate this more.

Sound good jmaher?
Flags: needinfo?(jmaher)
So what you're saying is that the first test doesn't wait long enough after the last one, and so some cleanup happens while the first test executes?

Typically, most TART subtests wait 500 ms before starting, and depending on the execution mode we might wait quite a bit longer before the first test is executed. To understand this best, one needs to run TART in talos from the command line and observe its execution cycles.

Either way, the delay before each test starts is configurable. Are you saying that if we set the delay before the first test to, say, 10s, then the regression should go away?

The goal of the pre-wait is exactly to let things settle down before the test starts, but if, following some code changes, cleanup now takes longer, then I think it's reasonable to adapt TART accordingly.

I don't think TART should knowingly be measuring animation performance while things are still being cleaned up.

A try push (with modified TART timeout before the first test) should be able to confirm this.
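
(A minimal sketch of that kind of configurable pre-wait, with made-up names, not TART's actual code:)

// Hypothetical sketch: a configurable settle period before the first subtest.
// "restBeforeFirstTest" and "runNextSubtest" are made-up names for illustration.
var restBeforeFirstTest = 10000; // ms; e.g. bump from ~500 ms to 10 s for a try push

function settle(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

settle(restBeforeFirstTest).then(runNextSubtest);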
I think it is fine to leave this alone, but I would like to test what avih has proposed. Ideally, if TART needs to be changed, we should do it.
Flags: needinfo?(jmaher)
So I ran a test where I set a 10s rest between tests. Here's the comparison:

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=ae3882699bf2&newProject=try&newRevision=eccc64e91349&originalSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc&newSignature=65e173fb1a511a28979c769ad2c8a1b31dd358bc

Which is basically no change. :/ So my theory that some async content process teardown is affecting us here might be bunk.
Here's another comparison. This time, the base is mrbkap's patch on fx-team, and the new side has the content process forced to stay alive by making all TART chrome pages run in the content process:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ae3882699bf2&newProject=try&newRevision=23791abd81d6

Pretty significant improvements for TART (and a regression for CART?)

And here's a comparison between mrbkap's patch on fx-team and a run with the simple tab test disabled (to see if the regression "moves" once the simple test is gone):

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=ae3882699bf2&newProject=try&newRevision=3cfa6bf3e997&originalSignature=042d1e849b0a07a0582bf4045cfd9b7e436b1065&newSignature=042d1e849b0a07a0582bf4045cfd9b7e436b1065

Here, I seem to see what might be the regression moving to the iconFade-open-DPI2 test.
We also have frequent TART timeouts on linux64 e10s (bug 1230978). I wonder if this is somehow related.
Flags: needinfo?(mconley)
tracking-e10s: --- → -
Flags: needinfo?(mconley)
Summary: 2.8% Linux 64 tart e10s regression on Mozilla-Inbound (v.45) on November 24, 2015 from push 4d8cadcfe408d66bdbcdf4d9b01cdb373c0fd44f → TART should not create and destroy many content processes
can you help brad?
Flags: needinfo?(blassey.bugs)
Mike, so as far as I understand, the hypothesis of a longer wait before the first test didn't help.

I didn't quite follow the details of your second experiment in comment 16, but it would help to know whether you suspect it's a TART issue or a Firefox issue, i.e. do you think the regression in this bug is something users would/could experience as well? Maybe only in some use cases?

The comparison in comment 1 clearly shows that only one test has regressed, and, as noted, it's the first tab opened after a "normal" newtab with preload was loaded and closed.

Would that also apply to users?
Flags: needinfo?(mconley)
(In reply to Doug Turner (:dougt) from comment #18)
> can you help brad?

I think Mike is the right person here and already NI'd
Flags: needinfo?(blassey.bugs)
(In reply to Avi Halachmi (:avih) from comment #19)
> Mike, so as far as I understand the hypothesis of longer wait before the
> first test didn't help.
> 
> I didn't quite follow the details of your second experiment at comment 16,
> but it would help to know whether you suspect it's a TART issue or a Firefox
> issue, I.e. do you think the regression at this bug is something users
> would/could experience as well? Maybe only at some use cases?

I think this is a TART issue only.

So here's the TL;DR of this thing:

1. A content process stays alive until the last browser connected to it is destroyed, after which the content process is shut down.

2. The newtab preload browser uses the same content process as the rest of the tabs.

3. With the prefs that we set, I think what we're doing is closing the last browser associated with a content process just before opening a new remote browser. I suspect this causes us to shut down the previous content process and start up a new one (a rough sketch of the opposite ordering follows this list).

4. I don't think this is something users really need to care about at this point. Quick creation and destruction of content processes is not something I think users will hit often, if ever.
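
A rough sketch of the opposite ordering mentioned in point 3 (illustrative browser-chrome JS only; previousTab is a hypothetical reference to the tab about to be closed):

// Illustrative only: open the next remote tab before closing the previous one,
// so the shared content process always has at least one browser attached and
// never gets torn down (and re-created) between subtests.
var nextTab = gBrowser.addTab("about:blank"); // attaches to the existing content process
gBrowser.removeTab(previousTab);              // closing the old tab no longer leaves the
                                              // process with zero browsers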
Flags: needinfo?(mconley)

Bulk-downgrade of unassigned, 4 years untouched DOM/Storage bugs' priority.

If you have reason to believe this is wrong (especially for the severity), please write a comment and ni :jstutte.

Severity: normal → S4
Priority: -- → P5