2.18 - 4.07% startup_about_home_paint / startup_about_home_paint_realworld_webextensions (linux64-shippable, linux64-shippable-qr) regression on push bb19bd67310fd0f0b309759d530ce6292b42d16e (Thu March 19 2020)
Categories
(Firefox :: Remote Settings Client, defect, P2)
Tracking
| Release | Tracking | Status |
|---|---|---|
| firefox-esr68 | --- | unaffected |
| firefox74 | --- | unaffected |
| firefox75 | --- | unaffected |
| firefox76 | --- | fixed |
| firefox77 | + | fixed |
People
(Reporter: Bebe, Assigned: Gijs)
References
(Blocks 1 open bug, Regression)
Details
(4 keywords, Whiteboard: [fxperf:p1])
Attachments
(1 file)
(deleted), text/x-phabricator-request | RyanVM: approval-mozilla-beta+ | Details
Talos has detected a Firefox performance regression from push bb19bd67310fd0f0b309759d530ce6292b42d16e.
Since you authored one of the patches included in that push, we need your help to address this regression.
Regressions:
4% startup_about_home_paint linux64-shippable opt e10s stylo 740.83 -> 771.00
4% startup_about_home_paint_realworld_webextensions linux64-shippable opt e10s stylo 785.46 -> 812.92
3% startup_about_home_paint linux64-shippable-qr opt e10s stylo 1,124.21 -> 1,154.50
2% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,137.12 -> 1,161.92
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=25465
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/TestEngineering/Performance/Talos
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling
Comment 1•5 years ago
The strange thing here is that the last couple of perf runs I did on this showed no regressions for these tests:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=87fd18a4071212bc00840c13378d82c816df974c&newProject=try&newRevision=4c88d4fe66f8cd019b7f114a092fd177596007d4&framework=1
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=49b9c64323badacb3e5aa4a12ff51c61f02321c3&newProject=try&newRevision=6a9167ae2788150c0f24b353301d0b629250720c&framework=1
I did change one thing about pre-emptive loading of the config (I removed it), so I've just pushed a comparison for that, though I'd be surprised if not having that caused the regression:
Comment 2•5 years ago
For 76 this code is only enabled on nightly, so marking as fix-optional. We are planning on shipping in 77 though.
Double checking the graphs, this appears to be a linux-only regression:
https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,2119227,1,1&series=mozilla-central,2119231,1,1&series=mozilla-central,2119233,1,1&series=mozilla-central,2119234,1,1&series=mozilla-central,2119246,1,1&series=mozilla-central,2119245,1,1&timerange=2592000&zoom=1584234887172,1585177614698,535.6043651064882,2295.0528651602904
https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,1925683,1,1&series=mozilla-central,1937834,1,1&series=mozilla-central,2054088,1,1&series=mozilla-central,1899640,1,1&series=mozilla-central,1934570,1,1&timerange=2592000&zoom=1584316807452,1585036687161,576.2999396829043,2156.830350225112
This makes me wonder if there's something platform-dependent affecting things here, maybe the SQLite database.
I'm just about to set off a couple more try pushes for various things, hopefully they might help identify it a bit more.
Comment 3•5 years ago
I've been experimenting with removing non-essential engines from the search engine configuration (ones that won't affect startup time); when I do so, I see a reduction in the times on Linux:
I tried getting profiles from before and after the config reduction, but I couldn't see any obvious point that was taking up the time:
The interesting thing is, we didn't see the regression reported in this bug in my last set of try runs:
Comment 4•5 years ago
Jan, Andrew, we're seeing a performance regression here on our modern search engine configuration. The patch that has caused the regression was one that moved the configuration file from being shipped within Firefox to being served from RemoteSettings (with a dump that's loaded on initial startup).
Since my try runs ahead of landing this, something seems to have caused a regression in startup_about_home_paint and startup_about_home_paint_realworld_webextensions on Linux.
As the search service and remote settings are JS-only code, I'm wondering if this could have been caused by something in IndexedDB, since RemoteSettings is based on that.
I'll see if I can narrow down a regression range, but it is likely to take a while since it'll involve a lot of try pushes.
Comment 5•5 years ago
There's been a ton of stuff happening in RemoteSettings lately, plus the revelation that instances are getting spun up N times (now addressed?), so I'm redirecting the needinfos to people more directly involved in the changes. IndexedDB-wise, there are very verbose log messages available with MOZ_LOG=IndexedDB:5 if you want to see the relative timestamps of when requests are issued and serviced, etc. If manual checking of the logs ends up being a lot of work, I can try and see if I can apply some previous log-processing efforts to aid in doing so in a more automated fashion[1].
1: There's also https://github.com/mayhemer/logan in general, but it doesn't understand IndexedDB logs right now, I think.
Assignee
Comment 6•5 years ago
(In reply to Mark Banner (:standard8) from comment #2)
> For 76 this code is only enabled on nightly, so marking as fix-optional. We are planning on shipping in 77 though.
> Double checking the graphs, this appears to be a linux-only regression:
I wouldn't necessarily assume that this is Linux-only; I suspect the resolution and stability of the numbers on other platforms are such that it's harder to see issues on those.
I have a very basic question: does search service initialization block about:home rendering, because of the search box?
Comment 7•5 years ago
(In reply to :Gijs (he/him) from comment #6)
> I have a very basic question: does search service initialization block about:home rendering, because of the search box?
The initial rendering of the outline and placeholder happens, but the bits like icons & topsites are delayed.
Assignee
Comment 8•5 years ago
(In reply to Mark Banner (:standard8) from comment #7)
> (In reply to :Gijs (he/him) from comment #6)
> > I have a very basic question: does search service initialization block about:home rendering, because of the search box?
> The initial rendering of the outline and placeholder happens, but the bits like icons & topsites are delayed.
So does delaying the initialization completion by 5 seconds delay these talos measures by five seconds, or are they not blocked on those things rendering? In other words, is simply a slower search service initialization a possible explanation here, or is it work that happens to happen before we get to the paint, and slows it down because the CPU is, well, perhaps not only human, but still only a CPU, and can only do a limited number of things at once?
Comment 9•5 years ago
I've done some more investigations and some try pushes.
As mentioned previously, these perf regressions didn't show up on my original pushes for the modern configuration, only when I actually pushed it. I've gone back through the few weeks between those, and with a bit of added guesswork, I've tracked it down to bug 1620185.
Here's the comparisons of try runs with my patch from bug 1542269 immediately before and then after bug 1620185 landed.
That's a clear difference between those two runs. So I think the actual performance issue should be considered as both of those bugs combined.
(In reply to :Gijs (he/him) from comment #8)
> So does delaying the initialization completion by 5 seconds delay these talos measures by five seconds, or are they not blocked on those things rendering? In other words, is simply a slower search service initialization a possible explanation here, or is it work that happens to happen before we get to the paint, and slows it down because the CPU is, well, perhaps not only human, but still only a CPU, and can only do a limited number of things at once?
If I delay by five seconds, then that increases the talos measures by five seconds as well.
If I'm understanding Talos and the code correctly, Talos is waiting for the top sites to be displayed, and they have to wait on the search service to initialise as they have to get the default engine and other things to work out what to display.
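To make that dependency concrete, here's a minimal sketch (hypothetical and simplified; the actual about:home/top-sites code paths differ), assuming, as described above, that the top-sites code needs the default engine:

```js
// Hypothetical simplification of the dependency described above; the real
// about:home/top-sites code is more involved.
const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");

async function getDefaultEngineForTopSites() {
  // Blocked on loading the search configuration (now served via Remote Settings).
  await Services.search.init();
  // Top sites need the default engine (among other things) to decide what to
  // display, so anything that slows init() pushes out the paint Talos measures.
  return Services.search.getDefault();
}
```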
Comment 10•5 years ago
I'm clearing my NI here since I feel that I cannot help much here after reading all your comments :(
Comment 11•5 years ago
I've just got round to doing some more try pushes across the patches from bug 1620185.
- This comparison shows no regression and is based on "Move DB to separate file"
- This comparison shows the regression and is based on "Import IndexedDB code from kinto.js"
Mathieu, it looks like importing the IndexedDB code from kinto.js has caused the regression here. It is a 2-6% regression in our startup times for getting to the tab paint.
The search service is opening and getting the search-config collection on startup, so I suspect it is something to do with that code path.
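For concreteness, here is a rough sketch of the suspected code path (the collection name is the one mentioned above; this is not the actual SearchService implementation):

```js
// Rough sketch only; the real SearchService code differs in detail.
const { RemoteSettings } = ChromeUtils.import(
  "resource://services-settings/remote-settings.js"
);

async function getSearchConfig() {
  // On a fresh profile the collection is seeded from the packaged dump, then
  // read back out of the Remote Settings IndexedDB store; that read is the
  // part that appears to have become slower.
  return RemoteSettings("search-config").get();
}
```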
This wasn't picked up when your patch landed, as we didn't have the Search Service patch to load from remote settings at that time.
Is there something we can do to improve the performance of remote settings code here?
Comment 12•5 years ago
Having thought about this overnight, I think this really needs input/fixing from the remote settings client side.
Fundamentally the database startup, loading, or something around that area is slower than it was before and that should be investigated - it obviously affects startup here, but will probably have an adverse effect on other remote settings items.
Hence I'm moving this across to the remote settings client, and un-assigning myself. I am still willing to help, but I have no background in the IndexedDB code, so I think I'm not the best person to primarily investigate that.
Comment 13•5 years ago
If anyone generates profiler traces (using MOZ_PROFILER_STARTUP as documented at https://profiler.firefox.com/docs/#/./guide-startup-shutdown I think), the relevant threads to filter on for IndexedDB purposes would be:
- "IndexedDB" (gets all IndexedDB threads)
- "QuotaManager" (which covers "QuotaManager IO", just the 1)
- "IPDL Background" (PBackground, which is where the requests go)
Probably also nice to include "DOM Worker", possibly via "Worker", since remote settings also spins up a ChromeWorker, and workers in general require the main thread to help out with script loading, which can both delay the worker startup and generate main-thread contention (planned to be fixed after bug 1528285 is fixed).
Comment 14•5 years ago
Mark, I agree with your conclusion, this is something that should be addressed at the Remote Settings level.
Thanks for the pointers and perf graphs!
One question, did you try with the latest trunk? Gijs has changed some parts of the implementation in Bug 1624983, most notably the way we handle opening the DB connection.
Comment 15•5 years ago
(In reply to Mathieu Leplatre [:leplatrem] from comment #14)
> One question, did you try with the latest trunk? Gijs has changed some parts of the implementation in Bug 1624983, most notably the way we handle opening the DB connection.
I believe I looked at that when it landed, and I didn't see anything noticeable in the perf graphs.
Though I'm currently wondering if bug 1631484 might help here.
Assignee
Comment 16•5 years ago
[Tracking Requested - why for this release]:
We're only shipping this in 77 so that's where it should be tracked. I don't think we're going to be wanting to regress startup times for about:home quite this much for 77...
I'll try to look at this this week.
Assignee
Comment 18•5 years ago
I'm still investigating the remotesettings side, but meanwhile, can you elaborate on this comment?
> * Note that this may cause a network check of the certificate, but that
> * should generally be quick.
The slowdown here seems exclusively due to the time (now) taken in _getConfiguration, and if that can involve network requests and we need them to display about:home, that's... really bad? Why would we need a network request?
Assignee
Comment 19•5 years ago
OK, I think I know what's wrong. There was a fast path at https://searchfox.org/mozilla-central/rev/41c3ea3ee8eab9ce7b82932257cb80b703cbba67/services/common/kinto-offline-client.js#452-457 to make a list call that passed no filtering options use the index's getAll method to get data. This involves one roundtrip to IndexedDB's background thread, instead of N (for N items) for the cursor approach that the Database.jsm code currently uses.
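For context, a minimal sketch of the two IndexedDB access patterns being contrasted here (illustrative only; not the actual kinto-offline-client.js or Database.jsm code):

```js
// Illustrative only; not the actual kinto-offline-client.js or Database.jsm code.

// Fast path: a single getAll() request, i.e. one round trip to the IndexedDB
// background thread for the whole result set.
function listAllFast(store) {
  return new Promise((resolve, reject) => {
    const request = store.getAll();
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Cursor path: one round trip per record, so N records cost N hops to the
// background thread even when no filtering is needed.
function listAllWithCursor(store) {
  return new Promise((resolve, reject) => {
    const results = [];
    const request = store.openCursor();
    request.onsuccess = () => {
      const cursor = request.result;
      if (cursor) {
        results.push(cursor.value);
        cursor.continue();
      } else {
        resolve(results);
      }
    };
    request.onerror = () => reject(request.error);
  });
}
```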
This is pretty bad; it also explains why some of the indexeddb interactions I saw in profiles for the blocklist recently were so much worse than I recalled seeing previously.
Mathieu, was there a specific reason this code was omitted when porting into Database.jsm?
Assignee
Comment 20•5 years ago
I need to sleep, but I pushed to try to see if I'm right - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46cf1883ab673930e84fece812e0edd7f70a98bc&newProject=try&newRevision=4efdf594fafdfc4325e77448e014f4fbbb7441dd&framework=1 should have results in some hours.
I can see a significant difference in the runtime of _getConfiguration that roughly corresponds to the magnitude of the regression here, on my local Windows machine. However, I do not see a significant effect in the resulting about_home time, so I guess on my local machine, other things are the limiting factor in how soon we consider about:home fully painted...
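For anyone wanting to reproduce the local measurement, instrumentation along these lines is enough (hypothetical snippet, not part of any patch; it assumes access to the SearchService instance and its private _getConfiguration method):

```js
// Hypothetical timing instrumentation, not part of any patch. Cu is
// Components.utils (available in chrome JS); _getConfiguration is the private
// SearchService method named above.
async function timeGetConfiguration(searchService) {
  const start = Cu.now();
  const config = await searchService._getConfiguration();
  console.log(`_getConfiguration took ${(Cu.now() - start).toFixed(1)} ms`);
  return config;
}
```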
Comment 21•5 years ago
(In reply to :Gijs (he/him) from comment #18)
> The slowdown here seems exclusively due to the time (now) taken in _getConfiguration, and if that can involve network requests and we need them to display about:home, that's... really bad? Why would we need a network request?
I believe this is a certificate check that Remote Settings needs to do, I can't quite remember the full details though. Mathieu, can you explain please?
Comment 22•5 years ago
First of all, I apologize for the lack of availability and delay of replies...
> Mathieu, was there a specific reason this code was omitted when porting into Database.jsm?
Not really, I wanted to simplify the implementation by removing this createListRequest() helper, and this fast path got dropped :(
> Though I'm currently wondering if bug 1631484 might help here.
It could help, especially on collections with lots of records, but it's not the root cause of the regression, since the put operations chaining was already here before:
https://github.com/Kinto/kinto.js/blob/5e09b2f3e06a731ee2bbd445f20829e63702da5e/src/adapters/IDB.js#L607-L636
> I believe this is a certificate check that Remote Settings needs to do, I can't quite remember the full details though. Mathieu, can you explain please?
I was going to say that it was due to signature verification of local data when calling .get(), but I see that you don't pass the {verifySignature: true} option (BTW without this flag, .get() will never throw).
If the collection is empty, then a sync is initiated (unless {syncIfEmpty: false} is passed), and that introduces network requests.
Apart from those, I don't really see why .get() would involve network requests :/
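As a small illustration of how those options look at a call site (option names as described above, collection name taken from this bug; this is not the actual search service call):

```js
// Illustration of the get() options discussed above; not the real search service call.
const { RemoteSettings } = ChromeUtils.import(
  "resource://services-settings/remote-settings.js"
);

async function readSearchConfig() {
  return RemoteSettings("search-config").get({
    // Verify the signature of the local data; without this flag, get() never throws.
    verifySignature: true,
    // By default an empty local collection triggers a sync (network requests);
    // passing false skips that.
    syncIfEmpty: false,
  });
}
```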
Assignee
Comment 23•5 years ago
Assignee
Comment 24•5 years ago
OK, so the trypush was successful and addresses the regression, so I will put up the patch to get that fix landed and perhaps even uplifted if we can (though this regression doesn't affect 76, since the search config stuff got held to Nightly, the fix will still improve perf elsewhere, as bug 1620185 made it to 76).
(In reply to Mathieu Leplatre [:leplatrem] from comment #22)
> > I believe this is a certificate check that Remote Settings needs to do, I can't quite remember the full details though. Mathieu, can you explain please?
> I was going to say that it was due to signature verification of local data when calling .get(), but I see that you don't pass the {verifySignature: true} option (BTW without this flag, .get() will never throw).
OK, so we don't hit this case here. That's good. I'll follow up separately.
> If the collection is empty, then a sync is initiated (unless {syncIfEmpty: false} is passed), and that introduces network requests.
In this case, we have dumps and those should be loaded instead, right? And AFAICT there is no way even failing to import the dump would cause us to request anything from the network.
Comment 25•5 years ago
Comment 26•5 years ago
bugherder
Assignee
Comment 27•5 years ago
Although the regression doesn't actually affect 76, I'm going to mark this as affected so it shows up on uplift queries.
Assignee
Comment 28•5 years ago
Comment on attachment 9142688 [details]
Bug 1624301 - optimize remote settings Database list calls without filters, r?leplatrem
Beta/Release Uplift Approval Request
- User impact if declined: Slow performance for remote settings, meaning sluggishness at (apparently) random intervals when the DB queries happen
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: n/a
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This is a small JS-only patch that reintroduces code similar to the bits that were removed in bug 1620185, and there's decent unit-test coverage to ensure that this is still functionally correct.
- String changes made/needed: none
Comment 29•5 years ago
Thank you for finding & fixing this Gijs. I'll follow up with Mathieu elsewhere about the network comments etc.
Comment 30•5 years ago
Comment on attachment 9142688 [details]
Bug 1624301 - optimize remote settings Database list calls without filters, r?leplatrem
I'm a bit hesitant about taking this so late in the cycle, but will approve for 76.0b8 since it's just restoring an erroneously-removed optimization we used to have and we have good test coverage for this feature.
Comment 31•5 years ago
bugherder uplift
Comment 32•5 years ago
== Change summary for alert #25730 (as of Mon, 27 Apr 2020 03:28:24 GMT) ==
Improvements:
4% startup_about_home_paint linux64-shippable opt e10s stylo 764.50 -> 730.25
3% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,036.67 -> 1,006.50
3% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,035.38 -> 1,005.33
3% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,035.38 -> 1,006.58
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25730