2.18 - 4.07% startup_about_home_paint / startup_about_home_paint_realworld_webextensions (linux64-shippable, linux64-shippable-qr) regression on push bb19bd67310fd0f0b309759d530ce6292b42d16e (Thu March 19 2020)
Categories
(Firefox :: Remote Settings Client, defect, P2)
Tracking
| Release | Tracking | Status |
|---|---|---|
| firefox-esr68 | --- | unaffected |
| firefox74 | --- | unaffected |
| firefox75 | --- | unaffected |
| firefox76 | --- | fixed |
| firefox77 | + | fixed |
People
(Reporter: Bebe, Assigned: Gijs)
References
(Blocks 1 open bug, Regression)
Details
(4 keywords, Whiteboard: [fxperf:p1])
Attachments
(1 file)
(deleted), text/x-phabricator-request | RyanVM: approval-mozilla-beta+ | Details
Talos has detected a Firefox performance regression from push bb19bd67310fd0f0b309759d530ce6292b42d16e.
Since you authored one of the patches included in that push, we need your help to address this regression.
Regressions:
4% startup_about_home_paint linux64-shippable opt e10s stylo 740.83 -> 771.00
4% startup_about_home_paint_realworld_webextensions linux64-shippable opt e10s stylo 785.46 -> 812.92
3% startup_about_home_paint linux64-shippable-qr opt e10s stylo 1,124.21 -> 1,154.50
2% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,137.12 -> 1,161.92
You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=25465
On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.
To learn more about the regressing test(s), please see: https://wiki.mozilla.org/TestEngineering/Performance/Talos
For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Running
*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***
Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling
Comment 1•5 years ago
The strange thing here is that the last couple of perf runs I did on this showed no regressions for these tests:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=87fd18a4071212bc00840c13378d82c816df974c&newProject=try&newRevision=4c88d4fe66f8cd019b7f114a092fd177596007d4&framework=1
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=49b9c64323badacb3e5aa4a12ff51c61f02321c3&newProject=try&newRevision=6a9167ae2788150c0f24b353301d0b629250720c&framework=1
I did change one thing about pre-emptive loading of the config (I removed it), so I've just pushed a comparison for that, though I'd be surprised if not having that caused the regression:
Comment 2•5 years ago
For 76 this code is only enabled on nightly, so marking as fix-optional. We are planning on shipping in 77 though.
Double checking the graphs, this appears to be a linux-only regression:
https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,2119227,1,1&series=mozilla-central,2119231,1,1&series=mozilla-central,2119233,1,1&series=mozilla-central,2119234,1,1&series=mozilla-central,2119246,1,1&series=mozilla-central,2119245,1,1&timerange=2592000&zoom=1584234887172,1585177614698,535.6043651064882,2295.0528651602904
https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,1925683,1,1&series=mozilla-central,1937834,1,1&series=mozilla-central,2054088,1,1&series=mozilla-central,1899640,1,1&series=mozilla-central,1934570,1,1&timerange=2592000&zoom=1584316807452,1585036687161,576.2999396829043,2156.830350225112
This makes me wonder if there's something platform-dependent affecting things here, maybe the SQLite database.
I'm just about to set off a couple more try pushes for various things, hopefully they might help identify it a bit more.
Comment 3•5 years ago
I've been experimenting with removing non-essential engines from the search engine configuration (ones that won't affect startup time); when I do so, I see a reduction in the times on Linux:
I tried getting profiles from before and after the config reduction, but I couldn't see any obvious point that was taking up the time:
The interesting thing is, we didn't see the regression reported in this bug in my last set of try runs:
Comment 4•5 years ago
Jan, Andrew, we're seeing a performance regression here on our modern search engine configuration. The patch that has caused the regression was one that moved the configuration file from being shipped within Firefox to being served from RemoteSettings (with a dump that's loaded on initial startup).
Since my try runs ahead of landing this, something seems to have caused a regression in startup_about_home_paint and startup_about_home_paint_realworld_webextensions on Linux.
As the search service and remote settings are JS-only code, I'm wondering if this could have been caused by something in IndexedDB, since RemoteSettings is based on that.
I'll see if I can narrow down a regression range, but it is likely to take a while since it'll involve a lot of try pushes.
Comment 5•5 years ago
There's been a ton of stuff happening in RemoteSettings lately, plus the revelation that instances are getting spun up N times (now addressed?), so I'm redirecting the needinfos to people more directly involved in the changes. IndexedDB-wise, there are very verbose log messages available with MOZ_LOG=IndexedDB:5 if you want to see the relative timestamps of when requests are issued and serviced, etc. If manual checking of the logs ends up being a lot of work, I can try and see if I can apply some previous log-processing efforts to aid in doing so in a more automated fashion[1].
1: There's also https://github.com/mayhemer/logan in general, but it doesn't understand IndexedDB logs right now, I think.
Assignee
Comment 6•5 years ago
(In reply to Mark Banner (:standard8) from comment #2)
> For 76 this code is only enabled on nightly, so marking as fix-optional. We are planning on shipping in 77 though.
> Double checking the graphs, this appears to be a linux-only regression:
I wouldn't necessarily assume that this is Linux-only; I suspect the resolution and stability of the numbers on other platforms are such that it's harder to see issues on those.
I have a very basic question: does search service initialization block about:home rendering, because of the search box?
Comment 7•5 years ago
(In reply to :Gijs (he/him) from comment #6)
> I have a very basic question: does search service initialization block about:home rendering, because of the search box?
The initial rendering of the outline and placeholder happens, but the bits like icons & topsites are delayed.
Assignee
Comment 8•5 years ago
(In reply to Mark Banner (:standard8) from comment #7)
> (In reply to :Gijs (he/him) from comment #6)
> > I have a very basic question: does search service initialization block about:home rendering, because of the search box?
> The initial rendering of the outline and placeholder happens, but the bits like icons & topsites are delayed.
So does delaying the initialization completion by 5 seconds delay these talos measures by five seconds, or are they not blocked on those things rendering? In other words, is simply a slower search service initialization a possible explanation here, or is it work that happens to happen before we get to the paint, and slows it down because the CPU is, well, perhaps not only human, but still only a CPU, and can only do a limited number of things at once?
Comment 9•5 years ago
I've done some more investigations and some try pushes.
As mentioned previously, these perf regressions didn't show up on my original pushes for the modern configuration, only when I actually pushed it. I've gone back through the few weeks between those, and with a bit of added guesswork, I've tracked it down to bug 1620185.
Here's the comparisons of try runs with my patch from bug 1542269 immediately before and then after bug 1620185 landed.
That's a clear difference between those two runs. So I think the actual performance issue should be considered as both of those bugs combined.
(In reply to :Gijs (he/him) from comment #8)
> So does delaying the initialization completion by 5 seconds delay these talos measures by five seconds, or are they not blocked on those things rendering? In other words, is simply a slower search service initialization a possible explanation here, or is it work that happens to happen before we get to the paint, and slows it down because the CPU is, well, perhaps not only human, but still only a CPU, and can only do a limited number of things at once?
If I delay by five seconds, then that increases the talos measures by five seconds as well.
If I'm understanding Talos and the code correctly, Talos is waiting for the top sites to be displayed, and they have to wait on the search service to initialise as they have to get the default engine and other things to work out what to display.
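To make that dependency concrete, here's a minimal sketch (hypothetical and simplified; the actual about:home/top-sites code paths differ), assuming, as described above, that the top-sites code needs the default engine:

```js
// Hypothetical simplification of the dependency described above; the real
// about:home/top-sites code is more involved.
const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");

async function getDefaultEngineForTopSites() {
  // Blocked on loading the search configuration (now served via Remote Settings).
  await Services.search.init();
  // Top sites need the default engine (among other things) to decide what to
  // display, so anything that slows init() pushes out the paint Talos measures.
  return Services.search.getDefault();
}
```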
Comment 10•5 years ago
I'm clearing my NI here since I feel that I cannot help much here after reading all your comments :(
Comment 11•5 years ago
I've just got round to doing some more try pushes across the patches from bug 1620185.
- This comparison shows no regression and is based on "Move DB to separate file"
- This comparison shows the regression and is based on "Import IndexedDB code from kinto.js"
Mathieu, it looks like importing the IndexedDB code from kinto.js has caused the regression here. It is a 2-6% regression in our startup times for getting to the tab paint.
The search service is opening and getting the search-config collection on startup, so I suspect it is something to do with that code path.
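For concreteness, here is a rough sketch of the suspected code path (the collection name is the one mentioned above; this is not the actual SearchService implementation):

```js
// Rough sketch only; the real SearchService code differs in detail.
const { RemoteSettings } = ChromeUtils.import(
  "resource://services-settings/remote-settings.js"
);

async function getSearchConfig() {
  // On a fresh profile the collection is seeded from the packaged dump, then
  // read back out of the Remote Settings IndexedDB store; that read is the
  // part that appears to have become slower.
  return RemoteSettings("search-config").get();
}
```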
This wasn't picked up when your patch landed, as we didn't have the Search Service patch to load from remote settings at that time.
Is there something we can do to improve the performance of remote settings code here?
Comment 12•5 years ago
Having thought about this overnight, I think this really needs input/fixing from the remote settings client side.
Fundamentally the database startup, loading, or something around that area is slower than it was before and that should be investigated - it obviously affects startup here, but will probably have an adverse effect on other remote settings items.
Hence I'm moving this across to the remote settings client, and un-assigning myself. I am still willing to help, but I have no background in the IndexedDB code, so I think I'm not the best person to primarily investigate that.
Comment 13•5 years ago
If anyone generates profiler traces (using MOZ_PROFILER_STARTUP as documented at https://profiler.firefox.com/docs/#/./guide-startup-shutdown I think), the relevant threads to filter on for IndexedDB purposes would be:
- "IndexedDB" (gets all IndexedDB threads)
- "QuotaManager" (which covers "QuotaManager IO", just the 1)
- "IPDL Background" (PBackground, which is where the requests go)
Probably also nice to include "DOM Worker", possibly via "Worker", since remote settings also spins up a ChromeWorker, and workers in general require the main thread to help out with script loading, which can both delay the worker startup and generate main-thread contention (planned to be fixed after bug 1528285 is fixed).
Comment 14•5 years ago
Mark, I agree with your conclusion, this is something that should be addressed at the Remote Settings level.
Thanks for the pointers and perf graphs!
One question, did you try with the latest trunk? Gijs has changed some parts of the implementation in Bug 1624983, most notably the way we handle opening the DB connection.
Comment 15•5 years ago
(In reply to Mathieu Leplatre [:leplatrem] from comment #14)
> One question, did you try with the latest trunk? Gijs has changed some parts of the implementation in Bug 1624983, most notably the way we handle opening the DB connection.
I believe I looked at that when it landed, and I didn't see anything noticeable in the perf graphs.
Though I'm currently wondering if bug 1631484 might help here.
Assignee
Comment 16•5 years ago
[Tracking Requested - why for this release]:
We're only shipping this in 77 so that's where it should be tracked. I don't think we're going to be wanting to regress startup times for about:home quite this much for 77...
I'll try to look at this this week.
Assignee
Comment 18•5 years ago
I'm still investigating the remotesettings side, but meanwhile, can you elaborate on this comment?
> * Note that this may cause a network check of the certificate, but that
> * should generally be quick.
The slowdown here seems exclusively due to the time (now) taken in _getConfiguration, and if that can involve network requests and we need them to display about:home, that's... really bad? Why would we need a network request?
Assignee
Comment 19•5 years ago
OK, I think I know what's wrong. There was a fast path at https://searchfox.org/mozilla-central/rev/41c3ea3ee8eab9ce7b82932257cb80b703cbba67/services/common/kinto-offline-client.js#452-457 to make a list call that passed no filtering options use the index's getAll method to get data. This involves one roundtrip to IndexedDB's background thread, instead of N (for N items) for the cursor approach that the Database.jsm code currently uses.
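For context, a minimal sketch of the two IndexedDB access patterns being contrasted here (illustrative only; not the actual kinto-offline-client.js or Database.jsm code):

```js
// Illustrative only; not the actual kinto-offline-client.js or Database.jsm code.

// Fast path: a single getAll() request, i.e. one round trip to the IndexedDB
// background thread for the whole result set.
function listAllFast(store) {
  return new Promise((resolve, reject) => {
    const request = store.getAll();
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Cursor path: one round trip per record, so N records cost N hops to the
// background thread even when no filtering is needed.
function listAllWithCursor(store) {
  return new Promise((resolve, reject) => {
    const results = [];
    const request = store.openCursor();
    request.onsuccess = () => {
      const cursor = request.result;
      if (cursor) {
        results.push(cursor.value);
        cursor.continue();
      } else {
        resolve(results);
      }
    };
    request.onerror = () => reject(request.error);
  });
}
```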
This is pretty bad; it also explains why some of the indexeddb interactions I saw in profiles for the blocklist recently were so much worse than I recalled seeing previously.
Mathieu, was there a specific reason this code was omitted when porting into Database.jsm?
Assignee
Comment 20•5 years ago
I need to sleep, but I pushed to try to see if I'm right - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46cf1883ab673930e84fece812e0edd7f70a98bc&newProject=try&newRevision=4efdf594fafdfc4325e77448e014f4fbbb7441dd&framework=1 should have results in some hours.
I can see a significant difference in the runtime of _getConfiguration that roughly corresponds to the magnitude of the regression here, on my local Windows machine. However, I do not see a significant effect in the resulting about_home time, so I guess on my local machine, other things are the limiting factor in how soon we consider about:home fully painted...
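For anyone wanting to reproduce the local measurement, instrumentation along these lines is enough (hypothetical snippet, not part of any patch; it assumes access to the SearchService instance and its private _getConfiguration method):

```js
// Hypothetical timing instrumentation, not part of any patch. Cu is
// Components.utils (available in chrome JS); _getConfiguration is the private
// SearchService method named above.
async function timeGetConfiguration(searchService) {
  const start = Cu.now();
  const config = await searchService._getConfiguration();
  console.log(`_getConfiguration took ${(Cu.now() - start).toFixed(1)} ms`);
  return config;
}
```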
Comment 21•5 years ago
(In reply to :Gijs (he/him) from comment #18)
> The slowdown here seems exclusively due to the time (now) taken in _getConfiguration, and if that can involve network requests and we need them to display about:home, that's... really bad? Why would we need a network request?
I believe this is a certificate check that Remote Settings needs to do, I can't quite remember the full details though. Mathieu, can you explain please?
Comment 22•5 years ago
First of all, I apologize for the lack of availability and delay of replies...
> Mathieu, was there a specific reason this code was omitted when porting into Database.jsm?
Not really, I wanted to simplify the implementation by removing this createListRequest() helper, and this fast path got dropped :(
> Though I'm currently wondering if bug 1631484 might help here.
It could help, especially on collections with lots of records, but it's not the root cause of the regression, since the put operations chaining was already here before:
https://github.com/Kinto/kinto.js/blob/5e09b2f3e06a731ee2bbd445f20829e63702da5e/src/adapters/IDB.js#L607-L636
> I believe this is a certificate check that Remote Settings needs to do, I can't quite remember the full details though. Mathieu, can you explain please?
I was going to say that it was due to signature verification of local data when calling .get(), but I see that you don't pass the {verifySignature: true} option (BTW without this flag, .get() will never throw).
If the collection is empty, then a sync is initiated (unless {syncIfEmpty: false} is passed), and that introduces network requests.
Apart from those, I don't really see why .get() would involve network requests :/
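As a small illustration of how those options look at a call site (option names as described above, collection name taken from this bug; this is not the actual search service call):

```js
// Illustration of the get() options discussed above; not the real search service call.
const { RemoteSettings } = ChromeUtils.import(
  "resource://services-settings/remote-settings.js"
);

async function readSearchConfig() {
  return RemoteSettings("search-config").get({
    // Verify the signature of the local data; without this flag, get() never throws.
    verifySignature: true,
    // By default an empty local collection triggers a sync (network requests);
    // passing false skips that.
    syncIfEmpty: false,
  });
}
```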
Assignee
Comment 23•5 years ago
Assignee
Comment 24•5 years ago
OK, so the trypush was successful and addresses the regression, so I will put up the patch to get that fix landed and perhaps even uplifted if we can (though this regression doesn't affect 76, since the search config stuff got held to Nightly, the fix will still improve perf elsewhere, as bug 1620185 made it to 76).
(In reply to Mathieu Leplatre [:leplatrem] from comment #22)
> > I believe this is a certificate check that Remote Settings needs to do, I can't quite remember the full details though. Mathieu, can you explain please?
> I was going to say that it was due to signature verification of local data when calling .get(), but I see that you don't pass the {verifySignature: true} option (BTW without this flag, .get() will never throw).
OK, so we don't hit this case here. That's good. I'll follow up separately.
> If the collection is empty, then a sync is initiated (unless {syncIfEmpty: false} is passed), and that introduces network requests.
In this case, we have dumps and those should be loaded instead, right? And AFAICT there is no way even failing to import the dump would cause us to request anything from the network.
Comment 25•5 years ago
Comment 26•5 years ago
bugherder
Assignee
Comment 27•5 years ago
Although the regression doesn't actually affect 76, I'm going to mark this as affected so it shows up on uplift queries.
Assignee
Comment 28•5 years ago
Comment on attachment 9142688 [details]
Bug 1624301 - optimize remote settings Database list calls without filters, r?leplatrem
Beta/Release Uplift Approval Request
- User impact if declined: Slow performance for remote settings, meaning sluggishness at (apparently) random intervals when the DB queries happen
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: n/a
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This is a small JS-only patch that reintroduces code similar to the bits that were removed in bug 1620185, and there's decent unit-test coverage to ensure that this is still functionally correct.
- String changes made/needed: none
Comment 29•5 years ago
Thank you for finding & fixing this Gijs. I'll follow up with Mathieu elsewhere about the network comments etc.
Comment 30•5 years ago
Comment on attachment 9142688 [details]
Bug 1624301 - optimize remote settings Database list calls without filters, r?leplatrem
I'm a bit hesitant about taking this so late in the cycle, but will approve for 76.0b8 since it's just restoring an erroneously-removed optimization we used to have and we have good test coverage for this feature.
Comment 31•5 years ago
bugherder uplift
Comment 32•5 years ago
== Change summary for alert #25730 (as of Mon, 27 Apr 2020 03:28:24 GMT) ==
Improvements:
4% startup_about_home_paint linux64-shippable opt e10s stylo 764.50 -> 730.25
3% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,036.67 -> 1,006.50
3% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,035.38 -> 1,005.33
3% startup_about_home_paint_realworld_webextensions linux64-shippable-qr opt e10s stylo 1,035.38 -> 1,006.58
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25730