Closed Bug 1386045 Opened 7 years ago Closed 7 years ago

stylo: Lots of time spent refreshing the stylist on facebook.com

Categories

(Core :: CSS Parsing and Computation, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox57 --- fixed

People

(Reporter: bholley, Assigned: emilio)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

STR: Log out of facebook, load facebook.com. Gecko profile: http://bit.ly/2uOQab4 Stylo profile: http://bit.ly/2uPa4CH The stylo profile spends 229ms in Servo_*, 209ms of which are in Servo_StyleSet_*. By comparison, the Gecko profile only spends 61.5ms in nsStyleSet*. So we're clearly doing way more work than we should. Facebook is p0 as far as 57/quantum are concerned, so we should fix this.
So talking today with Manish reminded me of this line[1], that we really don't need 99% of the time. Gecko needs it to clear stuff in the rule tree, but we don't. I had a comment there[2], but it seems it got removed recently. I'd expect fixing that to help a fair bit with this. [1]: http://searchfox.org/mozilla-central/rev/bbc1c59e460a27b20929b56489e2e55438de81fa/layout/base/nsPresContext.cpp#2122 [2]: http://searchfox.org/mozilla-central/rev/c7741ea7eaecb2511dd4dcc0ed29b18b50d59775/layout/base/nsPresContext.cpp#2132
So the main questions are: 1) Why are we doing the stylist updates at all? i.e. why is SetStylistStyleSheetsDirty() being called? Presumably sheets are being added to the DOM or something? Or are there silly things we're doing somewhere here? 2) Why does a stylist update take longer than the equivalent Gecko thing? We have several bugs on stylist update performance: bug 1362538, bug 1363789, bug 1382925. Bug 1382927 tracks all of that stuff, afaict.
Depends on: 1382927
So I just tried a simple test. Load the page in question (<https://www.facebook.com/>), run this in the console: var s = document.createElement("style"); s.textContent = "#nosuchidhere { color: red; }"; var c = getComputedStyle(document.documentElement); var start = performance.now(); document.head.appendChild(s); var mid = performance.now(); c.color; var stop = performance.now(); alert((mid - start) + " " + (stop - mid)); With Gecko, that shows me things like "0.2 2.5", so 0.2ms to handle the sheet insertion and 2.5ms to handle the rule cascade update plus style flush. With stylo, I get things like "1.4 1.7". So the update plus flush is faster, but the initial thing is much slower. On this same hardware, a stylist rebuild on a page with no stylesheets (i.e. just the UA bits; see bug 1382925) is about 0.3ms. So that's a noticeable fraction of the above times. Looking at the above-linked profiles, some more numbers: 1) Gecko shows 33.5s under RefreshRuleCascade (this is the equivalent of rebuilding the stylist). 2) Gecko shows no appreciable time destroying rule cascade bits. 3) Stylo shows 111ms just under stylist::{{impl}}::clear (!). This is presumably part of the difference between the "0.2" and "1.4" numbers above, but it sounds like it happens ~100 times or something to add up to 111ms??? 4) Stylo shows 97ms under stylist::{impl}}::rebuild. If we assume this is happening ~100 times during pageload, then 30ms will be the UA bits and the rest "something else". But I think this assumption is bogus; see below. I'm not quite sure how to reconcile the 1.7 vs 2.5 numbers above with the 97ms vs 33.5ms, unless of course stylo is much faster than Gecko on the rest of the style recomputation here. I did try instrumenting stylist::clear() (in the !cleared case) and stylist::rebuild() (in the case when it actually does work) to see how many times they're called during this pageload. Looks like 9 clears and 13 rebuilds. In which case I really have no idea how to interpret the numbers above...
So I tried profiling locally. I see nothing like the times in the profiles above. Checked with bholley, and those times/profiles are with abp installed. Which means some additional Very Large stylesheets, I bet at the UA or user level, which Gecko is not refreshing rule cascades for and stylo is. So ok. We should do the thing from comment 1 about getting rid of unnecessary rebuilds. But we're also just doing strictly more work than Gecko....
Depends on: 1382925
Emilio, is there an existing bug tracking commment 1?
Flags: needinfo?(emilio+bugs)
Flags: needinfo?(emilio+bugs)
Err, that's not comment 1, sorry... Not that I know of. Will open one, and presumably attach a patch tomorrow.
Flags: needinfo?(emilio+bugs)
Flags: needinfo?(emilio+bugs)
Priority: -- → P1
Here are profiles exhibiting similarly bad behavior on amazon (search results for 'laptop'): Stylo: https://perfht.ml/2wbH1rA Gecko: https://perfht.ml/2v1cbU0
Assignee: nobody → emilio+bugs
Depends on: 1363789
Depends on: 1387992
So, I landed a few patches today (others are on the queue) that should improve things around here a bit: * https://github.com/servo/servo/pull/17991 * https://github.com/servo/servo/pull/17992 * https://github.com/servo/servo/pull/17993 * https://github.com/servo/servo/pull/17995 * https://github.com/servo/servo/pull/17999 I suspect most of the remaining bits we can trivially improve here, apart from avoiding other dumb hashing on counter-style / keyframes, and perhaps micro-optimizing the rules iterator or media query evaluation (which don't show nearly as much in the profiles), is either bug 1382925 (which Cam is working on, though happy to help), or bug 1362338. Should I try to pursue the second one, Bobby?
Flags: needinfo?(bobbyholley)
(In reply to Emilio Cobos Álvarez [:emilio] from comment #10) > So, I landed a few patches today (others are on the queue) that should > improve things around here a bit: > > * https://github.com/servo/servo/pull/17991 > * https://github.com/servo/servo/pull/17992 > * https://github.com/servo/servo/pull/17993 > * https://github.com/servo/servo/pull/17995 > * https://github.com/servo/servo/pull/17999 This is great, thank you! I will remeasure once all this lands. > > I suspect most of the remaining bits we can trivially improve here, apart > from avoiding other dumb hashing on counter-style / keyframes, and perhaps > micro-optimizing the rules iterator or media query evaluation (which don't > show nearly as much in the profiles), is either bug 1382925 (which Cam is > working on, though happy to help), or bug 1362338. > > Should I try to pursue the second one, Bobby? If it's showing up in the profiles, then yes please. :-)
Flags: needinfo?(bobbyholley)
Blocks: 1386580
Attached patch Measurement patch (deleted) — Splinter Review
So, I took some numbers because Bobby told me that it seemed we were rebuilding unnecessarily. This is _not_ the case. We're rebuilding three times each. The difference between Gecko and stylo is that Gecko manages to save the data across cascade levels pretty much all the time, so Gecko only rebuilds the one level of the stylist each time, while we rebuild the whole thing three times. Output with Gecko: RefreshRuleCascade(0x7f9893c30680, 0, 1, 1, https://www.facebook.com/) RefreshRuleCascadeCache(0x7f9894898b00, 1, 1, 1, 1, https://www.facebook.com/) RefreshRuleCascade(0x7f989313fb00, 3, 1, 0, https://www.facebook.com/) RefreshRuleCascade(0x7f9892978880, 3, 1, 0, https://www.facebook.com/) RefreshRuleCascade(0x7f98907df880, 3, 1, 0, https://www.facebook.com/) Note that the equivalent of: RefreshRuleCascadeCache(0x7f9894898b00, 1, 1, 1, 1, https://www.facebook.com/) Doesn't appear on the stylo log, but stylo does equally well in that regard thanks to bug 1386602... We just don't enter into UpdateStylist at all in that case (it's a MediaFeaturesChanged call with only UA rules). Stylo output: UpdateStylist(1, https://www.facebook.com/) UpdateStylist(2, https://www.facebook.com/) UpdateStylist(3, https://www.facebook.com/) So apart of optimizing stylist rebuild times in general (I'll submit a patch for the Atom thing) the only thing left to do here is bug 1382925.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #12) > Created attachment 8894947 [details] [diff] [review] > Measurement patch > > So, I took some numbers because Bobby told me that it seemed we were > rebuilding unnecessarily. > > This is _not_ the case. We're rebuilding three times each. So I think one of the things I've been seeing is time spent clearing the stylist in stylesheet insertion: https://perfht.ml/2wzMjgK This shows up as a block on the profiler when I filter for Servo_StyleSet_ (because it's in Servo_StyleSet_InsertStyleSheetBefore), even though it isn't a rebuild per se. Gecko spends zero time in AddStyleSheetToStyleSets. Perhaps that's just because it defers that work until the next refresh of the rule cascade? But even there, I don't see any samples spent destroying the old hashtables. Any idea what's going on here? Is it just all the arc traffic dropping the Selectors that's killing us? If so, we could hypothetically cheat and drop the refcounts non-atomically given that we're on the main thread. But we'd want to be sure that's the issue before doing something gross like that.
Flags: needinfo?(emilio+bugs)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13) > (In reply to Emilio Cobos Álvarez [:emilio] from comment #12) > > Created attachment 8894947 [details] [diff] [review] > > Measurement patch > > > > So, I took some numbers because Bobby told me that it seemed we were > > rebuilding unnecessarily. > > > > This is _not_ the case. We're rebuilding three times each. > > So I think one of the things I've been seeing is time spent clearing the > stylist in stylesheet insertion: https://perfht.ml/2wzMjgK > > This shows up as a block on the profiler when I filter for Servo_StyleSet_ > (because it's in Servo_StyleSet_InsertStyleSheetBefore), even though it > isn't a rebuild per se. > > Gecko spends zero time in AddStyleSheetToStyleSets. Perhaps that's just > because it defers that work until the next refresh of the rule cascade? But > even there, I don't see any samples spent destroying the old hashtables. > > Any idea what's going on here? Is it just all the arc traffic dropping the > Selectors that's killing us? If so, we could hypothetically cheat and drop > the refcounts non-atomically given that we're on the main thread. But we'd > want to be sure that's the issue before doing something gross like that. I think it's just slow drops/free(), yeah... It doesn't appear in Gecko because Gecko doesn't clear the UA parts of the stylesheet, so it's basically the same problem, and the patches by Cameron should also help here.
Flags: needinfo?(emilio+bugs)
So I took a profile with the patches from bug 1382925, and I barely see stylist rebuild now (like 3.5 / 4 ms). Should we dupe them Bobby? We're still slightly slower in the testcase in bug 1382927, but I'll comment there with a few things that can be done there. https://perfht.ml/2uxQZXg
Flags: needinfo?(bobbyholley)
So, unfortunately stylo is still significantly slower here. To be clear, I'm measuring the interval up to timeToFirstNonBlankPaint. The primary testcase I'm using is amazon-laptop.html in https://www.dropbox.com/s/a8mtk66jplcs1rj/stylo-sitetests.zip?dl=0 . On trunk, I see Stylo spending about 30ms in Servo_StyleSet_*, and Gecko spending ~10ms in RefreshRuleCascade. I dug into it a bit and tried a few things. Making the refcounts non-atomic didn't seem to help all that much. Making all the selectors weak references helped a bit more (but not as much as I'd expect). But the thing that helped the most was commenting out the invalidation_map.note_selector() call, which dropped the overhead down to near-gecko levels. So this kind of speaks to the fact that we're building some very expensive data structures that gecko doesn't build (the invalidation map). Even if we got the performance of building the invalidation map down to that of the regular rulehash (it's currently slower), we'd still be significantly slower than Gecko. So here's my proposed set of next steps: (1) Make the clear() in AppendStyleSheets lazy. I think this is as simple as moving the call to clear() from the FFI functions to flush_stylesheets. There are probably some cache advantages of doing this. And this will allow us to parallelize this work (see (3)). (2) Implement incremental updates to the invalidation maps. AFAICT, the map is not ordering-dependent, so we shouldn't need to clear the entire map when a stylesheet is added, we can just incrementally add to it, which saves us work on both ends. (3) Build the invalidation set in parallel with the rule hash (i.e. bug 1362538). I still think this should be pretty straightforward - we just clone the read-only iterator to the stylesheet rules, and pass them to two closures, one of which has a mutable borrow of the invalidation map, and the other of which has a mutable borrow of the rulehash. We then pass the closures to rayon::join. Since it's only two jobs, it will be effective on any machine with at least two cores, which is 90% of our users. My hope is that (2) will get the invalidation set stuff fast enough that it takes about as much time as the rulehash stuff, and then the parallelization can mostly eliminate the cost of doing both. In other words, we use parallelism the counteract the fact that we build a large data structure that gecko doesn't. Once we do that, we can take stock of things and see how close to parity we are with gecko, and whether we want to pull the trigger on weak selector references (which, conceptually, is the primary thing making our rulehash stuff do more work than gecko's). I really hope we don't have to do though, because it wrecks a lot of safety. Emilio, wdyt? Can you tackle this?
Flags: needinfo?(bobbyholley) → needinfo?(emilio+bugs)
I can. I have another few ideas about how to make it faster without falling back to parallelizing it, so I may try doing that first. Btw, instead of weak selector references, I'd rather do weak atom references, tbh, since they're held alive by selectors anyway.
Flags: needinfo?(emilio+bugs)
So, doing (2) is not as trivial as it seems... I mean, sure, we can special-case the only stylesheets were added to the document, and do something there, but the complexity we need for that is rather big and doesn't pay off IMO. The right solution to do it cleanly is moving the StyleSheetSet to the stylist, but that's not easy at all on Servo, last time I checked... I could try though. I have a patch for (1). I think another thing we could to to make the invalidation pass cheaper would be, for example, stop storing in the hash map ids/classes without combinators to the right, and use some kind of bloom filter or what not, and assume that if the element matches the filter we need to restyle ourselves. This will make the optimization less precise, but maybe good enough.
I think I'll work on getting the correct fix for (2) landed. I need to run now, but I'll submit the patches for (1) and some other cleanup tomorrow.
Depends on: 1390255
My refactoring on this stuff went kinda awry, and I decided to stop trying to put the stylist in the DOM and instead put the stylesheet list on layout: https://github.com/servo/servo/pull/18124 With that, I can next move the stylesheet list inside the stylist, so it can be aware of which stylists have changed, and with that rebuild the invalidation map incrementally.
Blocks: 1290285
Today I landed https://github.com/servo/servo/pull/18143 (this got too long due a Servo transitions bug), and I plan to land https://github.com/servo/servo/pull/18170 next, which should make easy to implement the optimizations that will help here.
The above is about to land, so I went ahead and opened https://github.com/servo/servo/pull/18191 is the actual optimization.
I see a perf improvement: == Change summary for alert #8956 (as of August 23 2017 15:03 UTC) == Improvements: 14% tp6_amazon summary windows7-32 pgo 1_thread e10s stylo 624.40 -> 534.25 12% tp6_amazon summary windows7-32 pgo e10s stylo 608.27 -> 536.04 7% tp6_facebook summary windows10-64 pgo 1_thread e10s stylo 307.35 -> 284.58 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=8956
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #23) > I see a perf improvement: > == Change summary for alert #8956 (as of August 23 2017 15:03 UTC) == > > Improvements: > > 14% tp6_amazon summary windows7-32 pgo 1_thread e10s stylo 624.40 -> > 534.25 > 12% tp6_amazon summary windows7-32 pgo e10s stylo 608.27 -> > 536.04 > 7% tp6_facebook summary windows10-64 pgo 1_thread e10s stylo 307.35 -> > 284.58 > > For up to date results, see: > https://treeherder.mozilla.org/perf.html#/alerts?id=8956 Those are pretty nice, great to hear!
Alright, there are potentially other improvements we can make here, but I think we can resolve this for now. Hopefully Simon's parser optimizations will bring us the last bit of the way to parity.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: