Open Bug 1384192 Opened 7 years ago Updated 2 years ago

stylo: ~25% slower CSS parsing in twitter testcase

Categories

(Core :: CSS Parsing and Computation, defect, P4)


Tracking


Tracking Status
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- ?

People

(Reporter: bholley, Unassigned)

References

(Blocks 1 open bug)

Details

http://tests.themasta.com/twitter-compact/twitter-main.html

I'm just measuring the first large stylesheet in the timeline. The numbers fluctuate a bit, but with stylo I get numbers around 40ms, and without stylo I get numbers closer to 30ms. I picked a few representative profiles in the midpoint:

With stylo: https://perfht.ml/2vFP7Zr
Without stylo: https://perfht.ml/2vFFf1D
Flags: needinfo?(simon.sapin)
It looks like these have a 0.5ms sample rate, which for 40ms gives a relatively small total number of samples to look at. It’s hard to draw meaningful conclusions when the time spent in a particular function is a single-digit number of samples. I think the sample rate can be increased a bit, but maybe things break down if we try to increase it by a couple of orders of magnitude? Is it doable instead to concatenate this stylesheet a hundred times, and profile parsing that?
(In reply to Simon Sapin (:SimonSapin) from comment #1) > It looks like these have a 0.5ms sample rate, which for 40ms give a > relatively small total number of samples to look at. It’s hard to draw > meaningful conclusions when the time spent in a particular function is a > single-digit number of samples. I think the sample rate can be increased a > bit, but maybe things break down if we try to do so by a couple orders of > magnitude? Is it doable instead to concatenate this stylesheet a hundred > times, and profile parsing that? It's easy to bump the sampling rate - I just sample at 0.5ms when trying to get an accurate timing for long-running functions (since increasing the sample rate can skew things, and the total time of a long-running function will be accurate to within the sample rate). Probably worth just taking new profiles to delve into the details - just takes a few seconds on a nightly build.
So I tried to duplicate the sheet 10x and profiled at 0.2ms.

Stylo: https://perfht.ml/2vXHpJa
Gecko: https://perfht.ml/2vXLtsT

Some interesting head-to-head numbers:

1) The total time under _Servo_StyleSheet_FromUTF8Bytes is 411ms. The total time under nsCSSParser::ParseSheet is 301ms.

2) Gecko takes 111ms to ParseSelectorList, while Stylo takes 154ms for parse_prelude. Of that 154ms, 135ms is parse_selector and 11.2ms is Tokenizer::source_location. The rest is self time, memmove, vec doubling, etc.; it could be leakage from inlined code.

3) The times taken under ErrorReport are about the same for both; slightly worse for Gecko (because it actually reports more errors than stylo here; see bug 1384225 and bug 1381143).

4) Gecko spends 12ms on ParseAtRule; Stylo spends 14ms on parse_at_rule. Pretty comparable.

5) Gecko spends 158ms doing ParseDeclarationBlock. Stylo spends 223ms under parse_block. For Stylo this includes 11.4ms of malloc, presumably for the declaration block itself, plus 50ms in PropertyDeclarationBlock::extend_common (reallocating the vec). The rest is 91ms in PropertyDeclaration::parse_into, 21ms getting the next token, 14ms reporting errors, 9ms in PropertyId::parse, and some long-tail bits. By way of comparison, Gecko has 122ms under ParseDeclaration (including some error reporting, some tokenizer next bits, parsing the actual property value, etc.), 19ms compressing into the nsCSSCompressedDataBlock, 10ms of malloc for the declaration block, plus some long-tail bits.

So the reallocation under block parsing is definitely hurting us some, afaict. That said, if I just filter for "alloc", the Gecko profile shows 175ms while the Stylo one shows 208ms, so that can't account for the entire difference by any means.

One other thing worth noting: time under nsCSSScanner::Next is 71ms in the Gecko profile, while time under Parser::next_including_whitespace_and_comments is 82ms in the Stylo profile. So there's a difference here, but not a gigantic one.
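On the extend_common point: the ~50ms there is the cost of growing the declaration vec element by element. Here's a toy model (Python, not Servo code; in Rust the fix would be reserving capacity up front) of how much copying amortized doubling does versus a single up-front reservation:

```python
# Toy model (not Servo code): count how many elements get memmoved when a
# growable buffer doubles its capacity on overflow, versus reserving the
# final capacity up front.

def copies_with_doubling(n):
    capacity, copied = 1, 0
    for size in range(1, n + 1):
        if size > capacity:
            copied += size - 1  # existing elements move to the new allocation
            capacity *= 2
    return copied

def copies_with_reserve(n):
    return 0  # one allocation, nothing ever moves

print(copies_with_doubling(32), copies_with_reserve(32))  # 31 0
```

Doubling keeps the total copies O(n), but over thousands of declaration blocks that constant shows up in profiles as realloc + memmove, which matches what the Stylo profile attributes to extend_common.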
I poked a bit at the Tokenizer::source_location part. It looks like at the beginning of every rule we rescan the text from the beginning of the previous rule, counting newlines. I don't know whether we can do this more efficiently somehow, do the newline-counting during tokenization, or whether any of this would even help things overall.
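To illustrate the alternative (a simplified sketch, not the cssparser code): if the tokenizer keeps a running newline count as it consumes input, each source_location query becomes O(1) instead of a rescan from the last rule boundary.

```python
# Sketch (simplified model, not the cssparser tokenizer): track newlines
# incrementally as the cursor advances, so source_location never rescans.

class LineCounter:
    def __init__(self, text):
        self.text = text
        self.pos = 0   # current offset into the input
        self.line = 0  # newlines seen so far

    def advance_to(self, new_pos):
        # Count newlines only over the newly consumed span.
        self.line += self.text.count("\n", self.pos, new_pos)
        self.pos = new_pos

    def source_location(self):
        return self.line  # O(1): no rescan needed

counter = LineCounter("a{}\nb{}\nc{}\n")
counter.advance_to(4)    # consumed "a{}\n"
print(counter.source_location())  # 1
counter.advance_to(12)   # consumed the rest
print(counter.source_location())  # 3
```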
I initially made source_location lazy on the premise that we would only use it when devtools are open, or something like that. It turns out we always call it, so laziness doesn’t buy us anything. I’ll try counting newlines during tokenization instead, and see if it affects the perf numbers. Another lead, discussed on IRC today: cssparser::Parser methods return a Result with a large-ish size_of, up to 56 bytes in some cases IIRC. This is largely because of the alignment of nested enums in the error types. I don’t know how doable that is, but if we can record error details out of band and switch back to zero-sized error types, we might save on memmove traffic.
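The out-of-band idea, as a pattern sketch (illustrative Python, not the cssparser API): keep the error value that travels through return paths payload-free so returns stay small, and stash the diagnostics in a side buffer for later reporting.

```python
# Sketch of "record error details out of band" (illustrative, not cssparser).
# The error that propagates carries no payload; full details go to a side list.

class ParseError(Exception):
    """Payload-free sentinel: keeps error returns as small as possible."""

class Parser:
    def __init__(self):
        self.error_details = []  # out-of-band diagnostics: (location, message)

    def fail(self, location, message):
        self.error_details.append((location, message))
        raise ParseError

    def parse_declaration(self, text, location):
        if ":" not in text:
            self.fail(location, "expected ':' in declaration")
        name, value = text.split(":", 1)
        return name.strip(), value.strip()

p = Parser()
print(p.parse_declaration("color: red", 1))  # ('color', 'red')
try:
    p.parse_declaration("oops", 2)
except ParseError:
    pass
print(p.error_details)  # [(2, "expected ':' in declaration")]
```

In Rust the analogue would be shrinking the error type carried in Result so that returns (and the memmove traffic they generate) get smaller, while the reporter reads details from the side channel.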
Priority: -- → P1
Priority: P1 → --
Marking this as P3 just because we're already tracking parser performance in bug 1331843, and there's a lot of overlap. We might be able to take a small regression in parse performance, but hopefully smaller than what we're seeing now.
Priority: -- → P3
I made a front-end for the gtest microbenchmarks harness. I have this in ~/.local/bin/gtest and use it like this:

  gtest Stylo*

Is it useful to add to the tree? If so, where should it go?

  #!/usr/bin/python3
  import sys, subprocess, json, statistics

  proc = subprocess.Popen(["./mach", "gtest", sys.argv[1]], stdout=subprocess.PIPE)
  for line in proc.stdout:
      if line.startswith(b"PERFHERDER_DATA:"):
          data = json.loads(line[len("PERFHERDER_DATA:"):].decode("utf8"))
          for suite in data["suites"]:
              for subtest in suite["subtests"]:
                  print("%4d.%03d ± %6s ms %s.%s" % (
                      subtest["value"] / 1000.,
                      subtest["value"] % 1000,
                      "%.3f" % (statistics.stdev(subtest["replicates"]) / 1000),
                      suite["name"],
                      subtest["name"],
                  ))
NI for comment 7.
Flags: needinfo?(nfroyd)
https://github.com/servo/servo/pull/18171 is blocked from landing right now, but it brings us from ~48% slower to ~3% slower than nsCSSParser in the gtest microbench. Results formatted with the code from comment 7:

Before:

  43.437 ± 0.391 ms Stylo.Servo_StyleSheet_FromUTF8Bytes_Bench
  29.244 ± 0.042 ms Stylo.Gecko_nsCSSParser_ParseSheet_Bench
 281.884 ± 0.028 ms Stylo.Servo_DeclarationBlock_SetPropertyById_Bench
 426.242 ± 0.008 ms Stylo.Servo_DeclarationBlock_SetPropertyById_WithInitialSpace_Bench

After:

  29.779 ± 0.254 ms Stylo.Servo_StyleSheet_FromUTF8Bytes_Bench
  28.841 ± 0.031 ms Stylo.Gecko_nsCSSParser_ParseSheet_Bench
 296.240 ± 4.744 ms Stylo.Servo_DeclarationBlock_SetPropertyById_Bench
 293.855 ± 4.304 ms Stylo.Servo_DeclarationBlock_SetPropertyById_WithInitialSpace_Bench
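For reference, the quoted slowdown percentages follow directly from the FromUTF8Bytes vs ParseSheet rows (assuming "X% slower" means (stylo − gecko) / gecko):

```python
# Sanity-check of the quoted slowdowns from the numbers above,
# assuming "X% slower" = (stylo_time - gecko_time) / gecko_time.
before = (43.437 - 29.244) / 29.244
after = (29.779 - 28.841) / 28.841
print("%.1f%% slower before, %.1f%% slower after" % (before * 100, after * 100))
# → 48.5% slower before, 3.3% slower after
```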
Flags: needinfo?(simon.sapin)
Applying https://joshmatthews.net/myspacebench to use the stylesheets from the myspace tp6 page instead, this patch takes us from ~42% slower to ~21% slower.

Before:

 334.556 ± 0.693 ms Stylo.Servo_StyleSheet_FromUTF8Bytes_Bench
 234.551 ± 0.144 ms Stylo.Gecko_nsCSSParser_ParseSheet_Bench

After:

 284.321 ± 0.670 ms Stylo.Servo_StyleSheet_FromUTF8Bytes_Bench
 234.393 ± 0.203 ms Stylo.Gecko_nsCSSParser_ParseSheet_Bench

At this point I’m out of ideas for what to try next :( We can try sprinkling skip_whitespace() in a few more places, but I can’t think of one that might have that much impact.
I think Simon's script is useful and should probably be integrated as a mach command (`mach gtest-bench`?). I guess we could have it as a standalone command for the time being?
Flags: needinfo?(nfroyd) → needinfo?(simon.sapin)
Submitted bug 1392329 with a patch.
Flags: needinfo?(simon.sapin)
https://github.com/servo/servo/pull/18171 has landed: https://hg.mozilla.org/integration/autoland/rev/75a20ebcd6e8. gtest numbers are better than before, but not as good as in comment 9:

  37.904 ±  0.397 ms Stylo.Servo_StyleSheet_FromUTF8Bytes_Bench
  29.507 ±  0.048 ms Stylo.Gecko_nsCSSParser_ParseSheet_Bench
 314.070 ±  0.180 ms Stylo.Servo_DeclarationBlock_SetPropertyById_Bench
 391.404 ± 43.133 ms Stylo.Servo_DeclarationBlock_SetPropertyById_WithInitialSpace_Bench

I’ll investigate to find out exactly what makes that difference and what perf we can get back. (The code measured in comment 9 was buggy, incorrectly considering some valid selectors as invalid.)
Sounds good. It was a bit disappointing to see the landing of #18171 not closing the gap on facebook tp6, so it's nice to know that there may be more wins to be had. It may be worth concatenating all the stylesheets in [1] and dropping that into the benchmark for comparison and profiling, since that's the highest-priority target at the moment. [1] https://github.com/rwood-moz/talos-pagesets/tree/master/tp6/tp6-facebook_files
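Concatenating those stylesheets for the benchmark could be done with a few lines (a sketch; the paths in the usage comment are hypothetical and assume the tp6 pageset has been saved locally):

```python
def concat_css(sheets):
    # Join the individual sheets with newlines so the last rule of one
    # sheet can't run into the first rule of the next.
    return "\n".join(sheets) + "\n"

# Hypothetical usage, assuming the tp6-facebook_files pageset is downloaded:
# import glob
# paths = sorted(glob.glob("tp6-facebook_files/*.css"))
# combined = concat_css(open(p).read() for p in paths)
```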
Also - how dependent are the optimizations here on the amount of whitespace in the stylesheet? The facebook stylesheets appear to be minified, so there's not much whitespace in there.
> it's nice to know that there may be more wins to be had

"May" is the key word. I have yet to reproduce and measure exactly, but it’s possible that the code measured in comment 9 was so fast because it was effectively doing less work by doing it incorrectly.

Regarding minification, that’s a good point. I expect #18171 would have no measurable effect if there’s no insignificant whitespace to skip.
Alright, I’ve isolated it. Reverting https://github.com/servo/rust-cssparser/commit/0acff3b2087a0518abe412a05f0db7c32f588f17 gives numbers like those in comment 9. Unfortunately that’s the bug-fixing commit, so comment 9 only tells us we could be fast at doing the wrong thing. I tried adding #[inline(never)] and #[cold] to consume_until_end_of_block but that didn’t make a difference.
Priority: P3 → P4
Severity: normal → S3