Closed Bug 329974 Opened 19 years ago Closed 19 years ago

Use static data for common whitespace values in textnodes

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sicking, Assigned: sicking)

References

Details

Attachments

(3 files)

There is a lot of makup out there that looks like <foo> <bar>...</bar> <baz> </foo> This produces textnodes that contains values like "\n " and "\n ". It should be possible to detect this and instead of allocating new string data for each of these nodes instead let them point into a single buffer that we allocate on startup. Since nsTextFragment doesn't use null terminated data we can use a single string like " \n " and then just set start and length as appropriate. I'm currently running some stats to determine what types of values are common enough that this is useful. Once I have that i'll attach a patch.
I went to just over 500 pages (spider starting with alexa top 500 until we crash). This resulted in 391031 SetTo calls in nsTextFragment. Out of these 227593 calls were for whitespace-only strings. Below are the resulting stats, n=newline, s=space, t=tab. 86872 - n 10601 - s 8313 - nt 7388 - nn 7357 - ntt 6773 - nss 6197 - nssssssss 5774 - nssss 5143 - nssssssssss 5128 - nttt 3929 - ntttt 3533 - nssssssssssss 3517 - nssssssssssssssss 3396 - nssssss 2835 - ns 2511 - nttttt 2216 - nssssssssssssss 1929 - nssssssssssssssssssssss 1815 - nssssssssssssssssss 1765 - ntttttt 1720 - nnn 1708 - nssssssssssssssssssss 1591 - sn 1514 - nssssssssssssssssssssssss 1219 - 1185 - snssssssssssss 1131 - nsss 1033 - nsssss 993 - ss 961 - nttttttt 800 - nssssssssssssssssssssssssssss 786 - snssssssssssssssssss 777 - snssssss 752 - nsssssssssssssss 743 - nsssssssssssssssssss 715 - nnt 688 - snssssssssssssssssssssssss 674 - ntttttttt 603 - snsssssssssssssss 603 - nssssssssssssssssssssssssss 507 - nnnn 475 - sss 466 - nntttttttttttt 457 - nttttttnttttttt 449 - snssssssssss 435 - nnttttttttttt 414 - nsssssssss 406 - nnss 373 - nssssssssssssssssssssssssssssssss 369 - snssss 365 - snssssssss 352 - nnssss 345 - nntt 339 - snssssssssssssssssssss 339 - nnssssssss 313 - ntttttttttt 313 - nttttttttt 277 - nssssssssssssssssssssssssssssss 276 - nsssssss 273 - snssssssssssssss 262 - snss 257 - nntttttttttt 251 - nssssssssssssssssssssssssssssssssss 250 - nnssssss 246 - ntnt 246 - nnssssssssssss 240 - ntss 237 - nntttttttttttttt 236 - snssssssssssssssssssssssssss 235 - ntssss 232 - nnttttttttttttttt 231 - snsssssssss 229 - snssssssssssssssss 228 - ssss
I tried some various strategies of which types of whitespace nodes to use static data for. My initial strategy (A) was to match patterns that started with 1-7 newlines followed by 0-50 spaces or tabs (not a mixture), or just 0-50 spaces. This matches 204664 out of 227593 nodes. If we allow a single space before the newlines it matches 214985 strings. Allowing 0-10 spaces before the newlines matches 215442 strings. If we match 0-50 tabs in addition to 0-50 spaces that adds an additional 500-something strings. I also tried to reduce the number of newlines to match. Just matching 1-3 looses some 1200 strings, 1-2 an additional 2200 and just a single newline 14000.
Sicking, I was wondering what the data looks like if you include the unicode char produced by &nbsp; in "whitespace"... Would that require another spider run, or can it be extracted from whatever data you logged?
I'd have to make another run. What pattern that includes &nbsp; do you suspect is common? I did note that out of the total 391031 values 12927 were a single non-whitespace char. If a significan portion of these chars have a low unicode value it would probably be worth having a static array for them too.
> What pattern that includes &nbsp; do you suspect is common? &nbsp;&nbsp;&nbsp;.... (repeated some number of times). Possibly with a leading \n.
Attached file single-char node values (deleted) —
I looked into which characters are common for single-char nodes. Seems like pretty much only characters 1-255 appear often enough on their own to warrent searching for them (well, 1-183 really, but it's cheap enough to do the rest). The only other charater that occurs pretty often on its own is 12288 which appears to be some sort of whitespace and 12539 which is a dot. However they appear as 0.4 resp. 0.2 per cent of the total number of textnodes so I don't really think it's worth it.
So I ran some tests that included &nbsp; characters. This gave the following result out of 1030005 nodes: 25189 /^\n{1,7}$/s 4093 /^ \n{1,7}$/s 119615 /^\n{1,7} {1,50}$/s 14155 /^ \n{1,7} {1,50}$/s 92678 /^\n{1,7}\t{1,50}$/s 3526 /^ \n{1,7}\t{1,50}$/s 285 /^\n{1,7}\x{a0}{1,50}$/s 7 /^ \n{1,7}\x{a0}{1,50}$/s 7353 /^ {2,50}$/s 1092 /^\t{2,50}$/s 2319 /^\x{a0}{2,50}$/s 664 /^\x{a0}\n{1,7}( {0,50}|\t{0,50}|\x{a0}{0,50})$/s 167 /^ \x{a0}{1,50}$/s 4 /^ {2,10}\x{a0}{1,50}$/s Note that all these patterns should be mutually exclusive. This also does not include any single-char nodes. While &nbsp; on its own is more common then tabs, I don't think searching for either is worth it. After newlines tabs are definitely worth it, however &nbsp; is not.
Attached file perl program to analyze log (deleted) —
This is the perl program i've used to gather the stats. I've put the compressed log on http://people.mozilla.com/~sicking/data/bug329974_log.bz2
Attached patch Patch to fix (deleted) — Splinter Review
This patch makes us share all 8-bit single-char node values, as well as nodevalues matching the expression: /^ ?\n{0,7}( {0,50}|\t{0,50})$/ Those two together matches just over 60% of all textnodes! I also got sucked into cleaning nsTextFragment up a bit. There were way too much dead code in there to just let it pass.
Attachment #215447 - Flags: superreview?(jst)
Attachment #215447 - Flags: review?(jst)
I just thought of a minor change that would be good to include. Rather then having an initial if (firstChar == ' ' || firstChar == '\n' || firstChar == '\t') I should make that if (firstChar <= ' ' && (firstChar == ' ' || firstChar == '\n' || firstChar == '\t')) That way the first test will most likly be the one that decides which way we go and we'd save a few cycles. And even better is if (firstChar <= ' ' && (firstChar == '\n' || firstChar == ' ' || firstChar == '\t')) since newline is the most common firstChar of the three
> + NS_ERROR("Could not initialize nsAttrValue"); You mean nsTextFragment, right?
Comment on attachment 215447 [details] [diff] [review] Patch to fix - In nsTextFragment::operator=(const nsTextFragment& aOther): + else if (aOther.mState.mIs2b) { + m2b = NS_STATIC_CAST(PRUnichar*, + nsMemory::Clone(aOther.m2b, aOther.mState.mLength * sizeof(PRUnichar))); + } + else { + m1b = NS_STATIC_CAST(char *, + nsMemory::Clone(aOther.m1b, aOther.mState.mLength * sizeof(char))); } Combine that to: + m1b = NS_STATIC_CAST(PRUnichar*, + nsMemory::Clone(aOther.m2b, aOther.mState.mLength * (sizeof(char) << aOther.mState.mIs2b))); ? :) Or combine them to: + m1b = NS_STATIC_CAST(PRUnichar*, + nsMemory::Clone(aOther.m2b, aOther.mState.mLength * (aOther.mState.mIs2b ? sizeof(PRUnichar) : sizeof(char))); r+sr=jst either way.
Attachment #215447 - Flags: superreview?(jst)
Attachment #215447 - Flags: superreview+
Attachment #215447 - Flags: review?(jst)
Attachment #215447 - Flags: review+
Checked in. Unfortunately there was no win in any perf numbers :(
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Blocks: 332184
Depends on: 332140
No longer blocks: 332184
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: