Closed Bug 329974 Opened 19 years ago Closed 19 years ago

Use static data for common whitespace values in textnodes

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: sicking, Assigned: sicking)

References

Details

Attachments

(3 files)

single-char node values 19 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), text/plain		Details
perl program to analyze log 19 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), text/plain		Details
Patch to fix 19 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently (deleted), patch	jst : review+ jst : superreview+	Details \| Diff \| Splinter Review

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Description

•

19 years ago

There is a lot of makup out there that looks like <foo> <bar>...</bar> <baz> </foo> This produces textnodes that contains values like "\n " and "\n ". It should be possible to detect this and instead of allocating new string data for each of these nodes instead let them point into a single buffer that we allocate on startup. Since nsTextFragment doesn't use null terminated data we can use a single string like " \n " and then just set start and length as appropriate. I'm currently running some stats to determine what types of values are common enough that this is useful. Once I have that i'll attach a patch.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 1

•

19 years ago

I went to just over 500 pages (spider starting with alexa top 500 until we crash). This resulted in 391031 SetTo calls in nsTextFragment. Out of these 227593 calls were for whitespace-only strings. Below are the resulting stats, n=newline, s=space, t=tab. 86872 - n 10601 - s 8313 - nt 7388 - nn 7357 - ntt 6773 - nss 6197 - nssssssss 5774 - nssss 5143 - nssssssssss 5128 - nttt 3929 - ntttt 3533 - nssssssssssss 3517 - nssssssssssssssss 3396 - nssssss 2835 - ns 2511 - nttttt 2216 - nssssssssssssss 1929 - nssssssssssssssssssssss 1815 - nssssssssssssssssss 1765 - ntttttt 1720 - nnn 1708 - nssssssssssssssssssss 1591 - sn 1514 - nssssssssssssssssssssssss 1219 - 1185 - snssssssssssss 1131 - nsss 1033 - nsssss 993 - ss 961 - nttttttt 800 - nssssssssssssssssssssssssssss 786 - snssssssssssssssssss 777 - snssssss 752 - nsssssssssssssss 743 - nsssssssssssssssssss 715 - nnt 688 - snssssssssssssssssssssssss 674 - ntttttttt 603 - snsssssssssssssss 603 - nssssssssssssssssssssssssss 507 - nnnn 475 - sss 466 - nntttttttttttt 457 - nttttttnttttttt 449 - snssssssssss 435 - nnttttttttttt 414 - nsssssssss 406 - nnss 373 - nssssssssssssssssssssssssssssssss 369 - snssss 365 - snssssssss 352 - nnssss 345 - nntt 339 - snssssssssssssssssssss 339 - nnssssssss 313 - ntttttttttt 313 - nttttttttt 277 - nssssssssssssssssssssssssssssss 276 - nsssssss 273 - snssssssssssssss 262 - snss 257 - nntttttttttt 251 - nssssssssssssssssssssssssssssssssss 250 - nnssssss 246 - ntnt 246 - nnssssssssssss 240 - ntss 237 - nntttttttttttttt 236 - snssssssssssssssssssssssssss 235 - ntssss 232 - nnttttttttttttttt 231 - snsssssssss 229 - snssssssssssssssss 228 - ssss

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 2

•

19 years ago

I tried some various strategies of which types of whitespace nodes to use static data for. My initial strategy (A) was to match patterns that started with 1-7 newlines followed by 0-50 spaces or tabs (not a mixture), or just 0-50 spaces. This matches 204664 out of 227593 nodes. If we allow a single space before the newlines it matches 214985 strings. Allowing 0-10 spaces before the newlines matches 215442 strings. If we match 0-50 tabs in addition to 0-50 spaces that adds an additional 500-something strings. I also tried to reduce the number of newlines to match. Just matching 1-3 looses some 1200 strings, 1-2 an additional 2200 and just a single newline 14000.

Boris Zbarsky [:bzbarsky]

Comment 3

•

19 years ago

Sicking, I was wondering what the data looks like if you include the unicode char produced by   in "whitespace"... Would that require another spider run, or can it be extracted from whatever data you logged?

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 4

•

19 years ago

I'd have to make another run. What pattern that includes   do you suspect is common? I did note that out of the total 391031 values 12927 were a single non-whitespace char. If a significan portion of these chars have a low unicode value it would probably be worth having a static array for them too.

Boris Zbarsky [:bzbarsky]

Comment 5

•

19 years ago

> What pattern that includes   do you suspect is common?    .... (repeated some number of times). Possibly with a leading \n.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 6

•

19 years ago

Attached file single-char node values (deleted) — Details

I looked into which characters are common for single-char nodes. Seems like pretty much only characters 1-255 appear often enough on their own to warrent searching for them (well, 1-183 really, but it's cheap enough to do the rest). The only other charater that occurs pretty often on its own is 12288 which appears to be some sort of whitespace and 12539 which is a dot. However they appear as 0.4 resp. 0.2 per cent of the total number of textnodes so I don't really think it's worth it.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 7

•

19 years ago

So I ran some tests that included   characters. This gave the following result out of 1030005 nodes: 25189 /^\n{1,7}$/s 4093 /^ \n{1,7}$/s 119615 /^\n{1,7} {1,50}$/s 14155 /^ \n{1,7} {1,50}$/s 92678 /^\n{1,7}\t{1,50}$/s 3526 /^ \n{1,7}\t{1,50}$/s 285 /^\n{1,7}\x{a0}{1,50}$/s 7 /^ \n{1,7}\x{a0}{1,50}$/s 7353 /^ {2,50}$/s 1092 /^\t{2,50}$/s 2319 /^\x{a0}{2,50}$/s 664 /^\x{a0}\n{1,7}( {0,50}|\t{0,50}|\x{a0}{0,50})$/s 167 /^ \x{a0}{1,50}$/s 4 /^ {2,10}\x{a0}{1,50}$/s Note that all these patterns should be mutually exclusive. This also does not include any single-char nodes. While   on its own is more common then tabs, I don't think searching for either is worth it. After newlines tabs are definitely worth it, however   is not.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 8

•

19 years ago

Attached file perl program to analyze log (deleted) — Details

This is the perl program i've used to gather the stats. I've put the compressed log on http://people.mozilla.com/~sicking/data/bug329974_log.bz2

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 9

•

19 years ago

Attached patch Patch to fix (deleted) — Details — Splinter Review

This patch makes us share all 8-bit single-char node values, as well as nodevalues matching the expression: /^ ?\n{0,7}( {0,50}|\t{0,50})$/ Those two together matches just over 60% of all textnodes! I also got sucked into cleaning nsTextFragment up a bit. There were way too much dead code in there to just let it pass.

Attachment #215447 - Flags: superreview?(jst)

Attachment #215447 - Flags: review?(jst)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 10

•

19 years ago

I just thought of a minor change that would be good to include. Rather then having an initial if (firstChar == ' ' || firstChar == '\n' || firstChar == '\t') I should make that if (firstChar <= ' ' && (firstChar == ' ' || firstChar == '\n' || firstChar == '\t')) That way the first test will most likly be the one that decides which way we go and we'd save a few cycles. And even better is if (firstChar <= ' ' && (firstChar == '\n' || firstChar == ' ' || firstChar == '\t')) since newline is the most common firstChar of the three

Boris Zbarsky [:bzbarsky]

Comment 11

•

19 years ago

> + NS_ERROR("Could not initialize nsAttrValue"); You mean nsTextFragment, right?

Johnny Stenback (:jst)

Comment 12

•

19 years ago

Comment on attachment 215447 [details] [diff] [review] Patch to fix - In nsTextFragment::operator=(const nsTextFragment& aOther): + else if (aOther.mState.mIs2b) { + m2b = NS_STATIC_CAST(PRUnichar*, + nsMemory::Clone(aOther.m2b, aOther.mState.mLength * sizeof(PRUnichar))); + } + else { + m1b = NS_STATIC_CAST(char *, + nsMemory::Clone(aOther.m1b, aOther.mState.mLength * sizeof(char))); } Combine that to: + m1b = NS_STATIC_CAST(PRUnichar*, + nsMemory::Clone(aOther.m2b, aOther.mState.mLength * (sizeof(char) << aOther.mState.mIs2b))); ? :) Or combine them to: + m1b = NS_STATIC_CAST(PRUnichar*, + nsMemory::Clone(aOther.m2b, aOther.mState.mLength * (aOther.mState.mIs2b ? sizeof(PRUnichar) : sizeof(char))); r+sr=jst either way.

Attachment #215447 - Flags: superreview?(jst)

Attachment #215447 - Flags: superreview+

Attachment #215447 - Flags: review?(jst)

Attachment #215447 - Flags: review+

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 13

•

19 years ago

Checked in. Unfortunately there was no win in any perf numbers :(

Status: NEW → RESOLVED

Closed: 19 years ago

Resolution: --- → FIXED

Andrew Schultz

Updated

•

19 years ago

Blocks: 332184

David Baron :dbaron:

Updated

•

19 years ago

Depends on: 332140

Jesse Ruderman

Updated

•

19 years ago

No longer blocks: 332184

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: DOM → DOM: Core & HTML

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Use static data for common whitespace values in textnodes

Categories

(Core :: DOM: Core & HTML, defect)

Tracking

()

People

(Reporter: sicking, Assigned: sicking)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(3 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Updated

Updated

Updated

Attachment

General

Description

File Name

Content Type