Closed Bug 506090 Opened 15 years ago Closed 15 years ago

[HTML5][Patch] Named character reference tokenization performance regression

Categories

(Core :: DOM: HTML Parser, defect, P2)


Tracking


RESOLVED FIXED

People

(Reporter: mozilla+ben, Assigned: hsivonen)


Details

(Keywords: perf)

Attachments

(2 files, 2 obsolete files)

Tokenizing a character reference such as &AElig; using the HTML5 parser takes time linear in the length of the list of character references. We should be able to do better. Compared with the HTML4 parser, the underlying task has gotten harder in two ways: first, there were only ~300 HTML4 entities, as opposed to the 2100+ HTML5 named character references; second, the HTML5 spec requires the parser to perform a longest-prefix match, whereas the HTML4 parser simply consumed the identifier following a '&' and gave back a character if that identifier named a known entity. The first factor shouldn't matter if we choose a data structure and/or search algorithm that supports O(log n) or O(1) lookup. The second factor can be overcome, I suspect, with a bit of cleverness. Amdahl's Law obliges me to point out that character reference tokenization is almost certainly not the bulk of parsing any real-world webpage, so this may not be a super-high priority, but we're going to want to fix it sooner or later.
See this URL for an illustration of the problem: http://people.mozilla.org/~bnewman/bugs/506090/
Relevant part of Tokenizer.java:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/javasrc/Tokenizer.java#l4037

And the corresponding part of parser/html/nsHtml5Tokenizer.cpp:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/nsHtml5Tokenizer.cpp#l2214

It's worth mentioning that it should be possible to resolve this issue entirely within the Java source files. In particular, this regression is *not* a problem with the Java-to-C++ translator, nor does it have anything to do with differences between the performance characteristics of Java and C++.
Tries are good for longest prefix match with O(1) string comparisons.
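For illustration, a minimal trie with longest-prefix matching could look like the sketch below. This is a hypothetical example, not the data structure the eventual patch uses; the two names added in the usage are real HTML5 references, but the class and its API are invented for this sketch. The walk remembers the last node that completed a known name, so the longest-prefix rule costs O(1) comparisons per input character.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative trie for longest-prefix matching of named character
// references. Not the structure used by the actual parser.
final class ReferenceTrie {
    private final Map<Character, ReferenceTrie> children = new HashMap<>();
    private String value; // replacement text if a known reference ends here

    void add(String name, String replacement) {
        ReferenceTrie node = this;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.computeIfAbsent(
                    name.charAt(i), c -> new ReferenceTrie());
        }
        node.value = replacement;
    }

    // Returns the replacement for the longest known reference that is a
    // prefix of `input`, or null if no reference matches.
    String longestMatch(String input) {
        ReferenceTrie node = this;
        String best = null;
        for (int i = 0; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                break; // no deeper match possible; keep the best seen so far
            }
            if (node.value != null) {
                best = node.value;
            }
        }
        return best;
    }
}
```

For example, with "not" and "notin;" both in the trie, the input "notx" resolves to the shorter "not" while "notin;" resolves to the full six-character name, which is exactly the longest-prefix behavior the spec requires.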
Looks like avoiding premature optimization didn't work all the way here. :-( I think a trie laid out into static const arrays ahead of compile time is the cleanest way to fix this (and the same mechanism could be used to optimize the doctype quirkiness check, too, even though that check runs at most once per parse). However, using binary search to locate the initial search window might be good enough with the current data structures.
The first level of search should definitely be a mere array lookup by using the input character as the array index. However, I wonder if trie-ish things make sense for the second character in the entity name. Maybe using binary search + linear search for window widening would be good enough for the subsequent levels.
Almost all entries in the table end with a ';', and it occurs to me that we don't need the full complexity of the longest-prefix logic for such character references, because ';' can only be the last character of a reference. Instead, we could save characters starting after the '&' and look those characters up in a simple hash set when we encounter a ';'. If there's no match, or if it takes longer to find the ';' than the maximum length of any character reference, then we could fall back to the smaller table of prefix candidates (the entries without ';'s).
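As a rough sketch of that fast path (hypothetical class, table contents, and length bound; the termination characters and the fallback behavior are part of the idea, not real parser code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ';'-terminated fast path proposed above (illustrative only).
// Characters after '&' are buffered; on ';' we do a single O(1) hash lookup.
final class SemicolonFastPath {
    // Illustrative cap: give up once the buffered name is longer than any
    // known reference name could be.
    private static final int MAX_NAME_LENGTH = 32;
    private final Map<String, String> references = new HashMap<>();

    SemicolonFastPath() {
        // Tiny stand-in for the full table of ';'-terminated references.
        references.put("amp", "&");
        references.put("AElig", "\u00c6");
    }

    // Returns the replacement if the input after '&' begins with "name;",
    // else null (a real tokenizer would fall back to the prefix table here).
    String lookup(CharSequence afterAmpersand) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < afterAmpersand.length(); i++) {
            char c = afterAmpersand.charAt(i);
            if (c == ';') {
                return references.get(name.toString());
            }
            if (c == '&' || c == '<' || name.length() >= MAX_NAME_LENGTH) {
                return null; // terminate the search and fall back
            }
            name.append(c);
        }
        return null;
    }
}
```

The appeal is that the common case, a well-formed ';'-terminated reference, becomes a single hash lookup instead of any prefix search at all.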
(In reply to comment #7)
> Almost all entries in the table end with a ';', and it occurs to me that we
> don't need the full complexity of the longest-prefix logic for such character
> references, because ';' can only be the last character of a reference.
> Instead, we could save characters starting after the '&' and look those
> characters up in a simple hash set when we encounter a ';'. If there's no
> match, or if it takes longer to find the ';' than the maximum length of any
> character reference, then we could fall back to the smaller table of prefix
> candidates (the entries without ';'s).

This could work. Of course, the search also needs to terminate on '<' and '&'.

(In reply to comment #6)
> The first level of search should definitely be a mere array lookup by using
> the input character as the array index. However, I wonder if trie-ish things
> make sense for the second character in the entity name. Maybe using binary
> search + linear search for window widening would be good enough for the
> subsequent levels.

I looked into this more. If the first level is a mere array lookup, it doesn't make sense to use binary search plus linear window widening for the second level, because in many cases it would do more compares than linear narrowing would. Array lookup on the first level combined with the current code would be very reasonable for &amp; and &nbsp;. Unfortunately, MathML defines 12 named characters whose names start with "lt" and 12 whose names start with "gt", so at the ';' after "lt" or "gt", the 'hi' index would need to travel over 12 entries. Maybe the least intrusive fix is the following: 1) Use array lookup for the first level of window narrowing. 2) Whenever window narrowing occurs subsequently, first advance lo, then check whether hi has to become equal to lo (by looking at lo+1), and only do a linear search with hi if not.
This would yield trie-like performance for &amp;, &lt;, &gt; and &nbsp; (but not &quot;) but would avoid having to buffer input up to the length of the longest name for &ltxxxxxxxxxxxxxxxxxxxxxxx, &gtxxxxxxxxxxxxxxxxxxxxxxx, &ampxxxxxxxxxxxxxxxxxxxxxxx and &xxxxxxxxxxxxxxxxxxxxxx.
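The two-step idea (first-level array lookup, then linear narrowing of a [lo, hi] window over the sorted name table) could look roughly like this toy Java version. The table here is a tiny hand-sorted stand-in, and the class and method names are invented for this sketch; the real patch generates the per-first-character windows as static tables.

```java
import java.util.Arrays;

// Toy sketch of window narrowing over a sorted reference-name table.
// First level: direct array lookup by character. Subsequent levels: advance
// lo and pull down hi linearly as each input character narrows the window.
final class WindowNarrowing {
    // Tiny hand-sorted stand-in for the real 2100+-entry table.
    private static final String[] NAMES =
            { "amp", "amp;", "gt", "gt;", "lt", "lt;", "ltdot;" };
    // First level: the first character indexes directly into lo/hi windows.
    private static final int[] LO = new int[128];
    private static final int[] HI = new int[128];
    static {
        Arrays.fill(LO, -1);
        for (int i = NAMES.length - 1; i >= 0; i--) LO[NAMES[i].charAt(0)] = i;
        for (int i = 0; i < NAMES.length; i++) HI[NAMES[i].charAt(0)] = i;
    }

    // Returns the longest reference name that prefixes `input`, or null.
    static String match(String input) {
        if (input.isEmpty() || input.charAt(0) >= 128
                || LO[input.charAt(0)] < 0) {
            return null;
        }
        int lo = LO[input.charAt(0)];
        int hi = HI[input.charAt(0)];
        String best = null;
        int pos = 1;
        while (lo <= hi) {
            if (NAMES[lo].length() == pos) {
                best = NAMES[lo]; // a name ends exactly here: candidate match
            }
            if (pos >= input.length()) {
                break;
            }
            char c = input.charAt(pos);
            // Advance lo past names that are exhausted or sort below c here.
            while (lo <= hi && (NAMES[lo].length() <= pos
                    || NAMES[lo].charAt(pos) < c)) {
                lo++;
            }
            // Pull hi down past names that sort above c at this position.
            while (hi >= lo && NAMES[hi].length() > pos
                    && NAMES[hi].charAt(pos) > c) {
                hi--;
            }
            if (lo > hi || NAMES[lo].charAt(pos) != c) {
                break; // window is empty or no name continues with c
            }
            pos++;
        }
        return best;
    }
}
```

On this toy table, "ampx" falls back to the shorter match "amp" while "lt;x" narrows down to "lt;", which mirrors the trie-like behavior described above for &amp; and &lt;.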
Taking as part of tp4 work.
Assignee: nobody → hsivonen
Blocks: 543458
Status: NEW → ASSIGNED
Attached patch WIP: C++ (obsolete) (deleted) — Splinter Review
Attached patch WIP: Java (obsolete) (deleted) — Splinter Review
This patch brings character reference tokenization perf to a level similar to raw text tokenization.
Attached patch The mozilla-central diff (deleted) — Splinter Review
Attachment #426230 - Attachment is obsolete: true
Attachment #428395 - Flags: review?(bnewman)
> 2) Whenever the window narrowing occurs subsequently, first advance lo and
> then check if hi has to become equal to lo (by looking at lo+1) and only do
> a linear search with hi if not.

This check turned out to be unnecessary for meeting the objective of getting character reference tokenization to the perf level of tokenizing the same amount of plain text.
Attached patch html-parser diff (deleted) — Splinter Review
Attachment #426232 - Attachment is obsolete: true
Attachment #428397 - Flags: review?(bnewman)
Summary: [HTML5] Named character reference tokenization performance regression → [HTML5][Patch] Named character reference tokenization performance regression
Depends on: 501082
Comment on attachment 428397 [details] [diff] [review]
html-parser diff

>+ // Java initializes arrays to zero. Zero is our magic value for no hilo
>+ // value.
>+ int[][] hiLoTable = new int[0x7B][52];

Make these numbers a little less magical? For example,

    int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];

or similar?
Attachment #428397 - Flags: review?(bnewman) → review+
Attachment #428395 - Flags: review?(bnewman) → review+
Comment on attachment 428395 [details] [diff] [review]
The mozilla-central diff

>-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
>-static PRUnichar const NAME_##N[] = { CHARS };
>+static PRInt32 const HILO_ACCEL_65[] = {
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  12386493,
>+  0,
---8<---

I don't know a good way to make these data readable, so you might as well save yourself and anyone else reading the file some scrolling by separating the array elements with ", " instead of ",\n".

>+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
>+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
---8<---

It's also regrettable that this file no longer reads like data, since the first two characters of each reference are missing.

Could you maybe include the first two characters in a comment? For instance,

    NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)

and include a comment at the top of the file pointing to the explanatory comment in Tokenizer.java?
I realize that my suggestions in comment #16 actually require changes to the java patch, but the resulting C++ code was what caught my attention.
(In reply to comment #15)
> (From update of attachment 428397 [details] [diff] [review])
> >+ // Java initializes arrays to zero. Zero is our magic value for no hilo
> >+ // value.
> >+ int[][] hiLoTable = new int[0x7B][52];
>
> Make these numbers a little less magical? For example,
>
> int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];
>
> or similar?

Fixed.

(In reply to comment #16)
> (From update of attachment 428395 [details] [diff] [review])
> >-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
> >-static PRUnichar const NAME_##N[] = { CHARS };
> >+static PRInt32 const HILO_ACCEL_65[] = {
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  12386493,
> >+  0,
> ---8<---
>
> I don't know a good way to make these data readable, so you might as well
> save yourself and anyone else reading the file some scrolling by separating
> the array elements with ", " instead of ",\n".

Fixed.

> >+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
> >+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
> ---8<---
>
> It's also regrettable that this file no longer reads like data, since the
> first two characters of each reference are missing.
>
> Could you maybe include the first two characters in a comment? For instance,
>
> NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
>
> and include a comment at the top of the file pointing to the explanatory
> comment in Tokenizer.java?

Fixed. Thanks!

http://hg.mozilla.org/mozilla-central/rev/42ccfcc82e1e
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED