Closed Bug 506090 Opened 15 years ago Closed 15 years ago

[HTML5][Patch] Named character reference tokenization performance regression

Categories

(Core :: DOM: HTML Parser, defect, P2)


Tracking


RESOLVED FIXED

People

(Reporter: mozilla+ben, Assigned: hsivonen)


Details

(Keywords: perf)

Attachments

(2 files, 2 obsolete files)

Tokenizing a character reference such as &AElig; using the HTML5 parser takes time linear in the length of the list of character references. We should be able to do better. Compared with the HTML4 parser, the underlying task has gotten harder in two ways: first, there were only ~300 HTML4 entities, as opposed to the 2100+ HTML5 named character references; second, the HTML5 spec requires the parser to perform a longest-prefix match, whereas the HTML4 parser simply consumed the identifier following a '&' and gave back a character if that identifier named a known entity. The first factor shouldn't matter if we choose a data structure and/or search algorithm that supports O(log n) or O(1) lookup. The second factor can be overcome, I suspect, with a bit of cleverness. Amdahl's Law obliges me to point out that character reference tokenization is almost certainly not the bulk of parsing any real-world webpage, so this may not be a super-high priority, but we're going to want to fix it sooner or later.
See this URL for an illustration of the problem: http://people.mozilla.org/~bnewman/bugs/506090/
Relevant part of Tokenizer.java:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/javasrc/Tokenizer.java#l4037

And the corresponding part of parser/html/nsHtml5Tokenizer.cpp:
http://hg.mozilla.org/mozilla-central/file/b2d3327fb4cc/parser/html/nsHtml5Tokenizer.cpp#l2214

It's worth mentioning that it should be possible to resolve this issue entirely within the Java source files. In particular, this regression is *not* a problem with the Java-to-C++ translator, nor does it have anything to do with differences between the performance characteristics of Java and C++.
Tries are good for longest prefix match with O(1) string comparisons.
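For illustration, a minimal trie with longest-prefix matching could look like the sketch below. This is a hypothetical example, not the data structure the eventual patch uses; the two names added in the usage are real HTML5 references, but the class and its API are invented for this sketch. The walk remembers the last node that completed a known name, so the longest-prefix rule costs O(1) comparisons per input character.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative trie for longest-prefix matching of named character
// references. Not the structure used by the actual parser.
final class ReferenceTrie {
    private final Map<Character, ReferenceTrie> children = new HashMap<>();
    private String value; // replacement text if a known reference ends here

    void add(String name, String replacement) {
        ReferenceTrie node = this;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.computeIfAbsent(
                    name.charAt(i), c -> new ReferenceTrie());
        }
        node.value = replacement;
    }

    // Returns the replacement for the longest known reference that is a
    // prefix of `input`, or null if no reference matches.
    String longestMatch(String input) {
        ReferenceTrie node = this;
        String best = null;
        for (int i = 0; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                break; // no deeper match possible; keep the best seen so far
            }
            if (node.value != null) {
                best = node.value;
            }
        }
        return best;
    }
}
```

For example, with "not" and "notin;" both in the trie, the input "notx" resolves to the shorter "not" while "notin;" resolves to the full six-character name, which is exactly the longest-prefix behavior the spec requires.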
Looks like avoiding premature optimization didn't work all the way here. :-( I think a trie laid out into static const arrays ahead of compile time is the cleanest way to fix this (and the same mechanism could be used to optimize the doctype quirkiness check, too, even though that check runs at most once per parse). However, using binary search to locate the initial search window might be good enough with the current data structures.
The first level of search should definitely be a mere array lookup by using the input character as the array index. However, I wonder if trie-ish things make sense for the second character in the entity name. Maybe using binary search + linear search for window widening would be good enough for the subsequent levels.
Almost all entries in the table end with a ';', and it occurs to me that we don't need the full complexity of the longest-prefix logic for such character references, because ';' can only be the last character of a reference. Instead, we could save characters starting after the '&' and look those characters up in a simple hash set when we encounter a ';'. If there's no match, or if it takes longer to find the ';' than the maximum length of any character reference, then we could fall back to the smaller table of prefix candidates (the entries without ';'s).
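As a rough sketch of that fast path (hypothetical class, table contents, and length bound; the termination characters and the fallback behavior are part of the idea, not real parser code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ';'-terminated fast path proposed above (illustrative only).
// Characters after '&' are buffered; on ';' we do a single O(1) hash lookup.
final class SemicolonFastPath {
    // Illustrative cap: give up once the buffered name is longer than any
    // known reference name could be.
    private static final int MAX_NAME_LENGTH = 32;
    private final Map<String, String> references = new HashMap<>();

    SemicolonFastPath() {
        // Tiny stand-in for the full table of ';'-terminated references.
        references.put("amp", "&");
        references.put("AElig", "\u00c6");
    }

    // Returns the replacement if the input after '&' begins with "name;",
    // else null (a real tokenizer would fall back to the prefix table here).
    String lookup(CharSequence afterAmpersand) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < afterAmpersand.length(); i++) {
            char c = afterAmpersand.charAt(i);
            if (c == ';') {
                return references.get(name.toString());
            }
            if (c == '&' || c == '<' || name.length() >= MAX_NAME_LENGTH) {
                return null; // terminate the search and fall back
            }
            name.append(c);
        }
        return null;
    }
}
```

The appeal is that the common case, a well-formed ';'-terminated reference, becomes a single hash lookup instead of any prefix search at all.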
(In reply to comment #7)
> Almost all entries in the table end with a ';', and it occurs to me that we
> don't need the full complexity of the longest-prefix logic for such character
> references, because ';' can only be the last character of a reference.
> Instead, we could save characters starting after the '&' and look those
> characters up in a simple hash set when we encounter a ';'. If there's no
> match, or if it takes longer to find the ';' than the maximum length of any
> character reference, then we could fall back to the smaller table of prefix
> candidates (the entries without ';'s).

This could work. Of course, the search also needs to terminate on '<' and '&'.

(In reply to comment #6)
> The first level of search should definitely be a mere array lookup by using
> the input character as the array index. However, I wonder if trie-ish things
> make sense for the second character in the entity name. Maybe using binary
> search + linear search for window widening would be good enough for the
> subsequent levels.

I looked into this more. If the first level is a mere array lookup, it doesn't make sense to use binary search plus linear window widening for the second level, because in many cases it would do more compares than linear narrowing would. Array lookup on the first level combined with the current code would be very reasonable for &amp; and &nbsp;. Unfortunately, MathML defines 12 named characters whose names start with "lt" and 12 whose names start with "gt", so at the ';' after "lt" or "gt", the 'hi' index would need to travel over 12 entries. Maybe the least intrusive fix is the following: 1) Use array lookup for the first level of window narrowing. 2) Whenever window narrowing occurs subsequently, first advance lo, then check whether hi has to become equal to lo (by looking at lo+1), and only do a linear search with hi if not.
This would yield trie-like performance for &amp;, &lt;, &gt; and &nbsp; (but not &quot;) but would avoid having to buffer input up to the length of the longest name for &ltxxxxxxxxxxxxxxxxxxxxxxx, &gtxxxxxxxxxxxxxxxxxxxxxxx, &ampxxxxxxxxxxxxxxxxxxxxxxx and &xxxxxxxxxxxxxxxxxxxxxx.
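The two-step idea (first-level array lookup, then linear narrowing of a [lo, hi] window over the sorted name table) could look roughly like this toy Java version. The table here is a tiny hand-sorted stand-in, and the class and method names are invented for this sketch; the real patch generates the per-first-character windows as static tables.

```java
import java.util.Arrays;

// Toy sketch of window narrowing over a sorted reference-name table.
// First level: direct array lookup by character. Subsequent levels: advance
// lo and pull down hi linearly as each input character narrows the window.
final class WindowNarrowing {
    // Tiny hand-sorted stand-in for the real 2100+-entry table.
    private static final String[] NAMES =
            { "amp", "amp;", "gt", "gt;", "lt", "lt;", "ltdot;" };
    // First level: the first character indexes directly into lo/hi windows.
    private static final int[] LO = new int[128];
    private static final int[] HI = new int[128];
    static {
        Arrays.fill(LO, -1);
        for (int i = NAMES.length - 1; i >= 0; i--) LO[NAMES[i].charAt(0)] = i;
        for (int i = 0; i < NAMES.length; i++) HI[NAMES[i].charAt(0)] = i;
    }

    // Returns the longest reference name that prefixes `input`, or null.
    static String match(String input) {
        if (input.isEmpty() || input.charAt(0) >= 128
                || LO[input.charAt(0)] < 0) {
            return null;
        }
        int lo = LO[input.charAt(0)];
        int hi = HI[input.charAt(0)];
        String best = null;
        int pos = 1;
        while (lo <= hi) {
            if (NAMES[lo].length() == pos) {
                best = NAMES[lo]; // a name ends exactly here: candidate match
            }
            if (pos >= input.length()) {
                break;
            }
            char c = input.charAt(pos);
            // Advance lo past names that are exhausted or sort below c here.
            while (lo <= hi && (NAMES[lo].length() <= pos
                    || NAMES[lo].charAt(pos) < c)) {
                lo++;
            }
            // Pull hi down past names that sort above c at this position.
            while (hi >= lo && NAMES[hi].length() > pos
                    && NAMES[hi].charAt(pos) > c) {
                hi--;
            }
            if (lo > hi || NAMES[lo].charAt(pos) != c) {
                break; // window is empty or no name continues with c
            }
            pos++;
        }
        return best;
    }
}
```

On this toy table, "ampx" falls back to the shorter match "amp" while "lt;x" narrows down to "lt;", which mirrors the trie-like behavior described above for &amp; and &lt;.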
Taking as part of tp4 work.
Assignee: nobody → hsivonen
Blocks: 543458
Status: NEW → ASSIGNED
Attached patch WIP: C++ (obsolete) (deleted) — Splinter Review
Attached patch WIP: Java (obsolete) (deleted) — Splinter Review
This patch brings character reference tokenization perf to a level similar to raw text tokenization.
Attached patch The mozilla-central diff (deleted) — Splinter Review
Attachment #426230 - Attachment is obsolete: true
Attachment #428395 - Flags: review?(bnewman)
> 2) Whenever the window narrowing occurs subsequently, first advance lo and
> then check if hi has to become equal to lo (by looking at lo+1) and only do
> a linear search with hi if not.

This check turned out to be unnecessary for meeting the objective of getting character reference tokenization to the perf level of tokenizing the same amount of plain text.
Attached patch html-parser diff (deleted) — Splinter Review
Attachment #426232 - Attachment is obsolete: true
Attachment #428397 - Flags: review?(bnewman)
Summary: [HTML5] Named character reference tokenization performance regression → [HTML5][Patch] Named character reference tokenization performance regression
Depends on: 501082
Comment on attachment 428397 [details] [diff] [review]
html-parser diff

>+ // Java initializes arrays to zero. Zero is our magic value for no hilo
>+ // value.
>+ int[][] hiLoTable = new int[0x7B][52];

Make these numbers a little less magical? For example,

    int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];

or similar?
Attachment #428397 - Flags: review?(bnewman) → review+
Attachment #428395 - Flags: review?(bnewman) → review+
Comment on attachment 428395 [details] [diff] [review]
The mozilla-central diff

>-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
>-static PRUnichar const NAME_##N[] = { CHARS };
>+static PRInt32 const HILO_ACCEL_65[] = {
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  0,
>+  12386493,
>+  0,
---8<---

I don't know a good way to make these data readable, so you might as well save yourself and anyone else reading the file some scrolling by separating the array elements with ", " instead of ",\n".

>+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
>+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
>+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
>+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
---8<---

It's also regrettable that this file no longer reads like data, since the first two characters of each reference are missing.

Could you maybe include the first two characters in a comment? For instance,

    NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)

and include a comment at the top of the file pointing to the explanatory comment in Tokenizer.java?
I realize that my suggestions in comment #16 actually require changes to the java patch, but the resulting C++ code was what caught my attention.
(In reply to comment #15)
> (From update of attachment 428397 [details] [diff] [review])
> >+ // Java initializes arrays to zero. Zero is our magic value for no hilo
> >+ // value.
> >+ int[][] hiLoTable = new int[0x7B][52];
>
> Make these numbers a little less magical? For example,
>
> int[][] hiLoTable = new int['z'+1]['Z'-'A'+1 + 'z'-'a'+1];
>
> or similar?

Fixed.

(In reply to comment #16)
> (From update of attachment 428395 [details] [diff] [review])
> >-#define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, VALUE, SIZE) \
> >-static PRUnichar const NAME_##N[] = { CHARS };
> >+static PRInt32 const HILO_ACCEL_65[] = {
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  0,
> >+  12386493,
> >+  0,
> ---8<---
>
> I don't know a good way to make these data readable, so you might as well
> save yourself and anyone else reading the file some scrolling by separating
> the array elements with ", " instead of ",\n".

Fixed.

> >+NAMED_CHARACTER_REFERENCE(0, 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(1, 'l' _ 'i' _ 'g' _ ';', 4, 0x00c6 _ 0)
> >+NAMED_CHARACTER_REFERENCE(2, 'P', 1, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(3, 'P' _ ';', 2, 0x0026 _ 0)
> >+NAMED_CHARACTER_REFERENCE(4, 'c' _ 'u' _ 't' _ 'e', 4, 0x00c1 _ 0)
> >+NAMED_CHARACTER_REFERENCE(5, 'c' _ 'u' _ 't' _ 'e' _ ';', 5, 0x00c1 _ 0)
> ---8<---
>
> It's also regrettable that this file no longer reads like data, since the
> first two characters of each reference are missing.
>
> Could you maybe include the first two characters in a comment? For instance,
>
> NAMED_CHARACTER_REFERENCE(0, /* A e */ 'l' _ 'i' _ 'g', 3, 0x00c6 _ 0)
>
> and include a comment at the top of the file pointing to the explanatory
> comment in Tokenizer.java?

Fixed. Thanks!

http://hg.mozilla.org/mozilla-central/rev/42ccfcc82e1e
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED