Closed
Bug 489820
Opened 16 years ago
Closed 9 years ago
[HTML5] Avoid repeated conditionals upon buffer append by sizing buffers for worst case
Categories
(Core :: DOM: HTML Parser, enhancement, P4)
Tracking
Status: RESOLVED FIXED
Target Milestone: mozilla43
firefox43: fixed
People
(Reporter: hsivonen, Assigned: hsivonen)
References
(Blocks 3 open bugs)
Details
Attachments
(4 files, 17 obsolete files)
1. (deleted), patch | hsivonen: review+ | Details | Diff | Splinter Review
2. (deleted), patch | hsivonen: review+ | Details | Diff | Splinter Review
3. (deleted), patch | wchen: review+ | Details | Diff | Splinter Review
4. (deleted), patch | wchen: review+ | Details | Diff | Splinter Review
Various buffers are grown dynamically on a per-append basis. This involves a buffer available space check on each append.
A given call into the tokenizer cannot cause buffers to grow by more than the length of the input buffer. Thus, it would be possible to use more RAM but run fewer conditionals by reserving buffer space for the current size plus the length of the tokenizer input buffer.
Assignee
Updated•15 years ago
Priority: -- → P4

Assignee
Updated•15 years ago
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Assignee
Comment 1•15 years ago
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Assignee
Comment 2•15 years ago
The fix here was broken. See bug 555462.
It was broken because it didn't take into account that appendable data can be baked into the current tokenizer state (either a failed character reference in strBuf, or something like a single '<' that is part of the state itself).
Instead of trying to change things to be more clever here, I think the right fix for now is to back this out instead of digging the hole deeper. Then, if we really want this later, this should be re-tried as a well-isolated landing.
Depends on: 555462
Assignee
Comment 3•15 years ago
Note to self: This solution needs special casing for U+0000 to U+FFFD mapping if the tokenizer is ever changed to operate on UTF-8 instead of UTF-16.
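The reason the UTF-8 caveat matters can be made concrete: U+0000 and U+FFFD are both one code unit in UTF-16, but in UTF-8 the replacement character is three bytes while NUL is one, so the mapping would expand the data. A small illustrative helper (not Gecko code):

```cpp
#include <cstddef>

// Illustrative helper: byte length of a code point encoded in UTF-8.
inline size_t Utf8Length(char32_t cp) {
  if (cp < 0x80) return 1;       // U+0000..U+007F
  if (cp < 0x800) return 2;      // U+0080..U+07FF
  if (cp < 0x10000) return 3;    // U+0800..U+FFFF (incl. U+FFFD)
  return 4;                      // astral plane
}
// U+0000 occupies 1 byte but its replacement U+FFFD occupies 3, so a
// UTF-8 tokenizer mapping NUL to U+FFFD grows the data by 2 bytes per
// NUL; in UTF-16 both are a single code unit and no expansion occurs.
```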
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 4•9 years ago
Attachment #8625260 -
Flags: review?(wchen)
Assignee
Comment 5•9 years ago
Attachment #8625261 -
Flags: review?(wchen)
Assignee
Comment 6•9 years ago
The patch for bug 1176668 goes between part 1 and part 2 here, because I thought part 1 here had broken NCR handling. By the time I realized I had found an ancient bug, the patches were already entangled enough that it's not worthwhile to disentangle them.
Status: REOPENED → ASSIGNED
Assignee
Comment 7•9 years ago
Attachment #8625268 -
Flags: review?(wchen)
Assignee
Comment 8•9 years ago
Attachment #8625269 -
Flags: review?(wchen)
Assignee
Comment 9•9 years ago
Attachment #8625276 -
Flags: review?(wchen)
Assignee
Comment 10•9 years ago
Attachment #8625277 -
Flags: review?(wchen)
Assignee
Comment 11•9 years ago
Comment on attachment 8625277 [details] [diff] [review]
m-c part 3: Size the buffers to worst case before each tokenization call
Oops. Gecko-specific code in part 3 is still wrong.
Attachment #8625277 -
Flags: review?(wchen)
Assignee
Comment 12•9 years ago
Attachment #8625277 -
Attachment is obsolete: true
Assignee
Comment 13•9 years ago
So the key insight here is that, when the input to the HTML parser is UTF-16 and the output is UTF-16 (as is the case in Gecko), the HTML parsing algorithm never expands strings: it either keeps the length the same or contracts it. Specifically, various markup does not end up in subsequent buffers at all, and a single character reference is always at least 4 UTF-16 code units long in the input and at most two UTF-16 code units in the output. U+0000 can turn into U+FFFD, but the two are equally long (one code unit) in UTF-16.
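The contraction claim for character references can be checked with a couple of illustrative UTF-16 length comparisons (the specific references chosen here are examples, not an exhaustive proof):

```cpp
#include <cstddef>

// UTF-16 code-unit lengths, excluding the terminating NUL.
// "&gt;" (4 units of input) expands to ">" (1 unit of output).
constexpr size_t kInShortest = sizeof(u"&gt;") / sizeof(char16_t) - 1;
constexpr size_t kOutShortest = sizeof(u">") / sizeof(char16_t) - 1;
// A numeric reference to an astral code point, "&#x1F600;" (9 units),
// decodes to a surrogate pair (2 units) -- still a contraction.
constexpr size_t kInAstral = sizeof(u"&#x1F600;") / sizeof(char16_t) - 1;
constexpr size_t kOutAstral =
    sizeof(u"\U0001F600") / sizeof(char16_t) - 1;

static_assert(kOutShortest <= kInShortest,
              "character references never expand in UTF-16");
static_assert(kOutAstral <= kInAstral,
              "astral references never expand in UTF-16");
```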
Part 3 of the patch set still crashes, because I forgot about View Source. Oops...
Assignee
Comment 14•9 years ago
It's a bit unfortunate that this makes tokenizeBuffer() a footgun, in the sense that you must consistently call another method right before it or memory unsafety ensues. Still, this seems to be the most sensible solution: the existing code was written under the assumption that the allocator subsystem would take care of stopping the parser when needed, and that part of the "infallible" allocator plan has since been dropped.
Assignee
Comment 15•9 years ago
Also: I have noticed an opportunity to save memory by unifying the buffer in the tokenizer and the buffer in the tree builder. However, this patch queue is risky enough as-is, and there's no way I'd get the unification done before I'm out of the office for 3 weeks, so I'm leaving it as a follow-up.
Assignee
Comment 16•9 years ago
Attachment #8625326 -
Attachment is obsolete: true
Assignee
Comment 17•9 years ago
Attachment #8625260 -
Attachment is obsolete: true
Attachment #8625260 -
Flags: review?(wchen)
Assignee
Comment 18•9 years ago
Attachment #8625261 -
Attachment is obsolete: true
Attachment #8625261 -
Flags: review?(wchen)
Assignee
Comment 19•9 years ago
Comment on attachment 8625679 [details] [diff] [review]
htmlparser repo Part 1: Consolidate strBuf and longStrBuf (comments fixed in a follow-up), v2
Moving the buffer consolidation to bug 559303, where it belongs, to make it easier to explain the order in which this patch queue needs to land.
Attachment #8625679 -
Attachment is obsolete: true
Assignee
Updated•9 years ago
Attachment #8625680 -
Attachment is obsolete: true
Assignee
Comment 20•9 years ago
Attachment #8625269 -
Attachment is obsolete: true
Attachment #8625269 -
Flags: review?(wchen)
Assignee
Comment 21•9 years ago
Attachment #8625268 -
Attachment is obsolete: true
Attachment #8625268 -
Flags: review?(wchen)
Assignee
Comment 22•9 years ago
Attachment #8625276 -
Attachment is obsolete: true
Attachment #8625276 -
Flags: review?(wchen)
Assignee
Comment 23•9 years ago
Attachment #8625457 -
Attachment is obsolete: true
Assignee
Comment 24•9 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #6)
> The patch for bug 1176668 goes between part 1 and part 2 here, because I
> thought part 1 here had broken NCR handling. By the time I realized I had
> found an ancient bug, the patches were already entangled enough that it's
> not worthwhile to disentangle them.
This no longer applies. Now the order is simply: bug 559303, bug 1176668, this bug.
Assignee
Comment 25•9 years ago
Attachment #8625704 -
Attachment is obsolete: true
Assignee
Updated•9 years ago
Attachment #8625701 -
Flags: review?(wchen)
Assignee
Updated•9 years ago
Attachment #8625702 -
Flags: review?(wchen)
Assignee
Updated•9 years ago
Attachment #8625703 -
Flags: review?(wchen)
Assignee
Comment 26•9 years ago
Comment on attachment 8625705 [details] [diff] [review]
m-c part 2: Size the buffers to worst case before each tokenization call, vN
See bug 1029671 comment 36 for the big picture of the whole queue and about landing.
Attachment #8625705 -
Flags: review?(wchen)
Assignee
Comment 27•9 years ago
Attachment #8626084 -
Flags: review?(wchen)
Assignee
Comment 28•9 years ago
Attachment #8626085 -
Flags: review?(wchen)
Updated•9 years ago
Attachment #8625701 -
Flags: review?(wchen) → review+
Updated•9 years ago
Attachment #8625702 -
Flags: review?(wchen) → review+
Comment 29•9 years ago
Comment on attachment 8625703 [details] [diff] [review]
htmlparser repo part 2: Size the buffers to worst case before each tokenization call
Review of attachment 8625703 [details] [diff] [review]:
-----------------------------------------------------------------
I took a look at the crashes and it seems like it's essentially the same problem as comment 2, and the insight of comment 13 doesn't hold.
The W2 test that is causing the crash looks like this (I've simplified the example):
var s = "<script></a";
for (var i = 0; i < s.length; i++) {
  document.write(s[i]);
}
Here is what I think is happening: it's writing to the tokenizer one character at a time, so we don't grow the buffer by more than 2 characters (mozilla::RoundUpPow2(1)). At the same time, this is causing a change in tokenizer state. When the tokenizer consumes the final "a", it's in a state that will cause it to output "</a", but the buffer can't hold that many characters and overflows.
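The failure mode described above can be modeled in a few lines. This is a hypothetical sketch (names are invented, not Gecko's): sizing the reservation by the incoming chunk alone underestimates, because the tokenizer may also flush characters buffered from previous chunks.

```cpp
#include <string>

// Hypothetical model of the reported bug: per-write reservations must
// cover pending tokenizer state plus the new input, not input alone.
struct ModelTokenizer {
  // Characters carried across write() calls, e.g. u"</a" while in a
  // script-data end-tag state that turns out not to match.
  std::u16string pending;

  // Worst case the tokenizer can emit while consuming n input units:
  // everything it has buffered so far plus the new input.
  size_t WorstCaseOutput(size_t n) const { return pending.size() + n; }
};
```

With pending state "</a", a single-character write can emit 4 code units, which is why a reservation sized only for the 1-character chunk overflows.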
Attachment #8625703 -
Flags: review?(wchen) → review-
Comment 30•9 years ago
Comment on attachment 8625705 [details] [diff] [review]
m-c part 2: Size the buffers to worst case before each tokenization call, vN
Review of attachment 8625705 [details] [diff] [review]:
-----------------------------------------------------------------
::: parser/html/nsHtml5TreeBuilderCppSupplement.h
@@ +894,5 @@
>
> void
> nsHtml5TreeBuilder::accumulateCharacters(const char16_t* aBuf, int32_t aStart, int32_t aLength)
> {
> + memcpy(charBuffer + charBufferLen, aBuf + aStart, sizeof(char16_t) * aLength);
We overflow here. There isn't an assertion here, which is why you weren't getting anything from debug builds.
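The kind of bounds assertion the reviewer notes is missing could look like the following. This is an illustrative standalone sketch (the capacity parameter is invented for the example; the real fix is to size the buffer correctly, with the assert only catching violations in debug builds):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch of accumulateCharacters with an explicit bounds check before
// the memcpy; illustrative, not the actual Gecko signature.
void AccumulateCharacters(char16_t* charBuffer, int32_t& charBufferLen,
                          int32_t charBufferCapacity,
                          const char16_t* aBuf, int32_t aStart,
                          int32_t aLength) {
  // Would have made the overflow visible in debug builds.
  assert(charBufferLen + aLength <= charBufferCapacity);
  memcpy(charBuffer + charBufferLen, aBuf + aStart,
         sizeof(char16_t) * aLength);
  charBufferLen += aLength;
}
```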
Attachment #8625705 -
Flags: review?(wchen) → review-
Comment 31•9 years ago
Comment on attachment 8626084 [details] [diff] [review]
m-c part 3 - undo part of part 2 to avoid crashing on web-platform-tests-2
Review of attachment 8626084 [details] [diff] [review]:
-----------------------------------------------------------------
This patch shouldn't be necessary if you fix the previously mentioned bug.
Attachment #8626084 -
Flags: review?(wchen) → review-
Updated•9 years ago
Attachment #8626085 -
Flags: review?(wchen)
Assignee
Updated•9 years ago
Attachment #8626084 -
Attachment is obsolete: true
Assignee
Updated•9 years ago
Attachment #8626085 -
Attachment is obsolete: true
Assignee
Comment 32•9 years ago
Thanks. Attaching patch 1 with an updated hg comment to keep track of stuff.
Attachment #8625701 -
Attachment is obsolete: true
Attachment #8644266 -
Flags: review+
Assignee
Comment 33•9 years ago
Attachment #8625702 -
Attachment is obsolete: true
Attachment #8644267 -
Flags: review+
Assignee
Comment 34•9 years ago
Thanks for finding the bug! Attaching revised part 2.
Attachment #8625703 -
Attachment is obsolete: true
Assignee
Comment 35•9 years ago
Attachment #8625705 -
Attachment is obsolete: true
Assignee
Updated•9 years ago
Attachment #8644268 -
Flags: review?(wchen)
Assignee
Updated•9 years ago
Attachment #8644269 -
Flags: review?(wchen)
Comment 36•9 years ago
Comment on attachment 8644268 [details] [diff] [review]
htmlparser repo part 2: Size the buffers to worst case before each tokenization call, revised
Looks good. The patch introduces some new trailing whitespace, which should be removed.
Attachment #8644268 -
Flags: review?(wchen) → review+
Comment 37•9 years ago
Comment on attachment 8644269 [details] [diff] [review]
m-c part 2: Size the buffers to worst case before each tokenization call, revised
Trailing whitespace
Attachment #8644269 -
Flags: review?(wchen) → review+
Comment 38•9 years ago
Assignee
Comment 39•9 years ago
Thanks. Landed with the trailing whitespace fixed. I think I'll file a follow-up to zap all trailing whitespace from the Java code, after which I can make my editor enforce its absence.
Comment 40•9 years ago
https://hg.mozilla.org/mozilla-central/rev/c5c201250f2a
https://hg.mozilla.org/mozilla-central/rev/bb89c4d2d004
Status: ASSIGNED → RESOLVED
Closed: 15 years ago → 9 years ago
status-firefox43:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla43