Closed Bug 236002 Opened 21 years ago Closed 7 years ago

Memory expands further than the system will allow with tags such as <s^g^ggfdf%> parsed as <s>

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: superdug, Assigned: choess)

References

()

Details

Attachments

(1 obsolete file)

User-Agent: Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040206 Firefox/0.8 Screen captures in XP can be found at http://www.dugnet.com/dug/oldstuff/000090.html the 17 megabyte file takes more than 1.12 gigs of system memory and doesnt completely load. Reproducible: Always Steps to Reproduce: 1.go to page http://www.chigc.com/et/ss-current/moz-kill.htm 2.watch memory usage of process 3.mozilla crashes Actual Results: Windows produced an error explaining that the system memory was critically low and exited the firefox process. Expected Results: Internet Explorer displayed the page correctly in under 100 megs of system resources.
Typical line from that file: <p> <a href=pb000106.htm target=_blank>000106</a> "^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W) GUID=fb9205635216121e04601afc1bd20ac6(VALID:247) [2004.02.12 00:28:41] This keeps coming up in various places. In view of our residual style handling being as memory-intensive as it is, I propose making almost all characters allowed in tagnames in the HTML parser (basically disallow whitespace, '/', '>'). That would solve issues like crap sites doing <style="something"> and ending up with the whole page as part of the stylesheet, too. Thoughts?
Blocks: 26003
Confirmed on linux. Browser continued to run, but system gets very sluggish when memory is exhausted (and another mozilla process (my Mailnews) crashed, probably because of out mem). Browser does *not* free memory after page close, which is probably a bug in itself. Mozilla Seamonkey 1.4.x on SuSE 9.
Status: UNCONFIRMED → NEW
Ever confirmed: true
> Browser does *not* free memory after page close Actually, that's not a Mozilla bug... see bug 130157 and discussion therein. That said, could we stay somewhat focused here? There's really no need to comment on this bug by anyone not directly involved in parser development or standards-compliance QA..... (in other words, I'd like to hear from Ian and choess on the feasibility of what I suggested).
No longer blocks: 26003
The following: "^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W) ...should parse to: text: |"^YF^| element: <s^ht^YF^> + element: <n^hg^Yr^> + element: <z^_[^jAd">; attribute: |(W)|, value: || ...or some such. That is, <foo<bar> ...is valid HTML (ignoring the fact that the tag names are from some future version of HTML that we don't know yet) and is exactly equivalent to <foo><bar>. I don't exactly understand what you are proposing though. Could you explain it in more detail? What is the problem you are solving?
Ian, '^' is not a valid character in an element name, as far as I understand. At the moment we parse that as follows: "^YF^<s^ht^YF^<n^hg^Yr^<z^_[^jAd" (W) parses to text: |"^YF^| element: <s>; attribute: |^ht^YF^|, value: || + element: <n>; attribute: |^hg^Yr^|, value:|| + element: <z>; attribute: |^_[^jAd"|, value: ||; attribute: |(W)|, value: || There may be more attributes; I'm not sure how '^' affects attribute parsing and it does not matter for purposes of this testcase. Now notice that we ended up with an <s> tag. And now we have unclose <s> tags, and residual style kicks in, so we keep reopening them inside every <p>, leading to a DOM tree that's several orders of magnitude bigger than it "should" be. That is the problem I am trying to solve. My proposal is to allow '^' in tagnames so that we will parse as you propose in comment 4. That would mean we don't have an <s> tag anymore. Since just allowing '^' is hacky, I propose allowing all chars that are not whitespace, '>', or '/' in a tagname (that is, the tagname would go from the '<' to the first whitespace, '/', or '>' char).
(In reply to comment #5) > Ian, '^' is not a valid character in an element name, as far as I understand. Well, no, but it's not allowed in attribute names either, as I understand it. > Now notice that we ended up with an <s> tag. And now we have unclose <s> tags, > and residual style kicks in, so we keep reopening them inside every <p>, > leading to a DOM tree that's several orders of magnitude bigger than it > "should" be. Oh, I see. Ok, I now understand what you were talking about before when you mentioned the residual style thing. So yes, your proposal makes sense to me. Does it match what IE does, do we know? I can make test cases if you want, but I don't have IE here.
IE allows '^' in tagnames as far as I recall from the testing I had people do before this bug got filed. If you can make some testcases for random chars, I'm sure we can find IE-using volunteers to test IE's behavior. ;)
Ok. Testcase: http://junkyard.damowmow.com/122 The test consists of taking the markup "<p>aaa<s^bbb</p>" and seeing what innerHTML thinks it is. IE6: aaa<S^BBB< p> IE appears to treat the whole thing as an element, although it also seems to get rather confused about the end tag. It also capitalised, go figure. Safari 1.2: aaa Safari appears to drop unknown HTML elements altogether, it thinks the first element contains no unknown element on that test. Mozilla: aaa<s ^bbb=""></s> As noted above, we treat the ^ bit as part of the attribute. (But note that View Source is different than either innerHTML _or_ the actual source.) Opera: aaa<s/> I don't even know where to begin. In conclusion, I think your proposal is the best thing.
Ian, what we actually need are tests that will test at least the following characters as part of "tag names": ! @ # $ % ^ & * ( ) _ + - = { } | [ ] \ : " ; ' , . ? < > / We could also use tests that use non-ascii chars in tag-soup tagnames. At the moment, we allow ':', '_', '-', and '.' in tagnames but no other punctuation and we don't allow non-ascii. The question is what punctuation to allow and whether to allow other random non-ascii stuff in there too. That is, do we keep our current "chars are not allowed in a tagname unless we explicitly say they are" approach, or do we go with the "chars are allowed unless we explicitly say they are not" approach?
The < and > characters are special. <t<t> is the same as <t><t> <t>t> is the same as <t>t&gt; So I didn't test those. Testcase: http://www.hixie.ch/tests/adhoc/html/parsing/compat/016.html It seems IE does indeed treat them all as valid characters.
We don't really want to allow '/' in tagnames because then <img/> will not be an <img> tag.... What about non-ascii?
OK, my informants tell me that on that testcase IE has exactly the behavior the testcase describes. I think we should do the same.
Hmmm. What if we modified CStartToken::Consume after the call to GetIdentifier and peeked at the next character? If it's not whitespace, "<", ">", or "/", then instead of calling LookupTag to set the TypeID, we could just set it to eHTMLTag_userdefined. That way we'd still keep identifiers to name characters, per spec, but it *should*--I'd want to check CNavDTD first for surprises--keep residual style from being activated on the tags, since they'll no longer be recognized as <s>, <p>, or whatever.
"per spec"?
That sounds pretty fragile to me.... And we'd apply the style anyway, even if non-residually (since for style the tagname is all that matters). I think that keeping to the spec on this aspect of tag-soup is just a lost cause....
Attached patch Patch to give us IE tokenization behavior here (obsolete) (deleted) — Splinter Review
Incidentally, this patch refixes bug 23791. The funny thing is, this doesn't make us use non-ridiculous amounts of memory on the page, since the page also contains about 250 lines that look like: <p> <a href=pb000194.htm target=_blank>000194</a> "^8<<I am a Target>>" (W) GUID=18a6a80c8604653277b2929c7352b857(?) [2004.02.16 21:10:19] and <i am a target> is a perfectly valid HTML tag that we parse as such and which triggers residual style. (And 125-fold increase in the amount of memory needed by the page. We need about 200 megs RAM to load the page if I remove the 'I' from there, so figure we'd need 25GB RAM to load the thing with this patch. Without this patch, toss in another factor of 5 or so, since there are around 850 lines that have a "<s".)
Comment on attachment 143225 [details] [diff] [review] Patch to give us IE tokenization behavior here I think this is the right thing to do anyway.... we have separate bugs on the fact that residual style is really bloaty in cases when we _do_ need it.
Attachment #143225 - Flags: superreview?(jst)
Attachment #143225 - Flags: review?(choess)
> ...perfectly valid HTML tag... I think you may be stretching the words "perfectly" and "valid" here... :-P + case '\b': + case '\t': + case '\v': Why do we search for those, but not, say, the form feed character?
It's valid from a tokenization standpoint. Because I forgot to add \f to the list. Consider it added.
Ok, what about a zero width space? or a zero width non-joiner? or...?
If there is a simple way to express things that should terminate tagnames, I welcome you telling me what it is. The ugly thing in the XML spec is not something I'm going to stick into this code (never mind that we are in fact violating it). Though I would love to hear what IE does on those testcases.
Try it (warning: takes several minutes to complete): http://www.hixie.ch/tests/adhoc/html/parsing/compat/017.html
Apparently IE totally fails to load that testcase, Ian.
Hixie updated the test to work. This is the output from IE 6.0: Test complete. 65536 characters tested. 0 characters not mentioned below were treated as part of tag names. U+0 - U+8 created no element U+9 - U+d delimits tag names from attributes U+e - U+1f created no element U+20 delimits tag names from attributes U+21 - U+2e created no element U+2f delimits tag names from attributes U+30 - U+3d created no element U+3e delimits tag names from attributes U+3f - U+d7ff created no element U+e000 - U+ffff created no element
bz: So in IE, "spaces" are U+0009 to U+000D, U+0020, U+002F (/), and U+003E (>). I have no clue why IE failed to created characters for the other cases. Probably some silly bug.
Comment on attachment 143225 [details] [diff] [review] Patch to give us IE tokenization behavior here sr=jst
Attachment #143225 - Flags: superreview?(jst) → superreview+
choess, any chance of a review? I'd sorta like to land this as early in 1.8 as I can....
Comment on attachment 143225 [details] [diff] [review] Patch to give us IE tokenization behavior here r=choess, but could you leave this bug on my plate when you check it in? I'd still like to consider an alternative solution that avoids the "arbitrariness" of the current behavior, but it will take a while to be sure it's not fragile.
Attachment #143225 - Flags: review?(choess) → review+
Checked that patch in to trunk. Over to choess, per request.
Assignee: parser → choess
Attachment #143225 - Attachment is obsolete: true
*** Bug 278055 has been marked as a duplicate of this bug. ***
*** Bug 290795 has been marked as a duplicate of this bug. ***
QA Contact: parser
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: