Closed Bug 70828 Opened 24 years ago Closed 20 years ago

Tags that are not closed by the end of the document do strange things.

Categories

(Core :: DOM: HTML Parser, defect)

x86
All
defect
Not set
minor

Tracking

()

RESOLVED FIXED
Future

People

(Reporter: ericmao, Assigned: mrbkap)

References

Details

(Keywords: testcase)

Attachments

(5 files, 1 obsolete file)

I created a simple HTML file: <html> <body bgcolor="white" "bogus extra stuff""""> </body> </html> I opened the file up in Mozilla 0.8 under Win2K, and went to View Page Source. I expected to see the same HTML code that I typed in, despite the fact that the HTML code contained errors. Instead, some of the double-quotes got cleaned up: <html> <body bgcolor="white" bogus extra stuff> </body> </html>
Same in here, under Linux. The question is whether it's a bug: a html file should look like that: <TAG OPT1="value1"> and not <TAG "value1"> ...
sending to parser for triage
Assignee: ben → harishd
Component: XP Apps: GUI Features → Parser
QA Contact: sairuh → janc
Confirming as per user comment, setting OS=All.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 2000 → All
View source should be showing the source as it really is -- otherwise its usefulness for debugging web pages is not very great (and why does anyone ever look at view source otherwise?).
because we want to see what content mozilla decided was worth parsing ;-b otherwise, we'd use telnet to get the real content :) *sigh* this is at least normal severity.
Severity: minor → normal
Keywords: dataloss
QA Contact: janc → bsharma
I appreciate that this is truly annoying, but can it actually alter the semantics of content so that developers are misled rather than just inconvenienced? Futuring for now, will reconsider given a compelling test case.
Target Milestone: --- → Future
Attached file Testcase (icky HTML, I know) (deleted) —
The attached testcase sucks as HTML goes. Nevertheless... Moving the closing quote to before </html> will make things work OK. But the fact remains that view source just silently lost half the source of that page, and more importantly has lost all indications that an error of any kind has occured. I'm not completely sure that it's the same bug, but it seems related...
Blocks: 57724
No longer blocks: 57724
*** This bug has been marked as a duplicate of 57724 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
QA Contact: bsharma → moied
Reopening 57724 dependencies for independent resolution.
Status: RESOLVED → REOPENED
Depends on: 57724
Keywords: testcase
Resolution: DUPLICATE → ---
Outer single quotes are converted to double quotes both in "View->Source" and via the "Save Page As..." dialogue. In the latter case, the saved page will not work because this change breaks the nested quotes.
Roland, I cannot reproduce the error you describe when viewing the source of that attachment (linux trunk build 2002-05-01-21). I see single quotes... The "save as" issue is a separate issue (with the "web page, complete" mode only) and should be filed as a separate bug.
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and <http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss bugs are of critical or possibly higher severity. Only changing open bugs to minimize unnecessary spam. Keywords to trigger this would be crash, topcrash, topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
Severity: critical → minor
Keywords: dataloss
Please actually read bugs before changing the severity, ok?
See also comment #3 of this bug: 223753
*** Bug 233609 has been marked as a duplicate of this bug. ***
This "cleaning up" of the source before interpretation can also cause security issues. Eg my website allows users to add certain simple HTML tags to their posts, but not others. If, however, a user enters this: <B ONCLICK="window.open('http://www.badsite.com')" then mozilla will automatically append the closing > and render the next part of the website's own content with this onclick link. Admittedly, it's my fault in this case for not having a good enough regexp for filtering out bad tags (which I've now fixed), but I do wonder whether Mozilla should be displaying "what an attacker means" rather than "what the designer said". The following was the vulnerable html cleanup code I had used. I've simplified $allowed for clarity. $allowed='\s*\/?\s*(b|i|u|s|pre|tt|ul|ol|li|p|)\s*'; $memo=preg_replace("/<((?!($allowed>))[^<>]*)>/is", "&lt;\\1&gt;", $memo);
--> me since I'm working on this.
Assignee: harishd → mrbkap
Status: REOPENED → NEW
Fixing summary to reflect the bug that this is covering (our awful handling of unclosed tags and the end of the document). For the record: start tags disappear altogether, end tags get duplicated.
Summary: view source makes double-quotes disappear → Tags that are not closed by the end of the document do strange things.
bz, rbs, do you have any more information on the stuff here? If you know anyone who may, please add them to the CC list. My explanation for this bug turned into a 300 word mini-essay on end-of-file handling at the tokenizer level, so instead of filling up this bug with it, I'll attach it (it's also available at: http://www.owlnet.rice.edu/~mrbkap/bug70828explain.html). My explanation is pretty long and comprehensive, but the short of it is that I have a partial patch with a couple of problems left to fix. I'm not sure about some parts of the patch (see the explanation). Any suggestions on my explanation or solution for this are welcome! Also, I can generalize my explanation and put it up as parser documentation (on EOF handling in the tokenizer) somewhere (since htmlparser documentation is terribly lacking), I just need to know what format/where to put it.
Attached file Explanation of this bug. (deleted) —
I'm attaching this to the bug so that if/when I take the explanation down from the internet, people looking into parser code can still find it here (and don't receive a 404).
Attached patch work in progress (obsolete) (deleted) — Splinter Review
This is a partial fix for this bug. I'm still unsure as to the nsScanner::IsIncremental()/mIsFinalChunk decision and there are a couple of (easier) bugs that I'd like to fix in this code before I spend more time on this bug.
Comment on attachment 161187 [details] [diff] [review] work in progress I've decided to go with the existing code's idea and fix this (for the most part) with nsScanner::IsIncremental(). I'll post a new patch later this week.
Attachment #161187 - Attachment is obsolete: true
Attached patch patch v2 (deleted) — Splinter Review
This patch fixes this bug. Basically, because of document.write(), mIsFinalChunk is useless to us (this is the reason it isn't really used anywhere in the parser). Basically, I went through and each place we can receive EOF, and added a check to see if the document was out of data. This also fixes the bug where the document: "<script>foo;</script " would show up as "<script>foo;</script</script" in view-source. I'm not sure if I was just unobservant, but my debug build seems to be viewing source very slowly (1/2 to 1/4 of a second lag on slashdot.org). I'm unsure if it's my patch, because when I locally backed it out, I still saw the lag. If someone could test and see if this does adversely view-source performance, it would be great.
Comment on attachment 162547 [details] [diff] [review] patch v2 rbs, could you take a look at this? An additional note: the changes to nsScanner::ReadTagIdentifier are because if we ran out of data before reaching an invalid character, we wouldn't end up appending what we had already.
Attachment #162547 - Flags: review?(rbs)
Note that testing performance in a debug build is very pointless....
Comment on attachment 162547 [details] [diff] [review] patch v2 I think you are better off at this point to find out how to run the parser regression tests. They will add another degree of confidence. This patch is invasive and hard to assess. I am witholding my r= pending what you see from the regression tests.
Attachment #162547 - Flags: review?(rbs)
Comment on attachment 162547 [details] [diff] [review] patch v2 Rerequesting r= after running the parser regression tests. The only two testcases to change were quote002.html and quote003.html, both of which exhibit the problem that this bug is trying to fix (unclosed start tag at the end of the document gets dropped).
Attachment #162547 - Flags: review?(rbs)
Comment on attachment 162547 [details] [diff] [review] patch v2 r=rbs
Attachment #162547 - Flags: review?(rbs) → review+
Comment on attachment 162547 [details] [diff] [review] patch v2 bz, asking you for sr since I'm touching nsScanner::ReadTagIdentifier. again.
Attachment #162547 - Flags: superreview?(bzbarsky)
Comment on attachment 162547 [details] [diff] [review] patch v2 This patch misses a case: the document |<a b| will lose the 'b'. I have a fix in my tree, but don't have time to attach it now. The patch is pretty much the same, except for an extra check in CAttributeToken::Consume().
Attached patch patch v2 (deleted) — Splinter Review
This makes us not lose the last attribute in an unclosed tag.
Comment on attachment 163144 [details] [diff] [review] patch v2 This should be the final patch. The differences between this patch are all in CAttributeToken::Consume(). I also fixed some typos and cleaned up some comments.
Attachment #163144 - Flags: superreview?(bzbarsky)
Attachment #162547 - Flags: superreview?(bzbarsky)
Comment on attachment 163144 [details] [diff] [review] patch v2 In an attempt to get this in faster. I'm playing with sr's. Trying roc first. rbs, I'm asking for r= on this one (for the attribute changes).
Attachment #163144 - Flags: superreview?(roc)
Attachment #163144 - Flags: superreview?(bzbarsky)
Attachment #163144 - Flags: review?(rbs)
Comment on attachment 163144 [details] [diff] [review] patch v2 r=rbs
Attachment #163144 - Flags: review?(rbs) → review+
Attachment #163144 - Flags: superreview?(roc) → superreview+
Fix checked in (albeit with the wrong bug number in the comment).
Status: NEW → RESOLVED
Closed: 23 years ago20 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: