Closed Bug 15378 Opened 25 years ago Closed 15 years ago

Newlines and spaces in html src get passed through to output

Categories

(Core :: DOM: HTML Parser, defect, P5)

Tracking

RESOLVED WONTFIX
Future

People

(Reporter: akkzilla, Unassigned)

When parsing HTML, newlines in the HTML source get passed through to the output sink as part of the string in a parser node of type eHTMLTag_text. The output sink passes these verbatim into the plaintext output. I'm not sure what the parser should be doing in this case; I was under the impression that the newline should be passed as a separate tag of type eHTMLTag_newline, to make it easier to separate out. But if the parser is behaving as expected, I can change the output sink to parse the string and filter the newlines out. An example is in htmlparser/tests/outsinks/simple.html when converting to plaintext. On Linux, you can reproduce this easily by running the test: TestOutput -i text/html -o text/plain -f 0 -w 0 OutTestData/simple.html ; on other platforms, try highlighting text and pasting it into something that accepts plaintext, or loading the file in the editor and doing "Debug->Output to Text", or highlighting and doing "test selection". The parser passes "page.\nHere is some " all as one text node, where I would have expected separate nodes for "page.", eHTMLTag_newline, "Here is some", and eHTMLTag_whitespace (and I thought the parser used to separate the tags out that way).
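For reference, the split the reporter expected can be sketched in a few lines of Python. This is only an illustration of the expected output, not the Mozilla parser: the function name `split_text_run` and the token-kind strings are hypothetical stand-ins for eHTMLTag_text, eHTMLTag_newline, and eHTMLTag_whitespace.

```python
import re

# Hypothetical token kinds standing in for eHTMLTag_text,
# eHTMLTag_newline and eHTMLTag_whitespace.
TEXT, NEWLINE, WHITESPACE = "text", "newline", "whitespace"


def split_text_run(run):
    """Split one text run into text, newline and edge-whitespace tokens.

    Interior spaces stay inside the text token; only newlines and
    leading/trailing spaces or tabs become separate tokens.
    """
    tokens = []
    # The capturing group keeps each newline sequence as its own piece.
    for piece in re.split(r"(\r\n|\r|\n)", run):
        if piece == "":
            continue
        if piece in ("\r\n", "\r", "\n"):
            tokens.append((NEWLINE, piece))
            continue
        lead, core, trail = re.fullmatch(r"([ \t]*)(.*?)([ \t]*)", piece).groups()
        if lead:
            tokens.append((WHITESPACE, lead))
        if core:
            tokens.append((TEXT, core))
        if trail:
            tokens.append((WHITESPACE, trail))
    return tokens
```

Calling `split_text_run("page.\nHere is some ")` on the string from the report yields four tokens: the text "page.", a newline, the text "Here is some", and a trailing whitespace token.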
Summary: Newlines in html src get passed through to output → Newlines and spaces in html src get passed through to output
Another, related, question. On the same page, there's the line: Here is a <a href="http://www.mozilla.org">link to the mozilla.org</a> page. The parser sends "Here is a " as a chunk (note the trailing space), but after the link, it sends "page.\nHere is some " -- note the space between the link and "page" isn't part of the string (and it wasn't sent as a separate text node, either). What's the rule on which spaces/newlines get embedded into text nodes and which ones don't?
To me the rule seems to be that newlines and whitespace are part of a text node but not of a tag, i.e., in the example

Here is a <a href="http://www.mozilla.org">link to the mozilla.org</a> page.

"Here is a " ---> Trailing whitespace is part of the string
"<a href="http://www.mozilla.org">link to the mozilla.org</a>" ---> No whitespace
"Whitespace token" ---> A separate token is created since it follows a tag
"page" ---> No leading whitespace
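The rule as described in this comment can be sketched roughly as follows. This is a toy tokenizer written for illustration under that description; it is not the actual parser code, and the name `tokenize`, the token-kind strings, and the simplistic tag regex are all assumptions.

```python
import re

# Simplistic tag matcher for illustration; real HTML tokenization is
# considerably more involved.
_TAG_RE = re.compile(r"(<[^>]+>)")


def tokenize(fragment):
    """Tokenize per the rule described above: whitespace before a tag
    stays in the text token, whitespace right after a tag becomes a
    separate whitespace token."""
    tokens = []
    after_tag = False
    for piece in _TAG_RE.split(fragment):
        if piece == "":
            continue
        if _TAG_RE.fullmatch(piece):
            tokens.append(("tag", piece))
            after_tag = True
            continue
        if after_tag:
            stripped = piece.lstrip(" \t\r\n")
            lead = piece[: len(piece) - len(stripped)]
            if lead:
                tokens.append(("whitespace", lead))
            piece = stripped
        if piece:
            tokens.append(("text", piece))
        after_tag = False
    return tokens
```

Run on the example line, this produces the text "Here is a " (trailing space kept), the start tag, the link text, the end tag, a separate whitespace token, and then "page." with no leading space.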
Target Milestone: M12
It's not a matter of whether it's part of the string or part of the tag, since in both cases the whitespace is adjacent to a text node. It seems like trailing whitespace is being considered part of the tag, but leading whitespace isn't. This is going to take a fair amount of effort to re-parse in the output sinks, so I want to make sure that this is really what's intended (and exactly what the rules are) before trying to write that parsing code. The rule it's following right now doesn't make much sense to me. Are things like this (e.g. the meaning of whitespace and newline nodes and when they're used) written down anywhere?
Akkana, what should I do with this bug? BTW, I'm still looking for proper documentation on whitespace handling whenever I find time.
Could you put a short document up on mozilla.org explaining the current whitespace handling (trailing whitespace gets included in the node but leading whitespace doesn't, or whatever the rule is) then assign the bug back to me? I'll make the content sinks do whatever they have to, but I'd like to have some constant document to refer to so that I have a reminder of what I'm trying to make them do.
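As a rough idea of what the sink-side workaround might look like, here is a minimal sketch of a plaintext sink that collapses source whitespace regardless of how the parser chunks it. The class `PlainTextSink` and its methods are hypothetical and are not the Mozilla output sink API; the sketch also assumes that collapsing every whitespace run to a single space is the desired behavior, which does not hold for preformatted content.

```python
import re


class PlainTextSink:
    """Hypothetical plaintext sink that collapses runs of spaces and
    newlines to a single space, whichever text node they arrive in."""

    def __init__(self):
        self._parts = []
        self._pending_space = False

    def add_text(self, text):
        # Walk alternating runs of whitespace and non-whitespace.
        for run in re.findall(r"\s+|\S+", text):
            if run.isspace():
                self._pending_space = True
                continue
            # Emit one space between words, but never a leading one.
            if self._pending_space and self._parts:
                self._parts.append(" ")
            self._parts.append(run)
            self._pending_space = False

    def result(self):
        return "".join(self._parts)
```

Because the pending space is tracked across calls, it does not matter whether a space arrives at the end of one text node, at the start of the next, or as a node of its own.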
Target Milestone: M12 → M14
Priority: P3 → P4
Target Milestone: M13 → M14
Priority: P4 → P5
Target Milestone: M14 → M16
Bulk move of all "Output" component bugs to new "DOM to Text Conversion" component. Output will be deleted as a component.
Component: Output → DOM to Text Conversion
Moving to M19.
Target Milestone: M16 → M19
Status: NEW → ASSIGNED
Target Milestone: M19 → Future
Here's a quote that says what authoring tools should do. Maybe this will give us a clue as to what "user agents" like Mozilla should do. (Where are the W3C recommendations for user agents?) "In order to avoid problems with SGML line break rules and inconsistencies among extant implementations, authors should not rely on user agents to render white space immediately after a start tag or immediately before an end tag. Thus, authors, and in particular authoring tools, should write: <P>We offer free <A>technical support</A> for subscribers.</P> and not: <P>We offer free<A> technical support </A>for subscribers.</P>" -- http://www.w3.org/TR/html4/struct/text.html
Component -> Parser. This is definitely a Parser bug in the handling of newlines, at least for the ones appearing immediately after an opening tag or before a closing tag. The HTML specification indicates the same: http://www.w3.org/TR/html4/appendix/notes.html#notes-line-breaks says that "The following two HTML examples must be rendered identically: <P>Thomas is watching TV.</P> <P> Thomas is watching TV. </P>". This is what I mentioned in bug #75283, comment #18. In fact, if we can do something nicer in the parser to overcome this problem, many (most) of the DOM-TXT serialization bugs would be resolved while retaining a nice view source and indentation.
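The HTML 4 note cited above says that a line break immediately following a start tag must be ignored, as must one immediately before an end tag. A minimal sketch of that rule applied to an element's raw text content might look like this; the helper name `strip_edge_line_breaks` is hypothetical, and the sketch assumes the second example in the spec has a line break on each side of the sentence.

```python
def strip_edge_line_breaks(content):
    """Drop at most one line break right after the start tag and at
    most one right before the end tag; leave everything else alone."""
    # Check "\r\n" first so a CRLF pair counts as one line break.
    for nl in ("\r\n", "\n", "\r"):
        if content.startswith(nl):
            content = content[len(nl):]
            break
    for nl in ("\r\n", "\n", "\r"):
        if content.endswith(nl):
            content = content[: -len(nl)]
            break
    return content
```

With this applied, both forms of the "Thomas is watching TV." paragraph reduce to the same text, while any additional blank lines or spaces inside the element are left for later whitespace handling.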
Component: DOM to Text Conversion → Parser
Blocks: 107927
But they are rendered identically, at least as far as I can see. What actually is in the content tree is a different matter.
Blocks: 147355
At the same time, Composer adds whitespace to the saved HTML. It seems to just stick whitespace in at random. After editing with Composer, all my HTML now looks like: "<p>We are a small local firm offering reliable</p>" If I have a CDATA section like this: <style type="text/css"> /*<![CDATA[*/ table.all { max-width:43em; width:expression( document.body.clientWidth > (650/12) * parseInt(document.body.currentStyle.fontSize)? "40em": "auto" ); } /*]]>*/ </style> Composer adds an extra newline between each line, every time the file is edited. I'd like the final HTML result from Composer to be neatly formatted, so that hand-editing is possible.
Assignee: harishd → nobody
Status: ASSIGNED → NEW
QA Contact: sujay → parser
This should be INVALID/WONTFIX per HTML5.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX