Closed Bug 64799 Opened 24 years ago Closed 21 years ago

textarea parsing: case of <tags> gets corrupted

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: djoham, Unassigned)

References

Details

(Keywords: compat, testcase)

Attachments

(9 files, 2 obsolete files)

From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.17-21mdk i686; en-US; m18) Gecko/20010109 BuildID: 2001010908 In writing web applications, I generally embedd raw XML into textares to load it into a JavaScript XML parser. The code generally looks something like this: <TEXTAREA><TAG1>value<TAG2>value</TAG2></TAG1></TEXTAREA> I've heard conflicting statements that this is either valid or invalid code. Nevertheless, it works in NN4, IE4+ and NN6. The problem is that now in Mozilla, some of the XML can have it's case changed from upper to lower case. Specifically, when the XML tag is <TITLE>. I have no idea why. This is a recent regression. Mozilla did not display this behavour in the mid-December builds Reproducible: Always Steps to Reproduce: Open the test case I'll submit. Notice that what is displayed in the textarea is not what is actually in the code. Actual Results: the ending title tag is now in lower case. Expected Results: the ending title tag should be in upper case Yes, I suppose it would be possible to escape out all of the less than and greater than signs. The problem is the complexity that that solution introduces since I'll be dealing with escaped escape characters at times in XML (since less thans and what not are escaped in XML also) It could get really ugly. It would be very nice if Mozilla could work the same way as NN4, NN6 and IE4+
Attached file test case (deleted) —
CONFIRMED, something screwed up is happening. -> Parser However, it should be noted that this is really invalid and not at all the "official" way of doing stuff like this. ;-)
Assignee: rods → harishd
Status: UNCONFIRMED → NEW
Component: HTML Form Controls → Parser
Ever confirmed: true
Keywords: testcase
QA Contact: bsharma → janc
Summary: Mozilla sometimes modifies the case of XML embedded in TEXTAREAS → Mozilla sometimes modifies the case of XML embedded in TEXTAREAS
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.1
This bug has been marked "future" because the original netscape engineer working on this is over-burdened. If you feel this is an error, that you or another known resource will be working on this bug,or if it blocks your work in some way -- please attach your concern to the bug for reconsideration -----
Target Milestone: mozilla0.9.1 → Future
updated qa contact.
QA Contact: janc → bsharma
Ian, just to let you know, this is the only way (that I know of) to manipulate XML client side, in a cross browser manner. Mostly because Netscape 4.x can't get values out of a div or anything like that. So if you are trying to reduce load on your server (for instance) by off loading xml parsing to the client, you need to do this, and it breaks in mozilla 0.9 I'm not saying it's any more important then 'Future', but that this is a legitimate thing to do.
Ryley: in the real world, you're right. I was just pointing out that having raw "<" and ">" (and "&") characters in <textarea> elements is invalid per HTML.
Just like to add a second on this one and give some more specific bug behavior... I have a basic static content management system that uses HTML forms, and by default all tags are put in the database in uppercase... but when I pull back down from my database and put the raw html into a textarea field, the closing tag of a binary html tag is switched to lowercase... the opening tag (or any tag that doesn't begin with a forward slash) maintains its case... This just hasn't been touched on in forever... just wanted to make sure it wasn't forgotten, minor as it may be...
QA Contact: bsharma → moied
*** Bug 96111 has been marked as a duplicate of this bug. ***
*** Bug 114759 has been marked as a duplicate of this bug. ***
Quick comment.... Escaping: & --> &amp; < --> &lt; > --> &gt; in text you put in a textarea by default will make your code compliant with the HTML spec, will make it work peachy with NS4, IE5, lynx, and Mozilla (all I have to test) and will generally make the world a better place.
*** Bug 127043 has been marked as a duplicate of this bug. ***
OS: Linux → All
Hardware: PC → All
Adding additional testcase from bug 127043 If you create a plain HTML page with the following source code in it: <HTML> <HEAD> <TITLE>Untitled</TITLE> </HEAD> <BODY> <FORM> <TEXTAREA> <SCRIPT> </TEXTAREA> </FORM> </BODY> </HTML> When this is rendered in Mozilla it will add a spurious </script> tag within the <textarea>. Also if you amend that source and add a closing (</SCRIPT>) tag (in upper case), Mozilla displays the opening (<SCRIPT>) tag in upper case as before, but the closing (</SCRIPT>) tag is now displayed as lower case, again changing the form input. Possible Dataloss?
Attached file Additional test case (deleted) —
This is the testcase mentioned in comment 12 expanded slightly and attached as an actual testcase. It demonstrates both Mozilla changing the case of tags within text areas, and adding its own closing tags that are visible both on the page and in the source, even though they aren't actually in the original source.
*** Bug 167616 has been marked as a duplicate of this bug. ***
*** Bug 181065 has been marked as a duplicate of this bug. ***
*** Bug 183545 has been marked as a duplicate of this bug. ***
With URI-encoded XML no workaround is possible (I've tried out) :)
Comment on attachment 111161 [details] Another testcase - verified with Mozilla 1.3a The -Label- tag surrounding the -Name- tag should start uppercas in both cases (start and end tag). Mozilla version: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3a) Gecko/20021212
Attachment #111161 - Attachment description: The -Label- tag surrounding the -Name- tag should start uppercas in both cases (start and end tag) → Another testcase - verified with Mozilla 1.3a
For what it's worth, all of these examples are invalid; you need to escape markup inside textarea. i.e.: <textarea><tag></tag></textarea> is wrong <textarea>&lt;tag>&lt;/tag></textarea> is right.
Keywords: compat
*** Bug 189772 has been marked as a duplicate of this bug. ***
*** Bug 190985 has been marked as a duplicate of this bug. ***
*** Bug 196828 has been marked as a duplicate of this bug. ***
Flags: blocking1.4a+
mlux, please don't set flags if you don't know how. thanks.
Flags: blocking1.4a+
np - setting the flag to "blocking 1.4a" was kind of "tryout" if someone would notice, 'cause in my opinion this is a really _serious_ bug. I'm working on an internet based multimedia library and I always forced Mozilla as number one app for using it because of its support for numerous features we need (i.e. quite the same drag and drop support on both linux and windows), but since we came across this bug we are using ie & windows again :( Since this bug has been identified in Jan. 2001 I wonder when there will be dealt with :) Please don't feel affronted with my comments - I'm using mozilla since milestone xx (don't know which) and i really like it!
Mozilla does not only convert your end tags to lowercase; in some case it entirely modifies textarea content. I have made a page available at http://www.iplataforma.org/test_textarea.php to test this bug. It basically shows in the textarea what you just submited. To test some unexpected behaviours: 1. Write <U>test</U> in the textarea and submit the query. Second tag appears in lowercase. 2. Now write <ABC>test</ABC> and press submit. Text stands intact. By trying some combinations, it seems the problem only happens with "known" tags. But here's a more serious one: 1. Write the following in the textarea: </B/U/I>test</IFRAME/BODY> and submit the query. 2. Result is </b>test</iframe>. By escaping > and < the error disappears, but it might be a good idea to parse everything between <TEXTAREA> and </TEXTAREA> just as a string.
Attached file XML within textarea (deleted) —
This attachment contains XML with tag names containing '.' separators (valid NameChars). Closing tag names are truncated in Mozilla.
I've added an attachment that shows even greater mangling of un-escaped XML markup within a textarea. The XML tag names contain '.' separators (e.g., com.broadsoft.protocols.nsoss.BroadsoftDocument). Interestingly, the opening tags are unmolested, but closing tag names of this form are truncated at the first '.' As remarked above, this data does not conform to #PCDATA, but is clearly #CDATA. Escaping the entities resolves the problem and makes the textarea render properly. This also works fine in IE, so perhaps fixing this bug would be more of a convenience than a high priority. --kirby
.
Assignee: harishd → parser
Status: ASSIGNED → NEW
Summary: Mozilla sometimes modifies the case of XML embedded in TEXTAREAS → textarea parsing bug: case of <tags> gets corrupted
QA Contact: moied → ian
Summary: textarea parsing bug: case of <tags> gets corrupted → textarea parsing: case of <tags> gets corrupted
Target Milestone: Future → ---
*** Bug 223069 has been marked as a duplicate of this bug. ***
*** Bug 223275 has been marked as a duplicate of this bug. ***
So in case anyone cares enough to try to fix this, the only reason start tags work is that we hack them to work.... see the call to PreserveToken() at http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsHTMLTokenizer.cpp#745, which writes some data into the mTrailingContent member of the start token. End tokens do not get such special treatment at the moment; doing that would be a possible way to fix this bug (though note the comment at http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsHTMLTokenizer.cpp#733)
What does the parser do for <style> blocks? I know <textarea> isn't supposed to be a CDATA block but I get the feeling we might be better off parsing it as one anyway (do people comment out stuff in <textarea>s?)
Hmmm. Although we're obviously losing the correctness battle on this judging from the number of bugs filed, I can think of one reason we might not want to rush right in and implement it as CDATA. I get the impression that a lot of these textareas causing problems are being used to enter HTML that gets stored server-side, then can be pulled back up to edit in the textarea again. If someone (accidentally or maliciously) enters a "</textarea>" into there, it'll break the entire form. So failing to properly escape markup has, I think, security implications beyond "nuisance HTML correctness issue" and I don't think it's wise to further obscure the issue for people.
Well in that case we should stop doing anything at all with elements inside <textarea>. What we do now is a kind of half-assed mess. (Note: I didn't mean CDATA block really. That would stop parsing entities as well, and we want to keep parsing those.) "What does IE do?"
Yeah, IE parses things as you've described. (SGML-wise, it would be RCDATA, if it weren't for the fact that they "screw up" comments.) I'm going to go drink turpentine now...
screw up meaning what, exactly? ignore, or include literally? We really should do what IE does here. Yeah, it'll cause problems with </textarea>, but then those problems already exist if you think about it.
We have implemented a mainframe-driven Web-based development environment. Using the [/textarea] tag in our pages is a known issue and has been dealt with for five years. I don't think that "it might break Web-configured apps" is a valid reason for holding up a workaround or fix of the case-translation issue, because it's not a new aspect of the object's behavior in that context. Would it be a good idea if this behaviour could be configured as an attribute? Something like [textarea style="content:CDATA;"] ?
No need for an attribute, but note that if you are serving <textarea> content with unescaped "<" characters, your markup is invalid.
Attached file More evil testcase (deleted) —
Attached patch The futility of it all (deleted) — Splinter Review
This fixes all testcases attached to this bug and all its duplicates except for attachment 84153 [details] and its equivalent in its original bug. To fix that we really would have to go to parsing the content as PCDATA or the like instead of actually tokenizing it. That could probably be done with a good bit of pain and suffering (probably by hacking CTextToken::ConsumeUntil or better hacking it's caller to handle the text token stopping at entities if desired (and thus changing the contents of <textarea> from a token stream of all sorts of tokens to a token stream of just text and entity tokens)). That change sounds big and scary to me, given the regression potential of changing ConsumeUntil()..... This patch is also scary (as anything in parser-land), but I _think_ it's not too dangerous. Ian? Chris? Thoughts?
Attached patch I lied (obsolete) (deleted) — Splinter Review
OK, this fixes all testcases in this bug and its dups. I haven't done much testing past making sure that it does not regress view-source of XML with <script> tags in it and testing the testcases in these bugs. If someone could apply this and test the hell out of it, that would be much appreciated.
Attached file Even more evil testcase (deleted) —
Comment on attachment 135161 [details] [diff] [review] I lied choess? heikki? What do you think (other than "PreserveToken() probably needs renaming if this patch is to go in"?) I don't really like this setup, but I can't think of anything sane, simple, and non-introsive on the tokens to replace it... I think I'm willing to stake my peace of mind on this patch being "correct" as far as it goes...
Attachment #135161 - Flags: superreview?(hjtoi-bugzilla)
Attachment #135161 - Flags: review?(choess)
choess, heikki, please let me know if there is anything I can do to facilitate reviewing this... (if the code needs more comments, tell me so ;) ). I'd sort of like to get this in for 1.6b, and freeze is in less than a week.
Comment on attachment 135161 [details] [diff] [review] I lied In the code itself the only thing that concerns me is the new data member. A lot of work went into making the parser consume as little memory as possible, and this would be a step backwards. How much does this cost us in memory on some typical pages? Being a fairly big parser change, the minimum testing that should happen before checkin, IMO: * run the parser tests * add all these testcase here into the parser test suite * run page load tests * visually observe everything looks ok on one round * make sure perf does not regress * get a bunch of people run with these changes for a while * Finally check in during an alpha cycle (parser breakages can cause very subtle problems that take a long time to notice)
My builds are up and running again; I think I can do this tomorrow.
Hmmm. Looking over the patch, my gut feeling is that we'd be better off in terms of compat by treating textarea as PCDATA. The changes look like they'd be fairly small (textarea.CanContainCDATA should be true), and we already have tried-and-true CDATA parsing in the tree for <script> and <style>. Furthermore, IE appears to treat <textarea> in just this way, which I think tends to reduce the risk we'll break something. Given that aIgnoreComments would be true in ConsumeUntil(), I think the prospect is not quite as scary as initially assessed. Heikki, bz, do you concur? I wouldn't necessarily be opposed to the current patch, but I think a fix to treat <textarea> as CDATA content would actually be less confusing down the road (doesn't split the content across start, text, end tokens for instance).
Heikki, I'm not sure I can estimate the memory usage here... Isn't all this stuff transient anyway? How do I run the parser tests? How do I usefully run the pageload tests from where I sit? I know I won't be able to get useful numbers out of them, but I could at least do the visual inspection bit. And yes, this is totally alpha material at this stage in the game (I wasn't cced on the bug :( ). Chris, does IE not deal with character entities in <textarea>? I find that difficult to believe... <script> and <style> do NOT deal with character entities, last I checked.
The memory used by these objects might stick around for a while, I can no longer remember exactly (and don't have time to go dig). The tests are in htmlparser/tests/html, check the README file on how to run the tests. Regarding pageload numbers, I guess in your case you would just need to check in and observe the numbers (or get someone else who has local access to the testsuite test it for you - I don't have the test material anymore, but I was able to run them locally with very little noise).
> The tests are in htmlparser/tests/html, check the README file Oh, those... It looks like those are Windows-only -- the various scripts hardcode things like exe extensions... If I correct that, the test app is still not getting started correctly, etc. Last time I needed to do this I spent 15-20 hours on getting the damn thing to run, then just gave up and asked Harish to run it. If someone with a Windows tree (where this stuff hopefully works) could run the tests, that would be great...
Oh, and I guess I will try to see whether I can avoid using that extra member...
OK, testing on IE reveals that entities in textareas are recognized as markup (but not comments or tags other than </textarea>), so the IE content model is closer to RCDATA.
Right. So we can't reuse the <style>/<script> parser...
Attached file htmlparser/tests/html log (obsolete) (deleted) —
Note to self -- tokens are arena-allocated, and creating a new token size is bad. I really need to not add that member.
Further note to self -- any solution I adopt must be able to return the following pieces of information: 1) The "tagname" (whatever CEndToken normally consumes). This must be returned by GetStringValue() 2) The full "trailing content" string (or its equivalent, for use by the textarea stuff). It's not clear to me how I can manage this given one string to work with.... :(
Attachment #135161 - Flags: superreview?(hjtoi-bugzilla)
Attachment #135161 - Flags: review?(choess)
Attached patch Slightly different approach (deleted) — Splinter Review
The basic ideas here are as follows: 1) Let tokens know whether we are preserving content or not as they consume 2) Have the end token save its case when preserving content 3) Allow identifiers to contain '.' (that fixes the XML testcase in this bug that does that) 4) Preserve case of nodes that close CDATA sections when preserving content The rest is random cleanup that I felt justified. This fixes all the testcases except "More evil testcase"; if people do that and think it should work, that's their problem, really. It's not valid XML, so there should be little call to do it. I don't think item 3 above should cause any problems, but it's really optional if we decide it's no good to have that change. tor, could I possibly impose on you to run those regression tests again?
Attachment #135161 - Attachment is obsolete: true
Attachment #138735 - Attachment is obsolete: true
so you mean that on forms that are broken (and thus forget to escape content on the preview page) I can't type: bla bla </ian hickson> ...and have it roundtrip? ;-)
Attached file regression test log of new patch (deleted) —
Comment on attachment 138857 [details] [diff] [review] Slightly different approach Ian, indeed. tor, thanks! Looks like the only change related to the '.' thing is also non-substantive. choess, heikki, could you please review this one? I'd really like to land this before alpha freezes (in a week and a half).
Attachment #138857 - Flags: superreview?(hjtoi-bugzilla)
Attachment #138857 - Flags: review?(choess)
Comment on attachment 138857 [details] [diff] [review] Slightly different approach 1) Adding the "." is fully spec-compliant. 2) I like Ian's quasi-RCDATA suggestion in bug 231631 somewhat better from a philosophical perspective, but this patch certainly works and is here now, so go with it.
Attachment #138857 - Flags: review?(choess) → review+
Attachment #138857 - Flags: superreview?(hjtoi-bugzilla) → superreview?(jst)
Comment on attachment 138857 [details] [diff] [review] Slightly different approach sr=jst
Attachment #138857 - Flags: superreview?(jst) → superreview+
Checked in to 1.7a trunk. No noticeable Tp effect.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Blocks: 216994
Blocks: 222523
*** Bug 216560 has been marked as a duplicate of this bug. ***
*** Bug 227050 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: