Closed
Bug 64799
Opened 24 years ago
Closed 21 years ago
textarea parsing: case of <tags> gets corrupted
Categories
(Core :: DOM: HTML Parser, defect)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
FIXED
People
(Reporter: djoham, Unassigned)
References
Details
(Keywords: compat, testcase)
Attachments
(9 files, 2 obsolete files)
(deleted),
text/html
|
Details | |
(deleted),
text/html
|
Details | |
(deleted),
text/html
|
Details | |
(deleted),
text/html
|
Details | |
(deleted),
text/html
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
text/html
|
Details | |
(deleted),
patch
|
choess
:
review+
jst
:
superreview+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details |
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.17-21mdk i686; en-US; m18) Gecko/20010109
BuildID: 2001010908
In writing web applications, I generally embedd raw XML into textares to load it
into a JavaScript XML parser. The code generally looks something like this:
<TEXTAREA><TAG1>value<TAG2>value</TAG2></TAG1></TEXTAREA>
I've heard conflicting statements that this is either valid or invalid code.
Nevertheless, it works in NN4, IE4+ and NN6.
The problem is that now in Mozilla, some of the XML can have it's case changed
from upper to lower case. Specifically, when the XML tag is <TITLE>. I have no
idea why.
This is a recent regression. Mozilla did not display this behavour in the
mid-December builds
Reproducible: Always
Steps to Reproduce:
Open the test case I'll submit. Notice that what is displayed in the textarea is
not what is actually in the code.
Actual Results: the ending title tag is now in lower case.
Expected Results: the ending title tag should be in upper case
Yes, I suppose it would be possible to escape out all of the less than and
greater than signs. The problem is the complexity that that solution introduces
since I'll be dealing with escaped escape characters at times in XML (since less
thans and what not are escaped in XML also) It could get really ugly.
It would be very nice if Mozilla could work the same way as NN4, NN6 and IE4+
Comment 2•24 years ago
|
||
CONFIRMED, something screwed up is happening. -> Parser
However, it should be noted that this is really invalid and not at all the
"official" way of doing stuff like this. ;-)
Assignee: rods → harishd
Status: UNCONFIRMED → NEW
Component: HTML Form Controls → Parser
Ever confirmed: true
Keywords: testcase
QA Contact: bsharma → janc
Summary: Mozilla sometimes modifies the case of XML embedded in TEXTAREAS → Mozilla sometimes modifies the case of XML embedded in TEXTAREAS
This bug has been marked "future" because the original netscape engineer working
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug,or if it blocks your work in some way
-- please attach your concern to the bug for reconsideration -----
Target Milestone: mozilla0.9.1 → Future
Ian, just to let you know, this is the only way (that I know of) to manipulate
XML client side, in a cross browser manner. Mostly because Netscape 4.x can't
get values out of a div or anything like that. So if you are trying to reduce
load on your server (for instance) by off loading xml parsing to the client, you
need to do this, and it breaks in mozilla 0.9
I'm not saying it's any more important then 'Future', but that this is a
legitimate thing to do.
Comment 6•23 years ago
|
||
Ryley: in the real world, you're right. I was just pointing out that having
raw "<" and ">" (and "&") characters in <textarea> elements is invalid per HTML.
Comment 7•23 years ago
|
||
Just like to add a second on this one and give some more specific bug behavior...
I have a basic static content management system that uses HTML forms, and by
default all tags are put in the database in uppercase... but when I pull back
down from my database and put the raw html into a textarea field, the closing
tag of a binary html tag is switched to lowercase... the opening tag (or any tag
that doesn't begin with a forward slash) maintains its case...
This just hasn't been touched on in forever... just wanted to make sure it
wasn't forgotten, minor as it may be...
Comment 9•23 years ago
|
||
*** Bug 114759 has been marked as a duplicate of this bug. ***
Comment 10•23 years ago
|
||
Quick comment.... Escaping:
& --> &
< --> <
> --> >
in text you put in a textarea by default will make your code compliant with the
HTML spec, will make it work peachy with NS4, IE5, lynx, and Mozilla (all I have
to test) and will generally make the world a better place.
Comment 11•23 years ago
|
||
*** Bug 127043 has been marked as a duplicate of this bug. ***
Updated•23 years ago
|
OS: Linux → All
Hardware: PC → All
Comment 12•23 years ago
|
||
Adding additional testcase from bug 127043
If you create a plain HTML page with the following source code in it:
<HTML>
<HEAD>
<TITLE>Untitled</TITLE>
</HEAD>
<BODY>
<FORM>
<TEXTAREA>
<SCRIPT>
</TEXTAREA>
</FORM>
</BODY>
</HTML>
When this is rendered in Mozilla it will add a spurious </script> tag within the
<textarea>. Also if you amend that source and add a closing (</SCRIPT>) tag (in
upper case), Mozilla displays the opening (<SCRIPT>) tag in upper case as
before, but the closing (</SCRIPT>) tag is now displayed as lower case, again
changing
the form input.
Possible Dataloss?
Comment 13•22 years ago
|
||
This is the testcase mentioned in comment 12 expanded slightly and attached as
an actual testcase.
It demonstrates both Mozilla changing the case of tags within text areas, and
adding its own closing tags that are visible both on the page and in the
source, even though they aren't actually in the original source.
Comment 14•22 years ago
|
||
*** Bug 167616 has been marked as a duplicate of this bug. ***
Comment 15•22 years ago
|
||
*** Bug 181065 has been marked as a duplicate of this bug. ***
Comment 16•22 years ago
|
||
*** Bug 183545 has been marked as a duplicate of this bug. ***
Comment 17•22 years ago
|
||
With URI-encoded XML no workaround is possible (I've tried out) :)
Comment 18•22 years ago
|
||
Comment on attachment 111161 [details]
Another testcase - verified with Mozilla 1.3a
The -Label- tag surrounding the -Name- tag should start uppercas in both cases
(start and end tag).
Mozilla version:
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3a) Gecko/20021212
Attachment #111161 -
Attachment description: The -Label- tag surrounding the -Name- tag should start uppercas in both cases (start and end tag) → Another testcase - verified with Mozilla 1.3a
Comment 19•22 years ago
|
||
For what it's worth, all of these examples are invalid; you need to escape
markup inside textarea. i.e.:
<textarea><tag></tag></textarea> is wrong
<textarea><tag></tag></textarea> is right.
Keywords: compat
Comment 20•22 years ago
|
||
*** Bug 189772 has been marked as a duplicate of this bug. ***
Comment 21•22 years ago
|
||
*** Bug 190985 has been marked as a duplicate of this bug. ***
Comment 22•22 years ago
|
||
*** Bug 196828 has been marked as a duplicate of this bug. ***
Updated•22 years ago
|
Flags: blocking1.4a+
Comment 23•22 years ago
|
||
mlux, please don't set flags if you don't know how. thanks.
Flags: blocking1.4a+
Comment 24•22 years ago
|
||
np - setting the flag to "blocking 1.4a" was kind of "tryout" if someone would
notice, 'cause in my opinion this is a really _serious_ bug. I'm working on an
internet based multimedia library and I always forced Mozilla as number one
app for using it because of its support for numerous features we need (i.e.
quite the same drag and drop support on both linux and windows), but since we
came across this bug we are using ie & windows again :(
Since this bug has been identified in Jan. 2001 I wonder when there will be
dealt with :)
Please don't feel affronted with my comments - I'm using mozilla since
milestone xx (don't know which) and i really like it!
Comment 25•21 years ago
|
||
Mozilla does not only convert your end tags to lowercase; in some case it
entirely modifies textarea content.
I have made a page available at http://www.iplataforma.org/test_textarea.php to
test this bug. It basically shows in the textarea what you just submited.
To test some unexpected behaviours:
1. Write <U>test</U> in the textarea and submit the query. Second tag appears in
lowercase.
2. Now write <ABC>test</ABC> and press submit. Text stands intact.
By trying some combinations, it seems the problem only happens with "known" tags.
But here's a more serious one:
1. Write the following in the textarea:
</B/U/I>test</IFRAME/BODY>
and submit the query.
2. Result is </b>test</iframe>.
By escaping > and < the error disappears, but it might be a good idea to parse
everything between <TEXTAREA> and </TEXTAREA> just as a string.
Comment 26•21 years ago
|
||
This attachment contains XML with tag names containing '.' separators (valid
NameChars). Closing tag names are truncated in Mozilla.
Comment 27•21 years ago
|
||
I've added an attachment that shows even greater mangling of un-escaped XML
markup within a textarea. The XML tag names contain '.' separators (e.g.,
com.broadsoft.protocols.nsoss.BroadsoftDocument). Interestingly, the opening
tags are unmolested, but closing tag names of this form are truncated at the
first '.'
As remarked above, this data does not conform to #PCDATA, but is clearly #CDATA.
Escaping the entities resolves the problem and makes the textarea render
properly. This also works fine in IE, so perhaps fixing this bug would be more
of a convenience than a high priority.
--kirby
Updated•21 years ago
|
Summary: Mozilla sometimes modifies the case of XML embedded in TEXTAREAS → textarea parsing bug: case of <tags> gets corrupted
Updated•21 years ago
|
QA Contact: moied → ian
Summary: textarea parsing bug: case of <tags> gets corrupted → textarea parsing: case of <tags> gets corrupted
Target Milestone: Future → ---
Comment 29•21 years ago
|
||
*** Bug 223069 has been marked as a duplicate of this bug. ***
Comment 30•21 years ago
|
||
*** Bug 223275 has been marked as a duplicate of this bug. ***
Comment 31•21 years ago
|
||
So in case anyone cares enough to try to fix this, the only reason start tags
work is that we hack them to work.... see the call to PreserveToken() at
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsHTMLTokenizer.cpp#745,
which writes some data into the mTrailingContent member of the start token.
End tokens do not get such special treatment at the moment; doing that would be
a possible way to fix this bug (though note the comment at
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsHTMLTokenizer.cpp#733)
Comment 32•21 years ago
|
||
What does the parser do for <style> blocks? I know <textarea> isn't supposed to
be a CDATA block but I get the feeling we might be better off parsing it as one
anyway (do people comment out stuff in <textarea>s?)
Comment 33•21 years ago
|
||
Hmmm. Although we're obviously losing the correctness battle on this judging
from the number of bugs filed, I can think of one reason we might not want to
rush right in and implement it as CDATA. I get the impression that a lot of
these textareas causing problems are being used to enter HTML that gets stored
server-side, then can be pulled back up to edit in the textarea again. If
someone (accidentally or maliciously) enters a "</textarea>" into there, it'll
break the entire form. So failing to properly escape markup has, I think,
security implications beyond "nuisance HTML correctness issue" and I don't think
it's wise to further obscure the issue for people.
Comment 34•21 years ago
|
||
Well in that case we should stop doing anything at all with elements inside
<textarea>. What we do now is a kind of half-assed mess. (Note: I didn't mean
CDATA block really. That would stop parsing entities as well, and we want to
keep parsing those.)
"What does IE do?"
Comment 35•21 years ago
|
||
Yeah, IE parses things as you've described. (SGML-wise, it would be RCDATA, if
it weren't for the fact that they "screw up" comments.) I'm going to go drink
turpentine now...
Comment 36•21 years ago
|
||
screw up meaning what, exactly? ignore, or include literally?
We really should do what IE does here. Yeah, it'll cause problems with
</textarea>, but then those problems already exist if you think about it.
Comment 37•21 years ago
|
||
We have implemented a mainframe-driven Web-based development environment. Using
the [/textarea] tag in our pages is a known issue and has been dealt with for
five years. I don't think that "it might break Web-configured apps" is a valid
reason for holding up a workaround or fix of the case-translation issue, because
it's not a new aspect of the object's behavior in that context.
Would it be a good idea if this behaviour could be configured as an attribute?
Something like [textarea style="content:CDATA;"]
?
Comment 38•21 years ago
|
||
No need for an attribute, but note that if you are serving <textarea> content
with unescaped "<" characters, your markup is invalid.
Comment 39•21 years ago
|
||
Comment 40•21 years ago
|
||
This fixes all testcases attached to this bug and all its duplicates except for
attachment 84153 [details] and its equivalent in its original bug. To fix that we really
would have to go to parsing the content as PCDATA or the like instead of
actually tokenizing it.
That could probably be done with a good bit of pain and suffering (probably by
hacking CTextToken::ConsumeUntil or better hacking it's caller to handle the
text token stopping at entities if desired (and thus changing the contents of
<textarea> from a token stream of all sorts of tokens to a token stream of just
text and entity tokens)). That change sounds big and scary to me, given the
regression potential of changing ConsumeUntil().....
This patch is also scary (as anything in parser-land), but I _think_ it's not
too dangerous. Ian? Chris? Thoughts?
Comment 41•21 years ago
|
||
OK, this fixes all testcases in this bug and its dups. I haven't done much
testing past making sure that it does not regress view-source of XML with
<script> tags in it and testing the testcases in these bugs. If someone could
apply this and test the hell out of it, that would be much appreciated.
Comment 42•21 years ago
|
||
Comment 43•21 years ago
|
||
Comment on attachment 135161 [details] [diff] [review]
I lied
choess? heikki? What do you think (other than "PreserveToken() probably needs
renaming if this patch is to go in"?) I don't really like this setup, but I
can't think of anything sane, simple, and non-introsive on the tokens to
replace it...
I think I'm willing to stake my peace of mind on this patch being "correct" as
far as it goes...
Attachment #135161 -
Flags: superreview?(hjtoi-bugzilla)
Attachment #135161 -
Flags: review?(choess)
Comment 44•21 years ago
|
||
choess, heikki, please let me know if there is anything I can do to facilitate
reviewing this... (if the code needs more comments, tell me so ;) ). I'd sort
of like to get this in for 1.6b, and freeze is in less than a week.
Comment on attachment 135161 [details] [diff] [review]
I lied
In the code itself the only thing that concerns me is the new data member. A
lot of work went into making the parser consume as little memory as possible,
and this would be a step backwards. How much does this cost us in memory on
some typical pages?
Being a fairly big parser change, the minimum testing that should happen before
checkin, IMO:
* run the parser tests
* add all these testcase here into the parser test suite
* run page load tests
* visually observe everything looks ok on one round
* make sure perf does not regress
* get a bunch of people run with these changes for a while
* Finally check in during an alpha cycle (parser breakages can cause very
subtle problems that take a long time to notice)
Comment 46•21 years ago
|
||
My builds are up and running again; I think I can do this tomorrow.
Comment 47•21 years ago
|
||
Hmmm. Looking over the patch, my gut feeling is that we'd be better off in terms
of compat by treating textarea as PCDATA. The changes look like they'd be fairly
small (textarea.CanContainCDATA should be true), and we already have
tried-and-true CDATA parsing in the tree for <script> and <style>. Furthermore,
IE appears to treat <textarea> in just this way, which I think tends to reduce
the risk we'll break something. Given that aIgnoreComments would be true in
ConsumeUntil(), I think the prospect is not quite as scary as initially
assessed. Heikki, bz, do you concur? I wouldn't necessarily be opposed to the
current patch, but I think a fix to treat <textarea> as CDATA content would
actually be less confusing down the road (doesn't split the content across
start, text, end tokens for instance).
Comment 48•21 years ago
|
||
Heikki, I'm not sure I can estimate the memory usage here... Isn't all this
stuff transient anyway? How do I run the parser tests? How do I usefully run
the pageload tests from where I sit? I know I won't be able to get useful
numbers out of them, but I could at least do the visual inspection bit. And
yes, this is totally alpha material at this stage in the game (I wasn't cced on
the bug :( ).
Chris, does IE not deal with character entities in <textarea>? I find that
difficult to believe... <script> and <style> do NOT deal with character
entities, last I checked.
The memory used by these objects might stick around for a while, I can no longer
remember exactly (and don't have time to go dig). The tests are in
htmlparser/tests/html, check the README file on how to run the tests. Regarding
pageload numbers, I guess in your case you would just need to check in and
observe the numbers (or get someone else who has local access to the testsuite
test it for you - I don't have the test material anymore, but I was able to run
them locally with very little noise).
Comment 50•21 years ago
|
||
> The tests are in htmlparser/tests/html, check the README file
Oh, those... It looks like those are Windows-only -- the various scripts
hardcode things like exe extensions... If I correct that, the test app is still
not getting started correctly, etc. Last time I needed to do this I spent 15-20
hours on getting the damn thing to run, then just gave up and asked Harish to
run it.
If someone with a Windows tree (where this stuff hopefully works) could run the
tests, that would be great...
Comment 51•21 years ago
|
||
Oh, and I guess I will try to see whether I can avoid using that extra member...
Comment 52•21 years ago
|
||
OK, testing on IE reveals that entities in textareas are recognized as markup
(but not comments or tags other than </textarea>), so the IE content model is
closer to RCDATA.
Comment 53•21 years ago
|
||
Right. So we can't reuse the <style>/<script> parser...
Comment 54•21 years ago
|
||
Comment 55•21 years ago
|
||
Note to self -- tokens are arena-allocated, and creating a new token size is
bad. I really need to not add that member.
Comment 56•21 years ago
|
||
Further note to self -- any solution I adopt must be able to return the
following pieces of information:
1) The "tagname" (whatever CEndToken normally consumes). This must be returned
by GetStringValue()
2) The full "trailing content" string (or its equivalent, for use by the
textarea stuff).
It's not clear to me how I can manage this given one string to work with.... :(
Updated•21 years ago
|
Attachment #135161 -
Flags: superreview?(hjtoi-bugzilla)
Attachment #135161 -
Flags: review?(choess)
Comment 57•21 years ago
|
||
The basic ideas here are as follows:
1) Let tokens know whether we are preserving content or not as they consume
2) Have the end token save its case when preserving content
3) Allow identifiers to contain '.' (that fixes the XML testcase in this bug
that does that)
4) Preserve case of nodes that close CDATA sections when preserving content
The rest is random cleanup that I felt justified.
This fixes all the testcases except "More evil testcase"; if people do that and
think it should work, that's their problem, really. It's not valid XML, so
there should be little call to do it.
I don't think item 3 above should cause any problems, but it's really optional
if we decide it's no good to have that change.
tor, could I possibly impose on you to run those regression tests again?
Attachment #135161 -
Attachment is obsolete: true
Attachment #138735 -
Attachment is obsolete: true
Comment 58•21 years ago
|
||
so you mean that on forms that are broken (and thus forget to escape content on
the preview page) I can't type:
bla bla
</ian hickson>
...and have it roundtrip? ;-)
Comment 59•21 years ago
|
||
Comment 60•21 years ago
|
||
Comment on attachment 138857 [details] [diff] [review]
Slightly different approach
Ian, indeed.
tor, thanks! Looks like the only change related to the '.' thing is also
non-substantive.
choess, heikki, could you please review this one? I'd really like to land this
before alpha freezes (in a week and a half).
Attachment #138857 -
Flags: superreview?(hjtoi-bugzilla)
Attachment #138857 -
Flags: review?(choess)
Comment 61•21 years ago
|
||
Comment on attachment 138857 [details] [diff] [review]
Slightly different approach
1) Adding the "." is fully spec-compliant.
2) I like Ian's quasi-RCDATA suggestion in bug 231631 somewhat better from a
philosophical perspective, but this patch certainly works and is here now, so
go with it.
Attachment #138857 -
Flags: review?(choess) → review+
Updated•21 years ago
|
Attachment #138857 -
Flags: superreview?(hjtoi-bugzilla) → superreview?(jst)
Comment 62•21 years ago
|
||
Comment on attachment 138857 [details] [diff] [review]
Slightly different approach
sr=jst
Attachment #138857 -
Flags: superreview?(jst) → superreview+
Comment 63•21 years ago
|
||
Checked in to 1.7a trunk. No noticeable Tp effect.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Comment 64•20 years ago
|
||
*** Bug 216560 has been marked as a duplicate of this bug. ***
Comment 65•20 years ago
|
||
*** Bug 227050 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•