Closed
Bug 105960
Opened 23 years ago
Closed 16 years ago
xml.cgi and other future xml pages generate invalid XML
Categories
(Bugzilla :: Query/Bug List, defect, P3)
Tracking
()
RESOLVED
FIXED
Bugzilla 3.2
People
(Reporter: luis, Assigned: khampton)
References
()
Details
Attachments
(3 files, 3 obsolete files)
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
mkanat
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
LpSolit
:
review+
|
Details | Diff | Splinter Review |
Because xml.cgi basically hand crafts xml instead of using the standard perl XML
libraries, it doesn't escape/encode umlauts (for example), so the XML it creates
is invalid. This means that sane XML libraries looking to parse the code are...
well, displeased.
Because we are moving towards templates for all data output (not just xml.cgi),
it seems unlikely that this problem is going to be solved the 'correct' way-
i.e., by using the perl XML libraries to create properly encoded XML.
So... I'm not entirely sure that there is any elegant solution to this, except
maybe to provide a 'encode_this_like_XML_likes()' function that can be called
iff xml is chosen as the output format. Not sure how that would work, though.
But it is a problem, and will continue to be if people want to use standard
system libraries (on both linux and windows) to create interaction with bugzilla.
Comment 1•23 years ago
|
||
We expect proper escaping for HTML templates, we should do the same for XML
templates. Template Toolkit has the 'html' filter, does it have an 'xml'
filter, or are they the same?
Priority: -- → P3
Target Milestone: --- → Bugzilla 2.18
Even for cases that have been anticipated, it looks like something weird is
going on here. Go to http://bugzilla.mozilla.org/xml.cgi?id=1000 and browse the
output and notice the invalid entities being generated for known cases.
QuoteXMLChars() in Bug.pm is supposed to be taking care of &, <, >, ', and ",
but the output above is somehow translating & into &amp;, < into &lt;,
and so on.
It looks as though translation is somehow happening twice, once expanding < to
< for example, then expanding that to &lt; . But QuoteXMLChars() doesn't
seem to be getting called twice within the <text /> tags output, and it
translates & characters before <, not after, as my guess would indicate. If it
is because of a recursive substitution, why does it stop? I am completely
baffled. Perhaps it is an escaping problem in QuoteXMLChars()?
Templates might be able to fix this, but without knowing what the problem is or
what the html filter covers, I can't be certain. IIRC, the html filter didn't
cover ", much less umlauts, and I didn't see an xml filter.
Comment 3•23 years ago
|
||
To handle the accented characters, would it make sense to set the encoding of
the xml output to latin1? Something like this:
<?xml version="1.0" encoding="iso8859-1" standalone="no"?>
Alternatively, the data could be reencoded to UTF-8 (which is the default
encoding for XML files).
The first option is probably the easiest one, and is probably correct if
bugzilla is assuming things work in latin1.
The encoding could easily be part of the template, so that bugzillas in other
countries which have bugs using a different encoding could be exported correctly.
I think a compromise like this would be necessary, given that the encoding of
the text returned by the bugzilla user isn't known (or isn't recorded).
Comment 4•23 years ago
|
||
An example of the accented chars causing problems is:
http://bugzilla.mozilla.org/xml.cgi?id=384
Setting the encoding in the <?xml?> PI makes the file parseable.
Comment 5•23 years ago
|
||
note that xml.cgi sends the wrong mimetype, too.
The current version on b.m.o sends text/plain, and lxr said it would send
text/html. tststs, text/xml it should be.
Talking about entities, the xml produces by xml.cgi should be readable for
Mozilla, IMHO, so it needs to be standalone. No dtds are loaded over the
network at all.
If you want to use entities, declare them in the document, or stick to the basic
XML entities plus numerical.
Comment 6•23 years ago
|
||
lxr is showing cvs bugzilla - that first content-type happens if you don't put
an id in at all, in which case we do generate an html page.
The dtd is just for validation - there are no entities used.
justdave: should we fix the mimetype for 2.16? Its a simple, obviously correct
fix. (Which will mean that we can't see the output in mozilla directly, w/o
using view source, but....)
Comment 7•23 years ago
|
||
If the DTD is only good for validation, then the xml pi should have
<?xml version="1.0" standalone="yes"?>, but xml.cgi serves "no".
If you want to keep validation information, give the right hint that this is
only needed for validation, but not for the creation of the content.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-rmd
Comment 8•23 years ago
|
||
Brad: if you guys agree it needs to be done, I won't stop it from getting
checked in. :-) Although I learned a lot about DTD-writing recently I'm still
generally clueless on the way the XML gets transmitted.
Comment 9•23 years ago
|
||
An easy way to test if the output is formal XML is with the xmllint program that
comes with the libxml2 library (http://www.xmlsoft.org). Just download the
output of xml.cgi, then run xmllint on it like this:
xmllint -noout file.xml
If it prints anything, then the file has character set or tag balancing issues.
You can also perform validity checks (it should be able to download the DTD) with:
xmllint -noout -valid file.xml
At the moment, it looks like the output from bugzilla.mozilla.org has formality
problems (isn't utf-8, but doesn't specify its character set), and validity
problems (looks like some elements are out of order).
For my purposes (my python module for talking to a bugzilla installation), only
the formality issue is a problem. It might be worth fixing the validity bug though.
Here is the output from a validity check after setting the character set to
iso8859-1 in the <?xml?> PI:
$ xmllint -noout -valid 384.xml
384.xml:20: validity error: No declaration for element resolution
<resolution>INVALID</resolution>
^
384.xml:199: validity error: Element bug content doesn't follow the DTD
Expecting (bug_id , exporter , urlbase , bug_status , resolution? , product ,
priority , version , rep_platform , assigned_to , delta_ts , component ,
reporter , target_milestone? , bug_severity , creation_ts , qa_contact? ,
status_whiteboard? , op_sys , short_desc? , keywords* , dependson* , blocks* ,
cc* , long_desc? , attachment*), got (bug_id bug_status product priority version
rep_platform assigned_to delta_ts component reporter target_milestone
bug_severity creation_ts qa_contact op_sys resolution short_desc long_desc
long_desc long_desc long_desc long_desc long_desc long_desc long_desc long_desc )
</bug>
^
Comment 10•23 years ago
|
||
Axel: to summarise: for 2.16, you want:
a) The XML served as text/xml
b) The standalone="no" to become standalone="yes"?
Gerv
Target Milestone: Bugzilla 2.18 → Bugzilla 2.16
Comment 11•23 years ago
|
||
That seems like the correct thing to do
Updated•23 years ago
|
Comment 12•23 years ago
|
||
Comment on attachment 81244 [details] [diff] [review]
patch
2r=caillon. This change is fine. (is there anything else we need to do though
before closing this bug? Comment 9 hints that we don't validate per the
bugzilla DTD)
Attachment #81244 -
Flags: review+
Comment 13•23 years ago
|
||
we validate now - that was a separate bug.
Charset encoding stuff is covered by a separate bug (bug 126255) Won't using
utf8 require us to use perl 5.6? That bug has a patch on it, but it doesn't
really cover whats needed here, it just allows an encoding to be set.
Leaving open, -> 2.18, and depending on bug 126255. I'll mark the other patch as
obsolete, too, so that it doesn't affect queries.
Updated•23 years ago
|
Attachment #81244 -
Attachment is obsolete: true
Comment 14•23 years ago
|
||
Bug 126255 seems to be something about logging into amazon.com, rather than a
bugzilla bug. I think you meant bug 126266. That patches for that bug seems to
cover setting the encoding of the html output of xml.cgi (ie. when you don't
give any bug numbers). Not sure which bug the charset issue fits into better
(we definitely aren't producing UTF-8 at the moment, so we should be setting it
to _some_ encoding, if we want it to validate).
As for the charset issue, specifying the encoding as iso8859-1 is probably the
best option at the moment.
Comment 15•23 years ago
|
||
I forgot to mention that the reason I marked my patch obsolete, and moved this
to 2.18, was because I checked that patch in. Oops.
Yes, I did mean bug 126266, sorry. If that patch goes in as is, we could just
add the use of the Param to the xml header. In any event, working out the
encoding for the xml output implies working out the encoding for the rest of
bugzilla.
We only really support ascii at the moment, I believe, so UTF8 is sort of
correct. The problem in that other bug and related issues is that we don't have
a way of defined what encoding is being used.
Comment 16•23 years ago
|
||
I just tried on http://landfill.tequilarista.org/bugzilla-tip/xml.cgi?id=267,
that looks pretty much like what you would expect.
It works in document inspector, entities are as they should be.
Rossi, would you take a look and see if the output works for you?
Comment 17•23 years ago
|
||
yes, this one works great. thanks for checking with me!
Comment 18•21 years ago
|
||
I found another problem caused by the QuoteXMLChars function in Bug.pm regarding
the '&' -> '&' conversion.
If I have for example a sequence like '->' in a bug description it is converted
to '-&gt;' when exporting this bug with xml.cgi.
Seems to be caused by QuoteXMLChars to be applied more than once on the same
field, thus first converting '>' to '>' and then '>' to '&gt;'.
The solution is an additional look-ahead in QuoteXMLChars:
$_[0] =~ s/&/&/g;
should be replaced with
$_[0] =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g;
This way, a '&' is not touched if it is used as a character prefix (as in >
or Ä or ™).
Comment 19•21 years ago
|
||
What verson are you using? I have a recollection of our double (and in some
cases triple) XML Escaping being fixed a while back. See bug 109530.
This is fixed in CVS where the whole xml thing got redone to just be a display
format.
Comment 20•21 years ago
|
||
Unloved bugs targetted for 2.18 but untouched since 9-15-2003 are being
retargeted to 2.20
If you plan to act on one immediately, go ahead and pull it back to 2.18.
Target Milestone: Bugzilla 2.18 → Bugzilla 2.20
Comment 21•20 years ago
|
||
It still generates invalid xml if you use things like smart quotes, etc. It
should convert them to &<ordinal value>; type deals.
I've attached a patch that fixes it.
Comment 22•20 years ago
|
||
Updated•20 years ago
|
Attachment #158828 -
Flags: review?
Comment 23•20 years ago
|
||
Sigh, i was up too late that night.
What i meant is that you still have to remove the unrepresentable characters.
There are some characters that xml simply doesn't allow.
However, they can still make it into the db occassionally.
I believe they are 0x01-0x08, 0x0b-0x0c 0x0e-0x19. THe XML Spec 2.2 says:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
Those not in here need to be simply dropped from the output.
ie $var =~ s<([\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x0c\x0e-\x19])><>seg;
Updated•20 years ago
|
Attachment #158828 -
Flags: review?(kiko)
Updated•20 years ago
|
Whiteboard: patch awaiting review
Comment 24•20 years ago
|
||
This bug has not been touched by its owner in over six months, even though it is
targeted to 2.20, for which the freeze is 10 days away. Unsetting the target
milestone, on the assumption that nobody is actually working on it or has any
plans to soon.
If you are the owner, and you plan to work on the bug, please give it a real
target milestone. If you are the owner, and you do *not* plan to work on it,
please reassign it to nobody@bugzilla.org or a .bugs component owner. If you are
*anybody*, and you get this comment, and *you* plan to work on the bug, please
reassign it to yourself if you have the ability.
Target Milestone: Bugzilla 2.20 → ---
Comment 25•20 years ago
|
||
Comment on attachment 158828 [details] [diff] [review]
XML quoting fix
Canceling review requests since comment 23 indicates this patch needs more
work. Daniel, are you still around to update your patch?
Attachment #158828 -
Flags: review?(kiko)
Attachment #158828 -
Flags: review?
Comment 26•20 years ago
|
||
*** Bug 291503 has been marked as a duplicate of this bug. ***
Updated•19 years ago
|
Whiteboard: patch awaiting review
Reporter | ||
Comment 27•19 years ago
|
||
Poking dberlin, who AFAICT was not on the cc list.
Comment 28•19 years ago
|
||
I'll update it. It's a one liner.
Updated•18 years ago
|
QA Contact: mattyt-bugzilla → default-qa
Comment 29•18 years ago
|
||
the following seem to work:
https://bugzilla-test.mozilla.org/show_bug.cgi?ctype=xml&id=384
https://bugzilla-test.mozilla.org/show_bug.cgi?ctype=xml&id=267
is this bug still real?
Comment 30•18 years ago
|
||
It's still possible to generate bad XML. See for example http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?ctype=xml&id=5032 (plus real-world example at http://bugzilla.gnome.org/show_bug.cgi?id=417196&ctype=xml)
Comment 31•17 years ago
|
||
I stumbled across another real-world example of this today; the following link results in invalid XML...
https://bugs.eclipse.org/bugs/show_bug.cgi?ctype=xml&id=140108
Assignee | ||
Comment 33•16 years ago
|
||
The proposed patch addresses two issues, both related to generating well-formed XML:
1) It replaces instances of '&' with '&' while skipping over valid character entities (so that '™' would *not* become '&trade;' for example).
2) It knocks out characters from disallowed character ranges (control characters, etc) per the XML 1.0 Spec (production 2.2). Note that the more verbose form of expressing the hex characters is used in the regex because the abbreviated form appearing in character classes evidently makes certain perl's lexers cry bitter tears of failure.
Attachment #354687 -
Flags: review?(mkanat)
Comment 34•16 years ago
|
||
Comment on attachment 354687 [details] [diff] [review]
V1
>+ # substitute & for & unless it is already
>+ # used in a character entity.
>+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g;
That's getting too complex. The way the filter is used, it should be displaying "™" if somebody writes "™".
>+ # the following nukes characters disallowed by the XML 1.0
>+ # spec, Production 2.2. 1.0 declares that only the following
>+ # are valid:
>+ # (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF])
>+ $var =~ s/([\x{0001}-\x{0008}]|
>+ [\x{000B}-\x{000C}]|
>+ [\x{00E}-\x{0019}]|
>+ [\x{D800}-\x{DFFF}]|
>+ [\x{FFFE}-\x{FFFF}])//gx;
I'd rather replace them with HTML entities, is that possible? People export data via XML, and theoretically some of these characters could be in comments (as unlikely as it seems).
Attachment #354687 -
Flags: review?(mkanat) → review-
Comment 35•16 years ago
|
||
html entities have nothing to do in xml output. You could use numeric xml entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.
Comment 36•16 years ago
|
||
Except that you can't, see the fourth line in that section (or any of the other thousands of places googling "form feed xml" will show you people being shocked to find they can't "just use XML" to wrap any data). You simply *cannot* represent a form-feed in XML 1.0 content. You can use &#x12;, to turn an actual form-feed character into the literal string "&x12;" if you find that noisy dataloss is better than silent dataloss, but you can no more use  than you can use the actual character.
(Thanks to its desire to be as incompatible and unused as possible, you actually can have  in your XML 1.1, though good luck finding anything to consume XML 1.1 :)
Assignee | ||
Comment 37•16 years ago
|
||
(In reply to comment #34)
> >+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g;
>
> That's getting too complex. The way the filter is used, it should be
> displaying "™" if somebody writes "™".
Well, that's the real question, I suppose: "display" vs. "consume". If I'm only looking at a bug "in XML" in a browser I'd expect what you expect; if I'm passing that data to some kind of XML tool chain in order to do something with it, I'd expect the entities to be preserved as-is.
>
> >+ # the following nukes characters disallowed by the XML 1.0
> >+ # spec, Production 2.2.
::snipping fulgy regex::
> I'd rather replace them with HTML entities, is that possible? People export
> data via XML, and theoretically some of these characters could be in comments
> (as unlikely as it seems).
As Phil rightly points out below, any instance of one of the characters not allowed in Production 2.2-- whether expressed by cutting and pasting from a Word doc or written by hand as entity ref to the code point-- instantly makes the document not-well-formed and all XML 1.0 parsers are /required/ to throw a fatal error. I realize that just nuking them is lossy and might break expectations, but there's simply no way to doll them up to make XML parsers happy and still maintain data integrity.
Comment 38•16 years ago
|
||
(In reply to comment #37)
> Well, that's the real question, I suppose: "display" vs. "consume".
If I type "& means & in HTML" in a comment, I want it to be set back to this exact string after the XML parser has read the XML file. I don't want the XML parser to convert it back to "& means & in HTML". So when you generate the XML file, you should encode & in all cases, and not assume that the input data was already escaped.
About illegal characters, as philor says we have no way to pass them due to limitations of the XML 1.0 spec, I think it's better to silently skip them as it's very unlikely that we will "see" them anyway.
So simply remove the first part of your patch about & and re-request review.
Assignee: bbaetz → khampton
OS: Other → All
Assignee | ||
Comment 39•16 years ago
|
||
(In reply to comment #36)
> Except that you can't, see the fourth line in that section (or any of the other
> thousands of places googling "form feed xml" will show you people being shocked
> to find they can't "just use XML" to wrap any data). You simply *cannot*
> represent a form-feed in XML 1.0 content.
Exactly.
> You can use &#x12;, to turn an
> actual form-feed character into the literal string "&x12;" if you find that
> noisy dataloss is better than silent dataloss, but you can no more use 
> than you can use the actual character.
Okay, that's interesting. It doesn't buy much but at least it keep the code
from silently eating parts of the users' data. Personally, I favor just nuking them since the most common case in my experience is that these characters are artifacts of copy/paste and are not intentional.
> (Thanks to its desire to be as incompatible and unused as possible, you
> actually can have  in your XML 1.1, though good luck finding anything to
> consume XML 1.1 :)
Heh, yeah. I expect to have my flying car /long/ before any 1.1 parsers make it
out of the lab.
Assignee | ||
Comment 40•16 years ago
|
||
(In reply to comment #35)
> html entities have nothing to do in xml output. You could use numeric xml
> entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.
Check the version on that spec. XML 1.1 does indeed allow for certain characters that XML 1.0 does not, but, to my knowledge, there are /no/ 1.1 parsers in the wild.
Assignee | ||
Comment 41•16 years ago
|
||
(In reply to comment #38)
> So simply remove the first part of your patch about & and re-request review.
Thanks, Frédéric. Will do.
Assignee | ||
Comment 42•16 years ago
|
||
Resubmitted with suggested changes. The patch now only nukes non-XML 1.0-friendly characters.
Attachment #354687 -
Attachment is obsolete: true
Attachment #354764 -
Flags: review?(mkanat)
Comment 43•16 years ago
|
||
Comment on attachment 354764 [details] [diff] [review]
Character Stripping V2
>+ [\x{00E}-\x{0019}]|
For consistency, should you write 000E instead of 00E? That's minor, though.
Assignee | ||
Comment 44•16 years ago
|
||
(In reply to comment #43)
> For consistency, should you write 000E instead of 00E? That's minor, though.
Yep, typo. Good catch. Resubmitting.
Assignee | ||
Comment 45•16 years ago
|
||
Fixed typo.
Attachment #354764 -
Attachment is obsolete: true
Attachment #354766 -
Flags: review?(mkanat)
Attachment #354764 -
Flags: review?(mkanat)
Comment 46•16 years ago
|
||
Comment on attachment 354766 [details] [diff] [review]
Character Stripping V3
This looks fine to me. :-)
Attachment #354766 -
Flags: review?(mkanat) → review+
Updated•16 years ago
|
Flags: approval3.2+
Flags: approval+
Target Milestone: --- → Bugzilla 3.2
Comment 47•16 years ago
|
||
tip:
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.78; previous revision: 1.77
done
3.2:
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.69.2.6; previous revision: 1.69.2.5
done
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 48•16 years ago
|
||
I've made this same mistake twice, publicly, in the last 24 hours!
The character that precedes 0x20 is 0x1f, not 0x19.
Attachment #365723 -
Flags: review?(mkanat)
Comment 49•16 years ago
|
||
ubu: Is dwm correct?
Assignee | ||
Comment 50•16 years ago
|
||
(In reply to comment #49)
> ubu: Is dwm correct?
Yes.
Comment 51•16 years ago
|
||
Comment on attachment 365723 [details] [diff] [review]
patch to V3
Yes, this change is correct, per the spec. r=LpSolit
Attachment #365723 -
Flags: review?(mkanat) → review+
Comment 52•16 years ago
|
||
tip:
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.86; previous revision: 1.85
done
3.2.2:
Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.69.2.9; previous revision: 1.69.2.8
done
See Also: → https://launchpad.net/bugs/191199
You need to log in
before you can comment on or make changes to this bug.
Description
•