105960 - xml.cgi and other future xml pages generate invalid XML

Reporter

Description

•

23 years ago

Because xml.cgi basically hand crafts xml instead of using the standard perl XML libraries, it doesn't escape/encode umlauts (for example), so the XML it creates is invalid. This means that sane XML libraries looking to parse the code are... well, displeased. Because we are moving towards templates for all data output (not just xml.cgi), it seems unlikely that this problem is going to be solved the 'correct' way- i.e., by using the perl XML libraries to create properly encoded XML. So... I'm not entirely sure that there is any elegant solution to this, except maybe to provide a 'encode_this_like_XML_likes()' function that can be called iff xml is chosen as the output format. Not sure how that would work, though. But it is a problem, and will continue to be if people want to use standard system libraries (on both linux and windows) to create interaction with bugzilla.

Matthew Tuck [:CodeMachine]

Comment 1

•

23 years ago

We expect proper escaping for HTML templates, we should do the same for XML templates. Template Toolkit has the 'html' filter, does it have an 'xml' filter, or are they the same?

Priority: -- → P3

Target Milestone: --- → Bugzilla 2.18

xyzzy

Comment 2

•

23 years ago

Even for cases that have been anticipated, it looks like something weird is going on here. Go to http://bugzilla.mozilla.org/xml.cgi?id=1000 and browse the output and notice the invalid entities being generated for known cases. QuoteXMLChars() in Bug.pm is supposed to be taking care of &, <, >, ', and ", but the output above is somehow translating & into &amp;, < into &lt;, and so on. It looks as though translation is somehow happening twice, once expanding < to < for example, then expanding that to &lt; . But QuoteXMLChars() doesn't seem to be getting called twice within the <text /> tags output, and it translates & characters before <, not after, as my guess would indicate. If it is because of a recursive substitution, why does it stop? I am completely baffled. Perhaps it is an escaping problem in QuoteXMLChars()? Templates might be able to fix this, but without knowing what the problem is or what the html filter covers, I can't be certain. IIRC, the html filter didn't cover ", much less umlauts, and I didn't see an xml filter.

James Henstridge

Comment 3

•

23 years ago

To handle the accented characters, would it make sense to set the encoding of the xml output to latin1? Something like this: <?xml version="1.0" encoding="iso8859-1" standalone="no"?> Alternatively, the data could be reencoded to UTF-8 (which is the default encoding for XML files). The first option is probably the easiest one, and is probably correct if bugzilla is assuming things work in latin1. The encoding could easily be part of the template, so that bugzillas in other countries which have bugs using a different encoding could be exported correctly. I think a compromise like this would be necessary, given that the encoding of the text returned by the bugzilla user isn't known (or isn't recorded).

James Henstridge

Comment 4

•

23 years ago

An example of the accented chars causing problems is: http://bugzilla.mozilla.org/xml.cgi?id=384 Setting the encoding in the <?xml?> PI makes the file parseable.

Axel Hecht

Comment 5

•

23 years ago

note that xml.cgi sends the wrong mimetype, too. The current version on b.m.o sends text/plain, and lxr said it would send text/html. tststs, text/xml it should be. Talking about entities, the xml produces by xml.cgi should be readable for Mozilla, IMHO, so it needs to be standalone. No dtds are loaded over the network at all. If you want to use entities, declare them in the document, or stick to the basic XML entities plus numerical.

Bradley Baetz (:bbaetz)

Comment 6

•

23 years ago

lxr is showing cvs bugzilla - that first content-type happens if you don't put an id in at all, in which case we do generate an html page. The dtd is just for validation - there are no entities used. justdave: should we fix the mimetype for 2.16? Its a simple, obviously correct fix. (Which will mean that we can't see the output in mozilla directly, w/o using view source, but....)

Axel Hecht

Comment 7

•

23 years ago

If the DTD is only good for validation, then the xml pi should have <?xml version="1.0" standalone="yes"?>, but xml.cgi serves "no". If you want to keep validation information, give the right hint that this is only needed for validation, but not for the creation of the content. http://www.w3.org/TR/2000/REC-xml-20001006#sec-rmd

Dave Miller [:justdave]

Comment 8

•

23 years ago

Brad: if you guys agree it needs to be done, I won't stop it from getting checked in. :-) Although I learned a lot about DTD-writing recently I'm still generally clueless on the way the XML gets transmitted.

James Henstridge

Comment 9

•

23 years ago

An easy way to test if the output is formal XML is with the xmllint program that comes with the libxml2 library (http://www.xmlsoft.org). Just download the output of xml.cgi, then run xmllint on it like this: xmllint -noout file.xml If it prints anything, then the file has character set or tag balancing issues. You can also perform validity checks (it should be able to download the DTD) with: xmllint -noout -valid file.xml At the moment, it looks like the output from bugzilla.mozilla.org has formality problems (isn't utf-8, but doesn't specify its character set), and validity problems (looks like some elements are out of order). For my purposes (my python module for talking to a bugzilla installation), only the formality issue is a problem. It might be worth fixing the validity bug though. Here is the output from a validity check after setting the character set to iso8859-1 in the <?xml?> PI: $ xmllint -noout -valid 384.xml 384.xml:20: validity error: No declaration for element resolution <resolution>INVALID</resolution> ^ 384.xml:199: validity error: Element bug content doesn't follow the DTD Expecting (bug_id , exporter , urlbase , bug_status , resolution? , product , priority , version , rep_platform , assigned_to , delta_ts , component , reporter , target_milestone? , bug_severity , creation_ts , qa_contact? , status_whiteboard? , op_sys , short_desc? , keywords* , dependson* , blocks* , cc* , long_desc? , attachment*), got (bug_id bug_status product priority version rep_platform assigned_to delta_ts component reporter target_milestone bug_severity creation_ts qa_contact op_sys resolution short_desc long_desc long_desc long_desc long_desc long_desc long_desc long_desc long_desc long_desc ) </bug> ^

Gervase Markham [:gerv]

Comment 10

•

23 years ago

Axel: to summarise: for 2.16, you want: a) The XML served as text/xml b) The standalone="no" to become standalone="yes"? Gerv

Target Milestone: Bugzilla 2.18 → Bugzilla 2.16

Bradley Baetz (:bbaetz)

Comment 11

•

23 years ago

Attached patch patch (obsolete) (deleted) — Details — Splinter Review

That seems like the correct thing to do

Bradley Baetz (:bbaetz)

Updated

•

23 years ago

Keywords: patch, review

Christopher Aillon (sabbatical, not receiving bugmail)

Comment 12

•

23 years ago

Comment on attachment 81244 [details] [diff] [review] patch 2r=caillon. This change is fine. (is there anything else we need to do though before closing this bug? Comment 9 hints that we don't validate per the bugzilla DTD)

Attachment #81244 - Flags: review+

Bradley Baetz (:bbaetz)

Comment 13

•

23 years ago

we validate now - that was a separate bug. Charset encoding stuff is covered by a separate bug (bug 126255) Won't using utf8 require us to use perl 5.6? That bug has a patch on it, but it doesn't really cover whats needed here, it just allows an encoding to be set. Leaving open, -> 2.18, and depending on bug 126255. I'll mark the other patch as obsolete, too, so that it doesn't affect queries.

Assignee: endico → bbaetz

Depends on: 126255

Keywords: patch, review

Target Milestone: Bugzilla 2.16 → Bugzilla 2.18

Bradley Baetz (:bbaetz)

Updated

•

23 years ago

Attachment #81244 - Attachment is obsolete: true

James Henstridge

Comment 14

•

23 years ago

Bug 126255 seems to be something about logging into amazon.com, rather than a bugzilla bug. I think you meant bug 126266. That patches for that bug seems to cover setting the encoding of the html output of xml.cgi (ie. when you don't give any bug numbers). Not sure which bug the charset issue fits into better (we definitely aren't producing UTF-8 at the moment, so we should be setting it to _some_ encoding, if we want it to validate). As for the charset issue, specifying the encoding as iso8859-1 is probably the best option at the moment.

Bradley Baetz (:bbaetz)

Comment 15

•

23 years ago

I forgot to mention that the reason I marked my patch obsolete, and moved this to 2.18, was because I checked that patch in. Oops. Yes, I did mean bug 126266, sorry. If that patch goes in as is, we could just add the use of the Param to the xml header. In any event, working out the encoding for the xml output implies working out the encoding for the rest of bugzilla. We only really support ascii at the moment, I believe, so UTF8 is sort of correct. The problem in that other bug and related issues is that we don't have a way of defined what encoding is being used.

Axel Hecht

Comment 16

•

23 years ago

I just tried on http://landfill.tequilarista.org/bugzilla-tip/xml.cgi?id=267, that looks pretty much like what you would expect. It works in document inspector, entities are as they should be. Rossi, would you take a look and see if the output works for you?

rossi

Comment 17

•

23 years ago

yes, this one works great. thanks for checking with me!

Roland Genske

Comment 18

•

21 years ago

I found another problem caused by the QuoteXMLChars function in Bug.pm regarding the '&' -> '&' conversion. If I have for example a sequence like '->' in a bug description it is converted to '-&gt;' when exporting this bug with xml.cgi. Seems to be caused by QuoteXMLChars to be applied more than once on the same field, thus first converting '>' to '&gt' and then '&gt' to '&gt;'. The solution is an additional look-ahead in QuoteXMLChars: $_[0] =~ s/&/&/g; should be replaced with $_[0] =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g; This way, a '&' is not touched if it is used as a character prefix (as in > or Ä or ™).

Bradley Baetz (:bbaetz)

Comment 19

•

21 years ago

What verson are you using? I have a recollection of our double (and in some cases triple) XML Escaping being fixed a while back. See bug 109530. This is fixed in CVS where the whole xml thing got redone to just be a display format.

Joel Peshkin

Comment 20

•

21 years ago

Unloved bugs targetted for 2.18 but untouched since 9-15-2003 are being retargeted to 2.20 If you plan to act on one immediately, go ahead and pull it back to 2.18.

Target Milestone: Bugzilla 2.18 → Bugzilla 2.20

Daniel Berlin

Comment 21

•

20 years ago

It still generates invalid xml if you use things like smart quotes, etc. It should convert them to &<ordinal value>; type deals. I've attached a patch that fixes it.

Daniel Berlin

Comment 22

•

20 years ago

Attached patch XML quoting fix (deleted) — Details — Splinter Review

Daniel Berlin

Updated

•

20 years ago

Attachment #158828 - Flags: review?

Daniel Berlin

Comment 23

•

20 years ago

Sigh, i was up too late that night. What i meant is that you still have to remove the unrepresentable characters. There are some characters that xml simply doesn't allow. However, they can still make it into the db occassionally. I believe they are 0x01-0x08, 0x0b-0x0c 0x0e-0x19. THe XML Spec 2.2 says: [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] Those not in here need to be simply dropped from the output. ie $var =~ s<([\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x0c\x0e-\x19])><>seg;

Christian Reis

Updated

•

20 years ago

Attachment #158828 - Flags: review?(kiko)

Max Kanat-Alexander

Updated

•

20 years ago

Whiteboard: patch awaiting review

Max Kanat-Alexander

Comment 24

•

20 years ago

This bug has not been touched by its owner in over six months, even though it is targeted to 2.20, for which the freeze is 10 days away. Unsetting the target milestone, on the assumption that nobody is actually working on it or has any plans to soon. If you are the owner, and you plan to work on the bug, please give it a real target milestone. If you are the owner, and you do *not* plan to work on it, please reassign it to nobody@bugzilla.org or a .bugs component owner. If you are *anybody*, and you get this comment, and *you* plan to work on the bug, please reassign it to yourself if you have the ability.

Target Milestone: Bugzilla 2.20 → ---

Myk Melez [:myk] [@mykmelez]

Comment 25

•

20 years ago

Comment on attachment 158828 [details] [diff] [review] XML quoting fix Canceling review requests since comment 23 indicates this patch needs more work. Daniel, are you still around to update your patch?

Attachment #158828 - Flags: review?(kiko)

Attachment #158828 - Flags: review?

Dave Miller [:justdave]

Comment 26

•

20 years ago

*** Bug 291503 has been marked as a duplicate of this bug. ***

Frédéric Buclin

Updated

•

19 years ago

Whiteboard: patch awaiting review

Luis Villa

Reporter

Comment 27

•

19 years ago

Poking dberlin, who AFAICT was not on the cc list.

Daniel Berlin

Comment 28

•

19 years ago

I'll update it. It's a one liner.

Olav Vitters

Updated

•

18 years ago

QA Contact: mattyt-bugzilla → default-qa

timeless

Comment 29

•

18 years ago

the following seem to work: https://bugzilla-test.mozilla.org/show_bug.cgi?ctype=xml&id=384 https://bugzilla-test.mozilla.org/show_bug.cgi?ctype=xml&id=267 is this bug still real?

Tom Parker

Comment 30

•

18 years ago

It's still possible to generate bad XML. See for example http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?ctype=xml&id=5032 (plus real-world example at http://bugzilla.gnome.org/show_bug.cgi?id=417196&ctype=xml)

Matthew Beermann

Comment 31

•

17 years ago

I stumbled across another real-world example of this today; the following link results in invalid XML... https://bugs.eclipse.org/bugs/show_bug.cgi?ctype=xml&id=140108

Kip Hampton

Assignee

Comment 33

•

16 years ago

Attached patch V1 (obsolete) (deleted) — Details — Splinter Review

The proposed patch addresses two issues, both related to generating well-formed XML: 1) It replaces instances of '&' with '&' while skipping over valid character entities (so that '™' would *not* become '&trade;' for example). 2) It knocks out characters from disallowed character ranges (control characters, etc) per the XML 1.0 Spec (production 2.2). Note that the more verbose form of expressing the hex characters is used in the regex because the abbreviated form appearing in character classes evidently makes certain perl's lexers cry bitter tears of failure.

Attachment #354687 - Flags: review?(mkanat)

Max Kanat-Alexander

Comment 34

•

16 years ago

Comment on attachment 354687 [details] [diff] [review] V1 >+ # substitute & for & unless it is already >+ # used in a character entity. >+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g; That's getting too complex. The way the filter is used, it should be displaying "™" if somebody writes "™". >+ # the following nukes characters disallowed by the XML 1.0 >+ # spec, Production 2.2. 1.0 declares that only the following >+ # are valid: >+ # (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) >+ $var =~ s/([\x{0001}-\x{0008}]| >+ [\x{000B}-\x{000C}]| >+ [\x{00E}-\x{0019}]| >+ [\x{D800}-\x{DFFF}]| >+ [\x{FFFE}-\x{FFFF}])//gx; I'd rather replace them with HTML entities, is that possible? People export data via XML, and theoretically some of these characters could be in comments (as unlikely as it seems).

Attachment #354687 - Flags: review?(mkanat) → review-

Axel Hecht [:Pike]

Comment 35

•

16 years ago

html entities have nothing to do in xml output. You could use numeric xml entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.

Phil Ringnalda (:philor)

Comment 36

•

16 years ago

Except that you can't, see the fourth line in that section (or any of the other thousands of places googling "form feed xml" will show you people being shocked to find they can't "just use XML" to wrap any data). You simply *cannot* represent a form-feed in XML 1.0 content. You can use &#x12;, to turn an actual form-feed character into the literal string "&x12;" if you find that noisy dataloss is better than silent dataloss, but you can no more use  than you can use the actual character. (Thanks to its desire to be as incompatible and unused as possible, you actually can have  in your XML 1.1, though good luck finding anything to consume XML 1.1 :)

Kip Hampton

Assignee

Comment 37

•

16 years ago

(In reply to comment #34) > >+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g; > > That's getting too complex. The way the filter is used, it should be > displaying "™" if somebody writes "™". Well, that's the real question, I suppose: "display" vs. "consume". If I'm only looking at a bug "in XML" in a browser I'd expect what you expect; if I'm passing that data to some kind of XML tool chain in order to do something with it, I'd expect the entities to be preserved as-is. > > >+ # the following nukes characters disallowed by the XML 1.0 > >+ # spec, Production 2.2. ::snipping fulgy regex:: > I'd rather replace them with HTML entities, is that possible? People export > data via XML, and theoretically some of these characters could be in comments > (as unlikely as it seems). As Phil rightly points out below, any instance of one of the characters not allowed in Production 2.2-- whether expressed by cutting and pasting from a Word doc or written by hand as entity ref to the code point-- instantly makes the document not-well-formed and all XML 1.0 parsers are /required/ to throw a fatal error. I realize that just nuking them is lossy and might break expectations, but there's simply no way to doll them up to make XML parsers happy and still maintain data integrity.

Frédéric Buclin

Comment 38

•

16 years ago

(In reply to comment #37) > Well, that's the real question, I suppose: "display" vs. "consume". If I type "& means & in HTML" in a comment, I want it to be set back to this exact string after the XML parser has read the XML file. I don't want the XML parser to convert it back to "& means & in HTML". So when you generate the XML file, you should encode & in all cases, and not assume that the input data was already escaped. About illegal characters, as philor says we have no way to pass them due to limitations of the XML 1.0 spec, I think it's better to silently skip them as it's very unlikely that we will "see" them anyway. So simply remove the first part of your patch about & and re-request review.

Assignee: bbaetz → khampton

OS: Other → All

Kip Hampton

Assignee

Comment 39

•

16 years ago

(In reply to comment #36) > Except that you can't, see the fourth line in that section (or any of the other > thousands of places googling "form feed xml" will show you people being shocked > to find they can't "just use XML" to wrap any data). You simply *cannot* > represent a form-feed in XML 1.0 content. Exactly. > You can use &#x12;, to turn an > actual form-feed character into the literal string "&x12;" if you find that > noisy dataloss is better than silent dataloss, but you can no more use  > than you can use the actual character. Okay, that's interesting. It doesn't buy much but at least it keep the code from silently eating parts of the users' data. Personally, I favor just nuking them since the most common case in my experience is that these characters are artifacts of copy/paste and are not intentional. > (Thanks to its desire to be as incompatible and unused as possible, you > actually can have  in your XML 1.1, though good luck finding anything to > consume XML 1.1 :) Heh, yeah. I expect to have my flying car /long/ before any 1.1 parsers make it out of the lab.

Kip Hampton

Assignee

Comment 40

•

16 years ago

(In reply to comment #35) > html entities have nothing to do in xml output. You could use numeric xml > entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references. Check the version on that spec. XML 1.1 does indeed allow for certain characters that XML 1.0 does not, but, to my knowledge, there are /no/ 1.1 parsers in the wild.

Kip Hampton

Assignee

Comment 41

•

16 years ago

(In reply to comment #38) > So simply remove the first part of your patch about & and re-request review. Thanks, Frédéric. Will do.

Kip Hampton

Assignee

Comment 42

•

16 years ago

Attached patch Character Stripping V2 (obsolete) (deleted) — Details — Splinter Review

Resubmitted with suggested changes. The patch now only nukes non-XML 1.0-friendly characters.

Attachment #354687 - Attachment is obsolete: true

Attachment #354764 - Flags: review?(mkanat)

Frédéric Buclin

Comment 43

•

16 years ago

Comment on attachment 354764 [details] [diff] [review] Character Stripping V2 >+ [\x{00E}-\x{0019}]| For consistency, should you write 000E instead of 00E? That's minor, though.

Kip Hampton

Assignee

Comment 44

•

16 years ago

(In reply to comment #43) > For consistency, should you write 000E instead of 00E? That's minor, though. Yep, typo. Good catch. Resubmitting.

Kip Hampton

Assignee

Comment 45

•

16 years ago

Attached patch Character Stripping V3 (deleted) — Details — Splinter Review

Fixed typo.

Attachment #354764 - Attachment is obsolete: true

Attachment #354766 - Flags: review?(mkanat)

Attachment #354764 - Flags: review?(mkanat)

Max Kanat-Alexander

Comment 46

•

16 years ago

Comment on attachment 354766 [details] [diff] [review] Character Stripping V3 This looks fine to me. :-)

Attachment #354766 - Flags: review?(mkanat) → review+

Max Kanat-Alexander

Updated

•

16 years ago

Flags: approval3.2+

Flags: approval+

Target Milestone: --- → Bugzilla 3.2

Frédéric Buclin

Comment 47

•

16 years ago

tip: Checking in Bugzilla/Util.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm new revision: 1.78; previous revision: 1.77 done 3.2: Checking in Bugzilla/Util.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm new revision: 1.69.2.6; previous revision: 1.69.2.5 done

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

David Marshall

Comment 48

•

16 years ago

Attached patch patch to V3 (deleted) — Details — Splinter Review

I've made this same mistake twice, publicly, in the last 24 hours! The character that precedes 0x20 is 0x1f, not 0x19.

Attachment #365723 - Flags: review?(mkanat)

Max Kanat-Alexander

Comment 49

•

16 years ago

ubu: Is dwm correct?

Kip Hampton

Assignee

Comment 50

•

16 years ago

(In reply to comment #49) > ubu: Is dwm correct? Yes.

Frédéric Buclin

Comment 51

•

16 years ago

Comment on attachment 365723 [details] [diff] [review] patch to V3 Yes, this change is correct, per the spec. r=LpSolit

Attachment #365723 - Flags: review?(mkanat) → review+

Frédéric Buclin

Comment 52

•

16 years ago

tip: Checking in Bugzilla/Util.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm new revision: 1.86; previous revision: 1.85 done 3.2.2: Checking in Bugzilla/Util.pm; /cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm new revision: 1.69.2.9; previous revision: 1.69.2.8 done

Launchpad

Updated

•

14 years ago

See Also: → https://launchpad.net/bugs/191199

patch 23 years ago Bradley Baetz (:bbaetz) (deleted), patch	caillon : review+ caillon : review+	Details \| Diff \| Splinter Review
XML quoting fix 20 years ago Daniel Berlin (deleted), patch		Details \| Diff \| Splinter Review
V1 16 years ago Kip Hampton (deleted), patch	mkanat : review-	Details \| Diff \| Splinter Review
Character Stripping V2 16 years ago Kip Hampton (deleted), patch		Details \| Diff \| Splinter Review
Character Stripping V3 16 years ago Kip Hampton (deleted), patch	mkanat : review+	Details \| Diff \| Splinter Review
patch to V3 16 years ago David Marshall (deleted), patch	LpSolit : review+	Details \| Diff \| Splinter Review