Open
Bug 169388
Opened 22 years ago
Updated 2 years ago
Handling of non ASCII characters in URLs
Categories
(Core :: Internationalization, defect)
Tracking
()
NEW
People
(Reporter: roslawski, Unassigned)
References
Details
(Keywords: intl)
Attachments
(1 file: deleted, text/html)
If a mailto link contains URL-encoded characters above %7F (e.g. %A7 for "§" or %B4 for "´"), the parameter parsing appears to get confused: URL-encoded characters such as "&" (%26) and "=" (%3D) are suddenly interpreted as if they weren't URL-encoded. Furthermore, the status bar doesn't show the link URL when the mouse moves over the link.

The following link should open the mail client with the recipient "test@test.com" and the subject "test§test&body=test". Netscape 4.78 actually does that. Mozilla sets the subject to "test§test" and the body to "test":

mailto:test@test.com?subject=test%A7test%26body%3Dtest

The following link should open the mail client with the subject "test´&test". Mozilla sets the subject to "test´" and drops the rest:

mailto:test@test.com?subject=test%B4%26test

Everything seems to work fine in Mozilla when the above-%7F character isn't encoded, e.g.:

mailto:test@test.com?subject=test§test%26body%3Dtest

I didn't have much time to look into this, sorry. Maybe it's just me who is confused or who didn't understand URL encoding. Maybe there's something in an RFC about URL encoding for mailto that I don't know yet. Sorry then. I think the behaviour is strange enough to report it.

- Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2a) Gecko/20020910
- Character Encoding Western (ISO 8859-1), auto-detect off
- Mozilla Mail is default mailer (IMAP-SSL account)
- Windows 2000 SP2 (german)

I only found this statement from July 31, 2001 on http://mozilla-evangelism.bclary.com/letter/news.html, which might point at the same problem:

"Also, when preparing the string containing the letter for use in the mailto: url assignment, it appears that ISO-8859-2 strings are truncated when using the JS function escape() to prepare them for use in the URL. I am not sure if I am causing this problem or if it is a problem in Mozilla."
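The expected behaviour described above can be sketched in a few lines of Python. This is only an illustration of the parsing order (split on the literal "&" and "=" first, decode %HH afterwards), not Mozilla's actual code. The helper name `parse_mailto` and its `charset` parameter are assumptions made for the example; ISO-8859-1 is used because that is the document encoding named in this report.

```python
from urllib.parse import unquote


def parse_mailto(uri, charset="iso-8859-1"):
    """Hypothetical helper: split a mailto URI into (address, fields).

    Delimiters are located in the still-escaped string, so an escaped
    "&" (%26) or "=" (%3D) can never be mistaken for a separator.
    """
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "mailto":
        raise ValueError("not a mailto URI")
    address, _, query = rest.partition("?")
    fields = {}
    for pair in query.split("&"):             # literal "&" only
        if not pair:
            continue
        name, _, value = pair.partition("=")  # first literal "=" only
        # Percent-decoding happens last; the charset is an assumption.
        fields[unquote(name).lower()] = unquote(value, encoding=charset)
    return unquote(address), fields


address, fields = parse_mailto(
    "mailto:test@test.com?subject=test%A7test%26body%3Dtest"
)
print(address)
print(fields)
```

With this order of operations the first link from the report yields a single "subject" field and no "body" field, which matches what Netscape 4.78 is reported to do.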
Comment 1•22 years ago
Confirming on Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318. I recently had a very similar bug which proved to be invalid for a reason which does not apply here, see bug 168793. CCing Niklas who helped to solve it. pi
Assignee: asa → harishd
Status: UNCONFIRMED → NEW
Component: Browser-General → Parser
Ever confirmed: true
OS: Windows 2000 → All
QA Contact: asa → moied
Hardware: PC → All
Comment 2•22 years ago
Comment 3•22 years ago
->Networking. Parser is HTML only, not appropriate for the URL parser.
Assignee: harishd → new-network-bugs
Component: Parser → Networking
QA Contact: moied → benc
Comment 4•22 years ago
It actually is related to bug 168793, namely because everything above %7F is supposed to be encoded with two bytes. Why is that? Because everything not ASCII (> %7F) _should_ be encoded with UTF-8, which uses one byte for the first 128 characters and two bytes for the next range (and then three and four bytes for characters in still higher positions in Unicode).

Excerpt from the HTML spec:

****
B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.
***

π-winger's problem was that he tried to escape a two-byte value (in UTF-8) with one byte, which failed. Same problem here. Find the relevant two-byte combination, and it will probably work. The section sign is encoded as %C2%A7 in a UTF-8 URI hex scheme.
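The two-step convention from the spec excerpt (UTF-8 first, then %HH per byte) can be tried out with Python's standard `urllib.parse.quote`. This is just a quick check of the byte values discussed here, not a statement about how any browser implements it.

```python
from urllib.parse import quote

# Step 1 + 2 of the convention: encode as UTF-8, then escape each byte.
print(quote("§"))                         # %C2%A7 (two bytes in UTF-8)
print(quote("Håkon"))                     # H%C3%A5kon

# The same character escaped from its single ISO-8859-1 byte instead.
print(quote("§", encoding="iso-8859-1"))  # %A7
```

The last line is the form used in the original report; the first line is the form recommended by the excerpt.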
Comment 5•22 years ago
And, may I add, it is safest NOT to encode anything. Let the browser do the encoding. This will also make your code more legible. Unicode to the world.
Reporter
Comment 6•22 years ago
The (default) character encoding of the document is ISO-8859-1, not UTF-8 like http://piology.org/ (at least Mozilla "view page info" tells me the document there uses UTF-8 encoding). URI-encoded characters above %7F in attribute values of http links seem to work fine in ISO-8859-1 encoded pages. The problem only happens in mailto links.

I encoded the parameters of the mailto link in UTF-8, but didn't change the document encoding. It works fine with Mozilla. But IE 6.0 and Netscape 4.78 translate it into two characters, e.g. %C2%A7 yields "Â§" instead of "§". And when I use %C2%A7 to encode an attribute value of an http link, Mozilla yields "Â§" as well (on an ISO-8859-1 encoded document).

Btw, I had a quick run at the link with the UTF-8 encoded anchor name on http://piology.org/. IE, Netscape, Opera, and Lynx fail to jump to the given anchor point.

I don't say the other browsers are right, it's just that I'm seriously confused by now. :-) And I don't know about HTML specifications on URIs, but why should they handle the encoding of http links and mailto links differently? And not encoding attribute values at all isn't always an option, especially when you have to handle dynamic content based on textual input from various users.
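The "two characters" effect is what happens when the two UTF-8 bytes C2 A7 are decoded one byte at a time as ISO-8859-1. A small Python check using the standard `urllib.parse.unquote` shows both readings of the same escaped string; which charset a given browser picks is exactly the open question in this bug, so the choice below is only illustrative.

```python
from urllib.parse import unquote

escaped = "%C2%A7"

# Read the two bytes as one UTF-8 sequence.
print(unquote(escaped, encoding="utf-8"))        # §

# Read each byte as its own ISO-8859-1 character.
print(unquote(escaped, encoding="iso-8859-1"))   # Â§
```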
Comment 7•22 years ago
@Niklas Dougherty: The part from the HTML spec you have posted is only a recommendation for error correction when a URI contains non-ASCII characters. These and only these should be encoded as UTF-8 and then as %nn. They don't say that URIs can't contain (encoded) octets that are not part of a valid UTF-8 sequence. (This is out of the scope of an HTML specification -- or even the W3C -- anyway.) And some URL schemes, such as "data" (RFC 2397), rely on that.

And no, it's not safest not to encode anything. This results in INVALID URIs and amounts to relying on the error-correction scheme you mentioned.

------------------------

On the other hand, the URI shown above is still invalid. This is because the "mailto" scheme does not define any representation for non-ASCII characters (see RFC 2368):

| 8-bit characters in mailto URLs are forbidden. MIME encoded words (as
| defined in [RFC2047]) are permitted in header values, but not for any
| part of a "body" hname.
-- RFC 2368, section 2

But as it explicitly does allow MIME encoding (RFC 2047) in headers, the URI above should be written as:

mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

Note that the following would also be valid (but includes an extra space):

mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3F%3D%20%26body%3Dtest

For the body, this does not work (you can't have encoded words in the body and the spec explicitly disallows content-* headers).

Now, what should Mozilla do when it encounters such an invalid URI:

1. It should be able to display it.
2. It should be able to parse it correctly and not confuse parts of the (intended) header with other (intended) headers or the (intended) body.
3. If the data for a header or for the body (incorrectly) contains non-ASCII characters, it should try to interpret it as UTF-8 and, failing that, as ISO-8859-1 (or any other 8-bit charset determined by other means...)
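To make the MIME variant concrete, here is a minimal Python sketch that builds such a URI. An RFC 2047 encoded-word has the form =?charset?encoding?text?=, where "Q" selects the quoted-printable-like encoding; the whole word is then percent-escaped for use in the URI. The helper names `q_encoded_word` and `mailto_with_subject` are made up for this example, the escaping is deliberately conservative (everything except ASCII letters and digits), and the 75-character limit on encoded-words is not handled.

```python
from urllib.parse import quote


def q_encoded_word(text, charset="ISO-8859-1"):
    """Hypothetical helper: build one RFC 2047 "Q" encoded-word."""
    parts = []
    for byte in text.encode(charset):
        char = chr(byte)
        if byte < 0x80 and char.isalnum():
            parts.append(char)             # ASCII letters and digits as-is
        elif char == " ":
            parts.append("_")              # space is written as "_"
        else:
            parts.append("=%02X" % byte)   # everything else as =HH
    return "=?%s?Q?%s?=" % (charset, "".join(parts))


def mailto_with_subject(address, subject):
    """Hypothetical helper: percent-escape the encoded-word for the URI."""
    word = q_encoded_word(subject)
    return "mailto:%s?subject=%s" % (address, quote(word, safe=""))


print(q_encoded_word("test§test&body=test"))
print(mailto_with_subject("test@test.com", "test§test&body=test"))
```

Because the resulting URI contains only ASCII characters and no raw "&" or "=" inside the subject value, it does not depend on how the consumer guesses the charset of bytes above %7F.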
Comment 8•22 years ago
OK, so let's make this bug broader. A problem in another bug (which I'll dupe in a moment) was: http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797 Mozilla did not display the link in the status bar. pi
Summary: url-encoded characters above %7f trouble parameter parsing in mailto:-links → Handling of non ASCII characters in URLs
Comment 9•22 years ago
*** Bug 168793 has been marked as a duplicate of this bug. ***
Comment 10•22 years ago
If I enter the URI references:

http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797

and

http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%C2%A797

manually into Mozilla's (Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826) address bar, both of them actually work. Excellent.

Also, all of these URIs work, too:

mailto:test@test.com?subject=test§test%26body%3Dtest
mailto:test@test.com?subject=test%c2%a7test%26body%3Dtest
mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

The following URI does not:

mailto:test@test.com?subject=test%a7test%26body%3Dtest

but that's not a big problem because it's neither a valid URI nor can it be expected that it would be in the future (the trend goes towards UTF-8). ((But maybe Mozilla should only produce a single <?> character (U+FFFD) for the non-UTF-8 "%A7" and not for the rest of the string.))

So the bug seems not to be related to Mozilla's URI handling in general but only to extracting links from HTML documents.
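Both ideas for dealing with a stray %A7 (a single U+FFFD for just the offending byte, or the "try UTF-8, then fall back to ISO-8859-1" rule from comment 7) can be sketched in Python. The function name `decode_escaped` and its `fallback` parameter are assumptions for the example; nothing here is taken from Mozilla's code.

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_escaped(value, fallback="iso-8859-1"):
    """Hypothetical helper: strict UTF-8 first, 8-bit fallback second."""
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat every byte as one fallback character.
        return raw.decode(fallback)


print(decode_escaped("test%C2%A7test"))   # valid UTF-8 is used as-is
print(decode_escaped("test%A7test"))      # lone A7 falls back to ISO-8859-1

# Lenient UTF-8 decoding replaces only the bad byte with U+FFFD
# and keeps the rest of the string intact.
print(unquote("test%A7test%26body%3Dtest") == "test\ufffdtest&body=test")
```

The fallback variant recovers "§" for both spellings, at the cost of guessing; the lenient variant never guesses but loses the character.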
Comment 12•22 years ago
-> intl
Assignee: darin → yokoyama
Component: Networking → Internationalization
QA Contact: benc → ruixu
Comment 14•20 years ago
*** Bug 285731 has been marked as a duplicate of this bug. ***
Comment 15•19 years ago
*** Bug 307694 has been marked as a duplicate of this bug. ***
Comment 16•19 years ago
*** Bug 296934 has been marked as a duplicate of this bug. ***
Comment 17•19 years ago
*** Bug 307940 has been marked as a duplicate of this bug. ***
Comment 18•19 years ago
*** Bug 192108 has been marked as a duplicate of this bug. ***
Comment 19•18 years ago
*** Bug 341532 has been marked as a duplicate of this bug. ***
Comment 20•18 years ago
*** Bug 354567 has been marked as a duplicate of this bug. ***
Updated•15 years ago
QA Contact: amyy → i18n
Comment 28•2 years ago
The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.
Assignee: jshin1987 → nobody
Updated•2 years ago
Severity: normal → S3
Comment 29•2 years ago
The severity field for this bug is relatively low, S3. However, the bug has 9 duplicates.
:m_kato, could you consider increasing the bug severity?
For more information, please visit auto_nag documentation.
Flags: needinfo?(m_kato)
Comment 30•2 years ago
The last needinfo from me was triggered in error by recent activity on the bug. I'm clearing the needinfo since this is a very old bug and I don't know if it's still relevant.
Flags: needinfo?(m_kato)