Open Bug 169388 Opened 22 years ago Updated 2 years ago

Handling of non ASCII characters in URLs

Categories

(Core :: Internationalization, defect)

People

(Reporter: roslawski, Unassigned)

References

Details

(Keywords: intl)

Attachments

(1 file)

If a mailto link contains URL-encoded characters above %7f
(e.g. %A7 for "§" or %B4 for "´"), the parameter parsing
appears to get confused. URL-encoded characters like "&" (%26)
and "=" (%3D) are then interpreted as if they weren't
URL-encoded. Furthermore, the status bar doesn't show the link
URL when the mouse moves over the link.

The following link should open the mail client with a recipient
"test@test.com" and a subject "test§test&body=test". Netscape
4.78 actually does that. Mozilla sets the subject to "test§test"
and sets the body to "test":

  mailto:test@test.com?subject=test%A7test%26body%3Dtest
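
The parsing order the report expects can be sketched in Python: split the query on the literal "&" and "=" first, and percent-decode each piece only afterwards, so that %26 and %3D stay part of the value. This is only an illustration, not Mozilla's code; `parse_mailto` is a hypothetical helper, and decoding %A7 as ISO-8859-1 is an assumption taken from the document encoding mentioned in the report.

```python
from urllib.parse import unquote


def parse_mailto(url, charset="iso-8859-1"):
    """Split a mailto URL into (recipient, {header: value}).

    Hypothetical helper: delimiters are located *before* percent-decoding,
    so an encoded "&" (%26) or "=" (%3D) never acts as a separator.
    """
    scheme, _, rest = url.partition(":")
    if scheme.lower() != "mailto":
        raise ValueError("not a mailto URL")
    recipient, _, query = rest.partition("?")
    headers = {}
    for field in query.split("&"):
        if not field:
            continue
        name, _, value = field.partition("=")
        # Decode only now; the charset is an assumption, the URL does not carry it.
        headers[unquote(name, encoding=charset).lower()] = unquote(value, encoding=charset)
    return unquote(recipient, encoding=charset), headers


to, headers = parse_mailto("mailto:test@test.com?subject=test%A7test%26body%3Dtest")
print(to, headers)
```

Under these assumptions the link yields one header, a subject of "test§test&body=test", and no body, which matches the Netscape 4.78 behaviour described above.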


The following link should open the mail client with the subject
"test´&test". Mozilla sets the subject to "test´" and drops the
rest:

  mailto:test@test.com?subject=test%B4%26test


Everything seems to work fine in Mozilla when the above-%7f
character isn't encoded, e.g.:

  mailto:test@test.com?subject=test§test%26body%3Dtest


I didn't have much time to look into this, sorry. Maybe it's just
me who is confused or who didn't understand URL encoding. Maybe
there's something in an RFC about URL encoding for mailto which
I don't know yet. Sorry then. I think the behaviour is strange
enough to report it.

- Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2a) Gecko/20020910
- Character Encoding Western (ISO 8859-1), auto-detect off
- Mozilla Mail is default mailer (IMAP-SSL account)
- Windows 2000 SP2 (german)

I only found this statement from July 31, 2001 on
http://mozilla-evangelism.bclary.com/letter/news.html, which might
point at the same problem:

  "Also, when preparing the string containing the letter for use
   in the mailto: url assignment, it appears that ISO-8859-2
   strings are truncated when using the JS function escape() to
   prepare them for use in the URL. I am not sure if I am causing
   this problem or if it is a problem in Mozilla."
Confirming on Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318.

I recently had a very similar bug which proved to be invalid for a reason which
does not apply here, see bug 168793. CCing Niklas who helped to solve it.

pi
Assignee: asa → harishd
Status: UNCONFIRMED → NEW
Component: Browser-General → Parser
Ever confirmed: true
OS: Windows 2000 → All
QA Contact: asa → moied
Hardware: PC → All
->Networking. Parser is HTML only, not appropriate for the URL parser.
Assignee: harishd → new-network-bugs
Component: Parser → Networking
QA Contact: moied → benc
It actually is related to bug 168793, namely because everything above %7F is
supposed to be encoded with two bytes. Why is that? Because everything not ASCII
(> %7F) _should_ be encoded with UTF-8, which uses one byte for the first 128
chars and two for the rest (and then three and four bytes for characters in
still higher positions in Unicode). Excerpt from the HTML spec:

****
B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors
sometimes specify them in attribute values expecting URIs (i.e., defined with
%URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling
non-ASCII characters in such cases:

   1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
   2. Escape these bytes with the URI escaping mechanism (i.e., by converting
each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738],
section 2.2 or [RFC2141], section 2) that is independent of the character
encoding to which the HTML document carrying the URI may have been transcoded.
***
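
The two steps from the spec excerpt (UTF-8 first, then %HH escaping of each byte) can be tried out with Python's standard `urllib.parse.quote`, which encodes non-ASCII characters as UTF-8 by default. This is just an illustration of the convention, not what any browser runs; the variable names are made up for the example.

```python
from urllib.parse import quote

# Steps 1 and 2 in one call: each non-ASCII character becomes its UTF-8
# bytes, and each of those bytes becomes %HH.
escaped_url = quote("http://foo.org/Håkon", safe=":/")
print(escaped_url)        # http://foo.org/H%C3%A5kon

# The section sign from the report, escaped as UTF-8 (two bytes) ...
section_utf8 = quote("§")
print(section_utf8)       # %C2%A7

# ... and, for comparison, as the single ISO-8859-1 byte used in the report.
section_latin1 = quote("§", encoding="iso-8859-1")
print(section_latin1)     # %A7
```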

π-winger's problem was that he tried to escape a two-byte value (in UTF-8) with
one byte, which failed. Same problem here. Find the relevant two-byte
combination, and it will probably work. The section sign is encoded as %C2%A7 in
a UTF-8 URI hex scheme.
And, may I add, it is safest NOT to encode anything. Let the browser do the
encoding. This will also make your code more legible.

Unicode to the world.
The (default) character encoding of the document is ISO-8859-1, not UTF-8
like http://piology.org/ (at least Mozilla "view page info" tells me the
document there uses UTF-8 encoding). URI-encoded characters above %7f in
attribute values of http-links seem to work fine in ISO-8859-1 encoded
pages. The problem only happens in mailto-links.

I encoded the parameters of the mailto-link in UTF-8, but didn't change
the document encoding. It works fine with Mozilla. But IE 6.0 and Netscape
4.78 translate it into two characters, e.g. %C2%A7 yields "Â§" instead
of "§". And when I use %C2%A7 to encode an attribute value of an
http-link, Mozilla yields "Â§" as well (on an ISO-8859-1 encoded document).
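
The two-character result is what one gets when the UTF-8 bytes C2 A7 are decoded as ISO-8859-1 instead of UTF-8. A small Python check, offered only as an illustration of the effect and not of any browser's internals:

```python
from urllib.parse import unquote

# Same escaped bytes, two different assumptions about the charset.
utf8_view = unquote("%C2%A7", encoding="utf-8")
latin1_view = unquote("%C2%A7", encoding="iso-8859-1")

print(ascii(utf8_view), len(utf8_view))      # '\xa7' 1      (the section sign)
print(ascii(latin1_view), len(latin1_view))  # '\xc2\xa7' 2  (A-circumflex, then section sign)
```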

Btw, I had a quick run at the link with the UTF-8 encoded anchor name on
http://piology.org/. IE, Netscape, Opera, and Lynx fail to jump to the
given anchor point.

I don't say the other browsers are right, it's just that I'm seriously
confused by now. :-) And I don't know about HTML-specifications on URIs,
but why should they handle the encoding of http-links and mailto-links
differently?

And no encoding of attribute values at all isn't always an option,
especially when you have to handle dynamic content based on textual
input of various users.
@Niklas Dougherty:

The part of the HTML spec you posted is only a recommendation
for error correction when a URI contains non-ASCII characters. These
and only these should be encoded as UTF-8 and then as %nn.

They don't say that URIs can't contain (encoded) octets that are not
part of a valid UTF-8 sequence. (This is out of the scope of an HTML
specification -- or even the W3C -- anyway.)
And some URL schemes, such as "data" (RFC 2397), rely on that.

And no, it's not safest not to encode anything. This results in INVALID
URIs and amounts to relying on the error-correction scheme you mentioned.

------------------------

On the other hand, the URI shown above is still invalid. This is because
the "mailto" scheme does not define any representation for non-ASCII
characters (see RFC 2368):

| 8-bit characters in mailto URLs are forbidden. MIME encoded words (as
| defined in [RFC2047]) are permitted in header values, but not for any
| part of a "body" hname.
                                                  -- RFC 2368, section 2

But as it explicitly does allow MIME encoding (RFC 2047) in headers, the
URI written above should be written as:

  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

Note that the following would also be valid (but includes an extra
space):

  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3F%3D%20%26body%3Dtest

For the body, this does not work (you can't have encoded words in the
body and the spec explicitly disallows content-* headers).  
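
To see how such a subject value unpacks, here is a minimal Python sketch using the standard `email.header` module. It assumes the encoded word carries the "Q" encoding marker that RFC 2047 requires between the charset and the text (=?charset?Q?text?=); `subject_from_encoded_value` is a hypothetical helper name, not anything from Mozilla.

```python
from email.header import decode_header, make_header
from urllib.parse import unquote


def subject_from_encoded_value(value):
    """Percent-decode a mailto header value, then decode RFC 2047 encoded words.

    Returns both the intermediate MIME form and the final text.
    """
    mime_form = unquote(value)  # only ASCII escapes are expected here
    text = str(make_header(decode_header(mime_form)))
    return mime_form, text


mime_form, text = subject_from_encoded_value(
    "%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D"
)
print(mime_form)  # =?ISO-8859-1?Q?test=A7test=26body=3Dtest?=
print(text)
```

The URL itself stays pure ASCII; the charset travels inside the encoded word, so the receiver does not have to guess it.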

Now, what should Mozilla do when it encounters such an invalid URI:

1. It should be able to display it.
2. It should be able to parse it correctly and not confuse parts of the
   (intended) header with other (intended) headers or the (intended)
   body.
3. If the data for a header or for the body (incorrectly) contains
   non-ASCII characters, it should try to interpret it as UTF-8 and,
   failing that, as ISO-8859-1 (or any other 8bit charset determined by
   other means...)
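
Point 3 could look roughly like this in Python. It is a sketch under assumptions: the value has already been separated from the other headers as in point 2, ISO-8859-1 is the chosen fallback, and `decode_header_value` is a hypothetical name.

```python
from urllib.parse import unquote_to_bytes


def decode_header_value(value, fallback="iso-8859-1"):
    """Percent-decode one header value; try UTF-8 first, then a fallback charset."""
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")      # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return raw.decode(fallback)     # ISO-8859-1 accepts every byte value


print(decode_header_value("test%C2%A7test%26body%3Dtest"))
print(decode_header_value("test%A7test%26body%3Dtest"))
```

Both calls give "test§test&body=test". The fallback can misfire when ISO-8859-1 bytes happen to form a valid UTF-8 sequence, so this is a heuristic rather than a guarantee.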
OK, so let's make this bug broader.

A problem in another bug (which I'll dupe in a moment) was:
http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797
Mozilla did not display the link in the status bar.

pi
Summary: url-encoded characters above %7f trouble parameter parsing in mailto:-links → Handling of non ASCII characters in URLs
*** Bug 168793 has been marked as a duplicate of this bug. ***
If I enter the URI references:
  http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797 and
  http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%C2%A797
manually into Mozilla's (Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
rv:1.1) Gecko/20020826) address bar, both of them actually work. Excellent.

Also, all of these URIs do work, too:
  mailto:test@test.com?subject=test§test%26body%3Dtest  
  mailto:test@test.com?subject=test%c2%a7test%26body%3Dtest  
  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

The following URI does not:
  mailto:test@test.com?subject=test%a7test%26body%3Dtest
but that's not a big problem because it's neither a valid URI nor can it be
expected to become one in the future (the trend goes towards UTF-8). (But
maybe Mozilla should produce only a single <?> character (U+FFFD) for the
non-UTF-8 "%A7" and not for the rest of the string.)
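
The single-replacement-character behaviour suggested here is, for comparison, what Python's `urllib.parse.unquote` does when asked to decode as UTF-8 with `errors="replace"`. This only illustrates the suggestion and says nothing about how Mozilla decodes the string.

```python
from urllib.parse import unquote

# The lone byte A7 is not valid UTF-8, so only that byte is replaced.
subject = unquote("test%A7test%26body%3Dtest", encoding="utf-8", errors="replace")
print(ascii(subject))  # 'test\ufffdtest&body=test'
```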

So the bug seems not to be related to Mozilla's URI handling in general but
only to extracting links from HTML documents.
darin
Assignee: new-network-bugs → darin
-> intl
Assignee: darin → yokoyama
Component: Networking → Internationalization
QA Contact: benc → ruixu
Keywords: intl
QA Contact: ruixu → ylong
nhotta
Assignee: yokoyama → nhotta
Blocks: 157673
Blocks: 181117
*** Bug 285731 has been marked as a duplicate of this bug. ***
*** Bug 307694 has been marked as a duplicate of this bug. ***
*** Bug 296934 has been marked as a duplicate of this bug. ***
*** Bug 307940 has been marked as a duplicate of this bug. ***
*** Bug 192108 has been marked as a duplicate of this bug. ***
Assignee: nhottanscp → jshin1987
Blocks: iri
*** Bug 341532 has been marked as a duplicate of this bug. ***
*** Bug 354567 has been marked as a duplicate of this bug. ***
QA Contact: amyy → i18n
Why should people use non-ASCII characters in URLs? UTF-8 should be enough; for example, at http://www.s1228.net UTF-8 is already enough to describe the URL.

The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: jshin1987 → nobody
Severity: normal → S3

The severity field for this bug is relatively low, S3. However, the bug has 9 duplicates.
:m_kato, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(m_kato)

The last needinfo from me was triggered in error by recent activity on the bug. I'm clearing the needinfo since this is a very old bug and I don't know if it's still relevant.

Flags: needinfo?(m_kato)