169388 - Handling of non ASCII characters in URLs

Reporter

Description

•

22 years ago

If a mailto-link contains url-encoded characters above %7f
(e.g. %A7 for "§" or %B4 for "´"), then the parameter parsing
appears to be confused. Url-encoded characters like "&" (%26)
and "=" (%3D) are suddenly interpreted just like they aren't
url-encoded. Furthermore the status bar doesn't show the link-
url if the mouse moves over the link.

The following link should open the mail client with a recipient
"test@test.com" and a subject "test§test&body=test". Netscape
4.78 actually does that. Mozilla sets the subject to "test§test"
and sets the body to "test":

  mailto:test@test.com?subject=test%A7test%26body%3Dtest
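
For comparison, a minimal Python sketch of the parse the report expects: split the query on literal "&" and "=" first, and only then decode the %HH escapes in each value. Reading %A7 as an ISO-8859-1 byte is an assumption taken from the document encoding listed in the report. The variable names are illustrative, and this is not Mozilla's code.

```python
from urllib.parse import urlsplit, parse_qs

link = "mailto:test@test.com?subject=test%A7test%26body%3Dtest"
parts = urlsplit(link)

# parse_qs splits on literal "&" and "=" before it decodes %HH escapes,
# so %26 and %3D stay inside the subject value.
# Assumption: bytes above %7F are ISO-8859-1.
fields = parse_qs(parts.query, encoding="latin-1")

recipient = parts.path
subject = fields["subject"][0]
print(recipient, subject)
```

With this order of operations there is only one field, "subject", and no "body" field at all.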


The following link should open the mail client with the subject
"test´&test". Mozilla sets the subject to "test´" and drops the
rest:

  mailto:test@test.com?subject=test%B4%26test


Everything seems to work fine in Mozilla when the above-%7f
character isn't encoded, e.g.:

  mailto:test@test.com?subject=test§test%26body%3Dtest


I didn't have much time to look into this, sorry. Maybe it's just
me who is confused or who didn't understand url-encoding. Maybe
there's something in an RFC about url-encoding for mailto which
I don't know yet. Sorry then. I think the behaviour is strange
enough to report it.

- Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2a) Gecko/20020910
- Character Encoding Western (ISO 8859-1), auto-detect off
- Mozilla Mail is default mailer (IMAP-SSL account)
- Windows 2000 SP2 (german)

I only found this statement from July 31, 2001 on
http://mozilla-evangelism.bclary.com/letter/news.html, which might
point at the same problem:

  "Also, when preparing the string containing the letter for use
   in the mailto: url assignment, it appears that ISO-8859-2
   strings are truncated when using the JS function escape() to
   prepare them for use in the URL. I am not sure if I am causing
   this problem or if it is a problem in Mozilla."

Boris 'pi' Piwinger

Comment 1

•

22 years ago

Confirming on Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318.

I recently had a very similar bug which proved to be invalid for a reason which
does not apply here, see bug 168793. CCing Niklas who helped to solve it.

pi

Assignee: asa → harishd

Status: UNCONFIRMED → NEW

Component: Browser-General → Parser

Ever confirmed: true

OS: Windows 2000 → All

QA Contact: asa → moied

Hardware: PC → All

Boris 'pi' Piwinger

Comment 2

•

22 years ago

Attached file Testcase with reporter's information (deleted) — Details

Christopher Hoess (gone)

Comment 3

•

22 years ago

->Networking. Parser is HTML only, not appropriate for the URL parser.

Assignee: harishd → new-network-bugs

Component: Parser → Networking

QA Contact: moied → benc

Niklas Dougherty

Comment 4

•

22 years ago

It actually is related to bug 168793, namely because everything above %7F is
supposed to be encoded with at least two bytes. Why is that? Because everything
not ASCII (> %7F) _should_ be encoded with UTF-8, which uses one byte for the
first 128 chars and two or more for the rest (two bytes up to U+07FF, then three
and four bytes for characters in still higher positions in Unicode). Excerpt
from the HTML spec:

****
B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors
sometimes specify them in attribute values expecting URIs (i.e., defined with
%URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling
non-ASCII characters in such cases:

   1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
   2. Escape these bytes with the URI escaping mechanism (i.e., by converting
each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738],
section 2.2 or [RFC2141], section 2) that is independent of the character
encoding to which the HTML document carrying the URI may have been transcoded.
***

π-winger's problem was that he tried to escape a two-byte value (in UTF-8) with
one byte, which failed. Same problem here. Find the relevant two-byte
combination, and it will probably work. The section sign is encoded as %C2%A7 in
a UTF-8 URI hex scheme.
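
A short Python sketch of the two steps from the quoted section (UTF-8 bytes first, then %HH per byte), next to the single-byte ISO-8859-1 form used in the original links. The variable names are only illustrative.

```python
from urllib.parse import quote

# Steps 1 and 2 of the quoted convention: UTF-8 bytes, then %HH per byte.
section_utf8 = quote("§", encoding="utf-8")

# The single-byte form the reported links used (ISO-8859-1).
section_latin1 = quote("§", encoding="latin-1")

# The spec's own example name, escaped the recommended way.
name_utf8 = quote("Håkon", encoding="utf-8")

print(section_utf8, section_latin1, name_utf8)
```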

Niklas Dougherty

Comment 5

•

22 years ago

And, may I add, it is safest NOT to encode anything. Let the browser do the
encoding. This will also make your code more legible.

Unicode to the world.

Christian Roslawski

Reporter

Comment 6

•

22 years ago

The (default) character encoding of the document is ISO-8859-1, not UTF-8
like http://piology.org/ (at least Mozilla "view page info" tells me the
document there uses UTF-8 encoding). URI-encoded characters above %7f in
attribute values of http-links seem to work fine in ISO-8859-1 encoded
pages. The problem only happens in mailto-links.

I encoded the parameters of the mailto-link in UTF-8, but didn't change
the document encoding. It works fine with Mozilla. But IE 6.0 and Netscape
4.78 translate it into two characters, e.g. %C2%A7 yields "Â§" instead
of "§". And when I use %C2%A7 to encode an attribute value of a http-
link, Mozilla yields "Â§" as well (on ISO-8859-1 encoded document).
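
The two-character result is what you get when the UTF-8 bytes behind %C2%A7 are read one byte at a time as ISO-8859-1. A small Python sketch of that effect (it only illustrates the byte arithmetic, not what any of the browsers do internally):

```python
from urllib.parse import unquote

escaped = "%C2%A7"

# Read the two bytes as one UTF-8 sequence.
as_utf8 = unquote(escaped, encoding="utf-8")

# Read the same two bytes as two ISO-8859-1 characters.
as_latin1 = unquote(escaped, encoding="latin-1")

print(as_utf8, as_latin1)
```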

Btw, I had a quick run at the link with the UTF-8 encoded anchor name on
http://piology.org/. IE, Netscape, Opera, and Lynx fail to jump to the
given anchor point.

I don't say the other browsers are right, it's just that I'm seriously
confused by now. :-) And I don't know about HTML-specifications on URIs,
but why should they handle the encoding of http-links and mailto-links
differently?

And no encoding of attribute values at all isn't always an option,
especially when you have to handle dynamic content based on textual
input of various users.

Claus Färber

Comment 7

•

22 years ago

@Niklas Dougherty:

The part from the HTML spec you have posted is only a recommendation
for error correction when a URI contains non-ASCII characters. These
and only these should be encoded as UTF-8 and then as %nn.

They don't say that URIs can't contain (encoded) octets that are not
part of a valid UTF-8 sequence. (This is out of the scope of an HTML
specification -- or even the W3C -- anyway.)
And some URL schemes, such as "data" (RFC 2397), rely on that.

And no, it's not safest not to encode anything. This results in INVALID
URIs and amounts to relying on the error-correction scheme you mentioned.

------------------------

On the other hand, the URI shown above is still invalid. This is because
the "mailto" scheme does not define any representation for non-ASCII
characters (see RFC 2368):

| 8-bit characters in mailto URLs are forbidden. MIME encoded words (as
| defined in [RFC2047]) are permitted in header values, but not for any
| part of a "body" hname.
                                                  -- RFC 2368, section 2

But as it explicitly does allow MIME encoding (RFC 2047) in headers, the
URI written above should be written as:

  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

Note that the following would also be valid (but includes an extra
space):

  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3F%3D%20%26body%3Dtest

For the body, this does not work (you can't have encoded words in the
body and the spec explicitly disallows content-* headers).  
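
To make the two layers visible, here is a hedged Python sketch that unpacks such a subject: first the URI escaping, then the RFC 2047 encoded word. It assumes the encoded word has the form =?charset?Q?text?= (with the "Q" encoding marker), and it uses Python's standard email package purely for illustration.

```python
from email.header import decode_header, make_header
from urllib.parse import urlsplit, parse_qs

uri = ("mailto:test@test.com?subject="
       "%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D")

# Layer 1: URI escaping. Only ASCII comes out, as RFC 2368 requires.
query = urlsplit(uri).query
encoded_word = parse_qs(query, encoding="ascii", errors="strict")["subject"][0]

# Layer 2: the RFC 2047 encoded word, which names its charset explicitly.
subject = str(make_header(decode_header(encoded_word)))

print(encoded_word)
print(subject)
```

The charset travels with the header value here, so the receiver does not have to guess between UTF-8 and ISO-8859-1.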

Now, what should Mozilla do when it encounters such an invalid URI:

1. It should be able to display it.
2. It should be able to parse it correctly and not confuse parts of the
   (intended) header with other (intended) headers or the (intended)
   body.
3. If the data for a header or for the body (incorrectly) contains
   non-ASCII characters, it should try to interpret it as UTF-8 and,
   failing that, as ISO-8859-1 (or any other 8bit charset determined by
   other means...)
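
Points 2 and 3 can be sketched in a few lines of Python: split on the literal separators before decoding anything, then try UTF-8 and fall back to ISO-8859-1. The function names are made up for this example, and the fallback charset is fixed to ISO-8859-1 as an assumption; a real client might determine it by other means.

```python
from urllib.parse import unquote_to_bytes


def decode_field(escaped: str) -> str:
    """Decode %HH escapes, trying UTF-8 first and ISO-8859-1 second."""
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


def parse_mailto_query(query: str) -> dict:
    """Split on literal '&' and '=' first, then decode each part."""
    fields = {}
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        fields[decode_field(name).lower()] = decode_field(value)
    return fields


print(parse_mailto_query("subject=test%A7test%26body%3Dtest"))
print(parse_mailto_query("subject=test%C2%A7test&body=hi"))
```

Because the split happens before decoding, an escaped %26 or %3D can never be mistaken for a separator, whatever the bytes around it look like.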

Boris 'pi' Piwinger

Comment 8

•

22 years ago

OK, so let's make this bug broader.

A problem in another bug (which I'll dupe in a moment) was:
http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797
Mozilla did not display the link in the status bar.

pi

Summary: url-encoded characters above %7f trouble parameter parsing in mailto:-links → Handling of non ASCII characters in URLs

Boris 'pi' Piwinger

Comment 9

•

22 years ago

*** Bug 168793 has been marked as a duplicate of this bug. ***

Claus Färber

Comment 10

•

22 years ago

If I enter the URI references:
  http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%A797 and
  http://www.duden.de/schreibung/regelwerk/zeichen_11.html#%C2%A797
manually into Mozilla's (Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
rv:1.1) Gecko/20020826) address bar, both of them actually work. Excellent.

Also, all of these URIs do work, too:
  mailto:test@test.com?subject=test§test%26body%3Dtest  
  mailto:test@test.com?subject=test%c2%a7test%26body%3Dtest  
  mailto:test@test.com?subject=%3D%3FISO-8859-1%3FQ%3Ftest%3DA7test%3D26body%3D3Dtest%3F%3D

The following URI does not:
  mailto:test@test.com?subject=test%a7test%26body%3Dtest
but that's not a big problem because it's neither a valid URI nor can it be
expected that it would be in the future (the trend goes towards UTF-8). (But
maybe Mozilla should only produce a single <?> character (U+FFFD) for the
non-UTF-8 "%A7" and not for the rest of the string.)
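
As an illustration of the "single U+FFFD" behaviour, this is what a lenient UTF-8 decoder does in Python: only the invalid byte is replaced and the rest of the string survives. It is a sketch of the suggested outcome, not a description of Mozilla's decoder.

```python
from urllib.parse import unquote

# %a7 on its own is not a valid UTF-8 sequence; errors="replace" swaps in
# U+FFFD for that one byte and keeps decoding the remaining bytes.
decoded = unquote("test%a7test%26body%3Dtest", encoding="utf-8", errors="replace")

print(decoded)
```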

So the bug seems not to be related to Mozilla's URI handling in general but
only to extracting links from HTML documents.

Doug Turner (:dougt)

Comment 11

•

22 years ago

darin

Assignee: new-network-bugs → darin

Darin Fisher

Comment 12

•

22 years ago

-> intl

Assignee: darin → yokoyama

Component: Networking → Internationalization

QA Contact: benc → ruixu

Rui Xu

Updated

•

22 years ago

Keywords: intl

QA Contact: ruixu → ylong

Roy Yokoyama

Comment 13

•

22 years ago

nhotta

Assignee: yokoyama → nhotta

Yuying Long

Updated

•

22 years ago

Blocks: 157673

Simon Montagu :smontagu

Updated

•

22 years ago

Blocks: 181117

Mike Cowperthwaite

Comment 14

•

20 years ago

*** Bug 285731 has been marked as a duplicate of this bug. ***

Elmar Ludwig

Comment 15

•

19 years ago

*** Bug 307694 has been marked as a duplicate of this bug. ***

Mike Cowperthwaite

Comment 16

•

19 years ago

*** Bug 296934 has been marked as a duplicate of this bug. ***

OstGote!

Comment 17

•

19 years ago

*** Bug 307940 has been marked as a duplicate of this bug. ***

OstGote!

Comment 18

•

19 years ago

*** Bug 192108 has been marked as a duplicate of this bug. ***

Jungshik Shin

Updated

•

19 years ago

Assignee: nhottanscp → jshin1987

Blocks: iri

Mike Cowperthwaite

Comment 19

•

18 years ago

*** Bug 341532 has been marked as a duplicate of this bug. ***

Mike Cowperthwaite

Comment 20

•

18 years ago

*** Bug 354567 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Updated

•

15 years ago

QA Contact: amyy → i18n

BugBot [:suhaib / :marco/ :calixte]

Comment 28

•

2 years ago

The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: jshin1987 → nobody

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

BugBot [:suhaib / :marco/ :calixte]

Comment 29

•

2 years ago

The severity field for this bug is relatively low, S3. However, the bug has 9 duplicates.
:m_kato, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(m_kato)

BugBot (nomail) [:suhaib / :marco/ :calixte]

Comment 30

•

2 years ago

The last needinfo from me was triggered in error by recent activity on the bug. I'm clearing the needinfo since this is a very old bug and I don't know if it's still relevant.

Flags: needinfo?(m_kato)