Closed Bug 66515 Opened 24 years ago Closed 13 years ago

Mozilla incorrectly rewrites URLs containing ISO characters

Categories

(Core :: DOM: Navigation, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: michele, Unassigned)


Attachments

(1 file)

If I type the URL http://www.dmoz.org/World/Français/, Mozilla rewrites it to http://www.dmoz.org/World/Fran%C3%A7ais/ instead of http://www.dmoz.org/World/Fran%E7ais/. The same URL is correctly translated if it appears in an <A HREF> tag in an HTML document.
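
To make the rewrite concrete: in ISO-8859-1 the 'ç' is the single byte 0xE7, which escapes to %E7, while in UTF-8 it is the two bytes 0xC3 0xA7, which escape to %C3%A7. A minimal standalone C++ sketch (illustrative only, not Mozilla code) that prints both forms:

    #include <cstddef>
    #include <cstdio>

    // Percent-encode a raw byte sequence, one %XX triplet per byte.
    static void percentEncode(const unsigned char* bytes, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i)
            std::printf("%%%02X", bytes[i]);
        std::printf("\n");
    }

    int main() {
        const unsigned char latin1[] = { 0xE7 };        // 'ç' in ISO-8859-1
        const unsigned char utf8[]   = { 0xC3, 0xA7 };  // 'ç' in UTF-8
        percentEncode(latin1, sizeof latin1);  // prints %E7
        percentEncode(utf8, sizeof utf8);      // prints %C3%A7
        return 0;
    }
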
Confirming with build 2001012420 on NT4. Platform/OS -> All/All, Component -> Networking, Severity -> Major (can't type in that URL)
Assignee: asa → neeti
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Browser-General → Networking
Ever confirmed: true
OS: Linux → All
QA Contact: doronr → tever
Hardware: PC → All
Could be a dup of bug 31225, although here the URI works as long as it is not typed in by hand. Adding dependency to URI tracking bug.
Blocks: 61999
No, this is not a dup of bug 31225: it is not related to host resolving. It seems to be related to character encoding and conversion, not to URL parsing.
Keywords: nsbeta1
Target Milestone: --- → mozilla0.9.1
I think this is being caused by the difference in how we are escaping the URLs for the location bar vs. the href handler. Someone in docshell should verify this.
Assignee: neeti → adamlock
Component: Networking → Embedding: Docshell
QA Contact: tever → adamlock
->Chak per Jud
Assignee: adamlock → chak
I think the issue here is the use of ToNewUTF8String() in NS_NewURI() at http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86. Piping a URL with non-ASCII chars through ToNewUTF8String() to force a conversion to a single-byte char* seems incorrect: this function converts the 'ç' to UTF-8, which results in two chars. Then, when the HTTP request string is built inside nsHTTPRequest::formBuffer() (at http://lxr.mozilla.org/seamonkey/source/netwerk/protocol/http/src/nsHTTPRequest.cpp#410), a call to GetPath() is made, which results in the escaped string "World/Fran%C3%A7ais/" being added to the request, hence the server responds with a page not found.

I think the way to fix this would be to call ToNewCString() instead of ToNewUTF8String() (at http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86); I'll submit the patch. I tested with that change and it seems to work fine. I'll let the experts in this area tell me if this breaks anything else and/or if there's a better way to fix it.

PS: If you want to try out this change yourself: since the functions in nsNetUtil.h are inlined, you may have to do a clean build of at least netwerk and docshell to test it, i.e. just changing nsNetUtil.h and doing a make won't help.
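
For reference, here is a toy sketch of how the two conversions diverge at the byte level. The helpers below are hypothetical stand-ins, not the Mozilla string API (ToNewCString()/ToNewUTF8String() are only being imitated):

    #include <cstdio>
    #include <string>

    // Hypothetical stand-in for a naive "to C string" conversion: truncate
    // each UCS-2 code unit to its low byte (lossless only for Latin-1).
    static std::string toNarrow(const std::u16string& s) {
        std::string out;
        for (char16_t c : s)
            out += static_cast<char>(c & 0xFF);
        return out;
    }

    // Hypothetical stand-in for a UTF-8 conversion; handles only code
    // points below U+0800, which is enough for Latin-1 input.
    static std::string toUTF8(const std::u16string& s) {
        std::string out;
        for (char16_t c : s) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    int main() {
        std::u16string path = u"Fran\u00E7ais";  // 'ç' is U+00E7
        // 8 bytes narrow vs. 9 bytes as UTF-8: the 'ç' became 0xC3 0xA7,
        // which later escapes to %C3%A7 instead of %E7.
        std::printf("narrow: %zu bytes\n", toNarrow(path).size());
        std::printf("utf8:   %zu bytes\n", toUTF8(path).size());
        return 0;
    }
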
Is there anyone else who can r= this one? Thanks.
Seems like ftang added the UTF-8 conversion. Any reason why it should not be changed back, Frank?
No, please do not change it to ToNewCString(); this will break other stuff. Please read http://www.ietf.org/rfc/rfc2718.txt: the URI should be UTF-8.
Also, please read ftp://ftp.isi.edu/in-notes/rfc2396.txt. The right thing to do is to %-encode the text in the upper level, while we still know the encoding information, and pass it down in % form. ToNewCString() will break the I-DNS work, Internet Keywords, and so on.
Frank: assuming that we do not call ToNewCString(), as you're suggesting, how and where do we fix this issue?
Domain names have to be UTF-8. For the path part, we can try the document charset after failing with UTF-8. There is a similar issue for HTML anchors; see the HTML spec: http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
What http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1 says is fine for an href inside a document. But the note does not address what should be done when there is no document charset specified, or when, for example, we enter the URL via the URL bar (which is what this bug is about).

[Also, adhering to the above recommendation can get us into a lot of trouble if we're not careful. For example, imagine the user requests a URL (with non-ASCII chars) which does not really exist on the server. We request the URL first with UTF-8 encoding and get back an HTTP 404 (doc not found), since the doc does not exist. Now we make another request, this time with the URL encoded in the document charset, which happens to be UTF-8. We get a 404 back again, since the second request was essentially the same as the first. At this stage we need to keep track of the request count so as not to get into recursive requests for a non-existent URL.]

I'm also not sure Communicator 4.x implements what's specified in the note above, and it seems to be working fine with URLs containing non-ASCII chars. How is Communicator handling this issue? Just curious... Thanks.
I just mentioned the HTML spec because it describes the fallback method of trying UTF-8, then a document charset. I think 4.x converts the URL to the OS default charset. This works for the limited cases where the server's charset is the same as the client OS charset. If there is a way to know the server's charset, then that should be used. If it's not possible to get a server's charset or a document charset, then the OS charset could be used as a fallback for UTF-8.
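
A minimal sketch of that fallback order, assuming a caller that can re-issue the request: try UTF-8 first, then the document charset, then the OS charset, skipping a charset that repeats the previous attempt so a UTF-8 document does not trigger the identical second request described above. All names here are hypothetical; escapeAs() and fetch() are demo stand-ins, not necko APIs:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Stand-in: a real implementation would convert `path` to `charset`
    // and percent-encode it. Here we just tag the path for the demo.
    static std::string escapeAs(const std::string& path,
                                const std::string& charset) {
        return path + " [" + charset + "]";
    }

    // Stand-in for issuing the request; pretend the server only knows
    // the ISO-8859-1 form of the path.
    static bool fetch(const std::string& escapedUrl) {
        return escapedUrl.find("[ISO-8859-1]") != std::string::npos;
    }

    static bool loadWithFallback(const std::string& path,
                                 const std::string& documentCharset,
                                 const std::string& osCharset) {
        std::vector<std::string> charsets = { "UTF-8", documentCharset,
                                              osCharset };
        std::string lastTried;
        for (const std::string& cs : charsets) {
            // Skip adjacent duplicates; a real guard would track every
            // attempt (and cap the request count).
            if (cs == lastTried)
                continue;
            lastTried = cs;
            if (fetch(escapeAs(path, cs))) {
                std::printf("loaded with charset %s\n", cs.c_str());
                return true;
            }
        }
        return false;
    }

    int main() {
        // Document charset UTF-8 duplicates the first attempt and is
        // skipped; the OS charset then succeeds.
        loadWithFallback("/World/Fran\xE7" "ais/", "UTF-8", "ISO-8859-1");
        return 0;
    }
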
We've pretty much got to ignore the specs because we can't code to them anyway; backwards compat and real-world scenarios are our masters. Cc'ing darin, because we were recently talking about eradicating unicode from necko altogether :-), and this sort of falls along those lines. I'm going to try and break this down like so many have before:

1. From the UI standpoint, we want to present URLs in their native character format. If that means unicode at the UI level, fine, but let's keep that at the level *above* necko. If we can't do this, someone please explain why not.

2. From necko's standpoint, all it should be dealing with is raw escaped char*'s. If I hand necko a URL with a space in it, it needs to escape (not UTF-8 encode) that space, and send the escaped request out onto the network.

So, can't we remove all the encoding from necko and ensure that all of the encoding happens *above* necko? Necko can't do anything with it anyway, if my *real-world* understanding is correct. This would mean necko util callsites would need to encode/decode on their own.
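
A sketch of the kind of byte-level escaping point 2 argues for: percent-encode unsafe octets exactly as they arrive, with no charset conversion anywhere (again illustrative only, not the necko implementation):

    #include <cctype>
    #include <cstdio>
    #include <string>

    // Percent-encode unsafe octets and pass everything else through.
    // No charset conversion happens here: bytes in, escaped bytes out.
    static std::string escapeBytes(const std::string& in) {
        static const char hex[] = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : in) {
            if (std::isalnum(c) || c == '-' || c == '.' || c == '_' ||
                c == '~' || c == '/') {
                out += static_cast<char>(c);
            } else {
                out += '%';
                out += hex[c >> 4];
                out += hex[c & 0x0F];
            }
        }
        return out;
    }

    int main() {
        // The space becomes %20 and the raw Latin-1 byte 0xE7 becomes
        // %E7; nothing is re-encoded as UTF-8 on the way out.
        std::printf("%s\n",
                    escapeBytes("/World/Fran\xE7" "ais test/").c_str());
        return 0;
    }
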
The *old* URL spec said everything had to be ASCII or %-escaped, so http://www.dmoz.org/World/Français/ is illegal in terms of the *old* URL spec. The *new* URL guideline proposes using UTF-8 in URLs, so the ISO Latin 1 http://www.dmoz.org/World/Français/ is ALSO illegal in terms of the *new* URL draft. In the meantime, we could also see ShiftJIS, EUC-JP, Russian, or Big5 URL paths similar to http://www.dmoz.org/World/Français/ in the real world. The data may or may not be encoded in ISO-8859-1.

I think there is no real solution if the user types into the URL bar. Basically, there is no way we can know what charset it is. It could be ISO-8859-1 on the server side, it could be ISO-8859-2, it could be anything; we have no context for it. The next best thing we can do is assume it follows the new URL / IDNS spec, which is UTF-8 in the URL. That is why we convert to UTF-8. The other reasons we convert to UTF-8 are 1) IDNS, 2) Internet Keywords, 3) What's Related, 4) ODP accepts UTF-8 URLs; and UTF-8 is the only choice with which we won't lose data.

I think the right thing to do is to %-escape as much as we can in the upper level (as we do now). For the edge cases where we cannot, converting to UTF-8 will at least ensure forward compatibility. Also, I think I did something special for file:/// URLs: if it is a file URL, we convert to the filesystem charset and escape it. For HTTP URLs, I think there is no real solution.
Also, be aware that both LDAP and IMAP URLs are in UTF-8, as defined in:
ftp://ftp.isi.edu/in-notes/rfc2253.txt
ftp://ftp.isi.edu/in-notes/rfc2255.txt
ftp://ftp.isi.edu/in-notes/rfc2192.txt

> we've pretty much got to ignore the specs because we can't code to them
> anyway. backwards compat and real-world scenarios are our masters.

I agree with you for the ftp:// and http:// cases. But for IMAP and LDAP URLs, we have been doing UTF-8 for a while already. And you have to allow UTF-8 in nsNetUtil, since nsNetUtil is not only for the http/ftp/file protocols. We want to make sure nsNetUtil works for IMAP/LDAP also, if the URL contains UTF-8.
Changing milestone to 0.9.2 since there's going to be some reworking of the Necko layer wrt handling wide-char strings. This bug depends on those changes and will be revisited when they're in place.
Status: NEW → ASSIGNED
Target Milestone: mozilla0.9.1 → mozilla0.9.2
->0.9.3
Target Milestone: mozilla0.9.2 → ---
Target Milestone: --- → mozilla0.9.3
->0.9.4
Target Milestone: mozilla0.9.3 → mozilla0.9.4
Target Milestone: mozilla0.9.4 → mozilla1.0
Blocks: 104166
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 (you can query for this string to delete spam or retrieve the list of bugs I've moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Keywords: mozilla1.3, patch, review
Summary: Mozilla incorrectly rewrites URLS containing ISO characters → Mozilla incorrectly rewrites URLs containing ISO characters
I can reproduce this on Linux with FF 20040406 by copying the URL to the clipboard and pasting it into the URL bar.
Keywords: mozilla1.3top100
I have no problem with this URL on XP with FF 20040419. I just copy/paste into the URL bar and get http://www.dmoz.org/World/Fran%E7ais/.
Assignee: chak → nobody
Status: ASSIGNED → NEW
QA Contact: adamlock → docshell
* dmoz's encoding is UTF-8 these days.
* This is probably WONTFIX, since the Awesomebar deeply depends on UTF-8 URIs.

-> wfm
Severity: major → normal
Status: NEW → RESOLVED
Closed: 13 years ago
Keywords: top100
Resolution: --- → WORKSFORME
Target Milestone: mozilla1.0.1 → ---