Closed Bug 66515 Opened 24 years ago Closed 13 years ago

Mozilla incorrectly rewrites URLs containing ISO characters

Categories

(Core :: DOM: Navigation, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: michele, Unassigned)


Attachments

(1 file)

If I type the URL http://www.dmoz.org/World/Français/, Mozilla rewrites it to http://www.dmoz.org/World/Fran%C3%A7ais/ instead of http://www.dmoz.org/World/Fran%E7ais/. The same URL is correctly translated if it appears in an <A HREF> tag in an HTML document.
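
To make the rewrite concrete: in ISO-8859-1 the 'ç' is the single byte 0xE7, which escapes to %E7, while in UTF-8 it is the two bytes 0xC3 0xA7, which escape to %C3%A7. A minimal standalone C++ sketch (illustrative only, not Mozilla code) that prints both forms:

    #include <cstddef>
    #include <cstdio>

    // Percent-encode a raw byte sequence, one %XX triplet per byte.
    static void percentEncode(const unsigned char* bytes, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i)
            std::printf("%%%02X", bytes[i]);
        std::printf("\n");
    }

    int main() {
        const unsigned char latin1[] = { 0xE7 };        // 'ç' in ISO-8859-1
        const unsigned char utf8[]   = { 0xC3, 0xA7 };  // 'ç' in UTF-8
        percentEncode(latin1, sizeof latin1);  // prints %E7
        percentEncode(utf8, sizeof utf8);      // prints %C3%A7
        return 0;
    }
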
Confirming with build 2001012420 on NT4. Platform/OS -> All/All, Component -> Networking, Severity -> Major (can't type in that URL)
Assignee: asa → neeti
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Browser-General → Networking
Ever confirmed: true
OS: Linux → All
QA Contact: doronr → tever
Hardware: PC → All
Could be a dup of bug 31225, although here the URI works as long as it is not typed in by hand. Adding dependency to URI tracking bug.
Blocks: 61999
No, this is not a dup of bug 31225: it is not related to host resolving. It seems to be related to character encoding and conversion, not to URL parsing.
Keywords: nsbeta1
Target Milestone: --- → mozilla0.9.1
I think this is being caused by the difference in how we are escaping the URLs for the location bar vs. the href handler. Someone in docshell should verify this.
Assignee: neeti → adamlock
Component: Networking → Embedding: Docshell
QA Contact: tever → adamlock
->Chak per Jud
Assignee: adamlock → chak
I think the issue here is the use of ToNewUTF8String() in NS_NewURI() at http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86. Piping a URL with non-ASCII chars through ToNewUTF8String() to force a conversion to a single-byte char* seems incorrect: this function converts the 'ç' to UTF-8, which results in two chars. Then, when the HTTP request string is built inside nsHTTPRequest::formBuffer() (at http://lxr.mozilla.org/seamonkey/source/netwerk/protocol/http/src/nsHTTPRequest.cpp#410), a call to GetPath() is made, which results in the escaped string "World/Fran%C3%A7ais/" being added to the request, hence the server responds with a page not found.

I think the way to fix this would be to call ToNewCString() instead of ToNewUTF8String() (at http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86); I'll submit the patch. I tested with that change and it seems to work fine. I'll let the experts in this area tell me if this breaks anything else and/or if there's a better way to fix it.

PS: If you want to try out this change yourself: since the functions in nsNetUtil.h are inlined, you may have to do a clean build of at least netwerk and docshell to test it, i.e. just changing nsNetUtil.h and doing a make won't help.
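
For reference, here is a toy sketch of how the two conversions diverge at the byte level. The helpers below are hypothetical stand-ins, not the Mozilla string API (ToNewCString()/ToNewUTF8String() are only being imitated):

    #include <cstdio>
    #include <string>

    // Hypothetical stand-in for a naive "to C string" conversion: truncate
    // each UCS-2 code unit to its low byte (lossless only for Latin-1).
    static std::string toNarrow(const std::u16string& s) {
        std::string out;
        for (char16_t c : s)
            out += static_cast<char>(c & 0xFF);
        return out;
    }

    // Hypothetical stand-in for a UTF-8 conversion; handles only code
    // points below U+0800, which is enough for Latin-1 input.
    static std::string toUTF8(const std::u16string& s) {
        std::string out;
        for (char16_t c : s) {
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    int main() {
        std::u16string path = u"Fran\u00E7ais";  // 'ç' is U+00E7
        // 8 bytes narrow vs. 9 bytes as UTF-8: the 'ç' became 0xC3 0xA7,
        // which later escapes to %C3%A7 instead of %E7.
        std::printf("narrow: %zu bytes\n", toNarrow(path).size());
        std::printf("utf8:   %zu bytes\n", toUTF8(path).size());
        return 0;
    }
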
Is there anyone else who can r= this one? Thanks.
Seems like ftang added the UTF-8 conversion. Any reason why it should not be changed back, Frank?
No, please do not change it to ToNewCString(); this will break other stuff. Please read http://www.ietf.org/rfc/rfc2718.txt: the URI should be UTF-8.
Also, please read ftp://ftp.isi.edu/in-notes/rfc2396.txt. The right thing to do is to %-encode the text in the upper level, while we still know the encoding information, and pass it down in % form. ToNewCString() will break the I-DNS work, Internet Keywords, and so on.
Frank: assuming that we do not call ToNewCString(), as you're suggesting, how and where do we fix this issue?
Domain names have to be UTF-8. For the path part, we can try the document charset after failing with UTF-8. There is a similar issue for HTML anchors; see the HTML spec: http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
What http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1 says is fine for an href inside a document. But the note does not address what should be done when there is no document charset specified, or when, for example, we enter the URL via the URL bar (which is what this bug is about).

[Also, adhering to the above recommendation can get us into a lot of trouble if we're not careful. For example, imagine the user requests a URL (with non-ASCII chars) which does not really exist on the server. We request the URL first with UTF-8 encoding and get back an HTTP 404 (doc not found), since the doc does not exist. Now we make another request, this time with the URL encoded in the document charset, which happens to be UTF-8. We get a 404 back again, since the second request was essentially the same as the first. At this stage we need to keep track of the request count so as not to get into recursive requests for a non-existent URL.]

I'm also not sure Communicator 4.x implements what's specified in the note above, and it seems to be working fine with URLs containing non-ASCII chars. How is Communicator handling this issue? Just curious... Thanks.
I just mentioned the HTML spec because it describes the fallback method of trying UTF-8, then a document charset. I think 4.x converts the URL to the OS default charset. This works for the limited cases where the server's charset is the same as the client OS charset. If there is a way to know the server's charset, then that should be used. If it's not possible to get a server's charset or a document charset, then the OS charset could be used as a fallback for UTF-8.
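
A minimal sketch of that fallback order, assuming a caller that can re-issue the request: try UTF-8 first, then the document charset, then the OS charset, skipping a charset that repeats the previous attempt so a UTF-8 document does not trigger the identical second request described above. All names here are hypothetical; escapeAs() and fetch() are demo stand-ins, not necko APIs:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Stand-in: a real implementation would convert `path` to `charset`
    // and percent-encode it. Here we just tag the path for the demo.
    static std::string escapeAs(const std::string& path,
                                const std::string& charset) {
        return path + " [" + charset + "]";
    }

    // Stand-in for issuing the request; pretend the server only knows
    // the ISO-8859-1 form of the path.
    static bool fetch(const std::string& escapedUrl) {
        return escapedUrl.find("[ISO-8859-1]") != std::string::npos;
    }

    static bool loadWithFallback(const std::string& path,
                                 const std::string& documentCharset,
                                 const std::string& osCharset) {
        std::vector<std::string> charsets = { "UTF-8", documentCharset,
                                              osCharset };
        std::string lastTried;
        for (const std::string& cs : charsets) {
            // Skip adjacent duplicates; a real guard would track every
            // attempt (and cap the request count).
            if (cs == lastTried)
                continue;
            lastTried = cs;
            if (fetch(escapeAs(path, cs))) {
                std::printf("loaded with charset %s\n", cs.c_str());
                return true;
            }
        }
        return false;
    }

    int main() {
        // Document charset UTF-8 duplicates the first attempt and is
        // skipped; the OS charset then succeeds.
        loadWithFallback("/World/Fran\xE7" "ais/", "UTF-8", "ISO-8859-1");
        return 0;
    }
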
We've pretty much got to ignore the specs because we can't code to them anyway; backwards compat and real-world scenarios are our masters. Cc'ing darin, because we were recently talking about eradicating unicode from necko altogether :-), and this sort of falls along those lines. I'm going to try and break this down like so many have before:

1. From the UI standpoint, we want to present URLs in their native character format. If that means unicode at the UI level, fine, but let's keep that at the level *above* necko. If we can't do this, someone please explain why not.

2. From necko's standpoint, all it should be dealing with is raw escaped char*'s. If I hand necko a URL with a space in it, it needs to escape (not UTF-8 encode) that space, and send the escaped request out onto the network.

So, can't we remove all the encoding from necko and ensure that all of the encoding happens *above* necko? Necko can't do anything with it anyway, if my *real-world* understanding is correct. This would mean necko util callsites would need to encode/decode on their own.
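
A sketch of the kind of byte-level escaping point 2 argues for: percent-encode unsafe octets exactly as they arrive, with no charset conversion anywhere (again illustrative only, not the necko implementation):

    #include <cctype>
    #include <cstdio>
    #include <string>

    // Percent-encode unsafe octets and pass everything else through.
    // No charset conversion happens here: bytes in, escaped bytes out.
    static std::string escapeBytes(const std::string& in) {
        static const char hex[] = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : in) {
            if (std::isalnum(c) || c == '-' || c == '.' || c == '_' ||
                c == '~' || c == '/') {
                out += static_cast<char>(c);
            } else {
                out += '%';
                out += hex[c >> 4];
                out += hex[c & 0x0F];
            }
        }
        return out;
    }

    int main() {
        // The space becomes %20 and the raw Latin-1 byte 0xE7 becomes
        // %E7; nothing is re-encoded as UTF-8 on the way out.
        std::printf("%s\n",
                    escapeBytes("/World/Fran\xE7" "ais test/").c_str());
        return 0;
    }
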
The *old* URL spec said everything had to be ASCII or %-escaped, so http://www.dmoz.org/World/Français/ is illegal in terms of the *old* URL spec. The *new* URL guideline proposes using UTF-8 in URLs, so the ISO Latin 1 http://www.dmoz.org/World/Français/ is ALSO illegal in terms of the *new* URL draft. In the meantime, we could also see ShiftJIS, EUC-JP, Russian, or Big5 URL paths similar to http://www.dmoz.org/World/Français/ in the real world. The data may or may not be encoded in ISO-8859-1.

I think there is no real solution if the user types into the URL bar. Basically, there is no way we can know what charset it is. It could be ISO-8859-1 on the server side, it could be ISO-8859-2, it could be anything; we have no context for it. The next best thing we can do is assume it follows the new URL / IDNS spec, which is UTF-8 in the URL. That is why we convert to UTF-8. The other reasons we convert to UTF-8 are 1) IDNS, 2) Internet Keywords, 3) What's Related, 4) ODP accepts UTF-8 URLs; and UTF-8 is the only choice with which we won't lose data.

I think the right thing to do is to %-escape as much as we can in the upper level (as we do now). For the edge cases where we cannot, converting to UTF-8 will at least ensure forward compatibility. Also, I think I did something special for file:/// URLs: if it is a file URL, we convert to the filesystem charset and escape it. For HTTP URLs, I think there is no real solution.
Also, be aware that both LDAP and IMAP URLs are in UTF-8, as defined in:
ftp://ftp.isi.edu/in-notes/rfc2253.txt
ftp://ftp.isi.edu/in-notes/rfc2255.txt
ftp://ftp.isi.edu/in-notes/rfc2192.txt

> we've pretty much got to ignore the specs because we can't code to them
> anyway. backwards compat and real-world scenarios are our masters.

I agree with you for the ftp:// and http:// cases. But for IMAP and LDAP URLs, we have been doing UTF-8 for a while already. And you have to allow UTF-8 in nsNetUtil, since nsNetUtil is not only for the http/ftp/file protocols. We want to make sure nsNetUtil works for IMAP/LDAP also, if the URL contains UTF-8.
Changing milestone to 0.9.2 since there's going to be some reworking of the Necko layer wrt handling wide-char strings. This bug depends on those changes and will be revisited when they're in place.
Status: NEW → ASSIGNED
Target Milestone: mozilla0.9.1 → mozilla0.9.2
->0.9.3
Target Milestone: mozilla0.9.2 → ---
Target Milestone: --- → mozilla0.9.3
->0.9.4
Target Milestone: mozilla0.9.3 → mozilla0.9.4
Target Milestone: mozilla0.9.4 → mozilla1.0
Blocks: 104166
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 (you can query for this string to delete spam or retrieve the list of bugs I've moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Keywords: mozilla1.3, patch, review
Summary: Mozilla incorrectly rewrites URLS containing ISO characters → Mozilla incorrectly rewrites URLs containing ISO characters
I can reproduce this on Linux with FF 20040406 by copying the URL to the clipboard and pasting it into the URL bar.
Keywords: mozilla1.3top100
I have no problem with this URL on XP with FF 20040419. I just copy/paste into the URL bar and get http://www.dmoz.org/World/Fran%E7ais/.
Assignee: chak → nobody
Status: ASSIGNED → NEW
QA Contact: adamlock → docshell
* dmoz's encoding is UTF-8 these days.
* This is probably WONTFIX, since the Awesomebar deeply depends on UTF-8 URIs.

-> wfm
Severity: major → normal
Status: NEW → RESOLVED
Closed: 13 years ago
Keywords: top100
Resolution: --- → WORKSFORME
Target Milestone: mozilla1.0.1 → ---