Closed
Bug 66515
Opened 24 years ago
Closed 13 years ago
Mozilla incorrectly rewrites URLs containing ISO characters
Categories
(Core :: DOM: Navigation, defect)
RESOLVED
WORKSFORME
People
(Reporter: michele, Unassigned)
Attachments
(1 file)
patch (deleted)
If I type the URL http://www.dmoz.org/World/Français/, Mozilla rewrites it to
http://www.dmoz.org/World/Fran%C3%A7ais/
instead of
http://www.dmoz.org/World/Fran%E7ais/
The same URL is correctly translated if it appears as an <A HREF> tag in an HTML
document.
Comment 1•24 years ago
Confirming with build 2001012420 on NT4. Platform/OS -> All/All,
Component -> Networking, Severity -> Major (can't type in that URL)
Assignee: asa → neeti
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Browser-General → Networking
Ever confirmed: true
OS: Linux → All
QA Contact: doronr → tever
Hardware: PC → All
Comment 2•24 years ago
Could be a dup of bug 31225, although here the URI works as long as it is not
typed in by hand.
Adding dependency to URI tracking bug.
Blocks: 61999
Comment 3•24 years ago
No, this is not a dup of bug 31225; it is not related to host resolving. It seems to
be related to character encoding and conversion, not to URL parsing.
I think this is being caused by the difference in how we are escaping the URLs
for the location bar vs. the href handler. Someone in docshell should verify this.
Assignee: neeti → adamlock
Component: Networking → Embedding: Docshell
QA Contact: tever → adamlock
Comment 6•24 years ago
I think the issue here is the use of ToNewUTF8String() in NS_NewURI() at
http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86
Piping a URL with non-ASCII chars through ToNewUTF8String() to force a conversion
to a single-byte char * seems incorrect. This function converts the 'ç'
to UTF-8, which results in two bytes.
Finally, when the HTTP request string is built inside
nsHTTPRequest::formBuffer() (at
http://lxr.mozilla.org/seamonkey/source/netwerk/protocol/http/src/nsHTTPRequest.cpp#410)
a call to GetPath() is made, which results in the escaped string
"World/Fran%C3%A7ais/" being added to the request. The server therefore responds
with a page not found.
I think the way to fix this would be to call ToNewCString() instead of
ToNewUTF8String() (at
http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86)
(I'll submit the patch)
I tested with that change and it seems to work fine. I'll let the experts in
this area tell me if this breaks anything else and/or if there's a better way
to fix this.
PS: If you want to try out this change yourself...
Since the functions in nsNetUtil.h are inlined, you may have to do a clean build
of at least netwerk and docshell to test this change, i.e. just changing
nsNetUtil.h and doing a make won't help.
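The two escaped forms discussed in this comment can be reproduced outside Mozilla. The following is only an illustrative sketch in Python (not the Mozilla code path); the variable names are made up for the example. It shows that the same path segment yields `%C3%A7` when the text is converted to UTF-8 before escaping and `%E7` when it is converted to ISO-8859-1.

```python
from urllib.parse import quote

# Path segment from the URL in this bug report.
segment = "Français"

# UTF-8 encodes 'ç' (U+00E7) as two bytes, 0xC3 0xA7, so two escapes appear.
utf8_escaped = quote(segment, encoding="utf-8")

# ISO-8859-1 encodes 'ç' as the single byte 0xE7, so one escape appears.
latin1_escaped = quote(segment, encoding="iso-8859-1")

print(utf8_escaped)    # Fran%C3%A7ais
print(latin1_escaped)  # Fran%E7ais
```

A server that stores the directory name in ISO-8859-1 only matches the second form, which is consistent with the "page not found" response described above.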
Comment 7•24 years ago
Comment 8•24 years ago
Is there anyone else who can r= this one? Thanks.
Seems like ftang added the UTF-8 conversion. Any reason why it should not be
changed back, Frank?
Comment 10•24 years ago
No, please do not change to ToNewCString().
This will break other stuff.
Please read http://www.ietf.org/rfc/rfc2718.txt
The URI should be UTF-8.
Comment 11•24 years ago
Also, please read
ftp://ftp.isi.edu/in-notes/rfc2396.txt
The right thing to do is to %-encode the text at the upper level, while we still
know the encoding information, and pass it down in % form. ToNewCString() will break
I-DNS work, Internet Keyword and so on.
Comment 12•24 years ago
Frank : Assuming that we do not call ToNewCString() like you're suggesting,
how/where do we fix this current issue?
Comment 13•24 years ago
Domain names have to be UTF-8. For the path part, we can retry with the document
charset after failing with UTF-8.
There is a similar issue for HTML anchors; see the HTML spec:
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
Comment 14•24 years ago
What http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1 says is
fine for an href inside a document. But the note does not address what
should be done when no document charset is specified or, for example,
when we enter the URL via the URL bar (which is what this bug is about).
[
Also, adhering to the above recommendation can get us into a lot of trouble if
we're not careful.
For example: imagine the user requests a URL (with non-ASCII chars) which does not
really exist on the server. We request the URL first with UTF-8 encoding and get
back an HTTP 404 (doc not found) since the doc does not really exist. Now we make
another request, this time with the URL encoded with the document charset, which
happens to be UTF-8. We get a 404 back again since the second request was
essentially the same as the first. At this stage we need to keep track of the
request count so as not to get into recursive requests for a non-existent URL.
]
I'm also not sure Communicator 4.x implements what's specified in the note above,
yet it seems to work fine with URLs with non-ASCII chars. How is
Communicator handling this issue? Just curious... Thanks
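One way to avoid the repeated 404 described in this comment is to compute the escaped candidates up front and drop duplicates before any request is sent. The sketch below is a hypothetical illustration in Python, not Mozilla code; `candidate_paths` and its parameters are names invented for the example.

```python
from urllib.parse import quote


def candidate_paths(path, document_charset=None):
    """Return escaped forms of `path` to try in order, without duplicates.

    UTF-8 is tried first; the document charset (if any) is the fallback.
    """
    charsets = ["utf-8"]
    if document_charset:
        charsets.append(document_charset)

    candidates = []
    for charset in charsets:
        try:
            escaped = quote(path, safe="/", encoding=charset, errors="strict")
        except UnicodeEncodeError:
            # This charset cannot represent the path, so there is nothing to try.
            continue
        # If the fallback produces the same bytes as UTF-8, a second request
        # would be identical to the first, so it is skipped.
        if escaped not in candidates:
            candidates.append(escaped)
    return candidates


print(candidate_paths("/World/Français/", "iso-8859-1"))
print(candidate_paths("/World/Français/", "utf-8"))
```

With a document charset of ISO-8859-1 there are two distinct candidates; with UTF-8 there is only one, so no second request would be issued.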
Comment 15•24 years ago
I only mentioned the HTML spec because it describes the fallback method
of trying UTF-8 and then a document charset.
I think 4.x converts the URL to the OS default charset. This works for the limited cases
where the server's charset is the same as the client OS charset.
If there is a way to know the server's charset, then that should be used.
If it's not possible to get a server's charset or a document charset, then the OS
charset could be used as a fallback for UTF-8.
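The fallback order suggested in this comment can be written down as a small selection function. This is a sketch under assumptions: the function name and parameters are hypothetical, the ordering (server charset, then document charset, then OS charset) is one reading of the comment, and Python's `locale.getpreferredencoding` stands in for "the OS charset".

```python
import locale


def pick_fallback_charset(server_charset=None, document_charset=None):
    """Choose the charset to retry with after a UTF-8 attempt has failed."""
    # Prefer what is known about the server, then the referring document.
    for charset in (server_charset, document_charset):
        if charset:
            return charset
    # Nothing is known, so fall back to the client OS default charset.
    return locale.getpreferredencoding(False)


print(pick_fallback_charset("koi8-r", "iso-8859-1"))  # koi8-r
print(pick_fallback_charset(None, "iso-8859-1"))      # iso-8859-1
```

As the comment notes, the last branch only helps when the server happens to use the same charset as the client OS.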
Comment 16•24 years ago
we've pretty much got to ignore the specs because we can't code to them anyway.
backwards compat and real-world scenarios are our masters.
cc'ing darin because we were recently talking about eradicating unicode from
necko altogether :-), and this sort of falls along those lines.
I'm going to try and break this down like so many have before:
1. From the UI standpoint, we want to present URLs in their native character
format. If that means unicode at the UI level, fine, but let's keep that at the
level *above* necko. If we can't do this, someone please explain why we can't.
2. From necko's standpoint, all it should be dealing w/ is raw escaped char*'s.
If I hand necko a url w/ a space in it, it needs to escape (not UTF8 encode)
that space, and send the escaped request out onto the network.
so, can't we remove all the encoding from necko, and ensure that all of the
encoding happens *above* necko? necko can't do anything w/ it anyway if my
*real-world* understanding is correct. this would mean necko util callsites
would need to encode/decode on their own.
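The layering proposed in this comment can be illustrated with two small functions: one that converts text to bytes where the charset is known, and one that only percent-escapes bytes. This is a hypothetical Python sketch of the idea, not necko's API; both function names are invented for the example.

```python
from urllib.parse import quote


def upper_layer_encode(text, charset):
    """Caller side: turn UI text into raw bytes while the charset is known."""
    return text.encode(charset)


def network_layer_escape(raw_path):
    """Network side: percent-escape raw bytes, with no charset conversion."""
    # quote() accepts bytes and escapes each unsafe byte as %XX.
    return quote(raw_path, safe="/")


# A space is escaped, not re-encoded.
print(network_layer_escape(b"/my docs/"))  # /my%20docs/

# The caller decides the charset; the network layer never sees text.
raw = upper_layer_encode("/World/Français/", "iso-8859-1")
print(network_layer_escape(raw))  # /World/Fran%E7ais/
```

In this arrangement the choice between `%E7` and `%C3%A7` is made entirely by the caller, which is the separation the comment argues for.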
Comment 17•24 years ago
The *old* URL spec said everything has to be ASCII or %-escaped, so
http://www.dmoz.org/World/Français/ is illegal in terms of the *old* URL spec.
The *new* URL guideline proposes using UTF-8 in URLs, so the ISO Latin 1
http://www.dmoz.org/World/Français/ is ALSO illegal in terms of the *new* URL
draft.
In the meantime, we also see ShiftJIS, EUCJP, Russian and Big5 URL
paths similar to http://www.dmoz.org/World/Français/ in the real world. The data
may or may not be encoded in ISO-8859-1.
I think there is no real solution if the user types into the URL bar. Basically,
there is no way we can know what charset it is. It could be ISO-8859-1 on the
server side, it could be ISO-8859-2 on the server side, it could be anything on
the server side. We have no context for it. The next best thing we can do is
to assume it follows the new URL / IDNS spec, which is UTF-8 in the URL. That is
why we convert to UTF-8.
The other reasons we convert to UTF-8 are that 1) IDNS, 2) Internet Keyword, 3)
what is related and 4) ODP accept UTF-8 URLs, and UTF-8 is the only choice with
which we won't lose data.
I think the right thing to do is to %-escape as much as we can at the upper
level (as we do now). And for the edge cases where we cannot, converting to
UTF-8 will at least ensure forward compatibility.
Also, I think I did something special for file:/// URLs. If it is a file URL, we
convert to the file system charset and escape it.
For HTTP URLs, I think there is no real solution.
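The per-scheme rule described in this comment (file system charset for file: URLs, UTF-8 otherwise) might look roughly like the following. This is a simplified Python sketch, not the Mozilla implementation: `escape_typed_url` is a hypothetical name, `sys.getfilesystemencoding()` stands in for "the file system charset", and query strings, fragments and non-ASCII host names are deliberately ignored.

```python
import sys
from urllib.parse import quote, urlsplit


def escape_typed_url(url):
    """Percent-escape the path of a typed URL using a per-scheme charset."""
    parts = urlsplit(url)
    if parts.scheme == "file":
        # Local paths are escaped in the charset the file system uses.
        charset = sys.getfilesystemencoding()
    else:
        # No context about the server is available, so assume UTF-8.
        charset = "utf-8"
    escaped_path = quote(parts.path, safe="/", encoding=charset)
    # Only scheme, host and path are rebuilt; query and fragment are dropped.
    return f"{parts.scheme}://{parts.netloc}{escaped_path}"


print(escape_typed_url("http://www.dmoz.org/World/Français/"))
# http://www.dmoz.org/World/Fran%C3%A7ais/
```

The HTTP result is the UTF-8 form reported in this bug, which illustrates why the comment concludes there is no real solution for servers that expect another charset.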
Comment 18•24 years ago
Also, be aware that both LDAP and IMAP URLs are in UTF-8, as defined in
ftp://ftp.isi.edu/in-notes/rfc2253.txt
ftp://ftp.isi.edu/in-notes/rfc2255.txt
ftp://ftp.isi.edu/in-notes/rfc2192.txt
>we've pretty much got to ignore the specs because we can't code to them anyway.
>backwards compat and real-world scenarios are our masters.
I agree with you for the ftp:// and http:// cases. But for IMAP and LDAP URLs, we have
used UTF-8 for a while already. And you have to allow UTF-8 in nsNetUtil since
nsNetUtil is not only for the http/ftp/file protocols.
We want to make sure nsNetUtil works for IMAP/LDAP also, if the URL contains UTF-8.
Comment 19•24 years ago
Changing milestone to 0.9.2 since there's going to be some reworking of the
Necko layer with respect to handling wide char strings.
This bug depends on those changes; will revisit when they're in place.
Status: NEW → ASSIGNED
Target Milestone: mozilla0.9.1 → mozilla0.9.2
Updated•23 years ago
Target Milestone: --- → mozilla0.9.3
Updated•23 years ago
Target Milestone: mozilla0.9.4 → mozilla1.0
Comment 22•23 years ago
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1
(you can query for this string to delete spam or retrieve the list of bugs I've
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Updated•22 years ago
Summary: Mozilla incorrectly rewrites URLS containing ISO characters → Mozilla incorrectly rewrites URLs containing ISO characters
Comment 23•21 years ago
I can reproduce this in Linux with FF 20040406 by copying the URL onto the
clipboard and pasting it into the URL bar.
Keywords: mozilla1.3 → top100
Comment 24•21 years ago
I have no problem with this URL in XP on FF 20040419. Just copy/paste to URL-bar
and I get http://www.dmoz.org/World/Fran%E7ais/.
Updated•15 years ago
Assignee: chak → nobody
Status: ASSIGNED → NEW
QA Contact: adamlock → docshell
Comment 25•13 years ago
* dmoz's encoding is UTF-8 these days.
* This is probably WONTFIX, because the Awesomebar depends deeply on UTF-8 URIs.
->wfm
Severity: major → normal
Status: NEW → RESOLVED
Closed: 13 years ago
Keywords: top100
Resolution: --- → WORKSFORME
Target Milestone: mozilla1.0.1 → ---