Closed Bug 43852 Opened 24 years ago Closed 24 years ago

"Send URLs as UTF-8" not working

Categories

(Core :: Internationalization, defect, P3)

x86
Windows 98

Tracking

()

VERIFIED FIXED
mozilla0.9

People

(Reporter: bill, Assigned: nhottanscp)

References

()

Details

(Keywords: helpwanted)

Mozilla build M14 seems to work sending URLs as UTF-8 to our Internationalized domain name service at http://www.nunames.nu/eu-lang-test.htm, but only if you type the URL into the browser's address field directly (or copy and paste it), not if you click on a link.

The way it works is this: if you type the multilingual URL into the browser window using a localized non-UTF-8 encoding (say your keyboard and OS encoding is ISO-8859-1), the Mozilla M14 browser converts that URL into UTF-8 and sends the request to the name server to be resolved. For www.åreskutan.nu it does not display the UTF-8 in the browser window (which should be "www.Ã¥reskutan.nu" in encoded form) but displays www.%c3%a5reskutan.nu instead. Even though the browser displays these % encodings, it actually sends the UTF-8 in the name server query, and this name will resolve in our system using M14. Under the "rule of least astonishment", it would be nice if it actually displayed the local language keyboard encoding to the user, in this case the originally typed ISO-8859-1, or www.åreskutan.nu.

But using M14, if you have a link on your page (as we do at http://www.nunames.nu/eu-lang-test.htm) and you click on the link instead, it correctly displays the UTF-8 encoding at the bottom left side of the browser (where it says "contacting http://www.Ã¥reskutan.nu"), so it seems to be able to make the correct conversion to UTF-8 from a link which uses ISO-8859-1. But it does not actually send that UTF-8 to the resolver, and the query does not work as a result. In this case, just as in the previous one, the UTF-8 encoding does *not* display in the browser window. But this time it displays a different series of % encodings: http://www.%c3%83%c2%a5reskutan.nu. And this is the UTF-8 it actually sends: "www.Ã¥reskutan.nu", which does not resolve (since the actual name we are serving is encoded as www.Ã¥reskutan.nu).

When using Mozilla build M15 (the same as NN 6 Beta, I believe), it also correctly displays the UTF-8 encoding at the bottom left side of the browser (where it says "contacting http://www.Ã¥reskutan.nu"), so it also seems to be able to make the correct conversion from a link which uses ISO-8859-1. But it does not actually send *any* UTF-8 to the resolver. Instead it sends the following % type encoding to the name server: "www.%c3ƒ%c2%a5reskutan.nu" as ASCII. As far as I can see, this has nothing to do with the correct UTF-8 conversion it initially made, is not actually sent as any kind of UTF-8, and is rejected by our system.

Two variations of a broken browser, I'd say.

Bill Semich
.NU Domain
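The two percent-encoded forms quoted in the report can be reproduced with a few lines of Python. This is a minimal sketch, not Mozilla code: the helper name `percent_encode` is made up for illustration, and it assumes that the second form comes from treating bytes that are already UTF-8 as ISO-8859-1 text and encoding them to UTF-8 a second time.

```python
def percent_encode(data: bytes) -> str:
    """Percent-encode every non-ASCII byte; ASCII bytes pass through (hypothetical helper)."""
    return "".join(chr(b) if b < 0x80 else f"%{b:02x}" for b in data)


host = "www.åreskutan.nu"

# Encoding the typed name to UTF-8 once gives the form seen in the address bar case.
once = host.encode("utf-8")
print(percent_encode(once))

# Assumed failure mode for the link case: the UTF-8 bytes are misread as
# ISO-8859-1 text and then encoded to UTF-8 again.
twice = once.decode("iso-8859-1").encode("utf-8")
print(percent_encode(twice))
```

The first line printed matches www.%c3%a5reskutan.nu and the second matches www.%c3%83%c2%a5reskutan.nu, which is consistent with a double conversion but does not prove where in the browser it happens.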
Teruko, please try to reproduce this and confirm if reproducible. Please also check 4.x behavior.
bill@mail.nic.nu, could you list the problems? Also, please try newer builds (M16 or later). Using today's win32 M17 build, typing http://www.åreskutan.nu in the URL bar and hitting return sends the UTF-8 query "http://www.%D0%93%D2%90reskutan.nu/". I am not sure what is broken.
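One possible reading of the bytes in that query, offered as a guess rather than a diagnosis: `%D0%93%D2%90` is what results when the UTF-8 bytes for å (C3 A5) are misread as Windows-1251 text and then encoded to UTF-8 again. The sketch below only checks that arithmetic; whether the M17 build actually took this path is an assumption.

```python
# UTF-8 bytes for the single character å.
utf8_once = "å".encode("utf-8")          # b'\xc3\xa5'

# Assumption: those bytes are interpreted as Windows-1251 (Cyrillic) text...
misread = utf8_once.decode("cp1251")     # two Cyrillic letters

# ...and the resulting text is encoded to UTF-8 a second time.
utf8_twice = misread.encode("utf-8")

observed = "".join(f"%{b:02X}" for b in utf8_twice)
print(observed)
```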
I sent the following message to people listed in this bug report in a response to Bill. I will repeat it here for the record.

I think we are doing the right thing mostly. But there is one spec-related issue. We can return UTF-8 URLs in case the URL links on a web page are not going to the same server as the page itself. In that case, we don't have to be bound by the page/server charset. For this small improvement, I will confirm the bug.

====

It seems to me that the current Mozilla is behaving more or less correctly with regard to returning/sending the URL. To summarize the current behavior,

1. In the location bar, there is no way we can assume the charset which the target server requires, so we default to UTF-8 in case the URL entered contains 8-bit data.
2. For web pages, if the pages are marked with the meta-charset (or if the server sends the charset info with the page), then we return the URL in that charset.

I think we are following these 2 basic principles described above currently. Your web pages are marked as follows:

A. http://www.nunames.nu/eu-lang-test.htm (Windows-1252)
   Therefore we return Latin 1 encoding in sending back the URL.
B. http://www.nunames.nu/NUregistryJP.htm (UTF-8)
   Therefore we send UTF-8 URL back to the server -- I confirmed that we indeed do this on this page.
C. http://www.nunames.nu/lldemo (Has no charset info)
   Therefore we will send back in whatever charset the user has selected in the Character Coding menu, or the default browser view charset.

I think these are more or less correct but there is probably one improvement we can make. If the links on the page are going to different servers than the one which is hosting the page, then we probably do not have to follow the charset of the page in sending the URL from a link. I can think of returning such URLs in UTF-8. Perhaps we can make this bug into making such an improvement. What do you think -- people on this list? Other than this, I don't see much else we can do.

====
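The charset rules summarized in that message can be restated as a small decision function. This is only a sketch of the rules as described in the comment, not Mozilla's implementation; the names `url_charset` and `escape_label`, and the parameters they take, are hypothetical.

```python
from typing import Optional
from urllib.parse import quote


def url_charset(from_location_bar: bool,
                page_charset: Optional[str],
                user_default: str) -> str:
    """Pick the charset for an 8-bit URL, following rules 1, 2 and case C above."""
    if from_location_bar:
        return "utf-8"            # rule 1: target charset unknown, default to UTF-8
    if page_charset:
        return page_charset       # rule 2: follow the page or server charset
    return user_default           # case C: no charset info, use the user's selection


def escape_label(text: str, charset: str) -> str:
    """Encode text in the chosen charset and percent-escape the non-ASCII bytes."""
    return quote(text.encode(charset), safe="")


# A link on a Windows-1252 page (case A) versus the same name typed in the location bar.
print(escape_label("åreskutan", url_charset(False, "windows-1252", "iso-8859-1")))
print(escape_label("åreskutan", url_charset(True, None, "iso-8859-1")))
```

Under these assumptions the link case yields `%E5reskutan` and the location bar case yields `%C3%A5reskutan`, which is the difference the reporter's name server is sensitive to.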
Status: UNCONFIRMED → NEW
Ever confirmed: true
IE5 has a preference (on by default?):
Tools|Internet Options...|Advanced
[x] Always send URLs as UTF-8

How does IE5 behave with and without this enabled?

Related bugs:
bug 42898 iDNS support
bug 42899 IURI support
FYI, there are unresolved issues with unicode canonicalization/normalization and "case" folding with regards to iDNS.
Assignee: nhotta → ftang
Reassign to ftang.
Bob wrote:
> > IE5 has a preference (on by default?) Tools|Internet Options...|Advanced
> [x] Always send URLs as UTF-8

I received the following email from a Microsoft employee a while ago:

Subject: Re: The .nu domain's experiment with 8859-1 encoded domain names.
Date: Mon, 10 Jan 2000 13:16:55 -0800
From: "Chris Wendt" <christw@microsoft.com>
To: "Erik van der Poel" <erik@netscape.com>, "Karlsson Kent - keka" <keka@im.se>
CC: <hostmaster@mail.nic.nu>, <duerst@w3.org>, <markdavis@ispchannel.com>, <mark.davis@us.ibm.com>, <goldsmith@apple.com>, <chrispr@microsoft.com>, <ftang@netscape.com>, <presnick@qualcomm.com>, <henrik.sviden@idg.se>

> > IE 5 can, apparently, always use Unicode/UTF-8 in (all of)
> > the URL, if set properly, already.

(all of) is not correct. Only in the part which comes before the first question mark '?'.

> What does "if set properly" mean, exactly? How does IE5 deal with HTML
> forms in non-UTF-8 encodings when submitting them?

"If set properly" means that the advanced option "Always send URLs in UTF-8" is ON. It is ON by default except for the Korean and Traditional Chinese localized version (major globalization faux pas, I agree :-(()

The query part (behind the first '?') is encoded in the encoding of the document bearing the <form> or in the client machine's default code page if the query is not submitted from a FORM. Client code can override the default setting for non-FORM queries as you can see in the IE5 autosearch feature where the autosearch query is ALWAYS UTF-8.

If any part of the URL is pre-escaped when IE gets it, i.e. by the HTML author, there will be no change applied.

I think we should look at the domain names without consideration of queries.

> (1) The Location field (URL bar) where users type the URL via keyboard.
> (2) Links in HTML pages <A HREF="...">
> For (1), we can convert the string typed by the user to UTF-8 before
> sending the domain name to the server.
>
> But for (2), what do you suggest? Should we convert it to UTF-8?

Definitely the same for both cases.
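The split described in that email (UTF-8 before the first '?', the document's encoding after it, and pre-escaped parts left alone) can be sketched as follows. This models only the behavior as the email describes it and is not IE5 code; the function name `escape_as_described` and the example URL are made up for illustration.

```python
from urllib.parse import quote


def escape_as_described(url: str, document_charset: str) -> str:
    """Escape the part before the first '?' as UTF-8 and the query in the document charset.

    '%' is kept in the safe set so sequences that are already escaped pass through unchanged.
    """
    before, sep, query = url.partition("?")
    escaped = quote(before, safe="/:%", encoding="utf-8")
    if sep:
        escaped += "?" + quote(query, safe="=&%", encoding=document_charset)
    return escaped


# A form on an ISO-8859-1 page: the same character is escaped two different ways.
print(escape_as_described("http://www.example.com/å?q=å", "iso-8859-1"))
```

With these assumptions the path becomes `%C3%A5` and the query value becomes `%E5`, which illustrates why the domain name question is easier to discuss separately from queries.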
Kat, why should we treat URLs that go back to the original server differently from URLs that go to other servers? Does some spec say this?
I don't think there is an RFC which defines that. However, when we parse a server path (URL) which is not escaped by the server itself, we do something like what we are doing, i.e. assume the encoding of the document and then escape it -- for the part below the host name level. I think we discussed this issue in: http://bugzilla.mozilla.org/show_bug.cgi?id=10373 So I am not surprised by what we are doing for the domain name part of it. My concern for distinguishing the original server vs. some other server is motivated by the same consideration, but I am not sure if that is the best thing to do. That is, should we distinguish how to deal with the domain name part from the rest of the server paths? In the absence of a real standard we can agree on, I think we can only agree on the best practice.
The approach that Mozilla has taken when the existing browsers do not adhere to the specs is to implement both, and switch between them based on the "Quirks Mode" and "Standard Mode". So I guess one possibility here is to follow the draft in Standard Mode, and follow some mixture of Nav4/MSIE in Quirks Mode. The draft is ftp://ftp.ietf.org/internet-drafts/draft-masinter-url-i18n-05.txt.
nhotta- I think you are the P person for URL issue in our current matrix. Reassign back to nhotta. We probably need to discuss what we should do with this bug.
Assignee: ftang → nhotta
Status: NEW → ASSIGNED
Keywords: helpwanted
*** Bug 49939 has been marked as a duplicate of this bug. ***
*** Bug 55303 has been marked as a duplicate of this bug. ***
Target Milestone: --- → Future
I told Mozilla 0.7 to load http://%e2%88%ae.cr.yp.to. That domain (with three 8-bit characters in place of %e2%88%ae, of course) has an address in DNS, namely 131.193.178.181. Try ``dig contourcname.cr.yp.to'' and you'll see, among other things, the relevant A record.

Mozilla gave me error 804b001e, the same error that it gives for nonexistent.cr.yp.to, and said that the host wasn't found. I had expected it to find the host without trouble.

Positive note: The not-found dialog box had a UTF-8 display of the name.

Negative note: The ``Resolving host'' display had an ISO-8859-1 display of the name. I would have been disappointed in that behavior even if ISO-8859-1 had been my default character set; domain names should be displayed the same way throughout the world.
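The difference between the two displays can be illustrated by decoding the same three bytes both ways. This is a small sketch using only the host name quoted above; it says nothing about how Mozilla renders either dialog internally.

```python
from urllib.parse import unquote_to_bytes

# Undo the percent-escaping to get the raw bytes of the host name.
raw = unquote_to_bytes("%e2%88%ae.cr.yp.to")

# Read as UTF-8, the three 8-bit bytes form a single character.
as_utf8 = raw.decode("utf-8")

# Read as ISO-8859-1, each byte becomes its own character.
as_latin1 = raw.decode("iso-8859-1")

# ascii() keeps the output printable regardless of the terminal encoding.
print(len(raw), ascii(as_utf8), ascii(as_latin1))
```

The UTF-8 reading is ten characters long and starts with U+222E, while the ISO-8859-1 reading is twelve characters long, so the two displays cannot look alike.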
On my Windows2000, WinAPI WSAAsyncGetHostByName (in nsDNSService.cpp) is called with a host name in UTF-8, and it returns a success. I also got the same error even with 131.193.178.181.
>and it returns a success. I mean calling the API succeeded but I got the error dialog which says the name was not found.
>I also got the same error even with 131.193.178.181. Not the same error, I got a page which says "file does not exist" but no dialog appeared. BTW, the following URLs (mentioned in the original report) are working with NS6. http://www.%C3%B6resundsregionen.nu/ http://www.%e7%99%bb%e9%8c%b2%e6%89%80.nu/ I am not sure what is special about http://%e2%88%ae.cr.yp.to.
I've created an index.html now. If you connect to 131.193.178.181 and do GET http://%e2%88%ae.cr.yp.to HTTP/1.1, you'll see it. But Mozilla says the host isn't found. Perhaps this is a UNIX-specific problem. The BIND DNS client library chokes on unusual characters; does Mozilla still use it?
Target Milestone: Future → mozilla0.9
The issue originally filed is resolved. The remaining problem is specific to one site and can be filed separately. Actually, I cannot connect to 131.193.178.181.
The original problem is fixed. Please file a separate bug for http://%e2%88%ae.cr.yp.to, but I see 131.193.178.181 does not work either.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Changed QA contact to andreasb@netscape.com. Andreas, please talk with nhotta about how to verify this.
QA Contact: teruko → andreasb
Original problem verified fixed in the following builds:
* 20010313 Linux
* 20010312 Win98
* 20010228 MacOS 9.1

Fix uncovered URL display problems; reporting new bugs for this.
Status: RESOLVED → VERIFIED