Closed Bug 83277 Opened 24 years ago Closed 16 years ago

Multilingual (unicode) input into address (url) bar

Categories

(Core :: Internationalization, defect)

x86
All
defect
Not set
major

Tracking

RESOLVED WORKSFORME
Future

People

(Reporter: nprobert, Assigned: nhottanscp)

Details

(Keywords: intl)

One upcoming development is support for multilingual internationalized domain names (IDN). In the case of our client, the conversion to an ASCII Compatible Encoding (ACE) is handled outside of the browser, at the resolver level. It is important that Mozilla not treat all address bar and form input as Latin-1, and that it neither normalize nor case fold the Unicode input the user enters in the address bar.

Anchors and JavaScript appear to work okay, but input into the address bar makes a mess of the Unicode input. I suspect that the input locale is getting in the way here and that Mozilla is not respecting it, particularly when, under Windows, the keyboard input locale and language are changed and input is done through the IME. Unfortunately, the Windows version does mess it up, as does IE. Microsoft won't change IE because it case folds the Latin-1 to optimize the caching. I hope Mozilla is not doing the same thing.

Given the Unicode input of these Chinese characters:

  4e00 4e8c 4e09.com (one, two and three horizontal bars, dot com)

which converted to UTF-8 is:

  00e4 00b8 0080 00e4 00ba 008c 00e4 00b8 0089 002e 0063 006f 006d

the result is this mangled UTF-8 by the time the resolver gets it:

  00e4 00b8 20ac 00e4 00ba 0152 00e4 00b8 2030 002e 0063 006f 006d

If Mozilla can be fixed to do this right, then it can be pushed to the international community as the preferred browser over IE.
over to i18n. See also bug 81019, bug 81022 and bug 81024 for related issues.
Assignee: rchen → nhotta
Status: UNCONFIRMED → NEW
Component: Localization → Internationalization
Ever confirmed: true
Summary: Multilingual input into address bar → Multilingual (unicode) input into address (url) bar
I tried a Japanese string before, but I got correct UTF-8 when calling WSAAsyncGetHostByName().
http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsDnsService.cpp#620

> 00e4 00b8 0080 00e4 00ba 008c 00e4 00b8 0089 002e 0063 006f 006d

Where did you get this string? Did you put a breakpoint somewhere in Mozilla?

> The conversion to an Ascii Character Encoding
> (ACE) is handled outside of the browser at the resolver level in the case of
> our client.

What does "our client" mean?

There is another bug, bug 42898, which implements http://search.ietf.org/internet-drafts/draft-ietf-idn-idna-01.txt.
Keywords: intl
> 00e4 00b8 0080 00e4 00ba 008c 00e4 00b8 0089 002e 0063 006f 006d

UTF-8 does not have zeros except for a terminator. The UTF-8 values for \u4e00 \u4e8c \u4e09 are 0xE4B880 0xE4BA8C 0xE4B889.
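As a quick check of the values above, here is a minimal Python sketch (not Mozilla code; the variable names are illustrative). It encodes the three characters as UTF-8 and prints the bytes two ways: as plain hex bytes, and through a zero-padded four-digit format. The second view is only an assumption about how the listing with leading zeros might have been produced.

```python
# Illustrative only: encode U+4E00 U+4E8C U+4E09 plus ".com" as UTF-8.
host = "\u4e00\u4e8c\u4e09.com"
utf8_bytes = host.encode("utf-8")

# Plain byte view: each character becomes three bytes, none of them zero.
plain = " ".join(f"{b:02x}" for b in utf8_bytes)

# The same bytes printed through a zero-padded four-digit hex format.
padded = " ".join("%04x" % b for b in utf8_bytes)

print(plain)
print(padded)
```

If the two views differ only by the padding, the data itself is ordinary UTF-8 and the extra zeros are a display artifact.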
5/30/01 2:48 PM
---------------
We've only used the MS IME or the Character Map. Are you saying that Mozilla properly handles all forms of Unicode input by the user, regardless of the keyboard and/or input locale?

Is there some way we can log what Mozilla or Netscape sees when we put URLs in the address bar and follow it through to the tail pipe? Our client (resolver plug-in) only sees what comes out of Mozilla.

Naoki Hotta wrote:
>
> Hi,
>
> What kind of IME have you tried? I heard some third party IME have problem
> with Mozilla. I think there is no data corruption when using MS IME.
>
> Naoki

> Neal Probert wrote:
-------------------
For more information on IDNA, see http://www.ietf.org/html.charters/idn-charter.html and http://www.i-d-n.net/.

Certain Latin-1 characters (which are really UTF-8 sequences) are case folded, particularly C0-CF. Others, like 80-8F, are transformed into something else, but I'm not sure exactly what is happening there. There may be other unexpected transformations as well, but the behavior is the same.

Our client plugs into the resolver library, so that's how we get our data when we turn on our debugging. We do expect UTF-8 and can handle URI-escaped UTF-8 as well. A copy of our client, which is used for nameprep and ACE, can be found at http://www.walid.com/. We plug in at the resolver level so that any client application can begin using IDNA immediately, and we will follow the standard set by the IDN WG and update our client accordingly.

To get the characters, we paste them from Character Map (using Arial Unicode MS) into the address bar. It can also be done from the IME, using the Chinese virtual keyboard, with identical results.
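To illustrate why case folding bytes in the C0-CF range is harmful, here is a minimal Python sketch. It is an assumption-laden illustration, not a reproduction of what Mozilla, IE, or Winsock actually do. It treats the UTF-8 bytes of "é" as Latin-1 text, lowercases them, and checks whether the result is still valid UTF-8.

```python
# Illustrative only: Latin-1 case folding applied to UTF-8 bytes.
original = "\u00e9"                      # the character é
utf8_bytes = original.encode("utf-8")    # two bytes: C3 A9

# Misread the UTF-8 bytes as Latin-1 text, then lowercase that text.
as_latin1 = utf8_bytes.decode("latin-1")
folded_bytes = as_latin1.lower().encode("latin-1")

# Check whether the folded bytes still decode as UTF-8.
try:
    folded_bytes.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(utf8_bytes.hex(), "->", folded_bytes.hex(), "valid UTF-8:", still_valid)
```

The lead byte C3 is "Ã" in Latin-1, so lowercasing turns it into E3 ("ã"), which is a three-byte UTF-8 lead byte with only one continuation byte after it.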
With the calls to MultiByteToWideChar(), perhaps the codepage is wrong and is not being adjusted to match the data type coming from the paste operation or the IME itself. We've been testing on Windows 2000. According to the documentation, perhaps the codepage should be CP_UTF8 instead on NT, 2K and XP. Somehow, it looks like the language is being mapped to a codepage, and that may be the cause of our problem.

Btw, the UTF-8 data printed below was the result of printf( "%04x ", c[i] ); so it may be misleading.
The code page is mapped from the language identifier.
http://lxr.mozilla.org/seamonkey/source/widget/src/windows/nsWindow.cpp#388

The language identifier is obtained by calling the Windows API GetKeyboardLayout().
http://msdn.microsoft.com/library/psdk/winui/keybinpt_5sxg.htm

I think the paste case may not work if the pasted characters do not match the keyboard locale.
It's not safe to assume that whatever is entered via the IME, or even pasted, will match the language, codepage or keyboard locale. While the keyboard may be tied to the locale (language), the IME is independent of the keyboard locale. Data from the IME should be handled with the same language/codepage/locale setting as the IME. If pasted text is in Unicode, then it should be treated as Unicode, independent of the keyboard locale.
I was incorrect in my assumption about the paste behavior; pasted text is actually treated as Unicode (e.g., I pasted Chinese characters from Character Map into the URL bar and they were turned into UTF-8 when fed to search.netscape.com).

I do not understand your request about the independence of the IME from the keyboard locale. Do you have examples where the keyboard and the IME are set to different languages? In my environment (Windows 2000), I cannot use the Japanese IME unless I set the keyboard to Japanese.
Here is a memory dump of a host name at the code calling WSAAsyncGetHostByName().
http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsDnsService.cpp#620

  051329C0 E4 B8 80 E4 BA 8C E4 B8 89 一二三
  051329C9 2E 63 6F 6D 00 FD FD FD FD .com.ýýýý

The test case was three Chinese characters (\u4e00 \u4e8c \u4e09) and ".com". The three Chinese characters were converted to valid UTF-8 (0xE4B880 0xE4BA8C 0xE4B889).

I tried two cases, pasting the characters from "Character Map" and typing them with the Japanese IME, and both gave the same result. I used a debug build on my machine (Windows 2000 with system locale en-US), pulled this morning.

Marking as Worksforme.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → WORKSFORME
QA -> jonrubin@netscape.com. Jon, when you get a chance can you take a look at this? Reporter, if possible please let us know if you are still able to reproduce this problem. Thanks.
QA Contact: andreasb → jonrubin
Could someone provide a testcase for this? Is it sufficient to enter a multilingual address and see if the resulting error message displays the error correctly, as in bug 81019? If so, then I can verify that this is fixed.
This is still not fixed in 0.9.1 for the hostname part for IDNA. The Unicode (UTF-16) code points taken from Character Map were 4e00, 4e8c, 4e09, which should be sent as the UTF-8 sequences e4b880, e4ba8c, e4b889, but they were still mangled.

Perhaps we are looking in the wrong places for this problem? If the call to WSAAsyncGetHostByName() is passed the correct UTF-8 data, then where is it getting mangled? Does this call get repeated by Mozilla anywhere else?
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Target Milestone: --- → mozilla0.9.3
After using the debug/trace ws2_32.dll from the Platform SDK in conjunction with a modified version of dt_dll built from the MSDN example, I was able to verify that Mozilla calls WSAAsyncGetHostByName() with the correct UTF-8. Our name space provider (NSP) for Winsock still sees the mangled UTF-8, which leads me to the conclusion that the Winsock layer is converting the character data using the thread's current locale. I'm wondering whether the thread that calls WSAAsyncGetHostByName() can set its locale to UTF-8 before making the call, and whether that would solve the problem.
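The mangled sequence from the original report is consistent with the UTF-8 bytes having been interpreted as Windows-1252 text somewhere along the way. The following minimal Python sketch shows that one such conversion reproduces the reported code points. It does not show where in Winsock, or elsewhere, a conversion happens, and the choice of code page 1252 is an assumption based on an en-US system locale.

```python
# Illustrative only: misread correct UTF-8 bytes as Windows-1252 text.
host = "\u4e00\u4e8c\u4e09.com"
utf8_bytes = host.encode("utf-8")        # e4 b8 80 e4 ba 8c e4 b8 89 ...

# Decode the bytes with the wrong code page (assumed: cp1252).
misread = utf8_bytes.decode("cp1252")

# List the resulting code points in the same style as the bug report.
code_points = " ".join(f"{ord(ch):04x}" for ch in misread)

# The mangled sequence quoted in the original report.
reported = "00e4 00b8 20ac 00e4 00ba 0152 00e4 00b8 2030 002e 0063 006f 006d"

print(code_points)
print("matches report:", code_points == reported)
```

In Windows-1252, bytes 0x80, 0x8C and 0x89 map to U+20AC, U+0152 and U+2030, which are exactly the three values that differ from the expected UTF-8 bytes.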
Move to future.
Target Milestone: mozilla0.9.3 → Future
mass change, switching qa contact from jonrubin to ruixu.
QA Contact: jonrubin → ruixu
The same kind of bug appears with French accented letters, and also when opening a URL via a script (for example, when searching on Google).

I maintain http://www.tchatche.com/bd , where a script creates links to pictures named after the comics. Sometimes the resulting URL has accents. For example, when you click on «Agrandir la...» or «voir un...» at http://www.tchatche.com/BD/Contents/consultation/chroniques/DetailBD.asp?bd=634 , a JavaScript call requests a picture with an accented name ( javascript:cover('/BD/','/bd/Images/ImagesBD/D\u00e9luge-Universal_-4-l.jpg','') where u00e9 stands for é). The image is not loaded, unlike in IE.

If I search for the French word for "summer" in Google from the main page, it gives « http://www.google.fr/search?q=%E9t%E9&hl=fr&btnG=Recherche+Google&meta= » (correct). But from the sidebar, « http://www.google.com/search?q=%C3%A9t%C3%A9&sourceid=mozilla-xul&btnG=Google+Search » will be wrongly translated « été ».

Is it the same bug? (And sorry for my misspelled English.)
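The two Google URLs above differ only in which encoding was used before percent-escaping the query. A minimal Python sketch, using the standard urllib.parse module, shows both forms and what happens when the UTF-8 form is read back as Latin-1. It is an illustration only, and assumes nothing about how Google or the sidebar actually decoded the query.

```python
from urllib.parse import quote, unquote

# Illustrative only: the French word for "summer".
word = "\u00e9t\u00e9"   # été

# Percent-encode the same word from two different byte encodings.
latin1_query = quote(word, encoding="latin-1")
utf8_query = quote(word, encoding="utf-8")

# Read the UTF-8 form back with the wrong encoding (Latin-1).
misread = unquote(utf8_query, encoding="latin-1")

print(latin1_query)   # form seen in the main-page URL
print(utf8_query)     # form seen in the sidebar URL
print(misread)        # what a Latin-1 reader would display
```

A server that expects one encoding and receives the other will see a different string, so both ends need to agree on the encoding of the escaped bytes.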
Beep, we need this fixed. Try going to www.domäninfo.com. It works fine in IE 6 SP1 on Windows XP. Mozilla can't resolve the name.
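For reference, here is a minimal Python sketch of converting such a host name to an ASCII form before resolution. It uses Python's built-in "idna" codec, which follows the later RFC 3490 standard with the "xn--" prefix. This is an assumption for illustration only, and is not the draft, the resolver plug-in, or the Mozilla code discussed in this bug.

```python
# Illustrative only: convert an internationalized host name to ASCII
# with Python's built-in "idna" codec (RFC 3490 style, not this bug's draft).
host = "www.dom\u00e4ninfo.com"

# Each dot-separated label is converted separately; ASCII labels pass through.
ace = host.encode("idna")

# Decoding the ASCII form should give back the original name.
round_trip = ace.decode("idna")

print(ace)
print(round_trip == host)
```

Only the label containing "ä" is rewritten. The "www" and "com" labels are already ASCII and stay as they are.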
WFM Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
Status: REOPENED → RESOLVED
Closed: 24 years ago → 16 years ago
Resolution: --- → WORKSFORME