Closed Bug 206811 Opened 21 years ago Closed 21 years ago

xp_iconv in nsNativeCharsetutils.cpp should use UTF-16 if available

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.4final

People

(Reporter: jshin1987, Assigned: jshin1987)

References

Details

(Keywords: intl)

Attachments

(2 files, 5 obsolete files)

xp_iconv in nsNativeCharsetUtils.cpp uses UCS-2, but *PRUnichar is not a string of UCS2's but UTF-16 string. [1] Therefore, we should try UTF-16 before trying UCS-2. At least, glibc, Solaris, HP/UX 10/11 and Tru64 [2] use 'UTF-16' for the native endian UTF-16. I haven't yet figured out what AIX and IRIX do about it. [3] [1] http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsNativeCharsetUtils.cpp#177 [2] http://docs.sun.com/db/doc/806-5584/6jej8rb0u?a=view http://h30097.www3.hp.com/docs/porting/tru64-to-hpux/CHP18NXX.HTM [3]http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/convert.htm
Attached patch a patch (obsolete) (deleted) — Splinter Review
Comment on attachment 124018 [details] [diff] [review] a patch >+static const char *UTF_16_NAMES[] = { >+ "UTF-16", > "UCS-2", > "UCS2", > "UCS_2", this patch makes good sense to me, but do we need to worry about other variants of UTF-16 .. such as UTF16, UTF_16, utf-16, utf16, and utf_16? i know it almost seems rediculous, but i can't really say for sure. also, how about replacing all instances of "ucs2" with "utf16" e.g., ucs2_to_isolatin1 -> utf16_to_isolatin1
As for replacing all instances of "ucs2" with "utf16", I'll do. As for UTF-16 variants, I'm not sure. It seems like every platform that supports UTF-16 uses "UTF-16". However, I guess it doesn't hurt to try other variants because it's done only once at the init, right?
darin, Can you also comment on my posting 'news://news.mozilla.org:119/bajml3$fvr1@ripley.netscape.com'? I thought I was ringing a false alram, but I just realized that Linux does not use iconv(3) but uses wcrtomb and mbrtowc assuming that wchar_t is UTF-16 string, of which I'm not sure.
Status: NEW → ASSIGNED
actually, i'm not sure why we'd want to ever prefer wcrtomb/mbrtowc over iconv on linux. i know that under OSX, nl_langinfo does not seem to exist, but the nsNativeCharsetUtils methods aren't invoked until OSX IIRC.
On Linux it is *never* true that wchar_t == UCS2 or UTF-16. With the -fshort-wchar option one can make gcc define wchar_t as a two-byte value but this doesn't magically the runtime. In fact, when using this option and the normal runtime for wide character handling you guarantee certain death for your program. Everybody who uses this option must be aware of this. And s/he must be aware of the fact that the so created fake wchar_t type violates ISO C. UTF-16 is no valid encoding for wchar_t and UCS2 is obviously insufficient. If somebody insists on using -fshort-wchar s/he must make sure to never use any of the wide character interfaces in the runtime. It is possible to use iconv() but even then the pseudo encoding name "WCHAR_T" must not be used. It is then necessary to explicitly use "UTF-16" or so.
Darin and I decided to switch over to iconv for Linux after discussing the issue off-line (newsgroups/mail) as well as here with Ulrich and Bruno.
Attachment #124018 - Attachment is obsolete: true
Before leaving for the vacation, Darin wrote to me the following: <quote> jshin: i am going to be away on vacation for the next 2 weeks, in which case i most likely will not get a chance to make this change in time for 1.4 final. if you think we should try to make this change for 1.4 final (sounds to me like we should), then you ought to get in touch with drivers@mozilla.org to discuss it with them. be sure to run some performance tests (start up time is probably going to be most sensitive) so they can be aware of all the implications of this change. thanks! </quote> I agree with him that we should fix this both in trunk and 1.4f branch because 1.4f branch is likely to be a long-lived one. I measured the start-up time with and without the patch (attachment 124092 [details] [diff] [review]) and I found that the one with the patch takes slightly shorter than otherwise. (0.095 vs 0.097 on 10-time average. the std. deviation is virtually zero unless I wait long enough between runs).
Attachment #124092 - Flags: superreview?(alecf)
Attachment #124092 - Flags: review?(ftang)
Comment on attachment 124092 [details] [diff] [review] a new patch (including switching to iconv under Linux) looks good. sr=alecf
Attachment #124092 - Flags: superreview?(alecf) → superreview+
Setting the target to 1.4final. Frank, can you review?
Target Milestone: --- → mozilla1.4final
Comment on attachment 124092 [details] [diff] [review] a new patch (including switching to iconv under Linux) asking smontagu for r.
Attachment #124092 - Flags: review?(ftang) → review?(smontagu)
Comment on attachment 124092 [details] [diff] [review] a new patch (including switching to iconv under Linux) I can't say I know enough about iconv to review this properly, but since (from comment 7) it looks like people who do know have signed off on the approach, I am happy to give r=smontagu
Attachment #124092 - Flags: review?(smontagu) → review+
This appears to have increased codesize by 3.4k on various linux platforms. Was this expected; are we statically linking the iconv libraries?
> This appears to have increased codesize by 3.4k on various linux platforms. Thank you for alerting me about the codesize I missed. I realize that USE_ICONV has more code than USE_STDCONV. We can cut down the increase by half by NOT defining ENABLE_UTF8_FALLBACK_SUPPORT. For sure, iconv(3) in glibc doesn't need that which is for iconv(3) on Solaris. On Solaris, iconv(3) has to go through UTF-8 for most combinations of to/from codesets. Removing 'inline' from xp_iconv_reset helped me very little (100bytes) while removing 'inline' from xp_iconv_open and xp_iconv didn't change the size at all (they are too complex to be inlined with gcc 3.x with -O1). I can squeeze out additional ~300 bytes removing unnecessary array elements. The final numbers are: - old : 898056 bytes - current : 902504 bytes : +4448 - w/ patch : 899528 bytes : +1472
Attached patch a patch to reduce the size (obsolete) (deleted) — Splinter Review
This patch doesn't look pretty because it's littered with '#if !defined(__linux)'. Anyway, with this patch, I cut down the code size increase (libxpcom.so on ix86) to 1.2kB.
I forgot to mention that I removed 'inline' in xp_iconv_reset, which saved me only about 190 bytes. Because I removed the function call overhead on Linux in other places (plain iconv_open() is used in place of xp_iconv_open), I don't think there's any slow down in the startup. If it's a worry, perhaps we can just put back inline for 190 byte size increase (on ix86, gcc 3.x, -O1...) BTW, I assumed that Linux binary is always compiled AND run with glibc 2.x or GNU libiconv, which is very unlikely to break and which enabled me to use 'UTF-16' and 'ISO-8859-1' directly in iconv_open(). Benjamin, I guess you're concerned with embedded devices. Is my assumption valid there?
jshin, I am not actually an embedder. I was just made very conscious of codesize issues while working with alecf, and looking a tbox it seemed a small change had large codesize implications. If that was unexpected, I just wanted to bring it to your attention. I don't have any particular need/desire for a fix.
But I'd like to see this fix go in if in fact the extra code isn't needed on linux... however, is there any way to detect at runtime what systems do and don't need it, as opposed to choosing linux vs. not linux? i.e. is there some kind of test we can add to configure.in to detect this case?
> is there any way to detect at runtime what systems do and > don't need it, as opposed to choosing linux vs. not linux? i.e. is > there some kind of test we can add to configure.in to detect this > case? There may be. Adding seawood to CC. We might be able to detect the codeset names for UTF-16, UTF-8, and ISO-8859-1 with some tweaking in configure.in. Then, we can reduce the codesize on all Unix(-like) platforms. When I wrote about this possibility for another program, some people had a reservation about it because what's detected at the compile-time is not always the case at the run-time because iconv(3) is available in a separate standalone libiconv and people often install it on non-Linux systems if they're not satisfied with the stock version. On Linux (barring possible exceptions on embedded devices unknown to me. adding dougt for this), the possibility is almost zero because there's no reason to replace the stock iconv(3) with something else. It just occurrs to me that some embedded devices (Linux) use a lightweight C library other than glibc. I'm not sure whether it lacks iconv(3) altogether, in which case we don't have to worry because 'HAVE_ICONV' is set to false. A tricky case is vanilla iconv(3) that doesn't support UTF-16. BTW, Ulrich, is there any known glibc (configuration option) that doesn't support UTF-16 but supports UCS-2? This question is also relevant to native uconv (http://lxr.mozilla.org/mozilla/source/intl/uconv/native/). If all these are regarded as too much headache for about 800bytes, I think we can just go with #ifdef'ing out 'UTF-8 fallback' code for Linux. That'll save us about 2.4kB (out of 4.5kB introduced by my check-in) instead of 3.2kB.
I think this checkin caused blocker bug 208809
> BTW, Ulrich, is there any known glibc (configuration option) that doesn't > support UTF-16 but supports UCS-2? UCS-2 handling is built in libc, UTF-16 is a module which can be removed.
>If all these are regarded as too much headache for about 800bytes, I think we >can just go with #ifdef'ing out 'UTF-8 fallback' code for Linux. That'll save us >about 2.4kB (out of 4.5kB introduced by my check-in) instead of 3.2kB. yeah, something like that is probably sufficient. as for the other tweaks, since you have done them... maybe worth making the changes, but in general there are bigger fish to fry ;-) i think ulrich's comment suggests that we should fallback on UCS-2 if no UTF-16 converter exists.
Blocks: 208809
Attached patch a new patch to fix make it work on RH 6.2 (obsolete) (deleted) — Splinter Review
This is the first of two patches. Here I'm explictly checking whether we need byte swapping or not and byte-swap if necessary. I had to cast away constness, which can be dangerous. I also disabled UTF8FALLBACK on linux but had to put back the dummy conversion for BOM check because glibc 2.2.x also puts BOM for UTF-16 and it's also necessary to check the endianness of UTF-16. With this, the size reduction compared with the trunk is 1.3kB.
Here I'm following Ulrich's suggestion in bug 208809. Instead of using UTF-16 (which is 'native' endian on recent glibc but is BE in old glibc regardless of the native endianness), UTF-16LE/UTF-16BE are used depending on the endianness. One problem with this is that UTF-16LE/BE are not supported by glibc 2.1.x (at least they're not in glibc 2.1.3 on RH 6.2 ). As a fallback, I'm using 'UNICODELITTLE' and 'UNICODEBIG' that are present both in glibc 2.1.x and recent glibc. The size reduction is about 2.8kB. Although we can't support nob-BMP on Linux with old glibc with this patch, I think this patch is better than attachment 125292 [details] [diff] [review]. There're a couple of reasons. Firstly, I don't feel comfortable casting away const (I guess it can crash Mozilla). And, it's smaler and faster (no need to byte-swap). With attachment 125292 [details] [diff] [review], we might be byteswapping twice, in glibc and in our code. BTW, I blocked the dummy conversion to get rid of 'BOM" on linux (because we now explicitly specify the endianness), but that saves us only about 100bytes. To be safe, I'd rather keep that code in.
> Instead of using UTF-16 > (which is 'native' endian on recent glibc but is BE in old glibc regardless of > the native endianness), That's wrong. If you use UTF-16 and you provide no BOM big endian is used. But it is not correct to use UTF-16 without BOM. So using the explicit names is *always* correct, regardless of platform.
> That's wrong. If you use UTF-16 and you provide no BOM big endian is used. But it is not correct to use UTF-16 without BOM. Let's put aside difference in interpreting UTF-16 and focus on how different versions of glibc behave. I didn't think about converting from UTF-16. 1. In recent glibc, when converting to 'UTF-16', the output is prepended with BOM to indicate the native endian in the first invocation and the *native* endian is used. 2. When converting from UTF-16, the absence of BOM is regarded as indicating BE regardless of the native endianness. 3. in old glibc, BOM was not prepended to the output and BE (not the native endian) is used everywhere when converting to UTF-16. #1 is what I observed and #3 was inferred from bz's test result in bug 208809. As for #2, is that what you meant? My test with glibc 2.2.93 on ix86 tells me a different story. With or without BOM, "\x20\x21" is interpreted as U+2120 by iconv when converting _from_ "UTF-16". That is, both '\xff\xfe\x20\x21' and '\x20\x21' gave me the same result (interpreted as LE). Has there been a change recently (2.3.x)? > So using the explicit names is *always* correct, regardless of > platform. No doubt about that. As I wrote, one problem with that is that UTF-16LE and UTF-16BE are not available in old glibc. Anyway, using UNICODELITTLE and UNICODEBIG as fallback at the cost of losing non-BMP support is reasonable.
Glibc has /usr/include/byteswap.h which has optimized swapping functions. You should use them where available.
Comment on attachment 125292 [details] [diff] [review] a new patch to fix make it work on RH 6.2 This is not well tested nor safe. (crashes on bz's RH 6.2). Let's go with the other one.
Attachment #125292 - Attachment is obsolete: true
> using UNICODELITTLE and UNICODEBIG as fallback at the cost of > losing non-BMP support in old glibc is reasonable. Darin (Simon, and Alec) , what do you think? I think attachment 125293 [details] [diff] [review] (except for not so pretty '__linux') is a reasonable compromise. If you agree, pls give r/sr to the patch so that I can remove the blocker (bug 208809).
I've done some tests (direct and indirect) and found that UTF-16LE and UTF-16BE work under glibc 2.2.x, Solaris 8 or newer, and Tru64 (Compaq - HP) V5. Under Tru64 (both V4 and V5), the meaning of UTF-16 is dependent on the environment variable ICONV_BYTEORDER, which is yet another reason to use UTF-16LE (because alpha is LE). By default, 'UTF-16' in Tru64 is interpreted as LE (platform endian). Unfortunately, Tru64 V4 doesn't support 'UTF-16LE' so that it'd break Mozilla if ICONV_BYTEORDER is set to 'big-endian', but there's not much we can do other than providing a shell script wrapper. Solaris 7 and IRIX 6 (both BE. MIPS can be run as LE as well, but SGI machines are BE) interpret UTF-16 as BE and emit UTF-16BE for 'UTF-16'. As mentioned earlier, for glibc 2.1.x, we need to use UNICODELITTLE and UNICODEBIG because 'UTF-16' is always interpreted as BE by glibc 2.1.x and UTF-16LE/UTF-16BE are not supported. Based on these and the fact that we have used UCS-2 and variants so far (AIX, HP/UX and other Unix), I think we can use the following for UTF_16_NAMES[] +#if defined(IS_LITTLE_ENDIAN) + "UTF-16LE", +#if defined(__linux) + "UNICODELITTLE", +#endif + "UCS-2LE", +#else + "UTF-16BE", +#if defined(__linux) + "UNICODEBIG", +#endif + "UCS-2BE", +#endif "UTF-16", "UCS-2", "UCS2",
jshin: that sounds reasonable to me.
Attached patch a new patch doing what I wrote (obsolete) (deleted) — Splinter Review
Darin, thanks. Here's the patch implementing that. Can I get review? Alec and Simon, please feel free to review :-) This is basically the same as attachment 125293 [details] [diff] [review] except that UTF16_NAMES[] is changed as I wrote and I kept the block removing BOM in the first invocation of iconv(3) for linux (that is, I got rid of '#if !defined(__linux)' around that code.).
Attachment #124889 - Attachment is obsolete: true
Attachment #125293 - Attachment is obsolete: true
Just in case we need to refer to them later..Let me add a note that I used attachment 125288 [details] (to bug 208809) and attachment 125485 [details] (to bug 209048) ( for Solaris that doesn't like 4 NULL's in iconv, need to do what Mozilla does in nsNativeCharsetUtils.cpp along with 'const char **' change) to try iconv(3) on a few platforms.
Comment on attachment 125560 [details] [diff] [review] a new patch doing what I wrote >+ * We also use it to determine whether or not 'UTF-16'/'UCS-2' is >+ * interpreted as native endian. If it's interpreted as BE on LE, >+ * we need byte-swapping. The above comment should have not been included. I got rid of it in my tree. I've been waiting for bug 209048 to be resolved. Now that it can be fixed by a Solaris patch, I'm asking for r/sr.
Attachment #125560 - Flags: superreview?(darin)
Attachment #125560 - Flags: review?(smontagu)
Comment on attachment 125560 [details] [diff] [review] a new patch doing what I wrote >Index: xpcom/io/nsNativeCharsetUtils.cpp >+// just in case... but we know for sure that iconv(3) in glibc for Linux nit: "but we know for" >+// doesn't need this. >+#if !defined(__linux) is there a GLIBC specified #define that we should test instead? >-// PRUnichar[] is NOT a UCS-2 array BUT for UTF-16 string. Therefore, we >-// have to use UTF-16 with iconv(3) on platforms where it's supported. >-// We could list 'UTF-16' name variants, but all platforms known (to me) to >-// support UTF-16 in iconv(3) uses 'UTF-16'. Let me know (jshin) if there's an >-// exception. (bug 206811) >+/* >+ * PRUnichar[] is NOT a UCS-2 array BUT for a UTF-16 string. Therefore, we nit: PRUnichar[] is NOT a UCS-2 array BUT a UTF-16 array. >+ * have to use UTF-16 with iconv(3) on platforms where it's supported. >+ * However, the way UTF-16 and UCS-2 are interpreted varies across platforms nit: However, the way UTF-16 and UCS-2 are interpreted varies across platforms >+ * and implementations of iconv(3). On Tru64, it also depends on the environment >+ * variable. To avoid the trouble arising from byte-swapping nit: variable. To avoid the trouble arising from byte-swapping > static const char *UTF_16_NAMES[] = { >+#if defined(IS_LITTLE_ENDIAN) >+ "UTF-16LE", >+#if defined(__linux) >+ "UNICODELITTLE", >+#endif >+ "UCS-2LE", ... > "UTF-16", > "UCS-2", so it's never the case that we want to try "UTF-16" before "UCS-2LE"? >+ * the first use of the iconv converter. The same is the case >+ * case of glibc 2.2.9x and Tru64 V5 (see bug 208809) nit: "case case" >+ * when 'UTF-16' is used. However, we use 'UTF-16LE/BE' in both >+ * cases, instead so that we should be safe. But just in case... nit: * cases, instead so that we should be safe. But just in case... > * This dummy conversion gets rid of the BOMs and fixes bugid 153562. nit: s/bugid/bug/
Attachment #125560 - Flags: superreview?(darin) → superreview-
> so it's never the case that we want to try "UTF-16" before "UCS-2LE"? I guess what's in 125560 is not 100% correct, but safer than the following. -------------- static const char *UTF_16_NAMES[] = { #if defined(IS_LITTLE_ENDIAN) "UTF-16LE", #if defined(__linux) "UNICODELITTLE", #endif #else "UTF-16BE", #if defined(__linux) // or if defined(__GLIBC__) "UNICODEBIG", #endif #endif "UTF-16", #if defined(IS_LITTLE_ENDIAN) "UCS-2LE", #else "UCS-2BE", #endif "UCS-2", --------------- In reality, I think there's no difference unless there's an iconv implementation that works the way glibc 2.1.x does. That is, we're in trouble if the implementation interprets UTF-16 as UTF-16BE on a LE machine but does not recognize 'UTF-16LE'. Do you know of any implementation that does? Anyway, which one do you like? >>+#if !defined(__linux) >is there a GLIBC specified #define that we should test instead? Perhaps '__GLIBC__' or '__GNU_LIBRARY__' (deprecated)?
>In reality, I think there's no difference unless there's an iconv >implementation that works the way glibc 2.1.x does. That is, we're in trouble >if the implementation interprets UTF-16 as UTF-16BE on a LE machine but does >not recognize 'UTF-16LE'. Do you know of any implementation that does? no, i don't know of any such implementations... >Anyway, which one do you like? perhaps your first way was better since it'd be better... i defer to your best judgement. we can of course always fix it if we discover a problem on some platform ;-) >>>+#if !defined(__linux) > >>is there a GLIBC specified #define that we should test instead? > > Perhaps '__GLIBC__' or '__GNU_LIBRARY__' (deprecated)? yeah, let's use __GLIBC__ instead since it's possible someone on another OS might be using glibc as well.
I decided to do the same as in attachment 125560 [details] [diff] [review]. So in terms of the actual code, this is the same as attachment 125560 [details] [diff] [review] except that __GLIBC__ is used instead of __linux.
Attachment #125560 - Attachment is obsolete: true
Attachment #125560 - Flags: review?(smontagu)
Attachment #127768 - Flags: superreview?(darin)
Attachment #127768 - Flags: review?(smontagu)
Comment on attachment 127768 [details] [diff] [review] a new patch with nits taken care of and __linux replaced by __GLIBC__ sr=darin
Attachment #127768 - Flags: superreview?(darin) → superreview+
Ping :-) Simon, can I get r? Or shall I ask someone else?
Probably better ask someone else since, as I said before in comment 12, I really know almost nothing about iconv.
Comment on attachment 127768 [details] [diff] [review] a new patch with nits taken care of and __linux replaced by __GLIBC__ Ulrich, can you review? You're the authority on iconv :-)
Attachment #127768 - Flags: review?(smontagu) → review?(drepper)
Comment on attachment 127768 [details] [diff] [review] a new patch with nits taken care of and __linux replaced by __GLIBC__ The patch looks fine. glibc will never use the BOM if the endianness is explicitly requested but it probably doesn't hurt. If the style is OK I see no problems. r=drepper
Attachment #127768 - Flags: review?(drepper) → review+
Comment on attachment 127768 [details] [diff] [review] a new patch with nits taken care of and __linux replaced by __GLIBC__ Fix checked into the trunk. Thank you all. Now I'm asking for a1.4 as Darin and I discussed (see comment #8). This patch (along with attachment 124092 [details] [diff] [review]) is to make nsNativeCharsetUtils on Unix(including Linux) be UTF-16-safe. Some problems in attachment 124092 [details] [diff] [review] have been identified and addressed in this patch so that I believe the risk is pretty low. Linux with both glibc 2.2.x or later and old glibc 2.1.x were tested. Solaris 8/9(Solaris 7 needs either a iconv patch from Sun or GNU iconv, which was already relase-noted) is fine. Other Unix should be all right because nsNativeCharsetUtils.cpp (with or without this patch) has a few 'defense measures' against various possibilities. I also tested 'iconv(3)' APIs on Tru64 and HP/UX 11 and it works as is assummed here. Considering that 1.4 would be a long lived one(like 1.0), I guess we can't just assume that nobody would use non-BMP characters in filenames and other places. On Linux, there was a little worry about the start-up time, but there's virtually no difference (or slightly faster with this patch than without). Given all these, I think it's safe to land this combined with attachment 124092 [details] [diff] [review] in 1.4branch
Attachment #127768 - Flags: approval1.4.x?
Comment on attachment 127768 [details] [diff] [review] a new patch with nits taken care of and __linux replaced by __GLIBC__ We don't have many testing resources and we're short on time. This doesn't appear to be a high-profile problem for users so it's not 1.4 material.
Attachment #127768 - Flags: approval1.4.x? → approval1.4.x-
sorry it's late. marking as fixed.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: