Closed
Bug 108136
Opened 23 years ago
Closed 22 years ago
Shift_JIS conversion problem on MacOS9, OS/2
Categories
(Core :: Internationalization, defect, P2)
Tracking
VERIFIED
FIXED
mozilla1.2beta
People
(Reporter: shom, Assigned: smontagu)
Details
(Keywords: intl)
Attachments
(7 files, 2 obsolete files)
patch (deleted)
patch (deleted)
patch (deleted)
patch (deleted) | ftang: review+, smontagu: superreview+
patch (deleted) | ftang: review+, alecf: superreview+
patch (deleted) | ftang: review+, alecf: superreview+
patch (deleted) | smontagu: review+, alecf: superreview+
The internal mapping table for Japanese is now based entirely on CP932 (bug 54135).
Mac OS 9 and OS/2 use different mapping tables, so some characters are converted
incorrectly when Mozilla passes its internal UCS2 codes to native OS functions that
handle UCS2.
PROBLEM:
testpage: http://rh.vinelinux.org/~shom/sjisprob.html
a problem on MacOS9
http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=364
a problem on OS/2
http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=367
RELATED BUGS : bug 35166, bug 58637, bug 33162, bug 65991
SOLUTIONS:
i) Convert the internal UCS2 codes to the codes used by the OS native mapping
whenever an OS function that handles UCS2 is called. Too hard?
ii) Implement dual mapping in the conversion tables. VERY HARD, I think.
iii) Make separate tables for the Shift_JIS variants. Currently the Japanese:UCS2
conversion table is generated from CP932.txt with mkjpconv.pl (bug 54135). Since
this tool can generate other mapping tables (e.g. from APPLE_JAPANESE.txt), it is easy
to make Shift_JIS(MacOS9) and Shift_JIS(OS2) -- or Shift_JIS(IBM943). This
solution has another advantage -- it can handle platform-dependent characters without
Unicode sequences (surrogate pairs?).
*** This bug has been confirmed by popular vote. ***
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 3•23 years ago
Reassign to ftang.
Updated•23 years ago
Status: NEW → ASSIGNED
Comment 5•23 years ago
what will happen if we don't fix this?
Updated•23 years ago
Priority: -- → P4
Reporter
Comment 6•23 years ago
Mozilla cannot handle many vendor-specific Shift JIS kanji characters (I know NC4 can).
# CP932 contains MS-specific kanji characters, so they can be handled on Windows :b
In addition, legal characters in JIS X 0208 have conversion problems.
[reported in bugzilla-jp <http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=868>]
testpage : http://rh.vinelinux.org/~shom/sjisprob2.html
* OS/2
Four SJIS characters (0x815c,0x8160,0x8161,0x817c) have problems.
screenshot
http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=538
screenshot after re-entering the '?' characters and submitting
http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=539
- display problem
(0x815c,0x8160,0x8161,0x817c) are displayed as '?'
in the page body, bookmark titles, tabs, and JavaScript alerts.
In the titlebar, they are displayed as ' '.
- query send problem
When one of (0x815c,0x8160,0x8161,0x817c) is entered in INPUT type=text /
TEXTAREA, the characters following it are truncated.
(http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=539)
- compose problem
(0x815c,0x8160,0x8161,0x817c) becomes — 〜 ‖ −
in saved page.
- mail/news send problem
(0x815c,0x8160,0x8161,0x817c) are treated as illegal, so the message cannot be sent.
If the alert is ignored, 0x815c becomes '--' and the others become '?'.
* Mac OS 9 (and probably Mac OS X)
(0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) have problems.
- query send problem
When one of (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) is entered in
INPUT type=text / TEXTAREA, the characters following it are truncated.
- mail/news send problem
(0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) are treated as illegal,
so the message cannot be sent.
If the alert is ignored, 0x815c becomes '--' and the others become '?'.
- bookmark problem
Bookmark titles containing (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca)
are displayed as blank in the OS menubar.
Comment 7•22 years ago
This is one of the top problems reported by the Mozilla Japanese group. Not sure how
to solve it yet. It may need to be broken down into separate tasks.
Comment 8•22 years ago
Kohei Ichioka has made a patch for this bug.
http://www5a.biglobe.ne.jp/~expf/ucvja.tar.gz
This file contains readme.txt which explains how to apply the patch.
And chado has made a Mac build based on this patch.
ftp://download.sourceforge.jp/wazilla/996/Wazilla-mac-1.1-2156c.sea.bin
Comment 9•22 years ago
Adding mkaply to Cc.
Tarball in comment 8 contains the patch for OS/2, but Kohei Ichioka
hasn't tested it. He doesn't have OS/2. Can you review the patch and
test it?
Severity: normal → critical
Comment 10•22 years ago
In the original report,
>MacOS9 and OS/2 have another mapping table
Does the problem exist on Mac OS X, or is it specific to Mac OS 9?
Comment 11•22 years ago
Can anyone attach a patch using cvs diff -u to this bug?
Reporter
Comment 12•22 years ago
* Change the Japanese-to-Unicode conversion rule
pref("intl.jis0208.map", "Apple") uses the MacJapanese conversion rule.
pref("intl.jis0208.map", "IBM943") uses the IBM943 conversion rule.
* Dual mapping for the Unicode-to-Japanese conversion rule
CP932,  Apple,  IBM943 -> SJIS (JIS)
U+2015, U+2014, U+2014 -> 0x815C (01-29)
U+FF5E, U+301C, U+301C -> 0x8160 (01-33)
U+2225, U+2016, U+2016 -> 0x8161 (01-34)
U+FF0D, U+2212, U+2212 -> 0x817C (01-61)
U+FFE0, U+00A2, U+FFE0 -> 0x8191 (01-81)
U+FFE1, U+00A3, U+FFE1 -> 0x8192 (01-82)
U+FFE2, U+00AC, U+FFE2 -> 0x81CA (02-44)
U+FFE4, U+FFE4, U+00A6 -> 0xEEFA (92-92)
U+FFE4, U+FFE4, U+00A6 -> 0xFA55
mozilla/intl/uconv/tools/jamap.pl creates the maps.
mozilla/intl/uconv/ucvja/japanese.map is the map for Japanese to Unicode.
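For readers who want to see the two rules side by side, here is a minimal sketch of what the table above implies. It is not the code from the patch: the names `JisMap`, `Row`, `kRows`, `DecodeVariant`, and `EncodeDual` are hypothetical, and only the first seven rows of the table are included. Decoding picks one column according to the active rule, while encoding accepts the code point from any column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Which Japanese-to-Unicode rule is active. These enum names are
// illustrative only; the thread shows just the pref values.
enum class JisMap { CP932 = 0, Apple = 1, IBM943 = 2 };

// One row per Shift_JIS code: its Unicode value under each rule.
struct Row {
  uint16_t sjis;
  char16_t unicode[3];  // indexed by JisMap
};

// Values copied from the first seven rows of the table in this comment.
static const Row kRows[] = {
    {0x815C, {0x2015, 0x2014, 0x2014}},
    {0x8160, {0xFF5E, 0x301C, 0x301C}},
    {0x8161, {0x2225, 0x2016, 0x2016}},
    {0x817C, {0xFF0D, 0x2212, 0x2212}},
    {0x8191, {0xFFE0, 0x00A2, 0xFFE0}},
    {0x8192, {0xFFE1, 0x00A3, 0xFFE1}},
    {0x81CA, {0xFFE2, 0x00AC, 0xFFE2}},
};

// Japanese -> Unicode: use the column for the active rule.
// Returns 0 when the code is not one of the variant-dependent rows.
char16_t DecodeVariant(uint16_t sjis, JisMap map) {
  for (const Row& row : kRows) {
    if (row.sjis == sjis) return row.unicode[static_cast<std::size_t>(map)];
  }
  return 0;
}

// Unicode -> Japanese with dual mapping: a match in any column is accepted.
// Returns 0 when the character is not in this small table.
uint16_t EncodeDual(char16_t ch) {
  for (const Row& row : kRows) {
    for (char16_t candidate : row.unicode) {
      if (candidate == ch) return row.sjis;
    }
  }
  return 0;
}
```

With this sketch, `DecodeVariant(0x8160, JisMap::Apple)` yields U+301C, and both U+FF5E and U+301C encode back to 0x8160. A real converter would use indexed tables rather than a linear scan.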
Comment 13•22 years ago
Matsumoto san,
Does the problem exist on Mac OS X, or is it specific to Mac OS 9?
Reporter
Comment 14•22 years ago
I don't know.
But I think MacOSX uses the same conversion rule as MacOS9 for backward
compatibility.
Comment 15•22 years ago
Could you give us a patch instead of an application/x-gzip file?
Reporter
Comment 16•22 years ago
Reporter
Comment 17•22 years ago
Reporter
Comment 18•22 years ago
Comment 19•22 years ago
Comment 20•22 years ago
Comment 21•22 years ago
Comment 22•22 years ago
Comment 23•22 years ago
The patch id=98510-98512 is incomplete.
id=102147-102150 is the actual patch.
Comment 24•22 years ago
ftang, could you review the patch?
The patch is divided into the following attachments.
attachment 102147
attachment 102148
attachment 102149
attachment 102150
Updated•22 years ago
Attachment #102147 - Flags: review+
Updated•22 years ago
Attachment #102148 - Flags: review+
Updated•22 years ago
Attachment #102149 - Flags: review+
Updated•22 years ago
Attachment #102150 - Flags: review+
Updated•22 years ago
Attachment #98078 - Attachment is obsolete: true
Comment 26•22 years ago
Comment on attachment 102147
patch #1/4
sr=alecf
Comment 27•22 years ago
Comment on attachment 102148
patch #2/4
sr=alecf
Attachment #102148 - Flags: superreview+
Comment 28•22 years ago
Comment on attachment 102149
patch #3/4
sr=alecf
Attachment #102149 - Flags: superreview+
Comment 29•22 years ago
Comment on attachment 102150
patch #4/4
What does this notation mean?
+ const PRUint16 (*mMapIndex)[128];
This seems a little confusing. How about
const PRUint16* mMapIndex[128]?
Though actually, are you storing a pointer to a 128-element array? I think this is a
misuse of this type, and what you might really want is PRUint16** mMapIndex?
Also, storing the per-platform value in prefs seems unnecessary... I mean, the value
is never going to change, right? Why not just #ifdef the code?
Prefs should only be used when the value is going to be changed... the
per-platform pref stuff is for when you want the DEFAULT value of the pref to vary
based on the platform, but you still expect the user to change it later.
Attachment #102150 - Attachment is obsolete: true
Comment 30•22 years ago
mMapIndex is actually a pointer to an array of 128 PRUint16 values.
It points to the first item of gIndex, gCP932Index, or gIBM943Index.
const PRUint16 gIndex[2][128];
const PRUint16 gCP932Index[2][128];
const PRUint16 gIBM943Index[2][128];
If I use PRUint16** mMapIndex, I must use extra variables.
const PRUint16 gIndex1[128] = {
...
};
const PRUint16 gIndex2[128] = {
...
};
const PRUint16 *const gIndex[2] = { gIndex1, gIndex2 };
...
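The difference between the two declarations discussed here can be shown with a small, self-contained sketch. It is only an illustration: the table contents are dummy values, `PRUint16` is a local stand-in for the NSPR typedef, and `gStdIndex`, `gRow0`, `gRow1`, `gRows`, and both lookup functions are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

typedef uint16_t PRUint16;  // local stand-in for the NSPR typedef

// Layout 1: one contiguous two-row table per mapping rule (dummy values).
static const PRUint16 gStdIndex[2][128] = {{10, 11}, {20, 21}};
static const PRUint16 gCP932Index[2][128] = {{30, 31}, {40, 41}};

// A pointer to an array of 128 PRUint16 can point at the first row of
// either table directly, with no helper variables.
PRUint16 LookupViaArrayPointer(bool useCP932, int row, int col) {
  const PRUint16 (*mapIndex)[128] = gStdIndex;
  if (useCP932) mapIndex = gCP932Index;
  static_assert(sizeof(*mapIndex) == 128 * sizeof(PRUint16),
                "dereferencing yields a whole 128-element row");
  return mapIndex[row][col];
}

// Layout 2: separate row arrays plus an array of row pointers, which is
// what a pointer-to-pointer member needs to point at.
static const PRUint16 gRow0[128] = {10, 11};
static const PRUint16 gRow1[128] = {20, 21};
static const PRUint16* const gRows[2] = {gRow0, gRow1};

PRUint16 LookupViaPointerPointer(int row, int col) {
  const PRUint16* const* mapIndex = gRows;
  return mapIndex[row][col];
}
```

Both forms are indexed the same way at the call site. The pointer-to-array form keeps each table in one block, while the pointer-to-pointer form costs one extra array of pointers per table and one extra indirection per lookup. Note that `const PRUint16* mMapIndex[128]`, the first alternative suggested in comment 29, would declare an array of 128 pointers, which is a different type again.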
Comment 31•22 years ago
reassign to smontagu for landing
Assignee: ftang → smontagu
Status: ASSIGNED → NEW
Assignee
Comment 32•22 years ago
Kohei, can you attach a new version of attachment 102150 addressing alecf's
comments? I'm assuming that all 4 attachments need to be checked in together.
Comment 33•22 years ago
In some cases, users will want to change the conversion table.
On Unix, the suitable conversion table depends on the installed fonts,
so it cannot be fixed at compile time.
As another case, suppose a Macintosh Mozilla user has trouble with
a web site and contacts the site's engineer, who uses
a Windows machine and does not have a Macintosh.
The engineer will want to look into the conversion behavior
on his Windows machine.
(In Japan, troubles related to character conversion occur often.)
If a Windows Mozilla user values compatibility
with Java programs over the appearance on screen,
the user will want to use the standard conversion table instead of
the Windows (CP932) conversion table.
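The trade-off between #ifdef and a pref can be sketched as a compile-time default combined with a runtime override. This is an illustration under stated assumptions, not the landed code: `MapChoice`, `DefaultMapChoice`, and `ResolveMapChoice` are hypothetical names, the platform macros `XP_WIN`, `XP_MAC`, and `XP_OS2` and the per-platform defaults are assumed, and "CP932" as a pref value is a guess. Only "Apple" and "IBM943" appear as pref values in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical set of selectable conversion tables.
enum class MapChoice { Standard, CP932, Apple, IBM943 };

// Compile-time default per platform (macro names and defaults are assumed).
MapChoice DefaultMapChoice() {
#if defined(XP_WIN)
  return MapChoice::CP932;
#elif defined(XP_MAC)
  return MapChoice::Apple;
#elif defined(XP_OS2)
  return MapChoice::IBM943;
#else
  return MapChoice::Standard;
#endif
}

// Runtime override: a recognized pref value wins; anything else,
// including an empty string, falls back to the platform default.
MapChoice ResolveMapChoice(const std::string& prefValue) {
  if (prefValue == "CP932") return MapChoice::CP932;  // hypothetical value
  if (prefValue == "Apple") return MapChoice::Apple;
  if (prefValue == "IBM943") return MapChoice::IBM943;
  return DefaultMapChoice();
}
```

Under this arrangement the engineer in the example above could set the pref to "Apple" on a Windows machine to reproduce a Macintosh user's conversion behavior, while users who never touch the pref get the platform default.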
Comment 34•22 years ago
Comment 35•22 years ago
Re Comment 14: this happens also on Mac OS X.
Assignee
Comment 36•22 years ago
Comment on attachment 106482
patch #4/4 using PRUint16** mMapIndex
Transferring r=ftang and requesting sr
Attachment #106482 - Flags: superreview?(alecf)
Attachment #106482 - Flags: review+
Comment 37•22 years ago
Comment on attachment 106482
patch #4/4 using PRUint16** mMapIndex
I thought I had commented about this earlier: (maybe it was another bug?)
Why are we using prefs to choose the charset on a per-platform basis - can't we
do this with #ifdefs? I guess I'm trying to understand the situation where the
user will be changing this value? If this isn't going to be changed by the
user, then we shouldn't add more dependencies on prefs.
The patch looks ok, but I'm going to hold off on my sr= until this is
explained..
Reporter
Comment 38•22 years ago
See #33, and...
Japanese "Shift JIS" has many variants. Many Japanese pages are labeled with the
"Shift_JIS" charset, but some of them are actually Shift_JIS, others are
Windows-31J, and others are Apple Japanese, IBM943C, etc.
They share the same "encoding (Shift JIS)", but each has its own "charset" and Unicode
mapping rules. We Japanese -- especially web developers -- sometimes want to
use each of them properly.
case-1) vendor specific Shift JIS characters problem
Until now, Windows-specific characters could not be displayed on Mac/UNIX, nor
Mac-specific ones on Windows/UNIX. Now, if we change the charset at runtime, we can see
them via iso10646-1 glyph mapping (at the cost of finding glyphs).
Especially on UNIX, some users want to use only "Shift_JIS" characters
because the cost of searching iso10646 font glyphs is so large, but others want
to see "Windows-31J" specific characters because many web pages (and some mails) use
them with "charset=Shift_JIS".
IMHO, the best solution is to make a charset/mapping rule for each major variant
of Shift JIS, and let us specify at runtime which rule is used as "Shift JIS".
(In addition, the ISO-2022-JP compatible with Windows-31J, which many Windows mailers
generate, is different from the ISO-2022-JP compatible with Shift_JIS, i.e. the JIS spec.)
case-2) Unicode conversion problem on XML with charset=UTF-8
Each Shift JIS variant has its own mapping rules for Unicode. Unfortunately they are
not compatible with each other, so there are in effect Shift_JIS-, Windows-31J-, and
Apple Japanese-compatible flavors of UTF-8.
For example, XML with "charset=UTF-8" that was converted/generated from Shift JIS data
by an XML processor using the "Windows-31J/CP932" mapping rules -- I think Microsoft
products do so -- will not be usable on other systems.
This problem has not surfaced so far, but it may grow as
XML with "charset=UTF-8" comes into wider use.
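To make case-2 concrete, the sketch below encodes the two Unicode code points that comment 12 lists for Shift JIS 0x8160 (U+FF5E under CP932, U+301C under the Apple rule) as UTF-8. The helper name `Utf8FromBmp` is hypothetical and it handles only code points in the Basic Multilingual Plane, which is enough for this illustration.

```cpp
#include <cassert>
#include <string>

// Encode one BMP code point (surrogates excluded) as UTF-8.
std::string Utf8FromBmp(char16_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

`Utf8FromBmp(0xFF5E)` produces the bytes EF BD 9E, while `Utf8FromBmp(0x301C)` produces E3 80 9C. Two UTF-8 documents converted from the same Shift JIS source therefore differ byte for byte at this character, depending on which mapping rule the converter used.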
Comment 39•22 years ago
Comment on attachment 106482
patch #4/4 using PRUint16** mMapIndex
OK, that seems like a reasonable explanation. sr=alecf
By the way, you should learn to use "cvs diff" -- you don't need to keep two
separate trees around.
Attachment #106482 - Flags: superreview?(alecf) → superreview+
Assignee
Comment 40•22 years ago
Attachment #102147 - Flags: superreview+
Assignee
Comment 41•22 years ago
Fix checked in.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment 42•22 years ago
The test pages
http://rh.vinelinux.org/~shom/sjisprob2.html and
http://rh.vinelinux.org/~shom/sjisprob.html
display fine on the 11-26 trunk build / Mac OS 9.2.1.
Marking as verified fixed.
Status: RESOLVED → VERIFIED
Updated•22 years ago
Attachment #102147 - Flags: approval1.0.x?
Updated•22 years ago
Attachment #102148 - Flags: approval1.0.x?
Updated•22 years ago
Attachment #102149 - Flags: approval1.0.x?
Updated•22 years ago
Attachment #106482 - Flags: approval1.0.x?
Comment 43•21 years ago
I'm trying to understand the fix to this bug. At first glance, it seems
fundamentally incorrect. From the look of this patch, we're treating incoming
content from the web differently depending on platform, so that some characters
work on some platforms and some on others. If that's true, it's simply wrong,
and should be undone.
Was the real problem here that when some platforms use something they call
Shift_JIS as their native character encoding (e.g., for the filesystem), they
mean different things? If that's the case, then we should call those different
things different names, have encoders/decoders for all of them, and fix up the
name when determining what the filesystem/native encoding is.
Or am I misunderstanding what this fix did?
Comment 44•21 years ago
> Was the real problem here that when some platforms use something they call
> Shift_JIS as their native character encoding (e.g., for the filesystem), they
> mean different things?
Yes, this is the crux of the problem. The differences, however, are
limited to a small number of characters. But these characters are
often used, too. Now, do we want a full table of encoders/decoders
for Mac, Windows, OS/2, etc.? Or do we handle only these small
number of characters differently?
Vendors are clear about differences in their technical specs and even
use different names though some are quite similar in naming.
The major problem is the web pages and the way Mozilla used to treat pages
that are determined to be in Shift_JIS. On web pages, there is only
one dominant name used, i.e. Shift_JIS. We have only one encoding
name, i.e. Shift_JIS, in the Character Coding menu to reflect the
overwhelming reality that over 65% of Japanese web pages use it. (The remaining
pages use either EUC-JP or ISO-2022-JP.)
It would be nearly impossible to persuade web developers to use different
names at this point -- it's been over 15 years with this single familiar
name to most web surfers. Can browser users tolerate different names for
encodings that have been treated for so many years as the same Shift_JIS
thing (except for a small number of characters)?
Comment 45•21 years ago
Are the pages on the Web in the standard version of Shift_JIS, the Windows
version, or the OS/2 version? If they're in the Windows version, then perhaps
we should treat "Shift_JIS" as the Windows version of Shift_JIS on all
platforms? Treating it as the Windows version only on Windows seems
problematic, since it could cause pages to work on Windows and fail on other
platforms, which is exactly what we don't want -- and why there should NOT be
platform differences at this level of the code.
Comment 46•21 years ago
> Are the pages on the Web in the standard version of Shift_JIS, the Windows
> version, or the OS/2 version?
We cannot tell which in reality because people by now are used to
minor glyph shape differences. A good place to begin is this image
above showing the differences between Mac and Windows:
http://bugzilla.mozilla.gr.jp/attachment.cgi?id=364&action=view
* The leftmost column shows: Shift_JIS codepoints
* The middle column shows glyphs used by Mac Japanese & corresponding Unicode
points
* The rightmost column shows Windows glyphs & corresponding Unicode points
You can see that the same Shift_JIS codepoints lead to slightly
different glyph shapes between the 2 platforms. But except for
Shift_JIS 0x007e (overline)
Shift_JIS 0x815F (reverse solidus)
all others look remarkably alike in glyph shapes. Users really don't care
about these minor glyph differences. As for the overline and reverse solidus
characters, after so many years of seeing these 2 codepoints
rendered with different glyph shapes, users now regard the two separate
glyphs on different OS's as **cognitively** equivalent.
So the glyph shapes are not an issue here. And given this situation,
for all practical purposes, Shift_JIS pages on the web can be
considered platform-independent.
The real problem happens when Mozilla has to convert internal Unicode
points back to OS native encodings. We had been using only the Windows
mapping table before this bug got fixed:
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
Take the wave dash character, which is used a lot in mail and web logs in
Japanese. This is Shift_JIS: 0x8160.
On Windows, it maps to \uFF5E.
Now on Mac, if we need to convert this to the native encoding, there is
no \uFF5E codepoint in the Mac Japanese mapping table:
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
hence users see a question mark there on Mac.
If we had converted Shift_JIS 0x8160 to \u301C on Mac in the first place,
that would have solved this roundtrip problem for Mac. I believe the current
code now takes care of this issue.
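The roundtrip described here can be sketched in a few lines. This is a toy model, not Mozilla or Mac OS code: `NativeMacEncode` is a hypothetical stand-in for a native encoder that knows only the Mac Japanese mapping of the wave dash and substitutes '?' (0x3F) for anything else, and `DecodeWaveDash` and `RoundTripWaveDash` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstdint>

const char16_t kWaveDashCP932 = 0xFF5E;  // CP932 mapping of SJIS 0x8160
const char16_t kWaveDashApple = 0x301C;  // Mac Japanese mapping of SJIS 0x8160

// Toy native encoder: only the Mac Japanese code point is known;
// everything else becomes '?' (0x3F).
uint16_t NativeMacEncode(char16_t ch) {
  if (ch == kWaveDashApple) return 0x8160;
  return 0x003F;
}

// Decode Shift_JIS 0x8160 with either the Mac table or the Windows table.
char16_t DecodeWaveDash(bool useMacTable) {
  return useMacTable ? kWaveDashApple : kWaveDashCP932;
}

// Decode, then hand the result back to the native encoder.
uint16_t RoundTripWaveDash(bool useMacTable) {
  return NativeMacEncode(DecodeWaveDash(useMacTable));
}
```

Decoding with the Windows table and then encoding natively gives 0x3F, the question mark users saw, whereas decoding with the Mac table returns the original 0x8160.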
Comment 47•21 years ago
Sorry, I should have said that the first part of conversion
from OS encoding -> Unicode created the real problem because
we used to use only the Windows mapping. The roundtrip is also
a problem.
By the way, this type of problem would not have occurred if we
lived only in the world of native encodings. The need for conversion
to/from Unicode is what exposes this problem so clearly.
Updated•20 years ago
Attachment #102147 - Flags: approval1.0.x?
Updated•20 years ago
Attachment #102148 - Flags: approval1.0.x?
Updated•20 years ago
Attachment #102149 - Flags: approval1.0.x?
Updated•20 years ago
Attachment #106482 - Flags: approval1.0.x?