Closed Bug 71489 Opened 24 years ago Closed 24 years ago

RFE: JOHAB <-> Unicode converter for Korean locale

Categories (Core :: Internationalization, defect, P1)

Product: Core
Component: Internationalization
Platform: All, Linux
Type: defect
Priority: P1
Severity: normal

Tracking

Status: VERIFIED FIXED
Milestone: mozilla0.9

People (Reporter: bstell, Assigned: ftang)

Details (Keywords: intl)

Attachments (8 files)

new file- mozilla/intl/uconv/ucvko/nsUnicodeToJohab.cpp (24 years ago, Frank Tang; text/plain, deleted)
new file- mozilla/intl/uconv/ucvko/nsUnicodeToJohab.h (24 years ago, Frank Tang; text/plain, deleted)
patch gfx/src/gtk , intl/uconv/{public,src,ucvko} (24 years ago, Frank Tang; patch, deleted)
updated patch for mozilla/intl/uconv/src/uscan.c (24 years ago, Frank Tang; patch, deleted)
a better patch to replace the last 2 (24 years ago, Frank Tang; patch, deleted)
Wrong hangul characters with the latest patches (24 years ago, Brian Yuan; image/gif, deleted)
The UTF-8 file for testing (24 years ago, Brian Yuan; text/plain, deleted)
wrong display of English characters under ko.UTF-8 locale (24 years ago, Ervin Yan; image/gif, deleted)

Reporter

Description

24 years ago

from bug 70550:

Anyway, Mozilla needs a 'SANG-YONG JOHAB' <-> Unicode converter to make use of Sun's ksc5601.1992-3 fonts. The converter should be very easy to write. For Hangul syllables, the conversion is algorithmic, and for the rest (Hanja and symbols), shifted-translated tables for EUC-KR can be used (refer to my implementation of JOHAB <-> UCS-4 conversion for iconv() in glibc 2.1.x or later, or the LGPLed libiconv by Bruno Haible). Perhaps it's time to overhaul the intl/uconv/ucvko directory in a way similar to what was done for ucvcn.

Jungshik

Reporter

Updated

24 years ago

Depends on: 67374

Assignee

Comment 1

24 years ago

What is the "Sun's ksc5601.1992-3" font? What is the algorithm? Is this cp949? Is there any documentation? And jshin, this is a good time for you to demonstrate your programming skill to us :)

Comment 2

24 years ago

As I have written a couple of times before, the encoding of ksc5601.1992-3 is just JOHAB, defined as a supplementary encoding in KS C 5601-1992 (KS X 1001:1997: see annex 3 table 2 of the standard, which I know you got a few years ago from Ken Lunde ^-^. You don't need to know Korean to 'decipher' it). It's NOT Windows-949, which is a proprietary (upward compatible) extension of EUC-KR by Microsoft. As such, Windows-949 does NOT have any algorithmic conversion to and from Unicode (neither does EUC-KR). On the other hand, there's a simple algorithmic conversion between JOHAB and Unicode for the 11,172 Hangul syllables. For the rest of JOHAB, the conversion table for EUC-KR or KS X 1001 GL can be used after some shifting/translating. However, I'm afraid ucvko is currently not written in such a way that this kind of reuse (of the table) can be done seamlessly. That's why I wrote about overhauling it the way it's done for ucvcn (or the way ucvja is written). I can do it pretty easily and would love to, but not now (maybe in May). I have to search for supernovae at the moment. :-)

As for the algorithm mentioned above, refer to Ken Lunde's "CJKV Information Processing" (and the Unicode 3.0 book), or get the source code of yudit (<http://www.yudit.org>) and see how I implemented it (UJOHABConv class in UCS2Conv.cpp. I'm not sure whether I have submitted to the author some of my recent changes to support the U+1100 Hangul Jamo block as well as the U+AC00 Hangul syllable block. Probably I haven't). Yet another place is the gconv module for JOHAB (originally written by me) in GNU Libc 2.x, as I mentioned before.

Basically, Hangul Jamo (leading consonant, medial vowel, final consonant) are assigned *5-bit* values (there are fewer than 32 leading consonants, medial vowels, and final consonants in modern Korean, so they can be encoded with 5 bits). They're bitwise ORed after the bit patterns for the leading consonant and medial vowel are shifted left by 10 bits and 5 bits, respectively, and the first bit of the first byte of the two bytes is always set to 1 to tell it from US-ASCII (the MSB of the second byte is not always 1, though). Therefore, the two-byte representation of a single Hangul syllable is something like this:

1lllllmmmmmfffff

where 'lllll', 'mmmmm', and 'fffff' are the bit patterns for the leading consonant, medial vowel, and final consonant (if there's no final consonant, the bit pattern for the 'filler' is used).

Jungshik
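The bit layout described in Comment 2 can be sketched in C roughly as follows. This is only an illustration, not the code from the patches attached to this bug. The function name `hangul_syllable_to_johab` is hypothetical, and the 5-bit code values (leading codes starting at 2, the gaps in the medial codes, the gap in the final codes after 17) are assumptions recalled from the JOHAB annex of KS X 1001 and from the glibc gconv tables; they should be checked against the standard before use.

```c
#include <stdint.h>

/* Assumed 5-bit JOHAB codes for the 21 medial vowels, indexed by the
 * Unicode vowel index (0 = A ... 20 = I). Values are an assumption. */
static const uint8_t kMedialCode[21] = {
    3, 4, 5, 6, 7,
    10, 11, 12, 13, 14, 15,
    18, 19, 20, 21, 22, 23,
    26, 27, 28, 29
};

/* Hypothetical helper: converts one precomposed Hangul syllable
 * (U+AC00..U+D7A3) to its 16-bit JOHAB code, or returns 0 if the
 * input is not a precomposed Hangul syllable. */
uint16_t hangul_syllable_to_johab(uint16_t ucs)
{
    if (ucs < 0xAC00 || ucs > 0xD7A3)
        return 0;

    /* Standard arithmetic decomposition of a Unicode Hangul syllable. */
    unsigned s = ucs - 0xAC00u;
    unsigned l = s / (21 * 28);   /* leading consonant index, 0..18 */
    unsigned v = (s / 28) % 21;   /* medial vowel index, 0..20 */
    unsigned t = s % 28;          /* final consonant index, 0 = none */

    unsigned lead   = l + 2;                      /* assumed: codes 2..20 */
    unsigned medial = kMedialCode[v];
    unsigned tail   = (t <= 16) ? t + 1 : t + 2;  /* assumed: 1 = filler, 18 unused */

    /* 1lllllmmmmmfffff: MSB set, then three 5-bit fields. */
    return (uint16_t)(0x8000u | (lead << 10) | (medial << 5) | tail);
}
```

With these assumed tables, U+AC00 maps to 0x8861 and U+D7A3 maps to 0xD3BD, which matches the commonly quoted JOHAB Hangul range.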

Comment 3

24 years ago

In case somebody wants to have a JOHAB <-> Unicode mapping table, it's available at http://pantheon.yale.edu/~jshin/faq/JOHAB.TXT.gz

It should be noted, however, that Hangul syllables can be algorithmically converted to and from Unicode. Therefore, it's not desirable to use the huge table for conversion of Hangul syllables. For ucvko, the only table necessary is Windows-949 <-> Unicode (this is huge!), and other encodings can make use of the whole or part of it.
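For the opposite direction, a table-free decode of the Hangul part of JOHAB might look like the sketch below. It is an assumption-laden illustration rather than anything taken from the attached patches: the name `johab_to_hangul_syllable` and the helper functions are hypothetical, the 5-bit code assignments are recalled from KS X 1001 annex 3 and the glibc gconv module, and codes that use a filler in the leading or medial position (isolated jamo) are simply rejected.

```c
#include <stdint.h>

/* Map a 5-bit leading-consonant code to its Unicode index, or -1.
 * Assumed assignment: codes 2..20 are the 19 leading consonants. */
static int lead_index(unsigned code)
{
    return (code >= 2 && code <= 20) ? (int)code - 2 : -1;
}

/* Map a 5-bit medial-vowel code to its Unicode index, or -1.
 * Assumed assignment: code 2 is the filler, and some codes are unused. */
static int medial_index(unsigned code)
{
    static const int8_t table[32] = {
        -1, -1, -1,  0,  1,  2,  3,  4,
        -1, -1,  5,  6,  7,  8,  9, 10,
        -1, -1, 11, 12, 13, 14, 15, 16,
        -1, -1, 17, 18, 19, 20, -1, -1
    };
    return table[code & 31];
}

/* Map a 5-bit final-consonant code to its Unicode index, or -1.
 * Assumed assignment: 1 = filler (no final), 18 unused. */
static int final_index(unsigned code)
{
    if (code >= 1 && code <= 17)
        return (int)code - 1;
    if (code >= 19 && code <= 29)
        return (int)code - 2;
    return -1;
}

/* Hypothetical helper: returns the Unicode Hangul syllable for a 16-bit
 * JOHAB code, or 0 if the code is not a complete Hangul syllable
 * (ASCII, Hanja, symbols, isolated jamo, or an unassigned pattern). */
uint16_t johab_to_hangul_syllable(uint16_t johab)
{
    if (!(johab & 0x8000u))
        return 0;

    int l = lead_index((johab >> 10) & 31u);
    int v = medial_index((johab >> 5) & 31u);
    int t = final_index(johab & 31u);
    if (l < 0 || v < 0 || t < 0)
        return 0;

    return (uint16_t)(0xAC00 + (l * 21 + v) * 28 + t);
}
```

Everything outside the Hangul syllable range (Hanja and symbols) would still need the table-based path discussed in this comment.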

Assignee

Comment 4

24 years ago

Jungshik Shin, do you have time to change Mozilla to add this encoding (from Unicode only) for us?

Comment 5

24 years ago

Frank, I would really love to work on ucvko. What time frame do you have for fixing this bug? I'm now looking at the roadmap, according to which mozilla 0.9 and mozilla 0.9.1 (or 1.0) will get frozen on April 18 and May 23, respectively. Would it be reasonable to aim at getting it added (as well as overhauling ucvko, if possible and/or necessary, to get Windows-949 supported) sometime in April? Until early April, I really have no time at all (sleeping should be given the highest priority if there's any spare time, for me to survive :-)).

Jungshik

Assignee

Comment 6

24 years ago

Jungshik Shin, never mind. I read Ken Lunde's book and hacked up a version here. Maybe you can code-review it and test it for us. Here is the patch.

Status: NEW → ASSIGNED

Assignee

Comment 7

24 years ago

Attached file: new file- mozilla/intl/uconv/ucvko/nsUnicodeToJohab.cpp (deleted)

Assignee

Comment 8

24 years ago

Attached file: new file- mozilla/intl/uconv/ucvko/nsUnicodeToJohab.h (deleted)

Assignee

Comment 9

24 years ago

Attached patch: patch gfx/src/gtk , intl/uconv/{public,src,ucvko} (deleted)

Assignee

Comment 10

24 years ago

I have not tested this code yet; I am waiting for the Sun folks to send me the font. Also, one line in Ken Lunde's code looks strange. I need to talk to him first. The line is in

Index: intl/uconv/src/uscan.c
...
+ // The following code are based on the Perl code lised under
+ // "Johab to ISO-2022-KR or EUC-KR Conversion" in page 1014 of
+ // "CJKV Information Processing" by Ken Lunde <lunde@adobe.com>
...
+ // $d8_off = ($hi == 216 and ($lo > 160 ? 94 : 42));

and in my C "translation" it is

+ PRUint16 d8_off = (hi == 216 && (lo > 160 ? 94 : 42));

Assignee

Comment 11

24 years ago

Is there any reason that we should also support conversion from Johab to Unicode? If not, then we should change the summary to

RFE: JOHAB <= Unicode converter for Korean locale

Assignee

Comment 12

24 years ago

Ken Lunde <lunde@adobe.com> replied to my email. Here is his email:

Frank,

I don't have a password, so I couldn't enter comments. My comments are attached below.

Regards...

-- Ken

The line in question:

$d8_off = ($hi == 216 and ($lo > 160 ? 94 : 42));

has three possible return values:

0 (if $hi is not equal to 216)
94 (if $hi is equal to 216, and if $lo is greater than 160)
42 (if $hi is equal to 216, and if $lo is not greater than 160)

This works in Perl, and you want to make sure that the C implementation uses the above logic.
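The difference Ken describes comes from operator semantics: Perl's `and` returns the value of its last evaluated operand, while C's `&&` always yields 0 or 1. A small sketch of the two behaviours follows. The function names `d8_off_naive` and `d8_off` are hypothetical and only mirror the expression quoted in Comment 10; the updated patch itself is not visible here, so this is not necessarily how it was fixed.

```c
/* Literal transliteration of the Perl line. In C, && evaluates to
 * 0 or 1, so the 94/42 value is lost and the result is only 0 or 1. */
int d8_off_naive(int hi, int lo)
{
    return (hi == 216 && (lo > 160 ? 94 : 42));
}

/* One way to reproduce the three Perl results (0, 94, 42) in C:
 * make the outer test an explicit conditional. */
int d8_off(int hi, int lo)
{
    return (hi == 216) ? (lo > 160 ? 94 : 42) : 0;
}
```

For example, with hi = 216 and lo = 161 the naive version returns 1, whereas the explicit conditional returns 94 as the Perl code does.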

Assignee

Comment 13

24 years ago

Attached patch: updated patch for mozilla/intl/uconv/src/uscan.c (deleted)

Assignee

Comment 14

24 years ago

Attached patch: a better patch to replace the last 2 (deleted)

Assignee

Comment 15

24 years ago

The things I changed from the 2+3 attachment and the 4th attachment:

1. Changed all C++ style comments to C style comments in ugen.c and uscan.c.
2. Added the change in charsetData.properties.
3. Put the change in uscan.c under #if 0, since we cannot test the code yet.
4. Added debugging printf code under #if 0 in ugen.c.
5. Changed lo_off into hi_off in the following two lines in ugen.c:

+ * push(@out, ((($hi+$hi_off) >> 1)+ ($hi <74 ? 200:187)- $fe_off),

and

+ out[0] = ((hi+hi_off) >> 1) + ((hi<74) ? 200 : 187 ) - fe_off;

I tested with the font Brian Yuan sent me on my Linux box. It works if you select the font in the font pref.

Comment 16

24 years ago

P1 bug for Sun; setting Priority = P1; adding keywords intl, nsbeta1

Keywords: intl, nsbeta1

Priority: -- → P1

Assignee

Comment 17

24 years ago

byuan, do we need Johab to Unicode? Why?

Comment 18

24 years ago

Frank,

What Solaris needs is to support all of the 11,172 Hangul characters that are included in Unicode 3.0, using the Solaris ksc5601.1992-3 fonts in all of the Solaris UTF-8 locales. It seems the Johab to Unicode converter will fix this problem.

Thanks.

Brian.

Assignee

Comment 19

24 years ago

moz0.9

Target Milestone: --- → mozilla0.9

Assignee

Comment 20

24 years ago

To fix this, we need the first 2 attachments and the 5th attachment. Also, we need to change the Mac ucvko.mcp file.

Reporter

Comment 21

24 years ago

Please add a comment documenting the difference between X11Johab and Johab.

r=bstell@netscape.com

Assignee

Comment 22

24 years ago

sr=erik

Checked in and fixed.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

Updated

24 years ago

Blocks: 60916

Comment 23

24 years ago

> Is there any reason that we should also support conversion from johab to
> Unicode ?

It depends. If all you want to do is support the ksc5601.1992-3 font included in Solaris, you do NOT, as you know well. However, it might be necessary if Mozilla wants to be a sort of 'universal code converter'. That is, JOHAB -> Unicode could be handy in case some Korean users have old plain text documents stored in JOHAB and want to import them into Mozilla (editor).

Comment 24

24 years ago

Changing QA Contact to ftang@netscape.com for now. Frank, can development verify this bug, or do you know of any test case which IQA can execute in order to verify this bug? Thanks.

QA Contact: andreasb → ftang

Comment 25

24 years ago

Yuying, can you verify this bug? You can find more information on how to verify it in bug 70550. Thanks.

QA Contact: ftang → ylong

Assignee

Comment 26

24 years ago

v=byuan@eng.sun.com last week

Status: RESOLVED → VERIFIED

QA Contact: ylong → byuan

Comment 27

24 years ago

After further testing, it seems the following Hangul characters still cannot be displayed correctly:

U+AF0D U+B3BF U+B607 U+BA61 U+BFC0 U+BFC1

I will attach a snapshot showing the problem with 'U+AF0D'.

Brian.

Comment 28

24 years ago

Attached image: Wrong hangul characters with the latest patches (deleted)

Comment 29

24 years ago

Attached file: The UTF-8 file for testing (deleted)

Comment 30

24 years ago

Please file a new bug for these problems if this bug is fixed and the described issues are special cases; otherwise, please reopen this bug.

Comment 31

24 years ago

In Mozilla nightly 2001050810, all Johab characters are displayed OK. The bug reported by Brian has been fixed; the characters that were displayed wrongly are now displayed OK. But under the Solaris ko.UTF-8 locale, all English characters are displayed as Hangul characters, while it is OK under the zh.UTF-8 and zh_TW.UTF-8 locales. I will attach a snapshot to show the above problem.

Comment 32

24 years ago

Attached image: wrong display of English characters under ko.UTF-8 locale (deleted)

Comment 33

24 years ago

Hi Yan, we should file a new bug. It seems that the ksc5601.1992-3 converter contains an ASCII part. Is this correct? Or should the ksc5601.1992-3 fonts have the ASCII glyphs? Or should we use the "-noascii" solution for this?

Assignee

Comment 34

24 years ago

I am stupid. I made one mistake. I opened a new bug for the ASCII problem, bug 80111. I will give you a fix right away.
