Closed Bug 162431 Opened 22 years ago Closed 11 years ago

add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
mozilla1.7alpha

People

(Reporter: ftang, Assigned: jshin1987)

References

Details

(Keywords: intl)

Attachments

(11 files, 2 obsolete files)

Here is my plan to add the support: use the upper 5 bits of aShiftTable->classID to indicate the plane of the mapping passed with it. The mapping is still divided into 16-bit to 16-bit mappings. For example, the cns11643 plane 3 to Unicode mapping will be divided into two 16-16 mappings: one maps cns11643 plane 3 to the Unicode BMP, and one maps cns11643 plane 3 to Unicode plane 2. For the second one, the 16-16 bit mapping does not include the 2 of the 0x20000 part; that information is passed in by the upper 5 bits of aShiftTable->classID. We need this to fully support euc-tw.
Also see bug 162364, which talks about updating the cns11643 plane 3-7 and 15 tables. We also generate separate tables for Extension B now (see cnsIRGTp*ExtB.txt, cns*extb.uf and cns*extb.ut).
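A minimal sketch of the plane-in-classID idea, assuming a 16-bit class ID whose upper 5 bits hold the Unicode plane. The field width, the helper names (PackClassId, PlaneOf, BaseClassOf, ToCodePoint) and the constants are hypothetical and are not taken from the uconv sources:

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (hypothetical): 16-bit class ID, plane in the upper 5 bits,
// the original class value in the lower 11 bits.
constexpr unsigned kPlaneShift = 11;
constexpr uint16_t kClassMask = (1u << kPlaneShift) - 1;  // 0x07FF

// Combine an existing class value with the target Unicode plane (0..16).
inline uint16_t PackClassId(uint16_t baseClass, unsigned plane) {
  assert(plane <= 16);
  assert(baseClass <= kClassMask);
  return static_cast<uint16_t>((plane << kPlaneShift) | baseClass);
}

inline unsigned PlaneOf(uint16_t classId) { return classId >> kPlaneShift; }

inline uint16_t BaseClassOf(uint16_t classId) { return classId & kClassMask; }

// The 16-16 table only yields the low 16 bits; the plane from the class ID
// supplies the rest (for example the "2" of 0x20000).
inline uint32_t ToCodePoint(uint16_t classId, uint16_t low16) {
  return (static_cast<uint32_t>(PlaneOf(classId)) << 16) | low16;
}
```

With this layout, a table entry that produces 0x0123 under a class ID packed with plane 2 ends up as U+20123, while the 16-16 table itself stays unchanged.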
Summary: add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper → add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper
Not complete; I have not compiled it yet.
OS: Windows XP → All
Hardware: PC → All
Summary: add surrogte support for nsIUnicodeDeocdeHelper, nsIUnicodeEncodeHelper → add surrogte support for nsIUnicodeDecodeHelper, nsIUnicodeEncodeHelper
*** Bug 162403 has been marked as a duplicate of this bug. ***
Keywords: intl
QA Contact: ruixu → ylong
Assign. rbs: why do you need it? I need it to support euc-tw cns 11643 planes 3-7. What is your need? That will help me justify the priority of this bug.
Status: NEW → ASSIGNED
Unicode 3.2 and the MathML spec include a bunch of math characters in plane 1. And, as usual, these math characters cannot be found in ordinary fonts. So in order to support these surrogate characters, a number of things are needed: - support for surrogate characters in the system (without this foundation, there is not even a starting point). - then, support for the mapping between surrogate characters and the glyphs within the special math fonts. Currently, nsIUnicodeEncode::Convert() is used to do the mapping, but this breaks if the input contains surrogate pairs. So to fix the problem, nsIUnicodeEncode::Convert() needs to be able to take an input that may contain surrogate pairs, and do a re-map according to a factory pre-built mapping table. Needless to say, the tools used to factory-build the mapping tables need to be updated to support surrogate input as well. Otherwise, it won't be possible to easily set up the internal data sets that are necessary to do the mapping.
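For reference, a sketch of what taking input that may contain surrogate pairs involves at the UTF-16 level. The helper names below are hypothetical stand-ins and are not the nsIUnicodeEncoder API; substituting U+FFFD for an unpaired surrogate is an assumption of this sketch:

```cpp
#include <cstddef>
#include <cstdint>

inline bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool IsLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Reads one code point starting at buf[pos] and advances pos past it.
// Precondition: pos < len. An unpaired surrogate yields U+FFFD.
inline uint32_t NextCodePoint(const char16_t* buf, std::size_t len,
                              std::size_t& pos) {
  char16_t u = buf[pos++];
  if (IsHighSurrogate(u) && pos < len && IsLowSurrogate(buf[pos])) {
    char16_t lo = buf[pos++];
    // Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
    return 0x10000u + ((static_cast<uint32_t>(u) - 0xD800u) << 10) +
           (static_cast<uint32_t>(lo) - 0xDC00u);
  }
  if (IsHighSurrogate(u) || IsLowSurrogate(u)) return 0xFFFD;
  return u;
}
```

The pair D835 DC9C decodes to U+1D49C (mathematical script capital A), which is the code point a plane 1 mapping table would then have to look up.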
My idea is not to change the tool. I think what we should do is keep the mapping a 16-16 mapping and supply the additional information somewhere else. Therefore, if we need to convert Unicode to a charset that encodes the BMP and one additional plane, then we need two uf files for that: one for the BMP and one for the other plane. We supply the plane information in the shift table.
Target Milestone: --- → mozilla1.2beta
Frank, what's your plan for nsICharRepresentable? Currently, it uses 2048 (= 64k / 32) PRUint32s to store the representability of BMP characters. Increasing the array size by a factor of 17 (planes 0 through 16) would be quick, but is not such a good idea. Do you want to use a technique similar to what's used in CCMap (compressed charmap) in gfx? Hmm, we can just switch to it, can't we (is FillInfo() frozen?)? The downsides of switching are that it takes longer to build a CCMap (at run time) than the 'info' array currently used, and that a more extensive patch is necessary (both in intl and in gfx, where 'nsICharRe..' is used). The former problem can be worked around by storing precompiled CCMaps as statics in the converter classes instead of building them at run time wherever necessary/desirable/possible (see bug 180266; the perl script there has to be extended to generate extended CCMaps that cover non-BMP characters). Certainly, there's a trade-off between speed and memory footprint here. Alternatively, we can extend the 'info' array (without inflating it by a factor of 17) as was done to extend CCMap to support plane 1 and beyond. By this, I mean adding 16 offsets (pointers) to the 'info' arrays for planes 1 - 16 (or 17 if we want to avoid branching completely, in which case we also need 2048 zeroed-out PRUint32s), followed by the size and as many arrays of 2048 PRUint32s as there are non-empty non-BMP planes.
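A simplified sketch of the last alternative: one 2048-word bitmap per plane, allocated only for planes that actually contain representable characters. The class name RepresentableSet and its methods are hypothetical, and it uses nullable per-plane pointers rather than the exact offset layout described above:

```cpp
#include <array>
#include <cstdint>
#include <memory>

// 2048 32-bit words = 65536 bits, one bit per code point of a plane.
using PlaneBits = std::array<uint32_t, 2048>;

class RepresentableSet {
 public:
  // Marks a code point as representable, allocating its plane on first use.
  void Set(uint32_t cp) {
    if (cp > 0x10FFFF) return;
    std::unique_ptr<PlaneBits>& plane = planes_[cp >> 16];
    if (!plane) plane = std::make_unique<PlaneBits>();  // zero-initialized
    uint32_t low = cp & 0xFFFF;
    (*plane)[low >> 5] |= 1u << (low & 31);
  }

  // Empty planes cost one null pointer instead of 8 KB of zeros.
  bool Has(uint32_t cp) const {
    if (cp > 0x10FFFF) return false;
    const std::unique_ptr<PlaneBits>& plane = planes_[cp >> 16];
    if (!plane) return false;
    uint32_t low = cp & 0xFFFF;
    return ((*plane)[low >> 5] >> (low & 31)) & 1u;
  }

  int AllocatedPlanes() const {
    int n = 0;
    for (const auto& p : planes_) n += p ? 1 : 0;
    return n;
  }

 private:
  std::array<std::unique_ptr<PlaneBits>, 17> planes_;  // planes 0..16
};
```

The null check in Has() is the branch that the "17 offsets plus a shared zeroed array" variant would avoid, at the cost of one extra 2048-word block.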
Can reach MathML SMP fonts in XML via entity reference (&Ascr;). Numeric character references and mathvariants in XML or DOM do not work.
> Can reach MathML SMP fonts in XML via entity reference (&Ascr;). > Numeric character references and mathvariants in XML or DOM do not work. It looks like this is a separate issue. What font did you use (Code2001?)? Do NCRs (numeric character references) work in html?
Attached file Sample XML invoking DOM. (deleted) —
Attached file Straight XML. (deleted) —
Attached file Straight HTML. (deleted) —
I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica 4.1 Fonts (Math1-5 and Math1-5Mono) from http://www.mozilla.org/projects/mathml/fonts/. Numeric character references above 0xffff do not work in XML or DOM or HTML; they show as question marks. In the DOM I can show that an SMP character round-trip loses 0x10000. Please see the three attachments which show attempts to display a mathematical script letter A. For me only the &Ascr; method works, and this is not available in the DOM. I am using Windows 98 and Mozilla build 2003051008.
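The thread does not establish where the 0x10000 is lost. As an illustration only, the sketch below contrasts proper UTF-16 encoding of a non-BMP code point with plain truncation to 16 bits, which is one way such a loss can arise. The names Utf16Units, EncodeUtf16 and TruncateTo16 are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

struct Utf16Units {
  char16_t units[2];
  int count;  // 1 for BMP code points, 2 for a surrogate pair
};

// Encodes one Unicode scalar value as UTF-16.
inline Utf16Units EncodeUtf16(uint32_t cp) {
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000) return {{static_cast<char16_t>(cp), 0}, 1};
  uint32_t v = cp - 0x10000;  // 20 bits, split 10 / 10
  return {{static_cast<char16_t>(0xD800 + (v >> 10)),
           static_cast<char16_t>(0xDC00 + (v & 0x3FF))},
          2};
}

// What a 16-bit-only code path effectively does to a non-BMP code point.
inline uint32_t TruncateTo16(uint32_t cp) { return cp & 0xFFFF; }
```

For U+1D49C the correct encoding is the pair D835 DC9C, whereas truncation gives 0xD49C, which is exactly 0x10000 less than the original value.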
> I used the fonts: TeX's CM Fonts (cmex10, cmmi10, cmr10, cmsy10) and Mathematica > 4.1 Fonts (Math1-5 and Math1-5Mono) from Without fixing this bug, these fonts cannot be used to render non-BMP characters (unless they're in xml and represented with entity names). If you install a font that actually covers Plane 1 (e.g. Code2001, http://home.att.net/~jameskass), you'll see that NCRs work fine in both html and xml. In html, the character entity name for 'Script A' doesn't work, which has to be filed as a separate bug. The DOM issue has to be filed as another separate bug. Why does 'ASCR' work in xml (not in html)? That's because it's mapped to a PUA code point for MathML. See http://lxr.mozilla.org/seamonkey/source/layout/mathml/content/src/mathml.dtd This file has to be modified once this bug is fixed.
FYI, bug 207919 and bug 207923 were filed for entity names for non-BMP characters in (X)HTML/XML and fromCharCode() in JS, respectively.
@netscape.com address doesn't work any more and ftang's aol address is not in bugzilla.
Assignee: ftang → jshin
Blocks: 230006
Status: ASSIGNED → NEW
what do you think of the problem mentioned in comment #17?
Summary: add surrogte support for nsIUnicodeDecodeHelper, nsIUnicodeEncodeHelper → add non-BMP Unicode (plane 1 and above. surrogate) support to charset encoder/decoder
Target Milestone: mozilla1.2beta → mozilla1.7alpha
Jungshik Shin wrote: > Additional Comment #17 From Jungshik Shin 2004-01-03 21:30 ------- ^^^ > what do you think of the problem mentioned in comment #17? ^^^ Uhm, now I am stuck in an endless loop... :) ... which comment do you mean ?
Ooops.sorry it's comment #7.
Jungshik Shin wrote: > Ooops.sorry it's comment #7. OK... suggestion: First implement a _simple_, _working_ version for release 1.7a, regardless of whether it makes the zilla bigger or not... ... and then do the fine-tuning and footprint work for the release 1.8 cycle.
Blocks: jis0213
Re comment #7: I was wrong to think that every instance of nsIUnicodeEncoder carries the |info| array (2048 PRUint32s). Callers (of nsICharRepresentable::FillInfo) have to take care of the memory alloc/dealloc, and the only caller is nsCompressedCharMap. So, what has to be done is to make some changes in the way FillInfo works (or add FillInfoEx) in intl/uconv/src/ (umap.c, nsUCSupport.cpp, nsUnicodeEncoderHelper.cpp, etc).
Status: NEW → ASSIGNED
*** Bug 320086 has been marked as a duplicate of this bug. ***
Is the PRUint32 info array really needed? If I'm not missing something important, it's almost equivalent to a ccmap, and it is only used inside intl, between nsCompressedCharMap and the nsBasicEncoder subclasses. Then we can change the behavior of nsBasicEncoder to set the ccmap directly, rather than the info arrays. In this way the change will be invisible to gfx, and the gfx code can be modified just to use CCMAP_HAS_CHAR_EXT instead of CCMAP_HAS_CHAR... To make the situation consistent we can introduce more radical changes (which can be hard ;): add HasChar and HasChars(PRUnichar* ptr, PRUint32 len) (what's really needed in gfx) to nsCompressedCharMap, and something like GetCompressedCharMap to nsICharRepresentable. Then the gfx code should simply ask for an nsCompressedCharMap instead of handling the CCMAP data block directly. In this way a more algorithmic approach for CCMAPs (like the ones in the mapping tables) will be possible.
Attached patch support higher planes (obsolete) (deleted) — Splinter Review
This experimental patch is even more ad hoc than what I suggested in comment 23. Besides applying the patch, you need to move nsCompressedCharMap.* from unicharutil/util/ to uconv/util/ and put nsMultiPlaneEncoderSupport.* there, and put the nsUnicodeToTeXMSBM files into uconv/ucvmath. The patch introduces: nsICharRepresentable - added HasChars(UInt16*, UInt32*) for (surrogate-aware) representability testing. nsBasicEncoder - stores a ccmap (constructed from its own FillInfo) to implement HasChars; some Unicode converters are changed to inherit from it. This way we can replace the old CCMAP_HAS_CHAR(MapperToCCMAP(*),-) with *->HasChars(&-,len).
Attached file multiple plane encoder (deleted) —
This nsMultiPlaneEncoderSupport introduces a parent converter that: - is initialized with an array of mapping tables and an array of shift tables. - has a child converter (possibly null) for each plane. The children are just instances of nsTableEncoderSupport that convert from the lower 16 bits of the UTF-32 value and don't know which plane they are associated with. In this way we don't have to extend the mapping tables. - redirects FillInfo to that of the plane 0 child, for backward compatibility. - does surrogate decomposition during conversion. - has an extended CCMAP. - implements HasChars, which can be moved to nsCompressedCharMap once we have declared an interface to access it from outside intl.
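The parent/child arrangement described above can be sketched roughly as follows. PlaneEncoder and MultiPlaneEncoder are hypothetical stand-ins (they are not nsTableEncoderSupport or nsMultiPlaneEncoderSupport), and a hash map stands in for the real uf mapping tables:

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// Hypothetical per-plane child: maps the low 16 bits of a code point to a
// 16-bit target code. It has no idea which plane it serves.
struct PlaneEncoder {
  std::unordered_map<uint16_t, uint16_t> table;

  bool Encode(uint16_t low16, uint16_t* out) const {
    auto it = table.find(low16);
    if (it == table.end()) return false;
    *out = it->second;
    return true;
  }
};

// Hypothetical parent: picks the child by plane and hands it the low 16 bits.
class MultiPlaneEncoder {
 public:
  // The parent does not own the children; a null child means "empty plane".
  void SetChild(unsigned plane, const PlaneEncoder* child) {
    if (plane < children_.size()) children_[plane] = child;
  }

  bool Encode(uint32_t cp, uint16_t* out) const {
    if (cp > 0x10FFFF) return false;
    const PlaneEncoder* child = children_[cp >> 16];
    return child && child->Encode(static_cast<uint16_t>(cp & 0xFFFF), out);
  }

 private:
  std::array<const PlaneEncoder*, 17> children_{};  // all null initially
};
```

Because each child only ever sees 16-bit keys, the same 16-16 table format serves every plane; only the parent deals with code points above U+FFFF.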
Attached file multiplane encoder implementation (obsolete) (deleted) —
Attached file MSBM10 (for texture) converter (deleted) —
Attached file MSBM converter impl (deleted) —
as you can see, it #includes two uf's for plane 0 and 1.
Attached file msbm (for texture) plane 0 uf (deleted) —
Attached file msbm plane 1 uf (deleted) —
The change is a build bustage fix plus gfx code to illustrate how it will work. config.mk declares a ref to ucvutil because the nsICharRepresentable->CMAP conversion has moved there. rbs, could you look into how this works? Current shortcomings are: - gfx/win is not yet complete (a surrogate char is accompanied by a "ghost" char). - On Windows, somehow I couldn't import the surrogate macros (IS_HIGH/LOW_SURROGATE) from nsCharTraits.h into nsMultiPlane...cpp. - const declarations in nsUnicodeToTeXMSBM.cpp might cause errors. Please just drop them then. Filenames for previous posts are: 25 nsMultiPlaneEncoderSupport.h 27 nsUnicodeToTeXMSBM.h 28 nsUnicodeToTeXMSBM.cpp 29 msbm10p0.uf 30 msbm10p1.uf
Attachment #207378 - Attachment is obsolete: true
Attachment #207380 - Attachment is obsolete: true
Blocks: 403564
When can this bug be fixed?
QA Contact: amyy → i18n
This is a very old bug and it's not clear to me that any of it is still relevant. Please indicate what still needs to be done, or else close the bug.
We still need this to support EUC-TW and Big5-HKSCS fully.
(In reply to Masatoshi Kimura [:emk] from comment #35) > We still need this to support EUC-TW and Big5-HKSCS fully. Thanks for the update. What remains to be done before a patch can be reviewed?
Per bug 912470 comment 25, I suggest we WONTFIX this. EUC-TW is no longer used by Firefox and we never supported a non-x- label for it. The code lingers around so that mailnews devs can decide if they want EUC-TW to be their problem. That leaves only unified Big5 from the Encoding Standard, which is different from all other encodings when it comes to astral characters. I think we should implement it from the spec in bug 912470 instead of trying to add generic astral plane support to our existing generic machinery. (Also, the encoder patch here doesn't seem ready yet anyway. At least it doesn't seem to handle the case where halves of a surrogate pair fall on different sides of a buffer boundary.)
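To illustrate the buffer-boundary case mentioned at the end of that comment: a streaming converter has to remember a trailing high surrogate between calls. The sketch below is a hypothetical standalone class, not the patch in this bug, and it assumes unpaired surrogates are replaced with U+FFFD:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class SurrogateJoiner {
 public:
  // Consumes one chunk of UTF-16 and appends complete code points to out.
  // A high surrogate at the end of the chunk is held until the next call.
  void Feed(const char16_t* buf, std::size_t len, std::vector<uint32_t>& out) {
    for (std::size_t i = 0; i < len; ++i) {
      char16_t u = buf[i];
      if (pendingHigh_ != 0) {
        char16_t hi = pendingHigh_;
        pendingHigh_ = 0;
        if (IsLow(u)) {
          out.push_back(0x10000u + ((static_cast<uint32_t>(hi) - 0xD800u) << 10) +
                        (static_cast<uint32_t>(u) - 0xDC00u));
          continue;
        }
        out.push_back(0xFFFD);  // the held high surrogate was unpaired
      }
      if (IsHigh(u)) {
        pendingHigh_ = u;  // wait for its partner, possibly in the next chunk
      } else if (IsLow(u)) {
        out.push_back(0xFFFD);  // low surrogate with no preceding high
      } else {
        out.push_back(u);
      }
    }
  }

  // Call at end of input to flush a dangling high surrogate.
  void Finish(std::vector<uint32_t>& out) {
    if (pendingHigh_ != 0) {
      out.push_back(0xFFFD);
      pendingHigh_ = 0;
    }
  }

 private:
  static bool IsHigh(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
  static bool IsLow(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }
  char16_t pendingHigh_ = 0;
};
```

Feeding {0x0041, 0xD835} and then {0xDC9C, 0x0042} yields U+0041, U+1D49C, U+0042; without the held state, both halves of the pair would be treated as unpaired.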
I have no objection as long as the MathML fonts stuff (way, way above) has been dealt with some other way, which I suspect it has.
My understanding is that we now only support and only want to support TTF/OTF math fonts that know their own Unicode mapping, but needinfoing fredw to check that we don't need font-specific encoders for math anymore.
Flags: needinfo?(fred.wang)
(In reply to Henri Sivonen (:hsivonen) from comment #39) > My understanding is that we now only support and only want to support > TTF/OTF math fonts that know their own Unicode mapping, but needinfoing > fredw to check that we don't need font-specific encoders for math anymore. I think Karl took care a long time ago to move the MathML code to use only TTF/OTF fonts. At the moment, the MathML code only accesses characters by Unicode code point (and using non-BMP characters has been possible since we support Asana fonts). The plan for bug 407059 is to add access by glyph index for stretchy characters using the MATH table, but I don't think anything in this bug will help / is necessary. cc'ing Karl.
Flags: needinfo?(fred.wang) → needinfo?(karlt)
(In reply to Henri Sivonen (:hsivonen) from comment #39) > My understanding is that we now only support and only want to support > TTF/OTF math fonts that know their own Unicode mapping, Legacy non-OTF fonts are still supported by graphics code, but I expect (but I'm not sure) that we only support non-Unicode mappings when the platform provides the translation. Regardless, we don't need to add a non-BMP encoder for fonts or math.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: needinfo?(karlt)
Resolution: --- → WONTFIX