Closed Bug 182877 Opened 22 years ago Closed 22 years ago

Xft-enabled Mozilla : Needs non-BMP (plane 1 and beyond) rendering support

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

()

Details

(Keywords: testcase)

Attachments

(5 files, 11 obsolete files)

(deleted), image/jpeg
Details
(deleted), text/html
Details
(deleted), image/jpeg
Details
(deleted), image/jpeg
Details
(deleted), patch
Details | Diff | Splinter Review
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021028 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021028 Xft-enabled Mozilla currently can't render non-BMP characters. As plane 1 and plane 2 have been being filled pretty rapidly with quite a lot of characters, the demand for non-BMP character rendering will go higher. With a very little change, Xft-enabled Mozilla can render non-BMP characters. I've already made a patch and it works pretty well except for RTL script(see bug 120114). Reproducible: Always Steps to Reproduce: 1.Go to the URL specified above 2.See how it gets rendered. 3. Actual Results: Each non-BMP character is rendered with a pair of unknown glyph chars(with 4 hex digit inside a box representing surrogate code point). That's because Mozilla's internal Unicode representation is UTF-16. Expected Results: All non-BMP characters have to be rendered if a font for them is present. If not, they have to be rendered with a single unknown glyph character (6 hex digits inside a box) instead of two unknown glyph characters. I have a patch. Xft and Freetype2 library also need a small patch.
Making it depend on bug 172768 (Xft-tracking bug). Simon, as I mentioned in the report, it's time to get in the patch you and Shanjian made for bug 120114. RTL/BIDI is now necessary for non-BMP characters as well :-) I don't know why I can't assign this to myself. Simon, I'm sorry to bother you, but can you give it to me?
Blocks: xft_tracking
Attached patch patch v1 (obsolete) (deleted) — Splinter Review
This, by itself, does not work. Xft and Freetyp2 also have to be patched. I'll post patches for them here as well as sending them to the maintainers of two libraries.
I'm gonna send this to Keith Packard. If I can make my case well, this is likely to be included in upcoming XFree86 4.3.0 release. Alternatively, he can make XftTextExtentsUtf16(), but from Mozilla's point of view, extending XftTextExtents16() to support UTF-16 is better because Mozilla's internal string is UTF-16.
This patch is necessary because Freetype2 doesn't handle well truetype fonts (e.g. Code2001) with multiple Unicode cmaps of which only one is UCS-4 (UTF-32) cmap while other two are UCS-2 cmap. Cmap search routine in FT2 (FT_Select_Charmap) returns the first Unicode cmap it find, but most of time it's not UCS-4 cmap but UCS-2 cmap although fonts like Code2001 have non-BMP characters. FcCharSetHasChar(fontconfig API) used by Mozilla fails to recognize that non-BMP characters are covered by CODE2001 because FT2 (used by fontconfig) fails the way described. With this patch applied, 'ft-cache' has to be run afresh to rebuild on-disk-cache (used by fontconfig and in turn by Xft and Mozilla) of chara. coverage map of fonts.
Etruscan is broken because it's RTL script encoded in Plane 1. Simon has a patch that I believe will solve this problem.
Attached file another test page (deleted) —
Testing this enabled me to catch a bug in my previous patch. In both attachment 107850 [details] [diff] [review] and attachment 107851 [details] [diff] [review], I used bitwise-or where I should have used arithmetic '+'.
Attached image rendering attachment 107855 (deleted) —
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: patch, testcase
Attached patch patch v2 (obsolete) (deleted) — Splinter Review
fixed a bug in surr. pair to UCS4 macro and removed a mix-up from my patch to another bug.
Attachment #107850 - Attachment is obsolete: true
Attached patch Xft patch (revised) (obsolete) (deleted) — Splinter Review
Attachment #107851 - Attachment is obsolete: true
I should have taken a look at the patch for bug 120114 to avoid saying something stupid.... Bug 120114 is about solving the problem in Windows.. Now, I have to figure out what Resolve(For|Back)ward() are for... nsFontMetricsXft didn't implement them, but it seems like they're used for BIDI. Simon, could you tell me a little about the role of those two methods? Perhaps, I have to rework my patch from the scratch or ...
The patch for bug 122800 solved the problem with RTL text as shown by the screenshot. There still is a mystery for which I don't have an explanation at the moment. Somehow, some characters are rendered as hollow boxes at 100%/120%. When I change the zooming factor to a higher value, they get rendered correctly. Some metric-related problem?...
Is this change about hacking nsAString to hold something other than UCS-2 like some Windows code does? If that is the case, we need to seriously reconsider our string strategy in general instead of hacking around it all over...
From the very early days of Mozilla development, the internal representation used has been not UCS-2 but UTF-16. Therefore, the content of nsAString has always been UTF-16 (!= UCS-2). I'm not making up anything here :-) Circa 1998, I argued that it'd be easier to use UTF-32, but perhaps due to concerns about memory consumption and other reasons, Mozilla team decided to use UTF-16 ( != UCS-2). Therefore, handling of surrogate pairs have to be implemented wherever necessary. This is not the only place but there are some other places where surrogate pairs are dealt with. A simple search with 'IS_HIGH_SURROGATE' or similar yields (maybe a dozen or so) hits. Anyway, UTF-16 is not so bad as you may think (although I prefer to work in UTF-32 ). Firstly, UTF-16 is much easier to deal with than old multibyte encodings (CJK encodings) . In some CJK encodings, in the worst case, you have to back off all the way to the beginning of a string to figure out the 'character boundary' correctly. In UTF-16, it's much simpler than that. Secondly, even if we use UTF-32, we have to deal with combining characters and many other subtleties of Unicode so that one codepoint for one 'grapheme' model doesn't hold even in that case. (for instance, see DUTR #29 at Unicode web site). Thirdly, Win32 and Java(and to a lesser extent, Javascript) have demonstrated that UTF-16 could be used without much disruption and much difficulty. Win32 'W' APIs use UTF-16 extensively and applications built on top of those APIs handle the full repertoire of Unicode (upto U+10FFFF) very well. Java's 'string' is also a string of UTF-16 (NOT UCS-2). And, Mozilla itself is another good example that UTF-16 can be used rather easily. If you're not yet convinced that UTF-16 works rather well (hope you're :-)) , you can read a lot more at Unicode mailing list archive at http://www.unicode.org. At this point, IMHO, switching over to UTF-32 is too risky and costs too much with little gain. (again, I'm not a big fan of UTF-16 and I wish UTF-32 had been adopted at the beginning, but it's not that bad so I can live with it.) Moreover, the distinction between UTF-16 and UCS-2 has to be made only in limited areas like this. If you read some of old news articles in mozilla.* hierarchy, you'll find that Erik (not at netscape any more) definitely stated that Mozilla string is UTF-16 (not UCS-2). He also made it clear to those involved in 'String' class development that they wouldn't have to concern themselves with this issue, to which they agreed.
> not UCS-2 but UTF-16 In that case, please explain the following: 1) Why there is no way to create a UTF-16 string (we have ways to convert ASCII or UTF-8 to UCS-2, but not to UTF-16). 2) Why nsIUnicodeDecoder/Encoder seem to work with UCS-2 (if they do not, the "number of Unicode characters written/read" params on Convert() are sort of useless). 3) Why the string documentation keeps referring to UCS-2 (eg see http://www.mozilla.org/projects/xpcom/string-guide.html) The basic issue is that right now almost everyone working on Mozilla is firmly in the "nsAString is UCS-2 with one PRUnichar per one character except in some weird Windows-only text-measurement code or something". If this viewpoint is incorrect, perhaps an effort should be made to educate people. Properly commenting the string classes and unicode decoder/encoder interfaces would be a good start. I'm not saying we should switch to UTF-32 or anything like that. I'm just saying that we can't have half the app assuming that nsAString is UTF-16 and the other half assuming it's UCS-2. We have to choose one or the other and use it. Otherwise, what's the point of having nsAString?
This is really a dupe of bug 127713, but since a lot of work has happened here already, for the time being I am mutually cc-ing the respective owners of the two bugs so you can work together and decide which one to close.
Assignee: smontagu → jshin
Simon, Thanks for giving it to me. BTW, as Brian noted in bug 127713, Xft-Mozilla and FT2-Mozilla have separate codepaths so that they're not dupe of each other.
Boris, I've just filed bug 183048 for UTF-16 vs UCS-2 documentation change. I think you raised a very important issue here. BTW, despite unfortunate choices of names like NS_ConvertUTF8toUCS2, I believe they work correctly for non-BMP characters. Therefore, the issue is mainly documentation and 'education'. Anyway, I'll try to check if they indeed do.
That bug does not address item #2 in comment 14. If NS_Convert*toUCS2 actually convert to UTF-16, we should rename them to make that clear. If developers cannot depend on the "one PRUnichar == one rendered glyph" mapping, I bet there is some code that needs rethinking.
Actually, I mentioned (though rather passingly) about the need to rename related functions/classes/methods in bug 183048. Do you think we need a separate bug to replace UCS2 in function/class/method names with UTF16? Perhaps, we do. That may be a huge replace/search operation that canbe gradually done with macro definitions. BTW, I believe all *UCS2* functions/classes are written to be 'UTF-16 clean'(see for instance |ConvertUTF8toUCS2| in nsString2.h (http://lxr.mozilla.org/seamonkey/source/string/obsolete/nsString2.h#587) or |ConvertUCS2toUTF8| in nsString.cpp. As for the lack of 1-to-1 correspondence between 'codepoints' and 'glyphs'/graphemes/whatever, no matter what representation we use for Unicode string, that NEVER holds true for even seemingly simple Latin/Greek/Cyrillic letters. Virtually all Unicode characters are 'complex' and have to be treated as such. Surrogate-pari handling can be regarded as a special case of 'combining character sequences'. I believe that that has been well noted by Mozilla developers. Of course, this doesn't mean that everything is perfect as of now (for instance, Latin/Greek/Cyrillic diacritic marks are not well supported at the moment) and that there's no need for rethinking. Let's move this discussion to bug 183048 or better(?) still, i18n newsgroup.
Attached patch a third patch (obsolete) (deleted) — Splinter Review
I haven't _yet_ managed to convince Keith that extending UCS2 API to support UTF-16 is harmless. Instead, he added XftTextExtentsUtf16()(I guess it'll go into XFree86 CVS soon). This patch calls that API. As for Freetype2 patch, a patch that does what I intended to do has been committed to the CVS of FT2. Because this patch calls XftTextExtentsUtf16 API (not yet available in any released version of Xft), it can't go in now. Nonetheless, Simon, can you review it? On my machine, it works well.
Attachment #107857 - Attachment is obsolete: true
Comment on attachment 108478 [details] [diff] [review] a third patch Why not included unicharutil.h and use the macros from there instead of defining your own IS_HIGH_SURROGATE IS_LOW_SURROGATE and SURROGATE_TO_UCS4? Why do you have several times the pattern + if (! IS_LOW_SURROGATE(c)) { + if (IS_HIGH_SURROGATE(c)) { Unless I am missing something, the first |if| is implied by the second. I would rather use a macro with an informative name like IS_NON_BMP, or a boolean, instead of testing |c >> 16| all the time. Apart from that, patch looks OK.
Attached patch patch v4 (obsolete) (deleted) — Splinter Review
Simon, thanks for reviewing it. This patch addressed your concern. As for two if's, the first 'if' is not implied by the second if because the set of codepoints satisfying the first condition contains not only high surrogates but also regular codepoints. See comments added in this patch. Can you reivew it again?
Attachment #108478 - Attachment is obsolete: true
Comment on attachment 108551 [details] [diff] [review] patch v4 r=smontagu, sorry for my confusion about the conditions before.
Attachment #108551 - Flags: review+
Comment on attachment 108551 [details] [diff] [review] patch v4 + goto FoundFont; // for speed -- avoid "if" statement you could simply break; here...
Attached patch patch v5 (obsolete) (deleted) — Splinter Review
Thank you Simon for r. Thanks, Christian. I should have done that way. It's clearer. Then, it wouldn't have confused Simon :-) Boris or David, can I get sr?
Attachment #108551 - Attachment is obsolete: true
Um, isn't it the |goto NotFound;|s that you should change to |break;| rather than the |goto FoundFont;|s?
urg, I think I might have misread the patch and read "FoundFont" as "NotFound". damn I hate gotos. In that light, I think it might be better to ignore what I said.
I don't like goto, either ;-) Anyway, patch v4 and patch v5 do exactly the same thing. When unpaired LOW SURR. is found, it goes to 'NotFound'. In patch v4, it's implicite while in patch v5 it's explicite and less confusing to human readers. Note that both NotFound and FoundFont blocks are _inside_ the (outer) for-loop(walking thru |aString[ ]|). Whether a font for char. is found or not and whether illegal surrogate is come across or not, the loop has to keep on going. Yeah, reading only the patch, I might have been confused, too.
I realized that using XftTextExtentsUtf16 (that was committted to the XFree86 CVS a couple of days ago) can be a 'deployment headache' because it'll take a while for a new Xft lib. with XftTextExtentsUtf16 to get widely deployed. Following Keith's suggestion, I added a run-time check for the presence of XftTextExtentsUtf16 and made Mozilla use the internally defined version if it's not available. I tested under four combinations (compiled with old Xft/new Xft and run with old Xft/new Xft) and in all four cases, it worked fine.
Attachment #107858 - Attachment is obsolete: true
Attachment #108582 - Attachment is obsolete: true
Comment on attachment 109358 [details] [diff] [review] a new patch with the run-time check for XftTextExtentsUtf16() Simon, can you review again? This is basically the same patch as the one you reivewed. The only significant difference is the addition of run-time check for XftTextExtentsUtf16().
Attachment #109358 - Flags: review?(smontagu)
Instead of the following line in attachment 109358 [details] [diff] [review], + gXftTextExtentsUtf16 = PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib); I now have the following: // ISO C++ forbids casting between pointer-to-object(void *) and // pointer-to-function. g++ allows it nonetheless, but some compilers // may not. To be on the safe side, we need to do this double // casting. It depends on ptrdiff_t type being of the same size // as pointer, which I believe is the case on all platforms. // I found this trick with google at // http://www.geocrawler.com/archives/3/1452/2001/6/0/5924216/ gXftTextExtentsUtf16 = NS_REINTERPRET_CAST(void (*)(Display*, XftFont*, const FcChar8*, FcEndian, int, XGlyphInfo*), NS_REINTERPRET_CAST(ptrdiff_t, PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib)));
Comment on attachment 109358 [details] [diff] [review] a new patch with the run-time check for XftTextExtentsUtf16() On Win16 (Windows 3.1), ptrdiff_t was 16 bits, signed. I think this was just broken, never mind any ISO standards -- you can't express a pointer difference within a 64K segment with 15 bits. Anyway, I'm not a language laywer, but I wonder if PRWord wouldn't be a better type than ptrdiff_t here? /be
Brendan, Your comment prompted to do some more digging and I found two threads in comp.lang.c++.* about casting between pointer to object and pointer to function. ( 'Casting void* to function pointer' in 1999 and 'g++ 3.1: ISO C++ forbids casting between pointer-to-function and pointer-to-object' this year). I was naive (as you pointed out by an example of Win16) to assume what I assumed in my prev. comment. Anyway, opinions varied as to the best way to work around that they could be of different sizes and that dlsym() returns |void *| for both function and object. Using a union as a generic pointer appears to be safest, but I feel like that's a bit too overkill for this. Someone cited POSIX document (http://www.opengroup.org/austin/docs/austin_110r2.txt) that stipulates that an object of type void * be able to hold a pointer to a function on a system implementing POSIX although it's not required by C/C++ standard. It also suggested the following to 'silence' a compiler complian to the provision of ISO C++/C that casting between ptr-to-obj and ptr-to-fcnt is forbidden. *(void **)(&fptr) = dlsym(handle, "my_function"); So, I tried the above and the following and both worked fine on my ix86 Linux box, which, of course, says little about the portability with which we're concerned. *NS_REINTERPRET_CAST(void **, &gXftTextExtentsUtf16) = PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib); This looks at least less ugly than double casting... What would you say?
No, PRUptrdiff need not be big enough to hold a pointer. PRWord is good enough; I don't mind the *(void **) hack, it's equivalent but IMHO uglier. A PRWord is the size of a void*, and we (following POSIX) stipulate that function pointers fit in void pointers. /be
My NSPR-fu is old school (NSPR1, back in 1995, something kipp and I wrote). PRWord and PRUword are deprecated by comments in prtypes.h, and PRPtrdiff and PRUptrdiff have comments claiming they're big enough for pointers, as well as for pointer differences. /be
I thought I added my comment this morning, but I forgot to commit. Anyway, as two persons have converged to double casting with PRUptrdiff as an intermediary and prtypes.h claims that it's big enough for pointers, I'll go with it as below: gXftTextExtentsUtf16 = NS_REINTERPRET_CAST(void (*)(Display*, XftFont*, const FcChar8*, FcEndian, int, XGlyphInfo*), NS_REINTERPRET_CAST(PRUptrdiff, PR_FindSymbolAndLibrary("XftTextExtentsUtf16", &lib))); Simon, can you review attachment 109358 [details] [diff] [review] with that in mind?
Comment on attachment 109358 [details] [diff] [review] a new patch with the run-time check for XftTextExtentsUtf16() r=smontagu, with the caveat that you need specific sr approval for xft zone (i.e. blizzard) and for pointer fu (but comment 36 seems to imply that already).
Attachment #109358 - Flags: review?(smontagu) → review+
Comment on attachment 109358 [details] [diff] [review] a new patch with the run-time check for XftTextExtentsUtf16() Thank you Simon for r and Brendan and Christopher for help with casting. Asking Blizzard for sr.
Attachment #109358 - Flags: superreview?(blizzard)
Chris, Can we move this forward? It's been two months.. With this, we can boast that Mozilla-Xft is sorta on par with Mozill-Win in terms of non-BMP support.
I know, I've just been swamped. I've been catching up on older reviews than this (if you can believe that.)
Just to make it a bit easier for Chris when he finally catches up with old reviews, I made a new patch against the current CVS with the casting issue fixed as we discussed. BTW, bug 127713 (non-BMP support for Mozilla with FT2) has been fixed recently.
Attachment #109358 - Attachment is obsolete: true
Attached patch an updated patch(against the cvs) (obsolete) (deleted) — Splinter Review
Sorry I diddn't realize nsFontMetricsXft.cpp had changed yesterday. This is a patch against the current cvs.
Attachment #115714 - Attachment is obsolete: true
Comment on attachment 115741 [details] [diff] [review] an updated patch(against the cvs) assuming Simon's r is carried over and asking for sr.
Attachment #115741 - Flags: superreview?(blizzard)
Attached patch patch (deleted) — Splinter Review
This is the patch that I checked in. It's a little different than the last patch, mostly in style. I cleaned up a lot of the braces and indenting, fixed a couple extra else/return places and some other random things. No big functionality changes, as tempting as they were. Thanks for the patch, and sorry it took so long to get into the tree.
Attachment #107852 - Attachment is obsolete: true
Attachment #115741 - Attachment is obsolete: true
Fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment on attachment 116090 [details] [diff] [review] patch >+void >+MozXftTextExtentsUtf16 (Display *aDisplay, XftFont *aFont, >+ _Xconst FcChar8 *aString, FcEndian aEndian, >+ int aLen, XGlyphInfo *aExtents) >+{ >+ FT_UInt *glyphs, glyphs_local[NUM_LOCAL_GLYPHS]; >+ int i; >+ int nglyphs = 0; >+ >+ // This implementation is not for generic use, but is only to be >+ // used where we know for sure that 'string' can be cast to 16bit >+ // shorts. >+ _Xconst FcChar16 *wstr = (_Xconst FcChar16 *) aString; >+ >+ aLen = aLen / 2; Hey, you forgot what I said over IRC about needing to make aLen unsigned, or else sticking with >> (or both). Using a signed int a / will result in slight code and average-executed-instruction bloat due to the need to check for a negative left operand before strength-reducing to >> 1 (as the original patch did by hand). /be
> Hey, you forgot what I said over IRC about needing to make aLen unsigned, or > else sticking with >> (or both). Using a signed int a / will result in slight > code and average-executed-instruction bloat due to the need to check for a > negative left operand before strength-reducing to >> 1 (as the original patch > did by hand). > I couldn't change the definition of the function since it has to be the same as the one included in the Xft header file, and that function uses a regular int. I could have just used an unsigned int in the function and assigned to it from the argument, but I figured that would have just added more code bloat than using division instead of the shift.
Assigning an int to an unsigned takes zero instructions. But you might just leave the int and use >> as the original patch did. I don't think it's a good idea to use say aLen /= 2 (I forgot to pick on the aLen = aLen / 2 failure to use /=, but that's not a big deal compared to what the compiler produces :-) with a signed int type. /be
Attachment #115741 - Flags: superreview?(blizzard)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: