Closed Bug 122436 Opened 23 years ago Closed 22 years ago

Unicode (UTF-8) pages use Western font preference

Categories

(Core :: Internationalization, defect)

Hardware: All
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking


Status: VERIFIED DUPLICATE of bug 91190
Target Milestone: Future

People

(Reporter: liblit, Assigned: shanjian)


Details

(Keywords: intl)

Attachments

(2 files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.7) Gecko/20020104
BuildID: 20020104

The URL given above contains XHTML encoded using UTF-8. However, Mozilla renders it using my "Western" serif font rather than the "Unicode" font. This leads to incorrect display of Unicode characters that are not represented in the (typically ISO-8859-*) Western font. For example, curly quotes and apostrophes instead appear as straight vertical marks. The easiest way to see that something is going wrong is to select wildly different Western and Unicode fonts, then visit a Unicode page. You will see that the Western font is used.

Reproducible: Always

Steps to Reproduce:
1. Select a "Western" font that will be easily recognized, as follows:
   1.1. Bring up the Preferences dialog.
   1.2. Select Appearance -> Fonts.
   1.3. In the "Fonts for:" menu, select "Western".
   1.4. In the "Proportional:" menu, select "Serif".
   1.5. In the "Serif:" menu, select something distinctive, such as "urw-zaph chancery-iso8859-1".
   1.6. Verify that the "Unicode" serif font is set to something reasonable and different from the "Western" font just selected.
2. Visit <http://www.cs.berkeley.edu/~liblit/>.
3. Under View -> Character Coding, verify that Mozilla has correctly selected Unicode (UTF-8).
4. Observe that the distinctive "Western" font is being used, in spite of the fact that the page is UTF-8 encoded and therefore should be using the "Unicode" font.

Actual Results: The page is rendered using the distinctive "Western" font.

Expected Results: The page should have been rendered using the selected "Unicode" font.

View -> Character Coding shows "Unicode (UTF-8)" selected, so Mozilla does know how the page is encoded. It's just not picking the right set of fonts for that encoding.

I suggested picking visually distinctive "Western" and "Unicode" fonts to make the problem easier to see. The issue shows up in normal usage as well, though it's more subtle. I normally have "adobe-times-iso8859-1" selected as my Western serif font, and "adobe-times-iso10646-1" selected as my Unicode serif font. There are a couple of Unicode characters at the URL given above: for example, the apostrophe in "Ben's" should be a curly right apostrophe (&#8217;), but it is rendered as a vertical apostrophe (') because the wrong font is being used. There are a couple more apostrophes, as well as some double quotes (&#8220;...&#8221;), further down on the page that have the same problem.

Mozilla's font selectors prevent one from choosing an iso10646-1 font for the Western encoding. Galeon, which uses Mozilla's rendering engine, applies no such restriction. If I tell Galeon to use an iso10646-1 font for Western encodings, the Mozilla engine happily goes ahead and uses it when I visit a Unicode page, and Unicode characters appear as they should (e.g., curly quotes really are curly). So at some level the rendering engine *does* know how to use Unicode fonts, if those are the fonts it is asked to use.
To intl.
Assignee: attinasi → yokoyama
Status: UNCONFIRMED → NEW
Component: Layout → Internationalization
Ever confirmed: true
QA Contact: petersen → ruixu
Keywords: intl
QA Contact: ruixu → ylong
Shanjian, I believe we can use NS_FONT_DEBUG to find out what lang group the font code thinks the document is in and where in the font search path the font is found. Could you explain to them how to do this? Once we know what is happening we can try to determine what can/should be done. Thanks.
For UTF-8 and other Unicode encodings, we currently use the user's locale charset to figure out the language. This is because we lack a mechanism to specify/recognize the language in XUL (and probably other XML files). Until that is fixed, we cannot do much about this bug.
Target Milestone: --- → Future
This happens on Windows too. Changed the platform to ALL. On my Simplified Chinese Windows XP, UTF-8 page display is using Simplified Chinese fonts.
Hardware: PC → All
give to shanjian
Assignee: yokoyama → shanjian
I'm not sure I understand what the exact problem is. I have multiple languages inside a UTF-8 page, and Mozilla auto-senses the region and displays the appropriate fonts for the appropriate language. Mozilla seems to use multiple fonts in one page. URL: http://www.realmspace.com/unicode/ut/h/utf8.html
Joaquin Menchaca has one example of a multilingual page that works, but one working example doesn't mean the code is correct in general. In my original report I gave quite exhaustive instructions on how to reproduce the problem. For Unicode characters not present in non-Unicode fonts, Mozilla is clearly and unambiguously doing the wrong thing. Joaquin, please surf over to <http://www.cs.berkeley.edu/~liblit/>, and look at the first word in the title: "Ben's". Do you see a vertical apostrophe, or do you see a curved single right quote? If you see a vertical apostrophe, then Mozilla is doing the wrong thing.
I see 2 issues here:

1) Having Unicode in the list of font language groups in the font prefs seems inappropriate. The rest of the entries are language groups (excluding "User Defined", which is there in the hope that it will allow people to trick the browser into working for unsupported languages). I suspect that the Unicode entry is a leftover from NS 4.x days when the code did not support Unicode.

2) The font system tries to avoid iso10646 fonts because it is so expensive to determine which chars they support, and we do not have any good way to tell which language group they are appropriate for. It would be great if we could tell what chars are in iso10646 fonts, but we cannot without doing an XLoadQueryFont (or XQueryFont), which is very expensive. When I added the TrueType support I ended up writing 4000+ lines of code to address this issue of getting the list of supported chars in a font. I was able to cache the info because I had access to the TrueType font file timestamps and could tell if the files had changed. Unfortunately, the X font API provides no way to tell if the fonts have changed, so if we were to cache which chars an X iso10646 font had, we would never be able to tell if the cache was stale. If the info is stale we would get complaints that we did the wrong thing. Until we have a reasonable way to get the list of chars in iso10646 fonts, we either have to choose to be very inefficient when searching for glyphs (all languages) or to have less than perfect Unicode support.
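To make that cost concrete, here is a minimal standalone C sketch (not Mozilla's actual code; the function name and structure are illustrative) of what it takes to find out whether an iso10646-1 X core font covers a given BMP character. The only way to get this information through the X core font API is XLoadQueryFont(), which transfers the per-character metrics for the entire font:

  #include <X11/Xlib.h>

  int font_has_char(Display *dpy, const char *xlfd, unsigned int ucs2)
  {
      /* the expensive round trip: pulls metrics for every char in the font */
      XFontStruct *fs = XLoadQueryFont(dpy, xlfd);
      int present = 0;

      if (!fs)
          return 0;

      /* simplification: fonts with a NULL per_char array claim uniform
       * metrics for every char in their declared range */
      if (fs->per_char) {
          unsigned int byte1 = (ucs2 >> 8) & 0xff;
          unsigned int byte2 = ucs2 & 0xff;
          if (byte1 >= fs->min_byte1 && byte1 <= fs->max_byte1 &&
              byte2 >= fs->min_char_or_byte2 && byte2 <= fs->max_char_or_byte2) {
              unsigned int cols = fs->max_char_or_byte2 - fs->min_char_or_byte2 + 1;
              XCharStruct *cs = &fs->per_char[(byte1 - fs->min_byte1) * cols +
                                              (byte2 - fs->min_char_or_byte2)];
              /* an all-zero metrics entry is conventionally an absent glyph */
              present = (cs->width || cs->ascent || cs->descent ||
                         cs->lbearing || cs->rbearing);
          }
      }
      XFreeFont(dpy, fs);
      return present;
  }

For a font that covers a large part of the BMP, that is tens of thousands of XCharStruct entries transferred per font, which is why doing this for every candidate font during normal glyph search is considered too slow.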
Let me restate the problem to make it clearer. When we choose fonts, we use the language group information to guide our font search. This info can be provided through the HTML attribute "lang". When it is missing from the document, in most cases we are able to figure it out from the document's encoding, i.e., the charset. For Unicode encodings this approach does not work, and we can only mark the document as Unicode. Since all of Mozilla's XUL files (the UI implementation) are in a Unicode encoding, and we don't want to use a Unicode font in that situation, we put in a hack: the current locale's language replaces Unicode. If you run Mozilla in a Western locale, the Western language group will be used, and thus a Western font will be selected. I do plan to fix this, but my effort is blocked by XML's inability to handle "lang". Keeping XUL files working well is a priority. Anyway, a bug has been filed and I am waiting for it.

For characters like &#8217; &#8220; &#8221;, their glyphs cannot be found in the Western font. If we choose to use an Asian font, the glyph will be too wide. A Unicode font is too expensive (as bstell suggested in his last comment) and we always try to avoid it. So the current approach is: if we can't find them in the Western font, we transliterate them and use a substitute glyph found in the Western font. I have been thinking about whether we should try a 10646 font or not. There are some other bugs filed against those problems, so they should not be the concern of this one.

To Brian: the current Unicode language group is really misleading and practically does not work at all. It might be a good idea to eliminate it for now, but for the future, I guess it might be useful in certain situations. I have no strong opinion about this issue.
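As a rough illustration of the selection order just described, here is a hypothetical sketch; the function name, the string-based language groups, and the tiny charset table are illustrative only, not Mozilla's internal API:

  #include <string.h>

  static const char *pick_lang_group(const char *html_lang,    /* lang attribute, may be NULL */
                                     const char *charset,      /* document encoding */
                                     const char *locale_group) /* group of the user's locale */
  {
      /* 1. An explicit lang attribute decides the language group. */
      if (html_lang && *html_lang)
          return html_lang;

      /* 2. A non-Unicode charset implies a group. */
      if (charset && strcmp(charset, "ISO-8859-1") == 0)
          return "x-western";
      if (charset && strcmp(charset, "GB2312") == 0)
          return "zh-CN";
      /* ... further charset -> group mappings ... */

      /* 3. Unicode encodings (UTF-8, UTF-16) carry no language information,
       * so the current hack falls back to the locale's group -- which is why
       * a UTF-8 page viewed in a Western locale gets the Western fonts. */
      return locale_group;
  }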
Ben: in the css could you try adding "adobe-times-iso10646-1" to the font list?
Per Brian's request, I tried adding the following CSS rule:

  html { font-family: adobe-times-iso10646-1 }

With this change, the various curly Unicode quotes do appear as intended. I'm not sure if Brian was trying to debug things or was suggesting a workaround. I wouldn't really consider this a viable workaround, because it has an additional unwanted side effect (selecting a Times font regardless of the user's defaults). If there were a way to specify the "iso10646-1" part without the "adobe-times" part, that might be a reasonable workaround.
Is this a duplicate of bug #91190? A blocker of it? Dependent upon it? I think both reports are basically talking about the same issue.
If I understand things correctly, when a character is not mapped to a specific non-Unicode language group, Mozilla falls back on the language group associated with the current locale. If the character is not actually defined in that locale's fonts, then Mozilla performs a reasonable best-effort substitution. What about adding one more stage to this logic? Before doing the best-effort substitution, check to see if that character is defined in the iso10646 font. If it is, then use it. If it's missing from there too, then fall back on the best-effort substitution. That should fix the sort of problems I'm seeing without changing the behavior of anything that was already working correctly. Can this be done in a way which is efficient relative to Brian Stell's concerns about XQueryFont() inefficiency and such?
this looks like a dup of bug 91190
> Before doing the best-effort substitution, check to see if that character is
> defined in the iso10646 font. ... Can this be done in a way which is
> efficient relative to Brian Stell's concerns about XQueryFont() inefficiency

The problem *is* that checking whether an iso10646 font has the char is very expensive. That's why we only do it when we are desperate (such as when transliteration fails). This is a problem with trying to use X's XLFD for iso10646 (Unicode) fonts. All other encodings (mostly) fill in all possible chars; Unicode does not. Thus we are stuck needing to get the list of chars via XLoadQueryFont (or XQueryFont). For a long time now we have talked about caching the data, but without a way to check if the cached data is stale this is not safe to do. For the TrueType fonts I was able to check for stale data because I have access to the font file timestamps (if the timestamp is not the same as when the data was generated, then the data is stale).
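For reference, the TrueType staleness test described here amounts to something like the following sketch (a plain mtime comparison; the function name is illustrative, and the point of the comment is that no equivalent timestamp exists for fonts served only through the X core protocol):

  #include <sys/stat.h>
  #include <time.h>

  static int cache_is_stale(const char *font_file, time_t cached_mtime)
  {
      struct stat st;
      if (stat(font_file, &st) != 0)
          return 1;                       /* cannot check -> treat as stale */
      return st.st_mtime != cached_mtime; /* changed since the cache was built */
  }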
bstell wrote:
> This is a problem with trying to use X's XLFD for iso10646 (Unicode) fonts.
> All other encodings (mostly) fill in all possible chars. Unicode does not.
> Thus we are stuck needing to get the list of chars via XLoadQueryFont (or
> XQueryFont).

Actually, the XLFD standard _allows_ peeking at whether a char is available in the font or not. For example:
'-misc-fixed-medium-r-normal--0-0-0-0-c-0-iso8859-1[65 70 80_92]'
tells the font source (Xserver or xfs) that the client is interested only in characters 65, 70, and 80-92. The question is whether major vendors like XFree86 implement that correctly...
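A minimal sketch of how a client might use that subset syntax (the font name and helper are illustrative; whether a given server honors the subset is exactly the open question above):

  #include <stdio.h>
  #include <X11/Xlib.h>

  /* Ask only about one codepoint by appending "[codepoint]" to the XLFD,
   * so the server need not report metrics for the whole font. */
  static XFontStruct *query_one_char(Display *dpy, unsigned int ucs2)
  {
      char name[256];
      snprintf(name, sizeof(name),
               "-misc-fixed-medium-r-normal--0-0-0-0-c-0-iso10646-1[%u]", ucs2);
      return XLoadQueryFont(dpy, name);   /* still one round trip per font */
  }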
Peeking like this implies a round trip to the X server per font, which is also expensive. Perhaps for local X servers we could detect that the font info cache is stale by checking the X font path and the files on that path. If the path or the files on the path change, we could update the cached font info. I have very limited time and I am working on TrueType printing. If someone would care to volunteer to work on caching the X font info, I think I can guide them. I'd guess that it would take only about a week to get working code and another 2-3 weeks to bring it up to production grade.
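A rough sketch of the proposed check, assuming a local server whose font path entries are ordinary directories with meaningful mtimes (both assumptions; remote font servers and ":unscaled" suffixes would need extra handling):

  #include <sys/stat.h>
  #include <X11/Xlib.h>

  static int font_path_changed(Display *dpy, time_t cache_time)
  {
      int npaths = 0, i, changed = 0;
      char **paths = XGetFontPath(dpy, &npaths);
      struct stat st;

      if (!paths)
          return 1;                       /* cannot enumerate -> assume changed */

      for (i = 0; i < npaths; i++) {
          if (stat(paths[i], &st) != 0 || st.st_mtime > cache_time) {
              changed = 1;                /* missing or modified -> regenerate cache */
              break;
          }
      }
      XFreeFontPath(paths);
      return changed;
  }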
accept.
Status: NEW → ASSIGNED
If the only problematic issue here is when to invalidate the cache, why not invalidate at Mozilla exit? I.e., cache for the lifetime of the process. Fonts don't change all *that* often, so it seems reasonable to require a quit/restart cycle to pick up changes. Or flush the cache whenever font prefs change. Anything more sophisticated, such as monitoring the font search path, is bonus work that shouldn't prevent us from getting something simple up and running that will do the job for most people in most common usage scenarios.
> If the only problematic issue here is when to invalidate the cache, why not > invalidate at Mozilla exit? Generating the data is extremely expensive (in the multiple minute range). Thus we cannot regenerate it every startup (unless we want a multiple minute delay on startup). Because of the huge time cost to be useful we would need to generate it once only and then only check if the data needs to be updated (as I do for the TrueType fonts).
Egad. I knew it was bad, but I didn't know it was *that* bad. Thanks for the info.
I just went back and revisited the cited URL (<http://www.cs.berkeley.edu/~liblit/>) using Mozilla 1.0, and the curvy quotes show up correctly. The Western font preference is still used for the majority of text on the page, but a proper Unicode font is being used for the Unicode-only characters (quotes, in this case). Is this bug now fixed? Or has it merely changed in some curious way?
That is probably because of freetype support.
No, I don't think this is because of freetype support: I'm using the prebuilt Red Hat RPMs, which supposedly do not include freetype support. Perhaps the addition of conditional freetype support affected font handling elsewhere, though, causing this change even without freetype support in my binary.
Actually, the mozilla.org build (non-Red Hat) has direct FreeType2 (TrueType) support, and I believe the Red Hat RPMs have FreeType2 via Xft (there was/is a long discussion on whether Xft was/is ready for Mozilla), so you might have TrueType working. You could use 'xmag' to capture/enlarge the pixels and see if they have "grey" pixels on the edges (while the direct FreeType2 code does use the TrueType embedded bitmaps if available, I believe that the Xft version cannot).
I'm using Ximian's RPMs installed via Red Carpet. "xmag" shows no grey-edged antialiasing. "lsof" reports that "mozilla-bin" has neither the Xft nor the FreeType2 libraries open. {shrug}
I wanted to attach two files to the same comment, but I guess that is not allowed. This bug is still present in Mozilla 1.0.1 (the browser used to submit this) and Mozilla 1.2.1. Personally, this bug drives me up the wall, especially since support is so close to working. While this attachment does show a working example, asking everyone on the planet with UTF-8 HTML to add a font selection in a style sheet doesn't seem like a likely workaround.
*** This bug has been marked as a duplicate of 91190 ***
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
Mark as verified.
Status: RESOLVED → VERIFIED