Closed Bug 36163 Opened 25 years ago Closed 23 years ago

BIDI control chars displayed as short vertical bar in browser (e.g. Unicode char &zwnj;)

Categories

(Core :: Internationalization, defect, P3)

RESOLVED FIXED

People

(Reporter: kentsin, Assigned: mkaply)

References

Details

(Whiteboard: Win: ???; Linux: ???; Mac: ???;)

Attachments

(5 files)

The character is displayed in IE5 as nothing. I believe this conforms to the 
name of the character. However, Word 2000 also displays it as a vertical bar, 
which I think is appropriate. 

The composer should display this character as a short vertical bar, however, 
the browser should not display anything at all. 

The character is the zero-width non-joiner, which I assume is used to mark a 
non-join break point without disturbing the reader with anything visible: a 
zero-width character.
See http://bugzilla.mozilla.org/showattachment.cgi?attach_id=2616 for a 
testcase. The 2000-04-17-08-M16 nightly binary on WinNT displays &zwnj;
as "?". This machine has the most recent TT MS core fonts installed.
Confirming that it is not displayed zero-width :-(.

Setting dependency for bug 17962, "Display all HTML 4 character entities in 
browser correctly", M20.
Blocks: 17962
Status: UNCONFIRMED → NEW
Ever confirmed: true
-> Erik
Assignee: ftang → erik
Hi Mike, I believe the zero width non-joiner is intended to be used with Arabic
(and possibly other scripts). Does your team intend to address this?
Assignee: erik → mkaply
Here are some remarks from Mati in Israel:

The ZWNJ (Zero-Width Non-Joiner) is an artificial character invented by 
Microsoft (I think) and adopted by Unicode to prevent adjacent letters to join 
in scripts like Arabic, where letter shapes are affected by their neighbors.  A 
typical usage would be to insert ZWNJs between the Arabic letters corresponding 
to IBM, since acronyms should be displayed in Isolated shapes.

Other than affecting the shape of letters, the ZWNJ should not be displayed, at 
least in a browser.  It seems that some editors choose to visualize ZWNJs (e.g. 
as vertical bars) to facilitate editing (it is quite awkward to edit text with 
invisible characters!).

By the way, all the above applies as well to ZWJ (Zero-Width Joiner), except 
maybe how it is visualized in editors.

So it should be displayed as a vertical bar in the composer.

Should the entity code be changed to display a space in browser and bar in 
composer?

More info from Maha:

	the &zwnj is behaving correctly in the browser on Arabic NT platform 
(don't know about the US platform). 
it is a zero width character and performs its function correctly (showing the 
characters as isolated) with out any specific display to the zwnj it self. In 
the composer it is also not displayed. 

NT gives an option of displaying or not displaying such zero width character in 
notepad (try the right mouse button)
as well as inserting them.

We can only measure to that behaviour and keep in mind that at least Windows 95 
would be a different story.

There is no specific requirement for Arabic, although it would be a nice to have 
to be able to insert them.
I am the person who reported the bug on Mozilla about the display of &zwnj;. 
From the conversation in the bug report, it seems you know the real meaning of 
this code.

I am Chinese and new to Unicode. I would like to ask you a question about the 
use of this character.

I intended to use this character for my analysis of Asian text. Since Asian 
text does not have natural word boundaries, and there does not seem to be a 
perfect word boundary separator available, I would like to build a 
semi-automatic process that marks word boundaries by inserting word break marks 
into the text.

Comments from the original opener of the bug:

Looking at the Unicode table, I found &zwnj;. I found out that it has zero 
width and that it functions as a non-joiner, so I guessed that it is suitable 
for the task I want. I would like to ask whether such a use will break other 
things. Will normal search engines recognize this character as a word boundary 
break? Will this conflict with Arabic applications?
Comments from Egypt:

I think that your choice of ZWNJ for delimiting words is not optimal.  The 
Unicode Standard 3.0 on page 316 states that "U+200C ZERO WIDTH NON-JOINER and 
U+200D ZERO WIDTH JOINER have no effect on word boundaries..."

The character which seems more appropriate for your purpose is U+200B ZERO WIDTH 
SPACE, about which Unicode says (on page 315): "The U+200B ZERO WIDTH SPACE 
indicates a word boundary, except that it has no width.  Zero-width space 
characters are intended to be used in languages that have no visible word 
spacing to represent word breaks, such as in Thai or Japanese."
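
As a rough illustration of that suggestion, here is a minimal Python sketch that marks word boundaries with U+200B and recovers the words again. The helper names and the sample segmentation are hypothetical and only show the idea; this is not anything Mozilla or a search engine actually does.

```python
# Sketch only: mark word boundaries with U+200B ZERO WIDTH SPACE.
# The function names and the sample word list are hypothetical.
ZWSP = "\u200b"


def mark_boundaries(words):
    """Join already-segmented words with an invisible boundary marker."""
    return ZWSP.join(words)


def split_on_boundaries(text):
    """Recover the word list from text marked by mark_boundaries()."""
    return text.split(ZWSP)


words = ["中文", "分词", "测试"]  # assumed output of some segmenter
marked = mark_boundaries(words)
```

Removing every U+200B from `marked` gives back the original unsegmented text, so the marks do not alter the visible content.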

To Mike Kaply: although the above may solve Kent's problem, there is still a 
Bidi-related problem about &zwnj being represented as vertical bar in Mozilla's 
browser.  I leave it to you to update the bug report if you see fit.
Status: NEW → ASSIGNED
Whiteboard: WORKSFORME?
This now WORKSFORME on Windows 2000 commercial build 6.0.17.2000080104.
Should this be closed, or are there remaining issues?
Whiteboard: WORKSFORME? → Win: WFM; Linux: ???; Mac: ???;
Eli, can you verify that this is WORKSFORME on all three tier 1 platforms too?
Cheers! The testcase is here:
   http://bugzilla.mozilla.org/showattachment.cgi?attach_id=2616
QA Contact: teruko → elig
Viewing the attached testcase with 2000-08-02-08-M18 on WinNT (U.S. Version),
&zwnj;, &zwj;, &lrm; and &rlm; all display as "?", both as named and numeric
entity references. Richard Zach saw the same on Linux with today's build,
see -- Additional Comments From Richard Zach 2000-08-02 11:58 -- in bug 17962.

The testcase also includes U+200B ZERO WIDTH SPACE, ​ -- it appears from
the empty table cell that Mozilla does not understand that character at all.
Whiteboard: Win: WFM; Linux: ???; Mac: ???; → Win: ???; Linux: ???; Mac: ???;
A zero width space is not content, is it? Hence there is no content to show in
the cell. Sounds right to me... I get no cell for:

<!ENTITY zwnj  CDATA "&#8204;" -- zero width non-joiner, U+200C NEW RFC 2070 -->
<!ENTITY zwj   CDATA "&#8205;" -- zero width joiner, U+200D NEW RFC 2070 -->
<!ENTITY lrm   CDATA "&#8206;" -- left-to-right mark, U+200E NEW RFC 2070 -->
<!ENTITY rlm   CDATA "&#8207;" -- right-to-left mark, U+200F NEW RFC 2070 -->

...all of which are, IIRC, sort of non-characters.

The attached testcase works for me, modulo bugs in the testcase (an extraneous
semicolon in one case, for example). Note that "zwsp" is not valid, and this
bug is on about zwnj not zwsp.
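
For anyone who wants to double-check which of these names are defined, a quick sketch using Python's standard `html.entities` table (which carries the HTML 4 entity names) may help. It is only a convenience check and says nothing about what Mozilla's own parser accepts.

```python
# Sketch: look up the HTML 4 entity names discussed in this bug.
from html.entities import name2codepoint

for name in ("zwnj", "zwj", "lrm", "rlm"):
    # Each of these names maps to a code point in U+200C..U+200F.
    print(f"&{name}; -> U+{name2codepoint[name]:04X}")

# "zwsp" is not an HTML 4 entity name, so it is absent from the table.
zwsp_is_defined = "zwsp" in name2codepoint
```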

If a character is displayed as "?" on some platforms, it is probably because
there are no fonts with that character in it. Try downloading (if you are on
Windows) the latest Microsoft Web Core Fonts (or whatever they call them).
[handing back to ian]
QA Contact: elig → ianh
Testing 2000-08-07-14-M18 with Bitstream Cyberbase, Times New Roman 2.55, and 
Times New Roman 2.82 (current), the same results were seen with each: in text
&zwnj;, &zwj; &lrm; and &rlm; all display as a thin vertical bar (this is the 
same as originally reported), while alone in a table cell they all appear as 
"nothing".

Ideally, those characters would be displayed as an invisible, zero-width
glyph no matter what font (or font version) was chosen, since they require
special handling anyway. But for FCS it would probably suffice to have them
display (and work) properly when a Middle-eastern font is active ... after all,
the impact will be negligible for display of other languages.

Mike?
If you make the font size big enough in a table, then you will see something
for these entities; at least you do on Linux with Bitstream Cyberbase.
This character does not display as nothing on IE on an English system.

Do we know what the right thing to do here is?
Zero-width "bidi" characters (like '&zwnj;') are not displayed in "bidi" text 
(text containing at least one strong Right-To-Left character, spacing or 
zero-width, or a Right-To-Left directional attribute) in Mozilla, even on 
systems without bidi support. The reason is that we remove bidi control 
characters before displaying. Can't we do the same for non-bidi text?
Attached patch: Suggested patch to fix this bug (deleted)
Can anyone give a test to the patch?

Summary of the changes:

1. Added BIDI control characters to the definitions of the stuff to be 
discarded.
2. Removed the method StripBidiControlCharacters() from nsBidiPresUtils.

Question:

Before the mentioned changes, characters classified as 'IS_DISCARDED' were only 
single-byte ones. The newly added BIDI controls are double-byte. Is that OK?
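
To make change 1 concrete, here is a small Python sketch of a "discardable character" test extended with the BIDI controls named in this bug. It is not the patch itself: the code point sets are assumptions for illustration (soft hyphen and carriage return stand in for the previously discardable single-byte characters, and U+200C to U+200F for the newly added ones), and the function names are hypothetical.

```python
# Sketch only; the real code lives in C++ (nsTextTransformer).
# Both sets below are assumptions made for illustration.
SINGLE_BYTE_DISCARDABLE = {0x00AD, 0x000D}      # assumed: soft hyphen, CR
BIDI_CONTROLS = {0x200C, 0x200D, 0x200E, 0x200F}  # zwnj, zwj, lrm, rlm


def is_discarded(ch):
    """True if the character should be dropped before measuring/painting."""
    code = ord(ch)
    return code in SINGLE_BYTE_DISCARDABLE or code in BIDI_CONTROLS


def strip_discarded(text):
    """Return text with every discardable character removed."""
    return "".join(ch for ch in text if not is_discarded(ch))


sample = "a\u200cb\u200fc\u00add"
stripped = strip_discarded(sample)
```

Because the test compares whole character values, it behaves the same for values below and above 0xFF, provided the buffer being scanned stores each character at full width.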
The patch is buggy: it doesn't work if the word length returned by 
nsTextTransformer::GetNextWord exceeds the content length of a text frame (this 
may happen when a text frame is split during BIDI processing). I'll fix the bug 
and submit a new patch. 
Attached patch: Working patch (deleted)
Well, the patch seems to work, and it also blocks bug 88588 (since it strips 
BIDI control characters before both text drawing and measurement). However, it 
caused many changes to nsTextFrame. So far it covers only *BIDI* zero-width 
characters (but it would be very easy to add whatever we might want). And 
finally, we have to separate BIDI controls (and maybe other not-single-byte 
characters, if any) from other discardable stuff.
Blocks: 88588
Is this a working patch (i.e., ripe for reviewing?)  If so, you should add the 
"review" keyword, and email the BIDI module owner (mkaply) asking for review, 
cc'ing to reviewers@mozilla.org.  (This sounds more in his line than ftang's).
Keywords: patch
QA Contact: bugmail → ian
Summary: Unicode char &zwnj; displayed as short vertical bar in browser → BIDI control chars displayed as short vertical bar in browser (e.g. Unicode char &zwnj;)
Lina, can you give a quick explanation of your changes?

I am going to send it to a layout person for review.
Hello all,

Here is a brief overview of the changes.

1. The idea was to add Bidi control characters to the existing list of 
discardable characters (nsTextTransformer.cpp). These characters are omitted 
when obtaining the next or previous word for presentation, so they are not 
included in the paint buffer and are not counted when measuring or displaying 
text.

2. But this very small change uncovered a bug, which may occur if *any* (not 
only Bidi) discardable character is present.

For example,

Given a word abcDEF
(capitalized is Right-to-Left, '|' is a character to be taken off).

Suppose for Bidi purposes this word, initially represented by 1 text frame, is 
split into 2 text frames - "abc|" and "DEF". And suppose that for measuring the 
text content of the 1st frame - "abc|" - we call 
nsTextTransformer::GetNextWord(..). GetNextWord() removes the character '|' and 
returns a word length of 6 (while the desired length is 3). The caller is not 
aware of how many characters were discarded. It can only retrieve the frame 
content length (4) and ensure that the word length doesn't exceed it, so it 
sets the word length to 4, which is also incorrect.
To work around this problem, before getting the next or previous word we set a 
stopper, such that the word can't go beyond it (see nsTextFrame.cpp; 
nsTextTransformer.cpp).

3. And finally, StripBidiControlCharacters(..) in nsBidiPresUtils became 
unnecessary, and was removed.
Sorry about the typo - the example should look as:
"Given a word abc|DEF"
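
The "abc|DEF" case can be simulated with a short Python sketch. This is a toy model, not the Mozilla code: `get_next_word` and its `stopper` parameter are hypothetical stand-ins for nsTextTransformer::GetNextWord and the stopper described above, and '|' plays the discardable character.

```python
# Toy model of the word scan; names and signature are hypothetical.
DISCARD = "|"


def get_next_word(text, start, stopper=None):
    """Collect a word from `start`, skipping discardable characters.

    If `stopper` is given, the scan never reads at or past that offset.
    """
    end = len(text) if stopper is None else min(stopper, len(text))
    out = []
    i = start
    while i < end and not text[i].isspace():
        if text[i] != DISCARD:
            out.append(text[i])
        i += 1
    return "".join(out)


text = "abc|DEF"   # one word, later split into frames "abc|" and "DEF"
frame_len = 4      # content length of the first frame, "abc|"

unbounded = get_next_word(text, 0)                   # runs into the 2nd frame
clamped_len = min(len(unbounded), frame_len)         # the caller's old fix-up
bounded = get_next_word(text, 0, stopper=frame_len)  # scan stops at the frame end
```

Without the stopper the scan yields "abcDEF" (length 6), and clamping to the frame length gives 4; with the stopper it yields "abc" (length 3), which is the length the first frame actually needs.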
Comment on attachment 41932 [details] [diff] [review]
Working patch

There is some really bad indentation - in the new code it is not so bad, but see the BIDI blocks after
@@ -4038,12 +4064,18 @@

[s]r=attinasi
Attachment #41932 - Flags: review+
This bad indentation is outside BIDI code.
Is there room in this bug for 'vertical tab (U+000B) should not display as box'
( http://www.fontfont.de/fffstuff/f_central.html ) or should I file another bug?

Affects the whole site, ignored by NS4.x and IE, may apply to other chars from
U+0000 - U+001F.
Alistair, I think that's a separate issue (and there may be a bug on it already,
though I couldn't find one).
Lina, I regret to inform you that the patch is bitrotted by the checkins on bug
98546
Thanks Simon, vertical tab issue is now bug 106311.
Strange. On IE 5.5, zwnj and zwj are displayed.

I will get this fix checked in tomorrow night.

are we going to worry about the editor case?
Fix checked in
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED