Closed Bug 135323 Opened 23 years ago Closed 22 years ago

A return code between the two CJK characters is converted to a space code

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED

People

(Reporter: hsaito54, Assigned: shanjian)

References

Details

(Keywords: intl, Whiteboard: [adt2 rtm])

Attachments

(4 files, 10 obsolete files)

A space code between the two kanji characters is desirable to remove. For example, this Japanese sentence "¤¢ ¤¤" is displayed "¤¢ ¤¤" but I expect to display "¤¢¤¤" without a space.
-> Int
Assignee: Matti → yokoyama
Component: Browser-General → Internationalization
QA Contact: imajes-qa → ruixu
Please apply this patch next to apply a patch "patch to wrap long word by key-characters" at "http://bugzilla.mozilla.org/attachment.cgi?id=75233&action=view" in "http://bugzilla.mozilla.org/show_bug.cgi?id=95067".
Attached file testcase (deleted) —
This bug is related to bug 95067. I believe most Japanese people prefer that a line-break isn't rendered as space.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hardware: Other → All
Attached patch additional patch #1 (obsolete) (deleted) — Splinter Review
I made a error in my patch carelessly. I correct errors with this patch program.
Keywords: intl
QA Contact: ruixu → ylong
saito-san: Your patch looks simple; but I think the patch is rotten. Can you update the patch for the current trunk? I'd like to apply the patch to verify the fix. Thanks
The patch was remade.
assign to shanjian. Shanjian: can you review his code? It's in the layout. Thanks
Assignee: yokoyama → shanjian
saito san, I think '\n' should be removed only between CJK characters. That's say inside function "ScanNormalWhiteSpace_F", you need to check character before and after '\n' before setting the flag. If '\n' is the last character in this fragment, it is OK to only check its precedence. (This will happen when text span multiple fragment.) I am also wondering about this change: - *aContentLenResult = offset - mOffset; + *aContentLenResult = offset - save_offset; Why did you use save_offset instead of mOffset, which is more updated? I assumed you have tested the former one and it didn't work well. But I would like to know why.
Status: NEW → ASSIGNED
To be sure, my patch program is wrong. Testing the following pages, it found two errors. http://www.asahi-net.or.jp/~wq6k-yn/para.html Every time Reload, a vertical scrill bar moves the first error. But returning a program to a basis, *aContentLenResult = offset - mOffset; the error was corrected. A patch needs to be added to correct the next error. The next sentence is not sometimes displayed accurately. 用途を詳しく分析した上で DTDを設計すれば、そのようなこと はあまり起きない (かもしれない)。しかし、 <===== error HTMLのように一つの <===== error DTDを様々なタイプの文書に用いるのでは、この種の不都合は避 <===== error け難い。 A variable contentLength is in the function nsTextFrame::MeasureText() at nsTextFrame.cpp. When a space between Kanji characters is deleted, 1 must be subtracted from this variable.
Can't I write the document of Japanese? 用途を詳しく分析した上で DTDを設計すれば、そのようなこと はあまり起きない (かもしれない)。しかし、 HTMLのように一つの DTDを様々なタイプの文書に用いるのでは、この種の不都合は避 け難い。
>>Can't I write the document of Japanese? What do you mean? As a test case, that's fine. But if you plan to use japanese for communication, I am incapable to read it.
I made a mistake. I wrote in an Shift-JIS code and displayed it in an EUC-JP code. > I think '\n' should be removed only between CJK characters. I appreciate your suggestion. I will try to investigate whether a patch can act only to the CJK characters.
The patch was remade. It is happy if you can evaluate.
saito san, I really appreciated your help one this issue. This part of code is complicated and very sensitive, so please be patient with me. Please also excuse me if my question does not make sense to you. I am learning this part of code as well. Several suggestions: 1, It is probably not a good idea to add Probe method to word breaker. You can add the macro to local file directly. In different places, we may have different definition of CJK char, so it is fine for me to define it multiply. Of cause, if we can add widely used CJK macro to XPCOM module to avoid this, it will be nice. But let's leave that for later. 2. Can we avoid the use of SpaceInfo? In nsTextTransformer::GetNextWord, we have some code dealing with German Szlig character, we need to transform this one to "SS". I assume you can do it the same way. But I am not very sure. I hope DeleteOneChar can also be avoided.
I am sorry, there was a spelling mistake. > Every time Reload, a vertical scrill bar moves the first error. Every time Reload, a vertical scroll bar moves the first error. > patch to remove a space between two CJK caracters patch to remove a space between two CJK characters > DeleteOneChar can also be avoided. If I can also do, I do not want to use this function. But a space will remain by the copy & paste using the mouse. The place of correction was not able to be found.
Summary: A return code between the two kanji caracters is converted to a space code → A return code between the two CJK characters is converted to a space code
please let me know. Although the character is stored to the buffer "mTransformBuf.mBuffer", I recognize that this buffer is not used by the copy & paste using the mouse. Is it right? If right, when copying & pasting by the mouse, where is the processing which transposes a return code to a space or the continuous space to one space?
I can't answer your question without looking into the code. But I guess that we do not do any transformation for copy-n-paste. If we do collapse multiple spaces to one in copy-n-paste, we need to do this transformation (remove space between kanji) in the same way.
The patch was remade again. Please give me a review.
This patch is acceptable to me, but the changes in nsTextFrame.cpp does not seem necessary. So could you take one more look?
The changes in nsTextFrame.cpp is required in order that a variable "totLen" in nsPlainTextSerializer.cpp may acquire the right length of a character sequence. Is it better to look for other methods for a its variable acquiring the length of a character sequence? content/base/src/nsPlainTextSerializer.cpp: 1513 void 1514 nsPlainTextSerializer::Write(const nsAString& aString) 1515 { 1524 PRInt32 totLen = aString.Length(); An example is shown below, "xx" shows the Kanji character of one character. "xx yy" The value of a variable "totLen" is 2, when a patch is not applied to nsTextFrame.cpp. However, the value should be 3.
Saito san, Sorry for not being able to response promptly. I was busy in fixing some urgent issues. I did spend some time today and check your patch carefully. I still could not understand why the change in nsTextFrame.cpp is necessary. Display and copy-n-paste works fine without that change. So could you post a test case to show the problem when that part is not there? thanks.
I appreciate for your preview and suggestions. Please test a testcase "Created an attachment (id=77572)". Can you copy "¤¢¤¤" using the mouse and paste it? When a patch was not applied to nsTextFrame.cpp , "¤¢ " was pasted. I tested by the version 0.9.8. The latest version cannot immediately be made up, but should I do so? P.S. "¤¢¤¤" is "xx" and "yy" that indicates two Kanji characters. "¤¢ " is "xx" and a white space.
I am sorry, it mistook again. > I appreciate for your preview and suggestions. I appreciate for your review and suggestions.
The previous message containing kanji characters was written in the code of EUC-JP.
Attached patch patch v5, (obsolete) (deleted) — Splinter Review
Saito, I finally got enough time to understand all your changes. I changed the patch to make it more local. (I hope you don't mind my modification.) The idea is the same. Please review my patch to see if I missed any of your points. I appreciate you contribution, this is a complicate area and you did a good job.
I agree with your changes, but the next program needs to change. nsTextTransformer.cpp: old + if (firstChar == '\n' && + offset - mOffset == 1 && + fragLen > offset) nsTextTransformer.cpp: new + if (firstChar == '\n' && + offset - mOffset == 1 && mOffset > 0 && + fragLen > offset) This change is added and a test is O.K.
Attached patch patch v6 (obsolete) (deleted) — Splinter Review
Attachment #77569 - Attachment is obsolete: true
Attachment #77597 - Attachment is obsolete: true
Attachment #80550 - Attachment is obsolete: true
Attachment #82001 - Attachment is obsolete: true
Attachment #82597 - Attachment is obsolete: true
Attachment #82989 - Attachment is obsolete: true
good point. I updated the patch. Now let's ask for r/sr. rbs/attinasi, could you give r/sr?
Comment on attachment 83046 [details] [diff] [review] patch v6 - i = contentLen; + if (1 == wordLen && contentLen == 2 && IS_CJK_CHAR(*bp)) { + i = contentLen; + } else { + i = contentLen; } No need to duplicate the statement. What happens when there are _multiple_ '\n' between the CJK chars? I am not a big fan of macros defined in several places. You can define IS_CJK_CHAR in one place (next to IS_HIGH/LOW_SURROGATE) in intl/unicharutil/util/nsUnicharUtils.h
Attached patch patch v7 (obsolete) (deleted) — Splinter Review
Moved macro to nsUnicharUtils.h.
Attachment #83046 - Attachment is obsolete: true
> What happens when there are _multiple_ '\n' between the CJK chars? One space is replaced, when two or more '\n' are between CJK characters, or when a space is included. It is the conventional processing.
When changed into patch-v6 from patch-v5, a mistake occurs. Please check. patch-v5: + if (firstChar == '\n' && + offset - mOffset == 1 && + fragLen > offset) + { patch-v6: + if (firstChar == '\n' && + offset - mOffset == 1 && + mOffset > 0) + { new: + if (firstChar == '\n' && + offset - mOffset == 1 && + mOffset > 0 && + fragLen > offset) + {
Attached patch patch v8 (obsolete) (deleted) — Splinter Review
Attachment #83587 - Attachment is obsolete: true
rbs/waterson, could you r/sr?
Comment on attachment 85265 [details] [diff] [review] patch v8 I didn't understand the reply to my question about what happens when there are multiple '\n' and/or tabs and/or space chars in-between the CJK chars?
Two or more '\n' between CJK characters replaces one space. Although one or more TAB and spaces between CJK characters also replace one space, the above both processing is the conventional processing and this patch does not bar the processing.
rbs, let me explain hideo's statement. We only remove a single return between 2 CJK characters. We do nothing to multiple return, tab, space, whatsoever. Why is that? For tab, space, and other non-return space equivalent characters, we believe user has a reason to put it there when it is there, so we don't want to do anything. In case of multiple return, that's arguable. When user put a single return between 2 CJK char, they just want to shorten line length and they expect text flow to be continuous. We decide to only handle this situation and nothing else. I personally believe hideo's approach is reasonable. To remove multiple return between CJK char also make sense to me, but I didn't find it really necessary.
Is there a real-life page with the problem? Something that has been bugging me is why there wasn't a duplicate/precedent to this bug. (If there is one in bugzilla, care to point it?). What is conventional elsewhere (e.g., telegraphic text -- just a funny example) might be irrelevant in HTML. Surely, there should be a real-life page, in this massive web, with this bug? Is the patch not going to cause a suprise and regress pages that have been using '\n' as end-of-word *and* end-of-line -- not just end-of-text-line? Also, I wonder about the rendering of <pre>formatted text, the patch might mess all that. I applied the patch and it seemed to "fix" the testcase. I had some doubts about the necessity of the plain-text part, but again, it might be the same doubt as above (BTW, double-check/confirm that mInWhitespace is in sync).
Again, as a reminder, this isn't an easy area, and tip-toeing to arrive safely is better -- although it might take longer. So bear with the picky r/sr.
By the text of usual Japanese, a space is not placed between kanji characters. In case a long document is edited, starting a new line, and continuing and describing a text in the suitable place of immediately after a punctuation and kanji characters other than a punctuation is performed ordinarily. Please see the following examples (was written in the code of EUC-JP). If HTML describes the text of 1, like 2, although it is visible like 3, we do not desire a text of 3. 1, ºé¤¤¤¿¡¢ºé¤¤¤¿¡¢ºù¤¬ºé¤¤¤¿¡£ 2, ºé¤¤¤¿¡¢ ºé¤¤¤¿¡¢ ºù¤¬ ºé¤¤¤¿¡£ 3, ºé¤¤¤¿¡¢ ºé¤¤¤¿¡¢ ºù¤¬ ºé¤¤¤¿¡£ It is possible to continue writing a Japanese text without a new-line for a long time. However, it is pain for us. It is checking that this patch does not influence the <pre> element.
> Is there a real-life page with the problem? It shows one example, but not a good example, since CSS is used and the space between kanji characters spreads, http://www.asahi-net.or.jp/~wq6 k-yn/para.html > Something that has been bugging me is why there wasn't a duplicate/precedent to this bug. The argument is made even now at Bugzilla-jp (http://www.mozilla.gr.jp/), and there is a contrary opinion about this affair. That is, it says that it is the specification of HTML and the specification should be followed, and has been considered so until now.
Sorry, it is the following URL. http://www.asahi-net.or.jp/~wq6k-yn/para.html
> It is possible to continue writing a Japanese text without a new-line for a > long time. Do you mean that there are words that excessively long and cannot be broken by users as they type them? Such a word would be weird to read... > The argument is made even now at Bugzilla-jp (http://www.mozilla.gr.jp/), and > there is a contrary opinion about this affair. > That is, it says that it is the specification of HTML and the specification > should be followed, and has been considered so until now. I am leaning towards the opinion of leaving things as they are myself (i.e., retain the HTML-spec way). This will avoid undue uncertainties with <pre>, not to mention the (still unknown) ramifications when typing in Composer. Plus, IE6 does what Mozilla is currently doing, and a change will not only be non-spec compliant, but it will also deviate from what other popular browsers are doing.
> Do you mean that there are words that excessively long and cannot be broken by > users as they type them? Such a word would be weird to read... CJK text doesn't have a space between words. So, if somebody hates current Mozilla behaviour, he/she have to write a paragraph in a single line. IE6/Win doesn't display a line-break as a space.
> CJK text doesn't have a space between words. So, if somebody hates > current Mozilla behaviour, he/she have to write a paragraph in a single line. But if you use the Mozilla's editor, |nsJISx4501LineBreaker| does line-breaking nicely as you type, right?
It is different to everybody what is used for a text editor. Can't it be interpreted as deletion being possible about the space between words by HTML 4.01 specifications depending on a each languege?
HTML spec: <http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1> > This layout may involve putting space between words (called inter-word space), > but conventions for inter-word space vary from script to script. For example, > in Latin scripts, inter-word space is typically rendered as an ASCII space > (&#x0020;), while in Thai it is a zero-width word separator (&#x200B;). > In Japanese and Chinese, inter-word space is not typically rendered at all. CSS spec: http://www.w3.org/TR/2002/WD-css3-text-20020515/#linefeed-treatment
> Plus, IE6 does what Mozilla is currently doing, and a change will not only be > non-spec compliant, but it will also deviate from what other popular browsers > are doing. The specifications described by "9.1 White space" of HTML is right. > In Japanese and Chinese and inter-word space is not typically rendered at all. I believe that it is left to the browser how inter-word space is rendered. I consider that those browsers force the custom of a Latin scripts on us.
Another real life example of this problem. http://alltheweb.com/
Oh, this one too. http://www.google.com/ To see the problem, have Japanese at the top of your preferred langage list, and search something in Japanese.
http://www.asahi-net.or.jp/~wq6k-yn/para.html Looking at that page, it looks bad. Although that page is intentionally creating the problem, it can happen in real pages like machine generated output (e.g. google). &#12288; I think we want to fix this on the trunk.
I propose to fix this in branch so that it makes to Netscape 7 release. We shouldn't ignore 2nd and 3rd mostly used language on the internet.
Taka, you can nominate this as nsbeta1 if you think this is important. Please provide some info. 1) How frequent does the user encounter the page with the problem. 2) How bad the problem can be? Please provide a screen shot.
Attached image screen shot of the doc in comment 53 (deleted) —
Naoki, It's not a matter of how often users encounter this. As far as we state that Mozilla is HTML 4.0 compliant browser and it supports Chinese and Japanese, the document in comment 53, which is error free HTML 4.0 Strict document, must be rendered as native language speakers expect. Please see the screenshot and judge how bad we are doing with error free HTML document. This issue has been around since Netscape 4.0 and discussed on various ML, BBS, Newsgroup, etc. I don't want to see people blaming on Mozilla/Netscape about this issue anymore. Nominating to nsbeta1.
Keywords: nsbeta1
Attached patch patch v9 (obsolete) (deleted) — Splinter Review
Exclude Korean from patch v8.
Attachment #85265 - Attachment is obsolete: true
Comment on attachment 88737 [details] [diff] [review] patch v9 r=shanjian
Attachment #88737 - Flags: review+
I just put r=shanjian since this patch is originally proposed by saito, and new one is touched by taka. chris waterson, could you please sr? rbs, feel free to express your objection if you still have some.
can we reduce this patch by take out the non essential change in nsJISx4501LineBreaker.cpp and +#define IS_CJK_CHAR(u) in nsUnicharUtils.h [adt2] since this a strong recommendation/complains from Japanese mozilla community.
Keywords: nsbeta1nsbeta1+
Whiteboard: [adt2]
Blocks: 141008
Attached patch patch v10 (deleted) — Splinter Review
don't touch nsJISx4501LineBreaker.cpp.
Attachment #88737 - Attachment is obsolete: true
After we added the macro to nsUnicharUtils.h, it does not make sense to keep another copy in nsJISx4501LineBreaker.cpp. Since those are identical macro, there should be no risk. I would suggest to use v9 instead of v10.
Actually, IS_CJK_CHAR(u) is not used in this patch anymore because we have to exclude Korean from special case handling. I added IS_CJ_CHAR() in nsUnicharUtils.h and that's the macro we are using.
Comment on attachment 89113 [details] [diff] [review] patch v10 I am Ok with v10.
Attachment #89113 - Flags: review+
Comment on attachment 89113 [details] [diff] [review] patch v10 sr=waterson
Attachment #89113 - Flags: superreview+
checked into trunk. /cvsroot/mozilla/layout/html/base/src/nsTextFrame.cpp,v <-- nsTextFrame.cpp new revision: 1.374; previous revision: 1.373 /cvsroot/mozilla/layout/html/base/src/nsTextTransformer.cpp,v <-- nsTextTransf ormer.cpp new revision: 1.69; previous revision: 1.68 /cvsroot/mozilla/content/base/src/nsPlainTextSerializer.cpp,v <-- nsPlainTextS erializer.cpp new revision: 1.60; previous revision: 1.59 /cvsroot/mozilla/intl/unicharutil/util/nsUnicharUtils.h,v <-- nsUnicharUtils.h new revision: 1.15; previous revision: 1.14
I talked to naoki and we double checked the code. The patch will only affect chinese and japanese characters. For characters of other languages, the behavior should be the same. For verification testing, we can also use any non-chinese and non-japanese website to verify this. Since '\n' is used in many long peice of text. If there is something, it shouldn't be difficult to identify. For chinese/japanese verification, see previous comment for testcases. Be sure to test copy-n-paste.
On 07-02 trunk build: 1. It works fine when a return code between 2 Japanese text characters. However, when there is a return code HTML tag mixed together with character(s), there still like "an extra space" shows in browser window. http://www.asahi-net.or.jp/~wq6k-yn/para.html 2. When the HTML source has space(s) between return code, like page: http://www.php.net/manual/ja/function.setcookie.php or: http://www.vinelinux.org/ There is a single byte space between 2 characters. I don't if it is OK, but on Netscape I can highlight the space between characters while on IE I can not. 3. No Copy/Paste regression was found right now. 4. I don't find any regression with English or other languages so far.
Attached image a screen shot after fixed (deleted) —
Notice: For the display in browser, there is no different with this case before or after the fix. When I copy/paste the string from Broswer window to Composer, the space will be shorter.
Taka, the first two issues in Comment #70 were they intended to be fixed by the last patch? If not, we can verify this as fixed.
>Taka, the first two issues in Comment #70 were they intended to be fixed by the >last patch? If not, we can verify this as fixed. Let me ask this to the original author of the patch. Saito san, could you answer my last comment? Is it okay for mozilla community in Japan about the current level of the support (i.e. remaining problems are not so important)?
There are about three future examination matters. A new-line replaces a space, without being deleted in the following examples. (1) a new-line is between a Kanji character and an English character. KK HTML KK (2) the Kanji character is inserted into HTML tag KK <a>KK</a> KK (3) there is a space following a new-line KK KK KK is the Kanji character of one character. The user group of Bugzilla-jp examines whether deletion of a space is required.
Okay, please file a separate bug for those problems. This bug can be marked as fixed. I think the current fix mostly covers exisiting problems. Nominate for adt1.0.1
Keywords: adt1.0.1
Whiteboard: [adt2] → [adt2 rtm]
adding adt1.0.1- per ADT. Let's get this in the next release.
> re: comment #59 > > rbs, feel free to express your objection if you still have some. I have been travelling, and was out of bugzilla. I don't like the fix very much (especially that it doesn't cover everything and it is short-sighted for the remaining parts), but I don't know what else to suggest -- except perhaps the following: would that possible to transform the space (when in the CJ context) into a zero-width space (ZWSP) to make it disappear in a natural/visual way? Motivation: avoid complications in nsTextFrame, if possible.
What rbs suggested is a interesting idea. But I am not sure how much it can do to simplify the code. I will mark this bug as fixed. File another bug anybody like to explore that idea.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Mark as verified for this one.
Status: RESOLVED → VERIFIED
Bug 156369 is filed for remaining problems.
Adding adt1.0.1- per the adt. Let's try to get it into the next release.
Keywords: adt1.0.1adt1.0.1-
I'm not very familiar with the Mozilla source but if the macro IS_CJK_CHAR or IS_CJ_CHAR in intl/unicharutil/util/nsUnicharUtils.h works like IS_HIGH/LOW_SURROGATE, won't Chinese and Japanese characters composed of two surrogates be missed and still cause this bug?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: