Closed Bug 164759 Opened 22 years ago Closed 4 years ago

Line wrapping issues with some punctuations

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: ruixu, Assigned: jshin1987)

References

(Blocks 1 open bug)

Details

(Keywords: fixed1.4, intl)

Attachments

(4 files, 2 obsolete files)

Build: 2002-08-26-04 trunk This is seperated from bug 162949. Line wrappings are incorrect with some punctuations, e.g. "...", "-", "." and "\" in single byte. Please refer to the test case attached. (better to compare with IE)
Attached file test case (deleted) —
Keywords: intl
QA Contact: ruixu → ylong
*** Bug 202883 has been marked as a duplicate of this bug. ***
See attachment 121366 [details] for a bit more extensive list of test cases. In case of U+002E (full stop), it's easy to fix. In the following, the condition (U_SPACE != next) in line 365 (http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#365) is too broad. It need to be removed all together or has to be narrowed down. U+002E is tricky because even in CJK text, somebody might use it between CJK Ideographs or Hangul syllables for abbreviation (as in 'e.g.'). If we can put it aside as too edge a case, we can just remove line 365 and 366 unless there's a special provision about U+002E in JIS X 4501 that I don't know about. Does JIS X 4501 stipulate that U+002E cannot end a line if it's followed by anything other than SPACE? It could be, but I guess it doesn't allow it to start a line no matter what. Currently, line 366 (assigning class 8 to U+002E) effectively allows line break both before and after U+002E. 350 PRInt8 nsJISx4501LineBreaker::ContextualAnalysis( 351 PRUnichar prev, PRUnichar cur, PRUnichar next 352 ) 353 { ..... 360 else if( U_PERIOD == cur) 361 { 362 if( (IS_ASCII_DIGIT (prev) || (0x0020 == prev) )&& 363 IS_ASCII_DIGIT (next) ) 364 return NUMERIC_CLASS; 365 if( U_SPACE != next ) 366 return CHARACTER_CLASS; 367 } ...... 374 return this->GetClass(cur); As for U+005C(reverse solidus), it's class 3(in Mozilla's reduced class scheme ( see http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/jisx4501class.h and http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/jisx4051pairtable.txt ) so that it can start a line. To me, it's odd to see a line starting with U+005C. (btw, U+002F is class 8 and it can both start and end a line. although not so much as U+005C, it's a bit strange to start a line with U+002F.) Anyway, we'd not want to break away from JIS X 4501 for Japanese. Therefore, I think we have to special-case U+005C (and possibly U+002F) for zh and ko.
Attached patch a simple fix for full stop only. (obsolete) (deleted) — Splinter Review
Other punctuation marks have be dealt with in lang/locale-specific line breakers, IMO (see bug 203016 and bug 56652) Instead of just removing two lines (which would make lines break at the first full stop in cases like 'i.e.' or 'e.g.'), I assigned CHARACTER_CLASS if both prev. and next chars are CHARACTER_CLASS.
Comment on attachment 121406 [details] [diff] [review] a simple fix for full stop only. Asking for review. The patch is to allow break after a full stop when it's followed by a character that can be broken with class 1(when 0 is the first class), full stop's 'native' class unless it's between class 8 characters (as in 'e.g.'). BTW, to short-circuit the most common case (next == SPACE) we may use if (next != U_SPACE && GetClass(prev) == ...... && GetClass(next) == .......) instead of + if( GetClass(prev) == CHARACTER_CLASS && + GetClass(next) == CHARACTER_CLASS ) Let's deal with the 'simplest' of all (full stop). It seems like JIS X 4051 and UTR #14 are in disagreement on other punctuation marks and they'd better be dealt with in a broader context (see my comment in bug 56652).
Attachment #121406 - Flags: superreview?(bzbarsky)
Attachment #121406 - Flags: review?(yokoyama)
Comment on attachment 121406 [details] [diff] [review] a simple fix for full stop only. sr=bzbarsky, but including just a _little_ more context in that diff would have saved me having to open the file and read the surrounding code...
Attachment #121406 - Flags: superreview?(bzbarsky) → superreview+
Attachment #121406 - Flags: review?(yokoyama) → review+
I'm really sorry to take back my patch after getting r and sr, but I'd rather be sorry now than to repatch it after commiting a not-so-right patch. The more I think about it, the more like a mine field it becomes ;-). My patch doesn't cover cases like 'a.3', '3.a', or '<cjk>.3' (as used in section numbers and equation numbers). At least in the first two cases, we don't want to break after the full stop. If our bottom line is that a full stop can never start a line, perhaps a better condition is 'GetClass(next) == 6 || GetClass(next) == 8' (simply put, class 6 is numeric digits and class 8 is alphabetic letters). For myself and others, here I'm summarizing the difference between the native class of the character in question(full stop) class 1 and CHARACTER_CLASS (class 8). (I'm using the actual class numbers used in C++ code instead of comments) (see http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#169) 1. breaking before - No line breaking is allowed before class 1 - No line breaking is allowed before class 8 when the it's preceeded by a character of class 0, 6(numeric), 7, and 8(alphabetic letters). That is, a character of class 8 can start a line if it's following a character of class 1 - 5. - difference : if prev == class [1..5], class 8 starts a line but class 1 cannot even in that case. - interim conclusion : we don't have to special case 'full stop' *based on prev. char.* because we don't want it to start a line. The only exception is when it follows space and is followed by a numeric digit, which is taken care of before the fragment of the code we're discussing. (see http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#362 ) 2. breaking after - line breaking is allowed after class 1 except when followed by class 1 - line breaking is allowed after class 8 except when followed by class 1, 6, 7, and 8 - difference : class 8 cannot end a line before class [6..8] while class 1 can before class [6..8]. [1] Based on this comparison and the fact that we don't want to break after '.' in cases like 'a.3', '3.a' and probably '<cjk>.a' as well (i.e. we don't want to break after '.' if it's followed by class 6 and 8) and we don't want to start a line with '.' unless it's at the beginning of fractions ( .357 ). I propose to replace the condition 'next != U_SPACE' in line #365 (http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#365) with a new condition 'GetClass(next) == 6 || GetClass(next) == 8' Sorry again and thanks. [1] The only class 7 character is U+2126 (Ohm sign). I wonder what's so special about it to make it a class of its own :-)
Attached patch a new patch (obsolete) (deleted) — Splinter Review
> 1. break before ..... > - difference : if prev == class [1..5], class 8 starts a line but > class 1 cannot even in that case. > - interim conclusion : we don't have to special case 'full stop' > *based on prev. char.* because we don't want it to start a line. Actually, we have to preserve this difference.Therefore, if prev. char. class is [1..5], we should keep the native class of the full stop. In my previous comment, I suggested using (next_class==6 || next_class=8). Here I'm using (next_class > 5), which is equivalent because class 9 is Thai and it breaks with everything and class 7(a class made of a single character) behaves the same way as class 6/8 when following full-stop.
Attachment #121406 - Attachment is obsolete: true
Comment on attachment 122108 [details] [diff] [review] a new patch Shanjian, I'm really sorry to bother you again with this.(I'm aware that you're now in other projects). But I couldn't help because I discovered a problem with my prev. patch. I guess this is as good as I can make it within JIS X 0451 framework. Boris, I'm sorry to you, too. There's a bit more context in this patch and I gave a rather long rationale for my patch in the bugzilla. Thank you !
Attachment #122108 - Flags: superreview?(bzbot)
Attachment #122108 - Flags: review?(shanjian)
Comment on attachment 122108 [details] [diff] [review] a new patch Sorry for spamming. I realized that bz is away. I'm asking dbaron for sr.
Attachment #122108 - Flags: superreview?(bzbot) → superreview?(dbaron)
Comment on attachment 122108 [details] [diff] [review] a new patch Shanjian seems too busy (he's not any more on mozilla-side). Because Simon has doen a lot of work on linebreaking, I'm asking him for review. Thanks
Attachment #122108 - Flags: superreview?(heikki)
Attachment #122108 - Flags: superreview?(dbaron)
Attachment #122108 - Flags: review?(smontagu)
Attachment #122108 - Flags: review?(shanjian)
Depends on: line-breaking
Comment on attachment 122108 [details] [diff] [review] a new patch >+ if( pc > 5 && pc < 1 && GetClass(next) > 5 ) This can't be right since this if-clause can never be true (pc can never be more than 5 and less than 1 at the same time).
Attachment #122108 - Flags: superreview?(heikki) → superreview-
Attached file another test case (deleted) —
In place of (pc < 1 && pc>5), I have ( pc==0 || pc > 5). It's beyond me how that blunder slipped in (perhaps I hand-edited the patch ....)
Attachment #122108 - Attachment is obsolete: true
Attachment #122108 - Flags: review?(smontagu)
Comment on attachment 123780 [details] [diff] [review] a new patch with the stupid mistake fixed sorry for unnecessary spamming due to my mistake.
Attachment #123780 - Flags: superreview?(heikki)
Attachment #123780 - Flags: review?(smontagu)
Comment on attachment 123780 [details] [diff] [review] a new patch with the stupid mistake fixed r=smontagu
Attachment #123780 - Flags: review?(smontagu) → review+
jshin: I am assigning this to a right owner. I hope you don't mind.
Assignee: yokoyama → jshin
Roy, no problem. Accepting.
Status: NEW → ASSIGNED
Comment on attachment 123780 [details] [diff] [review] a new patch with the stupid mistake fixed >Index: intl/lwbrk/src/nsJISx4501LineBreaker.cpp >=================================================================== >+ // character/numeric class (class 8, 6, and 7. class 9 (Thai) Wouldn't it make more sense to say classes 6-8? sr=heikki
Attachment #123780 - Flags: superreview?(heikki) → superreview+
Comment on attachment 123780 [details] [diff] [review] a new patch with the stupid mistake fixed Thank you for r/sr. I'll change the comment as you suggested. Asking for a1.4. This is a low risk patch that will make Mozilla more compliant to JIS X 4501 and UTR #10.
Attachment #123780 - Flags: approval1.4?
Have we been referring to JIS X 4051 as JIS X 4501? I don't think there is such a standard as "JIS X 4501". Gosh, we even have a file name including "4501"? Shouldn't this error be corrected in the source?
Thank you for the note. I'm always confused as to which is the correct name, JIS X 4051 or JIS X 4501, partly due to the fact that Mozilla uses both names. We can rename *JISX4501*cpp. The only downside is that we'll lose the cvs change record for the old file (, which is a weak point of CVS.) BTW, in comment #20, it should have been UTR #14 instead of UTR #10.
fix checked in to the trunk. waiting for a1.4
OS: Windows 2000 → All
Hardware: PC → All
Please don't include spurious style/spacing changes in patches; it makes them harder to review, and also makes cvs blame much less functional.
Or if you do clean up spacing, attacha diff -wu as well. /be
Comment on attachment 123780 [details] [diff] [review] a new patch with the stupid mistake fixed a=asa (on behalf of drivers) for checkin to the 1.4 branch.
Attachment #123780 - Flags: approval1.4? → approval1.4+
Attached file Quotation marks woes (deleted) —
Here is an additional testcase with regard to quotation marks (ubiquotous ' and " as well as paired ones; U+2018, U+2019, U+201C, U+201D). It is encoded in UTF-8, and modified from attachment 121366 [details].
Keywords: fixed1.4
asa, sorry I forgot to note that the patch had been checked in for 1.4branch. as for attachment 124625 [details], thanks for Issac for alerting me to this issue. It's really curious that U+2019 (single-high 9) and U+201D(double-high-9) behave differently (case 4 and case 5). A quick glance at intl/lwbrk tells me a different story (they're both class 1 and are not supposed to start a line.). Depending on the cause, we may have to open a new bug.
I don't know how I forgot that U+2019 is also subjected to contextual analaysis. It's dealt with in the 'if-clause' right after that for full-stop. It may be handled similarly, but that may not be the best.
No longer depends on: line-breaking
jshin: Why is this bug still opened? I think this bug should be marked as fixed. If there are other bugs, we should file new bugs for them.
QA Contact: amyy → i18n

Marking as fixed as per last comment.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: