Closed
Bug 164759
Opened 22 years ago
Closed 4 years ago
Line wrapping issues with some punctuations
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
People
(Reporter: ruixu, Assigned: jshin1987)
References
(Blocks 1 open bug)
Details
(Keywords: fixed1.4, intl)
Attachments
(4 files, 2 obsolete files)
(deleted),
text/html
|
Details | |
(deleted),
text/html
|
Details | |
(deleted),
patch
|
smontagu
:
review+
hjtoi-bugzilla
:
superreview+
asa
:
approval1.4+
|
Details | Diff | Splinter Review |
(deleted),
text/html
|
Details |
Build: 2002-08-26-04 trunk
This is seperated from bug 162949.
Line wrappings are incorrect with some punctuations, e.g.
"...", "-", "." and "\" in single byte.
Please refer to the test case attached. (better to compare with IE)
Assignee | ||
Comment 2•22 years ago
|
||
*** Bug 202883 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 3•22 years ago
|
||
See attachment 121366 [details] for a bit more extensive list of test cases.
In case of U+002E (full stop), it's easy to fix. In the following,
the condition (U_SPACE != next) in line 365
(http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#365)
is too broad. It need to be removed all together or has to be narrowed down.
U+002E is tricky because even in CJK text, somebody might use it
between CJK Ideographs or Hangul syllables for abbreviation (as in
'e.g.'). If we can put it aside as too edge a case, we can just remove
line 365 and 366 unless there's a special provision about U+002E
in JIS X 4501 that I don't know about. Does JIS X 4501 stipulate
that U+002E cannot end a line if it's followed by anything other
than SPACE? It could be, but I guess it doesn't allow it to
start a line no matter what. Currently, line 366 (assigning
class 8 to U+002E) effectively allows line break both before and after U+002E.
350 PRInt8 nsJISx4501LineBreaker::ContextualAnalysis(
351 PRUnichar prev, PRUnichar cur, PRUnichar next
352 )
353 {
.....
360 else if( U_PERIOD == cur)
361 {
362 if( (IS_ASCII_DIGIT (prev) || (0x0020 == prev) )&&
363 IS_ASCII_DIGIT (next) )
364 return NUMERIC_CLASS;
365 if( U_SPACE != next )
366 return CHARACTER_CLASS;
367 }
......
374 return this->GetClass(cur);
As for U+005C(reverse solidus), it's class 3(in Mozilla's reduced class scheme
( see
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/jisx4501class.h and
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/jisx4051pairtable.txt )
so that it can start a line. To me, it's odd to see a line starting with
U+005C. (btw, U+002F is class 8 and it can both start and end a line.
although not so much as U+005C, it's a bit strange to start a line with
U+002F.) Anyway, we'd not want to break away from JIS X 4501 for
Japanese. Therefore, I think
we have to special-case U+005C (and possibly U+002F) for zh and ko.
Assignee | ||
Comment 4•22 years ago
|
||
Other punctuation marks have be dealt with in lang/locale-specific
line breakers, IMO (see bug 203016 and bug 56652)
Instead of just removing two lines
(which would make lines break at the first full stop in cases
like 'i.e.' or 'e.g.'), I assigned CHARACTER_CLASS if both prev.
and next chars are CHARACTER_CLASS.
Assignee | ||
Comment 5•22 years ago
|
||
Comment on attachment 121406 [details] [diff] [review]
a simple fix for full stop only.
Asking for review. The patch is to allow break after a full stop when it's
followed by a character that can be broken with class 1(when 0 is the first
class), full stop's 'native' class unless it's between
class 8 characters (as in 'e.g.').
BTW, to short-circuit the most common case
(next == SPACE)
we may use
if (next != U_SPACE &&
GetClass(prev) == ...... &&
GetClass(next) == .......)
instead of
+ if( GetClass(prev) == CHARACTER_CLASS &&
+ GetClass(next) == CHARACTER_CLASS )
Let's deal with the 'simplest' of all (full stop). It seems like JIS X 4051 and
UTR #14 are in disagreement on other punctuation marks and they'd better be
dealt with in a broader context (see my comment in bug 56652).
Attachment #121406 -
Flags: superreview?(bzbarsky)
Attachment #121406 -
Flags: review?(yokoyama)
Comment 6•22 years ago
|
||
Comment on attachment 121406 [details] [diff] [review]
a simple fix for full stop only.
sr=bzbarsky, but including just a _little_ more context in that diff would have
saved me having to open the file and read the surrounding code...
Attachment #121406 -
Flags: superreview?(bzbarsky) → superreview+
Updated•22 years ago
|
Attachment #121406 -
Flags: review?(yokoyama) → review+
Assignee | ||
Comment 7•22 years ago
|
||
I'm really sorry to take back my patch after getting r and sr, but I'd
rather be sorry now than to repatch it after commiting a not-so-right
patch. The more I think about it, the more like a mine field it becomes ;-).
My patch doesn't cover cases like 'a.3', '3.a', or '<cjk>.3' (as used in section
numbers
and equation numbers). At least in the first two cases,
we don't want to break after the full stop. If our bottom line is
that a full stop can never start a line, perhaps a better condition
is 'GetClass(next) == 6 || GetClass(next) == 8' (simply put, class 6 is
numeric digits and class 8 is alphabetic letters).
For myself and others, here I'm summarizing the difference between
the native class of the character in question(full stop) class 1 and
CHARACTER_CLASS (class 8). (I'm using the actual class numbers
used in C++ code instead of comments)
(see
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#169)
1. breaking before
- No line breaking is allowed before class 1
- No line breaking is allowed before class 8 when the it's preceeded by
a character of class 0, 6(numeric), 7, and 8(alphabetic letters).
That is, a character of class 8 can start a line if it's following
a character of class 1 - 5.
- difference : if prev == class [1..5], class 8 starts a line but
class 1 cannot even in that case.
- interim conclusion : we don't have to special case 'full stop'
*based on prev. char.* because we don't want it to start a line.
The only exception is when it follows space and is followed by
a numeric digit, which is taken care of before the fragment
of the code we're discussing. (see
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#362
)
2. breaking after
- line breaking is allowed after class 1 except when followed by
class 1
- line breaking is allowed after class 8 except when followed by
class 1, 6, 7, and 8
- difference : class 8 cannot end a line before class [6..8]
while class 1 can before class [6..8]. [1]
Based on this comparison and the fact that we don't want to break
after '.' in cases like 'a.3', '3.a' and probably '<cjk>.a' as well
(i.e. we don't want to break after '.' if it's followed by
class 6 and 8) and we don't want to start a line with '.'
unless it's at the beginning of fractions ( .357 ).
I propose to replace the condition 'next != U_SPACE' in
line #365
(http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp#365)
with a new condition 'GetClass(next) == 6 || GetClass(next) == 8'
Sorry again and thanks.
[1] The only class 7 character is U+2126 (Ohm sign). I wonder
what's so special about it to make it a class of its own :-)
Assignee | ||
Comment 8•22 years ago
|
||
> 1. break before
.....
> - difference : if prev == class [1..5], class 8 starts a line but
> class 1 cannot even in that case.
> - interim conclusion : we don't have to special case 'full stop'
> *based on prev. char.* because we don't want it to start a line.
Actually, we have to preserve this difference.Therefore, if prev. char. class
is [1..5], we should keep the native class of the full stop. In my
previous comment, I suggested using (next_class==6 || next_class=8).
Here I'm using (next_class > 5), which is equivalent because
class 9 is Thai and it breaks with everything and class 7(a class
made of a single character) behaves the same way as class 6/8
when following full-stop.
Attachment #121406 -
Attachment is obsolete: true
Assignee | ||
Comment 9•22 years ago
|
||
Comment on attachment 122108 [details] [diff] [review]
a new patch
Shanjian, I'm really sorry to bother you again with this.(I'm aware that you're
now in other projects). But I couldn't help because I discovered a problem with
my prev. patch.
I guess this is as good as I can make it within JIS X 0451 framework.
Boris, I'm sorry to you, too. There's a bit
more context in this patch and
I gave a rather long rationale for my
patch in the bugzilla.
Thank you !
Attachment #122108 -
Flags: superreview?(bzbot)
Attachment #122108 -
Flags: review?(shanjian)
Assignee | ||
Comment 10•22 years ago
|
||
Comment on attachment 122108 [details] [diff] [review]
a new patch
Sorry for spamming.
I realized that bz is away. I'm asking dbaron
for sr.
Attachment #122108 -
Flags: superreview?(bzbot) → superreview?(dbaron)
Assignee | ||
Comment 11•22 years ago
|
||
Comment on attachment 122108 [details] [diff] [review]
a new patch
Shanjian seems too busy (he's not any more on mozilla-side). Because Simon has
doen a lot of work on linebreaking, I'm asking him for review.
Thanks
Attachment #122108 -
Flags: superreview?(heikki)
Attachment #122108 -
Flags: superreview?(dbaron)
Attachment #122108 -
Flags: review?(smontagu)
Attachment #122108 -
Flags: review?(shanjian)
Assignee | ||
Updated•22 years ago
|
Depends on: line-breaking
Comment on attachment 122108 [details] [diff] [review]
a new patch
>+ if( pc > 5 && pc < 1 && GetClass(next) > 5 )
This can't be right since this if-clause can never be true (pc can never be
more than 5 and less than 1 at the same time).
Attachment #122108 -
Flags: superreview?(heikki) → superreview-
Assignee | ||
Comment 13•22 years ago
|
||
Assignee | ||
Comment 14•22 years ago
|
||
In place of (pc < 1 && pc>5), I have ( pc==0 || pc > 5). It's beyond me how
that blunder slipped in (perhaps I hand-edited the patch ....)
Attachment #122108 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #122108 -
Flags: review?(smontagu)
Assignee | ||
Comment 15•22 years ago
|
||
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed
sorry for unnecessary spamming due to my mistake.
Attachment #123780 -
Flags: superreview?(heikki)
Attachment #123780 -
Flags: review?(smontagu)
Comment 16•22 years ago
|
||
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed
r=smontagu
Attachment #123780 -
Flags: review?(smontagu) → review+
Comment 17•22 years ago
|
||
jshin: I am assigning this to a right owner. I hope you don't mind.
Assignee: yokoyama → jshin
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed
>Index: intl/lwbrk/src/nsJISx4501LineBreaker.cpp
>===================================================================
>+ // character/numeric class (class 8, 6, and 7. class 9 (Thai)
Wouldn't it make more sense to say classes 6-8?
sr=heikki
Attachment #123780 -
Flags: superreview?(heikki) → superreview+
Assignee | ||
Comment 20•22 years ago
|
||
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed
Thank you for r/sr. I'll change the comment as you suggested.
Asking for a1.4. This is a low risk patch that will make Mozilla more compliant
to JIS X 4501 and UTR #10.
Attachment #123780 -
Flags: approval1.4?
Comment 21•22 years ago
|
||
Have we been referring to JIS X 4051 as JIS X 4501?
I don't think there is such a standard as "JIS X 4501". Gosh,
we even have a file name including "4501"? Shouldn't this
error be corrected in the source?
Assignee | ||
Comment 22•22 years ago
|
||
Thank you for the note.
I'm always confused as to which is the correct name, JIS X 4051 or JIS X 4501,
partly due to the fact that Mozilla uses both names. We can rename
*JISX4501*cpp. The only downside is that we'll lose the cvs change record for
the old file (, which is a weak point of CVS.)
BTW, in comment #20, it should have been UTR #14 instead of UTR #10.
Assignee | ||
Comment 23•22 years ago
|
||
fix checked in to the trunk. waiting for a1.4
OS: Windows 2000 → All
Hardware: PC → All
Comment 24•22 years ago
|
||
Please don't include spurious style/spacing changes in patches; it makes them
harder to review, and also makes cvs blame much less functional.
Comment 25•22 years ago
|
||
Or if you do clean up spacing, attacha diff -wu as well.
/be
Comment 26•22 years ago
|
||
Comment on attachment 123780 [details] [diff] [review]
a new patch with the stupid mistake fixed
a=asa (on behalf of drivers) for checkin to the 1.4 branch.
Attachment #123780 -
Flags: approval1.4? → approval1.4+
Comment 27•22 years ago
|
||
Here is an additional testcase with regard to quotation marks (ubiquotous ' and
" as well as paired ones; U+2018, U+2019, U+201C, U+201D). It is encoded in
UTF-8, and modified from attachment 121366 [details].
Assignee | ||
Comment 28•21 years ago
|
||
asa, sorry I forgot to note that the patch had been checked in for 1.4branch.
as for attachment 124625 [details], thanks for Issac for alerting me to this issue. It's
really curious that U+2019 (single-high 9) and U+201D(double-high-9) behave
differently (case 4 and case 5). A quick glance at intl/lwbrk tells me a
different story (they're both class 1 and are not supposed to start a line.).
Depending on the cause, we may have to open a new bug.
Assignee | ||
Comment 29•21 years ago
|
||
I don't know how I forgot that U+2019 is also subjected
to contextual analaysis. It's dealt with in the 'if-clause' right after that for
full-stop. It may be handled similarly, but that may not be the best.
Assignee | ||
Updated•21 years ago
|
Blocks: line-breaking
No longer depends on: line-breaking
Comment 30•17 years ago
|
||
jshin:
Why is this bug still opened? I think this bug should be marked as fixed. If there are other bugs, we should file new bugs for them.
Updated•15 years ago
|
QA Contact: amyy → i18n
Comment 31•4 years ago
|
||
Marking as fixed as per last comment.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•