Closed
Bug 56652
Opened 24 years ago
Closed 1 year ago
More intelligent Unicode-compatible linebreaking algorithms (UAX #14) needed
Categories
(Core :: Internationalization, enhancement, P4)
RESOLVED
DUPLICATE
of bug 1719535
Future
People
(Reporter: cmattar, Assigned: m_kato)
References
(Blocks 2 open bugs)
Details
(Keywords: intl, testcase, Whiteboard: [p-ie])
Attachments
(1 file: text/html, deleted)
Currently the line only breaks on whitespaces. There are some more cases where
linebreaking wouldn't hurt (and IE seems to agree with me there), see the
attachment for example.
I tested this using Mozilla M18 on Win98, but I don't think this is a
regression.
Assigning to Layout, don't know if this bug's right there.
Reporter | ||
Comment 1•24 years ago
|
||
Comment 2•24 years ago
|
||
Should this go to Internationalization?
Comment 3•24 years ago
|
||
UNICODE has a zero-width space for this kind of thing -- are we sure this
is valid? I'm always dubious of heuristics, and this would seem to fall
into that category...
Comment 4•24 years ago
|
||
Reassigning to Buster and marking future.
Assignee: clayton → buster
Target Milestone: --- → Future
Comment 6•23 years ago
|
||
FWIW, <URL:http://www.unicode.org/unicode/reports/tr14/> provides some guidance
on line-breaking implementation.
Comment 7•23 years ago
|
||
*** Bug 147836 has been marked as a duplicate of this bug. ***
Comment 8•23 years ago
|
||
As I see it, this bug is two years old. I ran into it again today and reported
it as bug 147836.
There probably isn't a simple solution for all languages, but what do you think
about starting with simple word splitting, using a fixed maximum number of
characters per word? It would solve the worst cases of this bug.
I have seen this on a page with many paragraphs, all were turned into 10
monitors wide lines by this bug.
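A minimal sketch of the fallback proposed in this comment, assuming plain whitespace-separated text. The function name and the default limit of 20 characters are hypothetical, chosen only for illustration; this is not how Gecko splits words.

```python
def split_long_words(text, max_len=20):
    """Split on whitespace, then chop any word longer than max_len.

    Illustrative only: the cut points ignore language and script, which
    is exactly the weakness of a fixed-length fallback.
    """
    pieces = []
    for word in text.split():
        # Words that already fit yield a single chunk.
        for start in range(0, len(word), max_len):
            pieces.append(word[start:start + max_len])
    return pieces
```

Calling `split_long_words("ab cdefgh", 3)` gives `["ab", "cde", "fgh"]`: the short word is untouched and the long one is cut every three characters.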
Comment 9•22 years ago
|
||
Layout doesn't even break lines on hyphens, for crying out loud.
OS: Windows 98 → All
Hardware: PC → All
Comment 10•22 years ago
|
||
There are two types of hyphens according to the HTML4.01 spec:
http://www.w3.org/TR/html401/struct/text.html#h-9.3.3
In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen.
The plain hyphen should be interpreted by a user agent as just another
character (i.e no special breaking behavior) The soft hyphen tells the user
agent where a line break can occur.
Those browsers that interpret soft hyphens must observe the following semantics:
If a line is broken at a soft hyphen, a hyphen character must be displayed at
the end of the first line. If a line is not broken at a soft hyphen, the user
agent must not display a hyphen character. For operations such as searching and
sorting, the soft hyphen should always be ignored.
In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;).
The soft hyphen is represented by the character entity reference &shy; (&#173;
or &#xAD;)
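The soft-hyphen semantics quoted above can be sketched as follows. This is a simplified illustration, assuming a single word measured in characters rather than rendered widths; the function names are hypothetical, and a first segment longer than the width is left unbroken.

```python
SOFT_HYPHEN = "\u00ad"


def break_at_soft_hyphens(word, width):
    """Greedily break one word at soft hyphens so each line fits width."""
    segments = word.split(SOFT_HYPHEN)
    lines = []
    current = segments[0]
    for i, seg in enumerate(segments[1:], start=1):
        is_last = i == len(segments) - 1
        # Keep one column free for a visible hyphen unless the word ends here.
        reserve = 0 if is_last else 1
        if len(current) + len(seg) + reserve <= width:
            current += seg               # unused soft hyphen: nothing displayed
        else:
            lines.append(current + "-")  # used soft hyphen: hyphen displayed
            current = seg
    lines.append(current)
    return lines


def searchable(text):
    """For searching and sorting, soft hyphens are ignored."""
    return text.replace(SOFT_HYPHEN, "")
```

With `"inter\u00adnation\u00adal\u00adization"` and a width of 10, the result is `["inter-", "national-", "ization"]`; with a width of 100 the word stays on one line and no hyphen is displayed.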
Comment 11•22 years ago
|
||
"(i.e., no special breaking behavior)" isn't part of the spec and isn't really
what the spec means. Rather, the relevant section is lower, in 9.3.5, which
uses Western scripts as an example and gives incorrect rules. I'd say the
example is non-normative and can be ignored.
Comment 12•22 years ago
|
||
dbaron: Should we be breaking a long repeating line of hyphens?
This is an issue in http://bugscape.mcom.com/show_bug.cgi?id=15288
Comment 13•22 years ago
|
||
see also bug 157967 : Mac OS X needs to use the ATSUI-services which, among
other things, will do the line breaking for you.
Comment 14•22 years ago
|
||
Altering summary: Unicode seems to lay down some normative linebreaking
behavior, although they don't define a full linebreaking algorithm per se.
<URL:http://www.unicode.org/unicode/reports/tr14/>. (More precisely, it defines
in what places lines may, must, or must not break; whether or not the layout
takes advantage of a possible linebreak is left to a higher-level algorithm.)
Keywords: testcase
Summary: More intelligent linebreaking algorithms needed → More intelligent Unicode-compatible linebreaking algorithms needed
Comment 15•22 years ago
|
||
*** Bug 175578 has been marked as a duplicate of this bug. ***
Comment 16•22 years ago
|
||
IE breaks on slashes (as the unicode standard recommends), Mozilla does not.
Adding [p-ie] to whiteboard.
Whiteboard: [p-ie]
Comment 17•22 years ago
|
||
ATSUI claims to support Unicode 3.2 (which defines the line breaks as noted
above, http://www.unicode.org/unicode/reports/tr14/):
http://developer.apple.com/techpubs/macosx/Carbon/text/ATSUI/
ATSUI_Concepts/atsui_app_unicode/index.html
(yeah I inserted a space ... ;-)
"ATSUI provides full layout support for Unicode 3.2 and supports text rendering
for all the features required by scripts included with version 2.1 of the
Unicode standard or later. "
So using ATSUI to render at the paragraph level on Mac OS X would fix this bug
on Mac OS X at least (obviously this == fixing bug 157967).
Comment 18•22 years ago
|
||
Mozilla currently implements JIS X 4051 and Thai linebreaking.
(see files in http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/).
There are several differences between JIS X 4051 (the only linebreaker
implemented so far) and UTR #14. They include (but are not limited to)
- treatment of NBSP, ZWNBSP, CGJ : the current linebreaker doesn't
implement 'do not break after. do not break before, either'
UTR #14>
GL * Non-breaking (“Glue”) NBSP, ZWNBSP,CGJ prohibit line breaks before
or after
Currently, Mozilla breaks before (after) NBSP if what follows (precedes)
it is CJK Ideograph or Hangul syllables.
- In (current) JIS X 4051 (implementation), Euro (U+20AC) and
other currency signs are class 8 while Yen(U+00A5) and Pound(U+00A3)
are class 3. UTR #14 stipulates that they be treated consistently.
- comma is treated per UTR, but fullstop is not (see bug 164759.
A simple 2-line patch will fix this). Other characters in
UTR#14 IS category need to be taken care of.
UTR> IS - Numeric Separator (Infix) (XB)
Characters that usually occur inside a numerical expression may
not be separated from following numeric characters, unless space
character intervenes. Since they are otherwise sentence ending
punctuation, they prevent breaks before.
- UTR #14 prohibits break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces,
but JIS X 4051 allows break before '/'.
FYI, other bugs on linebreaking (we may need a tracking bug opened) are:
bug 193212, bug 203016, bug 178290, bug 172052, bug 164759(dup: bug 202833),
bug 162049(closed), bug 162940 and more.
It has to be noted that *not all* rules in UTR #14 are
normative and we can tailor them(non-normative rules)
as we see fit (per lang/locale or based on other criteria).
Keywords: intl
Comment 19•22 years ago
|
||
> fullstop is not (see bug 164759. A simple 2-line patch will fix this).
Actually, a bit more work is necessary. We have to add a new class
(break neither before nor after) and assign that class to fullstop in
some context (for instance, between 'e' and 'g' in 'e.g.').
We need that class anyway for NBSP/ZWNBSP/CGJ.
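A rough sketch of what a "break neither before nor after" class could look like, assuming a simplified reading of the GL rule quoted in comment 18. The function names, the single CJK range, and the rule set are assumptions for illustration; this is neither the JIS X 4051 code nor the full UTR #14 pair table (which, for example, treats a space before a glue character differently).

```python
# NBSP, ZWNBSP and CGJ, the glue characters named in comment 18.
GLUE = {"\u00a0", "\ufeff", "\u034f"}


def is_ideograph(ch):
    # Only the main CJK Unified Ideographs block, to keep the sketch short.
    return "\u4e00" <= ch <= "\u9fff"


def break_opportunities(text):
    """Return each index i where a break between text[i-1] and text[i] is allowed."""
    result = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        if before in GLUE or after in GLUE:
            continue  # glue: break neither before nor after
        if after == " ":
            continue  # never break before a space
        if before == " " or is_ideograph(before) or is_ideograph(after):
            result.append(i)
    return result
```

Under these assumptions `break_opportunities("漢字")` returns `[1]`, while `break_opportunities("漢\u00a0字")` returns `[]`: the NBSP suppresses both breaks that the current implementation is described as allowing.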
Comment 20•22 years ago
|
||
Will Mozilla's linebreaking algorithm include BPH -- break permitted here? (Or
something equivalent.)
Does bug 172819 belong to this "bug family" too?
Comment 21•22 years ago
|
||
Sorry for spamming.
Adding a few more to CC. BTW, note that I uploaded a fix for fullstop case in
bug 164759 (attachment 121406).
Updated•22 years ago
|
Depends on: line-breaking
Comment 22•21 years ago
|
||
->Fonts & Text
Assignee: attinasi → font
Component: Layout → Layout: Fonts and Text
QA Contact: petersen → ian
Updated•21 years ago
|
Priority: P3 → --
Target Milestone: Future → ---
Comment 23•21 years ago
|
||
This is a serious usability problem, because it causes pages with long
hyperlinks to grow excessively wide. If fixing to Unicode is going to be
futured, then a workaround to line-break on slashes and hyphens should be put in
place in the meantime.
Updated•21 years ago
|
Priority: -- → P4
Target Milestone: --- → Future
Updated•21 years ago
|
Blocks: line-breaking
No longer depends on: line-breaking
Comment 24•21 years ago
|
||
re: comment 23 (Simon Woodside): Breaking after ASCII hyphen (hyphen-minus) is
now bug 95067; breaking after slash is bug 218580. Also, handling of
soft-hyphen is bug 9101.
re: comment 20 (Torsten Bronger): I am unfamiliar with "BPH", but Unicode does
provide a zero-width space (U+200B -- note that 'zwsp' is not a defined entity
name) to use as a non-visible author-specified break point -- and Mozilla does
in fact handle this correctly.
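As an illustration of author-specified break points, an author-side helper could insert U+200B after path separators in a long URL. This is only a sketch; the helper name is hypothetical, and the choice to skip the first slash of "//" and a trailing slash is an assumption, not a rule from any spec.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def add_break_hints(url):
    """Insert a zero-width space after each slash that is followed by a non-slash."""
    return re.sub(r"/(?=[^/])", "/" + ZWSP, url)
```

The text displayed is unchanged, but a layout engine that honours U+200B may now wrap the URL after those slashes. The inserted characters do become part of the text, so they should be stripped before the string is used as an actual address.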
Comment 25•21 years ago
|
||
I had a discussion about U+200B in comp.text.xml last year. I don't like it
because the Unicode specs say that it "may expand in justification" which is
totally unacceptable of course. But with this Unicode description I must assume
that some XML interpreters do/will treat it like this.
Comment 26•21 years ago
|
||
*** Bug 222057 has been marked as a duplicate of this bug. ***
Comment 27•19 years ago
|
||
This bug hasn't seen any activity in a couple years, but it still seems to be a problem. What's up?
Comment 28•19 years ago
|
||
Lack of resources. Patches accepted.
Comment 29•18 years ago
|
||
smontagu asked me to comment here -- basically I agree with Chris Hoess's assessment in comment 14.
Most of UAX #14 is non-normative. (This will be even clearer in the next revision.) The normative rules deal mainly with line breaking control characters, and should be implemented as specced in the proposed update.
The rest of it is tailorable, and many of those rules are impractical unless we also implement prioritization. E.g. we should allow breaks after hyphens as suggested, but only if they are at a lower priority than spaces.
IMHO UAX#14's non-normative rules should be viewed more as hints on how to do things right rather than a specification for how to do things right. It collects together a lot of hard-to-find and useful information about line breaking, but it's not a complete usable algorithm, its heuristics are not always the best, and sometimes it's just wrong.
So, to summarize, work on line breaking at punctuation other than spaces should
a) implement prioritization
b) use UAX14 as a starting point but also
c) use common sense, expert opinion, and/or research to support any changes
from what we do today, not just blindly implement UAX14's pairs table
d) use the latest proposed update to UAX 14 [1], as it fixes some substantial
errors in the latest approved version
[1] http://unicode.org/reports/tr14/tr14-20.html
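A minimal sketch of the prioritization idea in point a), assuming a strict two-level scheme in which a break after a space always wins over a break after a hyphen or slash. The priority table, the function name, and measuring width in characters are all assumptions for illustration; a real implementation might weigh priorities against line raggedness instead.

```python
# Lower number means preferred. Hypothetical table, not taken from UAX #14.
PRIORITY = {" ": 0, "-": 1, "/": 1}


def choose_break(text, width):
    """Return the index at which to end the first line, or None.

    A break at index i puts text[:i] on the first line. None means the
    text already fits or there is no candidate break within width.
    """
    if len(text) <= width:
        return None
    best = None  # (priority, -index): smaller tuples are better
    for i in range(1, len(text)):
        prio = PRIORITY.get(text[i - 1])
        if prio is None:
            continue
        # Trailing spaces are allowed to hang past the line end.
        if len(text[:i].rstrip(" ")) > width:
            break
        key = (prio, -i)
        if best is None or key < best:
            best = key
    return None if best is None else -best[1]
```

For `"see well-known sites"` with a width of 12, this picks index 4 (after "see ") even though "see well-" would also fit, because the space outranks the hyphen. For `"well-known-identifier"` there is no space, so it falls back to the last hyphen that fits, index 11.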
Comment 30•17 years ago
|
||
This bug is affected by the recent fix to bug 95067 (which, I suppose, is actually a duplicate, although at a little more specific level). The fix allows linebreaking in connection with hyphens, slashes and a number of other characters, mainly by imitating the linebreaking behavior of WinIE 7. See the comparison table between Gecko, IE 7 and Opera 9.2:
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/spec_table.html
While offering a solution to the lay-out problems caused by URLs and other very long strings, the fix seems to introduce rather undesirable side-effects. For example, linebreaking is allowed after the slash in "c/o", and both before and after the parentheses in "colo(u)ring".
I was considering filing bugs for some of the new issues but I haven't had the opportunity to test them properly. And then I found this bug and realized that basically they all concern the same subject, so perhaps the discussion should continue here.
I don't think imitating IE's over-simplified linebreaking algorithms is the right thing to do. Mozilla has made its reputation by being better than IE, even if doing so caused some web-sites that were optimized for IE to look bad. Now, the competition has finally forced even Microsoft to bring its browser to the 21st century. This is not the time to lower the standards and start trailing them.
Comment 31•17 years ago
|
||
By the way, when considering the applicability of UAX #14, the general criticism by Jukka Korpela might be worth taking into account (although it isn't quite up to date with the most recent revisions):
http://www.cs.tut.fi/~jkorpela/unicode/linebr.html
There is even a more extensive article about word division in IE and the problems it causes especially from the point of web-authoring:
http://www.cs.tut.fi/~jkorpela/html/nobr.html
Comment 32•17 years ago
|
||
Can we have an update on this bug?
By the way, bug 346969 is now fixed, but I cannot close it.
Comment 33•16 years ago
|
||
Bug 450088 is related to this issue. I also have a zlib-licensed implementation of UAX #14 available at:
http://vimgadgets.cvs.sourceforge.net/vimgadgets/common/tools/linebreak/
Updated•15 years ago
|
Assignee: layout.fonts-and-text → nobody
QA Contact: ian → layout.fonts-and-text
Comment 34•12 years ago
|
||
I think that this is now important for compatibility with other browsers. I'll try to implement it as a new class that can be selected with a pref. I think that when we enable the new class by default, we should remove the pref and the current implementation.
Assignee: nobody → masayuki
Severity: minor → normal
Component: Layout: Text → Internationalization
Priority: P4 → --
Summary: More intelligent Unicode-compatible linebreaking algorithms needed → More intelligent Unicode-compatible linebreaking algorithms (UAX #14) needed
Assignee | ||
Comment 35•12 years ago
|
||
If we import libicu (bug 724531 and bug 820261) into our code, we can handle this more easily instead of creating new table.
Comment 36•12 years ago
|
||
(In reply to Makoto Kato from comment #35)
> If we import libicu (bug 724531 and bug 820261) into our code, we can handle
> this more easily instead of creating new table.
Is it enough for our requirement? Probably, if we implement UAX #14 strictly, we break compatibility with a lot of websites. So, we need to add similar customization added in current line breaker. Is it possible?
Comment 37•12 years ago
|
||
Looks like it's not capable of CSS3 text, such as line-break. I don't think that we should use 3rd party's library for line breaker because it's too sensitive for compatibility and performance.
Comment 38•12 years ago
|
||
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #37)
> Looks like it's not capable of CSS3 text, such as line-break.
What exactly do you mean here? Do you think we need to change the spec? If so, which part (5.1? 5.2?)
Comment 39•12 years ago
|
||
It seems that Chromium also uses their own table for compatibility:
http://mxr.mozilla.org/chromium/source/src/third_party/WebKit/Source/WebCore/rendering/break_lines.cpp#71
(In reply to John Daggett (:jtd) from comment #38)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #37)
> > Looks like it's not capable of CSS3 text, such as line-break.
>
> What exactly do you mean here? Do you think we need to change the spec? If
> so, which part (5.1? 5.2?)
No. If we would use ICU line breaker, the library should have all behavior defined by CSS3 Text and the behavior should have compatibility with current Gecko and other browsers moderately, especially in ASCII character range.
Comment 40•12 years ago
|
||
If ICU supports complex line breaking script, it's worthwhile to use ICU only for them, I think. Currently, we use native API's line breaker for them. So, Gecko doesn't behave same on all platforms for such language users.
Assignee | ||
Comment 41•12 years ago
|
||
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #40)
> If ICU supports complex line breaking script, it's worthwhile to use ICU
> only for them, I think. Currently, we use native API's line breaker for
> them. So, Gecko doesn't behave same on all platforms for such language users.
The platform's line breaker may not find the correct line break positions for complex languages such as Khmer. We should use another approach (e.g. libicu) for these languages.
Also, actually, even if not complex script, line breaker isn't compatible on each browser implementation. See http://w3c-test.org/framework/results/i18n-css3-text/.
Comment 42•12 years ago
|
||
Hmm, Chromium might use ICU for the fallback class of non-ASCII characters. But I'm not sure if the build option (ICU_UNICODE) is enabled in the default setting. And if it's enabled, I'm not sure how they think about supporting the line-break property in the future.
(In reply to Makoto Kato from comment #41)
> Also, actually, even if not complex script, line breaker isn't compatible on
> each browser implementation. See
> http://w3c-test.org/framework/results/i18n-css3-text/.
Yes, but I think we can improve the compatibility in non-ASCII range since we have never used UAX #14 yet.
Comment 43•12 years ago
|
||
And probably, if we use ICU, it becomes more difficult to fix bug 389710.
Updated•8 years ago
|
Whiteboard: [p-ie] → [p-ie] [platform-rel-Intel]
Updated•8 years ago
|
Whiteboard: [p-ie] [platform-rel-Intel] → [p-ie]
Updated•4 years ago
|
Assignee: masayuki → nobody
Severity: normal → S3
Type: defect → enhancement
Priority: -- → P4
Assignee | ||
Updated•4 years ago
|
Assignee: nobody → m_kato
Status: NEW → ASSIGNED
Comment 48•1 year ago
|
||
We've integrated ICU4X line segmenter in bug 1719535, which is UAX 14 compatible.