word boundary detection for Thai text
Categories
(Core :: Layout: Text and Fonts, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox77 | --- | fixed |
People
(Reporter: arthit, Assigned: jfkthame)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: intl)
Attachments
(10 files, 8 obsolete files)
Attachment | Type | Review |
---|---|---|
(deleted) | patch | roc: review+ |
(deleted) | patch | roc: review+ |
(deleted) | patch | roc: review+ |
(deleted) | patch | roc: review+ |
(deleted) | patch | roc: review+ |
(deleted) | patch | roc: review+ |
(deleted) | text/x-review-board-request | |
(deleted) | text/x-phabricator-request | |
(deleted) | text/x-phabricator-request | |
(deleted) | text/x-phabricator-request | |
Comment 65•5 years ago
Our monthly meetings with the Thai community keep bringing up this bug. It is a major usability issue and hinders bringing new contributors into the community.
Comment 66•5 years ago
Word boundary detection is used for various text-related functions
in Firefox, notably caret keyboard navigation, mouse double-click
selection, and three-tap dictionary lookup on macOS. The current
"naive" word breaker uses boundaries between different scripts, which
breaks languages such as Thai, Mandarin, and Japanese (mixed
Hiragana and Han characters [Kanji]) only into large chunks of text.
This patch utilizes the complex script services provided by the platform
that are already being used by moz::intl::LineBreaker, adapting
similar code for moz::intl::WordBreaker.
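To illustrate the script-boundary behavior described in this comment, here is a minimal C++ sketch of a breaker that only reports a boundary where the character class changes. It is not Gecko's WordBreaker code: the names CharClass, ClassOf, and NaiveWordBreaks are hypothetical, and the class table is reduced to a few ranges for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Coarse character classes, loosely modelled on a script-based word breaker.
// The names and ranges are illustrative assumptions, not Gecko's actual tables.
enum class CharClass { Space, Latin, Thai, Other };

CharClass ClassOf(char32_t c) {
  if (c == U' ' || c == U'\t' || c == U'\n') return CharClass::Space;
  if ((c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')) {
    return CharClass::Latin;
  }
  if (c >= 0x0E00 && c <= 0x0E7F) return CharClass::Thai;  // Thai block
  return CharClass::Other;
}

// Returns every offset at which the character class differs from that of the
// preceding character. A run of same-class characters is never split.
std::vector<std::size_t> NaiveWordBreaks(const std::u32string& text) {
  std::vector<std::size_t> breaks;
  for (std::size_t i = 1; i < text.size(); ++i) {
    if (ClassOf(text[i]) != ClassOf(text[i - 1])) {
      breaks.push_back(i);
    }
  }
  return breaks;
}
```

For the Latin text "hello world" this reports a boundary on each side of the space. For the two-word Thai phrase ภาษาไทย it reports none, because every character falls in the same class, which is the "large chunk" problem this bug is about.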
Comment 67•5 years ago
I'd like to bump this issue, as it affects Mandarin and Japanese hiragana as well. Double-clicking, Option + arrow key caret movement, and, most importantly, dictionary lookup all select the whole chunk of text.
Additional Scenarios for Mandarin and Japanese
Han characters generally don't work well:
山林開放後|,|我們與動物的距離 (current behavior)
山林|開放|後|,|我們|與|動物|的|距離 (expected)
Japanese words are more complicated, as a single word can mix Han and Hiragana (but not Katakana):
雨|と|気温上昇|で|雪解|けが|進|む (current behavior, grouping words by script type)
雨|と|気温|上昇|で|雪解け|が|進む (expected, native behavior in macOS)
Current Status
I've refreshed the patch on Phabricator, then built and tested it on macOS 10.14. It works for the Thai examples provided above, but because line-breaking rules for East Asian scripts are so liberal, NS_GetComplexLineBreaks basically means break-all there.
Proposed Solution
Probably related to Bug 1275486. For complex scripts and Han/Hiragana characters, I think we need to pass a different parameter to the platform text engine, which in the macOS case would be kCFStringTokenizerUnitWord instead of kCFStringTokenizerUnitSentence. This would probably be implemented in a function similar to NS_GetComplexLineBreaks (say, NS_GetComplexWordBoundary).
Thoughts?
Comment 68•5 years ago
Ideally, we would have a behavior that is (a) the same across platforms, (b) the same across browsers, and (c) something that we could standardize. How does what you propose fit these criteria?
Comment 69•5 years ago
... and I suppose it's also worth asking how it aligns with what CSS Text Level 3 specifies.
Comment 70•5 years ago
@dbaron Thanks for pointing these out.
My understanding is that LineBreaker is used to implement CSS word-break and line-break, hence it is properly documented and carefully distinguishes between approaches for different scripts. It is the class that calls into platform-specific language services for better line breaking (see intl/lwbrk/LineBreaker.cpp, lines 1072–84).
WordBreaker, however, is not used in rendering, nor has it followed any web standard. A full grep of the source shows that it is only used for caret movement, find in page, the spellchecker, dictionary lookup, and other UI functions. Thus, it is important for these to be consistent with native platform behavior. The previous suggestion basically gives alphabets a fast track and passes the rest to the system.
Assignee | ||
Comment 71•5 years ago
That's right, according to my understanding. The terminology is a bit confusing, because e.g. the CSS word-break property is one of the things that controls line breaking for layout, but that's not what Gecko's WordBreaker is for -- it provides word boundary analysis for various user-interaction purposes. I don't think CSS Text really has anything to say about this.
What might become relevant here is that there's a proposal to add a text segmentation API to JavaScript, which would provide word-boundary services (among other things). If this happens, it will probably involve reimplementing what WordBreaker does in a more thorough and better-standardized way, possibly based on ICU or a Rust implementation of UAX 29. If/when that happens, it's likely to supersede anything done here.
However, that's still under discussion and the way forward is not entirely clear. In the meantime, I think that if we can improve the behavior for languages such as Thai by means of a fairly straightforward patch to the existing code, this would be a valuable interim solution. It's unfortunate that Thep's work several years ago came so close to being ready to land, but we weren't able to get over the finish line at that time.
I've just pushed the current patch from phabricator to tryserver to see how things look across platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=68211de31fd686e359ce3147900ae4fe65420954.
Comment 72•5 years ago
@jfkthame Thank you for summing it up. UAX 29 is a pretty good read, and from this note under 4.1.1:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
I'd say that if we ever implement UAX 29, it would be a proper replacement only for the current WordBreaker script-guessing hack. Unless we're going full ICU with dictionaries, we still need to pass such text to the platform, just as you mentioned.
I've never used the try server before! I'll see if I can make Japanese text work, based on the current test results.
Reporter | ||
Comment 73•5 years ago
Possibly related: the W3C Southeast Asian Layout Task Force (sealreq) is now developing a document on complex text layout languages (Thai, Lao, Khmer) here: https://github.com/w3c/sealreq/
Assignee | ||
Comment 75•5 years ago
Poren: I took your patch and made some minor fixes to resolve crashes/failures that showed up on tryserver, and then I have also updated it to handle the various Southeast Asian writing systems that do not use interword spaces (Khmer, Lao, etc) in addition to Thai, based on the Unicode script property of the characters. This seems to be working pretty well in my (limited) testing.
I omitted sending Han & Hiragana text through the complex breaker, as I'm not sure using line-break positions as word boundaries will work so well there; it tends to be too eager to break. Maybe that would still be better than the current behavior, but it's less clear to me. Really, we need to call a slightly different platform API (or integrate a real text segmentation component that knows about Chinese & Japanese). Anyhow, what I suggest is that in this bug we just target the SEAsian alphabets (like Thai) that do not have word spaces; if we have ideas for how to handle Japanese/Chinese better, we can do a followup bug about that.
If you'd like to test this version of the patch, and let me know of any further issues/suggestions, that would be really helpful - thanks!
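The routing decision described in this comment can be sketched roughly as follows. This is a simplified illustration, not the landed patch: the names IsSpacelessSEAsian, Breaker, and ChooseBreaker are hypothetical, and Unicode block ranges stand in for the Unicode script property that the patch is said to use.

```cpp
// Approximates "belongs to a Southeast Asian script written without
// inter-word spaces" using Unicode block ranges. This is an assumption made
// for brevity; the comment above describes using the Unicode script
// property, which covers more characters than these four blocks.
constexpr bool IsSpacelessSEAsian(char32_t c) {
  return (c >= 0x0E00 && c <= 0x0E7F) ||  // Thai
         (c >= 0x0E80 && c <= 0x0EFF) ||  // Lao
         (c >= 0x1000 && c <= 0x109F) ||  // Myanmar
         (c >= 0x1780 && c <= 0x17FF);    // Khmer
}

// Which breaking path a character's run would be sent down.
enum class Breaker { Simple, Complex };

// Only the space-less Southeast Asian scripts go to the complex
// (platform-backed) breaker; everything else, including Han and Hiragana,
// stays on the existing simple path.
constexpr Breaker ChooseBreaker(char32_t c) {
  return IsSpacelessSEAsian(c) ? Breaker::Complex : Breaker::Simple;
}
```

With this split, a Thai character such as U+0E01 is routed to the complex breaker, while 山 (U+5C71) and と (U+3068) keep the current behavior, matching the decision to leave Chinese and Japanese for a follow-up bug.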
Assignee | ||
Comment 77•5 years ago
Just to note: there are other aspects of the WordBreaker code that should really be updated, too -- e.g. it doesn't handle surrogate pairs, which means it won't properly recognize supplementary-plane characters. But for clarity, I haven't included that here; this patch just aims to address the specific issue of "borrowing" the complex line-breaker behavior for these SEAsian scripts, without disrupting anything else. Other fixes/improvements should be separate bugs (or may be superseded by a new text-segmentation component, if/when we adopt an ICU- or Rust-based implementation for that).
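For context on the surrogate-pair gap mentioned above, here is a minimal sketch of what reading UTF-16 text by code point (rather than by 16-bit unit) involves. The helper names are hypothetical and this is not taken from the WordBreaker source; it only shows the standard UTF-16 decoding arithmetic.

```cpp
#include <cstddef>
#include <string>

constexpr bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool IsLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Reads one code point from UTF-16 text starting at index i and advances i
// past it. The caller must ensure i < s.size(). A well-formed surrogate pair
// is combined into a supplementary-plane code point; an unpaired surrogate is
// returned unchanged.
char32_t NextCodePoint(const std::u16string& s, std::size_t& i) {
  const char16_t hi = s[i++];
  if (IsHighSurrogate(hi) && i < s.size() && IsLowSurrogate(s[i])) {
    const char16_t lo = s[i++];
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10) +
           (static_cast<char32_t>(lo) - 0xDC00);
  }
  return hi;
}
```

A breaker that classifies each 16-bit unit on its own sees the two halves of a pair as separate, unclassifiable units, so any script lookup for supplementary-plane characters has to happen after a decoding step like this one.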
Assignee | ||
Comment 78•5 years ago
Depends on D71206
Comment 80•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/73938a98d8fc
https://hg.mozilla.org/mozilla-central/rev/e6483e810796
Comment 81•5 years ago
Nice to see this fixed, thanks so much Jonathan!
It's been ~10 years since Barcamp Bangkok, fond memories of working with local hackers there and helping to file and push forward these types of issues that matter so much to daily use of Firefox in Thai language :D
Comment 82•5 years ago
Jonathan: Thank you so much, glad to see it merged!
I finally had time to revisit the codebase (given the epidemic), and only now do I realize why you mentioned the upcoming IntlSegmenter component. It totally makes sense now. 😅
I'll start a new bug to track possible mitigations for CJK scripts, as IntlSegmenter probably won't land for at least half a year? Bug 345823 seems stale.