[meta] Unified Segmenter 2021
Categories
(Core :: Internationalization, task, P3)
Tracking
()
People
(Reporter: zbraniecki, Unassigned)
References
(Depends on 4 open bugs, Blocks 4 open bugs)
Details
(Keywords: meta)
We are kicking off a new project to implement a unified segmentation model for the layout engine based on UAX#14/UAX#29, and to offer it to SpiderMonkey to back ECMA-402 Intl.Segmenter.
The effort is going to be part of the ICU4X project, and initially will live as a branch of icu4x in https://github.com/aethanyc/icu4x
Once we're ready to loop it into Gecko, we'll file specific integration bugs and mark them as blocking this one.
Until then, this meta bug will collect bugs we hope to address in the rewrite.
Comment 1•4 years ago
This came up in https://github.com/w3c/editing/issues/278. Is there a standardization effort backing this? Web developers are being bitten by the fact that, even on a single platform, browsers behave differently from each other. (Differences between platforms are expected.)
Comment 2•4 years ago (Reporter)
I don't think there is. We're going to follow the Unicode UAX#14 and UAX#29 standards, which should help, but the biggest goal for us is to end up with a single set of logic and data powering both layout segmentation and JavaScript segmentation.
By following UAX#14/UAX#29, we're likely to close most of the gap with what engines powered by ICU4C are doing.
Comment 3•4 years ago
I'm just curious, how will this impact complex text layout languages? Will it ship the data needed for Thai, Burmese, Khmer, and Lao?
Comment 4•4 years ago
> I'm just curious, how will this impact complex text layout languages? Will it ship the data needed for Thai, Burmese, Khmer, and Lao?
Yes, we are planning to support these languages. The data will be either a model trained via machine learning or a dictionary, as in ICU.
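For readers unfamiliar with the dictionary approach, here is a minimal sketch of the general idea, using greedy longest-match lookup. It is an illustration only and does not claim to reflect how ICU or ICU4X actually implement segmentation; the function name `segment` and the two-word dictionary are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (illustrative sketch only).

    `dictionary` is a set of known words. Characters that start no
    dictionary word are emitted as single-character tokens.
    """
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary; a real one holds many thousands of entries.
toy_dictionary = {"ภาษา", "ไทย"}
print(segment("ภาษาไทย", toy_dictionary))  # ['ภาษา', 'ไทย']
```

The dictionary must ship in full with the segmenter, which is why a compact trained model could be attractive for data size.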
Comment 5•4 years ago
That's awesome!
It would be neat to learn more about the process of building machine learning models - as that could be a huge data size benefit for a number of these languages.
I'm helping maintain the Lao data in ICU and Hunspell Lao, and have personal contacts who have worked with the other regional languages (Thai, Khmer, and Burmese), so I'm sure it would be possible to gather plenty of decent training data to develop some really good basic models.
Comment 6•3 years ago
No longer blocks bug 1381019, because of the workaround that landed in bug 1713973.