Open Bug 1684927 (segmenter) Opened 4 years ago Updated 1 year ago

[meta] Unified Segmenter 2021

Categories

(Core :: Internationalization, task, P3)

People

(Reporter: zbraniecki, Unassigned)

References

(Depends on 4 open bugs, Blocks 4 open bugs)

Details

(Keywords: meta)

We are kicking off a new project to implement a unified segmentation model for the layout engine, based on UAX#14 and UAX#29, and to offer it to SpiderMonkey as the backend for ECMA-402 Intl.Segmenter.

The effort is going to be part of the ICU4X project, and will initially live as a branch of icu4x in https://github.com/aethanyc/icu4x

Once we're ready to loop it into Gecko, we'll file specific integration bugs and mark them as blocking this one.

Until then, this meta bug will collect bugs we hope to address in the rewrite.
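
For readers unfamiliar with the JavaScript side, here is a minimal sketch of what ECMA-402 Intl.Segmenter exposes to script. It only exercises the standard API as implemented by whatever engine runs it and says nothing about how the planned ICU4X-based backend works internally. The helper names `graphemes` and `words` are hypothetical, and the cast to `any` is an assumption made so the snippet compiles even where the TypeScript lib typings for Intl.Segmenter are not enabled.

```typescript
// Sketch only: standard ECMA-402 Intl.Segmenter usage, not Gecko or ICU4X internals.
// Assumption: cast to `any` in case the TS lib typings for Intl.Segmenter are unavailable.
const SegmenterCtor = (Intl as any).Segmenter;

// Hypothetical helper: split text into user-perceived characters
// (UAX#29 extended grapheme clusters).
function graphemes(text: string, locale: string = "en"): string[] {
  const segmenter = new SegmenterCtor(locale, { granularity: "grapheme" });
  const out: string[] = [];
  for (const s of segmenter.segment(text) as Iterable<{ segment: string }>) {
    out.push(s.segment);
  }
  return out;
}

// Hypothetical helper: return only the word-like segments (UAX#29 word boundaries),
// skipping whitespace and punctuation.
function words(text: string, locale: string = "en"): string[] {
  const segmenter = new SegmenterCtor(locale, { granularity: "word" });
  const out: string[] = [];
  for (const s of segmenter.segment(text) as Iterable<{
    segment: string;
    isWordLike?: boolean;
  }>) {
    if (s.isWordLike) out.push(s.segment);
  }
  return out;
}

// "a" followed by a family emoji built from three code points joined by ZWJ (U+200D).
const family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67";
console.log(graphemes("a" + family).length); // 2: "a" plus one emoji cluster
console.log(words("Hello, world!")); // [ 'Hello', 'world' ]
```

Results for dictionary- or model-driven scripts such as Thai, Lao, Khmer, and Burmese depend on the data the engine ships, so no output is shown for those here.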

Severity: -- → S3
Priority: -- → P3
Blocks: 359179
Blocks: 774965

This came up in https://github.com/w3c/editing/issues/278. Is there a standardization effort backing this? It's biting web developers that, even on a single platform, browsers behave differently from each other. (Differences between platforms are expected.)

I don't think there is. We're going to follow the Unicode UAX#14 and UAX#29 standards, which should help, but our biggest goal is to end up with a single set of logic and data powering both layout segmentation and JavaScript segmentation.

By following UAX#14/UAX#29, we're likely to substantially narrow the gap with what engines powered by ICU4C are doing.

I'm just curious, how will this impact complex text layout languages? Will it ship the data needed for Thai, Burmese, Khmer, and Lao?

Yes, we are planning to support these languages. The data will be either a model trained via machine learning or a dictionary, as in ICU.

That's awesome!

It would be neat to learn more about the process of building the machine learning models, as that could be a huge data size benefit for a number of these languages.

I'm helping maintain the Lao data in ICU and Hunspell Lao, and have personal contacts who have worked with the other regional languages (Thai, Khmer, and Burmese), so I'm sure it would be possible to gather plenty of decent training data to develop some really good basic models.

Depends on: 1719535
Depends on: 1719537
Depends on: 1722484

No longer blocks bug 1381019, because of the workaround that landed in bug 1713973.

No longer blocks: win32k-lockdown
Alias: segmenter
Depends on: 1847807
Depends on: 1817386