Closed Bug 728180 Opened 13 years ago Closed 8 years ago

Use ICU to replace the (obsolete) implementation of nsUnicodeNormalizer

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla52

Tracking Flags:

Tracking

Status

firefox52

---

fixed

People

(Reporter: jfkthame, Assigned: jfkthame)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file, 2 obsolete files)

patch 1, update normalization_data.h to unicode 6.1 13 years ago Jonathan Kew [:jfkthame] (deleted), patch		Details \| Diff \| Splinter Review
patch 2, update normalization code for Unicode PRI-29 issue 13 years ago Jonathan Kew [:jfkthame] (deleted), patch		Details \| Diff \| Splinter Review
Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code 8 years ago Jonathan Kew [:jfkthame] (deleted), patch	emk : review+	Details \| Diff \| Splinter Review

Jonathan Kew [:jfkthame]

Assignee

Description

•

13 years ago

I was intending to make use of our normalization support, but on looking into intl/unicharutil to see what nsIUnicodeNormalizer could offer, it appeared to be quite out-of-date, with the key normalization data in http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/src/normalization_data.h not having been updated since the move to HG (at least). There's a NormalizationTest tool in intl/unicharutil/tests, but the necessary test data isn't actually present in the tree, so I don't see how it can be testing much - and it doesn't look like it is run as part of "make check" anyway, AFAICT. I tried running the NormalizationTest after generating up-to-date test data with the genNormalizationData.pl tool. As expected, the results indicate that the implementation of normalization is seriously behind the times: NormalizationTest: test nsIUnicodeNormalizer. UCD version: 6.1.0 Test Part0: Specific cases 23 cases passed, 0 failed Test Part1: Character by character test 1113792 cases passed, 319 failed Test Part2: Canonical Order Test 740 cases passed, 566 failed Test Part3: PRI #29 Test 25 cases passed, 143 failed Test finished

Jonathan Kew [:jfkthame]

Assignee

Comment 1

•

13 years ago

It turns out there are two aspects to this. First, we need to regenerate normalization_data.h based on current Unicode data files. This will resolve the failures in Part1 and Part2 of the test. However, there's also the correction to the definition of normalization, as discussed in http://www.unicode.org/review/pr-29.html and incorporated into UAX#15 as of Unicode 4.1. This requires a code fix in nsUnicodeNormalizer.cpp. Patches upcoming, after I finish building & testing locally....

Jonathan Kew [:jfkthame]

Assignee

Comment 2

•

13 years ago

One complication to consider: I'm not sure what all the various users of nsIUnicodeNormalizer may be, but I believe there are some standards/protocols that explicitly require the use of a specific, older version of Unicode normalization. Do we even _know_ what version of Unicode is currently implemented by nsIUnicodeNormalizer? I didn't see any obvious mention of the version; I assumed it was meant to support the same Unicode version as we're supporting in the rest of our code (the current version, allowing for a short time lag to update our data tables when a new version is released). But that may not be true, even in intention (let alone in actuality). So do we actually need to support _multiple_ normalization tables, so as to provide one for general use by code that just wants "support for the current Unicode version", and one (or more) specific older versions to support protocols that mandate their use? Taking a quick look at some docs: http://tools.ietf.org/html/rfc3454 (stringprep) references Unicode 3.2.0 http://tools.ietf.org/html/rfc3491 (nameprep) also uses Unicode 3.2, by reference to stringprep http://tools.ietf.org/html/rfc5890 (IDNA - definitions) references Unicode 5.2.0 http://tools.ietf.org/html/rfc5891 (IDNA - protocol) references RFC5890, but also (normatively) references the _current_ version of UAX#15 (which would now be 6.1.0) - I suspect that may be unintentional? Ugh.

Jonathan Kew [:jfkthame]

Assignee

Comment 3

•

13 years ago

According to MXR, it seems that the only two users of nsIUnicodeNormalizer in the m-c tree are at (1) http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/nsIDNService.cpp#539 (as part of the implementation of stringprep), and (2) http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/nsNPAPIPlugin.cpp#2205 (looks like part of an API to return the document origin, applies normalization form NFKC before returning it). For use (1), stringprep, the spec seems to insist on Unicode 3.2.0, so updating nsIUnicodeNormalizer would _break_ that. For use (2), it's less clear. I found https://wiki.mozilla.org/NPAPI:DocumentOrigin, but this only says that the API returns "the Unicode serialization of the origin converted to NFKC-encoded (normalized) UTF-8"; in the absence of a mention of any specific Unicode version, it would be natural to assume this refers to whatever version of Unicode the browser currently "supports" (i.e. 6.1, for Gecko). In view of item (1) here, I think we need to be very cautious about updating the normalization code; however, I also think it is highly misleading to provide an "nsIUnicodeNormalizer" service that is tied to an old version of Unicode, without making this limitation extremely clear in its documentation/headers/even its NAME. (nsIUnicode_3_2_0_Normalizer?)

Jonathan Kew [:jfkthame]

Assignee

Comment 4

•

13 years ago

Attached patch patch 1, update normalization_data.h to unicode 6.1 (obsolete) (deleted) — Details — Splinter Review

FTR, I'm attaching a patch that updates the normalization data to Unicode 6.1, although as discussed above, I don't think we can just go ahead and do this.

Jonathan Kew [:jfkthame]

Assignee

Comment 5

•

13 years ago

Forgot to add, with that patch, the NormalizationTest (run against Unicode 6.1 test data) report: NormalizationTest: test nsIUnicodeNormalizer. UCD version: 6.1.0 Test Part0: Specific cases 23 cases passed, 0 failed Test Part1: Character by character test 1114111 cases passed, 0 failed Test Part2: Canonical Order Test 1306 cases passed, 0 failed Test Part3: PRI #29 Test 12 cases passed, 156 failed

Jonathan Kew [:jfkthame]

Assignee

Comment 6

•

13 years ago

Attached patch patch 2, update normalization code for Unicode PRI-29 issue (obsolete) (deleted) — Details — Splinter Review

And this is the normalization code fix for PRI-29; with this patch in addition to the data update, NormalizationTest passes all tests.

Ed Morley [:emorley]

Updated

•

13 years ago

Assignee: nobody → jfkthame

Simon Montagu :smontagu

Comment 7

•

13 years ago

See bug 210502 comment 26 and following: we deliberately kept the normalization out-of-date in order not to break IDNA. I agree with you that it would have made more sense to make this explicit in the name. However, bug 479520 should make all this moot.

Depends on: IDNA2008

Jonathan Kew [:jfkthame]

Assignee

Comment 8

•

13 years ago

Ah, thanks, I hadn't run across bug 210502. That confirms my concern about updating this. What about the NPAPI consumer of normalization - is there any definitive ruling on what version that's supposed to be using? I do think we should have an up-to-date normalization service, for the sake of the (new) text-shaping consumer, and also to expose to any code (extensions, whatever) that might want normalization and isn't specifically tied to an old version.

Jonathan Kew [:jfkthame]

Assignee

Updated

•

12 years ago

Blocks: 543200

Masatoshi Kimura [:emk]

Comment 9

•

11 years ago

Now we have ICU, but are low-level functions (such as nsUnicodeNormalizer::DecomposeNonRecursively) implementable without modifying the ICU source?

Jonathan Kew [:jfkthame]

Assignee

Comment 10

•

11 years ago

For DecomposeNonRecursively (which AFAIK is used only by the harfbuzz shaping backend), I think unorm2_getRawDecomposition can be used. http://icu-project.org/apiref/icu4c/unorm2_8h.html#a0551019637648af4638435d79f24be17

Jonathan Kew [:jfkthame]

Assignee

Comment 11

•

11 years ago

...though we should adapt the callback in gfxHarfBuzzShaper.cpp to use ICU directly rather than adding an extra layer of wrapper around it to emulate the old nsUnicodeNormalizer API. Especially as this is a particularly hot codepath. So I suspect we could scrap DecomposeNonRecursively altogether.

Makoto Kato [:m_kato]

Updated

•

9 years ago

Depends on: 1161900

Jonathan Kew [:jfkthame]

Assignee

Comment 12

•

8 years ago

When ICU is available (i.e. when ENABLE_INTL_API is true), we can use ICU normalization to replace our old (unmaintained and obsolete) code here. The issue of requiring Unicode 3.2 normalization for IDNA is no longer a concern when ENABLE_INTL_API is true, as we've migrated to ICU stringprep services for that case. The ICU-based implementation doesn't need to provide the Compose and DecomposeNonRecursively functions, as these are not used in ENABLE_INTL_API builds, and will disappear from the tree completely when non-ENABLE_INTL_API codepaths are eliminated.

Summary: nsIUnicodeNormalizer is out of date and is not being tested → Use ICU to replace the (obsolete) implementation of nsUnicodeNormalizer

Jonathan Kew [:jfkthame]

Assignee

Updated

•

8 years ago

Attachment #598327 - Attachment is obsolete: true

Jonathan Kew [:jfkthame]

Assignee

Updated

•

8 years ago

Attachment #598330 - Attachment is obsolete: true

Jonathan Kew [:jfkthame]

Assignee

Comment 13

•

8 years ago

Attached patch Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code (deleted) — Details — Splinter Review

Attachment #8795662 - Flags: review?(VYV03354)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

8 years ago

Blocks: 724529

Jonathan Kew [:jfkthame]

Assignee

Comment 14

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0628ce6f4b8 FTR, this reduces the impact of ENABLE_INTL_API on the final .apk size of an Android build by about 32K, as we no longer end up with two copies of normalization data in the product. (It'll have a similar effect on desktop builds, too, though we're less concerned about size there.)

Masatoshi Kimura [:emk]

Comment 15

•

8 years ago

Do we need to preserve nsUnicodeNormalizer in the first place? The only remaining user is NPAPI if ICU is available. How about rewriting the last caller instead?

Jonathan Kew [:jfkthame]

Assignee

Comment 16

•

8 years ago

I've been assuming there'd be add-on users of the service that haven't necessarily all been updated to use the new JS String method. If we're willing to break such code then yes, I think we could remove the service entirely.

Jonathan Kew [:jfkthame]

Assignee

Comment 17

•

8 years ago

Incidentally, I see that bug 1292476 intends to remove the NPAPI consumer of this interface.

Depends on: 1292476

Masatoshi Kimura [:emk]

Comment 18

•

8 years ago

Comment on attachment 8795662 [details] [diff] [review] Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code Review of attachment 8795662 [details] [diff] [review]: ----------------------------------------------------------------- OK, let's discuss dropping nsIUnicodeNormalizer elsewhere.

Attachment #8795662 - Flags: review?(VYV03354) → review+

Jonathan Kew [:jfkthame]

Assignee

Comment 19

•

8 years ago

It's bug 1292486. I'm inclined to think we should go ahead and do it, but to be polite to any existing users, we should announce it ahead of time (an "intent to unship"?) and document it (with recommended alternatives). Meanwhile, we can do this internally without affecting any consumers.

Jonathan Kew [:jfkthame]

Assignee

Comment 20

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/01cd7a8a158d425c216cb319898832c4891ec35e Bug 728180 - Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code. r=emk

Carsten Book [:Tomcat]

Comment 21

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/01cd7a8a158d

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox52: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

Sylvestre Ledru [:Sylvestre]

Updated

•

7 years ago

Depends on: 1378683

You need to log in before you can comment on or make changes to this bug.