Closed
Bug 728180
Opened 13 years ago
Closed 8 years ago
Use ICU to replace the (obsolete) implementation of nsUnicodeNormalizer
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla52
Tracking | Status | |
---|---|---|
firefox52 | --- | fixed |
People
(Reporter: jfkthame, Assigned: jfkthame)
References
(Blocks 2 open bugs)
Details
Attachments
(1 file, 2 obsolete files)
I was intending to make use of our normalization support, but on looking into intl/unicharutil to see what nsIUnicodeNormalizer could offer, it appeared to be quite out-of-date, with the key normalization data in http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/src/normalization_data.h not having been updated since the move to HG (at least).
There's a NormalizationTest tool in intl/unicharutil/tests, but the necessary test data isn't actually present in the tree, so I don't see how it can be testing much - and it doesn't look like it is run as part of "make check" anyway, AFAICT.
I tried running the NormalizationTest after generating up-to-date test data with the genNormalizationData.pl tool. As expected, the results indicate that the implementation of normalization is seriously behind the times:
NormalizationTest: test nsIUnicodeNormalizer. UCD version: 6.1.0
Test Part0: Specific cases
23 cases passed, 0 failed
Test Part1: Character by character test
1113792 cases passed, 319 failed
Test Part2: Canonical Order Test
740 cases passed, 566 failed
Test Part3: PRI #29 Test
25 cases passed, 143 failed
Test finished
Assignee | ||
Comment 1•13 years ago
|
||
It turns out there are two aspects to this. First, we need to regenerate normalization_data.h based on current Unicode data files. This will resolve the failures in Part1 and Part2 of the test.
However, there's also the correction to the definition of normalization, as discussed in http://www.unicode.org/review/pr-29.html and incorporated into UAX#15 as of Unicode 4.1. This requires a code fix in nsUnicodeNormalizer.cpp.
Patches upcoming, after I finish building & testing locally....
Assignee | ||
Comment 2•13 years ago
|
||
One complication to consider: I'm not sure what all the various users of nsIUnicodeNormalizer may be, but I believe there are some standards/protocols that explicitly require the use of a specific, older version of Unicode normalization.
Do we even _know_ what version of Unicode is currently implemented by nsIUnicodeNormalizer? I didn't see any obvious mention of the version; I assumed it was meant to support the same Unicode version as we're supporting in the rest of our code (the current version, allowing for a short time lag to update our data tables when a new version is released). But that may not be true, even in intention (let alone in actuality).
So do we actually need to support _multiple_ normalization tables, so as to provide one for general use by code that just wants "support for the current Unicode version", and one (or more) specific older versions to support protocols that mandate their use?
Taking a quick look at some docs:
http://tools.ietf.org/html/rfc3454 (stringprep) references Unicode 3.2.0
http://tools.ietf.org/html/rfc3491 (nameprep) also uses Unicode 3.2, by reference to stringprep
http://tools.ietf.org/html/rfc5890 (IDNA - definitions) references Unicode 5.2.0
http://tools.ietf.org/html/rfc5891 (IDNA - protocol) references RFC5890, but also (normatively) references the _current_ version of UAX#15 (which would now be 6.1.0) - I suspect that may be unintentional?
Ugh.
Assignee | ||
Comment 3•13 years ago
|
||
According to MXR, it seems that the only two users of nsIUnicodeNormalizer in the m-c tree are at
(1) http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/nsIDNService.cpp#539
(as part of the implementation of stringprep), and
(2) http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/nsNPAPIPlugin.cpp#2205
(looks like part of an API to return the document origin, applies normalization form NFKC before returning it).
For use (1), stringprep, the spec seems to insist on Unicode 3.2.0, so updating nsIUnicodeNormalizer would _break_ that. For use (2), it's less clear. I found https://wiki.mozilla.org/NPAPI:DocumentOrigin, but this only says that the API returns "the Unicode serialization of the origin converted to NFKC-encoded (normalized) UTF-8"; in the absence of a mention of any specific Unicode version, it would be natural to assume this refers to whatever version of Unicode the browser currently "supports" (i.e. 6.1, for Gecko).
In view of item (1) here, I think we need to be very cautious about updating the normalization code; however, I also think it is highly misleading to provide an "nsIUnicodeNormalizer" service that is tied to an old version of Unicode, without making this limitation extremely clear in its documentation/headers/even its NAME. (nsIUnicode_3_2_0_Normalizer?)
Assignee | ||
Comment 4•13 years ago
|
||
FTR, I'm attaching a patch that updates the normalization data to Unicode 6.1, although as discussed above, I don't think we can just go ahead and do this.
Assignee | ||
Comment 5•13 years ago
|
||
Forgot to add, with that patch, the NormalizationTest (run against Unicode 6.1 test data) report:
NormalizationTest: test nsIUnicodeNormalizer. UCD version: 6.1.0
Test Part0: Specific cases
23 cases passed, 0 failed
Test Part1: Character by character test
1114111 cases passed, 0 failed
Test Part2: Canonical Order Test
1306 cases passed, 0 failed
Test Part3: PRI #29 Test
12 cases passed, 156 failed
Assignee | ||
Comment 6•13 years ago
|
||
And this is the normalization code fix for PRI-29; with this patch in addition to the data update, NormalizationTest passes all tests.
Updated•13 years ago
|
Assignee: nobody → jfkthame
Comment 7•13 years ago
|
||
See bug 210502 comment 26 and following: we deliberately kept the normalization out-of-date in order not to break IDNA. I agree with you that it would have made more sense to make this explicit in the name.
However, bug 479520 should make all this moot.
Depends on: IDNA2008
Assignee | ||
Comment 8•13 years ago
|
||
Ah, thanks, I hadn't run across bug 210502. That confirms my concern about updating this.
What about the NPAPI consumer of normalization - is there any definitive ruling on what version that's supposed to be using?
I do think we should have an up-to-date normalization service, for the sake of the (new) text-shaping consumer, and also to expose to any code (extensions, whatever) that might want normalization and isn't specifically tied to an old version.
Comment 9•11 years ago
|
||
Now we have ICU, but are low-level functions (such as nsUnicodeNormalizer::DecomposeNonRecursively) implementable without modifying the ICU source?
Assignee | ||
Comment 10•11 years ago
|
||
For DecomposeNonRecursively (which AFAIK is used only by the harfbuzz shaping backend), I think unorm2_getRawDecomposition can be used.
http://icu-project.org/apiref/icu4c/unorm2_8h.html#a0551019637648af4638435d79f24be17
Assignee | ||
Comment 11•11 years ago
|
||
...though we should adapt the callback in gfxHarfBuzzShaper.cpp to use ICU directly rather than adding an extra layer of wrapper around it to emulate the old nsUnicodeNormalizer API. Especially as this is a particularly hot codepath.
So I suspect we could scrap DecomposeNonRecursively altogether.
Assignee | ||
Comment 12•8 years ago
|
||
When ICU is available (i.e. when ENABLE_INTL_API is true), we can use ICU normalization to replace our old (unmaintained and obsolete) code here. The issue of requiring Unicode 3.2 normalization for IDNA is no longer a concern when ENABLE_INTL_API is true, as we've migrated to ICU stringprep services for that case.
The ICU-based implementation doesn't need to provide the Compose and DecomposeNonRecursively functions, as these are not used in ENABLE_INTL_API builds, and will disappear from the tree completely when non-ENABLE_INTL_API codepaths are eliminated.
Summary: nsIUnicodeNormalizer is out of date and is not being tested → Use ICU to replace the (obsolete) implementation of nsUnicodeNormalizer
Assignee | ||
Updated•8 years ago
|
Attachment #598327 -
Attachment is obsolete: true
Assignee | ||
Updated•8 years ago
|
Attachment #598330 -
Attachment is obsolete: true
Assignee | ||
Comment 13•8 years ago
|
||
Attachment #8795662 -
Flags: review?(VYV03354)
Assignee | ||
Comment 14•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0628ce6f4b8
FTR, this reduces the impact of ENABLE_INTL_API on the final .apk size of an Android build by about 32K, as we no longer end up with two copies of normalization data in the product. (It'll have a similar effect on desktop builds, too, though we're less concerned about size there.)
Comment 15•8 years ago
|
||
Do we need to preserve nsUnicodeNormalizer in the first place?
The only remaining user is NPAPI if ICU is available. How about rewriting the last caller instead?
Assignee | ||
Comment 16•8 years ago
|
||
I've been assuming there'd be add-on users of the service that haven't necessarily all been updated to use the new JS String method.
If we're willing to break such code then yes, I think we could remove the service entirely.
Assignee | ||
Comment 17•8 years ago
|
||
Incidentally, I see that bug 1292476 intends to remove the NPAPI consumer of this interface.
Depends on: 1292476
Comment 18•8 years ago
|
||
Comment on attachment 8795662 [details] [diff] [review]
Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code
Review of attachment 8795662 [details] [diff] [review]:
-----------------------------------------------------------------
OK, let's discuss dropping nsIUnicodeNormalizer elsewhere.
Attachment #8795662 -
Flags: review?(VYV03354) → review+
Assignee | ||
Comment 19•8 years ago
|
||
It's bug 1292486. I'm inclined to think we should go ahead and do it, but to be polite to any existing users, we should announce it ahead of time (an "intent to unship"?) and document it (with recommended alternatives).
Meanwhile, we can do this internally without affecting any consumers.
Assignee | ||
Comment 20•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/01cd7a8a158d425c216cb319898832c4891ec35e
Bug 728180 - Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code. r=emk
Comment 21•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox52:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
You need to log in
before you can comment on or make changes to this bug.
Description
•