Use UTS 35 Unicode BCP 47 Locale Identifiers instead of RFC-5646/6067 BCP 47 language tags
Categories
(Core :: JavaScript: Internationalization API, enhancement, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox70 | --- | fixed |
People
(Reporter: anba, Assigned: anba)
References
Details
Attachments
(18 files)
Intl.Locale (bug 1433303) depends on https://github.com/tc39/ecma402/pull/289.
https://github.com/tc39/ecma402/pull/289#issuecomment-444178026 lists multiple open issues, but the PR was merged nonetheless, so it's not entirely clear what to implement in some edge cases.
Assignee | ||
Comment 1•6 years ago
I propose we first switch the language tag parser over to Unicode BCP 47 locale identifiers and if that works out without causing any web-compat issues, we can proceed to switch the canonicalisation to whatever UTS 35 specifies. (But also see https://github.com/tc39/ecma402/issues/330.)
Assignee | ||
Comment 2•6 years ago
https://github.com/tc39/ecma402/pull/289 changed ECMA-402 to use Unicode BCP47
locale identifiers instead of BCP47 language tags for language tags. That means
extlang subtags are no longer supported in language tags.
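To make the extlang change concrete, here is a minimal sketch of a structural check for the unicode_language_id production from UTS 35, written as a regular expression. The function name is hypothetical, and the check deliberately ignores the special "root" form, extension and private-use sequences, and duplicate-variant detection. It is an illustration of the grammar, not the parser from the patch.

```javascript
// unicode_language_id (without the "root" form):
//   language ("-" script)? ("-" region)? ("-" variant)*
// There is no slot for a three-letter extlang subtag after the language.
const UNICODE_LANGUAGE_ID = new RegExp(
  "^(?:[a-z]{2,3}|[a-z]{5,8})" +                 // language
  "(?:-[a-z]{4})?" +                             // script
  "(?:-(?:[a-z]{2}|[0-9]{3}))?" +                // region
  "(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*$",   // variants
  "i"
);

// Hypothetical helper: true if the tag matches the grammar sketched above.
function isStructurallyValidLanguageId(tag) {
  return UNICODE_LANGUAGE_ID.test(tag);
}

console.log(isStructurallyValidLanguageId("zh-yue"));         // false (extlang)
console.log(isStructurallyValidLanguageId("zh-cmn-Hans-CN")); // false (extlang)
console.log(isStructurallyValidLanguageId("yue"));            // true
console.log(isStructurallyValidLanguageId("zh-Hant-HK"));     // true
```

A tag such as "zh-yue" therefore has to be written with the primary language subtag "yue" instead.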
Assignee | ||
Comment 3•6 years ago
Irregular grandfathered language tags and regular grandfathered tags with
extlang-like subtags can't be parsed as Unicode BCP 47 locale identifiers, so
they now need to be rejected by the language tag parser.
Depends on D23536
Assignee | ||
Comment 4•6 years ago
Language tags consisting only of private-use subtags are not allowed in Unicode
BCP 47 locale identifiers.
Depends on D23537
Assignee | ||
Comment 5•6 years ago
Unicode BCP 47 locale identifiers don't support four-letter language subtags.
Depends on D23538
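The restrictions from comments 3 to 5 can be probed from script. The sketch below assumes an engine whose Intl.getCanonicalLocales follows the updated ECMA-402 text; the helper name isAcceptedLocale is hypothetical. On such an engine the grandfathered, privateuse-only, and four-letter inputs are expected to be rejected with a RangeError, while a private-use sequence that follows a language subtag is still accepted.

```javascript
// Hypothetical helper: reports whether the engine accepts a tag as a
// structurally valid locale identifier.
function isAcceptedLocale(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (e) {
    // Invalid tags are reported as RangeError; anything else is unexpected.
    if (e instanceof RangeError) {
      return false;
    }
    throw e;
  }
}

const probes = [
  "i-klingon",     // irregular grandfathered tag
  "zh-min-nan",    // regular grandfathered tag with extlang-like subtags
  "x-private",     // privateuse-only tag
  "abcd",          // four-letter language subtag
  "en-x-private",  // private-use sequence after a language subtag
];

for (const tag of probes) {
  console.log(tag, isAcceptedLocale(tag));
}
```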
Assignee | ||
Comment 6•6 years ago
- Strict parsing for "u" and "t" extensions is not yet implemented.
- Canonicalisation per UTS 35 is also not yet implemented, so it still refers to BCP 47 tags.
Depends on D23539
Assignee | ||
Comment 7•6 years ago
Unicode BCP 47 locale identifiers have stricter requirements for the Unicode ("-u-") and
transformed content ("-t-") extension sequences.
- Keys in Unicode extensions must be of the form "alphanum alpha".
- Transformed content extensions need to be parsed following the transformed_extensions syntax from UTS 35.
Depends on D23540
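As a rough illustration of the stricter key syntax, the sketch below checks a Unicode extension sequence against the UTS 35 shape: optional attributes, then keywords whose key is "alphanum alpha". It also shows the "alpha digit" form that UTS 35 uses for keys of transform fields. All function names are hypothetical and the input is assumed to be only the extension sequence (starting at "u"), not a whole tag.

```javascript
const ALPHANUM_3_8 = /^[a-z0-9]{3,8}$/i;   // attribute or type subtag
const UNICODE_KEY = /^[a-z0-9][a-z]$/i;    // key = alphanum alpha
const TRANSFORM_KEY = /^[a-z][0-9]$/i;     // tkey = alpha digit

// Hypothetical helper: validates "u" ("-" attribute)* ("-" keyword)*
// with at least one subtag after the singleton.
function isValidUnicodeExtension(extension) {
  const subtags = extension.split("-");
  if (subtags.shift().toLowerCase() !== "u" || subtags.length === 0) {
    return false;
  }
  let i = 0;
  // Leading attributes.
  while (i < subtags.length && ALPHANUM_3_8.test(subtags[i])) {
    i++;
  }
  // Keywords: a key followed by zero or more type subtags.
  while (i < subtags.length) {
    if (!UNICODE_KEY.test(subtags[i])) {
      return false;
    }
    i++;
    while (i < subtags.length && ALPHANUM_3_8.test(subtags[i])) {
      i++;
    }
  }
  return true;
}

// Hypothetical helper for the key of a transform field.
function isTransformKey(subtag) {
  return TRANSFORM_KEY.test(subtag);
}

console.log(isValidUnicodeExtension("u-ca-gregory")); // true
console.log(isValidUnicodeExtension("u-c1-gregory")); // false, key ends in a digit
console.log(isTransformKey("m0"));                    // true
```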
Assignee | ||
Comment 9•6 years ago
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5f2bb8e0636a84c2b4ea4442abe936cf23f3983
Comment 10•6 years ago
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/113a287cfb7f
Part 1: Remove support for extlang subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/5a50fd4ec8ba
Part 2: Remove support for irregular grandfathered tags and regular grandfathered tags with extlang-like subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/b215b68fbccc
Part 3: Remove support for privateuse-only language tags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/c5a97d342431
Part 4: Remove support for four letter language subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/7b0c2144242c
Part 5: Update comments to refer to Unicode BCP 47 locale identifiers. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1245a50cc3a0
Part 6: Add strict parsing of Unicode and transform extension sequences. r=jwalden
Comment 11•6 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/113a287cfb7f
https://hg.mozilla.org/mozilla-central/rev/5a50fd4ec8ba
https://hg.mozilla.org/mozilla-central/rev/b215b68fbccc
https://hg.mozilla.org/mozilla-central/rev/c5a97d342431
https://hg.mozilla.org/mozilla-central/rev/7b0c2144242c
https://hg.mozilla.org/mozilla-central/rev/1245a50cc3a0
Comment 12•6 years ago
== Change summary for alert #20419 (as of Fri, 12 Apr 2019 06:30:28 GMT) ==
Improvements:
Change | Test | Platform | Before | After
---|---|---|---|---
1% | Base Content JS | linux64-shippable opt | 4,023,191.33 | 4,002,330.67
1% | Base Content JS | linux64-shippable-qr opt | 4,023,148.00 | 4,002,240.67
1% | Base Content JS | osx-10-10-shippable opt | 4,020,194.67 | 3,999,178.67
1% | Base Content JS | windows10-64-shippable opt | 4,083,708.00 | 4,062,744.67
1% | Base Content JS | windows10-64-shippable-qr opt | 4,083,694.00 | 4,062,758.00
0% | Base Content JS | linux64-shippable-qr opt | 4,020,210.33 | 4,002,276.67
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20419
Comment 13•5 years ago
I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial ⇒ en-u-ms-uksystem.
If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.
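For illustration, the kind of type-alias replacement mentioned here could look roughly like the sketch below. The alias table holds only the single mapping named in this comment, the function name is hypothetical, and the code assumes the first "u" singleton found really starts a Unicode extension (so tags with an earlier private-use sequence are out of scope). It is not the code that make_intl_data.py generates.

```javascript
// Illustrative alias table: key -> deprecated type -> preferred type.
// Only the mapping mentioned in this bug is listed.
const TYPE_ALIASES = {
  ms: { imperial: "uksystem" },
};

// Hypothetical helper: replaces deprecated type values in the "-u-" extension.
function replaceUnicodeTypeAliases(tag) {
  const subtags = tag.split("-");
  const start = subtags.findIndex((s, i) => i > 0 && s.toLowerCase() === "u");
  if (start === -1) {
    return tag;
  }
  const out = subtags.slice(0, start + 1);
  let i = start + 1;
  // Walk until the next singleton (or the end of the tag).
  while (i < subtags.length && subtags[i].length > 1) {
    const subtag = subtags[i++].toLowerCase();
    if (subtag.length !== 2) {
      out.push(subtag); // attribute, copied unchanged
      continue;
    }
    // Collect the (possibly multi-subtag) type that follows the key.
    const typeParts = [];
    while (i < subtags.length && subtags[i].length > 2) {
      typeParts.push(subtags[i++].toLowerCase());
    }
    const type = typeParts.join("-");
    const preferred = TYPE_ALIASES[subtag]?.[type] ?? type;
    out.push(subtag);
    if (preferred) {
      out.push(preferred);
    }
  }
  return out.concat(subtags.slice(i)).join("-");
}

console.log(replaceUnicodeTypeAliases("en-u-ms-imperial")); // "en-u-ms-uksystem"
```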
Assignee | ||
Comment 14•5 years ago
(In reply to Jeff Walden [:Waldo] from comment #13)
> I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial ⇒ en-u-ms-uksystem.
This canonicalisation is currently not required, but probably should be for consistency with Intl.Locale. See also https://github.com/tc39/ecma402/issues/330.
> If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.
The ten patches (actually eleven, but one of them was already reviewed in bug 1433303 and now got moved here) are on the way to you! :-)
Assignee | ||
Comment 15•5 years ago
Start implementing the new canonicalisation algorithm by validating all
subtags are in normalised case.
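A small sketch of what "normalised case" means for the language part of an identifier: language and variants in lower case, script in title case, region in upper case. The helper names are hypothetical, and the input is assumed to be a structurally valid unicode_language_id without any extension or private-use sequence.

```javascript
// Hypothetical helper: returns the language identifier in normalised case.
function normalizeLanguageIdCase(languageId) {
  return languageId
    .split("-")
    .map((subtag, index) => {
      if (index === 0) {
        return subtag.toLowerCase(); // language
      }
      if (/^[a-z]{4}$/i.test(subtag)) {
        // script: first letter upper case, rest lower case
        return subtag[0].toUpperCase() + subtag.slice(1).toLowerCase();
      }
      if (/^(?:[a-z]{2}|[0-9]{3})$/i.test(subtag)) {
        return subtag.toUpperCase(); // region
      }
      return subtag.toLowerCase(); // variant
    })
    .join("-");
}

// Hypothetical helper: the validation step only has to compare against
// the normalised form.
function isInNormalizedCase(languageId) {
  return languageId === normalizeLanguageIdCase(languageId);
}

console.log(normalizeLanguageIdCase("SR-cyrl-rs")); // "sr-Cyrl-RS"
console.log(isInNormalizedCase("de-CH-1996"));      // true
```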
Assignee | ||
Comment 16•5 years ago
Updated test cases to conform to the changed canonicalization when variants are
sorted in alphabetical order.
Depends on D37440
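The effect on test expectations can be sketched as follows. The helper is hypothetical and assumes a structurally valid language identifier with no extension or private-use sequence, so that everything from the first variant onwards is a variant.

```javascript
const VARIANT = /^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/i;

// Hypothetical helper: sorts the variant subtags alphabetically.
function sortVariants(languageId) {
  const subtags = languageId.split("-");
  // Skip index 0: a 5-8 letter language subtag would also match VARIANT.
  const first = subtags.findIndex((s, i) => i > 0 && VARIANT.test(s));
  if (first === -1) {
    return languageId;
  }
  const head = subtags.slice(0, first);
  const variants = subtags.slice(first).map((s) => s.toLowerCase()).sort();
  return [...head, ...variants].join("-");
}

console.log(sortVariants("nan-Hans-MM-variant2-variant1"));
// "nan-Hans-MM-variant1-variant2"
```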
Assignee | ||
Comment 17•5 years ago
Depends on D37441
Assignee | ||
Comment 18•5 years ago
Depends on D37442
Assignee | ||
Comment 19•5 years ago
The new language mappings data will be retrieved from CLDR, so rename the previous file
before starting to make the switch to CLDR.
Depends on D37443
Assignee | ||
Comment 20•5 years ago
Switch language and region mappings from IANA to CLDR.
Depends on D37444
Assignee | ||
Comment 21•5 years ago
- Add support for language mappings where, in addition to the language subtag, other subtags are also modified.
- Add support for region mappings where the preferred replacement region is chosen based on the likely subtags from the base language and script subtags.
- Variant subtag replacements are not supported in the currently used CLDR algorithm, so remove them for now.
Depends on D37445
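The two kinds of complex mappings can be sketched as below. The tables are tiny hand-written excerpts for illustration only (the real data is generated from CLDR), the region list for "SU" is abbreviated, the likely-region table is a simplification that looks at the language alone rather than language and script, and all names are hypothetical.

```javascript
// Illustrative excerpts, not the generated CLDR data.
const LANGUAGE_ALIASES = {
  iw: { language: "he" },                 // simple: only the language changes
  sh: { language: "sr", script: "Latn" }, // complex: also supplies a script
};
const REGION_ALIASES = {
  BU: ["MM"],             // simple: single replacement
  SU: ["RU", "AM", "UA"], // complex: several candidates (abbreviated list)
};
// Simplified stand-in for likely subtags: language -> likely region.
const LIKELY_REGION = { hy: "AM", ru: "RU", uk: "UA" };

// Hypothetical helper operating on already parsed subtags.
function applyMappings({ language, script, region }) {
  const alias = LANGUAGE_ALIASES[language];
  if (alias) {
    language = alias.language;
    // Subtags present in the input win over those from the replacement.
    script = script ?? alias.script;
    region = region ?? alias.region;
  }
  const candidates = REGION_ALIASES[region];
  if (candidates) {
    // Prefer the candidate matching the likely region, else the first one.
    const likely = LIKELY_REGION[language];
    region = candidates.includes(likely) ? likely : candidates[0];
  }
  return { language, script, region };
}

console.log(applyMappings({ language: "sh" }));
// { language: "sr", script: "Latn", region: undefined }
console.log(applyMappings({ language: "hy", region: "SU" }));
// { language: "hy", script: undefined, region: "AM" }
```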
Assignee | ||
Comment 22•5 years ago
Depends on D37446
Assignee | ||
Comment 23•5 years ago
Depends on D37447
Assignee | ||
Comment 24•5 years ago
Intl.Locale is no longer required to handle (uncanonicalised) grandfathered tags,
so we can directly update grandfathered tags to their modern form in the parser.
Depends on D37448
Assignee | ||
Comment 25•5 years ago
Depends on D37449
Assignee | ||
Comment 26•5 years ago
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d1089b58023c109bbea1cdcd59e92a126c4ab56
Comment 27•5 years ago
Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c6dd14881fdb
Part 7: Ensure subtags are in normalised case. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/f79999353e41
Part 8: Order all subtags in canonical syntax form. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/2794e19ca868
Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/566e019402f2
Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6059f6604172
Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/df9468fb8406
Part 12: Add simple language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/aa7801f87840
Part 13: Add complex language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1c3668a78484
Part 14: Remove no longer used IANA language subtag registry code. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6de0fe9b007c
Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/984a6cb95c09
Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1fec79fd2faf
Part 17: Remove references to RFC 5646. r=jwalden
Comment 28•5 years ago
Backed out for mochitest failures on test_ManifestProcessor_lang.html
Backout link: https://hg.mozilla.org/integration/autoland/rev/0d8d40e596d0c5b1455c9e210cb70853613ad6f6
Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=257230947&repo=autoland&lineNumber=8069
Comment 29•5 years ago
// Test valid language tags - derived from IANA and BCP-47 spec
// and our Intl.js implementation.
var validTags = [
"aa", "ab", "ae", "af", "ak", "am", "an", "ar", "as", "av", "ay", "az",
"ba", "be", "bg", "bh", "bi", "bm", "bn", "bo", "br", "bs", "ca", "ce",
"ch", "co", "cr", "cs", "cu", "cv", "cy", "da", "de", "dv", "dz", "ee",
"el", "en", "eo", "es", "et", "eu", "fa", "ff", "fi", "fj", "fo", "fr",
"fy", "ga", "gd", "gl", "gn", "gu", "gv", "ha", "he", "hi", "ho", "hr",
"ht", "hu", "hy", "hz", "ia", "id", "ie", "ig", "ik", "io",
"is", "it", "iu", "ja", "jv", "ka", "kg", "ki", "kj",
"kk", "kl", "km", "kn", "ko", "kr", "ks", "ku", "kv", "kw", "ky", "la",
"lb", "lg", "li", "ln", "lo", "lt", "lu", "lv", "mg", "mh", "mi", "mk",
"ml", "mn", "mr", "ms", "mt", "my", "na", "nb", "nd", "ne", "ng",
"nl", "nn", "no", "nr", "nv", "ny", "oc", "oj", "om", "or", "os", "pa",
"pi", "pl", "ps", "pt", "qu", "rm", "rn", "ro", "ru", "rw", "sa", "sc",
"sd", "se", "sg", "sh", "si", "sk", "sl", "sm", "sn", "so", "sq", "sr",
"ss", "st", "su", "sv", "sw", "ta", "te", "tg", "th", "ti", "tk", "tl",
"tn", "to", "tr", "ts", "tt", "tw", "ty", "ug", "uk", "ur", "uz", "ve",
"vi", "vo", "wa", "wo", "xh", "yi", "yo", "za", "zh", "zu", "en-US",
"jp-JS", "pt-PT", "pt-BR", "de-CH", "de-DE-1901", "es-419", "sl-IT-nedis",
"en-US-boont", "mn-Cyrl-MN", "sr-Cyrl", "sr-Latn",
"zh-TW", "en-GB-boont-posix-r-extended-sequence-x-private",
"nan-Hans-MM-variant2-variant1-t-zh-latn-u-ca-chinese-x-private",
"yue-HK", "de-CH-x-phonebk", "az-Arab-x-aze-derbend",
"qaa-Qaaa-QM-x-southern",
];
for (var tag of validTags) {
const expected = `Expect lang to be "${tag}"`;
data.jsonText = JSON.stringify({
lang: tag,
});
const result = processor.process(data);
is(result.lang, tag, expected);
}
...yeah, "sh" processed above will ultimately go through https://searchfox.org/mozilla-central/rev/da855d65d1fbdd714190cab2c46130f7422f3699/dom/manifest/ValueExtractor.jsm#73 which will behave differently after these patches. Looks like the stuff in this file needs a regen of expected results, more or less, for at least that value, possibly for others.
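One way to find the affected entries, sketched under the assumption that the manifest code canonicalises the lang member through Intl.getCanonicalLocales (the linked line is not quoted here, so treat that as an assumption): compare each tag with its canonical form and list the ones that differ. The helper name and the short sample list are hypothetical; which tags show up depends on the engine's CLDR data.

```javascript
// Hypothetical helper: canonical form of a single tag as the engine sees it.
function canonicalLang(tag) {
  return Intl.getCanonicalLocales(tag)[0];
}

// A few tags taken from the validTags list above.
const sample = ["en-US", "sh", "de-CH-1996", "sr-Latn"];

// Tags whose canonical form differs from the input need new expected values.
const changed = sample.filter((tag) => canonicalLang(tag) !== tag);
console.log(changed);
```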
Assignee | ||
Comment 30•5 years ago
ECMA-402 changed the language tag specification from RFC-5646 BCP-47 language
tags to UTS 35 Unicode BCP-47 locale identifiers. Update the expected
canonicalisation results accordingly.
Depends on D37450
Assignee | ||
Comment 31•5 years ago
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0032fb96e8190bf9d4230d4bdecb934b0a19627
Comment 32•5 years ago
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/cc29a8483de6
Part 7: Ensure subtags are in normalised case. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/79f844a1bf23
Part 8: Order all subtags in canonical syntax form. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/4e9279f09500
Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/80158bfcf0f5
Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/a11b3db4d9ca
Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/21eae8e6487a
Part 12: Add simple language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/2b41a865b58b
Part 13: Add complex language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/fec9e8ac45dd
Part 14: Remove no longer used IANA language subtag registry code. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/20bd2706beb6
Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/be2eb3b3774c
Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/69499c0220c4
Part 17: Remove references to RFC 5646. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1e0a350b954a
Part 18: Update 'lang' member test for Web manifest to match latest ECMA-402. r=marcosc
Comment 33•5 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/cc29a8483de6
https://hg.mozilla.org/mozilla-central/rev/79f844a1bf23
https://hg.mozilla.org/mozilla-central/rev/4e9279f09500
https://hg.mozilla.org/mozilla-central/rev/80158bfcf0f5
https://hg.mozilla.org/mozilla-central/rev/a11b3db4d9ca
https://hg.mozilla.org/mozilla-central/rev/21eae8e6487a
https://hg.mozilla.org/mozilla-central/rev/2b41a865b58b
https://hg.mozilla.org/mozilla-central/rev/fec9e8ac45dd
https://hg.mozilla.org/mozilla-central/rev/20bd2706beb6
https://hg.mozilla.org/mozilla-central/rev/be2eb3b3774c
https://hg.mozilla.org/mozilla-central/rev/69499c0220c4
https://hg.mozilla.org/mozilla-central/rev/1e0a350b954a