Closed Bug 802082 Opened 12 years ago Closed 12 years ago

Merge encodings that IE or WebKit treat as the same

Categories

(Core :: Internationalization, enhancement)

RESOLVED DUPLICATE of bug 801402

People

(Reporter: ayg, Assigned: ayg)

Attachments

(1 file)

Bug 802030 deals with merging us-ascii, iso-8859-1, and windows-1252. That's potentially risky, because all major browsers (IE, Firefox, Chrome) treat them as distinct. But the encoding spec mandates merging lots of other encodings that we treat as distinct too, listed at bug 802030 comment 2. All of these other merges are already implemented by IE or Chrome, so we have much more assurance that they're safe.
So the following charsets will no longer exist:
* Big5-HKSCS -> Big5
* GB2312 -> gbk
* ISO-8859-6-E -> ISO-8859-6
* ISO-8859-6-I -> ISO-8859-6
* ISO-8859-8-E -> ISO-8859-8
* ISO-8859-9 -> windows-1254
* ISO-8859-11 -> windows-874
* TIS-620 -> windows-874
* windows-949 -> euc-kr

I have to figure out which encoder/decoder to choose in each case, though.
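For illustration only, the merges listed above can be pictured as a lookup from an old charset label to the encoding it would resolve to. This is a minimal sketch, not the Gecko implementation: `NormalizeLabel`, `ResolveMergedCharset`, and `kMerges` are hypothetical names, and the table contains only the nine mappings from the list above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Lower-case an ASCII charset label so lookups are case-insensitive.
std::string NormalizeLabel(std::string label) {
  std::transform(label.begin(), label.end(), label.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return label;
}

// Hypothetical helper: returns the merged encoding name for a label,
// or the lower-cased label itself when no merge applies.
std::string ResolveMergedCharset(const std::string& label) {
  static const std::unordered_map<std::string, std::string> kMerges = {
      {"big5-hkscs", "big5"},
      {"gb2312", "gbk"},
      {"iso-8859-6-e", "iso-8859-6"},
      {"iso-8859-6-i", "iso-8859-6"},
      {"iso-8859-8-e", "iso-8859-8"},
      {"iso-8859-9", "windows-1254"},
      {"iso-8859-11", "windows-874"},
      {"tis-620", "windows-874"},
      {"windows-949", "euc-kr"},
  };
  const std::string key = NormalizeLabel(label);
  auto it = kMerges.find(key);
  return it == kMerges.end() ? key : it->second;
}
```

With a table like this, several old labels collapse onto one target, which is why a single encoder/decoder pair has to be chosen per merged group.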
41 files changed, 42 insertions(+), 1180 deletions(-)

This only tries to tackle three of the merges, which Anne advised me would be the safest to start with. It was mindlessly adapted from the patch to bug 623610. I have no idea if this even makes sense, but it seems to compile.

Try: https://tbpl.mozilla.org/?tree=Try&rev=264a46a7828e

I guess we want someone interested in mail to comment on whether this is a bad idea for them. It will probably make outgoing mail declare its encoding as windows-* instead of ISO-8859-*, which maybe other clients don't like. If that is a problem, what do we want to do about it, here and in similar cases?
Attachment #671796 - Flags: review?(smontagu)
Since you remove variants, you might want to have the UI just say "Thai" instead of also listing the encoding. Chrome does the same.
Comment on attachment 671796
Patch part 1 -- Merge ISO-8859-9 and -11 and TIS-620 with windows-1254 and -874

>- {"ISO-8859-9", "ISO8859_9"},
>+ {"ISO-8859-9", "Cp1254"},

Just remove this line. The left-hand side is a canonical charset name, so "ISO-8859-9" will never appear after the merge.

>- {"iso88599",iso9_tbl}, //ISO-8859-9
> {"iso885910",iso10_tbl}, //ISO-8859-10
>- {"tis620",tis620_tbl}, //TIS-620/ISO-8859-11
>- {"tis6202533",tis620_tbl}, //TIS-620/ISO-8859-11
>- {"iso885911",tis620_tbl}, //TIS-620/ISO-8859-11

Then hunspell doesn't support Thai anymore? I don't think that's correct.
(In reply to Masatoshi Kimura [:emk] from comment #4)
> Then hunspell doesn't support Thai anymore? I don't think it's correct.

What are those tables in hunspell used for? Do we ever feed hunspell with an encoding other than UTF-* anyway?
Dunno. Please ask spellchecker folks.
I needed this change too to fix a failing test:

--- a/extensions/universalchardet/src/base/LangThaiModel.cpp
+++ b/extensions/universalchardet/src/base/LangThaiModel.cpp
@@ -180,10 +180,10 @@ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
 const SequenceModel TIS620ThaiModel =
 {
   TIS620CharToOrderMap,
   ThaiLangModel,
   (float)0.926386,
   false,
-  "TIS-620"
+  "windows-874"
 };

Maybe the classes in that file should be renamed?

I'll address other feedback when smontagu says how he'd like me to proceed (or if he's even the right person to ask for review).
Comment on attachment 671796
Patch part 1 -- Merge ISO-8859-9 and -11 and TIS-620 with windows-1254 and -874

Change
> {"TIS-620", "MS874"},
to
{"windows-874", "MS874"},
I'm not happy with merging encoders (as opposed to decoders) until we have a way to make it not apply to sent mail.
Bug 801402 will fix this without degrading the mail.
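The encoder/decoder distinction raised above can be sketched as a lookup that applies the merges only when decoding, so that outgoing mail would keep declaring the original charset name. This is a hedged illustration of the idea, not a description of what bug 801402 actually implements; `CharsetUse`, `ResolveCharsetFor`, and `kDecodeOnlyMerges` are hypothetical names, and the table is a small subset chosen for the example.

```cpp
#include <string>
#include <unordered_map>

// Whether the caller wants to decode incoming bytes or encode outgoing text.
enum class CharsetUse { Decode, Encode };

// Hypothetical helper; expects an already lower-cased label.
std::string ResolveCharsetFor(const std::string& label, CharsetUse use) {
  // Merges applied only on the decoding path (illustrative subset).
  static const std::unordered_map<std::string, std::string> kDecodeOnlyMerges = {
      {"iso-8859-9", "windows-1254"},
      {"iso-8859-11", "windows-874"},
      {"tis-620", "windows-874"},
  };
  if (use == CharsetUse::Decode) {
    auto it = kDecodeOnlyMerges.find(label);
    if (it != kDecodeOnlyMerges.end()) {
      return it->second;
    }
  }
  // Encoding path, or no merge for this label: keep the name unchanged.
  return label;
}
```

Under this assumption, incoming content labeled iso-8859-9 would be decoded as windows-1254, while text encoded for sending would still be labeled iso-8859-9.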
Depends on: 801402
If bug 801402 brings us in line with the spec just as well, this bug is no longer necessary.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Attachment #671796 - Flags: review?(smontagu)