Use UTF-8 for all outgoing email
Categories
(MailNews Core :: Composition, task, P2)
Tracking
(thunderbird_esr78 wontfix)
Tracking | Status | |
---|---|---|
thunderbird_esr78 | --- | wontfix |
People
(Reporter: hsivonen, Assigned: rnons)
References
Details
(Keywords: intl)
Attachments
(4 files, 9 obsolete files)
(deleted),
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
rnons
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
rnons
:
review+
|
Details | Diff | Splinter Review |
Comment 2•12 years ago
|
||
Reporter | ||
Comment 4•12 years ago
|
||
Comment 5•12 years ago
|
||
Comment 6•12 years ago
|
||
Comment 7•12 years ago
|
||
Reporter | ||
Comment 8•12 years ago
|
||
Comment 10•12 years ago
|
||
Comment 11•11 years ago
|
||
Reporter | ||
Comment 12•11 years ago
|
||
Reporter | ||
Comment 13•11 years ago
|
||
Reporter | ||
Comment 14•9 years ago
|
||
Comment 15•7 years ago
|
||
Reporter | ||
Comment 16•7 years ago
|
||
Comment 17•7 years ago
|
||
Comment 18•7 years ago
|
||
Comment 19•5 years ago
|
||
We should probably do this now. Even Japanese set the mailnews.send_default_charset to UTF-8 since back in 2017. https://dxr.mozilla.org/l10n-central/source/ja/mail/chrome/messenger/messenger.properties#281
If they can do it, and Gmail has done it since 4-5 years already, it doesn't seem likely it would cause any problems.
Comment 20•4 years ago
|
||
Let's take this on now, and save ourselves from some pointless charset complications.
Comment 21•4 years ago
|
||
Can you please post to the relevant mailing lists announcing this change.
Assignee | ||
Comment 22•4 years ago
|
||
This patch
- removes the
Text Encoding
menu from compose window - removes
mailnews.send_default_charset
preference
Do we want to remove mailnews.reply_in_default_charset
as well?
(In reply to Jorg K (CEST = GMT+2) from comment #21)
Can you please post to the relevant mailing lists announcing this change.
OK, I will post to the maildev list after my patch got reviewed, so that I know I understand the problem correctly.
Assignee | ||
Updated•4 years ago
|
Reporter | ||
Comment 23•4 years ago
|
||
Do we want to remove
mailnews.reply_in_default_charset
as well?
It seems to me that pref needs to be removed (in the sense of always replying in UTF-8) in order to end up it a state where all outgoing email is in UTF-8 and the non-UTF-8 email generation code paths become irrelevant.
Comment 24•4 years ago
|
||
Yes that should be removed as well.
Comment 25•4 years ago
|
||
Assignee | ||
Comment 26•4 years ago
|
||
- Remove
mailnews.reply_in_default_charset
andmailnews.disable_fallback_to_utf8
- Remove characterSet from nsIMsgCompFields
- Fix tests
Assignee | ||
Comment 27•4 years ago
|
||
Fix Windows build. Try run: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=149179d49660f1d14bce26b17c2db9
Comment 28•4 years ago
|
||
Comment 29•4 years ago
|
||
I don't think it's very useful to split up backend, UI and test changes. We could split up removal of delsp though.
Comment 30•4 years ago
|
||
Assignee | ||
Comment 31•4 years ago
|
||
::: mailnews/compose/test/unit/test_autoReply.js
@@ +215,5 @@// The text gets converted to UTF-8 (regardless of what it is) at some point.
// Suspect a bug with how BinaryInputStream handles the strings.
if (templateHdr.Charset == "windows-1252") {
// XXX: should really check for " åäö"
- if (!body.includes("åäö xlatin1")) {
I don't think this should change?
Old templates/drafts can have outdated charsets. We'll have to deal with that
Yes, it's a bit confusing, but the change is needed. Take https://searchfox.org/comm-central/source/mailnews/test/data/template-utf8#26 for example, åäö xutf8
became "åäö xutf8"
in the test file https://searchfox.org/comm-central/source/mailnews/compose/test/unit/test_autoReply.js#230. Maybe due to https://searchfox.org/comm-central/source/mailnews/compose/test/unit/test_autoReply.js#214-216
I can split up if that makes reviewing easier. But if you don't mind, I will not split then, and will remove delsp
in a separate patch.
I'm surprised that https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_charsetEdit.js doesn't need to change.
I thought it was irrelevant so I removed it, seems I'm wrong, will update.
Comment 32•4 years ago
|
||
Sorry, I missed the removal. Maybe it is irrelevant now, I'd have to check closer. What happens with you receive a message in windows-1252 or ISO-2022-JP and you go "edit as new message". That needs to be successfully converted into UTF-8 now. That case is a little similar to that test, no?
Comment 33•4 years ago
|
||
(In reply to Jorg K (CEST = GMT+2) from comment #28)
Also, keep in mind that the outgoing delsp processing was only introduced
for ISO-2022-JP, so if you scrap that, the delsp processing can be removed,
for example here:
I was going to say that's not entirely true, but then I went back searching through the bugs and remembered that we have this giant cesspool of interaction between format=flowed, plaintext serializer, charset, and Content-Transfer-Encoding selection that is overcomplicated and way too deferential to supporting legacy issues that may not matter anymore, and so what ought to be the design in a smart modern system has too little implications for our existing codebase. When we can watch all of this crap code go up in a merry blaze, I will be quite happy.
I haven't looked at this patch in detail yet (splinter review is crapping out on me), but I will ask that any time you assign "UTF-8" to a nsString only to immediately pass it into a UTF-16-to-ASCII (or UTF-8) conversion function, just strip out the middleman and assign it to a nsCString instead. Better yet, try to go forward and pass "UTF-8" literal strings to directly wherever the results are being used.
Assignee | ||
Comment 34•4 years ago
|
||
- Update browser_charsetEdit.js
- Fix some comments and pass "UTF-8" directly in a few places
Assignee | ||
Comment 35•4 years ago
|
||
Remove delsp related code.
Comment 36•4 years ago
|
||
Comment on attachment 9169733 [details] [diff] [review]
862292-delsp.patch
You missed https://searchfox.org/comm-central/rev/b3767e49a6acf1f66abd0a276bd71579eada4787/mailnews/compose/src/MessageSend.jsm#359 (that's new, I know, but Searchfox finds it). Actually, there is more in that file:
https://searchfox.org/comm-central/search?q=DelSp&path=&case=false®exp=false
Not that the stuff in mimetpfl.cpp is for incoming delsp, so that needs to stay.
Assignee | ||
Comment 37•4 years ago
|
||
Thanks, if it's OK with you, I'm planning to update MessageSend.jsm related stuff in bug 1211292, after this bug is closed.
Comment 38•4 years ago
|
||
Comment on attachment 9169733 [details] [diff] [review]
862292-delsp.patch
Sure. Sorry.
Assignee | ||
Comment 39•4 years ago
|
||
Remove unnecessary charset
and disallowBreaks
arguments from GetSerialiserFlags
.
Comment 40•4 years ago
|
||
This change breaks Chinese, Japanese and Korean languages (CJK), and probably some others.
The issue is that Unicode is broken for CJK. Early on they made the mistake of trying to merge many characters in these languages that are in fact not the same, so it becomes impossible to create a true Unicode font that works for more than one of the three.
It's a severe issue because it prevents people writing their names correctly and other basic stuff like that. Worse still the receiver of the email may see a different character on their screen to the one the sender wrote.
So until Unicode gets fixed for CJK Thunderbird should default to their preferred national character encodings. I don't know if it would be easier to have an allowlist or denylist of broken Unicode languages because I'm not sure how widespread this problem is. I seem to recall some Thai people complaining they couldn't render their names in Unicode a few years back.
Comment 41•4 years ago
|
||
Thai uses UTF-8 already so I can't think that's a problem. Japanese also uses UTF-8.
zh-TW uses UTF-8 as well. I'm assuming zh-CN uses gbk from old habit or more or less political reasons?
Korean uses EUC-KR, I don't know if there's a reason. If there's a reason, what does Gmail do for Korean?
https://dxr.mozilla.org/l10n-central/search?q=mailnews.send_default_charset&redirect=false
Reporter | ||
Comment 42•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #41)
Thai uses UTF-8 already so I can't think that's a problem.
It's not a problem. Han unification is irrelevant to Thai. The way Unicode analyzes Thai is based on the Thai national standard that existed before. There's no benefit from using the legacy encoding.
Japanese also uses UTF-8.
Moreover, absent a hint of how to resolve CJK ambiguity, Gecko resolves in favor of Japanese by default. This makes sense for platforms that have distinct font designs for different CJK locales, since this way kana and kanji match design-wise.
zh-TW uses UTF-8 as well.
So there's no change from this patch for Traditional Chinese.
I'm assuming zh-CN uses gbk from old habit or more or less political reasons?
There are potential relevance to font selection for incoming email that was labeled as gbk (if the CJK resolution hint from encoding even survives in Thunderbird the way it survives in Firefox; I didn't check). However, if the result is unsatisfactory, users can change the font Thunderbird uses to a font that resolves the issue. In any case, zh-CN users have been receiving UTF-8 email from non-Thunderbird senders for ages and, therefore, must have dealt with this already.
Korean uses EUC-KR, I don't know if there's a reason. If there's a reason, what does Gmail do for Korean?
Han characters are rarely used in Korean these days. They mostly appear as abbreviations in newspaper headlines and in parenthetical clarifications. I don't have statistics for email, but it's indicative that the Android input method for Korean doesn't even support the input of Han characters. Hangul doesn't have anything to unify with C or J, so there are no unification issues.
Comment 43•4 years ago
|
||
This is incorrect, Japan still uses a lot of Shift-JIS because of the Unicode issues and I believe China is BIG-5, but I'm not an expert on Chinese.
Again though, the issue is less of a problem for say Japanese people who only communicate with other Japanese people in Japanese. Where it becomes an issue is when they need to talk to a Chinese or Korean person.
Imagine if you booked a flight from Xiamen to Tokyo but the ticket had your name printed wrong because of Unicode. Now you have to explain to the staff that it's a character encoding issue, your browser displays Chinese Unicode and the airline uses Japanese Unicode and please look at the Latin transliteration and ignore the fact that there is a mismatch with my passport.
I remembered wrong, it was Bengali that was the issue: https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name
That was 2015, I don't know if they have fixed it and all relevant fonts have been updated and all relevant software that processes and displays Unicode has been fixed.
I suggest that Thunderbird displays a notification to the user asking them if they want to make the change, and warning them that it could cause potential issues with certain languages (Chinese, Japanese, Korean, Bengali, possibly others).
Reporter | ||
Comment 44•4 years ago
|
||
(In reply to kuro68k from comment #43)
This is incorrect, Japan still uses a lot of Shift-JIS because of the Unicode issues and I believe China is BIG-5, but I'm not an expert on Chinese.
I request that you refrain from arguing about this further on this bug report. The above sentence is indicative that it is not going to be productive.
I remembered wrong, it was Bengali that was the issue: https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name
That article is unfortunate in its attribution of issues. Unicode's analysis of Bengali is based on ISCII, which the federal government of India issued in 1988. The approach of analyzing non-Devanagari Brahmic scripts used in India as isomorphisms of Devanagari comes from there.
In any case, Thunderbird doesn't support any non-Unicode encodings for Bengali. Also, Thunderbird processes all text as Unicode internally, so the notion that not using UTF-8 for interchange somehow avoided Unicode is mistaken.
Comment 45•4 years ago
|
||
That article is unfortunate in its attribution of issues.
Not really, it either works or it doesn't and Unicode doesn't. Naturally people who can't write their own names in it care little where the blame is assigned, only that it gets fixed.
Anyway I realized that this isn't an issue as long as Thunderbird adds the right RFC 3282 header. Well, it is an issue because at least outside of HTML email you can't mix CJK, but it addresses the main problem here.
Comment 46•4 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #42)
Japanese also uses UTF-8.
Moreover, absent a hint of how to resolve CJK ambiguity, Gecko resolves in favor of Japanese by default. This makes sense for platforms that have distinct font designs for different CJK locales, since this way kana and kanji match design-wise.
It is no longer true. We changed the precedence recently (see bug 1581822) and made it configurable (bug 1596875).
(In reply to kuro68k from comment #43)
This is incorrect, Japan still uses a lot of Shift-JIS because of the Unicode issues and I believe China is BIG-5, but I'm not an expert on Chinese.
Did you forget that this bug is about text encoding for e-mails? We (Japanese) do not use Shift-JIS encoding for e-mail messages. We used to use ISO-2022-JP exclusively.
Comment 47•4 years ago
|
||
Comment 48•4 years ago
|
||
(In reply to Joshua Cranmer [:jcranmer] from comment #47)
An interesting test to add to conduct (and add to our test suite) is
replying inline to a non-UTF-8 message. A further test would be to reply
inline to a message whose charset has been manually overridden by the UI.
Right, and, as I said in comment #32, we also need to cover the other code path, which is forward, "edit as new message", edit draft/template. You could have an old non-UTF-8 draft/template. Remember that composition has two very distinct code paths: One is reply and one is the other lot.
EDIT: Looks like there are some forwarding tests here: https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_forwardUTF8.js which can be extended to non-UTF-8 messages.
There is also https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_replyMultipartCharset.js which uses body-greek.eml. So you'd have to check what is covered and what is not and extend what is already there.
Assignee | ||
Comment 49•4 years ago
|
||
Updates according to comment 47. Fixed QuoteMessage
argument, added browser_quoteMessage.js
.
An interesting test to add to conduct (and add to our test suite) is replying inline to a non-UTF-8 message. A further test would be to reply inline to a message whose charset has been manually overridden by the UI.
I've tested replying to non-UTF8 message a few times before, it worked well. After your comment, I became aware of the Options > Quote Message
menu, it did corrupt non-UTF8 text, thanks. I've fixed argument of QuoteMessage
and added browser_quoteMessage.js
for this case. For the case of manually changing the view charset to an incorrect one, then click reply, the text is corrupted as well. But I think that is expected, at least it's the same on nightly, do I need to do anything?
(In reply to Jorg K (CEST = GMT+2) from comment #32)
Sorry, I missed the removal. Maybe it is irrelevant now, I'd have to check closer. What happens with you receive a message in windows-1252 or ISO-2022-JP and you go "edit as new message". That needs to be successfully converted into UTF-8 now. That case is a little similar to that test, no?
Sorry, I forgot to reply this message earlier, click "edit as new message" on a ISO-2022-JP message works well.
(In reply to Jorg K (CEST = GMT+2) from comment #48)
Right, and, as I said in comment #32, we also need to cover the other code path, which is forward, "edit as new message", edit draft/template. You could have an old non-UTF-8 draft/template. Remember that composition has two very distinct code paths: One is reply and one is the other lot.
Seems to me browser_replyMultipartCharset.js
already covered reply, edit as new and forward. And test_autoReply
covered non-UTF8 template.
The case I missed was Options > Quote Message
, which is fixed in this patch.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Comment 50•4 years ago
|
||
Comment 51•4 years ago
|
||
Assignee | ||
Comment 52•4 years ago
|
||
Updates according to comment 50.
- use var
- use NS_ConvertUTF16toUTF8
- remove unused test
We usually use var for the top level variables.
Seems to me const is used more in m-c now.
Maybe the forceMsgEncoding and isAsciiOnly are not something we need anymore?
I think isAsciiOnly can be removed, maybe we should do it separately. But forceMsgEncoding
is used to determine QP. https://searchfox.org/comm-central/source/mailnews/compose/src/nsMsgAttachmentHandler.cpp#334
Assignee | ||
Comment 53•4 years ago
|
||
eslint
Comment 54•4 years ago
|
||
Updated•4 years ago
|
Reporter | ||
Comment 55•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #54)
mOriginalHTMLBody = NS_ConvertUTF16toUTF8(origHTMLBody).get() ?
= NS_ConvertUTF16toUTF8(bodyText).get()?
Wouldn't using .get()
result in use-after-free? What would keep the pointee of mOriginalHTMLBody
alive in that case?
Why isn't mOriginalHTMLBody
an nsCString
in the first place?
Comment 56•4 years ago
|
||
I think you're right. Looks like we should just make mOriginalHTMLBody an nsCString.
Assignee | ||
Comment 57•4 years ago
|
||
Updates according to comment 54.
Maybe you don't need
attachment1_body
anymore either
I think it's still needed since there is rv = SnarfAndCopyBody(attachment1_body, TEXT_HTML);
later.
(In reply to Magnus Melin [:mkmelin] from comment #56)
I think you're right. Looks like we should just make mOriginalHTMLBody an nsCString.
I didn't make this change. There are several other char*
and ToNewCString
s in nsMsgSend, I don't see the immediate benefit of only changing mOriginalHTMLBody to nsCString in this patch.
Try run: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=32a5ce888a39968972d2b1be17efef
Assignee | ||
Updated•4 years ago
|
Comment 58•4 years ago
|
||
Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/58159743bab4
Use UTF-8 for all outgoing email. r=mkmelin
https://hg.mozilla.org/comm-central/rev/0d9bc3ca1452
Remove delsp since UTF-8 is always used. r=mkmelin
Comment 59•4 years ago
|
||
There's a new failure in comm/mailnews/compose/test/unit/test_longLines.js on Windows, which seems likely to be related to this bug. Please check it out.
Example failure log:
https://treeherder.mozilla.org/logviewer.html#?job_id=314049833&repo=comm-central&lineNumber=4043
(It's there in your Try run too. I didn't spot it before landing.)
Assignee | ||
Comment 60•4 years ago
|
||
Fix test failure in test_longLines.
There is also a failure in browser_quoteMessage.js related to open menu on macOS, looking into it.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 61•4 years ago
|
||
Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/a6bcebe11fbc
followup - Fix newline character in test_longLines. r=mkmelin DONTBUILD
Assignee | ||
Comment 62•4 years ago
|
||
Call goDoCommand("cmd_quoteMessage")
directly on macOS to fix browser_quoteMessage.
Try run: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=08fae1a373dc2579e3e59dad3a3deac5549a56ab
Comment 63•4 years ago
|
||
Updated•4 years ago
|
Assignee | ||
Comment 64•4 years ago
|
||
Just found browser_quoteMessage still failed for macOS on Try server, will take another look.
Assignee | ||
Comment 65•4 years ago
|
||
sleep(1)
seems more reliable. https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=dd65416e54766e5131970a3d1e83c62a1268
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Comment 66•4 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/e79e0ea42b97
followup - Workaround clicking menu on mac to make browser_quoteMessage pass. r=mkmelin
Description
•