Enable UTF-8 IMAP store (RFC 6855: add Support for UTF8=ACCEPT and UTF8=ONLY)
Categories
(MailNews Core :: Networking: IMAP, enhancement, P1)
Tracking
(thunderbird_esr78 wontfix, thunderbird82 wontfix)
People
(Reporter: vesely, Assigned: gds)
References
Details
Attachments
(1 file, 20 obsolete files)
(deleted),
patch
|
mkmelin
:
review+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Steps to reproduce:
Open a message that contains non-ASCII header fields from a Courier-IMAP mailstore.
Actual results:
Get a warning like:
Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8.
Note that alert messages are displayed in an unpredictable fashion.
Expected results:
I noticed no hiccups when receiving messages with utf-8 header fields. I can copy them to and from local folders without harm. Note, in particular, that Courier-IMAP does not downgrade messages —that operation is unsafe if it involves, for example, an utf-8 local part.
Because of bug #1563891, I cannot reply to messages from utf-8 local parts addresses. However, they are received well. The proper action would be to let the server know, by adding an IMAP command after login. For example:
server: * CAPABILITY IMAP4rev1 UIDPLUS [...] ENABLE UTF8=ACCEPT
server: 5 OK CAPABILITY completed
client: 6 ENABLE UTF8=ACCEPT
server: * ENABLED UTF8=ACCEPT
server: 6 OK Options enabled
Updated•5 years ago
|
Comment 1•5 years ago
|
||
There's more than just sending ENABLE UTF8. The complete specification is RFC 6855.
Comment 2•5 years ago
|
||
Seems like a larger job.
Reporter | ||
Comment 3•5 years ago
|
||
Hi Jorg,
(In reply to Jorg K (GMT+2) from comment #2)
Seems like a larger job.
IME, the behavior is fine as is. Yes, there may be cases... there always are. To send ENABLE UTF8 will make it official that messages can have utf-8 values in the header. That's the right direction anyway. Possible issues will be addressed when they emerge, as usual.
The alternative, i.e. not enabling UTF8=ACCEPT, can be worse with varying IMAP servers. As RFC 6855 suggests, servers can hide EAI mail or produce surrogates. That would be decisively annoying. Users unable to get their messages won't like it —and we know sporadic unencoded Subject:'s leaked into our Inboxes even before our servers supported SMTPUTF8.
So, I ran into this recently when trying to figure out why Seamonkey's message filtering was no longer working and why I kept getting errors in my huge inbox. (I use Courier IMAP which appears to implement 6855/6856 rather aggressively).
Everything on my linux box has been UTF-8 by default for a couple of decades. Is there any magic flag in about:config or anything lower level to just make Thunderbird/Seamonkey send out this UTF8=ACCEPT so stuff starts working again? I am totally fine with anything manual even if you guys want to be cautious about doing it by default, and all my other clients still work fine.
Assignee | ||
Comment 5•5 years ago
|
||
(In reply to nemo from comment #4)
Everything on my linux box has been UTF-8 by default for a couple of decades. Is there any magic flag in about:config or anything lower level to just make Thunderbird/Seamonkey send out this UTF8=ACCEPT so stuff starts working again? I am totally fine with anything manual even if you guys want to be cautious about doing it by default, and all my other clients still work fine.
I don't see anywhere in the tb code where it responds to imap capability UTF8=ACCEPT. However, this doesn't look like it would be hard to do. Just send ". enable UTF8=ACCEPT" whenever a new connection is authenticated if the capability is present (and ENABLE capability is present which seem like it would be implied by UTF8=ACCEPT).
It could also be disabled by a new pref in case it causes problems.
Should I put this on my list and do this?
Comment 6•5 years ago
|
||
Go for it! :)
Yay. Thanks.
BTW, I found:
https://www.linuxquestions.org/questions/linux-software-2/thunderbird-and-unicode-how-do-you-know-you've-enabled-support-4175669972/
So clearly I'm not the only one struggling with this.
Assignee | ||
Comment 9•5 years ago
|
||
Here's a possible fix for this bug. It now recognizes the UTF8=ACCEPT capability and, on entering AUTHENTICATED imap state, sends the "ENABLE UTF8=ACCEPT" command. Since no action is needed on the OK response, the response is not parsed and acted upon (except to log a possible NO or BAD response).
There are a few other capabilities defined by the same RFC that defines UTF8=ACCEPT. Should we also recognize UTF8=ONLY, UTF8=ALL and UTF8=USER as described in https://tools.ietf.org/html/rfc6855?
I tested this with gmail that advertises UTF8=ACCEPT and I see the enable occur and it succeeds. I haven't tested for any other different behavior. If anyone can test this in a useful way I can make a "try" build. Just need to know your architecture (win64, OSX etc) and if you want it applied to daily trunk or 68.
Comment 10•5 years ago
|
||
Updated•5 years ago
|
Assignee | ||
Comment 11•5 years ago
|
||
I tested this on emails with Russian characters and it works as expected. I added the ability to inhibit the "enable UTF8=ACCEPT". I also I remove the "CHARSET <specification>" sub-string from the search string when UTF8=ACCEPT is in effect. There is probably a better way to do what I did but it does work. Tested with gmail and if "CHARSET <specification>" not removed, gmail returns an error and the search fails. Also, if server advertises UTF8=ONLY, I treat it like UTF8=ACCEPT; I think that is correct. (I don't check for the UTF8=<> items you pointed out as obsolete.)
Comment 12•5 years ago
|
||
Assignee | ||
Comment 13•5 years ago
|
||
Hmm... do we ever send a charset? I didn't find where. Maybe it should be fixed at that point if we do.
Yes we do send "search charset <spec> ..." here:
https://searchfox.org/comm-central/rev/edbbc461dc1489444ca7ab72fe5da32c826b28fe/mailnews/base/search/src/nsMsgImapSearch.cpp#110
I too preferred to just skip putting it in at that point instead of taking it out later. However, we only want to not put it in when capability UTF8=ACCEPT has been enabled. I don't know how make this known in nsMsgImapSearch.cpp when the enable occurs in nsImapProtocol.cpp. Maybe an access function of some sort?
Edit: Looking closer at this, I can't figure it out. If you really want this and think it can be done, I need some good advice on how to do it.
usually better to do the positive form. maybe useUTF8Accept
No problem, I can do this.
Comment 14•5 years ago
|
||
I guess you should be able to get the server from the folder and then the pref from the server. https://searchfox.org/comm-central/rev/edbbc461dc1489444ca7ab72fe5da32c826b28fe/mailnews/base/search/src/nsMsgSearchTerm.cpp#1638
Then that would need to be passed on to the searchadapter - https://searchfox.org/comm-central/rev/edbbc461dc1489444ca7ab72fe5da32c826b28fe/mailnews/base/search/src/nsMsgImapSearch.cpp#17
(btw, gross that the .h file there is "nsMsgSearchImap.h !!)
Assignee | ||
Comment 15•5 years ago
|
||
Thanks for the hints. I think I did what you asked. With UTF8=ACCEPT in effect, It now leaves off the "CHARSET <name>" where the search string is created instead of removing it later before it is sent.
.
One thing I'm not sure is right: The function Encode() is declared as static. I needed to use the member variable m_scope inside of it but that's not allowed. I removed the static and it worked with m_scope inside. I then went back and put the static back and passed the variable m_scope into Encode() as a new parameter, called scope and used it inside. That also worked and seems better to add a parameter rather than remove the static designation.
It seems to be working OK but I probably need to test it a bit more and I've left a few printf's in the diff for now. Also, there are some blanks and end of lines I need to remove.
(btw, gross that the .h file there is "nsMsgSearchImap.h !!)
Yes, I noticed that too. ...SearchImap.h and ...ImapSearch.cpp
Comment 16•5 years ago
|
||
Comment 17•5 years ago
|
||
Comment 18•5 years ago
|
||
2020-05-24 11:36:00.874872 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5fadd800:imap.gmail.com:A:CreateNewLineFromSocket: * XLIST (\HasNoChildren \Spam) "/" "[Gmail]/Skräppost"
Maybe it's Thunderbird who doesn't handle that...
Comment 19•5 years ago
|
||
2020-05-24 11:36:25.662826 UTC - [(null) 7281: Main Thread]: D/IMAP proposed url = [Gmail]/Skräppo folder for connection INBOX has To Wait = false can run = false
2020-05-24 11:36:26.021864 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5f160000:imap.gmail.com:A:SendData: 6 select "[Gmail]/Skräppo"
2020-05-24 11:36:26.052594 UTC - [(null) 7281: IMAP]: D/IMAP ReadNextLine [stream=0x7f0b5ed2b3a0 nb=66 needmore=0]
2020-05-24 11:36:26.052611 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5f160000:imap.gmail.com:A:CreateNewLineFromSocket: 6 NO [NONEXISTENT] Unknown Mailbox: [Gmail]/Skräppo (Failure)
Comment 20•5 years ago
|
||
Or maybe have the pref on ifdef nightly.
Reporter | ||
Comment 21•5 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #17)
Running with this for a bit, I now notice that Gmail mangled my folder names
(I only saw this after a restart or two).
My Junk mail for that account is in Swedish so it's Skräppost, but it's now
showing as Skrᅢᄂppost
Pre Unicode IMAP used a modified UTF-7 encoding. Courier-MTA documents this here. However, after playing with iconv for a few minutes, I haven't been able to find out how the 'ᅢᄂ' came about.
Assignee | ||
Comment 22•4 years ago
|
||
Magnus, I made the changes you requested. I left the enabling pref true. Not sure if that's how you want it.
Comment 23•4 years ago
|
||
Assignee | ||
Comment 24•4 years ago
|
||
You still have it on by default.
Ok so I assume that by this:
Or maybe have the pref on ifdef nightly.
you wanted it off by default. Wasn't sure what you meant by "ifdef nightly".
But, can we resolve the folder naming thing?
I duplicated the problem and will take a look at it. Thanks for the links.
Assignee | ||
Comment 25•4 years ago
|
||
Tb is not currently designed, as far as I can tell, to accept folder names from the imap server in UTF8 except as single byte 7-bit ascii. In fact I guess it was assumed to be illegal (Re: Bug 63186 comment 23). TB still assumes that any mailbox/folder name returned by the server that contains a character needing more than one UTF8 byte is encoded in MUTF7 (as 7-bit ascii). For example, if I create the mailbox Skräppost--gds
at gmail.com and UTF8=ACCEPT is enabled, gmail returns in the imap LIST response Skräppost--gds
where the ä is 8-bit UTF8 chars 0xc3 0xa4 and the others are their 7-bit ascii values. The mailbox displays incorrectly in tb (with Korean characters) and when clicked-on, the mailbox name encoded in the imap SELECT is also invalid and rejected by the server. There are also problems with the mbox and .msf filenames.
But with UTF8=ACCEPT in effect (or if not in effect), I can create in tb the folder Skräppost--eds
. Tb encodes it as MUTF7Skr&AOQ-ppost--eds
for the imap CREATE command. Gmail returns the same MUTF7 string in imap LIST and the mailbox displays correctly in tb as Skräppost--eds
and works correctly when clicked on (the MUTF7 is sent in imap SELECT).
The display of Korean characters is caused by function
nsresult CopyMUTF7toUTF16(const nsACString& aSrc, nsAString& aDest) in mailnews/base/util/nsMsgI18N.cpp: https://searchfox.org/comm-central/rev/83ae42aa38bc39143b0f6d93bb69a40094c4b515/mailnews/base/util/nsMsgI18N.cpp#141
Looking with debugger at aSrc:
(gdb) p aSrc
$1 = (const nsACString &) @0x7fffffff75e8: {<mozilla::detail::nsTStringRepr<char>> = {
mData = 0x7fffd5e56c88 "Skräppost--gds", mLength = 15,
mDataFlags = (mozilla::detail::StringDataFlags::TERMINATED | mozilla::detail::StringDataFlags::REFCOUNTED),
mClassFlags = mozilla::detail::StringClassFlags::NULL_TERMINATED}, static kMaxCapacity = 2147483637}
(gdb) x/15bx aSrc.mData
0x7fffd5e56c88: 0x53 0x6b 0x72 0xc3 0xa4 0x70 0x70 0x6f
0x7fffd5e56c90: 0x73 0x74 0x2d 0x2d 0x67 0x64 0x73
The 2-byte UTF8 value is highlighted above. This gets converted from MUTF7 to UTF16 but it's not MUTF7; it is UTF8. Looking with debugger after conversion:
(gdb) p aDest
$3 = (nsAString &) @0x7fffffff7768: {<mozilla::detail::nsTStringRepr<char16_t>> = {
mData = 0x7fffffff777c u"Skrᅢᄂppost--gds", mLength = 15,
mDataFlags = (mozilla::detail::StringDataFlags::TERMINATED | mozilla::detail::StringDataFlags::INLINE),
mClassFlags = (mozilla::detail::StringClassFlags::INLINE | mozilla::detail::StringClassFlags::NULL_TERMINATED)}, static kMaxCapacity = 1073741817}
(gdb) x /30bx aDest.mData
0x7fffffff777c: 0x53 0x00 0x6b 0x00 0x72 0x00 0xc3 0xff
0x7fffffff7784: 0xa4 0xff 0x70 0x00 0x70 0x00 0x6f 0x00
0x7fffffff778c: 0x73 0x00 0x74 0x00 0x2d 0x00 0x2d 0x00
0x7fffffff7794: 0x67 0x00 0x64 0x00 0x73 0x00
The 2-byte UTF8 chars, highlighted above, are converted to U-FFC3 and U-FFA4 or ᅢand ᄂ
This seems to only really affect the "Pretty" name displayed in the folder list. I don't think it affects the mbox/.msf file names or the mailbox name sent to the server in imap commands (the "online" name). I can fix this in CopyMUTF7toUTF16() by checking aSrc using IsAscii() and if it returns false because aSrc is not a pure 7-bit string, do CopyUTF8toUTF16(aSrc, aDest) instead. This converts the 2-byte UTF8 character to the correct UTF-16 value U-00E4 and it displays correctly in the list.
Another problem is that when the UTF8 folder name containing "ä" is sent by tb to the imap server, the "online name" is corrupted to either one character short and/or the "ä" changed to "ä" (U-00c3,U-00a4). This would cause a failure on the imap SELECT server response due to an invalid mailbox name. I was never able to figure out why but removing this line fixes at least the "too short" problem:
https://searchfox.org/comm-central/rev/83ae42aa38bc39143b0f6d93bb69a40094c4b515/mailnews/imap/src/nsImapMailFolder.cpp#1896
Apparently the call to GetCharPropterty() right before this leaves off the last char(s) when the string has 8-bit chars. From what I can see, the value assigned to m_onlineFolderName is already OK so skipping this causes no problems. (Note: The original developer possibly alluded to this problem here: https://searchfox.org/comm-central/rev/83ae42aa38bc39143b0f6d93bb69a40094c4b515/mailnews/imap/src/nsImapMailFolder.cpp#1854)
The final problem is that on tb restart, the folders containing 8-bit UTF8s would alternately appear and then not appear in the folder list on subsequent restarts. Also the mbox/.msf files would change names (between containing "ä" and "ä") or have new numerical dash suffixes added on restart. Also, all headers and messages bodies for this folder were all re-downloaded on each restart. I think this was due to several places where UTF16 was converted to ASCII in a lossy manner. So changing several calls from NS_LossyConvertUTF16toASCII() to NS_ConvertUTF16toUTF8() fixed that problem.
The attached utf8.diff show the proposed fixes along with a lot of debug printf's and in a very rough form. It seems to fix the utf8 problem and I haven't noticed any other problems that my changes might have caused. But more testing is definitely required.
utf8.diff doesn't show anything from the original patch.
Assignee | ||
Comment 26•4 years ago
|
||
This cleans up utf8.diff to remove all the printfs and other debug stuff that make it hard to parse and only shows the essential changes in two files. With attachment 9153343 [details] [diff] [review] applied this seems to work correctly when applied on top of that patch. (It at least fixes the problem identified by Magnus in comment 17.)
Note: There are several places where I replaced CopyASCIItoUTF16() with CopyUTF8toUTF16() that I failed to mention in my previous (long) comment.
Reporter | ||
Comment 27•4 years ago
|
||
(In reply to gene smith from comment #25)
Tb is not currently designed, as far as I can tell, to accept folder names from the imap server in UTF8 except as single byte 7-bit ascii. In fact I guess it was assumed to be illegal (Re: Bug 63186 comment 23). TB still assumes that any mailbox/folder name returned by the server that contains a character needing more than one UTF8 byte is encoded in MUTF7 (as 7-bit ascii). For example, if I create the mailbox ```Skräppost--gds`` [...]
FWIW, Courier-IMAP fixes that:
courier@wmail:Maildir$ grep Skr imaplog.dat
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
WRITE: * LSUB (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
WRITE: * LSUB (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
READ: QUOTED_STRING: INBOX.Skr&AOQ-ppost
READ: QUOTED_STRING: INBOX.Skr&AOQ-ppost
WRITE: * MYRIGHTS "INBOX.Skr&AOQ-ppost" "acdilrsw"
READ: QUOTED_STRING: INBOX.Skr&AOQ-ppost
WRITE: * QUOTAROOT "INBOX.Skr&AOQ-ppost" "ROOT"
But
courier@wmail:Maildir$ ls -ld .Sk*
drwx------ 6 courier courier 4096 Jun 13 17:23 .Skräppost
Assignee | ||
Comment 28•4 years ago
|
||
FWIW, Courier-IMAP fixes that:
Is the grep result what courier sends to tb with UTF7=ACCEPT enabled? If so, courier isn't reporting the mailbox/folder name in UTF8 but in MUTF7. Therefore tb with just my original patch should work fine with courier. But with gmail, it returns the folder name as UTF8 so something like my utf8-clean.diff is also needed.
From comment 18, here's how gmail returns the folder name with UTF8=ACCEPT enabled:
2020-05-24 11:36:00.874872 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5fadd800:imap.gmail.com:A:CreateNewLineFromSocket: * XLIST (\HasNoChildren \Spam) "/" "[Gmail]/Skräppost"
With UTF8=ACCEPT not enabled gmail returns MUTF7 just like courier:
2020-05-24 11:36:00.874872 UTC - [(null) 7281: IMAP]: I/IMAP 0x7f0b5fadd800:imap.gmail.com:A:CreateNewLineFromSocket: * XLIST (\HasNoChildren \Spam) "/" "[Gmail]/Skr&AOQ-ppost"
Reporter | ||
Comment 29•4 years ago
|
||
(In reply to gene smith from comment #28)
FWIW, Courier-IMAP fixes that:
Is the grep result what courier sends to tb with UTF7=ACCEPT enabled?M
Oops, no, sorry, I thought it was clear I'm using a stock Thunderbird package (68.8.0). There is no ACCEPT string in the log, as it starts after login.
Here's a telnet session:
Connected to wmail.tana.it.
Escape character is '^]'.
* OK [CAPABILITY IMAP4rev1 UIDPLUS CHILDREN NAMESPACE THREAD=ORDEREDSUBJECT THREAD=REFERENCES SORT QUOTA AUTH=CRAM-MD5 AUTH=CRAM-SHA1 IDLE ACL ACL2=UNION STARTTLS XCOURIEROUTBOX=INBOX.SendMailViaImap XMAGICTRASH ENABLE UTF8=ACCEPT] Courier-IMAP ready. Copyright 1998-2019 Double Precision, Inc. See COPYING for distribution information.
a login **** ****
a OK LOGIN Ok.
a list "" "INBOX.Sk*"
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
a OK LIST completed
a status INBOX.Skr&AOQ-ppost messages
* STATUS "INBOX.Skr&AOQ-ppost" (MESSAGES 1)
a OK STATUS Completed.
a enable utf8=accept
* ENABLED UTF8=ACCEPT
a OK Options enabled
a list "" "INBOX.Sk*"
* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skräppost"
a OK LIST completed
a status INBOX.Skr&AOQ-ppost messages
a NO Mailbox does not exist, or must be subscribed to.
a status INBOX.Skräppost messages
a NO Mailbox does not exist, or must be subscribed to.
a status "INBOX.Skräppost" messages
* STATUS "INBOX.Skräppost" (MESSAGES 1)
a OK STATUS Completed.
a logout
* BYE Courier-IMAP server shutting down
a OK LOGOUT completed
Connection closed by foreign host.
Is it painful to install a patched TB?
Assignee | ||
Comment 30•4 years ago
|
||
Ok, I see. So courier and gmail behave the same once UTF8=ACCEPT is enabled and the string has to be quoted which tb always does AFAIK.
Is it painful to install a patched TB?
I can build version of tb 68 using the "try" server with my patches if this is what you want. Sounds like just a linux version would be OK based on your original comment 0. When built I can provide a link. You just un-pack the tarball, cd into the new tb directory and run ./thunderbird. No install is needed.
Assignee | ||
Comment 31•4 years ago
|
||
I attempt to build ESR 68 but I would have had to make unrelated changes to get it to merge and build. If you really want patched 68 let me know and I can do it, but I just went ahead and did the build on top of the latest trunk/daily from comm-central. The try build is here:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e9ab71c1f3aff80d9aa3c8027e1ea757095d0682
You can click on the green "B" and see the Job Details below. Then you can download and unpack the item called target.tar.bz2 and run thunderbird. The direct link is this:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/S9FhRXY2RXWXaw5ZOfXkjQ/runs/0/artifacts/public/build/target.tar.bz2
When you run the daily build with my patches applied, you will see something about profiles. I don't know what it all means but you can bypass that by running tb like this:
./thunderbird -p -allow-downgrade
and then select your usual profile (typically default). Not sure you really need the -allow-downgrade.
The enable UTF8=ACCEPT is true by default. Magnus asked to to make the pref false by default but I haven't done that yet.
Reporter | ||
Comment 32•4 years ago
|
||
(In reply to gene smith from comment #31)
./thunderbird -p -allow-downgrade
It works! The only thing I noticed, apart from UI changes, if a yellow warning triangle near the Daily Menu saying "PopmailListRecipients requires new permissions".
Server side logs differ:
--- /home/ale/tmp/imaplog.dat 2020-06-13 17:46:16.000000000 +0200
+++ /home/ale/tmp/imaplog-TBpatch.dat 2020-06-15 11:08:06.000000000 +0200
@@ -5,13 +5,20 @@
READ: EOL
WRITE: * NAMESPACE (("INBOX." ".")) NIL (("#shared." ".")("shared." "."))
4 OK NAMESPACE completed.
+READ: NUMBER: 5
+READ: ATOM: ENABLE
+READ: ATOM: UTF8=ACCEPT
+READ: EOL
+WRITE: * ENABLED UTF8=ACCEPT
+5 OK Options enabled
READ: NUMBER: 4
-READ: ATOM: NAMESPACE
+READ: ATOM: ENABLE
+READ: ATOM: UTF8=ACCEPT
+READ: NUMBER: 6
READ: EOL
-WRITE: * NAMESPACE (("INBOX." ".")) NIL (("#shared." ".")("shared." "."))
-4 OK NAMESPACE completed.
-READ: NUMBER: 5
READ: ATOM: LIST
+WRITE: * ENABLED UTF8=ACCEPT
+4 OK Options enabled
READ: QUOTED_STRING:
READ: QUOTED_STRING: INBOX.*
READ: EOL
@@ -21,7 +28,7 @@
READ: QUOTED_STRING: INBOX.*
READ: EOL
WRITE: * LIST (\Unmarked \HasNoChildren) "." "INBOX.Sent.S2018"
-* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skr&AOQ-ppost"
+* LIST (\Unmarked \HasNoChildren) "." "INBOX.Skräppost"
The enable UTF8=ACCEPT is true by default. Magnus asked to to make the pref false by default but I haven't done that yet.
Yes, it's true
. I found it in mail.server.default.allow_utf8_accept
. Of course, TB would only enable UTF8=ACCEPT when servers advertise the corresponding capability. At standard use, TB knows better than users whether UTF8 is supported, so it doesn't make much sense to keep it false
, except possibly on beta releases. To require that users use Config Editor to set that value in case their server support UTF8 sounds overelaborate.
Looking at Wikipedia table, there are just 6 servers having a green "Yes" in both IMAP and IDN/UTF8 columns, CommuniGate Pro, Courier Mail Server, Dovecot, MDaemon, Microsoft Exchange Server, and Oracle Communications Messaging Server. How many of them should be tested before concluding there are no problems?
Assignee | ||
Comment 33•4 years ago
|
||
It works! The only thing I noticed, apart from UI changes, if a yellow warning triangle near the Daily Menu saying "PopmailListRecipients requires new permissions".
Ok, glad to hear that it worked! Don't think I changed anything about POP so might be some other existing bug/feature.
Do you have an email that was rejected before but now works? I haven't done anything that touches internal headers that might be in UTF8 that might be rejected with the error like you show above:
"Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8."
Do you see this in tb or in your server log?
I only changed the mailbox name handling.
I guess I'm asking if you now don't see the message with the "try" version with the "problem" email? If so, could you attach a sample email that is fixed by the "try" tb version?
I have an assortment of test imap servers that I can access. The only one I see that advertises UTF8=ACCEPT is gmail. I don't see it with Dovecot, MDaemon or Outlook.com (M/S Exchange I assume). I don't have Courier, CG-Pro or Oracle.
Reporter | ||
Comment 34•4 years ago
|
||
(In reply to gene smith from comment #33)
Do you have an email that was rejected before but now works? I haven't done anything that touches internal headers that might be in UTF8 that might be rejected with the error like you show above:
"Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8."
Actually not. That behavior happened in the first UTF-8 release of Courier-IMAP, and was later hacked to return all messages, leaving it to the client to do what it can. The message appears transiently as a warning, not every time, but doesn't prevent fetching the message. Remarkably, TB 68.8.0 works nicely. When replying to an UTF-8 address, the message is not sent due to bug 1563891. However, while the message used to look right before hitting «Send», Daily 79.0a1 (2020-06-14) worsenes the appearance, as the field becomes mim?@fo?.it
(see image) —but that's another bug.
In the words of Sam Varshavchik (Courier developer):
The alert is trigged by 8-bit characters in headers but the IMAP client did
not enable UTF-8 mode.
Not really a bug, but a situation that's falling through the cracks, and
this serves as a visual heads-up that things have entered "garbage-in,
garbage-out" state of affairs.
That state of affairs is the reason why I'd opt for default true
...
I have an assortment of test imap servers that I can access. The only one I see that advertises UTF8=ACCEPT is gmail. I don't see it with Dovecot, MDaemon or Outlook.com (M/S Exchange I assume). I don't have Courier, CG-Pro or Oracle.
Fixed Wikipedia table, thanks.
Comment 35•4 years ago
|
||
Assignee | ||
Comment 36•4 years ago
|
||
Magnus, Sorry for the delay. I will try to finish this up tomorrow (Wednesday).
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 37•4 years ago
|
||
Due to bit rot this updates 1571672-UTF8-accept-v4.patch with no other changes to work with current tip. It still needs to be integrated with ut8-clean.diff, after some more testing and changes, to provide the complete patch.
Assignee | ||
Comment 38•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #35)
Probably needs to be changed to AUTF8String. ACString is ASCII only.
Changing that didn't help. All it changed was a comment in the resulting .h file. Also changed it on setCharProperty() and made no difference either.
It seems that the m_onlineFolderName is 1 char too short when the code is not commented out. The server reports the correct UTF8 folder name: "Skräppost--gds". But the online name tb uses to access the server is "Skräppost--gd" (length=14) instead of the correct "Skräppost--gds" (length=15) where the 'ä' is length 2. When it reads the string from the database, the lenght is 14 but when it writes it the length is 15, so don't know where the missing byte is down in the "mork" stuff. I never could find where it actually stores the 15 (or 14) bytes.
Not sure, but maybe this is the "crunch" referred to here:
https://searchfox.org/comm-central/rev/57ab5f2d0feff8be94d400b745d06aa55dbde61c/mailnews/imap/src/nsImapMailFolder.cpp#1853
It does end up corrupting m_onlineFolderName if that's what he meant by "crunch". But I don't think the code was tested originally with UTF8.
Comment 39•4 years ago
|
||
Did you change the CopyASCIItoUTF16 to use CopyUTF8toUTF16? https://searchfox.org/comm-central/rev/57ab5f2d0feff8be94d400b745d06aa55dbde61c/mailnews/imap/src/nsImapMailFolder.cpp#1858
Assignee | ||
Comment 40•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #39)
Did you change the CopyASCIItoUTF16 to use CopyUTF8toUTF16? https://searchfox.org/comm-central/rev/57ab5f2d0feff8be94d400b745d06aa55dbde61c/mailnews/imap/src/nsImapMailFolder.cpp#1858
Yes, been doing that all along.
I found a problem here https://searchfox.org/comm-central/rev/c3b6043c388edb0c72244958863fe2cbaf73149c/mailnews/db/msgdb/src/nsMsgDatabase.cpp#3394. Length() returns the number of UTF16 16-bit words and not the number of bytes after conversion to UTF8. That's why a byte was cut off of the onlineName UTF8 string. Doing strlen() on the null terminated converted string gives the right byte length (see diff fragment beflow). A similar use of strlen() is done here: https://searchfox.org/comm-central/rev/c3b6043c388edb0c72244958863fe2cbaf73149c/mailnews/base/src/nsMsgFolderCacheElement.cpp#107
/* static */ struct mdbYarn* nsMsgDatabase::nsStringToYarn(
struct mdbYarn* yarn, const nsAString& str) {
yarn->mYarn_Buf = ToNewCString(NS_ConvertUTF16toUTF8(str));
- yarn->mYarn_Size = str.Length() + 1;
+ yarn->mYarn_Size = strlen((const char*)yarn->mYarn_Buf) + 1;
yarn->mYarn_Fill = yarn->mYarn_Size - 1;
yarn->mYarn_Form =
0; // what to do with this? we're storing csid in the msg hdr...
return yarn;
}
After this, I realized that if you create or rename a folder name with non-ascii UTF8, it is sent to the server as mUTF7. It works OK and looks right in tb but when you go to webmail you see the strange mUTF7 string for the folder name. (I've only checked this on gmail.) I'm still in the process of making create and rename send UTF8 when UTF8=ACCEPT is enabled.
Edit 22 days later: In my patch I now send create and rename strings as UTF8 when UTF8=ACCEPT is in effect. So looks ok in webmail now.
Updated•4 years ago
|
Assignee | ||
Comment 41•4 years ago
|
||
Well, trying to get this to work has not been fun. With no offline store, when I open a message with multiple attachments and the folder name contains one or more non-ascii chars such as ä, the url is OK and the whole message is fetched and stored to RAM cache. Then each attachment (part) is accessed from cache and displayed inline. Then another URL occurs to fetch the whole message again (not sure why) but this time the url does not contain the proper 2 byte utf8 sequence for ä but instead contains the "replacement character" sequence that displays as a question mark inside a diamond (0xef, 0xbf, 0xbd). The 2nd fetch with the bad URL often causes a hang since it can't get a protocol connection.
I finally found a way to fix this. The bad utf8 was caused by the ä being encoded as a single byte 0xE4 instead of utf8, 0xc3,0xa4. Converting from ASCII to UTF16 and then back to UTF8 fixes it. However it only works when the UTF8 character in the folder name can be represented as a 2 byte "latin" utf8. If you want to have a 3 or more byte "non-latin" utf8 char in the folder name, like ᄂ (U+FFA4, or 0xef,0xbe,0xa4), it won't work.
With offline store the same problem occurs. The messages are all stored OK to mbox file with correctly encoded UTF8 URLs. But when an individual message is accessed, the URL again has the invalid replacement characters. The same fix also works to convert the URL from single byte to multi-byte UTF8 for only the "latin" chars.
The mystery is why the URL to access messages starts out OK, even with non-latin, but then gets "corrupted". (The corruption seem to be that the folder name starts out with proper UTF8 encoding but then somehow gets changed to ISO-8859-1 single byte per char style. It's fixable for "latin" chars that can be represented as a single byte but not for non-latin that won't fit in a single byte.)
Here's the diff fragment for the "fix" I'm referring to above:
diff --git a/mailnews/base/util/nsMsgDBFolder.cpp b/mailnews/base/util/nsMsgDBFolder.cpp
--- a/mailnews/base/util/nsMsgDBFolder.cpp
+++ b/mailnews/base/util/nsMsgDBFolder.cpp
@@ -2843,17 +2843,25 @@ nsresult nsMsgDBFolder::parseURI(bool ne
nsAutoCString fileName;
nsAutoCString escapedFileName;
url->GetFileName(escapedFileName);
if (!escapedFileName.IsEmpty()) {
// XXX conversion to unicode here? is fileName in UTF8?
// yes, let's say it is in utf8
MsgUnescapeString(escapedFileName, 0, fileName);
NS_ASSERTION(mozilla::IsUtf8(fileName), "fileName is not in UTF-8");
- CopyUTF8toUTF16(fileName, mName);
+ if (!mozilla::IsUtf8(fileName))
+ {
+ static nsAutoString temp;
+ printf("gds: 1. fileName is not in UTF-8=%s\n", fileName.get());
+ CopyASCIItoUTF16(fileName, temp /*mName*/);
+ mName = temp;
+ }
+ else
+ CopyUTF8toUTF16(fileName, mName);
}
}
// grab the server by parsing the URI and looking it up
// in the account manager...
// But avoid this extra work by first asking the parent, if any
nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(mServer, &rv);
if (NS_FAILED(rv)) {
@@ -2888,16 +2896,23 @@ nsresult nsMsgDBFolder::parseURI(bool ne
// now try to find the local path for this folder
if (server) {
nsAutoCString newPath;
nsAutoCString escapedUrlPath;
nsAutoCString urlPath;
url->GetFilePath(escapedUrlPath);
if (!escapedUrlPath.IsEmpty()) {
MsgUnescapeString(escapedUrlPath, 0, urlPath);
+ if (!mozilla::IsUtf8(urlPath))
+ {
+ static nsAutoString temp;
+ printf("gds: 2. urlPath is not in UTF-8=%s\n", urlPath.get());
+ CopyASCIItoUTF16(urlPath, temp);
+ urlPath = NS_ConvertUTF16toUTF8(temp);
+ }
// transform the filepath from the URI, such as
// "/folder1/folder2/foldern"
// to
// "folder1.sbd/folder2.sbd/foldern"
// (remove leading / and add .sbd to first n-1 folders)
// to be appended onto the server's path
Assignee | ||
Comment 42•4 years ago
|
||
I have it working properly now for your typical operations like create, rename, [un]subscribe, select, fetch, expunge etc. However, there is still a problem when the the folder name contains a non-latin1 character. That is, it works as long as the unicode value is less than or equal to U+00FF. If you have a folder with a non-latin1 character like U+FFC3 "ᅢ" you will see this warning at the console:
[138089, Main Thread] WARNING: char16_t out of char range; high bits of data lost: 0xffc3: file /home/gene/mozilla/js/xpconnect/src/XPCConvert.cpp, line 408
It comes from here: https://searchfox.org/comm-central/rev/e3ec4fdf727ca65e4acbe42fc4888275977eaf5a/mozilla/js/xpconnect/src/XPCConvert.cpp#406. At this point, it is only a warning. The actual loss of the high byte occurs at this point: https://searchfox.org/comm-central/rev/e3ec4fdf727ca65e4acbe42fc4888275977eaf5a/mozilla/js/xpconnect/src/XPCConvert.cpp#604 and inside the call here: https://searchfox.org/comm-central/rev/e3ec4fdf727ca65e4acbe42fc4888275977eaf5a/mozilla/js/src/jsapi.cpp#4539
This only occurs when accessing mem cached or offline store for messages with attachments and unusual chars (greater than U+00FF) are present in the folder name. (I assume that these "unusual" characters are needed in the many non-latin1 languages.)
Note: There is an bug reported that proposes to change the above warning to a crash, at least for DEBUG builds (bug 1557254). However, after some complaints the associated patch was backed out. The last comment was to re-land that patch but that was about a year ago.
Assignee | ||
Comment 43•4 years ago
|
||
Alessandro, is it possible for you to provide me with a temporary Courier imap account for testing this? So far, I have only tested with gmail and the only local imap server I have set up here and sort-of working is Dovecot and it doesn't support UTF8=ACCEPT capability. Also, none of my other external accounts I have set up for imap testing support it, outlook, yahoo etc.
Assignee | ||
Comment 44•4 years ago
|
||
Thanks! Test account working and tests in progress. No surprises yet.
Assignee | ||
Comment 45•4 years ago
|
||
This is the diff to comm-central that I am currently running and seems to be working OK. Any folder can be named with a UTF-8 name and works with the basic IMAP functions. It also now allows the folder name to contain non-latin1 characters but only if the subsequent diff to mozilla code (in the next attachment) is applied.
This still contains a lot of printf's and commented out alternatives. These need to be cleaned up for a formal review.
Assignee | ||
Comment 46•4 years ago
|
||
To allow non-latin1 UTF-8 characters in folder names with UTF8=ACCEPT in effect, this change to the mozilla-central code seems to be needed. Without this, any UTF-16 character that is greater than U+00FF loses the top byte. This allows the top bytes to be kept in the resulting UTF-8 string instead of being discarded and allows folder names such as "Skrê_ꯍ" to be created in tb.
This address the issue described in comment 42.
Assignee | ||
Comment 47•4 years ago
|
||
This is the cleaned-up diff against comm-central. Removed the printfs and the various alternatives that were commented out. Functionality is equivalent to attachment 9166487 [details] [diff] [review]
Assignee | ||
Comment 48•4 years ago
|
||
This is the cleaned-up diff against mozilla-central. Removed the printfs and the various alternatives that were commented out. Functionality is equivalent to attachment 9166491 [details] [diff] [review] [diff] [review]
Assignee | ||
Comment 49•4 years ago
|
||
(In reply to Alessandro Vesely from comment #0)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Steps to reproduce:
Open a message that contains non-ASCII header fields from a Courier-IMAP mailstore.
Actual results:
Get a warning like:
Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8.
Alessandro, where do you see this warning? This must be in your Courier log, right?. I have only tested with various non-ASCII mailbox names and not with internal non-ASCII headers. Can you provide an example email that has non-ASCII headers that I can test? (I'm not sure what these look like.)
Anyhow, after working on this for probably much too long I have wondered as to the usefulness of all this. Testing with gmail and courier, it appears that even with UTF8=ACCEPT turned off you can still create and have mailbox names with any unicode character. The only difference is that they are encoded in mUTF7 instead of UTF8. So I don't see a lot of improvement here.
Another thing is that the RFC https://tools.ietf.org/html/rfc6855#section-3 it states that "All IMAP servers that support "UTF8=ACCEPT" SHOULD accept UTF-8 in mailbox names...". I take this to mean that some servers may still accept and report mailbox name as mUTF7 even with UTF8=ACCEPT in effect. But it appears that gmail and courier (the only two I have found that really support UTF8=ACCEPT) do report via LIST the mailbox names as UTF8 and accept UTF8 when mailbox is created, renamed and selected when in UTF8=ACCEPT mode.
I do know, at least with gmail, that if you create a mailbox with a mUTF7 string when in UTF8=ACCEPT mode you will see in webmail the literal mUTF7 string and not the intended unicode string. See comment 40 above.
So it seems that the lack of UTF8=ACCEPT support is a non-problem for gmail. Problems have only been reported for courier and often the "solution" is to switch to dovecot. So I'm asking if this complex and high-risk change is really needed due to issues with one server type Courier?
(In reply to Jorg K (CEST = GMT+2) from comment #2)
Seems like a larger job.
Very true!
Comment 50•4 years ago
|
||
(In reply to gene smith from comment #46)
Created attachment 9166491 [details] [diff] [review]
utf8=accept;mozilla;with-debug.diffTo allow non-latin1 UTF-8 characters in folder names with UTF8=ACCEPT in effect, this change to the mozilla-central code seems to be needed. Without this, any UTF-16 character that is greater than U+00FF loses the top byte. This allows the top bytes to be kept in the resulting UTF-8 string instead of being discarded and allows folder names such as "Skrê_ꯍ" to be created in tb.
What properties is it that cause a problem? I'm assuming they just have a wrong string type.
Assignee | ||
Comment 51•4 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #50)
What properties is it that cause a problem? I'm assuming they just have a wrong string type.
Don't know which "properties" cause the problem and don't really even know what you mean by "property". I can tell that the problem occurs at two places where JS_EncodeStringToBuffer() is called
here
https://searchfox.org/comm-central/source/mozilla/js/xpconnect/src/XPCConvert.cpp#605
and here
https://searchfox.org/comm-central/source/mozilla/js/xpconnect/src/XPCConvert.cpp#727
Edit: Permalinks not working currently for some reason. Replaced "rev/<long number>" with "source" now works. No idea why?
The first call at line 604 is for string types:
case nsXPTType::T_CHAR_STR:
case nsXPTType::T_PSTRING_SIZE_IS:
The other call at line 726 is for string type:
case nsXPTType::T_CSTRING:
In both cases the problem occurs when a UTF16 string is converted to a byte (UTF8) string and the upper byte is discarded. My fix is to keep the upper byte by decoding it in a different way (only if new parameter "losses" is true). Apparently with mozilla losses are never expected. With TB losses now only occur because folder names at this point are in UTF-16 and, with UTF8=ACCEPT active both byte are significant. With just mUTF7, the upper byte is always zero so no problem occurs.
The actual discard/loss of the upper byte occurs here when decoding 2-byte char strings:https://searchfox.org/comm-central/source/mozilla/js/src/jsapi.cpp#4532
I really have no idea what controls this. But I see this code occur in conjunction with a when a message in a UTF8 encoded folder name with non-latin1 chars (>U+00FF) is accessed and it has attachments inline. It occurs after the full message is fetched and after the inline parts are accessed from offline store or ram cache depending on configuration.
Another different problem also occurs when the compose window appears in response to a reply, forward, edit-as-new or edit draft when the recipient addresses are written to the fields. This is described in bug 1563891 comment 23.
Assignee | ||
Comment 52•4 years ago
|
||
(In reply to gene smith from comment #49)
(In reply to Alessandro Vesely from comment #0)
Alessandro, where do you see this warning? This must be in your Courier log, right?.
Never mind. I already asked this and you answered above. It's the courier log. Sorry. Oh well, still wrong. It is seen in tb. See below.
Still sample emails with UTF8 encoded headers might be helpful.
Reporter | ||
Comment 53•4 years ago
|
||
(In reply to gene smith from comment #49)
Get a warning like:
Message 2211 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8.Alessandro, where do you see this warning? This must be in your Courier log, right?
No, Courier sends it via IMAP and TB displays it in a tiny window which disappears after s short time even if you don't dismiss it. It happened the first time I opened the mail I sent you from my Sent folder. I should've been quicker with gnome-screenshot. Reopening it doesn't generate a new warning, even after restarting TB. Probably Courier tries and minimize those warnings.
Another thing is that the RFC https://tools.ietf.org/html/rfc6855#section-3 it states that "All IMAP servers that support "UTF8=ACCEPT" SHOULD accept UTF-8 in mailbox names...".
I'd guess it's not "MUST" to allow for ASCII-only file systems.
[...] So I'm asking if this complex and high-risk change is really needed due to issues with one server type Courier?
Personally, I think UTF-8 can be used as an universal encoding for quite a long time. The only issues, w.r.t. ASCII only are some string operations like truncating a string or align it to a given column, which require programmer's attention. Oh, and IDNA before DNS lookup . Having to deal with obsolete clients is a PITA. It cannot be done cleanly. For comparison, consider how log did utilities to convert filenames into DOS 8.3 format last. Sooner or later obsolete uses have to be dropped. The sooner the better.
Thank you for carrying Thunderbird into modern-day.
Assignee | ||
Comment 54•4 years ago
|
||
Yes, I see the message now with the sample email you sent to my test courier account. Of course I only see it with UTF8=ACCEPT disabled in my patched code. It maybe only happens on the first message access after starting tb, definitely not every time. It is actaully an "Alert" message as part of the OK response on fetching the full message body:
<full body fetch>
:
2020-07-30 18:27:56.657180 UTC - [(null) 3858: IMAP]: I/IMAP 0x7f1e20e2f000:mail.tana.it:S-INBOX:STREAM:CLOSE: Normal Message End Download Stream
2020-07-30 18:27:56.931924 UTC - [(null) 3858: Main Thread]: D/IMAP Updating stored message size from 1309, new size 1309
2020-07-30 18:27:56.932148 UTC - [(null) 3858: IMAP]: D/IMAP ReadNextLine [rv=0x0 stream=0x7f1e1f30b560 nb=216 needmore=0]
2020-07-30 18:27:56.932168 UTC - [(null) 3858: IMAP]: I/IMAP 0x7f1e20e2f000:mail.tana.it:S-INBOX:CreateNewLineFromSocket: * OK [ALERT] Message 3 appears to be a Unicode message and your E-mail reader did not enable Unicode support. Please use an E-mail reader that supports IMAP with UTF-8 (see https://tools.ietf.org/html/rfc6855.html)
2020-07-30 18:27:56.932188 UTC - [(null) 3858: IMAP]: D/IMAP SetConnectionStatus(0x0)
2020-07-30 18:27:56.996999 UTC - [(null) 3858: IMAP]: D/IMAP ReadNextLine [rv=0x0 stream=0x7f1e1f30b560 nb=24 needmore=0]
2020-07-30 18:27:56.997029 UTC - [(null) 3858: IMAP]: I/IMAP 0x7f1e20e2f000:mail.tana.it:S-INBOX:CreateNewLineFromSocket: 19 OK FETCH completed.
Also I see now that non-ASCII header just means something as simple as a U+2014 (Long Dash) in the Subject as in your example. Also, see non-ASCII in the CC address.
The non-ASCII displays correctly in TB main screen with and without UTF8=ACCEPT enabled as expected. The only problem I see is that on "Show Source" the UTF8 is not shown correctly. It seems to be decoding each UTF8 3-byte sequence as a separate IS0-8859-1 characters. So the long dash appears as 3 separate characters —.
Reporter | ||
Comment 55•4 years ago
|
||
(In reply to gene smith from comment #54)
The only problem I see is that on "Show Source" the UTF8 is not shown correctly. It seems to be decoding each UTF8 3-byte sequence as a separate IS0-8859-1 characters. So the long dash appears as 3 separate characters —.
"Show Source" has a View/Text Encoding menu where you can correct that. The default values are probably adjustable, but I haven't found out how, yet.
Assignee | ||
Comment 56•4 years ago
|
||
(In reply to Alessandro Vesely from comment #55)
"Show Source" has a View/Text Encoding menu where you can correct that. The default values are probably adjustable, but I haven't found out how, yet.
Yes, that fixes it if you change from Western to Unicode. Seems like the setting for Incoming mail with the fonts setting should set the default for view source but it doesn't.
Assignee | ||
Comment 57•4 years ago
|
||
There's a bug in my APPEND of UTF8 messages found when trying to save to SENT when sending a UTF8 message. See bug 1563891 comment 22. Probably applies to any situation where imap APPEND occurs.
Thanks Alessandro for researching this and finding the cause.
I failed to implement this detail in the utf8=accept rfc: https://tools.ietf.org/html/rfc6855#section-4
Reporter | ||
Comment 58•4 years ago
|
||
Sam Varshavchik researched and found, see Python bug. I merely pushed the papers. The Python patch which fixed that looks quite simple...
Assignee | ||
Comment 59•4 years ago
|
||
This fixes the append problem. If utf8=accept is enabled, it adds the UTF8, parenthesis and the tilde in the same manner as the python patch pointed to in comment 58. Append to Sent folder now works OK with no alerts with Alessandro's and gmail server.
It also fixes another problem revealed after some email exchanges with Arnt Gulbrandsen (co-author of imap enable RFC and others too I think). He recommended that I go ahead and attempt the enable of utf8=accept if ENABLE capability is detected. If the server doesn't support utf8=accept you still get an OK response from the ENABLE command but you don't get the untagged
* ENABLED UTF8=ACCEPT
response. Before I wasn't even looking for this but just assumed that it worked if the ENABLE and UTF8=ACCEPT capabilities are both present and if OK tagged response occurs (which it always does unless you, for example, misspell ENABLE).
I also now set a flag accessible with m_imapServerSink-> and imapServer-> since there are several other places that utf8=accept needs to be known. Before I was just looking at capabilities and not whether it was really enabled at the server.
The .diff still contains some debug and printfs that need to be cleaned up. It also contain the changes for SMTPUTF8 from bug #1563891 that need to be taken out for this bug.
Comment 60•4 years ago
|
||
To clarify (In reply to gene smith from comment #51)
(In reply to Magnus Melin [:mkmelin] from comment #50)
What properties is it that cause a problem? I'm assuming they just have a wrong string type.
Don't know which "properties" cause the problem and don't really even know what you mean by "property".
I was referring to the object properties as declared in the .idl files, and what string types they have there
- string / wstring: old style string pointers (char*) - ASCII only
- ACString: ASCII only, nicer c++ interface
- AUTF8String/ AString are the non-ASCII versions
If something you're passing around doesn't have the right string type you'll get in trouble for non-ascii characters.
Feel free to needinfo if you have specific questions. Questions inside comments can get lost if I don't have time to answer when I read it.
Assignee | ||
Comment 61•4 years ago
|
||
Magnus, thanks for the explanation. I am now looking again at this bug. But I have another question that is OT to this bug.
I just applied my patch again to a pristine trunk that only affects comm-central code. For some reason, it's re-building a large chunk of mozilla-central code starting with
3:09.45 devtools/shared/heapsnapshot
3:09.52 devtools/shared/heapsnapshot/tests/gtest
3:09.58 docshell/base
3:09.66 docshell/shistory
3:11.04 dom/abort
3:20.84 dom/animation
3:23.96 dom/audiochannel
:
and it takes 30 to 40 minutes to finish on this old (slow) laptop. Does mozilla code have dependencies on comm code? I've only just recently been seeing this. I may be touching something in comm-central that affects the javascript engine and forcing a rebuild? (A full "clobber" build takes about an hour so I'm not see that.)
Comment 62•4 years ago
|
||
Sounds like bug 1658188.
Assignee | ||
Comment 63•4 years ago
|
||
Ok, made some progress on this thanks to Jorg's suggestion to use DebugDumpJSStack() at the point where the upper byte is lost in js/xpconnect/src/XPCConvert.cpp. The JS stack dumps shows me where in the JS code the upper byte is getting ignored. Then I found the relevant .idl file and typically changed "string" type to "AUTF8String" and then fixed the many compile errors that "ripple" out. In some cases that change was just from ACString to AUTF8String in the .idl resulting in no compiler errors (not 100% sure these changes are needed, but the "string" to AUTF8String definitely are).
Now I'm not seeing any warnings about the truncation of char16 values and the folder names containing non-latin1 chars are working correctly. However, I still need to go over everything again and make sure all is still working and then clean up the changes and submit an official patch for review. Also, need to run this through the unit tests (probably should do that first).
For now I am leaving in my changes to the mozilla code to detect and correct losses and dump the JS stack for my debug purposes. For reference, I'll attach those changes next.
Assignee | ||
Comment 64•4 years ago
|
||
For reference, here is my temporary changes to mozilla code to detect when char16's are truncated to char8. When this is detected, DebugDumpJSStack() is called to print the JS stack to help in determining where the problem is occurring. This is not readily evident from just the C++ stack visible in gdb.
Assignee | ||
Comment 65•4 years ago
|
||
I did some more testing by attempting to designate drafts, archives, template and sent as folders containing non-ascii utf8. Drafts and archives worked OK with my latest diff. But when I tried trash with a non-ascii chars, e.g., Skräpppost, problems occurred. For example, what looks like a duplicate folder was created but with MUTF7 encoding that became the designated trash folder. (Both folder appeared in the list with pretty name Skräpppost.)
The attached fix-trash-utf8.diff shows the changes I made to just fix this problem. I still need to test some more and also try my current changes with template and sent special folders having non-ascii utf8 chars.
Assignee | ||
Comment 66•4 years ago
|
||
Noticed another problem when testing the trash folder selection issue. If the folder selected for trash is non-ascii utf8, on startup and the server setup page invoked, it fails to initialize the "picker" widget for the trash folder name; it only says "Choose folder...". The correct trash folder is still actually selected according to the icon and tb behavior, so this is a display problem.
I'm not sure if it has anything to do with this comment in the code: https://searchfox.org/comm-central/rev/ec64095d118d299477d3e5f9fd2dc11912d0b573/mailnews/base/prefs/content/am-server.js#531, but this is where the folder
selected for trash is displayed. (Looking with tb's built-in debugger, I don't actually see the "try" for select failing and causing the "catch" to occur.)
Another random observation is that you can't choose an arbitrary trash folder (e.g., on another account or Local Folders) like you can for Sent, Drafts, etc. (I thought maybe I had broken it while tweaking the code, but I hadn't.) Apparently this was requested but never acted upon in bug 277854.
Assignee | ||
Comment 67•4 years ago
|
||
(In reply to gene smith from comment #66)
Noticed another problem when testing the trash folder selection issue. If the folder selected for trash is non-ascii utf8, on startup and the server setup page invoked, it fails to initialize the "picker" widget for the trash folder name; it only says "Choose folder...". The correct trash folder is still actually selected according to the icon and tb behavior, so this is a display problem.
The attached incremental diff to am-server.js shows a fix for the problem. It only show the changes to fix this problem required only in am-server.js function setupImapDeleteUI(). A full cleaned-up patch is still needed for the full bug fix showing all the files and functions.
I'm not sure if it has anything to do with this comment in the code: https://searchfox.org/comm-central/rev/ec64095d118d299477d3e5f9fd2dc11912d0b573/mailnews/base/prefs/content/am-server.js#531, but this is where the folder
selected for trash is displayed. (Looking with tb's built-in debugger, I don't actually see the "try" for select failing and causing the "catch" to occur.)
No, this comment seems to be not an issue anymore. i never see the "select" actually fail and trigger the catch. The "select" can return false which causes the "Choose folder..." to be displayed. I now check the return value for SelectFolder() to know that the trash folder name is probably stored utf8 instead of the normal mutf7 due to UTF8=ACCEPT and process the strings differently.
Assignee | ||
Comment 68•4 years ago
|
||
This is a raw diff of all changes working correctly to support utf8=accept. It includes some commented out code and and fairly useful printfs for debug and verification purposes.
I am marking all the older attachments as obsolete but they are still useful for documentation and explanation purposes.
Next I will attach the complete cleaned-up patch for feedback and/or review.
Assignee | ||
Comment 69•4 years ago
|
||
Here's the full patch without printfs and ready for review.
There is included some changes in nsImapProtocol.cpp to work around bug 1661992 to avoid an immediate crash so those changes will be removed in later patch versions (function Suspend() and Resume()).
Overall, this required a lot of changes to allow folder names to have non-ascii utf-8 chars at all usage points. I think I have checked everything except namespace and server directory name as utf-8.
Reporter Alessandro, thanks for letting me use your Courier server to test this. I have also tested it on gmail which also support UTF8=ACCEPT capability. Any testing you could do on this would be appreciated. If that's possible let me know and I will produce a "try" build for you to use.
I have manually run my changed version through all the comm unit tests and don't see any failure that can be attributed to my changes.
Finally, clang formatting has yet to be done.
Comment 70•4 years ago
|
||
Assignee | ||
Comment 71•4 years ago
|
||
Did clang-format on the cpp/h files. Skipped all the other, js, idl. I think that's right.
Did try build. Fixed a reported lint problem on a .js file. See https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=53e34454281dab86162c3108569aa5c2a65db3d6
Comment 72•4 years ago
|
||
Assignee | ||
Comment 73•4 years ago
|
||
I remember doing that but never made a note as to why. So I went back and retested and tried to recall why I did it. It looks like the change is not needed but doesn't break anything. I think I was concern that with non-ascii addresses that the parsing would fail and produce the wrong count of the number of addresses. However unlike with Bug 1658361 that also changed parseEncodedHeader to parseEncodedHeadeW, the actual parsed address strings are not used here but just the count of the number of address strings so the proper count is returned which determines if the "Reply All" button is enabled or not.
I'll remove this from the patch since it really doesn't apply to this bug even if there was a problem.
Assignee | ||
Comment 74•4 years ago
|
||
Removed changes to mailWindowOverlay.js from the patch per comment 72 and comment 73. Otherwise same as v7 patch version.
Comment 75•4 years ago
|
||
Assignee | ||
Comment 76•4 years ago
|
||
Made the requested changes and ran a try build:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=8b625e4f213380c84e9209dc9bda6ebbac20feb1
I've fixed the whitespace error shown.
Note: I had problems with pushing my changes to try-comm and did several re-pushes since I thought I must be doing something wrong. But apparently my initial push was OK and the server was just busy or something, so they all queued up and they all finally ran. Sorry for the extra traffic. Anyhow, the link above is to my first push.
On my later re-push there was an X* error on the windows build but I see similar errors on other people's pushes so I'm assuming it's OK:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=1289707e4621d0de2d171552b7db185e7352cce7
Comment 77•4 years ago
|
||
Comment 78•4 years ago
|
||
Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/0054ea73e658
Add support for imap capability UTF8=ACCEPT (RFC 6855). r=mkmelin
Updated•4 years ago
|
Comment 79•4 years ago
|
||
As mentioned in bug 1686034 comment #15, it's unclear why URI's were moved to UTF-8 here
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l2.12
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l4.12
and elsewhere in the patch when maybe proper escaping as for mailbox: URIs could have done the trick.
There's some very hacky stuff, for example CopyMUTF7toUTF16()
can now also handle UTF-8 input. Overall it's not clear when a folder name is transported as UTF-8 and when it's transported as MUTF-7, IOW, which parts of the code now always expect UTF-8. Here here, the name is now unconditionally converted to UTF-8:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.68
The test which decides whether the input is UTF-7 or UTF-8 is here:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l5.22
And it's based on whether the folder name is ASCII. Looks like creating a folder by the name of test&AOk- now results in creating a folder testé.
This hunk here https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.149 is unnecessary. The original code compiles.
Comment 80•4 years ago
|
||
BTW, apart from bug 1686034, this also caused bug 1685033 and bug 1686415.
Assignee | ||
Comment 81•4 years ago
|
||
Back at comment 49 I question the usefulness of this UTF8=ACCEPT stuff since MUTF7 allows you to create any mailbox name you want without using UTF8. This was after much frustration getting anything to work without causing other problems.
Anyhow, I'm not sure what you mean by "proper escaping" to fix the problem.
I did just verify that even with UTF8=ACCEPT disabled in config editor, even on an account that doesn't support UTF8=ACCEPT capability, that a folder created as test&AOk- produces a folder named testé. I haven't tried running without it, but I assume this is a regression due to the patch for the current bug.
Comment 82•4 years ago
|
||
Anyhow, I'm not sure what you mean by "proper escaping" to fix the problem.
Well, please read bug 1686034 comment #15 again. It shows that mailbox: URLs escape testé as mailbox-message://nobody@Local%20Folders/test%C3%A9#1 in nsMailboxService::MessageURIToMsgHdr(). The é is C3A9 in UTF-8 and it's percent encoded.
You try to push UTF-8 into the URI without escaping here and in bug 1686034 (and bug 1685033) to get imap-message://klaus.bartosch%40gmail.com@imap.gmail.com/INBOX/testé#1.
That may be an option, but it's certainly not the way other URL types are handled.
Overall, as stated in comment #79, it's unclear which encodings are used in a string retrieved from an IMAP folder name, so sadly CopyMUTF7toUTF16()
now tries to guess based on the ASCII-ness of the input.
I'm also wondering if/why https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.68 is correct. IMHO, instead of spot-fixing any encoding issues, if would be good to design/define at which locations in the code which encoding is used. There is a comment to that extent here:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.131
Reporter | ||
Comment 83•4 years ago
|
||
mUTF7 is a relic from the past.
See Is UTF-8 the final character encoding for all future time?
Comment 84•4 years ago
|
||
MUTF-7 is used for folder names in almost all IMAP servers except Gmail and Courier if I read comment #49 correctly. Sure, moving towards UTF-8 is the way to go, but I suspect that MUTF-7 is here to stay for another decade.
All that said, sadly the changes here introduced a heap of regressions due to incorrect charset handling: bug 1686034, bug 1685033 and bug 1686415. So whilst getting rid of a "relic from the past", Gmail processing went broken in a few ways. So even if you wanted to use the new feature, you can't.
Comment 85•4 years ago
|
||
I've looked through the patch again. Looks like the idea was to make folder names UTF-8 internally, and only convert them to MUTF-7 at the "last minute":
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l23.38
That also answers the question re. https://hg.mozilla.org/comm-central/rev/0054ea73e658#l16.68.
In view of bug 1686034 comment #24 all the NS_ConvertUTF16toUTF8/NS_ConvertUTF8ToUTF16 should be inspected to check whether they should be NS_CopyUnicodeToNative/NS_CopyNativeToUnicode.
If would also be good to determine whether a folder name coming back from IMAP is MUTF-7 or UTF-8. That test shouldn't be done in CopyMUTF7toUTF16(). Surely, if the server doesn't have UTF-8 support, MUTF-7 is returned. It's only questionable whether servers with UTF-8 support are guaranteed to return UTF-8 and not MUTF-7 any more.
Looks like a trash folder name is constructed here:
https://hg.mozilla.org/comm-central/rev/0054ea73e658#l19.48
Looking at bug 1686415 I wonder whether some processing was missed for the other "special" folders, like Sent, etc.
Further nits are the inconsistent spelling of UTF-8 and MUTF-7 throughout the patch, I see: uft8, uft-8, mUTF7, MUTF7, mutf7, mutf-7.
Comment 86•4 years ago
|
||
Reporter | ||
Comment 87•4 years ago
|
||
(In reply to Klaus B. from comment #85)
Using TB 78, I have a folder named Skr&AOQ-ppost on disk. It should be converted when switching to UTF-8.
I don't know Gmail, but the first time I installed an UTF-8 version of Courier, before launching the new daemon I had to convert any directory on disk. Courier provides a conversion utility for that.
Switching to UTF-8 simplifies IMAP a bit, although, having to support both UTF-8 and mUTF-7 actually complicates it. I'm not clear how much longer will we have to endure UTF-16.
Comment 88•4 years ago
|
||
Using TB 78, I have a folder named Skr&AOQ-ppost on disk. It should be converted when switching to UTF-8.
Exactly, I see testé and testé.msf now. On Windows, you will run into trouble when your local ANSI page can't hold the name. I'm looking into bug 1686034 comment #26 right now and the reason why this fails is that the filename is NOT UTF-8 when returned from Windows.
UTF-16 is the internal encoding in Mozilla, also used in JavaScript and for Windows WCHAR type. It will be forever.
Comment 89•4 years ago
|
||
A bit of pedantism. The various wchar types out there were invented in the era of UCS-2 and are mostly UTF-16 oblivious. UTF-16 surrogate pairs had to be hacked in later. Windows wchar falls in that camp - it is UTF-16 surrogate pair oblivious and thus mangling is quite possible. This is the reason for WTF-8. http://simonsapin.github.io/wtf-8/
Otherwise, yeah, we're stuck w/ UTF-16 for a good long while. Hopefully not forever. Windows is trying to slowly move off of it too. Maybe in another hundred years the limitations of UTF-16 in the UTF-8 encoding length can be dropped ☺
https://utf8everywhere.org/
Anyway, carry on - I'm very much looking forward to this bug landing!
Assignee | ||
Comment 90•4 years ago
|
||
(In reply to Alessandro Vesely from comment #87)
(In reply to Klaus B. from comment #85)
I don't know Gmail, but the first time I installed an UTF-8 version of Courier, before launching the new daemon I had to convert any directory on disk. Courier provides a conversion utility for that.
Does that utility convert files in tb profile files or Courier's own files? Just curious.
Comment 91•4 years ago
|
||
Anyway, carry on - I'm very much looking forward to this bug landing!
This bug has landed months ago and is now causing problems in beta, see comment #84. Those bugs revealed some fundamental issues, so it's going to be a while before this is ready for prime-time.
Thanks for the links, fascinating reading :-)
Does that utility convert files in tb profile files or Courier's own files?
Looks like it re-encodes MUTF-7 file names to UTF-8. As far as I know, (some) IMAP servers use maildir, just like TB.
Reporter | ||
Comment 92•4 years ago
|
||
(In reply to Klaus B. from comment #91)
Does that utility convert files in tb profile files or Courier's own files?
Looks like it re-encodes MUTF-7 file names to UTF-8. As far as I know, (some) IMAP servers use maildir, just like TB.
It also converts maildrop recipes, if they were created by Courier's automaton.
(In reply to nemo from comment #89)
That manifesto mentions Java but not Javascript.
Comment 93•4 years ago
|
||
Magnus, can you please set bug 1686415, bug 1685033, bug 1685450 and bug 1687452 as additional regressions. There are also various issues that slipped the review, see comment #86. Finally, there are a few call sites in the code base https://searchfox.org/comm-central/search?q=CopyUTF16toMUTF7&path=&case=false®exp=false where CopyUTF16toMUTF7 is called unconditionally. That is most likely wrong.
Backing out this bug here would be the way to go since you're already in a "whack a mole" situation where more bugs are bound to pop up.
Comment 94•4 years ago
|
||
And bug 1685449 appears to be another regression.
Comment 95•4 years ago
|
||
I don't think we want to back it out, but rather fix the discovered problems.
Comment 96•4 years ago
|
||
To all the people following this bug: All (known) regressions have been fixed, so this should work perfectly in TB Daily and TB 86 beta soon to come.
Note that if your file system can handle the names, you will get the mailbox name as file name. If your filesystem can't handle the name, it will be "hashed up". For example, on my US-English Windows machine with ANSI/windows-1252 locale, a folder with name テストテストテスト ends up as 61ef77d0.msf.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Description
•