Closed Bug 479214 Opened 16 years ago Closed 16 years ago

Gloda JS mime emitter needs to deal with character set and encoding issues.

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

Thunderbird 3.0b3

People

(Reporter: bugzilla, Assigned: asuth)

References

Details

(Keywords: intl)

Attachments

(1 file, 2 obsolete files)

SQLite Manager screenshot which show mojibake strings in global-messages-db.sqlite 16 years ago dynamis (Tomoya ASAI) (deleted), image/png		Details
v1 path of least resistance, convert the stewed utf-8 16 years ago Andrew Sutherland [:asuth] (he/him) (deleted), patch		Details \| Diff \| Splinter Review
v2 convert startAttachment/writeBody/write signatures on nsIMimeEmitter 16 years ago Andrew Sutherland [:asuth] (he/him) (deleted), patch	neil : review+ Bienvenu : superreview+	Details \| Diff \| Splinter Review

dynamis (Tomoya ASAI)

Reporter

Description

•

16 years ago

Attached image SQLite Manager screenshot which show mojibake strings in global-messages-db.sqlite (obsolete) (deleted) — Details

When we search Japanese (or other multibyte) strings within message body (or attachmentName) with Thunderbar (or Postbox full text search), we cannot find it.
In other words, current Thunderbar don't support multibyte chars for message body.

I first thought that is just bug of thunderbar frontend but when I checked global-messages-db.sqlite with SQLite Manager, I found body/attachmentNames field of messagesText/messagesText_content table contain mojibake (garbage chars).
# subject field contains correct strings

And I checked insertMessage of datastore.js (to see mojibake is introduced before insert or not) method. It will get aMessage argument and aMessage._subject is OK but aMessage._indexedBodyText contains mojibake strings.
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/datastore.js#1807

So, I think that not exptoolbar but gloda backend (caller of insertMessage) don't support multi-byte chars correctly now.

In general, Subject (and other) header and message Body is not equally encoded in mail message. I guess current code just treat body same (or something wrong) way as one for subject (and other) header and it cause mojibake in message body.

I believe body/attachmentNames field should contain UTF-8 same as subject field but if body/attachment field should contain non-UTF-8 string, I'm wrong and ignore my report. In that case, bug of exptoolbar (and Postbox) not of gloda.

note: One curious thing is that we can fulltext multibyte string search correctly with Edit -> Find -> Find Messages menu, even when we use gloda and global-messages-db.sqlite contains mojibake.

Flags: blocking-thunderbird3?

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 1

•

16 years ago

Thank you for the detailed investigation and bug message!

I was hoping really hard that libmime and the rest of the message streaming process had this under control.  (I was going to look into this when I implemented the tokenizer stuff...)  I suspect the jsmimeemitter should probably be doing something with updateCharacterSet; we previously tried to tell the channel about it, but the IMAP mock channel got angry.

cc-ing bienvenu in case he knows exactly what we should be doing :)

Dropping importance since gloda is not yet exposed in trunk, but marking as blocking enabling gloda since this should be a fairly easy fix.

Blocks: 464354

Severity: blocker → major

Summary: Gloda database must support multibyte chars (for body/attachemntNames fields) → Gloda database must support multibyte chars (for body/attachmentNames fields)

Target Milestone: --- → Thunderbird 3.0b3

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

Summary: Gloda database must support multibyte chars (for body/attachmentNames fields) → Gloda JS mime emitter needs to deal with character set and encoding issues.

Magnus Melin [:mkmelin]

Comment 2

•

16 years ago

blocking‑thunderbird3+ as this sounds pretty bad

Flags: blocking-thunderbird3? → blocking-thunderbird3+

David :Bienvenu

Comment 3

•

16 years ago

I would store utf8 in sqlite - e.g., mime2 decode the part header to unicode, and then convert that to utf8 (or maybe mime2 decode directly to utf8 - don't know if our decoders can do that...) - is that not what we're doing?

I would also store the message body as utf8, but I'm not real familiar with what sqlite supports.

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 4

•

16 years ago

Attached patch v1 path of least resistance, convert the stewed utf-8 (obsolete) (deleted) — Details — Splinter Review

libmime is, in fact, doing all the hard work for us.  The problem is that nsIMimeEmitter.writeBody passes its utf-8 data with a "string" signature.  XPConnect takes this to mean we are passing ASCII data and not utf8 data.  So we get utf8 stored in wide characters.  mozStorage uses utf-8 storage in sqlite, but of course utf-8 stored in utf-16 does not reduce to valid utf-8 when converted.  You get escaped utf-8 in utf-8...

The attached patch does the simplest thing possible, which is to use scriptableunicodeconverter to convert all the body data that gets thrown at us.  Generally speaking, this is silly and inefficient.  The arguably better course of action is to change the signature of writeBody to be AUTF8String so that things come out of XPConnect correct.  The trick there is that then all the C++ dudes need to change to using an nsACString& signature.  I am not a "native speaker" when it comes to mozilla string classes, so I'd appreciate input on how far I would want to propagate such changes, or whether I should just make that one change using a dependent string or such...

I did add a simple unit test that sanity checks the current implementation and should be useful to sanity-check a better implementation too.

Assignee: nobody → bugmail

Attachment #363086 - Attachment is obsolete: true

Status: NEW → ASSIGNED

Attachment #363619 - Flags: superreview?(bienvenu)

Attachment #363619 - Flags: review?(bienvenu)

David :Bienvenu

Comment 5

•

16 years ago

There's only one call to WriteBody, mime_output_fn, in mimemoz2.cpp, so I think it's well worthwhile changing the idl signature. I think you can use nsDependentCString(buf) and change WriteBody to take AUTF8String, which the idl compiler changes to nsACString, and then I think our utf8 data will pass through unharmed. I'm not sure you couldn't even use ACString in the idl, but I bet Neil knows off the top of his head.

neil@parkwaycc.co.uk

Comment 6

•

16 years ago

(In reply to comment #5)
> I'm not sure you couldn't even use ACString in the idl
My understanding was that string didn't even guarantee 8 bits when passed through XPConnect; ACString is best for binary data since AUTF8String unsurprisingly only works for UTF8 data.

dynamis (Tomoya ASAI)

Reporter

Comment 7

•

16 years ago

With the patch Thunderbird will (convert and) import Japanese message body correctly into the global-messages-db.sqlite database as utf-8. ;)

But I still see mojibake for attachmentNames field of the message_Text/Messate_Text_content table.

And as far as I test, following will solve the problem for attachemntNames field too:
# do same for mime_emitter_startAttachment() as for mime_emitter_writeBody()
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/components/jsmimeemitter.js#339

       if (partMatch) {
         let part = new this._mimeMsg.MimeMessageAttachment(partMatch[1],
-            aName, aContentType, aUrl, aIsExternalAttachment);
+            this._converter.ConvertToUnicode(aName), aContentType, aUrl, aIsExternalAttachment);

dynamis (Tomoya ASAI)

Reporter

Updated

•

16 years ago

Blocks: 479783

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 8

•

16 years ago

(In reply to comment #7)
> With the patch Thunderbird will (convert and) import Japanese message body
> correctly into the global-messages-db.sqlite database as utf-8. ;)
> 
> But I still see mojibake for attachmentNames field of the
> message_Text/Messate_Text_content table.

Thanks for (thoroughly) testing!  I'll fix the attachment name vector (and
audit the rest of the interface call-sites) and add tests for this to my
enhanced unit test.  (The unit test in the patch sadly fails to prove much, as
the message payload was actually utf-8 anyways.  Happily, it turns out that it
does work for euc-jp and shift-jis :)

David :Bienvenu

Comment 9

•

16 years ago

Comment on attachment 363619 [details] [diff] [review]
v1 path of least resistance, convert the stewed utf-8

I'd prefer fixing the idl for WriteBody since that's where we want to end up, I believe.

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 10

•

16 years ago

Attached patch v2 convert startAttachment/writeBody/write signatures on nsIMimeEmitter (deleted) — Details — Splinter Review

In addition to changing the signature for writeBody, we also need startAttachment's signature changed for that to be correct.  Also, as Neil points out, ACString is best for binary data, so I changed the signature to write as well.  I tried to propagate signatures just far enough, but may not have struck the right balance.

Unit test now tests subject, body, and attachment name, and does so for 3 encodings.

Attachment #363619 - Attachment is obsolete: true

Attachment #364856 - Flags: superreview?(bienvenu)

Attachment #364856 - Flags: review?(neil)

Attachment #363619 - Flags: superreview?(bienvenu)

Attachment #363619 - Flags: review?(bienvenu)

neil@parkwaycc.co.uk

Comment 11

•

16 years ago

Comment on attachment 364856 [details] [diff] [review]
v2 convert startAttachment/writeBody/write signatures on nsIMimeEmitter

+ *  expected/actual value. 
I only skimmed the gloda bits, but there's trailing whitespace here. *

>-    mCurrentAttachment->displayName = strdup(name);
>+    mCurrentAttachment->displayName = ToNewCString(name);
This is still an allocator mismatch :-( [According to the rules, strdup goes with strfree and ToNewCString goes with NS_Free and neither go with PR_FREEIF]

>+      mBufferMgr->IncreaseBuffer(buf.Data(), buf.Length());
Data() is internal API only; I think BeginReading() should work instead. *

>+    rv = Write(NS_LITERAL_CSTRING(""), &written);
Use EmptyCString() *

>+  NS_IMETHOD          UtilityWrite(const nsACString &buf);
>   NS_IMETHOD          UtilityWriteCRLF(const char *buf);
Weird, why are these public virtual instead of protected? (Or indeed the "original" UtilityWrite method, which made it into the .idl!)

>     void utilityWrite([const] in string buf);
Ah yes, here it is ;-)

>+      msd->output_emitter->WriteBody(nsDependentCString(buf, (PRUint32)size),
>+                                     &written);
...
>+      msd->output_emitter->Write(nsDependentCString(buf, (PRUint32)size), &written);
IIRC Substring() would be better, although unfortunately the internal API doesn't have a start/length version... *

>+  // use of nsDependentCString demands nul-termination, so add 1.
Substring() would avoid this. *

* required for r=me

Attachment #364856 - Flags: review?(neil) → review+

David :Bienvenu

Comment 12

•

16 years ago

Interesting about your vending machine problem. My hovercraft has a similar issue.

In addition to Neil's comments,

you can lose the blank lines here:

+NS_IMETHODIMP
+nsMimeBaseEmitter::UtilityWrite(const nsACString &buf)
+{
+  PRUint32    written;
+
+  Write(buf, &written);
 
   return NS_OK;
 }

David :Bienvenu

Updated

•

16 years ago

Attachment #364856 - Flags: superreview?(bienvenu) → superreview+

David :Bienvenu

Comment 13

•

16 years ago

Comment on attachment 364856 [details] [diff] [review]
v2 convert startAttachment/writeBody/write signatures on nsIMimeEmitter

sr=me, but don't forget to fix Neil's starred comments.

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 14

•

16 years ago

comments addressed.  Thanks all.

pushed: http://hg.mozilla.org/comm-central/rev/99320ed86776

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

dynamis (Tomoya ASAI)

Reporter

Comment 15

•

16 years ago

As far as I test with latest tinderbox builds ISO-2022-JP (most popular Japanese mail encoding), Shift_JIS, EUC-JP, UTF-8 Japanese mail are stored into db with UTF-8 correctly.
# testmail bymyself only for Shift_JIS, EUC-Jp since almost none use it in e-mail

Thanks fixing this bug.

note:
I feel that it may be better to have more languages and more encodings tests if we want to make sure we can handle every locales/encoding.
# it's just better to have, not required

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 16

•

16 years ago

Thank you for verifying!  In this case, it's not essential to have additional encodings.  The main thing was just to make sure to have more than just UTF-8 so we don't get some form of accidental correct behaviour on UTF-8 but fail on all other encodings.  I believe this is sufficient to verify that the proper decoding mechanisms are in play, and it is probably best for the codec code to make sure it has full unit test coverage of the multiple possible encodings, rather than gloda.

I think once we implement the tokenizer, that is where we will want to have a large set of different strings in different encodings to verify that tokenization is occurring as expected/desired.

Status: RESOLVED → VERIFIED

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 17

•

16 years ago

Note: The fix for this bug accidentally introduced/exposed bug 482416.  (The removal of a guard exposed an oversight; nested rfc822 messages are falling into the void a bit.)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

Flags: in-testsuite+

You need to log in before you can comment on or make changes to this bug.