Closed Bug 547119 Opened 15 years ago Closed 14 years ago

Imported Japanese Email from Outlook shows as garbled text

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

Thunderbird 3.3a1

People

(Reporter: dythim, Assigned: jorgk-bmo)

References

(Blocks 1 open bug,
URL
)

Details

(Whiteboard: [gs])

Attachments

(3 files, 6 obsolete files)

PST containing a single Japanese-encoded message. 15 years ago dythim (deleted), application/octet-stream		Details
This is how the test case looks like after I imported it with my fixed version. 14 years ago Jorg K (CEST = GMT+2) (deleted), image/jpeg		Details
Patch to header file 14 years ago Jorg K (CEST = GMT+2) (deleted), patch		Details \| Diff \| Splinter Review
Patch to CPP file. 14 years ago Jorg K (CEST = GMT+2) (deleted), text/plain		Details
Patch to CPP file. Oops, didn't check the box. 14 years ago Jorg K (CEST = GMT+2) (deleted), patch		Details \| Diff \| Splinter Review
Combined patch to header and CPP file 14 years ago Jorg K (CEST = GMT+2) (deleted), patch	neil : superreview+	Details \| Diff \| Splinter Review
New patch after review 14 years ago Jorg K (CEST = GMT+2) (deleted), patch		Details \| Diff \| Splinter Review
New patch after review (take 2) 14 years ago Jorg K (CEST = GMT+2) (deleted), patch	neil : review+	Details \| Diff \| Splinter Review
New patch after review (take 3) 14 years ago Jorg K (CEST = GMT+2) (deleted), patch	neil : review+	Details \| Diff \| Splinter Review

dythim

Reporter

Description

•

15 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.89 Safari/532.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; ja; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1

When migrating from Outlook 2007 from Thunderbird 3.01 (both Japanese),
using the Tools -> Import to retrieve Outlook messages apparently succeeded, 
but when viewing the imported messages in Thunderbird the text shows as garbage.

The character set is auto-detected as Shift_JIS, and ISO-2022-JP must be manually selected in order to read the messages properly.

Newly sent/received mail does not seem to have the same problem -- it is correctly auto-detected as ISO-2022-JP.

Reproducible: Always

Steps to Reproduce:
1. Have existing Outlook Japanese e-mail.
2. Install Thunderbird 3.0.1 Japanese version.
3. Select Tools -> Import (ツール→設定とデータのインポート)
4. Select Mailbox (メールボックス) from the Import window, click Next (次へ), select Outlook as the type, press Next (次へ) to execute.
5. Import completes automatically, no problems reported.
6. Open one of the imported messages, it shows as garbage.
7. Change the View -> Character Encoding (表示→文字エンコーディング) from Shift_JIS (日本語（Shift_JIS) to (日本語（ISO-2022-JP）).
8. Message shows as normal Japanese. 
Actual Results:  
Thunderbird detects character encoding of imported messages as Shift_JIS, which results in garbled text.

Expected Results:  
Thunderbird should be able to detect that the messages are encoded in ISO-2022-JP.

All mail sent and received after the migration is correctly auto-detected as ISO-2022-JP.  It seems like a problem with the import.

Also, switching between tabs while an imported message is displayed causes the message to be refreshed, and regarbled.

Makoto Kato [:m_kato]

Comment 1

•

15 years ago

Could you attach sample *.pst file to reproduce this?

dythim

Reporter

Comment 2

•

15 years ago

Attached file PST containing a single Japanese-encoded message. (deleted) — Details

dythim

Reporter

Comment 3

•

15 years ago

I uploaded a PST file, but I was not able to force Thunderbird to import from it...  It would only import directly from Outlook.

Makoto Kato [:m_kato]

Comment 4

•

15 years ago

Thank you!.  I confirmed on 3.1 alpha1.

When Tb imports your mailbox as HTML mail, mail data is following.  Although charset of Content-type is Shift_JIS, but charset of HTML is iso-2022-jp.


 :
Content-Type: text/html; charset=Shift_JIS
 :

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
 :
 :

Status: UNCONFIRMED → NEW

Ever confirmed: true

Makoto Kato [:m_kato]

Comment 5

•

15 years ago

related to bug 378008 (same) and bug 505072 (wrong fix for this).

When I open HTML source via Outlook, there is no charset...  If we import HTML mail data from Outlook, I should remove charset of header, maybe...

Jorg K (CEST = GMT+2)

Assignee

Comment 6

•

14 years ago

This is a problem with Korean messages, too. For example this is the source of a message I imported from Outlook:

This is a multi-part message in MIME format.
--------------040206010205070609040902
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit

Her name is Yena Ryu and that&#39;s 예나 류 in Korean<br><br>

===============
Where does the Content-Type come from???

The message looks OK when viewed as UTF-8, but when viewed with the default windows-1252, it looks like this:
Her name is Yena Ryu and that's ì˜ˆë‚˜ ë¥˜ in Korean

This is not only a problem with Asian text. Even German text with Umlaut (äöü) gets garbled:

This is a multi-part message in MIME format.
--------------090001070201040804030900
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Ihre simyo Rechnung</TITLE>
<META content="text/html; charset=utf-8" http-equiv=Content-Type><!-- Title ON --><!-- Title OFF -->

Later in the body of the message:
  der Rechnungsbetrag für die 
which when viewed at the default looks like this:
  der Rechnungsbetrag fÃ¼r die

========

So once again, is there away to detect the default encoding for the message a little better?

If it's a HTML message, it could be taken from the charset of the HTML header.

Jorg K (CEST = GMT+2)

Assignee

Comment 7

•

14 years ago

The problem is here in the code:

http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.

The import tries to get the charset from the header, if none is there, it defaults to the system character set, for Windows machines with an English version of Windows, that's normally "windows-1251".

Instead, what should be done is extract the charset from the appropriate HTML header.

Makoto Kato [:m_kato]

Comment 8

•

14 years ago

(In reply to comment #7)
> The problem is here in the code:
> 
> http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.
> 
> The import tries to get the charset from the header, if none is there, it
> defaults to the system character set, for Windows machines with an English
> version of Windows, that's normally "windows-1251".
> 
> Instead, what should be done is extract the charset from the appropriate HTML
> header.

That's right.  we need a meta element parser like nsMsgI18NParseMetaCharset() to outlook importer.

Ludovic Hirlimann [:Usul]

Updated

•

14 years ago

Assignee: nobody → mozilla

Jorg K (CEST = GMT+2)

Assignee

Comment 10

•

14 years ago

Attached image This is how the test case looks like after I imported it with my fixed version. (deleted) — Details

Jorg K (CEST = GMT+2)

Assignee

Comment 11

•

14 years ago

Attached patch Patch to header file (obsolete) (deleted) — Details — Splinter Review

Attachment #442379 - Flags: review?(bienvenu)

Jorg K (CEST = GMT+2)

Assignee

Comment 12

•

14 years ago

Attached file Patch to CPP file. (obsolete) (deleted) — Details

I'd like to request that this fix be rolled into the source base quickly since I want to continue working on another bug (bug 558653) which will require changes to this file.

Attachment #442380 - Flags: review?(bienvenu)

Jorg K (CEST = GMT+2)

Assignee

Comment 13

•

14 years ago

Attached patch Patch to CPP file. Oops, didn't check the box. (obsolete) (deleted) — Details — Splinter Review

I'd like to request that this fix be rolled into the source base quickly since
I want to continue working on another bug (bug 558653) which will require
changes to this file.

Attachment #442380 - Attachment is obsolete: true

Attachment #442382 - Flags: review?(bienvenu)

Attachment #442380 - Flags: review?(bienvenu)

Ludovic Hirlimann [:Usul]

Comment 14

•

14 years ago

Can you provide both .h and cpp changes in the same patch ?

David :Bienvenu

Comment 15

•

14 years ago

Jorg, thx for working on this.
Yes, one patch per bug fix, and you might want to look into Mercurial queues as a way to manage multiple patches to the same files. https://developer.mozilla.org/en/Mercurial_Queues

We can't land anything until the tree re-opens, so you're going to have to be patient...

Jorg K (CEST = GMT+2)

Assignee

Comment 16

•

14 years ago

Attached patch Combined patch to header and CPP file (obsolete) (deleted) — Details — Splinter Review

Attachment #442379 - Attachment is obsolete: true

Attachment #442382 - Attachment is obsolete: true

Attachment #442493 - Flags: review?(bienvenu)

Attachment #442379 - Flags: review?(bienvenu)

Attachment #442382 - Flags: review?(bienvenu)

Jorg K (CEST = GMT+2)

Assignee

Comment 17

•

14 years ago

(In reply to comment #15)
> Yes, one patch per bug fix, and you might want to look into Mercurial queues as
> a way to manage multiple patches to the same files.
> https://developer.mozilla.org/en/Mercurial_Queues

I prefer not to spend time on config management. Once this patch has been included into the source base, I will submit the patch for bug 558653.

Joe Sabash [:JoeS1]

Updated

•

14 years ago

Blocks: tb-enterprise

Jorg K (CEST = GMT+2)

Assignee

Updated

•

14 years ago

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Wayne Mery (:wsmwk)

Comment 18

•

14 years ago

Jorg K, something's amiss here - the bug got marked fixed, but there's no checkin mentioned and the patch doesn't yet have review approval

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Jorg K (CEST = GMT+2)

Assignee

Comment 19

•

14 years ago

Sorry, looks like I did the wrong thing. I just wanted to give this a little nudge.

Ludovic Hirlimann [:Usul]

Comment 20

•

14 years ago

David , ping for the review.

David :Bienvenu

Updated

•

14 years ago

Attachment #442493 - Flags: review?(bienvenu) → review?(bugzilla)

Mark Banner (:standard8)

Comment 21

•

14 years ago

Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file

Neil, I'm happy to test this, but can you comment on the string related stuff first (and if they will be fine with external api).

Attachment #442493 - Flags: superreview?(neil)

neil@parkwaycc.co.uk

Comment 22

•

14 years ago

Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file

Well, we haven't started on porting mailnews/import to the external API (mainly because a Linux guy has been doing most of the work). But I'll comment anyway.

>+  PRInt32 idx_endhead = str.Find( "/head", PR_TRUE);
>+  PRInt32 idx_charset = str.Find( "charset=", PR_TRUE);
[This happens to be the one usage of Find that is directly portable. Which is a shame because we probably should be using nsCString here.]

>+  str.Right( tStr, str.Length() - idx_charset);
Right isn't available in the external API. Instead you should use
tStr = Substring(str, idx_charset)
Or you can use an offset to your FindChar functions and use the three-argument version of substring.

>+  PRInt32 idx  = tStr.FindChar( ';');
>+  PRInt32 idx0 = tStr.FindChar( '"');
>+  PRInt32 idx1 = tStr.FindChar( ' ');
This should use FindCharInSet. (Well, it should use MsgFindCharInSet, but that's currently broken.) Conveniently FindCharInSet accepts an initial offset parameter, so you could write
PRInt32 idx = str.FindCharInSet(";\" ", idx_charset);
[this would of course return an offset from the start of the string]

>+    tStr.Left( str, idx);
str = StringHead(tStr, idx);
Or if you didn't create tStr,
str = Substring(str, idx_charset, idx - idx_charset);

>+  int recheck_charset = 0;
>   if (headerVal.IsEmpty())
>   {
>+    recheck_charset = 1;
This should use bool/false/true or PRBool/PR_FALSE/PR_TRUE. (And the variable name isn't strictly accurate either, since we're checking a charset that we didn't even know existed before.)

>+    pMimeType = strdup ("text/plain");
[I know this was moved code but it should be NS_strdup.]

>+    CopyASCIItoUTF16(m_Body.get(), headerVal);
>+    ExtractMetaCharset (headerVal);
It's not ideal to convert the entire body just to extract the charset. It would be better if the ASCII charset could be extracted from the ASCII body. (Everyone really wants an ASCII charset anyway, but that's harder to fix.)

>-  charSet = headerVal;
So, the one thing I don't understand is why we set the character set field, but not actually use the meta character set when creating the body.

Jorg K (CEST = GMT+2)

Assignee

Comment 23

•

14 years ago

Thanks for the review.

Being new here, I used the existing ExtractCharset
http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#455
as a model.

That uses all the things you complained about, ie. str.Left, str.Right. strdup is also used throughout the same source file.

CopyASCIItoUTF16 is used so that my new function ExtractMetaCharset can work the same way as the existing ExtractCharset.

So please indicate with of the changes you suggested are mandatory.

Most importantly, the logic that you don't understand:

Quite frankly, I don't understand it either, all I can say is that it works this way.

I observed that when the message is the converted using the current version of TB, it creates a

Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit

To make the message appear correctly, all you need to do is to change the charset to whatever the HTML code said, for example UTF-8 or iso-2022-jp.

The actual creation of the message is not affected, only the header.

Therefore in the code, I use:
m_pMsgFields->SetCharacterSet( NS_LossyConvertUTF16toASCII(headerVal).get() );
(headerVal being the new charset extracted from the HTML header)

nsMsgI18NConvertFromUnicode( NS_LossyConvertUTF16toASCII(charSet).get(),uniBody, body);
(charSet being the character set retrieved the "traditional" way as in the previous version).

If I use the new charset from the HTML header for the body conversion, things fall apart and I get dog's breakfast.

Please advise how to proceed.

Mark Banner (:standard8)

Updated

•

14 years ago

Component: Migration → Import

Product: Thunderbird → MailNews Core

QA Contact: migration → import

Ludovic Hirlimann [:Usul]

Comment 24

•

14 years ago

standard8 ^^

Jorg K (CEST = GMT+2)

Assignee

Comment 25

•

14 years ago

Sorry, what does "standard8 ^^" mean?

Ludovic Hirlimann [:Usul]

Comment 26

•

14 years ago

Asking standard8 to comment on the previous comment that's what it means

neil@parkwaycc.co.uk

Updated

•

14 years ago

Attachment #442493 - Flags: superreview?(neil) → superreview+

neil@parkwaycc.co.uk

Comment 27

•

14 years ago

Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file

OK, you've convinced me.

Jorg K (CEST = GMT+2)

Assignee

Comment 28

•

14 years ago

Well, I'm happy to make further changes.

1) FindCharInSet seems to me more elegant.
2) If I can avoid to convert the whole message with
     CopyASCIItoUTF16(m_Body.get(), headerVal);
   then surely that's the more elegant solution.
   Perhaps you can give me a hint how I'd do this instead.

neil@parkwaycc.co.uk

Comment 29

•

14 years ago

(In reply to comment #28)
> If I can avoid to convert the whole message with
>   CopyASCIItoUTF16(m_Body.get(), headerVal);
> then surely that's the more elegant solution.
> Perhaps you can give me a hint how I'd do this instead.
You would need to change ExtractMetaCharset to use nsCString instead of nsString, and you would pass CaseInsensitiveCompare as the second parameter to Find, so it would work with the external API. And you might also want to add a second parameter for the return value, so that you don't have to copy m_Body.

Jorg K (CEST = GMT+2)

Assignee

Comment 30

•

14 years ago

Attached patch New patch after review (obsolete) (deleted) — Details — Splinter Review

All review comments have now been incorporated.

Attachment #442493 - Attachment is obsolete: true

Attachment #461877 - Flags: review?(neil)

Attachment #442493 - Flags: review?(bugzilla)

Jorg K (CEST = GMT+2)

Assignee

Comment 31

•

14 years ago

Attached patch New patch after review (take 2) (obsolete) (deleted) — Details — Splinter Review

All review comments have now been incorporated.

However, I don't understand this part of comment 29:

... and you would pass CaseInsensitiveCompare as the second parameter to
Find, so it would work with the external API.

Attachment #461877 - Attachment is obsolete: true

Attachment #461880 - Flags: review?(neil)

Attachment #461877 - Flags: review?(neil)

neil@parkwaycc.co.uk

Comment 32

•

14 years ago

(In reply to comment #31)
> I don't understand this part of comment 29:
> 
> ... and you would pass CaseInsensitiveCompare as the second parameter to
> Find, so it would work with the external API.
Starting in bug 377319, we've been trying to convert the mailnews codebase to be compatible with the external string API, although there are some notable omissions, such as most of import. In the case of Find, the external string API wants a specific parameter to indicate a case insensitive find. There is a #define CaseInsensitiveCompare PR_TRUE in nsMsgUtils.h for use with the internal API, so that the code continues to compile and work.

neil@parkwaycc.co.uk

Comment 33

•

14 years ago

Comment on attachment 461880 [details] [diff] [review]
New patch after review (take 2)

>+void nsOutlookCompose::ExtractMetaCharset( nsCString str, nsString& newstr)
>+{ 
Nit: unnecessary space after the { ...

>+} 
>+
> void nsOutlookCompose::ExtractType( nsString& str)
... and again after the }

Attachment #461880 - Flags: review?(neil) → review+

Jorg K (CEST = GMT+2)

Assignee

Comment 34

•

14 years ago

Attached patch New patch after review (take 3) (deleted) — Details — Splinter Review

used CaseInsensitiveCompare and removed extra whitespace as requested by most recent review.

Attachment #461880 - Attachment is obsolete: true

Attachment #461937 - Flags: review?(neil)

neil@parkwaycc.co.uk

Updated

•

14 years ago

Attachment #461937 - Flags: review?(neil) → review+

Mike Kaganski

Comment 36

•

14 years ago

I have the same problem with russian charsets, too.
I want to note that TB not only garbles the HTML-style text, it also changes the text-only messages that do have the 'Content-Type: text/plain; charset="iso-8859-5"' and 'Content-Transfer-Encoding: quoted-printable' into garbled with 'Content-Type: text/html; charset=windows-1251
' and 'Content-transfer-encoding: 8bit', as shown in my (duplicate) https://bugzilla.mozilla.org/show_bug.cgi?id=584504

Michael A. Pasek

Updated

•

14 years ago

URL: http://gsfn.us/t/19887

Whiteboard: [gs]

Michael A. Pasek

Updated

•

14 years ago

URL: http://gsfn.us/t/19887

Whiteboard: [gs]

Jorg K (CEST = GMT+2)

Assignee

Comment 37

•

14 years ago

This bug is about HTML messages only, which are imported correctly but display garbled.

Garbled (badly imported, data loss) plain text messages are covered in bug 207156.

David :Bienvenu

Updated

•

14 years ago

Whiteboard: [checkin-needed]

Ludovic Hirlimann [:Usul]

Updated

•

14 years ago

Keywords: checkin-needed

Mark Banner (:standard8)

Comment 38

•

14 years ago

Checked into trunk: http://hg.mozilla.org/comm-central/rev/5107881ac105

Status: REOPENED → RESOLVED

Closed: 14 years ago → 14 years ago

Keywords: checkin-needed

Resolution: --- → FIXED

Whiteboard: [checkin-needed]

Target Milestone: --- → Thunderbird 3.2a1

Michael A. Pasek

Comment 39

•

14 years ago

Incorrectly removed the "[gs]" tag and URL due to a mis-reading of Comment #37  (GS topic has BOTH HTML and plain-text problems).  Re-adding tag/URL.

URL: http://gsfn.us/t/19887

Whiteboard: [gs]

You need to log in before you can comment on or make changes to this bug.