Closed Bug 207156 Opened 21 years ago Closed 13 years ago

Characters outside the default code page becomes ?'s when importing Outlook mail

Categories

(MailNews Core :: Internationalization, defect)

x86
Windows XP
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 6.0

People

(Reporter: peter, Assigned: mikekaganski)

References

(Blocks 3 open bugs, )

Details

(Keywords: dataloss, Whiteboard: [gs])

Attachments

(10 files, 13 obsolete files)

(deleted), application/x-stuffit
Details
(deleted), application/zip
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), application/zip
Details
(deleted), application/octet-stream
Details
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
standard8
: review+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030507
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030507

Trying to import Outlook 2002 mail into Mozilla. Some of my mail includes
Unicode characters not in my system code page 1252, some not in any standard
code page. Outlook displays this correctly. Mozilla replaces with ?'s or a near
equivalent in CP1252 (e.g. Turkish g breve > g). The same problem with e-mail I
send, not marked with any encoding but presumably stored as Unicode in Outlook
(although in Mozilla it appears as "Content-type: text/plain;
charset=windows-1252; format=flowed"), and in e-mail I receive which is marked
"Content-Type: text/plain; charset=utf-8; format=flowed".

Reproducible: Always

Steps to Reproduce:
1. Import my Outlook data into Mozilla
2. Open the messages in Mozilla and/or look at the raw files which have the same
problem

Actual Results:  
Part of one message looks like:
I can send Unicode Greek text in plain text e-mail, as here: ?? ???? ?? ? ?????.

Expected Results:  
Greek characters should appear instead of the question marks. (Some of them are
from the Unicode Extended Greek pages, not in standard Greek code pages.)

Classified as "critical" because I am losing data here, and there was no
warning. Unfortunately if there is no fix or work around for this one I will
have to abandon Mozilla and continue to use Outlook, as I need access to old
e-mails which are not just in CP1252.
This may be related to bugs 88603 and 100867 - if the latter it has not been
fixed properly.
This zip file includes two copies of the same e-mail, one which I submitted to
a list and one received back by me (only the headers are different), in an
Outlook PST file with the Greek text visible in both, and as source saved from
Mozilla showing the Greek replaced by question marks even in the received copy
which is marked as utf-8.
This problem may be more Outlook's than Mozilla's. I find that every way of
saving or exporting these messages from Outlook, even importing them into
Outlook Express 6, converts them according to CP1252 - which means any data not
in CP1252 is converted to ?'s and so lost. The Unicode data is saved only in the
.PST files. But it is accessible by copy and paste from an Outlook window and
also if a message is forwarded. In the latter case Outlook automatically selects
a suitable code page if one is available or uses UTF-8 if no one code page
covers the whole text. This suggests that a better way to export Outlook
messages to Mozilla might be to forward the messages to an internal pseudo-mail
server within Mozilla.
A little more on this one. I find that the Unicode is preserved correctly and
can be read by Mozilla if the problem e-mail is sent as an attachment to another
e-mail and received by Mozilla. This method also gets around bug 127049 (and bug
183124 which seems to be a duplicate) as the headers are filled in as e-mail
addresses - and is a cleaner solution than Gene Wood's because the message is
not actually resent and so no additional headers are added. So my outline new
import procedure, to fix both bugs, would be to attach all the messages in each
Outlook folder to a dummy message and send that via a mail client to Mozilla,
which can then unpack them back into a folder complete with Unicode and
meaningful headers. It would probably be possible to set up a macro within
Outlook 2002 to do what is necessary at that end. Can Mozilla handle the rest?
Just to confirm that Gene Wood's workaround for bug 127049 (copy the Outlook
folder to an IMAP server and view from Mozilla) does get round this bug as well
as that one, and there is no reason to consider this fix less clean than my
suggestion of using attachments. The only problem is that I have 1 GB of Outlook
archives and the IMAP server I am using gives me only a few MB of space. Is
there any way Mozilla can appear to Outlook to be an IMAP server with a local
loopback IP address?
Peter,
  If the attachment of emails solution works, great. If your having trouble
because your provider only gives you a few meg of space for the IMAP server,
just install a local imap server temporarily. It isn't a great solution, but it
would be a one time thing and it would probably be faster than uploading and
downloading a gig. 

I'm not sure I follow what you mean when you say "would be to attach all the
messages in each Outlook folder to a dummy message and send that via a mail
client to Mozilla". How are you "sending" this email with a gig of attached
emails to mozilla?
Thanks, Gene. Yes, I have tried Mercury as a local IMAP server and that seems to
do the job though not very cleanly. One problem is that Mozilla only recognises
newly uploaded folders in Mercury when I close and reopen both Mozilla and
Mercury. Can you recommend something better, a free download easily configurable
for Win2000?

My idea of sending attachments to Mozilla worked well via a remote e-mail
server. But to do this for all my data I would also have to use some kind of
local mail server, so not much different from your idea really except I can use
POP/SMTP rather than IMAP.
Another twist on this one. Some of my archives in Outlook were imported (in
2000) from Lotus cc:Mail. It seems that Outlook stores these in a different
format from regular mail. Mozilla import seems to work correctly with these
messages (they are ASCII only so the non-western character issue doesn't apply).
But the alternatives don't. When Outlook 2002 copies them from a local folder to
an IMAP server (I tried both my local one and a remote one using different
software) the From: line is lost, and of course remains lost when read into
Mozilla. And when they are sent as attachments the attachment arrives as
"Content-Type: application/ms-tnef;
	name="winmail.dat"" which Mozilla doesn't know what to do with.

So it looks as if I will have to use Mozilla import for some of my messages and
the IMAP method for others.
I now have in Mozilla two copies of the same e-mail, originating in Israel.
Although mostly in English, there is some Hebrew in both the subject line and
the title. The copy which was imported from Outlook using the Mozilla import
command (now on build Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4)
Gecko/20030529) has the Hebrew correct in the subject line but replaced by ?'s
in the text. The copy which was imported via an IMAP server has the Hebrew in
the subject line replaced by ?'s but that in the text correct.

The source for the subject line in which Hebrew appears OK is
Subject: =?utf-8?Q?=D7=AA=D7=A9=D7=95=D7=91=D7=94=3A_Re=3A_=5BHebrew_Computing=5D?=
 =?utf-8?Q?_WordMill?=

In the copy which displays the Hebrew text correctly, the Hebrew is displayed
also in the message source window, following:
Content-Type: text/plain;
	charset="utf-8"

I just can't win, it seems! But actually Outlook itself could not display the
Hebrew in the subject line but replaced it with ?'s.
Some further observations on this one: I discovered that in many of the messages
which I imported from Outlook via the IMAP server the only attachment listed is
winmail.dat, which is basically unreadable except within Outlook. From looking
in Mozilla at several hundred messages which had attachments in Outlook, I
discovered the following patterns:

Messages from the time when I was using Outlook 2000 mostly came across via IMAP
into Mozilla correctly, with readable attachments. In just a few cases, mostly
towards the end of this period, the attachments became unreadable in
winmail.dat. I traced some of these to the sender of the message using Outlook 2002.

Messages dated from the day when I installed Office XP and Outlook 2002 (I found
one message telling me the exact date, 14 January 2002) mostly have unreadable
attachments, when imported via IMAP. There are some exceptions, many of them
being when the message is HTML format rather than plain text.

This all suggests that the problem is largely with Outlook 2002. It is not
Mozilla's problem, I realise, but I hope it is helpful that this is noted here,
and perhaps mentioned in documentation at some time.

So for these messages with unreadable attachments I was forced to revert to the
copies imported from Outlook by Mozilla. This is despite several significant
problems with such imported messages, some of which probably are Mozilla issues:

1) Addresses in the To: etc lines of these messages are separated by semicolons
and spaces, not by commas and new lines - see bug 210600;

2) In many of these messages the text has become an attachment named Part 1.1 -
this would not be a problem if such attachments were displayed inline, they are
not displayed inline although I have "Display Attachments Inline" checked.

3) Attachment names are truncated to 8.3 filename format;

4) The Unicode problem which this bug started with.
Severity: critical → normal
On comment #9 point 2, see my newly reported bug 210606.
Re comment 9: I just found Fentun, at http://www.fentun.com/, which decodes
winmail.dat attachments (and runs OK on Windows 2000). I set up Fentun as the
default application for MIME type ms-tnef, and now I can read my imported
attachments! Wonderful! Well, it would have been even more wonderful if it had
alllowed me to double click on the .gif file it found to view it with my default
viewer, rather than require me extract it to a file and view it from there. But
you can't have everything.
Product: MailNews → Core
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
This bug has been automatically resolved after a period of inactivity (see above
comment). If anyone thinks this is incorrect, they should feel free to reopen it.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → EXPIRED
This bug should be reopend because of bug 330134 and there is still no bugfix for bug 207156. And the only reason that there are no other comments to this bug is that not a lot of people having emails with unicode and importing of emails from another emailprogram is happening mostly one time in usage of mozilla/thunderbird.

Can someone deactivate autosolving for this bug and 330134?
Status: RESOLVED → UNCONFIRMED
Resolution: EXPIRED → ---
Confirming based on comments
Status: UNCONFIRMED → NEW
Ever confirmed: true
Blocks: 330134
According to comment #2, it's not as much our problem as that of MS Outlook. Nonetheless, we need to figure out a way to work around that. 

Summary: Unicode becomes ?'s when importing Outlook mail → Characters outside the default code page becomes ?'s when importing Outlook mail
Because of loss of data may this bug set as "critical" and "dataloss"?
I would say: "Yes".
(In reply to comment #18)
> I would say: "Yes".

Please, read comment #2 and comments following it about how to deal with this problem of *Outlook*. I wrote in comment #16 that we need to figure out a workaround, but now I'm not sure if it's possible at all. 

Yes, I know about comment #2.

Two easy suggestions from me:

 1) Write to Microsoft about this problem, so that they can make a bugfix for Outlook.
 2) Until this Mozilla should at least give the user a information that Mozilla
    is not able to import emails, if it is getting mails filled with ???????, without this bugfix.

Because I'm not involved in programming of Mozilla I can not make an proposal for this problem. Is there no working interface at outlook to get the mails with the right codepage?
Question: How dealing other email programs with this problem?

Are having http://www.pmail.com/ and http://www.eudora.com/ (for example) the same problem with importing emails from outlook?
possible related bugs:
bug 217234
bug 272745
bug 359785
bug 276663
I am sure there are others.

(does OE have same problem and cause? example bug 254118)
Severity: normal → critical
Keywords: dataloss
Can someone add "intl" to the keywords like in bug 330134

And this is not only a win2000 problem. It's still there in WinXP.
(In reply to comment #22)
> possible related bugs:
> bug 217234
> bug 272745
> bug 359785
> bug 276663

I do see these are dupes

> (does OE have same problem and cause? example bug 254118)
> 
This is very probably related

But are these still on in current versions tb2.0.0.14 or 3?
This bug makes migration to Thunderbird completely impossible.
Problem lies somewhere deeper I assume look for my bug 404255 which is somewhat related. There are some difference behaving of TB when using localized version. And yes there also some dump things Outlook doing when exporting address book for example - will look deeper into importing emails
OS: Windows 2000 → Windows XP
QA Contact: marina → mailnews.i18n
Product: Core → MailNews Core
Martin, can you help with this or bug 404255?
(In reply to comment #29)
> Martin, can you help with this or bug 404255?

I don't think so, sorry - my work was about URL-encoded data in email headers, not a generic encoding problem.
Przemyslaw, Phil, suggestions on how to move this forward?
Flags: wanted-thunderbird3?
someone post the source of the email.  the samples have ???
then I can try import of OE and see if it is the same problem.

Or maybe I just make an email formatted to greek(windows)?
The attached PST contains two messages:

One sent message (without any useful header information) and one received one.

The received message has this header:
MIME-Version: 1.0
Content-Type: text/plain;
	charset="utf-8"

TB correctly recognizes the character set and in the imported e-mail we get this header:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed

However, in the plain text body of the message the Greek characters appear garbled, very much like the Russian characters in bug 547119.
Whiteboard: [gs]
I will further investigate this problem shortly.
Take a look at
http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/MapiMessage.cpp#461

If the first attempt to retrieve the message body failed, we retrieve the body in the else-clause at line 482.

Debugging shows that the body is returned at line 482 as this (for example the e-mail containing Greek characters):

====
{mData=0x07ae5728 "Three points here:

1) This is a quite separate issue from the plain text vs. HTML format issue. I can send Unicode Greek text in plain text e-mail, as here: ?? ???? ?? ? ?????. Indeed you did just this in your e-mail to me where you quoted these Greek words from my e-mail (retained below). Now to read this, especially the extended polytonic Greek characters, you may have to 
====

Outlook has already converted the characters in question into question marks.
And there is nothing we can do to get the original information back.

Debugging on the Russian messages supplied in bug 584504 tell a different story. The plain text message attached there is returned in the first call at line 461. Outlook returns us the plain text message as HTML (with the correct data). Our heuristic at line 477 decides correctly that this is a plain text message, so we retrieve the body again at line 482 and again get ??? back.

Summary:

There are two cases here:

In one case, Outlook never offers the message body as HTML. When retrieved at line 482, we get ???

In the other case, Outlook at first delivers the message body as HTML **with** the correct data. We decide to retrieve it again and then get the question marks.

So the best a fix could do is to convert some plain text messages as HTML in oder to have the correct content.

Opinions?
For any case where the body is returned as HTML -- even if Outlook did convert it from text/plain -- wouldn't using  the Outlook HTML conversion be more likely to result in the proper display to the user ? 

I note that the MXR comm-central repository contains your fix for Bug 250878 (applied Aug 2, 2010), whereas the user in the case of Bug 584504 would not have that fix (the "heuristic at line 477" would not be present in his case).

I'm also slightly confused by the comments regarding that user's problem in the four bug reports; I think what you're saying is that the first message (plain text) could be properly imported (with the fix for Bug 250878 ?), but the HTML message is hopelessly lost, since it doesn't contain any charset data, either in the headers or HTML body.  Is that an accurate statement ?
(In reply to comment #37)
> For any case where the body is returned as HTML -- even if Outlook did convert
> it from text/plain -- wouldn't using  the Outlook HTML conversion be more
> likely to result in the proper display to the user ?

That remains to be investigated. In this case we would promote a plain text message to HTML, which is in itself a bad thing. My testing/debugging shows that there are messages, like the one with the Greek characters, where we don't stand a chance to get to the correct data since these are never offered as HTML in the MAPI interface.

> I note that the MXR comm-central repository contains your fix for Bug 250878
> (applied Aug 2, 2010), whereas the user in the case of Bug 584504 would not
> have that fix (the "heuristic at line 477" would not be present in his case).

Not true. The "heuristic at line 477" was always there, only now it's a little more refined. Before we looked for the string:
"<!-- Converted from text/plain format -->"

Now we look more specifically for:
"<BODY>\r\n<!-- Converted from text/plain format -->"

That's the difference.

> I'm also slightly confused by the comments regarding that user's problem in the
> four bug reports; I think what you're saying is that the first message (plain
> text) could be properly imported (with the fix for Bug 250878 ?), but the HTML
> message is hopelessly lost, since it doesn't contain any charset data, either
> in the headers or HTML body.  Is that an accurate statement ?

No.

The following is correct:

HTML messages are always (well, as far as I have seen) properly imported. Where they don't contain the correct charset information, the will display garbled, but by changing the character encoding used for the display, the content can be viewed correctly. The change implemented for bug 547119 now results in looking for the charset a little harder, that is, also in the body of the HTML message, not only in the headers. Were a HTLM message does not contain the right information, the fix applied for bug 547119 won't make a difference. This is the case for the HTML message presented in bug 584504. I repeat: Even in the case where the HTML data looks garbled, it is indeed stored correctly and can be viewed correctly after manual intervention.

This bug 207156 is about plain text messages. They have a general problem in that "international" characters don't get imported properly.

My testing shows that the MAPI interface returns question marks instead of the correct data. However, there is subclass of plain text messages that the MAPI interface also offers as a version converted to HTML which indeed contains the correct characters. This is what we discussed in the first paragraph.
Please note that the looking for the charset "a little harder" results in a change in the original message headers. Wouldn't it be better not to change the message headers but change the way the message is displayed? OK, there's no header charset information; so when _showing_ the message we take into account the HTML header charset information? I would further suppose that an inner block with a different setting than in the outer block should temporary (for this block and its sub-blocks only) override that setting. And the HTML code is definitely the inner block relative to its MIME header.

And isn't the "???" problem caused by non-unicode MAPI?
(In reply to comment #39)

> And isn't the "???" problem caused by non-unicode MAPI?

Yes, indeed. In XP in "Regional and Language Options" (Advanced) I switched the character set to be used for non-unicode applications to Russian. And guess what: The two messages from your bug 584504 imported just fine. Even the folder names were right.

Hold on a shake which I switch the machine to Greek and try again with the original Greek messages supplied here.

> Please note that the looking for the charset "a little harder" results in a
> change in the original message headers. Wouldn't it be better not to change the
> message headers but change the way the message is displayed? OK, there's no
> header charset information; so when _showing_ the message we take into account
> the HTML header charset information? I would further suppose that an inner
> block with a different setting than in the outer block should temporary (for
> this block and its sub-blocks only) override that setting. And the HTML code is
> definitely the inner block relative to its MIME header.

Well, if you want to join the discussion on bug 547119, why don't you post your comment there?

To answer your question: The way the message is shown is derived from the message headers. If they say KOI8-R, then this will be used for the default display. There is no change to the "original message headers". In fact, the fix implemented adds the encoding information where it was missing.

My testing show that Japanese and German messages (German has äöüß, etc.) now show up very nicely. Too bad your Russian HTML message from bug 584504 has such bad HTML that there is no character set mentioned.
In XP in "Regional and Language Options" (Advanced) I switched the
character set to be used for non-unicode applications to Greek and imported again.

Result: less ??? but still not perfect:

Outlook: Ἐν ἀρχῇ ἦν ὁ λόγος
TB: ?ν ?ρχ? ?ν ? λ?γος
(In reply to comment #38)
> [...] Not true. The "heuristic at line 477" was always there [...]

So it was.  I was looking for any deleted lines AFTER the "+"s, and missed the
line preceding the change added by the fix.

> [...] The following is correct:

Thanks for the explanation, and your further explanation/examination in
subsequent comments.  One last question:
> In XP in "Regional and Language Options" (Advanced) I switched the
> character set to be used for non-unicode applications to Russian. And guess
> what: The two messages from your bug 584504 imported just fine. Even the folder
> names were right.

Does that correct import require fixes beyond what's in 3.1.1  (i.e., does the user merely need to select the correct setting in XP, or does he need more code ?)  I understand that TB cannot possibly determine the correct character set if none is mentioned in the converted data, but it sounds like the correct choice for that setting would allow the correct import for many foreign-language users.  

[test with Greek] Maybe the system from which the .pst was obtained did not have the "locale" set to "Greek" ?
Oh, I didn't know about 3.1.1, I'm still on 3.0.6.

Well, even in 3.0.6 (without any modifications), if you import Russian messages, switching the XP default to Russian in the "Regional and Language Options" will give you a better result.
(In reply to comment #41)
> In XP in "Regional and Language Options" (Advanced) I switched the
> character set to be used for non-unicode applications to Greek and imported
> again.
> 
> Result: less ??? but still not perfect:
> 
> Outlook: Ἐν ἀρχῇ ἦν ὁ λόγος
> TB: ?ν ?ρχ? ?ν ? λ?γος

That's expected: ν, ρ, χ, λ, γ,ο ς are all representable in the Windows Greek code page, but the other polytonic characters are not.
My findings show that Outlook may store the message body in three different formats: plain text, HTML and RTF. It never stores the message as it was received (except for the very unlikely case where user has the SaveAllMIMENotJustHeaders setting in the registry) but recomposes it as it likes (http://email.about.com/od/outlooktips/qt/How_to_View_the_Complete_Message_Source_in_Outlook.htm). Plain text may be converted into HTML, and both plain text and HTML may be converted into RTF.

There is a possibility that there will only exist formatted version (such as in http://www.ureader.com/msg/107959.aspx). In addition, as observed with that Greek text, even the "originally"-formatted version may become broken when manipulated using MAPI.
The formatted versions, on the other side, may contain the information needed to display the text properly (it may be either charset info or unicode codepoints of every character). Furthermore, even if the original message had not initially contain the necessary info, the user himself could provide this info to the Outlook, and in this case this info is saved in this formatted version.

There already exist an implementation that takes the HTML converted into RTF and extracts the "original" HTML (http://groups.google.com/group/microsoft.public.win32.programmer.messaging/browse_thread/thread/d78c7a34576adc49/16f4a84aec573358?lnk=st&q=Lucian+Wischik+rtf+html&rnum=1&hl=en#16f4a84aec573358). It is a part of the "MAPI Utils" (http://www.wischik.com/lu/programmer/mapi_utils.html).

Seems like (at least in some versions of Outlook) the precedence of the formats is RTF over HTML over plain text.

So my proposal is that we could first check the RTF version (getting the compressed version using the PR_RTF_COMPRESSED flag, then decompressing it with WrapCompressedRTFStream), and if present, look into it to see if the original is HTML or plain text ("\fromhtml" or "\fromtext"). If it is HTML then try to get the PR_BODY_HTML (0x1013001e) body. If successful then use it, otherwise extract the HTML from RTF. If it is plain text, then _don't_ use the PR_BODY nor PR_BODY_HTML but straight convert the RTF to plain text, thus saving the internationalization info. If there's no RTF version then descend to HTML, and then if the original is plain text then again don't use the PR_BODY but just convert the HTML to plain text, and only use the PR_BODY if there's no RTF and HTML.
The reason for this is that anyway we cannot get the true original body from Outlook. The PR_BODY would be modified version even if the charset would be OK. So we could substitute one modified version (that gives the garbage way too often) with another modified version (that would possibly provide us much better results).
I started to study the RTF specification (http://www.microsoft.com/downloads/details.aspx?familyid=DD422B8D-FF06-4207-B476-6B5396A18A2B&displaylang=en), and one thing startled me: when discussing the codepages and unicode, there is a passage: "An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Often a question mark is used if no reasonable ANSI character exists". Resembles this issue, isn't it? Seems like Outlook itself converts the plain text data from its own RTF using this technique. However, I started to develop an RTF reader that would be able to convert it to plain text or extract the embedded HTML reliaby (looks like the abovementioned code cannot handle charsets properly).

This said, I must admit that this could further require the revision of the code that extracts the message headers, as they may contain the charset info that is inconsistent with the new converted body.
As I suspected, the Microsoft MAPI implementation just couldn't be non-Unicode-aware. At least some problems here could be resolved by just using the necessary Unicode flags/structures.

E.g., the PR_BODY constant is defined in MAPITags.h as "PROP_TAG( PT_TSTRING, 0x1000)"; there also exist PR_BODY_W and PR_BODY_A versions. When you query the msg specifically for the PR_BODY_W, you get the UTF16 string. In TB, it seems that the UNICODE preprocessor directive isn't in effect, so the generic PR_BODY translates to PR_BODY_A and thus the returned body text contains "?" in places where the characters out of the current codepage were. If the PR_BODY_W would be used then, again, the decision would have to be made how to store the UTF16 text into the body; either use utf8 or some other charset? But this would allow to fix that Greek message.
Join the team, Mike, and fix it ;-) UTF8 would be a suitable choice, right?
Thank you for invitation.

The fix is the thing I'm busy with right now. Hope to be worthwhile.
I worked with Mike on this problem over two weeks (23 Aug to 3 Sept) and tested his changes. Mike did a great job restructuring the handling of the body text of a message during import. Not only do they solve the problem of international characters being used, but also cases where HTML messages were displayed as plain text showing the original HTML tags (bug 395745).

Our test data includes Korean, Japanese, Greek, Russian and German (äöüß) messages.

We used test data from this bug (Greek), bug 584504 (Russian) and bug 547119 (Japanese) as well as our own (more Russian, Korean, German).

I can supply my test data to anyone who wants it. I don't want to post it here since it contains personal messages.

Can anyone supply some Chinese test data, please!

We also made sure that the fix for bug 250878 was maintained and that bug 395745 was also fixed (which CLEARLY was not a duplicate of 250878).

Clearly this bug also blocks tb-enterprise (bug 564148) since bug companies are likely to have a lot of "international" e-mail.

I recommend a very swift review of this bug since its solution is very important, as can be seen in the large number of duplicates (see comment #22).
Attachment #472109 - Flags: review?(smontagu)
Attachment #472109 - Flags: review?(smontagu) → review?(bienvenu)
Seems like I have failed to fulfill the requirements of the "How to Submit a Patch" (https://developer.mozilla.org/en/Getting_your_patch_in_the_tree) in that I didn't request the review. Now that I try to fulfill this, I am not sure who is the person who owns the module in question. Based on the patches of other bugs (Bug 250878, Bug 309932) that were reviewed by David :Bienvenu, I decided to ask David. Please tell me the correct address if I'm wrong.
I think that's a good start ;-) He'll nominate someone else if he doesn't want to do it.
Comment on attachment 472109 [details] [diff] [review]
Totally reworked retrieval of the body, it is now retrieved in Unicode. Fixes to body type handling. Fixes to original charset guessing.

switching review to Neil - I haven't been able to get testing of outlook import working on my machine.
Attachment #472109 - Flags: review?(bienvenu) → review?(neil)
I can certainly attempt to perform a test import of Outlook data, but I don't know any of the C++ standard library, so I can't comment on the code itself.

Note: Use of the C++ standard library may be incompatible with Gecko's (lack of) use of exceptions and custom memory allocator.
I suggested to Mike not to use STL. In any case, the use is limited to one function only (ExtractMetaCharset), if one ignores the fact that it is used in the "new" RTF decoder.
I must note that the work is being done to allow the use of stl in Mozilla projects. See Bug 550610, Bug 556699. The following modules are already considered safe (as they have been reviewed and anyway they are used in the code): <algorithm> (Bug 556700), <vector> (Bug 556701), <map> (Bug 556702), <deque> (Bug 556703), <queue> (Bug 556704), <set> (Bug 556705), <stack> (Bug 556706).
Generally it's considered safe to use stl specifically because of the compiler directives that turn the exception generation off, and because of the custom operator new() that doesn't throw exceptions.
I use <algorithm>, <stack>, <map>, and the following not reviewed modules: <locale>, <string>, <istream>, <sstream>. I could try to avoid the use of streams, using iterators instead, and replace the use of <locale> by using another function testing if a char if valid ascii alfanumeric. But I prefer using the stl containers and algorithms.
So I am waiting for your decision. :)
Regarding the use of <locale>:
I only use the static method std::locale::classic() that returns the always-existent object (classic C locale). The isspace, isalpha and isdigit functions return a value on any input value, so no exception is possible. I use the locale paradigm to make it clear which assumptions/tests are being done, without any dependencies on a user environment settings. I'm quite sure that the algorithms in this module should be as fast as possible (hopefully table-based).
Is somewhere available a build of thunderbird for windows (nightly build etc.) which includes the mike's patch? I would really need it. I have the same problem as Mike and my trial version of Outlook what I've installed because of my old mail import to Thunderbird will expire soon...
bienvenu, or Mike, can you throw up a tryserver build.
also, is there progress to finding an additional reviewer beyond neil?

removing my obsolete wanted-thunderbird3? and transferring to thunderbird3.2.

bugs for you to examine as possible dups (I have not examined these closely):
Bug 270638 - Import kills 8-bit characters from subjects and addresses (has testcase)
Bug 357294 - incorrect import of outlook pst (has testcase)
blocking-thunderbird3.2: --- → ?
Flags: wanted-thunderbird3?
(In reply to comment #60)
> bienvenu, or Mike, can you throw up a tryserver build.

Unfortunately, I don't know how to do it, and suspect that some high access level needed for this. I'm just a newbee.

(In reply to comment #60)
> also, is there progress to finding an additional reviewer beyond neil?

See above :)

(In reply to comment #60)
> Bug 270638 - Import kills 8-bit characters from subjects and addresses (has
> testcase)
This one isn't dup (thiugh I believe it may be fixed using similar approach)

(In reply to comment #60)
> Bug 357294 - incorrect import of outlook pst (has testcase)
After importing the testcase with my patched version, the message looks OK. Well, almost OK, because the message consists of very long lines (see Bug 593907) - one letter got broken in my test.
I can confirm Mike's test result. In a version that contains this fix, the test message from bug 357294 is imported correctly.

Bug 357294 can therefore be marked as a duplicate of this bug.
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
The import code have been rewritten to make it more structured and manageable. Roles and responsibilities of modules and classes have been reviewed. As a result, this made possible to fix numerous other bugs, most notably Bug 558653. Created a workaround for deficiency of nsMsgComposeAndSend::EnsureLineBreaks(), thus solving Bug 593907. Prevented an analog of Bug 503690 to appear in Outlook import (incomplete workaround for poorly implemented nsMsgComposeAndSend::GetBodyFromEditor()). Made it possible to import of Outlook messages (msg) embedded as attachments as message/rfc822 attachment type. Fixed a minor bug in converting RTF to HTTP/plaintext.
Attachment #472109 - Attachment is obsolete: true
Attachment #488161 - Flags: review?(neil)
Attachment #472109 - Flags: review?(neil)
The import from Microsoft Outlook has had many bugs. Mike Kaganski has totally reworked the import to fix all major bugs, ie. this bug 207156 and bug 558653 (which are the parents of many duplicates). He has made a fantastic effort. I have worked with Mike over the last few months and I have tested his changes. This week I have tested Mike's latest changes on my Outlook data from eleven years (1999-2010). I imported 2 GB worth of data in on hit without any problem.

I urge you to review and accept this patch as soon as possible. If ever TB wants to make way into Outlook territory, this patch is an absolute MUST HAVE.
The compiled version of TB (Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101108 Thunderbird/3.3a1pre) with the patch applied is available at http://habenergoproect.narod.ru/Shredder.7z. It is a debug version of TB with the following .mozconfig:

ac_add_options --enable-application=mail
ac_add_options --disable-optimize
ac_add_options --enable-debug

I don't recommend using this for anything except importing the Outlook data. Furthermore, I don't recommend using this if you are able to compile TB yourself (I can assist with this if you need). I only did this for those who cannot compile the program themselves and need it fast (as Comment 59). I tested it with my antivirus, and I didn't incorporate any malware in it. It works in my environment. No other claims are made.
To avoid numerous debug messages, you should set the "XPCOM_DEBUG_BREAK" environment variable to "warn" (both without quotes).
And you should expect quite long import time compared to vanilla TB. This is partially due to debug version of the program, but mostly because of large amount of processing that is introduced by my patch that is needed to improve the import quality. As the import is just one-time activity, I hope that user will agree to wait longer once to get more accurate results.

If you will try my patch (in this archive or compiled yourself), please respond with results, especially if you find any bugs.
(In reply to comment #66)
> The compiled version of TB (Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre)
> Gecko/20101108 Thunderbird/3.3a1pre) with the patch applied is available at
> http://habenergoproect.narod.ru/Shredder.7z. It is a debug version of TB with

This is 404.
Oh, I see. Maybe it was removed by hoster. Can you suggest me a place to upload
a 14.1 MB archive?
Excuse me. I now uploaded it to Google Docs. Here is the link:
https://docs.google.com/leaf?id=0B-kIIbVJbQ46NWRiNGY4NWUtYTRhZC00NTZlLThhMTctM2I5NzYzOGI3ZjVi&hl=en
(In reply to comment #69)
> Excuse me. I now uploaded it to Google Docs. Here is the link:
> https://docs.google.com/leaf?id=0B-kIIbVJbQ46NWRiNGY4NWUtYTRhZC00NTZlLThhMTctM2I5NzYzOGI3ZjVi&hl=en

I don't like the report generated by virus total about this file http://www.virustotal.com/file-scan/report.html?id=9a2cbd2ba5dd970fc8581d284dd7d7787c09b9c84f6565c5f74f08cbd34a107b-1289201775 it says it contains ApplicUnsaf.Win32.AdWare.cinmus.194
Why we just can't use tryserver for that?
(In reply to comment #71)
> Why we just can't use tryserver for that?

Excuse me, I just don't know what is tryserver. Maybe you can. If you can, than you definitely should not use my file, and I explicitly stated that you should not use it if you can use any other safer way. I expected that some AV may mark the file I uploaded as unsafe, and even if they would not, you cannot guarantee my good will. This proves true (as may be seen in Comment 70).

As there are doubts concerning the file I submitted, I can only say that you could try it in a virtual machine. I only hope that there will be some update to the Comodo av bases that will not trigger this false positive any more.

And I regret that I posted this. It spoils my reputation, and I could predict this, as I already saw the very same situations.
(In reply to comment #72)
> (In reply to comment #71)
> > Why we just can't use tryserver for that?
> 
> Excuse me, I just don't know what is tryserver.

Try server is a set of builders that we use to try out patches. https://wiki.mozilla.org/Thunderbird/Infrastructure/TryServer

As you won't have access yet, I'll be pushing the patch shortly, once I get some other issues resolved. I'll post to the bug again once the builds are available, hopefully sometime today.

> And I regret that I posted this. It spoils my reputation, and I could predict
> this, as I already saw the very same situations.

Don't worry, we're very grateful to you for providing the patch and trying to help move it along. I really suspect the anti-virus case is just a false positive, and you did include health warnings anyway :-)
The try server build failed with:

MapiMessage.cpp

e:/buildbot/tryserver-win32/build/mailnews/import/outlook/src/MapiMessage.cpp(1290) : warning C4018: '>=' : signed/unsigned mismatch

e:/buildbot/tryserver-win32/build/mailnews/import/outlook/src/MapiMessage.cpp(1428) : error C2664: 'std::_Vector_iterator<_Ty,_Alloc> std::vector<_Ty>::erase(std::_Vector_iterator<_Ty,_Alloc>)' : cannot convert parameter 1 from 'std::_Vector_const_iterator<_Ty,_Alloc>' to 'std::_Vector_iterator<_Ty,_Alloc>'

        with
        [
            _Ty=CMapiMessageHeaders::CHeaderField *,
            _Alloc=std::allocator<CMapiMessageHeaders::CHeaderField *>
        ]

        No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called


http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTry/1289221464.1289228419.23322.gz
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
Made code more standards-complying.
Corrections to comments.

Hope that I didn't use such non-standard MS-specific versions of functions any more. Now it should compile on other compilers (I hope).
Attachment #488161 - Attachment is obsolete: true
Attachment #488995 - Flags: review?(neil)
Attachment #488161 - Flags: review?(neil)
That's better, the build succeeded and all tests passed (although we've not got any tests covering Outlook import). The build can be found here:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/tryserver-builds/bugzilla@standard8.plus.com-f90ad2e714c9/tryserver-win32/
Hello, I have tested the above mentioned build with Outlook 2007 (Italian version) and I have been able to import mail, folders and address book fine.
Just the Settings weren't imported but I think this was because our company work with Exchange.
I had to import them by choosing the related option one by one since if I choose Import Everything, after I hit Next, I receive this dialog window back http://img26.imageshack.us/img26/9141/nofrom.png where after "from:" there are no clients available and if I hit Next nothing happens.

Ciao,
Giuliano
I have just made a test with Mozilla/5.0 (Windows NT 5.1; rv:2.0b8pre) Gecko/20101109 Thunderbird/3.3a1pre.

Import of all parameters from Outlook 2007 is fine but:
- account settings (pop and smtp parameters) where not imported and failed with a warning "An error occurred while importing settings (...)" if I try to import the settings only later.
- all mail are marked unread whereas they are read.
Mike Kaganski rework has to do with importing e-mail messages. It has nothing to do with importing settings.

Imported mail used to show as unread and this behaviour hasn't changed.

The behaviour that has changed is that
- HTML messages are now reliably imported as HTML
- plain text messages are imported as plain text
- messages using international characters are imported properly
- messages with embedded images are imported properly
- messages with long lines are now imported properly, so no
  <CR><LF> is inserted into a multi-byte character thus destroying the
  original information.

I trust Mike himself can extend this list.
Giuliano, caméléon, thank you for testing the patch!

(In reply to comment #77)
Well, I have never tried to import everything before. Now I have downloaded this tryserver build along with a normal nightly build (http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-central/thunderbird-3.3a1pre.en-US.win32.zip). When I tried to "Import everything" they both worked OK.
I have only one Outlook profile, so I don't know if the problem you described is somehow related the multi-profile setup.
However, as Jorg mentioned, we haven't modified a single line outside the mail import. Everything in the import of addresses and settings is left as it was.
What to the fact that all the mail is marked unread - well, I'm not even sure if it's possible to mark any of the imported mail as read. If it is, please tell me how, and I will try to fix this (insignificant to my opinion) inconvenience.
Oops, one thing to be added to the list in comment #79:
- messages with attached messages now get imported properly

As for the imported messages being marked as unread: Go to "Unread Folders" and mark them all as read in one hit. Yes, it's inconvenient, and there is a bug about it, too - bug 219269.
(In reply to comment #81)
> Oops, one thing to be added to the list in comment #79:
> - messages with attached messages now get imported properly

According to http://perso.hirlimann.net/~ludo/blog/archives/2010/11/outlook-user-wanted.html#comment-511811 it doesn't work for all mailboxes.
In the attached ZIP file there is a message in Outlook format with two attached messages and the same message imported into TB. Test is for yourself.
(In reply to comment #83)
> Created attachment 489784 [details]
> Message with attached messages for test (comment #82)
> 
> In the attached ZIP file there is a message in Outlook format with two attached
> messages and the same message imported into TB. Test is for yourself.

The point I was trying to make is not that it doesn't work, but that it doesn't seem to cover all cases.

Martijn what version of version of outloook did you try to import from ?
(In reply to comment #82)
That definitely looks like a bug.
Please provide more information about this. Most useful would be information about the Outlook configuration plus a .pst file that contains such messages and is known to import without attachments on the user's machine.
Ludovic, I tried to import from Outlook 2007 SP2.

I have not especially checked the import of attached email messages, but at first glance I did not see some of the attachments that were in Outlook.

When I now look in other folders I do see some attachments with all kinds of different attachments (the regular PDF and XLS, but also .lic and .config). So some of them are imported correct. I did not see that before, because I checked another folder that has attachments in Outlook, but not in the import in Thunderbird.

However, in another folder I do not see a DOCX-attachment (= the OOXML format from recent Microsoft Office versions). I think that most of the attachments that I am missing are from the DOCX format.
I am less comfortable with attaching a PST file as this is my corporate email.
I think it is best reproduced with sending a message with a DOCX file attached to it to a test mailbox (just to see whether docx attachments should be working fine or whether there's a reason these might fail).
(In reply to comment #80)
> (In reply to comment #77)
> Well, I have never tried to import everything before. Now I have downloaded
> this tryserver build along with a normal nightly build
> (http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-central/thunderbird-3.3a1pre.en-US.win32.zip).
> When I tried to "Import everything" they both worked OK.

Mike, I've just downloaded the build you suggested and tried again to "Import everything". 
This is the build ID: Mozilla/5.0 (Windows NT 6.0; rv:2.0b8pre) Gecko/20101110 Thunderbird/3.3a1pre
Unfortunately I am obtaining again the same behavior I reported in comment #77 

About attachments: all my attachments where imported correctly (doc, docx, xls, xlsx, ppt, pptx, pdf, zip and rar).
About emails: all the emails were imported marked as unread
About settings: also today I was unable to import my settings (Exchange server)

About Outlook: http://img529.imageshack.us/img529/7268/aboutoutlook.png

Ciao,
Giuliano
(In reply to comment #87)
> I am less comfortable with attaching a PST file as this is my corporate email.
> I think it is best reproduced with sending a message with a DOCX file attached
> to it to a test mailbox (just to see whether docx attachments should be working
> fine or whether there's a reason these might fail).

Could you generate a pst file with only one email in it and share that one ?
(In reply to comment #88)
Giuliano,
this indicates that the patch we discuss here doesn't cause any regression. This behaviour is somewhere outside of it (though I agree that from the user's point of view there's no mush difference - you just need the task be done).

(In reply to comment #86)
Martjin,
the sympthoms you describe indicate that most likely the cause of this problem is the so-called OLE attachments (ATTACH_OLE). The code that "handles" them in the old code is here:
http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/MapiMessage.cpp#714.
You may note that this code does nothing - it simply ignores the attachment.
My new code does nothing more - it ignores the attachment as well. Unfortunately, I have no idea how to handle this.
To make me look a little better, I must note that my code still handles one other attachment type better - it is Embedded message (ATTACH_EMBEDDED_MSG) (previously it was ignored as well, now it is imported correctly).

However, I'm eager to see a test case to see if my speculations are correct. And if someone can point me a direction where I can handle the OLE attachments, I would be most grateful.
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
Fixed a bug where some attachments were mistakingly treated as embedded.

Martijn, thank you very much for the bug report and the test case! I was wrong when thought about the possible cause of it. Now I hope that this problem will be solved.

The code that tells if an attachment is embedded or not has to be improved. I'm not familiar with this, so I simply check for embedded images (i.e. "src" attribute of <img> tags). However, there may exist other cases of embedded resources (like css or script), or other attributes may be affected. If somebody has testcases of this, or has some documentation covering this, please send it to me so that I can improve this.
Attachment #488995 - Attachment is obsolete: true
Attachment #490035 - Flags: review?(neil)
Attachment #488995 - Flags: review?(neil)
Yes, it happens in background images ...

<table width="100%" border="0" cellspacing="0" cellpadding="0"><tr valign="top">
<td 
style="background-image:url(cid:1__=45BBFC26DFED153A8f9e8a93d@slv.vic.gov.au);
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
The complete reimplementation of handling the embedded attachments.
Now the import relies on the information that Outlook provides to decide whether an attachment is embedded or not. Further, the Content-Ids of those attachments are kept intact (previously they were replaced with TB-generated). I hope these changes will improve the reliability and quality of the embedded attachments import. But I must say, that the logic to decide whether an attachment is embedded or not is based on some undocumented properties, and I need some feedback to see if it has the right to live.
Second, the Bug 219269 is partially fixed. Namely, the read state is now retained on import. For this, I adopted the patch proposed by David Bienvenu for Bug 315069 (Attachment #362624 [details] [diff]). I hope that David will not frown upon it. Thank you, David.

(In reply to comment #92)
Jorg, this patch now creates the proper message structure for those testcases that you sent me. However, TB seems to ignore the embedded attachments in the places other than <img>. You saw it yourself when recieved such message directly with TB. This should be filed as a separate bug, what do you think? By the way, seems like Outlook does the same. At least, I didn't notice any difference in them.
Attachment #490035 - Attachment is obsolete: true
Attachment #490035 - Flags: review?(neil)
Given that there is now a system in place for a number of volunteers to test builds including this patch, and that it is now three times its original size, I would just like to point out that I cannot provide much useful input at this point, except possibly to assist in the use of Mozilla-specific constructs.

For instance,
bool CMapiMessage::GetTmpFile(nsILocalFile **tmp_file)
{
  nsCOMPtr<nsIFile> _tmp_file;
...
  _tmp_file->QueryInterface(nsILocalFile::GetIID(), reinterpret_cast<void**>(tmp_file));
}
[note: no return value!] becomes
bool CMapiMessage::GetTmpFile(nsILocalFile **aResult)
{
  nsCOMPtr<nsIFile> tmpFile;
...
  return NS_SUCCEEDED(CallQueryInterface(tmpFile, aResult));
}
although I do wonder why data->tmp_file isn't an nsIFile in the first place.
(In reply to comment #94)
Hi Neil!

Thank you for your input!
Your assistance in this field would be most useful. I cannot even get close to understanding of the Mozilla-specific constructs, and I'm absolutely sure that I make mistakes often. Please help me with this. If you would also comment on the mistakes (I mean why it is conceptually wrong), I would be most grateful.
I converted 2 GB of Outlook data using the newest version posted with comment #93. At first glance I found no problems.

As a bonus, imported mail is no marked as read.

Please get a tryserver build going, so others then test it, too.
Oops, this is not ready for prime time use yet.

Text messages containing the word "From" in the message body are now broken into to parts, since the import in no longer inserts the ">" prefix. So the mailbox after the conversion looks like this:

From - Sat, 19 Sep 2009 05:20:31
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Date:Sat, 19 Sep 2009 05:24:36 +0000
From: "JA¶rg Knobloch"
Subject: How are you?
To: "David Morgan"
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit

Hello David,

From the "mailing list" e-mail I sent out the other day regarding the pictures o
f my latest trip, I received a read receipt from you. That was a good sign.

Drop me a line.

Jörg.
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
In this patch I fixed the bugs that have been reported so far. I am especially thankful to Jorg for his reports and thorough testing.

Besides, the Replied/Forwarded status is now kept on imported mail, and the mail that have been imported in the "headers-only" mode is now marked as such.

All cases of ampersand being improperly treated in imported mail should be fixed now.

Now I consider the mail import functionally complete. I ask for doing a tryserver build.
Attachment #490452 - Attachment is obsolete: true
Once again, I imported my 2 GB Outlook e-mail data and all the problems reported above (comment #97) were fixed.

I watched the memory consumption and it was better than in the previous version. It climbed up to 200 MB.

As a bonus, imported mail is now marked as read and the replied/forwarded status is also maintained.

Please get a tryserver build going, so others then test it, too.
(In reply to comment #93)

> Second, the Bug 219269 is partially fixed. Namely, the read state is now
> retained on import. For this, I adopted the patch proposed by David Bienvenu
> for Bug 315069 (Attachment #362624 [details] [diff]). I hope that David will not frown upon it.
> Thank you, David.
That's fine, of course! Thanks for working on this.
(In reply to comment #98)
> Created attachment 491457 [details] [diff] [review]
> Major revision of the mail import

Try server build requested for this patch, it should be available here in about 2 hours:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/tryserver-builds/bugzilla@standard8.plus.com-1112462eedb3

(that directory will be 404 unit the builds show up).
Just downloaded and tested the new build with another import from my Outlook profile. All attachments seems to be working fine as well and the contents of the emails looks fine. My compliments and the added bonus (read status of the mails) is really nice.
Yesterday I downloaded the latest tryserver build and imported all my Outlook mail (thousands of e-mails). Everything seems to be OK. Mike's patch solved all the problems I've had i.e. wrong charsets, missing text or attachments, html code in messages etc. Great work! Thank you.
Comment on attachment 491457 [details] [diff] [review]
Major revision of the mail import

I didn't asked for review of the previous patch, since I made a lot of versions that turned out to require bug fixing. But now the last tryserver build with the last patch seems to achieve a generally positive responce, and I have no plans to add any new functionality to it. So now I think is the time to start the process of reviewing and consequent merge to the trunk.
Neil, I ask you to review it (or assign someone for this task), since it's required to make things done. I understand that there are no tests for the module, but as this module is used for only one task that is performed once, maybe it's acceptable to rely on the users' testing? What to the improper use of the Mozilla-specific constructs, I would be thankful if you poit to them so that they could be fixed.
Attachment #491457 - Flags: review?(neil)
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
Fixed a bug that was introduced by previous patches, that improperly escaped some plaintext attachments and caused such messages to be split to fragments ("ghost messages"). Thanks to Jörg for pointing out this bug.
Attachment #491457 - Attachment is obsolete: true
Attachment #497106 - Flags: review?(neil)
Attachment #491457 - Flags: review?(neil)
Blocks: 618480
Blocks: 219269
Comment on attachment 497106 [details] [diff] [review]
Major revision of the mail import

I didn't try to understand the code. This is just a style review. For instance, there are some lines of the form if (test) return false; these should be split onto two lines.

>+      m_headers.Assign(NS_ConvertASCIItoUTF16(pVal->Value.lpszA).get());
CopyASCIItoUTF16(pVal->Value.lpszA, m_headers);

>+    _tmp_file->QueryInterface(nsILocalFile::GetIID(), reinterpret_cast<void**>(tmp_file));
CallQueryInterface(_tmp_file, tmp_file);
Actually I notice that you used this earlier in the file.

>+  nsresult rv = NS_NewFileURI(getter_AddRefs(uri), aFile, /*Is this OK? I added it to optimize calls*/m_pIOService);
It might be necessary if you're running on a background thread. I'm not sure how threadsafe the IO service is.

>+      data->real_name = strdup(NS_ConvertUTF16toUTF8(fname).get());
...
>+    NS_Free( data->real_name);
Although NS_Free currently calls free and therefore this will work, it didn't before, and as far as I know there's no guarantee that this will continue to work in the future. So, please either
a) ensure that you free everything that you strdup, or
b) use methods that allocate using NS_Alloc, for instance
data->real_name = ToNewUTF8String(fname);

>+#define hackWiden2(t) L ## t
>+#define hackWiden(t) hackWiden2(t)
I think we have a macro for this, NS_LL(x)

>+    pOutlookEditor->QueryInterface( NS_GET_IID(nsIEditor), getter_AddRefs(pEditor) );
pEditor = do_QueryObject(pOutlookEditor);

>+  if (m_EmbeddedObjectList == nsnull) {
Could just write if (!m_EmbeddedObjectList) {

>+    NS_IF_ADDREF(m_EmbeddedObjectList);
m_EmbeddedObjectList is an nsCOMPtr which addrefs automatically. But GetEmbeddedObjects needs to NS_IF_ADDREF(*aNodeList).

>+  nsOutlookHTMLImageElement *image = new nsOutlookHTMLImageElement(this, uri, cid, name);
>+
>+  nsCOMPtr<nsIDOMHTMLImageElement>   imageNode;
>+  image->QueryInterface( NS_GET_IID(nsIDOMHTMLImageElement), getter_AddRefs(imageNode) );
nsCOMPtr<nsIDOMHTMLImageElement> imageNode = new nsOutlookHTMLImageElement(this, uri, cid, name);

>+  nsCOMPtr<nsOutlookHTMLImageElement> node;
>+  nsresult rv = m_EmbeddedObjectList->QueryElementAt(embedIndex, NS_GET_IID(nsOutlookHTMLImageElement), getter_AddRefs(node));
nsOutlookHTMLImageElement doesn't have an IID. This only compiles because we don't check IIDs strictly. (The templated IID mechanism makes it hard to enforce this. The previous system was enforcable.) Now we know this is a closed system, but I'd prefer a static cast.
Attachment #497106 - Flags: review?(neil) → review-
Happy new year everyone!

Thank you Neil! I'll make these changes as soon as I will reach my development PC.
I hope these will help me better understand the Mozilla coding style.
(In reply to comment #106)

First off, I want to note that some of the code that handles embedded images was borrowed from the corresponding part of the Eudora code. And some ugliness comes from there.

> >+      m_headers.Assign(NS_ConvertASCIItoUTF16(pVal->Value.lpszA).get());
> CopyASCIItoUTF16(pVal->Value.lpszA, m_headers);
m_headers is not a string; thus, it's impossible to use the string handling functions on it.

> >+  nsresult rv = NS_NewFileURI(getter_AddRefs(uri), aFile, /*Is this OK? I added it to optimize calls*/m_pIOService);
> It might be necessary if you're running on a background thread. I'm not sure
> how threadsafe the IO service is.

Could you please clarify what should I do here? Do I need to remove the comment? Or maybe the parameter? Or investigate the safeness of this service? (I would prefer not doing the latter :) ).

> Although NS_Free currently calls free and therefore this will work, it didn't
> before, and as far as I know there's no guarantee that this will continue to
> work in the future. So, please either
> a) ensure that you free everything that you strdup, or
> b) use methods that allocate using NS_Alloc, for instance
Could you please advise me a NS function that duplicates a char*? I could only find those that operate on nsStrings. I need it to replace fragments like strdup(MESSAGE_RFC822).

> >+  nsCOMPtr<nsOutlookHTMLImageElement> node;
> >+  nsresult rv = m_EmbeddedObjectList->QueryElementAt(embedIndex, NS_GET_IID(nsOutlookHTMLImageElement), getter_AddRefs(node));
> nsOutlookHTMLImageElement doesn't have an IID. This only compiles because we
> don't check IIDs strictly. (The templated IID mechanism makes it hard to
> enforce this. The previous system was enforcable.) Now we know this is a closed
> system, but I'd prefer a static cast.

When I wrote it, I studied how the IIDs work, and I am aware about the mechanism you refer to. If this mechanism may change in the future then I agree that I need to change this. However, I think that the static_cast is unsafe, too. Maybe I need to introduce my own IID? This way it could be safer (similar to the dynamic_cast). For now, I rewrote it this way:
m_EmbeddedObjectList->QueryElementAt(embedIndex, NS_GET_IID(nsIDOMHTMLImageElement), getter_AddRefs(node));
This is the same as static_cast, and it is obvious that we get an nsIDOMHTMLImageElement and cast it to nsOutlookHTMLImageElement. What do you think?
(In reply to comment #108)
> (In reply to comment #106)
> First off, I want to note that some of the code that handles embedded images
> was borrowed from the corresponding part of the Eudora code. And some ugliness
> comes from there.
Ah, well that explains alot ;-)

> m_headers is not a string
Sorry, I hadn't realised.

> > >+  nsresult rv = NS_NewFileURI(getter_AddRefs(uri), aFile, /*Is this OK? I added it to optimize calls*/m_pIOService);
> > It might be necessary if you're running on a background thread. I'm not sure
> > how threadsafe the IO service is.
> Could you please clarify what should I do here? Do I need to remove the
> comment? Or maybe the parameter? Or investigate the safeness of this service?
I'll try to investigate the safeness of the call. If it's unsafe then you need the parameter to make it safe. Then you may want to comment to that effect.

> (I would prefer not doing the latter :) ).
> 
> > Although NS_Free currently calls free and therefore this will work, it didn't
> > before, and as far as I know there's no guarantee that this will continue to
> > work in the future. So, please either
> > a) ensure that you free everything that you strdup, or
> > b) use methods that allocate using NS_Alloc, for instance
> Could you please advise me a NS function that duplicates a char*? I could only
> find those that operate on nsStrings. I need it to replace fragments like
> strdup(MESSAGE_RFC822).
NS_strdup seems to be overloaded for char* and PRUnichar*.

> > >+  nsCOMPtr<nsOutlookHTMLImageElement> node;
> > >+  nsresult rv = m_EmbeddedObjectList->QueryElementAt(embedIndex, NS_GET_IID(nsOutlookHTMLImageElement), getter_AddRefs(node));
> > nsOutlookHTMLImageElement doesn't have an IID. This only compiles because we
> > don't check IIDs strictly. (The templated IID mechanism makes it hard to
> > enforce this. The previous system was enforcable.) Now we know this is a closed
> > system, but I'd prefer a static cast.
> When I wrote it, I studied how the IIDs work, and I am aware about the
> mechanism you refer to. If this mechanism may change in the future then I agree
> that I need to change this. However, I think that the static_cast is unsafe,
> too. Maybe I need to introduce my own IID?
Only introducing your own IID is completely safe. (We do that in a couple of other places, such as nsITreeColumn -> nsTreeColumn.) Calling QueryInterface on nsIDOMHTMLImageElement's IID will only guarantee you an nsIDOMHTMLImageElement, so none of the other suggestions gives any improvement from a safety point of view, but my version does at least make it most clear that it's unsafe!
Comment on attachment 497106 [details] [diff] [review]
Major revision of the mail import

>+  nsresult rv = NS_NewFileURI(getter_AddRefs(uri), aFile, /*Is this OK? I added it to optimize calls*/m_pIOService);
>+  if (NS_FAILED(rv))
>+    return false;
>+
>+  nsCString urlStr;
>+  uri->GetSpec(urlStr);
>+  if (urlStr.IsEmpty())
>+    return false;
>+
>+  rv = m_pIOService->NewURI( urlStr, nsnull, nsnull, url);
I can't see the wood for the trees! How did I overlook that you get the spec of the URI you just created only to create a new URI with it. It does demonstrate that you need to pass m_pIOService to be safe, but if you prefer you could just write nsresult rv = m_pIOService->NewFileURI(aFile, url);
(In reply to comment #110)

http://mxr.mozilla.org/comm-central/source/mailnews/import/eudora/src/nsEudoraCompose.cpp#575

Oh, now I see why I couldn't understand why it seems to be done twice!

About the IID. Is there any procedure of aquiring/registering new IIDs? How can I do it?
(In reply to comment #111)
> (In reply to comment #110)
> http://mxr.mozilla.org/comm-central/source/mailnews/import/eudora/src/nsEudoraCompose.cpp#575
> 
> Oh, now I see why I couldn't understand why it seems to be done twice!
As I said, that explains a lot ;-)

> About the IID. Is there any procedure of aquiring/registering new IIDs? How can
> I do it?
Generally you run the uuidgen program or equivalent. (Or /msg firebot uuid and it will give you one, with the advantage of providing the C form.)
Attached patch Major revision of the mail import (obsolete) (deleted) — Splinter Review
Fixes according to comments 106 to 112. Now nsOutlookHTMLImageElement has its own IID.
Attachment #497106 - Attachment is obsolete: true
Attachment #504319 - Flags: review?(neil)
blocking-thunderbird3.2: ? → ---
Flags: wanted-thunderbird+
Got some feedback from a French user on Geckozone: Outlook import is much better than with TB3.1, but there are still some issues. For instance:
"Je reviens du ciné (où je viens de voir l'excellent "Le Discours d' Un Roi" ), et voilà une des bandes annonces que j'y ai vu juste avant:
Il me semble que c'est tiré d'une énième nouvelle de ce cher barbu de Philip K. Dick..."

Also, sometimes the images embedded into the message body seems to be not imported.

Source: http://www.geckozone.org/forum/viewtopic.php?p=624147#p624147
Please submit a PST file containing the messages that are not imported properly.

I have tested with German, Korean, Russian, Greek messages and also imported tons of messages with embedded pictures. So what is claimed here, is somewhat hard to believe. Some French accents like àùèé etc. are surely imported correctly. But let's see the data to take the issue further.

The correct text above would be:
Je reviens du ciné (où je viens de voir l'excellent "Le Discours d' Un Roi"
), et voilà une des bandes annonces que j'y ai vu juste avant: ..."
(In reply to comment #115)
Hello caméléon!

Every feedback is velcome, but as it was noted, we need the source message in the Outlook firmat to be able to investigate the problem.

Also, we need to know how the import was performed. To be specific, we need to know which patch was used and how it was compiled. As you may see, quite a few bugs were fixed in various versions of the proposed patch.

Unfortunately, I cannot read French, so even if this information is provided in the source you mentioned, I am unable to understand it.
From Outlook 2007 (12.0.4518.1014) MSO (12.0.4518.1014)
(not correctly imported on Miramar Alpha 3)
From Outlook 2007 (12.0.4518.1014) MSO (12.0.4518.1014)
(In this email,images are not imported on Miramar Alpha 3)
We need a PST file to test it. Either attach a PST file or individual messages in Outlook format (.MSG).
Attached file import issue with "é" and "è" (obsolete) (deleted) —
***PLEASE DELETE THIS ATTACHMENT WHEN JOB IS DONE TO PROTECT PRIVACY ***

Original in Outlook 2007: 
Venez découvrir nos produits nouveaux
=> Le débitmètre massique thermique FS10A de FCI

After Miramar alpha 3 import:
Venez découvrir nos produits nouveaux
=> Le débitmètre massique thermique FS10A de FCI
3 messages showing different issues when importing from Outlook 2007 by Miramar alpha3:
- image not correctly imported
- body of email completely corrupted
- accent issue with "é" and "è"

*** PLEASE DELETE THE ZIP FILE WHEN JOB IS DONE FOR PRIVACY CONCERN ***
Attachment #529065 - Attachment is obsolete: true
(In reply to comment #119)
Hello Zacki!

I'm afraid that the Miramar Alpha 3 doesn't include the patches that were suggested here. For some reason it is postponed to an indefinite time (at least I can guess so taking into account the recent changes of the state of this bug). I would advise you to check the latest tryserver build containing these patches (it is referenced in Comment #101), but it seems to be unavailable now. I am sure that it would import those messages properly. Your feedback is helpful in the sense that it may remind to the developers that the state of import still needs improvement, so maybe they will pay the due attention to this bug.
OK. So I don't know where to find the patch as the link in the comment #101 doesn't work.
By the way, do you still need the two exemples in Outlook format ?
The situation is very unclear...

(In reply to comment #113)
> Created attachment 504319 [details] [diff] [review]
> Major revision of the mail import
> 
> Fixes according to comments 106 to 112. Now nsOutlookHTMLImageElement has its
> own IID.

I wonder if this is the last patches for the bug? 

Can anyone confirm which patch has been integrated in TB3.3a3 and which one still need to be integrated?
Can anyone test if the last version of the patch import correctly the 3 messages I have attached in the zip file?
The three messages supplied were imported perfectly - as expected - into TB. This of course with a version that contains the patch.

In the future when raising issue, please ensure that the version you are using to do the import indeed contains the patch.
(In reply to comment #126) 
> In the future when raising issue, please ensure that the version you are using
> to do the import indeed contains the patch.

This is great but I wonder how I can guess if the version that I use contain or not the patch. I would have expected it was the case with TB3.3a3...
I think that when this patch will find its way to any official build of TB they will mention it here somehow, probably they will change the status of the bug to "Fixed in version X".

I hoped that this would happen already, too, but you may check the code of the files
(http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/) - it is still unchanged since august 2010, while the last patch here dated january 2011.

If you wish to test this patch, you may only try to request the tryserver build (last time it was Mark Banner who did it for me).
Unfortunately I have no idea how I can try this patch or what is a tryserver build... Could you explain it step by step for averages users like Zacki and me are? If it is not too complicated, I believe Zacki will be happy to test this patch as he have an urgent need to import from Outlook...
caméléon,

Regarding the tryserver build. You are echoing me :) This thing is explained here above (starting at Comment #60). In essence, you need to ask someone who has the authority to do it (possibly via a personal mail), and if you are successfull, this person compiles the patched version and posts a link here.
The only other option is to compile it yourself, but this is rather complicated (though not impossible). There are tons of information on this in the web, the starting point may be here: https://developer.mozilla.org/en/Simple_Thunderbird_build.

Regarding Comment #126. I suppose (and I think that Jörg will agree) that it's OK for everyone to write here about this problem, until it is actually fixed in the TB. One user had opened this bug, some other (including Jörg and me) tried to contribute and invented the fix, but the work is not finished until it is 
actually commited to the program. The developers should take it and commit to the trunk (or choose to invent their own wheel). Until then, I think that anyone interested ought to raise this issue so that developers don't forget that it is not dissapear by a miracle. On the other side, of course, you realize that the creators of the proposed patch will not support unpatched programs.
From the normal user's point of view it's not essential to know all these details. He needs his import to go smoothly and its result satisfactory. So please go and call to devs: Hey, we need this! Turn your attention here and just make it work!
Neil, do you think there's a chance you'll get to review this in the next couple weeks? If not, I can give it a try.
Comment on attachment 504319 [details] [diff] [review]
Major revision of the mail import

this patch has bit-rotted in that it doesn't apply against the trunk. I'll try to refresh it.
Attachment #504319 - Flags: review?(neil)
Mike, in order to take this patch, we'd need to put the Mozilla tri-license on rtfDecoder files - is that OK? I've updated your patch so that it builds and works on the trunk, and cleaned up a bunch of whitespace issues, and I'll attach that in a minute, but before I go any further, I'd need to know that I can change the license on those files.

Thx again for all your work on this!
Yes, it is definitely must be done. The only thing I ask is that the notice to be kept that I am the original author, so that if I will later decide to use this lib in a project under a different (incompatible) license, there would not arise a legal issue. Is this OK?
yes, of course, I'll put you as the original author, thx!
Attached patch update to trunk, fix whitespace issues (obsolete) (deleted) — Splinter Review
this now builds on the trunk. I fixed a bunch of whitespace issues (tabs, newline issues, etc) and updated the copyright on the rtf files per Mike's last comment. I'm having a bit of a challenge testing the import code because I only have a trial version of Outlook at the moment. I'll try to do a fuller review later today, but the first thing that struck me were the commented out lines,
e.g.,

+  // nsOutlookMail::SetDefaultContentType()
+  if (strnicmp(m_mimeContentType.get(), "multipart/", 10) == 0) {

+      //50229  ISO 2022 Traditional Chinese 
+      //50930  EBCDIC Japanese (Katakana) Extended 
+      //50931  EBCDIC US-Canada and Japanese 
+      //50933  EBCDIC Korean Extended and Korean 
+      //50935  EBCDIC Simplified Chinese Extended and Simplified Chinese 
+      //50936  EBCDIC Simplified Chinese 
+      //50937  EBCDIC US-Canada and Traditional Chinese 
+      //50939  EBCDIC Japanese (Latin) Extended and Japanese 

+    m_msgFlags = CMapiApi::GetLongFromProp( pVal);
+//    CMapiApi::MAPIFreeBuffer( pVal); // No need since GetLongFromProp() has a delVal with default of PR_TRUE
+  }
+  pVal = CMapiApi::GetMapiProperty(m_lpMsg, PR_LAST_VERB_EXECUTED);
+  if (pVal) {
+    m_msgLastVerb = CMapiApi::GetLongFromProp( pVal);
+//    CMapiApi::MAPIFreeBuffer( pVal); // No need since GetLongFromProp() has a delVal with default of PR_TRUE

We generally don't allow commented out code in patches, because it makes the code hard to maintain. If appropriate, the commmented out code should be replaced with comments, or just removed.

Also,

#if 0 code - nsOutlookEditor::UpdateEmbeddedImageReference().

I assume this was just cloned from the Eudora editor code - should it just be removed?
Attachment #504319 - Attachment is obsolete: true
Attachment #529583 - Flags: review?(dbienvenu)
(In reply to comment #136)
> +  // nsOutlookMail::SetDefaultContentType()
> +  if (strnicmp(m_mimeContentType.get(), "multipart/", 10) == 0) {
  = the next lines came from old nsOutlookMail::SetDefaultContentType()


> +      //50229  ISO 2022 Traditional Chinese 
> +      //50930  EBCDIC Japanese (Katakana) Extended 
> +      //50931  EBCDIC US-Canada and Japanese 
> +      //50933  EBCDIC Korean Extended and Korean 
> +      //50935  EBCDIC Simplified Chinese Extended and Simplified Chinese 
> +      //50936  EBCDIC Simplified Chinese 
> +      //50937  EBCDIC US-Canada and Traditional Chinese 
> +      //50939  EBCDIC Japanese (Latin) Extended and Japanese 
  As I noted above this block, this list is from a MS document; I left those lines that had not corresponding charset commented out to keep the table as close to original as possible; besides, it clearly shows that this table is incomplete, so if someone knows a charset name fot that codepages it's easily added here.

> +    m_msgFlags = CMapiApi::GetLongFromProp( pVal);
> +//    CMapiApi::MAPIFreeBuffer( pVal); // No need since GetLongFromProp() has
> a delVal with default of PR_TRUE
> +  }
> +  pVal = CMapiApi::GetMapiProperty(m_lpMsg, PR_LAST_VERB_EXECUTED);
> +  if (pVal) {
> +    m_msgLastVerb = CMapiApi::GetLongFromProp( pVal);
> +//    CMapiApi::MAPIFreeBuffer( pVal); // No need since GetLongFromProp() has
> a delVal with default of PR_TRUE
  Here I do something I believe to be OK, but as I'm not comfortable with the Mozilla API, I feel safer putting these comments so anyone looking for possible memory leaks will see a possible root of problems.

> #if 0 code - nsOutlookEditor::UpdateEmbeddedImageReference().
> I assume this was just cloned from the Eudora editor code - should it just be
> removed?
  No, this isn't cloned from the Eudora editor, it's the first attempt to revert the original cids. It was later replaced with another, that I believe to be more precise. Whatever, it may be removed. However, I'd like to explain what is it for.
When a multipart/related message contains embedded objects, they are referenced by their cids. As the editor interface allows for autogenerated cids only, not allowing me to specify a cid of my choice, the resulting code of the imported message differed from the original. Furthermore, these cids may appear at places other than img tags, eg in scc or script sections or in other parts of the multipart/related message. At first I used the Eudora approach, which cares for img tags only, and tried to replace the autogenerated cids after generation of the message. But as it became clear that this approach is too narrow I decided to drop it and now my code tries to restore all the cids that may happen in the message so that the message is most close to the original.
OK, thx, but all those would be better expressed as comments, not simply commented out code. I can make those changes since I need to do a couple things to speed up import, which seems to be massively slowed down by the patch, though I think it's mostly due to thread-safety warnings that I can fix.
Yes, the patch does slow the import down a lot. I'm afraid there's nothing to be done about it. It is for the sake of accuracy. As it's generally done once, I think that the quality of result is more important than the speed.

What to the comments, I want to note that

// nsOutlookMail::SetDefaultContentType()
*is* just a comment, not a commented-out code (if it were it would end with ";");

      //50229  ISO 2022 Traditional Chinese 
*is* just a comment - you cannot uncomment it and make it compile;

//    CMapiApi::MAPIFreeBuffer( pVal); // No need since GetLongFromProp() has a delVal with default of PR_TRUE
is indeed the commented out code, but I hardly can imagine something clearer (if it to be left anyway). And I do believe that changing some fine self-explaining commented-out code with some not-so-good comment is not good, even if rules say otherwise.

Of course, if you feel like it, please do it.
I'm very happy that working on this bug is revived.

 Happy Coding (^.^)/ Thanks Thanks
Blocks: 415045
Blocks: 645994
I've gone through and tried to explain the commented out code, and I'll attach the newere patch in a minute.

However, what does this mean, as a comment? I honestly don't know - there's no method nsOutlookMail::SetDefaultContentType, but there is a class nsOutlookMail.

// nsOutlookMail::SetDefaultContentType()

The function question is normalizing the mime content type, that much I can tell from the function comment...
Attached patch tweak the comments. (obsolete) (deleted) — Splinter Review
(In reply to comment #141)
> // nsOutlookMail::SetDefaultContentType()

Isn't that part of a larger block which is commented out via #if 0?
(In reply to comment #143)
> (In reply to comment #141)
> > // nsOutlookMail::SetDefaultContentType()
> 
> Isn't that part of a larger block which is commented out via #if 0?

You're right, it is, a good example of why we don't like to checkin #if 0 code :-) So I withdraw that question, and I'll remove the #if 0 code.
Attached patch remove more #if 0 code (deleted) — Splinter Review
this just removes some more #if 0 code, and tweaks some comments. Generally, the normal place for the #if 0 code is in patches in bugzilla - if it turns out we want to turn on the code, we can get it from the patch in the bug.
Attachment #529734 - Attachment is obsolete: true
Blocks: 583490
Attached patch more cleanup (deleted) — Splinter Review
this cleans up a lot more whitespace issues (tabs, spaces after (, long lines, whitespace at end of lines, unneeded {'s, etc). There are still some long lines, but it looks a lot more reasonable.
Attachment #529583 - Attachment is obsolete: true
Attachment #529583 - Flags: review?(dbienvenu)
I'm going to make one more pass through this cleaning up long lines and the like, and then I hope to land this before we branch for Miramar, and get some testing during beta.
Attached patch yet more cleanup (deleted) — Splinter Review
mostly more breaking up of long lines, cleaning up some comments, etc. I've made some minor code changes, and I'm thinking this is what I'm going to land, so I'm going to kick off a try server build for some last sanity checks by interested folks.
I hate to ask, but what's a "minor" code change? Anything that changes the logic? I did some extensive testing with the version Mike posted in December 2010 (comment #105). Have there been any functional changes since this version?
(In reply to comment #149)
> I hate to ask, but what's a "minor" code change? 
Extremely minor - I changed a function into a method, and made the method use a member variable instead of having the caller pass in the same two arguments every time. All the other changes were whitespace, breaking up long lines, etc.

> Anything that changes the
> logic? I did some extensive testing with the version Mike posted in December
> 2010 (comment #105). Have there been any functional changes since this
> version?

No, there shouldn't be any functional changes.
Jörg,
I have checked the changes and I can assure you that the logic isn't changed.

David,
If you consider this a good change than it's ok. However, I feel like this change is something questionable:
1. I tried to make it clear that this function is just a replacement for a defficient other (that may become corrected at some time, or its variant may appear, so that the only thing that will be needed in that case is to replace my function with that proper one).
2. You change the class definition. Additionally, you make it look like this class is responsible for something it isn't (in this case, you claim its responsibility to analyze the charsets of a text).
However, your method name makes it clear that this responsibility only covers this class' member, and it's private. And if that "proper" function will be created in the proper place (presumably in nsMsgI18N), this method may be easily changed to just one line, without changing other things. Well, whatever you decide.
(In reply to comment #151)

> 1. I tried to make it clear that this function is just a replacement for a
> defficient other (that may become corrected at some time, or its variant may
> appear, so that the only thing that will be needed in that case is to
> replace my function with that proper one).
Mike, thx for looking at the change. I left your comment in, so I think it's still clear that you're replacing some deficient functionality. But it's an excellent working assumption that the other function will not be fixed. Making the method private, as you say, makes it clear that it's a helper function. And if the other function is fixed, it's easy to switch back to it - we have excellent tools for tracking those kinds of changes.
try server build here - 
The full log for this test run is available at http://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/bienvenu@nventure.com-e4866a647ff0

My plan is to land this on the trunk after we branch for Miramar, so the changes would be in the release 6 weeks after Miramar.
OK, I've downloaded and tested this build against my test cases. This time I used Outlook 2010 for testing, and as far as I can tell, everything is fine. So now this importer is tested to work with Outlook 2003, 2007 and 2010. Hope to hear about other (older) versions, too.
Assignee: smontagu → mikekaganski
fixed on trunk, with one more tweak Mike e-mailed me - changeset http://hg.mozilla.org/comm-central/rev/90c3929c5b5d.

This fix will be in the release after Miramar
Status: NEW → RESOLVED
Closed: 19 years ago13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.4
Attached patch stop using stdlib::locale (obsolete) (deleted) — Splinter Review
this caused at least one person to have a build issue, and we don't tend to use the stdlib stuff much, and I think plain old isalpha is fine for rtf.
Comment on attachment 533280 [details] [diff] [review]
stop using stdlib::locale

With this patch I don't get the link error and can compile successfully. The application starts up without error but I can't test the import functionality because I don't have Outlook installed.
Attachment #533280 - Flags: feedback+
Comment on attachment 533280 [details] [diff] [review]
stop using stdlib::locale

Mike, can you just do a quick sanity check that this is OK? We don't tend to use stdlib if we don't have to, and rtf is inherently ascii, so I can't imagine this would cause a problem.
Attachment #533280 - Flags: review?(mikekaganski)
Comment on attachment 533280 [details] [diff] [review]
stop using stdlib::locale

Review of attachment 533280 [details] [diff] [review]:
-----------------------------------------------------------------

1. I don't insist on using the std::locale
2. Unfortunately, the ctype functions such as isalpha, isdigit etc, are not appropriate here, as they depend on the current locale. The RTF spec defines its character types exactly as the default "C" locale (or std's classic) does. So if we have to abandon the <locale>, we need to use the functions that allow to specify the "C" locale, or create custom variants.
thx, Mike - isn't this essentially ch >= 'a' && <= 'z' || ch >= 'A' && <= 'Z' and isdigit is ch >= '0' && ch <= '9'?

Or, won't this give us the standard "isalpha"? NS_ISALPHA here:

http://mxr.mozilla.org/comm-central/source/mailnews/base/public/msgCore.h#106
Yes, David, you are right. I would prefer not using the NS_ISALPHA version, because personally I don't know if it will not change someday.

And isspace needs to be defined as well.
(In reply to comment #161)
> Yes, David, you are right. I would prefer not using the NS_ISALPHA version,
> because personally I don't know if it will not change someday.

I believe it exists for the very reason you described, so I'm sure it won't change in a way that breaks us. Plus, it's mailnews code, so we control it, and it hasn't changed in a very long time.

But I don't particularly care, since there's no advantage to using the macro's either.

> 
> And isspace needs to be defined as well.

OK, thx, I'll come up with a new patch.
I have palyed with some RTFs, and it looks like the isspace here must look like
ch == ' '. So in this case, the "C" locale isspace is incorrect, too.

I thought that when "a space" appeared in the spec, it referred to any spacing character, but the MS RTF parsers treat tabs differently.
Attached patch use our own macro's (obsolete) (deleted) — Splinter Review
Attachment #533280 - Attachment is obsolete: true
Attachment #533457 - Flags: review?(mikekaganski)
Attachment #533280 - Flags: review?(mikekaganski)
Please modify the macros this way:
#define IS_DIGIT(i)   (((i) >= '0') && ((i) <= '9'))
#define IS_ALPHA(VAL) ((((i) >= 'A') && ((i) <= 'Z')) || (((i) >= 'a') && ((i) <= 'z')))

This will literally follow the next statements from MS RTF Spec:
"<letter>		a..z | A..Z
<control name>		<letter>+
<digit>			0..9
<parameter>		'-'? <digit>+
<control word entity>	'\' <control name><parameter>?"
this uses explicit character checks...
Attachment #533457 - Attachment is obsolete: true
Attachment #534752 - Flags: review?(mikekaganski)
Attachment #533457 - Flags: review?(mikekaganski)
Comment on attachment 534752 [details] [diff] [review]
use explicit char checks - checked in

Everything is in the place, r+ from me.
Comment on attachment 534752 [details] [diff] [review]
use explicit char checks - checked in

Mike said he's happy with this in comment 168, I'm happy as well (although I've not tested it, but that's fine). So we should get this landed.
Attachment #534752 - Flags: review?(mikekaganski) → review+
Comment on attachment 534752 [details] [diff] [review]
use explicit char checks - checked in

checked in - http://hg.mozilla.org/comm-central/rev/7ac67e03558b
Attachment #534752 - Attachment description: use explicit char checks → use explicit char checks - checked in
Blocks: 668393
Recent modifications to rtfDecoder.cpp have broken the build, at least in my environment.  This is what I get when building TB 6.0b2:

   Creating library xul.lib and object xul.exp
import.lib(rtfDecoder.obj) : error LNK2001: unresolved external symbol "private: static void __cdecl std::locale::facet::facet_Register(class std::locale::facet *)" (?facet_Register@facet@locale@std@@CAXPEAV123@@Z)
xul.dll : fatal error LNK1120: 1 unresolved externals

Seen with both 32- and 64-bit builds.  No problem building TB 5.0 in same environment.

Environment:

WinXP/x64 (all updates applied)
VS2008/SP1 (all updates applied)
WinSDK v7.1

The build completes successfully if I remove the references to the Outlook importer.
(In reply to comment #171)
> Recent modifications to rtfDecoder.cpp have broken the build, at least in my
> environment.  This is what I get when building TB 6.0b2:

Please file a separate follow-up bug so we can track that separately.
Done. See Bug 675893.
Depends on: 675893
(In reply to Steve Snyder from comment #171)
> Recent modifications to rtfDecoder.cpp have broken the build, at least in my
> environment.  This is what I get when building TB 6.0b2:
> ...
> The build completes successfully if I remove the references to the Outlook
> importer.

I just encountered exactly the same problem. One possible workaround without disabling the Outlook import functionality is to replace these expressions

std::isalpha(ch, std::locale::classic())
std::isdigit(ch, std::locale::classic())
std::isspace(ch, std::locale::classic())

with their C runtime counterparts in rtfDecoder.cpp.
This is marked resolved, but I am seeing similar errors in importing from Outlook 2003 using Thunderbird 17.0.2
(In reply to km from comment #176)
> This is marked resolved, but I am seeing similar errors in importing from
> Outlook 2003 using Thunderbird 17.0.2

Please file a separate follow-up bug so we can track that separately.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: