Closed Bug 129443 Opened 23 years ago Closed 3 years ago

Incorrect encoding (charset) for mail and news/nntp URIs in browser

Categories

(MailNews Core :: Internationalization, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: xslf, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC; en-US; rv:0.9.9+) Gecko/20020306
BuildID: 2002030608
When opening in the browser a Hebrew file with the MIME type message/rfc822, Mozilla incorrectly displays it as windows-1255 and shows junk. The user has to manually change the encoding to Unicode.
Reproducible: Always
Steps to Reproduce:
1. Go to http://www.typo.co.il/~sbforum/MESSAGES/86/1586.eml
2. See how the Hebrew is (not) displayed.
3. Change the encoding to Unicode in order to view it correctly.
Actual Results: Mozilla displays the message with the wrong encoding.
Expected Results: Mozilla should pick the correct encoding.
I have heard that this happens on Linux and Windows as well, but I haven't tested on them.
Summary: When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255 → When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255
This problem is not specific to Mac. It happens on Windows and Linux as well. Some more examples:
http://www.typo.co.il/~sbforum/MESSAGES/80/1580.eml
http://www.typo.co.il/~sbforum/MESSAGES/73/1573.eml
These messages are encoded in 'windows-1255'. Mozilla displays them as either 'iso-8859-1' (Latin) or 'windows-1255' (actually I'm not sure about the latter -- I'm using Windows now and the messages are shown using 'iso-8859-1', not 'windows-1255') and you see junk instead of Hebrew. You have to switch to UTF-8 in order to read them. It seems that the user's default encoding doesn't matter. Nor does it matter whether the message is 'multipart/alternative' or single-part.
The following messages are encoded in UTF-8, so the problem is not specific to the Hebrew encoding:
http://www.typo.co.il/~sbforum/MESSAGES/75/975.eml
http://www.typo.co.il/~sbforum/MESSAGES/15/715.eml
Mozilla displays them as 'iso-8859-1' (Latin) and you see junk instead of Hebrew.
OS: Mac System 9.x → All
Hardware: Macintosh → All
This should probably be in the intl component.
Assignee: mkaply → yokoyama
Component: BiDi Hebrew & Arabic → Internationalization
QA Contact: zach → ruixu
There is a summary of what's going on here at http://bugzilla.mozilla.org/show_bug.cgi?id=33049#c17 Bug 33049 was resolved as WORKSFORME, but this seems to be a real problem.
.eml files are files saved by Mozilla/Netscape 6 Mail. If a message is saved by Mozilla/Netscape 6, it is saved in UTF-8. So what you're seeing is not a bug; it follows the current spec. If you want to see the messages in the encoding of the system you are using, you should save them as ".txt" files. Better yet, use HTML format when saving. So in essence this is not a browser bug. These people are exposing saved mail msgs without pointers and they should be told to include an instruction.
My suggestions to eliminate the problem:
1. When you save mail msgs, use HTML format. This should get you the document encoding tag.
2. Turn on View | Character Coding | Auto-Detect | All. Auto-detectors normally check for UTF-8 sequences.
It is possible that we can build in an automatic UTF-8 check on any encoding menu item. I wonder if that is a good idea or a bad idea. During the Communicator 4.x days, we used to check for UCS-2 on any incoming data, and that turned out to cause some problems, so we restricted the UCS-2 check to just when one of the Unicode encodings is chosen.
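A minimal sketch of the kind of automatic UTF-8 check floated above, written in Python purely for illustration. The function name `looks_like_utf8` and the sample strings are hypothetical; this is not how any Mozilla auto-detector is implemented. It only shows why such a check is cheap: legacy 8-bit Hebrew text is very unlikely to form valid UTF-8 sequences.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains non-ASCII bytes.

    Pure ASCII is valid in almost every encoding, so it tells us nothing
    and is reported as False here.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)


# Hypothetical sample: the same Hebrew word in two encodings.
hebrew = "שלום"
utf8_bytes = hebrew.encode("utf-8")
cp1255_bytes = hebrew.encode("cp1255")  # Python's name for windows-1255

print(looks_like_utf8(utf8_bytes))    # valid multi-byte UTF-8
print(looks_like_utf8(cp1255_bytes))  # not valid UTF-8
```

A check like this can still give false positives on short inputs, which is the same class of risk as the UCS-2 sniffing problem mentioned above.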
> .eml files are files saved by Mozila/Netscape 6Mail. If it is saved by
> Mozilla/Netscape 6, they are saved in UTF-8.
I correct myself. I explained this much better in http://bugzilla.mozilla.org/show_bug.cgi?id=33049#c17 The .eml data is saved as the original RFC 822 data. I should add one more workaround: eliminate the .eml extension. You will then be able to see it as a Windows-1255 file.
> Bug 33049 was resolved as WORKSFORME, but this seems to be a real problem.
Before you do anything, please check with the mail team to see what consequences there are for changing the current behavior, as summarized in the above quoted comment, for parsing .eml files.
Keywords: intl
QA Contact: ruixu → ylong
If I understand correctly, the problem is that we construct internally a DOM representation of the message, with the text in UTF-8, but without setting any charset attribute. I haven't located the code where this happens, but if my assumptions are right, the fix ought to be trivial (famous last words)
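To make the hypothesis above concrete, here is a small Python sketch using the standard `email` package. It is an illustration under stated assumptions, not Mozilla's code: the sample message `raw` and the variable names are invented. It shows the general pattern of decoding a body from its declared charset, converting it to UTF-8, and then labelling the output so that a consumer does not have to guess the encoding.

```python
from email import message_from_bytes

# Hypothetical RFC 822 message whose body is 8-bit windows-1255 text.
raw = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=windows-1255\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "שלום".encode("cp1255")
    + b"\r\n"
)

msg = message_from_bytes(raw)
declared = msg.get_content_charset()        # charset named in the message
body_bytes = msg.get_payload(decode=True)   # raw body bytes, still windows-1255
text = body_bytes.decode(declared)          # now a Unicode string

# Converting to UTF-8 is harmless as long as the result is labelled.
# Emitting utf8_body without the header line is the unlabelled case
# that the comment above suspects.
utf8_body = text.encode("utf-8")
labelled = b"Content-Type: text/plain; charset=utf-8\r\n\r\n" + utf8_body
```

If the label is dropped, a browser falls back to its default or auto-detected encoding, which would match the junk reported in this bug.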
re-assign to smontagu
Assignee: yokoyama → smontagu
cc Xianglan and marina.
this is totally a mail/charset issue. cc'ing nhotta.
As Kat explained, saving the original RFC822 data in UTF-8 for .eml file extension is by design. If we add any charset attribute to the file, it won't be the original RFC822 data anymore. Should we resolve this as WFM then? QA contact to myself.
Product: Browser → MailNews
QA Contact: ylong → ji
With regard to comment #6 by smontagu, we may be reusing the mail code for this because of the .eml extension. CC'ing bienvenu@netscape.com also.
> As Kat explained, saving the original RFC822 data in UTF-8
> for .eml file extension is by design.
My comment in this bug is incorrect. I think I was more accurate in the original bug smontagu cited above. The data are saved as the original data, but we use UTF-8 in the internal representation.
Kat, you're probably right, but I'm not the right person to ask - you might try e-mailing mscott directly for the definitive answer.
Status: NEW → ASSIGNED
*** Bug 223225 has been marked as a duplicate of this bug. ***
Summary: When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255 → Incorrect encoding for mail and news URIs in browser
Summary: Incorrect encoding for mail and news URIs in browser → Incorrect encoding (charset) for mail and news/nntp URIs in browser
Yes, I'm the one who submitted the duplicated bug 223225. In that case it shows that the problem is not the *.EML file in itself. Apparently the same UTF-8 conversion mentioned in comment #4 is also performed on external links to news articles. Probably the conversion is performed on all non-webpages displayed in the browser, and comment #6 and comment #11 are therefore perfectly right.
*** Bug 231524 has been marked as a duplicate of this bug. ***
*** Bug 244945 has been marked as a duplicate of this bug. ***
Blocks: 254868
None of the URLs provided in this bug as samples are valid any longer. Could someone *attach* an actual .eml file that exhibits this problem to the bug? Remember to give it type: message/rfc822
The file at attachment 11787 (from bug 33049) is pretty peculiar. Loading it in the browser:
- Autodetect:Universal identifies the charset as Greek (ISO-8859-7).
- Autodetect:Japanese identifies the charset as Shift_JIS, which shows a bunch of Kanji (or Chinese) mixed with centered-dot characters -- including within the vCard.
- Forcing an encoding of ISO-2022-JP (the charset specified within the file itself), the display is all '?'.
- Forcing an encoding of UTF-8, the subject and body appear to be some form of kana, except in the vCard where the characters appear as '?'.
(In reply to comment #19)
> - Forcing an encoding of UTF-8, the subject and body appear to be some form of
> kana, except in the vCard where the characters appear as '?'.
This needs to be retested, but I believe that that is bug 221631, which has been fixed since the date of the attachment.
(In reply to comment #20)
> (In reply to comment #19)
> > - Forcing an encoding of UTF-8, the subject and body appear to be some form
> > of kana, except in the vCard where the characters appear as '?'.
>
> This needs to be retested, but I believe that that is bug 221631, which has
> been fixed since the date of the attachment.
The fix there seems to be forcing a default of utf-8 on (some?) vCards -- which is how Mozilla sends vCards now. The vCard in that attachment has an explicit 2022-JP encoding. Even when displayed in Mail/News, those characters are not shown correctly, so that problem is unrelated to this bug.
I forgot that attachment 139450, from the bug I filed that was duped to this one, shows the basic problem. One symptom from that attachment which is not mentioned here: the 8bit characters which (illegally) are in the Subject header of that mail display correctly when the browser's encoding is 8859-1 (whereas the body shows the 8859-1 bytes corresponding to the UTF-8 encoding of the original 8859-1 characters). Forcing the encoding to UTF-8, the body displays correctly but the headers are wrong.
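The header/body mismatch described above can be reproduced in a few lines of Python. This is only a model of the symptom under an assumption: that the header bytes reach the browser untouched while the body has been converted to UTF-8. The sample word and variable names are hypothetical.

```python
# Assumed data flow: Subject bytes passed through as raw ISO-8859-1,
# body re-encoded as UTF-8, both delivered in one unlabelled stream.
subject_raw = "café".encode("iso-8859-1")
body_utf8 = "café".encode("utf-8")
stream = subject_raw + b"\n" + body_utf8

# Viewing the whole stream as ISO-8859-1: subject fine, body is mojibake.
as_latin1 = stream.decode("iso-8859-1")

# Viewing the whole stream as UTF-8: body fine, subject gets U+FFFD,
# the "black diamond with a white question mark".
as_utf8 = stream.decode("utf-8", errors="replace")

print(as_latin1.splitlines())
print(as_utf8.splitlines())
```

No single encoding choice renders both parts correctly, which is consistent with what the attachment shows.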
*** Bug 38109 has been marked as a duplicate of this bug. ***
Product: MailNews → Core
(In reply to comment #20)
> (In reply to comment #19)
> > - Forcing an encoding of UTF-8, the subject and body appear to be some form of
> > kana, except in the vCard where the characters appear as '?'.
>
> This needs to be retested, but I believe that that is bug 221631, which has been
> fixed since the date of the attachment.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5pre) Gecko/2008031507 SeaMonkey/2.0a1pre
I see Character Encoding: Autodetect -> Universal and UTF-8. The one line of text is identical to the Subject; they look Japanese (including both hiragana and kanji). The vCard includes only ASCII plus a number of black diamonds with white question marks on them.
(In reply to comment #21)
[...]
> I forgot that attachment 139450, from the bug I filed that was duped to this
> one, shows the basic problem. One symptom from that attachment which is not
> mentioned here: the 8bit characters which (illegally) are in the Subject header
> of that mail display correctly when the browser's encoding is 8859-1 (whereas
> the body shows the 8859-1 bytes corresponding to the UTF-8 encoding of the
> original 8859-1 characters). Forcing the encoding to UTF-8, the body displays
> correctly but the headers are wrong.
It is still so using "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5pre) Gecko/2008031507 SeaMonkey/2.0a1pre": with Autodetect -> Universal and Windows-1252, accented characters are shown OK in the Subject header and replaced by gibberish in the body. Forcing UTF-8 shows accented characters replaced by black diamonds with white question marks on them in the Subject header and OK in the body.
Product: Core → MailNews Core
QA Contact: ji → i18n
Assignee: smontagu → nobody
Status: ASSIGNED → NEW

Is this expected to still be a problem?

Flags: needinfo?(mkmelin+mozilla)

Probably not. The test cases are no longer available.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(mkmelin+mozilla)
Resolution: --- → INCOMPLETE