Closed
Bug 129443
Opened 23 years ago
Closed 3 years ago
Incorrect encoding (charset) for mail and news/nntp URIs in browser
Categories
(MailNews Core :: Internationalization, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: xslf, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: intl)
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC; en-US; rv:0.9.9+) Gecko/20020306
BuildID: 2002030608
When a Hebrew file with the MIME type message/rfc822 is opened in the browser,
Mozilla incorrectly displays it as windows-1255 and shows junk. The user has to
manually change the encoding to Unicode.
Reproducible: Always
Steps to Reproduce:
1. Go to http://www.typo.co.il/~sbforum/MESSAGES/86/1586.eml
2. See how the Hebrew is (not) displayed.
3. Change the encoding to Unicode in order to view it correctly.
Actual Results: Mozilla displays the message with the wrong encoding.
Expected Results: Mozilla should pick the correct encoding.
I have heard that this happens on Linux and Windows as well, but I haven't tested
on them.
Updated•23 years ago
Summary: When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255 → When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255
Comment 1•23 years ago
This problem is not specific to Mac. It happens in Windows and Linux as well.
Some more examples:
http://www.typo.co.il/~sbforum/MESSAGES/80/1580.eml
http://www.typo.co.il/~sbforum/MESSAGES/73/1573.eml
These messages are encoded in 'windows-1255'.
Mozilla displays them as either 'iso-8859-1' (latin) or 'windows-1255'
(actually I'm not sure about the latter -- I'm using Windows now and the
messages are shown using 'iso-8859-1', not 'windows-1255') and you see junk
instead of Hebrew.
You have to switch to UTF-8 in order to read them.
It seems that the user's default encoding doesn't matter. Nor does it matter if
the message is 'multipart/alternative' or single.
The following messages are encoded in UTF-8, so the problem is not specific to
Hebrew:
http://www.typo.co.il/~sbforum/MESSAGES/75/975.eml
http://www.typo.co.il/~sbforum/MESSAGES/15/715.eml
Mozilla displays them as 'iso-8859-1' (latin) and you see junk instead of
Hebrew.
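The symptom can be reproduced outside the browser with a minimal sketch. It assumes (as later comments in this bug suggest, but do not prove) that the browser receives the message text as UTF-8 bytes regardless of the original charset. The sample string and the `show_as` helper are made up for illustration and are not taken from the test messages.

```python
# Hypothetical sample text; the original test messages are not reproduced here.
hebrew = "שלום"

# Assumption: the mail code hands the browser UTF-8 bytes, whatever the
# charset of the original message was.
utf8_bytes = hebrew.encode("utf-8")


def show_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes the way a viewer forced to `charset` would."""
    return raw.decode(charset, errors="replace")


as_latin1 = show_as(utf8_bytes, "iso-8859-1")    # junk: two characters per Hebrew letter
as_cp1255 = show_as(utf8_bytes, "windows-1255")  # junk as well
as_utf8 = show_as(utf8_bytes, "utf-8")           # readable Hebrew

print(repr(as_latin1), repr(as_cp1255), repr(as_utf8))
```

Under that assumption, only the UTF-8 interpretation round-trips, which matches the observation that switching to UTF-8 makes the messages readable.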
Reporter | ||
Updated•23 years ago
OS: Mac System 9.x → All
Hardware: Macintosh → All
Comment 2•23 years ago
This should probably be in the intl component.
Assignee: mkaply → yokoyama
Component: BiDi Hebrew & Arabic → Internationalization
QA Contact: zach → ruixu
Comment 3•23 years ago
There is a summary of what's going on here at
http://bugzilla.mozilla.org/show_bug.cgi?id=33049#c17
Bug 33049 was resolved as WORKSFORME, but this seems to be a real problem.
Comment 4•23 years ago
.eml files are files saved by Mozilla/Netscape 6 Mail. If they are saved by
Mozilla/Netscape 6, they are saved in UTF-8. So what you're seeing
is not a bug but behavior according to the current spec. If you want to see them in the
encoding of the system you are using, you should save them as ".txt" files.
Better yet, use HTML format when saving.
So in essence this is not a browser bug. These people are exposing
saved mail msgs without pointers and they should be told to
include an instruction. My suggestions to eliminate the problem:
1. When you save mail msgs, use HTML format. This should get you the
document encoding tag.
2. Turn on View | Character Coding | Auto-Detect | All.
Auto-detectors normally check for UTF-8 sequences.
It is possible that we could build an automatic UTF-8 check into any
encoding menu item. I wonder whether that is a good idea or a bad idea.
During the Communicator 4.x days, we used to check for UCS-2 on any incoming
data. That turned out to cause some problems, so we restricted
the UCS-2 check to cases where one of the Unicode encodings is
chosen.
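As a rough illustration of what "checking for UTF-8 sequences" could mean, here is a minimal sketch. The function name `looks_like_utf8` is hypothetical and this is not Mozilla's detector; it only shows that strict UTF-8 validation rejects typical windows-1255 or ISO-8859-1 byte streams.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and contains at least one non-ASCII byte.

    Pure ASCII is excluded because it is valid in almost every charset and
    therefore says nothing about which encoding was intended.
    """
    if not any(byte >= 0x80 for byte in raw):
        return False
    try:
        raw.decode("utf-8")  # strict mode: raises on any malformed sequence
    except UnicodeDecodeError:
        return False
    return True


sample = "שלום"  # hypothetical sample text
print(looks_like_utf8(sample.encode("utf-8")))         # valid multi-byte sequences
print(looks_like_utf8(sample.encode("windows-1255")))  # single high bytes, not UTF-8
print(looks_like_utf8(b"plain ASCII"))                 # no evidence either way
```

A check like this is cheap, but short legacy-encoded inputs can occasionally pass it by accident, which may be one reason to be cautious about running it unconditionally.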
Comment 5•23 years ago
> .eml files are files saved by Mozilla/Netscape 6 Mail. If they are saved by
> Mozilla/Netscape 6, they are saved in UTF-8.
I correct myself. I explained this much better in
http://bugzilla.mozilla.org/show_bug.cgi?id=33049#c17
The .eml data is saved as the original RFC 822 data.
I should add one more workaround:
eliminate the .eml extension. You will then be able to see it
as a Windows-1255 file.
> Bug 33049 was resolved as WORKSFORME, but this seems to be a real problem.
Before you do anything, please check with the mail team to see
what consequences there are for changing the current behavior
as summarized in the above quoted comment for parsing .eml files.
Comment 6•23 years ago
If I understand correctly, the problem is that we construct internally a DOM
representation of the message, with the text in UTF-8, but without setting any
charset attribute. I haven't located the code where this happens, but if my
assumptions are right, the fix ought to be trivial (famous last words).
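To make that hypothesis concrete, here is a minimal sketch using Python's standard `email` module rather than the actual Mozilla code, which has not been located. The `render` function, the sample message, and the `label_charset` switch are all hypothetical. The sketch decodes the body with the charset declared in the message, emits UTF-8 HTML, and optionally labels the output so that a consumer does not have to guess.

```python
from email import message_from_bytes
from html import escape

# Hypothetical RFC 822 message whose body is Hebrew text in windows-1255.
RAW = (
    b"Content-Type: text/plain; charset=windows-1255\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"\xf9\xec\xe5\xed\r\n"
)


def render(raw: bytes, label_charset: bool) -> bytes:
    """Convert a message body to UTF-8 HTML, with or without a charset label."""
    msg = message_from_bytes(raw)
    declared = msg.get_content_charset() or "us-ascii"
    body = msg.get_payload(decode=True).decode(declared, errors="replace")
    meta = '<meta charset="utf-8">' if label_charset else ""
    html = f"<html><head>{meta}</head><body><pre>{escape(body)}</pre></body></html>"
    return html.encode("utf-8")


unlabeled = render(RAW, label_charset=False)  # consumer must guess the encoding
labeled = render(RAW, label_charset=True)     # consumer is told it is UTF-8
```

In this sketch the two outputs carry identical UTF-8 text; the only difference is whether the encoding is declared, which is the gap the comment above describes.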
Comment 8•23 years ago
cc Xianglan and marina.
Comment 9•23 years ago
This is totally a mail/charset issue. CC'ing nhotta.
Comment 10•23 years ago
As Kat explained, saving the original RFC822 data in UTF-8 for .eml file
extension is by design. If we add any charset attribute to the file, it won't be
the original RFC822 data anymore. Should we resolve this as WFM then?
QA contact to myself.
Product: Browser → MailNews
QA Contact: ylong → ji
Comment 11•23 years ago
With regard to comment #6 by smontagu, we may be
reusing the mail code for this because of the
.eml extension. CC'ing bienvenu@netscape.com also.
> As Kat explained, saving the original RFC822 data in UTF-8
> for .eml file extension is by design.
My comment in this bug is incorrect. I think I was more
accurate in the original bug smontagu cited above. The data
are saved as the original data, but we use UTF-8 for the internal
representation.
Comment 12•23 years ago
Kat, you're probably right, but I'm not the right person to ask - you might try
e-mailing mscott directly for the definitive answer.
Updated•23 years ago
Status: NEW → ASSIGNED
Comment 13•21 years ago
*** Bug 223225 has been marked as a duplicate of this bug. ***
Comment 14•21 years ago
From dupe: the same bug with news:// and nntp:// URIs
nntp://news.mozilla.org:119/tnhhsv6arys1.dlg@borumat.de
news:news.mozilla.org:119/tnhhsv6arys1.dlg@borumat.de
Summary: When opening a URL with a Hebrew file with the mime type of message/rfc822, mozilla incorrectly detects it as being windows-1255 → Incorrect encoding for mail and news URIs in browser
Updated•21 years ago
Summary: Incorrect encoding for mail and news URIs in browser → Incorrect encoding (charset) for mail and news/nntp URIs in browser
Comment 15•21 years ago
Yes, I'm the one who submitted the duplicated bug 223225.
In that case it shows that the problem is not the *.EML file in itself.
Apparently the same UTF-8 conversion mentioned in comment #4 is also performed
on external links to news articles. Probably the conversion is performed on all
non-webpages displayed in the browser, and comment #6 and comment #11 are
therefore perfectly right.
Comment 16•21 years ago
*** Bug 231524 has been marked as a duplicate of this bug. ***
Comment 17•21 years ago
xref bug 116399
Comment 18•20 years ago
*** Bug 244945 has been marked as a duplicate of this bug. ***
Comment 19•20 years ago
None of the URLs provided in this bug as samples are valid any longer.
Could someone *attach* an actual .eml file that exhibits this problem to the
bug? Remember to give it type: message/rfc822
The file at attachment 11787 [details] (from bug 33049) is pretty peculiar. Loading it in
the browser:
- Autodetect:Universal identifies the charset as Greek (ISO-8859-7).
- Autodetect:Japanese identifies the charset as Shift_JIS, which shows a bunch
of Kanji (or Chinese) mixed with centered-dot characters -- including within the
vCard.
- Forcing an encoding of ISO-2022-JP (the charset specified within the file
itself), the display is all '?'.
- Forcing an encoding of UTF-8, the subject and body appear to be some form of
kana, except in the vCard where the characters appear as '?'.
Comment 20•20 years ago
(In reply to comment #19)
> - Forcing an encoding of UTF-8, the subject and body appear to be some form of
> kana, except in the vCard where the characters appear as '?'.
This needs to be retested, but I believe that that is bug 221631, which has been
fixed since the date of the attachment.
Comment 21•20 years ago
(In reply to comment #20)
> (In reply to comment #19)
> > - Forcing an encoding of UTF-8, the subject and body appear to be some form
> > of kana, except in the vCard where the characters appear as '?'.
>
> This needs to be retested, but I believe that that is bug 221631, which has
> been fixed since the date of the attachment.
The fix there seems to be forcing a default of utf-8 on (some?) vCards -- which
is how Mozilla sends vCards now. The vCard in that attachment has an explicit
2022-JP encoding. Even when displayed in Mail/News, those characters are not
shown correctly, so that problem is unrelated to this bug.
I forgot that attachment 139450 [details], from the bug I filed that was duped to this
one, shows the basic problem. One symptom from that attachment which is not
mentioned here: the 8bit characters which (illegally) are in the Subject header
of that mail display correctly when the browser's encoding is 8859-1 (whereas
the body shows the 8859-1 bytes corresponding to the UTF-8 encoding of the
original 8859-1 characters). Forcing the encoding to UTF-8, the body displays
correctly but the headers are wrong.
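The split between headers and body can be modelled with a short sketch. It assumes, based only on the symptom described above, that raw 8-bit header bytes are passed through unchanged while the body is converted to UTF-8. The subject and body strings are hypothetical and are not taken from attachment 139450.

```python
# Hypothetical content; not taken from the attachment.
subject = "Café"
body = "déjà vu"

# Assumption: the header keeps its raw ISO-8859-1 bytes, the body becomes UTF-8.
header_bytes = b"Subject: " + subject.encode("iso-8859-1")
body_bytes = body.encode("utf-8")
stream = header_bytes + b"\n\n" + body_bytes

# Viewing the mixed stream with each forced encoding:
as_latin1 = stream.decode("iso-8859-1")            # header fine, body garbled
as_utf8 = stream.decode("utf-8", errors="replace")  # header damaged, body fine

print(repr(as_latin1))
print(repr(as_utf8))
```

No single encoding choice can display such a stream correctly, because it contains bytes in two different encodings. With UTF-8 forced, the invalid header byte becomes U+FFFD, the replacement character.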
Comment 22•20 years ago
*** Bug 38109 has been marked as a duplicate of this bug. ***
Updated•20 years ago
Product: MailNews → Core
Comment 23•17 years ago
(In reply to comment #20)
> (In reply to comment #19)
> > - Forcing an encoding of UTF-8, the subject and body appear to be some form of
> > kana, except in the vCard where the characters appear as '?'.
>
> This needs to be retested, but I believe that that is bug 221631, which has been
> fixed since the date of the attachment.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5pre) Gecko/2008031507 SeaMonkey/2.0a1pre
I see Character Encoding: Autodetect -> Universal and UTF-8. The one line of text is identical to the Subject; both look Japanese (including both hiragana and kanji). The vCard includes only ASCII plus a number of black diamonds with white question marks on them.
Comment 24•17 years ago
(In reply to comment #21)
[...]
> I forgot that attachment 139450 [details], from the bug I filed that was duped to this
> one, shows the basic problem. One symptom from that attachment which is not
> mentioned here: the 8bit characters which (illegally) are in the Subject header
> of that mail display correctly when the browser's encoding is 8859-1 (whereas
> the body shows the 8859-1 bytes corresponding to the UTF-8 encoding of the
> original 8859-1 characters). Forcing the encoding to UTF-8, the body displays
> correctly but the headers are wrong.
It is still so using "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b5pre) Gecko/2008031507 SeaMonkey/2.0a1pre":
Autodetect -> Universal and Windows-1252 shows accented characters OK in Subject header and replaced by gibberish in the body. Forcing UTF-8 shows accented characters replaced by black diamonds with white question marks on them in the Subject header and OK in the body.
Assignee | ||
Updated•16 years ago
Product: Core → MailNews Core
Updated•16 years ago
QA Contact: ji → i18n
Updated•4 years ago
Assignee: smontagu → nobody
Status: ASSIGNED → NEW
Comment 26•3 years ago
Probably not. Test cases are no longer available.
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(mkmelin+mozilla)
Resolution: --- → INCOMPLETE