Closed Bug 253807 Opened 20 years ago Closed 20 years ago

RSS preview and subjects are always UTF-8

Categories

(MailNews Core :: Feed Reader, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rowla, Assigned: mscott)

References

Details

(Keywords: fixed-aviary1.0)

Attachments

(7 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040731 Firefox/0.9.1+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040731 Firefox/0.9.1+

Characters in the preview window and subjects are all "?" because of the "Content-Type: text/html; charset=UTF-8" header, even when the site uses a different charset such as euc-jp.

Reproducible: Always

Steps to Reproduce:
1. Subscribe to RSS http://slashdot.jp/slashdotjp.rss
2. Update feed.

Actual Results: Characters are all "?".

Expected Results: Thunderbird should auto-detect the charset used in the feed.
I don't know enough about what's going on here to be sure. But if I look at the XML generated for: http://slashdot.jp/slashdotjp.rss I see the following right at the top: <?xml version="1.0" encoding="utf-8"?> specifying that the feed is utf-8. We honor that and treat the parsed out feeds as being utf-8...
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Target Milestone: --- → Thunderbird0.8
Another question for you. Not having a Japanese system, I need to ask: does the name of the folder in the folder pane for this feed look correct? i.e. is the problem just the subject and the feed contents, or does it also include the name of the feed folder in the folder pane?
(In reply to comment #2)
> Does the name of the folder in the folder pane for this feed look correct?
>
> i.e. is the problem just the subject and the feed contents or does it also
> include the name of the feed folder in the folder pane?

The names in the folder pane are correct. The problem is just the subject and the feed contents. As you pointed out, the example RDF is in UTF-8, so that was a bad example. The RDF is in UTF-8 but the site itself is in euc-jp. I should have mentioned this.
Found a good example: http://rebecca.ac/milano/mt/index.rdf This RDF file declares: <?xml version="1.0" encoding="EUC-JP"?> and the site is written with: <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" /> but the header of the feed message is "Content-Type: text/html; charset=UTF-8".
Attached image screen shot of my attempted fix (deleted) —
I don't have the ability to read Japanese fonts. Ayumi, here's a screen shot of the message list pane with some changes I made to try to MIME encode the subject as EUC-JP. Can you tell me if the Subject column in this screen shot looks correct, or is it impossible to tell since I don't have a Japanese font installed?
This patch does several things:
1) Extracts the document charset from the XML document for the RSS feed.
2) MIME encodes the subject in the aforementioned character set before writing the subject to the mail folder.
3) Sets the charset parameter on the Content-Type header to match the aforementioned character set instead of hard coding a value of UTF-8.

What works with this patch:
1) Subject values for sites that use 8-bit ASCII such as:
http://www.heise.de/newsticker/heise.rdf
http://photodb.kicker.de/library/rss091/kicker.xml
These sites now have the correct subject values show up in the thread pane and the message pane.
2) I need a Japanese user to confirm from the screen shot I posted that this patch also fixes the subject values.

Remaining issues:
When the RSS article is just an iframe link to the website, the contents of the iframe are not getting decoded properly. For both the ASCII with accented characters case and the JA case, I don't think the body is rendering correctly. I don't know why yet. Maybe we need to explicitly set a charset on the iframe?
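For readers following along, the three steps the patch description lists can be sketched roughly as below. This is a Python illustration only, not the actual patch (which lives in the Thunderbird feed code); the names `declared_charset` and `build_headers` and the regular expression are assumptions made for the example.

```python
import re
from email.header import Header, decode_header

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of a document. Hypothetical helper pattern, not taken from the patch.
XML_DECL_RE = re.compile(
    rb'''^\s*<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']'''
)


def declared_charset(feed_bytes: bytes, default: str = "utf-8") -> str:
    """Step 1: read the charset the feed document declares for itself."""
    match = XML_DECL_RE.match(feed_bytes)
    return match.group(1).decode("ascii") if match else default


def build_headers(feed_bytes: bytes, title: str) -> dict:
    """Steps 2 and 3: MIME encode the subject and label the body charset."""
    charset = declared_charset(feed_bytes)
    return {
        # RFC 2047 encoded-word in the feed's own charset
        "Subject": Header(title, charset).encode(),
        # charset parameter follows the feed instead of a hard-coded UTF-8
        "Content-Type": f"text/html; charset={charset}",
    }


feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'.encode("iso-8859-1")
headers = build_headers(feed, "Größe")
```

The sketch trusts the XML declaration completely, so a feed that advertises the wrong encoding would still produce a wrong subject.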
(In reply to comment #5)
> Created an attachment (id=155038)
> screen shot of my attempted fix
>
> I don't have the ability to read Japanese fonts. Ayumi, here's a screen shot of
> the message list pane with some changes I made to try to mime encode the
> subject as EUC_JP. Can you tell me if this screen shot looks correct (with the
> Subject) or is it impossible to tell since I don't have a japanese font
> installed?

It is impossible to tell from the screenshot, maybe because you don't have a Japanese font. I'll build Thunderbird with your patch and test it.
Hmm, fixing the iframe problem is going to be really hard. In fact, right now I don't know how to fix it. The message pane always shows UTF-8 for the gecko instance running inside of it. When we display a message, we check the charset attribute and convert the entire message body from the specified charset to UTF-8 before rendering the text. This works great except for RSS.

For many feed articles the message body is just an iframe link: <iframe src="some feed article url"></iframe> We convert that from the appropriate charset to UTF-8, which gives us back: <iframe src="some feed article url"></iframe> and layout renders the contents of the iframe, which are treated as UTF-8 because we've told layout our message has been converted to UTF-8. I don't know how we can convert data we don't actually have to UTF-8 before layout renders it.
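A small Python illustration of why the conversion step cannot help here. Converting the iframe markup is a no-op because the markup is plain ASCII, while the article bytes that arrive later are still in the site's charset. The URL and the sample text are placeholders, not data from this bug.

```python
# The stored message body: pure ASCII markup, labelled as EUC-JP.
# example.invalid is a placeholder URL used only for this illustration.
body = b'<iframe src="http://example.invalid/article"></iframe>'

# "Convert the body from its charset to UTF-8": nothing changes, because
# every byte of the markup is ASCII in both encodings.
converted = body.decode("euc_jp").encode("utf-8")
unchanged = converted == body

# The article behind the iframe is fetched later; its bytes are still EUC-JP.
article_text = "日本語"
article_bytes = article_text.encode("euc_jp")

# Rendering those bytes as if they were UTF-8 produces garbage.
rendered_as_utf8 = article_bytes.decode("utf-8", errors="replace")
```

Here `unchanged` is true and `rendered_as_utf8` no longer matches `article_text`, which mirrors the symptom described in the comment.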
About the original feed that started this discussion: http://slashdot.jp/slashdotjp.rss I'm not sure what to do here for the subjects. The RSS document says it's a UTF-8 document, so we try to convert the values of the title field as UTF-8. However, that fails because the titles are really in EUC-JP. Seems like a problem with the feed not really being encoded in the charset it advertises...
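One possible way to cope with a feed that advertises the wrong charset is to decode strictly with the declared charset first and only then try a short list of fallbacks. The sketch below is a heuristic for illustration, not something any patch on this bug implements; `decode_title` and the fallback list are assumptions, and a fallback can guess wrong on short or ambiguous input.

```python
from typing import Sequence, Tuple


def decode_title(
    raw: bytes,
    declared: str,
    fallbacks: Sequence[str] = ("euc_jp", "shift_jis"),
) -> Tuple[str, str]:
    """Return (text, charset_used), preferring the declared charset."""
    for charset in (declared, *fallbacks):
        try:
            # Strict decoding: any invalid byte sequence rejects this charset.
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: never fail, but mark undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8"


# A title that is really EUC-JP inside a feed that claims to be UTF-8.
mislabelled = "日本語".encode("euc_jp")
text, used = decode_title(mislabelled, "utf-8")
```

With these inputs the strict UTF-8 attempt fails and the EUC-JP fallback recovers the title.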
Attached image screenshot (deleted) —
> 2) I need a japanese user to confirm from the screen shot I posted that this
> patch also fixes the subject values.

My test build is finished. However, I'm still seeing "?" in the subject and preview. While I was comparing the patched and unpatched versions, I noticed one place where Japanese letters are rendered correctly in both cases (marked on the screenshot).
I've gone ahead and checked my patch into the aviary 1.0 branch. This addresses the subject issues for 8-bit ASCII characters. Now we need to figure out what's going on with more complicated charsets like EUC-JP for the subject. I suspect I'm never getting the correct characters from the JavaScript when I try to MIME encode the header. I see lots of weird JavaScript string assertions about data being lost before the MIME encoding code ever gets called.
RSS articles that are iframes whose src attribute points back to the website article aren't getting the correct document charset conversion on the iframe contents. This was occurring because mailnews forces a charset value of UTF-8 on the message pane, since libmime converts message bodies from their native charsets to UTF-8 before giving the data to gecko. This fix attempts to get around that problem by setting a default character set on the message pane docshell instead of a forced charset. By using the default charset method, the nested iframe is no longer forced to use the UTF-8 charset of the outer frame (the message pane).

This patch fixes the RSS problem. However, I'm quite concerned it may break display of I18n message bodies for regular mail and news articles. My limited tests showed that it didn't break rendering of mail messages, but I need more testing help. Ayumi, up for another patch to test?
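The difference between a forced charset and a default charset can be pictured with a toy model. This is not Gecko's docshell API; the `Frame` class and its fields are invented for illustration, and the precedence rule (forced on any ancestor, then the document's own declaration, then an inherited default) is an assumption drawn from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """Toy stand-in for a frame; all fields are hypothetical."""
    declared: Optional[str] = None   # charset the loaded document declares
    forced: Optional[str] = None     # override that also binds child frames
    default: Optional[str] = None    # used only when nothing is declared
    parent: Optional["Frame"] = None

    def _inherited(self, attr: str) -> Optional[str]:
        # Walk from this frame up through its ancestors.
        frame = self
        while frame is not None:
            value = getattr(frame, attr)
            if value:
                return value
            frame = frame.parent
        return None

    def effective_charset(self) -> str:
        return (
            self._inherited("forced")
            or self.declared
            or self._inherited("default")
            or "utf-8"
        )


# Before: the message pane forces UTF-8, so the nested article is misread.
before = Frame(declared="euc-jp", parent=Frame(forced="utf-8"))
# After: UTF-8 is only a default, so the article's own charset wins.
after = Frame(declared="euc-jp", parent=Frame(default="utf-8"))
```

In this model `before.effective_charset()` yields the outer frame's UTF-8, while `after.effective_charset()` yields the article's own EUC-JP.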
Oh, in case I wasn't clear, the 2004-08-03 patch addresses the RSS body ONLY. It doesn't address Japanese characters in the subject fields still looking incorrect...
Note that three out of the four feeds mentioned above have problems with their character encoding - you can check it with http://feedvalidator.org/. The reason that they may be displayed as intended anyway is probably due to bug 247024.
(In reply to comment #14)
> Oh in case I wasn't clear, the 2004-08-03 patch addresses the RSS body ONLY. It
> doesn't address japanese characters in the subject fields still looking incorrect...

I've just finished building Thunderbird with your patch, and it is working.
(In reply to comment #16)
> incorrect...
>
> I've just finished building Thunderbird with your patch, and it is working.

Did I read this right? Do you mean that with both the RSS body patch and the patch to MIME encode the subject, you are now seeing both the subjects and the body look correct? As in this bug is fixed with those two patches? :) Woo Hoo!
> As in this bug is fixed with those two patches? :) Woo Hoo! No, no. As you mentioned, the subject line is still "?". I meant the preview is working.
*** Bug 254424 has been marked as a duplicate of this bug. ***
I think this patch gets us a lot closer, and it fixes a nasty regression that attachment #155147 introduced. Ayumi, can you try this patch and see if it works for the subject? Note: You must re-download the headers with a build that has this latest patch to test if it really fixed things. If the feed article was already downloaded using a bad build then the header will still look wrong. So make sure you delete the RSS feed and then add it again...
I just checked the 08/05 patch into the branch to get more testing. The FeedItem.js change that converts the JS unicode string back to a char * in the original charset is something we definitely want even if the problem isn't fully fixed. This gets rid of some really nasty xpconnect errors I was seeing when xpconnect tried to pass the unicode string into nsIMsgLocalFolder::AddMessage as a char * without converting the string properly.
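As a rough picture of what "converting the unicode string back to the original charset" buys, compare an explicit encode with a naive narrowing of each character to one byte. Both helpers are hypothetical; the narrowing function only illustrates a lossy conversion in general and is not a claim about what xpconnect actually did.

```python
def to_native_bytes(text: str, charset: str) -> bytes:
    """Encode a Unicode string into the feed's original charset."""
    return text.encode(charset, errors="replace")


def naive_narrow(text: str) -> bytes:
    """Illustrative only: keep the low byte of each code point, losing data."""
    return bytes(ord(ch) & 0xFF for ch in text)


title = "日本語"
native = to_native_bytes(title, "euc_jp")   # round-trips through EUC-JP
lossy = naive_narrow(title)                 # cannot be turned back into title
```

The explicitly encoded bytes decode back to the original title, while the narrowed bytes have already lost half of each character.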
The patch is working in 'some' cases.
o http://slashdot.jp/slashdotjp.rss -- working
o http://rebecca.ac/milano/mt/index.rdf -- not working
Mozilla Thunderbird 0.7+ (Windows/20040807)
Most pages are working for me, but some pages are not. For the following sites, both the message pane and the preview pane are broken:
http://blog.bulknews.net/mt/index.rdf
http://naoya.dyndns.org/~naoya/mt/index.rdf
Another RSS that doesn't work as expected: http://www.pc-magazin.de/rss/all IMHO Thunderbird should use the default charset for mails if no charset is specified in the rss feed.
Moving the remaining work for this bug to 0.9. The initial patches for this bug have fixed a lot of issues. Most remaining issues are with sites that list the wrong encoding but there are still some sites that just look wrong even though they do use the right encoding. I don't know why that is yet.
Target Milestone: Thunderbird0.8 → Thunderbird0.9
> Does the name of the folder in the folder pane for this feed look correct?
>
> i.e. is the problem just the subject and the feed contents or does it also
> include the name of the feed folder in the folder pane?

No, the filename is bad. Look at http://www.vaclavak.net/weblog/weblog.xml, which is in ISO-8859-2 encoding. The problem is not only with subjects, but also with the name of the feed (and the name of the folder and the filename). It creates the strange filenames "Weblog V&#195;&#165;clav&#195;&#165;k" and "Weblog V&#195;&#165;clav&#195;&#165;k.msf", but "Weblog V&#195;&#165;clav&#195;&#165;k" is empty and the mailbox is stored in the files "f85ec1e9" and "f85ec1e9.msf". As a result, a second "Weblog V&#195;&#165;clav&#195;&#165;k" folder is created in TB. There are also problems when you drag and drop this folder (the dropped folder loses its contents). It seems to me that TB creates the bad filename "Weblog V&#195;&#165;clav&#195;&#165;k", and sometimes writes data to "f85ec1e9" and sometimes tries to read from "Weblog V&#195;&#165;clav&#195;&#165;k" (which is empty). Screenshots are coming...
Added screenshot
Attached image Screenshot of 2 created mailboxes (deleted) —
Added screenshot
Blocks: 259227
Depends on: 258447
xref: bug #264071. I now see that some of the things from my comment #26 (> and sometimes write data to the "f85ec1e9") were reported in bug #264071.
I don't currently have any pending issues here for 0.9. But these fixes do need to be migrated to the trunk. Leaving open for that.
Target Milestone: Thunderbird0.9 → Thunderbird1.1
adding fixed-aviary1.0 per comment 30 (also to help in our queries).
Keywords: fixed-aviary1.0
checked into the trunk too.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Component: RSS → Feed Reader
Product: Thunderbird → MailNews Core
Target Milestone: Thunderbird1.1 → ---