Closed Bug 1713145 Opened 3 years ago Closed 3 years ago

messages.getFull can not decode body of nntp message

Categories

(Thunderbird :: Add-Ons: Extensions API, defect)

Thunderbird 90
defect

Tracking

(thunderbird_esr78 wontfix)

RESOLVED FIXED
91 Branch
Tracking Status
thunderbird_esr78 --- wontfix

People

(Reporter: manikulin, Assigned: TbSync)

Details

Attachments

(2 files, 2 obsolete files)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0

Steps to reproduce:

MimeParser.extractMimeMsg introduced in Bug #1644027 for messages.getFull is unable to properly decode UTF-8 body or message part having quoted-printable or base64 transfer encoding. The issue is specific to nntp messages, e.g. messages from local folders are decoded correctly.

Minimal example using code from implementation of messages.getFull. Inspect output in console after loading of the following extension:

background.js

browser.tst.decode(
`From - Thu May 27 21:23:35 2021
Newsgroups: gmane.comp.mozilla.thundebird.user
From: Bug Reporter <new@thunderbird.bug>
Subject: =?UTF-8?B?zrHOu8+GzqzOss63z4TOvw==?=
Date: Thu, 27 May 2021 21:23:35 +0100
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8;
Content-Transfer-Encoding: base64

zobOu8+GzrEK
`).then(msg => {
	console.log(msg);
	console.log(msg.headers['subject']);
	console.log(msg.parts[0].body);
}).catch(console.error);

experiment.js

var { MimeParser } = ChromeUtils.import("resource:///modules/mimeParser.jsm");

function convertMessagePart(part) {
  let partObject = {};
  for (let key of [
    "body",
    "contentType",
    "headers",
    "name",
    "partName",
    "size",
  ]) {
    if (key in part) {
      partObject[key] = part[key];
    }
  }
  if ("parts" in part && Array.isArray(part.parts) && part.parts.length > 0) {
    partObject.parts = part.parts.map(convertMessagePart);
  }
  return partObject;
}
var tst = class extends ExtensionCommon.ExtensionAPI {
	getAPI(context) {
		return { tst: {
			async decode(raw) {
				let mimeMsg = {};
				try {
					mimeMsg = MimeParser.extractMimeMsg(raw, {
						includeAttachments: false,
					});
				} catch (e) {
					console.log(MimeParser);
					console.error(e);
					throw new Error('Message decode error');
				}
				return convertMessagePart(mimeMsg);
			}
		} };
	}
};

schema.json

[ {
	"namespace": "tst",
	"functions": [ {
		"name": "decode",
		"type": "function",
		"async": true,
		"parameters": [ { "name": "raw", "type": "string" } ]
	} ]
} ]

manifest.json

{
	"manifest_version": 2,
	"name": "getFull decoder test",
	"version": "0.1",
	"background": { "scripts": [ "background.js" ] },
	"experiment_apis": { "tst": {
		"schema": "schema.json",
		"parent": {
			"scopes": [ "addon_parent" ],
			"paths": [ [ "tst" ] ],
			"script": "experiment.js"
		}
	} }
}

Actual results:

  • Headers (Subject) are decoded correctly.
  • Garbage instead of body text. Representation is different in object inspector (msg) and for body printed by console.log.

Expected results:

Body: "Άλφα"

@max: Do you know how to download binaries from the artifact section of "B" tests of the try run? For Windows it is the zip file:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/IEp0TznQRu2uL_AhqKS1yw/runs/0/artifacts/public/build/target.zip

Could you give that version a spin and check if this bug is fixed and if there are more flaws I have missed?

If it is not Windows, what OS are you running, so I can give you the correct download link?

I hope, I will try it tomorrow. I am a linux user. I noticed the problem when I tried a couple of messages from news.gmane.io. There are plenty of groups with messages where ASCII is not enough. Try e.g. gmane.linux.debian.user.russian or similar one with a more familiar language.

Assignee: nobody → john

Thank you, most messages, I have tried, are working correctly. While I was looking for a message in legacy 8 bit encoding, I found some messages with missed body in getFull result. It may happen even with messages having UTF-8 encoding. Try e.g. X70U3LO/zYt2yoRe@mail.ioffe.ru on news.gmane.io and other messages by the same author.

I have not figured out what can be wrong with such messages (or why they are special ones).

Can you give me the name of the newsgroup and a title so I can find this example message?

thunderbird 'news://news.gmane.io/X70U3LO/zYt2yoRe@mail.ioffe.ru'

I can not say that it works reliably, but first time when thunderbird is already opened, the command opens the message. It is gmane.linux.debian.user.russian.

It seems, the problem is caused by Content-Disposition: inline. Other headers may be quite usual

Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

It is the Content-Disposition: inline header. For some reason the emitter is copying many headers from the main message to the first mime part and thus the body is seen as an attachment, even though it is not. Investigating. Nevertheless, setting content-disposition on the main message seems invalid, but I think WE (Thunderbird) did that wrong, if I understad the wiki article correct:
https://en.wikipedia.org/wiki/MIME#Content-Disposition

Attachment #9225138 - Attachment is obsolete: true

(In reply to John Bieling (:TbSync) from comment #8)

It is the Content-Disposition: inline header. For some reason the emitter is copying many headers from the main message to the first mime part and thus the body is seen as an attachment, even though it is not. Investigating. Nevertheless, setting content-disposition on the main message seems invalid, but I think WE (Thunderbird) did that wrong, if I understad the wiki article correct:
https://en.wikipedia.org/wiki/MIME#Content-Disposition

I am not familiar with related RFC's or de-facto practice. I think, it should be consisted within thunderbird. If the body is shown in message window then extensions should get it as well.

Recently I was looking for a way to view patches sent as attachments in thunderbird without launching an external viewer (maybe it is not a fool-proof way, but acceptable implementation was done in Bug #1509709). Before I found that preference, I noticed a couple of bugs related to Content-Disposition issues: Bug #184869 and Bug #182627 (the latter have complains added after it was closed). Unsure if they are relevant to the case found while I was testing your patch.

Please, consider adding a couple of tests with legacy 8-bit encodings in addition to UTF-8. It seems, they work.
Expectations: {body: "Вопрос\n", subject: [ "Алфавит" ]}

From - Sun May 27 21:23:35 2001
Newsgroups: gmane.comp.mozilla.thundebird.user
From: Bug Reporter <new@thunderbird.bug>
Subject: =?koi8-r?B?4czGwdfJ1Ao=?=
Date: Sun, 27 May 2001 21:23:35 +0100
MIME-Version: 1.0
Content-Type: text/plain; charset=koi8-r;
Content-Transfer-Encoding: base64

98/Q0s/TCg==
From - Sun May 27 21:23:35 2001
Newsgroups: gmane.comp.mozilla.thundebird.user
From: Bug Reporter <new@thunderbird.bug>
Subject: =?windows-1251?B?wOv04OLo8go=?=
Date: Sun, 27 May 2001 21:23:35 +0100
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1251;
Content-Transfer-Encoding: base64

wu7v8O7xCg==
Attachment #9225258 - Attachment is obsolete: true

I found more issues and updated the patch/tests. Try run is here:
https://treeherder.mozilla.org/jobs?repo=try-comm-central&revision=196e3f512c528e4d4c611f33a529fed73f9f3fee

If that is through, you could give it another spin.

Pushed by alessandro@thunderbird.net:
https://hg.mozilla.org/comm-central/rev/9473904d2047
Fix MimeParser.extractMimeMsg() to return the same headers as MsgHdrToMimeMessage, properly return inline body parts and fix multiple decoding issues in the messages.getFull WebExtension API. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Is it correct that MimeParser.extractMimeMsg() will now return the MIME parts as unicode instead of raw binary data in the original encoding? See:
https://hg.mozilla.org/comm-central/rev/9473904d2047fc7d42476625eb8f4cf8d5a6f11f#l12.408 ?

This will invalidate the proposed solution in bug 1713786, attachment 9224827 [details] [diff] [review], where I relied on the raw data to be returned so it can be used for charset detection. Or am I reading this incorrectly?

BTW, what does https://hg.mozilla.org/comm-central/rev/9473904d2047fc7d42476625eb8f4cf8d5a6f11f#l12.224 do? Looks like you're encoding header values into UTF-8 now. Why is that necessary? Maybe for headers that where not ASCII/RFC 2047, so they're now safely are made UTF-8 since you can have raw UTF-8 in headers according to RFC 6532.

Flags: needinfo?(john)

MimeParser.extractMimeMsg() behaves now identical to the core libmime based mime parser used for imap and local messages. The API messages.getFull() now returns the same values for local, imap and nntp. The only remaining difference is that local/imap headers eat the TAB after a CLRF of a multiline header, while nntp headers do not. I opened bug 1715833 for that.

Is it correct that MimeParser.extractMimeMsg() will now return the MIME parts as unicode instead of raw binary data in the original encoding? See:
https://hg.mozilla.org/comm-central/rev/9473904d2047fc7d42476625eb8f4cf8d5a6f11f#l12.408 ?

That is only used internally and was not exposed to the final API. The binary string was later converted, but as it turned out, body parts got mangled. If you search for the enigmail code this is based on, you will find they used unicode there as well. It was a mistake to switch to binarystring, but at that time I did not know any better.

This will invalidate the proposed solution in bug 1713786, attachment 9224827 [details] [diff] [review] [diff] [review], where I relied on the raw data to be returned > so it can be used for charset detection. Or am I reading this incorrectly?

messages.getFull() does not (and should not) return a binary string, but properly decoded data, according to the parts charset. I did not invent this, I just aligned the output of nntp messages and imap/local messages. See the added tests which have to be passed for all three account types:
https://searchfox.org/comm-central/rev/8a77f2ff2762bbe436133205d6b98561f3bc81d1/mail/components/extensions/test/xpcshell/test_ext_messages_get.js#163-642

BTW, what does https://hg.mozilla.org/comm-central/rev/9473904d2047fc7d42476625eb8f4cf8d5a6f11f#l12.224 do? Looks like you're encoding
header values into UTF-8 now. Why is that necessary? Maybe for headers that where not ASCII/RFC 2047, so they're now safely are made UTF-8
since you can have raw UTF-8 in headers according to RFC 6532.

TextEncoder is used to turn the header into a binary string, as this is what the original mime parser was doing (internally). Again, I did not invent this. I just alligned the new MimeParser.extractMimeMsg() mime parser with the already existing one.

Flags: needinfo?(john)

Thank you. I have not tried hard it, but a dozen of messages are decoded properly (including "Content-Disposition: inline").

I noticed the following behavior however, I hope, it is expected

  1. Content-Type: multipart/alternative txt+html (e.g. from gmail) have their parts in parts[0].parts, not in the top-level parts array.
  2. Message generated by Alpine has Content-Type: multipart/mixed and

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--8323329-1640320710-1596988834=:40294
Content-Type: text/plain; CHARSET=UTF-8; format=flowed
Content-Transfer-Encoding: 8BIT

I can not find "This message is in MIME format..." in getFull() result. news://news.gmane.io/alpine.DEB.2.23.453.2008091859370.40294@jura104.jinr.ru Unsure if I really need it, I am just slightly afraid that some parts of complex messages are unavailable.

This is expected, the core parser is doing the same. I updated the test files to ensure identical behavior.

Target Milestone: --- → 91 Branch

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/cdf8bc4921de
Also test content at root part of multipart/alternative emails for messages.getFull API. r=mkmelin

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: