Closed Bug 136055 Opened 23 years ago Closed 18 years ago

Filter/Search on Body erroneously applied to encoded binary attachments

Categories

(MailNews Core :: Filters, defect)

x86
Windows 98
defect
Not set
major

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 37031

People

(Reporter: dmitry, Assigned: sspitzer)

References

Details

If I have a filter rule Body contains "porn" then a legitimate message with a binary attachment gets filtered if the attachment happens to contain ... 1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI ...
related: bug 67421
Severity: critical → major
also related: bug 98141
*** Bug 153973 has been marked as a duplicate of this bug. ***
On the other hand, if the message is multipart/mixed and its html part is encoded as base64, filters of the form "body contains string" do not apply to it. I think the html part of a message should be considered its "body" for filtering purposes, to be decoded if necessary and fed through the "body" filters.
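To illustrate the behaviour asked for in this comment and in the original report, here is a minimal sketch in Python using the standard `email` package. It is not how the MailNews code works; the function name `text_parts_contain` and the two sample messages are made up for illustration. It searches only `text/*` parts, decodes them first, and never looks inside binary parts.

```python
import base64
from email import message_from_string
from email.policy import default


def text_parts_contain(raw_message: str, needle: str) -> bool:
    """Return True if any decoded text/* part contains needle (case-insensitive)."""
    msg = message_from_string(raw_message, policy=default)
    needle = needle.lower()
    for part in msg.walk():
        # Containers (multipart/*, message/rfc822) carry no text of their own.
        if part.is_multipart():
            continue
        # Binary parts are neither decoded nor searched.
        if part.get_content_maintype() != "text":
            continue
        # get_content() undoes base64/quoted-printable and applies the charset.
        if needle in part.get_content().lower():
            return True
    return False


# Hypothetical message: plain text plus a GIF whose base64 happens to contain "PorN".
SAMPLE = "\n".join([
    "From: sender@example.com",
    "Subject: holiday photo",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="BOUNDARY"',
    "",
    "--BOUNDARY",
    "Content-Type: text/plain; charset=us-ascii",
    "",
    "Here is the photo from our trip.",
    "--BOUNDARY",
    "Content-Type: image/gif",
    "Content-Transfer-Encoding: base64",
    "",
    "1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI",
    "--BOUNDARY--",
    "",
])

# Hypothetical message whose HTML body is base64-encoded.
ENCODED_HTML = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: text/html; charset=us-ascii",
    "Content-Transfer-Encoding: base64",
    "",
    base64.b64encode(b"<p>cheap porn</p>").decode("ascii"),
    "",
])
```

A naive case-insensitive search over the raw source of `SAMPLE` hits the encoded GIF data, while `text_parts_contain(SAMPLE, "porn")` does not; for `ENCODED_HTML` the match is found only because the part is decoded before searching.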
*** Bug 166573 has been marked as a duplicate of this bug. ***
*** Bug 159645 has been marked as a duplicate of this bug. ***
*** Bug 181418 has been marked as a duplicate of this bug. ***
Confirmed. Voted for. Suggest changing OS to All. This is quite annoying to me also. I have a filter called "porn mail" that checks whether the body contains "sex", "porn", "farm", etc. I use match "any of these words" because "match all words" doesn't work well for this application. There is therefore no way to set up filters like these without making each filter only 1 word + "and only if message doesn't have attachment" (because I would have to use match all). Filters would be greatly improved if this bug were fixed.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Wouldn't a rule "body doesn't contain 'Content-Transfer-Encoding: base64'" (or any other way of detecting attachments; this was the best I could think of just now) solve this?
Well, try it yourself. I don't think that will work though, because you would have to use "match all", instead of "match any of the below rules". Using match all would work, but it doesn't solve the problem. It's just a very awkward workaround. You cannot stop the filters from being applied to a message's attachments. Using match all extremely limits the usefulness of making filters, unless you want to make one filter per word to match. Correct me if I'm wrong.
Well, I have a rule set up that moves Klez to a Viruses folder using a "Body contains <first line of encoded Klez>" filter, so if this bug were fixed that would stop working.
mass re-assign.
Assignee: naving → sspitzer
Maybe offer both a "Body" criterion and a "text-only body" criterion that looks only at the parts with "Content-Type: text/anything". For example, if the reporter of this bug set the rule "text-only body" matches any of "porn" or "xxx", it would not trigger the filter on a message that has a GIF containing ... 1jQNPorNpERB6fvelBUi1+XqGieb7gKwd8asCQNRAO7Uf6AINv/A/+DgH3DW/gJXyejoLLjVRSpI ... because it would look only at text/plain, text/html, and other text/* parts, while Justin Kerk from comment #11 would still be able to set "body" matches <first line of encoded Klez>.
A workaround that worked for me and a few others: I get probably 30 or 40 junk emails a day, and I have maybe 30 or 40 people in my address book, plus those that I confer with daily. Since most of them are on the same mail domain, I set up something like this: if sender does not contain <specified domain>, then move to trash; if sender does not contain <specified email> or <specified email>, etc. I find that filtering for the emails I do want, instead of those I don't want, works faster and is much less work. I would still like an attachment filter, though. I currently use 3 rules that explicitly match different parts of the header info with the AND feature, and I can catch 90% of them, but a custom criterion named "attachment" would be nice.
This bug is still present in current builds, e.g. Thunderbird 0.7.*. I would really appreciate it if it were fixed before 1.0, because it has been bugging me since Netscape 6.0. I have a filter set up to move all emails that contain the word "yps" to a separate folder, but it works so badly that almost any mail with an attachment is moved.
*** Bug 267230 has been marked as a duplicate of this bug. ***
That duplicate notes that the problem exists for UUencoded attachments, as well as MIME ones.
Product: MailNews → Core
*** Bug 272042 has been marked as a duplicate of this bug. ***
Summary: Filters erroneously apply to encoded binary attachments → Filter/Search on Body erroneously applied to encoded binary attachments
This is a big problem, and while one result of the bug is talked about frequently (the fact that the wrong messages are matched with a search/filter), another problem is mostly overlooked, but arguably even more important: a full body search takes extremely long to complete in folders with many attachments, since those attachments can easily be megabytes in size, while the text in the message (that should be searched) is perhaps only a few percent of that. So, the search could easily be made 95% quicker or so in most situations, where people send a few pictures or other documents every now and then. I really don't understand that it takes years to fix this bug: i thought open-source actually meant that things get fixed quickly, but this really makes me lose confidence in this process and this product.
*** Bug 282682 has been marked as a duplicate of this bug. ***
I think this is a dup of bug #132340
(In reply to comment #21)
> I think this is a dup of bug #132340

Not exactly -- there we *do* want to search the body after decoding; here, we not only don't want to search within (binary) attachments, we don't even want to decode them in the first place (during search).
(In reply to comment #22)
> (In reply to comment #21)
> > I think this is a dup of bug #132340
>
> Not exactly -- there we *do* want to search the body after decoding; here, we
> not only don't want to search within (binary) attachments, we don't even want to
> decode them in the first place (during search).

Ah, ok, that makes sense. But in light of my efforts to unify the junk filter and regular filters, I think we need to make this difference explicit in the UI. E.g., there should be separate criteria:

Body Text <contains/etc>   <-- that is, only search the plaintext
Attachment <contains/etc>  <-- only search the attachments, decode as necessary
Body <contains/etc>        <-- everything

and some other bug reports have also requested All Headers and Entire Message as criteria scopes. That would probably cover all the bases.
(In reply to comment #23)
> E.g., there should be separate criteria:
> Body Text <contains/etc>   <-- that is, only search the plaintext
> Attachment <contains/etc>  <-- only search the attachments, decode as necessary
> Body <contains/etc>        <-- everything

After re-reading some more, that doesn't seem to really cover it completely. It helps to be able to decide which portions of the message to filter. If the portion you choose is encoded, it should always be decoded. (And the spam filter will always operate on the whole message, I didn't need to mention that here.)
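The three proposed criteria can be sketched roughly as follows. This is a Python illustration only, not a proposal for the actual MailNews code: the `Scope` enum, `part_text`, `matches`, and the sample message are hypothetical names, and treating a part as an attachment only when it carries `Content-Disposition: attachment` is an assumption. Whichever portion is selected gets decoded before it is searched.

```python
import base64
from email import message_from_string
from email.policy import default
from enum import Enum


class Scope(Enum):
    BODY_TEXT = "Body Text"    # only inline text parts
    ATTACHMENT = "Attachment"  # only attachments, decoded as necessary
    BODY = "Body"              # everything


def part_text(part) -> str:
    """Decode one leaf part into searchable text."""
    if part.get_content_maintype() == "text":
        return part.get_content()
    # Assumption: binary data is searched byte-for-byte via latin-1.
    return (part.get_payload(decode=True) or b"").decode("latin-1")


def matches(raw_message: str, needle: str, scope: Scope) -> bool:
    msg = message_from_string(raw_message, policy=default)
    needle = needle.lower()
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no content of their own
        attached = part.is_attachment()
        is_text = part.get_content_maintype() == "text"
        if scope is Scope.BODY_TEXT and (attached or not is_text):
            continue
        if scope is Scope.ATTACHMENT and not attached:
            continue
        if needle in part_text(part).lower():
            return True
    return False


# Hypothetical message: an inline note plus a base64-encoded text attachment.
MESSAGE = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="BOUNDARY"',
    "",
    "--BOUNDARY",
    "Content-Type: text/plain; charset=us-ascii",
    "",
    "See the attached notes.",
    "--BOUNDARY",
    "Content-Type: text/plain; charset=us-ascii",
    'Content-Disposition: attachment; filename="notes.txt"',
    "Content-Transfer-Encoding: base64",
    "",
    base64.b64encode(b"quarterly budget figures").decode("ascii"),
    "--BOUNDARY--",
    "",
])
```

With this sample, a search for "budget" matches under `Scope.ATTACHMENT` and `Scope.BODY` but not under `Scope.BODY_TEXT`. Parts inside an attached message/rfc822 are not treated specially here.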
(In reply to comment #24)
> If the portion you choose is encoded, it should always be decoded. (And the
> spam filter will always operate on the whole message, I didn't need to
> mention that here.)

I think that's all fine; but I wonder if it's ever necessary to perform a text search within a binary attachment. As an example, suppose you regularly get messages with text/html attachments, and others with (large) image/jpeg. If you're searching "attachment body" for a string that you expect in some html files, you don't need to spend the time decoding the JPEG and then performing a probably-fruitless search in that file's data; even if you *did* get a match (on a short string, presumably, like that in the original bug report here), it would probably be a false positive.

Generally, I would prefer that "attachment body" searching was limited to text attachments (including message/rfc822 attachments). On the other hand, there are Word and PDF and Postscript docs which are in fact mostly text that might well be a good target for filtering. But there are many, many JPEGs, MP3s and the like being mailed around these days which do not seem like good targets for text filtering.
Can anyone explain why it takes > 3 years to fix something simple like this? I'm still getting the wrong search results, and my searches take way too long because it's searching through attachments unnecessarily as i explained earlier. It shouldn't take one programmer more than a day or so to make sure that attachments aren't searched. Or should it?
LOL, if it was that easy, yes, it would be done. Body search has no idea about the mime structure of a message, and teaching it about mime would be non-trivial.
But surely it shouldn't have to take over 3 years?! I mean, MIME isn't rocket science.. just follow the specs from the corresponding RFC document. You don't even have to decode anything, just ignore the binary parts. Have a look at 'view source' in an email message: all that basically needs to be done is search through everything below "Content-Type: text/plain", and "Content-Type: text/html", and ignore the rest. It's rather trivial.

If the reason that this isn't done is that Thunderbird basically isn't being supported anymore by the developer community, then so be it, but in that case it would be nice to put some sort of note on the main page - like: "this product is no longer being supported or improved", so potential users don't make the mistake to download this program expecting broken things to be fixed within a reasonable time span. Or am i basically just expecting too much? How do other users / developers feel about this? Do most people think it's normal to wait 3 years for simple bugs to be fixed?

Also, there doesn't really seem to be any (visible) progress: it would be nice if the people managing this part of the program would post something on this bug page, like "yeah.. we're working on it: it'll be fixed in release 1.xx". That's the way it's done for instance by Sun on the Java bug parade. Right now - i wouldn't be surprised if it's still not fixed in another 3 years.

(In reply to comment #27)
> LOL, if it was that easy, yes, it would be done. Body search has no idea about
> the mime structure of a message, and teaching it about mime would be non-trivial.
Just because we're not working on your favorite bug doesn't mean we've done nothing over the last three years.

> "Content-Type: text/plain", and "Content-Type:
> text/html", and ignore the rest. It's rather trivial.

What about messages with nested attachments? Anyway, we're just going to have to agree to disagree about the relative importance and difficulty of fixing this bug. Any help you want to offer in coding up a fix for this would be greatly appreciated...
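The nesting concern can be made concrete with a small Python sketch. It only illustrates why a flat scan for "Content-Type: text/plain" is not enough and says nothing about how the MailNews search code is structured; `iter_text_leaves` and the forwarded-message example are hypothetical. The text to search can sit several levels down, inside a multipart that is itself inside an attached message/rfc822, so the traversal has to recurse.

```python
from email.message import EmailMessage


def iter_text_leaves(part, depth=0):
    """Yield (depth, text) for every text/* leaf, however deeply it is nested."""
    if part.is_multipart():
        # True for multipart/* and for message/rfc822 wrappers alike.
        for child in part.get_payload():
            yield from iter_text_leaves(child, depth + 1)
    elif part.get_content_maintype() == "text":
        yield depth, part.get_content().strip()
    # Any other leaf (images, archives, ...) is skipped without decoding.


# Hypothetical forwarded message: the inner mail has its own text and a binary blob.
inner = EmailMessage()
inner["Subject"] = "original"
inner.set_content("the secret word is yps")
inner.add_attachment(b"\x00yps\x00", maintype="application",
                     subtype="octet-stream", filename="blob.bin")

outer = EmailMessage()
outer["Subject"] = "Fwd: original"
outer.set_content("forwarding this")
outer.add_attachment(inner)  # stored as a message/rfc822 part

leaves = list(iter_text_leaves(outer))
```

Here the outer text sits one level down, while the inner text is three levels down: multipart/mixed, then message/rfc822, then the inner multipart/mixed. The binary blob is never examined, even though its bytes contain the search word.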
*** This bug has been marked as a duplicate of 37031 ***
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → DUPLICATE
Status: RESOLVED → VERIFIED
Product: Core → MailNews Core