Closed Bug 224318 Opened 21 years ago Closed 21 years ago

Bayes filtering should learn through use of external/serverside filters

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: raccettura, Assigned: Bienvenu)

References

(Blocks 1 open bug)

Details

(Keywords: fixed1.7, late-l10n)

Attachments

(11 files, 5 obsolete files)

(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
mscott
: superreview+
Details | Diff | Splinter Review
(deleted), patch
mscott
: superreview+
chofmann
: approval1.7+
Details | Diff | Splinter Review
(deleted), patch
mscott
: superreview+
Details | Diff | Splinter Review
(deleted), patch
mscott
: superreview+
Details | Diff | Splinter Review
(deleted), patch
mscott
: superreview+
Details | Diff | Splinter Review
(deleted), patch
Stefan.Borggraefe
: review+
Bienvenu
: superreview+
chofmann
: approval1.7+
Details | Diff | Splinter Review
(deleted), patch
Bienvenu
: superreview+
chofmann
: approval1.7+
Details | Diff | Splinter Review
(deleted), patch
mscott
: superreview+
Details | Diff | Splinter Review
Bayes filtering should be aware of the increasingly popular X-Spam headers. Products such as SpamAssassin use them to mark suspected spam emails. Ideally, Bayes should ignore the top message that SpamAssassin attaches to suspected spam. I invite discussion on how to deal with Bayes and other spam filtering software. I'm attaching 2 emails, 1 spam and 1 ham, both filtered through SpamAssassin. The spam is very distinctive, as it's always attached to a spam notice email. All have X-Spam. There is also the possibility of an option to utilize external spam filters with, or instead of, Bayes. For example, training on SpamAssassin's spam/ham results (as SA's own Bayes filtering does), or allowing Mozilla to simply accept SA's decision about what the email is, rather than having SA do the scan and then Mozilla do it again. This would in essence let third party filters use Mozilla's Spam UI, or work in conjunction with Mozilla.
Attached file Spam (deleted) —
Spam Sample (note attached original email, and headers).
Attached file Ham (deleted) —
Ham. Note difference from Spam.
Note that messages *can* be inline, though that is no longer the default format in SA versions later than 2.50; attachments are now the default behavior.
Severity: normal → enhancement
OS: Windows XP → All
Hardware: PC → All
Scott and I had some ideas about what else we could do with this. We're thinking of a new tab on the spam settings window (which we're proposing to give tabs when we add more options like this) with the following choices about what to do with the x-spam-status header:
1. Ignore
2. Trust positives
3. Trust negatives (can trust both pos and neg)
4. Give weight to x-spam-status (somehow combine the x-spam-status result with the bayesian score)
If the user chooses to trust both positives and negatives, then we don't need to run the bayesian filter.
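A minimal sketch of how choices 1 to 3 could interact with the local filter, in Python rather than the actual Mozilla code. The `Trust` flags, `classify`, and the `run_bayes` callable are hypothetical names for illustration; choice 4 (weighting) is not covered because the comment leaves the combination rule open.

```python
from enum import Flag
from typing import Callable, Optional


class Trust(Flag):
    """Hypothetical flags: how far to trust the server's verdict."""
    POSITIVES = 1  # believe the server when it says "spam"
    NEGATIVES = 2  # believe the server when it says "not spam"


def classify(server_says_spam: Optional[bool],
             trust: Trust,
             run_bayes: Callable[[], bool]) -> bool:
    """Return True if the message should be treated as junk.

    server_says_spam is None when no x-spam-status header was found.
    run_bayes stands in for the local Bayesian filter and is only
    called when the server verdict is missing or not trusted.
    """
    if server_says_spam is True and Trust.POSITIVES in trust:
        return True
    if server_says_spam is False and Trust.NEGATIVES in trust:
        return False
    return run_bayes()
```

With both flags set and a header present, `run_bayes` is never called, which matches the observation that the bayesian filter need not run in that case. `Trust(0)` corresponds to "Ignore".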
I think an option to "use X-Spam" would be good too: rather than using Mozilla's bayes filtering, honor the spam filter on the server, and use Mozilla's UI and spam handling with the server's decision. That would cut down on CPU for those users. I like "give weight". It would be nice to have a checkbox to "feed" the bayes filter and train it with the results from X-Spam. Since SpamAssassin and other products are relatively accurate without user interaction, Bayes could be trained quite quickly in those situations, as noted here (among other places): http://www.eweek.com/article2/0,4149,1366242,00.asp There's a ton of potential. That little tag can really do a lot to enhance Mozilla.
Attached file Header examples (deleted) —
There is one problem: There are different headers with different products. I use spampal (Mail-Proxy freeware for win32) and it adds different headers.
Blocks: spam
*Detection Checklist*
SpamAssassin:
Ham: X-Spam-Status: No
Spam: X-Spam-Status: Yes
Spam: X-Spam-Flag: YES
(X-Spam-Status can have data after Yes or No.) I believe X-Spam-Flag was added in later versions.
SpamPal:
Ham: X-SpamPal: PASS
Spam: X-SpamPal: SPAM
(Spam can have data after the word SPAM.)
SpamCatcher:
Ham: X-SpamCatcher-Flag: No
Spam: X-SpamCatcher-Flag: Yes
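The checklist above can be sketched as a small lookup table, shown here in Python for illustration only. The `RULES` table and `server_verdict` function are hypothetical names, and case-insensitive prefix matching is an assumption made because the values may carry extra data after the keyword.

```python
from typing import Mapping, Optional

# (header name, prefix meaning spam, prefix meaning ham), from the checklist.
# Names and values are compared in lower case (an assumption of this sketch).
RULES = [
    ("x-spam-flag", "yes", None),
    ("x-spam-status", "yes", "no"),
    ("x-spampal", "spam", "pass"),
    ("x-spamcatcher-flag", "yes", "no"),
]


def server_verdict(headers: Mapping[str, str]) -> Optional[bool]:
    """Return True for spam, False for ham, None if no known header is present."""
    lowered = {name.lower(): value.strip().lower()
               for name, value in headers.items()}
    for name, spam_prefix, ham_prefix in RULES:
        value = lowered.get(name)
        if value is None:
            continue
        if value.startswith(spam_prefix):
            return True
        if ham_prefix is not None and value.startswith(ham_prefix):
            return False
    return None
```

Supporting another product would then mean adding one row to the table, which is the kind of extensibility the later comments aim for with filter definition files.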
It occurs to me that we need this to be extensible, and that we need pattern matching of some sort. So, I'm thinking I should add a filter action to set the junk score. Then, users can write their own filters to handle some of the server-side spam products. Then, integrating with these products becomes a matter of defining some filters. We could do like the MDN code does and define the filters internally on the fly, invisible to the user. So I think I'll add a filter action that sets a junk score.
I agree, there should be a way through filters. But I'm thinking that in Junk Mail Controls there should be a new tab, with checkboxes for:
[ ] Enable Support for External Mail Filters (Select..)
[ ] Enable Habeas Support
[ ] More soon...
and a mention that users can define their own filters to customize this further. For the first one, have a button that brings up a popup offering "Spam Assassin", "SpamPal", and "SpamCatcher", with the ability to check multiples. Turn them on by default (if the user doesn't have SpamAssassin, it just will never fire, no real harm done); if they do, it automatically kicks in. The other option is to have Thunderbird detect the first instance. With both the filter and the tab in Junk Mail Controls, power users can define their own rules with a filter, while the basics stay within easy reach of the general user. As more products emerge, we could easily add UI options for a few of the most popular ones. I think the 3 mentioned are the most popular right now. Only doing filter rules would put this feature beyond the casual user, who may want filtering but is not geeky enough to make a filter. Besides, I think the above 3 will cover most of those who want the feature anyway, so it would work out of the box for most people, and the rest can adapt it to their needs. As a side note, bug 11040 is indirectly related to this bug. I wouldn't say blocking, but a definite influence.
sorry if I wasn't clear - that's what I meant by "Then, integrating with these products becomes a matter of defining some filters. We could do like the MDN code does and define the filters internally on the fly, invisible to the user." So the implementation of that UI would internally be some filters. One issue I need to deal with is to propagate the junk status set by filters to the imap server so that if the message gets moved to another folder, the junk status is also moved.
David: Sounds good to me. Changing the summary of this bug a little, since it's about more than X-Spam headers now.
Summary: Bayes filtering should be aware of X-Spam Headers → Bayes filtering should learn through use of external/serverside filters
One drawback of this filter approach, as opposed to putting some code in the code that parses mail headers, is that filters only run on new mail downloaded in the inbox. If there are server-side filters that classify messages *and* move them to other imap folders on the server, the client-side spam header detection filters won't detect them. Not sure if this is an important issue...if it turns out to be, we could run the internal filters on folders other than the inbox, I guess. The advantage of using filters is that they're extensible.
this patch makes it so the user can set a message as junk or not junk through a mail filter. I think I'm going to make this three separate bugs:
1. this one - UI and backend for filters to set junk score.
2. adding hidden custom filters for various well-known server-side plugins.
3. adding ability to train bayesian filter on server data.
Attachment #144072 - Flags: superreview?(mscott)
(In reply to comment #13) > 2. adding hidden custom filters for various well-known server-side plugins. > I think I might take that one when I get a few free cycles.
Robert, I'll get you started by doing one of them, when I get a few cycles :-)
(In reply to comment #15) > Robert, I'll get you started by doing one of them, when I get a few cycles :-) If you're doing these in separate bugs, CC me on them so I can keep track. Thanks.
Comment on attachment 144072 [details] [diff] [review] support for filters setting junk score awesome!
Attachment #144072 - Flags: superreview?(mscott) → superreview+
I think Habeas should write their own as an XPI, personally. Lazy so-and-so's ;-) Gerv
Habeas headers - they suggest filtering on #3... http://www.habeas.com/configurationPages/headers.htm I'm thinking the way this will work is that we'll add the ability to load this kind of spam filter from disk, so we'll store the individual filters on disk. That way dropping in new kinds of filters won't involve changing the code so much.
Attached patch fix for custom headers (deleted) — Splinter Review
turns out custom headers were somewhat broken, in terms of what the UI allowed you to set.
Attachment #144171 - Flags: superreview?(mscott)
Attachment #144171 - Flags: superreview?(mscott) → superreview+
Attached file filters for spamassassin (obsolete) (deleted) —
Attached file filters for SpamCatcher (obsolete) (deleted) —
Attached file filters for Habeas (obsolete) (deleted) —
Attached file filters for SpamPal (obsolete) (deleted) —
I'm thinking the way this might work is we add some attributes to nsISpamSettings for handling server-side spam filters:
1. ServerSpamFilterName
2. ServerSpamAction - trust yes, trust no, trust both
Then, when we're starting up a server, if the spam filter name is set, we load the correspondingly named filter file, and enable the Yes and/or No filters, according to what the user has specified. As far as the UI for picking the server side spam filter to incorporate is concerned, I imagine it'll just be a drop down where you can pick from the list of server-side spam filters we know about (maybe with a default choice of None, or a checkbox to turn off this behaviour). It would be cool to populate this list from the .dat files on disk, so that dropping in a new one adds it automatically to the list, but we might not get there...
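A rough sketch of the startup step described above, written in Python for illustration and not taken from the patch. `HiddenFilter`, `filters_for_server`, the bit values of the trust flags, and the `known` dictionary (standing in for filter files loaded from disk) are all hypothetical; the junk scores of 100 and 0 follow the values that appear later in this bug.

```python
from dataclasses import dataclass
from typing import Dict, List

TRUST_POSITIVES = 1  # hypothetical bit: enable the "Yes" (junk) filter
TRUST_NEGATIVES = 2  # hypothetical bit: enable the "No" (not junk) filter


@dataclass
class HiddenFilter:
    name: str
    junk_score: int      # 100 marks junk, 0 marks not junk
    enabled: bool = False


def filters_for_server(filter_name: str,
                       trust_flags: int,
                       known: Dict[str, List[HiddenFilter]]) -> List[HiddenFilter]:
    """Return copies of the named filters, enabled according to trust_flags."""
    if not filter_name:
        return []  # no server-side filter configured for this server
    result = []
    for f in known.get(filter_name, []):
        wanted = TRUST_POSITIVES if f.junk_score == 100 else TRUST_NEGATIVES
        result.append(HiddenFilter(f.name, f.junk_score,
                                   bool(trust_flags & wanted)))
    return result
```

Keying `known` by filter name mirrors the idea of populating the list from files on disk: dropping in a new definition adds an entry without changing this function.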
Comment on attachment 144072 [details] [diff] [review] support for filters setting junk score this would involve an exception for the localization freeze (it adds a few strings) but we'd really like to get this into tbird .6 and Moz 1.7 - the fix is fairly safe, and allows you to make filters set a junk score.
Attachment #144072 - Flags: approval1.7?
Comment on attachment 144171 [details] [diff] [review] fix for custom headers this is needed because the custom headers stuff was always slightly broken...
Attachment #144171 - Flags: approval1.7?
Comment on attachment 144171 [details] [diff] [review] fix for custom headers a=chofmann for 1.7
Attachment #144171 - Flags: approval1.7? → approval1.7+
David: If we are learning from positive marks from external spam filters, isn't it necessary to learn from negatives as well? Otherwise we are essentially tainting the built-in bayesian filters with one-sided results. Just thinking out loud really.
Not sure what you mean - I've added settings to trust both positive and negative results in my patch, and in the filters (except for Habeas). But I haven't done anything about actually feeding the data into the spam filter to train it...I'm probably going to leave that to you or someone else.
Hmm.. I retract my last comment. I apparently have some networking issues. When I was looking at the filter for SpamAssassin I saw:
>name="SpamAssasinYes"
>enabled="yes"
>type="1"
>action="JunkScore"
>actionValue="100"
>condition="OR (\"X-Spam-Status\",begins with,Yes) OR (\"x-Spam-Flag\",begins with,YES)"
and that was it... hence my question. But now I see the rest. I've also been double posting on at least one forum and having connections time out, so I think I have some networking problem here at the moment, though my MRTG graph barely shows a change in ping time. Anyway, disregard my last comment.
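To illustrate how a condition string like the one quoted above could be read, here is a small Python sketch. It is not Mozilla's filter parser: `parse_condition` and `matches` are hypothetical helpers, only the "begins with" operator is handled, every term is assumed to use the same AND/OR connective, and comparing case-insensitively is an assumption.

```python
import re
from typing import List, Mapping, Tuple

# One term looks like: OR (\"Header\",begins with,Value)
# The backslashes before the quotes are optional in this sketch.
TERM = re.compile(r'(AND|OR) \(\\?"([^"\\]+)\\?",([^,]+),([^)]*)\)')


def parse_condition(condition: str) -> List[Tuple[str, str, str, str]]:
    """Split a condition into (connective, header, operator, value) terms."""
    return [m.groups() for m in TERM.finditer(condition)]


def matches(condition: str, headers: Mapping[str, str]) -> bool:
    """Evaluate the condition against a dict of message headers."""
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    terms = parse_condition(condition)
    results = []
    for _connective, header, operator, value in terms:
        if operator != "begins with":
            raise ValueError(f"unsupported operator: {operator}")
        actual = lowered.get(header.lower(), "")
        results.append(actual.startswith(value.lower()))
    if terms and terms[0][0] == "AND":
        return all(results)
    return any(results)
```

For the quoted SpamAssassin "Yes" filter, a message carrying either `X-Spam-Status: Yes, ...` or `X-Spam-Flag: YES` would match, and the filter's action would then set the junk score to 100.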
This handles automatically creating hidden filters for a given server-side filter, if the per-server pref serverFilterName and serverFilterTrustFlags are set appropriately.
Attachment #144590 - Flags: superreview?(mscott)
diff for filter description files (includes a typo fix in SpamAssassin.sfd)
Attachment #144230 - Attachment is obsolete: true
Attachment #144231 - Attachment is obsolete: true
Attachment #144232 - Attachment is obsolete: true
Attachment #144233 - Attachment is obsolete: true
Attachment #144591 - Flags: superreview?(mscott)
Attachment #144592 - Flags: superreview?(mscott)
Attachment #144592 - Flags: superreview?(mscott) → superreview+
Attachment #144591 - Flags: superreview?(mscott) → superreview+
Comment on attachment 144590 [details] [diff] [review] backend support for automatic server spam filter filters looks great.
Attachment #144590 - Flags: superreview?(mscott) → superreview+
Comment on attachment 144072 [details] [diff] [review] support for filters setting junk score a=asa (on behalf of drivers) for checkin to 1.7
Attachment #144072 - Flags: approval1.7? → approval1.7+
front and backend support for filters setting junk score checked in.
Keywords: late-l10n
Bug 181631 was already about having Mark as Junk/Not Junk in the message filter actions; I've marked it Fixed with an xref to this bug. I've opened bug 238816 about adding those enhancements for custom-header matching to MailViews and Search.
I think there are some small issues with the strings that were checked in:
> +<!ENTITY setJunkScore.label "Set Junk Status">
The other filter actions that use a combobox all end with a colon. Also, I think this filter action should end with "to", to be consistent with "Change message priority to:". I'm not sure whether "Junk Status" should be upper case or not.
> +<!ENTITY notJunk.label "NotJunk">
There should be a blank between Not and Junk.
I agree with Stefan about the language changes he suggested. This patch does just that. 1) It adds a space between Not and Junk. 2) It adds a colon to the phrase "Set Junk Status to", to be consistent with setting the priority. 3) I also moved the junk status action in the dialog so it is grouped with the rest of the combo box driven actions such as setting the priority, labeling the message, etc. Don't let the weird way cvs diff generated the patch for that change fool you. It was just moving a few lines of xul higher up in the file. Still have one remaining problem...whenever we read the filter in from disk, this action always resets to Not Junk even if you had it set to Junk.
(In reply to comment #41) > Created an attachment (id=145191) > > 1) It adds a space between Not and Junk This is not included in the patch. :-(
actually it is. But cvs diff -uw ignores white space and it views that change as white space so it didn't show up. Weird :)
Attachment #145191 - Flags: superreview?(bienvenu)
Attachment #145191 - Flags: review?(Stefan.Borggraefe)
Attachment #145191 - Flags: review?(Stefan.Borggraefe) → review+
Attachment #145191 - Flags: superreview?(bienvenu) → superreview+
I'm not able to reproduce the filter returning to non-junk problem, even with a fresh tree from CVS. Maybe it's a release build only issue...
Comment on attachment 145191 [details] [diff] [review] following up on Stefan's suggestions to the filter UI asking for 1.7 status for this polish
Attachment #145191 - Flags: approval1.7?
Comment on attachment 145191 [details] [diff] [review] following up on Stefan's suggestions to the filter UI a=chofmann for 1.7
Attachment #145191 - Flags: approval1.7? → approval1.7+
Comment on attachment 145191 [details] [diff] [review] following up on Stefan's suggestions to the filter UI this patch has been checked in for 1.7 final
(In reply to comment #41) > Still have one remaining problem...whenever we read the filter in from disk, > this action always resets to Not Junk even if you had it set to Junk. I see this too. The actionValue contains a random number instead of 0 or 100 when the FilterListDialog is opened for the first time after Mozilla is started. When you just open the FilterListDialog and close it immediately without opening the FilterEditor, this value is written to msgFilterRules.dat. Also, in FilterEditor.js sometimes gJunkScoreCheckbox and sometimes gChangeJunkScoreCheckbox is used for something that looks like it should be just one variable. But this is unrelated to the random number problem.
this patch fixes one of the issues with this bug fix: "whenever we read the filter in from disk, this action always resets to Not Junk even if you had it set to Junk." We were never reading in the junk mail action value when reading the filter from disk. Hence, the action value was garbage, causing it to sometimes get set to mark as junk and sometimes as not junk. However, there is still another really nasty issue out there: any filter that fires has the random potential of marking mail as junk, even if the filter does not have the junk status action checked. See Bug #239349 for information about that issue.
Comment on attachment 145363 [details] [diff] [review] fixes a bug where the junk action value never gets initialized david see my comment above that explains this patch.
Attachment #145363 - Flags: superreview?(bienvenu)
Comment on attachment 145363 [details] [diff] [review] fixes a bug where the junk action value never gets initialized uninitialized variables leading to random behavior == good candidate for 1.7 final :)
Attachment #145363 - Flags: approval1.7?
I just found the cause of Bug #239349 which caused messages to get randomly marked as junk or not junk if you had a filter rule that set a label action. That fix should also go into 1.7
Comment on attachment 145363 [details] [diff] [review] fixes a bug where the junk action value never gets initialized a=chofmann for 1.7
Attachment #145363 - Flags: approval1.7? → approval1.7+
Comment on attachment 145363 [details] [diff] [review] fixes a bug where the junk action value never gets initialized I swear I wrote that code...
Attachment #145363 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 145363 [details] [diff] [review] fixes a bug where the junk action value never gets initialized this patch has been checked in for 1.7
Keywords: fixed1.7
backend support is checked in. I still need to write some front end code to allow the user to set this up (though for now you can just set a hidden pref on the server, serverFilterName, to the appropriate server side filter name: Habeas, SpamAssassin, SpamCatcher, or SpamPal).
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
this busted balsa tinderbox (gcc3.4): /builds/tinderbox/SeaMonkey-gcc3.4/Linux_2.4.7-10_Depend/mozilla/mailnews/base/src/nsSpamSettings.cpp:458: error: extra `;'
Attached patch Fix Bustage (obsolete) (deleted) — Splinter Review
Attachment #146072 - Flags: review?(bienvenu)
Comment on attachment 146072 [details] [diff] [review] Fix Bustage I actually already checked in the same fix
Attachment #146072 - Attachment is obsolete: true
Attachment #146072 - Flags: review?(bienvenu)
Blocks: 240476
What about forget headers? Is this code immune to that? E.g. takes only the later added headers? The spammers could insert their own headers saying spam-status: 0. And how are new spam filters added to this? My server has some new stuff, it inserts a score into the header and even the cause for this score - what was suspicious in the mail. Something like this: X-Spam-Status: No, hits=0.1 required=5.0 X-Spam-Level: HTML_MAIL, NO_SENDER
(In reply to comment #60) > What about forget headers? Is this code immune to that? E.g. takes only the > later added headers? The spammers could insert their own headers saying > spam-status: 0. Good point. Some of my mail gets filtered two or more times and gets different X-Spam headers; if not all of the filters mark it as spam, then it might be a problem.
I meant forged, sorry for the typo.
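The thread does not say how the checked-in code treats duplicate or forged headers. As one possible approach, and assuming the receiving server prepends its own header above anything already present in the message, a client could consider only the topmost occurrence. The sketch below uses Python's standard `email` package; `topmost_header` is a hypothetical helper, and it relies on `get_all` returning values in the order they appear in the message.

```python
from email import message_from_string
from typing import Optional


def topmost_header(raw_message: str, name: str) -> Optional[str]:
    """Return the first (topmost) value of a header, or None if absent.

    Assumption: the trusted server adds its header at the top, so any
    copy further down may have been inserted earlier by the sender.
    """
    msg = message_from_string(raw_message)
    values = msg.get_all(name)  # header lookup is case-insensitive
    return values[0] if values else None
```

This only helps when the trusted server always adds its own header; if it adds none, the topmost copy could still be the sender's.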
Attachment #147413 - Flags: superreview?(mscott)
Attachment #147413 - Flags: superreview?(mscott) → superreview+
Comment on attachment 147413 [details] [diff] [review] fix for pop3 filter junk score action very safe fix, only affecting setting junk score with pop3 filters...
Attachment #147413 - Flags: approval1.7?
Comment on attachment 147413 [details] [diff] [review] fix for pop3 filter junk score action a=asa (on behalf of drivers) for checkin to 1.7
Attachment #147413 - Flags: approval1.7? → approval1.7+
*** Bug 243049 has been marked as a duplicate of this bug. ***
My mail server sends x-junkmail-status headers, an example value is "score=150/50, host=mx01.versatel.de". There also is a header X-Junkmail-Whitelist.
Product: MailNews → Core
I noticed that the sfd files were added to packages-os2 (and others), but when I build mailnews, these files never get exported into my dist. It appears that the Makefile never gets hit?
(In reply to comment #68) > I noticed that the sfd files were added to packages-os2 (and others), but when I > build mailnews, these files never get exported into my dist. > > It appears that the Makefile never gets hit? Makefile.in (in mailnews/base/search/src/) has only added SpamAssassin and SpamPal, leaving out the other two, even though the packager scripts try to install them (see attachment 144592 [details] [diff] [review]). Legal issues, or were they simply forgotten?
Product: Core → MailNews Core