Open
Bug 62598
Opened 24 years ago
Updated 2 years ago
Allow to filter mail messages by language (charset)
Categories
(MailNews Core :: Filters, enhancement, P3)
Tracking
(Not tracked)
NEW
People
(Reporter: dr, Unassigned)
I get a fair amount of Japanese spam. I don't speak Japanese, so I can safely
filter all jp-encoded mail to my trash. I'd like to see a filter-by-language or
by-charset criterion added to filters.
Comment 1•24 years ago
Not a bad idea. Here is a sketch of what could be done with it: preferences
could have an option menu just for languages, so that languages could be set up
for Chatzilla, Mail, Composer, and Navigator. It could be set up for both
reading and writing in all components this way. This is just a visualization
of how it could be carried out. Hope it helps.
Comment 2•24 years ago
The languages don't have to be installed. I think the header just has to be
searched to see what language it's in.
Comment 3•24 years ago
I wonder how practical this is.
In any case, if we want this kind of feature,
maybe we should allow for a general 'header' key in which
the user can manually specify a header attribute.
For example,
Key: relationship value:
Headers include charset=iso-2022-jp
Headers include Content-type: text/html
etc.
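A minimal sketch of such a generic "header includes" rule, assuming the message is available as raw RFC 822 text. The function name `header_contains` and the sample message are illustrative only and are not part of any Mozilla API.

```python
from email import message_from_string


def header_contains(raw_message: str, header: str, needle: str) -> bool:
    """Return True if any value of `header` contains `needle` (case-insensitive)."""
    msg = message_from_string(raw_message)
    values = msg.get_all(header, [])
    return any(needle.lower() in str(value).lower() for value in values)


# Hypothetical sample message used only for illustration.
SAMPLE = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "Content-Type: text/plain; charset=iso-2022-jp\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(header_contains(SAMPLE, "Content-Type", "charset=iso-2022-jp"))
    print(header_contains(SAMPLE, "Content-Type", "text/html"))
```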
Comment 4•24 years ago
A specific language filter rule would be more discoverable by novice users.
Comment 5•24 years ago
In that case, there needs to be a backend way to
map a 'language' name to a corresponding set of 'charsets'.
This is because one language may use more than one charset.
Coming up with such a list is non-trivial as there are
quite a few languages. For an encoding such as
ISO-8859-1, there is no way to distinguish the languages
that it supports without additional lang info buried
in the messages.
That is why I asked above how practical this is.
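To make the mapping problem concrete, here is a rough sketch of a language-to-charsets table and the reverse lookup a filter would need. The table `LANGUAGE_CHARSETS` is a small, incomplete illustration and not an authoritative list; several of these charsets are also used for other languages.

```python
# Illustrative, deliberately incomplete mapping (an assumption, not a standard).
LANGUAGE_CHARSETS = {
    "Japanese": {"iso-2022-jp", "shift_jis", "euc-jp"},
    "Korean": {"euc-kr", "ks_c_5601-1987", "iso-2022-kr"},
    "Chinese": {"big5", "gb2312", "gbk"},
    "Russian": {"koi8-r", "windows-1251", "iso-8859-5"},
}


def languages_for_charset(charset: str) -> set:
    """Return the languages this table associates with a charset name."""
    name = charset.strip().lower()
    return {lang for lang, names in LANGUAGE_CHARSETS.items() if name in names}


if __name__ == "__main__":
    print(languages_for_charset("ISO-2022-JP"))
    # ISO-8859-1 and UTF-8 cover many languages, so the charset alone
    # tells us nothing here and the lookup comes back empty.
    print(languages_for_charset("iso-8859-1"))
```

The empty result for ISO-8859-1 and UTF-8 is the limitation described above: without extra language information, a charset-based rule cannot say anything about those messages.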
Comment 6•24 years ago
Yep. Maybe you could do something like:
Encoding is <blah> (Language, Language, ...)
Comment 7•24 years ago
"Language Encoding" or similar might be easier to understand.
Mm, "language encoding" sounds good to me. There are descriptions for each
encoding ("Central European" for example) which serve to better describe each
encoding than enumerating each language using it (since there are often
political issues attached to the names of languages -- Serbo-Croat vs. Croatian
vs. Bosnian, etc.)
Comment 10•23 years ago
This is a bad way to filter spam. Many Japanese users send all mail (including
mail written in English) in the Japanese charset, etc.
Comment 11•23 years ago
*** Bug 129263 has been marked as a duplicate of this bug. ***
Comment 12•23 years ago
AFAIK such a filter is useless - spammers can always switch to UTF-8
encoding... what do you do in that case?
Comment 13•23 years ago
Also, as comment 10 points out, many non-spammers use Asian encodings even when
sending messages in English. This would probably cause a lot of trouble for such
users sending mail to a mozilla-mailnews user. I hate spam, but I believe
there must be better ways to filter out junk -- this would be an ugly hack.
Suggesting wontfix.
Comment 14•23 years ago
I would encourage discussion regarding clever ways / algorithms to detect spam,
so we could build those features instead. netscape.public.mail-news, anyone?
Comment 15•23 years ago
Håkan Waara wrote:
> Suggesting wontfix.
It is not a way to defeat spammers - but it may be useful in other ways.
Just implement it - it won't hurt... :)
Comment 16•23 years ago
This kind of filtering can already be done by users who really want it with the
current Mozilla, even if it's not very easy.
The steps could just be documented somewhere.
- Create a new filter
- Choose the Customize option in the header list
- Create a new customized header named "Content-Type"
- When it contains "ks_c_5601" (Korean spam), "iso-2022-jp" (Japanese spam), or
"big5" (Chinese spam), set the rule to delete the message.
- Add a rule to also delete the mail when one of the above strings is found in
the subject. (I haven't tested whether this really works; I don't know if the
filter is applied before or after the subject's encoding is decoded. If it's
after, and thinking about it it should be after, this won't work.)
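The steps above can be sketched outside the mail client as follows. This is only an illustration that assumes access to the raw, undecoded message; it says nothing about whether Mozilla's own filters see the Subject before or after decoding. The names `UNWANTED`, `charsets_used` and `is_unwanted` are hypothetical.

```python
from email import message_from_string
from email.header import decode_header

# Charset names to reject; an assumption chosen for this example.
UNWANTED = {"ks_c_5601-1987", "iso-2022-jp", "big5"}


def charsets_used(raw_message: str) -> set:
    """Collect charsets declared in Content-Type headers and in the raw Subject."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            found.add(charset.lower())
    # The raw Subject may carry RFC 2047 encoded-words such as
    # "=?iso-2022-jp?B?...?="; decode_header exposes their charset.
    for _chunk, charset in decode_header(msg.get("Subject", "")):
        if charset:
            found.add(charset.lower())
    return found


def is_unwanted(raw_message: str) -> bool:
    return bool(charsets_used(raw_message) & UNWANTED)


ENCODED_SUBJECT = (
    "Subject: =?iso-2022-jp?B?GyRCJTklUSVgGyhC?=\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "hello\n"
)
PLAIN = "Subject: hello\nContent-Type: text/plain; charset=us-ascii\n\nhello\n"

if __name__ == "__main__":
    print(is_unwanted(ENCODED_SUBJECT), is_unwanted(PLAIN))
```

Working on the raw header sidesteps the "before or after decoding" question, because the charset label is still visible inside the encoded-word.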
Comment 17•23 years ago
Wontfix. Filtering by language would be an ineffective and dangerous way to
block spam, so we should not encourage Mozilla users to use language to filter spam.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → WONTFIX
Comment 18•23 years ago
Jesse Ruderman wrote:
> Wontfix. Filtering by language would be an ineffective and dangerous way to
> block spam, so we should not encourage Mozilla users to use language to filter
> spam.
Please read comment #15 - and consider reopening this bug. "Implement
filter-by-language" may not be effective for filtering spam, but it may have
other (useful) purposes...
Comment 19•23 years ago
There are many headers that people might want to filter on, but they can't all
be listed in the filter dialog. Language encoding would be one of the less
reliable and more confusing headers to filter on.
Comment 20•23 years ago
I don't know Japanese, Chinese, or any other oriental language for that matter,
and somehow I manage to get all these emails in Outlook that look like absolute
gibberish because I'm on some stupid mailing list, and it's absolutely annoying!
If Mailnews is able to show Chinese, etc. characters for an email, then doesn't
it KNOW it's in Chinese?
Comment 21•23 years ago
*** Bug 154811 has been marked as a duplicate of this bug. ***
Comment 22•23 years ago
*** Bug 159150 has been marked as a duplicate of this bug. ***
Comment 23•23 years ago
The filter would also have to take advantage of Unicode character ranges. Maybe
one solution would be to provide the ability to add mail filter plug-ins. This
way Mozilla need not provide all these filters, but would at least offer the
possibility for someone to provide their own filter, independent of the Mozilla
distribution. The filter dialog would then be coupled with an 'advanced' filter
section where you would select the filter by name and then hit a configure
button which would bring up the settings panel of the filter. Below is a quick
rendition of what the code interface could look like:
Filter
- getName() : String
- getConfigPanel( SettingsRef ) : Panel
- matches ( e-mail ) : boolean
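The interface outlined above could look roughly like this in Python. It is a sketch only: `MailFilter`, `CharsetFilter` and the dictionary standing in for a settings panel are hypothetical and do not correspond to any existing Mozilla plug-in API.

```python
from abc import ABC, abstractmethod
from email import message_from_string
from email.message import Message


class MailFilter(ABC):
    """Hypothetical plug-in interface mirroring getName/getConfigPanel/matches."""

    @abstractmethod
    def get_name(self) -> str: ...

    @abstractmethod
    def get_config_panel(self, settings: dict) -> dict: ...

    @abstractmethod
    def matches(self, message: Message) -> bool: ...


class CharsetFilter(MailFilter):
    """Example plug-in: matches messages whose declared charset is listed."""

    def __init__(self, charsets):
        self.charsets = {c.lower() for c in charsets}

    def get_name(self) -> str:
        return "Charset filter"

    def get_config_panel(self, settings: dict) -> dict:
        # A plain dict stands in for a real UI panel in this sketch.
        return {"title": self.get_name(), "charsets": sorted(self.charsets), **settings}

    def matches(self, message: Message) -> bool:
        return (message.get_content_charset() or "") in self.charsets


if __name__ == "__main__":
    mail = message_from_string("Content-Type: text/plain; charset=koi8-r\n\nhi\n")
    print(CharsetFilter(["KOI8-R", "Big5"]).matches(mail))
```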
BTW I certainly feel that this needs to be reopened
Comment 24•23 years ago (Reporter)
Ok, you have good points. I agree that it wouldn't have the intended effect as
much as I'd hope. Your argument of Japanese users sending English mail in Shift-
JIS encoding is particularly compelling. But I wouldn't consider it
categorically useless, or harmful.
I don't ever expect to get any email from Russia or China, for example. Even
English email. I don't have any friends or family there, and I'm certainly not
interested in any "business opportunities" there. So if I see KOI8 or Big5
emails, I don't care what the content is, I want that mail in the trash.
As for UTF-8, it'd be pretty easy to deal with character ranges... Actually,
come to think of it, do we convert all encodings to Unicode internally? (That'd
help with the other encodings). Regardless, it'd be plenty good to take a
random sampling of characters in the email, determine their Unicode range, and
filter based on that. I'd be very happy to say "only send me email in Basic
Latin and Latin-1 Supplement."
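A rough sketch of that sampling idea, assuming the message body has already been decoded to Unicode text. The function names, the sample size and the 0.9 threshold are arbitrary choices for illustration.

```python
import random


def fraction_latin(text: str, sample_size: int = 200, seed: int = 0) -> float:
    """Fraction of sampled non-space characters in Basic Latin or Latin-1 Supplement."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 1.0  # nothing to judge, so do not penalise the message
    if len(chars) > sample_size:
        chars = random.Random(seed).sample(chars, sample_size)
    # Basic Latin is U+0000..U+007F, Latin-1 Supplement is U+0080..U+00FF.
    latin = sum(1 for ch in chars if ord(ch) <= 0xFF)
    return latin / len(chars)


def accept(text: str, threshold: float = 0.9) -> bool:
    """Accept the message if enough sampled characters fall in the allowed ranges."""
    return fraction_latin(text) >= threshold


if __name__ == "__main__":
    print(accept("Bonjour, ça va très bien"))
    print(accept("これはスパムです"))
```

Note that this judges script, not language: an English message and a French one look the same to it, and an English message quoting a few lines of Japanese may still pass.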
I think behavior like that, specified by the mail recipient's expectations of
where -- in a very broad-brush approximation sense -- they expect to be
receiving mail from, would be plenty good.
So I apologize if this is a nuisance but I'd like to reopen this request for
enhancement. It shouldn't clutter your radar as such, and I think it would be a
rather useful feature for many users, even if it's not perfect for everybody.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Comment 25•22 years ago
I don't get much legitimate mail from China or Japan, but I do get legitimate
mail from friends who live in the US and send all of their mail in strange
character sets.
Summary: Implement filter-by-language → Filter by language
Comment 27•21 years ago
*** Bug 225784 has been marked as a duplicate of this bug. ***
Comment 28•20 years ago
Has anything ever happened here? I have become a favorite of Korean and
Japanese spammers---and since I have not spoken Korean EVER, I would love to
turn these off.
Comment 29•20 years ago
*** Bug 268646 has been marked as a duplicate of this bug. ***
Updated•20 years ago
Product: MailNews → Core
Comment 30•19 years ago
I disagree with Jesse's reasoning. That people post to lists in other than their native (or default) language and forget to change the language type represents incorrect behaviour on their part...
This is not an invalidation of the basic idea and its merit of filtering based on language.
The people doing the above will eventually learn to do the right thing. In any case, TB and the rest of the Mozilla clan allows for one to easily set up multiple user profiles. One could be a profile used exclusively for posting to mailing lists, that has default options such as:
* charset 8859-1
* top-posting
* text only encoding
* etc.
Comment 31•18 years ago
It's been possible for a long time now to add Content-Type to the list of searchable headers, and then filter on a charset name (similar to comment 3).
Comment 4 and 5 are still valid, if it's actually worth anyone's effort to set up a mapping between, say, Japanese and several encodings on the behalf of those novice users who want an easily discoverable way to ignore such messages.
But if the encoding is UTF-8, you can't tell what language it's in. Either you'll end up filtering out, say, French or German messages, or you'll allow some Japanese messages. Either way, these novice users are going to be confused by the situation. And I agree with Jesse Ruderman's basic premise: this is a dangerous feature; therefore, I'd argue it shouldn't be discoverable.
Given that, combined with the pretty high quality of the Junk Controls feature, this bug really should be WontFix'd, for good.
Comment 32•18 years ago
(In reply to comment #31)
> But if the encoding is UTF-8, you can't tell what language it's in. Either
> you'll end up filtering out, say, French or German messages, or you'll allow
> some Japanese messages. Either way, these novice users are going to be
> confused by the situation. And I agree with Jesse Ruderman's basic premise:
> this is a dangerous feature; therefore, I'd argue it shouldn't be discoverable.
True, but there are a lot of spammers out there that try to imitate what Outlook does, and Outlook likes Windows-125[0-8] charset encoding... God knows why.
There isn't much that can't be encoded in US-ASCII, ISO-8859-1, or UTF-8 (in that order of trying).
In fact, there's an RFC out there (forget which) that says that these are the recommended encodings, and that nothing else should be used.
This applies to comment #10 as well: since messages should be encoded in the smallest encoding they will fit in ("Be conservative in what you end...", to quote Jon Postel), since this has the highest probability of being supported, then English would be encoded in US-ASCII or at worst ISO-8859-1, not any Japanese native charsets.
As for comment #12: when that happens, we'll evolve, as they have.
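The "smallest encoding that fits" order described in this comment (US-ASCII, then ISO-8859-1, then UTF-8) can be sketched as below. The function name `smallest_charset` is made up for this example, and the sketch does not claim that any particular mail client selects charsets this way.

```python
def smallest_charset(text: str) -> str:
    """Return the first of US-ASCII, ISO-8859-1, UTF-8 that can encode `text`."""
    for charset in ("us-ascii", "iso-8859-1", "utf-8"):
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    # Only reachable for text that even UTF-8 rejects, e.g. lone surrogates.
    raise ValueError("text cannot be encoded in any candidate charset")


if __name__ == "__main__":
    for sample in ("hello", "café", "日本語"):
        print(sample, "->", smallest_charset(sample))
```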
Comment 33•18 years ago
(In reply to comment #32)
Umm... "Conservative in what you send..." Fat fingers.
Comment 34•18 years ago
(In reply to comment #33)
> (In reply to comment #32)
>
> Umm... "Conservative in what you send..." Fat fingers.
This bug isn't about sending, it's about receiving. And the rest of that aphorism, "be liberal in what you accept," exactly countermands your so-called "argument" for keeping this bug open.
Comment 35•18 years ago
Just a question, are there any hooks available to be able to write a plugin to do this? In a worst case scenario I can imagine being able to add custom filter plugins until there is a large enough demand for such a feature.
Comment 36•18 years ago
(In reply to comment #34)
>
> This bug isn't about sending, it's about receiving. And the rest of that
> aphorism, "be liberal in what you accept," exactly countermands your so-called
> "argument" for keeping this bug open.
The flip-side is that it's fairly clear what the correct behavior should have been for Outlook in how they select their encodings... and that there's a limit in how much another mail client should be willing to "be liberal" in order to accept things that are just plain wrong.
Demonstrably, Outlook is just plain wrong.
And it shouldn't matter that they have an 80% market share (or whatever it is).
Comment 37•18 years ago
*** Bug 354445 has been marked as a duplicate of this bug. ***
Comment 38•18 years ago
It could be done if Thunderbird analysed the content. In my opinion, the key is not the encoding of the message (several languages share the same encoding) but the words in the message.
It could be possible:
- to extract the words that appear in the message ("ham", "cosa", "equus", ...),
- to classify each word by the languages it may belong to ("ham" belongs to {english}, "cosa" belongs to {spanish, catalan, italian}...),
- to determine the most probable language of the message (the message is written in language L if L is the language to which the most of its words may belong),
- and then to decide whether to filter the message or not.
It's a draft.
The classification of the words could be done automatically if we know the words that belong to each language. That could come from, for example, a public database of such words. Translations of programs, Wikipedia pages, etc. could be used as sources.
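The draft above could be sketched as a simple word-voting classifier. The tiny `WORD_LANGUAGES` table is invented for illustration; a real word list would have to come from a source such as the ones mentioned above.

```python
import re
from collections import Counter

# Toy word list (an assumption for this sketch, not real linguistic data).
WORD_LANGUAGES = {
    "ham": {"english"},
    "the": {"english"},
    "cosa": {"spanish", "catalan", "italian"},
    "equus": {"latin"},
}


def guess_language(text: str):
    """Return the language that the most known words may belong to, or None."""
    votes = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for language in WORD_LANGUAGES.get(word, ()):
            votes[language] += 1
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically so the
    # result does not depend on set iteration order.
    return min(votes, key=lambda lang: (-votes[lang], lang))


if __name__ == "__main__":
    print(guess_language("The ham and the eggs"))
    print(guess_language("12345"))
```

A word such as "cosa" votes for several languages at once, so short messages can easily end in a tie; the alphabetical tie-break here is arbitrary.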
Thanks,
Xan.
Comment 39•18 years ago
sorry for the spam. making bugzilla reflect reality as I'm not working on these bugs. filter on FOOBARCHEESE to remove these in bulk.
Assignee: sspitzer → nobody
Updated•16 years ago
Product: Core → MailNews Core
Summary: Filter by language → Allow to filter mail messages by language (charset)
Updated•2 years ago
Severity: normal → S3