Open Bug 752844 Opened 13 years ago Updated 1 year ago

have global search support exact matching (disabling stemming). ex: Searching for 'wedding' finds 'weds'

Categories

(Thunderbird :: Search, enhancement)

12 Branch
enhancement

Tracking

(Not tracked)

People

(Reporter: chris, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19
Steps to reproduce: Search e-mail for 'wedding'.
Actual results: Many emails were found, most of which contained 'weds'.
Expected results: 'weds' should not have been found. There should be some way to override this 'clever' search. I've seen other systems use +word or "word". Neither stops 'weds' from being found.
You are correct; we have no way to disable the Porter stemming algorithm. Exact matching would need to be performed as a filtering post-pass, since the database stores only the stemmed word. Using the plus sign might work out better, since we/SQLite already use quoting to indicate phrases; although quoting a single word is unambiguous, quoting a phrase would not be. In the specific example, I assume the problem is that although "weds" in the sense of "Alice weds Bob tomorrow" is accurately related to wedding (wiktionary: "wed (third-person singular simple present weds, present participle wedding)"), people use "Weds." as an abbreviation for Wednesday, so the results for "wedding" get swamped by completely unrelated things?
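A minimal sketch of the post-pass idea described above, not Thunderbird's actual code. `toy_stem` is a crude hypothetical stand-in for the Porter stemmer, and `build_stemmed_index` and `search` are illustrative names; the point is only that the index holds stems, so exactness has to be re-checked against the message text.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper standing in for the Porter stemmer (NOT the real algorithm)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "wedd" -> "wed".
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_stemmed_index(docs):
    """Map each stem to the ids of the documents containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, docs, term, exact=False):
    """Look the term up by its stem; with exact=True, re-check the original text."""
    hits = sorted(index.get(toy_stem(term), set()))
    if exact:
        wanted = term.lower()
        hits = [d for d in hits if wanted in tokenize(docs[d])]
    return hits


docs = [
    "Meeting moved to Weds. at noon",
    "The wedding is in June",
    "Alice weds Bob tomorrow",
]
index = build_stemmed_index(docs)
stemmed_hits = search(index, docs, "wedding")
exact_hits = search(index, docs, "wedding", exact=True)
print(stemmed_hits)  # every message whose words share the stem "wed"
print(exact_hits)    # only the message that literally contains "wedding"
```

The post-pass needs the original message text at query time, which is why it costs extra work per hit compared with a pure index lookup.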
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: x86_64 → All
Summary: Searching for 'wedding' finds 'weds' → have global search support exact matching (disabling stemming). ex: Searching for 'wedding' finds 'weds'
Yes, you are correct, my mails are full of people using weds for Wednesday. I had not really clicked that 'Weds' could also be a stem variant of Wedding.
Trying to find a message recommending some 'accountants' was pretty tricky, as the stemmer took that all the way down to 'account'.
Somehow worse is the fact that longer, more precise queries like "we submitted" (quoted in the search box) match things like: submitted, submit, @submit, -submit. The stemming behavior should be disabled by quoting.
(In reply to M Lopez-Ibanez from comment #5)
> The stemming behavior should be disabled by quoting.
The problem is that under the current implementation, that information is already gone in the inverted index. There is no "submitted" in the inverted index, only "submit".
Amending my previous statements in comment 1, possible solutions that don't require implementing our own full-text-search back-end for SQLite would be to:
  • Try to have the tokenizer also emit an un-stemmed token. Unfortunately, this would break phrase searching, so it's not a great fix.
  • Populate a second full-text-search table that does not do stemming but does do case-folding, etc. Then "submitted" would be in the inverted index. The downside is that it would probably double the disk space used (well, a 50% increase for global-messages-db.sqlite), although I think FTS3/FTS4 may have had some enhancements where you don't need to store the body text in-place anymore, so that might be only a 25% increase.
Important note: I don't hack on Thunderbird anymore at all; I'm just including these possibilities in case someone else wants to pick up the bug.
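The second option (a separate, unstemmed index alongside the stemmed one) can be sketched in a few lines. This is an illustration under assumptions, not gloda's schema: `toy_stem` is a hypothetical stand-in for the Porter stemmer, `DualIndex` is an invented name, and the `+word` prefix for exact matching is only the syntax floated in comment 1.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper standing in for the Porter stemmer (NOT the real algorithm)."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "submitt" -> "submit".
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


def tokenize(text):
    """Lower-case the text (case-folding) and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class DualIndex:
    """Two inverted indexes over the same messages: one stemmed, one case-folded only."""

    def __init__(self):
        self.stemmed = defaultdict(set)
        self.exact = defaultdict(set)

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.stemmed[toy_stem(token)].add(doc_id)
            self.exact[token].add(doc_id)  # the un-stemmed word survives here

    def search(self, query):
        # Hypothetical syntax: a leading '+' requests an exact (unstemmed) match.
        if query.startswith("+"):
            return sorted(self.exact.get(query[1:].lower(), set()))
        return sorted(self.stemmed.get(toy_stem(query), set()))


idx = DualIndex()
for doc_id, text in enumerate([
    "We submitted the patch",
    "Please submit your report",
    "Click the submit button",
]):
    idx.add(doc_id, text)

fuzzy = idx.search("submitted")    # stemmed lookup
strict = idx.search("+submitted")  # exact lookup
print(fuzzy, strict)
```

Both structures are stored for every message, which is where the extra disk space mentioned above comes from; in exchange, no post-filtering pass is needed.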
(In reply to Andrew Sutherland (:asuth) from comment #6)
> (In reply to M Lopez-Ibanez from comment #5)
> > The stemming behavior should be disabled by quoting.
>
> The problem is that under the current implementation, that information is already gone in the inverted index. There is no "submitted" in the inverted index, only "submit".
Speaking from ignorance, so feel free to tell me that this simply cannot work: why not just post-process the results to remove invalid matches? I see Thunderbird helpfully highlighting the matched word, so perhaps before doing that it can check that what is matched actually matches exactly what was requested. This is a bit inefficient, since the exact search could be more efficient than the partial search, plus there is the added overhead of detecting and removing inexact matches. But perhaps it is not actually noticeably slower. I don't find the global search especially slow (with a 107M global-messages-db.sqlite), but it is annoying to not be able to find something that I know is there.
Great minds think alike :) What you propose can be done and is the option I described in comment 1. I just wanted to enumerate some additional options that could avoid a post-filtering pass. Especially in cases where we issue a LIMIT to the results, post-filtering could run into trouble where the stem of the word is extremely common, like for the example in comment 1.
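The LIMIT concern can be made concrete with a small simulation. Everything here is hypothetical: `STEMS` is a two-entry lookup table standing in for the stemmer, and the corpus is synthetic, with one exact match buried among 400 messages that only share its stem.

```python
# Hypothetical stem table standing in for the real stemmer.
STEMS = {"weds": "wed", "wedding": "wed"}


def stem(word):
    return STEMS.get(word, word)


def stem_matches(docs, term):
    """Ids of documents containing any word that shares the term's stem, in index order."""
    target = stem(term)
    return [i for i, text in enumerate(docs)
            if any(stem(w) == target for w in text.split())]


def limit_then_filter(docs, term, limit):
    """LIMIT is applied to the stemmed query first; exact filtering happens afterwards."""
    candidates = stem_matches(docs, term)[:limit]
    return [i for i in candidates if term in docs[i].split()]


def filter_then_limit(docs, term, limit):
    """Exact filtering runs over every stemmed hit; LIMIT is applied last."""
    exact = [i for i in stem_matches(docs, term) if term in docs[i].split()]
    return exact[:limit]


docs = ["see you weds"] * 200 + ["wedding invitation enclosed"] + ["back on weds"] * 200
lost = limit_then_filter(docs, "wedding", limit=50)
found = filter_then_limit(docs, "wedding", limit=50)
print(lost)   # the only exact match lies beyond the first 50 stemmed hits
print(found)  # filtering before limiting still finds it
```

Filtering before limiting avoids the loss, but it means walking every stemmed hit, which is exactly the cost a LIMIT is meant to avoid when the stem is very common.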
Quoting definitely does not work. This functionality defeats the purpose of "searching" if it's giving me things I'm not looking for! I would propose that this is not an "enhancement" request. This goes beyond the basic functionality of searching (i.e. returning matches for exact string "needle") in a way that reduces the quality of the search results ("needle OR hay OR stack OR haystack"). - RG>
It affects me too. Search just doesn't work for me. I can't find emails containing "@reviews.co.uk" or "reviews.co.uk" because it finds emails with the word "review" in them.
Yes, this is a big problem for me too. When looking for domains or email addresses, Thunderbird returns too many results that match only parts of my search; in most cases I need an exact match.
This NEEDS correcting - just because a bad method has been implemented is not a reason to justify it. I was searching for "Messaging" but I get a million results for "Messages". This happens every time. USABILITY IS KING. There are other algorithms for search - even if it's slow but returns what we want, it's VASTLY BETTER than what happens now. Just give us a "strict match" checkbox, and let us save that as a preference.
It's also missing some text. For instance, Exchange replies to meetings with messages having *~*~*~*~*~ in them. The current version of Thunderbird won't find these. For example:
When: Friday, November 14, 2014 3:30 PM-4:30 PM. (UTC-05:00) Eastern Time (US & Canada)
Where: Conference Room 2 WPB
*~*~*~*~*~*~*~*~*~*
Meeting notes go here
The official help pages of Mozilla https://support.mozilla.org/en-US/kb/global-search say: 'Search for the phrase "new Thunderbird pages". Results should include messages that contain the entire phrase' This, as this bug report says, doesn't work.
A work-around is to use the classical search "Ctrl+Shift+F", but it is limited to one account.
This issue even affects searches for specific websites. For example, I searched for the website alternate.de and got matches for emails that contain the word "alternatively". This feature should be disabled for terms that aren't words. I would still prefer a solution where you can disable the stemming feature completely.
As a workaround with this plugin search seems to be better: https://addons.mozilla.org/en-US/thunderbird/addon/gmailui/ This is no solution though.
I hope this will be fixed soon too. I use the search a lot for searching non-English and non-existent words to find important documents. It has become a very time-consuming task since this smart search was enabled a few years ago. The Thunderbird versions before that saved me a lot of time. IMHO, it would be nice if we could disable stemming altogether; if I want both account and accounting I will just search for account. Or make it optional if people really need it.
As a workaround you can always use the old style search (Ctrl+Shift+F). It's not as fast as it's not indexed, but there's no stemming.
(In reply to Piotr Szymkowski from comment #11) Yes, for me it is a big problem. When searching for domains or email addresses, Thunderbird returns too many results that match only parts of my search; in most cases, I need an exact match. @Piotr Szymkowski Yes, I work in the wedding sector: http://www.noemiwedding.com/ A work-around is to use the classical search "Ctrl+Shift+F", but it is limited to one account.
A search for "interns" provides results for "internal" (even when followed by "and external") and "international" (which can swamp results) (though curiously, "internship" is not found).
And trying to exclude some of these unwanted terms, with -[unwanted term], doesn't exclude them.
I'm also trying to search for all e-mails containing the exact literal "hp.com", but Thunderbird finds all mails with "com" (i.e. all mails containing a dot-com address :-O)... and I can't search for just "hp" (no results... probably too short?). This is a huge limitation of the global search engine that has hurt me many times.
Blocks: 544580
Blocks: glodafailtracker
No longer blocks: 544580

(In reply to Caro Cogitatus from comment #31)

> Please fix this. Fuzzy searches should be an option. This "feature" makes the search function essentially useless.

I could not agree more. I have finally managed to introduce thunderbird in my company and being able to find email messages reliably is basic functionality.

This issue was opened 7 years ago and it appears there aren't any plans for changing it any time soon.

One last thought, IMHO this should not be considered an enhancement but a serious bug, as I said before, email search is basic functionality for any email client.

Agreed, this is infuriatingly broken! 99% of the time I do NOT want the stemmed result.

Today I am trying to find emails referencing products of the brand "Sensative", and it comes up with hits for "sense", "senses", etc., completely drowning out the few emails with 'sensative' in them...

This has been bugging people for 7 years, and still no-one fixed it.

> This has been bugging people for 7 years, and still no-one fixed it.

"No-one" includes everyone who adds comments here instead of contributing a patch. Age of a ticket is entirely irrelevant: There is no unlimited (wo)manpower. If you want to see this fixed, write a patch to fix it. Please avoid more "me too" comments that don't help anyone. Thanks.

IMHO 7 years, 16 votes, 33 watchers, 1 blocked bug, 5 duplicates and a problem which is objectively important and a huge limitation in an e-mail client should not be considered "entirely irrelevant" when the devs decide the roadmap of the next Thunderbird release...

After all, we're talking about the ability to perform an "exact match" search, something that should be one of the very first thing that one needs in an e-mail client...

Of course you can always say "step up and fix it yourself", but I don't think that's quite fair. I don't think there are many people here who have the knowledge to fix this, but this doesn't mean that they are not legitimate Thunderbird users who would like to see it improved. And if the "me too" is annoying and "irrelevant", how could the community express its opinion on the importance of this problem vs another one? What is driving the devs' decisions?

Just IMHO, as I already said.

@Mauro: Some egos just don't like to be told what other people feel is important when it doesn't match the shiny new feature they want to add.
Adding new stuff is always more fun for a developer (I know, I've been one) than fixing old code!

I work 60-ish hours every week, have a family and a house, I don't even have time to play the games I paid cash for, let alone debug a mess of code like Thunderbird...

It is possible to fix it, because someone wrote a plugin that provides better search functionality than TB does itself, but it's ridiculous that something this basic is still bothering people 7 years and several duplicates later!

Just a little quote to close this comment:
"Constant and intense critique is one of the reasons we build great products. It's harder to fall into group-think if there is always a healthy amount of dissent. We want to encourage vibrant debate inside of the Mozilla community, we want you to disagree with us, and we want you to effectively argue your case. However, we require that in the process, you criticize things, not people"

I have been a volunteer free-software contributor[*] and I'm also a long-time Thunderbird user frustrated by this bug like you. I can tell you that the tone of your comments is actually detracting from getting this bug fixed. Me-too comments (including this one) actually make developers mute bugs and remove them from their lists to improve the signal-to-noise ratio. Moreover, no volunteer dev likes to be told that their work is bad or a mess, or that they must fix something in their limited free time that may not bother them personally, and they may not have enough free time to have a family or enough money to buy a house... Nobody wants to reward bad behaviour by fixing bugs under pressure.

There are many positive and constructive ways for a user to express their interest for a particular bug:

  • Politely explain why you think the bug deserves more attention by adding arguments that have not been mentioned before (the minimum effort one can do before commenting is to actually read previous comments).
  • Add constructive information to the bug report: a deep analysis of the issue, high-level suggestions for potential solution designs, a high-level sketch of a solution's implementation, references to fixes implemented in add-ons or other email readers, etc.
  • Volunteer in thunderbird in some other way that gives you a voice in setting the roadmap.
  • Organise users affected to set up a substantial bounty.
  • Reach out to companies using/contributing to Thunderbird to invest resources in this particular bug.
  • etc.

[*] I recommend trying to be one, even for a short period; you'll learn a lot and your perspective will change on many things.

(I believe this comment should be marked as off-topic and hidden by default)

I wrote a "me too" comment 6 years ago.

Just FYI - there is a workaround for finding exact matches. Don't use global search; just right-click on a folder (even a parent/top folder) and click Search Messages.

It's not an indexed search, it takes time, and you can search only one mailbox at a time, but it works.

(In reply to M Lopez-Ibanez from comment #38)

I respect you taking time to present a balanced, friendly response. It's a refreshing change from the high-and-mighty and arrogant responses.
It's the latter that triggered responses 36-38... It's not just users requesting bug fixes that should be aware of their tone!

I've been a developer and also did open source, so I know what you mean. And yes, IMHO comments # 33, 34 should be marked #MeToo, and 35, 37, 38 and 40 as #OffTopic.

But automatically considering MeToo's as bad and negative is NOT productive either! It's other people confirming that the bug is a problem that helps to identify issues that might need more attention. Just discarding MeToo's is disregarding that important indicator, and there seems to be no other way to help identify what users find important.

I never said that anyone's work is bad, or a mess, I referred to the WHOLE TB codebase as a Mess of Code... (Which it is! Nobody can say it is structured, uniform coding... At least not without lying)
And yes, this is a clear example why free/open-source software will never be able to fully get to the level commercial code does: Nobody wants to work on fixing old code that isn't a flashy new feature, whereas an employer can go tell developers "go fix this"... Having said that, I - most of the time - still prefer the free/open-source software!

Now let's close this discussion, as it contributes nothing to this bug, or the fixing of this broken behavior.

So finally, to actually add a constructive element to this comment to further indicate why this should be fixed:
I know of several people who prefer to use any of Microsoft's mail products over TB simply because they cannot find what they're looking for in TB when they need to search for things, despite me having it set up for them! So this is definitely costing TB users, and a bigger user base means a better representation, and a bigger 'voice' for the TB team and the Mozilla Foundation.
Now if I know several people, there's going to be more people that know several people like that...

Is this something money could fix?

I ask, because time & talent aren't things I can contribute, but money is.

I pledge $200 to this getting fixed. Is that realistic? Will others join my pledge?

I thought it was just me going crazy - no wonder I found TB search (filter) to be nearly useless if this is the default (!)

I want to add that this is very counter-intuitive.

When I type a search (filter) word (especially as I use 2-3 languages in my emails) I need to use the exact word - especially as it is not clear how to "force" TB to use the search term verbatim.

Ah - found right click "search" - but searching the Body of the text seems to be broken for me - no results when I know there should be...
Search Subject does work...

I'll happily match the $200 pledge for getting this fixed. Is there a way to do that officially? (I didn't find one in a quick search.) I don't want to renege on it by say missing an email.

Post processing seems perfectly fine to me - if there is an internal limit hit, the result is not worse. The limit would already get hit in the current code, right? Just more to wade through manually. It might be nice to warn though.

Alternatively, a doubling of index storage sounds like a small cost to get such an important feature working as expected. Even smaller now, relatively, given the passage of time (hardware specs continue to evolve, other software bloats much faster).

> if there is an internal limit hit, the result is not worse

Oh, except that adding search terms doesn't shrink the original query but expands it, so there's no easy user workaround(?) Hmm....

A larger index would seem like a very agreeable 'sacrifice' for a better (non-stemmed) search... Stemming might make sense in some cases, but in most cases (for both myself and many people in my vicinity) a literal search is what we're looking for!

Anything is good - I last contributed to this thread * 6 * years ago, yes folks that's SIX years ago.
I'll happily test and do interface review - very experienced dev for web and UX here - and wrangler of database search - it would be trivially easy if they hadn't gone for stemming.
It's lunacy as it stands.
And add an up-front date filter - half the time there's no need to search 10 years of data, usually all I want is the last year.

This is a terrible issue. I'm trying to find some emails that mention a specific company name and it is showing all the emails that contain "Here".

I've quickly reviewed this issue discussion, and I understand there is no fix for this because we don't want to expand the index.

Fine: why not just do a two-step search? For example:
Step 1) Find in the index all messages that are stemmed hits for our query, as happens now.
I understand that this gives us a list of hits and the word on which each hit occurred (since Thunderbird highlights the words).
Step 2) Filter the word matches and keep on the list only those that hit the word exactly (a checkbox or a specific query string could activate this).

Looks so simple to implement. Am I missing something?

(In reply to cristian from comment #47)

...
> I've quickly reviewed this issue discussion, and I understand there is no fix for this because we don't want to expand the index.

No, index size and speed were the reasons for stemming in the original design 10+ years ago. Hardware has improved, so those should no longer be blocking factors.

The reason you haven't seen this yet is that no volunteer has offered to fix it, and amongst paid staff it is far from the top of the heap of thousands of things to fix. I say this as someone who knows the process and would also like to see this fixed. I expect it will rank higher when search in general gets a revamp, but I don't see that happening in the current development cycle for version 78, due out this spring.

(In reply to Wayne Mery (:wsmwk) from comment #48)

Thanks for your reply. I'm mentioning it because maybe some volunteer missed the point here and is not willing to work on a full revamp of search, but could work on a quick post-filtering of search results.

I'm willing to do it, but I'd need a longer onboarding time on the project and I have little spare time. I mean, it would take me some time, but not 8 years :). Maybe someone with more experience just needs an idea of how to proceed, and is thinking of too complex a change for what could be a quick fix?

I think the easiest way to partially fix this is to replace stemming with lemmatization.

Is there any chance this will be prioritized in the near future?
Search in TB is a mess and it is driving me crazy not being able to find mail that I KNOW is somewhere in my mail storage.

This is the single biggest reason I loathe TB as my mail client.

Please fix this!
Put a bounty on it and let users chip in.

I no longer use the global search, and just do CTRL+SHIFT+F (CMD+SHIFT+F on macOS) now... It’s considerably slower in populating results, but at least it shows me what I am looking for, instead of the useless crap the stemmed option in global search regurgitates!

(IMHO they should just drop it altogether, and just give us back the storage and cpu cycles that creating the stemmed database cost...)

(In reply to Fear na Boinne from comment #52)

I completely agree. I also deleted the search index recently and turned off the search in advanced settings, as it uses over 10GB of data and I would otherwise have to delete valuable data to keep my computer working properly. Search is (one of) the most used features for me and I hate that there is no proper alternative. I have over 20 e-mail accounts and I now search them one by one with the per-mailbox search every time I need to find an email message. Unfortunately it is even more cumbersome as I filter many messages into several folders. This stemming makes using Thunderbird very time-consuming.

I understand the reasons the global search returns unwanted results. However, why does it seem to not be the case for the message filter that searches messages in a particular folder (i.e inbox)?

For example, if I want to filter messages with a subject containing "direct debit," the filter is able to parse out and return messages only containing "direct debit" in the subject. It does not return messages containing either direct or debit, which the global search would return.

Is there any way to utilize the algorithm from the message filter in the global search?

Because quick filter doesn't use the global search (index), but searches on demand.

Magnus,

I appreciate your answer, but it feels equivalent to "because I said so."

Also, your answer does not address the questions that I asked. The point I tried to make (perhaps poorly) is: if Mozilla can do the search on "demand" for the filter, then why can't employ the same search method for the global search. It's analogous to building a car that can go forward and reverse, and building a truck that can only go forward.

Thanks anyway.

It may be practical to search a few messages, but impractical to search too many messages without using an index. And stemming makes it easier to create an index.

Easier to create an index at the cost of being unable to get accurate results for some searches. Personally, I'm happy to have to wait longer for exact match results on the relatively few occasions when it's required. As it stands the only way to do this is to use an extension and search folder-by-folder, which is about a 1 on a 1-10 UX scale.

(In reply to mozilla from comment #61)

You can use the built-in Search Messages dialog ([CTRL|CMD]+SHIFT+F) to search one or more folders without stemming. I think it would be an easier ask to make Search Messages a widget for the toolbar(s) than to change their minds on the usefulness of the stemming search that is the default now (for me its usefulness is a -3 on a 1-10 scale, because I NEVER use it, but occasionally forget I should use Search Messages instead!)…

Frankie suggests trying ("why can't employ the same search method for the global search"), but the architecture of global search doesn't fit quick filter. So Magnus is absolutely correct and directly answered the question of comment 57 - a solution is never going to come from plumbing global search capabilities into Quick Filter Bar.

The possible architectural solutions are comment 1, comment 6, and a complete replacement of gloda - none of which are in scope for the next version, 103.

For now, possible workarounds are Edit > Find > Search Messages, mentioned in comment 17 (repeated in comment 23), and ...

(In reply to github from comment #21)

> As a workaround with this plugin search seems to be better:
> https://addons.mozilla.org/en-US/thunderbird/addon/gmailui/

The updated version is https://addons.thunderbird.net/en-US/thunderbird/addon/expression-search-ng/ If you want to support his work, suggest helping with the code, or donating. Open issues are listed at https://github.com/opto/Expression-Search-NG/issues (this bug report is not a place to discuss this addon).

Sorry, I am not getting this. This is filed not as a bug but as a feature enhancement request, and thus is not a duplicate of my bug report. The fact that search often returns 98% false positives means that search doesn't work; thus it's not a feature request, it's a bug, and should be marked accordingly. As I see it, this behavior should have been visible for a very long time, but for some reason I only bumped into it today. Polite requests, rude requests, it doesn't matter; we go the Microsoft way: it's not a bug, it's a feature.
I don't care about any index, just do a full-text search. I need results, and if that means it eats a thousand times more resources than 20 years ago, OK, do it, and after that we may look at how to make it as fast as in version 2, where it worked well with an amount of mail very similar to what I have today - around ten thousand messages.

The main point is that the amount of text data in emails has not risen much in the meantime, and has probably even fallen, as many mails are just a title and a link instead of the whole message. The number of emails held by active users has not risen much either, and even if it had, the search could run in the background, going deeper and deeper into history. Thus if it was possible to do an exact match search in reasonable time 20 years ago, it should be instant today. I understand that as every developer adds his 1% negligible slowdown, we are doomed to wait as we always have, as the number of developers behind every button rises every year. So it may not be instant, but it should at worst be as slow as it was back then.

Maybe the stemming algorithm is language-related, so maybe using the Swahili version of Thunderbird could solve this. Am I right or not? And if so, is there a way of using the Swahili stemming algorithm in an English- or Czech-language Thunderbird?

Severity: normal → S3
Duplicate of this bug: 1816331
Duplicate of this bug: 1837175

I filed bug 1837175. IMHO software should not make assumptions about what the user wants.
I think (as an outsider): why not take the existing search function and use it for 'global search', allowing it to search over multiple accounts and fields, with a reasonable default (like from, to, subject, message - well, we have an assumption again, but...) that can be modified and saved by the user?
(Thanks for your good work, by the way. I am going to donate some money, so I am ready to give something back.)
