Bug 544580 (NEW, open) — gloda tokenizer could probably do a better job of indexing numbers
Opened 15 years ago; updated 2 years ago
Product/Component: Thunderbird :: Search (defect)
Tracking: not tracked
Reporter: asuth; Assignee: unassigned
References: depends on 1 open bug; blocks 1 open bug
Whiteboard: [gloda key][tokenizer key]
Here's the deal on how this works right now:
- The copy stemmer takes at most the first 3 characters and last 3 characters from a string involving digits. So "123456789" is emitted as "123789". As long as the user queries on "123456789" they will get what they expect, but they will also get any other results that look like "123*789".
- We stop on punctuation. So "100,000,000" gets emitted as "100", "000", "000", and "1-555-555-5555" gets emitted as "1", "555", "555", "5555". While phone number detection can be handled at a higher level, unless we start forcibly intercepting queries and shunting them to higher-level searches (rather than hinting with autocomplete), this can still result in some ridiculously expensive queries.
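The two behaviors above can be approximated with a rough Python sketch (illustrative only — the actual gloda tokenizer is C++ code inside the SQLite fts3 layer, and the constant name here is assumed):

```python
import re

COPY_STEMMER_MAX = 6  # assumed: first 3 + last 3 characters are kept

def current_tokens(text):
    """Approximate the current behavior: split on punctuation, then
    truncate long digit-bearing tokens to first 3 + last 3 characters."""
    tokens = []
    for tok in re.split(r"[^A-Za-z0-9]+", text):  # stop on punctuation
        if not tok:
            continue
        if any(c.isdigit() for c in tok) and len(tok) > COPY_STEMMER_MAX:
            tok = tok[:3] + tok[-3:]  # copy-stemmer truncation
        tokens.append(tok)
    return tokens

print(current_tokens("123456789"))       # ['123789']
print(current_tokens("100,000,000"))     # ['100', '000', '000']
print(current_tokens("1-555-555-5555"))  # ['1', '555', '555', '5555']
```

This reproduces both failure modes described above: the long number is truncated so that any "123*789" string matches it, and the phone number shatters into short, extremely common tokens.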
The most likely solution to this problem would be to do the following:
- Increase the copy stemmer constant so that it at least goes up to a full phone number with 4 digit extension.
- Further specialize the state machine so that it distinguishes between ASCII digits and ASCII letters and treat limited punctuation between digits as invisible. For example ",.-" would probably be good candidates.
I think the driving motivation here would be to constrain the potential set of matches rather than guarantee the user gets the right match. For example, apart from the ambiguity between US (1,000) and European (1.000) style delimiters, the user probably wouldn't care about the number of cents involved if they were searching for a dollar amount. However, from a search-space perspective, interpreting "1,122.00" as "1122" and "00" is going to flood us with bogus results on the "00" token. Interpreting it as "112200" is much saner, and could still turn out okay for the user if promoted to a wildcard search.
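The proposed digit-aware handling could be sketched as follows (again illustrative Python, not the actual implementation; the punctuation set ",.-" is the candidate set suggested above):

```python
import re

def proposed_tokens(text):
    """Sketch of the proposal: treat ',' '.' '-' as invisible when
    sandwiched between two digits, then split on remaining punctuation."""
    collapsed = re.sub(r"(?<=\d)[,.\-](?=\d)", "", text)
    return [t for t in re.split(r"[^A-Za-z0-9]+", collapsed) if t]

print(proposed_tokens("1,122.00"))        # ['112200']
print(proposed_tokens("1-555-555-5555"))  # ['15555555555']
```

Note that for this to help, the copy-stemmer length constant would also need to be raised (the first bullet above), since an 11-digit phone-number token would otherwise still be truncated.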
Reporter updated 15 years ago:
Whiteboard: [gloda key][tokenizer key]
Comment 2 (Reporter), 15 years ago:
The thing I duped in was a request for finding version numbers like "3.6.4". Same general problem, although in that case we likely do not want to elide the punctuation.
Comment 3, 14 years ago:
I was searching today for my email announcement of Thunderbird 3.1.7 using the string "3.1.7" and "3 1 7". Neither search returned any results, even though I've located emails (using gmail) with that string.
I think it makes sense to index "3.1.7" as a word and make it searchable.
Updated 14 years ago:
Flags: wanted-thunderbird+
Updated 13 years ago:
Blocks: glodafailtracker
Comment 4, 13 years ago:
(exactly what I mentioned to protz on IRC yesterday)
Protz, will this be helped by bug 681754 ("gloda fts3 tokenizer would greatly benefit from stopword support")?
And how is this related to bug 549594 ("GlodaMsgSearcher needs to avoid generating clauses that the tokenizer will eat")?
(In reply to John Hopkins (:jhopkins) from comment #3)
> I was searching today for my email announcement of Thunderbird 3.1.7 using
> the string "3.1.7" and "3 1 7". Neither search returned any results, even
> though I've located emails (using gmail) with that string.
>
> I think it makes sense to index "3.1.7" as a word and make it searchable.
Comment 5, 13 years ago:
This bug is specifically about improving the tokenizer so that it emits better tokens when indexing numbers. However, if we are to take any action on this bug or bug 681754, then we had better make sure we fix bug 549594 first, otherwise the situation will become a real mess.
The problem you're having is that the part of gloda that builds the query doesn't behave exactly like the tokenizer, so gloda thinks the search terms it passes to the SQLite search will yield valid results, when in fact they're not valid tokens, so there's no chance they'll match anything.
To make sure gloda only issues valid search terms, we need to run the query through the tokenizer first, and that is bug 549594.
I'm updating the dependencies to reflect my comment.
Depends on: 549594
Comment 11, 5 years ago:
I don't know if I am experiencing this bug, but I have an email with the words "Electrolux 5303918344" in the subject. If I search "5303918344" the email is not found; OTOH, if I search "Electrolux" it is found. I'm using Thunderbird 68.8.0.
Updated 2 years ago:
Severity: normal → S3