Closed Bug 858138 Opened 12 years ago Closed 12 years ago

[keyboard] auto-correct needs probabilities from the prediction engine

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: djf, Unassigned)

References

Details

The prediction engine currently returns three word suggestions to us. They are supposed to be the three most commonly used words that match (fuzzily) the user's input so far.

For auto-correct, we can't just always use the most common word that matches. I think we only want to alter the user's input if we think it is very likely that the correction is what the user wanted.

If I type the letter "t", there are many common words that I might be typing.  "the" is probably the most common word. But there are hundreds of other common words that begin with t also.  So our auto-correction should not automatically convert "t" to "the".  

It is easy to make the prediction engine return the word frequency along with each suggested word, but this isn't quite what we need.  For auto-correction, I'd like to know the frequency of the word "the" weighted by the frequencies of all other candidates that begin with "t".  When I type "t", there is probably a < 10% chance that I actually intend to type the word "the". I probably don't want to auto-correct unless the probability is > 25% or something.

So anyway, I'm not so interested in the frequency of each word in the language as a whole, but instead the frequency of the word relative to the frequencies of the other words in the search space of possible matches. Computing this will require changes to the search algorithm to retain this information during the search.
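The idea described above could be sketched like this. This is a minimal illustration, not actual Gaia keyboard code; the names (conditionalProbability, shouldAutoCorrect, THRESHOLD) and the frequencies are all made up for the example:

```javascript
// Given the candidate words (with raw frequencies) that match the
// current prefix, compute the probability of the top candidate
// relative to all candidates, and only auto-correct when it clears
// a tuneable threshold.
function conditionalProbability(candidates, word) {
  // candidates: array of { word, freq } returned for the current prefix
  var total = candidates.reduce(function (sum, c) { return sum + c.freq; }, 0);
  var match = candidates.filter(function (c) { return c.word === word; })[0];
  return match ? match.freq / total : 0;
}

var THRESHOLD = 0.25; // "> 25% or something"

function shouldAutoCorrect(candidates) {
  return candidates.length > 0 &&
    conditionalProbability(candidates, candidates[0].word) >= THRESHOLD;
}

// With many strong candidates for "t", "the" alone does not dominate:
var tCandidates = [
  { word: 'the', freq: 220 },
  { word: 'to', freq: 200 },
  { word: 'that', freq: 190 },
  { word: 'this', freq: 180 },
  { word: 'they', freq: 170 }
  // ...hundreds more in a real lexicon
];
// shouldAutoCorrect(tCandidates) is false here: 220/960 < 0.25
```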
Blocks: 797170
I've decided that a simpler heuristic is good enough: if the number associated with the first suggestion is significantly higher than the number associated with the second suggestion (where "significantly" is a tuneable parameter), then the first suggestion should be used for auto-correction.

This is what I did for bug 860462, and it seems to be working reasonably, so I'm going to close this bug.
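The heuristic can be sketched roughly as follows. The real tuning lives in bug 860462; the function name and the ratio value here are illustrative only:

```javascript
// Compare the first suggestion's weight against the second's; only
// auto-correct when the first is "significantly" higher, where
// AUTO_CORRECT_RATIO is the tuneable parameter.
var AUTO_CORRECT_RATIO = 1.5;

// suggestions: array of { word, weight }, sorted by descending weight
function pickAutoCorrection(suggestions) {
  if (suggestions.length === 0) return null;
  if (suggestions.length === 1) return suggestions[0].word;
  if (suggestions[0].weight >= AUTO_CORRECT_RATIO * suggestions[1].weight)
    return suggestions[0].word;
  return null; // no clear winner, so leave the user's input alone
}
```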
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
For the record, in case we need to come back to this and re-open it, here are the rambling thoughts I wrote up while trying to decide if we really needed to do this:

 Need to think about frequency some more. We need word frequencies to
 build this dictionary, and we may need to keep frequencies in the
 file for the search algorithm (do we?). But that really isn't what we
 want for word suggestions and auto-correction.

 If we know that the user has typed some prefix p, the question we're
 asking when doing auto correction and prediction is "of all the words
 beginning with p, what word is most likely and what is the likelihood
 that the user meant to type it?" (Or, instead of "of all the words"
 maybe we want to ask "of all the words up to 2 times the length of
 p") If there are lots of high-frequency words beginning with p, then
 we may be able to say which is most likely, but if there are other
 words that are almost as likely, our confidence in the suggestion
 will be low, and it will not be a good choice to auto-correct. For
 auto correction we need to know the confidence as well as the
 frequency.

 Or do we? Currently I just compare the weight of the first suggestion
 to the weight of the second. If it is significantly higher, I
 autocorrect. That may actually be fine.

 For suggestions and corrections, we're not just going to look at the
 prefix p, we're also going to assume that the user could have
 mistyped and consider other prefixes p' that are similar. If there
 are words that begin with p' that have higher frequencies than the 3
 highest frequencies for p (even after being weighted depending on how
 unlikely the hypothetical mistyping is) then those words will end up
 on the list of suggestions. So if we're going to assign a confidence
 value to the suggestions, we'd have to compare words not just against
 the universe of words with the prefix p, but all words that begin
 with similar prefixes.  That's not something we can precompute.

 It would still be elegant if we could convert from word frequencies
 into probabilities of some kind. (It would also help with
 scaling... Once we get to long, unlikely words, we wouldn't have to
 be up against a lower limit of 1 and 2 for frequency. If we're
 talking about the probability of the word among words that begin with
 the prefix, the numbers might be higher and we could discriminate
 more accurately.)

 Can I do that? If the most common word 'the' is about 5% of English
 words, then its 220 frequency means 220/4400, and we could convert
 all of the frequencies to probabilities by dividing by 4400.  (Better
 would probably be to add up all the frequencies to get a total and
 divide them all by that to give the probability of each word if words
 were being picked randomly by throwing darts at the corpus.)

 We can do the same thing for any prefix: add up the frequencies of
 all the words that begin with that prefix and divide each one by that
 total, possibly limiting the search to words that aren't too much
 longer than the prefix length.  Somehow then we'd need to scale these
 numbers so that they didn't get too small. To allow comparisons with
 other prefixes p', we'd have to look at the weights for all prefixes
 of length n and scale them all the same. Then do it for prefixes of
 length n+1, etc.
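 That precomputation could be sketched like this; it is a back-of-the-envelope illustration with a hypothetical helper name, not a proposal for the dictionary format:

```javascript
// For each prefix of a fixed length, sum the frequencies of the words
// sharing that prefix, then rescale each word's frequency to an
// integer weight relative to that total.
function normalizeByPrefix(lexicon, prefixLen, scale) {
  // lexicon: array of { word, freq }
  var totals = {};
  lexicon.forEach(function (e) {
    var p = e.word.slice(0, prefixLen);
    totals[p] = (totals[p] || 0) + e.freq;
  });
  return lexicon.map(function (e) {
    var p = e.word.slice(0, prefixLen);
    // Integer weights keep rare words distinguishable instead of
    // pinning them against a floor of frequency 1 or 2.
    return { word: e.word, weight: Math.round(scale * e.freq / totals[p]) };
  });
}
```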

 Our data structure depends on being able to find the highest
 probability word at a given node by traversing the center pointer
 chain straight down.  We can't do that if the way the words are
 ranked changes with prefix length. If we can be certain that the
 highest frequency word will always remain the highest ranked, then
 this works. But we can only be certain of that if we always consider
 all words and don't limit the universe of words based on the prefix
 length. 

 Really, I'm no longer sure any of this matters. Word frequency may be
 a good enough proxy for word probability and maybe it's just fine the
 way it is.
Blocks: 873934