Closed Bug 861108 Opened 12 years ago Closed 12 years ago

Text prediction ignores Polish letters

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Platform: All / Gonk (Firefox OS)
Priority: Not set
Severity: normal
Tracking

(blocking-b2g:tef+, b2g18 fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 verified)

VERIFIED FIXED
1.0.1 Madrid (19apr)

People

(Reporter: stas, Assigned: jj.evelyn)

References

Details

(Whiteboard: [status: needs new patch][madrid] c=keyboard)

Attachments

(1 file)

This happens on v1.0.1 only.

When typing Cześć (Polish for "hello"), as soon as I type "ś" the whole suggestion bar clears and remains empty.

When typing Śmiech (Polish for laughter), with "Ś" the bar shows nothing, and with "Śm" it shows suggestions for "m".

I'm not sure if this happens only with Polish letters.  I tested Spanish, and the special characters seemed to work correctly.
blocking-b2g: tef? → tef+
If this is only on v1.0.1, then that would be a very good reason to uplift the new prediction code to that tree.  But we should give it a while to stabilize before we do that, I suppose.

Alternatively, we could fix this specific bug in the old code, but I think that the desire is to uplift all the new stuff to 1.0.1.
Blocks: 797170
I wonder if this is something stupid in the keyboard code that can be easily resolved without importing a whole new feature to 1.0.1?
hi evelyn, any idea on this? thanks
Flags: needinfo?(ehung)
Assignee: nobody → ehung
I just saw you push a patch on IRC. I'd be happy to review if you'd like.
Thanks to kanru's help, we found it's not a Unicode problem: the regular expression used to match Latin letters in latin.js is too simple and misses a lot of characters.
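The failure mode can be illustrated with a minimal sketch (hypothetical names, not the actual latin.js code): a regexp that only covers ASCII letters treats accented characters as word boundaries, which reproduces both symptoms from the bug description.

```javascript
// Hypothetical sketch of the bug: a regexp that only matches ASCII
// letters treats accented characters like "ś" as non-letters.
const NAIVE_LETTER = /[a-zA-Z]/;

function lastWord(input) {
  // Walk backwards from the end of the input, collecting "letters".
  let word = '';
  for (let i = input.length - 1; i >= 0; i--) {
    if (!NAIVE_LETTER.test(input[i])) break;
    word = input[i] + word;
  }
  return word;
}

lastWord('Czes');  // "Czes"
lastWord('Cześ');  // "" (the ś cuts the candidate, so suggestions vanish)
lastWord('Śm');    // "m" (matches the "suggestions for m" symptom)
```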
Flags: needinfo?(ehung)
Attachment #737524 - Flags: review?(dflanagan)
Comment on attachment 737524 [details]
point to https://github.com/mozilla-b2g/gaia/pull/9183

I'm glad you found that bug. I was worried that we would have to re-generate the dictionaries to fix it.

I'm giving an r- here because the regexp initialization code is in the wrong place, and also because I think a better fix is to just remove that regexp completely.

See the patch for bug 860462: in that, I get rid of LETTER altogether and just use anything other than whitespace as a letter.  The user isn't going to have a keyboard layout that enables them to enter non-Latin letters, so I think we can assume that it will always be latin.  So I recommend that you just copy the relevant changes from 860462 instead of creating this giant regexp.
Attachment #737524 - Flags: review?(dflanagan) → review-
Evelyn, do you have an ETA for a new patch?
Whiteboard: [status: needs new patch]
Whiteboard: [status: needs new patch] → [status: needs new patch][status: ETA 4/19]
(In reply to David Flanagan [:djf] from comment #7)
> Comment on attachment 737524 [details]
> point to https://github.com/mozilla-b2g/gaia/pull/9183
> 
> I'm glad you found that bug. I was worried that we would have to re-generate
> the dictionaries to fix it.
> 
> I'm giving an r- here because the regexp initialization code is in the wrong
> place, and also because I think a better fix is to just remove that regexp
> completely.
> 
> See the patch for bug 860462: in that, I get rid of LETTER altogether and
> just use anything other than whitespace as a letter.  The user isn't going
> to have a keyboard layout that enables them to enter non-Latin letters, so I
> think we can assume that it will always be latin.
It could also be punctuation. If we don't filter punctuation out before prediction, we waste CPU cycles searching for a word that cannot be in the dictionary. In the case of correction, however, taking punctuation into account would be smarter. Since we do prediction and correction at the same time, I think it's just a trade-off: be efficient or be smarter. Is that correct?
Flags: needinfo?(dflanagan)
Whiteboard: [status: needs new patch][status: ETA 4/19] → [status: needs new patch]
You're absolutely correct: you should filter out punctuation. That's a bug in the code I landed.  Thanks for pointing it out. 

But still, instead of enumerating all possible letters in one giant regexp you can probably enumerate all possible non-letters more easily. (Don't treat ' and - as punctuation, though, since they get used in words).
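The suggested inversion could look something like this (a hypothetical sketch, not the reviewed patch; the separator set shown here is illustrative, not exhaustive):

```javascript
// Hypothetical alternative: enumerate separators rather than letters.
// Whitespace and common punctuation end a word, but ' and - do not,
// since they appear inside words ("don't", "well-known").
const WORD_SEPARATOR = /[\s.,;:!?()"\u201C\u201D]/;

function isWordChar(ch) {
  return !WORD_SEPARATOR.test(ch);
}

isWordChar('ś');  // true: accented letters pass without being listed
isWordChar("'");  // true: apostrophes stay inside words
isWordChar('.');  // false
```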
Flags: needinfo?(dflanagan)
I've filed bug 862430 to fix the issue Evelyn points out above in my code on master.
(In reply to David Flanagan [:djf] from comment #10)
> You're absolutely correct: you should filter out punctuation. That's a bug
> in the code I landed.  Thanks for pointing it out. 
> 
> But still, instead of enumerating all possible letters in one giant regexp
> you can probably enumerate all possible non-letters more easily. (Don't
> treat ' and - as punctuation, though, since they get used in words).
Hi David, I'm trying to enumerate all possible non-letters, but I don't think it's any easier. There are two approaches:
1) scan the layout.js file and copy the values of the non-letter keys from all imEngine: 'latin' layouts.
2) look up the Unicode "Symbols and Punctuation" set and pick the subsets we're interested in, e.g. ASCII Punctuation, Latin-1 Punctuation, Arrows, maybe some subsets of "Other Symbols", etc.

Either way is painful, and we also need to handle the return key, backspace key, and so on.
I guess the result won't be smaller than enumerating all possible letters. :-/
What do you think?
Flags: needinfo?(dflanagan)
> 2) look up unicode "Symbols and Punctuation" set, and pick some subsets that
> we're interested, e.g. ASCII Punctuation, Latin-1 Punctuation, Arrows, maybe
> some subsets in "Other Symbols"... etc. 
my reference: http://www.unicode.org/charts/
(In reply to Evelyn Hung [:evelyn] from comment #12)

> 1) scan layout.js file and copy those non-letters keys' value from all
> imEngine: 'latin' layouts.

Why is this approach painful?  It sounds like we could take all Latin letters plus the ones defined in the "alt" dictionary of the currently selected layout.

https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/layout.js#L1030

(In reply to David Flanagan [:djf] from comment #10)
> (Don't treat ' and - as punctuation, though, since they get used in words).

This sounds like an i18n thing: these characters are used in words only in a subset of languages.

Another solution, which would require the recompilation of the dictionaries, would be to expose a list of letters used in all words of the dictionary for the given language.  Is this feasible?
Whiteboard: [status: needs new patch] → [status: needs new patch][madrid]
Stas, David,

My understanding from Evelyn about this bug is that David doesn't think the approach in https://github.com/mozilla-b2g/gaia/pull/9183 makes sense, and he would like to write the regexp in reverse, i.e. filter out spaces, punctuation, and other non-alphabet characters. My concern is that in reality this is an even larger set, given the wide range of Unicode characters, and the possibility that such characters reach our regexp filter not because the user can type them on our vkb, but because they exist in the content of the input currently being edited.

Therefore, I tend to think pull 9183 is the right approach -- it could certainly be improved (e.g. by not re-constructing the regexp on every init(), or by constructing the regexp based on the language we are currently using).
(In reply to Staś Małolepszy :stas (needinfo along with cc, please) from comment #14)
> 
> Another solution, which would require the recompilation of the dictionaries,
> would be to expose a list of letters used in all words of the dictionary for
> the given language.  Is this feasible?

Already done, but not landed yet. It probably won't help here, however, because the patch in question is trying to decide, outside of the prediction engine, what the word boundaries are so it can send an appropriate word to the prediction engine.
Flags: needinfo?(dflanagan)
(In reply to David Flanagan [:djf] from comment #16)
> (In reply to Staś Małolepszy :stas (needinfo along with cc, please) from
> comment #14)
> > 
> > Another solution, which would require the recompilation of the dictionaries,
> > would be to expose a list of letters used in all words of the dictionary for
> > the given language.  Is this feasible?
> 
> Already done, but not landed yet. It probably won't help here, however,
> because the patch is question is trying to decide, outside of the prediction
> engine, what the word boundaries are so it can send an appropriate word to
> the prediction engine.

If this is done, why not go for this fix instead?
Tim, Evelyn: I think we're really over-thinking this. The issue is just a minor optimization: not asking the prediction engine to complete something that we know cannot be a word.  If our word detection code is imperfect and we ask for completions of something like 'l33t', we won't get any completions and no harm is done.

Whitespace and punctuation are the obvious word-separator characters in all Latin languages that I know of.  (Though Stas is correct that hyphens and apostrophes are ambiguous in English and French, at least.)

That's why I suggest a simple, imperfect test for determining word boundaries.  I can even add code to the prediction engine to return immediately if the input text contains characters that do not occur in the dictionary.  That way it wouldn't even be expensive to ask for predictions on invalid words.

If you want to land 9183, I don't really have a strong opinion about that, as long as you put the regexp initialization code in the right place. I always get nervous about giant regexps like that, though.  What is the memory footprint of a compiled regexp that includes about a hundred Unicode characters?  Maybe it's no big deal at all.
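The cheap bail-out described above could be sketched like this (hypothetical code; `DICT_CHARS` stands in for a character set derived from the dictionary, here seeded with the Polish alphabet, and `searchFn` is a placeholder for the real dictionary search):

```javascript
// Hypothetical early return inside the prediction engine: if the
// candidate contains any character absent from the dictionary's
// alphabet, no word can match, so skip the search entirely.
const DICT_CHARS = new Set('aąbcćdeęfghijklłmnńoóprsśtuwyzźż');

function predict(candidate, searchFn) {
  for (const ch of candidate.toLowerCase()) {
    if (!DICT_CHARS.has(ch)) return [];  // e.g. "l33t" yields nothing
  }
  return searchFn(candidate);
}
```

This keeps it cheap to ask for predictions on invalid words, since bad candidates are rejected before any dictionary traversal happens.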
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #17)
> > Already done, but not landed yet. It probably won't help here, however,
> > because the patch is question is trying to decide, outside of the prediction
> > engine, what the word boundaries are so it can send an appropriate word to
> > the prediction engine.
> 
> If this is done, why not go for this fix instead?

Because the list of characters is in the prediction worker, and we need the list of characters outside the worker in latin.js. We could add a communication protocol between the IM and the worker to transfer that data, but it seems unnecessary for this.
Adding whiteboard tags for tracking via srumbu.gs.
Whiteboard: [status: needs new patch][madrid] → c=keyboard
Whiteboard: c=keyboard → [status: needs new patch][madrid] c=keyboard
(In reply to David Flanagan [:djf] from comment #18)
> Tim, Evelyn: I think we're really over-thinking this. The issue is just
> minor optimization of not asking the prediction engine to complete something
> that we know can not be a word.  If our word detection code is imperfect and
> we ask for completions of something like 'l33t' we won't get any completions
> and no harm is done. 
> 
> Whitespace and punctuation are the obvious word-separator characters in all
> Latin languages that I know of.  (Though Stas is correct that hypens and
> apostrophes are ambiguous in English and French, at least).
> 
> That's why I suggest a simple, imperfect test for determining word
> boundaries.  I can even add code to the prediction engine to return
> immediately if the input text contains characters that do not occur in the
> dictionary.  That way it wouldn't even be expensive to ask for predictions
> on invalid words.

Agree. If we already know sending invalid characters into prediction engines doesn't do any harm, let's do that.

> If you want to land 9183, I don't really have a strong opinion about that,
> as long as you put the regexp initialization code in the right place. I
> always get nervous about giant regexps like that, though.  What is the
> memory footprint of a compiled regexp that includes about a hundred Unicode
> characters?  Maybe its no big deal at all.

"Find the memory footprint of that regexp" would be harder than do the simple test :P
Comment on attachment 737524 [details]
point to https://github.com/mozilla-b2g/gaia/pull/9183

I decided to fix the bug as :djf proposed in comment 7. After thinking about it more, I'm convinced that using whitespace (\s) to determine a word boundary is probably better than filtering punctuation out, because:
1. sending invalid characters into the prediction engine doesn't do any harm. (comment 18) - main reason
2. it's complicated to enumerate the right punctuation to filter out once i18n is taken into account. (comment 14)
3. if we support customized word completion in the future, so that the user may select a word and add it to an auto-complete list, then every 'token' could be a valid word.
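The landed approach can be sketched roughly like this (a simplified illustration under the reasoning above, not the actual patch): only whitespace ends a word, so any non-whitespace character, including ś, ć, or even punctuation, is forwarded to the prediction engine, where an invalid candidate harmlessly yields no suggestions.

```javascript
// Sketch of the whitespace-boundary approach: scan backwards from the
// cursor until whitespace, and treat everything else as word content.
function wordBeforeCursor(text, cursor) {
  let start = cursor;
  while (start > 0 && !/\s/.test(text[start - 1])) {
    start--;
  }
  return text.substring(start, cursor);
}

wordBeforeCursor('Mówię: Cześć', 12);  // "Cześć"
```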
Attachment #737524 - Attachment mime type: text/plain → text/html
Attachment #737524 - Flags: review- → review?(dflanagan)
Target Milestone: --- → Madrid (19apr)
Comment on attachment 737524 [details]
point to https://github.com/mozilla-b2g/gaia/pull/9183

Looks good!
Attachment #737524 - Flags: review?(dflanagan) → review+
merged into gaia v1.0.1 branch.
https://github.com/mozilla-b2g/gaia/commit/4e08c2bbec97273a97753250a047fb18acc0bffb
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to David Flanagan [:djf] from comment #23)
> Comment on attachment 737524 [details]
> point to https://github.com/mozilla-b2g/gaia/pull/9183
> 
> Looks good!
Thank you, David. 
Nice to have a long discussion here; it helped me get to know the keyboard code better. :)
also uplift to v1-train:

https://github.com/mozilla-b2g/gaia/commit/21ff040abc73842b1304790b574a7fceea270451
Verified fixed
gecko: 6dfd179008acf965412d98bebc97f60dd99e2376
gaia: c47ef39de04e634d847cc86b6acc616fabce69eb
BuildID 20130426070207
Version 18.0
Status: RESOLVED → VERIFIED
Hum, I take back what I said. It seems this is resolved on v1.0.1 (cf. comment above), but trying on v1 just gives me one option for the S and C letters in Polish, whereas there are more options on the v1.0.1 keyboard...
Therefore flagging accordingly

v1
gecko: bdbd181fa434d15976a2ae965d82ec7fe85a963f
gaia: c9046a7acef33328977840176ff5574720d2c74c
BuildID 20130426070204
Version 18.0
Status: VERIFIED → RESOLVED
Closed: 12 years ago12 years ago
$ git branch --contains 21ff040abc73842b1304790b574a7fceea270451
  v1-train

This patch is definitely on v1-train.

Evelyn, can you please check whether this patch still fixes v1-train?
Flags: needinfo?(ehung)
I tried it and it works, the same as on v1.0.1. Can you verify it again? Please note the patch landed on v1.0.1 and v1-train, not on gaia master.
Flags: needinfo?(ehung)
Flags: needinfo?(lebedel.delphine)
(In reply to Evelyn Hung [:evelyn] from comment #30)
> I tried and it works as v1.0.1. Can you verify it again? Please note the
> patch is landed on v1.0.1 and v1-train, not on gaia master.

Evelyn, comment 28 says that this works on v1.0.1, but not 'v1'.  I interpreted 'v1' to mean 'v1-train'.  Can you try this again on v1-train to verify that it works there?
Flags: needinfo?(ehung)
(In reply to John Ford [:jhford] -- If you expect a reply from me, use needsinfo? instead of CC from comment #31)
> 
> Evelyn, comment 28 says that this works on v1.0.1, but not 'v1'.  I
> interpreted 'v1' to mean 'v1-train'.  Can you try this again on v1-train to
> verify that it works there?

Yes that's what I meant. 
So I've just tested this again today on the Unagi pvt nightly v1-train build, and now it seems to be OK:
Gaia   94f03a82bc66ad04352d127747ca226368d96363
BuildID 20130506105248
Version 18.0

I don't really know why, though, so if Evelyn can verify as well for a double check, that would be great.
Flags: needinfo?(lebedel.delphine)
(In reply to John Ford [:jhford] -- If you expect a reply from me, use needsinfo? instead of CC from comment #31)
> (In reply to Evelyn Hung [:evelyn] from comment #30)
> > I tried and it works as v1.0.1. Can you verify it again? Please note the
> > patch is landed on v1.0.1 and v1-train, not on gaia master.
> 
> Evelyn, comment 28 says that this works on v1.0.1, but not 'v1'.  I
> interpreted 'v1' to mean 'v1-train'.  Can you try this again on v1-train to
> verify that it works there?

I don't know why I said v1.0.1, sorry, I actually tested on v1-train.

(In reply to delphine from comment #32) 
> I don't really know why though, so if Evelyne can verify as well for a
> double check that would be great
Thanks! I checked again, it works fine.
Flags: needinfo?(ehung)
Per comment 33, it looks like this is fixed on branch then.
Issue on Leo v1.1 build.
When typing "Cześć" (Polish for "hello"), the word "Cześć" is not suggested.
Different words are suggested, but not the simple Polish word for "hello".
On the Inari v1.0.1 build, the "Cześć" suggestion appears right after the third letter, "e".

Environmental Variables:
Leo Build ID: 20130515070208
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/d06cfe7d67c2
Gaia: 0ddb515f15cbc6b74fc2742b7599d6ae74c6413f

Environmental Variables:
Inari Build ID: 20130515070203
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/0b6bcb1f4175
Gaia: 9648799c2e45917ff150fa9eef8aeac79a9ac008
Blocks: 873934
Word Suggestions Text Prediction now correctly recognizes Polish characters. 
Tested b2g/nightly/mozilla-central-unagi/latest

Kernel Date: Dec 5
Gecko: http://hg.mozilla.org/mozilla-central/rev/3c6f2394995d
Gaia: d2ad0f1a5a6be8a2d6df721ba3c2b33e037c423b
Status: RESOLVED → VERIFIED
Please add a testcase for this bug to moztrap for 1.1 testsuite.  If yes, mark this in-moztrap+ when completed.  If not, mark this in-moztrap-.
Flags: in-moztrap?(cschmoeckel)
Added Keyboard Suite Test Case #8449 - [Keyboard] Auto Correct's text prediction works correctly in the Polish Language
Flags: in-moztrap?(cschmoeckel) → in-moztrap+
This patch was backed out of v1-train as part of bug 873934, and was replaced by equivalent code uplifted from master.  It remains fixed on v1-train.