Closed Bug 861108 Opened 12 years ago Closed 12 years ago

Text prediction ignores Polish letters

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Platform: All / Gonk (Firefox OS)
Priority: Not set
Severity: normal
Tracking

(blocking-b2g:tef+, b2g18 fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 verified)

VERIFIED FIXED
1.0.1 Madrid (19apr)

People

(Reporter: stas, Assigned: jj.evelyn)

References

Details

(Whiteboard: [status: needs new patch][madrid] c=keyboard)

Attachments

(1 file)

This happens on v1.0.1 only.

When typing Cześć (Polish for "hello"), as soon as I type "ś" the whole suggestion bar clears and remains empty.

When typing Śmiech (Polish for laughter), with "Ś" the bar shows nothing, and with "Śm" it shows suggestions for "m".

I'm not sure if this happens only with Polish letters.  I tested Spanish, and the special characters seemed to work correctly.
blocking-b2g: tef? → tef+
If this is only on v1.0.1, then that would be a very good reason to uplift the new prediction code to that tree.  But we should give it a while to stabilize before we do that, I suppose.

Alternatively, we could fix this specific bug in the old code, but I think that the desire is to uplift all the new stuff to 1.0.1.
Blocks: 797170
I wonder if this is something stupid in the keyboard code that can be easily resolved without importing a whole new feature to 1.0.1?
hi evelyn, any idea on this? thanks
Flags: needinfo?(ehung)
Assignee: nobody → ehung
I just saw you push a patch on IRC. I'd be happy to review if you'd like.
Thanks to kanru's help, we found it's not a Unicode problem: the regular expression used to match Latin letters in latin.js is too simple and misses a lot of characters.
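The failure mode can be illustrated with a minimal sketch (hypothetical names, not the actual latin.js code): a regexp that only covers ASCII letters treats accented characters as word boundaries, which reproduces both symptoms from the bug description.

```javascript
// Hypothetical sketch of the bug: a regexp that only matches ASCII
// letters treats accented characters like "ś" as non-letters.
const NAIVE_LETTER = /[a-zA-Z]/;

function lastWord(input) {
  // Walk backwards from the end of the input, collecting "letters".
  let word = '';
  for (let i = input.length - 1; i >= 0; i--) {
    if (!NAIVE_LETTER.test(input[i])) break;
    word = input[i] + word;
  }
  return word;
}

lastWord('Czes');  // "Czes"
lastWord('Cześ');  // "" (the ś cuts the candidate, so suggestions vanish)
lastWord('Śm');    // "m" (matches the "suggestions for m" symptom)
```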
Flags: needinfo?(ehung)
Attachment #737524 - Flags: review?(dflanagan)
Comment on attachment 737524 [details]
point to https://github.com/mozilla-b2g/gaia/pull/9183

I'm glad you found that bug. I was worried that we would have to re-generate the dictionaries to fix it.

I'm giving an r- here because the regexp initialization code is in the wrong place, and also because I think a better fix is to just remove that regexp completely.

See the patch for bug 860462: in that, I get rid of LETTER altogether and just use anything other than whitespace as a letter.  The user isn't going to have a keyboard layout that enables them to enter non-Latin letters, so I think we can assume that it will always be latin.  So I recommend that you just copy the relevant changes from 860462 instead of creating this giant regexp.
Attachment #737524 - Flags: review?(dflanagan) → review-
Evelyn, do you have an ETA for a new patch?
Whiteboard: [status: needs new patch]
Whiteboard: [status: needs new patch] → [status: needs new patch][status: ETA 4/19]
(In reply to David Flanagan [:djf] from comment #7)
> Comment on attachment 737524 [details]
> point to https://github.com/mozilla-b2g/gaia/pull/9183
> 
> I'm glad you found that bug. I was worried that we would have to re-generate
> the dictionaries to fix it.
> 
> I'm giving an r- here because the regexp initialization code is in the wrong
> place, and also because I think a better fix is to just remove that regexp
> completely.
> 
> See the patch for bug 860462: in that, I get rid of LETTER altogether and
> just use anything other than whitespace as a letter.  The user isn't going
> to have a keyboard layout that enables them to enter non-Latin letters, so I
> think we can assume that it will always be latin.
It could also be punctuation. If we don't filter punctuation out before prediction, we waste CPU cycles searching for a word that cannot be in the dictionary. In the case of correction, however, taking punctuation into account would be smarter. Since we do prediction and correction at the same time, I think it's just a trade-off: be efficient or be smarter. Is that correct?
Flags: needinfo?(dflanagan)
Whiteboard: [status: needs new patch][status: ETA 4/19] → [status: needs new patch]
You're absolutely correct: you should filter out punctuation. That's a bug in the code I landed.  Thanks for pointing it out. 

But still, instead of enumerating all possible letters in one giant regexp you can probably enumerate all possible non-letters more easily. (Don't treat ' and - as punctuation, though, since they get used in words).
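The suggested inversion could look something like this (a hypothetical sketch, not the reviewed patch; the separator set shown here is illustrative, not exhaustive):

```javascript
// Hypothetical alternative: enumerate separators rather than letters.
// Whitespace and common punctuation end a word, but ' and - do not,
// since they appear inside words ("don't", "well-known").
const WORD_SEPARATOR = /[\s.,;:!?()"\u201C\u201D]/;

function isWordChar(ch) {
  return !WORD_SEPARATOR.test(ch);
}

isWordChar('ś');  // true: accented letters pass without being listed
isWordChar("'");  // true: apostrophes stay inside words
isWordChar('.');  // false
```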
Flags: needinfo?(dflanagan)
I've filed bug 862430 to fix the issue Evelyn points out above in my code on master.
(In reply to David Flanagan [:djf] from comment #10)
> You're absolutely correct: you should filter out punctuation. That's a bug
> in the code I landed.  Thanks for pointing it out. 
> 
> But still, instead of enumerating all possible letters in one giant regexp
> you can probably enumerate all possible non-letters more easily. (Don't
> treat ' and - as punctuation, though, since they get used in words).
Hi David, I'm trying to enumerate all possible non-letters, but I don't think it's any easier. There are two approaches:
1) scan the layout.js file and copy the values of the non-letter keys from all imEngine: 'latin' layouts.
2) look up the Unicode "Symbols and Punctuation" set and pick the subsets we're interested in, e.g. ASCII Punctuation, Latin-1 Punctuation, Arrows, maybe some subsets of "Other Symbols", etc.

Either way is painful, and we also need to handle the return key, backspace key, and so on.
I guess the result won't be smaller than enumerating all possible letters. :-/
What do you think?
Flags: needinfo?(dflanagan)
> 2) look up unicode "Symbols and Punctuation" set, and pick some subsets that
> we're interested, e.g. ASCII Punctuation, Latin-1 Punctuation, Arrows, maybe
> some subsets in "Other Symbols"... etc. 
my reference: http://www.unicode.org/charts/
(In reply to Evelyn Hung [:evelyn] from comment #12)

> 1) scan layout.js file and copy those non-letters keys' value from all
> imEngine: 'latin' layouts.

Why is this approach painful?  It sounds like we could take all Latin letters plus the ones defined in the "alt" dictionary of the currently selected layout.

https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/layout.js#L1030

(In reply to David Flanagan [:djf] from comment #10)
> (Don't treat ' and - as punctuation, though, since they get used in words).

This sounds like an i18n thing: these characters are used in words only in a subset of languages.

Another solution, which would require the recompilation of the dictionaries, would be to expose a list of letters used in all words of the dictionary for the given language.  Is this feasible?
Whiteboard: [status: needs new patch] → [status: needs new patch][madrid]
Stas, David,

My understanding from Evelyn about this bug is that David doesn't think the approach in https://github.com/mozilla-b2g/gaia/pull/9183 makes sense, and he would like to write the regexp in reverse, i.e. filter out spaces, punctuation, and other non-alphabet characters. My concern is that in reality this is an even larger set, given the wide range of Unicode characters, and the possibility that such characters reach our regexp filter not because the user can type them on our vkb, but because they exist in the content of the input currently being edited.

Therefore, I tend to think pull 9183 is the right approach -- it could certainly be improved (e.g. by not re-constructing the regexp on every init(), or by constructing the regexp based on the language we are currently using).
(In reply to Staś Małolepszy :stas (needinfo along with cc, please) from comment #14)
> 
> Another solution, which would require the recompilation of the dictionaries,
> would be to expose a list of letters used in all words of the dictionary for
> the given language.  Is this feasible?

Already done, but not landed yet. It probably won't help here, however, because the patch in question is trying to decide, outside of the prediction engine, what the word boundaries are so it can send an appropriate word to the prediction engine.
Flags: needinfo?(dflanagan)
(In reply to David Flanagan [:djf] from comment #16)
> (In reply to Staś Małolepszy :stas (needinfo along with cc, please) from
> comment #14)
> > 
> > Another solution, which would require the recompilation of the dictionaries,
> > would be to expose a list of letters used in all words of the dictionary for
> > the given language.  Is this feasible?
> 
> Already done, but not landed yet. It probably won't help here, however,
> because the patch is question is trying to decide, outside of the prediction
> engine, what the word boundaries are so it can send an appropriate word to
> the prediction engine.

If this is done, why not go for this fix instead?
Tim, Evelyn: I think we're really over-thinking this. The issue is just a minor optimization: not asking the prediction engine to complete something that we know cannot be a word.  If our word detection code is imperfect and we ask for completions of something like 'l33t', we won't get any completions and no harm is done.

Whitespace and punctuation are the obvious word-separator characters in all Latin languages that I know of.  (Though Stas is correct that hyphens and apostrophes are ambiguous in English and French, at least.)

That's why I suggest a simple, imperfect test for determining word boundaries.  I can even add code to the prediction engine to return immediately if the input text contains characters that do not occur in the dictionary.  That way it wouldn't even be expensive to ask for predictions on invalid words.

If you want to land 9183, I don't really have a strong opinion about that, as long as you put the regexp initialization code in the right place. I always get nervous about giant regexps like that, though.  What is the memory footprint of a compiled regexp that includes about a hundred Unicode characters?  Maybe it's no big deal at all.
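The cheap bail-out described above could be sketched like this (hypothetical code; `DICT_CHARS` stands in for a character set derived from the dictionary, here seeded with the Polish alphabet, and `searchFn` is a placeholder for the real dictionary search):

```javascript
// Hypothetical early return inside the prediction engine: if the
// candidate contains any character absent from the dictionary's
// alphabet, no word can match, so skip the search entirely.
const DICT_CHARS = new Set('aąbcćdeęfghijklłmnńoóprsśtuwyzźż');

function predict(candidate, searchFn) {
  for (const ch of candidate.toLowerCase()) {
    if (!DICT_CHARS.has(ch)) return [];  // e.g. "l33t" yields nothing
  }
  return searchFn(candidate);
}
```

This keeps it cheap to ask for predictions on invalid words, since bad candidates are rejected before any dictionary traversal happens.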
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #17)
> > Already done, but not landed yet. It probably won't help here, however,
> > because the patch is question is trying to decide, outside of the prediction
> > engine, what the word boundaries are so it can send an appropriate word to
> > the prediction engine.
> 
> If this is done, why not go for this fix instead?

Because the list of characters is in the prediction worker, and we need the list of characters outside the worker in latin.js. We could add a communication protocol between the IM and the worker to transfer that data, but it seems unnecessary for this.
Adding whiteboard tags for tracking via srumbu.gs.
Whiteboard: [status: needs new patch][madrid] → c=keyboard
Whiteboard: c=keyboard → [status: needs new patch][madrid] c=keyboard
(In reply to David Flanagan [:djf] from comment #18)
> Tim, Evelyn: I think we're really over-thinking this. The issue is just
> minor optimization of not asking the prediction engine to complete something
> that we know can not be a word.  If our word detection code is imperfect and
> we ask for completions of something like 'l33t' we won't get any completions
> and no harm is done. 
> 
> Whitespace and punctuation are the obvious word-separator characters in all
> Latin languages that I know of.  (Though Stas is correct that hypens and
> apostrophes are ambiguous in English and French, at least).
> 
> That's why I suggest a simple, imperfect test for determining word
> boundaries.  I can even add code to the prediction engine to return
> immediately if the input text contains characters that do not occur in the
> dictionary.  That way it wouldn't even be expensive to ask for predictions
> on invalid words.

Agree. If we already know sending invalid characters into prediction engines doesn't do any harm, let's do that.

> If you want to land 9183, I don't really have a strong opinion about that,
> as long as you put the regexp initialization code in the right place. I
> always get nervous about giant regexps like that, though.  What is the
> memory footprint of a compiled regexp that includes about a hundred Unicode
> characters?  Maybe its no big deal at all.

"Find the memory footprint of that regexp" would be harder than do the simple test :P
Comment on attachment 737524 [details]
point to https://github.com/mozilla-b2g/gaia/pull/9183

I decided to fix the bug as :djf proposed in comment 7. After thinking about it more, I'm convinced that using whitespace (\s) to determine a word boundary is probably better than filtering punctuation out, because:
1. sending invalid characters into the prediction engine doesn't do any harm. (comment 18) - main reason
2. it's complicated to enumerate the right punctuation to filter out once i18n is taken into account. (comment 14)
3. if we support customized word completion in the future, so that the user may select a word and add it to an auto-complete list, then every 'token' could be a valid word.
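The landed approach can be sketched roughly like this (a simplified illustration under the reasoning above, not the actual patch): only whitespace ends a word, so any non-whitespace character, including ś, ć, or even punctuation, is forwarded to the prediction engine, where an invalid candidate harmlessly yields no suggestions.

```javascript
// Sketch of the whitespace-boundary approach: scan backwards from the
// cursor until whitespace, and treat everything else as word content.
function wordBeforeCursor(text, cursor) {
  let start = cursor;
  while (start > 0 && !/\s/.test(text[start - 1])) {
    start--;
  }
  return text.substring(start, cursor);
}

wordBeforeCursor('Mówię: Cześć', 12);  // "Cześć"
```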
Attachment #737524 - Attachment mime type: text/plain → text/html
Attachment #737524 - Flags: review- → review?(dflanagan)
Target Milestone: --- → Madrid (19apr)
Comment on attachment 737524 [details]
point to https://github.com/mozilla-b2g/gaia/pull/9183

Looks good!
Attachment #737524 - Flags: review?(dflanagan) → review+
merged into gaia v1.0.1 branch.
https://github.com/mozilla-b2g/gaia/commit/4e08c2bbec97273a97753250a047fb18acc0bffb
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to David Flanagan [:djf] from comment #23)
> Comment on attachment 737524 [details]
> point to https://github.com/mozilla-b2g/gaia/pull/9183
> 
> Looks good!
Thank you, David. 
Nice to have a long discussion here; it helped me get to know the keyboard code better. :)
also uplift to v1-train:

https://github.com/mozilla-b2g/gaia/commit/21ff040abc73842b1304790b574a7fceea270451
Verified fixed
gecko: 6dfd179008acf965412d98bebc97f60dd99e2376
gaia: c47ef39de04e634d847cc86b6acc616fabce69eb
BuildID 20130426070207
Version 18.0
Status: RESOLVED → VERIFIED
Hum, I take back what I said. It seems this is resolved on v1.0.1 (cf. comment above), but trying on v1 just gives me one option for the S and C letters in Polish, whereas there are more options on the v1.0.1 keyboard...
Therefore flagging accordingly

v1
gecko: bdbd181fa434d15976a2ae965d82ec7fe85a963f
gaia: c9046a7acef33328977840176ff5574720d2c74c
BuildID 20130426070204
Version 18.0
Status: VERIFIED → RESOLVED
Closed: 12 years ago12 years ago
$ git branch --contains 21ff040abc73842b1304790b574a7fceea270451
  v1-train

This patch is definitely on v1-train.

Evelyn, can you please check whether this patch still fixes v1-train?
Flags: needinfo?(ehung)
I tried it and it works, the same as on v1.0.1. Can you verify it again? Please note the patch landed on v1.0.1 and v1-train, not on gaia master.
Flags: needinfo?(ehung)
Flags: needinfo?(lebedel.delphine)
(In reply to Evelyn Hung [:evelyn] from comment #30)
> I tried and it works as v1.0.1. Can you verify it again? Please note the
> patch is landed on v1.0.1 and v1-train, not on gaia master.

Evelyn, comment 28 says that this works on v1.0.1, but not 'v1'.  I interpreted 'v1' to mean 'v1-train'.  Can you try this again on v1-train to verify that it works there?
Flags: needinfo?(ehung)
(In reply to John Ford [:jhford] -- If you expect a reply from me, use needsinfo? instead of CC from comment #31)
> 
> Evelyn, comment 28 says that this works on v1.0.1, but not 'v1'.  I
> interpreted 'v1' to mean 'v1-train'.  Can you try this again on v1-train to
> verify that it works there?

Yes that's what I meant. 
So I've just tested this again today on the Unagi pvt nightly v1-train build, and now it seems to be OK:
Gaia   94f03a82bc66ad04352d127747ca226368d96363
BuildID 20130506105248
Version 18.0

I don't really know why, though, so if Evelyn can verify as well for a double check, that would be great.
Flags: needinfo?(lebedel.delphine)
(In reply to John Ford [:jhford] -- If you expect a reply from me, use needsinfo? instead of CC from comment #31)
> (In reply to Evelyn Hung [:evelyn] from comment #30)
> > I tried and it works as v1.0.1. Can you verify it again? Please note the
> > patch is landed on v1.0.1 and v1-train, not on gaia master.
> 
> Evelyn, comment 28 says that this works on v1.0.1, but not 'v1'.  I
> interpreted 'v1' to mean 'v1-train'.  Can you try this again on v1-train to
> verify that it works there?

I don't know why I said v1.0.1, sorry, I actually tested on v1-train.

(In reply to delphine from comment #32) 
> I don't really know why though, so if Evelyne can verify as well for a
> double check that would be great
Thanks! I checked again, it works fine.
Flags: needinfo?(ehung)
Per comment 33, it looks like this is fixed on branch then.
Issue on Leo v1.1 build.
When typing "Cześć" (Polish for "hello"), the word "Cześć" is not suggested.
Different words are suggested, but not the simple Polish word for "hello".
On the Inari v1.0.1 build, the "Cześć" suggestion appears right after the third letter, "e".

Environmental Variables:
Leo Build ID: 20130515070208
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/d06cfe7d67c2
Gaia: 0ddb515f15cbc6b74fc2742b7599d6ae74c6413f

Environmental Variables:
Inari Build ID: 20130515070203
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/0b6bcb1f4175
Gaia: 9648799c2e45917ff150fa9eef8aeac79a9ac008
Blocks: 873934
Word Suggestions Text Prediction now correctly recognizes Polish characters. 
Tested b2g/nightly/mozilla-central-unagi/latest

Kernel Date: Dec 5
Gecko: http://hg.mozilla.org/mozilla-central/rev/3c6f2394995d
Gaia: d2ad0f1a5a6be8a2d6df721ba3c2b33e037c423b
Status: RESOLVED → VERIFIED
Please add a testcase for this bug to moztrap for 1.1 testsuite.  If yes, mark this in-moztrap+ when completed.  If not, mark this in-moztrap-.
Flags: in-moztrap?(cschmoeckel)
Added Keyboard Suite Test Case #8449 - [Keyboard] Auto Correct's text prediction works correctly in the Polish Language
Flags: in-moztrap?(cschmoeckel) → in-moztrap+
This patch was backed out of v1-train as part of bug 873934, and was replaced by equivalent code uplifted from master.  It remains fixed on v1-train.