Closed Bug 967732 Opened 11 years ago Closed 8 years ago

[B2G][l10n][Gaia][Keyboard] Bengali: Auto correction and word suggestion are not present for the Bengali keyboard

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(b2g-v1.3 affected)

RESOLVED WONTFIX
Tracking Status
b2g-v1.3 --- affected

People

(Reporter: lmauritson, Assigned: karanjai.moz, NeedInfo)

References

Details

(Whiteboard: LocRun1.3)

Attachments

(1 file)

Attached image Bengal Bemgal Bebgal (deleted) —
Description:
When the keyboard is set to Bengali, neither Auto Correct nor Word Suggestion appears.

Repro Steps:
1) Update a Buri to BuildID: 20140203181708
2) Switch to the Bengali keyboard with Auto Correct and Word Suggestion enabled.
3) Open up the Messages app and create a new message, selecting the text box.
4) With the Bengali keyboard enabled, type "Bengal" as a comparison word and hit space to finalize the word.
5) Type in "Bemgal" or "Bebgal" and observe the lack of auto correction or word suggestion bar.

Actual:
Auto correct does not correct words and there is no word suggestion bar.

Expected:
Auto correct and word suggestions are present and functional.

1.3 Environmental Variables:
Device: Buri 1.3 MOZ
BuildID: 20140203181708
Gaia: 0388dcb7621c5933712680a258ce94ead389e2b6
Gecko: 9731b0b7fa78
Version: 28.0
Firmware Version: v1.2-device.cfg

Repro frequency: 100%

See attached: Screenshot
Blocks: 949609
We don't have a Bengali dictionary in keyboard/dictionaries/

I suspect that for bn-Avro, this is additional work to hook up a dictionary to the IME?
Component: Gaia → Gaia::Keyboard
Keywords: l12y
If someone knows where to find a Bangla dictionary with word frequencies that would help (should also allow for redistributing).
Needinfo Kevin, our keyboard super star :)
Flags: needinfo?(kscanne)
We don't have a word frequency list so far; there have been a few efforts by students, but nothing mature enough to incorporate.

As of now, the most prominent language tool available is the Bengali spell checker add-on for Firefox. It has a word list in text format.

We can arrange some kind of boot camp to develop a Bengali word frequency list. Just saying.
I can create a frequency list from web texts without too much trouble.  How will the keyboard handle U+200C (zero-width non-joiner)?  If I include those in the frequency list will it mess up the prediction algorithm?
Flags: needinfo?(kscanne)
(In reply to Kevin Scannell from comment #5)
> I can create a frequency list from web texts without too much trouble.  How
> will the keyboard handle U+200C (zero-width non-joiner)?  If I include those
> in the frequency list will it mess up the prediction algorithm?

I'm not sure I understood correctly. We use the zero-width non-joiner for only a few character conjunctions; if we don't use U+200C, the end result is nonsense.

This word wrote using the zero-width non-joiner, and rendered correctly. - র‍্যাব If I don't use that, it appears like this - র্যাব 

Do you want me to give a list of possible uses of the zero-width non-joiner?
Thanks.  No need to send a list - I can see the possibilities in the texts I'm collecting, and also in the spellchecking word list.

So I'll treat 200C like any other character when creating the word list, that's easy enough.  I suppose my comment was aimed more at the people who wrote the predictive text algorithm, to be sure it handles non-letter characters correctly.
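A minimal sketch of what "treat 200C like any other character" could look like when counting words. This is an illustration only, not the script used for the actual list: the helper names are hypothetical, and the definition of a word character (anything in the Bengali block U+0980 to U+09FF, plus U+200C and U+200D) is a rough assumption.

```python
from collections import Counter

ZWNJ = "\u200c"  # zero-width non-joiner
ZWJ = "\u200d"   # zero-width joiner

def is_word_char(ch):
    """Assumption: Bengali-block characters and the two joiners are word characters."""
    return "\u0980" <= ch <= "\u09ff" or ch in (ZWNJ, ZWJ)

def tokenize(text):
    """Split text into runs of word characters; everything else is a separator."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

def word_frequencies(text):
    """Count tokens; a word with a joiner is a different key from one without."""
    return Counter(tokenize(text))
```

With this rule a joiner inside a word never splits it, so the joined and unjoined spellings end up as separate entries in the frequency list.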
One other question - is the word list in the Firefox addon complete enough that I can just gather frequencies for the words in that list?  Doing this would mean there's no danger of misspelled words being suggested.  But you might also be missing some common words.    I have v0.08 of the addon which has 503573 words.
Another technical issue that may come back to bite us.  Looks like the addon word list does not use Normalization Form C, so there are (many) precomposed characters in there, like U+09DF which normalizes to U+09AF U+09BC.   What I'll do is generate a frequency list in NFC but then it will be important for the input method and prediction algorithm to take this into account.
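For illustration, here is a small sketch of the normalization step using Python's standard unicodedata module. The function names are hypothetical, and the idea of normalizing both the typed text and the dictionary entry before comparing them is an assumption about how a lookup could cope with this, not a description of what the keyboard code does.

```python
import unicodedata

def to_nfc(word):
    """Return the word in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", word)

def same_word(typed, entry):
    """Compare two strings after normalizing both to NFC."""
    return to_nfc(typed) == to_nfc(entry)

# U+09DF is a composition exclusion, so NFC turns it into U+09AF U+09BC.
print([f"U+{ord(c):04X}" for c in to_nfc("\u09df")])
```

If the frequency list is in NFC but the layout emits U+09DF directly, a plain string comparison would miss the match, which is why both sides are normalized in same_word.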
(In reply to Kevin Scannell from comment #9)
> Another technical issue that may come back to bite us.  Looks like the addon
> word list does not use Normalization Form C, so there are (many) precomposed
> characters in there, like U+09DF which normalizes to U+09AF U+09BC.   What
> I'll do is generate a frequency list in NFC but then it will be important
> for the input method and prediction algorithm to take this into account.

Personally, I hate this normalization. Bengali Wikipedia did this for http://codepoints.net/U+09DF and we suffer the consequences while searching. We type using the original character, but Wikipedia saves the text after normalizing it; then when we search, we use the original character and Wikipedia returns zero results. :(

I don't know if normalization of these characters is needed that much. There are a couple more characters that suffer from these consequences, IMHO.
Is having auto-correct dictionaries for all shipping locales a requirement? If so, then this could become a blocker.
Flags: needinfo?(lebedel.delphine)
Here's my first attempt at the auto-correct dictionary:
http://borel.slu.edu/obair/bn.zip
Kevin, are there any profane words in that dictionary that we wouldn't want to suggest? Like 'fuck' in English. We have them in the dictionaries with f="0". Other than that it looks pretty awesome (but then again I don't speak Bengali :p)
Flags: needinfo?(kscanne)
This was considered a blocker in previous versions. Nominating.
blocking-b2g: --- → 1.3?
Flags: needinfo?(lebedel.delphine)
That's a question for Mak; switching the needinfo. I can say that the wordlist only contains words accepted by v0.08 of the spell checking addon:

https://addons.mozilla.org/en-us/firefox/addon/bengali-bangladesh-dictionary/
Flags: needinfo?(kscanne) → needinfo?(mahayalamkhan)
adding David here also in case he can help
Flags: needinfo?(dflanagan)
(In reply to Kevin Scannell from comment #15)
> That's a question for Mak; switching the needinfo. I can say that the
> wordlist only contains words accepted by v0.08 of the spell checking addon:
> 
> https://addons.mozilla.org/en-us/firefox/addon/bengali-bangladesh-dictionary/

The word list within the add-on doesn't contain any Bengali words that mean "fuck" or anything similar.

Even so, I shall mail Kevin a list of Bengali words that are not supposed to be in the autocomplete dictionary and are considered vulgar in our society and culture. He can do a search and tell us here.
Flags: needinfo?(mahayalamkhan)
I removed the words as requested by mak and created a new version:

http://borel.slu.edu/obair/bn-v2.zip
Well, we should still have them in the dict; they should just be marked with f="0", so that someone who is inclined to type the word is still able to do so. It will just never show up in autocorrect suggestions.

Thanks btw Kevin for creating dictionaries!
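To illustrate the f="0" convention, here is a rough sketch that writes a word list where blocked words are kept but given frequency 0. Only the f="0" marking and the existence of a flags attribute come from this bug; the element names (wordlist, w), the locale attribute and the function name are assumptions made for the example, so check them against the existing XML files in keyboard/dictionaries/ before relying on them.

```python
from xml.sax.saxutils import escape

def wordlist_xml(frequencies, blocked, locale="bn"):
    """Build an XML word list; blocked words stay in the list with f="0".

    Assumed layout: <wordlist locale="..."> containing <w f="N" flags="">word</w>.
    """
    lines = [f'<wordlist locale="{locale}">']
    # Most frequent words first, ties broken alphabetically.
    for word, freq in sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0])):
        f = 0 if word in blocked else freq
        lines.append(f' <w f="{f}" flags="">{escape(word)}</w>')
    lines.append("</wordlist>")
    return "\n".join(lines)
```

Keeping the blocked words in the list rather than deleting them means they remain valid words as typed but carry frequency 0.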
Flags: needinfo?(kscanne)
Auto correction for bengali?
Flags: needinfo?(l10n)
Wilfred, is the auto correction and word suggestion functionality critical in 1.3?
Flags: needinfo?(wmathanaraj)
Wilfred's answer matters here, not mine.
Flags: needinfo?(l10n)
The attached screenshot appears to show the bn-Avro keyboard layout.  That layout uses the "jsavrophonetic" input method.  Autocorrection and word suggestions are a feature of the "latin" input method.  So even with Kevin's dictionary (thanks, Kevin!) we can't enable autocorrect for that layout without lots of coding work in the input method.

The other Bengali layout is bn-Probhat, which does not appear to use an input method at all. If we add the latin input method and Kevin's dictionary to that layout we might be able to get autocorrection.  But I don't know enough about Bengali to know if the other latin input method stuff (like auto punctuation and auto capitalization) makes sense.

If Bengali autocorrect is important for v1.3, this should have been prioritized much earlier.

I spoke once with the contributor who created the input method for the bn-Avro layout.  IIRC, he says that is the one that everyone in Bangladesh uses. So even if we can add autocorrect to bn-Probhat it could be that no one will care.

Mahay: is it worth trying to add autocorrect for bn-Probhat?  If so, will features of the latin input method, like auto punctuation (double space turns into period space, e.g.) and auto capitalization cause problems in Bengali?

Aniruddha: do you have time and interest in adding autocorrect support to your jsavrophonetic input method?

I'm guessing that this is not something we'll be able to fix for 1.3.  And I don't think it is on anyone's feature list for 1.4 either. So if Bangladesh is an important market, someone should be sounding an alarm here.
Flags: needinfo?(mahayalamkhan)
Flags: needinfo?(dflanagan)
Flags: needinfo?(aniruddha)
I put the alarm here as Bangladesh will be a very important market for us in 1.3. But if this would be pretty much self-contained we can uplift in partner builds from 1.4.

All depending on how important dictionaries are in phonetic layout of course.
I will find out the importance of this and circle back
Flags: needinfo?(wmathanaraj)
If we have Bengali auto correction and word suggestion in Firefox by default, it will be a big advantage for us. From a user's point of view, I definitely want auto correction for my language on mobile.
I am currently working on implementing an "efficient" auto-suggestion builder for bn-Avro. Since it is a transliterating IME, a lot of work needs to be done to get usable suggest-as-you-type suggestions. I have talked with the original creator of the Avro Phonetic IME about it and he gave me some interesting pointers on this.

So, currently the state of dictionary support is -> WIP. I am working on it. And bn-Probhat won't do with Latin because latin has stuff like auto-capitalization which might seriously mess up the IME.
Flags: needinfo?(aniruddha)
(In reply to David Flanagan [:djf] from comment #23)
 
> Mahay: is it worth trying to add autocorrect for bn-Probhat?  If so, will
> features of the latin input method, like auto punctuation (double space
> turns into period space, e.g.) and auto capitalization cause problems in
> Bengali?

YES. It is worth trying for Probhat. Autocorrect in Bengali will be a unique selling point for Firefox OS in Bangladesh. AFAIK, there is none right now. This autocorrect can be forked for lots of would-be FxOS apps in Bengali. It can be used for Bengali handwriting prediction too. The opportunity is unlimited.


As Firefox OS is focused on first-time feature phone users, Probhat on a mobile phone doesn't require users to learn the layout, unlike on the desktop. They can type while looking at the keys, and at some point it can grow a big user base.
Flags: needinfo?(mahayalamkhan)
blocking-b2g: 1.3? → backlog
Based on triage today, we discussed with Wilfred that it's too late for 1.3, so moving it to backlog at this point.
Jan,

To move this forward, are you able to take Kevin's wordlist, add the missing equals signs to the flag attributes, convert to a .dict file and update the bn-Probhat layout to use the latin IM and the bn.dict dictionary?

Then we could ask Mahay or another localizer to try it out and give us feedback on any problems with the autocapitalization and autopunctuation features of the latin im. (We need a way to make these configurable on a per-layout basis for the French keyboard layout also.)
Flags: needinfo?(janjongboom)
Yeah, no problem. Leaving the ni so I don't forget tomorrow :-)
Assignee: nobody → janjongboom
Flags: needinfo?(janjongboom)
Find the branch here: https://github.com/comoyo/gaia/tree/bengal_autocomplete

To make a new profile with this branch:

git remote add comoyo git://github.com/comoyo/gaia.git
git fetch comoyo
git checkout comoyo/bengal_autocomplete
GAIA_KEYBOARD_LAYOUTS=en,bn-Probhat make

PLEASE NOTE: I disabled |replaceSurroundingText|, so it won't autocorrect words as it crashes B2G on Bangla at the moment, due to some bug somewhere (haven't looked into it yet). But you can see the suggestions it comes up with. Please let me know if they make any sense.
Flags: needinfo?(mahayalamkhan)
Flags: needinfo?(kscanne)
(In reply to Jan Jongboom [:janjongboom] from comment #32)
> Find the branch here: https://github.com/comoyo/gaia/tree/bengal_autocomplete
> 
> To make a new profile with this branch:
> 
> git remote add comoyo git://github.com/comoyo/gaia.git
> git fetch comoyo
> git checkout comoyo/bengal_autocomplete
> GAIA_KEYBOARD_LAYOUTS=en,bn-Probhat make
> 
> PLEASE NOTE: I disabled |replaceSurroundingText|, so it won't autocorrect
> words as it crashes B2G on Bangla at the moment, due to some bug somewhere
> (haven't looked into it yet). But you can see the suggestions it comes up
> with. Please let me know if they make any sense.

Too busy to sit down and make a build from the git repo; let me know if your patch gets committed to the latest nightly.
Flags: needinfo?(mahayalamkhan)
This is not going to land in nightly before I have any confirmation that the suggestions actually work, so doing it in this order is not gonna work.
Flags: needinfo?(mahayalamkhan)
Mak, any news on this please? thanks
Jan, Aniruddha and I met today and discussed many things. We will do a brainstorm with two more people, who developed Bengali phonetic keyboards and helped Aniruddha a lot to develop the bn_BD keyboard for Firefox OS.

Hope to update more on this after Thursday.
Flags: needinfo?(mahayalamkhan)
@Mak,
Talking with Benedikte at Telenor, if this feature is to be available in the initial launch device then it would need to land before 15th July. Do you think that is feasible? If not, do you have an estimation of when it might land (even if it is a very rough estimate) pls?
Thanks, tony
Flags: needinfo?(mahayalamkhan)
(In reply to Tony Appleton from comment #37)
> @Mak,
> Talking with Benedikte at Telenor, if this feature is to be available in the
> initial launch device then it would need to land before 15th July. Do you
> think that is feasible? If not, do you have an estimation of when it might
> land (even if it is a very rough estimate) pls?
> Thanks, tony

Dear Tony,

I am afraid we can't make it before 15th July or any time soon before the device launch. We need more time to brainstorm on the algorithm, benchmark which approach is faster and more accurate, and then go through rigorous manual error checking of all words in the list.

So I just can't predict a time, as it depends on resource availability. We can initiate a mail discussion to source and estimate.
Flags: needinfo?(mahayalamkhan)
Assignee: janjongboom → nobody
Assign to Rabimba per proposal on dev-gaia.
Assignee: nobody → karanjai.moz
Status: NEW → ASSIGNED
Setting mentor flag so I can give reply promptly on this bug.
Mentor: timdream
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #41)
> Setting mentor flag so I can give reply promptly on this bug.

I have an initial Word Frequency list prepared for both Phonetic and Probhat (two WF lists). I need to integrate and test it before I generate a bigger (hopefully better) WF list using my full corpus (only for Probhat).

Need help/pointers creating dict. Or should I just post the lists here?
(In reply to Rabimba from comment #42)
> I have an initial Word Frequency list prepared for both Phonetic and
> Probhat(two WF lists). Need to integrate and test it before I generate a
> bigger(hopefully better) WF list using my full corpus (only for Probhat)
> 
> Need help/pointers creating dict. Or should I just post the lists here?

So by corpus you mean you already have a small list of word-frequency pairs and you are trying to create a bigger one? How did you create the small list? Doesn't the same method apply to the larger list?

As for generating the suggestions from the word list, did you try to read prediction.js and see how it works? Maybe you could tweak the list of expected alphabets and come up with a version of prediction.js, and you could then hook it up to a Bengali equivalent of latin.js.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #43)
> So by corpus you mean you already have a small list of word-frequency pair
> and you are trying to create a bigger one? How do you create the small list?
> Isn't the method apply to the larger list?

Well no. Not exactly. By corpus I meant I have a big body of text written in Bengali. I have written a script to parse through this and generate a word frequency list for our consumption. The problem with this approach is that I still have to tweak my script to get better output. For example, I need to be able to determine which words are proper nouns (names of persons) and leave them out. But I don't want to leave out the names of places (hence I am still debating my options). There are many more tweaks I need to do.
Also, the corpus I am using is twofold and made by scraping regional language websites (which might have copyright restrictions on their data usage, hence I need to confirm that too). I will be concentrating on two data sources, primarily sourcing from bn_IN and bn_BD. Though Bengali in its written form should be the same for both, I just want to be sure I'm not missing any local derivations.

> As of generating the suggestions from the word list, did you try to read
> prediction.js and see how it works? Maybe you could tweak list of expected
> alphabets and come up with a version of prediction.js, and you could then
> hook it up to a Bengali equivalent of latin.js.

I haven't. That is an excellent idea and I'll have a look at it to see how it works. But before that I just wanted to see if it works right now if I provide a word frequency pair list. I guess I should be able to build the dict if I run xml2dict.py?
(In reply to Rabimba from comment #44)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #43)
> > So by corpus you mean you already have a small list of word-frequency pair
> > and you are trying to create a bigger one? How do you create the small list?
> > Isn't the method apply to the larger list?
> 
> Well no. Not exactly. By corpus I meant I have a big list of text written in
> Bengali. I have written a script to parse through this and generate a word
> frequency list for our consumption. The problem with this approach is that I
> still have to tweak my script to get a better output. For example I need to
> be able to determine what words are proper noun (names of persons) and leave
> them out. But I don't want to leave out the names of the places (hence still
> debating my options). There are many more tweaks I need to do.
> Also the corpus I am using is two folds and made by scraping regional
> language websites (which might have copyright on it's data usage hence I
> need to confirm that too). I will be concentrating one two data sources
> primarily sourcing from bn_IN and bn_BD. Though Bengali in it's written form
> should be same for both, I just want to be sure I'm not missing any local
> derivations.
> 

I probably can't help you on this part because I have never done this before.

> > As of generating the suggestions from the word list, did you try to read
> > prediction.js and see how it works? Maybe you could tweak list of expected
> > alphabets and come up with a version of prediction.js, and you could then
> > hook it up to a Bengali equivalent of latin.js.
> 
> I haven't. That is an excellent idea and I'll have a look at it to see how
> it works. But before that I just wanted to see if it works right now if I
> provide a word frequency pair list. I guess I should be able to build the
> dict if I run xml2dict.py?

Yes, xml2dict.py will give you the binary dict that prediction.js consumes.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #45)
 
> Yes, xml2dict.py will give you the binary dict that prediction.js consumes.

Thanks!
Can I somehow utilize this to test it quickly?
https://github.com/timdream/gaia-keyboard-demo
(In reply to Rabimba from comment #46)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #45)
>  
> > Yes, xml2dict.py will give you the binary dict that prediction.js consumes.
> 
> Thanks!
> Can I somehow utilize this to test it quickly?
> https://github.com/timdream/gaia-keyboard-demo

Yes, just update/replace the Gaia loaded as submodule in ./gaia, and launch the page with a localhost http server. Skip the |git submodule| commands and create a symbolic link that goes to your working Gaia repo.
The problem is that prediction.js doesn't yield quality output for Bangla.
(In reply to Jan Jongboom [:janjongboom] (Telenor) from comment #48)
> The problem is that prediction.js doesn't yield quality output for Bangla.

I simply want to point out the ternary search tree part of the code should be reused, since it's pure math.
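As a rough illustration of why that part is script-independent, here is a generic ternary search tree sketch. It is not the code in prediction.js (which works from the binary dict); the class and method names are hypothetical. The point is only that nodes are ordered by comparing characters, so nothing in the structure is specific to the Latin alphabet.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "freq")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.freq = None  # set only on nodes that end a word


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, freq):
        if word:
            self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node, word, i, freq):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq = freq
        return node

    def _find(self, prefix):
        """Return the node matching the last character of prefix, or None."""
        node, i = self.root, 0
        while node is not None:
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(prefix):
                    return node
                node = node.eq
        return None

    def _collect(self, node, prefix, found):
        """Gather (word, freq) pairs for every word in the subtree."""
        if node is None:
            return
        self._collect(node.lo, prefix, found)
        if node.freq is not None:
            found.append((prefix + node.ch, node.freq))
        self._collect(node.eq, prefix + node.ch, found)
        self._collect(node.hi, prefix, found)

    def suggest(self, prefix, limit=3):
        """Return up to `limit` words starting with prefix, most frequent first."""
        if not prefix:
            return []
        node = self._find(prefix)
        if node is None:
            return []
        found = []
        if node.freq is not None:
            found.append((prefix, node.freq))
        self._collect(node.eq, prefix, found)
        found.sort(key=lambda wf: (-wf[1], wf[0]))
        return [word for word, _ in found[:limit]]
```

This only does exact prefix completion. Any fuzzy matching for nearby keys or spelling corrections would be layered on top and is where language-specific tuning would come in.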
I'm sorry, this thread is very confusing, so please bear with me.
This bug has been open since 1.3 and it seems it still hasn't been resolved. We've shipped in Bengali in the meantime.
There is renewed interest in Bengali, from v1.4 and onwards. We will be needing Bengali autocorrection/wordsuggestion at least from 1.4 and onwards, please! (not sure if/what flags I should be putting in for this)
Last time we visited the bug Rabimba said he is working on a new IMEngine which will provide auto correction & suggestion for Bengali. I am not sure about his progress here.
Flags: needinfo?(karanjai.moz)
If the change is isolated in a new IMEngine, it would not be hard to backport all the way to v1.4, other than adopting some InputMethodGlue interface changes.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #51)
> Last time we visited the bug Rabimba said he is working on a new IMEngine
> which will provide auto correction & suggestion for Bengali. I am not sure
> about his progress here.

I will be sending out a pull request in a day or two.
My apologies for being so late and lackluster even after the mail communications. I was travelling for the last few weeks, and since I am now back in India for a month, I did not have connectivity set up at my old place till yesterday.

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #52)
> If the change is isolated in a new IMEngine, it would not be hard to
> backport all the way to v1.4, other than adopting some InputMethodGlue
> interface changes.

Just the prediction itself is isolated (mostly). So should not be a problem. The learning however is not. I will reserve any more comments till the pull request to let you see the changes.
Flags: needinfo?(karanjai.moz)
(In reply to Rabimba from comment #53)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #52)
> > If the change is isolated in a new IMEngine, it would not be hard to
> > backport all the way to v1.4, other than adopting some InputMethodGlue
> > interface changes.
> 
> Just the prediction itself is isolated (mostly). So should not be a problem.
> The learning however is not. I will reserve any more comments till the pull
> request to let you see the changes.

Sounds great! Just a heads-up, I am touching latin.js and worker.js for bug 1110028.
[Blocking Requested - why for this release]:
Bengali is shipping from 1.4 and onwards, so we would need to get this into 1.4 and all other following branches. Since there is no more 1.4 blocking flag, have requested this as blocking for 2.0. However we should make sure this gets into all other needed branches. Thanks!
blocking-b2g: backlog → 2.0?
Flags: needinfo?(bbajaj)
[Blocking Requested - why for this release]:

[Triage] Considering current 2.0 timing and the need from partner, nom. to 2.1 (or even 2.2?) instead for consideration.
blocking-b2g: 2.0? → 2.1?
Heads Up:

I am stuck, unable to do anything on this one until 16th January when I return to the US, since
1. I am not even able to access github from my country now (ref: http://techcrunch.com/2014/12/31/indian-government-censorsht/)
2. Even a few days back when I was able to, my pathetic connection here was giving me speeds around 10-12kbps, which was barely enough for me to even make this post here :(
(In reply to Rabimba from comment #57)

No worries, do contact me if you need any help!
Clearing blocking nom, refer to comment in : https://bugzilla.mozilla.org/show_bug.cgi?id=1114866
Flags: needinfo?(bbajaj)
blocking-b2g: 2.1? → ---
Hi Rabimba! Seems like this was dropped at one point :) You mentioned doing a pull request for this in comment 53. Care to push this forwards? thanks!
Flags: needinfo?(karanjai.moz)
I am closing this bug as Won't fix.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX