Bugzilla

Comment 2

•

11 years ago

If someone knows where to find a Bangla dictionary with word frequencies, that would help (its license should also allow redistribution).

Comment 3

•

11 years ago

Needinfo Kevin, our keyboard super star :)

Flags: needinfo?(kscanne)

Comment 4

•

11 years ago

We don't have a word frequency list so far; there have been a few efforts by students, but nothing mature enough to incorporate.

As of now, the most prominent language tool available is the Bengali spell checker add-on for Firefox. It has a word list in text format.

We can arrange some kind of boot camp to develop a Bengali word frequency list. Just saying.

Comment 5

•

11 years ago

I can create a frequency list from web texts without too much trouble.  How will the keyboard handle U+200C (zero-width non-joiner)?  If I include those in the frequency list will it mess up the prediction algorithm?

Flags: needinfo?(kscanne)

Comment 6

•

11 years ago

(In reply to Kevin Scannell from comment #5)
> I can create a frequency list from web texts without too much trouble.  How
> will the keyboard handle U+200C (zero-width non-joiner)?  If I include those
> in the frequency list will it mess up the prediction algorithm?

I'm not sure I understood correctly. We use the zero-width non-joiner for only a few character conjuncts; if we don't use U+200C, the end result is nonsense.

This word wrote using the zero-width non-joiner, and rendered correctly. - র‍্যাব If I don't use that, it appears like this - র্যাব 

Do you want me to give a list of possible uses of the zero-width non-joiner?

Comment 7

•

11 years ago

Thanks.  No need to send a list - I can see the possibilities in the texts I'm collecting, and also in the spellchecking word list.

So I'll treat U+200C like any other character when creating the word list; that's easy enough. I suppose my comment was aimed more at the people who wrote the predictive text algorithm, to be sure it handles non-letter characters correctly.
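For illustration, the counting step could look roughly like the Python sketch below. It is not the script actually used for this bug; the regular expression, the `count_words` name, and the choice to also keep U+200D (zero-width joiner) inside words are assumptions made for the example.

```python
import re
from collections import Counter

# A "word" here is a run of characters from the Bengali block (U+0980-U+09FF)
# plus U+200C and U+200D. Treating the zero-width characters as ordinary
# word characters is the assumption being illustrated. Note that the Bengali
# block also contains digits, so numbers are counted as words too.
WORD_RE = re.compile(r"[\u0980-\u09FF\u200C\u200D]+")


def count_words(text):
    """Count Bengali words in text, keeping zero-width characters inside words."""
    counts = Counter()
    for token in WORD_RE.findall(text):
        # Drop stray zero-width characters at the edges of a token, so a
        # lone U+200C is never counted as a word by itself.
        token = token.strip("\u200c\u200d")
        if token:
            counts[token] += 1
    return counts
```

With this approach a spelling that contains a zero-width character and one that does not are counted as two different words, which is what "treat it like any other character" implies.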

Comment 8

•

11 years ago

One other question - is the word list in the Firefox addon complete enough that I can just gather frequencies for the words in that list?  Doing this would mean there's no danger of misspelled words being suggested.  But you might also be missing some common words.    I have v0.08 of the addon which has 503573 words.

Comment 9

•

11 years ago

Another technical issue that may come back to bite us.  Looks like the addon word list does not use Normalization Form C, so there are (many) precomposed characters in there, like U+09DF which normalizes to U+09AF U+09BC.   What I'll do is generate a frequency list in NFC but then it will be important for the input method and prediction algorithm to take this into account.
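A small Python sketch of what this means in practice, using the standard `unicodedata` module. The helper names `to_nfc` and `merge_counts_nfc` are made up for the example; the point is only that both the word list and whatever the user types need to go through the same normalization before lookup.

```python
import unicodedata
from collections import Counter


def to_nfc(word):
    """Return word in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", word)


def merge_counts_nfc(counts):
    """Re-key a word -> count mapping by NFC form, summing duplicates."""
    merged = Counter()
    for word, n in counts.items():
        merged[to_nfc(word)] += n
    return merged


# U+09DF is excluded from composition, so NFC turns it into the
# two-character sequence U+09AF U+09BC rather than leaving it precomposed.
assert to_nfc("\u09df") == "\u09af\u09bc"
```

If the input method emits U+09DF while the dictionary stores U+09AF U+09BC, a plain string comparison will not match, so the typed prefix would have to be passed through something like `to_nfc` before it reaches the prediction code.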

Comment 10

•

11 years ago

(In reply to Kevin Scannell from comment #9)
> Another technical issue that may come back to bite us.  Looks like the addon
> word list does not use Normalization Form C, so there are (many) precomposed
> characters in there, like U+09DF which normalizes to U+09AF U+09BC.   What
> I'll do is generate a frequency list in NFC but then it will be important
> for the input method and prediction algorithm to take this into account.

Personally, I hate this normalization. Bengali Wikipedia did this for http://codepoints.net/U+09DF and we suffer the consequences when searching. We type using the original character, but Wikipedia saves the text after normalizing; then when we search, we use the original character and Wikipedia returns zero results. :(

I don't know if normalization of these characters is needed that much. There are a couple more characters that suffer from the same consequences, IMHO.

Jason Smith [:jsmith]

Comment 11

•

11 years ago

Is having auto-correct dictionaries for all shipping locales a requirement? If so, then this could become a blocker.

Flags: needinfo?(lebedel.delphine)

Comment 12

•

11 years ago

Here's my first attempt at the auto-correct dictionary:
http://borel.slu.edu/obair/bn.zip

Comment 13

•

11 years ago

Kevin, are there any profane words in that dictionary that we wouldn't want to suggest? Like 'fuck' in English. We have them in the dictionaries with f="0". Other than that it looks pretty awesome (but then again I don't speak Bengali :p)

Flags: needinfo?(kscanne)

Comment 14

•

11 years ago

This was considered as a blocker in previous versions. Nominating

blocking-b2g: --- → 1.3?

Flags: needinfo?(lebedel.delphine)

Comment 15

•

11 years ago

That's a question for Mak; switching the needinfo. I can say that the wordlist only contains words accepted by v0.08 of the spell checking addon:

https://addons.mozilla.org/en-us/firefox/addon/bengali-bangladesh-dictionary/

Flags: needinfo?(kscanne) → needinfo?(mahayalamkhan)

Comment 16

•

11 years ago

adding David here also in case he can help

Flags: needinfo?(dflanagan)

Comment 17

•

11 years ago

(In reply to Kevin Scannell from comment #15)
> That's a question for Mak; switching the needinfo. I can say that the
> wordlist only contains words accepted by v0.08 of the spell checking addon:
> 
> https://addons.mozilla.org/en-us/firefox/addon/bengali-bangladesh-dictionary/

The word list within the add-on doesn't contain any Bengali words that mean 'fuck' or anything similar.

Even so, I shall mail Kevin a list of Bengali words that are not supposed to be in the autocomplete dictionary because they are considered vulgar in our society and culture. He can search for them and report back here.

Flags: needinfo?(mahayalamkhan)

Comment 18

•

11 years ago

I removed the words as requested by mak and created a new version:

http://borel.slu.edu/obair/bn-v2.zip

Preeti Raghunath(:Preeti)

Comment 19

•

11 years ago

Well we should still have them in the dict, they should just be marked with f="0", so if someone is inclined to type the word they are still able to do so. It will just never show up in autocorrect suggestions.
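As a rough sketch of that idea in Python: keep every word in the list, but write the blocked ones with f="0". The element and attribute names (`wordlist`, `w`, `f`, `flags`) and the `build_wordlist` helper are assumptions for illustration, not checked against what the Gaia dictionary tooling actually expects.

```python
import xml.etree.ElementTree as ET


def build_wordlist(freqs, blocked, locale="bn"):
    """Serialize word frequencies to XML, forcing f="0" for blocked words.

    freqs:   dict mapping word -> integer frequency
    blocked: set of words that must stay typeable but never be suggested
    """
    root = ET.Element("wordlist", {"locale": locale})
    # Most frequent words first; blocked words stay in the list.
    for word, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
        f = "0" if word in blocked else str(freq)
        entry = ET.SubElement(root, "w", {"f": f, "flags": ""})
        entry.text = word
    return ET.tostring(root, encoding="unicode")
```

Whether the word appears with f="0" or is missing entirely makes a difference to the user: in the first case the keyboard recognizes it as a valid word and leaves it alone, in the second it may try to correct it to something else.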

Thanks btw Kevin for creating dictionaries!

Flags: needinfo?(kscanne)

Comment 20

•

11 years ago

Auto correction for bengali?

Flags: needinfo?(l10n)

Peter Dolanjski [:pdol]

Comment 21

•

11 years ago

Wilfred, is the auto correction and word suggestion functionality critical in 1.3?

Flags: needinfo?(wmathanaraj)

Axel Hecht [:Pike]

Comment 22

•

11 years ago

Wilfred's answer matters here, not mine.

Flags: needinfo?(l10n)

David Flanagan [:djf]

Comment 23

•

11 years ago

The attached screenshot appears to show the bn-Avro keyboard layout. That layout uses the "jsavrophonetic" input method. Autocorrection and word suggestions are a feature of the "latin" input method. So even with Kevin's dictionary (thanks, Kevin!) we can't enable autocorrect for that layout without lots of coding work in the input method.

The other Bengali layout is bn-Probhat, which does not appear to use an input method at all. If we add the latin input method and Kevin's dictionary to that layout we might be able to get autocorrection. But I don't know enough about Bengali to know if the other latin input method stuff (like auto punctuation and auto capitalization) makes sense.

If Bengali autocorrect is important for v1.3, this should have been prioritized much earlier.

I spoke once with the contributor who created the input method for the bn-Avro layout. IIRC, he says that is the one that everyone in Bangladesh uses. So even if we can add autocorrect to bn-Probhat it could be that no one will care.

Mahay: is it worth trying to add autocorrect for bn-Probhat? If so, will features of the latin input method, like auto punctuation (double space turns into period space, e.g.) and auto capitalization cause problems in Bengali?

Aniruddha: do you have time and interest in adding autocorrect support to your jsavrophonetic input method?

I'm guessing that this is not something we'll be able to fix for 1.3. And I don't think it is on anyone's feature list for 1.4 either. So if Bangladesh is an important market, someone should be sounding an alarm here.

Flags: needinfo?(mahayalamkhan)

Flags: needinfo?(dflanagan)

Flags: needinfo?(aniruddha)

Wilfred Mathanaraj [:WDM]

Comment 24

•

11 years ago

I put the alarm here as Bangladesh will be a very important market for us in 1.3. But if this would be pretty much self-contained we can uplift in partner builds from 1.4.

All depending on how important dictionaries are in phonetic layout of course.

Comment 25

•

11 years ago

I will find out the importance of this and circle back

Flags: needinfo?(wmathanaraj)

Ashickur Rahman

Comment 26

•

11 years ago

If we have Bengali autocorrection and word suggestion in Firefox by default, it will be a big advantage for us. From a user's point of view, I definitely want autocorrection for my language on mobile.

Aniruddha Adhikary [:tuxboy]

Comment 27

•

11 years ago

I am currently working on implementing an "efficient" auto-suggestion builder for bn-Avro. Since it is a transliterating IME, a lot of work needs to be done to get usable suggest-as-you-type suggestions. I have talked with the original creator of the Avro Phonetic IME about it and he gave me some interesting pointers on this.

So, currently the state of dictionary support is -> WIP. I am working on it. And bn-Probhat won't do with Latin because latin has stuff like auto-capitalization which might seriously mess up the IME.

Flags: needinfo?(aniruddha)

Comment 28

•

11 years ago

(In reply to David Flanagan [:djf] from comment #23)
 
> Mahay: is it worth trying to add autocorrect for bn-Probhat?  If so, will
> features of the latin input method, like auto punctuation (double space
> turns into period space, e.g.) and auto capitalization cause problems in
> Bengali?

YES. It is worth trying for Probhat. Autocorrect in Bengali will be a unique selling point for Firefox OS in Bangladesh. AFAIK, there is none right now. This autocorrect can be forked for lots of would-be FxOS apps in Bengali. It can be used for Bengali handwriting prediction too. The opportunity is unlimited.

Since Firefox OS is aimed at first-time feature phone users, Probhat on a mobile phone doesn't require users to learn the layout, unlike on desktop. They can type while looking at the keys, and at some point it can grow a big user base.

Flags: needinfo?(mahayalamkhan)

Updated

•

11 years ago

blocking-b2g: 1.3? → backlog

Comment 29

•

11 years ago

Based on triage today, we discussed with Wilfred that it's too late for 1.3, so moving it to backlog at this point.

David Flanagan [:djf]

Comment 30

•

11 years ago

Jan,

To move this forward, are you able to take Kevin's wordlist, add the missing equals signs to the flag attributes, convert to a .dict file and update the bn-Probhat layout to use the latin IM and the bn.dict dictionary?

Then we could ask Mahay or another localizer to try it out and give us feedback on any problems with the autocapitalization and autopunctuation features of the latin im. (We need a way to make these configurable on a per-layout basis for the French keyboard layout also.)

Flags: needinfo?(janjongboom)

Comment 31

•

11 years ago

Yeah, no problem. Leaving the ni so I don't forget tomorrow :-)

Updated

•

11 years ago

Assignee: nobody → janjongboom

Flags: needinfo?(janjongboom)

Comment 32

•

11 years ago

Find the branch here: https://github.com/comoyo/gaia/tree/bengal_autocomplete

To make a new profile with this branch:

git remote add comoyo git://github.com/comoyo/gaia.git
git fetch comoyo
git checkout comoyo/bengal_autocomplete
GAIA_KEYBOARD_LAYOUTS=en,bn-Probhat make

PLEASE NOTE: I disabled |replaceSurroundingText|, so it won't autocorrect words as it crashes B2G on Bangla at the moment, due to some bug somewhere (haven't looked into it yet). But you can see the suggestions it comes up with. Please let me know if they make any sense.

Flags: needinfo?(mahayalamkhan)

Updated

•

11 years ago

Flags: needinfo?(kscanne)

Comment 33

•

11 years ago

(In reply to Jan Jongboom [:janjongboom] from comment #32)
> Find the branch here: https://github.com/comoyo/gaia/tree/bengal_autocomplete
> 
> To make a new profile with this branch:
> 
> git remote add comoyo git://github.com/comoyo/gaia.git
> git fetch comoyo
> git checkout comoyo/bengal_autocomplete
> GAIA_KEYBOARD_LAYOUTS=en,bn-Probhat make
> 
> PLEASE NOTE: I disabled |replaceSurroundingText|, so it won't autocorrect
> words as it crashes B2G on Bangla at the moment, due to some bug somewhere
> (haven't looked into it yet). But you can see the suggestions it comes up
> with. Please let me know if they make any sense.

Too busy to sit down and make a build from the git repo; let me know if your patch gets committed to the latest nightly.

Flags: needinfo?(mahayalamkhan)

Comment 34

•

11 years ago

This is not going to land in nightly before I have any confirmation that the suggestions actually work, so doing it in this order is not gonna work.

Flags: needinfo?(mahayalamkhan)

Comment 35

•

11 years ago

Mak, any news on this please? thanks

Comment 36

•

10 years ago

Jan, Aniruddha and I met today and discussed many things. We will brainstorm with two more people who developed Bengali phonetic keyboards and helped Aniruddha a lot in developing the bn_BD keyboard for Firefox OS.

Hope to update more on this after Thursday.

Flags: needinfo?(mahayalamkhan)

aappleton@mozilla.com

Comment 37

•

10 years ago

@Mak,
Talking with Benedikte at Telenor, if this feature is to be available in the initial launch device then it would need to land before 15th July. Do you think that is feasible? If not, do you have an estimation of when it might land (even if it is a very rough estimate) pls?
Thanks, tony

Flags: needinfo?(mahayalamkhan)

Comment 38

•

10 years ago

(In reply to Tony Appleton from comment #37)
> @Mak,
> Talking with Benedikte at Telenor, if this feature is to be available in the
> initial launch device then it would need to land before 15th July. Do you
> think that is feasible? If not, do you have an estimation of when it might
> land (even if it is a very rough estimate) pls?
> Thanks, tony

Dear Tony,

I am afraid we can't make it before 15th July or any time soon before the device launch. We need more time to brainstorm on the algorithm, benchmark which approach is faster and more accurate, and then go through a rigorous manual error check of all words in the list.

So I just can't predict a time, as it depends on resource availability. We can start a mail discussion to find resources and come up with an estimate.

Flags: needinfo?(mahayalamkhan)

Updated

•

10 years ago

Assignee: janjongboom → nobody

Updated

•

10 years ago

Blocks: IndianBengali-WordPrediction

Comment 39

•

10 years ago

Assign to Rabimba per proposal on dev-gaia.

Assignee: nobody → karanjai.moz

Status: NEW → ASSIGNED

Comment 41

•

10 years ago

Setting mentor flag so I can give reply promptly on this bug.

Mentor: timdream

Assignee

Comment 42

•

10 years ago

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #41)
> Setting mentor flag so I can give reply promptly on this bug.

I have an initial word frequency (WF) list prepared for both Phonetic and Probhat (two WF lists). I need to integrate and test them before I generate a bigger (hopefully better) WF list using my full corpus (only for Probhat).

Need help/pointers creating dict. Or should I just post the lists here?

Comment 43

•

10 years ago

(In reply to Rabimba from comment #42)
> I have an initial Word Frequency list prepared for both Phonetic and
> Probhat(two WF lists). Need to integrate and test it before I generate a
> bigger(hopefully better) WF list using my full corpus (only for Probhat)
> 
> Need help/pointers creating dict. Or should I just post the lists here?

So by corpus you mean you already have a small list of word-frequency pairs and you are trying to create a bigger one? How did you create the small list? Doesn't the same method apply to the larger list?

As for generating the suggestions from the word list, did you try to read prediction.js and see how it works? Maybe you could tweak the list of expected alphabets and come up with a version of prediction.js, and you could then hook it up to a Bengali equivalent of latin.js.

Assignee

Comment 44

•

10 years ago

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #43)
> So by corpus you mean you already have a small list of word-frequency pair
> and you are trying to create a bigger one? How do you create the small list?
> Isn't the method apply to the larger list?

Well, no. Not exactly. By corpus I meant I have a big collection of text written in Bengali. I have written a script to parse through this and generate a word frequency list for our consumption. The problem with this approach is that I still have to tweak my script to get better output. For example, I need to be able to determine which words are proper nouns (names of persons) and leave them out. But I don't want to leave out the names of places (hence I'm still debating my options). There are many more tweaks I need to do.
Also, the corpus I am using is twofold and was made by scraping regional-language websites (which might have copyright restrictions on their data usage, so I need to confirm that too). I will be concentrating on two data sources, primarily sourcing from bn_IN and bn_BD. Though Bengali in its written form should be the same for both, I just want to be sure I'm not missing any local variations.

> As of generating the suggestions from the word list, did you try to read
> prediction.js and see how it works? Maybe you could tweak list of expected
> alphabets and come up with a version of prediction.js, and you could then
> hook it up to a Bengali equivalent of latin.js.

I haven't. That is an excellent idea and I'll have a look at it to see how it works. But before that I just wanted to see if it works right now if I provide a word frequency pair list. I guess I should be able to build the dict if I run xml2dict.py?

Comment 45

•

10 years ago

(In reply to Rabimba from comment #44)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #43)
> > So by corpus you mean you already have a small list of word-frequency pair
> > and you are trying to create a bigger one? How do you create the small list?
> > Isn't the method apply to the larger list?
> 
> Well no. Not exactly. By corpus I meant I have a big list of text written in
> Bengali. I have written a script to parse through this and generate a word
> frequency list for our consumption. The problem with this approach is that I
> still have to tweak my script to get a better output. For example I need to
> be able to determine what words are proper noun (names of persons) and leave
> them out. But I don't want to leave out the names of the places (hence still
> debating my options). There are many more tweaks I need to do.
> Also the corpus I am using is two folds and made by scraping regional
> language websites (which might have copyright on it's data usage hence I
> need to confirm that too). I will be concentrating one two data sources
> primarily sourcing from bn_IN and bn_BD. Though Bengali in it's written form
> should be same for both, I just want to be sure I'm not missing any local
> derivations.
> 

I probably can't help you on this part because I have never done this before.

> > As of generating the suggestions from the word list, did you try to read
> > prediction.js and see how it works? Maybe you could tweak list of expected
> > alphabets and come up with a version of prediction.js, and you could then
> > hook it up to a Bengali equivalent of latin.js.
> 
> I haven't. That is an excellent idea and I'll have a look at it to see how
> it works. But before that I just wanted to see if it works right now if I
> provide a word frequency pair list. I guess I should be able to build the
> dict if I run xml2dict.py?

Yes, xml2dict.py will give you the binary dict that prediction.js consumes.

Assignee

Comment 46

•

10 years ago

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #45)
 
> Yes, xml2dict.py will give you the binary dict that prediction.js consumes.

Thanks!
Can I somehow utilize this to test it quickly?
https://github.com/timdream/gaia-keyboard-demo

Comment 47

•

10 years ago

(In reply to Rabimba from comment #46)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #45)
>  
> > Yes, xml2dict.py will give you the binary dict that prediction.js consumes.
> 
> Thanks!
> Can I somehow utilize this to test it quickly?
> https://github.com/timdream/gaia-keyboard-demo

Yes, just update/replace the Gaia loaded as submodule in ./gaia, and launch the page with a localhost http server. Skip the |git submodule| commands and create a symbolic link that goes to your working Gaia repo.

Comment 48

•

10 years ago

The problem is that prediction.js doesn't yield quality output for Bangla.

Comment 49

•

10 years ago

(In reply to Jan Jongboom [:janjongboom] (Telenor) from comment #48)
> The problem is that prediction.js doesn't yield quality output for Bangla.

I simply want to point out the ternary search tree part of the code should be reused, since it's pure math.
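For anyone not familiar with the data structure, here is a minimal ternary search tree in Python with a frequency-ranked prefix lookup. This is only a sketch of the general technique, not the code in prediction.js; the class and method names are made up for the example.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "freq")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.freq = None  # set only where a complete word ends


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, freq):
        if word:
            self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node, word, i, freq):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq = freq
        return node

    def _find(self, prefix):
        """Return the node matching the last character of prefix, or None."""
        node, i = self.root, 0
        while node is not None:
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(prefix):
                    return node
                node = node.eq
        return None

    def _collect(self, node, prefix, found):
        """Append every (word, freq) stored in the subtree rooted at node."""
        if node is None:
            return
        self._collect(node.lo, prefix, found)
        if node.freq is not None:
            found.append((prefix + node.ch, node.freq))
        self._collect(node.eq, prefix + node.ch, found)
        self._collect(node.hi, prefix, found)

    def suggest(self, prefix, limit=3):
        """Return up to limit words starting with prefix, most frequent first."""
        if not prefix:
            return []
        node = self._find(prefix)
        if node is None:
            return []
        found = []
        if node.freq is not None:
            found.append((prefix, node.freq))
        self._collect(node.eq, prefix, found)
        found.sort(key=lambda wf: -wf[1])
        return [word for word, _ in found[:limit]]
```

Nothing in the tree depends on the script: characters are only compared by code point, so the same structure works for Bengali strings as for Latin ones. What differs per language is how candidate prefixes are generated from key presses.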

Comment 50

•

10 years ago

I'm sorry, this thread is very confusing, so please bear with me.
This bug has been open since 1.3 and it seems it still hasn't been resolved. We've shipped in Bengali in the meantime.
There is renewed interest in Bengali, from v1.4 and onwards. We will be needing Bengali autocorrection/wordsuggestion at least from 1.4 and onwards, please! (not sure if/what flags I should be putting in for this)

Comment 51

•

10 years ago

Last time we visited the bug Rabimba said he is working on a new IMEngine which will provide auto correction & suggestion for Bengali. I am not sure about his progress here.

Flags: needinfo?(karanjai.moz)

Comment 52

•

10 years ago

If the change is isolated in a new IMEngine, it would not be hard to backport all the way to v1.4, other than adopting some InputMethodGlue interface changes.

Assignee

Comment 53

•

10 years ago

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #51)
> Last time we visited the bug Rabimba said he is working on a new IMEngine
> which will provide auto correction & suggestion for Bengali. I am not sure
> about his progress here.

I will be sending out a pull request in a day or two.
My apologies for being so late and lackluster even after the mail communications. I was travelling for the last few weeks, and since I am now back in India for a month, I did not have connectivity set up at my old place until yesterday.

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #52)
> If the change is isolated in a new IMEngine, it would not be hard to
> backport all the way to v1.4, other than adopting some InputMethodGlue
> interface changes.

Just the prediction itself is isolated (mostly). So should not be a problem. The learning however is not. I will reserve any more comments till the pull request to let you see the changes.

Flags: needinfo?(karanjai.moz)

Comment 54

•

10 years ago

(In reply to Rabimba from comment #53)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #52)
> > If the change is isolated in a new IMEngine, it would not be hard to
> > backport all the way to v1.4, other than adopting some InputMethodGlue
> > interface changes.
> 
> Just the prediction itself is isolated (mostly). So should not be a problem.
> The learning however is not. I will reserve any more comments till the pull
> request to let you see the changes.

Sounds great! Just a heads-up, I am touching latin.js and worker.js for bug 1110028.

Updated

•

10 years ago

Blocks: WordPrediction

Comment 55

•

10 years ago

[Blocking Requested - why for this release]:
Bengali is shipping from 1.4 and onwards, so we would need to get this into 1.4 and all other following branches. Since there is no more 1.4 blocking flag, have requested this as blocking for 2.0. However we should make sure this gets into all other needed branches. Thanks!

blocking-b2g: backlog → 2.0?

Flags: needinfo?(bbajaj)

Wesly Huang (EPM)

Comment 56

•

10 years ago

[Blocking Requested - why for this release]:

[Triage] Considering current 2.0 timing and the need from partner, nom. to 2.1 (or even 2.2?) instead for consideration.

blocking-b2g: 2.0? → 2.1?

Assignee

Comment 57

•

10 years ago

Heads Up:

I am stuck, unable to do anything on this one until 16th January when I return to the US, since:
1. I am not even able to access GitHub from my country right now (ref: http://techcrunch.com/2014/12/31/indian-government-censorsht/)
2. Even a few days back, when I was able to, my pathetic connection here was giving me speeds of around 10-12 kbps, which was barely adequate even to make this post :(

Comment 58

•

10 years ago

(In reply to Rabimba from comment #57)

No worries, do contact me if you need any help!

Comment 59

•

10 years ago

Clearing blocking nom, refer to comment in : https://bugzilla.mozilla.org/show_bug.cgi?id=1114866

Flags: needinfo?(bbajaj)

Updated

•

10 years ago

blocking-b2g: 2.1? → ---

Comment 60

•

9 years ago

Hi Rabimba! Seems like this was dropped at one point :) You mentioned doing a pull request for this in comment 53. Care to push this forwards? thanks!

Flags: needinfo?(karanjai.moz)