Closed Bug 1007547 (Basque-WordPrediction) Opened 11 years ago Closed 10 years ago

[B2G][l10n][Gaia][Keyboard] Basque: auto-correction and auto-suggestion

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

x86
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.1+, b2g-v2.1 verified, b2g-v2.2 verified)

VERIFIED FIXED
2.1 S6 (10oct)
blocking-b2g 2.1+
Tracking Status
b2g-v2.1 --- verified
b2g-v2.2 --- verified

People

(Reporter: julenx, Assigned: ileturia)

Attachments

(1 file, 2 obsolete files)

We would like to have the auto-correction and auto-suggestion features for people using the keyboard in a Basque locale. It's worth noting that there's no such thing as a Basque keyboard (see bug 995509), so I'm not sure where this should live, as the choice of which auto-correction/suggestion dictionary to use is made there. I've seen there are some dictionaries/wordlists at https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries Are there any special instructions, apart from those listed in the README file, for building our own auto-correction/suggestion features?
I think the dictionary is defined at keyboard level. Since Catalan is using the Spanish keyboard, I'm not sure that's possible. The alternative is to fork the Spanish keyboard, call it Catalan, and add a dictionary. CCing also Pike who might know more.
Sounds like a question for keyboard folks really. Also, flod, *Basque*. Just sayin'
Flags: needinfo?(dflanagan)
(In reply to Axel Hecht [:Pike] from comment #2) > Also, flod, *Basque*. Just sayin' Need. More. Coffee. :-(
Yes, autocorrect dictionaries are tied to layouts, which is why we have per-language layouts rather than just QWERTY, AZERTY, etc. So if we can get a Basque dictionary, then we'd just have to clone the most appropriate layout file and link it to the Basque dictionary. Kevin Scannell is our source for wordlists for languages not supported by Android. Kevin: do you have time to create a Basque wordlist for us? Francesco, Axel, or Julen: if Kevin can create a wordlist for us, maybe you can take it from there. Or ask Rudy or I for assistance.
Flags: needinfo?(dflanagan) → needinfo?(kscanne)
Always happy to help my Basque friends :) That said, Basque is morphologically very complex, and so no matter how big of a corpus I collect, there will be many words missing. For example, the Xuxen spellchecker addon accepts hundreds of millions of words in total (so many it's hard to even estimate), but accepts 86-87% of words in typical running texts. Julen, any thoughts on this? Would you be satisfied with a frequency list of say 1.5-2M words even if there are many gaps? Also, do you want me to only include words that the spellchecker accepts? This is what I've done for other languages to avoid English or Spanish "pollution", but again this might leave out some important words. I could also send you a list of the most frequent words not accepted by the spell checker and you could manually clean that list (and potentially add them to Xuxen).
Flags: needinfo?(kscanne)
(In reply to David Flanagan [:djf] from comment #4) > Francesco, Axel, or Julen: if Kevin can create a wordlist for us, maybe you > can take it from there. Or ask Rudy or I for assistance. I can take care of it. @Julen Can we use Spanish as base? These strings should be correct for Basque label: 'Catalan' menuLabel: 'Euskara'
... sends coffee.
Something is seriously wrong here, too many things going on these days :-\ label: 'Basque' menuLabel: 'Euskara'
(In reply to Francesco Lodolo [:flod] from comment #6) > > @Julen > Can we use Spanish as base? As a start yes, since the majority of potential users will be in Spain-governed areas. It'd be nice if we could have another configuration for users from the French area since they'll be used to have French keyboard layouts. But as I understood in bug 995509, that's not possible yet(?).
(In reply to Francesco Lodolo [:flod] from comment #8) > Something is seriously wrong here, too many things going on these days :-\ > > label: 'Basque' > menuLabel: 'Euskara' You got the coffee right this time :)
(In reply to Kevin Scannell from comment #5) > Always happy to help my Basque friends :) > > That said, Basque is morphologically very complex, and so no matter how big > of a corpus I collect, there will be many words missing. For example, the > Xuxen spellchecker addon accepts hundreds of millions of words in total (so > many it's hard to even estimate), but accepts 86-87% of words in typical > running texts. Julen, any thoughts on this? Would you be satisfied with a > frequency list of say 1.5-2M words even if there are many gaps? > > Also, do you want me to only include words that the spellchecker accepts? > This is what I've done for other languages to avoid English or Spanish > "pollution", but again this might leave out some important words. I could > also send you a list of the most frequent words not accepted by the spell > checker and you could manually clean that list (and potentially add them to > Xuxen). Thanks for the input and willingness to help, Kevin. In order to have a more contrasted opinion, I'll contact some fellow colleagues that have been working on these things for a while and ideally ask them to chime in. Otherwise I'll come back with a more specific answer once I've heard from them.
(In reply to Julen Ruiz Aizpuru from comment #9) > As a start yes, since the majority of potential users will be in > Spain-governed areas. It'd be nice if we could have another configuration > for users from the French area since they'll be used to have French keyboard > layouts. But as I understood in bug 995509, that's not possible yet(?). Yep, if you use the French keyboard, it will have French suggestions. And AZERTY/QWERTY are not compatible, so I don't think we have an easy way out.
This might be a little left-field but how about taking a middle-ground approach? Why not take a more modern layout which offers improved speed but which is not common to either Spain or France? Like Dvorak or something? Or do a hybrid between French/Spanish if you think a layout like Dvorak is too different?
(In reply to Julen Ruiz Aizpuru from comment #11) > (In reply to Kevin Scannell from comment #5) > > Always happy to help my Basque friends :) > > That said, Basque is morphologically very complex, and so no matter how big > > of a corpus I collect, there will be many words missing. For example, the > > Xuxen spellchecker addon accepts hundreds of millions of words in total (so > > many it's hard to even estimate), but accepts 86-87% of words in typical > > running texts. Julen, any thoughts on this? Would you be satisfied with a > > frequency list of say 1.5-2M words even if there are many gaps? > > Also, do you want me to only include words that the spellchecker accepts? > > This is what I've done for other languages to avoid English or Spanish > > "pollution", but again this might leave out some important words. I could > > also send you a list of the most frequent words not accepted by the spell > > checker and you could manually clean that list (and potentially add them to > > Xuxen). > Thanks for the input and willingness to help, Kevin. > In order to have a more contrasted opinion, I'll contact some fellow > colleagues that have been working on these things for a while and ideally > ask them to chime in. Otherwise I'll come back with a more specific answer > once I've heard from them. Julen contacted us to ask if we could provide a larger word frequency list. We have a Basque web corpus of around 200 million words, with around 4 million different word forms, whose frequency list we could provide. What is the format that would be needed? word-TAB-frequency-NEWLINE? And also, when would it be needed, in order to appear in the forthcoming 2.0 version (if possible)? As a side question, my institution, Elhuyar Fundazioa (the owner of the corpus), has asked if it would be possible for it to somehow appear in the credits for the Basque auto-suggestion...
You may want to check out bug 992647 - we finally managed to get Gaelic/Irish/Manx onto the system with the help of janjongboom. There seem to be three steps to this process (summing up; I'm thinking of doing a writeup anyway and posting it on Mozilla somewhere):
1. Grab the nearest matching keyboard js from github and adjust it for the new language. Just open it in Notepad++ or some such text editor, it's easy enough to understand.
2. Create a wordlist, using an existing one as a pattern.
2.1 Size - according to Jan, size is not an issue but advice is to follow en-US size for now. So less than 150k lines.
2.2 Ranking. Based on a corpus, rank them according to frequency using a scale from 1 to 255.
2.3 Collect this data in xml format following the format the other locales use, i.e.:
    <wordlist locale="ga" description="Gaeilge" date="1401554807" version="1">
    <w f="255" flags="">a</w>
    <w f="254" flags="">an</w>
    <w f="247" flags="">agus</w>
So for Basque that would be (not sure about the date stamp):
    <wordlist locale="eu" description="Euskara" date="1401554807" version="1">
    <w f="255" flags="">eta</w>
    <w f="254" flags="">da</w>
    <w f="247" flags="">hau</w>
The f= is the frequency (1 to 255, most frequent at the top).
There is then a process to create the working files:
1. Go in the gaia directory to apps/keyboard/js/imes/latin/dictionaries
2. Open Makefile
3. Append the name of your dictionary to the list (e.g. if it's called sv_wordlist.xml), put:
    ru.dict \
    sk.dict \
    sl.dict \
    sv.dict \
    sr.dict \
4. Make sure the _wordlist.xml file is in the folder
5. Run 'make sv.dict'
At this stage, as you can see in the Gaelic bug, I got stuck, and if you do, it might be best to ask Jan for help. But then I'm a noob when it comes to code; it might be really obvious to Julen. He also helped me get a test version onto my Flame. Hope that helps.
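The wordlist steps above could be scripted roughly as follows. This is only a minimal sketch, not the tooling used in this bug: it assumes the corpus is available as word-TAB-frequency lines, and the logarithmic mapping onto 1..255 is an assumption, since the thread only says to rank by frequency on that scale. All function names are hypothetical.

```python
import math
from xml.sax.saxutils import escape, quoteattr


def parse_frequency_lines(lines):
    """Parse 'word<TAB>count' lines into a dict, skipping malformed rows."""
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        try:
            count = int(parts[1])
        except ValueError:
            continue
        if parts[0] and count > 0:
            counts[parts[0]] = counts.get(parts[0], 0) + count
    return counts


def scale_frequencies(counts, max_entries=150000):
    """Keep the most frequent words and map raw counts onto 1..255.

    The log scale is an assumption made for this sketch; the thread
    does not say how the 1..255 values should be derived.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = ranked[:max_entries]
    if not ranked:
        return []
    top = math.log(ranked[0][1] + 1)
    return [(word, max(1, round(255 * math.log(count + 1) / top)))
            for word, count in ranked]


def build_wordlist_xml(counts, locale, description, date, version=1):
    """Render the scaled list in the <wordlist>/<w> layout quoted above."""
    lines = ['<wordlist locale=%s description=%s date="%d" version="%d">'
             % (quoteattr(locale), quoteattr(description), date, version)]
    for word, f in scale_frequencies(counts):
        lines.append('<w f="%d" flags="">%s</w>' % (f, escape(word)))
    lines.append("</wordlist>")
    return "\n".join(lines)
```

The 150000 cap follows the "less than 150k lines" advice above; the result would be saved as eu_wordlist.xml next to the Makefile before running make.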
Igor, the instructions written by Michael look to be the most accurate steps to follow for now. He has just written an extended version at https://developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Customizing_the_keyboard#New_locales_from_the_localizers_perspective
I have doubts as to how to act regarding case... Does the word frequency list have to be completely case-insensitive and in lower case, or should I treat 'herri', 'Herri' and 'HERRI' as different words with different frequencies? If the word prediction does not take into account the case of the words the user is typing, then a case-insensitive frequency list would be best. But if it does, then the ideal would be not to take into account the case except for proper nouns, but I do not have this information...
In Gaelic we only capped proper nouns and, in terms of frequencies, ignored case completely. This works reasonably well (by "reasonably" I mean that the prediction feature in Firefox OS is fairly basic and does not take into account anything but single word frequencies, so it's pretty useless at predicting the next word). Caps in the live system seem to work like this:
- sentence initial is auto-capped (e.g. atz will get you > Atzo)
- single shift press caps sentence medial initial letter (e.g. SHIFT elk will get you > Elkar)
All caps is a bit counter-intuitive. If you double press SHIFT, you start typing in caps (e.g. HER), but if you select a suggestion, it reduces this to single cap (e.g. Herri). But I suspect that's a bug rather than a feature. Personally I think that for now you can ignore case for frequencies and only cap proper nouns in the lexicon.
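One way to apply that advice to a raw frequency list is sketched below. It merges case variants into a single entry and keeps the capitalised form only when that form dominates. The 90% threshold is purely an assumption standing in for a real proper-noun list (which, as noted above, is not available), and fold_case is a hypothetical helper name.

```python
from collections import defaultdict


def fold_case(counts, proper_ratio=0.9):
    """Merge case variants of each word into one entry.

    A word keeps its capitalised form only if that form accounts for at
    least `proper_ratio` of all its occurrences (a rough heuristic for
    proper nouns, assumed for this sketch); otherwise it is lowercased.
    """
    groups = defaultdict(dict)
    for form, count in counts.items():
        groups[form.lower()][form] = count

    folded = {}
    for lower, forms in groups.items():
        total = sum(forms.values())
        capitalised = lower[:1].upper() + lower[1:]
        cap_count = forms.get(capitalised, 0)
        if capitalised != lower and cap_count / total >= proper_ratio:
            folded[capitalised] = total
        else:
            folded[lower] = total
    return folded
```

With this heuristic, 'herri', 'Herri' and 'HERRI' collapse into a single lowercase entry whose frequency is the sum of the three, while a word seen almost only as 'Bilbo' stays capitalised.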
OK, Michael, thanks! Another doubt: should I include numbers, dates, times, etc? In Basque they can come inflected, as in '2014ko', or '19:30ean'...
That's probably a judgement call at this stage, but I notice that the English lexicon does not have 5th, 6th etc. in it. That may have something to do with the keyboard settings, which allow th via longpress on the numbers:
    alt: {
      '1': ['¹', '1st'],
      '2': ['²', '2nd'],
      '3': ['³', '3rd'],
So you could change that to
    alt: {
      '1': ['¹', '1en'],
      '2': ['²', '2garren'],
      '3': ['³', '3garren'],
It might also be possible to do (removing the superscript numbers, they don't seem to work anyway)
    alt: {
      '1': ['1en', '1ak'],
      '2': ['2garren', '2ak'],
      '3': ['3garren', '3ak'],
and outsource some of the more common ones into the long press menu of the number, especially those for giving the time. I can't see why that wouldn't work.
In terms of the years, my guess is that they would have to get added if you want 1887tik to come up when someone enters 1887tik, but given the verb/noun system, I'd be tempted to leave those for manual entry and add more noun/verb inflections. Why don't you make a judgement call, build a first version and play around with that? As with Gaelic, I'm sure you'll end up submitting a second version (at least) anyway. It would be nice if we could develop an affix system like Hunspell has; it would make the system much more powerful for languages like Basque or Gaelic. Kevin, you know Hunspell fairly well, any thoughts on whether that might work?
So, if there is not a clear criterion regarding numbers, then it is up to us to decide... As you say, Michael, I will make a first try and play with it, and improve it later if necessary. And yes, Hunspell can be used for Basque at least to some extent, and it has been used. But I do not know if it can be used in the FirefoxOS context...
OK, so following Michael's instructions (very clear, thanks!), I have been able to build the keyboard layout and the dictionary. I think everything has gone all right, and I have submitted a pull request, https://github.com/mozilla-b2g/gaia/pull/23038. According to the instructions on https://developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Customizing_the_keyboard#New_locales_from_the_localizers_perspective, I should now be able to test the keyboard at the address https://github.com/timdream/gaia-keyboard-demo (or http://timdream.org/gaia-keyboard-demo/), but the Basque keyboard is not among the options there. I guess the pull request has to be accepted and merged before? Anyone here with admin rights to do that?
Ez horregatik :) Copying in Jan - could you take a look? I think Igor has done most of the legwork.
Attached file Patch (obsolete) (deleted) —
Attachment #8481347 - Flags: review?(janjongboom)
We have to remove the words with numbers. You can play around with the UI at: http://janjongboom.com/gaia-keyboard-demo (open in Chrome, I broke it in Firefox for some reason).
Flags: needinfo?(ileturia)
Attachment #8481347 - Flags: review?(janjongboom)
You are right, Jan. I have tried the keyboard in the address you pointed out and having included the inflected numbers does more harm than good: numbers are always suggested no matter what key I press! I will make another version without the numbers and try it, and if it works OK I will commit it. Thanks!
Flags: needinfo?(ileturia)
I have tried the dictionary without the numbers at the address provided by Jan and it works much better. I have committed and pushed the new dictionary, so now it is in the pull request. I guess we now have to wait for someone to approve the pull request?
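Dropping the numeric entries before building the dictionary can be done with a simple filter. The sketch below assumes the word counts are held in a dict and removes every entry containing a digit, which covers inflected forms such as '2014ko' or '19:30ean'; drop_numeric_entries is a hypothetical name, not part of the Gaia tooling.

```python
def drop_numeric_entries(counts):
    """Return a copy of the word counts without entries containing digits."""
    return {word: count for word, count in counts.items()
            if not any(ch.isdigit() for ch in word)}
```

Running this over the frequency list before generating the XML keeps inflected numbers out of the suggestions while leaving ordinary words untouched.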
In order for the PR to be in good shape to be merged, I believe you can squash all three commits into a single one (use `git rebase -i a8fea49^`, then write `squash` next to the last two commits) and then rebase the result on top of current master.
OK, Julen, thanks for the help! It's done now. Now we just wait till it gets merged? And then in what version will it be included? I'm eager to have it in my phone! ;-)
After a long time waiting for the acceptance of the Pull Request, it no longer could be automatically merged and the Pull Request was closed by some robot... I have made a new Pull Request based on Gaia's current master which can be automatically merged (https://github.com/mozilla-b2g/gaia/pull/24667). Could someone please review, accept and merge it? Thanks!
Flags: needinfo?(janjongboom)
[Blocking Requested - why for this release]: Nominated as v2.1+ according to bug 1077033 comment 1, === this is a Tako shipping locale, we're going to need it in 2.1 ===
blocking-b2g: --- → 2.1?
Attached file Updated patch (obsolete) (deleted) —
Adding an attachment for the updated PR. Jan, could you please help review this since you already looked into it? Let me know if you need me to take over. Thanks.
Attachment #8499453 - Flags: review?(janjongboom)
Attachment #8481347 - Attachment is obsolete: true
Attached file Patch v3 (deleted) —
r=me, except for two things:
- We changed the way alt pages work, so I copied the one from Spanish in here.
- The alt numbers (like 4ko) don't fit in the current alt char menu. Since you can type these anyway, because the suffix is in Latin script (contrary to Spanish), I removed them for now.
Can you please let me know if you're OK with this change?
Attachment #8499453 - Attachment is obsolete: true
Attachment #8499453 - Flags: review?(janjongboom)
Attachment #8500340 - Flags: review+
Attachment #8500340 - Flags: feedback?(ileturia)
Flags: needinfo?(janjongboom)
2.1 shipping locale, blocking
blocking-b2g: 2.1? → 2.1+
Yes, Jan, if the alt numbers don't fit, then the best thing to do is to remove them, as you have done.
Comment on attachment 8500340 [details] Patch v3 OK with the change.
Attachment #8500340 - Flags: feedback?(ileturia)
Hi ileturia, can you help to land this? Thanks.
Flags: needinfo?(ileturia)
Assignee: nobody → ileturia
Status: NEW → ASSIGNED
Assigning to ileturia, the patch author.
Everything is OK on my part, and the PR is ready to be merged. Maybe Jan has merge rights? Sending needinfo to him.
Flags: needinfo?(ileturia) → needinfo?(janjongboom)
Tree is closed at the moment.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(janjongboom)
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S6 (10oct)
So glad to finally see this hit the gaia repository! On a side note, may I ask why the authorship is not attributed to ileturia in the above commit?
It is: see the actual commit https://github.com/mozilla-b2g/gaia/commit/992cd373070e539760c35369e3b582aa928a5dfc The merge commit is authored by someone else, but the underlying commit is by Igor. See also: https://github.com/mozilla-b2g/gaia/commits?author=e-gor
Aye, I see now, thanks! These merge commits are a jungle hiding the real meat.
This should not be auto uplifted to v2.1, since we changed the format of the layout definition in v2.2.
Comment on attachment 8500340 [details] Patch v3
[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): This is a new feature to add Basque keyboard layout.
[User impact] if declined: Cannot present a Basque based keyboard layout to the native user.
[Testing completed]: yes.
[Risk to taking this patch] (and alternatives if risky): pretty low, the layout file itself is just a definition of the layout without logic inside.
[String changes made]: N/A
Attachment #8500340 - Flags: approval-gaia-v2.1?
Whiteboard: NO_UPLIFT
Attachment #8500340 - Flags: approval-gaia-v2.1? → approval-gaia-v2.1+
Gaia v2.1, b9e9f537531cb7c577cff64d00da340aa1067bed
Whiteboard: NO_UPLIFT
So now the Basque keyboard with word prediction has been committed to master, it should come shipped in the nightly OTA updates for the Flame, right? The problem is, I have a Flame with the latest master build (2.2) of the production build (including locales), and with the nightly update channel. But although I have the very latest nightly build (the one for today), I cannot find the option of the Basque keyboard anywhere... As I understand, with the change we made in build/config/keyboard-layouts.json, that is,
    "eu": [
    + {"layoutId": "eu", "app": ["apps", "keyboard"]},
      {"layoutId": "es", "app": ["apps", "keyboard"]},
      {"layoutId": "fr", "app": ["apps", "keyboard"]},
      {"layoutId": "en", "app": ["apps", "keyboard"]}
    ],
the default keyboard for the Basque locale should be the Basque one we created, but although I have the Basque locale installed, when I type anything I only have the other three available. I've gone through the settings in case there is something that I need to change or activate, but could not find anything... The nearest I could find was an option to add a keyboard, but Basque was not available there... Am I missing something?
(In reply to ileturia from comment #49)
> So now the Basque keyboard with word prediction has been committed to master, it should come shipped in the nightly OTA updates for the Flame, right?
I don't think this happens automatically; the build owner needs to add it to the config as well. Easiest fix:
    $ cd gaia
    $ APP=keyboard GAIA_KEYBOARD_LAYOUTS=en,zh-Hans-Pinyin,zh-Hant-Zhuyin,nl,es,fr,eu make install-gaia
This adds the keyboard to your phone. Maybe Rudy knows what our policy is for Flame builds these days.
Flags: needinfo?(rlu)
As Jan said, the preload set of keyboard layouts is configured at build time. For Flame builds, we could open a separate bug to include more layouts like this one, bug 1020068. BTW, if you guys can wait, we have Bug 1029951 to track a feature for downloading dictionaries dynamically.
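To illustrate why the Basque layout did not show up, the sketch below models the situation described in the last few comments. It is a simplified model and not Gaia's actual build logic: it assumes the locale entry is a top-level key of the JSON (the real keyboard-layouts.json may nest it differently) and that a layout is only usable if it is also named in the build-time GAIA_KEYBOARD_LAYOUTS list. available_layouts is a hypothetical helper.

```python
import json

# Fragment quoted earlier in this bug; its position inside the real
# keyboard-layouts.json file is an assumption of this sketch.
LAYOUT_CONFIG = """
{
  "eu": [
    {"layoutId": "eu", "app": ["apps", "keyboard"]},
    {"layoutId": "es", "app": ["apps", "keyboard"]},
    {"layoutId": "fr", "app": ["apps", "keyboard"]},
    {"layoutId": "en", "app": ["apps", "keyboard"]}
  ]
}
"""


def available_layouts(config_text, locale, preloaded):
    """List the layouts configured for a locale that the build also ships.

    `preloaded` is a comma-separated list in the same form as the
    GAIA_KEYBOARD_LAYOUTS value used in the make command above.
    """
    config = json.loads(config_text)
    wanted = [entry["layoutId"] for entry in config.get(locale, [])]
    shipped = set(preloaded.split(","))
    return [layout for layout in wanted if layout in shipped]
```

Under these assumptions, a build whose preload list lacks eu would offer only es, fr and en for the Basque locale, which matches what was observed on the Flame nightly; adding eu to the list makes it the first entry.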
Flags: needinfo?(rlu)
(In reply to Rudy Lu [:rudyl] from comment #51) > As Jan said, the preload set of keyboard layouts is configured at build time. > For Flame builds, we could open a separate bug to include more layouts like > this one, bug 1020068. My understanding was that we weren't accepting new layouts and were waiting for bug 1029951 (see bug 1050574 and all the duplicates).
This issue is verified fixed on Flame 2.2 and 2.1. The auto-correct and auto-suggestion functions work properly when the keyboard is in Basque.
Flame 2.2
    Device: Flame 2.2 Master (319mb)(Kitkat Base)(Full Flash)
    BuildID: 20141021040206
    Gaia: ba6667c83c5d0fb1e333349dfeaf5f6ca8043e63
    Gecko: 29fbfc1b31aa
    Gonk: 05aa7b98d3f891b334031dc710d48d0d6b82ec1d
    Version: 36.0a1 (2.2 Master)
    Firmware: V180
    User Agent: Mozilla/5.0 (Mobile; rv:36.0) Gecko/36.0 Firefox/36.0
Flame 2.1
    Device: Flame 2.1 (319mb)(Kitkat Base)(Full Flash)
    BuildID: 20141021001201
    Gaia: f896470b694e3e76e39c5d48f1428b847a10b8fd
    Gecko: ee86921a986f
    Gonk: 05aa7b98d3f891b334031dc710d48d0d6b82ec1d
    Version: 34.0 (2.1)
    Firmware: V180
    User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Keywords: verifyme
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Alias: Basque-WordPrediction