Closed
Bug 1007547
(Basque-WordPrediction)
Opened 11 years ago
Closed 10 years ago
[B2G][l10n][Gaia][Keyboard] Basque: auto-correction and auto-suggestion
Categories
(Firefox OS Graveyard :: Gaia::Keyboard, defect)
Tracking
(blocking-b2g:2.1+, b2g-v2.1 verified, b2g-v2.2 verified)
People
(Reporter: julenx, Assigned: ileturia)
Attachments
(1 file, 2 obsolete files)
(deleted), text/x-github-pull-request | janjongboom: review+ | fabrice: approval-gaia-v2.1+
We would like to have the auto-correction and auto-suggestion features for people using the keyboard in a Basque locale.
It's worth noting that there's no such thing as a Basque keyboard (see bug 995509), so I'm not sure where this should live, since the choice of auto-correction/suggestion dictionary is made at the keyboard level.
I've seen there are some dictionaries/wordlists at https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries
Any special instructions apart from those listed in the README file for building our own auto-correction/suggestion features?
Comment 1•11 years ago
I think the dictionary is defined at keyboard level. Since Catalan is using the Spanish keyboard, I'm not sure that's possible.
The alternative is to fork the Spanish keyboard, call it Catalan, and add a dictionary.
CCing also Pike who might know more.
Comment 2•11 years ago
Sounds like a question for keyboard folks really.
Also, flod, *Basque*. Just sayin'
Flags: needinfo?(dflanagan)
Comment 3•11 years ago
(In reply to Axel Hecht [:Pike] from comment #2)
> Also, flod, *Basque*. Just sayin'
Need. More. Coffee. :-(
Comment 4•11 years ago
Yes, autocorrect dictionaries are tied to layouts, which is why we have per-language layouts rather than just QWERTY, AZERTY, etc. So if we can get a Basque dictionary, then we'd just have to clone the most appropriate layout file and link it to the Basque dictionary.
Kevin Scannell is our source for wordlists for languages not supported by Android. Kevin: do you have time to create a Basque wordlist for us?
Francesco, Axel, or Julen: if Kevin can create a wordlist for us, maybe you can take it from there. Or ask Rudy or I for assistance.
Flags: needinfo?(dflanagan) → needinfo?(kscanne)
Comment 5•11 years ago
Always happy to help my Basque friends :)
That said, Basque is morphologically very complex, and so no matter how big of a corpus I collect, there will be many words missing. For example, the Xuxen spellchecker addon accepts hundreds of millions of words in total (so many it's hard to even estimate), but accepts 86-87% of words in typical running texts. Julen, any thoughts on this? Would you be satisfied with a frequency list of say 1.5-2M words even if there are many gaps?
Also, do you want me to only include words that the spellchecker accepts? This is what I've done for other languages to avoid English or Spanish "pollution", but again this might leave out some important words. I could also send you a list of the most frequent words not accepted by the spell checker and you could manually clean that list (and potentially add them to Xuxen).
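For anyone wanting to reproduce a coverage figure like the 86-87% mentioned above for their own wordlist, here is a minimal Python sketch. It is not how Xuxen measures anything; `token_coverage` and the toy data are made up for illustration, and tokenization is assumed to be a plain whitespace split.

```python
def token_coverage(wordlist, tokens):
    """Return the fraction of running-text tokens found in the wordlist."""
    if not tokens:
        return 0.0
    # Compare case-insensitively (an assumption, not a rule from this bug).
    known = {w.lower() for w in wordlist}
    hits = sum(1 for t in tokens if t.lower() in known)
    return hits / len(tokens)


# Toy data, for illustration only.
wordlist = ["eta", "da", "hau", "etxe"]
text = "hau etxea da eta hau mendia da".split()
print(f"{token_coverage(wordlist, text):.0%}")
```

Inflected forms such as "etxea" count as misses here, which is exactly the gap a finite list leaves for a morphologically rich language.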
Flags: needinfo?(kscanne)
Comment 6•11 years ago
(In reply to David Flanagan [:djf] from comment #4)
> Francesco, Axel, or Julen: if Kevin can create a wordlist for us, maybe you
> can take it from there. Or ask Rudy or I for assistance.
I can take care of it.
@Julen
Can we use Spanish as base?
These strings should be correct for Basque
label: 'Catalan'
menuLabel: 'Euskara'
Comment 7•11 years ago
... sends coffee.
Comment 8•11 years ago
Something is seriously wrong here, too many things going on these days :-\
label: 'Basque'
menuLabel: 'Euskara'
Reporter
Comment 9•11 years ago
(In reply to Francesco Lodolo [:flod] from comment #6)
>
> @Julen
> Can we use Spanish as base?
As a start yes, since the majority of potential users will be in Spain-governed areas. It'd be nice if we could have another configuration for users from the French area since they'll be used to have French keyboard layouts. But as I understood in bug 995509, that's not possible yet(?).
Reporter
Comment 10•11 years ago
(In reply to Francesco Lodolo [:flod] from comment #8)
> Something is seriously wrong here, too many things going on these days :-\
>
> label: 'Basque'
> menuLabel: 'Euskara'
You got the coffee right this time :)
Reporter
Comment 11•11 years ago
(In reply to Kevin Scannell from comment #5)
> Always happy to help my Basque friends :)
>
> That said, Basque is morphologically very complex, and so no matter how big
> of a corpus I collect, there will be many words missing. For example, the
> Xuxen spellchecker addon accepts hundreds of millions of words in total (so
> many it's hard to even estimate), but accepts 86-87% of words in typical
> running texts. Julen, any thoughts on this? Would you be satisfied with a
> frequency list of say 1.5-2M words even if there are many gaps?
>
> Also, do you want me to only include words that the spellchecker accepts?
> This is what I've done for other languages to avoid English or Spanish
> "pollution", but again this might leave out some important words. I could
> also send you a list of the most frequent words not accepted by the spell
> checker and you could manually clean that list (and potentially add them to
> Xuxen).
Thanks for the input and willingness to help, Kevin.
In order to have a more contrasted opinion, I'll contact some fellow colleagues that have been working on these things for a while and ideally ask them to chime in. Otherwise I'll come back with a more specific answer once I've heard from them.
Comment 12•11 years ago
(In reply to Julen Ruiz Aizpuru from comment #9)
> As a start yes, since the majority of potential users will be in
> Spain-governed areas. It'd be nice if we could have another configuration
> for users from the French area since they'll be used to have French keyboard
> layouts. But as I understood in bug 995509, that's not possible yet(?).
Yep, if you use the French keyboard, it will have French suggestions.
And AZERTY/QWERTY are not compatible, so I don't think we have an easy way out.
Comment 13•11 years ago
This might be a little left-field but how about taking a middle-ground approach? Why not take a more modern layout which offers improved speed but which is not common to either Spain or France? Like Dvorak or something? Or do a hybrid between French/Spanish if you think a layout like Dvorak is too different?
Assignee
Comment 14•10 years ago
(In reply to Julen Ruiz Aizpuru from comment #11)
> (In reply to Kevin Scannell from comment #5)
> > Always happy to help my Basque friends :)
> > That said, Basque is morphologically very complex, and so no matter how big
> > of a corpus I collect, there will be many words missing. For example, the
> > Xuxen spellchecker addon accepts hundreds of millions of words in total (so
> > many it's hard to even estimate), but accepts 86-87% of words in typical
> > running texts. Julen, any thoughts on this? Would you be satisfied with a
> > frequency list of say 1.5-2M words even if there are many gaps?
> > Also, do you want me to only include words that the spellchecker accepts?
> > This is what I've done for other languages to avoid English or Spanish
> > "pollution", but again this might leave out some important words. I could
> > also send you a list of the most frequent words not accepted by the spell
> > checker and you could manually clean that list (and potentially add them to
> > Xuxen).
> Thanks for the input and willingness to help, Kevin.
> In order to have a more contrasted opinion, I'll contact some fellow
> colleagues that have been working on these things for a while and ideally
> ask them to chime in. Otherwise I'll come back with a more specific answer
> once I've heard from them.
Julen contacted us to ask if we could provide a larger word frequency list. We have a Basque web corpus of around 200 million words, with around 4 million different word forms, whose frequency list we could provide.
What is the format that would be needed? word-TAB-frequency-NEWLINE? And also, when would it be needed, in order to appear in the forthcoming 2.0 version (if possible)?
As a side question, my institution, Elhuyar Fundazioa (the owner of the corpus), has asked if it would be possible for it to somehow appear in the credits for the Basque auto-suggestion...
Comment 15•10 years ago
You may want to check out bug 992647 - we finally managed to get Gaelic/Irish/Manx onto the system with the help of janjongboom.
There seem to be three steps to this process (summing up; I'm thinking of doing a writeup anyway and posting it on Mozilla somewhere):
1. Grab the nearest matching keyboard js from github and adjust it for the new language. Just open it in Notepad++ or some such text editor, it's easy enough to understand.
2. Create a wordlist, using an existing one as a pattern.
2.1 Size - according to Jan, size is not an issue but advice is to follow en-US size for now. So less than 150k lines.
2.2 Ranking. Based on a corpus, rank them according to frequency using a scale from 1 to 255.
2.3 Collect this data in xml format following the format the other locales use i.e.:
<wordlist locale="ga" description="Gaeilge" date="1401554807" version="1">
<w f="255" flags="">a</w>
<w f="254" flags="">an</w>
<w f="247" flags="">agus</w>
So for Basque that would be (not sure about the date stamp)
<wordlist locale="eu" description="Euskara" date="1401554807" version="1">
<w f="255" flags="">eta</w>
<w f="254" flags="">da</w>
<w f="247" flags="">hau</w>
the f= is the frequency (1 to 255, most frequent at the top)
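A rough Python sketch of going from raw corpus counts to a wordlist in this layout. The log scaling is an assumption (the thread only asks for a 1 to 255 rank), the function names are made up, and the closing `</wordlist>` tag is added only because well-formed XML needs it. The 150k cap follows the size advice in 2.1.

```python
import math
import time
from xml.sax.saxutils import escape


def scale_frequencies(counts, max_words=150_000):
    """Map raw corpus counts to the 1..255 range, most frequent first."""
    # Keep only the most frequent entries (advice above: under 150k lines).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_words]
    if not ranked:
        return []
    top = math.log(ranked[0][1] + 1)
    scaled = []
    for word, count in ranked:
        # Log scaling is an assumption, not something this bug prescribes.
        f = max(1, round(255 * math.log(count + 1) / top))
        scaled.append((word, f))
    return scaled


def to_wordlist_xml(scaled, locale, description, date=None):
    """Render (word, f) pairs in the <wordlist>/<w> layout shown above."""
    date = int(time.time()) if date is None else date
    lines = [
        f'<wordlist locale="{locale}" description="{description}" '
        f'date="{date}" version="1">'
    ]
    for word, f in scaled:
        lines.append(f'<w f="{f}" flags="">{escape(word)}</w>')
    lines.append("</wordlist>")
    return "\n".join(lines)


# Made-up counts, for illustration only.
counts = {"eta": 1000, "da": 900, "hau": 10}
print(to_wordlist_xml(scale_frequencies(counts), "eu", "Euskara", date=1401554807))
```

A linear or rank-based scale would work just as well for the format; the only hard constraints taken from the thread are the 1 to 255 range and the descending order.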
There is then a process to create the working files:
1. Go in the gaia directory to apps/keyboard/js/imes/latin/dictionaries
2. Open Makefile
3. Append the name of your dictionary to the list. E.g., if your wordlist is called sv_wordlist.xml, add sv.dict:
ru.dict \
sk.dict \
sl.dict \
sv.dict \
sr.dict \
4. Make sure the _wordlist.xml file is in the folder
5. Run 'make sv.dict'
At this stage as you can see in the Gaelic bug I got stuck and if you do, it might be best to ask Jan for help. But then I'm a noob when it comes to code, it might be really obvious to Julen. He also helped me get a test version onto my Flame.
Hope that helps.
Reporter
Comment 16•10 years ago
Igor, the instructions written by Michael look to be the most accurate steps to follow for now. He has just written an extended version at https://developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Customizing_the_keyboard#New_locales_from_the_localizers_perspective
Assignee
Comment 17•10 years ago
I have doubts as to how to act regarding case... Does the word frequency list have to be completely case-insensitive and in lower case, or should I treat 'herri', 'Herri' and 'HERRI' as different words with different frequencies? If the word prediction does not take into account the case of the words the user is typing, then a case-insensitive frequency list would be best. But if it does, then the ideal would be not to take into account the case except for proper nouns, but I do not have this information...
Comment 18•10 years ago
In Gaelic we only capped proper nouns and, in terms of frequencies, ignored case completely. This works reasonably well (by "reasonably" I mean that the prediction feature in Firefox OS is fairly basic and does not take into account anything but single-word frequencies, so it's pretty useless at predicting the next word).
Caps in the live system seem to work like this:
- sentence initial is auto-capped (e.g. atz will get you > Atzo)
- single shift press caps sentence medial initial letter (e.g. SHIFT elk will get you > Elkar)
All caps is a bit counter-intuitive. If you double press SHIFT, you start typing in caps (e.g. HER) but if you select a suggestion, it reduces this to single cap (e.g. Herri). But I suspect that's a bug rather than a feature.
Personally I think that for now you can ignore case for frequencies and only cap proper nouns in the lexicon.
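As a sketch of that suggestion in Python: merge the counts of all case variants into one lower-case entry, and keep an initial capital only for words on a hand-made proper-noun list. `fold_case` and the `proper_nouns` set are hypothetical; nothing in the Gaia tooling provides them.

```python
from collections import Counter


def fold_case(counts, proper_nouns=frozenset()):
    """Merge case variants of each word into a single frequency entry."""
    folded = Counter()
    for form, count in counts.items():
        key = form.lower()
        # Only words listed (in lower case) as proper nouns keep a capital.
        if key in proper_nouns:
            key = key.capitalize()
        folded[key] += count
    return folded


# Made-up counts, for illustration only.
counts = {"herri": 50, "Herri": 20, "HERRI": 5, "bilbo": 1, "Bilbo": 30}
print(dict(fold_case(counts, proper_nouns={"bilbo"})))
```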
Assignee
Comment 19•10 years ago
OK, Michael, thanks!
Another doubt: should I include numbers, dates, times, etc? In Basque they can come inflected, as in '2014ko', or '19:30ean'...
Comment 20•10 years ago
That's probably a judgement call at this stage but I notice that the English lexicon does not have 5th, 6th etc in it but that may possibly have something to do with the keyboard settings which allow th via longpress on the numbers:
alt: {
'1': ['¹', '1st'],
'2': ['²', '2nd'],
'3': ['³', '3rd'],
So you could change that to
alt: {
'1': ['¹', '1en'],
'2': ['²', '2garren'],
'3': ['³', '3garren'],
It might also be possible to do (removing the superscript numbers, they don't seem to work anyway)
alt: {
'1': ['1en', '1ak'],
'2': ['2garren', '2ak'],
'3': ['3garren', '3ak'],
and outsource some of the more common ones into the long press menu of the number, especially those for giving the time. I can't see why that wouldn't work
In terms of the years, my guess is that they would have to get added if you want 1887tik to come up when someone enters 1887tik but given the verb/noun system, I'd be tempted to leave those for manual entry and add more noun/verb inflections.
Why don't you make a judgement call and build a first version and play around with that? As with Gaelic, I'm sure you'll end up submitting a second version (at least) anyway.
Would be nice if we could develop an affix system like Hunspell has, it would make the system much more powerful for languages like Basque or Gaelic.
Kevin, you know Hunspell fairly well, any thoughts on whether that might work?
Assignee
Comment 21•10 years ago
So, if there is not a clear criterion regarding numbers, then it is up to us to decide... As you say, Michael, I will make a first try and play with it, and improve it later if necessary.
And yes, Hunspell can be used for Basque at least to some extent, and it has been used. But I do not know if it can be used in the FirefoxOS context...
Assignee
Comment 22•10 years ago
OK, so following Michael's instructions (very clear, thanks!), I have been able to build the keyboard layout and the dictionary. I think everything has gone all right, and I have submitted a pull request, https://github.com/mozilla-b2g/gaia/pull/23038.
According to the instructions on https://developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Customizing_the_keyboard#New_locales_from_the_localizers_perspective, I should now be able to test the keyboard at the address https://github.com/timdream/gaia-keyboard-demo (or http://timdream.org/gaia-keyboard-demo/), but the Basque keyboard is not among the options there. I guess the pull request has to be accepted and merged before? Anyone here with admin rights to do that?
Comment 23•10 years ago
Ez horregatik :)
Copying in Jan - could you take a look? I think Igor has done most of the legwork.
Comment 24•10 years ago
Attachment #8481347 -
Flags: review?(janjongboom)
Comment 25•10 years ago
We have to remove the words with numbers. You can play around with the UI at:
http://janjongboom.com/gaia-keyboard-demo
(open in Chrome, I broke it in Firefox for some reason).
Flags: needinfo?(ileturia)
Updated•10 years ago
Attachment #8481347 -
Flags: review?(janjongboom)
Assignee
Comment 26•10 years ago
You are right, Jan. I have tried the keyboard in the address you pointed out and having included the inflected numbers does more harm than good: numbers are always suggested no matter what key I press! I will make another version without the numbers and try it, and if it works OK I will commit it. Thanks!
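Dropping the number-bearing entries before rebuilding can be done with a small filter. A minimal Python sketch, assuming the words are already in a plain list; `drop_numeric_words` is a made-up helper, not part of the dictionary build scripts.

```python
def drop_numeric_words(words):
    """Keep only entries that contain no digit characters."""
    return [w for w in words if not any(ch.isdigit() for ch in w)]


# Inflected numbers such as '2014ko' or '19:30ean' are removed.
print(drop_numeric_words(["2014ko", "19:30ean", "herri", "etxe"]))
```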
Flags: needinfo?(ileturia)
Assignee
Comment 27•10 years ago
I have tried the dictionary without the numbers at the address provided by Jan and it works much better. I have committed and pushed the new dictionary, so now it is in the pull request. I guess we now have to wait for someone to approve the pull request?
Reporter
Comment 28•10 years ago
In order for the PR to be in good shape to be merged, I believe you can squash all three commits into a single one (use `git rebase -i a8fea49^`, then write `squash` next to the last two commits) and then rebase it on top of current master.
Assignee
Comment 29•10 years ago
OK, Julen, thanks for the help! It's done now. Now we just wait till it gets merged? And then in what version will it be included? I'm eager to have it in my phone! ;-)
Assignee
Comment 30•10 years ago
After a long time waiting for the acceptance of the Pull Request, it no longer could be automatically merged and the Pull Request was closed by some robot...
I have made a new Pull Request based on Gaia's current master which can be automatically merged (https://github.com/mozilla-b2g/gaia/pull/24667). Could someone please review, accept and merge it? Thanks!
Flags: needinfo?(janjongboom)
Comment 32•10 years ago
[Blocking Requested - why for this release]:
Nominated as v2.1+ according to bug 1077033 comment 1,
===
this is a Tako shipping locale, we're going to need it in 2.1
===
blocking-b2g: --- → 2.1?
Comment 33•10 years ago
Adding an attachment for the updated PR.
Jan, could you please help review this since you already looked into it?
Let me know if you need me to take over.
Thanks.
Attachment #8499453 -
Flags: review?(janjongboom)
Updated•10 years ago
Attachment #8481347 -
Attachment is obsolete: true
Comment 34•10 years ago
r=me, except for two things:
- We changed the way alt pages work, so I copied the one from Spanish in here.
- The alt numbers (like 4ko) don't fit in the current alt char menu. Since you can type these directly because the suffix is in Latin script (unlike in Spanish), I removed them for now.
Can you please let me know if you're OK with this change?
Attachment #8499453 -
Attachment is obsolete: true
Attachment #8499453 -
Flags: review?(janjongboom)
Attachment #8500340 -
Flags: review+
Attachment #8500340 -
Flags: feedback?(ileturia)
Flags: needinfo?(janjongboom)
Assignee
Comment 36•10 years ago
Yes, Jan, if the alt numbers don't fit, then the best thing to do is to remove them, as you have done.
Assignee
Comment 37•10 years ago
Comment on attachment 8500340 [details]
Patch v3
OK with the change.
Attachment #8500340 -
Flags: feedback?(ileturia)
Updated•10 years ago
Assignee: nobody → ileturia
Status: NEW → ASSIGNED
Comment 39•10 years ago
Assign to ileturia, the patch author.
Assignee
Comment 40•10 years ago
Everything OK from my part, PR ready to be merged. Maybe Jan has merge rights? Sending needinfo to him.
Flags: needinfo?(ileturia) → needinfo?(janjongboom)
Comment 41•10 years ago
Tree is closed at the moment.
Comment 42•10 years ago
Landed to Gaia master,
https://github.com/mozilla-b2g/gaia/commit/37bb56d123df2def71be63b7cc3bdd4c9a2e2d6c
--
Thanks.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
status-b2g-v2.2:
--- → fixed
Flags: needinfo?(janjongboom)
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S6 (10oct)
Reporter
Comment 43•10 years ago
So glad to finally see this hit the gaia repository!
On a side note, may I ask why the authorship is not attributed to ileturia in the above commit?
Comment 44•10 years ago
It does: see the actual commit https://github.com/mozilla-b2g/gaia/commit/992cd373070e539760c35369e3b582aa928a5dfc
The merge commit is authored by someone else, but the underlying commit is by Igor. See also: https://github.com/mozilla-b2g/gaia/commits?author=e-gor
Reporter
Comment 45•10 years ago
Aye, I see now, thanks! These merge commits are a jungle hiding the real meat.
Comment 46•10 years ago
This should not be auto uplifted to v2.1, since we changed the format of the layout definition in v2.2.
Comment 47•10 years ago
Comment on attachment 8500340 [details]
Patch v3
[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): This is a new feature to add Basque keyboard layout.
[User impact] if declined: Cannot present a Basque based keyboard layout to the native user.
[Testing completed]: yes.
[Risk to taking this patch] (and alternatives if risky): pretty low, the layout file itself is just a definition of the layout without logic inside.
[String changes made]: N/A
Attachment #8500340 -
Flags: approval-gaia-v2.1?
Updated•10 years ago
Whiteboard: NO_UPLIFT
Updated•10 years ago
Attachment #8500340 -
Flags: approval-gaia-v2.1? → approval-gaia-v2.1+
Comment 48•10 years ago
Gaia v2.1,
b9e9f537531cb7c577cff64d00da340aa1067bed
status-b2g-v2.1:
--- → fixed
Whiteboard: NO_UPLIFT
Assignee
Comment 49•10 years ago
So now the Basque keyboard with word prediction has been committed to master, it should come shipped in the nightly OTA updates for the Flame, right?
The problem is, I have a Flame with the latest master build (2.2) of the production build (including locales), and with the nightly update channel. But although I have the very latest nightly build (the one for today), I cannot find the option of the Basque keyboard anywhere...
As I understand, with the change we made in build/config/keyboard-layouts.json, that is,
"eu": [
+ {"layoutId": "eu", "app": ["apps", "keyboard"]},
{"layoutId": "es", "app": ["apps", "keyboard"]},
{"layoutId": "fr", "app": ["apps", "keyboard"]},
{"layoutId": "en", "app": ["apps", "keyboard"]}
],
the default keyboard for the Basque locale should be the Basque one we created, but although I have the Basque locale installed, when I type anything I only have the other three available.
I've gone through the settings in case there is something that I need to change or activate, but could not find anything... The nearest I could find was an option to add a keyboard, but Basque was not available there... Am I missing something?
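To double-check what the config actually lists for a locale, a small Python sketch like the following can help. It embeds the fragment quoted above, wrapped in an outer object (an assumption about the file's overall shape); `layouts_for` is a made-up helper. It only reads the mapping and says nothing about which layouts a given build preloads.

```python
import json

# Fragment from the comment above, wrapped in an outer object (assumed shape).
config = json.loads("""
{
  "eu": [
    {"layoutId": "eu", "app": ["apps", "keyboard"]},
    {"layoutId": "es", "app": ["apps", "keyboard"]},
    {"layoutId": "fr", "app": ["apps", "keyboard"]},
    {"layoutId": "en", "app": ["apps", "keyboard"]}
  ]
}
""")


def layouts_for(config, locale):
    """Return the layout IDs listed for a locale, in file order."""
    return [entry["layoutId"] for entry in config.get(locale, [])]


print(layouts_for(config, "eu"))
```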
Comment 50•10 years ago
(In reply to ileturia from comment #49)
> So now the Basque keyboard with word prediction has been committed to
> master, it should come shipped in the nightly OTA updates for the Flame,
> right?
I don't think this happens automatically, build owner needs to add it to config as well. Easiest fix:
$ cd gaia
$ APP=keyboard GAIA_KEYBOARD_LAYOUTS=en,zh-Hans-Pinyin,zh-Hant-Zhuyin,nl,es,fr,eu make install-gaia
Adds the keyboard to your phone.
Maybe Rudy knows what our policy is for Flame builds these days.
Flags: needinfo?(rlu)
Comment 51•10 years ago
As Jan said, the preload set of keyboard layouts is configured at build time.
For Flame builds, we could open a separate bug to include more layouts like this one, bug 1020068.
BTW, if you guys can wait, we have Bug 1029951 to track a feature that dictionaries could be downloaded dynamically.
Flags: needinfo?(rlu)
Comment 52•10 years ago
(In reply to Rudy Lu [:rudyl] from comment #51)
> As Jan said, the preload set of keyboard layouts is configured at build time.
> For Flame builds, we could open a separate bug to include more layouts like
> this one, bug 1020068.
My understanding was that we weren't accepting new layouts and were waiting for bug 1029951 (see bug 1050574 and all the duplicates).
Comment 53•10 years ago
This issue is verified fixed on Flame 2.2 and 2.1.
The auto-correct and auto-suggestion functions work properly when the keyboard is in Basque.
Flame 2.2
Device: Flame 2.2 Master (319mb)(Kitkat Base)(Full Flash)
BuildID: 20141021040206
Gaia: ba6667c83c5d0fb1e333349dfeaf5f6ca8043e63
Gecko: 29fbfc1b31aa
Gonk: 05aa7b98d3f891b334031dc710d48d0d6b82ec1d
Version: 36.0a1 (2.2 Master)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:36.0) Gecko/36.0 Firefox/36.0
Flame 2.1
Device: Flame 2.1 (319mb)(Kitkat Base)(Full Flash)
BuildID: 20141021001201
Gaia: f896470b694e3e76e39c5d48f1428b847a10b8fd
Gecko: ee86921a986f
Gonk: 05aa7b98d3f891b334031dc710d48d0d6b82ec1d
Version: 34.0 (2.1)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Keywords: verifyme
Updated•10 years ago
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Updated•10 years ago
Blocks: WordPrediction
Updated•9 years ago
Alias: Basque-WordPrediction