Bug 1126076 (Closed): Add Hausa (ha) Wordlist/Dictionary
Opened 10 years ago; closed 9 years ago
Categories: Firefox OS Graveyard :: Gaia::Keyboard (defect)
Tracking: not tracked
Status: RESOLVED WONTFIX
People: Reporter: delphine; Unassigned; NeedInfo
Attachments: 3 files
Please add Hausa Wordlist and Dictionary to Firefox OS
Reporter
Comment 1•10 years ago
Adding the localizer to see if he can help with feedback here. Thanks.
Flags: needinfo?(mcsteann)
Reporter
Comment 2•10 years ago
No update from the localizer, so asking:
* Peiying: can you get Rubric plugged in to help with this?
* Kevin: could you help out also?
Thanks, all!
Flags: needinfo?(pmo)
Flags: needinfo?(kscanne)
Comment 4•10 years ago
(In reply to Peiying Mo [:CocoMo] from comment #3)
> + Devon and Ian, we need Rubric's advice on this.
Let's try the same approach as my comment 12 on bug 1121730.
Comment 5•10 years ago
There's a good amount of Hausa online, but like Lingala, only a small percentage (about 4% by my best estimate) uses the correct "special" characters (in this case ɓ, ƙ, ɗ).
The other issue is that there is no clean word list that uses the correct characters. The Firefox addon here:
https://addons.mozilla.org/en-us/firefox/addon/hausa-spelling-dictionary/
is virtually all ASCII.
From earlier work of mine on a spellchecker, I have what I think is a fairly comprehensive list of ~500 pairs of words that are correct either as ASCII or with special characters (ƙasa/kasa, saƙo/sako, etc.).
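(For illustration: the pairs above differ only by replacing the hooked letters with their ASCII look-alikes. A minimal sketch of that folding, assuming only ɓ, ɗ, ƙ and their capitals need handling:)

```python
# Hypothetical helper: fold Hausa "hooked" letters to their ASCII
# look-alikes, so that ƙasa -> kasa and saƙo -> sako. The character
# set covered here (ɓ/ɗ/ƙ plus capitals) is an assumption, not an
# exhaustive inventory of Hausa orthography.
FOLD = str.maketrans("ɓɗƙƁƊƘ", "bdkBDK")

def ascii_fold(word: str) -> str:
    """Return the ASCII spelling of a Hausa word."""
    return word.translate(FOLD)
```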
With that in mind, here's what I'd propose:
(1) I use the full web corpus to produce an ASCII-only frequency list, maybe validating against the Firefox spellchecker (I haven't checked its coverage, so I don't know if that's worthwhile)
(2) Use the 4% of properly-encoded web texts to produce a word list of words containing special characters, everything appearing more than, say, 2 or 3 times.
(3) Use the frequency of the ASCII version from (1) as a proxy for the frequency of the presumed-correct words from step (2)...
(4) *except* if the word is in my list of special cases (ƙasa/kasa, saƙo/sako, etc.). Here it's not clear what to do; I suppose I could split the frequency of the ASCII version from (1) according to the relative proportions I see in the good (4%) corpus.
I'd be grateful for some feedback from the Hausa team on this. If they don't care about preserving the special characters then I suppose I don't need to bother with any of this!
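The four steps above could be sketched roughly like this (a sketch only: the function names, the folding table, and the `min_count` threshold are all illustrative assumptions, not an actual implementation):

```python
from collections import Counter

# Fold the hooked letters to ASCII (assumed mapping, as discussed above).
FOLD = str.maketrans("ɓɗƙƁƊƘ", "bdkBDK")

def ascii_fold(word: str) -> str:
    return word.translate(FOLD)

def build_wordlist(ascii_freq: Counter, good_words, special_ascii: set,
                   min_count: int = 3) -> dict:
    """ascii_freq: folded-word frequencies from the full web corpus (step 1).
    good_words: tokens from the correctly encoded ~4% corpus (step 2).
    special_ascii: ASCII spellings that are themselves valid words
    (the ~500 pairs like kasa/ƙasa)."""
    good = Counter(good_words)
    out = {}
    for word, n in good.items():
        folded = ascii_fold(word)
        if word == folded or n < min_count:
            continue  # keep only attested special-character words (step 2)
        if folded in special_ascii:
            # Step 4: both spellings are real words; split the web-corpus
            # frequency by the proportions observed in the good corpus.
            share = good[word] / (good[word] + good[folded])
            out[word] = round(ascii_freq[folded] * share)
            out[folded] = ascii_freq[folded] - out[word]
        else:
            # Step 3: the ASCII form is presumed a misspelling of `word`;
            # use its web-corpus frequency as a proxy.
            out[word] = ascii_freq.get(folded, n)
    return out
```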
Flags: needinfo?(kscanne)
Comment 6•10 years ago
Kevin, you can also have a look at the PO files for Gaia itself:
https://github.com/translate/mozilla-gaia/commits/2.0/ha
Here are also a few from old GNOME translations:
https://l10n.gnome.org/POT/gnome-panel.master/gnome-panel.master.ha.po
https://l10n.gnome.org/POT/metacity.master/metacity.master.ha.po
https://l10n.gnome.org/POT/nautilus.master/nautilus.master.ha.po
(Be sure to include obsolete messages if you can; there seems to be quite a lot of text there.)
Maybe this is already in your web corpus, but if not, it hopefully gives you a bit more text (horribly biased, unfortunately). I saw at least some non-ASCII characters in these files, though I don't know how frequent they are supposed to be.
About your plan: If your list of 500 is fairly complete, it mostly sounds good. I guess you can also augment the 500 with what you see in the 4% corpus. Is the 4% big enough, though? Any issues of balance in the 4%?
Comment 7•10 years ago
(In reply to Friedel Wolff from comment #6)
> Kevin, you can also have a look at the PO files for Gaia itself [...]
> Here are also a few from old GNOME translations [...]
Thanks.
> About your plan: If your list of 500 is fairly complete, it mostly sounds
> good. I guess you can also augment the 500 with what you see in the 4%
> corpus. Is the 4% big enough, though? Any issues of balance in the 4%?
It's about 250k words total, ~19k unique words. It's heavily biased towards religious texts, so I'd rather not use frequencies from it if possible.
Comment 8•10 years ago
Here is the FFOS
Comment 9•10 years ago
Comment 10•10 years ago
Comment 11•10 years ago
Hello Kevin,
These files are done:
http://mozilla.locamotion.org/ha/mozilla_lang/main.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/mozorg/home/index.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/firefox/os/index.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/firefox/os/devices.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/firefox/os/faq.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/firefox/partners/index.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/firefoxos/firefoxos.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/legal/index.lang.po
http://mozilla.locamotion.org/ha/mozilla_lang/tabzilla/tabzilla.lang.po
Comment 12•10 years ago
Hello Kevin,
Another bit of corpus:
https://localize.mozilla.org/ha/masterfirefoxos/
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX