Closed Bug 1137970 Opened 10 years ago Closed 9 years ago

[b2g] Generate Ukrainian (uk) wordlist/dictionary

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: delphine, Assigned: a.polivanchuk)

Attachments

(8 files)

We are working on adding the Ukrainian locale to FxOS builds right now (see Bug 1137822).
Once that's done, we will need to get autocorrection and word suggestion up and running for this locale.
Artem: if you have any feedback or can give input on this, it would be helpful. Thanks!
Is it possible to get the dictionary from the uk locale of Firefox?
I don't think they contain frequency information, which we need for FxOS.
What information exactly is needed?
Hey Artem, for this bug someone will need to generate a wordlist so that there is autocorrection and word suggestion in Firefox OS when you are typing with the keyboard.
Someone on CC here will probably be able to advise on how to move forward better than I can ^^ Thanks!
As I see in the requests for other locales, the list for autocorrection and word suggestion should contain about 4-5 thousand frequently used words.
Are there any other requirements?
Hey Artem - sorry, this fell off my radar as I was not needinfo'ed here :)
Asking Kevin if he can help you with this when he gets back. Thanks!
Flags: needinfo?(kscanne)
I already have the list of regular Ukrainian words.
I am checking them now and will post them here soon :)
Thanks Artem! 
@Kevin: could you please have a look at the list attached? Do you think we need more corpus, or are we good to go? thanks!
Hi Artem, this list looks very good. How did you create it? 

Is it worth including frequent proper names like "України", "Україні"? I see a few other common words on the web like "можна" and "цього" that could be added too. Would it be worth working from a big spell checking word list and paring it down? This one is tri-licensed:

http://extensions.libreoffice.org/extension-center/ukrainian-spelling-dictionary-and-thesaurus

Or maybe that's overkill if you have most of the words you want in testing...
Flags: needinfo?(kscanne)
Attached file Words for Firefox OS.txt (deleted) —
Hi Kevin, I just took a big list of common words, processed it in an Excel table, and then checked it further.
Of course it is worth including the words you proposed. I have already added them to the updated list.

How can I open this big spell checking word list in a readable form? How many words are there?
I guess that 4-5 thousand common words should be enough for autocorrection and word suggestion.
Hi Artem,

  It's a bit tricky to view the spelling list as a text file - the .oxt file can be renamed as .zip, and then unzipped.  Inside the zip you'll see a uk_UA folder with .dic and .aff files that have the dictionary data.   Fully expanded, there are 1.7 *million* words... I agree with you that this is overkill! :)  I'd say it's OK to go ahead with the list you have, and if you feel like you want to add more common words based on testing, I can help with that.
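
For anyone who would rather not rename and unzip by hand, here is a minimal sketch of the same step in Python. It relies only on the fact stated above that an .oxt file is a zip archive; the function name is made up for illustration, and it only lists the .dic and .aff entries rather than expanding them.

```python
import zipfile


def list_dictionary_files(oxt_path):
    """Return the .dic and .aff entries inside an .oxt extension archive."""
    # An .oxt file is an ordinary zip archive, so zipfile can read it
    # directly without renaming it to .zip first.
    with zipfile.ZipFile(oxt_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith((".dic", ".aff"))
        )
```

Call it with the path to the downloaded extension file. Getting the fully expanded word count would still need a Hunspell-aware tool, which this sketch does not attempt.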
OK, I see. So, let's go ahead :)
Thanks!
Is there any progress with landing the wordlist in 2.2 and 2.5?
Maybe some additional action is needed to move this forward?
Attached file uk_wordlist.xml (deleted) —
OK, here is the right list, which contains 98931 words.

Information about this list (http://u-mova.blogspot.com/2013/09/blog-post.html):

Frequency list of Ukrainian language.
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
Copyright © 2013 Volodymyr Vlad

I converted the original file into an .xml file.
One thing I'm unsure about is the date="1377342507" in the header.
Please take a look and check it.

Thanks!
Flags: needinfo?(kscanne)
Looks good to me. The date in the header is just a timestamp in seconds since the Unix epoch; you can get the current one with "date +%s" in your favorite shell.
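
To see what moment a given header value stands for, a short Python sketch can convert it back to a calendar date. The value below is the one quoted from the header; the variable names are only illustrative.

```python
from datetime import datetime, timezone

# Value of date="..." quoted from the wordlist header.
header_date = 1377342507

# Interpret the number as seconds since the Unix epoch, in UTC.
as_utc = datetime.fromtimestamp(header_date, tz=timezone.utc)
print(as_utc.isoformat())
```

This particular value falls on 24 August 2013, shortly before the source blog post was published.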
Flags: needinfo?(kscanne)
Artem,

Thanks for finding this wordlist! This looks great. Our dictionary creation script used to require frequencies to be in the range 1 to 255, but I don't think it does anymore. However, note that the frequencies you give will be linearly scaled to fit in 5 bits, so that there are really just 32 frequency levels. This means that something like 95% or more of your words will be at the lowest frequency level, so autocorrect can't make good decisions about which words are more frequent than others. So I'd recommend that you take these raw frequency numbers and rescale them on a logarithmic scale so that they are more evenly distributed between 1 and 255 or 1 and 1000 or something.
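
The effect can be illustrated with a rough Python sketch. It does not reproduce the actual scaling code in the dictionary script; the two helper functions and the Zipf-like counts are assumptions chosen only to show why linear scaling of raw counts crowds almost every word into the bottom level, while scaling the logarithm spreads words across more levels.

```python
import math
from collections import Counter


def linear_level(freq, max_freq, levels=32):
    """Scale a raw count linearly onto levels 0..levels-1."""
    return (freq * (levels - 1)) // max_freq


def log_level(freq, max_freq, levels=32):
    """Scale the logarithm of a raw count onto levels 0..levels-1."""
    return round(math.log(freq) / math.log(max_freq) * (levels - 1))


# Hypothetical Zipf-like counts: the word at rank r occurs about max/r times.
max_freq = 1_000_000
counts = [max_freq // rank for rank in range(1, 10_001)]

linear = Counter(linear_level(c, max_freq) for c in counts)
logarithmic = Counter(log_level(c, max_freq) for c in counts)

share_lowest = linear[0] / len(counts)
print(f"linear: {share_lowest:.1%} of words at the lowest level")
print(f"levels used: linear={len(linear)}, log={len(logarithmic)}")
```

With these made-up counts, well over 95% of the words end up at linear level 0, and the logarithmic version uses noticeably more distinct levels.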

Also, note that you can use f="0" to mark obscene words that you do not want to be suggested by autocorrect, but that should be considered valid words and not be autocorrected. You can look at the English wordlist if you want to see the long list of English words that Google thinks are profane.

Once you've adjusted the frequencies, you need to run the xml2dict.js script to create a uk.dict file. See https://bugzilla.mozilla.org/show_bug.cgi?id=1226981#c32 for a description of how this was done for Romanian.

Test out the .dict file to make sure it works well. If it does, then prepare a pull request that includes the wordlist and the .dict file, and also modifies the README file to include notes about how the Ukrainian wordlist was generated. Explain the original source and its CC license and briefly explain how you modified the frequencies from that source.

Once you've done that, you can ask me or :timdream to review and land the patch.

After that, the final step is to get the uk.dict file hosted on the Amazon S3 server so that it can be downloaded by FirefoxOS devices.  I don't know how to do that part, but :timdream can help you with that if you set the needinfo flag for him.
OK, at Artem's request I took a look at his word list. I added a list of relative frequencies scaled from the list found here http://u-mova.blogspot.com/2013/09/blog-post.html .

The formula for relative frequencies (in the xml, simply "freq") is: 
relative_freq = ceil (log(maxFreq/freq) / log(1.05622) + 0.5)

The constant 1.05622 is found by: exp (log(1082616) / 254), where 1082616 is the maximum pre-normalized frequency.
This gives values between 1 and 255, with 255 for the least frequent words and 1 for the most frequent.
If this looks alright I can also generate a .dict blob file for a potential patch.
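
For reference, the formula above can be written as a small Python function. This is a sketch that follows the description in this comment, not the actual script used to produce the attachment; the constant and function names are illustrative.

```python
import math

MAX_FREQ = 1082616  # maximum pre-normalized frequency quoted above
BASE = 1.05622      # exp(log(MAX_FREQ) / 254), rounded as quoted above


def relative_freq(freq, max_freq=MAX_FREQ, base=BASE):
    """Map a raw count to 1..255, where 1 is the most frequent word."""
    return math.ceil(math.log(max_freq / freq) / math.log(base) + 0.5)


print(relative_freq(MAX_FREQ))  # most frequent word
print(relative_freq(1))         # a word seen only once
```

The most frequent word maps to 1 and a word seen once maps to 255, which matches the range described above.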
Hi Jobava,
Thank you very much for your help with adjusting the frequencies.

Is it expected that the highest frequency numbers became the lowest after adjusting?
I mean that the original list has frequencies in descending order, but the new list has them in ascending order, while the word order is unchanged.
Looking at the lists for other locales, I guess it should be adjusted the opposite way (255 for the most frequent words and 1 for the least frequent).
Flags: needinfo?(jobaval10n)
It should be that way: the initial frequencies are the raw number of occurrences in the corpus of text that Volodymyr compiled. Volodymyr's list also had "relative frequencies" on a base-2 logarithm, but that ran out after just rank 21, since log2(1 million) ≈ 20. I stretched that by picking a smaller base, as described above.

The "freq" values here are relative frequencies.
Flags: needinfo?(jobaval10n)
David, could you please review the new list (Comment 19)...
Flags: needinfo?(dflanagan)
I recalculated the frequency order adjusted by Jobava, using the formula: 256 - <frequency>

I also contacted Volodymyr, the author of this list, and he recommended recalculating it this way.
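
The recalculation is a simple flip of the 1..255 range. A minimal Python sketch, with an illustrative function name, looks like this:

```python
def invert_frequency(relative_freq):
    """Flip a 1..255 value so that 255 marks the most frequent word."""
    if not 1 <= relative_freq <= 255:
        raise ValueError("expected a value between 1 and 255")
    return 256 - relative_freq


print(invert_frequency(1))    # most frequent word in the adjusted list
print(invert_frequency(255))  # least frequent word in the adjusted list
```

After the flip, the most frequent word carries 255 and the least frequent carries 1, which is the direction the lists for other locales appear to use.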
Hello, I am attaching the .dict file as requested by Sergiy and Artem. I used the xml2dict.js script in apps/keyboard/js/imes/latin/dictionaries/

It looks like the .dict compilation instructions need a little attention as other people are having issues with using it.
Thanks a lot for your help, Jobava!
Next, I'm going to create a PR using the latest uk_wordlist.xml and uk.dict.
Flags: needinfo?(dflanagan)
Comment on attachment 8714892 [details]
[gaia] stenox:master > mozilla-b2g:master

Please review the PR.
Attachment #8714892 - Flags: review?(timdream)
Comment on attachment 8714892 [details]
[gaia] stenox:master > mozilla-b2g:master

Looks good, thanks for the contribution. However, you will need to fix the build tests.

https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=dd7d25d44aa5946d6f54fea1ab42f507cbb41885&selectedJob=3491105
Attachment #8714892 - Flags: review?(timdream) → feedback+
Assignee: nobody → a.polivanchuk
Is this a matter of broken tests, or did the uploaded files break something they shouldn't have?

I don't understand the errors, or where in that web interface I should look, or how to mentally parse them.
Will be back after the holiday here...
Flags: needinfo?(timdream)
Artem,

Please refer to this commit I added for bug 1033185. The keyboard build tests assert which layouts are included in the build. You need to add the newly added layout to the expected lists here so the test code knows the addition is legitimate.

https://github.com/timdream/gaia/commit/8be51e8e5020c8855ad2722819c1942657b078b6

Thanks for helping out!
Flags: needinfo?(timdream) → needinfo?(a.polivanchuk)
Tim, I modified the files and updated the PR, but there are 5 failures again.
There might be something wrong with my changes.
Could you please take a look?

https://github.com/mozilla-b2g/gaia/pull/34020
Flags: needinfo?(a.polivanchuk)
Yes, I see. But, unfortunately, I have no idea what the problem is or how to fix it.
Flags: needinfo?(a.polivanchuk)
I am not sure how much time you can commit, but if you have more time to deal with this issue, you could try running the tests locally to see their output. The command to run is:

> make build-test-integration TEST_FILES=apps/keyboard/test/build/integration/keyboard_test.js

Sorry about all the troubles.
Flags: needinfo?(a.polivanchuk)
Hey Tim, thanks for your continued guidance.

The layoutIds array already has a 'uk' element.
Flags: needinfo?(timdream)
Actually, I already added the new layout to the layoutIds array in the last PR update.
Flags: needinfo?(a.polivanchuk)
Hi, Artem

Thanks for your patch. The build fails because we added a dict for uk, so our build system no longer preloads the layout by default.

We need to delete this line to match the manifest; then the test should pass.
https://github.com/stenox/gaia/blob/370511577d3fccf66310c84d049a8c7b12ffb3bb/apps/keyboard/test/build/integration/keyboard_test.js#L72

Hope it helps.
Thanks Ray for helping out!
Flags: needinfo?(timdream)
merged, master: https://github.com/mozilla-b2g/gaia/commit/a6ecae635719115aa72465efe522fdced3dd1d70
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED