Closed
Bug 1011528
Opened 10 years ago
Closed 7 years ago
Improve the heuristics used to make Accented English strings longer
Categories
(Firefox OS Graveyard :: Gaia::L10n, defect, P4)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: stas, Unassigned)
References
Details
(Whiteboard: [good first bug])
Attachments
(1 file)
(deleted), patch | stas: feedback-
Bug 900182 will add a fake Accented English locale. It currently uses a very simple method of making its strings longer than regular English: every vowel is doubled.

This could be improved, with the goal of making short strings be affected to a greater extent than long strings. See flod's http://l10n.mozilla-community.org/~flod/compare_length/ for some length statistics.

French might be a good example: strings are 33% longer than English on average, but the average increase differs depending on the length of the original string:

short (0, 5] +58% (+2.3 chars)
medium (5, 10] +31% (+2.5 chars)
long (10, 20] +38% (+5.6 chars)
phrase (20, ∞) +25% (+13 chars)

The current vowel strategy might be too linear to be useful.
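For illustration, the vowel-doubling strategy described above can be sketched as follows. This is a minimal sketch and not the actual l10n.js code: the names doubleVowels and growthPercent are hypothetical, and lowercasing the appended copy is an assumption made for this example.

```javascript
// Hypothetical helper: double every ASCII vowel, as the description says.
// The appended copy is lowercased (an assumption of this sketch), so
// "Edit" becomes "Eediit" rather than "EEdiit".
function doubleVowels(text) {
  return text.replace(/[AEIOUaeiou]/g, (vowel) => vowel + vowel.toLowerCase());
}

// Hypothetical helper: growth of the transformed string relative to the
// original length, in percent.
function growthPercent(original, transformed) {
  return ((transformed.length - original.length) / original.length) * 100;
}

console.log(doubleVowels("Done")); // vowel-rich short word grows a lot
console.log(doubleVowels("SMS")); // no vowels, so no growth at all
```

The growth depends only on how many vowels a string contains, not on how short it is, which is one way to read "too linear": a vowel-free label such as "SMS" does not grow at all.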
Updated•10 years ago
Priority: -- → P4
Reporter | ||
Updated•10 years ago
Whiteboard: [good first bug]
Comment 1•10 years ago
If I didn't mess up my tests, the current strategy (double vowels) gives us these stats.

global +32.2% (+7.1 chars)
short (0, 5] +34.4% (+1.4 chars)
medium (5, 10] +35.0% (+2.7 chars)
long (10, 20] +30.8% (+4.7 chars)
phrase (20, ∞) +30.6% (+15.8 chars)

So, the only big difference is with short strings. Maybe it would be enough to double one random character if the string has from 2 to 5 characters.
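Per-bucket figures like these could be computed along the following lines. This is only a sketch: the thread does not say how the percentages were derived, so dividing the total added characters by the total original characters in each bucket is an assumption, and BUCKETS, bucketFor and lengthStats are hypothetical names.

```javascript
// Length buckets from the thread: (0, 5], (5, 10], (10, 20], (20, ∞).
const BUCKETS = [
  { name: "short", max: 5 },
  { name: "medium", max: 10 },
  { name: "long", max: 20 },
  { name: "phrase", max: Infinity },
];

function bucketFor(length) {
  return BUCKETS.find((bucket) => length <= bucket.max).name;
}

// pairs: array of [originalString, transformedString].
function lengthStats(pairs) {
  const totals = {};
  for (const [original, transformed] of pairs) {
    if (original.length === 0) continue; // empty strings fall outside (0, 5]
    const name = bucketFor(original.length);
    const entry = totals[name] || { count: 0, originalChars: 0, addedChars: 0 };
    entry.count += 1;
    entry.originalChars += original.length;
    entry.addedChars += transformed.length - original.length;
    totals[name] = entry;
  }
  const stats = {};
  for (const [name, entry] of Object.entries(totals)) {
    stats[name] = {
      // Assumption: percentage = added characters / original characters.
      percent: (entry.addedChars / entry.originalChars) * 100,
      // Average number of characters added per string in the bucket.
      chars: entry.addedChars / entry.count,
    };
  }
  return stats;
}
```

Averaging the per-string percentages instead would weight very short strings more heavily and could give different numbers.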
Comment 2•10 years ago
(In reply to Francesco Lodolo [:flod] from comment #1)
> If I didn't mess up my tests, the current strategy (double vowels) gives us
> these stats.

Actually, this is not reliable: I didn't account for variable names in strings.
Comment 3•10 years ago
I'm using a pseudo-locale created with ugly Python code, so it might not be completely reliable (e.g. I'm trying to not double vowels in time/date formats, and variable names). Anyhow, these are some data.

Current strategy (double vowels)
global +30.12% (+ 6.62 chars)
short (0, 5] +34.27% (+ 1.35 chars)
medium (5, 10] +34.36% (+ 2.65 chars)
long (10, 20] +27.70% (+ 4.15 chars)
phrase (20, ∞) +27.93% (+14.92 chars)

Current strategy + double last character if length of the original string is <= 4
global +33.32% (+ 6.76 chars)
short (0, 5] +56.69% (+ 2.04 chars)
medium (5, 10] +34.57% (+ 2.67 chars)
long (10, 20] +28.32% (+ 4.25 chars)
phrase (20, ∞) +27.99% (+14.93 chars)

Current strategy + double last character if length of the new string (with doubled vowels) is <= 5
global +34.14% (+ 6.80 chars)
short (0, 5] +62.98% (+ 2.35 chars)
medium (5, 10] +34.57% (+ 2.67 chars)
long (10, 20] +28.32% (+ 4.25 chars)
phrase (20, ∞) +27.99% (+14.93 chars)
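The two variants above might look roughly like this in JavaScript. This sketch follows the wording of the comment, not the Python script that produced the numbers. All function names are hypothetical, and taking the "last character" from the string after vowel doubling is an assumption.

```javascript
// Hypothetical helper: the current strategy, doubling every ASCII vowel.
function doubleVowels(text) {
  return text.replace(/[AEIOUaeiou]/g, (vowel) => vowel + vowel.toLowerCase());
}

// Hypothetical helper: repeat the final character once.
function doubleLastChar(text) {
  return text + text.slice(-1);
}

// Variant 1: the threshold is checked against the ORIGINAL length (<= 4).
function makeLongerByOriginalLength(text) {
  const longer = doubleVowels(text);
  return text.length > 0 && text.length <= 4 ? doubleLastChar(longer) : longer;
}

// Variant 2: the threshold is checked against the NEW length (<= 5),
// that is, after the vowels have been doubled.
function makeLongerByNewLength(text) {
  const longer = doubleVowels(text);
  return longer.length > 0 && longer.length <= 5 ? doubleLastChar(longer) : longer;
}
```

Under these assumptions the variants differ on a word like "Done": variant 1 extends it again because the original has 4 characters, while variant 2 leaves "Doonee" alone because it already exceeds 5 characters.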
Comment 4•10 years ago
Some more thoughts: while I believe that l10n.js has access to a "compiled" string (with variables replaced), I only have access to the original strings. So, in my case "SIM {n}" is 7 characters long, while for l10n.js it should be just 5 (e.g. "SIM 1").

Besides that, I think the data could still be useful. I fixed some more lousy variable replacements and added gaia-l10n locales to the picture:
http://l10n.mozilla-community.org/~flod/compare_length_gaia/
Reporter | ||
Comment 5•10 years ago
Actually, l10n.js pseudolocalizes the strings in their raw form, right after the l10n resources are downloaded:
https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L1086-L1101

The raw form for "foo bar {n} baz" is just "foo bar {n} baz". However, the value passed to the makeLonger function here:
https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L941-L945

is "foo bar" and then "baz", because I tried to exclude tokens and syntax which should not be pseudolocalized:
https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L966-L981
Comment 6•10 years ago
Not sure if I should ask for feedback or review, let's start with f?.

This is the most intrusive version of the patch:
* (not really necessary) makeAccented is renamed to remapAlphaCharacters, since it doesn't do just Accented English and the name is confusing.
* Instead of passing pieces of strings to makeLonger and makeRTL, I'm replacing variables with placeholders, applying the transformation to the entire string, and then restoring the original variables.

The alternative patch just changes makeLonger, but I need more controls on the last transformations:
* exclude val.length<=1
* exclude val = ''

Github PR https://github.com/mozilla-b2g/gaia/pull/23349
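The replace, transform, restore approach can be sketched as follows. The attached patch is deleted, so this is a reconstruction from the description only and not the patch itself. The sentinel scheme (one Unicode private-use character per variable), the placeholder pattern and all names are assumptions.

```javascript
// Assumption: variables look like {n} or {{n}}.
const PLACEHOLDER = /\{\{[^{}]*\}\}|\{[^{}]*\}/g;
// Assumption: private-use characters never occur in real strings and are
// left untouched by the transformation.
const SENTINEL_BASE = 0xe000;

// Replace each variable with a single sentinel character.
function maskPlaceholders(text) {
  const saved = [];
  const masked = text.replace(PLACEHOLDER, (match) => {
    saved.push(match);
    return String.fromCharCode(SENTINEL_BASE + saved.length - 1);
  });
  return { masked, saved };
}

// Put the original variables back in place of the sentinels.
function restorePlaceholders(text, saved) {
  return text.replace(/[\uE000-\uF8FF]/g, (sentinel) => {
    const original = saved[sentinel.charCodeAt(0) - SENTINEL_BASE];
    return original === undefined ? sentinel : original;
  });
}

// Transform the whole string at once, skipping '' and single characters.
function transformWhole(text, transform) {
  if (text.length <= 1) return text;
  const { masked, saved } = maskPlaceholders(text);
  return restorePlaceholders(transform(masked), saved);
}
```

With this masking, "SIM {n}" is seen by the transformation as 5 characters, which matches the length discussed earlier in the thread. A transformation that duplicates the last character would need to skip sentinels, otherwise a trailing variable would be repeated.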
Attachment #8479668 -
Flags: feedback?(stas)
Reporter | ||
Comment 7•10 years ago
Thanks for the patch, flod! I'll review it next week; this week has been a little bit crazy because of the FL deadline.
Reporter | ||
Comment 8•10 years ago
Comment on attachment 8479668
23349.patch

Review of attachment 8479668:
-----------------------------------------------------------------

Hey Flod, thanks for the patch. I agree with the name changes, but I think I'd like to try out an alternative approach to this one.

In your patch, the string is taken as a whole, and based on its total length the last word might become longer in certain cases. This leads to non-determinism: one word can have more than one 'translation' into pseudolocales with this method.

Instead, I wonder if it would be possible to look at each word separately and come up with rules that would make certain short words longer consistently. You could also make longer words grow only by a little, to compensate for the short words getting longer in full sentences.

What do you think?
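One possible reading of the per-word proposal is sketched below. The thresholds (4 and 10 characters) and the specific rules are invented for illustration; the review does not specify any. The names are hypothetical, and the sketch assumes variables have already been excluded from the text.

```javascript
// Hypothetical helper: double every ASCII vowel.
function doubleVowels(text) {
  return text.replace(/[AEIOUaeiou]/g, (vowel) => vowel + vowel.toLowerCase());
}

// Each word maps to exactly one output, regardless of the sentence it is in.
function lengthenWord(word) {
  if (word.length <= 4) {
    // Short words: double the vowels and repeat the last character.
    const doubled = doubleVowels(word);
    return doubled + doubled.slice(-1);
  }
  if (word.length <= 10) {
    // Medium words: keep the existing vowel-doubling rule.
    return doubleVowels(word);
  }
  // Long words grow only a little: double the first vowel only.
  return word.replace(/[AEIOUaeiou]/, (vowel) => vowel + vowel.toLowerCase());
}

// Apply the per-word rule to every run of ASCII letters.
function lengthenWords(text) {
  return text.replace(/[A-Za-z]+/g, (word) => lengthenWord(word));
}

console.log(lengthenWords("Done, not done"));
```

Here "done" is rendered the same way wherever it appears, which is the determinism the review asks for; the cost is that a sentence made mostly of short words grows considerably.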
Attachment #8479668 -
Flags: feedback?(stas) → feedback-
Comment 9•10 years ago
I'm not sure I see value in being deterministic: pseudolocale needs to be understandable, and qps-ploc gives us a more realistic coverage by increasing the original en-US length. Numbers say that we're already doing a good job in all sectors besides very short words (under 5 characters). Also, while this becomes deterministic for single words (i.e. 'Done' is always rendered in the same way), the inflation for longer strings becomes a lot harder to measure: a string made of short words could become extremely long (e.g. "You can not make calls, send messages or go online because emergency callback mode is enabled. Would you like to turn it off?"), unless you want to consider the entire sentence.
Reporter | ||
Updated•10 years ago
Blocks: fxos-pseudolocales
Comment 10•7 years ago
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX