Closed Bug 1011528 Opened 10 years ago Closed 7 years ago

Improve the heuristics used to make Accented English strings longer

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: stas, Unassigned)

References

Details

(Whiteboard: [good first bug])

Attachments

(1 file)

23349.patch (deleted), patch, 10 years ago, Francesco Lodolo [:flod], stas: feedback-

Staś Małolepszy :stas

Reporter

Description

•

10 years ago

Bug 900182 will add a fake Accented English locale.  It currently uses a very simple method of making its strings longer than regular English: every vowel is doubled.
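For reference, the vowel-doubling rule described above can be sketched in a few lines of Python. This is only an illustration, not the l10n.js code: the names `double_vowels` and `growth`, the choice of ASCII vowels only, and the sample strings are all assumptions.

```python
VOWELS = set("aeiouAEIOU")  # assumption: only plain ASCII vowels are doubled


def double_vowels(text: str) -> str:
    """Repeat every vowel once and leave all other characters untouched."""
    return "".join(ch * 2 if ch in VOWELS else ch for ch in text)


def growth(text: str) -> float:
    """Relative length increase for a non-empty string (0.5 means +50%)."""
    return (len(double_vowels(text)) - len(text)) / len(text)


# Hypothetical sample strings, not taken from any real locale file.
for sample in ("OK", "Done", "Settings", "Turn on airplane mode"):
    print(f"{sample!r:25} -> {double_vowels(sample)!r:32} {growth(sample):+.0%}")
```

Because the number of added characters is simply the number of vowels, the growth tracks vowel density rather than string length, which is why short and long strings end up inflated by a similar percentage.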

This could be improved, with the goal of making short strings be affected to a greater extent than long strings.

See flod's http://l10n.mozilla-community.org/~flod/compare_length/ for some length statistics.  French might be a good example:  strings are 33% longer than English on average, but the average increase differs depending on the length of the original string:

  short   (0, 5]   +58% (+2.3 chars)
  medium  (5, 10]  +31% (+2.5 chars)
  long    (10, 20] +38% (+5.6 chars)
  phrase  (20, ∞)  +25% (+13 chars)

The current vowel strategy might be too linear to be useful.
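The per-bucket figures above could be reproduced with something like the following Python sketch. The bucket boundaries come from the table; everything else is an assumption, including the function names and the choice to report the percentage as total extra characters divided by total original characters (the linked page may compute it differently).

```python
from collections import defaultdict

# (name, exclusive lower bound, inclusive upper bound), as in the table above
BUCKETS = [
    ("short", 0, 5),
    ("medium", 5, 10),
    ("long", 10, 20),
    ("phrase", 20, float("inf")),
]


def bucket_for(length: int) -> str:
    """Return the bucket name for an original string length (> 0)."""
    for name, low, high in BUCKETS:
        if low < length <= high:
            return name
    raise ValueError("empty strings do not belong to any bucket")


def growth_by_bucket(pairs):
    """Map bucket -> (relative growth, average extra chars).

    `pairs` is an iterable of (english, localized) strings.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # original chars, extra chars, count
    for english, localized in pairs:
        entry = totals[bucket_for(len(english))]
        entry[0] += len(english)
        entry[1] += len(localized) - len(english)
        entry[2] += 1
    return {
        name: (extra / original, extra / count)
        for name, (original, extra, count) in totals.items()
    }
```

For example, `growth_by_bucket([("Save", "Enregistrer")])` puts the single pair in the short bucket with 7 extra characters.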

Zibi Braniecki [:zbraniecki][:gandalf]

Updated

•

10 years ago

Priority: -- → P4

Staś Małolepszy :stas

Reporter

Updated

•

10 years ago

Whiteboard: [good first bug]

Francesco Lodolo [:flod]

Comment 1

•

10 years ago

If I didn't mess up my tests, the current strategy (double vowels) gives us these stats.

  global           +32.2%  (+7.1 chars)
  short   (0, 5]   +34.4%  (+1.4 chars)
  medium  (5, 10]  +35.0%  (+2.7 chars)
  long    (10, 20] +30.8%  (+4.7 chars)
  phrase  (20, ∞)  +30.6%  (+15.8 chars)

So, the only big difference is with short strings.
Maybe it would be enough to double one random character if the string has 2 to 5 characters.

Francesco Lodolo [:flod]

Comment 2

•

10 years ago

(In reply to Francesco Lodolo [:flod] from comment #1)
> If I didn't mess up my tests, the current strategy (double vowels) gives us
> these stats.

Actually, this is not reliable: I didn't account for variable names in strings.

Francesco Lodolo [:flod]

Comment 3

•

10 years ago

I'm using a pseudo-locale created with ugly Python code, so it might not be completely reliable (e.g. I'm trying not to double vowels in time/date formats and variable names). Anyhow, here is some data.

Current strategy (double vowels)

  global           +30.12%  (+ 6.62 chars)
  short   (0, 5]   +34.27%  (+ 1.35 chars)
  medium  (5, 10]  +34.36%  (+ 2.65 chars)
  long    (10, 20] +27.70%  (+ 4.15 chars)
  phrase  (20, ∞)  +27.93%  (+14.92 chars)

Current strategy + double last character if length of the original string is <= 4

  global           +33.32%  (+ 6.76 chars)
  short   (0, 5]   +56.69%  (+ 2.04 chars)
  medium  (5, 10]  +34.57%  (+ 2.67 chars)
  long    (10, 20] +28.32%  (+ 4.25 chars)
  phrase  (20, ∞)  +27.99%  (+14.93 chars)

Current strategy + double last character if length of the new string (with doubled vowels) is <= 5

  global           +34.14%  (+ 6.80 chars)
  short   (0, 5]   +62.98%  (+ 2.35 chars)
  medium  (5, 10]  +34.57%  (+ 2.67 chars)
  long    (10, 20] +28.32%  (+ 4.25 chars)
  phrase  (20, ∞)  +27.99%  (+14.93 chars)
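The two variants measured above differ only in which length is tested before the last character is repeated. A minimal Python sketch, assuming ASCII vowels and using hypothetical function names (this is not the script that produced the numbers):

```python
VOWELS = set("aeiouAEIOU")  # assumption: ASCII vowels only


def double_vowels(text: str) -> str:
    return "".join(ch * 2 if ch in VOWELS else ch for ch in text)


def longer_by_original_length(text: str, limit: int = 4) -> str:
    """Double vowels, then repeat the last character if the ORIGINAL is short."""
    result = double_vowels(text)
    if 0 < len(text) <= limit:
        result += result[-1]
    return result


def longer_by_new_length(text: str, limit: int = 5) -> str:
    """Double vowels, then repeat the last character if the RESULT is still short."""
    result = double_vowels(text)
    if 0 < len(result) <= limit:
        result += result[-1]
    return result
```

With these rules "Done" gets the extra character only in the first variant (its doubled form "Doonee" is already 6 characters long), while a 5-character string without vowels gets it only in the second.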

Francesco Lodolo [:flod]

Comment 4

•

10 years ago

Some more thoughts: while I believe that l10n.js has access to a "compiled" string (with variables replaced), I only have access to the original strings. 

So, in my case "SIM {n}" is 7 characters long, while for l10n.js it should be just 5 (e.g. "SIM 1").
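The difference between the two lengths can be illustrated with a short Python sketch. The `{n}` placeholder syntax is taken from the example above; the regular expression, the function names, and the sample value are assumptions.

```python
import re

# Assumption: variables look like {name}, as in the "SIM {n}" example.
PLACEHOLDER = re.compile(r"\{\s*(\w+)\s*\}")


def raw_length(source: str) -> int:
    """Length of the string as stored in the l10n resource."""
    return len(source)


def compiled_length(source: str, values: dict) -> int:
    """Length after every placeholder is replaced by its value."""
    return len(PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), source))


print(raw_length("SIM {n}"))                 # 7
print(compiled_length("SIM {n}", {"n": 1}))  # 5
```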

Besides that, I think the data could still be useful. I fixed some more lousy variable replacements and added gaia-l10n locales to the picture:
http://l10n.mozilla-community.org/~flod/compare_length_gaia/

Staś Małolepszy :stas

Reporter

Comment 5

•

10 years ago

Actually, l10n.js pseudolocalizes the strings in their raw form, right after the l10n resources are downloaded:

  https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L1086-L1101

The raw form for "foo bar {n} baz" is just "foo bar {n} baz". However, the value passed to the makeLonger function here:

  https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L941-L945

is "foo bar" and then "baz", because I tried to exclude tokens and syntax which should not be pseudolocalized:

  https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L966-L981
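The piece-by-piece approach described here can be sketched in Python as follows. This is not a port of l10n.js: the `{n}` token pattern, the function names, and the vowel-doubling stand-in for makeLonger are assumptions, and unlike the description above this sketch passes the surrounding whitespace along with each piece.

```python
import re

# Assumption: the only tokens to protect are {name} placeholders.
TOKEN = re.compile(r"(\{\s*\w+\s*\})")
VOWELS = set("aeiouAEIOU")


def double_vowels(text: str) -> str:
    """Stand-in for makeLonger: repeat every ASCII vowel once."""
    return "".join(ch * 2 if ch in VOWELS else ch for ch in text)


def pseudolocalize_parts(source: str, transform) -> str:
    """Apply `transform` to the text between tokens, never to the tokens."""
    # With one capture group, re.split puts plain text at even indexes
    # and the captured tokens at odd indexes.
    parts = TOKEN.split(source)
    return "".join(
        part if index % 2 else transform(part)
        for index, part in enumerate(parts)
    )


print(pseudolocalize_parts("foo bar {n} baz", double_vowels))
```

Since the transform only ever sees one piece at a time, any rule based on the total string length has no way to know how long the full string is.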

Francesco Lodolo [:flod]

Comment 6

•

10 years ago

Attached patch 23349.patch (deleted)

Not sure if I should ask for feedback or review, let's start with f?.

This is the most intrusive version of the patch:
* (not really necessary) makeAccented is renamed as remapAlphaCharacters, since it doesn't do just Accented English and the name is confusing.
* Instead of passing pieces of strings to makeLonger and makeRTL, I replace variables with placeholders, apply the transformation to the entire string, and then restore the variables.

The alternative patch just changes makeLonger, but it needs more checks in the last transformation:
* exclude val.length<=1
* exclude val = ''
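The mask, transform, restore idea from the first patch can be sketched in Python like this. It is not the code from the (deleted) patch: the `{n}` variable pattern, the use of Unicode Private Use Area characters as placeholders, and all names are assumptions made for illustration.

```python
import re

VARIABLE = re.compile(r"\{\s*\w+\s*\}")  # assumption: variables look like {name}
VOWELS = set("aeiouAEIOU")
SENTINEL_BASE = 0xE000  # first code point of the Unicode Private Use Area


def double_vowels(text: str) -> str:
    return "".join(ch * 2 if ch in VOWELS else ch for ch in text)


def transform_whole(source: str, transform) -> str:
    """Mask variables, transform the entire string, then restore them."""
    variables = []

    def mask(match):
        variables.append(match.group(0))
        # Each variable becomes one character the transform will not touch.
        return chr(SENTINEL_BASE + len(variables) - 1)

    transformed = transform(VARIABLE.sub(mask, source))
    for index, variable in enumerate(variables):
        transformed = transformed.replace(chr(SENTINEL_BASE + index), variable)
    return transformed


print(transform_whole("foo bar {n} baz", double_vowels))
```

In this sketch the transform sees "SIM {n}" as a 5-character string, so a length-based rule can work on the whole string. A rule that repeats the last character would still need a guard so that it never duplicates a placeholder.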

Github PR
https://github.com/mozilla-b2g/gaia/pull/23349

Attachment #8479668 - Flags: feedback?(stas)

Staś Małolepszy :stas

Reporter

Comment 7

•

10 years ago

Thanks for the patch, flod!  I'll review it next week;  this week has been a little bit crazy because of the FL deadline.

Staś Małolepszy :stas

Reporter

Comment 8

•

10 years ago

Comment on attachment 8479668
23349.patch

Review of attachment 8479668:
-----------------------------------------------------------------

Hey Flod, thanks for the patch.  I agree with the name changes, but I think I'd like to try out an alternative approach to this one.  In your patch, the string is taken as a whole, and based on its total length the last word might become longer in certain cases.  This leads to non-determinism:  one word can have more than one 'translation' into pseudolocales with this method.  Instead, I wonder if it would be possible to look at each word separately and come up with rules that would make certain short words longer consistently.  You could also make longer words grow only by a little, to compensate for the lengthened short words in full sentences.

What do you think?
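One possible reading of the per-word idea, as a Python sketch. The thresholds and the specific rules (extra last letter for short words, only the first vowel doubled for long words) are invented here for illustration and were not proposed in this bug.

```python
import re

VOWELS = set("aeiouAEIOU")
WORD = re.compile(r"[A-Za-z]+")  # assumption: a word is a run of ASCII letters


def double_vowels(text: str) -> str:
    return "".join(ch * 2 if ch in VOWELS else ch for ch in text)


def lengthen_word(word: str, short: int = 4, long: int = 8) -> str:
    """Hypothetical rules: short words grow a lot, long words only a little."""
    if len(word) <= short:
        return double_vowels(word) + word[-1]
    if len(word) >= long:
        for index, ch in enumerate(word):
            if ch in VOWELS:
                # Double only the first vowel.
                return word[: index + 1] + ch + word[index + 1:]
        return word
    return double_vowels(word)


def lengthen_text(text: str) -> str:
    """Apply the per-word rule to every word; other characters are kept."""
    return WORD.sub(lambda match: lengthen_word(match.group(0)), text)


print(lengthen_text("Done"))         # Dooneee
print(lengthen_text("Turn it off"))  # Tuurnn iitt oofff
```

With rules like these a given word is always rendered the same way, regardless of the string it appears in.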

Attachment #8479668 - Flags: feedback?(stas) → feedback-

Francesco Lodolo [:flod]

Comment 9

•

10 years ago

I'm not sure I see value in being deterministic: pseudolocale needs to be understandable, and qps-ploc gives us a more realistic coverage by increasing the original en-US length. Numbers say that we're already doing a good job in all sectors besides very short words (under 5 characters).

Also, while this becomes deterministic for single words (i.e. 'Done' is always rendered in the same way), the inflation for longer strings becomes a lot harder to measure: a string made of short words could become extremely long (e.g. "You can not make calls, send messages or go online because emergency callback mode is enabled. Would you like to turn it off?"), unless you want to consider the entire sentence.
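The concern about sentences made of short words can be checked with a small Python sketch. The per-word rule used here (double vowels, plus an extra last letter for words of up to 4 letters) is a hypothetical stand-in, since no concrete rule was agreed on in this bug.

```python
import re

VOWELS = set("aeiouAEIOU")
WORD = re.compile(r"[A-Za-z]+")  # assumption: a word is a run of ASCII letters


def double_vowels(text: str) -> str:
    return "".join(ch * 2 if ch in VOWELS else ch for ch in text)


def per_word(text: str, short: int = 4) -> str:
    """Hypothetical per-word rule: short words also repeat their last letter."""
    def lengthen(match):
        word = match.group(0)
        longer = double_vowels(word)
        return longer + word[-1] if len(word) <= short else longer
    return WORD.sub(lengthen, text)


def growth(original: str, transformed: str) -> float:
    return (len(transformed) - len(original)) / len(original)


SENTENCE = ("You can not make calls, send messages or go online because "
            "emergency callback mode is enabled. Would you like to turn it off?")

whole = growth(SENTENCE, double_vowels(SENTENCE))
words = growth(SENTENCE, per_word(SENTENCE))
print(f"vowel doubling only: {whole:+.1%}, per-word rule: {words:+.1%}")
```

Every word of up to 4 letters adds one more character under the per-word rule, so the gap between the two figures widens with the share of short words in the sentence.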

Staś Małolepszy :stas

Reporter

Updated

•

10 years ago

Blocks: fxos-pseudolocales

BMO Automation

Comment 10

•

7 years ago

Firefox OS is not being worked on

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → WONTFIX


Categories

(Firefox OS Graveyard :: Gaia::L10n, defect, P4)
