Closed Bug 1487212 Opened 6 years ago Closed 5 years ago

Can we share hyphenation data across processes?

Categories

(Core :: Internationalization, enhancement, P3)

RESOLVED FIXED
mozilla72
Tracking Status
firefox72 --- fixed

People

(Reporter: bzbarsky, Assigned: jfkthame)

References

(Blocks 1 open bug)

Details

(Whiteboard: [overhead:655K])

Attachments

(1 file)

Looking at a DMD report for a content process, I have this:

Unreported {
  1 block in heap block record 31 of 6,988
  655,360 bytes (655,360 requested / 0 slop)
  0.14% of the heap (25.97% cumulative)
  0.37% of unreported (70.56% cumulative)
  Allocated at {
    #01: replace_realloc(void*, unsigned long) (DMD.cpp:1307, in libmozglue.dylib)
    #02: moz_xrealloc (mozalloc.cpp:94, in libmozglue.dylib)
    #03: hnj_get_state (hyphen.c:194, in XUL)
    #04: hnj_hyphen_load_line (hyphen.c:370, in XUL)
    #05: hnj_hyphen_load_file (hyphen.c:441, in XUL)
    #06: hnj_hyphen_load (hyphen.c:384, in XUL)
    #07: nsHyphenationManager::GetHyphenator(nsAtom*) (nsHyphenator.cpp:22, in XUL)
    #08: nsHyphenationManager::GetHyphenator(nsAtom*) (nsHyphenationManager.cpp:119, in XUL)
  }
}

If I read that right, we fopen and read the hyphenation file and end up allocating a 655,360-byte buffer for the hyphenation states. Each state looks like 40 bytes on 64-bit (though seems like we could get it down to 32 bytes with better packing), so we have 16384 HyphenStates in that array... or at least more than 8192.

Anyway, this seems like data that might be nice to share across processes if possible.
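For anyone wanting to check the 40-byte and 32-byte figures, here is a minimal C sketch. The field names and order in StateOriginal are an assumption modelled on a libhyphen-style state record, and StatePacked and records_in are hypothetical names made up for illustration. Under this assumed field set, reordering alone would still pad out to 40 bytes on a 64-bit target, so the sketch also narrows the transition count to 16 bits, which is one possible way (not necessarily the one intended in the comment) to reach 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the transition record; only a pointer target here. */
struct Trans { char ch; int new_state; };

/* Assumed layout of the existing state record (field order is an assumption). */
struct StateOriginal {
  char *match;
  char *repl;
  signed char replindex;
  signed char replcut;
  int fallback_state;
  int num_trans;
  struct Trans *trans;
};

/* Pointers first, then narrower fields; num_trans narrowed to 16 bits
 * (an assumption made for this sketch, not a change anyone proposed). */
struct StatePacked {
  char *match;
  char *repl;
  struct Trans *trans;
  int32_t fallback_state;
  uint16_t num_trans;
  signed char replindex;
  signed char replcut;
};

/* Number of whole records that fit in a buffer of the given size. */
static size_t records_in(size_t buffer_bytes, size_t record_bytes) {
  return buffer_bytes / record_bytes;
}
```

With 40-byte records, records_in(655360, 40) gives the 16384 mentioned above. On a typical 64-bit target one would expect sizeof(struct StateOriginal) to be 40 and sizeof(struct StatePacked) to be 32, though the exact values depend on the ABI.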
We may have to allocate the data in shared memory... I don't know how feasible it is. Or we can eagerly load it before forking in a page supposed to be shared?
Component: Layout: Text and Fonts → Internationalization
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #1)
> We may have to allocate the data in shared memory... I don't know how
> feasible it is. Or we can eagerly load it before forking in a page supposed
> to be shared?

The only place we're sure we'll be able to fork content processes at all in the future is desktop Linux. Possibly also OS X. Definitely not Windows. And we don't do it anywhere now. So we'll need to explicitly allocate it in shared memory.
Whiteboard: [overhead:655K]
Using shared memory here seems like it'll be difficult, unless we're prepared to fork libhyphen and rewrite its runtime data structures to not rely on a bunch of structs that have pointers to each other.

Note that we don't load hyphenation data for any given language until a page tries to use it (i.e. we encounter content with hyphens:auto and lang=...), so the memory usage here is highly dependent on the site that's loaded. In the extreme case where a site applies hyphenation to a number of different languages, it might be considerably higher; but in the (common?) case where hyphens:auto isn't used, this shouldn't show up at all.
FWIW, the total memory used by hyphenation data (when loaded) will be considerably more than just the large block used for the array of HyphenStates: each state has pointers to separately-malloc'd match and repl strings, and to an array of HyphenTrans records (again, separately malloc'd). So there are potentially thousands more small malloc'd objects (and associated slop and overhead!) hanging off the array of states. This could certainly be done in a more memory-efficient way (and potentially in a cross-process-sharable way), but unfortunately I think this would be a pretty intrusive modification of the libhyphen code from upstream. :\
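To make the "cross-process-sharable" idea concrete, here is a minimal sketch of what a pointer-free layout could look like. This is not libhyphen's format and not necessarily what any later patch did; the word layout, flat_next_state and kExampleBlob are all hypothetical. The point is only that when every reference is an index from the start of one contiguous block, the block works unchanged at whatever address it gets mapped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat layout; every value is a 32-bit word indexed from the
 * start of the blob:
 *   word 0:          number of states
 *   words 1..2n:     per state: { index of first transition, transition count }
 *   remaining words: transitions, each encoded as (new_state << 8) | ch
 */

/* Return the state reached from `state` on byte `ch`, or -1 if none. */
static int32_t flat_next_state(const uint32_t *blob, uint32_t state, uint8_t ch) {
  if (state >= blob[0]) return -1;
  uint32_t first = blob[1 + 2 * state];
  uint32_t count = blob[2 + 2 * state];
  for (uint32_t i = 0; i < count; i++) {
    uint32_t t = blob[first + i];
    if ((uint8_t)(t & 0xff) == ch) return (int32_t)(t >> 8);
  }
  return -1;
}

/* A two-state example table: state 0 --'a'--> 1, state 1 --'b'--> 0. */
static const uint32_t kExampleBlob[] = {
  2,                /* number of states */
  5, 1,             /* state 0: one transition, stored at word 5 */
  6, 1,             /* state 1: one transition, stored at word 6 */
  (1u << 8) | 'a',  /* word 5 */
  (0u << 8) | 'b',  /* word 6 */
};
```

Because nothing in the blob is an absolute address, copying it elsewhere with memcpy (or mapping it at a different address in another process) leaves lookups working, which is exactly what the pointer-linked structs can't offer. It also avoids the per-object malloc overhead and slop mentioned above.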
Another approach may be putting the hyphenation into the font server if we are going to have one? I'm not sure how that would work, though.
Priority: -- → P3
One decent start would be reducing the space required for a state index:

https://searchfox.org/mozilla-central/source/intl/hyphenation/hyphen/hyphen.h#88-89
https://searchfox.org/mozilla-central/source/intl/hyphenation/hyphen/hyphen.h#95

Looking at the hyphenation files we have, the most states of any one file is 113772 (intl/locales/hu/hyphenation/hyph_hu.dic), so we only need 17 bits to store a state. (en-US has 15618 states, FWIW.)

Storing a state number in 24 bits would enable _HyphenState in the first link above to be represented in 20/32 bytes on a 32/64-bit system, down from 24/40. That would provide the easy win bz mentions.

We'd also win by shrinking _HyphenTrans down to 4/4 bytes (down from 8/8), which might provide a little improvement. There are just as many _HyphenTrans objects floating around as _HyphenState objects, though they're not contiguously allocated, as jfkthame notes.

I don't know if upstream would accept the changes necessary to shrink those members or not. But this doesn't address cross-process sharing issues...
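A quick sketch of the bit arithmetic above, in case it helps. The names bits_for_count, PackedTrans, pack_trans, trans_ch and trans_state are hypothetical, and plain shifts are used instead of bitfields just to keep the layout portable; an actual patch might well look different.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of bits that can represent every index in [0, count). */
static unsigned bits_for_count(uint32_t count) {
  unsigned bits = 0;
  while (bits < 32 && (1u << bits) < count) bits++;
  return bits;
}

/* One transition in a single 32-bit word: low 8 bits hold the character,
 * high 24 bits hold the target state number. */
typedef uint32_t PackedTrans;

static PackedTrans pack_trans(uint8_t ch, uint32_t new_state) {
  assert(new_state < (1u << 24)); /* must fit in 24 bits */
  return (new_state << 8) | ch;
}

static uint8_t trans_ch(PackedTrans t) { return (uint8_t)(t & 0xff); }

static uint32_t trans_state(PackedTrans t) { return t >> 8; }
```

bits_for_count(113772) comes out as 17, since 2^16 = 65536 is too small and 2^17 = 131072 is enough, so 24 bits leaves plenty of headroom. A PackedTrans is 4 bytes, compared with the 8 bytes quoted above for the existing record.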

Bug 1567437 comment 2 gives examples of the total memory footprint of loading hyphenation patterns.

Depends on: 1590167

Bug 1590167 makes this no longer an issue for desktop Firefox, as the hyphenation resources are stored uncompressed in the omnijar (which is mapped into memory already) and the new mapped_hyph library uses the resources directly from there.

On Android (GeckoView), the omnijar is compressed, so in order to use a hyphenation table it must first be uncompressed. The RAM footprint of this is much smaller with mapped_hyph than it was with libhyphen, but may be as much as a megabyte for the largest hyphenation resources. Therefore, using shared memory to share a single uncompressed copy across all processes will still be beneficial in a multi-content-process world.
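As a rough illustration of sharing one uncompressed copy, here is a small POSIX-only C sketch. It is not how Gecko does it: fork is used only to keep the demo self-contained (as noted earlier in this bug, content processes generally aren't forked, so a real implementation would hand a shared-memory handle to each child instead), and child_sees_parent_write is a hypothetical helper. The parent fills the region after the child already exists, so the child can only see the bytes if the pages really are shared.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child process sees data the parent wrote into a shared
 * mapping after the fork, 0 if it does not, and -1 on error. */
static int child_sees_parent_write(const void *data, size_t len) {
  int fds[2];
  int status = 0;
  int result = -1;
  void *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (shared == MAP_FAILED) return -1;
  if (pipe(fds) != 0) { munmap(shared, len); return -1; }

  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]); close(fds[1]); munmap(shared, len);
    return -1;
  }
  if (pid == 0) {
    char go;
    close(fds[1]);
    /* Block until the parent says the region has been filled. */
    if (read(fds[0], &go, 1) != 1) _exit(2);
    _exit(memcmp(shared, data, len) == 0 ? 0 : 1);
  }

  close(fds[0]);
  memcpy(shared, data, len);          /* stands in for "uncompress the table" */
  if (write(fds[1], "x", 1) != 1) {   /* wake the child; EOF on failure */
    close(fds[1]);
    waitpid(pid, &status, 0);
    munmap(shared, len);
    return -1;
  }
  close(fds[1]);

  if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
    int code = WEXITSTATUS(status);
    result = code == 0 ? 1 : (code == 1 ? 0 : -1);
  }
  munmap(shared, len);
  return result;
}
```

Since the table is read-only once loaded, a real implementation would presumably map it read-only in the content processes.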

Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d519e5920a23
When hyphenation resources are compressed in omnijar, load them into shared memory and share among all content processes. r=heycam,froydnj
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Assignee: nobody → jfkthame
Regressions: 1751840