Open
Bug 1436250
(memshrink-content)
Opened 7 years ago
Updated 2 years ago
[meta] Reduce content process memory overhead
Categories
(Core :: General, enhancement)
Tracking
()
NEW
People
(Reporter: bzbarsky, Unassigned)
References
(Depends on 58 open bugs, Blocks 1 open bug)
Details
(Keywords: meta, Whiteboard: [MemShrink:meta])
User Story
Attachments
(1 file)
(deleted),
patch
Going to use this to track various specific bits.
We have a lot of heap-unclassified (almost 40% of the heap) in a vanilla content process with nothing really loaded.
Also, I haven't even looked at the non-heap-allocated overhead.
Updated•7 years ago
Whiteboard: [memshrink]
Updated•7 years ago
Keywords: meta
Summary: Reduce content process memory overhead → [meta] Reduce content process memory overhead
Comment 1•7 years ago
I have a dump of the memory allocated by content processes, taken with ASAN's __sanitizer_print_memory_profile(). It's a little confused because the two content processes interleave their dumps (I'll need to find a way to separate them, or dump only one). The two processes have ~17MB and ~21.5MB of live allocations; for reference, System Monitor (on Fedora) showed sizes of ~24.5MB and ~28MB for the two processes when I looked in a different run with the same profile.
The 17MB one is probably the 'warm' process that hasn't loaded any content, and the other is showing a blank page. Comparing the two will also be interesting.
The dump is large, since I told it to dump the allocation stacks of *all* allocations (the total was ~77K for the smaller process and ~100K for the larger). I'll upload the raw files, but the highlights:
We're spending a lot on alignment(?)
We have a LOT of power-of-2-sized buffers -- IIRC jemalloc isn't efficient on powers-of-two (not unusual) -- Glandium?
The profiler is allocating a bunch of memory up front in case it needs it when turned on (I presume)
Lots of HashTables - many probably far from filled, and some are static once created
Prefs.... (njn is working on this!)
fontconfig is a PIG!!!!
Telemetry is a non-0 %-age
Quite a bit (scattered) of script data/source/etc
~8% is in posix_memalign from slab_allocator_alloc_chunk() in gslice.c (2500+ allocations)
~3% (~850K) in ~25 allocations from ThreadInfo::ThreadInfo in tools/profiler/core/ThreadInfo.cpp, allocated when the threads are registered with the profiler, or 32808 bytes per thread. That's a lot to spend for the profiler when I haven't installed it in that profile, let alone used it. Lazy allocation, perhaps?
~3% in many allocations from g_realloc() (no further backtrace)
~3% (650K) in 21 allocations from performXDR<> called from js::XDRScript<>
~2% in 5 allocations (of ~88K each) from js::DuplicateString(), called from js::ScriptSource::setSourceCopy()
~1% (262144) in 1 allocation from PLDHashTable::ChangeTable from Preferences(!) (SetLatePreferences)
~1% (262144) in 16 allocations from DoInterfaceDescriptor(XPTArena...), called a ways above from DoRegisterXPT()
~1% in ~8000 allocations from FcCharSetFindLeafCreate() (fontconfig)
~1% in ~7700 allocations from FcValueListCreate/
~1% in 621 allocations from JSScript::createScriptData() (from XDRScript<>)
~0.5% in 2 allocations from __strtof_l()
~0.5% in FcPatternObjectInsertElt()
~0.5% (131072) from js::detail::HashTable<>changeTableSize()
~0.5% (131072) in 2 allocations from PLDHashTable::Add() (an XPTInterfaceInfoManager table)
~0.5% (131072) in 1 allocation from PLDHashTable::Add() from GetAtomHashEntry() when in RegisterStaticAtoms
~0.5% (131072) in PLDHashTable::Add() called from TelemetryHistogram::InitializeGlobalState()
(perhaps a couple more 131072 or 262144 allocations)
~0.5% (122K) in XPT_DoCString() from XPTInterfaceInfoManager::RegisterBuffer()
~0.4% (113K) in 95 allocations from _dl_new_object()
~0.4% (111K) from FcCharSetPutLeaf
~0.4% in 17xx allocations from PLDHashTable::Add for strings from TelemetryHistogram::InitializeGlobalState()
~0.4% (102K) in 50 allocations from nsPersistentProperties::SetStringProperty()
~0.4% (101K) in many allocations from FcValueSave()
~0.4% (98K) in 3 allocations from ThreadInfo::ThreadInfo()
~0.4% (98304) in 6 allocations from xptiInterfaceEntry::Create()
~0.4% (98304) in 2 allocations from PLDHashTable::Add()
98K in 12 allocs from FcConfigAllocExpr
81920 (10*8192!) in 10 allocations from DuplicateString<char, 8192ul, 1ul> from Pref::Pref()
Bunch more allocs from ThreadInfo::ThreadInfo() (profiler)
65536 in 1 alloc from HashTable<>::createTable() from AtomizeAndCopyChars<>
65536 in 1 allocation from nsAtomFriend::RegisterStaticAtoms()
65536 in 2 allocations from gfxFcPlatformFontList::AddPatternToFontList()/InitFontListForPlatform()
65536 in 2 allocs from js::LifoAlloc::newChunkWithCapacity()
65536 in 1 alloc from nsComponentManagerImpl::RegisterCIDEntryLocked()
65520 in 2 allocs from nsPurpleBuffer::Put()
60K in 65 allocs from ft_mem_qalloc() (freetype)
60K in ~2500 allocs from nsAtomFriend::RegisterStaticAtoms()
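The "lazy allocation" idea raised for the ThreadInfo buffers above can be sketched roughly as follows. This is only an illustration in Python, not the profiler's code: `LazyBuffer` and `eager_cost` are hypothetical names, and the only numbers taken from the list are the 32808 bytes per thread and the ~25 registered threads.

```python
class LazyBuffer:
    """Hypothetical stand-in for a per-thread profiler buffer."""

    def __init__(self, size_bytes):
        self.size_bytes = size_bytes
        self._data = None  # nothing is allocated at registration time

    @property
    def allocated(self):
        """True once the backing storage has actually been created."""
        return self._data is not None

    def data(self):
        """Return the backing storage, allocating it on first use only."""
        if self._data is None:
            self._data = bytearray(self.size_bytes)
        return self._data


def eager_cost(thread_count, bytes_per_thread):
    """Bytes reserved up front if every thread allocates at registration."""
    return thread_count * bytes_per_thread
```

With these figures, `eager_cost(25, 32808)` is 820200 bytes, which is in the same range as the ~850K reported above. With the lazy variant, a thread that never uses its buffer pays nothing beyond the small wrapper object.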
Flags: needinfo?(mh+mozilla)
Comment 2•7 years ago
> We have a LOT of power-of-2-sized buffers -- IIRC jemalloc isn't efficient on powers-of-two (not unusual) -- Glandium?
No, powers-of-two are the best case, along with everything that's exactly matching a class size, or is a multiple of the page size for larger sizes.
Flags: needinfo?(mh+mozilla)
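To make the size-class point concrete, here is a minimal sketch of how rounding to a size class produces slop. The class table and the 4096-byte page size are assumptions chosen for illustration; they are not mozjemalloc's actual table.

```python
PAGE_SIZE = 4096  # assumed page size

# Assumed, simplified small size classes; NOT mozjemalloc's real table.
SMALL_CLASSES = [16, 32, 48, 64, 80, 96, 112, 128, 256, 512, 1024, 2048]


def rounded_size(request):
    """Return the number of bytes actually reserved for a request."""
    if request <= 0:
        raise ValueError("request must be positive")
    for cls in SMALL_CLASSES:
        if request <= cls:
            return cls
    # Larger requests are rounded up to a whole number of pages.
    pages = -(-request // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE


def slop(request):
    """Bytes reserved but not requested."""
    return rounded_size(request) - request
```

Under these assumptions, `slop(262144)` and `slop(131072)` are both 0, matching the statement that power-of-two and page-multiple sizes are the best case. A request just past a boundary is the bad case: `slop(4097)` is 4095, and a 32808-byte request would reserve 36864 bytes.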
Comment 3•7 years ago
> No, powers-of-two are the best case, along with everything that's exactly
> matching a class size, or is a multiple of the page size for larger sizes.
Good. (IIRC at one point it was better to be power-of-2-minus-n; though perhaps I'm thinking of some other system/allocator)
Comment 4•7 years ago
You might be thinking about things like nsTAutoArray, which have an embedded header, so a better size for it is jemalloc_class_size - header_size.
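As a rough illustration of the "class size minus header size" rule, the capacity of a buffer with an embedded header can be picked so that header plus elements exactly fill a size class. The function names and the 8-byte header used in the usage note are hypothetical, not taken from the array implementation.

```python
def best_capacity(class_size, header_size, elem_size):
    """Largest element count whose header + payload fits in class_size."""
    if header_size > class_size:
        raise ValueError("header does not fit in the size class")
    return (class_size - header_size) // elem_size


def bytes_needed(header_size, elem_size, capacity):
    """Total allocation size for a header followed by capacity elements."""
    return header_size + capacity * elem_size
```

For a 256-byte class, an assumed 8-byte header, and 8-byte elements, `best_capacity(256, 8, 8)` is 31, and `bytes_needed(8, 8, 31)` is exactly 256. Choosing a "round" capacity of 32 instead needs 264 bytes, which no longer fits the 256-byte class and spills into the next one.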
Comment 5•7 years ago
(In reply to Randell Jesup [:jesup] from comment #1)
> I have a dump of memory allocated by content processes using ASAN's
> __sanitizer_print_memory_profile()
You'll want to be careful with that -- I'm pretty sure ASAN will be using its own allocator instead of jemalloc, so it's not a representative run.
You can use DMD for vanilla heap profiling. It works with jemalloc so will give representative results. See the docs about "live mode" at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD.
Comment 6•7 years ago
So, DMD results (I was using ASAN before): similar, of course, since I don't care much about a few bytes. The biggest difference would be in slop and alignment, I imagine.
The comments above are still valid. We can now see that gtk is using a moderate amount and, no surprise, the fontconfig stuff is called from InitFontList().
There is a lot in total in js::ScriptSource and in various things involving Atoms, and XPTInterfaceInfoManager::RegisterBuffer() is a big hotspot.
LifoAlloc then comes in with a lot of different little allocations (probably not surprising)
6% (811K) in ThreadInfo::ThreadInfo (note: another couple % below with different stacks)
4.5% in js::ScriptSource::performXDR<>
3.6% (491K) from XPTInterfaceInfoManager::RegisterBuffer() (a few % more below)
2.3% in js::ScriptSource::setSourceCopy()
2% in js::SharedScriptData::new_ from JSScript::createScriptData()
1.9% (262144) in PLDHashTable::ChangeTable() from SetLatePreferences()
1.7% in js::SharedScriptData::new_ from JSScript::fullyInitFromEmitter()
1.6% in FcPatternObjectAddWithBinding() (from InitFontList())
1.5% in gtk_css_selector_tree_builder_build from (near top) gtk_settings_get_for_display()
1.4% from js::ScriptSource::setSourceCopy()
1.2% in glibc _dl_new_object()
1% (~30% cumulative) in nsPersistentProperties::SetStringProperty() from nsStringBundle::LoadProperties()
1% from XPTInterfaceInfoManager::RegisterBuffer() (again, different stack slightly)
1% in PLDHashTable::Add() from nsAtomFriend::RegisterStaticAtoms()
1% (131072, 1 alloc) in gtk_css_provider_load_internal() from gtk_settings_get_for_display
1% (ditto) in PLDHashTable::Add() from TelemetryHistogram::InitializeGlobalState()
1% (ditto) in js::AtomizeChars from the frontend::GeneralParser<>
1% (ditto) in js::AtomizeChars from js::XDRAtom<>/js::XDRScript<>
1% in FcPatternObjectInsertElt from InitFontList()
0.85% in PLDHashTable::Add from XPTInterfaceInfoManager::RegisterBuffer() (different stack)
0.8% in gtk_css_ruleset_add() from gtk_settings_get_for_display
0.8% in js::SharedScriptData::new_() from JSScript::fullyInitFromEmitter()
0.8% DuplicateString() from pref_SetPref()
0.7% in FcCharSetPutLeaf from InitFontList
0.7% in js::LifoAlloc::newChunkWithCapacity() from js::frontend::PerHandlerParser
0.6% from js::SharedScriptData::new_ from JSScript::createScriptData() (different stack)
0.6% from nsAtomFriend::RegisterStaticAtoms()
0.5% from FcPatternObjectAddWithBinding
0.5% from ThreadInfo::ThreadInfo (different stack)
0.5% from XPTInterfaceInfoManager::RegisterBuffer() (different stack)
0.5% from call_init() (dl_init.c) in glibc
0.5% (45% cumulative) in js::AtomizeChars (different stack)
<bunch of 65536 byte allocs from Component Manager, HashTables for StaticAtoms, JSScript::shareScriptData()>
<several 61440-byte totals (15 allocs) from LifoAlloc, and a bunch in the 50K region with 13 allocs from LifoAlloc>
<4 ~36K alloc stacks from ThreadInfo::ThreadInfo with different callers (HangMonitor, WatchdogMain, BackgroundHangManager); I wonder if there's some duplication here that could be eliminated>
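Since the same functions show up several times with different stacks, it can help to sum the percentages per subsystem. The following is a minimal sketch of that tally, assuming entries shaped like the lines above; the grouping keywords and the function name `group_percentages` are my own choices for illustration, not part of DMD.

```python
import re
from collections import defaultdict

# Matches a leading percentage such as "6%", "3.6%" or "~0.5%".
ENTRY_RE = re.compile(r"^\s*~?(\d+(?:\.\d+)?)%")


def group_percentages(lines, groups):
    """Sum leading percentages per group.

    `groups` maps a group name to a substring; each line is counted
    under the first group whose substring it contains.
    """
    totals = defaultdict(float)
    for line in lines:
        match = ENTRY_RE.match(line)
        if not match:
            continue  # skip lines without a leading percentage
        pct = float(match.group(1))
        for name, needle in groups.items():
            if needle in line:
                totals[name] += pct
                break
    return dict(totals)


sample = [
    "6% (811K) in ThreadInfo::ThreadInfo",
    "3.6% (491K) from XPTInterfaceInfoManager::RegisterBuffer()",
    "1.6% in FcPatternObjectAddWithBinding() (from InitFontList())",
    "1% from XPTInterfaceInfoManager::RegisterBuffer()",
    "0.5% from ThreadInfo::ThreadInfo (different stack)",
]
groups = {
    "profiler": "ThreadInfo",
    "xpt": "XPTInterfaceInfoManager",
    "fontconfig": "Fc",
}
totals = group_percentages(sample, groups)
```

On this five-line sample the profiler entries add up to about 6.5% and the XPT entries to about 4.6%. Running it over the full list would give per-subsystem totals that are easier to rank than the individual stacks.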
Reporter | ||
Comment 7•7 years ago
Bug 1436179 tracks the ThreadInfo/profiler bits.
Comment 8•7 years ago
Raw data. Note that lsan4_xaa is the first 100K lines, which goes down to ~1K total allocation per stack. The tail is LONG: xaa is only about 1/15th of the full file. Also note that lsan has a mix of two content processes: one that is displaying a blank page and one that hasn't been used yet.
https://app.box.com/folder/46288716831
Comment 9•7 years ago
Updated•7 years ago
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Updated•7 years ago
Assignee: rjesup → nobody
Status: ASSIGNED → NEW
Updated•7 years ago
Whiteboard: [memshrink] → [MemShrink:meta]
Updated•7 years ago
Alias: memshrink-content
Comment 11•6 years ago
cc'ing felipe who might want to be in the loop on this.
Reporter | ||
Updated•6 years ago
User Story: (updated)
Updated•6 years ago
Depends on: memshrink-style
Updated•6 years ago
Depends on: ipc-devirt
Updated•5 years ago
Depends on: WarpBuilder
Comment 14•5 years ago
There are a few instances of nsStaticCaseInsensitiveNameTable, which is a class that could clearly be generated at compile time. However, the three instances of this class in each content process occupy a total of about 7000 bytes, so it would not be worth the time to convert it.
Comment 15•5 years ago
gPropertyIDLNameTable also looks to contain only static data. It uses about 28720 bytes, which is better but still maybe too small to bother with.
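For tables like these that hold only static data, one possible approach is to emit the table as constant source at build time instead of filling a heap structure at startup. A minimal sketch of such a generator, written in Python and assuming the entries are plain identifier-like names that need no escaping; `generate_static_table` and the emitted array name are hypothetical and do not correspond to an existing build script:

```python
def generate_static_table(array_name, names):
    """Emit C++ source for a sorted, lower-cased, constant string table."""
    # Lower-case and de-duplicate so lookups can be case-insensitive,
    # then sort so the consumer could binary-search the array.
    entries = sorted({name.lower() for name in names})
    lines = [f"static constexpr const char* {array_name}[] = {{"]
    lines += [f'  "{entry}",' for entry in entries]
    lines.append("};")
    return "\n".join(lines)


source = generate_static_table("kColorNames", ["Red", "blue", "GREEN"])
```

The generated text is a constant array that the compiler can place in read-only data shared between processes, so nothing is allocated per content process. Whether that is worth doing for a few kilobytes is the judgment call discussed in the comments above.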
Comment 16•4 years ago
Here's 12KB of slop/clownshoes allocated by a 3rd party rust library: https://github.com/crossbeam-rs/crossbeam/issues/551
Updated•4 years ago
Fission Milestone: --- → Future
Updated•3 years ago
Depends on: editor-transaction-footprint
Updated•3 years ago
Fission Milestone: Future → ---
Updated•2 years ago
Severity: normal → S3