Closed Bug 1506776 Opened 6 years ago Closed 6 years ago

17.78% Base Content Resident Unique Memory (osx-10-10) regression on push 22d335fc020fb0509c67c4a945db21a27b8332d4 (Sat Nov 10 2018)

Categories

(Core :: Security: Process Sandboxing, defect, P1)

65 Branch
Unspecified
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- unaffected
firefox64 --- unaffected
firefox65 + fixed

People

(Reporter: igoldan, Assigned: haik)

References

Details

(Keywords: perf, regression)

Attachments

(1 file)

We have detected an awsy regression from push: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e51037702c9e8febad9b39f219fec3f22a6d7f26&tochange=22d335fc020fb0509c67c4a945db21a27b8332d4 As author of one of the patches included in that push, we need your help to address this regression. Regressions: 18% Base Content Resident Unique Memory osx-10-10 opt stylo 22,576,298.67 -> 26,590,549.33 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=17483 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests
Component: General → Security: Process Sandboxing
Product: Testing → Core
Flags: needinfo?(haftandilian)
This is a pretty huge regression from a Fission Memshrink point of view, I think we might need to backout.
I'll be looking into this. I first need to understand at what point in content process startup do we snapshot the memory size. Starting the sandbox earlier (which is enabled with bug 1501126) is going to use memory earlier in the content process startup than before.
The diff is pretty weird. 'explicit', 'resident', and 'vsize' went down but 'resident-unique' went up. > Web Content (pid NNN) > Explicit Allocations > > -1.82 MB (100.0%) -- explicit > ├──-1.77 MB (96.98%) ── heap-unclassified > ├──-0.06 MB (03.38%) -- heap-overhead > │ ├──-0.09 MB (04.94%) ── page-cache [2] > │ ├───0.03 MB (-01.55%) ── bin-unused [2] > │ └───0.00 MB (-00.01%) ── bookkeeping [2] > └───0.01 MB (-00.36%) ++ (3 tiny) > > Other Measurements > > 0 (100.0%) ++ extensions > > -1.82 MB (100.0%) -- heap-committed > ├──-1.76 MB (96.62%) ── allocated [2] > └──-0.06 MB (03.38%) ── overhead [2] > > 0.01 MB (100.0%) -- js-main-runtime > ├──0.01 MB (101.12%) -- zones > │ ├──0.01 MB (114.29%) ── type-pool [2] > │ ├──-0.00 MB (-15.29%) -- shapes > │ │ ├──-0.00 MB (-14.29%) ── malloc-heap/tree-kids [2] > │ │ └──-0.00 MB (-01.00%) ++ gc-heap > │ └──0.00 MB (02.12%) ── unused-gc-things [2] > └──-0.00 MB (-01.12%) ── realms/classes/objects/gc-heap [2] > > 0.00 MB (100.0%) -- js-main-runtime-gc-heap-committed > ├──0.00 MB (100.0%) -- unused/gc-things > │ ├──0.00 MB (100.0%) ── objects [2] > │ ├──0.00 MB (100.0%) ── shapes [2] > │ └──0.00 MB (100.0%) ── base-shapes [2] > └──-0.00 MB (100.0%) -- used/gc-things > ├──-0.00 MB (100.0%) ── objects [2] > ├──-0.00 MB (100.0%) ── shapes [2] > └──-0.00 MB (100.0%) ── base-shapes [2] > > -1.76 MB ── heap-allocated [2] > 2,282 ── page-faults-soft [2] > -6.25 MB ── resident [2] > -4.48 MB ── resident-peak [2] > 7.03 MB ── resident-unique [2] > -25.96 MB ── vsize [2]
Assignee: nobody → haftandilian
Flags: needinfo?(haftandilian)
Priority: -- → P1
How much do we know about the services we're blocking access to with the early sandbox startup? Could there be something that normally accesses shared memory but falls back to making its own copy when that fails? If it's something like that, comparing the output of vmmap(1) before/after could help find what's different.
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #4) > How much do we know about the services we're blocking access to with the > early sandbox startup? Could there be something that normally accesses > shared memory but falls back to making its own copy when that fails? If > it's something like that, comparing the output of vmmap(1) before/after > could help find what's different. Yes, that's definitely possible and I think I've come across some instances of that before--where without access to a service, the process creates its own instances of things. Once I can reproduce this, vmmap is first on my list of things to check.
I've found that adding back access to the coreservicesd daemon via com.apple.CoreServices.coreservicesd improves the "Base Content Resident Unique Memory" measurement bringing it back in line with the measurement prior to enabling early sandbox startup. This compares earlyinit=true (base) with earlyinit=true + coreservicesd: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3f0e50d050e7007f8b15fb1f6f2434a2f3b45f49&newProject=try&newRevision=5c7d54b858be6213be146661bd752fcc095a960b&framework=4 This compares earlyinit=false (base) with earlyinit=true + coreservicesd: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=28035db7d9a0&newProject=try&newRevision=5c7d54b858be6213be146661bd752fcc095a960b&framework=4 coreservicesd is sandboxed, but unfortunately its sandbox policy is highly permissive allowing global filesystem read access, filesystem write access to temp and cache dirs, outbound network access and more. If we added this back, it would still be a win to be dropping access to the unsandboxed dock daemon along with distnoted. I'm going to continue debugging to try to get a better understanding of what's happening and to see if we can realize the same memory savings without granting access to coreservicesd.
I have not found a workaround for enabling CoreServices so I plan to add back CoreServices to the content policy and file a new bug to cover investigating this more. See below for some debugging details. The memory impact of blocking CoreServices seems to be less on newer OS versions. To investigate this further, I think we should compare the memory impact and possibly turn on blocking just for newer OS versions. In addition, tracing all allocations and trying to see where/why we do the additional allocations when CoreServices is blocked may lead to some discovery. On 10.10, comparing vmmap output between an about:blank content process without allowing CoreServices vs allowing Coreservices, the total writable regions is larger by ~4 MB when CoreServices is blocked which is consistent with the difference flagged by the awsy base test "Base Content Resident Unique Memory" (ab on try). I don't have an explanation for why just the awsy base test regressed, but not the "after tabs open" test. More debugging needed. There are also some memory mapped files that I only see when CoreServices is allowed, but they are small and don't appear to be causing the difference. Diff of vmmap output: *** vmmap-base.51799 --- vmmap-with-core-services.51862 ! ==== Summary for process 51799 ReadOnly portion of Libraries: Total=345.5M resident=0K(0%) swapped_out_or_unallocated=345.5M(100%) ! Writable regions: Total=61.4M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=61.4M(100%) VIRTUAL REGION REGION TYPE SIZE COUNT (non-coalesced) =========== ======= ======= CG image 4K 2 CG shared images 32K 2 ! CoreServices 500K 14 -- ! ==== Summary for process 51862 ReadOnly portion of Libraries: Total=345.5M resident=0K(0%) swapped_out_or_unallocated=345.5M(100%) ! Writable regions: Total=56.7M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=56.7M(100%) VIRTUAL REGION REGION TYPE SIZE COUNT (non-coalesced) =========== ======= ======= CG image 4K 2 CG shared images 32K 2 ! CoreServices 128K 4 I looked into trying to avoid triggering the connection to CoreServices. We first connect in ContentChild::Init() when we call SetProcessName(). With that code removed, we then try to connect in nsURLHelperOSX.cpp calling ::FSGetVolumeInfo() for some HFS-specific path checks. And with that code removed, we try to connect when calling [NSApplication sharedApplication] from nsAppShell::Init(). * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1 * frame #0 libxpc.dylib`bootstrap_look_up3() frame #1 libxpc.dylib`bootstrap_look_up2() frame #2 CarbonCore`SCClientSession::checkinWithServer() frame #3 CarbonCore`connectToCoreServicesD() frame #4 CarbonCore`getStatus() frame #5 CarbonCore`getSystemSession() frame #6 CarbonCore`scCreateSystemCacheable() frame #7 CarbonCore`___ZL17_CLV_GetCacheablev_block_invoke() frame #8 libdispatch.dylib`_dispatch_client_callout() frame #9 libdispatch.dylib`_dispatch_once_callout() frame #10 CarbonCore`_CLV_GetCacheable() frame #11 CarbonCore`_standalone_CopyCheckFixForBundleID() frame #12 CarbonCore`GetBugsForOurBundleIDFromCoreservicesd() frame #13 CarbonCore`_CSCheckFix() frame #14 SkyLight`initDisplayState() frame #15 SkyLight`initDisplayMappings() frame #16 SkyLight`__SLSInitialize_block_invoke() frame #17 libdispatch.dylib`_dispatch_client_callout() frame #18 libdispatch.dylib`_dispatch_once_callout() frame #19 SkyLight`CGS_CHECK_INIT() frame #20 SkyLight`SLSMainConnectionID() frame #21 CoreFoundation`_CFAppSleepSetupCoreGraphics() frame #22 CoreFoundation`____CFRunLoopSetOptionsReason_block_invoke_2() frame #23 libdispatch.dylib`_dispatch_client_callout() frame #24 libdispatch.dylib`_dispatch_once_callout() frame #25 CoreFoundation`__CFRunLoopSetOptionsReason() frame #26 LaunchServices`_LSApplicationCheckIn() frame #27 HIServices`_RegisterApplication() frame #28 HIServices`GetCurrentProcess() frame #29 HIToolbox`MenuBarInstance::GetAggregateUIMode() frame #30 HIToolbox`MenuBarInstance::IsVisible() frame #31 AppKit`_NSInitializeAppContext() frame #32 AppKit`-[NSApplication init]() frame #33 AppKit`+[NSApplication sharedApplication]() frame #34 XUL`nsAppShell::Init() at nsAppShell.mm:337 frame #35 XUL`nsAppShellInit() at nsAppShellSingleton.h:45 frame #36 XUL`nsComponentManagerImpl::KnownModule::Load() at nsComponentManager.cpp:811 frame #37 XUL`nsFactoryEntry::GetFactory() at nsComponentManager.cpp:1811 frame #38 XUL`nsComponentManagerImpl::CreateInstance() at nsComponentManager.cpp:1048 frame #39 XUL`nsComponentManagerImpl::GetService() at nsComponentManager.cpp:1293 frame #40 XUL`nsGetServiceByCID::operator()() at nsComponentManagerUtils.cpp:56 frame #41 XUL`nsGetServiceByCID::operator()() at nsComponentManagerUtils.cpp:253 frame #42 XUL`nsCOMPtr_base::assign_from_gs_cid() at nsCOMPtr.cpp:73 frame #43 XUL`XRE_RunAppShell() at nsCOMPtr.h:660 frame #44 XUL`XRE_RunAppShell() at nsCOMPtr.h:657 frame #45 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:911 frame #46 XUL`MessageLoop::Run() at message_loop.cc:325 frame #47 XUL`MessageLoop::Run() at message_loop.cc:318 frame #48 XUL`MessageLoop::Run() at message_loop.cc:298 frame #49 XUL`XRE_InitChildProcess() at nsEmbedFunctions.cpp:780 frame #50 plugin-container`main() at plugin-container.cpp:50 frame #51 plugin-container`main() at MozillaRuntimeMain.cpp:25 frame #52 libdyld.dylib`start() frame #53 libdyld.dylib`start()
Add back access to com.apple.CoreServices.coreservicesd to avoid extra allocations that are triggered when the service is blocked.
Pushed by haftandilian@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/823acff9cd2f 17.78% Base Content Resident Unique Memory regression r=Alex_Gaynor
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: