Closed Bug 1614933 Opened 5 years ago Closed 5 years ago

Don't use shared memory to send annotations from child processes

Categories

(Toolkit :: Crash Reporting, task, P1)

task

Tracking

()

RESOLVED FIXED
mozilla77
Tracking Status
firefox77 --- fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: [MemShrink:P2][overhead:16K] )

Attachments

(3 files)

This bug is meant to simplify the interface we use for sending crash annotations so that it matches the one we've designed for the rewritten client library in bug 1588530.

The architecture we devised for bug 1588530 needs all crash data to be sent to an external process that will write out the complete crash report on behalf of the crashed process. One of the things that need to be fixed in order to achieve this is send all crash annotations to a single destination. Right now our interface assumes that crash annotations coming from content processes will be streamed via shared-memory while the process is alive while the exception-time ones will be sent over a pipe from within the exception handler.

I want to remove the first path and stream out all the annotations at exception time. This allows us to break the dependency between the CrashReporterClient class and the crash reporter implementation making gathering annotations a black-box which can be easily replaced.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Priority: -- → P1

Crash annotations in content processes are currently sent over IPC using
shared memory buffers. To pave the way for the Rust rewrite of the exception
handler we are removing this code and gathering all the crash annotations
within the content processes. The annotations are then streamed out from the
exception handler together with the exception-time annotations.

This has a number of benefits: we have one less channel to exchange data
between content processes and the parent process, we save memory because we
don't need to allocate the shared memory buffers, annotations are faster
because we don't stream them all out every time one changes and finally we
won't truncate annotation anymore if we run out of space in the shared
segment.

I was testing my WIP here and I noticed something odd: removing this call that allocates a chunk of shared memory in the content process causes all following calls to shm_open() to fail. The sandbox seems to deny the shm_open() call and spits this in the log:

Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path=/dev/shm for pid=27576
Sandbox: Failed errno -13 op open flags 00 path /dev/shm/

In this case process 27576 is a child process. If I disable the sandbox everything works as expected.

Jed can you help me with this? The first shm_open() call I removed in this patch happens very early during content process startup, all the other ones happen later. It seems that if we don't open a chunk of shared memory early enough we can never open another one later. Is this expected behavior?

Flags: needinfo?(jld)

I confirmed that before applying my patch in a content process things go like this in a content process:

  1. RecvInitCrashReporter() gets called in response to IPC
  2. We call shm_open() to allocate some shared memory and it succeeds
  3. RecvSetProcessSandbox() gets called in response to IPC, we setup the sandbox
  4. We call shm_open() again and it also succeeds

With my patch this turns into:

  1. RecvInitCrashReporter() gets called in response to IPC
  2. RecvSetProcessSandbox() gets called in response to IPC, we setup the sandbox
  3. We call shm_open() for the first time and it fails because the sandbox prevents it

Do I need to call shm_open() before we set the sandbox to prime it? Or is there a cleaner way to prevent the sandbox from blocking the call?

That… shouldn't happen. It looks as if glibc is trying to open the /dev/shm directory itself, which is a problem because we'd like to not allow readdiring it. That's probably the result of our seccomp-bpf-powered polyfill for this statfs call, expressing it as open+fstatfs, so one possibility would be to make statfs a first-class broker call. Alternately, opening with O_PATH might have the properties we want (but we'd have to teach the broker to allow that, because I think it currently doesn't.)

But the simplest solution might be to just try to shm_open something that doesn't exist (without O_CREAT) before starting the sandbox, to force glibc's lazy initialization.

Flags: needinfo?(jld)

Thanks Jed, this is precious information. I was completely stumped at why it was trying to open just /dev/shm. I'll go for the quick fix of shm_open()ing something early enough to prime glibc for now, do you feel like we should file a bug for this? Obviously this is a glibc-specific thing which might also change in the future so I don't know if it's worth a more complex fix.

Also adding this as a blocker to bug 1436250 because we're going to shave 16KiB of memory per content process.

Whiteboard: [MemShrink][overhead:16K]
Whiteboard: [MemShrink][overhead:16K] → [MemShrink:P2][overhead:16K]
Attachment #9126101 - Attachment description: Bug 1614933 - Gather content processes' crash annotations at exception time instead of using IPC → Bug 1614933 - Gather content processes' crash annotations at exception time instead of using IPC; r=froydnj
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5cee8f603ae4 Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld https://hg.mozilla.org/integration/autoland/rev/ff92f800a74e Gather content processes' crash annotations at exception time instead of using IPC; r=froydnj

Backed out 2 changesets (Bug 1614933) for busages complaining about IdleSchedulerChild.cpp
Also saw leaks starting with these changes.
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&fromchange=6eab5a5a91c13ef5c91fccfe354fc49130810df3&searchStr=build%2Cosx&tochange=b7d06bfdd0d5655fba6c53d7d86070b1f2f2ace6&selectedJob=292472867

Backout link: https://hg.mozilla.org/integration/autoland/rev/f0e5bc0bf9ace0eaefa152aa3410edb75a0ad674

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=292472867&repo=autoland&lineNumber=31750

[task 2020-03-10T15:18:11.818Z] 15:18:11     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/storage'
[task 2020-03-10T15:18:11.818Z] 15:18:11     INFO -  storage/mozStorageBindingParams.o
[task 2020-03-10T15:18:11.819Z] 15:18:11     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/storage'
[task 2020-03-10T15:18:12.040Z] 15:18:12     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/ipc/glue'
[task 2020-03-10T15:18:12.046Z] 15:18:12     INFO -  /builds/worker/fetches/sccache/sccache /builds/worker/fetches/clang/bin/clang++ -isysroot /builds/worker/checkouts/gecko/MacOSX10.11.sdk -std=gnu++17 --target=x86_64-apple-darwin -o Unified_cpp_ipc_glue1.o -c  -fvisibility=hidden -fvisibility-inlines-hidden -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -ftrivial-auto-var-init=pattern -DDEBUG=1 -DHAVE_UINT64_T -DWEBRTC_POSIX -DWEBRTC_BUILD_LIBEVENT -DWEBRTC_MAC -DOS_POSIX=1 -DOS_MACOSX=1 '-DMOZ_CHILD_PROCESS_NAME="plugin-container"' '-DMOZ_CHILD_PROCESS_BUNDLE="plugin-container.app/Contents/MacOS/"' -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DSTATIC_EXPORTABLE_JS_API -I/builds/worker/checkouts/gecko/ipc/glue -I/builds/worker/workspace/obj-build/ipc/glue -I/builds/worker/checkouts/gecko/caps -I/builds/worker/checkouts/gecko/dom/broadcastchannel -I/builds/worker/checkouts/gecko/dom/indexedDB -I/builds/worker/checkouts/gecko/dom/storage -I/builds/worker/checkouts/gecko/media/webrtc/trunk -I/builds/worker/checkouts/gecko/media/webrtc/trunk/webrtc -I/builds/worker/checkouts/gecko/netwerk/base -I/builds/worker/checkouts/gecko/xpcom/build -I/builds/worker/checkouts/gecko/dom/ipc -I/builds/worker/checkouts/gecko/toolkit/crashreporter -I/builds/worker/checkouts/gecko/toolkit/xre -I/builds/worker/checkouts/gecko/xpcom/base -I/builds/worker/checkouts/gecko/xpcom/threads -I/builds/worker/workspace/obj-build/ipc/ipdl/_ipdlheaders -I/builds/worker/checkouts/gecko/ipc/chromium/src -I/builds/worker/checkouts/gecko/ipc/glue -I/builds/worker/workspace/obj-build/dist/include -I/builds/worker/workspace/obj-build/dist/include/nspr -I/builds/worker/workspace/obj-build/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/obj-build/mozilla-config.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Werror=non-literal-null-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=return-std-move -Wno-error=atomic-alignment -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-unknown-warning-option -fno-sized-deallocation -fno-aligned-new -fcrash-diagnostics-dir=/builds/worker/artifacts -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -Xclang -load -Xclang /builds/worker/workspace/obj-build/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -O3 -fno-omit-frame-pointer -funwind-tables -Werror -Wno-error=shadow -Wno-shadow  -MD -MP -MF .deps/Unified_cpp_ipc_glue1.o.pp   Unified_cpp_ipc_glue1.cpp
[task 2020-03-10T15:18:12.046Z] 15:18:12     INFO -  In file included from Unified_cpp_ipc_glue1.cpp:2:
[task 2020-03-10T15:18:12.046Z] 15:18:12    ERROR -  /builds/worker/checkouts/gecko/ipc/glue/IdleSchedulerChild.cpp:86:15: error: member access into incomplete type 'ipc::PBackgroundChild'
[task 2020-03-10T15:18:12.046Z] 15:18:12     INFO -      background->SendPIdleSchedulerConstructor(sMainThreadIdleScheduler);
[task 2020-03-10T15:18:12.046Z] 15:18:12     INFO -                ^
[task 2020-03-10T15:18:12.047Z] 15:18:12     INFO -  /builds/worker/workspace/obj-build/dist/include/mozilla/dom/ipc/StructuredCloneData.h:25:7: note: forward declaration of 'mozilla::ipc::PBackgroundChild'
[task 2020-03-10T15:18:12.048Z] 15:18:12     INFO -  class PBackgroundChild;
[task 2020-03-10T15:18:12.049Z] 15:18:12     INFO -        ^
[task 2020-03-10T15:18:12.050Z] 15:18:12     INFO -  1 error generated.
[task 2020-03-10T15:18:12.051Z] 15:18:12     INFO -  /builds/worker/checkouts/gecko/config/rules.mk:745: recipe for target 'Unified_cpp_ipc_glue1.o' failed
[task 2020-03-10T15:18:12.051Z] 15:18:12    ERROR -  make[4]: *** [Unified_cpp_ipc_glue1.o] Error 1
[task 2020-03-10T15:18:12.052Z] 15:18:12     INFO -  make[4]: Leaving directory '/builds/worker/workspace/obj-build/ipc/glue'
[task 2020-03-10T15:18:12.053Z] 15:18:12     INFO -  make[4]: *** Waiting for unfinished jobs....
[task 2020-03-10T15:18:12.054Z] 15:18:12     INFO -  make[4]: Entering directory '/builds/worker/workspace/obj-build/storage'

https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=292482568&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&fromchange=6eab5a5a91c13ef5c91fccfe354fc49130810df3&tochange=b7d06bfdd0d5655fba6c53d7d86070b1f2f2ace6

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=292482568&repo=autoland&lineNumber=951

[task 2020-03-10T15:45:52.225Z] 15:45:52     INFO - PID 4164 | [Parent 7968, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /builds/worker/checkouts/gecko/js/xpconnect/loader/mozJSComponentLoader.cpp, line 228
[task 2020-03-10T15:45:52.225Z] 15:45:52     INFO - PID 4164 | [Parent 7968, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /builds/worker/checkouts/gecko/js/xpconnect/loader/mozJSComponentLoader.cpp, line 255
[task 2020-03-10T15:45:53.013Z] 15:45:53     INFO - PID 4164 | [Parent 7968, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 | nsStringStats
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 |  => mAllocCount:         125331
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 |  => mReallocCount:            0
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 |  => mFreeCount:          125331
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 |  => mShareCount:         118938
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 |  => mAdoptCount:           2379
[task 2020-03-10T15:45:53.091Z] 15:45:53     INFO - PID 4164 |  => mAdoptFreeCount:       2451
[task 2020-03-10T15:45:53.092Z] 15:45:53     INFO - PID 4164 |  => Process ID: 7968, Thread ID: 9848
[task 2020-03-10T15:45:53.114Z] 15:45:53     INFO - Browser exited with return code 0
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - PROCESS LEAKS c:\users\task_1583853630\appdata\local\temp\tmpt3_jr9.mozrunner\runtests_leaks_7872.log
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - leakcheck | Processing log file c:\users\task_1583853630\appdata\local\temp\tmpt3_jr9.mozrunner\runtests_leaks_7872.log for scope /2dcontext/wide-gamut-canvas
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - TEST-INFO | leakcheck | default process: leak threshold set at 30000 bytes
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - TEST-INFO | leakcheck | forkserver process: leak threshold set at 0 bytes
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - TEST-INFO | leakcheck | gmplugin process: leak threshold set at 20000 bytes
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - TEST-INFO | leakcheck | gpu process: leak threshold set at 0 bytes
[task 2020-03-10T15:45:53.115Z] 15:45:53     INFO - TEST-INFO | leakcheck | plugin process: leak threshold set at 0 bytes
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - TEST-INFO | leakcheck | rdd process: leak threshold set at 400 bytes
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - TEST-INFO | leakcheck | socket process: leak threshold set at 0 bytes
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - TEST-INFO | leakcheck | tab process: leak threshold set at 10000 bytes
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - TEST-INFO | leakcheck | vr process: leak threshold set at 0 bytes
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - leakcheck | Processing leak log file c:\users\task_1583853630\appdata\local\temp\tmpt3_jr9.mozrunner\runtests_leaks_7872.log
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 7968
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
[task 2020-03-10T15:45:53.116Z] 15:45:53     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
[task 2020-03-10T15:45:53.117Z] 15:45:53     INFO -    0 |TOTAL                                 |       39        0| 1581948        0|
[task 2020-03-10T15:45:53.122Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.122Z] 15:45:53     INFO - nsTraceRefcnt::DumpStatistics: 1908 entries
[task 2020-03-10T15:45:53.122Z] 15:45:53     INFO - TEST-PASS | leakcheck | default no leaks detected!
[task 2020-03-10T15:45:53.122Z] 15:45:53     INFO - leakcheck | Processing leak log file c:\users\task_1583853630\appdata\local\temp\tmpt3_jr9.mozrunner\runtests_leaks_7872_gpu_pid4656.log
[task 2020-03-10T15:45:53.122Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.123Z] 15:45:53     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, gpu process 4656
[task 2020-03-10T15:45:53.123Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.123Z] 15:45:53     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
[task 2020-03-10T15:45:53.124Z] 15:45:53     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
[task 2020-03-10T15:45:53.124Z] 15:45:53     INFO -    0 |TOTAL                                 |       18      168|  104452        3|
[task 2020-03-10T15:45:53.124Z] 15:45:53     INFO -   71 |Mutex                                 |       80      160|     266        2|
[task 2020-03-10T15:45:53.125Z] 15:45:53     INFO -  260 |nsStringBuffer                        |        8        8|    1098        1|
[task 2020-03-10T15:45:53.125Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.125Z] 15:45:53     INFO - nsTraceRefcnt::DumpStatistics: 271 entries
[task 2020-03-10T15:45:53.125Z] 15:45:53     INFO - TEST-INFO | leakcheck | gpu leaked 2 Mutex
[task 2020-03-10T15:45:53.125Z] 15:45:53     INFO - TEST-INFO | leakcheck | gpu leaked 1 nsStringBuffer
[task 2020-03-10T15:45:53.126Z] 15:45:53     INFO - TEST-UNEXPECTED-FAIL | leakcheck | gpu 168 bytes leaked (Mutex, nsStringBuffer)
[task 2020-03-10T15:45:53.126Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.126Z] 15:45:53     INFO - leakcheck | Processing leak log file c:\users\task_1583853630\appdata\local\temp\tmpt3_jr9.mozrunner\runtests_leaks_7872_tab_pid10028.log
[task 2020-03-10T15:45:53.126Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.126Z] 15:45:53     INFO - == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, tab process 10028
[task 2020-03-10T15:45:53.127Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.127Z] 15:45:53     INFO -      |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
[task 2020-03-10T15:45:53.128Z] 15:45:53     INFO -      |                                      | Per-Inst   Leaked|   Total      Rem|
[task 2020-03-10T15:45:53.128Z] 15:45:53     INFO -    0 |TOTAL                                 |       41      232|   39271       11|
[task 2020-03-10T15:45:53.129Z] 15:45:53     INFO -  244 |Mutex                                 |       80      160|     424        2|
[task 2020-03-10T15:45:53.129Z] 15:45:53     INFO -  809 |nsStringBuffer                        |        8       72|    8228        9|
[task 2020-03-10T15:45:53.129Z] 15:45:53     INFO - 
[task 2020-03-10T15:45:53.129Z] 15:45:53     INFO - nsTraceRefcnt::DumpStatistics: 882 entries
[task 2020-03-10T15:45:53.130Z] 15:45:53     INFO - TEST-INFO | leakcheck | tab leaked 2 Mutex
[task 2020-03-10T15:45:53.130Z] 15:45:53     INFO - TEST-INFO | leakcheck | tab leaked 9 nsStringBuffer
[task 2020-03-10T15:45:53.130Z] 15:45:53     INFO - WARNING | leakcheck | tab 232 bytes leaked (Mutex, nsStringBuffer)
[task 2020-03-10T15:45:53.130Z] 15:45:53     INFO - 
...
Flags: needinfo?(gsvelto)
Depends on: 1622452
Flags: needinfo?(gsvelto)
Blocks: 1623787
Attachment #9127824 - Attachment description: Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld → Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=froydnj

I rebased the patches after fixing bug 1622452 but I still have some failures to sort out: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1605f4f2324c9862c978b539a5ed48a95cd720ef

Attachment #9127824 - Attachment description: Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=froydnj → Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/578c8ba9598f Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld https://hg.mozilla.org/integration/autoland/rev/59503d3a702f Gather content processes' crash annotations at exception time instead of using IPC; r=froydnj

I'll look into this today. Naturally the failing test doesn't reproduce locally even if I run it a zillion times.

Flags: needinfo?(gsvelto)
Depends on: 1624467

The issue is truly puzzling, I first thought about a race and identified bug 1624467, but that's not what's causing the test to fail here. There seems to be an issue in breakpad exception handler that happens only in that test: the callback that writes out the crash annotations is being called correctly but then the exception handler fails to inform the crash generation server (or the crash generation server doesn't respond?). I'm still trying to figure out what's happening.

No longer depends on: 1624467

Alright, I figure it out. We're writing the annotations to a pipe before we signal the main process that it can grab a minidump and thus before the main process has a chance to pick it up. The main process will read from the pipe only after it has picked up the minidump. If we write too much data into the pipe the child process will block and never notify the main process and it'll be deadlocked. This only seems to happen in this particular test because it's sending more annotation than others thus filling up the pipe buffers. The straightforward solution should be to move writing to the pipe after the minidump has been taken but I have to double-check if that guarantees that the main process will read from the pipe afterwards.

Attachment #9127824 - Attachment description: Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld → Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=froydnj

On Windows content process minidump generation goes through the following
steps:

  • The child process exception handler invokes the filter callback. This is the
    ChildFPEFilter() function in nsExceptionHandler.cpp.
  • If this callback returns true then the child process sends a request to the
    crash generation server living in the main process and then waits.
  • The main process generates a minidump in response to this message.
  • Once the minidump has been generated the crash generation server calls the
    dump_callback callback with the minidump filename. This is the
    OnChildProcessDumpRequested() callback in nsExceptionHandler.cpp.
  • Once this callback returns the main process signals the child process that
    it has finished generating the dump.
  • The child process resumes execution in the exception handler and then
    terminates.

During this process we send crash annotations via a pipe from the child
process to the main process. The function responsible for this is
PrepareChildExceptionTimeAnnotations() in nsExceptionHandler.cpp and it's
called within ChildFPEFilter(). Because pipe writes are blocking if we send
enough annotations to fill the pipe buffer then the child process will get
stuck in ChildFPEFilter() and never request a dump from the main process. Thus
the content process will never shut down and the main process will never see
that it has crashed, it will only appear indefinitely stuck.

To solve this issue a few changes are necessary:

  • The crash annotations are not sent from the filter callback, they are sent
    from the minidump callback which we did not use before. This callback is
    executed in the child process after signaling the main process.
  • This is not enough to prevent the deadlock though, because the content
    process will still wait before the parent process has responded before
    invoking this new callback. The parent process on the other hand will
    generate the minidump and then call OnChildProcessDumpRequest() which will
    wait to read the annotations sent by the child process. Thus both processes
    would be stuck. To solve this other issue we change the order in which the
    parent process responds: instead of invoking the dump_callback first and the
    signaling the child process it does the opposite.

I've also added a new test that covers two separate issues solved by this
patch-set: that we don't deadlock when writing too much data (this issue) and
that we don't fail to send an annotation if it's too large (previously the
shared memory buffer was 16KiB in size so this annotation wouldn't fit).

Depends on D62586

Attachment #9127824 - Attachment description: Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=froydnj → Bug 1614933 - Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld

Thanks for the review Kris, let's try to land this again.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fbc017b1397e Ensure that glibc's lazy initializers run before we enable the content process sandbox on Linux; r=jld https://hg.mozilla.org/integration/autoland/rev/3b17ced47bd7 Gather content processes' crash annotations at exception time instead of using IPC; r=froydnj https://hg.mozilla.org/integration/autoland/rev/991fa6037162 Fix a deadlock in Breakpad crash generation on Windows; r=KrisWright
Regressions: 1628261
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Regressions: 1628399
Regressions: 1659185
Depends on: 1690606
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: