Closed
Bug 1457545
Opened 6 years ago
Closed 6 years ago
Mac Crash deadlock triggered by dlsym()/dlopen() deadlock
Categories
(Toolkit :: Crash Reporting, defect, P1)
Tracking
()
RESOLVED
FIXED
mozilla62
Tracking | Status | |
---|---|---|
firefox62 | --- | fixed |
People
(Reporter: haik, Assigned: haik)
References
(Blocks 1 open bug)
Details
Attachments
(4 files)
See also bugs 1457501 and 1395504 which are similar.
While running the test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py with a change to make the test wait indefinitely for the browser to cleanly exit (see attached) and the fix for 1457501, I hit test hangs where Firefox has exited but a content process is stuck in a hung state.
This problem is triggered by breakpad code in the exception handler thread running ExceptionHandler::WaitForMessage(). After we've finished generating a minidump, we call breakpad_exc_server() to send a response back to the exception server. That triggers a dlsym(3) call which blocks on a lock held by a suspended thread in the middle of a dlopen(3). The dlopen appears to hold a lock which is used to serialize dlsym() and dlopen() calls. Since the lock holder is suspended, it never releases the lock and causes deadlock in the crash handling. Ted tracked down the dlsym() blocking in Darwin/XNU code [1].
Here are some stacks from an instance of the problem.
Exception handler thread, blocked trying to call dlsym:
frame #0 libsystem_kernel.dylib`__psynch_mutexwait()
frame #1 libsystem_pthread.dylib`_pthread_mutex_lock_wait()
frame #2 libsystem_pthread.dylib`_pthread_mutex_lock_slow()
frame #3 libdyld.dylib`LockHelper::LockHelper()
frame #4 libdyld.dylib`dlsym()
frame #5 libsystem_kernel.dylib`internal_catch_exception_raise()
frame #6 libsystem_kernel.dylib`_Xexception_raise()
frame #7 libsystem_kernel.dylib`exc_server()
frame #8 XUL`google_breakpad::ExceptionHandler::WaitForMessage() at exception_handler.cc:207
frame #9 XUL`google_breakpad::ExceptionHandler::WaitForMessage() at exception_handler.cc:596
frame #10 libsystem_pthread.dylib`_pthread_body()
frame #11 libsystem_pthread.dylib`_pthread_start()
frame #12 libsystem_pthread.dylib`thread_start()
Main thread, suspended during a call to dlopen, presumably holding a libdyld serialization lock:
frame #0 dyld`stat64()
frame #1 dyld`dyld::my_stat()
frame #2 dyld`dyld::loadPhase5()
frame #3 dyld`dyld::loadPhase4()
frame #4 dyld`dyld::loadPhase3()
frame #5 dyld`dyld::loadPhase1()
frame #6 dyld`dyld::loadPhase0()
frame #7 dyld`dyld::load()
frame #8 dyld`dlopen()
frame #9 libdyld.dylib`dlopen()
frame #10 libsystem_sandbox.dylib`sandbox_init_with_parameters()
frame #11 XUL`mozilla::StartMacSandbox() at Sandbox.mm:282
frame #12 XUL`mozilla::dom::ContentChild::RecvSetProcessSandbox() at ContentChild.cpp:1711
frame #13 XUL`mozilla::dom::ContentChild::RecvSetProcessSandbox() at ContentChild.cpp:1745
frame #14 XUL`mozilla::dom::PContentChild::OnMessageReceived() at PContentChild.cpp:5610
frame #15 XUL`mozilla::ipc::MessageChannel::DispatchAsyncMessage() at MessageChannel.cpp:2141
frame #16 XUL`mozilla::ipc::MessageChannel::DispatchMessage() at MessageChannel.cpp:2071
frame #17 XUL`mozilla::ipc::MessageChannel::MessageTask::Run() at MessageChannel.cpp:1950
frame #18 XUL`nsThread::ProcessNextEvent() at nsThread.cpp:1096
frame #19 XUL`NS_ProcessNextEvent() at nsThreadUtils.cpp:519
frame #20 XUL`mozilla::ipc::MessagePump::Run() at MessagePump.cpp:97
frame #21 XUL`MessageLoop::Run() at message_loop.cc:326
frame #22 XUL`MessageLoop::Run() at message_loop.cc:319
frame #23 XUL`MessageLoop::Run() at message_loop.cc:299
frame #24 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:888
frame #25 XUL`MessageLoop::Run() at message_loop.cc:326
frame #26 XUL`MessageLoop::Run() at message_loop.cc:319
frame #27 XUL`MessageLoop::Run() at message_loop.cc:299
frame #28 XUL`XRE_InitChildProcess() at nsEmbedFunctions.cpp:719
frame #29 plugin-container`main() at plugin-container.cpp:50
frame #30 plugin-container`main() at MozillaRuntimeMain.cpp:25
frame #31 plugin-container`start()
[1] https://github.com/apple/darwin-xnu/blob/5394bb038891708cd4ba748da79b90a33b19f82e/libsyscall/mach/exc_catcher.c#L55
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → haftandilian
Priority: -- → P1
Assignee | ||
Comment 3•6 years ago
|
||
The posted changes resume threads before calling breakpad_exc_server(). Needs analysis is needed to determine if this is the best fix.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 4•6 years ago
|
||
I'm looking into ways to fix this that don't require resuming threads earlier. And I hope to land this early in 62.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 8•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #7)
> Haik, do you have an update for us?
Sorry, some higher priority things came up. I need to study the breakpad implementation a bit more. The posted fix worked well in my tests, but I'm not sure it's the right fix yet. I'll aim to get to this by the end of the week.
Flags: needinfo?(haftandilian)
Assignee | ||
Comment 9•6 years ago
|
||
After looking more at the breakpad code, I don't see a problem with resuming threads before calling breakpad_exc_server() after we've attempted to generate the minidump. At this point, we've already called WriteMinidumpWithException(exit_after_write=true) and therefore attempted to message the parent to write the minidump. Since WriteMinidumpWithException() returned instead of exiting, it must be because its call to crash_generation_client_->RequestDumpForException() fails which may mean we failed to generate a minidump. The call to breakpad_exc_server() ends up forwarding the exception which should result in the process being killed anyway.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 13•6 years ago
|
||
Comment 14•6 years ago
|
||
mozreview-review |
Comment on attachment 8971682 [details]
Bug 1457545 - Part 3 - Re-enable crash tests for Mac opt builds
https://reviewboard.mozilla.org/r/240456/#review251212
Hurray!! Thank you so much!
Attachment #8971682 -
Flags: review?(hskupin) → review+
Comment 15•6 years ago
|
||
mozreview-review |
Comment on attachment 8977763 [details]
Bug 1457545 - Part 2 - Fix indendation (whitespace change only)
https://reviewboard.mozilla.org/r/245250/#review251260
LGTM
Attachment #8977763 -
Flags: review?(gsvelto) → review+
Comment 16•6 years ago
|
||
mozreview-review |
Comment on attachment 8971681 [details]
Bug 1457545 - Part 1 - Mac Crash deadlock triggered by dlsym()/dlopen() deadlock
https://reviewboard.mozilla.org/r/240454/#review251258
Very nice catch!
Attachment #8971681 -
Flags: review?(gsvelto) → review+
Comment 17•6 years ago
|
||
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c57a74c70332
Part 1 - Mac Crash deadlock triggered by dlsym()/dlopen() deadlock r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/a532236ec4a8
Part 2 - Fix indendation (whitespace change only) r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/e87851499d8f
Part 3 - Re-enable crash tests for Mac opt builds r=whimboo
Comment 18•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/c57a74c70332
https://hg.mozilla.org/mozilla-central/rev/a532236ec4a8
https://hg.mozilla.org/mozilla-central/rev/e87851499d8f
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox62:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
You need to log in
before you can comment on or make changes to this bug.
Description
•