Closed Bug 1457501 Opened 6 years ago Closed 6 years ago

Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition

Categories

(Toolkit :: Crash Reporting, defect, P1)

57 Branch
Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: haik, Assigned: haik)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

While running the test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py with a change to make the test wait indefinitely for the browser to cleanly exit (see attached), I hit test hangs where Firefox has exited but a content process is stuck in a hung state. This hang is a deadlock similar to bug 1395504, but in a breakpad callback. The problem is that in CrashReporter::GetFlatThreadAnnotation() which is run from the exception handler thread when all other threads have been suspended, we acquire a lock. Acquiring the lock triggers a memory allocation which blocks on a memory allocator lock. The lock is held by another thread which is suspended, hence the deadlock. And without the allocation, we could still deadlock if another thread held the mutex. So we shouldn't try to get the lock in exception context on Mac. This bug is specific to Mac because Windows and Linux implementations of breakpad work differently and don't suspend all threads while still trying to run in the child. Here are some stacks from an instance of the problem. Exception handler thread, blocked trying to allocate memory: thread #5 frame #0 libsystem_kernel.dylib`syscall_thread_switch() frame #1 libsystem_platform.dylib`_OSSpinLockLockYield() frame #2 libmozglue.dylib`Mutex::Lock() at Mutex.h:66 frame #3 libmozglue.dylib`AutoLock<Mutex>::AutoLock() at Mutex.h:129 frame #4 libmozglue.dylib`AutoLock<Mutex>::AutoLock() at Mutex.h:127 frame #5 libmozglue.dylib`arena_t::MallocSmall() at mozjemalloc.cpp:2940 frame #6 libmozglue.dylib`arena_t::Malloc() at mozjemalloc.cpp:3000 frame #7 libmozglue.dylib`BaseAllocator::calloc() at mozjemalloc.cpp:4179 frame #8 libmozglue.dylib`Allocator<MozJemallocBase>::calloc() at malloc_decls.h:38 frame #9 libmozglue.dylib`Allocator<ReplaceMallocBase>::calloc() at malloc_decls.h:38 frame #10 libmozglue.dylib`::calloc(size_t, size_t)() at malloc_decls.h:38 frame #11 libnss3.dylib`PR_Calloc() at prmem.c:443 frame #12 libnss3.dylib`pt_AttachThread() at ptthread.c:265 frame #13 libnss3.dylib`PR_GetCurrentThread() at ptthread.c:622 frame #14 libnss3.dylib`PR_GetThreadPrivate() at prtpd.c:204 frame #15 XUL`mozilla::BlockingResourceBase::ResourceChainFront() at BlockingResourceBase.h:157 frame #16 XUL`mozilla::BlockingResourceBase::CheckAcquire() at BlockingResourceBase.cpp:279 frame #17 XUL`mozilla::OffTheBooksMutex::Lock() at BlockingResourceBase.cpp:384 frame #18 XUL`mozilla::StaticMutex::Lock() at StaticMutex.h:44 frame #19 XUL`mozilla::BaseAutoLock<mozilla::StaticMutex>::BaseAutoLock() at Mutex.h:166 frame #20 XUL`mozilla::BaseAutoLock<mozilla::StaticMutex>::BaseAutoLock() at Mutex.h:163 frame #21 XUL`CrashReporter::GetFlatThreadAnnotation() at ThreadAnnotation.cpp:235 frame #22 XUL`CrashReporter::PrepareChildExceptionTimeAnnotations() at nsExceptionHandler.cpp:1300 frame #23 XUL`CrashReporter::ChildFilter() at nsExceptionHandler.cpp:1426 frame #24 XUL`google_breakpad::ExceptionHandler::WriteMinidumpWithException() at exception_handler.cc:373 frame #25 XUL`google_breakpad::ExceptionHandler::WaitForMessage() at exception_handler.cc:573 frame #26 libsystem_pthread.dylib`_pthread_body() frame #27 libsystem_pthread.dylib`_pthread_start() frame #28 libsystem_pthread.dylib`thread_start() Main thread, suspended by exception handler, holding allocator lock: thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP frame #0 libmozglue.dylib`RedBlackTree<arena_chunk_map_t, ArenaRunTreeTrait>::TreeNode::Left() at rb.h:172 frame #1 libmozglue.dylib`RedBlackTree<arena_chunk_map_t, ArenaRunTreeTrait>::Remove() at rb.h:577 frame #2 libmozglue.dylib`RedBlackTree<arena_chunk_map_t, ArenaRunTreeTrait>::Remove() at rb.h:144 frame #3 libmozglue.dylib`arena_t::GetNonFullBinRun() at mozjemalloc.cpp:2790 frame #4 libmozglue.dylib`arena_t::MallocSmall() at mozjemalloc.cpp:2943 frame #5 libmozglue.dylib`arena_t::Malloc() at mozjemalloc.cpp:3000 frame #6 libmozglue.dylib`BaseAllocator::calloc() at mozjemalloc.cpp:4179 frame #7 libmozglue.dylib`Allocator<MozJemallocBase>::calloc() at malloc_decls.h:38 frame #8 libmozglue.dylib`Allocator<ReplaceMallocBase>::calloc() at malloc_decls.h:38 frame #9 libmozglue.dylib`::calloc(size_t, size_t)() at malloc_decls.h:38 frame #10 libmozglue.dylib`zone_calloc() at zone.c:143 frame #11 libsystem_malloc.dylib`malloc_zone_calloc() frame #12 libsystem_malloc.dylib`calloc() frame #13 ColorSync`CMMMemMgr::New() frame #14 ColorSync`CMMBase::NewInternal() frame #15 ColorSync`CMMProfile::MakeTag() frame #16 ColorSync`CMMProfile::GetTag() frame #17 ColorSync`CMMProfile::Usable() frame #18 ColorSync`DoValidateProfile() frame #19 ColorSync`AppleCMMValidateProfile() frame #20 ColorSync`create() frame #21 ColorSync`ColorSyncVerifySRGBData() frame #22 CoreGraphics`CGColorSpaceCreateWithICCData() frame #23 SkyLight`CGSColorSpaceRegistryCopyColorSpace() frame #24 SkyLight`SLSCopyDisplayColorSpace() frame #25 AppKit`-[NSCGSDisplay initWithDisplayID:flipOffset:]() frame #26 AppKit`___NSCGSCreateDisplaysFromDisplayIDsUsingPredicate_block_invoke() frame #27 AppKit`_NSCGSCreateArrayUsingBlock() frame #28 AppKit`+[NSCGSDisplay uniqueDisplays]() frame #29 AppKit`_NSScreenConfigurationUpdateSharedInfoForReason() frame #30 AppKit`___NSScreenConfigurationEnsureInitialUpdateOccurred_block_invoke() frame #31 libdispatch.dylib`_dispatch_client_callout() frame #32 libdispatch.dylib`dispatch_once_f() frame #33 AppKit`-[NSApplication(ScreenHandling) _registerForDisplayChangedNotifications]() frame #34 AppKit`-[NSApplication init]() frame #35 AppKit`+[NSApplication sharedApplication]() frame #36 XUL`nsAppShell::Init() at nsAppShell.mm:314 frame #37 XUL`nsAppShellInit() at nsAppShellSingleton.h:45 frame #38 XUL`nsComponentManagerImpl::KnownModule::Load() at nsComponentManager.cpp:726 frame #39 XUL`nsFactoryEntry::GetFactory() at nsComponentManager.cpp:1748 frame #40 XUL`nsComponentManagerImpl::CreateInstance() at nsComponentManager.cpp:964 frame #41 XUL`nsComponentManagerImpl::GetService() at nsComponentManager.cpp:1209 frame #42 XUL`CallGetService() at nsComponentManagerUtils.cpp:56 frame #43 XUL`nsGetServiceByCID::operator()() at nsComponentManagerUtils.cpp:253 frame #44 XUL`nsCOMPtr<nsIAppShell>::assign_from_gs_cid() at nsCOMPtr.h:1198 frame #45 XUL`nsCOMPtr<nsIAppShell>::nsCOMPtr() at nsCOMPtr.h:559 frame #46 XUL`nsCOMPtr<nsIAppShell>::nsCOMPtr() at nsCOMPtr.h:556 frame #47 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:850 frame #48 XUL`mozilla::ipc::MessagePumpForChildProcess::Run() at MessagePump.cpp:269 frame #49 XUL`MessageLoop::RunInternal() at message_loop.cc:326 frame #50 XUL`MessageLoop::RunHandler() at message_loop.cc:319 frame #51 XUL`MessageLoop::Run() at message_loop.cc:299 frame #52 XUL`XRE_InitChildProcess() at nsEmbedFunctions.cpp:719 frame #53 XUL`mozilla::BootstrapImpl::XRE_InitChildProcess() at Bootstrap.cpp:69 frame #54 plugin-container`content_process_main() at plugin-container.cpp:50 frame #55 plugin-container`main() at MozillaRuntimeMain.cpp:25 frame #56 plugin-container`start()
Assignee: nobody → haftandilian
Priority: -- → P1
More specifically: it's not safe to malloc in the exception handler, because one of the things that can cause a crash is heap corruption, so trying to malloc could cause us to crash again! It's unfortunate that `PR_GetCurrentThread` can malloc. I wonder if we could change BlockingResourceBase to use a `static MOZ_THREAD_LOCAL(BlockingResourceBase*) sResourceAcqnChainFront` member instead of `PR_{Set,Get}ThreadPrivate`? That would involve changing this: https://dxr.mozilla.org/mozilla-central/rev/63a0e2f626febb98d87d2543955ab99a653654ff/xpcom/threads/BlockingResourceBase.h#297 To look more like this: https://searchfox.org/mozilla-central/rev/78dbe34925f04975f16cb9a5d4938be714d41897/dom/script/ScriptSettings.cpp#29 and replacing the `PR_{Get,Set}CurrentThread` usage with `sResourceAcqnChainFront.{get,set}`.
I'm not super familiar with how our thread local stuff works, but we're using `MOZ_THREAD_LOCAL` in plenty of other places and it seems like the thing we need here.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3) > I'm not super familiar with how our thread local stuff works, but we're > using `MOZ_THREAD_LOCAL` in plenty of other places and it seems like the > thing we need here. Even if acquiring the StaticMutex didn't trigger allocations, it's still not safe to do in exception context because a suspended thread may hold the lock. That would cause deadlock because the exception thread would be waiting for a suspended thread to release a lock which would never happen.
Comment on attachment 8971644 [details] Bug 1457501 - Part 1 - Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition https://reviewboard.mozilla.org/r/240414/#review246608 Looks good to me. While I don't like seeing more platform-specific code being added to these code paths I don't see an alternative method of fixing this.
Attachment #8971644 - Flags: review?(gsvelto) → review+
Comment on attachment 8971645 [details] Bug 1457501 - Part 2 - Remove unused type DeleteWithLock https://reviewboard.mozilla.org/r/240416/#review246610 Nice catch, the code that used this was backed out and never reintroduced. LGTM.
Attachment #8971645 - Flags: review?(gsvelto) → review+
Pushed by haftandilian@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bba71ee5fb46 Part 1 - Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition r=gsvelto https://hg.mozilla.org/integration/autoland/rev/911c930bc055 Part 2 - Remove unused type DeleteWithLock r=gsvelto
Backed out 2 changesets (bug 1457501) for build bustage. CLOSED TREE Log: https://treeherder.mozilla.org/logviewer.html#?job_id=176362508&repo=autoland&lineNumber=24665 [task 2018-04-30T22:45:34.310Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:82:8: error: 'MacCrashReporterLock' does not name a type [task 2018-04-30T22:45:34.310Z] 22:45:34 INFO - static MacCrashReporterLock sMutex; [task 2018-04-30T22:45:34.310Z] 22:45:34 INFO - ^~~~~~~~~~~~~~~~~~~~ [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::{anonymous}::ThreadLocalDestructor(void*)': [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:210:30: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - CrashReporterAutoLock lock(sMutex); [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In destructor 'CrashReporter::{anonymous}::ThreadAnnotationSpan::~ThreadAnnotationSpan()': [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:222:3: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - sMutex.AssertCurrentThreadOwns(); [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::InitThreadAnnotation()': [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:233:30: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - CrashReporterAutoLock lock(sMutex); [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::SetCurrentThreadName(const char*)': [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:258:30: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - CrashReporterAutoLock lock(sMutex); [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.311Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::GetFlatThreadAnnotation(const std::function<void(const char*)>&, bool)': [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:294:5: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - sMutex.Lock(); [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:305:5: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - sMutex.Unlock(); [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::ShutdownThreadAnnotation()': [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:311:30: error: 'sMutex' was not declared in this scope [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - CrashReporterAutoLock lock(sMutex); [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - ^~~~~~ [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - At global scope: [task 2018-04-30T22:45:34.312Z] 22:45:34 INFO - cc1plus: error: unrecognized command line option '-Wno-implicit-fallthrough' [-Werror] [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - cc1plus: all warnings being treated as errors [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - /builds/worker/workspace/build/src/config/rules.mk:1024: recipe for target 'Unified_cpp_crashreporter0.o' failed [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - make[4]: *** [Unified_cpp_crashreporter0.o] Error 1 [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/toolkit/crashreporter' [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - /builds/worker/workspace/build/src/config/recurse.mk:73: recipe for target 'toolkit/crashreporter/target' failed [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - make[3]: *** [toolkit/crashreporter/target] Error 2 [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - make[3]: *** Waiting for unfinished jobs.... [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/media/mtransport/third_party/nICEr/nicer_nicer' [task 2018-04-30T22:45:34.313Z] 22:45:34 INFO - make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/media/mtransport/third_party/nICEr/nicer_nicer' Push with fails: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=911c930bc0558545ef3b49294b81fa3ec0237b9a Backout: https://hg.mozilla.org/integration/autoland/rev/072c882dccce90b2b2b3e13442dab17752855947
Flags: needinfo?(haftandilian)
(In reply to Dorel Luca [:dluca] from comment #12) > Backed out 2 changesets (bug 1457501) for build bustage. CLOSED TREE Sorry, this fails to compile on Linux due to a typo. The fix is below. I'll fix this and test the build on Linux/Mac/Windows and then re-land. $ hg diff diff --git a/toolkit/crashreporter/ThreadAnnotation.cpp b/toolkit/crashreporter/ThreadAnnotation.cpp --- a/toolkit/crashreporter/ThreadAnnotation.cpp +++ b/toolkit/crashreporter/ThreadAnnotation.cpp @@ -74,17 +74,17 @@ typedef mozilla::BaseAutoLock<MacCrashRe typedef MacCrashReporterLock CrashReporterLockType; #else /* !XP_MACOSX */ // Use StaticMutex for locking typedef StaticMutexAutoLock CrashReporterAutoLock; typedef StaticMutex CrashReporterLockType; #endif /* XP_MACOSX */ // Protects access to sInitialized and sThreadAnnotations. -static MacCrashReporterLock sMutex; +static CrashReporterLockType sMutex;
Flags: needinfo?(haftandilian)
hg error in cmd: hg pull upstream: pulling from https://hg.mozilla.org/integration/autoland searching for changes abort: HTTP Error 500: Internal Server Error
Pushed by haftandilian@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c287079aaf7f Part 1 - Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition r=gsvelto https://hg.mozilla.org/integration/autoland/rev/e4216d889836 Part 2 - Remove unused type DeleteWithLock r=gsvelto
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: