Open Bug 1461353 Opened 7 years ago Updated 2 years ago

Startup hang in nsIWidget::Create on Windows 10

Categories

(Core :: Widget: Win32, defect, P3)

defect

Tracking

()

People

(Reporter: gbrown, Unassigned)

References

(Blocks 1 open bug)

Details

Bug 1414495 records intermittent failures in tests caused by "no output timeouts": when the test harness times out because tests have not reported any results for several minutes. For many weeks, frequent intermittent failures in that bug have been apparent startup hangs in Windows 10 mochitest tasks. Those hangs appear to be on browser startup, well before any tests are started. Screenshots taken at the time of the timeout show no Firefox window. On timeout, the harness kills the browser and almost always successfully collects a minidump; however, the crash reports have rarely been useful, with no Mozilla code in the crashing thread stack. Loading some minidumps in Visual Studio pointed to bug 1373279, which was recently fixed. Now some - perhaps 30% - of the Windows startup hang reports have a stack that includes nsIWidget::Create. Here are some examples: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=178273429&lineNumber=7613 https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178251739&lineNumber=5336 https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178251664&lineNumber=5808 https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=178251659&lineNumber=10585 https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=178333902&lineNumber=2879-2910 13:26:54 INFO - 16 xul.dll!nsIWidget::Create(nsIWidget *,void *,mozilla::gfx::IntRectTyped<mozilla::DesktopPixel> const &,nsWidgetInitData *) [nsIWidget.h:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 458 + 0x22] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc3f0 13:26:54 INFO - rip = 0x00007ff869653a2a 13:26:54 INFO - Found by: stack scanning 13:26:54 INFO - 17 xul.dll!nsWebShellWindow::Initialize(nsIXULWindow *,nsIXULWindow *,nsIURI *,int,int,bool,nsITabParent *,mozIDOMWindowProxy *,nsWidgetInitData &) [nsWebShellWindow.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 177 + 0x1a] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc450 13:26:54 INFO - rip = 0x00007ff86965328e 13:26:54 INFO - Found by: call frame info 13:26:54 INFO - 18 xul.dll!nsAppShellService::JustCreateTopWindow(nsIXULWindow *,nsIURI *,unsigned int,int,int,bool,nsITabParent *,mozIDOMWindowProxy *,nsWebShellWindow * *) [nsAppShellService.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 736 + 0x57] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc630 13:26:54 INFO - rip = 0x00007ff869652cff 13:26:54 INFO - Found by: call frame info 13:26:54 INFO - 19 xul.dll!nsAppShellService::CreateTopLevelWindow(nsIXULWindow *,nsIURI *,unsigned int,int,int,nsITabParent *,mozIDOMWindowProxy *,nsIXULWindow * *) [nsAppShellService.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 200 + 0x56] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc7a0 13:26:54 INFO - rip = 0x00007ff86953cb1c 13:26:54 INFO - Found by: call frame info 13:26:54 INFO - 20 xul.dll!nsAppStartup::CreateChromeWindow2(nsIWebBrowserChrome *,unsigned int,nsITabParent *,mozIDOMWindowProxy *,unsigned __int64,bool *,nsIWebBrowserChrome * *) [nsAppStartup.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 680 + 0x3f] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc820 13:26:54 INFO - rip = 0x00007ff86953ca03 13:26:54 INFO - Found by: call frame info 13:26:54 INFO - 21 xul.dll!nsWindowWatcher::CreateChromeWindow(nsTSubstring<char> const &,nsIWebBrowserChrome *,unsigned int,nsITabParent *,mozIDOMWindowProxy *,unsigned __int64,nsIWebBrowserChrome * *) [nsWindowWatcher.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 469 + 0x3d] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc8a0 13:26:54 INFO - rip = 0x00007ff86953c8c4 13:26:54 INFO - Found by: call frame info 13:26:54 INFO - 22 xul.dll!nsWindowWatcher::OpenWindowInternal(mozIDOMWindowProxy *,char const *,char const *,char const *,bool,bool,bool,nsIArray *,bool,bool,nsIDocShellLoadInfo *,mozIDOMWindowProxy * *) [nsWindowWatcher.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 939 + 0x44] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfc910 13:26:54 INFO - rip = 0x00007ff86917bc63 13:26:54 INFO - Found by: call frame info 13:26:54 INFO - 23 xul.dll!nsWindowWatcher::OpenWindow(mozIDOMWindowProxy *,char const *,char const *,char const *,nsISupports *,mozIDOMWindowProxy * *) [nsWindowWatcher.cpp:091cdd176a11741a6d1c5d0d90f5d72b4c146472 : 328 + 0x4e] 13:26:54 INFO - rbp = 0x00007ff88ba50100 rsp = 0x0000006555bfcc50 13:26:54 INFO - rip = 0x00007ff86968afef 13:26:54 INFO - Found by: call frame info
:jimm - It would be great if you could find someone to look into this. These startup hangs affect a wide range of Windows 10 test jobs and have been on-going for many weeks; this stack looks like the most promising way of addressing it.
Blocks: 1414495
Flags: needinfo?(jmathies)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2aa248746ceed45ac2cf1ecc4e9caa92706c2ac7 reproduces the hang with additional logging: https://hg.mozilla.org/try/rev/4ff65cda51295d3e7503eebd98a8be650e0385ac#l1.12 https://treeherder.mozilla.org/logviewer.html#?job_id=178817719&repo=try&lineNumber=6392-6395 18:33:17 INFO - GECKO(6308) | ZZZ calling WTSRegisterSessionNotification 00000000001D00EA 18:36:17 INFO - runtests.py | Waiting for browser... 18:39:27 INFO - Buffered messages finished 18:39:27 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output suggesting a hang in WTSRegisterSessionNotification(). Same in: https://treeherder.mozilla.org/logviewer.html#?job_id=178817592&repo=try&lineNumber=14110-14113
Interesting here is also the comment above the line of code which seems to hang: > // Starting with Windows XP, a process always runs within a terminal services > // session. In order to play nicely with RDP, fast user switching, and the > // lock screen, we should be handling WM_WTSSESSION_CHANGE. We must register > // our HWND in order to receive this message. > DebugOnly<BOOL> wtsRegistered = ::WTSRegisterSessionNotification(mWnd, > NOTIFY_FOR_THIS_SESSION); Given that this is only a Windows 10 issue, I wonder what's different compared to our Windows 7 AMIs. Rob, any idea?
Flags: needinfo?(rthijssen)
(In reply to Henrik Skupin (:whimboo) from comment #3) > Given that this is only a Windows 10 issue, I wonder what's different > compared to our Windows 7 AMIs. Rob, any idea? well, for starters the operating system is quite different. https://stackoverflow.com/questions/40854000/wtsregistersessionnotification-causing-process-to-hang https://stackoverflow.com/questions/39904632/why-is-exitprocess-necessary-under-win32-when-you-can-use-a-ret
Flags: needinfo?(rthijssen)
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > central&job_id=178273429&lineNumber=7613 > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > inbound&job_id=178251739&lineNumber=5336 > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > inbound&job_id=178251664&lineNumber=5808 > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > inbound&job_id=178251659&lineNumber=10585 All of these appear to be related to a delayed DLL load that seems to be taking a long time. Jank here is common since it involves disk access usually but we shouldn't hang indefinitely. Can you post a minidump for one of these hangs here we can inspect?
Flags: needinfo?(jmathies) → needinfo?(gbrown)
(My first thought is the graphic test window btw, but I have no way to confirm that yet.
Flags: needinfo?(gbrown)
(In reply to Jim Mathies [:jimm] from comment #5) > All of these appear to be related to a delayed DLL load that seems to be > taking a long time. Jank here is common since it involves disk access > usually but we shouldn't hang indefinitely. Can you post a minidump for one > of these hangs here we can inspect? We are definitely waiting over 5 minutes without seeing any progress in the test logs...not just jank. Did you notice that the frame just after nsIWidget::Create is often (always?) "0x1": https://treeherder.mozilla.org/logviewer.html#?job_id=180534849&repo=mozilla-central&lineNumber=7630-7645 12:02:22 INFO - 14 wtsapi32.dll!PsspCaptureVaSpaceInformation + 0x35d 12:02:22 INFO - rbx = 0x0000000000000001 rbp = 0x00007ff975810100 12:02:22 INFO - rsp = 0x00000017af3fca90 r12 = 0x0000000000000003 12:02:22 INFO - r13 = 0x0000000000000040 r14 = 0x0000000000000001 12:02:22 INFO - r15 = 0x0000000000000000 rip = 0x00007ff9726a21d1 12:02:22 INFO - Found by: call frame info 12:02:22 INFO - 15 0x1 12:02:22 INFO - rbx = 0x0000000000000001 rbp = 0x00007ff975810100 12:02:22 INFO - rsp = 0x00000017af3fcbf0 r12 = 0x0000000000000003 12:02:22 INFO - r13 = 0x0000000000000040 r14 = 0x0000000000000001 12:02:22 INFO - r15 = 0x0000000000000000 rip = 0x0000000000000001 12:02:22 INFO - Found by: call frame info 12:02:22 INFO - 16 xul.dll!nsIWidget::Create(nsIWidget *,void *,mozilla::gfx::IntRectTyped<mozilla::DesktopPixel> const &,nsWidgetInitData *) [nsIWidget.h:a466172aed4bc2afc21169b749b8068a4b98c93f : 459 + 0x22] 12:02:22 INFO - rbp = 0x00007ff975810100 rsp = 0x00000017af3fcc20 12:02:22 INFO - rip = 0x00007ff953986fca 12:02:22 INFO - Found by: stack scanning Is that just a curiosity of the crash report, or are we actually calling into a bad address?
:jimm - See comments 7 through 10. Also, more generally, can you find some time for this bug? It is one of the most common Windows test failures.
Flags: needinfo?(jmathies)
(In reply to Geoff Brown [:gbrown] from comment #1) > :jimm - It would be great if you could find someone to look into this. These > startup hangs affect a wide range of Windows 10 test jobs and have been > on-going for many weeks; this stack looks like the most promising way of > addressing it. Looking for someone with cycles, will assign when I do.
(In reply to Geoff Brown [:gbrown] from comment #8) > And here's a new example, from today: > > https://treeherder.mozilla.org/logviewer.html#?job_id=180534849&repo=mozilla- > central&lineNumber=7642-7673 > > (very similar crash report) > > Minidump: > https://queue.taskcluster.net/v1/task/DB6POVKrRau_X_X8OWmm9w/runs/0/ > artifacts/public/test_info/c7728861-ab34-40ff-9cc6-8865d9a3b06c.dmp Hey Adam, would you mind taking ten minutes and poking at this see if you can find anything obvious going on here.
Flags: needinfo?(jmathies) → needinfo?(agashlin)
tldr: We don't prevent stack walking during DLL load/unload unless the profiler is enabled, here we are walking due to Background Hang Monitor and getting deadlocked. --- The looks like the same symptoms as bug 1349444. From https://treeherder.mozilla.org/logviewer.html#?job_id=183012047&repo=mozilla-inbound&lineNumber=12237 in Comment 9: Thread 3020 (main): ntdll.dll!NtProtectVirtualMemory() ntdll.dll!LdrpChangeMrdataProtection() ntdll.dll!LdrProtectMrdata() ntdll.dll!RtlInsertInvertedFunctionTable() ntdll.dll!LdrpProcessMappedModule() ntdll.dll!LdrpMapDllWithSectionHandle() ntdll.dll!LdrpMapDllNtFileName() ntdll.dll!LdrpMapDllSearchPath() ntdll.dll!LdrpProcessWork() ntdll.dll!LdrpLoadDllInternal() ntdll.dll!LdrpLoadForwardedDll() ntdll.dll!LdrpGetDelayloadExportDll() ntdll.dll!LdrpHandleProtectedDelayload() ntdll.dll!LdrResolveDelayLoadedAPI() wtsapi32.dll!__delayLoadHelper2() wtsapi32.dll!__tailMerge_WINSTA_dll() wtsapi32.dll!WTSRegisterSessionNotification() xul.dll!nsWindow::Create(nsIWidget * aParent, void * aNativeParent, const mozilla::gfx::IntRectTyped<mozilla::LayoutDevicePixel> & aRect, nsWidgetInitData * aInitData) Line 928 RtlInsertInvertedFunctionTable does RtlAcquireSRWLockExclusive on LdrpInvertedFunctionTableSRWLock before the call to LdrProtectMrdata and doesn't release it until later. Thread 6612: ntdll.dll!NtWaitForAlertByThreadId() ntdll.dll!RtlpxLookupFunctionTable() ntdll.dll!RtlLookupFunctionEntry() mozglue.dll!WalkStackMain64(WalkStackData * aData) Line 317 mozglue.dll!MozStackWalkThread(void(*)(unsigned int, void *, void *, void *) aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, void * aThread, _CONTEXT * aContext) Line 443 xul.dll!DoMozStackWalkBacktrace(const mozilla::BaseAutoLock<PSMutex> & aLock, const RegisteredThread & aRegisteredThread, const Registers & aRegs, NativeStack & aNativeStack) Line 1124 [Inline Frame] xul.dll!profiler_suspend_and_sample_thread::__l9::<lambda_9998a5e560acb364e76945bccdc5e530>::operator()(const Registers &) Line 3674 [Inline Frame] xul.dll!Sampler::SuspendAndSampleAndResumeThread(const mozilla::BaseAutoLock<PSMutex> &) Line 177 xul.dll!profiler_suspend_and_sample_thread(int aThreadId, unsigned int aFeatures, ProfilerStackCollector & aCollector, bool aSampleNative) Line 3661 xul.dll!mozilla::ThreadStackHelper::GetStack(mozilla::HangStack & aStack, nsTSubstring<char> & aRunnableName, bool aStackWalk) Line 148 xul.dll!mozilla::BackgroundHangManager::RunMonitorThread() Line 395 xul.dll!mozilla::BackgroundHangManager::MonitorThread(void * aData) Line 80 profiler_suspend_and_sample_thread has aThreadId = 3020, so it has suspended the main thread. The NtWaitForAlertByThreadId is done on LdrpInvertedFunctionTableSRWLock. So it seems like we're detecting a hang (maybe due to the DLL load taking a while?) and trying to walk the stack, but that's turning into a permanent hang as we were in the middle of DLL loading. In bug 1349444 we added a hook to prevent stack walking during LdrUnloadDll, in this case we're instead in LdrResolveDelayLoadedAPI, but that should already have a hook in place since bug 1355559. But it looks like since bug 1378442 we only install the hooks if the profiler is enabled, and in this run ActivePS::sInstance = NULL. So Background Hang Monitor is running without a safety net. It sounds like we just need to do InitializeWin64ProfilerHooks in BHR init, but I'm reluctant to do that since we explicitly disabled it in bug 1378442.
Flags: needinfo?(agashlin)
dmajor, what do you think about enabling the profiler hooks for background hang reporter?
Flags: needinfo?(dmajor)
I've paged out all the profiler code I worked on. Could you give me some context/backstory around the question?
Flags: needinfo?(dmajor)
Ok I think I managed to page some of this back in... having the hooks installed for typical users in the wild had a lot of bad interactions with AVs and other software that competes to hook those same functions. I moved the hooks to the profiler so that they'd basically only run on the Quantum Flow team's dev machines, which tend to be pretty free of such external software. If you are wanting to re-enable these hooks in the wild, the results are probably not going to be pretty.
Wrote this up before seeing comment 17: It's dangerous to walk the stack of a thread while it is loading or unloading DLLs, as it could be suspended while holding a lock exclusively on some internal loader structures; when the walking thread does RtlLookupFunctionEntry it tries to get the same lock and we deadlock. (It looks like a related issue with the JIT was first diagnosed in bug 1263595 comment 12.) The workaround from bug 1349444 (LdrUnloadDll), bug 1354611 (LdrLoadDll), and bug 1355559 (LdrResolveDelayLoadedAPI) is to not allow stack walking while running the Ldr* APIs, this is accomplished by wrapping the calls in a AutoSuppressStackWalking guard [1]. The wrapping is done by installing hooks on the APIs (or adding to the one already in place for LdrLoadDll). Originally this was added to EnsureWalkThreadReady for LdrUnloadDll and LdrResolveDelayLoadedAPI, but it was moved to DllBlocklist_Initialize in bug 1372375 (to avoid taking locks during the profiler's critical section), and then to locked_profiler_start in bug 1378442 (to mitigate shutdown regressions in non-profiling sessions caused by the previous move). Moving to locked_profiler_start means that the LdrUnloadDll and LdrResolveDelayLoadedAPI hooks are now only installed when profiling is enabled, but Background Hang Monitor (not "Reporter", oops) also does stack walking, which seems to be the cause of the current bug. It seems reasonable to run InitializeWin64ProfilerHooks when setting up Background Hang Monitor. But I think BHM is used on all Nightly builds, and I'm wary of reviving the yet-unsolved shutdown regressions mentioned in bug 1378442. In bug 1359507 the mechanism for disabling stack walking was simplified, I was hoping that might have solved the shutdown issues, but it looks like you landed that (03 May 2017) [2] before moving the hook installation to mitigate the shutdown issues (05 July) [3], so I assume that whatever was happening on shutdown is still happening. [1] https://searchfox.org/mozilla-central/rev/d0a41d2e7770fc00df7844d5f840067cc35ba26f/tools/profiler/core/platform-win32.cpp#303-354 [2] https://hg.mozilla.org/mozilla-central/rev/ab0c15933c74 [3] https://hg.mozilla.org/mozilla-central/rev/a199c96110ca
(In reply to David Major [:dmajor] from comment #17) > Ok I think I managed to page some of this back in... having the hooks > installed for typical users in the wild had a lot of bad interactions with > AVs and other software that competes to hook those same functions. I moved > the hooks to the profiler so that they'd basically only run on the Quantum > Flow team's dev machines, which tend to be pretty free of such external > software. If you are wanting to re-enable these hooks in the wild, the > results are probably not going to be pretty. Ah, ok, interfering with others hooking the same functions sounds like a recipe for disaster, it's surprising we get away with it for LdrLoadDll. I wonder if there's any way around this that isn't going to neuter BHM.
> I wonder if there's any way around this that isn't going to neuter BHM. This is probably a horrible idea, but if we can be sure that LdrLoadDll/LdrUnloadDll/LdrResolveDelayLoadedAPI always hold the loader lock before they take the function-table lock, then we could test whether the loader lock is held by calling LdrLockLoaderLock(2, ...) and avoid stack walking if so. This is an undocumented API though so it would be inviting trouble on the next Windows release.
As discussed in our PlatInt meeting: What if we wrote our own delayLoadHelper(), whose replacement is supported by MSVC? I guess another question is, what are the implications for clang-cl?
Flags: needinfo?(dmajor)
AIUI it's just normal symbol resolution, so as long as your lib containing __delayLoadHelper2 appears before delayimp.lib, it should be fine. Do we know that this issue is limited to delay-loading, i.e. it doesn't happen during "vanilla" LdrLoadDll/LdrUnloadDll?
Flags: needinfo?(dmajor)
We already suppress profiler stackwalking in those cases AIUI.
LdrLoadDll, yes. LdrUnloadDll is only hooked when profiling, like LdrResolveDelayLoadedAPI: https://searchfox.org/mozilla-central/rev/d0a41d2e7770fc00df7844d5f840067cc35ba26f/tools/profiler/core/platform-win32.cpp#346 For the record, "cross our fingers and hope this doesn't happen during Unload" is a valid approach IMO. If we've made it this long without problems...
Do we have any update for this bug?
Maybe Adam can comment? (Intermittent test failures continue to be reported in bug 1414495. Huge thanks to everyone who has commented so far!)
Flags: needinfo?(agashlin)
No immediate plans from me (or platform integration in general) to work on this. As I understand we are only doing Background Hang Monitor on Nightly [1], though that includes automated testing. Fix will probably be aklotz's suggestion in comment 21, but it isn't known how much work that is going to involve yet. [1] https://searchfox.org/mozilla-central/rev/6ef785903fee6c0b16a1eab79d722373d940fd78/toolkit/components/backgroundhangmonitor/moz.build#10
Flags: needinfo?(agashlin)
Priority: -- → P2
Please note that bug 1414495 shows a very small number of intermittent failures related to Windows all the last weeks. So has something been changed on mozilla-central, which might have caused this?
Actually this failure is still happening more or less frequently.
Priority: P2 → P3

Hey Geoff,
Can you still reproduce this issue or should we close it?

Flags: needinfo?(whole.grains)

Windows mochitest startup hangs continue (after all this time!), as seen in bug 1414495. I see stacks very similar to those discussed in comment 14+. This bug is still relevant.

Flags: needinfo?(whole.grains)
QA Whiteboard: [qa-not-actionable]

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: critical → --
Severity: -- → S3
You need to log in before you can comment on or make changes to this bug.