Closed Bug 1374038 Opened 7 years ago Closed 7 years ago

Nightly 2017-06-17 prevents Windows 10 shutdown

Categories

(Core :: Gecko Profiler, defect)

56 Branch
defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 + verified
firefox56 + verified

People

(Reporter: ssusbauer, Assigned: benjamin)

References

Details

(Keywords: regression)

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0 Build ID: 20170617030206 Steps to reproduce: Running Nightly 2017-06-17 (x64). Shutdown or reboot Windows without first exiting Nightly. Actual results: Windows displays "This program is preventing shutdown" and eventually dumps back to desktop. The Nightly UI is hard frozen, "not responding" and I have to manually kill with Windows. There is nothing shown in about:crashes. Same behavior if I hit "cancel" on the Windows shutdown screen to return to the desktop. Expected results: Nightly should close immediately.
I experience this on two computers, Nightly exits immediately when closing manually. I've run through Mozregression, I do not experience this behavior in 2017-06-16, I do experience this in build dfb091ee but none of the other ones that come up. Here is the output from Mozregression: 2017-06-18T01:03:33: INFO : Narrowed inbound regression window from [ff45f0da, b4d533bb] (4 revisions) to [c956b50f, b4d533bb] (2 revisions) (~1 steps left) 2017-06-18T01:03:33: DEBUG : Starting merge handling... 2017-06-18T01:03:33: DEBUG : Using url: https://hg.mozilla.org/integration/autoland/json-pushes?changeset=b4d533bb18fc216234c2f18717502050663ba5d8&full=1 2017-06-18T01:03:34: DEBUG : Found commit message: Bug 1368386 - Only call mInvalidRegion.GetRegion() when an up-to-date valid region is requested. r=mattwoodrow MozReview-Commit-ID: IHb0fNAzyjj 2017-06-18T01:03:34: INFO : The bisection is done. 2017-06-18T01:03:34: INFO : Stopped
Blocks: 1368386
Has Regression Range: --- → yes
Component: Untriaged → Layout: Web Painting
Keywords: regression
Product: Firefox → Core
This is very surprising. Steven, can you check whether the build at https://queue.taskcluster.net/v1/task/UPYbXgstSyiijD2WtsG5hQ/runs/0/artifacts/public/build/firefox-56.0a1.en-US.win64.zip exhibits the problem?
Yes, this build does appear to have the problem. I have noticed that for all of the affected builds, if I start Firefox in Safe Mode the problem goes away until I run normally again. Manually disabling hardware acceleration or setting the content process to 1 does not seem to have an affect.
(In reply to Steven S. from comment #3) > Yes, this build does appear to have the problem. Thanks! This was the build from revision https://hg.mozilla.org/integration/autoland/rev/1c4c304c065706ffb014ca972f23cf19c586c910 , which is just one push *before* the push that landed bug 1368386. So bug 1368386 is not the cause of this regression. > I have noticed that for all > of the affected builds, if I start Firefox in Safe Mode the problem goes > away until I run normally again. Manually disabling hardware acceleration or > setting the content process to 1 does not seem to have an affect. Ah. It sounds like it's rather tricky to get consistent results during the regression range search. Can you try again anyway, now that you know more about the things you need to look out for?
No longer blocks: 1368386
Has Regression Range: yes → no
from dupe bug: 1374284 Affected: Nightly56.0a1(default e10s enabled) Not affected: Beta55.0b2(both e10s disabled by accessibility tools and force e10s enabled) Reproducible: always Nightly56.0a1(default e10s on) Steps To reproduce: 1. Launch Nightly with clean profile(no addon) 2. Sign out from right click of Start button of windows taskbar 3. Wait for few seconds( ~10sec) AR: Nightly Not Responding. And Fails to sigh out. You must forcibly sign out ER: Successfully proceed sign out Regression window: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c9669890cf99c9db2374cf81055c19cb42e382c4&tochange=bb9f906e70e30d56c386717ee9a45b7d3c774573 Regressed by: bb9f906e70e3 Markus Stange — Bug 1330185 - Use a top-level PProfiler protocol to control the profiler in other processes. r=billm,njn In Ver55, I confirmed that it was fixed when merge day (central to beta). i.e, Reproduced on https://hg.mozilla.org/mozilla-central/rev/f9605772a0c9098ed1bcaa98089b2c944ed69e9b Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0 ID:20170612041720 But not reproduced on https://hg.mozilla.org/releases/mozilla-beta/rev/2191d7f87e2e69af1db8b3459522abe7c1db4c2e Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0 ID:20170612062327
Blocks: 1330185
Status: UNCONFIRMED → NEW
Component: Layout: Web Painting → Untriaged
Ever confirmed: true
Thanks, I found the same. When I was manually asking mozregression gui for 2017-06-16 or later, it is grabbing builds from mozilla-central that do not appear to have the issue.
(In reply to Markus Stange [:mstange] from comment #4) > So bug 1368386 is not the cause of this regression. (In reply to Alice0775 White from comment #7) > Regressed by: > bb9f906e70e3 Markus Stange — Bug 1330185 - Use a top-level PProfiler > protocol to control the profiler in other processes. r=billm,njn Haha, I shouldn't have believed I'd be let off the hook that easily.
Assignee: nobody → mstange
Status: NEW → ASSIGNED
We're hanging while calling Close() on the message channel in the parent process. This happens when the ProfilerParentTracker is destroyed from the UniquePtr destructor during global object destruction when all other threads have already shut down. So we're waiting for the IPC thread to signal us at a time at which there is no IPC thread any more. > ntdll.dll!NtWaitForAlertByThreadId() > ntdll.dll!RtlSleepConditionVariableSRW() > KernelBase.dll!SleepConditionVariableSRW() > > mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) > xul.dll!mozilla::CondVar::Wait(unsigned int aInterval) > xul.dll!mozilla::ipc::MessageChannel::SynchronouslyClose() > xul.dll!mozilla::ipc::MessageChannel::Close() > xul.dll!mozilla::ProfilerParentTracker::~ProfilerParentTracker() > xul.dll!mozilla::UniquePtr<mozilla::ProfilerParentTracker,mozilla::DefaultDelete<mozilla::ProfilerParentTracker> >::reset(mozilla::ProfilerParentTracker * aPtr) > ucrtbase.dll!<lambda>(void)() > ucrtbase.dll!__crt_seh_guarded_call<int>::operator()<<lambda_7777bce6b2f8c936911f934f8298dc43>,<lambda>(void) & __ptr64,<lambda_3883c3dff614d5e0c5f61bb1ac94921c> >() > ucrtbase.dll!_execute_onexit_table() > xul.dll!dllmain_crt_process_detach(const bool is_terminating) > xul.dll!dllmain_dispatch(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) > ntdll.dll!LdrpCallInitRoutine() > ntdll.dll!LdrShutdownProcess() > ntdll.dll!RtlExitUserProcess() > kernel32.dll!ExitProcessImplementation() > ucrtbase.dll!exit_or_terminate_process() > ucrtbase.dll!common_exit() > xul.dll!nsWindow::ProcessMessage(unsigned int msg, unsigned __int64 & wParam, __int64 & lParam, __int64 * aRetValue) > xul.dll!nsWindow::WindowProcInternal(HWND__ * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) > xul.dll!CallWindowProcCrashProtected(__int64(*)(HWND__ *, unsigned int, unsigned __int64, __int64) aWndProc, HWND__ * aHWnd, unsigned int aMsg, unsigned __int64 aWParam, __int64 aLParam) > xul.dll!nsWindow::WindowProc(HWND__ * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) > user32.dll!UserCallWinProcCheckWow() > user32.dll!DispatchClientMessage() > user32.dll!__fnDWORD() > ntdll.dll!KiUserCallbackDispatcherContinue() > win32u.dll!NtUserMessageCall() > user32.dll!RealDefWindowProcWorker() > user32.dll!DefWindowProcW() > user32.dll!UserCallWinProcCheckWow() > user32.dll!CallWindowProcW() > xul.dll!nsWindow::WindowProcInternal(HWND__ * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) > xul.dll!CallWindowProcCrashProtected(__int64(*)(HWND__ *, unsigned int, unsigned __int64, __int64) aWndProc, HWND__ * aHWnd, unsigned int aMsg, unsigned __int64 aWParam, __int64 aLParam) > xul.dll!nsWindow::WindowProc(HWND__ * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) > user32.dll!UserCallWinProcCheckWow() > user32.dll!DispatchClientMessage() > user32.dll!__fnDWORD() > ntdll.dll!KiUserCallbackDispatcherContinue() > win32u.dll!NtUserPeekMessage() > user32.dll!PeekMessageW() > msctf.dll!CThreadInputMgr::PeekMessageW(struct tagMSG *,struct HWND__ *,unsigned int,unsigned int,unsigned int,int *) > xul.dll!nsAppShell::ProcessNextNativeEvent(bool mayWait) > xul.dll!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal * thr, bool mayWait) > xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) > xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) > xul.dll!MessageLoop::RunHandler() > xul.dll!MessageLoop::Run() > xul.dll!nsBaseAppShell::Run() > xul.dll!nsAppShell::Run() > xul.dll!nsAppStartup::Run() > xul.dll!XREMain::XRE_mainRun() > xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) > xul.dll!XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) > firefox.exe!NS_internal_main(int argc, char * * argv, char * * envp) > firefox.exe!wmain(int argc, wchar_t * * argv) > firefox.exe!__scrt_common_main_seh() > kernel32.dll!BaseThreadInitThunk() > ntdll.dll!RtlUserThreadStart()
I don't know if this patch is a good idea. It seems very risky. According to https://connect.microsoft.com/VisualStudio/feedback/details/3111120/vs-2015-ucrt-exit-function-is-racy-with-respect-to-other-threads , the fact that _exit(0) runs the destructor of my UniquePtr could be considered a bug. It might be nicer to have a termination function which does not run these destructors. Is there such a thing?
Component: Untriaged → Gecko Profiler
(In reply to Markus Stange [:mstange] from comment #12) > It might be nicer to have a termination function which does not run these > destructors. Is there such a thing? That would be TerminateProcess, and according to the Connect bug, the Chrome folks switched to that as a workaround. I don't know our shutdown code well enough to say whether it makes sense for us to do the same.
Comment on attachment 8879318 [details] Bug 1374038 - Run ClearOnShutdown handlers during WM_ENDSESSION. https://reviewboard.mozilla.org/r/150602/#review156316 I don't know what the right answer is here. I think the `TerminateProcess` route is a little less risky (and better maintenance-wise, etc.)...maybe? Anybody who's doing things that they want saved to disk should be listening for `profile-before-change` or some variant thereof. WDYT? ::: widget/windows/nsWindow.cpp:5139 (Diff revision 1) > + KillClearOnShutdown(ShutdownPhase::WillShutdown); > + KillClearOnShutdown(ShutdownPhase::Shutdown); > + KillClearOnShutdown(ShutdownPhase::ShutdownThreads); > + KillClearOnShutdown(ShutdownPhase::ShutdownLoaders); > + KillClearOnShutdown(ShutdownPhase::ShutdownFinal); This seems pretty risky; I'd be worried about hidden dependencies between these shutdown stages and the work performed by the handlers for the actual xpcom shutdown notifications (that we don't actually do here). OTOH, I'd wager the reason we haven't taken the `TerminateProcess` course (or, uh, `quick_exit`, I think, on Unix-y platforms) is that we're worried about the possibility that real, useful work is down in those static destructors that wouldn't get called. I think anybody who is trying to do useful work in destructors in that fashion is asking for trouble, but I can understand the concern.
The mochitest-2 failures on the associated try run also look kind of concerning to me, in addition to comment 14.
Flags: needinfo?(mstange)
(In reply to Steven S. from comment #0) > > Windows displays "This program is preventing shutdown" and eventually dumps > back to desktop. The Nightly UI is hard frozen, "not responding" and I have > to manually kill with Windows. I get the exact same thing for a week or two now, only on v55 beta (e10s=on). Was something pushed to beta that might do this?
(In reply to Nathan Froyd [:froydnj] from comment #14) > I don't know what the right answer is here. I think the `TerminateProcess` > route is a little less risky (and better maintenance-wise, etc.)...maybe? > Anybody who's doing things that they want saved to disk should be listening > for `profile-before-change` or some variant thereof. WDYT? Sounds reasonable... I think. I don't think I know enough about this problem to really make an informed decision here, though. I haven't looked into the mochitest-2 failures yet.
Flags: needinfo?(mstange)
Comment on attachment 8879318 [details] Bug 1374038 - Run ClearOnShutdown handlers during WM_ENDSESSION. (In reply to Markus Stange [:mstange] from comment #17) > (In reply to Nathan Froyd [:froydnj] from comment #14) > > I don't know what the right answer is here. I think the `TerminateProcess` > > route is a little less risky (and better maintenance-wise, etc.)...maybe? > > Anybody who's doing things that they want saved to disk should be listening > > for `profile-before-change` or some variant thereof. WDYT? > > Sounds reasonable... I think. I don't think I know enough about this problem > to really make an informed decision here, though. Join the club! :) > I haven't looked into the mochitest-2 failures yet. Canceling review while the test failures and an alternate path forward are looked at, then.
Attachment #8879318 - Flags: review?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #14) > I don't know what the right answer is here. I think the `TerminateProcess` > route is a little less risky (and better maintenance-wise, etc.)...maybe? > Anybody who's doing things that they want saved to disk should be listening > for `profile-before-change` or some variant thereof. WDYT? the problem is that, unless the consumer uses AsyncShutdown (And not all of our "writers" do), even if the work begins at profile-before-change, it's unlikely to finish there, since we move most IO off the main thread.
Is this still reproducible in Nightly? Andrei, can someone on your team check the STR in comment 7?
Flags: needinfo?(andrei.vaida)
I am still seeing this on Nightly 20170712030204
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #20) > Is this still reproducible in Nightly? Andrei, can someone on your team > check the STR in comment 7? Hello, Yes, I can confirm that this issue is still reproducible with latest Nightly 56.0a1 (2017-07-12) on Windows 10 x64.
Flags: needinfo?(andrei.vaida)
Markus, can you follow up? Is this a possible source of updates failing?
Flags: needinfo?(mstange)
I think we should quick-exit here. We aren't trying to clean up memory in this case: we're just trying to save the profile and then get out of the way as quickly as possible. If static destructors are being relied on to save profile data, we have much bigger problems.
Comment on attachment 8887201 [details] Bug 1374038 - _exit() can call static destructors on Windows! So replace it with TerminateProcess which can't, +mstange https://reviewboard.mozilla.org/r/157970/#review163078 Thank you! I agree with your assessment.
Attachment #8887201 - Flags: review?(mstange) → review+
Comment on attachment 8887201 [details] Bug 1374038 - _exit() can call static destructors on Windows! So replace it with TerminateProcess which can't, +mstange https://reviewboard.mozilla.org/r/157970/#review163086
Attachment #8887201 - Flags: review?(jmathies) → review+
Pushed by bsmedberg@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/9acc3b670795 _exit() can call static destructors on Windows! So replace it with TerminateProcess which can't, r=jimm r=mstange
Did you intend to set `process = GetCurrentProcessId()`? I suspect that WaitForSingleObject wouldn't know what to do with the -1 pseudo-handle (is it a process? a thread? something else?).
Flags: needinfo?(benjamin)
(In reply to David Major [:dmajor] from comment #30) > Did you intend to set `process = GetCurrentProcessId()`? Or maybe that's not right either. You might need to DuplicateHandle to get a real handle. I'm not sure offhand.
I don't know for certain that WaitForSingleObject can deal with the -1 pseudohandle, but it appears to work in local testing (I used the tool at http://stefanstools.sourceforge.net/SendMessage.html to do local testing).
Flags: needinfo?(benjamin)
Oh, the GetCurrentThread() pseudo-handle is -2, so maybe WaitForSingleObject knows how to distinguish them after all. If all is well then sorry for the noise!
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Emil, you mentioned in email you were still able to reproduce this issue with the 7-18 Nightly. Can you verify the fix here tomorrow (I guess with the 7/20 nightly.) Thanks!
Flags: qe-verify+
Hi Liz, I can confirm that this issue is not reproducible anymore with latest Nightly 56.0a1 (2017-07-20) both e10s enabled and disabled, on Windows 10 x64.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Confirmed working, thank you for fixing this really annoying thing!
I have just checked on 55.0b10 to see if Beta is indeed unaffected. It seems that with e10s enabled (1/1 Enabled by default) I can reproduce this issue. Having e10s disabled the issue can't be reproduced. Can you please take a look at this, Markus?
You're right, 55 is affected. The bug that caused this (bug 1330185) landed in 55.
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #40) > You're right, 55 is affected. The bug that caused this (bug 1330185) landed > in 55. So is there any barrier to uplifting this to beta? There are still two weeks left in this beta cycle. (and it's really annoying) (PS: I mentioned that I'm experiencing this on beta way back in comment #16 )
Assignee: mstange → benjamin
Comment on attachment 8887201 [details] Bug 1374038 - _exit() can call static destructors on Windows! So replace it with TerminateProcess which can't, +mstange Approval Request Comment [Feature/Bug causing the regression]: proximate cause is bug 1330185; root bug existed far earlier [User impact if declined]: users with the profiler will hang on OS shutdown/restart on Windows [Is this code covered by automated tests?]: no [Has the fix been verified in Nightly?]: yes [Needs manual test from QE? If yes, steps to reproduce]: already in the bug [List of other uplifts needed for the feature/fix]: none [Is the change risky?]: hard to say [Why is the change risky/not risky?]: This is fairly isolated code that I expect would not be very risky. However, shutdown effects are always hard to predict: especially if there are 3rd-party DLLs in our process, this could possibly cause those to change their behavior in ways that are user-visible. [String changes made/needed]: None It might be worth testing this shutdown codepath with the more common injected DLLs: antivirus and accessibility tools; just for sanity checking.
Attachment #8887201 - Flags: approval-mozilla-beta?
Comment on attachment 8887201 [details] Bug 1374038 - _exit() can call static destructors on Windows! So replace it with TerminateProcess which can't, +mstange Nice catch, Ciprian. Let's uplift this and see if we can get it into today's beta build.
Attachment #8887201 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Verified fixed on 55.0b11, as well. (e10s on/off)
Flags: in-qa-testsuite+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: