Closed Bug 1143866 Opened 10 years ago Closed 7 years ago

crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown()

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox36 --- affected
firefox37 --- affected
firefox38 --- affected
firefox39 --- affected
firefox42 --- affected
firefox43 --- affected
firefox44 --- affected
firefox45 --- affected
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox-esr38 --- affected
firefox-esr45 --- affected
firefox50 --- affected
firefox51 --- affected

People

(Reporter: u279076, Unassigned)

References

()

Details

(Keywords: crash, steps-wanted, topcrash-win)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is report bp-593574cb-90c4-430d-a5a2-2220c2150309. ============================================================= 0 xul.dll mozilla::`anonymous namespace'::RunWatchdog(void*) toolkit/components/terminator/nsTerminator.cpp 1 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c 2 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c 3 msvcr120.dll _callthreadstartex f:\dd\vctools\crt\crtw32\startup\threadex.c:376 4 msvcr120.dll msvcr120.dll@0x2c000 5 kernel32.dll BaseThreadInitThunk 6 ntdll.dll __RtlUserThreadStart 7 ntdll.dll _RtlUserThreadStart ============================================================= More reports: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+PR_JoinThread+|+mozilla%3A%3Adom%3A%3ADOMStorageDBThread%3A%3AShutdown%28%29#tab-reports While there have been over 1600 reports of this crash over the last week, I do not believe this qualifies it for topcrash status. Correlations: * Top platform is Windows 7 @ 51% * Top product is Firefox 36.0.1 @ 77% * Top 10 GPUs are all Intel * Top URLs are facebook, youtube, gmail, and twitter * Comments: several people mention this occurs when restoring from a sleep/hibernate state Florin, do you think you have time to see if you can shake out steps to reproduce?
Flags: needinfo?(florin.mezei)
QA Whiteboard: [triaged]
Maybe Honza has some ideas.
Flags: needinfo?(honzab.moz)
We figured out earlier today that the "crashing thread" stack for shutdownhangs in Socorro does not match that of the reported signature. To get this you need to drill down to the other threads in the report. I have filed a Socorro bug to rectify this. For clarity, here is the stack for the matching signature in this bug: 0 ntdll.dll NtWaitForSingleObject 1 kernelbase.dll WaitForSingleObjectEx 2 kernel32.dll WaitForSingleObjectExImplementation 3 kernel32.dll WaitForSingleObject 4 nss3.dll PR_WaitCondVar nsprpub/pr/src/threads/combined/prucv.c 5 nss3.dll PR_JoinThread nsprpub/pr/src/threads/combined/pruthr.c 6 xul.dll mozilla::dom::DOMStorageDBThread::Shutdown() dom/storage/DOMStorageDBThread.cpp 7 xul.dll mozilla::dom::DOMStorageCache::StopDatabase() dom/storage/DOMStorageCache.cpp 8 xul.dll mozilla::dom::DOMStorageObserver::Observe(nsISupports*, char const*, wchar_t const*) dom/storage/DOMStorageObserver.cpp 9 xul.dll nsObserverList::NotifyObservers(nsISupports*, char const*, wchar_t const*) xpcom/ds/nsObserverList.cpp 10 xul.dll nsObserverService::NotifyObservers(nsISupports*, char const*, wchar_t const*) xpcom/ds/nsObserverService.cpp 11 xul.dll nsXREDirProvider::DoShutdown() toolkit/xre/nsXREDirProvider.cpp 12 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/nsAppRunner.cpp 13 xul.dll ScopedXPCOMStartup::`scalar deleting destructor'(unsigned int) 14 xul.dll XREMain::XRE_main(int, char** const, nsXREAppData const*) toolkit/xre/nsAppRunner.cpp 15 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp 16 firefox.exe do_main browser/app/nsBrowserApp.cpp 17 firefox.exe NS_internal_main(int, char**) browser/app/nsBrowserApp.cpp 18 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp 19 firefox.exe __tmainCRTStartup f:/dd/vctools/crt/crtw32/startup/crt0.c 20 kernel32.dll BaseThreadInitThunk 21 ntdll.dll __RtlUserThreadStart 22 ntdll.dll _RtlUserThreadStart
Hmm.. it actually "hangs" on the DOMStorage IO thread. I see two cases: 1. mozilla::storage::Connection::Close -> NtClose 2. mozilla::storage::Connection::Close -> arena_dalloc_large I've encountered two other bugs (1124880 and 1131330) that hang in those two functions as well. We either have a global problem here or it's just another individual case we have to handle separately. I'm on both of those other bugs already, so I may find some conclusion there.
We tried to reproduce the issue today on two separate machines (Windows 7 x64 and x86) using info from comment 0, but we could not get the crash. Also we have not encountered this during our testing last week. At this point it seems highly unlikely that we could find steps to reproduce this.
Flags: needinfo?(florin.mezei)
I don't know when, but in the last weeks there was a patch checked in that fix a lot of the [OOM | small] crashes. So for me - I had this signature often before - it is nearly impossible to crash FF with this signature again. Can somebody figure out when this patch was checked in and if the [shutdownhang] and [WaitForSingleObjectEx] crashes rose after that date ??? If so, I guess that it is (nearly) the same problem as described from me here: https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37 In short words: FF now crashes no more on complex JS with a [OOM | small] (e.g. by Facebook, Google, Youtube, Twitter) because FF have now a patch that prevent FF from crashing with this signature, but now FF is not able to handle the cases that have FF crashed before. As it normally only happens on Windows and there the memory goes very often by this pages "crazy", I guess it is a Garbage Collection problem (in different variations). Solution should be IMHO to analyze mem-reports again, find the problems, fix GC and look if the crash signatures go down again.
Some "new" signature I had in last time: [@ shutdownhang | ReleaseData(void*, unsigned int) ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_InvokeByIndex ] [@ shutdownhang | _VEC_memset ] [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | F_1152915508___________________________________ ] [@ shutdownhang | RtlEnterCriticalSection | PR_Lock | mozilla::net::CacheFile::~CacheFile()] [OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xrealloc | mozilla::CycleCollectedJSRuntime::DeferredFinalize(nsISupports*)] [@ js::gc::AllocateNonObject<JSFatInlineString, int>(js::ExclusiveContext*)] I attach two mem-reports from the last days that were auto-created by FF but didn't send out because FF was no more crashing. Btw.: FF hangs during creating this reports but don't send them out anymore because FF crash no more. Think there should be something changed again ... The second archive is incomplete because it seems I killed the task before FF was able to finish the report. But I don't think that FF would ever finished the report because the file in the archive would have - when complete - over 3GB !!!
Some more infos: - I never use sleep/hibernate. - Plz try to open in Facebook e.g. a group, where at the right side the advertisements and suggestions reload all the time. Then let it run for some time (some hours).
Tobias, unless you are sure your reports belong to this *specific* shutdownhang in DOMStorageDBThread shutdown, please put your comments on a separate bug report instead. Also, there are no good reasons I know of to believe that OOM|small would have anything to do with any shutdownhangs. The former means we were out of (virtual) memory, the latter means Firefox was already being (intentionally) closed and could not finish that in more than a minute of time.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10) > Tobias, unless you are sure your reports belong to this *specific* > shutdownhang in DOMStorageDBThread shutdown, please put your comments on a > separate bug report instead. > > Also, there are no good reasons I know of to believe that OOM|small would > have anything to do with any shutdownhangs. The former means we were out of > (virtual) memory, the latter means Firefox was already being (intentionally) > closed and could not finish that in more than a minute of time. Plz read again: (In reply to Tobias B. Besemer from comment #5) > I don't know when, but in the last weeks there was a patch checked in that > fix a lot of the [OOM | small] crashes. So for me - I had this signature > often before - it is nearly impossible to crash FF with this signature again. > > Can somebody figure out when this patch was checked in and if the > [shutdownhang] and [WaitForSingleObjectEx] crashes rose after that date ??? > > If so, I guess that it is (nearly) the same problem as described from me > here: > https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37 > > In short words: > FF now crashes no more on complex JS with a [OOM | small] (e.g. by Facebook, > Google, Youtube, Twitter) because FF have now a patch that prevent FF from > crashing with this signature, but now FF is not able to handle the cases > that have FF crashed before. > As it normally only happens on Windows and there the memory goes very often > by this pages "crazy", I guess it is a Garbage Collection problem (in > different variations). > Solution should be IMHO to analyze mem-reports again, find the problems, fix > GC and look if the crash signatures go down again. Maybe you should read: https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37
Robert, I comment on this bug because of "See Also: 1131330 Remove" and "Keywords: steps-wanted". Bug 1131330 was also "steps-wanted" and you can find them here: https://bugzilla.mozilla.org/show_bug.cgi?id=1131330#c37 I was able to reproduce it because I had an idea what should be the reason for the crash and so I guess what is the reason of the crash. ;-)
They still do not have anything to do with each other. You are speculating.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #13) > They still do not have anything to do with each other. You are speculating. OK, I speculating! ;-) But the crash seems to be gone in the last days, that's fine, or? :-) https://crash-stats.mozilla.com/report/list?signature=shutdownhang%20|%20WaitForSingleObjectEx%20|%20WaitForSingleObject%20|%20PR_WaitCondVar%20|%20PR_JoinThread%20|%20mozilla%3A%3Adom%3A%3ADOMStorageDBThread%3A%3AShutdown%28%29#tab-table
OS: Windows NT → Windows
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown()] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown()] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::…
*arg* Signature have changed ... :-/ https://crash-stats.mozilla.com/report/list?range_unit=days&range_value=28&signature=shutdownhang+|+WaitForSingleObjectEx+|+WaitForSingleObject+|+PR_WaitCondVar+|+PR_JoinThread+|+mozilla%3A%3Adom%3A%3ADOMStorageDBThread%3A%3AShutdown Last 28 days: Operating System Percentage Number Of Crashes Windows 7 59.66% 2181 Windows XP 14.28% 522 Windows 8.1 13.48% 493 Windows Vista 6.32% 231 Windows 10 4.18% 153 Windows 8 2.02% 74 Windows Unknown 0.06% 2 18 comments, just one useful: part of the problem is that time warner wifi service is slower that the old telephone dialup service. they ae the only game in town so I am stuck with them Submitted: 2015-11-29T13:57:43+00:00 https://crash-stats.mozilla.com/report/index/d2fbe824-e635-4a7b-9175-f37332151129 Last report on FF44.0a2: https://crash-stats.mozilla.com/report/index/85e33bee-3ef1-46b2-8979-854cc2151202 Crashing Thread Frame Module Signature Source 0 xul.dll mozilla::`anonymous namespace'::RunWatchdog(void*) toolkit/components/terminator/nsTerminator.cpp 1 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c 2 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c 3 msvcr120.dll _callthreadstartex f:\dd\vctools\crt\crtw32\startup\threadex.c:376 4 msvcr120.dll msvcr120.dll@0x2c000 5 kernel32.dll BaseThreadStart Seems the Crashing Thread is always the same.
Status: RESOLVED → REOPENED
Keywords: topcrash-win
Resolution: WORKSFORME → ---
Component: DOM → Hardware Abstraction Layer (HAL)
Component: Hardware Abstraction Layer (HAL) → DOM
It's been a year since you were looking at those other bugs, and this is still here ...
Flags: needinfo?(honzab.moz)
I've spend a lot of time on this. We all did in necko, also asking around etc. No idea where from hangs in ntclose() and other IO operations are coming. Could be just a slow hard drive and just a coincidence the watchdog catches it in NTClose (where it may hang for few hundred ms). I also start thinking that _VEC_memset is just hanging because of system swaps, so again - blocked on io. True is that we should not crash. But we are crashing artificially... I start thinking that the shutdown watchdog is a bad thing to do by only one measure. Apparently slower systems should have the delay longer. To sum, no, I personally don't have any new data here. Sorry.
Flags: needinfo?(honzab.moz)
BTW, the hangs are inside mozstorage. You should target your questions to someone who maintains that code, like mak77.
Dropping in as a mozStorage peer... the NtClose thing seems like this is a system issue, but indeed there seem to be a lot of other indications in other stacks that indicate I/O issues, like the mozStorage stack being in a call to FlushFileBuffers, etc. There seem to be two major possibilities: * Anti-virus issues. Brief searching brought up http://stackoverflow.com/questions/24169830/how-can-i-work-out-which-process-thread-owns-the-resource-that-my-program-is-han where someone who seems to know their windows kernels investigation stuff identified Symantec anti-virus as a problem with NtClose. * I/O backlog (which can also be compounded with anti-virus). It probably also doesn't help that closes on Windows can be excessively slow, as :gps identified at http://gregoryszorc.com/blog/2015/10/22/append-i/o-performance-on-windows/ Anti-virus wise, does Socorro know whether there is an active anti-virus on the system and what it is? I/O backlog wise... could the watchdog maybe do something crazy like join on a thread that issues one or more writes to the profile directory followed by a FlushFileBuffer invocation? The idea would be that the watchdog wouldn't kill the process if I/O is hopelessly backlogged while also grabbing a measurement point about the I/O status if it does force a crash. It would be very useful to know that it took 5 seconds to issue a write and a flush when the forced crash happens, for example. (Or even if there's just a number that indicates how long the crash-dump process took from start to finish if we have that?)
(In reply to Andrew Sutherland [:asuth] from comment #20) > Anti-virus wise, does Socorro know whether there is an active anti-virus on > the system and what it is? Only if they have loaded a library into our process or installed and LSP.
Crash volume for signature 'shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown': - nightly (version 50): 3 crashes from 2016-06-06. - aurora (version 49): 41 crashes from 2016-06-07. - beta (version 48): 1774 crashes from 2016-06-06. - release (version 47): 11676 crashes from 2016-05-31. - esr (version 45): 749 crashes from 2016-04-07. Crash volume on the last weeks: Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7 - nightly 0 0 0 0 2 1 0 - aurora 8 4 8 6 6 7 1 - beta 224 239 207 274 345 327 83 - release 1814 1918 1724 1706 1764 1725 490 - esr 81 94 71 64 111 88 64 Affected platform: Windows
Crash volume for signature 'shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | PR_JoinThread | mozilla::dom::DOMStorageDBThread::Shutdown': - nightly (version 51): 2 crashes from 2016-08-01. - aurora (version 50): 6 crashes from 2016-08-01. - beta (version 49): 389 crashes from 2016-08-02. - release (version 48): 477 crashes from 2016-07-25. - esr (version 45): 967 crashes from 2016-05-02. Crash volume on the last weeks (Week N is from 08-22 to 08-28): W. N-1 W. N-2 W. N-3 - nightly 0 1 1 - aurora 0 5 1 - beta 103 222 64 - release 107 275 95 - esr 47 89 83 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly - aurora - beta - release - esr
Status: REOPENED → RESOLVED
Closed: 9 years ago7 years ago
Resolution: --- → WORKSFORME
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: