Closed
Bug 1389021
Opened 7 years ago
Closed 7 years ago
shutdownhangs in mozilla::layers::CompositorThreadHolder::Shutdown
Categories
(Core :: Graphics: Layers, defect, P3)
Tracking
()
RESOLVED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox55 | --- | unaffected |
firefox56 | blocking | fixed |
firefox57 | --- | fixed |
People
(Reporter: philipp, Assigned: aosmond)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression, topcrash, Whiteboard: [gfx-noted])
Crash Data
Attachments
(6 files, 4 obsolete files)
(deleted),
patch
|
nical
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
nical
:
review+
jerry
:
feedback+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
aosmond
:
review+
|
Details | Diff | Splinter Review |
[Tracking Requested - why for this release]:
in the first hours of the 56.0b1 rollout this issue is manifesting itself as one of the main crash issues (13% of all browser crashes)
This bug was filed from the Socorro interface and is
report bp-c8561832-aae4-44ce-ac47-61a290170810.
=============================================================
Crashing Thread (0)
Frame Module Signature Source
0 ntdll.dll KiFastSystemCallRet
1 ntdll.dll ZwWaitForKeyedEvent
2 ntdll.dll RtlSleepConditionVariableCS
3 kernel32.dll SleepConditionVariableCS
4 mozglue.dll mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) mozglue/misc/ConditionVariable_windows.cpp:58
5 xul.dll mozilla::CondVar::Wait(unsigned int) obj-firefox/dist/include/mozilla/CondVar.h:68
6 xul.dll nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsEventQueue.cpp:76
7 xul.dll nsThread::nsChainedEventQueue::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsThread.cpp:839
8 xul.dll nsThread::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsThread.cpp:1297
9 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1380
10 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:480
11 xul.dll mozilla::layers::CompositorThreadHolder::Shutdown() gfx/layers/ipc/CompositorThread.cpp:137
12 xul.dll gfxPlatform::ShutdownLayersIPC() gfx/thebes/gfxPlatform.cpp:1039
13 xul.dll mozilla::ShutdownXPCOM(nsIServiceManager*) xpcom/build/XPCOMInit.cpp:887
14 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/nsAppRunner.cpp:1464
15 xul.dll mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) obj-firefox/dist/include/mozilla/UniquePtr.h:345
16 xul.dll XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4800
17 xul.dll XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4867
18 xul.dll mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/Bootstrap.cpp:45
19 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:115
20 firefox.exe __scrt_common_main_seh f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
21 kernel32.dll BaseThreadInitThunk
22 ntdll.dll __RtlUserThreadStart
23 ntdll.dll _RtlUserThreadStart
these shutdownhang crashes on windows with mozilla::layers::CompositorThreadHolder::Shutdown are regressing in volume in the 56 cycle. the first affected build on nightly seems to have been 56.0a1 20170615030208 - pushlog to the day before: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b266a8d8fd595b84a7d6218d7b8c6b7af0b5027c&tochange=035c25bef7b5e4175006e63eff10c61c2eef73f1
the associated crash signatures are generic shutdownhang signature lumping together a bunch of different issues - this is the query i've used to filter for the crashes described in this report:
https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown&version=56.0a1&version=57.0a1&version=56.0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Milan, this is shaping up to be a top crash in beta 56, can you find someone to investigate? Thanks.
Flags: needinfo?(milan)
Let's see if Andrew can make heads or tails out of this.
Flags: needinfo?(milan) → needinfo?(aosmond)
Threads 19 and 20 are image decoder threads...
Comment 4•7 years ago
|
||
I checked several crash reports and all of them had the following configuration.
features":{"compositor":"basic","gpuProcess":{"status":"unavailable"},"advancedLayers":{"status":"disabled"},"d3d11":{"status":"failed"},"d2d":{"status":"blacklisted","version":"1.1"}}},"isWow64":false},
"settings":{"blocklistEnabled":true,"e10sEnabled":true,"e10sMultiProcesses":4,"e10sCohort":"multiBucket4",
Comment 5•7 years ago
|
||
And all crashes occurred at win7.
Assignee | ||
Comment 6•7 years ago
|
||
Pretty sure I understand what went wrong here -- I believe I broke it in part 5 of bug 1365927. Patch coming up.
Updated•7 years ago
|
Crash Signature: shutdownhang | ntdll.dll@0x4d62a] → shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | GetTickCount64 | mozilla::TimeStamp::Now | mozilla::CondVar::Wait]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | SetWaitableTimer]
[@ shutdownhang | ntdll.dll@0x…
Assignee | ||
Comment 7•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Attachment #8897011 -
Flags: review?(dvander)
Comment on attachment 8897011 [details] [diff] [review]
Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy., v1
Review of attachment 8897011 [details] [diff] [review]:
-----------------------------------------------------------------
:dvander on PTO, let's get a quicker review
Attachment #8897011 -
Flags: review?(dvander) → review?(nical.bugzilla)
Updated•7 years ago
|
Attachment #8897011 -
Flags: review?(nical.bugzilla) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd4bf41ca47d
Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy. r=nical
Comment 10•7 years ago
|
||
Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a9f759fd05fd
Backed out changeset dd4bf41ca47d for bustage
Assignee | ||
Comment 11•7 years ago
|
||
Let's try this again...
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=802c95d19b25ca1373006fe3cf14eb86e55c7da9
Did a full try run, none of the failures seem related to this change. I broke up the freeing of the thread and the freeing of CompositorManagerParent. It appears the failures were caused by trying to free shared memory after the CompositorManagerParent was freed. So we need to keep it around for just as long as we do now, but ensure the compositor thread itself can go free when the IPDL is shutdown.
Attachment #8897011 -
Attachment is obsolete: true
Attachment #8898062 -
Flags: review?(nical.bugzilla)
Updated•7 years ago
|
Attachment #8898062 -
Flags: review?(nical.bugzilla) → review+
Comment 12•7 years ago
|
||
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c289ff85f94b
Release the compositor thread after CompositorManagerParent IPDL reference is released. r=nical
Comment 13•7 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment 14•7 years ago
|
||
(In reply to Wes Kocher (:KWierso) from comment #13)
> https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b
This patch will have appeared in 08-18 nightly build. Reduced weekend usage makes it is unclear whether nightly crash rate dropped. But judging from two queries for the most common nightly signatures containing RtlSleepConditionVariableCS, the crash definitely is not gone
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWaitForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-20T11%3A52%3A37.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWaitForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-20T11%3A52%3A37.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
Comment 15•7 years ago
|
||
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #14)
> (In reply to Wes Kocher (:KWierso) from comment #13)
> > https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b
>
> This patch will have appeared in 08-18 nightly build. Reduced weekend usage
> makes it is unclear whether nightly crash rate dropped. But judging from two
> queries for the most common nightly signatures containing
> RtlSleepConditionVariableCS, the crash definitely is not gone
>
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWai
> tForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditio
> nVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-
> 20T11%3A52%3A37.
> 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum
> ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-
> date&page=1#reports
>
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWai
> tForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVari
> ableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-
> 20T11%3A52%3A37.
> 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum
> ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-
> date&page=1#reports
From the first query, I still saw crash from 20170820 nightly build.
Jerry, could you help to check the fix was include the build or not?
If yes, then we need to re-open this issue and check it.
c00f79dc-97ac-4604-a5c2-ddbaa0170820 2017-08-20 11:16:43 Firefox 57.0a1 20170819100442
Flags: needinfo?(hshih)
Comment 16•7 years ago
|
||
The version of "Firefox 57.0a1 20170819100442":
https://download-origin.cdn.mozilla.net/pub/firefox/nightly/2017/08/2017-08-19-10-04-42-mozilla-central/firefox-57.0a1.en-US.win64.txt
It already contains the fix is this bug.
hg log --follow -r :4f4487cc2d30d988742109868dcf21c4113f12f5 | grep c289ff85f94b
=>
changeset: 421990:c289ff85f94b
So, reopen this bug.
Status: RESOLVED → REOPENED
Flags: needinfo?(hshih)
Resolution: FIXED → ---
There are a lot of different crash signatures here and only a few crashes on nightly. We could still uplift this fix to beta (for beta 6) if you think it will mitigate the issue for beta users. What do you think?
Flags: needinfo?(aosmond)
Unfortunately this missed the beta 5 build today. We could uplift the current patch for beta 6 on Thursday though.
Comment 19•7 years ago
|
||
shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS signature was the top ranked browser crash in Beta 4.
Comment 20•7 years ago
|
||
MozReview-Commit-ID: 3eVziNDH59f
Comment 21•7 years ago
|
||
Comment on attachment 8900253 [details] [diff] [review]
[uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released.
:aosmond is in PTO, so I request this beta uplift. And I will still investigate the remaining crash with the same signature like: https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c16
---
Approval Request Comment
[Feature/Bug causing the regression]:
bug 1365927
Please check https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c6
[User impact if declined]:
We might hit the shutdown timeout.
[Is this code covered by automated tests?]:
Already pass the try in Nightly.
https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c11
[Has the fix been verified in Nightly?]:
https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c18
We still see the same crash in nightly, but the number is fewer.
[Needs manual test from QE? If yes, steps to reproduce]:
none
[List of other uplifts needed for the feature/fix]:
none
[Is the change risky?]:
no.
[Why is the change risky/not risky?]:
The "mozilla::layers::CompositorThreadHolder::Shutdown()" hang is because that CompositorThreadHolder::Shutdown() is waiting for the CompositorThreadHolder obj released. And this patch try to release the CompositorThreadHolder obj ref-counting during shutdown.
[String changes made/needed]:
none
Attachment #8900253 -
Flags: approval-mozilla-beta?
Comment on attachment 8900253 [details] [diff] [review]
[uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released.
Review of attachment 8900253 [details] [diff] [review]:
-----------------------------------------------------------------
Should fix a beta top crash or at least mitigate the crash rate. Aiming for beta 6 here.
Attachment #8900253 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
n-i to myself to check the crash rate on Friday after beta 6 is released.
Flags: needinfo?(lhenry)
Comment 24•7 years ago
|
||
bugherder uplift |
Comment 25•7 years ago
|
||
I just ran a quick query, and it looks as if this is still happening in B6: http://bit.ly/2vDeOs1
It may be happening less with beta 6. Most 56 users are now on beta 6 (as of Monday morning). So we'll see from today's crash rate.
Jerry or Andrew, can you look at the signatures coming in for 56 beta 6?
Flags: needinfo?(hshih)
Assignee | ||
Comment 28•7 years ago
|
||
I'm looking into this again as well, although I no longer have a straightforward explanation as to why this is still happening. I only see reports for when the compositor thread is in the UI process -- this may just be an artifact of the reporting however because I see no "shutdownhang" crashes for the GPU process. Linux and Mac do not appear to be affected as they have relatively few shutdownhang reports in the last month, and none of them (that I checked) matched the profile -- this is interesting because Linux and Mac don't have the GPU process enabled at all so all users could be affected if it reproduced there (instead of just the subset of Windows users which disable the GPU process).
Comment 29•7 years ago
|
||
Currently, the crash are all at win7. And there is no gpu process in these crash reports.
I try to add some log in the AddRef()/Release() for compositorThreadHolder object, but I don't see the non-matched addRef/Release pair for compositorThreadHolder.
checking...
Assignee | ||
Comment 30•7 years ago
|
||
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #29)
> Currently, the crash are all at win7. And there is no gpu process in these
> crash reports.
> I try to add some log in the AddRef()/Release() for compositorThreadHolder
> object, but I don't see the non-matched addRef/Release pair for
> compositorThreadHolder.
> checking...
Are you sure it is tied to the Windows version? I see reports on Windows 8.1, Windows 10.
Comment 31•7 years ago
|
||
Hi Andrew,
Here is my query string.
https://crash-stats.mozilla.com/signature/?_sort=-date&platform=Windows&version=56.0a1&version=57.0a1&version=56.0b&signature=shutdownhang%20%7C%20ZwWaitForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-31T13%3A50%3A00.000Z&proto_signature=%7Emozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown
And all crashes are at win7.
And here is another query string.
"Most" of crashes doesn't use gpu process.
https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown&telemetry_environment=~%22gpuProcess%22%3A%7B%22status%22%3A%22unavailable%22%7D&version=56.0a1&version=57.0a1&version=56.0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-31T13%3A50%3A00.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
We use some tools to open the firefox and close it. Then, I saw one shutdown hang log. I will turn on the addRef/Release call log for compositorThreadHolder.
Assignee | ||
Comment 32•7 years ago
|
||
Probably just an artifact of that particular trace, as other versions of Windows show up on other queries, e.g.
Windows 10:
https://crash-stats.mozilla.com/report/index/08eea1f9-7ff3-4020-8435-01b2b0170901
https://crash-stats.mozilla.com/report/index/6f8b8c32-c399-42aa-9f4d-52da10170901
Windows 8.1:
https://crash-stats.mozilla.com/report/index/0c87f205-ee10-450e-a80e-aeffb0170901
https://crash-stats.mozilla.com/report/index/abd4e87d-9ea0-42b0-8396-9eeeb0170830
Assignee | ||
Comment 33•7 years ago
|
||
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #31)
>
> And here is another query string.
> "Most" of crashes doesn't use gpu process.
> https://crash-stats.mozilla.com/search/
> ?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdo
> wn&telemetry_environment=~%22gpuProcess%22%3A%7B%22status%22%3A%22unavailable
> %22%7D&version=56.0a1&version=57.0a1&version=56.
> 0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-
> 31T13%3A50%3A00.000Z&_sort=-
> date&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_
> facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platfo
> rm_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cp
> u_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=s
> ignature&_columns=product&_columns=version&_columns=build_id&_columns=platfor
> m#facet-signature
>
Hmm, the crash reports are all for the main process, as the only way we hit this particular code path (gfxPlatform::ShutdownLayersIPC) is for shutting down the compositor thread in the main process. Looking for the alternative path for the GPU process (mozilla::gfx::GPUParent::ActorDestroy) turns up nothing.
However tweaking your query to search for Linux and Mac OS X did turn up a handful of recent results on Linux, e.g.
https://crash-stats.mozilla.com/report/index/39a8a8cb-cf19-4544-8087-0f94b0170821
https://crash-stats.mozilla.com/report/index/d1c54c34-f447-4eda-8526-94c190170827
https://crash-stats.mozilla.com/report/index/cefb6e21-7254-4f16-816e-b99bc0170816
https://crash-stats.mozilla.com/report/index/dc37eb9c-7a4d-47d2-9d3f-376510170815
https://crash-stats.mozilla.com/report/index/27d63e0f-f9f1-4e0e-bb67-65f080170815
Another variant of Linux is it blocking on acquiring the MessageChannel mutex:
https://crash-stats.mozilla.com/report/index/b95ba2b6-2cd8-4537-852f-c00bb0170826
I suspect unrelated, but notable in case it offers additional clues.
It was also observed on Mac OS X, but last seen on build 20170728100358 for 56.0a1, e.g.:
https://crash-stats.mozilla.com/report/index/61fce4e1-5465-47f8-8cbe-4771c0170812
If I can glean anything from these, I will note it on the bug.
> We use some tools to open the firefox and close it. Then, I saw one shutdown
> hang log. I will turn on the addRef/Release call log for
> compositorThreadHolder.
Great news -- if you are able to reproduce again with the extra logging, I look forward to seeing the results :D.
Assignee | ||
Comment 34•7 years ago
|
||
It appears the Linux crashes only started in build 20170814100258. It has bug 1383742 attached to the signature, which was found to be introduced by bug 1351148. Bug 1351148 relanded on central on 2017-08-13. The volume is very low on Linux 57, and it wasn't promoted to 56, so it cannot explain our 56 crashes. It also does not appear to have had an impact on the Windows 57 crash rate.
Assignee | ||
Comment 35•7 years ago
|
||
The first build containing the changes in bug 1365927 was 20170615030208 (https://hg.mozilla.org/mozilla-central/rev/035c25bef7b5e4175006e63eff10c61c2eef73f1, confirmed as present in the build via https://hg.mozilla.org/mozilla-central/log/035c25bef7b5e4175006e63eff10c61c2eef73f1/gfx/layers/ipc/CompositorManagerChild.cpp).
The first build that we see the crash signature however is the *next day*'s nightly, 20170616030207. There appear to be lots of telemetry results / crash reports in general for both builds, so unless 20170615030208 got pulled early (I do see the top crash volume (IPCError) got cut in half between 20170615030208 and 20170616030207), it seems it is in the realm of possibility the change causing the crash was not bug 1365927 but something that came the next day.
Pushlog between the two builds:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=035c25bef7b5e4175006e63eff10c61c2eef73f1&tochange=fe809f57bf2287bb937c3422ed03a63740b3448b
Assignee | ||
Comment 36•7 years ago
|
||
Milan confirmed that there was nothing especially bad about 20170615030208 (e.g. such that we pulled it earlier and thus maybe didn't have enough users on it to hit the crash). It was also on a Thursday which has "normal" usage patterns.
Jerry, if your own testing with the AddRef/Release logging doesn't turn up anything useful, could you throw build 20170615030208 into the tool you are using to reproduce? If it doesn't happen there, but it *does* happen on 20170616030207, I think we need to hook it up to mozregression to figure out what the offending change is. The tinderbox builds should still be available for us to narrow this down without having to rebuild ourselves.
Flags: needinfo?(aosmond) → needinfo?(hshih)
Assignee | ||
Comment 37•7 years ago
|
||
My bad, hg itself contains the actual build information. It seems it was in a day earlier than I thought:
first release with
nightly linux32 b266a8d8fd59 / 56.0a1 / 20170614100332 / files
nightly linux64 b266a8d8fd59 / 56.0a1 / 20170614100332 / files
nightly mac b266a8d8fd59 / 56.0a1 / 20170614030206 / files
nightly win32 b266a8d8fd59 / 56.0a1 / 20170614030206 / files
nightly win64 b266a8d8fd59 / 56.0a1 / 20170614030206 / files
last release without
nightly linux32 5293e5f89358 / 56.0a1 / 20170613100218 / files
nightly linux64 5293e5f89358 / 56.0a1 / 20170613100218 / files
nightly mac 5293e5f89358 / 56.0a1 / 20170613030203 / files
nightly win32 5293e5f89358 / 56.0a1 / 20170613030203 / files
nightly win64 5293e5f89358 / 56.0a1 / 20170613030203 / files
This is even more suggestive that it was something new in the 20170616 build.
Reporter | ||
Comment 38•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #35)
> The first build that we see the crash signature however is the *next day*'s
> nightly, 20170616030207.
i'm not sure if this is correct - in crash stats i see those crashes already popping up starting with build 20170615030208 like bp-0d5df06e-bc3a-4230-ade5-47ebb0170626
Comment 39•7 years ago
|
||
Comment 40•7 years ago
|
||
After adding the ref-count log, I only got one crash log at win7. Here is the log: attachment 8904211 [details].
I add debug message in CompositorThreadHolder::Shutdown(), but I don't see the debug message for this function.
Another interesting thing is that I saw some logs like "[Child 263496] WARNING: '!mMainThread', file c:/mozilla-source/git/gecko-dev/xpcom/threads/nsThreadManager.cpp, line 320". It means that the main thread is already deleted.
But the main thread deletion is called after CompositorThreadHolder::Shutdown(). So, I add some additional log in "nsThreadManager::GetMainThread()" for more information.
Comment 41•7 years ago
|
||
I have another crash log which show the gc module try to post some tasks after the main thread deleted. I am not sure it's related to this problem. Still investigating.
Comment 42•7 years ago
|
||
This is still high volume in 56.0b8
Assignee | ||
Comment 43•7 years ago
|
||
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #40)
> After adding the ref-count log, I only got one crash log at win7. Here is
> the log: attachment 8904211 [details].
>
> I add debug message in CompositorThreadHolder::Shutdown(), but I don't see
> the debug message for this function.
> Another interesting thing is that I saw some logs like "[Child 263496]
> WARNING: '!mMainThread', file
> c:/mozilla-source/git/gecko-dev/xpcom/threads/nsThreadManager.cpp, line
> 320". It means that the main thread is already deleted.
> But the main thread deletion is called after
> CompositorThreadHolder::Shutdown(). So, I add some additional log in
> "nsThreadManager::GetMainThread()" for more information.
Looks like the thread which the shutdown hang is reported on is not the main thread (since it doesn't match the thread addref and release was called on). From that I can only deduce the content process CompositorManagerParent never gets deallocated. From the crash reports I have looked at, the compositor thread is never blocked and just waiting for events to come in. This would suggest then that the close request from the content processes never came. I checked a number of crash reports and couldn't find any content process crash reports around the same time as the thread shutdown hang.
I wonder if the content process is blocking in CompositorBridgeChild::ShutDown. Maybe CompositorBridgeChild::ActorDestroy got called on its own somehow, so we cleared mCanSend which prevents CompositorBridgeChild::Destroy from running, which in turn is the only way CompositorBridgeChild::AfterDestroy gets called, which in turn is how sCompositorBridge is cleared which is the spin condition in CompositorBridgeChild::ShutDown? And the parent process crashes first on the shutdown hang so we never get to see the content process shutdown hang? I can prepare something to fix this, but it is just more speculation from me ;).
Assignee | ||
Comment 44•7 years ago
|
||
As I said, total speculation. Jerry, can you take a look at the patch and give your thoughts? And maybe apply this patch to build running in that special test harness? My thought process is moving entirely to a badly behaving content process now over something gone wrong in the parent process, but I could be completely off base here...
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9257f97a9fdb64b05c98af94133bae0c11d3b70e
Attachment #8904681 -
Flags: feedback?(hshih)
Comment 45•7 years ago
|
||
I am not sure I catch your idea. But the patch was applied for the shutdown test. Hope there will have more info for this issue. Let me think more for your description.
Comment 46•7 years ago
|
||
I set a infinite loop in the CompositorBridgeChild::ShutDown(), then we will have the same shutdown hang problem as this bug. Since the parent side are waiting for the CompositorBridgeParent::ActorDestroy(), but the child side never sends that.
The interesting thing is that parent process has a shutdown hang watchdog, but the content process doesn't.
I setup 2 laptop for the the original beta code and attachment 8904681 [details] [diff] [review], but there is no more crash occurred from my testing environment. I can't tell the attachment 8904681 [details] [diff] [review] could fix the problem but at least the sCompositorBridge could be always released.
Flags: needinfo?(hshih)
Updated•7 years ago
|
Attachment #8904681 -
Flags: feedback?(hshih) → feedback+
Assignee | ||
Updated•7 years ago
|
Attachment #8904681 -
Attachment description: [WIP] CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1 → CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1
Attachment #8904681 -
Flags: review?(nical.bugzilla)
Assignee | ||
Comment 47•7 years ago
|
||
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #45)
> I am not sure I catch your idea. But the patch was applied for the shutdown
> test. Hope there will have more info for this issue. Let me think more for
> your description.
I should probably explain on the bug itself.
When CompositorManagerChild::Shutdown is called, it in turn delegates to CompositorBridgeChild::ShutDown before calling CompositorManagerChild::Close and thus starting the process required to release the compositor thread reference in the parent process.
CompositorBridgeChild::ShutDown calls CompositorBridgeChild::Destroy and then spin the main thread until the static reference to itself is released. Ordinarily Destroy will tear down the CompositorBridgeChild bits and bobs, and then trigger CompositorBridgeChild::AfterDestroy (via a dispatch to the main thread) to delete the IPDL object and clear said static reference. Then ShutDown stops spinning, and we call Close.
If however CompositorBridgeChild::ActorDestroy already got called (somehow -- I did not think too hard on why...), then Destroy does nothing. And we will loop waiting for the static reference to be cleared (which will never happen, since it only happens in AfterDestroy). Which in turn means Close never gets called, and the compositor thread references linger.
Assignee | ||
Updated•7 years ago
|
Priority: -- → P3
Whiteboard: [gfx-noted]
Updated•7 years ago
|
Attachment #8904681 -
Flags: review?(nical.bugzilla) → review+
Comment 48•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #47)
> (In reply to Jerry Shih[:jerry] (UTC+8) from comment #45)
> > I am not sure I catch your idea. But the patch was applied for the shutdown
> > test. Hope there will have more info for this issue. Let me think more for
> > your description.
>
> I should probably explain on the bug itself.
>
> When CompositorManagerChild::Shutdown is called, it in turn delegates to
> CompositorBridgeChild::ShutDown before calling CompositorManagerChild::Close
> and thus starting the process required to release the compositor thread
> reference in the parent process.
>
> CompositorBridgeChild::ShutDown calls CompositorBridgeChild::Destroy and
> then spin the main thread until the static reference to itself is released.
> Ordinarily Destroy will tear down the CompositorBridgeChild bits and bobs,
> and then trigger CompositorBridgeChild::AfterDestroy (via a dispatch to the
> main thread) to delete the IPDL object and clear said static reference. Then
> ShutDown stops spinning, and we call Close.
>
> If however CompositorBridgeChild::ActorDestroy already got called (somehow
> -- I did not think too **** why...), then Destroy does nothing. And we
> will loop waiting for the static reference to be cleared (which will never
> happen, since it only happens in AfterDestroy). Which in turn means Close
> never gets called, and the compositor thread references linger.
It's more clear.
The parent process has a shutdown hang watchdog, but the content process doesn't.
I think we should also have a hang monitor at the child side to prevent the problem again. In this case, the child side might still be spinning but we can't know.
Comment 49•7 years ago
|
||
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/df9d3661abb6
CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning. r=nical
Comment 50•7 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Comment 51•7 years ago
|
||
For release build, the content process will go through ProcessChild::QuickExit(). Then, the content process will not call xpcom_shutdown. So, it will also not go though CompositorBridgeChild::ShutDown.
I'm not sure the attachment #8904681 [details] [diff] [review] could fix the problem. But let's see how it will be.
Flags: needinfo?(aosmond)
Assignee | ||
Comment 52•7 years ago
|
||
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #51)
> For release build, the content process will go through
> ProcessChild::QuickExit(). Then, the content process will not call
> xpcom_shutdown. So, it will also not go though
> CompositorBridgeChild::ShutDown.
> I'm not sure the attachment #8904681 [details] [diff] [review] [diff] [review] could fix the
> problem. But let's see how it will be.
Ugh, I wonder if is this related to reproducing in other ways? Maybe I should follow the exact configuration for a nightly build. I am still seeing reports (4 thus far) on 20170907220212 which contains the new fix.
https://crash-stats.mozilla.com/report/index/609a2aab-8fa9-4af5-ba11-315210170908
https://crash-stats.mozilla.com/report/index/5edb45fa-1e37-4e69-9b09-658420170908
At this point, I need to think about what information would be useful to debug and land something which dumps it to the GFX log. Maybe a list of what CompositorManagerParents are still alive at shutdown...
Assignee | ||
Comment 53•7 years ago
|
||
It still calls CompositorBridgeChild::Destroy in a few other places (e.g. by the CompositorSession) that should still happen on shutdown (e.g. nsBaseWidget::DestroyCompositor via WidgetShutdownObserver::Observe). Before my changes this would release the CompositorBridgeParent and its thread before the quick exit. While it still releases the CompositorBridgeParent, the CompositorManagerParent remains up. Maybe we need to do more in these places.
It doesn't make sense though because a quick exit should error out the ProcessLink on the other side...surely. But it is a behavioural change, so I guess it is an option.
Assignee | ||
Comment 54•7 years ago
|
||
The signature morphed as a result of some other changes (probably the quantum DOM scheduler).
Status: RESOLVED → REOPENED
Crash Signature: ntdll.dll@0x4d62a] → ntdll.dll@0x4d62a]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::PrioritizedEventQueue<T>::GetEvent]
Resolution: FIXED → ---
Comment 55•7 years ago
|
||
Here is the customize AddRef/Release impl for CompositorThreadHolder. It will use
gfxCriticalNote to dump the stack for AddRef/Release call.
But I can't get the complete log from the gfxCriticalNote. Some logs seems to be
overwrote by the newer log.
Comment 56•7 years ago
|
||
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e66c8d65d0f1
Explicitly shutdown the CompositorManagerChild during ContentChild::ActorDestroy. r=me
Assignee | ||
Comment 57•7 years ago
|
||
(In reply to Pulsebot from comment #56)
> Pushed by aosmond@gmail.com:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/e66c8d65d0f1
> Explicitly shutdown the CompositorManagerChild during
> ContentChild::ActorDestroy. r=me
I will back this out after we get a few builds with it and observe the effect on crash rates. Fingers crossed.
Comment 58•7 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Comment 59•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #53)
> It still calls CompositorBridgeChild::Destroy in a few other places (e.g. by
> the CompositorSession) that should still happen on shutdown (e.g.
> nsBaseWidget::DestroyCompositor via WidgetShutdownObserver::Observe). Before
> my changes this would release the CompositorBridgeParent and its thread
Yes, but it happens at parent process. If the parent is blocked at the CompositorThreadHolder::Shutdown, it should already have ran the nsBaseWidget::DestroyCompositor().
> before the quick exit. While it still releases the CompositorBridgeParent,
> the CompositorManagerParent remains up. Maybe we need to do more in these
> places.
>
> It doesn't make sense though because a quick exit should error out the
> ProcessLink on the other side...surely. But it is a behavioural change, so I
> guess it is an option.
The parent actor should be deleted by ActorDestory(). So, I'm still checking for the remaining holder for the compositorThread.
https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/PCompositorManagerParent.cpp#234
If this is fixed in nightly can you request uplift to beta? Or is this still a diagnostic patch?
Flags: needinfo?(lhenry)
Reporter | ||
Comment 62•7 years ago
|
||
unfortunately i still see those crashes in today's nightly build 20170912100139 which contains the patch:
http://bit.ly/2f3G29h
Assignee | ||
Comment 63•7 years ago
|
||
Yes I agree. Back to the drawing board.
Status: RESOLVED → REOPENED
Flags: needinfo?(aosmond)
Resolution: FIXED → ---
Updated•7 years ago
|
Comment 64•7 years ago
|
||
This patch use CompositorThreadHolderRef to record the different types of CompositorThreadHolder object holder.
Then, we could dump the log for the remaining holder type during the XPCOM shutdown.
MozReview-Commit-ID: GfUtMD11Ajt
Attachment #8907556 -
Flags: feedback?(nical.bugzilla)
Attachment #8907556 -
Flags: feedback?(aosmond)
Updated•7 years ago
|
Attachment #8906563 -
Attachment is obsolete: true
Comment 65•7 years ago
|
||
There are 53 crashes in nightly 57 for signature "shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-" and 86 for "shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-".
Crash Signature: ntdll.dll@0x4d62a]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::PrioritizedEventQueue<T>::GetEvent] → ntdll.dll@0x4d62a]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::PrioritizedEventQueue<T>::GetEvent]
[@…
Comment 66•7 years ago
|
||
Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcbb4575e482
Backed out changeset e66c8d65d0f1 because it did not reduce the crash volume.
Assignee | ||
Comment 67•7 years ago
|
||
So two things in the pipeline from my end right now. One, bug 1399453 has landed on inbound which should annotate the crash reports with the current compositor thread references -- when a shutdown hang is tripped, we should be able to just search for "gfxCompositorThread" to find them. Hopefully that makes it into the next build this evening/tomorrow morning. This may help with diagnosing the problem.
In parallel, I had a discussion with dvander about the problem. We agreed that in an ideal world, the parent process should be the one initiating the IPDL objects closing, rather than the child processes. I figured this was more risky than it probably is, because as dvander pointed out, the child processes already need to handle the GPU process going away at a random point. This is sort of a backup plan, assuming the annotations don't tell us anything new but hopefully a sound one if the try is clean.
try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=92a2008a0c4b243673cc2e57f19bff2c21bb3ec1
try (beta): https://treeherder.mozilla.org/#/jobs?repo=try&revision=030ee6b93585fa860012a965b3cbb2afabf84a8c
Assignee | ||
Comment 68•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #67)
> Created attachment 8907693 [details] [diff] [review]
> Force CompositorManagerParent to close before shutting down the compositor
> thread., v1
>
> try (central):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=92a2008a0c4b243673cc2e57f19bff2c21bb3ec1
> try (beta):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=030ee6b93585fa860012a965b3cbb2afabf84a8c
My bad:
try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc8e6365ad4398e155ebbdeb67412e808a90157c
Assignee | ||
Comment 69•7 years ago
|
||
Fix that leak check (grr).
try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=e27af77640ce1b1e6bef88a27f31d74cc5e0b4fb
try (beta): https://treeherder.mozilla.org/#/jobs?repo=try&revision=316a24034803386ae55d6c315200c671f0f0061a
Attachment #8907693 -
Attachment is obsolete: true
Assignee | ||
Comment 70•7 years ago
|
||
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2
The crash report annotations show that the CompositorManagerChild/Parent for the content processes are still alive. So we were on the right track with the previous patches, but for some reason, they aren't helping. The content process should quick exit and let the OS cleanup the IPC resources but that doesn't seem to be happening either.
As said in comment 67, we actually would prefer to move towards shutting down in the parent process anyways, rather than rely upon orderly shutdown in the content processes. This should be safe since they need to handle the GPU process dying at any time in theory. So this patch should move us in the right direction and hopefully fix the hang once and for all.
Attachment #8907841 -
Flags: review?(dvander)
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2
Review of attachment 8907841 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/ipc/CompositorManagerParent.cpp
@@ +133,5 @@
> AddRef();
> +
> + StaticMutexAutoLock lock(sMutex);
> + if (!sActiveActors) {
> + sActiveActors = new nsTArray<CompositorManagerParent*>();
Arguably we should create the array at startup, and refuse to create new CompositorManagers if this array doesn't exist, since otherwise they might not be severed later. But it's okay as-is.
Attachment #8907841 -
Flags: review?(dvander) → review+
Comment 72•7 years ago
|
||
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d0893f1b6ce
Force CompositorManagerParent to close before shutting down the compositor thread. r=dvander
Comment 73•7 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 74•7 years ago
|
||
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2
No crash reports since the change landed. We typically have seen one by now, so I'm ready to call it.
Approval Request Comment
[Feature/Bug causing the regression]: Bug 1365927
[User impact if declined]: May experience shutdown hangs on Windows.
[Is this code covered by automated tests?]: The code is run when it shuts down the test harness in general.
[Has the fix been verified in Nightly?]: Yes. The shutdown hang crash reports have stopped coming in.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It is self contained within one module and easily auditable. While the CompositorManager and subprotocols could be closed earlier (in theory) than we have historically tested, the content processes already support the GPU process going away unexpectedly (e.g. it crashes and later recovers), which is all this emulates from the content process perspective. There should not be any state information lost as a result of the earlier close.
[String changes made/needed]: None.
Attachment #8907841 -
Flags: approval-mozilla-beta?
Comment 75•7 years ago
|
||
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2
56 is now on m-r. See comment 74.
Attachment #8907841 -
Flags: approval-mozilla-beta? → approval-mozilla-release?
Comment on attachment 8907841 [details] [diff] [review]
Force CompositorManagerParent to close before shutting down the compositor thread., v2
Looks like this should fix a topcrash blocking the 56 release. Please uplift for the RC build.
Attachment #8907841 -
Flags: approval-mozilla-release? → approval-mozilla-release+
Updated•7 years ago
|
status-firefox55:
--- → unaffected
status-firefox-esr52:
--- → unaffected
Comment 77•7 years ago
|
||
bugherder uplift |
Comment 78•7 years ago
|
||
I had to back this out from mozilla-release because it broke debug builds (and I confirmed on Try that this patch was the culprit). Basically, Windows and Linux builds would appear to finish their checktests and then just hang until eventually timing out. Also, OSX debug xpcshell would similarly appear to just hang and end up timing out during chrome/test/unit_ipc/test_resolve_uris_ipc.js.
https://hg.mozilla.org/releases/mozilla-release/rev/28c16d0c9746f7e5f4a5d01c21bd3fede6c6e1a5
https://treeherder.mozilla.org/logviewer.html#?job_id=131483091&repo=mozilla-release
https://treeherder.mozilla.org/logviewer.html#?job_id=131477593&repo=mozilla-release
Flags: needinfo?(aosmond)
Assignee | ||
Comment 79•7 years ago
|
||
Ugh, I should have posted my own trys against mozilla-release instead of mozilla-beta at the time of posted I take it.
Comment 80•7 years ago
|
||
Your Try push from comment 69 against Beta shows the same problem.
Comment 81•7 years ago
|
||
And no, you did the right thing at the time. 56 went from m-b to m-r on the 14th, so your push against Beta on the 13th was fine.
Assignee | ||
Comment 82•7 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #81)
> And no, you did the right thing at the time. 56 went from m-b to m-r on the
> 14th, so your push against Beta on the 13th was fine.
Ugh -- I guess I didn't think it to be related. Is this happening on central as well (to your knowledge)? I know about the leakcheck failure, which is in my queue to look at. But that seems less serious than this.
Comment 83•7 years ago
|
||
It's not happening on m-c. I'm quite certain it would have been backed out already had it been :)
Comment 84•7 years ago
|
||
This causes permaleaks (bug 1374856) on Linux x64 asan mda2, see https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&tochange=ea39b11826c9f4299ae84231d76a97abecbf9747&fromchange=7fbfb64d7b9b351e557785987baecf4e0cbd1649&filter-searchStr=linux%20asan%20mda2&selectedJob=131595760
Depends on: 1374856
Comment 85•7 years ago
|
||
Backed out for leaking in mda2 on Linux x64 asan (bug 1374856) and because the patch was only applied on central:
https://hg.mozilla.org/mozilla-central/rev/877d536658e644bd0d1662d608cc2ca3a9fa8d9f
Updated•7 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla57 → ---
Assignee | ||
Comment 86•7 years ago
|
||
I fixed the asan/leakcheck issue, so this should be landable on mozilla-central:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=223084d2affbc22de128e0bdfb3bef4a4d61116a
And I fixed the 56 branch debug vs release issue by not shutting down in the parent in debug builds. I believe the proper fix will be to always shutdown in the parent in both debug and release but we are already past the 11th hour...
Attachment #8907841 -
Attachment is obsolete: true
Flags: needinfo?(aosmond)
Attachment #8909466 -
Flags: review+
Assignee | ||
Comment 87•7 years ago
|
||
We will need to uplift both attachment 8904681 [details] [diff] [review] and attachment 8909466 [details] [diff] [review], assuming the try is clean.
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a6b883cb23225f54dc96d8f0d3b55936031ec92
Comment 88•7 years ago
|
||
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4dd86bbce649
Force CompositorManagerParent to close before shutting down the compositor thread. r=dvander,me
Comment 89•7 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment 90•7 years ago
|
||
bugherder uplift |
Comment 91•7 years ago
|
||
Comment on attachment 8907556 [details] [diff] [review]
show the ref-counting for CompositorThreadHolder in XPCOM shutdown.
I am late to the game sorry.
Attachment #8907556 -
Flags: feedback?(nical.bugzilla)
Updated•7 years ago
|
Attachment #8907556 -
Flags: feedback?(aosmond)
Updated•7 years ago
|
Crash Signature: mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-] → mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
[@ s…
Comment 94•7 years ago
|
||
Bug 1379619 is specific to Mac OS X.
Whereas Bug 1389021 is specific to Windows OS.
These bugs were marked as duplicates of each other. But with different operating systems being affected, and different crash signatures, should they be left separate instead? Maybe linked via "See Also"?
Or should the problems with both bugs and operating systems be merged? Changing the OS on Bug 1389021 from Windows to All.
Comment 95•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #74)
> Comment on attachment 8907841 [details] [diff] [review]
> Force CompositorManagerParent to close before shutting down the compositor
> thread., v2
>
> No crash reports since the change landed. We typically have seen one by now, so I'm ready to call it.
>
> Approval Request Comment
> [Feature/Bug causing the regression]: Bug 1365927
Are these crash reports relevant? They appeared in Nightly recently.
00e502a7-1bc6-4b77-a3f2-2627e0170919 2017-09-19 15:35:04 Firefox 57.0a1 20170918220054 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-19 15:12:59
2a12f86b-5f6c-48db-ad9c-c588d0170917 2017-09-17 00:37:59 Thunderbird 57.0a1 20170914030208 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-16 15:52:23
3ad05691-0c7c-4c20-b645-127a50170916 2017-09-16 15:06:09 Firefox 57.0a1 20170914100122 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-15 04:31:26
7c040dc6-f960-42dc-af21-cffef0170915 2017-09-15 13:29:15 Firefox 57.0a1 20170914220209 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-15 09:16:34
Assignee | ||
Comment 96•7 years ago
|
||
(In reply to skywalker333 from comment #95)
> Are these crash reports relevant? They appeared in Nightly recently.
>
> 00e502a7-1bc6-4b77-a3f2-2627e0170919 2017-09-19 15:35:04 Firefox 57.0a1
> 20170918220054 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> 2017-09-19 15:12:59
> 2a12f86b-5f6c-48db-ad9c-c588d0170917 2017-09-17 00:37:59 Thunderbird
> 57.0a1 20170914030208 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> 2017-09-16 15:52:23
> 3ad05691-0c7c-4c20-b645-127a50170916 2017-09-16 15:06:09 Firefox 57.0a1
> 20170914100122 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> 2017-09-15 04:31:26
> 7c040dc6-f960-42dc-af21-cffef0170915 2017-09-15 13:29:15 Firefox 57.0a1
> 20170914220209 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> 2017-09-15 09:16:34
They are unrelated crashes, and should get a new bug. For the crash tracked by this bug, you should see CompositorThreadHolder::Shutdown in the crashing thread stack trace. Sadly the crash signature can't be made more specific (afaik).
Updated•7 years ago
|
Comment 97•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #96)
> (In reply to skywalker333 from comment #95)
> > Are these crash reports relevant? They appeared in Nightly recently.
> >
> > 00e502a7-1bc6-4b77-a3f2-2627e0170919 2017-09-19 15:35:04 Firefox 57.0a1
> > 20170918220054 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> > 2017-09-19 15:12:59
> > 2a12f86b-5f6c-48db-ad9c-c588d0170917 2017-09-17 00:37:59 Thunderbird
> > 57.0a1 20170914030208 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> > 2017-09-16 15:52:23
> > 3ad05691-0c7c-4c20-b645-127a50170916 2017-09-16 15:06:09 Firefox 57.0a1
> > 20170914100122 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> > 2017-09-15 04:31:26
> > 7c040dc6-f960-42dc-af21-cffef0170915 2017-09-15 13:29:15 Firefox 57.0a1
> > 20170914220209 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0
> > 2017-09-15 09:16:34
>
> They are unrelated crashes, and should get a new bug. For the crash tracked
> by this bug, you should see CompositorThreadHolder::Shutdown in the crashing
> thread stack trace. Sadly the crash signature can't be made more specific
> (afaik).
So then these two bugs are not duplicates and should be kept separate.
[Bug 1379619 is specific to Mac OS X.
Whereas Bug 1389021 is specific to Windows OS.]
Updated•7 years ago
|
Crash Signature: mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
[@ s… → mozilla::PrioritizedEventQueue<T>::GetEvent]
[@ shutdownhang | NtWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
[@ shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | mozilla::TimeStamp::operator-]
Comment 100•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #74)
> [Is this code covered by automated tests?]: The code is run when it shuts
> down the test harness in general.
> [Has the fix been verified in Nightly?]: Yes. The shutdown hang crash
> reports have stopped coming in.
> [Needs manual test from QE? If yes, steps to reproduce]: No.
Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that it has automated coverage.
Flags: qe-verify-
Updated•7 years ago
|
Comment 101•7 years ago
|
||
Are we sure the bug is actually fixed? It's possible the signature just shifted a bit.
E.g. look at https://crash-stats.mozilla.com/search/?proto_signature=~CompositorThreadHolder%3A%3AShutdown&product=Firefox&date=%3E%3D2017-09-23T10%3A57%3A41.000Z&date=%3C2017-09-26T10%3A57%3A41.000Z&_sort=-date&_facets=signature&_facets=proto_signature&_facets=content_sandbox_level&_facets=content_sandbox_capable&_facets=content_sandbox_enabled&_facets=content_sandbox_capabilities&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature.
Assignee | ||
Comment 102•7 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #101)
> Are we sure the bug is actually fixed? It's possible the signature just
> shifted a bit.
> E.g. look at
> https://crash-stats.mozilla.com/search/
> ?proto_signature=~CompositorThreadHolder%3A%3AShutdown&product=Firefox&date=%
> 3E%3D2017-09-23T10%3A57%3A41.000Z&date=%3C2017-09-26T10%3A57%3A41.
> 000Z&_sort=-
> date&_facets=signature&_facets=proto_signature&_facets=content_sandbox_level&
> _facets=content_sandbox_capable&_facets=content_sandbox_enabled&_facets=conte
> nt_sandbox_capabilities&_columns=date&_columns=signature&_columns=product&_co
> lumns=version&_columns=build_id&_columns=platform#facet-proto_signature.
Yes. It went away from 56. I think this is fallout from bug 1401668. Maybe I should back it out on central to confirm now that bug 1398070 will have made us more forgiving on a late free, and to make a real fix less urgent.
Updated•3 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•