1389021 - shutdownhangs in mozilla::layers::CompositorThreadHolder::Shutdown

Reporter

Description

•

7 years ago

[Tracking Requested - why for this release]: in the first hours of the 56.0b1 rollout this issue is manifesting itself as one of the main crash issues (13% of all browser crashes) This bug was filed from the Socorro interface and is report bp-c8561832-aae4-44ce-ac47-61a290170810. ============================================================= Crashing Thread (0) Frame Module Signature Source 0 ntdll.dll KiFastSystemCallRet 1 ntdll.dll ZwWaitForKeyedEvent 2 ntdll.dll RtlSleepConditionVariableCS 3 kernel32.dll SleepConditionVariableCS 4 mozglue.dll mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) mozglue/misc/ConditionVariable_windows.cpp:58 5 xul.dll mozilla::CondVar::Wait(unsigned int) obj-firefox/dist/include/mozilla/CondVar.h:68 6 xul.dll nsEventQueue::GetEvent(bool, nsIRunnable**, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsEventQueue.cpp:76 7 xul.dll nsThread::nsChainedEventQueue::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsThread.cpp:839 8 xul.dll nsThread::GetEvent(bool, nsIRunnable**, unsigned short*, mozilla::BaseAutoLock<mozilla::Mutex>&) xpcom/threads/nsThread.cpp:1297 9 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1380 10 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:480 11 xul.dll mozilla::layers::CompositorThreadHolder::Shutdown() gfx/layers/ipc/CompositorThread.cpp:137 12 xul.dll gfxPlatform::ShutdownLayersIPC() gfx/thebes/gfxPlatform.cpp:1039 13 xul.dll mozilla::ShutdownXPCOM(nsIServiceManager*) xpcom/build/XPCOMInit.cpp:887 14 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/nsAppRunner.cpp:1464 15 xul.dll mozilla::UniquePtr<ScopedXPCOMStartup, mozilla::DefaultDelete<ScopedXPCOMStartup> >::reset(ScopedXPCOMStartup*) obj-firefox/dist/include/mozilla/UniquePtr.h:345 16 xul.dll XREMain::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4800 17 xul.dll XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4867 18 xul.dll mozilla::BootstrapImpl::XRE_main(int, char** const, mozilla::BootstrapConfig const&) toolkit/xre/Bootstrap.cpp:45 19 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:115 20 firefox.exe __scrt_common_main_seh f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253 21 kernel32.dll BaseThreadInitThunk 22 ntdll.dll __RtlUserThreadStart 23 ntdll.dll _RtlUserThreadStart these shutdownhang crashes on windows with mozilla::layers::CompositorThreadHolder::Shutdown are regressing in volume in the 56 cycle. the first affected build on nightly seems to have been 56.0a1 20170615030208 - pushlog to the day before: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b266a8d8fd595b84a7d6218d7b8c6b7af0b5027c&tochange=035c25bef7b5e4175006e63eff10c61c2eef73f1 the associated crash signatures are generic shutdownhang signature lumping together a bunch of different issues - this is the query i've used to filter for the crashes described in this report: https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown&version=56.0a1&version=57.0a1&version=56.0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 1

•

7 years ago

Milan, this is shaping up to be a top crash in beta 56, can you find someone to investigate? Thanks.

tracking-firefox56: ? → +

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 2

•

7 years ago

Let's see if Andrew can make heads or tails out of this.

Flags: needinfo?(milan) → needinfo?(aosmond)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 3

•

7 years ago

Threads 19 and 20 are image decoder threads...

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

7 years ago

Assignee: nobody → hshih

Blocks: 1388995

Peter Chang[:pchang]

Comment 4

•

7 years ago

I checked several crash reports and all of them had the following configuration. features":{"compositor":"basic","gpuProcess":{"status":"unavailable"},"advancedLayers":{"status":"disabled"},"d3d11":{"status":"failed"},"d2d":{"status":"blacklisted","version":"1.1"}}},"isWow64":false}, "settings":{"blocklistEnabled":true,"e10sEnabled":true,"e10sMultiProcesses":4,"e10sCohort":"multiBucket4",

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 5

•

7 years ago

And all crashes occurred at win7.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 6

•

7 years ago

Pretty sure I understand what went wrong here -- I believe I broke it in part 5 of bug 1365927. Patch coming up.

Assignee: hshih → aosmond

Blocks: 1365927

Status: NEW → ASSIGNED

Flags: needinfo?(aosmond)

Marcia Knous [:marcia]

Updated

•

7 years ago

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 7

•

7 years ago

Attached patch Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy., v1 (obsolete) (deleted) — Details — Splinter Review

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cfa51eb247a5fb9e345abe3218b814f2b5a3d502

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Attachment #8897011 - Flags: review?(dvander)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 8

•

7 years ago

Comment on attachment 8897011 [details] [diff] [review] Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy., v1 Review of attachment 8897011 [details] [diff] [review]: ----------------------------------------------------------------- :dvander on PTO, let's get a quicker review

Attachment #8897011 - Flags: review?(dvander) → review?(nical.bugzilla)

Nicolas Silva [:nical]

Updated

•

7 years ago

Attachment #8897011 - Flags: review?(nical.bugzilla) → review+

Pulsebot

Comment 9

•

7 years ago

Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/dd4bf41ca47d Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy. r=nical

Pulsebot

Comment 10

•

7 years ago

Backout by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/a9f759fd05fd Backed out changeset dd4bf41ca47d for bustage

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 11

•

7 years ago

Attached patch Release the compositor thread after CompositorManagerParent IPDL reference is released., v2 (deleted) — Details — Splinter Review

Let's try this again... try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=802c95d19b25ca1373006fe3cf14eb86e55c7da9 Did a full try run, none of the failures seem related to this change. I broke up the freeing of the thread and the freeing of CompositorManagerParent. It appears the failures were caused by trying to free shared memory after the CompositorManagerParent was freed. So we need to keep it around for just as long as we do now, but ensure the compositor thread itself can go free when the IPDL is shutdown.

Attachment #8897011 - Attachment is obsolete: true

Attachment #8898062 - Flags: review?(nical.bugzilla)

Nicolas Silva [:nical]

Updated

•

7 years ago

Attachment #8898062 - Flags: review?(nical.bugzilla) → review+

Pulsebot

Comment 12

•

7 years ago

Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c289ff85f94b Release the compositor thread after CompositorManagerParent IPDL reference is released. r=nical

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 13

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

Wayne Mery (:wsmwk)

Comment 14

•

7 years ago

(In reply to Wes Kocher (:KWierso) from comment #13) > https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b This patch will have appeared in 08-18 nightly build. Reduced weekend usage makes it is unclear whether nightly crash rate dropped. But judging from two queries for the most common nightly signatures containing RtlSleepConditionVariableCS, the crash definitely is not gone https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWaitForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-20T11%3A52%3A37.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWaitForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08-20T11%3A52%3A37.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

Peter Chang[:pchang]

Comment 15

•

7 years ago

(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #14) > (In reply to Wes Kocher (:KWierso) from comment #13) > > https://hg.mozilla.org/mozilla-central/rev/c289ff85f94b > > This patch will have appeared in 08-18 nightly build. Reduced weekend usage > makes it is unclear whether nightly crash rate dropped. But judging from two > queries for the most common nightly signatures containing > RtlSleepConditionVariableCS, the crash definitely is not gone > > https://crash-stats.mozilla.com/signature/ > ?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWai > tForAlertByThreadId%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditio > nVariableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08- > 20T11%3A52%3A37. > 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum > ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=- > date&page=1#reports > > https://crash-stats.mozilla.com/signature/ > ?product=Firefox&release_channel=nightly&signature=shutdownhang%20%7C%20NtWai > tForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVari > ableCS&date=%3E%3D2017-08-06T11%3A52%3A37.000Z&date=%3C2017-08- > 20T11%3A52%3A37. > 000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_colum > ns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=- > date&page=1#reports From the first query, I still saw crash from 20170820 nightly build. Jerry, could you help to check the fix was include the build or not? If yes, then we need to re-open this issue and check it. c00f79dc-97ac-4604-a5c2-ddbaa0170820 2017-08-20 11:16:43 Firefox 57.0a1 20170819100442

Flags: needinfo?(hshih)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 16

•

7 years ago

The version of "Firefox 57.0a1 20170819100442": https://download-origin.cdn.mozilla.net/pub/firefox/nightly/2017/08/2017-08-19-10-04-42-mozilla-central/firefox-57.0a1.en-US.win64.txt It already contains the fix is this bug. hg log --follow -r :4f4487cc2d30d988742109868dcf21c4113f12f5 | grep c289ff85f94b => changeset: 421990:c289ff85f94b So, reopen this bug.

Status: RESOLVED → REOPENED

Flags: needinfo?(hshih)

Resolution: FIXED → ---

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 17

•

7 years ago

There are a lot of different crash signatures here and only a few crashes on nightly. We could still uplift this fix to beta (for beta 6) if you think it will mitigate the issue for beta users. What do you think?

Flags: needinfo?(aosmond)

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 18

•

7 years ago

Unfortunately this missed the beta 5 build today. We could uplift the current patch for beta 6 on Thursday though.

Marcia Knous [:marcia]

Comment 19

•

7 years ago

shutdownhang | ZwWaitForKeyedEvent | RtlSleepConditionVariableCS | SleepConditionVariableCS signature was the top ranked browser crash in Beta 4.

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Updated

•

7 years ago

tracking-firefox56: + → blocking

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 20

•

7 years ago

Attached patch [uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released. (deleted) — Details — Splinter Review

MozReview-Commit-ID: 3eVziNDH59f

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 21

•

7 years ago

Comment on attachment 8900253 [details] [diff] [review] [uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released. :aosmond is in PTO, so I request this beta uplift. And I will still investigate the remaining crash with the same signature like: https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c16 --- Approval Request Comment [Feature/Bug causing the regression]: bug 1365927 Please check https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c6 [User impact if declined]: We might hit the shutdown timeout. [Is this code covered by automated tests?]: Already pass the try in Nightly. https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c11 [Has the fix been verified in Nightly?]: https://bugzilla.mozilla.org/show_bug.cgi?id=1389021#c18 We still see the same crash in nightly, but the number is fewer. [Needs manual test from QE? If yes, steps to reproduce]: none [List of other uplifts needed for the feature/fix]: none [Is the change risky?]: no. [Why is the change risky/not risky?]: The "mozilla::layers::CompositorThreadHolder::Shutdown()" hang is because that CompositorThreadHolder::Shutdown() is waiting for the CompositorThreadHolder obj released. And this patch try to release the CompositorThreadHolder obj ref-counting during shutdown. [String changes made/needed]: none

Attachment #8900253 - Flags: approval-mozilla-beta?

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 22

•

7 years ago

Comment on attachment 8900253 [details] [diff] [review] [uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released. Review of attachment 8900253 [details] [diff] [review]: ----------------------------------------------------------------- Should fix a beta top crash or at least mitigate the crash rate. Aiming for beta 6 here.

Attachment #8900253 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 23

•

7 years ago

n-i to myself to check the crash rate on Friday after beta 6 is released.

Flags: needinfo?(lhenry)

Ryan VanderMeulen [:RyanVM]

Comment 24

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/b08da0b32226

status-firefox56: affected → fixed

Marcia Knous [:marcia]

Comment 25

•

7 years ago

I just ran a quick query, and it looks as if this is still happening in B6: http://bit.ly/2vDeOs1

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 26

•

7 years ago

It may be happening less with beta 6. Most 56 users are now on beta 6 (as of Monday morning). So we'll see from today's crash rate. Jerry or Andrew, can you look at the signatures coming in for 56 beta 6?

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Updated

•

7 years ago

Flags: needinfo?(hshih)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 27

•

7 years ago

I'm checking for this issue now.

Flags: needinfo?(hshih)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 28

•

7 years ago

I'm looking into this again as well, although I no longer have a straightforward explanation as to why this is still happening. I only see reports for when the compositor thread is in the UI process -- this may just be an artifact of the reporting however because I see no "shutdownhang" crashes for the GPU process. Linux and Mac do not appear to be affected as they have relatively few shutdownhang reports in the last month, and none of them (that I checked) matched the profile -- this is interesting because Linux and Mac don't have the GPU process enabled at all so all users could be affected if it reproduced there (instead of just the subset of Windows users which disable the GPU process).

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 29

•

7 years ago

Currently, the crash are all at win7. And there is no gpu process in these crash reports. I try to add some log in the AddRef()/Release() for compositorThreadHolder object, but I don't see the non-matched addRef/Release pair for compositorThreadHolder. checking...

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 30

•

7 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #29) > Currently, the crash are all at win7. And there is no gpu process in these > crash reports. > I try to add some log in the AddRef()/Release() for compositorThreadHolder > object, but I don't see the non-matched addRef/Release pair for > compositorThreadHolder. > checking... Are you sure it is tied to the Windows version? I see reports on Windows 8.1, Windows 10.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 31

•

7 years ago

Hi Andrew, Here is my query string. https://crash-stats.mozilla.com/signature/?_sort=-date&platform=Windows&version=56.0a1&version=57.0a1&version=56.0b&signature=shutdownhang%20%7C%20ZwWaitForKeyedEvent%20%7C%20RtlSleepConditionVariableCS%20%7C%20SleepConditionVariableCS&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-31T13%3A50%3A00.000Z&proto_signature=%7Emozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown And all crashes are at win7. And here is another query string. "Most" of crashes doesn't use gpu process. https://crash-stats.mozilla.com/search/?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdown&telemetry_environment=~%22gpuProcess%22%3A%7B%22status%22%3A%22unavailable%22%7D&version=56.0a1&version=57.0a1&version=56.0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08-31T13%3A50%3A00.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature We use some tools to open the firefox and close it. Then, I saw one shutdown hang log. I will turn on the addRef/Release call log for compositorThreadHolder.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 32

•

7 years ago

Probably just an artifact of that particular trace, as other versions of Windows show up on other queries, e.g. Windows 10: https://crash-stats.mozilla.com/report/index/08eea1f9-7ff3-4020-8435-01b2b0170901 https://crash-stats.mozilla.com/report/index/6f8b8c32-c399-42aa-9f4d-52da10170901 Windows 8.1: https://crash-stats.mozilla.com/report/index/0c87f205-ee10-450e-a80e-aeffb0170901 https://crash-stats.mozilla.com/report/index/abd4e87d-9ea0-42b0-8396-9eeeb0170830

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 33

•

7 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #31) > > And here is another query string. > "Most" of crashes doesn't use gpu process. > https://crash-stats.mozilla.com/search/ > ?proto_signature=~mozilla%3A%3Alayers%3A%3ACompositorThreadHolder%3A%3AShutdo > wn&telemetry_environment=~%22gpuProcess%22%3A%7B%22status%22%3A%22unavailable > %22%7D&version=56.0a1&version=57.0a1&version=56. > 0b&platform=Windows&date=%3E%3D2017-01-01T00%3A00%3A00.000Z&date=%3C2017-08- > 31T13%3A50%3A00.000Z&_sort=- > date&_facets=signature&_facets=version&_facets=user_comments&_facets=uptime&_ > facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platfo > rm_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cp > u_arch&_facets=process_type&_facets=moz_crash_reason&_columns=date&_columns=s > ignature&_columns=product&_columns=version&_columns=build_id&_columns=platfor > m#facet-signature > Hmm, the crash reports are all for the main process, as the only way we hit this particular code path (gfxPlatform::ShutdownLayersIPC) is for shutting down the compositor thread in the main process. Looking for the alternative path for the GPU process (mozilla::gfx::GPUParent::ActorDestroy) turns up nothing. However tweaking your query to search for Linux and Mac OS X did turn up a handful of recent results on Linux, e.g. https://crash-stats.mozilla.com/report/index/39a8a8cb-cf19-4544-8087-0f94b0170821 https://crash-stats.mozilla.com/report/index/d1c54c34-f447-4eda-8526-94c190170827 https://crash-stats.mozilla.com/report/index/cefb6e21-7254-4f16-816e-b99bc0170816 https://crash-stats.mozilla.com/report/index/dc37eb9c-7a4d-47d2-9d3f-376510170815 https://crash-stats.mozilla.com/report/index/27d63e0f-f9f1-4e0e-bb67-65f080170815 Another variant of Linux is it blocking on acquiring the MessageChannel mutex: https://crash-stats.mozilla.com/report/index/b95ba2b6-2cd8-4537-852f-c00bb0170826 I suspect unrelated, but notable in case it offers additional clues. It was also observed on Mac OS X, but last seen on build 20170728100358 for 56.0a1, e.g.: https://crash-stats.mozilla.com/report/index/61fce4e1-5465-47f8-8cbe-4771c0170812 If I can glean anything from these, I will note it on the bug. > We use some tools to open the firefox and close it. Then, I saw one shutdown > hang log. I will turn on the addRef/Release call log for > compositorThreadHolder. Great news -- if you are able to reproduce again with the extra logging, I look forward to seeing the results :D.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 34

•

7 years ago

It appears the Linux crashes only started in build 20170814100258. It has bug 1383742 attached to the signature, which was found to be introduced by bug 1351148. Bug 1351148 relanded on central on 2017-08-13. The volume is very low on Linux 57, and it wasn't promoted to 56, so it cannot explain our 56 crashes. It also does not appear to have had an impact on the Windows 57 crash rate.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 35

•

7 years ago

The first build containing the changes in bug 1365927 was 20170615030208 (https://hg.mozilla.org/mozilla-central/rev/035c25bef7b5e4175006e63eff10c61c2eef73f1, confirmed as present in the build via https://hg.mozilla.org/mozilla-central/log/035c25bef7b5e4175006e63eff10c61c2eef73f1/gfx/layers/ipc/CompositorManagerChild.cpp). The first build that we see the crash signature however is the *next day*'s nightly, 20170616030207. There appear to be lots of telemetry results / crash reports in general for both builds, so unless 20170615030208 got pulled early (I do see the top crash volume (IPCError) got cut in half between 20170615030208 and 20170616030207), it seems it is in the realm of possibility the change causing the crash was not bug 1365927 but something that came the next day. Pushlog between the two builds: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=035c25bef7b5e4175006e63eff10c61c2eef73f1&tochange=fe809f57bf2287bb937c3422ed03a63740b3448b

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 36

•

7 years ago

Milan confirmed that there was nothing especially bad about 20170615030208 (e.g. such that we pulled it earlier and thus maybe didn't have enough users on it to hit the crash). It was also on a Thursday which has "normal" usage patterns. Jerry, if your own testing with the AddRef/Release logging doesn't turn up anything useful, could you throw build 20170615030208 into the tool you are using to reproduce? If it doesn't happen there, but it *does* happen on 20170616030207, I think we need to hook it up to mozregression to figure out what the offending change is. The tinderbox builds should still be available for us to narrow this down without having to rebuild ourselves.

Flags: needinfo?(aosmond) → needinfo?(hshih)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 37

•

7 years ago

My bad, hg itself contains the actual build information. It seems it was in a day earlier than I thought: first release with nightly linux32 b266a8d8fd59 / 56.0a1 / 20170614100332 / files nightly linux64 b266a8d8fd59 / 56.0a1 / 20170614100332 / files nightly mac b266a8d8fd59 / 56.0a1 / 20170614030206 / files nightly win32 b266a8d8fd59 / 56.0a1 / 20170614030206 / files nightly win64 b266a8d8fd59 / 56.0a1 / 20170614030206 / files last release without nightly linux32 5293e5f89358 / 56.0a1 / 20170613100218 / files nightly linux64 5293e5f89358 / 56.0a1 / 20170613100218 / files nightly mac 5293e5f89358 / 56.0a1 / 20170613030203 / files nightly win32 5293e5f89358 / 56.0a1 / 20170613030203 / files nightly win64 5293e5f89358 / 56.0a1 / 20170613030203 / files This is even more suggestive that it was something new in the 20170616 build.

[:philipp]

Reporter

Comment 38

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #35) > The first build that we see the crash signature however is the *next day*'s > nightly, 20170616030207. i'm not sure if this is correct - in crash stats i see those crashes already popping up starting with build 20170615030208 like bp-0d5df06e-bc3a-4230-ade5-47ebb0170626

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 39

•

7 years ago

Attached file CompositorThreadHolder ref-count log (deleted) — Details

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 40

•

7 years ago

After adding the ref-count log, I only got one crash log at win7. Here is the log: attachment 8904211 [details]. I add debug message in CompositorThreadHolder::Shutdown(), but I don't see the debug message for this function. Another interesting thing is that I saw some logs like "[Child 263496] WARNING: '!mMainThread', file c:/mozilla-source/git/gecko-dev/xpcom/threads/nsThreadManager.cpp, line 320". It means that the main thread is already deleted. But the main thread deletion is called after CompositorThreadHolder::Shutdown(). So, I add some additional log in "nsThreadManager::GetMainThread()" for more information.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 41

•

7 years ago

I have another crash log which show the gc module try to post some tasks after the main thread deleted. I am not sure it's related to this problem. Still investigating.

Julien Cristau [:jcristau]

Comment 42

•

7 years ago

This is still high volume in 56.0b8

status-firefox56: fixed → affected

status-firefox57: fixed → affected

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 43

•

7 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #40) > After adding the ref-count log, I only got one crash log at win7. Here is > the log: attachment 8904211 [details]. > > I add debug message in CompositorThreadHolder::Shutdown(), but I don't see > the debug message for this function. > Another interesting thing is that I saw some logs like "[Child 263496] > WARNING: '!mMainThread', file > c:/mozilla-source/git/gecko-dev/xpcom/threads/nsThreadManager.cpp, line > 320". It means that the main thread is already deleted. > But the main thread deletion is called after > CompositorThreadHolder::Shutdown(). So, I add some additional log in > "nsThreadManager::GetMainThread()" for more information. Looks like the thread which the shutdown hang is reported on is not the main thread (since it doesn't match the thread addref and release was called on). From that I can only deduce the content process CompositorManagerParent never gets deallocated. From the crash reports I have looked at, the compositor thread is never blocked and just waiting for events to come in. This would suggest then that the close request from the content processes never came. I checked a number of crash reports and couldn't find any content process crash reports around the same time as the thread shutdown hang. I wonder if the content process is blocking in CompositorBridgeChild::ShutDown. Maybe CompositorBridgeChild::ActorDestroy got called on its own somehow, so we cleared mCanSend which prevents CompositorBridgeChild::Destroy from running, which in turn is the only way CompositorBridgeChild::AfterDestroy gets called, which in turn is how sCompositorBridge is cleared which is the spin condition in CompositorBridgeChild::ShutDown? And the parent process crashes first on the shutdown hang so we never get to see the content process shutdown hang? I can prepare something to fix this, but it is just more speculation from me ;).

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

7 years ago

Keywords: topcrash

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 44

•

7 years ago

Attached patch CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1 (deleted) — Details — Splinter Review

As I said, total speculation. Jerry, can you take a look at the patch and give your thoughts? And maybe apply this patch to build running in that special test harness? My thought process is moving entirely to a badly behaving content process now over something gone wrong in the parent process, but I could be completely off base here... try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9257f97a9fdb64b05c98af94133bae0c11d3b70e

Attachment #8904681 - Flags: feedback?(hshih)

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 45

•

7 years ago

I am not sure I catch your idea. But the patch was applied for the shutdown test. Hope there will have more info for this issue. Let me think more for your description.

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 46

•

7 years ago

I set a infinite loop in the CompositorBridgeChild::ShutDown(), then we will have the same shutdown hang problem as this bug. Since the parent side are waiting for the CompositorBridgeParent::ActorDestroy(), but the child side never sends that. The interesting thing is that parent process has a shutdown hang watchdog, but the content process doesn't. I setup 2 laptop for the the original beta code and attachment 8904681 [details] [diff] [review], but there is no more crash occurred from my testing environment. I can't tell the attachment 8904681 [details] [diff] [review] could fix the problem but at least the sCompositorBridge could be always released.

Flags: needinfo?(hshih)

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

7 years ago

Attachment #8904681 - Flags: feedback?(hshih) → feedback+

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Attachment #8904681 - Attachment description: [WIP] CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1 → CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1

Attachment #8904681 - Flags: review?(nical.bugzilla)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 47

•

7 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #45) > I am not sure I catch your idea. But the patch was applied for the shutdown > test. Hope there will have more info for this issue. Let me think more for > your description. I should probably explain on the bug itself. When CompositorManagerChild::Shutdown is called, it in turn delegates to CompositorBridgeChild::ShutDown before calling CompositorManagerChild::Close and thus starting the process required to release the compositor thread reference in the parent process. CompositorBridgeChild::ShutDown calls CompositorBridgeChild::Destroy and then spin the main thread until the static reference to itself is released. Ordinarily Destroy will tear down the CompositorBridgeChild bits and bobs, and then trigger CompositorBridgeChild::AfterDestroy (via a dispatch to the main thread) to delete the IPDL object and clear said static reference. Then ShutDown stops spinning, and we call Close. If however CompositorBridgeChild::ActorDestroy already got called (somehow -- I did not think too hard on why...), then Destroy does nothing. And we will loop waiting for the static reference to be cleared (which will never happen, since it only happens in AfterDestroy). Which in turn means Close never gets called, and the compositor thread references linger.

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Priority: -- → P3

Whiteboard: [gfx-noted]

Nicolas Silva [:nical]

Updated

•

7 years ago

Attachment #8904681 - Flags: review?(nical.bugzilla) → review+

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 48

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #47) > (In reply to Jerry Shih[:jerry] (UTC+8) from comment #45) > > I am not sure I catch your idea. But the patch was applied for the shutdown > > test. Hope there will have more info for this issue. Let me think more for > > your description. > > I should probably explain on the bug itself. > > When CompositorManagerChild::Shutdown is called, it in turn delegates to > CompositorBridgeChild::ShutDown before calling CompositorManagerChild::Close > and thus starting the process required to release the compositor thread > reference in the parent process. > > CompositorBridgeChild::ShutDown calls CompositorBridgeChild::Destroy and > then spin the main thread until the static reference to itself is released. > Ordinarily Destroy will tear down the CompositorBridgeChild bits and bobs, > and then trigger CompositorBridgeChild::AfterDestroy (via a dispatch to the > main thread) to delete the IPDL object and clear said static reference. Then > ShutDown stops spinning, and we call Close. > > If however CompositorBridgeChild::ActorDestroy already got called (somehow > -- I did not think too **** why...), then Destroy does nothing. And we > will loop waiting for the static reference to be cleared (which will never > happen, since it only happens in AfterDestroy). Which in turn means Close > never gets called, and the compositor thread references linger. It's more clear. The parent process has a shutdown hang watchdog, but the content process doesn't. I think we should also have a hang monitor at the child side to prevent the problem again. In this case, the child side might still be spinning but we can't know.

Pulsebot

Comment 49

•

7 years ago

Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/df9d3661abb6 CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning. r=nical

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 50

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/df9d3661abb6

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 51

•

7 years ago

For release build, the content process will go through ProcessChild::QuickExit(). Then, the content process will not call xpcom_shutdown. So, it will also not go though CompositorBridgeChild::ShutDown. I'm not sure the attachment #8904681 [details] [diff] [review] could fix the problem. But let's see how it will be.

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 52

•

7 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #51) > For release build, the content process will go through > ProcessChild::QuickExit(). Then, the content process will not call > xpcom_shutdown. So, it will also not go though > CompositorBridgeChild::ShutDown. > I'm not sure the attachment #8904681 [details] [diff] [review] [diff] [review] could fix the > problem. But let's see how it will be. Ugh, I wonder if is this related to reproducing in other ways? Maybe I should follow the exact configuration for a nightly build. I am still seeing reports (4 thus far) on 20170907220212 which contains the new fix. https://crash-stats.mozilla.com/report/index/609a2aab-8fa9-4af5-ba11-315210170908 https://crash-stats.mozilla.com/report/index/5edb45fa-1e37-4e69-9b09-658420170908 At this point, I need to think about what information would be useful to debug and land something which dumps it to the GFX log. Maybe a list of what CompositorManagerParents are still alive at shutdown...

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 53

•

7 years ago

It still calls CompositorBridgeChild::Destroy in a few other places (e.g. by the CompositorSession) that should still happen on shutdown (e.g. nsBaseWidget::DestroyCompositor via WidgetShutdownObserver::Observe). Before my changes this would release the CompositorBridgeParent and its thread before the quick exit. While it still releases the CompositorBridgeParent, the CompositorManagerParent remains up. Maybe we need to do more in these places. It doesn't make sense though because a quick exit should error out the ProcessLink on the other side...surely. But it is a behavioural change, so I guess it is an option.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 54

•

7 years ago

The signature morphed as a result of some other changes (probably the quantum DOM scheduler).

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 55

•

7 years ago

Attached patch AddRef/Release log for CompositorThreadHolder. (obsolete) (deleted) — Details — Splinter Review

Here is the customize AddRef/Release impl for CompositorThreadHolder. It will use gfxCriticalNote to dump the stack for AddRef/Release call. But I can't get the complete log from the gfxCriticalNote. Some logs seems to be overwrote by the newer log.

Pulsebot

Comment 56

•

7 years ago

Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e66c8d65d0f1 Explicitly shutdown the CompositorManagerChild during ContentChild::ActorDestroy. r=me

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 57

•

7 years ago

(In reply to Pulsebot from comment #56) > Pushed by aosmond@gmail.com: > https://hg.mozilla.org/integration/mozilla-inbound/rev/e66c8d65d0f1 > Explicitly shutdown the CompositorManagerChild during > ContentChild::ActorDestroy. r=me I will back this out after we get a few builds with it and observe the effect on crash rates. Fingers crossed.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 58

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e66c8d65d0f1

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 59

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #53) > It still calls CompositorBridgeChild::Destroy in a few other places (e.g. by > the CompositorSession) that should still happen on shutdown (e.g. > nsBaseWidget::DestroyCompositor via WidgetShutdownObserver::Observe). Before > my changes this would release the CompositorBridgeParent and its thread Yes, but it happens at parent process. If the parent is blocked at the CompositorThreadHolder::Shutdown, it should already have ran the nsBaseWidget::DestroyCompositor(). > before the quick exit. While it still releases the CompositorBridgeParent, > the CompositorManagerParent remains up. Maybe we need to do more in these > places. > > It doesn't make sense though because a quick exit should error out the > ProcessLink on the other side...surely. But it is a behavioural change, so I > guess it is an option. The parent actor should be deleted by ActorDestory(). So, I'm still checking for the remaining holder for the compositorThread. https://dxr.mozilla.org/mozilla-central/source/obj-x86_64-pc-linux-gnu/ipc/ipdl/PCompositorManagerParent.cpp#234

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 61

•

7 years ago

If this is fixed in nightly can you request uplift to beta? Or is this still a diagnostic patch?

Flags: needinfo?(lhenry)

[:philipp]

Reporter

Comment 62

•

7 years ago

unfortunately i still see those crashes in today's nightly build 20170912100139 which contains the patch: http://bit.ly/2f3G29h

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 63

•

7 years ago

Yes I agree. Back to the drawing board.

Status: RESOLVED → REOPENED

Flags: needinfo?(aosmond)

Resolution: FIXED → ---

Julien Cristau [:jcristau]

Updated

•

7 years ago

status-firefox57: fixed → affected

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 64

•

7 years ago

Attached patch show the ref-counting for CompositorThreadHolder in XPCOM shutdown. (deleted) — Details — Splinter Review

This patch use CompositorThreadHolderRef to record the different types of CompositorThreadHolder object holder. Then, we could dump the log for the remaining holder type during the XPCOM shutdown. MozReview-Commit-ID: GfUtMD11Ajt

Attachment #8907556 - Flags: feedback?(nical.bugzilla)

Attachment #8907556 - Flags: feedback?(aosmond)

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

7 years ago

Attachment #8906563 - Attachment is obsolete: true

Calixte Denizet (:calixte)

Comment 65

•

7 years ago

Pulsebot

Comment 66

•

7 years ago

Backout by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/bcbb4575e482 Backed out changeset e66c8d65d0f1 because it did not reduce the crash volume.

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Updated

•

7 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1397214

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 67

•

7 years ago

Attached patch Force CompositorManagerParent to close before shutting down the compositor thread., v1 (obsolete) (deleted) — Details — Splinter Review

So two things in the pipeline from my end right now. One, bug 1399453 has landed on inbound which should annotate the crash reports with the current compositor thread references -- when a shutdown hang is tripped, we should be able to just search for "gfxCompositorThread" to find them. Hopefully that makes it into the next build this evening/tomorrow morning. This may help with diagnosing the problem. In parallel, I had a discussion with dvander about the problem. We agreed that in an ideal world, the parent process should be the one initiating the IPDL objects closing, rather than the child processes. I figured this was more risky than it probably is, because as dvander pointed out, the child processes already need to handle the GPU process going away at a random point. This is sort of a backup plan, assuming the annotations don't tell us anything new but hopefully a sound one if the try is clean. try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=92a2008a0c4b243673cc2e57f19bff2c21bb3ec1 try (beta): https://treeherder.mozilla.org/#/jobs?repo=try&revision=030ee6b93585fa860012a965b3cbb2afabf84a8c

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 68

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #67) > Created attachment 8907693 [details] [diff] [review] > Force CompositorManagerParent to close before shutting down the compositor > thread., v1 > > try (central): > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=92a2008a0c4b243673cc2e57f19bff2c21bb3ec1 > try (beta): > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=030ee6b93585fa860012a965b3cbb2afabf84a8c My bad: try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc8e6365ad4398e155ebbdeb67412e808a90157c

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 69

•

7 years ago

Attached patch Force CompositorManagerParent to close before shutting down the compositor thread., v2 (obsolete) (deleted) — Details — Splinter Review

Fix that leak check (grr). try (central): https://treeherder.mozilla.org/#/jobs?repo=try&revision=e27af77640ce1b1e6bef88a27f31d74cc5e0b4fb try (beta): https://treeherder.mozilla.org/#/jobs?repo=try&revision=316a24034803386ae55d6c315200c671f0f0061a

Attachment #8907693 - Attachment is obsolete: true

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 70

•

7 years ago

Comment on attachment 8907841 [details] [diff] [review] Force CompositorManagerParent to close before shutting down the compositor thread., v2 The crash report annotations show that the CompositorManagerChild/Parent for the content processes are still alive. So we were on the right track with the previous patches, but for some reason, they aren't helping. The content process should quick exit and let the OS cleanup the IPC resources but that doesn't seem to be happening either. As said in comment 67, we actually would prefer to move towards shutting down in the parent process anyways, rather than rely upon orderly shutdown in the content processes. This should be safe since they need to handle the GPU process dying at any time in theory. So this patch should move us in the right direction and hopefully fix the hang once and for all.

Attachment #8907841 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 71

•

7 years ago

Comment on attachment 8907841 [details] [diff] [review] Force CompositorManagerParent to close before shutting down the compositor thread., v2 Review of attachment 8907841 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/CompositorManagerParent.cpp @@ +133,5 @@ > AddRef(); > + > + StaticMutexAutoLock lock(sMutex); > + if (!sActiveActors) { > + sActiveActors = new nsTArray<CompositorManagerParent*>(); Arguably we should create the array at startup, and refuse to create new CompositorManagers if this array doesn't exist, since otherwise they might not be severed later. But it's okay as-is.

Attachment #8907841 - Flags: review?(dvander) → review+

Pulsebot

Comment 72

•

7 years ago

Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/7d0893f1b6ce Force CompositorManagerParent to close before shutting down the compositor thread. r=dvander

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 73

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/7d0893f1b6ce

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 74

•

7 years ago

Comment on attachment 8907841 [details] [diff] [review] Force CompositorManagerParent to close before shutting down the compositor thread., v2 No crash reports since the change landed. We typically have seen one by now, so I'm ready to call it. Approval Request Comment [Feature/Bug causing the regression]: Bug 1365927 [User impact if declined]: May experience shutdown hangs on Windows. [Is this code covered by automated tests?]: The code is run when it shuts down the test harness in general. [Has the fix been verified in Nightly?]: Yes. The shutdown hang crash reports have stopped coming in. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: No. [Why is the change risky/not risky?]: It is self contained within one module and easily auditable. While the CompositorManager and subprotocols could be closed earlier (in theory) than we have historically tested, the content processes already support the GPU process going away unexpectedly (e.g. it crashes and later recovers), which is all this emulates from the content process perspective. There should not be any state information lost as a result of the earlier close. [String changes made/needed]: None.

Attachment #8907841 - Flags: approval-mozilla-beta?

Julien Cristau [:jcristau]

Comment 75

•

7 years ago

Comment on attachment 8907841 [details] [diff] [review] Force CompositorManagerParent to close before shutting down the compositor thread., v2 56 is now on m-r. See comment 74.

Attachment #8907841 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 76

•

7 years ago

Comment on attachment 8907841 [details] [diff] [review] Force CompositorManagerParent to close before shutting down the compositor thread., v2 Looks like this should fix a topcrash blocking the 56 release. Please uplift for the RC build.

Attachment #8907841 - Flags: approval-mozilla-release? → approval-mozilla-release+

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

Depends on: 1400116

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox55: --- → unaffected

status-firefox-esr52: --- → unaffected

Ryan VanderMeulen [:RyanVM]

Comment 77

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/a01f8c3ec5e9 (FIREFOX_56b13_RELBRANCH) https://hg.mozilla.org/releases/mozilla-release/rev/fce8bb3c16f7

status-firefox56: affected → fixed

Ryan VanderMeulen [:RyanVM]

Comment 78

•

7 years ago

I had to back this out from mozilla-release because it broke debug builds (and I confirmed on Try that this patch was the culprit). Basically, Windows and Linux builds would appear to finish their checktests and then just hang until eventually timing out. Also, OSX debug xpcshell would similarly appear to just hang and end up timing out during chrome/test/unit_ipc/test_resolve_uris_ipc.js. https://hg.mozilla.org/releases/mozilla-release/rev/28c16d0c9746f7e5f4a5d01c21bd3fede6c6e1a5 https://treeherder.mozilla.org/logviewer.html#?job_id=131483091&repo=mozilla-release https://treeherder.mozilla.org/logviewer.html#?job_id=131477593&repo=mozilla-release

status-firefox56: fixed → affected

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 79

•

7 years ago

Ugh, I should have posted my own trys against mozilla-release instead of mozilla-beta at the time of posted I take it.

Ryan VanderMeulen [:RyanVM]

Comment 80

•

7 years ago

Your Try push from comment 69 against Beta shows the same problem.

Ryan VanderMeulen [:RyanVM]

Comment 81

•

7 years ago

And no, you did the right thing at the time. 56 went from m-b to m-r on the 14th, so your push against Beta on the 13th was fine.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 82

•

7 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #81) > And no, you did the right thing at the time. 56 went from m-b to m-r on the > 14th, so your push against Beta on the 13th was fine. Ugh -- I guess I didn't think it to be related. Is this happening on central as well (to your knowledge)? I know about the leakcheck failure, which is in my queue to look at. But that seems less serious than this.

Ryan VanderMeulen [:RyanVM]

Comment 83

•

7 years ago

It's not happening on m-c. I'm quite certain it would have been backed out already had it been :)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 84

•

7 years ago

This causes permaleaks (bug 1374856) on Linux x64 asan mda2, see https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&tochange=ea39b11826c9f4299ae84231d76a97abecbf9747&fromchange=7fbfb64d7b9b351e557785987baecf4e0cbd1649&filter-searchStr=linux%20asan%20mda2&selectedJob=131595760

Depends on: 1374856

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 85

•

7 years ago

Backed out for leaking in mda2 on Linux x64 asan (bug 1374856) and because the patch was only applied on central: https://hg.mozilla.org/mozilla-central/rev/877d536658e644bd0d1662d608cc2ca3a9fa8d9f

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

7 years ago

Status: RESOLVED → REOPENED

status-firefox57: fixed → affected

Resolution: FIXED → ---

Target Milestone: mozilla57 → ---

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 86

•

7 years ago

Attached patch Force CompositorManagerParent to close before shutting down the compositor thread., v3 [carries r=dvander,me] (deleted) — Details — Splinter Review

I fixed the asan/leakcheck issue, so this should be landable on mozilla-central: https://treeherder.mozilla.org/#/jobs?repo=try&revision=223084d2affbc22de128e0bdfb3bef4a4d61116a And I fixed the 56 branch debug vs release issue by not shutting down in the parent in debug builds. I believe the proper fix will be to always shutdown in the parent in both debug and release but we are already past the 11th hour...

Attachment #8907841 - Attachment is obsolete: true

Flags: needinfo?(aosmond)

Attachment #8909466 - Flags: review+

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 87

•

7 years ago

We will need to uplift both attachment 8904681 [details] [diff] [review] and attachment 8909466 [details] [diff] [review], assuming the try is clean. try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a6b883cb23225f54dc96d8f0d3b55936031ec92

Pulsebot

Comment 88

•

7 years ago

Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/4dd86bbce649 Force CompositorManagerParent to close before shutting down the compositor thread. r=dvander,me

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 89

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4dd86bbce649

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

Ryan VanderMeulen [:RyanVM]

Comment 90

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/402d087e349e https://hg.mozilla.org/releases/mozilla-release/rev/8a5176d8e799

status-firefox56: affected → fixed

Nicolas Silva [:nical]

Comment 91

•

7 years ago

Comment on attachment 8907556 [details] [diff] [review] show the ref-counting for CompositorThreadHolder in XPCOM shutdown. I am late to the game sorry.

Attachment #8907556 - Flags: feedback?(nical.bugzilla)

Jerry Shih[:jerry] (UTC+8) (inactive)

Updated

•

7 years ago

Attachment #8907556 - Flags: feedback?(aosmond)

Jim Mathies [:jimm]

Updated

•

7 years ago

Blocks: 1402037

Trevor Skywalker

Updated

•

7 years ago

Trevor Skywalker

Comment 94

•

7 years ago

Bug 1379619 is specific to Mac OS X. Whereas Bug 1389021 is specific to Windows OS. These bugs were marked as duplicates of each other. But with different operating systems being affected, and different crash signatures, should they be left separate instead? Maybe linked via "See Also"? Or should the problems with both bugs and operating systems be merged? Changing the OS on Bug 1389021 from Windows to All.

Trevor Skywalker

Comment 95

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #74) > Comment on attachment 8907841 [details] [diff] [review] > Force CompositorManagerParent to close before shutting down the compositor > thread., v2 > > No crash reports since the change landed. We typically have seen one by now, so I'm ready to call it. > > Approval Request Comment > [Feature/Bug causing the regression]: Bug 1365927 Are these crash reports relevant? They appeared in Nightly recently. 00e502a7-1bc6-4b77-a3f2-2627e0170919 2017-09-19 15:35:04 Firefox 57.0a1 20170918220054 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-19 15:12:59 2a12f86b-5f6c-48db-ad9c-c588d0170917 2017-09-17 00:37:59 Thunderbird 57.0a1 20170914030208 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-16 15:52:23 3ad05691-0c7c-4c20-b645-127a50170916 2017-09-16 15:06:09 Firefox 57.0a1 20170914100122 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-15 04:31:26 7c040dc6-f960-42dc-af21-cffef0170915 2017-09-15 13:29:15 Firefox 57.0a1 20170914220209 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 2017-09-15 09:16:34

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 96

•

7 years ago

(In reply to skywalker333 from comment #95) > Are these crash reports relevant? They appeared in Nightly recently. > > 00e502a7-1bc6-4b77-a3f2-2627e0170919 2017-09-19 15:35:04 Firefox 57.0a1 > 20170918220054 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > 2017-09-19 15:12:59 > 2a12f86b-5f6c-48db-ad9c-c588d0170917 2017-09-17 00:37:59 Thunderbird > 57.0a1 20170914030208 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > 2017-09-16 15:52:23 > 3ad05691-0c7c-4c20-b645-127a50170916 2017-09-16 15:06:09 Firefox 57.0a1 > 20170914100122 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > 2017-09-15 04:31:26 > 7c040dc6-f960-42dc-af21-cffef0170915 2017-09-15 13:29:15 Firefox 57.0a1 > 20170914220209 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > 2017-09-15 09:16:34 They are unrelated crashes, and should get a new bug. For the crash tracked by this bug, you should see CompositorThreadHolder::Shutdown in the crashing thread stack trace. Sadly the crash signature can't be made more specific (afaik).

Trevor Skywalker

Updated

•

7 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1379619

Trevor Skywalker

Comment 97

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #96) > (In reply to skywalker333 from comment #95) > > Are these crash reports relevant? They appeared in Nightly recently. > > > > 00e502a7-1bc6-4b77-a3f2-2627e0170919 2017-09-19 15:35:04 Firefox 57.0a1 > > 20170918220054 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > > 2017-09-19 15:12:59 > > 2a12f86b-5f6c-48db-ad9c-c588d0170917 2017-09-17 00:37:59 Thunderbird > > 57.0a1 20170914030208 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > > 2017-09-16 15:52:23 > > 3ad05691-0c7c-4c20-b645-127a50170916 2017-09-16 15:06:09 Firefox 57.0a1 > > 20170914100122 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > > 2017-09-15 04:31:26 > > 7c040dc6-f960-42dc-af21-cffef0170915 2017-09-15 13:29:15 Firefox 57.0a1 > > 20170914220209 Mac OS X EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 0x0 > > 2017-09-15 09:16:34 > > They are unrelated crashes, and should get a new bug. For the crash tracked > by this bug, you should see CompositorThreadHolder::Shutdown in the crashing > thread stack trace. Sadly the crash signature can't be made more specific > (afaik). So then these two bugs are not duplicates and should be kept separate. [Bug 1379619 is specific to Mac OS X. Whereas Bug 1389021 is specific to Windows OS.]

Trevor Skywalker

Updated

•

7 years ago

Andrei Vaida [:avaida]

Comment 100

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #74) > [Is this code covered by automated tests?]: The code is run when it shuts > down the test harness in general. > [Has the fix been verified in Nightly?]: Yes. The shutdown hang crash > reports have stopped coming in. > [Needs manual test from QE? If yes, steps to reproduce]: No. Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that it has automated coverage.

Flags: qe-verify-

Calixte Denizet (:calixte)

Updated

•

7 years ago

Depends on: 1402592

Jim Mathies [:jimm]

Updated

•

7 years ago

No longer blocks: 1402037

Depends on: 1402037

Marco Castelluccio [:marco]

Comment 101

•

7 years ago

Are we sure the bug is actually fixed? It's possible the signature just shifted a bit. E.g. look at https://crash-stats.mozilla.com/search/?proto_signature=~CompositorThreadHolder%3A%3AShutdown&product=Firefox&date=%3E%3D2017-09-23T10%3A57%3A41.000Z&date=%3C2017-09-26T10%3A57%3A41.000Z&_sort=-date&_facets=signature&_facets=proto_signature&_facets=content_sandbox_level&_facets=content_sandbox_capable&_facets=content_sandbox_enabled&_facets=content_sandbox_capabilities&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 102

•

7 years ago

(In reply to Marco Castelluccio [:marco] from comment #101) > Are we sure the bug is actually fixed? It's possible the signature just > shifted a bit. > E.g. look at > https://crash-stats.mozilla.com/search/ > ?proto_signature=~CompositorThreadHolder%3A%3AShutdown&product=Firefox&date=% > 3E%3D2017-09-23T10%3A57%3A41.000Z&date=%3C2017-09-26T10%3A57%3A41. > 000Z&_sort=- > date&_facets=signature&_facets=proto_signature&_facets=content_sandbox_level& > _facets=content_sandbox_capable&_facets=content_sandbox_enabled&_facets=conte > nt_sandbox_capabilities&_columns=date&_columns=signature&_columns=product&_co > lumns=version&_columns=build_id&_columns=platform#facet-proto_signature. Yes. It went away from 56. I think this is fallout from bug 1401668. Maybe I should back it out on central to confirm now that bug 1398070 will have made us more forgiving on a late free, and to make a real fix less urgent.

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Depends on: 1413011

Karl Tomlinson (:karlt)

Updated

•

3 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1740087

Release owned CompositorManagerParent in CompositorBridgeParent::DeferredDestroy., v1 7 years ago Andrew Osmond [:aosmond] (he/him) (deleted), patch	nical : review+	Details \| Diff \| Splinter Review
Release the compositor thread after CompositorManagerParent IPDL reference is released., v2 7 years ago Andrew Osmond [:aosmond] (he/him) (deleted), patch	nical : review+	Details \| Diff \| Splinter Review
[uplift for beta] Release the compositor thread after CompositorManagerParent IPDL reference is released. 7 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch	lizzard : approval-mozilla-beta+	Details \| Diff \| Splinter Review
CompositorThreadHolder ref-count log 7 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), text/plain		Details
CompositorBridgeChild::ActorDestroy should not prevent CompositorBridgeChild::ShutDown from returning., v1 7 years ago Andrew Osmond [:aosmond] (he/him) (deleted), patch	nical : review+ jerry : feedback+	Details \| Diff \| Splinter Review
AddRef/Release log for CompositorThreadHolder. 7 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
show the ref-counting for CompositorThreadHolder in XPCOM shutdown. 7 years ago Jerry Shih[:jerry] (UTC+8) (inactive) (deleted), patch		Details \| Diff \| Splinter Review
Force CompositorManagerParent to close before shutting down the compositor thread., v1 7 years ago Andrew Osmond [:aosmond] (he/him) (deleted), patch		Details \| Diff \| Splinter Review
Force CompositorManagerParent to close before shutting down the compositor thread., v2 7 years ago Andrew Osmond [:aosmond] (he/him) (deleted), patch	dvander : review+ lizzard : approval-mozilla-release+	Details \| Diff \| Splinter Review
Force CompositorManagerParent to close before shutting down the compositor thread., v3 [carries r=dvander,me] 7 years ago Andrew Osmond [:aosmond] (he/him) (deleted), patch	aosmond : review+	Details \| Diff \| Splinter Review