Closed
Bug 1423261
Opened 7 years ago
Closed 6 years ago
Crash in mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent
Categories
(Core :: DOM: Content Processes, defect, P1)
Tracking
RESOLVED
FIXED
mozilla61
Version | Tracking | Status |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | fixed |
firefox57 | --- | unaffected |
firefox58 | --- | wontfix |
firefox59 | --- | wontfix |
firefox60 | - | wontfix |
firefox61 | --- | fixed |
People
(Reporter: philipp, Assigned: spohl)
References
Details
(4 keywords)
Crash Data
Attachments
(3 files, 9 obsolete files)
(deleted), patch | jimm: review+ | Details | Diff | Splinter Review
(deleted), patch | jimm: review+ | Details | Diff | Splinter Review
(deleted), patch | jimm: review+, RyanVM: approval-mozilla-esr60+ | Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is report bp-0077eff1-a75a-4c26-ac0c-0dd730171205.
=============================================================
Top 10 frames of crashing thread:
0 xul.dll mozilla::ipc::MessageChannel::Clear ipc/glue/MessageChannel.cpp:708
1 xul.dll mozilla::ipc::MessageChannel::~MessageChannel ipc/glue/MessageChannel.cpp:581
2 xul.dll mozilla::dom::PContentParent::~PContentParent ipc/ipdl/PContentParent.cpp:264
3 xul.dll mozilla::dom::ContentParent::cycleCollection::DeleteCycleCollectable dom/ipc/ContentParent.h:316
4 xul.dll SnowWhiteKiller::~SnowWhiteKiller xpcom/base/nsCycleCollector.cpp:2729
5 xul.dll nsCycleCollector::FreeSnowWhite xpcom/base/nsCycleCollector.cpp:2917
6 xul.dll nsCycleCollector::Shutdown xpcom/base/nsCycleCollector.cpp:3987
7 xul.dll nsCycleCollector_shutdown xpcom/base/nsCycleCollector.cpp:4373
8 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:973
9 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1508
=============================================================
Crashes during shutdown with this signature are (re)appearing in Firefox 58; the same signature was fixed once before in bug 1363601.
On 58.0b this is currently the #3 browser top crash. All the crashes are from Windows and come with MOZ_CRASH(MessageChannel destroyed without being closed).
It first regressed in 58.0a1 build 20171003100226, so a likely pushlog containing the regression would be: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=15f221f491f707b1e8e46da344b6dd5a394b1242&tochange=11fe0a2895aab26c57bcfe61b3041d7837e954cd
Comment 1•7 years ago
Hey Gabor, can you take a quick look here? You fixed this previously, had something to do with preallocated processes.
Flags: needinfo?(gkrizsanits)
Comment 2•7 years ago
(In reply to [:philipp] from comment #0)
> crashes during shutdown with this signature are (re)appearing in firefox 58
> - the same signature got fixed once before in bug 1363601.
> on 58.0b this is currently the #3 browser top crash, all the crashes are
> from windows and come with MOZ_CRASH(MessageChannel destroyed without being
> closed).
>
> it first regressed on 58.0a1 build 20171003100226, so a likely pushlog
> containing the regression would be:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=15f221f491f707b1e8e46da344b6dd5a394b1242&tochange=11fe
> 0a2895aab26c57bcfe61b3041d7837e954cd
The patch that re-enables the preallocated process manager is in this range. However, that wasn't a big change at the time, so I'm not convinced it's related. (Note that we have crashes with this signature even on the 57 release, where the ppm is disabled completely.) Something seems to have changed around November 15th-ish: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2017-09-01T13%3A35%3A00.000Z&date=%3C2017-12-07T13%3A35%3A59.000Z#graphs
That's where the crash rate went up like crazy. I'm not sure why; I couldn't find anything suspicious yet.
Flags: needinfo?(gkrizsanits)
Reporter
Comment 3•7 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #2)
> Something seems to be changed around November 15th-ish:
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20
> %7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozil
> la%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2017-09-
> 01T13%3A35%3A00.000Z&date=%3C2017-12-07T13%3A35%3A59.000Z#graphs
>
> that's where the crash-rate went up like crazy, I'm not sure why, could not
> found anything suspicious yet.
That's when 58 was released/pushed to the beta population.
If you filter to crashes on Nightly, you can see they started appearing on that channel on October 3rd (the pushlog of the changes is in comment #0): https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2017-09-01T15%3A35%3A00.000Z#graphs
Comment 4•7 years ago
Gabor, does anything in the range from Comment #3 stick out to you?
Flags: needinfo?(gkrizsanits)
Comment 5•7 years ago
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #4)
> Gabor, does anything in the range from Comment #3 stick out to you?
Based on comment 3, it's probably the preallocated process manager (bug 1385249).
I've looked into this over the last couple of days, but without being able to reproduce it, it's quite hard to tell what's going on. It's a Windows-only shutdown crash, and it seems like it can happen without the ppm, but the ppm makes it (much) more likely to happen. What _should_ ensure that this never happens is that during shutdown we explicitly spin the event loop and wait for the channels to close for all content processes, including the ppm, here:
https://searchfox.org/mozilla-central/rev/f6f1731b1b7fec332f86b55fa40e2c9ae67ac39b/dom/ipc/ContentParent.cpp#2764
and shortly before that we make sure that no more preallocated processes are spawned during shutdown, here:
https://searchfox.org/mozilla-central/rev/f6f1731b1b7fec332f86b55fa40e2c9ae67ac39b/dom/ipc/PreallocatedProcessManager.cpp#139
I discovered that the ppm should probably release the preallocated process right after the channel is closed (this currently happens a bit later: https://searchfox.org/mozilla-central/rev/f6f1731b1b7fec332f86b55fa40e2c9ae67ac39b/dom/ipc/PreallocatedProcessManager.cpp#84)
What might also cause some trouble is that mForceKillTimer might fire while we're waiting for the channel, but I fail to see how that would leave the channel open.
Flags: needinfo?(gkrizsanits)
Comment 6•7 years ago
I guess it's worth a try to release the preallocated process earlier and see if that helps somehow, but this will probably require more work to understand the real underlying issue around content process shutdown. Another approach is to convert the failing release assertion, during shutdown, into something more resilient.
Comment 7•7 years ago
Jim/Blake, any ideas of how to progress here?
Flags: needinfo?(mrbkap)
Flags: needinfo?(jmathies)
Updated•7 years ago
Assignee: nobody → mrbkap
Flags: needinfo?(mrbkap)
Comment 8•7 years ago
This is almost certainly the same problem that billm found in bug 1138520, comment 26. In particular, KillHard either isn't killing the (hung?) child process or, at least, killing the process without notifying the parent and closing the channel.
I don't know too much about Windows IPC stuff and especially its interaction with force killing processes. Jim, do you know what could cause this? It's worth noting that in the past, this seems to have happened for hung child processes (and those processes stayed hung for at least a minute, triggering the parent process's shutdown hang killer).
We probably should be figuring out what's happening to these processes.
Reporter
Comment 9•7 years ago
[Tracking Requested - why for this release]:
So far in 58rc1 this is the #2 top browser crash (close to 3% of all browser crashes).
tracking-firefox58:
--- → ?
If we can get a fix ready, this seems like a good candidate for inclusion in a potential 58.0.1 release. GChang, FYI.
Flags: needinfo?(gchang)
Comment 11•7 years ago
Hey Bob, do you have any ideas for how to track this down? I'm wondering if the child process ever starts up at all.
One thing I've noticed is that we don't seem to ever upgrade the process priority of the preallocated processes that we create. I suppose it might be possible, then, that it doesn't get scheduled ever. I'll file a new bug to track that.
Flags: needinfo?(bobowencode)
Comment 12•7 years ago
(In reply to Blake Kaplan (:mrbkap) from comment #11)
> One thing I've noticed is that we don't seem to ever upgrade the process
> priority of the preallocated processes that we create. I suppose it might be
> possible, then, that it doesn't get scheduled ever. I'll file a new bug to
> track that.
After reading more code, setting process priorities isn't enabled on any of our current platforms anyway, so fixing this won't have any effect.
Comment 13•7 years ago
I'll track this as a possible dot release candidate for 58.
Flags: needinfo?(gchang)
Comment 14•7 years ago
https://crash-stats.mozilla.com/search/?signature=%3Dmozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&product=Firefox&date=%3E%3D2018-01-11T14%3A45%3A34.000Z&date=%3C2018-01-18T14%3A45%3A34.000Z&_sort=-date&_facets=signature&_facets=version&_facets=uptime&_facets=shutdown_progress&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-uptime
crash notes -
* always occurs during xpcom shutdown
* uptimes are between 25 and about 130 seconds, none higher.
* currently #4 top browser crash for beta 58
I'm not sure what those uptimes imply but I find it interesting. An odd timing window.
Comment 15•7 years ago
This also impacts both 64-bit and 32-bit builds, although 32-bit has a much higher incidence rate.
Comment 16•7 years ago
(In reply to Jim Mathies [:jimm] from comment #14)
> * uptimes are between 25 and about 130 seconds, none higher.
>
> I'm not sure what those uptimes imply but I find it interesting. An odd
> timing window.
I think Super Search is messing with me here; uptimes do go higher, they just don't show up in the search results.
Flags: needinfo?(jmathies)
Comment 17•7 years ago
(In reply to Blake Kaplan (:mrbkap) from comment #8)
> This is almost certainly the same problem that billm found in bug 1138520,
> comment 26. In particular, KillHard either isn't killing the (hung?) child
> process or, at least, killing the process without notifying the parent and
> closing the channel.
I don't think it's the responsibility of the child process to communicate shutdown. The KillProcess call in KillHard's OnGenerateMinidumpComplete calls directly through to TerminateProcess, and I've never seen that call fail. By the way, there's a scary wait in here [1]: 60 seconds waiting for the process to shut down.
> I don't know too much about Windows IPC stuff and especially its interaction
> with force killing processes. Jim, do you know what could cause this? It's
Not yet.
One note: I don't think this is a case where the child never launches, since we open the handle [2] right before we try to terminate it. That call should fail for an invalid handle.
[1] https://searchfox.org/mozilla-central/source/ipc/chromium/src/base/process_util_win.cc#427
[2] https://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#3161
Updated•7 years ago
Flags: needinfo?(jmathies)
Updated•7 years ago
Flags: needinfo?(jmathies)
Flags: needinfo?(bobowencode)
Updated•7 years ago
Assignee: mrbkap → nobody
Updated•7 years ago
Priority: -- → P2
Updated•7 years ago
tracking-firefox58:
+ → ---
Assignee
Comment 18•7 years ago
We call MOZ_CRASH[1] because we believe the MessageChannel to not be closed. We rely on Unsound_IsClosed()[2] to determine whether or not the channel is closed. But the method name implies that it isn't sound to rely 100% on its return value to determine whether or not the channel is closed, and it is further confirmed in the following comment[3]:
// Unsound_IsClosed and Unsound_NumQueuedMessages are safe to call from any
// thread, but they make no guarantees about whether you'll get an
// up-to-date value; the values are written on one thread and read without
// locking, on potentially different threads. Thus you should only use
// them when you don't particularly care about getting a recent value (e.g.
// in a memory report).
It seems like we shouldn't crash the application based on a possibly out-of-date value. Even if there is a true issue here and the message channel has indeed not been shut down yet, we should check this based on a guaranteed up-to-date value to get an accurate measure of how prevalent this problem is.
Bill, I see that this was added in bug 1119878. Do you agree with the above and do you happen to know of a way to check whether the message channel is closed in a guaranteed up-to-date way?
[1] https://hg.mozilla.org/releases/mozilla-beta/annotate/73ef186ad51a/ipc/glue/MessageChannel.cpp#l704
[2] https://hg.mozilla.org/releases/mozilla-beta/annotate/73ef186ad51a/ipc/glue/MessageChannel.cpp#l701
[3] https://hg.mozilla.org/releases/mozilla-beta/annotate/73ef186ad51a/ipc/glue/MessageChannel.h#l310
Flags: needinfo?(bill.mccloskey)
Comment 19•7 years ago
I think in this case it's safe to use Unsound_IsClosed. The reason Unsound_IsClosed is marked "unsound" is that it reads state that's written on the I/O thread. However, assuming that someone called MessageChannel::Close() on the main thread already (which is what this assertion is trying to check), we should have already done a round-trip from the main thread to the I/O thread in order to close the channel. That happens like this:
MessageChannel::Close() ->
MessageChannel::SynchronouslyClose() ->
ProcessLink::SendClose() -> ...via PostMessage to I/O thread...
ProcessLink::OnCloseChannel() ->
Transport::Close() [this is the platform-specific version that sets closed_]
Sets MessageChannel::mChannelState to ChannelClosed and notifies CV
SynchronouslyClose also waits on the mChannelState becoming ChannelClosed
So the whole thing is race-free *assuming* that you've already called MessageChannel::Close() on the main thread. But that's what we're trying to assert anyway.
As to why this crash is happening, I'm afraid I don't know.
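The round trip described above can be sketched as a sequential toy model. This is a hypothetical simplification: `ToyChannel` and its members are illustrative names, and the real code actually hops from the main thread to the I/O thread and blocks on a condition variable, whereas here the "round trip" is inlined so the ordering guarantee is easy to follow.

```cpp
#include <cassert>

// Sequential toy model of MessageChannel::Close(). In the real code,
// SendClose() is posted to the I/O thread and SynchronouslyClose() waits
// until mChannelState becomes ChannelClosed.
enum class ChannelState { Connected, Closing, Closed };

class ToyChannel {
public:
  // "Main thread": MessageChannel::Close() -> SynchronouslyClose().
  void Close() {
    mState = ChannelState::Closing;
    // ProcessLink::SendClose(): in reality a hop to the I/O thread while the
    // main thread waits on a condition variable; here we just run the handler.
    OnCloseChannel();
    // SynchronouslyClose() returns only once the state is ChannelClosed, so
    // the main thread has fully synchronized with the I/O thread here.
    assert(mState == ChannelState::Closed);
  }

  // Safe to call after Close() has returned: the waiting round trip above
  // already made the cross-thread write visible, which is why the "unsound"
  // read is sound in the destructor's assertion.
  bool Unsound_IsClosed() const { return mState == ChannelState::Closed; }

private:
  // "I/O thread": ProcessLink::OnCloseChannel() -> Transport::Close().
  void OnCloseChannel() { mState = ChannelState::Closed; }

  ChannelState mState = ChannelState::Connected;
};
```

The point of the sketch is the ordering: by the time `Close()` returns, the state transition has already happened, so the later unlocked read cannot be stale.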
Flags: needinfo?(bill.mccloskey)
Assignee
Comment 20•7 years ago
(In reply to Bill McCloskey [inactive unless it's an emergency] (:billm) from comment #19)
> (...)
>
> So the whole thing is race-free *assuming* that you've already called
> MessageChannel::Close() on the main thread. But that's what we're trying to
> assert anyway.
>
> As to why this crash is happening, I'm afraid I don't know.
Thank you for the clarifications! In this case we might instead be dealing with a channel error, timeout or similar. Unsound_IsClosed() checks if the channel is in the ChannelClosed state. But looking at MessageChannel::Close(), there are situations that could result in a different channel state.
I'm suggesting that we land this diagnostics patch to figure out what state the channel is in. This will show whether we crash due to one or several different states. We can then work backwards and figure out what might be causing the channel to be in this state when it is expected to be closed.
Assignee: nobody → spohl.mozilla.bugs
Status: NEW → ASSIGNED
Attachment #8954788 -
Flags: review?(jmathies)
Comment 21•7 years ago
[Tracking Requested - why for this release]:
status-firefox60:
--- → affected
tracking-firefox60:
--- → ?
Comment 22•7 years ago
Comment on attachment 8954788 [details] [diff] [review]
Diagnostics patch for mChannelState
Review of attachment 8954788 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm. any bets on which we'll end up with?
Attachment #8954788 -
Flags: review?(jmathies) → review+
Assignee
Comment 23•7 years ago
(In reply to Jim Mathies [:jimm] from comment #22)
> Comment on attachment 8954788 [details] [diff] [review]
> Diagnostics patch for mChannelState
>
> Review of attachment 8954788 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> lgtm. any bets on which we'll end up with?
I believe it's most likely that we'll see ChannelError, but no matter what the state will be we should be able to add more diagnostics once we know what we're dealing with.
Assignee
Comment 24•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/b05965477a5adc39e0d64f61d13c278e8cf9989a
Bug 1423261: Diagnostics patch to obtain more info about the IPC channel state when we expect it to be closed. r=jimm
Assignee
Updated•7 years ago
Keywords: leave-open
Comment 25•7 years ago
bugherder
Assignee
Comment 26•7 years ago
Crash stats[1] just displayed the first crash report[2] after the diagnostics patch landed; the channel state is ChannelConnected. I will keep an eye on further reports.
[1] https://crash-stats.mozilla.com/signature/?product=Firefox&version=60.0a1&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AClear%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2018-02-23T10%3A45%3A30.000Z&date=%3C2018-03-02T10%3A45%3A30.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
[2] https://crash-stats.mozilla.com/report/index/377cbbd2-30ec-46f4-b59a-3ca030180302
Updated•7 years ago
Assignee
Comment 28•7 years ago
We discovered that the crash reports in Socorro did not include the protocol name as an annotation, even though it was added by calling CrashReporter::AnnotateCrashReport just before MOZ_CRASH. I've now discovered that Socorro appears to filter annotations based on a schema[1]. `ProtocolName` was not in the schema. However, `IPCFatalErrorProtocol` is and I have confirmed that crash reports for other crashes successfully submit the protocol name under `IPCFatalErrorProtocol`.
Crash reports after this patch should tell us which protocol(s) is/are causing problems here.
[1] https://github.com/mozilla-services/socorro/blob/master/socorro/schemas/crash_report.json
Attachment #8958019 -
Flags: review?(jmathies)
Comment 29•7 years ago
(In reply to Stephen A Pohl [:spohl] from comment #28)
> Created attachment 8958019 [details] [diff] [review]
> Fix the way we annotate crash reports
>
> We discovered that the crash reports in Socorro did not include the protocol
> name as an annotation, even though it was added by calling
> CrashReporter::AnnotateCrashReport just before MOZ_CRASH. I've now
> discovered that Socorro appears to filter annotations based on a schema[1].
> `ProtocolName` was not in the schema. However, `IPCFatalErrorProtocol` is
> and I have confirmed that crash reports for other crashes successfully
> submit the protocol name under `IPCFatalErrorProtocol`.
>
> Crash reports after this patch should tell us which protocol(s) is/are
> causing problems here.
>
> [1]
> https://github.com/mozilla-services/socorro/blob/master/socorro/schemas/
> crash_report.json
We set these other values here -
https://searchfox.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#291
I think the issue is that ProtocolName isn't getting included in super search queries. We have to file a bug on getting that added.
Comment 30•7 years ago
Bug 1257986 is similar.
Updated•7 years ago
Attachment #8958019 -
Flags: review?(jmathies)
Assignee
Comment 31•7 years ago
(In reply to Jim Mathies [:jimm] from comment #29)
> I think the issue is that ProtocolName isn't getting included in super
> search queries. We have to file a bug on getting that added.
There are several reasons I thought changing ProtocolName to IPCFatalErrorProtocol would be preferable over adding ProtocolName to super search:
1. Both "tags" (not sure if this is the right terminology) refer to the same thing: the name of the protocol in question during a fatal IPC error.
2. IPCFatalErrorProtocol has a specific name that more accurately describes that the reason why it was added to the crash report is because there was a fatal error with this particular protocol.
3. ProtocolName is a very generic name and even if we went the route of adding it to super search, I would suggest that we change it before adding it to super search to make it more specific.
4. If we were to change ProtocolName to a more specific name, we would end up with something that sounds a lot like IPCFatalErrorProtocol.
5. The current use of IPCFatalErrorProtocol that you mention is for the same purpose as here: adding the problematic protocol name to the crash report.
6. There are enough new crash reports coming in that we don't need to go back through existing crash reports and add ProtocolName to super search to get to the important data. Looking at IPCFatalErrorProtocol going forward would be sufficient.
Unless there is an implicit reason that I'm not aware of that says that a specific crash annotation tag (IPCFatalErrorProtocol) cannot be used in two distinct places in our code base, I believe we would have a more accurate tag name for this crash annotation going forward and we would see results come in faster by switching to IPCFatalErrorProtocol than by adding ProtocolName to super search.
Flags: needinfo?(jmathies)
Updated•7 years ago
Flags: needinfo?(jmathies)
Attachment #8958019 -
Flags: review+
Assignee
Comment 32•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ea7d945741c1ffb4db49a4d6c0f61ec973eafce
Bug 1423261: Submit IPC protocol names in shutdown crash reports in a way that Socorro is able to display. r=jimm
Comment 33•7 years ago
bugherder
Comment 34•7 years ago
Assignee
Comment 35•7 years ago
At a high level, shutdown works like this:
1. Parent tells child to shut down and sets mShutdownPending to true.
2. Child goes through shutdown sequence. At the end of the sequence, the child sends a FinishShutdown message to the parent.
3. The parent receives the FinishShutdown message, calls Close() on the message channel and completes shutdown.
This patch adds a crash annotation message to our crash reports for nightly builds. This will change the crash signature a bit because we will trigger this crash inside ContentParent, but it will allow us to gather more information about the state of the ContentParent. My best guess at the moment is that the ContentParent is in the `shutdown pending` state and fails to receive a FinishShutdown message from the child in a timely manner. Assuming that this is the case, we may be able to do away with this roundtrip during shutdown. The parent and the child both have ForceKillTimers and we may be able to rely on this timer to shut down the child should it not terminate in a timely manner. But we should confirm that we're in the `shutdown pending` state first.
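The three-step handshake above can be sketched as a toy model. `ToyParent`, `ToyChild`, and the string messages are illustrative stand-ins, not the real IPDL-generated interfaces; the sketch only shows the happy path in which FinishShutdown arrives before the ContentParent is destroyed.

```cpp
#include <cassert>
#include <string>

// Toy model of the parent/child shutdown handshake described above.
struct ToyParent {
  bool mShutdownPending = false;
  bool mChannelClosed = false;

  // Step 1: parent asks the child to shut down.
  std::string RequestShutdown() {
    mShutdownPending = true;
    return "Shutdown";  // message sent over the channel
  }

  // Step 3: parent receives FinishShutdown and closes the channel.
  void RecvFinishShutdown() { mChannelClosed = true; }

  // The invariant behind MOZ_CRASH(MessageChannel destroyed without being
  // closed): the channel must be closed before the ContentParent dies.
  bool SafeToDestroy() const { return mChannelClosed; }
};

struct ToyChild {
  // Step 2: child runs its shutdown sequence, then replies to the parent.
  std::string RecvShutdown(const std::string& aMsg) {
    assert(aMsg == "Shutdown");
    // ... child-side shutdown work would happen here ...
    return "FinishShutdown";
  }
};
```

The crash in this bug corresponds to destroying `ToyParent` while `SafeToDestroy()` is still false, i.e. step 3 never completed.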
Attachment #8959578 -
Flags: review?(jmathies)
Assignee
Comment 36•7 years ago
Uploaded the wrong patch. This is it instead.
Attachment #8959578 -
Attachment is obsolete: true
Attachment #8959578 -
Flags: review?(jmathies)
Attachment #8959581 -
Flags: review?(jmathies)
Updated•7 years ago
status-firefox61:
--- → affected
Comment 37•7 years ago
Comment on attachment 8959581 [details] [diff] [review]
Add crash message annotation
Review of attachment 8959581 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm!
Attachment #8959581 -
Flags: review?(jmathies) → review+
Assignee
Comment 38•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/58ddcf890331b9ddbc81b9bf4e8e3e04f551996e
Bug 1423261: Add crash message annotations to our crash reports to help investigate shutdown crashes due to open message channels when they are expected to be closed. Nightly only. r=jimm
Comment 39•7 years ago
bugherder
Assignee
Comment 40•7 years ago
The first crash report with the latest diagnostics patch has just come in:
https://crash-stats.mozilla.com/report/index/63545839-471f-4910-ae37-cb34a0180320
The IPC Fatal Error Message is:
queued-ipc-messages/content-parent(Browser, pid=-1, open channel, 0x1fd186b8, refcnt=1, numQueuedMessages=140, remoteType=web, mCalledClose=false, mCalledKillHard=true, mShutdownPending=true, mIPCOpen=true)
This seems to confirm my previous suspicion that the parent is in the shutdown pending state, but fails to receive confirmation from the child that it has shut down in a timely manner. Also notice that KillHard has been called on the parent, which could be the reason why we see this crash signature.
Next, I will explore if the parent can request child shutdowns without waiting for those shutdowns to complete.
Assignee
Updated•7 years ago
Crash Signature: [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent] → [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent]
[@ mozilla::dom::ContentParent::~ContentParent]
Assignee
Comment 42•7 years ago
There may be a simple explanation why we might be getting into this state: when a shutdown message is received within a nested event loop by the child, we delay shutdown by an arbitrary 100ms in the hopes that the event loop will have finished by then. This is to prevent premature termination of "unload" or "pagehide" handlers, for example. However, there was no limit to how many times shutdown could be delayed by 100ms. If shutdown was delayed too long, the ForceKillTimer would trigger in the parent, resulting in this crash here.
This patch adds a limit to the number of times that a child can delay shutdown. The limit is equal to half the ForceKillTimer timeout to allow sufficient time for the child to send the FinishShutdown message back to the parent.
This patch should help us verify this theory, and if confirmed, fix the crash.
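The bounded-delay idea can be sketched as follows. The names and the budget arithmetic are hypothetical simplifications of the patch: the real code re-posts `ContentChild::ShutdownInternal` with a 100ms delay and derives its budget from the `dom.ipc.tabs.shutdownTimeoutSecs` pref.

```cpp
#include <cstdint>

// Each retry while a nested event loop is running burns 100ms of budget.
constexpr int32_t kDelayMs = 100;

struct ShutdownScheduler {
  int32_t mBudgetMs;               // e.g. half the ForceKillTimer timeout
  bool mInNestedEventLoop = true;  // set while "unload"/"pagehide" etc. run

  // Returns true once the child should actually proceed with shutdown.
  bool TryShutdown() {
    if (mInNestedEventLoop && mBudgetMs >= kDelayMs) {
      mBudgetMs -= kDelayMs;  // delay another 100ms and try again later
      return false;
    }
    // Budget exhausted (or the nested loop finished): proceed with shutdown
    // even if still nested, leaving the parent enough headroom to receive
    // the FinishShutdown message before its ForceKillTimer fires.
    return true;
  }
};
```

With a 250ms budget, for example, the scheduler allows exactly two 100ms delays before forcing shutdown on the third attempt.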
Attachment #8960940 -
Flags: review?(jmathies)
Assignee
Updated•7 years ago
Priority: P2 → P1
Comment 43•7 years ago
Comment on attachment 8960940 [details] [diff] [review]
Patch - set shutdown timeout in child processes
Review of attachment 8960940 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/ipc/ContentChild.cpp
@@ +3025,5 @@
> NewRunnableMethod(
> "dom::ContentChild::RecvShutdown", this,
> &ContentChild::ShutdownInternal),
> + delay);
> + mShutdownTimeout -= delay;
Before these changes, we might get caught up in this forever, right? With these changes, and with a default value of 5 for dom.ipc.tabs.shutdownTimeoutSecs, we'll wait up to 2.5 seconds. What's the KillHard timeout? 5 seconds?
::: dom/ipc/ContentChild.h
@@ +830,5 @@
> // NOTE: This member is atomic because it can be accessed from off-main-thread.
> mozilla::Atomic<uint32_t> mPendingInputEvents;
> #endif
>
> + int32_t mShutdownTimeout;
nit - lets move this up near this line -
mozilla::Atomic<bool> mShuttingDown
Attachment #8960940 -
Flags: review?(jmathies) → review+
Assignee
Comment 44•7 years ago
Thank you! Addressed feedback, carrying over r+.
(In reply to Jim Mathies [:jimm] from comment #43)
> Comment on attachment 8960940 [details] [diff] [review]
> Patch - set shutdown timeout in child processes
>
> Review of attachment 8960940 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: dom/ipc/ContentChild.cpp
> @@ +3025,5 @@
> > NewRunnableMethod(
> > "dom::ContentChild::RecvShutdown", this,
> > &ContentChild::ShutdownInternal),
> > + delay);
> > + mShutdownTimeout -= delay;
>
> Before these changes, we might get caught up in this forever right?
That's right, which may result in the parent killing the child via KillHard(). The crash in this bug occurs because we notice that the channel hasn't been closed when the ContentParent gets destroyed. KillHard() does not attempt to close the channel, so it seems likely that this is the reason for these crashes.
> With these changes and with a default value of 5 for
> dom.ipc.tabs.shutdownTimeoutSecs we'll wait up to 2.5 seconds. What's the
> KillHard timeout? 5 seconds?
Right, the code doesn't make this particularly clear but dom.ipc.tabs.shutdownTimeoutSecs actually *is* the KillHard timeout.
Attachment #8960940 -
Attachment is obsolete: true
Attachment #8961206 -
Flags: review+
Assignee
Comment 45•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff30955a00d2f6cbba9288b3ba2b7b749a577628
Bug 1423261: Don't allow child processes to delay shutdown endlessly when a shutdown message is received from within a nested event loop. r=jimm
Comment 46•7 years ago
bugherder
Comment hidden (offtopic)
Updated•7 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 48•7 years ago
mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent
mozilla::dom::ContentParent::~ContentParent
Top Crashers for Firefox 61.0a1
26 0.64% 0.03% mozilla::dom::ContentParent::~ContentParent 33 33 0 0 32 0 2016-02-24
Top Crashers for Firefox 60.0b
6 3.28% 0.15% mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent 1181 1181 0 0 1171 0 2017-05-17
Top Crashers for Firefox 59.0.2
4 2.11% 1.08% mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent 2543 2543 0 0 2456 0 2017-05-17
Top Crashers for Firefox 58.0.2
3 2.33% -0.38% mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::dom::PContentParent::~PContentParent 312 312 0 0 315 0 2017-05-17
Keywords: nightly-community, topcrash
Updated•7 years ago
Updated•7 years ago
Assignee
Comment 49•7 years ago
Assignee
Comment 50•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/abb2fafe775b3e58cdff89164a229883755a08df
Bug 1423261: Revert changeset ff30955a00d2 since it is ineffective. r=me
Assignee
Comment 51•7 years ago
Comment on attachment 8961206 [details] [diff] [review]
Patch - set shutdown timeout in child processes
This has been backed out (see comment 50). After further digging it turns out that the shutdown delay due to nested event loops was introduced to help with content process shutdown hangs. Adding a timeout to this delay simply gets us back to the state before we had an explicit shutdown delay due to nested event loops, i.e. a shutdown hang.
Attachment #8961206 -
Attachment is obsolete: true
Assignee
Comment 52•7 years ago
This patch should fix the crashes reported here by closing the IPC channel during KillHard shutdowns. We are already submitting both parent- and content-process crash dumps during KillHard shutdowns and will therefore continue to have the same data to investigate content process shutdown hangs.
Attachment #8966345 -
Flags: review?(jmathies)
Updated•7 years ago
Attachment #8966345 -
Flags: review?(jmathies) → review+
Assignee
Comment 53•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/f45d6ae3fdb194a07ef5eba9f9a760f28f564b64
Bug 1423261: Close the IPC channel during KillHard shutdowns to prevent shutdown crashes in the parent process. r=jimm
Comment 54•7 years ago
bugherder
Comment 55•7 years ago
bugherder
Assignee
Comment 56•7 years ago
(In reply to Natalia Csoregi [:nataliaCs] from comment #55)
> https://hg.mozilla.org/mozilla-central/rev/f45d6ae3fdb1
This causes bug 1453252. The fix may be as simple as restricting the Close() call to the parent side, but I will need to investigate further when I get back next week. Will back out shortly.
Comment 57•7 years ago
Backed out changeset f45d6ae3fdb1 (bug 1423261) for causing bug 1453252.
Flags: needinfo?(spohl.mozilla.bugs)
Comment 58•7 years ago
Backout by shindli@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/88f297d206a6
Backed out changeset f45d6ae3fdb1 for causing bug 1453252. a=backout
Assignee
Comment 59•6 years ago
The previous patch runs into issues when the IPC channel gets properly closed before cycle collection runs during a KillHard shutdown. So we've learned that the IPC channel may be in either connected or closed state at the time of cycle collection during a KillHard shutdown.
Since we now understand that the parent may still have a connected IPC channel during a KillHard shutdown, it makes sense to skip the intentional crash reported in this bug when KillHard has been called. We continue to collect crash dumps for hung content processes as before.
Attachment #8966345 -
Attachment is obsolete: true
Flags: needinfo?(spohl.mozilla.bugs)
Attachment #8968272 -
Flags: review?(jmathies)
Updated•6 years ago
Attachment #8968272 -
Flags: review?(jmathies) → review+
Assignee
Comment 60•6 years ago
I was a bit too quick in posting my previous patch. The patch applied to code that we added for diagnostics (attachment 8959581 [details] [diff] [review]) and which we should revert. I'm posting three patches instead, of which this is the first one:
1. Revert changeset 58ddcf890331 (attachment 8959581 [details] [diff] [review]) to get back to the previous crash signature.
2. Add a new field (mInKillHardShutdown) to MessageChannel to be able to tell when we're in a KillHard shutdown during cycle collection.
3. Restrict patch 2 above to Nightly only. This makes it easy to turn the fix on for all branches by backing this patch out.
Assignee
Comment 61•6 years ago
This reverts changeset 58ddcf890331 (attachment 8959581 [details] [diff] [review]) which was added for diagnostic purposes. Marking as r+ since this is a straight backout.
Attachment #8968272 -
Attachment is obsolete: true
Attachment #8968590 -
Flags: review+
Assignee
Comment 62•6 years ago
This adds a new field (mInKillHardShutdown) to MessageChannel to be able to tell when we're in a KillHard shutdown. This allows us to skip the intentional crash (this bug) during cycle collection.
Attachment #8968591 -
Flags: review?(jmathies)
Assignee
Comment 63•6 years ago
This restricts the fix to nightly only for now. A simple backout of this patch will allow us to turn the fix on for all branches.
Attachment #8968592 -
Flags: review?(jmathies)
Updated•6 years ago
Attachment #8968591 -
Flags: review?(jmathies) → review+
Updated•6 years ago
Attachment #8968592 -
Flags: review?(jmathies) → review+
Assignee
Comment 64•6 years ago
Assignee
Comment 65•6 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/d86b5b69f86745f0d9f505f633f48d1bf2ab45e1
Bug 1423261: Backout diagnostics changeset 58ddcf890331. r=backout
https://hg.mozilla.org/integration/mozilla-inbound/rev/c8842b205236ab4e73ffa1892c00fe379bfd9efd
Bug 1423261: Skip intentionally crashing the browser during KillHard shutdowns. r=jimm
https://hg.mozilla.org/integration/mozilla-inbound/rev/6385170a411128e87a2bc180202cedd7594efbe7
Bug 1423261: Restrict patch to nightly only for now. r=jimm
Comment 66•6 years ago
bugherder
Assignee
Comment 68•6 years ago
This has fallen 23 spots in the nightly top crash list and there are no reported crashes after the 20180417225505 build. This looks ready to be turned on for all branches.
Assignee
Comment 69•6 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa51d5460d2815b8794ba244dcf279632c8fce22
Backout changeset c8842b205236 to turn on the fix for bug 1423261 on all branches. a=backout
Assignee
Updated•6 years ago
Attachment #8938319 -
Attachment is obsolete: true
Assignee
Updated•6 years ago
Attachment #8959581 -
Attachment is obsolete: true
Assignee
Updated•6 years ago
Attachment #8968590 -
Attachment is obsolete: true
Assignee
Updated•6 years ago
Attachment #8968592 -
Attachment is obsolete: true
Assignee
Updated•6 years ago
Keywords: leave-open
Assignee
Comment 70•6 years ago
I suggest we let this ride the train since this is a shutdown crash and doesn't have a real impact on users, other than seeing a crash dialog.
Comment 71•6 years ago
Sounds good to me, thanks Stephen.
Reporter
Updated•6 years ago
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Reporter
Updated•6 years ago
Target Milestone: --- → mozilla61
Comment 72•6 years ago
Stephen: I see a few Mac signatures in crash stats, some 61/62: https://bit.ly/2rIEkwT. Is this another manifestation of the crash? Crash reason: MOZ_CRASH(MessageChannel destroyed without being closed (mChannelState == ChannelConnected).)
Flags: needinfo?(spohl.mozilla.bugs)
Assignee
Comment 73•6 years ago
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #72)
> Stephen: I see a few Mac signatures in crash stats, some 61/62:
> https://bit.ly/2rIEkwT. Is this another manifestation of the crash? Crash
> reason: MOZ_CRASH(MessageChannel destroyed without being closed
> (mChannelState == ChannelConnected).)
It is possible that this is the child side of the same issue, i.e. we're in a KillHard shutdown when the child attempts to destroy the MessageChannel. This would need its own investigation and fix however, so it would be best to track this in a separate bug.
Flags: needinfo?(spohl.mozilla.bugs)
Reporter
Comment 74•6 years ago
would this be an issue to consider fixing in 60esr?
status-firefox-esr60: --- → affected
Flags: needinfo?(ryanvm)
Comment 75•6 years ago
I'll redirect to Stephen on that.
Flags: needinfo?(ryanvm) → needinfo?(spohl.mozilla.bugs)
Assignee
Comment 76•6 years ago
Comment on attachment 8968591 [details] [diff] [review]
Fix: Skip intentionally crashing the browser when KillHard has been called
Yes, it seems reasonable to consider this for ESR60.
[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: Avoids showing crash dialogs when we terminate browser processes during excessively long shutdowns.
User impact if declined: If the browser takes too long to shut down we terminate the browser processes. Without this fix, the user is presented with a crash dialog.
Fix Landed on Version: 61
Risk to taking this patch (and alternatives if risky): Has been riding the trains with no fallout.
String or UUID changes made by this patch: none
See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Flags: needinfo?(spohl.mozilla.bugs)
Attachment #8968591 -
Flags: approval-mozilla-esr60?
Comment 77•6 years ago
Comment on attachment 8968591 [details] [diff] [review]
Fix: Skip intentionally crashing the browser when KillHard has been called
Fixes a topcrash on ESR60 and has been shipping with Fx61 without any known problems. Approved for ESR 60.2.
Attachment #8968591 -
Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Comment 78•6 years ago
bugherder uplift