Closed
Bug 1244883
Opened 9 years ago
Closed 8 years ago
crash in RtlEnterCriticalSection | PR_EnterMonitor | mozilla::layers::ImageContainer::ClearImagesFromImageBridge
Categories
(Core :: Graphics: Layers, defect)
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: milan, Assigned: eflores)
References
Details
(Keywords: crash, Whiteboard: [gfx-noted])
Crash Data
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
nical
:
review+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-8d6f2243-3ef6-4a22-980e-493e72160127.
=============================================================
No particular correlation I could find.
Here's one from 45:
https://crash-stats.mozilla.com/report/index/7df17f96-c760-457a-b9ca-365192160130
Reporter | ||
Comment 1•9 years ago
|
||
Timothy, anything that looks interesting here?
Flags: needinfo?(tnikkel)
Whiteboard: [gfx-noted]
Comment 2•9 years ago
|
||
Probably not likely an imagelib issue. ImageContainer's are also used for video, and all of the code in the crash stack lives under gfx/layers.
Component: ImageLib → Graphics: Layers
Flags: needinfo?(tnikkel)
Assignee | ||
Comment 3•9 years ago
|
||
Most of the crash reports involve media, but I'm not sure yet whether that's relevant.
On MediaDecoderStateMachine shutdown (which indirectly owns the ImageContainer), ImageBridgeChild::FlushAllImages is called. This queues a sync-ish (using AsyncTransactionWaiter) task on the ImageBridge thread which calls ImageBridgeChild::FlushAllImagesSync. This function acts on the ImageContainer, which stays alive while FlushAllImages is waiting.
AsyncTransactionWaiter has a timeout of 50 seconds which it's hitting, AFAICT. So when FlushAllImagesSync finally gets a chance to run, the ImageContainer is already dead.
I can't get much more information out of these crash reports, since the real cause happens before the crash. I think should add a gfxDevCrash on AsyncTransactionWaiter timeout to find the actual cause.
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → edwin
Assignee | ||
Comment 4•9 years ago
|
||
Attachment #8733888 -
Flags: review?(nical.bugzilla)
Assignee | ||
Updated•9 years ago
|
Keywords: leave-open
Assignee | ||
Comment 5•9 years ago
|
||
oops.
Attachment #8733888 -
Attachment is obsolete: true
Attachment #8733888 -
Flags: review?(nical.bugzilla)
Attachment #8733912 -
Flags: review?(nical.bugzilla)
Updated•9 years ago
|
Attachment #8733912 -
Flags: review?(nical.bugzilla) → review+
Comment 6•9 years ago
|
||
cc :sotaro who is rewriting things related to this code in another bug (I think).
Reporter | ||
Comment 8•9 years ago
|
||
Comment on attachment 8733912 [details] [diff] [review]
1244883-assert.patch
Review of attachment 8733912 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/ipc/AsyncTransactionTracker.cpp
@@ +35,5 @@
> printf_stderr("Timeout of waiting transaction complete.");
> }
> +
> + if (count == maxCount) {
> + gfxDevCrash(LogReason::AsyncTransactionTimeout) << "Bug 1244883: AsyncTransactionWaiter timed out.";
Would it be useful to have mWaitCount in the crash report?
Comment 9•9 years ago
|
||
bugherder |
Reporter | ||
Comment 10•9 years ago
|
||
We're going to get this backed out (https://treeherder.mozilla.org/#/jobs?repo=ash&revision=24c5fbde4488&group_state=expanded&filter-searchStr=win%20debug) - we may have just gotten the information we needed.
Comment 11•9 years ago
|
||
Mission accomplished, this blew up Windows debug e10s tests all over Ash. Backed out per IRC discussion with Milan.
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b1daf1da80a
https://treeherder.mozilla.org/logviewer.html#?job_id=158705&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158820&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158821&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158824&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158826&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158829&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158831&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158835&repo=ash
https://treeherder.mozilla.org/logviewer.html#?job_id=158837&repo=ash
To reproduce these on Try, you'll want to apply the patch below that forces e10s on. You can then run the regular non-e10s test suite for that given platform and things will Just Work.
https://hg.mozilla.org/try/rev/f414dcbd1ef6
Comment 12•9 years ago
|
||
backout bugherder |
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/0b1daf1da80a
Reporter | ||
Comment 13•9 years ago
|
||
Edwin, does the data from those try runs help?
Flags: needinfo?(edwin)
Assignee | ||
Comment 14•9 years ago
|
||
More data can't hurt! I find it odd that those tests didn't hit this bug previously, though. Will look today.
Flags: needinfo?(edwin)
Assignee | ||
Comment 15•9 years ago
|
||
Managed to get a dump of the parent process.
What I see happening is the compositor process trying to shut down the content process. It eventually reaches EnsureProcessTerminated and waits indefinitely *on the I/O thread* for the content process to die, but doesn't attempt to terminate it [1].
On the content side, when the media stuff shuts down, MediaFormatReader::ReleaseMediaResources() calls ImageContainer::ClearAllImages(), which calls ImageBridgeChild::FlushAllImages. This adds a FlushAllImagesSync() task to the ImageBridge thread and waits on an AsyncTransactionWaiter. [2]
On the ImageBridge thread, FlushAllImagesSync() indirectly calls ImageClient::RemoveTextureWithWaiter(). This creates an AsyncTransactionTracker which increments the wait count on the above waiter. [3] That tracker and the TextureClient being destroyed is then passed to ImageBridgeClient::RemoveTextureFromCompositableAsync().
That function sends a RemoveTextureAsync op to the compositor process and makes sure the TextureClient stays alive until we receive a reply from the compositor process by calling CompositableClient::HoldUntilComplete. [4]
The compositor process doesn't get the RemoveTextureAsync message before content dies because its I/O thread is blocked until then.
Sadness ensues.
[1] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/ipc/chromium/src/chrome/common/process_watcher_win.cc#77
[2] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/ipc/ImageBridgeChild.cpp#673
[3] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/client/ImageClient.cpp#94
[4] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/ipc/ImageBridgeChild.cpp#1240
Assignee | ||
Comment 16•9 years ago
|
||
Oh! Oh... hmm...
All of the dumps from Ash are in XPCOM shutdown.
XPCOM shutdown usually (always?) happens on the child only once the Content channel is closed. On the parent side, the Content channel closing starts the wait for the child to die (via PContentParent::OnChannelClose -> PContentParent::DestroySubtree -> ContentParent::ActorDestroy -> DelayedDeleteSubprocess -> EnsureProcessTerminated).
Ultimately, if we're deleting textures at XPCOM shutdown we have already lost the game. Unfortunately that's when media shutdown happens, which holds onto an ImageContainer until the very end.
Comment 17•9 years ago
|
||
Any chance bug 1215265 would have helped here? Also, the "Force e10s on" patch has had to change a bit. Below is a link to one that'll work on current trunk.
https://hg.mozilla.org/try/rev/8d6ed373efa8fc656264c2e0b15daa948040e176
Assignee | ||
Comment 18•9 years ago
|
||
Yes, that looks like it might fix it -- thanks for the heads up!
I've updated my m-c and pushed a try run. If green, I'll re-land the diagnostic patch as I'm pretty certain that the "in-the-wild" reports have a different cause (for one thing, they don't even have e10s enabled).
Comment 19•9 years ago
|
||
(In reply to Edwin Flores [:eflores] [:edwin] from comment #16)
> XPCOM shutdown usually (always?) happens on the child only once the Content
> channel is closed.
I hope not, because it doesn a whole lot of stuff that relies on IPC still being up. We should make sure that if the channel is closed when we run the shutdown stuff (that would be because of an abnormal shutdown I assume), we can cancel anything that waits on ipc related stuff.
>
> Ultimately, if we're deleting textures at XPCOM shutdown we have already
> lost the game. Unfortunately that's when media shutdown happens, which holds
> onto an ImageContainer until the very end.
That's when all of the shutdown of pretty much anything happens and I don't think we'll be able to shut things down earlier. We definitely need ipc to still be up in ShutdownXPCOM in the case of a normal shtudown, and maybe just kill the process without shutting down XPCOM for abnormal shutdowns.
(In reply to Edwin Flores [:eflores] [:edwin] from comment #18)
> Yes, that looks like it might fix it -- thanks for the heads up!
Ok good, although this stuff is not enabled on Windows yet.
Assignee | ||
Comment 20•9 years ago
|
||
(In reply to Edwin Flores [:eflores] [:edwin] from comment #18)
> Yes, that looks like it might fix it -- thanks for the heads up!
>
> I've updated my m-c and pushed a try run. If green, I'll re-land the
> diagnostic patch as I'm pretty certain that the "in-the-wild" reports have a
> different cause (for one thing, they don't even have e10s enabled).
Hmm, nope. Still happens.
Assignee | ||
Comment 21•9 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #19)
> (In reply to Edwin Flores [:eflores] [:edwin] from comment #16)
> > XPCOM shutdown usually (always?) happens on the child only once the Content
> > channel is closed.
>
> I hope not, because it doesn a whole lot of stuff that relies on IPC still
> being up. We should make sure that if the channel is closed when we run the
> shutdown stuff (that would be because of an abnormal shutdown I assume), we
> can cancel anything that waits on ipc related stuff.
Looks like I was wrong about "usually": this depends on NS_FREE_PERMANENT_DATA being set. That's why it's only showing up in debug builds (although I haven't quite figured out how it's meant to work without that #defined -- it appears we just hard kill ALL THE THINGS and forgo XPCOM shutdown?).
The way it is shutting down in this case (based on gratuitous addition of printfs) is, in glorious high detail:
On the parent side:
- ContentParent::NotifyTabDestroyed is called
- That dispatches a call to ShutDownProcess(SEND_SHUTDOWN_MESSAGE)
- That sends SendShutdown to the child
On the child:
- RecvShutdown is received
- That does some light shutdown stuff, including firing a notification with subject "content-child-shutdown" -- we could move the shutdown to here, maybe?
- That sends SendFinishShutdown to the parent.
On the parent again:
- RecvFinishShutdown is received
- That calls ShutDownProcess(CLOSE_CHANNEL)
- That calls Close()
The channel is now closed.
Simultaneously-ish:
- PContentChild/Parent::OnChannelClose is eventually called
- That calls DestroySubtree
- Which calls ContentChild/Parent::ActorDestroy
On the child:
- ContentChild::ActorDestroy calls XRE_ShutdownChildProcess if defined(NS_FREE_PERMANENT_DATA)
- This tells the message loop to quit.
On the parent, around the same time:
- ContentParent::ActorDestroy does some things -- lots of things. Among these things is dispatching a call to DelayedDeleteSubprocess for our content process
- That then dispatches a DeleteTask to the IO message loop for that content process
- That destroys our GeckoChildProcessHost
- ~GeckoChildProcessHost calls ProcessWatcher::EnsureProcessTerminated with force = !defined(NS_FREE_PERMANENT_DATA)
- The parent patiently waits for the child to die. [1]
Meanwhile, on the child main thread in XRE_InitChildProcess:
- The UI message loop returns from Run()
- ContentProcess::CleanUp is called
- That calls ScopedXREEmbed::Stop
- Which calls XRE_TermEmbedding
- Which calls NS_ShutdownXPCOM
- Which finally calls ShutdownXPCOM
TL;DR -- ShutdownXPCOM happens. Content channel is closed.
[1] https://dxr.mozilla.org/mozilla-central/rev/29d5a4175c8b74f45482276a53985cf2568b4be2/ipc/chromium/src/chrome/common/process_watcher_win.cc#77
Updated•9 years ago
|
Comment 22•9 years ago
|
||
Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724.
PContent "opens" but doesn't "manage" PCompositorBridge. Are they actually using the same channel? If they do I'd expect this issue to have showed up a lot more.
PContent also opens PImageBridge and I am certain that PCompositorBridge and PImageBridge do not use the same channel, so we may be trying to fix a problem with PContent we don't have.
Moving the gfx shutdown earlier is tricky because the earlier we shut gfx down, the earlier everything that depends on gfx needs to be shut down as well, and that pulls an uncomfortable amount of dependencies. I would rather make it so we call SendFinishShutdown from ShutdownXPCOM (if we have to, I still think we don't).
Assignee | ||
Comment 23•9 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #22)
> PContent "opens" but doesn't "manage" PCompositorBridge. Are they actually
> using the same channel? If they do I'd expect this issue to have showed up a
> lot more.
> PContent also opens PImageBridge and I am certain that PCompositorBridge and
> PImageBridge do not use the same channel, so we may be trying to fix a
> problem with PContent we don't have.
They don't use the same channel. But all IPC goes through the I/O thread (at least on the parent).
Comment 24•9 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #22)
> Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724.
Correction: he is trying to remove it but that hasn't landed yet.
Comment 25•9 years ago
|
||
(In reply to Edwin Flores [:eflores] [:edwin] from comment #23)
> They don't use the same channel. But all IPC goes through the I/O thread (at
> least on the parent).
Then what we need to fix is the DeleteTask on the parent IO loop being sent too early.
Comment 26•9 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #22)
> Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724.
Correction (bis): I meant bug 1254029.
Comment 27•9 years ago
|
||
Assignee | ||
Comment 28•9 years ago
|
||
Bug 1262898 will have fixed the shutdown problems. Re-landing diagnostic patch.
Comment 29•9 years ago
|
||
bugherder |
Updated•9 years ago
|
Blocks: e10s-crashes
Assignee | ||
Comment 31•9 years ago
|
||
This crash is very low volume in nightly and aurora. I don't expect we'll have any new information until Beta 48.
Flags: needinfo?(edwin)
Comment 33•8 years ago
|
||
Crash volume for signature 'RtlEnterCriticalSection | PR_EnterMonitor | mozilla::layers::ImageContainer::ClearImagesFromImageBridge':
- nightly (version 50): 0 crash from 2016-06-06.
- aurora (version 49): 0 crash from 2016-06-07.
- beta (version 48): 0 crash from 2016-06-06.
- release (version 47): 5285 crashes from 2016-05-31.
- esr (version 45): 87 crashes from 2016-04-07.
Crash volume on the last weeks:
Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7
- nightly 0 0 0 0 0 0 0
- aurora 0 0 0 0 0 0 0
- beta 0 0 0 0 0 0 0
- release 879 742 893 835 771 584 240
- esr 5 12 8 12 16 15 6
Affected platform: Windows
status-firefox-esr45:
--- → affected
Assignee | ||
Comment 34•8 years ago
|
||
Looks good these days. This signature hasn't been reported since 47, and neither has the diagnostic crash.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(edwin)
Resolution: --- → WORKSFORME
Comment 35•7 years ago
|
||
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Updated•2 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•