Closed Bug 1244883 Opened 9 years ago Closed 8 years ago

crash in RtlEnterCriticalSection | PR_EnterMonitor | mozilla::layers::ImageContainer::ClearImagesFromImageBridge

Categories

(Core :: Graphics: Layers, defect)

x86
Windows
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox47 --- wontfix
firefox-esr45 --- wontfix

People

(Reporter: milan, Assigned: eflores)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-8d6f2243-3ef6-4a22-980e-493e72160127. ============================================================= No particular correlation I could find. Here's one from 45: https://crash-stats.mozilla.com/report/index/7df17f96-c760-457a-b9ca-365192160130
Timothy, anything that looks interesting here?
Flags: needinfo?(tnikkel)
Whiteboard: [gfx-noted]
Probably not likely an imagelib issue. ImageContainer's are also used for video, and all of the code in the crash stack lives under gfx/layers.
Component: ImageLib → Graphics: Layers
Flags: needinfo?(tnikkel)
Most of the crash reports involve media, but I'm not sure yet whether that's relevant. On MediaDecoderStateMachine shutdown (which indirectly owns the ImageContainer), ImageBridgeChild::FlushAllImages is called. This queues a sync-ish (using AsyncTransactionWaiter) task on the ImageBridge thread which calls ImageBridgeChild::FlushAllImagesSync. This function acts on the ImageContainer, which stays alive while FlushAllImages is waiting. AsyncTransactionWaiter has a timeout of 50 seconds which it's hitting, AFAICT. So when FlushAllImagesSync finally gets a chance to run, the ImageContainer is already dead. I can't get much more information out of these crash reports, since the real cause happens before the crash. I think should add a gfxDevCrash on AsyncTransactionWaiter timeout to find the actual cause.
Assignee: nobody → edwin
Attached patch 1244883-assert.patch (obsolete) (deleted) — — Splinter Review
Attachment #8733888 - Flags: review?(nical.bugzilla)
Attached patch 1244883-assert.patch (deleted) — — Splinter Review
oops.
Attachment #8733888 - Attachment is obsolete: true
Attachment #8733888 - Flags: review?(nical.bugzilla)
Attachment #8733912 - Flags: review?(nical.bugzilla)
Attachment #8733912 - Flags: review?(nical.bugzilla) → review+
cc :sotaro who is rewriting things related to this code in another bug (I think).
Comment on attachment 8733912 [details] [diff] [review] 1244883-assert.patch Review of attachment 8733912 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/AsyncTransactionTracker.cpp @@ +35,5 @@ > printf_stderr("Timeout of waiting transaction complete."); > } > + > + if (count == maxCount) { > + gfxDevCrash(LogReason::AsyncTransactionTimeout) << "Bug 1244883: AsyncTransactionWaiter timed out."; Would it be useful to have mWaitCount in the crash report?
Edwin, does the data from those try runs help?
Flags: needinfo?(edwin)
Depends on: 1259630
More data can't hurt! I find it odd that those tests didn't hit this bug previously, though. Will look today.
Flags: needinfo?(edwin)
Managed to get a dump of the parent process. What I see happening is the compositor process trying to shut down the content process. It eventually reaches EnsureProcessTerminated and waits indefinitely *on the I/O thread* for the content process to die, but doesn't attempt to terminate it [1]. On the content side, when the media stuff shuts down, MediaFormatReader::ReleaseMediaResources() calls ImageContainer::ClearAllImages(), which calls ImageBridgeChild::FlushAllImages. This adds a FlushAllImagesSync() task to the ImageBridge thread and waits on an AsyncTransactionWaiter. [2] On the ImageBridge thread, FlushAllImagesSync() indirectly calls ImageClient::RemoveTextureWithWaiter(). This creates an AsyncTransactionTracker which increments the wait count on the above waiter. [3] That tracker and the TextureClient being destroyed is then passed to ImageBridgeClient::RemoveTextureFromCompositableAsync(). That function sends a RemoveTextureAsync op to the compositor process and makes sure the TextureClient stays alive until we receive a reply from the compositor process by calling CompositableClient::HoldUntilComplete. [4] The compositor process doesn't get the RemoveTextureAsync message before content dies because its I/O thread is blocked until then. Sadness ensues. [1] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/ipc/chromium/src/chrome/common/process_watcher_win.cc#77 [2] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/ipc/ImageBridgeChild.cpp#673 [3] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/client/ImageClient.cpp#94 [4] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/ipc/ImageBridgeChild.cpp#1240
Oh! Oh... hmm... All of the dumps from Ash are in XPCOM shutdown. XPCOM shutdown usually (always?) happens on the child only once the Content channel is closed. On the parent side, the Content channel closing starts the wait for the child to die (via PContentParent::OnChannelClose -> PContentParent::DestroySubtree -> ContentParent::ActorDestroy -> DelayedDeleteSubprocess -> EnsureProcessTerminated). Ultimately, if we're deleting textures at XPCOM shutdown we have already lost the game. Unfortunately that's when media shutdown happens, which holds onto an ImageContainer until the very end.
Any chance bug 1215265 would have helped here? Also, the "Force e10s on" patch has had to change a bit. Below is a link to one that'll work on current trunk. https://hg.mozilla.org/try/rev/8d6ed373efa8fc656264c2e0b15daa948040e176
Yes, that looks like it might fix it -- thanks for the heads up! I've updated my m-c and pushed a try run. If green, I'll re-land the diagnostic patch as I'm pretty certain that the "in-the-wild" reports have a different cause (for one thing, they don't even have e10s enabled).
(In reply to Edwin Flores [:eflores] [:edwin] from comment #16) > XPCOM shutdown usually (always?) happens on the child only once the Content > channel is closed. I hope not, because it doesn a whole lot of stuff that relies on IPC still being up. We should make sure that if the channel is closed when we run the shutdown stuff (that would be because of an abnormal shutdown I assume), we can cancel anything that waits on ipc related stuff. > > Ultimately, if we're deleting textures at XPCOM shutdown we have already > lost the game. Unfortunately that's when media shutdown happens, which holds > onto an ImageContainer until the very end. That's when all of the shutdown of pretty much anything happens and I don't think we'll be able to shut things down earlier. We definitely need ipc to still be up in ShutdownXPCOM in the case of a normal shtudown, and maybe just kill the process without shutting down XPCOM for abnormal shutdowns. (In reply to Edwin Flores [:eflores] [:edwin] from comment #18) > Yes, that looks like it might fix it -- thanks for the heads up! Ok good, although this stuff is not enabled on Windows yet.
(In reply to Edwin Flores [:eflores] [:edwin] from comment #18) > Yes, that looks like it might fix it -- thanks for the heads up! > > I've updated my m-c and pushed a try run. If green, I'll re-land the > diagnostic patch as I'm pretty certain that the "in-the-wild" reports have a > different cause (for one thing, they don't even have e10s enabled). Hmm, nope. Still happens.
(In reply to Nicolas Silva [:nical] from comment #19) > (In reply to Edwin Flores [:eflores] [:edwin] from comment #16) > > XPCOM shutdown usually (always?) happens on the child only once the Content > > channel is closed. > > I hope not, because it doesn a whole lot of stuff that relies on IPC still > being up. We should make sure that if the channel is closed when we run the > shutdown stuff (that would be because of an abnormal shutdown I assume), we > can cancel anything that waits on ipc related stuff. Looks like I was wrong about "usually": this depends on NS_FREE_PERMANENT_DATA being set. That's why it's only showing up in debug builds (although I haven't quite figured out how it's meant to work without that #defined -- it appears we just hard kill ALL THE THINGS and forgo XPCOM shutdown?). The way it is shutting down in this case (based on gratuitous addition of printfs) is, in glorious high detail: On the parent side: - ContentParent::NotifyTabDestroyed is called - That dispatches a call to ShutDownProcess(SEND_SHUTDOWN_MESSAGE) - That sends SendShutdown to the child On the child: - RecvShutdown is received - That does some light shutdown stuff, including firing a notification with subject "content-child-shutdown" -- we could move the shutdown to here, maybe? - That sends SendFinishShutdown to the parent. On the parent again: - RecvFinishShutdown is received - That calls ShutDownProcess(CLOSE_CHANNEL) - That calls Close() The channel is now closed. Simultaneously-ish: - PContentChild/Parent::OnChannelClose is eventually called - That calls DestroySubtree - Which calls ContentChild/Parent::ActorDestroy On the child: - ContentChild::ActorDestroy calls XRE_ShutdownChildProcess if defined(NS_FREE_PERMANENT_DATA) - This tells the message loop to quit. On the parent, around the same time: - ContentParent::ActorDestroy does some things -- lots of things. Among these things is dispatching a call to DelayedDeleteSubprocess for our content process - That then dispatches a DeleteTask to the IO message loop for that content process - That destroys our GeckoChildProcessHost - ~GeckoChildProcessHost calls ProcessWatcher::EnsureProcessTerminated with force = !defined(NS_FREE_PERMANENT_DATA) - The parent patiently waits for the child to die. [1] Meanwhile, on the child main thread in XRE_InitChildProcess: - The UI message loop returns from Run() - ContentProcess::CleanUp is called - That calls ScopedXREEmbed::Stop - Which calls XRE_TermEmbedding - Which calls NS_ShutdownXPCOM - Which finally calls ShutdownXPCOM TL;DR -- ShutdownXPCOM happens. Content channel is closed. [1] https://dxr.mozilla.org/mozilla-central/rev/29d5a4175c8b74f45482276a53985cf2568b4be2/ipc/chromium/src/chrome/common/process_watcher_win.cc#77
Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724. PContent "opens" but doesn't "manage" PCompositorBridge. Are they actually using the same channel? If they do I'd expect this issue to have showed up a lot more. PContent also opens PImageBridge and I am certain that PCompositorBridge and PImageBridge do not use the same channel, so we may be trying to fix a problem with PContent we don't have. Moving the gfx shutdown earlier is tricky because the earlier we shut gfx down, the earlier everything that depends on gfx needs to be shut down as well, and that pulls an uncomfortable amount of dependencies. I would rather make it so we call SendFinishShutdown from ShutdownXPCOM (if we have to, I still think we don't).
(In reply to Nicolas Silva [:nical] from comment #22) > PContent "opens" but doesn't "manage" PCompositorBridge. Are they actually > using the same channel? If they do I'd expect this issue to have showed up a > lot more. > PContent also opens PImageBridge and I am certain that PCompositorBridge and > PImageBridge do not use the same channel, so we may be trying to fix a > problem with PContent we don't have. They don't use the same channel. But all IPC goes through the I/O thread (at least on the parent).
(In reply to Nicolas Silva [:nical] from comment #22) > Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724. Correction: he is trying to remove it but that hasn't landed yet.
(In reply to Edwin Flores [:eflores] [:edwin] from comment #23) > They don't use the same channel. But all IPC goes through the I/O thread (at > least on the parent). Then what we need to fix is the DeleteTask on the parent IO loop being sent too early.
(In reply to Nicolas Silva [:nical] from comment #22) > Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724. Correction (bis): I meant bug 1254029.
Bug 1262898 will have fixed the shutdown problems. Re-landing diagnostic patch.
Blocks: e10s-crashes
hey edwin, any new information here?
Flags: needinfo?(edwin)
This crash is very low volume in nightly and aurora. I don't expect we'll have any new information until Beta 48.
Flags: needinfo?(edwin)
should we uplift the diagnostics to beta 47?
Flags: needinfo?(edwin)
Crash volume for signature 'RtlEnterCriticalSection | PR_EnterMonitor | mozilla::layers::ImageContainer::ClearImagesFromImageBridge': - nightly (version 50): 0 crash from 2016-06-06. - aurora (version 49): 0 crash from 2016-06-07. - beta (version 48): 0 crash from 2016-06-06. - release (version 47): 5285 crashes from 2016-05-31. - esr (version 45): 87 crashes from 2016-04-07. Crash volume on the last weeks: Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7 - nightly 0 0 0 0 0 0 0 - aurora 0 0 0 0 0 0 0 - beta 0 0 0 0 0 0 0 - release 879 742 893 835 771 584 240 - esr 5 12 8 12 16 15 6 Affected platform: Windows
Looks good these days. This signature hasn't been reported since 47, and neither has the diagnostic crash.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(edwin)
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: