Closed
Bug 1205559
Opened 9 years ago
Closed 9 years ago
Access to TextureClient / TextureChild not thread-safe
Categories
(Core :: Graphics: Layers, defect)
Core
Graphics: Layers
Tracking
()
RESOLVED
FIXED
mozilla44
People
(Reporter: jya, Assigned: nical)
References
Details
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
jya
:
review+
|
Details | Diff | Splinter Review |
Investigation of bug 1072313 indicates that we have a race occurring upon freeing the mKeep object in TextureChild . In https://treeherder.mozilla.org/#/jobs?repo=try&revision=750259489f71 ; logging was added to find leaks on the KeepAlive object. In 30% of the MacIOSurface leak we see: TEST-UNEXPECTED-FAIL | leakcheck | default process: 88 bytes leaked (KeepAlive, MacIOSurface) The simple attempt to protect access to mKeep with a monitor: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b109b5578839 We sees that we have no more KeepAlive leaks (though still plenty of MacIOSurface leaks) Now, obviously protecting mKeep with a Monitor isn't the solution as we still have TextureClient::mActor and TextureChild::mTextureClient that are accessed in a non-threadsafe manner. It still indicates a problem. The fix is likely non-trivial as we probably don't want to use a monitor there which would likely have severe performance penalties. I have some more logs and assert to prove that points coming shortly.
Reporter | ||
Comment 1•9 years ago
|
||
Things like: http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#611 // Always make a temporary strong reference to the actor before we use it, // in case TextureChild::ActorDestroy might null mActor concurrently. RefPtr<TextureChild> actor = mActor; This is certainly not thread-safe ... TextureChild::ActorDestroy accesses both TextureChild::mKeep and TextureClient::mActor and can do so concurrently to ~TextureClient
Updated•9 years ago
|
Assignee: nobody → nical.bugzilla
Flags: needinfo?(nical.bugzilla)
Assignee | ||
Comment 2•9 years ago
|
||
The two problems are: - We should not have to manually call TextureClient::ForceRemove (which we when shutting down the ImageBridge protocol for all textures that are still alive (and there is no sane reason for any texture to be kept alive that late after half of gfx has already been shut down). Right now ther can be a race between this manual call to ForceRemove and the TextureClient actually starting its late deallocation on another thread. - A race when code touches the ipdl actor outside the ipdl thread (and ~TextureClient can run off the ipdl thread so it should not do things like KeppUntilDeallocation). There's a patch in bug 1072313 which adds a hook to run destruction code on the ipdl thread. ForceRemove and ActorDestroy can only run in the ipdl thread so they can't race with each other.
Flags: needinfo?(nical.bugzilla)
Reporter | ||
Comment 3•9 years ago
|
||
I think this is the easiest way to solve the problem without a complete rethink of the threading model.
Attachment #8662354 -
Flags: review?(nical.bugzilla)
Reporter | ||
Comment 4•9 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #2) > ForceRemove and ActorDestroy can only run in the ipdl thread so they can't > race with each other. Are you sure? ActorDestroy I thought I saw it being called from the main thread from time to time. If it is ; then running KeppUntilDeallocation would be a solution. However, what about the other methods accessing TextureClient::mActor ? Are they all always running on the ipdl thread? If not; you still have a race seeing that mActor can be cleared.
Flags: needinfo?(nical.bugzilla)
Reporter | ||
Comment 5•9 years ago
|
||
I added an assert on ActorDestroy testing that we're never on the main thread: it hits it instantly. Assertion failure: !NS_IsMainThread(), at /Users/jyavenard/Work/Mozilla/mozilla-central/gfx/layers/client/TextureClient.cpp:170 #01: mozilla::layers::TextureChild::ActorDestroy(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x17c6599] #02: mozilla::layers::PTextureChild::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0xcfe2cb] #03: mozilla::layers::PTextureChild::OnMessageReceived(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0xcfdf4e] #04: mozilla::layers::PCompositorChild::OnMessageReceived(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0xed4073] #05: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x873d5d] #06: mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x8730ed] #07: mozilla::ipc::MessageChannel::OnMaybeDequeueOne()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x86ef32] #08: void DispatchToMethod<mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)()>(mozilla::ipc::MessageChannel*, bool (mozilla::ipc::MessageChannel::*)(), Tuple0 const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x893443] #09: RunnableMethod<mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)(), Tuple0>::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x8932ee] #10: mozilla::ipc::MessageChannel::RefCountedTask::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x88f908] #11: mozilla::ipc::MessageChannel::DequeueTask::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x88f774] #12: MessageLoop::RunTask(Task*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7e3930] #13: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7e3e9f] #14: MessageLoop::DoWork()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7e40c4] #15: mozilla::ipc::DoWorkRunnable::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x876e35] #16: nsThread::ProcessNextEvent(bool, bool*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x198b55] #17: NS_ProcessNextEvent(nsIThread*, bool)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x217e97] #18: nsThread::Shutdown()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x198373] #19: void nsRunnableMethodArguments<>::apply<nsIThread, nsresult (nsIThread::*)()>(nsIThread*, nsresult (nsIThread::*)())[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1c1323] #20: nsRunnableMethodImpl<nsresult (nsIThread::*)(), true>::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1c1156] #21: nsThread::ProcessNextEvent(bool, bool*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x198b55] #22: NS_ProcessPendingEvents(nsIThread*, unsigned int)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x217d07] #23: nsBaseAppShell::NativeEventCallback()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3d15506] #24: nsAppShell::ProcessGeckoEvents(void*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3da56fd] #25: __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0xaa621] #26: __CFRunLoopDoSources0[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0x89e1c] #27: __CFRunLoopRun[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0x8933f] #28: CFRunLoopRunSpecific[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0x88d38] #29: RunCurrentEventLoopInMode[/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox +0x30d55] #30: ReceiveNextEventCommon[/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox +0x30a97] #31: _BlockUntilNextEventMatchingListInModeWithFilter[/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox +0x309cf] #32: _DPSNextEvent[/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit +0x49236] #33: -[NSApplication _nextEventMatchingEventMask:untilDate:inMode:dequeue:][/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit +0x48665] #34: -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:][/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3da4237] #35: -[NSApplication run][/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit +0x3d1c8] #36: nsAppShell::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3da60ec] #37: nsAppStartup::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4df6541] #38: XREMain::XRE_mainRun()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4ecde4c] #39: XREMain::XRE_main(int, char**, nsXREAppData const*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4ece6f9] #40: XRE_main[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4ecebc7] #41: do_main(int, char**, nsIFile*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/firefox +0x297f] #42: main[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/firefox +0x1da6]
Assignee | ||
Comment 6•9 years ago
|
||
(In reply to Jean-Yves Avenard [:jya] from comment #4) > Are you sure? > > ActorDestroy I thought I saw it being called from the main thread from time > to time. Yes, and you are also right: what I called the ipdl thread is the thread that the texture's ipdl actor belongs to, and that can be either the ImageBridge thread (if the texture is created for async video playback), or the main thread (general case). Video stuff is what tends to keep references to the textures outside of the ipdl thread (ImageBridge in this case). > However, what about the other methods accessing TextureClient::mActor ? Are > they all always running on the ipdl thread? TextureClient::mActor is only cleared after the last reference to the TextureClient is gone because TextureClient's reference counting is what triggers the awful and convoluted destruction sequence (I am trying to clean that in another bug but I discover new things with every attempt), unless the texture has survived longer than the ImageBridge (in which case we force the destruction on the ImageBridge thread which can cause the remaining race). When that happens we are basically in a state where a lot of bad things can happen because a lot of components that gfx use have already been destroyed or and the remaining ones are about to be destroyed in a bunch of cycles. a leak is probably one of the least preoccupying thing that can happen in this situation, so I would prefer that we figure out how to make sure things like video decoders let go of all their textures before we get there, rather than try to make things work in this situation. I don't know what's holding the texture for that long, but it could be that it needs to be shut down earlier. For example if it is being shut down on the event NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID, it should probably observe NS_XPCOM_SHUTDOWN_OBSERVER_ID instead. ImageBridge is basically destroyed as late as possible (after that there is NS_XPCOM_SHUTDOWN_THREADS and then all threads must be gone).
Flags: needinfo?(nical.bugzilla)
Reporter | ||
Comment 7•9 years ago
|
||
I did a very crude patch that would cause an assert should any members of TextClient be accessed by two threads concurrently. I got an assert there: xul.dll!mozilla::layers::TextureClient::SetRecycleAllocator(mozilla::layers::TextureClientRecycleAllocator *) [TextureClient.cpp:baaa7fdcb26a : 275 + 0xa] logs: https://treeherder.mozilla.org/logviewer.html#?job_id=11636252&repo=try Now I haven't examined what SetRecycleAllocator is doing it could very well do like TextureClient::ForceRemove dispatch a sync task to another TextureClient method which would trigger an assert ; but from the stack trace, the call to SetRecycleAllocator is initiated from WMFMediaDataDecoder. Tomorrow I'll make it print the backtrace of the thread concurrently accessing the TextureClient ; will give us more information (like there's another crash in RecycleTexture, but this one does seem to be due to a sync dispatch)
Reporter | ||
Comment 8•9 years ago
|
||
(In reply to Jean-Yves Avenard [:jya] from comment #7) > I did a very crude patch that would cause an assert should any members of > TextClient be accessed by two threads concurrently. > > I got an assert there: > xul.dll!mozilla::layers::TextureClient::SetRecycleAllocator(mozilla::layers:: > TextureClientRecycleAllocator *) [TextureClient.cpp:baaa7fdcb26a : 275 + 0xa] > > looks like this one just got fixed by bug 1197534. At least it's comforting to see that the patch mentioned in comment 7 does its job and found a genuine race!
Depends on: 1197534
Reporter | ||
Comment 9•9 years ago
|
||
Still some races / assertion failure, only on windows 7 / 8: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb4c876a83f7 On debug build: Assertion failure: !mActor->mKeep, at c:/builds/moz2_slave/try-w64-d-00000000000000000000/build/src/gfx/layers/client/TextureClient.cpp:573 This is TextureClient::KeepUntilFullDeallocation On opt build this hits the thread safety crash: which is due to a concurrent access to TextureClient::RecycleTexture. A pity the printf aren't shown ; it would print which other functions are being access simultaneously. From earlier backtraces; this appears related to the recycling of D3D textures.
Flags: needinfo?(matt.woodrow)
Reporter | ||
Comment 10•9 years ago
|
||
Yes, it's the D3D9 recycling that appears racy: the concurrent access is in: 05:05:16 INFO - 8 xul.dll!mozilla::layers::TextureClient::SetRecycleAllocator(mozilla::layers::TextureClientRecycleAllocator *) [TextureClient.cpp:bb4c876a83f7 : 282 + 0xa] 05:05:16 INFO - rsp = 0x000000cb8f0cf2e0 rip = 0x000007fc2df8fe0a 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 9 xul.dll!mozilla::layers::TextureClientRecycleAllocator::CreateOrRecycle(mozilla::gfx::SurfaceFormat,mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits>,mozilla::layers::BackendSelector,mozilla::layers::TextureFlags,mozilla::layers::TextureAllocationFlags) [TextureClientRecycleAllocator.cpp:bb4c876a83f7 : 129 + 0xb] 05:05:16 INFO - rsp = 0x000000cb8f0cf310 rip = 0x000007fc2df8a94b 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 10 xul.dll!mozilla::layers::D3D9RecycleAllocator::CreateOrRecycleClient(mozilla::gfx::SurfaceFormat,mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const &) [D3D9SurfaceImage.cpp:bb4c876a83f7 : 228 + 0x1b] 05:05:16 INFO - rsp = 0x000000cb8f0cf3c0 rip = 0x000007fc2df563df 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 11 xul.dll!mozilla::layers::D3D9SurfaceImage::SetData(mozilla::layers::D3D9SurfaceImage::Data const &) [D3D9SurfaceImage.cpp:bb4c876a83f7 : 60 + 0x20] 05:05:16 INFO - rsp = 0x000000cb8f0cf420 rip = 0x000007fc2df59ce5 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 12 xul.dll!mozilla::D3D9DXVA2Manager::CopyToImage(IMFSample *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::layers::ImageContainer *,mozilla::layers::Image * *) [DXVA2Manager.cpp:bb4c876a83f7 : 429 + 0x4e] 05:05:16 INFO - rsp = 0x000000cb8f0cf530 rip = 0x000007fc2e866e92 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 13 xul.dll!mozilla::WMFVideoMFTManager::CreateD3DVideoFrame(IMFSample *,__int64,mozilla::VideoData * *) [WMFVideoMFTManager.cpp:bb4c876a83f7 : 540 + 0x2e] 05:05:16 INFO - rsp = 0x000000cb8f0cf5c0 rip = 0x000007fc2e867538 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 14 xul.dll!mozilla::WMFVideoMFTManager::Output(__int64,nsRefPtr<mozilla::MediaData> &) [WMFVideoMFTManager.cpp:bb4c876a83f7 : 606 + 0x15] 05:05:16 INFO - rsp = 0x000000cb8f0cf670 rip = 0x000007fc2e869eab 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 15 xul.dll!mozilla::WMFMediaDataDecoder::ProcessOutput() [WMFMediaDataDecoder.cpp:bb4c876a83f7 : 153 + 0x13] 05:05:16 INFO - rsp = 0x000000cb8f0cf6d0 rip = 0x000007fc2e86a096 05:05:16 INFO - Found by: call frame info 05:05:16 INFO - 16 xul.dll!mozilla::WMFMediaDataDecoder::ProcessDecode(mozilla::MediaRawData *) [WMFMediaDataDecoder.cpp:bb4c876a83f7 : 144 + 0xc] 05:05:16 INFO - rsp = 0x000000cb8f0cf700 rip = 0x000007fc2e869f9f 05:05:16 INFO - Found by: call frame info
Milan - can you prod nical?
Flags: needinfo?(milan)
Is the patch still relevant?
Flags: needinfo?(milan)
Reporter | ||
Comment 13•9 years ago
|
||
The patch is no longer applicable since bug 1072313. However the races are still there. (in particular the allocation / recycling of the D3D textures) ; they are allocated/recycled on multiple threads with no locking in place.
Assignee | ||
Comment 14•9 years ago
|
||
Comment on attachment 8662354 [details] [diff] [review] Make TextureChild/TextureClient thread-safe. Review of attachment 8662354 [details] [diff] [review]: ----------------------------------------------------------------- We don't need to lock around mKeep anymore. It's worth taking the rest of the patch though, even if I haven't seen reports of crashes or leaks related to the other members. That stuff will get rewritten as part of 1200595 and followups but it's taking longer than I hoped.
Assignee | ||
Comment 15•9 years ago
|
||
Jean-Yves, this is a rebased version of your patch without the synchronization around mKeep.
Attachment #8662354 -
Attachment is obsolete: true
Attachment #8662354 -
Flags: review?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
Attachment #8673520 -
Flags: review?(jyavenard)
Reporter | ||
Comment 16•9 years ago
|
||
Comment on attachment 8673520 [details] [diff] [review] rebased patch Review of attachment 8673520 [details] [diff] [review]: ----------------------------------------------------------------- I'm not sure this will be enough to resolve the issues on Windows where a DXVA surface may be used on multiple threads at once.
Attachment #8673520 -
Flags: review?(jyavenard) → review+
https://hg.mozilla.org/mozilla-central/rev/09de9d59910d https://hg.mozilla.org/mozilla-central/rev/cbbc1719a22b
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox44:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
https://hg.mozilla.org/mozilla-central/rev/09de9d59910d https://hg.mozilla.org/mozilla-central/rev/cbbc1719a22b
You need to log in
before you can comment on or make changes to this bug.
Description
•