Closed Bug 1345735 Opened 8 years ago Closed 8 years ago

Intermittent tsvgx | Found crashes after test run, terminating test

Categories

(Testing :: Talos, defect)

Version 3
defect
Not set
normal

Tracking

(firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: dvander)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(2 files)

Whiteboard: [stockwell needswork]
Talos, Windows 8, e10s crash (not timeout), bad report, as in several other bugs and under discussion in bug 1310638.
More than 90% of these have failures have no crash report, but a few do! Here's one: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=85527944&lineNumber=2215 22:57:18 INFO - PROCESS-CRASH | tsvgx | application crashed [@ CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason)] 22:57:18 INFO - Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmp3ohf2e\profile\minidumps\21b16005-da8d-466d-8185-18535b975da9.dmp 22:57:18 INFO - Operating system: Windows NT 22:57:18 INFO - 6.2.9200 22:57:18 INFO - CPU: amd64 22:57:18 INFO - family 6 model 30 stepping 5 22:57:18 INFO - 8 CPUs 22:57:18 INFO - GPU: UNKNOWN 22:57:18 INFO - Crash reason: EXCEPTION_BREAKPOINT 22:57:18 INFO - Crash address: 0x7ffb9101b63 22:57:18 INFO - Assertion: Unknown assertion type 0x00000000 22:57:18 INFO - Process uptime: 9 seconds 22:57:18 INFO - Thread 9 (crashed) 22:57:18 INFO - 0 xul.dll!CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason) [gfxPlatform.cpp:e0be781966d4 : 408 + 0x11] 22:57:18 INFO - rax = 0x000007ffcb730a30 rdx = 0x0000000000000000 22:57:18 INFO - rcx = 0x000007ffbb252fc0 rbx = 0x0000000957e3f3e0 22:57:18 INFO - rsi = 0x0000000951929b20 rdi = 0x0000000000000004 22:57:18 INFO - rbp = 0x0000000957e3f3b0 rsp = 0x0000000957e3f2e0 22:57:18 INFO - r8 = 0x000007ffbba90960 r9 = 0x0000000000000190 22:57:18 INFO - r10 = 0x00000009515985f1 r11 = 0x0000000957e3f210 22:57:18 INFO - r12 = 0x0000000000000002 r13 = 0x0000000000000000 22:57:18 INFO - r14 = 0x0000000088760873 r15 = 0x00000009519067a0 22:57:18 INFO - rip = 0x000007ffb9101b63 22:57:18 INFO - Found by: given as instruction pointer in context 22:57:18 INFO - 1 xul.dll!mozilla::gfx::CriticalLogger::CrashAction(mozilla::gfx::LogReason) [Factory.cpp:e0be781966d4 : 971 + 0x11] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f310 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb8f822af 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 2 xul.dll!mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush() [Logging.h:e0be781966d4 : 280 + 0x3b] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f340 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb8c4361e 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 3 xul.dll!mozilla::layers::SyncObjectD3D11::Init() [TextureD3D11.cpp:e0be781966d4 : 1235 + 0x44] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f3c0 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb90b2e74 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 4 xul.dll!mozilla::layers::SyncObjectD3D11::FinalizeFrame() [TextureD3D11.cpp:e0be781966d4 : 1274 + 0x5] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f500 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb90b17eb 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 5 xul.dll!mozilla::D3D11DXVA2Manager::CopyToImage(IMFSample *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::layers::Image * *) [DXVA2Manager.cpp:e0be781966d4 : 948 + 0xa] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f6b0 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a1e650 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 6 xul.dll!mozilla::WMFVideoMFTManager::CreateD3DVideoFrame(IMFSample *,__int64,mozilla::VideoData * *) [WMFVideoMFTManager.cpp:e0be781966d4 : 931 + 0x1d] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f7a0 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a1ef4d 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 7 xul.dll!mozilla::WMFVideoMFTManager::Output(__int64,RefPtr<mozilla::MediaData> &) [WMFVideoMFTManager.cpp:e0be781966d4 : 1048 + 0x21] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f850 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a22d6d 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 8 xul.dll!mozilla::WMFMediaDataDecoder::ProcessOutput(nsTArray<RefPtr<mozilla::MediaData> > &) [WMFMediaDataDecoder.cpp:e0be781966d4 : 155 + 0x13] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f950 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a2318f 22:57:18 INFO - Found by: call frame info 22:57:18 INFO - 9 xul.dll!mozilla::WMFMediaDataDecoder::ProcessDrain() [WMFMediaDataDecoder.cpp:e0be781966d4 : 200 + 0x5] 22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0 22:57:18 INFO - rsp = 0x0000000957e3f980 r12 = 0x0000000000000002 22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873 22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a22f79 22:57:18 INFO - Found by: call frame info
David - Can you have a look at the crash reports in comment 4 and 5? We see these crashes infrequently in Windows e10s Talos tests, apparently at shutdown time. (We also see *frequent* Windows e10s Talos crashes at shutdown time without any crash reports. I'm hoping the SyncObjectD3D11 crash is the same issue...and that you can fix it, or tell us how to avoid these crashes!)
:milan, can you help us find someone to look at this- we suspect this is the cause for a 25%+ failure rate in all windows 8 talos crashes. Geoff has outlined this nicely in Comment 6.
Flags: needinfo?(milan)
If this is only happening on shutdown, that makes a lot of sense - media could be racing with shutting down D3D11 devices. Can we tell what PID is crashing? If it's the GPU process, then this locking might serve no purpose and can just be removed. If it's content, we'll have to investigate more.
Flags: needinfo?(dvander) → needinfo?(gbrown)
It does appear that all of these crashes are on shutdown. We cannot easily determine the PID, but the process type is available. There is a .extra file associated with minidumps, available as artifacts of the test job. In both crashes from comment 4 and comment 5, the .extra files show ProcessType=gpu. (If we really need the PID, :ted says it is available and can be retrieved with code like https://github.com/mozilla/socorro/blob/master/minidump-stackwalk/stackwalker.cc#L1196. Thanks much :ted!)
Flags: needinfo?(gbrown)
:dvander, knowing this is a GPU process, can you look into removing the shutdown locking?
Flags: needinfo?(dvander)
I talked with Matt Woodrow and it sounds like we can remove the locking on AMD GPUs only. If our machines use nVidia/Intel we'll have to do something else... do you know what they run, Joel?
Flags: needinfo?(dvander) → needinfo?(jmaher)
our machines are documented here: https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation and for the linux machines: iX21X4 2U Neutron "Gemini" Series Four Node Hot-Pluggable Server (4 nodes per 2U) 920W High-Efficiency redundant (1+1) Power Supply 1 Intel X3450 CPU per node 8GB Total: 2 x 4GB DDR3 1333Mhz ECC/REG RAM per node 1 WD5003ABYX hard drive per node 1 NVIDIA GPU GeForce GT 610 per node we will have new machines later in the summer or fall, I believe they will be Intel graphics. Glad to hear there is a path forward here.
Flags: needinfo?(jmaher)
Okay, can't dropping the locking... I'll investigate further then.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(milan)
:dvander, checking in here- do you have any updates here?
Flags: needinfo?(dvander)
(In reply to Joel Maher ( :jmaher) from comment #17) > :dvander, checking in here- do you have any updates here? Sorry, I haven't had time to look at it yet. I can't seem to find any newer stack traces. The logs on brasstacks all say: "16:53:12 INFO - PROCESS-CRASH | tsvgx | application crashed [unknown top frame]" There's no stack, and the minidump isn't parseable by Visual Studio. Did something break or am I looking in the wrong places?
Flags: needinfo?(dvander) → needinfo?(jmaher)
Recall comments 4 and 6: More than 90% of these Windows Talos shutdown crashes have no crash reports (usually there's a minidump, but it isn't parseable). You have to look through at least 20 logs before you find a stack. Here's another one, a little different, from yesterday: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=90753534&lineNumber=2213
Flags: needinfo?(jmaher)
Errr... why do so few runs not have crash reports? Is that intentional?
I put a bunch of sleep() calls in right before video locks the SyncObject, to root out race conditions during shutdown. The first thing that hit was this, InitIPDLActor ends up getting a null forwarder.
Attachment #8858182 - Flags: review?(matt.woodrow)
Attached patch part 2, don't dev-crash (deleted) — Splinter Review
This code actually asserted earlier than the talos crash, because CompositorBridgeChild shouldn't be used in the GPU process and definitely not off the main thread. This patch just skips the gfxDevCrash if we're not in the main thread.
Attachment #8858183 - Flags: review?(matt.woodrow)
Attachment #8858182 - Flags: review?(matt.woodrow) → review+
Attachment #8858183 - Flags: review?(matt.woodrow) → review+
let me know if I can land these patches, right now our #1 failure on the tree is a new talos test (glvideo in bug 1356445) which is failing with this same pattern- in total we have 400+ instances of failures in the last week related to what looks like this issue.
Pushed by danderson@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d6348ea45c2d Don't create TextureClients if the video bridge has shut down. (bug 1345735 part 1, r=mattwoodrow) https://hg.mozilla.org/integration/mozilla-inbound/rev/7c5628d40478 Don't gfxDevCrash when video fails to acquire a SyncObject. (bug 1345735 part 2, r=mattwoodrow)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [stockwell needswork] → [stockwell fixed]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: