Closed
Bug 1345735
Opened 8 years ago
Closed 8 years ago
Intermittent tsvgx | Found crashes after test run, terminating test
Categories
(Testing :: Talos, defect)
Tracking
(firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed)
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox53 | --- | unaffected |
firefox54 | --- | unaffected |
firefox55 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: dvander)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])
Attachments
(2 files)
(deleted),
patch
|
mattwoodrow
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mattwoodrow
:
review+
|
Details | Diff | Splinter Review |
Updated•8 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Whiteboard: [stockwell needswork]
Comment 3•8 years ago
|
||
Talos, Windows 8, e10s crash (not timeout), bad report, as in several other bugs and under discussion in bug 1310638.
Comment 4•8 years ago
|
||
More than 90% of these have failures have no crash report, but a few do! Here's one:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=85527944&lineNumber=2215
22:57:18 INFO - PROCESS-CRASH | tsvgx | application crashed [@ CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason)]
22:57:18 INFO - Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmp3ohf2e\profile\minidumps\21b16005-da8d-466d-8185-18535b975da9.dmp
22:57:18 INFO - Operating system: Windows NT
22:57:18 INFO - 6.2.9200
22:57:18 INFO - CPU: amd64
22:57:18 INFO - family 6 model 30 stepping 5
22:57:18 INFO - 8 CPUs
22:57:18 INFO - GPU: UNKNOWN
22:57:18 INFO - Crash reason: EXCEPTION_BREAKPOINT
22:57:18 INFO - Crash address: 0x7ffb9101b63
22:57:18 INFO - Assertion: Unknown assertion type 0x00000000
22:57:18 INFO - Process uptime: 9 seconds
22:57:18 INFO - Thread 9 (crashed)
22:57:18 INFO - 0 xul.dll!CrashStatsLogForwarder::CrashAction(mozilla::gfx::LogReason) [gfxPlatform.cpp:e0be781966d4 : 408 + 0x11]
22:57:18 INFO - rax = 0x000007ffcb730a30 rdx = 0x0000000000000000
22:57:18 INFO - rcx = 0x000007ffbb252fc0 rbx = 0x0000000957e3f3e0
22:57:18 INFO - rsi = 0x0000000951929b20 rdi = 0x0000000000000004
22:57:18 INFO - rbp = 0x0000000957e3f3b0 rsp = 0x0000000957e3f2e0
22:57:18 INFO - r8 = 0x000007ffbba90960 r9 = 0x0000000000000190
22:57:18 INFO - r10 = 0x00000009515985f1 r11 = 0x0000000957e3f210
22:57:18 INFO - r12 = 0x0000000000000002 r13 = 0x0000000000000000
22:57:18 INFO - r14 = 0x0000000088760873 r15 = 0x00000009519067a0
22:57:18 INFO - rip = 0x000007ffb9101b63
22:57:18 INFO - Found by: given as instruction pointer in context
22:57:18 INFO - 1 xul.dll!mozilla::gfx::CriticalLogger::CrashAction(mozilla::gfx::LogReason) [Factory.cpp:e0be781966d4 : 971 + 0x11]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f310 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb8f822af
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 2 xul.dll!mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush() [Logging.h:e0be781966d4 : 280 + 0x3b]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f340 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb8c4361e
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 3 xul.dll!mozilla::layers::SyncObjectD3D11::Init() [TextureD3D11.cpp:e0be781966d4 : 1235 + 0x44]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f3c0 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb90b2e74
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 4 xul.dll!mozilla::layers::SyncObjectD3D11::FinalizeFrame() [TextureD3D11.cpp:e0be781966d4 : 1274 + 0x5]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f500 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb90b17eb
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 5 xul.dll!mozilla::D3D11DXVA2Manager::CopyToImage(IMFSample *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::layers::Image * *) [DXVA2Manager.cpp:e0be781966d4 : 948 + 0xa]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f6b0 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a1e650
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 6 xul.dll!mozilla::WMFVideoMFTManager::CreateD3DVideoFrame(IMFSample *,__int64,mozilla::VideoData * *) [WMFVideoMFTManager.cpp:e0be781966d4 : 931 + 0x1d]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f7a0 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a1ef4d
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 7 xul.dll!mozilla::WMFVideoMFTManager::Output(__int64,RefPtr<mozilla::MediaData> &) [WMFVideoMFTManager.cpp:e0be781966d4 : 1048 + 0x21]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f850 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a22d6d
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 8 xul.dll!mozilla::WMFMediaDataDecoder::ProcessOutput(nsTArray<RefPtr<mozilla::MediaData> > &) [WMFMediaDataDecoder.cpp:e0be781966d4 : 155 + 0x13]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f950 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a2318f
22:57:18 INFO - Found by: call frame info
22:57:18 INFO - 9 xul.dll!mozilla::WMFMediaDataDecoder::ProcessDrain() [WMFMediaDataDecoder.cpp:e0be781966d4 : 200 + 0x5]
22:57:18 INFO - rbx = 0x0000000957e3f3e0 rbp = 0x0000000957e3f3b0
22:57:18 INFO - rsp = 0x0000000957e3f980 r12 = 0x0000000000000002
22:57:18 INFO - r13 = 0x0000000000000000 r14 = 0x0000000088760873
22:57:18 INFO - r15 = 0x00000009519067a0 rip = 0x000007ffb9a22f79
22:57:18 INFO - Found by: call frame info
Comment 5•8 years ago
|
||
Another, very similar: https://treeherder.mozilla.org/logviewer.html#?job_id=85750177&repo=try&lineNumber=2215
Comment 6•8 years ago
|
||
David - Can you have a look at the crash reports in comment 4 and 5? We see these crashes infrequently in Windows e10s Talos tests, apparently at shutdown time. (We also see *frequent* Windows e10s Talos crashes at shutdown time without any crash reports. I'm hoping the SyncObjectD3D11 crash is the same issue...and that you can fix it, or tell us how to avoid these crashes!)
Flags: needinfo?(dvander)
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Updated•8 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 9•8 years ago
|
||
:milan, can you help us find someone to look at this- we suspect this is the cause for a 25%+ failure rate in all windows 8 talos crashes. Geoff has outlined this nicely in Comment 6.
Flags: needinfo?(milan)
Assignee | ||
Comment 10•8 years ago
|
||
If this is only happening on shutdown, that makes a lot of sense - media could be racing with shutting down D3D11 devices.
Can we tell what PID is crashing? If it's the GPU process, then this locking might serve no purpose and can just be removed. If it's content, we'll have to investigate more.
Flags: needinfo?(dvander) → needinfo?(gbrown)
Comment 11•8 years ago
|
||
It does appear that all of these crashes are on shutdown.
We cannot easily determine the PID, but the process type is available. There is a .extra file associated with minidumps, available as artifacts of the test job. In both crashes from comment 4 and comment 5, the .extra files show ProcessType=gpu.
(If we really need the PID, :ted says it is available and can be retrieved with code like https://github.com/mozilla/socorro/blob/master/minidump-stackwalk/stackwalker.cc#L1196. Thanks much :ted!)
Flags: needinfo?(gbrown)
Comment 12•8 years ago
|
||
:dvander, knowing this is a GPU process, can you look into removing the shutdown locking?
Flags: needinfo?(dvander)
Assignee | ||
Comment 13•8 years ago
|
||
I talked with Matt Woodrow and it sounds like we can remove the locking on AMD GPUs only. If our machines use nVidia/Intel we'll have to do something else... do you know what they run, Joel?
Flags: needinfo?(dvander) → needinfo?(jmaher)
Comment 14•8 years ago
|
||
our machines are documented here:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation
and for the linux machines:
iX21X4 2U Neutron "Gemini" Series Four Node Hot-Pluggable Server (4 nodes per 2U)
920W High-Efficiency redundant (1+1) Power Supply
1 Intel X3450 CPU per node
8GB Total: 2 x 4GB DDR3 1333Mhz ECC/REG RAM per node
1 WD5003ABYX hard drive per node
1 NVIDIA GPU GeForce GT 610 per node
we will have new machines later in the summer or fall, I believe they will be Intel graphics.
Glad to hear there is a path forward here.
Flags: needinfo?(jmaher)
Assignee | ||
Comment 15•8 years ago
|
||
Okay, can't dropping the locking... I'll investigate further then.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(milan)
Comment hidden (Intermittent Failures Robot) |
Comment 17•8 years ago
|
||
:dvander, checking in here- do you have any updates here?
Flags: needinfo?(dvander)
Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #17)
> :dvander, checking in here- do you have any updates here?
Sorry, I haven't had time to look at it yet. I can't seem to find any newer stack traces. The logs on brasstacks all say:
"16:53:12 INFO - PROCESS-CRASH | tsvgx | application crashed [unknown top frame]"
There's no stack, and the minidump isn't parseable by Visual Studio. Did something break or am I looking in the wrong places?
Flags: needinfo?(dvander) → needinfo?(jmaher)
Comment 19•8 years ago
|
||
Recall comments 4 and 6: More than 90% of these Windows Talos shutdown crashes have no crash reports (usually there's a minidump, but it isn't parseable). You have to look through at least 20 logs before you find a stack.
Here's another one, a little different, from yesterday: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=90753534&lineNumber=2213
Updated•8 years ago
|
Flags: needinfo?(jmaher)
Assignee | ||
Comment 20•8 years ago
|
||
Errr... why do so few runs not have crash reports? Is that intentional?
Assignee | ||
Comment 21•8 years ago
|
||
s/not//
Assignee | ||
Comment 22•8 years ago
|
||
I put a bunch of sleep() calls in right before video locks the SyncObject, to root out race conditions during shutdown. The first thing that hit was this, InitIPDLActor ends up getting a null forwarder.
Attachment #8858182 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 23•8 years ago
|
||
This code actually asserted earlier than the talos crash, because CompositorBridgeChild shouldn't be used in the GPU process and definitely not off the main thread. This patch just skips the gfxDevCrash if we're not in the main thread.
Attachment #8858183 -
Flags: review?(matt.woodrow)
Updated•8 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Attachment #8858182 -
Flags: review?(matt.woodrow) → review+
Updated•8 years ago
|
Attachment #8858183 -
Flags: review?(matt.woodrow) → review+
Comment 25•8 years ago
|
||
let me know if I can land these patches, right now our #1 failure on the tree is a new talos test (glvideo in bug 1356445) which is failing with this same pattern- in total we have 400+ instances of failures in the last week related to what looks like this issue.
Comment 26•8 years ago
|
||
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6348ea45c2d
Don't create TextureClients if the video bridge has shut down. (bug 1345735 part 1, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/7c5628d40478
Don't gfxDevCrash when video fails to acquire a SyncObject. (bug 1345735 part 2, r=mattwoodrow)
Comment 27•8 years ago
|
||
this looks to be fixed:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20talos%20e10s%20x64%20g4&group_state=expanded&fromchange=61ff36c046fab1fc70772472371e509b363c3302&tochange=f1c0c24105685fe93c82601e48ec566ffe291528
thank you for the fix, review, and landing!
Comment 28•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/d6348ea45c2d
https://hg.mozilla.org/mozilla-central/rev/7c5628d40478
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Updated•8 years ago
|
Whiteboard: [stockwell needswork] → [stockwell fixed]
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
status-firefox53:
--- → unaffected
status-firefox54:
--- → unaffected
status-firefox-esr52:
--- → unaffected
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•