Closed Bug 1091322 Opened 10 years ago Closed 9 years ago

Intermittent test_zmedia_cleanup.html | application crashed [@ mozilla::layers::Compositor::AssertOnCompositorThread()] after Assertion failure: CompositorParent::CompositorLoop() == MessageLoop::current() (Can only call this from the compositor thread!),

Categories

(Core :: Graphics: Layers, defect)

Version: 36 Branch
Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED WORKSFORME
Tracking Status
e10s + ---

People

(Reporter: KWierso, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

16:45:46 INFO - nsStringStats
16:45:46 INFO - => mAllocCount: 2923512
16:45:46 INFO - => mReallocCount: 296938
16:45:46 INFO - => mFreeCount: 2923476 -- LEAKED 36 !!!
16:45:46 INFO - => mShareCount: 4927300
16:45:46 INFO - => mAdoptCount: 145709
16:45:46 INFO - => mAdoptFreeCount: 145709
16:45:46 INFO - => Process ID: 1875, Thread ID: 140698723338688
16:45:46 INFO - TEST-INFO | Main app process: killed by SIGSEGV
16:45:46 INFO - 1815 INFO TEST-START | Shutdown
16:45:46 INFO - 1816 INFO Passed: 214309
16:45:46 INFO - 1817 INFO Failed: 0
16:45:46 INFO - 1818 INFO Todo: 24138
16:45:46 INFO - 1819 INFO Slowest: 230736ms - /tests/dom/imptests/editing/conformancetest/test_runtest.html
16:45:46 INFO - 1820 INFO SimpleTest FINISHED
16:45:46 INFO - 1821 INFO TEST-INFO | Ran 1 Loops
16:45:46 INFO - 1822 INFO SimpleTest FINISHED
16:45:46 INFO - 1823 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_zmedia_cleanup.html | application terminated with exit code 11
16:45:46 INFO - runtests.py | Application ran for: 0:44:05.438840
16:45:46 INFO - zombiecheck | Reading PID log: /tmp/tmpZE5q1Cpidlog
16:45:46 INFO - ==> process 1837 launched child process 1875
16:45:46 INFO - ==> process 1875 launched child process 5172
16:45:46 INFO - zombiecheck | Checking for orphan process with PID: 1875
16:45:46 INFO - zombiecheck | Checking for orphan process with PID: 5172
16:45:55 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/7a73cf6d-1b7c-84d1-3f07168f-7f967c13.dmp
16:45:55 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/7a73cf6d-1b7c-84d1-3f07168f-7f967c13.extra
16:45:55 WARNING - PROCESS-CRASH | /tests/dom/media/tests/mochitest/test_zmedia_cleanup.html | application crashed [@ mozilla::layers::Compositor::AssertOnCompositorThread()]
16:45:55 INFO - Crash dump filename: /tmp/tmpx0nahD.mozrunner/minidumps/7a73cf6d-1b7c-84d1-3f07168f-7f967c13.dmp
16:45:55 INFO - Operating system: Linux
16:45:55 INFO - 0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64
16:45:55 INFO - CPU: amd64
16:45:55 INFO - family 6 model 62 stepping 4
16:45:55 INFO - 1 CPU
16:45:55 INFO - Crash reason: SIGSEGV
16:45:55 INFO - Crash address: 0x0
16:45:55 INFO - Thread 17 (crashed)
16:45:55 INFO - 0 libxul.so!mozilla::layers::Compositor::AssertOnCompositorThread() [Compositor.cpp:0c50940fd16d : 46 + 0x18]
Without any gfx people CCed, this bug is *going* places.
The shutdown of the compositor goes like this:

  // [some stuff]
  sCompositorThreadHolder = nullptr;

  while (!sFinishedCompositorShutDown) {
    NS_ProcessNextEvent(nullptr, true); // the main thread is waiting here while the compositor crashes
  }

After sCompositorThreadHolder is nulled, AssertOnCompositorThread will always return false, because what it does is check that the current MessageLoop is the one stored in the thread kept by sCompositorThreadHolder.

Benoit, can we null out the pointer after the while loop to avoid this, or is there a dependency on the compositor thread holder being nulled before?
Flags: needinfo?(bjacob)
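For reference, the check that fires in this crash boils down to something like the following. This is a paraphrase of the assertion text in the log above and of comment 37, not an exact copy of Compositor.cpp:

  /* static */ void
  Compositor::AssertOnCompositorThread()
  {
    // CompositorLoop() hands back the MessageLoop of the thread kept alive by
    // sCompositorThreadHolder; once that holder is nulled during shutdown,
    // this comparison can no longer succeed for any caller.
    MOZ_ASSERT(CompositorParent::CompositorLoop() == MessageLoop::current(),
               "Can only call this from the compositor thread!");
  }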
Blocks: 1093443
Can we please find an active owner for this frequent e10s crash? It's contributing to an entire test suite being hidden at the moment.
tracking-e10s: --- → ?
Flags: needinfo?(milan)
Let's see if Benoit can take a look after we deal with 33.* issues.
Assignee: nobody → bjacob
Flags: needinfo?(milan)
I'll see if I can find a regressing cset to back out in the meantime.
Can't help but notice bug 1088898 landing around the time that this started.
FYI, the test test_zmedia_cleanup.html is not drawing anything. It is supposed to turn off the WiFi network interface on the B2G emulator, so theoretically it really does something on B2G. That being said, the network-related tests on the B2G emulator are currently turned off anyhow, so theoretically we could simply turn this test off as it should not be needed. Obviously, that could simply hide the real problem here.
Retriggers strongly point to this push as where it started:
https://treeherder.mozilla.org/ui/#/jobs?repo=b2g-inbound&revision=edf60abe62a5
Blocks: 998872
Sean, I'm sorry to put you in this situation, but this failure is currently a major contributor to a test suite being hidden by default on Treeherder due to how often it fails. Unfortunately, the gfx team is tied up with OMTC firefighting at the moment, so I'm afraid that backing out bug 998872 is our only realistic short-term option for getting this resolved.
Flags: needinfo?(selin)
Ryan, another short-term solution could be to disable test_zmedia_cleanup.html right now. test_zmedia_cleanup.html is not really a test by itself, so we are not losing anything by disabling it. In that case we should open another bug report to investigate and clean this up once the gfx team has time again. And I should note that if we want to re-enable the WebRTC tests on the B2G emulator, the disabled test_zmedia_cleanup.html would become a blocker.
Try run of bug 998872 (and some deps that landed on top of it) backed out: https://tbpl.mozilla.org/?tree=Try&rev=f573c2e79394
(In reply to Nils Ohlmeier [:drno] from comment #120)
> Ryan, another short-term solution could be to disable test_zmedia_cleanup.html
> right now. test_zmedia_cleanup.html is not really a test by itself, so we are
> not losing anything by disabling it. In that case we should open another bug
> report to investigate and clean this up once the gfx team has time again. And
> I should note that if we want to re-enable the WebRTC tests on the B2G
> emulator, the disabled test_zmedia_cleanup.html would become a blocker.

These failures are happening on desktop Firefox w/ e10s enabled, not B2G.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #122)
> These failures are happening on desktop Firefox w/ e10s enabled, not B2G.

I know. Which makes it even stranger. Unfortunately our manifests don't allow us to specify just an 'if', because test_zmedia_cleanup.html should only get executed on B2G. The test itself has code in it to figure out if it runs on the B2G emulator.
Ah, I understand what you're saying now. Here's a Try run of test_zmedia_cleanup.html disabled: https://tbpl.mozilla.org/?tree=Try&rev=a637782a99a2
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #112)
> Sean, I'm sorry to put you in this situation, but this failure is currently
> a major contributor to a test suite being hidden by default on Treeherder
> due to how often it fails. Unfortunately, the gfx team is tied up with OMTC
> firefighting at the moment, so I'm afraid that backing out bug 998872 is our
> only realistic short-term option for getting this resolved.

Hi Ryan, actually I'm more inclined toward Nils' solution. Disabling test_zmedia_cleanup.html, which appears only relevant to WebRTC on B2G, seems more acceptable to me as a short-term fix.
Flags: needinfo?(selin)
...if it works.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #121)
> Try run of bug 998872 (and some deps that landed on top of it) backed out:
> https://tbpl.mozilla.org/?tree=Try&rev=f573c2e79394

The backout Try push looks green, so that at least confirms it to be a valid way of fixing this. We'll see what the test disabling run looks like (my concern being that it'll just move the crash to another test since test_zmedia_cleanup.html should be a no-op on desktop anyway per drno's comments).
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #124)
> Ah, I understand what you're saying now. Here's a Try run of
> test_zmedia_cleanup.html disabled:
> https://tbpl.mozilla.org/?tree=Try&rev=a637782a99a2

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #129)
> (my concern being that it'll just move the crash to another test since
> test_zmedia_cleanup.html should be a no-op on desktop anyway per drno's
> comments).

And sure enough, that's exactly what happens. The crash just moves to test_peerConnection_toJSON.html instead.
https://tbpl.mozilla.org/php/getParsedLog.php?id=51852297&tree=Try

So yeah, backing out is the only way to get to green in the short term unless the gfx team comes up with the manpower to solve this on their end of things.
Flags: needinfo?(selin)
I'm thinking of disabling all TV tests for e10s on Linux debug builds as a short-term workaround. Here's the try run and it looks good to me:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=4313f438e8df
Flags: needinfo?(selin)
(In reply to Sean Lin [:seanlin] from comment #131)
> I'm thinking of disabling all TV tests for e10s on Linux debug builds as a
> short-term workaround. Here's the try run and it looks good to me:
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=4313f438e8df

Looks great, thanks for doing that!
(In reply to Nicolas Silva [:nical] from comment #37)
> The shutdown of the compositor goes like this:
>
>   // [some stuff]
>   sCompositorThreadHolder = nullptr;
>
>   while (!sFinishedCompositorShutDown) {
>     NS_ProcessNextEvent(nullptr, true); // the main thread is waiting here while the compositor crashes
>   }
>
> After sCompositorThreadHolder is nulled, AssertOnCompositorThread will always
> return false, because what it does is check that the current MessageLoop is
> the one stored in the thread kept by sCompositorThreadHolder.
>
> Benoit, can we null out the pointer after the while loop to avoid this, or is
> there a dependency on the compositor thread holder being nulled before?

Sorry for not getting to this earlier.

sFinishedCompositorShutDown is only set to true here:

  /* static */ void
  CompositorThreadHolder::DestroyCompositorThread(Thread* aCompositorThread)
  {
    MOZ_ASSERT(NS_IsMainThread());
    MOZ_ASSERT(!sCompositorThreadHolder,
               "We shouldn't be destroying the compositor thread yet.");

    DestroyCompositorMap();
    delete aCompositorThread;
    sFinishedCompositorShutDown = true;
  }

(in the same file). This means that it will never be set to true as long as sCompositorThreadHolder remains a strong reference held on the CompositorThreadHolder singleton. So if you move the 'sCompositorThreadHolder = nullptr' line after the while loop, the while loop will never terminate. Just try it :-)
Flags: needinfo?(bjacob)
Ok, then I suppose we should have AssertOnCompositorThread only do the assertion if sCompositorThreadHolder is not null or have another state to mark that the assertion should do nothing after a certain point in the shutdown.
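A minimal sketch of that first option, assuming an early return guarded on the thread holder (sCompositorThreadHolder is a file-level static in the real code, so an actual patch would need an accessor; this is not landed code):

  /* static */ void
  Compositor::AssertOnCompositorThread()
  {
    // Hypothetical guard: once compositor shutdown has released the thread
    // holder, the loop comparison below can never succeed, so skip the
    // assertion instead of tripping it during late shutdown.
    if (!sCompositorThreadHolder) {
      return;
    }
    MOZ_ASSERT(CompositorParent::CompositorLoop() == MessageLoop::current(),
               "Can only call this from the compositor thread!");
  }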
Did disabling those TV tests change the chunking, and that's why it worked? These last few were the (non-intermittent) result of bug 1117650 moving some tests from alphabetically before media to after media, and thus pulling the webrtc tests from being at the start of debug e10s m3 to being at the end of debug e10s m2, where they have to deal with shutdown, which they fail to deal with. I backed it out, to get the tree green, but I don't feel good about that, and since your tests apparently simply aren't capable of running without having some other tests to run after them as a buffer, I really should have disabled all of your tests instead.
Blocks: 1117650
Further evidence that it's WebRTC causing trouble rather than gfx: that change also moved the WebRTC tests in the ASAN test chunking, and being near shutdown, without anyone else to buffer and allow time for WebRTC to fizzle out, results in https://treeherder.mozilla.org/logviewer.html#?job_id=5696825&repo=mozilla-inbound as well.
Maire, do you have anybody who can look into this soonish? This is now actively going to be blocking other devs from landing seemingly-unrelated patches due to chunking changes on B2G.
Assignee: jacob.benoit.1 → nobody
Flags: needinfo?(mreavy)
I can try to add stopping all media playing in the WebRTC tests to make the life of gfx easier when it gets to cleanup...
(In reply to Nils Ohlmeier [:drno] from comment #147)
> I can try to add stopping all media playing in the WebRTC tests to make the
> life of gfx easier when it gets to cleanup...

Nils -- Please try that and let's then assess how much it helps (whether it's sufficient or more is needed).
Flags: needinfo?(mreavy)
Reproduced the problem on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=599e6e3705bb

But even my change, which stops/pauses all media at the end of each test, does not prevent the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8670b05b212

So I'm out of ideas for what we can do on the WebRTC test side to avoid this problem. If someone from the gfx side could explain what is actually causing the problem here, I'll be happy to adjust the WebRTC tests.
Milan, can you suggest someone who might be able to help? This bug is blocking other patches from landing, which is not so cool :(
Flags: needinfo?(milan)
Does the patch from bug 1122722 help?
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #151)
> Does the patch from bug 1122722 help?
Flags: needinfo?(drno)
I started a try run with the patches from bug 1117650 and bug 1122722... leaving the need-info to remind myself of the result later.
Flags: needinfo?(drno)
And obviously I wanted to remove the check mark before saving...
Flags: needinfo?(drno)
So it looks pretty green with that patch included: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cabfcd658a3

I verified that mochitest chunk 2 executes test_zmedia_cleanup and re-triggered another 10 times. If those re-triggers don't fail, I think we have a fix for the problem.
So after quite a few re-runs of the test, only one other intermittent problem showed up. So I'm fairly confident that bug 1122722 fixes this issue.
Depends on: 1122722
Flags: needinfo?(drno)
No longer blocks: 1117650
(In reply to Nils Ohlmeier [:drno] from comment #156)
> So after quite a few re-runs of the test, only one other intermittent
> problem showed up. So I'm fairly confident that bug 1122722 fixes this issue.

OK, I'll get bug 1122722 through reviews and do a try run.
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME