986113 - B2G child shutdown: MOZ_Assert: Assertion failure: work_queue_.empty(), at message_loop.cc:393

Reporter

Description

•

11 years ago

I have seen this a couple of times now on B2G debug builds. We also hit it during tests.
For example: https://tbpl.mozilla.org/php/getParsedLog.php?id=36455607&tree=B2g-Inbound&full=1#error2

Gregor Wagner [:gwagner]

Reporter

Updated

•

11 years ago

blocking-b2g: --- → 1.4?

Preeti Raghunath(:Preeti)

Comment 1

•

11 years ago

Please request gaia approval for 1.4 on this.

Preeti Raghunath(:Preeti)

Comment 2

•

11 years ago

Meant aurora approval

Gregor Wagner [:gwagner]

Reporter

Comment 3

•

11 years ago

Ehsan, can you give us more information about this assertions? How bad is it? Lets wait on the domain experts before we make a blocking decision.

Flags: needinfo?(ehsan)

(no longer active)

Comment 4

•

11 years ago

It's very bad. :(

For one thing, it means that we leak those objects.  This is fine I guess if we are hitting this assertion during shutdown.

What's more worrying is that the work queue is not empty.  This means that some of the work queue items were never processed before the thread is shutting down, which means that some work which we expected to happen never does happen.  Gregor, can you reproduce this?  If yes, can you please catch this under the debugger and see what the type of the Task objects left in the work queue is?  That should give us an idea about the nature of the problem.

Note that backing out bug 976363 alone is not an option because it will turn mochitest-2 perma orange because of leaks.  That would also require backing out bug 935778 which I would like to avoid at all costs.

Flags: needinfo?(ehsan)

Flags: needinfo?(benjamin)

Flags: needinfo?(anygregor)

(no longer active)

Comment 5

•

11 years ago

(Also note that this is a debug only assertion, and it just means that we now *know* about this problem.  The underlying problem has been happening all along regardless -- bug 976363 just helped uncover it.)

Benjamin Smedberg

Comment 6

•

11 years ago

Does the assert come with a stack? I don't see any of the normal crashreporter stuff in the log. It would be nice to know what kind of threads are still running, if you catch this in a debugger.

If this is only at shutdown I'm not sure I care all that much; it probably doesn't need to block in that case.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 7

•

11 years ago

I'm seeing this in e10s testing too.

Blocks: e10s-tests

Gregor Wagner [:gwagner]

Reporter

Comment 8

•

11 years ago

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> Does the assert come with a stack? I don't see any of the normal
> crashreporter stuff in the log. It would be nice to know what kind of
> threads are still running, if you catch this in a debugger.
> 
> If this is only at shutdown I'm not sure I care all that much; it probably
> doesn't need to block in that case.

This happens on the emulator and we don't record stack traces for a child process after we run all the tests successfully. Andrew knows more about this.
Maybe we can try to get a stack trace on try with your patch that enables this?

Flags: needinfo?(anygregor) → needinfo?(ahalberstadt)

(no longer active)

Comment 9

•

11 years ago

(In reply to comment #7)
> I'm seeing this in e10s testing too.

Do you have a stack and/or can you repro?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 10

•

11 years ago

This is the output in e10s. I'm wondering if the assertion happens in the child process. The stack isn't too helpful.
https://tbpl.mozilla.org/php/getParsedLog.php?id=36478575&full=1&branch=holly#error0

There's also this very scary looking assertion:
firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.

Andrew Halberstadt [:ahal]

Comment 11

•

11 years ago

The patch in bug 969146 should get the crash stack dumped in the log, but bug 983489 is blocking me from landing it.

Flags: needinfo?(ahalberstadt)

Gregor Wagner [:gwagner]

Reporter

Comment 12

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #11)
> The patch in bug 969146 should get the crash stack dumped in the log, but
> bug 983489 is blocking me from landing it.

I can land the patch on pine but it doesn't apply currently.

Gregor Wagner [:gwagner]

Reporter

Comment 13

•

11 years ago

I pushed the patch to pine. We have multiple crash stacks but I am not sure if we have one that shows this bug.

https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110

Gregor Wagner [:gwagner]

Reporter

Updated

•

11 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=986836

(no longer active)

Comment 14

•

11 years ago

(In reply to Bill McCloskey (:billm) from comment #10)
> This is the output in e10s. I'm wondering if the assertion happens in the
> child process. The stack isn't too helpful.
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=36478575&full=1&branch=holly#error0
> 
> There's also this very scary looking assertion:
> firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion
> `(-(e)) != 3 || !robust' failed.

This stack is very interesting.  I think one process here is aborting from libc's abort(), and the other one is aborting with the assertion in the summary of this bug.  We have no stacks for the latter.  But the former process seems to be trying to post an IPC message to the MessageLoop *while* we're shutting down on the main thread:

18:42:58     INFO -  18  libxul.so!LockImpl::Lock() [lock_impl_posix.cc:3d266881bda3 : 41 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94a055d   rsp = 0x00007fc2bab3e830
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  19  libxul.so!tracked_objects::Tracked::SetBirthPlace(tracked_objects::Location const&) [tracked.cc:3d266881bda3 : 73 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94b044c   rsp = 0x00007fc2bab3e840
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  20  libxul.so!MessageLoop::PostTask_Helper(tracked_objects::Location const&, Task*, int, bool) [lock.h:3d266881bda3 : 43 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94ad9a6   rsp = 0x00007fc2bab3e860
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  21  libxul.so!mozilla::ipc::MessageChannel::RefCountedTask::AddRef() [MessageChannel.h:3d266881bda3 : 386 + 0x18]
18:42:58     INFO -      rip = 0x00007fc2c94bcac6   rsp = 0x00007fc2bab3e8b0
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  22  libxul.so!mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) [MessageChannel.cpp:3d266881bda3 : 551 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94c5f60   rsp = 0x00007fc2bab3e8e0
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  23  libxul.so!mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) [MessageLink.cpp:3d266881bda3 : 259 + 0xb]
18:42:58     INFO -      rip = 0x00007fc2c94c60ce   rsp = 0x00007fc2bab3e940
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  24  libxul.so!IPC::Channel::ChannelImpl::ProcessIncomingMessages() [ipc_channel_posix.cc:3d266881bda3 : 608 + 0xc]
18:42:58     INFO -      rip = 0x00007fc2c94a5a79   rsp = 0x00007fc2bab3e970
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning

This is definitely broken.  It may or may not be the only problem to solve here, but why is the IPC thread doing work while the main thread is shutting down?

Flags: needinfo?(bent.mozilla)

(no longer active)

Comment 15

•

11 years ago

(In reply to Gregor Wagner [:gwagner] from comment #13)
> I pushed the patch to pine. We have multiple crash stacks but I am not sure
> if we have one that shows this bug.
> 
> https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110

The assertions happen on most of the test runs, but I can't find a stack in the ones that I looked at.  Andrew, your patch seems to not be working as expected.

Flags: needinfo?(ahalberstadt)

Andrew Halberstadt [:ahal]

Comment 16

•

11 years ago

This only seems to happen in some of the debug mochitests as far as I can see. The crashes in opt mochitest are bug 983489. The stacks in the debug mochitests seem to be bug 930884 which is odd because that bug should be fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the libEGL one happens first?

Flags: needinfo?(ahalberstadt)

bhavana bajaj [:bajaj]

Updated

•

11 years ago

blocking-b2g: 1.4? → 1.4+

bhavana bajaj [:bajaj]

Updated

•

11 years ago

Assignee: nobody → overholt

(no longer active)

Comment 17

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #16)
> This only seems to happen in some of the debug mochitests as far as I can
> see. The crashes in opt mochitest are bug 983489. The stacks in the debug
> mochitests seem to be bug 930884 which is odd because that bug should be
> fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the
> libEGL one happens first?

MOZ_ASSERT is a debug only macro, so it's totally expected to not see that in opt builds.  Looks like bug 930884 is not fixed?  I'm not really sure how to make progress here, we need more info before this bug is actionable.

Andrew Halberstadt [:ahal]

Comment 18

•

11 years ago

Apparently bug 930884 was crashing in the parent process, and that got fixed, but that crash can still happen in a child process? I filed bug 987251 to continue investigating it.

To clarify, these two bugs are potentially blocking us from getting a crash stack. Though I'll see if I can reproduce this locally and try to figure a way of extracting a stack manually.

Depends on: 987251, 983489

Gregor Wagner [:gwagner]

Reporter

Updated

•

11 years ago

Depends on: 986836

Benjamin Smedberg

Updated

•

11 years ago

Flags: needinfo?(benjamin)

Andrew Halberstadt [:ahal]

Comment 19

•

11 years ago

I tried to debug this locally and I got a different MOZ_ASSERT [1] on shutdown for which a stack trace *did* appear (in addition to the libEGL one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but I'm confused why I would see a stack for that locally, but we wouldn't see a stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call in my patch is happening too early, before gecko has completely shutdown and we are missing the crash. To test this theory I added a sleep:

https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb

[1] F/MOZ_Assert(  311): Assertion failure: mUsedShmems.empty(), at /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51

(no longer active)

Comment 20

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #19)
> I tried to debug this locally and I got a different MOZ_ASSERT [1] on
> shutdown for which a stack trace *did* appear (in addition to the libEGL
> one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but
> I'm confused why I would see a stack for that locally, but we wouldn't see a
> stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call
> in my patch is happening too early, before gecko has completely shutdown and
> we are missing the crash. To test this theory I added a sleep:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb
> 
> [1] F/MOZ_Assert(  311): Assertion failure: mUsedShmems.empty(), at
> /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51

I'm not sure if I can read this log properly?  Which process is each one of these assertions coming from?  The parent or the child?  Surely they must happen on two different processes as they're both fatal...

(no longer active)

Updated

•

11 years ago

Flags: needinfo?(ahalberstadt)

Andrew Halberstadt [:ahal]

Comment 21

•

11 years ago

Which log? I didn't paste any. I have no idea why I saw two crash stacks locally (though not for this MOZ_ASSERT) but that try run only seems to have one. I also don't know if fixing bug 987251 and bug 983489 will help expose the stack here or if we aren't seeing it for some other reason.

Flags: needinfo?(ahalberstadt)

(no longer active)

Comment 22

•

11 years ago

(In reply to comment #21)
> Which log? I didn't paste any.

The logs in the try run!

> I have no idea why I saw two crash stacks
> locally (though not for this MOZ_ASSERT) but that try run only seems to have
> one. I also don't know if fixing bug 987251 and bug 983489 will help expose the
> stack here or if we aren't seeing it for some other reason.

Hmm.. Can you please help me understand which code is responsible for capturing the stacks here?  MOZ_ASSERT itself doesn't unfortunately.

FWIW, I'm guessing that fixing those bugs will fix the underlying issue here, because they both look like potential symptoms of the underlying problem here.  I don't think fixing either of those will get us stacks per se but that could be because my understanding of how we capture these stack traces is incorrect.

Andrew Halberstadt [:ahal]

Comment 23

•

11 years ago

There's only one stack in the try logs :), I was talking about what I was seeing locally. I'll try to explain more clearly:

1) When MOZ_CRASHREPORTER_SHUTDOWN is set, we quit the application when a child process crash happens:
http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#135

2) My patch in the try run, makes b2g mochitests pass in MOZ_CRASHREPORTER_SHUTDOWN:
https://hg.mozilla.org/try/rev/a9c2ddbf2832

3) When a crash happens, a minidump should be saved in 'path/to/profile/minidumps'. This is on crashreporter and has nothing to do with the test harnesses.

4) We call mozcrash.check_for_crashes() after a test run, on a timeout, or when there is an exception. This basically just checks for minidumps in the profile and passes them on to minidump_stackwalk (which is what prints the stack):
http://mxr.mozilla.org/mozilla-central/source/testing/mozbase/mozcrash/mozcrash/mozcrash.py#22

My theory is that the work_queue.empty MOZ_ASSERT causes a child process crash which triggers this line [1]. Then we hit bug 983489 which causes another crash in a different child process (?) and somehow erases the original minidump or prevents it from being stackwalked (?).

However when I ran debug mochitests locally, I witnessed two separate stacks in the log. This makes me question my original theory. In short, I don't know what is happening :).

[1] http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#139

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 24

•

11 years ago

(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #14)
> This is definitely broken.  It may or may not be the only problem to solve
> here, but why is the IPC thread doing work while the main thread is shutting
> down?

Dunno, shutdown ordering is hard? :)

Currently I think we kill the IO thread here:

  http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#1034

Maybe it just needs to be moved earlier.

Flags: needinfo?(bent.mozilla)

Andrew Overholt [:overholt]

Comment 25

•

11 years ago

This is still waiting on bug 986836 which Milan said is going to start moving at the earliest tomorrow.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 26

•

11 years ago

I've been investigating this from the e10s end, and I think I have a better idea what's going on. The problem seems to happen when the child process tries to shut down the image bridge. We call ImageBridgeChild::DestroyBridge() from the main thread. That posts a task to the image bridge thread that calls StopImageBridgeSync. That function tries to send a sync message to the parent process.

All of this stuff happens very late in shutdown. If the parent then dies, the child's send call will fail and it will run some IPC to post a task somewhere to report the error. However, we never actually process that task because we're so late in shutdown. Instead, we end up with this assertion. The task that remains in the queue is MessageChannel::OnNotifyMaybeChannelError.

It's hard to know if the same problem is affecting b2g. However, I think both b2g and e10s would benefit from a fix for bug 924622, which proposes to fix the shutdown code in graphics so that it happens earlier. That would solve the image bridge bug I'm seeing and probably fix a bunch of other things.

Also, I should mention that it's pretty easy to reproduce this problem in e10s by putting sleep calls in the right places.

(no longer active)

Updated

•

11 years ago

Depends on: 924622

Chris Peterson [:cpeterson]

Comment 27

•

11 years ago

Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".

tracking-e10s: --- → +

(no longer active)

Comment 28

•

11 years ago

Gregor, there has been a bit of progress on the list of dependencies here.  Can you please retest this when you get a chance?  Thanks!

Flags: needinfo?(anygregor)

Gregor Wagner [:gwagner]

Reporter

Comment 29

•

11 years ago

(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #28)
> Gregor, there has been a bit of progress on the list of dependencies here. 
> Can you please retest this when you get a chance?  Thanks!

Sure, I just merged to pine again. https://tbpl.mozilla.org/?tree=Pine&showall=1

Flags: needinfo?(anygregor)

Jason Smith [:jsmith]

Comment 30

•

11 years ago

Re-triage - this doesn't have proof of being a user facing crash on devices, it only happens in debug build situations & tests. So renoming to reevaluate.

blocking-b2g: 1.4+ → 1.4?

Gregor Wagner [:gwagner]

Reporter

Comment 31

•

11 years ago

(In reply to Jason Smith [:jsmith] from comment #30)
> Re-triage - this doesn't have proof of being a user facing crash on devices,
> it only happens in debug build situations & tests. So renoming to reevaluate.

This might as well be a major source of orange like bug 965677.

Preeti Raghunath(:Preeti)

Updated

•

11 years ago

Assignee: overholt → anygregor

blocking-b2g: 1.4? → 1.4+

Gregor Wagner [:gwagner]

Reporter

Comment 32

•

11 years ago

They are all gfx related as far as I can tell. Milan is a better owner/driver here.

Assignee: anygregor → milan

(inactive) Jim Chen [:jchen] [:darchons]

Updated

•

11 years ago

Blocks: 997191

Chiajung Hung [:chiajung]

Updated

•

11 years ago

Blocks: 997367

Kai-Chih Hu [:khu]

Comment 33

•

11 years ago

(In reply to Gregor Wagner [:gwagner] from comment #32)
> They are all gfx related as far as I can tell. Milan is a better
> owner/driver here.

So, does it mean, that "graphics" component would make sense? 
Milan, may we have your comments for this bug? Thank you.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 34

•

11 years ago

It may very well be the case.  We are looking at shutdown issues in bug 924622, but it is slow going.

Flags: needinfo?(milan)

Kai-Chih Hu [:khu]

Updated

•

11 years ago

Component: IPC → Graphics

Preeti Raghunath(:Preeti)

Comment 35

•

11 years ago

Milan

Are we fixing this issue in 1.4? We blocked on it so please help with next steps.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 36

•

11 years ago

These shutdown issues are expensive to fix, we've been at it for weeks, with 7-8 landings that got backed out, so best I can offer is to keep working on it.  Bug 924622 is where the work is, if that works on central, then we can proceed.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 37

•

11 years ago

Assigning to :nical as a placeholder once we clear bug 924622.

Assignee: milan → nical.bugzilla

Milan Sreckovic [:milan] (needinfo for best results)

Comment 38

•

10 years ago

Now that bug 924622 has landed in 2.0, is this assertion still present there?

Flags: needinfo?(anygregor)

Gregor Wagner [:gwagner]

Reporter

Comment 39

•

10 years ago

(In reply to Milan Sreckovic [:milan] from comment #38)
> Now that bug 924622 has landed in 2.0, is this assertion still present there?

This assertion is no longer there.
We moved to bug 1017781 (https://tbpl.mozilla.org/php/getParsedLog.php?id=40756411&tree=Pine&full=1)
And bug 1017719 which is not gfx related.

Status: NEW → RESOLVED

Closed: 10 years ago

Flags: needinfo?(anygregor) → needinfo?(milan)

Resolution: --- → WORKSFORME

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

10 years ago

Flags: needinfo?(milan)