Closed Bug 774622 Opened 12 years ago Closed 12 years ago

crash with abort message: "mismatched CxxStackFrame ctor/dtors: file /builds/slave/m-cen-andrd-ntly/build/ipc/glue/RPCChannel.cpp, line 656" on quitting Nightly

Categories

(Core :: Graphics: Layers, defect)

16 Branch
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla18
Tracking Status
firefox15 --- unaffected
firefox16 + verified
firefox17 --- verified
firefox18 --- verified
fennec 16+ ---

People

(Reporter: AdrianT, Assigned: nical)

References

Details

(4 keywords, Whiteboard: [native-crash])

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is report bp-2b3ea40a-be14-424e-b07d-6145d2120717 . Other crashes: https://crash-stats.mozilla.com/report/index/bp-6337ba57-4ad8-450b-9737-bd7772120717 https://crash-stats.mozilla.com/report/index/bp-ace5405c-2498-4f42-a283-ebe1c2120717 ============================================================= Nightly 16.0a1 2012-07-16 Google Nexus (Android 4.0.4) Steps to reproduce: 1) Load planet.mozilla.org. 2) Wait for the page to load. 3) Quit Firefox Actual result: A crash happens. I got 3 different crashes from 4 tries. Adding the crash signatures to the same bug as they may be related since the same steps to reproduce apply. Issue is not reproducible on Aurora ============================================================= 0 libmozalloc.so mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:19 1 libc.so libc.so@0x11f6e 2 dalvik-heap (deleted) dalvik-heap @0x7b511f 3 dalvik-mark-stack (deleted) dalvik-mark-stack @0x30eaf40 4 org.mozilla.fennec-2.apk org.mozilla.fennec-2.apk@0x124556b 5 org.mozilla.fennec-2.apk org.mozilla.fennec-2.apk@0x33696a 6 dalvik-heap (deleted) dalvik-heap @0xfad5f63 7 org.mozilla.fennec-2.apk org.mozilla.fennec-2.apk@0x703633 8 libxul.so XPCJSRuntime::Get js/xpconnect/src/xpcprivate.h:645 9 libxul.so XPCCallContext::~XPCCallContext js/xpconnect/src/XPCCallContext.cpp:302 10 libxul.so nsXPCWrappedJSClass::DelegatedQueryInterface js/xpconnect/src/XPCWrappedJSClass.cpp:745 11 @0xffffffff 12 libxul.so XPCCallContext::~XPCCallContext js/xpconnect/src/XPCCallContext.cpp:334 13 libxul.so nsXPCWrappedJSClass::DelegatedQueryInterface js/xpconnect/src/XPCWrappedJSClass.cpp:745 14 @0x5f4c5d3e 15 libmozglue.so libmozglue.so@0x7041 16 libmozglue.so libmozglue.so@0x8fbd 17 libmozglue.so libmozglue.so@0x90fd 18 libmozglue.so libmozglue.so@0x9c7b 19 libmozglue.so libmozglue.so@0x20ffa 20 libmozglue.so libmozglue.so@0x199d7 21 libmozglue.so libmozglue.so@0x7041 22 libmozglue.so libmozglue.so@0x8fbd 23 libc.so libc.so@0x12132 24 libc.so libc.so@0x436ae 25 libmozglue.so libmozglue.so@0x19b29 26 libmozglue.so libmozglue.so@0x19b41 27 libxul.so std::__node_alloc::allocate _alloc.h:158 28 libxul.so std::priv::_STLP_alloc_proxy<unsigned int, IPC::Message, std::allocator<IPC::Mes _alloc.h:307 29 @0x3 30 libxul.so mozilla::ipc::RPCChannel::DebugAbort ipc/glue/RPCChannel.cpp:656 31 libxul.so mozilla::ipc::RPCChannel::~RPCChannel ipc/glue/RPCChannel.cpp:80 32 libxul.so mozilla::layers::PCompositorParent::~PCompositorParent obj-firefox/ipc/ipdl/PCompositorParent.cpp:88 33 libxul.so mozilla::layers::CompositorParent::~CompositorParent gfx/layers/ipc/CompositorParent.cpp:106 34 libxul.so mozilla::layers::CompositorParent::~CompositorParent gfx/layers/ipc/CompositorParent.cpp:106 35 libxul.so mozilla::layers::CompositorParent::Release CompositorParent.h:58 36 libxul.so RunnableFunction<void , Tuple2<nsRefPtr<mozilla::layers::CompositorParent>, nsRe nsAutoPtr.h:874 37 libxul.so RunnableFunction<void , Tuple2<nsRefPtr<mozilla::layers::CompositorParent>, nsRe ipc/chromium/src/base/task.h:411 38 libxul.so MessageLoop::RunTask ipc/chromium/src/base/message_loop.cc:327 39 libxul.so MessageLoop::DeferOrRunPendingTask ipc/chromium/src/base/message_loop.cc:334 40 libxul.so MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:434 41 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:86 42 libxul.so MessageLoop::RunInternal ipc/chromium/src/base/message_loop.cc:208 43 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:201 44 libxul.so nsBaseAppShell::Run widget/xpwidgets/nsBaseAppShell.cpp:163 45 libxul.so nsAppStartup::Run toolkit/components/startup/nsAppStartup.cpp:257 46 libxul.so XREMain::XRE_mainRun toolkit/xre/nsAppRunner.cpp:3787 47 libxul.so XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:3864 48 libxul.so XRE_main toolkit/xre/nsAppRunner.cpp:3940 49 libxul.so GeckoStart toolkit/xre/nsAndroidStartup.cpp:73 50 libmozglue.so libmozglue.so@0x10aad 51 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29918e 52 libdvm.so libdvm.so@0x1ec32 53 dalvik-heap (deleted) dalvik-heap @0xcb3bae 54 libdvm.so libdvm.so@0x58eed 55 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x134e61 56 libmozglue.so libmozglue.so@0x10a5b 57 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x11d4b9 58 libc.so libc.so@0x14a43 59 libdvm.so libdvm.so@0x98e3d 60 libc.so libc.so@0x158a7 61 libmozglue.so libmozglue.so@0x10a5b 62 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x11d4b9 63 libc.so libc.so@0x158a7 64 libmozglue.so libmozglue.so@0x10a5b 65 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x11d4b9 66 libc.so libc.so@0x15f09 67 libdvm.so libdvm.so@0x5ae8d 68 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x347e 69 dalvik-heap (deleted) dalvik-heap @0xddde1e 70 dalvik-heap (deleted) dalvik-heap @0xddde1e 71 libdvm.so libdvm.so@0xa2532 72 dalvik-heap (deleted) dalvik-heap @0xddde1e 73 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x2991a2 74 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29918e 75 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29918e 76 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x148534 77 libdvm.so libdvm.so@0x8e8ab 78 libdvm.so libdvm.so@0x74897 79 dalvik-heap (deleted) dalvik-heap @0xcb3bae 80 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29918e 81 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29918e 82 libdvm.so libdvm.so@0x58d43 83 dalvik-heap (deleted) dalvik-heap @0xcb3bae 84 libdvm.so libdvm.so@0x5ac1d 85 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29918e 86 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x9ccb8 87 dalvik-heap (deleted) dalvik-heap @0xcb3bae 88 libdvm.so libdvm.so@0x1edbe 89 libdvm.so libdvm.so@0x30a8e 90 libdvm.so libdvm.so@0xb2f8e 91 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29b4b6 92 libdvm.so libdvm.so@0x34272 93 libdvm.so libdvm.so@0x5f987 94 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29b4b6 95 dalvik-heap (deleted) dalvik-heap @0xc9b946 96 data@app@org.mozilla.fennec-2.apk@classes.dex data@app@org.mozilla.fennec-2.apk@classes.dex@0x134a9c 97 libdvm.so libdvm.so@0x6c8ff 98 dalvik-LinearAlloc (deleted) dalvik-LinearAlloc @0x29b4b6 99 dalvik-heap (deleted) dalvik-heap @0xc9b946 100 libdvm.so libdvm.so@0xb7c56 101 libdvm.so libdvm.so@0x5f987 102 libdvm.so libdvm.so@0x6c923 103 libdvm.so libdvm.so@0xb7c56 104 libdvm.so libdvm.so@0x5f987 105 libdvm.so libdvm.so@0xb2f8e 106 libdvm.so libdvm.so@0x5fa37 107 libdvm.so libdvm.so@0x5f987 108 libc.so libc.so@0x12c4e 109 libc.so libc.so@0x127a2
Keywords: regression
Summary: crash in mozalloc_abort on quitting Nightly → crash with abort message: "mismatched CxxStackFrame ctor/dtors: file /builds/slave/m-cen-andrd-ntly/build/ipc/glue/RPCChannel.cpp, line 656" on quitting Nightly
Whiteboard: [native-crash]
Based on crash stats, it first appeared in 16.0a1/20120715. The regression range is: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0602e44ac248&tochange=57abb5f70e01 It might be a regression from bug 745148.
Depends on: 764756
Chris looks like a regression from your bug, can you take a look?
Assignee: nobody → jones.chris.g
tracking-fennec: ? → 16+
Crash Signature: [@ mozalloc_abort | XPCJSRuntime::Get] [@ mozalloc_abort | nsAString_internal::Assign ] [@ mozalloc_abort | nsSVGUtils::GetFontXHeight ] → [@ mozalloc_abort | XPCJSRuntime::Get] [@ mozalloc_abort | nsAString_internal::Assign] [@ mozalloc_abort | nsSVGUtils::GetFontXHeight] [@ mozalloc_abort | NS_IsMainThread_P] [@ mozalloc_abort | libwebcore.so@0x3acf63] [@ mozalloc_abort | dalvik-bitm…
Crash Signature: dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort] → dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ]
The mozalloc_abort | XPCJSRuntime::Get signature is #7 top crasher in 16.0a2 and #5 in 17.0a1. This bug ranks probably higher with its other signatures that are shared with other bugs.
Keywords: topcrash
Good build: 2012-07-14 Bad build: 2012-07-15 Possible regression range: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0602e44ac248&tochange=57abb5f70e01
That range has a lot of changesets in it, please use tinderbox-builds.
The presence of bug 767775 and bug 745148 in that list looks suspicious. Any thoughts, Chris?
Crash Signature: dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort | nsThreadManager::GetIsMainThread] [@ mozalloc_abort | arena_avail_tree_insert] [@ mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ]
Crashes on abort account for 4.4% of all crashes in 15.0b3 and 17.7% in 16.0a2. Guessing this bug is the only crash regression in 16.0, it accounts for around 13% of all crashes in 16.0a2, making it #1 top crasher in 16.0a2.
I've hit this crash a couple of times on the Galaxy Nexus, Android 4.1 while quitting while I was on http://buienradar.nl/ . Adrian, could you try and narrow the regression range further?
Here is the regression range of comment 12: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=011147826130&tochange=9046ecf4db8f Neither bug 767775, nor bug 745148 belong to it. I think it's a regression from bug 763234.
In inbound tinderbox builds: The good build: 2012-07-13 73e6fad01985 The bad build: 2012-07-13 f9723f26c8a9 Pushlog: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=73e6fad01985&tochange=f9723f26c8a9
Blocks: 763234
Bug 763234 expects that no one uses the CompositorParents or whatever is living on the compositor thread after gfxPlatform has been destroyed. Could it be related to this?
Crash Signature: dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort | nsThreadManager::GetIsMainThread] [@ mozalloc_abort | arena_avail_tree_insert] [@ mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort | dalvik-bitmap-2 (deleted)@0x12011f] [@ mozalloc_abort | 0 (deleted)@0x11f911f] [@ mozalloc_abort | nsThreadManager::GetIsMainThread] [@ mozalloc_abort | arena_avail_tree_insert] [@ mozalloc_abor…
In 16.0a2/20120819, it accounts for 46% of all crashes.
(In reply to Scoobidiver from comment #20) > In 16.0a2/20120819, it accounts for 46% of all crashes. I'll ping cjones now and see if anybody else may have more time to look at this.
I am looking at it. I had troubles debugging this until recently because I couldn't get symbols to show up in gdb (see bug 781668). If it is very important that we get a temporary hack to remove the crash ASAP, one (very bad) possibility is to leak CompositorParent. CompositorParent is created a startup and destroyed when exiting firefox. The crash happens in its destructor.
Has anyone been able to create reliable STR for this bug? (In reply to Nicolas Silva [:nical] from comment #22) > If it is very important that we get a temporary hack to remove the crash > ASAP, one (very bad) possibility is to leak CompositorParent. > CompositorParent is created a startup and destroyed when exiting firefox. > The crash happens in its destructor. Shutdown crashes are also extremely frustrating because that's exactly what we want to do at shutdown. Another workaround is determine after what point in shutdown this crash is occurring, and then _exit(0) instead of aborting if we hit this assertion at that point. STR definitely the way to start here.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #23) > STR definitely the way to start here. Adding needURLs and qawanted.
Keywords: needURLs, qawanted
Sorry, I cannot draw URLs for all abort crashes, and this looks like it's becoming just a collection of all abort crashes, which probably can have any number of different causes. I doubt there's any use in pulling URLs for all of those. I also don't know how to pull URLs specifically for one abort message, which may be what could be reasonable here.
Crash Signature: mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort | __wrap_malloc] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ]
The steps from Comment 0 can still be used to reproduce the issue using Nightly 2012-08-27 on Galaxy Note running Android 4.0.4. Steps to reproduce: 1) Load planet.mozilla.org. 2) Wait for the page to load. 3) Quit Firefox Reproducibility rate is about 40-50% of the cases. From ~10 tries I got 4 crashes.
Keywords: qawanted
FWIW, I don't manage to reproduce it with debug builds while I can reproduce it on optimized builds (on a galaxy nexus and a nexus 7).
I can reproduce with the suggested steps in comment #28 pretty easily (08/28, GN:4.1.1) I/GeckoApp( 2680): pause I/GeckoApp( 2680): application paused I/Gecko ( 2680): ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/RPCChannel.cpp, line 656
Keywords: reproducible
I see that we are sometimes deleting the compositor thread before we delete CompositorParent. When this happens the nsWindow are already deleted, so it is likely that something dispatches a NewRunnableFunction or NewRunnableMethod that takes the CompositorParent in parameter (this uses a ref ptr internally) which keeps it alive for too long. I'm almost there
Target Milestone: --- → Firefox 17
Crash Signature: mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort | __wrap_malloc] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort | __wrap_malloc] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] [@ mozal…
Version: Firefox 16 → Firefox 18
Crash Signature: mozalloc_abort | ProcessGeneric] [@ mozalloc_abort | mozilla::dom::CSS2PropertiesBinding::set_borderImageRepeat] [@ mozalloc_abort | exn_toSource] [@ mozalloc_abort | browser.db (deleted)@0x156451] [@ mozalloc_abort | browser.db (deleted)@0x356e6c] [… → mozalloc_abort | ProcessGeneric ] [@ mozalloc_abort | mozilla::dom::CSS2PropertiesBinding::set_borderImageRepeat ] [@ mozalloc_abort | exn_toSource ] [@ mozalloc_abort | browser.db (deleted)@0x156451 ] [@ mozalloc_abort | browser.db (deleted)@0x356e6c…
Christian, it's a regression from 16.0, not from 18.0.
Version: Firefox 18 → Firefox 16
It seems that the problem is coming from the CompositorParent being kept alive longer than expected because of a dispatched Task, as I said earlier. This patch fixes the crash on my nexus 7. I just pushed it to try. What it does is it manages a reference count for the thread, so that when CompositorParent::DestroyThread is called it only decrements the ref count, and only when the ref count reaches zero we delete the thread. It turned out that managing a static int counter by hand required less code than creating a reference counted subclass of base::Thread so I did it by hand.
Assignee: jones.chris.g → nsilva
Attachment #656574 - Flags: review?(jones.chris.g)
Crash Signature: (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] → (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get]
oh well, ran into the cxx stackframe abort again with the patch applied, so I probably just made it less likely to happen :( Chris, do you know if it is important that CompositorParent's destructor gets called before the compositor thread is destroyed? If so my patch still has some value...
I have (yet another) theory: The abort that causes the crash is a piece of ipdl code that checks that a stack mCxxStackFrame is empty. this stack is pusehd and poped only through RAII with CxxStackFrame (see RPCChannel.h) when sending and receiving messages, like this RPCChannel::Send(Message* msg) { Message copy = *msg; CxxStackFrame f(*this, OUT_MESSAGE, &copy); // push return AsyncChannel::Send(msg); } // (RAII) pop with a sync message we expect to see the following sequence (not always true actually): Sender.RPCChannel.mCxxStackFrame.push(OUT_MESSAGE) Receiver.RPCChannel.mCxxStackFrame.push(IN_MESSAGE) // the message is handled on the receiver's thread Receiver.RPCChannel.mCxxStackFrame.pop(IN_MESSAGE) Sender.RPCChannel.mCxxStackFrame.pop(OUT_MESSAGE) However, if the stack is poped after sending back to the sender the signal that the message has beene handled, it is actually possible that the sender thread start running again before the receiver thread reaches the end of the scope and pops its stack. In our case the crash happens always in the destructor of CompositorParent very shortly after PCompositorParent::SendStop, because we dispatch a task on the main thread that holds a refPtr to the compositor parent, that calls SendStop and then returns. When the task returns CompositorParent is deleted because the task's RefPtr is destroyed. so ~CompositorParent happens very shortly after SendStop. if i printf_stderr something in ~CxxStackFrame just before poping the stack this makes it way easier to reproduce. Chris, does it make sense? You know ipdl way better than me.
Is the CompositorParent setting itself to be destroyed during ActorDestroy? What's the expected destruction sequence.
Comment on attachment 656574 [details] [diff] [review] Fixes fennec exit crash by reference counting the compositor thread. Clearing review request since doesn't seem the entire fix.
Attachment #656574 - Flags: review?(jones.chris.g)
Crash Signature: (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] → (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] [@ mozalloc_abort | UnmarkGrayChildren ]
When the widget is destroyed (in its dtor), CompositorChild first sends a WillStop message, telling the parent side that it has to get rid of its resources now. After SendWillStop, a task is dispatched on the main thread to defer the second phase of the destruction (it's the same 2-steps destruction sequence principle as ImageBridge's). This task is a RunnableFunction that executes DeferredDestroyCompositor (nsBaseWdget.cpp) which calls compositorChild->Destroy() and immediately after, loses the last reference to the CompositorParent and the CompositorChild which deletes them. Destroy() calles SendStop() which synchronously finishes the destruction sequence. in short: nsBaseWidget::dtor | CompositorChild::SendWillStop [sync] | | CompositorParent::RecvWillStop | Dispatch DeferredDestroyCompositor ... (on fennec it appears that we run into gfxPlatform::ShutDown here, in which case:) gfxPlatform::ShutDown | delete compositorThread ... DeferredDestroyCompositor | CompositorChild::SendStop [sync] | | compsitorParent::RecvStop | CompositorChild::dtor | CompositorParent::dtor I think we have at least two bugs here, the first being that the compositor thread can be destroyed way too early, so my first patch fixes this. The second bug is that since CompositorParent::dtor (and therefore RPCChannel's dtor) is called immediately after SendStop, we have a race between the CompositorParent's destructor and the RPCChannel's poping its mCxxStackFrames on the other side.
Crash Signature: (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] [@ mozalloc_abort | UnmarkGrayChildren ] → (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] [@ mozalloc_abort | UnmarkGrayChildren ] [@ mozalloc_abort | __swrite | libxul.so@0x10d3d37] [@ mozall…
The parent and child worker threads each get their own set of stack frames. If we're sharing them somehow, that would be a fantastically bad bug! :/
They're not sharing their stack frames, but they both get deleted in the main thread at the end of DeferredDestroyCompositor. In our case we are having a problem with CompositorParent being destroyed on the main thread while its CxxStackFrames has not yet be poped on the compositor thread after a sync message from the main thread to the compositor thread.
Oh, so we're trying to delete the CompositorParent on the main thread while it's still active on the compositor thread? That's also very bad! :)
Component: General → Graphics: Layers
OS: Android → All
Product: Firefox for Android → Core
Hardware: ARM → All
Target Milestone: Firefox 17 → ---
Version: Firefox 16 → 16 Branch
I have a patch in whice we dispatch a task on the Compositor thread from within CompositorParent::RecvStop in order to make sure that a nsRefPtr of the compositor parent stays there and keeps the compositor parent alive until the end of the ipdl code around RecvStop. So this solves the problem of deleting the CompositorParent while it is still active on the compositor thread. With this patch we then have the compositor thread's ref count that reaches zero within its own thread, so we end up having to dispatch a task on the main thread so that later deletes the compositor thread. The destruction sequence is becoming quite messy: When a widget is destroyed it triggers most of the compositor parent's clean up, and dispatches a task on the main thread to terminate the cleanup. This is not extremely nice because we then have part of the compositor stuff being destroyed before the destruction of gfxPlatform, and the rest being destroyed after. When the second half of the destruction happens, we need to have the compositor thread dispatch a task to himself to avoid deleting the compositor parent too soon. When we reach this task on the compositor thread we then need to dispatch a task from the compositor thread to the main thread so that the main thread deletes the compositor thread. The way we end up dispatching tasks all over the place makes it hard for us too have guarantees that we destroy the modules in the right order. For instance, Chris, do we have a constraint on how soon we must have deleted the last object that uses ipdl?
(In reply to Nicolas Silva [:nical] from comment #45) > When we reach this task on the compositor thread > we then need to dispatch a task from the compositor thread to the main > thread so that the main thread deletes the compositor thread. I'm concerned with dispatching a task from the compositor thread to the main thread after we've destroyed the channel and at which points the main thread is tearing down the browser. We have no guarantee that the main thread will always be in a proper state to respond to this event and wont be far along it's shutdown sequence. I think the right thing to do here is to destroy the associated compositor parent before the widget. To do this we need to find a way to make sure all the pending messages are processed.
This patch seems to fix it for me on fennec debug and optimized. I agree with BenWa that fixing the problem at this level is probably not the best solution, yet we have to do something about this crash in the short term and fixing the entire destruction sequence may be very tricky.
Attachment #656574 - Attachment is obsolete: true
Attachment #658588 - Flags: review?(jones.chris.g)
Crash Signature: libxul.so@0x1148aa3 | NS_IsMainThread_P] [@ mozalloc_abort | org.mozilla.fennec_aurora-2.apk@0xf24f40] [@ mozalloc_abort | __swrite | libxul.so@0x1148aa3 | 0 (deleted)@0x11f911f] → libxul.so@0x1148aa3 | NS_IsMainThread_P] [@ mozalloc_abort | org.mozilla.fennec_aurora-2.apk@0xf24f40] [@ mozalloc_abort | __swrite | libxul.so@0x1148aa3 | 0 (deleted)@0x11f911f] [@ mozalloc_abort | __swrite | libxul.so@0x1148b03] [@ mozalloc_abort | …
Crash Signature: dalvik-bitmap-2 (deleted)@0x111f63] [@ mozalloc_abort | org.mozilla.firefox_beta-1.apk@0x6d2038] [@ mozalloc_abort | NS_DebugBreak_P | mozilla::ipc::RPCChannel::DebugAbort] → dalvik-bitmap-2 (deleted)@0x111f63] [@ mozalloc_abort | org.mozilla.firefox_beta-1.apk@0x6d2038] [@ mozalloc_abort | __swrite | libxul.so@0x10d3cf7] [@ mozalloc_abort | org.mozilla.firefox_beta-2.apk@0x6d2038] [@ mozalloc_abort | org.mozilla.firefox_b…
Comment on attachment 658588 [details] [diff] [review] Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread. >diff --git a/gfx/layers/ipc/CompositorParent.cpp b/gfx/layers/ipc/CompositorParent.cpp >+static int sCompositorThreadRefCount = 0; Document the semantics of this. >+static void KeepCompositorParentAlive(CompositorParent* aNowReadyToDie) Call this DeferredDelete. >+{ >+ aNowReadyToDie->Release(); >+} >+ >+static void DeleteCompositorThread() >+{ >+ if (NS_IsMainThread()){ >+ if (sCompositorThread) { >+ delete sCompositorThread; No need to null check, |delete| does the right thing. > void CompositorParent::DestroyThread() > { >+ --sCompositorThreadRefCount; >+ if(sCompositorThreadRefCount == 0) { Nit: |if (0 == --sRefCount)| > CompositorParent::~CompositorParent() > { >+ --sCompositorThreadRefCount; >+ if (sCompositorThreadRefCount == 0) { >+ DeleteCompositorThread(); >+ } Don't duplicate this code. Refactor it into a helper function.
Attachment #658588 - Flags: review?(jones.chris.g)
Comment on attachment 660164 [details] [diff] [review] Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread. No need to re-request review for small changes like these.
Attachment #660164 - Flags: review?(jones.chris.g) → review+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Please nominate for Aurora/Beta asap (when you're comfortable with the bake time on Nightly).
I cannot reproduce any @mozalloc_abort crashes on the latest Nightly build. (In reply to Alex Keybl [:akeybl] from comment #54) > Please nominate for Aurora/Beta asap (when you're comfortable with the bake > time on Nightly). This crashes are annoying because occur almost for each Firefox quit. Due to the high number of crashes both on m-a and m-b, I nominate this to Aurora and Beta. We cannot deliver an app which crashes much more frequently than it does any other released Betas so far.
Attachment #660164 - Flags: approval-mozilla-beta?
Attachment #660164 - Flags: approval-mozilla-aurora?
(In reply to Cristian Nicolae (:xti) from comment #55) > Due to the high number of crashes both on m-a and m-b, I nominate this to Aurora > and Beta. That is not exactly what Alex asked. It's instead: does this patch fix the issue AND not add regressions that are worse than the benefit? You answered yes to the first part, what about the second part?
Status: RESOLVED → VERIFIED
(In reply to Scoobidiver from comment #56) what about the second part? So far the app is working fine. I tried different scenarios to reproduce this bug, but no crash occurred. Atm there isn't any noticeable regression, so I guess that this patch is great benefit.
I haven't had any problem since this patch so I feel comfortable having it landed on aurora and beta. What would worry me most is having another problem in the destruction sequence that we have not seen yet (the tear-down sequence of omtc is a mess), but it would have to be fixed on top of this anyway.
Comment on attachment 660164 [details] [diff] [review] Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread. [Triage Comment] Fix for a new top crash in FF16, which we believe is low risk based upon current testing. Approving for branches.
Attachment #660164 - Flags: approval-mozilla-beta?
Attachment #660164 - Flags: approval-mozilla-beta+
Attachment #660164 - Flags: approval-mozilla-aurora?
Attachment #660164 - Flags: approval-mozilla-aurora+
Crash Signature: org.mozilla.firefox_beta-1.apk@0x3d2038] [@ mozalloc_abort | NS_DebugBreak_P | mozilla::ipc::RPCChannel::DebugAbort] → org.mozilla.firefox_beta-1.apk@0x3d2038] [@ mozalloc_abort | NS_DebugBreak_P | mozilla::ipc::RPCChannel::DebugAbort] [@ mozalloc_abort | __swrite | libxul.so@0x10d56b7] [@ mozalloc_abort | __swrite | libxul.so@0x10d56b7 | 0 (deleted)@0x11f911f] [@ moz…
Comment on attachment 660164 [details] [diff] [review] Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread. Review of attachment 660164 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/CompositorParent.cpp @@ +84,5 @@ > + > +static void DeleteCompositorThread() > +{ > + if (NS_IsMainThread()){ > + delete sCompositorThread; Trailing whitespace @@ +231,5 @@ > + // after this function returns while some ipdl code still needs to run on > + // this thread. > + // We must keep the compositor parent alive untill the code handling message > + // reception is finished on this thread. > + this->AddRef(); // Corresponds to DeferredDeleteCompositorParent's Release NS_ADDREF_THIS() is more common. @@ +232,5 @@ > + // this thread. > + // We must keep the compositor parent alive untill the code handling message > + // reception is finished on this thread. > + this->AddRef(); // Corresponds to DeferredDeleteCompositorParent's Release > + CompositorLoop()->PostTask(FROM_HERE, More trailing whitespace @@ +233,5 @@ > + // We must keep the compositor parent alive untill the code handling message > + // reception is finished on this thread. > + this->AddRef(); // Corresponds to DeferredDeleteCompositorParent's Release > + CompositorLoop()->PostTask(FROM_HERE, > + NewRunnableFunction(&DeferredDeleteCompositorParent, Weird indentation
This crash doesn't occur anymore on the latest Aurora and Beta builds. -- Device: Galaxy Note OS: Android 4.0.4
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: