Closed Bug 519438 Opened 15 years ago Closed 15 years ago

Frequent crashes in nsAppShell::ProcessNextNativeEvent after mozilla::storage::Connection::Close()

Categories

(Toolkit :: Storage, defect, P2)

1.9.2 Branch
x86
Linux
defect

Tracking

()

RESOLVED DUPLICATE of bug 528708

People

(Reporter: johnath, Unassigned)

References

Details

(Keywords: crash)

The stack below has been crashing talos regularly since last Friday (Sep 25) on mozilla-1.9.2 linux, apparently without a bug being filed. The crash happens during a dirty profile Ts test (ts_places_generated_max) leading Mak, Ted and I to speculate that this was a latent crash that the tests have exposed. However, bhearsum confirms that the dirty profile tests have been running longer than that which suggests to me that, instead, a recent checkin caused it. I can't find any instances of this crash earlier than Wan-Teh's NSPR update here: http://hg.mozilla.org/releases/mozilla-1.9.2/rev/7a6b4aea8917 but since it's intermittent, it's not clear whether or not that commit is the culprit. Filing this against Storage as the direct trigger of the crash, and because of the suspicious connection to large profiles. Who can take a look? Instances I know of: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1253933045.1253934285.31717.gz&fulltext=1 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254030837.1254032087.23904.gz&fulltext=1 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254051540.1254052785.7507.gz&fulltext=1 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254144714.1254145997.25714.gz&fulltext=1 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254188158.1254189435.5281.gz&fulltext=1 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1254224123.1254225413.11654.gz&fulltext=1 Stack: 0 libgtk-x11-2.0.so.0.1200.0 + 0x231c8b eip = 0xb6eaec8b esp = 0xbfaa2e24 ebp = 0xbfaa2e9c ebx = 0xb6ffe0bc esi = 0xb041cb80 edi = 0xb10a5cf0 eax = 0xb5e55820 ecx = 0xb041cb80 edx = 0x656c6966 efl = 0x00010206 1 libgtk-x11-2.0.so.0.1200.0 + 0x231f90 eip = 0xb6eaef91 esp = 0xbfaa2ea4 ebp = 0xbfaa2ebc 2 libgdk-x11-2.0.so.0.1200.0 + 0x175e7 eip = 0xb6bf25e8 esp = 0xbfaa2ec4 ebp = 0xbfaa2edc 3 libglib-2.0.so.0.1400.1 + 0x2f8d5 eip = 0xb6a388d6 esp = 0xbfaa2ee4 ebp = 0xbfaa2f0c 4 libglib-2.0.so.0.1400.1 + 0x2f11b eip = 0xb6a3811c esp = 0xbfaa2f14 ebp = 0xbfaa2f6c 5 libglib-2.0.so.0.1400.1 + 0x3255e eip = 0xb6a3b55f esp = 0xbfaa2f74 ebp = 0xbfaa2fcc 6 libglib-2.0.so.0.1400.1 + 0x32ac4 eip = 0xb6a3bac5 esp = 0xbfaa2fd4 ebp = 0xbfaa2fec 7 libxul.so!nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.cpp:7a6b4aea8917 : 147 + 0x9] eip = 0xb7adea44 esp = 0xbfaa2ff4 ebp = 0xbfaa300c 8 libxul.so!nsBaseAppShell::DoProcessNextNativeEvent(int) [nsBaseAppShell.cpp:7a6b4aea8917 : 151 + 0xa] eip = 0xb7af2eb4 esp = 0xbfaa3014 ebp = 0xbfaa302c 9 libxul.so!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int) [nsBaseAppShell.cpp:7a6b4aea8917 : 278 + 0x9] eip = 0xb7af333d esp = 0xbfaa3034 ebp = 0xbfaa305c 10 libxul.so!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:7a6b4aea8917 : 508 + 0x1c] eip = 0xb7bbc30e esp = 0xbfaa3064 ebp = 0xbfaa309c 11 libxul.so!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 230 + 0xd] eip = 0xb7b8afbb esp = 0xbfaa30a4 ebp = 0xbfaa30cc 12 libxul.so!nsThread::Shutdown() [nsThread.cpp:7a6b4aea8917 : 468 + 0xb] eip = 0xb7bbc8f8 esp = 0xbfaa30d4 ebp = 0xbfaa310c 13 libxul.so!mozilla::storage::Connection::Close() [mozStorageConnection.cpp:7a6b4aea8917 : 500 + 0x8] eip = 0xb7a84093 esp = 0xbfaa3114 ebp = 0xbfaa313c 14 libxul.so!NS_GetXPTCallStub_P + 0x30 eip = 0xb7bc94c3 esp = 0xbfaa3144 ebp = 0xbfaa3158 15 libxul.so!XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) [xpcwrappednative.cpp:7a6b4aea8917 : 2710 + 0x20] eip = 0xb731cdb8 esp = 0xbfaa3160 ebp = 0xbfaa3458 16 libxul.so!XPC_WN_CallMethod(JSContext*, JSObject*, unsigned int, int*, int*) [xpcwrappednativejsops.cpp:7a6b4aea8917 : 1740 + 0x9] eip = 0xb7323ca1 esp = 0xbfaa3460 ebp = 0xbfaa3528 17 libmozjs.so!js_Invoke [jsinterp.cpp:7a6b4aea8917 : 1373 + 0x27] eip = 0xb709a619 esp = 0xbfaa3530 ebp = 0xbfaa35f8 18 libmozjs.so!js_Interpret [jsops.cpp:7a6b4aea8917 : 2199 + 0x1d] eip = 0xb708b0af esp = 0xbfaa3600 ebp = 0xbfaa3858 19 libmozjs.so!js_Invoke [jsinterp.cpp:7a6b4aea8917 : 1381 + 0xa] eip = 0xb709a7eb esp = 0xbfaa3860 ebp = 0xbfaa3928 20 libxul.so!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) [xpcwrappedjsclass.cpp:7a6b4aea8917 : 1671 + 0x1a] eip = 0xb731a741 esp = 0xbfaa3930 ebp = 0xbfaa3b98 21 libxul.so!nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) [xpcwrappedjs.cpp:7a6b4aea8917 : 570 + 0x13] eip = 0xb7315b67 esp = 0xbfaa3ba0 ebp = 0xbfaa3bc8 22 libxul.so!PrepareAndDispatch [xptcstubs_gcc_x86_unix.cpp:7a6b4aea8917 : 95 + 0x1d] eip = 0xb7bc9fd9 esp = 0xbfaa3bd0 ebp = 0xbfaa3c98 23 libxul.so!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:7a6b4aea8917 : 527 + 0xa] eip = 0xb7bbc368 esp = 0xbfaa3ca0 ebp = 0xbfaa3cd8 24 libxul.so!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 230 + 0xd] eip = 0xb7b8afbb esp = 0xbfaa3ce0 ebp = 0xbfaa3d08 25 libxul.so!nsBaseAppShell::Run() [nsBaseAppShell.cpp:7a6b4aea8917 : 170 + 0x9] eip = 0xb7af2fd6 esp = 0xbfaa3d10 ebp = 0xbfaa3d28 26 libxul.so!nsAppStartup::Run() [nsAppStartup.cpp:7a6b4aea8917 : 182 + 0x5] eip = 0xb79c341e esp = 0xbfaa3d30 ebp = 0xbfaa3d48 27 libxul.so!XRE_main [nsAppRunner.cpp:7a6b4aea8917 : 3479 + 0xb] eip = 0xb72e80ea esp = 0xbfaa3d50 ebp = 0xbfaa43c8 28 firefox-bin!main [nsBrowserApp.cpp:7a6b4aea8917 : 156 + 0xe] eip = 0x080495aa esp = 0xbfaa43d0 ebp = 0xbfaa4428 29 libc-2.6.1.so + 0x1604f eip = 0xb65fa050 esp = 0xbfaa4430 ebp = 0xbfaa4498
Mak points out that this may be related to Jesse's bug 496019. I'll add the dependency so that they're co-tracked - the stacks are different, but may point to a similar underlying problem.
Depends on: 496019
Assignee: nobody → mak77
assigned to marco for investigation, due to Places involvement.
Blocks: 438871
i was able to start the max dirty profile talos test on linux, but for now i've not been able to crash, i will try with more runs.
Marco - if you can't reproduce it locally, are there things you could land on trunk to help debug - extra logging, for instance? This one is pretty hot.
Hard to tell, this is something deep in how storage handles threads during db closing, and nsAppShell is not exactly my land. Also could be storage i only activating the native event loop, but we are crashing due to the actual event that gets executed (that maybe has a different handling from trunk where we don't crash). If i could reproduce i could at least ensure if fixing bug 496019 would actually fix this. It took a bit of time to run dirty tests on Linux due to lack of a goot howto documentation. i tried with a 100MB db i was unable to reproduce anything, will try with larger dbs.
this is on mozilla-central and looks like the same, so this crash is cross-branches http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1254756006.1254757288.12591.gz Linux mozilla-central talos dirty on 2009/10/05 08:20:06 but quite more rare on central.
as a reference the max places.sqlite generated files that are used by max dirty profiles tests are about 210MB as of today.
This is very likely bug 469019. I'll try to get a patch for it today.
(In reply to comment #18) > This is very likely bug 469019. I'll try to get a patch for it today. Of course, I meant bug 496019.
sadly, i was unable to locally reproduce with a 210MB db too.
I'm not sure if this crash is related. Storage is not, as far as I can tell, implicated in the crash, however there are 5 frames at the bottom (top? business-end?) of the callstack that are identical to the callstacks from other logs in this bug. FWIW. 3 libxul.so!nsAppShell::ProcessNextNativeEvent(int) [nsAppShell.cpp:4c58fd598abb : 147 + 0x9] eip = 0xb7aae298 esp = 0xbf871ad0 ebp = 0xbf871ae8 4 libxul.so!nsBaseAppShell::DoProcessNextNativeEvent(int) [nsBaseAppShell.cpp:4c58fd598abb : 151 + 0xa] eip = 0xb7ac2a0c esp = 0xbf871af0 ebp = 0xbf871b08 5 libxul.so!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, int, unsigned int) [nsBaseAppShell.cpp:4c58fd598abb : 296 + 0xa] eip = 0xb7ac2ba3 esp = 0xbf871b10 ebp = 0xbf871b38 6 libxul.so!nsThread::ProcessNextEvent(int, int*) [nsThread.cpp:4c58fd598abb : 508 + 0x1c] eip = 0xb7b8b3da esp = 0xbf871b40 ebp = 0xbf871b78 7 libxul.so!NS_ProcessNextEvent_P(nsIThread*, int) [nsThreadUtils.cpp : 230 + 0xd] eip = 0xb7b5ba44 esp = 0xbf871b80 ebp = 0xbf871ba8 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1253933045.1253934285.31717.gz&fulltext=1
Curtis, iirc that's another bug, not related to this, i'd say bug 505572 maybe?
(In reply to comment #26) > Curtis, iirc that's another bug, not related to this, i'd say bug 505572 maybe? Now that the log on 505572 is gone, it's hard to say. The similarities in the call stack looked a little bit suspicious. It's definitely not the exact same thing, but if does happen to be related it could have some diagnostic value.
Right, the top of the stack is just "the event loop". If storage isn't on the stack, it's not this bug, but it could be a similar crash, with something else spinning an event loop.
Flags: blocking1.9.2?
This is linux only, and it's crashing in glib/gtk. Do we have somebody familiar with these libraries who could take a look at this?
Why is this nominated for blocking? Hasn't been a lot of activity lately, and while test issues are annoying, unless they block the active development of something I have a hard time blocking on it in particular. Basically: why do we feel this needs to be fixed in order to release Firefox 3.6?
(In reply to comment #37) > Why is this nominated for blocking? Hasn't been a lot of activity lately, If it had been blocking+'d when I nominated it three weeks ago, there might have been some activity. > and > while test issues are annoying, unless they block the active development of > something I have a hard time blocking on it in particular. If the test is crashing, the browser is probably crashing too. > > Basically: why do we feel this needs to be fixed in order to release Firefox > 3.6? The browser is crashing on shutdown, while it didn't use to.
Flags: blocking1.9.2? → blocking1.9.2+
Priority: -- → P2
I don't think this should block. The reason this was given blocking was "the test is crashing, the browser is probably crashing too" but we have no data of this crash on crash-stats. That doesn't mean users may not be crashing in it, but it does mean it doesn't happen as much as some of our other crashes which are not blockers. This crash is in gtk code, which is called in a nested event loop. We have nested event loops in more than just this particular bit of code. This crash is either a bug in gtk (given the stacks), or it's a bug we have in calling out to gtk in our native event loop processing. Have we ruled out that this isn't a new crash because we updated system libraries on our tinderboxes? Have we looked into getting symbols for the system calls to get a better idea of the failure? I asked for help in comment 31 for someone more familiar with linux and have gotten deafening silence in return.
Flags: blocking1.9.2+ → blocking1.9.2?
why does it crash?
Flags: blocking1.9.2? → blocking1.9.2+
it doesn't mean that it doesn't happen as often to our users as other crashes -- what do we know about what has to happen to trigger the crash? if using newer GTK causes it, we have a ticking clock. I don't think we need kernel symbols to make progress here, but a core dump and debuginfo for the system libraries would let us at least triage on the basis of some more data about what triggers the problem. To solicit Linux help, I do suggest asking for it at the dev meeting, or a newsgroup post, or mailing people directly. They may well not be already cc:d on the bug!
(In reply to comment #42) > To solicit Linux help, I do suggest asking for it at the dev meeting, or a > newsgroup post, or mailing people directly. They may well not be already cc:d > on the bug! Who?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1257446852.1257448188.27999.gz Linux mozilla-1.9.2 talos dirty on 2009/11/05 10:47:32 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1257448330.1257449615.12873.gz Linux mozilla-1.9.2 talos dirty on 2009/11/05 11:12:10 Two cycles in a row, but now it's gone again.
I've requested debuginfo for the gdk/gtk libraries in bug 527223.
bug 496019 needs to land on 1.9.2, then we can reevaluate. We need someone to get those gdk/gtk stack traces on the tinderboxes, otherwise we know that something crashes in image libraries but we can't tell what.
Depends on: amo-mxr
are we sure this is linux only? the fx 3.6b1 stack in http://crash-stats.mozilla.com/report/index/f8d60064-092f-4780-808c-b13712091107 looks a lot like comment zero and even more like comment 25. Its not clear to me from comment 27 and 28 that a stack looking like this is not connected or could aid in debugging. Module Signature [Expand] Source 0 xul.dll nsAppShell::ProcessNextNativeEvent widget/src/windows/nsAppShell.cpp:181 1 winmm.dll timeGetTime 2 xul.dll nsBaseAppShell::OnProcessNextEvent widget/src/xpwidgets/nsBaseAppShell.cpp:296 3 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:508 4 xul.dll nsBaseAppShell::Run widget/src/xpwidgets/nsBaseAppShell.cpp:170 5 xul.dll nsAppStartup::Run toolkit/components/startup/src/nsAppStartup.cpp:182 6 nspr4.dll PR_GetEnv 7 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:110 8 firefox.exe __tmainCRTStartup obj-firefox/memory/jemalloc/crtsrc/crtexe.c:591 9 kernel32.dll BaseProcessStart these windows crashes are just outside the top 100 and we over 50 or those everyday. distribution of all versions where the nsAppShell::ProcessNextNativeEvent crash was found on 20091106-crashdata.csv 29 Firefox 3.5.4 15 Firefox 3.0.15 12 Firefox 3.5.5 3 Firefox 3.5 2 Firefox 3.5.3 1 Firefox 3.6b1 1 Firefox 3.5.1 1 Firefox 3.0.7 1 Firefox 3.0.14 1 Firefox 3.0.13 If these are not connected let me know and I'll spin off another bug or bugs.
(In reply to comment #49) > are we sure this is linux only? pretty sure > the fx 3.6b1 stack in > http://crash-stats.mozilla.com/report/index/f8d60064-092f-4780-808c-b13712091107 > looks a lot like comment zero and even more like comment 25. Its not clear to > me from comment 27 and 28 that a stack looking like this is not connected or > could aid in debugging. These look very different to me. You should file a new bug that crash.
Severity: normal → critical
any update here?
(In reply to comment #54) > any update here? Shaawn is still working on the patch for connection::close Here we still need someone that could take a stack in libgtk on the talos box.
I'm unassigning from me, i can't do anything more here apart pinging here and there. Shawn's patch could solve this not spinning the events loops but clearly this won't solve the crash, will just avoid firing it after connection::close() Someone with gtk knowledge and access to talos machine should own this at this point.
Assignee: mak77 → nobody
Hi Karl, in the platform meeting they said you might know how to debug this?
It was mentioned in the platform meeting that this type of crash might apply to any nested event loop. Is there any reason to believe there's a more general defect here than the one covered by bug 496019?
(In reply to comment #64) > It was mentioned in the platform meeting that this type of crash might apply to > any nested event loop. Is there any reason to believe there's a more general > defect here than the one covered by bug 496019? I don't see how spinning the event loop in storage would be any different than spinning it elsewhere. Fixing bug 496019 will likely make this crash no longer noticeable to us though...
filed bug 529380 to remove this behavior from Weave.
Blocks: 526391
Was the stack in comment 0 on this platform (Ubuntu 7.04 Desktop) https://wiki.mozilla.org/ReferencePlatforms/Test/Linux It should be possible to get the corresponding debug symbol packages and manually resolve the offsets into objects. addr2line -j .text -C -f -i -e /usr/lib/debug/usr/lib/libgtk-x11-2.0.so.0.1200.0.debug 0x231c8b etc.
(In reply to comment #68) > Was the stack in comment 0 on this platform (Ubuntu 7.04 Desktop) > https://wiki.mozilla.org/ReferencePlatforms/Test/Linux The answer is no, because that release would have gtk+-2.10.11, but this is gtk+-2.12.0. These look like good candidates: https://launchpad.net/ubuntu/+source/gtk+2.0/2.12.0-1ubuntu3/+build/403567 https://launchpad.net/ubuntu/+source/glib2.0/2.14.1-1ubuntu1/+build/394621
(In reply to comment #66) > (In reply to comment #64) > > It was mentioned in the platform meeting that this type of crash might apply to > > any nested event loop. Is there any reason to believe there's a more general > > defect here than the one covered by bug 496019? > I don't see how spinning the event loop in storage would be any different than > spinning it elsewhere. indeed for example bug 528708 is a trunk libgtk crash pretty similar to this (in dirty talos), and is not due to storage spinning the event loop.
Depends on: 528708
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Depends on: 529786
No longer blocks: 438871
Whiteboard: [orange]
No longer depends on: 528708, 529786
No longer blocks: 526391
This no longer blocks for the same reasons bug 528708 doesn't block.
Flags: blocking1.9.2+ → blocking1.9.2-
You need to log in before you can comment on or make changes to this bug.