Closed Bug 21556 Opened 25 years ago Closed 25 years ago

crash on SMP systems: socket transport in load group

Categories

(Core :: Networking, defect, P3)

VERIFIED FIXED

People

(Reporter: bsemrad, Assigned: warrensomebody)

Details

(Keywords: crash, Whiteboard: [PDT+] w/b minus on 3/7 [have fixes!])

Attachments

(2 files)

System specs: Dual PII 400 running Windows NT 4.0 Service Pack 6a, 128 Meg RAM, tons of hard drive space. This is also a fairly young install of NT (about 4 months). Communicator 4.7 never crashes on this machine. This particular Mozilla build is from 12-10-99, but Mozilla has been acting this way for me for at least a month or so. I always remove the mozregistry.dat file, the user account directory that gets created, and the entire Moz directory every time I install a new daily build. I have not modified the bookmarks or any other configuration other than to accept the defaults at initial system startup. I'm not sure if this makes much difference, but I'm accessing the net on my NT box through a Linux masquerading system attached to a cable modem.

Problem: Mozilla seems very unstable on my SMP system (PC specs above). Besides crashing about 50% of the time on startup, I usually (90% of the time) get an exception within 60 seconds of browser startup. Occasionally I can just start Mozilla and let it sit for a minute or so and it will get a read exception while displaying the initial mozilla.org web site. I tried this just now but couldn't get it to reproduce within a few minutes or so. To get it to crash I can usually just type "http://www.slashdot.org" or "http://www.linuxworld.com" into the URL bar and press enter. Then, while the home page of either of these sites is being displayed, Mozilla will usually get an exception before the main page is completely displayed. In my experience either of these sites will crash Mozilla about 30%-50% of the time.

Reproducing the crash: Edit the URL to be one of the above websites and hit enter. If Mozilla doesn't crash, put the cursor on the URL bar and hit enter again. I can usually get it to crash within the first couple of tries on either web site. It seems much more likely to crash the first few times I visit a site, but that may be my imagination.

I went to each site about 10 times just now and got it to crash about 4 times on each one. The dialog that popped up notifying me of the exception seemed somewhat consistent, in that it displayed the same exception message about every other time.
Adding some multi-threading gurus/perps to the cc list. /be
One problem is 18110. (Jan, I think that this is your reproducible testcase)
Service Pack 6a. Wow. We should try and find a developer with that service pack to see where we're crashing.
Depends on: 18110
bsemrad@adsoft.net: could you attach a Dr Watson log from Windows NT?
Here is an excerpt from an email that I sent to dougt@netscape.com about the crash on my machine. I went ahead and downloaded the source for Mozilla dated on 12-13-99 and compiled it and then ran it. Below is a copy of the stack trace of the crash when I tried to go to www.slashdot.org.

nsCOMPtr<nsProxyObject>::assign_with_AddRef(nsISupports * 0x02f69060) line 759 + 9 bytes
nsCOMPtr<nsProxyObject>::operator=(nsProxyObject * 0x02f69060) line 516
nsProxyObjectCallInfo::nsProxyObjectCallInfo(nsProxyObject * 0x02f69060, nsXPTMethodInfo * 0x021ed670, unsigned int 3, nsXPTCVariant * 0x02f6a3d0, unsigned int 4, PLEvent * 0x02f6a890) line 65
nsProxyObject::Post(unsigned int 3, nsXPTMethodInfo * 0x021ed670, nsXPTCMiniVariant * 0x02d1fe18, nsIInterfaceInfo * 0x02f6e060) line 340 + 57 bytes
nsProxyEventObject::CallMethod(nsProxyEventObject * const 0x02f6f810, unsigned short 3, const nsXPTMethodInfo * 0x021ed670, nsXPTCMiniVariant * 0x02d1fe18) line 391 + 55 bytes
PrepareAndDispatch(nsXPTCStubBase * 0x02f6f810, unsigned int 3, unsigned int * 0x02d1fecc, unsigned int * 0x02d1feb8) line 100 + 31 bytes
SharedStub() line 125
------------------------------------------
Doug then emailed me with the following: Thanks for the great work. This indeed is bug 18110.

I told Doug that I might have a go at fixing it but it has been several days since I told him that and I haven't yet had time to look at it seriously so you should probably not count on me for this one.
Assignee: leger → dp
Component: Browser-General → XPCOM
Assignee: dp → dougt
I've been seeing crashes in assign_with_AddRef under SMP Linux as well. RH6.0 + gcc 2.95.2 + binutils 2.9.1.0.25 + gtk 1.2.5 + glibc 2.1.2 (from RH6.1). Kernels 2.2.12-2.2.14pre17. On some pages (http://userfriendly.org/, http://cnn.com/), I can just let the main page load, not touch the browser, switch to another workspace (using WindowMaker) and the browser will crash within 30secs. This is from last night's testing with the M12 fullcircle build.
I've crashed about 90% of the time loading http://userfriendly.org/ getting one of these stacks.

Redhat 6.1, dual pentium II 450
Linux localhost.localdomain 2.2.12-20smp #1 SMP Mon Sep 27 10:34:45 EDT 1999 i686 unknown
gtk+-1.2.5-2
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
binutils-2.9.1.0.23-6
glibc-2.1.2-11

#0 0x3f in ?? ()
#1 0x40529bf1 in nsOnStopRequestEvent::~nsOnStopRequestEvent () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#2 0x4052962c in nsStreamListenerEvent::DestroyPLEvent () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0x40176c6d in PL_DestroyEvent () from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#4 0x40176c46 in PL_HandleEvent () from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#5 0x40176b86 in PL_ProcessPendingEvents () from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#6 0x401471ce in ?? () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#7 0x405b6ac4 in event_processor_callback () from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#8 0x405b680f in our_gdk_io_invoke () from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#9 0x4086052a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#10 0x40861be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#11 0x408621a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#12 0x40862341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#13 0x4078c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#14 0x405b7067 in nsAppShell::Run () from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#15 0x404d0c41 in nsAppShellService::Run () from /home/endico/mozilla/mozilla/dist/bin/libnsappshell.so
#16 0x804adf1 in main1 ()
#17 0x804b225 in main ()
#18 0x4025e1eb in ?? () from /lib/libc.so.6

#0 0x40333238 in main_arena () from /lib/libc.so.6
#1 0x4003af66 in ?? () from /home/endico/mozilla/mozilla/dist/bin/libraptorgfx.so
#2 0x4014ec84 in nsCOMPtr_base::assign_with_AddRef () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#3 0x4139d18f in nsCOMPtr<nsIChannel>::operator= () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#4 0x41392de8 in nsHTTPRequest::~nsHTTPRequest () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#5 0x41392ea0 in nsHTTPRequest::Release () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#6 0x4138cba5 in nsHTTPChannel::~nsHTTPChannel () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#7 0x4138cd63 in nsHTTPChannel::Release () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#8 0x4052954e in nsStreamListenerEvent::~nsStreamListenerEvent () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#9 0x40529bf1 in nsOnStopRequestEvent::~nsOnStopRequestEvent () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#10 0x4052962c in nsStreamListenerEvent::DestroyPLEvent () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#11 0x40176c6d in PL_DestroyEvent () from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#12 0x40176c46 in PL_HandleEvent () from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#13 0x40176b86 in PL_ProcessPendingEvents () from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#14 0x401471ce in nsEventQueueImpl::ProcessPendingEvents () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#15 0x405b6ac4 in event_processor_callback () from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#16 0x405b680f in our_gdk_io_invoke () from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#17 0x4086052a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x40861be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#19 0x408621a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#20 0x40862341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#21 0x4078c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#22 0x405b7067 in nsAppShell::Run () from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#23 0x404d0c41 in nsAppShellService::Run () from /home/endico/mozilla/mozilla/dist/bin/libnsappshell.so
#24 0x804adf1 in main1 ()
#25 0x804b225 in main ()
#26 0x4025e1eb in __libc_start_main (main=0x804b044 <main>, argc=1, argv=0xbffffac4, init=0x80493a4 <_init>, fini=0x804d6d8 <_fini>, rtld_fini=0x4000a610, stack_end=0xbffffabc) at ../sysdeps/generic/libc-start.c:90
Attached file: test case (22 gif images) (deleted)
added test case with 22 gif images. There was a theory that this problem was due to animated gifs but reducing the test case to just two animated gifs didn't cause a crash after 2 tries. I'm guessing that it has more to do with having lots of threads running on different processors. The problem may also have to do with one of the gifs being a lot bigger than the others. It seemed like the userfriendly page had been done loading for a long time but the throbber was still spinning. Apparently the extra time was being spent loading extra frames on one of the animated gifs.
This is a dup of 18110 [dogfood] XPCOM/Proxy needs to be threadsafe!!
dougt, jband just whacked XPConnect to be threadsafe and otherwise refactored it for correct thread-local vs. process-global, etc. considerations. Since the xpcom proxy code sprang from the brow of XPConnect, perhaps his changes could help safen xpcom/proxy. What's the prognosis? /be
Status: NEW → ASSIGNED
Many of his changes can be massaged into xpcom/proxy. However, because of the very nature of xpcom/proxy, we need to do a really good job of protecting ourselves. Simply applying his changes is not good enough.
*** Bug 22648 has been marked as a duplicate of this bug. ***
Summary: Mozilla crashes often on SMP systems. → [Dogfood] Mozilla crashes often on SMP systems.
ugh! Mozilla is pretty unusable for me any more at home on my smp box. It crashes too much. Please please please get dougt an smp box to debug with. I noticed that looking at slashdot.org is causing problems too. It too has lots of images/page and often uses animated gifs. I got this stack after loading mozilla to home page, then loading slashdot.org, then loading mozillazine and staying there a while. This is with this morning's build.

#0 0x4017451a in nsProxyObject::Post (this=0x41d585f8, methodIndex=4, methodInfo=0x407bbd0c, params=0xbf5ffa5c, interfaceInfo=0x41d00870) at nsProxyEvent.cpp:423
#1 0x40176747 in nsProxyEventObject::CallMethod (this=0x419dbd28, methodIndex=4, info=0x407bbd0c, params=0xbf5ffa5c) at nsProxyEventObject.cpp:391
#2 0x40181924 in PrepareAndDispatch (self=0x419dbd28, methodIndex=4, args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#3 0x40181a4a in nsXPTCStubBase::Stub4 (this=0x419dbd28) at ../../../../../../dist/include/xptcstubsdef.inc:6
#4 0x4061211b in ?? () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#5 0x4060f4a0 in ?? () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#6 0x40613307 in ?? () from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#7 0x401716b5 in nsThread::Main (arg=0x41b452c0) at nsThread.cpp:83
#8 0x402138fb in _pt_root (arg=0x41b16d98) at ptthread.c:157
#9 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
*** Bug 18659 has been marked as a duplicate of this bug. ***
yet another stack trace. looked at mozilla.org, slashdot.org, www.benews.com. It crashed at benews.com in the middle of scrolling after having sat there a while. Maybe this is timer based. It seems like this morning's crashes are happening after 10 minutes or so and some happen while the browser is idle.

#0 0x81e48e1 in ?? ()
#1 0x4019d670 in nsCOMPtr<nsProxyObject>::Assert_NoQueryNeeded (this=0x4210bb30) at ../../../dist/include/nsCOMPtr.h:444
#2 0x4019d630 in nsCOMPtr<nsProxyObject>::operator= (this=0x4210bb30, rhs=0x81ebd20) at ../../../dist/include/nsCOMPtr.h:516
#3 0x40173448 in nsProxyObjectCallInfo::nsProxyObjectCallInfo (this=0x4210bb10, owner=0x81ebd20, methodInfo=0x407bbcbc, methodIndex=4, parameterList=0x4210bad8, parameterCount=3, event=0x4210bab8) at nsProxyEvent.cpp:63
#4 0x401743e0 in nsProxyObject::Post (this=0x81ebd20, methodIndex=4, methodInfo=0x407bbcbc, params=0xbf5ffa5c, interfaceInfo=0x824a378) at nsProxyEvent.cpp:374
#5 0x40176747 in nsProxyEventObject::CallMethod (this=0x81a4708, methodIndex=4, info=0x407bbcbc, params=0xbf5ffa5c) at nsProxyEventObject.cpp:391
#6 0x40181924 in PrepareAndDispatch (self=0x81a4708, methodIndex=4, args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#7 0x40181a4a in nsXPTCStubBase::Stub4 (this=0x81a4708) at ../../../../../../dist/include/xptcstubsdef.inc:6
#8 0x4061211b in nsSocketTransport::fireStatus (this=0x81a73c8, aCode=5) at nsSocketTransport.cpp:1897
#9 0x4060f4a0 in nsSocketTransport::Process (this=0x81a73c8, aSelectFlags=0) at nsSocketTransport.cpp:539
#10 0x40613307 in nsSocketTransportService::Run (this=0x41b6c3b8) at nsSocketTransportService.cpp:467
#11 0x401716b5 in nsThread::Main (arg=0x41b48728) at nsThread.cpp:83
#12 0x402138fb in _pt_root (arg=0x41b48eb0) at ptthread.c:157
#13 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Doug, can this be fixed in M13? Or at least get a Target Milestone. /be
Here's another stack very similar to one before. It crashed sitting at a slashdot article while i was away. Maybe it was reloading an ad? I forget if their ads refresh themselves.

#0 0x85b579c in ?? ()
#1 0x4060e7ab in nsSocketTransport::~nsSocketTransport (this=0x85d75a8, __in_chrg=3) at nsSocketTransport.cpp:223
#2 0x40610760 in nsSocketTransport::Release (this=0x85d75a8) at nsSocketTransport.cpp:1191
#3 0x416b9eae in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (this=0x862bce8, newPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:415
#4 0x416bea8c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x862bce8, rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
#5 0x416bf7e3 in nsCOMPtr<nsIChannel>::operator= (this=0x862bce8, rhs=0x0) at ../../../../dist/include/nsCOMPtr.h:515
#6 0x416b05cf in nsHTTPRequest::~nsHTTPRequest (this=0x862bcd0, __in_chrg=3) at nsHTTPRequest.cpp:140
#7 0x416b0720 in nsHTTPRequest::Release (this=0x862bcd0) at nsHTTPRequest.cpp:151
#8 0x416a9cc5 in nsHTTPChannel::~nsHTTPChannel (this=0x85b7af8, __in_chrg=3) at nsHTTPChannel.cpp:117
#9 0x416a9f22 in nsHTTPChannel::Release (this=0x85b7af8) at nsHTTPChannel.cpp:127
#10 0x4060b78e in nsStreamListenerEvent::~nsStreamListenerEvent (this=0x83fe668, __in_chrg=3) at nsAsyncStreamListener.cpp:77
#11 0x4060c091 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x83fe668, __in_chrg=3) at nsAsyncStreamListener.cpp:257
#12 0x4060b8bf in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84ba438) at nsAsyncStreamListener.cpp:104
#13 0x401d841b in PL_DestroyEvent (self=0x84ba438) at plevent.c:545
#14 0x401d83b9 in PL_HandleEvent (self=0x84ba438) at plevent.c:532
#15 0x401d827c in PL_ProcessPendingEvents (self=0x80aa1f8) at plevent.c:483
#16 0x4016faa9 in nsEventQueueImpl::ProcessPendingEvents (this=0x80aa1d0) at nsEventQueue.cpp:201
#17 0x40830da4 in event_processor_callback (data=0x80aa1d0, source=6, condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#18 0x40830a2f in our_gdk_io_invoke (source=0x8156560, condition=G_IO_IN, data=0x81f3308) at nsAppShell.cpp:54
#19 0x406d752a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#20 0x406d8be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#21 0x406d91a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#22 0x406d9341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#23 0x40907209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#24 0x408313a7 in nsAppShell::Run (this=0x8095350) at nsAppShell.cpp:304
#25 0x4058ffbd in nsAppShellService::Run (this=0x80a9fd0) at nsAppShellService.cpp:465
#26 0x804bf3d in main1 (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:609
#27 0x804c3c7 in main (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:697
yet another stack trace. note the assert this time. It crashed while loading http://www.mozilla.org/banners/ It seemed like it was done loading but the throbber kept going.

Document http://www.mozilla.org/ loaded successfully
Document: Done (9.162 secs)
WEBSHELL+ = 4
Opening file signon.tbl failed
FindShortcut: in='http://www.mozilla.org/banners/' out='null'
###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().: 'mRawPtr != 0', file ../../dist/include/nsCOMPtr.h, line 569
###!!! Break: at file ../../dist/include/nsCOMPtr.h, line 569
[Switching to Thread 16561]
Program received signal SIGSEGV, Segmentation fault.
0x4017451a in ?? () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
(gdb) where
#0 0x4017451a in nsProxyObject::Post (this=0x407575f0, methodIndex=4, methodInfo=0x816b65c, params=0xbf5ffa5c, interfaceInfo=0x8523ae8) at nsProxyEvent.cpp:423
#1 0x40176747 in nsProxyEventObject::CallMethod (this=0x40705a90, methodIndex=4, info=0x816b65c, params=0xbf5ffa5c) at nsProxyEventObject.cpp:391
#2 0x40181924 in PrepareAndDispatch (self=0x40705a90, methodIndex=4, args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#3 0x40181a4a in nsXPTCStubBase::Stub4 (this=0x40705a90) at ../../../../../../dist/include/xptcstubsdef.inc:6
#4 0x4061211b in nsSocketTransport::fireStatus (this=0x4073aed8, aCode=5) at nsSocketTransport.cpp:1897
#5 0x4060f4a0 in nsSocketTransport::Process (this=0x4073aed8, aSelectFlags=0) at nsSocketTransport.cpp:539
#6 0x40613307 in nsSocketTransportService::Run (this=0x407375e0) at nsSocketTransportService.cpp:467
#7 0x401716b5 in nsThread::Main (arg=0x40738488) at nsThread.cpp:83
#8 0x402138fb in ?? () from /home/endico/mozilla/mozilla/dist/bin/libnspr3.so
#9 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Whiteboard: [PDT+]
Target Milestone: M13
Putting on PDT+ radar.
Attached file: single gif image (deleted)
A single animated gif image is enough to cause a crash although it may take a while. Load the image and wait. Eventually mozilla will crash. Sometimes it crashes immediately, sometimes it takes an hour or more. Oddly, I found that i get good stacks when i view an html file with a link to an animated gif but the stack is corrupted if I type the url of the gif image directly into the location bar. It looks like a networking problem that happens to be exercised by animated gifs because they aren't cached, and have to be downloaded from the source over and over.

------------ viewing animated gif ------------
Document http://userfriendly.org/images/buttons/ufbook.gif loaded successfully
Document: Done (35.863 secs)
[Switching to Thread 25851]
Program received signal SIGSEGV, Segmentation fault.
0x84dc660 in ?? ()
(gdb) where
#0 0x84dc660 in ?? ()
#1 0x19d8f2bf in ?? ()
Cannot access memory at address 0x5ff8e808.

--------------- loaded test case 2, an html file that displays an animated gif ----------------
(gdb) where
#0 0x40d00138 in ?? ()
#1 0x416c8a9c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x40d22ad8, rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
#2 0x416c97f3 in nsCOMPtr<nsIChannel>::operator= (this=0x40d22ad8, rhs=0x0) at ../../../../dist/include/nsCOMPtr.h:515
#3 0x416ba5df in nsHTTPRequest::~nsHTTPRequest (this=0x40d22ac0, __in_chrg=3) at nsHTTPRequest.cpp:140
#4 0x416ba730 in nsHTTPRequest::Release (this=0x40d22ac0) at nsHTTPRequest.cpp:151
#5 0x416b3cd5 in nsHTTPChannel::~nsHTTPChannel (this=0x418a2588, __in_chrg=3) at nsHTTPChannel.cpp:117
#6 0x416b3f32 in nsHTTPChannel::Release (this=0x418a2588) at nsHTTPChannel.cpp:127
#7 0x4060ca1e in nsStreamListenerEvent::~nsStreamListenerEvent (this=0x853efc0, __in_chrg=3) at nsAsyncStreamListener.cpp:77
#8 0x4060d321 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x853efc0, __in_chrg=3) at nsAsyncStreamListener.cpp:257
#9 0x4060cb4f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84bae80) at nsAsyncStreamListener.cpp:104
#10 0x401d941b in PL_DestroyEvent (self=0x84bae80) at plevent.c:545
#11 0x401d93b9 in PL_HandleEvent (self=0x84bae80) at plevent.c:532
#12 0x401d927c in PL_ProcessPendingEvents (self=0x80ab660) at plevent.c:483
#13 0x4016fc3c in nsEventQueueImpl::ProcessPendingEvents (this=0x80ab638) at nsEventQueue.cpp:228
#14 0x406c1064 in event_processor_callback (data=0x80ab638, source=7, condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#15 0x406c0cef in our_gdk_io_invoke (source=0x811fda8, condition=G_IO_IN, data=0x82344d0) at nsAppShell.cpp:54
#16 0x4087352a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#17 0x40874be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x408751a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#19 0x40875341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#20 0x4079c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#21 0x406c1667 in nsAppShell::Run (this=0x808d038) at nsAppShell.cpp:304
#22 0x4059107d in nsAppShellService::Run (this=0x80ab438) at nsAppShellService.cpp:465
#23 0x804bf3d in main1 (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:609
#24 0x804c3c7 in main (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:697
(gdb) print *this
No symbol "this" in current context.
(gdb) print this
No symbol "this" in current context.
(gdb) up
#1 0x416c8a9c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x40d22ad8, rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
760 assign_assuming_AddRef(NS_REINTERPRET_CAST(T*, rawPtr));
(gdb) print this
$1 = (nsCOMPtr<nsIChannel> *) 0x0
(gdb) print *this
Cannot access memory at address 0x0.
(gdb) up
#2 0x416c97f3 in nsCOMPtr<nsIChannel>::operator= (this=0x40d22ad8, rhs=0x0) at ../../../../dist/include/nsCOMPtr.h:515
515 assign_with_AddRef(rhs);
(gdb) print this
$2 = (nsCOMPtr<nsIChannel> *) 0x40d22ad8
(gdb) print *this
$3 = {mRawPtr = 0x0}
(gdb) up
#3 0x416ba5df in nsHTTPRequest::~nsHTTPRequest (this=0x40d22ac0, __in_chrg=3) at nsHTTPRequest.cpp:140
140 mTransport = null_nsCOMPtr();
(gdb) print this
$4 = (nsHTTPRequest *) 0x40d22ac0
(gdb) print *this
$5 = {<nsIStreamObserver> = {<nsISupports> = {_vptr. = 0x416cfcc0 <nsHTTPRequest virtual table>}, <No data fields>}, <nsIRequest> = {<nsISupports> = {_vptr. = 0x416cfc80 <nsHTTPRequest::nsIRequest virtual table>}, <No data fields>}, mRefCnt = 1, mMethod = HM_GET, mURI = {mRawPtr = 0x40d693b8}, mVersion = HTTP_ONE_ZERO, mTransport = {mRawPtr = 0x0}, mConnection = 0x418a2588, mHeaders = {mHTTPHeaders = {mRawPtr = 0x40d7dca8}}, mUsingProxy = 0, mRequestBuffer = {<nsStr> = {mLength = 0, mCapacity = 128, mCharSize = eOneByte, mOwnsBuffer = 1, {mStr = 0x40da2140 "", mUStr = 0x40da2140}}, _vptr. = 0x401b0084 <nsCString virtual table>}, mPostDataStream = {mRawPtr = 0x0}}
(gdb) quit
Whiteboard: [PDT+] → [PDT+] need SMP machine
Doug is not looking at this until he gets his hands on a machine that exhibits the problem. Anyone else want to take it? Anyone want to get another processor for Doug?
Brian, do we have a machine that dougt can use to debug this? One of those new solaris machines? (with purify?)
Here's the info I dug up on SMP boxes (for the ambitious): Bill Law and dp have dual processor machines, but they're 200mhz and dp thinks that's too slow. Rickg has a 733mhz (?!) machine but I'm not sure if it's here or in san diego. Cyeh may be able to whip something up too. Alec says he sees deadlocks running an MP kernel on a single-processor machine. Warren
If we think the problem is in xpcom/proxy, we could try a code review, even before dougt sits in front of a fast SMP machine (or whatever). I'm up for it, and jband would probably be willing to help. /be
Brendan, et al. we (jband and I) have already done this. I need to protect my hash tables and proxyCallInfo class. I could just code these fixes and check them in, but I would rather be able to verify (for myself) that the problem does go away when I do this.
This is silly: you know of thread-safety bugs (MP or Uniprocessor, I dunno), there are people in the Mozilla community being bitten by these bugs (including endico@mozilla.org), but you don't wanna code the fixes until you can test 'em yourself? This is not the way of the Mozilla bazaar. Can you hack up fixes to the current revs of the files, and attach cvs diff -u output to this bug, so others can at least help test for ya? Thanks. /be
An even faster way to reproduce this bug is to use mail. Opening a folder with 2k messages took 5 tries because it made mozilla crash.
Attach a patch and i'll be happy to test it. (And let me know what testing needs to be done)
Rebuilt from the top with dougt's changes and still crashed.

#0 0x4019eb33 in nsCOMPtr<nsProxyObject>::assign_with_AddRef (this=0x40767028, rawPtr=0x8665848) at ../../../dist/include/nsCOMPtr.h:759
#1 0x4019ef67 in nsCOMPtr<nsProxyObject>::operator= (this=0x40767028, rhs=0x8665848) at ../../../dist/include/nsCOMPtr.h:515
#2 0x40174b44 in nsProxyObjectCallInfo::nsProxyObjectCallInfo (this=0x40767008, owner=0x8665848, methodInfo=0x816bfe0, methodIndex=3, parameterList=0x40746e38, parameterCount=4, event=0x40766f90) at nsProxyEvent.cpp:70
#3 0x40175b00 in nsProxyObject::Post (this=0x8665848, methodIndex=3, methodInfo=0x816bfe0, params=0xbf5ffadc, interfaceInfo=0x86831c8) at nsProxyEvent.cpp:384
#4 0x40177ff7 in nsProxyEventObject::CallMethod (this=0x8680668, methodIndex=3, info=0x816bfe0, params=0xbf5ffadc) at nsProxyEventObject.cpp:394
#5 0x40183184 in PrepareAndDispatch (self=0x8680668, methodIndex=3, args=0xbf5ffb94) at xptcstubs_unixish_x86.cpp:92
#6 0x4018325e in nsXPTCStubBase::Stub3 (this=0x8680668) at ../../../../../../dist/include/xptcstubsdef.inc:5
#7 0x4061343e in nsSocketTransport::doRead (this=0x868e328, aSelectFlags=1) at nsSocketTransport.cpp:976
#8 0x40612755 in nsSocketTransport::Process (this=0x868e328, aSelectFlags=1) at nsSocketTransport.cpp:512
#9 0x406166d7 in nsSocketTransportService::Run (this=0x40768b88) at nsSocketTransportService.cpp:467
#10 0x40172d05 in nsThread::Main (arg=0x4073b4f8) at nsThread.cpp:83
#11 0x402158fb in _pt_root (arg=0x407469a0) at ptthread.c:157
#12 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
I'm still crashing but things don't seem as fragile as before. I was able to download my mailbox headers twice in a row without crashing. Last time it crashed 4/5 times. Doug's changes seem to have made an improvement.
Probably the extra locks just slowed down the timing of things, shrinking the window of vulnerability. Dawn -- sounds like we should get a debug build/env on your machine so that we can diagnose the problem when it happens. Can you set that up?
I just got a crash with a fresh tree on a dual 350 PII running linux, here's a stack trace.

Program received signal SIGSEGV, Segmentation fault.
0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4, methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158) at nsProxyEvent.cpp:433
433 mDestQueue->PostEvent(event);
(gdb) bt
#0 0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4, methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158) at nsProxyEvent.cpp:433
#1 0x40177ff7 in nsProxyEventObject::CallMethod (this=0x862c7f0, methodIndex=4, info=0x812ac44, params=0xbf5ffa38) at nsProxyEventObject.cpp:394
#2 0x40183184 in PrepareAndDispatch (self=0x862c7f0, methodIndex=4, args=0xbf5ffaf0) at xptcstubs_unixish_x86.cpp:92
#3 0x401832aa in nsXPTCStubBase::Stub4 (this=0x862c7f0) at ../../../../../../dist/include/xptcstubsdef.inc:6
#4 0x4060a4eb in nsSocketTransport::fireStatus (this=0x862c900, aCode=3) at nsSocketTransport.cpp:1903
#5 0x40607860 in nsSocketTransport::Process (this=0x862c900, aSelectFlags=0) at nsSocketTransport.cpp:539
#6 0x4060b0c6 in nsSocketTransportService::ProcessWorkQ (this=0x84f64d0) at nsSocketTransportService.cpp:259
#7 0x4060b794 in nsSocketTransportService::Run (this=0x84f64d0) at nsSocketTransportService.cpp:493
#8 0x40172d05 in nsThread::Main (arg=0x84f6810) at nsThread.cpp:83
#9 0x402158fb in _pt_root (arg=0x85bf110) at ptthread.c:157
#10 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
(gdb) print this
$2 = (nsProxyObject *) 0x860ff28
(gdb) print *this
$3 = {<nsISupports> = {_vptr. = 0x883ce90}, mRefCnt = 140573992, mProxyType = 6, mDestQueue = {mRawPtr = 0x0}, mRealObject = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mLock = 0x882a7b8}

As far as I can tell, "this" was destroyed while one thread was executing this->Post(): there's a check for !mDestQueue at the beginning of nsProxyObject::Post(), so this should not happen...
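The failure mode described above (an object being torn down while another thread is still inside one of its methods) can be illustrated with a small sketch. This is not the XPCOM proxy code: `Proxy`, `ProxySlot` and `postSafely` are hypothetical names, and `std::shared_ptr` stands in for XPCOM reference counting. The idea is only that the calling thread takes its own strong reference before the call, so a concurrent teardown cannot free the object mid-call.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a proxy object; NOT the real nsProxyObject.
class Proxy {
public:
    // Queues an event; returns false once the proxy has been closed.
    bool post(const std::string& event) {
        std::lock_guard<std::mutex> guard(lock_);
        if (!open_) return false;
        queue_.push_back(event);
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> guard(lock_);
        open_ = false;
    }
    std::size_t pending() const {
        std::lock_guard<std::mutex> guard(lock_);
        return queue_.size();
    }
private:
    mutable std::mutex lock_;
    bool open_ = true;
    std::vector<std::string> queue_;
};

// Hypothetical holder shared between threads. One thread may reset() it
// while another is about to call through it.
class ProxySlot {
public:
    explicit ProxySlot(std::shared_ptr<Proxy> p) : proxy_(std::move(p)) {}

    // Copy the strong reference under the lock; the copy keeps the
    // object alive for as long as the caller holds it.
    std::shared_ptr<Proxy> acquire() const {
        std::lock_guard<std::mutex> guard(lock_);
        return proxy_;
    }

    // Drop the slot's reference. The object is destroyed only when the
    // last outstanding reference goes away, never in the middle of a call.
    void reset() {
        std::shared_ptr<Proxy> dying;
        {
            std::lock_guard<std::mutex> guard(lock_);
            dying.swap(proxy_);
        }
    }
private:
    mutable std::mutex lock_;
    std::shared_ptr<Proxy> proxy_;
};

// The caller pins the object before calling into it.
bool postSafely(const ProxySlot& slot, const std::string& event) {
    std::shared_ptr<Proxy> keepAlive = slot.acquire();
    if (!keepAlive) return false;  // already torn down
    return keepAlive->post(event);
}
```

Under these assumptions, a thread that wins the race simply posts to an object that is still alive, and a thread that loses it gets a null reference and a clean failure instead of a segfault.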
Doug, Looking at EventHandler (shouldn't this be static or something?)... http://lxr.mozilla.org/seamonkey/source/xpcom/proxy/src/nsProxyEvent.cpp#460 ...I see that you are holding a per-object lock while invoking XPTC_InvokeByIndex. This seems excessive and/or dangerous. Aren't you then precluding reentrant calls via the proxy on the proxied object? Do you really need to protect more than your shared tables of information about the proxies and the refcount management of the proxies themselves? I think that you should limit the scope of all locks to the bare minimum that is absolutely required, so that you decrease the chance of deadlocks or NSPR assertions on attempts to reenter a non-reentrant lock.
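The suggestion above, holding the lock only around the shared table and not around the invocation itself, might look roughly like the following. This is a generic sketch, not the xpcom/proxy implementation: `Dispatcher`, `add` and `invoke` are hypothetical names, and `std::mutex` stands in for an NSPR lock (both are non-reentrant).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical dispatcher with a shared handler table.
class Dispatcher {
public:
    using Handler = std::function<int(Dispatcher&, int)>;

    void add(const std::string& name, Handler handler) {
        std::lock_guard<std::mutex> guard(lock_);
        table_[name] = std::move(handler);
    }

    // Returns the handler's result, or -1 if no handler is registered.
    int invoke(const std::string& name, int arg) {
        Handler target;
        {
            // Hold the lock only long enough to copy the table entry.
            std::lock_guard<std::mutex> guard(lock_);
            auto it = table_.find(name);
            if (it == table_.end()) return -1;
            target = it->second;
        }
        // The lock is released here, so the handler may call back into
        // the dispatcher without deadlocking on the non-reentrant mutex.
        return target(*this, arg);
    }

private:
    std::mutex lock_;
    std::map<std::string, Handler> table_;
};
```

If `invoke` kept `lock_` held across `target(*this, arg)`, a handler that called `invoke` again would block forever on its own thread, which is the reentrancy hazard the comment describes.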
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
Good catch, both event handlers need to be static. The scope of the locks needs to be reduced. Marking this bug as a dup of 18110. *** This bug has been marked as a duplicate of 18110 ***
On a Linux SMP machine, Mozilla M13 crashes almost immediately. It also crashes while you are doing nothing.
Status: RESOLVED → REOPENED
anssi@bigfoot.com, why was this reopened if it is in fact a duplicate of 18110? Your comments don't argue that it is a separate bug from 18110, so I don't see the point in reopening. Resolving it as a duplicate doesn't mean that the bug it describes, duplicated by an earlier bugzilla report, is fixed -- it just means we know that the newer bug is a dup. /be
Clearing DUPLICATE resolution due to reopen.
closing. see other bug.
Status: REOPENED → RESOLVED
Closed: 25 years ago
I brought my box in to work today and let dougt hack. He thinks this may actually be a duplicate of 24711.
Re-opening because this bug turned out not to be a duplicate of 18110. Marking as dependent on 24711 and removing dependency on 18110.
Status: RESOLVED → REOPENED
Depends on: 24711
No longer depends on: 18110
assigning to http god.
Assignee: dougt → gagan
Status: REOPENED → NEW
Clearing Duplicate resolution due to reopen.
Resolution: DUPLICATE → ---
Moving to m14.
Keywords: beta1, crash
Target Milestone: M13 → M14
Putting dogfood in the keyword field.
Keywords: dogfood
Summary: [Dogfood] Mozilla crashes often on SMP systems. → Mozilla crashes often on SMP systems.
Putting in correct component.
Component: XPCOM → Networking
Why is this considered Networking now? It's purely a proxy problem, isn't it? It could affect anything. And why is this owned by Gagan?
No. This is the problem with having socket transports in the load group. The second onStop() crashes SMP machines.
Changing summary from: Mozilla crashes often on SMP systems. To: crash on SMP systems: socket transport in load group Reassigning to Rick Potts because I think he's working on this now.
Assignee: gagan → rpotts
Summary: Mozilla crashes often on SMP systems. → crash on SMP systems: socket transport in load group
hey doug, are you sure that there is a SocketTransport sitting in a load group? I would have thought that that was not possible... -- rick
gagan and jud are in the know.
This is not Windows only; I've been seeing this on Linux for a while too. Changing OS and Platform...
OS: Windows NT → All
Hardware: PC → All
Status whiteboard says you need an SMP machine. Hasn't dougt's arrived yet? Mozilla is pretty useless for me at home until this bug gets fixed. I could bring the machine in again, but the last time I tried that the motherboard fried.
Hey Rick; I'm seeing these crashes _constantly_ on my home machine. Almost any page I visit will eventually end up in this state. Sometimes it's just visiting the page, sometimes it's when I leave the page, sometimes it's just sitting idle (so to speak). I'll start forwarding stack traces.
Here's an *all-too-typical* stack trace on my SMP/NT box...

nsStreamListenerEvent::~nsStreamListenerEvent() line 77 + 24 bytes
nsOnStopRequestEvent::~nsOnStopRequestEvent() line 258 + 8 bytes
nsOnStopRequestEvent::`scalar deleting destructor'(unsigned int 1) + 15 bytes
nsStreamListenerEvent::DestroyPLEvent(PLEvent * 0x02fe63e0) line 104 + 30 bytes
PL_DestroyEvent(PLEvent * 0x02fe63e0) line 549 + 10 bytes
PL_HandleEvent(PLEvent * 0x02fe63e0) line 536 + 9 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x02382cd0) line 487 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x003e0550, unsigned int 49342, unsigned int 0, long 37235920) line 975 + 9 bytes
USER32! 77e71820()
02382cd0()

I'm certainly willing to drive this machine remotely if someone wants to try to debug this problem.
Line 77 looks like the release of mContext or possibly mChannel, the line above it. Rickg: Can you see if one of these looks like it has already been deleted? Maybe we've got race between an addref on one thread and a release on this one.
For that particular stack trace, it is possible that the crash is happening on the NS_RELEASE(mContext) because mContext has already been deleted! It turns out that mContext is really an nsHTTPChannel. Unfortunately, nsHTTPChannel *does not* have thread-safe implementations of AddRef() and Release()... Since these methods are called on multiple threads (i.e. socket transport and UI) there can be problems :-) I'll check in a fix to make AddRef() and Release() thread-safe and we'll see if things get any better... Are you seeing any other stack traces?
I've just checked in thread-safe AddRef/Release implementations for nsHTTPChannel, nsHTTPResponseListener, nsHTTPRequest and nsHTTPEncodeStream. I suspect that other nsIInputStream implementations (besides nsHTTPEncodeStream) will need thread-safe Addref/Release implementations... In particular the "string stream"
Rick, I've never understood how making addref and release threadsafe really solved things. If one thread might be doing the last release while another is trying to addref, there's obviously some higher-level synchronization needed, isn't there? Or maybe it's just that the thread doing the release shouldn't have been the final release -- but the refcount got tromped somewhere along the way. It still seems like more than the refcount needs to be protected in this case. Warren
One way it can help is that this threadsafety code makes the manipulation of the refcount atomic. If you have one release happening when another addref is going on, then the release *might* set the refcount to a lower number than it should be, ignoring the addref's change; i.e. --refcnt is really (get, decrement, store). If another thread changes the refcount in the middle of that non-atomic set of actions then you can stomp its change. Only later does that get you, when the 'final' release comes when the refcount should really not be zero yet.
Warren, The race you worried about is not a problem. The only time folks should be messing with an object is IF they already have done an addref. There is no chance that a thread is "about to do an addref" on an object unless that thread *has* an outstanding addref ahead of time. Hence there is no risk from some other thread doing a decref (the count is already at least 2, one for each thread handling the object). On some platforms, you can get some guarantees about atomic actions for some class of integers. Waldemar looked into this a LOT for multiprocessor machines, and can probably chime in with potential answers. If the action is not atomic (as pointed out by jband), then there is a big risk of losing either an increment, or a decrement :-(. Adding Waldemar to this thread in case he has suggestions.
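To illustrate the (get, decrement, store) point, here is a small sketch in standard C++. `AtomicRefCount` and `HammerRefCount` are hypothetical names, and `std::atomic` merely stands in for whatever atomic primitive the thread-safe macros rely on. Because each update is one indivisible read-modify-write, balanced addref/release pairs from several threads cannot stomp each other, and the count returns to its starting value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical counter; std::atomic stands in for the platform's atomic
// increment/decrement primitives.
struct AtomicRefCount {
    std::atomic<int> count{1};
    void AddRef()  { count.fetch_add(1); }  // one indivisible read-modify-write
    void Release() { count.fetch_sub(1); }
};

// Several threads each perform balanced AddRef/Release pairs on one counter.
// With atomic updates no change is lost, so the count ends where it started.
inline int HammerRefCount(int threads, int pairsPerThread) {
    assert(threads > 0 && pairsPerThread >= 0);
    AtomicRefCount rc;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&rc, pairsPerThread] {
            for (int i = 0; i < pairsPerThread; ++i) {
                rc.AddRef();
                rc.Release();
            }
        });
    }
    for (auto& th : pool) th.join();
    return rc.count.load();
}
```

With a plain `int` and `++`/`--` the same loop would be a data race, and the final value could drift away from 1 in either direction.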
My point is that if 2 threads are manipulating the same channel, then the channel better be protecting the state for other operations, not just addref/release.
The issue that I've seen in the past with non-threadsafe Addref/Release is that the refcount can prematurely go to zero. For example, if an object has a reference count of two and two threads call Release() simultaneously, there is a chance that the --mRefCount will be executed on each thread *before* either one checks for 0. In this case, both threads will see (mRefCount == 0) and delete the object. This double deletion was the whole reason that I added the NS_IMPL_THREADSAFE macros to nsISupportsUtils.h
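The double-deletion scenario can be avoided by deciding on the value returned by the atomic decrement itself, never on a second read of the counter. The sketch below uses standard C++; `Counted` and `ReleaseFromTwoThreads` are hypothetical names, and this is not the NS_IMPL_THREADSAFE macro expansion.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical ref-counted object. Release() decides whether to delete from
// the value returned by the atomic decrement, never from a second read.
class Counted {
public:
    explicit Counted(std::atomic<int>* destroyed) : mDestroyed(destroyed) {}
    void AddRef() {
        assert(mRefCnt.load() > 0);  // caller must already hold a reference
        mRefCnt.fetch_add(1);
    }
    void Release() {
        // fetch_sub returns the previous value, so exactly one caller sees 1.
        if (mRefCnt.fetch_sub(1) == 1) {
            delete this;
        }
    }

private:
    ~Counted() { mDestroyed->fetch_add(1); }
    std::atomic<int> mRefCnt{1};
    std::atomic<int>* mDestroyed;
};

// Two threads drop the last two references at the same time; returns how
// many times the destructor ran.
inline int ReleaseFromTwoThreads() {
    std::atomic<int> destroyed{0};
    Counted* obj = new Counted(&destroyed);  // refcount 1
    obj->AddRef();                           // refcount 2, one per thread
    std::thread a([obj] { obj->Release(); });
    std::thread b([obj] { obj->Release(); });
    a.join();
    b.join();
    return destroyed.load();
}
```

In the broken variant described above, both threads decrement first and then each reads zero, so both run the destructor.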
Ok. What other channel implementations need this same fix?
I'm not seeing crashes at home, but as of a day or two ago, I can no longer load any remote pages on my machine at home (SMP machine).
warren, I think that we should examine the File Transport as well... Basically, any pointer that is Addref/Released on another thread requires thread-safe ISupports implementations... Typically, these are the internal nsIStreamListener implementations and the streams... I was thinking of adding some assertions to the non-threadsafe AddRef/Release macros which assert if they are ever called on multiple threads... Do you think this would be useful ? I used to have some debug macros, along the lines of NS_ENSURE_THREADSAFE(...) which could be used to verify that method arguments were threadsafe, but they required using an NS_IMPL_THREADSAFE_QI macro... Troy whined endlessly about that so I removed it :-( However, I could make the checking completely transparent if I added an 'owning thread' pointer as a data member in NS_DECL_ISUPPORTS (for debug only)
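The 'owning thread' idea might look roughly like the following sketch in standard C++. `OwnedByThread` and `CountForeignAddRefs` are hypothetical names, and the sketch counts off-thread calls instead of asserting so that its behavior can be observed; a debug build would presumably assert at that point.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical debug aid: remembers the creating thread and counts AddRef
// calls that arrive from any other thread.
class OwnedByThread {
public:
    OwnedByThread() : mOwner(std::this_thread::get_id()) {}
    void AddRef() {
        if (std::this_thread::get_id() != mOwner) {
            mForeignCalls.fetch_add(1);  // non-threadsafe object used off-thread
        }
        ++mRefCnt;  // deliberately non-atomic, like a non-threadsafe AddRef
    }
    void Release() {
        assert(mRefCnt > 0);
        --mRefCnt;
    }
    int ForeignCalls() const { return mForeignCalls.load(); }

private:
    std::thread::id mOwner;
    int mRefCnt = 0;
    std::atomic<int> mForeignCalls{0};
};

inline int CountForeignAddRefs() {
    OwnedByThread obj;
    obj.AddRef();                                 // owning thread: not counted
    std::thread other([&obj] { obj.AddRef(); });  // other thread: counted
    other.join();
    return obj.ForeignCalls();
}
```

The two `AddRef` calls here never overlap (the second thread is started after the first call and joined before the result is read), so the non-atomic counter is safe in this demo even though the check fires.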
I'm nominating bug #24642 and #26686 as dups of this bug. What do people think?
Need to fix by 03/03 for beta1 train.
QA Contact: leger → tever
Whiteboard: [PDT+] need SMP machine → [PDT+] Must fix by 03/03 need SMP machine
*** Bug 24642 has been marked as a duplicate of this bug. ***
Rick's comment about two threads doing simultaneous decrefs, and then both thinking it was their job to do the delete (because they checked non-atomically for a zero after the decref), is really scary :-(. Do we have this problem with many classes of objects, or is there a small set that generally faces this evil handling on multiple threads?
...another question... if this is a problem on SMP, why are we not hitting it on a single processor machine? Considering that task switching between threads is pre-emptive, I'd expect a similar amount of risk of a conflict. What am I missing? Is there a way to mark an executable to NOT use more than one processor?? Would that give us a wimpy work-around for now???
hey jim, I think that we *are* seeing this problem on single processor machines. Take a look at bug #26686 and bug #24642. They both have very similar stack traces... I think that we are seeing it *more* on SMP boxes because we get more concurrency... But the problem still exists on single processors...
Damn.... this is sounding more and more scary. I need to look at how other systems deal with this while doing ref-counting. Ugh... this looks hard (but at least that makes it interesting!!!! :-) ).
Whiteboard: [PDT+] Must fix by 03/03 need SMP machine → [PDT+] w/b minus on 03/03- need SMP machine
After some analysis, I've identified the following classes as being un-threadsafe in their usage of Addref/Release. This analysis was *only* for bringing up the browser - there are definitely more in FTP and IMAP :-( For each of these classes, at least one instance is created on one thread and then Addref/Released by another.

nsThread
nsLocalFileSystem
nsFileTransport
nsLocalFile
nsGenericModule
nsFileTransportService
nsProxyObject
nsInterfaceInfo
nsMIMEService
nsMIMEInfoImpl
nsBasicStringImpl
nsDNSService
nsIOService
nsEventQueueImpl
nsSupportsArray
AtomImpl
nsGenericFactory

Each of these classes needs to be analyzed to determine the extent of un-threadsafe behavior beyond Addref() and Release()!!
Should we file separate bugs on each of these? Are you going to change the above to use the thread-safe version of addref/release?
Does anyone have a proposed patch to fix this? Maybe some changes to the addref/release macros for everything? I don't crash here at home... I just can't load pages. I can run mozilla remotely to my xserver at work if anyone wants me to test this out.
The fix for Addref/Release is trivial. You simply need to use the NS_IMPL_THREADSAFE_ADDREF(...) and NS_IMPL_THREADSAFE_RELEASE(...) macros. The bigger question is: if Addref/Release are being accessed on multiple threads, what other members are also accessed, and not threadsafe? I think that as we migrate these classes to use the THREADSAFE macros, we must *also* do a careful analysis to determine the overall threadsafety (and thread exposure) of each class...
Pavlov, FYI: I'm running Linux at home on a dual 350MHz PII, I've never had problems with loading remote pages (over a modem line) in mozilla (I update and test almost daily), and mozilla hasn't crashed in a while either...
Rick: I'd eventually like to get your assertions for this into the tree too so that the problem doesn't come up in the future (after we've analyzed and fixed all these). Good work figuring out how to spot this. Pavlov: What do you say we build us an SMP box out of our Dell 210s? I want to make sure somebody has a machine in house that will exhibit these problems. Don/Peter/dp: Do any of you have a spare Dell 210 that you can give up for a while to make a multiprocessor out of? That would let me keep mine for development. Thanks.
Damn, I shouldn't have said that! Now I'm seeing a crash again, and I was able to get a stack trace. The stack trace is different from all the other ones in this bug, but I still think it belongs here.

#0 0x4059b090 in main_arena () from /lib/libc.so.6
#1 0x40042f7e in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (this=0x89d8150, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
#2 0x41c183ac in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x89d8150, rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:787
#3 0x41c18de3 in nsCOMPtr<nsIChannel>::operator= (this=0x89d8150, rhs=0x0) at ../../../../dist/include/nsCOMPtr.h:526
#4 0x41c05ac0 in nsHTTPRequest::~nsHTTPRequest (this=0x89d8138, __in_chrg=3) at nsHTTPRequest.cpp:146
#5 0x41c05c25 in nsHTTPRequest::Release (this=0x89d8138) at nsHTTPRequest.cpp:154
#6 0x41bfe9b5 in nsHTTPChannel::~nsHTTPChannel (this=0x81c5660, __in_chrg=3) at nsHTTPChannel.cpp:127
#7 0x41bfec33 in nsHTTPChannel::Release (this=0x81c5660) at nsHTTPChannel.cpp:142
#8 0x40043a74 in nsCOMPtr<nsIChannel>::~nsCOMPtr (this=0xbffff2c4, __in_chrg=2) at ../../dist/include/nsCOMPtr.h:434
#9 0x40e4fcc7 in nsDocLoaderImpl::DocLoaderIsEmpty (this=0x85c3918, aStatus=0) at nsDocLoader.cpp:495
#10 0x40e4fb18 in nsDocLoaderImpl::OnStopRequest (this=0x85c3918, aChannel=0x8c87db8, aCtxt=0x0, aStatus=0, aMsg=0x0) at nsDocLoader.cpp:437
#11 0x40706b52 in nsLoadGroup::RemoveChannel (this=0x85c3970, channel=0x8c87db8, ctxt=0x0, status=0, errorMsg=0x0) at nsLoadGroup.cpp:535
#12 0x407405bb in nsFileChannel::OnStopRequest (this=0x8c87db8, transportChannel=0x8c87ec8, context=0x0, aStatus=0, aMsg=0x0) at nsFileChannel.cpp:450
#13 0x406efb0d in nsOnStopRequestEvent::HandleEvent (this=0x408ea618) at nsAsyncStreamListener.cpp:282
#14 0x406ef1e7 in nsStreamListenerEvent::HandlePLEvent (aEvent=0x41dd6560) at nsAsyncStreamListener.cpp:97
(More stack frames follow...)
Here's what it crashed on:

#1 0x40042f7e in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (this=0x89d8150, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
416 NSCAP_RELEASE(oldPtr);
(gdb) print oldPtr
$6 = (nsIChannel *) 0x88346f4
(gdb) print *oldPtr
$7 = {<nsIRequest> = {<nsISupports> = { _vptr. = 0x4059b088}, <No data fields>}, <No data fields>}

Still no problems loading remote pages tho...
I've got a 210. It isn't spare, but I could loan it out for a short time, especially over the weekend.
For what it is worth, the xpcom log might help. It is enabled for release builds too. Here is how you get it:

set env NSPR_LOG_MODULES=nsComponentManager:5
set env NSPR_LOG_FILE=xpcom.log
mozilla

Now you should have an xpcom.log. There is a sufficiently large chance that we might be able to tell what is happening from the log.
It looks like the last stack trace is slightly different... In this case, the last URL of the document has finished and the LoadGroup is releasing its reference to the "document channel" (which is a nsHTTPChannel). The nsHTTPChannel (this=0x81c5660) releases its nsHTTPRequest (this=0x89d8138), which in turn releases its reference to the nsSocketTransport (0x88346f4) - which is an nsIChannel. Unfortunately, the nsSocketTransport instance has already been deleted :-(
That class of problem (release on an already deleted object) is exactly the sort of thing that would be expected from the problem you isolated. When the ref count on the object is down-counted to zero, and the hit to zero is felt by *two* threads, then *both* threads will delete and clean up that object. When both threads start to "clean up," some related objects will be deleted on one thread, and then later the other thread will come along to "clean up" and do additional releases on a collected object. This all seems to fit... or am I missing something??
The good news is that I no longer crash when sitting there viewing slashdot.org or the test case. It appears that animated gifs are now cached instead of being downloaded over and over. The bad news is that I still get random crashes with the same stack traces. I'll try gagan's xpcom logging suggestion.
holy cow! I ran mozilla for 15 or so minutes with NSPR_LOG_MODULES and NSPR_LOG_FILE set. I generated a 56mb log file filled with millions of these 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) These spewed out at the rate of 1 or 2 per second even just sitting at http://www.mozilla.org/ Eventually after reloading my mailbox and loading some other pages it crashed. #0 0x40800149 in ?? () #1 0x41c373ac in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x420a8ac0, rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:787 #2 0x41c37de3 in nsCOMPtr<nsIChannel>::operator= (this=0x420a8ac0, rhs=0x0) at ../../../../dist/include/nsCOMPtr.h:526 #3 0x41c24ac0 in nsHTTPRequest::~nsHTTPRequest (this=0x420a8aa8, __in_chrg=3) at nsHTTPRequest.cpp:146 #4 0x41c24c25 in nsHTTPRequest::Release (this=0x420a8aa8) at nsHTTPRequest.cpp:154 #5 0x41c1d9b5 in nsHTTPChannel::~nsHTTPChannel (this=0x42029b80, __in_chrg=3) at nsHTTPChannel.cpp:127 #6 0x41c1dc33 in nsHTTPChannel::Release (this=0x42029b80) at nsHTTPChannel.cpp:142 #7 0x406e40ee in nsStreamListenerEvent::~nsStreamListenerEvent ( this=0x82efb48, __in_chrg=3) at nsAsyncStreamListener.cpp:81 #8 0x406e4a01 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x82efb48, __in_chrg=3) at nsAsyncStreamListener.cpp:261 #9 0x406e421f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84689c8) at nsAsyncStreamListener.cpp:108 #10 0x40189c5b in PL_DestroyEvent (self=0x84689c8) at plevent.c:549 #11 0x40189bf9 in PL_HandleEvent (self=0x84689c8) at plevent.c:536 #12 0x40189abc in PL_ProcessPendingEvents (self=0x812cf78) at plevent.c:487 #13 0x4018b5fc in nsEventQueueImpl::ProcessPendingEvents (this=0x812cf50) at nsEventQueue.cpp:298 #14 0x40935a64 in event_processor_callback (data=0x812cf50, source=9, condition=GDK_INPUT_READ) at nsAppShell.cpp:141 #15 0x409356ef in our_gdk_io_invoke 
(source=0x4159f368, condition=G_IO_IN, data=0x415b2988) at nsAppShell.cpp:54 #16 0x407cc52a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0 #17 0x407cdbe6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0 #18 0x407ce1a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0 #19 0x407ce341 in g_main_run () from /usr/lib/libglib-1.2.so.0 #20 0x40a12209 in gtk_main () from /usr/lib/libgtk-1.2.so.0 #21 0x40936067 in nsAppShell::Run (this=0x40812e38) at nsAppShell.cpp:304 #22 0x4064eaad in ?? () from /home/endico/mozilla/mozilla/dist/bin/components/libnsappshell.so #23 0x804e60e in main1 (argc=1, argv=0xbffff9e4, splashScreen=0x0) at nsAppRunner.cpp:763 #24 0x804eba0 in main (argc=1, argv=0xbffff9e4) at nsAppRunner.cpp:883 from the end of xpcom.log: 1024[8058968]: found rel:libnecko.so as 807ac80 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: ProgIDToClassID(component://netscape/image/decoder&type=image/gif)->{0d471b70-baf5-11d2-802c-0060088f91a3} 1024[8058968]: nsComponentManager: FindFactory({0d471b70-baf5-11d2-802c-0060088f91a3}) 1024[8058968]: found rel:libnsgif.so as 8085720 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({6049b261-c1e6-11d1-a827-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b1d8 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 
1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9}) 1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: ProgIDToClassID(component://netscape/network/protocol?name=http)->{52a30880-dd95-11d3-a1a7-0050041caf44} 1024[8058968]: nsComponentManager: FindFactory({90012125-1616-4fa1-ae14-4e7fa5766eb6}) 1024[8058968]: found rel:libnecko.so as 807b070 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: FindFactory({de9472d0-8034-11d3-9399-00104ba0fd40}) 1024[8058968]: found rel:libnecko.so as 807a890 in factory cache. 1024[8058968]: nsComponentManager: FindFactory({dbf72351-4fd8-46f0-9dbc-fa5ba60a305c}) 1024[8058968]: found rel:libnecko.so as 807afc8 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: Factory CreateInstance() succeeded. 
1024[8058968]: nsComponentManager: ProgIDToClassID(component://netscape/scriptsecuritymanager)->{7ee2a4c0-4b93-17d3-ba18-0060b0f199a2} 1024[8058968]: nsComponentManager: ProgIDToClassID(component://netscape/network/protocol?name=http)->{52a30880-dd95-11d3-a1a7-0050041caf44} 1024[8058968]: nsComponentManager: ProgIDToClassID(component://netscape/network/cache?name=manager)->{2030f0b0-9567-11d3-90d3-0040056a906e} 1024[8058968]: nsComponentManager: FindFactory({60047bb2-91c0-11d3-8cd9-0060b0fc14a3}) 1024[8058968]: found rel:libnecko.so as 807ac80 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded. 1024[8058968]: nsComponentManager: ProgIDToClassID(component://netscape/image/decoder&type=image/gif)->{0d471b70-baf5-11d2-802c-0060088f91a3} 1024[8058968]: nsComponentManager: FindFactory({0d471b70-baf5-11d2-802c-0060088f91a3}) 1024[8058968]: found rel:libnsgif.so as 8085720 in factory cache. 1024[8058968]: Factory CreateInstance() succeeded.
Dawn: Try setting nsSocketTransport:5 instead of nsComponentManager:5. I think that would be more helpful. Still working on an SMP machine for Rick. Pavlov agreed to pool his machine with mine... if I could only find him!
now that animated gifs don't constantly reload my new test case is browser buster. it broke for me at about the 3rd url. Here's a new stack and the last part of xpcom.log. I have 150K or so of log file with random.yahoo.com and esta.org messages if anyone is interested. using nsComponentManager:5. 0 0x0 in ?? () #1 0x406e40ee in nsStreamListenerEvent::~nsStreamListenerEvent ( this=0x8793948, __in_chrg=3) at nsAsyncStreamListener.cpp:81 #2 0x406e4a01 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x8793948, __in_chrg=3) at nsAsyncStreamListener.cpp:261 #3 0x406e421f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x8788988) at nsAsyncStreamListener.cpp:108 #4 0x40189c5b in PL_DestroyEvent (self=0x8788988) at plevent.c:549 #5 0x40189bf9 in PL_HandleEvent (self=0x8788988) at plevent.c:536 #6 0x40189abc in PL_ProcessPendingEvents (self=0x812b798) at plevent.c:487 #7 0x4018b5fc in nsEventQueueImpl::ProcessPendingEvents (this=0x812b770) at nsEventQueue.cpp:298 #8 0x40935a64 in event_processor_callback (data=0x812b770, source=9, condition=GDK_INPUT_READ) at nsAppShell.cpp:141 #9 0x409356ef in our_gdk_io_invoke (source=0x8338070, condition=G_IO_IN, data=0x81c6f08) at nsAppShell.cpp:54 #10 0x407cc52a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0 #11 0x407cdbe6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0 #12 0x407ce1a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0 #13 0x407ce341 in g_main_run () from /usr/lib/libglib-1.2.so.0 #14 0x40a12209 in ?? () from /usr/lib/libgtk-1.2.so.0 #15 0x40936067 in nsAppShell::Run (this=0x8130de8) at nsAppShell.cpp:304 #16 0x4064eaad in nsAppShellService::Run (this=0x812b570) at nsAppShellService.cpp:399 #17 0x804e60e in main1 (argc=1, argv=0xbffff9e4, splashScreen=0x0) at nsAppRunner.cpp:763 #18 0x804eba0 in main (argc=1, argv=0xbffff9e4) at nsAppRunner.cpp:883 1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. mStatus = 80470007. 
CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. aSelectFlags = 1. CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80 41ccdd20]. aSelectFlags = 1. 1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 239. Bytes read =239 1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 2048. Bytes read =261 1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 80470007. Buffer space = 1787. Bytes read =0 1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20]. nsIPipe=408fe088 Count=500 1026[812d2c8]: WriteSegments [fd=40805220]. rv = 0. Bytes read =500 1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80 41ccdd20]. rv = 80470007. Total bytes read: 500 1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. mStatus = 80470007. CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. aSelectFlags = 1. CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80 41ccdd20]. aSelectFlags = 1. 1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 1787. Bytes read =500 1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 80470007. Buffer space = 1287. Bytes read =0 1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20]. nsIPipe=408fe088 Count=500 1026[812d2c8]: WriteSegments [fd=40805220]. rv = 0. Bytes read =500 1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80 41ccdd20]. rv = 80470007. Total bytes read: 500 1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. mStatus = 80470007. CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. aSelectFlags = 1. CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80 41ccdd20]. aSelectFlags = 1. 
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 1287. Bytes read =46 1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 1241. Bytes read =0 1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20]. nsIPipe=408fe088 Count=46 1026[812d2c8]: WriteSegments [fd=40805220]. rv = 0. Bytes read =46 1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80 41ccdd20]. rv = 80470007. Total bytes read: 46 1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. mStatus = 80470007. CurrentState = 5 1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. aSelectFlags = 20. CurrentState = 5 1026[812d2c8]: Operation failed via PR_POLL_HUP. [www.esta.org:80 41ccdd20]. 1026[812d2c8]: Transport [www.esta.org:80 41ccdd20] is in error state. 1026[812d2c8]: Transport [www.esta.org:80 41ccdd20] is in done state. 1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80 41ccdd20]. mStatus = 0. CurrentState = 3 1024[8058968]: Deleting nsSocketTransport [komodo.mozilla.org:80 877f0a8]. 1024[8058968]: Deleting nsSocketTransport [random.yahoo.com:80 408c83e8]. 1024[8058968]: Deleting nsSocketTransport [www.esta.org:80 41ccdd20].
hey dawn, This last bit of logging info is starting to look useful :-) can you try it again with NSPR_LOG_MODULES=nsHTTPProtocol:5,nsSocketTransport:5 This will give info about how/when the HTTP objects are destroyed too. Thanks, -- rick
*** Bug 26686 has been marked as a duplicate of this bug. ***
So, I've been trying to reproduce these crashes most of the night on a 2 processor NT machine without any luck :-( I'll try Linux tomorrow... Is anyone else still seeing these crashes on SMP NT boxes? Or is Linux the only platform now?
I did what rick asked and mailed him another stack trace and log file rather than pasting it all here. Here's the end of the log.

1024[8058968]: Canceling nsSocketTransport [dspace.dial.pipex.com:80 42564550]. rv = 0
1024[8058968]: Canceling nsSocketTransport [dspace.dial.pipex.com:80 42578428]. rv = 0
1024[8058968]: Deleting nsHTTPChannel [this=8164f10].
1024[8058968]: Deleting nsHTTPRequest [this=814c7b0].
1024[8058968]: Deleting nsSocketTransport [komodo.mozilla.org:80 42093698].
1024[8058968]: Deleting nsHTTPChannel [this=4201e630].
1024[8058968]: Deleting nsHTTPRequest [this=41ec6930].
I would be happy to do any testing on Linux that is needed. I saw some ideas of how to get the appropriate info earlier in this bug. I did notice that the mozilla nightly from last night/this morning was very unstable compared to 48 hours ago on linux/smp.
rpotts@netscape.com asked if anyone still was seeing this on NT: yes. I've sent the full dump directly. This was on "latest nightly": 2000022908 on a dual PII 450 running NT. I had been browsing for about an hour or so, /., UF, mozilla.org, oreily.com, nothing serious, when it crashed... took most of NT with it... I had to log out and kill most of my active processes in order to get realtime control back... there's a line in the stack trace for the active thread that might explain that.... (dnetc was running in the background; ending that from a command line helped, but didn't restore full usability. Whatever the crash did, it resulted in normal processes only getting time (even to repaint) when dnetc was IO bound to disk - that is NOT the normal behaviour of dnetc, it is usually very well behaved. After it shut down it only took a minute for the start menu to appear, and another minute for the shutdown menu option to select.... Before that it took ten minutes to get the start->run dialog.) Here's the top of the active thread:

jsdom!nsGetInterface::operator=
gkhtml!NS_NewEventListenerManager
gkhtml!NS_NewPresShell
gkview!nsCreateInstanceByProgID::operator=
gkview!nsCreateInstanceByProgID::operator=
[...]
mozilla!nsGetInterface::operator=
kernel32!GetProcessPriorityBoost
mozilla!<nosymbols>
adding link to bug 25910 which most likely is a duplicate
I'll have to take this over now that Rick has gone on sabbatical, but in some sense it's probably Dougt's bug. Status: We worked on this all day yesterday on Dawn's machine and saw numerous crashes. For necko they were often in using the proxy code to post OnStatus and OnProgress notifications back to the mozilla thread. However, we also saw problems where the gfx toolkit would go away and others, so solving just the necko issue won't make us completely stable on MP machines. Possible solutions: (a) don't deliver status/progress at all (disable them in the socket transport and just rel-note it) (b) don't use the proxy code to deliver status/progress (implement the event delivery/thread-switch by hand), (c) get Doug to track down what's going on with proxies. Last night we augmented the TestSocketTransport test program to receive status/progress notifications so that it might also exhibit this problem, and left it running on the machine but didn't see the same failure by the time we went home. :-(
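For option (b), delivering status/progress with a hand-rolled thread switch, a rough sketch in standard C++ follows. `EventQueue`, `Post`, `ProcessPending`, and `DeliverProgress` are hypothetical names; this is not the PLEvent or proxy API. Any thread may post, but only the owning thread drains the queue, so listener code only ever runs on the owning thread.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical event queue: any thread may Post(), only the owning thread
// calls ProcessPending(), so listeners always run on the owning thread.
class EventQueue {
public:
    void Post(std::function<void()> event) {
        std::lock_guard<std::mutex> guard(mLock);
        mPending.push_back(std::move(event));
    }
    // Swap the pending list out under the lock, then run the events unlocked.
    int ProcessPending() {
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> guard(mLock);
            batch.swap(mPending);
        }
        for (auto& event : batch) event();
        return static_cast<int>(batch.size());
    }

private:
    std::mutex mLock;
    std::deque<std::function<void()>> mPending;
};

// A "socket" thread posts progress values; the calling thread delivers them.
inline std::vector<int> DeliverProgress() {
    EventQueue queue;
    std::vector<int> seen;  // touched only on the calling thread
    std::thread socketThread([&queue, &seen] {
        for (int bytes : {100, 200, 300}) {
            queue.Post([&seen, bytes] { seen.push_back(bytes); });
        }
    });
    socketThread.join();
    int handled = queue.ProcessPending();
    assert(handled == 3);
    return seen;
}
```

Any reference-counted object captured by a posted event would still need a thread-safe AddRef/Release, since the event is created on one thread and destroyed on another.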
Assignee: rpotts → warren
Found it! NS_MT_SUPPORTED was not defined for Linux (!) and a bunch of classes weren't thread safe. See news://news.mozilla.org/38BF7E94.3CA715DA%40netscape.com for details.
Whiteboard: [PDT+] w/b minus on 03/03- need SMP machine → [PDT+] w/b minus on 03/03 [have fixes!]
The landing is in progress, so I'm extending this to w/b minus on 3/7
Whiteboard: [PDT+] w/b minus on 03/03 [have fixes!] → [PDT+] w/b minus on 3/7 [have fixes!]
Here's the list of classes I'm having to make threadsafe: AtomImpl BasicStringImpl CacheOutputStream InterceptStreamListener MemCacheWriteStreamWrapper TestConnection nsAppShellService nsCacheEntryChannel nsCharsetConverterManager nsConverterFactory nsDNSService nsDateTimeFormatWin nsDocShell nsDocumentOpenInfo nsEventQueueImpl nsEventQueueServiceImpl nsFTPDirListingConv nsFileSpecImpl nsFileTransport nsFileTransportService nsGenericFactory nsGenericModule nsHTTPIndexParser nsIOService nsImapFlagAndUidState nsImapMailCopyState nsImapMockChannel nsInputStreamChannel nsInputStreamFileSystem nsInterfaceInfoManager nsLocalFile nsLocalFileSystem nsLocale nsLocaleService nsMIMEInfoImpl nsMIMEService nsMemCacheChannel nsMemCacheRecord nsMsgAccountManager nsMsgIncomingServer nsMsgMailNewsUrl nsMsgStatusFeedback nsMsgWindow nsObserverService nsPref nsPrefMigration nsProxyEventClass nsProxyEventObject nsProxyObjectManager nsRDFResource nsRunner nsSocketTransport nsSocketTransportService nsStdURLParser nsStorageStream nsStreamConverterService nsSupportsArray nsThread nsThreadPool nsWalletlibService
By what evidence are you basing the need to make the imap classes thread-safe? (by which I assume you mean adding threadsafe add and release refs) Inspection, or actual evidence of CONCURRENT access to add and release ref from multiple threads? The imap code uses BLOCKING proxy calls between threads so that while one thread may be manipulating the ref count, the other thread is blocked.
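The blocking-proxy argument can be pictured with the sketch below, in standard C++. `CallBlocking` and `BlockingRefCountDemo` are hypothetical names, and this does not reproduce the real proxy code, which would need to be studied to confirm that it gives the same guarantee. The caller does not continue until the work on the other thread has finished, so the two threads never touch the counter at the same time.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Hypothetical blocking proxy: runs `work` on another thread and does not
// return until it has finished, so caller and callee never run concurrently.
template <typename Fn>
auto CallBlocking(Fn work) -> decltype(work()) {
    std::packaged_task<decltype(work())()> task(std::move(work));
    auto result = task.get_future();
    std::thread target(std::move(task));
    auto value = result.get();  // the caller is blocked here while `work` runs
    target.join();
    return value;
}

inline int BlockingRefCountDemo() {
    int refCnt = 1;  // plain int: safe only because accesses never overlap
    int seen = CallBlocking([&refCnt] { return ++refCnt; });  // "AddRef" on the other thread
    assert(seen == 2);
    --refCnt;        // "Release" back on the calling thread
    return refCnt;
}
```

The guarantee disappears as soon as any call on the object is made asynchronously, which is one reason making AddRef/Release atomic anyway is a cheap safety margin.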
These changes went in moments ago, along with Andreas' changes. David: These classes were determined experimentally. I hadn't thought about the case where only synchronous proxy code was used, and consequently making AddRef/Release threadsafe _shouldn't_ be necessary (I'd have to really study the proxy code to determine whether that's really true), but I think making these classes threadsafe is mostly harmless -- just a little more overhead in the AddRef/Release which will hopefully be insignificant. Let's see if anything shows up during profiling.
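To make the "threadsafe AddRef/Release" discussion concrete, here is a minimal sketch of what the change amounts to. It is an assumption-laden illustration, not the code that landed: RefCounted and HammerRefCount are hypothetical names, and std::atomic is used as a modern stand-in for the NSPR atomic helpers (PR_AtomicIncrement / PR_AtomicDecrement) that would have been used at the time.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical reference-counted class with an atomic count.
class RefCounted {
public:
    // fetch_add returns the previous value, so add 1 to report the new count.
    unsigned long AddRef() {
        return mRefCnt.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    // The decrement is atomic, so exactly one caller observes the
    // transition to zero and deletes the object.
    unsigned long Release() {
        unsigned long count =
            mRefCnt.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (count == 0) {
            delete this;
        }
        return count;
    }

private:
    ~RefCounted() = default;  // destroyed only via Release()
    std::atomic<unsigned long> mRefCnt{0};
};

// Demo: several threads AddRef/Release concurrently while the caller
// holds one reference. Returns the count after the final Release.
unsigned long HammerRefCount(int threads, int iterations) {
    RefCounted* obj = new RefCounted();
    obj->AddRef();  // the owner's reference
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([obj, iterations] {
            for (int i = 0; i < iterations; ++i) {
                obj->AddRef();
                obj->Release();
            }
        });
    }
    for (std::thread& th : pool) {
        th.join();
    }
    return obj->Release();  // drops the owner's reference
}
```

With a plain (non-atomic) counter, concurrent increments and decrements can be lost, so the count may reach zero too early or never at all. With strictly blocking proxy calls, as David describes for imap, that concurrency would not arise, and the atomic version only costs the extra overhead.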
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Warren, I was playing around on my machine today in the tree you were working on and found lots of other thread safety assertions and crashes in the mail account wizard and while loading my inbox. Do you need that tree any more or is it safe to update to the tip? I don't want to blow away your changes but I don't want to report the crashes if they are unique to my tree.
You can update to the tip. Tons of other fixes went in after that. It would be great if you could verify that the thread safety asserts you mentioned have gone away now. If not, you can send them to me, or file new bugs. Thanks.
Dawn, could you help once again in verifying this bug. I have been told that you were able to reproduce this. Thanks.
Oops, I did this the other day and mailed warren but forgot to comment in the bug. After I updated from the tip things worked great. I got no assertions and didn't crash after several hours. Marking verified.
Status: RESOLVED → VERIFIED
I'm running on a Quad Sun UE450 (Solaris 2.6) and have been experiencing quite a lot of instability.. if I run the exact same code on an UP machine with the exact same OS etc it's almost perfectly stable. I bet the quad will trigger smp bugs more than a dual... I'm running current CVS (tip) with gtk/glib 1.2.6, compiled with gcc 2.95.2 (-O -msupersparc).

Here is a stacktrace from searching for 'Mozilla' in the search sidebar and waiting a few seconds (repeatable sometimes 8):

#0 0xef1d66b8 in pthread_mutex_lock () from /usr/lib/libthread.so.1
#1 0xef5614c8 in PR_Lock () from /scratch/mozilla/mozilla/dist/bin/./libnspr4.so
#2 0xedefff6c in nsSocketTransport::Process () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0xedf029f4 in nsSocketTransportService::ProcessWorkQ () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#4 0xedf02f30 in nsSocketTransportService::Run () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#5 0xef68da64 in nsThread::Main () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6 0xef566208 in _pt_root () from /scratch/mozilla/mozilla/dist/bin/./libnspr4.so

Loading a page with a bunch of images resulted in:

#0 0xedefd850 in nsStreamListenerEvent::~nsStreamListenerEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1 0xedefdef4 in nsOnStopRequestEvent::~nsOnStopRequestEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2 0xedefd908 in nsStreamListenerEvent::DestroyPLEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0xef68b698 in PL_DestroyEvent () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4 0xef68b674 in PL_HandleEvent () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5 0xef68b584 in PL_ProcessPendingEvents () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6 0xef68c328 in nsEventQueueImpl::ProcessPendingEvents () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7 0xee630a74 in event_processor_callback () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8 0xee630794 in our_gdk_io_invoke () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9 0xee251d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee252444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee252634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee429814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xee630f78 in nsAppShell::Run () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xee6cfa30 in nsAppShellService::Run () from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()
stric: Is this the latest build? Debug or optimized? We're still finding thread-safety assertions that we're tracking down, so we know this isn't 100% fixed yet, but we closed this bug because we know that the assertions will help us resolve them over time. I'm wondering if you've seen any assertions, and/or whether you think we should reopen this bug.
Note that crashing on the tip build this past weekend (or today) is no big deal. There is a lot of instability at this moment. Do you crash when you pull last Friday's evening build? Try picking that up from Mozilla. That was when we branched for beta, but before the giant landings began. If you are building your own binary, you should try to induce this bug using the Netscape beta1 branch. That would be the interesting (sad? surprising?) test. Thanks, Jim
I hate to be a broken record, but the asserts only catch lack of thread safety on addref and release - there could be all sorts of other thread-safety issues.
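A small sketch of the limitation described above, under stated assumptions: OwnedByOneThread and ViolationsAfter are hypothetical names, and the check counts violations instead of asserting so the effect can be observed. It is not the actual Mozilla assertion code.

```cpp
#include <cassert>
#include <thread>

// Hypothetical class whose AddRef/Release check the calling thread.
class OwnedByOneThread {
public:
    OwnedByOneThread() : mOwner(std::this_thread::get_id()) {}

    void AddRef()  { CheckOwner(); ++mRefCnt; }
    void Release() { CheckOwner(); --mRefCnt; }

    // Not instrumented: a call from the wrong thread goes unnoticed.
    void SetValue(int v) { mValue = v; }

    int Violations() const { return mViolations; }

private:
    // Records a violation when called off the owning thread.
    void CheckOwner() {
        if (std::this_thread::get_id() != mOwner) {
            ++mViolations;
        }
    }

    std::thread::id mOwner;
    int mRefCnt = 0;
    int mValue = 0;
    int mViolations = 0;
};

// Demo: touch the object from a second thread, either through the
// refcount methods or through an ordinary method, and report how many
// violations the checks noticed. The join() orders the accesses, so the
// demo itself has no data race.
int ViolationsAfter(bool touchRefCount) {
    OwnedByOneThread obj;
    std::thread other([&obj, touchRefCount] {
        if (touchRefCount) {
            obj.AddRef();
            obj.Release();
        } else {
            obj.SetValue(42);
        }
    });
    other.join();
    return obj.Violations();
}
```

Cross-thread AddRef/Release is flagged, while the cross-thread SetValue call is not, which is why clean assertion runs do not prove a class is thread-safe.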
ftp://ftp.mozilla.org/pub/mozilla/nightly/2000-03-10-08-M15/mozilla-source.tar.gz This is the source tarball from last Friday that jar mentioned. I don't see a source tarball for the netscape beta branch. You can pull it from cvs if you use the proper tag. The tag should be listed on the builds or seamonkey newsgroup.
I don't think mozilla.org is doing any building of tarballs based on the netscape branch (although you could ask for 'em!! :-) ). That was why the best build I could point at was late in the day on last Friday. Thanks to endico for adding the pointer. Bienvenu is quite correct that other bugs can/will exist in/around multi-threading. There is a good chance that the nature of the thread-induced problem will not be memory-centric (re: double frees, etc.), and hence I personally would be more surprised to see a stack trace that looked consistently like the ones we had been seeing on this bug. Another bug... yes... but I was hoping we were free of this particular class of threading errors. Perhaps we never will be... but a guy can hope! :-) Again, please tell us how you do with the "relatively" stable build that endico identified.
Warren: I was running current (by then) CVS source from CVS HEAD, optimized build. I just updated and now I get crashes when I resize (a bunch) the window when viewing slashdot.org, for example. I get a 120-130 step backtrace; here's a snip:

#0 0x0 in ?? ()
#1 0xedad7300 in nsInlineFrame::ReflowFrames () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#2 0xedad719c in nsInlineFrame::Reflow () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#3 0xedadaa9c in nsLineLayout::ReflowFrame () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#4 0xedab8b38 in nsBlockFrame::ReflowInlineFrame () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
...
#87 0xedae75fc in PresShell::ResizeReflow () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#88 0xed6dec54 in nsViewManager2::SetWindowDimensions () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#89 0xed6e0420 in nsViewManager2::DispatchEvent () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#90 0xed6ced54 in HandleEvent () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#91 0xeea3bc98 in nsWidget::DispatchEvent () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#92 0xeea3bba8 in nsWidget::DispatchWindowEvent () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#93 0xeea3aa8c in nsWidget::OnResize () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#94 0xeea42ff4 in nsWindow::Resize () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#95 0xed6d0a30 in nsView::SetDimensions () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#96 0xed6dec24 in nsViewManager2::SetWindowDimensions () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so

Here's a dump from loading a page with a bunch of png/jpg/gif images:

(gdb) bt
#0 0xee2ed9d0 in nsStreamListenerEvent::~nsStreamListenerEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1 0xee2ee074 in nsOnStopRequestEvent::~nsOnStopRequestEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2 0xee2eda88 in nsStreamListenerEvent::DestroyPLEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0xefa8b650 in PL_DestroyEvent () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4 0xefa8b62c in PL_HandleEvent () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5 0xefa8b53c in PL_ProcessPendingEvents () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6 0xefa8c2e0 in nsEventQueueImpl::ProcessPendingEvents () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7 0xeea2c40c in event_processor_callback () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8 0xeea2c12c in our_gdk_io_invoke () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9 0xee651d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee652444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee652634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee829814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xeea2c910 in nsAppShell::Run () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xeeacfb30 in nsAppShellService::Run () from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()

How do I update for the beta1 branch? If it's getting stable on this quad I could try it on a 10 cpu onyx2 for some more concurrency 8) With the current code I would not classify it as fixed. Maybe on dual boxes, but not on a quad.