bsemrad@adsoft.net: could you attach a Dr Watson log from Windows NT?

Brian Semrad

Reporter

Comment 5

25 years ago

Here is an excerpt from an email that I sent to dougt@netscape.com about the crash on my machine. I downloaded the Mozilla source dated 12-13-99, compiled it, and ran it. Below is the stack trace of the crash when I tried to go to www.slashdot.org:

nsCOMPtr<nsProxyObject>::assign_with_AddRef(nsISupports * 0x02f69060) line 759 + 9 bytes
nsCOMPtr<nsProxyObject>::operator=(nsProxyObject * 0x02f69060) line 516
nsProxyObjectCallInfo::nsProxyObjectCallInfo(nsProxyObject * 0x02f69060, nsXPTMethodInfo * 0x021ed670, unsigned int 3, nsXPTCVariant * 0x02f6a3d0, unsigned int 4, PLEvent * 0x02f6a890) line 65
nsProxyObject::Post(unsigned int 3, nsXPTMethodInfo * 0x021ed670, nsXPTCMiniVariant * 0x02d1fe18, nsIInterfaceInfo * 0x02f6e060) line 340 + 57 bytes
nsProxyEventObject::CallMethod(nsProxyEventObject * const 0x02f6f810, unsigned short 3, const nsXPTMethodInfo * 0x021ed670, nsXPTCMiniVariant * 0x02d1fe18) line 391 + 55 bytes
PrepareAndDispatch(nsXPTCStubBase * 0x02f6f810, unsigned int 3, unsigned int * 0x02d1fecc, unsigned int * 0x02d1feb8) line 100 + 31 bytes
SharedStub() line 125

------------------------------------------

Doug then emailed me with the following: "Thanks for the great work. This indeed is bug 18110."

I told Doug that I might have a go at fixing it, but several days have passed since then and I haven't yet had time to look at it seriously, so you should probably not count on me for this one.

Michael Lowe

Updated

25 years ago

I'm still crashing but things don't seem as fragile as before. I was able to download my mailbox headers twice in a row without crashing. Last time it crashed 4/5 times. Doug's changes seem to have made an improvement.

Warren Harris

Assignee

Comment 35

25 years ago

Probably the extra locks just slowed down the timing of things, shrinking the window of vulnerability. Dawn -- sounds like we should get a debug build/env on your machine so that we can diagnose the problem when it happens. Can you set that up?

jst

Comment 36

25 years ago

I just got a crash with a fresh tree on a dual 350 PII running Linux. Here's a stack trace:

Program received signal SIGSEGV, Segmentation fault.
0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4, methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158) at nsProxyEvent.cpp:433
433	    mDestQueue->PostEvent(event);
(gdb) bt
#0  0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4, methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158) at nsProxyEvent.cpp:433
#1  0x40177ff7 in nsProxyEventObject::CallMethod (this=0x862c7f0, methodIndex=4, info=0x812ac44, params=0xbf5ffa38) at nsProxyEventObject.cpp:394
#2  0x40183184 in PrepareAndDispatch (self=0x862c7f0, methodIndex=4, args=0xbf5ffaf0) at xptcstubs_unixish_x86.cpp:92
#3  0x401832aa in nsXPTCStubBase::Stub4 (this=0x862c7f0) at ../../../../../../dist/include/xptcstubsdef.inc:6
#4  0x4060a4eb in nsSocketTransport::fireStatus (this=0x862c900, aCode=3) at nsSocketTransport.cpp:1903
#5  0x40607860 in nsSocketTransport::Process (this=0x862c900, aSelectFlags=0) at nsSocketTransport.cpp:539
#6  0x4060b0c6 in nsSocketTransportService::ProcessWorkQ (this=0x84f64d0) at nsSocketTransportService.cpp:259
#7  0x4060b794 in nsSocketTransportService::Run (this=0x84f64d0) at nsSocketTransportService.cpp:493
#8  0x40172d05 in nsThread::Main (arg=0x84f6810) at nsThread.cpp:83
#9  0x402158fb in _pt_root (arg=0x85bf110) at ptthread.c:157
#10 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
(gdb) print this
$2 = (nsProxyObject *) 0x860ff28
(gdb) print *this
$3 = {<nsISupports> = {_vptr. = 0x883ce90}, mRefCnt = 140573992, mProxyType = 6, mDestQueue = {mRawPtr = 0x0}, mRealObject = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}, mLock = 0x882a7b8}

As far as I can tell, "this" was destroyed while a thread was still executing this->Post(). There's a check for !mDestQueue at the beginning of nsProxyObject::Post(), so this should not happen...
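The dump above (null mDestQueue and mRealObject, an implausible mRefCnt) is consistent with the object being freed while Post() was still running on another thread. A member check at the top of a method cannot help once some other thread drops the last reference; the usual remedy is for the calling thread to hold its own strong reference for the duration of the call. The sketch below illustrates that idea with std::shared_ptr standing in for XPCOM refcounting. ProxyLike and PostFromWorkerWhileOwnerReleases are hypothetical names, not Mozilla code, and this is not the fix that was actually checked in.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a proxy object: a destination queue plus Post().
// Illustrative only; this is not the real nsProxyObject.
class ProxyLike {
 public:
  explicit ProxyLike(std::shared_ptr<std::vector<std::string>> queue)
      : mDestQueue(std::move(queue)) {}

  // The null check protects against a missing queue, but it cannot protect
  // against *this* being destroyed by another thread partway through the call.
  bool Post(const std::string& event) {
    std::lock_guard<std::mutex> lock(mLock);
    if (!mDestQueue) return false;
    mDestQueue->push_back(event);
    return true;
  }

 private:
  std::mutex mLock;
  std::shared_ptr<std::vector<std::string>> mDestQueue;
};

// Hypothetical demo: a worker posts events while the owning thread drops its
// reference. The worker captures its own shared_ptr copy ("grip"), so the
// object stays alive until the worker is done with it.
std::size_t PostFromWorkerWhileOwnerReleases(int count) {
  auto queue = std::make_shared<std::vector<std::string>>();
  auto owner = std::make_shared<ProxyLike>(queue);

  std::thread worker([grip = owner, count] {  // copy = independent strong ref
    for (int i = 0; i < count; ++i) grip->Post("OnStatus");
  });
  owner.reset();  // the owner releases; the object survives via `grip`
  worker.join();  // join() makes the worker's pushes visible here
  return queue->size();
}
```

Had the worker captured a raw ProxyLike* instead, owner.reset() could free the object mid-loop, which is the shape of the crash in the trace.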

John Bandhauer

Comment 37

25 years ago

Doug, looking at EventHandler (shouldn't this be static or something?)... http://lxr.mozilla.org/seamonkey/source/xpcom/proxy/src/nsProxyEvent.cpp#460 ...I see that you are holding a per-object lock while invoking XPTC_InvokeByIndex. This seems excessive and/or dangerous. Aren't you then precluding reentrant calls via the proxy on the proxied object? Do you really need to protect more than your shared tables of information about the proxies and the refcount management of the proxies themselves? I think you should limit the scope of all locks to the bare minimum that is absolutely required, so that you decrease the chance of deadlocks or NSPR assertions on attempts to reenter a non-reentrant lock.
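John's suggestion is to hold the lock only while touching shared state and to release it before calling into the target object. A minimal sketch of that pattern, assuming a hypothetical Dispatcher class and using std::mutex in place of the NSPR lock in the real code: the handler is copied out of the shared table under the lock and invoked after the lock is released, so a handler that re-enters Dispatch() does not try to re-acquire a non-reentrant lock its own thread already holds.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical dispatcher with a table of handlers shared across threads.
class Dispatcher {
 public:
  void Register(const std::string& name, std::function<void()> fn) {
    std::lock_guard<std::mutex> lock(mTableLock);
    mTable[name] = std::move(fn);
  }

  // The lock guards only the table lookup. The handler runs with the lock
  // released, so it may call back into Dispatch() without self-deadlocking.
  bool Dispatch(const std::string& name) {
    std::function<void()> fn;
    {
      std::lock_guard<std::mutex> lock(mTableLock);
      auto it = mTable.find(name);
      if (it == mTable.end()) return false;
      fn = it->second;  // copy out while the table is protected
    }
    fn();  // invoke outside the lock
    return true;
  }

 private:
  std::mutex mTableLock;
  std::map<std::string, std::function<void()>> mTable;
};

// Hypothetical demo: the "outer" handler re-enters the dispatcher.
int ReentrantDemo() {
  Dispatcher d;
  int calls = 0;
  d.Register("inner", [&calls] { ++calls; });
  d.Register("outer", [&d, &calls] {
    ++calls;
    d.Dispatch("inner");  // reentrant call through the same dispatcher
  });
  d.Dispatch("outer");
  return calls;
}
```

If fn() were called while mTableLock was still held, the nested Dispatch("inner") would lock a std::mutex the same thread already owns, which is undefined behavior and in practice usually a deadlock.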

Doug Turner (:dougt)

Updated

25 years ago

Status: ASSIGNED → RESOLVED

Closed: 25 years ago

Resolution: --- → DUPLICATE

Doug Turner (:dougt)

Comment 38

25 years ago

Good catch: both event handlers need to be static, and the scope of the locks needs to be reduced. Marking this bug as a dup of 18110. *** This bug has been marked as a duplicate of 18110 ***

anssi

Comment 39

25 years ago

On a Linux SMP machine, Mozilla M13 crashes almost immediately. It also crashes while you are doing nothing.

Status: RESOLVED → REOPENED

Brendan Eich [:brendan]

Comment 40

25 years ago

anssi@bigfoot.com, why was this reopened if it is in fact a duplicate of 18110? Your comments don't argue that it is a separate bug from 18110, so I don't see the point in reopening. Resolving it as a duplicate doesn't mean that the bug it describes, duplicated by an earlier bugzilla report, is fixed -- it just means we know that the newer bug is a dup. /be

leger

Comment 41

25 years ago

Clearing DUPLICATE resolution due to reopen.

Doug Turner (:dougt)

Comment 42

25 years ago

closing. see other bug.

Status: REOPENED → RESOLVED

Closed: 25 years ago → 25 years ago

25 years ago

Putting dogfood in the keyword field.

Keywords: dogfood

Michael Lowe

Updated

25 years ago

Summary: [Dogfood] Mozilla crashes often on SMP systems. → Mozilla crashes often on SMP systems.

Doug Turner (:dougt)

Comment 49

25 years ago

Putting in correct component.

Component: XPCOM → Networking

Warren Harris

Assignee

Comment 50

25 years ago

Why is this considered Networking now? It's purely a proxy problem, isn't it? It could affect anything. And why is this owned by Gagan?

Doug Turner (:dougt)

Comment 51

25 years ago

No. This is the problem with having socket transports in the load group. The second onStop() crashes SMP machines.

Warren Harris

Assignee

Comment 52

25 years ago

Changing summary from: Mozilla crashes often on SMP systems. To: crash on SMP systems: socket transport in load group Reassigning to Rick Potts because I think he's working on this now.

Assignee: gagan → rpotts

Summary: Mozilla crashes often on SMP systems. → crash on SMP systems: socket transport in load group

rpotts (gone)

Comment 53

25 years ago

hey doug, are you sure that there is a SocketTransport sitting in a load group? I would have thought that that was not possible... -- rick

Doug Turner (:dougt)

Comment 54

25 years ago

gagan and jud are in the know.

jst

Comment 55

25 years ago

This is not Windows-only; I've been seeing this on Linux for a while too. Changing OS and Platform...

OS: Windows NT → All

Hardware: PC → All

Dawn Endico

Comment 56

25 years ago

The status whiteboard says you need an SMP machine. Hasn't dougt's arrived yet? Mozilla is pretty useless for me at home until this bug gets fixed. I could bring the machine in again, but the last time I tried that, the motherboard fried.

rickg

Comment 57

25 years ago

Hey Rick; I'm seeing these crashes _constantly_ on my home machine. Almost any page I visit will eventually end up in this state. Sometimes it's just visiting the page, sometimes it's when I leave the page, sometimes it's just sitting idle (so to speak). I'll start forwarding stack traces.

rickg

Comment 58

25 years ago

Here's an *all-too-typical* stack trace on my SMP/NT box...

nsStreamListenerEvent::~nsStreamListenerEvent() line 77 + 24 bytes
nsOnStopRequestEvent::~nsOnStopRequestEvent() line 258 + 8 bytes
nsOnStopRequestEvent::`scalar deleting destructor'(unsigned int 1) + 15 bytes
nsStreamListenerEvent::DestroyPLEvent(PLEvent * 0x02fe63e0) line 104 + 30 bytes
PL_DestroyEvent(PLEvent * 0x02fe63e0) line 549 + 10 bytes
PL_HandleEvent(PLEvent * 0x02fe63e0) line 536 + 9 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x02382cd0) line 487 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x003e0550, unsigned int 49342, unsigned int 0, long 37235920) line 975 + 9 bytes
USER32! 77e71820()
02382cd0()

I'm certainly willing to drive this machine remotely if someone wants to try to debug this problem.

Warren Harris

Assignee

Comment 59

25 years ago

Line 77 looks like the release of mContext, or possibly mChannel on the line above it. Rickg: can you see whether one of these looks like it has already been deleted? Maybe we've got a race between an AddRef on one thread and a Release on this one.

rpotts (gone)

Comment 60

25 years ago

For that particular stack trace, it is possible that the crash is happening on the NS_RELEASE(mContext) because mContext has already been deleted! It turns out that mContext is really an nsHTTPChannel. Unfortunately, nsHTTPChannel *does not* have thread-safe implementations of AddRef() and Release()... Since these methods are called on multiple threads (i.e. socket transport and UI), there can be problems :-) I'll check in a fix to make AddRef() and Release() thread-safe and we'll see if things get any better... Are you seeing any other stack traces?
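The fix Rick describes amounts to making the reference count an atomic counter. With a plain counter, two threads doing ++ and -- at the same moment can lose an update, leaving the count too low (an early delete, like the destructor crash above) or too high (a leak). A minimal sketch, assuming C++11 std::atomic rather than the NSPR primitives (such as PR_AtomicIncrement) the Mozilla code would have used; RefCounted, ChannelLike and HammerRefCount are hypothetical names, not the real XPCOM macros.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical intrusively refcounted base whose AddRef/Release are safe to
// call from several threads at once.
class RefCounted {
 public:
  unsigned long AddRef() {
    // Relaxed is enough here: the caller already holds a reference.
    return mRefCnt.fetch_add(1, std::memory_order_relaxed) + 1;
  }
  unsigned long Release() {
    // acq_rel makes earlier writes to the object visible to whichever
    // thread performs the final decrement and runs the destructor.
    unsigned long count = mRefCnt.fetch_sub(1, std::memory_order_acq_rel) - 1;
    if (count == 0) delete this;
    return count;
  }

 protected:
  virtual ~RefCounted() = default;

 private:
  std::atomic<unsigned long> mRefCnt{0};
};

std::atomic<int> gDestroyed{0};  // counts destructor runs for the demo

// Hypothetical stand-in for a channel shared between threads.
class ChannelLike : public RefCounted {
 protected:
  ~ChannelLike() override { gDestroyed.fetch_add(1); }
};

// Several threads take and drop references concurrently. Returns
// {destructions before the final Release, destructions after it}.
std::pair<int, int> HammerRefCount(int threads, int iterations) {
  gDestroyed = 0;
  ChannelLike* obj = new ChannelLike();
  obj->AddRef();  // creator's reference keeps the object alive throughout
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([obj, iterations] {
      for (int i = 0; i < iterations; ++i) {
        obj->AddRef();
        obj->Release();
      }
    });
  }
  for (auto& th : pool) th.join();
  int beforeFinalRelease = gDestroyed.load();
  obj->Release();  // last reference: the destructor runs exactly once, here
  return {beforeFinalRelease, gDestroyed.load()};
}
```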

rpotts (gone)

Comment 61

25 years ago

I've just checked in thread-safe AddRef/Release implementations for nsHTTPChannel, nsHTTPResponseListener, nsHTTPRequest and nsHTTPEncodeStream. I suspect that other nsIInputStream implementations (besides nsHTTPEncodeStream) will need thread-safe AddRef/Release implementations... in particular, the "string stream".

Warren Harris

Assignee

Comment 99

25 years ago

I'll have to take this over now that Rick has gone on sabbatical, but in some sense it's probably Dougt's bug.

Status: We worked on this all day yesterday on Dawn's machine and saw numerous crashes. For Necko, they were often in the proxy code used to post OnStatus and OnProgress notifications back to the Mozilla thread. However, we also saw other problems, such as the gfx toolkit going away, so solving just the Necko issue won't make us completely stable on MP machines.

Possible solutions: (a) don't deliver status/progress at all (disable them in the socket transport and just rel-note it); (b) don't use the proxy code to deliver status/progress (implement the event delivery/thread-switch by hand); (c) get Doug to track down what's going on with proxies.

Last night we augmented the TestSocketTransport test program to receive status/progress notifications so that it might also exhibit this problem, and left it running on the machine, but we hadn't seen the same failure by the time we went home. :-(
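Option (b) means replacing the generic proxy with a hand-written thread switch. One way that could look, sketched with std::thread, std::mutex and std::condition_variable rather than NSPR and PLEvent: the socket thread copies each status message into a closure and posts it to a queue that only the UI thread drains, so the only object shared between the threads is the queue itself. EventQueue and DeliverStatus are hypothetical names, and this is not what Necko actually shipped.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical event queue owned by the UI thread. Other threads only push
// closures; only the owning thread runs them.
class EventQueue {
 public:
  void Post(std::function<void()> ev) {
    {
      std::lock_guard<std::mutex> lock(mLock);
      mEvents.push_back(std::move(ev));
    }
    mCond.notify_one();
  }

  // Blocks until an event is available, then runs it on the calling thread.
  void ProcessOne() {
    std::function<void()> ev;
    {
      std::unique_lock<std::mutex> lock(mLock);
      mCond.wait(lock, [this] { return !mEvents.empty(); });
      ev = std::move(mEvents.front());
      mEvents.pop_front();
    }
    ev();  // run outside the lock
  }

 private:
  std::mutex mLock;
  std::condition_variable mCond;
  std::deque<std::function<void()>> mEvents;
};

// Hypothetical demo: a "socket thread" posts OnStatus-style messages by
// value; the calling thread plays the UI thread and consumes them in order.
std::vector<std::string> DeliverStatus(int n) {
  EventQueue uiQueue;
  std::vector<std::string> received;  // touched only by the calling thread
  std::thread socketThread([&uiQueue, &received, n] {
    for (int i = 0; i < n; ++i) {
      std::string msg = "status " + std::to_string(i);
      // The message is copied into the event; the closure runs on the UI side.
      uiQueue.Post([&received, msg] { received.push_back(msg); });
    }
  });
  for (int i = 0; i < n; ++i) uiQueue.ProcessOne();
  socketThread.join();
  return received;
}
```

Copying the status data into the event, instead of handing the UI thread a pointer back into the transport, is what removes the lifetime question that the proxy crashes kept raising.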

Assignee: rpotts → warren

Warren Harris

Assignee

Comment 100

25 years ago

Found it! NS_MT_SUPPORTED was not defined for Linux (!) and a bunch of classes weren't thread safe. See news://news.mozilla.org/38BF7E94.3CA715DA%40netscape.com for details.

Warren Harris

Assignee

Updated

25 years ago

Whiteboard: [PDT+] w/b minus on 03/03- need SMP machine → [PDT+] w/b minus on 03/03 [have fixes!]

Jim Roskind

Comment 101

25 years ago

The landing is in progress, so I'm extending this to w/b minus on 3/7

Whiteboard: [PDT+] w/b minus on 03/03 [have fixes!] → [PDT+] w/b minus on 3/7 [have fixes!]

Warren Harris

Assignee

Comment 102

25 years ago

Here's the list of classes I'm having to make threadsafe: AtomImpl, BasicStringImpl, CacheOutputStream, InterceptStreamListener, MemCacheWriteStreamWrapper, TestConnection, nsAppShellService, nsCacheEntryChannel, nsCharsetConverterManager, nsConverterFactory, nsDNSService, nsDateTimeFormatWin, nsDocShell, nsDocumentOpenInfo, nsEventQueueImpl, nsEventQueueServiceImpl, nsFTPDirListingConv, nsFileSpecImpl, nsFileTransport, nsFileTransportService, nsGenericFactory, nsGenericModule, nsHTTPIndexParser, nsIOService, nsImapFlagAndUidState, nsImapMailCopyState, nsImapMockChannel, nsInputStreamChannel, nsInputStreamFileSystem, nsInterfaceInfoManager, nsLocalFile, nsLocalFileSystem, nsLocale, nsLocaleService, nsMIMEInfoImpl, nsMIMEService, nsMemCacheChannel, nsMemCacheRecord, nsMsgAccountManager, nsMsgIncomingServer, nsMsgMailNewsUrl, nsMsgStatusFeedback, nsMsgWindow, nsObserverService, nsPref, nsPrefMigration, nsProxyEventClass, nsProxyEventObject, nsProxyObjectManager, nsRDFResource, nsRunner, nsSocketTransport, nsSocketTransportService, nsStdURLParser, nsStorageStream, nsStreamConverterService, nsSupportsArray, nsThread, nsThreadPool, nsWalletlibService.

David :Bienvenu

Comment 103

25 years ago

On what evidence are you basing the need to make the IMAP classes thread-safe (by which I assume you mean adding thread-safe AddRef and Release)? Inspection, or actual evidence of CONCURRENT access to AddRef and Release from multiple threads? The IMAP code uses BLOCKING proxy calls between threads, so while one thread may be manipulating the refcount, the other thread is blocked.
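David's argument is that a blocking proxy call serializes the two threads: the caller is parked until the callee returns, so even a non-atomic refcount is never touched concurrently. A sketch of that reasoning, assuming a hypothetical CallSync helper that runs the work on a freshly spawned std::thread and waits on a std::future (the real proxy code instead posts to the target thread's existing event queue). The caveat is that this only holds if every cross-thread path is synchronous; a single asynchronous post breaks it.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for a synchronous proxy call: run `fn` on another
// thread and block until it returns.
int CallSync(const std::function<int()>& fn) {
  std::packaged_task<int()> task(fn);
  std::future<int> result = task.get_future();
  std::thread target(std::move(task));
  int value = result.get();  // the caller is blocked here while `fn` runs
  target.join();
  return value;
}

// Hypothetical demo: a plain (non-atomic) refcount is changed on both
// threads, but never at the same time, because every cross-thread call
// blocks the caller until it completes.
int SerializedRefCountDemo(int rounds) {
  int refCnt = 0;  // deliberately not atomic
  for (int i = 0; i < rounds; ++i) {
    ++refCnt;                                          // caller-side AddRef
    CallSync([&refCnt] { ++refCnt; return refCnt; });  // callee-side AddRef
    CallSync([&refCnt] { --refCnt; return refCnt; });  // callee-side Release
    --refCnt;                                          // caller-side Release
  }
  return refCnt;
}
```

The thread start and the future's get() provide the happens-before edges that make the unsynchronized int safe here; replace CallSync with a fire-and-forget post and the same code becomes a data race.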

Warren Harris

Assignee

Comment 104

25 years ago

These changes went in moments ago, along with Andreas' changes. David: These classes were determined experimentally. I hadn't thought about the case where only synchronous proxy code was used, and consequently making AddRef/Release threadsafe _shouldn't_ be necessary (I'd have to really study the proxy code to determine whether that's really true), but I think making these classes threadsafe is mostly harmless -- just a little more overhead in the AddRef/Release which will hopefully be insignificant. Let's see if anything shows up during profiling.

Status: NEW → RESOLVED

Closed: 25 years ago → 25 years ago

Resolution: --- → FIXED

Dawn Endico

Comment 105

25 years ago

Warren, I was playing around on my machine today in the tree you were working on and found lots of other thread safety assertions and crashes in the mail account wizard and while loading my inbox. Do you need that tree any more or is it safe to update to the tip? I don't want to blow away your changes but I don't want to report the crashes if they are unique to my tree.

Warren Harris

Assignee

Comment 106

25 years ago

You can update to the tip. Tons of other fixes went in after that. It would be great if you could verify that the thread safety asserts you mentioned have gone away now. If not, you can send them to me, or file new bugs. Thanks.

Tom Everingham

25 years ago

stric: Is this the latest build? Debug or optimized? We're still finding thread-safety assertions that we're tracking down, so we know this isn't 100% fixed yet, but we closed this bug because we know that the assertions will help us resolve them over time. I'm wondering if you've seen any assertions, and/or whether you think we should reopen this bug.

Jim Roskind

Comment 111

25 years ago

Note that crashing on the tip build this past weekend (or today) is no big deal. There is a lot of instability at the moment. Do you crash when you pull last Friday's evening build? Try picking that up from Mozilla. That was when we branched for beta, but before the giant landings began. If you are building your own binary, you should try to induce this bug using the Netscape beta1 branch. That would be the interesting (sad? surprising?) test. Thanks, Jim

David :Bienvenu

Comment 112

25 years ago

I hate to be a broken record, but the asserts only catch lack of thread safety on addref and release - there could be all sorts of other thread-safety issues.
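To make David's point concrete: a debug-only owning-thread check, roughly in the spirit of the assertions being discussed, only watches AddRef and Release. The sketch below is hypothetical (SingleThreadedObject and CountViolations are illustrative names, and it counts violations instead of asserting so the result can be observed). It flags both refcount calls made from the wrong thread but says nothing about the SetValue() write made from that same thread.

```cpp
#include <cassert>
#include <thread>

// Hypothetical object that remembers its creating thread and records a
// violation when AddRef/Release are called from any other thread.
class SingleThreadedObject {
 public:
  SingleThreadedObject() : mOwningThread(std::this_thread::get_id()) {}

  void AddRef() { CheckOwningThread(); ++mRefCnt; }
  void Release() { CheckOwningThread(); --mRefCnt; }

  // Unchecked: a cross-thread write here is never reported.
  void SetValue(int v) { mValue = v; }

  int Value() const { return mValue; }
  int RefCount() const { return mRefCnt; }
  int Violations() const { return mViolations; }

 private:
  void CheckOwningThread() {
    if (std::this_thread::get_id() != mOwningThread) ++mViolations;
  }

  std::thread::id mOwningThread;
  int mRefCnt = 0;
  int mValue = 0;
  int mViolations = 0;
};

// Hypothetical demo. The threads run one after another (join), so the plain
// ints are not actually raced here; the point is what the check observes.
int CountViolations() {
  SingleThreadedObject obj;
  obj.AddRef();  // owning thread: not counted
  std::thread other([&obj] {
    obj.AddRef();      // wrong thread: counted
    obj.SetValue(42);  // wrong thread: NOT counted
    obj.Release();     // wrong thread: counted
  });
  other.join();
  obj.Release();
  return obj.Violations();
}
```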

Dawn Endico

Comment 113

25 years ago

ftp://ftp.mozilla.org/pub/mozilla/nightly/2000-03-10-08-M15/mozilla-source.tar.gz is the source tarball from last Friday that jar mentioned. I don't see a source tarball for the Netscape beta branch. You can pull it from CVS if you use the proper tag. The tag should be listed in the builds or seamonkey newsgroups.

Jim Roskind

Comment 114

25 years ago

I don't think mozilla.org is doing any building of tarballs based on the Netscape branch (although you could ask for 'em!! :-) ). That was why the best build I could point at was late in the day last Friday. Thanks to endico for adding the pointer. Bienvenu is quite correct that other bugs can/will exist in/around multi-threading. There is a good chance that the nature of the thread-induced problem will not be memory-centric (re: double frees, etc.), and hence I personally would be more surprised to see a stack trace that looked consistently like the ones we had been seeing on this bug. Another bug... yes... but I was hoping we were free of this particular class of threading errors. Perhaps we never will be... but a guy can hope! :-) Again, please tell us how you do with the "relatively" stable build that endico identified.

Tomas Ögren

Comment 115

25 years ago

Warren: I was running the then-current CVS HEAD source, optimized build. I just updated, and now I get crashes when I resize the window (a bunch) while viewing, for example, slashdot.org. I get a 120-130 frame backtrace; here's a snip:

#0  0x0 in ?? ()
#1  0xedad7300 in nsInlineFrame::ReflowFrames () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#2  0xedad719c in nsInlineFrame::Reflow () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#3  0xedadaa9c in nsLineLayout::ReflowFrame () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#4  0xedab8b38 in nsBlockFrame::ReflowInlineFrame () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
...
#87 0xedae75fc in PresShell::ResizeReflow () from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#88 0xed6dec54 in nsViewManager2::SetWindowDimensions () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#89 0xed6e0420 in nsViewManager2::DispatchEvent () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#90 0xed6ced54 in HandleEvent () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#91 0xeea3bc98 in nsWidget::DispatchEvent () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#92 0xeea3bba8 in nsWidget::DispatchWindowEvent () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#93 0xeea3aa8c in nsWidget::OnResize () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#94 0xeea42ff4 in nsWindow::Resize () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#95 0xed6d0a30 in nsView::SetDimensions () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#96 0xed6dec24 in nsViewManager2::SetWindowDimensions () from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so

Here's a dump from loading a page with a bunch of png/jpg/gif images:

(gdb) bt
#0  0xee2ed9d0 in nsStreamListenerEvent::~nsStreamListenerEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1  0xee2ee074 in nsOnStopRequestEvent::~nsOnStopRequestEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2  0xee2eda88 in nsStreamListenerEvent::DestroyPLEvent () from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3  0xefa8b650 in PL_DestroyEvent () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4  0xefa8b62c in PL_HandleEvent () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5  0xefa8b53c in PL_ProcessPendingEvents () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6  0xefa8c2e0 in nsEventQueueImpl::ProcessPendingEvents () from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7  0xeea2c40c in event_processor_callback () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8  0xeea2c12c in our_gdk_io_invoke () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9  0xee651d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee652444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee652634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee829814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xeea2c910 in nsAppShell::Run () from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xeeacfb30 in nsAppShellService::Run () from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()

How do I update for the beta1 branch? If it's getting stable on this quad, I could try it on a 10-CPU Onyx2 for some more concurrency 8) With the current code I would not classify it as fixed. Maybe on dual boxes, but not on a quad.

test case (22 gif images), 25 years ago, Dawn Endico (deleted), text/html
single gif image, 25 years ago, Dawn Endico (deleted), text/html