Closed Bug 18005 Opened 25 years ago Closed 25 years ago

[DOGFOOD] Leave mail window for a long time, GetMsg, crash

Categories

(MailNews Core :: Networking, defect, P3)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: trudelle, Assigned: dougt)

References

Details

(Whiteboard: [PDT+] Verified for all the platforms)

Attachments

(1 file)

Today's opt build (yesterday too)
Launch Apprunner
Task>Mail
Open IMAP server
select inbox
read a message (possibly extraneous step)
Leave mail window sitting there for a while without using it.
Click GetMsg
Crash, log available.
Saw several times on Mac, once on Linux, will try on Win98
Assignee: phil → bienvenu
Summary: Crash on GetMsg → Leave mail window for a long time, GetMsg, crash
Here's the stack trace. Peter, is this really a Seamonkey stack trace? It has all sorts of names which look like 4.x.
Calling chain using A6/R1 links
Back chain ISA Caller
00000000 PPC 16C7DF28
068428C0 PPC 16C7E07C
06842870 PPC 174BDE6C LApplication::Run()+000B8
06842800 PPC 17099704 XP_GetNonGridContext+285EC
068427A0 PPC 17447B64 LPeriodical::DevoteTimeToRepeaters(const EventRecord&)+0004C
06842740 PPC 16D2F6D8 CFrontApp::GetApplication()+010D4
068426F0 PPC 16D2FE3C CFrontApp::GetApplication()+01838
06842660 PPC 16D31948 SSL_DataPending+01390
06842610 PPC 16E05C94 CACHE_FindURLInCache+0317C
068425C0 PPC 16EB2568 NET_CacheConverter+01560
06842560 PPC 16E92420 NET_DeregisterContentTypeConverter+087C0
06842510 PPC 17057630 FE_DefaultDocCharSetID+3AAD8
068424A0 PPC 171AD594 XP_Confirm+178D0
06842450 PPC 17071BC4 XP_GetNonGridContext+00AAC
068423B0 PPC 1707292C XP_GetNonGridContext+01814
06842360 PPC 16DCEA64 XP_TempDirName+0C0B4
068422F0 PPC 16DCF5A8 XP_TempDirName+0CBF8
068422B0 PPC 16DD0344 XP_TempDirName+0D994
06842270 PPC 16DD0344 XP_TempDirName+0D994
06842230 PPC 16DD0344 XP_TempDirName+0D994
068421F0 PPC 16DD0358 XP_TempDirName+0D9A8
068421B0 PPC 16D96644 SOB_get_error+019E4
06842170 PPC 174F3EDC Flush_Free+0000C
Return addresses on the stack
Stack Addr Frame Addr ISA Caller
068424D8 PPC 16D18A18 XP_PlatformFileToURL+0A1D4
068424CC 68K 0636DE42
068424C8 PPC 16CF8804 INTL_DefaultWinCharSetID+004F0
068424B8 68K 17445482 LBroadcaster::BroadcastMessage(long, void*)+0008A
068424A8 PPC 17057630 FE_DefaultDocCharSetID+3AAD8
0684248C 68K 065AA29E
06842458 06842450 PPC 171AD594 XP_Confirm+178D0
06842408 06842400 PPC 174F3E88 Flush_Allocate+0001C
068423B8 068423B0 PPC 17071BC4 XP_GetNonGridContext+00AAC
06842388 PPC 17133678 UGraphicGizmos::BevelRect(const Rect&, short, short, short)+05EE4
06842368 06842360 PPC 1707292C XP_GetNonGridContext+01814
0684235C 06842358 68K 063E7ACA
06842308 06842300 PPC 16F04378 XP_ProgressText+20950
068422F8 068422F0 PPC 16DCEA64 XP_TempDirName+0C0B4
068422EC 68K 063E7ACA
068422DE 68K 0003FFFE
068422D8 068422D0 PPC 1732C8B8 PR_ExitMonitor+00098
068422CC 68K 0635866A
068422B8 068422B0 PPC 16DCF5A8 XP_TempDirName+0CBF8
06842298 68K 063E7ACA
06842288 68K 063DA9CE
06842278 06842270 PPC 16DD0344 XP_TempDirName+0D994
06842258 06842250 PPC 16D7B890 ET_moz_CallFunction+003C0
06842238 06842230 PPC 16DD0344 XP_TempDirName+0D994
06842218 06842210 PPC 16D7BB04 ET_moz_CallFunction+00634
068421F8 068421F0 PPC 16DD0344 XP_TempDirName+0D994
068421D8 068421D0 PPC 16D7C044 ET_moz_CallFunction+00B74
068421CC 68K 0635866A
068421C8 068421C0 PPC 16C7F3D4
068421B8 068421B0 PPC 16DD0358 XP_TempDirName+0D9A8
068421A8 068421A0 PPC 17043C44 FE_DefaultDocCharSetID+270EC
06842194 68K 063DA9CE
06842188 06842180 PPC 174F3EDC Flush_Free+0000C
06842178 06842170 PPC 16D96644 SOB_get_error+019E4
0684215C 68K 0635866A
06842158 06842150 PPC 16DCF51C XP_TempDirName+0CB6C
06842148 68K 063E7ACA
06842138 06842130 PPC 174F3EDC Flush_Free+0000C
06842118 06842110 PPC 174F3EDC Flush_Free+0000C
06842108 06842100 PPC 16DCF020 XP_TempDirName+0C670
068420F8 068420F0 PPC 174F3EDC Flush_Free+0000C
068420F4 068420F0 68K 063E7ACA
Severity: normal → critical
QA Contact: lchiang → esther
Summary: Leave mail window for a long time, GetMsg, crash → [DOGFOOD] Leave mail window for a long time, GetMsg, crash
is this a seamonkey crash, or a 4.5 crash? was 4.5 running at the time?
Did I send the wrong file? Sorry, I'll try it again.
Attached file: Macsbug log for GetMsg crash (deleted)
Looks like there were two logs in the file I sent, and only the first (a 4.7 crash) got pasted. I deleted that log from the file and attached the apprunner log only.
OK, here's the stack trace from the attachment. Looks like a problem shutting down the thread, especially with the proxy event code. I'm assuming biff is not turned on, or we wouldn't have timed out.
04F29908 04F29900 PPC 1791E650 PR_CSetOnMonitorRecycle+00050
04F298C8 04F298C0 PPC 16BBB294 nsThread::Exit(void*)+0001C
04F29888 04F29880 PPC 16BBB438 nsThread::Release()+00040
04F29848 68K 16BBB19E nsThread::~nsThread()+00036
04F29808 04F29800 PPC 16B85690 nsCOMPtr_base::~nsCOMPtr_base()+00030
04F297C8 04F297C0 PPC 163289F4 nsImapProtocol::Release()+289F4
04F297A8 04F297A0 PPC 1791E474 PR_CExitMonitor+00074
04F29788 04F29780 PPC 16329C14 nsImapProtocol::~nsImapProtocol()+29C14
04F29768 04F29760 PPC 16C86950 operator delete(void*)+00014
04F29758 04F29750 PPC 17922580 PR_ExitMonitor+00054
04F29748 04F29740 PPC 17922408 PR_DestroyMonitor+0001C
04F29730 68K 05BA264E
04F29728 04F29720 PPC 16C877F8 free+00030
04F29708 04F29700 PPC 1792405C PR_DestroyLock+00018
04F296E8 04F296E0 PPC 16BC83FC nsProxyEventObject::~nsProxyEventObject()+000F0
04F296D8 04F296D0 PPC 16C8956C nsLargeHeapAllocator::AllocatorFreeBlock(void*)+00020
04F296C8 04F296C0 PPC 1791DE94 PR_Free+00014
04F296B8 04F296B0 PPC 16B886AC nsAllocator::Free(void*)+00054
04F296A8 04F296A0 PPC 16BC8480 nsProxyEventObject::RootRemoval()+00034
04F29688 04F29680 PPC 16C86950 operator delete(void*)+00014
I tried this on windows. It seemed fine. I'll try linux next.
Right, no biff.
I can't reproduce this on Win98, but I just reproduced it on Linux again.
Are we having a dangling connection to a time-out'd thread?
I reproduced the crash on linux. We get the following stack trace. This is probably some symptom of our screwed-up event handling. Perhaps DougT's proxy event changes will help, though I doubt it.
#0 0x40368888 in main_arena ()
#1 0x68403688 in ?? ()
#2 0x408e37ea in nsStreamListenerEvent::HandlePLEvent (aEvent=0x83fec48) at nsAsyncStreamListener.cpp:169
#3 0x4019a36b in PL_HandleEvent (self=0x83fec48) at plevent.c:537
#4 0x4019a27c in PL_ProcessPendingEvents (self=0x8736020) at plevent.c:498
#5 0x401599e9 in nsEventQueueImpl::ProcessPendingEvents (this=0x8735ff8) at nsEventQueue.cpp:190
#6 0x405181ec in event_processor_callback (data=0x8735ff8, source=21, condition=GDK_INPUT_READ) at nsAppShell.cpp:228
#7 0x40517aff in our_gdk_io_invoke (source=0x8736080, condition=G_IO_IN, data=0x8722e98) at nsAppShell.cpp:49
#8 0x406b23ca in g_io_unix_dispatch ()
#9 0x406b3a86 in g_main_dispatch ()
#10 0x406b4041 in g_main_iterate ()
#11 0x406b41e1 in g_main_run ()
#12 0x405dd7a9 in gtk_main ()
#13 0x405186ff in nsAppShell::Run (this=0x80a2ce8) at nsAppShell.cpp:395
#14 0x4039d351 in nsAppShellService::Run (this=0x80a1f60) at nsAppShellService.cpp:480
More likely we have a proxy event in the event queue, and it refers to a deleted object, like the protocol, or thread. Since linux event handling seems fairly messed up, at least as far as IMAP is concerned, this doesn't surprise me too much.
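As a rough illustration of the hazard described above (a queued event that outlives the object it refers to), here is a minimal C++ sketch. It is not Mozilla code: `Protocol`, `EventQueue`, and `DemoDroppedEvents` are hypothetical names, and the use of `std::weak_ptr` is an assumption about one possible defence, not the fix that was eventually checked in. Each queued event holds only a weak reference, so an event whose target has been destroyed is dropped instead of dereferencing freed memory.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>

// Hypothetical stand-in for a protocol object; not the real nsImapProtocol.
struct Protocol {
    int handled = 0;
    void OnData() { ++handled; }
};

// Hypothetical event queue. Each event keeps only a weak reference to its
// target, so the queue never extends the target's lifetime and never
// dereferences it after destruction.
class EventQueue {
public:
    void Post(const std::shared_ptr<Protocol>& target) {
        std::weak_ptr<Protocol> weak = target;
        events_.push([weak]() -> bool {
            if (auto strong = weak.lock()) {  // target still alive?
                strong->OnData();
                return true;
            }
            return false;  // target gone: drop the event
        });
    }

    // Runs all pending events and returns how many were dropped because
    // their target had already been destroyed.
    int ProcessPending() {
        int dropped = 0;
        while (!events_.empty()) {
            if (!events_.front()()) ++dropped;
            events_.pop();
        }
        return dropped;
    }

private:
    std::queue<std::function<bool()>> events_;
};

// Posts two events, destroys the target, then drains the queue.
int DemoDroppedEvents() {
    EventQueue queue;
    auto protocol = std::make_shared<Protocol>();
    queue.Post(protocol);
    queue.Post(protocol);
    assert(protocol.use_count() == 1);  // the queue holds no strong reference
    protocol.reset();  // simulate the timed-out connection being released
    return queue.ProcessPending();
}
```

In this sketch both events are dropped once the target is gone. With a raw pointer in place of the weak reference, the same sequence would touch a deleted object.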
Whiteboard: [PDT+]
Putting on PDT+ radar.
If you turn on biff at an interval less than 29 minutes, you won't have this problem.
What's happening, I bet, is that we're removing the timed-out connection, attempting to logout, and releasing the imap protocol instance. This eventually causes the imap thread to be destroyed. On windows, this happens later on the thread in question, but it looks like on the mac, it happens immediately on the ui thread. On linux, it looks like the proxy event stuff isn't noticing that the event queue has gone away.
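The teardown sequence guessed at above can be contrasted with an ordering that leaves no pending event pointing at a destroyed object. The following is only a sketch under assumptions: `Protocol`, `WorkerQueue`, `ShutDown`, and `DemoShutDown` are hypothetical names, and the single-threaded queue is a simplification of the real per-connection thread. The order shown is: close the queue to new events, drain what is pending, and only then release the protocol object.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical protocol object; counts the events it handled.
struct Protocol {
    int handled = 0;
};

// Hypothetical per-connection event queue.
class WorkerQueue {
public:
    // Returns false once the queue has been closed to new events.
    bool Post(std::function<void()> event) {
        if (!accepting_) return false;
        pending_.push(std::move(event));
        return true;
    }
    void StopAccepting() { accepting_ = false; }
    // Runs every event that is already queued.
    void Drain() {
        while (!pending_.empty()) {
            std::function<void()> event = std::move(pending_.front());
            pending_.pop();
            event();
        }
    }
    bool Empty() const { return pending_.empty(); }

private:
    bool accepting_ = true;
    std::queue<std::function<void()>> pending_;
};

// Tears a connection down so that no queued event can outlive the protocol.
// Returns the number of events the protocol handled before it was released.
int ShutDown(WorkerQueue& queue, std::unique_ptr<Protocol>& protocol) {
    queue.StopAccepting();  // 1. no new events can target the protocol
    queue.Drain();          // 2. run everything already queued
    assert(queue.Empty());
    int handled = protocol->handled;
    protocol.reset();       // 3. only now destroy the protocol
    return handled;
}

// Queues two events, shuts down, then checks that a late event is refused.
int DemoShutDown() {
    WorkerQueue queue;
    std::unique_ptr<Protocol> protocol(new Protocol());
    Protocol* raw = protocol.get();
    queue.Post([raw] { ++raw->handled; });
    queue.Post([raw] { ++raw->handled; });
    int handled = ShutDown(queue, protocol);
    bool lateAccepted = queue.Post([] {});
    return lateAccepted ? -1 : handled;
}
```

The raw pointer captured by the events is safe here only because the queue is drained before the protocol is reset. Reversing steps 2 and 3 reproduces the kind of dangling access suspected in this bug.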
Thanks David, I thought that (30 min. connection drop) might be the case, and the workaround is good enough for dogfood.
It turns out that if we really did drop the connection, everything would be fine. Unfortunately, we try to gracefully close and logout. If I comment out those calls, we don't crash. My gdb/linux skills are pretty marginal - all I can guess is that the vtbl for the StreamListenerEvent is horked, but the object doesn't look deleted. I'll keep poking around but I suspect this will take a few days.
Oy, gevalt. The nsImapProtocol object is definitely getting destroyed before the event queue is finished, which is not good. But what's worse is that I put in a call to StopAcceptingEvents after our thread has stopped running to see if that helps. It didn't help, but it allowed me to discover that on linux (but not windows), our imap event queue is somehow marked the "elder" event queue. (I suspect this should be "eldest"). This seems wrong.
The above is partly wrong - the elder assert happens on windows as well, so perhaps it's not the problem. But, we are executing the onDataAvailableEvent::HandleEvent on the wrong thread on Linux (i.e., the main thread), just like 17065 - my gut tells me this is the root of our problem.
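To make the wrong-thread concern concrete, here is a small C++ sketch of a queue that refuses to run its events for any caller other than the thread that created it. This is an illustration only: `ThreadBoundQueue` and `DemoWrongThread` are hypothetical names, not Mozilla classes, and a default-constructed `std::thread::id` (which names no running thread) is used as a stand-in for "some other thread" so the example needs no real second thread.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical queue bound to the thread that created it.
class ThreadBoundQueue {
public:
    ThreadBoundQueue() : owner_(std::this_thread::get_id()) {}

    void Post(std::function<void()> event) {
        pending_.push_back(std::move(event));
    }

    // Runs pending events only when `caller` is the owning thread.
    // Returns the number of events run; a foreign caller runs none.
    int ProcessPending(std::thread::id caller) {
        if (caller != owner_) return 0;  // refuse to run on the wrong thread
        std::vector<std::function<void()>> batch;
        batch.swap(pending_);  // take the batch so events may post new events
        int ran = 0;
        for (auto& event : batch) {
            event();
            ++ran;
        }
        return ran;
    }

private:
    std::thread::id owner_;
    std::vector<std::function<void()>> pending_;
};

// Tries to process one event from a foreign thread id, then from the owner.
int DemoWrongThread() {
    ThreadBoundQueue queue;
    int handled = 0;
    queue.Post([&handled] { ++handled; });

    // A default-constructed id names no thread; it plays the foreign caller.
    int ranForeign = queue.ProcessPending(std::thread::id{});
    assert(ranForeign == 0 && handled == 0);

    // The owning thread is allowed to run the event.
    return queue.ProcessPending(std::this_thread::get_id());
}
```

A check like this would not fix the dispatch problem by itself, but it turns a silent cross-thread call into a visible refusal that is easier to debug.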
I've verified that if I stop gtk from calling into imap code from the UI thread, this crash doesn't happen. I did this by disabling the nsAppShell::ListenToEventQueue call, which prevents us from getting called from the ui thread. Unfortunately, it also breaks the password prompt, presumably because that's why this event queue listener hack is there in the first place. I believe this is an xpapps problem, so I'm reassigning it back to you, Peter. I truly believe that we should be called from the correct thread.
Assignee: bienvenu → trudelle
David, I think we're all agreed that this problem has the same source as 17065. Brendan is going to help me find someone to help us figure out what's going on with event processing on linux. I'm hesitant to mark this a dup, but the problem is probably the same even though the symptoms are different.
Assignee: trudelle → brendan
Reassigning to brendan for triage. Let's not forget, this also happened on Mac, as did 17065.
Yep, and I believe they both do hacky things with event dispatching to get modal dialogs to work. I believe these two bugs have the same cause, and Scott and I spent a lot of time discovering that in both cases, our events are getting processed by the wrong thread.
Status: NEW → ASSIGNED
Dan is gonna help me fix this on all platforms, yes he is. /be
Blocks: 18471
Target Milestone: M12
17065 is M12, so should this one be. /be
Blocks: 18951
Blocks: 20203
*** Bug 20247 has been marked as a duplicate of this bug. ***
Can also be seen when using "an imap server that only allows a single connection to a folder, and kills previous connections (like the UW server)" as bienvenu mentions in the duplicate bug.
QA Contact: esther → huang
Changing QA Contact to me since this is an IMAP bug. Cc: Esther.
Same occurs for me: I'm using UW-IMAP, a non-Mozilla-Biff checking the INBOX every 30 seconds and "check mail every 1 min." in Mozilla. While having a normal subfolder (not INBOX and not under INBOX) open, I get "Document: Done (0.21 secs) In OnFolderLoader" every min. or so. Mozilla (debug build) crashed after 20 min. w/o any notice. HTH.
Brendan, what's projected fix date for this bug?
the better question to get started is who is going to tackle this hairy problem? did we find a porkjockey owner?
Assignee: brendan → dougt
Status: ASSIGNED → NEW
dougt has been fixing bugs in event-loop land and kindly offers to take this one. he's gonna dig into this tomorrow. /be
Status: NEW → ASSIGNED
Whiteboard: [PDT+] → [PDT+] 12/9
Sent workaround to mscott to verify. Still tracking down real problem.
Whiteboard: [PDT+] 12/9 → [PDT+] Fix ready, patch sent for review.
Blocks: 21564
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
fix checked in.
I have not been able to reproduce on linux 6.0, NT 4.0 or Mac OS 8.5.1 using 12-16-12m12 commercial build. I was indeed seeing it often on my mac and linux machines prior to this week's builds (fixed this week). I will let huang or someone else who'd seen this double-check before marking it verified.
This bug requires leaving the PC idle for a while... I will test it later, since I need to continue the Basic Functionality Test for M12.
Blocks: 22176
Status: RESOLVED → VERIFIED
Whiteboard: [PDT+] Fix ready, patch sent for review. → [PDT+] Verified for all the platforms
Verified on the Linux 12-20-23-M12 final commercial build
Verified on the Mac 12-21-11-M12 final commercial build
Verified on the Linux 12-21-00-M12 final commercial build
I idled for more than 30 minutes without a crash on all the platforms!! Marking as Verified.
No longer blocks: 18471
No longer blocks: 18951
No longer blocks: 20203
No longer blocks: 21564
No longer blocks: 22176
Product: MailNews → Core
Product: Core → MailNews Core