Closed Bug 71515 Opened 24 years ago Closed 24 years ago

Reentrance in js_AllocGCThing which leads to crash in debug version

Categories

(Core :: DOM: Navigation, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla0.9.1

People

(Reporter: shanjian, Assigned: adamlock)

References

()

Details

(Keywords: crash)

Attachments

(2 files)

While I was working on bug 70759, the same reproduce process always leads to assertion and crash in my debug build. This problem has been bothering me for several days. To reproduce the problem: 1. go to http://www.chinesecomputing.com 2. click any link on the home page (optional: click down several links) 3. change the Character Coding selection for the page (e.g., change to Chinese Traditional Big 5) After step 3, the program shows assertion at file jsgc.c, line 448 JS_ASSERT(!rt->gcRunning); I spent some time on this, and posted here for your reference: After step 2, set breakpoint at jsgc.c, 1103, rt->gcRunning = JS_TRUE; Then follow step 3. The program will stop at jsgc.c:1103 twice. The first time it stop there, let it go. The second time it will crash. The reentrance happened in the last item of rt->gcArenaPool.first (around line 1205). It happened inside function "finalizer(cx, thing);".
Marking NEW this has a lot of good info in it.
Severity: blocker → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: crash
Crash still occurring with debug build from 2001-03-12 on WinNT. My steps to reproduce are slightly different than above: 1. Go to http://www.chinesecomputing.com 2. Change the character coding selection for the page to anything new 3. CRASH !!! void * js_AllocGCThing(JSContext *cx, uintN flags) { JSBool tried_gc; JSRuntime *rt; JSGCThing *thing; uint8 *flagp; #ifdef TOO_MUCH_GC js_GC(cx, GC_KEEP_ATOMS); tried_gc = JS_TRUE; #else tried_gc = JS_FALSE; #endif rt = cx->runtime; JS_LOCK_GC(rt); JS_ASSERT(!rt->gcRunning); <<<<<<<<<<<<<<<<<<<<<<<<<<< CRASHED HERE if (rt->gcRunning) { METER(rt->gcStats.finalfail++); JS_UNLOCK_GC(rt); return NULL; } METER(rt->gcStats.alloc++);
Attached file WinNT stack trace of crash (deleted) —
NOTES: 1. This crash only happened for me with debug builds, not the nightly binaries 2. With the nightly binaries, I would occasionally get this messagbox on WinNT: ###!!! ASSERTION: no handler for font package: '0', file d:\mozilla\intl\locale\src\nsFontPackageService.cpp, line 70 3. With the debug build on WinNT, I saw messages like this in the debug console just before the crash: JavaScript error: chrome://global/content/charsetOverlay.js line 126: wnd has no properties JavaScript error: chrome://global/content/charsetOverlay.js line 126: wnd has no properties ###!!! ASSERTION: docshell has null child: 'shell', file d:\mozilla\docshell\base\nsDocShell.cpp, line 153 Charset Overlay menu item pressed: EUC-JP
I think this is not a JS Engine problem; reassigning to Embedding:Docshell. cc'ing Brendan, jband for a more refined analysis on the root cause -
Assignee: rogerl → adamlock
Component: Javascript Engine → Embedding: Docshell
QA Contact: pschwartau → adamlock
Phil, any place where you see this on Win32 debug builds only and with another assertion going the you shoul dsuspect a dup of bug 54792. Look in the stack trace for the nest into the event loop from the NS_ASSERTION with a JS gc further up the stack.
I wasn't sure it was bug 54792 because I didn't see any nsDebug::Assertion anywhere in the stack. Should I mark this as a dupe of 54792, then?
Phil, It is not clear if this bug on the whole is only a dup of bug 54792 or not. I think that the two cases you listed probably are. The stack trace attached to this bug is not conclusive for me. There may be an assert hiding in the transition from the the View dtor into the NT code (that the debugger is not showing). Or something else might be going on that causes it to get into dispatching events. We know there have been some cases where dtors called from JS finalizers have made calls into other objects that then get back into JS and cause this assert. I don't know if that is one of them or just the stupid Windows assert dialog bug.
The assertion happens because destruction of the old set of docshells causes garbage collection to start. This causes the destruction of the associated views and when they destroy or release their parent widget the thread goes into message processing loop. Before the loop ends, a DOM keyboard event is processed from the queue and the assertion occurs as it is turned into a JS object.
John, it does look similar to the other bug. I think the problem here is the message processing loop happening during GC. This makes it possible for DOM events to be handled before the loop ends. DOM events usually end up with a JS object being created which is causing the bug we're seeing because GC is still in progress. I can't think of why there is a message processing loop in the middle of this stack trace, but it could be that the widget is actually a proxy so calling methods on it causes events to be posted and a processing loop during which other things can happen. If so then perhaps we can fix this problem by putting checks into the proxy code so that when the calling thread and the object thread are the same it just makes a straight call rather than using a queue.
Target Milestone: --- → mozilla0.9
Danm saw a similar problem, and testified that Windows just *had* to process a message or two in the middle of a Win32 call that I claimed should not process messages. I am clueless about Windows, and defer to others, but it seems to me that between our code's reentrancy restrictions, and JS's run-to-completion execution model, we need to keep event/message processing confined to well-known loops down the stack from JS and other code activations. Can we? /be
The only time I know that Win32 will enter a message processing loop without explicitly being told to is for a marshalled COM call. Since we don't make such a call anywhere near here (though it could happen around the clipboard / drag & drop code) I'm not not sure what the reason is. If it does occur some other time I don't think there is much we can do about it.
Brendan is talking about bug 69586; a crash with a stack trace like this: js_AllocGCThing(JSContext * 0x00db3460, unsigned int 0) line 448 + 41 bytes ... nsWindow::DispatchFocus(unsigned int 105) line 4112 + 15 bytes ... nsWindow::WindowProc(HWND__ * 0x00030288, unsigned int 7, unsigned int 393444, long 0) line 923 + 27 bytes USER32! 77e13eb0() ... DocumentViewerImpl::~DocumentViewerImpl() line 414 + 97 bytes ... js_GC(JSContext * 0x035a8908, unsigned int 0) line 1218 + 11 bytes ... nsDocShell::CreateContentViewer(nsDocShell * const 0x035b47c0, const char * 0x0012f828, nsIRequest * 0x0482c698, nsIStreamListener * * 0x0012f87c) line 2771 + 32 bytes ... PL_HandleEvent(PLEvent * 0x036924ec) line 586 + 10 bytes PL_ProcessPendingEvents(PLEventQueue * 0x00b57318) line 516 + 9 bytes _md_EventReceiverProc(HWND__ * 0x001e00e6, unsigned int 49442, unsigned int 0, long 11891480) line 1067 + 9 bytes ... nsAppShellService::Run(nsAppShellService * const 0x00b4b770) line 408 Some object is being deleted, Mozilla enters js_GC, the whole window gets torn down, Windows sends a WM_SETFOCUS message directly to some (presumably, other) window, the handler for that event invokes JS. eek. It isn't entering a Windows message processing loop at all. Windows is dispatching Windows events directly. All very legal. All very problematic.
It looks as if focus events should be discarded when a window is being taken down. Does anyone know who did the focus stuff?
CC'ing saari@netscape.com Chris, do you have any ideas on the best way to prevent the focus issues from causing DOM events to be fired to a window during it's destruction?
Event firing during window transitions is a general problem that we haven't really addressed yet. We could certainly lock things down if we notify the presshell that it is about to be destroyed and then proceed with said destruction.
In the stack trace I posted above, the message causing all the trouble is a *get* focus message on a window other than the one being destroyed. Two different PresShells. Locking down event handling on a window while it is being destroyed wouldn't help; in fact, that may be the only window you wouldn't have to lock down. Yar.
It may be possible to stick some extra refcounts on the document viewer to stop it going away right in the middle of GC. Investigating further.
Refcount testing didn't work, moving to 0.9.1. Note also that the bug has no apparant ill-effect on the release version save perhaps for a lost focus event. The crash, or rather assertion only happens in debug mode.
Target Milestone: mozilla0.9 → mozilla0.9.1
Adam: no kungFuDeathGrip luck? The debug-only assertion-crash (JS assertions are always fatal) does mean that in release builds, at least one JS object (or perhaps a string or double, those are GC-allocated too) allocation is failing. This should throw an "out of memory" exception. Any sign of that? If not, perhaps the JS allocation is also somehow dependent on DEBUG. /be
*** Bug 77064 has been marked as a duplicate of this bug. ***
[Win2K] I am seeing this now and then.
Here is a list of things I tried, all of which were in nsDocShell::Destroy: 1. Put a death grip on the docshell's content viewer 2. Move the call to DestroyChildren() to before the destruction of the parent docshell's content viewer 3. Splitting the call to destroy each child docshell into two methods called before and after content viewer destruction. No luck so far. I will work on this some more. I will also investigate whether the loss of the focus DOM event has any repercussions, e.g. JS thinking it's out of memory, focus problems etc. Suggestions are welcome.
Blocks: 77421
My previous attempts at death grips were in the wrong place. Turns out that a single line death grip in nsDocShell::Embed() ensures the content viewer is destroyed *after* GC so this chain of events does not occur. Patch also removes a few extraneous lines related to the nsDocShell::Destroy() method. Can I have review and super-review please?
For ease of reference later, a comment would help here: + // Ensure that the content viewer is destroyed *after* the GC - bug 71515 + nsCOMPtr<nsIContentViewer> kungfuDeathGrip = mContentViewer;
Thanks, I'll add a comment above the line.
this looks good to me - sr=rpotts. you might want to delete the comment at line 3079 which says "don't hold onto a reference to the DocViewer beyond this point" :-) Since that is exactly what this patch is doing :-) I believe that that comment is old and out of date... Does anyone think that the comment is still valid? -- rick
Fix checked in.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: