Closed Bug 71515 Opened 24 years ago Closed 24 years ago

Reentrance in js_AllocGCThing which leads to crash in debug version

Categories

(Core :: DOM: Navigation, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla0.9.1

People

(Reporter: shanjian, Assigned: adamlock)

References

()

Details

(Keywords: crash)

Attachments

(2 files)

While I was working on bug 70759, the same reproduce process always leads to assertion and 
crash in my debug build. This problem has been bothering me for several days.  To reproduce 
the problem:
1. go to http://www.chinesecomputing.com
2. click any link on the home page (optional: click down several links)
3. change the Character Coding selection for the page (e.g., change to Chinese 
Traditional Big 5)

After step 3, the program shows assertion at file jsgc.c, line 448
    JS_ASSERT(!rt->gcRunning);

I spent some time on this, and posted here for your reference:
After step 2, set breakpoint at jsgc.c, 1103, 
    rt->gcRunning = JS_TRUE;
Then follow step 3. The program will stop at jsgc.c:1103 twice. The first
time it stop there, let it go. The second time it will crash. 

The reentrance happened in the last item of rt->gcArenaPool.first (around line 1205).
It happened inside function "finalizer(cx, thing);".
Marking NEW this has a lot of good info in it.
Severity: blocker → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: crash
Crash still occurring with debug build from 2001-03-12 on WinNT.
My steps to reproduce are slightly different than above: 

1. Go to http://www.chinesecomputing.com
2. Change the character coding selection for the page to anything new
3. CRASH !!! 



void *
js_AllocGCThing(JSContext *cx, uintN flags)
{
    JSBool tried_gc;
    JSRuntime *rt;
    JSGCThing *thing;
    uint8 *flagp;

#ifdef TOO_MUCH_GC
    js_GC(cx, GC_KEEP_ATOMS);
    tried_gc = JS_TRUE;
#else
    tried_gc = JS_FALSE;
#endif

    rt = cx->runtime;
    JS_LOCK_GC(rt);
    JS_ASSERT(!rt->gcRunning); <<<<<<<<<<<<<<<<<<<<<<<<<<< CRASHED HERE
    if (rt->gcRunning) {
        METER(rt->gcStats.finalfail++);
        JS_UNLOCK_GC(rt);
        return NULL;
    }
    METER(rt->gcStats.alloc++);
Attached file WinNT stack trace of crash (deleted) —
NOTES:

1. This crash only happened for me with debug builds, not the nightly binaries

2. With the nightly binaries, I would occasionally get this messagbox on WinNT:

   ###!!!  ASSERTION:  no handler for font package: '0', file       
   d:\mozilla\intl\locale\src\nsFontPackageService.cpp, line 70


3. With the debug build on WinNT, I saw messages like this in the 
   debug console just before the crash:

   JavaScript error: 
   chrome://global/content/charsetOverlay.js  line 126:  wnd has no properties

   JavaScript error: 
   chrome://global/content/charsetOverlay.js  line 126:  wnd has no properties

   ###!!!  ASSERTION:  docshell has null child:  'shell',  file       
   d:\mozilla\docshell\base\nsDocShell.cpp,  line 153

   Charset Overlay menu item pressed: EUC-JP

I think this is not a JS Engine problem; reassigning to Embedding:Docshell.
cc'ing Brendan, jband for a more refined analysis on the root cause - 
Assignee: rogerl → adamlock
Component: Javascript Engine → Embedding: Docshell
QA Contact: pschwartau → adamlock
Phil, any place where you see this on Win32 debug builds only and with another 
assertion going the you shoul dsuspect a dup of bug 54792. Look in the stack 
trace for the nest into the event loop from the NS_ASSERTION with a JS gc 
further up the stack.
I wasn't sure it was bug 54792 because I didn't see any nsDebug::Assertion 
anywhere in the stack. Should I mark this as a dupe of 54792, then?  
Phil, It is not clear if this bug on the whole is only a dup of bug 54792 or 
not. I think that the two cases you listed probably are. The stack trace 
attached to this bug is not conclusive for me. There may be an assert hiding in 
the transition from the the View dtor into the NT code (that the debugger is 
not showing). Or something else might be going on that causes it to get into 
dispatching events. We know there have been some cases where dtors called 
from JS finalizers have made calls into other objects that then get back into JS 
and cause this assert. I don't know if that is one of them or just the stupid 
Windows assert dialog bug.
The assertion happens because destruction of the old set of docshells causes 
garbage collection to start. This causes the destruction of the associated views 
and when they destroy or release their parent widget the thread goes into 
message processing loop. Before the loop ends, a DOM keyboard event is processed 
from the queue and the assertion occurs as it is turned into a JS object.
John, it does look similar to the other bug.

I think the problem here is the message processing loop happening during GC. 
This makes it possible for DOM events to be handled before the loop ends. DOM 
events usually end up with a JS object being created which is causing the bug 
we're seeing because GC is still in progress.

I can't think of why there is a message processing loop in the middle of this 
stack trace, but it could be that the widget is actually a proxy so calling 
methods on it causes events to be posted and a processing loop during which 
other things can happen.

If so then perhaps we can fix this problem by putting checks into the proxy code 
so that when the calling thread and the object thread are the same it just makes 
a straight call rather than using a queue.
Target Milestone: --- → mozilla0.9
Danm saw a similar problem, and testified that Windows just *had* to process a
message or two in the middle of a Win32 call that I claimed should not process
messages.

I am clueless about Windows, and defer to others, but it seems to me that
between our code's reentrancy restrictions, and JS's run-to-completion execution
model, we need to keep event/message processing confined to well-known loops
down the stack from JS and other code activations.  Can we?

/be
The only time I know that Win32 will enter a message processing loop without 
explicitly being told to is for a marshalled COM call. Since we don't make such 
a call anywhere near here (though it could happen around the clipboard / drag & 
drop code) I'm not not sure what the reason is.

If it does occur some other time I don't think there is much we can do about 
it. 
Brendan is talking about bug 69586; a crash with a stack trace like this:

js_AllocGCThing(JSContext * 0x00db3460, unsigned int 0) line 448 + 41 bytes
...
nsWindow::DispatchFocus(unsigned int 105) line 4112 + 15 bytes
...
nsWindow::WindowProc(HWND__ * 0x00030288, unsigned int 7, unsigned int 393444, 
long 0) line 923 + 27 bytes
USER32! 77e13eb0()
...
DocumentViewerImpl::~DocumentViewerImpl() line 414 + 97 bytes
...
js_GC(JSContext * 0x035a8908, unsigned int 0) line 1218 + 11 bytes
...
nsDocShell::CreateContentViewer(nsDocShell * const 0x035b47c0, const char * 
0x0012f828, nsIRequest * 0x0482c698, nsIStreamListener * * 0x0012f87c) line 2771 
+ 32 bytes
...
PL_HandleEvent(PLEvent * 0x036924ec) line 586 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x00b57318) line 516 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x001e00e6, unsigned int 49442, unsigned int 0, 
long 11891480) line 1067 + 9 bytes
...
nsAppShellService::Run(nsAppShellService * const 0x00b4b770) line 408

Some object is being deleted, Mozilla enters js_GC, the whole window gets torn 
down, Windows sends a WM_SETFOCUS message directly to some (presumably, other) 
window, the handler for that event invokes JS. eek.

It isn't entering a Windows message processing loop at all. Windows is 
dispatching Windows events directly. All very legal. All very problematic.
It looks as if focus events should be discarded when a window is being taken 
down. Does anyone know who did the focus stuff?
CC'ing saari@netscape.com

Chris, do you have any ideas on the best way to prevent the focus issues from 
causing DOM events to be fired to a window during it's destruction?
Event firing during window transitions is a general problem that we haven't
really addressed yet. We could certainly lock things down if we notify the
presshell that it is about to be destroyed and then proceed with said destruction.
In the stack trace I posted above, the message causing all the trouble is a *get* 
focus message on a window other than the one being destroyed. Two different 
PresShells. Locking down event handling on a window while it is being destroyed 
wouldn't help; in fact, that may be the only window you wouldn't have to lock 
down. Yar.
It may be possible to stick some extra refcounts on the document viewer to stop 
it going away right in the middle of GC. Investigating further.
Refcount testing didn't work, moving to 0.9.1. 

Note also that the bug has no apparant ill-effect on the release version save 
perhaps for a lost focus event. The crash, or rather assertion only happens in 
debug mode.
Target Milestone: mozilla0.9 → mozilla0.9.1
Adam: no kungFuDeathGrip luck?

The debug-only assertion-crash (JS assertions are always fatal) does mean that
in release builds, at least one JS object (or perhaps a string or double, those
are GC-allocated too) allocation is failing.  This should throw an "out of
memory" exception.  Any sign of that?  If not, perhaps the JS allocation is also
somehow dependent on DEBUG.

/be
*** Bug 77064 has been marked as a duplicate of this bug. ***
[Win2K] I am seeing this now and then.
Here is a list of things I tried, all of which were in nsDocShell::Destroy:

1. Put a death grip on the docshell's content viewer
2. Move the call to DestroyChildren() to before the destruction of the parent 
docshell's content viewer
3. Splitting the call to destroy each child docshell into two methods called 
before and after content viewer destruction.

No luck so far.

I will work on this some more. I will also investigate whether the loss of the 
focus DOM event has any repercussions, e.g. JS thinking it's out of memory, 
focus problems etc. Suggestions are welcome.
Blocks: 77421
My previous attempts at death grips were in the wrong place. Turns out that a 
single line death grip in nsDocShell::Embed() ensures the content viewer is 
destroyed *after* GC so this chain of events does not occur.

Patch also removes a few extraneous lines related to the nsDocShell::Destroy() 
method.

Can I have review and super-review please?
For ease of reference later, a comment would help here:

+     // Ensure that the content viewer is destroyed *after* the GC - bug 71515
+     nsCOMPtr<nsIContentViewer> kungfuDeathGrip = mContentViewer;
Thanks, I'll add a comment above the line.
this looks good to me - sr=rpotts.

you might want to delete the comment at line 3079 which says "don't hold onto a 
reference to the DocViewer beyond this point" :-)  Since that is exactly what 
this patch is doing :-)

I believe that that comment is old and out of date...  Does anyone think that 
the comment is still valid?

-- rick
Fix checked in.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: