Closed
Bug 83367
Opened 24 years ago
Closed 24 years ago
Crash [@ XPCWrappedNative::FlatJSObjectFinalized] with LDAP datasource
Categories
(Core :: XPConnect, defect)
Tracking
()
VERIFIED
FIXED
mozilla0.9.2
People
(Reporter: peterv, Assigned: jband_mozilla)
References
Details
Attachments
(7 files)
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
text/plain
|
Details |
I've been working on the LDAP datasource (which is implemented in JS). With my
newest version (patch available in 83360), I always get a crash on quitting
after I've loaded our XUL example. I can't tell what causes this crash, my new
code for the datasource or that I switched the example to outliner. I (or dmose)
can provide instructions on how to build the datasource and example if you let
me know what platform you want to debug this on.
I'm not too sure about the right component for this bug, it seems related to
XPConnect and to the JS GC. I'll attach a stack trace for the crash.
Reporter | ||
Comment 1•24 years ago
|
||
Comment 2•24 years ago
|
||
jband, I think this is the one that you helped me track down some months ago but
I was too lame to actually file the bug on. You and Brendan had a rather
involved discussion about finalization, and (aybe) XBL, and came to the
conclusion that the fix would be some work, but should be done eventually. Hope
this rings a bell...
Assignee | ||
Comment 3•24 years ago
|
||
peterv: David Bradley is the primary problem solver for xpconnect now. Like me
he prefers windows. It would be good to see what line this crashes on - and
which pointer(?) is bad. I'll help him if I can.
dmose: I remember working with you then. The conversation with brendan was about
the problems with code paths that go from gc to a finalizer to code that ends up
calling js_AllocGCThing - and asserting. What I can't remember is what your
specific problem was that got us talking about that.
Assignee: jband → dbradley
Comment 4•24 years ago
|
||
The specific problem was exactly this: quitting the browser while the LDAP
datasource was being held by XUL.
Reporter | ||
Comment 5•24 years ago
|
||
http://lxr.mozilla.org/seamonkey/source/js/src/xpconnect/src/xpcwrappednative.cpp#842
is the line where it fails, obj seems to be bad. I can't really tell you much
about obj though (Codewarrior can't handle pointers to interfaces :().
Reporter | ||
Comment 6•24 years ago
|
||
I can't try this out right now (no win box at home), but these should be the
steps to reproduce:
apply my patch from bug 83360 (attachment 36531 [details] [diff] [review])
set ENABLE_LDAP_EXPERIMENTAL=1
set DISABLE_TESTS=
cd mozilla\directory\xpcom
nmake /f makefile.win
start the browser
open chrome://communicator/content/ldapviewer/example.xul
wait a bit for the data to load
quit the browser, this is where you should crash
dmose, please correct me if I'm wrong.
Assignee | ||
Comment 7•24 years ago
|
||
hmm. It is certainly possible that xpconnect is messing up its reference
counting here. But, I'd bet that some other code has released a reference that
xpconnect owns; i.e. some other code has more Release calls than AddRef calls on
the same object that xpconnect is referencing so that the object has been
deleted even though xpconnect thinks it is holding a reference. Inspection in a
more helpful debugger can clear this up.
Comment 8•24 years ago
|
||
Sorry for the delay, thanks jband for jumping in. jband, did you attempt his
test? Just curious. I would think if xpconnect had some kind of ref' counting
issue it would be rearing its head in many other places. Not to discount the
possibility. jband, if you haven't run the test, I can give it a try. Maybe I'll
be able to get more information that will point to source. I've got to do a full
build before I can do any of this though, so it will take a couple of hours
before I see any results. jband, if you've already tried his patch let me know,
and I won't bother.
Status: NEW → ASSIGNED
Comment 9•24 years ago
|
||
I was able to duplicate the problem and I'll attach a stack trace. Hopefully
this will provide more clues.
Comment 10•24 years ago
|
||
Assignee | ||
Comment 11•24 years ago
|
||
Ah, so it is coming back to call js_AllocGCThing during a gc. This is just not
allowed. Perhaps a *big* chunk of the stack was just missing in the Mac stacks
from peterv? Peter can this be true? Or is there somehow sometimes a problem
with a bad pointer in that same location? Seems unlikely.
I have not tried the test case. But the windows stack says a lot.
I guess this was the thing dmose and I looked at. I don't see what can be done
off hand. This is why brendan and I got into a 'thowing up hands' discussion
back then. This code is doing something pretty darn reasonable in calling an
observer in its Release handling. It can't be expected to know that the call
will result in a call via xpconnect requiring an JS object allocation in order
to build a wrapper. I see lots of things wrong with every solution I can think
of to this general problem. This is where I tried to convince Brendan that the
JS engine *should* allow this - even if it means restarting gc. Yet, we came up
with plenty of problems with that.
Assignee | ||
Comment 12•24 years ago
|
||
jst and I were talking about this. Though the general solution for all
embeddings of the js engine might be elusive, the paths to this problem through
xpconnect could be dealt with by queueing up and deferring the release calls in
xpconnect until the JSGC_END event. The downside of this is that it would
require the allocation of memory to track the queued object pointers during gc -
which might sometimes have been triggered because of a low memory condition.
This seems unlikely to be a big problem in the browser though.
Comment 13•24 years ago
|
||
jband, I guess we could hope that GC already reclaimed memory before it got to
this point, but that's like roling the dice. Would it be safe to preallocate a
small amount of memory to hold the list? I doubt this list would need to be very
big, would it? How much data would we need to store per object, just a pointer?
Comment 14•24 years ago
|
||
Over a full GC cycle I guess it could grow quite large, how likely is that to
happen vs being out of memory?
Reporter | ||
Comment 15•24 years ago
|
||
Weird. I did get that other stack trace twice, but now I (again) always get the
one I posted.
Comment 16•24 years ago
|
||
Peter, you running release or debug?
Comment 17•24 years ago
|
||
Oh, I guess that's pretty self explanatory from the info provided in the stack
trace. In any case, I find it interesting that the stack you posted terminated
in the XPCWrappedNative::FlatJSObjectFinalized, while mine terminated within the
JS engine.
Could we be seeing a combination of two problems here?
Comment 18•24 years ago
|
||
Ok, to sum things up, we have it crashing in one place on Windows and another on
Mac.
The Mac version is crashing because the obj pointer returned from to->GetNative
is invalid.
The Windows version is crashing because of the JS_ASSERT being fired because an
object is being allocated during a GC cycle.
Peter, you said that you're consistantly getting the bomb on the Mac at the
location indicated by the stack you posted, but had gotten a bomb in another
location, but can't reproduce it. Is that correct?
Also, my current code is almost a week old. I'm going to update it and see if
the result changes. If that allows me to reproduce the fault at the same point
as you, that might indicate a reference counting issue outside of xpconnect, as
I believe I have most of the recent xpconnect changes. Under Windows I might be
able to get more information and determine the source.
Assignee | ||
Comment 19•24 years ago
|
||
I have a patch to support deferring the Release calls on the native xpcom
objects of gc'd wrappednatives until after the gc is done. This can be enabled
at runtime via a method on nsIXPConnect. In my patch I turn this option on in a
call from the nsJSEnvironment ctor. I also added support for telling xpconnect
that gc is only being run on the main thread. This allows xpconnect to omit some
of its locking. Actually, the flag tells xpconnect to itself force gc to happen
on the main thread. So, we could remove the now redundant GCCallback in
nsJSEnvironment if we choose.
This code is working for me. The instrumentation show that most gcs have only
tens of deferred Releases. Some have up to 500 or so. The most I've seen was
1700 on shutdown. I left in code that will do an immediate Release if we fail to
grow the nsVoidArray that holds the deferred Release pointers when adding a
pointer for later Release.
I have not tried this code on the specific test case here. I really wonder if
there are in fact two separate crashes here. I doubt it.
Can someone please try my patch and we can decide if we want it checked in.
Assignee | ||
Comment 20•24 years ago
|
||
Comment 21•24 years ago
|
||
John, I had a look through your patch (didn't try it yet, I'll leave that to
peterv :-) and your changes look good to me, the only change you might consider
is to change:
+ JSBool mMainThreadOnlyGC;
+ JSBool mDeferReleases;
+ JSBool mDoingFinalization;
to use PRPackedBool (JSPackedBool?) to save a whopping 8 byets in the singleton
XPCRuntime :-)
sr=jst should you need it (and assuming it actually fixes the problem ;-).
Assignee | ||
Comment 22•24 years ago
|
||
Thanks for looking at this Johnny. If this were not a singleton I'd opt to go
with your suggestion about the PRPackedBool. But, I'm betting that any given
compiler is going to add more than 8 bytes of additional code just to mask
access to the data in the various inlined accessors/mutators.
Comment 23•24 years ago
|
||
I installed your patch jband. I'm now getting an access violation in the
NS_RELEASE(obj); statement in GCCallback. The memory obj is pointing at, is hex
DD which is what VC++ usually fills delete memory with. I'll look further into
this in the morning, and check back here for any further information. I'll
attach a stack trace.
Comment 24•24 years ago
|
||
Assignee | ||
Comment 25•24 years ago
|
||
OK. So there were two bugs. As it happens both were associated with the same
code in RDF.
The first bug was the indirect call into js_AllocGCThing during gc. (see
dbradley's first stack trace). My xpconnect deferred Release stuff deals with
that.
The other bug was the one peterv originally saw and which shows up in dbradley's
second stack trace. I had to insert instrumentation in xpconnect to find it.
(Note: I think we want the xpconnect deferred Release code, but it does make it
a little harder to find the xpconnect wrapper data associated with a bad native
pointer).
Anyway, the problem is in CompositeDataSourceImpl::Release
http://lxr.mozilla.org/seamonkey/search?string=CompositeDataSourceImpl%3A%3ARel
In the "else if" case the code makes a call out to ds->RemoveObserver(this) and
then immediately does a NS_DELETEXPCOM(this). This pattern ASSUMES that no one
will AddRef the 'this' during the interim.
Now, I know what was intended, but this just ain't right. It crashes because the
ds is implemented in JS and so we automatically build an xpconnect wrapper
around the passed 'this' and that wrapper quite reasonably does an AddRef and
expects that AddRef to pin the object down. But this code deletes the object
anyway. So we later crash when we dereference that pointer.
Ironically, it was this same code requiring wrapper creation during a Release
that caused the other problem during GC. I should have found this faster since
I'd seen this code :)
So what should be done to fix this?
It looks to me like insead of calling NS_DELETEXPCOM(this) you could base a
solution on a "return Release();" Since you've already pumped up refcount then
this would normally go through the "decrement to zero" case. But if an outsider
had added a ref then you'd just decrement the count and stay alive.
However, you have the problem that you'd then also need to be clearing the
mDataSources array after iterating it and before returning. But this brings up
the ugly question: what if the mDataSources array is modified as you are calling
out to ds->RemoveObserver(this) in the loop? What is the underlying model here?
Also, you are doing a bunch of Atomic manipulation. This implies you think
you'll be accessed on other threads. But, if so, how can you get away with
iterating the array without any sort of locking?
I'll attach a patch for consideration. It ignores the multi-threaded risk,
but does deal with the case of changes to the mDataSources array during the
call out. I'm not convinced it does not leak :( But I think it is right. I'm at
a loss to know what to do about the NS_LOG_RELEASE call in this case.
Anyway, it does work for me using the test cited above (and with my xpconnect
patch in place to avoid the JS_ASSERT for the alloc during gc problem).
I'm reassigning to waterson for comment and whatever.
Assignee: dbradley → waterson
Status: ASSIGNED → NEW
Assignee | ||
Comment 26•24 years ago
|
||
Comment 27•24 years ago
|
||
With this patch the adding of count+1 should no longer be needed since the
members of mDataSources are removed from the array inside the loop, so I think
the PR_AtomicAdd at the top could just be a PR_AtomicIncrement and the
PR_AtomicDecrement added could be removed.
I don't think the mDataSources manipulation should need to be threadsafe since
all the remaining pointers to the composite datasource should be to the
nsIRDFObserver interface, and none of the methods on that interface (except for
the Release method!) manipulate mDataSources. (I don't think anyone should ever
QI from nsIRDFObserver to nsIRDFDataSource, should they? Even so, should it be
prevented by implementing the two on different objects?)
Looking at this patch, I can't think of any way it could leak or reenter the
loop in the Release method.
Comment 28•24 years ago
|
||
jband, your patch looks fine. r=waterson
Assignee | ||
Comment 29•24 years ago
|
||
Assignee | ||
Comment 30•24 years ago
|
||
New patch does as dbaron suggests. So I guess the refcount instrumentation
actually balances out.
waterson: still look good to you? Or did I make a mistake I'm not seeing?
Comment 31•24 years ago
|
||
Yup, looks okay. (/me is wishing that weak references had been invented when I
wrote this the first time.)
Assignee | ||
Comment 32•24 years ago
|
||
OK. I'll take this one back and try to do the legwork to get both patches in
0.9.2.
peterv: Do these patches fix the problem for you?
Assignee: waterson → jband
Target Milestone: --- → mozilla0.9.2
Assignee | ||
Updated•24 years ago
|
Hardware: Macintosh → All
Comment 33•24 years ago
|
||
After applying the patch I'm getting another similar crash (accessing deleted
memory) at a different location.
I'll attach a stack trace and investigate further.
Comment 34•24 years ago
|
||
Assignee | ||
Comment 35•24 years ago
|
||
What activities did you perform before the crash? Did you rebuild the rdf dll?
(this requires rebuilding in rdf/build too).
Status: NEW → ASSIGNED
Assignee | ||
Comment 36•24 years ago
|
||
My tree has no code at xpcjsruntime.cpp line 638. Your stack implies that
mDyingWrappedNativeProtoMap is whacked. Are you sure xpconnect was fully
rebuilt?
Comment 37•24 years ago
|
||
r=dbradley
It's on my end, a patch botched the xpcjsruntime.cpp file, the deletion of
mDyingWrappedNativeProtoMap got duplicated when I applied the patch to defer
release calls on natives of wrappednatives.
It is working fine now and changes look good
Comment 38•24 years ago
|
||
Did you intend to include that code with this patch?
Assignee | ||
Comment 39•24 years ago
|
||
Oops. Sorry. That would be your patch to bug 82274. Yes, we might as well check
it in along with this. It is reviewed and 100% safe and sane.
Comment 40•24 years ago
|
||
if the super review in the first patch applies to the better fix then a=
asa@mozilla.org for checkin to the trunk.
(on behalf of drivers)
Assignee | ||
Comment 41•24 years ago
|
||
Both patched were checked into the trunk.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 42•24 years ago
|
||
peterv, does this fix things? If so, would you be able to mark this bug
VERIFIED, thanks -
Reporter | ||
Comment 43•24 years ago
|
||
Verified fixed. Thanks all for the very quick turnaround :).
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•