Closed Bug 51954 Opened 24 years ago Closed 24 years ago

Random crashes in id lookup

Categories

(Core :: JavaScript Engine, defect, P2)

defect

Tracking

()

VERIFIED FIXED

People

(Reporter: pschwartau, Assigned: brendan)

References

Details

(Keywords: crash, Whiteboard: [nsbeta3+][rtm++] FIX IN TRUNK, READY FOR RTM)

Attachments

(3 files)

Using JS shell built 2000-09-08 on WinNT, Linux, and MacOS9. The above testcase now crashes on Linux and Mac. On WinNT: doesn't crash, but takes 28 seconds to complete. Here is the part of the testcase that seems to cause the crash, with a debugging "print" added: function addTestCase( t ) { for ( d = 0; d < TimeInMonth(MonthFromTime(t)); d+= msPerDay ) { t += d; testcases[tc++] = new TestCase( SECTION, "(new Date("+t+")).getDate()", DateFromTime(LocalTime(t)), (new Date(t)).getDate() ); } print ("t is " + t); exit (3); } Here is what appeared in the console: t is -62132140800000 and then: CRASH! Will attach Linux stack trace below...
Attached file Linux stack trace - (deleted) —
Keywords: crash
I saw that the stack trace included global_resolve, which is defined in js.c. On a hunch, I did blame, and found code there bracketed in #ifdef LAZY_STANDARD_CLASSES that uses the new lazy class initialization APIs. (bug 46703) If I comment out #define LAZY_STANDARD_CLASSES earlier in the file, the test runs to completion. It still takes a long time, so I don't think that's part of the problem. The trace also includes 'StdNameToAtom (cx=0x80d7198, name=0x80cd3fc "escape")' ... so it seems that we're trying to resolve 'escape' when we run into trouble. 'escape' may have gone away recently as a fix to 42221, so it seems likely that this problem arises from an interaction between the fixes to 42221 and 46703. CCing Brendan for a look.
I've been running the testsuite successfully for a while, during and long after the LAZY_STANDARD_CLASSES stuff was committed. mccabe, which pointer is bad, and can you find out why? I'll look too, in a bit. /be
Brendan - Are you running the testsuite from the standalone JS shell? It's the missing (or possibly not entirely gone) 'escape' function that I suspect.
Yes, I'm running js/tests/jsDriver.pl -e smdebug -L lc2 lc3 -t and getting no crashes. At least I did two or three days ago. How do you make it crash? And did you debug around a bit to see what the bad address was, and whence it came? /be
I'm getting an assert failure now... [rginda@swamp tests] perl jsDriver.pl -e smdebug -l ecma/Date/15.9.5.10-2.js -f /dev/null -k -#- Executing 1 test(s). *-* Testcase ecma/Date/15.9.5.10-2.js failed: Expected exit code 0, got 134 Testcase terminated with signal 0 Complete testcase output was: Assertion failure: first, at jscntxt.c:85 -#- Wrote results to '/dev/null'. -#- 1 test(s) failed
ok, I guess I needed to make clean. I don't see any failure in the date testcase anymore, but instead, five other segfaults show up under the ecma suite. I'll go discuss this with phil and try to sort it out.
I'm testing on Linux with the standalone js engine. Executing 'load("../jsref.js");load("15.9.5.10-2.js")' in the tests/ecma/Date directory gets me a crash. I think it has to do with overflowing the GC heap before a GC can occur, but I'm still looking. Brendan, sorry for the poorly founded suppositions earlier.
I'm still seeing this. It looks like it was introduced with the LAZY_STANDARD_CLASSES checkin of 8/19/2000. (Verified via before-and-after checkouts.) I'm still using the same recipe to reproduce. It looks like something elusive, as making the testcase generate fewer objects (it uses many) or using the proper "shell.js" test init file rather than "jsref.js" hides the problem. In the JS shell, load("../jsref.js"); load("15.9.5.10-2.js"); foo; (where foo is undefined) crashes. The crash occurs when the engine tries to look up "foo" via js.c's global_resolve; this runs into the JS_ResolveStandardClass code. This bombs when trying to atomize some of the standard class names to compare to "foo", as the hashtable entries for some standard names ("escape") point to garbage. Just why, I don't know.
Summary: Testcase mozilla/js/tests/ecma/Date/15.9.5.10-2.js failing → Random crashes in id lookup
Try purify. Any GC'ing going on? A breakpoint will tell. /be
mccabe and I spent some time looking at this. Purify did not gather any extra info - it crashed in the same place but no error or warnings preceeded that. What we *did* do was find the hashtable entry that contains a garbage string. We set a data breakpoint for that data changing and let it rip. We then got a stack trace for the location where that JSString - that *should* be rooted in the atom table - is free'd. This is before the point when it will later crash. This free happens in a finalization in gc. The *really* interesting thing is that this is a gc invoked from the "last ditch" gc in js_AllocGCThing! This links us to the problem in the marking phase of the same sort of "last ditch" gc we see in bug 53123. I'll attach the stack of the gc location.
Oh My Gawd. This is an ancient bug in atom mark/sweep GC. Patch coming up. I can't believe this wasn't found sooner. The gist of it is that only pinned atoms have their keys marked, and therefore not swept. Worse, we rely on marking pinned atoms (not their keys, mind you) to keep the pinned atoms (again, not their keys) from being swept. Argh! /be
Keywords: patch, review
Wait, it's not as bad as I said -- I was on goof gas. The problem is that the atom mark function wants to mark only pinned atoms, and it was doing that, but it was failing to mark all atom's keys. This too seems right: some atoms are garbage, and so are their keys. If a garbage atom has a non-garbage key, the key will be marked otherwise (via a live object's strong ref to that string jsvsal, say). Now, in the "last ditch" case where js_AllocGCThing calls js_GC, we won't sweep any atoms (the GC_KEEP_ATOMS gcflag is passed to js_GC) because atoms are never rooted (they aren't jsvals, too bad; old old design decision). But note well! Atoms are not swept, but their keys may be, because we have failed to mark those keys (such as the string "escape"), and no other ref may exist to the key. Patch still coming up, but this isn't as dire as I thought in my goof-gas state: it afflicts only the last-ditch GC attempt in js_AllocGCThing. /be
Attached patch proposed fix (deleted) — Splinter Review
mccabe, jband: you guys deserve the credit for leading this horse to water and making him drink. Gimme some testing, r= and a= love, ok? beard: please get me nsbeta3++ approval once the r's and a's are lined up. /be
Assignee: rogerl → brendan
Severity: normal → major
Keywords: nsbeta3, rtm
Priority: P3 → P2
Target Milestone: --- → M18
Status: NEW → ASSIGNED
r=mccabe, looks good and makes sense. The testcase now works OK. Beard is out of town, so another ++er is needed.
Looking for a=jband, need trudelle to play beard here. Peter, this is not a topcrash bug that we know of, but it should be rtm++ at least. I'm concerned that it and 53123-reopen (which is topcrash) are accounting for other topcrash bugs that haven't been diagnosed. I'd like to get the fix in for nsbeta3 too, therefore -- less noise for the remaining bug signal to rise above. /be
a=jband. This looks right to me. The crasher test now runs ok. I can imagine this bug causing some havoc.
Only PDT can double-plus, but I'll give it an nsbeta3+/rtm+ to get on their radar, and some persuasive whiteboard verbiage. To get it into nsbeta3, someone will have to plead the merits to PDT, and given the lateness, it will have to be on the basis of known horrible effect on the user (i.e. P1), preferably combined with a simple/safe/well-tested fix. I don't see the evidence here yet. Noise argument can also be used as a reason *not* to change nsbeta3 bits unnecessarily.
Whiteboard: [nsbeta3+][rtm+] FIX READY TO LAND
"Noise argument can also be used as a reason *not* to change nsbeta3 bits unnecessarily." Certainly, and that's the PDT's call. But fix-in-hand should mean something. The fix here is not going to make things worse. /be
Fix checked into trunk. /be
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Oops, I should keep this open so the trunk changes get onto the Netscape 6 branch after PR3, for RTM. /be
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [nsbeta3+][rtm+] FIX READY TO LAND → [nsbeta3+][rtm+] FIX IN TRUNK, READY FOR RTM
Status: REOPENED → ASSIGNED
PDT marking [rtm++]
Whiteboard: [nsbeta3+][rtm+] FIX IN TRUNK, READY FOR RTM → [nsbeta3+][rtm++] FIX IN TRUNK, READY FOR RTM
Fix is in the branch too, for RTM. /be
Forgot to close this! /be
Status: ASSIGNED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
Marking Verified -
Status: RESOLVED → VERIFIED
*** Bug 174871 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: