Closed Bug 188744 Opened 22 years ago Closed 22 years ago

a way to reproduce a fastload hang

Categories

(Core :: XUL, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jrgmorrison, Assigned: bugs)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

This will reproduce a hang in fastload code (and Matti could also do this in win32/debug, which is where I first did this too. Platform or debug dependency? Don't know.) However, it occurred to me that this is not quite like the other reported hangs, where the actual file's footer is apparently corrupt. This hang happens when starting from a (supposedly) valid fastload file. So, I filed this as a separate bug. 1) delete XUL.mfl 2) start Navigator; Quit 3) start Navigator; do 'File->New->Message'; close compo window; Quit 4) start Navigator; click on the Messenger icon 5) Three pairs of assertions a) "redundant multiplexed document?: 'docMapEntry->mString == nsnull" in StartMuxedDocument b) "unknown aURI passed to EndMuxedDocument!: 'uriMapEntry->mDocMapEntry" in EndMuxedDocument for documents: i) chrome://messenger/content/widgetglue.js ii) chrome://messenger/content/accountUtils.js iii) chrome://messenger/content/mail-offline.js Note: other documents are loading at the same time but eventually the hang kicks in. (I don't know when exactly it starts, but the fact that I broke when loading msgHdrViewOverlay.xul is just coincidence. [i.e., it took me a while to attach the debugger and break, with all the malloc and disk thrash going on]. 6) Then ... Hang. Allocating huge amounts of memory. Break in debugger (after some period of time). This is the stack (below). In ChangeTable, oldCapacity is 2,097,152 (which means new table will be ~80MB) Up in nsFastLoadFileWriter::StartMuxedDocument docMapEntry->mString == chrome://messenger/content/msgHdrViewOverlay.xul docMapEntry->mInitialSegmentOffset == 0 docMapEntry->mCurrentSegmentOffset == 0 ---------------------------------------- ChangeTable(PLDHashTable * 0x03611edc, int 1) line 456 + 3 bytes PL_DHashTableOperate(PLDHashTable * 0x03611edc, const void * 0x038552e4, int 1) line 514 + 13 bytes nsFastLoadFileWriter::StartMuxedDocument(nsFastLoadFileWriter * const 0x03611e54, nsISupports * 0x038552e0, const char * 0x0012f384) line 1354 + 25 bytes nsFastLoadService::StartMuxedDocument(nsFastLoadService * const 0x00fed4c8, nsISupports * 0x038552e0, const char * 0x0012f384, int 2) line 288 + 31 bytes nsXULPrototypeCache::StartFastLoadingURI(nsIURI * 0x038552e0, int 2) line 733 + 36 bytes nsXULPrototypeCache::WritePrototype(nsXULPrototypeCache * const 0x014d2a08, nsIXULPrototypeDocument * 0x039cfce8) line 666 + 19 bytes nsXULDocument::EndLoad(nsXULDocument * const 0x03643bb8) line 1690 XULContentSinkImpl::DidBuildModel(XULContentSinkImpl * const 0x039d7928, int 0) line 485 nsExpatDriver::DidBuildModel(nsExpatDriver * const 0x039d7ec8, unsigned int 0, int 1, nsIParser * 0x039d79e0, nsIContentSink * 0x039d7928) line 972 + 23 bytes nsParser::DidBuildModel(unsigned int 0) line 1283 + 41 bytes nsParser::ResumeParse(int 1, int 1, int 1) line 1832 nsParser::ContinueParsing() line 1390 + 19 bytes CSSLoaderImpl::SheetComplete(SheetLoadData * 0x039d3ae8, int 1) line 1800 CSSLoaderImpl::ParseSheet(nsIUnicharInputStream * 0x039e02c0, SheetLoadData * 0x039d3ae8, int & 1) line 1733 SheetLoadData::OnStreamComplete(SheetLoadData * const 0x039d3ae8, nsIUnicharStreamLoader * 0x039d3fa8, nsISupports * 0x00000000, unsigned int 0, nsIUnicharInputStream * 0x039e02c0) line 1124 + 23 bytes nsUnicharStreamLoader::OnStopRequest(nsUnicharStreamLoader * const 0x039d3fac, nsIRequest * 0x039d6028, nsISupports * 0x00000000, unsigned int 0) line 187 nsJARChannel::OnStopRequest(nsJARChannel * const 0x039d602c, nsIRequest * 0x039d61e4, nsISupports * 0x00000000, unsigned int 0) line 606 + 49 bytes nsOnStopRequestEvent::HandleEvent() line 213 nsARequestObserverEvent::HandlePLEvent(PLEvent * 0x039cadac) line 116 PL_HandleEvent(PLEvent * 0x039cadac) line 663 + 10 bytes PL_ProcessPendingEvents(PLEventQueue * 0x00f57c00) line 593 + 9 bytes _md_EventReceiverProc(HWND__ * 0x015403d0, unsigned int 49396, unsigned int 0, long 16088064) line 1379 + 9 bytes ---------------------------------------- I then let it proceed to the next document (msgHdrViewAddresses.js). StartMuxedDocument, in do the PL_DHASH_ADD forced another ChangeTable [now to ~160MB (on a 128MB RAM machine no less. Ew, painful :-)]. And at that point, I shut it down, because this had gone wrong way too long ago to really tell what had happened, and anyways, I didn't have the horsepower to attempt the next (320MB) ChangeTable ;-). I put back the last known good XUL.mfl (saved after step 3 above), and when I broke earlier in the hang, I noticed that mURIMap.entryCount had a value of 4294967294 (or 0xfffffffe). I'm not sure what the trigger for this mistake is, though. It happens during a load of a document subsequent to the first pair of assertions (i.e., for a different doc than the asserts were for).
The mURIMap gets broken (shorthand explanation) in nsFastLoadFileWriter::EndMuxedDocument when it is passed an unknown aURI from nsXULDocument::OnStreamComplete. Well, actually, it's not unknown, it's "chrome://messenger/content/widgetglue.js", but mURIMap has zero entries at this moment [because we didn't add 'widgetglue.js' to mURIMap in nsFastLoadFileWriter::StartMuxedDocument when we bailed early because we thought we were dealing with a redundant entry in the mDocumentMap]. So, in nsFastLoadFileWriter::EndMuxedDocument, with our empty mURIMap and a zeroed out nsURIMapWriteEntry* entry, we call PL_DHashTableRawRemove. And that clears and frees the entry, and decrements (underflows)table->entryCount. After that, every PL_DHASH_ADD decides that it needs more capacity, and the doubling continues until we run out of memory. (Hmm, wait. Or possibly, for a smaller xul dialog we don't run out of memory and eventually post our suspect tables to disk in a format that [in some cases] guarantees failure on the next startup! That's the case where I've been collecting the fastload files and seeing the bad entries in the header and footer of the fastload files. Also, some people have reported an episode where the app grinds away for a while, and then after that everything apparently works OK. [I got a private email about that today].). I guess we need to fix the case in PL_DHashTableRawRemove above, but I think the problem still lies above in the load of the XUL documents that trigger the load of 'widgetglue.js'. What I see happening is that we first try to read in 'messenger.xul' from the fastload file. We find that it isn't there, so we kick off an asynch load of that document. We get the callback on this stack: nsFastLoadFileReader::StartMuxedDocument() line 406 nsFastLoadService::StartMuxedDocument() line 257 + 31 bytes nsXULPrototypeCache::StartFastLoadingURI() line 733 + 36 bytes nsXULPrototypeCache::GetPrototype() line 260 + 14 bytes nsXULDocument::StartDocumentLoad() line 702 + 57 bytes nsContentDLF::CreateRDFDocument() line 528 + 47 bytes nsContentDLF::CreateInstance() ... where we have a wrapped reader, but we really want to be writing, not reading. So we bail out of nsFastLoadFileReader::StartMuxedDocument with NS_ERROR_NOT_AVAILABLE, but now we load the stream and put it in the xul-cache. Then (I'm probably omitting a lot of steps here) we wind up back in fastload code, calling nsFastLoadFileWriter::StartMuxedDocument for 'widgetglue.js', on this stack. nsFastLoadFileWriter::StartMuxedDocument() line 1322 nsFastLoadService::StartMuxedDocument() line 288 + 31 bytes nsXULDocument::OnStreamComplete() line 5805 + 59 bytes nsStreamLoader::OnStopRequest() line 144 nsJARChannel::OnStopRequest() line 606 + 49 bytes nsOnStopRequestEvent::HandleEvent() line 213 ... But the code in nsXULDocument::OnStreamComplete has comments to the effect that "we can't (or shouldn't) be reading this script from the fastload file, and we must be writing a new fastload because ...". Well, actually I don't understand what it's saying :-) At any rate, this is the point where we get into trouble as described at the beginning of this note.
This is a start, and I'm looking for r=jrgm and sr=ben to get it in ASAP. /be
This restores the deserialization attempt code to nsXULContentSink::OpenScript that was lost when bug 112064 was fixed. The previous patch is needed too, for sanity. This patch does not subsume the previous patch. jrgm, can you try it out? Mail me or ping me on IRC if there's some stupid compilation error. Thanks, /be
jrgm found some bad corruption late last night, but I had to crash and never heard the end of the story. There may be another bug lurking, or this fix may not be quite right. I hope to hear more soon, because I'm disappearing for three weeks in another day. Anyone watching this bug who can help, please try the two valid patches in your builds and say whether they help. /be
Attachment #111643 - Attachment is obsolete: true
Flags: blocking1.3b?
The corruption I'm seeing is reproducible as follows: 1) start nav; quit 2) start nav; quit 3) start nav; start msgcompo; quit 4) start nav; start messenger; triggers commonDialog; \ start messengercompose.xul; send a message (get \ progress dialog, and cert dialog [clientauthask.xul] \ quit 5) start messenger (-mail and avoid profile manager [-P]); \ get commonDialog login; start AccountManager.xul; \ ASSERT!!! and bork mObjectMap What I'm seeing in debug: I come in at nsFastLoadFileUpdater::Open, where it populates the object map. That all goes fine and I have 196 objects with first mOID being 8. Next, I pick it up on first call to nsFastLoadFileWriter::WriteObjectCommon after the PL_DHASH_ADD and inside of the !entry->mObject test, so we've added another entry. But entryCount is 196. Huh? When I inspect raw memory for mObjectMap, the entry for mOID of 8 is gone (?). Now, the entry count of 196 means we now get a duplicate mOID of 1576 in the map and that leads to trouble down the road. But I can't see where something would kill that first entry. [By the way, I have reproduced this in todays' mozilla win32 build so this corruption is independent of this fix, it would seem]. (I'm also referring to the untagged values for mOID [(index+1)<<3]).
Comment on attachment 111532 [details] [diff] [review] defend against bad calls into nsFastLoadFile.cpp and js/pldhash.c r=jrgm for hashtable asserts and defense against empty uriMapEntry
Attachment #111532 - Flags: review+
Comment on attachment 111777 [details] [diff] [review] better fix, but jrgm's still testing r=jrgm I think we can go with this. The other problem that I've noted above was already reproducible in the current trunk builds so it is not a consequence of landing this patch. And this patch (along with the earlier patch for the asserts n stuff) does fix another known way that I could cause fastload to hang. It may also be that the second mode of failure (the mOID botch) is not too common, since for most cases, I think this will cause the fastload cache to be destroyed sooner or later [with the case to be afraid of being if the memory read in is less than the fastload file size. Then pain could ensue].
Attachment #111777 - Flags: review+
Flags: blocking1.3b? → blocking1.3b+
The first and third attachments above were checked in by brendan at about 2am last Thursday night (Fri. morning). I've filed bug 189832 to continue to look at what I said in comment 5 above. No bad news from this fix so far. Marking fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: