Closed
Bug 188744
Opened 22 years ago
Closed 22 years ago
a way to reproduce a fastload hang
Categories
(Core :: XUL, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: jrgmorrison, Assigned: bugs)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 1 obsolete file)
(deleted),
patch
|
jrgmorrison
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jrgmorrison
:
review+
|
Details | Diff | Splinter Review |
This will reproduce a hang in fastload code (and Matti could also do this in
win32/debug, which is where I first did this too. Platform or debug
dependency? Don't know.)
However, it occurred to me that this is not quite like the other reported
hangs, where the actual file's footer is apparently corrupt. This hang
happens when starting from a (supposedly) valid fastload file. So, I filed
this as a separate bug.
1) delete XUL.mfl
2) start Navigator; Quit
3) start Navigator; do 'File->New->Message'; close compo window; Quit
4) start Navigator; click on the Messenger icon
5) Three pairs of assertions
a) "redundant multiplexed document?: 'docMapEntry->mString == nsnull"
in StartMuxedDocument
b) "unknown aURI passed to EndMuxedDocument!: 'uriMapEntry->mDocMapEntry"
in EndMuxedDocument
for documents:
i) chrome://messenger/content/widgetglue.js
ii) chrome://messenger/content/accountUtils.js
iii) chrome://messenger/content/mail-offline.js
Note: other documents are loading at the same time but eventually the
hang kicks in. (I don't know when exactly it starts, but the fact that
I broke when loading msgHdrViewOverlay.xul is just coincidence. [i.e.,
it took me a while to attach the debugger and break, with all the malloc
and disk thrash going on].
6) Then ... Hang. Allocating huge amounts of memory.
Break in debugger (after some period of time). This is the stack (below).
In ChangeTable, oldCapacity is 2,097,152 (which means new table will be ~80MB)
Up in nsFastLoadFileWriter::StartMuxedDocument
docMapEntry->mString == chrome://messenger/content/msgHdrViewOverlay.xul
docMapEntry->mInitialSegmentOffset == 0
docMapEntry->mCurrentSegmentOffset == 0
----------------------------------------
ChangeTable(PLDHashTable * 0x03611edc, int 1) line 456 + 3 bytes
PL_DHashTableOperate(PLDHashTable * 0x03611edc, const void * 0x038552e4, int 1)
line 514 + 13 bytes
nsFastLoadFileWriter::StartMuxedDocument(nsFastLoadFileWriter * const
0x03611e54, nsISupports * 0x038552e0, const char * 0x0012f384) line 1354 + 25
bytes
nsFastLoadService::StartMuxedDocument(nsFastLoadService * const 0x00fed4c8,
nsISupports * 0x038552e0, const char * 0x0012f384, int 2) line 288 + 31 bytes
nsXULPrototypeCache::StartFastLoadingURI(nsIURI * 0x038552e0, int 2) line 733 +
36 bytes
nsXULPrototypeCache::WritePrototype(nsXULPrototypeCache * const 0x014d2a08,
nsIXULPrototypeDocument * 0x039cfce8) line 666 + 19 bytes
nsXULDocument::EndLoad(nsXULDocument * const 0x03643bb8) line 1690
XULContentSinkImpl::DidBuildModel(XULContentSinkImpl * const 0x039d7928, int 0)
line 485
nsExpatDriver::DidBuildModel(nsExpatDriver * const 0x039d7ec8, unsigned int 0,
int 1, nsIParser * 0x039d79e0, nsIContentSink * 0x039d7928) line 972 + 23 bytes
nsParser::DidBuildModel(unsigned int 0) line 1283 + 41 bytes
nsParser::ResumeParse(int 1, int 1, int 1) line 1832
nsParser::ContinueParsing() line 1390 + 19 bytes
CSSLoaderImpl::SheetComplete(SheetLoadData * 0x039d3ae8, int 1) line 1800
CSSLoaderImpl::ParseSheet(nsIUnicharInputStream * 0x039e02c0, SheetLoadData *
0x039d3ae8, int & 1) line 1733
SheetLoadData::OnStreamComplete(SheetLoadData * const 0x039d3ae8,
nsIUnicharStreamLoader * 0x039d3fa8, nsISupports * 0x00000000, unsigned int 0,
nsIUnicharInputStream * 0x039e02c0) line 1124 + 23 bytes
nsUnicharStreamLoader::OnStopRequest(nsUnicharStreamLoader * const 0x039d3fac,
nsIRequest * 0x039d6028, nsISupports * 0x00000000, unsigned int 0) line 187
nsJARChannel::OnStopRequest(nsJARChannel * const 0x039d602c, nsIRequest *
0x039d61e4, nsISupports * 0x00000000, unsigned int 0) line 606 + 49 bytes
nsOnStopRequestEvent::HandleEvent() line 213
nsARequestObserverEvent::HandlePLEvent(PLEvent * 0x039cadac) line 116
PL_HandleEvent(PLEvent * 0x039cadac) line 663 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x00f57c00) line 593 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x015403d0, unsigned int 49396, unsigned int 0,
long 16088064) line 1379 + 9 bytes
----------------------------------------
I then let it proceed to the next document (msgHdrViewAddresses.js).
StartMuxedDocument, in do the PL_DHASH_ADD forced another ChangeTable
[now to ~160MB (on a 128MB RAM machine no less. Ew, painful :-)].
And at that point, I shut it down, because this had gone wrong way too long
ago to really tell what had happened, and anyways, I didn't have the
horsepower to attempt the next (320MB) ChangeTable ;-).
I put back the last known good XUL.mfl (saved after step 3 above), and
when I broke earlier in the hang, I noticed that mURIMap.entryCount had
a value of 4294967294 (or 0xfffffffe). I'm not sure what the trigger for
this mistake is, though. It happens during a load of a document subsequent
to the first pair of assertions (i.e., for a different doc than the asserts
were for).
Reporter | ||
Comment 1•22 years ago
|
||
The mURIMap gets broken (shorthand explanation) in
nsFastLoadFileWriter::EndMuxedDocument when it is passed an unknown aURI from
nsXULDocument::OnStreamComplete. Well, actually, it's not unknown, it's
"chrome://messenger/content/widgetglue.js", but mURIMap has zero entries at
this moment [because we didn't add 'widgetglue.js' to mURIMap in
nsFastLoadFileWriter::StartMuxedDocument when we bailed early because we
thought we were dealing with a redundant entry in the mDocumentMap].
So, in nsFastLoadFileWriter::EndMuxedDocument, with our empty mURIMap and
a zeroed out nsURIMapWriteEntry* entry, we call PL_DHashTableRawRemove. And
that clears and frees the entry, and decrements (underflows)table->entryCount.
After that, every PL_DHASH_ADD decides that it needs more capacity, and
the doubling continues until we run out of memory.
(Hmm, wait. Or possibly, for a smaller xul dialog we don't run out of memory
and eventually post our suspect tables to disk in a format that [in some
cases] guarantees failure on the next startup! That's the case where I've been
collecting the fastload files and seeing the bad entries in the header and
footer of the fastload files. Also, some people have reported an episode
where the app grinds away for a while, and then after that everything
apparently works OK. [I got a private email about that today].).
I guess we need to fix the case in PL_DHashTableRawRemove above, but I
think the problem still lies above in the load of the XUL documents that
trigger the load of 'widgetglue.js'.
What I see happening is that we first try to read in 'messenger.xul' from the
fastload file. We find that it isn't there, so we kick off an asynch load of
that document. We get the callback on this stack:
nsFastLoadFileReader::StartMuxedDocument() line 406
nsFastLoadService::StartMuxedDocument() line 257 + 31 bytes
nsXULPrototypeCache::StartFastLoadingURI() line 733 + 36 bytes
nsXULPrototypeCache::GetPrototype() line 260 + 14 bytes
nsXULDocument::StartDocumentLoad() line 702 + 57 bytes
nsContentDLF::CreateRDFDocument() line 528 + 47 bytes
nsContentDLF::CreateInstance()
...
where we have a wrapped reader, but we really want to be writing, not reading.
So we bail out of nsFastLoadFileReader::StartMuxedDocument with
NS_ERROR_NOT_AVAILABLE, but now we load the stream and put it in the
xul-cache. Then (I'm probably omitting a lot of steps here) we wind up back in
fastload code, calling nsFastLoadFileWriter::StartMuxedDocument for
'widgetglue.js', on this stack.
nsFastLoadFileWriter::StartMuxedDocument() line 1322
nsFastLoadService::StartMuxedDocument() line 288 + 31 bytes
nsXULDocument::OnStreamComplete() line 5805 + 59 bytes
nsStreamLoader::OnStopRequest() line 144
nsJARChannel::OnStopRequest() line 606 + 49 bytes
nsOnStopRequestEvent::HandleEvent() line 213
...
But the code in nsXULDocument::OnStreamComplete has comments to the effect
that "we can't (or shouldn't) be reading this script from the fastload file,
and we must be writing a new fastload because ...". Well, actually I don't
understand what it's saying :-) At any rate, this is the point where we get
into trouble as described at the beginning of this note.
Comment 2•22 years ago
|
||
This is a start, and I'm looking for r=jrgm and sr=ben to get it in ASAP.
/be
Comment 3•22 years ago
|
||
This restores the deserialization attempt code to nsXULContentSink::OpenScript
that was lost when bug 112064 was fixed.
The previous patch is needed too, for sanity. This patch does not subsume the
previous patch.
jrgm, can you try it out? Mail me or ping me on IRC if there's some stupid
compilation error. Thanks,
/be
Updated•22 years ago
|
Blocks: FastLoadHang
Comment 4•22 years ago
|
||
jrgm found some bad corruption late last night, but I had to crash and never
heard the end of the story. There may be another bug lurking, or this fix may
not be quite right. I hope to hear more soon, because I'm disappearing for
three weeks in another day.
Anyone watching this bug who can help, please try the two valid patches in your
builds and say whether they help.
/be
Updated•22 years ago
|
Attachment #111643 -
Attachment is obsolete: true
Updated•22 years ago
|
Flags: blocking1.3b?
Reporter | ||
Comment 5•22 years ago
|
||
The corruption I'm seeing is reproducible as follows:
1) start nav; quit
2) start nav; quit
3) start nav; start msgcompo; quit
4) start nav; start messenger; triggers commonDialog; \
start messengercompose.xul; send a message (get \
progress dialog, and cert dialog [clientauthask.xul] \
quit
5) start messenger (-mail and avoid profile manager [-P]); \
get commonDialog login; start AccountManager.xul; \
ASSERT!!! and bork mObjectMap
What I'm seeing in debug: I come in at nsFastLoadFileUpdater::Open, where it
populates the object map. That all goes fine and I have 196 objects with
first mOID being 8. Next, I pick it up on first call to
nsFastLoadFileWriter::WriteObjectCommon after the PL_DHASH_ADD and inside of
the !entry->mObject test, so we've added another entry.
But entryCount is 196. Huh? When I inspect raw memory for mObjectMap, the
entry for mOID of 8 is gone (?). Now, the entry count of 196 means we now get
a duplicate mOID of 1576 in the map and that leads to trouble down the road.
But I can't see where something would kill that first entry. [By the way, I
have reproduced this in todays' mozilla win32 build so this corruption is
independent of this fix, it would seem]. (I'm also referring to the untagged
values for mOID [(index+1)<<3]).
Reporter | ||
Comment 6•22 years ago
|
||
Comment on attachment 111532 [details] [diff] [review]
defend against bad calls into nsFastLoadFile.cpp and js/pldhash.c
r=jrgm for hashtable asserts and defense against empty uriMapEntry
Attachment #111532 -
Flags: review+
Reporter | ||
Comment 7•22 years ago
|
||
Comment on attachment 111777 [details] [diff] [review]
better fix, but jrgm's still testing
r=jrgm
I think we can go with this. The other problem that I've noted above was
already reproducible in the current trunk builds so it is not a consequence of
landing this patch. And this patch (along with the earlier patch for the
asserts n stuff) does fix another known way that I could cause fastload to
hang. It may also be that the second mode of failure (the mOID botch) is not
too common, since for most cases, I think this will cause the fastload cache to
be destroyed sooner or later [with the case to be afraid of being if the memory
read in is less than the fastload file size. Then pain could ensue].
Attachment #111777 -
Flags: review+
Updated•22 years ago
|
Flags: blocking1.3b? → blocking1.3b+
Reporter | ||
Comment 8•22 years ago
|
||
The first and third attachments above were checked in by brendan at about 2am
last Thursday night (Fri. morning). I've filed bug 189832 to continue to look
at what I said in comment 5 above.
No bad news from this fix so far. Marking fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•