Closed Bug 359821 Opened 18 years ago Closed 17 years ago

Firefox just crashes after about 10 minutes of use [@ nsHTMLDocument::GetElementById]

Categories

(Core :: DOM: Core & HTML, defect, P1)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: stavandesai, Assigned: mozilla)

References

()

Details

(6 keywords, Whiteboard: STR in comment 30. fixed on trunk by 348156, 1.8-branch patch here.)

Crash Data

Attachments

(3 files, 5 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0

It does not matter which website or when, Firefox 2.0 seems to just crash after 10 minutes of use. I am using firefox 2.0 on Windows XP service pack 2. I have used Firefox for a long time, and only now am I having this problem. All virus scans by different programs came back negative, same for spyware. Just randomly, the firefox bug reporter comes on, sends the crash information and then firefox closes. 

Reproducible: Always
Please give a few talkback ID's so we can trace the cause of the crashes. See http://kb.mozillazine.org/Talkback for how to get them.
Here are the Incident IDs so far
TB25657363Q, TB25651372H, TB25650554W, TB25624849K, TB25590893G, TB25590064Z, TB25572066W, TB25564431H, TB25563660X, TB25504882Z, TB25503020X, TB25481533Q, TB25480652M, TB25479335Z, TB25428351Y, TB25380519W
Looks like the crashes all occur at 
  http://www.gamespot.com/xbox360
I couldn't reproduce the crash surfing around a bit and leaving it sitting for 15 minutes or so (using Firefox 2 w/ blank profile, Flash 8.0 r24, win32)

The stacks end at a random memory address, or at nsHTMLDocument::GetElementById - about a 50/50 split, with 3 other stacks that are empty. We need someone who knows stack-fu to say more.
Severity: critical → major
Keywords: crash
Version: unspecified → 2.0 Branch
Are you using any extensions? If so, do you still crash in safe mode (http://kb.mozillazine.org/Safe_mode).

Incident ID: 25480652
Stack Signature	nsHTMLDocument::GetElementById cb202024
Product ID	Firefox2
Build ID	2006101023
Trigger Time	2006-11-03 20:15:07.0
Platform	Win32
Operating System	Windows NT 5.1 build 2600
Module	firefox.exe + (00196e33)
URL visited	http://www.gamespot.com/xbox360
User Comments	
Since Last Crash	182 sec
Total Uptime	75242 sec
Trigger Reason	Access violation
Source File, Line No.	c:/builds/tinderbox/Fx-Mozilla1.8-release/WINNT_5.2_Depend/mozilla/content/html/document/src/nsHTMLDocument.cpp, line 2479
Stack Trace 	
nsHTMLDocument::GetElementById  [mozilla/content/html/document/src/nsHTMLDocument.cpp, line 2479]
XPTC_InvokeByIndex  [mozilla/xpcom/reflect/xptcall/src/md/win32/xptcinvoke.cpp, line 102]
XPCWrappedNative::CallMethod  [mozilla/js/src/xpconnect/src/xpcwrappednative.cpp, line 2169]
XPC_WN_CallMethod  [mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp, line 1455]
js_Invoke  [mozilla/js/src/jsinterp.c, line 1377]
js_InternalInvoke  [mozilla/js/src/jsinterp.c, line 1471]
JS_CallFunctionValue  [mozilla/js/src/jsapi.c, line 4419]
XPC_NW_FunctionWrapper  [mozilla/js/src/xpconnect/src/XPCNativeWrapper.cpp, line 387]
js_Invoke  [mozilla/js/src/jsinterp.c, line 1377]
js_Interpret  [mozilla/js/src/jsinterp.c, line 4121]
js_Invoke  [mozilla/js/src/jsinterp.c, line 1396]
js_InternalInvoke  [mozilla/js/src/jsinterp.c, line 1471]
JS_CallFunctionValue  [mozilla/js/src/jsapi.c, line 4419]
nsJSContext::CallEventHandler  [mozilla/dom/src/base/nsJSEnvironment.cpp, line 1493]
nsGlobalWindow::RunTimeout  [mozilla/dom/src/base/nsGlobalWindow.cpp, line 6714]
nsGlobalWindow::TimerCallback  [mozilla/dom/src/base/nsGlobalWindow.cpp, line 7086]
nsAppStartup::Run  [mozilla/toolkit/components/startup/src/nsAppStartup.cpp, line 152]
main  [mozilla/browser/app/nsBrowserApp.cpp, line 61]
kernel32.dll + 0x16fd7 (0x7c816fd7)
Assignee: nobody → general
Severity: major → critical
Component: General → DOM
Product: Firefox → Core
QA Contact: general → ian
Summary: Firefox just crashes after about 10 minutes of use → Firefox just crashes after about 10 minutes of use [@ nsHTMLDocument::GetElementById]
Version: 2.0 Branch → 1.8 Branch
I am using extensions, but I have used them all for a long time, and it still does crash in safemode. Also, someone mentioned that all the crashes were at www.gamespot.com/xbox360. I have had crashes at other sites as well, such as my.yahoo.com, www.cnn.com etc. Sometimes, the talkback feedback agent doesn't come up, just the windows send error report. 
This is #67 FF2 crasher.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Boris, you may remember something about getelementbyid ;)
That line is a CallQI call on something that's guaranteed to not be null.  Hence I conclude that it's dead.  Possibly the document is dead as well, though I'd expect that to crash sooner.

As for how that would happen... no idea.  This code is pretty different on the trunk.
*** Bug 359973 has been marked as a duplicate of this bug. ***
Flags: blocking1.8.1.1?
Keywords: helpwanted, qawanted
Whiteboard: Need way to reproduce or something....
Another thing to check is to try with a clean profile.
Would love to see a fix, moved up to #23 on the topcrash list.

Jonas: any thoughts on who to assign this to if not you?
Assignee: general → bugmail
Flags: wanted1.8.1.x+
Flags: blocking1.8.1.1?
Flags: blocking1.8.1.1+
Keywords: topcrash
I just installed Firefox 2.0 on my laptop and it is now crashing as well.  It crashed 2 times within the first 10 minutes of having 2.0 installed. I am using the default theme.
We are also seeing random, somewhat frequent exceptions being thrown by getElementById in our web application.  We haven't been able to boil it down to a simple test case, but would love to see this one get fixed.  Our web application also interacts with Flash on the page.
Robert, I assume you're seeing these in 2.0 and not 1.5?  If so, would you possibly be willing to try some of the intermediate builds and see whether you can narrow down when the problem started?  Or give someone access to your web app?
Yes we are seeing the crashes in the 2.0 release.  Which intermediate releases would you suggest we test against?
Robert, see http://archive.mozilla.org/pub/firefox/nightly/ -- it might take a bit to load the first time, sadly.  :(

The directories you're looking for have names like "2006-09-15-04-mozilla1.8/" -- that would be a build from the 1.8 Gecko branch (which Firefox 2 is based on), compiled on September 15, 2006, at 4am Pacific time.  There will generally be 2-3 directories like that per day (Linux, Windows, Mac builds).  These aren't releases, but development snapshots, but they should be pretty stable in general.

The earliest 1.8 branch build (pretty identical to Firefox 1.5) looks like 2005-08-09-12-mozilla1.8/  and the most recent one on archive right now is 2006-10-16-05-mozilla1.8/ -- more recent builds would be at <ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/>.

I've had good luck in the past with a binary search on the builds in cases like this.  For about a year's worth of builds, as in this case, only about 8-9 need to be tested...
I am the original poster of this bug. The problem used to occur mostly at www.gamespot.com/xbox360. However, recently, it has been happening more often at different locations, such as www.engadget.com, my.yahoo.com etc. Also, I have never done this before. Can I actually expect this bug to get fixed?
As soon as someone with a debugger manages to reproduce it, probably...
Our QA group went back through the nightly builds and have narrowed the bug to beginning on the 7/07 build.  Builds prior to that do not exhibit the bug and ones   from that build onward exhibit the bug.  Our testers report that it seems more reproducible with Flash 8 installed (and using Flash on the page).

I can't be sure the bug we are seeing is identical to the original reported bug, so it may be helpful if some corroborates the build dates.
Assuming the date is good, the checkins between the 06/July and 07/July Windows nightly are listed at
http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=AviarySuiteBranchTinderbox&branch=MOZILLA_1_8_BRANCH&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=1152181140&maxdate=1152272400&cvsroot=%2Fcvsroot
which includes the changes to nsHTMLDocument from bug 333038 and the JS 1.7 landing.

Is it possible to get useful Talkback reports for a 07/July build ? Or are the symbols long gone ...

Jay, see comment 20?

Stavan Desai, can you confirm the regression range?
Robert, could you paste in links to the first build you guys see issues with and the last one you don't, just to make sure we're all on the same page?

Also, does your application use designmode iframes?
Flags: blocking1.8.1.1+ → blocking1.8.1.1-
Here are the file links we tracked the bug we're seeing to.

The bug begins occuring in this build:
 http://archive.mozilla.org/pub/firefox/nightly/2006-07-07-04-mozilla1.8/
 The file link:  firefox-2.0a3.en-US.win32.installer.exe 
 
The bug does NOT occur in this build:
 http://archive.mozilla.org/pub/firefox/nightly/2006-07-06-03-mozilla1.8/
 The file link:  firefox-2.0a3.en-US.win32.installer.exe  

Our application does not use designmode iframes.  It does use regular iframes and embedded Flash object.s
OK.  If designmode is not being used, then bug 333038 is not too likely to be the problem.

Robert, I assume your stacks look like those cited in comment 2?

Looking at those stacks again, what's happening is that chrome JS is calling getElementById on a content document off a timeout (note the XPC_NW_FunctionWrapper on the stack).  Given the regression range, this makes bug 189039 a possible suspect... at the very least it's a change to chrome code that does in fact make getElementById calls (not sure about the timeout part).

Stavan, Robert, are you using typeahead find on pages with textboxes?

What would be ideal here would be someone catching this crash in a debugger and looking up the JS callstack... (e.g. by calling the DumpJSStack() function in the debugger).  That would certainly help with sorting out how we're getting to the crash site...
That line could do the wrong thing if we're holding the only reference to the node, for sure.  But then we'd crash at this point...
This crash has jumped up to the #14 topcrash for Firefox 2.0.0.1. I don't know if we're willing to reconsider blocking status based on that information.
(In reply to comment #28)
> This crash has jumped up to the #14 topcrash for Firefox 2.0.0.1. I don't know
> if we're willing to reconsider blocking status based on that information.
> 

I've been running into an issue that looks very similar with an extension I'm developing. It's crashing immediately below the CallQueryInterface function in GetElementById(). I've run it with my own build of BonEchoDebug, and MallocDebug with scribbling on free turned on, and I see the pointer to the element that's in in the e variable points to an area filled with 0x55, which means it's been freed.

I did also notice that mIsGoingAway always seems to be true in the document when I get this crash.

I can reproduce this with the following steps:
- Install my extension, which listens for google search result loads, and calls XMLHttpRequest's on the result links to determine if they exist, and have the search terms in their text
- Load a page of search results in google, and rapidly click next while there's a lot of requests in flight
- After two or three pages, I usually crash with a callstack that looks just like the one in comment #4. It dies right when I click the next button

It dies in my onreadystatechange callback from my logging, right after I've added an element to the current document by appending to another element's innerHTML. I make a call to retrieve the element I just added with that ID, and it dies in there.

I will attach my extension code if anyone else wants to help debug this. My next steps are going to be adding logging to the getElementById() code to see where the pointer to the element is being retrieved from (eg the hash map?), log the closing of the document, and try to trace the life of the element I add so I can see why it's getting freed. Oh, and I'm testing on OS X if that makes any difference. 
Also attach the .xpi so someone could try it on Win32/Linux ?
This is the extension xpi I mention in my comment, that's needed to reproduce the crash I'm seeing.

You can look at the source code by unzipping the xpi, and then unzipping the content.jar that's inside the xpi. The crashing function is sendPageRequest in petesearch.js

I've included heavy logging for that function, to narrow down where the crash occurs, using dump(). Logging indicates that the crash occurs inside the getPreviewElement(), which looks like this:

function getPreviewElement(document, target)
{
	return document.getElementById(getPreviewIDFromTarget(target));
}
Pete, have you tryed to reproduce in _stock_ 2.0.0.3 or the new 1.9 nightly builds?
Sounds like htmldocument doesn't get notified that element is removed 
from document. Should we remove content objects in ::Destroy (not just unbind) to get notifications or just clear mIdAndNameHashTable there.
(In reply to comment #33)
> Pete, have you tryed to reproduce in _stock_ 2.0.0.3 or the new 1.9 nightly
> builds?

Yes, it was in stock 2.0.3 that I saw this. I only built my own version to try and get some more debug information. I haven't tried it on top-of-tree yet.


I don't think we want to remove the content objects in ::Destroy on branch, that's too big of a change. Instead lets just clear the ID cache and use nsContentUtils::MatchElementId
Depends on: 348156
To test the theory that this is caused by the hash map holding on to destroyed pointers, I added some logging. I had nsDocument::Destroy() log when it was called for each document, and made the following changes to nsHTMLDocument::GetElementById:

NS_IMETHODIMP
nsHTMLDocument::GetElementById(const nsAString& aElementId,
                               nsIDOMElement** aReturn)
{
...
fprintf(stderr, "GetElementById called on document %8x\n", this);
...
  if (e == ID_NOT_IN_DOCUMENT) {
...
fprintf(stderr, "GetElementById:: 0x%8x found in hash after flush\n",e);
  }
else
{
fprintf(stderr, "GetElementById:: 0x%8x found in hash before flush\n",e);
}
...
fprintf(stderr, "GetElementById:: 0x%8x found, but not in hash\n",e);
}


Logging output just before the crash:

Document 0x1d40ee00 destroyed
<... other unrelated log statements ...>
GetElementById called on document 1d40ee00
GetElementById:: 0x3b8aec10 found in hash before flush

This seems to support the idea that the hash map is the cause.

It seems like I could further test this by calling nsHTMLDocument::InvalidateHashTables(). This isn't accessible from the nsDocument base class though, should I create a derived version of Destroy() for nsHTMLDocument that calls nsDocument::Destroy() and then calls InvalidateHashTables(), or is there some other mechanism I could use to make sure it happens at the right time?
Attached patch Proposed patch (obsolete) (deleted) — Splinter Review
Here's a proposed fix. I've added an implementation of Destroy to nsHTMLDocument, that overrides the original virtual nsDocument implementation.
void
nsHTMLDocument::Destroy()
{
  nsDocument::Destroy();
  InvalidateHashTables();
}

It calls back to the the nsDocument Destroy, and then invalidates the hash as an additional step. I've tested this, and I'm no longer able to reproduce the crash with the steps that caused it before.

This patch is against BonEcho. I'm a mozilla newbie, going from http://www.mozilla.org/hacking/life-cycle.html it sounds like I should really do this against top-of-tree, and then maybe it could get backported if necessary for a 2.0.x release?
Please swap the calls and invalidate the hashes first. This is because we're still in a bad state after the Destroy call, so we should do as little as possible then.
Lets block on this since it's a topcrasher. 
Flags: blocking1.9? → blocking1.9+
Attached patch Updated Patch (obsolete) (deleted) — Splinter Review
Thanks Jonas, that makes sense. I originally placed it last in case anything in the base destroy added objects to the hash, but looking through the code it's pretty clear nothing does or should.

Here's the updated patch, the only change from the previous one is the move of the  invalidate call. I've tested it locally with my repro steps, and it no longer crashes. I've only tested on BonEcho, I will add a comment once I've built and run with it on the latest trunk. I assumed that won't block review?
Attachment #261555 - Attachment is obsolete: true
Attachment #262060 - Flags: review?(jonas)
Attachment #261555 - Flags: review?(jonas)
I've now tested with the latest Mac nightly (21st April). I can still repro the crash in the nightly builds, though the crash location seems to vary more. I've seen it in nsHTMLDocument::getAttribute() for example.

I've also built my own version of Minefield, with my patch applied, and I no longer see the crash when following the repro steps.

The callstack for the nightly crashes starts from XMLHttpRequest, so I believe it's the same underlying issue (a freed object pulled from the hash table and used).

Based on this, it seems like the fix is still needed, and should go into 1.9 as well as 1.8.x
Comment on attachment 262060 [details] [diff] [review]
Updated Patch

You should also set some flag indicating that this doc has been destroyed, and then make the implementation of getElementById not put new things into the hash if that flag is set.
Attachment #262060 - Flags: review?(jonas) → review-
Sorry for not mentioning that earlier :(
(In reply to comment #44)
> Sorry for not mentioning that earlier :(

np :)

I'll add some code using the existing NSDocument::mIsGoingAway flag in getElementById(). That only gets set in NSDocument::destroy(), and looks like it's used for similar "do something different when the document's destroyed" behaviors in a couple of other functions. Sound reasonable?

Once I've implemented and tested that change, I'll put up a new patch.
Flags: blocking1.8.1.5?
Here's a patch with all of the places where an object might be added to the hash table guarded with checks on mIsGoingAway.

I had to touch several functions in nsHtmlDocument, in addition to GetElementById.  I searched the code for any use of mIdAndNameHashTable with the PL_DHASH_ADD operation to decide where to put the check.

I made AddToIdTable, UpdateIdTableEntry and ResolveName return early without doing anything if the document was destroyed. 

The one function I didn't touch was ReserveNameInHash, since this is only used in the init and resetURI methods, and the entries it creates are not normal elements at risk of being deallocated.

The changes I made to GetElementById were more involved than I'd like, because the path we want to take (calling MatchElementId and then CallQueryInterface) is at the bottom of the function, and interleaved with hash table accesses that I had to guard.

An alternative for GetElementById is to duplicate that code in an early returning block at the top of the function, so we don't have to guard all the subsequent hash code. This has the disadvantage of making it harder to maintain and keep the two duplicate code blocks in sync going forward. My current approach seems lower risk and more maintainable. A third alternative would be to return null from GetElementById when the document had been destroyed, but this seems too big a change in behavior.

I've tested this against my original steps, and it does fix the problem. It does slightly change the behavior of some of the document functions after destruction, so it would be good to see it tested against some other DOM intensive pages.
Attachment #262060 - Attachment is obsolete: true
Attachment #263474 - Flags: review?(jonas)
Reassigning since Pete is the one actually working on this.
Assignee: jonas → mozilla
As a note for anyone looking for a workaround until we get this in a release, I switched my extension over to retrieving the element by name rather than ID. Since this doesn't use the hash table, it appears to avoid the crash and work on destroyed documents.

The released plugin is up at http://petesearch.com/
Status: NEW → ASSIGNED
After chatting to Jonas, I've reformatted the patch to hide the large number of whitespace differences caused by the indenting, to make it easier to understand, and put it back in for review.
Attachment #263474 - Attachment is obsolete: true
Attachment #265080 - Flags: review?(jonas)
Comment on attachment 265080 [details] [diff] [review]
Updated patch, same as above but using -w to hide whitespace differences

>@@ -2503,34 +2518,38 @@ nsHTMLDocument::GetElementById(const nsA
>   if (e == ID_NOT_IN_DOCUMENT) {
>     // We've looked for this id before and we didn't find it, so it
>     // won't be in the document now either (since the
>     // mIdAndNameHashTable is live for entries in the table)
> 
>     return NS_OK;
>   }
> 
>+  }

No need for the empty line between the two '}' lines.

>       // the fact that it's not in the document
>+      if (entry)
>       entry->mIdContent = ID_NOT_IN_DOCUMENT;
> 
>       return NS_OK;
>     }
> 
>     // We found an element with a matching id, store that in the hash
>+    if (entry)
>     entry->mIdContent = e;
>   }

Please use { } around the two above new if-statements. 


> nsresult
> nsHTMLDocument::ResolveName(const nsAString& aName,
>                             nsIDOMHTMLFormElement *aForm,
>                             nsISupports **aResult)
> {
>   *aResult = nsnull;
> 
>-  if (IsXHTML()) {
>+  if (IsXHTML()||mIsGoingAway) {

Put spaces around the ||.

r/sr=me with that fixed.
Attachment #265080 - Flags: superreview+
Attachment #265080 - Flags: review?(jonas)
Attachment #265080 - Flags: review+
Do we need a separate patch for branch or does the patch work there too?
This is still a 2.0.0.x topcrasher.
Though, first the patch needs to land on trunk.
Pete, could you update the patch, I can then check it in.
Here's the cleaned up patch, diffed against the 2.0.3 branch. I'll update and build against the main trunk, test it there, and then post another patch.
Attachment #265080 - Attachment is obsolete: true
Here's the diff against the trunk. The internals of HTMLDocument have changed enough that I had to do a pretty manual merge, so it might be easier to use the previous patch for the 2.0.x branch when it needs to be added there.

I've tested it, and I got the crash with the attached xpi and described steps without this change on last night's trunk, and couldn't repro it after my changes.

I do wonder if we shouldn't change the hash so that it takes ownership of the objects added to it as a longer-term fix? Anyway, this patch is a solution for the immediate crash at least.
Please hold off landing this on trunk as I'm working on a different fix there. Please do land on branch asap though (after getting approval of course).
Attachment #267546 - Flags: approval1.8.1.5?
Attachment #267546 - Flags: approval1.8.0.13?
This should be fixed on trunk with the patch from bug 348156
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Reopening since bug 348156 was backed out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: blocking1.8.1.5? → blocking1.8.1.5+
Whiteboard: Need way to reproduce or something.... → fixed on trunk by 348156, 1.8-branch patch here.
Here's the patch against FIREFOX_2_0_0_4_RELEASE
I've included whitespace differences, as per Olli's request, and tested locally.
Attachment #267546 - Attachment is obsolete: true
Attachment #267546 - Flags: approval1.8.1.5?
Attachment #267546 - Flags: approval1.8.0.13?
Comment on attachment 269341 [details] [diff] [review]
Patch as 267546, but against 2.0.4 branch, with whitespace shown

Requested approval on this patch.
Attachment #269341 - Flags: approval1.8.1.5?
Attachment #269341 - Flags: approval1.8.0.13?
Comment on attachment 269341 [details] [diff] [review]
Patch as 267546, but against 2.0.4 branch, with whitespace shown

Jonas, would you mind a brief re-review here to make sure the branch patch isn't going to stumble on some trunk/branch difference?
Attachment #269341 - Flags: review?(jonas)
Whiteboard: fixed on trunk by 348156, 1.8-branch patch here. → STR in comment 30. fixed on trunk by 348156, 1.8-branch patch here.
Comment on attachment 269341 [details] [diff] [review]
Patch as 267546, but against 2.0.4 branch, with whitespace shown

approved for 1.8.1.5 and 1.8.0.13, a=dveditz
Attachment #269341 - Flags: approval1.8.1.5?
Attachment #269341 - Flags: approval1.8.1.5+
Attachment #269341 - Flags: approval1.8.0.13?
Attachment #269341 - Flags: approval1.8.0.13+
This is now a trunk only bug, afaict, so changing the Version field to Trunk.
Version: 1.8 Branch → Trunk
Is there an updated patch for trunk needed here?
No. Bug 348156 should fix this one too.
But better to test after that has landed.
Should be fixed by patch in bug 348156
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
Crash Signature: [@ nsHTMLDocument::GetElementById]
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: