Closed Bug 470487 Opened 16 years ago Closed 15 years ago

Firefox Crash [@ nsWindow::GetParentWindow(int)] & [@ nsBaseWidget::Destroy()]?

Categories

(Core :: Widget: Win32, defect, P2)

1.9.1 Branch
x86
Windows XP
defect

Tracking

()

VERIFIED FIXED
mozilla1.9.2a1
Tracking Status
status1.9.2 --- beta1-fixed
blocking1.9.1 --- .6+
status1.9.1 --- .6-fixed

People

(Reporter: chofmann, Assigned: jimm)

References

Details

(4 keywords, Whiteboard: [sg:critical?][crashkill][crashkill-fix])

Crash Data

Attachments

(5 files, 5 obsolete files)

Currently the #4 top crash in Firefox 3.1b2 0 xul.dll nsWindow::GetParentWindow widget/src/windows/nsWindow.cpp:1544 1 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1518 2 xul.dll nsBaseWidget::Destroy widget/src/xpwidgets/nsBaseWidget.cpp:260 3 xul.dll nsWindow::Destroy widget/src/windows/nsWindow.cpp:1416 4 xul.dll nsPluginInstanceOwner::Destroy layout/generic/nsObjectFrame.cpp:4030 5 xul.dll DoStopPlugin layout/generic/nsObjectFrame.cpp:1987 6 xul.dll nsStopPluginRunnable::Run layout/generic/nsObjectFrame.cpp:2024 7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:510 8 xul.dll nsBaseAppShell::Run widget/src/xpwidgets/nsBaseAppShell.cpp:170 9 nspr4.dll PR_GetEnv 10 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:87 11 firefox.exe firefox.exe@0x2197 12 kernel32.dll kernel32.dll@0x17066 not many comments to go on. *replying to message in yahoo mail at the time it crashed. Had just hit reply
Component: XUL Widgets → Widget: Win32
Product: Toolkit → Core
QA Contact: xul.widgets → win32
Flags: blocking1.9.1?
Severity: normal → critical
OS: Mac OS X → Windows XP
Summary: Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() → Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() ]
Version: unspecified → 1.9.1 Branch
there is also another similar stack that goes though nsWindow::Destroy widget/src/windows/nsWindow.cpp:1416 I wonder if this is a related crash, or another bug that needs to be spun off? This stack signature "nsBaseWidget::Destroy()" is currently ranked #7 0 xul.dll nsBaseWidget::Destroy widget/src/xpwidgets/nsBaseWidget.cpp:262 1 xul.dll nsWindow::Destroy widget/src/windows/nsWindow.cpp:1416 2 xul.dll nsPluginInstanceOwner::Destroy layout/generic/nsObjectFrame.cpp:4030 3 xul.dll DoStopPlugin layout/generic/nsObjectFrame.cpp:1987 4 xul.dll nsStopPluginRunnable::Run layout/generic/nsObjectFrame.cpp:2024 5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:510 6 xul.dll nsBaseAppShell::Run widget/src/xpwidgets/nsBaseAppShell.cpp:170 7 nspr4.dll PR_GetEnv 8 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:87 9 firefox.exe firefox.exe@0x2197 10 kernel32.dll kernel32.dll@0x17066
Summary: Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() ] → Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() ] & [@ nsBaseWidget::Destroy()?
I didn't change the actual logic of nsWindow::GetParentWindow which its old name is nsWindow::GetParent. The crash timing is: if (widget->mIsDestroying) { this line. So, looks like the parent widget is already destroyed.
The pointer to nsWindow is released from HWND by SubclassWindow(FALSE). It is called only in nsWindow::OnDestroy. nsWindow::Destroy and getting WM_CLOSE. I guess that the parent window didn't get WM_CLOSE message but the instance was already released. Then, SubclassWindow(FALSE) is called via nsWindow::Destroy. But then, mWnd was already set to null, so, we fail to release the pointer to nsWindow from HWND.
(In reply to comment #4) > The pointer to nsWindow is released from HWND by SubclassWindow(FALSE). It is > called only in nsWindow::OnDestroy. nsWindow::Destroy and getting WM_CLOSE. nsWindow::Destroy and getting WM_CLOSE are callers of nsWindow::OnDestroy.
It's strange... ::DestroyWindow automatically destroys its children. So, why can ::GetParent in nsWindow::GetParentWindow get the destroyed parent window after the parent nsWindow destroyed??
Masayuki, thanks for helping to investigate. I've cc'ed some others that have some worked in the windows widget code in the past to see if they have ideas.
Keywords: topcrash
any more ideas on this one?
This is actually more likely to be plugin related -- I know there were some bugs & changes about plugin destruction. Cc'ing jst to see if this looks similar..
Can we figure out what plugins are loaded for people who are crashing? Is it always yahoo-related? We've had problems with the yahoo state plugin in the past.
Given that this looks windows specific it's even more odd as on windows we reparent the plugins widget to the root widget (the desktop) when we do plugin destruction off of an event, which seems to be the case here. I looked through a dozen or so crash reports, and only one of them had the yahoo state plugin dll loaded (npYState.dll), so it's likely not specific to yahoo. It's also interesting to note that out of the 6886 crashes found by my query, 6675 happened on December 1st of 2008. So how much of a top-crash this is atm is not clear to me.
This query shows 851 crashes in the last week for users on beta2 http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.1b2&query_search=signature&query_type=contains&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28%29 That ranks it as #7 top crash. Glancing though a few of the module lists I've seen NPAskSBr.dll (the ask tool bar) show up in a couple of reports.
Not blocking for now since it looks like we might have just seen a spike caused by other software/sites? If it persists in B3 as a topcrash please renom.
Flags: blocking1.9.1? → blocking1.9.1-
This ranks as the number 7 for Firefox 3.0.6. My guess is that it'll be highly present in b3 just as it was for b2.
Flags: wanted1.9.0.x+
This is currently #6 in the list of top crashes for Firefxo 3.1 beta 3. Renoming per comment 14.
Flags: blocking1.9.1- → blocking1.9.1?
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P2
Keywords: crash
Summary: Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() ] & [@ nsBaseWidget::Destroy()? → Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() ] & [@ nsBaseWidget::Destroy()]?
I've stared at this code and I can't see anything obviously wrong; we need to find steps to reproduce.
Keywords: qawanted
(In reply to comment #15) > This ranks as the number 7 for Firefox 3.0.6. My guess is that it'll be highly > present in b3 just as it was for b2. Did this start in 3.0.6, or earlier? A regression range would be quite handy.
I've spent half a day trying different scenarios concerning Yahoo Mail and the Yahoo Mail extension and haven't been able to reproduce it whatsoever. The comments on the crash reports aren't helpful for this at all either. Has anyone been able to see this once or twice on their profile? I'd like to get some deatils as to what extensions, sites and operations/actions were used.
maybe between 3.0.4 and 3.0.5? searching the topcrasher lists for recent 3.0.x releases I see these in nsWindow code. release rank stack 3.0.4 43 nsWindow::GetParent() 3.0.5 28 nsWindow::GetParentWindow() 3.0.4 17 nsWindow::GetParentWindow()
er, that should have been maybe between 3.0.3 and 3.0.4? searching the topcrasher lists for recent 3.0.x releases I see these in nsWindow code. release rank stack 3.0.3 43 nsWindow::GetParent() 3.0.4 17 nsWindow::GetParentWindow() 3.0.5 28 nsWindow::GetParentWindow() 3.0.6 20 nsWindow::GetParentWindow() 3.0.7 19 nsWindow::GetParentWindow()
and for the 1.9.1 related releases release rank stack 3.1a2 nothing 3.1b1 7 nsWindow::GetParentWindow() 3.1b2 15 nsWindow::GetParentWindow() 3.1b3 9 nsWindow::GetParentWindow() 3.5b4pre 18 nsWindow::GetParentWindow()
so maybe some time last sept. looks like when it started showing up. v3.0.3, released September 26, 2008 -- v3.0.4, released November 12, 2008 3.1 alpha2 Sept 5, 2008 --- v3.1 Beta 1, released October 14, 2008
at first glance the only thing I see landing around then that might be related is https://bugzilla.mozilla.org/show_bug.cgi?id=452385 any possibility of some bad interaction between the patches there and plugins?
Bug 452385 renamed nsWindow::GetParent() to nsWindow::GetParentWindow(). So this bug had occured in 3.0.3.
> re: comment 25 ok, that's helping to make a bit more sense of the pattern seen in the releases. I looked at the stack traces in 3.0.3 and they look basically the same as comment 0. one thing looking at this bug shows is an increase in the ranking after 3.0.4, and from no visability to being ranked in 3.1 beta1. the increase in visability in 3.1 beta1 may just be related to more users ramping up on 3.1 when it hit the first beta. It does appear that that it takes a large user pool to hit this problem. Is it possible the fixes that went in for bug 452385 reduced the chances of bug being surfaced in the bookmark UI case, but increased the chances in plugin window popups or some other windowing cases, or that some hangs for remaining edge cases in that bug have been turned into crashes?
(In reply to comment #26) > one thing looking at this bug shows is an increase in the ranking after 3.0.4, > and from no visability to being ranked in 3.1 beta1. the increase in > visability in 3.1 beta1 may just be related to more users ramping up on 3.1 > when it hit the first beta. It does appear that that it takes a large user > pool to hit this problem. This is likely related to the fact that crash-stats top crash reports don't "freeze", so they're still based on current data (last 2 weeks or whatever). There's likely less users on 3.0.3 than on 3.1b1 or something weird like that...
A random thought... Can someone who can reproduce this bug (can anyone?) please try removing the mIsDestroying and mOnDestroyCalled variables from windows/nsWindow.h (they already exist in nsBaseWidget). Duplicate variables are never a good idea. They've been there for a long time, but maybe something is now tickling the classes into using different versions?
Yeah, I don't think we have a way to reproduce it.. the fix you suggest is probably a good one in any case, though.
This is still sitting on qawanted -- given that noone can reproduce this, it would be helpful for QA to: - try to find a way to reproduce, by looking at crash info, perhaps by contacting users or some other type of outreach (blog post saying "hey, if you crashed, and you have an about:crashes log that looks like this, get a hold of us"); - get this in the debugger for a developer to look at; - try to find a regression range. Cc'ing Jim as well -- Jim, could you take a look at jeremy's suggestion on comment #28 and fix up those variables?
Juan, you probably added the wrong url? This are crashes from Songbird and only two with comments. Our ones aren't helpful either: http://crash-stats.mozilla.com/report/list?product=Firefox&branch=1.9.1&version=Firefox%3A3.1b3&query_search=signature&query_type=exact&query=nsWindow%3A%3AGetParentWindow%28%29&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28%29 We could file an IT ticket and ask for a couple of URLs where the crash happens.
Assignee: nobody → jmathies
Henrik, I added the intended URL, because it had a couple of comments, as opposed to our stats site which, when I tried to access, would time out.
(In reply to comment #32) > Juan, you probably added the wrong url? This are crashes from Songbird and only > two with comments. > > Our ones aren't helpful either: > http://crash-stats.mozilla.com/report/list?product=Firefox&branch=1.9.1&version=Firefox%3A3.1b3&query_search=signature&query_type=exact&query=nsWindow%3A%3AGetParentWindow%28%29&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28%29 > > We could file an IT ticket and ask for a couple of URLs where the crash > happens. Clicking through a number of these, all seem to have some form of plugin loaded, but there doesn't appear to be a particular plugin that's causing it, so I'm guessing it's more generic.
Summary: Firefox 3.1b2 Crash Report [@ nsWindow::GetParentWindow() ] & [@ nsBaseWidget::Destroy()]? → Firefox Crash [@ nsWindow::GetParentWindow() ] & [@ nsBaseWidget::Destroy()]?
OK, here's my debugging assessment. I don't have time to work up a patch and test it now. But basically, widget destruction on windows is completely broken - it's a wonder it doesn't crash more... This is besides the duplicate member variable problem. The sequence of preferred actions appears to be nsWindow::Destroy() which calls nsBaseWidget::Destroy(), nsWindow::OnDestroy() which calls nsBaseWidget::OnDestroy() and finally nsWindow::~nsWindow(). Lets look at these in order: 1381 NS_METHOD nsWindow::Destroy() 1382 { 1383 // Switch to the "main gui thread" if necessary... This method must 1384 // be executed on the "gui thread"... 1385 nsToolkit* toolkit = (nsToolkit *)mToolkit; 1386 if (toolkit != nsnull && !toolkit->IsGuiThread()) { 1387 MethodInfo info(this, nsWindow::DESTROY); 1388 toolkit->CallMethod(&info); 1389 return NS_ERROR_FAILURE; 1390 } This is fine (the bogus failure code is always ignored). 1392 // disconnect from the parent 1393 if (!mIsDestroying) { 1394 nsBaseWidget::Destroy(); 1395 } This is wrong. This should test be: if (mIsDestroying) // Should be an atomic test and set... return NS_OK; nsCOMPtr<nsIWidget> kungFuDeathGrip(this); mIsDestroying = PR_TRUE; nsBaseWidget::Destroy(); 1397 // just to be safe. If we're going away and for some reason we're still 1398 // the rollup widget, rollup and turn off capture. 1399 if ( this == gRollupWidget ) { 1400 if ( gRollupListener ) 1401 gRollupListener->Rollup(nsnull); 1402 CaptureRollupEvents(nsnull, PR_FALSE, PR_TRUE); 1403 } This is OK. 1405 EnableDragDrop(PR_FALSE); 1406 1407 // destroy the HWND 1408 if (mWnd) { 1409 // prevent the widget from causing additional events 1410 mEventCallback = nsnull; 1411 1412 // if IME is disabled, restore it. 1413 if (mOldIMC) { 1414 mOldIMC = ::ImmAssociateContext(mWnd, mOldIMC); 1415 NS_ASSERTION(!mOldIMC, "Another IMC was associated"); 1416 } 1417 1418 HICON icon; 1419 icon = (HICON) ::SendMessageW(mWnd, WM_SETICON, (WPARAM)ICON_BIG, (LPARAM) 0); 1420 if (icon) 1421 ::DestroyIcon(icon); 1422 1423 icon = (HICON) ::SendMessageW(mWnd, WM_SETICON, (WPARAM)ICON_SMALL, (LPARAM) 0); 1424 if (icon) 1425 ::DestroyIcon(icon); 1426 1427 #ifdef MOZ_XUL 1428 if (eTransparencyTransparent == mTransparencyMode) 1429 { 1430 SetupTranslucentWindowMemoryBitmap(eTransparencyOpaque); 1431 1432 } 1433 #endif This is OK, until here. We should add: SubclassWindow(FALSE); 1434 1435 VERIFY(::DestroyWindow(mWnd)); 1436 1437 mWnd = NULL; 1438 //our windows can be subclassed by 1439 //others and these nameless, faceless others 1440 //may not let us know about WM_DESTROY. so, 1441 //if OnDestroy() didn't get called, just call 1442 //it now. MMP 1443 if (PR_FALSE == mOnDestroyCalled) 1444 OnDestroy(); 1445 } 1446 1447 return NS_OK; 1448 } This should be: VERIFY(::DestroyWindow(mWnd)); mWnd = NULL; } //our windows can be subclassed by //others and these nameless, faceless others //may not let us know about WM_DESTROY. so, //if OnDestroy() didn't get called, just call //it now. MMP if (PR_FALSE == mOnDestroyCalled) OnDestroy(); return NS_OK; } Looking at OnDestroy: 5759 void nsWindow::OnDestroy() 5760 { 5761 mOnDestroyCalled = PR_TRUE; This should test and exit. 5763 SubclassWindow(FALSE); 5764 5765 // We have to destroy the native drag target before we null out our 5766 // window pointer 5767 EnableDragDrop(PR_FALSE); 5768 5769 mWnd = NULL; This should test mIsDestroying and call Destroy() and exit. This should be done before the test for mOnDestroyCalled. 5770 5771 // free GDI objects 5772 if (mBrush) { 5773 VERIFY(::DeleteObject(mBrush)); 5774 mBrush = NULL; 5775 } 5776 5777 #if 0 5778 if (mPalette) { 5779 VERIFY(::DeleteObject(mPalette)); 5780 mPalette = NULL; 5781 } 5782 #endif 5783 5784 // if we were in the middle of deferred window positioning then 5785 // free the memory for the multiple-window position structure 5786 if (mDeferredPositioner) { 5787 VERIFY(::EndDeferWindowPos(mDeferredPositioner)); 5788 mDeferredPositioner = NULL; 5789 } 5790 5791 // release references to children, device context, toolkit, and app shell 5792 nsBaseWidget::OnDestroy(); nsBaseWidget::OnDestroy() does not destroy children (and there is not more app shell). GTK widgets explicitly destroy their children in Destroy()... should windows be doing that too? It seems cleaner... Otherwise OK. 5793 5794 // dispatch the event 5795 if (!mIsDestroying) { 5796 // dispatching of the event may cause the reference count to drop to 0 5797 // and result in this object being destroyed. To avoid that, add a reference 5798 // and then release it after dispatching the event 5799 AddRef(); 5800 DispatchStandardEvent(NS_DESTROY); 5801 Release(); 5802 } 5803 } This is not needed. Or otherwise should the ordering be OnDestroy(), Destroy()? If it is needed, should this have a kungfuDeathGrip? 697 nsWindow::~nsWindow() 698 { 699 mIsDestroying = PR_TRUE; This should test and call Destroy(). Or just call destroy unconditionally... 700 if (gCurrentWindow == this) { 701 gCurrentWindow = nsnull; 702 } 703 704 MouseTrailer* mtrailer = nsToolkit::gMouseTrailer; 705 if (mtrailer) { 706 if (mtrailer->GetMouseTrailerWindow() == mWnd) 707 mtrailer->DestroyTimer(); 708 709 if (mtrailer->GetCaptureWindow() == mWnd) 710 mtrailer->SetCaptureWindow(nsnull); 711 } These need mWnd, so should be moved to Destroy(). 713 // If the widget was released without calling Destroy() then the native 714 // window still exists, and we need to destroy it 715 if (NULL != mWnd) { 716 Destroy(); 717 } Called above. 719 if (mCursor == -1) { 720 // A successfull SetCursor call will destroy the custom cursor, if it's ours 721 SetCursor(eCursor_standard); 722 } Move to destroy? 724 sInstanceCount--; 725 726 #ifdef NS_ENABLE_TSF 727 if (!sInstanceCount) 728 nsTextStore::Terminate(); 729 #endif //NS_ENABLE_TSF 730 731 #ifndef WINCE 732 // 733 // delete any of the IME structures that we allocated 734 // 735 if (sInstanceCount == 0) { 736 if (sIMECompUnicode) 737 delete sIMECompUnicode; 738 if (sIMEAttributeArray) 739 delete [] sIMEAttributeArray; 740 if (sIMECompClauseArray) 741 delete [] sIMECompClauseArray; 742 743 NS_IF_RELEASE(gCursorImgContainer); 744 745 if (sIsOleInitialized) { 746 ::OleFlushClipboard(); 747 ::OleUninitialize(); 748 sIsOleInitialized = FALSE; 749 } 750 } 751 752 NS_IF_RELEASE(mNativeDragTarget); 753 #endif 754 755 } These are OK. Otherwise, once the order of Destroy() and OnDestroy() is fixed, all of the other tests for mIsDestroying and mOnDetroyCalled should be changed to whichever is first. Finally, and what I think is actually the bug here. SubClassWindow should addref and release the this pointer. Hope this helps.
One more thing... The WindowProc does and conditional kungfuDeathGrip on the target, which assumes nsIsDestroying is only set in the destructor. GTK has a nsIsDestroyed flag which is set in the destructor. Maybe windows should have one of these too?
Adding a bit of data - I was looking at the normal shutdown steps a windowed plugin takes when it's destroyed. Similar to Jeremy's break down: Starting a shutdown for a Flash nsObjectFrame: nsObjectFrame::StopPluginInternal nsPluginInstanceOwner::PrepareToStop -> mWidget->SetParent(nsnull) nsObjectFrame::StopPluginInternal done. (time passes for a delayed plugin teardown which gets fired off as a runnable event) nsObjectFrame's DoStopPlugin nsPluginInstanceOwner::Destroy() nsWindow::Destroy() nsBaseWidget::Destroy() nsWindow::GetParentWindow() mWnd = 0x260afe | parent hwnd = 0x10010 | parent widget = 0x0 This is where this crash would occur. A couple notes - we set the native parent window of the plugin to null way up in nsPluginInstanceOwner's PrepareToStop. This happens way before we actually destroy the window. This is the same parent hwnd we retrieve in GetParentWindow later on when we are destroying. During destruction, GetParentWindow's ::GetParent(mWnd) call actually returns the handle to the desktop instead, since GetParent returns the parent _or_ the owner, which in this case is the desktop. We then query that for a mozilla property through GetProp to get the widget, which fails and everything moves on smoothly... nsWindow::GetParentWindow() done. nsBaseWidget::Destroy() done. nsWindow::OnDestroy() nsWindow::SubclassWindow(0) 0x260afe nsBaseWidget::OnDestroy() nsBaseWidget::OnDestroy() done. nsWindow::OnDestroy() done. nsWindow::Destroy() done. ~nsWindow() ~nsWindow() done. So that's a normal shutdown. This bug is rather odd in that in the crash case GetParentWindow retrieves a valid parent hwnd, retrieves a valid mozilla property for the widget pointer, and then tries to access mIsDestroying for the widget. But if shutdown is working correctly, parent should have been set to null and the desktop hwnd would have been returned. I still don't have a specific answer as to how this could happen, but I think it's something more than a slight timing issue in how we destroy. I'm going to look a bit more at the plugin code. I'm wondering if there are cases where we do not delay the unload of plugins like flash. (Also we should probably be calling GetAncestor instead of GetParent in GetParentWindow so we can avoid checking for properties on the desktop window during shutdown.)
Actually, looking at the stacks, all of these crashes occur on threads that are coming through nsStopPluginRunnable::Run, so the delayed vs. immediate destruction isn't the issue. These crashes are all on delayed tear down events.
Attached patch getparent / destroy variable cleanup v.1 (obsolete) (deleted) — Splinter Review
I'm going to look at Jeremy's other cleanup suggestions but I'd like to start with something simple - cleanup the variable madness and make sure get parent is always working with a parent window handle.
Attachment #373186 - Flags: review?(vladimir)
Comment on attachment 373186 [details] [diff] [review] getparent / destroy variable cleanup v.1 ugh, field shadowing.. can you get rid of the other shadowed fields as well? mIs{Shift,Control,Alt}Down are the ones I see at a quick glance
Attachment #373186 - Flags: review?(vladimir) → review+
Looks like it was this little block - beos and os2 also use most of these. PRPackedBool mIsShiftDown; PRPackedBool mIsControlDown; PRPackedBool mIsAltDown; PRPackedBool mIsDestroying; PRPackedBool mOnDestroyCalled; new patch coming up...
Attached patch getparent / destroy variable cleanup v.2 (obsolete) (deleted) — Splinter Review
Threw this at try. I'll land it on m-c once that clears.
Attachment #373186 - Attachment is obsolete: true
Attachment #373201 - Attachment is patch: true
> - HWND parent = ::GetParent(mWnd); > + HWND parent = ::GetAncestor(mWnd, GA_PARENT); Actually we also need the owner here when GetTopLevelWindow() is called with aStopOnDialogOrPopup=PR_FALSE.
(In reply to comment #43) > > - HWND parent = ::GetParent(mWnd); > > + HWND parent = ::GetAncestor(mWnd, GA_PARENT); > Actually we also need the owner here when GetTopLevelWindow() is called with > aStopOnDialogOrPopup=PR_FALSE. Hmm, I'm not seeing why this is. GetTopLevelWindow would return the current window if it's top and I don't see how that would mess up caret position data in calls to ResolveIMECaretPos. I'll do some debugging though and see if I can spot what the potential problem might be. If we do need owner for the ime code, I think it might be best to differentiate between the two calls so that other calls to getparentwindow return the true parent.
It will affect the popup window (e.g. "Organize Bookmarks" dialog). We want a main window pointer from children of the popup. > I think it might be best to differentiate between the two calls so that other > calls to getparentwindow return the true parent. I agree.
(In reply to comment #45) > It will affect the popup window (e.g. "Organize Bookmarks" dialog). We want a > main window pointer from children of the popup. > > I think it might be best to differentiate between the two calls so that other > > calls to getparentwindow return the true parent. > I agree. How will it effect windows like organize bookmarks? Is this an IME issue or something more general? I've been using a build of this for a day now and haven't seen any strange behavior during normal use.
Attached patch getparent / destroy variable cleanup v.3 (obsolete) (deleted) — Splinter Review
Here's an update that chooses between the two so that calls to GetTopLevelWindow continue to return owner.
Attachment #373201 - Attachment is obsolete: true
Attachment #373343 - Attachment is patch: true
Independent of these changes, this crash still seems pretty prevenlent in pre beta 4. Three crash signatures all seems to tie in with this - From the last week: nsWindow::GetParentWindow() (rank 14) - http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b4pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow() 0 xul.dll nsWindow::GetParentWindow widget/src/windows/nsWindow.cpp:1651 1 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1625 2 xul.dll nsBaseWidget::Destroy widget/src/xpwidgets/nsBaseWidget.cpp:259 3 xul.dll nsWindow::Destroy widget/src/windows/nsWindow.cpp:1523 nsBaseWidget::Destroy() (rank 39) - http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b4pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsBaseWidget%3A%3ADestroy() 0 xul.dll nsBaseWidget::Destroy widget/src/xpwidgets/nsBaseWidget.cpp:261 1 xul.dll nsWindow::Destroy widget/src/windows/nsWindow.cpp:1523 2 xul.dll nsPluginInstanceOwner::Destroy layout/generic/nsObjectFrame.cpp:4069 3 xul.dll DoStopPlugin layout/generic/nsObjectFrame.cpp:1986 254 NS_METHOD nsBaseWidget::Destroy() 255 { 256 // Just in case our parent is the only ref to us 257 nsCOMPtr<nsIWidget> kungFuDeathGrip(this); 258 // disconnect from the parent 259 nsIWidget *parent = GetParent(); 260 if (parent) { 261 parent->RemoveChild(this); nsWindow::Destroy() (rank 78) - (Last one of these was the 12th) http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b4pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3ADestroy() 0 @0x740072 1 xul.dll nsWindow::Destroy widget/src/windows/nsWindow.cpp:1489 2 xul.dll nsPluginInstanceOwner::Destroy layout/generic/nsObjectFrame.cpp:4069 3 xul.dll DoStopPlugin layout/generic/nsObjectFrame.cpp:1986 4 xul.dll nsStopPluginRunnable::Run layout/generic/nsObjectFrame.cpp:2023 5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:510 I was hoping to reproduce this from some extended use of a test build but still haven't seen it.
(In reply to comment #46) > How will it effect windows like organize bookmarks? Sorry, I have confused a dialog name. It should read "Edit This Bookmark" dialog (i.e. a dialog which will appear when tha star icon is clicked). > Is this an IME issue or > something more general? I've been using a build of this for a day now and > haven't seen any strange behavior during normal use. IME issue. With cleanup v.2 patch and ChangJie, composition window was displayed at wrong position on "Edit Bookmarks" dialog. Even worse, With cleanup v.3 patch, Firefox hanged when I tried to input on "Edit Bookmarks" dialog using ChangJie. It seems to regress bug 452385. Without neither patches, there was no problem.
(In reply to comment #49) > (In reply to comment #46) > > How will it effect windows like organize bookmarks? > Sorry, I have confused a dialog name. It should read "Edit This Bookmark" > dialog (i.e. a dialog which will appear when tha star icon is clicked). > > Is this an IME issue or > > something more general? I've been using a build of this for a day now and > > haven't seen any strange behavior during normal use. > IME issue. With cleanup v.2 patch and ChangJie, composition window was > displayed at wrong position on "Edit Bookmarks" dialog. > Even worse, With cleanup v.3 patch, Firefox hanged when I tried to input on > "Edit Bookmarks" dialog using ChangJie. It seems to regress bug 452385. > Without neither patches, there was no problem. Thanks for testing. I guess I'll go back to using ::GetParent on the IME calls.
Attachment #373343 - Attachment is obsolete: true
The new patch looks good. I haven't find any apparent problems during my short, non-scientific testing.
Comment on attachment 373373 [details] [diff] [review] [m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3 r? -> vlad since there have been some additional changes. This might fix the crash problem so we should consider getting it into 1.9.1 after it's sat on trunk for a few days.
Attachment #373373 - Flags: review?(vladimir)
Attachment #373373 - Attachment description: getparent / destroy variable cleanup v.3 → [m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3
And... ?
Attachment #373373 - Flags: approval1.9.1?
So we generally know which thing is NULL; is there any data at that point that we could use to figure out what happened? I'm thinking we should maybe add some logging; unfortunately breakpad doesn't support sending arbitrary logged data, but we could leave a file in some dir and ask people to mail it in if they see it. (Should always append to it, etc.)
Looks like this can actually be done: http://mxr.mozilla.org/mozilla-central/source/xpcom/system/nsICrashReporter.idl#101 (appendAppNotesToCrashReport) and it'll show up in breakpad.
Attached patch Properly call Destroy() (obsolete) (deleted) — Splinter Review
I just pushed this patch to the try server... It's a big hammer approach to this, which makes sure Destroy() is called correctly. To do that, SetParent() also needs to manage the parent's child list, which most platforms where not doing. My brief testing showed that the new asserts were not being called, and that reftests and mochitests looked good, except for a few, which appears to caused by a focus problem (the main window isn't refocused after a popup). I'll see what the tryserver says. I think this patch is to intrusive for 3.5. But some things I did learn... It appears that nsWindow::~nsWindow could be called multiple times... That's not good. I think what might be happening is that the code is actually stepping through deleted memory which gets overwritten at some point.
Jeremy, out of interest, do you have a test case where SetParent is called from nsViewManager::ReparentChildWidgets?
Karl... No. I've just been reading the code. bug 129034 is the original implementation. The test case there now asserts on something else (bug 485893 - it doesn't get to SetParent()). It seems to have regressed from 3.0.10, but I haven't looked at it carefully. If SetParent(nonnull) is not needed then these could all be changed to ResetParent(), like what Cocoa did recently, and the code greatly simplified. The assert in nsViewManager::ReparentChildWidgets has never had a bug filed on it. But this is getting off topic.
A possible fix for 1.9.1 might be to move (or add) a call to SetParent(nsnull) just before we destroy the widget in nsPluginInstanceOwner's Destroy. We clear it in PrepareToStop (just below Destroy in the code) a little bit before we actually tear down. http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsObjectFrame.cpp#4217
(In reply to comment #61) > A possible fix for 1.9.1 might be to move (or add) a call to SetParent(nsnull) > just before we destroy the widget in nsPluginInstanceOwner's Destroy. We clear > it in PrepareToStop (just below Destroy in the code) a little bit before we > actually tear down. > > http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsObjectFrame.cpp#4217 Hmm, we call GetParent in SetParent, so that's not going to work without changes to the window code.
Summarizing what's known at this point - - The crash is sporadic, and hard to reproduce - The crash occurs in delayed tear down of a plugin window - We know that SetParent is called in PreparToStop before the tear down event is sent, and it seems that it must succeed in setting the parent widget pointer in the native window's property list to null. - When the event comes back, we go into what is a rather haphazzard process of unhooking and destroying the widget and window. During this process we retrieve the parent widget pointer via the property list. - The parent widget pointer returned on a valid native window handle is not null and does not point to a valid parent widget which results in the crash. I've tried to reproduce this, and looked at ways to do a one off for 1.9.1 but really am not coming up with much of anything except hacks. Vlad suggested logging but I didn't see anything there that would help. What has been suggested is a rework of our tear down methodology with a heavy handed patch in the hopes it will fix the problem. But nobody feels confident about applying that to 1.9.1 this late in the ball game. I'm not entirely sure what direction to take, so I'd welcome any help / suggestions.
(In reply to comment #63) > What has been suggested is a rework of our tear down methodology with a heavy > handed patch in the hopes it will fix the problem. But nobody feels confident > about applying that to 1.9.1 this late in the ball game. Can we start working on this now, land it for 1.9.2 as soon as it's ready and see if it fixes the problem there? If so, and if it bakes for a while, we might be able to take it in 1.9.1.x, despite its size/scope.
(In reply to comment #58) > Created an attachment (id=376333) [details] > Properly call Destroy() > > I just pushed this patch to the try server... It's a big hammer approach to > this, which makes sure Destroy() is called correctly. To do that, SetParent() > also needs to manage the parent's child list, which most platforms where not > doing. My brief testing showed that the new asserts were not being called, and > that reftests and mochitests looked good, except for a few, which appears to > caused by a focus problem (the main window isn't refocused after a popup). > I'll see what the tryserver says. > > I think this patch is to intrusive for 3.5. But some things I did learn... It > appears that nsWindow::~nsWindow could be called multiple times... That's not > good. I think what might be happening is that the code is actually stepping > through deleted memory which gets overwritten at some point. Jeremy, how are you feeling about your destroy rework patch? Want to kick off some reviews?
Spoke to Jeremy through email, his patch had some issues and didn't make it through try, so I'll start in on something myself.
I was going to attach an updated patch, but the more I little bugs I've fixed, the more creep in... What I have become convinced of is that the windows cleverness of using the windows ::GetParent() call is the main problem, because it's ownership model and semantics are just not compatible with what nsIWidget wants. I think that an alternative approach here is to add a nsRefPtr<nsWindow> mParent to nsWindow, and manage it in Create() [=baseParent], SetParent() and Destroy(). Then we can remove GetParentWindow(), since mParent would now be correct. Also, many of the uses of ::GetParent() in nsWindow.cpp are bogus - they should already be using GetParentWindow(). Also, I think that Destroy() should be called on all widgets. I think that the only needed fixes are in nsXULWindow (the last hunk of my patch), and then a loop over the children in Destroy(), like GTK already does. I would also make the loop over the children use a nsCOMPtr so that they are only Released() and destructed when Destroy has returned, just to make life more predictable. I'll merge most of the rest of my changes into the cleanup I'm doing in bug 431634.
This is marked topcrash - do we know how much of a crasher this is under b4?
According to a discussion w/ jimm in IRC, we saw 4991 in a week on 3.0.0.11. In any case, after talking this over with Joe Drew and jimm in IRC, since we can't reproduce this crash, it's in 3.0, we don't have a regression range, *and* we're really down to the last 1-3 bugs that are going to actually hold the release, I think we're at the point where we should minus this. Marking as such. If people disagree and think we should hold the release for this bug, please re-nom and comment.
Flags: blocking1.9.1+ → blocking1.9.1-
(In reply to comment #68) > This is marked topcrash - do we know how much of a crasher this is under b4? http://crash-stats.mozilla.com/query/query?do_query=1&product=Firefox&version=Firefox%3A3.5b4&platform=windows&date=&range_value=1&range_unit=weeks&query_search=signature&query_type=contains&query=GetParentWindow 1597 for one week. If anyone has any ideas on how to pull up a regression range pre the 3.0 release I'd welcome the input. This was present in the first 3.0 release.
(In reply to comment #69) > If people disagree and think we should hold the release for this bug, > please re-nom and comment. I don't disagree, but I think we should block on this for 1.9.2 and consider backporting it to 1.9.1.x if a fix sticks and is relatively safe.
Flags: wanted1.9.1.x?
Flags: blocking1.9.2?
Jim, it's hard without any way of reproducing this crash. All the crash stats have no comments. Chris, could you check if we have a domain or url where this mostly happens? If you can't we should file an it bug to get the results. We still have the email addresses per crash report if users accept their transmission, right? Could we find some users for those the crash happens regularly?
looking at the url data it doesn't look to helpful in this case. the distribution of sites and urls to these crash signatures is very wide. that matches up with the ideas in comment 67 suggesting these crashes are a result of bad interaction within the chrome and general window management; not with anything content related. As for examples of this it looks like 760 different sites for just the 1500 some odd nsWindow::GetParentWindow() crashes I'm looking at in a weeks worth of data. The list of top sites looks like: #crash - domain 78 www.youtube.com 23 mail.google.com 22 www.google.com 21 www.facebook.com 9 www.google.com.vn 8 www.google.fr 8 apps.facebook.com 7 www.orkut.com.br 7 win.mail.ru 7 us.mg2.mail.yahoo.com 6 www.orkut.co.in 6 in.mg50.mail.yahoo.com 5 us.mg1.mail.yahoo.com 5 depositfiles.com 5 addons.mozilla.org 4 www.ynet.co.il 4 www.mininova.org 4 www.microsoft.com 4 www.justin.tv 4 www.google.ro 4 www.google.it 4 www.google.co.id 4 www.gametrailers.com one user might have even hit the crash on about:config , and others hit it on searches, startpage and just a variety of frequent and common browsing activities. to drill down on the top of the youtube list it looks like: Firefox 3.5b4 2 http://www.youtube.com/watch?v=daTTOyu-E1w&hd=1&feature=hd nsWindow::GetParentWindow() Firefox 3.5b4 2 http://www.youtube.com/watch?v=L5brwhalCG0&feature=PlayList&p=BBCB18EBEEF42EC2&index=0&playnext=1 nsWindow::GetParentWindow() Firefox 3.5b4 2 http://www.youtube.com/watch?v=BsAtRKrbB8w nsWindow::GetParentWindow() Firefox 3.5b4 2 http://www.youtube.com/watch?v=wXUMZb7gqdg nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=1DODjKLUMVU nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=UYpxIaAI51M nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=SJvtkCFaKaE nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=PGuhX5AmjuA nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=UuBm0dvIzcc&feature=fvst nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/ nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=MYNEYp06EkI nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=kMEFCRbbi0g nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/activesharing_history nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=I8kcnYCHxTk nsWindow::GetParentWindow() Firefox 3.5b4 1 http://www.youtube.com/watch?v=eUm-zvGsgiE nsWindow::GetParentWindow() for nsBaseWidget::Destroy stack signature its basically the same thing. Another 625 crash reports in a weeks worth of 3.5b4 data and 457 different domains listed in the urls 40 www.youtube.com 29 mail.google.com 17 www.facebook.com 11 www.google.com 11 apps.facebook.com 5 www.orkut.co.in 5 www.google.com.vn 5 hi5.com 4 www.google.co.in
I do see a few interesting things that appear to be messing with/reading chrome. I guess one check would be to look at a bunch of reports to try and figure out if there is a common or set of common addons installed like adblockplus, ietab, or speeddial. There really are not very many of these kinds of reports, but the general question of "does chrome get messed up by an addon or cause timing problems not seen otherwise?" might be an interesting one to try and figure out. release # crashes url at time of crash -- stack signature Firefox 3.5b4 1 chrome://adblockplus/content/tip_subscriptions.xul nsWindow::GetParentWindow() Firefox 3.5b4 1 chrome://ietab/content/reloaded.html?url=http://dl.xunlei.com/index.htm?tag=1 nsWindow::GetParentWindow() Firefox 3.5b4 1 chrome://speeddial/content/speeddial.xul nsWindow::GetParentWindow() Firefox 3.5b4 1 chrome://wot/locale/loading.html#aHR0cDovL3d3dy50dnZhdWdoYW4uY29tLw%3D%3D nsWindow::GetParentWindow() Firefox 3.5b4 1 about:sessionrestore nsBaseWidget::Destroy() Firefox 3.5b4 1 chrome://ietab/content/reloaded.html?url=http://update.microsoft.com/*de-ch nsBaseWidget::Destroy() Firefox 3.5b4 1 chrome://speeddial/content/speeddial.xul nsBaseWidget::Destroy()
Because it generally seems to be related to plugin destruction, the actual URL may not even be relevant -- for example, it could be caused by destroying a plugin that was on the page you were visiting previously. That would be my guess even, given the high number of google.com and speeddial crashes... then again, there's an about:sessionrestore crash, and that's kinda weird :)
(In reply to comment #75) > Because it generally seems to be related to plugin destruction, the actual URL > may not even be relevant -- for example, it could be caused by destroying a > plugin that was on the page you were visiting previously. That would be my > guess even, given the high number of google.com and speeddial crashes... then > again, there's an about:sessionrestore crash, and that's kinda weird :) Yeah the current url is not assured to be accurate. The dealyed tear down can take some time. There is an interesting little bit of code in nsStopPluginRunnable's Run that delays tear down due to nesting level of events - http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsObjectFrame.cpp#1965 I've been looking at this recently because in my testing this code never gets used. A possible suspect maybe.
> there's an about:sessionrestore crash, and that's kinda weird :) yeah, I've been looking at a bunch of cases where this helps as a semi-reliable way to find reproducable crashers. e.g. 1) user crashes on a page 2) sesssion restore kicks in and user selects to load the pages from previous crashes 3) page(s) load and the crash happens again this case pokes holes in that theory as being fully reliable. I'm guessing the session restore isn't initiating any plugin page teardowns. I'm getting time of crash, time since last crash, and uptime (time between startup and crash) and that will help to understand and connect these about:sessionrestore crashes to other events on a given users PC a bit better.
I am getting some really crazy stuff going on, including what looked like this crash, by triggering the code in bug 389931. I'm also seeing a lot of slow script warnings, assertions in /mozilla/docshell/shistory/src/nsshentry.cpp's nsSHEntry::AddChild, assertions in the plugin code, hung instances, and plugins that hang around in memory and keep playing. Steps to reproduce: 1) open two windows 2) in the second window, browse to a page and hit ctrl-s 3) leave the modal file chooser open and switch back to the first window 4) browse around to sites with plugins like youtube & yahoo, use multiple tabs, close tabs, etc. 5) try switching back and forth between the two windows occasionally. All kinds of interesting things happen. Script: chrome://global/content/bindings/general.xml:0 Script: chrome://browser/content/browser.js:3552 Script: chrome://global/content/bindings/general.xml:0 && WARNING: NS_ENSURE_TRUE(doc) failed: file f:/Mozilla/firefox/470487/mozilla/modules/plugin/base/src/nsNPAPIPlugin.cpp, line 1179 WARNING: NS_ENSURE_TRUE(cx) failed: file f:/Mozilla/firefox/470487/mozilla/modules/plugin/base/src/nsNPAPIPlugin.cpp, line 1192 WARNING: NS_ENSURE_TRUE(doc) failed: file f:/Mozilla/firefox/470487/mozilla/modules/plugin/base/src/nsNPAPIPlugin.cpp, line 1499 are common. I can't say for sure that this is the cause of this crash. I did see a couple crashes in nswindow's destroy code, but it wasn't in the same place. It was however at points where nswindow was accessing member variables in the base widget. I can address most of this (except the slow script warnings) by simply commenting out the timer code block in nsObjectFrame's nsStopPluginRunnable::Run(), so that plugins destroy when they are initially asked to. Maybe QA can lend a hand in getting some reproducible cases, so much is going on here I'm not sure where all of it is coming from. This is all on a trunk build.
The patch in bug 389931 looks to have landed in Fx 3.0, beta 3, so that would put it in the right time frame for when this crash started showing up.
I've successfully reproduced a crash in beta 4 (no stack dump though) following the above steps. The browser re-opened both windows and reloaded the same content, which might explain some of the crash reports we've seen.
Flags: blocking1.9.1- → blocking1.9.1?
Bug 420886 is where the delayed tear down code for plugins landed. Apparently it addressed a crash bug in realplayer post the landing of bug 389931.
Bug 422926 may also be related.
Given Jim's data, I'd like to see this go back on the 1.9.1 blocking list -- I think we've got a potential can of worms here.
Filed bug 493578 on the slow script warnings, they are unrelated to this.
(In reply to comment #81) > Bug 420886 is where the delayed tear down code for plugins landed. Apparently > it addressed a crash bug in realplayer post the landing of bug 389931. A general observation - the code in bug 420886 seems to be compensating for a bug in real player. In nsPluginInstanceOwner's PrepareToStop, we hide the window, disable it, and then during destruction the realplayer window receives wm_destroy, at which point it should destroy any modal windows. Maybe the right thing to do here is to pull the patch in 420886, and get with real to work on a fix. We seem to be jumping through a lot of hoops and destablizing our own code in other ways to compensate for a an issue in realplayer.
Moving the blocking request to bug 493601.
Flags: blocking1.9.1?
Comment on attachment 373373 [details] [diff] [review] [m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3 a191=beltzner
Attachment #373373 - Flags: approval1.9.1? → approval1.9.1+
re: comment 85 kev / jst/ josh, do we have any recent contacts at real that we could open up that discussion.
I'll ping a couple people I know there and see if we can get the RP team engaged.
http://tinyurl.com/nds6x8 shows the current crashes for the Firefox 3.5 release. With 3058 crashes it is currently the #4 topcrash, so it doesn't appear the patch fixed the issue for 1.9.1.
Summary: Firefox Crash [@ nsWindow::GetParentWindow() ] & [@ nsBaseWidget::Destroy()]? → Firefox Crash [@ nsWindow::GetParentWindow(int)] & [@ nsBaseWidget::Destroy()]?
Might be interesting to look at mozilla-central in a few days with bug 500925 landing today.
(In reply to comment #93) > Might be interesting to look at mozilla-central in a few days with bug 500925 > landing today. My guess is that won't help much as this is a widget tear down issue. In general I guess this is still a top issue and needs to be addressed sooner rather than later. Now that the nswindow refactoring is complete, I'll look at this again.
Jim: Is it possible this is fixed in bug 493601? If so, it's likely that this will be fixed in 1.9.1 later.
(In reply to comment #95) > Jim: Is it possible this is fixed in bug 493601? If so, it's likely that this > will be fixed in 1.9.1 later. Doesn't look like it as it's still showing up on 3.6a1pre - http://crash-stats.mozilla.com/query/query?do_query=1&product=Firefox&version=Firefox%3A3.6a1pre&platform=windows&date=&range_value=1&range_unit=weeks&query_search=signature&query_type=contains&query=GetParentWindow
Seems like we have more signatures now - xul.dll nsWindow::GetParentWindow widget/src/windows/nsWindow.cpp:1668 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1634 xul.dll nsBaseWidget::SetZIndex widget/src/xpwidgets/nsBaseWidget.cpp:385 xul.dll xul.dll@0x2a44f9 xul.dll nsIView::CreateWidget view/src/nsView.cpp:699 xul.dll nsMenuPopupFrame::CreateWidgetForView layout/xul/base/src/nsMenuPopupFrame.cpp:280 xul.dll xul.dll@0x320c70 -- xul.dll nsWindow::GetParentWindow widget/src/windows/nsWindow.cpp:1668 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1634 xul.dll nsBaseWidget::SetZIndex widget/src/xpwidgets/nsBaseWidget.cpp:385 xul.dll xul.dll@0x2a44f9 xul.dll nsIView::CreateWidget view/src/nsView.cpp:699 xul.dll nsMenuPopupFrame::CreateWidgetForView layout/xul/base/src/nsMenuPopupFrame.cpp:280 xul.dll nsMenuPopupFrame::EnsureWidget layout/xul/base/src/nsMenuPopupFrame.cpp:230 xul.dll nsMenuPopupFrame::InitializePopup layout/xul/base/src/nsMenuPopupFrame.cpp:497 xul.dll nsXULPopupManager::ShowPopup layout/xul/base/src/nsXULPopupManager.cpp:454 The DoStopPlugin one though is still the most common. We might have two bugs here, one on Destroy and another on CreateWidget.
This is a refactor of the tear down code in nsWindow. I’d welcome comments. I’ve paired Destroy() down to simply calling DestroyWindow, and rely on the WM_DESTROY event handler to do the actual tear down. Within OnDestroy, I’ve reorganized the order things into what I would consider a more proper tear down order. This may not fix the crash problem, but it should give us a better feel for exactly what’s going on and when. If the crash still occurs in the call to GetParent in nsBaseWidget’s Destroy(), my guess would be that we could eliminate nsWindow code as being the problem and instead look at the view/plug-in code. The reason I say this is that the plugin code calls SetParent(nsnull) on the widget in nsPluginInstanceOwner’s PrepareToStop, which sets the native parent to null and also removes the window from the parent widget's child list. This (supposedly) happens way before Destroy is called on the window. If a call to GetParent on the final Destroy() call returns a valid Mozilla window handle with a bad |this| pointer, nsWindow’s tear down wouldn’t be the problem AFAICT.
Attachment #376333 - Attachment is obsolete: true
Attachment #387214 - Flags: review?(neil)
Comment on attachment 387214 [details] [diff] [review] [m-c checkin: da42cf530fa7] teardown refactor v.1 This passed try fine. I've also been running the try build now for a couple days without problem. Neil, would you be a good reviewer for something like this?
Blocks: 503561
Attachment #387214 - Flags: review?(neil) → review+
Attachment #387214 - Attachment description: teardown refactor v.1 → [m-c checkin: da42cf530fa7] teardown refactor v.1
did this land on the branch?
Group: core-security
Keywords: fixed1.9.1
(In reply to comment #104) > did this land on the branch? We could do that if it looks like the patch actually addressed the issue. It would require a little back porting since we refactored nswindow on trunk. I'd like to let this sit on trunk for a while (maybe 2 or 3 weeks) to test to see if it actually fixed the problem. So far we're about 5 days out from the landing and there hasn't been an occurrence in nightlies - not enough time to confirm IMHO.
For reference, the patch landed on July 13th - http://hg.mozilla.org/mozilla-central/rev/da42cf530fa7 Crash stats - http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.6a1pre&platform=windows&query_search=signature&query_type=contains&query=GetParentWindow&date=&range_value=2&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28int%29 The last DoStopPlugin crash was in a build from the 12th. We had a crash in GetParentWindow from a build on the 13th, but that was in CreateWidget, which might be something else. Hard to say at this point if the problem is fixed.
I see. you have two bugs in one here. that's why beltzner marked it fixed1.9.1
Flags: blocking1.9.2? → blocking1.9.2+
on trunk! I'll get a 1.9.1 patch put together.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attached patch teardown refactor 1.9.1 v.1 (obsolete) (deleted) — Splinter Review
Blocks: 506108
This is definitely fixed on trunk. http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.6a1pre&platform=windows&query_search=signature&query_type=contains&query=GetParentWindow&date=&range_value=2&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28int%29 Posted a 1.9.1 patch to try, if all goes well after a little testing, we should be good for a 1.9.1.3 release. Bug 506108 is filed on the CreateWidget crash. We could also back port this pretty easily to 3.0 if somebody thinks it's needed.
looking at crash reports from yesterday I still see this signature on 3.5.2 494 total crashes for nsWindow::GetParentWindow(int) on 20090817 distribution of versions where the crash was found on 20090817-crashdata.csv 430 Firefox 3.5.2 44 Firefox 3.5 20 Firefox 3.5.1 but nothing for 3.0.x does that match up with the places the patch has landed and/or needs to land?
chofmann: This was initially filed with the signature of nsWindow::GetParentWindow() and changed to be nsWindow::GetParentWindow(int) with an additional nsBaseWidget::Destroy() signature. While the "int" version of the first signature does not show on 1.9.0, the non-int version is still present, ranked #31 with the nsBaseWidget signature ranked at #73. The above is consistent with this patch having not landed on 1.9.1 or 1.9.0.
Flags: wanted1.9.1.x?
Verified fixed on trunk based on crash stats.
Status: RESOLVED → VERIFIED
Keywords: qawanted
Target Milestone: --- → mozilla1.9.2a1
(In reply to comment #111) > looking at crash reports from yesterday I still see this signature on 3.5.2 > > 494 total crashes for nsWindow::GetParentWindow(int) on 20090817 > > distribution of versions where the crash was found on 20090817-crashdata.csv > 430 Firefox 3.5.2 > 44 Firefox 3.5 > 20 Firefox 3.5.1 > > but nothing for 3.0.x > > does that match up with the places the patch has landed and/or needs to land? That's correct. 3.6 is what you want to look at to see if it's fixed. A 3.5.x patch hasn't landed because the original 3.6 patch back ported to 3.5 caused leaks in component manager that I was never able to track down.
What do you mean "never able to track down"? Are you no longer working on this? It's one of our major crashes, so if you need help with it please do ask for it.
(In reply to comment #115) > What do you mean "never able to track down"? Are you no longer working on > this? It's one of our major crashes, so if you need help with it please do ask > for it. These are the leaks that I can reliably reproduce on the unit test box: TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 484 bytes during test execution TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 1 instance of nsComponentManagerImpl with size 276 bytes TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 2 instances of nsLocalFile with size 88 bytes each (176 bytes total) TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 3 instances of nsStringBuffer with size 8 bytes each (24 bytes total) TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 2 instances of nsTArray_base with size 4 bytes each (8 bytes total) http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250700153.1250706746.27537.gz&fulltext=1#err7 Why a slight re-org of the window destruction code causes this is a mystery. I've poked around in component manager looking for an answer but haven't turned anything up. If anyone has any suggestions on how to better track this down I'd welcome the input. I'll post the current rev of the patch here in a sec.
Attached patch teardown refactor 1.9.1 v.3 (deleted) — Splinter Review
Attachment #390333 - Attachment is obsolete: true
Mass change: adding fixed1.9.2 keyword (This bug was identified as a mozilla1.9.2 blocker which was fixed before the mozilla-1.9.2 repository was branched (August 13th, 2009) as per this query: http://is.gd/2ydcb - if this bug is not actually fixed on mozilla1.9.2, please remove the keyword. Apologies for the bugspam)
Keywords: fixed1.9.2
Assignee: jmathies → jst
Are the leaks that bad? If this fixes the topcrash it might be worth it.
blocking1.9.1: ? → needed
(In reply to comment #119) > Are the leaks that bad? If this fixes the topcrash it might be worth it. They caused leak check test failures on try. FWIW, I can revisit this next week after the freeze.
So that leak output is basically leaking the component manager; everything else that's leaked is members of the component manager: mRegistryFile and mComponentsDir for the nsIFiles, mLoaderData and mPendingServices for the nsTArray_bases; I'd guess the strings are from mLoaderData (though not sure why it only has 3 entries). Is it possible that this patch effectively changes the timing of things somehow so that we end up executing code after the component manager has been shut down (via nsComponentManagerImpl::Shutdown) and somehow reinstantiates the component manager? Or something like that....
If this is what people were looking at, it's a different bug: http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A3.7a1pre&platform=windows&date=&range_value=3&range_unit=weeks&query_search=signature&query_type=contains&query=GetParentWindow&do_query=1# 0 xul.dll nsWindow::GetParentWindow widget/src/windows/nsWindow.cpp:1062 1 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1024 2 xul.dll GetWidgetOffset layout/base/nsLayoutUtils.cpp:833 3 xul.dll nsLayoutUtils::TranslateWidgetToView layout/base/nsLayoutUtils.cpp:854 4 xul.dll nsLayoutUtils::GetEventCoordinatesRelativeTo layout/base/nsLayoutUtils.cpp:674 5 xul.dll PresShell::HandleEvent layout/base/nsPresShell.cpp:6188 6 xul.dll nsViewManager::HandleEvent view/src/nsViewManager.cpp:1202 7 xul.dll nsViewManager::DispatchEvent view/src/nsViewManager.cpp:1181 8 xul.dll HandleEvent view/src/nsView.cpp:167 9 xul.dll nsWindow::DispatchEvent widget/src/windows/nsWindow.cpp:2763 10 xul.dll nsWindow::DispatchWindowEvent widget/src/windows/nsWindow.cpp:2786 11 xul.dll nsWindow::DispatchMouseEvent widget/src/windows/nsWindow.cpp:3161 12 xul.dll ChildWindow::DispatchMouseEvent
As long as we don't have a testcase or steps how to reproduce this crash there will be no way to run a regression test. Clearing flag for now.
We're going mark this as blocking 1.9.1.5. We'd like to get it in sooner rather than later, however, I'd like a second review on this. Technically security bugs need a superreview, but I'm explicitly asking for a second review (superreview or not) to get a better sanity check on the code since we haven't yet shipped this code in a release. Jim or Vlad: Is there a Windows person that would be good for a second review here?
blocking1.9.1: needed → .5+
(In reply to comment #124) > We're going mark this as blocking 1.9.1.5. We'd like to get it in sooner rather > than later, however, I'd like a second review on this. Technically security > bugs need a superreview, but I'm explicitly asking for a second review > (superreview or not) to get a better sanity check on the code since we haven't > yet shipped this code in a release. > > Jim or Vlad: Is there a Windows person that would be good for a second review > here? rob arnold, neil rashbrook, and doug turner come to mind. we still have the leaks to deal with though, so this isn't ready to be reviewed/land on 1.9.1.
Jim, can you fix those leaks? Per our discussion earlier today, crashes are our top priority right now.
(In reply to comment #126) > Jim, can you fix those leaks? Per our discussion earlier today, crashes are > our top priority right now. I can certainly try. Will look into what bz thought might be the cause (comment 121).
(In reply to comment #127) > (In reply to comment #126) > > Jim, can you fix those leaks? Per our discussion earlier today, crashes are > > our top priority right now. > > I can certainly try. Will look into what bz thought might be the cause (comment > 121). Things haven't changed much since the last time I worked on this. These leaks only show up on try server - reproducing them on an XP vmware image with both debug and release builds doesn't work. I guess I'll try playing around with a winnt 5.2 unit test image next. If I can reproduce this on something other than try server where I have more control, I might be able to find something. TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 484 bytes during test execution TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of nsComponentManagerImpl with size 276 bytes TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsLocalFile with size 88 bytes each (176 bytes total) TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 3 instances of nsStringBuffer with size 8 bytes each (24 bytes total) TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsTArray_base with size 4 bytes each (8 bytes total) TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 484 bytes during test execution TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of nsComponentManagerImpl with size 276 bytes TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsLocalFile with size 88 bytes each (176 bytes total) TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 3 instances of nsStringBuffer with size 8 bytes each (24 bytes total) TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsTArray_base with size 4 bytes each (8 bytes total) http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1255636233.1255645227.4871.gz&fulltext=1#err12
Reassigning to Jim since he's investigating here...
Assignee: jst → jmathies
Depends on: 523186
Whiteboard: [crashkill]
After getting access to a try server slave, I found the crashtest leaks corresponded to two tests involving applets - layout/base/crashtests/265986-1.html parser/htmlparser/tests/crashtests/328751-1.html Disabling those two tests addresses the crashtest leak problem. Will try to track down the mochitest-plain leaks, they probably related to applets as well. Another interesting note - disabling one test and leaving the other results in the same number of leaked bytes.
The other tests for mochitest-plain all tie to some html validation tests that make use of a java applet: http://mxr.mozilla.org/mozilla-central/source/dom/tests/mochitest/dom-level2-html/ http://mxr.mozilla.org/mozilla-central/source/dom/tests/mochitest/dom-level2-html/files/ test_HTMLBodyElement07.html \ test_HTMLBodyElement08.html \ test_HTMLBodyElement09.html \ test_HTMLBodyElement10.html \ test_HTMLBodyElement11.html \ test_HTMLBodyElement12.html \ test_HTMLDocument01.html \ test_HTMLDocument02.html \ test_HTMLDocument03.html \ test_HTMLDocument04.html \ test_HTMLDocument05.html \ test_HTMLDocument07.html \ test_HTMLDocument08.html \ test_HTMLDocument09.html \ test_HTMLDocument10.html \ test_HTMLDocument11.html \ test_HTMLDocument13.html \ test_HTMLDocument14.html \ test_HTMLDocument15.html \ test_HTMLDocument16.html \ test_HTMLDocument17.html \ test_HTMLDocument18.html \ test_HTMLDocument19.html \ test_HTMLDocument20.html \ test_HTMLDocument21.html \ test_HTMLDocument22.html \ test_HTMLDocument23.html \ test_HTMLDocument24.html \ test_HTMLDocument25.html \ test_HTMLDocument26.html \ test_HTMLDocument27.html \ I'm guessing we probably don't want to disable these on 1.9.1.
What's next here? Let's get this done.
Ah, nevermind. I see it now.
Whiteboard: [crashkill] → [crashkill][crashkill-fix]
(In reply to comment #132) > What's next here? Let's get this done. Next step would be to make a decision on how to handle these leaks. We can disable the tests (which allows this to pass try cleanly) or we can spend some additional time looking for a fix to the applet problem.
I don't think we should disable the tests, but I'm willing to live with the leaks. Can we change our expected leak value to accomodate these new leaks? We'd really like to get this in 1.9.1.6 which code freezes on November 10 at 11:59pm.
(In reply to comment #135) > I don't think we should disable the tests, but I'm willing to live with the > leaks. Can we change our expected leak value to accomodate these new leaks? > We'd really like to get this in 1.9.1.6 which code freezes on November 10 at > 11:59pm. Who would make that call? Any pointers on how we would change tolerated leak values? I can work up a patch and request approval on it.
I just made that call. :) I don't know how to change the tolerate leak values, unfortunately. Hopefully someone else can help with that.
Just fix the leaks.
I also don't understand why we're so stumped by the leaks here -- what's the interaction that's changed? What have we looked at or tried? A lot of different tests leak, so it seems like we should be able to track the lifecycle of the relevant objects without too much hassle. It's not like we allocate the component manager very often in the normal course of things!
Has anyone else tried reproducing the crashtest leaks? I still can't, and I've even recently upgraded my system. If not I guess I can allocate a try slave again and see if I can get a debug build going on it.
(In reply to comment #140) > Has anyone else tried reproducing the crashtest leaks? I still can't, and I've > even recently upgraded my system. If not I guess I can allocate a try slave > again and see if I can get a debug build going on it. Never mind! After installing the latest java run time, I'm able to reproduce them!
Attached file java refs (deleted) —
This appears to be the root of the problem. I can't find corresponding releases for two AddRef's from the java plugin, and on shutdown the component manager asserts with a left over refcount of 2.
FWIW, I can reproduce this on 1.9.1, w/Java(TM) Platform SE 6 U17, without the tear down patch applied. Trunk builds don't exhibit the problem.
So... code freeze for 1.9.1.6 is tonight (6.5 hours). While I'm all for getting leaks fixed, this is a leak on shutdown when using Java. It's not a huge issue and it only happens on 1.9.1. The trade off of stability vs leaking on shutdown is pretty clear to me (stability wins!). I don't think we should push off landing this patch longer than 1.9.1.6. The chances are 1.9.1.7 won't be until early February. Jim, other than the leaks, is attachment 395363 [details] [diff] [review] ready for the 1.9.1 branch?
I fully agree with Sam here, let's take this, the leaks mean nothing compared to the stability gains we're hoping to get out of this fix!
(In reply to comment #144) > So... code freeze for 1.9.1.6 is tonight (6.5 hours). While I'm all for getting > leaks fixed, this is a leak on shutdown when using Java. It's not a huge issue > and it only happens on 1.9.1. The trade off of stability vs leaking on shutdown > is pretty clear to me (stability wins!). > > I don't think we should push off landing this patch longer than 1.9.1.6. The > chances are 1.9.1.7 won't be until early February. > > Jim, other than the leaks, is attachment 395363 [details] [diff] [review] ready for the 1.9.1 branch? I can now reproduce this sans my patch, so the java issue is independent, although my patch on 1.9.1 triggers it with try server slave config. Let me see if the patch if fresh, will post back here in a sec.
This still applies cleanly.
Attachment #395363 - Flags: approval1.9.1.6?
Comment on attachment 395363 [details] [diff] [review] teardown refactor 1.9.1 v.3 Approved for 1.9.1.6. a=ss
Attachment #395363 - Flags: approval1.9.1.6? → approval1.9.1.6+
This landed tonight: http://hg.mozilla.org/releases/mozilla-1.9.1/rev/3d0b0cfd9644 The java leaks were apparently known. We have leak threshold values related to java set on 1.9.1 that aren't currently set on try, which is why this problem manifested on test runs. (hat tip to nthomas for helping track this down.) Related bugs: bug 471647, and bug 495533.
Whiteboard: [crashkill][crashkill-fix] → [sg:critical?][crashkill][crashkill-fix]
Henrik, you marked this verified fixed for 1.9.2. How did you verify this? I need to know since I need to verify this for 1.9.1 and I don't see a way to do so based on my reading of the comments.
I did that regarding our crash stats due to a missing testcase. Jim, it looks like that a similar crash has been returned for 1.9.2. The first frames are identical. Shall we open a new bug on that? See bp-673ba0da-133f-488d-91ff-2d63b2091120 0 xul.dll nsWindow::GetParentWindow widget/src/windows/nsWindow.cpp:1063 1 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1025 2 xul.dll GetWidgetOffset layout/base/nsLayoutUtils.cpp:839 3 xul.dll nsLayoutUtils::TranslateWidgetToView layout/base/nsLayoutUtils.cpp:860 4 xul.dll nsLayoutUtils::GetEventCoordinatesRelativeTo layout/base/nsLayoutUtils.cpp:680
(In reply to comment #151) > I did that regarding our crash stats due to a missing testcase. > > Jim, it looks like that a similar crash has been returned for 1.9.2. The first > frames are identical. Shall we open a new bug on that? See > bp-673ba0da-133f-488d-91ff-2d63b2091120 > > 0 xul.dll nsWindow::GetParentWindow > widget/src/windows/nsWindow.cpp:1063 > 1 xul.dll nsWindow::GetParent widget/src/windows/nsWindow.cpp:1025 > 2 xul.dll GetWidgetOffset layout/base/nsLayoutUtils.cpp:839 > 3 xul.dll nsLayoutUtils::TranslateWidgetToView > layout/base/nsLayoutUtils.cpp:860 > 4 xul.dll nsLayoutUtils::GetEventCoordinatesRelativeTo > layout/base/nsLayoutUtils.cpp:680 Definitely, that's a completely different stack. Looks like it has something to do with scroll.
There are a few common crash signatures here mixed in together. One is CreateWidget - http://crash-stats.mozilla.com/report/index/bcd92777-df9d-4f36-a7f5-8f0e02091120 http://crash-stats.mozilla.com/report/index/ebe43c86-d12d-44dd-affc-bf5ec2091120 That's likely bug 506108. Another has to do with dispatching events - http://crash-stats.mozilla.com/report/index/a85d727a-fcc7-46df-aab2-b5b652091120 We should split out a separate bug for the events crash.
Blocks: 530070
Filed bug 530070. How do we get bug numbers added to crash stats for specific stacks?
I think the only way we could do that would be to add enough of the top frames to the skiplist in the crash processor so that we can tell the two appart from just the signature of the crash report. In these two cases it seems we would need to add at least nsWindow::GetParentWindow and nsWindow::GetParent, so that the signatures for the two individual crashes would become "nsWindow::GetParentWindow | nsWindow::GetParent | nsBaseWidget::SetZIndex" vs "nsWindow::GetParentWindow | nsWindow::GetParent | GetWidgetOffset". I don't know whether that's a good idea in general, or if it's worth it just for this specific case.
This appears to still happen on the 1.9.0 branch (where it's topcrash #32 or so). Would the patch here work for that branch? Do we need a new one? Can't really unhide the bug while it's still happening on that branch. Does not appear to happen on the 1.8 branch, or at least doesn't appear anywhere in the top-500 crash reports. Tried searching to see if it happened at all but can't get a response from the talkback server.
Flags: blocking1.9.0.19?
Flags: blocking1.9.0.18?
Would love to get this into 3.0.18 but blocking may not be realistic. If the patch applies please request approval ASAP, otherwise we'll block for the next 3.0.x release.
Flags: blocking1.9.0.19?
Flags: blocking1.9.0.19-
Flags: blocking1.9.0.18?
Group: core-security
Crash Signature: [@ nsWindow::GetParentWindow(int)] [@ nsBaseWidget::Destroy()]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: