Closed Bug 851641 Opened 12 years ago Closed 12 years ago

Intermittent components/satchel/test/test_form_autocomplete_with_list.html | Exited with code 1 during test run | application crashed [@ nsAutoLayoutPhase::Enter()]

Categories

(Core :: Layout, defect)

x86_64
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: RyanVM, Assigned: tnikkel)

References

(Depends on 1 open bug)

Details

(Keywords: crash, intermittent-failure, Whiteboard: [snappy:p1])

Crash Data

Attachments

(7 files, 7 obsolete files)

(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
MatsPalmgren_bugz
: review+
Details | Diff | Splinter Review
(deleted), patch
mossop
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
MatsPalmgren_bugz
: review+
Details | Diff | Splinter Review
(deleted), patch
tnikkel
: review+
Details | Diff | Splinter Review
https://tbpl.mozilla.org/php/getParsedLog.php?id=20687946&tree=Mozilla-Inbound Rev5 MacOSX Mountain Lion 10.8 mozilla-inbound debug test mochitest-5 on 2013-03-15 06:23:23 PDT for push e90fa0a09ac5 slave: talos-mtnlion-r5-081 06:32:55 INFO - 168589 INFO TEST-PASS | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Checking form3 input 06:32:55 INFO - 168590 INFO TEST-INFO | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Starting test #201 06:32:55 INFO - 168591 INFO TEST-PASS | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Checking form3 input 06:32:55 INFO - 168592 INFO TEST-INFO | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Starting test #202 06:32:55 INFO - 168593 INFO TEST-PASS | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Checking form3 input 06:32:55 INFO - 168594 INFO TEST-INFO | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Starting test #203 06:32:55 INFO - 168595 INFO TEST-PASS | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Checking form3 input 06:32:55 INFO - 168596 INFO TEST-INFO | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Starting test #300 06:32:55 INFO - Assertion failure: mPresContext->mLayoutPhaseCount[eLayoutPhase_Reflow] == 0 (painting in the middle of reflow), at ../../../layout/base/nsAutoLayoutPhase.cpp:37 06:32:58 WARNING - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | Exited with code 1 during test run 06:32:58 INFO - INFO | automation.py | Application ran for: 0:08:40.454693 06:32:58 INFO - INFO | automation.py | Reading PID log: /var/folders/6p/28y5jm7s6lz843fsd3j2d_lh00000w/T/tmpG1Adbtpidlog 06:33:09 INFO - PROCESS-CRASH | /tests/toolkit/components/satchel/test/test_form_autocomplete_with_list.html | application crashed [@ nsAutoLayoutPhase::Enter()] 06:33:09 INFO - Crash dump filename: /var/folders/6p/28y5jm7s6lz843fsd3j2d_lh00000w/T/tmpFQEu0z/minidumps/31F42F0B-23AC-47E3-801B-9B8D93E16461.dmp 06:33:09 INFO - Operating system: Mac OS X 06:33:09 INFO - 10.8.0 12A269 06:33:09 INFO - CPU: amd64 06:33:09 INFO - family 6 model 42 stepping 7 06:33:09 INFO - 8 CPUs 06:33:09 INFO - Crash reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 06:33:09 INFO - Crash address: 0x0 06:33:09 INFO - Thread 0 (crashed) 06:33:09 INFO - 0 XUL!nsAutoLayoutPhase::Enter() [nsAutoLayoutPhase.cpp:e90fa0a09ac5 : 34 + 0x0] 06:33:09 INFO - rbx = 0x00007fff73398c68 r12 = 0x00007fff5fbfb2c8 06:33:09 INFO - r13 = 0x0000000107680c10 r14 = 0x00007fff5fbfb2f8 06:33:09 INFO - r15 = 0x000000010767ff80 rip = 0x00000001013f935f 06:33:09 INFO - rsp = 0x00007fff5fbfb0e0 rbp = 0x00007fff5fbfb0f0 06:33:09 INFO - Found by: given as instruction pointer in context 06:33:09 INFO - 1 XUL!PresShell::Paint(nsView*, nsRegion const&, unsigned int) [nsPresShell.cpp:e90fa0a09ac5 : 5512 + 0x4] 06:33:09 INFO - rip = 0x00000001013dcfd0 rsp = 0x00007fff5fbfb100 06:33:09 INFO - rbp = 0x00007fff5fbfb2a0 06:33:09 INFO - Found by: stack scanning 06:33:09 INFO - 2 AppKit + 0xf65d8 06:33:09 INFO - rip = 0x00007fff86b995d9 rsp = 0x00007fff5fbfb110 06:33:09 INFO - rbp = 0x00007fff5fbfb2a0 06:33:09 INFO - Found by: stack scanning 06:33:09 INFO - 3 AppKit + 0x124855 06:33:09 INFO - rip = 0x00007fff86bc7856 rsp = 0x00007fff5fbfb120 06:33:09 INFO - rbp = 0x00007fff5fbfb2a0 06:33:09 INFO - Found by: stack scanning 06:33:09 INFO - 4 AppKit + 0x12477b 06:33:09 INFO - rip = 0x00007fff86bc777c rsp = 0x00007fff5fbfb140 06:33:09 INFO - rbp = 0x00007fff5fbfb2a0 06:33:09 INFO - Found by: stack scanning 06:33:09 INFO - 5 XUL!nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) [nsViewManager.cpp:e90fa0a09ac5 : 407 + 0x7] 06:33:09 INFO - rip = 0x0000000101b38dfd rsp = 0x00007fff5fbfb170 06:33:09 INFO - rbp = 0x00007fff5fbfb2a0 06:33:09 INFO - Found by: stack scanning 06:33:09 INFO - 6 XUL!nsView::DoResetWidgetBounds(bool, bool) [nsView.cpp:e90fa0a09ac5 : 236 + 0x10] 06:33:09 INFO - rip = 0x0000000101b35561 rsp = 0x00007fff5fbfb180 06:33:09 INFO - rbp = 0x00007fff5fbfb2a0
Attached file full stack trace for posterity (deleted) —
Severity: normal → critical
Only happened in debug builds so far, so is mPresContext null or is the moz assert failing? I can't necessarily tell.
This is triggering fairly consistently on my machine at the moment. I added some debug and can verify its the assert triggering. mPresContext is non-null and mPresContext->mLayoutPhaseCount[eLayoutPhase_Reflow] is 1 (so probably not garbage). The condition definitely seems timing related. I had to play with the debug a bit since simple printf's masked the problem completely. Without any debug I seem to trigger nearly 100% of the time. With conditional debug it becomes intermittent. Interestingly, the test with debug also seems to run longer before triggering the assert. With no debug the assert triggers nearly immediately. With a conditional printf added I can see it stepping through the form automation for a few seconds before triggering. Anyway, just some observations in case it helps track this down. This is on a 2009 macbook pro running lion.
Is it possible you could get a stack? The ones in the tinderbox logs seem a little broken and don't explain why we are seeing this problem.
Attached file Another stack dump (deleted) —
Sure. Here is another stack dump. I used NS_ABORT() to trigger it, so you can see the immediate abort trace earlier in the log and then the minidump_stackwalk at the end. Is there a better way to do this? I'm new to the moz-central dev environment.
That's perfect, thank you. It clearly shows what the problem is now.
Awesome. Let me know if you want me to test any patches, etc.
So the problem is that during reflow of a popup we make it visible by setting the view to be visible, this calls Show and Resize on the widget, the OS calls back from the resize call and asks us to paint.
That makes sense looking at the stack trace, but I'm still trying to understand how this was timing sensitive. If the callback from the OS was done synchronously, shouldn't this have always triggered? Or are paints throttled such that if one occurred recently, then this new repaint may get queued for as an async event?
I guess the OS decides if it wants to ask us for a paint or not and it bases it's decision on something, and I don't know if we can always control that. I'm pretty sure I've seen the same issue on Windows too, so we should probably just fix it properly. Which is to make these widget calls to happen async in the view subsystem. I think mstange landed a patch for that a few years ago but it had to be backed out for a regression. But I wouldn't be surprised it we could do it now, or just fix the regression.
I think bug 562138 is what I'm thinking about.
Attached patch patch (obsolete) (deleted) — Splinter Review
Here Markus' patch from bug 562138 unbitrotted. It causes some failures. I think Linux would fail too except test_panel.xul is disabled on linux. https://tbpl.mozilla.org/?tree=Try&rev=bd6721e0627a
So I also tried the patch in attachment 735381 [details] [diff] [review] on my machine. It does seem to prevent the assertion in test_form_autocomplete_with_list.html, but as you said, introduces some other issues. The problem I saw was that while the test passed cleaning, the autocomplete popup was not properly removed from the page at the end of the test. See the attached screenshot.
Attached patch Updated patch (obsolete) (deleted) — Splinter Review
Updated the patch to correct the rendering error in the test. I moved the conditional mWindow->Show(false) to occur earlier in DoResetWidgetBounds(). In this particular case the test was triggering the following check and bailing out before hiding the popup window: if (type == eWindowType_popup && ((curBounds.IsEmpty() && mDimBounds.IsEmpty()) || mVis == nsViewVisibility_kHide)) { // Don't manipulate empty or hidden popup widgets. For example there's no // point moving hidden comboboxes around, or doing X server roundtrips // to compute their true screen position. This could mean that WidgetToScreen // operations on these widgets don't return up-to-date values, but popup // positions aren't reliable anyway because of correction to be on or off-screen. return; } I hope you don't mind me updating/obsoleting your patch. Is this bad form? Thanks.
Attachment #735381 - Attachment is obsolete: true
Attachment #735555 - Attachment is obsolete: true
Great! Thanks for figuring that out. I pushed the new patch to try https://tbpl.mozilla.org/?tree=Try&rev=5f2075c4dbd1 It seems to have fixed the problems on mac, but we still have the Windows failures left.
I'm not near my dev environment at the moment, but it looks like that test has an explicit delay programmed for Linux: var waitSteps = 0; function popupShown(event) { var panel = event.target; if (waitSteps > 0 && navigator.platform.indexOf("Linux") >= 0 && panel.boxObject.screenY == 210) { waitSteps--; setTimeout(popupShown, 10, event); return; } currentTest.result(currentTest.testname + " ", panel); panel.hidePopup(); } Perhaps we need to adapt that for windows as well now that the popup visiblity is deferred. Do the different platforms flush pending updates at different rates?
Attached patch Patch v3 (obsolete) (deleted) — Splinter Review
This is a bit of a shot in the dark since I don't have a windows dev environment to test this. It looks like there is explicit code in this test to wait for popup visibility changes, but only on Linux. I went looking for some history on this and the best I could find was bug 552982 comment 45. It implies that the Linux OS callback will always be deferred and "requires waiting for the window manager to reply". Now that we are deferring all visibility related forced paints, it seems like it makes sense to relax the test to allow deferred painting for all platforms.
Attachment #735594 - Attachment is obsolete: true
(In reply to Ben Kelly from comment #26) > Now that we are deferring all visibility related forced paints, it seems like > it makes sense to relax the test to allow deferred painting for all > platforms. Linux is particularly gross in this case because of how X11 works. It's the best we've been able to do so far. But I don't know if we should give up that easily for Windows. The popupshown event is already dispatched async here http://mxr.mozilla.org/mozilla-central/source/layout/xul/base/src/nsMenuPopupFrame.cpp#472 so if our coords are wrong when it's fired we've already waited a bit.
Hmm, I think I'll need to look when its not 1am to understand. Do you think it would be worth running a try to see if this has any effect? Even if its not the final solution it might give us a data point as to what is going on in the test.
Sure, I pushed it to try https://tbpl.mozilla.org/?tree=Try&rev=790e1ac8fd27 we can see if it ever gets the right position or not with this.
Attachment #736119 - Attachment is obsolete: true
Attachment #735594 - Attachment is obsolete: false
Well, attachment 736119 [details] [diff] [review] change to the test did not help. Restoring the previous patch as the best version. Thanks for humoring me!
That's good news, that means it's not timing dependent which means it should be easier for someone with a Windows setup to debug.
I have a winxp build working in a VM and can now reproduce the failure in test_panel.xul. I will take a look at it over the weekend.
FYI, I'm still looking at this. Slow going since the compile/debug cycle is about 30+ minutes on this VM. My current idea is that the state mismatch between the view manager's GetVisibility() and the window's IsVisible() is causing an issue.
Hey Ben you should be able to speed that up by using pymake if you aren't already. And then going into your objdir and then only building in the following folders under there 1) the folder under which you've made changes 2) layout 3) toolkit/library Let me know if I can be of any help, here, or on email or irc.
Some more things that might help: panels with titlebars we usually rely on a two part process to get them positioned and sized correctly where we first position it without knowing how big the titlebar will be, then get a size/move notification and then we can get the info we need to place it properly. Perhaps something it going wrong there? Logging the nsXULPopupManager::PopupMoved ad nsXULPopupManager::PopupResized both in the working case and not working case might help there.
Wow, thank you! That cut my build time down to something much more reasonable. And the explanation of how popups are displayed is very helpful. I'll investigate this when I next get time. Unfortunately this may not be until next week. Thanks again!
This is blocking the landing of bug 566746, a large project that I'd like to get landed ASAP. Timothy, can you take this if Ben's not available in the near term?
Whiteboard: [snappy:p1]
Yea, I won't have time until next week and my windows development environment is not the best. Please feel free to drive this forward. I'd be happy to help test.
I'll debug the Windows failure. Although that may not be the only thing blocking this from landing: the original patch from bug 562138 had a part 2 that was needed then (haven't looked into if it's still needed), and this patch did get backed out in bug 562138 for causing a regression (haven't looked into if it will still cause that regression).
QA Contact: tnikkel
Assignee: nobody → tnikkel
QA Contact: tnikkel
This makes the main patch here green on try server. But after looking into this more I'm thinking that taking the safer approach of just doing the view show from a runnable we dispatch is a better idea. Currently debugging one try server failure with that approach...
Just setting the visibility on the view turned out to be a little more complicated than it sounds. I re-used the nsXULPopupShownEvent event that we already dispatch so that the popup-shown event is still dispatched immediately after we call Show on the widget.
Attachment #741089 - Flags: review?(matspal)
Attached patch fix a test (deleted) — Splinter Review
After we send a mouse event to click on the cancel button the popup is hidden and shown before it turns into the "Would you like to restart the addon download?" popup. This test seems to be assuming that it stays shown and is just modified. So we need to add another layer of wait_for_notification to wait for the popup-shown event to be dispatched. I'll attach a whitespace diff that will make this patch very clear.
Attachment #741090 - Flags: review?(dtownsend+bugmail)
Attached patch fix a test (diff -w) (deleted) — Splinter Review
Attachment #741090 - Flags: review?(dtownsend+bugmail) → review+
For what its worth, I can confirm attachment 741089 [details] [diff] [review] fixes the issue for me. Thanks!
Comment on attachment 741089 [details] [diff] [review] run view show off a runnable instead of during reflow I'm a little worried about the latency in using an event. Could we delay the mDoViewShow part using PostReflowCallback instead?
(In reply to Mats Palmgren [:mats] from comment #49) > I'm a little worried about the latency in using an event. > Could we delay the mDoViewShow part using PostReflowCallback instead? That's a great idea, I don't know why I didn't think of it. It makes things simpler too because we know it will execute before the popupshown event dispatcher runs.
Attached patch use a reflow callback (obsolete) (deleted) — Splinter Review
Attachment #741089 - Attachment is obsolete: true
Attachment #741089 - Flags: review?(matspal)
Attachment #741478 - Flags: review?(matspal)
Comment on attachment 741478 [details] [diff] [review] use a reflow callback >+nsMenuPopupFrame::ReflowFinished() >+{ >+ mReflowCallbackPosted = false; >+ >+ nsView* view = GetView(); >+ nsViewManager* viewManager = view->GetViewManager(); >+ if (view && viewManager) { No need to null-check those. >+ nsWeakFrame weak(this); >+ viewManager->SetViewVisibility(view, nsViewVisibility_kShow); >+ if (!weak.IsAlive()) { >+ return false; >+ } I don't think SetViewVisibility() can destroy the frame. If it can then this code isn't safe. AFAICT, this bug is only about a fatal assertion in debug builds, am I missing something? Maybe we should wrap the loop in PresShell::HandlePostedReflowCallbacks() in a nsAutoScriptBlocker scope to prevent that from happening? It would be a good safety precaution in general I think... >+ mPopupState = ePopupOpenAndVisible; >+ view = GetView(); >+ if (view) { No need to null-check. > nsView* view = GetView(); > nsViewManager* viewManager = view->GetViewManager(); >+ if (mReflowCallbackPosted) { >+ PresContext()->PresShell()->CancelReflowCallback(this); >+ mReflowCallbackPosted = false; >+ } > viewManager->SetViewVisibility(view, nsViewVisibility_kHide); Nit: move the added lines before the declaration of 'view' perhaps? Looks good otherwise.
Examining the stack in comment 13 a bit more, there's: PresShell::ProcessReflowCommands PresShell::DoReflow ... nsMenuPopupFrame::LayoutPopup nsViewManager::SetViewVisibility ... [ChildView viewWillDraw] nsViewManager::WillPaintWindow nsViewManager::ProcessPendingUpdatesForView and ProcessPendingUpdatesForView has a nsAutoScriptBlocker... The outermost ProcessReflowCommands runs DoReflow inside a nsAutoScriptBlocker scope though, so all of the above is protected by that. Moving the SetViewVisibility call to a reflow callback which isn't run with such protection seems like it might mess up other callbacks that follow.
(In reply to Mats Palmgren [:mats] from comment #54) > I don't think SetViewVisibility() can destroy the frame. If it can > then this code isn't safe. AFAICT, this bug is only about a fatal > assertion in debug builds, am I missing something? The stack in an attachment in this bug shows how SetViewVisibility can enter painting, and painting can flush via PresShell::WillPaint.
I think we need a couple of changes for this to be safe as a reflow callback. First, create a helper class to use as the callback that has a nsWeakFrame member for the MenuPopupFrame. Second, wrap the loop in HandlePostedReflowCallbacks in a nsAutoScriptBlocker scope. If that doesn't work for some reason then I think we should take the nsPopupShowEvent solution.
(In reply to Mats Palmgren [:mats] from comment #57) > If that doesn't work for some reason then I think we should take > the nsPopupShowEvent solution. I'm actually thinking that the first patch, to do view visibility changes async through pending updates on the view manager (called "Updated patch"), is now the most desirable approach.
OK, I'll review that instead. (sorry for wasting your time with the callback suggestion)
Yeah, I like that approach better too. Feel free to ask me for review when you and Ben have fixed the regressions. One thing -- in the added code at the end of DoResetWidgetBounds, can 'this' be dead there? if so, using mWindow isn't safe. You should probably use a strong ref on the stack for the widget.
Attachment #741478 - Flags: review?(matspal)
Comment on attachment 740431 [details] [diff] [review] ignore popup move/resize on hidden widgets We'll need this anyway to land the above patch. I'd like to land this patch separately since it can land now. In nsMenuPopupFrame::LayoutPopup we resize the view and then set the view visibility to show and then set the menu popup state to ePopupOpenAndVisible. Before the above patch the resize would be queued and not flushed to the widget. But when we set the view visibility to show it would flush the resize, and then call Show(true) on the widget. The OS sends us a resize notification that gets handled by nsXULPopupManager::PopupResized/Moved, it is ignored because the menu popup state is not ePopupOpenAndVisible. And then set the menu popup state to ePopupOpenAndVisible. After the above patch the resize is queued and not flushed, and the view show is queued and not flushed. And then set the menu popup state to ePopupOpenAndVisible. Then when the resize and show is flushed to the widget the OS sends us back a resize notification (we don't care about it as long as it matches what we asked for). Then when we get to nsXULPopupManager::PopupResized/Moved we don't bail out because the menu popup state is ePopupOpenAndVisible now. This whole thing is a little bit fragile. We do need a better way to tell resize notifications that come from our own actions and are what we expect to any other type, but we don't have that currently.
Attachment #740431 - Flags: review?(matspal)
Attachment #740431 - Flags: review?(matspal) → review+
Whiteboard: [snappy:p1] → [snappy:p1][leave open]
(In reply to Mats Palmgren [:mats] from comment #60) > One thing -- in the added code at the end of DoResetWidgetBounds, > can 'this' be dead there? if so, using mWindow isn't safe. > You should probably use a strong ref on the stack for the widget. So if we make the assumption that 1) calling widget->Show(true) when the view was visibility hidden 2) calling Resize/Move on a widget that is current visible are the only things that do anything unsafe (ie painting/flushing) and that Show, Reize, Move don't need someone to be holding a ref to their widget then I think the code is ok. The mWindow member on the view holds a ref, so that should keep it alive as long as it's not null. If the visibility is not changing in DoResetWidgetBounds (ie we make no Show calls) then only the Move/Resize calls can be unsafe. But we don't use anything but local variables after those calls, so we should be fine. If we are hiding the window we assumed Show(false) is safe, and the Move/Resize calls are happening on a hidden widget, so should be safe. If we are showing the window the Move/Resize calls should be safe because they happen while the widget is hidden. But I agree that this is complicated and fragile, and we have to make those assumptions, so we can add some precautions
Yeah, sounds very fragile to me. I think we should use a local strong ref variable instead of mWindow.
So I held the widget in a local and used that, but that doesn't make us safe from this getting destroyed still: mViewManager, mVis, mDimBounds, CalcWidgetBounds are all used after we hide the widget but before we resize/move or show it. I can make it completely safe if you'd prefer. The widget show/hide will still happen before we hit the event loop after the reflow: the reason being that FlushPendingNotifications calls ProcessPendingUpdates after reflowing. ResizeReflowIgnoreOverride can still reflow without calling ProcessPendingUpdates after (perhaps it should?) but resizing the window and opening a popup should be very rare cases.
Attachment #735594 - Attachment is obsolete: true
Attachment #743479 - Flags: review?(matspal)
I went through the other issues that might need to be resolved before landing this. The part 2 patch in bug 562138 shouldn't needed to fix this, I think it should only be needed to fix bug 562138 (a white flash when showing popup windows). We can of course try landing it later. The regressions caused by the original landing of bug 562138. There were two types: a crash, and popup window sizing/positioning issues. The crash had no surviving stack (only a link to crash-stats that is now dead) and no one debugged it, so it's hard to tell either way. We have fixed a lot of painting related crashes in view code recently, so I hopeful this won't show up again. The popup widget sizing/positioning issues. The other patch that landed in this bug fixed one such issue. In addition I've landed patches that fixed similar sounding issues caused by other changes since then. So I'm also hopeful these won't be a problem again this time.
> I can make it completely safe if you'd prefer. Yes please.
Attachment #743479 - Attachment is obsolete: true
Attachment #743479 - Flags: review?(matspal)
Attachment #743934 - Flags: review?(matspal)
Attached patch addendum? (deleted) — Splinter Review
Might be simpler to just calculate the bool up front (and re-use it) instead of stashing mDimBounds / mVis.
Comment on attachment 743934 [details] [diff] [review] use view subsystem pending updates to handle show/hides Looks good, r=mats (with/without the suggested addendum)
Attachment #743934 - Flags: review?(matspal) → review+
Comment on attachment 744080 [details] [diff] [review] addendum? Good idea, thanks! I'll include this.
Attachment #744080 - Flags: review+
Whiteboard: [snappy:p1][leave open] → [snappy:p1]
Attachment #741478 - Attachment is obsolete: true
Thanks for fixing this one, Timothy (and for the assistance, Mats).
Depends on: 869151
Blocks: 264231
Depends on: 1172302
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: