Closed Bug 514061 Opened 15 years ago Closed 15 years ago

Flash on sears.shoplocal.com sometimes causes "ASSERTION: recursive painting not permitted" followed by crash

Categories

(Core :: Web Painting, defect)

1.9.1 Branch
x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta5-fixed

People

(Reporter: cmtalbert, Assigned: roc)

References

()

Details

(Keywords: assertion, crash, Whiteboard: [sg:vector-critical (flash)?])

Attachments

(8 files)

Attached file Crash stack and !expolitable output (deleted) —
This occurred during our automated crash analysis. Then we were not able to reproduce it until we turned on usermode DEP on XP. With that on, I got the attached stack using a debug mozilla 1.9.1 build. However, I was not able to crash it on subsequent attempts with 1.9.1 or mozcentral. Each time before the crash I'd get a couple of Assertions stating that "recursive painting not permitted", so this might be related to bug 454295. The stack and the exploitable report looks like it's a flash crash and this page randomly loads a flash ad, so it's probably just luck of the draw and you only crash when you get the "lucky" ad.
Whiteboard: [sg:nse]
Clint, can you get a stack trace for the recursive-painting assertion? Also, does "wget -E -H -k -K -p" get you a self-contained and/or consistently-reproducible testcase?
Whiteboard: [sg:nse] → [sg:vector-critical (flash)?]
(In reply to comment #1) > Clint, can you get a stack trace for the recursive-painting assertion? Also, > does "wget -E -H -k -K -p" get you a self-contained and/or > consistently-reproducible testcase? I'm assuming you mean that you want a reproducible testcase for that assertion right? As per above, I can't get a reproducible testcase for the crash.
Yeah. That wget command often works better than straight wget or "save as complete" in Firefox.
Attached file Crash 1 with current nightly (deleted) —
I've been unable to get a minimized version of this site. In all the minimized versions I can get, not all the flash loads. But this has become VERY reproducible on a current build. = Steps = 1. Build debug using this mozconfig on windows XP . $topsrcdir/browser/config/mozconfig CFLAGS="-RTC1" mk_add_options MOZ_CO_PROJECT=browser ac_add_options --enable-application=browser mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ffx-dbg ac_add_options --enable-debug ac_add_options --disable-optimize mk_add_options MOZ_MAKE_FLAGS="-j4" ac_add_options --enable-accessibility ac_add_options --with-windows-version=600 2. Ensure you have flash installed. 3. Run against the URL above and about 1 in 5 runs will crash. Every run that crashes throws the recursive painting assertion. I did this testing on a build built from this changeset: changeset: 34236:22010cbc1269 tag: tip user: Jim Blandy <jimb@mozilla.org> date: Tue Oct 27 13:54:54 2009 -0700 summary: Bug 524724: Correctly disable WebGL under OSSO. r=red on poor Maemo
Keywords: assertion, crash
Summary: Intermittent Crash on sears.shoplocal.com, might be flash related → Flash on sears.shoplocal.com sometimes causes "ASSERTION: recursive painting not permitted" followed by crash
Chris, let's try to get this in a recorded VM session.
(In reply to comment #6) > Chris, let's try to get this in a recorded VM session. I can't reproduce this in my VM unfortunately. I have DEP turned on, latest Flash 10.0.32.18, and rebuilt with ctalbert's config above. ctalbert: I get a prompt for a zip code when I visit that URL, and I only ever see one ad. What zip code are you entering? I'm using 90201, which is probably different to what you've entered, so I may be seeing a different ad.
(In reply to comment #7) > (In reply to comment #6) > > Chris, let's try to get this in a recorded VM session. > > I can't reproduce this in my VM unfortunately. I have DEP turned on, latest > Flash 10.0.32.18, and rebuilt with ctalbert's config above. > > ctalbert: I get a prompt for a zip code when I visit that URL, and I only ever > see one ad. What zip code are you entering? I'm using 90201, which is probably > different to what you've entered, so I may be seeing a different ad. I don't ever have to enter a zip code. It crashes on load. Let me see if I can record it in my VM.
Attached file Another crash stack_10.0.42.34 (deleted) —
I can still get it to crash today using the same build as previously mentioned. I just have to run it over and over. It still crashes on startup. I didn't realize that the record/replay feature isn't in VMWare fusion 3. I'm working to bring up a VMWare workstation setup and will try the vm there. Or, would it be easier for you to try to reproduce it with the vm? I can zip up the vm and point you to it.
Chris, Roc, I have finally managed to record the crash in Vmware. It took a few days to get vmware 7 (as fusion doesn't have replay/record) and then to get the vm over there and then finally to reproduce it. So, now I have that. How should I get it to you? The VM is pretty big: about 10Gb compressed. Would it be worth it to have dbaron look over my shoulder here in the office, or should I find some room on fs to upload it so you guys can get it?
We don't want to ship the VM around... Probably the most efficient next step is to let Chris Pearce talk you through debugging the Firefox process during replay and getting a proper stack trace from Visual Studio, by setting a breakpoint at NS_DebugBreak_P so we can inspect locals etc. I'm particularly interested in finding out what messages are being dispatched to WindowProc in those assertion stacks.
Yeah, what we want is a stack trace from when that assertion fails... I'll send you an email with instructions on how to setup Visual Studio to allow remote debugging... You'll likely have to change settings in your VM and re-record in order to allow replay debugging...
Hey Chris, Thanks for the email. This week we have the test dev team on site in MV, so I may not have a chance to debug this further until this weekend. :(
Trying to repro it again, and found that it is MUCH more reproducible with a build from changeset http://hg.mozilla.org/mozilla-central/log?rev=a80fb8f5552f This morning out of 4 attempts, I reproduced it 3 times. I now have a VM ware snapshot of the recursive paint assertion captured in the debugger. I've looked at the code, but I'm not familiar with this boundary between our code and flash, so I'll need more experienced eyes here to figure out what's going on. If we changed our code to make this flash crash more reproducible (which seems to be the only conclusion here for the change) then this crash should block release until we figure out if it is something on our end and we fix it, or we determine that flash is truly the culprit and we get them to fix it. Marking blocking ? for 1.9.2
Flags: blocking1.9.2?
Hey guys. So is this Windows only and on trunk? Can it be reproduced with 3.6 beta and either the private build we gave you or the latest public beta? I was UTR on my Mac with 3.5.5
Clint, can you copy the stack trace from Visual Studio and paste it somewhere we can read it?
Can you copy the stack trace from Visual Studio and paste it here? I'd particularly like to know the value of 'msg' for each call to nsWindow::WindowProc that's on the stack. A dump of *this in each of those stack frames would also be useful. You can copy and paste this data from the Visual Studio UI.
Feels like this is a blocker until we get a confirmation that it isn't, yes?
Flags: blocking1.9.2? → blocking1.9.2+
(Oh, and needs an owner ... assigning to roc@oc for now)
Assignee: nobody → roc
Note, I tried to reproduce this and couldn't.
I tried to reproduce it on Windows 7 with some of the recent nighlies, but wasn't able to. I need a VS2008 pro license to be able to build with the mozconfig in comment 4, which I've filed a bug for. FWIW, I didn't see any Flash ad on the website in the summary...
since it isn't clear that we know what the stack(s) look (and associated signatures) I took a crack at trying to come at it from a different angle. This shoplocal web app/store platform looks like it causes a variety of problems. One or more might be connected to the bug being studied here, but to get the full list available I created a tracking bug. see Bug 530213 if you look at https://bug530213.bugzilla.mozilla.org/attachment.cgi?id=413720 and search for "sears" you can see several signatures and links to reports where we are crashing in a variety of places, and other storefronts that have similar signatures. for sears crashes these signatures come into play @0x3050f3f0 @0x80008000 GCGraphBuilder::NoteRoot(unsigned int, void*, nsCycleCollectionParticipant*) RealDefWindowProcWorker memmove | nsTArray_base::ShiftData(unsigned int, unsigned int, unsigned int, unsigned int) NPSWF32.dll@0x138f2f NPSWF32.dll@0x1999fc NPSWF32.dll@0x1cd5c0 NPSWF32.dll@0x22edfa NPSWF32.dll@0x77bd0 and maybe a few more flash addresses none of the stacks in the crash reports look really useful. I'll make this one of the dependent bugs on the tracking bug. Should we close that tracking bug too?
Blocks: 530213
"close" as in make it security-sensitive
Clint has recorded this in his VM so he has all the information we need to reproduce and fix. We just need to work with him to debug this.
(In reply to comment #17) > Can you copy the stack trace from Visual Studio and paste it here? I'd > particularly like to know the value of 'msg' for each call to > nsWindow::WindowProc that's on the stack. A dump of *this in each of those > stack frames would also be useful. You can copy and paste this data from the > Visual Studio UI. The data is attached with the value of *this printed out for each of our stack frames. This stack does look pretty horked -- note the JS trace call that occurs before XRE_Main. Here's a copy of just the stack for folks to chew on in the comment. The value of msg in nsWindow::WindowProc is 15. NPSWF32.dll!08027e1f() [Frames below may be incorrect and/or missing, no symbols loaded for NPSWF32.dll] NPSWF32.dll!080283cc() NPSWF32.dll!080361d2() NPSWF32.dll!08039aed() NPSWF32.dll!07fbe765() NPSWF32.dll!07fbe793() NPSWF32.dll!07ff6e1e() NPSWF32.dll!0802352a() NPSWF32.dll!08000b30() xpcom_core.dll!SearchTable(PLDHashTable * table=0x00000000, const void * key=0x0ee99504, unsigned int keyHash=830573490, PLDHashOperator op=PL_DHASH_LOOKUP) Line 472 + 0x1e bytes C xpcom_core.dll!PL_DHashTableOperate(PLDHashTable * table=0x00285b62, const void * key=0x0709e908, PLDHashOperator op=PL_DHASH_LOOKUP) Line 625 + 0x13 bytes C gklayout.dll!nsPropertyTable::GetPropertyInternal(nsPropertyOwner aObject={...}, unsigned short aCategory=13, nsIAtom * aPropertyName=0x00cc5960, int aRemove=1231988, unsigned int * aResult=0x00449616) Line 194 + 0x18 bytes C++ 0012cc6c() NPSWF32.dll!0806ed45() NPSWF32.dll!08070ec5() NPSWF32.dll!08070ec5() NPSWF32.dll!0806ed45() NPSWF32.dll!08070ec5() NPSWF32.dll!0806ed45() NPSWF32.dll!08070ec5() NPSWF32.dll!0806ed45() NPSWF32.dll!08043860() NPSWF32.dll!08043ac6() gdi32.dll!77f16f79() msvcr90d.dll!_free_dbg(void * pUserData=0x0012ea7c, int nBlockUse=122701920) Line 1260 + 0xc bytes C++ thebes.dll!_moz_cairo_matrix_multiply(_cairo_matrix * result=0x0012ead0, const _cairo_matrix * a=0x04ec9640, const _cairo_matrix * b=0x0012ead0) Line 308 + 0xf bytes C 0012eac4() gklayout.dll!nsViewManager::RenderViews(nsView * aView=0x04ec9640, nsIRenderingContext & aRC={...}, const nsRegion & aRegion={...}) Line 516 C++ gklayout.dll!nsViewManager::Refresh(nsView * aView=0x04ec9640, nsIRenderingContext * aContext=0x074ccd20, nsIRegion * aRegion=0x0427f098, unsigned int aUpdateFlags=1) Line 484 C++ gklayout.dll!nsViewManager::DispatchEvent(nsGUIEvent * aEvent=0x0012ee88, nsIView * aView=0x04ec9640, nsEventStatus * aStatus=0x0012ed5c) Line 992 C++ gklayout.dll!HandleEvent(nsGUIEvent * aEvent=0x0012ee88) Line 168 C++ gkwidget.dll!nsWindow::DispatchEvent(nsGUIEvent * event=0x0012ee88, nsEventStatus & aStatus=nsEventStatus_eIgnore) Line 2936 + 0xc bytes C++ gkwidget.dll!nsWindow::DispatchWindowEvent(nsGUIEvent * event=0x0012ee88, nsEventStatus & aStatus=nsEventStatus_eIgnore) Line 2965 C++ gkwidget.dll!nsWindow::OnPaint(HDC__ * aDC=0x00000000) Line 517 + 0x24 bytes C++ gkwidget.dll!nsWindow::ProcessMessage(unsigned int msg=15, unsigned int & wParam=0, long & lParam=0, long * aRetValue=0x0012f57c) Line 3847 + 0x18 bytes C++ gkwidget.dll!nsWindow::WindowProc(HWND__ * hWnd=0x0003071e, unsigned int msg=15, unsigned int wParam=0, long lParam=0) Line 3547 + 0x20 bytes C++ user32.dll!7e418734() user32.dll!7e418816() user32.dll!7e428ea0() user32.dll!7e428eec() ntdll.dll!7c90e473() user32.dll!7e4194d2() user32.dll!7e428f10() user32.dll!7e419402() user32.dll!7e418a10() gkwidget.dll!nsAppShell::ProcessNextNativeEvent(int mayWait=0) Line 185 C++ gkwidget.dll!nsBaseAppShell::DoProcessNextNativeEvent(int mayWait=0) Line 151 + 0x11 bytes C++ gkwidget.dll!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal * thr=0x00ccfdd8, int mayWait=0, unsigned int recursionDepth=0) Line 278 + 0xd bytes C++ xpcom_core.dll!nsThread::ProcessNextEvent(int mayWait=1, int * result=0x0012f800) Line 510 C++ xpcom_core.dll!NS_ProcessNextEvent_P(nsIThread * thread=0x00ccfdd8, int mayWait=1) Line 250 + 0x16 bytes C++ gkwidget.dll!nsBaseAppShell::Run() Line 170 + 0xc bytes C++ toolkitcomps.dll!nsAppStartup::Run() Line 182 + 0x1c bytes C++ xul.dll!XRE_main(int argc=2, char * * argv=0x00ccd920, const nsXREAppData * aAppData=0x00cca4f0) Line 3491 + 0x25 bytes C++ firefox.exe!NS_internal_main(int argc=2, char * * argv=0x00ccd920) Line 158 + 0x12 bytes C++ firefox.exe!wmain(int argc=2, wchar_t * * argv=0x00cca3c0) Line 120 + 0xd bytes C++ firefox.exe!__tmainCRTStartup() Line 579 + 0x19 bytes C > firefox.exe!wmainCRTStartup() Line 399 C kernel32.dll!7c817077() mozjs.dll!TraceRecorder::record_JSOP_OBJTOP() Line 10295 + 0xa bytes C++
Thanks. How about the stack for the assertion about recursive painting?
Attached file The Stack and data at the assertion (deleted) —
(In reply to comment #26) > Thanks. How about the stack for the assertion about recursive painting? I thought you'd ask for that :). I reloaded my snapshot and did the same exercise as above with them. This time there are several nsWindow::WindowProc calls (part of that recursive painting no doubt). I've noted the values of msg in each.
Thanks. Was this a nightly PGO build? The stack is quite corrupt :-( OK, so we call from painting into Flash somehow, here: NPSWF32.dll!08043ac6() gdi32.dll!77f16f79() msvcr90d.dll!_free_dbg(void * pUserData=0x0012ea7c, int nBlockUse=122701920) Line 1260 + 0xc bytes C++ *this CXX0017: Error: symbol "this" not found thebes.dll!_moz_cairo_matrix_multiply(_cairo_matrix * result=0x0012ead0, const _cairo_matrix * a=0x04ec9640, const _cairo_matrix * b=0x0012ead0) Line 308 + 0xf bytes C *this CXX0017: Error: symbol "this" not found 0012eac4() gklayout.dll!nsViewManager::RenderViews(nsView * aView=0x04ec9640, nsIRenderingContext & aRC={...}, const nsRegion & aRegion={...}) Line 516 C++ Then Flash does something that triggers a WM_IME_NOTIFY which reenters our code. This is very bad: gkwidget.dll!nsWindow::WindowProc(HWND__ * hWnd=0x0003071e, unsigned int msg=642, unsigned int wParam=11, long lParam=0) Line 3547 + 0x20 bytes C++ *this CXX0017: Error: symbol "this" not found msg is 642 user32.dll!7e418734() user32.dll!7e418816() user32.dll!7e42927b() user32.dll!7e4292e3() imm32.dll!76393f86() imm32.dll!76394e46() NPSWF32.dll!080da343() msg 642 is WM_IME_NOTIFY (0x282) We call nsWindow::ProcessMessageForPlugin which then calls nsWindow::DispatchPendingEvents, which starts processing arbitrary events: gkwidget.dll!nsWindow::WindowProc(HWND__ * hWnd=0x0003071e, unsigned int msg=15, unsigned int wParam=0, long lParam=0) Line 3547 + 0x20 bytes C++ *this CXX0017: Error: symbol "this" not found msg is still 15 user32.dll!7e418734() [Frames below may be incorrect and/or missing, no symbols loaded for user32.dll] user32.dll!7e418816() user32.dll!7e428ea0() user32.dll!7e428f10() user32.dll!7e419402() user32.dll!7e418a10() gkwidget.dll!nsAppShell::ProcessNextNativeEvent(int mayWait=0) Line 185 C++ Message 15 is WM_PAINT again. So we end up asserting. The assert is probably not directly related to the crash. But spinning a Windows event loop during painting is definitely forbidden and almost certainly the cause of our problems.
The WM_IME_NOTIFY message's wParam means IMN_SETCOMPOSITIONWINDOW, which means "the style or position of the composition window is updated". I presume you weren't using an IME? Anyway...
Attached patch possible fix (deleted) — Splinter Review
Don't spin our event loop in nsWindow::DispatchPendingEvents if we're currently painting. I pushed this to the try-server, ID b4a8729a38d8 --- Clint, please download the build when it's done and give it a try!
Whiteboard: [sg:vector-critical (flash)?] → [sg:vector-critical (flash)?][needs review]
Do we need the DispatchPendingEvents in ProcessMessageForPlugin at all? ProcessMessageForPlugin (including the DispatchPendingEvents) was added about a year ago by bug 272847.
That is a good question.
I'm not sure, the DispatchPendingEvents has been called after the user input messages are processed. I imitated that. I guess that it wants the repainting, it should be executed immediately for the feedback.
(In reply to comment #30) > Created an attachment (id=414177) [details] > possible fix > > Don't spin our event loop in nsWindow::DispatchPendingEvents if we're currently > painting. > > I pushed this to the try-server, ID b4a8729a38d8 --- Clint, please download the > build when it's done and give it a try! Ok, I've run this several times, both release and building this patch in debug. = Good News = We don't crash. I can also verify that the recursive painting call is still being hit in debug where I get the hundreds of instances of the new warning messages. = Bad News = With both the try server build and my debug build, none of the text renders. We get the "enter zip code" box but not the text that says "View the Weekly Ad for local Sears Store" and "Enter your zip code or City, State". We also don't get the background painting for the little flash app behind the "Enter Zip Code" box, that's all grayed out. I'll attach a screen shot.
(In reply to comment #34) > (In reply to comment #30) > = Bad News = > With both the try server build and my debug build, none of the text renders. > We get the "enter zip code" box but not the text that says "View the Weekly Ad > for local Sears Store" and "Enter your zip code or City, State". We also don't > get the background painting for the little flash app behind the "Enter Zip > Code" box, that's all grayed out. I'll attach a screen shot. Ack, the minute I typed this, I went back to the VM to get the unpainted screen shot and it started working :/ Now, it works every time, whether I run in debug or using the try server build. I swear I had a string of sessions that didn't paint, but now they all seem to be ok. So, there may be an intermittent paint problem here, but I can't seem to reliably reproduce it. If anyone has the ability to write a test case in flash, here's what we need: * paint a flash UI in the main HTML page * paint a window over it that waits for interaction from the user * fire IME events from flash for the window painted over the main flash UI. If you can do all that without a crash or a recursive paint issue then the test passes. Ideally, we'd love to see a test where the plugin fires each of the window events in this switch statement[1] that would result in DispatchPendingEvents() being called at the end of the function. But, in my testing, we seem to have fixed the crash, so, I think this patch fixes this bug. [1] http://mxr.mozilla.org/mozilla-central/source/widget/src/windows/nsWindow.cpp#3561
Attached patch better fix? (deleted) — Splinter Review
Clint, try this patch to see if it fixes the regression. Try-server build f5128be6e525.
Hmm, only just read your last comment. Oh well. I'd still like to know if the second patch works for you. It's closer to trunk in behaviour but should still fix the crash. It would be interesting to know if you see that new warning I added while using regular non-crashing Flash sites. If you do, there's a chance that the first patch might cause a painting regression on many sites, which would be bad.
Well, fwiw, I'm not able to reproduce this on 1.9.2 or trunk. It would be nice if we could track down a test case that worked in all cases so we could see what's going on. We can land roc's first patch since it looks reasonably safe, but it's a guess fix since we've only reproduced this on one image and the results of that testing haven't really confirmed it's a fix. ctalbert, have you had a chance to play with the second patch? We're looking for a reliable fix that doesn't have any questionable results.
All(In reply to comment #37) > Hmm, only just read your last comment. Oh well. > > I'd still like to know if the second patch works for you. It's closer to trunk > in behaviour but should still fix the crash. > > It would be interesting to know if you see that new warning I added while using > regular non-crashing Flash sites. If you do, there's a chance that the first > patch might cause a painting regression on many sites, which would be bad. I tested around several flash sites with the first patch in debug and did *not* get the recursive paint warning anywhere except the sears site. I tested the second fix both using the try server build and my own debug build and could not create any paint problems. Tried many different sites that use flash - espn.com, cnn.com, youtube.com, the new moon movie site, clicked on flash ads, and audi.com. None of them had any issues painting, and the only one that had any errors was espn, which Roc and I are reviewing in a separate email thread as it may not pertain to this bug. On the sears site with the second patch I still see the recursive paint warning, but I do *not* see the painting issue. I am starting to think the painting issue might have been due to vmware thrashing as this box is being stressed to its limit to run vmware. So, the second patch looks like it has fixed the issue here, and I think we're ok to take it.
Those ESPN messages are definitely unrelated. Given that the first patch and the second patch both appear to work fine, and we don't trigger the first patch on most Flash sites, I think we should take the first patch, because it's simpler.
Attachment #414177 - Flags: review?(jmathies) → review+
> On the sears site with the second patch I still see the recursive paint warning Clint, are you seeing the warning added by this patch, "We were asked to dispatch pending events during painting, denying since that's unsafe" or the assertion failure "recursive painting not permitted" that indicates something going very wrong? Is it the same with the first patch, at least on the sears site?
Whiteboard: [sg:vector-critical (flash)?][needs review] → [sg:vector-critical (flash)?][needs landing]
(In reply to comment #41) > > On the sears site with the second patch I still see the recursive paint > warning > > Clint, are you seeing the warning added by this patch, "We were asked to > dispatch pending events during painting, denying since that's unsafe" or the > assertion failure "recursive painting not permitted" that indicates something > going very wrong? > > Is it the same with the first patch, at least on the sears site? When testing either of these patches, I only see the new warning. I do not see the original "recursive painting" assertion. And I only see the new warning when I am on the sears site. Other flash sites do not seem to get into this state (at least as far as I've tested).
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Whiteboard: [sg:vector-critical (flash)?][needs landing] → [sg:vector-critical (flash)?]
Group: core-security
Component: Layout: View Rendering → Layout: Web Painting
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: