Closed Bug 1353420 Opened 8 years ago Closed 8 years ago

Crash in RtlVirtualUnwind | RtlVirtualUnwindStub | WalkStackMain64

Categories

(Core :: Gecko Profiler, defect)

55 Branch
Unspecified
Windows 10
defect
Not set
critical

Tracking


RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: marcia, Assigned: away)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, regression, regressionwindow-wanted)

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-4146980e-8a76-4548-b5e6-435e32170404.
=============================================================
Seen while looking at nightly crash data - this crash has moved up to #2 overall browser crash in the last 7 days: http://bit.ly/2nYLPyh

Comments mention suddenly crashing for no reason, and that "crashes mostly happen when tab bar is manipulated in some way (scrolled, switched, close/open tab, etc)".

Crashes appear to have started using 20170327030203 build. Possible regression range based on that Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681
Blocks: 1329181
Component: General → Gecko Profiler
Keywords: regression
I hit what looks like this yesterday while running Cleopatra and scrolling through my Facebook news feed: https://crash-stats.mozilla.com/report/index/2335a754-20e6-46c0-91ac-b1d1f2170405
There was some discussion in bug 1180561, but the crash rate does seem to be increasing. I got 28 crashes with this signature today (so far), which is much worse than last week. In my case it's usually a crash on loading a new page (e.g. clicking a link), followed by another startup crash on restart.
(In reply to Marcia Knous [:marcia - use ni] from comment #0)
> Crashes appear to have started using 20170327030203 build. Possible
> regression range based on that Build ID:
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681

This regression range seems really unlikely; there's nothing in there that touches the profiler. Are we sure that there weren't other merges that went into the 20170327030203 build that could have caused the problem?
Flags: needinfo?(mozillamarcia.knous)
(In reply to Nathan Froyd [:froydnj] from comment #3)
> (In reply to Marcia Knous [:marcia - use ni] from comment #0)
> > Crashes appear to have started using 20170327030203 build. Possible
> > regression range based on that Build ID:
> > https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681
>
> This regression range seems really unlikely; there's nothing in there that
> touches the profiler. Are we sure that there weren't other merges that went
> into the 20170327030203 build that could have caused the problem?

I widened the search scope and I do see crashes with this signature going all the way back to 20170127030206 (when Nightly was in 54). So I will have to try to get you a better regression range.
Flags: needinfo?(mozillamarcia.knous)
(In reply to Marcia Knous [:marcia - use ni] from comment #4)
> I widened the search scope and I do see crashes with this signature going
> all the way back to 20170127030206 (when Nightly was in 54). So I will have
> to try to get you a better regression range.

This issue severely hurts the productivity of anybody who wants to use the Gecko Profiler. I would be very grateful to see a better regression range. Thank you, Marcia.
I have a proposed fix in bug 1180561 comment 49 that may fix a large number of these (the ones with msmpeg2vdec.dll), but I could use more testers on the patch, if anyone is able to reproduce this.

Also, it may not cover all the crashes, if there are cases with DLLs other than msmpeg2vdec. Could someone with crash access and WinDbg send me some sample output of "dps @$csp L100"?
(In reply to David Major [:dmajor] from comment #6)
> I have a proposed fix in bug 1180561 comment 49 that may fix a large number
> of these (the ones with msmpeg2vdec.dll), but I could use more testers on
> the patch, if anyone is able to reproduce this.
>
> Also, it may not cover all the crashes, if there are cases with DLLs other
> than msmpeg2vdec. Could someone with crash access and WinDbg send me some
> sample output of "dps @$csp L100"?

I uploaded the output of two crashes as attachments. They were both 64-bit Nightly with build ID 20170409123541. Is this helping? I have never used WinDbg before.
Samael can reproduce this locally and sent me a heap dump. His thread being sampled goes like:

00 00000069`3d1e21b8 00007ff9`8d65d72a xul!js::EmptyShape::getInitialShape+0x3
[... lots of normal-looking code frames ...]
40 00000069`3d1eaed0 000000af`c395bfda xul!js::jit::DoCallFallback+0x489
41 00000069`3d1eb120 000000af`c3c80000 0x000000af`c395bfda
                                        ^^^^^^^^^^^^^^^^^^^
42 00000069`3d1eb128 00000069`3d1eb1f8 0x000000af`c3c80000
                                        ^^^^^^^^^^^^^^^^^^^
43 00000069`3d1eb130 fffe018d`b6074ec0 0x00000069`3d1eb1f8

And his crash site is:

ntdll!RtlVirtualUnwind+0x91:
00007ff9`e37bbb61 803f48 cmp byte ptr [rdi],48h ds:000000af`c3c80000=??
                                                   ^^^^^^^^^^^^^^^^^

There is reasonable-looking code at "frame 41", 0x000000af`c395bfda:

0:067> !address 0x000000af`c395bfda
Usage:              <unknown>
Base Address:       000000af`c3951000
End Address:        000000af`c3961000
Region Size:        00000000`00010000 ( 64.000 kB)
State:              00001000 MEM_COMMIT
Protect:            00000020 PAGE_EXECUTE_READ
Type:               00020000 MEM_PRIVATE
Allocation Base:    000000af`c3940000
Allocation Protect: 00000001 PAGE_NOACCESS

0:067> u 0x000000af`c395bfda
000000af`c395bfda 4883c438     add rsp,38h
000000af`c395bfde 5c           pop rsp
000000af`c395bfdf 84c0         test al,al
000000af`c395bfe1 0f8408000000 je 000000af`c395bfef
000000af`c395bfe7 59           pop rcx
000000af`c395bfe8 4883c410     add rsp,10h
000000af`c395bfec c22800       ret 28h
000000af`c395bfef e94c51ffff   jmp 000000af`c3951140

But not at "frame 42", 0x000000af`c3c80000; this page is not committed:

0:067> !address 0x000000af`c3c80000
Usage:              <unknown>
Base Address:       000000af`c3c71000
End Address:        000000af`c3c81000
Region Size:        00000000`00010000 ( 64.000 kB)
State:              00002000 MEM_RESERVE
Protect:            <info not present at the target>
Type:               00020000 MEM_PRIVATE
Allocation Base:    000000af`c3940000
Allocation Protect: 00000001 PAGE_NOACCESS

(But note that it was carved out from the same initial Allocation Base as the other frame.)

So: I suspect that it was bogus to stack-walk from 0x000000af`c395bfda to 0x000000af`c3c80000.

AFAIK we have no way of providing correct unwind information for JIT code, so I'm surprised that the stack walker even tried. I was under the impression that we would just give up here:
https://dxr.mozilla.org/mozilla-central/rev/f914d40a48009c5acd1093e9939cc0ec035696dd/mozglue/misc/StackWalk.cpp#403-410

But maybe not? We _do_ register some (dummy) RUNTIME_FUNCTIONs:
https://dxr.mozilla.org/mozilla-central/rev/f914d40a48009c5acd1093e9939cc0ec035696dd/js/src/jit/ProcessExecutableMemory.cpp#165
-- those entries say "if you're unwinding me for exceptions, just call Breakpad instead" -- but maybe those entries provide misleading information when we unwind solely for stack-walking purposes?

Jan, does my interpretation sound reasonable? Can you think of any other causes?

(I'm actually kind of hoping that I'm wrong about the RUNTIME_FUNCTIONs, because it would mean we've been doing bad stack walks for years.)
Flags: needinfo?(jdemooij)
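For context on the mechanism being discussed in the comment above, here is a minimal sketch of the standard Win64 virtual-unwind loop. This is not Mozilla's actual WalkStackMain64/StackWalk.cpp code; the frame limit, the leaf-frame handling, and the lack of address validation are simplifying assumptions for illustration only.

#include <windows.h>

// Sketch of a Win64 stack walk driven by the OS unwinder. If
// RtlLookupFunctionEntry finds no RUNTIME_FUNCTION for a PC, the walker can
// only treat the frame as a leaf (or give up). If it *does* find an entry --
// even a dummy one like the JIT region's -- it proceeds to RtlVirtualUnwind.
static void SketchWalkStack(const CONTEXT& aStartContext) {
  CONTEXT context = aStartContext;
  for (int frame = 0; frame < 64 && context.Rip; ++frame) {
    // A real walker would record context.Rip as this frame's PC here.

    DWORD64 imageBase = 0;
    PRUNTIME_FUNCTION fn =
        RtlLookupFunctionEntry(context.Rip, &imageBase, nullptr);
    if (!fn) {
      // No unwind info: assume a leaf function whose return address sits at
      // [rsp]. (A real walker must validate this read before performing it.)
      context.Rip = *reinterpret_cast<DWORD64*>(context.Rsp);
      context.Rsp += sizeof(DWORD64);
      continue;
    }

    PVOID handlerData = nullptr;
    DWORD64 establisherFrame = 0;
    RtlVirtualUnwind(UNW_FLAG_NHANDLER, imageBase, context.Rip, fn, &context,
                     &handlerData, &establisherFrame, nullptr);
  }
}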
Luke is more familiar with the Win64 exception handler...
Flags: needinfo?(jdemooij) → needinfo?(luke)
(Since the recent code allocation overhaul, the RUNTIME_FUNCTION is registered for the whole JIT memory region. Maybe the decommitted pages in that range confuse things?)
Right, so we have that dummy RUNTIME_FUNCTION registered for the (now single) big JIT code reservation, but it's only for the benefit of SEH unwinding. It doesn't "help" RtlVirtualUnwind in any way, since it doesn't provide any info about how to get to the next frame; it just provides an exceptionHandler that RtlVirtualUnwind() shouldn't be calling.

I wonder if maybe this RUNTIME_FUNCTION is somehow *hurting*: misleading RtlVirtualUnwind() into thinking this is valid code worth probing, more than if there were no RUNTIME_FUNCTION for the code region at all...
Flags: needinfo?(luke)
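For concreteness, the following is a rough sketch of how a single "exception handler only" RUNTIME_FUNCTION can be registered for a code region on Win64, which is the kind of registration being discussed above. This is not the actual ProcessExecutableMemory.cpp code; the struct follows the documented x64 UNWIND_INFO layout, and handlerRva is assumed to be the RVA of a handler thunk placed inside the region.

#include <windows.h>
#include <cstddef>
#include <cstdint>

// Sketch only: a bare-bones x64 UNWIND_INFO with no unwind codes, just an
// exception handler (UNW_FLAG_EHANDLER). All RVAs are relative to the
// BaseAddress passed to RtlAddFunctionTable, so this data (and the handler
// thunk) must live inside the registered region, and both 'entry' and 'info'
// must outlive the registration.
struct MinimalUnwindInfo {
  uint8_t versionAndFlags;       // version 1 | (UNW_FLAG_EHANDLER << 3)
  uint8_t sizeOfProlog;          // 0: no prologue described
  uint8_t countOfUnwindCodes;    // 0: no unwind codes at all
  uint8_t frameRegisterAndOffset;
  uint32_t exceptionHandlerRva;  // where the handler thunk lives
};

static bool RegisterDummyUnwindEntry(uint8_t* regionBase, size_t regionSize,
                                     uint32_t handlerRva,
                                     RUNTIME_FUNCTION* entry,
                                     MinimalUnwindInfo* info) {
  info->versionAndFlags = 1 | (1 /* UNW_FLAG_EHANDLER */ << 3);
  info->sizeOfProlog = 0;
  info->countOfUnwindCodes = 0;
  info->frameRegisterAndOffset = 0;
  info->exceptionHandlerRva = handlerRva;

  // One entry covering the entire region.
  entry->BeginAddress = 0;
  entry->EndAddress = static_cast<DWORD>(regionSize);
  entry->UnwindData =
      static_cast<DWORD>(reinterpret_cast<uint8_t*>(info) - regionBase);

  return RtlAddFunctionTable(entry, 1,
                             reinterpret_cast<DWORD64>(regionBase)) != FALSE;
}

An entry like this tells the SEH machinery "there is a handler here" but gives RtlVirtualUnwind no register-restore information, which matches the description in the comment above.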
Maybe we should just check if context.Rip is in JIT code and give up if so. Is there a way for code outside of SM to determine whether an instruction is in the JIT region?
I suppose one way might be to VirtualQuery and check for MEM_IMAGE.
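A minimal sketch of that check, assuming the intent described in the two comments above (the helper name is made up): before unwinding a frame, query the PC and only keep walking if it falls inside committed, image-backed memory.

#include <windows.h>

// Sketch of the proposed per-frame sanity check: is this PC inside a mapped
// module (DLL/EXE image) rather than JIT or heap memory?
static bool IsPcInMappedImage(DWORD64 pc) {
  MEMORY_BASIC_INFORMATION mbi = {};
  if (VirtualQuery(reinterpret_cast<LPCVOID>(pc), &mbi, sizeof(mbi)) == 0) {
    return false;  // not a queryable address at all
  }
  return mbi.State == MEM_COMMIT && mbi.Type == MEM_IMAGE;
}

As the rest of the thread shows, calling VirtualQuery on every frame while the sampled thread is suspended turned out to be far too slow in practice.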
Blocks: 1354292
That'd be great when the PC itself is in JIT code. However, there is also the (perhaps more common) case where JIT code has called into C++, like you're seeing in comment 10. Unless there's a way to query frame by frame? But actually, my hope was that someone could learn more about how RUNTIME_FUNCTION works and see if there is some way (e.g., some flag we could set) to tell RtlVirtualUnwind() to give up.
More thoughts on this: RUNTIME_FUNCTION isn't well documented, but ChakraCore, now open source, apparently does fully describe all of its code via RUNTIME_FUNCTION (so that stack unwinding actually works), so perhaps some study of that would help. Also, ChakraCore preserves fp, so describing its code via RUNTIME_FUNCTION is a lot easier than it would be for Ion, which doesn't.

Recently, to support async stack-walking and precise GC, wasm started preserving fp (bug 1334504). On x86 that was a modest regression (0-10%), but on x64 there was no change at all. So maybe it's worth doing this along with the RUNTIME_FUNCTION work, so we could have real stack unwinding on Win64?
Yep, per-frame VirtualQuery is what I had in mind. Getting the RUNTIME_FUNCTION stuff right would be ideal, but for the sake of all the Quantum profiling going on, I'd prefer not to block a crash fix on that work. How about this: I'll post a patch that tests for JIT code, and I will be 100% OK with it being removed if a better patch comes along.
Attached patch Sanity check the address at each frame (obsolete) (deleted) — Splinter Review
Not sure who is best to review profiler+Win32 stuff; feel free to reassign.
Assignee: nobody → dmajor
Attachment #8858431 - Flags: review?(n.nethercote)
I hope this isn't too expensive; doing too much work while the main thread is paused would earn me a seat at bug 1329212. Maybe I should sample the sampler under xperf...
Oh no! This is far too slow. VirtualQuery is at the top of my profile.
Attachment #8858431 - Flags: review?(n.nethercote) → review-
Actually, now that all JIT code is in a single contiguous region (js/src/jit/ProcessExecutableMemory.cpp), it should be easy to expose a super-cheap JS::IsPCInJitCode() API.
The stack walk code is in mozglue, where we can't call libxul functions. :-(

I guess we could invert the call and have ProcessExecutableMemory::init() call some mozglue export to notify it of the region bounds. Would that be too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
Attachment #8858431 - Attachment is obsolete: true
Attachment #8858971 - Flags: review?(luke)
Attachment #8858971 - Flags: review?(luke) → review+
(In reply to David Major [:dmajor] from comment #23)
> The stack walk code is in mozglue, where we can't call libxul functions. :-(
>
> I guess we could invert the call and have ProcessExecutableMemory::init()
> call some mozglue export to notify it of the region bounds. Would that be
> too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.

The xul code invoking the stack walking can't pass the bounds (or a JSAPI function pointer) to mozglue?
(In reply to Jan de Mooij [:jandem] from comment #25)
> (In reply to David Major [:dmajor] from comment #23)
> > I guess we could invert the call and have ProcessExecutableMemory::init()
> > call some mozglue export to notify it of the region bounds. Would that be
> > too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
>
> The xul code invoking the stack walking can't pass the bounds (or a JSAPI
> function pointer) to mozglue?

I like this solution.
(In reply to Jan de Mooij [:jandem] from comment #25)
> The xul code invoking the stack walking can't pass the bounds (or a JSAPI
> function pointer) to mozglue?

I think this would get messy because there isn't just one piece of xul code making the call; there are many callers of MozStackWalk scattered about. Unless I'm misunderstanding?
(In reply to David Major [:dmajor] from comment #27)
> I think this would get messy because there isn't just one piece of xul code
> making the call; there are many callers of MozStackWalk scattered about.
> Unless I'm misunderstanding?

Ah, if it requires modifying many callers, I agree your patch may be simpler.
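For illustration, here is a minimal sketch of the approach that was settled on and landed below: the JIT allocator tells mozglue where its single, contiguous code reservation lives, and the Win64 stack walker bails out instead of virtual-unwinding any PC in that range. The names used here (RegisterJitCodeRegion, sJitRegionStart, sJitRegionSize) are placeholders, not the actual exported API.

#include <atomic>
#include <cstddef>
#include <cstdint>

// --- mozglue side (conceptually next to the stack-walking code) ---
static std::atomic<uintptr_t> sJitRegionStart{0};
static std::atomic<size_t> sJitRegionSize{0};

extern "C" void RegisterJitCodeRegion(void* aStart, size_t aSize) {
  sJitRegionStart.store(reinterpret_cast<uintptr_t>(aStart),
                        std::memory_order_relaxed);
  sJitRegionSize.store(aSize, std::memory_order_relaxed);
}

static bool IsInJitCodeRegion(uintptr_t aPc) {
  uintptr_t start = sJitRegionStart.load(std::memory_order_relaxed);
  size_t size = sJitRegionSize.load(std::memory_order_relaxed);
  return start != 0 && (aPc - start) < size;  // unsigned wrap handles aPc < start
}

// In the Win64 walker, before calling RtlVirtualUnwind:
//   if (IsInJitCodeRegion(context.Rip)) break;  // never unwind through JIT code

// --- libxul side, after ProcessExecutableMemory reserves its region ---
//   RegisterJitCodeRegion(reservationBase, reservationSize);

Unlike the per-frame VirtualQuery approach, this is just an address-range comparison, so it costs essentially nothing while the sampled thread is paused.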
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3586c6c53779
Register the JIT region with mozglue so it can avoid stack-walking that code. r=luke
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
My Nightly has stopped crashing over the past two days. Thanks :dmajor!
Blocks: 1366030