Closed
Bug 1353420
Opened 8 years ago
Closed 8 years ago
Crash in RtlVirtualUnwind | RtlVirtualUnwindStub | WalkStackMain64
Categories
(Core :: Gecko Profiler, defect)
Tracking
()
RESOLVED
FIXED
mozilla55
Tracking | Status | |
---|---|---|
firefox-esr45 | --- | unaffected |
firefox52 | --- | unaffected |
firefox-esr52 | --- | unaffected |
firefox53 | --- | unaffected |
firefox54 | --- | unaffected |
firefox55 | --- | fixed |
People
(Reporter: marcia, Assigned: away)
References
(Blocks 2 open bugs)
Details
(Keywords: crash, regression, regressionwindow-wanted)
Crash Data
Attachments
(3 files, 1 obsolete file)
This bug was filed from the Socorro interface and is
report bp-4146980e-8a76-4548-b5e6-435e32170404.
=============================================================
Seen while looking at nightly crash data - this crash has moved up to #2 overall browser crash in the last 7 days: http://bit.ly/2nYLPyh
Comments mention suddenly crashing for now reason and "crashes mostly happen when tab bar is manipulated in some way (scrolled, switched, close/open tab, etc)"
Crashes appear to have started using 20170327030203 build. Possible regression range based on that Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681
Updated•8 years ago
|
Comment 1•8 years ago
|
||
I hit what looks like this yesterday while running Cleopatra scrolling through my facebook news feed:
https://crash-stats.mozilla.com/report/index/2335a754-20e6-46c0-91ac-b1d1f2170405
Comment 2•8 years ago
|
||
There were some disucssion in bug 1180561, but the crash rate does seem to be increasing. I got 28 crashes of this signature today (so far), which is much worse than it behaves last week. In my case it's usually a crash on loading a new page (e.g. link clicking) followed by another startup crash on restart.
Comment 3•8 years ago
|
||
(In reply to Marcia Knous [:marcia - use ni] from comment #0)
> Crashes appear to have started using 20170327030203 build. Possible
> regression range based on that Build ID:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53
> 710589fb500610495da5258b7b9221edf681
This regression range seems really unlikely; there's nothing in there that touches the profiler. Are we sure that there weren't other merges that went into the 20170327030203 build that could have caused the problem?
Flags: needinfo?(mozillamarcia.knous)
Reporter | ||
Comment 4•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #3)
> (In reply to Marcia Knous [:marcia - use ni] from comment #0)
> > Crashes appear to have started using 20170327030203 build. Possible
> > regression range based on that Build ID:
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53
> > 710589fb500610495da5258b7b9221edf681
>
> This regression range seems really unlikely; there's nothing in there that
> touches the profiler. Are we sure that there weren't other merges that went
> into the 20170327030203 build that could have caused the problem?
I widened the search scope and I do see crashes going all the way back to 20170127030206 in this signature (when Nightly was in 54). So I will have to try to get you a better regression range.
Flags: needinfo?(mozillamarcia.knous)
Comment 5•8 years ago
|
||
(In reply to Marcia Knous [:marcia - use ni] from comment #4)
> (In reply to Nathan Froyd [:froydnj] from comment #3)
> > (In reply to Marcia Knous [:marcia - use ni] from comment #0)
> > > Crashes appear to have started using 20170327030203 build. Possible
> > > regression range based on that Build ID:
> > > https://hg.mozilla.org/mozilla-central/
> > > pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53
> > > 710589fb500610495da5258b7b9221edf681
> >
> > This regression range seems really unlikely; there's nothing in there that
> > touches the profiler. Are we sure that there weren't other merges that went
> > into the 20170327030203 build that could have caused the problem?
>
> I widened the search scope and I do see crashes going all the way back to
> 20170127030206 in this signature (when Nightly was in 54). So I will have to
> try to get you a better regression range.
This issue severely hurts productivity of anybody who wants to use the gecko profiler. I would be very grateful for seeing a better regression range, thank you Marcia.
I have a proposed fix in bug 1180561 comment 49 that may fix a large number of these (the ones with msmpeg2vdec.dll) but I could use more testers on the patch, if anyone is able to reproduce this.
Also it may not cover all the crashes, if there are cases with DLLs other than msmpeg2vdec. Could someone with crash access and WinDbg send me some sample output of "dps @$csp L100" ?
Reporter | ||
Updated•8 years ago
|
Keywords: regressionwindow-wanted
Comment 7•8 years ago
|
||
Comment 8•8 years ago
|
||
Comment 9•8 years ago
|
||
(In reply to David Major [:dmajor] from comment #6)
> I have a proposed fix in bug 1180561 comment 49 that may fix a large number
> of these (the ones with msmpeg2vdec.dll) but I could use more testers on the
> patch, if anyone is able to reproduce this.
>
> Also it may not cover all the crashes, if there are cases with DLLs other
> than msmpeg2vdec. Could someone with crash access and WinDbg send me some
> sample output of "dps @$csp L100" ?
I uploaded the output of 2 crashes as attachments. The were both 64bits nightly with build-id 20170409123541. Is this helping? I never used windbg before.
Updated•8 years ago
|
status-firefox52:
--- → unaffected
status-firefox53:
--- → unaffected
status-firefox54:
--- → unaffected
status-firefox-esr45:
--- → unaffected
status-firefox-esr52:
--- → unaffected
Assignee | ||
Comment 10•8 years ago
|
||
Samael can reproduce this locally and sent me a heap dump. His thread being sampled goes like:
00 00000069`3d1e21b8 00007ff9`8d65d72a xul!js::EmptyShape::getInitialShape+0x3
[... lots of normal-looking code frames ...]
40 00000069`3d1eaed0 000000af`c395bfda xul!js::jit::DoCallFallback+0x489
41 00000069`3d1eb120 000000af`c3c80000 0x000000af`c395bfda
^^^^^^^^^^^^^^^^^^^
42 00000069`3d1eb128 00000069`3d1eb1f8 0x000000af`c3c80000
^^^^^^^^^^^^^^^^^^^
43 00000069`3d1eb130 fffe018d`b6074ec0 0x00000069`3d1eb1f8
And his crash site is:
ntdll!RtlVirtualUnwind+0x91:
00007ff9`e37bbb61 803f48 cmp byte ptr [rdi],48h ds:000000af`c3c80000=??
^^^^^^^^^^^^^^^^^
There is reasonable-looking code at "frame 41", 0x000000af`c395bfda:
0:067> !address 0x000000af`c395bfda
Usage: <unknown>
Base Address: 000000af`c3951000
End Address: 000000af`c3961000
Region Size: 00000000`00010000 ( 64.000 kB)
State: 00001000 MEM_COMMIT
Protect: 00000020 PAGE_EXECUTE_READ
Type: 00020000 MEM_PRIVATE
Allocation Base: 000000af`c3940000
Allocation Protect: 00000001 PAGE_NOACCESS
0:067> u 0x000000af`c395bfda
000000af`c395bfda 4883c438 add rsp,38h
000000af`c395bfde 5c pop rsp
000000af`c395bfdf 84c0 test al,al
000000af`c395bfe1 0f8408000000 je 000000af`c395bfef
000000af`c395bfe7 59 pop rcx
000000af`c395bfe8 4883c410 add rsp,10h
000000af`c395bfec c22800 ret 28h
000000af`c395bfef e94c51ffff jmp 000000af`c3951140
But not at "frame 42", 0x000000af`c3c80000; this page is not committed:
0:067> !address 0x000000af`c3c80000
Usage: <unknown>
Base Address: 000000af`c3c71000
End Address: 000000af`c3c81000
Region Size: 00000000`00010000 ( 64.000 kB)
State: 00002000 MEM_RESERVE
Protect: <info not present at the target>
Type: 00020000 MEM_PRIVATE
Allocation Base: 000000af`c3940000
Allocation Protect: 00000001 PAGE_NOACCESS
(But note that it was carved out from the same initial Allocation Base as the other frame.)
So:
I suspect that it was bogus to stack-walk from 0x000000af`c395bfda to 0x000000af`c3c80000. AFAIK we have no way of providing correct unwind information for JIT code, so I'm surprised that the stack walker even tried.
I was under the impression that we would just give up here: https://dxr.mozilla.org/mozilla-central/rev/f914d40a48009c5acd1093e9939cc0ec035696dd/mozglue/misc/StackWalk.cpp#403-410
But maybe not? We _do_ register some (dummy) RUNTIME_FUNCTIONs: https://dxr.mozilla.org/mozilla-central/rev/f914d40a48009c5acd1093e9939cc0ec035696dd/js/src/jit/ProcessExecutableMemory.cpp#165 -- those entries say "if you're unwinding me for exceptions, just call Breakpad instead" -- but maybe those entries provide misleading information when we unwind solely for stack-walking purposes?
Jan, does my interpretation sound reasonable? Can you think of any other causes? (I'm actually kind of hoping that I'm wrong about the RUNTIME_FUNCTIONs, because it would mean we've been doing bad stack walks for years)
Flags: needinfo?(jdemooij)
Comment 11•8 years ago
|
||
Luke is more familiar with the Win64 exception handler..
Flags: needinfo?(jdemooij) → needinfo?(luke)
Comment 12•8 years ago
|
||
(Since the recent code allocation overhaul, the RUNTIME_FUNCTION is registered for the whole JIT memory region. Maybe the decommitted pages in that range confuse things?)
Comment 13•8 years ago
|
||
Right, so we have that dummy RUNTIME_FUNCTION registered for the (now single) big JIT code reservation, but it's only for the benefit of SEH unwinding; it doesn't "help" RtlVirtualUnwind in any way since it doesn't provide any info for how to get to the next frame; it just provides an exceptionHandler that RtlVirtualUnwind() shouldn't be calling.
I wonder if maybe this RUNTIME_FUNCTION is somehow *hurting*, misleading RtlVirtualUnwind() into thinking this is valid code and worth probing more than if there was no RUNTIME_FUNCTION for the code region...
Flags: needinfo?(luke)
Assignee | ||
Comment 14•8 years ago
|
||
Maybe we should just check if context.Rip is in JIT code and give up if so.
Is there a way for code outside of SM to determine whether an instruction is in the JIT region?
Assignee | ||
Comment 15•8 years ago
|
||
I suppose one way might be to VirtualQuery and check for MEM_IMAGE.
Comment 16•8 years ago
|
||
That'd be great when the PC itself is in JIT code. However, there is also the (perhaps more common) case where JIT code has called into C++ like you're seeing in comment 10. Unless there's a way to query frame by frame?
But actually, my hope was that someone could perhaps learn more about how RUNTIME_FUNCTION worked and see if there was some way (e.g., some flag we could set) that told RtlVirtualUnwind() to give up.
Comment 17•8 years ago
|
||
More thoughts on this: RUNTIME_FUNCTION isn't well documented, but ChakraCore, now open source, does apparently fully describe all their code via RUNTIME_FUNCTION (so that stack unwinding actually works) so perhaps some study of that would help.
Also, ChakraCore preserves fp, so describing their code to RUNTIME_FUNCTION is a lot easier than Ion, which doesn't. Recently, to support async stack-walking and precise GC, wasm started preserving fp (bug 1334504). On x86, it was a modest regression (0-10%), but on x64, there was no change at all. So maybe it's worth doing this along with RUNTIME_FUNCTION work so we could have real stack unwinding on Win64?
Assignee | ||
Comment 18•8 years ago
|
||
Yep, per-frame VirtualQuery is what I had in mind.
Getting the RUNTIME_FUNCTION stuff right would be ideal, but for the sake of all the Quantum profiling going on, I'd prefer not to block a crash fix on that work. How about this, I'll post a patch for to test for JIT code, and I will be 100% OK with it being removed if a better patch comes along.
Assignee | ||
Comment 19•8 years ago
|
||
Not sure who is best to review profiler+Win32 stuff; feel free to reassign.
Assignee: nobody → dmajor
Attachment #8858431 -
Flags: review?(n.nethercote)
Assignee | ||
Comment 20•8 years ago
|
||
I hope this isn't too expensive; doing too much work while the main thread is paused would earn me a seat at bug 1329212. Maybe I should sample the sampler under xperf...
Assignee | ||
Comment 21•8 years ago
|
||
Oh no! This is far too slow. VirtualQuery is at the top of my profile.
Attachment #8858431 -
Flags: review?(n.nethercote) → review-
Comment 22•8 years ago
|
||
Actually, now that all JIT code is in a single contiguous region (js/src/jit/ProcessExecutableMemory.cpp), it should be easy to expose a super-cheap JS::IsPCInJitCode() API.
Assignee | ||
Comment 23•8 years ago
|
||
The stack walk code is in mozglue, where we can't call libxul functions. :-(
I guess we could invert the call and have ProcessExecutableMemory::init() call some mozglue export to notify it of the region bounds. Would that be too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
Assignee | ||
Comment 24•8 years ago
|
||
Attachment #8858431 -
Attachment is obsolete: true
Attachment #8858971 -
Flags: review?(luke)
Updated•8 years ago
|
Attachment #8858971 -
Flags: review?(luke) → review+
Comment 25•8 years ago
|
||
(In reply to David Major [:dmajor] from comment #23)
> The stack walk code is in mozglue, where we can't call libxul functions. :-(
>
> I guess we could invert the call and have ProcessExecutableMemory::init()
> call some mozglue export to notify it of the region bounds. Would that be
> too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
The xul code invoking the stack walking can't pass the bounds (or a JSAPI function pointer) to mozglue?
Comment 26•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #25)
> (In reply to David Major [:dmajor] from comment #23)
> > I guess we could invert the call and have ProcessExecutableMemory::init()
> > call some mozglue export to notify it of the region bounds. Would that be
> > too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
>
> The xul code invoking the stack walking can't pass the bounds (or a JSAPI
> function pointer) to mozglue?
I like this solution.
Assignee | ||
Comment 27•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #25)
> The xul code invoking the stack walking can't pass the bounds (or a JSAPI
> function pointer) to mozglue?
I think this would get messy because there isn't just one piece of xul code making the call; there are many callers of MozStackWalk scattered about. Unless I'm misunderstanding?
Comment 28•8 years ago
|
||
(In reply to David Major [:dmajor] from comment #27)
> I think this would get messy because there isn't just one piece of xul code
> making the call; there are many callers of MozStackWalk scattered about.
> Unless I'm misunderstanding?
Ah, if it requires modifying many callers I agree your patch may be simpler.
Comment 29•8 years ago
|
||
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3586c6c53779
Register the JIT region with mozglue so it can avoid stack-walking that code. r=luke
Comment 30•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment 31•8 years ago
|
||
My nightly stops crashing these 2 days. Thanks :dmajor!
You need to log in
before you can comment on or make changes to this bug.
Description
•