Closed Bug 1353420 Opened 8 years ago Closed 8 years ago

Crash in RtlVirtualUnwind | RtlVirtualUnwindStub | WalkStackMain64

Categories

(Core :: Gecko Profiler, defect)

55 Branch
Unspecified
Windows 10
defect
Not set
critical

Tracking


RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox52 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: marcia, Assigned: away)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, regression, regressionwindow-wanted)

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-4146980e-8a76-4548-b5e6-435e32170404.
=============================================================
Seen while looking at nightly crash data - this crash has moved up to #2 overall browser crash in the last 7 days: http://bit.ly/2nYLPyh

Comments mention suddenly crashing for no reason, and that "crashes mostly happen when tab bar is manipulated in some way (scrolled, switched, close/open tab, etc)".

Crashes appear to have started using 20170327030203 build. Possible regression range based on that Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681
Blocks: 1329181
Component: General → Gecko Profiler
Keywords: regression
I hit what looks like this yesterday while running Cleopatra and scrolling through my Facebook news feed: https://crash-stats.mozilla.com/report/index/2335a754-20e6-46c0-91ac-b1d1f2170405
There was some discussion in bug 1180561, but the crash rate does seem to be increasing. I got 28 crashes with this signature today (so far), which is much worse than last week. In my case it's usually a crash on loading a new page (e.g. clicking a link), followed by another startup crash on restart.
(In reply to Marcia Knous [:marcia - use ni] from comment #0)
> Crashes appear to have started using 20170327030203 build. Possible
> regression range based on that Build ID:
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681

This regression range seems really unlikely; there's nothing in there that touches the profiler. Are we sure that there weren't other merges that went into the 20170327030203 build that could have caused the problem?
Flags: needinfo?(mozillamarcia.knous)
(In reply to Nathan Froyd [:froydnj] from comment #3)
> (In reply to Marcia Knous [:marcia - use ni] from comment #0)
> > Crashes appear to have started using 20170327030203 build. Possible
> > regression range based on that Build ID:
> > https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f5e214144799889e2408c4841351f4053f00544e&tochange=cc53710589fb500610495da5258b7b9221edf681
>
> This regression range seems really unlikely; there's nothing in there that
> touches the profiler. Are we sure that there weren't other merges that went
> into the 20170327030203 build that could have caused the problem?

I widened the search scope and I do see crashes with this signature going all the way back to 20170127030206 (when Nightly was in 54). So I will have to try to get you a better regression range.
Flags: needinfo?(mozillamarcia.knous)
(In reply to Marcia Knous [:marcia - use ni] from comment #4)
> I widened the search scope and I do see crashes with this signature going
> all the way back to 20170127030206 (when Nightly was in 54). So I will have
> to try to get you a better regression range.

This issue severely hurts the productivity of anybody who wants to use the Gecko Profiler. I would be very grateful to see a better regression range. Thank you, Marcia.
I have a proposed fix in bug 1180561 comment 49 that may fix a large number of these (the ones with msmpeg2vdec.dll), but I could use more testers on the patch, if anyone is able to reproduce this.

Also, it may not cover all the crashes, if there are cases with DLLs other than msmpeg2vdec. Could someone with crash access and WinDbg send me some sample output of "dps @$csp L100"?
(In reply to David Major [:dmajor] from comment #6)
> I have a proposed fix in bug 1180561 comment 49 that may fix a large number
> of these (the ones with msmpeg2vdec.dll), but I could use more testers on
> the patch, if anyone is able to reproduce this.
>
> Also, it may not cover all the crashes, if there are cases with DLLs other
> than msmpeg2vdec. Could someone with crash access and WinDbg send me some
> sample output of "dps @$csp L100"?

I uploaded the output of two crashes as attachments. They were both 64-bit Nightly with build ID 20170409123541. Is this helping? I have never used WinDbg before.
Samael can reproduce this locally and sent me a heap dump. His thread being sampled goes like:

00 00000069`3d1e21b8 00007ff9`8d65d72a xul!js::EmptyShape::getInitialShape+0x3
[... lots of normal-looking code frames ...]
40 00000069`3d1eaed0 000000af`c395bfda xul!js::jit::DoCallFallback+0x489
41 00000069`3d1eb120 000000af`c3c80000 0x000000af`c395bfda
                                        ^^^^^^^^^^^^^^^^^^^
42 00000069`3d1eb128 00000069`3d1eb1f8 0x000000af`c3c80000
                                        ^^^^^^^^^^^^^^^^^^^
43 00000069`3d1eb130 fffe018d`b6074ec0 0x00000069`3d1eb1f8

And his crash site is:

ntdll!RtlVirtualUnwind+0x91:
00007ff9`e37bbb61 803f48 cmp byte ptr [rdi],48h ds:000000af`c3c80000=??
                                                   ^^^^^^^^^^^^^^^^^

There is reasonable-looking code at "frame 41", 0x000000af`c395bfda:

0:067> !address 0x000000af`c395bfda
Usage:              <unknown>
Base Address:       000000af`c3951000
End Address:        000000af`c3961000
Region Size:        00000000`00010000 ( 64.000 kB)
State:              00001000 MEM_COMMIT
Protect:            00000020 PAGE_EXECUTE_READ
Type:               00020000 MEM_PRIVATE
Allocation Base:    000000af`c3940000
Allocation Protect: 00000001 PAGE_NOACCESS

0:067> u 0x000000af`c395bfda
000000af`c395bfda 4883c438     add rsp,38h
000000af`c395bfde 5c           pop rsp
000000af`c395bfdf 84c0         test al,al
000000af`c395bfe1 0f8408000000 je 000000af`c395bfef
000000af`c395bfe7 59           pop rcx
000000af`c395bfe8 4883c410     add rsp,10h
000000af`c395bfec c22800       ret 28h
000000af`c395bfef e94c51ffff   jmp 000000af`c3951140

But not at "frame 42", 0x000000af`c3c80000; this page is not committed:

0:067> !address 0x000000af`c3c80000
Usage:              <unknown>
Base Address:       000000af`c3c71000
End Address:        000000af`c3c81000
Region Size:        00000000`00010000 ( 64.000 kB)
State:              00002000 MEM_RESERVE
Protect:            <info not present at the target>
Type:               00020000 MEM_PRIVATE
Allocation Base:    000000af`c3940000
Allocation Protect: 00000001 PAGE_NOACCESS

(But note that it was carved out from the same initial Allocation Base as the other frame.)

So: I suspect that it was bogus to stack-walk from 0x000000af`c395bfda to 0x000000af`c3c80000.

AFAIK we have no way of providing correct unwind information for JIT code, so I'm surprised that the stack walker even tried. I was under the impression that we would just give up here:
https://dxr.mozilla.org/mozilla-central/rev/f914d40a48009c5acd1093e9939cc0ec035696dd/mozglue/misc/StackWalk.cpp#403-410

But maybe not? We _do_ register some (dummy) RUNTIME_FUNCTIONs:
https://dxr.mozilla.org/mozilla-central/rev/f914d40a48009c5acd1093e9939cc0ec035696dd/js/src/jit/ProcessExecutableMemory.cpp#165
-- those entries say "if you're unwinding me for exceptions, just call Breakpad instead" -- but maybe those entries provide misleading information when we unwind solely for stack-walking purposes?

Jan, does my interpretation sound reasonable? Can you think of any other causes?

(I'm actually kind of hoping that I'm wrong about the RUNTIME_FUNCTIONs, because it would mean we've been doing bad stack walks for years.)
Flags: needinfo?(jdemooij)
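For context on the mechanism being discussed in the comment above, here is a minimal sketch of the standard Win64 virtual-unwind loop. This is not Mozilla's actual WalkStackMain64/StackWalk.cpp code; the frame limit, the leaf-frame handling, and the lack of address validation are simplifying assumptions for illustration only.

#include <windows.h>

// Sketch of a Win64 stack walk driven by the OS unwinder. If
// RtlLookupFunctionEntry finds no RUNTIME_FUNCTION for a PC, the walker can
// only treat the frame as a leaf (or give up). If it *does* find an entry --
// even a dummy one like the JIT region's -- it proceeds to RtlVirtualUnwind.
static void SketchWalkStack(const CONTEXT& aStartContext) {
  CONTEXT context = aStartContext;
  for (int frame = 0; frame < 64 && context.Rip; ++frame) {
    // A real walker would record context.Rip as this frame's PC here.

    DWORD64 imageBase = 0;
    PRUNTIME_FUNCTION fn =
        RtlLookupFunctionEntry(context.Rip, &imageBase, nullptr);
    if (!fn) {
      // No unwind info: assume a leaf function whose return address sits at
      // [rsp]. (A real walker must validate this read before performing it.)
      context.Rip = *reinterpret_cast<DWORD64*>(context.Rsp);
      context.Rsp += sizeof(DWORD64);
      continue;
    }

    PVOID handlerData = nullptr;
    DWORD64 establisherFrame = 0;
    RtlVirtualUnwind(UNW_FLAG_NHANDLER, imageBase, context.Rip, fn, &context,
                     &handlerData, &establisherFrame, nullptr);
  }
}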
Luke is more familiar with the Win64 exception handler...
Flags: needinfo?(jdemooij) → needinfo?(luke)
(Since the recent code allocation overhaul, the RUNTIME_FUNCTION is registered for the whole JIT memory region. Maybe the decommitted pages in that range confuse things?)
Right, so we have that dummy RUNTIME_FUNCTION registered for the (now single) big JIT code reservation, but it's only for the benefit of SEH unwinding. It doesn't "help" RtlVirtualUnwind in any way, since it doesn't provide any info about how to get to the next frame; it just provides an exceptionHandler that RtlVirtualUnwind() shouldn't be calling.

I wonder if maybe this RUNTIME_FUNCTION is somehow *hurting*: misleading RtlVirtualUnwind() into thinking this is valid code worth probing, more than if there were no RUNTIME_FUNCTION for the code region at all...
Flags: needinfo?(luke)
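For concreteness, the following is a rough sketch of how a single "exception handler only" RUNTIME_FUNCTION can be registered for a code region on Win64, which is the kind of registration being discussed above. This is not the actual ProcessExecutableMemory.cpp code; the struct follows the documented x64 UNWIND_INFO layout, and handlerRva is assumed to be the RVA of a handler thunk placed inside the region.

#include <windows.h>
#include <cstddef>
#include <cstdint>

// Sketch only: a bare-bones x64 UNWIND_INFO with no unwind codes, just an
// exception handler (UNW_FLAG_EHANDLER). All RVAs are relative to the
// BaseAddress passed to RtlAddFunctionTable, so this data (and the handler
// thunk) must live inside the registered region, and both 'entry' and 'info'
// must outlive the registration.
struct MinimalUnwindInfo {
  uint8_t versionAndFlags;       // version 1 | (UNW_FLAG_EHANDLER << 3)
  uint8_t sizeOfProlog;          // 0: no prologue described
  uint8_t countOfUnwindCodes;    // 0: no unwind codes at all
  uint8_t frameRegisterAndOffset;
  uint32_t exceptionHandlerRva;  // where the handler thunk lives
};

static bool RegisterDummyUnwindEntry(uint8_t* regionBase, size_t regionSize,
                                     uint32_t handlerRva,
                                     RUNTIME_FUNCTION* entry,
                                     MinimalUnwindInfo* info) {
  info->versionAndFlags = 1 | (1 /* UNW_FLAG_EHANDLER */ << 3);
  info->sizeOfProlog = 0;
  info->countOfUnwindCodes = 0;
  info->frameRegisterAndOffset = 0;
  info->exceptionHandlerRva = handlerRva;

  // One entry covering the entire region.
  entry->BeginAddress = 0;
  entry->EndAddress = static_cast<DWORD>(regionSize);
  entry->UnwindData =
      static_cast<DWORD>(reinterpret_cast<uint8_t*>(info) - regionBase);

  return RtlAddFunctionTable(entry, 1,
                             reinterpret_cast<DWORD64>(regionBase)) != FALSE;
}

An entry like this tells the SEH machinery "there is a handler here" but gives RtlVirtualUnwind no register-restore information, which matches the description in the comment above.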
Maybe we should just check if context.Rip is in JIT code and give up if so. Is there a way for code outside of SM to determine whether an instruction is in the JIT region?
I suppose one way might be to VirtualQuery and check for MEM_IMAGE.
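A minimal sketch of that check, assuming the intent described in the two comments above (the helper name is made up): before unwinding a frame, query the PC and only keep walking if it falls inside committed, image-backed memory.

#include <windows.h>

// Sketch of the proposed per-frame sanity check: is this PC inside a mapped
// module (DLL/EXE image) rather than JIT or heap memory?
static bool IsPcInMappedImage(DWORD64 pc) {
  MEMORY_BASIC_INFORMATION mbi = {};
  if (VirtualQuery(reinterpret_cast<LPCVOID>(pc), &mbi, sizeof(mbi)) == 0) {
    return false;  // not a queryable address at all
  }
  return mbi.State == MEM_COMMIT && mbi.Type == MEM_IMAGE;
}

As the rest of the thread shows, calling VirtualQuery on every frame while the sampled thread is suspended turned out to be far too slow in practice.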
Blocks: 1354292
That'd be great when the PC itself is in JIT code. However, there is also the (perhaps more common) case where JIT code has called into C++, like you're seeing in comment 10. Unless there's a way to query frame by frame? But actually, my hope was that someone could learn more about how RUNTIME_FUNCTION works and see if there is some way (e.g., some flag we could set) to tell RtlVirtualUnwind() to give up.
More thoughts on this: RUNTIME_FUNCTION isn't well documented, but ChakraCore, now open source, apparently does fully describe all of its code via RUNTIME_FUNCTION (so that stack unwinding actually works), so perhaps some study of that would help. Also, ChakraCore preserves fp, so describing its code via RUNTIME_FUNCTION is a lot easier than it would be for Ion, which doesn't.

Recently, to support async stack-walking and precise GC, wasm started preserving fp (bug 1334504). On x86 that was a modest regression (0-10%), but on x64 there was no change at all. So maybe it's worth doing this along with the RUNTIME_FUNCTION work, so we could have real stack unwinding on Win64?
Yep, per-frame VirtualQuery is what I had in mind. Getting the RUNTIME_FUNCTION stuff right would be ideal, but for the sake of all the Quantum profiling going on, I'd prefer not to block a crash fix on that work. How about this: I'll post a patch that tests for JIT code, and I will be 100% OK with it being removed if a better patch comes along.
Attached patch Sanity check the address at each frame (obsolete) (deleted) — Splinter Review
Not sure who is best to review profiler+Win32 stuff; feel free to reassign.
Assignee: nobody → dmajor
Attachment #8858431 - Flags: review?(n.nethercote)
I hope this isn't too expensive; doing too much work while the main thread is paused would earn me a seat at bug 1329212. Maybe I should sample the sampler under xperf...
Oh no! This is far too slow. VirtualQuery is at the top of my profile.
Attachment #8858431 - Flags: review?(n.nethercote) → review-
Actually, now that all JIT code is in a single contiguous region (js/src/jit/ProcessExecutableMemory.cpp), it should be easy to expose a super-cheap JS::IsPCInJitCode() API.
The stack walk code is in mozglue, where we can't call libxul functions. :-(

I guess we could invert the call and have ProcessExecutableMemory::init() call some mozglue export to notify it of the region bounds. Would that be too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
Attachment #8858431 - Attachment is obsolete: true
Attachment #8858971 - Flags: review?(luke)
Attachment #8858971 - Flags: review?(luke) → review+
(In reply to David Major [:dmajor] from comment #23)
> The stack walk code is in mozglue, where we can't call libxul functions. :-(
>
> I guess we could invert the call and have ProcessExecutableMemory::init()
> call some mozglue export to notify it of the region bounds. Would that be
> too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.

The xul code invoking the stack walking can't pass the bounds (or a JSAPI function pointer) to mozglue?
(In reply to Jan de Mooij [:jandem] from comment #25)
> (In reply to David Major [:dmajor] from comment #23)
> > I guess we could invert the call and have ProcessExecutableMemory::init()
> > call some mozglue export to notify it of the region bounds. Would that be
> > too ugly? I'm not a fan of random mozglue exports; happy to hear other ideas.
>
> The xul code invoking the stack walking can't pass the bounds (or a JSAPI
> function pointer) to mozglue?

I like this solution.
(In reply to Jan de Mooij [:jandem] from comment #25)
> The xul code invoking the stack walking can't pass the bounds (or a JSAPI
> function pointer) to mozglue?

I think this would get messy because there isn't just one piece of xul code making the call; there are many callers of MozStackWalk scattered about. Unless I'm misunderstanding?
(In reply to David Major [:dmajor] from comment #27)
> I think this would get messy because there isn't just one piece of xul code
> making the call; there are many callers of MozStackWalk scattered about.
> Unless I'm misunderstanding?

Ah, if it requires modifying many callers, I agree your patch may be simpler.
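For illustration, here is a minimal sketch of the approach that was settled on and landed below: the JIT allocator tells mozglue where its single, contiguous code reservation lives, and the Win64 stack walker bails out instead of virtual-unwinding any PC in that range. The names used here (RegisterJitCodeRegion, sJitRegionStart, sJitRegionSize) are placeholders, not the actual exported API.

#include <atomic>
#include <cstddef>
#include <cstdint>

// --- mozglue side (conceptually next to the stack-walking code) ---
static std::atomic<uintptr_t> sJitRegionStart{0};
static std::atomic<size_t> sJitRegionSize{0};

extern "C" void RegisterJitCodeRegion(void* aStart, size_t aSize) {
  sJitRegionStart.store(reinterpret_cast<uintptr_t>(aStart),
                        std::memory_order_relaxed);
  sJitRegionSize.store(aSize, std::memory_order_relaxed);
}

static bool IsInJitCodeRegion(uintptr_t aPc) {
  uintptr_t start = sJitRegionStart.load(std::memory_order_relaxed);
  size_t size = sJitRegionSize.load(std::memory_order_relaxed);
  return start != 0 && (aPc - start) < size;  // unsigned wrap handles aPc < start
}

// In the Win64 walker, before calling RtlVirtualUnwind:
//   if (IsInJitCodeRegion(context.Rip)) break;  // never unwind through JIT code

// --- libxul side, after ProcessExecutableMemory reserves its region ---
//   RegisterJitCodeRegion(reservationBase, reservationSize);

Unlike the per-frame VirtualQuery approach, this is just an address-range comparison, so it costs essentially nothing while the sampled thread is paused.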
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3586c6c53779
Register the JIT region with mozglue so it can avoid stack-walking that code. r=luke
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
My Nightly has stopped crashing over the past two days. Thanks :dmajor!
Blocks: 1366030