Closed Bug 519616 Opened 15 years ago Closed 15 years ago

Some crashes don't get unwound by the minidump processor usefully [@ @0x0] [@ @0x1]

Categories

(Socorro :: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jrmuizel, Assigned: ted)

References

Details

(Whiteboard: [crashkill])

Attachments

(2 files)

One example is crash d6eddf4e-a71e-4efe-b796-924112090928. The minidump has 5368 bytes of stack for the crashing thread, but we currently don't get any useful data out of it at all. We should be able to do better.
Here's a very quick and dirty manual unwind of the stack. Not everything below is necessarily sane, but it looks like we're crashing during plugin initialization.
google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(_EXCEPTION_POINTERS *,MDRawAssertionInfo *)
google_breakpad::ExceptionHandler::HandleException(_EXCEPTION_POINTERS *)
...
ComputeBorderCornerDimensions
nsCOMPtr_base::~nsCOMPtr_base()
...
nsPluginInstanceOwner::GetMode(nsPluginMode *)
nsNPAPIPluginInstance::InitializePlugin(nsIPluginInstancePeer *)
nsPluginHostImpl::TrySetUpPluginInstance(char const *,nsIURI *,nsIPluginInstanceOwner *)
nsCOMPtr_base::assign_from_qi(nsQueryInterface,nsID const &)
PL_DHashMatchStringKey
PL_DHashTableOperate
nsComponentManagerImpl::GetFactoryEntry(char const *,unsigned int
nsComponentManagerImpl::IsContractIDRegistered(char const *,int *)
SearchTable
NS_TableDrivenQI(void *,QITableEntry const *,nsID const &,void * *)
nsPluginHostImpl::QueryInterface(nsID const &,void * *)
etc...
WinDBG gives a similar stack:
ChildEBP RetAddr
0012eb08 7c90df5a ntdll!KiFastSystemCallRet
0012eb0c 7c8025db ntdll!ZwWaitForSingleObject+0xc
0012eb70 7c802542 kernel32!WaitForSingleObjectEx+0xa8
0012eb84 103a4236 kernel32!WaitForSingleObject+0x12
0012eb9c 1048ec2b xul!google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(struct _EXCEPTION_POINTERS * exinfo = 0x0012eec0, struct MDRawAssertionInfo * assertion = 0x0caa17fe)+0x6a [e:\builds\moz2_slave\win32_build\build\toolkit\crashreporter\google-breakpad\src\client\windows\handler\exception_handler.cc @ 562]
0012ebc0 6681f1b7 xul!google_breakpad::ExceptionHandler::HandleException(struct _EXCEPTION_POINTERS * exinfo = 0x0012eec0)+0x53 [e:\builds\moz2_slave\win32_build\build\toolkit\crashreporter\google-breakpad\src\client\windows\handler\exception_handler.cc @ 394]
WARNING: Stack unwind information not available. Following frames may be wrong.
0012ee98 7c8438fa QuickTime+0x1f1b7
0012eea0 7c839b39 kernel32!BaseProcessStart+0x39
0012eec8 7c9032a8 kernel32!_except_handler3+0x61
0012eeec 7c90327a ntdll!ExecuteHandler2+0x26
0012ef9c 7c90e48a ntdll!ExecuteHandler+0x24
0012ef9c 00000000 ntdll!KiUserExceptionDispatcher+0xe
0012f298 669c22d7 0x0
0012f32c 100b502e QuickTime+0x1c22d7
0012f334 10699db7 xul!nsCOMPtr_base::~nsCOMPtr_base(void)+0xe [e:\builds\moz2_slave\win32_build\build\obj-firefox\xpcom\build\nscomptr.cpp @ 82]
0012f344 1070c805 xul!nsPluginInstanceOwner::GetMode(nsPluginMode * aMode = <Memory access error>)+0x61 [e:\builds\moz2_slave\win32_build\build\layout\generic\nsobjectframe.cpp @ 2362]
00000000 00000000 xul!nsNPAPIPluginInstance::InitializePlugin(class nsIPluginInstancePeer * peer = <Memory access error>)+0x238 [e:\builds\moz2_slave\win32_build\build\modules\plugin\base\src\nsnpapiplugininstance.cpp @ 1030]
Note that, of course, Breakpad doesn't just walk the stack from the top on the crashed thread, but starts from the register state in the exception context:
http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/minidump_processor.cc#167
!analyze -v gives, among other things:
STACK_TEXT:
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f298 669c22d7 0012f318 0012f310 00000000 0x0
0012f32c 100b502e 0407cc00 10699db7 0416a220 QuickTime+0x1c22d7
0012f334 10699db7 0416a220 0416a220 00000000 xul!nsCOMPtr_base::~nsCOMPtr_base+0xe [e:\builds\moz2_slave\win32_build\build\obj-firefox\xpcom\build\nscomptr.cpp @ 82]
0012f344 1070c805 07d18f20 076cb418 00000001 xul!nsPluginInstanceOwner::GetMode+0x61 [e:\builds\moz2_slave\win32_build\build\layout\generic\nsobjectframe.cpp @ 2362]
00000000 00000000 00000000 00000000 00000000 xul!nsNPAPIPluginInstance::InitializePlugin+0x238 [e:\builds\moz2_slave\win32_build\build\modules\plugin\base\src\nsnpapiplugininstance.cpp @ 1030]
QuickTime, j'accuse! Now, we just need to figure out why WinDBG can walk this stack and Breakpad can't.
Note that if you open a dump in WinDBG, you can use ".excr" to set your current context to the exception context, such that your displayed register state and call stack will be from the exception context.
Also, to add to the fun in this particular report: http://crash-stats.mozilla.com/report/index/d6eddf4e-a71e-4efe-b796-924112090928 cvasds1.dll e8main1.dll are apparently from Trojan.Dropper/Gen-NV.
A little bit of examination shows that we hit this block in the stack walker: http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/stackwalker_x86.cc#257
Since %eip is 0x0, which is clearly not in a known module, we get shunted to the "standard calling convention" path. However, %ebp is off in the weeds (0x6170706c), so every one of those conditions fails, and the stack walker gives up and returns NULL: http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/stackwalker_x86.cc#293
There must be a smarter way to handle this particular situation. From running minidump_dump on the dump, we have:
stack.start_of_memory_range = 0x12eb08
stack.memory.data_size = 0x14f8
and %esp looks sane: 0x12f29c. I think the stack walker just needs to try a little harder here.
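To make the "%esp looks sane, %ebp doesn't" observation concrete, here is a small Python check using only the numbers quoted above. The helper name stack_contains is made up for this illustration and is not part of Breakpad.

```python
def stack_contains(start, size, address, word_size=4):
    """Return True if a word-sized read at `address` stays inside the captured stack."""
    return start <= address and address + word_size <= start + size


STACK_START = 0x12EB08  # stack.start_of_memory_range from minidump_dump
STACK_SIZE = 0x14F8     # stack.memory.data_size (5368 bytes)
ESP = 0x12F29C          # %esp from the exception context
EBP = 0x6170706C        # %ebp from the exception context

STACK_END = STACK_START + STACK_SIZE

print("stack range: %#x .. %#x" % (STACK_START, STACK_END))
print("esp inside stack:", stack_contains(STACK_START, STACK_SIZE, ESP))
print("ebp inside stack:", stack_contains(STACK_START, STACK_SIZE, EBP))
```

The captured range ends at 0x130000, so %esp is usable as a starting point even though %ebp is not.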
Attached patch try harder to unwind (deleted) — Splinter Review
With this quick patch, I get the following stack:
0 0x0 eip = 0x00000000 esp = 0x0012f29c ebp = 0x6170706c ebx = 0x00000000 esi = 0x0800b5fe edi = 0x000001f1 eax = 0x00040002 ecx = 0x07fa4ff0 edx = 0x07ff67e0 efl = 0x00010286
1 QuickTime.qts + 0x1c22d6 eip = 0x669c22d7 esp = 0x0012f2a0 ebp = 0x6170706c
2 QuickTime.qts + 0x98225 eip = 0x66898226 esp = 0x0012f2d0 ebp = 0x6170706c
3 QuickTime.qts + 0x1c1f2f eip = 0x669c1f30 esp = 0x0012f2f0 ebp = 0x6170706c
4 QuickTime.qts + 0x11b2af eip = 0x6691b2b0 esp = 0x0012f308 ebp = 0x6170706c
5 QuickTimeWebHelper.qtx + 0x9f3c eip = 0x675a9f3d esp = 0x0012f318 ebp = 0x6170706c
6 xul.dll!nsCOMPtr_base::~nsCOMPtr_base() [nsCOMPtr.cpp:c6f51c76fb5d : 81 + 0x7] eip = 0x100b502e esp = 0x0012f334 ebp = 0x6170706c
7 xul.dll!nsPluginInstanceOwner::GetMode(nsPluginMode *) [nsObjectFrame.cpp:c6f51c76fb5d : 2362 + 0xf] eip = 0x10699db7 esp = 0x0012f33c ebp = 0x6170706c
8 xul.dll!nsNPAPIPluginInstance::InitializePlugin(nsIPluginInstancePeer *) [nsNPAPIPluginInstance.cpp:c6f51c76fb5d : 1030 + 0x4f] eip = 0x1070c805 esp = 0x0012f34c ebp = 0x6170706c
9 xul.dll + 0x9a49bb eip = 0x109a49bc esp = 0x0012f37c ebp = 0x6170706c
10 xul.dll + 0x9a49bb eip = 0x109a49bc esp = 0x0012f380 ebp = 0x6170706c
11 mozcrt19.dll!arena_bin_nonfull_run_get [jemalloc.c:c6f51c76fb5d : 3795 + 0x6] eip = 0x78139637 esp = 0x0012f38c ebp = 0x6170706c
12 xul.dll + 0x29bae5 eip = 0x1029bae6 esp = 0x0012f3c8 ebp = 0x0012f444
13 xul.dll + 0x29bcd7 eip = 0x1029bcd8 esp = 0x0012f44c ebp = 0x0012f740
14 xul.dll + 0x299af2 eip = 0x10299af3 esp = 0x0012f748 ebp = 0x0012f934
15 xul.dll + 0x29c526 eip = 0x1029c527 esp = 0x0012f93c ebp = 0x0012f994
16 xul.dll + 0x29c6bb eip = 0x1029c6bc esp = 0x0012f99c ebp = 0x0012ffb0
17 firefox.exe!_IsNonwritableInCurrentImage + 0xd eip = 0x004018e8 esp = 0x0012ffb8 ebp = 0x0012ffe0
18 kernel32.dll + 0x39ad7 eip = 0x7c839ad8 esp = 0x0012ffe8 ebp = 0xffffffff
19 kernel32.dll + 0x1707f eip = 0x7c817080 esp = 0x0012ffec ebp = 0xffffffff
20 firefox.exe!pre_c_init + 0x3 eip = 0x004015b0 esp = 0x0012fffc ebp = 0xffffffff
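For readers who haven't seen the patch (the attachment is deleted), the general idea of "trying harder" can be sketched roughly as follows. This is a Python illustration of stack scanning in general, not the actual Breakpad change: scan_for_return_address is a hypothetical name, the stack contents are made up, and the QuickTime image size is an assumption. Only the base address 0x66800000 follows from the WinDBG output (0x669c22d7 = QuickTime+0x1c22d7).

```python
import struct


def scan_for_return_address(stack_bytes, stack_base, esp, modules, word_size=4):
    """Scan upward from esp for the first word that points into a known module.

    Returns (slot_address, candidate_eip), or None if nothing plausible is found.
    """
    offset = esp - stack_base
    while 0 <= offset and offset + word_size <= len(stack_bytes):
        (value,) = struct.unpack_from("<I", stack_bytes, offset)
        if any(low <= value < high for low, high in modules):
            return stack_base + offset, value
        offset += word_size
    return None


QUICKTIME_BASE = 0x66800000  # implied by 0x669c22d7 = QuickTime+0x1c22d7
QUICKTIME_SIZE = 0x00400000  # assumption: the real image size is not in this bug
MODULES = [(QUICKTIME_BASE, QUICKTIME_BASE + QUICKTIME_SIZE)]

STACK_BASE = 0x0012F290      # made-up slice of stack memory for the demo
STACK_WORDS = [0x00000001, 0x0012F318, 0x00000000, 0x669C22D7, 0x0012F310]
STACK_BYTES = struct.pack("<5I", *STACK_WORDS)

hit = scan_for_return_address(STACK_BYTES, STACK_BASE, STACK_BASE, MODULES)
if hit is not None:
    slot, eip = hit
    # The caller's %esp is the address just past the return-address slot.
    print("eip = %#010x, caller esp = %#010x" % (eip, slot + 4))
```

A scan like this accepts any word that happens to look like a code address, so it can produce false positives.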
I implemented the TODO mentioned in this patch: http://people.mozilla.com/~tmielczarek/stackwalker-guess-harder.patch I don't have all the symbols for that version of Firefox handy though, so Jeff was going to test it.
Attached patch WIP including ted's stuff (deleted) — Splinter Review
(In reply to comment #8)
> I don't have all the symbols for that version of Firefox handy though, so Jeff
> was going to test it.
I couldn't get that patch to help. Haven't looked into why yet. One of the other things we can try to do is get a useful value into ebp. I tried this, but was getting worse results because of something that looks like a framepointer in QuickTime. I'm not sure how to fix that yet.
One of the big problems with the current patch is that once we are able to walk through the QuickTime stack using brute force, we're never able to use the more elegant ways of stackwalking. This is what's giving us a bunch of false positives. Ideally, we could get back on track and start using the FPO data again.
Yeah, that's a bummer. We just sort of stumble our way back into a random part of libxul and then we're off by just enough to get a crummy stack from there, right?
(In reply to comment #11)
> Yeah, that's a bummer. We just sort of stumble our way back into a random part
> of libxul and then we're off by just enough to get a crummy stack from there,
> right?
Yeah, I think so. It also looks like we don't have frame data for all the functions. For example, if you look at xul.sym, we have FPO data for nsCOMPtr_base::~nsCOMPtr_base() but not for nsPluginInstanceOwner::GetMode() or nsNPAPIPluginInstance::InitializePlugin(). Further, do you know why we get STACK_INFO_FRAME_DATA records for some code and STACK_INFO_FPO for other code?
dump_syms just dumps out whatever (underdocumented) data is in the PDB files using the DIA APIs: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/common/windows/pdb_source_line_writer.cc#272 You can see the STACK WIN lines in any Windows symbol file: http://symbols.mozilla.org/firefox/firefox.pdb/B9F3DF1DC69045E29B7E9877E67F99EC2/firefox.sym Some of them have program strings, some don't. I guess it just depends on what the compiler decided to do.
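As a rough illustration of what those STACK WIN lines contain, here is a Python sketch that splits one record into fields. It assumes the field order described in the Breakpad symbol file documentation (type, rva, code_size, prologue_size, epilogue_size, parameter_size, saved_register_size, local_size, max_stack_size, has_program_string, then either a program string or an allocates_base_pointer flag, with numbers in hex; type 4 is STACK_INFO_FRAME_DATA and type 0 is STACK_INFO_FPO). The sample lines are made up and are not taken from firefox.sym.

```python
from typing import NamedTuple, Optional


class StackWin(NamedTuple):
    frame_type: int                # 0 = STACK_INFO_FPO, 4 = STACK_INFO_FRAME_DATA
    rva: int
    code_size: int
    program_string: Optional[str]  # None when the record carries no program string


def parse_stack_win(line):
    """Parse one 'STACK WIN' record, or return None for any other line."""
    fields = line.strip().split(None, 12)
    if len(fields) < 13 or fields[0] != "STACK" or fields[1] != "WIN":
        return None
    has_program_string = fields[11] == "1"
    return StackWin(
        frame_type=int(fields[2], 16),
        rva=int(fields[3], 16),
        code_size=int(fields[4], 16),
        program_string=fields[12] if has_program_string else None,
    )


# Made-up sample records, one of each kind.
FRAME_DATA_LINE = "STACK WIN 4 2170 14 1 0 0 0 0 0 1 $eip 4 + ^ = $esp $ebp 8 + = $ebp $ebp ^ ="
FPO_LINE = "STACK WIN 0 1000 20 3 0 4 0 8 0 0 1"

for sample in (FRAME_DATA_LINE, FPO_LINE):
    print(parse_stack_win(sample))
```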
(In reply to comment #13)
> You can see the STACK WIN lines in any Windows symbol file:
> http://symbols.mozilla.org/firefox/firefox.pdb/B9F3DF1DC69045E29B7E9877E67F99EC2/firefox.sym
> Some of them have program strings, some don't. I guess it just depends on what
> the compiler decided to do.
But some have no entry at all, which means we can't really unwind the stack very well. Is it possible there's a bug someplace that's preventing us from getting entries for some code?
You can look at that pdb_source_line_writer code, it's a pretty straightforward application of the DIA APIs.
I had a look at the PDBs with dia2dump, and the ones that I downloaded from the symbol server have significantly fewer FPO entries than the one that I built myself (10098 lines of output vs. 160667 lines of output). Assuming the stack RVAs are in order, the stack RVAs in the downloaded ones only go up to about 0x29ccb0 instead of the 0x77fa4d you would expect. No idea what would cause this...
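The same kind of comparison can be run against the Breakpad .sym files rather than dia2dump output. A minimal Python sketch, assuming the STACK WIN field order from the Breakpad symbol file documentation (rva and code_size are the second and third values after "STACK WIN", in hex); the function name and the two sample line lists are hypothetical and only mimic the shape of the problem.

```python
def summarize_stack_win(lines):
    """Return (record_count, highest_covered_rva) for the STACK WIN lines given."""
    count = 0
    highest_end = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 5 or parts[0] != "STACK" or parts[1] != "WIN":
            continue
        rva = int(parts[3], 16)
        code_size = int(parts[4], 16)
        count += 1
        highest_end = max(highest_end, rva + code_size)
    return count, highest_end


# Made-up stand-ins for a symbol-server file and a locally built one.
DOWNLOADED = [
    "MODULE windows x86 0000 xul.pdb",
    "STACK WIN 4 1000 20 3 0 4 0 8 0 0 1",
    "STACK WIN 4 29cc90 20 3 0 4 0 8 0 0 1",
]
LOCAL = DOWNLOADED + [
    "STACK WIN 4 699d56 80 3 0 4 0 8 0 0 1",
    "STACK WIN 4 77fa2d 20 3 0 4 0 8 0 0 1",
]

for name, lines in (("downloaded", DOWNLOADED), ("local", LOCAL)):
    count, highest = summarize_stack_win(lines)
    print("%s: %d records, coverage ends at %#x" % (name, count, highest))
```

This does not rely on the records being sorted, since it tracks the maximum end address rather than reading the last line.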
Could the difference be related to the following linker flags? '-DEBUG -DEBUGTYPE:CV'
"On the command line, if /DEBUG is specified, the default type is /DEBUGTYPE:CV; if /DEBUG is not specified, /DEBUGTYPE is ignored." so perhaps not.
Should we split off a separate bug for the actual crash we identified here, involving QuickTime?
(In reply to comment #19)
> Should we split off a separate bug for the actual crash we identified here,
> involving QuickTime?
I did so as bug 520650.
I've also split off the missing stack unwind info bug as bug 520651.
(In reply to comment #19)
> Should we split off a separate bug for the actual crash we identified here,
> involving QuickTime?
Yes, we probably should.
Blocks: 520650
Depends on: 520651
Looks like a different problem, but also something is confused there: Version 3.7a1pre Branch 1.9.2 those do not match!
(In reply to comment #23)
> These crashes also aren't very useful:
> ...
> For example, XUL doesn't have symbols and I'm not sure why.
I have seen a similar example in bug 512810 where crash reports from nightly builds appear to be missing symbols for xul but the reports from 3.6a1 work (sort of).
(In reply to comment #24)
> Looks like a different problem, but also something is confused there:
> Version 3.7a1pre
> Branch 1.9.2
> those do not match!
Looking at a selection of crash reports for trunk, they all seem to have branch = 1.9.2
I filed that issue as bug 520852.
Here's the stack with my work in progress patch and a description of what's going on:
Thread 0 (crashed)
0 0x0 eip = 0x00000000 esp = 0x0012f29c ebp = 0x6170706c ebx = 0x00000000 esi = 0x0800b5fe edi = 0x000001f1 eax = 0x00040002 ecx = 0x07fa4ff0 edx = 0x07ff67e0 efl = 0x00010286 trust: none
1 QuickTime.qts + 0x1c22d6 eip = 0x669c22d7 esp = 0x0012f2a0 ebp = 0x6170706c trust: scan
2 QuickTime.qts + 0x98225 eip = 0x66898226 esp = 0x0012f2d0 ebp = 0x6170706c trust: scan
3 QuickTime.qts + 0x1c1f2f eip = 0x669c1f30 esp = 0x0012f2f0 ebp = 0x6170706c trust: scan
4 QuickTime.qts + 0x11b2af eip = 0x6691b2b0 esp = 0x0012f308 ebp = 0x6170706c trust: scan
5 QuickTimeWebHelper.qtx + 0x9f3c eip = 0x675a9f3d esp = 0x0012f318 ebp = 0x6170706c trust: scan
6 xul.dll!nsCOMPtr_base::~nsCOMPtr_base() [nsCOMPtr.cpp:c6f51c76fb5d : 81 + 0x7] eip = 0x100b502e esp = 0x0012f334 ebp = 0x6170706c trust: scan
* everything looks good to this point. The new 'scan' method lets us unwind through the QuickTime stack. Note: it looks like QuickTime has been compiled without a framepointer
7 xul.dll!nsPluginInstanceOwner::GetMode(nsPluginMode *) [nsObjectFrame.cpp:c6f51c76fb5d : 2362 + 0xf] eip = 0x10699db7 esp = 0x0012f33c ebp = 0x6170706c trust: cfi_scan
* In frame 7 we can use the unwind info from frame 6. I'm not exactly sure why we have to revert to scanning though.
8 xul.dll!nsNPAPIPluginInstance::InitializePlugin(nsIPluginInstancePeer *) [nsNPAPIPluginInstance.cpp:c6f51c76fb5d : 1030 + 0x4f] eip = 0x1070c805 esp = 0x0012f34c ebp = 0x6170706c trust: scan
* Frame 7 is missing unwind info, so we're forced back into the scan method
9 mozcrt19.dll!arena_bin_nonfull_run_get [jemalloc.c:c6f51c76fb5d : 3795 + 0x6] eip = 0x78139637 esp = 0x0012f38c ebp = 0x6170706c trust: scan
10 xul.dll + 0x29bae5 eip = 0x1029bae6 esp = 0x0012f3c8 ebp = 0x0012f444 trust: cfi_scan
* Frame 9 has unwind info so we try to use it again. Unfortunately, it also has something that looks like a frame pointer on the stack, so ebp gets set to that.
11 xul.dll + 0x29bcd7 eip = 0x1029bcd8 esp = 0x0012f44c ebp = 0x0012f740 trust: fp
* Since we now have a frame pointer (or something that looks like one) we use that. This brings us through the 0x29.... range of code. I'm not sure what's actually here because xul.sym does not have any symbols for this range. The frame pointer unwinder doesn't use AddressSeemsValid() so we don't have symbols for these frames.
12 xul.dll + 0x299af2 eip = 0x10299af3 esp = 0x0012f748 ebp = 0x0012f934 trust: fp
13 xul.dll + 0x29c526 eip = 0x1029c527 esp = 0x0012f93c ebp = 0x0012f994 trust: fp
14 xul.dll + 0x29c6bb eip = 0x1029c6bc esp = 0x0012f99c ebp = 0x0012ffb0 trust: fp
* More of the same.
15 firefox.exe!_IsNonwritableInCurrentImage + 0xd eip = 0x004018e8 esp = 0x0012ffb8 ebp = 0x0012ffe0 trust: fp
16 kernel32.dll + 0x39ad7 eip = 0x7c839ad8 esp = 0x0012ffe8 ebp = 0xffffffff trust: fp
17 kernel32.dll + 0x1707f eip = 0x7c817080 esp = 0x0012ffec ebp = 0xffffffff trust: scan
18 firefox.exe!pre_c_init + 0x3 eip = 0x004015b0 esp = 0x0012fffc ebp = 0xffffffff trust: scan
* I've not looked at these last few frames in detail.
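The "trust: fp" frames follow the classic x86 frame-pointer convention, which can be sketched in a few lines of Python. The function name is made up, and the two memory words are reconstructed from the register values printed for frames 10 and 11 in this annotated trace rather than read from the minidump, so treat them as an assumption.

```python
def unwind_with_frame_pointer(read_u32, ebp):
    """One frame-pointer step: recover the caller's (eip, esp, ebp) from %ebp.

    With a standard prologue, [ebp] holds the caller's saved %ebp and
    [ebp + 4] holds the return address.
    """
    caller_ebp = read_u32(ebp)
    caller_eip = read_u32(ebp + 4)
    caller_esp = ebp + 8
    return caller_eip, caller_esp, caller_ebp


# Stack words implied by the trace: frame 10 has ebp = 0x0012f444, and frame 11
# reports eip = 0x1029bcd8, ebp = 0x0012f740.
MEMORY = {
    0x0012F444: 0x0012F740,  # saved %ebp
    0x0012F448: 0x1029BCD8,  # return address
}

eip, esp, ebp = unwind_with_frame_pointer(MEMORY.__getitem__, 0x0012F444)
print("eip = %#010x esp = %#010x ebp = %#010x" % (eip, esp, ebp))
```

The recovered %esp is 0x0012f44c, which matches frame 11. The step never checks that the recovered %eip lands in known code, so a bogus-but-plausible %ebp is followed without question.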
I'm going to test this patch a little more. It gives us *something* for these crashes @0x0, which is better than the nothing we have now. I want to make sure that it's not going to be worse for other cases, though. Since we have that sampling of 24 hours of minidumps from our production system, I'm going to try running the patched and unpatched stackwalker against a bunch of them, and compare the results.
Assignee: jmuizelaar → ted.mielczarek
The biggest problem keeping us from getting a good stack here is that we don't have proper unwind info. Another problem is that we seem to assume that there is a frame pointer when we do the search of the stack when we have unwind info. Fixing either issue should fix this stack. Getting proper unwind info is the easiest from a correctness standpoint :)
Assignee: ted.mielczarek → jmuizelaar
Assignee: jmuizelaar → ted.mielczarek
Unfortunately that's also the part that's hardest to deal with, since we have to work with whatever Visual C++ is producing.
I ran this against a small set of crashes from the minidump collection, and it looks like it never makes things worse. You can see the minidump_stackwalk output (and diffs between old and new) here: http://people.mozilla.org/~tmielczarek/breakpad-stacks/ This makes sense, since we only really take this code path in a case where we would otherwise just give up and quit walking the stack.
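A comparison like this can be scripted along the following lines. This Python sketch assumes the old and new stackwalker output for a dump have already been captured as text; the function names, the sample strings, and the "old frames are a prefix of the new frames" criterion are assumptions made for illustration, not how the comparison above was actually done.

```python
import difflib


def diff_stacks(old_output, new_output):
    """Unified diff between unpatched and patched stackwalker output."""
    return list(
        difflib.unified_diff(
            old_output.splitlines(),
            new_output.splitlines(),
            fromfile="unpatched",
            tofile="patched",
            lineterm="",
        )
    )


def never_worse(old_output, new_output):
    """True if every old line is still present, in order, at the top of the new output."""
    old_lines = old_output.splitlines()
    new_lines = new_output.splitlines()
    return new_lines[: len(old_lines)] == old_lines


# Made-up abbreviated outputs for one @0x0 crash.
OLD = "0 0x0"
NEW = "0 0x0\n1 QuickTime.qts + 0x1c22d6"

print("\n".join(diff_stacks(OLD, NEW)))
print("never worse:", never_worse(OLD, NEW))
```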
Cleaned up the patch a little and submitted it for review upstream: http://breakpad.appspot.com/32003
Landed upstream: http://code.google.com/p/google-breakpad/source/detail?r=409 Filed bug 521231 on getting our production copy updated.
Depends on: 521231
Summary: Some crashes don't get unwound by the minidump processor usefully [@ @0x0] → Some crashes don't get unwound by the minidump processor usefully [@ @0x0] [@ @0x1]
New code is in production, I think we're done here. We can file a new bug if we find some other type of crash that the stack walker does a bad job on.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Here's a crash with a minimal completely useless stack: bp-2ab122c7-f0f1-457b-ae68-d6ff32090701 Can it be helped with this bug?
It's difficult to say without having access to a specific minidump. Note that the fix for this has been rolled out in production, so crash reports processed after 2009-10-14 (in the evening) will have this fix in effect.
Blocks: 522701
Whiteboard: [crashkill]
Component: General → Socorro
Product: Core → Webtools
QA Contact: general → socorro
Component: Socorro → General
Product: Webtools → Socorro