Closed Bug 1799225 Opened 2 years ago Closed 2 years ago

Crash in [@ RtlpAllocateHeap | RtlpAllocateHeapInternal | AllocMemory]

Categories

(Toolkit :: Crash Reporting, defect)

x86_64
Windows 11
Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox108 --- wontfix
firefox109 --- wontfix

People

(Reporter: ash153311, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [tbird crash] [win:stability])

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/72786a95-85d3-48bd-8298-f98cb0221104

Reason: STATUS_HEAP_CORRUPTION

Top 10 frames of crashing thread:

0  ntdll.dll  RtlReportFatalFailure  
1  ntdll.dll  RtlReportCriticalFailure  
2  ntdll.dll  RtlpHeapHandleError  
3  ntdll.dll  RtlpHpHeapHandleError  
4  ntdll.dll  RtlpLogHeapFailure  
5  ntdll.dll  RtlpAllocateHeap  
6  ntdll.dll  RtlpAllocateHeapInternal  
7  dbgcore.dll  AllocMemory  
8  dbgcore.dll  GenAllocateProcessObject  
9  dbgcore.dll  GenGetProcessInfo  
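
The signature in this bug's title can be read off the frames above: the heap error-reporting frames are skipped, the allocator frames are kept as a prefix, and the first frame outside ntdll.dll ends the signature. The sketch below only illustrates that idea. The `SKIP` and `PREFIX` sets and the `signature` helper are hypothetical, chosen to reproduce this one title, and are not the crash-stats signature generator's real rules.

```python
# Frames of the crashing thread as (module, function), copied from the report.
FRAMES = [
    ("ntdll.dll", "RtlReportFatalFailure"),
    ("ntdll.dll", "RtlReportCriticalFailure"),
    ("ntdll.dll", "RtlpHeapHandleError"),
    ("ntdll.dll", "RtlpHpHeapHandleError"),
    ("ntdll.dll", "RtlpLogHeapFailure"),
    ("ntdll.dll", "RtlpAllocateHeap"),
    ("ntdll.dll", "RtlpAllocateHeapInternal"),
    ("dbgcore.dll", "AllocMemory"),
    ("dbgcore.dll", "GenAllocateProcessObject"),
    ("dbgcore.dll", "GenGetProcessInfo"),
]

# ASSUMPTION: illustrative rule sets, not the real signature configuration.
SKIP = {
    "RtlReportFatalFailure",
    "RtlReportCriticalFailure",
    "RtlpHeapHandleError",
    "RtlpHpHeapHandleError",
    "RtlpLogHeapFailure",
}
PREFIX = {"RtlpAllocateHeap", "RtlpAllocateHeapInternal"}


def signature(frames, skip=SKIP, prefix=PREFIX):
    """Join frame names, dropping skipped ones and stopping after the
    first frame that is not a prefix frame."""
    parts = []
    for _module, function in frames:
        if function in skip:
            continue
        parts.append(function)
        if function not in prefix:
            break
    return " | ".join(parts)


print(signature(FRAMES))
```

The point of the illustration is that the signature names the allocation that tripped the heap check, not the code that corrupted the heap.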
Version: Firefox 106 → Other Branch

The stacks I looked at involved the crash reporter. I'm not sure what the deal is though.

Component: General → Crash Reporting
Product: Core → Toolkit
Version: Other Branch → unspecified

The bug is marked as tracked for firefox108 (nightly). We have limited time to fix this; the soft freeze is today. However, the bug still isn't assigned.

:gcp, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(gpascutto)

Last Error Value: ERROR_INSUFFICIENT_BUFFER

Is this missing error checking in Breakpad? Not sure why this spiked though. It's already with the right triage owner.

Flags: needinfo?(gpascutto)

The severity field is not set for this bug.
:gsvelto, could you have a look please?

Flags: needinfo?(gsvelto)

This looks like it could be a bug in Microsoft libraries, as the crash is coming from deep within MiniDumpWriteDump(). It could also be that something changed in Windows 11 and we need to pass information differently into that function. Most of the crashes come from Windows 11 version 10.0.25236, which is a dev-channel build.

Severity: -- → S3
Flags: needinfo?(gsvelto)
Whiteboard: [tbird crash]

(Edited, my previous analysis was wrong)

The crashes we have here are main process crashes that occur while an out-of-process crash dump is being generated for a child process. So the situation is as follows:

  • a child process crashes and requests an out-of-process crash dump;
  • the parent process calls into Microsoft code to dump the child process;
  • but the parent process's default heap is either already corrupt or gets corrupted by Microsoft's code, and the corruption is detected;
  • so we end up doing an in-process crash dump of the parent process and reporting it.

As [:gsvelto] mentioned, the crashes come almost exclusively from Windows 11 23H2 Insider Preview builds, starting with build 10.0.25236. The beginning of the crash spike matches the release of build 10.0.25236 on November 2nd, 2022. In addition to the possibilities already mentioned by [:gsvelto], I would like to suggest that Microsoft could have added new ways to detect heap corruption, so that a corruption that usually went undetected is more likely to be caught starting with build 10.0.25236. In that case, the heap corruption we detect here would not necessarily originate from Microsoft code: it could have occurred earlier and merely be detected here.

Moving the code that is responsible for dumping other processes (including the main process) to a fully dedicated process should allow us to discriminate between the two possibilities here:

  • if this is a real bug within our crash dump code or Microsoft's MiniDumpWriteDump, we would see the same crash reports as currently;
  • if this crash is just the consequence of simultaneous corruptions affecting the child and main processes, this specific crash should disappear, and we should manage to generate crash dumps for all corruptions.

When we implement the dedicated crash dumping process, we should have it block injection of all third-party DLLs to limit the risk of simultaneous corruptions affecting all our processes, including the crash dumping one. An example scenario would be a third-party application injecting its DLL into all our processes and sending all its in-process clients a message that results in a heap corruption.

Has STR: --- → no
Whiteboard: [tbird crash] → [tbird crash] [win:stability]

As :gcp noted in comment 4, all the crashes where the last error value is accessible have it set to ERROR_INSUFFICIENT_BUFFER. It might be that we're hitting a limit somewhere inside dbgcore.dll which sends it down an error path that's poorly tested or just hard to recover from. I also wonder if Windows preview builds might work differently from regular ones, for example by having more internal assertions turned on, as we do on nightly/beta. So a harmless error in release becomes a hard error in a preview.

Talking about error codes made me realize I didn't check what kind of heap failure gets reported. By looking at a few dumps, it appears to always be a heap_failure_entry_corruption, where the other possibilities are as follows:

ntdll!_HEAP_FAILURE_TYPE
   heap_failure_internal = 0n0
   heap_failure_unknown = 0n1
   heap_failure_generic = 0n2
   heap_failure_entry_corruption = 0n3
   heap_failure_multiple_entries_corruption = 0n4
   heap_failure_virtual_block_corruption = 0n5
   heap_failure_buffer_overrun = 0n6
   heap_failure_buffer_underrun = 0n7
   heap_failure_block_not_busy = 0n8
   heap_failure_invalid_argument = 0n9
   heap_failure_invalid_allocation_type = 0n10
   heap_failure_usage_after_free = 0n11
   heap_failure_cross_heap_operation = 0n12
   heap_failure_freelists_corruption = 0n13
   heap_failure_listentry_corruption = 0n14
   heap_failure_lfh_bitmap_mismatch = 0n15
   heap_failure_segment_lfh_bitmap_corruption = 0n16
   heap_failure_segment_lfh_double_free = 0n17
   heap_failure_vs_subsegment_corruption = 0n18
   heap_failure_null_heap = 0n19
   heap_failure_allocation_limit = 0n20
   heap_failure_commit_limit = 0n21
   heap_failure_invalid_va_mgr_query = 0n22
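
For anyone decoding the failure type from more dumps, the listing above can be turned into a small lookup. This is a minimal sketch: the member names and values are copied from the debugger output above, while `HeapFailureType` and `describe` are hypothetical helper names introduced here for illustration.

```python
from enum import IntEnum

# Names in the order of the ntdll!_HEAP_FAILURE_TYPE listing, values 0..22.
NAMES = """
heap_failure_internal heap_failure_unknown heap_failure_generic
heap_failure_entry_corruption heap_failure_multiple_entries_corruption
heap_failure_virtual_block_corruption heap_failure_buffer_overrun
heap_failure_buffer_underrun heap_failure_block_not_busy
heap_failure_invalid_argument heap_failure_invalid_allocation_type
heap_failure_usage_after_free heap_failure_cross_heap_operation
heap_failure_freelists_corruption heap_failure_listentry_corruption
heap_failure_lfh_bitmap_mismatch heap_failure_segment_lfh_bitmap_corruption
heap_failure_segment_lfh_double_free heap_failure_vs_subsegment_corruption
heap_failure_null_heap heap_failure_allocation_limit
heap_failure_commit_limit heap_failure_invalid_va_mgr_query
""".split()

# Hypothetical helper type: values are assigned sequentially from 0.
HeapFailureType = IntEnum("HeapFailureType", NAMES, start=0)


def describe(code):
    """Return the failure type name for a numeric code read from a dump."""
    try:
        return HeapFailureType(code).name
    except ValueError:
        return f"unknown ({code})"


print(describe(3))   # the type seen in the dumps examined for this bug
print(describe(20))
```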

More specifically, RtlpAllocateHeapInternal has paths to report heap_failure_allocation_limit, and RtlpAllocateHeap has paths to report heap_failure_entry_corruption or heap_failure_freelists_corruption. But in our case it seems to always be heap_failure_entry_corruption.

The crash volume suggests that this crash has been fixed in Windows 11 Insider Preview build 25309 (announced in March). We have received no reports for this build or any later one; the last reports are from build 25300:

Windows version   Count         %
10.0.19043            1    0.15 %
10.0.19045            3    0.44 %
10.0.25236            6    0.87 %
10.0.25247            6    0.87 %
10.0.25252           50    7.27 %
10.0.25262           51    7.41 %
10.0.25267          152   22.09 %
10.0.25272           54    7.85 %
10.0.25276           36    5.23 %
10.0.25281           38    5.52 %
10.0.25284           54    7.85 %
10.0.25290           67    9.74 %
10.0.25295           35    5.09 %
10.0.25300          135   19.62 %
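
The figures in the table above can be cross-checked with a few lines of Python. This is only a sketch over the counts copied from the table; `FIRST_AFFECTED` is the build at which the spike started according to the earlier comments, and the helper name `build` is an assumption made for this example.

```python
# (version, crash count) pairs copied from the table above.
TABLE = """\
10.0.19043 1
10.0.19045 3
10.0.25236 6
10.0.25247 6
10.0.25252 50
10.0.25262 51
10.0.25267 152
10.0.25272 54
10.0.25276 36
10.0.25281 38
10.0.25284 54
10.0.25290 67
10.0.25295 35
10.0.25300 135
"""

rows = [(v, int(c)) for v, c in (line.split() for line in TABLE.splitlines())]


def build(version):
    """Extract the build number, e.g. '10.0.25236' -> 25236."""
    return int(version.split(".")[2])


FIRST_AFFECTED = 25236  # first build of the spike, per the comments above

total = sum(count for _, count in rows)
affected = sum(count for v, count in rows if build(v) >= FIRST_AFFECTED)
last_seen = max(build(v) for v, _ in rows)

print(f"total reports: {total}")
print(f"from build {FIRST_AFFECTED} onward: {affected} ({100 * affected / total:.2f} %)")
print(f"last build with reports: {last_seen}")
```

With these counts, all but 4 of the 688 reports come from build 25236 or later, and no build after 25300 appears.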

This could relate to the following entry: "Fixed an underlying issue which was leading to Microsoft Edge crashes for some Insiders in the last few flights."

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME