Closed Bug 1793525 Opened 2 years ago Closed 2 years ago

Obtain crash reports from hanging child process on NS_FREE_PERMANENT_DATA builds

Categories

(Core :: IPC, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
110 Branch
Tracking Status
firefox110 --- fixed

People

(Reporter: jld, Assigned: jld)

References

(Blocks 2 open bugs, Regressed 1 open bug)

Details

Attachments

(1 file)

This is something I've mentioned in the comments of bug 1769821 but it should be its own item. On builds that define NS_FREE_PERMANENT_DATA (debug for refcount logging, sanitizers, code coverage), if still we're waiting for child processes via ProcessWatcher during parent process shutdown, we wait indefinitely — the process is expected to write some kind of log, so if we imposed a timeout then we'd break the test run anyway.

But, while this is late in shutdown, the crash reporter is (and this may be a relatively recent change) still functional, including for child processes; on Linux, we can send SIGABRT and get a crash dump (except when bug 1793523 interferes). This should also work on Mac but I haven't tested it yet.

I have more-or-less working patches and this did catch an issue, bug 1792203, where the child process wasn't completely stuck but was waiting for longer than the arbitrary delay I'd picked. In that case the crash report didn't help much, because it was waiting for a timer, but it did give us an idea of how late in shutdown it was and what to think about. In the case of a classical deadlock, the stacks from a crash report would be immediately useful.

Windows is more complicated, because we're using a Windows API to generate minidumps and I don't know if there's a convenient way to cause a crash given a process handle the way we can do on Unix, or if we'd need to integrate more closely with the crash reporter. The code for it would be separate, so that can be a followup bug.

This appears to be doable on Windows without too much difficulty: we can use CreateRemoteThread to cause the other process to crash. This is easiest if we know the process is using Gecko (has the same xul.dll loaded as the parent) so that we can include a function to do the crash and just use its address from the parent process; we already do this in the crash reporter.

Currently, the only uses of ProcessWatcher are in GeckoChildProcessHost and ContentParent, but in theory it's a general-purpose facility and could be applied to any child process in its current form, so I think I'll make the callers explicitly opt in.

There's one problem I'm currently having: if I Sleep a few seconds to avoid false positives, the way I'd been doing on Unix, I can see the injected MOZ_CRASH working but there's no crash report. This is, hopefully, fixable.

Summary: Obtain crash reports from hanging child process on NS_FREE_PERMANENT_DATA builds on Unix → Obtain crash reports from hanging child process on NS_FREE_PERMANENT_DATA builds

On the build types where child processes need to go through full
shutdown to write some kind of logs or data (debug / refcount logging,
sanitizers, code coverage), the parent process currently waits forever
for them to exit and blocks its own shutdown to do so. If we were to
time out and kill the child processes, test runs would fail due to
missing the log files they were expected to write; instead, they fail
and blame the parent process for not exiting, which is not especially
helpful.

What we'd like is to know the state of the child process that's failing
to exit. This patch attempts to get a crash report in that situation; on
Unix we send SIGABRT, and on Windows we inject a thread that calls
DbgBreakPoint, both of which result in invoking our crash reporter.

Blocks: 1791335

Hi Jed, will you be able to address the review comments on Phabricator? The number of failures increased since last week. Thanks.

Status: NEW → ASSIGNED
Flags: needinfo?(jld)

This is actively being worked on.

Severity: -- → S2
Flags: needinfo?(jld)
Priority: -- → P1

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #3)

Hi Jed, will you be able to address the review comments on Phabricator? The number of failures increased since last week. Thanks.

The Windows implementation turns out to have a problem — we get a minidump, but the child process gets stuck and doesn't exit. This appears to be while it's trying to write crash annotations, which won't work because the parent process already closed the pipe it would have read them from, but the child process ought to just fail in that case instead of blocking forever. (In theory this could also happen with a real crash; if it did, on CI, you'd see both the actual child process crash as well as an nsTerminator crash of the parent process.)

I might split off the Unix part and land that separately if I can't come up with a quick fix for Windows.

And there's a new problem: this test failure, which seems to be permafail with my patches; it looks like we're leaking a reference (?) to the SharedThreadPool instance named MediaPDecoder, and only when WMF is used so probably from here. I'll file a bug.

Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5f3b763b270b Get crash reports for child processes shutdown hangs on debug/sanitizer/ccov builds. r=nika,yjuglaret
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 110 Branch
Regressions: 1805761
Regressions: 1805903
Regressions: 1806224
Regressions: 1806274
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: