Open Bug 1685642 Opened 4 years ago Updated 1 year ago

Crash in [@ mozilla::SandboxFork::SandboxFork]

Categories

(Core :: Security: Process Sandboxing, defect, P1)

All
Linux
defect

Tracking

Tracking Status
firefox-esr78 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 --- wontfix

People

(Reporter: sefeng, Assigned: jld)

References

Details

(Keywords: crash, topcrash, Whiteboard: [not-a-fission-bug])

Crash Data

Attachments

(1 file, 2 obsolete files)

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/19f3b6a8-c652-4f46-bb4d-b7bb90210107

MOZ_CRASH Reason: MOZ_CRASH(socketpair failed)

Top 10 frames of crashing thread:

0 libxul.so mozilla::SandboxFork::SandboxFork security/sandbox/linux/launch/SandboxLaunch.cpp:409
1 libxul.so mozilla::SandboxLaunchPrepare security/sandbox/linux/launch/SandboxLaunch.cpp:353
2 libxul.so mozilla::ipc::GeckoChildProcessHost::AsyncLaunch ipc/glue/GeckoChildProcessHost.cpp:686
3 libxul.so mozilla::dom::ContentParent::BeginSubprocessLaunch dom/ipc/ContentParent.cpp:2414
4 libxul.so mozilla::dom::ContentParent::PreallocateProcess dom/ipc/ContentParent.cpp:658
5 libxul.so mozilla::PreallocatedProcessManagerImpl::AllocateNow dom/ipc/PreallocatedProcessManager.cpp:304
6 libxul.so mozilla::detail::RunnableMethodImpl<mozilla::PreallocatedProcessManagerImpl*, void  xpcom/threads/nsThreadUtils.h:1201
7 libxul.so mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:739
8 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
9 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87

We've had this assertion for a few years; however, we have been hitting it regularly since the middle of last year.

This link gives us all the crashes since the beginning of 2020; only 5 of them had Fission enabled, so this might not be Fission related. (I'm not sure when we added the Fission-enabled flag to crash reports.)

Component: DOM: Content Processes → Security: Process Sandboxing

Very low volume. We have some ideas and will reinvestigate if the volume rises (might be fd exhaustion).

Severity: -- → S4
Priority: -- → P5
Whiteboard: [not-a-fission-bug]

I can reproduce this with Fission enabled and a lot of tabs if, right after startup, I try to close all tabs by holding down Ctrl+W:

https://crash-stats.mozilla.org/report/index/541fc4ab-1e62-4ea7-9c68-ef3f90210416

It does look like FD exhaustion of some sort, because I get a similar crash, `MOZ_RELEASE_ASSERT(result.mFd.fd != -1) (DuplicateDescriptor failed)`, with the same STR:

https://crash-stats.mozilla.org/report/index/b68af5c4-5689-4442-9d51-9bff60210416

(In reply to Emilio Cobos Álvarez (:emilio) from comment #3)

Could this come from bug 1719391?

Attachment #9273717 - Attachment is obsolete: true
Attachment #9273708 - Attachment is obsolete: true

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Linux on release

:gcp, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(gpascutto)
Keywords: topcrash

Jed, this started spiking, can you have a look?

Assignee: nobody → jld
Severity: S4 → S3
Flags: needinfo?(gpascutto) → needinfo?(jld)
Priority: P5 → P1

This looks like file descriptor exhaustion, which tends to cause a lot of other problems besides this. In general we don't have good data on how much that happens (even crash reports might be an undercount, because I'm pretty sure parent process fd exhaustion can prevent being able to write out a minidump). Possibly we could increase our fd limit; the Debian/Ubuntu family as well as Fedora have fairly high hard limits, but RHEL could need some outreach. That's a larger issue than this particular bug, of course.
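For anyone following along, here is a minimal sketch of why fd exhaustion would surface as `socketpair failed`. It is written in Python rather than Gecko's C++ to keep it short, and it is not Firefox code: the function name `exhaust_fds_with_socketpairs` and the artificially low limit of 64 are made up for illustration. It lowers the soft limit for the current process, creates socket pairs until the kernel refuses, and reports the errno.

```python
import errno
import resource
import socket


def exhaust_fds_with_socketpairs(soft_limit=64):
    """Create socket pairs under a lowered fd soft limit until one fails.

    Returns (pairs_created, errno_of_failure). The original limit is
    restored and every socket is closed before returning.
    Hypothetical helper for illustration only.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Only ever lower the soft limit here; never exceed the current one.
    if old_soft == resource.RLIM_INFINITY:
        new_soft = soft_limit
    else:
        new_soft = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, old_hard))

    pairs = []
    failure = None
    try:
        while True:
            try:
                # Each successful call consumes two descriptors.
                pairs.append(socket.socketpair())
            except OSError as e:
                failure = e.errno
                break
    finally:
        for a, b in pairs:
            a.close()
            b.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return len(pairs), failure


if __name__ == "__main__":
    count, err = exhaust_fds_with_socketpairs()
    print(count, errno.errorcode.get(err, err))
```

On Linux the reported errno should be `EMFILE`, meaning the per-process limit was reached. A descriptor-duplicating call would be expected to fail the same way once the table is full, which fits the `DuplicateDescriptor failed` crash seen with the same STR.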

Flags: needinfo?(jld)
Attached image per_major_version.png (deleted)

I think there are several indicators here suggesting that this may be a file descriptor leak introduced in 116:

  • Most crashes are on Ubuntu, which according to comment 26 has a high hard limit.
  • Some user comments mention that they did not have this issue before updating and that now it is recurrent (e.g. here).
  • Graphs per version suggest 116 was a turning point and that this is likely to continue in 117 (see attachment).

There could be some interesting hints in the correlations, where e.g. the following modules have "100% in signature", which seems unusual: libpixbufloader-svg.so, libdrm_radeon.so.1, libdrm_nouveau.so.2.

To clarify the resource limit terminology: the soft limit (rlim_cur) is the limit that's applied when using the resource in question; the hard limit (rlim_max) is how high an unprivileged process can increase the soft limit. We raise the fd soft limit to 4096 or the hard limit, whichever is lower; so, right now we have 4096 fds per process whether it's Red Hat or Ubuntu or anything else, but in principle we could change this number and it should work on every(?) major distro other than Red Hat (where the hard limit is 4096).
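A sketch of the soft/hard limit logic described above, in Python for brevity (Firefox does this in C++, and this is not that code). `raised_soft_limit` and `raise_fd_soft_limit` are hypothetical names, the 4096 target is the figure from this comment, and the choice never to lower a soft limit that is already higher is my assumption.

```python
import resource


def raised_soft_limit(soft, hard, target=4096):
    """Return the soft limit to request: target or hard, whichever is lower.

    Assumption: a soft limit that is already at or above the result is
    left unchanged rather than lowered.
    """
    if hard != resource.RLIM_INFINITY and hard < target:
        target = hard  # an unprivileged process cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    return target


def raise_fd_soft_limit(target=4096):
    """Apply raised_soft_limit to this process and return the new limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = raised_soft_limit(soft, hard, target)
    if new_soft != soft:
        # The hard limit is passed through unchanged.
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_fd_soft_limit())
```

With a hard limit of 4096 (the Red Hat case mentioned here), any `target` above 4096 is clamped back to 4096, which is why changing the number would not help there without raising the hard limit.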

If we're seeing a lot of crashes on Ubuntu, it's probably because a lot of Linux users are on Ubuntu.

As for the correlations, note that this is specific to Linux and to the parent process, but the “overall” numbers are all OSes and all process types, so the presence of common Linux libraries maybe isn't too informative.

I was wondering if the memfd:xshmfence correlation might be a clue, but it's marked “96.15% vs 65.70% if platform_pretty_version = Ubuntu 22.04.3 LTS” which might mean nothing more than that it's only seen in the parent process, which is expected.

There are a lot of graphics errors about unexpected remote texture size: Size(0,0) and similar, but that might be a side-effect of fd exhaustion.

I don't have any great ideas here (at least nothing that can be implemented quickly), and this doesn't seem to happen in significant volumes except on release, so if we wanted to do something like increase the limit and see what happens we'd need to wait for a release cycle.
