Closed Bug 1739924 Opened 3 years ago Closed 3 years ago

widget.dmabuf-webgl.enabled: All tabs start and keep crashing after a while / Crash in [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap]

Categories

(Core :: IPC, defect, P1)

Firefox 94
x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox-esr91 95+ fixed
firefox94 + fixed
firefox95 + fixed
firefox96 + fixed

People

(Reporter: schreibemirhalt, Assigned: aosmond)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(4 files)

Out of nowhere, tabs crash after a while. All tabs crash at the same time, except for some (especially tabs that play a video, it seems); those crash after I switch away from them and back, or when I try to open a new site in them. It also affects all new tabs. The browser becomes unusable for surfing until it is closed or killed and reopened. The behaviour happens at random: sometimes after half an hour, sometimes after multiple hours. It started right after I updated from Firefox 93 to Firefox 94 on Arch Linux. I have more than enough system resources; neither RAM nor swap was full.

Here is some data that was automatically entered into this form:

Crash report: https://crash-stats.mozilla.org/report/index/007b5ede-76b7-458e-bd9a-2a1580211107

MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(result.isOk())

Top 10 frames of crashing thread:

0 libxul.so mozilla::dom::ipc::SharedStringMap::SharedStringMap /build/firefox/src/firefox-94.0.1/dom/ipc/SharedStringMap.cpp:33
1 libxul.so  /build/firefox/src/firefox-94.0.1/intl/strres/nsStringBundle.cpp:489
2 libxul.so  /build/firefox/src/firefox-94.0.1/intl/strres/nsStringBundle.cpp:581
3 libxul.so nsStringBundleBase::GetStringFromName /build/firefox/src/firefox-94.0.1/intl/strres/nsStringBundle.cpp:569
4 libxul.so mozilla::CubebUtils::InitBrandName /build/firefox/src/firefox-94.0.1/dom/media/CubebUtils.cpp:382
5 libxul.so mozilla::detail::RunnableFunction<void  /build/firefox/src/firefox-94.0.1/xpcom/threads/nsThreadUtils.h:531
6 libxul.so mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal /build/firefox/src/firefox-94.0.1/xpcom/threads/TaskController.cpp:770
7 libxul.so nsThread::ProcessNextEvent /build/firefox/src/firefox-94.0.1/xpcom/threads/nsThread.cpp:1148
8 libxul.so mozilla::ipc::MessagePump::Run /build/firefox/src/firefox-94.0.1/ipc/glue/MessagePump.cpp:85
9 libxul.so MessageLoop::Run /build/firefox/src/firefox-94.0.1/ipc/chromium/src/base/message_loop.cc:306

The Bugbug bot thinks this bug should belong to the 'Core::Audio/Video: Playback' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: General → Audio/Video: Playback
Product: Firefox → Core
Component: Audio/Video: Playback → General
Hardware: Desktop → x86_64

To make it clear: all browser tabs crash.
Not only is it really annoying, since I use over a thousand tabs and killing the browser and then waiting a few minutes until all tabs reopen after I restore the session adds up... but attending online meetings while this bug exists can also get on the nerves of others if I just "quit" because the tab with the meeting crashed together with all the other open tabs.

Summary: All tabs start and keep crashing a while / Crash in [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap] → All tabs start and keep crashing after a while / Crash in [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap]

To add further information: new "New Tab" pages are not directly affected, and new about: pages are also not affected. It could be that those do not crash if they are open when all the other tabs crash. But when I try to open any internet address in any tab, whether new or already existing, the tab crashes.

Component: General → IPC

This sounds bad, and it looks like this crash has started happening for a lot of people.

It is interesting to see that you don't seem to have Fission enabled in the crash in comment 0, so it can't just be that you have a zillion processes. Looking at the crashes in 94, only 10% of them had Fission enabled, so presumably it is unrelated.

This does feel like some kind of resource exhaustion, but I'm not sure what.

I'm marking this as a regression. We've seen this before, but it seems like it is happening more often in 94.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression

[Tracking Requested - why for this release]: Seems like a pretty serious regression in Linux stability. There are a remarkable number of comments on these crash reports, more than I've seen on crashes in a long time, and lots of them are talking about how Firefox started crashing very frequently for them, in the last week or so.

gcp, is there somebody on your team who might be able to look into this? This has been happening for a while, but it looks like it got a lot worse in 94 on Linux for some reason. Something is going wrong with some kind of memory-mapping or file-handle kind of thing on Linux and I'm afraid I'm not really familiar with what might be going wrong there. Thanks.

Flags: needinfo?(gpascutto)

If we could figure out a way to reproduce this, running mozregression to figure out what regressed it would be the best approach, but I don't know how easy that will be.

Chris, I recall you saying somewhere that there was a big spike in crash pings where the crash reason is "MOZ_RELEASE_ASSERT(result.isOk())" on Linux with Fission enabled? I wonder if that's the same issue. Maybe with Fission it manifests in some kind of non-user visible way so we don't show the crash reporter and they tend to show up more as pings than crash reports on crash stats?

Flags: needinfo?(cpeterson)

Bug 1715975 comment 2 says "The exception seems to be some of the Ubuntu cases, which are kind of weird." and it does look like 64% of these crashes (across all versions) are on Ubuntu 20.04.3 LTS, but maybe that's a very common distro or something.

Zibi, were there any big changes to localization or string bundles in 94? These crashes look string bundle related, but we don't have any real explanation why the crashes would start spiking up in 94 on Linux. Thanks.

Flags: needinfo?(zbraniecki)
Flags: needinfo?(cpeterson)
Keywords: crash

No, not really. We haven't touched StringBundle in years, as it is on the path to deprecation.

The last meaningful change was :kmag's shared memory buffer in 2018.

Flags: needinfo?(zbraniecki)
Flags: needinfo?(gpascutto)

Bug 1514734 claims this is likely OOM (specifically, that it might be a shmem OOM)

Something is going wrong with some kind of memory-mapping or file-handle kind of thing on Linux

I'm unclear where that conclusion comes from (and hence why this is in IPC - at least that would make sense if it were shmem!).

See also this analysis about the OOM theory: https://bugzilla.mozilla.org/show_bug.cgi?id=1715975#c3

I have more than enough system resources; neither RAM nor swap was full.

When this happens, how does /dev/shmem look? i.e.

morbo@mozwell:~$ df -h
Filesystem              Size  Used Avail Use% Mounted on
tmpfs                    16G  5.0M   16G   1% /dev/shm

That said, shouldn't memfd_create be immune to this? Last changes to shmem there that I remember are bug 1672085 but that was 93, not 94.
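A minimal sketch of how to check that, assuming memfd-backed descriptors show up in /proc/<pid>/fd as "/memfd:<name> (deleted)" (the usual kernel naming), so they never appear in, or count against, the /dev/shm mount that df reports on:

# count memfd-backed fds per Firefox process
for pid in $(pgrep -f firefox); do
  n=$(ls -l "/proc/$pid/fd" 2>/dev/null | grep -c 'memfd:')
  echo "pid $pid: $n memfd fds"
done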

Clicking a few reports show those systems are indeed critically low on swap:
https://crash-stats.mozilla.org/report/index/af11bd3b-8f23-4c94-94ae-386af0211117
https://crash-stats.mozilla.org/report/index/045b7608-b3ac-4b1b-90b8-65f900211117
https://crash-stats.mozilla.org/report/index/08cc781b-038d-4411-9c85-8388d0211117
https://crash-stats.mozilla.org/report/index/24654a82-5aba-450f-aac6-dd18b0211117

But as to why this is spiking, no idea.

Flags: needinfo?(schreibemirhalt)

Are we hitting file descriptor limits?

Yep, I never had Fission enabled and I still don't. Also, the only thing I updated at the time was Firefox: I had not updated my system or installed anything for over a week and had one or two restarts; then I updated Firefox, and a few hours after installing the new version the tab crashes started.
Regarding OOM: I have 64GB of RAM. I never saw a spike in my task manager (xfce4-taskmanager) when the issue appeared. Also, the browser itself doesn't crash (well, at least most of the time); only the tabs crash. And if I try to reload one of the crashed tabs, it crashes again (again, at least most of the time; for some reason some tabs can be reloaded after a few minutes, but not in every case: e.g. I was able to reload a Twitter tab at one time and at another time it did not work, but as soon as I click "reload all tabs" on another crashed tab, those crash again). It also doesn't seem to matter how many tabs I have open. If I recall correctly, it happened at least once before I restored my session, so I had only a few tabs open instead of over a thousand.
I suspected some of my add-ons, but some of the crash reports from other people barely contain any add-ons, and not even the ones that I use.
If you want me to try something, just let me know. I could also come to a Matrix room if you want.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #14)

I have more than enough system resources; neither RAM nor swap was full.

When this happens, how does /dev/shmem look? i.e.

I will send that when tabs crash again.

Flags: needinfo?(schreibemirhalt)

Interesting output to check when this problem is happening:

ulimit -a -H
ulimit -a
pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; lsof -p {} | wc -l'
pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; pmap -p {} | wc -l'

Regarding OOM: I have 64GB of RAM.

Your distro/Linux config might limit some system resources to really low values (don't ask me why they do this!) so unfortunately that doesn't mean much. Also, a bug in a graphics driver that we trigger (for example!) can easily use up all memory. So we still have to account for this possibility.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #12)

Something is going wrong with some kind of memory-mapping or file-handle kind of thing on Linux

I'm unclear where that conclusion comes from (and hence why this is in IPC - at least that would make sense if it were shmem!).

This was based on the fact that mMap.initWithHandle() is failing. It looks like the reasons that can fail are: the file descriptor passed in was invalid, PR_ImportFile failed, PR_GetOpenFileInfo64 returned an overly large value, PR_CreateFileMap failed, or PR_MemMap failed.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #14)

Clicking a few reports show those systems are indeed critically low on swap:

That's a good point. I thought I'd checked that, but maybe I was looking at total page file instead of available page file. There are a total of 1671 Linux crashes on either 94.0 or 94.0.1. Of those, 714 have 0 available page file, and 925 of the crashes have > 200 million bytes of available page file. So it seems like a lack of page file might be the explanation for many of these crashes, but not even really the majority of them. Although I don't know whether the available page file value is measured when the crash happens, or if it can be stale.

Flags: needinfo?(kmaglione+bmo)

(In reply to Andrew McCreight [:mccr8] from comment #8)

Chris, I recall you saying somewhere that there was a big spike in crash pings where the crash reason is "MOZ_RELEASE_ASSERT(result.isOk())" on Linux with Fission enabled? I wonder if that's the same issue. Maybe with Fission it manifests in some kind of non-user visible way so we don't show the crash reporter and they tend to show up more as pings than crash reports on crash stats?

In Matrix, regarding these crashes, cpeterson said: "The increase I saw in Linux content process crash pings I see in 94 are definitely worse with Fission. They're related to image threads failing to start, so probably not related to the SharedStringMap MOZ_RELEASE_ASSERTs, unless they are both failing due to some system limits on threads or processes?"

Flags: needinfo?(kmaglione+bmo)

How many distinct systems are there? It didn't immediately make sense to me that 0 page file with free physical RAM would cause this to fail, as Linux does memory overcommit. But then looking closer shows that all the reports I was looking at were from the same machine.
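A minimal sketch for confirming the overcommit policy and commit accounting on an affected system (standard procfs files, nothing Firefox-specific); the last two commands show swap state, which, as far as I understand, is roughly what the "available page file" annotation reflects on Linux:

# 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# under strict accounting, allocations fail once Committed_AS reaches CommitLimit
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
# swap devices and current usage
swapon --show
free -h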

Is it possible this looks huge because it's an immediate startup crash per tab? Thus, every browser instance that hits this will send 4-8 or more reports?

For the FD exhaustion theory, this bug https://bugzilla.mozilla.org/show_bug.cgi?id=1719140 also "spiked" but in much lower numbers.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)

How many distinct systems are there? It didn't immediately make sense to me that 0 page file with free physical RAM would cause this to fail, as Linux does memory overcommit. But then looking closer shows that all the reports I was looking at were from the same machine.

Good point. I faceted the crashes by install time. The top installation has 129 crashes, then there are 6 install times with 25 to 49 crashes, then there are another dozen install times with around 6 to 16 crashes. Then lots of people with 3 or less.

So yes: because this seems to be a crash in content processes, where a user can keep trying to open new tabs and failing, there is some kind of constant multiplier on the crashes.

A major change for Linux in Firefox 94 was https://mozillagfx.wordpress.com/2021/10/30/switching-the-linux-graphics-stack-from-glx-to-egl/

Looking at correlations:
(77.18% in signature vs 01.26% overall) adapter_driver_version = 21.0.3.0 [78.90% vs 40.96% if platform = Linux]

Note above that >=21 is required for it to be active. I'm not sure how to interpret that correlation (i.e. if platform is limited, are there still a ton of crashes on other drivers?)

(In reply to Andrew McCreight [:mccr8] from comment #27)

So yes: because this seems to be a crash in content processes, where a user can keep trying to open new tabs and failing, there is some kind of constant multiplier on the crashes.

When I press the button to reload a tab and it crashes again, another report appears in about:crashes that can be submitted. It seems that when multiple tabs crash at the same time, only one report is created.

(In reply to schreibemirhalt from comment #29)

When I press the button to reload a tab and it crashes again, another report appears in about:crashes that can be submitted. It seems that when multiple tabs crash at the same time, only one report is created.

Yeah, multiple web pages can be in the same process. Without Fission, there are no more than 8 content processes.

(In reply to Andrew McCreight [:mccr8] from comment #30)

Yeah, multiple web pages can be in the same process. Without Fission, there are no more than 8 content processes.
Ah, okay. I'd also like to add that the count column there doesn't seem to reflect the real number of submitted crashes? I have over 300 reports in about:crashes from this month alone. 49 (install time 1636312441) looks to me like a realistic number of times that I had to close and reopen Firefox because of the issue.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)

A major change for Linux in Firefox 94 was https://mozillagfx.wordpress.com/2021/10/30/switching-the-linux-graphics-stack-from-glx-to-egl/

Looking at correlations:
(77.18% in signature vs 01.26% overall) adapter_driver_version = 21.0.3.0 [78.90% vs 40.96% if platform = Linux]

Note above that >=21 is required for it to be active. I'm not sure how to interpret that correlation (i.e. if platform is limited, are there still a ton of crashes on other drivers?)

Good catch. Faceting on driver version, all but 3 of these crashes on Linux 94.0/94.0.1 are on 21.0.1.0 or higher. In contrast, if you look at all crashes on Linux 94.0/94.0.1, around 25% of the crashes are lower than 21.0, and another couple percent have version numbers of 300 or higher.

Jim, is there somebody who might be able to look into whether this increase in Linux crashes could be related to the switch to EGL? Thanks.

Flags: needinfo?(jmathies)

I spent some time looking at the comments and URLs in the crash reports.

Lots of Facebook apps that look like they could be games in the URLs:
https://apps.facebook.com/belote-multijoueur/?kt_referer=preroll
https://apps.facebook.com/forestrescuebubble/?ref=bookmarks&fb_source=bookmark&count=0
https://apps.facebook.com/cccityofromance/?fb_source=canvas_bookmark
https://apps.facebook.com/goldencitycasino/?ref=bookmarks&fb_source=bookmark&count=0
https://apps.facebook.com/lets_fish/?ref=bookmarks&fb_source=web_shortcut&count=0

Various other sites that look like games:
https://www.king.com/de/play/candycrush
https://www.royalgames.com/index.jsp?redirect=true
https://www.myjackpot.fr/game/ramses-book-roar
https://www.novumgames.com/spiel/mustersuche/
https://ru1.seafight.com/index.es?action=internalMapUnity

In the comments, there's also a lot of discussion of games:

"I was playing Tetris with Facebook and Gmail open in other tabs. The video in Tetris started skipping and then the tab went kaput."

"was on the AARP game page doing a mini crossword when it froze"

"was on candy crush"

via Google translate: "fed up my computer keeps crashing with FACEBOOK impossible to make my games crashes all day it's really boring"

"jigsaw world keeps crashing"

"jewel hunt crashes after a while"

"i want to play games fix nowwwwwwwwww "

A few mentions of Google Maps. A few mentions of YouTube. A few mentions of watching a SpaceX video. A few mentions of other games.

Another comment with specific steps to reproduce on Google Maps: "Just panning, zooming, and switching between map and 3D views in Google maps."

(In reply to Andrew McCreight [:mccr8] from comment #8)

Chris, I recall you saying somewhere that there was a big spike in crash pings where the crash reason is "MOZ_RELEASE_ASSERT(result.isOk())" on Linux with Fission enabled? I wonder if that's the same issue. Maybe with Fission it manifests in some kind of non-user visible way so we don't show the crash reporter and they tend to show up more as pings than crash reports on crash stats?

MOZ_RELEASE_ASSERT(result.isOk()) crashes are not a problem on Fission. The spiking Fission crashes on Linux are "Should successfully create image I/O thread" and "Failed to start ImageBridgeChild thread" MOZ_RELEASE_ASSERTs and MOZ_CRASH(PR_CreateThread failed!).

These SharedStringMap crashes spiked starting November 2 when Firefox 94 was released. We didn't start rolling Fission out to Firefox 94 users until November 9. So while the thread errors are worse on Fission, they don't seem related to these SharedStringMap crashes (unless they are both side effects of some other system resource exhaustion problem).

(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)

Looking at correlations:
(77.18% in signature vs 01.26% overall) adapter_driver_version = 21.0.3.0 [78.90% vs 40.96% if platform = Linux]

Note above that >=21 is required for it to be active. I'm not sure how to interpret that correlation (i.e. if platform is limited, are there still a ton of crashes on other drivers?)

Crash ping telemetry query for adapter driver versions:

https://sql.telemetry.mozilla.org/queries/82936/source

Seems to be related to x86-64 architecture. Only 9 out of 160K crash pings are from 32-bit x86 and those have adapter driver version 20.0.8.0, so they wouldn't be using the new EGL code.

The 10 most common driver versions for these MOZ_RELEASE_ASSERT(result.isOk()) crash pings:

adapter driver version crash ping count
21.0.3.0 142,583
21.2.2.0 5,312
21.2.3.0 4,937
21.2.4.0 2,752
21.2.5.0 2,671
21.1.8.0 1,962
21.0.1.0 709
22.0.0.0 545
21.2.0.0 284
21.3.0.0 184

The 10 most common driver versions for other Linux x86-64 crash pings:

adapter driver version crash ping count
21.0.3.0 67,561
18.0.5.0 5,957
20.0.8.0 5,405
21.2.5.0 5,035
21.2.2.0 4,486
20.2.6.0 4,175
470.74.0.0 3,184
21.2.3.0 2,929
21.2.4.0 2,660
17.1.3.0 2,617

Posting a few thoughts, in no particular order….

(In reply to Andrew McCreight [:mccr8] from comment #22)

This was based on the fact that mMap.initWithHandle() is failing. It looks like the reasons that can fail are: the file descriptor passed in was invalid, PR_ImportFile failed, PR_GetOpenFileInfo64 returned an overly large value, PR_CreateFileMap failed, or PR_MemMap failed.

It looks like PR_ImportFile can only fail due to failure to malloc, which I think can't happen with mozjemalloc? PR_CreateFileMap on Unix can fail because of malloc, or if PR_GetOpenFileInfo fails, or if the size is larger than that of the file and it can't be extended (shouldn't be possible, because we pass in the size from PR_GetOpenFileInfo64). PR_MemMap is basically just mmap; failures include if the fd was completely invalid (which would already cause an error), or a special file that can't be mapped like a pipe/socket (that one might be able to reach this point, because nothing checks that the size is nonzero), or various resource exhaustions which probably wouldn't apply in a newly started content process and would more likely break malloc rather than this.
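If someone wants to rule out the "special file" case on a live system, here is a minimal sketch (the PID and fd number are hypothetical placeholders) for inspecting what a descriptor actually refers to before it would reach mmap:

pid=1234; fd=42   # placeholders: the content process and the suspect descriptor
readlink "/proc/$pid/fd/$fd"                        # e.g. /memfd:... vs pipe:[12345] or socket:[67890]
stat -L -c 'type=%F size=%s' "/proc/$pid/fd/$fd"    # a non-regular or zero-size file can't be usefully mmap'd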

The reports that something happens and then it's impossible to start any content processes (like the initial report on this bug, but also comment #27 about many reports per profile) suggest that something happened in the parent process to corrupt its copy of the shared map fd; I'd suspect a double close bug, but that should cause problems in a lot of places, not just here, and the crash stats suggest that there's something special about SharedStringMap.

And all the comments about WebGL in comment #33, combined with the evidence for EGL/X11 vs. GLX having a role… but WebGL currently runs in the content process. (I'd wondered if the sandbox's connect brokering might be buggy somehow, because that didn't get a lot of use until 91, but that would happen when WebGL is first used, not later on, and the way that it's used hasn't really changed between GLX and EGL/X11.) If I understand correctly we'd also be using EGL instead of GLX to set things up for WebRender, so maybe that's part of it?

(I'm also not sure what to make of comment #3, because the (default) new tab page is rendered in a content process, and that process has some special permissions but in theory shouldn't be different in ways that would matter here? Does it have its own SharedStringMap instance(s) instead of sharing with regular content processes?)

I can reproduce that locally, Fedora 34 / X11 / EGL (Gnome).

We also have user reports that switching back to GLX fixes it:
https://bugzilla.redhat.com/show_bug.cgi?id=2020981#c18

As I can reproduce it, is there any info / debug I can provide?

Attached file lsof and pmap (deleted) —
(In reply to Gian-Carlo Pascutto [:gcp] from comment #14)

When this happens, how does /dev/shmem look? i.e.
morbo@mozwell:~$ df -h
Filesystem              Size  Used Avail Use% Mounted on
tmpfs                    16G  5.0M   16G   1% /dev/shm

(In reply to Gian-Carlo Pascutto [:gcp] from comment #18)

Interesting output to check when this problem is happening:
ulimit -a -H
ulimit -a
pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; lsof -p {} | wc -l'
pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; pmap -p {} | wc -l'

I put your commands in a while loop and executed them after the initial crash of nearly all tabs. I tried "Restore This Tab" and "Restore All Crashed Tabs" several times:

1. while true; do df -h | grep /dev/shm; done:

tmpfs            32G  368M   32G   2% /dev/shm

The line doesn't change, no matter how often I reload the tabs. As you can see, only 368M of the 32G of shared memory (which is half my physical RAM) is used.

2. while true; do ulimit -a -H; echo ---; done:

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             unlimited
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-u: processes                       256790
-n: file descriptors                524288
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 256790
-q: bytes in POSIX msg queues       819200
-e: max nice                        30
-r: max rt priority                 95
-N 15:                              unlimited
---

Same as in 1.: the output doesn't change. But I'm not sure whether "Restore All Tabs" even worked (the tabs don't even seem to try to reload after I pressed that button a few times; "Restore This Tab" is not affected by this problem).

3. while true; do ulimit -a; echo ---; done:

-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-u: processes                       256790
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 256790
-q: bytes in POSIX msg queues       819200
-e: max nice                        30
-r: max rt priority                 95
-N 15:                              unlimited
---

Same as before. By the way, since this is slightly different from the -H output and some values are no longer "unlimited", it would be nice to know whether those are common values or whether the Arch devs changed something. Because if they did, we could ask them why.

4. I put the pgrep one-liners together in a single loop, since that seemed to make more sense:

while true
do
    pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; lsof -p {} | wc -l'
    echo "---lsof---"
    pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; pmap -p {} | wc -l'
    echo "---pmap---"
done

The output varies a bit (maybe because I switched between the windows and tabs that I was trying to restore?) but most of it stays the same:

I got an attachment upload error (something with missing key) and my comment was still sent? Wtf?

And the markdown doesn't even display in contrast to what was shown in the preview...

(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)

We also have user reports that switching back to GLX fixes it:
https://bugzilla.redhat.com/show_bug.cgi?id=2020981#c18

That looks interesting. They enabled logging and they have lots of errors related to file descriptors:
IPDL protocol Error: Received an invalid file descriptor
[Parent 10122, Main Thread] WARNING: failed to create memfd: Too many open files
[Parent 10122, IPC I/O Parent] WARNING: Message needs unreceived descriptors

"too many open files" - linux still has limits (4K?) on open fd's per process

(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)

As I can reproduce it, is there any info / debug I can provide?

Given that it looks a lot like a file descriptor leak, lsof output to see what the leaked fds are would help.

Flags: needinfo?(stransky)
Flags: needinfo?(schreibemirhalt)
Flags: needinfo?(jmathies)

"too many open files" - linux still has limits (4K?) on open fd's per process

This is what I was trying to get at with the ulimit data, but perhaps it was dumping the wrong value?

Raising this to S2 - we just realized that the FD exhaustion might affect the parent process, in which case we'll crash and the crash reporter won't be able to report for the same reason.

Severity: -- → S2
Priority: -- → P1

(In reply to Gian-Carlo Pascutto [:gcp] from comment #47)

Raising this to S2 - we just realized that the FD exhaustion might affect the parent process, in which case we'll crash and the crash reporter won't be able to report for the same reason.

Slight correction: As far as I know the crash reporter won't work if we run out of fds in the parent process or the crashing child process — the client creates a pipe and sends one end to the parent with SCM_RIGHTS so it can be told when the reporter is done and it can continue exiting. I noticed this because of the references to sys_pipe in the logs at https://bugzilla.redhat.com/show_bug.cgi?id=2020981, but those are actually parent process crashes (see Parent in earlier logs, also cloned child from the crash handler) and that also seems to be broken by the fd exhaustion.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #46)

This is what I was trying to get at with the ulimit data, but perhaps it was dumping the wrong value?

We attempt to raise the limit to 4k; if the soft limit is already higher than that then we won't lower it, and if the hard limit is lower then we'll raise the soft limit as much as we can. In this case the soft limit is 1k and the hard limit is 512k, so we'll have 4k. (IIRC there's at least one distro which does set the hard limit to 4k by default, which is one reason why our target isn't higher than that.) There's also /proc/{pid}/limits to read another process's limits.
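A minimal sketch for comparing those limits against actual usage across every Firefox process (plain procfs, no assumptions beyond the standard /proc layout):

for pid in $(pgrep -f firefox); do
  # soft and hard "Max open files" from /proc/<pid>/limits, plus the number of fds in use
  soft=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  hard=$(awk '/Max open files/ {print $5}' "/proc/$pid/limits")
  used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  echo "pid=$pid used=$used soft=$soft hard=$hard"
done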

"too many open files" - linux still has limits (4K?) on open fd's per process

The logs from reporter have 5500+ fds in the parent process.

If you're affected by this, the following output is useful:
pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}'
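A minimal sketch that runs the same lsof sweep but groups the output, so a leak of one particular kind of descriptor (as it turned out here, sync_file) stands out immediately; the grouping is rough, using lsof's TYPE column and the last field of NAME:

for pid in $(pgrep -f firefox); do
  echo "== pid $pid =="
  lsof -p "$pid" 2>/dev/null | awk 'NR > 1 {print $5, $NF}' | sort | uniq -c | sort -rn | head
done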

Blocks: gfx-triage

Could this be bug 1735905?

(Ramón Cahenzli from bug 1740260 comment #4)

The fact that disabling EGL works around the issue may be a coincidence. The symptoms are the same as for these bugs reported for Debian and openSUSE. The crackling audio problem is the same as well:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=998108
https://bugzilla.opensuse.org/show_bug.cgi?id=1192067

This seems to be related to the toolchains used to build Firefox, Mesa and the use of LLVM version 2 vs. 3.

Could this be bug 1735905?

Given that the segfault is in some of the first code that needs to allocate new fds, that the Red Hat bug tracker shows people running out of fds per the errors, and that the log posted here shows >5000 fds in the main process, I don't see how it could be. The bug you linked is a hang; this is a crash in every new process we try to launch.

Martin Stransky confirmed the leaked FDs are sync_file, pointing to WebGL + dmabuf problems.

Attached file last_10min.tar.xz (deleted) —

(In reply to Gian-Carlo Pascutto [:gcp] from comment #50)

"too many open files" - linux still has limits (4K?) on open fd's per process

The logs from reporter have 5500+ fds in the parent process.

If you're affected by this, the following output is useful:
pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}'

I reported the bug in RHBZ. I am attaching the (sanitized, I hope) output of lsof on my system; I had it running in a loop every minute. According to my system log the crash happened at 23:06:06, and I've included lsof runs from 22:57 to 23:07.

This bug affects Intel only (AMD looks fine) and occurs on both Wayland and X11 in EGL mode.
It comes from the dmabuf backend, which opens 'sync_file' descriptors.

A simple reproducer is to open any WebGL example (I use https://webglsamples.org/blob/blob.html) and check the open files of the GeckoMain process. There is an increasing number of open sync_file fds.
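A minimal sketch for watching that happen over time, based on the same lsof | grep sync_file check used elsewhere in this bug; it assumes the first pgrep match is the parent (GeckoMain) process, which may need adjusting:

pid=$(pgrep -f firefox | head -n 1)   # assumed to be GeckoMain; verify with ps
while sleep 5; do
  echo "$(date +%T) sync_file fds: $(lsof -p "$pid" 2>/dev/null | grep -c sync_file)"
done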

Flags: needinfo?(stransky)

We can test the WebGL code without the fence and call glFinish() instead to see if that helps.

Gnome Xwayland, Debian Testing, Intel

Already running main Nightly instance:
$ pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' | grep "sync_file" | wc -l
32

Bug reproducible with (clicked through all tabs):
$ MOZ_X11_EGL=1 MOZ_DISABLE_CONTENT_SANDBOX=1 mozregression --launch 2021-11-18 --pref gfx.webrender.all:true -P stdout -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html

$ pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' | grep "sync_file" | wc -l
462

Not reproducible with widget.dmabuf-webgl.enabled:false.

$ pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' | grep "sync_file" | wc -l
32
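For anyone who needs a stopgap until a fix ships, a minimal sketch of making that pref flip persistent via user.js (the profile path below is a placeholder; find the real one in about:profiles, or simply toggle widget.dmabuf-webgl.enabled in about:config instead):

profile="$HOME/.mozilla/firefox/xxxxxxxx.default-release"   # placeholder profile directory
echo 'user_pref("widget.dmabuf-webgl.enabled", false);' >> "$profile/user.js"
# restart Firefox for the pref to take effect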

Has STR: --- → yes
Flags: needinfo?(schreibemirhalt)
Regressed by: 1695933
Summary: All tabs start and keep crashing after a while / Crash in [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap] → widget.dmabuf-webgl.enabled: All tabs start and keep crashing after a while / Crash in [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap]
Has Regression Range: --- → yes

I found that a file handle could get leaked (bug 1741997) when I background a tab running a WebGL demo with EGL + DMABuf.

It might not be the only problem, however; mstransky is investigating another potential leak related to the drivers.

Xwayland, Ubuntu 21.10, Intel:
Just switching between 2 WebGL tabs (Ctrl+PageUp / Ctrl+PageDown) adds 21 sync_file fds.
One tab crashed at a sync_file count of 3910.
Other tabs freeze, showing plain red at first and then a static image of the WebGL content. Some other WebGL tabs still work properly.
Crashed tabs don't show a crash report form.
Then I got a main process crash. The crash reporter dialog opened; I clicked Restart, but Nightly crashed each time on startup.
The sync_file count was at 11673. Then I exited the crash reporter and the count went down to 0. After that I was able to start Nightly again and saw these in about:crashes:
bp-772f208c-459c-4d6d-8951-d39950211118 11/19/21, 00:59 [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap ]
bp-d33f4e7a-eeb8-4a30-9f9d-2b87a0211118 11/19/21, 00:59 [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap ]
bp-6d50d076-b83e-42fd-b6dd-03c340211118 11/19/21, 00:59 [@ EMPTY: no crashing thread identified; ERROR_NO_MINIDUMP_HEADER ] "unable to find a usable font ()"
bp-edf84066-5c5a-42c3-89d9-71a890211118 11/19/21, 00:58 [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST ] "unable to find a usable font (Noto Sans)"
bp-6bbdf95f-9654-4901-b083-c8a490211118 11/19/21, 00:58 [@ webrender::glyph_rasterizer::GlyphRasterizer::request_glyphs ] "assertion failed: self.font_contexts.lock_shared_context().has_font(&font.font_key)"
bp-72c33386-161d-4c7f-8393-170760211118 11/19/21, 00:58 [@ mozilla::SandboxFork::SandboxFork ] "MOZ_CRASH(socketpair failed)"
bp-4de7fe07-667c-4bae-adaa-888870211118 11/19/21, 00:58 [@ mozilla::SandboxFork::SandboxFork ] "MOZ_CRASH(socketpair failed)"

Yes, bug 1741997 seems to stop the sync_file count explosion! Nice!
With 6 tabs open, the count doesn't go beyond 26, no matter how often I switch tabs.
After closing one tab: 25
After closing next tab: 20
After closing next tab: 16
After closing next tab: 12
After closing next tab: 5
After closing last tab: 0

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #45)

Given that it looks a lot like a file descriptor leak, lsof output to see what the leaked fds are would help.

Executed after the initial crash of the tabs and tried to restore some of the tabs in different windows (just like before):

while true
do
pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' >> crash
echo "---lsof---" >> crash
done
Depends on: 1741997
Attached file lsof.txt.zip (deleted) —

Here is the text in a zip, in a separate comment, because a "Couldn't upload the text as an attachment. Please try again later. Error: Unexpected Error" message showed up after I clicked the submit button.

Ubuntu and Debian with Mesa drivers default to Wayland:

  • ESR 91 is unaffected and slow by default (GLX/Xwayland).
  • Firefox Snap ESR91 is affected by default, it uses the native Wayland backend (bug 1543600 == MOZ_ENABLE_WAYLAND=1 == EGL/Wayland and Dmabuf) instead of Xwayland.
    bp-a24c1dc2-8fa6-40c1-ac1e-cf1e20211119 11/19/21, 03:40 [@ libxul.so@0x2e718ce | libxul.so@0xb719f0 | libxul.so@0x5ce31e7 | libxul.so@0xa719c2 | libxul.so@0xa719c2 | libxul.so@0xa81e70 | libxul.so@0xa387b6 | libxul.so@0xa81c1b | libxul.so@0xb71e56 | libxul.so@0xb70661 | libxul.so@0xb6efac | libxul.so@0x6cb55ff... ] MOZ_RELEASE_ASSERT(result.isOk())
    bp-2782e2ee-6e93-4a81-b4e0-dffa30211119 11/19/21, 03:40 [@ libxul.so@0x2e718ce | libxul.so@0xb719f0 | libxul.so@0x5ce31e7 | libxul.so@0xa719c2 | libxul.so@0xa719c2 | libxul.so@0xa81e70 | libxul.so@0xa387b6 | libxul.so@0xa81c1b | libxul.so@0xb71e56 | libxul.so@0xb70661 | libxul.so@0xb6efac | libxul.so@0x6cb55ff... ] MOZ_RELEASE_ASSERT(result.isOk())
    bp-5b93047b-3413-4d42-acaf-836730211119 11/19/21, 03:39 [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST ] MOZ_CRASH(socketpair failed)
    bp-fee6511d-d63e-4f92-883b-a7a0a0211119 11/19/21, 03:39 [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST ] MOZ_CRASH()

Bug 1741997 fixes it for me; the sync_file count is stable and no longer rises over time.

(In reply to Darkspirit from comment #64)

Ubuntu and Debian with Mesa drivers default to Wayland:

  • ESR 91 is unaffected and slow by default (GLX/Xwayland).
  • Firefox Snap ESR91 is affected by default, it uses the native Wayland backend (bug 1543600 == MOZ_ENABLE_WAYLAND=1 == EGL/Wayland and Dmabuf) instead of Xwayland.

Does Snap ESR91 really use Wayland? That's a bit insane given the state of Wayland support in Firefox 91.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #66)

Does Snap ESR91 really use Wayland? That's a bit insane given the state of Wayland support in Firefox 91.

Yes, it does. But given that the Snaps have other teething problems (bug 1665641), I don't know how relevant that variant is.

I experienced this issue pretty consistently with 94.0.1, after ~10 minutes on jstris.jezevec10.com.

Regarding the hypothesis that Bug 1741997 fixes this issue: I upgraded to 94.0.2 earlier today, and my understanding is that the fix for Bug 1741997 was included in it. However, I got a crash since then; see: https://crash-stats.mozilla.org/report/index/d25dbea4-0c32-4b36-8ef2-a55150211120

Note that there was an older build of 94.0.2 created prior to this fix (buildID=20211117154346). We had to respin the 94.0.2 build to pick up this fix (buildID=20211119140621). Given that we never actually shipped the first 94.0.2 build, I assume you manually downloaded and installed it. Double-check the build you've got and make sure you're on the one we actually intend to ship on Monday with this fix.
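A minimal sketch for checking which build is installed from the shell; the install path is a guess and varies by distro, and the same Build ID is visible in about:support under Application Basics:

grep -E '^(Version|BuildID)=' /usr/lib/firefox/application.ini   # path is distro-dependent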

Crash rates on Nightly & Beta are also looking much better in builds with the fix for bug 1741997 included. Calling this fixed by bug 1741997. We expect to ship 94.0.2 later today to release users.

Assignee: nobody → aosmond
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch
No longer blocks: gfx-triage