widget.dmabuf-webgl.enabled: All tabs start and keep crashing after a while / Crash in [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap]
Categories
(Core :: IPC, defect, P1)
Tracking
()
People
(Reporter: schreibemirhalt, Assigned: aosmond)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(4 files)
Out of nowhere, tabs crash after a while. All tabs will crash at the same time, except for some tabs (it seems especially tabs which play a video); those will crash after I switch away from them and back to them, or when I try to open a new site within them. It also affects all new tabs. The browser becomes unusable for surfing until it is closed or killed and reopened. The behaviour happens at random, sometimes after half an hour, sometimes after multiple hours. It started right after my update from Firefox 93 to Firefox 94 on Arch Linux. I have more than enough system resources; neither RAM nor swap was full.
Here is some data that was automatically entered into this form:
Crash report: https://crash-stats.mozilla.org/report/index/007b5ede-76b7-458e-bd9a-2a1580211107
MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(result.isOk())
Top 10 frames of crashing thread:
0 libxul.so mozilla::dom::ipc::SharedStringMap::SharedStringMap /build/firefox/src/firefox-94.0.1/dom/ipc/SharedStringMap.cpp:33
1 libxul.so /build/firefox/src/firefox-94.0.1/intl/strres/nsStringBundle.cpp:489
2 libxul.so /build/firefox/src/firefox-94.0.1/intl/strres/nsStringBundle.cpp:581
3 libxul.so nsStringBundleBase::GetStringFromName /build/firefox/src/firefox-94.0.1/intl/strres/nsStringBundle.cpp:569
4 libxul.so mozilla::CubebUtils::InitBrandName /build/firefox/src/firefox-94.0.1/dom/media/CubebUtils.cpp:382
5 libxul.so mozilla::detail::RunnableFunction<void /build/firefox/src/firefox-94.0.1/xpcom/threads/nsThreadUtils.h:531
6 libxul.so mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal /build/firefox/src/firefox-94.0.1/xpcom/threads/TaskController.cpp:770
7 libxul.so nsThread::ProcessNextEvent /build/firefox/src/firefox-94.0.1/xpcom/threads/nsThread.cpp:1148
8 libxul.so mozilla::ipc::MessagePump::Run /build/firefox/src/firefox-94.0.1/ipc/glue/MessagePump.cpp:85
9 libxul.so MessageLoop::Run /build/firefox/src/firefox-94.0.1/ipc/chromium/src/base/message_loop.cc:306
Comment 1•3 years ago
The Bugbug bot thinks this bug should belong to the 'Core::Audio/Video: Playback' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Reporter
Comment 2•3 years ago
To make it clear: All browser tabs crash.
This is not only really annoying (I use over a thousand tabs, so killing the browser and waiting a few minutes until all tabs open again after I restore the session adds up), but as long as this bug exists, attending online meetings can also get on the nerves of others if I appear to just "quit" because the tab with the meeting crashed together with all other open tabs.
Reporter
Comment 3•3 years ago
To add further information: new "New Tab" pages are not directly affected, and new about: pages are also not affected. It could be that those do not crash if they are already open when all other tabs crash. But when I try to open any internet address in any tab, new or existing, the tab crashes.
Comment 4•3 years ago
This sounds bad, and it looks like this crash has started happening for a lot of people.
It is interesting to see that you don't seem to have Fission enabled in the crash in comment 0, so it can't just be that you have a zillion processes. Looking at the crashes in 94, only 10% of them had Fission enabled, so presumably it is unrelated.
This does feel like some kind of resource exhaustion, but I'm not sure what.
I'm marking this as a regression. We've seen this before, but it seems like it is happening more often in 94.
Comment 5•3 years ago
[Tracking Requested - why for this release]: Seems like a pretty serious regression in Linux stability. There are a remarkable number of comments on these crash reports, more than I've seen on crashes in a long time, and lots of them are talking about how Firefox started crashing very frequently for them, in the last week or so.
Comment 6•3 years ago
gcp, is there somebody on your team who might be able to look into this? This has been happening for awhile, but it looks like it got a lot worse in 94 on Linux for some reason. Something is going wrong with some kind of memory-mapping or file-handle kind of thing on Linux and I'm afraid I'm not really familiar with what might be going wrong there. Thanks.
Comment 7•3 years ago
If we could figure out a way to reproduce this, running mozregression to figure out what regressed it might be best but I don't know how easy that will be.
Comment 8•3 years ago
Chris, I recall you saying somewhere that there was a big spike in crash pings where the crash reason is "MOZ_RELEASE_ASSERT(result.isOk())" on Linux with Fission enabled? I wonder if that's the same issue. Maybe with Fission it manifests in some kind of non-user visible way so we don't show the crash reporter and they tend to show up more as pings than crash reports on crash stats?
Comment 9•3 years ago
Bug 1715975 comment 2 says "The exception seems to be some of the Ubuntu cases, which are kind of weird." and it does look like 64% of these crashes (across all versions) are on Ubuntu 20.04.3 LTS, but maybe that's a very common distro or something.
Comment 10•3 years ago
Zibi, were there any big changes to localization or string bundles in 94? These crashes look string bundle related, but we don't have any real explanation why the crashes would start spiking up in 94 on Linux. Thanks.
Comment 11•3 years ago
No, not really. We haven't touched StringBundle in years as it is on the path to be deprecated.
The last meaningful change was :kmag's shared memory buffer in 2018.
Comment 12•3 years ago
Bug 1514734 claims this is likely OOM (specifically, that it might be a shmem OOM).
Something is going wrong with some kind of memory-mapping or file-handle kind of thing on Linux
I'm unclear of where that conclusion comes from (and hence why this is in IPC - at least that would make sense if it was shmem!).
Comment 13•3 years ago
See also this analysis about the OOM theory: https://bugzilla.mozilla.org/show_bug.cgi?id=1715975#c3
Comment 14•3 years ago
I have more than enough system resources; neither RAM nor swap was full.
When this happens, how does /dev/shm look? i.e.
morbo@mozwell:~$ df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 16G 5.0M 16G 1% /dev/shm
That said, shouldn't memfd_create be immune to this? The last changes to shmem there that I remember are bug 1672085, but that was 93, not 94.
Clicking a few reports shows those systems are indeed critically low on swap:
https://crash-stats.mozilla.org/report/index/af11bd3b-8f23-4c94-94ae-386af0211117
https://crash-stats.mozilla.org/report/index/045b7608-b3ac-4b1b-90b8-65f900211117
https://crash-stats.mozilla.org/report/index/08cc781b-038d-4411-9c85-8388d0211117
https://crash-stats.mozilla.org/report/index/24654a82-5aba-450f-aac6-dd18b0211117
But as to why this is spiking, no idea.
Comment 15•3 years ago
Are we hitting file descriptor limits?
Reporter
Comment 16•3 years ago
Yep, I never had and still don't have Fission enabled. Also, the only thing I updated at the time was Firefox. I did not update my system or install anything for over a week and had one or two restarts; then I updated Firefox, and a few hours after the installation of the new FF version the tab crashing started.
Regarding OOM: I have 64 GB of RAM. I never saw a spike in my task manager (xfce4-taskmanager) when the issue appeared. Also, the browser itself doesn't crash (well, at least most of the time); only the tabs crash. And if I try to reload one of the crashed tabs, it crashes again (at least most of the time; for some reason some tabs can be reloaded after a few minutes, but not in every case — e.g. I was able to reload a Twitter tab at one time and at another time it did not work — but as soon as I click "reload all tabs" on another crashed tab, those will crash again). It also doesn't seem to matter how many tabs I have open. If I recall correctly, it happened at least once before I restored my session, so I had only a few tabs open instead of over a thousand.
I suspected some of my add-ons, but some of the crash reports from other people barely contain any add-ons, and none of the ones I use.
If you want me to try something, just let me know. I could also come to a Matrix room if you want.
Reporter
Comment 17•3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #14)
I have more than enough system resources; neither RAM nor swap was full.
When this happens, how does /dev/shm look? i.e.
I will send that when tabs crash again.
Comment 18•3 years ago
Interesting output to check when this problem is happening:
ulimit -a -H
ulimit -a
pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; lsof -p {} | wc -l'
pgrep -f firefox | xargs -I {} bash -c 'printf {}" "; pmap -p {} | wc -l'
Comment 19•3 years ago
Regarding OOM: I have 64GB of RAM.
Your distro/Linux config might limit some system resources to really low values (don't ask me why they do this!) so unfortunately that doesn't mean much. Also, a bug in a graphics driver that we trigger (for example!) can easily use up all memory. So we still have to account for this possibility.
Comment hidden (obsolete)
Comment hidden (obsolete)
Comment 22•3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #12)
Something is going wrong with some kind of memory-mapping or file-handle kind of thing on Linux
I'm unclear of where that conclusion comes from (and hence why this is in IPC - at least that would make sense if it was shmem!).
This was based on the fact that mMap.initWithHandle() is failing. It looks like the reasons that can fail are: the file descriptor passed in was invalid, PR_ImportFile failed, PR_GetOpenFileInfo64 returned an overly large value, PR_CreateFileMap failed, or PR_MemMap failed.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #14)
Clicking a few reports show those systems are indeed critically low on swap:
That's a good point. I thought I'd checked that, but maybe I was looking at total page file instead of available page file. There are a total of 1671 Linux crashes on either 94.0 or 94.0.1. Of those, 714 have 0 available page file, and 925 of the crashes have > 200 million bytes of available page file. So it seems like a lack of page file might be the explanation for many of these crashes, but not even really the majority of them. Although I don't know whether the available page file value is measured when the crash happens, or if it can be stale.
Comment 23•3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #8)
Chris, I recall you saying somewhere that there was a big spike in crash pings where the crash reason is "MOZ_RELEASE_ASSERT(result.isOk())" on Linux with Fission enabled? I wonder if that's the same issue. Maybe with Fission it manifests in some kind of non-user visible way so we don't show the crash reporter and they tend to show up more as pings than crash reports on crash stats?
In Matrix, regarding these crashes, cpeterson said: "The increase I saw in Linux content process crash pings I see in 94 are definitely worse with Fission. They're related to image threads failing to start, so probably not related to the SharedStringMap MOZ_RELEASE_ASSERTs, unless they are both failing due to some system limits on threads or processes?"
Comment 24•3 years ago
How many distinct systems are there? It didn't immediately make sense to me that 0 page file with free physical RAM would cause this to fail, as Linux does memory overcommit. But then looking closer shows that all the reports I was looking at were from the same machine.
Is it possible this looks huge because it's an immediate startup crash per tab? Thus, every browser instance that hits this will send 4-8 or more reports?
Comment 25•3 years ago
For the FD exhaustion theory, this bug https://bugzilla.mozilla.org/show_bug.cgi?id=1719140 also "spiked" but in much lower numbers.
Comment 26•3 years ago
As did this one https://bugzilla.mozilla.org/show_bug.cgi?id=1685642.
Comment 27•3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
How many distinct systems are there? It didn't immediately make sense to me that 0 page file with free physical RAM would cause this to fail, as Linux does memory overcommit. But then looking closer shows that all the reports I was looking at were from the same machine.
Good point. I faceted the crashes by install time. The top installation has 129 crashes, then there are 6 install times with 25 to 49 crashes, then there are another dozen install times with around 6 to 16 crashes. Then lots of people with 3 or less.
So yes, because it seems to be a crash in content processes, where a user can keep trying to open new tabs and failing, there is some kind of constant multiplier on the crashes.
Comment 28•3 years ago
A major change for Linux in Firefox 94 was https://mozillagfx.wordpress.com/2021/10/30/switching-the-linux-graphics-stack-from-glx-to-egl/
Looking at correlations:
(77.18% in signature vs 01.26% overall) adapter_driver_version = 21.0.3.0 [78.90% vs 40.96% if platform = Linux]
Note above that >=21 is required for it to be active. I'm not sure how to interpret that correlation (i.e. if platform is limited, are there still a ton of crashes on other drivers?)
Reporter
Comment 29•3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #27)
So yes, because it seems to be a crash in content processes, where a user can keep trying to open new tabs and failing, there is some kind of constant multiplier on the crashes.
When I press the button to reload a tab and it crashes again, another report appears in about:crashes that can be submitted. It seems that when multiple tabs crash at the same time, only one report is created.
Comment 30•3 years ago
(In reply to schreibemirhalt from comment #29)
When I press the button to reload a tab and it crashes again, another report appears in about:crashes that can be submitted. It seems that when multiple tabs crash at the same time, only one report is created.
Yeah, multiple web pages can be in the same process. Without Fission, there are no more than 8 content processes.
Reporter
Comment 31•3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #30)
Yeah, multiple web pages can be in the same process. Without Fission, there are no more than 8 content processes.
Ah, okay. I'd also like to add that the count column there doesn't seem to be the real number of submitted crashes? I have over 300 reports in about:crashes from this month alone. 49 (install time 1636312441) looks to me like a realistic number of times that I had to close and reopen Firefox because of the issue.
Comment 32•3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
A major change for Linux in Firefox 94 was https://mozillagfx.wordpress.com/2021/10/30/switching-the-linux-graphics-stack-from-glx-to-egl/
Looking at correlations:
(77.18% in signature vs 01.26% overall) adapter_driver_version = 21.0.3.0 [78.90% vs 40.96% if platform = Linux]
Note above that >=21 is required for it to be active. I'm not sure how to interpret that correlation (i.e. if platform is limited, are there still a ton of crashes on other drivers?)
Good catch. Faceting on driver version, all but 3 of these crashes on Linux 94.0/94.0.1 are on 21.0.1.0 or higher. In contrast, if you look at all crashes on Linux 94.0/94.0.1, around 25% of the crashes are lower than 21.0, and another couple percent have version numbers of 300 or higher.
Jim, is there somebody who might be able to look into whether this increase in Linux crashes could be related to the switch to EGL? Thanks.
Comment 33•3 years ago
I spent some time looking at the comments and URLs in the crash reports.
Lots of Facebook apps that look like they could be games in the URLs:
https://apps.facebook.com/belote-multijoueur/?kt_referer=preroll
https://apps.facebook.com/forestrescuebubble/?ref=bookmarks&fb_source=bookmark&count=0
https://apps.facebook.com/cccityofromance/?fb_source=canvas_bookmark
https://apps.facebook.com/goldencitycasino/?ref=bookmarks&fb_source=bookmark&count=0
https://apps.facebook.com/lets_fish/?ref=bookmarks&fb_source=web_shortcut&count=0
Various other sites that look like games:
https://www.king.com/de/play/candycrush
https://www.royalgames.com/index.jsp?redirect=true
https://www.myjackpot.fr/game/ramses-book-roar
https://www.novumgames.com/spiel/mustersuche/
https://ru1.seafight.com/index.es?action=internalMapUnity
In the comments, there's also a lot of discussion of games:
"I was playing Tetris with Facebook and Gmail open in other tabs. The video in Tetris started skipping and then the tab went kaput."
"was on the AARP game page doing a mini crossword when it froze"
"was on candy crush"
via Google translate: "fed up my computer keeps crashing with FACEBOOK impossible to make my games crashes all day it's really boring"
"jigsaw world keeps crashing"
"jewel hunt crashes after a while"
"i want to play games fix nowwwwwwwwww "
A few mentions of Google Maps. A few mentions of YouTube. A few mentions of watching a SpaceX video. A few mentions of other games.
Comment 34•3 years ago
Another comment with specific steps to reproduce on Google Maps: "Just panning, zooming, and switching between map and 3D views in Google maps."
Comment 35•3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #8)
Chris, I recall you saying somewhere that there was a big spike in crash pings where the crash reason is "MOZ_RELEASE_ASSERT(result.isOk())" on Linux with Fission enabled? I wonder if that's the same issue. Maybe with Fission it manifests in some kind of non-user visible way so we don't show the crash reporter and they tend to show up more as pings than crash reports on crash stats?
MOZ_RELEASE_ASSERT(result.isOk()) crashes are not a problem on Fission. The spiking Fission crashes on Linux are "Should successfully create image I/O thread" and "Failed to start ImageBridgeChild thread" MOZ_RELEASE_ASSERTs and MOZ_CRASH(PR_CreateThread failed!).
These SharedStringMap crashes spiked starting November 2 when Firefox 94 was released. We didn't start rolling Fission out to Firefox 94 users until November 9. So while the thread errors are worse on Fission, they don't seem related to these SharedStringMap crashes (unless they are both side effects of some other system resource exhaustion problem).
Comment 36•3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
Looking at correlations:
(77.18% in signature vs 01.26% overall) adapter_driver_version = 21.0.3.0 [78.90% vs 40.96% if platform = Linux]
Note above that >=21 is required for it to be active. I'm not sure how to interpret that correlation (i.e. if platform is limited, are there still a ton of crashes on other drivers?)
Crash ping telemetry query for adapter driver versions:
https://sql.telemetry.mozilla.org/queries/82936/source
Seems to be related to x86-64 architecture. Only 9 out of 160K crash pings are from 32-bit x86 and those have adapter driver version 20.0.8.0, so they wouldn't be using the new EGL code.
The 10 most common driver versions for these MOZ_RELEASE_ASSERT(result.isOk()) crash pings:
adapter driver version | crash ping count |
---|---|
21.0.3.0 | 142,583 |
21.2.2.0 | 5,312 |
21.2.3.0 | 4,937 |
21.2.4.0 | 2,752 |
21.2.5.0 | 2,671 |
21.1.8.0 | 1,962 |
21.0.1.0 | 709 |
22.0.0.0 | 545 |
21.2.0.0 | 284 |
21.3.0.0 | 184 |
The 10 most common driver versions for other Linux x86-64 crash pings:
adapter driver version | crash ping count |
---|---|
21.0.3.0 | 67,561 |
18.0.5.0 | 5,957 |
20.0.8.0 | 5,405 |
21.2.5.0 | 5,035 |
21.2.2.0 | 4,486 |
20.2.6.0 | 4,175 |
470.74.0.0 | 3,184 |
21.2.3.0 | 2,929 |
21.2.4.0 | 2,660 |
17.1.3.0 | 2,617 |
Comment 37•3 years ago
Posting a few thoughts, in no particular order….
(In reply to Andrew McCreight [:mccr8] from comment #22)
This was based on the fact that mMap.initWithHandle() is failing. It looks like the reasons that can fail are: the file descriptor passed in was invalid, PR_ImportFile failed, PR_GetOpenFileInfo64 returned an overly large value, PR_CreateFileMap failed, or PR_MemMap failed.
It looks like PR_ImportFile can only fail due to failure to malloc, which I think can't happen with mozjemalloc? PR_CreateFileMap on Unix can fail because of malloc, or if PR_GetOpenFileInfo fails, or if the size is larger than that of the file and it can't be extended (shouldn't be possible, because we pass in the size from PR_GetOpenFileInfo64). PR_MemMap is basically just mmap; failures include if the fd was completely invalid (which would already cause an error), or a special file that can't be mapped like a pipe/socket (that one might be able to reach this point, because nothing checks that the size is nonzero), or various resource exhaustions which probably wouldn't apply in a newly started content process and would more likely break malloc rather than this.
The reports that something happens and then it's impossible to start any content processes (like the initial report on this bug, but also comment #27 about many reports per profile) suggest that something happened in the parent process to corrupt its copy of the shared map fd; I'd suspect a double-close bug, but that should cause problems in a lot of places, not just here, and the crash stats suggest that there's something special about SharedStringMap.
And all the comments about WebGL in comment #33, combined with the evidence for EGL/X11 vs. GLX having a role… but WebGL currently runs in the content process. (I'd wondered if the sandbox's connect brokering might be buggy somehow, because that didn't get a lot of use until 91, but that would happen when WebGL is first used, not later on, and the way that it's used hasn't really changed between GLX and EGL/X11.) If I understand correctly we'd also be using EGL instead of GLX to set things up for WebRender, so maybe that's part of it?
(I'm also not sure what to make of comment #3, because the (default) new tab page is rendered in a content process, and that process has some special permissions but in theory shouldn't be different in ways that would matter here? Does it have its own SharedStringMap instance(s) instead of sharing with regular content processes?)
Comment 38•3 years ago
I can reproduce that locally, Fedora 34 / X11 / EGL (Gnome).
We also have user reports that switching back to GLX fixes it:
https://bugzilla.redhat.com/show_bug.cgi?id=2020981#c18
As I can reproduce it, is there any info / debug I can provide?
Reporter
Comment 39•3 years ago
Comment hidden (duplicate)
Reporter
Comment 41•3 years ago
I got an attachment upload error (something with missing key) and my comment was still sent? Wtf?
Reporter
Comment 42•3 years ago
And the markdown doesn't even display in contrast to what was shown in the preview...
Comment 43•3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)
We also have user reports that switch back to GLX fixes it:
https://bugzilla.redhat.com/show_bug.cgi?id=2020981#c18
That looks interesting. They enabled logging and they have lots of errors related to file descriptors:
IPDL protocol Error: Received an invalid file descriptor
[Parent 10122, Main Thread] WARNING: failed to create memfd: Too many open files
[Parent 10122, IPC I/O Parent] WARNING: Message needs unreceived descriptors
Comment 44•3 years ago
"too many open files" - linux still has limits (4K?) on open fd's per process
Comment 45•3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)
As I can reproduce it, is there any info / debug I can provide?
Given that it looks a lot like a file descriptor leak, lsof output to see what the leaked fds are would help.
Comment 46•3 years ago
"too many open files" - linux still has limits (4K?) on open fd's per process
This is what I was trying to get at with the ulimit data, but perhaps it was dumping the wrong value?
Comment 47•3 years ago
Raising this to S2 - we just realized that the FD exhaustion might affect the parent process, in which case we'll crash and the crash reporter won't be able to report for the same reason.
Comment 48•3 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #47)
Raising this to S2 - we just realized that the FD exhaustion might affect the parent process, in which case we'll crash and the crash reporter won't be able to report for the same reason.
Slight correction: As far as I know the crash reporter won't work if we run out of fds in the parent process or the crashing child process — the client creates a pipe and sends one end to the parent with SCM_RIGHTS so it can be told when the reporter is done and it can continue exiting. I noticed this because of the references to sys_pipe in the logs at https://bugzilla.redhat.com/show_bug.cgi?id=2020981, but those are actually parent process crashes (see Parent in earlier logs, also cloned child from the crash handler) and that also seems to be broken by the fd exhaustion.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #46)
This is what I was trying to get at with the ulimit data, but perhaps it was dumping the wrong value?
We attempt to raise the limit to 4k; if the soft limit is already higher than that then we won't lower it, and if the hard limit is lower then we'll raise the soft limit as much as we can. In this case the soft limit is 1k and the hard limit is 512k, so we'll have 4k. (IIRC there's at least one distro which does set the hard limit to 4k by default, which is one reason why our target isn't higher than that.) There's also /proc/{pid}/limits to read another process's limits.
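For example, the per-process limits can be read like this (a self-contained sketch using /proc/self; any pid works in place of self):

```shell
# Read the fd limit of a process from /proc without attaching to it.
# The two numeric columns are the soft and hard limits.
grep 'Max open files' /proc/self/limits
```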
Comment 49•3 years ago
Here's where we create the reply pipe for a parent process crash (note the log message) and here's the one for child processes (note the lack of logging).
Comment 50•3 years ago
"too many open files" - linux still has limits (4K?) on open fd's per process
The logs from reporter have 5500+ fds in the parent process.
If you're affected by this, the following output is useful:
pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}'
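If lsof isn't installed, a rough equivalent (an illustrative sketch, not from this bug) can group a process's fds by target via /proc; a sync_file leak then stands out as a large count of "anon_inode:sync_file" entries.

```shell
# Group a process's open fds by what they point at, most common first.
# A sync_file leak shows up as a big count of "anon_inode:sync_file".
fd_summary() {
  for fd in /proc/"$1"/fd/*; do
    readlink "$fd" 2>/dev/null || true
  done | sed 's/:\[[0-9]*\]$//' | sort | uniq -c | sort -rn
}
fd_summary $$
```

The sed step folds inode numbers (e.g. `socket:[12345]` into `socket`) so identical fd types aggregate into one line.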
Comment 51•3 years ago
Could this be bug 1735905?
(Ramón Cahenzli from bug 1740260 comment #4)
The fact that disabling EGL works around the issue may be a coincidence. The symptoms are the same as for these bugs reported for Debian and openSUSE. The crackling audio problem is the same as well:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=998108
https://bugzilla.opensuse.org/show_bug.cgi?id=1192067
This seems to be related to the toolchains used to build Firefox, Mesa and the use of LLVM version 2 vs. 3.
Comment 52•3 years ago
Could this be bug 1735905?
Given the segfault is in some of the first code that needs to allocate new fds, the Red Hat bugtracker having people running out of fds as per the errors, and the log posted here showing >5000 in the main process, I don't see why. The bug you linked is a hang - this is a crash in every new process we try to launch.
Comment 54•3 years ago
Martin Stransky confirmed the leaked FDs are sync_file, pointing to WebGL + dmabuf problems.
Comment 55•3 years ago
30250x_sync_file
(In reply to Gian-Carlo Pascutto [:gcp] from comment #50)
"too many open files" - linux still has limits (4K?) on open fd's per process
The logs from reporter have 5500+ fds in the parent process.
If you're affected by this, the following output is useful:
pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}'
I reported the bug in RHBZ. I am attaching the (sanitized, I hope) output of lsof on my system - I had it running on a loop every minute. According to my system log the crash happened at 23:06:06, I've included lsof runs from 22:57 to 23:07.
Comment 56•3 years ago
This bug affects Intel only (AMD looks fine) and occurs on both Wayland and X11 in EGL mode.
It comes from the dmabuf backend, which opens 'sync_file' fds.
A simple reproducer is to open any WebGL example (I use https://webglsamples.org/blob/blob.html) and check the open files of the GeckoMain process. There is an increasing number of open sync_file fds.
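A small helper to watch that count (a sketch assuming the processes match "firefox" in pgrep; it prints 0 when no Firefox is running):

```shell
# Count sync_file fds across all firefox processes via /proc.
count_sync_file() {
  (pgrep -f firefox || true) | while read -r pid; do
    ls -l "/proc/$pid/fd" 2>/dev/null || true
  done | awk '/sync_file/ {n++} END {print n+0}'
}
count_sync_file
```

Run it before and after interacting with the WebGL demo; a steadily growing number confirms the leak.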
Comment 57•3 years ago
We can test the WebGL code without a fence and call glFinish() instead to see if that helps.
Comment 58•3 years ago
Gnome Xwayland, Debian Testing, Intel
Already running main Nightly instance:
$ pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' | grep "sync_file" | wc -l
32
Bug reproducible with (clicked through all tabs):
$ MOZ_X11_EGL=1 MOZ_DISABLE_CONTENT_SANDBOX=1 mozregression --launch 2021-11-18 --pref gfx.webrender.all:true -P stdout -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html -a https://webglsamples.org/blob/blob.html
$ pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' | grep "sync_file" | wc -l
462
Not reproducible with widget.dmabuf-webgl.enabled:false.
$ pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' | grep "sync_file" | wc -l
32
Assignee
Comment 59•3 years ago
I found a file handle could get leaked in bug 1741997 when I background a tab running a WebGL demo with EGL + DMABuf.
It might not be the only problem, however. mstransky is investigating another potential leak related to the drivers.
Comment 60•3 years ago
Xwayland, Ubuntu 21.10, Intel:
Just switching between 2 webgl tabs (Ctrl+PageUp / Ctrl+PageDown) causes 21 additional sync_file fds.
One tab crashed at a sync_file count of 3910.
Other tabs are frozen, show plain red at first, then a static image of the webgl content. Some other webgl tabs still work properly.
Crashed tabs don't show a crash report form.
Then I got main process crash. The crash reporter dialog opened. I clicked on Restart, but Nightly crashed each time on startup.
sync_file count is at 11673. Then I exited the crash reporter. The count went down to 0. Then I was able to start Nightly again and saw these on about:crashes:
bp-772f208c-459c-4d6d-8951-d39950211118 11/19/21, 00:59 [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap ]
bp-d33f4e7a-eeb8-4a30-9f9d-2b87a0211118 11/19/21, 00:59 [@ mozilla::dom::ipc::SharedStringMap::SharedStringMap ]
bp-6d50d076-b83e-42fd-b6dd-03c340211118 11/19/21, 00:59 [@ EMPTY: no crashing thread identified; ERROR_NO_MINIDUMP_HEADER ] "unable to find a usable font ()"
bp-edf84066-5c5a-42c3-89d9-71a890211118 11/19/21, 00:58 [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST ] "unable to find a usable font (Noto Sans)"
bp-6bbdf95f-9654-4901-b083-c8a490211118 11/19/21, 00:58 [@ webrender::glyph_rasterizer::GlyphRasterizer::request_glyphs ] "assertion failed: self.font_contexts.lock_shared_context().has_font(&font.font_key)"
bp-72c33386-161d-4c7f-8393-170760211118 11/19/21, 00:58 [@ mozilla::SandboxFork::SandboxFork ] "MOZ_CRASH(socketpair failed)"
bp-4de7fe07-667c-4bae-adaa-888870211118 11/19/21, 00:58 [@ mozilla::SandboxFork::SandboxFork ] "MOZ_CRASH(socketpair failed)"
Comment 61•3 years ago
Yes, bug 1741997 seems to stop the sync_file count explosion! Nice!
6 tabs don't go beyond 26, no matter how often I switch tabs.
After closing one tab: 25
After closing next tab: 20
After closing next tab: 16
After closing next tab: 12
After closing next tab: 5
After closing last tab: 0
Reporter
Comment 62•3 years ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #45)
Given that it looks a lot like a file descriptor leak, lsof output to see what the leaked fds are would help.
Executed this after the initial crash of the tabs, while trying to restore some of the tabs in different windows (just like before):
while true
do
pgrep -f firefox | xargs -I {} bash -c 'lsof -p {}' >> crash
echo "---lsof---" >> crash
done
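One caveat with the loop above: it has no delay between iterations, so it re-runs lsof as fast as possible and the log grows very quickly. A variant with a sampling interval and an optional sample limit; the function name and the interval/limit defaults are illustrative, not part of the original report:

```shell
#!/bin/sh
# collect_lsof LOGFILE INTERVAL LIMIT: append lsof output for every
# firefox process to LOGFILE, separated by "---lsof---" markers, every
# INTERVAL seconds. LIMIT > 0 stops after that many samples.
collect_lsof() {
  logfile="${1:-crash}"
  interval="${2:-5}"
  limit="${3:-0}"   # 0 = run until interrupted
  i=0
  while [ "$limit" -eq 0 ] || [ "$i" -lt "$limit" ]; do
    pgrep -f firefox 2>/dev/null \
      | xargs -I {} sh -c 'lsof -p {}' >> "$logfile" 2>/dev/null
    echo "---lsof---" >> "$logfile"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

For example, `collect_lsof crash 5 0` reproduces the original behavior with a 5-second pause between samples.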
Reporter
Comment 63•3 years ago
Here is the text in a zip, in a separate comment, because the error "Couldn't upload the text as an attachment. Please try again later. Error: Unexpected Error" showed up after I clicked on the submit button.
Comment 64•3 years ago
Ubuntu and Debian with Mesa drivers default to Wayland:
- ESR 91 is unaffected and slow by default (GLX/Xwayland).
- Firefox Snap ESR91 is affected by default, it uses the native Wayland backend (bug 1543600 == MOZ_ENABLE_WAYLAND=1 == EGL/Wayland and Dmabuf) instead of Xwayland.
bp-a24c1dc2-8fa6-40c1-ac1e-cf1e20211119 11/19/21, 03:40 [@ libxul.so@0x2e718ce | libxul.so@0xb719f0 | libxul.so@0x5ce31e7 | libxul.so@0xa719c2 | libxul.so@0xa719c2 | libxul.so@0xa81e70 | libxul.so@0xa387b6 | libxul.so@0xa81c1b | libxul.so@0xb71e56 | libxul.so@0xb70661 | libxul.so@0xb6efac | libxul.so@0x6cb55ff... ] MOZ_RELEASE_ASSERT(result.isOk())
bp-2782e2ee-6e93-4a81-b4e0-dffa30211119 11/19/21, 03:40 [@ libxul.so@0x2e718ce | libxul.so@0xb719f0 | libxul.so@0x5ce31e7 | libxul.so@0xa719c2 | libxul.so@0xa719c2 | libxul.so@0xa81e70 | libxul.so@0xa387b6 | libxul.so@0xa81c1b | libxul.so@0xb71e56 | libxul.so@0xb70661 | libxul.so@0xb6efac | libxul.so@0x6cb55ff... ] MOZ_RELEASE_ASSERT(result.isOk())
bp-5b93047b-3413-4d42-acaf-836730211119 11/19/21, 03:39 [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST ] MOZ_CRASH(socketpair failed)
bp-fee6511d-d63e-4f92-883b-a7a0a0211119 11/19/21, 03:39 [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST ] MOZ_CRASH()
Comment 65•3 years ago
Bug 1741997 fixes it for me; the sync_file count is stable and does not rise over time.
Comment 66•3 years ago
(In reply to Darkspirit from comment #64)
Ubuntu and Debian with Mesa drivers default to Wayland:
- ESR 91 is unaffected and slow by default (GLX/Xwayland).
- Firefox Snap ESR91 is affected by default, it uses the native Wayland backend (bug 1543600 == MOZ_ENABLE_WAYLAND=1 == EGL/Wayland and Dmabuf) instead of Xwayland.
Does Snap ESR91 really use Wayland? That's a bit insane given the Wayland state in FF91.
Comment 67•3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #66)
Does Snap ESR91 really use Wayland? That's a bit insane given the Wayland state in FF91.
Yes, it does. But given that the Snaps have other childhood illnesses (bug 1665641), I don't know how relevant that variant is.
Comment 68•3 years ago
I experienced this issue pretty consistently with 94.0.1, after ~10 minutes on jstris.jezevec10.com.
Regarding the hypothesis that Bug 1741997 fixes this issue: I upgraded to 94.0.2 earlier today, and my understanding is that the fix for Bug 1741997 was included in it. However, I got a crash since then; see: https://crash-stats.mozilla.org/report/index/d25dbea4-0c32-4b36-8ef2-a55150211120
Comment 69•3 years ago
Note that there was an older build of 94.0.2 created prior to this fix (buildID=20211117154346). We had to respin the 94.0.2 build to pick up this fix (buildID=20211119140621). Given that we never actually shipped the first 94.0.2 build, I assume you manually downloaded and installed it. Double-check the build you've got and make sure you're on the one we actually intend to ship on Monday with this fix.
Comment 70•3 years ago
Crash rates on Nightly & Beta are also looking much better in builds with the fix for bug 1741997 included. Calling this fixed by bug 1741997. We expect to ship 94.0.2 later today to release users.