Sub-optimal signatures for crashes in memfd:pipewire-memfd
Categories
(Socorro :: Signature, task)
Tracking
(Not tracked)
People
(Reporter: bwc, Unassigned)
References
Details
This set of signatures is pretty unwieldy, and difficult to track in bugzilla:
Comment 1•2 years ago
|
||
Example signatures:
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x7
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1007
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x2007
libxul.so@0x3665e29 | memfd:pipewire-memfd (deleted)@0x7
libxul.so@0x366da09 | memfd:pipewire-memfd (deleted)@0x7
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x107
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1c7
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x2047
libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x347
The first frame I can understand. The second frame--I've never seen anything like that before.
Gabriele: What does (deleted)@0xNNN
mean in a symbol? Should we add a signature generation normalization pass to coalesce those into something like memfd:pipewire-memfd (deleted)0xNNN
or drop the hex altogether? I don't know what information is important here to distinguish between crash signatures.
Comment 2•2 years ago
|
||
There's a few things to unpack here:
- We don't have symbols for libxul.so in the first frame, which is odd, but coming from Ubuntu this is potentially a snap build and we've had trouble scraping symbols from them so needinfo-ing Alexandre about that
- Because we don't have symbols for libxul.so the stackwalker doesn't know how to find the next frame and scans the stacks. While doing so it falls on a mapping that's been deleted: that is it found what looks like an instruction pointer that points into an area of the memory map where a file was mapped, but the file underlying the mapping was deleted from the filesystem. In particular this was an anonymous file created by pipewire via memfd_create()
- This frame is clearly wrong, I don't think we were executing code in that particular area, the stack walker should skip it and go forward with the scanning, even though the real solution here is having symbols in the first place so we don't have to resort to stack scanning
To sum it up: we need symbols but I'll file a bug in the stackwalker to improve behavior under these conditions.
Updated•2 years ago
|
Comment 3•2 years ago
|
||
If I take a look at one of the reports: https://crash-stats.mozilla.org/report/index/ceeb6878-3236-4005-8b48-7430d0221114
- BuildID 20221104181631
- this likely matches https://launchpad.net/~mozilla-snaps/firefox/+snap/firefox-snap-stable/+build/1932023 which was finished building on 20221104 at 2034 CET
- Last run of symbols pulling on november 4th was at 17:04 and it only includes one 107 beta build: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/CwXeEje5QxCos9E9wrJTgw/runs/0/artifacts/public/logs/live_backing.log
- Next run is on november 5th 22:53 and it seems we got nothing https://treeherder.mozilla.org/jobs?repo=mozilla-central&searchStr=symbols-snap&fromchange=2fe5fe82b2f89d3f13d58c9d331deeb6175ba856&selectedTaskRun=LTLxq9yUTTyPw4XcGQe4GQ.0
Comment 4•2 years ago
|
||
I could confirm that the buildid matching this above crash is from the 106.0.5-1 build one can download from https://launchpadlibrarian.net/632047151/firefox_106.0.5-1_amd64.debug https://launchpadlibrarian.net/632047166/firefox_106.0.5-1_amd64.snap
Gabriele, do we have a way to check from the debug symbols file the signature is matching?
$ ll libxul.so/0C36CC917C9C83B272B4110218A758D50/libxul.so.sym
-rw-r--r-- 1 alex alex 660M janv. 1 2010 libxul.so/0C36CC917C9C83B272B4110218A758D50/libxul.so.sym
Also since the wget of packages mentions those URLs in https://firefoxci.taskcluster-artifacts.net/LTLxq9yUTTyPw4XcGQe4GQ/0/public/build/wget_packages.log we know that symbols should have been produced from https://firefoxci.taskcluster-artifacts.net/LTLxq9yUTTyPw4XcGQe4GQ/0/public/logs/live_backing.log but:
- this log shows none being extracted
- the task-uploaded SHA256SUMS file does not contains anything above 106.0 builds for stable releases: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LTLxq9yUTTyPw4XcGQe4GQ/runs/0/artifacts/public/build/SHA256SUMS.zip so it confirms it failed to add it
Updated•2 years ago
|
Comment 5•2 years ago
|
||
Yes, the debug ID for libxul.so in those crashes is 0C36CC917C9C83B272B4110218A758D50
.
Comment 6•2 years ago
|
||
OK, I think I know:
$ grep -G "firefox_106.0.5-1_amd64.snap,[0-9]" SHA256SUMS
firefox_106.0b5-1_amd64.snap,188637184
This should not match. I'll debug that deeper.
Comment 7•2 years ago
|
||
Alexandre has fixed the issue in the Ubuntu Snap symbol scrapers and uploaded the missing symbols, we'll reprocess all the affected crashes shortly and with some luck the bad signatures will all go away.
Comment 8•2 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #7)
Alexandre has fixed the issue in the Ubuntu Snap symbol scrapers and uploaded the missing symbols, we'll reprocess all the affected crashes shortly and with some luck the bad signatures will all go away.
A first batch of 106.0.[1-5] symbol was done, and I've shared Gabriele a remainder of 107.0.[1-2] because we were also missing those.
I've also made a pull of GNOME SDK ones, just in case.
Reporter | ||
Updated•2 years ago
|
Comment 9•2 years ago
|
||
Where are we with this bug? I'm still seeing crash reports with the problematic signature. Are there other sets of symbols we're missing?
For example, I just reprocessed this crash report and it has the same signature:
bp-7bcca737-d4aa-4288-be39-ff6360221113
It's got libxul.so/0C36CC917C9C83B272B4110218A758D50.
Comment 10•2 years ago
|
||
It seems that the issue morphed from a "missing symbols" issue to a stack walker one. Take this crash for example. The crashing address is 0x36e8da9
within libxul.so
. If I look at the symbols we have this line:
FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)
36e8900 + 4a9
is exactly 0x36e8da9
so that's where we should be. We do have unwinding information for that function too, however the stackwalker falls back to stack scanning and ultimately considers that the symbol is missing. Given the address is exactly one byte past the end of the function I wonder if we've got an off-by-one issue in the stackwalker, we have a bunch of heuristics about that and I remember that we had to make some compromises there.
Comment 11•2 years ago
|
||
I filed issue #780 in the rust-minidump repo to address this issue (though this means deploying a hack heuristic to work around it).
Comment 12•2 years ago
|
||
Stack walking issue aside I've noticed that there's still several dependencies of the snap packages that are missing. For example:
- Crash 1c120051-f195-4a8b-91e8-b41b30230112 is missing symbols for
- libdconfsettings.so
- libspa-support.so
- libpipewire-module-client-node.so
- libpipewire-0.3.so.0
- Crash 025a19cb-bba6-4ddf-a971-82e820230109 is missing symbols for
- libxcb.so.1
- libdbus-1.so.3
- libxcb-present.so.0
- Crash 078ee570-dc44-4d99-a650-cbb860230108 is missing symbols for
- libcICE.so.6
These aren't a big deal as they rarely appear in stack traces. I still wanted to make a note of it to double-check at a later time.
Comment 13•2 years ago
|
||
A quick update from the discussion in the rust-minidump issue: this crash is an odd beast. The crashing address is off-by-one WRT to the function where it's supposed to be happening but the signal that caused it originated in the Linux kernel - not in userspace. This might explain the odd address so I'm trying to narrow down my hack to only target this very specific case.
Comment 14•2 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #10)
It seems that the issue morphed from a "missing symbols" issue to a stack walker one. Take this crash for example. The crashing address is
0x36e8da9
withinlibxul.so
. If I look at the symbols we have this line:FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)
36e8900 + 4a9
is exactly0x36e8da9
so that's where we should be. We do have unwinding information for that function too, however the stackwalker falls back to stack scanning and ultimately considers that the symbol is missing. Given the address is exactly one byte past the end of the function
The last 5 bytes of the function, which I suspect are the last instruction: e8 b7 cf fa ff
That's a call to 0x3695d60, as the last instruction. My first guess is that the callee was declared noreturn, but it did return; I've heard of this happening and causing problems like this. The symbol file has:
FUNC 3695d60 5d4 0 rtc::webrtc_logging_impl::Log(rtc::webrtc_logging_impl::LogArgType const*, ...)
But that doesn't appear to be declared noreturn. However, the call might be inlined from a noreturn function; poking around in searchfox finds a few functions named FatalLog
or WriteFatalLog
with an RTC_NORETURN
before the decl. (This looks like it's derived from Chromium logging, where an assertion failure / deliberate crash involves a log message with “fatal” level.)
Edited to add: the GitHub issue linked in comment #11 also found more or less the same thing.
Comment 15•2 years ago
|
||
After going through the discussion on GitHub and looking at the crashes once more I think we're dealing with a compiler bug. The preprocessed code that's crashing is the following:
if (spaBuffer->datas[0].chunk->size == 0) {
!rtc::LogMessage::IsNoop<::rtc::LS_ERROR>() && ::rtc::webrtc_logging_impl::LogCall() & ::rtc::webrtc_logging_impl::LogStreamer<>() << ::rtc::webrtc_logging_impl::LogMetadata("/home/gsvelto/projects/mozilla-central/third_party/libwebrtc/modules/desktop_capture/linux/wayland/moz_base_capturer_pipewire.cc", 474, ::rtc::LS_ERROR) << "Failed to get video stream: Zero size.";
return;
}
The crash happens when we return from a function call in the second line. The compiled code looks like this:
36e8da4: e8 b7 cf fa ff call 3695d60 <_ZNKSt4lessISt17basic_string_viewIcSt11char_traitsIcEEEclERKS3_S6_@@xul108+0x2f90>
36e8da9: 00 00 add %al,(%rax)
We're deep within inlined code that's manipulating the strings for the log message. Here's what's going on, the last two zeroes at offset 36e8da9
are padding which the compiler inserted for some reason - possibly because it thought that the call would never return - and this looks genuinely like a bug. However since x86 is x86 00 00
is a valid instruction which is executed and accesses the memory pointed by rax
. Since rax
contains a non-canonical address it triggers a general protection fault which is reported as SIGSEGV / SI_KERNEL
by our code (since the signal was raised by the kernel).
BTW I don't think we're hitting a DCHECK()
because it doesn't appear there are calls to it in the affected code - but I may have missed some - however even if we'd hit it we'd enter WriteFatalLog()
which calls abort() so it shouldn't return.
Comment 16•2 years ago
|
||
Also this is a very long way to say that the signature here cannot be improved: it accurately reflects what's going on which should never happen.
Comment 17•2 years ago
|
||
Just one last note: we are compiling with #define RTC_DCHECK_IS_ON 1
and we do hit WriteFatalLog()
from time to time, but it works as expected. See this crash for example which is from one of the affected builds. It's crashing where it's supposed to, so if a miscompilation is happening either it's preventing WriteFatalLog()
from being called or it's happening somewhere else entirely (and I think it's the latter given that the last call before the crash is to some std::less<>
template and not something within WebRTC's code).
Comment 18•2 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #17)
(and I think it's the latter given that the last call before the crash is to some
std::less<>
template and not something within WebRTC's code).
I don't think that's correct; that's probably just the name of the last public symbol (.dynsym
rather than .symtab
) before the address, due to disassembling a stripped binary. 0x2f90 would be a fairly large offset for a string comparison function, and see also the line of the breakpad symbols I quoted in comment #14.
Comment 19•2 years ago
|
||
But I do tend to think this is a compiler bug, because I don't see how that call could be legitimately noreturn. From the list of inlines (from the GitHub thread), there's nothing in there that's declared noreturn, and nothing “interesting” going on there — they're taking a sequence of operator<<
calls, collecting the argument types into a template parameter pack, and eventually calling webrtc_logging_impl::Log
(with an array of type tags and a varargs list of the values, which it then unpacks into a sequence of operator<<
calls on another type, which seems weird but not wrong).
As for the RTC_DCHECK_NOTREACHED
calls, those could happen if the LogArgType
sequence were ill-formed, but we can see what those values are from the C++ types in the symbol name, and it's fine (kLogMetadata
, kCharP
, kEnd
); we'd need something that was statically known bad for that to make any sense, and it would have to somehow take a different path at runtime.
Comment 20•2 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #18)
I don't think that's correct; that's probably just the name of the last public symbol (
.dynsym
rather than.symtab
) before the address, due to disassembling a stripped binary. 0x2f90 would be a fairly large offset for a string comparison function, and see also the line of the breakpad symbols I quoted in comment #14.
You're right, I forgot that the binary was stripped -_-
It's definitely rtc::webrtc_logging_impl::Log()
then, but it also gets called in a bunch of other places that keep executing after it. 0x36e8da4
is the only location where the call to that function is followed by padding. If I look for other call-sites that look like that, I can find some, but they're all for noreturn
functions, e.g.:
1091aa8: e8 83 2d ff ff call 1084830 <__stack_chk_fail@plt>
1091aad: 00 00 add BYTE PTR [rax],al
Description
•