Bug 1800919 (Open) · Opened 2 years ago · Updated 2 years ago

Sub-optimal signatures for crashes in memfd:pipewire-memfd

Categories

(Socorro :: Signature, task)

Tracking

(Not tracked)

People

(Reporter: bwc, Unassigned)

References

Details

Example signatures:

  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x7
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1007
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x2007
  • libxul.so@0x3665e29 | memfd:pipewire-memfd (deleted)@0x7
  • libxul.so@0x366da09 | memfd:pipewire-memfd (deleted)@0x7
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x107
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1c7
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x2047
  • libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x347

The first frame I can understand. The second frame, though, is something I've never seen before.

Gabriele: What does (deleted)@0xNNN mean in a symbol? Should we add a signature generation normalization pass to coalesce those into something like memfd:pipewire-memfd (deleted)@0xNNN, or drop the hex altogether? I don't know which information is important here to distinguish between crash signatures.
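For illustration, such a normalization pass could look something like the following sketch (the regex and function name are hypothetical, not Socorro's actual signature-generation API): it collapses the varying hex offset so all these frames produce one signature.

```python
import re

# Hypothetical normalization sketch: strip the varying "@0xNNN" offset
# from frames in deleted memfd mappings so they coalesce into one
# signature. Not Socorro's actual code.
MEMFD_RE = re.compile(r"(memfd:\S+ \(deleted\))@0x[0-9a-fA-F]+")

def normalize_frame(frame):
    return MEMFD_RE.sub(r"\1", frame)

print(normalize_frame("libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)@0x1007"))
# → libxul.so@0x3664869 | memfd:pipewire-memfd (deleted)
```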

Flags: needinfo?(gsvelto)

There are a few things to unpack here:

  • We don't have symbols for libxul.so in the first frame, which is odd; since the crash comes from Ubuntu this is potentially a snap build, and we've had trouble scraping symbols from those, so I'm needinfo-ing Alexandre about that
  • Because we don't have symbols for libxul.so, the stackwalker doesn't know how to find the next frame and falls back to scanning the stack. While doing so it lands on a mapping that has been deleted: that is, it found what looks like an instruction pointer pointing into an area of the memory map where a file was mapped, but the file underlying the mapping was deleted from the filesystem. In this particular case it was an anonymous file created by PipeWire via memfd_create()
  • This frame is clearly wrong; I don't think we were executing code in that area, so the stack walker should skip it and continue scanning. That said, the real solution is to have symbols in the first place, so we don't have to resort to stack scanning at all
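The stack-scanning fallback described above can be illustrated with a toy model (purely illustrative, not rust-minidump's actual implementation): walk the stack word by word and accept any value that lands inside an executable mapping as a candidate return address, which is exactly how a stale pointer into a deleted memfd mapping gets picked up as a frame.

```python
# Toy model of the stack-scanning fallback (illustrative only). The
# mapping addresses below are made up; the module-relative offsets are
# chosen to match the signatures quoted in this bug.
mappings = [
    ("libxul.so", 0x7F0000000000, 0x7F0010000000, True),
    ("memfd:pipewire-memfd (deleted)", 0x7F0020000000, 0x7F0020010000, True),
    ("[stack]", 0x7FFC00000000, 0x7FFC00100000, False),
]

def scan_stack(words):
    hits = []
    for w in words:
        for name, start, end, is_exec in mappings:
            # Anything pointing into an executable mapping is accepted
            # as a candidate return address, even if it is garbage.
            if is_exec and start <= w < end:
                hits.append((name, w - start))  # module-relative offset
    return hits

stack = [0x0, 0x7F0003664869, 0x41414141, 0x7F0020000007]
print(scan_stack(stack))  # hits in libxul.so and the deleted memfd mapping
```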

To sum it up: we need the symbols, but I'll also file a bug against the stackwalker to improve its behavior under these conditions.

Flags: needinfo?(gsvelto) → needinfo?(lissyx+mozillians)
Flags: needinfo?(lissyx+mozillians)

I can confirm that the build ID matching the crash above is from the 106.0.5-1 build one can download from https://launchpadlibrarian.net/632047151/firefox_106.0.5-1_amd64.debug and https://launchpadlibrarian.net/632047166/firefox_106.0.5-1_amd64.snap

Gabriele, do we have a way to check from the debug symbols file whether the debug ID matches?

$ ll libxul.so/0C36CC917C9C83B272B4110218A758D50/libxul.so.sym 
-rw-r--r-- 1 alex alex 660M janv.  1  2010 libxul.so/0C36CC917C9C83B272B4110218A758D50/libxul.so.sym

Also, since the wget of the packages mentions those URLs in https://firefoxci.taskcluster-artifacts.net/LTLxq9yUTTyPw4XcGQe4GQ/0/public/build/wget_packages.log, we know that symbols should have been produced by https://firefoxci.taskcluster-artifacts.net/LTLxq9yUTTyPw4XcGQe4GQ/0/public/logs/live_backing.log but:

Flags: needinfo?(gsvelto)

Yes, the debug ID for libxul.so in those crashes is 0C36CC917C9C83B272B4110218A758D50.

Flags: needinfo?(gsvelto)

OK, I think I know:

$ grep -G "firefox_106.0.5-1_amd64.snap,[0-9]" SHA256SUMS
firefox_106.0b5-1_amd64.snap,188637184

This should not match. I'll debug this further.
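The unescaped dots in the grep pattern explain the false match: in a basic regular expression, `.` matches any character, so `106.0.5` also matches the beta version string `106.0b5`. A quick check (using Python's `re` for illustration; `grep -G` treats `.` the same way):

```python
import re

# In the grep pattern the dots are regex metacharacters, so "106.0.5"
# also matches the beta version string "106.0b5".
pattern = r"firefox_106.0.5-1_amd64.snap,[0-9]"
beta_line = "firefox_106.0b5-1_amd64.snap,188637184"
print(bool(re.search(pattern, beta_line)))   # True: unescaped dots match "b"

escaped = re.escape("firefox_106.0.5-1_amd64.snap") + r",[0-9]"
print(bool(re.search(escaped, beta_line)))   # False: literal dots don't
```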

Alexandre has fixed the issue in the Ubuntu Snap symbol scrapers and uploaded the missing symbols; we'll reprocess all the affected crashes shortly, and with some luck the bad signatures will all go away.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

> Alexandre has fixed the issue in the Ubuntu Snap symbol scrapers and uploaded the missing symbols; we'll reprocess all the affected crashes shortly, and with some luck the bad signatures will all go away.

A first batch of 106.0.[1-5] symbols was done, and I've shared with Gabriele the remaining 107.0.[1-2] ones, because we were also missing those.

I've also pulled the GNOME SDK symbols, just in case.

Where are we with this bug? I'm still seeing crash reports with the problematic signature. Are there other sets of symbols we're missing?

For example, I just reprocessed this crash report and it has the same signature:

bp-7bcca737-d4aa-4288-be39-ff6360221113

It's got libxul.so/0C36CC917C9C83B272B4110218A758D50.

Flags: needinfo?(gsvelto)

It seems the issue has morphed from a missing-symbols problem into a stack walker one. Take this crash for example. The crashing address is 0x36e8da9 within libxul.so. If I look at the symbols, we have this line:

FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)

0x36e8900 + 0x4a9 is exactly 0x36e8da9, so that's where we should be. We do have unwinding information for that function too, yet the stackwalker falls back to stack scanning and ultimately considers the symbol missing. Given that the address is exactly one byte past the end of the function, I wonder if we've got an off-by-one issue in the stackwalker; we have a bunch of heuristics in that area and I remember we had to make some compromises there.
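The arithmetic above can be checked directly against the breakpad FUNC record, whose fields are address, size, parameter size, and name: the function covers [start, start + size), so the crashing address falls exactly one past its end. A minimal check:

```python
# Breakpad FUNC records are "FUNC <address> <size> <param_size> <name>";
# the function spans [address, address + size). Checking the crash
# address against the record quoted above shows it is one past the end.
record = "FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)"
_, addr, size, _param, name = record.split(" ", 4)
start, size = int(addr, 16), int(size, 16)

crash = 0x36E8DA9
print(hex(start + size))              # 0x36e8da9
print(start <= crash < start + size)  # False: one byte past the end
```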

Flags: needinfo?(gsvelto)

I filed issue #780 in the rust-minidump repo to address this (though fixing it means deploying a hackish heuristic to work around the problem).

Stack walking issue aside, I've noticed that there are still several dependencies of the snap packages whose symbols are missing. For example:

These aren't a big deal, as they rarely appear in stack traces, but I still wanted to make a note of it to double-check at a later time.

A quick update from the discussion in the rust-minidump issue: this crash is an odd beast. The crashing address is off by one with respect to the function where the crash is supposed to be happening, but the signal that caused it originated in the Linux kernel, not in userspace. This might explain the odd address, so I'm trying to narrow down my hack to target only this very specific case.

(In reply to Gabriele Svelto [:gsvelto] from comment #10)

> It seems the issue has morphed from a missing-symbols problem into a stack walker one. Take this crash for example. The crashing address is 0x36e8da9 within libxul.so. If I look at the symbols, we have this line:
>
> FUNC 36e8900 4a9 0 webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*)
>
> 0x36e8900 + 0x4a9 is exactly 0x36e8da9, so that's where we should be. We do have unwinding information for that function too, yet the stackwalker falls back to stack scanning and ultimately considers the symbol missing. Given that the address is exactly one byte past the end of the function

The last 5 bytes of the function, which I suspect form the last instruction, are: e8 b7 cf fa ff

That's a call to 0x3695d60 as the last instruction. My first guess is that the callee was declared noreturn but returned anyway; I've heard of this happening and causing problems like this. The symbol file has:

FUNC 3695d60 5d4 0 rtc::webrtc_logging_impl::Log(rtc::webrtc_logging_impl::LogArgType const*, ...)

But that doesn't appear to be declared noreturn. However, the call might be inlined from a noreturn function; poking around in searchfox finds a few functions named FatalLog or WriteFatalLog with an RTC_NORETURN before the decl. (This looks like it's derived from Chromium logging, where an assertion failure / deliberate crash involves a log message with “fatal” level.)

Edited to add: the GitHub issue linked in comment #11 also found more or less the same thing.
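The call target can be verified from the raw bytes: `e8` is a near call whose operand is a signed 32-bit little-endian displacement relative to the address of the next instruction. Decoding the five bytes quoted above (a quick sketch):

```python
import struct

# Decode an x86-64 "call rel32" (opcode e8): the four bytes after the
# opcode are a signed little-endian displacement relative to the
# address of the *next* instruction.
def call_target(insn_addr, insn_bytes):
    assert insn_bytes[0] == 0xE8
    (rel,) = struct.unpack("<i", insn_bytes[1:5])
    return insn_addr + 5 + rel

target = call_target(0x36E8DA4, bytes([0xE8, 0xB7, 0xCF, 0xFA, 0xFF]))
print(hex(target))  # 0x3695d60, i.e. rtc::webrtc_logging_impl::Log
```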

After going through the discussion on GitHub and looking at the crashes once more, I think we're dealing with a compiler bug. The preprocessed code that's crashing is the following:

  if (spaBuffer->datas[0].chunk->size == 0) {
    !rtc::LogMessage::IsNoop<::rtc::LS_ERROR>() &&
        ::rtc::webrtc_logging_impl::LogCall() &
        ::rtc::webrtc_logging_impl::LogStreamer<>()
            << ::rtc::webrtc_logging_impl::LogMetadata(
                   "/home/gsvelto/projects/mozilla-central/third_party/libwebrtc/modules/desktop_capture/linux/wayland/moz_base_capturer_pipewire.cc",
                   474, ::rtc::LS_ERROR)
            << "Failed to get video stream: Zero size.";
    return;
  }

The crash happens when we return from a function call in the second line. The compiled code looks like this:

 36e8da4:       e8 b7 cf fa ff          call   3695d60 <_ZNKSt4lessISt17basic_string_viewIcSt11char_traitsIcEEEclERKS3_S6_@@xul108+0x2f90>
 36e8da9:       00 00                   add    %al,(%rax)

We're deep within inlined code that's manipulating the strings for the log message. Here's what's going on: the two zero bytes at offset 36e8da9 are padding which the compiler inserted, possibly because it thought that the call would never return, and that genuinely looks like a bug. However, since x86 is x86, 00 00 is a valid instruction, so it gets executed and accesses the memory pointed to by rax. Since rax contains a non-canonical address, this triggers a general protection fault, which our code reports as SIGSEGV / SI_KERNEL (since the signal was raised by the kernel).
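The "non-canonical address" part can be sanity-checked too. On x86-64 with 48-bit virtual addresses, bits 48-63 must be a copy of bit 47; dereferencing anything else raises a general protection fault rather than an ordinary page fault, which is why the signal arrives as SIGSEGV with si_code SI_KERNEL. A small sketch of the check (assuming 48-bit virtual addresses):

```python
# A "canonical" x86-64 address (48-bit VA) has bits 48..63 equal to a
# copy of bit 47. Dereferencing a non-canonical address raises #GP,
# which Linux reports as SIGSEGV / SI_KERNEL.
def is_canonical(addr, va_bits=48):
    top = addr >> (va_bits - 1)  # bit 47 plus everything above it
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1

print(is_canonical(0x00007FFFFFFFFFFF))  # True: top of the user range
print(is_canonical(0xFFFF800000000000))  # True: bottom of the kernel range
print(is_canonical(0x0000800000000000))  # False: non-canonical
```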

BTW I don't think we're hitting a DCHECK(), because there don't appear to be any calls to it in the affected code (though I may have missed some). Even if we did hit one, we'd enter WriteFatalLog(), which calls abort(), so it shouldn't return.

This is all a very long way of saying that the signature here cannot be improved: it accurately reflects what's going on, which is something that should never happen.

Just one last note: we are compiling with #define RTC_DCHECK_IS_ON 1, and we do hit WriteFatalLog() from time to time, but it works as expected. See this crash for example, which is from one of the affected builds: it's crashing where it's supposed to. So if a miscompilation is happening, either it's preventing WriteFatalLog() from being called or it's happening somewhere else entirely (and I think it's the latter, given that the last call before the crash is to some std::less<> template and not something within WebRTC's code).

(In reply to Gabriele Svelto [:gsvelto] from comment #17)

> (and I think it's the latter given that the last call before the crash is to some std::less<> template and not something within WebRTC's code).

I don't think that's correct; that's probably just the name of the last public symbol (.dynsym rather than .symtab) before the address, due to disassembling a stripped binary. 0x2f90 would be a fairly large offset for a string comparison function, and see also the line of the breakpad symbols I quoted in comment #14.

But I do tend to think this is a compiler bug, because I don't see how that call could legitimately be noreturn. From the list of inlines (from the GitHub thread), nothing in there is declared noreturn, and nothing "interesting" is going on: they're taking a sequence of operator<< calls, collecting the argument types into a template parameter pack, and eventually calling webrtc_logging_impl::Log (with an array of type tags and a varargs list of the values, which it then unpacks into a sequence of operator<< calls on another type, which seems weird but not wrong).

As for the RTC_DCHECK_NOTREACHED calls, those could fire if the LogArgType sequence were ill-formed, but we can see what those values are from the C++ types in the symbol name, and they're fine (kLogMetadata, kCharP, kEnd); for that to make any sense we'd need something statically known to be bad, and it would somehow have to take a different path at runtime.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #18)

> I don't think that's correct; that's probably just the name of the last public symbol (.dynsym rather than .symtab) before the address, due to disassembling a stripped binary. 0x2f90 would be a fairly large offset for a string comparison function, and see also the line of the breakpad symbols I quoted in comment #14.

You're right, I forgot that the binary was stripped -_-

It's definitely rtc::webrtc_logging_impl::Log() then, but that function also gets called in a bunch of other places where execution continues after it; 0x36e8da4 is the only location where the call to it is followed by padding. If I look for other call sites like that I can find some, but they're all for noreturn functions, e.g.:

 1091aa8:       e8 83 2d ff ff          call   1084830 <__stack_chk_fail@plt>
 1091aad:       00 00                   add    BYTE PTR [rax],al
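The search for other call sites followed by padding can be sketched as a byte-pattern scan over the .text bytes. This is purely illustrative (not the method actually used in the bug), and a real scan should disassemble properly, since e8 can also appear inside longer instructions:

```python
import struct

# Illustrative sketch: find "call rel32" (e8) immediately followed by
# two 00 padding bytes in a raw .text blob. Not the exact method used
# in this bug; e8 can occur inside longer instructions, so a proper
# disassembler should be used for real work.
def find_call_then_padding(text, base):
    hits = []
    for i in range(len(text) - 6):
        if text[i] == 0xE8 and text[i + 5] == 0 and text[i + 6] == 0:
            (rel,) = struct.unpack("<i", text[i + 1:i + 5])
            hits.append((base + i, base + i + 5 + rel))  # (call site, target)
    return hits

# The site at 0x36e8da4 quoted in this bug, reassembled as raw bytes:
blob = bytes([0xE8, 0xB7, 0xCF, 0xFA, 0xFF, 0x00, 0x00])
print([(hex(a), hex(t)) for a, t in find_call_then_padding(blob, 0x36E8DA4)])
# [('0x36e8da4', '0x3695d60')]
```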