Closed Bug 1741454 Opened 3 years ago Closed 3 years ago

Tab crash in libxul.so ExpandColorMap, when using jemalloc and LLVM 13

Categories

(Firefox Build System :: Toolchains, defect)

Firefox 94
x86_64
Linux
defect

Tracking

(firefox-esr91 unaffected, firefox94 unaffected, firefox95 fixed, firefox96 fixed)

RESOLVED FIXED
96 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox94 --- unaffected
firefox95 --- fixed
firefox96 --- fixed

People

(Reporter: oxalicc, Assigned: glandium, NeedInfo)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(4 files)

Attached file ff-crash-backtrace.txt (deleted) —

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0

Steps to reproduce:

  1. On the KDE Plasma desktop environment.
  2. export MOZ_ENABLE_WAYLAND=1
  3. Run Firefox with an empty profile.
  4. Open "https://youtube.com"
  5. Once loaded, quickly move the mouse cursor around different video thumbnails.
  6. A few seconds later, see "Tab crashed".

Actual results:

Almost deterministic tab crash.

A full stack backtrace with symbols is attached. The crashing function is ExpandColorMap, at firefox-94.0.1/media/libwebp/src/dec/vp8l_dec.c:1283.

The crash report can be seen here:
https://crash-stats.mozilla.org/report/index/2de601e4-b4e7-43ac-9721-9bd960211116

Note: I recently upgraded from Firefox 93.0 to 94.0.1, and the LLVM used for the build was upgraded from 11 to 13. Firefox 93.0 doesn't crash in the same situation.

Expected results:

No crash.

I forgot to mention that I'm on a KDE Wayland session; that's why I set export MOZ_ENABLE_WAYLAND=1.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core
Component: Widget: Gtk → ImageLib
OS: Unspecified → Linux
Hardware: Unspecified → x86_64

Reproducible with the same stack trace under X11.

What kind of builds are you using? Provided by your distro? From Mozilla? Built yourself? Can you bisect using mozregression to see what caused the problem?

If you have it in gdb you can go to the frame for mozilla::image::AnimationSurfaceProvider::Run; mImage.mRawPtr->mURI.mRawPtr holds the URI of the image being decoded. In lldb I did:

(lldb) p mImage.mRawPtr->mURI.mRawPtr
(mozilla::net::nsStandardURL *) $9 = 0x0000000111caac10
(lldb) p ((mozilla::net::nsStandardURL *) 0x0000000111caac10).mSpec

to display the URI. You might have to look at mSpec.mData. Once you have the URI you can see if just loading that image reproduces the problem.

Flags: needinfo?(oxalicc)
Flags: needinfo?(mid.autumn0moon)
Attached file ff-94.0.1-debug-build.log.gz (deleted) —

What kind of builds are you using? Provided by your distro? From Mozilla? Built yourself? Can you bisect using mozregression to see what caused the problem?

I'm on NixOS. Hardware and system information can be found in the crash report link I provided before.
https://crash-stats.mozilla.org/report/index/2de601e4-b4e7-43ac-9721-9bd960211116

From personal discussion, @mid.autumn0moon also seems to be on NixOS, and on an AMD platform with integrated graphics like me.

Firefox is built from this Nix file (with additional --enable-debug-symbols and --disable-strip to include symbols; the issue exists without these flags):
https://github.com/NixOS/nixpkgs/blob/b964655a7d2ab1dcff46aa73d5510239bd193c88/pkgs/applications/networking/browsers/firefox/common.nix#L250
I uploaded the full build log, which includes the configure flags and environment.

If you have it in gdb you can go to the frame for mozilla::image::AnimationSurfaceProvider::Run; mImage.mRawPtr->mURI.mRawPtr holds the URI of the image being decoded

Here's what I tried,

(gdb) frame 16
#16 0x00007fb1c29da62d in mozilla::image::AnimationSurfaceProvider::Run (this=0x7fb18ad3db20) at /build/firefox-94.0.1/image/AnimationSurfaceProvider.cpp:209
209     /build/firefox-94.0.1/image/AnimationSurfaceProvider.cpp: No such file or directory.
(gdb) p mImage.mRawPtr->mURI.mRawPtr
$1 = (nsISupports *) 0x7fb18ad3c7a0
(gdb) p ((mozilla::net::nsStandardURL *) 0x7fb18ad3c7a0).mSpec
$2 = {<nsTSubstring<char>> = {<mozilla::detail::nsTStringRepr<char>> = {mData = 0x7fb18acba908 "https://i.ytimg.com/an_webp/fUyTKaU6zRc/mqdefault_6s.webp?du=3000&sqp=CM3Fz4wG&rs=AOn4CLCs5ij3lzc0yY6tOEOkvatRBZGUpA", mLength = 116, mDataFlags = 5,
      mClassFlags = mozilla::detail::StringClassFlags::NULL_TERMINATED}, static kMaxCapacity = <optimized out>}, <No data fields>}

The image URL seems to be:
https://i.ytimg.com/an_webp/fUyTKaU6zRc/mqdefault_6s.webp?du=3000&sqp=CM3Fz4wG&rs=AOn4CLCs5ij3lzc0yY6tOEOkvatRBZGUpA
But directly opening this URL does not reproduce the issue.

Also worth mentioning: when I build with --enable-debug, --enable-profiling and --disable-optimize, this issue disappears. I think it might be related to optimization.

Can you bisect using mozregression to see what caused the problem?

I'll try. A single build takes me more than an hour, so it will take some time.

Flags: needinfo?(oxalicc)

Building to bisect is a long process; you don't need to go through that. If you download an official build of Firefox from mozilla can you still reproduce the issue?

If you download an official build of Firefox from mozilla can you still reproduce the issue?

I cannot reproduce this issue with the official 94.0.1 release binaries.

Here is a tiny PoC for this crash. It is 100% reliable on my system.
It works by repeatedly loading a specific webp image in an iframe until the crash is triggered.

It's worth noting that the webp image used here did not come from YouTube, so the issue is not isolated to them.

https://buckley310.github.io/nixos_firefox_webp_poc/index.html

Attached file disassembly.txt (deleted) —

I can also reproduce this bug with Sean's testcase. According to gdb, the crash happens at a movdqa xmm0,XMMWORD PTR [r12] instruction (at 58396fd in the attached disassembly of DecodeImageStream; ExpandColorMap is inlined into DecodeImageStream in the Nix build). ExpandColorMap does some funny stuff with pointer aliasing, so maybe LLVM is assuming pointers are aligned when they aren't, much like this bug. In my test, r12 was 0x7fffb0d6e298 and new_data pointed to the same memory location.

This happens with llvm/clang++ 13.0.0.
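
For reference, the relevant pattern in ExpandColorMap is roughly the following (a paraphrased sketch for illustration; the function name and exact arithmetic are mine, not the actual vp8l_dec.c source):

#include <stdint.h>
#include <stdlib.h>

/* Sketch of the aliasing pattern: the expanded color map is allocated as a
 * uint32_t buffer, but the expansion loop reads and writes it through
 * uint8_t* aliases. The autovectorizer can turn the byte loop into 16-byte
 * vector accesses, and if it also believes the malloc result is 16-byte
 * aligned it may emit aligned loads (movdqa). */
uint32_t* expand_color_map_sketch(const uint32_t* color_map, int num_colors) {
  uint32_t* new_color_map =
      (uint32_t*)malloc((size_t)num_colors * sizeof(*new_color_map));
  if (new_color_map == NULL) return NULL;
  const uint8_t* data = (const uint8_t*)color_map;
  uint8_t* new_data = (uint8_t*)new_color_map;  /* byte view of the new buffer */
  new_color_map[0] = color_map[0];
  for (int i = 4; i < 4 * num_colors; ++i) {
    /* byte-wise running sum over the previous entry's bytes */
    new_data[i] = (uint8_t)(data[i] + new_data[i - 4]);
  }
  return new_color_map;
}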

Status: UNCONFIRMED → NEW
Ever confirmed: true

After digging through the coredump with the disassembly, I figured out that *($rsp+0x48) is transform->data_ and r15d is num_colors << 2. So:

(gdb) print *(VP8LTransform*)(*(void **)($rsp+0x48)-16) # transform
= {type_ = COLOR_INDEXING_TRANSFORM, bits_ = 3, xsize_ = 384, ysize_ = 367, data_ = 0x7f0e4371d460}
(gdb) print $r15d >> 2 # num_colors
= 2

There are only 8 bytes (64 bits) to process, and that is also the size of the memory allocated by malloc. movdqa is definitely wrong here, since it tries to read 16 bytes of data (xmm is 128 bits) and also requires 16-byte alignment; neither prerequisite holds. Maybe blame LLVM 13?

To be clear, the situation here is:

  1. malloc returns an 8-byte-aligned pointer when allocating 8 bytes.
  2. LLVM assumes it is 16-byte aligned and directly uses movdqa to read an aligned 16 bytes.
  3. movdqa fails the alignment requirement.

I'm not sure which of (1) and (2) is considered expected.
Reading 16 bytes from an 8-byte allocation also seems quite wrong to me, though someone says it might be acceptable for vectorization as long as the loads are 16-byte aligned. I'm not sure about that.
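
A minimal hand-written illustration of the failing access (my own sketch using SSE intrinsics directly, not the actual generated code; whether it faults depends on what alignment malloc(8) actually returns):

#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

/* movdqa (_mm_load_si128) requires its operand to be 16-byte aligned; if the
 * pointer is only 8-byte aligned the instruction raises #GP, which shows up
 * as a crash. It also reads 16 bytes from an 8-byte allocation, matching the
 * situation described above. */
int main(void) {
  uint8_t* p = (uint8_t*)malloc(8);
  if (p == NULL) return 1;
  __m128i v = _mm_load_si128((const __m128i*)p);  /* aligned 16-byte load */
  return _mm_cvtsi128_si32(v) & 1;
}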

We are using glibc 2.33 in nixpkgs; not sure if that's related.

I agree with the above, but "blame" is too strong a word. uint8_t and uint32_t are not compatible types, so I believe accessing the same memory location through pointers of these two types is undefined behavior (here's a write-up about it), and LLVM is correct to do the optimization.

Actually, I just checked, and LLVM's official stance is (or was at some point) that uint8_t should be treated as a char type even though the spec does not require it. I retract the previous comment; this is probably an LLVM bug unless there's some other undefined behavior somewhere.
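
To illustrate the distinction (this assumes uint8_t is a typedef for unsigned char, as it is on the platforms in question; the function is my own example, not code from libwebp):

#include <stddef.h>
#include <stdint.h>

/* If uint8_t is unsigned char, character-typed lvalues may alias any object,
 * so walking a uint32_t buffer byte-by-byte is well-defined. If uint8_t were
 * a distinct extended integer type instead, these accesses would violate the
 * strict-aliasing rule and the compiler could assume they never touch the
 * uint32_t data. */
uint32_t sum_bytes(const uint32_t* words, size_t n_words) {
  const uint8_t* bytes = (const uint8_t*)words;  /* byte-typed alias */
  uint32_t total = 0;
  for (size_t i = 0; i < n_words * sizeof(uint32_t); ++i) {
    total += bytes[i];
  }
  return total;
}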

Okay, I figured it out.
We have jemalloc enabled, but it returns 8-byte-aligned memory from malloc for small allocations, which violates the 16-byte-alignment assumption recently added to clang in LLVM 13.

From this jemalloc issue, https://github.com/jemalloc/jemalloc/issues/1533#issuecomment-507915829

Practically, this means an alignment of 8 for 8-byte allocations, and 16 for 16-or-more-byte allocations, on 64-bit systems. (This can be tweaked as a config option, as well).
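
A quick way to see what a given allocator actually returns is a tiny check like this (my own test program, not part of the attachments):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Prints the largest power-of-two alignment of a few 8-byte allocations.
 * glibc malloc returns at least 16-byte alignment on 64-bit systems;
 * jemalloc's 8-byte size class can return pointers that are only 8-byte
 * aligned. */
int main(void) {
  for (int i = 0; i < 8; ++i) {
    void* p = malloc(8);
    if (p == NULL) return 1;
    printf("%p alignment: %lu\n", p,
           1UL << __builtin_ctzl((unsigned long)(uintptr_t)p));
  }
  return 0;
}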

And this clang patch, which is included in LLVM 13, added the assumption that malloc returns a 16-byte-aligned pointer: https://reviews.llvm.org/D100879
You can also confirm it by seeing how clang optimizes (uintptr_t)malloc(8) % 16 == 0: https://godbolt.org/z/ennqs1Ma6
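i.e. something along these lines (the function name is mine; the body mirrors the godbolt example):

#include <stdint.h>
#include <stdlib.h>

/* With clang 13 at -O2 this folds to a constant 1: the compiler assumes the
 * malloc result is 16-byte aligned even for an 8-byte request. The pointer
 * is intentionally leaked to keep the example minimal. */
int malloc8_is_16_byte_aligned(void) {
  void* p = malloc(8);
  return ((uintptr_t)p % 16) == 0;
}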

I rebuilt firefox with --disable-jemalloc and this issue is gone.

Summary: Tab crash in libxul.so ExpandColorMap → Tab crash in libxul.so ExpandColorMap, when using jemalloc and LLVM 13
Severity: -- → S3
Priority: -- → P3
Component: ImageLib → Memory Allocator

I agree with the latest messages in the LLVM issue that this is a clang problem.

Component: Memory Allocator → Toolchains
Product: Core → Firefox Build System

The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --
Regressed by: clang-13
Has Regression Range: --- → yes
Attached file Bug 1741454 - Revert clang's D100879. (deleted) —

It assumes 16-byte alignment for small allocations on 64-bit platforms, which
is not true for mozjemalloc allocations.
This only addresses the problem with mozilla.org-produced builds. We may
have to figure out something for downstreams.

Assignee: nobody → mh+mozilla
Status: NEW → ASSIGNED

Set release status flags based on info from the regressing bug 1731582

Crash Signature: [@ libxul.so@0x58376fd | firefox@0x168a0b ]
Keywords: crash, regression

Comment on attachment 9252051 [details]
Bug 1741454 - Revert clang's D100879.

Beta/Release Uplift Approval Request

  • User impact if declined: Possibly random crashes, depending on how the compiler decides to do its optimizations vs. how small allocations are actually laid out in memory.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Straightforward backout of a clang upstream patch.
  • String changes made/needed: N/A
Attachment #9252051 - Flags: approval-mozilla-beta?
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

Comment on attachment 9252051 [details]
Bug 1741454 - Revert clang's D100879.

Approved for our last beta.

Attachment #9252051 - Flags: approval-mozilla-beta? → approval-mozilla-beta+