Tab crash in libxul.so ExpandColorMap, when using jemalloc and LLVM 13
Categories
(Firefox Build System :: Toolchains, defect)
Tracking
(firefox-esr91 unaffected, firefox94 unaffected, firefox95 fixed, firefox96 fixed)
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox94 | --- | unaffected |
firefox95 | --- | fixed |
firefox96 | --- | fixed |
People
(Reporter: oxalicc, Assigned: glandium, NeedInfo)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(4 files)
(deleted),
text/plain
|
Details | |
(deleted),
application/gzip
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details |
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0
Steps to reproduce:
- On KDE plasma desktop environment.
export MOZ_ENABLE_WAYLAND=1
- Run firefox on an empty profile.
- Open "https://youtube.com"
- When loaded, let the mouse cursor quickly move around different video thumbnails.
- Few seconds later, see "Tab crashed"
Actual results:
Almost deterministic tab crash.
Full stack backtrace with symbols is in attachments. The crashed function is ExpandColorMap, firefox-94.0.1/media/libwebp/src/dec/vp8l_dec.c:1283
Crash report can be seen here,
https://crash-stats.mozilla.org/report/index/2de601e4-b4e7-43ac-9721-9bd960211116
Note: I recently upgraded from firefox 93.0 to 94.0.1, and the LLVM used for build is upgraded from 11 -> 13. Firefox 93.0 doesn't crash in the same situation.
Expected results:
Not crash.
Forget to mention that, I'm on KDE wayland session. That's why I set export MOZ_ENABLE_WAYLAND=1
.
Comment 2•3 years ago
|
||
The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Comment 3•3 years ago
|
||
Reproducible with the same trace stack, under X11.
Comment 4•3 years ago
|
||
What kind of builds are you using? Provided by your distro? From Mozilla? Built yourself? Can you bisect using mozregression to see what caused the problem?
If you have it in gdb you can goto the frame for mozilla::image::AnimationSurfaceProvider::Run, mImage.mRawPtr->mURI.mRawPtr holds the uri of the image being decoded, I lldb I did:
(lldb) p mImage.mRawPtr->mURI.mRawPtr
(mozilla::net::nsStandardURL *) $9 = 0x0000000111caac10
(lldb) p ((mozilla::net::nsStandardURL *) 0x0000000111caac10).mSpec
to get display the uri. You might have to look at mSpec.mData. Once you have the uri you can see if just loading that image reproduces the problem.
Updated•3 years ago
|
What kind of builds are you using? Provided by your distro? From Mozilla? Built yourself? Can you bisect using mozregression to see what caused the problem?
I'm on NixOS. Hardware and system information can be found in the crash report link I provided before.
https://crash-stats.mozilla.org/report/index/2de601e4-b4e7-43ac-9721-9bd960211116
From personal discussion, @mid.autumn0moon seems to be also on NixOS and also on AMD platform with integrated graphics card like me.
The firefox is built from this nix file, (with additional --enable-debug-symbols
and --disable-strip
to include symbols. The issue exists without these flags)
https://github.com/NixOS/nixpkgs/blob/b964655a7d2ab1dcff46aa73d5510239bd193c88/pkgs/applications/networking/browsers/firefox/common.nix#L250
I uploaded the full build log which includes configure flags and environments.
If you have it in gdb you can goto the frame for mozilla::image::AnimationSurfaceProvider::Run, mImage.mRawPtr->mURI.mRawPtr holds the uri of the image being decoded
Here's what I tried,
(gdb) frame 16
#16 0x00007fb1c29da62d in mozilla::image::AnimationSurfaceProvider::Run (this=0x7fb18ad3db20) at /build/firefox-94.0.1/image/AnimationSurfaceProvider.cpp:209
209 /build/firefox-94.0.1/image/AnimationSurfaceProvider.cpp: No such file or directory.
(gdb) p mImage.mRawPtr->mURI.mRawPtr
$1 = (nsISupports *) 0x7fb18ad3c7a0
(gdb) p ((mozilla::net::nsStandardURL *) 0x7fb18ad3c7a0).mSpec
$2 = {<nsTSubstring<char>> = {<mozilla::detail::nsTStringRepr<char>> = {mData = 0x7fb18acba908 "https://i.ytimg.com/an_webp/fUyTKaU6zRc/mqdefault_6s.webp?du=3000&sqp=CM3Fz4wG&rs=AOn4CLCs5ij3lzc0yY6tOEOkvatRBZGUpA", mLength = 116, mDataFlags = 5,
mClassFlags = mozilla::detail::StringClassFlags::NULL_TERMINATED}, static kMaxCapacity = <optimized out>}, <No data fields>}
The Image URL seems to be,
https://i.ytimg.com/an_webp/fUyTKaU6zRc/mqdefault_6s.webp?du=3000&sqp=CM3Fz4wG&rs=AOn4CLCs5ij3lzc0yY6tOEOkvatRBZGUpA
But directly opening this URL cannot reproduce the issue.
Also worth mentioning that, I used to set --enable-debug
, --enable-profiling
and --disable-optimize
, and this issue disappears. I think it might be related to optimization.
Can you bisect using mozregression to see what caused the problem?
I'll try. It costs me more than 1hr to build once. So it would take some time.
Comment 7•3 years ago
|
||
Building to bisect is a long process, you don't need to go through that. If you download an official build of Firefox from mozilla can you still reproduce the issue?
If you download an official build of Firefox from mozilla can you still reproduce the issue?
The official release binary distribution 94.0.1 cannot reproduce this issue.
Comment 9•3 years ago
|
||
Here is a tiny poc for this crash bug. It is 100% reliable on my system.
It works by repeatedly loading a specific webp image in an iframe until the crash is triggered.
It's worth noting that the webp image used here did not come from YouTube, so the issue is not isolated to them.
https://buckley310.github.io/nixos_firefox_webp_poc/index.html
Comment 10•3 years ago
|
||
I can also reproduce this bug with Sean's testcase. According to gdb, the crash happens at a movdqa xmm0,XMMWORD PTR [r12]
instruction (at 58396fd in the attached disassembly of DecodeImageStream; ExpandColorMap is inlined into DecodeImageStream in the Nix build). ExpandColorMap does some funny stuff with pointer aliasing, so maybe LLVM is assuming pointers are aligned when they aren't, much like this bug. In my test, r12 was 0x7fffb0d6e298 and new_data pointed to the same memory location.
This happens with llvm/clang++ 13.0.0.
Updated•3 years ago
|
Comment 11•3 years ago
|
||
I reported this upstream at https://bugs.chromium.org/p/webp/issues/detail?id=545.
Reporter | ||
Comment 12•3 years ago
|
||
After digging data in coredump with disassembly, I figured that *($rsp+0x48)
is transform->data_
and r15d
is num_colors << 2
. So,
(gdb) print *(VP8LTransform*)(*(void **)($rsp+0x48)-16) # transform
= {type_ = COLOR_INDEXING_TRANSFORM, bits_ = 3, xsize_ = 384, ysize_ = 367, data_ = 0x7f0e4371d460}
(gdb) print $r15d >> 2 # num_colors
= 2
There are only 8byte (64bit) to process, and so does the allocated memory from malloc
. movdqa
is definitely wrong here since it try to read 16 byte data (xmm is 128bit) and also requires 16byte alignment. Neither of the per-requirements are true. Maybe blame LLVM 13?
Reporter | ||
Comment 13•3 years ago
|
||
For clear, the situation here is,
- malloc returns 8byte aligned pointer when allocating 8byte.
- LLVM thinks its 16byte aligned and directly uses
movdqa
to read aligned 16byte. movdqa
fails the alignment requirement.
I'm not sure which of (1) and (2) is considered expected.
And also reading 16byte from 8byte allocation is also quite wrong for me. But someone says it might be okay for vectorization if only they are already 16byte-aligned. I'm not sure about it.
We are using glibc 2.33 in nixpkgs, not sure if it's related.
Comment 14•3 years ago
|
||
I agree with the above but "blame" is too strong a word. uint8_t and uint32_t are not compatible types, therefore I believe accessing the same memory location through pointers of these two types is undefined behavior (here's a write-up about it), so LLVM is correct in doing the optimization.
Comment 15•3 years ago
|
||
Actually, I just checked and LLVM's official stance is (or was at some point) that uint8_t should be a char type even though the spec does not require it. I retract the previous comment, this is probably a LLVM bug unless there's some other undefined behavior somewhere.
Reporter | ||
Comment 16•3 years ago
|
||
Okay I figured out.
We have jemalloc enabled but it would return 8-byte-aligned memory from malloc, which violates the 16-byte-aligned assumption added recently in clang from LLVM 13.
From this jemalloc issue, https://github.com/jemalloc/jemalloc/issues/1533#issuecomment-507915829
Practically, this means an alignment of 8 for 8-byte allocations, and 16 for 16-or-more-byte allocations, on 64-bit systems. (This can be tweaked as a config option, as well).
And this clang patch, which is included in LLVM 13, added assumptions that malloc
must returned 16-byte-aligned pointer, https://reviews.llvm.org/D100879
You can also confirm it by optimizing (uintptr_t)malloc(8) % 16 == 0
https://godbolt.org/z/ennqs1Ma6
I rebuilt firefox with --disable-jemalloc
and this issue is gone.
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 17•3 years ago
|
||
I agree with the latest messages in the llvm issue that this is a clang problem.
Comment 18•3 years ago
|
||
The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.
Updated•3 years ago
|
Assignee | ||
Comment 19•3 years ago
|
||
It assumes 16-bytes alignment for small allocations on 64-bits, which is
not true for mozjemalloc allocations.
This only addresses the problem with mozilla.org-produced builds. We may
have to figure out something for downstreams.
Updated•3 years ago
|
Comment 20•3 years ago
|
||
Set release status flags based on info from the regressing bug 1731582
Updated•3 years ago
|
Comment 21•3 years ago
|
||
Assignee | ||
Comment 22•3 years ago
|
||
Comment on attachment 9252051 [details]
Bug 1741454 - Revert clang's D100879.
Beta/Release Uplift Approval Request
- User impact if declined: Possibly random crashes, depending on how the compiler decides to do its optimizations vs. how small allocations are actually laid out in memory.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Straightforward backout of a clang upstream patch.
- String changes made/needed: N/A
Comment 23•3 years ago
|
||
bugherder |
Comment 24•3 years ago
|
||
Comment on attachment 9252051 [details]
Bug 1741454 - Revert clang's D100879.
Approved for our last beta.
Comment 25•3 years ago
|
||
bugherder uplift |
Description
•