Closed Bug 1741454 Opened 3 years ago Closed 3 years ago

Tab crash in libxul.so ExpandColorMap, when using jemalloc and LLVM 13

Categories

(Firefox Build System :: Toolchains, defect)

Firefox 94
x86_64
Linux
defect

Tracking

(firefox-esr91 unaffected, firefox94 unaffected, firefox95 fixed, firefox96 fixed)

RESOLVED FIXED
96 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox94 --- unaffected
firefox95 --- fixed
firefox96 --- fixed

People

(Reporter: oxalicc, Assigned: glandium, NeedInfo)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(4 files)

Attached file ff-crash-backtrace.txt (deleted) —

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0

Steps to reproduce:

  1. On the KDE Plasma desktop environment.
  2. export MOZ_ENABLE_WAYLAND=1
  3. Run Firefox with an empty profile.
  4. Open "https://youtube.com"
  5. Once loaded, quickly move the mouse cursor around different video thumbnails.
  6. A few seconds later, see "Tab crashed".

Actual results:

Almost deterministic tab crash.

A full stack backtrace with symbols is attached. The crashing function is ExpandColorMap, at firefox-94.0.1/media/libwebp/src/dec/vp8l_dec.c:1283.

The crash report can be seen here:
https://crash-stats.mozilla.org/report/index/2de601e4-b4e7-43ac-9721-9bd960211116

Note: I recently upgraded from Firefox 93.0 to 94.0.1, and the LLVM used for the build was upgraded from 11 to 13. Firefox 93.0 doesn't crash in the same situation.

Expected results:

No crash.

I forgot to mention that I'm on a KDE Wayland session; that's why I set export MOZ_ENABLE_WAYLAND=1.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core
Component: Widget: Gtk → ImageLib
OS: Unspecified → Linux
Hardware: Unspecified → x86_64

Reproducible with the same stack trace under X11.

What kind of builds are you using? Provided by your distro? From Mozilla? Built yourself? Can you bisect using mozregression to see what caused the problem?

If you have it in gdb you can go to the frame for mozilla::image::AnimationSurfaceProvider::Run; mImage.mRawPtr->mURI.mRawPtr holds the URI of the image being decoded. In lldb I did:

(lldb) p mImage.mRawPtr->mURI.mRawPtr
(mozilla::net::nsStandardURL *) $9 = 0x0000000111caac10
(lldb) p ((mozilla::net::nsStandardURL *) 0x0000000111caac10).mSpec

to display the URI. You might have to look at mSpec.mData. Once you have the URI you can see if just loading that image reproduces the problem.

Flags: needinfo?(oxalicc)
Flags: needinfo?(mid.autumn0moon)
Attached file ff-94.0.1-debug-build.log.gz (deleted) —

What kind of builds are you using? Provided by your distro? From Mozilla? Built yourself? Can you bisect using mozregression to see what caused the problem?

I'm on NixOS. Hardware and system information can be found in the crash report link I provided before.
https://crash-stats.mozilla.org/report/index/2de601e4-b4e7-43ac-9721-9bd960211116

From personal discussion, @mid.autumn0moon also seems to be on NixOS, and on an AMD platform with integrated graphics like me.

Firefox is built from this Nix file (with additional --enable-debug-symbols and --disable-strip to include symbols; the issue exists without these flags):
https://github.com/NixOS/nixpkgs/blob/b964655a7d2ab1dcff46aa73d5510239bd193c88/pkgs/applications/networking/browsers/firefox/common.nix#L250
I uploaded the full build log, which includes the configure flags and environment.

If you have it in gdb you can go to the frame for mozilla::image::AnimationSurfaceProvider::Run; mImage.mRawPtr->mURI.mRawPtr holds the URI of the image being decoded

Here's what I tried,

(gdb) frame 16
#16 0x00007fb1c29da62d in mozilla::image::AnimationSurfaceProvider::Run (this=0x7fb18ad3db20) at /build/firefox-94.0.1/image/AnimationSurfaceProvider.cpp:209
209     /build/firefox-94.0.1/image/AnimationSurfaceProvider.cpp: No such file or directory.
(gdb) p mImage.mRawPtr->mURI.mRawPtr
$1 = (nsISupports *) 0x7fb18ad3c7a0
(gdb) p ((mozilla::net::nsStandardURL *) 0x7fb18ad3c7a0).mSpec
$2 = {<nsTSubstring<char>> = {<mozilla::detail::nsTStringRepr<char>> = {mData = 0x7fb18acba908 "https://i.ytimg.com/an_webp/fUyTKaU6zRc/mqdefault_6s.webp?du=3000&sqp=CM3Fz4wG&rs=AOn4CLCs5ij3lzc0yY6tOEOkvatRBZGUpA", mLength = 116, mDataFlags = 5,
      mClassFlags = mozilla::detail::StringClassFlags::NULL_TERMINATED}, static kMaxCapacity = <optimized out>}, <No data fields>}

The image URL seems to be:
https://i.ytimg.com/an_webp/fUyTKaU6zRc/mqdefault_6s.webp?du=3000&sqp=CM3Fz4wG&rs=AOn4CLCs5ij3lzc0yY6tOEOkvatRBZGUpA
But directly opening this URL does not reproduce the issue.

Also worth mentioning: when I build with --enable-debug, --enable-profiling and --disable-optimize, this issue disappears. I think it might be related to optimization.

Can you bisect using mozregression to see what caused the problem?

I'll try. A single build takes me more than an hour, so it will take some time.

Flags: needinfo?(oxalicc)

Building to bisect is a long process; you don't need to go through that. If you download an official build of Firefox from mozilla can you still reproduce the issue?

If you download an official build of Firefox from mozilla can you still reproduce the issue?

I cannot reproduce this issue with the official 94.0.1 release binaries.

Here is a tiny PoC for this crash. It is 100% reliable on my system.
It works by repeatedly loading a specific webp image in an iframe until the crash is triggered.

It's worth noting that the webp image used here did not come from YouTube, so the issue is not isolated to them.

https://buckley310.github.io/nixos_firefox_webp_poc/index.html

Attached file disassembly.txt (deleted) —

I can also reproduce this bug with Sean's testcase. According to gdb, the crash happens at a movdqa xmm0,XMMWORD PTR [r12] instruction (at 58396fd in the attached disassembly of DecodeImageStream; ExpandColorMap is inlined into DecodeImageStream in the Nix build). ExpandColorMap does some funny stuff with pointer aliasing, so maybe LLVM is assuming pointers are aligned when they aren't, much like this bug. In my test, r12 was 0x7fffb0d6e298 and new_data pointed to the same memory location.

This happens with llvm/clang++ 13.0.0.
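
For reference, the relevant pattern in ExpandColorMap is roughly the following (a paraphrased sketch for illustration; the function name and exact arithmetic are mine, not the actual vp8l_dec.c source):

#include <stdint.h>
#include <stdlib.h>

/* Sketch of the aliasing pattern: the expanded color map is allocated as a
 * uint32_t buffer, but the expansion loop reads and writes it through
 * uint8_t* aliases. The autovectorizer can turn the byte loop into 16-byte
 * vector accesses, and if it also believes the malloc result is 16-byte
 * aligned it may emit aligned loads (movdqa). */
uint32_t* expand_color_map_sketch(const uint32_t* color_map, int num_colors) {
  uint32_t* new_color_map =
      (uint32_t*)malloc((size_t)num_colors * sizeof(*new_color_map));
  if (new_color_map == NULL) return NULL;
  const uint8_t* data = (const uint8_t*)color_map;
  uint8_t* new_data = (uint8_t*)new_color_map;  /* byte view of the new buffer */
  new_color_map[0] = color_map[0];
  for (int i = 4; i < 4 * num_colors; ++i) {
    /* byte-wise running sum over the previous entry's bytes */
    new_data[i] = (uint8_t)(data[i] + new_data[i - 4]);
  }
  return new_color_map;
}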

Status: UNCONFIRMED → NEW
Ever confirmed: true

After digging through the coredump with the disassembly, I figured out that *($rsp+0x48) is transform->data_ and r15d is num_colors << 2. So:

(gdb) print *(VP8LTransform*)(*(void **)($rsp+0x48)-16) # transform
= {type_ = COLOR_INDEXING_TRANSFORM, bits_ = 3, xsize_ = 384, ysize_ = 367, data_ = 0x7f0e4371d460}
(gdb) print $r15d >> 2 # num_colors
= 2

There are only 8 bytes (64 bits) to process, and that is also the size of the memory allocated by malloc. movdqa is definitely wrong here, since it tries to read 16 bytes of data (xmm is 128 bits) and also requires 16-byte alignment; neither prerequisite holds. Maybe blame LLVM 13?

To be clear, the situation here is:

  1. malloc returns an 8-byte-aligned pointer when allocating 8 bytes.
  2. LLVM assumes it is 16-byte aligned and directly uses movdqa to read an aligned 16 bytes.
  3. movdqa fails the alignment requirement.

I'm not sure which of (1) and (2) is considered expected.
Reading 16 bytes from an 8-byte allocation also seems quite wrong to me, though someone says it might be acceptable for vectorization as long as the loads are 16-byte aligned. I'm not sure about that.
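
A minimal hand-written illustration of the failing access (my own sketch using SSE intrinsics directly, not the actual generated code; whether it faults depends on what alignment malloc(8) actually returns):

#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

/* movdqa (_mm_load_si128) requires its operand to be 16-byte aligned; if the
 * pointer is only 8-byte aligned the instruction raises #GP, which shows up
 * as a crash. It also reads 16 bytes from an 8-byte allocation, matching the
 * situation described above. */
int main(void) {
  uint8_t* p = (uint8_t*)malloc(8);
  if (p == NULL) return 1;
  __m128i v = _mm_load_si128((const __m128i*)p);  /* aligned 16-byte load */
  return _mm_cvtsi128_si32(v) & 1;
}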

We are using glibc 2.33 in nixpkgs; not sure if that's related.

I agree with the above, but "blame" is too strong a word. uint8_t and uint32_t are not compatible types, so I believe accessing the same memory location through pointers of these two types is undefined behavior (here's a write-up about it), and LLVM is correct to do the optimization.

Actually, I just checked, and LLVM's official stance is (or was at some point) that uint8_t should be treated as a char type even though the spec does not require it. I retract the previous comment; this is probably an LLVM bug unless there's some other undefined behavior somewhere.
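
To illustrate the distinction (this assumes uint8_t is a typedef for unsigned char, as it is on the platforms in question; the function is my own example, not code from libwebp):

#include <stddef.h>
#include <stdint.h>

/* If uint8_t is unsigned char, character-typed lvalues may alias any object,
 * so walking a uint32_t buffer byte-by-byte is well-defined. If uint8_t were
 * a distinct extended integer type instead, these accesses would violate the
 * strict-aliasing rule and the compiler could assume they never touch the
 * uint32_t data. */
uint32_t sum_bytes(const uint32_t* words, size_t n_words) {
  const uint8_t* bytes = (const uint8_t*)words;  /* byte-typed alias */
  uint32_t total = 0;
  for (size_t i = 0; i < n_words * sizeof(uint32_t); ++i) {
    total += bytes[i];
  }
  return total;
}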

Okay, I figured it out.
We have jemalloc enabled, but it returns 8-byte-aligned memory from malloc for small allocations, which violates the 16-byte-alignment assumption recently added to clang in LLVM 13.

From this jemalloc issue, https://github.com/jemalloc/jemalloc/issues/1533#issuecomment-507915829

Practically, this means an alignment of 8 for 8-byte allocations, and 16 for 16-or-more-byte allocations, on 64-bit systems. (This can be tweaked as a config option, as well).
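
A quick way to see what a given allocator actually returns is a tiny check like this (my own test program, not part of the attachments):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Prints the largest power-of-two alignment of a few 8-byte allocations.
 * glibc malloc returns at least 16-byte alignment on 64-bit systems;
 * jemalloc's 8-byte size class can return pointers that are only 8-byte
 * aligned. */
int main(void) {
  for (int i = 0; i < 8; ++i) {
    void* p = malloc(8);
    if (p == NULL) return 1;
    printf("%p alignment: %lu\n", p,
           1UL << __builtin_ctzl((unsigned long)(uintptr_t)p));
  }
  return 0;
}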

And this clang patch, which is included in LLVM 13, added the assumption that malloc returns a 16-byte-aligned pointer: https://reviews.llvm.org/D100879
You can also confirm it by seeing how clang optimizes (uintptr_t)malloc(8) % 16 == 0: https://godbolt.org/z/ennqs1Ma6
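i.e. something along these lines (the function name is mine; the body mirrors the godbolt example):

#include <stdint.h>
#include <stdlib.h>

/* With clang 13 at -O2 this folds to a constant 1: the compiler assumes the
 * malloc result is 16-byte aligned even for an 8-byte request. The pointer
 * is intentionally leaked to keep the example minimal. */
int malloc8_is_16_byte_aligned(void) {
  void* p = malloc(8);
  return ((uintptr_t)p % 16) == 0;
}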

I rebuilt firefox with --disable-jemalloc and this issue is gone.

Summary: Tab crash in libxul.so ExpandColorMap → Tab crash in libxul.so ExpandColorMap, when using jemalloc and LLVM 13
Severity: -- → S3
Priority: -- → P3
Component: ImageLib → Memory Allocator

I agree with the latest messages in the LLVM issue that this is a clang problem.

Component: Memory Allocator → Toolchains
Product: Core → Firefox Build System

The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --
Regressed by: clang-13
Has Regression Range: --- → yes
Attached file Bug 1741454 - Revert clang's D100879. (deleted) —

It assumes 16-byte alignment for small allocations on 64-bit platforms, which
is not true for mozjemalloc allocations.
This only addresses the problem with mozilla.org-produced builds. We may
have to figure out something for downstreams.

Assignee: nobody → mh+mozilla
Status: NEW → ASSIGNED

Set release status flags based on info from the regressing bug 1731582

Crash Signature: [@ libxul.so@0x58376fd | firefox@0x168a0b ]
Keywords: crash, regression

Comment on attachment 9252051 [details]
Bug 1741454 - Revert clang's D100879.

Beta/Release Uplift Approval Request

  • User impact if declined: Possibly random crashes, depending on how the compiler decides to do its optimizations vs. how small allocations are actually laid out in memory.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Straightforward backout of a clang upstream patch.
  • String changes made/needed: N/A
Attachment #9252051 - Flags: approval-mozilla-beta?
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

Comment on attachment 9252051 [details]
Bug 1741454 - Revert clang's D100879.

Approved for our last beta.

Attachment #9252051 - Flags: approval-mozilla-beta? → approval-mozilla-beta+