Closed Bug 1779558 Opened 2 years ago Closed 2 years ago

Intel/Linux: Crash in [@ mozalloc_abort | abort | _iris_batch_flush]. Fixed in Linux kernel 6.1.4 (Fedora 36, Ubuntu 23.04, Debian 12 Bookworm)

Categories

(Core :: Graphics, defect)

Unspecified
Linux
defect

Tracking

()

RESOLVED MOVED
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- wontfix
firefox102 --- wontfix
firefox103 --- wontfix
firefox104 --- wontfix
firefox105 --- wontfix
firefox106 --- wontfix
firefox107 --- wontfix
firefox108 --- wontfix
firefox109 --- wontfix

People

(Reporter: aryx, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(2 files)

4 crashes from 4 different Linux installations of Firefox 102.0.1, isolated crashes for newer development versions

Crash report: https://crash-stats.mozilla.org/report/index/6c8cdf27-4f8f-455f-addc-adf090220714

MOZ_CRASH Reason: Redirecting call to abort() to mozalloc_abort


Top 10 frames of crashing thread:

0 firefox-bin mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:35
1 firefox-bin abort memory/mozalloc/mozalloc_abort.cpp:88
2 libgallium_dri.so [clone .lto_priv.0] [clone .cold] /usr/src/debug/mesa-22.1.3/src/gallium/drivers/nouveau/codegen/nv50_ir_from_tgsi.cpp:3653
3 libsqlite3.so.0 libsqlite3.so.0@0x0000000000082fe3
4 libgallium_dri.so tc_call_flush_resource.lto_priv.0 /usr/src/debug/mesa-22.1.3/src/gallium/auxiliary/util/u_threaded_context.c:3790
5 libgallium_dri.so _fini
6 libgallium_dri.so iris_fence_flush /usr/src/debug/mesa-22.1.3/src/gallium/drivers/iris/iris_fence.c:267
7 None @0x00007f4adb9b737f
8 libgallium_dri.so st_context_flush /usr/src/debug/mesa-22.1.3/src/mesa/state_tracker/st_manager.c:808
9 libgallium_dri.so dri_flush /usr/src/debug/mesa-22.1.3/src/gallium/frontends/dri/dri_drawable.c:522

All the reports seem to be from Arch Linux-based systems running Mesa's iris driver. The crashes started a few days after Arch released Mesa 22.1.3.

Mesa 22.1.4 was just released to Arch users.

PS: Still crashes with 22.1.4 (e.g. bp-2903b0e7-85a9-4352-a406-a3ab30220728).

The backtrace is misleading. There shouldn't be any nouveau code in use. Rather, I think this is the abort at the end of _iris_batch_flush at src/gallium/drivers/iris/iris_batch.c:1115.

I'm getting these crashes at random while playing H.264 and AV1 videos, but not VP9. All of them are hardware-decoded (ADL GT2).

I wonder if this is a race triggered by illegal multi-threaded use of GL in Firefox.

PS: Definitely also affects AV1. It mostly happens with YouTube (which is most of the videos I watch) but also other sites.

Is that ERROR("unhandled TGSI opcode: %u\n", tgsi.getOpcode()) at Converter::handleInstruction() ?

No, as I mentioned in comment #2, the backtrace pointing at nouveau code is nonsense but there's an abort in the iris code that fits the earlier frames (up until iris_fence_flush) better.

I get the following message from a mesa debug build:

iris: Failed to submit batchbuffer: No space left on device

Still happens even with GALLIUM_THREAD=0.
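
The debug message above reads like an ENOSPC failure: "No space left on device" is the standard C library description of that errno value, which suggests the batch submission call returned ENOSPC before Mesa aborted. A minimal Python sketch that checks the mapping; it does not touch the GPU, and the idea that the driver formats the error with strerror() is an assumption based on the message text:

```python
import errno
import os

# Text the C library associates with ENOSPC.
enospc_text = os.strerror(errno.ENOSPC)

# Message quoted from the Mesa debug build in this bug.
mesa_line = "iris: Failed to submit batchbuffer: No space left on device"

# True when the message suffix equals the ENOSPC description.
matches_enospc = mesa_line.endswith(enospc_text)
print(errno.errorcode[errno.ENOSPC], "->", enospc_text, "| match:", matches_enospc)
```

If the match holds, the failure is an out-of-space condition reported by the kernel rather than a decoding or shader error, which is consistent with the backtrace ending in _iris_batch_flush.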

Crash Signature: [@ mozalloc_abort | abort | (anonymous namespace)::Converter::handleInstruction] → [@ mozalloc_abort | abort | (anonymous namespace)::Converter::handleInstruction] [@ mozalloc_abort | abort | _iris_batch_flush.cold ]
Summary: Crash in [@ mozalloc_abort | abort | (anonymous namespace)::Converter::handleInstruction] → Crash in [@ mozalloc_abort | abort | _iris_batch_flush.cold ]
Blocks: wr-linux
Component: Graphics: CanvasWebGL → Graphics
OS: Unspecified → Linux
Attached file log.zst (deleted) —

Log made with:

MOZ_LOG="Dmabuf:5, PlatformDecoderModule:5" INTEL_DEBUG=submit \
  mozregression --launch 2022-08-31 -P stdout -a 'https://www.youtube.com/watch?v=NOHtTKXqDeo' \
  |& zstd -7 > log.zst
Crash Signature: [@ mozalloc_abort | abort | (anonymous namespace)::Converter::handleInstruction] [@ mozalloc_abort | abort | _iris_batch_flush.cold ] → [@ mozalloc_abort | abort | (anonymous namespace)::Converter::handleInstruction] [@ <name omitted> | mozalloc_abort | abort | _iris_batch_flush] [@ mozalloc_abort | abort | _iris_batch_flush.cold] [@ mozalloc_abort | abort | _iris_batch_flush]
Summary: Crash in [@ mozalloc_abort | abort | _iris_batch_flush.cold ] → Crash in [@ mozalloc_abort | abort | _iris_batch_flush]

Can you test latest nightly?
Thanks.

Flags: needinfo?(jan.steffens)
Flags: needinfo?(jan.steffens)

Jan,
the 'Renderer' thread is supposed to be much bigger, so this looks like memory corruption.
Can you try running an ASAN Nightly build, please?
https://firefox-source-docs.mozilla.org/tools/sanitizer/asan.html#address-sanitizer
Thanks.

Flags: needinfo?(jan.steffens)

By the way, on the crash report page I see various Intel GPUs but no AMD/Radeon. Looks like a threading issue on Intel, then?

I have the same hardware (ADL GT2), so I'll test that. I wonder why Firefox crashes at all, given that VA-API decoding runs in a separate RDD process and the main process should not be affected by it.

Attached file full backtrace (deleted) —

An asan-debug build also triggered the abort in mesa. No reports from ASAN.

Backtrace:

#0  0x00007fbe016e67c5 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fbda36ac190, rem=rem@entry=0x7fbda36ac190) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fbe016eb2d7 in __GI___nanosleep (req=req@entry=0x7fbda36ac190, rem=rem@entry=0x7fbda36ac190) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007fbe016eb20e in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007fbde17760a2 in common_crap_handler(int, void const*) (signum=signum@entry=6, aFirstFramePC=aFirstFramePC@entry=0x7fbde16fdd6b <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+955>) at /builds/worker/checkouts/gecko/toolkit/xre/nsSigHandlers.cpp:96
#4  0x00007fbde177630c in ah_crap_handler(int) (signum=6) at /builds/worker/checkouts/gecko/toolkit/xre/nsSigHandlers.cpp:104
#5  0x00007fbde16fdd6b in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*) (signo=<optimized out>, info=<optimized out>, context=<optimized out>) at /builds/worker/checkouts/gecko/toolkit/profile/nsProfileLock.cpp:183
#6  0x00007fbe01651a00 in <signal handler called> () at /usr/lib/libc.so.6
#7  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#8  0x00007fbe016a16b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#9  0x00007fbe01651958 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007fbe0163b53d in __GI_abort () at abort.c:79
#11 0x00007fbdb2c98c21 in _iris_batch_flush () at ../mesa-22.2.3/src/gallium/drivers/iris/iris_batch.c:1121
#12 0x00007fbdb3991f20 in iris_fence_flush () at ../mesa-22.2.3/src/gallium/drivers/iris/iris_fence.c:267
#13 0x00007fbdb2d87700 in st_context_flush () at ../mesa-22.2.3/src/mesa/state_tracker/st_manager.c:808
#14 0x00007fbdb2cafd9e in dri_flush () at ../mesa-22.2.3/src/gallium/frontends/dri/dri_drawable.c:522
#15 0x00007fbda139bdbb in dri2_wl_swap_buffers_with_damage () at ../mesa-22.2.3/src/egl/drivers/dri2/platform_wayland.c:1592
#16 0x00007fbda139080d in dri2_swap_buffers_with_damage () at ../mesa-22.2.3/src/egl/drivers/dri2/egl_dri2.c:2039
#17 0x00007fbda1383083 in _eglSwapBuffersWithDamageCommon () at ../mesa-22.2.3/src/egl/main/eglapi.c:1398
#18 0x00007fbdd047d349 in mozilla::gl::GLLibraryEGL::fSwapBuffersWithDamage(void*, void*, int const*, int) (this=<optimized out>, dpy=0x61b00013fd80, surface=0x6190016a1280, rects=0x602000538270, n_rects=1) at /builds/worker/checkouts/gecko/gfx/gl/GLLibraryEGL.h:510
#19 mozilla::gl::EglDisplay::fSwapBuffersWithDamage(void*, int const*, int) (this=<optimized out>, surface=surface@entry=0x6190016a1280, rects=rects@entry=0x602000538270, n_rects=1) at /builds/worker/checkouts/gecko/gfx/gl/GLLibraryEGL.h:939
#20 0x00007fbdd047c47f in mozilla::gl::GLContextEGL::SwapBuffers() (this=<optimized out>) at /builds/worker/checkouts/gecko/gfx/gl/GLContextProviderEGL.cpp:549
#21 0x00007fbdd11a372c in mozilla::wr::RenderCompositorEGL::EndFrame(nsTArray<mozilla::wr::Box2D<int, mozilla::wr::DevicePixel> > const&) (this=0x606000200360, aDirtyRects=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderCompositorEGL.cpp:153
#22 0x00007fbdd11d6969 in mozilla::wr::RendererOGL::UpdateAndRender(mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*, mozilla::wr::RendererStats*) (this=<optimized out>, aReadbackSize=<optimized out>, aReadbackFormat=<optimized out>, aReadbackBuffer=<optimized out>, aNeedsYFlip=<optimized out>, aOutStats=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RendererOGL.cpp:222
#23 0x00007fbdd11d2f70 in mozilla::wr::RenderThread::UpdateAndRender(mozilla::wr::WrWindowId, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> const&, mozilla::TimeStamp const&, bool, mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*) (this=<optimized out>, aWindowId=..., aStartId=<optimized out>, aStartTime=<optimized out>, aRender=<optimized out>, aReadbackSize=<optimized out>, aReadbackFormat=<optimized out>, aReadbackBuffer=<optimized out>, aNeedsYFlip=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderThread.cpp:580
#24 0x00007fbdd11d15a8 in mozilla::wr::RenderThread::HandleFrameOneDoc(mozilla::wr::WrWindowId, bool) (this=0x615000203d00, aWindowId=..., aRender=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderThread.cpp:426
#25 0x00007fbdd11f1e1f in mozilla::detail::RunnableMethodArguments<mozilla::wr::WrWindowId, bool>::applyImpl<mozilla::wr::RenderThread, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), StoreCopyPassByConstLRef<mozilla::wr::WrWindowId>, StoreCopyPassByConstLRef<bool>, 0ul, 1ul>(mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), mozilla::Tuple<StoreCopyPassByConstLRef<mozilla::wr::WrWindowId>, StoreCopyPassByConstLRef<bool> >&, std::integer_sequence<unsigned long, 0ul, 1ul>) (o=<optimized out>, m=<optimized out>, args=<optimized out>) at /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:1147
#26 mozilla::detail::RunnableMethodArguments<mozilla::wr::WrWindowId, bool>::apply<mozilla::wr::RenderThread, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool)>(mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool)) (this=<optimized out>, o=<optimized out>, m=<optimized out>) at /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:1153
#27 mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), true, (mozilla::RunnableKind)0, mozilla::wr::WrWindowId, bool>::Run() (this=<optimized out>) at /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:1200
#28 0x00007fbdcd55bfa8 in nsThread::ProcessNextEvent(bool, bool*) (this=0x615000203800, aMayWait=<optimized out>, aResult=<optimized out>) at /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1198
#29 0x00007fbdcd57075a in NS_ProcessNextEvent(nsIThread*, bool) (aThread=0x9f411, aMayWait=<optimized out>) at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:465
#30 0x00007fbdcf573ce9 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) (this=<optimized out>, aDelegate=<optimized out>) at /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:330
#31 0x00007fbdcf329b14 in MessageLoop::RunInternal() (this=<optimized out>) at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381
#32 0x00007fbdcf329777 in MessageLoop::RunHandler() (this=0x7fbda3eac980) at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
#33 MessageLoop::Run() (this=<optimized out>) at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
#34 0x00007fbdcd54e2d5 in nsThread::ThreadFunc(void*) (aArg=0x6030001e3eb0) at /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:383
#35 0x00007fbe01ac06b8 in _pt_root (arg=0x612000095ec0) at /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201
#36 0x00007fbe0169f8fd in start_thread (arg=<optimized out>) at pthread_create.c:442
#37 0x00007fbe01721a60 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Flags: needinfo?(jan.steffens)

Jan, can you confirm that the Mesa version is significant here? I tested the 20.0.x line (Ubuntu 22.04) but can't reproduce the crash on the same hardware. I wonder if the kernel version is also relevant.

It seems to be, but I can't say for sure. The crashes seem to start a few days after Arch released Mesa 22.1.3 and Linux 5.18.9.

The first crash (bp-296c0055-38f0-4396-b982-d47c50220707) happened about 40 minutes after Linux 5.18.10 was released to the testing repos, but the user was still using 5.18.9-arch1-1 (and Mesa 22.1.3).

Crashes are also happening on Fedora ([@ <name omitted> | mozalloc_abort | abort | _iris_batch_flush ]) which you might have more information for.

I reproduced it on ADL GT2 (Iris Xe) on Fedora 37. It crashed after ~10 minutes of VA-API playback with 5-6 clips playing together. I will try Fedora 36 on the same hardware. Fedora 36 + Intel 630 doesn't crash for me.

My testing shows that the bug does not depend on the Mesa version: I compiled 22.0.0 and a git snapshot from this morning, with no difference. But the bug does depend on the kernel version. With the 5.17 series it does not show up for me; from 5.18 up to the latest drm-tip it is still there.

I checked Mesa revisions more deeply: 21.0.3 works fine with any kernel, and kernel 5.17 works fine with any Mesa. All 22+ Mesa versions and 5.18+ kernels contain the error, but having just one of the two components at an error-free older version is enough to get rid of the issue. I tried to bisect the kernel but have not succeeded so far: kernel 5.18-rc1 does not work at all on my hardware, so there is a wide range of commits that I cannot classify as good or bad.
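
The observation above amounts to a two-part condition: the crash needs both a new enough Mesa and a new enough kernel. A minimal sketch, assuming the thresholds reported in this comment (Mesa 22.0 and Linux 5.18). Versions between the tested points (for example, Mesa newer than 21.0.3 but older than 22.0) were not tested, and `crash_expected` is a hypothetical helper name:

```python
def crash_expected(mesa: tuple, kernel: tuple) -> bool:
    """Return True when both components fall in the range reported as bad.

    The thresholds come from one tester's observations in this bug,
    not from a verified boundary.
    """
    mesa_bad = mesa >= (22, 0)      # "all 22+ Mesa versions"
    kernel_bad = kernel >= (5, 18)  # "5.18+ kernels"
    return mesa_bad and kernel_bad


# Combinations mentioned in the comment above.
for mesa, kernel in [((21, 0, 3), (5, 18, 9)),
                     ((22, 0, 0), (5, 17, 0)),
                     ((22, 0, 0), (5, 18, 9))]:
    print(mesa, kernel, crash_expected(mesa, kernel))
```

Because either old component hides the problem, a Mesa-only or kernel-only downgrade is not enough to tell which side contains the regression.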

Reverted this kernel commit https://cgit.freedesktop.org/drm-tip/commit/?id=658a0c632625e1db51837ff754fe18a6a7f2ccf8
drm/i915: don't call free_mmap_offset when purging

It seems that the idea of "let's not free the memory here and instead wait until those pieces are freed later, since they will be freed anyway" is a bad one in this case.

A simple patch -R over the 6.1-rc5 kernel still applies without any issue. I'm not sure whether it solves the problem completely, but I have not seen a crash since, even with the latest Mesa.

Anyway, this is only my experience on one laptop with one Gentoo setup, so it would be better to try it in other environments.

Thank you for analyzing the issue. The referenced commit was pulled into Linux v5.18-rc1. I created issue 7570 ([regression, bisected] Crash in iris_batch_flush) in the Intel DRM issue tracker.

The commit author replied. It’d be great if you joined the discussion over in the Intel DRM issue tracker.

The proposed solution also seems to work.
I put that commit back and applied the suggested patch:
https://lore.kernel.org/lkml/20221110053133.2433412-1-mani@chromium.org/

So far so good. Reverting the commit was a kind of black magic that worked (maybe; at least it did not crash during 3 hours of AV1 playback), but the suggested patch has some scientific understanding behind it, so it most likely nails the root cause of the issue rather than just saving a bit more space to delay the crash. But again, I need to test it a bit longer.

After the proof-of-concept patch, a patch series appeared on freedesktop Patchwork that should solve the issue. Waiting for this series to land.
https://patchwork.freedesktop.org/series/110720/

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

The patch series to fix the issue is on its way: https://patchwork.freedesktop.org/series/111686/

As soon as it lands, this crash should no longer occur.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → MOVED
Duplicate of this bug: 1833824

(Darkspirit from bug 1833824 comment 2)

from crash report:

OS Version 5.19.0-41-generic #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2

Duplicate of bug 1779558 (https://gitlab.freedesktop.org/drm/intel/-/issues/7570#note_1692847, https://patchwork.freedesktop.org/series/111686/).

The fix landed in Linux kernel 6.1.4:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2003687

Linux 6.1.4

drm/i915: improve the catch-all evict to handle lock contention

https://packages.ubuntu.com/search?keywords=linux-image-generic

Debian:

Fedora: https://packages.fedoraproject.org/pkgs/kernel/kernel/

  • fixed: Fedora Rawhide (6.4.0-0.rc2.23.fc39)
  • fixed: Fedora 38 (6.2.15-300.fc38)
  • fixed: Fedora 37 (6.2.15-200.fc37)
  • fixed: Fedora 36 (6.1.7@2023-01-18. now 6.2.15)
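
To check a kernel release string against the range discussed in this bug, a minimal sketch follows. It assumes the regression starts at 5.18 (the offending commit was pulled into v5.18-rc1) and ends at 6.1.4 (where the fix is reported to have landed). The helper names are hypothetical, and a stable or distribution kernel that carries a backport of the fix would be misclassified:

```python
import re

REGRESSION_START = (5, 18, 0)  # offending commit was pulled into v5.18-rc1
FIX_RELEASE = (6, 1, 4)        # fix reported in this bug as landing in 6.1.4


def parse_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '5.19.0-41-generic'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def in_affected_range(release: str) -> bool:
    """True when the upstream version lies in [5.18, 6.1.4)."""
    return REGRESSION_START <= parse_release(release) < FIX_RELEASE


# Release strings taken from this bug.
for release in ["5.18.9-arch1-1", "5.19.0-41-generic", "6.2.15-300.fc38"]:
    print(release, in_affected_range(release))
```

On a live system the string can be taken from `platform.release()` (the same value `uname -r` prints) and passed to `in_affected_range`.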
Summary: Crash in [@ mozalloc_abort | abort | _iris_batch_flush] → Intel/Linux: Crash in [@ mozalloc_abort | abort | _iris_batch_flush]. Fixed in Linux kernel 6.1.4 (Fedora 36, Ubuntu 23.04, Debian 12 Bookworm)