Bugzilla

Comment 1

•

2 years ago

All the reports seem to be from Arch Linux-based systems running Mesa's iris driver. The crashes started a few days after Arch released Mesa 22.1.3.

Mesa 22.1.4 was just released to Arch users.

PS: Still crashes with 22.1.4 (e.g. bp-2903b0e7-85a9-4352-a406-a3ab30220728).

Comment 2

•

2 years ago

The backtrace is misleading. There shouldn't be any nouveau code in use. Rather, I think this is the abort at the end of _iris_batch_flush at src/gallium/drivers/iris/iris_batch.c:1115.

Comment 3

•

2 years ago

I'm getting these crashes at random while playing H.264 and AV1 videos, but not VP9. All of them are hardware-decoded (ADL GT2).

I wonder if this is a race triggered by illegal multi-threaded use of GL in Firefox.

PS: Definitely also affects AV1. It mostly happens with YouTube (which is most of the videos I watch) but also other sites.

Comment 4

•

2 years ago

Is that ERROR("unhandled TGSI opcode: %u\n", tgsi.getOpcode()) at Converter::handleInstruction() ?

Comment 5

•

2 years ago

No, as I mentioned in comment #2, the backtrace pointing at nouveau code is nonsense but there's an abort in the iris code that fits the earlier frames (up until iris_fence_flush) better.

Comment 6

•

2 years ago

I get the following message from a mesa debug build:

iris: Failed to submit batchbuffer: No space left on device
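
A note on reading that message: the text after the colon matches the standard C library description of errno ENOSPC, which would be consistent with the kernel rejecting the batch submission with that error. This is an inference from the message text alone, since the thread does not quote the Mesa source line that prints it. A minimal Python sketch for checking the mapping on a Linux system (the helper name `describe_errno` is made up for illustration):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, C library message) for an errno value."""
    return errno.errorcode[code], os.strerror(code)


# ENOSPC is the errno whose message text appears in the iris log line.
name, message = describe_errno(errno.ENOSPC)
print(f"{name}: {message}")
```

On a glibc-based system the printed message should be the same "No space left on device" string seen in the log, which suggests the failure is a resource exhaustion reported by the kernel rather than a disk-space problem.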

Updated

•

2 years ago

See Also: → https://gitlab.freedesktop.org/mesa/mesa/-/issues/5600

Comment 7

•

2 years ago

Still happens even with GALLIUM_THREAD=0.

Updated

•

2 years ago

status-firefox103: affected → wontfix

status-firefox105: --- → affected

status-firefox106: --- → affected

Summary: Crash in [@ mozalloc_abort | abort | (anonymous namespace)::Converter::handleInstruction] → Crash in [@ mozalloc_abort | abort | _iris_batch_flush.cold ]

Updated

•

2 years ago

Blocks: egl-linux-vaapi

Darkspirit

Updated

•

2 years ago

Blocks: wr-linux

Component: Graphics: CanvasWebGL → Graphics

OS: Unspecified → Linux

Comment 8

•

2 years ago

Attached file log.zst (deleted) — Details

Log made with:

MOZ_LOG="Dmabuf:5, PlatformDecoderModule:5" INTEL_DEBUG=submit \
  mozregression --launch 2022-08-31 -P stdout -a 'https://www.youtube.com/watch?v=NOHtTKXqDeo' \
  |& zstd -7 > log.zst

Updated

•

2 years ago

Summary: Crash in [@ mozalloc_abort | abort | _iris_batch_flush.cold ] → Crash in [@ mozalloc_abort | abort | _iris_batch_flush]

Comment 9

•

2 years ago

Can you test the latest Nightly?
Thanks.

Flags: needinfo?(jan.steffens)

Comment 10

•

2 years ago

Just had another crash watching an AV1 video on YouTube.

Crash ID bp-33017db1-5c24-40a4-97c8-e1dbc0221113
Built from https://hg.mozilla.org/mozilla-central/rev/b7164776589657b7d7fd40d32268b2a489eed789

Flags: needinfo?(jan.steffens)

Comment 11

•

2 years ago

Jan,
the 'Renderer' thread is supposed to be much bigger, so this looks like memory corruption.
Can you try running an ASAN Nightly build, please?
https://firefox-source-docs.mozilla.org/tools/sanitizer/asan.html#address-sanitizer
Thanks.

Flags: needinfo?(jan.steffens)

Comment 12

•

2 years ago

By the way, the crash report page lists various Intel GPUs but no AMD/Radeon ones. Looks like a threading issue on Intel, then?

Comment 13

•

2 years ago

I have the same HW (ADL GT2) so I'll test that. I wonder why Firefox itself crashes, given that VA-API decoding runs in a separate RDD process and so should not affect it.

Comment 14

•

2 years ago

A better bt: https://crash-stats.mozilla.org/report/index/32a32a91-27db-4c79-94fc-f10c70221114

Comment 15

•

2 years ago

Attached file full backtrace (deleted) — Details

An asan-debug build also triggered the abort in mesa. No reports from ASAN.

Backtrace:

#0  0x00007fbe016e67c5 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fbda36ac190, rem=rem@entry=0x7fbda36ac190) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fbe016eb2d7 in __GI___nanosleep (req=req@entry=0x7fbda36ac190, rem=rem@entry=0x7fbda36ac190) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007fbe016eb20e in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007fbde17760a2 in common_crap_handler(int, void const*) (signum=signum@entry=6, aFirstFramePC=aFirstFramePC@entry=0x7fbde16fdd6b <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+955>) at /builds/worker/checkouts/gecko/toolkit/xre/nsSigHandlers.cpp:96
#4  0x00007fbde177630c in ah_crap_handler(int) (signum=6) at /builds/worker/checkouts/gecko/toolkit/xre/nsSigHandlers.cpp:104
#5  0x00007fbde16fdd6b in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*) (signo=<optimized out>, info=<optimized out>, context=<optimized out>) at /builds/worker/checkouts/gecko/toolkit/profile/nsProfileLock.cpp:183
#6  0x00007fbe01651a00 in <signal handler called> () at /usr/lib/libc.so.6
#7  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#8  0x00007fbe016a16b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#9  0x00007fbe01651958 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007fbe0163b53d in __GI_abort () at abort.c:79
#11 0x00007fbdb2c98c21 in _iris_batch_flush () at ../mesa-22.2.3/src/gallium/drivers/iris/iris_batch.c:1121
#12 0x00007fbdb3991f20 in iris_fence_flush () at ../mesa-22.2.3/src/gallium/drivers/iris/iris_fence.c:267
#13 0x00007fbdb2d87700 in st_context_flush () at ../mesa-22.2.3/src/mesa/state_tracker/st_manager.c:808
#14 0x00007fbdb2cafd9e in dri_flush () at ../mesa-22.2.3/src/gallium/frontends/dri/dri_drawable.c:522
#15 0x00007fbda139bdbb in dri2_wl_swap_buffers_with_damage () at ../mesa-22.2.3/src/egl/drivers/dri2/platform_wayland.c:1592
#16 0x00007fbda139080d in dri2_swap_buffers_with_damage () at ../mesa-22.2.3/src/egl/drivers/dri2/egl_dri2.c:2039
#17 0x00007fbda1383083 in _eglSwapBuffersWithDamageCommon () at ../mesa-22.2.3/src/egl/main/eglapi.c:1398
#18 0x00007fbdd047d349 in mozilla::gl::GLLibraryEGL::fSwapBuffersWithDamage(void*, void*, int const*, int) (this=<optimized out>, dpy=0x61b00013fd80, surface=0x6190016a1280, rects=0x602000538270, n_rects=1) at /builds/worker/checkouts/gecko/gfx/gl/GLLibraryEGL.h:510
#19 mozilla::gl::EglDisplay::fSwapBuffersWithDamage(void*, int const*, int) (this=<optimized out>, surface=surface@entry=0x6190016a1280, rects=rects@entry=0x602000538270, n_rects=1) at /builds/worker/checkouts/gecko/gfx/gl/GLLibraryEGL.h:939
#20 0x00007fbdd047c47f in mozilla::gl::GLContextEGL::SwapBuffers() (this=<optimized out>) at /builds/worker/checkouts/gecko/gfx/gl/GLContextProviderEGL.cpp:549
#21 0x00007fbdd11a372c in mozilla::wr::RenderCompositorEGL::EndFrame(nsTArray<mozilla::wr::Box2D<int, mozilla::wr::DevicePixel> > const&) (this=0x606000200360, aDirtyRects=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderCompositorEGL.cpp:153
#22 0x00007fbdd11d6969 in mozilla::wr::RendererOGL::UpdateAndRender(mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*, mozilla::wr::RendererStats*) (this=<optimized out>, aReadbackSize=<optimized out>, aReadbackFormat=<optimized out>, aReadbackBuffer=<optimized out>, aNeedsYFlip=<optimized out>, aOutStats=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RendererOGL.cpp:222
#23 0x00007fbdd11d2f70 in mozilla::wr::RenderThread::UpdateAndRender(mozilla::wr::WrWindowId, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> const&, mozilla::TimeStamp const&, bool, mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*) (this=<optimized out>, aWindowId=..., aStartId=<optimized out>, aStartTime=<optimized out>, aRender=<optimized out>, aReadbackSize=<optimized out>, aReadbackFormat=<optimized out>, aReadbackBuffer=<optimized out>, aNeedsYFlip=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderThread.cpp:580
#24 0x00007fbdd11d15a8 in mozilla::wr::RenderThread::HandleFrameOneDoc(mozilla::wr::WrWindowId, bool) (this=0x615000203d00, aWindowId=..., aRender=<optimized out>) at /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderThread.cpp:426
#25 0x00007fbdd11f1e1f in mozilla::detail::RunnableMethodArguments<mozilla::wr::WrWindowId, bool>::applyImpl<mozilla::wr::RenderThread, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), StoreCopyPassByConstLRef<mozilla::wr::WrWindowId>, StoreCopyPassByConstLRef<bool>, 0ul, 1ul>(mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), mozilla::Tuple<StoreCopyPassByConstLRef<mozilla::wr::WrWindowId>, StoreCopyPassByConstLRef<bool> >&, std::integer_sequence<unsigned long, 0ul, 1ul>) (o=<optimized out>, m=<optimized out>, args=<optimized out>) at /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:1147
#26 mozilla::detail::RunnableMethodArguments<mozilla::wr::WrWindowId, bool>::apply<mozilla::wr::RenderThread, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool)>(mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool)) (this=<optimized out>, o=<optimized out>, m=<optimized out>) at /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:1153
#27 mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), true, (mozilla::RunnableKind)0, mozilla::wr::WrWindowId, bool>::Run() (this=<optimized out>) at /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:1200
#28 0x00007fbdcd55bfa8 in nsThread::ProcessNextEvent(bool, bool*) (this=0x615000203800, aMayWait=<optimized out>, aResult=<optimized out>) at /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1198
#29 0x00007fbdcd57075a in NS_ProcessNextEvent(nsIThread*, bool) (aThread=0x9f411, aMayWait=<optimized out>) at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:465
#30 0x00007fbdcf573ce9 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) (this=<optimized out>, aDelegate=<optimized out>) at /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:330
#31 0x00007fbdcf329b14 in MessageLoop::RunInternal() (this=<optimized out>) at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381
#32 0x00007fbdcf329777 in MessageLoop::RunHandler() (this=0x7fbda3eac980) at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374
#33 MessageLoop::Run() (this=<optimized out>) at /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356
#34 0x00007fbdcd54e2d5 in nsThread::ThreadFunc(void*) (aArg=0x6030001e3eb0) at /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:383
#35 0x00007fbe01ac06b8 in _pt_root (arg=0x612000095ec0) at /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201
#36 0x00007fbe0169f8fd in start_thread (arg=<optimized out>) at pthread_create.c:442
#37 0x00007fbe01721a60 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Flags: needinfo?(jan.steffens)

Comment 16

•

2 years ago

Jan, can you confirm that the Mesa version is significant here? I tested the 20.0.x line (Ubuntu 22.04) on the same HW but can't reproduce it. I wonder if the kernel version is also relevant.

Comment 17

•

2 years ago

It seems to be, but I can't say for sure. The crashes seem to start a few days after Arch released Mesa 22.1.3 and Linux 5.18.9.

The first crash (bp-296c0055-38f0-4396-b982-d47c50220707) happened about 40 minutes after Linux 5.18.10 was released to the testing repos, but the user was still using 5.18.9-arch1-1 (and Mesa 22.1.3).

Crashes are also happening on Fedora ([@ <name omitted> | mozalloc_abort | abort | _iris_batch_flush ]) which you might have more information for.

Comment 18

•

2 years ago

I reproduced it on ADL GT2 (Iris Xe) on Fedora 37. It crashed after ~10 mins of VA-API playback when 5-6 clips were played together. I will try Fedora 36 on the same hardware. Fedora 36 + Intel 630 doesn't crash for me.

Updated

•

2 years ago

Blocks: egl-linux-vaapi-release-intel

Dianna Smith [:diannaS]

Updated

•

2 years ago

status-firefox104: affected → wontfix

status-firefox105: affected → wontfix

status-firefox106: affected → wontfix

status-firefox107: --- → wontfix

status-firefox108: --- → affected

status-firefox109: --- → affected

Comment 19

•

2 years ago

My testing shows that the bug does not depend on the Mesa version: I compiled 22.0.0 and a git snapshot from this morning, and there was no difference. But the bug does depend on the kernel version. With the 5.17 series it does not show up for me; from 5.18 up to the latest drm-tip it is still there.

Comment 20

•

2 years ago

I checked further back in Mesa revisions: 21.0.3 works fine with any kernel, and kernel 5.17 works fine with any Mesa. All 22+ Mesa versions and 5.18+ kernels contain the error, but it is enough to have just one of the error-free older components to get rid of the issue. I've tried to bisect the kernel but unfortunately have not succeeded so far: kernel 5.18-rc1 does not work at all on my hardware, so there is a wide range of commits that I cannot mark as good or bad.
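
The combination rule reported in this comment can be written down as a small check. This is only a sketch of one reporter's observations, not a statement about which versions are actually affected: the function names are hypothetical, and Mesa releases between 21.0.3 and 22.0.0 were not reported as tested in this thread, so the Mesa threshold of 22 is an assumption taken from the "All 22+" wording.

```python
import re


def parse_version(text):
    """Turn '5.18.9-arch1-1' or '22.1.3' into a tuple of its leading integers."""
    parts = []
    for piece in re.split(r"[.\-]", text):
        if not piece.isdigit():
            break  # stop at suffixes such as 'arch1' or 'rc5'
        parts.append(int(piece))
    return tuple(parts)


def likely_affected(kernel, mesa):
    """True only if BOTH components are in the ranges reported as bad."""
    kernel_bad = parse_version(kernel) >= (5, 18)  # 5.17 reported as good
    mesa_bad = parse_version(mesa) >= (22,)        # 21.0.3 reported as good
    return kernel_bad and mesa_bad


for kernel, mesa in [("5.18.9-arch1-1", "22.1.3"),
                     ("5.17.15", "22.1.3"),
                     ("6.1-rc5", "21.0.3")]:
    print(kernel, mesa, likely_affected(kernel, mesa))
```

The `and` reflects the observation that a single older component, either kernel or Mesa, was enough to make the crash go away.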

Comment 21

•

2 years ago

Reverted this kernel commit https://cgit.freedesktop.org/drm-tip/commit/?id=658a0c632625e1db51837ff754fe18a6a7f2ccf8
drm/i915: don't call free_mmap_offset when purging

It seems that the idea of "let's not free the memory here and instead wait until all those pieces are freed later, since they will be freed anyway" is a bad one in this case.

A simple patch -R on top of the 6.1-rc5 kernel still applies without any issue. I'm not sure whether it solves the problem completely, but I have not seen a crash since, using the latest Mesa.

Anyway, this is only my experience on one laptop with one Gentoo setup, so it would be good to try it in other environments.

Paul Menzel

Comment 22

•

2 years ago

Thank you for analyzing the issue. The referenced commit was pulled into Linux v5.18-rc1. I created issue 7570 ([regression, bisected] Crash in iris_batch_flush) in the Intel DRM issue tracker.

Paul Menzel

Comment 23

•

2 years ago

The commit author replied. It’d be great if you joined the discussion over in the Intel DRM issue tracker.

Comment 24

•

2 years ago

The proposed solution also seems to work.
I put that commit back and applied the suggested patch:
https://lore.kernel.org/lkml/20221110053133.2433412-1-mani@chromium.org/

So far so good. Reverting the commit was a kind of black magic that worked (maybe; at least it did not crash during 3 hours of AV1 playback), but the suggested patch has some real understanding behind it, so most likely it addresses the root cause of the issue rather than just saving a bit more space to delay the crash. But again, I need to test it a bit longer.

Comment 25

•

2 years ago

After the proof-of-concept patch, a patch series that will solve the issue appeared on freedesktop patchwork. Waiting for this series to land.
https://patchwork.freedesktop.org/series/110720/

BugBot [:suhaib / :marco/ :calixte]

Comment 26

•

2 years ago

Oops... The initial patch https://lore.kernel.org/lkml/20221110053133.2433412-1-mani@chromium.org/ is still needed :(

Comment 27

•

2 years ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 desktop browser crashes on nightly

For more information, please visit auto_nag documentation.

Keywords: topcrash

Comment 28

•

2 years ago

The patch series to fix the issue is on its way: https://patchwork.freedesktop.org/series/111686/

As soon as it lands, this bug will no longer cause crashes.