Closed Bug 1661572 Opened 4 years ago Closed 4 years ago

Crash in [@ DMABufSurfaceYUV::UpdateYUVData]

Categories

(Core :: Widget: Gtk, defect)

Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
88 Branch
Tracking Status
firefox-esr78 --- disabled
firefox86 --- disabled
firefox87 --- disabled
firefox88 --- fixed

People

(Reporter: sefeng, Assigned: stransky)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(5 files)

Crash report: https://crash-stats.mozilla.org/report/index/60545573-c384-4685-8c4a-8cb200200825

Top 10 frames of crashing thread:

0 libgallium_dri.so nouveau_drm_screen_create 
1 libgallium_dri.so nouveau_drm_screen_create 
2 libgallium_dri.so nouveau_drm_screen_create 
3 libgallium_dri.so __driDriverGetExtensions_zink 
4 libgallium_dri.so libgallium_dri.so@0x1287d8 
5 libxul.so DMABufSurfaceYUV::UpdateYUVData widget/gtk/DMABufSurface.cpp:867
6 libxul.so DMABufSurfaceYUV::CreateYUVSurface widget/gtk/DMABufSurface.cpp:734
7 libxul.so mozilla::FFmpegVideoDecoder<58>::CreateImageDMABuf dom/media/platforms/ffmpeg/FFmpegVideoDecoder.cpp:776
8 libxul.so mozilla::FFmpegVideoDecoder<58>::DoDecode dom/media/platforms/ffmpeg/FFmpegVideoDecoder.cpp:504
9 libxul.so mozilla::FFmpegDataDecoder<58>::DoDecode dom/media/platforms/ffmpeg/FFmpegDataDecoder.cpp:181

This seems related to our Wayland support. I'm not sure whether it is actionable, since the crashes happen in the Nouveau driver.

Interesting that I hit the "nouveau" code path on Intel. I'm seeing this crash on Amazon Chime during video calls.

bp-cde98884-be8b-4362-9d18-1a5a80200904

I have similar SIGSEGV backtrace with radeonsi and X11:

#0  0x00007fc107a82267 in u_transfer_unmap_vtbl () at /usr/lib64/dri/radeonsi_dri.so
#1  0x00007fc107a82e6f in _tc_sync.constprop.0 () at /usr/lib64/dri/radeonsi_dri.so
#2  0x00007fc107a873e3 in tc_flush () at /usr/lib64/dri/radeonsi_dri.so
#3  0x00007fc1070f18c9 in st_context_flush () at /usr/lib64/dri/radeonsi_dri.so
#4  0x00007fc1070e67f9 in dri_flush () at /usr/lib64/dri/radeonsi_dri.so
#5  0x00007fc110817dee in DMABufSurfaceYUV::UpdateYUVData(void**, int*) () at /usr/lib64/firefox/libxul.so
#6  0x00007fc110817f4d in DMABufSurfaceYUV::CreateYUVSurface(int, int, void**, int*) () at /usr/lib64/firefox/libxul.so
#7  0x00007fc110509965 in mozilla::FFmpegVideoDecoder<58>::CreateImageDMABuf(long, long, long, nsTArray<RefPtr<mozilla::MediaData> >&) () at /usr/lib64/firefox/libxul.so
#8  0x00007fc11050a010 in mozilla::FFmpegVideoDecoder<58>::DoDecode(mozilla::MediaRawData*, unsigned char*, int, bool*, nsTArray<RefPtr<mozilla::MediaData> >&) () at /usr/lib64/firefox/libxul.so
#9  0x00007fc110508881 in mozilla::FFmpegDataDecoder<58>::DoDecode(mozilla::MediaRawData*, bool*, nsTArray<RefPtr<mozilla::MediaData> >&) () at /usr/lib64/firefox/libxul.so
#10 0x00007fc11050b1c2 in mozilla::FFmpegDataDecoder<58>::ProcessDecode(mozilla::MediaRawData*) () at /usr/lib64/firefox/libxul.so
#11 0x00007fc111ee7d97 in mozilla::detail::ProxyRunnable<mozilla::MozPromise<nsTArray<RefPtr<mozilla::MediaData> >, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<nsTArray<RefPtr<mozilla::MediaData> >, mozilla::MediaResult, true> > (mozilla::FFmpegDataDecoder<58>::*)(mozilla::MediaRawData*), mozilla::FFmpegDataDecoder<58>, mozilla::MediaRawData*>::Run() () at /usr/lib64/firefox/libxul.so
#12 0x00007fc111736db2 in mozilla::TaskQueue::Runner::Run() () at /usr/lib64/firefox/libxul.so
#13 0x00007fc1117369a0 in nsThreadPool::Run() () at /usr/lib64/firefox/libxul.so
#14 0x00007fc111489dd7 in nsThread::ProcessNextEvent(bool, bool*) () at /usr/lib64/firefox/libxul.so
#15 0x00007fc111489b90 in NS_ProcessNextEvent(nsIThread*, bool) () at /usr/lib64/firefox/libxul.so
#16 0x00007fc1114a1e3e in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) () at /usr/lib64/firefox/libxul.so
#17 0x00007fc1118a6d05 in MessageLoop::Run() () at /usr/lib64/firefox/libxul.so
#18 0x00007fc111735a57 in nsThread::ThreadFunc(void*) () at /usr/lib64/firefox/libxul.so
#19 0x00007fc1186f5150 in _pt_root () at /lib64/libnspr4.so
#20 0x00007fc118c3d3f9 in start_thread () at /lib64/libpthread.so.0
#21 0x00007fc118818903 in clone () at /lib64/libc.so.6

This can happen when more than one GPU is installed in the system (integrated + dedicated, for instance) and we use the wrong GPU.
It may also happen when incorrect data is uploaded to GPU textures: when VA-API does not support a particular format (say VP8/9), we do SW decoding with ffmpeg and then upload the frames to the GPU (dmabuf).

It may help to attach terminal output of:

lspci | grep "VGA"
vainfo --display drm --device /dev/dri/renderD128
vainfo --display drm --device /dev/dri/renderD129

and run Firefox with

MOZ_LOG="PlatformDecoderModule:5"

and attach the log here. I may add more logging to the dmabuf module if we see more similar issues.

Flags: needinfo?(sefeng)
Depends on: 1588904

Not sure why I am needinfo'ed. I guess you want me to try those commands, Martin?

However, I've never hit this crash. I filed this bug because I was analyzing crash reports.

Flags: needinfo?(sefeng)

I have only one GPU, so it is probably the second case.
It seems to crash frequently when there are several (more than two) VP8 videos on the same page.

I can reproduce this easily with this HTML file:

<!DOCTYPE html>
<html>
<body>
<video autoplay muted>
  <source src=http://techslides.com/demos/sample-videos/small.webm type=video/webm>
</video>
<video autoplay muted>
  <source src=http://techslides.com/demos/sample-videos/small.webm type=video/webm>
</video>
<video autoplay muted>
  <source src=http://techslides.com/demos/sample-videos/small.webm type=video/webm>
</video>
</body>
</html>

lspci:

09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)

vainfo:

libva info: VA-API version 1.9.0
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_9
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.9 (libva 2.9.0)
vainfo: Driver version: Mesa Gallium driver 20.2.2 for Radeon RX 580 Series (POLARIS10, DRM 3.39.0, 5.9.8-200.fc33.x86_64, LLVM 11.0.0)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc
Attached file ffvp8crash.log (deleted)

Firefox 83 terminal output with MOZ_LOG="PlatformDecoderModule:5" when page crashed.

Adding to my TODO list, thanks for the reproducer.

Assignee: nobody → stransky

I can reproduce it now on the radeon/gallium driver. When running a debug build it complains about a double free.

Hm, it looks like a multi-threading issue after all. Here are the significant parts of the backtraces:

[Switching to thread 175 (Thread 0x7f17160fe640 (LWP 180491))]
#5 0x00007f179c22f1e0 in <signal handler called> () at /lib64/libpthread.so.0
#6 0x000055f70fc51bad in arena_run_reg_dalloc(arena_run_t*, arena_bin_t*, void*, unsigned long) (run=0x7f172f8d5000, bin=0x7f179bb001b8, ptr=0x7f172f8d5bf0, size=80)
at /raid/src/memory/build/mozjemalloc.cpp:2209
#7 0x000055f70fc5150d in arena_t::DallocSmall(arena_chunk_t*, void*, arena_chunk_map_t*) (this=0x7f179bb00000, aChunk=0x7f172f800000, aPtr=0x7f172f8d5bf0, aMapElm=0x7f172f801418)
at /raid/src/memory/build/mozjemalloc.cpp:3288
#8 0x000055f70fc51067 in arena_dalloc(void*, unsigned long, arena_t*) (aPtr=0x7f172f8d5bf0, aOffset=875504, aArena=0x0) at /raid/src/memory/build/mozjemalloc.cpp:3372
#9 0x000055f70fc573d5 in BaseAllocator::free(void*) (this=0x7f17160fbd60, aPtr=0x7f172f8d5bf0) at /raid/src/memory/build/mozjemalloc.cpp:4137
#10 0x000055f70fc53bc5 in Allocator<MozJemallocBase>::free(void*) (arg1=0x7f172f8d5bf0) at /raid/src/memory/build/malloc_decls.h:54
#11 0x000055f70fc8cca6 in PageFree(mozilla::Maybe<unsigned long> const&, void*) (aArenaId=..., aPtr=0x7f172f8d5bf0) at /raid/src/memory/replace/phc/PHC.cpp:1281
#12 0x000055f70fc8d426 in replace_free(void*) (aPtr=0x7f172f8d5bf0) at /raid/src/memory/replace/phc/PHC.cpp:1317
#13 0x000055f70fc47d57 in Allocator<ReplaceMallocBase>::free(void*) (arg1=0x7f172f8d5bf0) at /raid/src/memory/build/malloc_decls.h:54
#14 0x000055f70fc47cf5 in free(void*) (arg1=0x7f172f8d5bf0) at /raid/src/memory/build/malloc_decls.h:54
#15 0x00007f176a905783 in tc_batch_execute (job=job@entry=0x7f17148025a0, thread_index=thread_index@entry=0) at ../src/gallium/auxiliary/util/u_threaded_context.c:163
#16 0x00007f176a905b49 in _tc_sync (tc=tc@entry=0x7f1714802000, func=<optimized out>, info=<optimized out>) at ../src/gallium/auxiliary/util/u_threaded_context.c:277
#17 0x00007f176a90618e in tc_transfer_map (_pipe=0x7f1714802000, resource=0x7f16fc613800, level=0, usage=2, box=0x7f17160fc800, transfer=<optimized out>)
at ../src/gallium/auxiliary/util/u_threaded_context.c:1589
#18 0x00007f1769e9cecc in pipe_transfer_map (transfer=0x7f17160fc7f8, h=160, w=<optimized out>, y=0, x=0, access=2, layer=0, level=0, resource=<optimized out>, context=<optimized out>)
at ../src/gallium/auxiliary/util/u_inlines.h:486
#19 dri2_map_image (context=<optimized out>, image=0x7f1711fac880, x0=0, y0=0, width=280, height=160, flags=2, stride=0x7f16f998a8ec, data=0x7f16f998a8d0)
at ../src/gallium/frontends/dri/dri2.c:1661
#20 0x00007f1770c1aab3 in gbm_dri_bo_map (_bo=0x7f1711fac0b0, x=0, y=0, width=280, height=<optimized out>, flags=2, stride=0x7f16f998a8ec, map_data=0x7f16f998a8d0)
at ../src/gbm/backends/dri/gbm_dri.c:1231
#21 0x00007f178f84b6eb in mozilla::widget::nsGbmLib::Map(gbm_bo*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int*, void**)
(bo=0x7f1711fac0b0, x=0, y=0, width=280, height=160, flags=2, stride=0x7f16f998a8ec, map_data=0x7f16f998a8d0) at /raid/src/widget/gtk/DMABufLibWrapper.h:71

(gdb) thread 176
#7 0x00007f179c22f1e0 in <signal handler called> () at /lib64/libpthread.so.0
#8 0x00007f176a8fee77 in u_transfer_unmap_vtbl (pipe=0x7f17149ba000, transfer=0x7f171489b7e0) at ../src/gallium/auxiliary/util/u_transfer.c:158
#9 0x00007f176a905783 in tc_batch_execute (job=job@entry=0x7f17148025a0, thread_index=thread_index@entry=0) at ../src/gallium/auxiliary/util/u_threaded_context.c:163
#10 0x00007f176a905b49 in _tc_sync (tc=tc@entry=0x7f1714802000, func=<optimized out>, info=<optimized out>) at ../src/gallium/auxiliary/util/u_threaded_context.c:277
#11 0x00007f176a905d08 in tc_flush (_pipe=0x7f1714802000, fence=0x0, flags=1) at ../src/gallium/auxiliary/util/u_threaded_context.c:2188
#12 0x00007f1769eacd7d in st_context_flush (stctxi=0x7f17148b1000, flags=3, fence=0x0, before_flush_cb=0x0, args=0x7f17153968d0) at ../src/mesa/state_tracker/st_manager.c:674
#13 0x00007f1769ea18e1 in dri_flush (cPriv=<optimized out>, dPriv=<optimized out>, flags=<optimized out>, reason=<optimized out>) at ../src/gallium/frontends/dri/dri_drawable.c:536
#14 0x00007f178f84b911 in mozilla::widget::nsGbmLib::Unmap(gbm_bo*, void*) (bo=0x7f1714aedfb0, map_data=0x7f171489b4c0) at /raid/src/widget/gtk/DMABufLibWrapper.h:73
#15 0x00007f178f84833a in DMABufSurface::Unmap(int) (this=0x7f1714add430, aPlane=0) at /raid/src/widget/gtk/DMABufSurface.cpp:612

When ffmpeg decodes video to dmabuf surfaces and the dmabuf backend fails to allocate one (for instance because we're running out of file descriptors),
we need to disable dmabuf surfaces and restart the video decoder to create a non-dmabuf ImageHost.
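
A minimal sketch of that fallback, under stated assumptions: `FakeDmabufAllocator`, `SketchDecoder`, `DecodeStatus` and `DecodeWithFallback` are hypothetical names invented for illustration and are not the Firefox classes touched by the patch. The sketch only shows the control flow: on allocation failure the decoder turns dmabuf off and asks the caller to restart, and the retry then produces a non-dmabuf frame.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not the real Firefox types.
struct Frame {
  bool usesDmabuf;
};

// Simulates a dmabuf backend with a limited budget of allocations,
// standing in for a limited resource such as file descriptors.
class FakeDmabufAllocator {
 public:
  explicit FakeDmabufAllocator(int budget) : mBudget(budget) {}

  bool Allocate() {
    if (mBudget <= 0) {
      return false;  // Out of resources.
    }
    --mBudget;
    return true;
  }

 private:
  int mBudget;
};

enum class DecodeStatus { Ok, NeedsRestart };

class SketchDecoder {
 public:
  explicit SketchDecoder(FakeDmabufAllocator& aAlloc) : mAlloc(aAlloc) {}

  DecodeStatus Decode(Frame* aOut) {
    if (mUseDmabuf) {
      if (!mAlloc.Allocate()) {
        // Allocation failed: disable dmabuf for this decoder and ask the
        // caller to restart decoding on the non-dmabuf path.
        mUseDmabuf = false;
        return DecodeStatus::NeedsRestart;
      }
      *aOut = Frame{true};
      return DecodeStatus::Ok;
    }
    *aOut = Frame{false};
    return DecodeStatus::Ok;
  }

  bool UsesDmabuf() const { return mUseDmabuf; }

 private:
  FakeDmabufAllocator& mAlloc;
  bool mUseDmabuf = true;
};

// Decodes one frame and retries once if the decoder requested a restart.
Frame DecodeWithFallback(SketchDecoder& aDecoder) {
  Frame frame{false};
  if (aDecoder.Decode(&frame) == DecodeStatus::NeedsRestart) {
    aDecoder.Decode(&frame);
  }
  return frame;
}
```

With an allocator budget of one, the first call to `DecodeWithFallback` returns a dmabuf frame and the second falls back to a non-dmabuf frame instead of crashing.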

It's possible that DMABufSurface::CreateDMABufSurface() fails, for instance when we're running out of file descriptors. In that case mSurface is null
and we need to check it before we use it.

Also implement DMABUFTextureHostOGL::IsValid() to report the mSurface state.
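
The null check described above might look roughly like the following sketch. `SketchSurface`, `CreateSurface` and `SketchTextureHost` are hypothetical stand-ins, not the real `DMABufSurface` or `DMABUFTextureHostOGL` code; the `simulateFailure` parameter exists only to make the failure path testable.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical surface type for illustration only.
struct SketchSurface {
  int width;
  int height;
};

// Hypothetical factory: returns null when creation fails, for example
// when the process has run out of file descriptors.
std::shared_ptr<SketchSurface> CreateSurface(int aWidth, int aHeight,
                                             bool aSimulateFailure) {
  if (aSimulateFailure || aWidth <= 0 || aHeight <= 0) {
    return nullptr;
  }
  return std::make_shared<SketchSurface>(SketchSurface{aWidth, aHeight});
}

class SketchTextureHost {
 public:
  explicit SketchTextureHost(std::shared_ptr<SketchSurface> aSurface)
      : mSurface(std::move(aSurface)) {}

  // Reports whether the underlying surface was created successfully.
  bool IsValid() const { return mSurface != nullptr; }

  // Every accessor checks mSurface before dereferencing it.
  int GetWidth() const { return mSurface ? mSurface->width : 0; }

 private:
  std::shared_ptr<SketchSurface> mSurface;
};
```

Callers can then test `IsValid()` once and skip the host entirely, rather than relying on every call site to remember the null check.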

Depends on D107976

When multiple DMABuf surfaces are used (for instance during video playback) we can run out of free file descriptors.
To avoid that scenario, open DMABuf file descriptors only when they are needed, i.e. when DMABuf objects are mapped to user
space, mapped as EGLImages, or shared with other processes.

  • Implement OpenFileDescriptors()/CloseFileDescriptors() methods to provide such functionality and also
    OpenFileDescriptorForPlane() / CloseFileDescriptorForPlane() for particular planes.

  • Use a mutex to protect the parts where file descriptors are used.

  • Make functions which use file descriptors fail-safe, i.e. return an error code when we can't get a file descriptor for a DMABuf object
    and propagate it.
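
The three points above can be sketched as follows. This is an illustration under assumptions, not the patch itself: `FakeFdSource` simulates a limited descriptor budget instead of calling into the OS, and `SketchLazyFdSurface` with its single-descriptor `OpenFileDescriptor()` / `CloseFileDescriptor()` / `Map()` methods is a hypothetical simplification of the per-plane methods named in the list.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical descriptor source with a fixed budget, standing in for the
// per-process file descriptor limit. It does not touch real descriptors.
class FakeFdSource {
 public:
  explicit FakeFdSource(int aLimit) : mLimit(aLimit) {}

  // Returns a fake descriptor number, or -1 when the budget is exhausted.
  int Open() {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mOpen >= mLimit) {
      return -1;
    }
    ++mOpen;
    return mNext++;
  }

  void Close(int /* aFd */) {
    std::lock_guard<std::mutex> lock(mMutex);
    --mOpen;
  }

  int OpenCount() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mOpen;
  }

 private:
  mutable std::mutex mMutex;
  int mLimit;
  int mOpen = 0;
  int mNext = 100;
};

class SketchLazyFdSurface {
 public:
  explicit SketchLazyFdSurface(FakeFdSource& aSource) : mSource(aSource) {}
  ~SketchLazyFdSurface() { CloseFileDescriptor(); }

  // Opens the descriptor on demand; returns false on failure.
  bool OpenFileDescriptor() {
    std::lock_guard<std::mutex> lock(mMutex);
    return OpenLocked();
  }

  void CloseFileDescriptor() {
    std::lock_guard<std::mutex> lock(mMutex);
    CloseLocked();
  }

  // Holds a descriptor only for the duration of the operation and
  // propagates failure to the caller instead of crashing.
  bool Map() {
    std::lock_guard<std::mutex> lock(mMutex);
    const bool wasOpen = mFd >= 0;
    if (!OpenLocked()) {
      return false;
    }
    // A real implementation would map the buffer here.
    if (!wasOpen) {
      CloseLocked();
    }
    return true;
  }

 private:
  bool OpenLocked() {
    if (mFd < 0) {
      mFd = mSource.Open();
    }
    return mFd >= 0;
  }

  void CloseLocked() {
    if (mFd >= 0) {
      mSource.Close(mFd);
      mFd = -1;
    }
  }

  FakeFdSource& mSource;
  std::mutex mMutex;
  int mFd = -1;
};
```

Because an idle surface holds no descriptor, many surfaces can coexist under a small budget; only the ones currently being mapped or shared count against the limit.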

Depends on D107977

See https://gitlab.freedesktop.org/mesa/mesa/-/issues/4422 for details. Mesa/radeon can't handle concurrent map/unmap operations from multiple threads, so protect them with a lock.
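
As a rough illustration of this workaround, the sketch below serializes all map/unmap calls through one process-wide mutex. `FakeDriver` is a hypothetical stand-in for a driver whose map/unmap entry points are not thread-safe, and `LockedMapper` and `RunWorkers` are invented names; the real patch wraps the gbm map/unmap calls, whose details are not reproduced here.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical driver whose Map/Unmap are not safe to call concurrently:
// the counters are plain integers with no synchronization of their own.
class FakeDriver {
 public:
  void* Map() {
    ++mMapped;
    ++mTotalMaps;
    return &mMapped;
  }
  void Unmap(void* /* aData */) { --mMapped; }

  int Mapped() const { return mMapped; }
  long TotalMaps() const { return mTotalMaps; }

 private:
  int mMapped = 0;
  long mTotalMaps = 0;
};

// Wrapper that takes one shared lock around every driver call.
class LockedMapper {
 public:
  explicit LockedMapper(FakeDriver& aDriver) : mDriver(aDriver) {}

  void* Map() {
    std::lock_guard<std::mutex> lock(DriverLock());
    return mDriver.Map();
  }

  void Unmap(void* aData) {
    std::lock_guard<std::mutex> lock(DriverLock());
    mDriver.Unmap(aData);
  }

 private:
  // A single process-wide mutex shared by all LockedMapper instances.
  static std::mutex& DriverLock() {
    static std::mutex sLock;
    return sLock;
  }

  FakeDriver& mDriver;
};

// Runs several threads that repeatedly map and unmap through the wrapper.
void RunWorkers(FakeDriver& aDriver, int aThreads, int aIterations) {
  std::vector<std::thread> workers;
  for (int t = 0; t < aThreads; ++t) {
    workers.emplace_back([&aDriver, aIterations] {
      LockedMapper mapper(aDriver);
      for (int i = 0; i < aIterations; ++i) {
        void* data = mapper.Map();
        mapper.Unmap(data);
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
}
```

A single global lock trades some parallelism for safety, which seems a reasonable choice for a workaround aimed at a driver-side threading problem.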

Depends on D107978

Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/a96a4ccad4aa
[Linux] When we fail to create DMABuf surface at FFmpegVideoDecoder then restart video decoder to create a new ImageHost, r=alwu
https://hg.mozilla.org/integration/autoland/rev/b19b750f1ecf
[Linux] Check mSurface before use at DMABUFTextureData/DMABUFTextureHostOGL, r=sotaro
https://hg.mozilla.org/integration/autoland/rev/c063ca73efb0
[Linux] Open DMABuf object file descriptors only when we need them, r=jhorak
https://hg.mozilla.org/integration/autoland/rev/cbc4cfcb9a3b
[Linux] Protect dmabuf map/unmap operations by mutex to workaround Mesa multi-threading issue, r=jgilbert
Regressions: 1699075