Assertion failure: !mHasBeenDecoded (Adding new frames after decoding is finished?), at /builds/worker/checkouts/gecko/image/FrameAnimator.cpp:117
Categories
(Core :: Graphics: ImageLib, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox111 | --- | disabled |
firefox112 | --- | disabled |
firefox113 | --- | fixed |
People
(Reporter: tsmith, Assigned: chunmin)
References
(Blocks 1 open bug)
Details
(Keywords: assertion, testcase)
Attachments
(2 files)
Found while fuzzing m-c 20230201-b7f075124503 (--enable-debug --enable-fuzzing)
This seems to be the most frequent issue reported at the moment but none of the test cases reproduce the issue.
Requires pref image.avif.sequence.enabled=true
Assertion failure: !mHasBeenDecoded (Adding new frames after decoding is finished?), at /builds/worker/checkouts/gecko/image/FrameAnimator.cpp:117
#0 0x7fb5cf5cf0ca in mozilla::image::AnimationState::UpdateKnownFrameCount(unsigned int) /builds/worker/checkouts/gecko/image/FrameAnimator.cpp:117:3
#1 0x7fb5cf5f0e1b in mozilla::image::RasterImage::NotifyProgress(unsigned int, mozilla::gfx::IntRectTyped<mozilla::OrientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) /builds/worker/checkouts/gecko/image/RasterImage.cpp:1593:24
#2 0x7fb5cf5fa4d3 in mozilla::image::RasterImage::NotifyDecodeComplete(mozilla::image::DecoderFinalStatus const&, mozilla::image::ImageMetadata const&, mozilla::image::DecoderTelemetry const&, unsigned int, mozilla::gfx::IntRectTyped<mozilla::OrientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) /builds/worker/checkouts/gecko/image/RasterImage.cpp:1646:3
#3 0x7fb5cf5dcbe2 in operator() /builds/worker/checkouts/gecko/image/IDecodingTask.cpp:134:39
#4 0x7fb5cf5dcbe2 in mozilla::detail::RunnableFunction<mozilla::image::IDecodingTask::NotifyDecodeComplete(mozilla::NotNull<mozilla::image::RasterImage*>, mozilla::NotNull<mozilla::image::Decoder*>)::$_10>::Run() /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.h:546:5
#5 0x7fb5cdf13042 in mozilla::SchedulerGroup::Runnable::Run() /builds/worker/checkouts/gecko/xpcom/threads/SchedulerGroup.cpp:114:20
#6 0x7fb5cdf1d635 in mozilla::RunnableTask::Run() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:539:16
#7 0x7fb5cdf188ac in mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:852:26
#8 0x7fb5cdf1747a in mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:684:15
#9 0x7fb5cdf177d5 in mozilla::TaskController::ProcessPendingMTTask(bool) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:462:36
#10 0x7fb5cdf21159 in operator() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:191:37
#11 0x7fb5cdf21159 in mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_3>::Run() /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:546:5
#12 0x7fb5cdf367c7 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1197:16
#13 0x7fb5cdf3cbad in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:477:10
#14 0x7fb5ceb3fcc3 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:107:5
#15 0x7fb5cea61838 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#16 0x7fb5cea61741 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#17 0x7fb5cea61741 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#18 0x7fb5d30831c8 in nsBaseAppShell::Run() /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:148:27
#19 0x7fb5d52d6a7b in XRE_RunAppShell() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:742:20
#20 0x7fb5ceb40bd9 in mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:235:9
#21 0x7fb5cea61838 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#22 0x7fb5cea61741 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#23 0x7fb5cea61741 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#24 0x7fb5d52d65d8 in XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:675:34
#25 0x55ce304eece0 in content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:57:28
#26 0x55ce304eece0 in main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:353:18
#27 0x7fb5e1649082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
#28 0x55ce304c5348 in _start (/home/twsmith/workspace/browsers/m-c-20230201215112-fuzzing-debug/firefox-bin+0x5b348) (BuildId: 785d584ab07eb3cb7fbcbd4f4f8c4f4318898232)
Updated•2 years ago
|
Reporter | ||
Comment 1•2 years ago
|
||
Reporter | ||
Comment 2•2 years ago
|
||
Reporter | ||
Updated•2 years ago
|
Comment 3•2 years ago
|
||
Thanks for the testcase. I am able to reproduce, but only when the pref image.avif.sequence.enabled=true
Comment 4•2 years ago
|
||
Oops, I meant to say I could reproduce, editing the last comment.
Reporter | ||
Comment 5•2 years ago
|
||
A Pernosco session is available here: https://pernos.co/debug/aoF7fuIh64akPgxsuygjLg/index.html
Looks like we're hitting an intermittent successful decode of the second frame from the testcase, where it should always be failing. If I set image_avif_use_dav1d to false, this issue doesn't occur. In particular, this line should always have a failure code and return from nsAVIFDecoder::Decode()
.
This seems like it may be an issue in dav1d.
Given that this is an intermittent failure code (DAV1D_ERR(EINVAL)) from dav1d, it might be good to get a Pernosco session from an unsuccessful decode to find out where that status is coming from. It appears to be happening in a tile decode worker, so if necessary dav1d could be configured to run only one thread.
However, I don't have access to Pernosco, and this one might be pretty time-consuming to track down without it, so I think I'll leave the rest to someone who does have access.
Comment 8•2 years ago
|
||
When we hit the error it seems to come from this line
Comment 9•2 years ago
|
||
I think I know what the problem is, but maybe I'm missing something because I've never looked at dav1d code until today.
We kick off a decode on an imgdecoder thread, thread #20. This communicates to dav1d-worker threads. dav1d-worker threads 189 and 194 seem to be working on this frame. We haven't yet encountered the error in the frame yet, and thread 189 gets here
where it pulls f->task_thread.error shared by all threads that care about this Dav1dFrameContext. So it pulls 0 (no error yet). Thread 189 pauses here for a bit. Then thread 194 encounters the error in the frame and sets f->task_thread.error here
Then thread 189 continues, it hasn't moved since I last mentioned it. It hits this if
Back when thread 189 fetched f->task_thread.error above the value of f->task_thread.task_counter was 4 and f->task_thread.done was {0,0}, but now that is no longer true (probably because thread 194 has modified them) so we enter this if and call dav1d_decode_frame_exit. We have the stale error value from above, so we pass retval = 0. So this seems the problem is pulling the error value and then later checking the termination condition of the overall frame tasks, in between that time an error can happen on another thread.
It also seems like this code dav1d_decode_frame_exit is a little fishy
which calls atomic_init on the shared error value (f->task_thread.error) and sets it back to 0. We don't actually use f->task_thread.error in this function, but perhaps it causes a problem elsewhere/later?
Also, atomic_init seems a little bit weird to be calling here. cppreference says
"Initializes the default-constructed atomic object obj with the value desired. The function is not atomic: concurrent access from another thread, even through an atomic operation, is a data race.
If obj was not default-constructed, the behavior is undefined.
If this function is called twice on the same obj, the behavior is undefined."
Presumably we only call dav1d_decode_frame_exit after all other threads that might be accessing f->task_thread.error have finished, so probably no other threads are touching this memory at this time. But the part about calling this function twice on the same obj seems bad because we likely already called atomic_init for this memory in dav1d_submit_frame.
Comment 10•2 years ago
|
||
This sounds like something that should be brought up with dav1d itself, but in the meantime it sounds like a fairly reasonable solution would be to disable threading, at least for animated images, for now.
Comment 11•2 years ago
|
||
I opened an issue for this on the dav1d issue tracker.
Comment 12•2 years ago
|
||
I can't access #419 for some reason, I'll ask the admin what's up with that. As for the issue, I have to look at the locking conditions for f->task_thread.error. I think the intention is that it's rolling, that is, it's OK for it to not immediately terminate other threads since it saves us having to take a lock for it (which is expensive for throughput). But we probably don't want to clear it in dav1d_decode_frame_exit(). I remember there being an issue when we don't unconditionally clear it, I'll see if I can find back what that was. If you change dav1d_decode_frame_exit() to only clear f->task_thread.error if retval!=0, does that resolve this issue? Similarly (but unfortunately still racy), if you re-load error before calling dav1d_decode_frame_exit() in thread_task.c:923, does that make this less likely to happen?
Comment 13•2 years ago
|
||
Can you open dav1d #419 (https://code.videolan.org/videolan/dav1d/-/issues/419) still? Nobody on our side can access that issue. I have an email notification for it, but nothing else. We're wondering if maybe some weird flag was set during creation that makes it inaccessible for other people. If you've set security/hidden flags, can you unset them?
Comment 14•2 years ago
|
||
(In reply to Ronald S. Bultje from comment #13)
Can you open dav1d #419 (https://code.videolan.org/videolan/dav1d/-/issues/419) still? Nobody on our side can access that issue. I have an email notification for it, but nothing else. We're wondering if maybe some weird flag was set during creation that makes it inaccessible for other people. If you've set security/hidden flags, can you unset them?
Not sure if this is to me or not. I can't access https://code.videolan.org/videolan/dav1d/-/issues/419 either, it says 404 for me. I don't think I touched it after I filed it, and I don't think I set any flags on it. I have one email notification for it but nothing else as well. I didn't add any information to it that isn't in this bug (although I might have explained the issue better).
Comment 15•2 years ago
|
||
I haven't forgotten about comment 12, just haven't had a chance yet to take a look.
Comment 16•2 years ago
|
||
(In reply to Ronald S. Bultje from comment #12)
If you change dav1d_decode_frame_exit() to only clear f->task_thread.error if retval!=0, does that resolve this issue?
I didn't observe the setting in dav1d_decode_frame_exit cause a problem, so I didn't try this.
Similarly (but unfortunately still racy), if you re-load error before calling dav1d_decode_frame_exit() in thread_task.c:923, does that make this less likely to happen?
I tried this, it seemed to fix this bug for me, and Tyson also was not able to reproduce this bug and the two other bugs. So this is more evidence that this issue is the root cause.
Comment 17•2 years ago
|
||
Victorien has merged a version of that fix into the codebase. Can you see if the issues are now fixed for you for each of the afore-mentioned bugs?
Comment 18•2 years ago
|
||
(In reply to Ronald S. Bultje from comment #17)
Victorien has merged a version of that fix into the codebase. Can you see if the issues are now fixed for you for each of the afore-mentioned bugs?
Thanks! Looking at the patch it's a superset of what we verified with, so I'm pretty confident it will fix the problem. We'll get an updated dav1d into the the Firefox repo and then we can verify and close these bugs. Thanks for your work on this!
Updated•2 years ago
|
Comment 19•2 years ago
|
||
Bug 1816484 landed which should fix this. Tyson can you confirm? Thanks!
Reporter | ||
Comment 20•2 years ago
|
||
I am no longer able to reproduce this issue and fuzzers are no longer reporting it.
Comment 21•2 years ago
|
||
Thanks! Let's call it fixed by bug 1816484 then.
Updated•2 years ago
|
Description
•