Closed Bug 1814560 Opened 2 years ago Closed 2 years ago

Assertion failure: !mHasBeenDecoded (Adding new frames after decoding is finished?), at /builds/worker/checkouts/gecko/image/FrameAnimator.cpp:117

Categories

(Core :: Graphics: ImageLib, defect)

defect

Tracking

()

RESOLVED FIXED
113 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox111 --- disabled
firefox112 --- disabled
firefox113 --- fixed

People

(Reporter: tsmith, Assigned: chunmin)

References

(Blocks 1 open bug)

Details

(Keywords: assertion, testcase)

Attachments

(2 files)

Found while fuzzing m-c 20230201-b7f075124503 (--enable-debug --enable-fuzzing)

This seems to be the most frequent issue reported at the moment but none of the test cases reproduce the issue.

Requires pref image.avif.sequence.enabled=true

Assertion failure: !mHasBeenDecoded (Adding new frames after decoding is finished?), at /builds/worker/checkouts/gecko/image/FrameAnimator.cpp:117

#0 0x7fb5cf5cf0ca in mozilla::image::AnimationState::UpdateKnownFrameCount(unsigned int) /builds/worker/checkouts/gecko/image/FrameAnimator.cpp:117:3
#1 0x7fb5cf5f0e1b in mozilla::image::RasterImage::NotifyProgress(unsigned int, mozilla::gfx::IntRectTyped<mozilla::OrientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) /builds/worker/checkouts/gecko/image/RasterImage.cpp:1593:24
#2 0x7fb5cf5fa4d3 in mozilla::image::RasterImage::NotifyDecodeComplete(mozilla::image::DecoderFinalStatus const&, mozilla::image::ImageMetadata const&, mozilla::image::DecoderTelemetry const&, unsigned int, mozilla::gfx::IntRectTyped<mozilla::OrientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) /builds/worker/checkouts/gecko/image/RasterImage.cpp:1646:3
#3 0x7fb5cf5dcbe2 in operator() /builds/worker/checkouts/gecko/image/IDecodingTask.cpp:134:39
#4 0x7fb5cf5dcbe2 in mozilla::detail::RunnableFunction<mozilla::image::IDecodingTask::NotifyDecodeComplete(mozilla::NotNull<mozilla::image::RasterImage*>, mozilla::NotNull<mozilla::image::Decoder*>)::$_10>::Run() /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.h:546:5
#5 0x7fb5cdf13042 in mozilla::SchedulerGroup::Runnable::Run() /builds/worker/checkouts/gecko/xpcom/threads/SchedulerGroup.cpp:114:20
#6 0x7fb5cdf1d635 in mozilla::RunnableTask::Run() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:539:16
#7 0x7fb5cdf188ac in mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:852:26
#8 0x7fb5cdf1747a in mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:684:15
#9 0x7fb5cdf177d5 in mozilla::TaskController::ProcessPendingMTTask(bool) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:462:36
#10 0x7fb5cdf21159 in operator() /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:191:37
#11 0x7fb5cdf21159 in mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_3>::Run() /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:546:5
#12 0x7fb5cdf367c7 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1197:16
#13 0x7fb5cdf3cbad in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:477:10
#14 0x7fb5ceb3fcc3 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:107:5
#15 0x7fb5cea61838 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#16 0x7fb5cea61741 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#17 0x7fb5cea61741 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#18 0x7fb5d30831c8 in nsBaseAppShell::Run() /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:148:27
#19 0x7fb5d52d6a7b in XRE_RunAppShell() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:742:20
#20 0x7fb5ceb40bd9 in mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:235:9
#21 0x7fb5cea61838 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#22 0x7fb5cea61741 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#23 0x7fb5cea61741 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#24 0x7fb5d52d65d8 in XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:675:34
#25 0x55ce304eece0 in content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:57:28
#26 0x55ce304eece0 in main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:353:18
#27 0x7fb5e1649082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
#28 0x55ce304c5348 in _start (/home/twsmith/workspace/browsers/m-c-20230201215112-fuzzing-debug/firefox-bin+0x5b348) (BuildId: 785d584ab07eb3cb7fbcbd4f4f8c4f4318898232)
Severity: -- → S3
Attached image testcase.avif (deleted) —
Attached file harness.html (deleted) —

Thanks for the testcase. I am able to reproduce, but only when the pref image.avif.sequence.enabled=true

Oops, I meant to say I could reproduce, editing the last comment.

A Pernosco session is available here: https://pernos.co/debug/aoF7fuIh64akPgxsuygjLg/index.html

Looks like we're hitting an intermittent successful decode of the second frame from the testcase, where it should always be failing. If I set image_avif_use_dav1d to false, this issue doesn't occur. In particular, this line should always have a failure code and return from nsAVIFDecoder::Decode().

This seems like it may be an issue in dav1d.

Given that this is an intermittent failure code (DAV1D_ERR(EINVAL)) from dav1d, it might be good to get a Pernosco session from an unsuccessful decode to find out where that status is coming from. It appears to be happening in a tile decode worker, so if necessary dav1d could be configured to run only one thread.

However, I don't have access to Pernosco, and this one might be pretty time-consuming to track down without it, so I think I'll leave the rest to someone who does have access.

I think I know what the problem is, but maybe I'm missing something because I've never looked at dav1d code until today.

We kick off a decode on an imgdecoder thread, thread #20. This communicates to dav1d-worker threads. dav1d-worker threads 189 and 194 seem to be working on this frame. We haven't yet encountered the error in the frame yet, and thread 189 gets here

https://searchfox.org/mozilla-central/rev/0e3779abe6ad44feb787a295fbb67bc9fe4e927a/third_party/dav1d/src/thread_task.c#907

where it pulls f->task_thread.error shared by all threads that care about this Dav1dFrameContext. So it pulls 0 (no error yet). Thread 189 pauses here for a bit. Then thread 194 encounters the error in the frame and sets f->task_thread.error here

https://searchfox.org/mozilla-central/rev/0e3779abe6ad44feb787a295fbb67bc9fe4e927a/third_party/dav1d/src/thread_task.c#765

Then thread 189 continues, it hasn't moved since I last mentioned it. It hits this if

https://searchfox.org/mozilla-central/rev/0e3779abe6ad44feb787a295fbb67bc9fe4e927a/third_party/dav1d/src/thread_task.c#920

Back when thread 189 fetched f->task_thread.error above the value of f->task_thread.task_counter was 4 and f->task_thread.done was {0,0}, but now that is no longer true (probably because thread 194 has modified them) so we enter this if and call dav1d_decode_frame_exit. We have the stale error value from above, so we pass retval = 0. So this seems the problem is pulling the error value and then later checking the termination condition of the overall frame tasks, in between that time an error can happen on another thread.

It also seems like this code dav1d_decode_frame_exit is a little fishy

https://searchfox.org/mozilla-central/rev/0e3779abe6ad44feb787a295fbb67bc9fe4e927a/third_party/dav1d/src/decode.c#3422

which calls atomic_init on the shared error value (f->task_thread.error) and sets it back to 0. We don't actually use f->task_thread.error in this function, but perhaps it causes a problem elsewhere/later?

Also, atomic_init seems a little bit weird to be calling here. cppreference says

"Initializes the default-constructed atomic object obj with the value desired. The function is not atomic: concurrent access from another thread, even through an atomic operation, is a data race.

If obj was not default-constructed, the behavior is undefined.

If this function is called twice on the same obj, the behavior is undefined."

Presumably we only call dav1d_decode_frame_exit after all other threads that might be accessing f->task_thread.error have finished, so probably no other threads are touching this memory at this time. But the part about calling this function twice on the same obj seems bad because we likely already called atomic_init for this memory in dav1d_submit_frame.

This sounds like something that should be brought up with dav1d itself, but in the meantime it sounds like a fairly reasonable solution would be to disable threading, at least for animated images, for now.

I opened an issue for this on the dav1d issue tracker.

I can't access #419 for some reason, I'll ask the admin what's up with that. As for the issue, I have to look at the locking conditions for f->task_thread.error. I think the intention is that it's rolling, that is, it's OK for it to not immediately terminate other threads since it saves us having to take a lock for it (which is expensive for throughput). But we probably don't want to clear it in dav1d_decode_frame_exit(). I remember there being an issue when we don't unconditionally clear it, I'll see if I can find back what that was. If you change dav1d_decode_frame_exit() to only clear f->task_thread.error if retval!=0, does that resolve this issue? Similarly (but unfortunately still racy), if you re-load error before calling dav1d_decode_frame_exit() in thread_task.c:923, does that make this less likely to happen?

Can you open dav1d #419 (https://code.videolan.org/videolan/dav1d/-/issues/419) still? Nobody on our side can access that issue. I have an email notification for it, but nothing else. We're wondering if maybe some weird flag was set during creation that makes it inaccessible for other people. If you've set security/hidden flags, can you unset them?

(In reply to Ronald S. Bultje from comment #13)

Can you open dav1d #419 (https://code.videolan.org/videolan/dav1d/-/issues/419) still? Nobody on our side can access that issue. I have an email notification for it, but nothing else. We're wondering if maybe some weird flag was set during creation that makes it inaccessible for other people. If you've set security/hidden flags, can you unset them?

Not sure if this is to me or not. I can't access https://code.videolan.org/videolan/dav1d/-/issues/419 either, it says 404 for me. I don't think I touched it after I filed it, and I don't think I set any flags on it. I have one email notification for it but nothing else as well. I didn't add any information to it that isn't in this bug (although I might have explained the issue better).

I haven't forgotten about comment 12, just haven't had a chance yet to take a look.

(In reply to Ronald S. Bultje from comment #12)

If you change dav1d_decode_frame_exit() to only clear f->task_thread.error if retval!=0, does that resolve this issue?

I didn't observe the setting in dav1d_decode_frame_exit cause a problem, so I didn't try this.

Similarly (but unfortunately still racy), if you re-load error before calling dav1d_decode_frame_exit() in thread_task.c:923, does that make this less likely to happen?

I tried this, it seemed to fix this bug for me, and Tyson also was not able to reproduce this bug and the two other bugs. So this is more evidence that this issue is the root cause.

Victorien has merged a version of that fix into the codebase. Can you see if the issues are now fixed for you for each of the afore-mentioned bugs?

(In reply to Ronald S. Bultje from comment #17)

Victorien has merged a version of that fix into the codebase. Can you see if the issues are now fixed for you for each of the afore-mentioned bugs?

Thanks! Looking at the patch it's a superset of what we verified with, so I'm pretty confident it will fix the problem. We'll get an updated dav1d into the the Firefox repo and then we can verify and close these bugs. Thanks for your work on this!

Bug 1816484 landed which should fix this. Tyson can you confirm? Thanks!

Depends on: 1816484
Flags: needinfo?(twsmith)

I am no longer able to reproduce this issue and fuzzers are no longer reporting it.

Flags: needinfo?(twsmith)

Thanks! Let's call it fixed by bug 1816484 then.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Assignee: nobody → cchang
Target Milestone: --- → 113 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: