Open Bug 1394061 Opened 7 years ago Updated 2 years ago

av1 performance regression

Categories

(Core :: Audio/Video: Playback, defect, P3)

People

(Reporter: rillian, Assigned: drno)

Details

The recent update of the av1 reference implementation in third_party/aom to upstream commit id f5bdeac22930ff4c6b219be49c843db35970b918 (bug 1380118) resulted in dropped frames for streams over 1 Mbps, even on high-end hardware, as demonstrated by the demo at https://demo.bitmovin.com/public/firefox/av1/. A lot of new features have been added to the code recently, so it may just be extra complexity. It's also now returning 16-bit-per-channel image data, even for 8-bit input, so memory bandwidth should be higher. Or maybe something is interacting badly with the Firefox playback scheduling. This bug is about tracking down and resolving the regression so the demo plays smoothly.
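To put a number on the memory bandwidth point, here is a minimal sketch of the per-frame size at 8 and 16 bits per sample. The 1280x720 resolution and the 4:2:0 layout are assumptions for illustration only; this bug does not state the demo's resolution or subsampling, and `frame_bytes` is a hypothetical helper.

```python
def frame_bytes(width: int, height: int, bytes_per_sample: int) -> int:
    """Size of one 4:2:0 frame: a full luma plane plus two quarter-size chroma planes."""
    luma = width * height
    # Chroma planes are halved in each dimension, rounding up for odd sizes.
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return (luma + chroma) * bytes_per_sample

# Hypothetical frame size (assumption, not taken from the demo).
WIDTH, HEIGHT = 1280, 720

eight_bit = frame_bytes(WIDTH, HEIGHT, 1)    # 8-bit samples, 1 byte each
sixteen_bit = frame_bytes(WIDTH, HEIGHT, 2)  # 16-bit samples, 2 bytes each
print(eight_bit, sixteen_bit, sixteen_bit / eight_bit)
```

The factor of two holds for any resolution or chroma subsampling, since only the bytes per sample change.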
David, the update should be merged soon, making it easier to verify this. Could you run your profiler, please, and see if anything stands out vs the 2017 August 25 Firefox Nightly?
Flags: needinfo?(dmajor)
(Fixed typo) As a first sanity check, I confirmed that I can reproduce the symptoms on my test machine (I haven't run the profiler yet) with https://demo.bitmovin.com/public/firefox/av1/ at 1 Mbps.
Nightly 08-25 takes about 9% of my 8-core CPU, so about 75% of a core.
Nightly 08-28 takes about 15% of my CPU, so a full core and then some.
Flags: needinfo?(dmajor)
Huge increase in av1_loop_filter_rows:

Nightly 0825:
xul.dll!av1_loop_filter_frame, 7532
xul.dll!av1_loop_filter_rows, 7531
xul.dll!av1_filter_block_plane_non420_ver, 4828
xul.dll!av1_filter_block_plane_non420_hor, 2676

Nightly 0828:
xul.dll!av1_loop_filter_frame, 33164
xul.dll!av1_loop_filter_rows, 33163
xul.dll!av1_filter_block_plane_vert, 16703
xul.dll!av1_filter_block_plane_horz, 16378
Also a large increase in CreateAndCopyData:

Nightly 0825:
xul.dll!mozilla::VideoData::CreateAndCopyData, 611
xul.dll!mozilla::VideoData::SetVideoDataToImage, 607
xul.dll!mozilla::layers::SharedPlanarYCbCrImage::CopyData, 607
xul.dll!mozilla::layers::UpdateYCbCrTextureClient, 589
xul.dll!mozilla::layers::MappedYCbCrTextureData::CopyInto, 588
xul.dll!mozilla::layers::MappedYCbCrChannelData::CopyInto, 588

Nightly 0828:
xul.dll!mozilla::VideoData::CreateAndCopyData, 4907
xul.dll!mozilla::VideoData::SetVideoDataToImage, 4895
xul.dll!mozilla::layers::SharedPlanarYCbCrImage::CopyData, 4895
xul.dll!mozilla::layers::UpdateYCbCrTextureClient, 4882
xul.dll!mozilla::layers::MappedYCbCrTextureData::CopyInto, 4882
xul.dll!mozilla::layers::MappedYCbCrChannelData::CopyInto, 4882
And some notable decreases:
av1_cdef_frame decreased from 7106 to 4573
update_boundary_info (#663) decreased from 6043 to negligible

The numbers in the data above are sample counts at 8kHz for 10 seconds of playback. So about 80k samples represents "a full core" worth of processing. In total I had 72k samples in nightly 0825, and 92k samples in nightly 0828, which more or less agrees with my comment 3.
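As a rough cross-check of that arithmetic, the sample counts can be converted into cores. This is a minimal sketch using only the figures quoted in this comment (8 kHz sampling, 10 seconds, approximate totals of 72k and 92k samples); the helper name `samples_to_cores` is illustrative.

```python
SAMPLE_RATE_HZ = 8_000  # profiler sampling rate quoted above
DURATION_S = 10         # seconds of playback that were profiled

def samples_to_cores(samples: int) -> float:
    """Convert a sample count into the equivalent number of fully busy cores."""
    return samples / (SAMPLE_RATE_HZ * DURATION_S)

# Approximate totals per build, as quoted above.
totals = {"nightly 0825": 72_000, "nightly 0828": 92_000}
for build, samples in totals.items():
    print(f"{build}: {samples_to_cores(samples):.2f} cores")

# Loop filter samples relative to each build's (approximate) total.
loop_filter = {"nightly 0825": 7_532, "nightly 0828": 33_164}
for build, samples in loop_filter.items():
    print(f"{build}: loop filter is {samples / totals[build]:.1%} of samples")
```

Because the totals are rounded to the nearest thousand, the resulting shares are approximate.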
Thanks, David. That's very helpful. It looks like the PARALLEL_DEBLOCK feature disabled SIMD in the loop filter. I've tried to re-enable it in https://aomedia-review.googlesource.com/c/aom/+/19920 but there may be alignment issues. The CreateAndCopyData spike is on our side. I thought that using the mSkip member of YCbCrBuffer::Plane would avoid the overhead of my downsampling loop, but if it did, it didn't help enough. Hopefully native 10/12-bit support (bug 1215089) will give us a fast-path here, but in the meantime I'll look into optimizing what we have.
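For readers unfamiliar with the conversion being discussed, a per-sample downsampling loop might look roughly like the following. This is a sketch only, not the code in the Firefox tree: the function name, the little-endian 16-bit layout, and the shift-based conversion are all assumptions.

```python
import struct

def downsample_plane(src: bytes, width: int, height: int,
                     stride: int, bit_depth: int = 8) -> bytes:
    """Convert one plane of 16-bit samples to 8-bit samples.

    Assumptions (hypothetical, for illustration): samples are little-endian
    unsigned 16-bit words, `stride` is the source row length in bytes, and
    values occupy the low `bit_depth` bits of each word.
    """
    shift = bit_depth - 8  # 0 when 8-bit content is stored in 16-bit words
    out = bytearray(width * height)
    for y in range(height):
        # Read one row of 16-bit samples, skipping any row padding.
        row = struct.unpack_from(f"<{width}H", src, y * stride)
        for x, sample in enumerate(row):
            out[y * width + x] = (sample >> shift) & 0xFF
    return bytes(out)

# One 4x1 row of 8-bit values carried in 16-bit words.
row = struct.pack("<4H", 16, 128, 235, 255)
print(list(downsample_plane(row, width=4, height=1, stride=8)))
```

Each output byte costs a two-byte read, and a row can no longer be moved with a plain block copy, which is why a native 16-bit path is attractive.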
Assignee: nobody → giles
Depends on: 1215089
Depends on: 1413734
Nils, this is a tracking bug for a performance regression from the last update we did. The recent upstream optimizations should help, but we probably still need a 16-bit fast-path. Overlaps with HDR work.
Assignee: giles → drno
Severity: normal → S3