Open Bug 1539735 Opened 6 years ago Updated 2 years ago

Perform on demand decoding driven by the compositor and move all decoding into the GPU process

Categories

(Core :: Audio/Video: Playback, enhancement, P3)

People

(Reporter: jya, Unassigned, NeedInfo)

References

(Blocks 2 open bugs)

Details

Idea came up during a 1:1 with :mattwoodrow

Currently the playback stack on Windows is something like:

Demuxing (CP) -> IPC -> Decoding (GPU) -> Copy into a new sync surface (GPU) -> IPC -> MediaFormatReader/MediaDecoderStateMachine which will buffer at least 3 frames (CP) -> VideoSink (CP) -> ImageContainer (CP) -> IPC -> Compositor (GPU) -> Upload to surface / convert to RGB (GPU)

On Windows, a DXVA decoder will use and recycle around 4 surfaces. As such, if you attempt to decode a 5th frame, it will typically use the same surface as what was used for the first frame.

The VideoSink / ImageContainer itself will keep over 10 frames in its queue.

As such, we must perform a blit/copy into a new surface (including allocation of such surface) before returning that image to the MDSM.
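
To illustrate why the copy is needed, here is a toy model of the recycling behaviour described above. It is only a sketch: `RecyclingDecoder` and `queued_contents` are hypothetical names, the pool size of 4 is the approximate figure quoted above, and real DXVA surface management is more involved than a round-robin list.

```python
class RecyclingDecoder:
    """Toy decoder that writes output into a fixed, recycled pool of surfaces."""

    def __init__(self, pool_size=4):
        # Each slot stands for one decoder-owned surface (assumed pool of 4).
        self.surfaces = [None] * pool_size
        self.next_slot = 0

    def decode(self, frame_id):
        slot = self.next_slot
        self.surfaces[slot] = frame_id  # overwrites the previous content
        self.next_slot = (slot + 1) % len(self.surfaces)
        return slot


def queued_contents(num_frames, copy_out):
    """Decode num_frames, queue them, then report what each queued entry shows."""
    decoder = RecyclingDecoder()
    queue = []
    for frame_id in range(1, num_frames + 1):
        slot = decoder.decode(frame_id)
        if copy_out:
            # Blit into a freshly allocated surface owned by the queue.
            queue.append(("copy", decoder.surfaces[slot]))
        else:
            # Keep only a reference to the decoder-owned surface.
            queue.append(("slot", slot))
    return [
        value if kind == "copy" else decoder.surfaces[value]
        for kind, value in queue
    ]


print(queued_contents(5, copy_out=False))  # first entry now shows frame 5
print(queued_contents(5, copy_out=True))   # every entry keeps its own frame
```

Without the copy, the entry queued for frame 1 ends up showing frame 5 once the fifth decode reuses the first surface, so a queue deeper than the pool cannot hold references to decoder surfaces.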

In bug 1536449, the profiler reveals that copying into the sync surface is where most of the time is spent, leading to disastrous performance.

One solution would be to deal only with compressed frames and feed the compositor with those instead.

When the compositor needs to paint a new frame, it would request a decoder to perform decoding on the fly. If decoding is too slow, that information would be propagated back to the VideoSink/MDSM, which would then activate the skip-to-next-frame logic.

We would no longer need to keep a massive queue of decoded frames, and could likely do everything under the 4 frames limit imposed by the Windows MFT.
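
The pull model proposed above could be sketched roughly as follows. This is not Gecko code: `CompressedSample`, `OnDemandDecoder`, `request_frame` and `on_too_slow` are hypothetical names, decode times are simulated inputs, and the sketch decodes samples strictly in order without modelling keyframes or the actual skip-to-next-frame logic.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CompressedSample:
    pts_ms: int          # presentation time of the frame
    decode_cost_ms: int  # simulated decode duration (assumed input)


class OnDemandDecoder:
    """Holds compressed samples and decodes one only when the compositor asks."""

    def __init__(self, samples, budget_ms, on_too_slow):
        self.pending = deque(samples)
        self.budget_ms = budget_ms
        self.on_too_slow = on_too_slow  # stands in for feedback to VideoSink/MDSM

    def request_frame(self, now_ms):
        # Nothing new is due yet: the compositor keeps showing the previous frame.
        if not self.pending or self.pending[0].pts_ms > now_ms:
            return None
        sample = self.pending.popleft()
        # Report a decode that overran the budget so the sink can react.
        if sample.decode_cost_ms > self.budget_ms:
            self.on_too_slow(sample.pts_ms)
        return sample.pts_ms


def simulate():
    slow = []
    samples = [
        CompressedSample(0, 4),
        CompressedSample(33, 12),  # exceeds the 8 ms budget below
        CompressedSample(66, 5),
    ]
    decoder = OnDemandDecoder(samples, budget_ms=8, on_too_slow=slow.append)
    # The compositor paints at these (hypothetical) times in milliseconds.
    painted = [decoder.request_frame(t) for t in (0, 16, 33, 50, 66)]
    return painted, slow


print(simulate())
```

In this model only compressed samples are queued, so at most one decoded surface is live per paint, which is what would keep usage under the surface limit mentioned above.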

Rank: 20
Priority: -- → P3
Blocks: 1594677

jya, any idea what the priority of this is these days?

Flags: needinfo?(jyavenard)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #1)

jya, any idea what the priority of this is these days?

I'd like to start looking into this after the fission work has completed.

Flags: needinfo?(jyavenard)

Adding more information from bug 1589165:

(In reply to Jean-Yves Avenard [:jya] from bug 1589165 comment #7)

The idea is to have all decoders in the GPU process; there will be no need to copy the decoded image into a shared buffer. The aim is to render directly what comes out of the decoder, be it software or hardware.

Summary: Perform on demand decoding driven by the compositor → Perform on demand decoding driven by the compositor and move all decoding into the GPU process

I'm worried that decoding on demand will add too much time to our frame time and cause us to blow our frame budget. It feels like we should be able to overlap decoding and drawing. From my quick investigation on my Skylake Windows machine using GPUView, video decode can definitely overlap with 3D rendering. Further, I see 4K H.264 video decoding times of 6-10 ms, which would eat a lot of the frame budget (especially when running at higher frame rates like 144 Hz).
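
For a sense of scale, the arithmetic behind that concern can be worked out as below. The helper names are made up for illustration, and the calculation assumes the worst case where decoding is fully serialized with drawing, which the overlap observed in GPUView suggests need not hold.

```python
def frame_budget_ms(refresh_hz):
    """Time available per frame at a given refresh rate, in milliseconds."""
    return 1000.0 / refresh_hz


def budget_share(decode_ms, refresh_hz):
    """Fraction of the frame budget consumed by a decode of decode_ms."""
    return decode_ms / frame_budget_ms(refresh_hz)


for hz in (60, 144):
    low = budget_share(6, hz)    # lower bound of the 6-10 ms range quoted above
    high = budget_share(10, hz)  # upper bound of that range
    print(f"{hz} Hz: budget {frame_budget_ms(hz):.2f} ms, "
          f"decode uses {low:.0%} to {high:.0%}")
```

At 144 Hz the budget is about 6.94 ms, so a 10 ms decode alone would overrun the frame if it could not overlap with rendering.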

Also, do we know what Chrome does about this problem?

According to their blog post, it seems that they already use this pull-based approach in their rendering pipeline.

Looks like the VTXDecoder and/or RDD process has unbounded memory growth on Nightly (94.0a1).
Here's a test case (use 4K settings) to repro:

https://www.youtube.com/watch?v=fDIUdXkeBEY

I hope all's well with y'all <3

Edit: looks like this issue is now logged as a regression from bug 1731815

Flags: needinfo?(jmuizelaar)
Severity: normal → S3