Bug 1589165 (Open) · Opened 5 years ago · Updated 2 years ago

7.3% of content process time spent in MacIOSurfaceImage::SetData() during vp9 decoding

Categories

(Core :: Audio/Video: Playback, defect, P2)

People

(Reporter: jrmuizel, Unassigned)

References

(Blocks 1 open bug)

Details

I suspect we're spending most of our time here doing zero-fill-on-demand/page-faulting a bunch? Can we avoid this copy? If not, can we reuse the buffer that we're copying into?
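A minimal sketch of the zero-fill-on-demand hypothesis (assuming POSIX mmap and one 8-bit 4:2:0 1080p frame; the function names are invented for illustration): copying into a freshly allocated buffer pays a page fault per touched page, while a reused buffer does not.

#include <cstdint>
#include <cstring>
#include <sys/mman.h>

// One 8-bit 4:2:0 1080p frame: full-size Y plane plus quarter-size Cb and Cr.
constexpr size_t kFrameBytes = 1920 * 1080 * 3 / 2;

// Fresh mapping per frame: the kernel hands out pages lazily, so the first
// write to each 4 KiB page inside memcpy takes a zero-fill-on-demand fault.
void CopyIntoFreshBuffer(const uint8_t* aSrc) {
  void* dst = mmap(nullptr, kFrameBytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  memcpy(dst, aSrc, kFrameBytes);
  munmap(dst, kFrameBytes);
}

// Reused mapping: after the first frame every page is already resident and
// the copy runs fault-free, at plain memory bandwidth.
void CopyIntoReusedBuffer(const uint8_t* aSrc) {
  static void* dst = mmap(nullptr, kFrameBytes, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  memcpy(dst, aSrc, kFrameBytes);
}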

Blocks: 1400787
Priority: -- → P2

Profile: https://share.firefox.dev/2WE2msp

Is this the same as bug 1403618, or is it a different copy?

Flags: needinfo?(jyavenard)

(In reply to Markus Stange [:mstange] from comment #1)

Profile: https://share.firefox.dev/2WE2msp

Is this the same as bug 1403618, or is it a different copy?

This copy is required to move the content from the decoded image into a shared buffer. This is used on all platforms except Windows with HW acceleration enabled.

On macOS, bug 1403618 would avoid a later copy from the shared buffer into the final texture. But we would still need to do one upload to an IOSurface, so that would still account for some time. We would just shift the time spent from one place to another.

To answer Jeff's question: the image we're copying into comes from the image factory, and it would be up to that code to return a previously used object.
I don't know why he would think that most of the time is spent with zero-fill.
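To illustrate that "return a previously used object" idea, a minimal recycling-pool sketch (the class and method names are hypothetical, not the actual Gecko image factory): buffers handed back to the pool keep their pages resident, so the next frame's copy avoids both the allocation and the zero-fill faults discussed in comment 0.

#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical recycling pool, not the real image factory.
class RecyclingBufferPool {
 public:
  std::shared_ptr<std::vector<uint8_t>> Acquire(size_t aBytes) {
    std::lock_guard<std::mutex> lock(mMutex);
    if (!mFreeList.empty() && mFreeList.back()->size() == aBytes) {
      auto buffer = std::move(mFreeList.back());
      mFreeList.pop_back();
      return buffer;  // warm buffer: pages already faulted in
    }
    return std::make_shared<std::vector<uint8_t>>(aBytes);
  }

  void Recycle(std::shared_ptr<std::vector<uint8_t>> aBuffer) {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mFreeList.size() < kMaxPooled) {
      mFreeList.push_back(std::move(aBuffer));
    }
  }

 private:
  // Cap the pool so a resolution change can't strand large buffers forever.
  static constexpr size_t kMaxPooled = 10;

  std::mutex mMutex;
  std::vector<std::shared_ptr<std::vector<uint8_t>>> mFreeList;
};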

This copy is used to extract the data from the decoder into a shareable surface that we can send to the compositor.

The only way to avoid it would be to do the on-demand decoding described in bug 1539735.

Flags: needinfo?(jyavenard)

(In reply to Jean-Yves Avenard [:jya] from comment #2)

This copy is required to move the content from the decoded image into a shared buffer. This is used on all platforms except Windows with HW acceleration enabled.

I see. So to avoid it for software decoding, would we need to decode straight into the shared buffer?

The only way to avoid it would be to do the on-demand decoding described in bug 1539735.

I don't understand what problem that would be solving. The problem described in bug 1539735 seems to be due to a total hardware buffer limit of 4 buffers, which exists on Windows for hardware decoding, if I understand correctly. But this bug is about software decoding.

On macOS, bug 1403618 would avoid a later copy from the shared buffer into the final texture. But we would still need to do one upload to an IOSurface, so that would still account for some time. We would just shift the time spent from one place to another.

I see. I agree, uploading via an IOSurface and uploading via a GL texture (with client storage) involve the same number of copies.
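For reference, the "client storage" path mentioned here is Apple's APPLE_client_storage extension for the legacy macOS OpenGL stack: the driver sources texel data straight from the caller's buffer instead of keeping its own copy. A rough sketch (error handling omitted; the function name is invented):

#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

// APPLE_client_storage upload: the GL keeps reading from aClientBuffer, so
// the caller must keep that memory alive and unchanged while the texture is
// in use. This trades the driver-side copy for lifetime bookkeeping.
void UploadWithClientStorage(GLuint aTexture, const void* aClientBuffer,
                             GLsizei aWidth, GLsizei aHeight) {
  glBindTexture(GL_TEXTURE_RECTANGLE_ARB, aTexture);
  glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_STORAGE_HINT_APPLE,
                  GL_STORAGE_SHARED_APPLE);
  glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
  glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA, aWidth, aHeight, 0,
               GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, aClientBuffer);
  glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_FALSE);
}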

Flags: needinfo?(jyavenard)

(In reply to Markus Stange [:mstange] from comment #3)

I don't understand what problem that would be solving. The problem described in bug 1539735 seems to be due to a total hardware buffer limit of 4 buffers, which exists on Windows for hardware decoding, if I understand correctly. But this bug is about software decoding.

Because it's fundamentally the same issue. FFmpeg uses a pool of software buffers. Those buffers are ref-counted, so we could keep them for longer; however, making this refcounting system work across processes (between RDD/Content/GPU) is currently non-functional (and I'm not sure it could be done).
Additionally, the MediaDecoder's state machine will buffer about 10 decoded frames in the video sink before sending them to the compositor.

So we have to copy FFmpeg's image into shareable memory.

Now, if we did the decoding on demand, right as we need to paint the image, we could use FFmpeg's image directly, uploading it immediately into the required surface.
No more copies, no more queuing.
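For context on the refcounting point above, a minimal in-process sketch of how FFmpeg's frame refcounting lets a caller hold decoded data without copying. The catch is that the refcount lives in process-local memory, so nothing here survives a hop between RDD, Content, and GPU:

extern "C" {
#include <libavutil/frame.h>
}

// Take an extra reference on a decoded frame: no pixel copy, just refcount++
// on the decoder pool's buffers, which the pool can't recycle until the last
// reference is dropped. Returns 0 on success, a negative AVERROR otherwise.
int KeepFrameAlive(AVFrame* aDecoded, AVFrame** aKept) {
  *aKept = av_frame_alloc();
  if (!*aKept) {
    return AVERROR(ENOMEM);
  }
  return av_frame_ref(*aKept, aDecoded);
}

// Drop our reference; the pool may now recycle the underlying buffers.
void ReleaseFrame(AVFrame** aKept) {
  av_frame_free(aKept);
}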

Flags: needinfo?(jyavenard)

(In reply to Markus Stange [:mstange] from comment #1)

Profile: https://share.firefox.dev/2WE2msp

After bug 1653409, the copy has now moved into MacIOSurfaceImage::SetData: https://share.firefox.dev/3ft9MFy

Instruments says that, of all the CPU usage in the content process, 70% is spent in VP9 decoding and 12% is spent in this copy. The rest is in various other media / IPC things and some is in JS.

Summary: 7.3% of content process time spent in MappedYCbCrChannelData::CopyInto() during vp9 decoding → 7.3% of content process time spent in MacIOSurfaceImage::SetData() during vp9 decoding

From what I can tell, the only way to eliminate this copy would be to have an API on each software decoder that would let us specify our own destination buffers into which the video frame should be decoded. From a quick glance, neither ffmpeg nor libav seem to have such an API - they manage their own buffers internally, and they give you access to the buffer data after decoding.
Bug 1539735 can't magically make this copy go away either. The decoded data still needs to be put into a shared buffer - either into a shmem or, on macOS, into an IOSurface.
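One possible exception worth noting: FFmpeg does expose a hook in this direction, the AVCodecContext::get_buffer2 callback, which lets the caller allocate the destination AVFrame buffers for decoders that advertise AV_CODEC_CAP_DR1. Whether its stride, alignment, and padding requirements could be satisfied by a shmem or IOSurface allocation is an open question. A rough sketch, where SharedPoolAcquire/SharedPoolRelease are hypothetical helpers:

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/imgutils.h>
}
#include <cstdint>

// Hypothetical helpers handing out / taking back shared-memory allocations.
uint8_t* SharedPoolAcquire(size_t aBytes);
void SharedPoolRelease(void* aOpaque, uint8_t* aData);

// Sketch of a get_buffer2 override that makes the decoder write into
// caller-supplied memory. A real implementation must also honor the padding
// from avcodec_align_dimensions2(); that is omitted here for brevity.
static int GetSharedBuffer(AVCodecContext* aCtx, AVFrame* aFrame, int aFlags) {
  if (!(aCtx->codec->capabilities & AV_CODEC_CAP_DR1)) {
    // This decoder can't do direct rendering; use FFmpeg's own allocator.
    return avcodec_default_get_buffer2(aCtx, aFrame, aFlags);
  }
  const auto format = static_cast<AVPixelFormat>(aFrame->format);
  int size = av_image_get_buffer_size(format, aFrame->width, aFrame->height, 32);
  if (size < 0) {
    return size;
  }
  uint8_t* data = SharedPoolAcquire(static_cast<size_t>(size));
  if (!data) {
    return AVERROR(ENOMEM);
  }
  // Wrap the shared allocation so FFmpeg's refcounting hands it back to the
  // pool when the last reference to the frame is dropped.
  aFrame->buf[0] = av_buffer_create(data, size, SharedPoolRelease, nullptr, 0);
  if (!aFrame->buf[0]) {
    return AVERROR(ENOMEM);
  }
  int ret = av_image_fill_arrays(aFrame->data, aFrame->linesize, data, format,
                                 aFrame->width, aFrame->height, 32);
  return ret < 0 ? ret : 0;
}
// Installed with: codecContext->get_buffer2 = GetSharedBuffer;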

(In reply to Markus Stange [:mstange] from comment #6)

Bug 1539735 can't magically make this copy go away either. The decoded data still needs to be put into a shared buffer - either into a shmem or, on macOS, into an IOSurface.

The idea is to have all decoders in the GPU process; there will be no need to copy the decoded image into a shared buffer. The aim is to render directly what comes out of the decoder, be it software or hardware.

As an interested reader I would like to understand the difference and have a naive question:
Isn't it that you can either
a) copy an image into shared memory to pass it across processes as RawData to WebRender, which uploads/copies it for you (= two copies?), or
b) upload/copy an image into a GL texture yourself and pass it as a NativeTexture to WebRender, which renders directly from it (= one copy?)?
Weren't bug 1403618 and its sister bugs about b)?

The idea is to have all decoders in the GPU process; there will be no need to copy the decoded image into a shared buffer. The aim is to render directly what comes out of the decoder, be it software or hardware.

Wouldn't that just remove the first copy in a), which isn't needed for WebRender anyway because you have b)?
So is this concern only about non-WebRender or Software WebRender?

FFmpeg uses a pool of software buffers. Those buffers are ref-counted, so we could keep them for longer; however, making this refcounting system work across processes (between RDD/Content/GPU) is currently non-functional (and I'm not sure it could be done).

Or do you want to go from one copy down to zero copies (what Chrome describes here) by letting ffmpeg software-decode directly into a given piece of GPU memory via dmabuf, so that it doesn't have to allocate the internal memory you would normally copy from? Is that bug 1376999 and bug 1539735 together? And the GPU cache misses Chrome talks about wouldn't matter because video isn't intended for the picture cache anyway?

(In reply to Jean-Yves Avenard [:jya] from comment #7)

The idea is to have all decoders in the GPU process;

Oh, I see! That's a crucial detail that wasn't obvious to me before.

there will be no need to copy the decoded image into a shared buffer.

On macOS the video is presented to the screen with the help of an IOSurface. There is no way around putting data into the IOSurface. The main-memory representation of the IOSurface is allocated by the IOSurface itself; we cannot tell it to wrap an existing buffer that comes from elsewhere. So if the decoder can't decode into the IOSurface's backing memory, we have to copy the data into it.
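To make that concrete, a sketch of the copy using the IOSurface C API (these calls are real; the plane geometry is simplified, error handling omitted, and the function name invented). There is no IOSurface call that adopts an existing buffer; all a client can do is lock the surface, look up its base address, and write:

#include <IOSurface/IOSurface.h>

#include <algorithm>
#include <cstdint>
#include <cstring>

// Copy one decoded plane into an IOSurface, which owns its backing memory.
void CopyPlaneIntoIOSurface(IOSurfaceRef aSurface, size_t aPlane,
                            const uint8_t* aSrc, size_t aSrcStride,
                            size_t aRows) {
  IOSurfaceLock(aSurface, 0, nullptr);
  uint8_t* dst = static_cast<uint8_t*>(
      IOSurfaceGetBaseAddressOfPlane(aSurface, aPlane));
  size_t dstStride = IOSurfaceGetBytesPerRowOfPlane(aSurface, aPlane);
  for (size_t row = 0; row < aRows; row++) {
    // Row by row, because the surface's stride rarely matches the decoder's.
    memcpy(dst + row * dstStride, aSrc + row * aSrcStride,
           std::min(aSrcStride, dstStride));
  }
  IOSurfaceUnlock(aSurface, 0, nullptr);
}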

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #8)

down to zero-copy (what Chrome describes here)

Thanks for the link, that's a really interesting article!

(In reply to Markus Stange [:mstange] from comment #9)

On macOS the video is presented to the screen with the help of an IOSurface.

Let me clarify a bit here. Bug 1653048 recently added an optimization that lets us skip the OpenGL compositing of the video if we have an IOSurface for it. This eliminates a copy. But it means that the IOSurface is now used not just to share the video data across Firefox processes, but also to share the data with the window server process. That means that we now need an IOSurface even if only one Firefox process is involved in the decoding and compositing.
And just to be clear, we're in a really good state now! Eliminating that last copy is a great win. But if there's a way to eliminate the copy in MacIOSurfaceImage::SetData as well, by making the decoders decode into the IOSurface memory buffer, then we'll be in an even better state.

Regarding comment 10, on Linux (Wayland compositing): bug 1617498 would separately hand over videos and WebRender's picture-cache tiles to the Wayland server (Mutter), so that WebRender would no longer have to composite them into one picture (WR draw compositor / non-OS-compositor) before handing them over to Wayland.
Until then, partial present (bug 1648872) could tell the Wayland server to look only at the changed region instead of the whole window.
On X11 (bug 1656472), Mesa support is being added and Nvidia supports it already, but needs bug 1650583.

Severity: normal → S3