[ffmpeg] Decode video frames directly into VideoData
Categories
(Core :: Audio/Video: Playback, enhancement, P2)
Tracking
Tracking | Status
---|---
firefox98 | fixed
People
(Reporter: stransky, Assigned: alwu)
References
(Depends on 2 open bugs, Blocks 5 open bugs, Regressed 1 open bug)
Details
Attachments
(7 files, 3 obsolete files)
(deleted), patch | Details | Diff | Splinter Review
(deleted), text/x-phabricator-request | Details
(deleted), image/png | Details
(deleted), text/x-phabricator-request | Details
(deleted), text/x-phabricator-request | Details
(deleted), text/x-phabricator-request | Details
(deleted), text/x-phabricator-request | Details
To eliminate a double copy, allow decoding video frames directly into dmabuf surfaces.
See:
https://stackoverflow.com/questions/28339980/ffmpeg-while-decoding-video-is-possible-to-generate-result-to-users-provided?rq=1
https://libav.org/documentation/doxygen/master/structAVCodecContext.html#a7c93198032a3a728b13cb7d7e637d295
Reporter | Updated • 3 years ago
Reporter | Comment 1 • 3 years ago
With Bug 1715365 (FFmpegVideoFramePool) we can extend it to frames located in RAM to reduce the AVFrame -> VideoData copy for all Linux backends - even Basic (at least for some well-known pixel formats).
If we allocate the frames in shm memory we may even reduce a copy during frame serialization (I'm not sure how the VideoData is processed in the GLX/Basic case).
Assignee | Comment 2 • 3 years ago
If my understanding is correct, this will not only help to avoid one copy (moving data into AVFrame) on Linux, but also can help ffvpx decoding on other platforms?
Reporter | Comment 3 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #2)
If my understanding is correct, this will not only help to avoid one copy (moving data into AVFrame) on Linux, but also can help ffvpx decoding on other platforms?
Yes, any platform which uses ffvpx can use that (but I think it's used on Linux only).
Assignee | Comment 4 • 3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #3)
Yes, any platform which uses ffvpx can use that (but I think it's used on Linux only).
We also use ffvpx on Windows and MacOS, so this would help on those platforms as well.
Assignee | Comment 5 • 3 years ago
Hi Martin, could you share your plan for this bug? Is there anything I can help with? I feel this could help a lot with the performance of VPX decoding when using ffvpx, which would be beneficial on multiple platforms. Thank you!
Reporter | Comment 6 • 3 years ago
Hi Alastor, I keep it on my radar but I'm busy with more pressing Linux/Wayland issues right now. I have WIP patches, though, and I'm going to come back and look at it.
Assignee | Comment 7 • 3 years ago
I started investigating how chromium does that today. Here is the place they set the get_buffer2
for ffmpeg in order to use their own buffer, and here is the place they create their own buffer for ffmpeg.
Reporter | Comment 8 • 3 years ago
Hi Alastor,
my WIP patch is attached here; it may help you get started. It's related to DMABuf, but the key parts may be reused for general in-place decoding as they are shared.
The key part is:
#if LIBAVCODEC_VERSION_MAJOR >= 58
if (mVideoFramePool /* && (mCodecContext->flags & AV_CODEC_CAP_DR1)*/) {
mCodecContext->opaque = this;
mCodecContext->get_buffer2 = GetVideoBufferWrapper;
mCodecContext->thread_safe_callbacks = 1;
}
#endif
where we set a custom method to supply video buffers to ffmpeg:
static int GetVideoBufferWrapper(struct AVCodecContext* aCodecContext,
AVFrame* aFrame, int aFlags);
static void ReleaseVideoBufferWrapper(void* opaque, uint8_t* data);
In FFmpegVideoDecoder<LIBAV_VER>::GetVideoBuffer() you need to create the buffer according to ffmpeg's expectations; it's similar to how we create the buffer for the copy.
The patch also contains related changes to FFmpegLibWrapper which may be the same.
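To make the callback shape concrete, here is a self-contained sketch of what a get_buffer2-style allocator has to do for a YUV420 frame. The MockFrame type, the Align helper, and the 32-pixel alignment are illustrative stand-ins, not the patch's actual code: the real implementation fills libavcodec's AVFrame, wraps the allocation with av_buffer_create() so ReleaseVideoBufferWrapper() runs when ffmpeg drops its reference, and plain calloc here stands in for the dmabuf/shmem pool.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Minimal stand-in for the ffmpeg types involved (illustration only; the
// real AVFrame/AVCodecContext live in libavcodec).
struct MockFrame {
  uint8_t* data[3] = {};
  int linesize[3] = {};
  int width = 0;
  int height = 0;
  void* opaque = nullptr;          // release cookie, like AVBufferRef's opaque
  void (*release)(void*) = nullptr;  // like av_buffer_create()'s free callback
};

static int Align(int aValue, int aAlignment) {
  return (aValue + aAlignment - 1) & ~(aAlignment - 1);
}

// Model of a GetVideoBuffer() callback: allocate one contiguous buffer
// (shmem/dmabuf in the real patch, calloc here) and point the three YUV420
// planes into it with aligned strides and a padded height.
static int GetVideoBuffer(MockFrame* aFrame) {
  const int stride = Align(aFrame->width, 32);          // assumed alignment
  const int alignedHeight = Align(aFrame->height, 32);  // ffmpeg pads height too
  const int ySize = stride * alignedHeight;
  const int cbCrStride = Align(stride / 2, 32);
  const int cbCrSize = cbCrStride * (alignedHeight / 2);

  uint8_t* buffer = static_cast<uint8_t*>(calloc(1, ySize + 2 * cbCrSize));
  if (!buffer) {
    return -1;  // ffmpeg treats a negative return as allocation failure
  }
  aFrame->data[0] = buffer;                    // Y plane
  aFrame->data[1] = buffer + ySize;            // Cb plane
  aFrame->data[2] = buffer + ySize + cbCrSize; // Cr plane
  aFrame->linesize[0] = stride;
  aFrame->linesize[1] = cbCrStride;
  aFrame->linesize[2] = cbCrStride;
  aFrame->opaque = buffer;
  aFrame->release = [](void* aBuffer) { free(aBuffer); };
  return 0;
}
```

In the real callback, a format the pool cannot handle would be forwarded to avcodec_default_get_buffer2() instead, as comment 9 below suggests.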
Reporter | Comment 9 • 3 years ago
Note that you don't have to create a custom buffer for every format - see avcodec_default_get_buffer2(). You tell ffmpeg to create the buffer, and then you copy the data later as we do now.
Assignee | Comment 10 • 3 years ago
Thanks for your patch!
After a quick glance, this seems to work only on Linux because it only handles DMABuf surfaces. As the ffmpeg decoder can also be used via ffvpx for software decoding on Windows and Mac, we also need to handle those cases, where we have to create a shmem to share with the GPU/chrome process, in order to let the compositor access the data directly.
Reporter | Comment 11 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #10)
Thanks for your patch!
After a quick glance, this seems to work only on Linux because it only handles DMABuf surfaces. As the ffmpeg decoder can also be used via ffvpx for software decoding on Windows and Mac, we also need to handle those cases, where we have to create a shmem to share with the GPU/chrome process, in order to let the compositor access the data directly.
Yes, the patch is designed for dmabuf but can be updated for shm memory. You need to create a shm memory buffer in the mCodecContext->get_buffer2 callback and pass it to VideoData instead of copying the data:
https://searchfox.org/mozilla-central/rev/75e9d727ce5ba2c14653cf8fb0f1367f085271b7/dom/media/MediaData.cpp#356
https://searchfox.org/mozilla-central/rev/75e9d727ce5ba2c14653cf8fb0f1367f085271b7/gfx/layers/ImageContainer.cpp#246
Assignee | Comment 12 • 3 years ago
Hi, Martin,
I'm still investigating the code to see how we can implement that. Now I have some rough ideas and have collected them into this doc. Would you mind checking whether the direction of my thinking is correct?
Thank you so much.
Reporter | Comment 13 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #12)
Hi, Martin,
I'm still investigating the code to see how we can implement that. Now I have some rough ideas and have collected them into this doc. Would you mind checking whether the direction of my thinking is correct?
Thank you so much.
Hello Alastor,
There are some facts which need to be considered:
- Direct decode to dmabuf (on Linux) is very slow. That's because ffmpeg also reads from the buffer during decode, and it expects the buffer to be zeroed. Direct decoding to dmabuf takes 700% (!) CPU on my box while indirect decoding takes 40-60%. GPU memory is generally very slow to read (if reading is even supported). I don't know how that is handled on Windows/Mac but I expect the situation is similar (or the data is cached and then moved to the GPU at once, but that's what we do already). The correct (and only working) way here is to create a texture, upload the video buffer into it and use that texture (which is done by gecko anyway).
- Recycled buffers can be held for a long time by different parts of gecko / different processes.
- You may need to use avcodec_default_get_buffer2() for video buffer formats which are not supported by us (at least for the beginning).
Right now the decoding sequence is (on Linux):
- ffmpeg allocates a video buffer and decodes video data into it
- we allocate a shm buffer and copy the video data there
- we allocate a GL texture and copy the data there
So there are 3 allocations and 3 video data copies.
Given these facts I'm suggesting to decode to shm memory buffers only. We can allocate every frame now and implement a frame pool later if that's needed; I don't know how expensive shm allocation is (we already allocate shm memory for every frame now).
Assignee | Comment 14 • 3 years ago
According to comment 13, as ffmpeg will read (referencing frames for decoding) and write (outputting decoded data) the buffer, providing GPU memory for it would be pretty slow because of the need to access GPU memory frequently from the CPU. Therefore, we should only give CPU memory (a shmem buffer) to the ffmpeg decoder.
If my understanding above is correct, then this improvement will only help with software composition?
Because for hardware composition and hardware decoding, the decoded data is already in GPU memory, so we don't actually need to copy it [1] and can give it directly to the hardware compositor. For software decoding and hardware composition, we would still create a sharable hardware texture (e.g. D3D11 or MacIOSurface) to share the data with the hardware compositor.
For the above cases, we would eventually create a hardware-based memory buffer to convey the data, so using shmem seems to be no help at all?
Sorry, I still don't really understand how the compositor works, so the above might be wrong.
Thank you so much.
[1] Although we still copy data on Windows, from my understanding that can be avoided (bug 1723207)
https://searchfox.org/mozilla-central/rev/a9ef6ad97d2d5f96d5ed51eda38f1a02700ccff7/dom/media/platforms/wmf/DXVA2Manager.cpp#906-931
Assignee | Comment 15 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #14)
According to comment 13, as ffmpeg will read (referencing frames for decoding) and write (outputting decoded data) the buffer, providing GPU memory for it would be pretty slow because of the need to access GPU memory frequently from the CPU. Therefore, we should only give CPU memory (a shmem buffer) to the ffmpeg decoder.
In addition, if this is the reason, would running the ffmpeg decoder in the GPU process solve the slowness? (I know only Windows has a GPU process; I'm just thinking about the possibility.)
Take vp9 on Windows as an example: when hardware decoding is enabled, we use the WMF decoder (the Windows platform decoder) in the GPU process, but once hardware decoding is disabled, we fall back to the ffvpx decoder in the RDD process. I imagine that if we could run the ffvpx decoder in the GPU process, then all reads/writes on the decoded buffer would happen in the GPU process and no longer be "slow" (if the reason I mentioned above is correct).
Reporter | Comment 16 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #14)
According to comment 13, as ffmpeg will read (referencing frames for decoding) and write (outputting decoded data) the buffer, providing GPU memory for it would be pretty slow because of the need to access GPU memory frequently from the CPU. Therefore, we should only give CPU memory (a shmem buffer) to the ffmpeg decoder.
If my understanding above is correct, then this improvement will only help with software composition?
I'm not familiar with Windows/Mac decoding so I can only give you a rough picture:
HW decoding:
GPU (decoded ffmpeg video frame) -> GPU (decoded video frame in dmabuf) -> IPC -> GPU (texture used by webrender)
SW decoding with dmabuf backend / WebRender:
CPU (decoded ffmpeg video frame) -> GPU (decoded video frame in dmabuf) -> IPC -> GPU (texture used by webrender)
SW decoding with WebRender
CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm) -> IPC -> GPU (texture used by webrender)
SW decoding with SW WebRender
CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm) -> IPC -> CPU (SW rendering of video frame)
With in-place decoding we can transform:
'CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm)'
to just one step
'CPU (decoded ffmpeg video frame in shm)'
So if we decode directly to a shm buffer, the shm buffer can be sent to another process for compositing (SW or HW) without an additional copy, no matter whether it's transformed into a texture or used for SW composition.
Assignee | Comment 17 • 3 years ago
Assignee | Comment 18 • 3 years ago
Update on the current WIP: on Linux I can now decode vpx and h264 into the shmem directly, but sometimes it causes a crash (SIGSEGV) inside libmozavcodec.so; I suspect I didn't set the size (or stride/linesize) correctly. I'm still debugging that issue.
Assignee | Comment 19 • 3 years ago
Hi, Martin,
Currently my patch works for live-streaming vp9 (like this one) and h264 video (like this), but it still crashes on this video at 1080p or 1440p (at other resolutions it doesn't crash, but I don't know why!).
I wonder if you could help check my patch and give me any feedback?
Thank you so much.
Reporter | Comment 20 • 3 years ago
Sure. I expect we need to align the frame sizes somehow as ffmpeg is a bit picky here and needs extra space.
Will look at it.
Assignee | Comment 21 • 3 years ago
Indeed, I printed the data size (the second parameter of av_buffer_create()) allocated by the default allocator, which is different from the size I set.
- H264 (640 * 368): data size (by default allocator) = 247119, data size (by my custom allocator) = 235520 (640*368)
- VP9 (1280 * 720): data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*720)
- VP9 (2560 * 1440): data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*720)
I found the default allocator always allocates a larger size than my custom allocator. In addition, Chrome uses a larger size for that parameter, calculated by adding the sizes of the three planes. I also tried this method, but still got a crash on the last VP9 video.
Reporter | Comment 22 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #21)
Indeed, I printed the data size (the second parameter of av_buffer_create()) allocated by the default allocator, which is different from the size I set.
- H264 (640 * 368): data size (by default allocator) = 247119, data size (by my custom allocator) = 235520 (640*368)
- VP9 (1280 * 720): data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*720)
- VP9 (2560 * 1440): data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*720)
I found the default allocator always allocates a larger size than my custom allocator. In addition, Chrome uses a larger size for that parameter, calculated by adding the sizes of the three planes. I also tried this method, but still got a crash on the last VP9 video.
I recall ffmpeg extends/aligns both the width & height, not just the width as one may expect. The default allocator's allocation method may be the way to get it correct; perhaps we need to debug it.
Also, the latest VP9 video is rotated - may that be the factor here? Flipped width/height?
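As a rough illustration of why the default allocator's sizes exceed width*height, here is the FFALIGN-style arithmetic with an assumed 32-pixel alignment on both dimensions. The real alignments come from avcodec_align_dimensions2() and vary per codec, and ffmpeg additionally appends AV_INPUT_BUFFER_PADDING_SIZE bytes, so the numbers below only approximate the reported sizes.

```cpp
#include <cassert>

// FFALIGN-style rounding up to a power-of-two alignment, as used in ffmpeg.
static int FFAlignUp(int aValue, int aAlignment) {
  return (aValue + aAlignment - 1) & ~(aAlignment - 1);
}

// Y-plane size once both dimensions are padded. The 32-pixel alignment on
// each dimension is an assumption for illustration only.
static int PaddedPlaneSize(int aWidth, int aHeight) {
  return FFAlignUp(aWidth, 32) * FFAlignUp(aHeight, 32);
}
```

For 640x368, the unpadded Y plane is 640*368 = 235520 bytes, but with the height rounded up to 384 it becomes 640*384 = 245760, much closer to the 247119 the default allocator reported; the remainder would be extra ffmpeg padding.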
Assignee | Comment 23 • 3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #22)
I recall ffmpeg extends/aligns both the width & height, not just the width as one may expect. The default allocator's allocation method may be the way to get it correct; perhaps we need to debug it.
Wow, you're right! Aligning the height did solve the crash! \o/
Thank you so much. I will keep working on this and see if there are any other issues or crashes.
Assignee | Comment 24 • 3 years ago
Depends on D130220
Assignee | Comment 25 • 3 years ago
Now I've encountered another problem: colors show incorrectly due to the larger Cb/Cr surface sizes. As ffmpeg needs special alignment on both height and width, I create an image and allocate a correspondingly sized buffer. Then, when we receive decoded data, if we give that surface directly to the compositor, it causes a color mismatch, because the Cb/Cr stride is larger than the compositor expects.
For example, for 1920*1080 the Cb/Cr stride and height should be 960 and 540. But ffmpeg needs 1024 and 544 after doing the alignment; if I present this texture directly to the compositor, it displays a wrong green area like this photo shows.
It seems that I will have to find a way to readjust the Cb/Cr surface region stored in the texture client after I receive decoded data?
Reporter | Comment 26 • 3 years ago
How do you create the textures?
For instance, eglCreateImage allows specifying strides for all planes, which don't need to be power-of-two aligned:
https://searchfox.org/mozilla-central/rev/24fac1ad31fb9c6e9c4c767c6a7ff45d226078f3/widget/gtk/DMABufSurface.cpp#564
(It's also possible that the dmabuf code here is affected by this too.)
Assignee | Comment 27 • 3 years ago
First I create a PlanarYCbCrImage from the image container, and create a texture client from there. Then I add a new method to create an empty buffer by providing a PlanarYCbCrData; the new method calls SharedPlanarYCbCrImage::Allocate to create a shmem buffer.
As FFmpeg requires alignment, when setting the PlanarYCbCrData (which is used to allocate the shmem buffer), I set the aligned stride and height on the data (see here, search for CreateEmptyPlanarYCbCrData).
Reporter | Comment 28 • 3 years ago
Looks like SharedPlanarYCbCrImage is incorrectly mapped to textures. I'd expect BufferTextureData::CreateForYCbCr() to be called for the video frames, but BufferTextureData::Create() is used instead with an RGBA format. Looks like the YUV is mapped to RGB somewhere and we're uploading RGB textures, which is wrong.
Reporter | Comment 29 • 3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #28)
Looks like SharedPlanarYCbCrImage is incorrectly mapped to textures. I'd expect BufferTextureData::CreateForYCbCr() is called for the video frames but BufferTextureData::Create() is used instead with RGBA format. Looks like the YUV is mapped to RGB somewhere and we're uploading RGB textures which is wrong.
So no, I was wrong here. The texture seems to be created as YUV at RenderBufferTextureHost::Lock().
Reporter | Comment 30 • 3 years ago
Yes, I do see the artifacts with https://www.youtube.com/watch?v=LXb3EKWsInQ
Looks like the width and also the height are wrong.
Assignee | Comment 31 • 3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #30)
Yes, I do see the artifacts with https://www.youtube.com/watch?v=LXb3EKWsInQ
Looks like the width and also the height are wrong.
Yes, that's true; that was caused by the extra padding for ffmpeg. If I watch that video in 4K, this issue doesn't happen, because we don't need the extra padding for the linesize at that resolution.
I asked this question in #gfx-firefox and the response from :jgilbert seems to match what I thought: I have to dive deeper into the buffer creation, to create a larger shmem buffer but keep the YCbCrDescriptor correct (using the original size, which we don't modify for ffmpeg).
So no, I was wrong here. The texture seems to be created as YUV at RenderBufferTextureHost::Lock().
In the beginning, I was also thinking about the possibility of creating a texture with the wrong format, but after my testing that is not the case. The problem is still that I didn't provide a correct YCbCrDescriptor, so the compositor would render the wrong image.
Assignee | Comment 32 • 3 years ago
Now I follow this suggestion to resize the planes to the useful size before creating the video data, and the result is good! I can eliminate the green part, and videos at different resolutions are all displayed correctly.
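A minimal sketch of that readjustment step: the allocation keeps the ffmpeg-padded strides, while the description handed to the compositor advertises only the visible picture size. PlaneDesc and AdjustToDisplaySize are hypothetical names for illustration; in gecko the real structure is the YCbCrDescriptor.

```cpp
#include <cassert>

// Hypothetical mirror of the per-plane layout the compositor consumes.
struct PlaneDesc {
  int stride;  // bytes per row of the allocation (padded for ffmpeg)
  int width;   // visible pixels per row
  int height;  // visible rows
};

// After decode, keep the padded stride (the allocation itself doesn't
// change) but advertise only the visible picture size so the compositor
// samples the right region instead of the green padding.
static PlaneDesc AdjustToDisplaySize(PlaneDesc aPadded, int aDisplayWidth,
                                     int aDisplayHeight) {
  PlaneDesc adjusted = aPadded;
  adjusted.width = aDisplayWidth;
  adjusted.height = aDisplayHeight;
  return adjusted;  // stride stays at the padded value
}
```

With the 1920x1080 example from comment 25, the Cb/Cr plane is allocated with a 1024-byte stride and 544 rows, but the adjusted description reports 960x540 visible pixels.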
Reporter | Comment 33 • 3 years ago
Yes, the blit can be another way, although it adds another frame copy - but that's done on the gfx card. I believe there is a way to upload the textures directly with a stride, without the additional blit (it's already done for dmabuf textures).
Assignee | Comment 34 • 3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #33)
Yes, the blit can be another way, although it adds another frame copy - but that's done on the gfx card. I believe there is a way to upload the textures directly with a stride, without the additional blit (it's already done for dmabuf textures).
Oh, the current way I use doesn't add any copy; it simply readjusts the sizes in the YCbCrDescriptor. The reason for not porting the logic to shmem buffer creation is that this is the only use case where we need a special stride on the height as well. So instead of adding more explanation and logic to shmem buffer creation, I feel adding another method to readjust the descriptor is easier and better for maintenance.
Assignee | Comment 35 • 3 years ago
But now the interesting thing is that even if I remove the need to copy the decoded data, the decoding time doesn't seem to improve. I wonder if creating shmem on the ffmpeg decoder threads degrades decoding performance. Still investigating.
Assignee | Comment 36 • 3 years ago
Depends on D130944
Assignee | Comment 37 • 3 years ago
Here are some profiled results. I followed this way to turn the ETW log into a format the Firefox Profiler can accept. They were all recorded while playing this 4K video for 10s.
- Firefox using the default buffer allocator (what we currently have; it will need to copy data into a texture or shmem later): result
For vp9 decoding, I searched for vp9_decode_frame(), which happened in threads #270~277. The total sample time for vp9_decode_frame() is 48111ms (=5929+6333+5961+6809+5920+5249+5648+6262) and the total sample count is 20680 (=2559+2753+2488+2652+2532+2676+2386+2634). Therefore, we can say one sample took 2.3ms on average.
For the ffmpeg video decoder, I searched for mozilla::FFmpegVideoDecoder::DoDecode(), which happened in threads #268, #269 and #282. The total sample time for mozilla::FFmpegVideoDecoder::DoDecode() is 43219ms (=16353+17186+9680) and the total sample count is 2472 (=949+885+638). Therefore, one sample took 17.48ms on average.
- Firefox using the customized buffer allocator (allocating shmem buffers): result
The same search for vp9_decode_frame(), which happened in threads #302~309. The total sample time for vp9_decode_frame() is 57393ms (=7606+6329+7865+7275+6327+7107+7563+7321) and the total sample count is 28356 (=3454+3345+3565+3776+3445+3355+3565+3851). Therefore, we can say one sample took 2.02ms on average.
For the ffmpeg video decoder, I searched for mozilla::FFmpegVideoDecoder::DoDecode(), which happened in MediaPdecoder#1~#3. The total sample time for mozilla::FFmpegVideoDecoder::DoDecode() is 23295ms (=11242+10936+11117) and the total sample count is 358 (=114+127+117). Therefore, one sample took 65.07ms on average. (?)
In terms of decoding time in the ffvpx decoder, it only improves very little. For ffmpeg's DoDecode, it's actually getting worse, but I am not sure I used the correct way to analyze these profiles...
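The per-sample averages quoted above are simply total sample time divided by sample count; a quick check of the arithmetic (rounded to two decimals):

```cpp
#include <cassert>
#include <cmath>

// Average milliseconds per sample: total sampled time over sample count.
static double AvgMs(double aTotalMs, double aSampleCount) {
  return aTotalMs / aSampleCount;
}
```

With the figures above: 48111/20680 ≈ 2.33ms, 43219/2472 ≈ 17.48ms, 57393/28356 ≈ 2.02ms, and 23295/358 ≈ 65.07ms, matching the averages quoted in the comment.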
Reporter | Comment 38 • 3 years ago
Hello Alastor,
I'm not a hardware expert but I think this article gives a basic explanation of what's going on:
https://tldp.org/HOWTO/Parallel-Processing-HOWTO-2.html
When we decode to shared memory, the memory is cached and needs to be synchronized with the other cores (that's the "Shared Everything vs. Shared Something" dilemma), as we use multi-threaded decoding. Also, I guess the speed heavily depends on the cache sizes (L1, L2, L3) and on DRAM speed (when the L* cache sizes are small, we may see better performance with direct decode to shm memory, as the cache may not be so involved there).
It would be interesting to compare single-threaded decoding (direct decode to shm vs. decode + copy) to see if multi-threaded decoding is a factor here.
Anyway, unless we find a way to tell the OS not to share/propagate shm changes during decode, this may be a dead end (as it is for dmabuf direct decoding).
Reporter | Comment 39 • 3 years ago
I'll try the direct decoding with some old hardware to check the cache theory.
Assignee | Comment 40 • 3 years ago
What I feel is that multi-threaded decoding is not a factor, but the mechanism of synchronizing the shmem might be related. The following results are for multiple threads and for a single thread (change this to 1); they show that even with a single thread, decoding to shmem is still slower.
- Linux, multi-threads, decode + copy: [Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 19340.897155 us in the average.
- Linux, single thread, decode + copy: [Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 44225.718321 us in the average.
- Linux, multi-threads, decode to shmem: [Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 20269.318408 us in the average.
- Linux, single thread, decode to shmem: [Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 53721.353087 us in the average.
By observing the stack in comment 37 (decode to shmem), a lot of time is spent in ZwFreeVirtualMemory (on Windows), which is called from av_buffer_unref. In vp9_decode_update_thread_context, I did see ffvpx start checking which frames are no longer needed and unref those AVFrames. That then seems to trigger resetting data in the shmem, which costs a lot of time. Also, on Linux, most of the time is spent in __pthread_cond_wait, which also looks like synchronizing shmem data between two processes?
However, if that data got reset during decoding, how could we see the image in the compositor?! Because the images are still complete, the shmem buffer should be unchanged after we receive decoded video frames. If the data didn't get reset, why would av_buffer_unref trigger those cleanup methods and take so much time?
Reporter | Comment 41 • 3 years ago
(In reply to Alastor Wu [:alwu] from comment #40)
What I feel is that multi-threaded decoding is not a factor, but the mechanism of synchronizing the shmem might be related. The following results are for multiple threads and for a single thread; they show that even with a single thread, decoding to shmem is still slower.
- Linux, multi-threads, decode + copy: [Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 19340.897155 us in the average.
- Linux, single thread, decode + copy: [Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 44225.718321 us in the average.
- Linux, multi-threads, decode to shmem: [Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 20269.318408 us in the average.
- Linux, single thread, decode to shmem: [Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 53721.353087 us in the average.
By observing the stack in comment 37 (decode to shmem), a lot of time is spent in ZwFreeVirtualMemory (on Windows), which is called from av_buffer_unref. In vp9_decode_update_thread_context, ffvpx starts checking which frames are no longer needed and unrefs those AVFrames, which then seems to trigger resetting data in the shmem and costs a lot of time. Also, on Linux, most of the time is spent in __pthread_cond_wait, which also looks like synchronizing shmem data between two processes?
Hm, maybe the shm memory allocation/free takes so much time? I wonder if the original way (decode to regular memory and then copy to shm) uses a different/better way to allocate shm memory chunks, or are they recycled?
However, if that data got reset during decoding, how could we see the image in the compositor?! Because the images are still complete, the shmem buffer should be unchanged after we receive decoded video frames. If the data didn't get reset, why would av_buffer_unref trigger those cleanup methods and take so much time?
Is av_buffer_unref called from ReleaseVideoBufferWrapper()? Because that seems to be the only correct place to call it, when your own buffer is released, right?
As for the images - the shm images are uploaded from shm to OpenGL textures (located on the GPU), and that's a single operation; you can release the shm after that, as GL keeps a copy of the texture on the GPU (and maybe also in RAM, depending on how the textures are uploaded). So it's possible that the underlying shm memory is released but the texture is still alive and used for rendering.
Assignee | Comment 42 • 3 years ago
Last time I discussed this with :jrmuizel, he pointed out that when measuring I should use a build closer to the release build. So I measured again on an OPT build with RUSTC_OPT_LEVEL=2 (because the release build doesn't have full Rust optimization); here are the results.
I tested five different videos on different platforms, and the results show that on Linux and macOS this improvement did help! (especially on my Linux machine, SW decoding + SW compositing). In my previous measurement I didn't use an OPT build and only tested on Linux, so this new result should be more precise than the old one.
But on Windows the result is interesting. For some videos it shows a very good result (53% improvement), but for other videos it shows a very bad result (-77%). I haven't figured out the reason yet and will keep investigating.
In general, if this result is correct, this improvement is still worth doing.
Assignee | Comment 43 • 3 years ago
In the previous measurement, the results on Windows 11 Dev didn't show a stable improvement; I'm not sure whether that's because Windows 11 Dev is still unstable or because of some unexpected behavior. In addition, yesterday after I restarted (and updated) my Windows 11 Dev machine, it became totally unusable, so I couldn't do any more measurements, which forced me to reinstall Windows 10 on my machine.
Therefore, I did the measurements again on Windows 10, and the results did show a stable improvement on all videos. So I think it would be good to use shmem for decoded data in all SW decoding cases, even if we use hardware composition. We can always share the decoded data with the compositor process, and then upload the texture to the GPU later if necessary.
Reporter | Comment 44 • 3 years ago
Alastor, while I'm looking at the patches, have you tried to run try on them? I see some failures there:
https://treeherder.mozilla.org/jobs?repo=try&revision=7074eecc4c93a9af5859a666cc4a8609c3ddc032
Assignee | Comment 45 • 3 years ago
Sure, but the tests I ran were all high-level integration tests because I didn't expect this to break the gtests. I will look into it later, thanks.
Assignee | Comment 46 • 3 years ago
Change some less important debug messages to verbose so that we can use the DEBUG level to focus on more important things.
For AudioTrimmer, this will help reduce the huge amount of useless logs when audio doesn't need to be trimmed.
Depends on D130220
Assignee | Comment 47 • 3 years ago
The ffmpeg decoder can also be used for VPX (and AV1 in the future), so remove the 'h264' wording from the warning message.
Depends on D134350
Assignee | Comment 48 • 3 years ago
Depends on D134351
Assignee | Comment 49 • 3 years ago
There was some offline discussion about how to recycle the shmem buffer; here jya mentioned that we need a cross-process, atomic refcount. During that discussion I didn't really understand the details of how we release the shmem buffer from the compositor side, so I checked again and am recording it here.
tl;dr: from my investigation, we do have a cross-process, atomic refcount mechanism for releasing the shmem buffer.
First, when we create a shmem buffer, it is stored inside a TextureClient, and the shmem buffer is only destroyed when the texture client gets destroyed.
The texture client has an atomic refcount. After the decoder returns data to us and we give the texture to the compositor, which is in another process, there are at least two references we can count on. The first is held for ffmpeg (we hold a reference to the image, done in patch 1 at the end of GetVideoBuffer(), and the image holds a reference to the texture client). The other is in CompositorBridgeChild, where the texture is added to a list; when the compositor finishes using the texture, the texture client is removed from the list. As the refcount is notified from another process, it's a cross-process mechanism.
When ffmpeg no longer needs the buffer, we are notified via the ReleaseVideoBufferWrapper() callback (in patch 1), in which we release the reference to the texture. When the compositor no longer needs the buffer, we are notified via CompositorBridge. Therefore, we can always ensure the shmem is released correctly.
Assignee | ||
Comment 50•3 years ago
|
||
Updated the memory usage comparison here: decoding directly into shmem significantly reduces memory usage on Linux and macOS. On Windows the results show neither improvement nor deterioration, and I am not sure why. (Also, the memory usage is super low on Windows. Maybe the profiler can't capture memory usage well there?)
Reporter | ||
Comment 51•3 years ago
|
||
Did you compare on the same box? Is decoding on Windows faster than on Linux on the same hardware with SW decoding?
Assignee | ||
Comment 52•3 years ago
|
||
No, they are different devices. You can check about:support for those devices in the Testing Devices section of my document.
Reporter | ||
Comment 53•3 years ago
|
||
I see. I think it would be interesting to compare performance on the same device on Windows/Linux. You can compare Linux performance with/without shm direct decode against the Windows numbers to deduce whether Windows behaves well or not.
Reporter | ||
Comment 54•3 years ago
|
||
Alastor, I have a laptop with both Windows 10 & Linux installed so I can do the Windows/Linux tests if there's a try build available.
Assignee | ||
Comment 55•3 years ago
|
||
Here it is, thank you so much!
Comment 56•3 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #55)
Here it is, thank you so much!
https://treeherder.mozilla.org/jobs?repo=try&revision=3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f
mozregression --repo try --launch 3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f
Comment 57•3 years ago
|
||
vs.
mozregression --repo try --launch 3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f --pref media.ffmpeg.customized-buffer-allocation:false
Reporter | ||
Comment 58•3 years ago
|
||
Thanks. I tested that with the Gecko profiler but couldn't get any meaningful numbers. I'm going to try a system-level profiler on Windows to get the difference (on Linux, perf may be the best tool for it).
Reporter | ||
Comment 59•3 years ago
|
||
Hello Alastor, are there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
I tested some clips (8K, 4K) on both Windows/Linux on the same box and:
- CPU usage looks pretty much the same on Windows/Linux. media.ffmpeg.customized-buffer-allocation doesn't seem to have an impact on CPU usage; if anything, media.ffmpeg.customized-buffer-allocation:false is about ~1-2% better on CPU.
- Video playback seems to be smoother with media.ffmpeg.customized-buffer-allocation:true
The number of decoding threads may also be a factor here.
Reporter | ||
Comment 60•3 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #59)
Hello Alastor, are there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
I tested some clips (8K, 4K) on both Windows/Linux on the same box and:
- CPU usage looks pretty much the same on Windows/Linux. media.ffmpeg.customized-buffer-allocation doesn't seem to have an impact on CPU usage; if anything, media.ffmpeg.customized-buffer-allocation:false is about ~1-2% better on CPU.
- Video playback seems to be smoother with media.ffmpeg.customized-buffer-allocation:true
Eh, sorry, the 8K clip was actually AV1 so I need to retest again :)
Reporter | ||
Comment 61•3 years ago
|
||
I re-tested a 4K VP9 clip and here are the results:
Windows:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 11-13% CPU, 306MB RAM (700MB all Firefox RAM)
media.ffmpeg.customized-buffer-allocation:true
RDD process: 12-15% CPU, 40MB RAM (400MB all Firefox RAM)
Linux:
The same numbers, but CPU is 1-2% lower.
Looks like direct decoding has a minor CPU penalty but a huge RAM improvement.
I used a simple process monitor to watch CPU% / RAM sizes as I didn't get any useful numbers from VTune. Also, I think CPU/RAM usage is what we and users care about most.
Comment 62•3 years ago
|
||
Sorry for interfering, but can I test this "flag" too with my UHD620 + 4K display combo on Windows and Linux, since I have issues playing 4k@60fps videos on both OSes? I just don't know how to do it. :)
Comment 63•3 years ago
|
||
(In reply to Ivan from comment #62)
Sorry for interfering, but can I test this "flag" too with my UHD620 + 4K display combo on Win and Linux since I have issues on both OS with playing 4k@60fps videos? I just do not know how to make it. :)
Yes - you'll need mozregression (pip install --user mozregression or similar), then you can run the commands from comment 56 and comment 57.
Comment 64•3 years ago
|
||
I didn't measure RAM usage, but in my case (Garuda Linux + 4K display + UHD620 + HW accel) I don't see any benefits in terms of temperatures and frame-drop behaviour compared to 97.0b2 (as I had hoped). But maybe I shouldn't expect any. :)
At the same time, no problems for me; video works just fine.
Assignee | ||
Comment 65•3 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #59)
Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
Yes, you can use VideoPlaybackQuality.droppedVideoFrames.
Looks like direct decoding has minor performance penalty but huge RAM impact.
Thanks for testing! That seems understandable: we no longer need to allocate extra memory for copying decoded data, so the improvement should be more significant on high resolution video. The higher CPU usage also seems understandable, because allocating shmem takes more work than allocating normal memory.
Reporter | ||
Comment 66•3 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #65)
Thanks for testing! That seems understandable, because now we don't need to allocate memory for copying decoded data, so the improvement should be more significant on high resolution video. Using higher CPU seems also understandable, because allocating shmem needs more work than allocating normal memory.
I see. I thought you did some kind of memory frame recycling with shm. I wonder what causes such a massive memory difference then.
Assignee | ||
Comment 67•3 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #66)
I see. I thought you did some kind of memory frame recycling with shm. I wonder what causes such a massive memory difference then.
I think that is probably also true? I didn't trace the whole process of how those texture buffers are used, but the texture allocator supports recycling: the texture client is pushed back to the pool when its reference count drops to 1.
Reporter | ||
Comment 68•3 years ago
|
||
That does not make much sense to me. I wonder why the decode to shm is slower. If the time is consumed by allocation, then what's being recycled? I expect that when we're recycling allocated shm memory (and everything looks like we are) we don't allocate anything but just cycle a few already-allocated shm buffers, which are mapped/unmapped as textures later during the cycle (that's the texture Lock/Unlock IMHO). The tiny memory usage also indicates that.
I really think that direct decode to shm is slower due to random access to shm, as ffmpeg does read/write to shm during decode, and that may be handled by the kernel to sync shm / flush caches etc. I'll try to do some benchmarks to compare. If that's confirmed we can decode to RAM as we did before but use the recycling shm image container to keep the RAM benefit of this patch.
Reporter | ||
Comment 69•3 years ago
|
||
I'll also try to get detailed performance data from perf as this is a really interesting problem :)
Reporter | ||
Comment 70•3 years ago
|
||
Got the performance data. I tested playing 1 min of a 4K clip. With direct decode to shm there are shm mapping faults on the stack (the rest is the same):
- 8.07% 0.67% MediaPD~oder #1 [kernel.kallsyms]
  - 7.40% asm_exc_page_fault
    - 7.33% exc_page_fault
      - 7.29% do_user_addr_fault
        - 6.85% handle_mm_fault
          - 6.56% __handle_mm_fault
            - 5.04% __do_fault
              - 4.98% shmem_fault
                - 4.91% shmem_getpage_gfp
                    1.58% clear_page_rep
                  - 1.43% shmem_alloc_and_acct_page
                    - 1.27% shmem_alloc_page
                      - 1.19% alloc_pages_vma
                        - 1.12% __alloc_pages
                          - 0.96% get_page_from_freelist
                            - 0.71% rmqueue_bulk
                                0.58% __list_del_entry_valid
                    1.19% shmem_add_to_page_cache
- 0.66% start_thread
  - 0.59% frame_worker_thread
    - vp9_decode_frame
        0.59% decode_sb
So it looks like we hit a page fault and the process is suspended until the page is loaded. Could this be caused by multi-threaded decode into the same shared memory page?
Reporter | ||
Comment 71•3 years ago
|
||
btw. Recorded on Linux as:
$ perf record -g ./firefox
And analyzed as:
$ perf report --call-graph
Assignee | ||
Comment 72•3 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #70)
So looks like we hit mapping failure so the process is suspended unless the page is loaded. May this be caused by multi-thread decode to the same shared memory page?
How about setting thread_count = 1 to disable multi-threaded decoding in ffmpeg? Will that make any difference?
Linux: The same numbers but CPU is 1-2% lower
Forgot to reply to this. So on Linux, memory usage between the two approaches didn't show any difference? Hmm, but the shmem should be reusable, because all buffers we want to allocate should be compatible with the shmem we have already allocated.
In addition, you said CPU is lower when enabling shmem decoding, so enabling it doesn't degrade performance on Linux, even though we hit 8.07% page faults?
I will also try other profiling tools tomorrow to see whether I can get other interesting data.
Comment 73•3 years ago
|
||
The profile suggests we're not properly recycling the shared memory. (On Linux, pages are usually allocated in the page-fault handler when they're first touched: alloc_pages_vma.)
We don't want to be doing any shared memory allocations during video playback once the initial buffers have been allocated.
Reporter | ||
Comment 74•3 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #72)
In addition, you said CPU is lower when enabling shmem decoding, so enabling that doesn't degrade the performance on Linux? Even if we hit 8.07% page fault?
To be clear I see:
Windows:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 11-13% CPU, 306MB RAM (700MB all Firefox RAM)
media.ffmpeg.customized-buffer-allocation:true
RDD process: 12-15% CPU, 40MB RAM (400MB all Firefox RAM)
Linux:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 9-11% CPU, 306MB RAM (700MB all Firefox RAM)
media.ffmpeg.customized-buffer-allocation:true
RDD process: 11-14% CPU, 40MB RAM (400MB all Firefox RAM)
Reporter | ||
Comment 75•3 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #72)
How about setting thread_count = 1 to disable multi-threaded decoding in ffmpeg? Will that make any difference?
It's the same, i.e. asm_exc_page_fault is on the perf stack.
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 76•3 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #73)
The profile suggests we're not properly recycling the shared memory. (On Linux pages of allocations are usually allocated in the page fault handler when they're first touched.
alloc_pages_vma
)We don't want to be doing any shared memory allocations during video playback once the initial buffers have been allocated.
That seems to be correct - we repeatedly allocate shm memory via VideoBridgeChild. Looking at it now.
Reporter | ||
Comment 77•3 years ago
|
||
So yes. The frames are not recycled because the texture sizes differ (the pooled textures hold the cropped sizes). In YCbCrTextureClientAllocationHelper::IsCompatible():
aTextureClient->GetSize(): 3840 x 2160
mData.mYSize: 3840 x 2176
bufferData->GetCbCrSize().ref(): 1920 x 1080
mData.mCbCrSize: 1920 x 1088
Assignee | ||
Comment 78•3 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #77)
So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():
Ah, good finding! So it seems we have to provide a cropped size if there are available textures for recycling. But I'm curious: if we didn't recycle textures correctly, why could we reduce such a significant amount of memory usage?
Reporter | ||
Comment 79•3 years ago
|
||
(In reply to Alastor Wu [:alwu] from comment #78)
(In reply to Martin Stránský [:stransky] (ni? me) from comment #77)
So yes. The frames are not recycled because the texture sizes differ (the pooled textures hold the cropped sizes). In YCbCrTextureClientAllocationHelper::IsCompatible():
Ah, good finding! So it seems we have to provide a cropped size if there are available textures for recycling. But I'm curious: if we didn't recycle textures correctly, why could we reduce such a significant amount of memory usage?
That may be an ffmpeg/ffvpx internal bug, or we manage AVBuffers incorrectly (Bug 1743750).
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 80•3 years ago
|
||
I had some offline discussion with Jeff, and I think the texture recycling issue (comment 77) should be fixed along with bug 1750858, because I don't want to add any new tech debt. An easy way to fix the recycling problem would be to modify YCbCrTextureClientAllocationHelper::IsCompatible() and let it check the cropped size instead, but that introduces new tech debt and is definitely a workaround.
The ultimate goal should be fixing bug 1750858; then we don't need to crop our planes anymore. All textures would then use the larger size when checking compatibility, so those textures could be reused.
Even in the worst case, where we allocate shmem for every video frame, that is equal to what we're currently doing (creating a shmem every time and then copying decoded data into it). So landing the current patches won't cause any new problem. Considering this is an experimental feature which would only be enabled on Nightly first, a 1~2% CPU cost isn't a big deal, and we also see a huge reduction in memory usage (although we're not clear why).
Therefore, I would still prefer to land the current patches first (after they are all r+), and bake them on Nightly to see if any new issues show up. In the meanwhile, we will also try to fix bug 1750858 in order to recycle textures correctly.
Comment 81•3 years ago
|
||
Comment 82•3 years ago
|
||
Backed out for bustages on FFmpegVideoDecoder.cpp
Backout link: https://hg.mozilla.org/integration/autoland/rev/ae774ab3fdd80116907a1742bfeaa3b43f3a8caa
Log link: https://treeherder.mozilla.org/logviewer?job_id=364752223&repo=autoland&lineNumber=16074
Assignee | ||
Updated•3 years ago
|
Comment 83•3 years ago
|
||
Comment 84•3 years ago
|
||
Backed out for causing build bustages and wpt failures
Backout link: https://hg.mozilla.org/integration/autoland/rev/a20b8d912b3f8cfe4e54143f3d708b549e17fe41
Push with failures: failure push wpt // failure push bustage
Failure log(s): failure log wpt // failure log bustage
Failure line(s): TEST-UNEXPECTED-FAIL | /html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-ImageBitmap-video.html | sRGB-BB0000, Context srgb, ImageData srgb, cropSource=true - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 187,0,0,255. expected true got false // gmake[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:660: Unified_cpp_ffmpeg_ffmpeg580.o] Error 1
Assignee | ||
Comment 85•3 years ago
|
||
I don't think that wpt failure was caused by my patches. Looking at my last try-run before landing, you can see that the failure is intermittent. When the failure happened, the crash was inside WebRender, which is not something my code touches.
I would say my code probably changes the timing and exposes this failure, making it happen more frequently, but my patches are not the root cause. I will update my patch to avoid the build bustage, though (weird, it didn't show up in my try-run before!).
Comment 86•3 years ago
|
||
Comment 87•3 years ago
|
||
Backed out on suspicion of causing wpt failure in canvas + video test. Multiple wpt failures.
- Backout link
- Push with failures
- Failure Log
- Failure line 1: html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-video.html | sRGB-FF0100, Context srgb, ImageData srgb, scaleImage=false - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 255,1,0,255. expected true got false
- Failure line 2: TEST-UNEXPECTED-FAIL | /html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-video.html | sRGB-BB0000, Context display-p3, ImageData srgb, scaleImage=false - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 187,1,0,255. expected true got false
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 88•3 years ago
|
||
Comment 89•3 years ago
|
||
Comment 90•3 years ago
|
||
Backed out for causing reftest failures at color_quads/720p.png.bt709.bt709.pc.yuv420p10.vp9.webm.
Backout link: https://hg.mozilla.org/integration/autoland/rev/9a0a0d21c4d3a0f19dae8fb56e317b88da06eb1b
Failure log: https://treeherder.mozilla.org/logviewer?job_id=365096697&repo=autoland&lineNumber=34091
Reporter | ||
Comment 91•3 years ago
|
||
Alastor, could you land it with media.ffmpeg.customized-buffer-allocation set to false to pass the tests, and enable it in another patch when all tests are fixed? That would save you time rebasing to the latest trunk. It also helps ensure it doesn't break anything when the pref is off.
Assignee | ||
Comment 92•3 years ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #91)
Alastor, could you land it with media.ffmpeg.customized-buffer-allocation set to false to pass the tests, and enable it in another patch when all tests are fixed? That would save you time rebasing to the latest trunk. It also helps ensure it doesn't break anything when the pref is off.
It's ok - that new failure is expected and I know why it failed. (See this; it should be fixed in bug 1748540.) I just forgot to handle it when we disabled shmem decoding for 10-bit+ videos on Windows in the last update. After I modify patch 4 again, do full testing, and ensure everything is ok, I will reland them.
Assignee | ||
Comment 93•3 years ago
|
||
I've updated patch 4 to remove the changes for 10+ bit video, which are all related to SW WR on Windows (we now won't enable shmem decoding for 10+ bit videos on Windows, see bug 1751498).
A partial try run and a completed try run both look good, as does the new try run I just pushed, which is based on the latest m-c after rebasing.
Assignee | ||
Comment 94•3 years ago
|
||
Depends on D134352
Updated•3 years ago
|
Comment 95•3 years ago
|
||
Comment 96•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/555a3481ee4d
https://hg.mozilla.org/mozilla-central/rev/f032a3f3200a
https://hg.mozilla.org/mozilla-central/rev/4d0ac457af61
https://hg.mozilla.org/mozilla-central/rev/38eb80e3c143
https://hg.mozilla.org/mozilla-central/rev/3b1e43ee7774
Reporter | ||
Comment 97•3 years ago
|
||
With bug 1753575 fixed I can confirm that shm memory recycling works.