Open Bug 1540919 Opened 6 years ago Updated 2 years ago

Replace libyuv with zimg

Categories

(Core :: Graphics: Color Management, task, P3)

People

(Reporter: jya, Unassigned)

References

(Blocks 1 open bug, )

Details

We use libyuv for converting YUV to RGB.

It does a poor job: it only handles BT601 properly, only supports 8-bit images, and can't do transforms or gamma changes. It's reasonably fast, but at the expense of accuracy.

And it has windows for arm optimisations.

We should replace it with "z" the zimg library https://github.com/sekrit-twc/zimg

Blocks: 1539709

(In reply to Jean-Yves Avenard [:jya] from comment #0)

We use libyuv for converting YUV to RGB.

It does a poor job: it only handles BT601 properly, only supports 8-bit images, and can't do transforms or gamma changes. It's reasonably fast, but at the expense of accuracy.

And it has windows for arm optimisations.

This is "it doesn't have..."?

Actually, looking at zimg, it doesn't have any ARM optimizations, which seems like a regression from libyuv?

Color space
libyuv does BT709, JPEG and BT601. The YUV to RGB conversion takes a matrix, so it's possible to add additional colorspaces.

10 bit HDR
10 bit YUV and RGB channel formats are supported, and 16 bit for scaling.
There is low level support for conversions of 9, 10, 12 or 16 bpc, optimized for AVX2.
If there's a specific HDR format you need, the best way is to provide the fourcc... or the code... that would be acceptable :-)

True, it trades off accuracy for performance. It's implemented with fixed point in AVX2 and Neon.
It's meant for real time camera and rendering on mobile devices.
The formats supported are all integer, except the HalfFloat conversion for GPU rendering.

For Windows on ARM there is a libyuv branch that supports MASM for 32 bit ARM.
But I suggest using clang-cl on Windows for both ARM and Intel, which gives 64 bit ARM and shares code with iOS and Android.

Here is a benchmark on SkylakeX with some of the 10 bit formats:

libyuv_test --gunit_filter=10ToOpt --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

I010ToI010_Opt (269 ms)
I010ToI420_Opt (175 ms)
H010ToH010_Opt (230 ms)
H010ToH420_Opt (169 ms)
I010ToARGB_Opt (298 ms)
I010ToABGR_Opt (295 ms)
I010ToAR30_Opt (343 ms)
I010ToAB30_Opt (344 ms)
H010ToARGB_Opt (295 ms)
H010ToABGR_Opt (295 ms)
H010ToAR30_Opt (343 ms)
H010ToAB30_Opt (346 ms)

I is bt601
H is bt709
AB30 is A2 B10 G10 R10.
I010 and H010 are planar 10 bit, as produced by software codecs.

For YUV to RGB there's a ColorTest that measures accuracy compared to a reference version:
[ RUN      ] LibYUVColorTest.TestFullYUV
hist        -3       -2       -1        0        1        2        3
red          0        0   116992  3170560   120320        0        0
green        0        0    43168  3321676    43028        0        0
blue      4864   145152   418304  2273024   398848   155904    11776
[       OK ] LibYUVColorTest.TestFullYUV (134 ms)
[ RUN      ] LibYUVColorTest.TestFullYUVJ
hist        -1        0        1
red     269056  2905344   233472
green   402665  2636848   368359
blue    288768  2848768   270336
[       OK ] LibYUVColorTest.TestFullYUVJ (131 ms)
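
To make the histogram above easier to read, here is a minimal sketch of how such an error histogram can be produced. It is not the libyuv test itself: the function names are hypothetical, only the red channel is computed, and the coefficients are the commonly quoted BT.601 limited range values (1.164 and 1.596) quantized to 6 fractional bits, which is an assumption about the fixed point format.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Reference: BT.601 limited range red channel in floating point.
inline int RedReference(int y, int v) {
  const double r = 1.164 * (y - 16) + 1.596 * (v - 128);
  return std::clamp(static_cast<int>(std::lround(r)), 0, 255);
}

// Fixed point with 6 fractional bits (assumed): 74/64 ~ 1.164, 102/64 ~ 1.596.
inline int RedFixed(int y, int v) {
  const int acc = 74 * (y - 16) + 102 * (v - 128) + 32;  // +32 rounds to nearest
  // Floor division by 64, written so it is well defined for negative sums.
  const int r = (acc >= 0) ? (acc / 64) : -((-acc + 63) / 64);
  return std::clamp(r, 0, 255);
}

// Counts fixed-minus-reference differences over every (Y, V) pair.
// Bins cover -4..+4; larger differences are clamped into the end bins.
inline std::array<int, 9> RedErrorHistogram() {
  std::array<int, 9> hist{};
  for (int y = 0; y < 256; ++y) {
    for (int v = 0; v < 256; ++v) {
      const int diff = RedFixed(y, v) - RedReference(y, v);
      ++hist[std::clamp(diff, -4, 4) + 4];
    }
  }
  return hist;
}

// Total number of samples counted in a histogram.
inline int HistogramTotal(const std::array<int, 9>& hist) {
  int total = 0;
  for (int count : hist) total += count;
  return total;
}
```

Bin index 4 holds the exact matches; a narrower spread around it means the fixed point path tracks the reference more closely.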

(In reply to Nathan Froyd [:froydnj] from comment #2)

Actually, looking at zimg, it doesn't have any ARM optimizations, which seems like a regression from libyuv?

Yes, I got confused by one commit that added arm64 support via the MSVC solution.

Having said that, I don't believe we use libyuv on ARM devices (Android or Windows).

On Android we use OpenGL surfaces/texture and on Windows we use D3D11.

All color conversions are done via HW shaders.

(In reply to Frank Barchard from comment #3)

10 bit HDR
10 bit YUV and RGB channel formats are supported, and 16 bit for scaling.
There is low level support for conversions of 9, 10, 12 or 16 bpc, optimized for AVX2.
If there's a specific HDR format you need, the best way is to provide the fourcc... or the code... that would be acceptable :-)

I must have missed that in the code, as I couldn't see anything using anything else but int8_t as data input in both our copy and upstream :(

Could you point me to the code?
Matrix coefficients are definitely for 8 bits only.

If that helps, we've made all the calculations for the matrices on this page:
https://jdashg.github.io/misc/colors/from-coeffs.html

For HDR, right now we're looking at BT2100 support with the HDR10, HDR10+ and HLG transfer functions.

We would need HDR10 -> SDR 8 bits conversion (including tone mapping).

Additionally, we need support for full vs limited YUV ranges. AFAIK, it only supports video/limited ranges, as seen from the matrices used.
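
For readers unfamiliar with the full vs limited range distinction, a minimal sketch follows. It assumes the usual 8 bit video levels (luma 16..235, chroma 16..240) and uses hypothetical helper names; it only rescales a single sample and is not how libyuv or zimg implement range handling.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Expand limited range luma (16..235) to full range (0..255).
inline uint8_t LumaLimitedToFull(uint8_t y) {
  const long v = std::lround((y - 16) * 255.0 / 219.0);
  return static_cast<uint8_t>(std::clamp(v, 0L, 255L));
}

// Expand limited range chroma (16..240, centered on 128) to full range.
inline uint8_t ChromaLimitedToFull(uint8_t c) {
  const long v = std::lround((c - 128) * 255.0 / 224.0) + 128;
  return static_cast<uint8_t>(std::clamp(v, 0L, 255L));
}
```

In practice the range scaling is normally folded into the YUV to RGB matrix coefficients rather than applied as a separate pass.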

The reason I looked at zimg is that, seeing it already supports all of that, I figured tweaking it for our needs would be quicker than adding all those features to libyuv.

BT2020 support for libyuv was added in this change:
https://phabricator.services.mozilla.com/D25345

True, it trades off accuracy for performance. It's implemented with fixed point in AVX2 and Neon.
It's meant for real time camera and rendering on mobile devices.
The formats supported are all integer, except the HalfFloat conversion for GPU rendering.

The YUV calculator looks super handy! Could you add JPEG?
As you can see from the bt2020 CL, the SIMD friendly version of the YUV to RGB matrix is CPU specific and not user friendly.
So I'd like to see a utility function to convert a standard float 4x3 matrix, as provided by this tool, to the SIMD struct YuvConstants,
and then expose the matrix versions of the conversions so users can define their own colorspace.
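
A rough sketch of what such a utility could look like is below. Everything in it is an assumption for illustration: `FixedMatrix` and `QuantizeMatrix` are hypothetical names, only the 3x3 linear part of the matrix is handled (the offset column is left out), and the real YuvConstants layout is CPU specific and is not reproduced here.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical container for a quantized 3x3 YUV to RGB matrix.
struct FixedMatrix {
  int16_t coeff[3][3];
  int fraction_bits;
};

// Scales each float coefficient by 2^fraction_bits and rounds to nearest.
inline FixedMatrix QuantizeMatrix(const double (&m)[3][3], int fraction_bits) {
  FixedMatrix out{};
  out.fraction_bits = fraction_bits;
  const double scale = static_cast<double>(1 << fraction_bits);
  for (int row = 0; row < 3; ++row) {
    for (int col = 0; col < 3; ++col) {
      out.coeff[row][col] =
          static_cast<int16_t>(std::lround(m[row][col] * scale));
    }
  }
  return out;
}
```

With 6 fractional bits, a coefficient such as 1.596 becomes 102. A SIMD specific packer would then rearrange these integers into whatever register layout each CPU path expects.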

Mind if I upstream the bt2020 functions?
The current 10 bit support would call this U010 which is
U = bt2020
0 = 420
10 = 10 bpc
The intent is to expose 444, 422 and other bpc depths as needed.
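
The naming scheme can be summarized with a small decoder. This is only an illustration with hypothetical names: the letter mapping (I, H, U) and the 0 = 420 digit come from this thread, while treating 2 as 422 and 4 as 444, and treating names like H420 as 8 bit, are assumptions inferred from the function names mentioned here.

```cpp
#include <cctype>
#include <optional>
#include <string>

struct YuvFormat {
  int matrix;       // 601, 709 or 2020
  int subsampling;  // 420, 422 or 444
  int bits;         // bits per channel
};

// Decodes names such as "H420", "I010" or "U412".
inline std::optional<YuvFormat> ParseYuvName(const std::string& name) {
  if (name.size() != 4) return std::nullopt;
  YuvFormat f{};
  switch (name[0]) {
    case 'I': f.matrix = 601; break;
    case 'H': f.matrix = 709; break;
    case 'U': f.matrix = 2020; break;
    default: return std::nullopt;
  }
  const std::string rest = name.substr(1);
  // Three digit subsampling with no depth suffix: classic 8 bit formats.
  if (rest == "420" || rest == "422" || rest == "444") {
    f.subsampling = std::stoi(rest);
    f.bits = 8;
    return f;
  }
  switch (rest[0]) {
    case '0': f.subsampling = 420; break;
    case '2': f.subsampling = 422; break;
    case '4': f.subsampling = 444; break;
    default: return std::nullopt;
  }
  if (!std::isdigit(static_cast<unsigned char>(rest[1])) ||
      !std::isdigit(static_cast<unsigned char>(rest[2]))) {
    return std::nullopt;
  }
  f.bits = (rest[1] - '0') * 10 + (rest[2] - '0');
  if (f.bits != 9 && f.bits != 10 && f.bits != 12 && f.bits != 16) {
    return std::nullopt;
  }
  return f;
}
```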

10 bit was added in multiple stages and CLs.
This is the low level function for 10 bit YUV to 10 bit RGB
https://cs.chromium.org/chromium/src/third_party/libyuv/source/row_gcc.cc?type=cs&q=+I210ToAR30Row_&sq=package:chromium&g=0&l=2729
The work was done in roughly 10 CL's over 3 months, starting with 2 step conversions that reduced 10 to 8 bit.

The LibYUVColorTest.TestFullYUV was done when a chrome tone shift and color inaccuracy was noticed.
The result of that was that the Y math was upgraded to 16 bits, the chroma rounded so the sum of quantized coefficients sums up to 1.0, and the image and coefficients negated, to extend the range of vpmaddubsw.

When 10 bit was added there were 3 options:

  1. fully upgrade the internals to 16 bit

  2. 2 step conversion

  3. adapt the existing conversion to read/write 10 bit

Step 1 was the 2 step conversion (option 2).
That was done first, so there are conversions from 10 to 8 bit etc. They use a multiply instead of a shift, so they can replicate bits and the shift can be specified as a multiplier constant.
That allowed Chromium to add H010ToAR30 to start prototyping.
Surprisingly, ARGBToAR30 was the bottleneck: about 2/3 of the time is spent packing the 10 bit RGB.
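
To illustrate the packing step, here is a scalar sketch of an 8 bit to AR30 conversion. The function names are hypothetical, and the bit layout (A2 R10 G10 B10 from the most to the least significant bit) is an assumption by analogy with the AB30 description earlier in this bug; the real code is SIMD.

```cpp
#include <cstdint>

// Expands 8 bits to 10 by replicating the top 2 bits into the new low bits,
// so 0 maps to 0 and 255 maps to 1023.
inline uint32_t Expand8To10(uint8_t v) {
  return (static_cast<uint32_t>(v) << 2) | (static_cast<uint32_t>(v) >> 6);
}

// Packs one pixel as A2 R10 G10 B10 (assumed layout, see above).
inline uint32_t PackAR30(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
  const uint32_t a2 = static_cast<uint32_t>(a) >> 6;  // keep top 2 alpha bits
  return (a2 << 30) | (Expand8To10(r) << 20) | (Expand8To10(g) << 10) |
         Expand8To10(b);
}
```

Each output pixel needs three expansions plus several shifts and ORs across channel boundaries, which gives some intuition for why the packing can dominate the run time.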

Fully upgrading the internals to read 16 bits, do 32 bit intermediate math, and output 16 bit would be slightly higher quality, but 2x more code and roughly 2x slower, since it could only do 1 pixel at a time in SSE or 2 with AVX2. And it wouldn't be able to use the fast vpmaddubsw. So adapting the existing YUV conversion was done.

  1. read 10 bit, write 10 bit.
    The existing YUV conversion produces 16 bit RGB but then packs it down to 8 bit. That packing was removed so the 16 bit values could be written at full accuracy and faster, with STOREAR30_AVX2
    https://cs.chromium.org/chromium/src/third_party/libyuv/source/row_gcc.cc?type=cs&sq=package:chromium&g=0&l=2556
    The Y in YUVTORGB() is 16 bit, normally produced by taking 8 bits and replicating them to 16 with an unpack in the reader. For H010 the Y is shifted left by 6.
    https://cs.chromium.org/chromium/src/third_party/libyuv/source/row_gcc.cc?type=cs&q=READYUV210_AVX2&g=0&l=2421
    But the chroma uses the 8 bit vpmaddubsw, so it is shifted down in the read. So there is a chroma accuracy loss, but full luma accuracy.
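
A scalar sketch of the read step described above, with hypothetical names: luma is widened from 10 to 16 bits by a left shift of 6, while chroma is narrowed from 10 to 8 bits. The right shift by 2 for chroma is an assumption that follows from going from 10 to 8 bits.

```cpp
#include <cstdint>

// One sample as fed to the conversion: 16 bit luma, 8 bit chroma.
struct ReadSample {
  uint16_t y;
  uint8_t u;
  uint8_t v;
};

// Reads one 10 bit Y, U, V triple the way the text above describes.
inline ReadSample Read10Bit(uint16_t y10, uint16_t u10, uint16_t v10) {
  ReadSample s{};
  s.y = static_cast<uint16_t>(y10 << 6);  // 10 -> 16 bits, nothing lost
  s.u = static_cast<uint8_t>(u10 >> 2);   // 10 -> 8 bits, low 2 bits dropped
  s.v = static_cast<uint8_t>(v10 >> 2);
  return s;
}
```

Chroma inputs 512 and 515 both come out as 128, which is the chroma accuracy loss mentioned above; every distinct luma input stays distinct.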

'Reasonably fast', as you say: about 4 ms/frame for 3840x2160 on Skylake.
libyuv_test --gunit_filter=H0ToAR*Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

10 to 10 H010ToAR30_Opt (4056 ms)
10 to 8 H010ToARGB_Opt (3985 ms)
8 to 8 H420ToARGB_Opt (3172 ms)

H010 is a classic 3 plane 420 YUV format. Next step: there is talk of adding P010, which is biplanar like NV12, for Windows Media codec support. It would be an additional reader, READP010_AVX2,
and likely some utilities to convert P010ToNV12. The low levels for bit depth conversion are there, so it's just hooking it up.

ps It would be interesting to have an identity matrix so conversions could convert YUV to YUV. And variations of that to adapt colorspace. BT601 to BT709 etc.

Priority: -- → P3

(In reply to Frank Barchard from comment #8)

10 bit was added in multiple stages and CLs.
This is the low level function for 10 bit YUV to 10 bit RGB
https://cs.chromium.org/chromium/src/third_party/libyuv/source/row_gcc.cc?type=cs&q=+I210ToAR30Row_&sq=package:chromium&g=0&l=2729
The work was done in roughly 10 CL's over 3 months, starting with 2 step conversions that reduced 10 to 8 bit.

The LibYUVColorTest.TestFullYUV was done when a chrome tone shift and color inaccuracy was noticed.
The result of that was that the Y math was upgraded to 16 bits, the chroma rounded so the sum of quantized coefficients sums up to 1.0, and the image and coefficients negated, to extend the range of vpmaddubsw.

This is what we need in this code.
Converting YUV10/12 (601/709/2020) to RGB 8 bits.

H010 is a classic 3 plane 420 YUV format. Next step: there is talk of adding P010, which is biplanar like NV12, for Windows Media codec support. It would be an additional reader, READP010_AVX2,
and likely some utilities to convert P010ToNV12. The low levels for bit depth conversion are there, so it's just hooking it up.

Unfortunately, none of our decoders will output P010 (that is, with the bits on the most significant side).
AFAIK, the only decoder I've seen outputting P016 (same as P010 storage wise) is the Windows VP9 MFT.

ps It would be interesting to have an identity matrix so conversions could convert YUV to YUV. And variations of that to adapt colorspace. BT601 to BT709 etc.

We don't handle direct composition here, everything is converted to RGB to be rendered on screen. So maybe in the future we will need those, but not for now.

I guess it's time for a resync with libyuv and see what that gives.

Type: defect → task

(In reply to Jean-Yves Avenard [:jya] from comment #9)

(In reply to Frank Barchard from comment #8)

10 bit was added in multiple stages and CLs.
This is the low level function for 10 bit YUV to 10 bit RGB
https://cs.chromium.org/chromium/src/third_party/libyuv/source/row_gcc.cc?type=cs&q=+I210ToAR30Row_&sq=package:chromium&g=0&l=2729
The work was done in roughly 10 CL's over 3 months, starting with 2 step conversions that reduced 10 to 8 bit.

The LibYUVColorTest.TestFullYUV was done when a chrome tone shift and color inaccuracy was noticed.
The result of that was that the Y math was upgraded to 16 bits, the chroma rounded so the sum of quantized coefficients sums up to 1.0, and the image and coefficients negated, to extend the range of vpmaddubsw.

This is what we need in this code.
Converting YUV10/12 (601/709/2020) to RGB 8 bits.

Running the unittests which support wildcards, libyuv_test --gunit_filter=10ToOpt --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
These are the 10 bit planar YUV formats converted to other formats, including ARGB which is 8 bit per channel:
I010ToI010_Opt (249 ms)
I010ToI420_Opt (172 ms)
H010ToH010_Opt (232 ms)
H010ToH420_Opt (172 ms)
I010ToARGB_Opt (325 ms)
I010ToABGR_Opt (349 ms)
I010ToAR30_Opt (398 ms)
I010ToAB30_Opt (377 ms)
H010ToARGB_Opt (331 ms)
H010ToABGR_Opt (330 ms)
H010ToAR30_Opt (374 ms)
H010ToAB30_Opt (340 ms)

So 601 and 709 10 bit are there.
2020 (U010) is missing. But your version has it, so if it's OK I'll merge the 2020 support upstream.
12 bit (H012 etc) is missing.

For the RGB side, is it ARGB you use? Which is B,G,R,A in memory?
It's hard to add every permutation of YUV and RGB, but not hard to add every 12 bit format to ARGB.

There is low level code for converting 9,10,12,16 to 8 bit

Convert16To8Row_AVX2
Convert8To16Row_AVX2
Convert16To8Plane

void Convert16To8Plane(const uint16_t* src_y,
                       int src_stride_y,
                       uint8_t* dst_y,
                       int dst_stride_y,
                       int scale,  // 16384 for 10 bits
                       int width,
                       int height);

I010ToI420() is implemented using Convert16To8Plane.
So it's not fully efficient, but any bit depth can be converted in 2 steps.
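
As an illustration of how a `scale` of 16384 can narrow 10 bit samples to 8 bits, here is a scalar sketch. The names are hypothetical and the formula (multiply, then shift right by 16) is an assumption consistent with the "16384 for 10 bits" comment above; strides are taken to be in elements, not bytes.

```cpp
#include <algorithm>
#include <cstdint>

// Multiplier that maps a `bits` deep sample to 8 bits: 16384 for 10 bits.
inline int ScaleForBits(int bits) { return 1 << (24 - bits); }

// Narrows one sample: (v * scale) >> 16, clamped to 255.
inline uint8_t Narrow16To8(uint16_t v, int scale) {
  const uint32_t scaled =
      (static_cast<uint32_t>(v) * static_cast<uint32_t>(scale)) >> 16;
  return static_cast<uint8_t>(std::min<uint32_t>(scaled, 255u));
}

// Narrows a whole plane row by row.
inline void Narrow16To8Plane(const uint16_t* src, int src_stride,
                             uint8_t* dst, int dst_stride,
                             int scale, int width, int height) {
  for (int row = 0; row < height; ++row) {
    for (int col = 0; col < width; ++col) {
      dst[row * dst_stride + col] =
          Narrow16To8(src[row * src_stride + col], scale);
    }
  }
}
```

Because the shift is expressed as a multiplier, the same routine covers 9, 10, 12 and 16 bit input just by changing the constant.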

H010 is a classic 3 plane 420 YUV format. Next step: there is talk of adding P010, which is biplanar like NV12, for Windows Media codec support. It would be an additional reader, READP010_AVX2,
and likely some utilities to convert P010ToNV12. The low levels for bit depth conversion are there, so it's just hooking it up.

Unfortunately, none of our decoders will output P010 (that is, with the bits on the most significant side).
AFAIK, the only decoder I've seen outputting P016 (same as P010 storage wise) is the Windows VP9 MFT.

ps It would be interesting to have an identity matrix so conversions could convert YUV to YUV. And variations of that to adapt colorspace. BT601 to BT709 etc.

We don't handle direct composition here, everything is converted to RGB to be rendered on screen. So maybe in the future we will need those, but not for now.

I guess it's time for a resync with libyuv and see what that gives.

I should add the U420 etc bt.2020 before you do that. That would include 10 bpc version.

(In reply to Jean-Yves Avenard [:jya] from comment #5)

Having said that, I don't believe we use libyuv on ARM devices (Android or Windows).

On Android we use OpenGL surfaces/texture and on Windows we use D3D11.

I believe we use libyuv in WebRTC on all platforms (including Android).

libyuv now does rec.2020 8 and 10 bit.
https://bugs.chromium.org/p/libyuv/issues/detail?id=845&q=
The 10 bit YUV and RGB is optimized for AVX2 but not arm. The 8 bit is ARM optimized. (So far no demand for 10 bit on ARM)

Chromium has similar media requirements: color space, 10/12 bit, transfer functions.
The near term plan is U410 (444 10 bit) and U412 (444 12 bit).

ARGBToI420 accuracy has improved by 1 bit. The Y channel uses 8 bit coefficients instead of 7 bit.

One missing feature that would be useful for AVIF is bilinear or better chroma scaling. libyuv uses nearest, which is fast but ugly, and for still images, users seem to notice it. https://news.ycombinator.com/item?id=23614807

While we're on the wishlist, libyuv's API is not very orthogonal and misses many of the conversions that can be signaled in AV1's sequence header. Obvious missing ones include:

  1. Planar RGB to ARGB
  2. Limited range RGB
  3. Full range Rec. 709
  4. Full range Rec. 2020
  5. YCoCg
  6. BT. 2020 CL

Note: for libyuv change requests, it's best to file an issue on the libyuv issue tracker
https://bugs.chromium.org/p/libyuv/issues/list

Re: filtered upsampling during conversion
I had an informal request from Chrome for the same. My plan is to break it into 2 steps as a start.

  1. upsample 422 or 420 to 444
  2. convert 444 to RGB

The bilinear scale will filter the 420 to 444, but it may take some logic to set subpixel accurate centering.
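
The first of the two steps above can be sketched for a single chroma row as follows. This is only an illustration with a hypothetical function name: it doubles the width with 3:1 linear weights, which assumes chroma samples sit midway between pairs of luma samples. Other chroma sitings need different weights, which is the centering issue mentioned above, and 420 input needs the same filter applied vertically as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Doubles the width of one chroma row using linear interpolation.
// Edges are handled by repeating the first and last sample.
inline std::vector<uint8_t> UpsampleChromaRow2x(const std::vector<uint8_t>& src) {
  std::vector<uint8_t> dst(src.size() * 2);
  for (std::size_t i = 0; i < src.size(); ++i) {
    const int cur = src[i];
    const int prev = src[i == 0 ? 0 : i - 1];
    const int next = src[i + 1 < src.size() ? i + 1 : i];
    // 3/4 of the nearest sample plus 1/4 of its neighbour, rounded.
    dst[2 * i] = static_cast<uint8_t>((3 * cur + prev + 2) >> 2);
    dst[2 * i + 1] = static_cast<uint8_t>((3 * cur + next + 2) >> 2);
  }
  return dst;
}
```

The 444 result can then be passed to any existing 444 to RGB conversion as the second step.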

  1. Planar RGB isn't supported, but planar AYUV is, so there's likely a way to do this. There are packed versions of YUV: YUV24 and AYUV.
    The conversions are meant as an alternative to NV12ToRGB24, so you can convert NV12ToYUV24, which is packed 24 bit YUV but can be used by a GPU.
    So the functions you're looking for are I444ToYUV24 and I444ToAYUV, to convert 3 planes of RGB to packed 24 or 32 bit RGB. They don't exist yet but are easy to add. Please file an issue if you need them.

  2. Limited range RGB is not supported. I assume you would want this during RGB to YUV. The first step for this would be different color matrix support, which is on the todo list, but it's due to all the YUV color spaces: video and full range bt.601, bt709, rec 2020.

3-5. YUV to RGB conversions now accept a matrix parameter.

  3. At one point I had bt709 full range, and it wouldn't be hard to dig up the color matrix for that.
    The YUV to RGB with color matrix functions are now exposed, so all YUV color spaces can be user defined and the preset ones can be used with any YUV to RGB conversion.

  5. YCoCg
    The advantage of this matrix is that it's fast and lossless. Using fixed point math would likely be slower than optimized avx/neon. But if that's OK, a matrix could be defined.

  6. CL refers to the non-linear version of rec.2020, similar to gamma correction.
    I don't have a concrete plan for how to support this efficiently. Suggestions or C code welcome.
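
On the "fast and lossless" point for YCoCg: the variant that round-trips integers exactly is usually called YCoCg-R, and assuming that is the one meant, a minimal sketch looks like this. The struct and function names are hypothetical, and this is the lifting form of the transform rather than a matrix multiply.

```cpp
// One pixel in YCoCg-R form. Co and Cg need one more bit than the RGB input.
struct YCoCg {
  int y;
  int co;
  int cg;
};

struct Rgb {
  int r;
  int g;
  int b;
};

// Forward transform using lifting steps (adds, subtracts and shifts only).
inline YCoCg RgbToYCoCgR(Rgb p) {
  YCoCg out{};
  out.co = p.r - p.b;
  const int t = p.b + (out.co >> 1);
  out.cg = p.g - t;
  out.y = t + (out.cg >> 1);
  return out;
}

// Inverse transform: undoes each lifting step in reverse order.
inline Rgb YCoCgRToRgb(YCoCg p) {
  Rgb out{};
  const int t = p.y - (p.cg >> 1);
  out.g = p.cg + t;
  out.b = t - (p.co >> 1);
  out.r = out.b + p.co;
  return out;
}

// True when a pixel survives the forward and inverse transform unchanged.
inline bool RoundTripsExactly(int r, int g, int b) {
  const Rgb back = YCoCgRToRgb(RgbToYCoCgR(Rgb{r, g, b}));
  return back.r == r && back.g == g && back.b == b;
}
```

The inverse recomputes the same shifted terms that the forward pass used, so whatever the shifts round away cancels exactly. That is why the round trip is exact even though the shifts discard bits.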

The improved 'Matrix' support is in r1751

Summary of your requests: the colorspaces for YUV to RGB are mostly complete and a matter of defining matrix constants.
For new functions, planar to packed is easy and efficient to implement.

Rethinking 1, another way to implement planar RGB is to use planar YUV with an identity matrix
I420AlphaToARGBMatrix() exists.
I444AlphaToARGBMatrix is on my todo list

But implemented in Neon it's a simple loop, and AVX for 32 bit AYUV is also easy: 3 pack instructions.
24 bit is also easy (st3) for Neon but hard with AVX.

void I444AlphaToAYUVRow_NEON(const uint8_t* src_y,
                             const uint8_t* src_u,
                             const uint8_t* src_v,
                             const uint8_t* src_a,
                             uint8_t* dst_ayuv,
                             int width) {
  asm volatile(
      "1:                                                  \n"
      "ld1     {v20.8b}, [%0], #8                          \n"  // load 8 Y
      "ld1     {v21.8b}, [%1], #8                          \n"  // load 8 U
      "ld1     {v22.8b}, [%2], #8                          \n"  // load 8 V
      "ld1     {v23.8b}, [%3], #8                          \n"  // load 8 A
      "subs    %w5, %w5, #8                                \n"  // 8 pixels per loop
      "st4     {v20.8b,v21.8b,v22.8b,v23.8b}, [%4], #32    \n"  // store interleaved
      "b.gt    1b                                          \n"
      : "+r"(src_y),     // %0
        "+r"(src_u),     // %1
        "+r"(src_v),     // %2
        "+r"(src_a),     // %3
        "+r"(dst_ayuv),  // %4
        "+r"(width)      // %5
      :
      : "cc", "memory", "v20", "v21", "v22", "v23");
}
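
For comparison, a portable scalar equivalent of the NEON loop above might look like the sketch below. The name `I444AlphaToAYUVRow_C` is hypothetical. The byte order follows the register order in the st4 instruction, so each pixel is written as Y, U, V, A in memory.

```cpp
#include <cstdint>

// Interleaves four planar rows into one packed row, 4 bytes per pixel.
inline void I444AlphaToAYUVRow_C(const uint8_t* src_y,
                                 const uint8_t* src_u,
                                 const uint8_t* src_v,
                                 const uint8_t* src_a,
                                 uint8_t* dst_ayuv,
                                 int width) {
  for (int x = 0; x < width; ++x) {
    dst_ayuv[4 * x + 0] = src_y[x];
    dst_ayuv[4 * x + 1] = src_u[x];
    dst_ayuv[4 * x + 2] = src_v[x];
    dst_ayuv[4 * x + 3] = src_a[x];
  }
}
```

Unlike the NEON version, which handles 8 pixels per iteration and so expects the width to be a multiple of 8 (or padded buffers), the scalar loop accepts any width.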

z.lib 3.0 supports ARM and NEON. All functions are implemented.

Severity: normal → S3