Closed Bug 1624468 Opened 5 years ago Closed 4 years ago

Add a fast path for more gradient types in WR

Categories

(Core :: Graphics: WebRender, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: gw, Assigned: bpeers)

References

(Blocks 1 open bug)

Details

Attachments

(9 files)

We have three gradient types supported in WR - linear, radial and conic. These shaders can be quite expensive to draw into the main scene, and also break batching due to requiring a special shader.

We have a fast path for simple linear gradients, which:

  • Draws the gradient with a custom gradient render task into a texture cache entry that is persisted across frames.
  • Uses the brush_image shader to draw this cached gradient into the main scene instead of running the full gradient shader.

By selecting this path only for simple gradients (horizontal/vertical, no hard stops, etc) we can draw the cache entry with a very small size, and rely on stretching / bilinear filtering of the brush_image shader when drawing the gradient image into the main scene.

There's two goals of this work (we might want to create separate bugs for each part?):

  • Support this fast caching path for radial and conic gradients where possible. This should be relatively simple, by either duplicating or abstracting the current linear gradient fast path into shared code.
  • Consider expanding support in this fast path for more complex gradients (e.g. hard stops, angle start/stop line). It's an open question here how (or if) it's possible to support some of these cases - we can discuss this in detail on matrix/zoom.

The ultimate outcome would be that all gradients get drawn into the main scene via brush_image (with all the gradients drawn into the persistent render task cache), although this probably isn't feasible in all cases.

Bert, might be of interest to you to work on?

Flags: needinfo?(bpeers)

I should add - some of the assumptions related to how gradients are currently implemented may no longer be true, so don't feel constrained by the current implementation details, it might be worth reconsidering these.

For example, a current constraint is that the gradient should should be as fast as possible, since it's often used on full-screen background gradients. However, if the gradient shader is instead only writing to a small number of render task cached pixels, that are persisted across frames, it might be reasonable to have a more complex shader implementation that allows it to handle more complex gradient types, since drawing into the main scene becomes a straightforward image shader.

Blocks: wr-perf
Priority: -- → P3
Assignee: nobody → bpeers
Flags: needinfo?(bpeers)

Thanks for the overview Glenn.
My first reaction when looking at the code was a bit of surprise :) ... as the gradient shader that we cache is not very complex or heavy on ALU; so I wouldn't expect it to be ALU bound and thus beaten by sampling from a 512x16 cached colorstrip.

Thus the first thing I've done is set up some benchmarks with 4 large gradients: two cacheable and two that aren't (once due to a slight angle, once due to hardstops). I checked in RenderDoc that the caching kicks in as expected.

The result seems to be that there's not an awful lot between cache-vs-no-cache, not enough to distinguish it from normal noise inside a run (the attached HTML has some charts, the error line on each bar is min-max) or between runs (the five groups of bars).

In fact, if anything, the non-cached version seems to be consistently faster on one of the tests for me :/

Disclaimer though: I am on a Quadro P400, which has extremely slow memory bandwidth of 32GB/sec (!). A single 1:1 blit of a 3000x1500xRGBA8 takes over 1ms. It's possible that sampling is actually worse than calculating the gradient (even though the ALU also needs to get its data from a LUT, so, I don't know -- perhaps raw loads come through faster than a bottlenecked texture sampler?). I'll see if I can test on my laptop as well.

I guess for more complex gradients the balance could shift a bit; but even then, we're ad-hoc reimplementing a caching mechanism. You may be right that investing that time in making picture caching more powerful, and/or unifying some of these gradient types to have fewer batch breaks, could pay off more.

Anyway let me know if there's a flaw in my thinking or benchmarking here :) I'll try to grab a few more data points. We can chat a bit next week. Cheers!

Ok, good thing I checked a few more GPUs: seems I was too pessimistic, and there is definitely a win from the caching. My confidence is still not a 100%, because I would expect both cached tests to perform very similarly -- once the gradient is baked into the cache, it wouldn't really matter anymore what it looks like when blitting it? So I'm not sure why linear-smooth would be ~10% slower than aligned-gradient (source yaml is in the previous HTML file I attached), hmm...

The point about extending the gradient cache versus improving picture-cache/batching is still valid, let's discuss :)

Attached image stop_hack_bugzilla.png (deleted) —

I've experimented a bit with removing the limitation of 4 stops: if there's more, render multiple cached gradients and emit multiple instances to stitch them together. The single cache_handle becomes a Vec of { RenderTaskHandle, prim rect } as emitted by prepare_interned_prim_for_render. It would then also open the door for hard stops -- each hard stop just starts a new segment.

It's just a hack for now, getting my bearings, but it does work, see the attached image. I can tidy it up a bit more next week, add more test cases, support hard stops, etc.


Other than that, I'm still a bit wary of turning this into a bad re-implementation of picture caching :)
Radial/conic gradients seem like they would need 2D textures; perhaps more than one due to hard stops, with alpha testing. Upscaling wouldn't be lossless like it is with the aligned linear gradients.

Misaligned linear gradients have a few options: 2D bake; rotating UVs in the image shader; rotating the vertices and clipping;... Anything other than the bake might cancel out the performance advantage and/or introduce more batching breakage. Maybe there is some more advanced blitter hiding that we can use.

  • Cleaned up the hacks
  • Added hard stops support
  • Add a bunch of reftests - zooming, clamping, hard stops, clip shape, etc.
  • Patch forthcoming :)
Pushed by bpeers@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3c84f9fd1f93 Add a fast path for more gradient types in WR r=gw
Keywords: leave-open

Hi Andrei, thanks for catching the regression.
I've checked the failures visually, and they are expected and acceptable -- minor differences in rendering due to pre-caching images.
I've updated the fuzzy values for the 3 failing tests and started a new try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f15d0f2dbfbb3daed0c06564d7163266c7bf95a6
I'll try to re-land once that goes green. Thanks.

Edit: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc315e6d871e11bc53c8c10db8984edceff3beb1

Flags: needinfo?(bpeers)

Hi Glenn, I'm curious on your thoughts for best bang-for-buck next steps?

For caching conic/radial gradients, we would lose quality from upsampling, more than from linear gradients. In particular the center might lose its sharpness, I'm not sure if that's acceptable.
Hard stops would be disallowed; adding them would mean multiple 2D images overlapping, with alpha cutouts.
So both are doable but is it a good idea vs. just relying on picture caching?

Likewise for linear gradients, removing the horizontal-vertical-only restriction would require baking a 2D texture, I think.
We could also rotate UVs in the image brush, but, it doesn't do so right now, and there'd be clamping overhead since the gradient is atlassed. That would seem a batch breaking change, and at some point that overhead must approach the original ALU shader, which isn't all that complex.

So I'm wondering if you had some ideas when you first wrote up this task. Thanks!

Flags: needinfo?(gwatson)
Pushed by bpeers@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9dc84057c6a9 Add a fast path for more gradient types in WR r=gw

Backed out changeset 9dc84057c6a9 (bug 1624468) for causing gradient-move-stops.html to fail
https://hg.mozilla.org/integration/autoland/rev/4e874b1ed303fa1ca1960d3caed4cced4288c49d

push that caused the backout: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=linux%2C18.04%2Cx64%2Cquantumrender%2Cdebug%2Cweb%2Cplatform%2Ctests%2Ctest-linux1804-64-qr%2Fdebug-web-platform-tests-reftest-e10s-1%2Cw%28wr1%29&fromchange=9dc84057c6a93793865a92a15e54a9231c0f9b03&tochange=4e874b1ed303fa1ca1960d3caed4cced4288c49d&selectedJob=295875161

failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295875161&repo=autoland&lineNumber=14737

[task 2020-04-02T02:43:15.015Z] 02:43:15 INFO - TEST-START | /css/css-images/gradient-move-stops.html
[task 2020-04-02T02:43:15.027Z] 02:43:15 INFO - PID 13842 | 1585795395018 Marionette INFO Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html
[task 2020-04-02T02:43:15.762Z] 02:43:15 INFO - PID 13842 | 1585795395754 Marionette INFO No differences allowed
[task 2020-04-02T02:43:15.879Z] 02:43:15 INFO - TEST-UNEXPECTED-PASS | /css/css-images/gradient-move-stops.html | Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html
[task 2020-04-02T02:43:15.888Z] 02:43:15 INFO - TEST-INFO expected FAIL | took 846ms
[task 2020-04-02T02:43:15.920Z] 02:43:15 INFO - PID 13842 | [GPU 13905, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp, line 371
[task 2020-04-02T02:43:16.123Z] 02:43:16 INFO - PID 13842 | 1585795396115 Marionette INFO Stopped listening on port 43590
[task 2020-04-02T02:43:16.483Z] 02:43:16 INFO - PID 13842 | [Child 14190, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | nsStringStats
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mAllocCount: 13510
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mReallocCount: 0
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mFreeCount: 13510
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mShareCount: 18304
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mAdoptCount: 350
[task 2020-04-02T02:43:16.522Z] 02:43:16 INFO - PID 13842 | => mAdoptFreeCount: 426
[task 2020-04-02T02:43:16.522Z] 02:43:16 INFO - PID 13842 | => Process ID: 14190, Thread ID: 140365280118656
[task 2020-04-02T02:43:17.157Z] 02:43:17 INFO - PID 13842 | [GPU 13905, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp, line 371
[task 2020-04-02T02:43:17.259Z] 02:43:17 INFO - PID 13842 | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2020-04-02T02:43:17.267Z] 02:43:17 INFO - PID 13842 | [2020-04-02T02:43:17Z WARN xulstore::persist] tried to remove key that isn't in the store
[task 2020-04-02T02:43:17.267Z] 02:43:17 INFO - PID 13842 | [2020-04-02T02:43:17Z WARN xulstore::persist] tried to remove key that isn't in the store
[task 2020-04-02T02:43:17.409Z] 02:43:17 INFO - PID 13842 | [Child 14027, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-04-02T02:43:17.439Z] 02:43:17 INFO - PID 13842 | nsStringStats

Flags: needinfo?(bpeers)

:\

I've verified in Chrome and Edge that gradient-move-stops.html and gradient-move-stops-ref.html do look the same, and indeed they don't in the current Nightly; but they do, after my change. So I assume the best thing to do is to remove the "if webrender expect fail" in gradient-move-stops.html.ini.

I'll stop trying to "save time and resources" by carefully picking my try runs, because clearly I'm not -- so here's a try auto run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dbd24005ee3d19012ed820ee875f4382f641c5c0

(I'm not adding a new pair of YAML reftests because I'm not sure how to translate this:
background-image: linear-gradient(to right, yellow, blue 70%, green 0);
to a YAML file that'd be any different from the reference 0.7 blue, 0.7 green, 1.0 green.)

Edit: try auto doesn't seem to work. New run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=20d5672c6804436227b41bb7476af6a957bbed0a

Flags: needinfo?(bpeers)
Pushed by bpeers@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/dfa94a886779 Add a fast path for more gradient types in WR r=gw
Backout by shindli@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/92912f17f2b8 Backed out changeset dfa94a886779 for causing web platform permafailures in /css/css-images/gradient-move-stops.html CLOSED TREE

Right, so if I understand correctly, if the test expects failure then Windows says:
TEST-UNEXPECTED-PASS | /css/css-images/gradient-move-stops.html | Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html

but if I change the test to expect equality, Linux says:

TEST-UNEXPECTED-FAIL | /css/css-images/gradient-move-stops.html | Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html

Flags: needinfo?(bpeers)

Add support for repeating gradients in the cached fast path.

(In reply to Bert Peers [:bpeers] from comment #11)

Hi Glenn, I'm curious on your thoughts for best bang-for-buck next steps?

For caching conic/radial gradients, we would lose quality from upsampling, more than from linear gradients. In particular the center might lose its sharpness, I'm not sure if that's acceptable.
Hard stops would be disallowed; adding them would mean multiple 2D images overlapping, with alpha cutouts.
So both are doable but is it a good idea vs. just relying on picture caching?

Likewise for linear gradients, removing the horizontal-vertical-only restriction would require baking a 2D texture, I think.
We could also rotate UVs in the image brush, but, it doesn't do so right now, and there'd be clamping overhead since the gradient is atlassed. That would seem a batch breaking change, and at some point that overhead must approach the original ALU shader, which isn't all that complex.

So I'm wondering if you had some ideas when you first wrote up this task. Thanks!

I think you're probably right - it's probably not worth looking into radial / conic gradients, or the other limitations at this point. In future, we might revisit if we come across some pages where these show up in profiles. Thanks for looking into this!

Flags: needinfo?(gwatson)
Attached image timings.svg (deleted) —

Thanks for the input Glenn. I think you're right, when we see a real-world case, it'll be easier to judge how much time to invest further into special cases.

I've got a green Try for the above 2 fails, but will hold off until after freeze to land.

Intel HD630 is happy with all the optimizations. GTX1050 agrees, except for the repeat in D69476 which shows a regression. Intel probably needs it more badly though and sees gains of up to 20%, so it's probably still valid? If we change our minds, it's easy enough to put back ExtendMode == Clamp in supports_caching, no need to rollback the whole patch.

Attached image GTX1050 (deleted) —

GTX1050

Comment on attachment 9138161 [details] timings.svg HD630

== Change summary for alert #25534 (as of Thu, 02 Apr 2020 17:58:36 GMT) ==

Improvements:

7% Heap Unclassified windows10-64-shippable-qr opt 64,626,073.69 -> 59,793,242.54

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25534

Attached file gradient_documentation_v1.html (deleted) —

Repeating gradient Documentation version 1 as of April 6, 2020.

Pushed by bpeers@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/376011a4881e Add a fast path for more gradient types in WR r=gw

(In reply to Alexandru Ionescu :alexandrui (needinfo me) from comment #24)

== Change summary for alert #25534 (as of Thu, 02 Apr 2020 17:58:36 GMT) ==

Improvements:

7% Heap Unclassified windows10-64-shippable-qr opt 64,626,073.69 -> 59,793,242.54

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25534

False alarm, sorry.

Pushed by bpeers@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1b2acec83166 Add a fast path for more gradient types in WR r=gw

I think you're probably right - it's probably not worth looking into radial / conic gradients, or the other limitations at this point. In future, we might revisit if we come across some pages where these show up in profiles. Thanks for looking into this!

Let's close and revisit in a new bug if/when it makes sense.

Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Regressions: 1703141
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: