Add a fast path for more gradient types in WR
Categories
(Core :: Graphics: WebRender, enhancement, P3)
Tracking
()
People
(Reporter: gw, Assigned: bpeers)
References
(Blocks 1 open bug)
Details
Attachments
(9 files)
(deleted),
application/x-zip-compressed
|
Details | |
(deleted),
application/x-zip-compressed
|
Details | |
(deleted),
image/png
|
Details | |
(deleted),
application/x-zip-compressed
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
image/svg+xml
|
Details | |
(deleted),
image/svg+xml
|
Details | |
(deleted),
text/html
|
Details |
We have three gradient types supported in WR - linear, radial and conic. These shaders can be quite expensive to draw into the main scene, and also break batching due to requiring a special shader.
We have a fast path for simple linear gradients, which:
- Draws the gradient with a custom gradient render task into a texture cache entry that is persisted across frames.
- Uses the brush_image shader to draw this cached gradient into the main scene instead of running the full gradient shader.
By selecting this path only for simple gradients (horizontal/vertical, no hard stops, etc) we can draw the cache entry with a very small size, and rely on stretching / bilinear filtering of the brush_image shader when drawing the gradient image into the main scene.
There's two goals of this work (we might want to create separate bugs for each part?):
- Support this fast caching path for radial and conic gradients where possible. This should be relatively simple, by either duplicating or abstracting the current linear gradient fast path into shared code.
- Consider expanding support in this fast path for more complex gradients (e.g. hard stops, angle start/stop line). It's an open question here how (or if) it's possible to support some of these cases - we can discuss this in detail on matrix/zoom.
The ultimate outcome would be that all gradients get drawn into the main scene via brush_image (with all the gradients drawn into the persistent render task cache), although this probably isn't feasible in all cases.
Reporter | ||
Comment 1•5 years ago
|
||
Bert, might be of interest to you to work on?
Reporter | ||
Comment 2•5 years ago
|
||
I should add - some of the assumptions related to how gradients are currently implemented may no longer be true, so don't feel constrained by the current implementation details, it might be worth reconsidering these.
For example, a current constraint is that the gradient should should be as fast as possible, since it's often used on full-screen background gradients. However, if the gradient shader is instead only writing to a small number of render task cached pixels, that are persisted across frames, it might be reasonable to have a more complex shader implementation that allows it to handle more complex gradient types, since drawing into the main scene becomes a straightforward image shader.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 3•5 years ago
|
||
Thanks for the overview Glenn.
My first reaction when looking at the code was a bit of surprise :) ... as the gradient shader that we cache is not very complex or heavy on ALU; so I wouldn't expect it to be ALU bound and thus beaten by sampling from a 512x16 cached colorstrip.
Thus the first thing I've done is set up some benchmarks with 4 large gradients: two cacheable and two that aren't (once due to a slight angle, once due to hardstops). I checked in RenderDoc that the caching kicks in as expected.
The result seems to be that there's not an awful lot between cache-vs-no-cache, not enough to distinguish it from normal noise inside a run (the attached HTML has some charts, the error line on each bar is min-max) or between runs (the five groups of bars).
In fact, if anything, the non-cached version seems to be consistently faster on one of the tests for me :/
Disclaimer though: I am on a Quadro P400, which has extremely slow memory bandwidth of 32GB/sec (!). A single 1:1 blit of a 3000x1500xRGBA8 takes over 1ms. It's possible that sampling is actually worse than calculating the gradient (even though the ALU also needs to get its data from a LUT, so, I don't know -- perhaps raw loads come through faster than a bottlenecked texture sampler?). I'll see if I can test on my laptop as well.
I guess for more complex gradients the balance could shift a bit; but even then, we're ad-hoc reimplementing a caching mechanism. You may be right that investing that time in making picture caching more powerful, and/or unifying some of these gradient types to have fewer batch breaks, could pay off more.
Anyway let me know if there's a flaw in my thinking or benchmarking here :) I'll try to grab a few more data points. We can chat a bit next week. Cheers!
Assignee | ||
Comment 4•5 years ago
|
||
Ok, good thing I checked a few more GPUs: seems I was too pessimistic, and there is definitely a win from the caching. My confidence is still not a 100%, because I would expect both cached tests to perform very similarly -- once the gradient is baked into the cache, it wouldn't really matter anymore what it looks like when blitting it? So I'm not sure why linear-smooth would be ~10% slower than aligned-gradient (source yaml is in the previous HTML file I attached), hmm...
The point about extending the gradient cache versus improving picture-cache/batching is still valid, let's discuss :)
Assignee | ||
Comment 5•5 years ago
|
||
I've experimented a bit with removing the limitation of 4 stops: if there's more, render multiple cached gradients and emit multiple instances to stitch them together. The single cache_handle
becomes a Vec of { RenderTaskHandle, prim rect }
as emitted by prepare_interned_prim_for_render
. It would then also open the door for hard stops -- each hard stop just starts a new segment.
It's just a hack for now, getting my bearings, but it does work, see the attached image. I can tidy it up a bit more next week, add more test cases, support hard stops, etc.
Other than that, I'm still a bit wary of turning this into a bad re-implementation of picture caching :)
Radial/conic gradients seem like they would need 2D textures; perhaps more than one due to hard stops, with alpha testing. Upscaling wouldn't be lossless like it is with the aligned linear gradients.
Misaligned linear gradients have a few options: 2D bake; rotating UVs in the image shader; rotating the vertices and clipping;... Anything other than the bake might cancel out the performance advantage and/or introduce more batching breakage. Maybe there is some more advanced blitter hiding that we can use.
Assignee | ||
Comment 6•5 years ago
|
||
- Cleaned up the hacks
- Added hard stops support
- Add a bunch of reftests - zooming, clamping, hard stops, clip shape, etc.
- Patch forthcoming :)
Assignee | ||
Comment 7•5 years ago
|
||
Assignee | ||
Updated•5 years ago
|
Comment 9•5 years ago
|
||
Backed out changeset 3c84f9fd1f93 (bug 1624468) for causing css-gradients reftest failures
https://hg.mozilla.org/integration/autoland/rev/4d16948c70a718218e6d877768eec27b5e75568d
push that caused the backout: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=ecb0f5f13cd361a46f920ad5441ece7473add713&failure_classification_id=2
Assignee | ||
Comment 10•5 years ago
|
||
Hi Andrei, thanks for catching the regression.
I've checked the failures visually, and they are expected and acceptable -- minor differences in rendering due to pre-caching images.
I've updated the fuzzy values for the 3 failing tests and started a new try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f15d0f2dbfbb3daed0c06564d7163266c7bf95a6
I'll try to re-land once that goes green. Thanks.
Edit: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc315e6d871e11bc53c8c10db8984edceff3beb1
Assignee | ||
Comment 11•5 years ago
|
||
Hi Glenn, I'm curious on your thoughts for best bang-for-buck next steps?
For caching conic/radial gradients, we would lose quality from upsampling, more than from linear gradients. In particular the center might lose its sharpness, I'm not sure if that's acceptable.
Hard stops would be disallowed; adding them would mean multiple 2D images overlapping, with alpha cutouts.
So both are doable but is it a good idea vs. just relying on picture caching?
Likewise for linear gradients, removing the horizontal-vertical-only restriction would require baking a 2D texture, I think.
We could also rotate UVs in the image brush, but, it doesn't do so right now, and there'd be clamping overhead since the gradient is atlassed. That would seem a batch breaking change, and at some point that overhead must approach the original ALU shader, which isn't all that complex.
So I'm wondering if you had some ideas when you first wrote up this task. Thanks!
Comment 12•5 years ago
|
||
Comment 13•5 years ago
|
||
Backed out changeset 9dc84057c6a9 (bug 1624468) for causing gradient-move-stops.html to fail
https://hg.mozilla.org/integration/autoland/rev/4e874b1ed303fa1ca1960d3caed4cced4288c49d
failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295875161&repo=autoland&lineNumber=14737
[task 2020-04-02T02:43:15.015Z] 02:43:15 INFO - TEST-START | /css/css-images/gradient-move-stops.html
[task 2020-04-02T02:43:15.027Z] 02:43:15 INFO - PID 13842 | 1585795395018 Marionette INFO Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html
[task 2020-04-02T02:43:15.762Z] 02:43:15 INFO - PID 13842 | 1585795395754 Marionette INFO No differences allowed
[task 2020-04-02T02:43:15.879Z] 02:43:15 INFO - TEST-UNEXPECTED-PASS | /css/css-images/gradient-move-stops.html | Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html
[task 2020-04-02T02:43:15.888Z] 02:43:15 INFO - TEST-INFO expected FAIL | took 846ms
[task 2020-04-02T02:43:15.920Z] 02:43:15 INFO - PID 13842 | [GPU 13905, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp, line 371
[task 2020-04-02T02:43:16.123Z] 02:43:16 INFO - PID 13842 | 1585795396115 Marionette INFO Stopped listening on port 43590
[task 2020-04-02T02:43:16.483Z] 02:43:16 INFO - PID 13842 | [Child 14190, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | nsStringStats
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mAllocCount: 13510
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mReallocCount: 0
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mFreeCount: 13510
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mShareCount: 18304
[task 2020-04-02T02:43:16.521Z] 02:43:16 INFO - PID 13842 | => mAdoptCount: 350
[task 2020-04-02T02:43:16.522Z] 02:43:16 INFO - PID 13842 | => mAdoptFreeCount: 426
[task 2020-04-02T02:43:16.522Z] 02:43:16 INFO - PID 13842 | => Process ID: 14190, Thread ID: 140365280118656
[task 2020-04-02T02:43:17.157Z] 02:43:17 INFO - PID 13842 | [GPU 13905, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp, line 371
[task 2020-04-02T02:43:17.259Z] 02:43:17 INFO - PID 13842 | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2020-04-02T02:43:17.267Z] 02:43:17 INFO - PID 13842 | [2020-04-02T02:43:17Z WARN xulstore::persist] tried to remove key that isn't in the store
[task 2020-04-02T02:43:17.267Z] 02:43:17 INFO - PID 13842 | [2020-04-02T02:43:17Z WARN xulstore::persist] tried to remove key that isn't in the store
[task 2020-04-02T02:43:17.409Z] 02:43:17 INFO - PID 13842 | [Child 14027, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-04-02T02:43:17.439Z] 02:43:17 INFO - PID 13842 | nsStringStats
Assignee | ||
Comment 14•5 years ago
|
||
:\
I've verified in Chrome and Edge that gradient-move-stops.html
and gradient-move-stops-ref.html
do look the same, and indeed they don't in the current Nightly; but they do, after my change. So I assume the best thing to do is to remove the "if webrender expect fail" in gradient-move-stops.html.ini
.
I'll stop trying to "save time and resources" by carefully picking my try runs, because clearly I'm not -- so here's a try auto
run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dbd24005ee3d19012ed820ee875f4382f641c5c0
(I'm not adding a new pair of YAML reftests because I'm not sure how to translate this:
background-image: linear-gradient(to right, yellow, blue 70%, green 0);
to a YAML file that'd be any different from the reference 0.7 blue, 0.7 green, 1.0 green
.)
Edit: try auto
doesn't seem to work. New run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=20d5672c6804436227b41bb7476af6a957bbed0a
Comment 15•5 years ago
|
||
Comment 16•5 years ago
|
||
Backed out changeset dfa94a886779 (Bug 1624468) for causing web platform permafailures in /css/css-images/gradient-move-stops.html CLOSED TREE
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=296015444&repo=autoland&lineNumber=20944
Comment 17•5 years ago
|
||
Assignee | ||
Comment 18•5 years ago
|
||
Right, so if I understand correctly, if the test expects failure then Windows says:
TEST-UNEXPECTED-PASS | /css/css-images/gradient-move-stops.html | Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html
but if I change the test to expect equality, Linux says:
TEST-UNEXPECTED-FAIL | /css/css-images/gradient-move-stops.html | Testing http://web-platform.test:8000/css/css-images/gradient-move-stops.html == http://web-platform.test:8000/css/css-images/gradient-move-stops-ref.html
Assignee | ||
Comment 19•5 years ago
|
||
Add support for repeating gradients in the cached fast path.
Reporter | ||
Comment 20•5 years ago
|
||
(In reply to Bert Peers [:bpeers] from comment #11)
Hi Glenn, I'm curious on your thoughts for best bang-for-buck next steps?
For caching conic/radial gradients, we would lose quality from upsampling, more than from linear gradients. In particular the center might lose its sharpness, I'm not sure if that's acceptable.
Hard stops would be disallowed; adding them would mean multiple 2D images overlapping, with alpha cutouts.
So both are doable but is it a good idea vs. just relying on picture caching?Likewise for linear gradients, removing the horizontal-vertical-only restriction would require baking a 2D texture, I think.
We could also rotate UVs in the image brush, but, it doesn't do so right now, and there'd be clamping overhead since the gradient is atlassed. That would seem a batch breaking change, and at some point that overhead must approach the original ALU shader, which isn't all that complex.So I'm wondering if you had some ideas when you first wrote up this task. Thanks!
I think you're probably right - it's probably not worth looking into radial / conic gradients, or the other limitations at this point. In future, we might revisit if we come across some pages where these show up in profiles. Thanks for looking into this!
Assignee | ||
Comment 21•5 years ago
|
||
Thanks for the input Glenn. I think you're right, when we see a real-world case, it'll be easier to judge how much time to invest further into special cases.
I've got a green Try for the above 2 fails, but will hold off until after freeze to land.
Intel HD630 is happy with all the optimizations. GTX1050 agrees, except for the repeat
in D69476 which shows a regression. Intel probably needs it more badly though and sees gains of up to 20%, so it's probably still valid? If we change our minds, it's easy enough to put back ExtendMode == Clamp
in supports_caching
, no need to rollback the whole patch.
Assignee | ||
Comment 22•5 years ago
|
||
GTX1050
Assignee | ||
Comment 23•5 years ago
|
||
Comment 24•5 years ago
|
||
== Change summary for alert #25534 (as of Thu, 02 Apr 2020 17:58:36 GMT) ==
Improvements:
7% Heap Unclassified windows10-64-shippable-qr opt 64,626,073.69 -> 59,793,242.54
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25534
Assignee | ||
Comment 25•5 years ago
|
||
Repeating gradient Documentation version 1 as of April 6, 2020.
Comment 26•5 years ago
|
||
Comment 27•5 years ago
|
||
bugherder |
Comment 28•5 years ago
|
||
(In reply to Alexandru Ionescu :alexandrui (needinfo me) from comment #24)
== Change summary for alert #25534 (as of Thu, 02 Apr 2020 17:58:36 GMT) ==
Improvements:
7% Heap Unclassified windows10-64-shippable-qr opt 64,626,073.69 -> 59,793,242.54
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25534
False alarm, sorry.
Comment 29•5 years ago
|
||
Comment 30•5 years ago
|
||
bugherder |
Comment 31•4 years ago
|
||
I think you're probably right - it's probably not worth looking into radial / conic gradients, or the other limitations at this point. In future, we might revisit if we come across some pages where these show up in profiles. Thanks for looking into this!
Let's close and revisit in a new bug if/when it makes sense.
Description
•