Blitting picture cache tiles seems slower on Windows/ANGLE for the same hardware.
Categories
(Core :: Graphics: WebRender, defect)
Tracking
Version | Tracking | Status |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox65 | --- | unaffected |
firefox66 | --- | disabled |
firefox67 | --- | fixed |
People
(Reporter: gw, Assigned: sotaro)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
When a picture cache tile is copied from the framebuffer to the texture cache, it shows up in the WR GPU profiler as a bright green bar near the end of the frame (tag "Blit").
When using the same hardware (Intel HD4600) these blits seem to take much longer (I'd guess 5-6x) on Windows/ANGLE than they do on Linux.
There are a few possibilities here:
- The GPU profiler is reporting incorrect values on Windows and/or Linux.
- The blits are much slower due to a slow driver path on Windows.
- The blits are much slower due to a slow ANGLE code path being hit.
We could investigate this by:
- Testing if the profile results are reproducible in other GPU profilers (e.g. GPA)
- Testing if the results differ when running on Windows + native GL.
Comment 1•6 years ago (Reporter)
It'd also be interesting to know if this is GPU / vendor specific.
Comment 2•6 years ago
Does this blit use the same path as when resizing the texture cache? If so, Linux may be using glCopyImageSubData, which isn't supported on ANGLE.
Comment 3•6 years ago
Oh, no, it can't be, since the blit is from the framebuffer and glCopyImageSubData needs a texture.
Comment 4•6 years ago (Reporter)
If I disable ANGLE, I see reported GPU times for caching picture tiles that are much faster (the GPU profile looks around the same as on Linux with native GL).
So I currently suspect we are triggering a slow path in ANGLE (or that the GPU timers being reported when using ANGLE are inaccurate).
Comment 5•6 years ago (Reporter)
Although not as drastic, it does seem to be the same on a machine with an nVidia GTX 1050 - the reported tile blit times are much longer when running with ANGLE compared to native GL.
I will try some vendor profiling tools to see if they report the same differences.
Comment 6•6 years ago
FWIW, I have noticed that the GPU timer queries on Windows tend to have a large performance impact. Do you have a way to check whether what you're seeing is just an artifact of enabling the timer queries or a real performance difference?
Comment 7•6 years ago (Reporter)
It's still possible it's a measurement error, but the difference between ANGLE and non-ANGLE does seem reproducible on different machines and GPUs. I'm planning to do some captures in wrench and then some profiling with the nVidia / Intel vendor profiling tools to see what they show.
Comment 8•6 years ago (Reporter)
A few notes from investigating today:
The code path we are hitting in ANGLE for glBlitFramebuffer results in a D3D draw call with vertex + pixel shader. It's unclear to me so far whether all blits are implemented as draws, or whether we are hitting a slow / uncommon path.
There is an ANGLE extension called ANGLE_framebuffer_blit. I haven't looked into it in detail, but I wonder if this is to work around some performance issues with implementing glBlitFramebuffer.
When running with the nvidia tools, it's not clear that there is a reported difference between ANGLE/D3D or native GL. However, this may just be because the GPU times are so fast on these test cases with a GTX 1050 - the GPU times are small enough that the noise in the profile timings is a significant percentage of the total time, let alone the blit times.
I'm going to try and run with the Intel GPU tools on a HD4600, where the difference seemed much more significant, and see if I can get more reliable numbers on that.
Comment 9•6 years ago
For the record, ANGLE_framebuffer_blit's glBlitFramebufferANGLE is just the ES2 version of the ES3-only glBlitFramebuffer.
Comment 10•6 years ago (Reporter)
Comment 11•6 years ago (Reporter)
I'm not making much progress on this - writing up some notes here to see if anyone else has ideas.
Problem:
- When running on a mobile HD4600 GPU, blitting cache tiles from the framebuffer into the texture cache is sometimes very slow. See the attached capture.png - the bright green blits are 90+% of the GPU time, when they are typically expected to be ~10% of the GPU time.
- It does seem to be a real slowdown (rather than a measurement error). Even with GPU timers disabled, scrolling is very noticeably laggy compared to with ANGLE disabled.
Things I've found:
- Only occurs when ANGLE is enabled in Firefox. If I disable ANGLE and run native GL, the GPU times look as expected.
- Only occurs inside Firefox. If I take a wrench capture and replay it, the GPU times are as expected, with/without --angle enabled.
Random things I've tried in Gecko without any improvements:
- Disabling the blocking present query.
- Disabling triple buffering.
- Tried to disable the DirectComposition path, but I just get a white screen with nothing rendered.
I initially thought it was because the glBlitFramebuffer impl in ANGLE was going through the slow path (skipping CopySubresourceRegion and doing a draw call). However, on the Intel GPU, as best I can tell it's going through the fast path. I also tried manually hacking ANGLE to go through the slow path and it didn't seem to help. It's possible I made a mistake here - might be worth verifying these claims.
Any ideas?
Comment 12•6 years ago (Reporter)
In terms of other devices, I think it's also occurring on nVidia GTX 1050, but it's much harder to say for sure since it's fast enough that it's hard to notice the difference conclusively.
I haven't tried on other Intel GPUs, but I suspect it will occur on those too, not just the HD4600. On the same hardware running Linux with native GL, the tile blit times are as expected, so the hardware itself is not the problem.
Comment 13•6 years ago (Reporter)
If you want to try and reproduce, it can be more easily seen by making the picture cache code always cache tiles, by:
- Commenting out https://searchfox.org/mozilla-central/rev/4587d146681b16ff9878a6fdcba53b04f76abe1d/gfx/wr/webrender/src/picture.rs#1569, so that tiles never become valid.
- Changing https://searchfox.org/mozilla-central/rev/4587d146681b16ff9878a6fdcba53b04f76abe1d/gfx/wr/webrender/src/picture.rs#1538 to "if true {", so that tiles are cached every frame.
Comment 14•6 years ago (Assignee)
Comment 15•6 years ago (Assignee)
attachment 9045543 [details] reduced the Blit time shown in the WebRender profiler for me on a Win10 Intel PC. The BufferUsage value was borrowed from Gecko and Chromium, but it seems better to use ANGLE's BufferUsage, since WebRender does readback.
https://searchfox.org/mozilla-central/rev/6376e2c6bb8b771dd6513156d84ac13b0f15c7f0/gfx/layers/d3d11/MLGDeviceD3D11.cpp#270
https://cs.chromium.org/chromium/src/gpu/ipc/service/direct_composition_child_surface_win.cc?l=324
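A minimal sketch of what a BufferUsage change of this kind amounts to, assuming (based on the flag named in comment 30) that the patch adds DXGI_USAGE_SHADER_INPUT to the swap chain's BufferUsage. The two flag values are the ones defined in the Windows SDK DXGI headers. The "before" value and the helper name has_flag are illustrative assumptions, not code taken from the patch, and Python is used here only to show the bit arithmetic.

```python
# DXGI_USAGE bit values as defined in the Windows SDK DXGI headers.
DXGI_USAGE_SHADER_INPUT = 0x10
DXGI_USAGE_RENDER_TARGET_OUTPUT = 0x20

# Assumption: the swap chain was previously created as render-target only.
usage_before = DXGI_USAGE_RENDER_TARGET_OUTPUT

# Assumption: the fix additionally marks the back buffer as a shader input.
usage_after = DXGI_USAGE_RENDER_TARGET_OUTPUT | DXGI_USAGE_SHADER_INPUT


def has_flag(usage: int, flag: int) -> bool:
    """Return True when every bit of `flag` is set in `usage` (hypothetical helper)."""
    return (usage & flag) == flag


print(hex(usage_before), has_flag(usage_before, DXGI_USAGE_SHADER_INPUT))
print(hex(usage_after), has_flag(usage_after, DXGI_USAGE_SHADER_INPUT))
```

In the real code the combined value would be assigned to the BufferUsage field of the swap chain description when the swap chain is created.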
Comment 16•6 years ago (Reporter)
Thanks Sotaro, I will do some testing with this tomorrow!
Comment 17•6 years ago (Assignee)
Oh, attachment 9045543 [details] was wrong, I updated the patch.
Comment 19•6 years ago (Reporter)
I just tried this patch out on my HD4600 mobile device and it makes a massive difference - the GPU time when scrolling around on most pages drops from ~11ms to ~5ms.
The sooner we can get this merged the better, I think. I'm not sure the best way to address Dzmitry's concerns about doing it on other GPUs. Maybe we merge this and revisit if we encounter perf issues on any other GPUs later?
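To put those GPU times in context, here is a rough calculation of how much of one frame's time budget they take up. It assumes a 60 Hz display, which is not stated anywhere in this bug, and the names REFRESH_HZ and budget_share are made up for the illustration.

```python
REFRESH_HZ = 60  # assumption: the refresh rate is not given in this bug
FRAME_BUDGET_MS = 1000 / REFRESH_HZ  # about 16.7 ms per frame


def budget_share(gpu_ms: float) -> float:
    """Fraction of a single frame's time budget used by `gpu_ms` of GPU work."""
    return gpu_ms / FRAME_BUDGET_MS


# The ~11ms and ~5ms figures are the approximate GPU times reported above.
for label, gpu_ms in (("before patch", 11.0), ("after patch", 5.0)):
    print(f"{label}: {gpu_ms:.0f} ms is {budget_share(gpu_ms):.0%} of the frame budget")
```

Under that assumption the GPU work drops from roughly two thirds of the frame budget to under a third.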
Comment 20•6 years ago (Reporter)
Discussing this a bit more with Jeff, my thoughts are:
I think we should land it as-is, because (1) the performance penalty is so severe without it, and (2) we don't have any real insight into whether ANGLE will choose a fast / slow path at any time, so we have no guarantees that it won't need this on all platforms at some time anyway.
How about we land it as-is, and keep a close eye on the telemetry graphs for each GPU in the metrics dashboards over the next few days?
Marking this as affected for the 66 webrender experiments. I can still take an uplift here.
Comment 23•6 years ago (Reporter)
Sotaro, I think this will also fix the main cause of the major MotionMark and other slowdowns you previously reported with picture caching.
I'd definitely like to get this uplifted to 66, if possible. Ideally we could let it sit in nightly for a day or two to make sure it doesn't regress elsewhere, but it certainly has the potential to be a big performance win on Intel GPUs at least.
Comment 24•6 years ago (Assignee)
(In reply to Glenn Watson [:gw] from comment #19)
> The sooner we can get this merged the better, I think. I'm not sure the best way to address Dzmitry's concerns about doing it on other GPUs. Maybe we merge this and revisit if we encounter perf issues on any other GPUs later?
Yeah, I agree. It might not affect NVIDIA, since the patch did not change the Talos results. And the MotionMark score improved on my Win10 Intel laptop.
Comment 25•6 years ago
Comment 26•6 years ago (bugherder)
Comment 27•6 years ago
Would you like to uplift this now? It's had some time on nightly. Are you able to measure a performance gain?
Comment 28•6 years ago (Reporter)
I think this is worth uplifting. Thoughts, Jeff, Sotaro?
Comment 29•6 years ago (Assignee)
Yeah, it seems worth uplifting :)
Comment 30•6 years ago (Assignee)
(In reply to Liz Henry (:lizzard) (use needinfo) from comment #27)
> Are you able to measure a performance gain?
No performance improvement showed up in Talos on try, so I tested locally with https://browserbench.org/MotionMark/ on my Intel PC (Lenovo P50). I used the following win64-pgo builds for MotionMark testing.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc6f06fbd9d4ef8fca2782a266f2faca5b65aa16
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7694cc64ee51e21cbaaa0ed810408bbde6b4091f
The scores on https://browserbench.org/MotionMark/ were as follows. Without the DXGI_USAGE_SHADER_INPUT flag, the scores were unstable.
- With DXGI_USAGE_SHADER_INPUT: 260-320
- Without DXGI_USAGE_SHADER_INPUT: 180-260
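As a rough way to read these two ranges, the sketch below compares their midpoints and widths. Summarising each range by its midpoint is an assumption made here for illustration, since the individual run scores are not given in this bug, and the function and variable names are hypothetical.

```python
def midpoint(low: float, high: float) -> float:
    """Middle of a reported score range."""
    return (low + high) / 2


def spread(low: float, high: float) -> float:
    """Width of a reported score range."""
    return high - low


with_flag = (260, 320)     # range reported with DXGI_USAGE_SHADER_INPUT
without_flag = (180, 260)  # range reported without it

mid_with = midpoint(*with_flag)
mid_without = midpoint(*without_flag)
relative_gain = mid_with / mid_without - 1

print(f"midpoints: {mid_with} vs {mid_without} (gain {relative_gain:.1%})")
print(f"range width: {spread(*with_flag)} vs {spread(*without_flag)}")
```

By this measure the flag is worth roughly a 30% higher score, the two ranges only meet at 260, and the range without the flag is wider, which matches the remark that those scores were unstable.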
Comment 31•6 years ago
OK, please request uplift and this can likely make it into beta 13 next week. Thanks!
Comment 32•6 years ago (Assignee)
Ah, it is not necessary to uplift the patch, since we do not enable WebRender on Intel GPUs on beta yet.
Comment 33•6 years ago
Per comment 32.
Comment 34•6 years ago
I tested out this change on an HD Graphics 530 in a desktop machine at 1080p and did not see a noticeable difference in MotionMark scores: 747 ±5.23% vs 483 ±4.69%.