Open Bug 918941 (webgl-shader-cache) Opened 11 years ago Updated 1 year ago

cache results of shader compilation

Categories

(Core :: Graphics: CanvasWebGL, enhancement, P3)

Tracking

()

People

(Reporter: vlad, Unassigned)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Whiteboard: [games:p1] webgl-perf [platform-rel-Games])

Attachments

(3 files)

Compiling WebGL shaders is really expensive, especially on Win32, and it is also slow on mobile.  We should cache as much as we can. For example, given the same WebGL GLSL input and the same Firefox/ANGLE/driver versions, we should be able to load HLSL bytecode from a previous ANGLE->HLSL compiler run, or shader binary code on mobile.
More generally, we should cache program binaries for all platforms. ANGLE supports a form of shader program binaries, so we can just treat it like any other type of shader binary.
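
As a rough illustration of "treat it like any other type of shader binary", the lookup could follow a get-or-build pattern. This is only a sketch: `ProgramBinaryCache`, `get_or_build`, and `fake_link` are hypothetical names, the binary is a placeholder byte string rather than anything a GL driver or ANGLE produced, and nothing here reflects how Gecko would actually store the data.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class ProgramBinaryCache:
    """Toy on-disk cache mapping a key string to an opaque program binary."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _path_for(self, key: str) -> Path:
        # Hash the key so arbitrary shader text maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.bin"

    def get_or_build(self, key: str, build: Callable[[], bytes]) -> bytes:
        path = self._path_for(key)
        if path.exists():
            self.hits += 1
            return path.read_bytes()
        # Cache miss: pay for the expensive step once, then store the result.
        self.misses += 1
        binary = build()
        path.write_bytes(binary)
        return binary


def fake_link() -> bytes:
    # Stand-in for the expensive compile/link step; not a real GL call.
    return b"\x00fake-program-binary"


cache = ProgramBinaryCache(Path(tempfile.mkdtemp()))
first = cache.get_or_build("vertex+fragment source", fake_link)
second = cache.get_or_build("vertex+fragment source", fake_link)
print(cache.misses, cache.hits)
```

The second call returns the stored bytes without invoking `fake_link` again, which is the saving this bug is after.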
OS: Windows 8 → All
Hardware: x86_64 → All
Yeah that's true, this could be done at the GLContext layer.
Any preferences or other settings that could make the cached shaders "invalid" or at least require us to purge this cache?
We should probably collect that list explicitly... at first pass:

- gl strings: GL_VERSION/GL_VENDOR/GL_RENDERER/GL_SHADING_LANGUAGE_VERSION (of the actual GLContext, not the WebGL ones)
- some ANGLE version string if it's not part of the above
- webgl.prefer-native-gl
- webgl.shader_validator
- (windows) D3D compiler DLL version

I don't think we need to depend on the underlying D3D driver below ANGLE, since we'd be caching HLSL bytecodes there.
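
One way to turn the list above into a purge rule is to fold every input into the cache key, so that a change in any of them simply misses the old entries. The sketch below assumes this approach; `CacheEnvironment`, `shader_cache_key`, and all the example strings are made up for illustration and are not existing Gecko or ANGLE names.

```python
import hashlib
from dataclasses import astuple, dataclass, replace


@dataclass(frozen=True)
class CacheEnvironment:
    # Strings from the actual GLContext, not the WebGL-exposed ones.
    gl_version: str
    gl_vendor: str
    gl_renderer: str
    glsl_version: str
    angle_version: str
    # Prefs that change how shaders are translated.
    prefer_native_gl: bool      # webgl.prefer-native-gl
    shader_validator: bool      # webgl.shader_validator
    # Windows only; empty elsewhere.
    d3d_compiler_version: str


def shader_cache_key(source: str, env: CacheEnvironment) -> str:
    h = hashlib.sha256()
    for field in astuple(env):
        data = repr(field).encode("utf-8")
        # Length-prefix each field so adjacent values cannot run together.
        h.update(len(data).to_bytes(4, "little"))
        h.update(data)
    h.update(source.encode("utf-8"))
    return h.hexdigest()


# Example values are invented placeholders.
env = CacheEnvironment(
    gl_version="example-gl-version",
    gl_vendor="example-vendor",
    gl_renderer="example-renderer",
    glsl_version="example-glsl-version",
    angle_version="example-angle-version",
    prefer_native_gl=False,
    shader_validator=True,
    d3d_compiler_version="example-d3dcompiler-version",
)
source = "void main() { gl_FragColor = vec4(1.0); }"
key_before = shader_cache_key(source, env)
key_after = shader_cache_key(source, replace(env, prefer_native_gl=True))
print(key_before != key_after)
```

With this design no explicit invalidation pass is needed for correctness; stale entries only need to be evicted to reclaim space.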
Granted, the shaders I worked with in the past were likely more complicated, but there is no way that app could have survived at all without us caching the compiled shaders.  It would be nice to get these numbers for our scenario, though, so that we at least know what we're talking about.  Is there a way to time just the compile? The sum of those times is the upper bound on what a cache could save us.
Yeah, we can just time how long glCompileShader/glLinkProgram take in total.  That will be -most- of the time saved, since those calls would be replaced by an internal read-from-cache.  We could do that by modifying the JS or by modifying Firefox itself.  We really should add some telemetry for this too, now that I think about it.

Also, the shadertoy shaders are extremely complicated, way more than what you'd generally see in the real world :/  That's what makes this so painful.
Blocks: gecko-games
It may be premature claiming that this blocks Gecko as a gaming platform; we do need numbers.
(In reply to Milan Sreckovic [:milan] from comment #7)
> It may be premature claiming that this blocks Gecko as a gaming platform; we
> do need numbers.

I don't have the numbers in front of me right now, but we've seen numbers from games developers before, and it's definitely something that will help with start-up time. Off the top of my head, we had one example with ~110 shaders, a couple of which took many hundreds of milliseconds each to compile.
That's a good example, thanks.  I didn't mean to deprioritize this bug, I was just trying to understand why we're saying that one shouldn't do games on Gecko until it is fixed.  However, it may be more of a "related" than "blocking" relationship we're trying to describe. It also sounds like we either have the numbers or can get them once we want to measure if "we're done" with this bug...
FWIW, our WebGL application spends several seconds compiling shaders.  We would benefit greatly from a compiled shader cache.

The application is behind closed beta right now, but if you'd like access for testing, I can get you an account.
I gathered some simple results. I ran four WebGL demos:

Unigine Crypt Demo: http://crypt-webgl.unigine.com/
Unity Angry Robots Demo, and the two Dead Trigger Demos: http://blogs.unity3d.com/2014/04/29/on-the-future-of-web-publishing-in-unity/

Conveniently nVIDIA on Linux does have a shader cache, so I also ran the demos a second time to see what the speed up might be.

Link Time is the time the full WebGLContext::LinkProgram() function took.
Compile Time is the time the ShCompile() call in WebGLContext::CompileShader() took.

Ran on Ubuntu 14.04 x64 and Mac OSX 10.9 x64 on a MBP 11,3

==Result Summary==

Angry Robots Linux
------------------
Total Compile Time: 62.8214 ms across 229 calls
Total Link Time: 519.3323 ms across 114 calls

nVIDIA Cache Time:
Total Compile Time: 64.9124ms
Total Link Time: 17.1665ms

Unity Dead Trigger 2 - Helicopter Linux
---------------------------------------
Total Compile Time: 288.2592 ms across 705 calls
Total Link Time: 1697.2105 ms across 354 calls

nVIDIA cache time:
Total Compile Time: 309.8136ms
Total Link Time: 64.7068ms


Unigine Crypt Demo Linux
------------------------
Total Compile Time: 32.806 ms across 24 calls
Total Link Time: 141.6146 ms across 15 calls

nVIDIA cache time:
Total Compile Time: 33.4767 ms across 24 calls
Total Link Time: 2.4799 ms across 15 calls


Dead Trigger 2 - Village Demo Linux
-----------------------------------
Total Compile Time: 240.5811 ms across 662 calls
Total Link Time: 1274.736 ms across 333 calls

nVIDIA cache time:
Total Compile Time: 226.3913 ms across 660 calls
Total Link Time: 53.7249 ms across 332 calls
=========================================================
Angry Robots OSX
----------------
Compile time: 63.9002 ms across 229 calls
Link time: 52.0501 ms across 114 calls

Dead Trigger 2 - Helicopter OSX
-------------------------------
Compile time: 297.2739 ms across 707 calls
Link time: 227.4129 ms across 355 calls


Dead Trigger 2 - Village OSX
----------------------------
Total Compile Time: 60.6543 ms across 172 calls
Total Link Time: 45.2148 ms across 86 calls

Unigine Crypt Demo - OSX
------------------------
Total Compile Time: 32.5662 ms across 24 calls
Total Link Time: 13.585 ms across 15 calls

I've got somewhat more detailed files and can post the diff, but Bugzilla doesn't seem to support multi-upload.
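
To make the Linux figures above easier to compare, the ratio of first-run to second-run link time can be computed directly. The numbers below are copied from the result summary; the function name `link_speedups` is just a label for this sketch.

```python
# (demo, first-run link ms, second-run link ms with the NVIDIA driver cache),
# copied from the Linux results in this comment.
LINK_TIMES_MS = [
    ("Angry Robots", 519.3323, 17.1665),
    ("Dead Trigger 2 - Helicopter", 1697.2105, 64.7068),
    ("Unigine Crypt", 141.6146, 2.4799),
    ("Dead Trigger 2 - Village", 1274.736, 53.7249),
]


def link_speedups(rows):
    """Return {demo: first-run link time divided by cached link time}."""
    return {name: cold / warm for name, cold, warm in rows}


speedups = link_speedups(LINK_TIMES_MS)
for name, factor in speedups.items():
    print(f"{name}: link time {factor:.1f}x shorter on the second run")
```

The compile totals barely move between runs, which would be consistent with the driver cache helping at link time only; ShCompile() runs before the driver sees the shader.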
Can you test on Windows too?  Shader compilation is generally slower on Windows due to ANGLE.
Attached patch Simple timing diff (deleted) — Splinter Review
Output time taken to compile/link shaders
Attached file Adds results to total timings (deleted) —
Make a text file of the form:

========
WebGL Demo Title
========
[Insert firefox console output here]

e.g.

====
Angry Robots
====
ShCompile: 0.1983ms
Shader: 24
Link Program: 0.2343ms
....

redirect it into the script

./add.py < results
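
The attached script is no longer available, so the following is only a guess at what such a summing script might look like, based on the input format described above. `sum_timings` is a hypothetical name, and the final `ShCompile: 0.5ms` line in the sample is an invented value added to show accumulation.

```python
import re
from collections import defaultdict

# Matches lines such as "ShCompile: 0.1983ms" or "Link Program: 0.2343ms".
TIMING = re.compile(r"^(ShCompile|Link Program):\s*(\d+(?:\.\d+)?)\s*ms$")


def sum_timings(lines):
    """Return {title: {kind: [total_ms, call_count]}} for each demo block."""
    totals = defaultdict(
        lambda: {"ShCompile": [0.0, 0], "Link Program": [0.0, 0]}
    )
    title = None
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line and set(line) == {"="}:
            # A row of '=' opens or closes a title header.
            in_header = not in_header
            continue
        if in_header:
            if line:
                title = line
            continue
        match = TIMING.match(line)
        if match and title is not None:
            entry = totals[title][match.group(1)]
            entry[0] += float(match.group(2))
            entry[1] += 1
    return totals


sample = """\
====
Angry Robots
====
ShCompile: 0.1983ms
Shader: 24
Link Program: 0.2343ms
ShCompile: 0.5ms
""".splitlines()

totals = sum_timings(sample)
for title, kinds in totals.items():
    for kind, (total_ms, calls) in kinds.items():
        print(f"{title}: total {kind} {total_ms:.4f} ms across {calls} calls")
```

Passing `sys.stdin` instead of `sample` would let it be used as a filter in the way the `./add.py < results` command suggests.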
In case the value of implementing a shader cache is not clear to everyone, I will share our situation.  Our WebGL application uses about 140 shaders.  Almost all of those are variants optimized for different numbers of lights, different numbers of skinned bones, etc.

On Windows, compiling shaders costs us about 7 seconds total.  In that time period, the browser is frozen.  (We can spread the compilation work across frames, but that doesn't really solve the problem - it just moves it.  It means we will have a slow frame rate for 15 seconds or whatever.)

Chrome implements a compiled shader cache ( https://code.google.com/p/chromium/issues/detail?id=88572 , https://code.google.com/p/chromium/issues/detail?id=249739 ) which reduces shader compile times to about 10% of the uncached time.

Other options: enable shader compilation in Web Workers or on a background thread in the browser.
Why is there so much interest in this bug, but only two votes (including mine)?
Doing this has been blocked on some of the quota manager/PBackground work -- the asm.js cache in dom/asmjscache is a model for how this can be implemented for webgl.  However, that code is more complicated than it needs to be, and will be simplified greatly by the work that's waiting in bug 961049.  Once that lands, then this can be implemented pretty straightforwardly.
Depends on: 961049, 961057
Depends on: 942542
No longer depends on: 961057
Shader compilation has generally been a load-time issue for most engines, but two Emscripten-ported engines have just popped up that actually compile shaders on demand at runtime.

It looks like Unreal Engine 4 does this as well. Found a particularly good test case with Unreal Engine 4 PlatformerGame demo, uploaded it to

https://s3.amazonaws.com/mozilla-games/tmp/2016-04-23-PlatformerGame/PlatformerGame.html?cpuprofiler&playback

Attached a screenshot that illustrates the stuttering caused by shader compilation. Pauses in the range of 500-1000 msecs don't seem that uncommon in this demo. Looking at the spikes in geckoprofiler, they lead to glLinkProgram().

Here is one geckoprofile trace: https://cleopatra.io/#report=4f151b9643c09da0d85b4242635204ea9849403b&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A80498,%22end%22%3A86634%7D,%7B%22type%22%3A%22FocusedCallstackPrefixSampleFilter%22,%22name%22%3A%22Browser_setImmediate_messageHandler()%20%40%20cb8fb3a6-fdd7-4b2c-a124-5248ada9de1d%3A8%22,%22focusedCallstack%22%3A%5B0,2541,3,4,2542%5D,%22appliesToJS%22%3Afalse%7D%5D&selection=2542,2543,2544,2545,2546

The time spent in these seems to be dominated by D3DCompiler_47.dll rather than ANGLE code, so it looks like the slow path is the actual D3D HLSL compilation, not ANGLE shader translation/validation or some such.
Blocks: 1268629
Whiteboard: [games:p2] → [games:p2] webgl-perf
Alias: webgl-shader-cache
Improved the above test case page to more rigorously measure and highlight the shader compilation times.

Visit

https://s3.amazonaws.com/mozilla-games/tmp/2016-05-05-PlatformerGame-profiling/PlatformerGame-HTML5-Shipping.html?playback&cpuprofiler&webglprofiler&expandhelp&tracegl=50

for an automated run. While the page is running, light blue spikes (the "Cold WebGL Calls" section) will appear on the profiling graph to indicate stuttering from shader compilation. After the run finishes, open the web page console, which will have logged events like

Trace: at t=43965.7, section "Cold GL" called via "_glLinkProgram" <- "__ZL11LinkProgramRK33FOpenGLLinkedProgramConfiguration" <- "__ZN17FOpenGLDynamicRHI25RHICreateBoundShaderStateEP21FRHIVertexDeclarationP16FRHIVertexShaderP14FRHIHullShaderP16FRHIDomainShaderP15FRHIPixelShaderP18FRHIGeometryShader" took 243.69 msecs!

To run the same page interactively, remove the "playback" option from the URL, i.e. visit

https://s3.amazonaws.com/mozilla-games/tmp/2016-05-05-PlatformerGame-profiling/PlatformerGame-HTML5-Shipping.html?cpuprofiler&webglprofiler&expandhelp&tracegl=50

On Firefox Nightly on Windows with a Core i7-5960X and a GTX 980 Ti with 365.19 NVidia drivers, the longest observed shader compilation event took 342.50 msecs, and overall there were about 60 such events that took more than 50 msecs. Chrome appears to have somewhat effective caching: on the second run there, only one stutter event took longer than 50 msecs.
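
Counting events over a threshold from the console output can be scripted. The sketch below assumes the log lines follow the "Trace: ... took N msecs!" shape shown earlier; `slow_events` is a hypothetical helper, the first sample line is a shortened copy of the trace above, and the second is invented to show an event under the threshold.

```python
import re

# Captures the innermost GL entry point and the reported duration.
TOOK = re.compile(r'called via "([^"]+)".*took (\d+(?:\.\d+)?) msecs')


def slow_events(log_lines, threshold_ms=50.0):
    """Return (entry_point, msecs) for every trace line above the threshold."""
    events = []
    for line in log_lines:
        match = TOOK.search(line)
        if match and float(match.group(2)) > threshold_ms:
            events.append((match.group(1), float(match.group(2))))
    return events


sample_log = [
    'Trace: at t=43965.7, section "Cold GL" called via "_glLinkProgram" '
    '<- "__ZL11LinkProgramRK33FOpenGLLinkedProgramConfiguration" '
    "took 243.69 msecs!",
    # Invented line, below the 50 msec threshold.
    'Trace: at t=44000.0, section "Cold GL" called via "_glCompileShader" '
    "took 12.5 msecs!",
]

events = slow_events(sample_log)
print(len(events), "event(s) over 50 msecs")
```

Running the same tally on logs from a first and a second page load would give a direct before/after count for any caching behavior.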
Whiteboard: [games:p2] webgl-perf → [games:p?] webgl-perf
Whiteboard: [games:p?] webgl-perf → [games:p1] webgl-perf
Whiteboard: [games:p1] webgl-perf → [games:p1] webgl-perf [platform-rel-Games]
platform-rel: --- → ?
platform-rel: ? → ---
Type: defect → enhancement
Priority: -- → P3

Hello! I would like to present a new use case that has popped up: the machine learning framework Tensorflow.js. As best as I can tell, it compiles many shaders (a shader per graph op?) when the model initially loads and runs the first inference. This causes quite heavy page loads.
While this can be used directly on a web page (which perhaps represents the bulk of usage?), I use it in a Firefox plugin for inference on images.
Basically this causes a 10 second stall when the plugin is initially loaded. As a point of reference on size, that's for a model based on MobilenetV2 - not something considered to be huge. I'm tracking my corresponding issue here

Beyond the undue level of hype that machine learning has received in recent years, I do anticipate a significant growth in its usage - even on the web. Based on that, I would imagine that - whether Tensorflow.js retains its popularity or not - caching shaders will remain important for this use case in the future as the CPU is simply not cut out for the linear algebra needed. (I also see active work on a WebGPU backend, but that's a future story...) For the curious, Tensorflow.js has several demos online.

Thank you team for developing Firefox and I hope you find this new use case report helpful!

Severity: normal → S3