Dynamic Function Linking Graph for Shaders

December 16, 2015

For many modern engines, the sheer quantity of different configuration options for shaders can start to be a major burden. Many compile-time options can end up increasing the number of compiled shaders exponentially. It can get to the point where the shader data image makes up a large segment of download time, and compile time becomes a major hassle during development.

For example, a pixel shader for forward lit scene elements will often need to be specialized to suit the number and types of lights nearby. If we have a few different types of lights, the number of combinations can very quickly become unmanageable.

But we really need a lot of compile time options! They are very useful.

Dynamic linking methods in D3D

What we really need is a way to do dynamic linking of shaders -- so that we can construct the particular shader we need at runtime.

D3D provides a few different methods for dynamic shader linking. One simple method involves "classes" and "interfaces."

Classes and interfaces

In the shader code, we can define an interface like this:

interface ILightResolver
{
    float3 Resolve(
        GBufferValues sample,
        LightSampleExtra sampleExtra,
        LightDesc light,
        float3 worldPosition,
        float3 directionToEye,
        LightScreenDest screenDest);
};

Then we can create an implementation of this interface:

class Directional : ILightResolver
{
    float3 Resolve(
        GBufferValues sample,
        LightSampleExtra sampleExtra,
        LightDesc light,
        float3 worldPosition,
        float3 directionToEye,
        LightScreenDest screenDest)
    {
        ...
    }
};

All of the methods in the interface function like virtual methods in C++. We can select which particular implementation to use from C++ code. It sounds very convenient -- and indeed it is really convenient!

XLE has supported this method for some time, but it's mostly just used for debugging shaders. While this method can work very well in simple situations, it has severe performance problems in complex situations.

Problems with lighting resolve

The biggest issue here is that each possible implementation class is included in one single "uber" shader. There are "call" instructions inserted into the compiled code that will jump to the correct implementation. This jumping should be quick (actually NVIDIA cards have had quick branching on static bools for a very long time) but in this case the issue is that the shader just becomes too large.

If we exceed the instruction cache for pixel shaders, it could cause some very serious performance problems.

With deferred rendering, the granularity seems wrong. We select a configuration, and then use it on a very large number of pixels. And then select the next configuration, etc... It doesn't seem right to have redundant code in the shader when we select the configuration infrequently.

There can also be problems passing large structures through the interface. In this sample, passing GBufferValues values is not ideal. The shader compiler seems to need to consume temporary registers to hold all of the parameters passed -- and in this case the number of parameters is too large.

Static polymorphism with interfaces

One of the cool things about interfaces and classes is they can be resolved at compile time! If we know the true type of the class at compile time, then the shader compiler will treat it just as a normal function call.

This works even if we are interacting with an interface pointer. So for example, I can have the function:

ILightResolver      GetLightResolver()
{
    #if LIGHT_SHAPE == 1
        Sphere result;
    #elif LIGHT_SHAPE == 2
        Tube result;
    #elif LIGHT_SHAPE == 3
        Rectangle result;
    #else
        Directional result;
    #endif
    return result;
}

Now, if I call GetLightResolver().Resolve(...) it is not a dynamic jump. It is treated just as a normal function call. So I can use GetLightResolver() anywhere, and never have to write the preprocessor switch stuff again.

This is a great hidden feature of the HLSL compiler! It's static polymorphism, just like using function overloads or templates in C++. It can really make shader code cleaner and clearer (but unfortunately this doesn't work with the feature described in the next section).

Patching shaders together

Our lighting shaders can actually be split into 3 logical parts: 1) light shape, 2) shadow cascade resolve, 3) shadow resolve.

Each part has different configuration settings, but we can mix and match them together.

What we really want is to be able to choose a configuration for each part independently, and then just stick them all together as one. We could do this back on old consoles -- when we had a lot of control of low level stuff, we would just compile the parts of shaders, and patch them into one at runtime.
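To make the saving concrete, here is a small back-of-the-envelope sketch. The option counts are made up for illustration (they are not XLE's real numbers). The point is just that fully specialized shaders grow with the product of the options, while independently compiled parts grow with the sum.

```python
from math import prod

def monolithic_variants(option_counts):
    """Shaders to compile when every combination is its own shader."""
    return prod(option_counts)

def independent_parts(option_counts):
    """Pieces to compile when each part is built on its own."""
    return sum(option_counts)

# Hypothetical counts: 4 light shapes, 3 cascade modes, 5 shadow resolve modes
counts = [4, 3, 5]
print(monolithic_variants(counts))  # 60
print(independent_parts(counts))    # 12
```

The combinations still have to be stuck together when they are needed, but that step works on already compiled parts rather than on HLSL source.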

Enter ID3D11Linker

Actually, D3D11 has a new feature that can do something like this. It involves the interfaces ID3D11Linker and ID3D11FunctionLinkingGraph. These are new features, but they are features of the compiler -- so they can work with any D3D 11 hardware.

This allows us to create reusable shader "libraries." The libraries can export functions using the "export" keyword:

export float3 DoResolve_Directional(
    float4 position, float3 viewFrustumVector,
    float3 worldPosition,
    float screenSpaceOcclusion
    MAYBE_SAMPLE_INDEX);

So, for example I can compile a library for light shapes, containing one exported function for each shape.

Now, it might be nice if we could just say "import float3 DoResolve_Directional(...)" in another shader, right? It seems logical, but it doesn't seem to be supported. Anyway, we may want to be constructing our linking shader at runtime, and we don't really want to be compiling HLSL source at that time.

However, there is another way...

Function Linking Graph

With the Function Linking Graph we can represent a series of function calls, and the process for passing parameters between them. This is sort of like a simplified "abstract syntax tree" for HLSL. It doesn't support any expressions or any statements other than function calls or return statements. But it's enough to stitch together our shader from several parts.

In XLE, we want high level code to be able to select configuration options, but in an implementation independent way. And we want our solution to fit in well with our assets system (ie, supporting hot reloads and smart handling of errors, etc).

The best way to do this is to introduce a simple scripting language. There are 2 ways to approach it: 1) a declarative manner (ie, we declare the function nodes and the links between them, and let the system figure out what to do with them), or 2) a more procedural method (ie, something that is just a thin layer over the underlying ID3D11FunctionLinkingGraph methods).

In this case, method 2 offers an efficient and more flexible solution. And the result is almost like a simplified HLSL:

FunctionLinkingGraph:1

main = DeclareInput(
    float4 position : SV_Position,
    float2 texCoord : TEXCOORD0,
    float3 viewFrustumVector : VIEWFRUSTUMVECTOR
    {{#passSampleIndex}}, uint sampleIndex : SV_SampleIndex{{/passSampleIndex}})

// Link in our main module
// We specify a filter for defines here. This is important because some defines
// are intended for this file (eg, for Mustache symbols) while other defines
// need to be passed down to this module
// It's good to be strict about this list, because the fewer defines get passed
// down to the modules, the fewer different versions of that module we have
libLightShape = Module(lib_lightshape.sh, GBUFFER_TYPE;MSAA_SAMPLERS;MSAA_SAMPLES;DIFFUSE_METHOD)
libShadow = Module(lib_shadow.sh, MSAA_SAMPLERS;MSAA_SAMPLES)
libHelper = Module(lib_helper.sh, MSAA_SAMPLERS;MSAA_SAMPLES;HAS_SCREENSPACE_AO)

// The basic structure is simple:
// 1) Calculate some inputs to the resolve operations
// 2) Perform each resolve step
// 3) Generate the output value by combining the resolve outputs
//
// Steps 1 and 3 are fixed, but step 2 varies depending on the options
// selected for the light (ie, this is where the dynamic linking occurs)
setup = libHelper.Setup(position, viewFrustumVector)
worldPosition = Alias(setup.2)
worldSpaceDepth = Alias(setup.3)
screenSpaceOcclusion = Alias(setup.4)
{{#passSampleIndex}}PassValue(sampleIndex, setup.5){{/passSampleIndex}}

light = libLightShape.DoResolve_{{shape}}(
    position, viewFrustumVector,
    worldPosition, screenSpaceOcclusion
    {{#passSampleIndex}}, sampleIndex{{/passSampleIndex}})

cascade = libShadow.DoResolve_{{cascade}}(
    position, texCoord, worldSpaceDepth)

shadow = libShadow.DoResolve_{{shadows}}(
    cascade.3, cascade.4, cascade.5,
    position
    {{#passSampleIndex}}, sampleIndex{{/passSampleIndex}})

finalize = libHelper.FinalizeResolve(light.result, shadow.result)
output = DeclareOutput(float4 outColour : SV_Target0)
outColour = finalize.result

This script is read by a simple, minimal parser. This parser should be quick enough to run at runtime without major issues.
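As a rough idea of how little parsing is needed, here is a minimal sketch in Python. It is not XLE's parser; the function names, the tuple layout and the regular expression are all assumptions made for illustration. It strips "//" comments, joins lines until the parentheses balance, and splits each assignment into a target, a callee and an argument list. Lines that are not assignments (such as the version header, or a bare PassValue call) are simply skipped here.

```python
import re

# target = callee            or            target = callee(arg, arg, ...)
STATEMENT = re.compile(r"^(\w+)\s*=\s*([\w.]+)\s*(?:\((.*)\))?$", re.S)

def split_statements(text):
    """Yield complete statements, joining lines until parentheses balance."""
    pending, depth = [], 0
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()   # drop // comments
        if not line:
            continue
        pending.append(line)
        depth += line.count("(") - line.count(")")
        if depth == 0:
            yield " ".join(pending)
            pending = []

def parse(text):
    """Return a list of (target, callee, [args]) tuples."""
    result = []
    for stmt in split_statements(text):
        m = STATEMENT.match(stmt)
        if not m:
            continue  # not an assignment; this sketch ignores it
        target, callee, args = m.groups()
        arg_list = [a.strip() for a in args.split(",")] if args else []
        result.append((target, callee, arg_list))
    return result
```

Each tuple would then map onto one node or one link in the graph. A real implementation also needs error reporting with line numbers, which this sketch leaves out.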

As a preprocessing step, I'm using a Mustache-compliant library. This is important for performing the customization we need. The Mustache step might be a little slow, but the result should still be ok in Release builds. In Debug builds the string handling is probably a little slower than ideal.
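The script above only relies on two Mustache features: {{name}} variables and {{#flag}}...{{/flag}} sections. The following Python sketch handles just those two cases, to show what the preprocessing step does to a line of the script. It is an illustration only, not the library XLE uses, and it ignores the rest of Mustache (nested sections, lists, escaping and so on).

```python
import re

SECTION = re.compile(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}", re.S)
VARIABLE = re.compile(r"\{\{(\w+)\}\}")

def render(template, context):
    """Expand {{#flag}}...{{/flag}} sections, then {{name}} variables."""
    def section(m):
        # keep the section body only when the flag is present and truthy
        return m.group(2) if context.get(m.group(1)) else ""
    def variable(m):
        # unknown names expand to an empty string
        return str(context.get(m.group(1), ""))
    return VARIABLE.sub(variable, SECTION.sub(section, template))

line = ("light = libLightShape.DoResolve_{{shape}}(position"
        "{{#passSampleIndex}}, sampleIndex{{/passSampleIndex}})")
print(render(line, {"shape": "Sphere", "passSampleIndex": True}))
print(render(line, {"shape": "Tube"}))
```

With the first context, the optional sampleIndex argument is kept; with the second, it disappears along with its leading comma.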

So, here we're basically just declaring the modules we want to link in. Then we specify the functions we want to call, and the sources for each parameter. We pass the outputs from each function down into where they are needed.
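The define filter on each Module(...) line is worth a closer look. Here is a small Python sketch of the idea, assuming a cache keyed on the module filename plus the filtered defines. The key scheme and the function names are hypothetical, not how XLE actually stores modules.

```python
def filter_defines(defines, allowed):
    """Keep only the defines a module is allowed to see, in a stable order."""
    return tuple(sorted((k, v) for k, v in defines.items() if k in allowed))

def module_key(filename, defines, filter_spec):
    """Cache key for one compiled module variant (hypothetical scheme)."""
    allowed = set(filter_spec.split(";"))
    return (filename, filter_defines(defines, allowed))

a = {"MSAA_SAMPLES": "4", "LIGHT_SHAPE": "1"}
b = {"MSAA_SAMPLES": "4", "LIGHT_SHAPE": "2"}

# LIGHT_SHAPE is filtered out, so both configurations share one compiled module
assert module_key("lib_shadow.sh", a, "MSAA_SAMPLERS;MSAA_SAMPLES") == \
       module_key("lib_shadow.sh", b, "MSAA_SAMPLERS;MSAA_SAMPLES")
```

A stricter filter means more configurations collapse onto the same key, so fewer versions of each module need to be compiled.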

We can't do any expressions or math or anything else here. All we can do is call functions and pass the results to output. Since this script will be compiled at runtime, perhaps a more restrictive language is actually preferable.

Compiled output

The final compiled shader code from the stitched approach appears to be very similar to a statically linked result. There might be a little overhead in some situations (particularly since the optimizer can't optimize across the function call points). But it seems minimal.

The stitched shader seems to use a few more temporary registers than the statically linked result (in one complex shader, the difference was 11 to 13). This may be an issue on some hardware, but maybe the benefits are worth some performance loss.

Uncommonly used?

ID3D11FunctionLinkingGraph doesn't seem to be frequently used currently. I'm not sure why: it's been around for a few years, it works well, and it's pretty useful...

If I do a Google search for ID3D11FunctionLinkingGraph, I only get around 3 pages of results right now. And there is only one clear sample (and it's a bit hard to find). It's strange to find features that are this useful, and yet still so uncommon!

Anyway, it works fine in XLE now. So maybe XLE can serve as an example implementation for anyone that needs it.

Complications

There are some complications involving this method. In particular, functions can refer to global bindings (like constant buffers and resources). These must be compatible for all functions used together (ie, this often means preferring explicitly bound constant buffers and resources).

Also, there are some complex shader features that aren't supported (like interpolation modes for input parameters). And matrices can't be used for input or output parameters. But maybe that kind of missing functionality isn't a high priority.

And the scripting language is actually extremely primitive. But it seems featureful enough to support everything we want to do with it.

Also, we can't pass structures into and out of exported functions. That's very annoying for me, because I've been using structures for lots of things. In particular, they are great for hiding optional parameters (like sampleIndex in the above example). Without structures, the solution for sampleIndex is a good deal more ugly.
