April 15, 2016

I'm making more progress on the Vulkan prototype. The SceneEngine now compiles, BufferUploads works, and it's possible to render basic geometry. Many features aren't supported yet. But the core rendering features (such as binding textures, compiling shaders, pushing geometry through the pipeline) have an early implementation.

So, while a working Vulkan version is now close, I also have a much better idea of the major hurdles involved in getting a really streamlined and efficient Vulkan implementation!

RenderPasses

This is going to be the big issue. RenderPasses are a new concept in Vulkan (though with heritage from Apple Metal and OpenGL), and not very compatible with the DirectX way of handling render targets. This is going to require some deep thought!

I haven't researched this thoroughly -- but this is my basic understanding of what might be happening...

RenderPasses seem to be fundamentally designed around PowerVR-style hardware (such as that found in iPhones and some Android hardware). Now, if I understand the situation correctly, this type of hardware doesn't need a traditional depth buffer. Instead it collates triangles as they are submitted, then splits them into buckets for tiles on the screen, and finally decomposes them into scanlines. The scanlines can be sorted using an old scanline-sorting algorithm, so that for every pixel we know the front-most scanline.

The key issue here is that we collate triangle information in the frame buffer, and do not produce a final pixel value until all triangles are finished. This requires the hardware to maintain buffers of triangle scanlines attached to the frame buffer.

The problem is this: what happens if we have render targets A and B, both of which are used in the same frame? We render to A, then to B, and then switch back to A and start rendering again.

In a traditional DirectX OMSetRenderTargets environment, this is not a big issue. We might do this (for example) to render a reflection into target B, which appears on geometry in target A. Maybe we cause a bit of a GPU pipeline stall; but it's probably not going to be a major issue.

With the PowerVR-style hardware, however, it's a bigger deal. We want to continue collating triangle information in A throughout the entire frame. We don't want to split that work into two separate passes. It would be better if A and B were actually two separate viewports onto the same frame buffer.

In other words, when we switch away from A, we kind of want to know if we will be returning to it. That's what RenderPasses are useful for. We can express to the API that A is not finished yet, and that B just contains temporary data that will eventually be used on A.
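
To make that a little more concrete, here is a minimal sketch of how a single Vulkan render pass with two subpasses could express exactly that relationship. This is just an illustration (it assumes a valid VkDevice called "device", and the formats and load/store ops are placeholders):

    // B (attachment 1) is written by subpass 0, then consumed by subpass 1
    // as an input attachment while rendering A (attachment 0)
    VkAttachmentDescription attachments[2] = {};
    attachments[0].format        = VK_FORMAT_R8G8B8A8_UNORM;
    attachments[0].samples       = VK_SAMPLE_COUNT_1_BIT;
    attachments[0].loadOp        = VK_ATTACHMENT_LOAD_OP_CLEAR;
    attachments[0].storeOp       = VK_ATTACHMENT_STORE_OP_STORE;       // A survives the pass
    attachments[0].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    attachments[0].finalLayout   = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
    attachments[1]               = attachments[0];
    attachments[1].storeOp       = VK_ATTACHMENT_STORE_OP_DONT_CARE;   // B is just temporary data
    attachments[1].finalLayout   = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

    VkAttachmentReference writeB = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    VkAttachmentReference writeA = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    VkAttachmentReference readB  = { 1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };

    VkSubpassDescription subpasses[2] = {};
    subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[0].colorAttachmentCount = 1;
    subpasses[0].pColorAttachments    = &writeB;    // subpass 0: render the reflection into B
    subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[1].colorAttachmentCount = 1;
    subpasses[1].pColorAttachments    = &writeA;    // subpass 1: render into A...
    subpasses[1].inputAttachmentCount = 1;
    subpasses[1].pInputAttachments    = &readB;     // ...reading B as an input attachment

    VkSubpassDependency dep = {};
    dep.srcSubpass    = 0;
    dep.dstSubpass    = 1;
    dep.srcStageMask  = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstStageMask  = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;

    VkRenderPassCreateInfo rpInfo = { VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO };
    rpInfo.attachmentCount = 2;  rpInfo.pAttachments  = attachments;
    rpInfo.subpassCount    = 2;  rpInfo.pSubpasses    = subpasses;
    rpInfo.dependencyCount = 1;  rpInfo.pDependencies = &dep;

    VkRenderPass renderPass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &rpInfo, nullptr, &renderPass);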

AMD claim on their website that this expressiveness is also useful for traditional forward rendering hardware. But it probably isn't quite so significant.

Nevertheless, Vulkan appears to be built entirely around RenderPasses, and the interface is very different from OMSetRenderTargets. So we need to figure out what to do.

RenderPass integration

There are a few approaches:

  1. Make RenderPasses a first-class feature of RenderCore::Metal, and deprecate binding render targets
    • this would require all client code to change to using RenderPasses. For DirectX, the RenderPass object would just generate OMSetRenderTargets calls
  2. Dynamically build simple RenderPasses when binding render targets
    • the client code wouldn't need to change. However, the render passes generated would not be ideal. It would also require calling vkCreateRenderPass and vkCreateFramebuffer during the frame (which is probably best avoided)
  3. Support both RenderPasses and binding render targets
    • clients can choose to use either

I'm inclined to choose method one, even though it will require the most work.

When there are these kinds of incompatibilities between APIs, it's normally best to follow the design of one of the APIs fairly closely. So, we can write Vulkan code in a DirectX-like way, or we can write DirectX code in a Vulkan-like way. In this case, the Vulkan method is more expressive and is a clear superset of the DirectX behaviour. Ignoring RenderPasses would make targeting PowerVR-style hardware (at a later date) much more difficult.

But it seems like it will be a little difficult to make the transition to RenderPasses. It's probably going to require significant changes in the SceneEngine. But it's also a good opportunity to think about other improvements to render target management within the SceneEngine.
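
Just to sketch the direction -- and to be clear, none of these names exist yet, this is purely hypothetical -- a first-class interface in RenderCore::Metal might end up looking something like this:

    // Hypothetical shape for a first-class RenderPass in RenderCore::Metal.
    // On Vulkan this would wrap vkCmdBeginRenderPass/vkCmdNextSubpass;
    // on DirectX it would just replay the equivalent OMSetRenderTargets calls.
    namespace RenderCore { namespace Metal
    {
        class DeviceContext;
        class FrameBuffer;      // wraps VkFramebuffer (or a set of D3D render target views)

        class RenderPass
        {
        public:
            void NextSubpass(DeviceContext& context);
            RenderPass(DeviceContext& context, const FrameBuffer& frameBuffer);
            ~RenderPass();      // ends the pass (vkCmdEndRenderPass on Vulkan)
        };
    }}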

Descriptors

Vulkan has a slightly different way to look at binding resources and constant buffers to shaders. We have a lot of new concepts:

  • Descriptor sets
  • Descriptor set layouts

These relate most closely to the Metal::BoundUniforms object in XLE.

The descriptor set layout contains a list of all of the bindings of a shader (or set of shaders) and the descriptor sets contain the list of objects to actually bind.
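
A minimal sketch of the two concepts side by side (assuming a valid device and an existing descriptorPool):

    // the layout describes *what* can be bound at each binding number...
    VkDescriptorSetLayoutBinding bindings[2] = {};
    bindings[0].binding         = 0;
    bindings[0].descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
    bindings[0].descriptorCount = 1;
    bindings[0].stageFlags      = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;
    bindings[1].binding         = 1;
    bindings[1].descriptorType  = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
    bindings[1].descriptorCount = 1;
    bindings[1].stageFlags      = VK_SHADER_STAGE_FRAGMENT_BIT;

    VkDescriptorSetLayoutCreateInfo layoutInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
    layoutInfo.bindingCount = 2;
    layoutInfo.pBindings    = bindings;
    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);

    // ...while the descriptor set holds the actual objects to bind
    VkDescriptorSetAllocateInfo allocInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO };
    allocInfo.descriptorPool     = descriptorPool;
    allocInfo.descriptorSetCount = 1;
    allocInfo.pSetLayouts        = &layout;
    VkDescriptorSet set = VK_NULL_HANDLE;
    vkAllocateDescriptorSets(device, &allocInfo, &set);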

Currently, the BoundUniforms objects own both descriptor sets and descriptor set layouts. But I'm not sure this is a great idea, mostly because we end up with just a 1:1 relationship (which seems redundant).

Both of these relate to the "binding" number set in the shader. My understanding at the moment is that the binding number should be a low number, and numbers should be used "mostly" sequentially -- like the register binding numbers in DirectX.

Since XLE binds things via string names, we still need some way to associate the string name with a particular binding number. One idea is to use the hash of some string value for the binding numbers... I haven't seen any documentation to say this is a bad idea -- but, nevertheless, I still suspect that it is. Near-sequential is going to be safer.

We can choose between 2 different designs for descriptor set layouts:

  • keep different descriptor set layouts bound tightly to specific shaders
    • ie, each shader could have its own layout, containing only the bindings relevant to that particular shader
  • share descriptor set layouts broadly
    • so layouts may contain a superset of bindings for any given shader

If we're sharing descriptor set layouts, we could choose to do that on a per-technique level. That would kind of make sense. Or we could do it on a global level. That is, we could just have a single huge layout for each descriptor set binding index.

The descriptor set binding indices are similar to the "bound uniforms stream index" in XLE. In particular, it makes sense to have a single shared layout for the global binding stream index. This would also move us closer to a "bindless" approach to these things.

However, that introduces the problem of mapping between our current string binding names and the binding indices used by the shader. Since no single shader contains all bindings, the normal method of using reflection doesn't work.

In addition to the layout management issues, there are also issues managing the descriptor sets. Some descriptor sets have temporary data (ie, we expect it might change in a future frame). But some descriptor sets contain only static data. For example, the texture bindings of a model are set at load time and remain constant.

For static descriptor set data, we would ideally write this once, and just reuse it every frame. That seems possible by adapting the SharedStateSet interfaces. But it would require a few changes.

For temporary descriptor set data, I suspect that the simple approach of just writing every frame and then throwing it away might be fine, and it would require fewer changes.
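
Something like the following might work for the temporary case -- one descriptor pool per frame in flight, reset once the GPU has finished with the frame that used it. The names here are illustrative, not real XLE types:

    #include <vulkan/vulkan.h>

    class TemporaryDescriptorPools
    {
    public:
        VkDescriptorSet Allocate(VkDescriptorSetLayout layout)
        {
            VkDescriptorSetAllocateInfo info = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO };
            info.descriptorPool     = _pools[_frameIndex];
            info.descriptorSetCount = 1;
            info.pSetLayouts        = &layout;
            VkDescriptorSet result = VK_NULL_HANDLE;
            vkAllocateDescriptorSets(_device, &info, &result);
            return result;
        }

        // call only after the GPU has finished the frame that last used this pool
        void BeginFrame()
        {
            _frameIndex = (_frameIndex + 1) % kFramesInFlight;
            vkResetDescriptorPool(_device, _pools[_frameIndex], 0);
        }

    private:
        static const unsigned kFramesInFlight = 2;
        VkDevice _device = VK_NULL_HANDLE;
        VkDescriptorPool _pools[kFramesInFlight] = {};
        unsigned _frameIndex = 0;
    };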

Also, we want to be able to attach the "set" layout qualifier to the GLSL code. However, there is no equivalent to this in HLSL, so no way to get it through the HLSL -> GLSL approach. Maybe we can do some hacks via register bindings, but it might be awkward.

We might possibly need to build a global table of bindings -- this table could assign the string name to binding ids, and also assign a "set" qualifier.
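
As a rough sketch of what that table might look like (entirely hypothetical names):

    #include <map>
    #include <string>

    struct BindingSlot { unsigned _set; unsigned _binding; };

    // maps the string names XLE already uses onto fixed (set, binding) pairs,
    // handing out near-sequential binding numbers within each set
    class GlobalBindingTable
    {
    public:
        BindingSlot Register(const std::string& name, unsigned set)
        {
            auto i = _slots.find(name);
            if (i != _slots.end()) return i->second;
            BindingSlot slot = { set, _nextBinding[set]++ };
            _slots.insert(std::make_pair(name, slot));
            return slot;
        }
    private:
        std::map<std::string, BindingSlot> _slots;
        std::map<unsigned, unsigned> _nextBinding;
    };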

At the moment it's not perfectly clear what's the best approach to dealing with descriptor sets. It might take some experimentation.

Pipelines

Vulkan has some support for precalculating and reusing what it calls "pipelines." These are a combination of all shader stages, the descriptor layouts and the input layouts. But they also have some render state information mixed in.

It's another big departure from DirectX, because it pre-binds these objects together. Again, it's a superset of DirectX behaviour.

It feels like we may have to dynamically create some pipelines during the frame with XLE. That said, for models (which should be the majority of geometry) we should be able to leverage the SharedStateSet code, so that these can be reused over multiple frames.

This would mean that some draw calls would use precreated pipelines, while others would use dynamically created pipelines. I think this might be the best balance between practicality and efficiency...
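
A rough sketch of that mixed approach -- a cache keyed on a hash of everything that feeds into the pipeline, creating on demand the first time a new combination appears (hypothetical names; the creation callback would wrap vkCreateGraphicsPipelines):

    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <functional>
    #include <unordered_map>

    class PipelineCache
    {
    public:
        // stateHash should combine the shader program, render states,
        // input layout and render pass compatibility
        VkPipeline GetPipeline(uint64_t stateHash, const std::function<VkPipeline()>& create)
        {
            auto i = _pipelines.find(stateHash);
            if (i != _pipelines.end()) return i->second;    // reuse (eg, via SharedStateSet)
            VkPipeline newPipeline = create();              // dynamic creation during the frame
            _pipelines.insert(std::make_pair(stateHash, newPipeline));
            return newPipeline;
        }
    private:
        std::unordered_map<uint64_t, VkPipeline> _pipelines;
    };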

Shader constants and push constants

In Vulkan, we have a few new approaches for uploading shader constants. We still have uniform buffers (but with a lower level interface for them). We also have "push constants", which presumably use the command buffer to pass frame-by-frame data to the constant registers.

Push constants will be great for data that changes every frame. Uniform buffers are best for data that is retained for multiple frames, or reused in many draw calls.

Fortunately, we also have the ConstantBufferPacket concept in Metal::BoundUniforms. This could be extended to work with push constants. But push constants seem to require shader changes as well. That is, maybe the shader must know whether it's going to receive those constants via push constants or a uniform buffer.

It might be a good time to rethink how to handle the LocalTransform constants for model rendering. Often, the local transform will be constant over the model's lifetime. In these cases, we might want to store the transform in a retained uniform buffer.

But another alternative is to just use push constants. This will be required in the case of animated transforms. And maybe it could be reasonably efficient? Anyway, it might be interesting to think about.
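
For reference, the push constant path might look something like this (assuming a 4x4 float LocalTransform, and a hypothetical pipeline layout declared with a matching push constant range):

    // declared once, as part of VkPipelineLayoutCreateInfo::pPushConstantRanges
    VkPushConstantRange range = {};
    range.stageFlags = VK_SHADER_STAGE_VERTEX_BIT;
    range.offset     = 0;
    range.size       = sizeof(float) * 16;

    // then, per draw call (the shader side would declare a matching
    // "layout(push_constant)" uniform block in GLSL):
    float localToWorld[16] = {};    // filled in from the model's transform
    vkCmdPushConstants(
        commandBuffer, pipelineLayout,
        VK_SHADER_STAGE_VERTEX_BIT, 0, sizeof(localToWorld), localToWorld);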

Image layouts

Vulkan has a new "image layout" concept. Each image object has a specific layout and we set these asynchronously using a command on the command buffer.

Normally, we only care about layout while we're creating the image or initializing it with data. But it's sort of architecturally awkward, because we create the object synchronously using a VkDevice call, but we don't set the layout until we first use it in a command buffer. That's a problem, because we usually don't know whether a given use is the first one.

All our images should be initialized via buffer uploads. This might make it easier to solve efficiently, because we can add a "layout initialization" command list that always gets executed as part of the buffer uploads update. In this way, buffer uploads will manage layouts, and other code can just ignore them. But it is going to create some difficulties for Transaction_Immediate resources.
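
The "layout initialization" step might record something like the following barrier (assuming an image freshly created in VK_IMAGE_LAYOUT_UNDEFINED, about to receive its data via a transfer; a second barrier, not shown, would move it to a shader-readable layout afterwards):

    VkImageMemoryBarrier barrier = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
    barrier.srcAccessMask       = 0;
    barrier.dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_UNDEFINED;            // contents don't matter yet
    barrier.newLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    barrier.subresourceRange.levelCount = 1;
    barrier.subresourceRange.layerCount = 1;

    vkCmdPipelineBarrier(
        uploadCommandBuffer,
        VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
        0, 0, nullptr, 0, nullptr, 1, &barrier);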

Resource deletion

XLE is often lazy about retaining references to objects. In general, we're taking advantage of the way DirectX11 keeps objects alive while they're required by the device.

Often we allocate some object, use it for one frame, and then destroy it (expecting the memory to be freed up when the GPU is finished with that frame). While this might not be perfectly efficient, it is very convenient.

Vulkan doesn't automatically track the GPU's progress when we free an object. So we need to know when an object is currently referenced by a command buffer, and when it might be used in the future (or even whether it's currently being used by the GPU). This is complicated by cases with secondary command buffers (such as those used by the buffer uploads).

So we need some new concepts for tracking GPU progress, and destroying objects when they are no longer needed.

Likewise, some objects don't need to be destroyed, but they might need to be overwritten. This happens in systems that are built for streaming. These cases are very similar -- "safe to delete" is often the same as "safe to overwrite."

The majority of objects can follow a simple path that just tracks a GPU frame index. After the GPU finishes a frame, then resources associated with that frame can be destroyed.
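
A sketch of that simple path (illustrative names only): destruction requests are queued with the GPU frame index that produced them, and flushed once the GPU reports -- via a fence, for instance -- that it has moved past that frame:

    #include <cstdint>
    #include <deque>
    #include <functional>

    class DestructionQueue
    {
    public:
        void Destroy(uint64_t producerFrame, std::function<void()>&& destroyFn)
        {
            _pending.push_back(Entry { producerFrame, std::move(destroyFn) });
        }

        void Update(uint64_t lastFrameCompletedByGPU)
        {
            while (!_pending.empty() && _pending.front()._frame <= lastFrameCompletedByGPU) {
                _pending.front()._destroyFn();      // the actual vkDestroy* / vkFree* call
                _pending.pop_front();
            }
        }

    private:
        struct Entry { uint64_t _frame; std::function<void()> _destroyFn; };
        std::deque<Entry> _pending;
    };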

In some cases, the destroy might actually be a "reset." For example, if we have a number of pools of temporary descriptor sets, we must track the GPU progress to know when it's safe to reset and reuse a pool.

Some cases might be more complicated -- such as if we have resources tied to secondary command buffers that are retained over several frames. In this case, the resources must be maintained until after the command buffer is no longer used and has been destroyed.

Conclusion

Many of the Vulkan concepts are familiar to me from previous projects... So to some extent, many ideas from Vulkan are already built into XLE. But at the same time it's going to require some refactoring and improvements of the current code before we can use Vulkan very efficiently.

While I've been working on Vulkan, I've also been improving the RenderCore::Metal interface. There are aspects of this interface that I've always meant to improve -- but I've never had the chance before. When working just with DirectX, it didn't really need to be very polished; but adding Vulkan to the mix changes that.

Anyway, it feels like Vulkan is going to be around for a while, and the prototype so far has shown where the most important difficulties are going to be!


