Vulkan prototype - metal layer refactoring
May 09, 2016

XLE has a thin layer over the underlying graphics API called "Metal". This was originally built for DirectX11 and OpenGLES. But over time it became more DirectX-focused. Part of the goal of building in Vulkan support was to provide a basis for refactoring and improving the metal layer.

The goals for this layer are simple:

  - Compile time polymorphism between underlying graphics APIs -- not link time or run time. We know the target during compilation of client code.
  - "Leaky" abstraction layer -- meaning that most client code is independent of the underlying graphics API, but the underlying objects are still accessible, so client code can write API-specific code when needed.
  - Very thin, with minimal overhead -- for example, many DeviceContext methods get inlined into client code, meaning that performance is similar to using the underlying API directly.
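In practice, the compile-time selection can be as simple as picking the right implementation headers when client code is built. A rough sketch of the idea -- the macro names, include paths and namespaces below are purely illustrative, not the real XLE layout:

```cpp
// Metal.h -- hypothetical dispatch header. The macro names, include paths and
// namespaces here are illustrative only, not the actual XLE layout.
#if defined(GFXAPI_VULKAN)
    #include "Vulkan/DeviceContext.h"
    namespace RenderCore { namespace Metal = Metal_Vulkan; }
#elif defined(GFXAPI_DX11)
    #include "DX11/DeviceContext.h"
    namespace RenderCore { namespace Metal = Metal_DX11; }
#endif

// Client code writes against RenderCore::Metal::DeviceContext. Because the
// concrete type is known at compile time, small methods can be inlined and
// there is no virtual dispatch overhead.
```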

To make this kind of layer work, we need to find abstractions that work well for all target APIs. Usually this means finding abstractions that are great for one API, and pretty good for other APIs. That can be tricky, particularly as APIs are changing and evolving over time.

Descriptor Set concept

Ideally we want the concept of "descriptor sets" to exist in the metal API somewhere. There are two reasons for this:

  1. Clean up the DeviceContext interface so there are fewer BindXX(...) methods
  2. Pre-cook permanent descriptor sets using BoundUniforms (or otherwise), so that they can be reused frame to frame (see the sketch below)
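I imagine the interface looking something like this -- the type and method names here are just illustrative, not the actual API:

```cpp
// Hypothetical sketch of a pre-cooked descriptor set in the Metal layer.
// None of these names are the real interface; they just show the shape.
class ShaderResourceView;
class ConstantBuffer;

class DescriptorSet
{
public:
    void Bind(unsigned slot, const ShaderResourceView& srv);
    void Bind(unsigned slot, const ConstantBuffer& cb);
    void Finalize();    // build the underlying VkDescriptorSet once, up front
};

class DeviceContext
{
public:
    // A single entry point replaces many separate BindXX(...) methods, and a
    // finalized DescriptorSet can be reused frame to frame without rebuilding.
    void Bind(unsigned setIndex, const DescriptorSet& set);
};
```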

Vulkan latest progress -- core lighting working
May 04, 2016

The Vulkan build is steadily getting more and more functionality. Now the core rendering pipeline in SceneEngine is working -- which means we can have deferred lighting, shadows, IBL, tonemapping, etc. Simple scenes should render correctly now. But there are some inefficiencies and issues (see below).

Unfortunately the DirectX11 version isn't working at the moment. This is all in the "experimental" branch.

Declarative render passes

Tone-mapping now works, and it was a good prototype for the "declarative" render pass model. This allows us to specify the render passes and render targets required using a "description" structure. The system will do some caching and correlate requests to resources, creating and binding as necessary.

There is some overhead with this design because it involves some per-frame hashing and lookups as we go along. It's not as efficient as (for example) just pre-creating all of the frame buffer / render pass objects in a configuration step. However, this design is maybe a little more flexible and easier to tie into existing scene engine code. In effect, we're building a new layer that is just one step more abstract than the underlying Vulkan objects.
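To make that concrete, the "description" structures could look roughly like this -- these are simplified guesses at the shape, not the exact XLE types:

```cpp
// Simplified sketch of a declarative render pass request. The real
// FrameBufferDesc / AttachmentDesc types differ; this just illustrates the idea.
#include <cstdint>
#include <vector>

enum class Format    { R8G8B8A8_UNORM, R16G16B16A16_FLOAT, D24_UNORM_S8_UINT };
enum class LoadStore { DontCare, Retain, Clear };

struct AttachmentDesc
{
    uint64_t    _semantic;          // binding id used to refer to it later
    unsigned    _width, _height;
    Format      _format;
    LoadStore   _loadOp, _storeOp;
};

struct SubpassDesc
{
    std::vector<unsigned> _outputs;     // indices into the attachment list
    std::vector<unsigned> _inputs;
    unsigned              _depthStencil;
};

struct FrameBufferDesc
{
    std::vector<AttachmentDesc> _attachments;
    std::vector<SubpassDesc>    _subpasses;

    uint64_t GetHash() const;   // key for the per-frame cache lookup
};

// Each frame, the description is hashed and used to look up (or lazily create)
// the matching VkRenderPass / VkFramebuffer pair, which is then bound.
```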

I've pushed some of the "busy-work" (like declaring subpass dependencies) down into the RenderCore::Metal layer. This makes the interface easier to use... But the downside is that my abstraction is not expressive enough for some unusual cases. For example, I came across a case where we want to bind the "depth & stencil" aspects of a depth texture in one subpass, and in the second subpass only the "stencil" aspect is bound. This apparently needs a dependency... But it's just really inconvenient to express with this interface.

I've also built a concept called "named resources" into the Metal::DeviceContext. This allows us to get TextureViews for attachments from the device context. It feels out of place because it's an operation that doesn't involve the hardware, but there doesn't seem to be any better way to handle this case.

Fundamentally we want to define attachments in FrameBufferDesc objects, so that we can later refer to them again by binding id. It would be better if some of this functionality was in the RenderCore::Techniques library... But it would be just too much hassle to split it between Techniques and Metal.
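As a rough illustration, the "named resources" idea boils down to something like this (again, hypothetical names):

```cpp
// Hypothetical sketch of "named resources" on the device context; not the
// real interface. The forward declaration stands in for the actual Metal type.
#include <cstdint>
class TextureView;

class DeviceContext
{
public:
    // Attachments declared in a FrameBufferDesc are registered against a
    // binding id, so later code can look them up without holding pointers.
    void               BindNamedResource(uint64_t bindingId, const TextureView& view);
    const TextureView& GetNamedResource(uint64_t bindingId) const;
};

// e.g. a tonemap step could fetch the HDR colour target by id:
//     auto& hdrInput = context.GetNamedResource(Hash64("HDRColor"));
// (Hash64 is just a placeholder for whatever hashing scheme generates the ids.)
```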

Anyway, it's working now in Vulkan. However, I still haven't got to the caching and reuse part. And it still needs to be implemented for DirectX11, too!

Compute shader work!

I added in support for the compute pipeline. It was actually pretty easy. I decided to switch some of the tonemapping code from pixel shaders to compute shaders -- because this seems to be more natural in Vulkan. Working with viewports and render targets is much more complex in Vulkan than in DirectX11.
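For reference, the raw Vulkan side of a compute dispatch is pleasantly small. A minimal sketch, assuming the pipeline, pipeline layout and descriptor set were created elsewhere, and that the shader uses a 16x16 thread group:

```cpp
// Minimal Vulkan compute dispatch. The pipeline, layout and descriptor set
// are assumed to have been created elsewhere; error handling is omitted.
#include <vulkan/vulkan.h>

void DispatchTonemap(
    VkCommandBuffer cmd, VkPipeline computePipeline,
    VkPipelineLayout layout, VkDescriptorSet descriptorSet,
    uint32_t width, uint32_t height)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
    vkCmdBindDescriptorSets(
        cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
        0, 1, &descriptorSet, 0, nullptr);

    // One thread group per 16x16 tile of the output image (assuming the
    // shader declares local_size_x = 16, local_size_y = 16).
    vkCmdDispatch(cmd, (width + 15) / 16, (height + 15) / 16, 1);
}
```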

Vulkan prototype - how many separate render passes in the lighting parser?
April 22, 2016

The lighting parser is a set of steps (including geometry rendering and full screen passes) that occur in a similar order and configuration every frame. In some ways it is like a higher level version of a "renderpass" (or frame buffer layout). Each "subpass" in the renderpass is like a step in the lighting parser process for evaluating the frame.

But how do we map the lighting parser steps onto render pass subpasses? Do we want to use one single huge render pass for everything? Or a number of small passes?

Declarative lighting parser

Since we're using a "declarative" approach for "frame buffer layouts", we could do the same for the lighting parser. Imagine a LightingParserConfiguration object that takes a few configuration variables (related to the current platform and quality mode, and also scene features) and produces the set of steps that the lighting parser will perform.

We could use the VariantFunctions interface for this, along with some structures to define the "attachment" inputs and outputs for each step. If we have a step (such as tonemapping) we just define the input attachments, output attachments, and temporary buffers required, and add the tone map function.

Then we would just need a small system that could collect all this information, and generate a RenderCore::FrameBufferDesc object from it.
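A rough sketch of what declaring a step could look like -- the types and names here are illustrative, not an actual XLE interface:

```cpp
// Hypothetical sketch of declaring lighting parser steps up front.
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

class DeviceContext;        // stand-ins for the real Metal / RenderCore types
struct FrameBufferDesc;

struct StepDesc
{
    std::vector<uint64_t> _inputAttachments;    // e.g. Hash64("HDRColor")
    std::vector<uint64_t> _outputAttachments;   // e.g. Hash64("LDRColor")
    std::vector<uint64_t> _temporaries;         // e.g. bloom working buffers
    std::function<void(DeviceContext&)> _fn;    // the work itself
};

class LightingParserConfiguration
{
public:
    void AddStep(StepDesc&& step) { _steps.emplace_back(std::move(step)); }

    // Walk the declared steps, correlate attachment ids, and generate a
    // RenderCore::FrameBufferDesc (roughly one subpass per step) that Run()
    // can then execute.
    FrameBufferDesc BuildFrameBufferDesc() const;

private:
    std::vector<StepDesc> _steps;
};
```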

This would allow us to have a very configurable pipeline, because we could just build up the list of steps as we need them, and from there generate everything we need. In effect, we would declare everything that is going to happen in the lighting parser beforehand; and then the system would automatically calculate everything that needs to happen when we call Run().

This could also allow us to have a single render pass for the entire lighting parser process. But is that what we really want?

How many render passes?

There are a few things to consider:

  1. Compute shader usage
  2. Frame-to-frame variations in the render pass
  3. Parallelization within the lighting parser

Vulkan prototype - the big issues
April 15, 2016

I'm making more progress on the Vulkan prototype. Now the SceneEngine compiles, BufferUploads works, and it's now possible to render basic geometry. Many features aren't supported. But the core rendering features (such as binding textures, compiling shaders, pushing geometry through the pipeline) have an early implementation.

So, while a working Vulkan version is now close, I have a much better idea of what major hurdles there would be to get a really streamlined and efficient Vulkan implementation!

RenderPasses

This is going to be the big issue. RenderPasses are a new concept in Vulkan (though with heritage from Apple Metal and OpenGL), and not very compatible with the DirectX way of handling render targets. This is going to require some deep thought!

I haven't researched this thoroughly -- but this is my basic understanding of what might be happening...

RenderPasses seem to be fundamentally designed around PowerVR-style hardware (such as that found in iPhones and some Android devices). Now, if I understand the situation correctly, this type of hardware doesn't need a traditional depth buffer. Instead it collates triangles as they are submitted, then splits them into buckets for tiles on the screen, and finally decomposes them into scanlines. The scanlines can be sorted using an old algorithm, so that for every pixel we know the front-most scanline.

The key point here is that we collate triangle information in the frame buffer, and do not produce a final pixel value until all triangles are finished. This requires the hardware to maintain buffers of triangle scanlines attached to the frame buffer.

The key problem here is this: what happens if we have render targets A and B, both of which are used in the same frame. We render to A, then to B, and then switch back to A and start rendering again?

In a traditional DirectX OMSetRenderTargets environment, this is not a big issue. We might do this (for example) to render a reflection into target B, which appears on geometry in target A. Maybe we cause a bit of a GPU pipeline stall; but it's probably not going to be a major issue.

With the PowerVR-style hardware, however, it's a bigger deal. We want to continue collating triangle information in A throughout the entire frame. We don't want to split that work into two separate passes. It would be better if A and B were actually two separate viewports onto the same frame buffer.

In other words, when we switch away from A, we kind of want to know if we will be returning to it. That's what RenderPasses are useful for. We can express to the API that A is not finished yet, and that B just contains temporary data that will eventually be used on A.
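For what it's worth, this is roughly how that relationship gets expressed in the raw API: two subpasses inside one render pass, with a dependency between them. Note that within a render pass the second subpass can only read B at the same pixel location (as an input attachment), so this shape really fits pixel-local cases like a deferred resolve; arbitrary sampling (as in the reflection example) still needs separate render passes. Attachment setup is abbreviated in this sketch:

```cpp
// Sketch: a two-subpass render pass where subpass 0 writes attachment B
// (temporary data) and subpass 1 reads B as an input attachment while
// rendering into attachment A (the "still open" target).
#include <vulkan/vulkan.h>

VkRenderPass CreateTwoSubpassRenderPass(VkDevice device, VkFormat colorFormat)
{
    VkAttachmentDescription attachments[2] = {};
    for (auto& a : attachments) {
        a.format = colorFormat;
        a.samples = VK_SAMPLE_COUNT_1_BIT;
        a.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
        a.storeOp = VK_ATTACHMENT_STORE_OP_STORE;
        a.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
        a.finalLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    }
    attachments[1].storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;      // B is temporary
    attachments[1].finalLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

    // attachment 0 = A (final target), attachment 1 = B (temporary)
    VkAttachmentReference writeB = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    VkAttachmentReference writeA = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    VkAttachmentReference readB  = { 1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };

    VkSubpassDescription subpasses[2] = {};
    subpasses[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[0].colorAttachmentCount = 1;
    subpasses[0].pColorAttachments = &writeB;

    subpasses[1].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[1].colorAttachmentCount = 1;
    subpasses[1].pColorAttachments = &writeA;
    subpasses[1].inputAttachmentCount = 1;
    subpasses[1].pInputAttachments = &readB;

    // The dependency tells the driver that B must be finished before subpass 1
    // consumes it, while A itself remains "open" for the whole render pass.
    VkSubpassDependency dependency = {};
    dependency.srcSubpass = 0;
    dependency.dstSubpass = 1;
    dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dependency.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dependency.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dependency.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
    dependency.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;

    VkRenderPassCreateInfo info = { VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO };
    info.attachmentCount = 2;  info.pAttachments = attachments;
    info.subpassCount = 2;     info.pSubpasses = subpasses;
    info.dependencyCount = 1;  info.pDependencies = &dependency;

    VkRenderPass renderPass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &info, nullptr, &renderPass);
    return renderPass;
}
```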

Important Vulkan tips
April 12, 2016

After 1 week of playing around with Vulkan, here are 4 tips I've learned. These are maybe not very obvious (some aren't clearly documented), but I think they're important to know.

1. Use buffers as staging areas to initialize images

The crux of this is that we need to use a function, vkCmdCopyBufferToImage, to implement staging images in Vulkan. It's almost impossible to do any graphics work with Vulkan without doing this -- but at first glance it might seem a bit counter-intuitive. First, a bit of background.

Vulkan images have a "tiling" property, which can be either "linear" or "optimal." This property relates to how texels are arranged in memory within a single mip level.

In linear tiling, texels are arranged row by row, column by column. So a texel's linear address follows this pattern:

address = (y*rowPitch) + x*bitsPerPixel/8

When we have images on disk, we usually expect them to have this tiling.
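In Vulkan terms, the rowPitch above comes from vkGetImageSubresourceLayout, so addressing a linear-tiled image that has been mapped looks roughly like this (assuming a 4-byte-per-texel format):

```cpp
// Compute a pointer to texel (x, y) of mip 0 in a VK_IMAGE_TILING_LINEAR
// image that has already been mapped. Assumes a 4-byte-per-texel format.
#include <cstdint>
#include <vulkan/vulkan.h>

uint8_t* LinearTexelAddress(
    VkDevice device, VkImage linearImage, uint8_t* mappedBase,
    uint32_t x, uint32_t y)
{
    VkImageSubresource subresource = {};
    subresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    subresource.mipLevel = 0;
    subresource.arrayLayer = 0;

    VkSubresourceLayout layout = {};
    vkGetImageSubresourceLayout(device, linearImage, &subresource, &layout);

    // address = offset + y*rowPitch + x*bytesPerPixel
    return mappedBase + layout.offset + y * layout.rowPitch + x * 4;
}
```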

However, linear tiling has a big problem. Texels in the same row will be near each other in memory, but texels in the same column will be quite far apart.

Given that images can be mapped arbitrarily onto the final 2D triangles, we can't control the order in which the GPU will access the texels. So this is a big issue for the memory cache.

So all GPUs are able to rearrange the texels into a "swizzled" order that tries to guarantee that nearby texels in 2D (or 3D) image space will be nearby in linear memory. This is called "optimal" tiling in Vulkan.
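Which brings us back to the tip itself: fill a buffer with the linear data, then let the driver do the rearrangement as part of a vkCmdCopyBufferToImage. A minimal sketch -- the layout transitions before and after the copy are assumed to happen elsewhere:

```cpp
// Copy linear texel data from a staging buffer into an "optimal"-tiled image.
// Assumes the image has already been transitioned to
// VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL and the buffer already contains the
// pixel data; the transition back to a shader-readable layout is also omitted.
#include <vulkan/vulkan.h>

void StageImageData(
    VkCommandBuffer cmd, VkBuffer stagingBuffer, VkImage destImage,
    uint32_t width, uint32_t height)
{
    VkBufferImageCopy region = {};
    region.bufferOffset = 0;
    region.bufferRowLength = 0;     // 0 means tightly packed rows
    region.bufferImageHeight = 0;
    region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    region.imageSubresource.mipLevel = 0;
    region.imageSubresource.baseArrayLayer = 0;
    region.imageSubresource.layerCount = 1;
    region.imageOffset = { 0, 0, 0 };
    region.imageExtent = { width, height, 1 };

    // The driver performs the linear -> swizzled ("optimal") rearrangement
    // as part of this copy.
    vkCmdCopyBufferToImage(
        cmd, stagingBuffer, destImage,
        VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
}
```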