Writing a bindless GPU abstraction layer
May 2, 2026
11 minute read
The Loon GPU library logo, a line drawing of the head of a common loon.

Back in December 2025, Sebastian Aaltonen published a blog post titled “No Graphics API”. It presented a great history of the evolution of GPU hardware, and gave an opinionated perspective on how we could simplify graphics APIs on modern hardware. Like many graphics programmers, I read it and really enjoyed it - it inspired me to see how close to the API he describes I could get today, layered on top of existing platform APIs. The answer turns out to be “pretty darn close”. The result is a project I’m calling Loon GPU, and I’ve put it up on GitHub. While the library is still rough, poorly tested, and surely filled with bugs, it is usable, and I wanted to share it early in case other folks are interested. Currently it has Vulkan 1.3 and Metal 4 backends, and here I want to do a high-level walkthrough of how the API design maps onto them.

In brief, the API looks like this:

  • No buffer objects; use GPU pointers everywhere.
  • No vertex buffers. Use vertex pulling instead; it’s much simpler.
  • Textures and samplers are treated bindlessly, with indices into a big texture heap object.
  • No explicit bind groups; shaders are fed their data through device pointers.

Pointers are first-class

In the API, you get GPU memory from malloc, and free it with free. Unlike the backends it maps onto, there is no explicit concept of a “Buffer” object, just memory. By default you get memory that is persistently mapped to the CPU and optimized for CPU writes, but you can request GPU-local memory or mapped memory optimized for CPU readback.

One change from the API described in the original blog post is that this API returns GPU device pointers - that is the only pointer type all three memory types have in common, so calling malloc gives you a device pointer, and you can call get_host_pointer() to turn that into a CPU-side pointer if you need it.
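
To make that concrete, here’s a minimal C++ sketch of the allocation flow. The exact signatures, the MemoryType enum, and the loon namespace here are illustrative guesses on my part, not Loon’s actual API - check the repository for the real thing.

#include <cstddef>
#include <cstdint>

namespace loon {
    using DevicePtr = uint64_t;                     // GPU device address (assumed representation)
    enum class MemoryType { Upload, DeviceLocal, Readback };

    DevicePtr malloc(size_t size, MemoryType type); // returns a device pointer
    void*     get_host_pointer(DevicePtr ptr);      // CPU-visible mapping of the same memory
    void      free(DevicePtr ptr);
}

void upload_example() {
    // Upload memory is persistently mapped, so we can write straight into it.
    loon::DevicePtr gpu = loon::malloc(1024, loon::MemoryType::Upload);
    float* cpu = static_cast<float*>(loon::get_host_pointer(gpu));
    cpu[0] = 1.0f;   // now visible to shaders through `gpu`
    loon::free(gpu);
}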

When you call the draw() or dispatch() functions to do work on the GPU, you can provide GPU pointers as arguments to your shaders. On Metal this is trivial - the API supports binding arbitrary GPU pointers to an argument table. On Vulkan, we pass them to the shader via push constants.

Internally, we do create buffer objects. In Vulkan, every allocation gets a VkBuffer bound to its entire memory range. In Metal, we create a MTLHeap and a buffer that covers the entire range as well. In both underlying APIs we sometimes need to map back from a GPU pointer to a buffer + offset pair, so we store a sorted list of allocated GPU pointers and binary search it whenever necessary. While I was working on this library, a new Vulkan extension, VK_KHR_device_address_commands, was published, which should eliminate the need for most of this mapping when it’s available. In general, the goal is to treat allocations as fairly heavy, and to build user-space allocators on top of them rather than allocating for every object needed. This should keep the lookup array small and lookups fast when they are needed.
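
For illustration, the pointer-to-buffer lookup can be as simple as a binary search over a sorted vector. This is a sketch of the idea rather than Loon’s actual code; the Allocation struct and resolve function are hypothetical.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>
#include <vulkan/vulkan.h>

struct Allocation {
    uint64_t base;     // device address of the start of the allocation
    uint64_t size;
    VkBuffer buffer;   // VkBuffer bound to the whole range
};

// `allocs` is kept sorted by `base`; insertions happen at allocation time.
std::pair<VkBuffer, uint64_t> resolve(const std::vector<Allocation>& allocs,
                                      uint64_t address) {
    // Find the first allocation starting after `address`, then step back one.
    auto it = std::upper_bound(allocs.begin(), allocs.end(), address,
        [](uint64_t addr, const Allocation& a) { return addr < a.base; });
    if (it == allocs.begin())
        return {VK_NULL_HANDLE, 0};            // address below every allocation
    --it;
    if (address >= it->base + it->size)
        return {VK_NULL_HANDLE, 0};            // address past the allocation's end
    return {it->buffer, address - it->base};   // the buffer + offset pair we need
}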

While it doesn’t seem like it would make a huge difference, having the API just give you pointers really does make the CPU-side code feel a lot more natural. You can easily create complex data structures, just as if they were objects in C, and with an ergonomic ring buffer, constructing draw arguments is trivial. It simplifies a lot of the GPU-side code too: combined with bindless textures, we don’t have to worry about pipeline layouts, bind groups, or any of the complexity of the Vulkan binding model.
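
As an example of what such a ring buffer might look like, here’s a hypothetical sketch: one persistently mapped upload allocation, carved up linearly, with the returned device pointers handed straight to draw calls. A real implementation would also wrap around and fence against in-flight frames.

#include <cstdint>
#include <cstring>

struct ArgsRing {
    uint64_t base;   // device pointer for one big upload allocation
    uint8_t* host;   // matching host pointer from get_host_pointer()
    uint64_t head = 0;

    // Copy an argument struct into the ring and return its device pointer.
    template <typename T>
    uint64_t push(const T& args) {
        uint64_t off = (head + alignof(T) - 1) & ~uint64_t(alignof(T) - 1);
        std::memcpy(host + off, &args, sizeof(T));
        head = off + sizeof(T);
        return base + off;   // pass this to draw()/dispatch()
    }
};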

Shaders

This wrapper is opinionated about how your shaders should look, and that is part of what makes it simpler to use than raw Vulkan or Metal. The idea is that all GPU data is bindless, and shaders get their inputs from a single pointer that is passed to them. In order to make this work, we need a shading language.

When I started this experiment I decided to use Slang. It has support for pointers, can compile to SPIRV and Metal, and looked like it would work for my use case. And it mostly does! But unfortunately there are some complications that I think make this the roughest part of trying to use Loon.

To understand the difficulties, let’s start with what I’d like to achieve. In Vulkan, I’d like to have a push constant range containing a single pointer assigned to each shader stage. In Metal, I’d like to bind a single buffer to each stage. In Loon, it currently looks like this:

Stage      Vulkan Push Constant Range (bytes)   Metal Buffer Index
Compute    [0, 8)                               0
Vertex     [0, 8)                               0
Fragment   [8, 16)                              1

In terms of syntax, Slang offers a very nice way to map shader arguments to push constants. You can declare a shader stage (e.g. a vertex stage) with syntax like:

[shader("vertex")] 
VertexStageOutput vertexMain(uniform InputData* args, uint32_t vertexIdx : SV_VertexID) 

Slang will turn the uniform InputData* parameter into a push constant. Unfortunately, you have no control over the layout of these push constants - they all get a range starting at 0. This makes it effectively impossible to pass separate arguments to the vertex and fragment shaders, as Vulkan shares the push constant range between all shaders in a draw call. This is tracked in a GitHub issue, but it’s not a trivial problem for Slang to solve in the general case. The workaround is to instead do something like this at global scope:

struct Args {
	VertexInput* vert;
	FragmentInput* frag;
};

[[vk::push_constant]] uniform Args args;

For compute shaders though, the entry-point parameter syntax works well.
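
For example, a compute entry point can take everything it needs through one uniform pointer. This is a sketch; InputData and the doubling kernel are just placeholders.

struct InputData {
    float* src;
    float* dst;
};

[shader("compute")]
[numthreads(64, 1, 1)]
void computeMain(uniform InputData* args, uint3 tid : SV_DispatchThreadID)
{
    // All inputs arrive through the single pointer in the push constant range.
    args->dst[tid.x] = args->src[tid.x] * 2.0f;
}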

The situation on Metal is unfortunately quite a bit worse. Slang support for Metal is experimental, and I’ve come across a handful of issues that make it difficult-to-impossible to share shaders between the Vulkan and Metal backends. I’m hopeful that this will improve in the future, and I have submitted some PRs for things I think I can fix, but in the meantime I’ve been translating my Slang shaders to Metal manually and selecting between them at runtime. This isn’t meant as a criticism of Slang either - Metal is explicitly not a primary target, and we’re trying to use it in a weird way that wouldn’t work as a general binding solution. The ideal solution may be a custom shader compiler or transpiler designed for this binding model, but that’s out of scope for me right now. A custom language is an enticing rabbit hole to go down, but I’d rather spend my time improving the base library.

One other pain point worth mentioning here is computing threadgroup sizes in Metal. While in Vulkan and DirectX threadgroup sizes are determined by shader annotations, Metal sets them at dispatch time (with maximum sizes determined at pipeline creation time). This makes a cross-platform API difficult. Metal 4 gave me hope this would be fixable with the addition of a required_threads_per_threadgroup() annotation - but there’s currently no way to read this value from the CPU side. For now, Loon requires this annotation on compute shaders and does some very hacky parsing to look for the annotation values, so we can pass them appropriately to the API at dispatch time. Ugly, but it works. Hopefully a future release exposes this annotation value through the pipeline state object.
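
The scan itself can be as crude as a regular expression over the shader source. This sketch shows the general idea, not Loon’s actual parser.

#include <array>
#include <optional>
#include <regex>
#include <string>

// Pull the (x, y, z) values out of a required_threads_per_threadgroup(x, y, z)
// annotation in Metal shader source, if one is present.
std::optional<std::array<int, 3>> find_threadgroup_size(const std::string& src) {
    static const std::regex re(
        R"(required_threads_per_threadgroup\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\))");
    std::smatch m;
    if (!std::regex_search(src, m, re))
        return std::nullopt;
    return std::array<int, 3>{std::stoi(m[1]), std::stoi(m[2]), std::stoi(m[3])};
}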

As Mesa’s new KosmicKrisp driver gets better, I may also care less about having a dedicated Metal backend - currently the performance of the translation isn’t great, but it’s also still very early days for the KK project and it should only get better from here.

Drawing

This is a minor point, but it differs from standard graphics APIs: there are no vertex buffers. Instead, your shader takes a pointer to vertex data and uses the vertex index to fetch what it needs. This removes a bunch of API surface. Index buffers are still used, but are again specified via a GPU pointer.
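
In Slang, vertex pulling looks something like the sketch below; Vertex and DrawArgs are placeholder types, not part of Loon.

struct Vertex   { float3 position; float2 uv; };
struct DrawArgs { Vertex* vertices; float4x4 transform; };

[shader("vertex")]
float4 vertexMain(uniform DrawArgs* args, uint32_t vertexIdx : SV_VertexID) : SV_Position
{
    // Fetch the vertex ourselves instead of using fixed-function vertex input.
    Vertex v = args->vertices[vertexIdx];
    return mul(args->transform, float4(v.position, 1.0));
}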

Bindless Textures and Samplers

In Metal, bindless textures and samplers are supported natively by the API. The TextureView and Sampler objects you get from the Loon API are just the GPUResourceID values that Metal exposes, and you can plop them into shader data structures and use them just like you would expect.

In Vulkan, we create one large binding set and use descriptor indexing. The table below explains the binding layout, which is based on what Slang generates.

Resource Type       Binding Slot
Samplers            0
Sampled Images      2
Read/Write Images   3

In Slang, the Sampler and TextureView types map to DescriptorHandle<T>, which can be used to access resources from shaders without worrying about the binding layout.
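
Used from a shader, that might look like the following sketch; Material and sampleAlbedo are placeholders of mine, and I’m assuming DescriptorHandle’s implicit conversion to the underlying resource type.

struct Material {
    DescriptorHandle<Texture2D>    albedo;
    DescriptorHandle<SamplerState> smp;
};

float4 sampleAlbedo(Material* mat, float2 uv)
{
    // A DescriptorHandle converts to its underlying resource at the use site.
    Texture2D    tex = mat->albedo;
    SamplerState s   = mat->smp;
    return tex.Sample(s, uv);
}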

No image layouts

Image layouts and layout transitions are one of my least favorite parts of Vulkan, and I wanted to avoid needing them. Metal doesn’t have them at all, so if they were in the API it would be purely for Vulkan’s sake. Even worse, since they’re tied to barriers in Vulkan, requiring explicit image transitions in our API could introduce unnecessary pipeline stalls in the Metal backend, hurting performance.

There are two mostly orthogonal things done through image layout transitions:

  1. Initialization on the queue timeline
  2. Transitioning to optimal layouts for different uses

Number 1 is required - Vulkan images are created in VK_IMAGE_LAYOUT_UNDEFINED and need to be transitioned to another layout before they can be used in any way. We work around this by keeping a global list of uninitialized textures and, on the next command buffer submission, recording a small command buffer that transitions all currently uninitialized images to VK_IMAGE_LAYOUT_GENERAL. This is fine for a single queue, but in a scenario with multiple queues (e.g. for async compute) it essentially introduces a sync point between them - no queue can do any work until this image initialization command buffer completes, because with bindless we don’t know when textures are used.
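
The initialization pass itself boils down to a single batched barrier. Here’s a simplified sketch using Vulkan 1.3’s synchronization2; init_pending_images and the color-aspect assumption are mine, not necessarily how Loon records it.

#include <vector>
#include <vulkan/vulkan.h>

// Transition every not-yet-used image from UNDEFINED to GENERAL in one batch.
// Assumes color images; depth/stencil would need different aspect bits.
void init_pending_images(VkCommandBuffer cmd, const std::vector<VkImage>& pending) {
    std::vector<VkImageMemoryBarrier2> barriers;
    for (VkImage image : pending) {
        VkImageMemoryBarrier2 b{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2};
        b.srcStageMask  = VK_PIPELINE_STAGE_2_NONE;   // nothing to wait on
        b.dstStageMask  = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT;
        b.dstAccessMask = VK_ACCESS_2_MEMORY_READ_BIT | VK_ACCESS_2_MEMORY_WRITE_BIT;
        b.oldLayout     = VK_IMAGE_LAYOUT_UNDEFINED;  // contents are discarded
        b.newLayout     = VK_IMAGE_LAYOUT_GENERAL;
        b.image         = image;
        b.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, VK_REMAINING_MIP_LEVELS,
                              0, VK_REMAINING_ARRAY_LAYERS};
        barriers.push_back(b);
    }
    VkDependencyInfo dep{VK_STRUCTURE_TYPE_DEPENDENCY_INFO};
    dep.imageMemoryBarrierCount = static_cast<uint32_t>(barriers.size());
    dep.pImageMemoryBarriers    = barriers.data();
    vkCmdPipelineBarrier2(cmd, &dep);
}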

For transitioning to optimal layouts, we basically ignore the problem. Vulkan recently introduced VK_KHR_unified_image_layouts, and modern NVIDIA cards support it, indicating that using the general layout everywhere shouldn’t degrade performance. Unfortunately, other vendors don’t seem to support this extension, at least on Windows. Historically, the reason to use specific image layouts was to enable optimizations like AMD’s Delta Color Compression (DCC). While a lot of the hardware details around these features aren’t public (or are too hard for me to find), I did look at the Mesa driver source tree, and from a cursory read it seems like on modern AMD chips (RDNA3 and up) image layouts don’t matter except for multisampled textures. So while it means potentially leaving some performance on the table, I’m happy to make that trade-off to remove this API surface entirely. Hopefully more drivers will support VK_KHR_unified_image_layouts in the future, and we can be confident that the general layout is fine to use everywhere.

What can’t be done

When I started this project, my first step was literally copy-pasting the API from Sebastian’s blog post into my text editor. While I’ve changed a few things during implementation, there were only a couple of things that I just couldn’t implement at all on the current underlying APIs.

  • The blog has a gpuSetBlendState() function, decoupling blend state from pipelines. On Vulkan VK_EXT_extended_dynamic_state3 could make this happen, but on Metal this is not currently supported (and as a result VK_EXT_extended_dynamic_state3 isn’t supported on Apple devices via MoltenVK or KosmicKrisp). For now I’ve left blend state as part of pipeline creation.
  • Split barriers in the original design use gpuSignalAfter()/gpuWaitBefore() and wait on a value at a memory location - similar to a futex API. Unfortunately there’s nothing similar to this in existing APIs, and no easy way to emulate it. I do want to come up with a simplified split barrier API, but it will look considerably different.
  • Texture heaps are opaque objects. The original blog has them as just pointers in memory, which would technically be possible using the VK_EXT_descriptor_heap extension. I’ve left them as opaque objects since that extension is still very new, and there’s no real equivalent on Metal.

Next steps

There’s a ton of features I’d like to support before I’d be comfortable calling this a 1.0 release. While I’m generally happy with the shape and ergonomics of the API, as I use it to write more examples I expect it will change - don’t expect stability yet!

Right now I’m working on debuggability: adding support for debug groups, GPU captures, and better error handling - which gets even more important with bindless approaches, where the driver isn’t able to help out as much. I also want to expand the examples to get better test coverage of existing features; I expect there are a lot of bugs hiding in plain sight. Down the line, mesh shaders and hardware ray tracing are on my list, but they aren’t high priority until the core gets more stable. On the implementation side, adding support for VK_EXT_descriptor_heap and VK_KHR_device_address_commands could improve performance on modern devices.

As of writing this, I’m in between jobs as I’ve decided to leave Seattle and move back to Canada with my family. I expect I’ll be working on this library a fair bit for the next little while as a result - unemployment leaves me with more free time than is good for me. If you’re looking for a C++ developer with experience in computer graphics, gaussian splatting, or computer vision, please reach out. I’ll primarily be looking in the Toronto area, but I’m very open to remote work as well.

If you’re interested in Loon or have any feedback, reach out on Bluesky or Mastodon.

