Thousands of Styled Rectangles in 120FPS on GPU

August 29, 2023

Today I will explain how to write a UI with 100,000 moving, styled rectangles that runs at 120 FPS on an M1 MacBook.

I've been coming back to this problem from time to time and finally managed to connect all the missing pieces. First I figured out rounded corners and borders, and recently – blur. This makes it possible to implement box-shadow, border-radius, border and of course background-color, which together make up the vast majority of modern website styling (not counting layout, of course; see my How to Write a Flexbox Layout Engine to learn more about that).

In my newest iteration of the UI renderer, I went with WebGPU. After diving a lot into it over the past couple of months, I have some observations (brace yourself for a love letter).

Why would you need that?

The usual question that applies to almost all my articles. The reason is that a lot of ideas sound bad not because they are inherently bad, but because the way they are defined is too broad and the way they are usually implemented is too slow. The root of the problem is that most people never try to imagine how things would look if that was not the case.

A great example is esbuild by Evan Wallace. By narrowing down the scope of the problem and writing an extremely fast tool for bundling JS and TS, he created entirely new possibilities for bundlers and a whole branch of on-the-fly compiling that just was not possible before, because what took 30s now takes 300ms.

And to give a more specific example related to this rendering, there's a whole paradigm of rendering UIs, called Immediate Mode GUI, which is based on defining UI on the go, in code, as it is being rendered. And it's only possible when rendering is really, really fast. So a good reason would be just to try playing with that in JS, without having to compile a C++ library like Dear ImGui to WASM.

Why WebGPU?

WebGPU is more verbose than WebGL, which reveals itself in two things:

  • It gives up a lot of abstractions that were meant to be a compromise between what the programmer wants to achieve and what the GPU driver must actually do to execute it.
  • It is way more declarative, as opposed to WebGL being a giant implicit state machine.

Does WebGPU require more code? Almost always yes. Does it mean it adds boilerplate? IMO absolutely not.

WebGPU is a really well-designed API with well-thought-out defaults. Methods such as createRenderPipeline() take descriptor objects with tens of fields, but they all have sensible defaults and I never feel like I need to write more code than necessary.

When writing WebGL I often felt like I was abusing the API to achieve slightly faster code. Now all of that is gone because the API doesn't try too hard to think for me and allows me to instruct the GPU more directly.

In WebGL I often felt unsure about what I am allowed to do. "Should this thing be called only after I bind the VAO? Hmm, not sure. Seems to work either way… but what if it's just my GPU driver and it will blow up on someone else's computer?" With WebGPU I haven't had this feeling even once.

Also, on a related note, WebGPU is more restrictive in a way that maps better to what GPUs can and can't easily do. In WebGL some functions are cheap and some innocent-looking ones hide a ton of work for the GPU. If some operation involves resizing a buffer, it might actually require creating a new one and copying over the content. WebGL will happily do it for you, even though it might easily become a huge bottleneck. WebGPU will just say no. We don't resize buffers. Is it a lot more code to create a new one, copy over the content and so on? Yes, but so is it a lot more work for the GPU.

But… what about availability? It won't run on phones, especially on mobile Safari, likely for years. So if you need to use it in production right now, then definitely go with WebGL. But if, like me, you are just playing with it for fun and you are not at risk of creating a product that needs to run on phones before 2026, then I think it's a great choice.

Renderer architecture

The UIRenderer class is very simple and looks roughly like this:

export class UIRenderer {
  constructor(
    private device: GPUDevice,
    private readonly context: GPUCanvasContext,
    private colorTextureView: GPUTextureView,
  ) {
    // ...
  }

  rectangle(
    color: Vec4,
    position: Vec2,
    size: Vec2,
    corners: Vec4,
    sigma: number,
  ): void {
    // ...
  }

  render(): void {
    // ...
  }
}
It is set up using device and context. render() is used to submit all commands to the GPU (it should be called at the end of each frame).

rectangle(): color is a Vec4 RGBA value. The position is in screen space, measured from the top-left corner. Size is in pixels. Corners is a Vec4 with the radius of each corner, starting from the top-left and going clockwise. Sigma is the blur radius, with 0.25 being the neutral value (pleasant antialiasing for rounded corners; no effect on straight edges).
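To make the parameters concrete, here is a hypothetical box-shadow built from two rectangle() calls (a sketch; the recording stub below stands in for a real UIRenderer instance, and all values are made up for illustration):

```typescript
// A minimal stand-in for UIRenderer that records calls instead of drawing
// (the real class talks to the GPU).
type Vec2 = { x: number; y: number };
type Vec4 = { x: number; y: number; z: number; w: number };
const calls: Array<[Vec4, Vec2, Vec2, Vec4, number]> = [];
const renderer = {
  rectangle: (color: Vec4, position: Vec2, size: Vec2, corners: Vec4, sigma: number) => {
    calls.push([color, position, size, corners, sigma]);
  },
};

// A soft drop shadow is just another rectangle drawn first, slightly
// offset, with a translucent color and a large sigma...
renderer.rectangle(
  { x: 0, y: 0, z: 0, w: 0.25 }, // translucent black
  { x: 104, y: 108 },            // offset from the card
  { x: 200, y: 100 },
  { x: 12, y: 12, z: 12, w: 12 },
  10,                            // heavy blur = soft shadow
);
// ...followed by the card itself with the neutral sigma.
renderer.rectangle(
  { x: 1, y: 1, z: 1, w: 1 },    // white card
  { x: 100, y: 100 },
  { x: 200, y: 100 },
  { x: 12, y: 12, z: 12, w: 12 },
  0.25,                          // neutral value: crisp, antialiased edges
);
```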

Note that with those options we can render pretty much any rectangle we might want in a UI. Using blur we can implement box-shadow. Using corners we can implement border-radius. Supporting border is a trivial matter of rendering a smaller rectangle on top of the border one, with border radius reduced by border width (this is correct behavior; otherwise borders have a very unpleasant look).
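The border trick can be sketched as a small pure helper that computes the inner rectangle (a sketch; `innerRectangle` is my name for it, and Vec2/Vec4 are assumed to be plain objects):

```typescript
type Vec2 = { x: number; y: number };
type Vec4 = { x: number; y: number; z: number; w: number };

// Compute the rectangle to draw on top of a border-colored one: inset by
// the border width on every side, with each corner radius reduced by the
// border width (clamped at 0), which is what makes borders look right.
function innerRectangle(
  position: Vec2,
  size: Vec2,
  corners: Vec4,
  borderWidth: number,
): { position: Vec2; size: Vec2; corners: Vec4 } {
  return {
    position: { x: position.x + borderWidth, y: position.y + borderWidth },
    size: { x: size.x - 2 * borderWidth, y: size.y - 2 * borderWidth },
    corners: {
      x: Math.max(corners.x - borderWidth, 0),
      y: Math.max(corners.y - borderWidth, 0),
      z: Math.max(corners.z - borderWidth, 0),
      w: Math.max(corners.w - borderWidth, 0),
    },
  };
}
```

A bordered box is then two calls: one rectangle() in the border color, followed by one in the background color using the values from innerRectangle().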


There are a few things specific to this implementation.

Vertex buffer stores only one full-screen-sized quad:

const vertices = [0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1];
device.queue.writeBuffer(this.vertexBuffer, 0, new Float32Array(vertices));

There's a storage buffer which stores an array of rectangle structs, containing information about screen position, size, color and UV coordinates.

this.rectangleBuffer = device.createBuffer({
  label: "rectangle",
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  // ...
});

Rendering is based on instancing: one draw call renders multiple meshes, each with an instance index assigned in the vertex shader that can be used for indexing into uniform or storage buffers. In the WebGPU API it's just one line to render them all (repeat 6 vertices – 2 triangles – rectangleCount times):

renderPass.draw(6, rectangleCount);


I found that, unsurprisingly, the renderer gets enormously faster if I write rectangle info directly to a statically-sized Float32Array instead of appending to a regular JS array. This removes a lot of dynamic memory allocation and avoids resizing the main array, which further reduces garbage collection. And from what I've observed, GC is the most common reason for frame drops.

How it looked before:

rectangle(
  color: Vec4,
  position: Vec2,
  size: Vec2,
  corners: Vec4,
  sigma: number
): void {
  this.rectangleData = this.rectangleData.concat([
    // ...
  ]);
  this.rectangleCount += 1;
}


And here is how it looks now:

rectangle(
  color: Vec4,
  position: Vec2,
  size: Vec2,
  corners: Vec4,
  sigma: number
): void {
  const struct = 16;
  this.rectangleData[this.rectangleCount * struct + 0] = color.x;
  this.rectangleData[this.rectangleCount * struct + 1] = color.y;
  this.rectangleData[this.rectangleCount * struct + 2] = color.z;
  this.rectangleData[this.rectangleCount * struct + 3] = color.w;
  this.rectangleData[this.rectangleCount * struct + 4] = position.x;
  this.rectangleData[this.rectangleCount * struct + 5] = position.y;
  this.rectangleData[this.rectangleCount * struct + 6] = 0;
  this.rectangleData[this.rectangleCount * struct + 7] = sigma;
  this.rectangleData[this.rectangleCount * struct + 8] = corners.x;
  this.rectangleData[this.rectangleCount * struct + 9] = corners.y;
  this.rectangleData[this.rectangleCount * struct + 10] = corners.z;
  this.rectangleData[this.rectangleCount * struct + 11] = corners.w;
  this.rectangleData[this.rectangleCount * struct + 12] = size.x;
  this.rectangleData[this.rectangleCount * struct + 13] = size.y;
  this.rectangleData[this.rectangleCount * struct + 14] = width;
  this.rectangleData[this.rectangleCount * struct + 15] = height;

  this.rectangleCount += 1;
}
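The same 16-float layout can be factored into a standalone helper, which makes it easier to see how it mirrors the WGSL Rectangle struct field for field (a sketch; `packRectangle` and `FLOATS_PER_RECT` are my names, and `width`/`height` are assumed to be the canvas size in pixels, i.e. the struct's `window` field):

```typescript
type Vec2 = { x: number; y: number };
type Vec4 = { x: number; y: number; z: number; w: number };

const FLOATS_PER_RECT = 16;

// Field order mirrors the WGSL struct: color (vec4), position (vec2),
// one unused float, sigma, corners (vec4), size (vec2), window (vec2).
function packRectangle(
  data: Float32Array,
  index: number,
  color: Vec4,
  position: Vec2,
  size: Vec2,
  corners: Vec4,
  sigma: number,
  width: number,
  height: number,
): void {
  data.set(
    [
      color.x, color.y, color.z, color.w,
      position.x, position.y, 0, sigma,
      corners.x, corners.y, corners.z, corners.w,
      size.x, size.y, width, height,
    ],
    index * FLOATS_PER_RECT,
  );
}
```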

Undeniably the code looks a bit more rough, but honestly it's not bad for an optimization compromise. And the speed difference is huge, so I am definitely keeping this change.


This is heavily based on Fast Rounded Rectangle Shadows by Evan Wallace (the same Evan Wallace who co-founded Figma and created esbuild, mentioned before).

struct VertexInput {
  @location(0) position: vec2f,
  @builtin(instance_index) instance: u32,
}

struct VertexOutput {
  @builtin(position) position: vec4f,
  @location(1) @interpolate(flat) instance: u32,
  @location(2) @interpolate(linear) vertex: vec2f,
}

struct Rectangle {
  color: vec4f,
  position: vec2f,
  _unused: f32,
  sigma: f32,
  corners: vec4f,
  size: vec2f,
  window: vec2f,
}

struct UniformStorage {
  rectangles: array<Rectangle>,
}

@group(0) @binding(0) var<storage> data: UniformStorage;

// To be honest this is a huge overkill. I tried to find what is the least 
// correct value that still works without changing how things look and 
// funnily enough it's 3. Not 3.14, just 3. But let's keep it for the sake 
// of it. 
const pi = 3.141592653589793;

// Adapted from 
fn gaussian(x: f32, sigma: f32) -> f32 {
  return exp(-(x * x) / (2 * sigma * sigma)) / (sqrt(2 * pi) * sigma);
}

// This approximates the error function, needed for the gaussian integral.
fn erf(x: vec2f) -> vec2f {
  let s = sign(x);
  let a = abs(x);
  var result = 1 + (0.278393 + (0.230389 + 0.078108 * (a * a)) * a) * a;
  result = result * result;
  return s - s / (result * result);
}

fn selectCorner(x: f32, y: f32, c: vec4f) -> f32 {
  return mix(mix(c.x, c.y, step(0, x)), mix(c.w, c.z, step(0, x)), step(0, y));
}

// Return the blurred mask along the x dimension.
fn roundedBoxShadowX(x: f32, y: f32, s: f32, corner: f32, halfSize: vec2f) -> f32 {
  let d = min(halfSize.y - corner - abs(y), 0);
  let c = halfSize.x - corner + sqrt(max(0, corner * corner - d * d));
  let integral = 0.5 + 0.5 * erf((x + vec2f(-c, c)) * (sqrt(0.5) / s));
  return integral.y - integral.x;
}

// Return the mask for the shadow of a box from lower to upper.
fn roundedBoxShadow(
  lower: vec2f,
  upper: vec2f,
  point: vec2f,
  sigma: f32,
  corners: vec4f
) -> f32 {
  // Center everything to make the math easier.
  let center = (lower + upper) * 0.5;
  let halfSize = (upper - lower) * 0.5;
  let p = point - center;

  // The signal is only non-zero in a limited range, so don't waste samples.
  let low = p.y - halfSize.y;
  let high = p.y + halfSize.y;
  let start = clamp(-3 * sigma, low, high);
  let end = clamp(3 * sigma, low, high);

  // Accumulate samples (we can get away with surprisingly few samples).
  let step = (end - start) / 4.0;
  var y = start + step * 0.5;
  var value: f32 = 0;

  for (var i = 0; i < 4; i++) {
    let corner = selectCorner(p.x, p.y, corners);
    value += roundedBoxShadowX(p.x, p.y - y, sigma, corner, halfSize)
      * gaussian(y, sigma) * step;
    y += step;
  }

  return value;
}

@vertex
fn vertexMain(input: VertexInput) -> VertexOutput {
  var output: VertexOutput;
  let r = data.rectangles[input.instance];
  let padding = 3 * r.sigma;
  let vertex = mix(
    r.position.xy - padding,
    r.position.xy + r.size + padding,
    input.position
  );

  output.position = vec4f(vertex / r.window * 2 - 1, 0, 1);
  output.position.y = -output.position.y;
  output.vertex = vertex;
  output.instance = input.instance;
  return output;
}

@fragment
fn fragmentMain(input: VertexOutput) -> @location(0) vec4f {
  let r = data.rectangles[input.instance];
  let alpha = r.color.a * roundedBoxShadow(
    r.position.xy,
    r.position.xy + r.size,
    input.vertex,
    r.sigma,
    r.corners
  );
  return vec4f(r.color.rgb, alpha);
}

I am not ambitious enough to go and explain all the math behind it, but I will try to give some more context into what's happening:


Unfortunately there's no closed-form solution for a rounded rectangle drop shadow. After a lot of experimentation, I've found the best approach is to use the closed-form solution along one dimension and to sample along the other.

This means that there's no straightforward equation where you can plug in some numbers and just get the value. It's possible for one dimension; for the other, we need to sample in a loop (so basically solve the integral numerically, if that helps).
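To make the two-step idea concrete, here is a direct TypeScript port of the shader math above (a sketch for CPU-side experimentation only; a scalar erf replaces the shader's vec2 version, and vectors are unrolled into plain number arguments):

```typescript
function gaussian(x: number, sigma: number): number {
  return Math.exp(-(x * x) / (2 * sigma * sigma)) / (Math.sqrt(2 * Math.PI) * sigma);
}

// Same polynomial erf approximation as in the shader, scalar form.
function erf(x: number): number {
  const s = Math.sign(x);
  const a = Math.abs(x);
  let r = 1 + (0.278393 + (0.230389 + 0.078108 * (a * a)) * a) * a;
  r = r * r;
  return s - s / (r * r);
}

// step(0, v) in WGSL is 1 when v >= 0, else 0; mix(a, b, t) = a + (b - a) * t.
function selectCorner(x: number, y: number, c: [number, number, number, number]): number {
  const sx = x >= 0 ? 1 : 0;
  const sy = y >= 0 ? 1 : 0;
  const top = c[0] + (c[1] - c[0]) * sx;
  const bottom = c[3] + (c[2] - c[3]) * sx;
  return top + (bottom - top) * sy;
}

// Closed-form blurred mask along the x dimension.
function roundedBoxShadowX(
  x: number, y: number, sigma: number, corner: number,
  halfSizeX: number, halfSizeY: number,
): number {
  const d = Math.min(halfSizeY - corner - Math.abs(y), 0);
  const c = halfSizeX - corner + Math.sqrt(Math.max(0, corner * corner - d * d));
  const lo = 0.5 + 0.5 * erf((x - c) * (Math.sqrt(0.5) / sigma));
  const hi = 0.5 + 0.5 * erf((x + c) * (Math.sqrt(0.5) / sigma));
  return hi - lo;
}

// Numerical integration (4 samples) along the y dimension.
function roundedBoxShadow(
  lowerX: number, lowerY: number, upperX: number, upperY: number,
  pointX: number, pointY: number, sigma: number,
  corners: [number, number, number, number],
): number {
  const px = pointX - (lowerX + upperX) * 0.5;
  const py = pointY - (lowerY + upperY) * 0.5;
  const halfSizeX = (upperX - lowerX) * 0.5;
  const halfSizeY = (upperY - lowerY) * 0.5;

  // The signal is only non-zero in a limited range, so don't waste samples.
  const low = py - halfSizeY;
  const high = py + halfSizeY;
  const start = Math.min(Math.max(-3 * sigma, low), high);
  const end = Math.min(Math.max(3 * sigma, low), high);

  const step = (end - start) / 4;
  let y = start + step * 0.5;
  let value = 0;
  for (let i = 0; i < 4; i++) {
    const corner = selectCorner(px, py, corners);
    value += roundedBoxShadowX(px, py - y, sigma, corner, halfSizeX, halfSizeY)
      * gaussian(y, sigma) * step;
    y += step;
  }
  return value;
}
```

Evaluating it at a point deep inside a rectangle gives a value close to 1, and far outside it goes to 0, which is exactly the alpha mask the fragment shader multiplies into the rectangle color.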

Padding is needed for the blur since it naturally expands past the border of the blurred rectangle.

Blur intensity is controlled by sigma. 0.25 is a neutral value (as mentioned above). I have yet to find a good way to map it to screen pixels.


I wrote a demo showing that it can go up to 100,000 rectangles (at least on my computer) – Demo – CodeSandbox. The biggest performance win is the switch to Float32Array mentioned before.

For comparison, here is how slow it gets with all the extra JS overhead: Slow demo – CodeSandbox.

I am not calling this a proper benchmark, nor aspiring to make one. Those numbers are highly dependent on the hardware, the sizes of the rectangles and whatever other work the JS thread might need to do in the meantime.


Let's put all of this together into an interactive demo.


Final effect

Final effect in case you are not able to see the WebGPU demo.

With just a bit over 300 LOC we made a renderer that can draw a lot of styled rectangles at once – to the point that we could render pretty much any GUI of a tool or game. And at this point, you probably know where I am going with this.

This is a nice first step to having a proper GUI renderer, but there are many challenges left and maybe the hardest one of them is to decide what to work on next. I can think of a few options:

  • Text rendering – with all kinds of styled rectangles, a flexbox layout and text rendering we could render almost any UI.
  • Advanced layout and scrolling – overflow: hidden support. Currently, each rectangle is always shown in full and there's no way to cut it in half. If we just reduce the size, the blur will end in the wrong place. This is connected to adding a scrolling concept to the layout engine.
  • Custom fragment shader for a rectangle – how about letting the user define a WGSL shader that is executed for a rectangle instead of the default one? Given the extremely dynamic nature of JS, it's absolutely feasible; the only question is how to implement it smartly. A shader needs to be compiled into a module and then a new pipeline has to be created, so I would use the async variant of the API and give the user an async function for setting up the shader. Once that's done, there would be no performance hit from using it.
  • Rendering images – now when I think about what's missing from having the ability to render most websites, rendering images definitely comes to mind.
  • Support for gradients – not that important, but it would be nice to have.

