I built a Minecraft-style voxel engine from scratch in C++ and OpenGL. This post is a technical walkthrough of the optimizations that took me from 3 FPS to 980 FPS in wireframe at 4 chunks, and from 54 FPS / 1.65 GB to 120 FPS / 251 MB in textured rendering at 24 chunks. The first working build rendered every face of every block in every chunk: each block is a cube with 6 faces, each face 2 triangles, each triangle 3 vertices, so every block cost 36 vertices whether or not any of them were visible.
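The arithmetic makes the problem obvious. For a single 16×256×16 chunk, the naive cost works out to:

```cpp
#include <cassert>
#include <cstdint>

// Naive cost of rendering every face of every block in one 16x256x16 chunk.
constexpr std::int64_t kBlocksPerChunk   = 16 * 256 * 16;  // 65,536 blocks
constexpr std::int64_t kVerticesPerBlock = 6 * 2 * 3;      // 6 faces * 2 tris * 3 verts = 36
constexpr std::int64_t kVerticesPerChunk = kBlocksPerChunk * kVerticesPerBlock;  // ~2.36M
```

Over 2.3 million vertices per chunk, almost all of them buried inside solid terrain.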
Each chunk owns a fixed 3D array of block IDs (uint8_t blocks[16][256][16]). Block IDs are cheap: an enum or an integer. The ChunkManager maintains an std::unordered_map keyed by chunk coordinates (with a custom glm::vec2 hash), loading and unloading chunks dynamically as the camera moves. Chunks beyond a fixed render distance are pruned each frame.
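A minimal sketch of this layout, using a plain `ChunkCoord` struct as a stand-in for glm::vec2 (the type, hash mix, and function names here are illustrative, not the engine's exact code):

```cpp
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <unordered_map>

// Stand-in for glm::vec2 chunk coordinates.
struct ChunkCoord {
    int x, z;
    bool operator==(const ChunkCoord& o) const { return x == o.x && z == o.z; }
};

// Custom hash: pack both coordinates into one 64-bit value, then hash it.
struct ChunkCoordHash {
    std::size_t operator()(const ChunkCoord& c) const {
        return std::hash<std::int64_t>{}((std::int64_t(c.x) << 32) ^ std::uint32_t(c.z));
    }
};

struct Chunk { std::uint8_t blocks[16][256][16] = {}; };

using ChunkMap = std::unordered_map<ChunkCoord, Chunk, ChunkCoordHash>;

// Prune chunks outside the render distance (Chebyshev distance in chunk units).
inline void pruneChunks(ChunkMap& chunks, ChunkCoord camera, int renderDistance) {
    for (auto it = chunks.begin(); it != chunks.end();) {
        int dx = std::abs(it->first.x - camera.x);
        int dz = std::abs(it->first.z - camera.z);
        if (dx > renderDistance || dz > renderDistance) it = chunks.erase(it);
        else ++it;
    }
}
```

Erasing through the iterator returned by `unordered_map::erase` keeps the pass safe while removing elements mid-iteration.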
When generating a chunk's mesh, only emit a face if the adjacent voxel in that direction is air. For each block, check 6 neighbors, skip any face where the neighbor is solid.
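The neighbor check can be sketched like this (a minimal version; the function name is illustrative, and a real engine would query the neighboring chunk at borders instead of always emitting):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t kAir = 0;

// Returns true if the face of the block at (x, y, z) pointing toward
// (dx, dy, dz) should be emitted, i.e. the neighbor is air or off-chunk.
inline bool shouldEmitFace(const std::uint8_t blocks[16][256][16],
                           int x, int y, int z, int dx, int dy, int dz) {
    int nx = x + dx, ny = y + dy, nz = z + dz;
    if (nx < 0 || nx >= 16 || ny < 0 || ny >= 256 || nz < 0 || nz >= 16)
        return true;  // chunk border: emit conservatively (or check the neighbor chunk)
    return blocks[nx][ny][nz] == kAir;  // emit only if the neighbor is air
}
```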
This single change eliminates the majority of geometry in a dense world. Interior faces, which make up the bulk of a filled terrain, disappear entirely. Result at baseline: 3 FPS → 25 FPS at 4 chunks.
The view frustum is the 6-plane volume representing everything the camera can see. For each chunk, compute its AABB (axis-aligned bounding box) and test it against all 6 frustum planes. If the AABB is fully outside any plane, skip the chunk's draw call entirely.
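A common way to implement this test is the "positive vertex" trick: for each plane, check only the AABB corner farthest along the plane normal. The sketch below assumes inward-facing plane normals (a convention choice, not necessarily the engine's):

```cpp
#include <cassert>

struct Plane { float a, b, c, d; };  // a*x + b*y + c*z + d = 0, normal points inward
struct AABB  { float minX, minY, minZ, maxX, maxY, maxZ; };

// If even the corner farthest along the normal is behind a plane,
// the whole box is outside the frustum and the chunk can be skipped.
inline bool aabbInFrustum(const AABB& box, const Plane planes[6]) {
    for (int i = 0; i < 6; ++i) {
        const Plane& p = planes[i];
        float px = (p.a >= 0.0f) ? box.maxX : box.minX;  // positive vertex, per axis
        float py = (p.b >= 0.0f) ? box.maxY : box.minY;
        float pz = (p.c >= 0.0f) ? box.maxZ : box.minZ;
        if (p.a * px + p.b * py + p.c * pz + p.d < 0.0f)
            return false;  // fully outside this plane
    }
    return true;  // intersecting or inside: issue the draw call
}
```

The planes themselves are typically extracted once per frame from the combined view-projection matrix.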
At large render distances, this skips 40–60% of draw calls. It was worth 95 FPS → 124 FPS at 16 chunks when combined with multithreading.
At 16 chunks, memory sat at 2.71 GB.
Mesh vertex data was being kept in CPU-side std::vector<float> buffers after being uploaded to the GPU via glBufferData. Once the data is on the GPU, the CPU copy is dead weight. The fix: release the vectors (including their capacity) immediately after upload.
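One subtlety worth a sketch: `clear()` alone does not return memory, because it leaves the vector's capacity intact. Swapping with an empty vector (or calling `shrink_to_fit`) actually frees it:

```cpp
#include <cassert>
#include <vector>

// Call after glBufferData has copied the mesh to the GPU.
// clear() only zeroes the size; swapping with a fresh vector
// drops both size and capacity, releasing the heap allocation.
inline void releaseCpuCopy(std::vector<float>& vertices) {
    std::vector<float>().swap(vertices);
}
```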
Terrain generation using layered FBM (Fractional Brownian Motion) noise is CPU-expensive. With a synchronous chunk manager, every new chunk generated blocks the render thread, causing hitching.
The fix was to offload chunk data generation (noise sampling, block array filling) to background threads using std::async, keeping the main thread dedicated to rendering and GPU uploads.
The key constraint: GPU uploads (glBufferData, VAO setup) must happen on the main thread, because an OpenGL context can only be current on one thread at a time. Generation is parallelized; upload is not.
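The two-phase pipeline can be sketched as below. `generateBlocks` is a stand-in for the FBM terrain sampler, and the function names are illustrative; the important shape is that std::async owns the CPU work while the main thread polls futures and does the GL calls:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <vector>

struct ChunkData { std::vector<std::uint8_t> blocks; };

// Worker thread: pure CPU work, no GL calls.
ChunkData generateBlocks(int chunkX, int chunkZ) {
    ChunkData data;
    data.blocks.assign(16 * 256 * 16, 0);
    // ... sample layered FBM noise and fill the block array here ...
    (void)chunkX; (void)chunkZ;
    return data;
}

// Main thread: kick off generation without blocking the render loop.
std::future<ChunkData> requestChunk(int chunkX, int chunkZ) {
    return std::async(std::launch::async, generateBlocks, chunkX, chunkZ);
}

// Main thread, once per frame: consume only futures that are already done,
// then do the GPU upload here, on the thread that owns the GL context.
bool tryFinishChunk(std::future<ChunkData>& pending, ChunkData& out) {
    if (pending.wait_for(std::chrono::seconds(0)) != std::future_status::ready)
        return false;  // still generating; check again next frame
    out = pending.get();
    // glBufferData / VAO setup would happen here.
    return true;
}
```

Polling with `wait_for(0)` keeps the render thread from ever blocking on a slow chunk.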
Result: 95 FPS → 124 FPS at 16 chunks.
Per-face culling reduces geometry to visible faces only. Greedy meshing merges adjacent coplanar faces of the same block type into a single larger quad, dramatically reducing triangle count.
For each axis-aligned slice of the chunk (e.g., all faces pointing in the +Z direction at Z=5):
1. Build a 2D mask of which cells in the slice need a face emitted.
2. Scan the mask. When you find an unvisited cell, greedily expand a rectangle: first extend as far right as possible (same block type, not yet consumed), then extend upward as long as full width is available.
3. Emit one quad for the entire rectangle. Mark all cells in the rectangle as consumed.
4. Repeat until the mask is exhausted.
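The four steps above can be sketched as a single pass over the 2D mask. This is a minimal version (names and mask layout are illustrative; a full implementation also carries the slice's axis and normal through to vertex emission):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Quad { int x, y, w, h; std::uint8_t type; };

// Greedy rectangle expansion over one W x H face mask (0 = no face needed).
std::vector<Quad> greedyMesh(const std::vector<std::uint8_t>& mask, int W, int H) {
    std::vector<Quad> quads;
    std::vector<bool> consumed(W * H, false);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            if (consumed[y * W + x] || mask[y * W + x] == 0) continue;
            std::uint8_t type = mask[y * W + x];
            // 1) extend right while the block type matches and is unconsumed
            int w = 1;
            while (x + w < W && !consumed[y * W + x + w] && mask[y * W + x + w] == type)
                ++w;
            // 2) extend upward only while the entire width still matches
            int h = 1;
            while (y + h < H) {
                bool rowOk = true;
                for (int i = 0; i < w; ++i) {
                    if (consumed[(y + h) * W + x + i] ||
                        mask[(y + h) * W + x + i] != type) { rowOk = false; break; }
                }
                if (!rowOk) break;
                ++h;
            }
            // 3) mark the rectangle consumed and emit one quad for all of it
            for (int j = 0; j < h; ++j)
                for (int i = 0; i < w; ++i)
                    consumed[(y + j) * W + x + i] = true;
            quads.push_back({x, y, w, h, type});
        }
    }
    return quads;  // 4) loop structure guarantees the mask is exhausted
}
```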
On flat terrain (the common case), entire horizontal layers collapse to a handful of quads. A 16×16 flat ground plane goes from 512 triangles to 2.
Result at 24 chunks: 54 FPS → 120 FPS, 1.65 GB → 251 MB. Both FPS and memory improve because fewer vertices means less GPU memory pressure and fewer vertex shader invocations.
Alongside greedy meshing, switching from raw vertex arrays to indexed rendering (EBOs) reduces vertex duplication. A quad has 4 unique vertices but 6 indices (two triangles). With an EBO you store 4 vertices and reuse them via indices, reducing vertex buffer size by 33% per quad.
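Concretely, one indexed quad looks like this (positions are illustrative; a real vertex also carries UVs, normals, etc.):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Vertex { float x, y, z; };

// 4 unique vertices instead of 6: the shared diagonal corners (0 and 2)
// are stored once and referenced twice through the index buffer.
constexpr std::array<Vertex, 4> kQuadVertices = {{
    {0.f, 0.f, 0.f},  // 0: bottom-left
    {1.f, 0.f, 0.f},  // 1: bottom-right
    {1.f, 1.f, 0.f},  // 2: top-right
    {0.f, 1.f, 0.f},  // 3: top-left
}};

// Two CCW triangles sharing the 0-2 diagonal; this array goes into
// a GL_ELEMENT_ARRAY_BUFFER (the EBO).
constexpr std::array<std::uint32_t, 6> kQuadIndices = {0, 1, 2, 2, 3, 0};
```

The draw call then becomes glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0) instead of glDrawArrays, and the vertex buffer shrinks from 6 to 4 vertices per quad.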