Saturday, March 12, 2011

Software occlusion culling pt.2: Implementation

In the previous post i’ve described occlusion culling technique that tries to take tad different steps on resolving visibility information. In this one i’m gonna share basic implementation that performs pretty well. Following steps are performed:
  • Transform objects (occluder / test geometry) bounding boxes to viewport space. Determine whether we should clip its triangles or not and tile bounds that object is in. Invisible objects are skipped. Also calculate view space Z bounds that later will used as sort key.
  • Transform vertices ( for objects intersecting frustum planes do transformation and clipping ) to viewport space.
  • Allocate output triangles and in each tile that triangle intersects store allocated triangle indices. We choose to store triangles in groups of four in SoA style. Also the backface mask is calculated during this step so backfacing triangles could be skipped in rasterizer
    Now as we have triangle lists for each tile we sort them using previously stored Z value.
  • Sorted lists are drawn, where occluder triangle scanlines are merely ORed with corresponding scanline of occlusion buffer of tile and test triangles are checked whether any pixels of scanline are visible.
Performance of the rasterizer is greatly affected by the resolution. In the demo i use 720p which can be easily reduced. Triangle list sorting can be expensive if all triangles are in same tiles. Also it could be beneficial to split tiles vertically and have masks for each tile so we can quickly check if tile is already full. I’m currently working on MT implementation and performance is scaled linearly with number of working threads just as expected. Though it seems making fast and reliable job manager on Windows is not that easy but that’s completely different topic.
Demo with source code

Monday, February 28, 2011

Software occlusion culling

Occlusion culling seems to be the obvious feature every mature engine should have these days. Some use HW occlusion queries but they introduce another sync point with GPU, others prefer using software occlusion. That is one can rasterise depth of nearby occluders ( special geomery or maybe even the one been rendered ) and check bound boxes of objects to be drawn againt this depth buffer. Its not necessary to use full resolution buffer. Memory/performance could be saved by using smaller one. There’s basically two approaches to rasterise triangles - scanlines and halfspaces. In scanline approach we draw horizontal lines between pair of triangles edges. When using halfspaces we build bound box of projected triangle and iterate its pixels checking if current one is inside triangle or not. But since all we care about is determining whether given triangle is visible or not some tricks can be used.

On a local forum (yes, its time to learn russian), infamous comrade IronPeter suggested a smart approach to do the occlusion culling. The idea is very simple - we have a queue of triangles. Triangles can be of two types - occluder-triangles which write occlusion info and occludee-triangles which only test it. Occlusion info is written in tiled bit-buffer. This allows us to write 128 pixels at a time at most using some SIMD extension like SSE. Rasterisation procedure works like in convinient scanline rasterisers except for we only need to know start and end coordinate of each scanline clipped by current tile bounds. With this assumptions in mind all is left is to sort triangle queues front-to-back using minimum z value for occludee triangles and maximum one for occluder triangles and then rasterise those triangles in order.

  1. Transform triangles to viewport space
  2. Determine tile bounds
  3. Push triangles into tile triangle queues
  4. Sort front-to-back each triangle queue of each tile ( using zmin for test triangles and zmax for occluders )
  5. Draw sorted triangle queues ( for test triangles check if any pixel of scanline is visible for occluders just OR scanline bits with tile buffer )

The algorithm is pretty simple and straightforward to implement and besides it can be easily threaded. My initial implementation can draw a scene with actual render geometry with more than 300k triangles in 20msec ( 12msec transform + 8msec rasterisation) in 720p. Lower resolution can be used to futher speedup processing. There’s a number improvements to be made. In my expectations SPU and VMX versions should be faster. After porting to other platforms and adding multithreading support i’m planning to do some reseach on automatic occluder and test geometry generation which seems to be pretty interesting topic as well. And i’m gonna share source code when there’ll be enough functionality to be usable in other engines.

P.S. Many thanks to IronPeter for neat idea and some insights on implementation.