Tiled lighting techniques have gained significant interest in recent years. However, compared with traditional DirectX 9 style deferred lighting (additive alpha blending), tiled lighting suffers from a significant number of false positives, largely because intersection testing uses coarse bounding volumes. This is particularly relevant when supporting lights/volumes of different shapes such as long narrow spot lights, wedges, capsules, etc. The downside of DirectX 9 style lighting is that it does not work with forward lighting, and having thousands of lights is extremely expensive due to the per-light setup cost, overlapping reads from the G-buffer, and overlapping writes to the frame buffer.
During the development of Rise of the Tomb Raider (ROTR) we came up with a new tiled lighting variant, which we named Fine Pruned Tiled Lighting (FPTL) and describe in GPU Pro 7. The full implementation has many details that are discussed in the article, and I will not go over them here, but the main point is that the cost of fine pruning can easily be absorbed by using asynchronous compute. This means we obtain a light list with very few false positives almost for free.
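To make the difference between a coarse test and fine pruning concrete, here is a minimal CPU-side sketch. It is not the code from the article or the demo: the `CapsuleLight` struct, the function names, and the choice of a bounding sphere as the coarse volume are all assumptions made for illustration. The idea is simply that a light survives fine pruning only if at least one position in the tile actually lies inside the exact light shape.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hypothetical capsule light: every point within `radius` of the segment [a, b].
struct CapsuleLight { Vec3 a, b; float radius; };

// Coarse test: point against a bounding sphere that encloses the whole capsule.
bool pointInBoundingSphere(Vec3 p, const CapsuleLight& l) {
    Vec3 center = {(l.a.x + l.b.x) * 0.5f, (l.a.y + l.b.y) * 0.5f, (l.a.z + l.b.z) * 0.5f};
    Vec3 ab = sub(l.b, l.a);
    float r = 0.5f * std::sqrt(dot(ab, ab)) + l.radius;
    Vec3 d = sub(p, center);
    return dot(d, d) <= r * r;
}

// Exact test: point against the capsule itself.
bool pointInCapsule(Vec3 p, const CapsuleLight& l) {
    assert(l.radius >= 0.0f);
    Vec3 ab = sub(l.b, l.a);
    float len2 = dot(ab, ab);
    // Parameter of the closest point on the segment, clamped to [0, 1].
    float t = len2 > 0.0f ? dot(sub(p, l.a), ab) / len2 : 0.0f;
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    Vec3 closest = {l.a.x + t * ab.x, l.a.y + t * ab.y, l.a.z + t * ab.z};
    Vec3 d = sub(p, closest);
    return dot(d, d) <= l.radius * l.radius;
}

// Fine pruning (sketch): keep only the lights from the coarse list whose exact
// shape contains at least one of the tile's reconstructed positions.
std::vector<std::size_t> finePrune(const std::vector<std::size_t>& coarseList,
                                   const std::vector<CapsuleLight>& lights,
                                   const std::vector<Vec3>& tilePositions) {
    std::vector<std::size_t> pruned;
    for (std::size_t index : coarseList) {
        assert(index < lights.size());
        for (const Vec3& p : tilePositions) {
            if (pointInCapsule(p, lights[index])) {
                pruned.push_back(index);
                break;  // one hit is enough to keep the light
            }
        }
    }
    return pruned;
}
```

A long, thin capsule is the typical false positive here: its bounding sphere can cover positions that are nowhere near the capsule itself, so the coarse test accepts it while `finePrune` drops it. On the GPU this work would run per tile in a compute shader rather than in nested CPU loops.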
As explained in the article, the technique works with essentially any methodology, such as deferred shading, pre-pass deferred, tiled forward, and even hybrids of these. A demo sample is available, though it was written in vanilla DirectX 11, which means the asynchronous compute part is left as an exercise for the reader! The demo shows a single terrain mesh lit by 1024 lights (heat map and fine pruning enabled by default). For simplicity the demo is set up as tiled forward, though on ROTR we used a hybrid in which we supported pre-pass deferred, tiled forward, and conventional forward.
When running the demo you will notice that it runs faster with fine pruning enabled than with it disabled, even though there is no asynchronous compute in the demo (since it is standard DX11). The speed improvement is of course much more significant when asynchronous compute is used correctly.
Another interesting aspect of the implementation is that we determine screen-space AABBs around each light (regardless of shape type) on the GPU. This allows us to reduce coverage significantly for partially visible lights (which accelerates fine pruning) and reduces register pressure during light list generation (explained in the article). Additionally, we keep light lists sorted by shape type to minimize the chance of thread divergence during tiled forward lighting.
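The two ideas in the previous paragraph can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not the GPU implementation from the article: `screenSpaceAabb` assumes every corner of the light's bounding volume is in front of the camera (z > 0, so no near-plane clipping is handled) and uses a simple symmetric perspective projection, and the `LightShape` enum, `LightRef` struct, and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };

// Bounds in normalized device coordinates, where the screen spans [-1, 1].
struct ScreenAabb { float minX, minY, maxX, maxY; };

// Projects the view-space corners of a light's bounding volume and clamps the
// result to the screen. Assumes all corners have z > 0 (no near-plane clipping).
// If a min ends up greater than its max, the light is entirely off-screen.
ScreenAabb screenSpaceAabb(const std::vector<Vec3>& corners,
                           float tanHalfFovX, float tanHalfFovY) {
    const float inf = std::numeric_limits<float>::infinity();
    ScreenAabb box = {inf, inf, -inf, -inf};
    for (const Vec3& c : corners) {
        assert(c.z > 0.0f);
        float x = c.x / (c.z * tanHalfFovX);
        float y = c.y / (c.z * tanHalfFovY);
        box.minX = std::min(box.minX, x);
        box.minY = std::min(box.minY, y);
        box.maxX = std::max(box.maxX, x);
        box.maxY = std::max(box.maxY, y);
    }
    // Clamping is what shrinks coverage for partially visible lights.
    box.minX = std::max(box.minX, -1.0f);
    box.minY = std::max(box.minY, -1.0f);
    box.maxX = std::min(box.maxX, 1.0f);
    box.maxY = std::min(box.maxY, 1.0f);
    return box;
}

// Hypothetical shape types; the order only defines the grouping.
enum class LightShape { Spot = 0, Sphere = 1, Capsule = 2, Wedge = 3 };

struct LightRef { std::size_t index; LightShape shape; };

// Groups a tile's light list by shape so that consecutive entries take the
// same shading branch. stable_sort keeps the original order within a shape.
void sortByShape(std::vector<LightRef>& list) {
    std::stable_sort(list.begin(), list.end(),
                     [](const LightRef& a, const LightRef& b) {
                         return static_cast<int>(a.shape) < static_cast<int>(b.shape);
                     });
}
```

With the list grouped this way, threads walking the same tile's list tend to evaluate the same light type at the same time, which is the property the sorting is meant to provide.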
For more information on the details... buy GPU Pro 7! :)