<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>zeux.io</title>
		<description></description>
		<link>https://zeux.io</link>
		<atom:link href="https://zeux.io/feed/" rel="self" type="application/rss+xml" />
		
		<item>
			<title>meshoptimizer 1.0 released</title>
			<description>&lt;p&gt;A short post today. If you’re following me on any of the vast array of social accounts then you’re probably aware, but if you’re reading this blog through the ancient technology otherwise known as RSS, meshoptimizer v1.0 was released yesterday!&lt;/p&gt;

&lt;p&gt;In addition to the usual &lt;a href=&quot;https://github.com/zeux/meshoptimizer/releases/tag/v1.0&quot;&gt;release notes on GitHub&lt;/a&gt;, I’ve also written a dedicated announcement that talks a little more about the last couple years of progress. If you’re interested in graphics programming, you might find it interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://meshoptimizer.org/v1&quot;&gt;meshoptimizer v1.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s a little strange to look back and think that it has been nine whole years since the library was originally created. The scope and quality have grown substantially since then, and what started as a small hobby project has slowly turned into an important technology used across the industry. The first release, confusingly numbered 0.5, was just under 1400 lines of code and had a fraction of the functionality that v1.0 provides today&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. I think “0.5” referred to the fact that parts of the code were salvaged from toy engines I’d developed over the years and as such were somewhat well tested. I assume that what I was thinking at the time was that, a few more improvements and a couple of versions later, the library could reach 1.0!&lt;/p&gt;

&lt;p&gt;What happened instead is a deep rabbit hole of algorithms, hardware details, inventions and new use cases. I’ve written about some of these on this blog, although it’s always been a tradeoff that’s difficult to navigate - do I write more code, or do I write more text &lt;em&gt;about&lt;/em&gt; the code? Here are some of the articles published about the internals over the years:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2017/07/31/optimal-grid-rendering-is-not-optimal/&quot;&gt;Optimal grid rendering isn’t optimal (2017)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2019/03/11/small-fast-web/&quot;&gt;Small, fast, web (2019)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2020/01/22/learning-from-data/&quot;&gt;Learning from data (2020)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2022/09/02/vpexpandb-neon-z3/&quot;&gt;VPEXPANDB on NEON with Z3 (2022)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2023/06/30/efficient-jagged-arrays/&quot;&gt;Efficient jagged arrays (2023)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2024/04/09/meshlet-triangle-locality/&quot;&gt;Meshlet triangle locality matters (2024)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2025/05/03/load-store-conflicts/&quot;&gt;Load-store conflicts (2025)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zeux.io/2025/09/30/billions-of-triangles-in-minutes/&quot;&gt;Billions of triangles in minutes (2025)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These only really cover a small portion of the research and work that went into the library. Perhaps someday I’ll write more; it might be interesting to do something special for the 10th anniversary.&lt;/p&gt;

&lt;p&gt;As always, feel free to drop me a note if you are &lt;a href=&quot;https://github.com/zeux/meshoptimizer/discussions/986&quot;&gt;using meshoptimizer&lt;/a&gt; or have ideas and suggestions for future versions. As I write in the full post linked above, v1.0 is an important milestone, but also 1.0 is just a number.&lt;/p&gt;

&lt;p&gt;We continue.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Ironically, because the initial version heavily relied on STL, compiling v1.0 - which has ten times as much source code - takes almost as much time in debug (1.1s vs 0.85s) and only twice as long in release (2.2s vs 1.1s). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Tue, 09 Dec 2025 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2025/12/09/meshoptimizer-v1/</link>
			<guid isPermaLink="true">https://zeux.io/2025/12/09/meshoptimizer-v1/</guid>
		</item>
		
		<item>
			<title>Billions of triangles in minutes</title>
			<description>&lt;p&gt;Early this year, NVIDIA released their new raytracing technology, &lt;a href=&quot;https://developer.nvidia.com/blog/nvidia-rtx-mega-geometry-now-available-with-new-vulkan-samples/&quot;&gt;RTX Mega Geometry&lt;/a&gt;, alongside an impressive &lt;a href=&quot;https://developer.nvidia.com/rtx-kit&quot;&gt;Zorah demo&lt;/a&gt;. This demo was distributed as a ~100 GB Unreal Engine scene that can only be opened in a special branch of Unreal Engine, NvRTX. The demo showcased the application of new driver-exposed raytracing features, specifically clustered raytracing, in combination with the Nanite clustered LOD pipeline - making it possible to stream and display a very highly detailed scene with full raytracing, without using the Nanite proxy meshes that Unreal Engine currently generates for raytracing.&lt;/p&gt;

&lt;p&gt;As this was too Unreal Engine specific, I couldn’t really experiment with it much - but then, in early September, NVIDIA released an update to their &lt;a href=&quot;https://github.com/nvpro-samples/vk_lod_clusters&quot;&gt;vk_lod_clusters open-source sample&lt;/a&gt; that - among other things - featured the Zorah scene as a glTF file. Naturally, this piqued my curiosity - and led me to spend a fair amount of time improving support for hierarchical clustered LOD in &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_1.jpg&quot;&gt;&lt;img src=&quot;/images/zorah_1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;technology&quot;&gt;Technology&lt;/h1&gt;

&lt;p&gt;The rest of this will be easier to understand if you have a reasonable grasp of Nanite. If not, I highly recommend &lt;a href=&quot;https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf&quot;&gt;Nanite: A Deep Dive&lt;/a&gt; by Brian Karis et al., which covers the specifics; I’ll just summarize the basic flow here.&lt;/p&gt;

&lt;p&gt;Given a triangle mesh with, probably, a lot of triangles, our task is to 1) generate a hierarchical structure that can represent this mesh at any level of detail, 2) stream parts of this structure at appropriate detail, and 3) render the visible parts of the mesh at appropriate detail level. It’s important that the structure we use can represent multiple levels of detail in multiple different regions of the mesh - this allows it to scale to large models while distributing the detail appropriately; it’s also important that it is efficient to render. The structure that’s chosen here is a graph (DAG) of clusters; each cluster is a small set of triangles, say, up to 128, and represents a small patch of the mesh at a given level of detail. The structure contains clusters at various levels of detail, and the runtime code is responsible for streaming and rendering them to minimize the visual error - a cluster is replaced by a coarser cluster only if the resulting visual error is under 1 pixel (and the resulting switch is hidden by TAA or other temporal filters).&lt;/p&gt;
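&lt;p&gt;As a concrete sketch of the runtime selection rule, here’s a minimal C++ illustration (the names here are mine, not meshoptimizer’s or Nanite’s): each cluster stores its own simplification error and the error of the coarser version that can replace it, and a cluster is rendered exactly when it is under the error threshold but its replacement is not.&lt;/p&gt;

```cpp
#include <cassert>

// Hypothetical DAG node: each cluster stores its own simplification error and
// the error of the coarser group that replaced it. Errors are kept monotonic
// so that parent_error >= error for every cluster.
struct Cluster
{
    float error;        // error of this cluster's own simplification
    float parent_error; // error of the coarser version that can replace it
};

// A cluster belongs to the rendered LOD cut when it is "good enough" but its
// coarser replacement is not; thanks to monotonic errors, exactly one version
// of each surface region passes this test, so the cut has no overlaps or holes.
bool inCut(const Cluster& c, float threshold)
{
    return c.error <= threshold && c.parent_error > threshold;
}
```

The same test runs independently per cluster, which is what makes the selection trivially parallel on the GPU.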

&lt;p&gt;There are three tricky parts to this technique: generation of the structure from the highly detailed mesh; compression of the results to make them efficient to stream; and real-time rendering of the results. We are only going to talk a little bit about the first part :)&lt;/p&gt;

&lt;p&gt;To build the structure, a mesh is split into a set of clusters; neighboring clusters are merged into slightly larger groups; each group is simplified independently, preserving the boundary edges of the group; the resulting group is then split into more clusters and the process recurses until no admissible clusters are left. There is a lot of nuance in how the algorithms are combined to ensure that the resulting representation doesn’t exhibit cracks between clusters at various levels of detail, and a lot of tradeoffs for the individual algorithms involved - easily the subject of a thesis (and indeed, multiple theses have been written on this topic).&lt;/p&gt;
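&lt;p&gt;To make the recursion above concrete, here’s a toy C++ skeleton of the build loop with stand-in stages (a real pipeline would use proper clusterization, spatial grouping and boundary-preserving simplification with error tracking; all names here are hypothetical):&lt;/p&gt;

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<unsigned int> IndexBuffer;

// Stand-in: split a triangle list into clusters of up to max_triangles each.
// (A real clusterizer would also optimize for locality and boundary length.)
static std::vector<IndexBuffer> clusterize(const IndexBuffer& indices, size_t max_triangles)
{
    std::vector<IndexBuffer> clusters;
    for (size_t i = 0; i < indices.size(); i += max_triangles * 3)
    {
        size_t end = i + max_triangles * 3 < indices.size() ? i + max_triangles * 3 : indices.size();
        clusters.push_back(IndexBuffer(indices.begin() + i, indices.begin() + end));
    }
    return clusters;
}

// Stand-in: "simplify" a group to half the triangles by dropping every other
// one. (A real simplifier preserves group boundaries and reports the error.)
static IndexBuffer simplify(const IndexBuffer& indices)
{
    IndexBuffer result;
    for (size_t i = 0; i + 5 < indices.size(); i += 6)
        result.insert(result.end(), indices.begin() + i, indices.begin() + i + 3);
    return result;
}

// Build DAG levels: clusterize, merge neighboring clusters into groups,
// simplify each group, and recurse on the coarser geometry until one
// cluster's worth of triangles remains. Returns the number of levels built.
size_t buildLevels(IndexBuffer indices, size_t max_triangles)
{
    size_t levels = 0;
    while (indices.size() > max_triangles * 3)
    {
        std::vector<IndexBuffer> clusters = clusterize(indices, max_triangles);
        levels++;

        // Merge pairs of neighboring clusters into groups and simplify each
        // group independently; the simplified groups form the next level.
        IndexBuffer coarser;
        for (size_t i = 0; i < clusters.size(); i += 2)
        {
            IndexBuffer group = clusters[i];
            if (i + 1 < clusters.size())
                group.insert(group.end(), clusters[i + 1].begin(), clusters[i + 1].end());

            IndexBuffer simplified = simplify(group);
            coarser.insert(coarser.end(), simplified.begin(), simplified.end());
        }

        if (coarser.size() >= indices.size())
            break; // simplification made no progress; stop

        indices = coarser;
    }
    return levels + 1; // the final coarse level
}
```

The crack-free property of the real technique comes from the group/split interplay that this sketch glosses over: group boundaries are locked during simplification, and the split after simplification moves the boundaries between levels.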

&lt;p&gt;&lt;img src=&quot;/images/zorah_2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Since Nanite was released in 2021, multiple different engines have started adopting this processing paradigm. An open-source geometry processing library I work on these days, &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;, has had an example since 2024 of how to combine the different algorithms it provides to build the resulting structure&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Having an end-to-end example made it much easier to improve the algorithms and experiment with variants of the higher-level technique - could I perhaps use that example code to process the Zorah scene?&lt;/p&gt;

&lt;h1 id=&quot;a-sense-of-scale&quot;&gt;A sense of scale&lt;/h1&gt;

&lt;p&gt;That screenshot above certainly looks pretty; getting to that level of fidelity requires a lot of texture, shading and lighting work beyond Nanite. Fortunately, we’re only concerned with the geometry part; this should make our job easy, right?&lt;/p&gt;

&lt;p&gt;As mentioned, while the original Zorah scene is an Unreal Engine asset, NVIDIA published a glTF scene for Zorah as part of their open source Vulkan samples. Let’s take a look, shall we?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;a href=&quot;http://developer.download.nvidia.com/ProGraphics/nvpro-samples/zorah_main_public.gltf.7z&quot;&gt;zorah_main_public.gltf.7z&lt;/a&gt;&lt;/p&gt;
  &lt;ul&gt;
    &lt;li&gt;1.64B triangles, 18.9B triangles with instancing&lt;/li&gt;
    &lt;li&gt;36.1 GB on disk&lt;/li&gt;
    &lt;li&gt;Render cache 62 GB on disk, it can be downloaded or generated&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;… oh. A 36 GB glTF file that &lt;em&gt;only&lt;/em&gt; contains geometry - moreover, it doesn’t contain vertex attribute data for the vast majority of the meshes, just positions! (the sample code derives normals for shading from positions in the shader code)&lt;/p&gt;

&lt;p&gt;Trying to import this glTF file in Blender takes ~10 minutes before running out of memory&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Naturally, Unreal Engine is much faster - which is to say, the UE import of this file runs out of memory and crashes in just under 5 minutes!&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; Evidently, the processing is quite memory heavy and 192 GB RAM is, in fact, not enough for everyone.&lt;/p&gt;

&lt;p&gt;Fortunately, we don’t need to import this file: we just need to run the NVIDIA sample code that processes it. An attempt to do that in early September, however, would also run out of memory when using 16 threads. Experimentally, I found that I could reliably run the processing code using 8 threads (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--processingthreadpct 0.25&lt;/code&gt;) as long as nothing else was running on the system, as the process would take ~180+ GB RAM. Using 7 threads made it possible to sort of use the computer in the meantime… for the approximately 30 minutes that it took to run.&lt;/p&gt;

&lt;p&gt;Now that I’ve sufficiently prepared you for just how large this scene is, it’s time to talk about a series of optimizations that make all of this a little more practical :)&lt;/p&gt;

&lt;h1 id=&quot;baseline&quot;&gt;Baseline&lt;/h1&gt;

&lt;p&gt;To build the hierarchical structure in question, we need &lt;a href=&quot;https://meshoptimizer.org/#clusterization&quot;&gt;clusterization&lt;/a&gt; (to split meshes into clusters), &lt;a href=&quot;https://meshoptimizer.org/#cluster-partitioning&quot;&gt;partitioning&lt;/a&gt; (to group clusters together) and &lt;a href=&quot;https://meshoptimizer.org/#simplification&quot;&gt;simplification&lt;/a&gt; (to reduce a cluster group to fewer triangles). Fortunately, &lt;a href=&quot;https://github.com/zeux/meshoptimizer/&quot;&gt;meshoptimizer&lt;/a&gt; provides algorithms for all three.&lt;/p&gt;

&lt;p&gt;As of version 0.25, meshoptimizer contains two main clusterization algorithms: one built for rasterization and mesh shaders, which tries to minimize the number of produced meshlets by packing geometry tightly into them, and one built for raytracing and the new clustered raytracing extensions. The former has been evolving over the last 8 years with incremental improvements and fixes; the latter is relatively new and was specifically developed after NVIDIA published their RTX Mega Geometry work&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. The reason why two algorithms need to exist is that clusterization, when used for raytracing, is very sensitive to where exactly the cluster boundaries lie - an optimal clusterization for raytracing makes it possible to take individual clusters, build micro-BVH trees for each one, build one BVH tree over all of the resulting clusters, and trace rays through the resulting structure. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_lod_clusters&lt;/code&gt; sample just needs raytracing-optimal clusters, but my original demo used raster-optimized ones, so we’ll start with those.&lt;/p&gt;

&lt;p&gt;When the demo code was written originally, it was structured to be useful for working on the underlying algorithms, not to be reusable. It took some time to rework this into code that’s easy to follow and presents a simple and clean interface; the code takes the mesh, as well as a lot of configuration parameters, as input, and produces groups of clusters via a callback. This conversion to reusable code, by itself, also had some performance benefits - in addition to eliminating some redundant STL copies (unlike meshoptimizer proper, this code uses STL for convenience for now), it was also helpful to switch to an interface where the caller communicates vertex attributes separately. The Zorah scene uses position-only meshes for the most part, so we shouldn’t spend time on processing normals or other attributes. The new interface also integrates some recent additions to simplification, like permissive mode, something that’s out of scope for today’s article. A minimal example is now quite small and simple:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;clodConfig&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;config&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clodDefaultConfigRT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;clodMesh&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{};&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions_stride&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;clodBuild&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmesh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clodGroup&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clodCluster&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cluster_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What remains, then, is to load the glTF scene and run the code on each individual mesh. My test code does &lt;em&gt;not&lt;/em&gt; save the resulting data to disk - so this is not really an apples-to-apples test, as saving the data may incur extra serialization costs. We’ll come back to this at the end. Naturally, we’ll be using multiple threads to process the data - and using &lt;a href=&quot;https://github.com/jkuhlmann/cgltf&quot;&gt;cgltf&lt;/a&gt; to load the file to memory.&lt;/p&gt;

&lt;p&gt;The file is 36 GB; to avoid loading the entire file into memory synchronously before the process starts, as well as a flat 36 GB memory overhead, we’ll use memory mapping; I contributed &lt;a href=&quot;https://github.com/jkuhlmann/cgltf/pull/278&quot;&gt;a small PR&lt;/a&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cgltf&lt;/code&gt; to make working with memory mapped buffers a little easier.&lt;/p&gt;
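&lt;p&gt;For reference, here’s a minimal POSIX sketch of what memory-mapping the buffer looks like, independent of cgltf specifics (the structure and names are mine; error reporting trimmed). Pages are faulted in on demand, so a 36 GB buffer costs address space rather than an upfront 36 GB read:&lt;/p&gt;

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical wrapper for a read-only file mapping.
struct MappedFile
{
    const char* data;
    size_t size;
};

bool mapFile(const char* path, MappedFile* out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return false;

    struct stat st;
    if (fstat(fd, &st) != 0)
    {
        close(fd);
        return false;
    }

    void* ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed

    if (ptr == MAP_FAILED)
        return false;

    out->data = (const char*)ptr;
    out->size = (size_t)st.st_size;
    return true;
}
```

The mapped pointer can then be handed to the glTF loader as if the whole file had been read into memory.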

&lt;p&gt;Finally, one other critical thing we need to do is to &lt;a href=&quot;https://meshoptimizer.org/#indexing&quot;&gt;reindex the meshes&lt;/a&gt;; the source glTF file here has some very large meshes with very inefficient indexing (e.g. 90M vertices for 30M triangles) - in addition to making it more difficult to get high quality simplification, this also hurts our processing times, since, as we’re about to find out, the number of vertices in the mesh is sometimes important.&lt;/p&gt;

&lt;p&gt;With these adjustments, a small program, when run on Linux and given 16 threads, completes the processing of the file in ~9m 20s, using ~54.6 GB RAM. This is using rasterization-optimized setup; if we switch to the &lt;a href=&quot;https://meshoptimizer.org/#clustered-raytracing&quot;&gt;new raytracing-optimized clusterizer&lt;/a&gt;, we get ~7m 10s and ~57.6 GB RAM.&lt;/p&gt;

&lt;p&gt;On one hand, this is not that bad! On the other hand, 7-9 minutes is still quite a while; getting a cup of coffee doesn’t take that long. It’s time to see if we can improve on this.&lt;/p&gt;

&lt;h1 id=&quot;sparsity-woes&quot;&gt;Sparsity woes&lt;/h1&gt;

&lt;p&gt;Using the excellent &lt;a href=&quot;https://superluminal.eu/&quot;&gt;Superluminal&lt;/a&gt; profiler, we can attempt to understand what is taking time and how we can make it better. First, let’s run both rasterization and raytracing builds and see if we can find any obvious hotspots…&lt;/p&gt;

&lt;p&gt;Rasterization:
&lt;img src=&quot;/images/zorah_3.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Raytracing:
&lt;img src=&quot;/images/zorah_4.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Hmm, that’s an awful lot of time to take for a memset! (we’ll come back to other issues here later)&lt;/p&gt;

&lt;p&gt;What happens here is that both clusterizers use an array indexed by the vertex index to track whether a vertex is assigned to the current meshlet. This saves us the trouble of having to look through the 64-128 vertices when trying to see if adding a triangle to the meshlet would increase the vertex count. Unfortunately, this code:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;memset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;used&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;short&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… is only fast as long as the number of vertices is quite small - not so when we are repeatedly clusterizing subsets of a 30M triangle mesh! Curiously, a similar problem, but much less severe, existed in the simplifier too - as part of the work to make meshoptimizer friendlier towards clustered LOD use cases back in 2024, I’ve added a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_SimplifySparse&lt;/code&gt; flag that assumes the input index buffer is a small subset of the mesh, and tries to avoid doing O(vertex_count) work at all costs… except that, too, had a small remaining issue, where it initialized a bit array for similar filtering:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;memset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, 1 bit per vertex is much cheaper to fill than 16… but this still adds up when working with meshes approaching 100M triangles. Previously, the largest single mesh I’ve tested this code on was 6M triangles, 3M vertices - an order of magnitude smaller than individual meshes in this scene.&lt;/p&gt;

&lt;p&gt;There are some ways to make this code more independent of the number of vertices - e.g. dynamically switching to a full hash map - but that carries extra costs and complexity, so for now let’s see what happens if we fix all of the issues by only initializing the array entries used by the index buffer when sparse access (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_count &amp;lt; vertex_count&lt;/code&gt;) is detected. Rerunning the code with these fixes&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, we get 3m 31s for the raster version and 3m 57s for the raytrace version. Progress!&lt;/p&gt;

&lt;p&gt;You will notice that the degree of the gains here does not align with the information the profiler is reporting. There are a few factors that contribute here, for example the profiler has significant overhead in this case which may skew the results; but more importantly, the time distribution the profiler is reporting is for &lt;em&gt;all&lt;/em&gt; the work that happens across &lt;em&gt;all&lt;/em&gt; threads, whereas the wall clock time for the entire processing depends on the slowest thread. Which brings us to…&lt;/p&gt;

&lt;h1 id=&quot;balancing-threads&quot;&gt;Balancing threads&lt;/h1&gt;

&lt;p&gt;Instead of looking at the distribution of functions that take time, let’s instead focus on whether we are using threads well. When running the executable from the terminal, you can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/usr/bin/time -v&lt;/code&gt; to get the CPU% the command took; for us this is 1240-1260% depending on the mode we’re running in - in other words, we are using a little more than 12 threads’ worth of aggregate compute.&lt;/p&gt;

&lt;p&gt;Let’s use Superluminal to look at the results more closely:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_5.png&quot;&gt;&lt;img src=&quot;/images/zorah_5.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;… ah yeah, this is not great. If we look at the distribution of the number of triangles per mesh in this scene, we will see that there’s a significant imbalance: a few meshes are in the tens of millions of triangles, but most meshes don’t have nearly as many. If we get unlucky, we may start processing large meshes much later into the process if they aren’t first in line to be queued for the thread pool; here we can see that in the “overhang”, there’s indeed a large mesh that takes ~48s just to build the clusters for the first DAG level. We need to be processing meshes like this first.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_6.png&quot;&gt;&lt;img src=&quot;/images/zorah_6.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While fully general solutions to scheduling problems like this are very complicated and may or may not work well, fortunately we don’t need a general solution. The time it takes to process one mesh is a function of the number of triangles, so we can simply sort the meshes by triangle count in decreasing order. This ensures that we’ll process the most expensive meshes first.&lt;/p&gt;
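&lt;p&gt;The sort itself is a one-liner in spirit; a minimal sketch (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MeshJob&lt;/code&gt; record is hypothetical):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-mesh work record; only the triangle count matters for
// ordering, since processing cost is roughly proportional to it.
struct MeshJob
{
    int mesh_index;
    size_t triangle_count;
};

// Schedule the most expensive meshes first: a decreasing sort ensures the
// large meshes start early, and small meshes fill the gaps at the tail.
void sortJobs(std::vector<MeshJob>& jobs)
{
    std::sort(jobs.begin(), jobs.end(), [](const MeshJob& a, const MeshJob& b) {
        return a.triangle_count > b.triangle_count;
    });
}
```

This is the classic longest-processing-time-first heuristic, which works well precisely because the workload is many independent jobs with a predictable cost proxy.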

&lt;p&gt;This would also be a good time to mention the memory limits. Experimentally, we now know that it takes ~60 GB to process this scene on 16 threads - part of the reason why our processing is that much faster is that it takes less memory, allowing us to scale to more threads. However, what if the system we need to run on only has 40 GB RAM?&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; Ideally, you’d use a limiter that only allows a certain fixed number of triangles to be processed “at once”; before running the next mesh, you’d check whether the running total is at zero or would stay under the limit, and wait otherwise. This can be implemented using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; (with yields/sleeps to avoid burning CPU power unnecessarily - although in this case it’s really a stopgap, and we’d prefer to burn all the CPU power available to us, thank you very much!), or a counting semaphore. Of course, for the purpose of this scene we’ll set the memory limit to 60+ GB to make sure we don’t throttle the execution - 192 GB RAM is quite spacious after all.&lt;/p&gt;
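&lt;p&gt;A sketch of such a limiter (illustrative, not the actual code): each worker reserves a mesh’s triangle count before processing and releases it afterwards; a mesh is admitted when the budget has room, or when nothing else is in flight, so that a single huge mesh can never stall the queue forever.&lt;/p&gt;

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical triangle budget shared by worker threads.
struct TriangleBudget
{
    std::atomic<size_t> inflight;
    size_t limit;

    explicit TriangleBudget(size_t limit_) : inflight(0), limit(limit_) {}

    void acquire(size_t triangles)
    {
        for (;;)
        {
            size_t current = inflight.load();

            // Admit if we fit under the limit, or unconditionally if nothing
            // is in flight - guaranteeing forward progress for huge meshes.
            if (current == 0 || current + triangles <= limit)
            {
                if (inflight.compare_exchange_weak(current, current + triangles))
                    return;
                continue; // lost the race; retry with the fresh value
            }

            // Stopgap: sleep instead of spinning while we wait for room.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    void release(size_t triangles)
    {
        inflight.fetch_sub(triangles);
    }
};
```

A counting semaphore (C++20 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::counting_semaphore&lt;/code&gt;) can replace the sleep loop, at the cost of having to quantize triangle counts into tokens.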

&lt;p&gt;Anyhow, let’s sort the meshes and rerun the code. Here’s the new thread schedule:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_7.png&quot;&gt;&lt;img src=&quot;/images/zorah_7.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;… sweet. While there are still a few little gaps here and there in the schedule, we now see the thread execution being perfectly balanced across 16 threads - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/usr/bin/time&lt;/code&gt; reports 1574% CPU utilization. Front-loading large meshes means that smaller meshes can fill the gaps at the end fairly efficiently. Of course, if the input scene just has one or two meshes, our parallelism strategy will need to change - but for this scene, “external” parallelism where the axis is mesh count is the best as it allows us to share no data between different threads.&lt;/p&gt;

&lt;p&gt;The execution time is now much better: 2m 56s for rasterization and 3m 07s for raytracing. Curiously, the peak memory consumption is actually a little lower (at ~45 GB for the raytracing version instead of ~54 GB before the sort). This is not very intuitive - normally you’d expect the peak memory consumption to be reached when each thread is processing one of the largest meshes, which is what we’re doing here - but there’s probably some explanation that I’m missing right now; this will certainly depend on the particulars of the system allocator.&lt;/p&gt;

&lt;p&gt;We’ve come quite a long way; ~3 minutes of processing time is quite a respectable number even though we’re not serializing the resulting data. That’s it then, see you next time!&lt;/p&gt;

&lt;h1 id=&quot;faster-er-clusterization&quot;&gt;Faster-er clusterization&lt;/h1&gt;

&lt;p&gt;… of course we’re not done. Coincidentally, right before NVIDIA released the new asset files, I had been working on performance improvements for both clusterizers. All of the results so far have been presented using meshoptimizer v0.25 (plus sparsity fixes), but actually we need to be testing on the latest master, which contains two important improvements to clusterizer performance.&lt;/p&gt;

&lt;p&gt;For the raster-optimized clusterizer, some internal tree-searching functions would, in certain cases, repeatedly search over the same data. I won’t go into too much detail as this post is getting long as it is, and the issue doesn’t affect these levels as acutely (fixing it saves ~3%); from now on, let’s focus on raytracing-optimized structures. Profiling the current code that takes ~3m 07s, we still see the new spatial clusterizer (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshletsSpatial&lt;/code&gt;) being responsible for two-thirds of the runtime. Fortunately, this is a case where we can point to a single function as the source of most of our problems:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/zorah_8.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Conceptually, the core of the spatial clusterizer is quite close to a sweep BVH builder. For each level of the tree, we need to determine the best splitting plane; to do that, we analyze the cost of putting a splitting plane through the centroid of each triangle along each of the cardinal axes. That cost can be computed by accumulating the bounding boxes of the triangles six times - three axes times two directions, left and right - and taking the surface area of the resulting AABBs. While there’s much more to the algorithm itself, thankfully the complex external logic doesn’t contribute much to the runtime.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;bvhComputeArea&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BVHBox&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boxes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;BVHBox&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accuml&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FLT_MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FLT_MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FLT_MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FLT_MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FLT_MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FLT_MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;BVHBox&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accumr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accuml&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;areas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boxMerge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accuml&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boxes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;areas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boxMerge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accumr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boxes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]);&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
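
&lt;p&gt;To illustrate how these accumulated areas get used - this is a simplified sketch of the general shape, not meshoptimizer’s exact cost function - a surface-area-heuristic style split selection over the prefix/suffix areas boils down to a single pass:&lt;/p&gt;

```cpp
#include <cfloat>
#include <cstddef>

// Sketch: areas[0..count) holds the surface area of the merged first i+1 boxes
// in sweep order, and areas[count..2*count) the area of the merged last i+1
// boxes, matching the layout bvhComputeArea produces. Splitting after the
// first i boxes costs (left area * i) + (right area * (count - i)).
size_t bvhPickSplit(const float* areas, size_t count)
{
    size_t best = 1;
    float bestCost = FLT_MAX;

    for (size_t i = 1; i < count; ++i)
    {
        // left covers boxes [0, i), right covers boxes [i, count)
        float cost = areas[i - 1] * float(i) + areas[2 * count - 1 - i] * float(count - i);

        if (cost < bestCost)
        {
            bestCost = cost;
            best = i;
        }
    }

    return best;
}
```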

&lt;p&gt;This case is a little curious because the performance characteristics of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bvhComputeArea&lt;/code&gt; change at different levels of the processing, making analysis complicated. When clusterizing large meshes, initial calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bvhSplit&lt;/code&gt; - which is a recursive function - end up processing the entire mesh with the locality of AABB traversal being Not Ideal. As such, we’d expect that function to be memory bound. When the recursive calls get all the way down to a few thousand triangles, the accesses become highly local because the “active” boxes readily fit into L2 and even L1.&lt;/p&gt;

&lt;p&gt;The reason this matters is that I initially thought I could improve the situation by reducing the amount of memory referenced per box. However, this ended up not dramatically improving the higher levels (presumably because the access locality was still poor) while regressing the lower levels, because storing AABBs in any way other than a few plain floats costs cycles to decode. After a few attempts to use different box representations, I gave up and tried a thing that should not have worked: simply converting the relevant code (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;boxMerge&lt;/code&gt; function used above) to SSE2. A box has two corners that can each be loaded into an SSE2 register; min/max accumulation can use the dedicated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MINPS/MAXPS&lt;/code&gt; instructions; and we can compute the box area with a moderate amount of shuffle crimes (the dedicated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DPPS&lt;/code&gt; dot-product instruction requires SSE4.1). The same can then be done with NEON, in case you are using ARM servers for content processing or for some strange reason running clustered raytracing acceleration code on a Mac.&lt;/p&gt;
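
&lt;p&gt;A rough sketch of the idea (the box is padded to 4 floats per corner here for simplicity, and the actual library code is more careful about the layout):&lt;/p&gt;

```cpp
#include <emmintrin.h> // SSE2

// Illustrative sketch, not the library's code: the box is padded to 4 floats
// per corner so full-width loads/stores stay in bounds; lane 3 carries
// garbage that we never read.
struct BVHBoxSSE { float min[4]; float max[4]; };

// Merges box into accum and returns half the surface area of the result.
inline float boxMergeAreaSSE(BVHBoxSSE& accum, const BVHBoxSSE& box)
{
    // accumulate the merged bounds with the dedicated min/max instructions
    __m128 bmin = _mm_min_ps(_mm_loadu_ps(accum.min), _mm_loadu_ps(box.min));
    __m128 bmax = _mm_max_ps(_mm_loadu_ps(accum.max), _mm_loadu_ps(box.max));

    _mm_storeu_ps(accum.min, bmin);
    _mm_storeu_ps(accum.max, bmax);

    // half area = dx*dy + dy*dz + dz*dx: rotate the extent vector, multiply
    // to get the three products, then sum the low three lanes
    __m128 d = _mm_sub_ps(bmax, bmin);
    __m128 dr = _mm_shuffle_ps(d, d, _MM_SHUFFLE(3, 0, 2, 1)); // (dy, dz, dx, _)
    __m128 p = _mm_mul_ps(d, dr);

    __m128 s = _mm_add_ss(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)));
    s = _mm_add_ss(s, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2)));

    return _mm_cvtss_f32(s);
}
```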

&lt;p&gt;The &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/ab3a8418481a9a02fbc9120f12c6ecc670bdcdb2/src/clusterizer.cpp#L763-L781&quot;&gt;resulting SIMD code&lt;/a&gt; is quite straightforward and is only 20 lines of code per architecture. It’s not the world’s best SIMD code: we are only using 3 floats’ worth of computation even though the hardware could use much wider vectors, but unfortunately it’s difficult to rearrange the data to make the layout SIMD-optimal as the order of boxes has to change too frequently. Still, if we rerun the code, we go from 3m 07s to 2m 51s - ~9% speedup overall!&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; This brings our raytracing-optimized code in line with rasterization-optimized, but we’re not quite done yet.&lt;/p&gt;

&lt;p&gt;As mentioned, the earlier levels of the recursion are likely hitting a memory subsystem limitation, as they end up pulling a lot of bounding boxes into the caches from all over memory. It stands to reason that if the bounding box order in memory - which matches the triangle order in the input index buffer - were more coherent, we might see a further speedup.&lt;/p&gt;

&lt;p&gt;Indeed, what we &lt;em&gt;can&lt;/em&gt; do is sort the triangles spatially using a Morton order - conveniently, meshoptimizer provides a function that does just that, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_spatialSortTriangles&lt;/code&gt;. Calling this function has a cost - however, as long as the gains in clusterization time outweigh the extra effort of sorting the triangles, it should still be a good idea. Trying this on the scene gets us to 2m 44s - a further ~5% speedup for a single extra line. Nice!&lt;/p&gt;
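
&lt;p&gt;For the curious, the core of a Morton-order sort is bit interleaving: quantize each triangle centroid to a small grid and interleave the coordinate bits so that spatially close triangles get numerically close sort keys. A minimal sketch of the classic 10-bit-per-axis construction (not meshoptimizer’s implementation):&lt;/p&gt;

```cpp
#include <cstdint>

// Spread the low 10 bits of v so there are two zero bits between each bit;
// this is the standard "expand bits" step of 3D Morton code construction.
static uint32_t expandBits(uint32_t v)
{
    v = (v | (v << 16)) & 0x030000FFu;
    v = (v | (v << 8)) & 0x0300F00Fu;
    v = (v | (v << 4)) & 0x030C30C3u;
    v = (v | (v << 2)) & 0x09249249u;
    return v;
}

// Interleave three 10-bit coordinates into a 30-bit Morton key; sorting
// triangles by the key of their quantized centroid yields a spatially
// coherent order.
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}
```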

&lt;h1 id=&quot;caching-allocations&quot;&gt;Caching allocations&lt;/h1&gt;

&lt;p&gt;It’s time to tackle the final boss: all of the aforementioned functions need to allocate some memory for the processing. Given 16 threads that allocate sizeable chunks of memory, an ideal allocator would figure out how to keep some amount of memory in thread-local buffers to avoid allocations from one thread contending with allocations from another thread.&lt;/p&gt;

&lt;p&gt;Unfortunately, expecting this may be overly optimistic, depending on the platform you’re running on. All of the experiments so far have been run on Linux (using the stock allocator without any extra configuration). And while in general we’re getting very reasonable performance with little contention, even on Linux there are occasional “red” spots in the thread utilization chart, which indicate that a thread is busy waiting - and if we check, it’s indeed waiting on a different thread to service an allocation.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_9.png&quot;&gt;&lt;img src=&quot;/images/zorah_9.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m a little hesitant to draw specific conclusions because under a heavy thread load, Superluminal skews the timings enough that I worry about interference between the profiler and the results. However, we can instead switch to a platform where the stock allocator is not very high quality - Windows - and observe the bleak thread utilization picture:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_10.png&quot;&gt;&lt;img src=&quot;/images/zorah_10.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What happens here is an unfortunate interaction between multi-threaded allocations and the default large-block policy. Large blocks bypass the heap and are allocated using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VirtualAlloc&lt;/code&gt;; memory allocated this way is quite expensive to work with initially&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;, so repeated allocations and deallocations cause performance problems. And because multiple threads contend over the same heap mutex, the resulting throughput suffers very significantly.&lt;/p&gt;

&lt;p&gt;Fortunately, there’s a simple solution to this problem: just use a per-thread arena and route the allocations to it if they fit. meshoptimizer exposes an easy way to globally override the allocations, and guarantees that allocation/deallocation callbacks will be called in a stack-like manner. This makes it easy to implement a thread-local cache: pre-allocate a chunk of memory, say, 128 MB; allocate out of it using a bump allocator or fall back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;; deallocation can check if the pointer belongs to the thread-local arena and if it doesn’t, fall back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free&lt;/code&gt;.&lt;/p&gt;
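
&lt;p&gt;A minimal sketch of such a cache (the names here are mine; the linked gist has the full version). The stack-like deallocation order guarantee means that freeing a block sitting on top of the arena can simply rewind the bump pointer:&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative sketch, not the production code from the gist: a per-thread
// bump arena with malloc/free fallback for blocks that don't fit.
const size_t kArenaSize = 128 * 1024 * 1024;

thread_local char* arena = nullptr;
thread_local size_t arenaOffset = 0;

void* cachedAllocate(size_t size)
{
    if (!arena)
        arena = static_cast<char*>(malloc(kArenaSize));

    size = (size + 63) & ~size_t(63); // keep blocks cache-line aligned

    if (arena && arenaOffset + size <= kArenaSize)
    {
        void* result = arena + arenaOffset;
        arenaOffset += size;
        return result;
    }

    return malloc(size); // doesn't fit into the arena
}

void cachedDeallocate(void* ptr)
{
    char* p = static_cast<char*>(ptr);

    if (arena && p >= arena && p < arena + kArenaSize)
        arenaOffset = size_t(p - arena); // valid because frees are LIFO
    else
        free(ptr);
}
```

&lt;p&gt;Assuming the standard override hook, this would then be installed once per process via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_setAllocator(cachedAllocate, cachedDeallocate)&lt;/code&gt;.&lt;/p&gt;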

&lt;p&gt;Doing this on Linux provides a modest further performance improvement; our code now runs in ~2m 35s - around a 3.5x speedup from our initial baseline, and significantly better than ~30 minutes. On Windows, the code so far ran at 4m 20s - and with the thread cache we get 2m 38s, in line with our Linux version! The utilization looks much better as well - note that we’re still using the global allocator for some STL code that’s part of the example (but can be replaced in the future), hence the remaining imperfections.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_11.png&quot;&gt;&lt;img src=&quot;/images/zorah_11.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With some extra effort it’s possible to generalize the solution so that it’s easy to integrate on top of the default allocator; I’m planning to add this in a future meshoptimizer version, however since &lt;a href=&quot;https://github.com/zeux/meshoptimizer/issues/940&quot;&gt;meshoptimizer will be 1.0 this year&lt;/a&gt; this will have to wait until the next version after that - in the meantime, the code &lt;a href=&quot;https://gist.github.com/zeux/6a282d99f10a76d67e07ac9104561335&quot;&gt;is available under the MIT license&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;results&quot;&gt;Results&lt;/h1&gt;

&lt;p&gt;Are we done now? Well, more or less :) Most of the improvements, as well as a few improvements I didn’t think were of general enough interest to include, have been incorporated into the demo code that’s now distributed as a single-header “micro-library” via the meshoptimizer repository, &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/master/demo/clusterlod.h&quot;&gt;clusterlod.h&lt;/a&gt;. The code is designed to be easy to modify and adapt, but also be easy to plug in as is.&lt;/p&gt;

&lt;p&gt;Out of the aforementioned performance improvements, the call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_spatialSortTriangles&lt;/code&gt; can be made externally if necessary, and the thread cache work has not been submitted yet. It will likely be included into meshoptimizer after v1.0 is released later this year, as it’s generally useful for improving content pipeline performance, with or without clustered LOD.&lt;/p&gt;

&lt;p&gt;And I thought this is more or less where things would end, but this example code has proven useful enough that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_lod_clusters&lt;/code&gt;, the sample that spawned all this work, integrates it as an option! You can select it by passing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--nvclusterlod 0&lt;/code&gt;. The &lt;a href=&quot;https://github.com/nvpro-samples/vk_lod_clusters/blob/945d6c87f7b85c6239b1eb11515ae8f7dd8e2fc8/src/meshopt_clusterlod.h&quot;&gt;version that’s part of NVIDIA’s repository&lt;/a&gt; changes the example code to implement optional “internal” parallelism - the ability to generate a cluster DAG from a single mesh using multiple threads. This is not something that is necessary for Zorah or other large scenes like it - as mentioned, “external” parallelism provides a more natural and performant axis here - but it is crucial for generating a DAG for a single large mesh more quickly.&lt;/p&gt;

&lt;p&gt;Because of their work I can now show another screenshot of the same Zorah asset, but this time running inside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_lod_clusters&lt;/code&gt; sample using the data generated by meshoptimizer’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clusterlod.h&lt;/code&gt;, rendered with an approximately 2 GB geometry pool, in 26ms when using ray tracing and 16ms when using rasterization, on NVIDIA GeForce 3050. Not bad for a GPU that draws all its power from the motherboard without needing a separate power cable!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/zorah_12.jpg&quot;&gt;&lt;img src=&quot;/images/zorah_12.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_lod_clusters&lt;/code&gt; the processing is structured a little differently; as a result, it generates a slightly different amount of work compared to the simpler demo I’ve been using for profiling, and it also includes data serialization, so it runs a little slower - ~3m 20s with all the mentioned optimizations included. In that time it performs all the processing described above and generates the 62 GB cache file - including, hilariously, the almost 10 seconds it takes Linux to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fopen()&lt;/code&gt; this file for &lt;em&gt;writing&lt;/em&gt;, as it takes a while to discard the existing file contents from the file system cache if they’re present! Since I’m also running my 7950X in eco mode, we’ll call it around 3 minutes, give or take.&lt;/p&gt;

&lt;p&gt;There are still opportunities for improvement, however. Notably, to be able to stream and display a scene like this efficiently, you need a separate hierarchical acceleration structure that can quickly determine the set of clusters to render; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_lod_clusters&lt;/code&gt; &lt;a href=&quot;https://github.com/nvpro-samples/vk_lod_clusters/blob/945d6c87f7b85c6239b1eb11515ae8f7dd8e2fc8/src/scene_cluster_lod.cpp#L370&quot;&gt;manages to do this using existing meshopt_ functions&lt;/a&gt; but that code is not part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clusterlod.h&lt;/code&gt; yet. &lt;del&gt;Also the default cluster partitioning algorithm used in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clusterlod.h&lt;/code&gt; to create groups of clusters is currently only willing to group clusters that are topologically adjacent (as in, they share vertices); this can sometimes result in DAGs that have too many roots, as the groups aren’t merged aggressively enough - and should be improved in the future as well (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_lod_clusters&lt;/code&gt; falls back to a different &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_&lt;/code&gt; partitioning algorithm if it detects this case).&lt;/del&gt; (&lt;em&gt;Update:&lt;/em&gt; as of a few days after this blog was published, this is now fixed in the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_partitionClusters&lt;/code&gt; in meshoptimizer so no further tweaks should be necessary!)&lt;/p&gt;

&lt;p&gt;But I’m happy to see a meaningful milestone for this code that started as a basic playground for clusterization algorithms.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to Christoph Kubisch for discussions, feedback, and vk_lod_clusters integration, to NVIDIA for sharing research, code and assets openly, and to Valve for sponsoring meshoptimizer development.&lt;/em&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This example code, as well as much of early developments here, was motivated by &lt;a href=&quot;https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/&quot;&gt;improving&lt;/a&gt; Bevy’s &lt;a href=&quot;https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/&quot;&gt;virtual geometry system&lt;/a&gt;. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;From here on, all testing results will be on my desktop system - AMD Ryzen 7950X (16C/32T), 192 GB RAM (DDR5-4800), NVIDIA GeForce 3050 8 GB (… my main GPU is AMD Radeon 7900 GRE, but the demo in question relies on NVIDIA specific extensions). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Worth noting that the original Unreal Engine scene was likely assembled out of individual mesh assets that were individually imported and exported, which probably made it possible to work with on more reasonable hardware configurations… assuming you didn’t need to re-process the entire scene at once. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;NVIDIA also released a library, &lt;a href=&quot;https://github.com/nvpro-samples/nv_cluster_builder&quot;&gt;nv_cluster_builder&lt;/a&gt;, which can perform this RT-aware clusterization - the new meshoptimizer algorithm uses similar ideas but fairly different implementation. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Unless mentioned otherwise, all improvements to the library code have already been committed to meshoptimizer - so as long as you use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;master&lt;/code&gt; branch you are already getting the performance improvements. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Notwithstanding the fact that processing a 36 GB glTF with just 40 GB RAM is &lt;em&gt;probably&lt;/em&gt; not the best idea. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I don’t have a sufficiently powerful Mac to run the entire workload, but curiously on Apple M4 I’ve measured &lt;em&gt;significant&lt;/em&gt; speedup from the similar change that converted &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;boxMerge&lt;/code&gt; to NEON - on the order of 2x+ speedup for clusterization alone. The gains on x64 are still significant albeit more muted. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is similar to a problem I ran into a decade ago, described in &lt;a href=&quot;/2014/12/21/page-fault-queue/&quot;&gt;A queue of page faults&lt;/a&gt; - since then it appears that Windows kernel got much better at processing soft faults, but the underlying problem remains and some costs are likely exacerbated by mitigations for various speculative attacks. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2025/09/30/billions-of-triangles-in-minutes/</link>
			<guid isPermaLink="true">https://zeux.io/2025/09/30/billions-of-triangles-in-minutes/</guid>
		</item>
		
		<item>
			<title>Do not disrespect the fractal</title>
			<description>&lt;p&gt;Some people have a misconception that in software engineering, skill stops mattering for code quality from some level of seniority, and all of the value add shifts to architecture, high level design decisions, problem setting, or guiding others. And as long as you have staff/senior engineers design a system and oversee mid-level - and, for some work, junior - engineers, the output quality is the same as if you got senior folks to write everything instead.&lt;/p&gt;

&lt;p&gt;What I believe, however, is that there’s not as much macro vs micro distinction as people want to imagine - software is fractal. Experts will make micro decisions, regular decisions and macro decisions, that all together combine into high quality software. Decisions at every level influence the quality and, often, influence levels above and below. You &lt;em&gt;can&lt;/em&gt; outsource lower levels to non-experts - given time or budget constraints, you may &lt;em&gt;have&lt;/em&gt; to - but you are not getting the same result.&lt;/p&gt;

&lt;p&gt;You also can’t validate micro decisions from a macro vantage point. The process of making the micro decisions shapes your understanding of the problem; without having solved the problem from the ground up, you don’t have precise visibility into the higher levels. Quality engineering involves constantly shifting between the levels, validating the results and structure by looking at how much pressure propagates to neighboring layers and where things bend vs break.&lt;/p&gt;

&lt;p&gt;This is why you shouldn’t replace engineers with LLMs even if you create a great plan and review the code. This is why you will not get software to improve if you outsource layers to non-experts. And this is also, I believe, why large teams routinely fail to make excellent software.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I’ve been planning to write more, shorter, posts on this blog. This one has been in my head for a few weeks now; it’s a little too long to be a tweet, so here you go!
If you were hoping for more technical content, I’ve been busy working on &lt;a href=&quot;https://github.com/zeux/meshoptimizer/releases/tag/v0.25&quot;&gt;meshoptimizer v0.25&lt;/a&gt;, so check that out instead :)&lt;/p&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Fri, 22 Aug 2025 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2025/08/22/do-not-disrespect-the-fractal/</link>
			<guid isPermaLink="true">https://zeux.io/2025/08/22/do-not-disrespect-the-fractal/</guid>
		</item>
		
		<item>
			<title>Load-store conflicts</title>
			<description>&lt;p&gt;&lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; implements several &lt;a href=&quot;https://meshoptimizer.org/#vertexindex-buffer-compression&quot;&gt;geometry compression&lt;/a&gt; algorithms that are designed to take advantage of redundancies common in mesh data and decompress quickly - targeting many gigabytes per second in decoding throughput. One of them, index decoder, has seen a significant and unexpected variance in performance across multiple compilers and compiler releases recently; upon closer investigation, the differences can mostly be attributed to the same microarchitectural detail that is not often talked about. So I thought it would be interesting to write about it.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;algorithm&quot;&gt;Algorithm&lt;/h1&gt;

&lt;p&gt;The encoding in this case is specialized to index buffers that store triangle lists; every triangle is represented by three vertex indices, so the job of the decoder is to compute the three indices and write them to the output buffer. The encoding scheme takes advantage of multiple sources of redundancy&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; present in carefully optimized index buffers - for the sake of the performance investigation here, we’ll omit most of the details except for a central intermediate structure used by both the encoder and decoder: the edge FIFO.&lt;/p&gt;

&lt;p&gt;The edge FIFO contains up to 16 triangle edges - pairs of 32-bit indices - and the encoded form of each triangle can reference a previously encountered edge. Thus, to decode the triangle, we need to read the recently seen edge from the FIFO like so:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;edgefifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edgefifooffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;edgefifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edgefifooffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… then read and decode the third vertex, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt;, and write two new edges of the triangle, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ca&lt;/code&gt;, using a simple function:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;pushEdgeFifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EdgeFifo&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… and finally write the triangle (indices &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a, b, c&lt;/code&gt;) to the output buffer.&lt;/p&gt;

&lt;p&gt;The remaining details are not material to the performance differences we are going to discuss: there are a number of less commonly encountered paths through the decoder, the third vertex can be encoded in several different ways, and so on. For simplicity, let’s focus on the FIFO in question and the code that reads from and writes to it. The FIFO is simply a 16-element array with two integers per element:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EdgeFifo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code is simple and straightforward; what can possibly go wrong?&lt;/p&gt;
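&lt;p&gt;Putting the pieces together, the hot path of the decode loop can be sketched in plain C++ as follows. This is an illustrative sketch rather than the actual meshoptimizer source: decodeTriangleFromEdge and its parameters are hypothetical names, and the decoding of the third vertex is elided.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned int EdgeFifo[16][2];

void pushEdgeFifo(EdgeFifo fifo, unsigned int a, unsigned int b, size_t& offset)
{
    fifo[offset][0] = a;
    fifo[offset][1] = b;
    offset = (offset + 1) & 15;
}

// Hypothetical sketch of the edge-reuse path: fetch edge ab that sits fe
// entries behind the write cursor, push the two new edges bc and ca, and
// emit the triangle. The third vertex c is assumed to be decoded already.
void decodeTriangleFromEdge(EdgeFifo edgefifo, size_t& edgefifooffset,
                            unsigned int fe, unsigned int c,
                            unsigned int* destination)
{
    unsigned int a = edgefifo[(edgefifooffset - 1 - fe) & 15][0];
    unsigned int b = edgefifo[(edgefifooffset - 1 - fe) & 15][1];

    pushEdgeFifo(edgefifo, b, c, edgefifooffset);
    pushEdgeFifo(edgefifo, c, a, edgefifooffset);

    destination[0] = a;
    destination[1] = b;
    destination[2] = c;
}
```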

&lt;h1 id=&quot;baseline-clang-20-x86_64&quot;&gt;Baseline: clang-20 x86_64&lt;/h1&gt;

&lt;p&gt;The decoder is used at runtime to decompress the data; when using 32-bit indices, the output form of each triangle is three 32-bit indices, or 12 bytes of data. Performance of the decoding loop is critical; here and below, we will measure it as gigabytes per second&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; written (assuming 32-bit indices), and as cycles per triangle. If 16-bit indices are used, most of the code except for writing the triangle runs the same instructions and costs the same, so the expected effective bandwidth is approximately halved, but the cycles per triangle stay the same.&lt;/p&gt;
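&lt;p&gt;The two metrics are tied together by the 12 bytes of output per triangle; the conversion used throughout this post can be written out as a small helper (illustrative, hypothetical names):&lt;/p&gt;

```cpp
#include <cassert>

// Each decoded triangle produces 3 x 32-bit indices = 12 bytes of output,
// so output bandwidth and cycles per triangle are two views of the same
// measurement: cycles/triangle = GHz * 12 / (GB/s), the 1e9 factors cancel.
double cyclesPerTriangle(double gigabytesPerSecond, double frequencyGhz)
{
    const double bytesPerTriangle = 3 * 4; // 3 indices, 4 bytes each
    return frequencyGhz * bytesPerTriangle / gigabytesPerSecond;
}
```

&lt;p&gt;For example, 6.6 GB/s at 5.47 GHz works out to roughly 9.9 cycles per triangle.&lt;/p&gt;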

&lt;p&gt;In production, we’d expect this code to be compiled with clang (when targeting mobile or console hardware, or macOS and, in some cases, Windows) or MSVC (when targeting Windows or Xbox). We’ll ignore MSVC here: it has some other challenges with this loop, but they are outside the scope of this post. So, let’s look at how clang (using clang-20) compiles accesses to this array and how fast the loop runs. For simplicity, we’ll ignore most of the loop and just focus on the instructions that read or write the FIFO or the triangle output buffer&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;; read edge ab from FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;r11d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write edges bc and ca&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ebp&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;r11d&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ebp&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write triangle abc&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;r11d&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ebp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The code is straightforward and easy to understand: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; are read into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r11d&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edx&lt;/code&gt; registers; code not shown here reads &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ebp&lt;/code&gt;; and then we write each pair back into the FIFO. Because the FIFO accesses are done modulo 16, the writes use two separate indices: these are often sequential, but if the first edge is written to index 15, the next edge will go to index 0, which complicates the address math a bit. This is roughly what I would expect to see from a competent compiler; the index codec was originally developed in 2017, and my recollection is that this is similar to what the compilers at the time produced as well.&lt;/p&gt;

&lt;p&gt;When running this code on typical dense/regular meshes on an AMD Ryzen 7950X, it decodes at approximately 6.6 GB/s; at 5.47 GHz, this corresponds to ~9.9 cycles to decode each triangle. Pretty good!&lt;/p&gt;

&lt;h1 id=&quot;store-to-load-forwarding&quot;&gt;Store-to-load forwarding&lt;/h1&gt;

&lt;p&gt;To understand the performance characteristics of this code, it’s important to note that FIFO elements that are written on one iteration will often be read on the next iteration. A triangle is likely to share an edge with one of the triangles seen very recently; this allows a fairly small FIFO to still capture most of the edge reuse.&lt;/p&gt;

&lt;p&gt;If you’ve written code that targets the PowerPC generation of game consoles, like Xbox 360 and PlayStation 3, you might be getting worried right about now. If you haven’t, let me introduce you to store buffers and &lt;a href=&quot;https://en.wikipedia.org/wiki/Load-Hit-Store&quot;&gt;Load-Hit-Store&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When the processor executes a store instruction, it’s tempting to assume that the write goes straight to the L1 cache. However, this would be inefficient: a write to L1 cache may require fetching the cache line from the next cache level or even memory, and we are likely to see a write to the same cache line follow soon after. For in-order CPUs, it is critical to amortize the cost of writes in this case, so most in-order CPUs include a special queue, usually called a store buffer; writes go to that queue when the store instruction executes, and eventually make their way to the cache line / memory. For out-of-order CPUs, this is also ubiquitous as store buffers help reduce the amount of time store instructions spend in the retirement queues, and enable executing store instructions speculatively (as the pending stores can be committed or discarded, but all cache writes are final).&lt;/p&gt;

&lt;p&gt;If a store instruction does not update the cache immediately, how do load instructions work? What happens when a load instruction needs to access memory that has a pending store in the store buffer?&lt;/p&gt;

&lt;p&gt;On the in-order PowerPC CPUs in PlayStation 3 / Xbox 360 consoles, this would trigger a condition known as Load-Hit-Store: if the load touched an address that had been written to recently and the write was still in the store buffer (for ~20 cycles), the execution would stall (for ~40 cycles) to give the store enough time to finish and make it to the L2 cache. Needless to say, this was extremely expensive: it was common to see code that spent most of its time in LHS stalls on innocuous instruction sequences like repeatedly incrementing a size field stored inside a structure.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Notably, the load may only check a subset of the store address bits, and assume there’s a match even when there isn’t; my recollection is that on Xbox 360 in particular, only 12 bits of the address were checked, which resulted in 4K LHS aliasing: some patterns that didn’t exhibit a physical load-store overlap would still be subject to the significant penalty! This is similar in modern CPUs as well, but the consequences are much less dire.&lt;/p&gt;
&lt;/blockquote&gt;
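&lt;p&gt;A minimal model of that partial-address check, assuming only the low 12 bits are compared (the exact bit count varies by microarchitecture):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the partial address comparison: two accesses are
// flagged as potentially overlapping whenever their addresses are equal
// modulo 4096, even if they touch entirely different memory.
bool mayAlias4K(uintptr_t loadAddress, uintptr_t storeAddress)
{
    return (loadAddress & 0xfff) == (storeAddress & 0xfff);
}
```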

&lt;p&gt;Fortunately, most CPUs, even old in-order ARM chips that you can still see in very low-end mobile phones, implement a feature called &lt;a href=&quot;https://en.wikipedia.org/wiki/Memory_disambiguation#Store_to_load_forwarding&quot;&gt;store-to-load forwarding&lt;/a&gt;: if a load sees a pending store to the same address in the store buffer, it takes the value from the latest associated entry in the store buffer instead of reading it from the cache. This allows code like the FIFO update above to still run very efficiently, and often is as fast or faster than reading from the cache. Life is good these days.&lt;/p&gt;

&lt;h1 id=&quot;pleasant-surprise-gcc-14-x86_64&quot;&gt;Pleasant surprise: gcc-14 x86_64&lt;/h1&gt;

&lt;p&gt;As mentioned above, most production workloads where decoder performance is critical use clang or MSVC as the compiler; gcc is a bit more of an outlier, as it is only used on Linux, and even there commercial games often choose clang to build their Linux versions instead. gcc often trails clang a little on some other parts of the decoder family, which is completely fine and not a cause for concern. That said, ever since switching to Linux as my main OS, I would occasionally profile gcc builds just because it’s the default system-wide compiler.&lt;/p&gt;

&lt;p&gt;So at one point I was pleasantly surprised to discover that gcc (as late as gcc-14, which is the default compiler on the latest Ubuntu) significantly outperforms clang on this code: the clang-built decoder achieves 6.6 GB/s (~9.9 cycles/triangle), while gcc-14 runs this code at ~7.5 GB/s (~8.7 cycles/triangle). That’s a significant improvement! After investigating the differences in the generated code, the gains turned out to be mostly attributable to the same FIFO code, which is compiled very differently:&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;; read FIFO entry twice: as a 64-bit pair (into rbp) and two 32-bit values (xmm)&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;       &lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;QWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rcx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x70&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movd&lt;/span&gt;      &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rcx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x70&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movd&lt;/span&gt;      &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rcx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x74&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; xmm0 contains &apos;c&apos; vertex; create 64-bit pairs bc and ca in xmm1/3&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;punpckldq&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;punpckldq&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write both pairs into two FIFO entries&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movq&lt;/span&gt;      &lt;span class=&quot;kt&quot;&gt;QWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x70&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movq&lt;/span&gt;      &lt;span class=&quot;kt&quot;&gt;QWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x70&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write the triangle abc to output&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;       &lt;span class=&quot;kt&quot;&gt;QWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movd&lt;/span&gt;      &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Instead of simply using separate 32-bit registers, gcc uses vector operations for all of the FIFO update code: instead of two 32-bit writes for each FIFO entry, it synthesizes the 64-bit value from two 32-bit elements in SSE registers and writes the entire pair with a 64-bit store. It also reads the FIFO entry (ab) twice: once as two separate 32-bit elements, and once into a 64-bit GPR. The latter allows the compiler to store “ab” directly to the output buffer when writing the output triangle, although this changes when writing 16-bit indices.&lt;/p&gt;

&lt;p&gt;This approach reduces the number of writes during the loop fairly significantly; it also allows for more instruction-level parallelism, as the rest of the loop (not shown here) uses many integer arithmetic instructions, so the SSE instructions can run in parallel on otherwise idle units. All in all, this reduces the cost of decoding by more than a cycle per triangle: a more than 10% speed improvement!&lt;/p&gt;

&lt;p&gt;Note that the effectiveness of this technique depends on various efficiency properties of the target system; for example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; vertex is copied into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmm0&lt;/code&gt; from a general-purpose register, and that copy has a cost. Reading the FIFO entry twice seems to work out in this case, but it’s unclear what the impact would be on other CPUs. Also, replicating this in C++ code is certainly possible by using SSE intrinsics directly, but that makes the code less portable and somewhat more fragile with respect to performance on MSVC.&lt;/p&gt;
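&lt;p&gt;For reference, the intrinsics version alluded to above might look like this sketch (hypothetical function name; it mirrors the gcc-14 codegen by forming each 64-bit edge pair in an SSE register and writing it with a single store):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2

typedef unsigned int EdgeFifo[16][2];

// Build the bc and ca pairs in xmm registers and write each FIFO entry
// with one 64-bit store instead of two 32-bit stores.
void pushEdgesSSE(EdgeFifo fifo, unsigned int a, unsigned int b, unsigned int c,
                  size_t& offset)
{
    __m128i bc = _mm_set_epi32(0, 0, (int)c, (int)b); // lane 0 = b, lane 1 = c
    __m128i ca = _mm_set_epi32(0, 0, (int)a, (int)c);

    _mm_storel_epi64((__m128i*)fifo[offset], bc);
    offset = (offset + 1) & 15;
    _mm_storel_epi64((__m128i*)fifo[offset], ca);
    offset = (offset + 1) & 15;
}
```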

&lt;p&gt;I tried a similar approach using 64-bit integer registers, and it generated worse code with clang and MSVC. So while the extra performance gains seemed interesting, it was unclear how best to integrate this into the code without risking regressions - I shelved the idea until I could revisit it in the future, and forgot about it…&lt;/p&gt;
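&lt;p&gt;A 64-bit integer variant could look something like this (a sketch with hypothetical names; memcpy avoids aliasing issues and typically compiles down to a single 64-bit store, though the packing below assumes a little-endian target):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef unsigned int EdgeFifo[16][2];

// Pack both vertices of the edge into one 64-bit value and write the FIFO
// entry with a single wide store. On a little-endian target the low half
// lands in fifo[offset][0] and the high half in fifo[offset][1].
void pushEdgeFifoWide(EdgeFifo fifo, unsigned int a, unsigned int b, size_t& offset)
{
    uint64_t pair = (uint64_t)b << 32 | a;
    std::memcpy(fifo[offset], &pair, sizeof(pair));
    offset = (offset + 1) & 15;
}
```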

&lt;h1 id=&quot;unpleasant-surprise-gcc-15-x86_64&quot;&gt;Unpleasant surprise: gcc-15 x86_64&lt;/h1&gt;

&lt;p&gt;… until &lt;a href=&quot;https://gcc.gnu.org/gcc-15/changes.html&quot;&gt;gcc-15 released&lt;/a&gt; last week. I ran a routine benchmark and was surprised to discover that gcc-15 was now producing code that, instead of being a little faster than clang’s, was significantly slower! While gcc-14 code ran at ~7.5 GB/s (~8.7 cycles/triangle), gcc-15 produced code that runs at ~4.8 GB/s (~13.6 cycles/triangle). A rather dramatic 5-cycle regression compared to the previous version.&lt;/p&gt;

&lt;p&gt;I expected some sort of significant difference in branching or loop structure, which are issues I’ve run into on this code in the past, but apparently I hadn’t internalized just how key the FIFO load-store specifically is to this decoder loop. To spare you a few hours of bisection&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; to find the offending gcc commit and compare the loop code alongside performance metrics, let’s just look immediately at the way gcc-15 compiles the FIFO access now:&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;; read FIFO entry as a 64-bit pair&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movq&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;QWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r9&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rcx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write bc edge to FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;pshufd&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xe5&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r9&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ecx&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movd&lt;/span&gt;    &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r9&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write ca edge to FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movd&lt;/span&gt;    &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r9&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;r9&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ecx&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write triangle to output&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;movq&lt;/span&gt;    &lt;span class=&quot;kt&quot;&gt;QWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;kt&quot;&gt;DWORD&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PTR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ecx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This doesn’t look too bad, does it? Gone is the redundant two-way read, so we only read the FIFO data once, into xmm0; we then use that register directly to store the relevant bits into the FIFO, and to store it to the output buffer. There are now four instructions that write to the FIFO instead of two, but only one that reads from it instead of three… all in all, not too bad? Ah, well… it’s time to talk about the title of this post.&lt;/p&gt;

&lt;h1 id=&quot;store-load-conflicts&quot;&gt;Store-load conflicts&lt;/h1&gt;

&lt;p&gt;See, my earlier description of store-to-load forwarding was a little bit hand-wavy: the store ends up in the store buffer, and the load instruction checks the store buffer to see if it can read the data from that buffer instead. However, the store buffer in this case would contain &lt;em&gt;two&lt;/em&gt; separate stores for each FIFO entry, each holding the 32-bit element it wrote. What happens when a single 64-bit load looks at the store buffer and sees two separate entries that, together, would provide the value for that load?&lt;/p&gt;
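&lt;p&gt;In C++ terms, the access pattern gcc-15 generates boils down to the following (an illustrative reduction, not the decoder itself):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two narrow 32-bit stores followed by one wide 64-bit load of the same
// location. Neither store fully contains the load, so on Zen 4 (and many
// other cores) store-to-load forwarding fails and the load must wait for
// both stores to reach the L1 cache.
uint64_t storeLoadConflict(uint32_t a, uint32_t b)
{
    alignas(8) uint32_t entry[2];
    entry[0] = a; // 32-bit store
    entry[1] = b; // 32-bit store

    uint64_t pair;
    std::memcpy(&pair, entry, sizeof(pair)); // 64-bit load spans both stores
    return pair;
}
```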

&lt;p&gt;The answer depends on the microarchitecture, but your baseline expectation should be “nothing good”. Indeed, if we check the &lt;a href=&quot;http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/57647_zen4_sog.pdf&quot;&gt;Zen 4 optimization guide&lt;/a&gt;, we will see (emphasis mine):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The LS unit supports store-to-load forwarding (STLF) when there is an older store that contains &lt;strong&gt;all of the load’s bytes&lt;/strong&gt;, and the store’s data has been produced and is available in the store queue. The load does not require any particular alignment relative to the store or to the 64B load alignment boundary as long as it is fully contained within the store.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, the load is no longer fully contained within either store. What happens, then, is that the store-to-load forwarding mechanism fails, and the load instruction needs to wait for the stores to actually hit the cache. Mercifully, this is not as bad as what used to happen on PowerPC in case of an LHS: the penalty is much smaller, and independent parts of the loop may still proceed. However, at best this limits the throughput of the loop to the latency of an L1 load (7 cycles for an SSE load) plus the latency of the L1 store. The &lt;a href=&quot;https://chipsandcheese.com/p/amds-zen-4-part-2-memory-subsystem-and-conclusion&quot;&gt;Chips and Cheese&lt;/a&gt; post measures the end-to-end penalty at 19 cycles; in our case, the entire loop ends up running at &amp;lt;14 cycles per iteration, as there may be opportunities for partial overlap, and not all triangles need access to edges written by the previous triangle: if the edge we need was written by the triangle before that, the latency may be mostly hidden.&lt;/p&gt;

&lt;p&gt;To confirm that this is what is happening, we can use &lt;a href=&quot;https://www.amd.com/en/developer/uprof.html&quot;&gt;AMD μProf&lt;/a&gt;, which allows us to gather performance counters and attribute them to individual functions. Specifically, the counter in question tracks store-to-load forwarding failures and is known as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bad_Status_2_STLI&lt;/code&gt; (store-to-load interlock). Its description in the &lt;a href=&quot;https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/58550-0.01.pdf&quot;&gt;performance counter reference&lt;/a&gt; is pretty telling: indeed, SIMD code is particularly susceptible to this problem, and it’s a bad idea to use narrow element-by-element stores if there’s a risk of the result being read as a wide vector!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/loadstore_1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The counter is enabled by default in the “Assess Performance (Extended)” set; profiling with that set, we get ~69 STLIs per thousand instructions. The number itself is a little surprising: with ~40 instructions in the loop body, just one of which is the offending load, we’d expect ~25 STLIs per thousand instructions - but we can compare with the gcc-14 binary (~24 STLIs per thousand instructions) or the clang-20 binary (~1 STLI per thousand instructions) and confidently conclude that, indeed, things are much worse now and the STLI count has increased dramatically.&lt;/p&gt;

&lt;p&gt;Curiously, while the clang binary generates essentially no STLIs, the gcc-14 binary generates an appreciable number. This might be due to the “double” access to the FIFO edge, where the loads of individual elements of the edge need to read data from part of the earlier, wider, store - AMD’s manual is silent on this, but Chips and Cheese claims it introduces an extra 6-7 cycle latency. Perhaps the STLI counter counts that case as well, but the latency ends up being hidden by the rest of the decoding, so the actual loop performance doesn’t suffer.&lt;/p&gt;

&lt;p&gt;Unfortunately, in larger loops it’s more difficult to trace this problem back to the specific instructions that cause it. AMD supports “Instruction Based Sampling”, which tracks a number of counters associated with the execution of individual instructions, but it does not support gathering data about STLF issues in particular, and the cache latencies it collects do not allow pinpointing the problem, as the delay is not cache-related.&lt;/p&gt;

&lt;p&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perf&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-e ls_bad_status2.stli_other&lt;/code&gt; produces the following output which makes it a little easier to reason about the cause:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/loadstore_2.jpg&quot; alt=&quot;&quot; width=&quot;450&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, the two instructions with lots of hits are the instruction that immediately follows the problematic load (it is typical for performance sampling to hit instructions that follow the expensive ones), and the instruction that is dependent on it. However, this only works because we already know STLI is the problem; the default profile is much less descriptive with a much longer and less precise &lt;a href=&quot;https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2023-0/hardware-event-skid.html&quot;&gt;skid window&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/loadstore_4.jpg&quot; alt=&quot;&quot; width=&quot;450&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;surprising-reversal-clang-aarch64&quot;&gt;Surprising reversal: clang aarch64&lt;/h1&gt;

&lt;p&gt;While profiling the codecs earlier this year, I also briefly looked at the performance numbers on Apple CPUs. These numbers were impressively high: the same index decoder ran at ~7.3 GB/s on Apple M4 (4.4 GHz), equivalent to ~7.2 cycles/triangle. That suggested code generation was probably reasonable, so I did not look further. This was a mistake, because then two things happened in close proximity:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;gcc-15 was released, significantly degrading performance and causing me to look more into this general area;&lt;/li&gt;
  &lt;li&gt;Xcode 16.3 was released, incorporating clang-17 - which increased performance on this decoder to ~9.8 GB/s (!!!). Clearly clang-16 was not that reasonable after all!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So let’s look closer at what happens on clang when targeting ARM and running on Apple M4.&lt;/p&gt;

&lt;p&gt;When using clang-16 from the older Xcode 16.2, I was getting ~7.2 cycles/triangle as noted above. As on x86_64, we will only look at the code that relates to FIFO access, as that’s the most important part for this investigation. Let’s look at the assembly:&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;; read FIFO entry as a 64-bit pair into SIMD register&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;ldr&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;d0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;x19&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;lsl&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write bc edge to FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;str&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;orr&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;x6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;x6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;st1.s&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;v0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write ca edge to FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;str&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;s0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x19&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;str&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x19&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;; write triangle to output&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;stur&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;d0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;str&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… okay then. We are seeing what is, essentially, the same code we have already seen gcc-15 generate for x86_64: when reading a FIFO entry, we read it into a SIMD register using a 64-bit load; the 32-bit components of that register are then written to two separate FIFO entries, along with the third vertex, as separate 32-bit stores. We have just seen this strategy result in a fairly catastrophic performance cliff on Zen 4, because the CPU can’t forward two separate stores from the store buffer into a single load. Is this, perhaps, not the case for Apple CPUs? Let’s refer to the &lt;a href=&quot;https://developer.apple.com/documentation/apple-silicon/cpu-optimization-guide&quot;&gt;Apple Silicon CPU Optimization Guide&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/loadstore_3.jpg&quot; alt=&quot;&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Interesting! It looks like on Apple chips, specifically on their performance cores, a 64-bit load may source both of its 32-bit halves - or just one - from entries in the store buffer; this would explain why we are seeing solid performance (7.2 cycles/triangle) on M4 even though the code is seemingly inefficient. The manual does say, however, that this may introduce stricter dependencies, and that it does not allow the more efficient single-element store forwarding, so let’s look at how the code and performance change with a newer compiler.&lt;/p&gt;

&lt;p&gt;When using Xcode 16.3 (clang-17&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;), our code runs at ~9.8 GB/s, or ~5.4 cycles/triangle, almost two full cycles faster than the clang-16 binary! And, lo and behold, if we look at the FIFO access we see this:&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;; load FIFO entry into two 32-bit registers&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;ldp&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;w21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;; write bc edge to FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;stp&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;w7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;; write ca edge to FIFO&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;stp&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;w21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;; write triangle to output&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;stp&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;w20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;str&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;w7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;x12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code is, broadly, equivalent to the x86_64 code we’ve seen clang generate - however, it relies on two incredibly useful instructions, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ldp&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stp&lt;/code&gt;, which AArch64 has and x86_64 doesn’t: they issue two loads into two separate registers, or two stores from two separate registers. Presumably, these paired instructions are decoded into two separate micro-operations that execute independently, which eliminates the “complex” store-to-load forwarding cases and allows the code to run at peak efficiency here - roughly half the cycles per triangle of the equivalent code on Zen 4. Unfortunately, Instruments does not seem to expose performance counters relevant to store-to-load forwarding, so it’s difficult to confirm that this is what happens, but the performance speaks for itself - Apple CPUs are impressive.&lt;/p&gt;

&lt;p&gt;Curiously, if we use clang-16 to compile for x86_64, we see the same problematic SIMD code pattern, matching gcc-15; meanwhile, neither clang-15 nor clang-17 exhibits this issue on either architecture. So this looks to be a regression that clang-16 introduced and clang-17 promptly fixed - making me feel better about not having spotted the issue before! Note that while Apple CPUs have unusually versatile store-to-load forwarding, I’d expect that most other ARM chips are not able to forward multiple stores into a single load - there, too, this would only be a problem for clang-16, and earlier and later versions should all work fine in this particular case.&lt;/p&gt;

&lt;p&gt;I haven’t tracked the clang-16 issue down to a specific commit so I don’t know what exactly introduced this regression and what exactly fixed it; this can perhaps be left as an exercise to the reader.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;This isn’t the first time I’ve encountered store-to-load forwarding issues on x86_64 CPUs; however, I’m more used to them happening as a result of code that explicitly loads or stores mismatched element sizes. For example, this problem is prevalent, and requires a lot of care, when unions are used to operate on tagged values: a structure like the one below is often filled using individual field writes but copied with a 128-bit wide load/store, which presents challenges in high-performance interpreters, where the same value is frequently written and read back in quick succession:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;union&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Regular structure copies may sometimes hit this problem as well, although these are often less latency sensitive. The index decompression code discussed here is an interesting scenario where all individual accesses in the source code are matched precisely, but the compiler may be eager to combine multiple loads and stores together - and unless it combines the stores, the combined loads may suffer.&lt;/p&gt;
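
&lt;p&gt;One source-level way to keep the widths matched - a sketch of the general idea, not what any of the compilers discussed here emit - is to combine both halves in a register and write them with a single 64-bit store, so that a later 64-bit load of the same entry is fully contained in one store-buffer entry:&lt;/p&gt;

```cpp
// Sketch of the matching-width store: assemble the FIFO entry in a register
// and write it with one 64-bit store (layout assumes a little-endian host).
unsigned int fifo[16];

void write_edge_wide(unsigned int slot, unsigned int a, unsigned int b) {
    unsigned long long v[1];
    v[0] = (unsigned long long)b * 4294967296ull + a; // b in high half, a in low
    __builtin_memcpy(fifo + slot * 2, v, 8); // single 64-bit store
}
```

&lt;p&gt;Whether this helps in practice depends on the compiler actually keeping the store wide; as this post demonstrates, checking the generated assembly remains the only reliable verification.&lt;/p&gt;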

&lt;p&gt;In clang, this problem on this specific code is mercifully restricted to a single, older compiler version (clang-16); hopefully, gcc will follow suit and fix this in gcc-16. Unfortunately, the presence or absence of this problem is often ephemeral and depends on the exact code compiled, not just the compiler version; for cases where performance matters, beware store-load conflicts and pay close attention to the code the compiler generates!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The compression ratio for triangle data is not state of the art compared to methods like Edgebreaker, but that’s an explicit design point, as we want a way to encode index buffers without distorting the vertex cache optimized order. The encoding is specialized to be friendly to general purpose LZ/entropy codecs, so the encoded output could be compressed further by LZ4/Zstd if needed. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Or, to be specific, decimal gigabytes per second rather than gibibytes, as that is the common unit of bandwidth measurement. If you have seen my Mastodon posts about this, or have read the numbers in my bug report, those use binary gigabytes and as such feature smaller numbers - sorry! &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The performance of writing to the output buffer is not critical here, but seeing these instructions may help understand the code flow a little better. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I don’t often use bisection for issues like this, but in this case I was asked to file a &lt;a href=&quot;https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960&quot;&gt;bug report&lt;/a&gt; so I thought bisection would be useful to pinpoint the change that led to this regression. Note that this does not mean the change is necessarily incorrect; it simply has an unintended consequence of a devastating performance regression on this specific code. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s not always clear how closely Apple clang versions track the official LLVM releases, but in this case the change in code generation can also be observed on non-Apple clang 16/17 builds. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This would be a good time to mention that you can clone &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; repository and run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make -B config=release codecbench &amp;amp;&amp;amp; ./codecbench&lt;/code&gt; to reproduce these results, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CXX&lt;/code&gt; environment variable if necessary to adjust the compiler version. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Sat, 03 May 2025 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2025/05/03/load-store-conflicts/</link>
			<guid isPermaLink="true">https://zeux.io/2025/05/03/load-store-conflicts/</guid>
		</item>
		
		<item>
			<title>Measuring acceleration structures</title>
<description>&lt;p&gt;Hardware-accelerated raytracing, as supported by DirectX 12 and Vulkan, relies on an abstract data structure that stores scene geometry, known as an “acceleration structure” and often referred to as “BVH” or “BLAS”. Unlike geometry representation for rasterization, rendering engines cannot customize the data layout; unlike texture formats, the layout is not standardized across vendors.&lt;/p&gt;

&lt;p&gt;It may seem like a trivial matter - surely, by 2025 all implementations are close to each other in memory consumption, and the main competition is over ray traversal performance and new ray tracing features? Let’s find out.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;experimental-setup&quot;&gt;Experimental setup&lt;/h1&gt;

&lt;p&gt;It’s going to be difficult to make any generalized claims here, and testing this requires using many different GPUs from many different vendors, which is time-consuming. So for the purpose of this post, we will just look at a single scene - &lt;a href=&quot;https://developer.nvidia.com/orca/amazon-lumberyard-bistro&quot;&gt;Amazon Lumberyard Bistro&lt;/a&gt;, or more specifically a somewhat customized variant by Nvidia which uses more instancing than the default FBX download.&lt;/p&gt;

&lt;p&gt;The results are captured by running &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt; renderer; if you’d like to follow along, you will need Vulkan 1.4 SDK and drivers, and something along these lines:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/zeux/niagara --recursive
cd niagara
git clone https://github.com/zeux/niagara_bistro bistro
cmake . &amp;amp;&amp;amp; make
./niagara bistro/bistro.gltf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/zeux/niagara/blob/master/src/scenert.cpp#L15&quot;&gt;code&lt;/a&gt; will parse the glTF scene, convert the meshes to use fp16 positions, build a BLAS for every mesh&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, compact it using the relevant parts of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_KHR_acceleration_structure&lt;/code&gt; extension, and print the resulting compacted sizes. While a number of levels of detail are built as the scene is loaded, only the original geometry makes it into acceleration structures, for a total of 1.754M triangles&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blas_1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The builds use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PREFER_FAST_TRACE&lt;/code&gt; build mode; on some drivers, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LOW_MEMORY&lt;/code&gt; flag reduces the BLAS size further at some cost to traversal performance, which we will ignore for now.&lt;/p&gt;

&lt;h1 id=&quot;experimental-results&quot;&gt;Experimental results&lt;/h1&gt;

&lt;p&gt;Running this on the latest (as of end of March) drivers of all respective vendors, on a host of different GPUs, we get the following results; the total BLAS size is presented alongside an approximate “bytes/triangle” number - not a rigorous metric to compute this way, but we will do it anyway.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;GPU&lt;/th&gt;
      &lt;th&gt;BLAS size&lt;/th&gt;
      &lt;th&gt;Bytes/triangle&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;AMD Ryzen 7950X (RDNA2 iGPU)&lt;/td&gt;
      &lt;td&gt;100 MB&lt;/td&gt;
      &lt;td&gt;57.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;AMD Radeon 7900 GRE (RDNA3)&lt;/td&gt;
      &lt;td&gt;100 MB&lt;/td&gt;
      &lt;td&gt;57.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;AMD Radeon 9070 (RDNA4)&lt;/td&gt;
      &lt;td&gt;84 MB&lt;/td&gt;
      &lt;td&gt;47.9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;NVIDIA GeForce RTX 2080&lt;/td&gt;
      &lt;td&gt;46 MB&lt;/td&gt;
      &lt;td&gt;26.5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;NVIDIA GeForce RTX 3050&lt;/td&gt;
      &lt;td&gt;45 MB&lt;/td&gt;
      &lt;td&gt;25.7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;NVIDIA GeForce RTX 4090&lt;/td&gt;
      &lt;td&gt;45 MB&lt;/td&gt;
      &lt;td&gt;25.7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;NVIDIA GeForce RTX 5070&lt;/td&gt;
      &lt;td&gt;33 MB&lt;/td&gt;
      &lt;td&gt;18.8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Intel Arc B580&lt;/td&gt;
      &lt;td&gt;79 MB&lt;/td&gt;
      &lt;td&gt;45.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Apple M4&lt;sup id=&quot;fnref:20&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:20&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/td&gt;
      &lt;td&gt;93 MB&lt;/td&gt;
      &lt;td&gt;53.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Now, that’s quite a gap! The delta between earlier AMD GPUs and the latest NVIDIA GPUs is 3x; comparing the latest AMD and NVIDIA GPUs, we still see a 2.5x disparity in memory consumption. Intel&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; is a little ahead of RDNA4, at 2.4x larger BLAS vs NVIDIA.&lt;/p&gt;

&lt;p&gt;This table presents BLAS memory consumption as a function of the GPU - clearly, the GPU generation has some effect on memory consumption. However, another important contributing factor is the software, or more specifically the driver. For AMD, we can compare the results of various driver releases over the last year, as well as an alternative driver, &lt;a href=&quot;https://docs.mesa3d.org/drivers/radv.html&quot;&gt;radv&lt;/a&gt;&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, on the same GPU - Radeon 7900 GRE:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Driver (RDNA3)&lt;/th&gt;
      &lt;th&gt;BLAS size&lt;/th&gt;
      &lt;th&gt;Bytes/triangle&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;AMDVLK 2024.Q3&lt;/td&gt;
      &lt;td&gt;155 MB&lt;/td&gt;
      &lt;td&gt;88.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;AMDVLK 2024.Q4&lt;/td&gt;
      &lt;td&gt;105 MB&lt;/td&gt;
      &lt;td&gt;59.9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;AMDVLK 2025.Q1&lt;/td&gt;
      &lt;td&gt;100 MB&lt;/td&gt;
      &lt;td&gt;57.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;radv (Mesa 25.0)&lt;/td&gt;
      &lt;td&gt;241 MB&lt;/td&gt;
      &lt;td&gt;137.4&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As we can see, over the last 9 months, BLAS memory consumption on the same AMD GPU and the same driver codebase has progressively improved by 1.5x&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, whereas if you use radv, your BLAS consumption is now 2.4x larger than with the official AMD drivers - not to mention the latest NVIDIA GPUs&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Well… that’s certainly a lot of different numbers. Let’s try to make sense of at least some of them.&lt;/p&gt;

&lt;h1 id=&quot;mental-model&quot;&gt;Mental model&lt;/h1&gt;

&lt;p&gt;Let’s try to build some models to help us understand what we should expect. Is 100 MB good for 1.754M triangles? Is 241 MB bad? It’s time to talk about what a BVH actually is.&lt;/p&gt;

&lt;p&gt;First, let’s contextualize this with how much data we are feeding in. The way Vulkan / DX12 APIs work is that the application provides the driver with geometry description, which is either a flat list of triangles, or a vertex-index buffer pair. Unlike rasterization where a vertex may carry various attributes packed in the way the application wants, for raytracing you only specify a position per vertex, and the formats are more strictly specified. As mentioned above, in this case we are giving the driver fp16 data - this is important, because on fp32 data you will likely see different results and less drastic differences between vendors.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The index buffer is your usual 32-bit or 16-bit data you would expect to see in rasterization; however, in most if not all cases, the index buffer is just a way to communicate your geometry to the driver - unlike rasterization, where &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;efficiency of your index and vertex buffers is critical&lt;/a&gt;, here the drivers typically build the acceleration structure without regard to the explicit indexing information.&lt;/p&gt;

&lt;p&gt;A flat triangle position list, then, would take 6 bytes per triangle corner * 3 corners per triangle * 1.754M triangles = 31.5 MB. This is not the most memory efficient storage: this scene uses 1.672M unique vertices, so using a 16-bit index buffer would require ~10 MB for vertex positions and ~10.5 MB for indices, and some meshlet compression schemes can go below that&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;; but regardless, our baseline for a not-very-efficient geometry storage can be in the neighborhood of 20-30 MB, or up to 18 bytes per triangle.&lt;/p&gt;

&lt;p&gt;A flat triangle list by itself is not enough to trace rays efficiently - the driver needs to build an acceleration structure. These structures are usually called “BVH” - &lt;a href=&quot;https://en.wikipedia.org/wiki/Bounding_volume_hierarchy&quot;&gt;bounding volume hierarchy&lt;/a&gt; - and represent a tree with a low branching factor, where the intermediate nodes are defined as bounding boxes and the leaf nodes store triangles. We will go over specific examples of this in the next section.&lt;/p&gt;

&lt;p&gt;Typically, you would want this structure to have high memory locality - when encountering a triangle in that data structure, you don’t want to have to reach for the triangle’s vertex data elsewhere in memory. In addition, Vulkan and DX12 allow the application to retrieve the triangle id for a ray hit (which must match the index of the triangle in the originally provided data); also, multiple mesh geometries can be combined in a single tree, and since for ray tracing performance it’s uneconomical to separate the geometries into separate sub-trees, the triangle information must also carry the geometry index. With all of this, we arrive at something like the following:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BoxNode&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;float3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb_min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;float3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;children&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TriangleNode&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;float3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;corners&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;primid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;geomid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;N is the branching factor; while any number between 2 (for a binary tree) and something exorbitantly large like 32 is possible in theory, in practice we should expect a small number that allows the hardware to test a reasonably small number of AABBs against a ray quickly; we will assume N=4 for now.&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blas_2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With N=4 and fp32 coordinates everywhere, BoxNode is 112 bytes and TriangleNode is 44 bytes. If both structures use fp16 instead, we’d get 64 bytes for boxes and 26 bytes for triangles instead. We know (mostly…) how many triangle nodes we should have - one per input triangle - but how many boxes are there?&lt;/p&gt;

&lt;p&gt;Well, with a tree of branching factor 4, if we have 1.754M leaf nodes (triangles), we’d hope to get 1.754M/4 = 438K box nodes at the next level, 438K/4 = 109K at the next level, 109K/4 = 27K at the level after that, 27K/4 = 6.7K after that, and all the way until we reach the root - which gives us about 584K. If you don’t want to use boring division one step at a time, this is about a third as many box nodes as triangle nodes, which was discovered by &lt;a href=&quot;https://en.wikipedia.org/wiki/1/4_%2B_1/16_%2B_1/64_%2B_1/256_%2B_%E2%8B%AF&quot;&gt;Archimedes around 2250 years ago&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Conveniently, this means that T triangles (using T for the triangle count to avoid confusion with the branching factor N) should take, approximately, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;T*sizeof(TriangleNode) + (T/3)*sizeof(BoxNode)&lt;/code&gt; memory, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sizeof(TriangleNode) + sizeof(BoxNode)/3&lt;/code&gt; bytes per triangle. With fp32 coordinates this gives us ~81.3 bytes per triangle, and with fp16 it’s ~47.3.&lt;/p&gt;

&lt;p&gt;This analysis is imprecise for a number of reasons. It ignores the potential for imbalanced trees (not all boxes may use 4 children for optimal spatial splits); it ignores various hardware factors like memory alignment and extra data; it assumes a specific set of node sizes; and it assumes the number of leaf (triangle) nodes is equal to the input triangle count. Let’s revisit these assumptions as we try to understand how BVHs &lt;em&gt;actually&lt;/em&gt; work.&lt;/p&gt;

&lt;h1 id=&quot;radv&quot;&gt;radv&lt;/h1&gt;

&lt;p&gt;Since the memory layout of a BVH is ultimately up to the specific vendor’s hardware and software and I don’t want to overly generalize this, let’s focus on AMD.&lt;/p&gt;

&lt;p&gt;AMD has the benefit of having multiple versions of their RDNA architecture - although there were no changes between RDNA2 and RDNA3 that would affect the memory sizes - and having documentation as well as open source drivers. Now, one caveat is that AMD does not properly document the BVH structure (the expected node memory layout &lt;em&gt;should&lt;/em&gt; have been part of the &lt;a href=&quot;https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf&quot;&gt;RDNA ISA&lt;/a&gt;, but it’s not - AMD, please fix this), but between the two open source drivers enough details are available. By contrast, pretty much nothing is known about NVidia, but they clearly have a significant competitive advantage here so maybe they have something to hide.&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The way AMD implements ray tracing is as follows: the hardware units (“ray accelerators”) are accessible to shader cores as instructions that are similar to texture fetching; each instruction is given the pointer to a single BVH node and ray information, and can automatically perform ray-box or ray-triangle tests against all boxes or triangles in the node and return the results. The driver, then, is responsible for:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;At build time, producing the BVH composed of nodes that match the HW format&lt;/li&gt;
  &lt;li&gt;At render time, building shader code that iterates over the tree, using the special instructions for node tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the official documentation for RT formats is lacking, we do not have to reverse engineer this as we have two separate drivers with source code.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.mesa3d.org/drivers/radv.html&quot;&gt;radv&lt;/a&gt;, the unofficial driver which is the default on Linux and SteamOS, has a very clean and easy-to-read code base, which defines &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/e612e840d2b15054e3597763e22d0537f6bc81e6/src/amd/vulkan/bvh/bvh.h&quot;&gt;the structures&lt;/a&gt; as follows:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;radv_bvh_triangle_node&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;coords&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reserved&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;cm&quot;&gt;/* flags in upper 4 bits */&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;geometry_id_and_flags&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reserved2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;radv_bvh_box16_node&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;children&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;float16_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;coords&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;radv_bvh_box32_node&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;children&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;vk_aabb&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;coords&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reserved&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These should be mostly self-explanatory (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vk_aabb&lt;/code&gt; has 6 floats to represent min/max) and mostly map to our earlier sketch. From this we can infer that RDNA GPUs support fp16/fp32 box nodes, but require full fp32 precision for triangle nodes. Additionally, the triangle node here is 64 bytes, the fp16 box node is 64 bytes, and the fp32 box node is 128 bytes: perhaps unsurprisingly, GPUs like things to be aligned and this is reflected in these structures.&lt;/p&gt;

&lt;p&gt;Looking closer at the source code, you can spot some additional memory that is &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/e612e840d2b15054e3597763e22d0537f6bc81e6/src/amd/vulkan/radv_acceleration_structure.c#L99&quot;&gt;allocated to store “parent links”&lt;/a&gt;: for each 64 bytes of the entire BVH, the driver allocates a 4-byte value, which will store the parent index of the node associated with this 64-byte chunk (due to alignment, every 64-byte aligned chunk is part of just one node). This is important for traversal: the shader uses a small stack for traversal that keeps the indices of the nodes that are currently being traversed, but that stack may not be sufficient for the full depth of large trees. To work around that, it’s possible to fall back to using these parent links - recursive traversal could be implemented in a completely stackless form, but reading the extra parent pointer from memory for every step would presumably be prohibitively expensive.&lt;/p&gt;

&lt;p&gt;Another, more crucial observation is that at the time of this writing radv does not support fp16 box nodes - all box nodes emitted are fp32. As such, we can redo our previous analysis using the radv structures:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;64 bytes/triangle for triangle nodes&lt;/li&gt;
  &lt;li&gt;128 * 1/3 ~= 43 bytes/triangle for box nodes&lt;/li&gt;
  &lt;li&gt;(64 + 43) / 64 * 4 ~= 7 bytes/triangle for parent links&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;… for a grand total of ~114 bytes/triangle we would expect from radv. Now, radv’s &lt;em&gt;actual&lt;/em&gt; data is 137 bytes/triangle - 23 more bytes unaccounted for! This would be a good time to mention that while we would hope that the tree is perfectly balanced and the branching factor is, indeed, 4, in reality we would expect some amount of imbalance - both due to the nature of the algorithms that build these trees, that are highly parallel in nature and don’t always reach the optimum, and due to some geometry configurations just requiring somewhat uneven splits in parts of the tree for optimal traversal performance&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h1 id=&quot;amdvlk&quot;&gt;AMDVLK&lt;/h1&gt;

&lt;p&gt;Given that the hardware formats of the BVH nodes are fixed, it does not seem like there would be &lt;em&gt;that&lt;/em&gt; much leeway in how much memory a BVH can take. With fp32 box nodes, we’ve estimated that a BVH should take a minimum of ~114 bytes/triangle on AMD hardware, and yet even the largest number we saw from the official driver was 88.4 bytes/triangle. What is going on here?&lt;/p&gt;

&lt;p&gt;It’s time to consult the official AMD &lt;a href=&quot;https://github.com/GPUOpen-Drivers/gpurt&quot;&gt;raytracing implementation&lt;/a&gt;. It is more or less what is running in both Windows and Linux versions of AMD’s driver; it should probably be taken as a definitive source, although unfortunately it’s quite a bit harder to follow than radv.&lt;/p&gt;

&lt;p&gt;In particular, it does not contain C structure definitions for the BVH nodes: most of the code there is in HLSL and it uses individual field writes with macro offsets. That said, for RDNA2/3, we need to look at the &lt;a href=&quot;https://github.com/GPUOpen-Drivers/gpurt/blob/f734985ebc31f471c376ed0cb217f43bdd40ee17/src/shadersClean/common/gfx10/TriangleNode1_0.hlsli&quot;&gt;triangle node&lt;/a&gt; more closely:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Note: GPURT limits triangle compression to 2 triangles per node. As a result the remaining bytes in the triangle node&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// are used for sideband data. The geometry index is packed in bottom 24 bits and geometry flags in bits 25-26.&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#define TRIANGLE_NODE_V0_OFFSET 0
#define TRIANGLE_NODE_V1_OFFSET 12
#define TRIANGLE_NODE_V2_OFFSET 24
#define TRIANGLE_NODE_V3_OFFSET 36
#define TRIANGLE_NODE_GEOMETRY_INDEX_AND_FLAGS_OFFSET 48
#define TRIANGLE_NODE_PRIMITIVE_INDEX0_OFFSET         52
#define TRIANGLE_NODE_PRIMITIVE_INDEX1_OFFSET         56
#define TRIANGLE_NODE_ID_OFFSET 60
#define TRIANGLE_NODE_SIZE      64
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it’s still 64 bytes; but what is this “NODE_V3” field, and what’s this triangle compression? Indeed, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;radv_bvh_triangle_node&lt;/code&gt; structure has a field &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint32_t reserved[3];&lt;/code&gt; right after the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;coords&lt;/code&gt; array; it turns out that the 64-byte triangle node in the AMD HW format can store up to 2 triangles instead of just one.&lt;/p&gt;

&lt;p&gt;AMD documentation refers to this as “triangle compression” or “pair compression”. The same concept can be seen in Intel’s hardware as “QuadLeaf”. In either case, the node can store two triangles that share an edge, which requires just 4 vertices. The triangles do &lt;em&gt;not&lt;/em&gt; have to be coplanar; the hardware intersection engine will dutifully intersect the ray against both and return one or both intersection points as required.&lt;/p&gt;

&lt;p&gt;Now, this type of sharing is not always possible. For example, if the input consists of a soup of disconnected triangles, then we will hit the worst case of one triangle per leaf node. And in some cases, even if two triangles can be merged, doing so might compromise the surface area heuristic (SAH) metrics if one of them is much larger. However, generally speaking, we would expect a lot of triangles to be grouped up in pairs.&lt;/p&gt;

&lt;p&gt;This changes our analysis pretty significantly:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Instead of 64 bytes/triangle for leaves, we only have 32 bytes/triangle&lt;/li&gt;
  &lt;li&gt;Since we have half as many leaves, we will also have half as many box nodes, for ~21 bytes/triangle&lt;/li&gt;
  &lt;li&gt;And the parent link cost is accordingly reduced by half as well, for ~4 bytes/triangle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which brings the total up to 57 bytes/triangle… assuming ideal conditions: all triangles can be merged in pairs, and all nodes have a branching factor of 4 (something we know is probably false based on the radv results). In reality, this is the configuration the AMD driver used in the 2024.Q3 drivers, and it had 88 bytes/triangle - 31 bytes more than expected - which is probably a combination of more box nodes than we would hope for, as well as less-than-perfect triangle pairing. Another quirk here is that the AMDVLK driver implements what’s known as &lt;a href=&quot;https://www.nvidia.in/docs/IO/77714/sbvh.pdf&quot;&gt;SBVH&lt;/a&gt;: individual triangles can be “split” across multiple BVH nodes, effectively appearing in the tree multiple times. This helps with ray tracing performance for long triangles, and may further skew our statistics, as the number of triangles stored in leaf nodes may indeed be larger than the input provided!&lt;sup id=&quot;fnref:14&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;radv does not implement either optimization at this time; beyond the sizable effect on memory consumption, I would expect this also has a significant impact on ray tracing cost - indeed, my measurements indicate that radv is significantly slower on this scene than the official AMD driver, but that is a story for another time.&lt;/p&gt;

&lt;p&gt;Now, what happened in AMD’s 2024.Q4 release? If we trace the &lt;a href=&quot;https://github.com/GPUOpen-Drivers/xgl/commit/a367518e0bf308056492d994c5713e06af9429af&quot;&gt;source changes&lt;/a&gt; closely (which is non-trivial as the commit structure is erased from the source code dumps, but I’m glad we at least have that much!), it becomes obvious that fp16 box nodes are now enabled by default. Before this change, box nodes used fp32 by default; with it, many box nodes use fp16 instead.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blas_3.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There are some specific conditions under which this happens - as you may have noticed from the radv structs, fp32 box nodes have one more &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reserved&lt;/code&gt; field that fp16 box nodes don’t; this field is actually used to store some extra information that may be deemed important on a per-node basis in some cases&lt;sup id=&quot;fnref:18&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:18&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;. But regardless, the &lt;em&gt;perfect&lt;/em&gt; configuration for an RDNA2/3 system seems to be:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;2 triangles per 64-byte leaf = 32 bytes/triangle&lt;/li&gt;
  &lt;li&gt;64-byte fp16 box * 1/3 * 1/2 = 11 bytes/triangle&lt;/li&gt;
  &lt;li&gt;4 bytes of parent links per 64b = 3 bytes/triangle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;… for a total of 46 bytes/triangle. This is the absolute best case and as we’ve seen before, it’s unrealistic to expect for complex geometry such as Bistro; the best results from the AMD driver use 57 bytes/triangle, 11 bytes/triangle more than the theoretical optimum.&lt;sup id=&quot;fnref:13&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Worth noting that the 2025.Q1 release reduced the memory consumption from ~60 bytes/triangle to ~57 bytes/triangle. Since we know some amount of memory is lost to various inefficiencies of the resulting structure compared to the optimum, it might be possible to squeeze more juice from this in the future - but given that the hardware units expect a fixed format, and some amount of efficiency loss is inevitable if you need to maintain good tracing performance, the remaining gains are going to be limited.&lt;/p&gt;

&lt;h1 id=&quot;rdna4&quot;&gt;RDNA4&lt;/h1&gt;

&lt;p&gt;… until the next hardware revision, that is.&lt;/p&gt;

&lt;p&gt;While RDNA3 mostly kept the BVH format from RDNA2 (with some previously reserved bits now used for various culling flags - a minor change that doesn’t affect memory consumption), RDNA4 appears to redesign the storage format completely. Presumably, all previous node types are still supported, since radv works without changes, but gpurt implements two major new node types:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/GPUOpen-Drivers/gpurt/blob/f734985ebc31f471c376ed0cb217f43bdd40ee17/src/shadersClean/common/gfx12/internalNode.hlsli&quot;&gt;Quantized BVH8 node&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/GPUOpen-Drivers/gpurt/blob/f734985ebc31f471c376ed0cb217f43bdd40ee17/src/shadersClean/common/gfx12/primitiveNode.hlsli&quot;&gt;Primitive node&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As is clear from the name, the BVH8 node stores 8 children; instead of using fp16 for box bounds, it stores the box corners in a special format&lt;sup id=&quot;fnref:15&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;16&lt;/a&gt;&lt;/sup&gt; with a 12-bit mantissa and an 8-bit exponent shared between all corners, plus a full fp32 origin corner. This adds up to 128 bytes - from the memory perspective, just as much as two fp16 BVH4 nodes from RDNA2/3 - but it should permit the full fp32 range of bounding box values (fp16 box nodes could not represent geometry with coordinates outside of +-64K!), so I would expect that RDNA4 BVH data does not need to use any BVH4 nodes. The new layout also allows AMD to embed other sorts of data into the box node, such as the OBB index for their new rotation support, and the parent pointer (which previously, as you recall, had to be allocated separately).&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ChildInfo&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minX&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minY&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cullingFlags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unused&lt;/span&gt;       &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minZ&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxX&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;instanceMask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxY&lt;/span&gt;      &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxZ&lt;/span&gt;      &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nodeType&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nodeRange&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;QuantizedBVH8BoxNode&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;internalNodeBaseOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leafNodeBaseOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parentPointer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;float3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;origin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xExponent&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yExponent&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zExponent&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childIndex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;obbMatrixIndex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ChildInfo&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childInfos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Primitive node is somewhat similar to triangle node, but it’s larger (128 bytes) and much more versatile: it can store a variable number of triangle pairs per node, and does that using what seems like a micro-meshlet format, where triangle pairs use vertex indices, with a separate section of the 128-byte packet storing the vertex positions - using a variable amount of bits per vertex for position storage.&lt;/p&gt;

&lt;p&gt;For position storage, the bits of the coordinates within a single node are split into three parts: a prefix (which must be the same across all floats for the same axis), the value, and trailing zeroes; all parts have the same bit width for the same axis across all vertices in the node. For fp16 source positions, I would expect prefix storage to remove the initial segment of bits shared between positions that are close together in space, and most of the trailing fp32 bits to be zero. It would probably be reasonable to expect around 30-33 bits per vertex on average with that setup (3 * 10-bit mantissas, with most of the exponent bits shared and the trailing zeroes removed).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/blas_4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The triangle pair vertex indices are encoded using 4 bits per index, with a few more bits used for other fields; primitive indices are stored as deltas from a single base value inside the primitive node, similarly to positions. Notably, a triangle pair has three independent indices per triangle, one for each corner - so it looks like the pair does not necessarily have to share a geometric edge, which presumably improves the efficiency with which geometry can be converted to this format, at a small cost of 8 extra bits for every other triangle&lt;sup id=&quot;fnref:19&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:19&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;. The number of pairs per node is limited to 8, or 16 triangles.&lt;/p&gt;
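&lt;p&gt;As a rough illustration of the bit budget (the exact field layout is not public; the 4-bit indices come from the text above, but the field order and flag width here are assumptions), a pair descriptor could be packed like this:&lt;/p&gt;

```python
def pack_pair(tri0, tri1, flags=0):
    # Pack two triangles, three 4-bit vertex indices each (0..15),
    # plus a few flag bits on top; field order is purely illustrative.
    assert all(0 <= i < 16 for i in tri0 + tri1)
    word = 0
    for n, idx in enumerate(tri0 + tri1):
        word |= idx << (4 * n)
    return word | (flags << 24)

# Six 4-bit indices = 24 bits; ~5 extra bits of per-pair fields
# bring the descriptor to roughly 29 bits.
pair = pack_pair((0, 1, 2), (2, 1, 3), flags=0b10101)
print(hex(pair), pair.bit_length())  # 0x15312210 29
```

Six 4-bit indices account for 24 bits; a handful of extra per-pair bits brings this to the ~29-bit figure used in the capacity estimate below.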

&lt;blockquote&gt;
  &lt;p&gt;It may seem that this indexed storage format is at odds with what was mentioned earlier in the post: if the driver discards the initial index buffer during BLAS construction, how can it use indices here? The answer is that BVH construction proceeds as before, and some subtrees get &lt;a href=&quot;https://github.com/GPUOpen-Drivers/gpurt/blob/f734985ebc31f471c376ed0cb217f43bdd40ee17/src/shaders/EncodeHwBvh3_1.hlsl&quot;&gt;packed into primitive nodes&lt;/a&gt;. During this packing, shared vertices are identified opportunistically using bitwise equality between vertex corners - so it does not matter whether the source geometry was indexed, as long as the triangle corner positions are exactly equal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All of this makes it difficult to correctly estimate the optimal storage capacity of such a node. With the limit of 16 triangles, we would ideally hope to pack a 3x5 vertex grid (15 vertices, 8 quads)&lt;sup id=&quot;fnref:16&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:16&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;18&lt;/a&gt;&lt;/sup&gt;. If ~30 bits per vertex for position storage is accurate, then 15 vertices will take 57 bytes. With each triangle pair occupying 29 bits, 8 pairs would take another 29 bytes, for a total of 86 bytes. A few additional bytes are needed for headers, the various anchors used to reconstruct positions and primitive indices, and a few bits per triangle for the primitive index deltas (assuming spatially coherent input triangles) - all of which can probably be expected to fit. Thus, a dense mesh might be packed at 16 triangles per node, or ~8 bytes/triangle.&lt;/p&gt;
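&lt;p&gt;The packing estimate above is easy to double-check with a few lines of arithmetic (the bit counts are the rough estimates from the text, not official format sizes):&lt;/p&gt;

```python
from math import ceil

VERTS, PAIRS = 15, 8      # 3x5 vertex grid: 15 vertices, 8 quads = 8 pairs
BITS_PER_VERTEX = 30      # ~3 x 10-bit mantissas after prefix/trailing removal
BITS_PER_PAIR = 29        # six 4-bit indices plus a few extra bits

position_bytes = ceil(VERTS * BITS_PER_VERTEX / 8)
pair_bytes = ceil(PAIRS * BITS_PER_PAIR / 8)
print(position_bytes, pair_bytes, position_bytes + pair_bytes)  # 57 29 86
```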

&lt;p&gt;Since BVH nodes are 8-wide, this also proportionally reduces the expected total number of box nodes, from 1/3 of the number of primitive nodes to just 1/7&lt;sup id=&quot;fnref:17&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:17&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt;. And given that parent pointers are already embedded into box nodes, this gives us a theoretical best-case bound of approximately:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;128-byte primitive nodes with 16 triangles/node = 8 bytes/triangle&lt;/li&gt;
  &lt;li&gt;128-byte box nodes, 1/7th of 1/16th of triangles = 1.2 bytes/triangle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;… for a grand total of 9.2 bytes/triangle. Compared with the &lt;em&gt;actual&lt;/em&gt; numbers of ~48 bytes/triangle, this is clearly a wildly unrealistic goal:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Even with BVH4, we have not seen anywhere near a 4x branching factor on our test geometry in practice; achieving 8x without degrading BVH quality should be even harder&lt;/li&gt;
  &lt;li&gt;RDNA4 acceleration units can process &lt;a href=&quot;https://hothardware.com/reviews/amd-rdna-4-architecture-deep-dive&quot;&gt;eight box or two triangle&lt;/a&gt; intersections at once; a node with 16 triangles will thus be much more expensive to process than a node with 2. The driver may therefore artificially limit the number of triangles in each leaf node to maintain trace performance.&lt;/li&gt;
  &lt;li&gt;The description above is simplified, assuming a high-level tree structure similar to RDNA2/3; in reality, QBVH8 nodes can &lt;a href=&quot;https://github.com/GPUOpen-Drivers/gpurt/blob/f734985ebc31f471c376ed0cb217f43bdd40ee17/src/shadersClean/common/gfx12/internalNode.hlsli#L286&quot;&gt;reference a subrange&lt;/a&gt; of a given primitive node. For example, one could imagine a single primitive node with 16 triangles and a single QBVH8 node that, in each child, references only 2 of those triangles - which may be a different way to improve traversal performance. This means the box:triangle node ratio may be closer to 1:1 or 1:2 in practice, for 4-8 bytes/triangle instead of 1.2.&lt;/li&gt;
&lt;/ul&gt;
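&lt;p&gt;As an aside, the 9.2 bytes/triangle figure itself can be reproduced numerically; the exact box-node term is 128/(7*16) ≈ 1.14 bytes/triangle, which the breakdown above rounds up to 1.2:&lt;/p&gt;

```python
NODE_BYTES = 128
TRIS_PER_LEAF = 16
K = 8  # branching factor of QBVH8 box nodes

# In a full k-ary tree, internal (box) nodes number ~1/(k-1) of the
# leaves: sum(1/k^i for i >= 1) converges to 1/(k-1).
leaf_bytes = NODE_BYTES / TRIS_PER_LEAF           # 8.0 bytes/triangle
box_bytes = NODE_BYTES / (K - 1) / TRIS_PER_LEAF  # ~1.14 bytes/triangle
print(round(leaf_bytes + box_bytes, 2))           # 9.14, ~9.2 in the text
```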

&lt;p&gt;Between these factors, it’s hard to estimate realistic memory consumption - but it seems plausible that BVH sizes will continue to shrink with future driver updates. Additionally, note that Bistro geometry has a lot of oddly shaped triangles and in general is not particularly uniform or dense. On denser meshes, the effective bytes/triangle ratio may well be closer to the theoretical optimum - exploring denser meshes is left as an exercise for the reader!&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Hopefully you’ve enjoyed this whirlwind tour through the exciting world of hardware-accelerated BVH storage! In summary, BVH memory consumption is highly hardware- and driver-specific: the driver can only build a BVH out of nodes that the hardware ray accelerators natively understand, and this format varies across GPU generations; some drivers may only support a subset of hardware formats due to limited development time or tracing efficiency concerns; and the specific algorithms a driver uses to build the BVH will yield trees with different branching factors and leaf packing, which greatly affects the results.&lt;/p&gt;

&lt;p&gt;It will be interesting to revisit this topic in a year or so: AMD has made significant progress in both software and hardware in shrinking their BVH structures, and while on RDNA2/3 it’s hard to see BVH memory shrinking much further, it’s not fully clear how much headroom remains on RDNA4, depending on the scene. Similarly, NVidia has clearly improved their internal hardware formats in the 5xxx series, and there may be some room left for the driver to optimize storage further.&lt;/p&gt;

&lt;p&gt;While standardized BVH formats would make it much easier to reason about the memory impact of raytracing renderers, that seems extremely unlikely to happen anytime soon; each vendor uses different hardware node formats with different properties and some unique features, and continues to evolve them in newer architectures. It’s unclear if these will &lt;em&gt;ever&lt;/em&gt; converge to a common format; D3D12’s experience with &lt;a href=&quot;https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_texture_layout&quot;&gt;standard swizzle&lt;/a&gt; provides a cautionary tale… Something to watch here is the future of &lt;a href=&quot;https://registry.khronos.org/vulkan/specs/latest/man/html/VK_NV_cluster_acceleration_structure.html&quot;&gt;cluster acceleration structures&lt;/a&gt;; while they do not solve the difference in memory consumption directly and are not yet supported by anyone except NVidia, they might make it easier to reason about the composition of BVH data and produce more predictable memory consumption on future GPUs - modulo trace performance concerns. Time will tell.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For the purpose of this analysis we will ignore TLAS; for this particular scene the memory costs of TLAS storage are very low - it only has a few hundred mesh instances; while this can be much larger in real games, I would expect BLAS storage to dominate. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Due to instancing, the amount of geometry present in the scene is larger - around 4M; be careful with that detail if you compare other Bistro variants to the numbers presented here, as they may not match. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:20&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The Apple number was added after the article was originally published (on April 10th); since existing translation layers like MoltenVK do not support raytracing, it is based on separate code that shares the geometry processing logic and uses Metal APIs to build the acceleration structures. &lt;a href=&quot;#fnref:20&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;These numbers were captured using official Intel drivers on Windows, &lt;em&gt;not&lt;/em&gt; Mesa on Linux. I don’t have Intel’s Linux numbers handy, and don’t feel like re-plugging the GPU again. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;radv is the default user-space driver for Linux systems, and the production driver for Steam Deck; all AMD measurements apart from the one explicitly listed below are taken from their official driver, AMDVLK, which is mostly the same between Windows/Linux. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I think the gaming community affectionately refers to this phenomenon as “AMD fine wine”. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that all of these numbers are &lt;em&gt;not&lt;/em&gt; using NVidia’s latest clustered acceleration structures aka “mega geometry”; this is a subject for another post, but it doesn’t affect the analysis drastically. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And I’m not repeating all of this again for fp32. I would argue that fp32 is quite excessive for 99% of meshes in any game, and you need to be using either fp16 or snorm16 position components if you are trying to actually optimize the memory footprint of your game. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For example, a recently released &lt;a href=&quot;https://github.com/GPUOpen-LibrariesAndSDKs/DGF-SDK/&quot;&gt;AMD DGF SDK&lt;/a&gt; seems to primarily target position-only geometry, and as such might be useful for future AMD GPUs. It would just cover the geometry storage though, so we can’t use their numbers to estimate the future BVH cost; we also don’t know if this is even something that they plan to support in their RT cores. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For Intel GPUs it looks like &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/blob/d3ec467031780136412366a2a36a46d4c4d8cfdc/src/intel/vulkan/grl/include/GRLGen12.h#L32&quot;&gt;N=6&lt;/a&gt;; for AMD GPUs, N=4 for RDNA2/3 and RDNA4 has a new N=8 node. Little is known about NVidia GPUs as usual. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I could have studied Intel GPUs more, as they do have an open source driver as part of Mesa; however, it’s unclear if their proprietary driver shares the same source, and in general I just was more interested in AMD when investigating this. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Curious readers are encouraged to explore this topic further; on AMD hardware, you can use &lt;a href=&quot;https://gpuopen.com/radeon-raytracing-analyzer/&quot;&gt;Radeon Raytracing Analyzer&lt;/a&gt; to analyze the BVH as well as traversal efficiency characteristics for your workload. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In theory it should be possible to do further tests with AMDVLK driver to disambiguate this somewhat and/or patch the code to provide more statistics, but it’s 9 PM and I’d like to finish this post today if possible. &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:18&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;fp16 boxes are also naturally limited to the range of 16-bit floating point numbers; this would be a problem for some meshes with fp32 vertex coordinates, but it’s not an issue if the source vertex positions are also fp16. &lt;a href=&quot;#fnref:18&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It should also be noted that gpurt sources make vague references to larger-than-64-byte triangle nodes that contain more triangles; if those allow sharing more edges, the optimum might be lower - but this might also refer to earlier hardware revisions that never materialized. &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I have not studied this source code as extensively as the RDNA2/3 details, so all of this is an approximate description of what I can gather from skimming the code. Some details here are likely incorrect and/or missing. &lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:19&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The text here is written using “triangle pair” as this is how the code references these structures, but it’s unclear if there are &lt;em&gt;any&lt;/em&gt; restrictions on packing - it may be that AMD kept the term for convenience, or maybe earlier versions of the format used a shared edge with a smaller descriptor, and they later introduced extra bits to decouple the triangles and didn’t rename the concept. &lt;a href=&quot;#fnref:19&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:16&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This math is similar to meshlet configurations described &lt;a href=&quot;/2023/01/16/meshlet-size-tradeoffs/&quot;&gt;in an earlier post&lt;/a&gt;. &lt;a href=&quot;#fnref:16&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:17&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As a generalization of Archimedes’ formula, sum(1/k^i) for i ≥ 1 equals 1/(k-1). &lt;a href=&quot;#fnref:17&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Mon, 31 Mar 2025 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2025/03/31/measuring-acceleration-structures/</link>
			<guid isPermaLink="true">https://zeux.io/2025/03/31/measuring-acceleration-structures/</guid>
		</item>
		
		<item>
			<title>Year of independence</title>
			<description>&lt;p&gt;I am happy to report that &lt;a href=&quot;/2023/11/28/it-is-time/&quot;&gt;life after Roblox&lt;/a&gt; does indeed exist.&lt;/p&gt;

&lt;p&gt;When I quit, people told me I should take some time off, relax, unwind, recharge, travel… &lt;!--more--&gt; That would make sense&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;! What I did instead, however, was a combination of “writing a lot of code” and “talking to a lot of people”. Many dozens of companies and individuals reached out - thank you to everyone who did! - and I had a lot of fun meetings and conversations, and got a better sense of what people are building these days and why.&lt;/p&gt;

&lt;p&gt;A lot of the discussion was around the technology in the spaces that I am broadly familiar with - game development, simulation, low level systems engineering. In some ways, it was too comfortable, as in many cases I could see exactly what the years ahead for that company would be like - which was exactly what I was not looking for.&lt;/p&gt;

&lt;p&gt;At the same time, I was also fascinated by the technology behind LLMs. The claims about achieving AGI seem wildly exaggerated&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, but the tech is still useful and bewildering. To try to get a better sense of what this is all about and how all of that works, I ended up reading more papers in the first few months of my “funemployment” than I’ve read in the previous decade, and started a new open source project, &lt;a href=&quot;https://github.com/zeux/calm&quot;&gt;calm&lt;/a&gt;, which is a from-scratch single-user LLM inference engine for CUDA &amp;amp; Metal (and CPU SIMD I guess because why not).&lt;/p&gt;

&lt;p&gt;This project was born out of my desire to learn more, but also out of dissatisfaction with the state of the art at the time. To run the models, you could use slow and bulky PyTorch with endless environment setup issues and never-ending problems around quantization support&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, or, even worse, try to use NVidia’s &lt;a href=&quot;https://docs.nvidia.com/tensorrt-llm/index.html&quot;&gt;TensorRT-LLM&lt;/a&gt;, which is impossible to build without Docker containers and dreadful to build with them; compared to that, &lt;a href=&quot;https://github.com/ggerganov/llama.cpp&quot;&gt;llama.cpp&lt;/a&gt; was a breath of fresh air - but it still felt too bulky and inefficient compared to what seemed possible. Starting a new project in an unfamiliar field is daunting because you need to go from “nothing works” to “something works” before you can even think about efficiency - fortunately, Andrej Karpathy’s &lt;a href=&quot;https://github.com/karpathy/llama2.c&quot;&gt;llama2.c&lt;/a&gt; appeared just in time, so I copied the code and started hacking, ultimately rewriting a 1000-line .c file into a 4000-line project that is a very very very fast single-user LLM inference engine.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Before this project, I had spent almost two decades programming GPUs, but somehow had never done it in CUDA. I had a rough understanding of what that entails, of course, but it was still a lot of fun discovering the peculiarities of modern NVidia hardware, working with the excellent NVidia performance tools, and coming up with ideas for how to structure the data and kernels to squeeze as close as possible to the &lt;a href=&quot;/2024/03/15/llm-inference-sol/&quot;&gt;theoretically possible performance&lt;/a&gt;. I even ended up going into some wild corners of multi-GPU programming and wrote a fully fused cooperative kernel (NVidia, please don’t deprecate these) that managed to run on H100 with okay efficiency&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I also ended up writing a Metal version, just to see what Apple hardware is really like to work with. In regular graphics programming, when people compare Metal to Vulkan, Metal is so much easier to work with and makes so many reasonable choices - although not everything in Metal is great - but porting CUDA to Metal was exactly the opposite. Things worked, but everything was much clunkier: it required manual resource management and dispatch; the lack of robust scheduling guarantees necessitated an inefficient dispatch flow and restricted optimization options; the profiling tools were mostly not very useful; writing my own profiling tools was mostly not an option&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;; and I ended up having to use &lt;a href=&quot;https://github.com/dougallj/applegpu&quot;&gt;Dougall Johnson’s applegpu&lt;/a&gt; project to disassemble some of the kernels to be able to optimize them better. Overall it was “fine” but not entirely enjoyable; the kernels do not scale as well to higher-end Apple models like M3 Max, but since I never had direct access to those - another big issue with the Apple ecosystem is the almost non-existent cloud infrastructure, whereas with NVidia you can rent any GPU in a few minutes and get direct SSH access for a few $ an hour! - I mostly left it as is.&lt;/p&gt;

&lt;p&gt;At the end of this journey, I got to the point where getting even more performance (on NVidia HW) would really require rethinking the entire pipeline - not just using an off-the-shelf model, but at a minimum distilling into differently structured models, which requires access to a lot of data that is difficult to get, and compute resources far beyond “sure, let’s spend $100 to play with an H100 on a weekend”. This was no longer feasible as an individual - and after a few weeks of considering joining an AI lab, I ultimately decided the path was not mine to take.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Something I would tell everybody I’ve talked to this year is that not only do I not know what field I want to explore in the future, but I also don’t know the mode I want to explore that field in - individual contributor at a company that pursues lofty goals that ultimately require a large team? technical visionary that guides the efforts of said large team? a blend at a small company where both can be highly impactful in combination? cofounder at a startup starting from a blank slate and hopefully reaching for the stars? - but the more I talked to different teams in different fields, the more I got the sense that without an idea that captivates me so much I just &lt;em&gt;have to do it&lt;/em&gt; - which even the LLMs were not - losing the complete freedom and independence I’ve been enjoying during the first few months was just… not worth it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I toyed with a few more ideas around LLMs but ultimately ended up shelving that entire direction. One difficult conflict to resolve here is that the entire technology stack that the ML industry has built is deeply flawed, and yet taking a real shot at fixing that requires complete dedication, significant resources, and - to get real uptake - strong support and connections across the industry. Which again would shift the balance from “independence” to “corporate” too much for comfort, in addition to other practical issues.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Not quite certain where to go from here, I decided to spend a little more time on &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;. In terms of the actual project, I had a vague set of directions in which to take the core library, as well as some glTF-related work I’d been meaning to get to; I was also wondering if I could make the development sustainable through some sort of hybrid sponsorship model. The former was tangible; the latter felt difficult to orchestrate. I know little about this, so take it with a grain of salt, but I was skeptical of donation-based sponsorships à la GitHub Sponsors / Patreon (it works for a few incredibly successful projects, but seems to require endless community outreach, and my baseline expectation was that even funding my morning coffee would be non-trivial), and corporate sponsorship means constantly working to find new companies, fighting legal and accounting at every new sponsor to settle the terms, justifying the value (ugh), balancing feature requests from paying sponsors against what I felt was right to do, etc. Ultimately, to be viable this seemed like it would both erode the independence and create a lot of the kind of coordination and fundraising work that I do not enjoy.&lt;/p&gt;

&lt;p&gt;So without a firm plan, I thought, well, I should just focus on &lt;a href=&quot;https://meshoptimizer.org/&quot;&gt;meshoptimizer&lt;/a&gt; for a little bit and see what happens. And then I discovered something I used to know but have since forgotten:&lt;/p&gt;

&lt;p&gt;Graphics is fun, actually.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As exhilarating as exploring the field of multiplying giant matrices quickly to steer the weights to be able to perfectly model content of questionable copyright status was, it turned out that working with 3D art&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; and rendering techniques, hacking on meshoptimizer and writing shaders was… fun.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://meshoptimizer.org/gltf/&quot;&gt;gltfpack&lt;/a&gt;, which is co-developed alongside meshoptimizer library, was fun to hack on because it meant working with complex scenes - meshes, scene graphs, animations, textures, oh my! - and while it is lighter on the complex algorithms, improvements are fulfilling because they support an ever-expanding glTF ecosystem and help people ship their content or make it more efficient.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; proper was fun to hack on because it required delving deep &lt;a href=&quot;/2024/04/09/meshlet-triangle-locality/&quot;&gt;into undocumented hardware details&lt;/a&gt;, learning about established algorithms and inventing new ones, and making the library more useful which helps &lt;a href=&quot;https://meshoptimizer.org/USERS&quot;&gt;many companies that use it&lt;/a&gt; - if your project is not on this list, please let me know! There’s a large amount of untapped potential in interesting and useful algorithms that can be hidden behind a small API surface - in contrast with something like &lt;a href=&quot;https://pugixml.org&quot;&gt;pugixml&lt;/a&gt; where a lot of the value is in the API surface itself - and improving the library internals helps many different engines that use it with minimal integration or adaptation effort.&lt;/p&gt;

&lt;p&gt;Importantly, the pace and direction of development are unconstrained - while fundamentally my goal is to make both projects useful, if a processing algorithm could be a little faster and I feel like I want to spend some time on this, that’s what I’m going to spend time on&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;; if mesh shading efficiency can be improved then I can do this even if many existing production pipelines are still stuck with index buffers and vertex shaders; if improving an algorithm requires &lt;a href=&quot;/2020/01/22/learning-from-data/&quot;&gt;research in an unconventional direction&lt;/a&gt; then that’s what is on the table; and if &lt;a href=&quot;https://en.bandainamcoent.eu/elden-ring/news/elden-ring-shadow-of-the-erdtree-global-release-timings&quot;&gt;a particular week is just not a good week for working&lt;/a&gt; then I guess code is not being written.&lt;/p&gt;

&lt;p&gt;That said, in addition to the perpetual lack of funding, one challenge with open source (at least in fields I’m used to, like game development) is the limited feedback and contributions you get from the companies that use the technology. Contributions in open source are a separate, nuanced topic which maybe I will write about one day, but limited feedback, coupled with working on the library in isolation, means that there are aspects of the library that don’t work as well as they could without you knowing about it, and portions of the library that simply don’t exist because you aren’t aware of the problem - these issues sometimes just remain unsolved, and sometimes gain proprietary solutions that companies keep re-inventing independently.&lt;/p&gt;

&lt;p&gt;To try to work around this problem a little bit, I’ve also spent some time contributing code to &lt;a href=&quot;https://godotengine.org/&quot;&gt;Godot Engine&lt;/a&gt; (&lt;a href=&quot;https://github.com/godotengine/godot/pull/84384&quot;&gt;1&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/93727&quot;&gt;2&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/93916&quot;&gt;3&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/94241&quot;&gt;4&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/94682&quot;&gt;5&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/95705&quot;&gt;6&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/98529&quot;&gt;7&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/98620&quot;&gt;8&lt;/a&gt;&lt;a href=&quot;https://github.com/godotengine/godot/pull/98801&quot;&gt;9&lt;/a&gt;) and &lt;a href=&quot;https://bevyengine.org/&quot;&gt;Bevy Engine&lt;/a&gt; (&lt;a href=&quot;https://github.com/bevyengine/bevy/pull/13904&quot;&gt;1&lt;/a&gt;&lt;a href=&quot;https://github.com/bevyengine/bevy/pull/13913&quot;&gt;2&lt;/a&gt;&lt;a href=&quot;https://github.com/bevyengine/bevy/pull/14038&quot;&gt;3&lt;/a&gt;&lt;a href=&quot;https://github.com/bevyengine/bevy/pull/14042&quot;&gt;4&lt;/a&gt;). Working with Godot helped me develop some algorithms further and significantly improve the mesh import pipeline processing using other algorithms, which prompted improvements in meshoptimizer documentation among other work; working with Bevy helped me understand the requirements of hierarchical clusterization (with some improvements that have been made to support this use case better, although this journey is far from over and hopefully more things will happen in the future) and work a little more with Rust (which was fun but do not expect a rewrite-in-Rust or new Rust projects from me in 2025).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Something that I completely forgot about when writing this is that I also spent some time working on &lt;a href=&quot;https://luau.org&quot;&gt;Luau&lt;/a&gt;! It’s nothing ground-breaking or earth-shattering, but I’ve contributed quite a few codegen and compiler optimizations (&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1164&quot;&gt;1&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1171&quot;&gt;2&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1174&quot;&gt;3&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1177&quot;&gt;4&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1194&quot;&gt;5&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1201&quot;&gt;6&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1243&quot;&gt;7&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1512&quot;&gt;8&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1529&quot;&gt;9&lt;/a&gt;&lt;a href=&quot;https://github.com/luau-lang/luau/pull/1545&quot;&gt;0&lt;/a&gt;), including significantly improved vector operation lowering and a few other improvements here and there, and I am hopeful that Luau will get &lt;a href=&quot;https://github.com/luau-lang/rfcs/pull/86&quot;&gt;a pretty good lerp function&lt;/a&gt; soon™.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And then I got a reach-out from someone working at &lt;a href=&quot;https://www.valvesoftware.com/en/&quot;&gt;Valve&lt;/a&gt; with an offer to sponsor meshoptimizer development.&lt;/p&gt;

&lt;p&gt;While I knew from third-party license notices that Valve uses meshoptimizer in various games (thank you! not all companies that use meshoptimizer provide attribution - sometimes this is forgotten, and sometimes it’s technically-okay-but-I-wish-you-still-did-this because the library is part of a content pipeline that is never shipped to users), I did not realize how many components are used. Working through the details with the team made me hopeful that this can be a case of “aligned” sponsorship, or open source funding done right. I had a rough roadmap for meshoptimizer development, and it turned out that this roadmap is broadly interesting to Valve as well, so no pivot was involved; so far there is minimal extra burden as well. Neither I nor meshoptimizer are affiliated with Valve in any way, and it is still the case that the development direction and priorities are determined entirely by me (driven by the needs of different users!). Besides just funding, more direct communication helps improve the library further, through testing on production-quality data and more insight into what works, what doesn’t, and what could be possible.&lt;/p&gt;

&lt;p&gt;There is a little bit of a bus factor risk here: having just one sponsor means the risk of losing it is that much higher - either the company could lose interest, or the dynamics could change such that I would see my own, or my project’s, independence unraveling - something that, as you can probably tell, is more and more important to me. But so far I’ve been very pleasantly surprised, and there is no sight of a bus coming; so for now, my plans graduated from “focus on meshoptimizer for a little bit” to “focus on meshoptimizer for a while”.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Notably, gltfpack is still being developed in spare time. If your company is interested in funding gltfpack development with no strings attached, feel free to reach out by e-mail!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a result of all of this, meshoptimizer has seen &lt;a href=&quot;https://github.com/zeux/meshoptimizer/releases/tag/v0.21&quot;&gt;significant&lt;/a&gt; work &lt;a href=&quot;https://github.com/zeux/meshoptimizer/releases/tag/v0.22&quot;&gt;done&lt;/a&gt; this year, and I expect this to continue. While this is an imperfect metric, here’s a pie chart that aggregates, for each line of the core library, the year that line last changed. meshoptimizer does not go through mass spontaneous refactors, and the code is generally changed only when it needs to improve, so I like this as a rough way to gauge progress as well as the robustness of some parts of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_4.png&quot; alt=&quot;&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In terms of large features, simplification has seen a lot of work this year (sparsity and explicit locks for hierarchical LODs, much better attribute-aware simplification, many improvements to topology handling and error metrics, component pruning), meshlet clustering has improved a little further (more to come in 2025), meshlets can now be optimized for locality to help with rasterization efficiency on NVidia hardware, meshoptimizer now supports provoking vertex index buffer generation &lt;a href=&quot;https://advances.realtimerendering.com/s2024/index.html#hable&quot;&gt;based on John Hable’s work presented at SIGGRAPH 2024&lt;/a&gt;, and I’m wrapping up improvements to the vertex codec (smaller meshes, even faster decoding, and variable encoding speeds!) as we speak. gltfpack has seen many small improvements as well, including better welding for models with unnecessary normal splits, automatic geometry deduplication which reduces output size on some large scenes, and texture compression improvements.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_5.png&quot; alt=&quot;&quot; width=&quot;500&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Working on meshoptimizer and gltfpack is fun and rewarding, but it is not relaxing: both are production projects that are widely used and have many diverse demands. You can’t simply write a few lines of code and commit them. Both have issue reports of… varying… quality… that need to be looked at, and in general working on them still feels like… work. This is something that made me go back to the early part of the year and think about calm - the name stands for “CUDA Accelerated Language Model inference”, but it also describes a different development paradigm: the first thing I did when I created the project was to write “the goal of this project is experimentation and prototyping; it does not aim to be production ready or stable” in the README and blanket-disable GitHub issue reports. This project was fundamentally for me to tinker with, and if it doesn’t work for anyone else, it’s their problem, not mine.&lt;/p&gt;

&lt;p&gt;In part because of this, but also to keep up with the ever-changing ecosystem and keep a fresh perspective on real-time graphics despite not working in that field directly anymore, I also “rebooted” my Vulkan renderer project, &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This project was started in 2018 (wow, time flies!) as an educational &lt;a href=&quot;https://www.youtube.com/@zeuxcg&quot;&gt;YouTube&lt;/a&gt; stream series - the goal was to write a simple but modern Vulkan renderer from scratch, on stream, both to provide a useful resource for people learning graphics programming and to let me experiment with video streaming, which I had not done before. The project was actively developed during 2018, and then mostly went on hiatus, as I felt it had reached a good baseline for pure geometry rendering (featuring a meshlet-based renderer, GPU culling, object occlusion culling, etc.) and significant further progress would require a lot of new concepts. I did a couple of streams in 2023 after getting a new AMD GPU, as the AMD mesh shading pipeline was different from the hardware perspective and required more work to reach the level of performance I considered acceptable, but I did not have further plans.&lt;/p&gt;

&lt;p&gt;However, there were still many interesting areas of graphics that the project - and I personally - left unexplored; I was particularly interested in topics new to me. As a byproduct of working at Roblox for the last decade-plus, my working knowledge of rendering stopped around the “late PS3-early PS4” era, notably excluding ray tracing, bindless resources, the exciting world of temporal jitter and the “boiling soup of pixels”, and other revolutionary advances since. So a few months ago I decided that this project can serve two goals at once: continue being an educational resource for people who want to learn graphics, and serve as an “I can just write code and nobody can stop me” playground for myself.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_6.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As part of this, I now sometimes write code for this project off-stream, but I still write code on-stream for topics that feel fun to explore with a live audience. The project now loads glTF scenes and uses bindless texturing, deferred shading, and HW raytracing&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; with soft(ish) shadows, with many other things planned for the future, time permitting :) If you have not seen the project before and have a spare week or two, the &lt;a href=&quot;https://www.youtube.com/playlist?list=PL0JVLUVCkk-l7CWCn3-cdftR0oajugYvd&quot;&gt;full YouTube playlist&lt;/a&gt; is just 82 hours for now!&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;So, where does this leave us, a few short hours before 2025 begins? What are the plans and resolutions? Where will I be and what will I do in 12 months? Excitingly, the answer remains the same - I don’t know! But I think more and more I appreciate the incredible power that freedom and independence give you, and it becomes less and less likely that you will see an “I am joining a company” announcement from me in the future. “I am starting a company” is still on the table for now ;)&lt;/p&gt;

&lt;p&gt;And maybe most importantly: by combining “work can be fun” and “you can just do things”, we arrive at “you can just do work that is fun”. Hopefully for many years to come.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/independence_7.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And some travel and unwinding did happen at various points of this year :) But this is not a travel blog… &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As recently as a few days ago I needed to write a 5-line function and both Claude and O1 failed completely at doing that for me, so I had to do it myself - in a classical &lt;a href=&quot;https://xkcd.com/1319/&quot;&gt;xkcd automation moment&lt;/a&gt;, it took me much less time to do it myself than to try to get an LLM to do it for me. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I hear it’s better these days with projects like &lt;a href=&quot;https://github.com/pytorch-labs/gpt-fast&quot;&gt;gpt-fast&lt;/a&gt; and &lt;a href=&quot;https://github.com/pytorch/ao&quot;&gt;torchao&lt;/a&gt;; there are also now alternatives like &lt;a href=&quot;https://github.com/tinygrad/tinygrad&quot;&gt;tinygrad&lt;/a&gt; that are much more pleasant to work with. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is entirely impractical - why would you ever optimize a single request latency to death on an 8xH100 system? - so this is on a branch that will never be merged. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;calm implements a &lt;a href=&quot;https://github.com/zeux/calm/blob/main/tools/cudaprof.cu&quot;&gt;small CUDA profiler&lt;/a&gt; using the CUPTI trace library, which was helpful for profiling on cloud hardware and at times more convenient than NVidia’s tools - but this was not strictly necessary, and the fact that you can even do this speaks to the excellent engineering discipline in the CUDA ecosystem. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;While I have some misgivings about both, I look forward to efforts of &lt;a href=&quot;https://www.modular.com/&quot;&gt;Modular&lt;/a&gt; and &lt;a href=&quot;https://tinygrad.org/&quot;&gt;TinyCorp&lt;/a&gt;. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of perfectly certain and proper copyright status, thank you for asking. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Content pipeline processing speed is important to me; it is also important to some meshoptimizer users who have multi-hundred-million-triangle meshes to process, but surely not the highest priority for others. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I even ended up contributing a small patch to radv to &lt;a href=&quot;https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32416&quot;&gt;use an RDNA3 feature to accelerate RT traversal&lt;/a&gt;, although a lot more work is required for radv to be competitive with AMD drivers in this area. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The audio quality in the early days was pretty rough :( Maybe just watch the streams since 2023! &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Tue, 31 Dec 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/12/31/year-of-independence/</link>
			<guid isPermaLink="true">https://zeux.io/2024/12/31/year-of-independence/</guid>
		</item>
		
		<item>
			<title>Unlearning metrics and algorithms</title>
			<description>&lt;p&gt;The first somewhat social platform that I’ve used was LiveJournal; I used it around 2004-2010. Back then, we had posts and comments, but one of the notable features of the platform was the uni-directional friend relationships. The number of people who befriended you was somewhat of a status symbol, with a special term “тысячник” (a person with 1000+ reverse friend connections) used to denote Popular People.&lt;/p&gt;

&lt;p&gt;That said, my recollection is that people mostly wrote what was fun or interesting for them to write about. Your friend feed contained a chronological display of whatever your friends posted - no ads, no algorithms.&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;This post is different than usual, and it was originally written in late 2022 and published on Cohost. Back then, I was excited about Cohost’s future and planned to use it as a micro-blogging platform, reserving this blog for long, very technical and carefully written posts.&lt;/p&gt;

  &lt;p&gt;In the middle of 2023, following a &lt;a href=&quot;https://cohost.org/staff/post/1690393-h1-2023-financial-up&quot;&gt;Financial Update&lt;/a&gt; post, I realized Cohost will not survive much longer, and had to change the plans. As a result, most of the prior Cohost content has already been reposted on this blog; one more small technical post will be posted in the coming weeks. I wanted to repost this here, today, even if it doesn’t follow the typical theme of this blog - because, today Cohost team &lt;a href=&quot;https://cohost.org/staff/post/7611443-cohost-to-shut-down&quot;&gt;announced it will shut down imminently&lt;/a&gt;. This post seemed relevant, and it’s the only remaining non-technical post that I’d like to keep for posterity.&lt;/p&gt;

  &lt;p&gt;As noted before, I will try to write more short posts on this blog in the future, although they will likely be technical in nature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At some point LiveJournal faded into irrelevance and Twitter was the hot new thing. Twitter started as a similarly simple platform - you had follow counts and replies to tweets - but additional measures quickly entered the picture: these days, your tweet can still be replied to, but the “engagement” indicators also include retweets and likes. I’ve been meaning to write about this for a couple weeks now, and since then Twitter has also started showing tweet view counts - yet another number that’s right in your face next to every tweet.&lt;/p&gt;

&lt;p&gt;Once every tweet carries a set of visible engagement metrics next to it, it’s natural to start thinking about them. Am I reaching my audience? Was this a good tweet? How do I make my messaging more interesting?&lt;/p&gt;

&lt;p&gt;Of course, what also starts happening is that these - and other - engagement signals are used by the platform to shape the content people see. Long gone are the days of chronological timelines - Twitter still supports a linear view, but very aggressively reverts to the algorithmic timeline, a choice that is stored per device/session. The algorithm takes visible cues, such as retweet/like numbers and their progression, as well as less visible or documented cues - for example, Twitter reportedly artificially reduces the reach of tweets that contain links, which makes it more difficult to share external content on Twitter.&lt;/p&gt;

&lt;p&gt;This is a vicious cycle. Global factors like follow count are now not very meaningful - my account has 12K followers on Twitter (some of them bots), and yet some of my tweets get closer to 2K impressions&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. By itself this shouldn’t matter much, but when it results in lower tweet signals, you start to wonder - am I doing something wrong? Should I not post this tweet because people aren’t going to engage with it? Should I post controversial or outrageous takes because that’s what generates buzz?&lt;/p&gt;

&lt;p&gt;There’s really no rational reason for having these metrics drive the content, lacking an actual numeric (monetary) incentive - it shouldn’t make a difference whether 10 or 1000 people liked my tweet, and yet the number is right there, prominent, tantalizing, and the satisfaction of using the platform seems way too closely related to how high the numbers go.&lt;/p&gt;

&lt;p&gt;What’s worse, for me at least this shifted the thinking about conversation. If my reply gets 100x less visibility than a tweet, should I even bother replying? Replies have smaller numbers so they’re less satisfactory and thus less valuable, or so the twisted thinking goes. Replying to Important People is numerically much more engaging than holding a profoundly interesting conversation - not a good outcome!&lt;/p&gt;

&lt;p&gt;It’s because of all of this that I’m happy to see smaller, simpler, newer platforms.&lt;/p&gt;

&lt;p&gt;Cohost doesn’t show you a single number as far as I can tell. I don’t even know how many people follow me, and while I could probably find out if I tried hard enough - it doesn’t matter. What matters is the quality of the content I write, and the quality of the conversations in the comments.&lt;/p&gt;

&lt;p&gt;Mastodon does show you a bunch of numbers, but the feed is chronological… and in fact, both the web client on &lt;a href=&quot;https://mastodon.gamedev.place/&quot;&gt;mastodon.gamedev.place&lt;/a&gt; and Ivory, the iOS client I use, by default hide the boost and favorite numbers - a setting I’m happy to keep in its default, sane, position! As such, the focus seems to be much more on discussion - and indeed, while the user base seems drastically smaller than Twitter’s, and my follower count is 10x smaller than it used to be, there is a similar amount of interesting conversations that I actually want to read or engage in, at least in my field.&lt;/p&gt;

&lt;p&gt;It’s still hard to not think about numbers that represent reach - in other networks like GitHub I still use the number of forks and stars to judge how popular a given repository is&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. I still look at view counts on my YouTube videos to try to figure out what content I should publish and whether the whole video thing is worthwhile to begin with. That said, I’m trying to break away from caring about metrics and focus on content and discussion quality - numbers be damned.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;By itself this could just be a difference between total Twitter accounts and monthly active Twitter accounts, as opposed to an algorithmic bias - something that’s difficult to estimate. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And this is probably not terribly healthy either; fork count in particular is now a measure of nothing useful, as a lot of people seem to have a habit of forking a repository without any intent to change the fork in any way. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Mon, 09 Sep 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/09/09/unlearning-metrics/</link>
			<guid isPermaLink="true">https://zeux.io/2024/09/09/unlearning-metrics/</guid>
		</item>
		
		<item>
			<title>X is justifiably slow</title>
			<description>&lt;p&gt;I regularly hear or read statements like this: “X is slow but this is to be expected because it needs to do a lot of work”. It can be said about an application or a component in a larger system, and can refer to other resources that aren’t time. I often find these profoundly unhelpful as they depend much more on the speaker’s intuition and understanding of the problem, than X itself.&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;This post is much shorter than usual, and it was originally written in 2022 and published &lt;a href=&quot;https://cohost.org/zeux/post/357091-x-is-justifiably-slo&quot;&gt;on Cohost&lt;/a&gt;. I’m going to experiment with posting shorter technical content like this more regularly in the coming months, including reposting my earlier Cohost posts (of which this is one).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;X&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; may be slow because it’s using an inefficient algorithm. When a database query takes a while to return a large volume of results, it may be because there’s no index for that query - the query engine might be relatively well optimized, but the poor algorithmic complexity wins.&lt;/p&gt;

&lt;p&gt;X may be slow because it’s using an inefficient implementation of an algorithm. It’s common to see 10x gaps or more between different implementations of the same idea, and without profiling the code it’s hard to know whether something unexpected is taking the majority of time.&lt;/p&gt;

&lt;p&gt;X may be slow because it’s not taking advantage of the hardware. Modern hardware is incredibly fast, and many people don’t have a good intuition for just how fast the computers they use day to day are. The performance deficit between a reasonable serial implementation and code that uses efficient SIMD and multithreading can be significant - it’s not particularly outlandish to see a 100x delta on modern multi-core chips with wide SIMD on computationally intensive problems - and replacing drastically cache-inefficient algorithms with cache-efficient ones can also yield dramatic speedups.&lt;/p&gt;

&lt;p&gt;X may be slow because it’s actually doing the work, but doing the work isn’t necessary. Maybe X has no cache in front of it, or the cache hit rate is 10x worse than it should be, or maybe the cache retrieval itself is slow even though it doesn’t need to be.&lt;/p&gt;

&lt;p&gt;X may not even use the right framing for a problem. C++ compilers are notoriously slow, but it’s not because the process of code compilation is fundamentally slow - it’s because every element of the stack often carries profound inefficiencies that can be corrected by reframing the problem (which may require significant changes to the formulation - maybe instead of C++ you need to compile a different language!).&lt;/p&gt;

&lt;p&gt;And yet there are cases when “X is slow because it’s doing a lot of work” is actually probably right - when the problem has been well explored and can’t be reframed, when the implementation is thoroughly profiled and optimized, and especially when you can do some sort of speed of light calculation&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, eg “on this system we can’t read memory faster than 50 GB/s, and yes we do need to read this much memory because we’ve already compressed the data to the extent feasible”.&lt;/p&gt;
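A speed-of-light estimate like the one above is just a division; as a hypothetical sketch (the data size here is an illustrative assumption, and the 50 GB/s figure is the one quoted in the text):

```python
# Hypothetical speed-of-light estimate for a memory-bound pass.
# The data size is an illustrative assumption, not a measurement:
data_bytes = 8 * 2**30          # say the pass must stream 8 GiB from RAM
bandwidth_bytes_per_s = 50e9    # the 50 GB/s sustained read bandwidth quoted above

# No implementation of this pass can finish faster than this bound:
lower_bound_s = data_bytes / bandwidth_bytes_per_s
print(f"lower bound: {lower_bound_s:.3f}s")  # ≈ 0.172s
```

If the measured time is close to this bound, “doing a lot of work” is a defensible explanation; if it is 10x the bound, the deficit lives in the implementation, not the problem.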

&lt;p&gt;It can be very difficult to tell the difference, which is why I get a little annoyed every time I hear this: the odds that enough analysis has been done on the particular implementation of the particular solution, on the specific hardware, with the exact data that’s being processed before the statement is made are slim.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As should be obvious from the framing, X here is a variable, not a web site formerly known as Twitter. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://zeux.io/2024/03/15/llm-inference-sol/&quot;&gt;LLM inference speed of light&lt;/a&gt; post provides a practical example of such exercise. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Fri, 31 May 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/05/31/justifiably-slow/</link>
			<guid isPermaLink="true">https://zeux.io/2024/05/31/justifiably-slow/</guid>
		</item>
		
		<item>
			<title>target_clones is a trap</title>
			<description>&lt;p&gt;In Luau, modulo operator &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a % b&lt;/code&gt; is defined as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a - floor(a / b) * b&lt;/code&gt;, the definition inherited from Lua 5.1. While it has some numeric issues, like behavior for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b = inf&lt;/code&gt;, it’s decently fast to compute so we have not explored alternatives yet.&lt;/p&gt;

&lt;p&gt;That is, it would be decently fast to compute if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;floor&lt;/code&gt; was fast.&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;This post is much shorter than usual, and it was originally written in 2022 and published &lt;a href=&quot;https://cohost.org/zeux/post/321642-target-clones-is-a-t&quot;&gt;on Cohost&lt;/a&gt;. I’m going to experiment with posting shorter technical content like this more regularly in the coming months, including reposting my earlier Cohost posts (of which this is one).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, on A64 the codegen for the &lt;a href=&quot;https://github.com/luau-lang/luau/blob/master/VM/src/lnumutils.h#L37-L40&quot;&gt;relevant C function&lt;/a&gt; is short and sweet, and the function is trivially inlineable:&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nf&quot;&gt;luai_nummod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;fdiv&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;d2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;d0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;d1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;frintm&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;d2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;d2&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;fmsub&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;d0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;d2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;d1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;d0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;ret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately, on Intel architectures this isn’t as simple. When compiling the native C source code with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-msse4.1&lt;/code&gt; command line switch, the codegen is also simple, but it uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;roundsd&lt;/code&gt; instruction, which requires SSE4.1: an instruction set that debuted 15 years ago, and yet you still can’t rely on it being present.&lt;/p&gt;

&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nf&quot;&gt;luai_nummod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movapd&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;divsd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;roundsd&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;mulsd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;subsd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;ret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Without SSE4.1, MSVC can be coerced to generate a lengthy inline SSE2 sequence with fast math pragmas, but clang insists on calling the libc function which has a substantial penalty&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Ideally what we want is to synthesize two versions of the function, one with SSE4.1 and one with SSE2, and have the compiler call the right one automatically based on the hardware we’re targeting at build time or running on at run time. Fortunately, gcc 6.0 (2016) introduced a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_clones&lt;/code&gt; attribute precisely for this purpose. We can simply add the following attribute to the function:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;__attribute__((target_clones(&quot;default&quot;, &quot;sse4.1&quot;)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and the compiler will generate two versions of the function, plus a helper “resolver” function (see &lt;a href=&quot;https://maskray.me/blog/2021-01-18-gnu-indirect-function&quot;&gt;GNU indirect function (ifunc) mechanism&lt;/a&gt;) that is run at process startup, computes the function pointer we’re going to use, and stores the result in the procedure linkage table (PLT), through which the function is then invoked via indirect calls.&lt;/p&gt;
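&lt;p&gt;To make this concrete, here’s a minimal, self-contained sketch of how the attribute might be applied to a floored-modulo helper in the spirit of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;luai_nummod&lt;/code&gt; (illustrative, not the actual Lua/Luau source; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;floor_small&lt;/code&gt; helper and the guard macros are inventions of this sketch):&lt;/p&gt;

```c
#include <assert.h>

/* floor() for the modest value range used here; avoids a libm
   dependency so the sketch stays fully self-contained */
static double floor_small(double x)
{
    double t = (double)(long long)x;
    return (t > x) ? t - 1.0 : t;
}

/* Enable target_clones only where the toolchain is known to support it
   (gcc 6+ / clang 14+ targeting x86-64 with glibc); elsewhere the
   function compiles as a single plain version. */
#if defined(__x86_64__) && defined(__gnu_linux__) && defined(__has_attribute)
#if __has_attribute(target_clones)
#define NUM_MOD_CLONES __attribute__((target_clones("default", "sse4.1")))
#endif
#endif
#ifndef NUM_MOD_CLONES
#define NUM_MOD_CLONES
#endif

/* floored modulo, modeled on Lua's luai_nummod: a - floor(a/b)*b */
NUM_MOD_CLONES
double num_mod(double a, double b)
{
    return a - floor_small(a / b) * b;
}
```

&lt;p&gt;When this compiles on a supported target, the object file contains both clones plus an ifunc resolver; on every other target the macro expands to nothing and a single version is emitted.&lt;/p&gt;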

&lt;p&gt;Perfect - so we just add the attribute for gcc/clang and we’re done!&lt;/p&gt;

&lt;p&gt;… well.&lt;/p&gt;

&lt;p&gt;While the attribute was implemented in gcc 6 back in 2016, clang only gained support for it in clang 14 (released in 2022, and as such possibly not your production compiler yet).&lt;/p&gt;

&lt;p&gt;Additionally, clang 14 seems to have a problem that prevents use of this attribute on inline functions: multiple resolvers are generated, and they aren’t correctly marked with the flags that would let the linker merge them. This often doesn’t matter, but it does here - for targets like AArch64 that don’t need the dispatch to begin with, or for x64 builds that target SSE4.1 directly, we’d like the resulting function to be inlinable. The issue seems to be fixed in clang 15.&lt;/p&gt;

&lt;p&gt;What’s more, this is really less of a gcc feature and more of a glibc feature: when glibc is not available, the attribute doesn’t seem to exist - and this notably includes macOS. While clang on macOS enables SSE4.1 by default these days, when targeting earlier versions of macOS using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-mmacosx-version-min=10.11&lt;/code&gt;, SSE4.1 code generation gets disabled by default&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Of course, even on Linux this can be a problem. Some distributions, like Alpine Linux, use &lt;a href=&quot;https://musl.libc.org/&quot;&gt;musl libc&lt;/a&gt; and the toolchain there doesn’t support ifunc and as a consequence target_clones doesn’t work either&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Now would be a great time to mention that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ifunc&lt;/code&gt; was one of the mechanisms used in the recent - as of 2024 when this was reposted, not as of 2022 when this was written! - &lt;a href=&quot;https://en.wikipedia.org/wiki/XZ_Utils_backdoor&quot;&gt;xz backdoor&lt;/a&gt;… Something tells me &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ifunc&lt;/code&gt; is not coming to musl based distributions any time soon.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So yes, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_clones&lt;/code&gt; attribute exists, and it solves the problem pretty elegantly… when it is supported, which, even 6 years after its introduction in gcc, is still “pretty rarely”. It’s unfortunate that SIMD in C is full of portability problems like this - for a language that prides itself on unlocking maximum performance, actually reaching that performance can be rather painful.&lt;/p&gt;

&lt;p&gt;In 2023, we ended up solving the efficiency problem without using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_clones&lt;/code&gt; via manual CPUID dispatch &lt;a href=&quot;https://github.com/luau-lang/luau/blob/68bd1b2349e188c374f04e00b0b5de39e18aa5c3/VM/src/lbuiltins.cpp#L1529-L1533&quot;&gt;to set up a function pointer&lt;/a&gt; in cases where SSE4.1-friendly computations were part of builtin functions, a mechanism that deserves a separate post eventually.&lt;/p&gt;
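&lt;p&gt;A minimal sketch of what such a dispatch can look like (illustrative only - the actual Luau code is linked above): the gcc/clang builtin &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__builtin_cpu_supports&lt;/code&gt; queries CPUID, and the result picks the implementation once:&lt;/p&gt;

```c
#include <assert.h>

/* Illustrative sketch of manual dispatch (not the actual Luau code):
   two implementations of the same operation, selected once at startup. */
static double num_mod_generic(double a, double b)
{
    /* floored modulo without libm; the cast-based floor is valid for
       the modest value range used here */
    double q = a / b;
    double f = (double)(long long)q;
    if (f > q)
        f -= 1.0;
    return a - f * b;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* compiled with SSE4.1 enabled, so the compiler may use roundsd here;
   only safe to call once CPUID has confirmed SSE4.1 support */
static __attribute__((target("sse4.1"))) double num_mod_sse41(double a, double b)
{
    double q = a / b;
    double f = (double)(long long)q;
    if (f > q)
        f -= 1.0;
    return a - f * b;
}
#endif

typedef double (*num_mod_fn)(double, double);

/* resolver: run once (e.g. from a global initializer) to pick the variant */
static num_mod_fn num_mod_resolve(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("sse4.1"))
        return num_mod_sse41;
#endif
    return num_mod_generic;
}
```

&lt;p&gt;Unlike &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_clones&lt;/code&gt;, this works with musl, on macOS, and with older clang, at the cost of an explicit function pointer.&lt;/p&gt;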

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;gcc can generate the inline SSE2 version with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ffast-math&lt;/code&gt; but that switch is unsafe to enable globally, so absent a way to enable it just for one function we’re still out of luck. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is intentional, as OS X 10.11 still supports iMacs released in 2007 that have a Core 2 Duo (T7700) CPU - these support up to SSSE3, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;roundsd&lt;/code&gt; is from SSE4.1. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s not fully clear to me which components of the system on Alpine really present the problem - this ostensibly should be a linker feature, not a libc feature, but I digress. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Sat, 20 Apr 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/04/20/target-clones-trap/</link>
			<guid isPermaLink="true">https://zeux.io/2024/04/20/target-clones-trap/</guid>
		</item>
		
		<item>
			<title>Meshlet triangle locality matters</title>
			<description>&lt;p&gt;When working with mesh shaders, the geometry needs to be split into meshlets: small geometry chunks where each meshlet has a set of vertices and triangle indices that refer to the vertices inside each meshlet. Mesh shader then has to transform all vertices and emit all transformed vertices and triangles through the shader API to the rasterizer. When viewed through the lens of traditional vertex reuse cache, mesh shaders seemingly make the reuse explicit so you would think that vertex/triangle locality within one meshlet doesn’t matter.&lt;/p&gt;

&lt;p&gt;You would be wrong.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;construction&quot;&gt;Construction&lt;/h1&gt;

&lt;p&gt;As covered &lt;a href=&quot;/2023/01/16/meshlet-size-tradeoffs/&quot;&gt;in an earlier post&lt;/a&gt;, it’s non-trivial to select an optimal meshlet configuration, as it presents a challenging balance between improving vertex reuse and maintaining reasonable meshlet culling rates; additionally, on different GPUs the meshlet execution maps differently to the underlying hardware, with various efficiency criteria&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Once we do select a meshlet configuration, splitting geometry into meshlets becomes non-trivial: not only are there a lot of different possible ways to split a mesh into fixed-size meshlets, but it’s not even clear what we need to optimize for!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; provides two algorithms for this task: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshletsScan&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshlets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshletsScan&lt;/code&gt; expects an index sequence that is optimized for vertex reuse, and splits it into meshlets such that a meshlet always corresponds to the longest subsequence that still satisfies meshlet limits; when a limit is exceeded, a new meshlet starts. This algorithm is very fast and is suitable to run at load time when working with mesh shaders using meshes that were optimized for the traditional rasterization pipeline, but can often produce too many meshlets or meshlets that are not spatially coherent.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshlets&lt;/code&gt;, on the other hand, aggregates triangles into meshlets using heuristics that maximize topological and spatial proximity: the goal is to minimize the amount of border vertices that waste vertex reuse potential, and keep the triangles of a given meshlet clustered together (plus, optionally, keep the triangle normals of a given meshlet pointing in roughly the same direction to improve cone culling rejection rates).&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshlets&lt;/code&gt; is a better algorithm. On most meshes, it produces fewer meshlets than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_buildMeshletsScan&lt;/code&gt; (usually by 1-3%), and the meshlets can be more easily culled by various meshlet culling techniques (resulting in 1-5% more meshlets culled). The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Scan&lt;/code&gt; variant is really only provided for load-time use, where the extra cost of running the full &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buildMeshlets&lt;/code&gt; algorithm may be prohibitive.&lt;/p&gt;

&lt;h1 id=&quot;discovery&quot;&gt;Discovery&lt;/h1&gt;

&lt;p&gt;As I was investigating results from a recent academic publication&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, some things didn’t quite add up. The testing was done using &lt;a href=&quot;https://github.com/zeux/niagara/&quot;&gt;niagara&lt;/a&gt;, and depending on the algorithms being selected and the parameters they were run with, the meshes would sometimes get &lt;em&gt;more&lt;/em&gt; meshlets but render &lt;em&gt;faster&lt;/em&gt;. By itself this is not necessarily surprising: niagara uses a series of culling optimization steps; but what’s surprising is that there would be cases where a particular split into &lt;em&gt;more&lt;/em&gt; meshlets would produce as many output triangles or more, but render &lt;em&gt;faster&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;During mesh shading, the mesh shader itself does some amount of work that is somewhat sensitive to the meshlet contents - for example, vertex attributes need to be fetched from the vertex buffer using potentially arbitrary indices, which could cause a variable number of cache misses. The rasterization stage will then do triangle setup and culling, which is fixed function but still takes time; and of course, depending on the order of triangles, shading may get more or less efficient due to depth rejection.&lt;/p&gt;

&lt;p&gt;To isolate as many effects as possible, I’ve changed niagara locally so that no culling was performed anywhere and the mesh shader used only the built-in position output (no other attributes), simply emitting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vec4(0.0)&lt;/code&gt; for all vertices - this ensured a minimal, uniform load for the mesh shader itself and eliminated fragment shader overhead. The performance difference between different meshlet algorithms remained and if anything became more pronounced - what was previously a couple percent difference in performance was now 10+%.&lt;/p&gt;

&lt;h1 id=&quot;investigation&quot;&gt;Investigation&lt;/h1&gt;

&lt;p&gt;With a stable and more sensitive performance environment, it became easier to experiment; after a few different attempts, I tried &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buildMeshletsScan&lt;/code&gt;, and it resulted in significantly faster (10%) rendering while producing a few more meshlets. Moreover, after adjusting build parameters, significantly &lt;em&gt;more&lt;/em&gt; meshlets generated via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buildMeshletsScan&lt;/code&gt; would often render faster than significantly fewer meshlets generated via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buildMeshlets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Scan&lt;/code&gt; algorithm doesn’t do anything smart, this had to have been due to the order of triangles &lt;em&gt;inside&lt;/em&gt; each meshlet affecting the rendering performance. I then validated this theory by running existing vcache (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_optimizeVertexCache*&lt;/code&gt;) and vfetch (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_optimizeVertexFetch&lt;/code&gt;) optimization algorithms that meshoptimizer provides &lt;em&gt;on meshlet triangle data itself&lt;/em&gt;, reordering data inside each meshlet individually. This ended up “fixing” all confusing results observed so far - now fewer meshlets were never slower to render as long as each meshlet was carefully optimized, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buildMeshlets&lt;/code&gt; resulted in smaller rendering times, as it should.&lt;/p&gt;

&lt;p&gt;Curiously, while vcache (reordering triangles so that new triangles refer to more recently seen vertices) provided the major benefit, vfetch (reordering index values so that the sequence of indices inside the meshlet is relatively sequential) also helped a little bit - the numbers were on the order of 10-15% improvement from vcache optimization and 1-2% from vfetch optimization, depending on the mesh. This is despite the fact that, since the mesh shader simply emitted zero position for each vertex, the shader itself did not depend on these values at all - it simply copied them to rasterizer buffers (using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gl_PrimitiveIndicesEXT&lt;/code&gt; API).&lt;/p&gt;
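&lt;p&gt;To make “locality” concrete, here’s a toy model (purely illustrative - not NVidia’s actual, undocumented mechanism): simulate a short FIFO window of recently referenced vertices and count how many vertex references miss it. A vcache-style order inside a meshlet keeps misses near the vertex count, while a scattered order of the same triangles misses more often:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Toy locality metric, illustrative only: count vertex references that
   miss a short FIFO window of recently seen vertices. Lower means
   better locality; the real hardware mechanism is unknown. */
static size_t count_window_misses(const unsigned int* indices, size_t count, size_t window)
{
    unsigned int fifo[64]; /* assumes window <= 64 */
    size_t fifo_size = 0, misses = 0;

    for (size_t i = 0; i < count; ++i)
    {
        int hit = 0;
        for (size_t j = 0; j < fifo_size; ++j)
            if (fifo[j] == indices[i])
            {
                hit = 1;
                break;
            }

        if (!hit)
        {
            ++misses;
            if (fifo_size == window) /* evict the oldest entry */
            {
                for (size_t j = 1; j < fifo_size; ++j)
                    fifo[j - 1] = fifo[j];
                --fifo_size;
            }
            fifo[fifo_size++] = indices[i];
        }
    }

    return misses;
}
```

&lt;p&gt;For four triangles of a small vertex grid and a 4-entry window, the strip-like order {0,2,1, 1,2,3, 2,4,3, 3,4,5} misses 6 times (once per vertex - the minimum), while the scattered order {0,2,1, 3,4,5, 1,2,3, 2,4,3} of the same triangles misses 7 times; the gap grows with meshlet size.&lt;/p&gt;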

&lt;p&gt;This is where the detailed investigation hits a wall. Clearly, the values that are being fed to the rasterizer need to be somewhat local to get better performance, but the exact mechanism is uncertain. Unfortunately, this happens in a fixed-function stage where APIs provide no official counters, and &lt;a href=&quot;https://developer.nvidia.com/nsight-graphics&quot;&gt;NSight Graphics&lt;/a&gt; only gives one counter of significance&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; that changes in this workload (ISBE allocation stalled), which indicates that for meshes that are slower to render, the mesh shader spends more time waiting for available space in the rasterization queue that connects the mesh shader with the fixed-function rasterization hardware. This is not very useful on its own, because the problem could be the size of the queue (for example, triangle indices are compacted in some way inside the queue), or the rasterizer itself benefiting from locality (for example, by caching edge equations over a short window of triangles), or both. The mesh shader active execution time was unchanged, so the problem must be scoped to the rasterizer’s triangle setup, but that’s as much as can be said with certainty.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: NSight also shows a delta in PES+VPC throughput, but this can similarly be attributed to both rasterizer not being as efficient on a different index sequence as well as not being fed quickly enough from mesh shading stage due to size limitations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s worth noting that AMD never mentions anything of this sort in their mesh shader optimization materials, and on my integrated AMD RDNA2 GPU there was no difference one way or the other: performance of more densely packed meshlet sequences was better regardless of the triangle order within each meshlet.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h1 id=&quot;solution&quot;&gt;Solution&lt;/h1&gt;

&lt;p&gt;One of my hypotheses was that the mesh shader outputs data into a buffer that uses triangle strips as its storage format. This was motivated by the fact that when the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NV_mesh_shader&lt;/code&gt; Vulkan extension was introduced, using mesh shaders through it reflected the number of mesh shader invocations in the pipeline statistic that normally corresponds to geometry shaders, suggesting that both use the same hardware - and geometry shaders’ native output format is triangle strips. This initially seemed like a good lead, especially since using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_optimizeVertexCacheStrip&lt;/code&gt; (which doesn’t, strictly speaking, produce a perfect strip order, but usually comes fairly close) performed better than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_optimizeVertexCache&lt;/code&gt; - however, after digging deeper and comparing the performance of various orders with the number of shared edges between consecutive triangles, the theory was invalidated. I still suspect that the rasterizer stores triangles encoded in some way to reduce space, which happens to benefit both strips and lists with good locality, but it’s hard to know for sure&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Ultimately, since we’re dealing with the behavior of a fixed function unit that can only be observed by comparing performance (a mesh-specific and flimsy indicator!), I ended up experimenting for a while with a few different ways to optimize triangle sequences for locality and implemented an optimization algorithm that reorders triangles to maximize recency of seen vertices within a short window as well as reordering index values to be sequential. The algorithm is &lt;a href=&quot;https://github.com/zeux/meshoptimizer/pull/673&quot;&gt;now available in meshoptimizer&lt;/a&gt; as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_optimizeMeshlet&lt;/code&gt; which should be &lt;a href=&quot;https://github.com/zeux/niagara/commit/3f1098b85c9334af28614b4a6e6e959ef2f1f73b#diff-a206f3ac437c314db451b081ccbb792cec98d9b4b18766e28340e99f166dd254&quot;&gt;a one-line addition&lt;/a&gt; to a meshlet optimization pipeline.&lt;/p&gt;

&lt;p&gt;In the future, if more hardware details are disclosed or other vendors are found to have similar, but better documented, behavior, the algorithm can be improved further - for now, it provides a significant speedup on NVidia hardware (in a 100% rasterizer bound workload mentioned above it accelerated rendering by 10-15%; more realistically, when triangles end up actually rasterized, mesh shader is running vertex transformation and cluster culling is active, I’ve measured 3-5% improvement depending on the mesh on triangle-dense scenes in niagara - still a good win for something like a shadow pass!), without any detrimental effects on AMD.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;If you are working with meshlets, it’s highly recommended to update to the latest version of meshoptimizer (or rather, the latest commit - this will be part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.21&lt;/code&gt;, which hasn’t been released yet) and use the newly added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_optimizeMeshlet&lt;/code&gt; function on each meshlet as part of your geometry processing pipeline. Like all other optimization algorithms, it affects triangle order but doesn’t otherwise affect appearance.&lt;/p&gt;

&lt;p&gt;It’s a little frustrating that these optimizations have to be discovered and developed blindly, without much information about what helps and what hurts performance from IHVs; hopefully one day NVidia will publish more detailed performance optimization guides!&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;AMD recently published &lt;a href=&quot;https://gpuopen.com/learn/mesh_shaders/mesh_shaders-optimization_and_best_practices/&quot;&gt;a blog post&lt;/a&gt; as well as &lt;a href=&quot;https://www.youtube.com/watch?v=MQv76-q2cm8&quot;&gt;a GDC talk&lt;/a&gt; that goes into specifics of mesh shader behavior on AMD hardware; this post will be mostly concerned with NVidia. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This probably needs another review post, but the summary is that you should keep using the latest meshoptimizer versions for meshlet building. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In theory, NSight Graphics Pro might have more metrics but it’s not publicly available; if someone from NVidia wants to send me a build or explain what is going on, that would be great! &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I do not have an Intel Arc GPU or Apple M3 to test; for now, I am assuming this is unique to NVidia GPUs. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In a &lt;a href=&quot;/2020/01/22/learning-from-data/&quot;&gt;similar situation with hardware vertex reuse&lt;/a&gt;, I was able to use pipeline statistics as an accurate reflection of what happens in hardware; this allowed doing very small modifications on synthetic index buffers to understand the behavior. Doing the same analysis with only performance as a guide is very difficult and error-prone. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Surely with the recent ML developments gaming and graphics are just a passion project for NVidia, so it’s okay to be more open ;-) &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Tue, 09 Apr 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/04/09/meshlet-triangle-locality/</link>
			<guid isPermaLink="true">https://zeux.io/2024/04/09/meshlet-triangle-locality/</guid>
		</item>
		
		<item>
			<title>Condvars and atomics do not mix</title>
			<description>&lt;p&gt;When using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::condition_variable&lt;/code&gt;, there’s an easy to remember rule: all variables accessed in wait predicate must be changed under a mutex.
However, this is easy to accidentally violate by throwing atomics in the mix.&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;This post is much shorter than usual, and it was originally written in 2022 and published &lt;a href=&quot;https://cohost.org/zeux/post/520125-condition-variables&quot;&gt;on Cohost&lt;/a&gt;. Originally my plan was to use Cohost for shorter notes like this one, and this blog post for long-form carefully detailed content. However, Cohost has an uncertain future and various limitations, and restricting this blog to long form posts results in very few articles that actually end up being written! As such, I’m going to experiment with posting shorter technical content like this more regularly in the coming months, including reposting my earlier Cohost posts (of which this is one of).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consider what a typical job pool worker function might look like:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique_lock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_mutex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m_has_work&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m_queue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Get the job from the queue and execute it&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code conserves CPU resources when the work queue is empty by waiting on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m_has_work&lt;/code&gt;, which is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::condition_variable&lt;/code&gt;. The problem, though, is that to cleanly terminate this thread, the main thread needs to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;join()&lt;/code&gt; on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::thread&lt;/code&gt; object running this code - but if the thread is waiting for work, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;join()&lt;/code&gt; will hang because work never arrives! No problem, let’s add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&amp;lt;bool&amp;gt; m_kill_flag&lt;/code&gt; and change the loop accordingly:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;m_has_work&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m_kill_flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m_queue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_kill_flag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now all we need to do is raise the flag before joining the threads:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Notify all workers that they need to die right now.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m_kill_flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;m_has_work&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;notify_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Wait for all workers to die.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m_threads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m_threads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All good? Not so fast! This code has a race condition, and may occasionally hang!&lt;/p&gt;

&lt;p&gt;The fact that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m_kill_flag&lt;/code&gt; is an atomic here is doing us a disservice: if we change it to a regular bool, then Clang’s thread sanitizer dutifully complains that the write to the boolean is unprotected:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WARNING: ThreadSanitizer: data race (pid=88143)
  Read of size 1 at 0x00016d231114 by thread T5 (mutexes: write M32):
...
  Previous write of size 1 at 0x00016d231114 by main thread:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The boolean is read under the mutex, but it was written without the mutex being held. It may feel like overkill to grab a mutex just to toggle a boolean, and using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; fixes the ThreadSanitizer report - but it doesn’t fix the race.&lt;/p&gt;

&lt;p&gt;Consider that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait(lock, pred)&lt;/code&gt; is equivalent to a loop like this:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What can happen in the case above is that the thread checks the kill flag, which hasn’t been set to true yet, but before it gets the chance to park the thread (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cvar.wait()&lt;/code&gt; will add the thread to a list of threads waiting on the cvar so that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;notify_all&lt;/code&gt; can wake it), the main thread sets the flag to true and calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;notify_all&lt;/code&gt;. The notification state isn’t “sticky” - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;notify_all&lt;/code&gt; will not wake threads that aren’t currently waiting on the condition variable!&lt;/p&gt;

&lt;p&gt;After this, the main thread proceeds to call join, and the worker thread calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cvar.wait()&lt;/code&gt;, as it missed both the setting of the flag to true and the attempt to notify the variable. Thus the worker thread waits on the condition variable forever, and the main thread waits to join it forever - a deadlock that unfortunately escapes ThreadSanitizer’s attention because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; silences the report.&lt;/p&gt;

&lt;p&gt;The correct way to go here is to ditch the atomic and grab the mutex in the destructor:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Notify all workers that they need to die right now.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique_lock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m_mutex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m_kill_flag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m_has_work&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;notify_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This ensures that the state of the kill flag can’t change between the predicate check and the moment the condition variable atomically unlocks the mutex and adds the thread to its wait list, fixing the race.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;notify_all&lt;/code&gt; can also be called outside of the scope; the code above results in a small loss of efficiency, as threads that get woken will attempt to grab the mutex that’s being held by the main thread. That said, the threads will serialize with each other on wakeup so it’s not likely to be a significant issue in this case, but it’s something to keep in mind in other cases, especially when using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;notify_one&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of course, this doesn’t mean that any code mixing atomics and condition variables has races like this - but whenever this mix happens, it can be useful to do a very careful audit of the code. Atomics provide what I like to call “physical” atomicity - individual variables will be in a coherent state - but what’s often desired is “logical” atomicity, where whole-system invariants continue to hold; issues around this are easy to miss, especially when tools like ThreadSanitizer only check individual accesses.&lt;/p&gt;
</description>
			<pubDate>Sat, 23 Mar 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/03/23/condvars-atomic/</link>
			<guid isPermaLink="true">https://zeux.io/2024/03/23/condvars-atomic/</guid>
		</item>
		
		<item>
			<title>LLM inference speed of light</title>
			<description>&lt;p&gt;In the process of working on &lt;a href=&quot;https://github.com/zeux/calm&quot;&gt;calm&lt;/a&gt;, a minimal from-scratch fast CUDA implementation of transformer-based language model inference, a critical consideration was establishing the speed of light for the inference process, and measuring the progress relative to that speed of light. In this post we’ll cover this theoretical limit and its implications.&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you’re interested in more derivation and some graphs, &lt;a href=&quot;https://github.com/zeux/calm/blob/main/tools/sol.ipynb&quot;&gt;this notebook&lt;/a&gt; does the same modeling in Python.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;inference-mechanics&quot;&gt;Inference mechanics&lt;/h1&gt;

&lt;p&gt;When a language model&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; is generating &lt;a href=&quot;https://tiktokenizer.vercel.app/&quot;&gt;tokens&lt;/a&gt;, it does so one token at a time; language models (specifically, decoder-only text transformer models, but the rest of the post will just describe them as LLMs) can be understood as functions that take a token as input and produce an array of probabilities for all tokens from the vocabulary (which typically has 50-250K tokens, each a few letters long). Then, the program samples from the set of all tokens using the probabilities to guide the sampling, produces the next token, and the process repeats. This means that there is no possibility of parallelism when generating one sequence of text - the generation process can be modeled one token at a time&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Broadly, the language model does two types of operations when processing a token: matrix-vector multiplication, where a large matrix (e.g. 8192x8192) is multiplied by a vector to produce another vector, and attention computation. During generation, the model can see not just the state of the current token, but also internal states from all previous tokens in the sequence - both the ones the user has written in the prompt and the ones the model itself has generated. These are stored in a structure called “KV-cache” (key-value cache), which is essentially a set of key and value vectors for each previous position in the text. Attention takes a query vector generated for the current token, computes a dot product between it and all key vectors for all previous positions, then normalizes the resulting set of scalars and computes an output vector as a weighted sum of all value vectors for all previous positions, using the normalized dot products as weights.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
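As a concrete illustration, single-head attention for one token can be sketched as follows; this is a simplification that ignores multi-head structure and the usual scaling factor, and the function name and layout are just for exposition:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Single-head attention for one token over a KV-cache of p previous
// positions, each holding a d-element key vector and value vector.
// Note: for every element read from keys/values we do one multiply-add.
std::vector<float> attend(const std::vector<float>& query,
                          const std::vector<std::vector<float>>& keys,
                          const std::vector<std::vector<float>>& values)
{
    size_t p = keys.size(), d = query.size();
    if (p == 0)
        return std::vector<float>(d, 0.0f);

    // Dot product of the query with every cached key vector.
    std::vector<float> scores(p);
    for (size_t i = 0; i < p; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < d; j++)
            s += query[j] * keys[i][j];
        scores[i] = s;
    }

    // Normalize the scores (softmax).
    float maxs = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - maxs);
        sum += s;
    }
    for (float& s : scores)
        s /= sum;

    // Weighted sum of the value vectors, using normalized scores as weights.
    std::vector<float> out(d, 0.0f);
    for (size_t i = 0; i < p; i++)
        for (size_t j = 0; j < d; j++)
            out[j] += scores[i] * values[i][j];

    return out;
}
```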

&lt;p&gt;Now, both matrix-vector multiplication and attention computation have one important characteristic in common: for each element read from the matrix or KV-cache, we need to do a very small number of floating-point operations. Matrix-vector multiplication does one multiply-add (2 FLOPs) per matrix element; attention computation does one multiply-add per key element for dot product, and one multiply-add per value element for computing a weighted sum.&lt;/p&gt;

&lt;p&gt;Modern CPUs and GPUs have a much higher rate of ALU operations (multiplies, adds) compared to the rate at which they can read inputs from memory. For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;AMD Ryzen 7950X has 67 GB/s memory bandwidth and 2735 GFLOPS, for a 40:1 FLOP:byte ratio&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;NVidia GeForce RTX 4090 has 1008 GB/s memory bandwidth and 83 TFLOPS, for an 82:1 FLOP:byte ratio&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;NVidia H100 SXM (which is a data-center card) has 3350 GB/s memory bandwidth and 67 TFLOPS, for a seemingly more modest 20:1 FLOP:byte ratio; however, for problems that look like a matrix multiplication, tensor cores provide ~494 TFLOPS without sparsity for a 147:1 FLOP:byte ratio.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The numbers get even worse for smaller floating point numbers like FP16 or FP8: H100 tensor cores have a theoretical throughput of 1979 TFLOPS for dense FP8 matrices, which brings the FLOP:byte ratio to 590:1. Needless to say, in any of these configurations and regardless of whether tensor cores are used or what the floating-point format is, ALU is in abundance.&lt;/p&gt;

&lt;p&gt;Thus, any problem that only needs to do two operations per element must be bandwidth-limited, and we should be able to estimate the minimum amount of time it can take to run the inference process from the model configuration, the size of the KV-cache, and the available bandwidth.&lt;/p&gt;

&lt;h1 id=&quot;mistral-speed-of-light&quot;&gt;Mistral speed-of-light&lt;/h1&gt;

&lt;p&gt;Without going too much into the exact formulas and matrices used, let’s look at a model like &lt;a href=&quot;https://mistral.ai/news/announcing-mistral-7b/&quot;&gt;Mistral 7B&lt;/a&gt;, which has 7.2 billion parameters (so the total number of all matrix elements is 7.2B).&lt;/p&gt;

&lt;p&gt;The composition of the parameters is as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;4096 * 32000 = 131M parameters for embedding matrix; this matrix isn’t used in a matrix-vector multiply, as just a single row of the matrix is read per token, so we will not include this in the bandwidth calculations&lt;/li&gt;
  &lt;li&gt;32 * (4096 * (128 * 32 + 128 * 8 * 2) + 4096 * 128 * 32) = 1342M parameters for computing attention-related vectors&lt;/li&gt;
  &lt;li&gt;32 * (4096 * 14336 * 3) = 5637M parameters for transforming hidden state via a feed-forward network&lt;/li&gt;
  &lt;li&gt;4096 * 32000 = 131M parameters for converting the hidden state into token probabilities; this is used in a matrix multiply unlike the embedding matrix&lt;/li&gt;
&lt;/ul&gt;
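The accounting above can be double-checked with a few lines of arithmetic; the architecture constants (32 layers, hidden dim 4096, head dim 128, 32 query heads, 8 KV heads, FFN dim 14336, vocab 32000) are assumed from Mistral 7B’s published configuration:

```cpp
#include <cstdint>

// Assumed Mistral 7B architecture constants.
const int64_t layers = 32, dim = 4096, head_dim = 128;
const int64_t q_heads = 32, kv_heads = 8, ffn_dim = 14336, vocab = 32000;

// Embedding matrix: read row-wise, so not part of the bandwidth cost.
const int64_t embed = dim * vocab;                                // ~131M

// Attention projections: Q, K, V inputs plus the output projection.
const int64_t attn = layers * (dim * (head_dim * q_heads + head_dim * kv_heads * 2) +
                               dim * head_dim * q_heads);         // ~1342M

// Feed-forward network: three dim x ffn_dim matrices per layer.
const int64_t ffn = layers * (dim * ffn_dim * 3);                 // ~5637M

// Classifier converting hidden state to token probabilities.
const int64_t classifier = dim * vocab;                           // ~131M

// "Active" parameters used in matrix-vector multiplies; x2 bytes for FP16.
const int64_t active = attn + ffn + classifier;                   // ~7.11B
```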

&lt;p&gt;This adds up to ~7111M “active” parameters that are used in matrix multiplications. If the model is using FP16 for the matrix elements, we end up having to read ~14.2 GB of data for each token.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; Additionally, while each matrix is going to be used again when running inference for the next token, caches are measured in tens of megabytes, and as such we can assume this process cannot run faster than memory bandwidth as the weights won’t stay in cache between runs&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This covers matrix math; attention computation needs to read the KV-cache up until the current token, so the amount of data read depends on how many tokens the model sees when generating the new one - that includes the system prompt (usually hidden from the user), user prompt, previous model output, and can include multiple user prompts for a longer chat session.&lt;/p&gt;

&lt;p&gt;For Mistral, the KV-cache stores 8 128-element key vectors and 8 128-element value vectors per layer, which adds up to 32 * 128 * 8 * 2 = 65K elements per token; if the KV-cache uses FP16 for individual elements, then for token number P we will need to read P * 130 KB of memory - for example, token number 1000 will need to read 130 MB of data from the KV-cache.&lt;/p&gt;

&lt;p&gt;From these numbers, it’s now easy to compute the minimal amount of time required for inference. For example, on NVidia RTX 4090 (1008 GB/s), 14.2 GB take ~14.1 ms to read, so we can expect ~14.1 ms per token for tokens with low position numbers (KV-cache impact is negligibly small). If we use 8-bit weights, we need to read 7.1 GB and that takes ~7.0 ms. These are &lt;em&gt;lower bounds&lt;/em&gt; - they represent the minimum theoretically possible time per token.&lt;/p&gt;
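These estimates can be sketched as a small calculation; the constants mirror the numbers in this section, and the helper function is just for exposition:

```cpp
#include <cstdint>

// KV-cache footprint per token: 32 layers x 8 KV heads x 128 dims,
// for both keys and values, at 2 bytes (FP16) per element.
const int64_t kKVBytesPerToken = 32 * 128 * 8 * 2 * 2; // 131072 = ~130 KB

// Lower bound on per-token latency: total bytes read divided by bandwidth.
double token_time_ms(double weight_bytes, int64_t position, double bandwidth) {
    double kv_bytes = double(position) * kKVBytesPerToken;
    return (weight_bytes + kv_bytes) / bandwidth * 1000.0;
}
```

For 14.2 GB of FP16 weights on a 1008 GB/s GPU this gives ~14.1 ms at position 0, and the KV-cache term only starts to matter at positions in the thousands.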

&lt;h1 id=&quot;are-theoretical-bounds-useful&quot;&gt;Are theoretical bounds useful?&lt;/h1&gt;

&lt;p&gt;We’ve done a bunch of math and arrived at a few numbers that tell us we can’t run inference faster than a given threshold - is this useful? Let’s look at a few reasons why it could be.&lt;/p&gt;

&lt;p&gt;To actually reach that time, you need a high-quality software implementation, and hardware that can reach the theoretical peak bandwidth. This means that if a given implementation is far from the optimal number, it’s a cause for investigation: efficiency might be left on the table, either on the software or on the hardware side. For example, on RTX 4090 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;calm&lt;/code&gt; achieves ~15.4 ms/tok for Mistral 7B when using 16-bit weights and ~7.8 ms/tok for 8-bit weights - this is around 90% of the theoretically possible performance.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; On an Apple M2 Air using CPU inference, both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;calm&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama.cpp&lt;/code&gt; only reach ~65% of the theoretical 100 GB/s bandwidth, suggesting that the quoted peak bandwidth can only be fully utilized with the help of the iGPU.&lt;/p&gt;

&lt;p&gt;Bandwidth scales linearly with the number of bytes used per element; this means that we can both estimate theoretical wins from smaller weight formats (quantization) and validate the quality of implementations by comparing the actual performance with the theoretical limit. For example, on RTX 4090 &lt;a href=&quot;https://github.com/ggerganov/llama.cpp&quot;&gt;llama.cpp&lt;/a&gt; achieves ~17.1 ms/tok for Mistral 7B when using 16-bit weights (82% of peak), ~10.3 ms/tok for 8.5-bit weights (71% of peak) and ~6.7 ms/tok for 4.5-bit weights (58% of peak), suggesting significant optimization opportunities for smaller formats.&lt;/p&gt;

&lt;p&gt;In addition to providing a lower bound on decoding time, the modeling above suggests that the inference process is significantly under-utilizing the ALU units. To fix this, the FLOP:byte balance needs to shift; techniques like &lt;a href=&quot;https://medium.com/@TitanML/in-the-fast-lane-speculative-decoding-10x-larger-model-no-extra-cost-f33ea39d065a&quot;&gt;speculative decoding&lt;/a&gt; attempt to help with this. In a multi-user setting, we can also note that when multiple user requests are being processed, we can perform multiple matrix-vector multiplications with the same matrix at the same time (otherwise known as a matrix-matrix multiplication!) - an optimal implementation of matrix-matrix multiplication becomes ALU-bound for sufficiently large matrices. This is why this ALU:byte imbalance is not a critical issue for production inference systems - when you ask ChatGPT to help with a task, your request is evaluated concurrently with many other requests on the same GPU, and the bandwidth is utilized more efficiently. Crucially, request batching typically does not help with KV-cache bandwidth (unless the requests share a very large prefix), because KV-cache size and bandwidth increase with the number of requests, whereas the weight matrix stays constant.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Mixture of Experts models like Mixtral have slightly different scaling characteristics: batching initially only increases the bandwidth required, but once the expert utilization becomes significant the inference becomes increasingly ALU bound.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, if batching is not applicable, bandwidth serves as a critical estimator, constant across model variations/device type or architecture, for the expected inference performance, and you can use it to decide on the hardware you need to use. For example, NVidia RTX 4080 has 716 GB/s bandwidth, so you would expect it to run LLM inference at ~0.7x the speed of RTX 4090 - this can be different from the relative performance in other workloads such as gaming, ray tracing or inference of other types of neural networks!&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;For problems like this where the amount of computation and memory access is known a priori, using theoretical speed of light modeling as grounding is really important, as it helps validate the quality of implementations and predict the impact of architectural changes.&lt;/p&gt;

&lt;p&gt;Ideally, your inference implementation should carefully calculate the achieved effective bandwidth, and you should use it during profiling as the main source of guidance - as this is the value that you know the limit for! Do make sure to calculate it carefully though - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;calm&lt;/code&gt; had several cases where an architectural quirk made the computed bandwidth slightly incorrect :)&lt;/p&gt;

&lt;h1 id=&quot;appendix-group-query-attention&quot;&gt;Appendix: Group query attention&lt;/h1&gt;

&lt;p&gt;Mistral-7B is a very well-balanced model; in all calculations above it would almost seem as if the KV-cache is not an essential part of the cost structure. One of the reasons behind this is the comparatively short context (Mistral-7B uses windowed attention, which limits the bandwidth consumed to a window of 4096 tokens), but the other, perhaps more important reason, is the use of group-query attention.&lt;/p&gt;

&lt;p&gt;In group-query attention (with a 4x ratio), to produce 4 dot-products for the attention, instead of using 4 query vectors and computing a dot product with 4 corresponding key vectors, we take &lt;em&gt;one&lt;/em&gt; key vector but 4 query vectors and perform 4 dot products. This allows us to reduce the size and the required bandwidth for the KV-cache - instead of reading each element from the KV-cache and only doing one multiply-add operation on it, we’re now doing 4, which rebalances ALU:bandwidth ratio somewhat in our favor.&lt;/p&gt;

&lt;p&gt;This is also critical for KV-cache memory size, but that may not be apparent for such short contexts: 4096-token context takes 0.5 GiB with Mistral, but a comparable model without GQA (like Llama 7B) would “only” need 2 GiB. Let’s look at a recent model that does &lt;em&gt;not&lt;/em&gt; use GQA, Cohere’s &lt;a href=&quot;https://txt.cohere.com/command-r/&quot;&gt;Command-R&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The model itself has ~35B parameters, so at 16 bits/weight, we would need to read 70 GB of weights for each token during inference&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;. For each token it needs to store 40 * 128 * 64 * 2 = 655K elements in the KV-cache, which at 16 bits/element is 1.3 MB per token.&lt;/p&gt;

&lt;p&gt;Thus a 4096-token context would take ~5.3 GB; that’s already somewhat significant compared to ~70 GB weights. However, things get scarier if you consider that Cohere’s model is advertised to have 200K token context window – to compute the last token of the 200K context window, you would need to read 260 GB! (let’s ignore the fact that you would also need 260 GB of VRAM to store it)&lt;/p&gt;

&lt;p&gt;In a typical “production” (still single-user) setting, things shift even more. Weights would often use 4-bit quantization (~4.5 bits/weight as is often implemented), and the KV-cache might use 8-bit (FP8) values. If we “conservatively” assume 100K context (half of the advertised maximum), this would get us to ~19.7 GB for model weights and ~65 GB for KV-cache, and to compute the last token we need to read all of that from memory. Suddenly, attention computation goes from insignificant to taking ~75% of the time, assuming both run at peak bandwidth!&lt;/p&gt;
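The Command-R estimates above can be reproduced with a few constants (the 40-layer / 64-head / 128-dim configuration and the byte sizes per element are taken from the description in this section):

```cpp
#include <cstdint>

// KV-cache elements per token: layers x head_dim x heads x 2 (keys + values).
const int64_t kv_elems = 40 * 128 * 64 * 2;               // 655360 per token
const double kv_mb_fp16 = kv_elems * 2 / 1e6;             // ~1.3 MB/token at FP16

// "Production" setting: FP8 KV-cache (1 byte/element), 100K-token context,
// ~35B weights quantized to ~4.5 bits/weight.
const double kv_gb_100k = 100000.0 * kv_elems / 1e9;      // ~65.5 GB
const double weights_gb = 35e9 * 4.5 / 8 / 1e9;           // ~19.7 GB

// Attention's share of the bandwidth cost for the last token.
const double attn_share = kv_gb_100k / (kv_gb_100k + weights_gb); // ~0.77
```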

&lt;p&gt;While 100K context may seem a little extreme, in a multi-user context this is also a fair representation of the expected workload. Batching allows us to make matrix multiplication ALU-bound and read the model weights once per batch of values (= per 64+ user requests), but every user request would typically have its own KV-cache, so attention stays bandwidth bound - and requires a lot of memory to fit all users’ requests on a single node!&lt;/p&gt;

&lt;p&gt;If these models used 4x GQA, the size and required bandwidth for KV-cache would have been 4x smaller; while still significant for tens of thousands of tokens of context, it would have been more manageable. There might be a quality degradation associated with GQA for Cohere’s intended use cases - it would be interesting to see the technical report as it may contain relevant ablation studies, but purely from the cost/performance point of view, GQA needs to be evaluated for every transformer-based LLM as the benefits are too significant to ignore.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This post is going to omit a lot of details and not attempt to fully explain the mechanics of transformer modeling; I’m not the best person to do so and detailed articles have been written by other people. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is different in a prefill phase, where the model is given existing text and is asked to convert it to the internal representation, where the tradeoffs are different. Also, notably techniques like speculative execution attempt to provide some degree of parallelism by trying to use a less accurate predictor serially and then validating the guesses in parallel. Neither technique will be discussed here. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This description omits multi-head attention and the details of “normalization” (softmax), but neither are critical for understanding the inference performance speed of light. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;These numbers are from AIDA64 table &lt;a href=&quot;https://lanoc.org/review/cpus/8673-amd-ryzen-9-7950x3d&quot;&gt;in this review&lt;/a&gt;; my 7950X uses slower memory so it can only sustain ~50 GB/s bandwidth. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;These numbers are taken from NVidia spec sheets; as such they represent the theoretical limits. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Here and elsewhere GB is a decimal unit, equal to 1000^3, not GiB. All bandwidth measurements reported by manufacturers are powers of 10 even though the RAM sizes are powers of 2. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There’s an architectural variant that fixes this by duplicating some layers which for smaller models can keep them in memory during inference, but I’m not aware of an open-source model that uses this. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Close, but not quite there - 100% bandwidth utilization is unfortunately very hard to get close to on NVidia GPUs for this workload. Larger GPUs like H100 are even more difficult to fully saturate; on Mixtral - this is a different architecture but it obeys the same tradeoffs for single sequence generation if you only count active parameters - calm achieves ~75% of theoretically possible performance, although large denser models like Llama 70B can get closer to the peak. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Command-R has a large vocab (256K) and large hidden state (8192) so it spends a whopping 2B parameters on embeddings, but it reuses the same matrix for embedding and classification so we don’t need to exclude this from the inference bandwidth calculation. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Fri, 15 Mar 2024 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2024/03/15/llm-inference-sol/</link>
			<guid isPermaLink="true">https://zeux.io/2024/03/15/llm-inference-sol/</guid>
		</item>
		
		<item>
			<title>It is time</title>
			<description>&lt;p&gt;I joined Roblox in August 2012; eleven years and 4000 commits later, it’s time to say goodbye. Today was my last day.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_fin.png&quot; alt=&quot;Commits&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Roblox was around 60 people when I joined. It was desktop-only, English-only, and predominantly US-focused (we thought it was okay to take the site down for a few hours every Wednesday night to update as most of North America would be asleep by then!). Neither DevEx nor developer forum existed (and even the word “developer” wasn’t around), the games you would play were much simpler and less polished, and the engine and tools needed a lot of work.&lt;/p&gt;

&lt;p&gt;I spent the decade that followed rebuilding the technology with the help of amazing coworkers and making many friends along the way. Looking back, I’m proud of what we all have built together - I wrote more about my contributions &lt;a href=&quot;https://zeux.io/2020/08/02/eight-years-at-roblox/&quot;&gt;in a retrospective post in 2020&lt;/a&gt;. The defining characteristic of engineering at Roblox to me has always been finding the best fit for the product and vision, even if it leads you down the path few have traveled before - the outcome is usually a mix of conventional and unconventional technology, and I think we’ve done well on both fronts.&lt;/p&gt;

&lt;p&gt;In the last few years, I’ve been fortunate to be able to self-direct and dedicate my time to problems that I thought were important for the company and interesting to solve. A lot of this time was spent on various &lt;a href=&quot;https://zeux.io/resume/&quot;&gt;language-related projects&lt;/a&gt; as well as some engine-wide initiatives (although I wound that down at the end of 2022 to focus on Luau). &lt;a href=&quot;https://luau-lang.org&quot;&gt;Luau&lt;/a&gt; is now a nicer language with a stronger implementation, and several big efforts are underway to make it even better. I’m also happy to report that all &lt;a href=&quot;https://www.youtube.com/playlist?list=PL0JVLUVCkk-nk_prIw6JKEe965zu4AUnq&quot;&gt;hack week projects&lt;/a&gt; I’ve implemented have either been shipped as part of Roblox or are actively being worked on&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, after a fair amount of effort, we open-sourced &lt;a href=&quot;https://luau-lang.org&quot;&gt;Luau&lt;/a&gt; at the end of 2021; I drove this initiative personally from the start and up until very recently. This year, the games Alan Wake 2 and Warframe both switched to Luau as their internal scripting language, which has been very validating. Open-sourcing engine components is unusual for Roblox and I’m grateful that for this specific project, it could happen; with the strong team that continues to develop the language, I’m optimistic about the future - Roblox remains committed to Luau and the open-source efforts around it&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Fast forward to today, it is difficult to recognize the company I joined all these years ago. The technology, processes, business metrics, and company size are all eons ahead of where they were back then - I can’t take credit for most of this but I’m happy to have helped. I’ve also grown significantly since then as an engineer and as a leader (unfortunately I am also 11 years older now…), and I’m very grateful for this opportunity.&lt;/p&gt;

&lt;p&gt;The tremendous success, however, comes at a price. Roblox helped me grow alongside the company, but in recent years it became increasingly difficult to keep up with the organizational scale while still building things. Maintaining focus on areas I still considered critical required continuous effort that drained too much energy, as the company embraced new directions of evolution that I was not particularly excited about. The organizational dynamics made steering decisions and projects needlessly difficult at times, and it felt like the culture I was used to was dissolving.&lt;/p&gt;

&lt;p&gt;Ultimately I realized that the company would do just fine without me - but that for me to continue to self-improve without distractions, and to deliver impact the way I prefer to, I needed to let go.&lt;/p&gt;

&lt;p&gt;Earlier this year, I had a chance to connect with a lot of the amazing developers at my ninth RDC, some for the ninth year in a row, some now working at Roblox, and it was just like the good old days - so much passion for the platform, so much knowledge and creativity, so much excitement about all the new features. They truly are a cornerstone of Roblox, and I will miss them - and all of the brilliant people who remain with the company.&lt;/p&gt;

&lt;p&gt;So, what is next for me? The answer makes me excited and terrified at the same time - I don’t know!&lt;/p&gt;

&lt;p&gt;My decision isn’t completely rational - I don’t know what path I want to take, or what goal I want to reach. Rather, I decided the best way for me to discover the next journey is to end this one - and in doing so, create a void that will naturally give birth to something new. I plan to dedicate more time to various open-source projects in the coming months and pursue new ideas - we will see where this takes me!&lt;/p&gt;

&lt;p&gt;I am always happy to connect and discuss interesting new ideas and opportunities - unless you’re a recruiter&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. I’m also tentatively interested in consulting, to the extent it will allow me to broaden my worldview, especially if it aligns with the open-source work I am doing anyway - please don’t hesitate to reach out via e-mail in either case!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Improved voxel lighting doesn’t carry any code that I’ve written as part of the hack week but it was inspired by it and I’m happy it’s being implemented nonetheless! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I do not yet know to what extent I will have the time and energy to participate as an external contributor myself; this is certainly possible, but no promises! &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Leaving one big company for another big company just to implement someone else’s idea would not make that much sense after all. If you are a founder of a tiny startup, let’s talk ;) &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Tue, 28 Nov 2023 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2023/11/28/it-is-time/</link>
			<guid isPermaLink="true">https://zeux.io/2023/11/28/it-is-time/</guid>
		</item>
		
		<item>
			<title>Efficient jagged arrays</title>
			<description>&lt;p&gt;A data structure that comes up fairly often when working with graphs or graph-like structure is a jagged array, or array-of-arrays. It’s very simple to build it out of standard containers but that’s often a poor choice for performance; in this post we’ll talk about a simple representation/construction code that I found useful across multiple different projects and domains.&lt;/p&gt;

&lt;p&gt;Crucially, we will focus on immutable structures - ones that you can build in one go from source data and then continuously query without having to change it. This seems like a major constraint but for many problems it is sufficient to build the structure once, and it makes significantly simpler and more efficient implementations possible.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;applications&quot;&gt;Applications&lt;/h1&gt;

&lt;p&gt;Graphs may seem like an abstract data structure out of a graduate CS course, but they come up fairly often in diverse algorithms. Here are a few examples of why you might want to use a jagged array:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For mesh processing, given an index buffer, it would be useful to build a list of triangles that each vertex belongs to&lt;/li&gt;
  &lt;li&gt;For other mesh processing algorithms, instead of triangles you might want to keep track of edges that each vertex belongs to&lt;/li&gt;
  &lt;li&gt;For intermediate representations in a compiler, you might want to build a list of basic blocks that jump to a given basic block (aka predecessors)&lt;/li&gt;
  &lt;li&gt;For transform graphs where every node only specifies a parent node, you might want to build a list of children for every node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all of these examples, you’re starting from data that already encodes the relevant relationship (parent-child, vertex-triangle), but for efficiency it’s useful to create a data structure that can quickly look up the entire set for a given element. For example, to compute an average vertex normal, you’d need to aggregate the normals of all incident triangles; there are ways to do this without building adjacency structures, but features like crease handling can be difficult to incorporate into an algorithm when adjacency is not available.&lt;/p&gt;

&lt;p&gt;In all cases above, the immutability is often tolerable. For example, the mesh topology might be static throughout the algorithm, or the basic block information might only change infrequently and as such can be recomputed on demand.&lt;/p&gt;
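&lt;p&gt;As a minimal sketch of the normal-averaging use case above - using a plain vector-of-vectors adjacency for now, with an illustrative &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Vec3&lt;/code&gt; type and function name that are not from any particular library:&lt;/p&gt;

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Sum incident triangle normals per vertex, assuming adjacency[v] lists
// the triangles that vertex v belongs to and triangle_normals[t] holds
// the normal of triangle t. Normalization (and angle/area weighting or
// crease handling) is omitted for brevity.
std::vector<Vec3> average_normals(
    const std::vector<std::vector<unsigned int>>& adjacency,
    const std::vector<Vec3>& triangle_normals)
{
    std::vector<Vec3> result(adjacency.size(), Vec3{0, 0, 0});

    for (size_t v = 0; v < adjacency.size(); ++v)
    {
        Vec3 sum = {0, 0, 0};
        for (unsigned int t : adjacency[v])
        {
            sum.x += triangle_normals[t].x;
            sum.y += triangle_normals[t].y;
            sum.z += triangle_normals[t].z;
        }
        result[v] = sum;
    }
    return result;
}
```

&lt;p&gt;The queries here only ever read the structure - which is exactly why building it once, immutably, is good enough.&lt;/p&gt;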

&lt;h1 id=&quot;naive-implementation&quot;&gt;Naive implementation&lt;/h1&gt;

&lt;p&gt;For concreteness, we will assume that we’re building a triangle adjacency structure, where for each vertex we need to keep track of all triangles it belongs to. Given the index buffer and number of vertices, the solution is very simple&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;push_back&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The code is simple but has a number of efficiency problems&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For each vertex we’re paying an overhead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sizeof(std::vector)&lt;/code&gt;, which is 24 bytes on a 64-bit system, just to store the array - even if the vertex is not used. This is a memory problem as well as a performance problem since lookups into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adjacency&lt;/code&gt; will use more cache space than necessary.&lt;/li&gt;
  &lt;li&gt;Because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; grows exponentially, typically with a factor of 1.5, we might also lose memory on the extra elements that never end up being used. If the length of each list is 6 on average, the vector capacity will be 8, losing ~8 bytes per vertex.&lt;/li&gt;
  &lt;li&gt;Because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; stores data on the heap, we will also potentially waste memory on allocation metadata and block rounding. The exact size depends on the allocator; on Linux we end up wasting ~16 bytes between each allocation on allocation headers.&lt;/li&gt;
  &lt;li&gt;In addition, we will need to reallocate each vector continuously as we’re pushing new elements. The growth of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; is particularly slow for small sizes, and to get to size 6 we will need 4 allocations. This results in a substantial performance overhead, as doing this many small allocations and deallocations could easily dominate the performance of the algorithm. While this (and the previous) problem can be mitigated by reserving the size of each element up-front if we have a guess&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, if our guess is off in either direction we could waste more memory or more time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/jagarr_1.png&quot; alt=&quot;Naive layout&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For the code above, if each vertex belongs to 6 triangles, the allocations where the actual lists are stored are going to be ~48 bytes apart on Linux; this is reasonable from the macro locality perspective&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, but given that we only have ~24 bytes of data (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;6 * sizeof(unsigned int)&lt;/code&gt;), traversing this data would still waste ~half of the bytes loaded into cache - and that’s ignoring the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sizeof(std::vector)&lt;/code&gt; overhead (which is also 24 bytes, so in total we’re using ~72 bytes per vertex to store just 24 bytes of triangle indices).&lt;/p&gt;

&lt;h1 id=&quot;counting-once&quot;&gt;Counting once&lt;/h1&gt;

&lt;p&gt;The majority of inefficiencies all come from the fact that we aren’t quite sure how long each per-vertex list of triangles should be; guessing this number up front can misfire, but given that the lists are not going to change in size after the adjacency computation is done, we can do much better by counting the list sizes &lt;em&gt;first&lt;/em&gt; in a separate pass. Once that’s done, we will be able to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reserve&lt;/code&gt; each vector to the exact size that it needs, and fill the lists as we did before.&lt;/p&gt;

&lt;p&gt;While we are here, we can also replace &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; with a plain old array pointer that we allocate with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new&lt;/code&gt; - after all, we’re paying a substantial memory cost for each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; object since it needs to manage the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;capacity&lt;/code&gt; fields.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Triangles&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Triangles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We now need to do one more pass through the source data to compute the size of each individual list; once that’s done, we can allocate the exact number of entries and fill them with a second pass. While we’re here, we can also use 32-bit integers instead of 64-bit integers for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offset&lt;/code&gt;, unless we want to process vertices that are part of &amp;gt;4B triangles.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/jagarr_2.png&quot; alt=&quot;Array layout&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s see how we’re doing on our efficiency goals.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Instead of ~4 allocations per vertex, we now use just one, immediately allocating an array of the correct size&lt;/li&gt;
  &lt;li&gt;We’re using less memory per vertex because we aren’t using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; anymore; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sizeof(Triangles)&lt;/code&gt; is only 16 bytes.&lt;/li&gt;
  &lt;li&gt;We’re still losing memory on allocation metadata and block rounding; on Linux, we end up only using ~32 bytes of memory for each allocated block though.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In total, we’ve reduced the total number of allocations to one per vertex, and the total amount of memory per vertex from 72 bytes to 48 bytes. Naturally, we can do better.&lt;/p&gt;

&lt;h1 id=&quot;merging-allocations&quot;&gt;Merging allocations&lt;/h1&gt;

&lt;p&gt;The major source of remaining inefficiencies is the fact that each list is allocated separately. A much more efficient approach would be to allocate enough memory for all the lists and then place each list at an offset in the resulting large allocation such that for each list, the items of that list follow the items of the list before it. This sounds as if it would require a lot of extra code and tracking data, but it turns out we are in a reasonable position to do this without too much extra complexity:&lt;/p&gt;

&lt;p&gt;Note that in our previous solution, just precomputing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; was not enough: we also needed to track &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offset&lt;/code&gt; for each vertex during the algorithm’s operation, as we needed to know which element in each triangle list to write to next. With one more pass over the counts, we can instead compute the offset assuming all lists are in one large allocation: instead of each offset starting from 0, we will start offset for each vertex with the total number of elements needed by all preceding vertices. This will allow us to allocate all triangle lists as part of one large allocation, &lt;em&gt;and&lt;/em&gt; stop storing the pointer to each list in each vertex. The initial value for each offset is known as a &lt;a href=&quot;https://en.wikipedia.org/wiki/Prefix_sum&quot;&gt;prefix sum&lt;/a&gt; and is trivial to compute in one pass.&lt;/p&gt;

&lt;p&gt;After we do that and fill all lists, each offset will have been shifted forward by the size of its list - to refer back to the correct range, we need to subtract count from offset to compensate for the extra additions. This was not a problem in our previous solution because we stored a pointer for each list, but now that the memory for all lists is shared, offset is what identifies the contents of each list when actually using the adjacency structure.&lt;/p&gt;
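&lt;p&gt;The offset bookkeeping described above can be sketched in isolation (function names here are illustrative): a prefix sum over the counts produces each list’s starting offset into one shared array, and after the fill loop advances each offset past its list, subtracting count restores the start:&lt;/p&gt;

```cpp
#include <cstddef>
#include <vector>

// Compute each list's starting offset into one shared allocation:
// list i starts where the elements of lists 0..i-1 end (prefix sum).
std::vector<unsigned int> prefix_offsets(const std::vector<unsigned int>& counts)
{
    std::vector<unsigned int> offsets(counts.size());
    unsigned int sum = 0;
    for (size_t i = 0; i < counts.size(); ++i)
    {
        offsets[i] = sum;
        sum += counts[i];
    }
    return offsets;
}

// Undo the shift accumulated while filling: since filling advanced each
// offset by exactly count elements, offset - count is the list start.
void rewind_offsets(std::vector<unsigned int>& offsets,
                    const std::vector<unsigned int>& counts)
{
    for (size_t i = 0; i < offsets.size(); ++i)
        offsets[i] -= counts[i];
}
```

&lt;p&gt;Putting the pieces together with the counting pass gives the full construction:&lt;/p&gt;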

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Triangles&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Triangles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// this corrects for offset++ from the previous loop:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// use &amp;amp;data[adjacency[vertex].offset] when querying adjacency&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Instead of allocating each list separately, we now place them one after another in a single large allocation. This packs the data more densely, lets us refer to individual elements with indices instead of pointers, and eliminates all allocation waste and overhead.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/jagarr_3.png&quot; alt=&quot;Merged layout&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that in the code above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offset&lt;/code&gt; is now tracking a “global” offset - which is to say, we now expect that the sum of all lengths of all lists fits into a 32-bit integer. This is valuable for memory efficiency, but does technically restrict this structure to slightly smaller data sources. It’s easy to change by using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size_t&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offset&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sum&lt;/code&gt; if necessary.&lt;/p&gt;

&lt;p&gt;Let’s see how we’re doing on our efficiency goals.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We are now using 0 allocations per vertex (2 allocations total).&lt;/li&gt;
  &lt;li&gt;Each vertex needs 8 bytes of data for maintaining the list (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offset&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;The lists themselves are stored tightly without any extra memory overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each vertex now uses 8+24 = 32 bytes of memory (assuming an average list length of 6), which is fairly close to optimal… but we’re not quite done yet.&lt;/p&gt;

&lt;h1 id=&quot;removing-counts&quot;&gt;Removing counts&lt;/h1&gt;

&lt;p&gt;Up until this point, things were mostly intuitive; I’ve used data structures computed in a similar manner for years. Recently I realized there’s still a little bit of redundancy left &lt;del&gt;and I totally did not just spend all this time writing an entire post to mention this one tiny tweak, which is probably also widely used elsewhere, to an otherwise standard construction&lt;/del&gt;.&lt;/p&gt;

&lt;p&gt;Consider an array of counts, [1 6 5 3 6] for example, that we’ve computed in the first loop.&lt;/p&gt;

&lt;p&gt;Our second loop computes a running sum, filling offsets with [0 1 7 12 15], and yielding sum=21. These are the offsets where each individual list will live.&lt;/p&gt;

&lt;p&gt;Our third loop actually fills all the list items, and in doing so advances each offset to get to the following offsets array: [1 7 12 15 21].&lt;/p&gt;

&lt;p&gt;Our fourth loop finally corrects this by subtracting count from each element.&lt;/p&gt;

&lt;p&gt;If you look at the shifted offset array carefully, it’s actually &lt;em&gt;almost&lt;/em&gt; a subset of our initial offsets array! Which really makes sense: if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets[v]&lt;/code&gt; is initially pointing at the beginning of each individual list, then after we filled all the lists &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets[v]&lt;/code&gt; is pointing at the end of the list - which also happens to be the beginning of the next list due to how we’ve laid out all lists in memory!&lt;/p&gt;

&lt;p&gt;This means that we don’t need to adjust the offsets using counts - conceptually, we can shift the elements forward by 1 in the array, or simply fill them into the right place to begin with. It also means that counts, once filled in the initial loop and used in the prefix sum, are no longer necessary - except when iterating over a list during adjacency queries, where we can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets[v + 1]&lt;/code&gt; to denote the end of the range where the list is stored.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/jagarr_4.png&quot; alt=&quot;Final layout&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This allows us to mostly ignore &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; - we still need to compute it once, but we’ll store it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets[]&lt;/code&gt; array, and all further operations will work with offsets. Instead of shifting the array by moving the elements around, we will simply carefully adjust the offset indexing, which will save us the trouble of doing any offset correction at the end.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// note: we allocate one extra element so that when querying, we use:&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// &amp;amp;data[offsets[v]] .. &amp;amp;data[offsets[v + 1]]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offsets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// offsets[0] starts at 0 and stays 0&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offsets1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offsets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// compute count into offsets1&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;offsets1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// transform counts into offsets in place&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offsets1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;offsets1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// populate lists; this automatically adjusts offsets to final value&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offsets1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;offsets1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// all done!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Compared to our previous solution, we no longer need a final loop to correct the data; more importantly, we don’t need to explicitly track &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; anymore, which saves ~4 bytes per vertex and yields a fairly optimal&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; general solution. Note that you can still replace &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unsigned int&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size_t&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets&lt;/code&gt; if the total number of elements can ever exceed 4 billion, which would still use only 8 bytes per vertex for metadata.&lt;/p&gt;

&lt;p&gt;To iterate through the resulting structure, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; need to be retained as they compose the entirety of the data structure (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets1&lt;/code&gt; is temporary and just simplifies the code a little bit):&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offsets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]];&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offsets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]];&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// do something with the triangle index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;The jagged array in a single allocation (or two) is surprisingly useful and versatile! Hopefully you’ve enjoyed the post, which I tried to structure so that the transformations from the naive solution to the optimal one are intuitive and have a clear rationale. Other algorithms and data structures can benefit from similar optimizations. Most crucially, cleanly splitting the operations into “build” and “query” phases, so that the structure can be built in one go, applies in many other cases and leads to a much better data layout or performance, whether driven by algorithmic complexity or machine efficiency. Beyond that, it really pays off to be careful with allocations and to structure as much data as possible as large, directly indexable arrays.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: After publishing this, Paul Lalonde on Mastodon noted that the final algorithm can be a good match for GPU processing: between atomics for the counting/filling phase, and a plethora of fast parallel prefix sum algorithms, the construction of this data structure can be done entirely on the GPU with a few compute shaders/kernels, and once built the structure is easy and fast to query as well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes the ability to mutate is still important, and some ideas from this post will not apply. Some algorithms in &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; use the second-to-last variant of the code (with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;counts&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets&lt;/code&gt; separated): while it does not allow arbitrary mutation, it does allow &lt;em&gt;removing&lt;/em&gt; elements from individual lists, which can speed up adjacency queries in some cases, so merging the two arrays is not always fruitful.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We technically don’t &lt;em&gt;need&lt;/em&gt; to use a jagged array here - an array of any list-like data structure would suffice. However, an array is desirable because it improves traversal performance: elements end up next to each other in memory. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For astute readers, the division by 3 may or may not be a small efficiency concern; the compiler can replace it with an induction variable and even if it doesn’t, this will get compiled to a few instructions that don’t involve integer divisions - we will ignore this specific problem throughout the rest of the article as it’s very specific to computing triangle indices and trivial to fix. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The number “6” referenced above is not accidental: for large manifold meshes, each vertex is likely to participate in ~6 triangles; however, depending on the triangulation, other valences are possible. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Meaning that even though we’re allocating a lot of tiny blocks, they are still going to be fairly close in memory as they are allocated sequentially. Depending on the allocator, this can change if another thread is allocating other small blocks concurrently. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; implies a redundant zero initialization step that can be skipped by using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new&lt;/code&gt;; also, the last loop should probably read and write &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;offsets1&lt;/code&gt; more explicitly to avoid codegen issues due to aliasing. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Fri, 30 Jun 2023 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2023/06/30/efficient-jagged-arrays/</link>
			<guid isPermaLink="true">https://zeux.io/2023/06/30/efficient-jagged-arrays/</guid>
		</item>
		
		<item>
			<title>Fine-grained backface culling</title>
			<description>&lt;p&gt;Backface culling is something we take for granted when rendering triangle meshes on the GPU. In general, an average mesh is expected to have about 50% of its triangles facing away from the camera. Unless you forget to set appropriate render states in your favorite graphics API, the hardware will reject these triangles as early in the rasterization pipeline as possible. Thus, it would seem that backface culling is a solved problem. In this post, however, we’ll explore a few alternative strategies that may or may not improve rendering performance.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;standard-backface-culling&quot;&gt;Standard backface culling&lt;/h1&gt;

&lt;p&gt;As long as you’re setting up backface culling in your graphics API, for example by using VK_CULL_MODE_BACK_BIT in Vulkan, the hardware will perform backface culling automatically. Any backfacing triangle will be culled early in the triangle setup process, typically by a fixed-function unit. Triangle setup can typically process a small number of triangles per cycle per geometry engine, and the triangle rejection rate may be a small multiple of that - the performance will certainly vary by GPU vendor and model. For example, according to the &lt;a href=&quot;https://www.amd.com/system/files/documents/rdna-whitepaper.pdf&quot;&gt;RDNA whitepaper&lt;/a&gt;, Navi GPUs cull two primitives per clock per “primitive unit”, of which the Radeon 5700 XT has four - adding up to a rejection rate of 8 triangles per clock. The 2-to-1 ratio of cull throughput to processing throughput is typical, as half of the triangles will be culled on average.&lt;/p&gt;

&lt;p&gt;On some GPU drivers, the culling may be implemented in “software” and run in a shader stage; we will cover this later in this post.&lt;/p&gt;

&lt;h1 id=&quot;cluster-cone-culling&quot;&gt;Cluster cone culling&lt;/h1&gt;

&lt;p&gt;In more recent GPU architectures, the geometry pipeline gained more flexibility with the introduction of task and mesh shaders. NVidia GeForce GPUs support task/mesh shaders starting from the 20xx series (Turing), and AMD Radeon GPUs support them starting from the 6xxx series (RDNA2), although for AMD there’s some amount of impedance mismatch between the exposed programming model and the hardware support, which has been improved in the 7xxx series (RDNA3). Task/mesh shaders can be used in Vulkan (via &lt;a href=&quot;https://github.com/KhronosGroup/Vulkan-Docs/blob/main/proposals/VK_EXT_mesh_shader.adoc&quot;&gt;VK_EXT_mesh_shader&lt;/a&gt;) and in DirectX 12 (via &lt;a href=&quot;https://devblogs.microsoft.com/directx/coming-to-directx-12-mesh-shaders-and-amplification-shaders-reinventing-the-geometry-pipeline/&quot;&gt;shader model 6.5&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;With these extensions, coarse-grained culling of geometry becomes feasible: the geometry is split into a number of “meshlets”, each containing a small number of triangles, and the task shader can reject a meshlet if none of its triangles are visible. As long as this check can be done efficiently (and conservatively), it can improve rasterization performance on geometry-dense scenes, as culling is performed in batches of triangles.&lt;/p&gt;

&lt;p&gt;For the purpose of this article, backface culling can be done on meshlet granularity by using &lt;a href=&quot;https://gpuopen.com/learn/geometryfx-1-2-cluster-culling/&quot;&gt;cluster cone culling&lt;/a&gt;. For each meshlet, the intersection between all negative half-spaces of all triangles in the meshlet is approximated with a cone, such that any point in that cone is simultaneously in all negative half-spaces - if the camera is in this cone, all triangles are backfacing and do not need to be rendered. meshoptimizer provides algorithms for &lt;a href=&quot;https://github.com/zeux/meshoptimizer#mesh-shading&quot;&gt;splitting meshes into meshlets and computing bounding information&lt;/a&gt;, and you can look at a full end-to-end integration of this technique in &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara renderer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/backface_1.png&quot; alt=&quot;Cone culling&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This test is cheap and coarse, which makes it a good candidate for task/mesh shaders. However, it’s also very conservative: while on average we’d expect 50% of triangles to be backfacing when using 124-triangle clusters and a single cone per cluster, cone culling can typically reject only up to ~25% on dense and reasonably smooth meshes, and the rejection rate will be lower on meshes with larger triangles or more complex topology (for example, on Lumberyard Bistro interior scene the rejection rate is just 4%)&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;While the cone is an imperfect approximation, fundamentally on meshes with large triangles or sharp changes in local curvature coarse backface culling is bound to be much less effective, as triangles with different orientations will be mixed together in the same meshlet. This can be mitigated by grouping triangles into meshlets by orientation, but that introduces a lot of topological seams that hurt transformation and rasterization efficiency, and results in a poor tradeoff&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h1 id=&quot;bruteforce-backface-culling&quot;&gt;Bruteforce backface culling&lt;/h1&gt;

&lt;p&gt;With the mesh shader execution model, every meshlet vertex gets transformed to clip space as a result of a mesh shader threadgroup invocation. Between this and the presence of meshlet-local topology, it’s trivial to perform backface culling in the mesh shader manually. In Vulkan, this can be done by outputting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gl_CullPrimitiveEXT&lt;/code&gt; for each primitive&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;To perform backface culling, we need to save the vertex position in clip space into threadgroup storage:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and compute the result of backface culling for each triangle:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;gl_MeshPrimitivesEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gl_CullPrimitiveEXT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that while backface culling simply requires computing the sign of the triangle area in screen space, because this computation is done after the perspective divide, we can only use the result of the test when all vertices are in front of the clip plane. This can be efficient if we need the screen-space positions for other forms of culling anyway, such as small primitive culling, but if we only need backface culling then we can use a more efficient formulation suggested in &lt;a href=&quot;https://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/olano97_homogeneous.pdf&quot;&gt;Triangle Scan Conversion using 2D Homogeneous Coordinates&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// for each vertex&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xyw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// for each triangle&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gl_MeshPrimitivesEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gl_CullPrimitiveEXT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;determinant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;mat3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexClip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
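&lt;p&gt;As a sanity check, the two formulations can be compared numerically: the 3x3 homogeneous determinant equals the post-divide signed area scaled by the product of the three w values, so for positive w both tests make the same decision. A standalone sketch (the helper types are stand-ins for the GLSL ones; which sign counts as backfacing depends on the winding convention):&lt;/p&gt;

```cpp
#include <cassert>

// (x, y, w) of a clip-space vertex; z is not needed for the winding test
struct ClipVert { float x, y, w; };

// 3x3 determinant with the three vertices as rows
// (stand-in for GLSL's determinant(mat3(a, b, c)))
static float det3(ClipVert a, ClipVert b, ClipVert c)
{
    return a.x * (b.y * c.w - b.w * c.y)
         - a.y * (b.x * c.w - b.w * c.x)
         + a.w * (b.x * c.y - b.y * c.x);
}

// Screen-space area sign after the perspective divide; valid only when all w > 0
static bool cullAfterDivide(ClipVert a, ClipVert b, ClipVert c)
{
    float ax = a.x / a.w, ay = a.y / a.w;
    float bx = b.x / b.w, by = b.y / b.w;
    float cx = c.x / c.w, cy = c.y / c.w;
    return (bx - ax) * (cy - ay) > (cx - ax) * (by - ay);
}

// Homogeneous form: det3 equals the post-divide area scaled by a.w * b.w * c.w,
// so for positive w the sign - and hence the culling decision - is identical
static bool cullHomogeneous(ClipVert a, ClipVert b, ClipVert c)
{
    return det3(a, b, c) > 0;
}
```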

&lt;p&gt;Either of the above formulas rejects backfacing triangles precisely, reaching ~50% culling efficiency. But why repeat the work the hardware is doing anyway - wouldn’t a fixed-function unit be more efficient?&lt;/p&gt;

&lt;p&gt;As briefly noted before, on some hardware the driver may end up doing some amount of per-triangle culling in the synthesized shaders when using the traditional rasterization pipeline. Notably, on AMD GPUs the radv driver - and likely other drivers - &lt;a href=&quot;https://timur.hu/blog/2022/what-is-ngg&quot;&gt;does this for backface, frustum, and small primitive culling&lt;/a&gt;, using heuristics to decide whether this is worthwhile.&lt;/p&gt;

&lt;p&gt;The intuition behind why this can be beneficial lies in the imbalance between fixed-function geometry units and shader units. Going back to the Radeon 5700 XT as an example, with its four primitive units it can cull 8 triangles per cycle. At the same time, that GPU has 40 compute units, each dispatching up to 64 scalar multiply-adds per cycle, which adds up to ~2560 multiply-adds per cycle. A 3x3 matrix determinant (above) takes ~9 scalar multiply-adds, so theoretically we should be able to cull ~280 triangles per cycle using ALUs. Of course, this speed-of-light estimate omits a lot of details, and some of the shader units may be busy executing other workloads (although for geometry-heavy passes like a depth prepass or a shadowmap pass the bottleneck is likely going to be in rasterization), but ultimately it’s clear that in certain cases it’s possible to dramatically outperform the fixed-function culling hardware.&lt;/p&gt;

&lt;p&gt;In fact, because AMD drivers tend to use shader culling with the traditional rasterization pipeline (at least on desktop), some form of fine-grained shader culling may be required to reach performance parity with the mesh shading pipeline, as - at least as of this writing, on radv - mesh shaders do not get any form of shader culling by default.&lt;/p&gt;

&lt;h1 id=&quot;precomputed-triangle-visibility-masks&quot;&gt;Precomputed triangle visibility masks&lt;/h1&gt;

&lt;p&gt;While 9 multiply-adds per triangle is not that much, it can be tempting to omit these computations altogether and find a more compute-efficient way to cull triangles. In 2015, the &lt;a href=&quot;https://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf&quot;&gt;GPU-Driven Rendering Pipelines&lt;/a&gt; SIGGRAPH talk introduced a technique for precomputing triangle visibility masks. The space around the center of each cluster is classified into a small number of regions (6 in the talk), and for each region we precompute a mask where each bit corresponds to the visibility of one triangle from that region. The mask has to be conservative - we must record 1 in the mask if the triangle is visible from any point in that region.&lt;/p&gt;

&lt;p&gt;Then, at runtime, we classify the camera position into one of the regions and use the corresponding mask to cull triangles using simple bit tests. Other than the classification, which can be done in the task shader as the result is shared between all triangles in the cluster, this only requires fetching the mask - which can be done using scalar loads for each 32-triangle group - and a bit test:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;gl_MeshPrimitivesEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gl_CullPrimitiveEXT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;maskSide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meshletData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;maskOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To compute the region, we need to assign a region index based on the camera position in cluster space. This requires transforming the camera position with inverse object transforms, and classifying the resulting vector using code like this for 6 regions:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maskSide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meshlets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;radius&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is not particularly cheap in the abstract but, when done in the task shader, the work is shared between all triangles in a cluster and as such has a fairly negligible cost.&lt;/p&gt;

&lt;p&gt;Computing the masks is fairly simple: we need to test each triangle against each region of space and see whether the region is entirely within the negative half-space of the triangle plane (in which case no point in the region can see the front side of the triangle). This task is made slightly harder by the fact that the region is infinite, but it has a simple algebraic formulation. For example, for the 6-region split, the region corresponding to the “+X” side is bounded by four rays, where t is the parameter along each ray:&lt;/p&gt;

&lt;p&gt;$ P(t) = (t, \pm t, \pm t) $&lt;/p&gt;

&lt;p&gt;Since the camera position is unlikely to be inside the cluster, we can assume &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t &amp;gt;= meshlet radius&lt;/code&gt; (and check that this actually holds during runtime classification so that the tests are conservative). Then, we simply need to solve the negative half-space test for each ray, assuming the triangle plane equation is $ Ax + By + Cz + D = 0 $:&lt;/p&gt;

&lt;p&gt;$ \forall t \ge radius : At \pm Bt \pm Ct + D \le 0 $&lt;/p&gt;

&lt;p&gt;$ \forall t \ge radius : At \pm Bt \pm Ct \le -D $&lt;/p&gt;

&lt;p&gt;$ A \pm B \pm C \le min(0, -D/radius) $&lt;/p&gt;

&lt;p&gt;If the above holds for each of the four rays (there are four +/- combinations), the region is entirely in the negative half-space of the triangle plane and we can set the corresponding bit in the mask to 0. This works because any point inside the region is a convex combination of points on the four rays that delimit it, and the plane equation is linear, so interpolating between points with negative plane values yields a negative value (put another way, the region is a convex set whose boundary is defined by the four rays).&lt;/p&gt;

&lt;p&gt;If the above does not hold for any of the rays, the region is not entirely in the negative half-space of the triangle plane and we must set the corresponding bit in the mask to 1. While this results in a conservative approximation of visibility, it’s not entirely precise - we’ll look at experimental results in a minute, but intuitively, in a similar test done in 2D with the space split into four quadrants, most triangles would be classified as invisible from only one region out of four, as the plane will intersect two of the others and the fourth will be entirely in the positive half-space:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/backface_2.png&quot; alt=&quot;Precomputed masks&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This problem can be mitigated by increasing the number of regions - although that, in turn, will require more storage for visibility masks. For example, we can split each frustum region into four sub-regions, which requires a total of 24 bits = 3 bytes per triangle (the same amount of space as meshlet-local topology storage occupies). That also allows us to use a slightly simpler classifier:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maskSide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meshlets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;radius&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
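&lt;p&gt;Offline, the per-region test derived earlier reduces to a handful of comparisons per triangle, since the worst case over the four sign combinations is obtained by taking absolute values of the two off-axis plane coefficients. A sketch for the 6-region case (the Plane struct and bit layout here are illustrative assumptions, not the exact implementation from the talk; radius is assumed positive):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Plane equation Ax + By + Cz + D = 0 of a triangle, in cluster-local space
// with the cluster center at the origin
struct Plane { float a, b, c, d; };

// Computes a 6-bit visibility mask for one triangle: bit k is cleared iff the
// region around axis k (+X, +Y, +Z, -X, -Y, -Z, matching the classifier's
// indices) lies entirely in the negative half-space of the plane. Each region
// is bounded by four rays with a fixed sign on the main axis and t >= radius;
// the worst case over the four off-axis sign choices is obtained by taking
// absolute values of the two off-axis coefficients, so one comparison suffices.
unsigned computeTriangleMask(Plane p, float radius)
{
    float n[3] = {p.a, p.b, p.c};
    float bound = std::fmin(0.0f, -p.d / radius);
    unsigned mask = 0;

    for (int axis = 0; axis < 3; ++axis)
    {
        float off = std::fabs(n[(axis + 1) % 3]) + std::fabs(n[(axis + 2) % 3]);

        if (n[axis] + off > bound)  // +axis region: bits 0..2
            mask |= 1u << axis;
        if (-n[axis] + off > bound) // -axis region: bits 3..5
            mask |= 1u << (axis + 3);
    }

    return mask;
}
```

&lt;p&gt;For example, a triangle facing +X through the cluster center gets the mask 0b110111: it is marked invisible only from the -X region, matching the 2D intuition above.&lt;/p&gt;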

&lt;h1 id=&quot;multiple-cones-per-meshlet&quot;&gt;Multiple cones per meshlet&lt;/h1&gt;

&lt;p&gt;In the technique above, we are limited to a fixed subdivision of space into regions, which creates an efficiency tradeoff - we either use very few regions, which results in a visibility approximation that is too conservative, or many regions, which improve the culling efficiency but require more storage.&lt;/p&gt;

&lt;p&gt;An alternative variant of precomputed visibility is to use the same technique as cluster cone culling, but instead of using one cone per cluster, allow multiple cones and record which cone each triangle belongs to. For example, if we use 4 cones instead of 1, we can store a 2-bit cone id for each triangle, perform 4 cone tests in the task shader, send the 4-bit result mask to the mesh shader, and for each triangle pick the correct bit to determine visibility.&lt;/p&gt;

&lt;p&gt;To compute the cone id, we need to classify triangles into groups with similar normals. For this we can use K-means clustering, which converges to a good result in very few iterations. Once we have the cone id for each triangle, we can compute the bounding cone for each subset of the triangles in the meshlet, and record the cone axis/angle. This can be done with repeated invocations of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_computeMeshletBounds&lt;/code&gt; for each cone as long as the apex-less formulation of the cone test is used (see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshoptimizer.h&lt;/code&gt; for details).&lt;/p&gt;
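&lt;p&gt;A minimal sketch of the clustering step - plain K-means on the unit sphere with naive seeding and a fixed iteration count, a simplification rather than the exact implementation (assumes at least k input normals):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct vec3 { float x, y, z; };

static float dot(const vec3& a, const vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static vec3 normalize(const vec3& v)
{
    float l = std::sqrt(dot(v, v));
    return l > 0 ? vec3{v.x / l, v.y / l, v.z / l} : vec3{0, 0, 1};
}

// Assigns each triangle normal to one of k cones via K-means on the unit
// sphere: closest centroid by maximum dot product, centroids averaged and
// re-normalized after each update. A few iterations are enough in practice.
std::vector<int> clusterNormals(const std::vector<vec3>& normals, int k, int iterations = 8)
{
    std::vector<vec3> centroids(k);
    for (int j = 0; j < k; ++j)
        centroids[j] = normals[j * normals.size() / k]; // naive seeding

    std::vector<int> assignment(normals.size(), 0);

    for (int it = 0; it < iterations; ++it)
    {
        // assignment step: pick the centroid with the largest dot product
        for (size_t i = 0; i < normals.size(); ++i)
        {
            int best = 0;
            for (int j = 1; j < k; ++j)
                if (dot(normals[i], centroids[j]) > dot(normals[i], centroids[best]))
                    best = j;
            assignment[i] = best;
        }

        // update step: average the normals in each cluster and re-normalize
        for (int j = 0; j < k; ++j)
        {
            vec3 sum = {0, 0, 0};
            for (size_t i = 0; i < normals.size(); ++i)
                if (assignment[i] == j)
                    sum = {sum.x + normals[i].x, sum.y + normals[i].y, sum.z + normals[i].z};
            if (dot(sum, sum) > 0)
                centroids[j] = normalize(sum);
        }
    }

    return assignment;
}
```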

&lt;p&gt;Just as with cluster cone culling, we can do all classification in the task shader - however, unless all cone tests agree that the triangles are back-facing, we can’t reject the entire cluster, and need to pass the mask to the mesh shader invocation and perform the per-triangle test (here we’re using bitshifts to index into a mask using a 2-bit cone id):&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;gl_MeshPrimitivesEXT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gl_CullPrimitiveEXT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mconeCull&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meshletData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;maskOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
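
&lt;p&gt;For reference, the CPU-side packing that this unpacking expression assumes can be sketched in Python (hypothetical helpers with illustrative names; multiplication and division by powers of 4 stand in for the shifts and masks used in the shader):&lt;/p&gt;

```python
def pack_cone_indices(indices):
    # Pack 2-bit cone indices, 16 per 32-bit word, matching the shader's
    # meshletData layout: index i lives at bit offset (i % 16) * 2 of
    # word i // 16.
    words = [0] * ((len(indices) + 15) // 16)
    for i, idx in enumerate(indices):
        assert 3 >= idx >= 0  # each index selects one of up to 4 cones
        words[i // 16] += idx * 4 ** (i % 16)
    return words

def unpack_cone_index(words, i):
    # Mirrors the GLSL expression: shift by (i % 16) * 2, then mask to 2 bits.
    return (words[i // 16] // 4 ** (i % 16)) % 4
```

&lt;p&gt;A round trip through these two helpers recovers the original per-triangle cone indices.&lt;/p&gt;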

&lt;h1 id=&quot;estimating-culling-efficiency&quot;&gt;Estimating culling efficiency&lt;/h1&gt;

&lt;p&gt;For any given mesh, it’s fairly straightforward to analyze the culling efficiency offline: assume the camera is in a given position, run the classification/culling code sketched above, and count the total number of backface-culled triangles. The results depend greatly on the mesh structure and the camera position; the table below summarizes culling rates for several meshes in a test set using each of the approximate culling techniques described above (keeping in mind that the optimal target is 50%):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Algorithm&lt;/th&gt;
      &lt;th&gt;Kitten (niagara)&lt;/th&gt;
      &lt;th&gt;Happy Buddha&lt;/th&gt;
      &lt;th&gt;Sponza&lt;/th&gt;
      &lt;th&gt;Lumberyard Interior&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Cluster cone culling&lt;/td&gt;
      &lt;td&gt;25.2%&lt;/td&gt;
      &lt;td&gt;28.0%&lt;/td&gt;
      &lt;td&gt;7.7%&lt;/td&gt;
      &lt;td&gt;4.2%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Precomputed masks (6 regions)&lt;/td&gt;
      &lt;td&gt;10.4%&lt;/td&gt;
      &lt;td&gt;10.6%&lt;/td&gt;
      &lt;td&gt;8.0%&lt;/td&gt;
      &lt;td&gt;10.2%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Precomputed masks (24 regions)&lt;/td&gt;
      &lt;td&gt;26.6%&lt;/td&gt;
      &lt;td&gt;28.2%&lt;/td&gt;
      &lt;td&gt;24.8%&lt;/td&gt;
      &lt;td&gt;24.6%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Multi-cone culling (2 cones)&lt;/td&gt;
      &lt;td&gt;32.9%&lt;/td&gt;
      &lt;td&gt;37.4%&lt;/td&gt;
      &lt;td&gt;18.5%&lt;/td&gt;
      &lt;td&gt;12.2%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Multi-cone culling (4 cones)&lt;/td&gt;
      &lt;td&gt;38.1%&lt;/td&gt;
      &lt;td&gt;42.5%&lt;/td&gt;
      &lt;td&gt;28.5%&lt;/td&gt;
      &lt;td&gt;24.3%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As we can see, cone culling is very sensitive to the cluster composition - even when using 4 cones per cluster, we see very different culling rates depending on how the meshes are constructed. In contrast, precomputed masks give much more consistent results, although they too depend on how triangle normals happen to align with the region planes.&lt;/p&gt;

&lt;p&gt;Unfortunately, neither 2 cones nor 6 precomputed masks get us close to the 50% target, and even with 24 regions the precomputed masks only get as far as ~25% culling efficiency. 24 masks require 24 bits per triangle; 4 cones require 2 bits per triangle plus ~16 bytes of cone data per meshlet (~3 bits per triangle total, assuming 128 triangle meshlets).&lt;/p&gt;
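
&lt;p&gt;The storage math above can be spelled out as a quick sketch (the 128-triangle meshlet size is the same assumption as in the text):&lt;/p&gt;

```python
def bits_per_triangle(per_triangle_bits, per_meshlet_bytes=0, triangles=128):
    # Amortize any per-meshlet data over the triangles it covers.
    return per_triangle_bits + per_meshlet_bytes * 8 / triangles

masks_24 = bits_per_triangle(24)                      # 24 regions: 24 bits/tri
cones_4 = bits_per_triangle(2, per_meshlet_bytes=16)  # 2 bits + ~1 bit of cone data
print(masks_24, cones_4)
```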

&lt;p&gt;Additionally, cluster cone culling remains the only coarse test: for all other tests, reaching the indicated efficiency numbers requires per-triangle culling - for which we can always apply the brute-force option as well. While the tests require less ALU per triangle, they aren’t free - some amount of bitmask loading and bit extraction is still necessary, only to reach suboptimal efficiency. Ultimately, the only way to know if any of these techniques are worthwhile is to try them out in a real application.&lt;/p&gt;

&lt;h1 id=&quot;measuring-hardware-performance&quot;&gt;Measuring hardware performance&lt;/h1&gt;

&lt;p&gt;Lacking the patience to fully implement all of these techniques in a production renderer, I’ve implemented them in &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt;. We are going to use the default kittens scene, and disable all other forms of per-cluster culling to isolate the performance impact of backface culling alone. While using the kittens mesh may seem to favor cluster cone culling due to the higher efficiency, this is partially mitigated by the use of level of detail: when level of detail is enabled, a lot of the scene is rendered using lower-density meshes than the base mesh for which the results were reported above. We will see that this affects the efficiency of cluster cone culling and brings it more in line with what we could expect for more realistic geometry.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/backface_3.jpg&quot; alt=&quot;Rendered frame&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The tests were performed on an NVidia RTX 4070 Ti using NVidia Nsight Graphics, which measures the frame time while locking the GPU clocks, hopefully providing more stable results. For each method, we report the number of triangles rendered as well as the full frame rendering time - while that includes other passes, the workload is squarely dominated by geometry processing. The culling efficiency rate is computed as the percentage of triangles rejected compared to the baseline.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Algorithm&lt;/th&gt;
      &lt;th&gt;Triangles (LOD)&lt;/th&gt;
      &lt;th&gt;Time (LOD)&lt;/th&gt;
      &lt;th&gt;Triangles (no LOD)&lt;/th&gt;
      &lt;th&gt;Time (no LOD)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Baseline (no culling)&lt;/td&gt;
      &lt;td&gt;106.1M&lt;/td&gt;
      &lt;td&gt;8.38ms&lt;/td&gt;
      &lt;td&gt;729.8M&lt;/td&gt;
      &lt;td&gt;55.20ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Cluster cone culling&lt;/td&gt;
      &lt;td&gt;91.8M (13%)&lt;/td&gt;
      &lt;td&gt;7.25ms&lt;/td&gt;
      &lt;td&gt;494.1M (32%)&lt;/td&gt;
      &lt;td&gt;45.88ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Precomputed masks (6 regions)&lt;/td&gt;
      &lt;td&gt;92.4M (13%)&lt;/td&gt;
      &lt;td&gt;8.58ms&lt;/td&gt;
      &lt;td&gt;644.9M (12%)&lt;/td&gt;
      &lt;td&gt;54.36ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Precomputed masks (24 regions)&lt;/td&gt;
      &lt;td&gt;74.4M (30%)&lt;/td&gt;
      &lt;td&gt;7.61ms&lt;/td&gt;
      &lt;td&gt;527.1M (27%)&lt;/td&gt;
      &lt;td&gt;49.25ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Multi-cone culling (2 cones)&lt;/td&gt;
      &lt;td&gt;79.7M (25%)&lt;/td&gt;
      &lt;td&gt;7.73ms&lt;/td&gt;
      &lt;td&gt;445.6M (39%)&lt;/td&gt;
      &lt;td&gt;45.79ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Multi-cone culling (4 cones)&lt;/td&gt;
      &lt;td&gt;69.3M (35%)&lt;/td&gt;
      &lt;td&gt;7.13ms&lt;/td&gt;
      &lt;td&gt;416.7M (43%)&lt;/td&gt;
      &lt;td&gt;44.15ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Brute-force culling&lt;/td&gt;
      &lt;td&gt;49.4M (53%)&lt;/td&gt;
      &lt;td&gt;7.63ms&lt;/td&gt;
      &lt;td&gt;368.0M (50%)&lt;/td&gt;
      &lt;td&gt;46.32ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Cluster cone + brute-force culling&lt;/td&gt;
      &lt;td&gt;49.4M (53%)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;6.96ms&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;368.0M (50%)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;40.07ms&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
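
&lt;p&gt;The efficiency percentages in the table are derived from the triangle counts; a sketch of that computation (note that the reported counts are themselves rounded, so recomputed percentages can be off by a point):&lt;/p&gt;

```python
def culling_efficiency(rendered, baseline):
    # Percentage of triangles rejected relative to the no-culling baseline.
    return 100.0 * (1.0 - rendered / baseline)

# Cluster cone culling, in millions of triangles:
print(round(culling_efficiency(91.8, 106.1)))   # LOD:    ~13%
print(round(culling_efficiency(494.1, 729.8)))  # no LOD: ~32%
```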

&lt;p&gt;As expected, the efficiency of mask-based methods is similar between the LOD and no-LOD versions, but the cone-culling-based methods lose efficiency in the LOD version: instead of a lot of very detailed meshes, they have to work with a mix of coarse and detailed meshes. Also worth noting is that these numbers are measured on a configuration where each meshlet only has up to 64 triangles, whereas the earlier modeling table used 124 triangles as a maximum&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;What we may or may not expect is that while efficiency and performance do not correlate exactly, generally speaking, spending more effort to cull more triangles pays off. This is especially pronounced in the brute-force culling version, which - while not the fastest - is a contender for a top-3 spot, despite simply doing what the rasterizer is otherwise perfectly capable of doing on its own. The notable exception to this is cluster cone culling, which saves a decent amount of performance despite the relatively low efficiency - but remember, cluster cone culling is the only method we’ve discussed that is able to perform culling on the cluster level, which makes the triangles this method culls much more valuable, as they do not contribute at all to the cost of running the mesh shader.&lt;/p&gt;

&lt;p&gt;As a result, combining cluster cone culling (to minimize the amount of work done by the mesh shader) with brute-force culling (to cull the triangles that cluster cone culling misses) gives us the best of both worlds: we get the performance of cluster cone culling with the efficiency of brute-force culling. This is the approach that &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt; uses right now. All of the other culling methods are interesting, but fundamentally, brute-force culling is cheap enough that the extra complexity doesn’t really pay off, as none of the methods can reach the same culling efficiency. Of course, the results in a more production-ready scene or on a different GPU may vary.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: I originally intended to rerun the experiments on AMD Radeon GPU, but these numbers take forever to gather, and based on my previous experiments with fine-grained culling on AMD hardware I suspect the results will hold even more strongly there, as seemingly any amount of ALU that is spent to cull triangles pays off there… Sorry!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this post we’ve looked at a number of different techniques to cull backfacing triangles in mesh shaders. We’ve seen that cluster cone culling - which is the only method that can cull triangles on the cluster level - can result in suboptimal culling rates but is still valuable to reduce the mesh shader workload. When considering triangle-level culling methods, all of them seem to, in the end, lose to brute-force culling or at best be on par with it.&lt;/p&gt;

&lt;p&gt;The code for fine-grained culling, as well as cone culling integration, is available in &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt; and &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;; mask-based culling is likely going to be added to meshoptimizer &lt;a href=&quot;https://github.com/zeux/meshoptimizer/pull/553&quot;&gt;in the future&lt;/a&gt; as well even though it looks like it’s not going to be the best option for most use cases. Implementing K-means for cone cluster selection is left as an exercise to the reader ;)&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There are other, more powerful, culling techniques that can make task-level culling of meshlets more effective, such as occlusion culling which is also implemented in &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt;, but are outside of the scope for today’s post. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;meshoptimizer’s main algorithm for meshlet construction, meshopt_buildMeshlets, has a parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cone_weight&lt;/code&gt; that helps control this tradeoff to a small extent, although the algorithm always prefers connectivity over orientation. All numbers in this post assume &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cone_weight = 0.5&lt;/code&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Depending on the hardware implementation, you may or may not want to try also compacting visible triangles in the mesh shader. My experiments did not suggest that this is worthwhile on current NVidia or AMD GPUs, but that may change on other hardware. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This discrepancy comes from the fact that niagara currently uses the only meshlet configuration that was reasonably performant on AMD hardware, and I didn’t invest the time yet to do rigorous cross-vendor tuning. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Fri, 28 Apr 2023 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2023/04/28/triangle-backface-culling/</link>
			<guid isPermaLink="true">https://zeux.io/2023/04/28/triangle-backface-culling/</guid>
		</item>
		
		<item>
			<title>Meshlet size tradeoffs</title>
			<description>&lt;p&gt;When working with mesh shaders to draw meshes, you need to split your source geometry into individual units called meshlets. Each meshlet is processed by one mesh shader workgroup, and when compiling the mesh shader you need to specify the maximum number of triangles and vertices that a meshlet can contain.&lt;/p&gt;

&lt;p&gt;These numbers are subject to some hardware limits. On current drivers, AMD, Intel and NVidia expose limits of 256 triangles and 256 vertices through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXT_mesh_shader&lt;/code&gt; Vulkan extension, but NVidia advertises a higher limit of 512 triangles &amp;amp; 256 vertices through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NV_mesh_shader&lt;/code&gt;. These limits are the ones you’d want to respect when building your meshlets, e.g. when using &lt;a href=&quot;https://github.com/zeux/meshoptimizer#mesh-shading&quot;&gt;meshoptimizer’s meshopt_buildMeshlets&lt;/a&gt; function - but what numbers do you actually use?&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: this was originally posted as two cohost posts, &lt;a href=&quot;https://cohost.org/zeux/post/659687-meshlet-sizing-theor&quot;&gt;Meshlet sizing theory&lt;/a&gt; and &lt;a href=&quot;https://cohost.org/zeux/post/779129-meshlet-sizing-effic&quot;&gt;Meshlet sizing efficiency&lt;/a&gt;. 🐞&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are hardware efficiency implications here that we will explore in a future post, but let’s first try to get a sense for the abstract tradeoffs.&lt;/p&gt;

&lt;h1 id=&quot;vertex-transform&quot;&gt;Vertex transform&lt;/h1&gt;

&lt;p&gt;Let’s first explore the relationship between number of vertices and number of triangles. For that, let’s assume that our source mesh is two-manifold, and as such our meshlet is a subset of a larger two-manifold mesh. To make things easier, let’s assume that each meshlet can be “flattened” - which is to say, we can lay out the meshlet triangles on a plane without triangles overlapping&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This allows us to view a meshlet as a planar graph and to apply Euler’s formula: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V-E+T=1&lt;/code&gt;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. V and T are numbers of vertices and triangles, respectively, and E is the number of edges. Every triangle has three edges, and in every meshlet that was created out of a two-manifold mesh we have unconnected edges that lie on the border of a meshlet and connected edges.&lt;/p&gt;

&lt;p&gt;Let’s say that our meshlet has P unconnected (perimeter) edges; then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3T=2E-P&lt;/code&gt; (as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3T&lt;/code&gt; would count every non-perimeter edge twice), so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;E=(3T+P)/2&lt;/code&gt;. Plugging this back into Euler’s formula and simplifying we get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2V=2+T+P&lt;/code&gt;. You can validate this on simple examples; a single triangle has V=3 T=1 P=3; a quad has V=4 T=2 P=4.&lt;/p&gt;
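
&lt;p&gt;This identity is easy to sanity-check mechanically; here is a small Python sketch that validates it for the single-triangle and quad cases as well as for triangulated rectangular grids:&lt;/p&gt;

```python
def check(v, t, p):
    # 2V = 2 + T + P, derived from Euler's formula for planar meshlets.
    assert 2 * v == 2 + t + p

check(3, 1, 3)  # single triangle
check(4, 2, 4)  # quad
for n in range(1, 10):
    for m in range(1, 10):
        # n x m quad grid, each quad split into two triangles:
        # (n+1)(m+1) vertices, 2nm triangles, 2(n+m) perimeter edges
        check((n + 1) * (m + 1), 2 * n * m, 2 * (n + m))
```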

&lt;p&gt;The mesh shader execution model implies that each vertex within the meshlet will be unpacked/transformed once, but if the vertex is shared between two meshlets then it will be transformed redundantly. As such, the number of vertices shared with other meshlets - which is the same as the number of perimeter edges, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;P&lt;/code&gt; - needs to be minimized.&lt;/p&gt;

&lt;p&gt;For very large meshlets, the optimal P becomes insignificant relative to V and T. Intuitively, P grows as a square root of V - for example, an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N^2&lt;/code&gt; quad grid has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(N+1)^2&lt;/code&gt; vertices, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2*N^2&lt;/code&gt; triangles, but only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4*N&lt;/code&gt; perimeter edges. So for very large meshlets, 2V approaches T (there are approximately half as many vertices as triangles); however, our limits tend to be reasonably small, and as such a 1:2 vertex:triangle ratio is unreasonable to expect.&lt;/p&gt;

&lt;p&gt;We can look at the ratio between P and T as, approximately, transform waste - if each perimeter vertex is only shared between two meshlets, then for a large source mesh that is split into K meshlets, we’ll get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K*T&lt;/code&gt; triangles and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;K*P&lt;/code&gt; redundant vertex transforms, so we get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;P/T&lt;/code&gt; redundant vertex transformations for every triangle - ideally we’d like that to be as close to zero as possible. We can also compute ACMR, which is a metric traditionally used for vertex transform optimization, by dividing V by T, for which 0.5 is the optimal (but unreachable) target.&lt;/p&gt;

&lt;p&gt;To make this a little more concrete, let’s look at how grid-like meshlets look like at different sizes. For simplicity, let’s look at the maximum vertex count of 32, 64, 96, 128 and 256, and let’s use &lt;a href=&quot;https://gist.github.com/zeux/1ebc5e04030a681f7957ded0f5957015&quot;&gt;a simple script&lt;/a&gt; to find the best grid-like configuration:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;v limit 32 : best 4 x 5 v 30 t 40 p 18 ; p/t 0.45 acmr 0.75
v limit 64 : best 7 x 7 v 64 t 98 p 28 ; p/t 0.29 acmr 0.65
v limit 96 : best 7 x 11 v 96 t 154 p 36 ; p/t 0.23 acmr 0.62
v limit 128 : best 10 x 10 v 121 t 200 p 40 ; p/t 0.20 acmr 0.60
v limit 256 : best 15 x 15 v 256 t 450 p 60 ; p/t 0.13 acmr 0.57
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
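
&lt;p&gt;The linked script boils down to a brute-force search over grid dimensions, minimizing P/T under the vertex limit; a hypothetical Python reimplementation that reproduces the figures above:&lt;/p&gt;

```python
def best_grid(vertex_limit):
    # Search w x h quad grids that fit under the vertex limit, minimizing
    # the transform waste ratio P/T (perimeter edges per triangle).
    best = None
    for w in range(1, vertex_limit):
        for h in range(w, vertex_limit):
            v = (w + 1) * (h + 1)
            if v > vertex_limit:
                break
            t, p = 2 * w * h, 2 * (w + h)
            if best is None or best[0] > p / t:
                best = (p / t, w, h, v, t, p)
    return best

for limit in (32, 64, 96, 128, 256):
    pt, w, h, v, t, p = best_grid(limit)
    print(f"v limit {limit} : best {w} x {h} v {v} t {t} p {p} ; "
          f"p/t {pt:.2f} acmr {v / t:.2f}")
```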

&lt;p&gt;This information alone should lead us to an obvious conclusion - large meshlets are good, because they maximize the effective vertex reuse and thus minimize the vertex transform cost. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXT_mesh_shader&lt;/code&gt; limits we’d outlined earlier&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, we need to limit the triangle count to 256, which ends up with the actual limits of around 144 vertices / 242 triangles for regular grid patches (ACMR 0.60); for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NV_mesh_shader&lt;/code&gt;, we can max out the 256 vertex limit for a slightly better 0.57. In practice there’s no reason to use regular grid patches, but it’s important to note that experimentally we see that the optimal grid configuration for a given limit looks like a square more than a long strip, since it minimizes the perimeter length.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/meshlets_1.png&quot; alt=&quot;Transform efficiency&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Alas, things are not quite as simple as “bigger meshlets = better” because there are other contributing factors in this problem.&lt;/p&gt;

&lt;h1 id=&quot;meshlet-culling&quot;&gt;Meshlet culling&lt;/h1&gt;

&lt;p&gt;First let’s look at how meshlets of different sizes behave wrt meshlet culling. In a mesh shading driven pipeline you’d expect some amount of per-meshlet culling to happen - after all, if that isn’t happening at all, why use mesh shaders to begin with? &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt;, an experimental toy renderer written on YouTube streams, implements - as of today! - frustum culling, occlusion culling and backface culling for meshlets. For each culling mechanism we can compute efficiency - how many triangles can we remove from the rasterizer by implementing a given culling scheme, in this case when working at meshlet granularity?&lt;/p&gt;

&lt;p&gt;A meshlet with a higher triangle limit is, naturally, larger. While this has some impact on frustum culling, it has a more noticeable impact on occlusion culling&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; efficiency. Any numbers here are going to be highly scene-dependent, and the scene they are taken on is very artificial (but full of kittens!), so take this with a grain of salt; still, on the graph below you can see that on our test scene occlusion culling efficiency drops from ~80% for meshlets with 64 triangles to ~66% for meshlets with 256 triangles. This is a significant decrease and can easily offset any gains in transform efficiency in practice.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/meshlets_2.png&quot; alt=&quot;Culling efficiency&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The picture is similarly bleak for backface culling. To perform backface culling on meshlets without considering individual triangles&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, we need to perform cone culling based on the aggregate normal cone &lt;a href=&quot;https://github.com/zeux/meshoptimizer#mesh-shading&quot;&gt;generated by meshoptimizer&lt;/a&gt;. This test works well when the normal cone is tight - that is, when all triangles in a meshlet have similar orientation. However, the more triangles a meshlet has, the higher the chance that their normals point in different directions. While it is possible to gather triangles into meshlets using normal clustering alone, that produces a lot of unconnected triangles and causes significant transform waste, so this is pretty unavoidable.&lt;/p&gt;
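
&lt;p&gt;The cone test itself is cheap; meshoptimizer’s documentation describes a formulation along these lines, sketched here in Python for clarity (illustrative names; the shader-side version operates on the cone data produced by meshopt_computeMeshletBounds):&lt;/p&gt;

```python
import math

def cone_cull(apex, axis, cutoff, camera):
    # Reject the meshlet when its whole normal cone faces away from the
    # camera: compare dot(normalize(apex - camera), axis) against cutoff.
    d = [a - c for a, c in zip(apex, camera)]
    dist = math.sqrt(sum(x * x for x in d))
    if dist == 0.0:
        return False  # camera at the cone apex: conservatively keep
    return sum((x / dist) * a for x, a in zip(d, axis)) >= cutoff
```

&lt;p&gt;The wider the aggregate cone, the higher the cutoff and the less often the test passes - which is exactly why culling efficiency drops for large meshlets with diverging normals.&lt;/p&gt;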

&lt;p&gt;Indeed, normally we’d expect that 50% of triangles on average are backfacing, but the efficiency of meshlet-level backface culling drops to 14% with 64-triangle meshlets and all the way to 3% for 256-triangle meshlets - making the culling test barely worthwhile for very large meshlets! In part this is a problem because we use level of detail, and as such a lot of geometry that we render is far away and relatively coarse. Indeed, disabling level of detail gets us a somewhat more respectable 32% for 64-triangle meshlets and 21% for 256-triangle meshlets - unfortunately this isn’t very realistic since at that point we’re pushing close to 1 billion triangles in our test scene.&lt;/p&gt;

&lt;h1 id=&quot;hardware-occupancy&quot;&gt;Hardware occupancy&lt;/h1&gt;

&lt;p&gt;It’s also important to note that larger meshlets lead to a higher mesh shader threadgroup memory footprint. The mesh shader computation model implies that a given mesh shader outputs V vertices and T triangles from the threadgroup, and the memory for this needs to be allocated out of finite storage. The details vary per GPU vendor and we’ll discuss some of them in a future post; however, the number of vertices can pretty significantly affect this memory footprint, as a single vertex likely needs at least 32 bytes of output memory and more if more than a couple varying outputs are written. The issue with using too much memory here is the possibility of reduced occupancy - since the threadgroup memory is allocated out of a hardware-specific fixed amount of memory, a large footprint leads to only being able to run a few different threadgroups on an individual compute unit concurrently. This can be an issue because it reduces execution efficiency - having multiple different workgroups that execute “at the same time” helps GPUs hide memory and ALU latency. Unfortunately, vendors currently don’t publish guidance on specific numbers here for mesh shading, and the details may vary significantly across vendors.&lt;/p&gt;

&lt;p&gt;Additionally, really small meshlets, e.g. with 8 or 16 triangles, are going to suffer from reduced threadgroup execution efficiency as well, due to the mesh shader computation model - typically one meshlet is processed by one threadgroup, and AMD and NVidia GPUs need at least 32 threads in a threadgroup.&lt;/p&gt;

&lt;p&gt;Overall, all of this says that we need to somehow balance “larger is better” from the transform perspective with “smaller is better” from culling and execution efficiency perspective.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion?&lt;/h1&gt;

&lt;p&gt;Looking at the original NVidia mesh shader publication, one concrete number they recommend is 64 vertices and 84 or 126 triangles per meshlet. 126 feels like a typo but it’s not - the NVidia implementation allocates output memory for primitive indices in groups of 128 bytes and has to leave a little extra room to store the primitive count, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;84*3=252&lt;/code&gt; bytes and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;126*3=378&lt;/code&gt; bytes are good targets for that hardware. From the analysis above we know that with a 64 vertex limit we can at best expect 98-triangle patches for a regular grid; thus while 84 isn’t going to run at quite the peak efficiency, V=64 T=84 is likely reasonable as “default” guidance, as a compromise between the higher culling efficiency of smaller meshlets and transform waste (V=64 T=126 is also a good target although it will typically waste a little bit of primitive memory as we’d realistically go up to &amp;lt;100 triangles per meshlet).&lt;/p&gt;
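
&lt;p&gt;The 84/126 numbers fall out of a simple computation; a sketch (the exact amount of space reserved for the primitive count is an assumption here - 4 bytes makes the arithmetic line up):&lt;/p&gt;

```python
def max_triangles(index_groups, group_bytes=128, reserved_bytes=4):
    # Each triangle needs 3 one-byte primitive indices; allocation happens
    # in fixed-size groups, minus room reserved for the primitive count.
    return (index_groups * group_bytes - reserved_bytes) // 3

print(max_triangles(2))  # 84
print(max_triangles(3))  # 126
```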

&lt;p&gt;Unfortunately, applying this configuration on AMD hardware will lead to suboptimal execution due to the specifics of how AMD hardware supports mesh shaders - and it’s also not clear whether “transform waste” is important there relative to some other factors! This is a subject we’ll explore in a future post, as it requires looking closely at how exactly mesh shaders execute on different hardware.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is not necessarily the case in general as not all subsets of two-manifold meshes can be planarized, but it’s a reasonable generalization. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Euler’s formula says &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V-E+F=2&lt;/code&gt; but it treats the infinitely large region outside of any face as one, so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F=T+1&lt;/code&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;256/256 is currently the de-facto common denominator across all vendors that have an implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXT_mesh_shader&lt;/code&gt;, as well as the minimum set of limits required by the spec - however, there are other important limits to keep in mind that we will discuss later. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Intuitively, frustum culling of individual meshlets is mostly important around the screen boundary, whereas occlusion culling of individual meshlets is important everywhere. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The renderer implements per-triangle culling as well but the efficiency numbers above are gathered with all triangle-level culling disabled. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Mon, 16 Jan 2023 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2023/01/16/meshlet-size-tradeoffs/</link>
			<guid isPermaLink="true">https://zeux.io/2023/01/16/meshlet-size-tradeoffs/</guid>
		</item>
		
		<item>
			<title>Approximate projected bounds</title>
			<description>&lt;p&gt;When working with various forms of culling, it can be useful to project the object bounds to screen space. This is necessary to implement various forms of occlusion culling when using a depth pyramid, or to be able to reject objects or clusters that don’t contribute to any pixels. The same operation can also be used for level of detail selection, although it’s typically faster to approximate the projected area on screen - here we’re interested in efficient conservative projected bounds. “Conservative” means that the resulting bounds must contain the original object. “Efficient” means that we’ll need to restrict ourselves to projecting 3D bounds that are known to contain the object - naturally, two common choices are a sphere and a box.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;near-clipping&quot;&gt;Near clipping&lt;/h2&gt;

&lt;p&gt;Perspective projection has a singularity on the plane that is orthogonal to the camera direction and goes through the camera position. Points on that plane have W=0 in homogeneous space, and attempts to perform a post-perspective divide will result in Inf/NaN. As such, graphics hardware clips rasterized triangles to the near plane (post-perspective Z&amp;gt;0). When the object bounds intersect the near plane in 3D, computing the precise projected bounds would require computing the bounds of the clipped volume. This can be computationally intensive, and is also entirely unnecessary when the result is used for culling - if the object intersects the camera plane, the odds of it being culled are nil. As such, for simplicity we’re going to assume that we will reject objects that intersect the near plane.&lt;/p&gt;

&lt;h2 id=&quot;sphere-projection&quot;&gt;Sphere projection&lt;/h2&gt;

&lt;p&gt;Spheres are convenient to transform - e.g. a world-space sphere can be converted to view space with a single matrix transformation, as the sphere retains its radius under such a transform (assuming the view matrix contains no scaling). Unfortunately, the projection of a sphere is not a disk - however, it is possible to analytically derive the precise bounds of the projection. &lt;a href=&quot;https://jcgt.org/published/0002/02/05/&quot;&gt;2D Polyhedral Bounds of a Clipped, Perspective-Projected 3D Sphere (Michael Mara, Morgan McGuire)&lt;/a&gt; does precisely that. While that paper comes with source code, it solves a more general problem; you can refer to the paper for the derivation, but here we can take that source code and specialize it for the task at hand, focusing on screen-space bounds and ignoring near clipping.&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// 2D Polyhedral Bounds of a Clipped, Perspective-Projected 3D Sphere. Michael Mara, Morgan McGuire. 2013&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;projectSphereView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;czr2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;czr2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;czr2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;miny&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;miny&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// clip space -&amp;gt; uv space&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xwzy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that in this implementation we assume that the projection is symmetrical for simplicity, and as such only two elements of the projection matrix (P00 and P11) are necessary. Symmetrical projections are the most common ones, but it’s easy to incorporate asymmetrical projections that may occur in VR rendering at a small added cost of four extra additions at the end. The function also converts the output from clip space (where coordinates range from [-1..1] and Y goes up) to normalized screen space or UV space, where coordinates range from [0..1] and Y goes down; if desired this transform can be folded into the projection transform for a small profit.&lt;/p&gt;
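&lt;p&gt;For example, assuming a column-major GLSL projection matrix where the off-center terms sit in the third column (so that clip.x = x * P00 + z * P20), the post-divide offset is a constant, and the hypothetical adjustment amounts to just four additions - the exact element names and signs depend on your matrix conventions:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// sketch: asymmetric projection support; P20/P21 are assumed to be the
// off-center terms of the projection matrix - verify against your layout
aabb = vec4(minx * P00 + P20, miny * P11 + P21,
            maxx * P00 + P20, maxy * P11 + P21);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;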

&lt;p&gt;All in all this requires ~30 FLOPs for the transform plus 12 FLOPs for the final conversions.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; above is the view-space center of the sphere - when the sphere is given in world space instead, we need to transform it to view space first, which takes ~18 FLOPs, for a grand total of 60 FLOPs.&lt;/p&gt;
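&lt;p&gt;The world-space variant can be a thin wrapper over the function above; in this sketch, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;view&lt;/code&gt; is assumed to be a world-to-view matrix with no scaling (so the radius is preserved) and positive-Z-forward view space, matching the convention the function expects:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// sketch: world-space sphere -&amp;gt; screen-space AABB; assumes view has no
// scaling and view space uses positive Z pointing into the screen
bool projectSphere(vec3 center, float r, mat4 view, float znear, float P00, float P11, out vec4 aabb)
{
    vec3 c = (view * vec4(center, 1.0)).xyz; // ~18 FLOPs
    return projectSphereView(c, r, znear, P00, P11, aabb);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;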

&lt;h2 id=&quot;naive-box-projection&quot;&gt;Naive box projection&lt;/h2&gt;

&lt;p&gt;If the input is an AABB, things become more difficult. Unlike a sphere, which remains a sphere after transformation, an axis-aligned bounding box stops being axis-aligned - this makes precise calculation of the projected bounds difficult, as any of the eight corners of the box can be an extremum, and the problem lacks symmetry. Because of this, a typical method projects all 8 corners, performs a perspective divide, and computes the min/max of the resulting values. To perform near plane rejection, before doing the divide we need an early out if any of the transformed vectors has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.w &amp;lt; znear&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s pseudo-code for what this entails, assuming a world-space input:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;projectBox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;mat4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// clip space -&amp;gt; uv space&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xwzy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is substantially more expensive than the equivalent computation for a sphere - each vector-matrix transform requires 18 FLOPs (since the resulting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.z&lt;/code&gt; is unused), so the entire function spends 144 FLOPs just on the transformations, 16 FLOPs on the post-perspective divides, 35 FLOPs on the various min/max computations and 8 FLOPs on the final conversion of the result - a grand total of 203 operations!&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Update: After the article was published, several people noted that it’s also possible to slightly reduce the overhead here using point classification: using the technique from &lt;a href=&quot;https://arbook.icg.tugraz.at/schmalstieg/Schmalstieg_031.pdf&quot;&gt;Fast Projected Area Computation for Three-Dimensional Bounding Boxes&lt;/a&gt;, once the camera position in bounding-box space is known, you can classify the silhouette edges via a table lookup, which yields 6 (or, rarely, 4) silhouette vertices that then need to be transformed as above. This keeps the transformation precise, but requires additional code and table data, so it may or may not be worthwhile on a GPU compared to the methods below. Thanks to Eric Haines and Alan Hickman for the correction!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;optimized-box-projection&quot;&gt;Optimized box projection&lt;/h2&gt;

&lt;p&gt;The original version of this article provided a way to reduce the amount of computation in the naive projection by eliminating redundant multiplications, but Aslan Dzodzikov suggested a much more efficient version in the comments. Instead of computing each corner by multiplying the world-space corner of the AABB by the matrix, we can take advantage of the distributive property of matrix multiplication and note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vec4(bmin.x, bmin.y, bmax.z, 1.0) * viewProjection == vec4(bmin.x, bmin.y, bmin.z, 1.0) * viewProjection + vec4(0, 0, bmax.z - bmin.z, 0) * viewProjection&lt;/code&gt;. This allows us to compute only one full vector-matrix product and three products that each reduce to a row/column multiplication; the rest of the computation is just vector additions:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;projectBox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;mat4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SX&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SZ&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewProjection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// clip space -&amp;gt; uv space&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xwzy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To compute each of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SX/SY/SZ&lt;/code&gt; we need 4 FLOPs (since the resulting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.z&lt;/code&gt; is unused, and the matrix multiplication reduces to a row/column multiplication); the full matrix transform for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;P0&lt;/code&gt; requires 18 FLOPs, and the seven vector additions require 7*3 = 21 FLOPs, so the transformation stage requires 51 FLOPs in total. The rest of the computation remains as is, with 16+35+8 FLOPs to compute the bounds, which adds up to 110 FLOPs - about half as many as the naive computation! The result is still precise, although it may not be exactly equal to the naive computation due to round-off errors.&lt;/p&gt;

&lt;h2 id=&quot;view-space-approximations&quot;&gt;View-space approximations&lt;/h2&gt;

&lt;p&gt;In order to improve these results further, we need to convert the box into something that is easier to deal with - namely, another axis-aligned box, but this time in view space. Converting AABBs between different spaces &lt;a href=&quot;/2010/10/17/aabb-from-obb-with-component-wise-abs/&quot;&gt;has been discussed on this blog before&lt;/a&gt;, and requires an additional vector transform by a 3x3 matrix whose elements are the component-wise absolute values of the original rotation matrix. Note that this transformation is conservative rather than precise - the resulting bounding volume, and consequently the projected bounds, will be larger.&lt;/p&gt;
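
&lt;p&gt;As a sketch of that conversion (the helper name and signature are illustrative, and the vector-times-matrix convention follows the code above): the center transforms as a regular point, while the extents are transformed by the component-wise absolute value of the rotation part, which yields a box that bounds the rotated original:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// converts a world-space AABB (bmin/bmax) into a conservative view-space AABB,
// returned as center c and extents r; view transforms world space to view space
void aabbToViewSpace(vec3 bmin, vec3 bmax, mat4 view, out vec3 c, out vec3 r)
{
    vec3 center = (bmin + bmax) * 0.5;
    vec3 extent = (bmax - bmin) * 0.5;

    // the center transforms as a point
    c = (vec4(center, 1.0) * view).xyz;

    // the extents go through abs() of the rotation part; each component of r
    // is a dot product with the absolute values of one column of the matrix
    r = vec3(
        dot(extent, abs(view[0].xyz)),
        dot(extent, abs(view[1].xyz)),
        dot(extent, abs(view[2].xyz)));
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;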

&lt;p&gt;Once the AABB is in view space, a number of things become easier. To perform the near-clipping rejection, we simply need to look at the minimum Z coordinate of the AABB - no need to compute it from scratch! For the actual projection, the first observation is that out of the 8 corners, the 4 on the “front” side (minimum Z) share one denominator during the post-perspective division, and the 4 on the “back” side (maximum Z) share another. Additionally, when computing, for example, the maximum X of the projection, we obviously only need to consider the “right” side of the box - now that we work in view space, it’s easy to tell which vertices can be the extrema! - and all vertices on the right side share the same view-space X. As such, to compute the max X, we only need to consider two values:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;viewmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, we can simplify this a little further: since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;viewmax.z &amp;gt;= viewmin.z&lt;/code&gt;, and both are positive, we know exactly which of them to divide by based on the sign of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;viewmax.x&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;viewmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;viewmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this, we’re ready to implement the entire function:&lt;/p&gt;

&lt;div class=&quot;language-glsl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;projectBoxView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// when we&apos;re computing the extremum of projection along an axis, the maximum&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// is reached by front face for positive and by back face for negative values&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rminz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rmaxz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rmaxz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rminz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rminz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rmaxz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;miny&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rmaxz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rminz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rminz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rmaxz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;miny&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// clip space -&amp;gt; uv space&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xwzy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;projectBoxApprox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;mat4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;mat3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;projectBoxView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;znear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The view-space projection takes 17 FLOPs to implement the core projection, plus 12 FLOPs to transform the result. To project a box to view space, we need to convert the box to center+radius form (12 FLOPs, although this may not be necessary if the AABB is already stored as center+radius) and do the matrix multiplications (18 FLOPs for the center and 24 FLOPs for the radius), for a grand total of 83 FLOPs - ~30% fewer operations than the optimized precise version. Note that in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;projectBoxView&lt;/code&gt; code above we choose to precompute the reciprocals of min/max Z to reduce the cost; this technically increases the number of floating-point operations by 2, trading 4 divisions for 2 reciprocal operations and 4 multiplies, which is usually a good tradeoff on GPUs.&lt;/p&gt;

&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h2&gt;

&lt;p&gt;So far we’ve ended up with two projection functions for bounding boxes, an optimized precise variant and a view-space approximation. We can also adapt &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;projectBoxView&lt;/code&gt; to serve as a sphere approximation: by approximating a sphere with a box of the same radius, we retain the conservative nature of the projection, trading some precision for computational time.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;projectBoxView&lt;/code&gt; takes ~29 FLOPs instead of ~42 FLOPs for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;projectSphereView&lt;/code&gt;, but that assumes that we’re already starting with a view-space sphere. It’s likely that we need to transform the sphere to view space instead, which adds 18 FLOPs for a matrix transform, for 47 FLOPs vs 60 FLOPs - or a ~27% speedup, at least as far as operation count is concerned - which is similar to the FLOPs delta between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;projectBoxApprox&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;projectBox&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As noted before, FLOPs are not an accurate measure of performance. To get a better sense of the performance of the different variants on real GPUs, we will use &lt;a href=&quot;https://github.com/GPUOpen-Tools/radeon_gpu_analyzer&quot;&gt;AMD Radeon GPU Analyzer&lt;/a&gt; to compile each of the four variants and measure the instruction count&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, as well as &lt;a href=&quot;https://developer.imaginationtech.com/pvrshadereditor/&quot;&gt;PVRShaderEditor&lt;/a&gt; to estimate cycle counts on PowerVR Series6 GPUs and &lt;a href=&quot;https://developer.arm.com/Tools%20and%20Software/Mali%20Offline%20Compiler&quot;&gt;Mali Offline Compiler&lt;/a&gt; to estimate cycle counts on ARM Mali-G78.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Function&lt;/th&gt;
      &lt;th&gt;FLOPs&lt;/th&gt;
      &lt;th&gt;GCN instructions&lt;/th&gt;
      &lt;th&gt;PVR cycles&lt;/th&gt;
      &lt;th&gt;Mali cycles&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;projectBox&lt;/td&gt;
      &lt;td&gt;110&lt;/td&gt;
      &lt;td&gt;138&lt;/td&gt;
      &lt;td&gt;73&lt;/td&gt;
      &lt;td&gt;0.91&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;projectBoxApprox&lt;/td&gt;
      &lt;td&gt;83&lt;/td&gt;
      &lt;td&gt;102&lt;/td&gt;
      &lt;td&gt;36&lt;/td&gt;
      &lt;td&gt;0.64&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;projectSphere&lt;/td&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;86&lt;/td&gt;
      &lt;td&gt;38&lt;/td&gt;
      &lt;td&gt;0.59&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;projectSphereApprox&lt;/td&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;82&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;0.41&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We can see that while the bounding box approximation yields meaningful ALU savings, the sphere approximation is much closer in cost to the precise sphere projection and will likely yield similar performance.&lt;/p&gt;

&lt;p&gt;Importantly, both approximations convert the primitive into a view-space bounding box, and as such they yield larger bounds. To evaluate this, I’ve integrated all 4 methods into &lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt; and looked at the ratio of screen-space bounds areas that the approximations return, as well as the impact the approximations have on occlusion culling efficiency.&lt;/p&gt;

&lt;p&gt;Unfortunately, both approximations end up returning ~1.65x larger bounds in terms of area, which results in a ~10% reduction in occlusion culling efficiency. As such, even for the box projection, the ALU savings from the approximation are likely to be negated by the reduced efficiency in later rendering stages, assuming the projection results are used for geometry culling downstream.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Despite the title of the article, it looks like in general, when computing projected bounds, the only approximation we need is to reject the cases where the primitive intersects the near plane. In all other cases, computing precise bounds - when done carefully with well-optimized code - is reasonably close to the view-space approximations in performance while providing a much more accurate result. A more accurate result allows more precise culling, which is likely to be a net benefit overall.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In this post we simplify the reasoning about the execution cost and treat every primitive operation as a single FLOP. The real performance implications are a little more nuanced, as division and square root are typically more expensive, and some multiplies and adds in the code above can be fused. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Again, these are approximate. For example on AMD GPUs, division is actually two operations, one of which (1/x) is comparatively expensive, and the computation above assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;min(x, y)&lt;/code&gt; is a single operation whereas it may actually require two - compare and select. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;While measuring instruction count is better than measuring FLOPs as it takes into account the compiler and hardware specifics, it’s still a bad approximation of the real performance as some GCN instructions have different throughput and scheduling constraints. Unfortunately, RGA does not output cycle estimates. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Thu, 12 Jan 2023 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2023/01/12/approximate-projected-bounds/</link>
			<guid isPermaLink="true">https://zeux.io/2023/01/12/approximate-projected-bounds/</guid>
		</item>
		
		<item>
			<title>VPEXPANDB on NEON with Z3</title>
			<description>&lt;p&gt;When working on vertex compressor for &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; in 2018, one of the goals was to make a bitstream that can be decompressed using (short) SIMD vector operations. This led to a host of design decisions in terms of how the data is structured, and some challenges when mapping the decoder flow onto various SIMD architectures. The most significant issue has to do with implementing an operation that often doesn’t have a straightforward implementation: byte expansion.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Without going into the &lt;a href=&quot;https://github.com/KhronosGroup/glTF/blob/main/extensions/2.0/Vendor/EXT_meshopt_compression/README.md#mode-0-attributes&quot;&gt;details of the bitstream format&lt;/a&gt;, we get a mask (with 16 one-bit elements, one for each vector byte) from an earlier stage of the decoding process, and for every 1 in the mask we need to read a byte from the input stream and place it into the corresponding byte of the SIMD register. For example, a mask with three bits set, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_0100_0011_0000&lt;/code&gt;, leads to reading three bytes from the current position in the stream, creating a vector &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xxxxxxxx_xxAAxxxx_xxxxBBCC_xxxxxxxx&lt;/code&gt;, and advancing the stream position by 3.&lt;/p&gt;
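
&lt;p&gt;To make the required semantics concrete, here’s a scalar model of the operation (an illustrative sketch - the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expandBytes&lt;/code&gt; is made up, and the real decoder operates on SIMD registers rather than byte arrays):&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Scalar model of byte expansion: for every 1 bit in mask16 (bit i
// corresponds to output byte i), consume the next input byte; other
// output bytes are zeroed, mirroring a zero-masked expand.
const unsigned char* expandBytes(const unsigned char* data, unsigned mask16, unsigned char out[16])
{
    for (int i = 0; i != 16; ++i)
    {
        out[i] = (mask16 % 2) ? *data++ : 0;
        mask16 /= 2; // move on to the next bit
    }

    return data; // advanced by the popcount of the original mask
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Every SIMD variant discussed below computes exactly this, just without a byte-by-byte loop.&lt;/p&gt;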

&lt;h1 id=&quot;baseline-implementation-ssse3avx-512&quot;&gt;Baseline implementation: SSSE3/AVX-512&lt;/h1&gt;

&lt;p&gt;The bitstream was designed to be quickly decodable on mainstream Intel/AMD processors, so there must be an efficient lowering for some reasonable version of the SSE instruction set. By structuring the bitstream to make sure we can always read 16 bytes from valid positions in the stream, we can load 16 bytes starting at the current position in the stream, but we then need to move the bytes corresponding to the bits set in the mask into their right locations. The general way to do this is to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PSHUFB&lt;/code&gt; (SSSE3) instruction, which can perform an arbitrary byte shuffle on the source 16-byte register, yielding any 16-byte permutation as a result.&lt;/p&gt;

&lt;p&gt;Computing the shuffle masks from bit masks is difficult to do efficiently, so instead we’ll precompute the masks and store them in a table. There’s a grand total of 2&lt;sup&gt;16&lt;/sup&gt; masks, which would require a 1MB lookup table - not very space or performance efficient - but we can instead precompute a table indexed by an 8-bit mask, which requires just 2KB of space (256 entries of 8 bytes each). From there we need to synthesize a full shuffle mask out of two 8-byte halves, and the number of bits set in the first half determines the offset that the second half should pull data from: for example, if the first 8 bits have 5 bits set, and the second 8 bits have only one, then the corresponding byte in the second half of the vector needs to be read from offset +5 in the input stream.&lt;/p&gt;

&lt;p&gt;Importantly, the mask originally comes as a result of vectorized comparisons, with one byte for each comparison result (containing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0b11111111&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0b00000000&lt;/code&gt;); however, we can easily convert it to a 16-bit mask in a general-purpose register using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PMOVMSKB&lt;/code&gt; (SSE2) instruction.&lt;/p&gt;

&lt;p&gt;To compute the offset we thus need an 8-bit population count - something that’s available as an instruction on modern x64 CPUs, but also easy to precompute a 256-byte table for. Conveniently, we also need the population count to figure out how much to advance the stream after reading the data: the position needs to shift by the total number of bits set in the 16-bit mask, which can be computed as a sum of two 8-bit population counts.&lt;/p&gt;
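
&lt;p&gt;The two lookup tables can be generated along these lines (a sketch for illustration; meshoptimizer’s actual table generation and layout may differ):&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// For each possible 8-bit mask, precompute PSHUFB indices for 8 output
// bytes (0x80 in an index makes PSHUFB emit zero) and the population count.
static unsigned char kTableShuffle[256][8];
static unsigned char kTableCount[256];

void buildTables()
{
    for (int mask = 0; mask != 256; ++mask)
    {
        int m = mask;
        int count = 0;

        for (int i = 0; i != 8; ++i)
        {
            // output byte i reads the stream at an offset equal to the
            // number of set bits below bit i
            kTableShuffle[mask][i] = (m % 2) ? (unsigned char)(count++) : 0x80;
            m /= 2;
        }

        kTableCount[mask] = (unsigned char)count;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The high-bit trick (0x80) means bytes that aren’t present in the mask come out as zero after the shuffle, so no extra masking is needed.&lt;/p&gt;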

&lt;p&gt;Overall, this requires a bit of setup, but is reasonably efficient, with the following pseudo-code:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kTableShuffle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 8-bit shuffle masks&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kTableCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 8-bit population counts&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// mask is __m128i&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_movemask_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sm0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadl_epi64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kTableShuffle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sm1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadl_epi64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kTableShuffle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sm1off&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_set1_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kTableCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sm1r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_add_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sm1off&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;smfull&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_unpacklo_epi64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sm1r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_shuffle_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;smfull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kTableCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kTableCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This approach works well on Intel/AMD hardware, but it does take a fair amount of setup: to compute the shuffle mask from the two halves of the 16-bit mask, we need to load the two halves from memory and reconstruct the full mask with ~4 additional instructions. I didn’t know about this initially, but byte expansion is useful enough in contexts like this that AVX-512 includes a dedicated instruction that matches our desired semantics perfectly, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VPEXPANDB&lt;/code&gt;, which allows us to eliminate all the setup and replace all of the above with:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// mask is __mmask16&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mask_expand_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_setzero_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_popcnt_u32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
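&lt;p&gt;To make the expansion semantics concrete, here’s a scalar model of what byte expansion computes (an illustrative sketch, not the actual hardware implementation): consecutive low bytes of the source are scattered to the byte positions whose mask bit is set, and every other position is filled from the fallback operand - zero in our case.&lt;/p&gt;

```python
def expand_bytes(data, mask, n=16):
    # Byte expansion: the j-th low byte of `data` goes to the j-th
    # position whose mask bit is set; cleared positions produce zero.
    out, j = [], 0
    for i in range(n):
        if (mask >> i) & 1:
            out.append(data[j])
            j += 1
        else:
            out.append(0)
    return out

# bytes 10, 20, 30 land at positions 0, 1 and 3
print(expand_bytes([10, 20, 30, 40] + [0] * 12, 0b1011)[:4])  # [10, 20, 0, 30]
```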

&lt;blockquote&gt;
  &lt;p&gt;Note: AVX-512 has other fantastic instructions that help in other areas of the decoder, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VPMULTISHIFTQB&lt;/code&gt;; overall, using AVX-512 makes the decoder &amp;gt;10% faster on an Intel CPU - all this while still using 128-bit SIMD vectors, and thus without the risk of the associated frequency adjustments. Unfortunately, AVX-512 is not widely supported, but I’m looking forward to benchmarking this on Zen4 in the future.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;neon-emulation&quot;&gt;NEON emulation&lt;/h1&gt;

&lt;p&gt;Unfortunately, NEON doesn’t have anything as powerful as byte expansion instructions, so we need to use something similar to the baseline SSE implementation. Most instructions translate without issues, and the shuffle can be implemented using the table lookup instruction &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TBL&lt;/code&gt;, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PMOVMSKB&lt;/code&gt; presents a challenge.&lt;/p&gt;

&lt;p&gt;This problem comes up frequently when porting SSE code to NEON, as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PMOVMSKB&lt;/code&gt; is incredibly useful in many algorithms; until recently, a reasonable generic alternative involved using the horizontal add instruction (only available on AArch64). In our case the problem is a little simpler - instead of a 16-bit mask we actually need two 8-bit masks, since we’re going to use them to compute load offsets anyway, and we know that the input is a 16-byte register with each byte containing all 1s or all 0s, which simplifies the emulation and makes it pretty reasonable:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kByteMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
 
&lt;span class=&quot;n&quot;&gt;uint8x16_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;byte_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vld1q_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kByteMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;uint8x16_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;masked&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vandq_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;byte_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vaddv_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vget_low_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;masked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vaddv_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vget_high_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;masked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
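&lt;p&gt;The reason this works is easy to check in scalar code: ANDing a byte that is all 1s with the power-of-two table turns byte k into 1 &amp;lt;&amp;lt; k, so the horizontal add of the 8 lanes assembles exactly the 8-bit movemask. A quick model (illustration only):&lt;/p&gt;

```python
BYTE_MASK = [1, 2, 4, 8, 16, 32, 64, 128]

def movemask8(lanes):
    # Each lane is 0x00 or 0xff; AND with the power-of-two table turns
    # lane k into (1 << k) or 0, and summing the lanes builds the mask.
    return sum(b & m for b, m in zip(lanes, BYTE_MASK))

# exhaustive check over all 256 possible lane patterns
for m in range(256):
    lanes = [0xFF if (m >> k) & 1 else 0 for k in range(8)]
    assert movemask8(lanes) == m
print("ok")
```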

&lt;p&gt;However, this still requires two horizontal adds, and horizontal adds aren’t particularly cheap: 4 cycles of latency per the Cortex-X2 optimization guide. &lt;a href=&quot;https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon&quot;&gt;A recent blog post from the Google Cloud team&lt;/a&gt; presents a more efficient replacement for many uses of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PMOVMSKB&lt;/code&gt; that uses a narrowing shift to create a 64-bit mask from a 128-bit vector, and then adjusts the rest of the algorithm to expect the bits in the right places. This results in significant speedups on a number of algorithms - unfortunately, it’s less useful in our case, because we ultimately need two 8-bit masks instead of a single 64-bit one.&lt;/p&gt;

&lt;p&gt;A further complication is that horizontal adds aren’t universally available. On ARMv7 they don’t exist, and a longer instruction sequence using paired adds must be used instead; additionally, &lt;a href=&quot;https://developercommunity.visualstudio.com/comments/446737/view.html&quot;&gt;Microsoft does not implement horizontal adds&lt;/a&gt; in their NEON headers, so when targeting Windows on ARM we again can’t use the most efficient sequence.&lt;/p&gt;

&lt;h1 id=&quot;sometimes-all-it-takes-is-a-mul&quot;&gt;Sometimes all it takes is a MUL&lt;/h1&gt;

&lt;p&gt;For a while, the above was the version used in native NEON builds. The same issue had to be solved in the WebAssembly version of the decoder - while the WebAssembly SIMD proposal now includes the &lt;a href=&quot;https://github.com/WebAssembly/simd/pull/201&quot;&gt;i8x16.bitmask instruction&lt;/a&gt;, earlier versions of the proposal didn’t have it (the shuffle instruction was also initially absent, which would have put the SIMD decoder at risk, but the &lt;a href=&quot;https://github.com/WebAssembly/simd/pull/71&quot;&gt;v8x16.swizzle instruction&lt;/a&gt; was added relatively early&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;).&lt;/p&gt;

&lt;p&gt;Because of this, the WebAssembly version relied on a rather slow scalar fallback that used many 64-bit ORs and shifts to move the bits into the right places. Before I got around to replacing it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i8x16.bitmask&lt;/code&gt;, someone named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0b0000000000000&lt;/code&gt; &lt;a href=&quot;https://twitter.com/0b0000000000000/status/1376568414634840065&quot;&gt;shared a much more efficient, if cryptic, solution with me&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;usize&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;MoveMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;U8x16&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;alignas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uint64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;wasm_v128_store&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uint64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x103070F1F3F80ULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;uint64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;uint64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;48&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFF00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since we need 8-bit masks anyway, this results in the following NEON code:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x000103070f1f3f80ull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;uint64x2_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vreinterpretq_u64_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vgetq_lane_u64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mask1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vgetq_lane_u64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It does assume that 64-bit multiplication is efficient; however, on ARMv7 this ends up being about the same speed as the less efficient NEON implementation with paired adds, and it’s consistently faster than horizontal adds on AArch64 (on the entire algorithm, of which all of the above is only a part, it results in ~2% faster decoding on Apple M2 and ~3% faster decoding on AWS Graviton2; it also compiles into the same efficient code on MSVC, where the lack of horizontal adds is no longer an issue).&lt;/p&gt;

&lt;p&gt;Ok, but… why does this code even work? Where do we get the magic constant from? And how did &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0b0000000000000&lt;/code&gt; arrive at this idea?&lt;/p&gt;

&lt;h1 id=&quot;enter-z3&quot;&gt;Enter Z3&lt;/h1&gt;

&lt;p&gt;To see why the multiplication gets us what we want, look at what happens when we multiply the magic constant by 255 - since all input bytes are either 00000000 or 11111111, the multiplication result is the sum of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(255*magic) &amp;lt;&amp;lt; (k*8)&lt;/code&gt; for each k that marks a non-zero input byte:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mh&quot;&gt;0x000103070f1f3f80&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x0102040810204080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can see that the result has exactly one bit set in each byte; this means the value can be shifted by any multiple of 8 and the shifted copies can be OR’d together (which is the same as adding them, because no carries are involved). Doing so leaves the top 8 bits of the value corresponding to the values of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt; where the byte was non-zero - exactly what we need to get our bitmask!&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
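&lt;p&gt;Here’s the same observation as a quick worked example: for an input with 0xff in bytes 0 and 3, the product is the sum of two shifted copies of 255*magic, and since no carries occur, the top byte ends up with exactly bits 0 and 3 set.&lt;/p&gt;

```python
magic = 0x000103070F1F3F80
assert magic * 255 == 0x0102040810204080  # one bit set per byte

# input with 0xff in bytes 0 and 3
x = 0xFF0000FF
y = (x * magic) & (2**64 - 1)  # 64-bit truncating multiply
print(hex(y >> 56))  # 0x9 == 0b1001: bits 0 and 3 of the mask
```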

&lt;p&gt;There are a few ways to derive this in reverse; I’m not sure what exact method &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0b0000000000000&lt;/code&gt; used, however we can simultaneously devise the magic constant, and validate that the multiplication gives us what we want, by using &lt;a href=&quot;https://github.com/Z3Prover/z3&quot;&gt;Z3 theorem prover&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A full introduction to Z3 is neither inside the scope of this article nor &lt;a href=&quot;https://theory.stanford.edu/~nikolaj/programmingz3.html&quot;&gt;is it mine to give&lt;/a&gt; - however, what Z3 allows us to do in this case is to give the solver our problem statement:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Given any 64-bit integer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt;, such that every byte of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; is either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;11111111&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Compute a 64-bit integer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y = x * magic&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Ensure that each of the top 8 bits of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt; is set iff the corresponding byte of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;11111111&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can then give Z3 a fixed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;magic&lt;/code&gt; value and verify that this holds for every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; - something that we could check by brute-forcing all 256 possible values of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; in this case, but that can be impractical in other cases - or, instead, ask Z3 to find a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;magic&lt;/code&gt; value that satisfies the conditions for any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt;, or report to us that no such value exists.&lt;/p&gt;
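&lt;p&gt;For this particular constant, the brute-force check is trivial to write (and fast, since every byte of x is 0x00 or 0xff, leaving only 256 valid inputs):&lt;/p&gt;

```python
magic = 0x000103070F1F3F80

# enumerate all 256 valid inputs and check that the top byte of the
# 64-bit truncated product reproduces the mask exactly
for m in range(256):
    x = sum(0xFF << (8 * k) for k in range(8) if (m >> k) & 1)
    y = (x * magic) & (2**64 - 1)
    assert y >> 56 == m
print("ok")
```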

&lt;p&gt;Z3 can be used via an SMT interface, but I’d recommend avoiding it and using the Python Z3 module instead. With the following short program, the solver gives us the answer in a couple of seconds:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;same8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BitVec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;magic&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BitVec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;x&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;solve&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ForAll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# x has bytes that aren&apos;t equal to 0xff or 0x00
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;Not&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;same8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])),&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# every byte of x has bits that are equal to a corresponding top bit of y
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;56&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;python3 bitmask.py 
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;magic &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; 284803830071168]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can also ask the solver to find the constant for a slightly different version of the problem - for example, if all bytes are either 0 or 1, we can just tweak the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;same8&lt;/code&gt; function and have the solver give us the new constant, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x102040810204080&lt;/code&gt;, which we have seen before. Or, for something that’s harder to build intuition for, if we remove the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;same8&lt;/code&gt; condition and just expect to find a constant that, after multiplication, moves the top bit of every byte into the top 8 bits (for a more general emulation of movemask), Z3 will print &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no solution&lt;/code&gt; - for any magic constant, carry propagation means there is some value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; for which the equality doesn’t hold.&lt;/p&gt;
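&lt;p&gt;The 0/1-byte variant is just as easy to confirm by brute force with the constant mentioned above:&lt;/p&gt;

```python
# constant for inputs whose bytes are 0 or 1; note that it equals
# 255 * magic from the 0x00/0xff case
magic01 = 0x0102040810204080

for m in range(256):
    x = sum(1 << (8 * k) for k in range(8) if (m >> k) & 1)
    assert ((x * magic01) & (2**64 - 1)) >> 56 == m
print("ok")
```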

&lt;p&gt;Z3 is probably not the best tool for every problem like this - you need to know the shape of the solution you’re looking for, and in some cases the problem constraints are such that a solution simply doesn’t exist. However, it’s surprisingly versatile: it can be used to validate optimizations, discover vulnerabilities in hash functions, and that’s before you venture outside the domain of bit vectors and explore the full power of predicate logic.&lt;/p&gt;

&lt;h1 id=&quot;removing-the-shift&quot;&gt;Removing the shift&lt;/h1&gt;

&lt;p&gt;The solution we’ve come up with so far requires two 64-bit multiplies and a shift. Is that the best we can do, or can we try to remove the shift?&lt;/p&gt;

&lt;p&gt;Well, we can’t remove the shift as it is - Z3 will happily tell us that it’s impossible to find a magic constant that moves bits from the end of a number into the beginning (although that’s also obvious without Z3). However, by using a 128-bit multiplication instruction, we can adjust the magic constant so that the 8-bit result lands in the top 64-bit half of the product. This can be accessed conveniently on GCC/clang with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__int128&lt;/code&gt; extension, resulting in the following code:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__int128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x0103070f1f3f8000ull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;uint64x2_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vreinterpretq_u64_u8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;mask0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vgetq_lane_u64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mask1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vgetq_lane_u64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The shift instruction gets optimized away, resulting in just two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UMULH&lt;/code&gt; instructions in addition to the vector-&amp;gt;scalar moves, and an extra ~1% throughput improvement on Apple M2. However, UMULH may not be as fast as a regular multiplication: the Cortex-X2 optimization guide lists it with one extra cycle of latency, and using it on AWS Graviton2 actually results in a ~2% throughput reduction. On MSVC, the same optimization requires the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_umul128&lt;/code&gt; intrinsic, so overall this solution is unfortunately not as portable as the original one.&lt;/p&gt;
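&lt;p&gt;If you’d rather not take Z3’s word (or mine) for the magic constant, the property is small enough to check exhaustively. Here’s a quick Python sketch - not part of meshoptimizer - that mirrors the scalar math: expand every possible 8-bit mask into a 64-bit value with 0xFF in each selected byte (the shape of a NEON comparison result), compute the full 128-bit product with the magic constant, and read the mask back from the low byte of the high 64-bit half:&lt;/p&gt;

```python
# Exhaustive check of the shift-free movemask emulation.
MAGIC = 0x0103070F1F3F8000

def expand(mask8):
    # 8-bit mask to a 64-bit value with byte i equal to 0xFF when bit i is set
    return sum(0xFF * 256**i for i in range(8) if (mask8 // 2**i) % 2)

def movemask(bytes64):
    product = bytes64 * MAGIC  # Python ints keep the full 128-bit product
    high = product // 2**64    # the top 64-bit half, i.e. the UMULH result
    return high % 256          # the uint8_t cast

assert all(movemask(expand(m)) == m for m in range(256))
print("verified all 256 masks")
```

&lt;p&gt;This is the brute-force counterpart of the Z3 query: checking all 256 cases directly instead of asking the solver to prove them.&lt;/p&gt;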

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;These few lines of meshoptimizer code had an interesting journey: starting with an SSSE3 implementation, which matched the mental model I had at the time; discovering that the core idea, byte expansion, is so common that of course AVX-512 has a dedicated instruction for it; wrestling with NEON emulation of movemask; enhancing the WebAssembly SIMD proposal (this snippet was one of the use cases for the addition of swizzle and bitmask); and finally realizing that in some cases, a plain old 64-bit multiply is as powerful as a SIMD instruction or two.&lt;/p&gt;

&lt;p&gt;The performance gains quoted here may seem small - but as often happens with performance tuning, solid wins for an entire algorithm come from optimizing each of its parts and letting the gains accumulate! Getting a few percent of extra throughput on the entire algorithm merely from tuning something as small as movemask emulation is pretty exciting from that perspective.&lt;/p&gt;

&lt;p&gt;To reproduce the performance measurements, you can clone meshoptimizer from GitHub and run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make config=release codecbench &amp;amp;&amp;amp; ./codecbench&lt;/code&gt;; the relevant throughput measurement is the first one (vtx), and the source code is in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/vertexcodec.cpp&lt;/code&gt;, most notably &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decodeBytesGroupSimd&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;While the specific code snippets presented here are probably not generally useful unless you’re writing a SIMD-friendly compressor, I hope that the introduction to Z3 motivates more people to use theorem provers, both to discover numeric properties that might not be obvious at first glance and to validate code transformations. The single program shared here likely represents a tiny fraction of Z3’s power and utility - feel free to share other fun things you’ve used Z3 for in the comments!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As an aside, the process of contributing to the WebAssembly SIMD proposal was very enjoyable! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In our case, the natural bit order works best; when a different bit order is desired, it’s a matter of simply changing the magic constant. E.g. to get the reverse order, you need to multiply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x8040201008040201&lt;/code&gt; by the multiplicative inverse of 255 mod 2&lt;sup&gt;64&lt;/sup&gt; to get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x0080c0e0f0f8fcff&lt;/code&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Fri, 02 Sep 2022 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2022/09/02/vpexpandb-neon-z3/</link>
			<guid isPermaLink="true">https://zeux.io/2022/09/02/vpexpandb-neon-z3/</guid>
		</item>
		
		<item>
			<title>On Proebsting&apos;s Law</title>
			<description>&lt;p&gt;A friend recently learned about Proebsting’s law and mentioned it to me off hand. If you aren’t aware, Proebsting’s law states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Compiler Advances Double Computing Power Every 18 Years&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which is to say, if you upgrade your compiler every 18 years, you would expect on average your code to double in performance on the same hardware.
This is in sharp contrast to Moore’s law, and suggests that we should be cautious about the performance gains that compiler evolution brings. Proebsting &lt;a href=&quot;https://proebsting.cs.arizona.edu/law.html&quot;&gt;writes&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Perhaps this means Programming Language Research should be concentrating on something other than optimizations. Perhaps programmer productivity is a more fruitful arena.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I knew about the law’s existence but I never really asked myself - do I believe in it?&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: this article was published as a Gist in January 2021, but I decided to repost it in the blog form with minor edits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;can-we-measure-this&quot;&gt;Can we measure this?&lt;/h1&gt;

&lt;p&gt;It occurred to me that I could try to do an experiment.
I could take a modern compiler and compare performance of generated code - along with perhaps a few other metrics - vs a 20-year-old one.&lt;/p&gt;

&lt;p&gt;At least this was my initial intention; however I’ve long wanted to do &lt;em&gt;another&lt;/em&gt; experiment which is to figure out how LLVM has changed over the years.
To combine these two I wanted to get an old version of LLVM and test it against a modern version.&lt;/p&gt;

&lt;p&gt;To make this experiment a bit more interesting, I was going to test LLVM 1.0 - unfortunately, it only comes with 32-bit Linux binaries that I wasn’t able to get to work fully due to lack of 32-bit system headers, and it segfaulted when compiling one of the source files. So we’re going to test two versions of LLVM:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;LLVM 2.7. This is the first release of LLVM that contains a version of Clang that can compile C++ code.&lt;/li&gt;
  &lt;li&gt;LLVM 11. This is the latest stable release of LLVM that I happen to have available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLVM 2.7 was released in April 2010, which was 11 years ago (10.5 years before the release of LLVM 11 in August 2020). So we wouldn’t quite expect a 2x speedup according to Proebsting’s law - only a 1.5x one.&lt;/p&gt;
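&lt;p&gt;As a quick sanity check on that arithmetic (a throwaway Python snippet, not part of any benchmark here): a doubling every 18 years means an expected factor of 2&lt;sup&gt;t/18&lt;/sup&gt; after t years, which for t = 10.5 comes out to roughly 1.5:&lt;/p&gt;

```python
# Proebsting's law: compiler advances double performance every 18 years,
# i.e. an expected speedup of 2**(t/18) after t years.
expected = 2 ** (10.5 / 18)
print(round(expected, 2))  # prints 1.5
assert round(expected, 2) == 1.5
```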

&lt;p&gt;We’re going to compare these compilers on the compile-time and run-time axes as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Using an amalgamated version of &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; library, we’re going to build &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libmeshoptimizer.o&lt;/code&gt; several times for each compiler, with and without optimizations (-O0 through -O3), and note the build time.&lt;/li&gt;
  &lt;li&gt;Using the resulting optimized .o file we’re going to compile the meshoptimizer demo program using modern clang, run it on a Stanford dragon mesh and compare timings for various algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reason we’re going to compile the demo program separately is that the demo program uses the STL, and I don’t want to track down versions of the STL that are compatible with these older compilers.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: I’m aware that this is not a rigorous or a scientific way to analyze the law; the law itself is also a bit tongue in cheek so who cares? Don’t read too much into the results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s go!&lt;/p&gt;

&lt;h1 id=&quot;building-library-code&quot;&gt;Building library code&lt;/h1&gt;

&lt;p&gt;I’ve downloaded a binary release of LLVM 2.7 from &lt;a href=&quot;https://releases.llvm.org/&quot;&gt;releases.llvm.org&lt;/a&gt;; LLVM 11 comes with Ubuntu 20. I’m running everything using WSL2 on a Linux partition to make sure the performance numbers are representative of real hardware.&lt;/p&gt;

&lt;p&gt;Each compiler is used to build all meshoptimizer source (8.5 KLOC) as a single translation unit to simplify the build process, in four configurations: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O0&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Os -DNDEBUG&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O2 -DNDEBUG&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O3 -DNDEBUG&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Build time comparison:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Build&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;O0&lt;/td&gt;
      &lt;td&gt;0.236s&lt;/td&gt;
      &lt;td&gt;0.267s (+13%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Os&lt;/td&gt;
      &lt;td&gt;0.540s&lt;/td&gt;
      &lt;td&gt;0.992s (+84%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O2&lt;/td&gt;
      &lt;td&gt;0.618s&lt;/td&gt;
      &lt;td&gt;1.350s (+118%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O3&lt;/td&gt;
      &lt;td&gt;0.658s&lt;/td&gt;
      &lt;td&gt;1.443s (+119%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Object size comparison:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Build&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;O0&lt;/td&gt;
      &lt;td&gt;229.5 KB&lt;/td&gt;
      &lt;td&gt;215.3 KB (-6%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Os&lt;/td&gt;
      &lt;td&gt;80.9 KB&lt;/td&gt;
      &lt;td&gt;74.4 KB (-8%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O2&lt;/td&gt;
      &lt;td&gt;86.2 KB&lt;/td&gt;
      &lt;td&gt;106.9 KB (+24%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O3&lt;/td&gt;
      &lt;td&gt;85.5 KB&lt;/td&gt;
      &lt;td&gt;111.9 KB (+30%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Based on this analysis we can observe that debug compilation throughput was not impacted very significantly - over 10 years of development clang+llvm got ~13% slower in debug builds, which is not surprising and not particularly alarming. Release builds, however, compile noticeably slower - 2.2x slower at O2/O3.&lt;/p&gt;

&lt;p&gt;In terms of output size, the numbers look healthy - O2/O3 builds got ~25-30% larger but that by itself isn’t a problem as long as we see matching performance increases - in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Os&lt;/code&gt;, where size is important, the binary got 8% smaller.&lt;/p&gt;

&lt;h1 id=&quot;runtime-basics-in-o0oso2o3&quot;&gt;Runtime: basics in O0/Os/O2/O3&lt;/h1&gt;

&lt;p&gt;The problem when comparing runtime is that it’s not clear what specific build we need to compare, and what code we need to benchmark.
meshoptimizer comes with lots of algorithms that have various performance characteristics. It would be interesting to analyze all of them, but since this article doesn’t promise to be scientific, we’re going to pick a few algorithms and measure them in all build configurations, and then select one configuration to dig into Proebsting’s law further.&lt;/p&gt;

&lt;p&gt;To get a basic understanding, let’s pick just three algorithms - vertex cache optimization, simplification and index decompression. We’re going to look closer into performance of other algorithms later, but it would be good to get a sense of the differences between the versions on a small set.&lt;/p&gt;

&lt;p&gt;Vertex cache optimization:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Build&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;O0&lt;/td&gt;
      &lt;td&gt;506ms&lt;/td&gt;
      &lt;td&gt;482ms (-5%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Os&lt;/td&gt;
      &lt;td&gt;176ms&lt;/td&gt;
      &lt;td&gt;167ms (-5%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O2&lt;/td&gt;
      &lt;td&gt;175ms&lt;/td&gt;
      &lt;td&gt;181ms (+3%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O3&lt;/td&gt;
      &lt;td&gt;174ms&lt;/td&gt;
      &lt;td&gt;183ms (+5%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Simplification:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Build&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;O0&lt;/td&gt;
      &lt;td&gt;761ms&lt;/td&gt;
      &lt;td&gt;741ms (-3%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Os&lt;/td&gt;
      &lt;td&gt;376ms&lt;/td&gt;
      &lt;td&gt;335ms (-11%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O2&lt;/td&gt;
      &lt;td&gt;379ms&lt;/td&gt;
      &lt;td&gt;325ms (-14%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O3&lt;/td&gt;
      &lt;td&gt;366ms&lt;/td&gt;
      &lt;td&gt;318ms (-13%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Index decompression:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Build&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;O0&lt;/td&gt;
      &lt;td&gt;21.3ms&lt;/td&gt;
      &lt;td&gt;18.9ms (-11%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Os&lt;/td&gt;
      &lt;td&gt;7.0ms&lt;/td&gt;
      &lt;td&gt;4.6ms (-34%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O2&lt;/td&gt;
      &lt;td&gt;5.1ms&lt;/td&gt;
      &lt;td&gt;4.6ms (-9%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;O3&lt;/td&gt;
      &lt;td&gt;5.2ms&lt;/td&gt;
      &lt;td&gt;4.6ms (-12%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The picture that is beginning to emerge here seems rather grim. We see speedups in the 10-15% range in optimized builds, with the exception of index decompression in Os, which looks more like an outlier - likely the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Os&lt;/code&gt; inlining heuristics in LLVM 11 result in the same code across different optimization levels. We also see speedups in the 5% range in unoptimized builds.&lt;/p&gt;

&lt;p&gt;Now, in addition to the disclaimer about the comparison not being particularly scientific, it’s important that the reader understands one extra detail - all algorithms in meshoptimizer are carefully optimized. This isn’t run-of-the-mill C++ code - this is code that was studied under various profilers and tweaked until the performance was deemed worthy, while remaining reasonably concise.&lt;/p&gt;

&lt;p&gt;It is possible in theory that less carefully optimized code exhibits different behavior, or that the benchmarks chosen here are simply not as amenable to compiler optimization as they could be. The lack of a pronounced difference between optimization levels is also noteworthy (although O3 in particular has been studied in academic research before, with inconclusive results).&lt;/p&gt;

&lt;p&gt;To try to get a more complete picture, let’s now look at more algorithms and compare them in O2 build only.&lt;/p&gt;

&lt;h1 id=&quot;runtime-algorithms-in-o2&quot;&gt;Runtime: algorithms in O2&lt;/h1&gt;

&lt;p&gt;We’re going to first take a look at a more complete set of algorithms from meshoptimizer library; this isn’t every single algorithm in existence as some of the algorithms have performance characteristics that aren’t very distinct compared to other algorithms already presented here. This also excludes vertex decompression which is going to be mentioned separately.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Algorithm&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Reindex&lt;/td&gt;
      &lt;td&gt;92ms&lt;/td&gt;
      &lt;td&gt;86ms (-7%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Cache&lt;/td&gt;
      &lt;td&gt;175ms&lt;/td&gt;
      &lt;td&gt;183ms (+4%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CacheFifo&lt;/td&gt;
      &lt;td&gt;49ms&lt;/td&gt;
      &lt;td&gt;48ms (-2%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Overdraw&lt;/td&gt;
      &lt;td&gt;57ms&lt;/td&gt;
      &lt;td&gt;52ms (-8%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Stripify&lt;/td&gt;
      &lt;td&gt;46ms&lt;/td&gt;
      &lt;td&gt;36ms (-20%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Meshlets&lt;/td&gt;
      &lt;td&gt;519ms&lt;/td&gt;
      &lt;td&gt;545ms (+5%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Adjacency&lt;/td&gt;
      &lt;td&gt;250ms&lt;/td&gt;
      &lt;td&gt;188ms (-25%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Simplify&lt;/td&gt;
      &lt;td&gt;380ms&lt;/td&gt;
      &lt;td&gt;323ms (-15%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SimplifySloppy&lt;/td&gt;
      &lt;td&gt;61ms&lt;/td&gt;
      &lt;td&gt;45ms (-26%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SpatialSort&lt;/td&gt;
      &lt;td&gt;22ms&lt;/td&gt;
      &lt;td&gt;19ms (-14%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;IndexEncode&lt;/td&gt;
      &lt;td&gt;29ms&lt;/td&gt;
      &lt;td&gt;26ms (-11%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;IndexDecode&lt;/td&gt;
      &lt;td&gt;5.2ms&lt;/td&gt;
      &lt;td&gt;4.6ms (-12%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Overall the picture here is not very different from what we’ve already established - LLVM 11 seems to produce code that’s 10-15% faster on most benchmarks. There are a couple of outliers where the performance gain is more substantial, up to 25%, and a couple of benchmarks where LLVM 11 actually generates consistently &lt;em&gt;slower&lt;/em&gt; code, by up to 5% - this is not a measurement error.&lt;/p&gt;

&lt;p&gt;I reran the outliers using -O3, with the following results; the gap narrows a bit but remains substantial:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Algorithm&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Stripify&lt;/td&gt;
      &lt;td&gt;44ms&lt;/td&gt;
      &lt;td&gt;35ms (-20%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Adjacency&lt;/td&gt;
      &lt;td&gt;212ms&lt;/td&gt;
      &lt;td&gt;174ms (-18%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SimplifySloppy&lt;/td&gt;
      &lt;td&gt;52ms&lt;/td&gt;
      &lt;td&gt;44ms (-15%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;These gains are certainly welcome, although it is unfortunate that they seem to come at the cost of 2x slower compilation. This takes me back to “The death of optimizing compilers” by Daniel J. Bernstein - I wonder if there’s a happier middle ground that can be found, one where the compiler gives more control over optimization decisions to the developer and allows tuning the code to reach gains that can be seen here at a lower complexity and compilation performance cost.&lt;/p&gt;

&lt;h1 id=&quot;runtime-simd&quot;&gt;Runtime: SIMD&lt;/h1&gt;

&lt;p&gt;All of the algorithms presented so far were scalar, implemented in portable C++. While portions of some of them can be vectorized in theory, in practice clang 11, even at -O3, struggles to generate efficient SIMD code for most if not all of them.&lt;/p&gt;

&lt;p&gt;meshoptimizer does have several algorithms with first-class SIMD versions, implemented using SSE/NEON/Wasm intrinsics. Their performance was compared using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codecbench&lt;/code&gt;, a utility that comes with meshoptimizer and reports performance in GB/sec - so unlike the timing tables above, larger is better.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Algorithm&lt;/th&gt;
      &lt;th&gt;LLVM 2.7&lt;/th&gt;
      &lt;th&gt;LLVM 11&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;vertex decode&lt;/td&gt;
      &lt;td&gt;2.3 GB/s&lt;/td&gt;
      &lt;td&gt;3.0 GB/s (+30%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;filter-oct8&lt;/td&gt;
      &lt;td&gt;2.6 GB/s&lt;/td&gt;
      &lt;td&gt;2.8 GB/s (+8%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;filter-oct12&lt;/td&gt;
      &lt;td&gt;4.1 GB/s&lt;/td&gt;
      &lt;td&gt;4.2 GB/s (+2%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;filter-quat12&lt;/td&gt;
      &lt;td&gt;2.4 GB/s&lt;/td&gt;
      &lt;td&gt;2.6 GB/s (+8%)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;filter-exp&lt;/td&gt;
      &lt;td&gt;13.2 GB/s&lt;/td&gt;
      &lt;td&gt;13.6 GB/s (+3%)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;All of the filters are typical SIMD streaming kernels - there are no branches or complex data dependencies. Perhaps unsurprisingly, the delta in performance of the compiled code is thus not very significant. Vertex decode is substantially more complicated - it contains function calls, branches, and a mix of scalar and vector instructions, and in general can be more challenging for the optimizer.&lt;/p&gt;

&lt;p&gt;It’s worth noting that on this particular example, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O3&lt;/code&gt; with LLVM 2.7 brings the performance up from 2.3 GB/s to 2.7 GB/s, while having no effect on LLVM 11 - bringing the delta between LLVM 11 and LLVM 2.7 back to ~10% range.&lt;/p&gt;

&lt;p&gt;It’s undoubtedly possible to find examples of loops that LLVM 2.7 couldn’t vectorize (by virtue of not having an autovectorizer) and LLVM 11 can - unfortunately, my experience even with streamlined kernels like the aforementioned filters forces me to maintain a deep distrust of the auto-vectorizer (of the four filter kernels above, clang 11 cannot vectorize a single one, and gcc 10 can only vectorize ‘exp’). I would claim that any gains due to auto-vectorization can’t be counted as significant until programmers are given better tools to make these optimizations more predictable and reliable.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion?&lt;/h1&gt;

&lt;p&gt;The overall picture seems to be as follows.&lt;/p&gt;

&lt;p&gt;LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old. This may be a general rule, something specific to highly tuned code, or something specific to meshoptimizer algorithms.&lt;/p&gt;

&lt;p&gt;Without spending more than an evening on this, it’s hard to disambiguate the reasons. And this post definitely doesn’t pretend to be thorough research - it’s just a fun little study of how competitive clang 2.7 looks in 2021. Without a doubt, the amazing community behind LLVM didn’t spend the last decade for naught - but if you still believe in the sufficiently smart optimizing compiler, it may be time to reconsider the extent to which you can rely on the compiler to make your code faster year after year; if anything, Proebsting’s law should probably be reformulated as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Compiler Advances Double Computing Power Every 50 Years, And The Interval Keeps Growing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s important to recognize that there are many forces that together define the rate at which software performance changes - between hardware getting faster (yes, even in the last 10 years, despite what articles like “Free Lunch Is Over” would have you believe), compilers getting better, software development practices frequently getting out of hand, and large discrepancies in developers’ optimization expertise, compiler advances are just one, rather small, piece of the puzzle. Perhaps Daniel Bernstein was right after all.&lt;/p&gt;
</description>
			<pubDate>Sat, 08 Jan 2022 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2022/01/08/on-proebstings-law/</link>
			<guid isPermaLink="true">https://zeux.io/2022/01/08/on-proebstings-law/</guid>
		</item>
		
		<item>
			<title>Eight years at Roblox</title>
			<description>&lt;p&gt;I joined Roblox at the end of 2012 as a rendering engineer; I had just spent more than a year working on various titles from FIFA franchise after years of console game development and was becoming a bit tired of the “big game development”. My work on FIFA was as a contractor and I got an offer for a full-time position, but I also had a friend who worked at Roblox reach out and offer me to move to California and work on Roblox. I knew absolutely nothing about Roblox, but California was nice and my friend told me it would be awesome. The platform was so different (and so strange!) that I decided to take a chance - here I am, 8 years later, still working at Roblox and enjoying it. I started on my first full time job in April 2007 so at this point I’ve worked for 13 years in game development and 8 of them were at Roblox.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;My memory works in interesting ways. I remember my interview pretty well: I remember having lunch at some place in downtown San Mateo near the Roblox HQ - a few people were there, including Roblox CEO David Baszucki, and I remember him asking many questions about my thoughts on engines and rendering, and I distinctly remember not finishing most of my lunch because I talked most of the time. However, I don’t really remember what was going through my head with regard to my perception of Roblox - why did I join, besides just wanting to do something else for a change? Who knows, but I am glad I did.&lt;/p&gt;

&lt;p&gt;I don’t really understand why Roblox is so successful - you can invent all sorts of reasons in retrospect but it’s hard to validate them, and if you came to anybody back in 2012 and asked for an investment to build a platform where all games are user generated and run on a custom engine with a custom toolset and all users participate in a giant virtual economy and …, I think you’d have gotten a blank stare.&lt;/p&gt;

&lt;p&gt;But I do understand that I found the perfect place for me, especially at that point in my career - I enjoy working on game technology but I never liked working on actual games, and Roblox maximizes the number of developers who can use the technology you work on, while offering good autonomy and a very wide range of problems to solve. It’s very hard to get bored here.&lt;/p&gt;

&lt;p&gt;I think I could talk for hours about Roblox - it somehow became a huge part of my life. I was very fortunate to join at the time when I did and witness the growth of our technology and business. I am really unsure of what the future holds but it’s hard to imagine what, if anything, comes after Roblox - I certainly don’t intend to leave any time soon…&lt;/p&gt;

&lt;p&gt;So I thought it might be fun to do what I’ve planned to do for a year or more now, and go over all decently sized projects I’ve ever worked on at Roblox. This is based on resummarizing and reliving the source control history, which tells me I’ve had 2752 changes that made it to our main branch, with merge commits counting as one - so, uh, this post might be on the larger side. Hopefully this will be fun!&lt;/p&gt;

&lt;p&gt;Before we begin, I just want to say that I’m very grateful to the Roblox leadership for treating me well, for all the friends and colleagues I made along the way, and for the wonderful Roblox community. The reason I still enjoy what I do is that whenever I write about a new big thing I’m working on, or a small feature, or even a bug fix, it’s usually met with excitement, which keeps me going. Thank you all from the bottom of my heart. I don’t think I could have done it without you, and I hope this continues for as long as possible despite the current trying and uncertain times.&lt;/p&gt;

&lt;h1 id=&quot;july-2012-assorted-fixes-to-rendering-code&quot;&gt;July 2012: Assorted fixes to rendering code&lt;/h1&gt;

&lt;p&gt;Notably including half-pixel offset fixes for Direct3D9 which I guess is a rite of passage for rendering engineers. The rendering code back then was based on OGRE rendering engine, so I had to learn that, and this was also my first time using OpenGL professionally - prior to that I’ve used Direct3D 9 and proprietary console APIs, and Direct3D 10/11 as a hobby.&lt;/p&gt;

&lt;h1 id=&quot;august-2012-prototype-new-part-rendering&quot;&gt;August 2012: Prototype new part rendering&lt;/h1&gt;

&lt;p&gt;Initially added for the “100 player” project, in October it evolved to render all parts and continued to serve as the part renderer until the introduction of instancing in 2018. Otherwise known as “featherweight parts”. This was further optimized and deployed around November 2012. Most of this code survives to this day, having evolved over time, and is still used when instancing doesn’t apply.&lt;/p&gt;

&lt;p&gt;The core idea in this system was to dynamically batch meshes together, for characters this would be based on the character model hierarchy, and for everything else the grouping is spatial. This allowed us to reduce the number of draw calls, which was a big concern due to both driver overhead and inefficiencies in OGRE.&lt;/p&gt;

&lt;p&gt;This would pave the way for what eventually turned out to be a complete, but gradual, rewrite of the rendering stack. The main motivation for this was always performance - what we ended up with let us port to mobile (the old rendering code was nowhere near fast enough even for relatively simple scenes) and break new ground on the number of objects we could render in a frame.&lt;/p&gt;

&lt;h1 id=&quot;august-2012-ogre-upgrade-from-16-to-18&quot;&gt;August 2012: OGRE upgrade from 1.6 to 1.8&lt;/h1&gt;

&lt;p&gt;One of a few OGRE upgrades we needed to do, this one was to get better GLES support. These were pretty painful, just like any other big middleware update. Read further to learn what happened to OGRE eventually…&lt;/p&gt;

&lt;p&gt;One thing I remember from doing these is that documentation in source code makes the upgrade process that much more painful. I had scripts that changed the copyright years in headers back to whatever they were in our tree just to make merging less painful, but there was some OGRE upgrade where 70% of the changes were documentation, and this was very hard to get through.&lt;/p&gt;

&lt;p&gt;The reason why these were challenging in general is that whenever we did an upgrade we had to a) merge our plentiful changes with the new code, b) gate dangerous parts of the upgrade with flags. We’ve used the same system of feature flags (we call them fast flags) since I joined Roblox, which allows us to dynamically disable parts of a release based on metrics - but this requires selectively isolating changes behind if statements, which for OGRE was sometimes necessary as we didn’t know what the impact of some low level change in OpenGL code would be.&lt;/p&gt;
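&lt;p&gt;Mechanically the gating looks as simple as it sounds - a dynamically controlled boolean check around the risky path. A toy sketch (the flag name and plumbing are invented for illustration; the real flags are fetched from a service):&lt;/p&gt;

```python
# Sketch: gate a risky upgrade path behind a dynamically controlled flag.
FLAGS = {"UseNewGLTextureUpload": False}  # shipped off, flipped remotely

def fast_flag(name):
    # Stand-in for the engine's dynamic flag lookup.
    return FLAGS.get(name, False)

def upload_texture(data):
    if fast_flag("UseNewGLTextureUpload"):
        return "new path: %d bytes" % len(data)  # upgraded code
    return "old path: %d bytes" % len(data)      # previous, known-good code
```

&lt;p&gt;The point is that the old code stays shippable: if metrics regress after a release, the flag is turned off server-side with no client update.&lt;/p&gt;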

&lt;h1 id=&quot;september-2012-first-hlsl-glsl-shader-compiler&quot;&gt;September 2012: First HLSL-&amp;gt;GLSL shader compiler&lt;/h1&gt;

&lt;p&gt;Before this we had hand-translated shaders, which started to be painful to maintain. The first version of the pipeline used hlsl2glsl and glsl-optimizer (same as Unity back in the day). We are using version 3 today, see below!&lt;/p&gt;

&lt;p&gt;Since this was done at the point where we used OGRE, the compiler would take HLSL files, preprocess and translate them to optimized GLSL, and save the resulting GLSL back to disk - which would then be loaded by OGRE directly through the material definition file. Eventually we replaced this with a binary shader pack that could store GLSL code for OpenGL and shader bytecode for other APIs, but back then we shipped HLSL and GLSL source and compiled HLSL code on device!&lt;/p&gt;

&lt;h1 id=&quot;september-2012-f-scripts-for-hw-statistics&quot;&gt;September 2012: F# scripts for HW statistics&lt;/h1&gt;

&lt;p&gt;Our equivalent of “Steam Hardware Survey” that went through SQL databases and coalesced various system information bits to help us understand the hardware at the time. This was during my era of obsession with F#, so it was written in F# instead of something like Python. We don’t use this anymore and don’t even have the SQL database in question!&lt;/p&gt;

&lt;p&gt;We never published the resulting data, and I’m not sure how often we used it to make decisions, but it was fun to look at the number of graphics cards from various vendors or amount of RAM or resolution a typical Roblox user has.&lt;/p&gt;

&lt;h1 id=&quot;september-2012-first-exploit-fix&quot;&gt;September 2012: First exploit fix&lt;/h1&gt;

&lt;p&gt;Although I was hired as a rendering engineer, I had a lot of really deep low-level systems experience, and as a consequence ended up engaging in both optimization work and security related work from the very beginning. I don’t do this anymore, but I was often involved in the security work for the first 3 or 4 years. Now we fortunately have people who can do this full time, and better than I could :)&lt;/p&gt;

&lt;h1 id=&quot;september-2012-character-texture-compositor&quot;&gt;September 2012: Character texture compositor&lt;/h1&gt;

&lt;p&gt;A second part of the “100 player” project, necessary to render every character in one draw call (these were really expensive for us back in the day!). A side effect was some resolution sacrifices on character items, which shirt creators aren’t fond of. The new system managed the atlas texture memory, rebaking far-away humanoids to smaller textures to conserve space. The compositor survived with minor changes to this day, although we’re now working on a new one.&lt;/p&gt;

&lt;p&gt;The compositor was built in a very configurable fashion, allowing the high level code to specify the layout to bake, while managing all the complex asynchronous processing and budgeting by itself. This allowed us to switch the compositing layout completely years later for R15.&lt;/p&gt;

&lt;h1 id=&quot;october-2012-assorted-memoryperformance-work&quot;&gt;October 2012: Assorted memory/performance work&lt;/h1&gt;

&lt;p&gt;At the end of 2012 we were actively working on the mobile port. Since then we’ve had to do a lot of work in a lot of different parts of the engine to make data structures smaller and algorithms faster. Of course you’re never done with optimizations, so we do this to this day. Curiously, our minimum spec on iOS has stayed the same since the initial launch in 2012!&lt;/p&gt;

&lt;p&gt;A fun fact is that even though we started with iPad 2 as the min. spec, we discussed adding support for iPad 1 after launch. At the time there were a lot of people who couldn’t play Roblox on iOS because of older hardware. However, the performance characteristics of those devices were just… not good enough. You could touch the screen with a finger and pan the camera, and during panning you lost 30% of the single available core to the OS processing the touch. We decided not to add support for this, and 8 years later it seems like a great decision for sure :D&lt;/p&gt;

&lt;h1 id=&quot;october-2012-event-based-profiler-in-f&quot;&gt;October 2012: Event-based profiler in F#&lt;/h1&gt;

&lt;p&gt;It was very hard to use Xcode Instruments to profile frame spikes on an iPad; to try to figure out how to get our performance to a better place on mobile, I wrote some ad-hoc code to dump all internal log events to a binary stream, and a desktop UI tool in F# and WPF to visualize it. This included a Lua profiler as well that could display profiles of Lua code in a traditional hierarchical aggregated view based on event data. This did not survive but curiously we ended up using a similar approach with microprofile years later.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;november-december-2012-finalize-part-rendering&quot;&gt;November-December 2012: Finalize part rendering&lt;/h1&gt;

&lt;p&gt;What started in August as a character-only renderer that supported meshes, evolved into something that could render any part in Roblox the same way as old rendering code did. This was not easy, both because performance was really important in every part of the code, and because there’s a &lt;em&gt;lot&lt;/em&gt; of corner cases that had to function pretty much as they did before. Except for perhaps the legacy cylinder rendering:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_2.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This code supported FFP as well, using matrix palette blending to efficiently render characters with rigid joints, and on desktop also came with vertex shaders that were carefully optimized to run faster on Intel GPUs without hardware vertex shading (through the software vertex shading path). It also implemented stencil shadows, using GPU-based extrusion and CPU-based silhouette detection with dynamic index buffers. Fun times!&lt;/p&gt;

&lt;h1 id=&quot;january-2013-voxel-lighting-work-begins&quot;&gt;January 2013: Voxel lighting work begins&lt;/h1&gt;

&lt;p&gt;My memory is a bit fuzzy on this one, but I think we were brainstorming possible ways to implement full scene shadows in a way that would work on mobile, and I’d recently watched the presentation from Little Big Planet on how they did lighting on PS3 with voxel based computations; our CEO was part of the discussions and mentioned “what if all lighting was voxel based”, and the rest is history. The approach we ended up taking was, to my knowledge, very distinct from other voxel based implementations.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_3.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;february-2013-skylight-and-optimizations&quot;&gt;February 2013: Skylight and optimizations&lt;/h1&gt;

&lt;p&gt;In January the voxel lighting engine got support for sun shadows and point/spot lights, but it felt like on good GPU hardware we could get better results with other techniques, so we were looking for other things we could use voxels for. I don’t remember who came up with the idea, but this is when skylight was implemented, which is a form of ambient occlusion with the sky as a light source, and is very hard to do correctly without voxels.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_4.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To make voxel lighting practical I also rewrote all functions using hand-coded SIMD (SSE2), including the voxelizer - lighting on the CPU isn’t practical without this (this code was later translated to NEON for the iOS port).&lt;/p&gt;

&lt;p&gt;The resulting lighting code survived up until FIB Phase 1, which added HDR support and changed the voxelizer to use anisotropic occupancy, but is otherwise still used today.&lt;/p&gt;

&lt;h1 id=&quot;march-2013-new-materials&quot;&gt;March 2013: New materials&lt;/h1&gt;

&lt;p&gt;In an effort to redefine the way Roblox games look (back then we thought Roblox as a platform needed an art style), the work on new materials began. We used to use a random set of shaders, including some procedural ones; this work replaced the shader framework with “surface shaders” (inspired by Aras P.’s work on Unity from around the same time; we use the resulting shader interfaces to this day, although it’s not clear that they are actually pulling their weight, and if I did this today I would not have gone that route), and implemented more traditional texture-based materials on top, ruining wood grain forever.&lt;/p&gt;

&lt;h1 id=&quot;april-2013-binary-file-format&quot;&gt;April 2013: Binary file format&lt;/h1&gt;

&lt;p&gt;Annoyed with the time it took our custom XML parser/serializer to work with large places, I designed and implemented a custom binary file format. It was chunk-based, with per-chunk LZ4 compression and custom binary filters that preprocess the data to help LZ4; the format was structured to make reflection interop cheaper and to maximize loading performance. We use this as the main file format to this day, although it has gotten a few small tweaks (mainly extensions to handle cryptographic signatures and more efficient shared binary blobs). I’m still happy with the design, but I’d slightly change the layout in a couple of places to make loading very big places more cache-coherent - something that wasn’t as big of a concern back then. This can still be done today, but requires small revisions to how some chunks represent data.&lt;/p&gt;
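&lt;p&gt;To give a flavor of what a “binary filter” does - preprocess a chunk so a byte-oriented compressor sees more repetition - here’s a toy delta filter over a sequential integer array, with zlib standing in for LZ4 since it ships with Python (the actual filters and layout are Roblox-specific):&lt;/p&gt;

```python
import struct
import zlib

def delta_ints(values):
    # Store each value as the difference from its predecessor; sequential
    # or sorted data turns into long runs of identical small numbers.
    out = []
    prev = 0
    for v in values:
        out.append((v - prev) % (2 ** 32))
        prev = v
    return out

values = list(range(4096))  # e.g. a chunk of sequential object ids
raw = struct.pack("4096I", *values)
filtered = struct.pack("4096I", *delta_ints(values))
plain_size = len(zlib.compress(raw))
filtered_size = len(zlib.compress(filtered))
# The delta-filtered chunk compresses to a fraction of the unfiltered size.
```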

&lt;p&gt;The initial rollout of this change was just for Play Solo, which saved the entire world to a file and loaded the result back into the new datamodel; this meant it was safe to release because no permanent data loss would occur. After this we gradually switched to using this format for publishing places, and eventually started using it for models (packages) as well. Today almost all semantically rich content on Roblox uses this format.&lt;/p&gt;

&lt;p&gt;Ironically we did end up replacing our XML parser with a library of mine, &lt;a href=&quot;https://github.com/zeux/pugixml&quot;&gt;pugixml&lt;/a&gt;, in 2019 - although the binary storage is still more performant and space efficient.&lt;/p&gt;

&lt;h1 id=&quot;may-2013-opengl-es2-support&quot;&gt;May 2013: OpenGL ES2 support&lt;/h1&gt;

&lt;p&gt;When we shipped our iOS port it was done with ES1 (FFP); this meant a lot of features didn’t work, including lighting which was becoming pretty important. OGRE support for ES2 was immature at the time, so this included a lot of fixes in OGRE code, and a fair amount of shader tweaks, plus the aforementioned NEON optimization for voxel lighting code to make it practical to run on a mobile device.&lt;/p&gt;

&lt;p&gt;This change helped future work - after this landed we never used FFP on mobile again, always using shaders to render content, which meant that we wouldn’t need to support ES1 for any technology upgrades - which, as it turned out, were waiting around the corner.&lt;/p&gt;

&lt;h1 id=&quot;may-2013-remove-legacy-part-rendering-code&quot;&gt;May 2013: Remove legacy part rendering code&lt;/h1&gt;

&lt;p&gt;The development of the Roblox engine usually follows a tic-toc-tac pattern (okay, we don’t actually have a name for this, but whatever). First we make a new awesome implementation of a subsystem that was in need of being replaced. Then we work on making that new implementation better. Then we remove the legacy system to simplify maintenance. By this point we’d switched all parts to render in the new cluster rendering path and the old code was ready to be removed. The commit says “removes 500 kb of C++ code, 400 kb of shader/material code, and 3 Mb of content/ textures. Also removes 17 rendering fast flags and 5 rendering log groups.”, which felt pretty good at the time!&lt;/p&gt;

&lt;h1 id=&quot;june-2013-thumbnail-rendering-optimizations&quot;&gt;June 2013: Thumbnail rendering optimizations&lt;/h1&gt;

&lt;p&gt;The way we render thumbnails is with the same rendering code we usually use on the client, but running on our servers with a software renderer. This was inefficient because we used a slow software renderer (single-threaded, without a JIT), and additionally we went through the setup/teardown of the entire engine for every new image; this was reworked to use a faster software renderer (which was mostly hard because building open-source software on Windows is a pain), and to reuse the same engine context for many thumbnails, which allowed us to dramatically cut the number of servers we used. It’s comparatively rare that the work I do can be measured in money saved for the company, so this felt good.&lt;/p&gt;

&lt;h1 id=&quot;july-2013-new-materials-for-real&quot;&gt;July 2013: New materials for real&lt;/h1&gt;

&lt;p&gt;… I guess in March I just did some preparatory work, and it’s in June that we actually started working on new shaders and new textures. There was a lot of back-and-forth with our artist to make sure the art could look good in-game but also not create too many issues for games using built-in materials completely counter to their intended purpose (e.g. using Sand colored in blue as water).&lt;/p&gt;

&lt;p&gt;A big portion of work here though was a new UV layout for our terrain materials - we used a voxel based terrain that had a few distinct block types (block, wedges, corner wedges), and together with artists we came up with a new UV layout that was more uniformly distributing pixels in the texture to get good density everywhere.&lt;/p&gt;

&lt;h1 id=&quot;july-2013-point-light-shadows&quot;&gt;July 2013: Point light shadows&lt;/h1&gt;

&lt;p&gt;I remember thinking about voxel lighting more and at some point realizing: well, we can do directional shadows from the sun, we can do point lights - why can’t we do both at the same time, so that every single light source can cast shadows? It turned out that the approach we used for the directional shadows could be adapted to work for point lights, and with some optimizations and tweaks the first version of the voxel lighting engine was finally complete. This would survive up until Future Is Bright Phase 1, which would ship at the end of 2018. This was finalized in September 2013 and optimized with SIMD:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Change 37700: SIMD shadow update now works, but I have no idea why.
Change 37701: SIMD shadow update works, and now I know why :) Still need more optimization
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I love writing commit messages that straddle the border between “professional” and “fun”.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=e2QpoVnx-y8&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/e2QpoVnx-y8/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;august-2013-new-text-renderer&quot;&gt;August 2013: New text renderer&lt;/h1&gt;

&lt;p&gt;Since all of the easy problems, such as part rendering and lighting, were solved, it was time to face the final rendering challenge: text.&lt;/p&gt;

&lt;p&gt;Back then we used a prebaked bitmap (actually two) at two different sizes, and very poorly written layout code that didn’t support kerning and didn’t handle spacing well. Instead I wrote an F# script (of course!) that baked lots of different sizes of a single font into a large atlas; to conserve texture space, I used a rectangle packer. At runtime the layout algorithm used kerning data to place glyphs at correct locations. This substantially improved text quality at the most frequently used sizes, and would last for a few years until internationalization became a priority and we had to start rendering the font atlas dynamically from TTF sources. The layout algorithm would survive for a few more years until we integrated HarfBuzz to do complex Unicode-aware shaping - both of these were done by other people years later.&lt;/p&gt;
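&lt;p&gt;The rectangle packing part is worth a sketch. Even the simplest “shelf” packer - sort glyphs tallest-first, fill rows left to right, start a new row when the current one is full - gets most of the win for glyph-sized rectangles (this is an illustration, not the actual F# tool):&lt;/p&gt;

```python
# Sketch: "shelf" rectangle packing for a glyph atlas. Tallest-first order
# keeps every shelf densely filled with similarly sized rectangles.
def pack_shelves(sizes, atlas_width):
    # sizes: list of (w, h) rectangles; returns ((x, y) per rect, height used).
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][1], reverse=True)
    placements = [None] * len(sizes)
    x = y = shelf_height = 0
    for i in order:
        w, h = sizes[i]
        if x + w > atlas_width:   # shelf full: open a new one below
            y += shelf_height
            x = shelf_height = 0
        placements[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return placements, y + shelf_height
```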

&lt;h1 id=&quot;september-2013-remote-events&quot;&gt;September 2013: Remote events&lt;/h1&gt;

&lt;p&gt;Continuing the trend of increasing the scope of my work beyond just rendering, I worked on the design and implementation of the new remote events API, including fire-and-forget events and remote function calls (which are super nifty in Lua - because of coroutine support, you can just call a function, the call will be routed to the server, the server will run the code and return the result, and your coroutine will continue running, almost oblivious to the time spent!). It was very hard to find good names for the APIs involved; we haven’t changed any of this since, and I still sometimes struggle with what the correct function/event name is.&lt;/p&gt;
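&lt;p&gt;The coroutine trick can be sketched with Python generators standing in for Lua coroutines (all names are made up): the script “calls the server”, the scheduler suspends it, and later resumes it with the reply, so the call site reads like a plain blocking function call.&lt;/p&gt;

```python
# Sketch: a remote function call that transparently suspends the caller.
# A generator stands in for a Lua coroutine; names are hypothetical.
def handle_on_server(request):
    # Pretend server-side handler for the remote function.
    return request * 2

def game_script():
    # From the script's point of view this is an ordinary call:
    # it yields to the scheduler and resumes once the reply arrives.
    result = yield ("InvokeServer", 21)
    yield ("Done", result)

def run(script):
    co = script()
    kind, payload = next(co)       # run until the script "calls the server"
    while kind == "InvokeServer":
        reply = handle_on_server(payload)   # network round-trip elided
        kind, payload = co.send(reply)      # resume the coroutine with result
    return payload
```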

&lt;h1 id=&quot;october-2013-infinite-terrain&quot;&gt;October 2013: Infinite terrain&lt;/h1&gt;

&lt;p&gt;Voxel terrain wasn’t very popular among our game developers. For a feature that took a lot of effort to develop and maintain this was unsatisfying, and I was trying to figure out why. One hypothesis was that the limited size (512x64x512 voxels, I want to say?) was too restrictive; to remedy that I worked on a new sparse voxel storage format and different replication mechanisms to allow arbitrarily large terrains. This took around a month to implement fully. This code no longer exists because I ended up throwing the old terrain out completely later - this is probably my largest body of work that just doesn’t exist in Roblox today, although if I hadn’t worked on it, smooth terrain would likely have taken longer and turned out worse, because many ideas from it translated well.&lt;/p&gt;

&lt;p&gt;During this work I also added TerrainRegion (a standalone object that could store voxel data) together with APIs for copy &amp;amp; paste - this part hasn’t changed since and is still available and useful.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_5.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;november-2013-implement-renderstepped-callback&quot;&gt;November 2013: Implement RenderStepped callback&lt;/h1&gt;

&lt;p&gt;I don’t really remember much about this - I don’t think we had an API proposal process at the time, so I think this just came up and we just did it. But this is a pretty significant event in retrospect, because it started the general trend of giving Lua scripts much more agency; without this we could not have implemented cameras in Lua, for example, and some games today - for better or for worse - run a lot of their game loop in RenderStepped for minimal input latency. At this point we already used a two-thread parallelization strategy where the input-&amp;gt;gameplay-&amp;gt;render cycle normally spans 2 frames on the CPU, but by using RenderStepped you can cut that short and get minimal latency for user interaction at some throughput cost.&lt;/p&gt;

&lt;h1 id=&quot;december-2013-ui-rendering-optimizations&quot;&gt;December 2013: UI rendering optimizations&lt;/h1&gt;

&lt;p&gt;Somehow all games or engines I have ever worked with end up with UI consuming a disproportionate amount of frame time, and ours is no exception :( Over the years I’ve had several points at which I snapped and committed a dozen performance improvements to make things faster; this is one of them. Each time the changes weren’t necessarily transformative or really complex, but they delivered solid performance gains by shaving ~10% or more at a time.&lt;/p&gt;

&lt;h1 id=&quot;december-2013-smooth-terrain-hack-week&quot;&gt;December 2013: Smooth terrain hack week&lt;/h1&gt;

&lt;p&gt;This was my first hack week - the 2012 hack week was held in July right before I joined, although I doubt I would have achieved much then as I was very unfamiliar with the codebase. By this point I knew everything there was to know about our rendering system, OGRE, and terrain. I combined this knowledge to implement a prototype for new terrain that used Marching Cubes meshing of a voxel distance field instead of the blocky terrain we had, with support for geometric LOD (with gaps between LODs, because hack week), and also added water with refraction and realtime reflections, and geometric grass. Over the next few years I ended up shipping most of this, although the implementation was dramatically different (e.g. we don’t use marching cubes), and finally in 2019 somebody else shipped geometric grass (again with a very different implementation), marking my 2013 hack week fully live :D&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=0jahn82XNAI&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/0jahn82XNAI/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;january-2014-mobile-performance&quot;&gt;January 2014: Mobile performance&lt;/h1&gt;

&lt;p&gt;A continuing drive to get our rendering code faster and faster invited more optimization. I remember us using two internal games for profiling: one was made by an internal games team (we used to make games ourselves back then! We quickly realized that we’re a platform and that our community can make games better than we can, though), and another by John Shedletsky, a name all Roblox readers would recognize. We were trying to get them to run at a stable 30 FPS on iPad 2, which was challenging. A lot of small performance tweaks went in here, but I was starting to become really frustrated with the amount of time we lost in OGRE and the OpenGL driver. I was pretty sure a lot of the OpenGL time was lost not just because the driver wasn’t very fast (that problem would have to wait until the introduction of Metal), but also because OGRE’s GLES backend was very inefficient. We could have tried to optimize OGRE, but that codebase was so large and unwieldy to work with that a question had to be asked: do we need it?&lt;/p&gt;

&lt;p&gt;So I spent one day on an experiment: I set up an alternative GL-focused rendering path alongside OGRE. This took just a day because I focused on getting only part rendering to work, and on converting just the actual render loop away from OGRE - using OGRE scaffolding to manage resources, and then getting OpenGL resource ids out into our own code. There were no special optimizations after that; I just wrote code that I thought was very simple and minimal - just do the state setup that needs to be done, in the simplest way. The result: a portion of the rendering frame that took 13 ms in OGRE took just 3 ms in our renderer.&lt;/p&gt;

&lt;p&gt;This made the decision of what to do next obvious.&lt;/p&gt;

&lt;h1 id=&quot;february-april-2014-new-rendering-engine&quot;&gt;February-April 2014: New rendering engine&lt;/h1&gt;

&lt;p&gt;We decided to completely remove OGRE as a dependency in favor of our own rendering engine. To simplify matters a bit, we decided on just two rendering backends: Direct3D 9 (with FFP support) and a combined OpenGL backend with desktop and mobile “modes” to support macOS and iOS, without FFP support.&lt;/p&gt;

&lt;p&gt;The fact that we already used OGRE in the most minimal way possible made things easier - we didn’t need to port the animation system, all the shaders we used were our own, etc. The only significant high level component we used from OGRE was the particle system, and we had an engineer start on redoing that - I focused on everything else, including defining a new graphics API abstraction layer, implementing Direct3D9 and OpenGL backends for that, working on a basic material system and render op list, etc.&lt;/p&gt;

&lt;p&gt;This had to be done side by side with the old engine, so we copied the high level code that used OGRE and started reworking that. A big pain point when working with OGRE was getting access to any hardware functionality (I don’t remember the details too well, but one thing I remember is render targets not being structured very well), so I spent a bit of time thinking about the graphics abstraction - but not too much, as we could iterate on that in the future (and we did!). A big focus was on usability (it had to be reasonably easy to use from our high level rendering code) and leanness (for performance and sanity reasons we wanted the concepts in the abstraction to map mostly 1 to 1 to the underlying implementations).&lt;/p&gt;

&lt;p&gt;Because mobile was a focus, we ended up inheriting some concepts from OpenGL (like Framebuffer, ShaderProgram, Geometry, uniform handles, etc.). Some of these survived to this day and continue being useful for other APIs; some parts of the abstraction saw major changes, for example we fully transitioned to uniform buffer style interface after a Direct3D 11 port, and render pass based interface during a Metal port. The abstraction continues to slowly evolve over time, and one part I’m super excited about is that since the initial release, the abstraction actually became leaner and more straightforward to implement (for example, we used to have a distinction between Renderbuffers and Textures, and now we just have Textures).&lt;/p&gt;

&lt;p&gt;This was the right time to do this change. We had already implemented all the critical high level rendering components by that point, so we knew exactly what to focus on - and this was back when we only had two rendering engineers, myself included, so we didn’t have to stall major projects to rebuild the foundation of the engine.&lt;/p&gt;

&lt;p&gt;The new code took around 2.5 months to complete and ship; the results were fantastic - much lower CPU overhead, a much simpler code base, much faster builds, and a smaller distribution size - it was a massive win along every single possible axis. Most of that code still exists and is in active use today; some parts had to be expanded or improved as we gained more graphics APIs - we went from supporting just two APIs to supporting five.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_6.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The epic for the change was named&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;US22804: Do you know how to K.I.L.L. an OGRE?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This, in addition to being an obvious pun on “ogre”, is a reference to Death Note, which I watched for the first time around that time and have rewatched many times since.&lt;/p&gt;

&lt;h1 id=&quot;may-2014-lua-security-fixes&quot;&gt;May 2014: Lua security fixes&lt;/h1&gt;

&lt;p&gt;During this period I submitted an unusual number of various security mitigations for different exploits, so while I don’t remember the details, it must have become a focus.&lt;/p&gt;

&lt;p&gt;In addition to that, in May and June I started helping with our Android port - something that was made &lt;em&gt;much&lt;/em&gt; easier by our new rendering engine, as OGRE support for EGL was incomplete at the time.&lt;/p&gt;

&lt;h1 id=&quot;may-2014-data-persistence-data-loss&quot;&gt;May 2014: Data persistence data loss&lt;/h1&gt;

&lt;p&gt;Something that I had completely forgotten about, but was reminded of by &lt;a href=&quot;https://roblox.fandom.com/wiki/Data_persistence&quot;&gt;a page on Roblox trivia wiki&lt;/a&gt;, is that I was responsible for a &lt;a href=&quot;https://blog.roblox.com/2014/05/yesterdays-data-persistence-error-an-explanation/&quot;&gt;data loss in our data persistence system&lt;/a&gt;. The core issue was caused by an innocuous code change that refactored some XML serialization logic in an effort to make sure that all callsites correctly used the binary file format for saving when requested. We had code that could serialize Roblox instances as part of a larger XML container, used for web API serialization - I mistakenly assumed this was part of web API support code that we didn’t need anymore.&lt;/p&gt;

&lt;p&gt;Unfortunately, this was actually important for games that used our data persistence system, which was in the process of being replaced by newer data stores. This only affected games that stored instance data (thankfully, most games used exclusively primitive types); unfortunately, we didn’t have unit tests for the system, and our manual regression test only used primitive types. Additionally, manual tests on our pre-production test environments that relied on backend data were often flaky due to environment stability issues (something we’ve largely solved in 2020 by switching to testing using production infrastructure and pre-release client/server builds instead of relying on separate environments).&lt;/p&gt;

&lt;p&gt;What made the issue damaging is that the legacy system in question treated all errors during data loading as “data is absent, start from scratch”. The web endpoint returned a status code 404 for non-existent data, which resulted in an exception propagated through the client-side code; instead of special-casing that error code and disabling data saving on any other failure, the code assumed any error was a 404 and joyfully proceeded to start with an empty data blob, saving it when the player quit the game.&lt;/p&gt;
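&lt;p&gt;The fix boils down to distinguishing “data does not exist” from “the request failed”. A minimal sketch of the safe behavior (status codes and names are illustrative, not the actual Roblox code):&lt;/p&gt;

```python
# Sketch: only a 404 means "no saved data exists"; any other failure must
# disable saving so an empty blob never overwrites real data.
class SaveDisabled(Exception):
    pass

def load_player_data(status, body):
    # status/body stand in for the data persistence endpoint's response.
    if status == 200:
        return body      # existing data loaded successfully
    if status == 404:
        return {}        # genuinely absent: safe to start from scratch
    # 5xx, timeouts, etc.: refuse to save for the rest of the session
    raise SaveDisabled("load failed with status %d" % status)
```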

&lt;p&gt;This was further aggravated by the release timing - we used to release the client &amp;amp; server at 9 PM; any issue discovered immediately after the release would lead to a rollback, but this issue was only discovered a few hours after the release by a developer who reported it to us - at which point everybody was sound asleep, and all players who played affected games that night lost their game data irrecoverably (as the system in question also had no backups). This also isn’t a problem in 2020, as we switched to a release process that’s safe to do at any time of day, and we now release in the morning so we can react immediately to any issue discovered hours after the release.&lt;/p&gt;

&lt;p&gt;Needless to say I’ve learned a few things about refactoring legacy code, testing and deployment processes, etc. I &lt;em&gt;think&lt;/em&gt; this is the only destructive or negative thing I’ve done during my time at Roblox, and it felt &lt;em&gt;terrible&lt;/em&gt; at the time. Time heals all wounds though!&lt;/p&gt;

&lt;h1 id=&quot;june-2014-lua-linter&quot;&gt;June 2014: Lua linter&lt;/h1&gt;

&lt;p&gt;I’m not sure how this came about, but I think I was just thinking about how we could make it easier for people to write correct Lua code, and an obvious gap was the lack of any sort of static analysis / linting. luacheck existed at the time, but it wasn’t very fast, and I thought we needed a tool written in C++ for this.&lt;/p&gt;

&lt;p&gt;Amazingly, it looks like the first change for this tried to do static analysis on compiled Lua bytecode - a fact that I had completely forgotten until today - but I quickly changed gears and started working on a Lua-&amp;gt;AST parser. I had written many parsers before that, some of them &lt;em&gt;very&lt;/em&gt; high performance, so I knew what I was doing pretty well.&lt;/p&gt;
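&lt;p&gt;For readers who haven’t written one: parsers like this are usually structured as recursive descent, with one function per precedence level. A toy sketch (nothing to do with the actual Lua grammar or the parser I wrote) for sums and products:&lt;/p&gt;

```cpp
#include <cctype>
#include <string>

// Toy recursive-descent expression parser/evaluator (illustrative only; a
// real parser produces an AST, handles a full grammar, and reports errors).
struct Parser {
    const char* p;
    long primary() {
        if (*p == '(') { ++p; long v = expr(); ++p; /* skip ')' */ return v; }
        long v = 0;
        while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
        return v;
    }
    long term() {          // higher-precedence '*'
        long v = primary();
        while (*p == '*') { ++p; v *= primary(); }
        return v;
    }
    long expr() {          // lower-precedence '+'
        long v = term();
        while (*p == '+') { ++p; v += term(); }
        return v;
    }
};

long evalExpr(const std::string& s) { Parser ps{s.c_str()}; return ps.expr(); }
```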

&lt;p&gt;After implementing the parser, I implemented a few static analysis rules based on the AST; some of these later shipped as the Script Analysis Widget (which required a fair amount of Qt work that a long-time Studio engineer helped me with), while others, such as a data flow analysis, were experimental, incomplete, and never went live.&lt;/p&gt;

&lt;p&gt;This work would prove to be very important 5 years later, as you’ll learn if you’re still reading this and if you’re going to get through the rest of this post :D&lt;/p&gt;

&lt;h1 id=&quot;july-august-2014-lua-sandboxing&quot;&gt;July-August 2014: Lua sandboxing&lt;/h1&gt;

&lt;p&gt;Exploits were still rampant on the platform. One thing worth emphasizing here for people unfamiliar with Roblox is that not only did we have a full Lua interpreter based on Lua 5.1 (so pretty easy to reverse engineer, as that is open source), but we also had a client-trusting networking model that was used by all games at the time. At the beginning of 2014 we introduced a new networking model, which eventually (in 2018) became the only one.&lt;/p&gt;

&lt;p&gt;People who are networking experts might scoff at this point. “Client authority is such an obvious mistake, what were they thinking?” To which I would say that it’s my firm belief that had Roblox started with a server-authoritative model, it’s very possible the company would not exist today - it’s hard to develop multiplayer games, and client authority makes a lot of responsive gameplay very easy to write. Of course it’s also very exploitable, which is why we ultimately got rid of the old model - but by that time we already had lots of developers with years of experience, and a much better understanding of how to make the platform accessible despite the replication barrier.&lt;/p&gt;

&lt;p&gt;Anyhow, the dominant attack vector in 2014 was to find a spot where the script source, replicated from the server, reached the client, and replace it with a malicious script - which would get access to all our APIs and, through client authority, allow arbitrary changes to the world state.&lt;/p&gt;

&lt;p&gt;With the secure replication mode out but not yet used by the majority of existing games, we had to find other ways to block these attacks, and we were tired of playing the cat &amp;amp; mouse game. To that end I reworked the Lua VM to split out the compiler, removing it from the client completely (and switching replication to bytecode), changing the bytecode format to differ from the stock Lua VM, and obfuscating various VM-internal data structures. This was hard to deploy - for one, it breaks network replication - and we wanted to do it very quickly; it was also incompatible with things like dynamic script evaluation, and even the process of starting a Roblox game involved downloading a Lua script from a web endpoint and running it! I focused on the core VM portions of this change and had another engineer help with the other bits, and we ultimately shipped it in around two months.&lt;/p&gt;

&lt;p&gt;I remember reading the exploiter forums the night of the release, and seeing a thread to the effect of “all exploits no longer work”, and one of the exploit authors replying something to the effect of “oh god, this will take a while to get around”. Of course exploiters always catch up, and they did - months and months later (and we were able to continue playing the cat &amp;amp; mouse game for a while longer, until the replication mode became the de-facto standard).&lt;/p&gt;

&lt;h1 id=&quot;september-november-2014-smooth-terrain&quot;&gt;September-November 2014: Smooth terrain&lt;/h1&gt;

&lt;p&gt;With Lua work out of the way, I went back to the hack week project from 2013. We had had conversations earlier in the year and all agreed that we needed a new terrain system - the old blocky one continued to not be very popular, and we just didn’t believe the features it provided were that interesting. With an existence proof from hack week, it was now time to figure out what to do.&lt;/p&gt;

&lt;p&gt;I think this work started a bit earlier in the year in a separate prototyping framework where I was able to quickly experiment with voxel representation etc, but it was time to figure out how to ship this.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/roblox_7.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In September I did most of the basic rendering work, and then started to focus on other aspects. This was the first large cross-functional project I took on at Roblox - except for the terrain tools, which were written by stickmasterluke, I did all of the implementation work here.&lt;/p&gt;

&lt;p&gt;In October I worked on physics - this was my chance to go back to some physics programming (I had done physics work at prior jobs, including a custom Box2D SPU port with many algorithms optimized or rewritten, but never at Roblox). This required using parts of the Bullet collision pipeline, which we had integrated earlier for CSG.&lt;/p&gt;

&lt;p&gt;In November I worked on a few more rendering pieces and the undo stack, and started working on replication support. Again, a lot of this work was easier to do since it mirrored the infinite terrain work from 2013.&lt;/p&gt;

&lt;h1 id=&quot;december-2014-hack-week&quot;&gt;December 2014: Hack week&lt;/h1&gt;

&lt;p&gt;I don’t remember my exact train of thought here, but I think I just accumulated a bunch of rendering ideas that I wanted to try; unlike my last hack week this one didn’t really have a specific focus, and I decided to just implement as many small ideas as I could possibly fit in a week.&lt;/p&gt;

&lt;p&gt;A great side effect of this is that you’re always ready to present. This is a big challenge in hack week - how do you time things so that, after a very intensive week of work, you have code that works well enough to show a demo? This being hack week, the code doesn’t have to work perfectly - in fact, you want to minimize the amount of “polish” work so that you can maximize the “oomph” and deliver more ambitious projects - but what if you don’t make it? The way I solved this problem in 2014 was by cramming a bunch of small projects into one; every day I’d start with the goal of finishing one aspect, and if I got there early - great! Just start the next one early.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=Y9-KDzMasjg&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/Y9-KDzMasjg/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What ended up in the hack week presentation is area light support for the voxel engine (later shipped as SurfaceLights), encoding light direction into voxel grid for per-pixel lighting (never shipped, but incorporated into the next hack week), soft particles (shipped later), particle lighting (this was done in a very brute-force way in this demo; I implemented it in a better way in Future Is Bright hack week, and we shipped that implementation later), HDR rendering with a very poorly tuned tone mapper (we didn’t use any of this code but we did end up implementing HDR rendering as part of Future Is Bright), shadow mapping with support for translucency and colored objects based on exponential shadow maps (we didn’t ship this exact code but this will show up later in the timeline), and volumetric lighting support using the shadow maps (never shipped this either).&lt;/p&gt;

&lt;p&gt;This hack week ended up being rich on small practical ideas - except for volumetric lighting and colored translucency, we ended up shipping all of these in one form or another over the next few years.&lt;/p&gt;

&lt;h1 id=&quot;january-2015-smooth-terrain-tools--api&quot;&gt;January 2015: Smooth terrain tools &amp;amp; API&lt;/h1&gt;

&lt;p&gt;With hack week over, it was time to continue working on smooth terrain and bring it closer to completion. Here we tried to figure out how we should build the tools. Traditionally you’d expect the tools to be implemented in C++, but we wanted to give our plugin creators a lot of power.&lt;/p&gt;

&lt;p&gt;So we decided to implement a fast Lua API for voxel reads and writes (bypassing our reflection layer for performance), and build the tools in Lua on top of it. This ended up being a great decision, as we were able to iterate on the tools quickly and empower the community to create their own (performance, of course, suffered as a result - something we’re trying to fully recover from to this day).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=FqU6HbFrV-Y&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/FqU6HbFrV-Y/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;february-june-2015-smooth-terrain-productizaton&quot;&gt;February-June 2015: Smooth terrain productization&lt;/h1&gt;

&lt;p&gt;At this point all the pieces were there - I had rendering, physics and replication working, and an API to build tools with. The tools weren’t ready yet, but neither were any of the pieces production quality - a lot of cleanup and optimization work remained.&lt;/p&gt;

&lt;p&gt;During these months I polished the code and made it faster. This involved writing custom triangle mesh colliders instead of using Bullet code (using a fast, cache-coherent KD-tree that was much faster to build and faster to query than Bullet’s BVH), improving rendering code performance and memory consumption, improving the terrain broadphase, making old APIs work with smooth terrain, etc., etc.&lt;/p&gt;
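&lt;p&gt;To give a flavor of what “cache coherent” means here (the layout I actually used isn’t public, so this is an assumption-laden sketch): KD-tree nodes can be packed into 8 bytes each, with both children stored contiguously so a single index suffices and traversal touches few cache lines:&lt;/p&gt;

```cpp
#include <cstdint>

// Sketch of a compact KD-tree node: 8 bytes, eight nodes per 64-byte cache
// line. Interior nodes store the split plane; leaves store a triangle range
// start. The low 2 bits of `packed` hold the split axis, or 3 for a leaf.
struct KDNode {
    float    split;     // split position (interior); unused for leaves
    uint32_t packed;    // bits 0-1: axis 0/1/2, or 3 = leaf
                        // bits 2-31: first child index / first triangle index

    void makeInterior(int axis, uint32_t firstChild, float pos) {
        split = pos; packed = (uint32_t)axis | (firstChild << 2);
    }
    void makeLeaf(uint32_t firstTri) { packed = 3u | (firstTri << 2); }

    bool     isLeaf() const { return (packed & 3u) == 3u; }
    int      axis()   const { return int(packed & 3u); }
    uint32_t index()  const { return packed >> 2; }
};
static_assert(sizeof(KDNode) == 8, "eight nodes fit in one cache line");
```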

&lt;p&gt;A lot of work goes into a high quality feature, and a lot of work can follow the first version - during this time I also prototyped geometric LOD support but it wasn’t ready for the release so we shipped without it.&lt;/p&gt;

&lt;p&gt;During this time we also started replacing the old temporary art with new art, which required making some rendering changes to improve the look based on art feedback as well.&lt;/p&gt;

&lt;p&gt;Overall I really loved working on smooth terrain. When a single feature touches so many areas, you get a chance to implement a lot of different things, get familiar with the dark corners of most of the codebase, and improve a lot of code everywhere as a result. Smooth terrain also required innovation in the algorithms as well as a lot of attention to performance detail to be practical, &lt;em&gt;and&lt;/em&gt; resulted in an entirely new building primitive becoming available to Roblox developers. Lots of people loved the result, and as we continued working on performance (it was just me for the first couple of years, but a few people worked on impactful performance changes for terrain in 2019 and 2020 as well, sometimes coming up with much better solutions than whatever I implemented in 2015), it quickly became a great way to build large worlds in Roblox. Of course the sky (or, in this case, the horizon) is the limit, and we need more and more improvements here.&lt;/p&gt;

&lt;p&gt;I remember the week we shipped smooth terrain, and in the same week one of the biggest games at the time, The Quarry, switched to it - which was scary! I don’t think we were quite ready, and the feedback from players told us that the tech wasn’t quite optimized enough for big maps, but it was fun nonetheless.&lt;/p&gt;

&lt;h1 id=&quot;july-2015-microprofiler&quot;&gt;July 2015: microprofiler&lt;/h1&gt;

&lt;p&gt;I don’t really remember exactly what prompted this, but I decided that the set of profiling tools we were using wasn’t adequate for a lot of the work we needed to do. Instead of reinventing the wheel completely as I did in 2012, I decided to integrate the microprofile library by Jonas Meyer. I think I picked it over Remotery, the other open-source library available at the time, because Remotery required a web browser to work and I wanted something with on-screen visualization.&lt;/p&gt;

&lt;p&gt;However, I wanted something that we could ship in the production client. Up until then if you had a performance problem in a non-development build the only hope of understanding what was wrong was to reproduce it on a local build and use platform-specific tools to profile. I wanted something where in any production build you could hit one button and instantly see the profiler you could interact with.&lt;/p&gt;

&lt;p&gt;Getting there required a lot of fixes to make the code robust, to make the profiler cheap to compile in without keeping it active all the time, and to add some features I felt were lacking. All of this work was done in the open on GitHub (&lt;a href=&quot;https://github.com/zeux/microprofile&quot;&gt;zeux/microprofile&lt;/a&gt;), but unfortunately the upstream repository at the time was hosted on Bitbucket and used Mercurial, which made pull requests impractical, and over time the fork diverged enough that it was too hard to merge it back. Ah well.&lt;/p&gt;

&lt;p&gt;This ended up being possibly the single biggest thing I’ve done to improve internal engineering at Roblox. The culture of having the profiler at your fingertips, with the same tool available on all platforms so that it’s always easy to check; the possibility of identifying performance problems that are infrequent in nature; the fact that the same tool is available to our developer community, so that when they report performance bugs it’s now possible to get a profiler dump from them - all of these things made it much easier to talk about performance and to work on performance as a company. We also ended up shipping microprofile support on the server (available to game developers, so that they can profile live servers with the same tool!) and on mobile (again, used by us internally and by game developers), and we now have internal infrastructure to capture long running sessions and gather statistical information with the ability to drill into individual problematic instances.&lt;/p&gt;

&lt;h1 id=&quot;august-2015-cylinders&quot;&gt;August 2015: Cylinders&lt;/h1&gt;

&lt;p&gt;Somehow, at this point Roblox had existed for a decade with support for the cylinder as a basic primitive type, but without physics support for cylinders (they were “approximated” by a ball). During the smooth terrain work I became familiar with our collision pipeline, and it seemed straightforward to add cylinders using Bullet’s cylinder support - so I did!&lt;/p&gt;

&lt;p&gt;This ended up causing a fair amount of trouble for our physics engineers, who were later forced to fix several issues in the Bullet integration (which is good, I guess) that became more prominent with quickly rotating cylinders, fix a few numerical instabilities in Bullet’s GJK that were important for building cars with quickly rotating wheels, and reimplement ray casts, which were very imprecise in Bullet as well.&lt;/p&gt;

&lt;p&gt;Sorry about that, folks. But hey, at least we have cylinders now!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=scRc7fXMTKU&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/scRc7fXMTKU/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;september-2015-character-shadows&quot;&gt;September 2015: Character shadows&lt;/h1&gt;

&lt;p&gt;Since before I joined and up until 2014 (or so?), we used stencil shadows for characters. Voxels weren’t small enough to represent a character with enough precision, but stencil shadows were costly, painful to maintain, resulted in double-shadowing artifacts, and had a very different look from the voxel shadows of the environment. We had shipped blob shadows earlier - they solved most of these problems but were too coarse and didn’t look as good.&lt;/p&gt;

&lt;p&gt;After my 2014 hack week I tried to extend the exponential shadow map implementation to be practical, but it was hard to make it work with only characters rendering into the shadow map, and I didn’t think we were quite ready for a full scene shadow map implementation. After several failed experiments I settled on a solution I liked, inspired by the static geometry shadows from the classic “Perspective Shadow Maps: Care and Feeding” article (written by the same friend who got me into Roblox - hi, Simon!), but incorporating an extra depth channel into the shadow map so that surfaces that fail the depth test can reject the shadow.&lt;/p&gt;

&lt;p&gt;The trick is to have two-channel shadow maps where one channel stores depth and another stores the shadow intensity; then you blur the intensity and dilate the depth information, which allows you to render soft shadows very quickly, as long as self-shadowing doesn’t matter.&lt;/p&gt;

&lt;p&gt;I later learned that the same technique is presented in GPU Pro 3 as “Depth Rejected Gobo Shadows”.&lt;/p&gt;
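&lt;p&gt;A sketch of the lookup as I’ve described it above (details such as the bias are assumptions, and the real version runs in a shader): the map stores caster depth alongside the pre-blurred intensity, and a receiver that is closer to the light than the caster rejects the shadow instead of darkening itself:&lt;/p&gt;

```cpp
#include <algorithm>

// Two-channel shadow map sample: blurring the intensity gives cheap soft
// shadows; the dilated depth channel lets surfaces in front of the caster
// reject the blurred shadow (no self-shadowing support, which is fine here).
struct ShadowSample { float casterDepth; float intensity; };

float shadowTerm(const ShadowSample& s, float receiverDepth, float bias = 0.01f)
{
    // Receiver in front of the caster: fails the test, reject the shadow.
    if (receiverDepth <= s.casterDepth + bias) return 1.0f;
    // Behind the caster: attenuate by the pre-blurred (soft) intensity.
    return 1.0f - std::clamp(s.intensity, 0.0f, 1.0f);
}
```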

&lt;p&gt;Incidentally the doge meme was popular in Roblox at the time, so the ticket is appropriately named “US30615: Character Shadows Much Improved Such Wow”.&lt;/p&gt;

&lt;p&gt;This shadowing technique remains in Roblox to this day, although it’s been largely superseded by the new exponential variance shadow map implementation that I contributed to by convincing the fantastic engineer who worked on this to try to make it work :D&lt;/p&gt;

&lt;h1 id=&quot;october-2015-various-optimizations&quot;&gt;October 2015: Various optimizations&lt;/h1&gt;

&lt;p&gt;I think at this point I was a bit tired of a few giant projects completed earlier in the year, and I just worked on a bunch of different small optimizations.&lt;/p&gt;

&lt;p&gt;I find that this in general is a great way to spend the time between projects - while you wait for something big to ship, a fantastic way to deliver value is to open a profiler, look at a few captures, and try to make them faster by cleaning code up or using slightly more efficient constructs in the places that matter. In an engine such as ours, there are so many games stressing parts of the engine in so many different ways that all of this work pays off eventually.&lt;/p&gt;

&lt;h1 id=&quot;november-2015-opengl-es-3&quot;&gt;November 2015: OpenGL (ES) 3&lt;/h1&gt;

&lt;p&gt;I really don’t remember what motivated this. But up until this point we had used GL2 on macOS and GLES2 on iOS/Android. There was some important reason for adapting our code to be GL3 compatible, I just don’t remember what it was :)&lt;/p&gt;

&lt;p&gt;This required some shader compiler tweaks and some code tweaks but ultimately wasn’t too bad. When I implemented the original GL backend I made the decision to have just one backend for all GL versions, and I haven’t regretted it since (this was a direct response to OGRE having a separate backend for GLES, which created many more problems than it solved).&lt;/p&gt;

&lt;p&gt;Of course the hard part of this change came later. In December I had to work around a host of compatibility issues on Android and macOS, where older drivers didn’t necessarily implement GL/GLES3 correctly, requiring us to detect these and fall back to GL2/GLES2.&lt;/p&gt;

&lt;h1 id=&quot;november-2015-terrain-memory-optimization&quot;&gt;November 2015: Terrain memory optimization&lt;/h1&gt;

&lt;p&gt;This was always known to become necessary at some point, but we shipped the first version without it. During some memory analysis it turned out that smooth terrain was much more memory hungry than the old blocky terrain. The ultimate solution was going to be level of detail, but it never hurts to make things more efficient - I switched to a carefully packed vertex format to reduce memory use, getting the vertex down to ~20 bytes. In addition, a bug in the old code generated ~10% extra vertices on the borders of chunks that were simply never used, so that was one more easy win.&lt;/p&gt;
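&lt;p&gt;The exact layout we shipped isn’t something I can reproduce here, but a hypothetical 20-byte terrain vertex shows the kind of quantization involved - 16-bit chunk-relative positions and byte-sized normal and material blend data:&lt;/p&gt;

```cpp
#include <cstdint>

// Hypothetical packed terrain vertex, illustrating how ~20 bytes is reached.
// Positions are quantized to 16 bits per axis relative to the chunk origin;
// normals and texture blend data fit in bytes. A real layout would likely
// repurpose the padding field for something useful.
struct PackedTerrainVertex {
    uint16_t px, py, pz;     // position, fixed point within the chunk (6 bytes)
    uint16_t pad;            // alignment filler                       (2 bytes)
    int8_t   nx, ny, nz, nw; // snorm normal + sign bit                (4 bytes)
    uint8_t  material[4];    // material ids for blending              (4 bytes)
    uint8_t  weight[4];      // blend weights                          (4 bytes)
};
static_assert(sizeof(PackedTerrainVertex) == 20, "vertex is 20 bytes");

// Quantize a chunk-relative coordinate in [0, extent] to 16 bits.
inline uint16_t quantize16(float v, float extent)
{
    float t = v / extent;
    if (t < 0) t = 0;
    if (t > 1) t = 1;
    return (uint16_t)(t * 65535.0f + 0.5f);
}
```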

&lt;h1 id=&quot;december-2015-hack-week&quot;&gt;December 2015: Hack week&lt;/h1&gt;

&lt;p&gt;So this year, unlike last year but like the year before that, I had a theme. I knew what I wanted to do.&lt;/p&gt;

&lt;p&gt;When we worked on the first version of the voxel lighting engine, we actually did a quick test to see what would happen if voxels were 1x1x1 stud. And the results looked fantastic. In the previous hack week I had also learned that we really needed HDR, and that adding a lighting direction to each voxel could make things look nicer.&lt;/p&gt;

&lt;p&gt;I wanted to combine all of that and build a version of our voxel lighting that supported 1x1x1 voxels. Done naively, that requires 64x more memory - so I knew I needed multiple voxel grids, nested in a cascade. Even with that, to maintain realtime updates for the high resolution portion of the grid next to the player, you need way more compute power - so I knew I needed a GPU implementation.&lt;/p&gt;
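&lt;p&gt;The 64x figure follows from scaling all three axes: going from 4x4x4 stud voxels to 1x1x1 multiplies the voxel count by 4^3 = 64. A cascade sidesteps this by keeping a fixed-resolution grid per level, with each level covering a larger extent, so memory grows linearly with the number of levels rather than cubically with resolution. A quick sketch of the arithmetic (invented helper names):&lt;/p&gt;

```cpp
#include <cstdint>

// Voxel count for a single uniform grid covering `extent` studs per axis
// at `voxelSize` studs per voxel: halving voxel size costs 2^3 = 8x.
uint64_t singleGridVoxels(uint32_t extent, uint32_t voxelSize)
{
    uint64_t n = extent / voxelSize;
    return n * n * n;
}

// Clipmap-style cascade: every level has the same gridRes^3 voxels, only the
// extent (and thus voxel size) doubles per level - memory is linear in levels.
uint64_t cascadeVoxels(uint32_t gridRes, uint32_t levels)
{
    return (uint64_t)gridRes * gridRes * gridRes * levels;
}
```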

&lt;p&gt;At this point I had never written a compute shader, so doing all of this in a week was daunting. So - I confess! - I cheated by starting to work on the hack week 3 days early. Hey, don’t judge me - we didn’t even have support for compute shaders at the time!&lt;/p&gt;

&lt;p&gt;What followed was 10 days of probably one of the most intense and fun rendering projects I’ve ever done. It wasn’t just working on lighting - it was building a non-traditional lighting system using the GPU as a general purpose compute unit (and, again, I had never used compute shaders before…). And I had to get to a point where something worked in slightly more than a week.&lt;/p&gt;

&lt;p&gt;I knew what I wanted to accomplish, but I didn’t know all the algorithms involved - I couldn’t simply port the CPU voxel lighting engine, since a lot of that code can’t be parallelized to the point where GPUs can run it performantly.&lt;/p&gt;

&lt;p&gt;So I had to reimagine some of the algorithms and use some technology… creatively… (e.g. the GPU voxelizer used instancing and geometry shaders to run the coverage shader for each voxel of the primitive’s bounding box, and then used max blending - I think? - to aggregate coverage per voxel in a 3D texture). I ultimately got to a demo a day early, so I even had time to implement a really bad voxel GI using something resembling light propagation volumes (… trying to debug SH math in the process).&lt;/p&gt;

&lt;p&gt;My biggest worry was that I would have nothing to show at the end. I didn’t have a working GPU debugger, and this was very unfamiliar territory for me, since I had to quickly get up to speed on what is and isn’t possible in D3D11 compute. I remember being very dismayed at the D3D11 restriction on typed UAV loads, and having to work around it.&lt;/p&gt;

&lt;p&gt;In the end, I did get to a demo, and it was a blast.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=z5TmqDtpwSM&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/z5TmqDtpwSM/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;january-april-2016-we-are-vr&quot;&gt;January-April 2016: We Are VR&lt;/h1&gt;

&lt;p&gt;With the hack week over (but not done, as we’ll see later), I switched to the next big thing. This was the time when the industry was going crazy in a wave of VR. VR was the next big thing, it was happening, and it was happening now. Analysts were projecting explosive growth, and we knew we had to be there.&lt;/p&gt;

&lt;p&gt;Well, we knew it would take years for VR to become pivotal, but we thought that by investing in VR now - by “getting on the map” - we’d secure enough of a foothold to become a strong player. To reduce risk, we decided to start with desktop VR - the plan being to add VR support to the platform and use side loading to support Oculus and Vive headsets, without committing to any single store yet.&lt;/p&gt;

&lt;p&gt;I worked on the engine parts of this initiative, adding support for stereo rendering, vsynced presentation, latency optimizations for parts of the input pipeline, integration of LibOVR and OpenVR, etc. We had a few other people starting to prototype character navigation, UI integration and the like.&lt;/p&gt;

&lt;p&gt;Of course we had our fair share of rendering tweaks we needed to do - what do you do with particles? Do post-effects work? Do all parts of rendering code handle asymmetric frustums? Etc. As for optimizations, some of them were VR-specific and allowed doing something once per frame instead of twice, but some were general enough to apply to the rest of the rendering modes as well.&lt;/p&gt;

&lt;p&gt;We also had to figure out a compatibility strategy - how do you play existing Roblox games in VR? We believed that this was where the strength of our platform lay - VR desperately needed content and we had a lot of it, it just wasn’t VR-ready. This is why built-in character navigation and UI portability were a big deal - how do we provide them with acceptable comfort? Games that wanted to could of course use VR in a more deliberate manner; I remember building a table tennis simulator game that wasn’t very fun, but was still profoundly impressive the first time you played it.&lt;/p&gt;

&lt;p&gt;Once the desktop version worked, and we agreed we’d ship it in a stealth way, we needed to figure out what the full product release would look like. And, us being mobile first, we naturally turned to mobile VR.&lt;/p&gt;

&lt;p&gt;From the business perspective this was a mistake, as mobile VR at the time was in a very sad state which we started discovering along the way; VR without positional tracking is not for the faint of heart, and the state of the ecosystem at the time was… bad. However this was also fun to navigate - we looked into Cardboard-like devices and, quickly getting dissatisfied with the SDKs available to us, I wrote a custom VR renderer using gyro/accel inputs, a pseudo Kalman filter tuned for latency, and a late latching setup where the final postprocessing quad would get timewarped using the latest known information from the CPU side about where the head is looking. The results were much better than what stock SDK provided, but still very far from a decent VR experience, let alone what you could get on desktop with positional tracking.&lt;/p&gt;
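&lt;p&gt;For readers who haven’t written sensor fusion before: the simplest version of this idea is a complementary filter - integrate the gyro for low-latency response, and slowly pull toward the accelerometer’s gravity-based estimate to cancel drift. A single-axis sketch (the filter I actually wrote was more involved and tuned specifically for latency):&lt;/p&gt;

```cpp
// Complementary filter for head orientation, one axis for clarity.
// gyroRate: angular velocity (rad/s) from the gyro - responsive but drifts.
// accelPitch: pitch estimated from gravity in the accelerometer - noisy but
// drift-free. alpha close to 1 trusts the gyro short-term.
float fusePitch(float pitch, float gyroRate, float accelPitch, float dt,
                float alpha = 0.98f)
{
    float gyroPitch = pitch + gyroRate * dt;             // fast path, drifts
    return alpha * gyroPitch + (1 - alpha) * accelPitch; // slow correction
}
```

On top of an orientation estimate like this, the timewarp step then re-projects the final quad using the freshest reading just before present, shaving off most of the remaining perceived latency.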

&lt;p&gt;Except for the mobile VR code, which never shipped, all the rest is still in Roblox today. Miraculously, it still works even though we spend almost no time maintaining it - in fact, one of the winners of a 2020 Game Jam was a VR game (&lt;a href=&quot;https://www.roblox.com/games/5372270649/The-Eyes-of-Providence-VR-Multiplatform&quot;&gt;The Eyes of Providence&lt;/a&gt;) that was actually lots of fun to play, with gameplay unique to both VR and our platform, combining the strengths of both in one package.&lt;/p&gt;

&lt;p&gt;Ultimately, we ended up testing the Daydream waters as well but never shipped anything - the engine still supports VR but a full product integration will have to wait for a time where a VR platform is meaningful to us as a business.&lt;/p&gt;

&lt;h1 id=&quot;may-july-2016-smooth-terrain-lod&quot;&gt;May-July 2016: Smooth terrain LOD&lt;/h1&gt;

&lt;p&gt;We always knew we needed a geometric LOD system for terrain. I even prototyped it when working on terrain initially, but didn’t have time to get it to work well enough to ship.&lt;/p&gt;

&lt;p&gt;Well, now was the time. There was a lot of careful work here in making LOD updates responsive, managing the cost of geometry updates (which had previously been limited to load time), hiding seams between chunks of different detail levels, etc., etc.&lt;/p&gt;

&lt;p&gt;To test this I used a level with ~500M voxels that was a Mars terrain import; levels of that size were only practical with LOD, but they also stressed all other parts of the system, forcing me to implement a new in-memory storage format for voxel data, optimize various parts of the system - including deserialization, undo history and physics - and do more performance work everywhere.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=VSdk4MfVGEk&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/VSdk4MfVGEk/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even that proved to not ultimately be enough, and we had a few more people take a stab at improving various components of the system since.&lt;/p&gt;

&lt;h1 id=&quot;august-2016-new-shader-pipeline&quot;&gt;August 2016: New shader pipeline&lt;/h1&gt;

&lt;p&gt;With an eye towards new features, such as compute, and new graphics APIs, such as Metal, I set out to find a better solution for shader translation. The combination of hlsl2glsl + glsl-optimizer worked but was limited to DX9/GLES2-style shaders, and everything else was a hack.&lt;/p&gt;

&lt;p&gt;One thing I wanted was support for constant buffers, which would significantly clean up how we handled constants in the cross-platform rendering abstraction, but it was very painful to do with the old pipeline.&lt;/p&gt;

&lt;p&gt;So I set out to find a new solution, and settled on &lt;a href=&quot;https://github.com/Thekla/hlslparser&quot;&gt;hlslparser&lt;/a&gt;, as used in The Witness. It used AST-&amp;gt;source translation, which was simpler than our previous pipeline - e.g. for Metal we wouldn’t have to go through glsl-optimizer’s IR - but it was incomplete, so I ended up making a lot of small changes to make it practical to switch to (all of them merged upstream) and replacing our old pipeline. We still relied on glsl-optimizer for OpenGL to optimize the shaders, as that made mobile drivers happier, but this opened the door to using more modern features, such as uniform buffers, in the “frontend” (hlslparser could then flatten these to vec4 arrays, which made the resulting shaders compatible with ES2).&lt;/p&gt;

&lt;p&gt;Thus the second version of the shader toolchain was born; we would end up reworking it again in the future!&lt;/p&gt;

&lt;h1 id=&quot;september-2016-rendering-abstraction-cleanup&quot;&gt;September 2016: Rendering abstraction cleanup&lt;/h1&gt;

&lt;p&gt;Before implementing Metal support, I wanted to make it easier to implement more modern APIs. This involved reworking the constant handling - we used to use integer handles that could be retrieved by name, as in GL - and introducing render pass support.&lt;/p&gt;

&lt;p&gt;For constants, we settled on constant buffer support, where a structure layout is mirrored between CPU &amp;amp; GPU and the entire buffer is set in bulk. This ended up being a huge win both in terms of code simplicity and in terms of performance - up until that point we had to emulate constant buffers on Direct3D 11 (a port done by another engineer, which is why it wasn’t mentioned in this post) and set individual uniforms on GL. With this change we had the shader compiler flatten the uniform structs into vec4 arrays for GL, so we could remove all reflection code from our shader pipeline and simplify and optimize the setup significantly.&lt;/p&gt;
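&lt;p&gt;A sketch of the mirrored-struct idea, with invented member names: the CPU-side struct tiles the shader’s float4 registers exactly, so the whole block can be uploaded in one copy on modern APIs, and the same struct can be flattened into a vec4 array uniform for GL:&lt;/p&gt;

```cpp
#include <cstddef>

// Mirrors a hypothetical `cbuffer Globals` in HLSL, field for field. HLSL
// constant buffers are addressed in 16-byte (float4) registers, so the CPU
// struct must tile those registers exactly for a bulk copy to be valid.
struct GlobalConstants {
    float viewProjection[16];     // 4 float4 registers
    float cameraPosition[4];      // 1 register (xyz + pad)
    float lightDirection[4];      // 1 register
    float params[4];              // 1 register of misc per-frame parameters
};

// For GL without uniform buffers, the same block becomes a vec4[kVec4Count]
// array uniform, set with a single glUniform4fv-style call.
constexpr size_t kVec4Count = sizeof(GlobalConstants) / 16;
static_assert(sizeof(GlobalConstants) % 16 == 0, "must be float4-aligned");
```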

&lt;p&gt;For render passes, we were targeting Metal so we went with a simple immediate-mode pass specification. It is used by our high-level rendering code with explicit load/store masks, and used in GL backend to discard attachment contents when necessary.&lt;/p&gt;

&lt;p&gt;This was also the last significant refactor of our rendering interface; today, almost 4 years later, it looks pretty close to how it looked 4 years ago - which is nice! I think we found a great balance between simplicity and performance across all the different APIs. The only large addition since then has been support for compute, which we still aren’t using in production [but it helps on hack weeks…].&lt;/p&gt;

&lt;h1 id=&quot;october-2016-metal-on-ios&quot;&gt;October 2016: Metal on iOS&lt;/h1&gt;

&lt;p&gt;With shader translation support implemented by hlslparser and the rendering abstraction refactor done, it was time to do a Metal port. The motivation was, of course, render dispatch performance - the GL driver consumed a significant portion of our frame, for what didn’t seem like a good reason.&lt;/p&gt;

&lt;p&gt;I think I did the initial bringup on a Friday. I came to work a bit early, wrote code non-stop until 6 PM, went back home, and proceeded to write code until 10 PM when the app ran and rendered correctly.&lt;/p&gt;

&lt;p&gt;Of course after that I had to spend the rest of the month on cleanup, corner case fixes, optimizations, etc., doing some small refactoring in other backends to make implementations align closer.&lt;/p&gt;

&lt;p&gt;Metal worked very well after that - on iOS we had very few stability issues after launch, and the backend has required little maintenance since.&lt;/p&gt;

&lt;h1 id=&quot;november-2016-assorted-cleanup-and-optimization&quot;&gt;November 2016: Assorted cleanup and optimization&lt;/h1&gt;

&lt;p&gt;It doesn’t look like anything very big happened here - some small cleanup of the rendering backend, some leftover Metal fixes, optimizations, etc.&lt;/p&gt;

&lt;p&gt;Calm before the storm, as they say.&lt;/p&gt;

&lt;h1 id=&quot;december-2016-hack-week&quot;&gt;December 2016: Hack week&lt;/h1&gt;

&lt;p&gt;Was it possible to top the last hack week?&lt;/p&gt;

&lt;p&gt;I felt like the demo from the previous hack week, while technically very awesome, didn’t quite have enough ‘oomph’. Part of the problem was the limitation of the voxel lighting technique, so I wanted to fix that, but also to have a slightly less ambitious plan.&lt;/p&gt;

&lt;p&gt;This may come as a surprise, because the result of this hack week is probably what the community regards as the best thing I’ve ever worked on:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=lrvOGqC9ZjQ&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/lrvOGqC9ZjQ/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, something to realize is that the previous year I was treading untrodden ground; this year I decided to see if I could implement a production-grade lighting engine - while knowing pretty much exactly what I needed to do and how. I’ve implemented shadow maps many times before in my life; I’ve even implemented a Forward+ renderer in my F# engine (as you probably realized from the lack of F# in the last few years of updates, I stopped using the language for a while) a few years ago.&lt;/p&gt;

&lt;p&gt;I just needed to combine all of that. In addition, I think this was right after reading a slide deck on Doom 2016 - a game that looks great, is very fast and has a very simple renderer by modern standards. Inspired by this, I decided to implement the same approach - Forward+ with shadow maps, and Doom 2016-style particle lighting with an atlas.&lt;/p&gt;

&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; I didn’t cheat this time around - I probably started the hack week on Saturday, but that seems fair (?) - and I managed to implement the entire thing in a week. It helped that I already had decent HDR code from the last hack week so I could copy that, and I had compute scaffolding I could copy for the light culling; the rest had to be mostly done from scratch, except for the light culling shader that I stole from my F# engine.&lt;/p&gt;

&lt;p&gt;The results, well, the results shaped the Roblox lighting work for the next 3 years. We ended up sharing a build with the community and developers made levels that &lt;a href=&quot;https://roblox.github.io/future-is-bright/results&quot;&gt;blew our minds&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;2017-future-is-bright-and-vulkan&quot;&gt;2017: Future Is Bright and Vulkan&lt;/h1&gt;

&lt;p&gt;I think I’ve been writing this blog post for 4 hours now and I’m not even up to 2020. But the interesting part is that it looks like in 2017 I did very little work that actually shipped to production in a meaningful way.&lt;/p&gt;

&lt;p&gt;This is because of a few things coinciding.&lt;/p&gt;

&lt;p&gt;One, I started doing way more technical direction. Helping teams with their roadmaps, helping design solutions to some hard problems, doing code reviews, etc. This was also the time when I was the de-facto rendering team lead, which didn’t leave much time to write code.&lt;/p&gt;

&lt;p&gt;Two, a lot of my attention was spent on the Future Is Bright prototypes. I now had two hack week demos from two prior years; both had very interesting ideas, but we needed to figure out which one of them - or rather which combination of ideas - to pursue. This was trickier than you’d think - some people in the company favored the voxel approach for reasons I won’t go too much into, and the voxel approach resulted in a very distinctive soft look and provided an answer to the entire rendering equation; the shadow map approach resulted in superior quality for direct lighting, and was faster, but wasn’t sufficient by itself.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=gJbhTBubWxw&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/gJbhTBubWxw/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also needed to answer content compatibility questions (what do we do on low end?), among others.&lt;/p&gt;

&lt;p&gt;So I ended up doing a lot of research/prototype work on both implementations, trying to find the limits of both approaches. This resulted in more explorations into GPU-friendly ways to do voxel lighting, adding extra features, ultimately porting both prototypes to Metal and running them on an iPhone, etc.&lt;/p&gt;

&lt;p&gt;The summary of this work is available &lt;a href=&quot;https://roblox.github.io/future-is-bright/compare&quot;&gt;here&lt;/a&gt;. We eventually decided to pursue the shadow map/forward+ route for direct lighting, and are likely going to use a voxel-based solution for the indirect components in the future.&lt;/p&gt;

&lt;p&gt;Three, Vulkan was the next API to tackle. The final boss, so to speak. Another engineer did the initial port but eventually I took over.&lt;/p&gt;

&lt;p&gt;The concrete projects that ended up shipping involved more work on the rendering API (adding compute support to make the next hack weeks easier…), and building a third - hopefully final! - version of the shader compilation pipeline using SPIR-V, Vulkan’s shader intermediate representation. This included contributing to multiple open-source projects to make this practical - a lot of work I won’t go into since I’m getting tired writing all of this.&lt;/p&gt;

&lt;p&gt;At the end of the year we had settled on the FIB path and had a Vulkan implementation that was ready to ship on Android - or so we thought. The real release had to wait until the next year.&lt;/p&gt;

&lt;p&gt;Of course throughout the year I shipped a few small optimizations here and there and worked on a few fixes.&lt;/p&gt;

&lt;h1 id=&quot;may-2017-memory-tracking&quot;&gt;May 2017: Memory tracking&lt;/h1&gt;

&lt;p&gt;… oh, and there’s this. We were working on some memory diagnostic tools and I decided to see if I could implement a very low-overhead always-on tracking system for memory.&lt;/p&gt;

&lt;p&gt;To make this practical, it had to be able to categorize memory allocations but do nothing else - by overriding global operators new &amp;amp; delete and embedding a little bit of metadata in addition to the user data, it was possible to do this at minimum cost and keep it enabled in all builds.&lt;/p&gt;
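&lt;p&gt;A minimal sketch of the idea - purely illustrative, not the actual Roblox implementation - might look like this: every allocation carries a small metadata header recording its size and category, so per-category totals stay cheap to maintain:&lt;/p&gt;

```python
# Hypothetical sketch of category-tagged allocation tracking. In C++ the
# header would be embedded in front of the user data by operator new; here a
# dict stands in for it.

class TrackingAllocator:
    def __init__(self):
        self.totals = {}  # category -> currently allocated bytes

    def allocate(self, size, category):
        # The metadata "header" travels with the allocation, mirroring how a
        # C++ implementation embeds it next to the user data.
        header = {"size": size, "category": category}
        self.totals[category] = self.totals.get(category, 0) + size
        return header

    def free(self, header):
        # The header tells us exactly how much to subtract and from where.
        self.totals[header["category"]] -= header["size"]

alloc = TrackingAllocator()
a = alloc.allocate(4096, "mesh")
b = alloc.allocate(1024, "script")
alloc.free(b)
assert alloc.totals == {"mesh": 4096, "script": 0}
```

&lt;p&gt;Note that freeing a block through a path that doesn’t understand the header is exactly the kind of allocator mismatch mentioned below.&lt;/p&gt;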

&lt;p&gt;This ended up challenging to do well because of various issues with allocator mismatch, and required some further work in 2018, but the system does exist to this day and remains a vital source of memory-related information in production builds.&lt;/p&gt;

&lt;p&gt;We had a few people look at memory occasionally, but it always involved using somewhat clunky platform-specific tools and made it hard to tell at a glance whether there’s a problem. Now that we had a way to validate memory usage and identify the biggest consumers to trigger a subsequent investigation, a few memory-related issues became obvious. So I also worked on fixing some of them, including some mesh memory optimizations (using my independently developed library, &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;), reducing script bytecode size by compressing it better, coming up with a new encoding scheme for part outline data which saved 20% of part memory - two years later I removed outlines from our code outright as they weren’t useful any more, but back then we still had games relying on them - and fixing a few assorted memory problems with animations.&lt;/p&gt;

&lt;h1 id=&quot;january-february-2018-vulkan&quot;&gt;January-February 2018: Vulkan&lt;/h1&gt;

&lt;p&gt;At the beginning of the year we had what was basically rock-solid Vulkan code, but production deployment required working around lots of driver issues, and there hadn’t been enough time for that, so that work had to happen in 2018. In the first couple of months we ironed out all of the issues and finally activated Vulkan on a large subset of production devices.&lt;/p&gt;

&lt;p&gt;More fixes were required throughout the rest of the year. Vulkan proved extremely challenging to fully release - I’m happy that we did do this now, with 60% of our user base using it and enjoying the resulting performance benefits, and with it providing a path away from relying on OpenGL as much; but it was a struggle, and honestly to a large extent it’s stubbornness and the sunk cost fallacy that got us over the milestone.&lt;/p&gt;

&lt;p&gt;I’ve written enough about Vulkan elsewhere, between multiple talks at different conferences and a few blog posts here, so I’ll just leave it at that. I take equal amounts of pride and solace in having, among other early adopters, paved the way for others to have an easier time.&lt;/p&gt;

&lt;h1 id=&quot;february-march-2018-network-performance-and-bandwidth&quot;&gt;February-March 2018: Network performance and bandwidth&lt;/h1&gt;

&lt;p&gt;One side effect of the Vulkan work is that it got me completely burned out on rendering. I suspect it was a combination of the work being incredibly frustrating and of it - along with the lighting prototyping - contributing to me not shipping anything concrete in 2017. So I felt like I was done with rendering for a while, and while I still helped guide the team to deliver other projects, I wanted to do other things for a change.&lt;/p&gt;

&lt;p&gt;Partly because of this and partly because of some challenges in this area at the time, I took a brief detour to focus on networking. As part of this I implemented many small performance fixes to various parts of the stack, redesigned parts of the networking protocol to be more efficient, completely rewrote our physics data compressor to provide higher quality with less bandwidth consumption (this code is still used today, although it’s possible to improve on it further), and wrote a few specs for future improvements in this area, most of which have since been implemented by other people.&lt;/p&gt;

&lt;h1 id=&quot;march-2018-hack-week&quot;&gt;March 2018: Hack Week&lt;/h1&gt;

&lt;p&gt;I forget why, but the hack week didn’t happen in 2017 and happened in March instead. According to what I wrote above, I was pretty done with rendering, and decided to do something else. At the time I had spent a lot of time thinking about the future of scripting at Roblox - programming languages and the evolution thereof.&lt;/p&gt;

&lt;p&gt;To that end I decided to explore a set of static typing extensions over Lua, using my script analysis work from 2014 (which we’ve used since it shipped) as a starting point. I extended the syntax with optional type annotations, and wrote a hybrid between a data flow analyzer and a unification-based inference engine, which you can see in action here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=_GNPwPwrEbI&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/_GNPwPwrEbI/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We haven’t used any of this code directly but this paved the way to a lot of the subsequent programming languages work we’ve started doing, although I personally haven’t worked on the type checking bits too much - see below!&lt;/p&gt;

&lt;h1 id=&quot;april-2018-accurate-play-solo&quot;&gt;April 2018: Accurate Play Solo&lt;/h1&gt;

&lt;p&gt;As I was working more on networking code, I started to understand the limitations and flexibility there very well. At the time Studio had a Play Solo testing mode that wasn’t using replication; it was a constant struggle to keep your game functioning correctly because of the semantic differences.&lt;/p&gt;

&lt;p&gt;This would come up in discussions occasionally, and everybody either said that we really needed the play solo mode because anything else just couldn’t be fast enough, or that we really needed a full, if slow, replicated cloud-based testing solution as that was the only way to get parity with production. At some point I couldn’t take it anymore and just went ahead and built a prototype that started a full replicated session locally very quickly.&lt;/p&gt;

&lt;p&gt;For this to work I had to tweak a bunch of parameters in the networking code and, crucially, spawn the server and client in the same process, so that it could happen as quickly as possible. There was still more overhead in this mode, and you did need two full datamodels, but it was much better than anything else we had, so we decided to ship it. I only worked on the initial prototype here, but the important lesson is that an existence proof is so often so important - the best way to get people to believe something is possible is to show them the fait accompli and watch them marvel.&lt;/p&gt;

&lt;p&gt;Since then we’ve removed the old play solo from the product, although there are some parts of the play solo flow that aren’t as fast as my old prototype was - which we will fix one day.&lt;/p&gt;

&lt;h1 id=&quot;july-2018-new-terrain-rasterizer-for-pathfinding&quot;&gt;July 2018: New terrain rasterizer for pathfinding&lt;/h1&gt;

&lt;p&gt;(note, I’m omitting some more shader pipeline work and background Vulkan work in prior months)&lt;/p&gt;

&lt;p&gt;Generally I think I was still in the mode of helping others much more than doing personal work in 2018. A few highlights from these projects are rewriting the rasterizer for navmesh generation and adding dynamic instancing for rendering.&lt;/p&gt;

&lt;p&gt;In June or thereabouts one of our engineers was finishing the rewrite of the navigation system, from the old voxel-based system to the new system based on Recast+Detours. Part of the system involved voxelization into spans and using the result to generate navmesh.&lt;/p&gt;

&lt;p&gt;The rasterizer used in that system is conservative, and not that efficient; on large maps with a lot of terrain this was proving to be a bottleneck. I realized that due to the somewhat unique construction of the terrain mesh it was possible to do a very good approximation using a fast non-conservative half-space rasterizer, and use a few tricks to match positive triangles to negative triangles to fill spans very efficiently.&lt;/p&gt;
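&lt;p&gt;The half-space (edge function) approach itself is standard; a toy illustration - not the terrain-specific span rasterizer described above - with integer sample points and a counter-clockwise triangle:&lt;/p&gt;

```python
# Toy half-space triangle rasterizer: a point is inside a CCW triangle when
# all three edge functions are non-negative. Illustrative only.
def edge(a, b, c):
    # Signed area test: positive when point c lies to the left of edge a to b.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def rasterize(w, h, a, b, c):
    # Returns the set of integer sample points covered by triangle abc.
    covered = set()
    for y in range(h):
        for x in range(w):
            p = (x, y)
            if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                covered.add(p)
    return covered

covered = rasterize(8, 8, (0, 0), (7, 0), (0, 7))
assert (0, 0) in covered      # corner of the triangle is covered
assert (7, 7) not in covered  # opposite corner of the grid is not
```

&lt;p&gt;A real implementation would also restrict the loop to the triangle’s bounding box and pair up positive/negative triangles to emit spans, as described above.&lt;/p&gt;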

&lt;p&gt;Curiously, even though I’ve never worked on any commercial products that targeted systems without a GPU, this is the second software rasterizer I ended up shipping - the first being one for a software occlusion system running on the SPUs back in my PS3 days.&lt;/p&gt;

&lt;h1 id=&quot;july-2018-cross-cluster-instancing&quot;&gt;July 2018: Cross-cluster instancing&lt;/h1&gt;

&lt;p&gt;Another engineer was finishing the instancing system; despite this being a rendering project I couldn’t resist and looked into improving performance of that code, and ended up adding dynamic instancing in addition to clustered batched instancing.&lt;/p&gt;

&lt;p&gt;As a result, we now have a system that can aggregate large sets of similar objects statically, so that we don’t waste time regenerating and reuploading constant data for heavy scenes; if some sets are smaller, we reaggregate them dynamically on the fly to merge the resulting draw calls for free.&lt;/p&gt;

&lt;p&gt;The result is performant almost regardless of the scene composition, which is neat!&lt;/p&gt;

&lt;h1 id=&quot;november-2018-help-optimize-fib-phase-1&quot;&gt;November 2018: Help optimize FIB phase 1&lt;/h1&gt;

&lt;p&gt;Other folks were actively working on FIB phase 1 (which consisted of a new voxel lighting implementation with HDR support, a tone mapper and an HDR-ish postfx pipeline), but in November we realized that we weren’t sure we could make it by the end of the year - the code was done, it worked properly, but we wanted to ship with minimal performance regressions, and our metrics showed that were we to ship as-is, we would drop performance by 10-15% on mobile.&lt;/p&gt;

&lt;p&gt;So I helped by implementing a few optimizations in various places of the stack to get us back on track, which contributed to helping release FIB phase 1 on time.&lt;/p&gt;

&lt;p&gt;The rest of 2018 doesn’t seem super eventful - similar to 2017, I worked on a few small bits here and there and focused a lot on helping others, writing specs, that sort of thing. Then at the end of 2018 I wrote a technical spec for the next Lua VM, which would consume much of the next year for me.&lt;/p&gt;

&lt;h1 id=&quot;aside-future-of-fib&quot;&gt;Aside: Future of FIB&lt;/h1&gt;

&lt;p&gt;In some sense of course Future Is Bright is my child. I made the original hack week demo, and was very involved in the initial stages of the production work, including some fixes for phase 1 above.&lt;/p&gt;

&lt;p&gt;However, the other phases saw progressively less of my involvement, and none of the phases would have shipped without other people’s work. In fact, although my original code is still present in all three phases, most of the code in them is not mine.&lt;/p&gt;

&lt;p&gt;I was somewhat involved in phase 2, not so much by contributing code, but by convincing the engineer who did a lot of the work to try to figure out how to combine a few crazy ideas we were discussing together, notably tile-based incremental cascade updates (inspired by Insomniac’s CSM Scrolling) with EVSM (exponential variance shadow maps) - look ma, no PCF! This ended up working wonderfully even though it was daunting at first.&lt;/p&gt;

&lt;p&gt;As far as phase 3 is concerned, although some of that code is still the same as it was in my hack week, again most of the effort at this point is not mine. As Roblox grows and I take more of an advisory role in rendering, more and more the work we ship is that of the entire team and a lot of different engineers Roblox developers may or may not know, as opposed to a few people who started all of this.&lt;/p&gt;

&lt;h1 id=&quot;2019-luau&quot;&gt;2019: Luau&lt;/h1&gt;

&lt;p&gt;At this point I knew pretty much exactly what we needed to do in terms of language evolution at Roblox. We used Lua 5.1 as a base, but it wasn’t fast enough - so we needed a much faster implementation - and it wasn’t robust enough for large projects, so we needed gradual typing.&lt;/p&gt;

&lt;p&gt;One engineer started working on the type system, and I started working on the virtual machine. We took the existing linting infrastructure I wrote in 2014, took the parser from it, made it more robust and faster, and then I wrote a compiler targeting a new bytecode, an interpreter for that bytecode, many changes in the VM to make faster execution practical, etc.&lt;/p&gt;

&lt;p&gt;This would consume me for the entire year, in addition to guidance for other projects in the company. It’s by no means a trivial task - I ended up delving very deeply into the dark art of making fast interpreters, tuning ours tightly to the compilers and hardware we ran on, and making a lot of performance improvements everywhere in the Lua stack (e.g. our reflection layer, even though it is not part of the VM itself, ended up being 2-3x faster as a result of this work - once you have good benchmarks, it’s easier to make progress!).&lt;/p&gt;

&lt;p&gt;This represented more than just making things faster though - it’s a new chapter in Roblox history: before this we used the language we’d been given since the very beginning, and now we treat it as our own, &lt;a href=&quot;https://roblox.github.io/luau&quot;&gt;Luau&lt;/a&gt;. This resulted in us adding libraries we could have technically added before but, due to the lack of focus, hadn’t; adding features that the Lua community at large has been begging for for years but that somehow never made it into Lua proper; and in general doing a lot of deep, meaningful and impactful work that helped our community.&lt;/p&gt;

&lt;p&gt;In addition to the new compiler and interpreter, I also reworked the existing static analysis passes into the new framework and added several more; a lot of attention was dedicated to compilation and analysis performance as well, with more work to follow in 2020.&lt;/p&gt;

&lt;p&gt;As a result at the end of 2019 we had a modern and performant interpreted language stack, which was ready for more polish in 2020; the type checking work was well under way but wasn’t finished in 2019, and we’re continuing to work on it today.&lt;/p&gt;

&lt;h1 id=&quot;december-2019-hack-week&quot;&gt;December 2019: Hack Week&lt;/h1&gt;

&lt;p&gt;With the entire year dedicated to Luau, it felt fitting to end the year with a language-related hack week. As a result of the prior work our implementation was pretty competitive with the LuaJIT interpreter (still losing on some benchmarks, but winning on a couple, with the code base written in portable and maintainable C). The next frontier was, of course, compiled performance.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=vScM-nk5Avk&quot;&gt;&lt;img src=&quot;https://img.youtube.com/vi/vScM-nk5Avk/0.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One reason why I started working on Luau is that, on technical grounds, the widely deployed solutions in the programming language space never seemed sufficient.&lt;/p&gt;

&lt;p&gt;For example, interpreters in all existing languages are slow; the only fast widely used production interpreter on the planet I’m aware of is LuaJIT, but it is hand-coded in assembly. I didn’t believe this was the only answer, so now we have an interpreter that beats any other interpreter out there except for LuaJIT (including every JS interpreter we’ve tested).&lt;/p&gt;

&lt;p&gt;There are JIT compilers that are amazingly fast; however, if you look at dynamically typed languages, the JIT story is often unsatisfactory, and always complicated. A modern JavaScript VM has a two- or three-tier JIT, with tiers having to support type recording, dynamic deoptimization with on-stack replacement, many different hidden representations of the same language types, etc. This is despite the fact that type information can be present in a JS program when TypeScript or Flow is used as a source language (this type information can be unsound, but that’s a separate problem).&lt;/p&gt;

&lt;p&gt;It felt unsatisfactory to discard type information and then have to learn it dynamically again; it felt unsatisfactory to have to perform implementation heroics to transparently replace data structures. I wanted to experiment with a gradually typed system where, through ownership of the entire stack starting at the language level, the resulting JIT can be much simpler and deliver comparable results.&lt;/p&gt;

&lt;p&gt;I didn’t quite get there - it’s hard to do this in the space of a week - but I did have a lot of fun doing it, and it feels like the theory hasn’t been disproven, at least. With a slightly stronger code generator the goals seem achievable, and I hope to get a chance to explore this more in the coming years.&lt;/p&gt;

&lt;p&gt;Oh, this time I started the hack week on Sunday, so basically no cheating ;)&lt;/p&gt;

&lt;h1 id=&quot;2020-more-luau&quot;&gt;2020: More Luau&lt;/h1&gt;

&lt;p&gt;There’s always more performance to be gained, and I did spend part of this year doing further tuning and optimization - mostly going through the backlog from last year. A lot more work went into memory optimizations, trimming down the sizes of various data structures, debug info, etc. This included writing a new memory allocator and a few other bits.&lt;/p&gt;

&lt;p&gt;In addition to that I’ve written a new low-level debugging engine; our old one relied on line hooks which we don’t support in the new VM, so I had to make a new one that works closer to how you’d expect a native debugger to work, using software breakpoints (bytecode patching) and single-step execution mode to get breakpoints and stepping to work.&lt;/p&gt;
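&lt;p&gt;To illustrate the bytecode-patching idea on a toy stack machine (none of these opcodes or structures come from Luau - this is purely a sketch of the technique): the debugger overwrites an instruction with a BREAK opcode and remembers the original so execution can continue past the breakpoint:&lt;/p&gt;

```python
# Toy VM demonstrating software breakpoints via bytecode patching.
PUSH1, ADD, HALT, BREAK = range(4)

class VM:
    def __init__(self, code):
        self.code = list(code)
        self.patched = {}   # pc -> original opcode saved when patching
        self.hits = 0
        self.acc = 0

    def set_breakpoint(self, pc):
        # Patch the instruction in place; keep the original to resume later.
        self.patched[pc] = self.code[pc]
        self.code[pc] = BREAK

    def run(self):
        for pc, op in enumerate(self.code):
            if op == BREAK:
                self.hits += 1            # a real debugger would pause here
                op = self.patched[pc]     # then resume the original opcode
            if op == PUSH1:
                self.acc += 1
            elif op == ADD:
                self.acc += self.acc      # toy semantics: double accumulator
            elif op == HALT:
                return

vm = VM([PUSH1, PUSH1, ADD, HALT])
vm.set_breakpoint(2)
vm.run()
assert vm.hits == 1
assert vm.acc == 4
```

&lt;p&gt;The interpreter only pays for breakpoints when it actually executes a patched instruction, which is what makes this approach cheaper than line hooks.&lt;/p&gt;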

&lt;p&gt;Some work also went into compilation throughput, but also cleanliness and correctness of the entire stack. I’ve spent some time writing fuzzing support code, something I hope to blog about soon, and making sure the language stack is cleanly separated from the rest of the engine and coherent internally - we have a clean separation between the high-level aspects of the language tooling and the VM, and can compile and test either of those without the other (although of course testing the VM without the compiler requires pre-building bytecode somehow).&lt;/p&gt;

&lt;p&gt;I’m now starting to turn my attention to garbage collection, with some optimization work already shipping but the ultimate destination being generational incremental collection with good pacing (Lua 5.4 has a generational non-incremental collector, so that’s not a good source of inspiration) as well as resuming the JIT experiments and hopefully eventually shipping something.&lt;/p&gt;

&lt;p&gt;Just like in prior years I’m also spending a fair amount of time helping the rest of the team with whatever projects they happen to be working on. And it looks like I did work on one thing that wasn’t Luau-specific:&lt;/p&gt;

&lt;h1 id=&quot;march-2020-faster-multi-core-narrowphase&quot;&gt;March 2020: Faster multi-core narrowphase&lt;/h1&gt;

&lt;p&gt;A big focus for the entire engine team since 2019 has been to make the engine scale to larger worlds, and use available cores more effectively. Up until March I was mostly involved in this initiative in an advisory capacity, but then I decided to get my hands dirty and fix a few issues that were apparent from the scaling work.&lt;/p&gt;

&lt;p&gt;We already had a few components parallelized at that time; after doing some multi-core profiling on our server hardware I didn’t like the scaling property of our parallel narrowphase and decided to write a new version.&lt;/p&gt;

&lt;p&gt;This involved writing a more carefully tuned implementation of the narrowphase itself - the old code had two long serial phases (prologue &amp;amp; epilogue), and I restructured a lot of the computation to make sure that the prologue is almost empty and the epilogue only involves serial processing on transitions of contact states (e.g. a body waking up or going to sleep), which happen more rarely.&lt;/p&gt;

&lt;p&gt;We also had some issues with balancing the workload across cores, so I added a more general facility to our task scheduler that could be used to run data-parallel workloads more easily without having to tune the work split too much.&lt;/p&gt;

&lt;p&gt;Finally, narrowphase ended up hitting a part of our physics pipeline that I was never truly happy with - where, to read the full transform matrix of a body, some lazy hierarchical update is required to perform the full computation. When you have many cores, doing these updates in parallel serializes the computation, which can result in significant performance problems. The optimal path here is to redesign the system to eliminate the lazy update - this is on our radar but it’s very difficult - so I did the next best thing and wrote carefully tuned lock-free code that allowed us to reduce the synchronization time during the updates to a minimum.&lt;/p&gt;
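&lt;p&gt;A heavily simplified sketch of the lazy-update pattern (this is not the engine code - the real implementation uses atomics over full transform hierarchies; here a tri-state flag guards a single cached value, and a lock stands in for the compare-exchange):&lt;/p&gt;

```python
import threading

# Tri-state flag: only one thread performs the update; other readers either
# see the CLEAN result immediately or retry instead of queueing on the work.
DIRTY, UPDATING, CLEAN = range(3)

class LazyTransform:
    def __init__(self, local):
        self.local = local
        self.world = None
        self.state = DIRTY
        self._guard = threading.Lock()  # stands in for an atomic CAS

    def read(self):
        while True:
            if self.state == CLEAN:
                return self.world       # fast path: no synchronization needed
            # Try to claim the update; in C++ this would be a compare-exchange
            # of DIRTY -> UPDATING rather than a lock.
            with self._guard:
                if self.state == DIRTY:
                    self.state = UPDATING
                    self.world = self.local * 2  # stand-in for the hierarchy walk
                    self.state = CLEAN
                    return self.world

results = []
t = LazyTransform(3)
threads = [threading.Thread(target=lambda: results.append(t.read())) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert results == [6, 6, 6, 6]  # every reader observes the same updated value
```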

&lt;p&gt;… and it’s 11:59 PM on a Sunday and I’m done so hopefully somebody made it to the end. Thank you.&lt;/p&gt;
</description>
			<pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2020/08/02/eight-years-at-roblox/</link>
			<guid isPermaLink="true">https://zeux.io/2020/08/02/eight-years-at-roblox/</guid>
		</item>
		
		<item>
			<title>Writing an efficient Vulkan renderer</title>
			<description>&lt;p&gt;In 2018, I wrote an article “Writing an efficient Vulkan renderer” for GPU Zen 2 book, which was published in 2019. In this article I tried to aggregate as much information about Vulkan performance as I could - instead of trying to focus on one particular aspect or application, this is trying to cover a wide range of topics, give readers an understanding of the behavior of different APIs on real hardware and provide a range of options for each problem that needs to be solved.&lt;/p&gt;

&lt;p&gt;At the time of publishing this article, the &lt;a href=&quot;https://www.amazon.com/GPU-Zen-Advanced-Rendering-Techniques-ebook/dp/B07SYP7P6B&quot;&gt;Kindle edition of the book&lt;/a&gt; is available for $2.99 on Amazon - that’s cheaper than a cup of coffee and it’s definitely worth your time and money. It contains many great articles about rendering effects and design.&lt;/p&gt;

&lt;p&gt;This, however, is the full, free of charge copy of the article - hopefully it will help graphics programmers to understand and use Vulkan to the fullest of its ability. The article has been lightly edited to mention Vulkan 1.1/1.2 promotions where applicable - fortunately, not much has changed in the last two years for Vulkan performance, so the content should still be mostly accurate.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;!--more--&gt;

&lt;blockquote&gt;
  &lt;p&gt;This article has been translated to &lt;a href=&quot;/data/effvulkan_kr.html&quot;&gt;Korean&lt;/a&gt; by 이정섭 and to &lt;a href=&quot;https://www.fevrierdorian.com/carnet/pages/ecrire-un-moteur-de-rendu-vulkan-performant.html&quot;&gt;French&lt;/a&gt; by Dorian Fevrier.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;abstract&quot;&gt;Abstract&lt;/h1&gt;

&lt;p&gt;Vulkan is a new explicit cross-platform graphics API. It introduces many new concepts that may be unfamiliar to even seasoned graphics programmers. The key goal of Vulkan is performance – however, attaining good performance requires in-depth knowledge about these concepts and how to apply them efficiently, as well as how particular drivers implement them. This article will explore topics such as memory allocation, descriptor set management, command buffer recording, pipeline barriers, render passes and discuss ways to optimize CPU and GPU performance of production desktop/mobile Vulkan renderers today as well as look at what a future looking Vulkan renderer could do differently.&lt;/p&gt;

&lt;p&gt;Modern renderers are becoming increasingly complex and must support many different graphics APIs with varying levels of hardware abstraction and disjoint sets of concepts. This sometimes makes it challenging to support all platforms at the same level of efficiency. Fortunately, for most tasks Vulkan provides multiple options that can be as simple as reimplementing concepts from other APIs with higher efficiency due to targeting the code specifically towards the renderer’s needs, and as hard as redesigning large systems to make them optimal for Vulkan. We will try to cover both extremes when applicable – ultimately, this is a tradeoff between maximum efficiency on Vulkan-capable systems and implementation and maintenance costs that every engine needs to carefully pick. Additionally, efficiency is often application-dependent – the guidance in this article is generic and ultimately best performance is achieved by profiling the target application on a target platform and making an informed implementation decision based on the results.&lt;/p&gt;

&lt;p&gt;This article assumes that the reader is familiar with the basics of the Vulkan API, and would like to understand them better and/or learn how to use the API efficiently.&lt;/p&gt;

&lt;h1 id=&quot;memory-management&quot;&gt;Memory management&lt;/h1&gt;

&lt;p&gt;Memory management remains an exceedingly complex topic, and in Vulkan it gets even more so due to the diversity of heap configurations on different hardware. Earlier APIs adopted a resource-centric model – the programmer doesn’t have a concept of graphics memory, only that of a graphics resource, and different drivers are free to manage the resource memory based on API usage flags and a set of heuristics. Vulkan, however, forces you to think about memory management up front, as you must manually allocate memory to create resources.&lt;/p&gt;

&lt;p&gt;A perfectly reasonable first step is to integrate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VulkanMemoryAllocator&lt;/code&gt; (henceforth abbreviated as VMA), an open-source library developed by AMD that handles many memory management details for you by providing a general-purpose resource allocator on top of Vulkan functions. Even if you do use that library, multiple performance considerations still apply; the rest of this section will go over memory caveats without assuming you use VMA – all of the guidance applies equally to VMA.&lt;/p&gt;

&lt;h2 id=&quot;memory-heap-selection&quot;&gt;Memory heap selection&lt;/h2&gt;

&lt;p&gt;When creating a resource in Vulkan, you have to choose a heap to allocate memory from. A Vulkan device exposes a set of memory types, where each memory type has flags that define the behavior of that memory and a heap index that defines the available size.&lt;/p&gt;

&lt;p&gt;Most Vulkan implementations expose two or three of the following flag combinations&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT&lt;/code&gt; – this is generally referring to GPU memory that is not directly visible from CPU; it’s fastest to access from the GPU and this is the memory you should be using to store all render targets, GPU-only resources such as buffers for compute, and also all static resources such as textures and geometry buffers.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT&lt;/code&gt; – on AMD hardware, this memory type refers to up to 256 MB of video memory that the CPU can write to directly, and is perfect for allocating reasonable amounts of data that is written by the CPU every frame, such as uniform buffers or dynamic vertex/index buffers.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT&lt;/code&gt;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; – this is referring to CPU memory that is directly visible from GPU; reads from this memory go over PCI-express bus. In the absence of the previous memory type, this generally speaking should be the choice for uniform buffers or dynamic vertex/index buffers, and also should be used to store staging buffers that are used to populate static resources allocated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT&lt;/code&gt; with data.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT&lt;/code&gt; – this is referring to GPU memory that might never need to be allocated for render targets on tiled architectures. It is recommended to use lazily allocated memory to save physical memory for large render targets that are never stored to, such as MSAA images or depth images.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On integrated GPUs, there is no distinction between GPU and CPU memory – these devices generally expose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT&lt;/code&gt;, which you can allocate all static resources through as well.&lt;/p&gt;
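
&lt;p&gt;As an illustration, memory type selection can be sketched as a scan over the exposed types; the structures below are simplified stand-ins for Vulkan’s (real code would use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPhysicalDeviceMemoryProperties&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memoryTypeBits&lt;/code&gt; field of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkMemoryRequirements&lt;/code&gt;):&lt;/p&gt;

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for Vulkan's memory type enumeration (assumption: real
// code would use the Vulkan structures and flag constants from vulkan.h).
const uint32_t DEVICE_LOCAL  = 1 << 0;
const uint32_t HOST_VISIBLE  = 1 << 1;
const uint32_t HOST_COHERENT = 1 << 2;

struct MemoryType { uint32_t propertyFlags; };

// Pick the first memory type allowed by typeBits that has all required flags;
// on failure the caller retries with a weaker set of flags (e.g. dropping
// DEVICE_LOCAL when VRAM is oversubscribed).
int findMemoryType(const std::vector<MemoryType>& types, uint32_t typeBits, uint32_t required)
{
    for (uint32_t i = 0; i < types.size(); ++i)
        if ((typeBits & (1u << i)) && (types[i].propertyFlags & required) == required)
            return int(i);
    return -1;
}
```

&lt;p&gt;A fallback chain (e.g. device-local + host-visible first, then plain host-visible) can be built by calling this helper with progressively weaker flag sets.&lt;/p&gt;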

&lt;p&gt;When dealing with dynamic resources, allocating in non-device-local host-visible memory generally works well – it simplifies resource management on the application side and is efficient due to GPU-side caching of read-only data. For resources that have a high degree of random access though, like dynamic textures, it’s better to allocate them in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT&lt;/code&gt; memory and upload data using staging buffers allocated in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT&lt;/code&gt; memory – similarly to how you would handle static textures. In some cases you might need to do this for buffers as well – while uniform buffers typically don’t suffer from this, in some applications using large storage buffers with highly random access patterns will generate too many PCIe transactions unless you copy the buffers to the GPU first; additionally, host memory has higher access latency from the GPU side, which can impact performance for many small draw calls.&lt;/p&gt;

&lt;p&gt;When allocating resources from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT&lt;/code&gt;, you can run out of memory if VRAM is oversubscribed; in this case you should fall back to allocating the resources in non-device-local &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT&lt;/code&gt; memory. Naturally, you should make sure that large frequently used resources such as render targets are allocated first. There are other things you can do in the event of oversubscription, such as migrating less frequently used resources from GPU memory to CPU memory – this is outside the scope of this article; additionally, on some operating systems like Windows 10, correct handling of oversubscription requires APIs that are not currently available in Vulkan.&lt;/p&gt;

&lt;h2 id=&quot;memory-suballocation&quot;&gt;Memory suballocation&lt;/h2&gt;

&lt;p&gt;While some other APIs offer the option of performing one memory allocation per resource, in Vulkan this is impractical for large applications – drivers are only required to support up to 4096 individual allocations. In addition to the total number being limited, allocations can be slow to perform, may waste memory due to assuming worst-case alignment requirements, and also require extra overhead during command buffer submission to ensure memory residency. Because of this, suballocation is necessary. A typical pattern of working with Vulkan involves performing large (e.g. 16 MB – 256 MB depending on how dynamic the memory requirements are) allocations using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkAllocateMemory&lt;/code&gt; and suballocating objects within this memory, effectively managing it yourself. Critically, the application needs to handle alignment of memory requests correctly, as well as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bufferImageGranularity&lt;/code&gt; limit that restricts valid configurations of buffers and images.&lt;/p&gt;
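
&lt;p&gt;At its simplest, suballocation within one large block is a bump (“linear”) allocator – a minimal sketch, ignoring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bufferImageGranularity&lt;/code&gt; (discussed next) and assuming power-of-two alignments:&lt;/p&gt;

```cpp
#include <cstdint>

// Minimal linear suballocator over one large VkDeviceMemory block (sketch;
// the returned offsets would be passed to vkBindBufferMemory/vkBindImageMemory).
struct LinearAllocator
{
    uint64_t size;
    uint64_t offset = 0;

    // Returns the suballocation offset, or UINT64_MAX when the block is full.
    // alignment must be a power of two, per VkMemoryRequirements.
    uint64_t allocate(uint64_t bytes, uint64_t alignment)
    {
        uint64_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + bytes > size) return UINT64_MAX;
        offset = aligned + bytes;
        return aligned;
    }

    void reset() { offset = 0; } // e.g. for per-frame scratch memory
};
```

&lt;p&gt;A production allocator would also support freeing individual ranges (e.g. via a TLSF or buddy scheme, as VMA does); the linear variant is still useful for per-frame data.&lt;/p&gt;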

&lt;p&gt;Briefly, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bufferImageGranularity&lt;/code&gt; restricts the relative placement of buffer and image resources in the same allocation, requiring additional padding between individual allocations. There are several ways to handle this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Always over-align image resources (as they typically have larger alignment to begin with) by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bufferImageGranularity&lt;/code&gt;, essentially using the maximum of the required alignment and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bufferImageGranularity&lt;/code&gt; for address and size alignment.&lt;/li&gt;
  &lt;li&gt;Track resource type for each allocation, and have the allocator add the requisite padding only if the previous or following resource is of a different type. This requires a somewhat more complex allocation algorithm.&lt;/li&gt;
  &lt;li&gt;Allocate images and buffers in separate Vulkan allocations, thus sidestepping the entire problem. This reduces internal fragmentation due to smaller alignment padding but can waste more memory if the backing allocations are too big (e.g. 256 MB).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On many GPUs the required alignment for image resources is substantially bigger than it is for buffers which makes the last option attractive – in addition to reducing waste due to lack of extra padding between buffers and images, it reduces internal fragmentation due to image alignment when an image follows a buffer resource. VMA provides implementations for option 2 (by default) and option 3 (see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT&lt;/code&gt;).&lt;/p&gt;
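
&lt;p&gt;Option 2 can be sketched as a bump allocator that remembers the kind of the previous suballocation and pads only at buffer/image boundaries. This is a simplification – the specification phrases the rule in terms of granularity-sized pages, so this version is conservative:&lt;/p&gt;

```cpp
#include <cstdint>

// Sketch of option 2: insert bufferImageGranularity padding only when a buffer
// is placed next to an image (or vice versa). Simplified/conservative relative
// to the spec's page-based rule.
enum class Kind { None, Buffer, Image };

struct GranularityAllocator
{
    uint64_t size, granularity; // granularity is a power of two per the spec
    uint64_t offset = 0;
    Kind last = Kind::None;

    static uint64_t alignUp(uint64_t v, uint64_t a) { return (v + a - 1) & ~(a - 1); }

    // Returns the suballocation offset, or UINT64_MAX when the block is full.
    uint64_t allocate(uint64_t bytes, uint64_t alignment, Kind kind)
    {
        uint64_t a = alignment;
        if (last != Kind::None && last != kind)
            a = a > granularity ? a : granularity; // pad at buffer/image boundary
        uint64_t aligned = alignUp(offset, a);
        if (aligned + bytes > size) return UINT64_MAX;
        offset = aligned + bytes;
        last = kind;
        return aligned;
    }
};
```

&lt;p&gt;Note how two buffers placed back to back only pay their own alignment, while an image following a buffer is pushed to the next granularity boundary.&lt;/p&gt;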

&lt;h2 id=&quot;dedicated-allocations&quot;&gt;Dedicated allocations&lt;/h2&gt;

&lt;p&gt;While the memory management model that Vulkan provides implies that the application performs large allocations and places many resources within one allocation using suballocation, on some GPUs it’s more efficient to allocate certain resources as one dedicated allocation. That way the driver can allocate the resources in faster memory under special circumstances.&lt;/p&gt;

&lt;p&gt;To that end, Vulkan provides an extension (core in 1.1) to perform dedicated allocations – when allocating memory, you can specify that you are allocating this memory for this individual resource instead of as an opaque blob. To know if this is worthwhile, you can query the extended memory requirements via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetImageMemoryRequirements2KHR&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetBufferMemoryRequirements2KHR&lt;/code&gt;; the resulting struct, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkMemoryDedicatedRequirementsKHR&lt;/code&gt;, will contain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requiresDedicatedAllocation&lt;/code&gt; (which might be set if the allocated resource needs to be shared with other processes) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prefersDedicatedAllocation&lt;/code&gt; flags.&lt;/p&gt;

&lt;p&gt;In general, applications may see performance improvements from dedicated allocations on large render targets that require a lot of read/write bandwidth depending on the hardware and drivers.&lt;/p&gt;

&lt;h2 id=&quot;mapping-memory&quot;&gt;Mapping memory&lt;/h2&gt;

&lt;p&gt;Vulkan provides two options when mapping memory to get a CPU-visible pointer:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Do this before CPU needs to write data to the allocation, and unmap once the write is complete&lt;/li&gt;
  &lt;li&gt;Do this right after the host-visible memory is allocated, and &lt;em&gt;never&lt;/em&gt; unmap memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second option is otherwise known as persistent mapping and is generally a better tradeoff – it minimizes the time it takes to obtain a writeable pointer (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkMapMemory&lt;/code&gt; is not particularly cheap on some drivers), removes the need to handle the case where multiple resources from the same memory object need to be written to simultaneously (calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkMapMemory&lt;/code&gt; on an allocation that’s already been mapped and not unmapped is not valid) and simplifies the code in general.&lt;/p&gt;
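
&lt;p&gt;A common use of persistent mapping is a per-frame upload/constant buffer; below is a minimal sketch where a plain byte array stands in for the pointer returned by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkMapMemory&lt;/code&gt;, and the 256-byte alignment is an assumption standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minUniformBufferOffsetAlignment&lt;/code&gt;:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of a persistently mapped per-frame upload buffer: vkMapMemory is
// called once at creation; each frame bumps an offset into its own region.
struct UploadBuffer
{
    std::vector<uint8_t> mapped; // stands in for the pointer from vkMapMemory
    uint64_t frameSize;
    uint64_t offset = 0;
    uint32_t frameIndex = 0;

    UploadBuffer(uint64_t frameSize, uint32_t framesInFlight)
        : mapped(frameSize * framesInFlight), frameSize(frameSize) {}

    // Called once the fence for this frame slot has signaled.
    void beginFrame(uint32_t frame) { frameIndex = frame; offset = 0; }

    // Copies data into the mapped region and returns the buffer offset, which
    // could feed the pDynamicOffsets argument of vkCmdBindDescriptorSets.
    uint64_t push(const void* data, uint64_t bytes, uint64_t alignment = 256)
    {
        offset = (offset + alignment - 1) & ~(alignment - 1);
        assert(offset + bytes <= frameSize); // real code would grow or chain buffers
        uint64_t at = frameIndex * frameSize + offset;
        memcpy(mapped.data() + at, data, bytes);
        offset += bytes;
        return at;
    }
};
```

&lt;p&gt;With multiple frames in flight, each frame writes to its own region, so the CPU never overwrites data the GPU is still reading.&lt;/p&gt;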

&lt;p&gt;The only downside is that this technique makes the 256 MB chunk of VRAM that is host-visible and device-local on AMD GPUs, described in “Memory heap selection”, less useful – on systems with Windows 7 and an AMD GPU, using persistent mapping on this memory may force WDDM to migrate the allocations to system memory. If this combination is a critical performance target for your users, then mapping and unmapping memory when needed might be more appropriate.&lt;/p&gt;

&lt;h1 id=&quot;descriptor-sets&quot;&gt;Descriptor sets&lt;/h1&gt;

&lt;p&gt;Unlike earlier APIs with a slot-based binding model, in Vulkan the application has more freedom in how to pass resources to shaders. Resources are grouped into descriptor sets that have an application-specified layout, and each shader can use several descriptor sets that can be bound individually. It’s the responsibility of the application to manage the descriptor sets to make sure that the CPU doesn’t update a descriptor set that’s in use by the GPU, and to provide a descriptor layout that has an optimal balance between CPU-side update cost and GPU-side access cost. In addition, since different rendering APIs use different models for resource binding and none of them match the Vulkan model exactly, using the API in an efficient and cross-platform way becomes a challenge. We will outline several possible approaches to working with Vulkan descriptor sets that strike different points on the scale of usability and performance.&lt;/p&gt;

&lt;h2 id=&quot;mental-model&quot;&gt;Mental model&lt;/h2&gt;

&lt;p&gt;When working with Vulkan descriptor sets, it’s useful to have a mental model of how they might map to hardware. One such possibility – and the expected design – is that descriptor sets map to a chunk of GPU memory that contains descriptors – opaque blobs of data, 16-64 bytes in size depending on the resource, that completely specify all resource parameters necessary for shaders to access resource data. When dispatching shader work, CPU can specify a limited number of pointers to descriptor sets; these pointers become available to shaders as the shader threads launch.&lt;/p&gt;

&lt;p&gt;With that in mind, Vulkan APIs can map more or less directly to this model – creating a descriptor set pool would allocate a chunk of GPU memory that’s large enough to contain the maximum specified number of descriptors. Allocating a set out of descriptor pool can be as simple as incrementing the pointer in the pool by the cumulative size of allocated descriptors as determined by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkDescriptorSetLayout&lt;/code&gt; (note that such an implementation would not support memory reclamation when freeing individual descriptors from the pool; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkResetDescriptorPool&lt;/code&gt; would set the pointer back to the start of pool memory and make the entire pool available for allocation again). Finally, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt; would emit command buffer commands that set GPU registers corresponding to descriptor set pointers.&lt;/p&gt;

&lt;p&gt;Note that this model ignores several complexities, such as dynamic buffer offsets, limited number of hardware resources for descriptor sets, etc. Additionally, this is just one possible implementation – some GPUs have a less generic descriptor model and require the driver to perform additional processing when descriptor sets are bound to the pipeline. However, it’s a useful model to plan for descriptor set allocation/usage.&lt;/p&gt;

&lt;h2 id=&quot;dynamic-descriptor-set-management&quot;&gt;Dynamic descriptor set management&lt;/h2&gt;

&lt;p&gt;Given the mental model above, you can treat descriptor sets as GPU-visible memory – it’s the responsibility of the application to group descriptor sets into pools and keep them around until the GPU is done reading them.&lt;/p&gt;

&lt;p&gt;A scheme that works well is to use free lists of descriptor set pools; whenever you need a descriptor set pool, you allocate one from the free list and use it for subsequent descriptor set allocations in the current frame on the current thread. Once you run out of descriptor sets in the current pool, you allocate a new pool. Any pools that were used in a given frame need to be kept around; once the frame has finished rendering, as determined by the associated fence objects, the descriptor set pools can be reset via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkResetDescriptorPool&lt;/code&gt; and returned to the free lists. While it’s possible to free individual descriptors from a pool via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT&lt;/code&gt;, this complicates memory management on the driver side and is not recommended.&lt;/p&gt;
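
&lt;p&gt;The free-list scheme above can be sketched as follows, with integer handles standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkDescriptorPool&lt;/code&gt; and the actual &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreateDescriptorPool&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkResetDescriptorPool&lt;/code&gt; calls elided:&lt;/p&gt;

```cpp
#include <vector>

// Sketch of descriptor pool recycling: pools used in a frame are parked until
// that frame's fence signals, then reset and returned to the free list.
using PoolHandle = int; // stands in for VkDescriptorPool

struct PoolRecycler
{
    std::vector<PoolHandle> freePools;
    std::vector<std::vector<PoolHandle>> usedPerFrame; // indexed by frame slot
    int nextPool = 0;

    PoolRecycler(int framesInFlight) : usedPerFrame(framesInFlight) {}

    PoolHandle createPool() { return nextPool++; } // would call vkCreateDescriptorPool

    // Called when the current pool runs out of sets for this frame/thread.
    PoolHandle acquire(int frameSlot)
    {
        PoolHandle p;
        if (!freePools.empty()) { p = freePools.back(); freePools.pop_back(); }
        else p = createPool();
        usedPerFrame[frameSlot].push_back(p);
        return p;
    }

    // Called once the fence for frameSlot has signaled.
    void recycle(int frameSlot)
    {
        for (PoolHandle p : usedPerFrame[frameSlot])
            freePools.push_back(p); // would call vkResetDescriptorPool first
        usedPerFrame[frameSlot].clear();
    }
};
```

&lt;p&gt;With per-thread free lists (or a mutex around this structure), the same scheme extends to multithreaded command recording.&lt;/p&gt;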

&lt;p&gt;When a descriptor set pool is created, application specifies the maximum number of descriptor sets allocated from it, as well as the maximum number of descriptors of each type that can be allocated from it. In Vulkan 1.1, the application doesn’t have to handle accounting for these limits – it can just call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkAllocateDescriptorSets&lt;/code&gt; and handle the error from that call by switching to a new descriptor set pool. Unfortunately, in Vulkan 1.0 without any extensions, it’s an error to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkAllocateDescriptorSets&lt;/code&gt; if the pool does not have available space, so the application must track the number of sets and descriptors of each type to know beforehand when to switch to a different pool.&lt;/p&gt;

&lt;p&gt;Different pipeline objects may use different numbers of descriptors, which raises the question of pool configuration. A straightforward approach is to create all pools with the same configuration that uses the worst-case number of descriptors for each type – for example, if each set can use at most 16 texture and 8 buffer descriptors, one can allocate all pools with maxSets=1024, and pool sizes 16*1024 for texture descriptors and 8*1024 for buffer descriptors. This approach can work, but in practice it can result in very significant memory waste for shaders with different descriptor counts – you can’t allocate more than 1024 descriptor sets out of a pool with the aforementioned configuration, so if most of your pipeline objects use 4 textures, you’ll be wasting 75% of the texture descriptor memory.&lt;/p&gt;

&lt;p&gt;Two alternatives that provide a better balance with respect to memory use are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Measure an average number of descriptors used in a shader pipeline per type for a characteristic scene and allocate pool sizes accordingly. For example, if in a given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700 buffer descriptors, then the average number of descriptors per set is 4.47 textures (rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configuration of a pool is maxSets=1024, 5*1024 texture descriptors, 1024 buffer descriptors. When a pool is out of descriptors of a given type, we allocate a new one – so this scheme is guaranteed to work and should be reasonably efficient on average.&lt;/li&gt;
  &lt;li&gt;Group shader pipeline objects into size classes, approximating common patterns of descriptor use, and pick descriptor set pools using the appropriate size class. This is an extension of the scheme described above to more than one size class. For example, it’s typical to have large numbers of shadow/depth prepass draw calls, and large numbers of regular draw calls in a scene – but these two groups have different numbers of required descriptors, with shadow draw calls typically requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets are used. To optimize memory use, it’s more appropriate to allocate descriptor set pools separately for shadow/depth and other draw calls. Similarly to general-purpose allocators that can have size classes that are optimal for a given application, this can still be managed in a lower-level descriptor set management layer as long as it’s configured with application specific descriptor set usages beforehand.&lt;/li&gt;
&lt;/ul&gt;
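
&lt;p&gt;The first option boils down to a bit of arithmetic; a sketch using the numbers from the example above:&lt;/p&gt;

```cpp
#include <cstdint>

// Derive pool sizes from a measured scene: divide per-type descriptor counts
// by the set count, rounding up, then scale by the pool's maxSets.
struct PoolConfig { uint32_t maxSets, textures, buffers; };

PoolConfig poolConfigFromStats(uint32_t sets, uint32_t textures, uint32_t buffers,
                               uint32_t maxSets = 1024)
{
    uint32_t texPerSet = (textures + sets - 1) / sets; // ceil average per set
    uint32_t bufPerSet = (buffers + sets - 1) / sets;
    return { maxSets, texPerSet * maxSets, bufPerSet * maxSets };
}
```

&lt;p&gt;For the scene above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;poolConfigFromStats(3000, 13400, 1700)&lt;/code&gt; yields maxSets=1024 with 5*1024 texture descriptors and 1024 buffer descriptors, matching the configuration in the text.&lt;/p&gt;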

&lt;h2 id=&quot;choosing-appropriate-descriptor-types&quot;&gt;Choosing appropriate descriptor types&lt;/h2&gt;

&lt;p&gt;For each resource type, Vulkan provides several options for accessing it in a shader; the application is responsible for choosing an optimal descriptor type.&lt;/p&gt;

&lt;p&gt;For buffers, the application must choose between uniform and storage buffers, and whether to use dynamic offsets or not. Uniform buffers have a limit on the maximum addressable size – on desktop hardware, you get up to 64 KB of data; however, on mobile hardware some GPUs only provide 16 KB of data (which is also the guaranteed minimum by the specification). The buffer resource can be larger than that, but a shader can only access this much data through one descriptor.&lt;/p&gt;

&lt;p&gt;On some hardware, there is no difference in access speed between uniform and storage buffers, however for other hardware depending on the access pattern uniform buffers can be significantly faster. Prefer uniform buffers for small to medium sized data, especially if the access pattern is fixed (e.g. for a buffer with material or scene constants). Storage buffers are more appropriate when you need large arrays of data that exceed the uniform buffer limit and are indexed dynamically in the shader.&lt;/p&gt;

&lt;p&gt;For textures, if filtering is required, there is a choice of combined image/sampler descriptor (where, like in OpenGL, descriptor specifies both the source of the texture data, and the filtering/addressing properties), separate image and sampler descriptors (which maps better to Direct3D 11 model), and image descriptor with an immutable sampler descriptor, where the sampler properties must be specified when pipeline object is created.&lt;/p&gt;

&lt;p&gt;The relative performance of these methods is highly dependent on the usage pattern; however, in general immutable descriptors map better to the recommended usage model in other newer APIs like Direct3D 12, and give driver more freedom to optimize the shader. This does alter renderer design to a certain extent, making it necessary to implement certain dynamic portions of the sampler state, like per-texture LOD bias for texture fade-in during streaming, using shader ALU instructions.&lt;/p&gt;

&lt;h2 id=&quot;slot-based-binding&quot;&gt;Slot-based binding&lt;/h2&gt;

&lt;p&gt;A simplistic alternative to the Vulkan binding model is the Metal/Direct3D 11 model, where an application binds resources to slots, and the runtime/driver manages descriptor memory and descriptor set parameters. This model can be implemented on top of Vulkan descriptor sets; while not providing the most optimal results, it generally is a good model to start with when porting an existing renderer, and with careful implementation it can be surprisingly efficient.&lt;/p&gt;

&lt;p&gt;To make this model work, application needs to decide how many resource namespaces are there and how they map to Vulkan set/slot indices. For example, in Metal each stage (VS, FS, CS) has three resource namespaces – textures, buffers, samplers – with no differentiation between e.g. uniform buffers and storage buffers. In Direct3D 11 the namespaces are more complicated since read-only structured buffers belong to the same namespace as textures, but textures and buffers used with unordered access reside in a separate one.&lt;/p&gt;

&lt;p&gt;The Vulkan specification only guarantees a minimum of 4 descriptor sets accessible to the entire pipeline (across all stages); because of this, the most convenient mapping option is to have resource bindings match across all stages – for example, texture slot 3 would contain the same texture resource no matter what stage it’s accessed from – and use different descriptor sets for different types, e.g. set 0 for buffers, set 1 for textures, set 2 for samplers. Alternatively, an application can use one descriptor set per stage&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; and perform static index remapping (e.g. slots 0-16 would be used for textures, slots 17-24 for uniform buffers, etc.) – this, however, can use much more descriptor set memory and isn’t recommended. Finally, one could implement optimally compact dynamic slot remapping for each shader stage (e.g. if a vertex shader uses texture slots 0, 4, 5, then they map to Vulkan descriptor indices 0, 1, 2 in set 0), with the application extracting the relevant texture information at runtime using this remapping table.&lt;/p&gt;

&lt;p&gt;In all these cases, the implementation of setting a texture to a given slot wouldn’t generally run any Vulkan commands and would just update shadow state; just before the draw call or dispatch you’d need to allocate a descriptor set from the appropriate pool, update it with new descriptors, and bind all descriptor sets using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt;. Note that if a descriptor set has 5 resources, and only one of them changed since the last draw call, you still need to allocate a new descriptor set with 5 resources and update all of them.&lt;/p&gt;

&lt;p&gt;To reach good performance with this approach, you need to follow several guidelines:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Don’t allocate or update descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don’t need to allocate/update the descriptor set with texture descriptors.&lt;/li&gt;
  &lt;li&gt;Batch calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkAllocateDescriptorSets&lt;/code&gt; if possible – on some drivers, each call has measurable overhead, so if you need to update multiple sets, allocating them in one call can be faster.&lt;/li&gt;
  &lt;li&gt;To update descriptor sets, either use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkUpdateDescriptorSets&lt;/code&gt; with descriptor write array, or use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkUpdateDescriptorSetWithTemplate&lt;/code&gt; from Vulkan 1.1. Using the descriptor copy functionality of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkUpdateDescriptorSets&lt;/code&gt; is tempting with dynamic descriptor management for copying most descriptors out of a previously allocated array, but this can be slow on drivers that allocate descriptors out of write-combined memory. Descriptor templates can reduce the amount of work application needs to do to perform updates – since in this scheme you need to read descriptor information out of shadow state maintained by application, descriptor templates allow you to tell the driver the layout of your shadow state, making updates substantially faster on some drivers.&lt;/li&gt;
  &lt;li&gt;Finally, prefer dynamic uniform buffers to updating uniform buffer descriptors. Dynamic uniform buffers allow specifying offsets into buffer objects using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pDynamicOffsets&lt;/code&gt; argument of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt; without allocating and updating new descriptors. This works well with dynamic constant management where constants for draw calls are allocated out of large uniform buffers; it substantially reduces CPU overhead and can be more efficient on the GPU. While on some GPUs the number of dynamic buffers must be kept small to avoid extra overhead in the driver, one or two dynamic uniform buffers should work well in this scheme on all architectures.&lt;/li&gt;
&lt;/ul&gt;
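
&lt;p&gt;The first guideline – skipping descriptor set allocation and updates when nothing changed – can be implemented with CPU-side shadow state; a sketch for a single set of texture slots, with a 64-bit handle standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkImageView&lt;/code&gt;:&lt;/p&gt;

```cpp
#include <cstdint>

// Sketch of slot-based shadow state: setTexture only updates CPU-side state;
// flush() reports whether a new descriptor set must be allocated and written.
using TextureHandle = uint64_t; // stands in for VkImageView

struct TextureSlots
{
    static const int kSlots = 16;
    TextureHandle slots[kSlots] = {};
    bool dirty = false;

    void setTexture(int slot, TextureHandle t)
    {
        if (slots[slot] == t) return; // redundant bind: no descriptor update
        slots[slot] = t;
        dirty = true;
    }

    // Called before vkCmdDraw*: when true, the caller allocates a fresh
    // descriptor set, writes all kSlots descriptors from `slots`, and binds it.
    bool flush()
    {
        bool need = dirty;
        dirty = false;
        return need;
    }
};
```

&lt;p&gt;The same pattern extends to buffer and sampler slots; with descriptor update templates, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slots&lt;/code&gt; array itself can serve as the source data for the update.&lt;/p&gt;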

&lt;p&gt;In general, the approach outlined above can be very efficient in terms of performance – it’s not as efficient as approaches with more static descriptor sets that are described below, but it can still run circles around older APIs if implemented carefully. Unfortunately, on some drivers the allocate &amp;amp; update path is not very fast – on some mobile hardware in particular, it may make sense to cache descriptor sets based on the descriptors they contain if they can be reused later in the frame.&lt;/p&gt;

&lt;h2 id=&quot;frequency-based-descriptor-sets&quot;&gt;Frequency-based descriptor sets&lt;/h2&gt;

&lt;p&gt;While the slot-based resource binding model is simple and familiar, it doesn’t result in optimal performance. Some mobile hardware may not support multiple descriptor sets; in general, however, the Vulkan API and drivers expect an application to manage descriptor sets based on frequency of change.&lt;/p&gt;

&lt;p&gt;A more Vulkan-centric renderer would organize the data that shaders need to access into groups by frequency of change, and use individual sets for individual frequencies, with set=0 representing the least frequent changes and set=3 the most frequent. For example, a typical setup would involve:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Set=0 descriptor set containing uniform buffer with global, per-frame or per-view data, as well as globally available textures such as shadow map texture array/atlas&lt;/li&gt;
  &lt;li&gt;Set=1 descriptor set containing uniform buffer and texture descriptors for per-material data, such as albedo map, Fresnel coefficients, etc.&lt;/li&gt;
  &lt;li&gt;Set=2 descriptor set containing dynamic uniform buffer with per-draw data, such as world transform array&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For set=0, the expectation is that it only changes a handful of times per frame; it’s sufficient to use a dynamic allocation scheme similar to the previous section.&lt;/p&gt;

&lt;p&gt;For set=1, the expectation is that for most objects, the material data persists between frames, and as such could be allocated and updated only when the gameplay code changes material data.&lt;/p&gt;

&lt;p&gt;For set=2, the data would be completely dynamic; due to the use of a dynamic uniform buffer, we’d rarely need to allocate and update this descriptor set – assuming dynamic constants are uploaded to a series of large per-frame buffers, for most draws we’d need to update the buffer with the constant data, and call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt; with new offsets.&lt;/p&gt;

&lt;p&gt;Note that due to compatibility rules between pipeline objects, in most cases it’s enough to bind sets 1 and 2 whenever a material changes, and only set 2 when material is the same as that for the previous draw call. This results in just one call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt; per draw call.&lt;/p&gt;
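
&lt;p&gt;This per-draw decision can be sketched as a small helper that returns the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;firstSet&lt;/code&gt; argument for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt; (with the cached material reset at the start of each command buffer):&lt;/p&gt;

```cpp
#include <cstdint>

// Decide which descriptor sets to rebind for a draw: when the material
// changed, rebind sets 1 (material) and 2 (per-draw) together; otherwise bind
// only set 2, which merely supplies a new dynamic uniform buffer offset.
// Returns the firstSet argument for vkCmdBindDescriptorSets.
uint32_t firstSetToBind(uint32_t& boundMaterial, uint32_t material)
{
    if (material != boundMaterial)
    {
        boundMaterial = material;
        return 1; // bind sets 1 and 2 in one call
    }
    return 2; // bind only set 2 with a new dynamic offset
}
```

&lt;p&gt;Set 0 is bound once per pass; either way each draw costs at most one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindDescriptorSets&lt;/code&gt; call, relying on the pipeline layout compatibility rules to keep lower-numbered sets bound.&lt;/p&gt;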

&lt;p&gt;For a complex renderer, different shaders might need to use different layouts – for example, not all shaders need to agree on the same layout for material data. In rare cases it might also make sense to use more than 3 sets depending on the frame structure. Additionally, given the flexibility of Vulkan it’s not strictly required to use the same resource binding system for all draw calls in the scene. For example, post-processing draw call chains tend to be highly dynamic, with texture/constant data changing completely between individual draw calls. Some renderers initially implement the dynamic slot-based binding model from the previous section and proceed to additionally implement the frequency-based sets for world rendering to minimize the performance penalty for set management, while still keeping the simplicity of slot-based model for more dynamic parts of the rendering pipeline.&lt;/p&gt;

&lt;p&gt;The scheme described above assumes that in most cases, per-draw data is larger than the size that can be efficiently set via push constants. Push constants can be set without updating or rebinding descriptor sets; with a guaranteed limit of 128 bytes per draw call, it’s tempting to use them for per-draw data such as a 4x3 transform matrix for an object. However, on some architectures the number of push constants that can be accessed quickly depends on the descriptor setup the shaders use, and may be closer to 12 bytes or so. Exceeding this limit can force the driver to spill the push constants into a driver-managed ring buffer, which can end up being more expensive than moving this data to a dynamic uniform buffer on the application side. While limited use of push constants may still be a good idea for some designs, they are more appropriate in the fully bindless scheme described in the next section.&lt;/p&gt;

&lt;h2 id=&quot;bindless-descriptor-designs&quot;&gt;Bindless descriptor designs&lt;/h2&gt;

&lt;p&gt;Frequency-based descriptor sets reduce the descriptor set binding overhead; however, you still need to bind one or two descriptor sets per draw call. Maintaining material descriptor sets requires a management layer that needs to update GPU-visible descriptor sets whenever material parameters change; additionally, since texture descriptors are cached in material data, global texture streaming systems become hard to deal with – whenever some mipmap levels in a texture get streamed in or out, all materials that refer to this texture need to be updated. This requires complex interaction between the material system and the texture streaming system, and introduces extra overhead whenever a texture is adjusted – which partially offsets the benefits of the frequency-based scheme. Finally, due to the need to set up descriptor sets per draw call, it’s hard to adapt any of the aforementioned schemes to GPU-based culling or command submission.&lt;/p&gt;

&lt;p&gt;It is possible to design a bindless scheme where the number of required set binding calls is constant for the world rendering, which decouples texture descriptors from materials, making texture streaming systems easier to implement, and facilitates GPU-based submission. As with the previous scheme, this can be combined with dynamic ad-hoc descriptor updates for parts of the scene where the number of draw calls is small, and flexibility is important, such as post-processing.&lt;/p&gt;

&lt;p&gt;To fully leverage bindless, core Vulkan may or may not be sufficient; some bindless implementations require updating descriptor sets without rebinding them after the update, which is not available in core Vulkan 1.0 or 1.1 but is possible to achieve with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_EXT_descriptor_indexing&lt;/code&gt; extension (core in Vulkan 1.2). However, the basic design described below can work without extensions, given high enough descriptor set limits. In that case, updating individual descriptors requires double buffering the texture descriptor array described below, since the array would otherwise be constantly accessed by the GPU.&lt;/p&gt;

&lt;p&gt;Similarly to the frequency-based design, we’ll split the shader data into global uniforms and textures (set 0), material data and per-draw data. Global uniforms and textures can be specified via a descriptor set in the same way as described in the previous section.&lt;/p&gt;

&lt;p&gt;For per-material data, we will move the texture descriptors into a large texture descriptor array (note: this is a different concept than a texture array – a texture array uses one descriptor and forces all textures to have the same size and format; a descriptor array doesn’t have this limitation and can contain arbitrary texture descriptors as array elements, including texture array descriptors). Instead of a texture descriptor, each material will store indices into this array as part of its material data, alongside other material constants.&lt;/p&gt;

&lt;p&gt;All material constants for all materials in the scene will reside in one large storage buffer; while it’s possible to support multiple material types with this scheme, for simplicity we’ll assume that all materials can be specified using the same data. An example of material data structure is below:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MaterialData&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;albedoTint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tilingX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tilingY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reflectance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unused0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// pad to vec4&lt;/span&gt;

	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;albedoTexture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalTexture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roughnessTexture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unused1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// pad to vec4&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
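&lt;p&gt;A CPU-side mirror of this structure can be validated at build or test time; the sketch below assumes the storage buffer uses the std430-style layout implied by the explicit vec4 padding, so a plain C struct matches it field for field and material records can be copied into the buffer directly:&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// C-side mirror of the GLSL MaterialData; the explicit unused0/unused1 pads
// keep every member at the offset the std430 layout rules produce, so the
// CPU can memcpy material records straight into the storage buffer.
typedef struct
{
	float albedoTint[4]; // vec4

	float tilingX;
	float tilingY;
	float reflectance;
	float unused0; // pad to vec4

	uint32_t albedoTexture;
	uint32_t normalTexture;
	uint32_t roughnessTexture;
	uint32_t unused1; // pad to vec4
} MaterialData;
```

&lt;p&gt;Asserting the offsets and total size in a unit test catches accidental layout drift between the C and GLSL declarations.&lt;/p&gt;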

&lt;p&gt;Similarly, all per-draw constants for all objects in the scene can reside in another large storage buffer; for simplicity, we’ll assume that all per-draw constants have identical structure. To support skinned objects in a scheme like this, we’ll extract transform data into a separate, third storage buffer:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformData&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Something that we’ve ignored so far is the vertex data specification. While Vulkan provides a first-class way to specify vertex data by calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindVertexBuffers&lt;/code&gt;, having to bind vertex buffers per-draw would not work for a fully bindless design. Additionally, some hardware doesn’t support vertex buffers as a first-class entity, and the driver has to emulate vertex buffer binding, which causes some CPU-side slowdowns when using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindVertexBuffers&lt;/code&gt;. In a fully bindless design, we need to assume that all vertex buffers are suballocated in one large buffer and either use per-draw vertex offsets (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vertexOffset&lt;/code&gt; argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdDrawIndexed&lt;/code&gt;) to have hardware fetch data from it, or pass an offset in this buffer to the shader with each draw call and fetch data from the buffer in the shader. Both approaches can work well, and might be more or less efficient depending on the GPU; here we will assume that the vertex shader will perform manual vertex fetching.&lt;/p&gt;

&lt;p&gt;Thus, for each draw call we need to specify three integers to the shader:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Material index; used to look up material data from material storage buffer. The textures can then be accessed using the indices from the material data and the descriptor array.&lt;/li&gt;
  &lt;li&gt;Transform data index; used to look up transform data from transform storage buffer&lt;/li&gt;
  &lt;li&gt;Vertex data offset; used to look up vertex attributes from vertex storage buffer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can specify these indices and additional data, if necessary, via draw data:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DrawData&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;materialIndex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transformOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertexOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;uint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unused0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// vec4 padding&lt;/span&gt;

	&lt;span class=&quot;c1&quot;&gt;// ... extra gameplay data goes here&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The shader will need to access storage buffers containing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MaterialData&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TransformData&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DrawData&lt;/code&gt; as well as a storage buffer containing vertex data. These can be bound to the shader via the global descriptor set; the only remaining piece of information is the draw data index, which can be passed via a push constant.&lt;/p&gt;

&lt;p&gt;With this scheme, we’d need to update the storage buffers used by materials and draw calls each frame and bind them once using our global descriptor set; additionally, we need to bind index data – assuming that, like vertex data, index data is allocated in one large index buffer, we only need to bind it once using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindIndexBuffer&lt;/code&gt;. With the global setup complete, for each draw call we need to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindPipeline&lt;/code&gt; if the shader changes, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdPushConstants&lt;/code&gt; to specify an index into the draw data buffer&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdDrawIndexed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In a GPU-centric design, we can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdDrawIndirect&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdDrawIndirectCountKHR&lt;/code&gt; (provided by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_KHR_draw_indirect_count&lt;/code&gt; extension, promoted to core in Vulkan 1.2) and fetch per-draw constants using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gl_DrawIDARB&lt;/code&gt; (provided by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_KHR_shader_draw_parameters&lt;/code&gt; extension) as an index instead of push constants. The only caveat is that for GPU-based submission, we’d need to bucket draw calls based on the pipeline object on the CPU, since there’s no support for switching pipeline objects otherwise.&lt;/p&gt;
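&lt;p&gt;That CPU-side bucketing step can be as simple as sorting the draw list by pipeline; the sketch below (with illustrative names) counts the resulting buckets, each of which would map to one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBindPipeline&lt;/code&gt; followed by one indirect draw covering the run:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical CPU-side bucketing for GPU-driven submission: draws are
// sorted by pipeline id so that each contiguous run can be submitted with a
// single pipeline bind followed by one indirect draw covering the run.
typedef struct
{
	uint32_t pipeline;  // opaque pipeline id
	uint32_t drawIndex; // index into the DrawData buffer
} DrawRecord;

static int byPipeline(const void* a, const void* b)
{
	uint32_t pa = ((const DrawRecord*)a)->pipeline;
	uint32_t pb = ((const DrawRecord*)b)->pipeline;
	return (pa > pb) - (pa < pb);
}

// Sorts draws in place and returns the number of pipeline buckets, i.e. the
// number of bind+indirect-draw pairs the frame needs.
uint32_t bucketDraws(DrawRecord* draws, uint32_t count)
{
	qsort(draws, count, sizeof(DrawRecord), byPipeline);

	uint32_t buckets = 0;
	for (uint32_t i = 0; i < count; ++i)
		if (i == 0 || draws[i].pipeline != draws[i - 1].pipeline)
			buckets++;
	return buckets;
}
```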

&lt;p&gt;With this, vertex shader code to transform the vertex could look like this:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DrawData&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drawData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gl_DrawIDARB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TransformData&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;td&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transformData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transformOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;positionLocal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vec4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positionData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gl_VertexIndex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;vec3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;positionWorld&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mat4x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;td&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;td&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;td&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;positionLocal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
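&lt;p&gt;For reference, the same transform can be expressed on the CPU; this sketch assumes the three vec4s in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TransformData&lt;/code&gt; store the rows of a 3x4 affine object-to-world matrix, so each world coordinate is a dot product of one row with the homogeneous local position:&lt;/p&gt;

```c
#include <assert.h>

static float dot4(const float r[4], float x, float y, float z, float w)
{
	return r[0] * x + r[1] * y + r[2] * z + r[3] * w;
}

// Transforms a local-space point by a 3x4 affine matrix stored as three
// row vec4s, mirroring the vertex shader's mat4x3 multiply.
void transformPoint(const float rows[3][4], const float local[3], float world[3])
{
	for (int i = 0; i < 3; ++i)
		world[i] = dot4(rows[i], local[0], local[1], local[2], 1.0f);
}
```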

&lt;p&gt;Fragment shader code to sample material textures could look like this:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DrawData&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drawData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;drawId&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MaterialData&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;md&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;materialData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;materialIndex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;vec4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;albedo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;texture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sampler2D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;materialTextures&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;albedoTexture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;albedoSampler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;vec2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tilingX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tilingY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This scheme minimizes the CPU-side overhead. Of course, fundamentally it’s a balance between multiple factors:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;While the scheme can be extended to multiple formats of material, draw and vertex data, it gets harder to manage&lt;/li&gt;
  &lt;li&gt;Using storage buffers exclusively instead of uniform buffers can increase GPU time on some architectures&lt;/li&gt;
  &lt;li&gt;Fetching texture descriptors from an array indexed by material data indexed by material index can add an extra indirection on GPU compared to some alternative designs&lt;/li&gt;
  &lt;li&gt;On some hardware, various descriptor set limits may make this technique impractical to implement; to be able to index an arbitrary texture dynamically from the shader, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxPerStageDescriptorSampledImages&lt;/code&gt; should be large enough to accommodate all material textures - while many desktop drivers expose a large limit here, the specification only guarantees a limit of 16, so bindless remains out of reach on some hardware that otherwise supports Vulkan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As renderers get more complex, bindless designs will become more involved and eventually allow moving even larger parts of the rendering pipeline to the GPU; due to hardware constraints this design is not practical on every Vulkan-compatible device, but it’s definitely worth considering when designing new rendering paths for future hardware.&lt;/p&gt;

&lt;h1 id=&quot;command-buffer-recording-and-submission&quot;&gt;Command buffer recording and submission&lt;/h1&gt;

&lt;p&gt;In older APIs, there is a single timeline for GPU commands; commands recorded on the CPU execute on the GPU in the same order, as there is generally only one thread recording them. There is no precise control over when the CPU submits commands to the GPU, and the driver is expected to manage the memory used by the command stream, as well as the submission points, optimally.&lt;/p&gt;

&lt;p&gt;In contrast, in Vulkan the application is responsible for managing command buffer memory, recording commands in multiple threads into multiple command buffers, and submitting them for execution with appropriate granularity. While with carefully written code a single-core Vulkan renderer can be significantly faster than older APIs, the peak efficiency and minimal latency is obtained by utilizing many cores in the system for command recording, which requires careful memory management.&lt;/p&gt;

&lt;h2 id=&quot;mental-model-1&quot;&gt;Mental model&lt;/h2&gt;

&lt;p&gt;Similarly to descriptor sets, command buffers are allocated out of command pools; it’s valuable to understand how a driver might implement this to be able to reason about the costs and usage implications.&lt;/p&gt;

&lt;p&gt;A command pool has to manage memory that will be filled with commands by the CPU and subsequently read by the GPU command processor. The amount of memory used by the commands can’t be statically determined; a typical implementation of a pool would thus involve a free list of fixed-size pages. A command buffer would contain a list of pages with actual commands, with special jump commands that transfer control from each page to the next one so that the GPU can execute all of them in sequence. Whenever a command needs to be allocated from a command buffer, it will be encoded into the current page; if the current page doesn’t have space, the driver would allocate the next page using the free list from the associated pool, encode a jump to that page into the current page, and switch to the next page for subsequent command recording.&lt;/p&gt;
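&lt;p&gt;A toy model of this page-based scheme (purely illustrative, not an actual driver implementation) might look like this:&lt;/p&gt;

```c
#include <assert.h>

// Toy model of a driver-side command pool: a free list of fixed-size pages,
// handed out to command buffers as they record; resetting the pool returns
// every page at once. The page count is arbitrary for the example.
#define PAGE_COUNT 8

typedef struct
{
	int freeList[PAGE_COUNT]; // stack of free page indices
	int freeCount;
} CommandPool;

void poolInit(CommandPool* p)
{
	p->freeCount = PAGE_COUNT;
	for (int i = 0; i < PAGE_COUNT; ++i)
		p->freeList[i] = i;
}

// Called when the current page fills up; returns a page index or -1.
int poolAcquirePage(CommandPool* p)
{
	return p->freeCount > 0 ? p->freeList[--p->freeCount] : -1;
}

// Models vkResetCommandPool: all pages become available again in one step,
// without tracking which command buffer held which page.
void poolReset(CommandPool* p)
{
	poolInit(p);
}
```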

&lt;p&gt;Each command pool can only be used from one thread concurrently, so the operations above don’t need to be thread-safe&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. Freeing the command buffer using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkFreeCommandBuffers&lt;/code&gt; may return the pages used by the command buffer into the pool by adding them to the free list. Resetting the command pool may put all pages used by all command buffers into the pool free list; when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT&lt;/code&gt; is used, the pages can be returned to the system so that other pools can reuse them.&lt;/p&gt;

&lt;p&gt;Note that there is no guarantee that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkFreeCommandBuffers&lt;/code&gt; actually returns memory to the pool; alternative designs may involve multiple command buffers allocating chunks within larger pages, which would make it hard for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkFreeCommandBuffers&lt;/code&gt; to recycle memory. Indeed, on one mobile vendor, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkResetCommandPool&lt;/code&gt; is necessary to reuse memory for future command recording in a default setup when pools are allocated without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;multi-threaded-command-recording&quot;&gt;Multi-threaded command recording&lt;/h2&gt;

&lt;p&gt;Two crucial restrictions in Vulkan for command pool usage are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Command buffers allocated from one pool may not be recorded concurrently by multiple threads&lt;/li&gt;
  &lt;li&gt;Command buffers and pools can not be freed or reset while GPU is still executing the associated commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these, a typical threading setup requires a set of command buffer pools. The set has to contain F*T pools, where F is the frame queue length – F is usually 2 (one frame is recorded by the CPU while another frame is being executed by the GPU) or 3; T is the number of threads that can concurrently record commands, which can be as high as the core count on the system. When recording commands from a thread, the thread needs to allocate a command buffer using the pool associated with the current frame &amp;amp; thread and record commands into it. Assuming that command buffers aren’t recorded across a frame boundary, and that at a frame boundary the frame queue length is enforced by waiting for the last frame in the queue to finish executing, we can then free all command buffers allocated for that frame and reset all associated command pools.&lt;/p&gt;
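&lt;p&gt;The pool selection described above reduces to a simple indexing scheme over a flat array of F*T pools; the constants below are example values, not requirements:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// Sketch of the F*T pool set: with a frame queue length F and T recording
// threads, each (frame, thread) pair maps to a dedicated command pool, so no
// pool is ever used by two threads at once, or reset while the GPU still
// reads it (the frame fence wait guarantees the latter).
enum { F = 2, T = 4 }; // e.g. double buffering, 4 recording threads

// Index into a flat array of F*T VkCommandPool handles.
uint32_t poolIndex(uint32_t frameNumber, uint32_t threadIndex)
{
	return (frameNumber % F) * T + threadIndex;
}
```

&lt;p&gt;Frame N+F reuses the pools of frame N, which is safe exactly because the frame queue length is enforced by waiting on frame N’s fence first.&lt;/p&gt;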

&lt;p&gt;Additionally, instead of freeing command buffers, it’s possible to reuse them after calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkResetCommandPool&lt;/code&gt; - which would mean that command buffers don’t have to be allocated again. While in theory allocating command buffers could be cheap, some driver implementations have a measurable overhead associated with command buffer allocation. This also makes sure that the driver never needs to return command memory to the system, which can make submitting commands into these buffers cheaper.&lt;/p&gt;

&lt;p&gt;Note that depending on the frame structure, the setup above may result in unbalanced memory consumption across threads; for example, shadow draw calls typically require less setup and less command memory. When combined with effectively random workload distribution across threads that many job schedulers produce, this can result in all command pools getting sized for the worst-case consumption. If an application is memory constrained and this becomes a problem, it’s possible to limit the parallelism for each individual pass and select the command buffer/pool based on the recorded pass to limit the waste.&lt;/p&gt;

&lt;p&gt;This requires introducing the concept of size classes to the command buffer manager. With a command pool per thread and a manual reuse of allocated command buffers as suggested above, it’s possible to keep a free list per size class, with size classes defined based on the number of draw calls (e.g. “&amp;lt;100”, “100-400”, etc.) and/or the complexity of individual draw calls (depth-only, gbuffer). Picking the buffer based on the expected usage leads to a more stable memory consumption. Additionally, for passes that are too small it is worthwhile to reduce the parallelism when recording these - for example, if a pass has &amp;lt;100 draw calls, instead of splitting it into 4 recording jobs on a 4-core system, it can be more efficient to record it in one job since that can reduce the overhead of command memory management and command buffer submission.&lt;/p&gt;
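&lt;p&gt;A size class selector could be as simple as the following sketch; the draw count thresholds are the example values mentioned above, not tuned numbers:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// Illustrative size classes for command buffer reuse, keyed by the expected
// draw count of the pass; a real renderer might also key on pass complexity
// (depth-only vs gbuffer) as the text suggests.
typedef enum { SIZE_SMALL, SIZE_MEDIUM, SIZE_LARGE } SizeClass;

SizeClass pickSizeClass(uint32_t expectedDraws)
{
	if (expectedDraws < 100)
		return SIZE_SMALL;
	if (expectedDraws < 400)
		return SIZE_MEDIUM;
	return SIZE_LARGE;
}
```

&lt;p&gt;Keeping a free list of reset command buffers per size class then lets each pass pick a buffer whose command memory footprint roughly matches its workload.&lt;/p&gt;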

&lt;h2 id=&quot;command-buffer-submission&quot;&gt;Command buffer submission&lt;/h2&gt;

&lt;p&gt;While it’s important to record multiple command buffers on multiple threads for efficiency, since state isn’t reused across command buffers and there are other scheduling limitations, command buffers need to be reasonably large to make sure GPU is not idle during command processing. Additionally, each submission has some overhead both on the CPU side and on the GPU side. In general a Vulkan application should target &amp;lt;10 submits per frame (with each submit accounting for 0.5ms or more of GPU workload), and &amp;lt;100 command buffers per frame (with each command buffer accounting for 0.1ms or more of GPU workload). This might require adjusting the concurrency limits for command recording for individual passes, e.g. if a shadow pass for a specific light has &amp;lt;100 draw calls, it might be necessary to limit the concurrency on the recording for this pass to just one thread; additionally, for even shorter passes combining them with neighboring passes into one command buffer becomes beneficial. Finally, the fewer submissions a frame has the better – this needs to be balanced with submitting enough GPU work earlier in the frame to increase CPU and GPU parallelism though, for example it might make sense to submit all command buffers for shadow rendering before recording commands for other parts of the frame.&lt;/p&gt;

&lt;p&gt;Crucially, the number of submissions refers to the total number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubmitInfo&lt;/code&gt; structures submitted in all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; calls in a frame, not to the number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; calls per se. For example, when submitting 10 command buffers, it’s much more efficient to use one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubmitInfo&lt;/code&gt; that submits 10 command buffers compared to 10 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubmitInfo&lt;/code&gt; structures with one command buffer each, even if in both cases only one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; call is performed. Essentially, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubmitInfo&lt;/code&gt; is a unit of synchronization/scheduling on the GPU since it has its own set of fences/semaphores.&lt;/p&gt;

&lt;h2 id=&quot;secondary-command-buffers&quot;&gt;Secondary command buffers&lt;/h2&gt;

&lt;p&gt;When one of the render passes in the application contains a lot of draw calls, such as the gbuffer pass, for CPU submission efficiency it’s important to split the draw calls into multiple groups and record them on multiple threads. There are two ways to do this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Record primary command buffers that render chunks of draw calls into the same framebuffer, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBeginRenderPass&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdEndRenderPass&lt;/code&gt;; execute the resulting command buffers using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; (batching submits for efficiency)&lt;/li&gt;
  &lt;li&gt;Record secondary command buffers that render chunks of draw calls, passing the render pass to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkBeginCommandBuffer&lt;/code&gt; along with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT&lt;/code&gt;; use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdBeginRenderPass&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS&lt;/code&gt; in the primary command buffer, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdExecuteCommands&lt;/code&gt; to execute all recorded secondary command buffers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While on immediate mode GPUs the first approach can be viable, and it can be a bit easier to manage wrt synchronization points on the CPU, it’s vital to use the second approach on GPUs that use tiled rendering instead. Using the first approach on tilers would require that the contents of the tiles are flushed to memory and loaded back from memory between each command buffer, which is catastrophic for performance.&lt;/p&gt;

&lt;h2 id=&quot;command-buffer-reuse&quot;&gt;Command buffer reuse&lt;/h2&gt;

&lt;p&gt;With the guidance on command buffer submission above, in most cases submitting a single command buffer multiple times after recording becomes impractical. In general, approaches that pre-record command buffers for parts of the scene are counter-productive: they can result in excessive GPU load, due to the inefficient culling required to keep the command buffer workload large, and can trigger inefficient code paths on some tiled renderers. Instead, applications should focus on improving threading and draw call submission cost on the CPU. As such, applications should use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT&lt;/code&gt; to make sure the driver has the freedom to generate commands that don’t need to be replayed more than once.&lt;/p&gt;

&lt;p&gt;There are occasional exceptions to this rule. For example, for VR rendering, an application might want to record the command buffer for the combined frustum between the left and right eye once. If the per-eye data is read out of a single uniform buffer, this buffer can then be updated between the command buffers using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdUpdateBuffer&lt;/code&gt;, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdExecuteCommands&lt;/code&gt; if secondary command buffers are used, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkQueueSubmit&lt;/code&gt; otherwise. Having said that, for VR it might be worthwhile to explore the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_KHR_multiview&lt;/code&gt; extension if available (core in Vulkan 1.1), since it should allow the driver to perform a similar optimization.&lt;/p&gt;

&lt;h1 id=&quot;pipeline-barriers&quot;&gt;Pipeline barriers&lt;/h1&gt;

&lt;p&gt;Pipeline barriers remain one of the most challenging parts of Vulkan code. In older APIs, the runtime and driver were responsible for making sure appropriate hardware-specific synchronization was performed in case of hazards such as a fragment shader reading from a texture that was previously rendered to. This required meticulous tracking of every single resource binding and resulted in both excessive CPU overhead and a sometimes excessive amount of GPU synchronization (for example, a Direct3D 11 driver typically inserts a barrier between any two consecutive compute dispatches that use the same UAV, even though, depending on the application logic, the hazard may be absent). Because inserting barriers quickly and optimally can require knowledge about the application’s use of resources, Vulkan requires the application to do this.&lt;/p&gt;

&lt;p&gt;For optimal rendering, the pipeline barrier setup must be perfect. A missing barrier risks the application encountering a timing-dependent bug on an untested – or, worse, not-yet-existing – architecture, which in the worst case could cause a GPU crash. An unnecessary barrier can reduce GPU utilization by limiting the potential for parallel execution – or, worse, trigger very expensive decompression operations or the like. To make matters worse, while the cost of excessive barriers can now be visualized by tools like Radeon Graphics Profiler, missing barriers are generally not detected by validation tools.&lt;/p&gt;

&lt;p&gt;Because of this, it’s vital to understand the behavior of barriers, the consequences of overspecifying them, and how to work with them.&lt;/p&gt;

&lt;h2 id=&quot;mental-model-2&quot;&gt;Mental model&lt;/h2&gt;

&lt;p&gt;The specification describes barriers in terms of execution dependencies and memory visibility between pipeline stages (e.g. a resource was previously written to by a compute shader stage, and will be read by the transfer stage), as well as layout changes for images (e.g. a resource was previously in a layout that is optimal for writes via the color attachment output and should be transitioned to a layout that is optimal for shader reads). However, it might be easier to think about barriers in terms of their consequences – as in, what can happen on a GPU when a barrier is used. Note that the GPU behavior is of course dependent on the specific vendor and architecture, but it helps to map barriers that are specified in an abstract fashion to more concrete constructs to understand their performance implications.&lt;/p&gt;

&lt;p&gt;A barrier can cause three different things to happen:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Stalling execution of a specific stage until another stage is drained of all current work. For example, if a render pass renders data to a texture, and a subsequent render pass uses a vertex shader to read from this texture, the GPU must wait for all pending fragment shader and ROP work to complete before launching shader threads for the vertex work in the subsequent pass. Most barrier operations will lead to execution stalling for some stages&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Flushing or invalidating an internal GPU-side cache and waiting for the memory transactions to finish to make sure another stage can read the resulting work. For example, on some architectures ROP writes might go through the L2 texture cache, but the transfer stage might operate directly on memory. If a texture has been rendered to in a render pass, then the following transfer operation might read stale data unless the cache is flushed before the copy. Similarly, if the shader stage needs to read an image that was copied using the transfer stage, the L2 texture cache may need to be invalidated to make sure it doesn’t contain stale data. Not all barrier operations will need to do this.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Converting the format the resource is stored in, most commonly to decompress the resource storage. For example, MSAA textures on some architectures are stored in a compressed form where each pixel has a sample mask indicating how many unique colors this pixel contains, and a separate storage for sample data. The transfer stage or shader stage might be unable to read directly from a compressed texture, so a barrier that transitions from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL&lt;/code&gt; might need to decompress the texture, writing all samples for all pixels to memory. Most barrier operations won’t need to do this, but the ones that do can be incredibly expensive.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this in mind, let’s try to understand the guidance for using barriers.&lt;/p&gt;

&lt;h2 id=&quot;performance-guidelines&quot;&gt;Performance guidelines&lt;/h2&gt;

&lt;p&gt;When generating commands for each individual barrier, the driver only has a local view of the barrier and is unaware of past or future barriers. Because of this, the first important rule is that barriers need to be batched as aggressively as possible. Given a barrier that implies a wait-for-idle for fragment stage and an L2 texture cache flush, the driver will dutifully generate that every time you call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdPipelineBarrier&lt;/code&gt;. If you specify multiple resources in a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdPipelineBarrier&lt;/code&gt; call, the driver will only generate one L2 texture cache flush command if it’s necessary for any transitions, reducing the cost.&lt;/p&gt;
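&lt;p&gt;As an illustration, transitioning two hypothetical gbuffer targets for sampling in a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdPipelineBarrier&lt;/code&gt; call – so that any stall or cache flush is generated once – might look like this sketch:&lt;/p&gt;

```cpp
// gbufferAlbedo/gbufferNormal and commandBuffer are hypothetical handles.
VkImageMemoryBarrier barriers[2] = {};
for (uint32_t i = 0; i < 2; ++i) {
    barriers[i].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barriers[i].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barriers[i].dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barriers[i].oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barriers[i].newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barriers[i].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barriers[i].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barriers[i].subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
}
barriers[0].image = gbufferAlbedo;
barriers[1].image = gbufferNormal;

// One call, two transitions: the driver can emit shared stall/flush commands once.
vkCmdPipelineBarrier(commandBuffer,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // producing stage
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // only the stage that actually reads
    0, 0, nullptr, 0, nullptr, 2, barriers);
```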

&lt;p&gt;To make sure the cost of the barriers isn’t higher than it needs to be, only relevant stages need to be included. For example, one of the most common barrier types is one that transitions a resource from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL&lt;/code&gt;. When specifying this barrier, you should specify the shader stages that will &lt;em&gt;actually&lt;/em&gt; read this resource via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dstStageMask&lt;/code&gt;. It’s tempting to specify the stage mask as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_PIPELINE_STAGE_ALL_COMMANDS_BIT&lt;/code&gt; to support compute shader or vertex shader reads. Doing so, however, would mean that the vertex shader workload from subsequent draw commands cannot start before all fragment work finishes, which is problematic:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;On immediate mode renderers, this slightly reduces the parallelism between draw calls, requiring all fragment threads to finish before vertex threads can start, which leads to GPU utilization dropping to 0 at the end of the pass and gradually rising from 0 to, hopefully, 100% as the next render pass begins;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On tiled mode renderers, for some designs the expectation is that all vertex work from the subsequent pass executes to completion before fragment work can start; waiting for fragment work to end for any vertex work to begin thus completely eliminates the parallelism between vertex and fragment stages and is one of the largest potential performance problems that a naively ported Vulkan title can encounter.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that even if the barriers are specified correctly – in this case, assuming the texture is read from the fragment stage, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dstStageMask&lt;/code&gt; should be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT&lt;/code&gt; – the execution dependency is still present, and it can still lead to reduced GPU utilization. This can come up in multiple situations, including compute: to read data generated by one compute shader from another compute shader, you need to express an execution dependency between CS and CS, but specifying a pipeline barrier is guaranteed to drain the GPU of compute work entirely, followed by slowly filling it with compute work again. Instead, it can be worthwhile to specify the dependency via what’s called a split barrier: instead of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdPipelineBarrier&lt;/code&gt;, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdSetEvent&lt;/code&gt; after the write operation completes, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdWaitEvents&lt;/code&gt; before the read operation starts. Of course, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdWaitEvents&lt;/code&gt; immediately after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdSetEvent&lt;/code&gt; is counter-productive and can be slower than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdPipelineBarrier&lt;/code&gt;; instead you should try to restructure your algorithm to make sure there’s enough work submitted between Set and Wait, so that by the time the GPU needs to process Wait, the event is most likely already signaled and there is no efficiency loss.&lt;/p&gt;
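&lt;p&gt;A sketch of the split barrier pattern between two compute dispatches, assuming a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;event&lt;/code&gt; created up front and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmd&lt;/code&gt; command buffer:&lt;/p&gt;

```cpp
vkCmdDispatch(cmd, groupsX, 1, 1);  // writes a buffer

// Mark the point where the write is complete.
vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

// ... record independent work here, so that by the time the GPU reaches
// the wait below, the event is most likely already signaled ...

VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

vkCmdWaitEvents(cmd, 1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    1, &barrier, 0, nullptr, 0, nullptr);

vkCmdDispatch(cmd, groupsX, 1, 1);  // reads the buffer
```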

&lt;p&gt;Alternatively, in some cases the algorithm can be restructured to reduce the number of synchronization points while still using pipeline barriers, making the overhead less significant. For example, a GPU-based particle simulation might need to run two compute dispatches for each particle effect: one to emit new particles, and another one to simulate particles. These dispatches require a pipeline barrier between them to synchronize execution, which requires a pipeline barrier per particle system if particle systems are simulated sequentially. A more optimal implementation would first submit all dispatches to emit particles (that would not depend on each other), then submit a barrier to synchronize emission and simulation dispatches, then submit all dispatches to simulate particles - which would keep GPU well utilized for longer. From there on using split barriers could help completely hide the synchronization cost.&lt;/p&gt;
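&lt;p&gt;The restructured particle update above could be sketched as follows (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParticleSystem&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmd&lt;/code&gt; and the dispatch sizes are hypothetical; pipeline binds and descriptor setup are omitted):&lt;/p&gt;

```cpp
// Phase 1: emit for all systems; these dispatches don't depend on each other.
for (ParticleSystem& ps : systems)
    vkCmdDispatch(cmd, ps.emitGroups, 1, 1);

// One barrier for the whole batch instead of one per particle system.
VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);

// Phase 2: simulate for all systems.
for (ParticleSystem& ps : systems)
    vkCmdDispatch(cmd, ps.simulateGroups, 1, 1);
```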

&lt;p&gt;As far as resource decompression goes, it’s hard to give general advice – on some architectures this never happens, and on some it does, but depending on the algorithm it might not be avoidable. Using vendor-specific tools such as Radeon Graphics Profiler is critical to understanding the performance impact decompression has on your frame; in some cases, it may be possible to adjust the algorithm to not require the decompression in the first place, for example by moving the work to a different stage. Of course it should be noted that resource decompression may happen in cases where it’s completely unnecessary and is a result of overspecifying barriers – for example, if you render to a framebuffer that contains a depth buffer and never read depth contents in the future, you should leave the depth buffer in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL&lt;/code&gt; layout instead of needlessly transitioning it into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL&lt;/code&gt;, which might trigger a decompression (remember, the driver doesn’t know if you are going to read the resource in the future!).&lt;/p&gt;

&lt;h2 id=&quot;simplifying-barrier-specification&quot;&gt;Simplifying barrier specification&lt;/h2&gt;

&lt;p&gt;With all the complexity involved in specifying barriers, it helps to have examples of commonly required barriers. Fortunately, Khronos Group provides many examples of valid and optimal barriers for various types of synchronization as part of &lt;a href=&quot;https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples&quot;&gt;Vulkan-Docs repository on GitHub&lt;/a&gt;. These can serve to improve the understanding of general barrier behavior, and can also be used directly in a shipping application.&lt;/p&gt;

&lt;p&gt;Additionally, for cases not covered by these examples and, in general, to simplify the specification code and make it more correct, it is possible to switch to a simpler model where, instead of fully specifying access masks, stages and image layouts, the only concept that needs to be known about a resource is the resource state that encapsulates the stages that can use the resource and the usage mode for most common types of access. Then all transitions involve transitioning a resource from state A to state B, which is much easier to understand. To that end, Tobias Hector, a member of Khronos Group and a co-author of the Vulkan specification, wrote an open-source library, simple_vulkan_synchronization, that translates resource state (otherwise known as access type in the library) transitions into Vulkan barrier specification. The library is small and simple and provides support for split barriers as well as full pipeline barriers.&lt;/p&gt;

&lt;h2 id=&quot;predicting-the-future-with-render-graphs&quot;&gt;Predicting the future with render graphs&lt;/h2&gt;

&lt;p&gt;The performance guidelines outlined in the previous section are hard to follow in practice, especially given conventional immediate mode rendering architectures.&lt;/p&gt;

&lt;p&gt;To make sure that the stages and image layout transitions are not overspecified, it’s important to know how the resource is going to be used in the future – if you want to emit a pipeline barrier after render pass ends, without this information you’re generally forced to emit a barrier with all stages in the destination stage mask, and an inefficient target layout.&lt;/p&gt;

&lt;p&gt;To solve this problem, it’s tempting to instead emit the barriers before the resource is read, since at that point it’s possible to know how the resource was written to; however, this makes it hard to batch barriers. For example, in a frame with 3 render passes, A, B, and C, where C reads A’s output and B’s output in two separate draw calls, to minimize the number of texture cache flushes and other barrier work it’s generally beneficial to specify a barrier before C that correctly transitions the outputs of both A and B; instead what would happen is that there’s a barrier before each of C’s draw calls. Split barriers in some cases can reduce the associated costs, but in general just-in-time barriers will be overly expensive.&lt;/p&gt;

&lt;p&gt;Additionally, using just-in-time barriers requires tracking the resource state to know the previous layout; this is very hard to do correctly in a multithreaded system since the final execution order on GPU can only be known once all commands are recorded and linearized.&lt;/p&gt;

&lt;p&gt;Due to the aforementioned problems, many modern renderers are starting to experiment with render graphs as a way to declaratively specify all dependencies between frame resources. Based on the resulting DAG structure, it’s possible to establish correct barriers, including barriers required for synchronizing work across multiple queues, and allocate transient resources with minimal use of physical memory.&lt;/p&gt;
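&lt;p&gt;As a toy illustration of why declaring dependencies up front helps – this is not a full render graph, and all types here are simplified stand-ins – the following sketch computes one batched transition list per pass from declared reads and writes, instead of emitting a just-in-time barrier before every draw:&lt;/p&gt;

```cpp
#include <string>
#include <vector>
#include <set>

// A pass declares up front which resources it writes and reads.
struct Pass {
    std::string name;
    std::vector<std::string> writes;
    std::vector<std::string> reads;
};

// Returns, for each pass index, the set of resources that must be
// transitioned to a readable state before that pass starts; each set maps
// to a single batched barrier.
std::vector<std::set<std::string>> computeBarriers(const std::vector<Pass>& passes) {
    std::vector<std::set<std::string>> result(passes.size());
    std::set<std::string> writtenAndUntransitioned;
    for (size_t i = 0; i < passes.size(); ++i) {
        // First read of a previously written resource: batch its transition
        // into the single barrier emitted before this pass.
        for (const std::string& r : passes[i].reads)
            if (writtenAndUntransitioned.erase(r))
                result[i].insert(r);
        for (const std::string& w : passes[i].writes)
            writtenAndUntransitioned.insert(w);
    }
    return result;
}
```

In the A/B/C example above, this produces a single barrier before C covering both A’s and B’s outputs; a real system would additionally track layouts, stage masks, and cross-queue transfers.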

&lt;p&gt;A full description of a render graph system is out of scope of this article, but interested readers are encouraged to refer to the following talks and articles:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.gdcvault.com/play/1024612/FrameGraph-Extensible-Rendering-Architecture-in&quot;&gt;FrameGraph: Extensible Rendering Architecture in Frostbite&lt;/a&gt;, Yuriy O’Donnell, GDC 2017&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.gdcvault.com/play/1024656/Advanced-Graphics-Tech-Moving-to&quot;&gt;Advanced Graphics Tech: Moving to DirectX 12: Lessons Learned&lt;/a&gt;, Tiago Rodrigues, GDC 2017&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/&quot;&gt;Render graphs and Vulkan — a deep dive&lt;/a&gt;, Hans-Kristian Arntzen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different engines pick different parameters of the solution; for example, Frostbite’s render graph is specified by the application in final execution order (which the author of this article finds more predictable and preferable), whereas the two other presentations linearize the graph based on certain heuristics to try to find a more optimal execution order. Regardless, the important part is that dependencies between passes must be declared ahead of time for the entire frame to make sure that barriers can be emitted appropriately. Importantly, frame graph systems work well for transient resources that are limited in number and represent the bulk of required barriers; while it’s possible to specify barriers required for resource uploads and similar streaming work as part of the same system, this can make the graphs too complex and the processing time too large, so these are generally best handled outside of a frame graph system.&lt;/p&gt;

&lt;h1 id=&quot;render-passes&quot;&gt;Render passes&lt;/h1&gt;

&lt;p&gt;One concept that is relatively unique to Vulkan compared to both older APIs and new explicit APIs is render passes. Render passes allow an application to specify a large part of their render frame as a first-class object, splitting the workload into individual subpasses and explicitly enumerating dependencies between subpasses to allow the driver to schedule the work and place appropriate synchronization commands. In that sense, render passes are similar to the render graphs described above and can be used to implement them, with some limitations (for example, render passes currently can only express rasterization workloads, which means that multiple render passes must be used if compute workloads need to be supported). This section, however, will focus on simpler uses of render passes that are more practical to integrate into existing renderers, and still provide performance benefits.&lt;/p&gt;

&lt;h2 id=&quot;load--store-operations&quot;&gt;Load &amp;amp; store operations&lt;/h2&gt;

&lt;p&gt;One of the most important features of render passes is the ability to specify load and store operations. Using these, the application can choose whether the initial contents of each framebuffer attachment need to be cleared, loaded from memory, or remain unspecified and unused by the application, and whether after the render pass is done the attachment needs to be stored to memory.&lt;/p&gt;

&lt;p&gt;These operations are important to get right – on tiled architectures, using redundant load or store operations leads to wasted bandwidth, which reduces performance and increases power consumption. On non-tiled architectures, the driver can still use these to perform certain optimizations for subsequent rendering – for example, if the previous contents of an attachment are irrelevant but the attachment has associated compression metadata, the driver may clear this metadata to make subsequent rendering more efficient.&lt;/p&gt;

&lt;p&gt;To allow maximum freedom for the driver, it’s important to specify the weakest load/store operations necessary – for example, when rendering a full-screen quad to the attachment that writes all pixels, on tiled GPUs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_LOAD_OP_CLEAR&lt;/code&gt; is likely to be faster than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_LOAD_OP_LOAD&lt;/code&gt;, and on immediate mode GPUs LOAD is likely to be faster – specifying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_LOAD_OP_DONT_CARE&lt;/code&gt; is important so that the driver can perform an optimal choice. In some cases &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_LOAD_OP_DONT_CARE&lt;/code&gt; can be better than either LOAD or CLEAR since it allows the driver to avoid an expensive clear operation for the image contents, but still clear image metadata to accelerate subsequent rendering.&lt;/p&gt;

&lt;p&gt;Similarly, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_STORE_OP_DONT_CARE&lt;/code&gt; should be used in case the application is not expecting to read the data rendered to the attachment - this is commonly the case for depth buffers and MSAA targets.&lt;/p&gt;
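&lt;p&gt;For example, a depth attachment that is needed during the pass but never read afterwards could be described like this (a sketch; the format and final layout are placeholders for whatever the pass actually uses):&lt;/p&gt;

```cpp
// Depth is cleared at the start of the pass and discarded at the end:
// no load from memory, no store to memory.
VkAttachmentDescription depth = {};
depth.format = VK_FORMAT_D32_SFLOAT;
depth.samples = VK_SAMPLE_COUNT_1_BIT;
depth.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;        // rendering needs a cleared depth buffer
depth.storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;  // contents are never read after the pass
depth.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
depth.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
depth.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;   // previous contents are irrelevant
depth.finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;
```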

&lt;h2 id=&quot;fast-msaa-resolve&quot;&gt;Fast MSAA resolve&lt;/h2&gt;

&lt;p&gt;After rendering data to an MSAA texture, it’s common to resolve it into a non-MSAA texture for further processing. If fixed-function resolve functionality is sufficient, there are two ways to implement this in Vulkan:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_STORE_OP_STORE&lt;/code&gt; for the MSAA texture and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdResolveImage&lt;/code&gt; after the render pass ends&lt;/li&gt;
  &lt;li&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_STORE_OP_DONT_CARE&lt;/code&gt; for the MSAA texture and specifying the resolve target via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pResolveAttachments&lt;/code&gt; member of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkSubpassDescription&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the latter case, the driver will perform the necessary work to resolve MSAA contents as part of work done when subpass/renderpass ends.&lt;/p&gt;
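&lt;p&gt;A sketch of the subpass setup for the second approach, assuming attachment 0 is the MSAA color target and attachment 1 is the single-sample resolve target:&lt;/p&gt;

```cpp
VkAttachmentReference colorRef   = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
VkAttachmentReference resolveRef = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

VkSubpassDescription subpass = {};
subpass.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments = &colorRef;
subpass.pResolveAttachments = &resolveRef; // the driver resolves when the subpass ends
```

The MSAA attachment itself would use `VK_ATTACHMENT_STORE_OP_DONT_CARE`, since only the resolved result needs to reach memory.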

&lt;p&gt;The second approach can be significantly more efficient. On tiled architectures, using the first approach requires storing the entire MSAA texture to main memory, followed by reading it from memory and resolving to the destination; the second approach can perform in-tile resolve in the most efficient manner. On immediate mode architectures, some implementations may not support reading compressed MSAA textures using the transfer stage – the API requires a transition into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL&lt;/code&gt; layout before calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdResolveImage&lt;/code&gt;, which may lead to decompression of the MSAA texture, wasting bandwidth and performance. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pResolveAttachments&lt;/code&gt;, the driver can perform the resolve operation at maximum performance regardless of the architecture.&lt;/p&gt;

&lt;p&gt;In some cases, fixed-function MSAA resolve is insufficient. In this case, it’s necessary to transition the texture to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL&lt;/code&gt; and do the resolve in a separate render pass. On tiled architectures, this has the same efficiency issues as the fixed-function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdResolveImage&lt;/code&gt; method; on immediate mode architectures the efficiency depends on the GPU and driver. One possible alternative is to use an extra subpass that reads the MSAA texture via an input attachment.&lt;/p&gt;

&lt;p&gt;For this to work, the first subpass that renders to the MSAA texture has to specify the MSAA texture via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pColorAttachments&lt;/code&gt;, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_ATTACHMENT_STORE_OP_DONT_CARE&lt;/code&gt; as the store op. The second subpass that performs the resolve needs to specify the MSAA texture via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pInputAttachments&lt;/code&gt; and the resolve target via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pColorAttachments&lt;/code&gt;; the subpass then needs to render a full-screen quad or triangle with a shader that uses a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subpassInputMS&lt;/code&gt; resource to read MSAA data. Additionally, the application needs to specify a dependency between the two subpasses that indicates the stage/access masks, similarly to pipeline barriers, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_DEPENDENCY_BY_REGION_BIT&lt;/code&gt; dependency flag. With this, the driver should have enough information to arrange the execution such that on tiled GPUs, the MSAA contents never leave tile memory and instead are resolved in-tile, with the resolve result being written to main memory&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. Note that whether this happens depends on the driver and is unlikely to result in significant savings on immediate mode GPUs.&lt;/p&gt;

&lt;h1 id=&quot;pipeline-objects&quot;&gt;Pipeline objects&lt;/h1&gt;

&lt;p&gt;Older APIs typically split the GPU state into blocks based on functional units – for example, in Direct3D 11 the full GPU state, modulo resource bindings, can be described using the set of shader objects for various stages (VS, PS, GS, HS, DS) as well as a set of state objects (rasterizer, blend, depth stencil), input assembly configuration (input layout, primitive topology) and a few other implicit bits like output render target formats. The API user could then set individual bits of the state separately, without regard to the design or complexity of the underlying hardware.&lt;/p&gt;

&lt;p&gt;Unfortunately, this model doesn’t match the model hardware typically uses, and several performance pitfalls can occur:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;While an individual state object is supposed to model parts of the GPU state and could be directly transferred to commands that set up GPU state, on some GPUs the configuration of the GPU state requires data from multiple different state blocks. Because of this, drivers typically must keep a shadow copy of all state and convert the state to the actual GPU commands at the time of Draw/DrawIndexed&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;With the rasterization pipeline getting more complex and gaining more programmable stages, some GPUs didn’t map them directly to hardware stages, which means that the shader microcode can depend on whether other shader stages are active and, in some cases, on the specific microcode for other stages; this meant that the driver might have to compile new shader microcode from state that can only be discovered at the time of Draw/DrawIndexed&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Similarly, on some GPUs, fixed-function units from the API description were implemented as part of one of the shader stages – changing the vertex input format, blending setup, or render target format could affect the shader microcode. Since the state is only known at the time of Draw/DrawIndexed, this, again, is where the final microcode had to be compiled&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the first problem is more benign, the second and third problems can lead to significant stalls during rendering as, due to the complexity of modern shaders and shader compilation pipelines, shader compilation can take tens to hundreds of milliseconds depending on hardware. To solve this, Vulkan and other new APIs introduce the concept of a pipeline object – it encapsulates most GPU state, including vertex input format, render target format, state for all stages and shader modules for all stages. The expectation is that on every supported GPU, this state is sufficient to build the final shader microcode and the GPU commands required to set the state up, so the driver never has to compile microcode at draw time and can optimize pipeline object setup to the extent possible.&lt;/p&gt;

&lt;p&gt;This model, however, presents challenges when implementing renderers on top of Vulkan. There are multiple ways to solve this problem, with different tradeoffs wrt complexity, efficiency, and renderer design.&lt;/p&gt;

&lt;h2 id=&quot;just-in-time-compilation&quot;&gt;Just-In-Time compilation&lt;/h2&gt;

&lt;p&gt;The most straightforward way to support Vulkan is to use just-in-time compilation for pipeline objects. In many engines due to the lack of first-class concepts that match Vulkan, the rendering backend must gather information about various parts of the pipeline state as a result of various state setup calls, similarly to what a Direct3D 11 driver might do. Then, just before the draw/dispatch where the full state is known, all individual bits of state would be grouped together and looked up in a hash table; if there’s already a pipeline state object in the cache, it can be used directly, otherwise a new object can be created.&lt;/p&gt;

&lt;p&gt;This scheme works to get the application running but suffers from two performance pitfalls.&lt;/p&gt;

&lt;p&gt;A minor concern is that the state that needs to be hashed together is potentially large; doing this for every draw call can be time consuming when the cache already contains all relevant objects. This can be mitigated by grouping state into objects and hashing pointers to these objects, and in general simplifying the state specification from the high-level API point of view.&lt;/p&gt;
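&lt;p&gt;A toy sketch of this mitigation – all types here are hypothetical stand-ins for an engine’s immutable state objects – hashes pointers to the grouped state blocks instead of their full contents:&lt;/p&gt;

```cpp
#include <cstdint>
#include <cstddef>

// Each field points at an immutable, deduplicated state block, so the
// pointer identity is a valid proxy for the block's contents.
struct DrawState {
    const void* shaders;      // shader module set
    const void* blendState;
    const void* depthState;
    const void* vertexLayout;
};

// boost-style hash combine.
inline size_t hashCombine(size_t seed, size_t value) {
    return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

inline size_t hashDrawState(const DrawState& s) {
    size_t h = 0;
    h = hashCombine(h, reinterpret_cast<size_t>(s.shaders));
    h = hashCombine(h, reinterpret_cast<size_t>(s.blendState));
    h = hashCombine(h, reinterpret_cast<size_t>(s.depthState));
    h = hashCombine(h, reinterpret_cast<size_t>(s.vertexLayout));
    return h;
}
```

This keeps the per-draw hashing cost at a few pointer-sized combines, regardless of how large the underlying state blocks are.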

&lt;p&gt;A major concern, however, is that for any pipeline state object that must be created, the driver might need to compile multiple shaders to the final GPU microcode. This process is time consuming; additionally, it cannot be optimally threaded with a just-in-time compilation model – if the application only uses one thread for command submission, this thread would typically also compile pipeline state objects; even with multiple threads, often multiple threads would request the same pipeline object, serializing compilation, or one thread would need several new pipeline objects, which increases the overall latency of submission since other threads would finish first and have no work to do.&lt;/p&gt;

&lt;p&gt;For multi-threaded submission, accessing the cache can result in contention between cores even when the cache is full. Fortunately, this can be solved by a two-level cache scheme as follows:&lt;/p&gt;

&lt;p&gt;The cache would have two parts: the immutable part that never changes during the frame, and the mutable part. To perform a pipeline cache lookup, we first check if the immutable cache has the object – this is done without any synchronization. In the event of a cache miss, we lock a critical section and check if the mutable cache has the object; if it doesn’t, we unlock the critical section, create the pipeline object, and then lock it again and insert the object into the cache, potentially displacing another object (additional synchronization might be required if, when two threads request the same object, only one compilation request should be issued to the driver). At the end of the frame, all objects from the mutable cache are added to the immutable cache and the mutable cache is cleared, so that on the next frame access to these objects can be free-threaded.&lt;/p&gt;
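&lt;p&gt;Here’s a minimal sketch of this two-level scheme in C (hypothetical names; a real renderer would use its own hash map and store VkPipeline handles). On a miss the caller compiles the pipeline and inserts it; the end-of-frame merge makes next frame’s lookups lock-free:&lt;/p&gt;

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy open-addressing map from a 64-bit state hash to a pipeline handle;
 * key 0 is reserved to mean "empty slot". A real renderer would use its
 * own hash map and store VkPipeline handles. */
#define CACHE_CAPACITY 1024

typedef struct PipelineMap {
    uint64_t keys[CACHE_CAPACITY];
    void* values[CACHE_CAPACITY];
} PipelineMap;

static void* map_find(const PipelineMap* m, uint64_t key)
{
    for (size_t i = key % CACHE_CAPACITY, n = 0; n < CACHE_CAPACITY; i = (i + 1) % CACHE_CAPACITY, ++n) {
        if (m->keys[i] == key) return m->values[i];
        if (m->keys[i] == 0) return NULL;
    }
    return NULL;
}

static void map_insert(PipelineMap* m, uint64_t key, void* value)
{
    for (size_t i = key % CACHE_CAPACITY;; i = (i + 1) % CACHE_CAPACITY) {
        if (m->keys[i] == 0 || m->keys[i] == key) {
            m->keys[i] = key;
            m->values[i] = value;
            return;
        }
    }
}

typedef struct PipelineCache {
    PipelineMap frozen;  /* immutable during the frame: read without locking */
    PipelineMap pending; /* this frame's additions, guarded by the mutex */
    pthread_mutex_t mutex;
} PipelineCache;

/* Returns NULL on a miss; the caller then compiles the pipeline
 * (e.g. via vkCreateGraphicsPipelines) and calls cache_insert. */
void* cache_get(PipelineCache* c, uint64_t key)
{
    void* p = map_find(&c->frozen, key); /* lock-free fast path */
    if (p) return p;
    pthread_mutex_lock(&c->mutex);
    p = map_find(&c->pending, key);
    pthread_mutex_unlock(&c->mutex);
    return p;
}

void cache_insert(PipelineCache* c, uint64_t key, void* pipeline)
{
    pthread_mutex_lock(&c->mutex);
    map_insert(&c->pending, key, pipeline);
    pthread_mutex_unlock(&c->mutex);
}

/* Called once per frame, when no other threads are accessing the cache. */
void cache_end_frame(PipelineCache* c)
{
    for (size_t i = 0; i < CACHE_CAPACITY; ++i)
        if (c->pending.keys[i] != 0)
            map_insert(&c->frozen, c->pending.keys[i], c->pending.values[i]);
    memset(&c->pending, 0, sizeof(c->pending));
}
```

&lt;p&gt;Note that in this sketch two threads can still race to compile the same pipeline between the miss and the insert; deduplicating that requires the extra synchronization mentioned above.&lt;/p&gt;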

&lt;h2 id=&quot;pipeline-cache-and-cache-pre-warming&quot;&gt;Pipeline cache and cache pre-warming&lt;/h2&gt;

&lt;p&gt;While just-in-time compilation can work, it results in a significant amount of stuttering during gameplay. Whenever an object with a new set of shaders/state enters the frame, we end up having to compile a pipeline object for it, which can be slow. This is a similar problem to what Direct3D 11 titles had; however, Direct3D 11 drivers did a lot of work behind the scenes to hide the compilation latency, precompiling some shaders early and implementing custom schemes for patching bytecode on the fly that didn’t require a full recompilation. In Vulkan, the expectation is that the application handles pipeline object creation manually and intelligently, so a naive approach doesn’t work very well.&lt;/p&gt;

&lt;p&gt;To make just-in-time compilation more practical, it’s important to use the Vulkan pipeline cache, serialize it between runs, and pre-warm the in-memory cache described in the previous section at application startup from multiple threads.&lt;/p&gt;

&lt;p&gt;Vulkan provides a pipeline cache object, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCache&lt;/code&gt;, that can store driver-specific bits of state and shader microcode to improve compilation time for pipeline objects. For example, if an application creates two pipeline objects with identical setup except for culling mode, the shader microcode would typically be the same. To make sure the driver only compiles the object once, the application should pass the same instance of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCache&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreateGraphicsPipelines&lt;/code&gt; in both calls, in which case the first call would compile the shader microcode and the second call would be able to reuse it. If these calls happen concurrently in different threads the driver might still compile the shaders twice since the data would only be added to the cache when one of the calls finishes.&lt;/p&gt;

&lt;p&gt;It’s vital to use the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCache&lt;/code&gt; object when creating all pipeline objects and to serialize it to disk between runs using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetPipelineCacheData&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pInitialData&lt;/code&gt; member of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCacheCreateInfo&lt;/code&gt;. This ensures that compiled objects are reused between runs and minimizes frame spikes during subsequent application runs.&lt;/p&gt;

&lt;p&gt;Unfortunately, during the first playthrough the shader compilation spikes will still occur since the pipeline cache will not contain all used combinations. Additionally, even when the pipeline cache contains the necessary microcode, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreateGraphicsPipelines&lt;/code&gt; isn’t free, and as such compilation of new pipeline objects can still increase the frame time variance. To solve that, it’s possible to pre-warm the in-memory cache (and/or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCache&lt;/code&gt;) during load time.&lt;/p&gt;

&lt;p&gt;One possible solution here is that at the end of the gameplay session, the renderer could save the in-memory pipeline cache data – which shaders were used with which state&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; – to a database. Then, during QA playthroughs, this database could be populated with data from multiple playthroughs at different graphics settings etc. – effectively gathering the set of states that are likely to be used during the actual gameplay.&lt;/p&gt;

&lt;p&gt;This database can then be shipped with the game; at game startup, the in-memory cache could be prepopulated with all states created using the data from that database (or, depending on the amount of pipeline states, this pre-warming phase could be limited to just the states for the current graphics settings). This should happen on multiple threads to reduce the load time impact; the first run would still have a longer load time (which can be further reduced with features like Steam pre-caching), but frame spikes due to just-in-time pipeline object creation can be mostly avoided.&lt;/p&gt;
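&lt;p&gt;A sketch of this record/replay flow (the record format here is hypothetical – a flat file of 64-bit state hashes; a shippable database would store full pipeline creation parameters, e.g. via a library like Fossilize):&lt;/p&gt;

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Save the set of pipeline state keys seen during a session. The record
 * format is hypothetical (a flat array of 64-bit hashes); a shippable
 * database needs full creation parameters, not just hashes. */
int save_state_database(const char* path, const uint64_t* keys, size_t count)
{
    FILE* f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(keys, sizeof(uint64_t), count, f);
    fclose(f);
    return written == count ? 0 : -1;
}

/* Load the database at startup; each returned key corresponds to a pipeline
 * object that should be pre-created, with the work split across threads. */
size_t load_state_database(const char* path, uint64_t* keys, size_t max_count)
{
    FILE* f = fopen(path, "rb");
    if (!f) return 0;
    size_t count = fread(keys, sizeof(uint64_t), max_count, f);
    fclose(f);
    return count;
}
```

&lt;p&gt;At startup, the loaded keys would be partitioned across worker threads, each creating the corresponding pipeline objects.&lt;/p&gt;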

&lt;p&gt;If a particular set of state combinations wasn’t discovered during QA playthroughs, the system can still function correctly – at the expense of some amount of stuttering. The resulting scheme is more or less universal and practical – but requires a potentially large effort to play through enough levels with enough different graphics settings to capture most realistic workloads, making it somewhat hard to manage.&lt;/p&gt;

&lt;h2 id=&quot;ahead-of-time-compilation&quot;&gt;Ahead of time compilation&lt;/h2&gt;

&lt;p&gt;The “perfect” solution – one that Vulkan was designed for – is to remove just-in-time compilation caches and pre-warming, and instead just have every single possible pipeline object available ahead of time.&lt;/p&gt;

&lt;p&gt;This typically requires changing the renderer design and integrating the concept of the pipeline state into the material system, allowing a material to specify the state completely. There are different possible designs; this section will outline just one, but the important thing is the general principle.&lt;/p&gt;

&lt;p&gt;An object is typically associated with a material that specifies the graphics state and resource bindings required to render the object. In this case, it’s important to separate resource bindings from the graphics state, as the goal is to be able to enumerate all combinations of graphics state in advance. Let’s call the collection of graphics state a “technique” (this terminology is intentionally similar to that of the Direct3D Effects framework, although there the state was stored in the pass). Techniques can then be grouped into effects, and a material would refer to an effect plus some sort of key that selects a technique within that effect.&lt;/p&gt;

&lt;p&gt;The set of effects, and the set of techniques within each effect, would be static. Effects are not as vital to being able to precompile pipeline objects as techniques, but they can serve as a useful semantic grouping of techniques – for example, a material is often assigned an effect at material creation time, but the technique can vary based on where the object is rendered (e.g. shadow pass, gbuffer pass, reflection pass) or on the gameplay effects active (e.g. highlight).&lt;/p&gt;
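&lt;p&gt;In code, the resulting per-draw lookup can be as simple as an array index (a hypothetical sketch – the pass names and structures are illustrative):&lt;/p&gt;

```c
#include <stddef.h>

/* Hypothetical pass keys used to select a technique within an effect. */
typedef enum TechniqueKey {
    TECHNIQUE_SHADOW,
    TECHNIQUE_GBUFFER,
    TECHNIQUE_REFLECTION,
    TECHNIQUE_COUNT
} TechniqueKey;

/* One pre-created pipeline per technique; in Vulkan this would hold a
 * VkPipeline created at load time. */
typedef struct Technique {
    void* pipeline;
} Technique;

typedef struct Effect {
    Technique techniques[TECHNIQUE_COUNT];
} Effect;

typedef struct Material {
    const Effect* effect; /* assigned at material creation time */
    /* resource bindings live elsewhere, separate from graphics state */
} Material;

/* Per-draw lookup: the current pass supplies the key; no hashing and
 * no just-in-time compilation are involved. */
static const Technique* material_technique(const Material* m, TechniqueKey key)
{
    return &m->effect->techniques[key];
}
```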

&lt;p&gt;Crucially, the technique must specify &lt;em&gt;all&lt;/em&gt; state required to create a pipeline object, statically, ahead of time – typically as part of the definition in some text file, whether in a D3DFX-like DSL or in a JSON/XML file. It must include all shaders, blend state, culling state, vertex format, render target formats, and depth state. Here’s an example of how this might look:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;technique&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gbuffer&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;vertex_shader&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gbuffer_vs&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;fragment_shader&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gbuffer_fs&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#ifdef DECAL
&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;depth_state&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;less_equal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;blend_state&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_alpha&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;one_minus_src_alpha&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#else
&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;depth_state&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;less_equal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;blend_state&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;disabled&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#endif
&lt;/span&gt;	
	&lt;span class=&quot;n&quot;&gt;render_target&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rgba16f&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;render_target&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rgba8_unorm&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;render_target&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rgba8_unorm&lt;/span&gt;

	&lt;span class=&quot;n&quot;&gt;vertex_layout&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gbuffer_vertex_struct&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Assuming all draw calls, including ones used for post-effects and the like, use the effect system to specify render state, and assuming the set of effects and techniques is static, it’s trivial to precreate all pipeline objects – each technique needs just one – at load time using multiple threads, and at runtime to use very efficient code with no need for in-memory caches and no possibility of frame spikes.&lt;/p&gt;

&lt;p&gt;In practice, implementing this system in a modern renderer is an exercise in complexity management. It’s common to use complex shader or state permutations – for example, for two-sided rendering you typically need to change the culling state and perhaps change the shaders to implement two-sided lighting. For skinned rendering, you need to change the vertex format and add some code to the vertex shader to transform the attributes using skinning matrices. On some graphics settings, you might decide that the render target format needs to be floating-point R10G11B10 instead of RGBA16F to conserve bandwidth. All these combinations multiply and require you to represent them concisely and efficiently when specifying technique data (for example, by allowing #ifdef sections inside technique declarations as shown above) and – importantly – to stay aware of the steadily growing number of combinations, refactoring and simplifying them as appropriate. Some effects are rare enough that they could be rendered in a separate pass without increasing the number of permutations. Some computations are simple enough that always running them in all shaders can be a better tradeoff than increasing the number of permutations. And some rendering techniques offer better decoupling and separation of concerns, which can also reduce the number of permutations.&lt;/p&gt;

&lt;p&gt;Importantly though, adding state permutations to the mix makes the problem harder but doesn’t make it different – many renderers have to solve the problem of a large number of shader permutations anyway, and once you incorporate all render state into shader/technique specification and focus on reducing the number of technique permutations, the same complexity management solutions apply equally to both problems. The benefit of implementing a system like this is perfect knowledge of all required combinations (as opposed to having to rely on fragile permutation discovery systems), great performance with minimal frame-to-frame variance including the first load, and a forcing function to keep the complexity of rendering code at bay.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;The Vulkan API shifts a large amount of responsibility from driver developers onto application developers. Navigating the landscape of various rendering features becomes more challenging when many implementation options are available; it’s challenging enough to write a correct Vulkan renderer, and performance and memory consumption are paramount as well. This article has tried to discuss important considerations when dealing with specific problems in Vulkan, and to present multiple implementation approaches that provide different tradeoffs between complexity, ease of use and performance, spanning the range from porting existing renderers to redesigning renderers around Vulkan.&lt;/p&gt;

&lt;p&gt;Ultimately, it’s hard to give general advice that works across all vendors and is applicable to all renderers. For this reason, it’s vital to profile the resulting code on the target platform/vendor – for Vulkan, it’s important to monitor performance across all vendors that the game is planning to ship on, as the choices the application makes matter even more, and in some cases a specific feature, like fixed-function vertex buffer bindings, is the fast path on one vendor but a slow path on another.&lt;/p&gt;

&lt;p&gt;Beyond using validation layers to ensure code correctness and vendor-specific profiling tools, such as AMD Radeon Graphics Profiler or NVidia Nsight Graphics, many open-source libraries that can help optimize your renderer for Vulkan are available:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator&quot;&gt;VulkanMemoryAllocator&lt;/a&gt; - provides convenient and performant memory allocators for Vulkan as well as other memory-related algorithms such as defragmentation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/zeux/volk&quot;&gt;volk&lt;/a&gt; - provides an easy way to load Vulkan entrypoints directly from the driver, which can reduce function call overhead&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/Tobski/simple_vulkan_synchronization&quot;&gt;simple_vulkan_synchronization&lt;/a&gt; - provides a way to specify Vulkan barriers using a simplified access type model, which helps balance correctness and performance&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ValveSoftware/Fossilize&quot;&gt;Fossilize&lt;/a&gt; - provides serialization support for various Vulkan objects, most notably for pipeline state creation info which can be used to implement pre-warming for a pipeline cache.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ARM-software/perfdoc&quot;&gt;perfdoc&lt;/a&gt; - provides a layer, similar to the validation layers, that analyzes the stream of rendering commands and identifies potential performance problems on ARM GPUs&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/zeux/niagara&quot;&gt;niagara&lt;/a&gt; - provides an example bindless renderer that follows some of the advice from this article (but not all of it!)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/khronosGroup/Vulkan-samples&quot;&gt;Vulkan-Samples&lt;/a&gt; - provides many samples that explore various tradeoffs in implementation of Vulkan rendering techniques along with details on the performance on mobile.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, some vendors develop open-source Vulkan drivers for Linux; studying their sources can help gain more insight into performance of certain Vulkan constructs:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/GPUOpen-Drivers/&quot;&gt;GPUOpen-Drivers&lt;/a&gt; for AMD - contains xgl which has the Vulkan driver source, and PAL which is a library used by xgl; many Vulkan function calls end up going through both xgl and PAL&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/mesa3d/mesa/tree/master/src/amd/vulkan&quot;&gt;mesa3d/radv&lt;/a&gt; for AMD - contains community-developed open-source radv driver&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/mesa3d/mesa/tree/master/src/intel/vulkan&quot;&gt;mesa3d/anvil&lt;/a&gt; for Intel - contains Anvil driver&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Author wishes to thank Alex Smith (Feral Interactive), Daniel Rákos (AMD), Hans-Kristian Arntzen (ex. ARM), Matthäus Chajdas (AMD), Wessam Bahnassi (INFramez Technology Corp) and Wolfgang Engel (CONFETTI) for reviewing the drafts of this article and helping make it better.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We only cover memory allocation types that are writable from host and readable or writable from GPU; for CPU readback of data that has been written by GPU, memory with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_HOST_CACHED_BIT&lt;/code&gt; flag is more appropriate. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VK_MEMORY_PROPERTY_HOST_COHERENT_BIT&lt;/code&gt; generally implies that the memory will be write-combined; on some devices it’s possible to allocate non-coherent memory and flush it manually with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkFlushMappedMemoryRanges&lt;/code&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that with only 4 descriptor sets per pipeline, this approach can’t handle the full pipeline setup for VS, GS, FS, TCS and TES – which is only a problem if you use tessellation on drivers that only expose 4 descriptor sets. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Depending on the GPU architecture it might also be beneficial to pass some of the indices, like material index or vertex data offset, via push constants to reduce the number of memory indirections in vertex/fragment shaders. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Regrettably, Vulkan doesn’t provide a way for the driver to implement thread-safe command buffer recording so that one command pool can be reused between threads; in the scheme described, cross-thread synchronization is only required for switching pages which is relatively rare and can be lock-free for the most part. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s crucial to note that a commonly held belief that individual draw calls execute in isolation without overlap with other work is wrong – GPUs commonly run subsequent draw calls in parallel across render state, shader and even render target switches. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of course, it’s not guaranteed that the driver will perform this optimization - it depends on the hardware architecture and driver implementation. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This can use an application-specific format, or a library like &lt;a href=&quot;https://github.com/ValveSoftware/Fossilize&quot;&gt;Fossilize&lt;/a&gt; &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Thu, 27 Feb 2020 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2020/02/27/writing-an-efficient-vulkan-renderer/</link>
			<guid isPermaLink="true">https://zeux.io/2020/02/27/writing-an-efficient-vulkan-renderer/</guid>
		</item>
		
		<item>
			<title>Learning from data</title>
			<description>&lt;p&gt;Machine learning is taking the world by storm. There’s amazing progress in many areas that were either considered intractable or had not reached a satisfying solution despite decades of research. A lot of results in machine learning are obtained using neural networks, but that’s just one class of algorithms. Today we’ll look at one key algorithm from &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; that was improved by getting the machine to find the best answer instead of me, the human&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;problem&quot;&gt;Problem&lt;/h1&gt;

&lt;p&gt;One little known fact is that the performance of rendering a mesh depends significantly on the order of triangles in that mesh. Most GPUs use a structure that we will call “vertex cache” (also known as “post T&amp;amp;L cache”, “post transform cache” and “parameter cache”) which can cache the results of the vertex shader invocation. The cache is indexed by the vertex index, and the details of the cache replacement are not documented and vary between GPU vendors and models.&lt;/p&gt;

&lt;p&gt;For example, if the triangle [2 1 3] immediately follows a triangle [0 1 2] in the index buffer, it’s very likely that the vertices 1 and 2 will not be transformed redundantly. However, if there are a lot of other triangles between these two in the index buffer, the GPU might need to transform these vertices again. Minimizing these redundant vertex shader invocations (cache misses) is beneficial for performance.&lt;/p&gt;

&lt;p&gt;There are many ways such a cache could function; a few obvious models are a fixed-size FIFO cache and a fixed-size LRU cache. Existing hardware mostly doesn’t follow any of these; specifically for fixed-size FIFO, relying on the replacement policy can be dangerous as illustrated by &lt;a href=&quot;/2017/07/31/optimal-grid-rendering-is-not-optimal/&quot;&gt;Optimal grid rendering isn’t optimal&lt;/a&gt;.&lt;/p&gt;
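&lt;p&gt;A fixed-size FIFO model is nevertheless easy to simulate; here’s a sketch that counts vertex shader invocations (i.e. cache misses) for an indexed triangle list under that model – keeping in mind that, as noted above, real replacement policies differ:&lt;/p&gt;

```c
#include <stddef.h>

/* Count vertex shader invocations for an indexed triangle list under a
 * fixed-size FIFO cache model; every cache miss costs one invocation.
 * This is only a model: real replacement policies differ between GPUs. */
size_t count_invocations_fifo(const unsigned* indices, size_t index_count, size_t cache_size)
{
    unsigned cache[64]; /* sketch assumes cache_size <= 64 */
    size_t used = 0, next = 0, invocations = 0;

    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j)
            if (cache[j] == indices[i]) { hit = 1; break; }

        if (!hit) {
            ++invocations;
            cache[next] = indices[i]; /* FIFO: overwrite the oldest entry */
            next = (next + 1) % cache_size;
            if (used < cache_size) ++used;
        }
    }
    return invocations;
}
```

&lt;p&gt;For example, the two triangles [0 1 2] and [2 1 3] from above cost 4 invocations with any reasonably sized cache, since vertices 1 and 2 are reused.&lt;/p&gt;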

&lt;p&gt;However, even if we knew the replacement policy for our target hardware, what would we do with this information? Problems of this nature tend to be NP-complete and require some sort of heuristics to get a reasonable result in a finite amount of time.&lt;/p&gt;

&lt;h1 id=&quot;algorithms&quot;&gt;Algorithms&lt;/h1&gt;

&lt;p&gt;There are a few well-known algorithms for optimizing meshes for vertex reuse. Two algorithms that I’ve implemented for meshoptimizer are &lt;a href=&quot;https://gfx.cs.princeton.edu/pubs/Sander_2007_%3ETR/tipsy.pdf&quot;&gt;Tipsy&lt;/a&gt;, which models a fixed-size FIFO cache, and Tom Forsyth’s &lt;a href=&quot;https://tomforsyth1000.github.io/papers/fast_vert_cache_opt.html&quot;&gt;Linear-Speed Vertex Cache Optimisation&lt;/a&gt;, which models a fixed-size LRU cache.&lt;/p&gt;

&lt;p&gt;Both algorithms are greedy and work somewhat similarly - given a set of recently seen vertices, they look at the adjacent triangles (TomF) or vertices (Tipsy), pick the next one, and emit the triangle or its adjacent triangles. The selection of the next triangle uses a heuristic that tries to improve the overall cache efficiency.&lt;/p&gt;

&lt;p&gt;Coming up with a heuristic that doesn’t compromise the global order in favor of the local order is challenging, since the heuristic only looks at the next possible choice&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. This is especially hard since the details of the cache behavior aren’t known.&lt;/p&gt;

&lt;p&gt;Initially after implementing both algorithms, I was tempted to abandon TomF’s algorithm. Using a fixed-size FIFO cache, Tipsy was generally producing slightly more efficient results on most meshes, substantially more efficient results on some meshes, notably large uniform grids, and was several times faster to boot - which isn’t that big of a deal until you are dealing with meshes that have millions of triangles. However, what if the hardware isn’t using a fixed-size FIFO&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;?&lt;/p&gt;

&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h1&gt;

&lt;p&gt;Instead of using a simple FIFO model when evaluating the algorithms, I decided to compare the behavior on the actual hardware.&lt;/p&gt;

&lt;p&gt;There are a couple of ways to do this - notably, you can &lt;a href=&quot;http://www.joshbarczak.com/blog/?p=1231&quot;&gt;use unordered access from vertex shaders&lt;/a&gt; to try to measure the number of times each vertex gets transformed, or &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ns-d3d11-d3d11_query_data_pipeline_statistics&quot;&gt;use performance counters provided by the hardware&lt;/a&gt; to measure the vertex shader invocation count. I implemented a simple program that used the performance counters and ran it on a few test meshes.&lt;/p&gt;

&lt;p&gt;The results were surprising. On both NVidia and AMD hardware, on the meshes where Tipsy was doing comparably or slightly better on the FIFO cache misses, the hardware counters consistently showed that TomF&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; was generating a noticeably more efficient order - with the exception of uniform grids.&lt;/p&gt;

&lt;p&gt;First I wanted to understand the hardware behavior better. While it’s straightforward to compute the total number of vertex shader invocations from an index sequence, that alone makes analysis cumbersome - it would tell us “this index sequence results in 19 invocations”, but not &lt;em&gt;why&lt;/em&gt;. To help with this, I wrote a special &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/master/tools/vcachetester.cpp&quot;&gt;analysis tool&lt;/a&gt; that measures the invocation count on variants of the input sequence:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For a given index sequence, we will compute the number of invocations for each prefix of the sequence that forms a triangle list; this means that for each triangle in the sequence, we can look at the sequence up until the previous triangle and determine the increase in the number of invocations from adding this triangle.&lt;/li&gt;
  &lt;li&gt;In addition to that, to try to analyze the indices within the triangle better, we will look at variants of the sequence where we replace triangle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a b c&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a a a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a b b&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Finally, to understand the state of the cache after processing the sequence, for each index &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; we will append the triangle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i i i&lt;/code&gt; to the sequence and measure the resulting invocations&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reason why we need to perform the modifications of the sequence is so that we can try to understand which specific indices in the sequence are causing a cache hit or miss. For example, if we know that adding the triangle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a b c&lt;/code&gt; increases the invocation count by 1, we can guess that only one of the indices results in an additional invocation; if replacing this triangle with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a a a&lt;/code&gt; leaves the invocation count as is, but replacing it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a b b&lt;/code&gt; increases it by 1 as well, then it’s likely that we had to transform &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; and not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As a result, for a given index sequence we can produce the result that looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// GPU 1: NVIDIA GeForce GTX 965M (Vendor 10de Device 1427)
// Sequence: 120 indices
//   0*3:   1*   2*   3*   4*   5*   6*   7*   8*   9*  10*   1    2
//   4*3:   3    4    5    6    7    8    9   10    1    2    3    4
//   8*3:   5    6    7    8    9   10    1    2    3    4    5    6
//  12*3:   7    8    9   10    1    2    3    4    5    6    7    8
//  16*3:   9   10    1    2    3    4    5    6    7    8    9   10
//  20*3:   1    2    3    4    5    6    7    8    9   10    1    2
//  24*3:   3    4    5    6    7    8    9   10    1    2    3    4
//  28*3:   5    6    7    8    9   10    1    2    3    4    5    6
//  32*3:   7*   8*   9*  10*   1*   2*   3*   4*   5*   6*   7    8
//  36*3:   9   10    1    2    3    4    5    6    7    8    9   10
// Cached  : 1 2 3 4 5 6 7 8 9 10 (10)
// Invocations: 20
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case, given a degenerate sequence of 12 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1-10&lt;/code&gt; index groups, we get 20 invocations - each vertex was processed twice - and we know which specific indices were processed (marked with *) and which weren’t. Now what we need to do is to analyze a number of index sequences to study the patterns and come up with a theory that explains them best.&lt;/p&gt;

&lt;p&gt;I won’t go too much into the results of the analysis - while it’s an interesting topic on its own, I’m not sure what hardware vendors would think of this, and I later learned of a fantastic paper that did this analysis using similar methods, &lt;a href=&quot;https://arbook.icg.tugraz.at/schmalstieg/Schmalstieg_351.pdf&quot;&gt;Revisiting The Vertex Cache: Understanding and Optimizing
Vertex Processing on the modern GPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a result, I had to enhance the simulation algorithms that meshoptimizer uses to measure the efficiency of the resulting index sequences to support more configurable parameters, so that I could quickly evaluate a sequence on models that resemble NVidia, AMD and Intel hardware. All three vendors use different parameters and replacement policies for their caches - if any of them are reading this, it would be nice if these details were publicly documented.&lt;/p&gt;

&lt;p&gt;With that out of the way… how do we actually improve the results?&lt;/p&gt;

&lt;h1 id=&quot;scoring&quot;&gt;Scoring&lt;/h1&gt;

&lt;p&gt;The way the TomF algorithm works is that every time it emits a triangle, it picks the triangle with the highest score from a set of candidates (triangles adjacent to the few most recently seen vertices). The score of a triangle is the sum of the scores of its vertices, and the score of a vertex is determined from two numbers, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cache&lt;/code&gt; (the position of the vertex in the LRU cache, 0 = most recent) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;valence&lt;/code&gt; (the number of not-yet-emitted triangles that this vertex belongs to), as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the score is the sum of cache score and valence score&lt;/li&gt;
  &lt;li&gt;cache score is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_triangle_score&lt;/code&gt; if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cache &amp;lt; 3&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1 - (cache - 3) / (cache_size - 3)) ^ cache_decay_power&lt;/code&gt; otherwise&lt;/li&gt;
  &lt;li&gt;valence score is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;valence_boost_scale / (valence ^ valence_boost_power)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_triangle_score = 0.75, cache_decay_power = 1.5, valence_boost_scale = 2.0, valence_boost_power = 0.5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
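&lt;p&gt;Put together, the scoring function can be sketched like this - a minimal reconstruction from the description above using TomF’s constants, not meshoptimizer’s actual source; note the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1 - x&lt;/code&gt; term in the cache score, which makes the score decay with cache position:&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Constants from the description above (TomF's original values);
// the simulated cache size of 32 is an assumption for this sketch.
const float last_triangle_score = 0.75f;
const float cache_decay_power = 1.5f;
const float valence_boost_scale = 2.0f;
const float valence_boost_power = 0.5f;
const int cache_size = 32;

// cache: position of the vertex in the LRU cache (0 = most recent), -1 if not cached
// valence: number of not-yet-emitted triangles that use this vertex (>= 1)
float vertex_score(int cache, int valence)
{
    // recently seen vertices score high, but the three most recent ones
    // are artificially lowered to avoid strip-like runs
    float cache_score = 0.f;
    if (cache >= 0)
        cache_score = (cache < 3)
            ? last_triangle_score
            : std::pow(1.f - float(cache - 3) / float(cache_size - 3), cache_decay_power);

    // vertices with fewer remaining triangles score higher
    float valence_score = valence_boost_scale / std::pow(float(valence), valence_boost_power);

    return cache_score + valence_score;
}
```

&lt;p&gt;The score of a candidate triangle is then just the sum of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vertex_score&lt;/code&gt; over its three vertices.&lt;/p&gt;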

&lt;p&gt;So generally speaking, the score decreases non-linearly as the cache position increases (the score is higher for recently seen vertices) and decreases non-linearly as the valence increases (the score is higher for vertices with fewer remaining triangles). Additionally, the score for the three most recently seen vertices is artificially lowered to avoid strip-like runs that penalize long-term efficiency.&lt;/p&gt;

&lt;p&gt;The heuristic intuitively makes sense. However, is it the best heuristic possible? One of the known problems with it was the suboptimal performance on uniform grids. I wanted to solve this, but wasn’t sure how to do so - it’s hard to find local tweaks of this heuristic that result in more optimal behavior globally.&lt;/p&gt;

&lt;p&gt;One obvious thought is that while the shape of the heuristic makes sense, it’s unclear that the parameters are chosen optimally. Why is the last triangle score &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.75&lt;/code&gt; instead of, say, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.8&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;To explore this further I implemented a simple brute-force optimization: there are only 4 values, so we can reasonably explore many different combinations, adjusting each parameter by a small increment, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.05&lt;/code&gt;, and rerunning the algorithm on a test data set. If the resulting efficiency improves, we’ve found slightly better constants.&lt;/p&gt;
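&lt;p&gt;The sweep itself is tiny. Here’s a sketch, where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;evaluate&lt;/code&gt; callback is a hypothetical stand-in for “re-run the optimizer on the test data set and measure the efficiency” (higher is better) - it is not a real meshoptimizer API:&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Nudge each parameter by +/- step and keep the change whenever the
// fitness improves; stop once no single nudge helps (a local optimum
// for this step size).
std::vector<float> tune(std::vector<float> params,
                        const std::function<float(const std::vector<float>&)>& evaluate,
                        float step = 0.05f, int rounds = 100)
{
    float best = evaluate(params);

    for (int r = 0; r < rounds; ++r)
    {
        bool improved = false;

        for (size_t i = 0; i < params.size(); ++i)
            for (float delta : {-step, step})
            {
                std::vector<float> candidate = params;
                candidate[i] += delta;

                float score = evaluate(candidate);
                if (score > best)
                {
                    best = score;
                    params = candidate;
                    improved = true;
                }
            }

        if (!improved)
            break;
    }

    return params;
}
```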

&lt;p&gt;After rerunning the tuning, I found &lt;a href=&quot;https://github.com/zeux/meshoptimizer/commit/ce93eaf278454a750a0b375f8464d18ec42dfc47&quot;&gt;slightly better values&lt;/a&gt; for the formula - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_triangle_score = 0.8, valence_boost_scale = 3.2, valence_boost_power = 0.9&lt;/code&gt; - but the gains were marginal, ~1% relative improvement on simulated NVidia-like model.&lt;/p&gt;

&lt;h1 id=&quot;tables&quot;&gt;Tables&lt;/h1&gt;

&lt;p&gt;Another problem with the original algorithm was performance. To make it as fast as I could without sacrificing quality, I replaced the expensive power functions in the original formula with cheaper table lookups, resulting in tables like this:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// last_triangle_score = 0.8, cache_decay_power = 1.5&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_score_table_cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_cache_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.000000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.800000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.800000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.800000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.000000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.948724&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.898356&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.848913&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.800411&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.752870&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.706309&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.660750&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.616215&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.572727&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.530314&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.489003&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.448824&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.409810&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.371997&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.335425&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.300136&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.266180&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.233610&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.202490&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.172889&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.144890&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.118591&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.094109&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.071591&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.051226&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.033272&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.018111&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.006403&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// valence_boost_scale = 3.2, valence_boost_power = 0.9&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_score_table_live&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_valence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.000000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;3.200000&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.714838&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.190531&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.918959&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.751756&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.637990&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.555344&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.492458&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.442927&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.402856&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.369740&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.341890&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.318127&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.297601&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.279684&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.263902&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.249888&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.237358&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.226085&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.215885&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.206611&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.198139&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.190368&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;mf&quot;&gt;0.183215&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.176605&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.170480&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.164787&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.159481&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.154523&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.149879&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.145521&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After this I reduced the table sizes to 16 for cache and 8 for valence, since this didn’t seem to hurt the quality of the results, while reducing the simulated cache size substantially improved the performance of the algorithm.&lt;/p&gt;

&lt;p&gt;One day I was staring at the tables and it suddenly hit me: if I am already evaluating the score as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f(cache) + g(valence)&lt;/code&gt; and if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;g&lt;/code&gt; are already table-driven, why do I need to generate the tables from some preconceived formula with bruteforced parameters if I can just find the optimal tables?&lt;/p&gt;

&lt;h1 id=&quot;tuning&quot;&gt;Tuning&lt;/h1&gt;

&lt;p&gt;After the tables became smaller, the cache table had 16 floats between 0 and 1, and the valence table had 8 floats; these were originally outside of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[0..1]&lt;/code&gt; range, but since the entire formula is scale-independent, we can normalize everything to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[0..1]&lt;/code&gt;. This gives us 24 floats that we need to find.&lt;/p&gt;
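&lt;p&gt;One detail worth noting: since the final score is a sum of one entry from each table, both tables have to be divided by the same constant - scaling them independently would change the relative weight of the cache and valence terms. A minimal sketch of the normalization:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// The score is f(cache) + g(valence); multiplying every entry of both tables
// by the same constant scales all triangle scores equally and never changes
// which candidate triangle wins, so we can bring both tables into [0..1].
void normalize_tables(std::vector<float>& cache_table, std::vector<float>& live_table)
{
    // the shared scale is the largest entry across both tables
    float maxv = std::max(*std::max_element(cache_table.begin(), cache_table.end()),
                          *std::max_element(live_table.begin(), live_table.end()));

    for (float& x : cache_table)
        x /= maxv;
    for (float& x : live_table)
        x /= maxv;
}
```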

&lt;p&gt;For each set of 24 values we can take a data set with many representative meshes (the test sets I use vary a bit, my most recent set contains a uniform grid - this was important to improve! - and a collection of various meshes including scans, sculpts, high poly and low poly game-ready models, for a total of ~1M triangles), optimize them using the table-driven algorithm and then measure the results. The same process can then be applied to a larger control set to make sure we aren’t overfitting the parameters to the test data set.&lt;/p&gt;

&lt;p&gt;To compare the results, I measured the data using models for multiple vendors (initially Intel, NVidia and AMD) and computed the fitness function as a sum of relative improvements between the baseline and the optimized result across all meshes. This metric is independent of the triangle count of each mesh, which is good because it means that one high-poly mesh doesn’t get weighted more heavily than meshes with fewer triangles.&lt;/p&gt;
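&lt;p&gt;In code, the fitness function is roughly the following (the struct and field names are illustrative, not meshoptimizer types; “cost” stands for whatever the simulated cache model reports, with lower being better):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One entry per (mesh, cache model) pair.
struct Result
{
    float baseline;  // simulated cost with the reference tables
    float optimized; // simulated cost with the candidate tables
};

// Sum of per-mesh relative improvements: every mesh contributes equally,
// regardless of its triangle count.
float fitness(const std::vector<Result>& results)
{
    float total = 0.f;
    for (const Result& r : results)
        total += (r.baseline - r.optimized) / r.baseline;
    return total;
}
```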

&lt;p&gt;Now that we can evaluate a given table… how do you bruteforce a 24-dimensional space?&lt;/p&gt;

&lt;p&gt;I have repeated the optimization process several times in the last few years, experimenting with different algorithms. In each case I took a specific optimization algorithm and ran it on a many-core machine, using OpenMP for parallelization&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; (typically a 64-core or 96-core cloud instance, ultimately paying hundreds of dollars to cloud providers), over the course of several days. To save costs when running in the cloud, I ended up using Google Cloud instead of AWS EC2 because the compute prices were cheaper, and using preemptible instances to get 2-3x cheaper execution. Because the optimization has to run for multiple days, I implemented support for persisting state so that when the cloud VM gets shut down due to preemption the state isn’t lost; additionally, a non-preemptible watchdog VM was running and waking up the many-core VM every 5 minutes&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;96 threads sounds like a lot, but a 24-dimensional search space is just way too big. For each combination of 24 numbers we need to run a million triangles through the optimization algorithm and then compute the fitness function; this process isn’t very fast.&lt;/p&gt;

&lt;h1 id=&quot;optimization&quot;&gt;Optimization&lt;/h1&gt;

&lt;p&gt;To make the problem tractable, we need to use an optimization algorithm. I tried three different ones over the years.&lt;/p&gt;

&lt;p&gt;First I implemented a &lt;a href=&quot;https://en.wikipedia.org/wiki/Genetic_algorithm&quot;&gt;genetic algorithm&lt;/a&gt;. Since our table is just a short vector of 24 floats, it’s trivial to implement mutation and crossover. Then you seed the population with many random vectors, and wait. And wait. And wait.&lt;/p&gt;

&lt;p&gt;I don’t know if this was a good idea, but I was bothered that after a round of evolution the next population wasn’t always better than the old one - that is, the best individual from one population could be replaced by a slightly worse one in the next, and it wasn’t clear that the process would converge. To fix that, the genetic evolution cheated: the best few specimens were copied from the old population to the new one verbatim (a technique known as elitism), and the rest were generated using mutation/crossover.&lt;/p&gt;
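&lt;p&gt;A minimal sketch of this setup - per-gene crossover between parents drawn from the fitter half, gaussian mutation, and the elitist copying of the best specimens; the details are illustrative rather than the exact tuner I used:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

using Genome = std::vector<float>;

// Minimal genetic optimizer: random initial population, per-gene crossover
// between two parents from the fitter half, gaussian mutation, and elitism -
// the best `elite` specimens survive into the next population verbatim, so
// the best fitness never regresses.
Genome evolve(const std::function<float(const Genome&)>& fitness, size_t dims,
              size_t population_size, size_t elite, int generations, unsigned seed = 42)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> unit(0.f, 1.f);
    std::normal_distribution<float> mutation(0.f, 0.05f);

    std::vector<Genome> population(population_size, Genome(dims));
    for (Genome& g : population)
        for (float& x : g)
            x = unit(rng);

    for (int gen = 0; gen < generations; ++gen)
    {
        // rank by fitness, best first (each genome is evaluated once)
        std::vector<std::pair<float, size_t>> ranked(population_size);
        for (size_t i = 0; i < population_size; ++i)
            ranked[i] = {fitness(population[i]), i};
        std::sort(ranked.begin(), ranked.end(), std::greater<>());

        std::vector<Genome> next;
        for (size_t i = 0; i < elite; ++i)
            next.push_back(population[ranked[i].second]);

        while (next.size() < population_size)
        {
            // parents are drawn uniformly from the fitter half
            const Genome& a = population[ranked[rng() % (population_size / 2)].second];
            const Genome& b = population[ranked[rng() % (population_size / 2)].second];

            Genome child(dims);
            for (size_t i = 0; i < dims; ++i)
                child[i] = ((rng() & 1) ? a[i] : b[i]) + mutation(rng);
            next.push_back(child);
        }

        population.swap(next);
    }

    Genome* best = &population[0];
    for (Genome& g : population)
        if (fitness(g) > fitness(*best))
            best = &g;
    return *best;
}
```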

&lt;p&gt;The results of the tuning process were extremely promising. Although it took many CPU hours to get there, the &lt;a href=&quot;https://github.com/zeux/meshoptimizer/commit/18f0220fd50a6887059a142fb22e43b391c82941&quot;&gt;resulting table&lt;/a&gt; was better than the table I started with - in many cases the delta was noticeable but not very significant, except for uniform grids where the results were ~10% better than the previous table.&lt;/p&gt;

&lt;p&gt;Additionally, this served as a validation of some of the intuition behind the heuristic used previously. The training was done tabula rasa - with no a priori knowledge about the problem, and given only the framework for the solution (the separable vertex scoring function), the machine decided that indeed, the three most recently seen vertices should have a lower score than the others (even though it didn’t assign equal weights to all three, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.792f, 0.767f, 0.764f&lt;/code&gt; - of course there’s a fair bit of noise in the resulting values).&lt;/p&gt;

&lt;p&gt;The optimization process seemed to get stuck after 3 days, and since I was paying real money for the experiments, I decided to try to alter the training algorithm in hopes that a different algorithm can produce better results.&lt;/p&gt;

&lt;p&gt;The next algorithm I implemented was a variant of &lt;a href=&quot;https://en.wikipedia.org/wiki/Simulated_annealing&quot;&gt;simulated annealing&lt;/a&gt;. Instead of performing mutation/crossover and arbitrarily picking some fraction of the last population as survivors, there’s a set of “temperature bands”; within each band we find a new state by mutating the old one, and probabilistically decide whether to replace the old state with the new one based on the difference in fitness.&lt;/p&gt;

&lt;p&gt;The classical description of the algorithm suggests starting with a high temperature value and gradually “cooling” it off - when temperature reaches 0, the result stabilizes. This didn’t fit the desired behavior since I didn’t know how to tune the starting temperature and wanted the algorithm to run indefinitely, so instead the annealing is performed simultaneously for multiple temperature values, and at the end of the run the best state propagates from higher temperatures to lower temperatures. This gives similar continuity to the solution - the best result (the state of the temperature band 0) continuously gets better.&lt;/p&gt;
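&lt;p&gt;Here’s roughly what that variant looks like (an illustrative sketch: band 0 runs at temperature 0 and holds the best result, hotter bands explore more freely, and better states flow from hot bands toward cold ones):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

using State = std::vector<float>;

// Simulated annealing over several fixed temperature bands instead of a
// cooling schedule: each band does Metropolis-style acceptance at its own
// temperature, and once per iteration better states propagate from hotter
// bands toward band 0. Band 0 only ever accepts improvements, so its
// result continuously gets better.
State anneal(const std::function<float(const State&)>& fitness, State initial,
             const std::vector<float>& temperatures, // ascending; temperatures[0] == 0
             int iterations, unsigned seed = 42)
{
    std::mt19937 rng(seed);
    std::normal_distribution<float> step(0.f, 0.05f);
    std::uniform_real_distribution<float> unit(0.f, 1.f);

    std::vector<State> bands(temperatures.size(), initial);
    std::vector<float> scores(temperatures.size(), fitness(initial));

    for (int it = 0; it < iterations; ++it)
    {
        for (size_t b = 0; b < bands.size(); ++b)
        {
            State candidate = bands[b];
            for (float& x : candidate)
                x += step(rng);

            float score = fitness(candidate);
            // always accept improvements; accept regressions with a
            // probability that shrinks as the temperature drops
            bool accept = score > scores[b] ||
                (temperatures[b] > 0.f && unit(rng) < std::exp((score - scores[b]) / temperatures[b]));

            if (accept)
            {
                bands[b] = candidate;
                scores[b] = score;
            }
        }

        // propagate better states from hotter bands into cooler ones
        for (size_t b = bands.size() - 1; b >= 1; --b)
            if (scores[b] > scores[b - 1])
            {
                bands[b - 1] = bands[b];
                scores[b - 1] = scores[b];
            }
    }

    return bands[0]; // the coolest band holds the best state seen
}
```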

&lt;p&gt;Of course, this algorithm eventually gets stuck as well. It’s possible that annealing could be tuned to produce better results, but in my runs the genetic optimizer came out ahead.&lt;/p&gt;

&lt;p&gt;At some point last year I was talking to &lt;a href=&quot;https://twitter.com/kenpex&quot;&gt;Angelo&lt;/a&gt; and the subject of numeric optimization came up. He introduced me to &lt;a href=&quot;https://en.wikipedia.org/wiki/Differential_evolution&quot;&gt;differential evolution&lt;/a&gt; which, contrary to what you might expect from the name, doesn’t require the function to be differentiable. Using the basic formulation from Wikipedia and the set of parameters suggested in &lt;a href=&quot;https://pdfs.semanticscholar.org/48aa/36e1496c56904f9f6dfc15323e0c45e34a4c.pdf&quot;&gt;Good Parameters for Differential Evolution&lt;/a&gt;, I was able to improve on the results a little bit further - differential evolution running for a day on a 96-core GCP instance resulted in the &lt;a href=&quot;https://github.com/zeux/meshoptimizer/commit/5c3cff18cc6941c64360b6c125b166e2966ac06d&quot;&gt;final tables&lt;/a&gt; that are still used right now.&lt;/p&gt;
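&lt;p&gt;For reference, here’s the basic DE/rand/1/bin scheme from the Wikipedia formulation in a few dozen lines (a sketch, not the actual tuner; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F&lt;/code&gt; is the differential weight and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CR&lt;/code&gt; the crossover probability):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

using Vec = std::vector<float>;

// Basic DE/rand/1/bin: for each agent, build a mutant from three other
// agents a + F * (b - c), binomially cross it over with the agent, and
// keep the trial only if it is no worse. Note that the fitness function
// never needs to be differentiable.
Vec differential_evolution(const std::function<float(const Vec&)>& fitness, size_t dims,
                           size_t population_size, int generations,
                           float F = 0.5f, float CR = 0.9f, unsigned seed = 42)
{
    assert(population_size >= 4); // need i plus three distinct agents

    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> unit(0.f, 1.f);

    std::vector<Vec> pop(population_size, Vec(dims));
    std::vector<float> score(population_size);
    for (size_t i = 0; i < population_size; ++i)
    {
        for (float& x : pop[i])
            x = unit(rng);
        score[i] = fitness(pop[i]);
    }

    for (int gen = 0; gen < generations; ++gen)
        for (size_t i = 0; i < population_size; ++i)
        {
            // pick three distinct agents, all different from i
            size_t a, b, c;
            do a = rng() % population_size; while (a == i);
            do b = rng() % population_size; while (b == i || b == a);
            do c = rng() % population_size; while (c == i || c == a || c == b);

            // binomial crossover; at least one dimension always crosses over
            Vec trial = pop[i];
            size_t forced = rng() % dims;
            for (size_t d = 0; d < dims; ++d)
                if (d == forced || unit(rng) < CR)
                    trial[d] = pop[a][d] + F * (pop[b][d] - pop[c][d]);

            // greedy selection
            float trial_score = fitness(trial);
            if (trial_score >= score[i])
            {
                pop[i] = trial;
                score[i] = trial_score;
            }
        }

    size_t best = 0;
    for (size_t i = 1; i < population_size; ++i)
        if (score[i] > score[best])
            best = i;
    return pop[best];
}
```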

&lt;p&gt;The beauty of the approach is that it just takes a little bit of machine time now to generate the functions for the target profile. For example, if somebody from AMD reached out to me tomorrow and told me exactly how the vertex reuse policy works (hint hint), I could generate a special table for AMD hardware that would probably be a few percent more efficient than the existing one - which might be reasonable to use when shipping a console title. Additionally, it’s much easier to explore a wider set of shapes of the heuristic function - maybe instead of adding the vertex scores we should multiply them? (nope) maybe we should generate a 16x8 non-separable table instead of a 16+8 separable one? (haven’t tried this yet because of concerns that the data set is too small) etc.&lt;/p&gt;

&lt;h1 id=&quot;compression&quot;&gt;Compression&lt;/h1&gt;

&lt;p&gt;A lot of work I am doing lately involves finding ways to efficiently compress the vertex and index data - instead of implementing complex mesh traversal algorithms that take a while to run during decompression and penalize rendering efficiency, the algorithms start from a triangle order optimized for vertex reuse and try to compress that.&lt;/p&gt;

&lt;p&gt;Specifically, meshoptimizer now ships with an index encoder (which assumes the index buffer has been optimized for vertex reuse, preserves the order, and outputs a much smaller byte sequence that can be decoded back into the index buffer at 1-2 GB/sec) and a stripifier (which starts with an optimized index buffer and generates a sequence of triangle strips that maintains reasonable reuse efficiency by slightly - but not too drastically - changing the triangle order, balancing vertex reuse against triangle strip length).&lt;/p&gt;

&lt;p&gt;When rendering performance is crucial, the index buffer should be optimized for vertex reuse and then compressed; when VRAM size is crucial it may be worth optimizing the index buffer and then converting it to triangle strips.&lt;/p&gt;

&lt;p&gt;What if it’s really important to optimize for compressed mesh size instead? Can we keep the overall algorithms and encoding structure, and produce a smaller compressed mesh? After exploring a few different ways to encode the index buffer while still preserving the order exactly, I kept running into the order restriction, and it became obvious that to generate smaller meshes with this approach, it was critical to reorder triangles a bit differently.&lt;/p&gt;

&lt;p&gt;If only there was a way to find a triangle order for a mesh that, instead of just optimizing for rendering efficiency, tried to make the index buffer - after using the specialized index compression - smaller.&lt;/p&gt;

&lt;p&gt;Wait. Right. That’s what we just did.&lt;/p&gt;

&lt;h1 id=&quot;eureka&quot;&gt;Eureka&lt;/h1&gt;

&lt;p&gt;And so instead of training a table to coerce the algorithm to find the most efficient sequence to render, I decided to train the algorithm to find the smallest sequence to transmit.&lt;/p&gt;

&lt;p&gt;Since the goal was to use the existing compression algorithm, the fitness function measured the size of the compressed index data (after using both the meshoptimizer index codec and deflate) in relation to the triangle count, and tried to minimize this. This time I decided to try to run all three algorithms again - since the fitness function was completely different from before, I wasn’t sure which algorithm would win&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;. Each algorithm was run for ~1 day on slightly weaker hardware, and the winning algorithm was then run for several days to produce the tables.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/vcachetuner.png&quot; alt=&quot;Tuning&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here are the results for all three algorithms over the first 24 hours (left hand side) and the first 100 minutes (right hand side). As we can see, all algorithms get most of the way there in the first hour or two, but differential evolution outperforms both alternative optimization algorithms on this problem by a significant margin.&lt;/p&gt;

&lt;p&gt;After this, differential evolution was run for a few more days, producing the following table:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// cache score&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.977&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.981&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.984&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.539&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.401&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.607&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.358&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.435&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.715&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.385&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.312&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.439&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.465&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.135&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.183&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.064&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// valence score&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.944&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.678&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.417&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.434&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.481&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.322&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.297&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.271&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This table is notably different from the vertex reuse table in that the first three elements are, in fact, close to 1. So the algorithm ranks the three vertices of the most recently seen triangle the highest.&lt;/p&gt;

&lt;p&gt;As we discussed briefly earlier, this tends to result in a strip-like order. And indeed, this order happens to result in substantially smaller triangle strips as well - so it’s a great set of parameters to use when trying to reduce index count! However, the optimization was trying to minimize the size of compressed index data when run through the index encoder and deflate compression. The index encoder was designed to compress cache-optimized index sequences, not triangle-strip-like sequences - why is it doing better?&lt;/p&gt;

&lt;p&gt;It turned out that the strip-like order has a much more predictable triangle structure than a cache-optimized order. There’s a long sequence of triangles adjacent to one another that goes back and forth across the (topologically) planar areas of the mesh. This results in the index encoder generating &lt;em&gt;more&lt;/em&gt; predictable encoded sequences&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;, which, in turn, results in deflate compressing the results better. The tuning algorithm picks up on that and finds the sequence that resembles strips without knowing anything about triangle strips - in fact, no part of the optimization pipeline knows about strips - they just happen to compress really well after the index codec!&lt;/p&gt;
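
&lt;p&gt;A toy demonstration of the effect - with a trivial delta encoding standing in for the actual index codec, which works quite differently - shows how much better deflate does on a strip-like index order than on the same triangles in a scrambled order:&lt;/p&gt;

```python
import random
import zlib

def delta_encode(indices):
    # Trivial stand-in for the real index codec: store differences
    # between consecutive indices so repetitive structure becomes
    # runs of identical bytes that deflate handles well.
    out = bytearray()
    prev = 0
    for i in indices:
        out.append((i - prev) % 256)
        prev = i
    return bytes(out)

# A strip-like triangle order: each triangle shares an edge with the
# previous one, so the index sequence is extremely regular.
strip_order = []
for t in range(1000):
    strip_order.extend([t, t + 1, t + 2])

# The same triangles emitted in a scrambled order.
random.seed(0)
tris = [[t, t + 1, t + 2] for t in range(1000)]
random.shuffle(tris)
shuffled_order = [i for tri in tris for i in tri]

strip_size = len(zlib.compress(delta_encode(strip_order), 9))
shuffled_size = len(zlib.compress(delta_encode(shuffled_order), 9))
# The strip-like order yields a far more predictable encoded stream,
# so deflate compresses it to a small fraction of the scrambled one.
print(strip_size, shuffled_size)
```

&lt;p&gt;Even with this crude encoding, the regularity of the strip-like order dominates the result; the real index codec exploits the same regularity far more systematically.&lt;/p&gt;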

&lt;p&gt;But see, it gets even better. Once we know that triangle strips are an interesting target, it’s not too hard to slightly tweak the index codec to anticipate triangle strip-like input and produce even more efficient byte sequences on these. After which you can re-train the tables to minimize the size even further! Which is what I am doing as we speak&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;When this journey started, I viewed the vertex cache optimization algorithm as something that should be understood and tuned by a human. And this still seems very valuable and important - structural changes to the algorithm require a deep understanding of the problem.&lt;/p&gt;

&lt;p&gt;However, studying the data is very powerful, and sometimes the machine can look for patterns on our behalf. This can be used to validate theories we have - it’s really fascinating to have the optimization process discover that something you thought to be true about the problem is, indeed, as far as we know, true! - and to discover theories we don’t yet have.&lt;/p&gt;

&lt;p&gt;Optimization algorithms in particular are an incredibly effective tool to have in the toolbox. A lot of attention is on deep learning and study of differentiable programs these days, but even if you don’t know too much about how the target function behaves, and you can’t run the learning algorithm on a large cluster of GPUs, it’s still possible to leverage the data to come up with enlightening answers.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A necessary disclaimer: I’m not a machine learning expert. It’s entirely possible that this article misuses some terms and that some analysis and conclusions here are wrong. You have been warned. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For example, it’s tempting to always pick the triangle that shares two vertices with the last emitted triangle; this can result in strip-like order which tends to be inefficient in the long run since it produces long strips of triangles and each vertex ends up being transformed twice on average for regular meshes. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of course if you optimize the mesh for a FIFO cache of a size that’s too large, the results are going to be substantially worse than expected - however, the variation between cache sizes isn’t that high in practice. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sorry, Tom, it’s your fault for not coming up with a short algorithm name. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Thanks to theagentd from &lt;a href=&quot;http://www.java-gaming.org/index.php?topic=37837.msg361895#msg361895&quot;&gt;Java-Gaming.org&lt;/a&gt; for the idea. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I used to scoff at OpenMP because of the lack of control and the focus on parallel for loops that tends to be insufficient when doing complex parallelization at scale, but it turns out that I just didn’t have the right problem. For this task a few OpenMP pragmas turned a serial program into a scalable parallel program with minimal effort. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Hopefully this doesn’t violate the terms of service for using preemptible instances? Hey, it was all for science and I paid for this out of my pocket. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If I am being honest, I mostly did this to gather data for this blog post. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I am sorry if this isn’t making too much sense but this post is getting long, and the details of index codec are best left for another day. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Also the resulting triangle sequence results in more compressible vertex data as well! But this discussion is also best left for another day. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Wed, 22 Jan 2020 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2020/01/22/learning-from-data/</link>
			<guid isPermaLink="true">https://zeux.io/2020/01/22/learning-from-data/</guid>
		</item>
		
		<item>
			<title>Three years of Metal</title>
			<description>&lt;p&gt;3 years ago, we ported our renderer to Metal. It didn’t take much time, it was a blast and it worked really well on iOS. Today Metal is in better shape than ever - and I’d like to talk a bit about that.&lt;/p&gt;

&lt;p&gt;But first, if you have not read the original article, &lt;a href=&quot;/2016/12/01/metal-retrospective/&quot;&gt;you might want to start with that&lt;/a&gt;; most of that still holds today.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;metal-in-2019&quot;&gt;Metal in 2019&lt;/h1&gt;

&lt;p&gt;The biggest changes that happened to Metal from my point of view in the last 3 years are about adoption at a massive scale.&lt;/p&gt;

&lt;p&gt;3 years ago, a quarter of iOS devices had to use OpenGL. Today, for our audience, this number is ~2% - which means our OpenGL backend barely matters anymore. We still maintain it but this will not continue for long.&lt;/p&gt;

&lt;p&gt;The drivers are also better than ever - generally speaking we don’t see driver issues on iOS, and when we do they often happen on early prototypes, and by the time the prototypes make their way to production, the issues are usually fixed.&lt;/p&gt;

&lt;p&gt;We’ve also spent some time improving our Metal backend, focusing on three areas:&lt;/p&gt;

&lt;h2 id=&quot;reworking-the-shader-compilation-toolchain&quot;&gt;Reworking the shader compilation toolchain&lt;/h2&gt;

&lt;p&gt;One other thing that happened in the last three years is the release and development of Vulkan. While it would seem that the APIs are completely different (and they are), the Vulkan ecosystem gave the rendering community a fantastic set of open-source tools that, when combined, result in an easy-to-use, production-quality compilation toolset.&lt;/p&gt;

&lt;p&gt;We used the libraries to build a compilation toolchain that can take HLSL source code (using various DX11 features including compute shaders), compile it to SPIRV, optimize the said SPIRV, and convert the resulting SPIRV to MSL (Metal Shading Language). It replaces our previous toolchain that could only use DX9 HLSL source as an input and had various correctness issues for complicated shaders.&lt;/p&gt;

&lt;p&gt;It is somewhat ironic that Apple didn’t have anything to do with this, but here we are. Huge thanks to the contributors and maintainers of &lt;a href=&quot;https://github.com/KhronosGroup/glslang&quot;&gt;glslang&lt;/a&gt;, &lt;a href=&quot;https://github.com/KhronosGroup/SPIRV-Tools&quot;&gt;spirv-opt&lt;/a&gt; and &lt;a href=&quot;https://github.com/KhronosGroup/SPIRV-Cross&quot;&gt;SPIRV-Cross&lt;/a&gt;. We have contributed a set of patches to these libraries to help us ship the new toolchain as well, and use it to retarget our shaders to Vulkan, Metal and OpenGL APIs.&lt;/p&gt;

&lt;h2 id=&quot;macos-support&quot;&gt;macOS support&lt;/h2&gt;

&lt;p&gt;A macOS port was always a possibility but wasn’t a big focus for us until we started missing some features and decided that we should invest in Metal on macOS to get a faster renderer and unlock some future projects.&lt;/p&gt;

&lt;p&gt;From the implementation perspective, this wasn’t very hard at all. Most of the API is exactly the same; other than window management, the only area that required substantial tweaks was memory allocation. On mobile, there’s a shared memory space for buffers and textures whereas on desktop, the API assumes a dedicated GPU with its own video memory.&lt;/p&gt;

&lt;p&gt;It’s possible to quickly work around that by using managed resources, where the Metal runtime takes care of copying the data for you. This is how we shipped our first version, but we later reworked the implementation to more explicitly copy resource data using scratch buffers so that we could minimize the system memory overhead.&lt;/p&gt;

&lt;p&gt;The biggest difference between macOS and iOS was stability. On iOS we were dealing with just one driver vendor on one architecture, whereas on macOS we had to support all three vendors (Intel, AMD, NVidia). Additionally, on iOS we - luckily! - skipped the &lt;em&gt;first&lt;/em&gt; version of iOS where Metal was available, iOS 8, and on macOS this was not practical because too few users would have been able to use Metal at the time. Because of the combination of these issues, we have hit many more driver issues in both relatively innocuous and relatively obscure areas of the API on macOS.&lt;/p&gt;

&lt;p&gt;We still support all versions of macOS Metal (10.11+), although we started removing support and switching to legacy OpenGL backend for some versions with known shader compiler bugs that are hard for us to work around, e.g. on 10.11 we now require macOS 10.11.6 for Metal to work.&lt;/p&gt;

&lt;p&gt;The performance benefits were in line with our expectations; in terms of market share, today we are at ~25% OpenGL and ~75% Metal users on the macOS platform, which is a pretty healthy split. This means that at some point in the future it may be practical for us to stop supporting desktop OpenGL at all, as no other platforms we support use it, which is great in terms of being able to focus on APIs that are easier to handle and get good performance with.&lt;/p&gt;

&lt;h2 id=&quot;iterating-on-performance-and-memory-consumption&quot;&gt;Iterating on performance and memory consumption&lt;/h2&gt;

&lt;p&gt;We are historically pretty conservative with the graphics API features that we use, and Metal is no exception. There are several big feature updates that Metal has acquired over the years, including improved resource allocation APIs with explicit heaps, tile shaders with Metal 2, argument buffers and GPU-side command generation, etc.&lt;/p&gt;

&lt;p&gt;We mostly don’t use any of the newer features - so far, the performance has been reasonable, and we’d like to focus on improvements that apply across the board, so something like tile shaders, that requires us to implement very special support for it throughout the renderer and is only accessible on newer hardware, is less interesting.&lt;/p&gt;

&lt;p&gt;Having said that, we spent some amount of time tuning various parts of the backend to just run &lt;em&gt;faster&lt;/em&gt; - using completely asynchronous texture uploads to reduce stuttering during level loads, which was completely painless, doing the aforementioned memory optimizations on macOS, optimizing CPU dispatch in various places of the backend by reducing cache misses etc., and - one of the only newer features we have explicit support for - using memoryless texture storage when available to significantly reduce the memory required for our new shadow system.&lt;/p&gt;

&lt;h1 id=&quot;future&quot;&gt;Future&lt;/h1&gt;

&lt;p&gt;Overall, the fact that we didn’t have to spend too much time on Metal improvements is actually a good thing - the code that was written 3 years ago, largely speaking, works and is fast and stable, which is a great sign of a mature API. Porting to Metal was a great investment, given the amount of time it took and the continuous benefits it gives us and our users.&lt;/p&gt;

&lt;p&gt;We constantly reevaluate the balance between the amount of work we do for different APIs - it is very likely that we will need to dive deeper into more modern parts of Metal API for some of the future rendering projects; if it does happen, there’s probably going to be another post about this!&lt;/p&gt;
</description>
			<pubDate>Thu, 12 Dec 2019 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2019/12/12/three-years-of-metal/</link>
			<guid isPermaLink="true">https://zeux.io/2019/12/12/three-years-of-metal/</guid>
		</item>
		
		<item>
			<title>Robust pipeline cache serialization</title>
			<description>&lt;p&gt;When writing a Vulkan renderer, one has to learn a lot of new concepts. Some of them are easier to deal with than others, and one of the pretty straightforward additions is the pipeline cache. To make sure pipeline creation is as efficient as possible, you need to create a pipeline cache and use it whenever you need to create a new pipeline. To make sure subsequent runs of your application don’t have to spend the time repeatedly compiling the shader microcode, you need to save the pipeline cache data to a file, and load it next time your application starts. How hard can this possibly be?&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Pretty hard, as it turns out.&lt;/p&gt;

&lt;h1 id=&quot;whats-in-a-pipeline-cache&quot;&gt;What’s in a pipeline cache?&lt;/h1&gt;

&lt;p&gt;Pipeline cache data is a (mostly) opaque blob; you create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCache&lt;/code&gt; object, possibly giving it the initial blob to start with, and then at some point you can retrieve the data blob from this object.&lt;/p&gt;

&lt;p&gt;While we don’t know much about the contents of the blob short of reading graphics driver source code&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, the pipeline cache data is guaranteed to start with a structure that identifies the device and looks something like this:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VkPipelineCacheHeaderOne&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// == sizeof(VkPipelineCacheHeaderOne)&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// == VK_PIPELINE_CACHE_HEADER_VERSION_ONE&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vendorID&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deviceID&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uuid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;VK_UUID_SIZE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The header is followed by driver-specific information that typically contains bits of shader microcode, the format of which depends on the GPU, and auxiliary data that may contain arbitrary driver defined structures. Some drivers treat this blob as a structured file stream and read data from it, some drivers store raw structures defined in driver source in that blob and use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy&lt;/code&gt; or pointer casts to navigate the data; needless to say, a driver update may invalidate the way the data is stored.&lt;/p&gt;
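
&lt;p&gt;To make the layout concrete, here is a sketch (in Python for brevity) of pulling these fields back out of a cache blob - an application can then compare vendorID, deviceID and the UUID against the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPhysicalDeviceProperties&lt;/code&gt; of the current device:&lt;/p&gt;

```python
import struct

VK_PIPELINE_CACHE_HEADER_VERSION_ONE = 1
VK_UUID_SIZE = 16
HEADER_ONE_SIZE = 32  # four uint32 fields plus a 16-byte UUID

def parse_cache_header(blob):
    # Reads the VkPipelineCacheHeaderOne structure from the start of a
    # pipeline cache blob. '=' asks struct for standard sizes in native
    # byte order, matching how the driver lays the header out in memory.
    if len(blob) >= HEADER_ONE_SIZE:
        length, version, vendor_id, device_id = struct.unpack_from('=IIII', blob, 0)
        if length == HEADER_ONE_SIZE and version == VK_PIPELINE_CACHE_HEADER_VERSION_ONE:
            return {'vendorID': vendor_id,
                    'deviceID': device_id,
                    'uuid': blob[16:16 + VK_UUID_SIZE]}
    return None  # not a recognizable version-one header
```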

&lt;p&gt;Now, in theory, the application just needs to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetPipelineCacheData&lt;/code&gt; to retrieve a data blob after the application reaches a steady state (for example before the application exits…), save the blob to a file, and then pass this blob using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPipelineCacheCreateInfo::pInitialData&lt;/code&gt; when creating the pipeline cache on the next run. If the contents of the blob don’t work for the current version of the driver - maybe the driver was updated, or maybe the user switched to a different GPU - the driver is supposed to ignore the initial data and create an empty pipeline cache.&lt;/p&gt;

&lt;p&gt;In practice, theory and practice are a bit different. The rule of thumb in practice is that a driver will only be able to correctly handle the &lt;em&gt;exact&lt;/em&gt; blob that the &lt;em&gt;exact&lt;/em&gt; same driver gave your application previously. Which is where the problems begin&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h1 id=&quot;is-the-driver-the-same&quot;&gt;Is the driver the same?&lt;/h1&gt;

&lt;p&gt;The specification assumes that the cache isn’t compatible between different devices (which is why vendorID and deviceID are present in the header), and relies on the driver to establish a pipeline UUID - which is a 16-byte GUID - that accurately identifies the full set of factors that determine whether a pipeline cache blob can be interpreted - you can think of this as a version number of the pipeline cache format. For example, during a driver upgrade, it may be the case that the pipeline cache format is &lt;em&gt;not&lt;/em&gt; updated, in which case the UUID typically shouldn’t change, which means that the application won’t need to recompile the shaders from scratch.&lt;/p&gt;

&lt;p&gt;However, drivers in the wild tend to exhibit two types of problems.&lt;/p&gt;

&lt;p&gt;Some older drivers neglect to verify the UUID correctly. As a result, during a driver update the application may try to give the blob with a stale UUID to the driver; the driver will try to interpret this as recent data and, as a result, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreatePipelineCache&lt;/code&gt; may crash. Note that in general &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreatePipelineCache&lt;/code&gt; doesn’t provide a guarantee that it accepts &lt;em&gt;arbitrary&lt;/em&gt; data and can handle it cleanly.&lt;/p&gt;

&lt;p&gt;Some drivers, including pretty recent ones, may neglect to update the UUID in a driver update that actually breaks compatibility of the shader pipeline binary. This can happen during a driver version update (although this is rare), or - something that happens trivially on current drivers of at least one major vendor - between driver binaries that are built from the same version for a different ABI. If a 32-bit driver and a 64-bit driver that ship on the same system have the same pipeline UUID, then &lt;em&gt;saving&lt;/em&gt; the cache from a 32-bit version of the application and &lt;em&gt;loading&lt;/em&gt; it from a 64-bit version may cause the driver to crash - which is &lt;em&gt;exactly&lt;/em&gt; what happens when you ship a 32-bit version of your application and then update it to 64-bit following Google’s guidelines.&lt;/p&gt;

&lt;h1 id=&quot;is-the-data-the-same&quot;&gt;Is the data the same?&lt;/h1&gt;

&lt;p&gt;Now that we know what awaits us when it comes to header validation, what’s next is validating the data. After calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkGetPipelineCacheData&lt;/code&gt;, the application saves the blob and loads the exact same blob on the next run.&lt;/p&gt;

&lt;p&gt;It turns out that saving data to a file is &lt;a href=&quot;https://danluu.com/deconstruct-files/&quot;&gt;basically impossible to do well&lt;/a&gt;. Filesystem issues as well as process stability issues may in some cases lead to files that are partially written, have chunks filled with zeroes at the end (or even with garbage), or, as a special case, are created but stay zero-size. On mobile, this can be complicated by the fact that the application is likely to be terminated abruptly at an arbitrary point in time by the user or the OS, something that happens less frequently on desktop; on Android it’s also common to use multi-process (multi-activity) applications and if your pipeline cache code runs in both processes and shares the same output file, these challenges become even harder to solve.&lt;/p&gt;

&lt;p&gt;The reason why zero-size files are particularly interesting is that there is at least one driver version that we’ve run into where passing a non-NULL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pInitialData&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;initialDataSize == 0&lt;/code&gt; returns an error during pipeline cache creation. Which brings us to the final caveat.&lt;/p&gt;

&lt;h1 id=&quot;error-handling-is-hard&quot;&gt;Error handling is hard&lt;/h1&gt;

&lt;p&gt;While the spec says that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreatePipelineCache&lt;/code&gt; should basically always succeed, short of running out of memory, such statements in the spec are rarely accurate. When creating the pipeline cache, the driver is supposed to ignore initial data if it’s incompatible (for example if it’s zero sized, if the stored UUID didn’t match the expected UUID, or if deserialization failed for any other reason); some drivers, instead, fail to create the pipeline cache.&lt;/p&gt;

&lt;p&gt;The user definitely isn’t at fault here, so aborting the application would not be polite; while it’s generally possible to proceed without a pipeline cache, that’s usually a terrible idea because that means that each pipeline has to be recompiled from scratch. That is, pipeline caches have utility even if they are not serialized to disk because they allow the driver to cache the results of compilation across pipeline objects in memory.&lt;/p&gt;

&lt;p&gt;All of this naturally leads to…&lt;/p&gt;

&lt;h1 id=&quot;its-not-paranoia-if-they-are-really-out-to-get-you&quot;&gt;It’s not paranoia if they are really out to get you&lt;/h1&gt;

&lt;p&gt;… the solution. When serializing pipeline cache data to the file, we use a header that is filled with enough information to be able to validate the data, with the pipeline cache data following immediately afterwards:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PipelineCachePrefixHeader&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;magic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// an arbitrary magic header to make sure this is actually our file&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// equal to *pDataSize returned by vkGetPipelineCacheData&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataHash&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// a hash of pipeline cache data, including the header&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vendorID&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;// equal to VkPhysicalDeviceProperties::vendorID&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deviceID&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;// equal to VkPhysicalDeviceProperties::deviceID&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;driverVersion&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// equal to VkPhysicalDeviceProperties::driverVersion&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;driverABI&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// equal to sizeof(void*)&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uuid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;VK_UUID_SIZE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// equal to VkPhysicalDeviceProperties::pipelineCacheUUID&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The hash of the pipeline cache data will allow us to validate the integrity of the data; to reduce the chance of an I/O error &lt;em&gt;actually&lt;/em&gt; causing an integrity issue, we create a temporary file, write this header to the file followed by the pipeline cache data, and then move the file to the target location using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rename&lt;/code&gt;.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;When loading the pipeline cache, we read the header, read the data, validate the data read using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dataSize&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dataHash&lt;/code&gt;, then validate that the data can be safely passed to the driver by comparing the remaining fields with the properties of the device&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;If the data is valid, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreatePipelineCache&lt;/code&gt; is called with the correct initial data. Crucially, if this call fails, this suggests that the driver implements additional checks that our logic didn’t detect on its own - instead of proceeding without the pipeline cache, we create an empty pipeline cache in this case by calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreatePipelineCache&lt;/code&gt; again, with no initial data.&lt;/p&gt;

&lt;p&gt;We also create the empty pipeline cache if the pipeline cache file was not found or our validation logic classified the data as unusable.&lt;/p&gt;
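
&lt;p&gt;Here is a minimal sketch of this save/load scheme, in Python for brevity (the real implementation is C++); FNV-1a stands in for the hash - any sturdy 64-bit hash works - and a plain dictionary stands in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VkPhysicalDeviceProperties&lt;/code&gt;:&lt;/p&gt;

```python
import os
import struct
import tempfile

MAGIC = 0x50434143  # hypothetical magic value; any fixed constant works
HEADER_SIZE = 48

def fnv1a64(data):
    # FNV-1a stands in for the hash; the header only needs a hash that
    # reliably detects truncation and corruption.
    h = 0xcbf29ce484222325
    for b in data:
        h = ((h ^ b) * 0x100000001b3) % (2 ** 64)
    return h

def pack_header(device_props, data):
    # Mirrors PipelineCachePrefixHeader: magic, dataSize, dataHash,
    # vendorID, deviceID, driverVersion, driverABI, pipelineCacheUUID.
    return struct.pack('=IIQIIII16s', MAGIC, len(data), fnv1a64(data),
                       device_props['vendorID'], device_props['deviceID'],
                       device_props['driverVersion'], device_props['driverABI'],
                       device_props['pipelineCacheUUID'])

def save_cache(path, device_props, data):
    # Write to a temporary file, then move it into place with a rename
    # so a crash mid-write cannot leave a truncated cache file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with os.fdopen(fd, 'wb') as f:
        f.write(pack_header(device_props, data))
        f.write(data)
    os.replace(tmp, path)

def load_cache(path, device_props):
    # Returns validated pipeline cache data, or None - in which case the
    # caller creates an empty VkPipelineCache with no initial data.
    try:
        with open(path, 'rb') as f:
            blob = f.read()
    except OSError:
        return None  # no cache file yet
    data = blob[HEADER_SIZE:]
    # Recomputing the expected header and comparing it byte-for-byte
    # validates the magic, size, hash and all device fields at once.
    if len(blob) >= HEADER_SIZE and blob[:HEADER_SIZE] == pack_header(device_props, data):
        return data
    return None  # zero-size, truncated, corrupt, or from a different driver
```

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCreatePipelineCache&lt;/code&gt; still fails with data that passed this validation, the fallback is the same: call it again with no initial data.&lt;/p&gt;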

&lt;blockquote&gt;
  &lt;p&gt;Note: because we incorporate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;driverVersion&lt;/code&gt; into the header, any driver update will cause the pipeline cache to be rebuilt; we include this check because it completely eliminates issues where the pipeline cache UUID doesn’t update even when it should - typically &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;driverVersion&lt;/code&gt; is updated as part of the build process, whereas a UUID update is more manual. For applications that target desktop exclusively this can be too aggressive - in general, desktop drivers are likely to be better behaved with respect to handling pipeline cache validity, so not all of this advice applies.&lt;/p&gt;
&lt;/blockquote&gt;
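&lt;p&gt;The validation flow above can be sketched as follows. This is a Python sketch with an assumed header layout and hash function - the exact field set, order and hash choice are up to the application, not prescribed here:&lt;/p&gt;

```python
import hashlib
import struct

# Hypothetical header: dataSize, dataHash, vendorID, deviceID, driverVersion,
# then a 16-byte pipelineCacheUUID; the exact layout is up to the application.
HEADER = struct.Struct("IIIII16s")

def data_hash(data):
    # Any stable hash works here; this sketch just truncates SHA-256.
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "little")

def load_pipeline_cache(blob, props):
    """Return cache data that is safe to pass to vkCreatePipelineCache, or None."""
    if HEADER.size > len(blob):
        return None
    size, digest, vendor, device, driver, uuid = HEADER.unpack_from(blob)
    data = blob[HEADER.size:]
    # Validate that the data was written out fully and correctly...
    if size != len(data) or digest != data_hash(data):
        return None
    # ...and that it comes from the same device and driver version.
    expected = (props["vendorID"], props["deviceID"],
                props["driverVersion"], props["pipelineCacheUUID"])
    if (vendor, device, driver, uuid) != expected:
        return None
    return data
```

&lt;p&gt;If this function returns None (file missing, corrupt, or from a different device/driver), the application falls back to creating an empty pipeline cache.&lt;/p&gt;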

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;&lt;del&gt;Friends don’t let friends ship Vulkan on Android&lt;/del&gt; Vulkan drivers are not always correct and don’t always follow the specification to the letter. Pipeline cache data is an especially fragile part of the Vulkan renderer because I/O is challenging to get right, and there’s often minimal to no integrity checks in the driver. However, with enough application-side validation, you &lt;em&gt;can&lt;/em&gt; eliminate stability issues coming from the pipeline cache handling in practice - it just takes work.&lt;/p&gt;

&lt;p&gt;Good luck. &lt;del&gt;You’re going to need it.&lt;/del&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Which you can absolutely do these days! For example, &lt;a href=&quot;https://github.com/mesa3d/mesa/blob/1d5ee315536d4563714b35004d9efc1bd6621f53/src/amd/vulkan/radv_pipeline_cache.c#L525&quot;&gt;here’s an implementation of vkGetPipelineCacheData for radv&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The remainder of this article is based on the experience of continuously shipping &lt;a href=&quot;https://corp.roblox.com/&quot;&gt;Roblox&lt;/a&gt; client on Android with Vulkan support and surviving through various Android OS updates, driver updates and in general dealing with both early and current Vulkan drivers from all major vendors. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In theory &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rename&lt;/code&gt; is supposed to be atomic, but in practice the exact semantics and guarantees vary with the file system; hash is useful as a way to perform a robust comparison. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Depending on the application you may want to also use different file names based on, for example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vendorID&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;driverABI&lt;/code&gt;; this is more interesting on desktop and less interesting on mobile. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Wed, 17 Jul 2019 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2019/07/17/serializing-pipeline-cache/</link>
			<guid isPermaLink="true">https://zeux.io/2019/07/17/serializing-pipeline-cache/</guid>
		</item>
		
		<item>
			<title>qgrep internals</title>
			<description>&lt;p&gt;In 2011-2012 I worked on FIFA Street, followed by FIFA EURO 2012 DLC and finally FIFA 13 - all of these games were based on the same codebase, and this codebase was HUGE. Given an unknown codebase, you need a way to quickly get around it - since you don’t know the code, you resort to search-based navigation, aka grep. Using Visual Studio Ctrl+Shift+F search on a HDD on a codebase this size means that every search takes minutes. This was frustrating and as such I decided to solve this problem.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;I wanted a tool that was much faster than existing alternatives&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; and was able to perform both literal and regular expression searches at similar speed. At the time, the only existing tool that gave near-instantaneous results was &lt;a href=&quot;https://github.com/google/codesearch&quot;&gt;Google Code Search&lt;/a&gt; - unfortunately, the performance on case-insensitive queries or some types of regular expressions at the time wasn’t very good&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Thus I decided to make a new tool, &lt;a href=&quot;https://github.com/zeux/qgrep&quot;&gt;qgrep&lt;/a&gt;&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. This article will go over the design and performance optimizations that make qgrep really fast.&lt;/p&gt;

&lt;h2 id=&quot;how-fast-are-we-talking-about&quot;&gt;How fast are we talking about?&lt;/h2&gt;

&lt;p&gt;Remember, qgrep was written in 2012, so the competition at the time was grep, ag, and Visual Studio Ctrl+Shift+F, and the hardware target was a multi-core CPU with an HDD. Today the gold standard of performance is set by &lt;a href=&quot;https://github.com/BurntSushi/ripgrep&quot;&gt;ripgrep&lt;/a&gt;, which runs much faster than the alternatives, and the norm is to use an SSD. Still, qgrep is &lt;em&gt;much&lt;/em&gt; faster.&lt;/p&gt;

&lt;p&gt;As an example, let’s search for a simple query, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vkCmdDrawInde.*KHR = 0&lt;/code&gt;, in the UE4 codebase, which today is similar in size to the FIFA codebase from 2012. We’ll run qgrep and ripgrep twice: the first time after a reboot, so the filesystem cache is cold, and the second time immediately after. The timings were taken on my desktop system with an i7 8700K (6 cores, 12 threads) and a Samsung 970 EVO SSD (an M.2 NVMe drive) on Windows 10, using the latest ripgrep/qgrep x64 builds. Here’s the hot run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;C:\work\qgrep&amp;gt;build\Release_x64\qgrep search ue4 S &quot;vkCmdDrawInde.*KHR = 0&quot;
C:/work/unrealengine/Engine/Source/ThirdParty/Vulkan/Include/vulkan/vulkan.hpp:44478:    PFN_vkCmdDrawIndexedIndirectCountKHR vkCmdDrawIndexedIndirectCountKHR = 0;
Search complete, found 1 matches in 0.03 sec

C:\work\unrealengine&amp;gt;rg --stats &quot;vkCmdDrawInde.*KHR = 0&quot;
Engine\Source\ThirdParty\Vulkan\Include\vulkan\vulkan.hpp
53924:    PFN_vkCmdDrawIndexedIndirectCountKHR vkCmdDrawIndexedIndirectCountKHR = 0;

1 matches
1 matched lines
1 files contained matches
118457 files searched
143 bytes printed
1237500936 bytes searched
3.651454 seconds spent searching
1.216025 seconds
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And here’s the table of timings in seconds; we also run qgrep with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; option, which will be explained later (it stands for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bruteforce&lt;/code&gt;):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;tool&lt;/th&gt;
      &lt;th&gt;cold&lt;/th&gt;
      &lt;th&gt;hot&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;qgrep&lt;/td&gt;
      &lt;td&gt;0.42&lt;/td&gt;
      &lt;td&gt;0.03&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;qgrep bruteforce&lt;/td&gt;
      &lt;td&gt;0.66&lt;/td&gt;
      &lt;td&gt;0.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ripgrep&lt;/td&gt;
      &lt;td&gt;10.2&lt;/td&gt;
      &lt;td&gt;1.2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It’s possible that if SSDs had been common in 2012 and ripgrep had been available, I wouldn’t have gone through the trouble of making a new tool - but the pain threshold was crossed, so I did, and here we are. But how is it possible to make a code search tool that’s so much faster than ripgrep? Why, by cheating, of course.&lt;/p&gt;

&lt;h2 id=&quot;eliminating-io-bottlenecks&quot;&gt;Eliminating I/O bottlenecks&lt;/h2&gt;

&lt;p&gt;The single biggest source of performance issues at the time was disk I/O and I/O-related system calls. A recursive grep tool must start by recursively traversing the target folder; as new files are encountered, it needs to read their contents, perform a regular expression or literal search on them, and print the matches, if any. In theory, the in-memory filesystem cache is supposed to make subsequent searches fast; on the FIFA codebase, however, neither the file hierarchy nor the file contents were ever fully in the filesystem cache, especially as you worked on the code (for example, switching to the browser to google something would likely evict large portions of the search data set). Even when the filesystem hierarchy was in the cache, retrieving it wasn’t very efficient due to the large number of kernel operations required. As you can see from the ripgrep results, the impact is very significant even on SSDs; on HDDs we’re talking about waiting a minute for the search to complete.&lt;/p&gt;

&lt;p&gt;To get around these issues, qgrep maintains a compressed representation of the entire searchable codebase. It’s stored in one file that consists of chunks; each chunk contains a list of file paths and the file data, both compressed using LZ4. Files are added to the current chunk until the uncompressed size reaches a certain threshold (512 KB at the moment); then the chunk is compressed and a new chunk is started. If a file is very large (which is actually the case for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vulkan.hpp&lt;/code&gt; at 2.2 MB), it’s broken up into several chunks using newlines as splitting points, which works well because the search is line-based.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/qgrep_1.png&quot; alt=&quot;File structure&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Chunks are compressed and decompressed atomically, in the sense that you can’t update just one file in a chunk - the compressor receives a blob that consists of the contents of all files concatenated together. Chunks that are too small decrease the efficiency of compression; chunks that are too large make incremental updates and parallel decompression less efficient.&lt;/p&gt;
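&lt;p&gt;A minimal sketch of this chunking scheme, using zlib as a stand-in for LZ4 (qgrep itself uses LZ4/LZ4HC) and omitting the details of how paths are stored:&lt;/p&gt;

```python
import zlib  # stand-in for LZ4; qgrep uses LZ4/LZ4HC

CHUNK_LIMIT = 512 * 1024  # uncompressed size threshold per chunk

def build_chunks(files):
    """files: iterable of (path, bytes); yields (paths, compressed_blob)."""
    paths, blobs, total = [], [], 0
    for path, data in files:
        start = 0
        while start != len(data):
            # Take as much of the file as fits into the current chunk,
            # splitting oversized files at a newline boundary when possible.
            take = min(len(data) - start, CHUNK_LIMIT - total)
            if take != len(data) - start:
                nl = data.rfind(b"\n", start, start + take)
                if nl >= start:
                    take = nl + 1 - start
            paths.append(path)
            blobs.append(data[start:start + take])
            total += take
            start += take
            if total >= CHUNK_LIMIT:
                # Chunk is full: compress the concatenated blob atomically.
                yield paths, zlib.compress(b"".join(blobs))
                paths, blobs, total = [], [], 0
    if blobs:
        yield paths, zlib.compress(b"".join(blobs))
```

&lt;p&gt;Decompression is the mirror image: each chunk decompresses to one contiguous blob of file contents, which is exactly what the line-based search wants to scan.&lt;/p&gt;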

&lt;p&gt;Having a single file that contains a compressed representation of &lt;em&gt;all&lt;/em&gt; input data means that even if the file is not in the file cache, reading it is as fast as possible, since it’s often stored contiguously on disk. Compression is important because HDD read speeds are slow compared to fast decompressors, and compressing increases the chance that the files will stay in cache.&lt;/p&gt;

&lt;p&gt;One other big advantage is that files can be pre-filtered during indexing to exclude files that aren’t interesting to search; by default qgrep includes most text files, using the file extension as a filter. This means that large folders with, say, build artifacts only affect the time it takes to update the data, and don’t impact search performance. In the UE4 example above, ripgrep looks through 1.2GB of file data, while qgrep only looks through 1.0GB (compressed to 212MB).&lt;/p&gt;

&lt;h2 id=&quot;why-lz4&quot;&gt;Why LZ4?&lt;/h2&gt;

&lt;p&gt;An important goal when choosing the compression algorithm was to have decompression run at similar performance to the search itself (which, as you will see soon, can be &lt;em&gt;really&lt;/em&gt; efficient). While it may seem that having a slower decompressor is good enough as long as it’s much faster than the disk, when the file stays in cache the disk I/O isn’t a factor so the faster the decompressor is, the better.&lt;/p&gt;

&lt;p&gt;At the time, a solid choice for extremely fast decompression was LZ4; today Zstd provides an interesting alternative that I haven’t evaluated rigorously for this use case. Snappy was also available at the time, but it had slower decompression and didn’t have a high-compression option. LZ4 has also made a lot of progress over the years; several updates to lz4 since 2012 improved search performance on decompression-heavy queries by ~50% in total.&lt;/p&gt;

&lt;p&gt;When qgrep was made initially, all focus was on search performance, and the time it took to update the qgrep data wasn’t that important. Because of this, LZ4HC was perfect for the job - at the cost of spending more time when compressing the data, it gave better compression ratios. In later releases a few changes were made to address the slow compression performance:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Chunk compression was moved to a separate thread, so that it could run in parallel with reading files off disk, performing newline conversion and the like, and writing the resulting chunk data to disk.&lt;/li&gt;
  &lt;li&gt;A lower compression level was used (a relatively recent addition to LZ4HC) to trade off a little bit of compression ratio for a lot of compression time.&lt;/li&gt;
  &lt;li&gt;The default way to update qgrep data is now incremental - if no files inside a chunk changed and the chunk sizes don’t need to be rebalanced to maintain a reasonable average size, no recompression is performed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a result, it takes 6.5 seconds with a hot file cache to rebuild the UE4 search data from scratch, and 1.2 seconds to update a single file in it; in the latter case most of the time is spent traversing the directory structure.&lt;/p&gt;

&lt;h2 id=&quot;multi-threading&quot;&gt;Multi-threading&lt;/h2&gt;

&lt;p&gt;Naturally, qgrep tries to multithread the search process. There’s a single thread that reads chunks off disk, and dispatches them to a thread pool. The threads in the pool handle decompression of chunk data and the search itself.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/qgrep_2.png&quot; alt=&quot;Threading structure&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Several important problems arise during this process.&lt;/p&gt;

&lt;p&gt;First, the read speed and search speed can be substantially unbalanced. For example, if the data is in the file cache, reading it from disk runs at memory speed, whereas decompression and search are usually slower. If the data is really large, the total memory impact of the chunk queue could become substantial. Because of this, qgrep extensively uses a &lt;a href=&quot;https://github.com/zeux/qgrep/blob/master/src/blockingqueue.hpp&quot;&gt;thread-safe blocking queue&lt;/a&gt; with a special twist - when the queue is created, it’s told the maximum amount of memory it should hold. Whenever pushing an item would exceed that budget, the pushing thread waits for “space” to become available. This makes sure we can search arbitrarily large datasets with limited memory.&lt;/p&gt;
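&lt;p&gt;A sketch of such a memory-bounded blocking queue in Python (qgrep’s actual implementation is the C++ one linked above):&lt;/p&gt;

```python
import threading
from collections import deque

class BoundedQueue:
    """Blocking queue bounded by total memory impact, not item count."""
    def __init__(self, memory_budget):
        self.budget = memory_budget
        self.used = 0
        self.items = deque()
        self.cv = threading.Condition()

    def push(self, item, size):
        with self.cv:
            # Wait until there is room; an oversized item is still admitted
            # once the queue drains, so a single huge item cannot deadlock.
            while self.used > 0 and self.used + size > self.budget:
                self.cv.wait()
            self.items.append((item, size))
            self.used += size
            self.cv.notify_all()

    def pop(self):
        with self.cv:
            while not self.items:
                self.cv.wait()
            item, size = self.items.popleft()
            self.used -= size
            self.cv.notify_all()  # wake producers waiting for space
            return item
```

&lt;p&gt;The reader thread pushes decompressed-chunk work items with their byte sizes; when search threads fall behind, the reader simply blocks instead of buffering unbounded amounts of data.&lt;/p&gt;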

&lt;p&gt;Second, search threads generate results in an arbitrary order - each thread produces the results within each chunk in order, but we’d like to display results from different chunks to the user in the exact order a single thread would. This means we need to buffer the output - however, for some queries the size of that buffer can grow arbitrarily large. To solve this problem, qgrep uses a special-purpose ordered queue: the producers (search threads) submit items to the queue with an index indicating the original chunk index, and the single consumer that prints the output processes items in the order indicated by this index. For example, if 3 threads finish chunks 1, 2, 3 quickly and the 4th thread takes a while to finish chunk 0, the output for chunks 1, 2, 3 stays in memory until chunk 0 is processed and added to the queue, after which the consumer thread can process output with indices 0, 1, 2, 3 in order.&lt;/p&gt;
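&lt;p&gt;The ordered queue can be sketched like this; a heap holds completed chunks until the chunk with the next expected index arrives:&lt;/p&gt;

```python
import heapq
import threading

class OrderedQueue:
    """Consumer receives items in chunk-index order, regardless of which
    producer thread finishes first."""
    def __init__(self):
        self.heap = []        # (index, item) pairs waiting for their turn
        self.next_index = 0   # index the consumer needs next
        self.cv = threading.Condition()

    def push(self, index, item):
        with self.cv:
            heapq.heappush(self.heap, (index, item))
            self.cv.notify()

    def pop(self):
        with self.cv:
            # Block until the chunk with the expected index has arrived.
            while not (self.heap and self.heap[0][0] == self.next_index):
                self.cv.wait()
            _, item = heapq.heappop(self.heap)
            self.next_index += 1
            return item
```

&lt;p&gt;Note that nothing bounds the heap here - as the article says, output for finished chunks stays in memory until the lagging chunk completes.&lt;/p&gt;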

&lt;p&gt;Third, a lot of care needs to be taken to optimize memory allocations. Allocating and deallocating large blocks of memory can be a big source of performance issues, because on many systems doing so will repeatedly release the memory to the OS, allocate it again, and trigger many page faults that can end up serializing execution, as described &lt;a href=&quot;/2014/12/21/page-fault-queue/&quot;&gt;in my older post&lt;/a&gt;. This issue is solved by using a pool for large allocations (which is somewhat counterintuitive, since it’s more common to pool small allocations than large ones).&lt;/p&gt;

&lt;p&gt;Finally, for large matches, the process of highlighting the match results may take quite a bit of time - once we know there’s a match in a given line, we need to find the bounds of this match, and generate appropriate markup for the terminal output with colors. Because of this, highlighting runs in the search thread and the output chunks contain final formatted output so that the single output consumer thread only needs to print the results to the terminal.&lt;/p&gt;

&lt;h2 id=&quot;regular-expression-search&quot;&gt;Regular expression search&lt;/h2&gt;

&lt;p&gt;After decompressing chunk contents, we use a regular expression engine to find the match. The engine that qgrep uses is &lt;a href=&quot;https://github.com/google/re2&quot;&gt;RE2&lt;/a&gt;; the design and performance characteristics are &lt;a href=&quot;https://swtch.com/~rsc/regexp/regexp3.html&quot;&gt;described here&lt;/a&gt;. My understanding is that the design of ripgrep is similar, but I’m not an expert on the matter - I profiled RE2 in 2012 and it was faster than other engines I tested. Since then I’ve used RE2 in some other applications and the only engine I’ve seen that could beat RE2 depending on the task was &lt;a href=&quot;https://www.pcre.org/original/doc/html/pcrejit.html&quot;&gt;PCRE JIT&lt;/a&gt;, which wasn’t available in 2012 (and for qgrep specifically I’m not sure if the JIT time will pay for the increased performance since we’re interested in end-to-end time of the query and aren’t able to cache the JIT code).&lt;/p&gt;

&lt;p&gt;The search is done on a line-by-line basis; however, instead of feeding the regular expression engine one line at a time, the regular expression is run on the entire file at once; whenever there’s a match, we output that match and jump to the beginning of the next line to continue the search. This results in optimal performance, as the cost of preparing the match data is minimized and the various scanning optimizations like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memchr&lt;/code&gt; mentioned below can truly shine.&lt;/p&gt;
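&lt;p&gt;In Python terms, the search loop looks roughly like this (qgrep does the equivalent in C++ on top of RE2):&lt;/p&gt;

```python
import re

def search_lines(data, pattern):
    """Run the pattern over the whole buffer; after each match, jump to the
    start of the next line instead of re-matching within the same line."""
    matches = []
    pos = 0
    while True:
        m = pattern.search(data, pos)
        if m is None:
            break
        # Recover the bounds of the line containing the match.
        line_start = data.rfind("\n", 0, m.start()) + 1
        line_end = data.find("\n", m.start())
        if line_end == -1:
            line_end = len(data)
        matches.append(data[line_start:line_end])
        pos = line_end + 1  # continue the search from the next line
    return matches
```

&lt;p&gt;Each line is reported at most once, and the engine scans the buffer in long uninterrupted runs rather than restarting at every newline.&lt;/p&gt;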

&lt;p&gt;One notable weakness of RE2 - shared by other regular expression engines - is case-insensitive searches: case-insensitive matches generate much more complex state machines. Additionally, one reason RE2 is fast is that it uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memchr&lt;/code&gt; when scanning for a literal submatch of a regular expression (in the test query from this post this happens for ‘v’ and ‘K’) instead of putting one character at a time through the automaton, and case-insensitive searches invalidate this. Since case-insensitive searches are &lt;em&gt;very&lt;/em&gt; common, qgrep takes a shortcut and assumes that they only need to handle ASCII - when that assumption holds, we can transform both the regular expression and the file contents to ASCII lower case. For file data, the performance of this transform is critical, so we use an &lt;a href=&quot;https://github.com/zeux/qgrep/blob/master/src/casefold.hpp#L33&quot;&gt;SSE2 optimized casefold function&lt;/a&gt; with a simple loop that processes 16 bytes at a time:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Shift &apos;A&apos;..&apos;Z&apos; range ([65..90]) to [102..127] to use one signed comparison insn&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shiftAmount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_set1_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;127&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;Z&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lowerBound&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_set1_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;127&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;Z&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;A&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upperBit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_set1_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upperMask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_cmpgt_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_add_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shiftAmount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lowerBound&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_or_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upperMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upperBit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_mm_storeu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the moral equivalent of computing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unsigned(ch - &apos;A&apos;) &amp;lt; 26 ? (ch | 0x20) : ch&lt;/code&gt;, but it has to work around the lack of unsigned comparisons in SSE2 by using signed comparisons instead. Note that this only works for ASCII, but “proper” Unicode support was never important enough to fix; ripgrep handles this more rigorously.&lt;/p&gt;

&lt;p&gt;Finally, regular expression search is augmented with a fast literal scanner. Very often, regular expressions start with a literal prefix (and of course sometimes the prefix &lt;em&gt;is&lt;/em&gt; the entire expression, when you’re just searching for a specific keyword). In this case, we can use special algorithms to accelerate the search. There are many algorithms described in the literature, and I spent some time in 2012 comparing different implementations&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, but outside of artificial queries like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aaaaaaaaaaaaaaaaaaaaaaab&lt;/code&gt; a custom SIMD optimized search algorithm was consistently faster.&lt;/p&gt;

&lt;p&gt;The basic idea is very simple - we can implement a very efficient SIMD scanner that scans characters 16 or 32 at a time, loads the data into SSE register, compares it to a register filled with the first character from the pattern, and does a more precise match if any match is found (the match location can be determined using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__builtin_ctz&lt;/code&gt; intrinsic or MSVC equivalents):&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maskv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_cmpeq_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_movemask_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;maskv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;countTrailingZeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This works okay in many cases, but it’s possible to do better. For example, if you scan for “ foo”, applying this algorithm naively requires repeatedly searching for the space (0x20) character and rejecting the match because the next character isn’t ‘f’. Instead, we can search for the least likely character in the pattern, and if we find it, compare the characters around it for a more precise match (using SIMD), and finally confirm the match with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcmp&lt;/code&gt;. It’s possible to maintain a character frequency table based on the actual text being searched, but all source code is more or less alike, so I ended up precomputing a static frequency table and choosing the character to search for using it. The inner loop of the matcher looks like this:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_movemask_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_cmpeq_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;firstLetter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// advance offset regardless of match results to reduce number of live values&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;countTrailingZeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;firstLetterOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;// check if we have a match&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;patternMatch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;reinterpret_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matchMask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_or_si128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;patternMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_cmpeq_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;patternMatch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;patternData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_movemask_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matchMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xffff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matchOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;firstLetterOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;firstLetterPos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

            &lt;span class=&quot;c1&quot;&gt;// final check for full pattern&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matchOffset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;memcmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matchOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matchOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we find the least frequent character, we compare up to 16 characters around it, and finally use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcmp&lt;/code&gt;; this results in &lt;em&gt;really&lt;/em&gt; good scanning performance overall. There’s a bit of overhead for short prefixes, so if the prefix is a single character we use a simpler SSE matcher.&lt;/p&gt;
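
&lt;p&gt;Selecting which character to scan for can be sketched as follows; the function name is illustrative, and qgrep’s actual precomputed table is derived from real source code:&lt;/p&gt;

```cpp
// Hypothetical sketch: pick the pattern character that is least frequent
// according to a precomputed frequency table (illustrative, not qgrep's
// actual table or function names).
unsigned int pickRareCharacter(const char* pattern, unsigned int length, const unsigned int* freq)
{
    unsigned int best = 0;
    for (unsigned int i = 1; i != length; ++i)
        if (freq[(unsigned char)pattern[best]] > freq[(unsigned char)pattern[i]])
            best = i;
    return best; // offset of the rarest character within the pattern
}
```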

&lt;h2 id=&quot;filtering-searched-chunks&quot;&gt;Filtering searched chunks&lt;/h2&gt;

&lt;p&gt;With all of the optimizations above, when the data set is in the filesystem cache and you have a few fast CPU cores, the searches are &lt;em&gt;very&lt;/em&gt; quick, as witnessed by the performance of the “bruteforce” variant. However, when comparing to Google CodeSearch I noticed that while its worst case search performance is really bad (substantially worse than ripgrep), its best case is really good, easily beating qgrep. This is because it uses an index based on ngrams&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; - for each unique 3-gram (3 consecutive characters) of a string, it keeps the list of files that contain this ngram anywhere. Given a regular expression, RE2 can produce a set of substrings that need to be in a string for it to match the expression (it’s slightly more complex than that, since it can produce a literal search tree); CodeSearch uses this to find the list of files that match all required 3-grams (for example, for a file to match “vkCmdDraw”, it has to be in the lists associated with the 3-grams “vkC”, “kCm”, “Cmd”, “mdD”, “dDr”, “Dra” and “raw”) and then searches in these files. Often the resulting list of files is very small, so the matches are very quick - qgrep is fast, but it’s hard to search through a gigabyte of data faster than it takes to search through a few files.&lt;/p&gt;
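
&lt;p&gt;The 3-gram decomposition itself is trivial; a sketch (illustrative code, not CodeSearch’s actual implementation):&lt;/p&gt;

```cpp
// Sketch: enumerate the 3-grams of a literal, as used by a trigram index
// ("vkCmdDraw" yields "vkC", "kCm", "Cmd", "mdD", "dDr", "Dra", "raw").
unsigned int listTrigrams(const char* text, unsigned int length, char out[][4])
{
    if (3 > length)
        return 0; // too short to have any 3-grams
    for (unsigned int i = 0; i != length - 2; ++i)
    {
        out[i][0] = text[i];
        out[i][1] = text[i + 1];
        out[i][2] = text[i + 2];
        out[i][3] = 0; // NUL-terminate each 3-gram for convenience
    }
    return length - 2; // number of 3-grams produced
}
```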

&lt;p&gt;I wanted qgrep to be faster for cases where an ngram index works well, but I didn’t want to compromise the worst case performance. When profiling qgrep bruteforce on the query from this blog post, the main bottleneck is LZ4 decompression, which is ~3x slower than regular expression search (which is mostly dominated by the fast literal matcher described above). Thus if we want to save time, we need to avoid decompressing the chunk entirely - since files are compressed as a single blob, this means that we need to know ahead of time that &lt;em&gt;no&lt;/em&gt; file in the chunk contains the given string.&lt;/p&gt;

&lt;p&gt;What’s more, we’d like to determine this using minimal extra time and data - while we could store some sort of ngram index, it would take quite a bit of space. This makes it a perfect use case for a Bloom filter.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Bloom_filter&quot;&gt;Bloom filters&lt;/a&gt; are a neat construct that allows us to spend a very small amount of memory and time to answer a question: does a given item belong to a given set? The catch is that the result can contain false positives - the answer to “is this item in the set?” is either “definitely not” or “maybe yes”.&lt;/p&gt;
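
&lt;p&gt;A minimal sketch of the idea follows; this is illustrative only - qgrep’s filter packs bits tightly and uses different hashing, while here one byte per bit is used for clarity:&lt;/p&gt;

```cpp
// Minimal Bloom filter sketch (illustrative; not qgrep's actual layout).
// For clarity the filter stores one byte per bit; real code packs bits.
struct BloomFilter
{
    bool bits[1024];
    unsigned int hashes; // number of hash iterations
};

// A toy mixing hash; any reasonable hash works here.
unsigned int bloomHash(unsigned int value, unsigned int seed)
{
    unsigned int h = value * 2654435761u + seed * 40503u;
    h = h ^ (h >> 15);
    h = h * 2246822519u;
    h = h ^ (h >> 13);
    return h;
}

void bloomInsert(BloomFilter* filter, unsigned int item)
{
    for (unsigned int i = 0; i != filter->hashes; ++i)
        filter->bits[bloomHash(item, i) % 1024] = true;
}

// Returns false if the item is definitely not in the set, true if it may be.
bool bloomMayContain(const BloomFilter* filter, unsigned int item)
{
    for (unsigned int i = 0; i != filter->hashes; ++i)
        if (!filter->bits[bloomHash(item, i) % 1024])
            return false;
    return true;
}
```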

&lt;p&gt;&lt;img src=&quot;/images/qgrep_3.png&quot; alt=&quot;Bloom&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Surprisingly, this is &lt;em&gt;exactly&lt;/em&gt; what we need. Each chunk stores an index - a Bloom filter that contains all ngrams (after some experiments I settled on 4-grams) from all files in the chunk. When reading chunks from the file, we read the chunk index first, and check all ngrams from the regular expression - if any ngram is “definitely not” in the index, we can skip the entire chunk - we don’t even need to read it from the file, which means we save some I/O time!&lt;/p&gt;
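
&lt;p&gt;The chunk rejection test can be sketched like this; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mayContain&lt;/code&gt; stands in for the Bloom filter lookup, and all names are illustrative rather than qgrep’s actual API:&lt;/p&gt;

```cpp
// Sketch: a chunk can be skipped when any 4-gram of the query literal is
// definitely absent from the chunk index. mayContain stands in for the
// Bloom filter lookup; names are illustrative.
typedef bool (*MayContainFn)(unsigned int ngramHash);

unsigned int hashNgram(const char* data)
{
    // pack 4 bytes into a 32-bit value (a real hash would mix this further)
    unsigned int result = 0;
    for (int i = 0; i != 4; ++i)
        result = result * 256 + (unsigned char)data[i];
    return result;
}

bool chunkMayMatch(const char* query, unsigned int length, MayContainFn mayContain)
{
    if (4 > length)
        return true; // too short to filter on 4-grams
    for (unsigned int i = 0; i != length - 3; ++i)
        if (!mayContain(hashNgram(query + i)))
            return false; // a 4-gram is definitely absent: skip the chunk
    return true; // every 4-gram may be present: decompress and search
}
```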

&lt;p&gt;The index is built when the chunk is constructed and compressed; it is sized at 10% of the compressed data size to limit the worst case impact of the index, and based on this size we estimate the number of Bloom filter hash iterations that is optimal for the false positive rate - the Wikipedia article contains the details. The index search itself is so fast that it’s done as we read the chunks in the main thread.&lt;/p&gt;
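
&lt;p&gt;The formula in question is the standard one: for a filter with m bits holding n items, the optimal number of hash iterations is k = (m / n) * ln 2. A sketch (the function name is illustrative, and qgrep’s exact rounding may differ):&lt;/p&gt;

```cpp
// Standard Bloom filter sizing: the optimal hash iteration count is
// k = (m / n) * ln(2) for m bits and n items (see the Wikipedia article).
unsigned int bloomOptimalHashes(unsigned int bits, unsigned int items)
{
    if (items == 0)
        return 1;
    double k = double(bits) / double(items) * 0.6931472; // ln 2
    if (k > 1.0)
        return (unsigned int)(k + 0.5); // round to nearest
    return 1;
}
```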

&lt;p&gt;The efficiency of the index is highly dependent on the query, of course. For some queries we end up reading ~10% more data from the file and not discarding any chunks (which generally has minimal impact on the overall search performance). For many queries, though, it’s highly effective - for the query from the example in this post, we end up filtering on ngrams for both “vkCmdDrawInde” and “KHR = 0” strings; as a result, out of 1903 chunks in the UE4 data set, we only need to decompress and process 6. Most of the resulting search time is spent reading chunk index data (all 20 MB), which could probably be optimized by packing the index data more tightly and/or using mmap, but tight packing makes incremental updates more challenging and I somehow never got around to implementing mmap support.&lt;/p&gt;

&lt;p&gt;Note that of the 6 chunks that have a potential match (that’s ~3 MB of source text), only one chunk has an actual match - because the Bloom filter isn’t precise and we compute it for all files in a chunk at once, we waste a bit of time on unrelated data, but this allows us to not compromise the worst case search performance, which is great.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Searching large volumes of source code using regular expressions is an interesting problem. There are several possible designs - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ripgrep&lt;/code&gt; uses raw file access and tries to optimize that as much as possible; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codesearch&lt;/code&gt; uses a trigram index in hopes that it yields a minimal number of files, so subsequent optimizations aren’t as interesting; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qgrep&lt;/code&gt; tries to accelerate worst case searches by reducing the filesystem performance impact as much as possible and adding a pseudo-index on top.&lt;/p&gt;

&lt;p&gt;Should you use qgrep? I don’t know! Compared to ripgrep, the need to maintain an up-to-date database is definitely a usability issue (although &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qgrep watch&lt;/code&gt; can make this less painful by adding modified files to the special “changed” file list and occasionally updating the database) and the command-line interface is somewhat arcane. Compared to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codesearch&lt;/code&gt;, qgrep seems like a win, but if you have 100 GB of data to search through you probably really need a “real” index. In any event, the goal of this post isn’t to advertise qgrep - it’s to talk about some interesting optimization techniques used in its implementation.&lt;/p&gt;

&lt;p&gt;Of course, if you aren’t using &lt;a href=&quot;https://github.com/zeux/qgrep&quot;&gt;qgrep&lt;/a&gt;, you’re missing out on this sweet sweet Vim plugin built entirely in Vimscript:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/qgrep_4.png&quot; alt=&quot;Vim&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;You would think that another option is Visual Assist or equivalent tools; however, this only works for the currently selected platform/configuration which, in a cross-platform codebase, isn’t always sufficient; it also is restricted to C++ symbols and has its own scaling challenges in a codebase with way too much source code. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This was back when codesearch was written in C++ I think; the Go version looks impressively fast these days, but I haven’t spent a lot of time using it to know how it fares on a wide range of use cases. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Some people who worked on the EA codebase at the time used &lt;em&gt;another&lt;/em&gt; tool called qgrep; it was closed source/proprietary and wasn’t as fast as I thought it should be - it was basically a regular expression engine that ran on the .tar.gz archive of the code. I decided to start with the same name in hopes that I could find a better name later, but never got around to it. I also inherited the command line structure from this tool because I wanted something that people on the team who wanted faster search could immediately use. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I &lt;em&gt;think&lt;/em&gt; I used an earlier version of &lt;a href=&quot;https://smart-tool.github.io/smart/&quot;&gt;SMART&lt;/a&gt; but I may be mistaking it for some other string search algorithm collection. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This process is described in detail here: &lt;a href=&quot;https://swtch.com/~rsc/regexp/regexp4.html&quot;&gt;Regular Expression Matching with a Trigram Index&lt;/a&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Sat, 20 Apr 2019 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2019/04/20/qgrep-internals/</link>
			<guid isPermaLink="true">https://zeux.io/2019/04/20/qgrep-internals/</guid>
		</item>
		
		<item>
			<title>Small, fast, web</title>
			<description>&lt;p&gt;When implementing vertex/index decoders in &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;, the main focus was on lean implementation and decompression performance.&lt;/p&gt;

&lt;p&gt;&lt;img align=&quot;left&quot; src=&quot;/images/fastweb.jpg&quot; style=&quot;padding-right: 20px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When your streaming source is capable of delivering hundreds of megabytes per second, as is the case with SSD drives, and you want to accelerate loading by compressing the data further, you need to be decompressing at multiple hundreds of megabytes per second, ideally a gigabyte, to make sure a small number of CPU cores can keep up with IO. Keeping the implementation lean meant it was easy to understand and optimize. To compensate for the inevitable loss of compression ratio, the codecs were designed in such a way that their output can be compressed further using lossless general purpose compressors such as &lt;a href=&quot;https://github.com/lz4/lz4&quot;&gt;lz4&lt;/a&gt; or &lt;a href=&quot;https://github.com/facebook/zstd&quot;&gt;zstd&lt;/a&gt;, thus offering an easy tradeoff between compression ratio and performance.&lt;/p&gt;

&lt;p&gt;This set of implementation decisions unexpectedly resulted in algorithms that are a pretty good fit for delivering web content. The performance penalty often induced by running code in the browser is offset by the incredibly high baseline performance, and most web content has “free” gzip compression efficiently applied during the download process. This article will describe the evolution of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt;, a WebAssembly port of geometry decoders from meshoptimizer.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;baseline&quot;&gt;Baseline&lt;/h2&gt;

&lt;p&gt;To get the decoders working, two functions needed to be exposed to JavaScript&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;meshopt_decodeVertexBuffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;destination&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;meshopt_decodeIndexBuffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;destination&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To simplify the port, the assumption is that the data can be encoded offline using C++ or &lt;a href=&quot;https://crates.io/crates/meshopt&quot;&gt;Rust&lt;/a&gt; versions of the library, and only decoders are necessary at runtime&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. JavaScript code would download the encoded data - possibly using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch()&lt;/code&gt; which would be capable of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gzip&lt;/code&gt; decompression built into the browser - pass them to the decode functions, and then upload the resulting vertex/index buffers to WebGL directly or using the 3D framework of choice.&lt;/p&gt;

&lt;p&gt;The functions are self-contained, perform no memory allocations and don’t use the standard library apart from basic functions like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memset&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy&lt;/code&gt;. This meant that the requirements on the cross-compilation toolchain would be minimal. The obvious choice - and probably the only practical option at the moment? - is &lt;a href=&quot;https://emscripten.org/&quot;&gt;Emscripten&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After installing it using the fantastic &lt;a href=&quot;https://github.com/emscripten-core/emsdk&quot;&gt;emsdk&lt;/a&gt; distribution, and experimenting with a few different options, I was able to get the code to compile:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;emcc&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertexcodec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpp&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indexcodec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Os&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNDEBUG&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXPORTED_FUNCTIONS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;_meshopt_decodeVertexBuffer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;_meshopt_decodeIndexBuffer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decoder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;js&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in two files, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt; (14801 bytes, 4606 bytes after gzip) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.wasm&lt;/code&gt; (3843 bytes, 1680 bytes after gzip). The bulk of the code was thus the JavaScript runtime; the WebAssembly file that contained the actual compiled code was relatively small.&lt;/p&gt;

&lt;p&gt;However, to invoke these functions, the caller has to provide the buffers to the WebAssembly-compiled functions from the JavaScript side.&lt;/p&gt;

&lt;h2 id=&quot;allocating-memory-from-javascript&quot;&gt;Allocating memory from JavaScript&lt;/h2&gt;

&lt;p&gt;Before I started the port, I was hoping that due to how simple the function interface is, I would be able to directly pass two ArrayBuffer objects - one for input and one for output - to the function and have it work directly with the data. Unfortunately, this isn’t really possible today - the WebAssembly specification only allows one Memory object to be used. Because of this, the JavaScript wrapper code needs to allocate two buffers on the heap, copy data into one buffer, perform the decompression, and then copy data into the target buffer allocated by the caller:&lt;/p&gt;

&lt;div class=&quot;language-js highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;decodeVertexBuffer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;_malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;_malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;HEAPU8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;_meshopt_decodeVertexBuffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;nx&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;HEAPU8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;subarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;_free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;_free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;throw&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Malformed vertex buffer data&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This requires a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free&lt;/code&gt; implementation to be provided by the runtime; since we cannot know ahead of time how large the heap might become, we need to compile our code with support for memory growth:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;emcc src/vertexcodec.cpp src/indexcodec.cpp -Os -DNDEBUG -s EXPORTED_FUNCTIONS=&apos;[&quot;_meshopt_decodeVertexBuffer&quot;, &quot;_meshopt_decodeIndexBuffer&quot;, &quot;_malloc&quot;, &quot;_free&quot;]&apos; -s ALLOW_MEMORY_GROWTH=1 --post-js decoder-post.js -o decoder.js
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in a larger &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt; (17980 bytes, 5427 bytes after gzip) and a substantially larger &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.wasm&lt;/code&gt; (12432 bytes, 5047 bytes after gzip). Fortunately, Emscripten provides a custom malloc implementation, emmalloc, that isn’t as fast as the default implementation but is substantially leaner. We don’t really care about allocation performance since we only allocate two buffers, and switching to emmalloc reduces &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.wasm&lt;/code&gt; to 5993 bytes, 2577 bytes after gzip - we’ve lost ~1 KB after gzip, which is reasonable.&lt;/p&gt;

&lt;h2 id=&quot;reducing-distribution-size-further&quot;&gt;Reducing distribution size further&lt;/h2&gt;

&lt;p&gt;We’re now down to two files that add up to 8004 bytes after gzip, which is pretty good - unfortunately, to start requesting the WebAssembly module, we need to fully parse and execute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt;, which increases our latency by the time it takes to do one extra HTTP request. To mitigate this cost, Emscripten supports an option to embed the binary files into the JavaScript file as a Base64 string (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-s SINGLE_FILE=1&lt;/code&gt;). Using it costs us the inclusion of a Base64 decoder and inflates the size of the embedded WebAssembly binary, but for small libraries it’s worthwhile.&lt;/p&gt;

&lt;p&gt;Applying this option means that we end up with a single file, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt;, with a size of 28006 bytes, 9918 bytes after gzip. The size penalty is probably not dissimilar to the size of an HTTP header…&lt;/p&gt;
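&lt;p&gt;The Base64 overhead itself is easy to estimate: the encoding emits 4 output characters for every 3 input bytes, so an embedded binary grows by roughly a third. A quick sketch of the arithmetic, using the byte counts measured above (the helper name is just for illustration):&lt;/p&gt;

```javascript
// Base64 emits 4 output characters for every 3 input bytes (with padding).
function base64Length(bytes) {
    return Math.ceil(bytes / 3) * 4;
}

// Sanity check against the sizes measured earlier: a 5993-byte .wasm file
// costs roughly 8 KB once embedded as a Base64 string.
console.log(base64Length(5993)); // 7992
```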

&lt;p&gt;Fortunately, we have one more easy switch left that we haven’t enabled yet - Emscripten docs recommend using the Google Closure compiler via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--closure 1&lt;/code&gt; switch. This compiler is written in Java (so using it requires installing a Java SDK…) and processes JS code to perform dead code elimination and minification of identifiers, including object keys. To make sure the code keeps working I had to enable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MODULARIZE=1&lt;/code&gt; mode and change dotted access like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Module._free&lt;/code&gt; to array access &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Module[&quot;_free&quot;]&lt;/code&gt; so that the Closure compiler doesn’t minify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_free&lt;/code&gt;, which has to match the name of the WebAssembly export.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;emcc src/vertexcodec.cpp src/indexcodec.cpp -Os -DNDEBUG -s EXPORTED_FUNCTIONS=&apos;[&quot;_meshopt_decodeVertexBuffer&quot;, &quot;_meshopt_decodeIndexBuffer&quot;, &quot;_malloc&quot;, &quot;_free&quot;]&apos; -s ALLOW_MEMORY_GROWTH=1 -s MALLOC=emmalloc -s MODULARIZE=1 -s EXPORT_NAME=MeshoptDecoder --closure 1 --post-js decoder-post.js -o decoder.js
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Closure compiler doesn’t change the size of the WebAssembly code or the overhead of Base64 encoding, but it does reduce the JavaScript runtime quite a bit; with this, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt; settled at 20023 bytes, 8469 bytes after gzip.&lt;/p&gt;

&lt;h2 id=&quot;optimizing-for-performance&quot;&gt;Optimizing for performance&lt;/h2&gt;

&lt;p&gt;Now it’s time to look at performance. WebAssembly tries to reach near-native speeds; however, based on past experience with sandboxed systems like this I expected some amount of overhead. Looking at decoding performance on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nyra.obj&lt;/code&gt; (28K triangles, 18K vertices), the web version was indeed substantially slower. However, the C++ version uses SSSE3 or NEON when available; measuring with SIMD disabled paints a slightly less gloomy picture&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;vertex decode&lt;/th&gt;
      &lt;th&gt;index decode&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;C++, SSSE3&lt;/td&gt;
      &lt;td&gt;0.10 msec&lt;/td&gt;
      &lt;td&gt;0.10 msec&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;C++, scalar&lt;/td&gt;
      &lt;td&gt;0.30 msec&lt;/td&gt;
      &lt;td&gt;0.10 msec&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;emcc -Os (Chrome)&lt;/td&gt;
      &lt;td&gt;0.55 msec&lt;/td&gt;
      &lt;td&gt;0.47 msec&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Not being quite sure how to profile the code, short of using internal Chrome options to inspect the generated assembly and guessing where the overhead is coming from, I decided to first make sure that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Os&lt;/code&gt; was indeed justified. While in theory optimizing for size may compromise performance, in practice for native compilation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Os&lt;/code&gt; often generates code that is close to optimal, so I wasn’t expecting miracles. I was pleasantly surprised.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;vertex decode&lt;/th&gt;
      &lt;th&gt;index decode&lt;/th&gt;
      &lt;th&gt;decoder.js size&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;emcc -Os&lt;/td&gt;
      &lt;td&gt;0.55 msec&lt;/td&gt;
      &lt;td&gt;0.47 msec&lt;/td&gt;
      &lt;td&gt;18211 bytes, 7845 bytes after gzip&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;emcc -O2&lt;/td&gt;
      &lt;td&gt;0.49 msec&lt;/td&gt;
      &lt;td&gt;0.27 msec&lt;/td&gt;
      &lt;td&gt;21209 bytes, 8902 bytes after gzip&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;emcc -O3&lt;/td&gt;
      &lt;td&gt;0.48 msec&lt;/td&gt;
      &lt;td&gt;0.27 msec&lt;/td&gt;
      &lt;td&gt;19387 bytes, 8193 bytes after gzip&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;While &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O2&lt;/code&gt; noticeably increases the binary size, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O3&lt;/code&gt; results in a somewhat more modest increase and a substantial performance boost, making the index decoder substantially faster in particular. The performance delta with native code is still, unfortunately, very large. The JavaScript timings do include the extra cost of copying the data; however, that’s not where the bulk of the cost lies - the copies are almost free compared to the time it takes to run the actual code.&lt;/p&gt;

&lt;h2 id=&quot;stack--heap&quot;&gt;Stack &amp;gt; heap&lt;/h2&gt;

&lt;p&gt;Now that performance looks slightly better, it’s time to take another look at code size. While it’s pretty reasonable as it stands, it’s still a fair amount of JS code to download, and a large part of it - ~1 KB post-gzip - is code we didn’t write that implements malloc/free. In many ways, Emscripten is very much like embedded software - linear memory model, NULL pointer is a valid pointer, strict code size constraints, etc. - and as in many embedded systems, the heap is implemented using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Emscripten, the linear memory space consists of the stack, which has a fixed size, and the heap, which follows after the stack. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk&lt;/code&gt; allows you to move the pointer to the end of the heap space; advancing it past the end of the current heap triggers a heap resize, which leads to Emscripten reallocating the backing Memory object for WebAssembly. Since we only need to allocate two buffers and they have the same lifetime, our memory management resembles a stack more than a heap, and thus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk&lt;/code&gt; is really easy to integrate:&lt;/p&gt;

&lt;div class=&quot;language-js highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;decodeVertexBuffer&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;_sbrk&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;_sbrk&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;HEAPU8&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;_meshopt_decodeVertexBuffer&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;nx&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;HEAPU8&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;subarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;vertexSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;_sbrk&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;tp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;_sbrk&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;throw&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Malformed vertex buffer data&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that you can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk(0)&lt;/code&gt; to get the pointer to the top of the heap space without moving it. There are some alignment considerations that we’re ignoring here since there are certain guarantees about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vertexSize&lt;/code&gt; that make them irrelevant.&lt;/p&gt;
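&lt;p&gt;If alignment did matter - say, for data that required 4-byte aligned buffers - rounding each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk&lt;/code&gt; argument up would be enough. A hypothetical sketch (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alignUp&lt;/code&gt; is not part of the runtime above):&lt;/p&gt;

```javascript
// Hypothetical helper: round an allocation size up to a multiple of align,
// so that consecutive sbrk calls hand out aligned pointers (assuming the
// heap base itself is aligned).
function alignUp(size, align) {
    return Math.ceil(size / align) * align;
}

console.log(alignUp(13, 4), alignUp(16, 4)); // 16 16
```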

&lt;p&gt;As a result, we no longer pay the cost of including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emmalloc&lt;/code&gt;, and it shows: decoder.js now takes 16654 bytes, 7004 bytes after gzip - we saved more than 1 KB by removing malloc! This is probably not an interesting consideration for larger programs, but for a library as small as this it’s worthwhile.&lt;/p&gt;

&lt;h2 id=&quot;minimal-runtime&quot;&gt;Minimal runtime&lt;/h2&gt;

&lt;p&gt;It’s pretty obvious at this point that the bulk of our size is in the JavaScript runtime - even after Closure optimizations, there’s still a lot of work that it needs to do. After seeing &lt;a href=&quot;https://twitter.com/FlohOfWoe/status/1096856311910809600&quot;&gt;the tweet by Andre Weissflog&lt;/a&gt; about a new Emscripten feature, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MINIMAL_RUNTIME&lt;/code&gt;, I was curious and decided to try it out. It requires an Emscripten version built from the latest source, and lives at the bleeding edge - the &lt;a href=&quot;https://github.com/emscripten-core/emscripten/pull/7923&quot;&gt;first change for it&lt;/a&gt; was merged just about a month ago. It doesn’t support two features I need, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SINGLE_FILE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALLOW_MEMORY_GROWTH&lt;/code&gt;, but I was able to hack single-file support in and build a working prototype without memory growth; after going through the Emscripten JS optimizer (no Closure support just yet), the resulting decoder.js shrank from 16654 bytes to 9306 bytes. Memory growth would mean a slight size increase, but the result would probably still be around 9500 bytes before gzip - very sizeable savings.&lt;/p&gt;

&lt;p&gt;This is where I thought this would end - I’d wait for the minimal runtime to become more mature, perhaps contribute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SINGLE_FILE&lt;/code&gt; support that wasn’t otherwise planned, and eventually get the size reduction I wanted. However, the more I thought about this, the more I wanted to try to just do this myself. After all, how much code do you &lt;em&gt;really&lt;/em&gt; need to run the decoders?&lt;/p&gt;

&lt;p&gt;I started by taking Emscripten and asking it to generate just the .wasm file:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;emcc src/vertexcodec.cpp src/indexcodec.cpp -O3 -DNDEBUG -s EXPORTED_FUNCTIONS=&apos;[&quot;_meshopt_decodeVertexBuffer&quot;, &quot;_meshopt_decodeIndexBuffer&quot;]&apos; -s ALLOW_MEMORY_GROWTH=1 -s TOTAL_STACK=32768 -s TOTAL_MEMORY=65536 -o decoder.wasm
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I then copied the code from the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WebAssembly/instantiate&quot;&gt;Mozilla developer site&lt;/a&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WebAssembly.instantiate&lt;/code&gt; and gradually filled in the missing pieces. Instead of compiling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk&lt;/code&gt; into the WebAssembly binary - which resulted in extra symbols such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DYNAMICTOP_PTR&lt;/code&gt; that I would have to export and maintain - I wrote a simple JS implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbrk&lt;/code&gt;. To download the file, the Base64 blob is passed to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch()&lt;/code&gt; function as a data URI - this method probably doesn’t work in all environments, but it does seem to work in all browsers; it’s small in terms of code size and reasonably fast, since the browser does all the heavy lifting. This resulted in a very short and, even if I say so myself, beautiful implementation:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;var wasm = &quot;BASE64DATA&quot;;
var memory = new WebAssembly.Memory({
    initial: 1
});
var heap = new Uint8Array(memory.buffer);
var brk = 32768; // stack top

var sbrk = function(size) {
    var old = brk;
    brk += size;
    if (brk &amp;gt; heap.length) {
        memory.grow(Math.ceil((brk - heap.length) / 65536));
        heap = new Uint8Array(memory.buffer);
    }
    return old;
};

var imports = {
    env: {
        memory: memory,
        _emscripten_memcpy_big: function(d, s, n) {
            heap.set(heap.subarray(s, s + n), d);
        },
    }
};

var instance = {};
var promise =
    fetch(&apos;data:application/octet-stream;base64,&apos; + wasm)
    .then(response =&amp;gt; response.arrayBuffer())
    .then(bytes =&amp;gt; WebAssembly.instantiate(bytes, imports))
    .then(result =&amp;gt; instance = result.instance);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this, and wrapper functions that looked more or less the same as the original Emscripten versions, the new “runtime” was complete. The only extra symbol, other than the heap, that the WebAssembly version needs is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy&lt;/code&gt; fallback for large blocks. Since WebAssembly - unfortunately! - doesn’t come with an instruction to efficiently copy a memory block, Emscripten implements &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy&lt;/code&gt; using a hand-coded loop that falls back to a JS function for really large blocks.&lt;/p&gt;
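&lt;p&gt;The JS side of that fallback is tiny - it just bounces the bytes through typed-array methods, as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_emscripten_memcpy_big&lt;/code&gt; import above does. The same copy as a standalone sketch:&lt;/p&gt;

```javascript
// Copy n bytes within a linear-memory byte array the same way the JS
// fallback does it: subarray creates a view (no allocation) and set
// performs the bulk copy.
function memcpyBig(heap, dst, src, n) {
    heap.set(heap.subarray(src, src + n), dst);
}

var heap = new Uint8Array(8);
heap[0] = 1; heap[1] = 2; heap[2] = 3; heap[3] = 4;
memcpyBig(heap, 4, 0, 4);
console.log(heap[4], heap[5], heap[6], heap[7]); // 1 2 3 4
```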

&lt;p&gt;To build the full library, instead of producing a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.js&lt;/code&gt; file from a source template by embedding the Base64-encoded binary into it, I opted to remove the dependencies on JS minifiers and the like, and patch the binary in using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt;&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sed -i &quot;s#\(var wasm = \)\&quot;.*\&quot;;#\\1\&quot;$$(cat decoder.wasm | base64 -w 0)\&quot;;#&quot; decoder.js
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoder.js&lt;/code&gt; file with the embedded Base64 blob is the smallest yet by far, taking 8346 bytes, 3676 bytes after gzip - about 1 KB smaller (pre-gzip) than the Emscripten minimal runtime. It only does one job, but it seems to do it pretty well. Note that this doesn’t include any minification - the code is so small that a minifier isn’t as critical, and it’s refreshing to be able to edit or debug the original source directly.&lt;/p&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future work&lt;/h2&gt;

&lt;p&gt;While the results are pretty good overall, and the &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/master/js/decoder.js&quot;&gt;optimized decoders&lt;/a&gt; will ship as part of the next meshoptimizer release, there’s always room for further improvement.&lt;/p&gt;

&lt;p&gt;Performance of the WebAssembly version is still substantially lower than that of the native version. This is partly due to the lack of SIMD - the SIMD proposal for WebAssembly, especially with the hopefully imminent addition of &lt;a href=&quot;https://github.com/WebAssembly/simd/issues/68&quot;&gt;dynamic shuffles&lt;/a&gt;, should address that - but the index codec doesn’t use SIMD. There’s clearly a lot of room for improvement here, probably mostly on the toolchain side, but there may be some modifications to the C++ code that would make it faster as well. One big issue is the lack of easy access to the generated native code - I’d love to start from the generated x64 assembly and then figure out why it’s inefficient and where in the stack the inefficiency occurs.&lt;/p&gt;

&lt;p&gt;In terms of code size, while it’s looking really good right now at just under 4 KB after gzip, the eventual introduction of a SIMD version would at least double the WebAssembly portion (to maintain support for browsers without WebAssembly SIMD), if not more. After the optimizations, the breakdown of code size before gzip in the final version is as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;JS runtime: 1966 bytes&lt;/li&gt;
  &lt;li&gt;wasm meshopt_decodeVertexBuffer: 1772 bytes (~2363 bytes after Base64)&lt;/li&gt;
  &lt;li&gt;wasm meshopt_decodeIndexBuffer: 2398 bytes (~3197 bytes after Base64)&lt;/li&gt;
  &lt;li&gt;wasm memcpy: 454 bytes (~605 bytes after Base64)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s likely possible to recode &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy&lt;/code&gt; to fit the workload better and/or rely on the eventual builtin. The JS runtime could be minified: running it through UglifyJS produces a 1059-byte file, which is ~900 bytes smaller. Finally, instead of using Base64 to encode the binary, a better option may be to either use a larger but more compressible encoding, or to put the data into a separate file and use some kind of prefetch declaration in the source HTML file to fetch the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.js&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.wasm&lt;/code&gt; files in parallel.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is probably a good time to mention that while I have extensive experience with native code, my exposure to JavaScript or WebAssembly has been non-existent before this work. I’m likely doing many things wrong, and comments with suggestions for improvements are appreciated! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A more or less complete example of performing all the necessary transformations can be found &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/master/tools/meshencoder.cpp&quot;&gt;in tools/meshencoder.cpp&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;All timings are taken on i7-8700K; C++ version is compiled using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcc -O3&lt;/code&gt;, WebAssembly version is running in latest stable Chrome and the measurements are taken after 10 runs of the function to exclude JIT/warmup. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;While modifying the build source directly may seem bad, this makes editing workflow much easier when WebAssembly portion doesn’t need to be rebuilt - you just edit the file! &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Mon, 11 Mar 2019 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2019/03/11/small-fast-web/</link>
			<guid isPermaLink="true">https://zeux.io/2019/03/11/small-fast-web/</guid>
		</item>
		
		<item>
			<title>Flavors of SIMD</title>
			<description>&lt;p&gt;During development of &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt; a question that comes up relatively often is “should this algorithm use SIMD?”. The library is performance-oriented, but SIMD doesn’t always provide significant performance benefits - unfortunately, the use of SIMD can make the code less portable and less maintainable, so this tradeoff has to be resolved on a case by case basis. When performance is of utmost importance, such as vertex/index codecs, separate SIMD implementations for SSE and NEON instruction sets need to be developed and maintained. In other cases it’s helpful to understand how much SIMD can help to make the decision. Today we will go through the exercise of accelerating sloppy mesh simplifier, a new algorithm that was recently added to the library, using SSEn/AVXn instruction sets.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_1.jpg&quot;&gt;&lt;img src=&quot;/images/simplifysimd_1.jpg&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For our benchmark, we will be simplifying a 6M triangle “Thai Buddha” model, reducing it to 0.1% of the triangle count. We will use one compiler, Microsoft Visual Studio 2019, targeting the x64 architecture.
The scalar algorithm can perform this simplification in about 210 ms&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, using one thread of an Intel Core i7-8700K (running at ~4.4 GHz). Simplification can be parallelized in some cases by splitting a mesh into chunks; however, this requires some extra connectivity analysis to preserve chunk boundaries, so for now we’ll limit ourselves to pure SIMD optimizations.&lt;/p&gt;

&lt;h1 id=&quot;measure-seven-times&quot;&gt;Measure seven times&lt;/h1&gt;

&lt;p&gt;To understand our opportunities for optimization, let’s profile the code using Intel VTune; we’ll be running simplification 100 times to make sure we have enough profiling data.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_2.png&quot;&gt;&lt;img src=&quot;/images/simplifysimd_2.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here I’m using the microarchitecture exploration mode to get both the time each function takes and where the bottlenecks are. We can see that simplification is performed using a set of functions; each function is self-contained in that all of the time is spent in the function itself, not in any callees. The list of functions is sorted by the time they take; here’s the same list sorted by the order in which they execute, to make the algorithm easier to understand:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rescalePositions&lt;/code&gt; normalizes positions of all vertices into a unit cube to prepare for quantization using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt; computes a 30-bit quantized id for each vertex by taking a uniform grid of a given size and quantizing each axis to the grid (grid size fits into 10 bits, thus the id needs up to 30)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;countTriangles&lt;/code&gt; computes the approximate number of triangles that the simplifier would produce given a grid size, assuming that all vertices that lie in the same grid cell are merged together&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillVertexCells&lt;/code&gt; fills a table that maps each vertex to the cell it belongs to; all vertices with the same id map to the same cell&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; fills a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Quadric&lt;/code&gt; (a symmetric 4x4 matrix) structure for each cell that represents the aggregate information about geometry contributing to the cell&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellRemap&lt;/code&gt; computes a vertex index for each cell, picking one of the vertices that lies in this cell and minimizes the geometric distortion according to the error quadric&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filterTriangles&lt;/code&gt; outputs the final set of triangles according to the vertex-&amp;gt;cell-&amp;gt;vertex mapping tables built earlier; naive mapping can produce ~5% duplicate triangles on average, so the function filters out duplicates.&lt;/li&gt;
&lt;/ul&gt;
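&lt;p&gt;To make the id computation above concrete, here’s a minimal sketch of the 30-bit packing (a hypothetical standalone helper, not the library’s exact code) - each axis of a position normalized to [0, 1] is quantized to a 10-bit grid coordinate, and the three coordinates are packed together:&lt;/p&gt;

```cpp
#include <cassert>

// Hypothetical sketch of the 30-bit vertex id described above: each axis
// of a position normalized to [0, 1] is quantized to a grid coordinate
// (grid_size <= 1024, so each coordinate fits in 10 bits), and the three
// coordinates are packed into a single 30-bit id.
unsigned int computeVertexId(float x, float y, float z, int grid_size)
{
    int xi = (int)(x * (grid_size - 1) + 0.5f);
    int yi = (int)(y * (grid_size - 1) + 0.5f);
    int zi = (int)(z * (grid_size - 1) + 0.5f);

    return (xi << 20) | (yi << 10) | zi;
}
```

&lt;p&gt;Vertices that land in the same grid cell get identical ids, which is what lets the counting and cell-filling passes treat the id as a cell key.&lt;/p&gt;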

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;countTriangles&lt;/code&gt; run multiple times - the algorithm determines the grid size to use for vertex merging by doing an accelerated binary search to reach the target triangle count, 6000 in this case, computing the number of triangles each candidate grid size would generate on every iteration. The other functions run just once. On the mesh in question, it takes 5 search passes to find the target grid size, which happens to be 40&lt;sup&gt;3&lt;/sup&gt;.&lt;/p&gt;
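&lt;p&gt;The search can be sketched roughly like this (an illustration under simplifying assumptions, not the library’s exact logic) - we binary-search over the grid size, asking the counting pass how many triangles would survive at each candidate size:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of the grid-size search described above: find the
// smallest grid size in [1, 1024] whose merged mesh keeps at least
// `target` triangles. `countForGrid` stands in for a countTriangles pass
// over the mesh; the count grows monotonically with grid size, which is
// what makes binary search valid.
template <typename Count>
int findGridSize(Count countForGrid, size_t target)
{
    int lo = 1, hi = 1024;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (countForGrid(mid) < target)
            lo = mid + 1; // grid too coarse: too many vertices merged
        else
            hi = mid; // enough triangles survive; try a coarser grid
    }

    return lo;
}
```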

&lt;p&gt;VTune helpfully tells us that the most expensive function is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt;, the function that computes quadrics, accounting for close to half of the total runtime of 21 seconds (across all 100 runs). This will be our first target for SIMD optimization.&lt;/p&gt;

&lt;h1 id=&quot;piecewise-simd&quot;&gt;Piecewise SIMD&lt;/h1&gt;

&lt;p&gt;Let’s look at the source of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; to get a better idea of what it needs to compute:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fillCellQuadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

        &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;3.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;quadricFromTriangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;quadricAdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;quadricAdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;quadricAdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;quadricAdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The function goes over all triangles, computes a quadric for each one, and adds it to the quadric of each cell that the triangle’s vertices belong to. A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Quadric&lt;/code&gt; is a symmetric 4x4 matrix; thanks to the symmetry, it can be represented with just 10 floats:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Quadric&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a22&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
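&lt;p&gt;Quadrics are additive - the aggregate for a cell is just the sum of the per-triangle quadrics - so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;quadricAdd&lt;/code&gt;, which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; calls, amounts to a component-wise sum. A sketch consistent with the layout above (the library’s actual code may differ in details):&lt;/p&gt;

```cpp
#include <cassert>

// The 10-float symmetric quadric layout from the post.
struct Quadric
{
    float a00;
    float a10, a11;
    float a20, a21, a22;
    float b0, b1, b2, c;
};

// Quadrics are additive, so accumulating a triangle quadric into a cell
// quadric is a plain component-wise sum (a sketch; the library's actual
// code may differ in details).
void quadricAdd(Quadric& Q, const Quadric& R)
{
    Q.a00 += R.a00;
    Q.a10 += R.a10; Q.a11 += R.a11;
    Q.a20 += R.a20; Q.a21 += R.a21; Q.a22 += R.a22;
    Q.b0 += R.b0; Q.b1 += R.b1; Q.b2 += R.b2;
    Q.c += R.c;
}
```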

&lt;p&gt;Computing the quadric requires computing a plane equation for the triangle, building the quadric matrix from it, and weighting it by the triangle area:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;quadricFromPlane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a11&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a21&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a22&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;quadricFromTriangle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;quadricFromPlane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;quadricMul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
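&lt;p&gt;The two helpers used here aren’t shown in the listing: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;normalize&lt;/code&gt; scales the vector to unit length and returns its original length (which, for a cross product of two edge vectors, is twice the triangle area), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;quadricMul&lt;/code&gt; scales all quadric coefficients uniformly. A plausible sketch of the first one (details may differ from the library):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

struct Vector3 { float x, y, z; };

// Plausible sketch of the normalize() helper used by quadricFromTriangle:
// scale the vector to unit length and return its original length. For a
// cross product of two triangle edges, that length is twice the triangle
// area, which quadricFromTriangle then uses as a weight.
float normalize(Vector3& v)
{
    float length = sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);

    if (length > 0)
    {
        v.x /= length;
        v.y /= length;
        v.z /= length;
    }

    return length;
}
```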

&lt;p&gt;This looks like a lot of floating-point operations, and we should be able to implement them using SIMD. Let’s start by representing each vector as a 4-wide SIMD vector, and also let’s change the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Quadric&lt;/code&gt; structure to have 12 floats instead of 10 so that it fits exactly into 3 SIMD registers&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;; we’ll also reorder the fields to make computations in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;quadricFromPlane&lt;/code&gt; more uniform:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Quadric&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a22&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some of the computations here, notably the dot product needed to normalize the normal and compute the plane distance, don’t map well to earlier versions of SSE; fortunately, SSE4.1 introduced a dedicated dot product instruction, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_dp_ps&lt;/code&gt;, that is quite handy here.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fillCellQuadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yzx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_MM_SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_MM_SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp_xyz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x7f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

        &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;_mm_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yzx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;_mm_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;_mm_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;_mm_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yzx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areasq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_dp_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp_xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// SSE4.1&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sqrt_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;areasq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;// masks the result of the division when area==0&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// scalar version does this in normalize()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_div_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_cmpneq_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()));&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_dp_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp_xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// SSE4.1&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalnegdist&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_blend_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_MM_SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_MM_SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalnegdist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_set1_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;_mm_storeu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_storeu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm_storeu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// omitted for brevity, repeats the if() body&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// three times for c0/c1/c2&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There’s nothing particularly interesting in this code; we use unaligned loads/stores a lot - while it’s possible to align the input Vector3 data, there doesn’t seem to be a noticeable penalty for unaligned reads. Note that in the first half of the function, which computes the normal and area, we aren’t utilizing the vector units that well: our vectors have 3 components, and in some cases just one (see the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;areasq&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;area&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;distance&lt;/code&gt; computation), whereas the hardware can perform 4 operations at once. Regardless, let’s see how much this helped.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_3.png&quot;&gt;&lt;img src=&quot;/images/simplifysimd_3.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; now takes 5.3 seconds per 100 runs instead of 9.8, which saves ~45 ms per simplification run - not bad, but somewhat underwhelming. Besides using just 3 components out of 4 in many instructions, we’re also relying on a dot product instruction with pretty hefty latency. If you’ve written any SIMD code before, you know that the right way to compute dot products…&lt;/p&gt;

&lt;h1 id=&quot;whens-the-last-time-you-had-one-quadric&quot;&gt;When’s the last time you had one quadric?&lt;/h1&gt;

&lt;p&gt;… is to compute four of them at once. Instead of storing one normal vector in one SIMD register, we’ll use 3 registers: one will store the 4 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; components of four normal vectors, another the 4 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt; components, and the third the 4 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;z&lt;/code&gt; components. For this to work, we need 4 vectors to operate on at once - which means we’ll be processing 4 triangles at a time.&lt;/p&gt;

&lt;p&gt;We’re dealing with a lot of arrays that are indexed dynamically - while normally it can help to pre-transpose your data to already have arrays of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;z&lt;/code&gt; components&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, this will not work well with dynamic indexing, so we’ll load 4 triangles’ worth of data as we normally do and transpose the vectors using the handy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_MM_TRANSPOSE4_PS&lt;/code&gt; macro.&lt;/p&gt;

&lt;p&gt;In theory, a pure application of this principle would mean that we need to compute each component of the final 4 quadrics in its own SIMD register (e.g. we’d have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__m128 Q_a00&lt;/code&gt; holding the 4 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a00&lt;/code&gt; members of the final quadrics). In this case, however, the operations on quadrics already lend themselves pretty nicely to 4-wide SIMD, and doing this transformation actually makes the code slower - so we’ll only transpose the initial vectors, then transpose the plane equations back and run the exact same code we used before to compute the quadrics, repeated 4 times. Here’s how the code that computes the plane equations looks after this change, with the remaining sections omitted for brevity:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i01&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i02&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i11&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i12&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i21&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i22&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i30&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i31&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// load first vertex of each triangle and transpose into per-component vectors (pw0 isn&apos;t used later)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pw0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pw0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// load second vertex of each triangle and transpose into per-component vectors&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pw1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pw1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// load third vertex of each triangle and transpose into per-component vectors&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i22&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pw2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_loadu_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pw2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// p1 - p0&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;py1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pz1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// p2 - p0&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;py2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pz2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// cross(p10, p20)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;py10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pz10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pz10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;py10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// normalize; note that areasq/area now contain 4 values, not just one&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areasq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sqrt_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;areasq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areanz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_cmpneq_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_div_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areanz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_div_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areanz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_div_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areanz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// this computes the plane equations (a, b, c, d) for each of the 4 triangles&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normaly&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plane0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
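
For reference, each SIMD lane of the block above performs the familiar scalar plane construction. Here is a minimal scalar sketch of the same math (the `Vec3` struct and `trianglePlane` helper are hypothetical names for illustration, not part of the library); the zero-area check mirrors the `areanz` masking with `_mm_and_ps`:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Scalar equivalent of one SIMD lane above: normal = cross(p1 - p0, p2 - p0),
// normalized by its length (which is twice the triangle area);
// the plane equation is (a, b, c, d) with d = -dot(normal, p0).
// A degenerate (zero-area) triangle yields an all-zero plane, matching
// the _mm_and_ps masking with areanz in the SIMD version.
void trianglePlane(Vec3 p0, Vec3 p1, Vec3 p2, float plane[4])
{
    Vec3 p10 = {p1.x - p0.x, p1.y - p0.y, p1.z - p0.z};
    Vec3 p20 = {p2.x - p0.x, p2.y - p0.y, p2.z - p0.z};

    // cross(p10, p20)
    float nx = p10.y * p20.z - p10.z * p20.y;
    float ny = p10.z * p20.x - p10.x * p20.z;
    float nz = p10.x * p20.y - p10.y * p20.x;

    // normalize, guarding against zero-area triangles
    float area = std::sqrt(nx * nx + ny * ny + nz * nz);
    float inv = (area == 0.f) ? 0.f : 1.f / area;

    plane[0] = nx * inv;
    plane[1] = ny * inv;
    plane[2] = nz * inv;
    plane[3] = -(plane[0] * p0.x + plane[1] * p0.y + plane[2] * p0.z);
}
```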

&lt;p&gt;The code got quite a bit longer: we’re now processing 4 triangles in each loop iteration. We no longer need any SSE4.1 instructions, though, and we should be utilizing the SIMD units better now. Did this actually help?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_4.png&quot;&gt;&lt;img src=&quot;/images/simplifysimd_4.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;… ok, this wasn’t really worth it. We did get a tiny bit faster, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; is now almost exactly 2x faster than the non-SIMD function we started with, but it’s not clear that the significant increase in complexity justifies this. In theory we should be able to use AVX2 and process 8 triangles per loop iteration; however, this requires even more manual loop unrolling&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Let’s try something else instead.&lt;/p&gt;

&lt;h1 id=&quot;avx2--sse2--sse2&quot;&gt;AVX2 = SSE2 + SSE2&lt;/h1&gt;

&lt;p&gt;AVX2 is a somewhat peculiar instruction set. It gives you 8-wide floating-point registers and lets you compute 8 operations with just one instruction; however, generally speaking each instruction behaves like two SSE2 instructions run on the two individual halves of the register&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_dp_ps&lt;/code&gt; computes a dot product between two SSE registers; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm256_dp_ps&lt;/code&gt; computes two dot products between the two halves of two AVX2 registers, so it’s limited to a 4-wide product within each half.&lt;/p&gt;
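
To make the per-half behavior concrete, here is a scalar model (plain C++, no intrinsics) of what `_mm256_dp_ps` with a `0xff` mask computes; `dp256_model` is a hypothetical name for this sketch, not a real intrinsic:

```cpp
#include <cstddef>

// Scalar model of _mm256_dp_ps(a, b, 0xff): each 128-bit half of the 8-wide
// register gets its own independent 4-wide dot product, broadcast to the four
// lanes of that half. There is no lane-crossing 8-wide dot product.
void dp256_model(const float a[8], const float b[8], float out[8])
{
    for (size_t half = 0; half < 2; ++half)
    {
        float dot = 0.f;
        for (size_t i = 0; i < 4; ++i)
            dot += a[half * 4 + i] * b[half * 4 + i];

        for (size_t i = 0; i < 4; ++i)
            out[half * 4 + i] = dot; // broadcast within this half only
    }
}
```

Note how the two halves never interact; this is exactly the "SSE2 + SSE2" behavior described above.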

&lt;p&gt;This often makes AVX2 code different from a general-purpose “8-wide SIMD”, but it works in our favor here - instead of trying to improve vectorization by transposing the 4-wide vectors, let’s go back to our first attempt at SIMD and unroll the loop 2x, using AVX2 instructions instead of SSE2/SSE4. We’ll still need to load and store 4-wide vectors, but in general the code is just a result of replacing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__m128&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__m256&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm256_&lt;/code&gt; with a few tweaks:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i01&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i02&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i11&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i12&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yzx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yzx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areasq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_dp_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp_xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_sqrt_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;areasq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areanz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_cmp_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_CMP_NEQ_OQ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm256_div_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;areanz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_dp_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp_xyz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm256_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalnegdist&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_blend_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x88&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm256_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_MM_SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm256_shuffle_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_MM_SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;negdistance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalnegdist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
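&lt;p&gt;One portability note: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm256_loadu2_m128&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm256_storeu2_m128&lt;/code&gt; are composite intrinsics that some compilers (older GCC versions in particular) don’t provide. If yours is missing them, they can be emulated with a cast plus an insert/extract; the helper names below are ours, not part of any standard header:&lt;/p&gt;

```cpp
#include <immintrin.h>

// Emulation of _mm256_loadu2_m128: low 128 bits from loaddr, high 128 bits from hiaddr.
__attribute__((target("avx")))
static inline __m256 loadu2_m128(const float* hiaddr, const float* loaddr)
{
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(loaddr)), _mm_loadu_ps(hiaddr), 1);
}

// Emulation of _mm256_storeu2_m128: low half goes to loaddr, high half to hiaddr.
__attribute__((target("avx")))
static inline void storeu2_m128(float* hiaddr, float* loaddr, __m256 v)
{
    _mm_storeu_ps(loaddr, _mm256_castps256_ps128(v));
    _mm_storeu_ps(hiaddr, _mm256_extractf128_ps(v, 1));
}

// Smoke-test helper: load two 4-float blocks into one __m256 and store them back out.
__attribute__((target("avx")))
static void roundtrip(const float* hi, const float* lo, float* out_hi, float* out_lo)
{
    storeu2_m128(out_hi, out_lo, loadu2_m128(hi, lo));
}
```

&lt;p&gt;The target attribute lets these helpers compile without enabling AVX codegen for the whole translation unit; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;roundtrip&lt;/code&gt; exists only to exercise the pair.&lt;/p&gt;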

&lt;p&gt;After this we could take each 128-bit half of the resulting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qx&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qy&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qz&lt;/code&gt; vectors and run the same code we previously used to add quadrics; instead, we’ll assume that if one of the two triangles has all three vertices in the same cell (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;single_cell == true&lt;/code&gt;), then the other triangle likely has all three vertices in a single cell as well, possibly a different one, and perform the final quadric aggregation using AVX2 too:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c01&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c02&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c11&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c12&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_cells&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;single_cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_set1_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Quadric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_quadrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q0z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;_mm256_storeu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm256_storeu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm256_storeu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q0z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Qz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// omitted for brevity&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting code is simpler, shorter and faster than our failed SSE2 approach:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_5.png&quot;&gt;&lt;img src=&quot;/images/simplifysimd_5.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, AVX2 didn’t get us 8x faster than our original scalar code; we’re just 2.45x faster. Our loads and stores are still 4-wide since dynamic indexing forces us to work with an inconvenient memory layout, and the computations aren’t optimal for SIMD - but with this change, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; is no longer at the top of our profile, and we should focus on other functions.&lt;/p&gt;

&lt;h1 id=&quot;gather-round-children&quot;&gt;Gather ‘round, children&lt;/h1&gt;

&lt;p&gt;We saved 4.8 seconds off our test run (48 msec per simplification run), and our top offender is now &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;countTriangles&lt;/code&gt;. The function is seemingly simple, but it does run 5 times instead of just once, so it makes sense that it would account for disproportionately more time:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;countTriangles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]];&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It iterates over all original triangles, and computes the number of non-degenerate triangles by comparing vertex ids. It’s not immediately clear how to make this use SIMD… unless you use gathers.&lt;/p&gt;
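&lt;p&gt;The branchless accumulation works because each comparison yields 0 or 1, so a triangle counts only when all three vertex ids are pairwise distinct. A quick scalar check of that property (a standalone sketch with a hypothetical helper name, not code from the library):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the body of the scalar loop above: a triangle is non-degenerate
// only if its three vertex ids are pairwise distinct; each comparison
// produces a 0/1 value and the results are combined without branches.
static size_t countDistinct3(unsigned int id0, unsigned int id1, unsigned int id2)
{
    return (id0 != id1) & (id0 != id2) & (id1 != id2);
}
```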

&lt;p&gt;AVX2 is the instruction set that introduced gather instructions to x64 SIMD (scatter stores only arrived later, with AVX-512); each gather takes a vector register containing 4 or 8 indices, and performs 4 or 8 loads simultaneously. If we could use gathers here, we could load 3 indices per triangle, gather the vertex ids for all of them at once (in groups of 4 or 8), and compare the results. Gathers have historically been pretty slow on Intel CPUs; however, let’s try this anyway. To make the gathers easier to set up we’ll load 8 triangles’ worth of index data, transpose the vectors similarly to our earlier attempt, and do the comparisons on the respective elements of each vector:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;triangle_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE8_LANE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tri0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tri3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_castps_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tri0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_castps_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tri1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_castps_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tri2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_i32gather_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_i32gather_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_i32gather_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_or_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_cmpeq_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_or_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm256_cmpeq_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm256_cmpeq_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_popcnt_u32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm256_movemask_epi8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;deg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
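&lt;p&gt;The last line of the loop deserves a note: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm256_movemask_epi8&lt;/code&gt; produces one bit per byte, so every all-ones 32-bit lane of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deg&lt;/code&gt; contributes 4 set bits, and dividing the popcount by 4 recovers the number of degenerate triangles in the batch. A scalar model of that reduction (an illustrative sketch, not the post’s code):&lt;/p&gt;

```cpp
#include <cstdint>

// Scalar model of `8 - popcnt(movemask_epi8(deg)) / 4`: each of the 8
// 32-bit lanes is either all-ones (degenerate) or all-zeros, movemask
// takes one bit per byte (4 bits per lane), so popcount/4 recovers the
// number of degenerate lanes.
static unsigned countNonDegenerate(const bool degenerate[8])
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 8; ++lane)
        if (degenerate[lane])
            mask |= 0xFu << (lane * 4); // 4 movemask bits per 32-bit lane

    unsigned pop = 0;
    for (uint32_t m = mask; m != 0; m &= m - 1)
        ++pop; // portable popcount

    return 8 - pop / 4;
}
```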

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_MM_TRANSPOSE8_LANE4_PS&lt;/code&gt; macro is an AVX2 equivalent of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_MM_TRANSPOSE4_PS&lt;/code&gt;; it’s not present in the standard headers but is easy to derive. It takes 4 AVX2 vectors and independently transposes the two 4x4 matrices they represent:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#define _MM_TRANSPOSE8_LANE4_PS(row0, row1, row2, row3) \
do { \
    __m256 __t0, __t1, __t2, __t3; \
    __t0 = _mm256_unpacklo_ps(row0, row1); \
    __t1 = _mm256_unpackhi_ps(row0, row1); \
    __t2 = _mm256_unpacklo_ps(row2, row3); \
    __t3 = _mm256_unpackhi_ps(row2, row3); \
    row0 = _mm256_shuffle_ps(__t0, __t2, _MM_SHUFFLE(1, 0, 1, 0)); \
    row1 = _mm256_shuffle_ps(__t0, __t2, _MM_SHUFFLE(3, 2, 3, 2)); \
    row2 = _mm256_shuffle_ps(__t1, __t3, _MM_SHUFFLE(1, 0, 1, 0)); \
    row3 = _mm256_shuffle_ps(__t1, __t3, _MM_SHUFFLE(3, 2, 3, 2)); \
} while (0)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
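&lt;p&gt;One subtlety worth spelling out: AVX2 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unpacklo&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shuffle&lt;/code&gt; operations never cross the 128-bit lane boundary, which is exactly why the macro transposes two 4x4 blocks independently rather than one 8x8 matrix. A scalar model of the resulting data movement (an illustrative sketch, not SIMD code):&lt;/p&gt;

```cpp
// Scalar model of _MM_TRANSPOSE8_LANE4_PS: each "register" is 8 floats
// split into two 128-bit lanes of 4; the 4x4 block in the low lanes and
// the 4x4 block in the high lanes are transposed independently.
static void transpose8Lane4(float rows[4][8])
{
    for (int lane = 0; lane < 2; ++lane)
        for (int r = 0; r < 4; ++r)
            for (int c = r + 1; c < 4; ++c)
            {
                float tmp = rows[r][lane * 4 + c];
                rows[r][lane * 4 + c] = rows[c][lane * 4 + r];
                rows[c][lane * 4 + r] = tmp;
            }
}
```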

&lt;p&gt;We have to transpose the vectors using floating-point register operations because of some idiosyncrasies of the SSE2/AVX2 instruction sets. We also load the data a bit sloppily; however, this mostly doesn’t seem to matter because we’re bound by the performance of the gathers:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_6.png&quot;&gt;&lt;img src=&quot;/images/simplifysimd_6.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;countTriangles&lt;/code&gt; does run ~27% faster now. Note that CPI (cycles per instruction) is now pretty abysmal: we’re dispatching ~4x fewer instructions, but the gather instructions take a lot of time. It’s great that they help us run a bit faster, but the performance gains are somewhat underwhelming. Still, we did manage to get under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fillCellQuadrics&lt;/code&gt; in the profile, which brings us to the last function at the top that we haven’t looked at yet.&lt;/p&gt;

&lt;h1 id=&quot;chapter-6-where-things-are-as-they-should-be&quot;&gt;Chapter 6, where things are as they should be&lt;/h1&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt; is the last remaining function we’ll look at today; it runs 6 times during our algorithm, so it’s also a great target for optimization. It’s the first function we’ve seen that actually looks like it should map cleanly to SIMD:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;computeVertexIds&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grid_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grid_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grid_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_scale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grid_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

        &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_scale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_scale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell_scale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
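&lt;p&gt;The packing in the last line is valid precisely because of the assert: with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grid_size&lt;/code&gt; at most 1024, each quantized coordinate lands in [0, 1023] and fits in 10 bits, so the three coordinates pack losslessly into a 30-bit id. A small sketch of the pack plus a hypothetical inverse (the unpack helper is ours, for checking only):&lt;/p&gt;

```cpp
#include <cstdint>

// Pack three grid coordinates in [0, 1023] (guaranteed by grid_size <= 1024)
// into one 30-bit vertex id, as computeVertexIds does. unpackId is a
// hypothetical inverse, used only to demonstrate the packing is lossless.
static uint32_t packId(int xi, int yi, int zi)
{
    return (uint32_t(xi) << 20) | (uint32_t(yi) << 10) | uint32_t(zi);
}

static void unpackId(uint32_t id, int& xi, int& yi, int& zi)
{
    xi = int((id >> 20) & 1023);
    yi = int((id >> 10) & 1023);
    zi = int(id & 1023);
}
```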

&lt;p&gt;After all the other optimizations we’ve explored here, we know what we need to do: unroll the loop 4 or 8 times (it doesn’t make sense to try to accelerate just one iteration), transpose the vector components, and perform the computation on all of them in parallel. Let’s do this with AVX2, processing 8 vertices at a time:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_set1_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cell_scale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;half&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_set1_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_loadu2_m128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE8_LANE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_cvttps_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_cvttps_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_cvttps_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm256_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_or_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;zi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm256_or_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm256_slli_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_mm256_slli_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;_mm256_storeu_si256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__m256i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertex_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And look at the results:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/simplifysimd_7.png&quot;&gt;&lt;img src=&quot;/images/simplifysimd_7.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We made &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt; 2x faster which, combined with all the other optimizations, brings our total runtime to ~120 ms for one simplification run - that adds up to 50 million triangles/second.&lt;/p&gt;

&lt;p&gt;Once again it may look like we’re not getting the level of performance we expected - shouldn’t &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt; improve by more than 2x from introducing SIMD? To answer this, let’s look at how much work this function is doing.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;computeVertexIds&lt;/code&gt; is run 6 times during the course of one simplification run - 5 times during the binary search, and once at the end to compute the final ids that are used for further processing. Each time this function processes 3M vertices, reading 12 bytes for each vertex and writing 4 bytes.&lt;/p&gt;

&lt;p&gt;In total, this function processes 1800M vertices over 100 runs of the algorithm, reading 21 GB of data and writing 7 GB back. To process 28 GB of data in 1.46 seconds requires 19 GB/sec bandwidth. Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcmp(block1, block2, 512 MB)&lt;/code&gt; on this system finishes in 45 msec, which makes me think that only about 22 GB/sec is achievable on a single core&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. Essentially we’re now running close to memory speed and improving performance further would require packing our vertex data tighter so that positions require less than 12 bytes to store.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;We’ve taken a pretty well optimized algorithm that could simplify very large meshes at a rate of 28 million triangles per second, and used SSE and AVX instruction sets to make it almost 2x faster, achieving 50 million triangles/second. Along this journey we had to explore different ways to apply SIMD - using SIMD registers to store 3-wide vectors, attempting to leverage SoA transposes, using AVX2 to store two 3-wide vectors, using gathers to load data slightly faster than is possible with scalar instructions and, finally, a straightforward application of AVX2 for stream processing.&lt;/p&gt;

&lt;p&gt;SIMD often isn’t a good starting point for optimization - the sloppy simplifier went through many iterations of both algorithmic optimizations and micro-optimizations without the use of platform-specific instructions; however, at some point most other optimization opportunities are exhausted and, if performance is critical, SIMD is a fantastic tool to be able to use when necessary.&lt;/p&gt;

&lt;p&gt;I’m not sure how many of these optimizations will end up in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshoptimizer&lt;/code&gt; master - after all, this was mostly an experiment to see how much it’s possible to push the hardware without drastically changing any algorithms involved. Hopefully this was informative and can give you ideas to optimize your code. The final source code for this article is &lt;a href=&quot;https://gist.github.com/zeux/1171b770c105b11c3bde128e1d3a16ec&quot;&gt;available here&lt;/a&gt;; this work is based off &lt;a href=&quot;https://github.com/zeux/meshoptimizer/commit/99ab49af6706daf9716c0f1e2d1a1d99fdf12d81&quot;&gt;meshoptimizer 99ab49&lt;/a&gt;, with Thai Buddha model available &lt;a href=&quot;https://sketchfab.com/3d-models/thai-buddha-cba029e262bd4f22a7ee4fcf064e22ee&quot;&gt;on Sketchfab&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This corresponds to ~28.5 million triangles/second which arguably is fast enough for practical purposes, but I was curious as to how much it’s possible to push the hardware here. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In this case making the quadric structure larger by 8 bytes doesn’t seem to make a difference performance-wise; unaligned loads should be mostly running at the same speed as aligned loads these days so it likely doesn’t matter much one way or the other. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Or rather what you should normally do is to pack data using small groups of SIMD registers, for example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float x[8], y[8], z[8]&lt;/code&gt; for each 8 vertices in your input data - this is known as AoSoA (arrays-of-structures-of-arrays) and gives a good balance between cache locality and ease of loading into SIMD registers. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Ideally you should be able to use ISPC here for it to generate all this code - however, my naive attempts to get ispc to generate good code here didn’t work well. I wasn’t able to get it to generate optimal load/store sequences; instead it resorted to using gather/scatter, which resulted in code that’s substantially slower than speed of light here. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;My understanding is that the first CPUs that supported AVX2 literally implemented it by decoding each instruction into two or more micro-ops, so performance gains were limited to the instruction fetch phase. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;AIDA64 benchmark gets up to 31 GB/sec read speed on my system, but it uses multiple cores to get there. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Sun, 17 Feb 2019 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2019/02/17/flavors-of-simd/</link>
			<guid isPermaLink="true">https://zeux.io/2019/02/17/flavors-of-simd/</guid>
		</item>
		
		<item>
			<title>Is C++ fast?</title>
			<description>&lt;p&gt;A library that I work on often these days, &lt;a href=&quot;https://github.com/zeux/meshoptimizer&quot;&gt;meshoptimizer&lt;/a&gt;, has changed over time to use fewer and fewer C++ library features, up until the current state where the code closely resembles C even though it uses some C++ features. There have been many reasons behind the changes - dropping the C++11 requirement allowed me to make sure anybody can compile the library on any platform, removing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; substantially improved the performance of unoptimized builds, and removing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;algorithm&lt;/code&gt; includes sped up compilation. However, I’ve never quite taken the leap all the way to C with this codebase. Today we’ll explore the gamut of possible C++ implementations for one specific algorithm, the mesh simplifier, henceforth known as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt;, and see if going all the way to C is worthwhile.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;methodology&quot;&gt;Methodology&lt;/h1&gt;

&lt;p&gt;The mesh simplifier is an implementation of an edge collapse quadric based simplification algorithm with many tweaks to improve the performance and quality of the result. The algorithm is still in development but has had a fair share of effort put into it. The details are really not that important, but it helps to understand the structure and size:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The entire algorithm is implemented in one standalone .cpp file that has almost exactly a thousand lines of code (1004 as of this writing), including comments, blank lines, lines with braces, etc.&lt;/li&gt;
  &lt;li&gt;The algorithm almost exclusively uses heap-allocated arrays as data structures, using raw pointers for this&lt;/li&gt;
  &lt;li&gt;The algorithm needs a hash table and a sorting routine, implemented from scratch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will look at several variations of the implementation, starting with one that uses C++ containers and algorithms that would be helpful for that algorithm, then remove one C++ feature at a time and measure compilation speed and runtime performance as we go, on three compilers - gcc 7.3, clang 6 and msvc 2017 - on a Core i7-8700K running Windows 10 / Ubuntu 16.10. We’ll measure compilation performance by just compiling one .cpp file (with default options in debug and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-O2&lt;/code&gt; optimization level in release), and measure runtime performance by simplifying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buddha.obj&lt;/code&gt; (1M triangle mesh) to 25% of its size. After we reach the current implementation, we will explore the option of changing the code to pure C99.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note that the way I arrived at these implementations is by taking the code you can see in the repository right now, and changing it to be more idiomatic Modern C++&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. However, these are generally very close to past versions of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; - the difference being that it’s possible to directly compare the variants now.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1 id=&quot;baseline-a-lot-of-c&quot;&gt;Baseline: a lot of C++&lt;/h1&gt;

&lt;p&gt;The version we’re starting with is the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; from &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/c93ba0987baa84bd73b61edf1c0ba7ba2e48df4b/src/simplifier.cpp&quot;&gt;current meshoptimizer master&lt;/a&gt;, with the following modifications:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;All raw pointers changed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Instead of a home-grown hash table we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::unordered_set&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Instead of a home-grown sorting routine we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::sort&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the performance that we’re getting as a result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/stl&lt;/th&gt;
      &lt;th&gt;debug compile&lt;/th&gt;
      &lt;th&gt;release compile&lt;/th&gt;
      &lt;th&gt;debug run&lt;/th&gt;
      &lt;th&gt;release run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;520 ms&lt;/td&gt;
      &lt;td&gt;646 ms&lt;/td&gt;
      &lt;td&gt;2273 ms&lt;/td&gt;
      &lt;td&gt;572 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;400 ms&lt;/td&gt;
      &lt;td&gt;684 ms&lt;/td&gt;
      &lt;td&gt;2356 ms&lt;/td&gt;
      &lt;td&gt;566 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang libc++&lt;/td&gt;
      &lt;td&gt;400 ms&lt;/td&gt;
      &lt;td&gt;725 ms&lt;/td&gt;
      &lt;td&gt;1535 ms&lt;/td&gt;
      &lt;td&gt;584 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc&lt;/td&gt;
      &lt;td&gt;422 ms&lt;/td&gt;
      &lt;td&gt;566 ms&lt;/td&gt;
      &lt;td&gt;36317 ms&lt;/td&gt;
      &lt;td&gt;579 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This is a good starting point. We can see that performance is pretty solid in release - 0.6 seconds to decimate a 1M triangle mesh is a good level of performance - and more or less reasonable in debug, with the notable exception of MSVC (the adverse behavior of MSVC STL in debug mode was one of the forcing functions to remove all STL use from meshoptimizer); compile times vary but are uninspiring.&lt;/p&gt;

&lt;p&gt;To put the compile times in perspective, Jonathan Blow recently posted a &lt;a href=&quot;https://youtu.be/iD08Vpkie8Q?t=4984&quot;&gt;video stream with compiler performance improvements&lt;/a&gt;, where his game engine and game written in his new language compile and link in about a second (compilation itself takes about 0.9 seconds). That’s on a codebase that has 100K lines of code - our algorithm only has 1K lines of code (excluding STL, of course - it’s not entirely fair to exclude STL, but it’s not entirely fair to include STL either since we know our algorithm can be implemented in 1K LOC without any STL dependencies). 400 ms is something you notice when compiling your code, even if it’s just one file, and something that makes me less happy when working on the code - given many files like that, cumulative compilation performance can be bad. And this is despite the fact that our implementation is pretty spartan about its STL dependencies - we only use three algorithms/containers. Let’s see what happens when we stop using one of them.&lt;/p&gt;

&lt;h1 id=&quot;not-using-unordered_set-in-the-first-place&quot;&gt;Not using unordered_set in the first place&lt;/h1&gt;

&lt;p&gt;The secret about the previous version we benchmarked is that it never existed in that form. While meshoptimizer initially used STL containers and algorithms, it never used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::unordered_set&lt;/code&gt; - that’s because based on prior experience I expected the performance to be insufficient for the kinds of algorithms I wanted to write, and had a custom replacement that was using quadratic probing in a large power of two sized array, which is similar to Google’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dense_hash_set&lt;/code&gt; design. It’s a kind of hash table I use and implement often in different codebases for different applications, so I’m very familiar with it. The implementation in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; is just 35 lines of code&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, so it’s easy to drop in and adapt for the use case at hand. Let’s see what happens when we use that instead.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/stl&lt;/th&gt;
      &lt;th&gt;debug compile&lt;/th&gt;
      &lt;th&gt;release compile&lt;/th&gt;
      &lt;th&gt;debug run&lt;/th&gt;
      &lt;th&gt;release run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;334 ms&lt;/td&gt;
      &lt;td&gt;461 ms&lt;/td&gt;
      &lt;td&gt;2054 ms&lt;/td&gt;
      &lt;td&gt;460 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;270 ms&lt;/td&gt;
      &lt;td&gt;517 ms&lt;/td&gt;
      &lt;td&gt;2152 ms&lt;/td&gt;
      &lt;td&gt;452 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang libc++&lt;/td&gt;
      &lt;td&gt;314 ms&lt;/td&gt;
      &lt;td&gt;609 ms&lt;/td&gt;
      &lt;td&gt;1179 ms&lt;/td&gt;
      &lt;td&gt;415 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc&lt;/td&gt;
      &lt;td&gt;361 ms&lt;/td&gt;
      &lt;td&gt;461 ms&lt;/td&gt;
      &lt;td&gt;28337 ms&lt;/td&gt;
      &lt;td&gt;380 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It looks like the extra 35 lines for a manual implementation of a better hash table were worth it. We’re seeing significant performance improvements across the board, in debug/release and both in terms of compile time and run time. The largest increase in runtime performance is on MSVC, where we got 1.5x faster - and this is despite the fact that the hash table isn’t used as a core part of the algorithm; it’s only used to establish a uniqueness relationship between individual vertices before the algorithm starts.&lt;/p&gt;

&lt;p&gt;This highlights the poor fit of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::unordered_set&lt;/code&gt; to performance-critical workloads, especially ones that are insert-heavy. Unfortunately, this is not an implementation defect and thus is not possible to correct - the issue is that the standard requirements on unordered containers preclude more efficient implementations. Here’s to hoping that eventually we’ll get a better hash table in the standard.&lt;/p&gt;

&lt;h1 id=&quot;exact-sorting-algorithms-are-overrated&quot;&gt;Exact sorting algorithms are overrated&lt;/h1&gt;

&lt;p&gt;At some point during development of the simplifier repeated profiling of various meshes showed that a lot of time is being spent in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::sort&lt;/code&gt;. Now, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::sort&lt;/code&gt; isn’t the fastest sorting algorithm, but it’s generally extremely competitive with custom implementations and it’s hard to beat without changing the problem around. In my case, sorting was used on an array of edge collapses, with the sort key being a floating point error value - so the natural instinct is to use a 3-pass radix sort, using 11, 11 and 10 bits of the key in each pass. However, there’s an interesting alternative available to us here - we can do radix sort in a single pass, using an 11 bit key&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;What happens is that we have a 32-bit non-negative floating point value; if we take the top 12 bits and ignore the topmost one (since that’s a sign bit and is always 0), we get 11 bits that represent 8 bits of exponent and 3 bits of mantissa, which essentially gives us a value of similar magnitude but with a significant round-off error. If we sort using this value as a key, the resulting sequence isn’t going to be perfectly ordered with respect to the full 32-bit key. However, in our case we need to sort to be able to process better edge collapses first based on a heuristic - and the heuristic is a gross approximation so the extra error our sorting introduces is not noticeable. This technique is surprisingly useful in other domains where you don’t necessarily need an exact order either. A benefit of a single-pass radix sort is that it’s faster (you only need to do one pass over the data instead of 3!) and simpler to implement than a full-blown radix sort, taking just 36 lines of code&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/stl&lt;/th&gt;
      &lt;th&gt;debug compile&lt;/th&gt;
      &lt;th&gt;release compile&lt;/th&gt;
      &lt;th&gt;debug run&lt;/th&gt;
      &lt;th&gt;release run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;287 ms&lt;/td&gt;
      &lt;td&gt;403 ms&lt;/td&gt;
      &lt;td&gt;949 ms&lt;/td&gt;
      &lt;td&gt;334 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;230 ms&lt;/td&gt;
      &lt;td&gt;461 ms&lt;/td&gt;
      &lt;td&gt;962 ms&lt;/td&gt;
      &lt;td&gt;327 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang libc++&lt;/td&gt;
      &lt;td&gt;312 ms&lt;/td&gt;
      &lt;td&gt;546 ms&lt;/td&gt;
      &lt;td&gt;940 ms&lt;/td&gt;
      &lt;td&gt;328 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc&lt;/td&gt;
      &lt;td&gt;330 ms&lt;/td&gt;
      &lt;td&gt;430 ms&lt;/td&gt;
      &lt;td&gt;26824 ms&lt;/td&gt;
      &lt;td&gt;285 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This time the gains in compilation times are somewhat more modest. We’ve removed the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;algorithm&amp;gt;&lt;/code&gt; header but it doesn’t seem to have had very significant benefits to compilation time - we’re still including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;vector&amp;gt;&lt;/code&gt; and it’s possible that large STL headers are pulled in by both. However, the effects on performance are very significant, especially on debug performance in libstdc++ (most likely &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::sort&lt;/code&gt; is very slow in debug there) but the gains in release builds are also exciting. What is not obvious from this table is that sorting got so much faster that it almost completely disappeared from the profiles compared to the other work - the entire algorithm runs “just” 1.35x faster, but the gains measured on just the sorting code are much larger, 117 ms -&amp;gt; 10 ms in release builds.&lt;/p&gt;

&lt;h1 id=&quot;so-long-stdvector&quot;&gt;So long, std::vector&lt;/h1&gt;

&lt;p&gt;One number that we haven’t moved substantially yet is the time it takes to run this code in debug using MSVC. While it’s natural to expect unoptimized builds to be slower than optimized, they have to be fast enough. Sometimes you want to debug your problem on a non-trivial input dataset. Sometimes you want to run the debug build with full checks through your tests to make sure they don’t trigger any bugs that could disappear in release. Sometimes you are trying to debug a different part of the program, but you still need to run the rest of it. Programmers creatively come up with many workarounds that make the problem less severe - you can make special builds that enable some optimizations but not all, you can use mixed optimization settings for different projects, you can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#pragma optimize&lt;/code&gt; to temporarily disable optimizations around offending parts of the code - but all of these seem like duct-tape. Let’s try to replace the only STL component we’re still using, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt;, with a really simple dynamic array - we don’t need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;resize&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;push_back&lt;/code&gt; in our code, all arrays are initialized with the right size. Our demands are low enough that our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; replacement is just 40 lines of code&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, and mostly consists of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operator[]&lt;/code&gt; definitions ;)&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/stl&lt;/th&gt;
      &lt;th&gt;debug compile&lt;/th&gt;
      &lt;th&gt;release compile&lt;/th&gt;
      &lt;th&gt;debug run&lt;/th&gt;
      &lt;th&gt;release run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;158 ms&lt;/td&gt;
      &lt;td&gt;303 ms&lt;/td&gt;
      &lt;td&gt;980 ms&lt;/td&gt;
      &lt;td&gt;318 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;138 ms&lt;/td&gt;
      &lt;td&gt;320 ms&lt;/td&gt;
      &lt;td&gt;1021 ms&lt;/td&gt;
      &lt;td&gt;297 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang libc++&lt;/td&gt;
      &lt;td&gt;142 ms&lt;/td&gt;
      &lt;td&gt;324 ms&lt;/td&gt;
      &lt;td&gt;1028 ms&lt;/td&gt;
      &lt;td&gt;299 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc&lt;/td&gt;
      &lt;td&gt;156 ms&lt;/td&gt;
      &lt;td&gt;219 ms&lt;/td&gt;
      &lt;td&gt;3482 ms&lt;/td&gt;
      &lt;td&gt;265 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This is certainly… interesting. By replacing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; with our own type we not only significantly improved debug performance in MSVC, but also halved the compile time for several compilers we were testing. Debug performance in gcc/clang regressed a bit - I believe this is because my replacement uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assert&lt;/code&gt; to perform bounds checking on every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operator[]&lt;/code&gt; access, and in libstdc++ and libc++ these are controlled using separate defines, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_GLIBCXX_ASSERTIONS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_LIBCPP_DEBUG&lt;/code&gt; respectively. Enabling these defines for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; variant increases the debug runtime to ~1350 ms for both libraries&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, so our replacement &lt;em&gt;is&lt;/em&gt; faster when comparable functionality is enabled.&lt;/p&gt;

&lt;p&gt;Release performance also slightly increased across the board - this is because for many of our arrays, the value initialization performed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; is redundant, as we’re going to fill the array anyway. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt;, you can either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;resize&lt;/code&gt; a large array and then compute the items (which value-initializes every item redundantly), or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reserve&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;push_back&lt;/code&gt; repeatedly (which requires a bit more code for adding each item, and this overhead can also add up). With a custom container it’s easy to have an option to skip initialization - in fact, in our replacement that’s the only option, since it’s easy to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memset&lt;/code&gt; the array manually if necessary.&lt;/p&gt;

&lt;h1 id=&quot;there-and-back-again&quot;&gt;There and back again*&lt;/h1&gt;

&lt;p&gt;A custom container with a bounds-checked &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operator[]&lt;/code&gt; was mostly a success, but it didn’t quite make me happy. In some algorithms the extra cost of the container was still pretty substantial. In others, internal functions would use raw pointers to maximize release performance, which meant bounds checking wasn’t performed anyway. And algorithm inputs used raw pointers, which required careful handling. Because of the use of raw pointers in many critical places, I would run builds with Address Sanitizer as part of the CI pipeline and also occasionally locally, so I felt safe about the lack of out-of-bounds accesses. Debuggers couldn’t display the arrays without custom visualizers, and more crucially had problems evaluating member access (this is true of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; as well, depending on the debugger), which made watch expressions more complex and debugging less pleasant. The status quo provided neither complete safety nor complete performance, so I decided to try raw pointers instead.&lt;/p&gt;

&lt;p&gt;Of course, one other benefit of containers is the extra protection against memory leaks - I wasn’t particularly keen on remembering to free each allocated pointer, so I made a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meshopt_Allocator&lt;/code&gt; class&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; that could allocate large blocks of typed data and remember each allocated pointer; at the end of the scope all allocated blocks would be deleted. This split the fused allocator+array class in two: the allocator class took over the memory management duties, while a raw pointer sufficed for the array itself. Address Sanitizer, along with rigorous testing and hand-crafted assertion statements, would keep the code correct.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/stl&lt;/th&gt;
      &lt;th&gt;debug compile&lt;/th&gt;
      &lt;th&gt;release compile&lt;/th&gt;
      &lt;th&gt;debug run&lt;/th&gt;
      &lt;th&gt;release run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;147 ms&lt;/td&gt;
      &lt;td&gt;260 ms&lt;/td&gt;
      &lt;td&gt;720 ms&lt;/td&gt;
      &lt;td&gt;320 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;132 ms&lt;/td&gt;
      &lt;td&gt;294 ms&lt;/td&gt;
      &lt;td&gt;699 ms&lt;/td&gt;
      &lt;td&gt;301 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang libc++&lt;/td&gt;
      &lt;td&gt;131 ms&lt;/td&gt;
      &lt;td&gt;297 ms&lt;/td&gt;
      &lt;td&gt;697 ms&lt;/td&gt;
      &lt;td&gt;300 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc&lt;/td&gt;
      &lt;td&gt;141 ms&lt;/td&gt;
      &lt;td&gt;194 ms&lt;/td&gt;
      &lt;td&gt;1080 ms&lt;/td&gt;
      &lt;td&gt;261 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;While I’m not 100% happy with the tradeoff, it has worked well so far. It’s great to remove the cognitive overhead of deciding, in each function, whether to use a raw pointer, an iterator or the container. It’s also worth noting that the overhead of builds with Address Sanitizer is very reasonable, and having it on makes me feel safer since it catches a superset of the problems that bounds checks in containers do.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/sanitizer&lt;/th&gt;
      &lt;th&gt;compile&lt;/th&gt;
      &lt;th&gt;run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;147 ms&lt;/td&gt;
      &lt;td&gt;721 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc asan&lt;/td&gt;
      &lt;td&gt;200 ms&lt;/td&gt;
      &lt;td&gt;1229 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc asan ubsan&lt;/td&gt;
      &lt;td&gt;260 ms&lt;/td&gt;
      &lt;td&gt;1532 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;135 ms&lt;/td&gt;
      &lt;td&gt;695 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang asan&lt;/td&gt;
      &lt;td&gt;154 ms&lt;/td&gt;
      &lt;td&gt;1266 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang asan ubsan&lt;/td&gt;
      &lt;td&gt;180 ms&lt;/td&gt;
      &lt;td&gt;1992 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h1 id=&quot;lets-c&quot;&gt;Let’s C&lt;/h1&gt;

&lt;p&gt;Once we’ve switched to raw pointers, there’s really not much C++ left in our code. There is still an occasional template or two, but the number of instantiations is small enough that we could duplicate the code for each type we need it for. meshoptimizer uses C++ casts for pointers and functional casts (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int(v)&lt;/code&gt;) for numbers, but C has neither, so that has to change. A few other syntactic annoyances emerge, but at this point it’s really not hard to make a C version of the code. It does require more sacrifices, and there’s the issue of MSVC, which either has to use C89 or compile our C99 code as C++ unless we’re willing to support only the latest MSVC versions - but it’s doable. Once we’ve stopped using every C++ standard header, though, does it really matter?&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler/stl&lt;/th&gt;
      &lt;th&gt;debug compile&lt;/th&gt;
      &lt;th&gt;release compile&lt;/th&gt;
      &lt;th&gt;debug run&lt;/th&gt;
      &lt;th&gt;release run&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;105 ms&lt;/td&gt;
      &lt;td&gt;209 ms&lt;/td&gt;
      &lt;td&gt;710 ms&lt;/td&gt;
      &lt;td&gt;321 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;95 ms&lt;/td&gt;
      &lt;td&gt;254 ms&lt;/td&gt;
      &lt;td&gt;711 ms&lt;/td&gt;
      &lt;td&gt;310 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc c++&lt;/td&gt;
      &lt;td&gt;139 ms&lt;/td&gt;
      &lt;td&gt;192 ms&lt;/td&gt;
      &lt;td&gt;1087 ms&lt;/td&gt;
      &lt;td&gt;262 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;msvc c99&lt;/td&gt;
      &lt;td&gt;125 ms&lt;/td&gt;
      &lt;td&gt;180 ms&lt;/td&gt;
      &lt;td&gt;1085 ms&lt;/td&gt;
      &lt;td&gt;261 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;There is a notable impact on gcc/clang compilation time - we save ~40 ms on both by switching to C. The real difference at this point is in the standard headers - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math.h&lt;/code&gt;, which happens to be substantially larger in C++ mode than in C mode, and that difference will grow even more once the default compilation mode is set to C++17:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;compiler&lt;/th&gt;
      &lt;th&gt;c99&lt;/th&gt;
      &lt;th&gt;c++98&lt;/th&gt;
      &lt;th&gt;c++11&lt;/th&gt;
      &lt;th&gt;c++14&lt;/th&gt;
      &lt;th&gt;c++17&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gcc&lt;/td&gt;
      &lt;td&gt;105 ms&lt;/td&gt;
      &lt;td&gt;143 ms&lt;/td&gt;
      &lt;td&gt;147 ms&lt;/td&gt;
      &lt;td&gt;147 ms&lt;/td&gt;
      &lt;td&gt;214 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang&lt;/td&gt;
      &lt;td&gt;95 ms&lt;/td&gt;
      &lt;td&gt;129 ms&lt;/td&gt;
      &lt;td&gt;133 ms&lt;/td&gt;
      &lt;td&gt;134 ms&lt;/td&gt;
      &lt;td&gt;215 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clang libc++&lt;/td&gt;
      &lt;td&gt;95 ms&lt;/td&gt;
      &lt;td&gt;130 ms&lt;/td&gt;
      &lt;td&gt;132 ms&lt;/td&gt;
      &lt;td&gt;136 ms&lt;/td&gt;
      &lt;td&gt;140 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The issue is that in gcc/clang &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math.h&lt;/code&gt; includes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmath&lt;/code&gt;, which pulls in a lot of C++ machinery, and in C++17 libstdc++ adds a slew of new special functions that are rarely useful but make compilation slower anyway. Removing the dependency on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math.h&lt;/code&gt; is easy in this case&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#ifdef __GNUC__
#define fabsf(x) __builtin_fabsf(x)
#define sqrtf(x) __builtin_sqrtf(x)
#else
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;math.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and brings us all the way down to C compile times. This is definitely an area where libstdc++ and libc++ could improve in the future - I don’t think it’s reasonable to force users of C headers to pay for the C++ baggage. With the exception of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math.h&lt;/code&gt; issue, C doesn’t look meaningfully faster to compile than C++ as long as a compile-time-conscious subset of C++ is used - so at this point a switch to C isn’t warranted for meshoptimizer.&lt;/p&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Hopefully the excursion through past, present and possible future changes in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; was useful. When making C/C++ libraries, it’s important to pay attention to more than just correctness - portability, ease of compilation, compilation times, runtimes in both debug and release, debuggability - all of these are important and will help reduce the friction for both users of the library and contributors. C++ is an unforgiving language, but, given enough time and effort, it’s possible to get good performance - assuming you’re willing to question everything, including practices that are sometimes believed to be universal, such as the effectiveness or efficiency of STL or CRT.&lt;/p&gt;

&lt;p&gt;We started with half a second of compile time on gcc and 36 seconds of runtime on MSVC in debug mode, and ended with 100 ms compile time on gcc and around a second of runtime on MSVC, which is much more pleasant to work with. Of course, at 1K lines compiling in 100 ms, and assuming linear scaling, we would need a full second per 10K lines, which is still substantially slower than some other languages - but not entirely unreasonable for a full build run on a single core. Getting there for large codebases developed over many years is a much harder problem, one that will be left as an exercise for the reader ;)&lt;/p&gt;

&lt;p&gt;All source modifications to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; are &lt;a href=&quot;https://gist.github.com/zeux/bf847986e0474cf48f61bb5749da38e4&quot;&gt;available here&lt;/a&gt;; in order described in the article, it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifiervsm.cpp&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifiervs.cpp&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifierv.cpp&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifierb.cpp&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.cpp&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simplifier.c&lt;/code&gt;.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Some time last year during a regular lunch time C++ discussion at work somebody said “there’s a good language subset of C++, C with classes”, to which I replied “there’s an even better subset, C with structs”. That’s what most of meshoptimizer source code looks like, barring a few templates :) &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The hash table interface is just two functions, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashBuckets&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashLookup&lt;/code&gt;: &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/c93ba0987baa84bd73b61edf1c0ba7ba2e48df4b/src/simplifier.cpp#L124&quot;&gt;simplifier.cpp:124&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Using 11 bits is a reasonable choice because that requires a 2048-entry histogram, which takes 8 KB and comfortably fits within a 16 KB L1 cache. Given a 32 KB L1 cache you can extend the histogram to 12 bits but going beyond that is generally less efficient. You can read more about the radix sort in the &lt;a href=&quot;http://www.codercorner.com/RadixSortRevisited.htm&quot;&gt;Radix Sort Revisited&lt;/a&gt; article by Pierre Terdiman. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The full implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sortEdgeCollapses&lt;/code&gt; function is available here: &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/c93ba0987baa84bd73b61edf1c0ba7ba2e48df4b/src/simplifier.cpp#L712&quot;&gt;simplifier.cpp:712&lt;/a&gt; &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This class is no longer a part of meshoptimizer, but you can look at the older slightly longer version here: &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/5b0d10bb3c0c174965b716dda3270bce4f3278b6/src/meshoptimizer.h#L605&quot;&gt;meshoptimizer.h:605&lt;/a&gt; &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I discovered this after investigating the curious performance difference in debug; I’m loath to repeat the debug benchmarks for all previous test cases, so I’ll assume that the overhead is the extra ~30% seen on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::vector&lt;/code&gt; variant - hopefully that doesn’t change the general picture. I’m not sure why these assertions aren’t enabled by default in the first place - that doesn’t seem very user friendly - but this should mirror the default experience of working with these libraries. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This class is used by all meshoptimizer algorithms and is available here: &lt;a href=&quot;https://github.com/zeux/meshoptimizer/blob/c93ba0987baa84bd73b61edf1c0ba7ba2e48df4b/src/meshoptimizer.h#L662&quot;&gt;meshoptimizer.h:662&lt;/a&gt; &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This feels like duct tape, and one that I’d have to apply to multiple source files independently, so for now I opted for not doing this. However if C++17 mode becomes the default before this issue gets fixed, I’ll have to reconsider since 2x compile time penalty is a bit too much to swallow. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Thu, 17 Jan 2019 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2019/01/17/is-c-fast/</link>
			<guid isPermaLink="true">https://zeux.io/2019/01/17/is-c-fast/</guid>
		</item>
		
		<item>
			<title>Voxel terrain: physics</title>
			<description>&lt;p&gt;In the &lt;a href=&quot;/2017/03/27/voxel-terrain-storage/&quot;&gt;last article&lt;/a&gt; we’ve discussed the particulars of voxel data definition and storage for voxel terrain we use at &lt;a href=&quot;https://www.roblox.com/&quot;&gt;Roblox&lt;/a&gt;. From there on a lot of other systems read &amp;amp; write data from the storage and interpret it in different ways - the implementation for each system (rendering, networking, physics) is completely separate and not tied too much to decisions storage or other systems are making, so we can study them independently.&lt;/p&gt;

&lt;p&gt;While logically speaking it would make sense to look at the mesher next (which is what we call the component that takes a box of voxel data and produces triangle data representing the terrain surface with material attributes), since it is used by both the physics and rendering systems, the algorithm is pretty involved and has quite a bit of “magic”, so we will leave it for another time and look at physics today.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;initial-prototype&quot;&gt;Initial prototype&lt;/h1&gt;

&lt;p&gt;Physics support is crucial for terrain since so much content we have relies on having robust physics behavior, both in-game and in-editor. For terrain in particular, having physics support meant implementing collision detection between terrain and all other shapes we use, as well as supporting raycasts efficiently. We care about performance and memory consumption, as the assumption is that some worlds will be heavily relying on terrain, including terrain physics.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_0.jpg&quot; alt=&quot;Car&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While our physics engine is custom, we use some components of &lt;a href=&quot;http://bulletphysics.org&quot;&gt;Bullet Physics&lt;/a&gt; for broadphase/narrowphase (specifically for complex convex objects and convex decomposition, relying on Bullet’s GJK implementation and some other algorithms), so it made sense to start by prototyping the solution that would heavily rely on Bullet.&lt;/p&gt;

&lt;p&gt;Similarly to voxel storage, we divide the entire world into chunks and represent each chunk as a collision object. This division is crucial because each chunk becomes the unit of update for physics data - since terrain can change at any time, the chunk size is a balance between update cost (if a chunk is too big, then every time we update a voxel in that chunk we pay the high cost of rebuilding the entire chunk - we currently assume that incremental updates are too complex to implement, as voxel changes can alter the topology of the resulting mesh) and chunk overhead (if a chunk is just 2^3 voxels, we’d spend a lot of time/memory managing chunks). We settled on 8^3 chunks (as a reminder, a character is slightly taller than 1 voxel, which should give you a sense of scale) as a balance between these factors.&lt;/p&gt;

&lt;p&gt;While we could try to do collision directly using the underlying voxel representation, the meshing algorithm we use is complex and makes it hard to accurately predict where the surface will be without running the algorithm; since our voxels are pretty big we wanted rendering and physics representation to match closely to eliminate visual artifacts so we decided to use the polygonal representation for collision.&lt;/p&gt;

&lt;p&gt;Thus, the prototype took each chunk, ran mesher on the chunk to generate a triangle mesh from it, and then created a Bullet collision object using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btBvhTriangleMeshShape&lt;/code&gt;. The resulting objects were inserted into the general broadphase structure along with other objects in the world; whenever an object, say, a ball, intersected one of the chunks, we would generate a contact object and then run Bullet’s algorithms to determine the contact points.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_1.png&quot; alt=&quot;Chunks&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While this prototype got us going, it highlighted several key areas of improvement:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Broadphase data is too imprecise - whenever any object intersects an 8^3 voxel chunk, we need to run the narrowphase algorithm on the contact; this results in a lot of redundant contacts that don’t generate any contact points, especially in caves, where an object hanging in mid-air can overlap a relatively large chunk bounding box but never touch any geometry&lt;/li&gt;
  &lt;li&gt;Narrowphase data is too large - for each chunk we end up storing the triangle mesh that is less compact than the voxel data (since we need to store vertex positions/indices) and Bullet’s BVH structure that is used to accelerate collision, which is also pretty large&lt;/li&gt;
  &lt;li&gt;Narrowphase data is too slow to generate - while our meshing algorithm is heavily optimized, Bullet’s processing of the resulting mesh is relatively slow (up to 3x slower than generating the mesh)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This suggested that we need to change our approach for both broadphase and narrowphase. Let’s look at what we ended up doing there.&lt;/p&gt;

&lt;h1 id=&quot;bit-broadphase-construction&quot;&gt;Bit broadphase: construction&lt;/h1&gt;

&lt;p&gt;Since the key issue with the broadphase was precision, we decided to teach broadphase about the structure of each object and do early rejects based on the actual voxel data - whenever an object moves, instead of creating a contact with each overlapping chunk, we would first look at the voxel data in the chunk to see if the object’s AABB intersects any voxel data.&lt;/p&gt;

&lt;p&gt;While our voxel data is pretty compact, we wanted to bring the memory impact of the broadphase data to the minimum. Additionally, the way our meshing algorithm works is that the geometry generated from a single voxel isn’t contained within that voxel and can spill into the neighboring voxels, so we actually need to check neighboring voxels as well. For all of this to work efficiently, we decided that for each chunk we would store a bitmask (where each voxel would correspond to one bit) that would tell us if each voxel has any geometry to collide with or not.&lt;/p&gt;

&lt;p&gt;It’s vital to be able to generate this mask without relying on the meshing algorithm (running the meshing algorithm on the entire terrain is too time-consuming, and, the way our code is structured, broadphase data has to be available for the entire world), so we approximate it by saying that if a voxel is solid, it can in theory generate geometry inside any of its neighboring voxels (including diagonal neighbors, for a total of 3^3=27 voxels), and fill all of these voxels with 1s in the mask - this process is called dilation. This also requires that we look at the neighboring voxels of the chunk, so our input is a 10^3 voxel box and our output is an 8^3 bitmask.&lt;/p&gt;

&lt;p&gt;Finally, we have two types of contacts - solid and water (we use contacts between primitives and water to compute buoyancy), so we generate two bitmasks. The process works roughly as follows (note that for exposition the images in this article are assuming 4^3 chunks instead of 8^3):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_2.png&quot; alt=&quot;Creating bitmasks&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To make this process fast, we dilate using bit operations - we first generate the 10^3 bitmask and store it in an array of 10^2 16-bit integers, then we dilate each integer horizontally like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then we dilate along other two axes in two passes like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;temp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a result, we get 10^2 10-bit dilated bitmasks; we then extract 8^2 8-bit bitmasks that correspond to chunk’s voxels (discarding the redundant boundary voxels) and store them as the broadphase data. During this process we also filter out redundant chunks - if all voxels in the 10^3 volume are filled with air then there’s no geometry there; also, if all voxels in the 10^3 volume are filled with either a solid material or water, then our meshing algorithm will never generate any polygons but we still need to keep this chunk in the broadphase (for example, to be able to tell if an object is fully submerged in water) so we tag it in a special way.&lt;/p&gt;

&lt;p&gt;Overall, this results in extremely low storage costs - the worst case is 2 bit/voxel (1 bit for solid mask and 1 bit for water mask), but many chunks are discarded since they are either empty or full and in this case we just need to store the chunk structure but not the mask.&lt;/p&gt;

&lt;h1 id=&quot;bit-broadphase-overlap-test&quot;&gt;Bit broadphase: overlap test&lt;/h1&gt;

&lt;p&gt;Now that we’ve generated the broadphase data (this is done when the level loads and each chunk’s broadphase data is regenerated whenever any voxel in a chunk changes), we can look at how we use it.&lt;/p&gt;

&lt;p&gt;Whenever the object moves, we take the AABB of the object with the old and new transforms, query the broadphase for overlaps and perform contact management - if the object had a solid contact with a chunk in the old position but no solid contact with that chunk in the new position, we can remove that contact. We track up to 1 solid and 1 water contact per object-chunk pair, so a very large object can end up having multiple contacts with terrain (and each contact can generate multiple contact points as a result of the narrowphase work that we’ll discuss later).&lt;/p&gt;

&lt;p&gt;The query is two-step - first, we take the object’s AABB, expand it by 1 voxel to account for objects touching geometry that spilled into a neighboring voxel, project it to chunk space (by dividing the coordinates by 8 voxels), and convert the min/max to integers (using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;floor&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ceil&lt;/code&gt;), which gives us the range of chunks. Then we take each chunk and inspect the bit data to see if the object’s AABB touches any bits marked as solid/water. While the latter could be implemented naïvely by querying each voxel in the object’s AABB, we take advantage of the bit data as follows.&lt;/p&gt;

&lt;p&gt;Remember that we have 8^3 bits in a chunk; we store each Y-slice (Y points up) of this chunk in a 64-bit integer, with each successive group of 8 bits determining the data for an X-row. Each mask is thus a simple array, and we store one mask for solid and one mask for water:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;water&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we take the object’s AABB, project it into chunk space and determine its bounds in voxel space. Then we look at the object’s XZ extents as it intersects the chunk and generate a 64-bit mask that has 1s set where the object intersects voxels in the XZ plane:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_3.png&quot; alt=&quot;Using bitmasks&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Then we iterate through all Y-slices that the object covers, and do a simple bit test for each slice:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;touchesSolid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This lets us check an entire XZ slice of a chunk - up to 64 voxels! - for overlap with a single instruction (on 32-bit architectures the test takes ~3 instructions), which makes precise queries very efficient even for reasonably large objects. To reduce the overhead for small objects, we make sure to create the mask for the XZ extents as quickly as possible - while we could iterate through the XZ extents and set bits one by one, the mask we need is really an intersection of two masks, one representing a vertical strip and one a horizontal strip; for each direction we keep a lookup table and combine the lookup results to get the final mask:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;masksVer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;masksHor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_4.png&quot; alt=&quot;Bitmask lookup&quot; /&gt;&lt;/p&gt;
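&lt;p&gt;The lookup tables themselves are cheap to precompute. A sketch of how they could be built (names follow the snippet above; the exact engine code may differ): &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;masksVer[a][b]&lt;/code&gt; sets columns a..b in every row, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;masksHor[a][b]&lt;/code&gt; sets all bits of rows a..b:&lt;/p&gt;

```cpp
#include <cstdint>

// Sketch of building the two 8x8 lookup tables used above (assumed layout:
// bit index = z * 8 + x within a Y-slice).
uint64_t masksVer[8][8]; // [xmin][xmax]: vertical strip, columns xmin..xmax
uint64_t masksHor[8][8]; // [zmin][zmax]: horizontal strip, rows zmin..zmax

void buildMaskTables()
{
    for (int a = 0; a < 8; ++a)
        for (int b = a; b < 8; ++b)
        {
            uint8_t row = (uint8_t)((0xFF >> (7 - (b - a))) << a); // bits a..b set

            uint64_t ver = 0, hor = 0;
            for (int z = 0; z < 8; ++z)
                ver |= (uint64_t)row << (z * 8); // same columns in every row
            for (int z = a; z <= b; ++z)
                hor |= (uint64_t)0xFF << (z * 8); // full rows a..b

            masksVer[a][b] = ver;
            masksHor[a][b] = hor;
        }
}
```

The query mask is then just the AND of two table lookups, as in the snippet above.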

&lt;h1 id=&quot;narrowphase-analysis--plan&quot;&gt;Narrowphase: analysis &amp;amp; plan&lt;/h1&gt;

&lt;p&gt;Now that we have our broadphase data in good shape, let’s look at narrowphase. On a high level the way Bullet-based narrowphase worked was as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Each chunk stores an array of vertices (each vertex has 3 floats for position and 1 byte for material), an array of indices (each triangle has 3 16-bit indices) and a BVH (which is an AABB tree)&lt;/li&gt;
  &lt;li&gt;When creating the tree, the list of triangles would get repeatedly subdivided into nodes up until the leaf nodes have just one triangle&lt;/li&gt;
  &lt;li&gt;When performing collision detection, a set of triangles is extracted from the tree using a simple AABB query; each triangle gets collided with the target primitive using either a specialized algorithm for this pair (like triangle-sphere) or a generic GJK/EPA algorithm. The points resulting from these collisions are fed into a simplex structure that keeps up to 4 contact points and tries to maximize contact area&lt;/li&gt;
  &lt;li&gt;When performing raycasts, a simple ray-AABB tree query is run; each matching leaf node is intersected with the ray. The closest point is kept and returned as the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We decided to keep the collision detection algorithms and the overall structure, but replace all other components with versions that worked better for our use case. To keep the memory cost low, we started doing lazy generation of collision objects - we would only generate triangle/tree data when a contact was created or a raycast against the chunk was performed, and would keep a cache of the resulting objects so that the memory overhead remained manageable.&lt;/p&gt;

&lt;p&gt;This significantly improved narrowphase memory consumption, but the generation cost remained prohibitively expensive. Bullet has two versions of its BVH triangle mesh tree: a non-quantized one and a quantized one. Each keeps an AABB tree but stores it differently (with 32-bit floating point in one case, and 16-bit integers in the other). Since each triangle has a leaf node and the tree is binary, you need about twice as many nodes as you have triangles.&lt;/p&gt;

&lt;p&gt;The base memory cost for a raw mesh is ~13b per vertex (3 floating point coordinates and 1 byte material) and ~6b per triangle (for 3 16-bit indices), which adds up to ~12.5b/triangle since on average there are twice as many triangles as vertices.&lt;/p&gt;

&lt;p&gt;The non-quantized Bullet BVH takes 64 bytes per node which adds up to ~128b/triangle (the node structure only needs 44 bytes but it’s padded to 64 bytes), which is ~10x the memory cost of triangle data. The tree also takes ~3x longer to generate compared to generating the mesh (using our implementation that converts voxel data to triangle data).&lt;/p&gt;

&lt;p&gt;The quantized Bullet BVH takes 16 bytes per node, which adds up to ~32b/triangle - ~2.5x the memory cost of triangle data. It is slower to generate than the non-quantized version, taking ~5x longer than generating the mesh.&lt;/p&gt;

&lt;p&gt;Both of these options were very unsatisfactory in terms of both memory and time to generate the data (due to lazy generation the construction time is important), so we decided to replace the tree structure with a custom kD tree with two planes per node.&lt;/p&gt;

&lt;h1 id=&quot;loose-kd-tree-construction&quot;&gt;Loose kD tree: construction&lt;/h1&gt;

&lt;p&gt;A kD tree is a binary tree in which each node splits space along one axis into two parts. Usually a kD tree has only one splitting plane per node, but this makes dealing with triangles that aren’t contained completely within one of the children complicated (you have to cut them with the splitting plane), so we settled on a loose kD tree - each node has &lt;em&gt;two&lt;/em&gt; splitting planes along the same axis, where all left children are contained within a subspace defined by one of them, and all right children are contained within a subspace defined by the other one. The planes tightly fit the content of child nodes, resulting in two possible plane configurations:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_5.png&quot; alt=&quot;Loose kD tree&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While an AABB tree can generally localize space a bit better than a kD tree, a kD tree can be faster to construct, and you can recover most of the information during the recursive traversal as we’ll discuss later. The big benefit is that you only need to store 2 values per node instead of a full AABB. We also decided to store more than one triangle per leaf node - in general, storing 2 triangles instead of 1 does not significantly affect the query quality - and ended up with this structure:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;union&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KDNode&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;splits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 0=X, 1=Y, 2=Z&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childIndex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// children are at childIndex+0,1&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;branch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// up to two triangles per leaf&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// must be 3; same offset as branch.axis&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;triangleCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leaf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The tree structure is relatively straightforward; all nodes are stored in one large array so that we can refer to nodes by index. The 4th value of the 2-bit axis field (3) is used as a tag to distinguish leaves, and just one child index is stored since each branch node always has both children. We store up to 2 triangles in each leaf node; we easily have space for 4, but storing 4 triangles per leaf turned out slightly slower for collision processing, so we went with 2.&lt;/p&gt;

&lt;p&gt;Note that a node is 12 bytes, and since we store 2 triangles in each leaf node, we need about as many total nodes as we have triangles, so the memory cost ends up being 12b/triangle. There is definitely more room for optimization here as well - we could reduce the memory cost of vertex data by packing positions into 16-bit integers and optimize the kD tree node in a similar fashion, getting to ~6b per kD node and a total triangle impact of 3.5b vertex + 6b index + 6b tree = 15.5b - but the current result of ~24.5b/triangle was good enough to ship.&lt;/p&gt;
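&lt;p&gt;For illustration, such a ~6b node could look like this (a hypothetical sketch of the untried optimization, not shipped code - split positions would be quantized to 16-bit integers relative to the chunk bounds):&lt;/p&gt;

```cpp
#include <cstdint>

// Hypothetical packed node for the proposed ~6b/node layout (not shipped):
// split planes quantized to 16-bit integers relative to the chunk AABB, with
// the 2-bit axis/leaf tag and a 14-bit child or leaf index in the last word.
struct PackedNode
{
    int16_t splits[2];  // quantized split positions
    uint16_t axisChild; // 2 bits axis tag + 14 bits index (up to 16K nodes)
};

static_assert(sizeof(PackedNode) == 6, "expected tight packing");
```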

&lt;p&gt;The process of tree construction is relatively standard - we start with an array of triangles and recursively subdivide it, generating branch nodes in the process (and stopping the recursion once we reach 2 triangles per node or fewer). For each division we pick the axis by taking the longest axis of the current AABB, place the split point at the average of triangle midpoints along this axis, partition triangles to the left/right of that plane using their midpoints as the decision factor, and then recompute the left and right split planes using all 3 vertices of each triangle. If the resulting distribution ends up too skewed in terms of triangle counts (currently defined as &amp;lt;25% of triangles ending up in one of the two nodes), we repartition and put half of the triangles (out of the list sorted by midpoints) in one subtree and half in the other - this maintains a balance between spatial coherency and tree depth, limits the tree depth and makes sure the process terminates.&lt;/p&gt;
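&lt;p&gt;The steps above can be condensed into a sketch like this (simplified and illustrative - plain float-based structures and index-based recursion instead of the engine’s packed layout):&lt;/p&gt;

```cpp
#include <algorithm>
#include <vector>

// Simplified sketch of the build described above; Tri and Node are
// illustrative stand-ins for the engine's packed structures.
struct Tri { float v[3][3]; }; // 3 vertices x 3 coordinates

struct Node
{
    float splits[2]; // branch: [0] = max of left subtree, [1] = min of right
    int axis;        // 0..2 for branches, 3 tags a leaf
    int childIndex;  // branch: children at childIndex+0 and childIndex+1
    int tris[2], triCount; // leaf payload
};

static float midpoint(const Tri& t, int axis)
{
    return (t.v[0][axis] + t.v[1][axis] + t.v[2][axis]) / 3.f;
}

// Fills nodes[nodeIndex] from triangles order[begin..end)
static void buildInto(std::vector<Node>& nodes, int nodeIndex, const std::vector<Tri>& mesh,
                      std::vector<int>& order, int begin, int end)
{
    int count = end - begin;

    if (count <= 2)
    {
        Node& leaf = nodes[nodeIndex];
        leaf.axis = 3;
        leaf.triCount = count;
        for (int i = 0; i < count; ++i)
            leaf.tris[i] = order[begin + i];
        return;
    }

    // pick the longest axis of the AABB of triangle midpoints
    float mn[3] = {1e30f, 1e30f, 1e30f}, mx[3] = {-1e30f, -1e30f, -1e30f};
    for (int i = begin; i < end; ++i)
        for (int a = 0; a < 3; ++a)
        {
            float m = midpoint(mesh[order[i]], a);
            mn[a] = std::min(mn[a], m);
            mx[a] = std::max(mx[a], m);
        }

    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (mx[a] - mn[a] > mx[axis] - mn[axis])
            axis = a;

    // split at the average of triangle midpoints along that axis
    float split = 0;
    for (int i = begin; i < end; ++i)
        split += midpoint(mesh[order[i]], axis);
    split /= count;

    int* pivot = std::partition(order.data() + begin, order.data() + end,
        [&](int t) { return midpoint(mesh[t], axis) < split; });
    int leftCount = (int)(pivot - (order.data() + begin));

    // repartition into equal halves if <25% of triangles ended up on one side
    if (leftCount * 4 < count || (count - leftCount) * 4 < count)
    {
        leftCount = count / 2;
        std::nth_element(order.data() + begin, order.data() + begin + leftCount, order.data() + end,
            [&](int l, int r) { return midpoint(mesh[l], axis) < midpoint(mesh[r], axis); });
    }

    // recompute loose split planes from all 3 vertices of each side
    float leftMax = -1e30f, rightMin = 1e30f;
    for (int i = begin; i < begin + leftCount; ++i)
        for (int v = 0; v < 3; ++v)
            leftMax = std::max(leftMax, mesh[order[i]].v[v][axis]);
    for (int i = begin + leftCount; i < end; ++i)
        for (int v = 0; v < 3; ++v)
            rightMin = std::min(rightMin, mesh[order[i]].v[v][axis]);

    int childIndex = (int)nodes.size();
    nodes.resize(nodes.size() + 2);

    Node& branch = nodes[nodeIndex];
    branch.axis = axis;
    branch.splits[0] = leftMax;
    branch.splits[1] = rightMin;
    branch.childIndex = childIndex;

    buildInto(nodes, childIndex + 0, mesh, order, begin, begin + leftCount);
    buildInto(nodes, childIndex + 1, mesh, order, begin + leftCount, end);
}
```

Note that both children are allocated before recursing so that they occupy consecutive indices, which is what lets the real node store a single 30-bit child index.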

&lt;p&gt;The construction process ends up being a bit simpler and coded more efficiently than Bullet’s, and thus is ~3-4x faster than Bullet’s fastest tree construction algorithm. This results in mesh &amp;amp; tree data being roughly equivalent in size and in generation time, which is a good balance (or rather, this means that to make significant improvements in the overall process you need to significantly optimize both :D).&lt;/p&gt;

&lt;h1 id=&quot;loose-kd-tree-queries&quot;&gt;Loose kD tree: queries&lt;/h1&gt;

&lt;p&gt;As mentioned before, we only need two types of queries - an AABB query (where we gather all triangles contained within a given AABB, used for narrowphase) and a raycast query (where we gather all triangles intersecting a ray, or just the one with the closest intersection point). Both are implemented using stackless traversal - since the tree depth is bounded, it’s easy to precompute it and preallocate scratch space for a given traversal. Stackless traversal doesn’t necessarily save us much space or time, but it helps with understanding profiling results, since all traversal overhead for both branches and leaves is centralized in one function, making it somewhat easier to work with.&lt;/p&gt;

&lt;p&gt;Unlike an AABB tree, a kD tree doesn’t store the full extents of each node, but we can recover them during traversal. For an AABB query, when we encounter a branch node we only descend into the branches whose half-space contains the AABB; due to the hierarchical traversal this ends up visiting only the nodes whose volume overlaps the AABB:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;branch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;splits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabbMax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childIndex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// push right child&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;branch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;splits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabbMin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childIndex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// push left child&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This can be inefficient if the queried AABB is outside of the full kD tree bounds so we store an AABB for each kD tree for early rejection.&lt;/p&gt;
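&lt;p&gt;Putting the pieces together, the whole AABB gather could look roughly like this (a sketch using a simplified node struct in place of the union shown earlier; the engine sizes the scratch buffer from the precomputed tree depth):&lt;/p&gt;

```cpp
#include <vector>

// Sketch of the stackless AABB query; Node mirrors the layout discussed
// above (axis == 3 tags a leaf), with illustrative field names.
struct Node
{
    float splits[2]; // branch: [0] = max of left subtree, [1] = min of right
    int axis;        // 0..2 = split axis, 3 = leaf
    int childIndex;  // branch: children at childIndex+0,1
    int tris[2], triCount; // leaf payload
};

void queryAABB(const std::vector<Node>& nodes, const float aabbMin[3], const float aabbMax[3],
               std::vector<int>& results)
{
    int buffer[64]; // grows by at most 1 entry per level, so depth-bounded
    int offset = 0;
    buffer[offset++] = 0; // start at the root

    while (offset > 0)
    {
        const Node& node = nodes[buffer[--offset]];

        if (node.axis == 3)
        {
            // leaf: collect triangles
            for (int i = 0; i < node.triCount; ++i)
                results.push_back(node.tris[i]);
            continue;
        }

        // descend only into the halves that the AABB overlaps
        if (node.splits[1] <= aabbMax[node.axis])
            buffer[offset++] = node.childIndex + 1; // right child
        if (node.splits[0] >= aabbMin[node.axis])
            buffer[offset++] = node.childIndex + 0; // left child
    }
}
```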

&lt;p&gt;For ray queries, we choose to instead do a segment-tree traversal, where the segment is defined by the ray origin/direction and two limits for the parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tmin&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tmax&lt;/code&gt;, that contain all points within the subspace defined by each node. When we encounter a branch, we need to intersect the split planes with the ray (which is simple &amp;amp; fast since planes are axis-aligned), and adjust the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; limits for subsequent traversal:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sa&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;raySource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;da&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rayDir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;branch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;splits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;da&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;branch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;splits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;da&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childIndex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmax&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;childIndex&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Similarly to AABB queries, if the ray doesn’t intersect the full kD tree bounds the traversal can be inefficient, so we compute the initial segment limits by intersecting the ray against the stored AABB.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxphysics_6.png&quot; alt=&quot;kD tree raycast&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Additionally, to accelerate ray queries that just need the first point, we arrange the traversal to first visit the branch that defines a subspace that occurs earlier along the ray’s direction (this affects &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i0&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i1&lt;/code&gt; indices in the snippet above); if we find an intersection point that is earlier along the ray than the minimum &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; of a given segment then we can terminate the traversal. Unfortunately, due to floating point precision issues we need to slightly expand the segment on each branch.&lt;/p&gt;

&lt;p&gt;With this we get efficient tree queries for both AABBs and raycasts; the AABB query performance is on par with the Bullet implementation, but the raycast query is faster, both because we do less work for each branch (only intersecting two planes with the ray), and because we can terminate the traversal early once a suitable intersection point is found.&lt;/p&gt;

&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;

&lt;p&gt;While the resulting algorithms we’ve built work pretty well, they can undoubtedly be improved even further. Memory consumption of narrowphase could be improved; additionally, we currently store triangles in the kD tree - Christer Ericson suggested that if you store quads as a first-class primitive, raycasts can be up to 2x more efficient, since the Möller-Trumbore algorithm can handle quads with minimal additional computation (our terrain is built using quads, and some of them are planar, so this could be viable).&lt;/p&gt;

&lt;p&gt;One area that we have yet to explore is the parts of narrowphase that we still use from Bullet. It’s possible that one can utilize better algorithms for doing triangle-convex collisions or for reducing the contact point manifold, and additionally we currently use a few hacks to deal with interior edge collisions whereas we could generate the data about whether each edge is exterior or interior and use this when generating collision points/normals.&lt;/p&gt;

&lt;p&gt;Finally, something that we explored during the prototyping phase but didn’t get into production is disjoint terrain regions - whenever voxel data was modified, our prototype performed a basic connectivity analysis and converted every disjoint connected voxel region into a freely moving object. Computing collisions for this in real time most likely involves approximating the shape with a convex hull, although other options might be possible, and we will probably explore this one day.&lt;/p&gt;
</description>
			<pubDate>Sat, 30 Dec 2017 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2017/12/30/voxel-terrain-physics/</link>
			<guid isPermaLink="true">https://zeux.io/2017/12/30/voxel-terrain-physics/</guid>
		</item>
		
		<item>
			<title>Optimal grid rendering isn&apos;t optimal</title>
			<description>&lt;p&gt;I have been working a lot on vertex cache optimization lately, exploring several algorithms from multiple axes - optimization performance, optimization efficiency, corner cases and the like. While doing so, I’ve implemented a program to verify that the algorithms actually produce results beneficial for real hardware - and today we will discuss one such algorithm, namely &lt;a href=&quot;http://www.ludicon.com/castano/blog/2009/02/optimal-grid-rendering/&quot;&gt;“Optimal Grid Rendering”&lt;/a&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;vertex-cache&quot;&gt;Vertex cache&lt;/h2&gt;

&lt;p&gt;First, let’s very briefly cover what the vertex cache is. It has many names - post-T&amp;amp;L cache (from the days of fixed-function hardware), post-transform cache, parameter cache and probably a few others I don’t know about. What it amounts to is that the GPU caches the result of a vertex shader invocation - namely, all the output attributes - in a cache that’s keyed by the vertex index and is usually small (tens of vertices). If the vertex index is not found in the cache, the GPU has to run the vertex shader and store the results in the cache for future use.&lt;/p&gt;

&lt;p&gt;The details of cache behavior are, to my knowledge, not documented by GPU vendors, and can vary in terms of size (does the cache store a fixed number of vertices? a fixed number of vector vertex shader outputs? a fixed number of scalar vertex shader outputs?), replacement policy (FIFO? LRU?) and some other aspects (such as the interaction between vertex reuse and warp packing). However, ultimately the efficiency of the cache depends on the order of vertex indices in the input index buffer.&lt;/p&gt;

&lt;p&gt;There are several algorithms that optimize meshes for vertex cache. Some of them model a cache with a specific size and replacement policy, others use heuristics to produce cache oblivious meshes. Generally these algorithms operate on generic triangle meshes, but today we’ll look at an algorithm that works only on uniform grids.&lt;/p&gt;

&lt;h2 id=&quot;optimal-grid-rendering&quot;&gt;Optimal Grid Rendering&lt;/h2&gt;

&lt;p&gt;Ignacio Castaño wrote a nice blog post about a technique that allows one to achieve perfect vertex cache hit ratio on a fixed size FIFO cache, called &lt;a href=&quot;http://www.ludicon.com/castano/blog/2009/02/optimal-grid-rendering/&quot;&gt;Optimal Grid Rendering&lt;/a&gt;. This technique is not new; it’s hard for me to date it precisely - I personally learned about it circa 2006, but I’m pretty sure it comes from the days of triangle strips and hardware T&amp;amp;L. The key parts of the algorithm are as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Rendering the grid as multiple vertical strips, with each strip width being just under the cache size;&lt;/li&gt;
  &lt;li&gt;Prefetching the first row of each strip using degenerate triangles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This algorithm is designed to have each vertex leave the cache at exactly the right moment when the vertex will not be needed again. Picking the right cache size is crucial - if you produce the optimal grid for a cache that’s slightly larger than the one target hardware uses, you get an index sequence that transforms each vertex twice (the same thing happens if your strips are correctly sized but you remove the degenerate triangles from the output).&lt;/p&gt;

&lt;p&gt;Given a cache that is exactly the right size and uses sequential FIFO replacement policy (vertex indices are added to FIFO cache one at a time), the resulting sequence is perfect. The question is, does the actual hardware performance match this model?&lt;/p&gt;
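&lt;p&gt;The sequential FIFO model is easy to state in code - a hypothetical simulator like the one below is what we’ll effectively be comparing the hardware against:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Minimal model of a sequential FIFO vertex cache: indices enter the cache
// one at a time; a miss counts as a vertex shader invocation and, once the
// cache is full, evicts the oldest entry. This is the model the algorithm
// assumes, not necessarily what any hardware does.
size_t simulateFifo(const std::vector<unsigned>& indices, size_t cacheSize)
{
    std::deque<unsigned> cache;
    size_t invocations = 0;

    for (unsigned index : indices)
    {
        if (std::find(cache.begin(), cache.end(), index) == cache.end())
        {
            ++invocations;
            cache.push_back(index);
            if (cache.size() > cacheSize)
                cache.pop_front();
        }
    }

    return invocations;
}
```

With the right strip width for a given cache size, every vertex of the grid misses exactly once in this model.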

&lt;h2 id=&quot;measuring-vertex-shader-invocations&quot;&gt;Measuring vertex shader invocations&lt;/h2&gt;

&lt;p&gt;There are several ways to investigate the vertex reuse behavior on a given GPU for a given index sequence:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Measure time it takes to render a mesh with the index buffer;&lt;/li&gt;
  &lt;li&gt;Measure the number of vertex shader invocations directly using GPU performance counters;&lt;/li&gt;
  &lt;li&gt;Measure the number of vertex shader invocations indirectly by using atomic increment in vertex shader.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately we care about the time it takes to render, but it’s hard to measure time in a stable way, especially on a GPU. We could set up a contrived workload with an extremely heavy vertex shader to make results more accurate, but we’ll do something else.&lt;/p&gt;

&lt;p&gt;Both other methods can be faulty as well - GPU performance counters don’t necessarily have to return sensible information, and the driver could change the execution flow based on whether the vertex shader has memory writes (for example, by disabling vertex reuse…). However, for the sake of this analysis we will use the performance counters, which can be read using D3D11_QUERY_PIPELINE_STATISTICS.&lt;/p&gt;

&lt;p&gt;So our testing method is as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Render the mesh using the supplied index buffer&lt;/li&gt;
  &lt;li&gt;Use pipeline statistics query to get the number of vertex shader invocations&lt;/li&gt;
  &lt;li&gt;Confirm the pipeline statistics data by comparing the number of triangles with the expected baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;algorithm-evaluation&quot;&gt;Algorithm evaluation&lt;/h2&gt;

&lt;p&gt;We have several algorithms we will evaluate:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optimal - Optimal Grid Rendering algorithm described in the aforementioned article, with striped data and row prefetching&lt;/li&gt;
  &lt;li&gt;Striped - render the grid as striped columns but do not render any degenerate triangles for prefetching&lt;/li&gt;
  &lt;li&gt;Tipsify - take the regular uniform grid and optimize it for vertex cache using &lt;a href=&quot;http://gfx.cs.princeton.edu/pubs/Sander_2007_%3ETR/tipsy.pdf&quot;&gt;Tipsify algorithm&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;TomF - take the regular uniform grid and optimize it for vertex cache using &lt;a href=&quot;https://tomforsyth1000.github.io/papers/fast_vert_cache_opt.html&quot;&gt;Tom Forsyth’s algorithm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For evaluation we will compare ATVR - average transformed vertex ratio, or the ratio of vertex shader invocations to the total number of vertices; the ideal number is 1. Based on the Optimal Grid article, we would expect:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optimal to reach the optimum when the target cache size is set to the hardware cache size;&lt;/li&gt;
  &lt;li&gt;Striped to only be effective when two rows of a strip fit into the cache, deteriorating to ATVR=2 for larger strips;&lt;/li&gt;
  &lt;li&gt;Tipsify to produce results that depend on the target cache size and are somewhat inferior to Optimal at the hardware cache size;&lt;/li&gt;
  &lt;li&gt;TomF to give the same results regardless of the target cache size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each GPU we test, we will look at a graph of ATVR (lower is better) for all 4 methods based on the cache size, for a 100x100 quad grid. Traditionally the cache size is measured in vertices; while it’s possible that the number of attributes the vertex shader outputs affects the effective cache size, on all three GPUs tested there is no observable difference between having the vertex shader output 1 float4 attribute or 10 - as such, all tests are done on a vertex shader that outputs 5 float4 attributes.&lt;/p&gt;

&lt;h2 id=&quot;results-nvidia-and-amd&quot;&gt;Results: NVidia and AMD&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The results here are not what we’d expect.&lt;/p&gt;

&lt;p&gt;For NVidia, the best method is Striped with cache size 6 (this results in strips that are 4 quads wide, or 5 vertices wide), with ATVR 1.53; Tipsify performs best at cache size 14 with ATVR 1.60; additionally, both striped and optimal reach ATVR &amp;gt;2 at cache size 11.&lt;/p&gt;

&lt;p&gt;AMD has similar results - the optimal method is Striped with cache size 8 (ATVR 1.21), Tipsify peaks at cache size 16 (ATVR 1.25).&lt;/p&gt;

&lt;p&gt;These results suggest that either the cache replacement policy is completely incompatible with Optimal Grid Rendering method, or that degenerate triangles are filtered out before the vertices get processed, or both.&lt;/p&gt;

&lt;h2 id=&quot;results-intel&quot;&gt;Results: Intel&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_3.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These results are more in line with what we thought might happen - optimal reaches ATVR=1, striped breaks down for cache size 66 (strip size 65 vertices), Tipsify is slightly worse than optimal but approaches it at cache size 128 (ATVR 1.007). This suggests that Intel has a FIFO cache for 128 vertices, which actually means that with optimal grid, we don’t even need to stripe the 100x100 grid - one row of the grid fits into the cache as is.&lt;/p&gt;

&lt;h2 id=&quot;hypothesis-degenerate-triangle-prefetch-doesnt-work&quot;&gt;Hypothesis: degenerate triangle prefetch doesn’t work&lt;/h2&gt;

&lt;p&gt;So while on Intel the algorithm clearly works, it doesn’t work on AMD or NVidia. The algorithm requires a FIFO cache and expects degenerate triangles to generate vertex shader invocations, so maybe degenerate triangles are skipped very early?&lt;/p&gt;

&lt;p&gt;We can test whether degenerate triangles are filtered out by counting the invocations for an index buffer where some vertices are only referenced by degenerate triangles, such as this one:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;0 1 1 2 3 4 5 5 5&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a GPU that feeds vertices in a degenerate triangle through the same vertex pipeline we’d expect 6 vertex shader invocations - and indeed this is what we get on all three GPUs. Thus this doesn’t seem like the problem here.&lt;/p&gt;

&lt;h2 id=&quot;hypothesis-cache-uses-lru-replacement-policy&quot;&gt;Hypothesis: cache uses LRU replacement policy&lt;/h2&gt;

&lt;p&gt;If AMD and NVidia do not use strict FIFO, what could they use? There are probably many algorithms one can use with many small tweaks, but one obvious alternative is LRU. It should be possible to learn how the cache works by inspecting the vertex shader invocations for a variety of index buffers, but let’s try something simpler - let’s model a FIFO cache with a fixed size and an LRU cache with a fixed size and see what results we get.&lt;/p&gt;
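&lt;p&gt;Both models are easy to simulate. Here is a hypothetical sketch of the LRU variant (the FIFO variant above differs only in not refreshing an entry’s position on a hit):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Minimal model of a fixed-size LRU vertex cache: a hit refreshes the
// index's recency, a miss counts as a vertex shader invocation and evicts
// the least recently used entry.
size_t simulateLru(const std::vector<unsigned>& indices, size_t cacheSize)
{
    std::list<unsigned> cache; // front = most recently used
    size_t invocations = 0;

    for (unsigned index : indices)
    {
        std::list<unsigned>::iterator it =
            std::find(cache.begin(), cache.end(), index);

        if (it != cache.end())
        {
            cache.erase(it); // hit: re-inserted at the front below
        }
        else
        {
            ++invocations;
            if (cache.size() == cacheSize)
                cache.pop_back(); // evict least recently used
        }

        cache.push_front(index);
    }

    return invocations;
}
```

Note that for the sequence 0 1 2 0 3 1 with a 3-entry cache, LRU transforms 5 vertices where FIFO would transform only 4 - the two policies diverge as soon as hits reorder the eviction queue.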

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_4.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With a fixed size FIFO cache of 128 vertices, we are getting the exact same result we got from Intel GPU - which means that our guess was probably right.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_5.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With a fixed size LRU cache of 16 vertices, we get results that closely resemble both NVidia and AMD in shape, although there are some deviations. The Tipsify curve matches the NVidia curve for that algorithm, but the AMD curve has a smaller minimum at a somewhat larger cache size (note that Tipsify simulates a FIFO cache as well, but it doesn’t take as much of a penalty when its output runs on an LRU cache of a similar size, so it makes sense that the sizes don’t quite match), as well as a slightly smaller optimal strip size.&lt;/p&gt;

&lt;h2 id=&quot;hypothesis-vertex-reuse-and-warps&quot;&gt;Hypothesis: vertex reuse and warps&lt;/h2&gt;

&lt;p&gt;If you think about the GPU as a massively parallel unit, and take into account the fact that you need to dispatch warps with 32 vertices to compute units (or wavefronts with 64 vertices on AMD), the concept of a fixed size cache with vertices inserted into it one by one stops making sense. In the fantastic blog series &lt;a href=&quot;https://fgiesen.wordpress.com/2011/07/03/a-trip-through-the-graphics-pipeline-2011-part-3/&quot;&gt;“A trip through the graphics pipeline”&lt;/a&gt;, Fabian Giesen suggests an alternative way the cache could work - in fact, it might be better to think of it as vertex reuse within a warp instead of a cache.&lt;/p&gt;

&lt;p&gt;Let’s say that primitive assembly submits triangles to the rasterizer in the form of a warp id and three indices into the warp. Then, to achieve vertex reuse within the entire warp, we just need to gather triangles for the rasterizer until we fill up an entire warp’s worth of vertices. At that point we have up to 32 vertices in a warp referenced by some number of triangles in the rasterizer; when the warp completes execution, we can kick off rasterization of all of them immediately.&lt;/p&gt;

&lt;p&gt;It seems that we’d also want to kick off rasterization work in limited batches; in the case of an index buffer with the repeating triangle 0 1 2 we’d have to buffer a lot of triangles referencing the same warp - so it’s likely that the number of triangles we can buffer is pretty small. By measuring the vertex shader invocations for the index buffer (0 1 2)+, it looked like on NVidia hardware specifically each subsequent batch of 32 triangles increases the vertex shader invocation count by 3, suggesting a buffer of 32 triangles.&lt;/p&gt;
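&lt;p&gt;This batching model can be sketched as follows; the vertex and triangle limits (32/32) are assumptions for illustration, not documented hardware parameters:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of warp-based vertex reuse: triangles are gathered into a batch
// until the next triangle would exceed the warp's vertex budget or the
// triangle buffer is full; every unique index in a batch costs one vertex
// shader invocation when the batch is flushed.
size_t simulateWarp(const std::vector<unsigned>& indices,
                    size_t warpSize, size_t maxTriangles)
{
    std::vector<unsigned> batch; // unique vertices in the current warp
    size_t triangles = 0;
    size_t invocations = 0;

    for (size_t i = 0; i + 2 < indices.size(); i += 3)
    {
        // count unique new vertices this triangle would add to the batch
        std::vector<unsigned> fresh;
        for (size_t k = 0; k < 3; ++k)
        {
            unsigned v = indices[i + k];
            if (std::find(batch.begin(), batch.end(), v) == batch.end() &&
                std::find(fresh.begin(), fresh.end(), v) == fresh.end())
                fresh.push_back(v);
        }

        if (batch.size() + fresh.size() > warpSize || triangles == maxTriangles)
        {
            invocations += batch.size(); // flush: shade the gathered warp
            batch.clear();
            triangles = 0;
        }

        for (size_t k = 0; k < 3; ++k)
        {
            unsigned v = indices[i + k];
            if (std::find(batch.begin(), batch.end(), v) == batch.end())
                batch.push_back(v);
        }

        ++triangles;
    }

    invocations += batch.size(); // flush the final partial warp
    return invocations;
}
```

Feeding this model 64 copies of triangle 0 1 2 yields 6 invocations - 3 per batch of 32 triangles, matching the NVidia measurement above.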

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_6.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While this model has some improvements over the LRU in terms of how well it matches the observed data (for example, it has the same sawtooth pattern for striped grids at small sizes), overall it’s actually worse than LRU - it does not degrade as quickly as real hardware when the grid is rendered using strips that are too wide. The data we have observed definitely suggests some smaller fixed size limit. Which brings us to the final question - what if comparing indices with the entire warp of data was too expensive, and instead we only compared against the last 16?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/optimalgrid_7.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At this point we are in the land of guesswork - it could be LRU-within-warp or it could be some other limiting factor that I’m not accounting for instead; however this graph does look quite similar to the graph we get on NVidia hardware. If NVidia engineers ever publish the details of their vertex cache I will be glad to learn how wrong I was throughout the entire post ;)&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Update: NVidia engineers didn’t, but in 2018 a paper &lt;a href=&quot;https://markussteinberger.net/papers/ShadingRate.pdf&quot;&gt;Revisiting The Vertex Cache: Understanding and Optimizing Vertex Processing on the modern GPU&lt;/a&gt; was released which builds a more precise model of vertex cache behavior using somewhat similar techniques, along with a special purpose optimization algorithm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We have gone through an interesting exercise of measuring the actual behavior of given index sequences on real hardware, and trying to model several possible approaches hardware could take to understand the behavior better. It’s pretty clear that Intel uses a 128-entry FIFO cache; as for AMD and NVidia, 16-entry LRU seems to approximate their behavior pretty well, although it’s doubtful that this is how the actual hardware works.&lt;/p&gt;

&lt;p&gt;At any rate, neither NVidia nor AMD actually performs well using the Optimal Grid Rendering algorithm. This is a crucial lesson for the design of vertex cache optimization algorithms - it’s important that the algorithm does not assume a precise cache model. Ideally the algorithm would generate a cache oblivious sequence - in this case TomF’s algorithm tries to do that, although you can see that it doesn’t get very close to what’s achievable on the hardware - but even if the algorithm assumes a certain cache size, it’s best to treat the cache replacement model as a heuristic instead of a hard set of rules. Tipsify works pretty well on both NVidia and AMD - despite the fact that the algorithm assumes a fixed-size FIFO cache, it only uses the cache size as a heuristic to select the fan sequence, so when the model doesn’t match, performance remains reasonable.&lt;/p&gt;

&lt;p&gt;You can get the source code used to generate data for the graphs in this post &lt;a href=&quot;https://gist.github.com/zeux/868ab739bca54207a0bdf24e713035c8&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
			<pubDate>Mon, 31 Jul 2017 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2017/07/31/optimal-grid-rendering-is-not-optimal/</link>
			<guid isPermaLink="true">https://zeux.io/2017/07/31/optimal-grid-rendering-is-not-optimal/</guid>
		</item>
		
		<item>
			<title>Voxel terrain: storage</title>
			<description>&lt;p&gt;It’s been about almost two years since we shipped the first version of &lt;a href=&quot;https://blog.roblox.com/2015/06/create-all-new-worlds-with-smooth-terrain/&quot;&gt;smooth voxel terrain&lt;/a&gt; at &lt;a href=&quot;https://www.roblox.com/&quot;&gt;Roblox&lt;/a&gt;, and with it being live for a while and seeing a lot of incremental improvements I wanted to write about the internals of the technology - this feature required implementing serialization, network replication, collision detection, ray casting, rendering and in-memory storage support and within each area some implementation details ended up being quite interesting. Today we’ll talk about voxel definition and storage.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxstorage_0.jpg&quot; alt=&quot;Prototype&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;whats-in-a-voxel&quot;&gt;What’s in a voxel?&lt;/h2&gt;

&lt;p&gt;Whenever you design a voxel system you have to answer a basic question - what defines a voxel? We wanted our terrain to implicitly define a mostly smooth surface with each point on the surface having a material value, so we could use some kind of contouring algorithm to generate polygons from the surface. Since it was important to have control over the surface in 3 dimensions (heightfield would not cut it), we needed to define the data that represents the surface. During a prototyping phase a few options became clear:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Material. In this model we’d store just a material (an enum value) per voxel, and rely on neighboring voxels to construct the surface.&lt;/li&gt;
  &lt;li&gt;Signed distance field. In this model we’d store the distance to the surface from the center of the voxel, with sign determining whether the center is outside or inside. This is a very well known model that’s frequently used to drive algorithms like Marching Cubes. We would also store the material per voxel.&lt;/li&gt;
  &lt;li&gt;Hermite data. In this model instead of storing data about voxels, you store data about the intersections between surface and grid edges. Assuming the surface only intersects each voxel edge at most once, you have to store a bit for every edge that defines whether surface intersects this edge, an intersection point (which is defined by a 0..1 scalar), surface normal at the intersection point and the material at the vertices.&lt;/li&gt;
  &lt;li&gt;Occupancy. In this model we’d store a material and the occupancy which defines the amount of matter around the voxel center - 0 meaning roughly that the voxel is completely empty and 1 meaning that the voxel is full.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxstorage_1.png&quot; alt=&quot;Voxel models&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We experimented with the material model and rejected it because it did not give us enough flexibility for tools - our voxels had to be pretty big for the solution to be practical for low end platforms, and thus we needed to be able to represent sub-voxel information (for example, to have a tool that can “erode” terrain on a scale much finer than a voxel size).&lt;/p&gt;

&lt;p&gt;Hermite data was very interesting - you could build pretty rich tooling based on boolean operations and allow fine control over the smoothness of the surface. However, what wasn’t clear was how we would present this to developers in the form of an API. We felt like a CSG-based API such as “subtract a sphere out of terrain” was not sufficient - how do you implement a smoothing tool if this is all you have? - and we wanted to give our developers an API that allows full control over the terrain data such that you can implement any tool with it (in fact, all our existing terrain manipulation tools use the public terrain API and are written in Lua). We didn’t feel like exposing raw Hermite data was an option due to its complexity, so we had to discard this approach.&lt;/p&gt;

&lt;p&gt;Signed distance field was more expressive than the occupancy model, but ultimately resulted in similar-looking content and had a more complicated mental model (how do I fill voxels that are far away from the surface? Does it even matter? How is the distance clamped? etc.), so we did not feel like it was worth the tradeoff.&lt;/p&gt;

&lt;p&gt;Thus we settled on the occupancy model, where every voxel would be defined by a material and occupancy pair. Our definition of occupancy is a bit peculiar because due to how our meshing algorithm works matter “overflows” from one voxel to another - occupancy 1 does not mean that the voxel looks like a cube, rather it just means that the voxel has as much matter as it can possibly contain. By discarding the Hermite model we lost an ability to precisely define sharp edges on the surface, which we decided to reclaim by using the material value - in addition to the texture that is applied the material also defines how terrain is shaped around the voxel; thus what defines geometry is the material and occupancy in the voxel and in voxel’s neighborhood. A screenshot from an early prototype shows how different materials with the same occupancy data can produce very different results:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxstorage_2.jpg&quot; alt=&quot;Material based shaping&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;memory-format&quot;&gt;Memory format&lt;/h2&gt;

&lt;p&gt;To represent a voxel, you need a material (which is an enum) and occupancy (which is a floating-point value in 0..1 range). Material storage is straightforward - we just store an index into the material table, which currently has 23 entries. To conserve space, we store occupancy as a quantized 8-bit value, which adds up to two bytes per voxel.&lt;/p&gt;

&lt;p&gt;In a material-occupancy definition there is inherent ambiguity in terms of empty voxel representation - you can define it as occupancy = 0 but then the material choice does not matter; or you can define it as material = air but then the occupancy does not matter. We chose the model with an explicit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Air&lt;/code&gt; material, and we decode occupancy as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(value + 1) / 256.0&lt;/code&gt; for non-air materials and 0.0 for air materials (which means that the occupancy is precisely represented in a floating point number which is not true for decoding methods like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;value / 255.0&lt;/code&gt;). Voxels with material = &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Air&lt;/code&gt; and occupancy != 0 are not valid; we make sure they never appear in the grid with simple branchless code:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// make sure occupancy is always 0 for Air material (0)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;material&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;material&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;occupancy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;occupancy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;material&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
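&lt;p&gt;For completeness, the corresponding decode could look like the hypothetical helper below:&lt;/p&gt;

```cpp
// Hypothetical decode helper matching the scheme above: Air (material 0)
// always decodes to occupancy 0.0, while non-air occupancy decodes as
// (value + 1) / 256.0 so that every quantized value is represented exactly
// in floating point.
inline float decodeOccupancy(unsigned char material, unsigned char value)
{
    return material == 0 ? 0.0f : (value + 1) / 256.0f;
}
```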

&lt;p&gt;When we originally developed the system, we decided to settle on a simple sparse storage format: all voxels would be stored in chunks 32x32x32 with each chunk represented as a linear 3D array (since each voxel needs two bytes for storage, this adds up to 64 KB per chunk; we also experimented with 16x16x16 chunks but the difference wasn’t very noticeable), and the entire voxel grid is simply a hash map, mapping the index of the chunk to the chunk contents itself:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;HashMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Vector3int32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Chunk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chunks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
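&lt;p&gt;With power-of-two chunk sizes, the two-level lookup is just bit arithmetic. A hypothetical sketch for one axis of a 32^3 chunk grid:&lt;/p&gt;

```cpp
#include <cstdint>

// Split a (possibly negative) voxel coordinate into a chunk index and a
// local 0..31 offset within the chunk. Arithmetic right shift acts as floor
// division, which keeps the mapping consistent across zero (this relies on
// the usual two's complement shift behavior of mainstream compilers).
struct SplitCoord
{
    int32_t chunk;   // chunk index along one axis (hash map key component)
    uint32_t offset; // 0..31 offset into the chunk's 3D array
};

inline SplitCoord splitCoordinate(int32_t v)
{
    SplitCoord result = {v >> 5, static_cast<uint32_t>(v & 31)};
    return result;
}
```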

&lt;p&gt;It’s worth noting that we needed a sparse unbounded storage because our worlds do not have set size limits and you can use voxel content at any point; of course, voxel access in a local region needed to be very fast. A simple two-level storage scheme (hash map + 3D array) turned out to work really well for this - it’s simple to implement and reason about, and provides an easy way to trade “sparseness” for increased locality by adjusting chunk size. We ended up using this way of storing voxel data throughout the entire engine, with individual systems adjusting the chunk size to match the expected workload better (storage uses 32^3 chunks, network uses 16^3 chunks for initial replication and 4^3 chunks for updates, physics uses 8^3 chunks, rendering first used 32x16x32 chunks and later switched to chunks that increase in size from 16^3 up to 256^3 voxels).&lt;/p&gt;

&lt;p&gt;Inside the chunk we ended up storing several 3D arrays - we call a 3D array of voxels “box” - of different sizes, forming a mipmap pyramid for our voxel content. We have 32^3 Box for the original content, and 16^3, 8^3 and 4^3 mipmaps that contain downsampled voxel data. The extra storage cost of mipmaps ends up being negligible, and the small penalty that we take for updating the mip levels when writing voxels to the voxel grid is well worth it considering that we are using them for Level Of Detail which allows us to render a lot fewer triangles in the distance, as well as for some other optimizations. The downsample algorithm that we use guarantees that if a voxel at a certain level of mip pyramid is filled with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Air&lt;/code&gt;, all voxels corresponding to this area in the upper levels of the pyramid are also filled with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Air&lt;/code&gt; - for example, if the bottom mip level has all 64 voxels empty then the entire chunk is empty.&lt;/p&gt;
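&lt;p&gt;A toy downsample step that preserves the Air guarantee might look like this; the material and occupancy selection below is a simplification for illustration - only the Air propagation rule is taken from the text:&lt;/p&gt;

```cpp
#include <cstdint>

struct Voxel
{
    uint8_t material;  // 0 = Air
    uint8_t occupancy; // quantized occupancy
};

// Collapse a 2x2x2 block of voxels into one mip voxel. The invariant that
// matters: the result is Air exactly when all eight source voxels are Air,
// so an empty mip voxel proves the whole source region is empty. Picking
// the first non-air material and averaging occupancy is a stand-in for the
// actual selection logic.
Voxel downsample(const Voxel (&src)[8])
{
    uint32_t occupancySum = 0;
    uint8_t material = 0;

    for (const Voxel& v : src)
    {
        occupancySum += v.occupancy;
        if (material == 0)
            material = v.material;
    }

    Voxel result;
    result.material = material;
    result.occupancy = material == 0 ? 0 : static_cast<uint8_t>(occupancySum / 8);
    return result;
}
```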

&lt;h2 id=&quot;disk-format&quot;&gt;Disk format&lt;/h2&gt;

&lt;p&gt;When it comes to storing voxels on disk, we wanted a more compact representation than a 3D array. There are a lot of ways to store voxels - we needed an encoding scheme that stored voxels as bytes (without any cross-byte bit packing to avoid efficiency loss of a byte-level lossless compressor - we usually take the encoded voxels and compress them further with LZ4), was reasonably efficient on the types of content we see frequently, and was simple - we could always upgrade the format later if we needed to spend more time doing it.&lt;/p&gt;

&lt;p&gt;Our previous experiments in voxel technology suggested that taking each chunk and RLE compressing it, efficiently representing runs of voxels with the same contents, results in pretty good compression ratios and pretty efficient compression code. Remember, we don’t need to be optimal - we need to compress data to reduce its size for storage efficiency, and we frequently have an LZ codec running on top of that. Running RLE before LZ may seem counter-intuitive, but it significantly reduces the size of data, making LZ faster, and in some cases means you don’t even need LZ compression because RLE on its own is enough - voxel data is frequently very regular.&lt;/p&gt;

&lt;p&gt;An additional quirk is that each voxel in our system occupies two bytes; we wanted to have a more compact baseline representation so that even if RLE can’t find runs of sufficient length we still end up with a smaller file. A key realization is that many voxels are completely solid - if you imagine a mountain built from voxels, usually all interior voxels in the mountain only carry material information (that is still meaningful in case you want to dig a hole through the mountain later) - occupancy is usually 1 for these cells. Similarly, for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Air&lt;/code&gt; we &lt;em&gt;never&lt;/em&gt; need to store occupancy - it’s always zero! With this and RLE in mind, here’s the encoding we came up with:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/voxstorage_3.png&quot; alt=&quot;Byte RLE encoding&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Each run of voxels starts with a byte that encodes the material (6 bits), whether we need to store the occupancy byte (1 bit) and whether we have a run length of more than one (1 bit). After this byte we store the occupancy value in a single byte (if the lead byte told us we needed it), followed by the run length in a single byte (if the lead byte told us we needed it). Since we encode single-voxel runs without a run length, we store the run length minus one, so the longest run we can store is 256 voxels - we need to break up longer runs, which is not a problem because a run takes just two bytes as long as all of its voxels are solid (or empty, in the case of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Air&lt;/code&gt;).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note that we only store count if we have more than one voxel in a run, so we could store &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count-2&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count-1&lt;/code&gt;; storing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count-1&lt;/code&gt; means that largest run is 256 voxels (which is a nice round number) and means we can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; later to extend the run format if necessary.&lt;/p&gt;
&lt;/blockquote&gt;
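&lt;p&gt;As a concrete illustration, a round-trip of this run format can be sketched as follows. The bit positions inside the lead byte are an assumption (the post doesn’t specify them), as are the helper names:&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One run of identical voxels. Assumed lead byte layout: material in the low
// 6 bits, bit 6 set when an occupancy byte follows, bit 7 set when a
// run-length byte storing count-1 follows.
struct Run
{
    uint8_t material;  // 0..63, 0 = Air
    uint8_t occupancy; // quantized occupancy
    uint16_t count;    // 1..256
};

// default occupancy that lets common runs omit the occupancy byte:
// fully solid for regular materials, zero for Air
inline uint8_t defaultOccupancy(uint8_t material)
{
    return material == 0 ? 0 : 255;
}

void encodeRun(std::vector<uint8_t>& out, const Run& run)
{
    bool hasOccupancy = run.occupancy != defaultOccupancy(run.material);
    bool hasCount = run.count > 1;

    out.push_back(run.material | (hasOccupancy ? 0x40 : 0) | (hasCount ? 0x80 : 0));
    if (hasOccupancy)
        out.push_back(run.occupancy);
    if (hasCount)
        out.push_back(static_cast<uint8_t>(run.count - 1));
}

Run decodeRun(const std::vector<uint8_t>& in, size_t& offset)
{
    uint8_t lead = in[offset++];

    Run run;
    run.material = lead & 0x3f;
    run.occupancy = (lead & 0x40) ? in[offset++] : defaultOccupancy(run.material);
    run.count = (lead & 0x80) ? in[offset++] + 1 : 1;

    return run;
}
```

A fully solid 256-voxel run encodes to two bytes, which is where the 256-bytes-per-uniform-chunk figure below comes from.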

&lt;p&gt;This encoding results in compression ratios close to 2x even without RLE, and with RLE it’s anywhere between 2x and 30x for real-world content; the maximum efficiency is reached for chunks that use just one material and all voxels are solid or empty; we can encode a chunk like this in 128 2-byte runs, which brings us from 64 KB per chunk to just 256 bytes (and a few bytes of metadata such as chunk index).&lt;/p&gt;

&lt;p&gt;It’s definitely possible to improve on this encoding but this ended up working well for us. In addition to it serving as a file storage format, we also use it to compress undo history - whenever you perform an operation on the region of voxels, our undo system extracts the box of voxels from the grid before the operation, packs it using this encoding (the compression is extremely fast) and saves the result in case the user needs to undo the operation. We also use a variant of this with more aggressive bit packing and some other tweaks for network replication.&lt;/p&gt;

&lt;h2 id=&quot;memory-format-take-two&quot;&gt;Memory format, take two&lt;/h2&gt;

&lt;p&gt;After we shipped the first version of our voxel system and started seeing a lot of adoption among developers, we started working on different ways to optimize the system in terms of both memory and performance. This was when we implemented rendering LOD which uses the mipmap levels and substantially reduces both time to render terrain and also memory required to store terrain geometry. After this optimization the bulk of memory cost of new terrain became the voxel data - storing two bytes per voxel just wasn’t going to cut it for low-end mobile devices.&lt;/p&gt;

&lt;p&gt;At this point we had a lot of code written to assume that you can read a box of voxels from a terrain region and be able to iterate over the box very efficiently; some areas of the code went beyond reading a cell from a box one by one (which required two integer multiplications to compute the offset in a linear array) and used fast row access via a function like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readRow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We wanted to store voxel data more efficiently but we did not want to compromise on access performance - we were willing to make voxel writes a bit slower, but voxel reads - which are required by every subsystem that works with voxel data - had to stay fast, and for some inner loops it was important to be able to quickly iterate through a row of voxels. In addition to the amount of effort needed to extract a voxel from the box, memory locality was also crucial - we considered using some complicated tree structure but ultimately decided against it.&lt;/p&gt;

&lt;p&gt;One obvious solution was to use our RLE scheme - keep chunks compressed in memory (either using pure RLE encoding or LZ4 on top of it), decompress on demand and keep a small cache of uncompressed chunks that are accessed frequently. This was a reasonable option but required compromising on read performance whenever a chunk needed decompression - both LZ4 and our RLE decompressor are pretty fast, but they are slower than just reading the memory, even after taking bandwidth savings into account. What we really wanted was a solution that reduced chunk memory overhead but retained read performance.&lt;/p&gt;
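
&lt;p&gt;To make the caching idea concrete, here is a minimal sketch of such a decompressed-chunk cache. This is illustrative Python, not our engine code; the decompressor is passed in as a stand-in for the RLE/LZ4 path, and all names are made up:&lt;/p&gt;

```python
from collections import OrderedDict

# Minimal LRU cache of uncompressed chunks; decompress_chunk stands in
# for the RLE or LZ4 decompressor and is a hypothetical callable.
class ChunkCache:
    def __init__(self, capacity, decompress_chunk):
        self.capacity = capacity
        self.decompress = decompress_chunk
        self.chunks = OrderedDict()  # chunk key -> uncompressed voxel data

    def read(self, key, compressed):
        if key in self.chunks:
            self.chunks.move_to_end(key)  # mark as most recently used
            return self.chunks[key]
        data = self.decompress(compressed)
        self.chunks[key] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict least recently used
        return data
```

&lt;p&gt;Note the tradeoff mentioned above: a cache miss still pays the full decompression cost, which is why we kept looking for a representation that avoids decompression entirely.&lt;/p&gt;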

&lt;p&gt;One of the issues was the chunk size. For some content 32^3 chunks weren’t imposing too much overhead, but for simple terrain that had almost no variation in height but was very large in the XZ dimensions, the requirement to store 32 voxels along the Y axis for every non-empty chunk increased the memory overhead substantially. Terrains of such large sizes were previously impractical because of the lack of rendering LOD, but now rendering could handle them just fine if not for the voxel memory overhead. Reducing chunk sizes too much made read operations slower because read requests for large regions had to look up many chunks in the hash, and also increased the overhead of chunk metadata we had to store (for an 8^3 chunk the size of raw voxel data is just 1 KB, so even a few pointers can add up to several percent of memory).&lt;/p&gt;

&lt;p&gt;This is where we decided to use the ideas from our RLE packing to make a new in-memory compressed box representation. What worked well in RLE was packing each voxel into just one byte in common cases and relying on long contiguous runs of the same voxel data. If we now assume that these runs occupy full rows of data, we can store chunks as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Each row (32^3 chunk has 32^2 rows) is either allocated or not.&lt;/li&gt;
  &lt;li&gt;For unallocated rows, we store one byte - which represents the material value - and assume that all cells in this row are filled with this material and have “default” occupancy (1 for solid materials and 0 for air).&lt;/li&gt;
  &lt;li&gt;For allocated rows, we store an offset into cell data, which is a linear array that contains data for all allocated rows in an uncompressed fashion.&lt;/li&gt;
&lt;/ul&gt;
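
&lt;p&gt;The scheme above can be modeled in a few lines. This is an illustrative sketch (in Python, with made-up names and bit assignments - our actual implementation is C++ and stores 16-bit headers), showing both the packing and the row lookup:&lt;/p&gt;

```python
# Toy model of the row-packed layout for a size^3 chunk; cells are
# (material, occupancy) pairs. Names and constants are illustrative.
ROW_ALLOCATED = 0x8000  # "allocated" tag bit of the 16-bit row header

def default_occupancy(material):
    return 0 if material == 0 else 1  # material 0 stands for air here

def pack_chunk(cells, size):
    """cells[y][z] is a row of `size` cells; returns (rows, rowdata)."""
    rows, rowdata = [], []
    for y in range(size):
        for z in range(size):
            row = cells[y][z]
            m = row[0][0]
            if all(c == (m, default_occupancy(m)) for c in row):
                rows.append(m)  # unallocated: header stores the material
            else:
                rows.append(ROW_ALLOCATED + len(rowdata) // size)
                rowdata.extend(row)  # allocated: header stores row offset
    return rows, rowdata

def read_row(rows, rowdata, material_rows, y, z, size):
    row = rows[y * size + z]
    if row >= ROW_ALLOCATED:  # allocated bit set
        offset = (row - ROW_ALLOCATED) * size
        return rowdata[offset : offset + size]
    return material_rows[row]  # prefilled per-material row (like gCellRows)
```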

&lt;p&gt;&lt;img src=&quot;/images/voxstorage_4.png&quot; alt=&quot;Row packed storage&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can see how for a typical chunk (the diagram above shows a 4^3 chunk for simplicity), most rows will not be allocated - in this case 10 are empty and 4 are filled with grass with occupancy 1 - and the remaining rows, 2 in this case, need extra storage to specify material and occupancy for every cell.&lt;/p&gt;

&lt;p&gt;In a 32^3 chunk we’d need up to 1024 allocated rows, which meant the row offset did not fit into one byte (it also doesn’t fit into one byte for 16^3 chunks, because you need to dedicate some bits to the row state). There are ways to work around the problem for 16^3 chunks, but we decided to just use two bytes per row for the row header, which contains the “allocated” bit and either the offset or the material value. To implement the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;readRow&lt;/code&gt;, we have a global (read-only) array that contains pre-filled arrays of voxels for every material up to a certain chunk size, so we could implement the function like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readRow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint16_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kRowTagAllocated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rowdata&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kRowTagMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gCellRows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kRowTagMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that this is constant-time and also very fast - the voxel read performance did not suffer as a result of this change (and in fact improved, since we need to read less memory).&lt;/p&gt;

&lt;p&gt;Writing becomes more complicated because we sometimes need to write cells into unallocated rows, which requires reallocating the cell content. We split the write workloads into two stages: the first stage marks the rows that will be written to, then we reallocate the chunk if necessary, and then we actually perform the writes of new cell data. When we mark the rows we analyze the content we’re going to write there to keep the rows packed (unallocated) if possible. This resulted in code that was slower and more complicated, but voxel writes are relatively rare. There are some artificial cases in our scheme where repeatedly doing tiny writes can repack the chunk too many times - this is pretty simple to work around though, because you can always pessimistically unpack the entire chunk at any point by allocating all rows.&lt;/p&gt;
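
&lt;p&gt;A simplified sketch of this staged write, with hypothetical names and with rows stored as tagged values instead of the real offset-based layout - the point here is only the staging, not the data structure:&lt;/p&gt;

```python
# Each chunk row is ("packed", material) or ("alloc", cells); a cell is a
# (material, occupancy) pair. All names here are illustrative.
def can_stay_packed(cells):
    m = cells[0][0]
    default_occ = 0 if m == 0 else 1  # air is material 0 in this sketch
    return all(c == (m, default_occ) for c in cells)

def write_rows(chunk, writes):
    # Stage 1: analyze the incoming data and decide, per row, whether it
    # can stay packed (uniform material, default occupancy) or not.
    plan = {}
    for key, cells in writes.items():
        if can_stay_packed(cells):
            plan[key] = ("packed", cells[0][0])
        else:
            plan[key] = ("alloc", list(cells))
    # Stage 2: a real implementation reallocates the chunk's linear cell
    # array once here, sized for every row that stage 1 marked as allocated.
    # Stage 3: perform the actual writes.
    chunk.update(plan)
    return chunk
```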

&lt;p&gt;The worst-case impact of this change is that we spend 2 bytes for every row in addition to whatever we had before, which adds 2 KB to a 64 KB chunk; most chunks have at least several rows compressed, and in some levels the impact is quite dramatic - for an extreme example, a layer of voxels that is just one voxel thick in the Y dimension will have all rows unallocated, which turns 64 KB chunks into 2 KB, reducing the memory impact 32x. Of course the memory savings vary depending on the content - one nice side effect of this change is that the chunk size affects memory much less, because the packed representation does not spend extra memory on empty rows.&lt;/p&gt;
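
&lt;p&gt;The arithmetic behind these numbers is easy to verify:&lt;/p&gt;

```python
# Worked numbers for a 32^3 chunk with 2-byte cells and 2-byte row headers.
size = 32
rows_per_chunk = size * size        # 1024 rows per chunk
header_bytes = rows_per_chunk * 2   # 2 KB of row headers - the worst-case overhead
raw_bytes = size ** 3 * 2           # 64 KB of raw cell data

assert header_bytes == 2 * 1024
assert raw_bytes == 64 * 1024

# A one-voxel-thick layer makes every row uniform, so nothing but the
# headers remains: a 32x reduction.
assert raw_bytes // header_bytes == 32
```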

&lt;h2 id=&quot;results-and-future-work&quot;&gt;Results and future work&lt;/h2&gt;

&lt;p&gt;To give you some idea of the resulting sizes, these numbers were captured on a very large procedurally generated level with many biomes using different materials, with mountains, cliffs, canyons and caves; this level has about 1 billion non-empty voxels:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Storage&lt;/th&gt;
      &lt;th&gt;Size&lt;/th&gt;
      &lt;th&gt;Bytes per voxel&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Disk (RLE)&lt;/td&gt;
      &lt;td&gt;73 MB&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disk (RLE+LZ4)&lt;/td&gt;
      &lt;td&gt;50 MB&lt;/td&gt;
      &lt;td&gt;0.05&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disk (RLE+zstd)&lt;/td&gt;
      &lt;td&gt;38 MB&lt;/td&gt;
      &lt;td&gt;0.04&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Network (bit-packed RLE)&lt;/td&gt;
      &lt;td&gt;59 MB&lt;/td&gt;
      &lt;td&gt;0.06&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Memory #1 (unpacked)&lt;/td&gt;
      &lt;td&gt;2973 MB&lt;/td&gt;
      &lt;td&gt;2.97&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Memory #2 (row packed)&lt;/td&gt;
      &lt;td&gt;488 MB&lt;/td&gt;
      &lt;td&gt;0.49&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Clearly RLE is very efficient at compressing this data; we do get further improvements using LZ4 but they are not very substantial, and the more valuable option is to employ entropy coding. What also works very well is using RLE and packing the data a bit more tightly - this has the advantage that it compresses and decompresses data very quickly and requires no state, so it works well for network packets where we only need to send a small chunk, although using byte-packed RLE with a fast general compressor on top of it would also work well.&lt;/p&gt;
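
&lt;p&gt;For illustration, the core of a byte-oriented RLE codec fits in a few lines; the actual network format is bit-packed and more elaborate, so treat this purely as a sketch:&lt;/p&gt;

```python
from itertools import groupby

# Byte-oriented RLE: each run becomes a (count, value) pair with the
# count capped at 255 so it fits in one byte.
def rle_encode(values):
    out = []
    for v, run in groupby(values):
        n = len(list(run))
        while n > 255:  # split runs longer than a byte's range
            out.append((255, v))
            n -= 255
        out.append((n, v))
    return out

def rle_decode(pairs):
    out = []
    for n, v in pairs:
        out.extend([v] * n)
    return out
```

&lt;p&gt;Long runs of identical voxels - the common case in terrain - collapse to a couple of bytes, and both directions are a single linear pass with no state carried between chunks, which is what makes this shape attractive for network packets.&lt;/p&gt;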

&lt;p&gt;Also note that, while with a naive chunked storage format we need more than 2 bytes per voxel to store data in memory with fast access (2 bytes per voxel would be optimal, but this is where the chunk size comes into play - the bigger the chunks are, the more memory you waste on empty voxels), our new row-packed format reduces that by a factor of 6, which gives us 0.5 bytes per voxel while preserving very fast random access and linear read performance.&lt;/p&gt;

&lt;p&gt;None of the methods presented above are very complicated, but that is pretty much the point - the simpler the code that we ship ends up being, the easier it is to reason about its performance and behavior, and the easier it is to rework later. Ultimately, while there are ways to compress the data better, and probably also ways to organize it for faster access, what we ended up with strikes a nice balance between performance, simplicity and compactness, and solved the issues we had - compared to the raw storage, storing undo data as RLE-compressed chunks and voxel data as row-packed chunks is significantly more efficient and does not have a noticeable performance impact on our workloads.&lt;/p&gt;

&lt;p&gt;We are pretty happy with the resulting solution; one thing to improve is that we currently send the baseline mip level over the network, which means that for big levels it takes a lot of bandwidth to send data that’s so far away that rendering will not use the baseline mip for it anyway. Since we already have the mip data it makes a lot of sense to send it instead - which we haven’t done yet but definitely will in the future. This mostly requires careful management of the data that the client has or doesn’t have, and requires limiting the physics simulation to the regions where we have full resolution data, because our physics code uses it to generate collision data - which will be the topic of a following post.&lt;/p&gt;
</description>
			<pubDate>Mon, 27 Mar 2017 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2017/03/27/voxel-terrain-storage/</link>
			<guid isPermaLink="true">https://zeux.io/2017/03/27/voxel-terrain-storage/</guid>
		</item>
		
		<item>
			<title>Metal retrospective</title>
			<description>&lt;p&gt;We have successfully shipped the Metal rendering backend to millions of users, and I want to write a bit about that. There are varying opinions on Metal in the industry - some claim Metal would not have been needed if only Apple had dedicated more attention to OpenGL and Vulkan, some say it’s the easiest graphics API that ever existed. Why even bother with Metal, some ask, if you can just write OpenGL or Vulkan code, and use MoltenGL or MoltenVK to the same effect? Here are my thoughts on the API.&lt;/p&gt;

&lt;h2 id=&quot;why-metal&quot;&gt;Why Metal?&lt;/h2&gt;

&lt;p&gt;When Apple announced Metal at WWDC in 2014, my initial reaction was to ignore it. It was only available on the newest hardware, which most of our users didn’t have, and while Apple claimed it solved CPU performance issues, optimizing for the smallest market segment would only widen the gap between the fastest devices and the slowest ones. At the time we were running only OpenGL ES 2 on Apple devices, and were also starting to port to Android.&lt;/p&gt;

&lt;p&gt;Fast forward two and a half years, here’s how the Metal market share looks for our users:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/metalr_diag1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is much more appealing than it used to be. It is still the case that implementing Metal does not help the oldest devices, but the GL market on iOS keeps shrinking, and the content that we run on the oldest devices is frequently different from the content that runs on the newest devices anyway, so it definitely makes sense to dedicate some effort to making the latter faster. Given that your iOS Metal code will run on Mac with very few changes, it could make sense to use it on Mac as well even if you are mobile-focused (we currently only ship Metal builds on iOS).&lt;/p&gt;

&lt;p&gt;I think it is worthwhile to analyze the market share in a bit more detail. On iOS, we support Metal for iOS 8.3+; while there are some users who can’t run Metal because of OS version restrictions, most of the 25% who still run GL are simply using older devices that have SGX hardware. They also don’t have any OpenGL ES 3 features, and we’re content with running a lower-end rendering path there (although we’d love all devices to go Metal - fortunately the GL/Metal split will only improve). On Mac, the Metal API is newer and the OS plays a pretty significant part - you have to use OS X 10.11+ to use Metal and half of our users simply have an older OS - it’s less about the hardware and more about the software (95% of our Mac users run OpenGL 3.2+).&lt;/p&gt;

&lt;p&gt;So given the market share, we still have options that do not involve porting to Metal. One of them is to just use &lt;a href=&quot;https://moltengl.com/moltengl/&quot;&gt;MoltenGL&lt;/a&gt;, which would use the OpenGL code we already have, but supposedly be faster; another is to port to Vulkan (to get better performance on PC, and eventually Android) and use &lt;a href=&quot;https://moltengl.com/moltenvk/&quot;&gt;MoltenVK&lt;/a&gt;. I have briefly evaluated MoltenGL and was not too thrilled with the results - it took some effort to make our code run at all, and while performance was a bit better compared to stock OpenGL I was hoping for more. As for MoltenVK, I think it is misguided to try to implement one low-level API as a layer above another one - you’re bound to get an impedance mismatch that results in suboptimal performance - maybe it will be better than the high-level API you used to use, but it’s unlikely to be as fast as possible, which supposedly is why you’re choosing a low-level API to begin with! One other important aspect is that a Metal implementation is much simpler than a Vulkan one - more on that later - so in some sense I’d prefer a Metal -&amp;gt; Vulkan wrapper instead of a Vulkan -&amp;gt; Metal one.&lt;/p&gt;

&lt;p&gt;It is also worth noting that apparently on iOS 10 on the newest iPhones there is no GL driver - GL is implemented on top of Metal. This means using OpenGL only really saves you a bit of development effort - not that much, considering that the “write once, run anywhere” promise of OpenGL does not really work out on mobile.&lt;/p&gt;

&lt;h2 id=&quot;porting&quot;&gt;Porting&lt;/h2&gt;

&lt;p&gt;I would say that overall porting to Metal was a breeze. We have a lot of experience working with different graphics APIs, ranging from high-level APIs like Direct3D 9/11 to low-level APIs like PS4 GNM. This gives us the unique advantage of being able to comfortably use an API like Metal, which is simultaneously reasonably high-level but also leaves some tasks, like CPU-GPU synchronization, to the app developer.&lt;/p&gt;

&lt;p&gt;The only real hurdle was getting our shaders to compile - once that was done and it was time to write the code, it became apparent that the API is so simple and self-explanatory that the code practically wrote itself. I got a port that rendered most things in a suboptimal fashion running in about 10 hours in a single day, and spent two more weeks cleaning up the code, fixing validation issues, profiling, optimizing and doing general polish. Getting an API implementation done in this time frame speaks volumes about the quality of the API and the toolset. I believe there are several aspects that contribute:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You can develop the code incrementally, with good feedback at every stage. Our code started by ignoring all CPU-GPU synchronization, being really suboptimal about certain parts of state setup, using built-in reference tracking for resources and never running CPU and GPU in parallel to avoid running into issues; the optimization/polish phase then converted this into something we could ship, never losing the ability to render in the process.&lt;/li&gt;
  &lt;li&gt;The tools are there for you, they work and they work well. This is not as much of a surprise for people who are used to Direct3D 11 - but this is the first time on mobile where I had a CPU profiler, a GPU profiler, a GPU debugger and a GPU API validation layer that all worked well in tandem, catching most issues during development and helping optimize the code.&lt;/li&gt;
  &lt;li&gt;While the API is somewhat lower level than Direct3D 11, and it leaves some key low-level decisions to the developer (such as the render pass configuration or the synchronization), it still uses a traditional resource model where each resource has certain “usage flags” it has been created with but does not require pipeline barriers or layout transitions, and a traditional binding model where each shader stage has several slots you can freely assign resources to. Both of these are familiar, easy to understand and require very limited amount of code to get going fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One other thing that helped is that our API interface was ready for Metal-like APIs - it is very lean but it exposes enough detail (such as render passes) to be able to easily write a performant implementation. At no point in our implementation did I need to save/restore state (many API interfaces suffer from this, particularly due to treating render target setup as state changes and resources/state binding persisting through that) or make complicated decisions about resource lifetime/synchronization. About the only “complicated” piece of code needed to render is one that creates the render pipeline state by hashing bits that are needed to create one - pipeline state objects are not part of our API abstraction. Even that is pretty straightforward and fast. I will write more about our API interface in a separate post.&lt;/p&gt;
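
&lt;p&gt;That hashing scheme is conceptually just a memoized factory; here is an illustrative sketch (with hypothetical names - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_pipeline_state&lt;/code&gt; stands in for the underlying pipeline state creation call, and the key fields are examples, not our real state bits):&lt;/p&gt;

```python
# Hash the bits that define a pipeline and create the heavyweight state
# object only on a cache miss; subsequent lookups are a dict probe.
class PipelineCache:
    def __init__(self, create_pipeline_state):
        self.create = create_pipeline_state
        self.cache = {}

    def get(self, vertex_shader, fragment_shader, blend_mode, color_format):
        key = (vertex_shader, fragment_shader, blend_mode, color_format)
        state = self.cache.get(key)
        if state is None:
            state = self.create(key)  # expensive: only on first use
            self.cache[key] = state
        return state
```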

&lt;p&gt;&lt;img src=&quot;/images/metalr_diag3.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So, a week to get the shaders compiling, two weeks to get a polished optimized implementation&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; - what are the results? The results are great - Metal absolutely delivers on the performance promise. For one, the single-threaded dispatch performance is noticeably better than with OpenGL (shrinking the draw dispatch part of our render frame by 2-3x depending on the workload), and that is despite our OpenGL implementation being pretty well tuned in terms of reducing redundant state setup and playing nice with the driver by using fast paths. But it does not stop there - multithreading in Metal is trivial to utilize provided that your rendering code is ready for it. We haven’t switched to threaded draw dispatch yet but are already converting some other parts that prepare resources to happen off the render thread, which, unlike with OpenGL, is pretty much effortless.&lt;/p&gt;

&lt;p&gt;Beyond that, Metal allows us to fix some other performance issues by giving easily accessible and reliable tools. One of the central parts of our rendering code is the system that computes lighting data on the CPU in world space and uploads it to regions of a 3D texture (which we have to emulate on OpenGL ES 2 hardware). The updates are partial so we can’t duplicate the entire texture and have to rely on however the driver implements &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glTexSubImage3D&lt;/code&gt;. At one point we tried to use PBO to improve update performance but faced significant stability issues across the board, both on Android and iOS. On Metal there are two builtin ways to upload a region - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MTLTexture.replaceRegion&lt;/code&gt; that you can use if GPU is not currently reading the texture, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MTLBlitCommandEncoder&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;copyFromBufferToTexture&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;copyFromTextureToTexture&lt;/code&gt;) that can upload the region asynchronously just in time for GPU to start using the texture.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/metalr_diag2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Both of these methods were slower than I’d like - the first one wasn’t really an option since the GPU may be reading the texture when we need to update parts of it, and it worked purely on the CPU using what looked like a very slow address translation implementation. The second one worked but seemed to use a series of 2D blits to fill the 3D texture, which were both pretty expensive to set up commands for on the CPU side and also had a very high GPU overhead for whatever reason. If this were OpenGL that would have been the end of it - in fact, the performance of these two methods roughly matched the observed cost of a similar update in OpenGL. Fortunately, Metal has easy access to compute shaders - and a super simple compute shader gave us the capability to do a buffer -&amp;gt; 3D texture upload that was very fast on both CPU and GPU and basically solved our performance problems in this part of the code for good&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Method&lt;/th&gt;
      &lt;th&gt;CPU cost&lt;/th&gt;
      &lt;th&gt;GPU cost&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;OpenGL, glTexSubImage3D&lt;/td&gt;
      &lt;td&gt;1.0 ms&lt;/td&gt;
      &lt;td&gt;3.0 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Metal, replaceRegion&lt;/td&gt;
      &lt;td&gt;2.3 ms&lt;/td&gt;
      &lt;td&gt;0.0 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Metal, copyFromBuffer&lt;/td&gt;
      &lt;td&gt;0.2 ms&lt;/td&gt;
      &lt;td&gt;3.0 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Metal, copyFromTexture&lt;/td&gt;
      &lt;td&gt;1.0 ms&lt;/td&gt;
      &lt;td&gt;3.0 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Metal, compute shader copy&lt;/td&gt;
      &lt;td&gt;0.2 ms&lt;/td&gt;
      &lt;td&gt;0.3 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
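
&lt;p&gt;Conceptually, the compute shader copy is just per-thread index math: each thread takes one texel of the destination region and reads it from a linear staging buffer. Sketched below in Python with illustrative names; the real version is a short Metal compute kernel doing the same arithmetic per thread:&lt;/p&gt;

```python
# Copy a w*h*d region from a linear staging buffer into a 3D texture at
# the given origin; the texture is modeled as a dict keyed by (x, y, z).
def copy_region(staging, texture, origin, region_size):
    ox, oy, oz = origin
    w, h, d = region_size
    for z in range(d):
        for y in range(h):
            for x in range(w):
                src = (z * h + y) * w + x  # linear offset in the staging buffer
                texture[(ox + x, oy + y, oz + z)] = staging[src]
```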

&lt;p&gt;As a final general comment, maintaining Metal code is pretty much effortless as well - all extra features we had to add so far were easier to add there than on any other API we support, and I expect this trend to continue. There was a bit of a concern that adding one more API would require constant maintenance, but compared to OpenGL this does not really require much work; in fact, since we won’t have to support OpenGL ES 3 on iOS any more, this means we can simplify some OpenGL code we have as well.&lt;/p&gt;

&lt;h2 id=&quot;stability&quot;&gt;Stability&lt;/h2&gt;

&lt;p&gt;Today on iOS Metal feels very stable. I am not sure what the situation was like at launch in 2014, or what it is like on Mac today, but both the drivers and the tools for iOS feel pretty solid.&lt;/p&gt;

&lt;p&gt;We had one driver issue on iOS 10 that had to do with loading shaders compiled with Xcode 7 (which we fixed by switching to Xcode 8), and one driver crash on iOS 9 that turned out to be a result of misusing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nextDrawable&lt;/code&gt; API. Other than that we haven’t seen any behavioral bugs or any crashes - for a relatively new API Metal has been very solid across the board.&lt;/p&gt;

&lt;p&gt;Additionally, the tools you get with Metal are varied and rich; specifically, you can use:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A pretty comprehensive validation layer that will identify common issues in using the API. It’s basically like the Direct3D debug layer - which is familiar to Direct3D users but pretty much unheard of in OpenGL land (in theory ARB_debug_output is supposed to solve this; in practice it’s mostly unavailable, and when it is, not terribly helpful)&lt;/li&gt;
  &lt;li&gt;A working GPU debugger which shows all commands you have dispatched along with their state, the render target contents, the texture contents, etc. I don’t know if it has a functioning shader debugger because I never needed that, and the buffer inspection could be a bit easier, but it mostly does the job.&lt;/li&gt;
  &lt;li&gt;A working GPU profiler which shows per-pass performance stats (time, bandwidth) and also per-shader execution time. Since the GPU is a tiler you can’t really expect per-drawcall timings, of course. Having this level of visibility - especially considering the complete lack of any GPU timing information in graphics APIs on iOS - is great.&lt;/li&gt;
  &lt;li&gt;A working CPU/GPU timeline trace (Metal System Trace) which shows the scheduling of CPU and GPU rendering workload, similar to GPUView but actually easy to use, modulo some UI idiosyncrasies.&lt;/li&gt;
  &lt;li&gt;An offline shader compiler that validates your shader syntax, occasionally gives you useful warnings, and converts your shader into a binary blob that is pretty fast to load at runtime and reasonably well optimized beforehand, reducing load times since the driver’s compiler has less work left to do.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you come from the Direct3D or console world, you may take every single one of these for granted - trust me, in OpenGL every single one of these is unusual and is met with excitement, especially on mobile, where you are used to dealing with occasionally broken drivers, no validation, no GPU debugger, no helpful GPU profiler, no ability to gather GPU scheduling data, and being forced to work with a text-based shader language that each vendor parses slightly differently.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Metal is a great API to both write code for and ship applications with. It’s easy to use, it has predictable performance, robust drivers and a solid toolset. It beats OpenGL in every single aspect except for portability, but the reality with OpenGL is that you really only should have used it on three platforms (iOS, Android and Mac), and two of those now support Metal; additionally, the portability promise of OpenGL is largely unfulfilled, as code that you write on one platform very frequently ends up not working on another for various reasons.&lt;/p&gt;

&lt;p&gt;If you are using a third-party engine like Unity or UE4, Metal is already supported there; if you aren’t and you enjoy graphics programming or care deeply about performance and take iOS or Mac seriously, I strongly urge you to give Metal a try. You will not be disappointed.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Yeah, okay, and maybe a week to fix a few bugs discovered during testing &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The numbers are for 128 KB worth of data updated per frame (two 32x16x32 RGBA8 regions) on A10 &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Thu, 01 Dec 2016 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2016/12/01/metal-retrospective/</link>
			<guid isPermaLink="true">https://zeux.io/2016/12/01/metal-retrospective/</guid>
		</item>
		
		<item>
			<title>Ten years of parsing XML</title>
			<description>&lt;p&gt;Exactly ten years ago, the first version of my XML parser, &lt;a href=&quot;https://pugixml.org&quot;&gt;pugixml&lt;/a&gt;, got released to the public.&lt;/p&gt;

&lt;p&gt;pugixml was born out of frustration with the status quo - ten years ago, XML parsers ranged from “slow” to “super slow”. Expat had decent performance, but was based on SAX (stream parsing with callbacks), which made parsing some documents, like COLLADA, very inconvenient. TinyXML was extremely memory hungry and extremely slow. There was a library on CodeProject, called pugxml, that was a bit faster than TinyXML and used an interesting parsing approach - in-situ (in-place) parsing, which I describe in more detail in my POSA article, &lt;a href=&quot;https://www.aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html&quot;&gt;“Parsing XML at the Speed of Light”&lt;/a&gt;.&lt;/p&gt;
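
&lt;p&gt;The gist of in-situ parsing is that the parser mutates the source buffer, terminating tokens in place, so nodes can point into the buffer instead of owning string copies. A deliberately tiny, non-XML illustration of that idea (this is not pugixml’s code):&lt;/p&gt;

```python
# Tokenize a mutable buffer of ';'-separated tokens by overwriting each
# separator with a NUL byte in place; every "string" the caller gets back
# is just an offset into the original buffer - no copies, no per-token
# allocations.
def tokenize_insitu(buf):
    offsets = [0] if buf else []
    for i, b in enumerate(buf):
        if b == ord(";"):
            buf[i] = 0             # terminate the previous token in place
            offsets.append(i + 1)  # next token starts right after
    return offsets

def token_at(buf, off):
    end = off
    while end != len(buf) and buf[end] != 0:
        end += 1
    return bytes(buf[off:end])
```

&lt;p&gt;A real in-situ XML parser applies the same trick to names, attribute values and text content while walking the markup, which is a large part of why the approach is so fast on large documents.&lt;/p&gt;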

&lt;p&gt;I was not satisfied with the performance or the code, but it was a good start. I decided to fork pugxml, call it pugixml&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; (“i” for “improved”), clean it up a bit and make it faster. I never imagined this would start a ten year long journey.&lt;/p&gt;

&lt;p&gt;The source code started as a single 1800 LOC source file and a small header, and is at ~12500 LOC today&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. While I try to focus on features that are important and avoid bloat, the first version was extremely bare-bones - it did not even support mutable trees - while today it features multiple UTF encodings, two mutable tree representations and an XPath 1.0 query engine. It quickly became apparent that the only way to guarantee quality is to create a very comprehensive unit test suite - which is at ~14700 LOC and covers close to 99% of source lines. In addition to that, pugixml has been extensively tested using both afl-fuzz and LLVM libFuzzer, was checked using many static analyzers, and underwent security audits by actual human beings.&lt;/p&gt;

&lt;p&gt;During development, pugixml went through two version control systems (from SVN through git-svn to pure Git), three documentation generators (from Doxygen to Boost Quickbook + DocBook to AsciiDoc) and two build systems (from Jamplus to CMake + Make). In terms of performance it got faster and leaner with pretty much every version - the parsing engine has been carefully tuned for several compilers over the years, and memory consumption also decreased over time, with the latest version introducing a new compact tree representation. As a result, while it’s not necessarily faster than every single other XML parser out there in all cases, it sure is in most, and it is very competitive from a memory standpoint as well (&lt;a href=&quot;https://pugixml.org/benchmark/&quot;&gt;benchmark results&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Initially pugixml supported just one compiler (Microsoft Visual C++) and just one platform (Windows). Today it supports more than a dozen platforms and quite a few compilers, ranging all the way from Microsoft Visual C++ 6&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; to the newest C++14 compilers, including some pretty esoteric toolchains (like Wind River). C++ being what it is, this takes some effort to maintain and test, but the library is there for you on any platform you choose to port your application to, and I happily accept portability patches, including warning fixes (with the goal of being warning-free on as wide a range of compilers and compiler options as feasible).&lt;/p&gt;

&lt;p&gt;pugixml tries to strike a balance between ease of use and robustness on the interface side, and between efficiency and portability on the implementation side. Some tradeoffs were definitely not made optimally, and once in a while I lament a certain part of the API that would be hard to take away now, but overall I hear very positive feedback from users of the library.&lt;/p&gt;

&lt;p&gt;I started with an SVN repository and a .zip file download; now you can get pugixml as a package in many Linux distributions, as well as through Homebrew and NuGet. The best part is that most of this is not my doing - several people maintain the Linux packages, which is great because I don’t have the resources to do all of that myself.&lt;/p&gt;

&lt;p&gt;Nothing makes me happier and prouder than e-mails from the wild - talking about a successful integration or a replacement of another XML parser, significant performance or memory gains achieved using pugixml, a weird embedded system where the compiler’s interpretation of trivial C++ constructs is sometimes unconventional, or just a note saying that the API is nice to use.&lt;/p&gt;

&lt;p&gt;I have heard from individuals and companies, big and small; people who make small applications and companies that make end-user products with extremely wide reach, like Skype; people who work in aerospace for different countries; people who have kilobytes of stack space on their embedded devices and people who have gigabytes of XML data to parse (thankfully the last two categories don’t intersect). Many users are incredibly helpful and dedicated - one of the crazier bugs I had to fix involved compiling pugixml on SPARC64 in QEMU to investigate and fix a floating-point alignment &lt;a href=&quot;https://github.com/zeux/pugixml/issues/48&quot;&gt;issue&lt;/a&gt;, with the person who reported it preparing the QEMU image with git, gcc, gdb and pugixml already inside it.&lt;/p&gt;

&lt;p&gt;I learned an incredible amount over these 10 years, and a big part of that is due to my work on pugixml. While the pace of pugixml development has definitely been slowing down, there are some big features that I occasionally implement, and I expect to continue maintaining, improving and polishing it in the future - so here’s to another 10 years!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;No, I am not really sure how to pronounce it. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;It’s still a single source file and a small header - this simplifies integration and forces me to keep the source reasonably small. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Yes, I still maintain support for this compiler. It is mostly straightforward, except for working around a template mangling issue: every template argument has to be present in the function signature, otherwise a wrong instantiation may be used. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Sun, 06 Nov 2016 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2016/11/06/ten-years-of-parsing-xml/</link>
			<guid isPermaLink="true">https://zeux.io/2016/11/06/ten-years-of-parsing-xml/</guid>
		</item>
		
		<item>
			<title>Optimizing slerp</title>
			<description>&lt;p&gt;In the last article (&lt;a href=&quot;/2015/07/23/approximating-slerp/&quot;&gt;Approximating slerp&lt;/a&gt;) we discussed a need for a fast and reasonably precise quaternion interpolation method. By looking at the data we arrived at two improvements to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nlerp&lt;/code&gt;, a less precise one and a more precise one. Let’s look at their implementations and performance!&lt;/p&gt;

&lt;p&gt;We will implement three functions - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nlerp&lt;/code&gt;, which is the baseline normalized linear interpolation, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fnlerp&lt;/code&gt;, which will use the simpler approximation for the interpolation parameter, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;onlerp&lt;/code&gt;, which will use the more exact approximation from the previous article. While implementing these functions seems trivial given the results of the hard work, we will focus on a vectorized version using SSE2. This post demonstrates what is probably the most important way to vectorize computations - it’s frequently pretty simple to implement, yields very good results and scales to arbitrary SIMD width.&lt;/p&gt;

&lt;h2 id=&quot;scalar-first&quot;&gt;Scalar first&lt;/h2&gt;

&lt;p&gt;Perhaps counter-intuitively, our path to a SIMD implementation will start with a scalar one. We will not just use it for comparisons - we will directly convert it to a SIMD version. Since we derived all the coefficients and equations in the last article, all implementations are pretty straightforward; here’s one of them:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;onlerp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ca&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fabsf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ca&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0904&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.2452&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.55645&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.43519&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.848013&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.06021&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.215638&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lerp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ca&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, this is similar to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nlerp&lt;/code&gt; - it handles quaternion double-cover and normalizes the quaternion after interpolating - but instead of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; directly it tries to find a better fit by adjusting it using the approximation we derived. Note that, similarly to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nlerp&lt;/code&gt; but unlike &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slerp&lt;/code&gt;, this function does not have a singularity when the angle between the quaternions approaches zero.&lt;/p&gt;
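
&lt;p&gt;For reference, here’s what the baseline &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nlerp&lt;/code&gt; looks like, together with a minimal sketch of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt; helpers these snippets assume - the four-argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lerp&lt;/code&gt; takes an independent weight for each operand (this is a guess at the helpers’ shape, not the exact code from my math library):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;cmath&amp;gt;

struct Q { float x, y, z, w; };

static float dot(Q a, Q b) { return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w; }

// weighted sum with an independent weight for each operand
static Q lerp(Q a, Q b, float wa, float wb)
{
    return { a.x * wa + b.x * wb, a.y * wa + b.y * wb, a.z * wa + b.z * wb, a.w * wa + b.w * wb };
}

static Q unit(Q q)
{
    float s = 1.f / sqrtf(dot(q, q));
    return { q.x * s, q.y * s, q.z * s, q.w * s };
}

Q nlerp(Q l, Q r, float t)
{
    float ca = dot(l, r);

    // negate the weight of r to take the shorter arc (double-cover)
    return unit(lerp(l, r, 1 - t, ca &amp;gt; 0 ? t : -t));
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;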

&lt;p&gt;This is more expensive to compute than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nlerp&lt;/code&gt;; note, though, that some of the extra computations depend only on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; and thus can be performed once when you have to interpolate a lot of quaternions with the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; (which is the case for some types of animation sampling).&lt;/p&gt;
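
&lt;p&gt;A hypothetical split along these lines - only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; and the final lerp depend on the quaternions, so the polynomial terms in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; can be computed once per sample time. This is a sketch that restates the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt; helpers it assumes, not production code:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;cmath&amp;gt;

struct Q { float x, y, z, w; };

static float dot(Q a, Q b) { return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w; }
static Q lerp(Q a, Q b, float wa, float wb)
{
    return { a.x * wa + b.x * wb, a.y * wa + b.y * wb, a.z * wa + b.z * wb, a.w * wa + b.w * wb };
}
static Q unit(Q q)
{
    float s = 1.f / sqrtf(dot(q, q));
    return { q.x * s, q.y * s, q.z * s, q.w * s };
}

// The parts of the onlerp correction that depend only on t; computed once
// per sample time and reused for many quaternion pairs.
struct OnlerpCoeffs
{
    float t; // original parameter
    float u; // (t - 0.5)^2
    float v; // t * (t - 0.5) * (t - 1)
};

OnlerpCoeffs onlerp_prepare(float t)
{
    return { t, (t - 0.5f) * (t - 0.5f), t * (t - 0.5f) * (t - 1) };
}

// Per-pair part: only A, B and the final lerp depend on the quaternions.
Q onlerp_apply(Q l, Q r, const OnlerpCoeffs&amp;amp; c)
{
    float ca = dot(l, r);
    float d = fabsf(ca);
    float A = 1.0904f + d * (-3.2452f + d * (3.55645f - d * 1.43519f));
    float B = 0.848013f + d * (-1.06021f + d * 0.215638f);
    float ot = c.t + c.v * (A * c.u + B); // ot = t + t*(t-0.5)*(t-1)*k, k = A*u + B

    return unit(lerp(l, r, 1 - ot, ca &amp;gt; 0 ? ot : -ot));
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;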

&lt;p&gt;Instead of measuring performance directly, let’s look at the performance modeled by &lt;a href=&quot;https://software.intel.com/en-us/articles/intel-architecture-code-analyzer&quot;&gt;Intel Architecture Code Analyzer&lt;/a&gt;, a great tool that lets you place special markers in your code and look at the approximate scheduling on the target Intel CPU of your choice. These numbers are approximate, but they let me skip setting up a benchmark and interpreting the timing results. &lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;To model performance, we will look at a loop that reads two streams of data - one quaternion stream and one stream with interpolation coefficients - performs the interpolation from the base quaternion and writes the result into a third stream. Adding the IACA markers is pretty straightforward:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;IACA_START&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;IACA_END&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’ll be measuring the throughput, not the latency, since we’re assuming that you’re running the interpolation many times on big arrays of data. And here are the results:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Function&lt;/th&gt;
      &lt;th&gt;Cycles/element&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;nlerp&lt;/td&gt;
      &lt;td&gt;14.15&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;fnlerp&lt;/td&gt;
      &lt;td&gt;18.35&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;onlerp&lt;/td&gt;
      &lt;td&gt;22.95&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, while our approximations add some overhead, the functions remain relatively fast. Why is there no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slerp&lt;/code&gt; in the table? Ah, that’s because IACA can’t really measure the performance of code that uses standard trigonometric functions: they are implemented in the standard library, the implementation is not inlined, so there’s no code to analyze! Anecdotally, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slerp&lt;/code&gt; is about 3x slower than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;onlerp&lt;/code&gt; in this test, although the results will vary greatly based on the architecture and the standard library implementation (e.g. with the x87 FPU the difference should be much more dramatic).&lt;/p&gt;

&lt;h2 id=&quot;whens-the-last-time-you-had-one-value-to-interpolate&quot;&gt;When’s the last time you had ONE value to interpolate?&lt;/h2&gt;

&lt;p&gt;Now let’s outline the process that we will use to vectorize the code. We always start with the scalar function, which performs the transformation once - in our case, it interpolates between two quaternions. The vectorized implementation will work on N items by performing the scalar transformation N times, where N is usually the SIMD width (4 in our case, since SSE2 registers are 4 floats wide).&lt;/p&gt;

&lt;p&gt;The data flow of the vectorized function will otherwise be the same - instead of using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt; variables, we will use variables of a vector type; wherever the scalar function multiplied two floats, we will multiply two SIMD registers. Here’s how you convert a quaternion dot product:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// ca = l.x * r.x + l.y * r.y + l.z * r.z + l.w * r.w&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ca&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;_mm_add_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that this is the exact same computation we performed in the scalar version - but since we’re using SIMD registers we’re doing 4 dot products at once. Also in this instance we’re grouping the operands a bit differently for addition to reduce latency a bit - since addition is left-associative, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a + b + c + d&lt;/code&gt; is evaluated as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;((a + b) + c) + d&lt;/code&gt;, which creates dependencies between all addition operations; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(a + b) + (c + d)&lt;/code&gt; is in some cases faster.&lt;/p&gt;

&lt;p&gt;Whenever there is a branch in the scalar version, we have a problem - we’re essentially running N computations on the data in parallel, and if the branch condition is different for different computations then we’d like to execute different SIMD instructions on different parts of the registers, but most SIMD instruction sets can’t do that. Instead, we’ll remove all branches by computing both sides of the branch and then selecting the result using bitwise operators; here’s an example &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// rt = ca &amp;gt; 0 ? ot : -ot&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rtMask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_cmpgt_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ca&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_setzero_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_or_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_andnot_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rtMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rtMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is mostly straightforward, except that SSE2 does not have a bitwise select, so we have to emulate it using primitive bitwise operations.&lt;/p&gt;
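
&lt;p&gt;Since this select pattern comes up for every branch we remove, it’s convenient to wrap it in a small helper (a sketch; on SSE4.1 and later, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_blendv_ps&lt;/code&gt; does this in a single instruction):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;emmintrin.h&amp;gt;

// r = mask ? b : a, per lane; mask lanes must be all-ones or all-zeros,
// which is exactly what SSE comparison instructions produce.
inline __m128 select_ps(__m128 mask, __m128 b, __m128 a)
{
    return _mm_or_ps(_mm_and_ps(mask, b), _mm_andnot_ps(mask, a));
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this helper, the last line above becomes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__m128 rt = select_ps(rtMask, rt1, rt0);&lt;/code&gt;.&lt;/p&gt;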

&lt;p&gt;With all the computations out of the way, the final problem is loading and saving the data. What we are using here is essentially the SoA (structure of arrays) layout - instead of putting all 4 components of a quaternion in an SSE register, we’re putting 4 X components in a register. Frequently the data is laid out using AoS (array of structures). In some cases changing the layout of your data in memory is the better option, but for now we will just convert from one to the other by loading 4 quaternions into 4 SSE registers and then transposing the resulting 4x4 “matrix” - columns become rows, rows become columns, and individual components are nicely grouped in one register:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lX&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_load_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_load_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lZ&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_load_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lW&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_load_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;_MM_TRANSPOSE4_PS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lZ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lW&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After we compute the results, we can use the same transpose macro (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_MM_TRANSPOSE4_PS&lt;/code&gt; is a standard SSE macro that expands into multiple permutation instructions) to convert the results back to AoS and store them. When I’m vectorizing code I usually start like this, and then gradually convert more and more internal structures to the SoA layout while keeping SoA &amp;lt;-&amp;gt; AoS conversions at the interface.&lt;/p&gt;
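
&lt;p&gt;For completeness, the store side of that conversion might look like this (a sketch; it writes four quaternions as 16 consecutive floats and assumes the destination is 16-byte aligned, matching the aligned loads above):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;xmmintrin.h&amp;gt;

// Store four SoA component registers (all Xs, all Ys, ...) as four
// consecutive AoS quaternions at dest; dest must be 16-byte aligned.
inline void store_aos(float* dest, __m128 x, __m128 y, __m128 z, __m128 w)
{
    _MM_TRANSPOSE4_PS(x, y, z, w); // components regroup into whole quaternions

    _mm_store_ps(dest + 0, x);
    _mm_store_ps(dest + 4, y);
    _mm_store_ps(dest + 8, z);
    _mm_store_ps(dest + 12, w);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;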

&lt;h2 id=&quot;aosoa&quot;&gt;AoSoA&lt;/h2&gt;

&lt;p&gt;One important digression: when people see the term “SoA” they usually think of big arrays of data - that is not the case here! A technique that’s almost always better is so-called “block SoA”, where the arrays are always pretty short - e.g. 4 or 8 elements -
and your actual data is composed of arrays of these structures. Some people refer to this as AoSoA - array of structures of arrays.&lt;/p&gt;

&lt;p&gt;For example, given this structure:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ContactLimiter&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Vector2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalProjector1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Vector2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalProjector2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;angularProjector1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;angularProjector2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ContactLimiter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;limiters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s how you would convert it to AoSoA:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ContactLimiter&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalProjector1X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalProjector1Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalProjector2X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalProjector2Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;angularProjector1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;angularProjector2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ContactLimiter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;limiters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That way you can write a function that processes 4 limiters at a time without having to go between AoS and SoA. Naturally, all of this means that you have to face extra complexity when dealing with data sizes that are not divisible by your group width. Two frequently used options are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Process as much data as you can (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size / N&lt;/code&gt;) in blocks of N using SIMD; process the rest using scalar code&lt;/li&gt;
  &lt;li&gt;Pad the arrays with dummy data until their size is divisible by N; ignore the extra few items at the end&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 2 is usually best since it needs just one version of the code and is easier to implement; however, it requires the processing to have no side effects, which can be problematic in some cases.&lt;/p&gt;
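&lt;p&gt;Here’s what option 2 might look like in isolation (a sketch with a hypothetical &lt;code&gt;padToBlock&lt;/code&gt; helper, not code from this article):&lt;/p&gt;

```cpp
#include <cstddef>
#include <vector>

// Option 2 from the list above: round the element count up to the next
// multiple of the SIMD width N and pad with a dummy value, so the
// vectorized loop can always run over full blocks of N.
template <int N>
std::vector<float> padToBlock(const std::vector<float>& data, float dummy = 0.0f)
{
    std::vector<float> padded = data;
    std::size_t rounded = (padded.size() + N - 1) / N * N;
    padded.resize(rounded, dummy); // the extra items at the end are never used
    return padded;
}
```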

&lt;h2 id=&quot;finishing-touches&quot;&gt;Finishing touches&lt;/h2&gt;

&lt;p&gt;With the major structural work out of the way, the final two issues to tackle are quaternion double-cover and normalization.&lt;/p&gt;

&lt;p&gt;Double-cover is pretty straightforward to handle - in fact, there’s code above that shows what we can do. However, the comparisons, selects and negates are all excessive for the simple use case we have - it’s sufficient to use a few bitwise operations to negate a value when another value is negative, since in floating-point the sign is just a bit:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// rt = ca &amp;gt; 0 ? ot : -ot&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;signMask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_castsi128_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_set1_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_xor_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_and_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;signMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ca&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
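&lt;p&gt;Here’s the scalar equivalent of this trick (a sketch with a hypothetical helper name; it mirrors what the &lt;code&gt;_mm_xor_ps&lt;/code&gt;/&lt;code&gt;_mm_and_ps&lt;/code&gt; pair does on a single lane):&lt;/p&gt;

```cpp
#include <cstdint>
#include <cstring>

// Negate t when ca is negative, with no branches: extract ca's sign bit
// and XOR it into t, since an IEEE-754 float's sign is just the top bit
// of its representation.
inline float negateIfNegative(float t, float ca)
{
    std::uint32_t tb, cb;
    std::memcpy(&tb, &t, sizeof(float));
    std::memcpy(&cb, &ca, sizeof(float));
    tb ^= cb & 0x80000000u; // flip t's sign iff ca's sign bit is set
    std::memcpy(&t, &tb, sizeof(float));
    return t;
}
```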

&lt;p&gt;Normalization requires a bit more care. While we could use the division and square root instructions available in SSE2, they’re pretty slow. What we actually need in our case is an inverse square root of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dot(q, q)&lt;/code&gt; - once we compute that, we can multiply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;q&lt;/code&gt; by it instead of using a slower division, at a minor precision cost (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a / b&lt;/code&gt; is more precise than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a * (1 / b)&lt;/code&gt; because the second expression rounds twice, losing up to 0.5 ULP at each step). However, if we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_rsqrt_ps&lt;/code&gt; blindly, the error of our function increases ~4x! Considering that we went through some trouble to reduce the error by an extra order of magnitude, this is definitely non-ideal.&lt;/p&gt;

&lt;p&gt;This is of course because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_rsqrt_ps&lt;/code&gt; is just an approximation with limited precision - improving it requires a Newton-Raphson iteration step, as suggested by the &lt;a href=&quot;ftp://download.intel.com/design/pentiumii/manuals/24512701.pdf&quot;&gt;Intel Architecture Optimization&lt;/a&gt; manual:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;us0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_rsqrt_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;un&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;__m128&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;us1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_set1_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;us0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_sub_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_set1_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_mm_mul_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;us0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;us0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;un&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With that the results are still faster than using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_sqrt_ps&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_div_ps&lt;/code&gt; and the precision is back to normal.&lt;/p&gt;
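&lt;p&gt;In scalar form the Newton-Raphson step above looks like this (a sketch with a hypothetical helper name; in the real code the input is the dot product and the initial estimate comes from &lt;code&gt;_mm_rsqrt_ps&lt;/code&gt;):&lt;/p&gt;

```cpp
// One Newton-Raphson refinement step for the inverse square root: given an
// estimate y0 of 1/sqrt(x), compute y1 = 0.5 * y0 * (3 - x * y0 * y0),
// which roughly doubles the number of correct bits.
inline float rsqrtRefine(float x, float y0)
{
    return 0.5f * y0 * (3.0f - x * y0 * y0);
}
```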

&lt;h2 id=&quot;going-wider&quot;&gt;Going wider&lt;/h2&gt;

&lt;p&gt;Here’s an interesting property of the code we ended up with. Since we put the same component of 4 different quaternions into each SIMD register, instead of putting each quaternion in its own register, the code is more or less agnostic to the SIMD width. It’s not a stretch to change the code to use AVX2, which has 8-wide registers - in fact, the process is mostly mechanical:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Replace &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__m128&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__m256&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Replace &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm256_&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Replace &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si128&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si256&lt;/code&gt; (damn it, Intel)&lt;/li&gt;
  &lt;li&gt;Multiply offsets in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;store&lt;/code&gt; instructions by 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this we are left with one final piece - we don’t have a macro to transpose elements. Since we load 8 quaternions into 4 AVX2 registers, we need a special transposition macro. We could implement one that orders the elements like we need:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w1&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;x0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w3&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;y0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w5&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x7&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y7&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z7&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w7&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;w0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately AVX2 does not have a very good supply of cross-lane operations - in other words, most AVX2 instructions work within separate 128-bit halves. However, it’s easy to create an operation (called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_MM_TRANSPOSE8_LANE4_PS&lt;/code&gt; in the code) that transposes two blocks of 4x4 elements like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;x0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w1&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;x0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w3&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;y0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w5&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z7&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x7&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y7&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z7&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w7&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;w0&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then swizzle the input ‘t’ array in the correct order like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;_mm256_permutevar8x32_ps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_mm256_setr_epi32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’ll do just that, and this is the final step we need to take to make an AVX2 version of our lerping functions that works on 8 quaternion pairs at a time.&lt;/p&gt;
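&lt;p&gt;A scalar model of that permute makes the index pattern easier to see (&lt;code&gt;permute8&lt;/code&gt; is a hypothetical helper, not the intrinsic itself - &lt;code&gt;_mm256_permutevar8x32_ps&lt;/code&gt; does the same thing in one instruction):&lt;/p&gt;

```cpp
// Scalar model of _mm256_permutevar8x32_ps: out[i] = in[idx[i]].
// With the index pattern 0 2 4 6 1 3 5 7, the elements destined for the
// low 128-bit lane come first, matching what the lane-local transpose
// expects.
inline void permute8(const float in[8], const int idx[8], float out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = in[idx[i]];
}
```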

&lt;h2 id=&quot;fma&quot;&gt;FMA&lt;/h2&gt;

&lt;p&gt;… no, not &lt;a href=&quot;https://en.wikipedia.org/wiki/Fullmetal_Alchemist&quot;&gt;&lt;em&gt;that&lt;/em&gt; FMA&lt;/a&gt;. One last thing that we can do is take advantage of fused multiply-add instructions available on some architectures (like Haswell). This instruction computes the expression &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a * b + c&lt;/code&gt; at the cost of one multiplication (with slightly higher precision).&lt;/p&gt;

&lt;p&gt;It’s pretty trivial to identify these expressions in our code - it has a fair share of them, used for computing the dot product between two quaternions, computing interpolation coefficients, etc.&lt;/p&gt;

&lt;p&gt;However, since we’re using clang we don’t need to do that manually - we can ask it to automatically fuse multiplies and adds whenever possible by passing these command-line arguments:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mfma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ffast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The reason we need to pass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ffast-math&lt;/code&gt; is that this optimization changes the output of the program since the precision is different. In our case this is not something to be concerned about - in fact, using FMA reduces the error slightly (as expected).&lt;/p&gt;
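&lt;p&gt;The scalar counterpart of the fused instruction is &lt;code&gt;std::fma&lt;/code&gt; from &lt;code&gt;&amp;lt;cmath&amp;gt;&lt;/code&gt; - it performs the multiply and add with a single rounding, which is exactly why contraction changes (and here slightly improves) the results:&lt;/p&gt;

```cpp
#include <cmath>

// a * b + c with a single rounding step - the scalar equivalent of the
// hardware fused multiply-add that -mfma lets the compiler emit.
inline float mulAdd(float a, float b, float c)
{
    return std::fmaf(a, b, c);
}
```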

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;Ok, all the hard work is done - let’s see what we got in the end. We’ll use IACA again to measure the performance of a loop - each SSE2 loop iteration processes 4 quaternions instead of 1, so we’ll divide the numbers we get from IACA by 4 (and for the AVX2 version we’ll divide by 8).&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Function&lt;/th&gt;
      &lt;th&gt;Cycles/element&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;nlerp&lt;/td&gt;
      &lt;td&gt;14.15&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;fnlerp&lt;/td&gt;
      &lt;td&gt;18.35&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;onlerp&lt;/td&gt;
      &lt;td&gt;22.95&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;nlerp4&lt;/td&gt;
      &lt;td&gt;4.76&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;fnlerp4&lt;/td&gt;
      &lt;td&gt;6.14&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;onlerp4&lt;/td&gt;
      &lt;td&gt;7.19&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;onlerp8&lt;/td&gt;
      &lt;td&gt;3.65&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;onlerp8 FMA&lt;/td&gt;
      &lt;td&gt;2.63&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Our SSE2 code is 2.5-3x faster than scalar, which is reasonable - we still lose time on AoS &amp;lt;-&amp;gt; SoA conversion. The AVX2 code is even more impressive, at 6x faster than scalar - the AVX2 function takes roughly as many cycles as the SSE2 function but processes twice as many elements per iteration! And the FMA version is allegedly a full cycle faster still.&lt;/p&gt;

&lt;p&gt;Keep in mind that these timings are estimated, not measured. My ghetto measurements don’t agree with these numbers, but they were performed in a setting where other factors, such as memory access time, may play a significant role in determining the execution time.&lt;/p&gt;

&lt;p&gt;All of the above code and more is &lt;a href=&quot;https://gist.github.com/zeux/1935b5f6d1c8c311e68bbd4a13955dfa&quot;&gt;available here&lt;/a&gt;. Note that the SIMD code is admittedly pretty ugly - the intrinsic names, sign bit manipulations etc. obscure the meaning of the code which is unfortunate because really the SIMD code is very much like the scalar code and the process of converting one to the other is pretty automatic, even if the results look wildly different. But that’s a problem for another time.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I do have some benchmarking code in the Gist with the sources, but I did not spend any effort to make the results stable. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There is a more efficient way to do this in this specific case by just manipulating the sign bit but this was the only branch that I could show the technique on. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Thu, 05 May 2016 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2016/05/05/optimizing-slerp/</link>
			<guid isPermaLink="true">https://zeux.io/2016/05/05/optimizing-slerp/</guid>
		</item>
		
		<item>
			<title>Approximating slerp</title>
			<description>&lt;p&gt;Quaternions should probably be your first choice as far as representing rotations goes. They take less space than matrices (this is important since programs are increasingly memory bound); they’re similar in terms of performance of basic operations (slower for some, faster for others); they are much faster to normalize, which is frequently necessary to combat accumulating error; and finally they’re way easier to interpolate. In this post we’ll focus on interpolation.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you’ve read &lt;a href=&quot;http://number-none.com/product/Hacking%20Quaternions/&quot;&gt;“Hacking Quaternions” (2002)&lt;/a&gt; by Jonathan Blow, then this article will be familiar. Then again, it’s been 13 years, and these results are more precise and more rigorously derived.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;spherical-interpolation&quot;&gt;Spherical interpolation&lt;/h2&gt;

&lt;p&gt;A well-known method of interpolating quaternions is called $slerp$, or spherical interpolation. Spherical interpolation is a linear combination of two quaternions with coefficients that depend on the half-angle of rotation between the quaternions:&lt;/p&gt;

&lt;p&gt;$ a = \arccos(q_0 \cdot q_1) $&lt;/p&gt;

&lt;p&gt;The most important feature of $slerp$ is that the interpolation has constant angular velocity - that is, the angle of rotation from $q_0$ to the resulting quaternion changes as a linear function of interpolating coefficient $t$. $slerp$ is defined as follows:&lt;/p&gt;

&lt;p&gt;$ slerp(q_0, q_1, t) = q_0\frac{\sin((1 - t) a)}{\sin(a)} + q_1\frac{\sin(t a)}{\sin(a)} $&lt;/p&gt;

&lt;p&gt;This function has a singularity at $a = 0$ (which corresponds to $q_0 = q_1$), so in practice $slerp$ is replaced by a simple linear interpolation as $q_0$ approaches $q_1$. In addition, because of quaternion double-cover - for each rotation there are two unit quaternions that represent it, $q$ and $-q$ - a $slerp$ implementation has to account for that and negate one of the quaternions if $q_0 \cdot q_1$ is negative.&lt;/p&gt;

&lt;p&gt;The problem with $slerp$ is that it’s expensive to compute. You have to evaluate four trigonometric functions; since they are usually implemented using a range reduction step followed by a polynomial approximation of relatively high degree, this can get expensive. We can try to replace them with simpler, less precise approximations, but it’s more efficient to solve the issue in a more direct way.&lt;/p&gt;

&lt;p&gt;One other way to interpolate quaternions is $nlerp$ - which is just a linear interpolation, followed by a renormalization step (as well as aforementioned negation to solve issues with double-cover). Here’s how it can work:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;nlerp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lerp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code assumes that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unit&lt;/code&gt; normalizes the quaternion and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lerp&lt;/code&gt; performs the computation &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;l * lt + r * rt&lt;/code&gt;.&lt;/p&gt;
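
&lt;p&gt;For completeness, here’s a self-contained sketch of those helpers (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt; layout is an assumption - any xyzw float quaternion works):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Minimal xyzw quaternion; the layout is illustrative, not the exact type from the snippet.
struct Q { float x, y, z, w; };

float dot(Q l, Q r)
{
	return l.x * r.x + l.y * r.y + l.z * r.z + l.w * r.w;
}

// lerp computes the linear combination l * lt + r * rt componentwise
Q lerp(Q l, Q r, float lt, float rt)
{
	return { l.x * lt + r.x * rt, l.y * lt + r.y * rt, l.z * lt + r.z * rt, l.w * lt + r.w * rt };
}

// unit renormalizes the result back onto the unit sphere
Q unit(Q q)
{
	float s = 1.f / std::sqrt(dot(q, q));
	return { q.x * s, q.y * s, q.z * s, q.w * s };
}

Q nlerp(Q l, Q r, float t)
{
	float lt = 1 - t;
	float rt = dot(l, r) > 0 ? t : -t;

	return unit(lerp(l, r, lt, rt));
}
```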

&lt;p&gt;This is much simpler than $slerp$ - the only semi-expensive step here is normalization (but even this is pretty efficient given the reciprocal square root intrinsics that are present in most SIMD instruction sets). However, this does not give us constant velocity interpolation.&lt;/p&gt;

&lt;p&gt;Despite not having constant velocity, $nlerp$ follows the same path as $slerp$ - both operations produce values that lie on the shortest arc between the two input quaternions. This naturally means that by adjusting the coefficient of interpolation in $nlerp$ we can get the same result as computed by $slerp$.&lt;/p&gt;

&lt;p&gt;For many applications constant velocity is not actually very important - for example, if you use quaternions in your animation system, it’s possible that your artists made the animations using spline-based Euler angle interpolation, so the choice of interpolation is to an extent arbitrary. The canonical way of exporting animations is to start with a high-frequency sampled animation (e.g. 60 Hz) and remove keyframes while the interpolation error stays acceptable; in that case a different interpolation method just changes the number of keyframes, so maintaining constant velocity is not critical. For the rest of the article, though, we will assume that we need close-to-constant angular velocity interpolation.&lt;/p&gt;

&lt;h2 id=&quot;approximating-slerp-with-nlerp&quot;&gt;Approximating slerp with nlerp&lt;/h2&gt;

&lt;p&gt;This is the equation we’re solving (we need to find $t&apos;$):&lt;/p&gt;

&lt;p&gt;$ nlerp(q_0, q_1, t&apos;) = slerp(q_0, q_1, t) $&lt;/p&gt;

&lt;p&gt;Given that, and some normalizing factor $s$ (remember, nlerp is a linear interpolation followed by normalization), we have:&lt;/p&gt;

&lt;p&gt;$ \frac{q_0(1 - t&apos;) + q_1 t&apos;}{s} = q_0\frac{\sin((1 - t) a)}{\sin(a)} + q_1\frac{\sin(t a)}{\sin(a)} $&lt;/p&gt;

&lt;p&gt;Let’s assume that the coefficients of the linear combination on both sides are equal (if they are, the equality will certainly hold); from that we get:&lt;/p&gt;

&lt;p&gt;$ s&apos; = \frac{s}{\sin(a)} $&lt;/p&gt;

&lt;p&gt;$ \frac{1 - t&apos;}{s&apos;} = \sin((1 - t) a) $&lt;/p&gt;

&lt;p&gt;$ \frac{t&apos;}{s&apos;} = \sin(t a) $&lt;/p&gt;

&lt;p&gt;From that it’s easy to get $t&apos;$:&lt;/p&gt;

&lt;p&gt;$ \frac{1}{s&apos;} = \frac{1 - t&apos;}{s&apos;} + \frac{t&apos;}{s&apos;} = \sin((1 - t) a) + \sin(t a) $&lt;/p&gt;

&lt;p&gt;$ \frac{1}{t&apos;} = \frac{1 / {s&apos;}}{t&apos; / {s&apos;}} = \frac{\sin((1 - t) a) + \sin(t a)}{\sin(t a)} $&lt;/p&gt;

&lt;p&gt;$ \frac{1}{t&apos;} = 1 + \frac{\sin((1 - t) a)}{\sin(t a)} $&lt;/p&gt;

&lt;p&gt;$ t&apos; = \frac{1}{1 + \frac{\sin((1 - t) a)}{\sin(t a)}} $&lt;/p&gt;

&lt;p&gt;This derivation leads us to the final formula that uses the cosine of the angle between quaternions as the parameter $d$:&lt;/p&gt;

&lt;p&gt;$ d = q_0 \cdot q_1 $&lt;/p&gt;

&lt;p&gt;$ t&apos; = \frac{1}{1 + \frac{\sin((1 - t) \arccos d)}{\sin(t \arccos d)}} $&lt;/p&gt;

&lt;p&gt;Now that we know how to compute $t&apos;$, we need to find a good approximation that is fast to compute - which means a polynomial approximation. Note that we need to compute $d$ anyway to determine if we need to flip one of the quaternions - so if we can efficiently approximate $t&apos;$, we can get an interpolation function that’s as precise as $slerp$ and as fast as $nlerp$!&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
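
&lt;p&gt;As a sanity check, here’s a sketch that computes the exact $t&apos;$ straight from this formula (assuming $d \ge 0$, i.e. the double-cover flip has already happened):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Exact nlerp parameter t' that reproduces slerp(q0, q1, t),
// computed directly from the formula above; expects d = dot(q0, q1) >= 0.
float slerp_t(float d, float t)
{
	float a = std::acos(d);

	// as a -> 0 both sines vanish and t' -> t; guard the 0/0 case
	if (std::sin(t * a) == 0.f)
		return t;

	return 1.f / (1.f + std::sin((1.f - t) * a) / std::sin(t * a));
}
```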

&lt;p&gt;The first step to finding a good approximation is looking at the data - in this case, at the function $t&apos; = t&apos;(d, t)$&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;staring-at-the-data&quot;&gt;Staring at the data&lt;/h2&gt;

&lt;p&gt;The easiest way to analyze the function is to graph it over the domain we’re interested in. Let’s first visualize our function in 3D over $[0..1]$:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot1.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot1.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$t&apos;(d, t)$&lt;/p&gt;

&lt;p&gt;This looks close to a plane, suggesting that $t&apos;(d, t) \approx t$. Thus the difference will probably be easier to look at:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot2.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot2.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$t&apos;(d, t) - t$&lt;/p&gt;

&lt;p&gt;This looks interesting - our function seems to resemble a cubic polynomial in any d-slice. Let’s plot several 2D slices at different values of d:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot3.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot3.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$t&apos;(d, t) - t,\space d=0.01, 0.2, 0.7, 0.99$&lt;/p&gt;

&lt;p&gt;Every d-slice of our function has three roots: 0, 0.5 and 1. At these values $t&apos;$ equals $t$, which means that $nlerp$ is exact at these three points&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. This also suggests that a polynomial approximation of $t&apos;(d, t) - t$ has $t(t-0.5)(t-1)$ as factors. The simplest approximation is thus $t&apos;(d, t) \approx K(d)(t-1)(t-0.5)t+t$, where $K$ is the factor that “flattens” the spline as seen on the graphs.&lt;/p&gt;

&lt;p&gt;Is this a good approximation? Let’s check!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot4.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot4.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$K(d, t)=\frac{t&apos;(d, t) - t}{t(t-0.5)(t-1)},\space d=0.01, 0.2, 0.7, 0.99$&lt;/p&gt;

&lt;p&gt;From this it is obvious that while $K$ is reasonably flat, for small values of $d$ that correspond to large angles between input quaternions it resembles a quadratic polynomial of the form $A(t-0.5)^2+B$ (the form is apparent because the lowest point is at $t=0.5$). We can therefore either model $K(d, t)$ without taking $t$ into account, which will give less accurate results, or model it as a quadratic polynomial in $t$ with coefficients that depend on $d$ alone.&lt;/p&gt;

&lt;p&gt;Let’s explore both options.&lt;/p&gt;

&lt;h2 id=&quot;fitting-kd&quot;&gt;Fitting K(d)&lt;/h2&gt;

&lt;p&gt;For any values of $d$ and $t$, we can compute $K(d, t)$. If we model $K$ as a value that does not depend on $t$, this gives us a lot of points that conflict - e.g. for a given value of $d$ we’d want $K$ to take a set of different values. You can think of this as having a lot of points on a plane and trying to fit them to a function. It makes sense to first plot these points, which is what we will do:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot5.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot5.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$K(d, t)$&lt;/p&gt;

&lt;p&gt;This looks like a quadratic polynomial. Of course, since the data is not a function of $d$ alone, any approximation will have some error - we can find the polynomial that minimizes the sum of squared errors using &lt;a href=&quot;http://mathworld.wolfram.com/LeastSquaresFitting.html&quot;&gt;least squares fitting&lt;/a&gt;, which yields our result:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot6.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot6.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$0.931872 - 1.25654 d + 0.331442 d^2$&lt;/p&gt;

&lt;p&gt;Thus our first approximation becomes:&lt;/p&gt;

&lt;p&gt;$ K_0(d, t) = 0.931872 - 1.25654 d + 0.331442 d^2 $&lt;/p&gt;
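
&lt;p&gt;Plugged into the cubic form above, this gives our first fast remapping (a sketch; the function name is illustrative, and $d$ is assumed non-negative):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// First approximation: t' ~= K0(d) * (t - 1) * (t - 0.5) * t + t,
// where K0 is the quadratic least-squares fit above; expects d >= 0.
float slerp_t_k0(float d, float t)
{
	float k = 0.931872f + (-1.25654f + 0.331442f * d) * d;

	return t + (t - 1.f) * (t - 0.5f) * t * k;
}
```

&lt;p&gt;Note that the cubic factor vanishes at $t=0$, $0.5$ and $1$, so the remapping stays exact at the three roots regardless of how good the fit for $K$ is.&lt;/p&gt;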

&lt;h2 id=&quot;fitting-kd-t&quot;&gt;Fitting K(d, t)&lt;/h2&gt;

&lt;p&gt;To get a more precise approximation, we’ll have to find $A$ and $B$ in $K(d, t) = A(t-0.5)^2+B$. The expression for $K$ is indeterminate (0/0) at $t=0.5$, but we can evaluate it at $t=0.49$ to get an estimate of $B$, and evaluating at $t=0.01$ gets us approximately $0.25A+B$. Both values depend on $d$, so naturally we will plot them:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot7.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot7.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$A=4*(K(d, 0.01)-K(d, 0.49)),\space B=K(d, 0.49)$&lt;/p&gt;

&lt;p&gt;The blue line represents $A$ and looks like a parabola; the orange line represents $B$ and looks like a line. Let’s first try to fit both of them independently:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot8.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot8.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$A=4*(K(d, 0.01)-K(d, 0.49)),\space B=K(d, 0.49),\space A&apos; \in P_2,\space B&apos; \in P_1$&lt;/p&gt;

&lt;p&gt;The fit is not very good - it looks like we’re missing an extra degree in both polynomials. Let’s try to approximate $A$ using a cubic polynomial and $B$ using a quadratic one:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/nlerp_plot9.png&quot;&gt;&lt;img src=&quot;/images/nlerp_plot9.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;$A=4*(K(d, 0.01)-K(d, 0.49)),\space B=K(d, 0.49),\space A&apos; \in P_3,\space B&apos; \in P_2$&lt;/p&gt;

&lt;p&gt;This is much better. The resulting polynomials that we get are:&lt;/p&gt;

&lt;p&gt;$ A_1(d) = 1.0615 - 2.97792 d + 2.89199 d^2 - 0.983735 d^3 $&lt;/p&gt;

&lt;p&gt;$ B_1(d) = 0.853322 - 1.07504 d + 0.225676 d^2 $&lt;/p&gt;

&lt;p&gt;$ K_1(d, t) = A_1(d)(t-0.5)^2 + B_1(d) $&lt;/p&gt;

&lt;p&gt;One issue is that we fit the polynomials $A$ and $B$ independently, while also assuming that $K$ is a quadratic polynomial in $t$, which is itself just an approximation. The errors from multiple approximations fit independently accumulate, so we won’t get the best results. Since we know the final form we want, we can fit the entire expression at once - Mathematica can do this using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FindFit&lt;/code&gt; function (and black magic). This gives us the following result:&lt;/p&gt;

&lt;p&gt;$ A_2(d) = 1.0904 - 3.2452 d + 3.55645 d^2 - 1.43519 d^3 $&lt;/p&gt;

&lt;p&gt;$ B_2(d) = 0.848013 - 1.06021 d + 0.215638 d^2 $&lt;/p&gt;

&lt;p&gt;$ K_2(d, t) = A_2(d)(t-0.5)^2 + B_2(d) $&lt;/p&gt;
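
&lt;p&gt;Putting everything together, here’s a sketch of the complete interpolation routine using $K_2$ (the names and the quaternion layout are illustrative, mirroring the $nlerp$ snippet earlier):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

struct Quat { float x, y, z, w; };

static float qdot(Quat l, Quat r)
{
	return l.x * r.x + l.y * r.y + l.z * r.z + l.w * r.w;
}

// slerp-quality interpolation at close to nlerp cost: remap t using K2,
// then do a regular nlerp. A sketch of the approach described above.
Quat fast_slerp(Quat l, Quat r, float t)
{
	float ca = qdot(l, r);
	float d = std::fabs(ca); // |cos a|; the sign is handled by the flip below

	// K2(d, t) = A2(d) * (t - 0.5)^2 + B2(d)
	float A = 1.0904f + (-3.2452f + (3.55645f - 1.43519f * d) * d) * d;
	float B = 0.848013f + (-1.06021f + 0.215638f * d) * d;
	float k = A * (t - 0.5f) * (t - 0.5f) + B;

	// t' = K2 * (t - 1) * (t - 0.5) * t + t
	float tp = t + (t - 1.f) * (t - 0.5f) * t * k;

	// regular nlerp with the remapped parameter
	float lt = 1.f - tp;
	float rt = ca > 0 ? tp : -tp;

	float x = l.x * lt + r.x * rt;
	float y = l.y * lt + r.y * rt;
	float z = l.z * lt + r.z * rt;
	float w = l.w * lt + r.w * rt;

	float s = 1.f / std::sqrt(x * x + y * y + z * z + w * w);
	return { x * s, y * s, z * s, w * s };
}
```

&lt;p&gt;Note that the flip test uses the sign of the raw dot product, exactly as in $nlerp$ - only the interpolation parameter changes.&lt;/p&gt;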

&lt;h2 id=&quot;evaluating-approximation-error&quot;&gt;Evaluating approximation error&lt;/h2&gt;

&lt;p&gt;All of the approximations above used a least-squares error metric in terms of $K$. However, $K$ is not meaningful by itself - it is just an intermediate value needed to compute $t&apos;$. We can compute the error in $t&apos;$, but the ultimate metric that we care about is the interpolation result - how much does the resulting quaternion deviate from the one obtained using $slerp$?&lt;/p&gt;

&lt;p&gt;Without loss of generality we can assume that the input quaternions were $q_0=(0,0,0,1)$ and $q_1=(\sqrt{1-d^2},0,0,d)$. The result of the linear interpolation, its normalized form, and the scalar component of the result are then:&lt;/p&gt;

&lt;p&gt;$ q_{lerp} = (\sqrt{1-d^2}t&apos;,0,0,(1-t&apos;)+dt&apos;) $&lt;/p&gt;

&lt;p&gt;$ q_{nlerp} = \frac{(\sqrt{1-d^2}t&apos;,0,0,(1-t&apos;)+dt&apos;)}{|(\sqrt{1-d^2}t&apos;,0,0,(1-t&apos;)+dt&apos;)|} $&lt;/p&gt;

&lt;p&gt;$ q_w = \frac{(1-t&apos;)+dt&apos;}{\sqrt{(1-d^2){t&apos;}^2 + (1-t&apos;+dt&apos;)^2}} $&lt;/p&gt;

&lt;p&gt;Since the scalar component of the quaternion is the cosine of the half-angle of rotation, and in our case we’re starting from angle 0, we expect that for any parameter $t$ we’ll get the half-angle $t\arccos d$. This lets us define the absolute angular error:&lt;/p&gt;

&lt;p&gt;$ e = 2|t\arccos d - \arccos \frac{(1-t&apos;)+dt&apos;}{\sqrt{(1-d^2){t&apos;}^2 + (1-t&apos;+dt&apos;)^2}}| $&lt;/p&gt;
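
&lt;p&gt;This error is easy to evaluate numerically - for example, a brute-force scan over the domain (a sketch; for plain $nlerp$, $t&apos;=t$) recovers the maximum error figures below:&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <algorithm>

// Angular error vs slerp for q0 = (0,0,0,1), q1 = (sqrt(1-d^2),0,0,d),
// per the formula above; tp is the (possibly remapped) nlerp parameter.
double angular_error(double d, double t, double tp)
{
	double w = (1 - tp) + d * tp;
	double x = std::sqrt(1 - d * d) * tp;

	return 2 * std::fabs(t * std::acos(d) - std::acos(w / std::sqrt(x * x + w * w)));
}

// Maximum error of plain nlerp (tp = t) over a grid covering d, t in [0..1]
double nlerp_max_error()
{
	double e = 0;

	for (int i = 0; i <= 1000; ++i)
		for (int j = 0; j <= 1000; ++j)
			e = std::max(e, angular_error(i / 1000.0, j / 1000.0, j / 1000.0));

	return e;
}
```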

&lt;p&gt;We can now measure the maximum error for $nlerp$ and all three approximations and get:&lt;/p&gt;

&lt;p&gt;$ e_{nlerp} = 1.42229 * 10^{-1} = 8.15^{\circ} $&lt;/p&gt;

&lt;p&gt;$ K_0: e_0 = 6.96632 * 10^{-3} = 0.40^{\circ} $&lt;/p&gt;

&lt;p&gt;$ K_1: e_1 = 1.09562 * 10^{-3} = 0.06^{\circ} $&lt;/p&gt;

&lt;p&gt;$ K_2: e_2 = 7.76255 * 10^{-4} = 0.04^{\circ} $&lt;/p&gt;

&lt;p&gt;This is pretty good - remember, this is the maximum absolute error across our entire range! We can clearly see that our efforts to make the approximation more precise paid off - using a more involved approximation for $K$ together with carefully fitting the coefficients reduced the error by an order of magnitude. Also note that all errors reach their maximum value for quaternions that are at almost $180^{\circ}$ rotation from each other. If we reduce the interval so that the initial quaternions are at most $90^{\circ}$ from each other, we get:&lt;/p&gt;

&lt;p&gt;$ e_{nlerp} = 1.60363 * 10^{-2} = 0.91^{\circ} $&lt;/p&gt;

&lt;p&gt;$ K_0: e_0 = 1.12533 * 10^{-4} = 0.006^{\circ} $&lt;/p&gt;

&lt;p&gt;$ K_1: e_1 = 1.24728 * 10^{-4} = 0.007^{\circ} $&lt;/p&gt;

&lt;p&gt;$ K_2: e_2 = 7.22881 * 10^{-5} = 0.004^{\circ} $&lt;/p&gt;

&lt;p&gt;It’s interesting that while our approximations do help, they are not that different from each other once the angle between the quaternions is not too high. This makes sense if you recall that $K$ was very flat for large values of $d$ - so we don’t really get more precision because our basic approximation was good enough!&lt;/p&gt;

&lt;p&gt;Note also that we got pretty good results despite not optimizing for the maximum error. All of our fits minimized the sum of squared errors, and the error was measured in terms of internal parameters rather than the angle. I tried to explicitly refit the equations to minimize the maximum angular error, but was not very successful - the results ended up being close, so let’s leave it at that.&lt;/p&gt;

&lt;p&gt;Now that we have two good approximations and have analyzed the error, we can make an informed decision about whether to use a more or a less precise implementation. In the next article we will look at implementations of the proposed approximations to see the relative performance of all interpolation methods.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Of course, in reality you have to make tradeoffs so it will be slower than $nlerp$ and less precise than $slerp$… &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It is possible to find a general fit for a polynomial of two variables using methods like GLM instead of trying to guess a good form of an approximation. I tried to use GLM for this problem and the results are comparable in terms of precision but are slightly more expensive to compute if you try to use a generic polynomial of the same degree. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It is crucial for an interpolation function to be exact in 0 and 1; having an exact solution for 0.5 is a nice to have. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Thu, 23 Jul 2015 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2015/07/23/approximating-slerp/</link>
			<guid isPermaLink="true">https://zeux.io/2015/07/23/approximating-slerp/</guid>
		</item>
		
		<item>
			<title>A queue of page faults</title>
			<description>&lt;p&gt;Execution time in many programs is dominated by memory access time, not compute time. This is becoming increasingly true  with higher instruction-level parallelism, wider SIMD, larger core counts and a lack of breakthroughs in memory access latency. A lot of performance talks now start by explaining that in a cache hierarchy the last-level cache miss is 100x or more expensive than a first-level cache hit, TLB misses are scary and contiguous data is great. But there is another beast that lurks in the depths of virtual memory subsystem, and its name is Page Fault.&lt;/p&gt;

&lt;p&gt;Oh, hi. &lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Last week Bruce Dawson published a post &lt;a href=&quot;http://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/&quot;&gt;Hidden Costs of Memory Allocation&lt;/a&gt; that explains a piece of unconventional wisdom - large block allocation is expensive. While any seasoned C++ programmer has an adequate cost model &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; for small allocations, and there are all sorts of tools to deal with their cost - specialized free lists, bump-pointer allocators, data structures that favor arrays over separately allocated nodes - allocations are usually perceived to have a cost that scales with their number, not their size.&lt;/p&gt;

&lt;p&gt;Turns out, nothing is free - and allocating a multi-megabyte block has a real cost. What’s worse, this cost is not paid when the allocation is performed - it can be paid when you perform a memory access or even when a background thread in another process fills it with zeros. The post mentioned above describes the problem in more detail using synthetic tests - I want to focus on an easily measurable impact in a real application.&lt;/p&gt;

&lt;h3 id=&quot;measuring-the-cost&quot;&gt;Measuring the cost&lt;/h3&gt;

&lt;p&gt;One of my projects, &lt;a href=&quot;https://github.com/zeux/qgrep&quot;&gt;qgrep&lt;/a&gt;, is a fast grep database designed to quickly search through code bases of varying sizes using regular expressions. Performance is the main feature of this project (if you don’t care about performance you can just use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt;) - most of the work involved was optimization and there were some interesting decisions made and algorithms used that I may blog about in the future. For now let’s focus on memory.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qgrep&lt;/code&gt; uses one “database” file that consists of a lot of compressed chunks stored sequentially. The main thread goes through the file, allocates enough memory for chunk processing (so that both the compressed and uncompressed data fit), reads the (compressed) chunk data from the file and hands the chunk over to one of the worker threads. A worker thread decompresses the chunk data and performs the regular expression search on the result. Worker threads don’t allocate memory - they work in the chunk memory that was allocated by the main thread.&lt;/p&gt;

&lt;p&gt;Chunks are usually around 512 Kb (uncompressed); the average compression ratio for source code is 1:4 so the main thread performs an allocation of roughly 512+128=640 Kb for every chunk, reads 128 Kb from the file and passes the result to one of the threads (there is a worker thread per core). &lt;a href=&quot;https://github.com/zeux/qgrep/commit/4885f8a88dec0d8329b11d12692bd018f7051232&quot;&gt;Back in March 2012&lt;/a&gt; I implemented a simple free-list pool for these allocations that resulted in significant speedups on some queries… But this free-list contains huge blocks - we’re using a block size of 512+256=768 Kb to satisfy ~640 Kb requests; how can it possibly be a win?&lt;/p&gt;
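
&lt;p&gt;The pool itself doesn’t have to be complicated; here’s a minimal sketch of the idea (illustrative only - the actual implementation is in the commit linked above):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// A minimal fixed-size block pool: blocks are recycled instead of being
// returned to the system, so their pages stay mapped after the first use.
// An illustrative sketch, not the actual qgrep implementation; blocks
// still in flight at destruction time are leaked here for simplicity.
class BlockPool
{
public:
	explicit BlockPool(size_t blockSize): blockSize_(blockSize) {}

	~BlockPool()
	{
		for (char* p : free_) delete[] p;
	}

	char* acquire()
	{
		std::lock_guard<std::mutex> lock(mutex_);

		if (!free_.empty())
		{
			char* p = free_.back();
			free_.pop_back();
			return p; // reuse: no new page faults on first touch
		}

		return new char[blockSize_];
	}

	void release(char* p)
	{
		std::lock_guard<std::mutex> lock(mutex_);
		free_.push_back(p);
	}

private:
	size_t blockSize_;
	std::mutex mutex_;
	std::vector<char*> free_;
};
```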

&lt;p&gt;To find out, let’s profile the application and figure this out. I’m using &lt;a href=&quot;https://github.com/torvalds/linux/releases/tag/v3.19-rc1&quot;&gt;Linux kernel v3.19-rc1&lt;/a&gt; as the data set, and grepping for the regular expression &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fooo?bar&lt;/code&gt; that has the following matches&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ qgrep init linux ~/linux
$ qgrep update linux
$ qgrep search linux fooo?bar
~/linux/arch/m68k/include/asm/openprom.h:298:
    int vector; /* This is foobar, what does it do? */
~/linux/arch/sparc/include/asm/openprom.h:205:
    int vector; /* This is foobar, what does it do? */
~/linux/arch/um/include/shared/init.h:23:
 * extern int initialize_foobar_device(int, int, int) __init;
~/linux/drivers/block/pktcdvd.c:1462:
static int kcdrwd(void *foobar)
~/linux/drivers/block/pktcdvd.c:1464:
    struct pktcdvd_device *pd = foobar;
~/linux/drivers/of/unittest.c:407:
    selftest(rc == 0 &amp;amp;&amp;amp; !strcmp(strings[0], &quot;foobar&quot;), &quot;of_property_read_string_index() failure; rc=%i\n&quot;, rc);
~/linux/drivers/usb/storage/unusual_devs.h:667:
/* Submitted by Michal Mlotek &amp;lt;mlotek@foobar.pl&amp;gt; */
~/linux/include/linux/init.h:26:
 * extern int initialize_foobar_device(int, int, int) __init;
~/linux/tools/perf/util/quote.h:15:
 * sprintf(cmd, &quot;foobar %s %s&quot;, sq_quote(arg0), sq_quote(arg1))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To find these matches, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qgrep&lt;/code&gt; has to scan through 466 Mb of source data that is compressed to 122 Mb in 932 chunks. On my laptop it takes around 100 ms to find the matches using 8 threads. To analyze the effect the aforementioned change has on performance, we’ll run the tests on Mac OSX (32/64 bit process) and Windows 7 (32/64 bit process), using 1 or 8 threads and with or without allocation pooling and measure the wall time. Here are the results (averaged over 100 runs):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Platform&lt;/th&gt;
      &lt;th&gt;8 threads&lt;br /&gt;pool&lt;/th&gt;
      &lt;th&gt;8 threads&lt;br /&gt;no pool&lt;/th&gt;
      &lt;th&gt;1 thread&lt;br /&gt;pool&lt;/th&gt;
      &lt;th&gt;1 thread&lt;br /&gt;no pool&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;MacOSX 32-bit&lt;/td&gt;
      &lt;td&gt;131 ms&lt;/td&gt;
      &lt;td&gt;142 ms&lt;/td&gt;
      &lt;td&gt;570 ms&lt;/td&gt;
      &lt;td&gt;610 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MacOSX 64-bit&lt;/td&gt;
      &lt;td&gt;113 ms&lt;/td&gt;
      &lt;td&gt;124 ms&lt;/td&gt;
      &lt;td&gt;510 ms&lt;/td&gt;
      &lt;td&gt;543 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Windows 32-bit&lt;/td&gt;
      &lt;td&gt;120 ms&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;204 ms&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;433 ms&lt;/td&gt;
      &lt;td&gt;471 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Windows 64-bit&lt;/td&gt;
      &lt;td&gt;100 ms&lt;/td&gt;
      &lt;td&gt;120 ms&lt;/td&gt;
      &lt;td&gt;404 ms&lt;/td&gt;
      &lt;td&gt;404 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;When the pool is not used, we end up requesting ~600 Mb from the system; when the pool is used, we can satisfy some requests using the free list so we end up requesting ~100 Mb with 8 threads and ~300 Mb with 1 thread. This makes sense since with 1 thread the data processing is slower so worker threads return allocated blocks to the pool more slowly and we end up with a larger queue of chunks.&lt;/p&gt;

&lt;p&gt;The results in the table above mostly make sense - there must be some overhead associated with allocating memory, and as Bruce Dawson’s post suggests this overhead grows with the total requested size. Switching to 64-bit improves performance since we have more registers, so the compiler can optimize the inner loops better. However, there is an outlier - in the 32-bit build on Windows, using the custom pool increases performance almost two-fold. Wait, what?&lt;/p&gt;

&lt;h3 id=&quot;finding-the-reason&quot;&gt;Finding the reason&lt;/h3&gt;

&lt;p&gt;Let’s use the excellent &lt;a href=&quot;http://msdn.microsoft.com/en-us/library/dd537632.aspx&quot;&gt;Visual Studio Concurrency Visualizer&lt;/a&gt; tool&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; to find out what’s going on! Here are the screenshots with 8-thread mode (click to enlarge):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/qgrep_pf_pool.png&quot;&gt;&lt;img src=&quot;/images/qgrep_pf_pool.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;Using the pool&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/qgrep_pf_nopool.png&quot;&gt;&lt;img src=&quot;/images/qgrep_pf_nopool.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p class=&quot;caption&quot;&gt;Not using the pool&lt;/p&gt;

&lt;p&gt;The first “worker thread” in these timelines is the thread that handles output processing (it usually is not CPU-heavy but for some reason there’s a significant time in the first capture spent initializing console output…), and the remaining 8 threads handle decompression and search. Green sections represent CPU activity, red sections represent the thread waiting on a synchronization primitive and orange sections represent “memory management” - in our case, page fault handling.&lt;/p&gt;

&lt;p&gt;As we can see, when we’re using the pool the threads are generally either doing work or waiting for more work to be queued by the main thread; however, when not using the pool we have a lot of orange sections everywhere up until the end of processing. Let’s zoom in and look at one of the orange sections:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/qgrep_pf_nopool_details.png&quot;&gt;&lt;img src=&quot;/images/qgrep_pf_nopool_details.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the curse of page faults. When we start using a memory page that has not yet been mapped to a physical location, we get a page fault that is handled by the kernel - so whatever code touches the memory first (like LZ4 decompression in this case) can suddenly get slower and if you’re not looking at the kernel stacks there is no way for you to find out. There is something else suspicious in this image though. In several instances we have multiple threads performing page fault handling, and the threads continue execution at almost exactly the same time! Is it possible that… page fault handling is serialized in the kernel?&lt;/p&gt;

&lt;p&gt;Let’s look at some numbers. When the pool is not used, we’re allocating ~600 Mb from the system. We can confirm that we get page faults on all of this memory by inspecting the page fault counter (we can read it using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetProcessMemoryInfo&lt;/code&gt; call from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Psapi.h&lt;/code&gt;); it’s 152k faults in our case, which, given a 4k page size, means that every single allocation we perform initially uses non-committed pages, so we have to pay the page fault cost for all of that memory.&lt;/p&gt;

&lt;p&gt;Bruce Dawson measures the cost of page faults to be 175 μs per Mb, which for 500 Mb (600 Mb without pool vs 100 Mb with pool) is… wait for it… 87 ms. Which is suspiciously close to the observed timing difference (204 ms - 120 ms) for 8 threads. For 1 thread the memory difference is 200 Mb, and the timing difference is 38 ms which is exactly equal to 471 ms - 433 ms!&lt;/p&gt;

&lt;p&gt;Based on the observed behavior we can come to a conclusion - you will pay around 175 μs for every megabyte of page faults (so ~680 ns per page fault), page fault processing is single-threaded, and if you happen to hit a page fault in two threads their execution will be serialized. This can be a significant problem if you’re processing a lot of data using all available cores with otherwise heavily optimized code.&lt;/p&gt;

&lt;h3 id=&quot;the-devil-is-in-the-detail&quot;&gt;The devil is in the detail&lt;/h3&gt;

&lt;p&gt;But wait, why do we not see the problem on x64? Does the kernel map pages on x64 in a more performant way? Yeah, right.&lt;/p&gt;

&lt;p&gt;Remember that you’re not really paying for memory that you &lt;em&gt;allocate&lt;/em&gt; - you’re paying for memory that you use after &lt;em&gt;mapping&lt;/em&gt; it into the address space of the process. Normally the heap implementation is designed to get memory from the system and keep it in various free lists without returning it to the system - however, not all block sizes are treated like this. Most heap implementations fall back to using virtual memory allocation above a certain size.&lt;/p&gt;

&lt;p&gt;You can refer to &lt;a href=&quot;http://illmatics.com/Windows%208%20Heap%20Internals.pdf&quot;&gt;Windows 8 Heap Internals&lt;/a&gt; for an in-depth look at how Windows heap allocation/deallocation works; notice that for allocations larger than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VirtualMemoryThreshold&lt;/code&gt; the implementation switches to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VirtualAlloc&lt;/code&gt;&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; that reserves the pages but does not fully commit them. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VirtualMemoryThreshold&lt;/code&gt; is set to 508 Kb and our allocations are slightly larger than 512 Kb so we should use memory allocated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VirtualAlloc&lt;/code&gt; even on x64… right?&lt;/p&gt;

&lt;p&gt;Alas, either the heap implementation is different on Windows 7 or the linked article is wrong - but the actual code in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RtlpAllocateHeap&lt;/code&gt; that decides which allocation path to use looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;# x86
shr ecx, 3
mov dword ptr [ebp-2Ch], ecx
...
mov eax, dword ptr [ebp-2Ch]
cmp eax, dword ptr [ebx+60h]
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;# x64
shr r13, 4
...
mov eax, dword ptr [rbx+9Ch]
cmp r13, rax
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So the threshold - which is actually 65024 for x86 and 65280 for x64 - is defined in terms of allocation size granularity, which is 8 bytes on x86 and 16 bytes on x64. Thus on x64 the actual cutoff is just under 1 Mb, not 512 Kb, and our allocations fall below this threshold. This causes the heap implementation to use a free list to satisfy these requests - which solves the memory mapping problem in the same way as our custom free list: once we allocate and use the pages, they stay mapped into our address space, so we only pay the page fault cost once.&lt;/p&gt;

&lt;p&gt;Now that we know that allocations have to go through the virtual memory subsystem directly for us to see a difference, let’s look at the actual behavior on Mac OSX. The observed performance difference there is not significant - but maybe that is because the allocator is not using newly mapped memory for every chunk. We can switch to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mmap&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;munmap&lt;/code&gt; to find out:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Platform&lt;/th&gt;
      &lt;th&gt;8 threads&lt;br /&gt;pool&lt;/th&gt;
      &lt;th&gt;8 threads&lt;br /&gt;no pool&lt;/th&gt;
      &lt;th&gt;1 thread&lt;br /&gt;pool&lt;/th&gt;
      &lt;th&gt;1 thread&lt;br /&gt;no pool&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;MacOSX 64-bit&lt;br /&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;113 ms&lt;/td&gt;
      &lt;td&gt;124 ms&lt;/td&gt;
      &lt;td&gt;510 ms&lt;/td&gt;
      &lt;td&gt;543 ms&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MacOSX 64-bit&lt;br /&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mmap&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;107 ms&lt;/td&gt;
      &lt;td&gt;165 ms&lt;/td&gt;
      &lt;td&gt;510 ms&lt;/td&gt;
      &lt;td&gt;604 ms&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Upon a cursory glance at &lt;a href=&quot;http://www.opensource.apple.com/source/Libc/Libc-594.1.4/gen/magazine_malloc.c&quot;&gt;magazine_malloc.c&lt;/a&gt; it looks like large allocations are cached (see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LARGE_CACHE&lt;/code&gt;) under certain circumstances, which can explain why using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mmap&lt;/code&gt; results in a noticeable performance difference when disabling allocation pooling.&lt;/p&gt;

&lt;p&gt;Interestingly, the timing differences for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mmap&lt;/code&gt; may suggest that the OSX kernel actually does not serialize page fault requests from multiple cores - disabling the pool with 8 threads results in an extra 58 ms for an extra 500 Mb of allocated memory (8.4 Gb/s), whereas disabling the pool with 1 thread results in a 94 ms cost for 200 Mb of memory (2 Gb/s), so the processing seems to scale with the number of cores. It should be possible to confirm or disprove this by reading the kernel sources but this post already took too long to write - please leave a comment if you know whether this is accurate!&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;This article takes another look at the effect of (soft) page faults on performance. Note that the resulting performance difference is very dramatic - admittedly, this is a somewhat special case because the rest of the processing is highly optimized (page faults at 5.7 Gb/s start to be noticeable once your processing itself performs at 6 Gb/s…).&lt;/p&gt;

&lt;p&gt;Having said that, this is actually quite common - a lot of components that are widely used turn out to be really slow once you set very aggressive performance limits. Slow single-core page remapping makes certain interesting approaches like &lt;a href=&quot;http://www.azulsystems.com/technology/c4-garbage-collector&quot;&gt;C4 garbage collector&lt;/a&gt; infeasible, among other things&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Investigating performance issues requires good tools and willingness to dive deeply into implementation details. I wish more platforms had good built-in profilers with timeline visualization similar to Concurrency Visualizer. I wish some platforms shipped with an open-source kernel. Will we see Windows kernel on GitHub one day?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to Bruce Dawson for corrections and suggestions for clarification.&lt;/em&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Time flies. If you remember this blog from 4 years ago you may have noticed that I changed the blogging platform again. This resulted in some spurious RSS updates - sorry about that! Posts and comments have been migrated and the feed should be stable now. Please let me know if something is off. Oh, and I will not promise to blog on a regular basis since apparently it does not end well. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I may be exaggerating here since the actual overhead of small allocations depends a lot on the implementation details (operating system version, process bitness, etc). The cost model usually boils down to “they can be expensive”, which is probably good enough. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Why not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foobar&lt;/code&gt;? Wait until another article about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qgrep&lt;/code&gt; internals to find out! &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is a profiler that uses ETW to gather important events about the system and displays them with a UI that you can actually enjoy using, unlike other ETW-based tools Microsoft provides. The tool is not yet available for Visual Studio 2015 so we’ll use Visual Studio 2013. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;… which causes the implementation to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VirtualFree&lt;/code&gt; on deallocation instead of putting the block on a free list. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A few years ago I had to stop using memory mapped files in .NET because the thread that waits for hard page fault (which takes ages compared to a soft page fault!) often blocked the thread performing virtual memory operations as a part of GC processing, resulting in I/O stalls affecting the entire application. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
			<pubDate>Sun, 21 Dec 2014 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2014/12/21/page-fault-queue/</link>
			<guid isPermaLink="true">https://zeux.io/2014/12/21/page-fault-queue/</guid>
		</item>
		
		<item>
			<title>Quantizing floats</title>
			<description>&lt;p&gt;Over the next few posts I’d like to write about optimizing mesh data for run-time performance (i.e. producing vertex/index buffers that accurately represent the source model and are as fast to render for GPU as possible).&lt;/p&gt;

&lt;p&gt;There are several important things you have to do in order to optimize your meshes, and one of them is packing your vertex/index data. Packing index data is trivial - for any sane mesh there are no more than 65536 unique vertices, so a 16-bit index buffer is enough; this is a small win, but it comes almost for free. Reducing the vertex size is more complex.&lt;/p&gt;

&lt;p&gt;In order to compress your vertex data you have to know the nature of your data (sign, range, special properties (like, is it a normalized vector), value distribution) and the available compression options. This is the topic for the next article; today I want to talk about quantization.&lt;/p&gt;

&lt;p&gt;All methods of vertex compression that are trivially implementable on GPU involve taking the floating-point source data and storing it in a value with less bits of precision; usually the value is either an integer or a fixed-point with a limited range (typically [-1; 1] or [0; 1]). This process is known as quantization.&lt;/p&gt;

&lt;p&gt;The goal of quantization is to preserve the original value with as much accuracy as possible - i.e., given a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode(x)&lt;/code&gt; function, which converts from fixed-point to floating-point, produce an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encode(x)&lt;/code&gt; function such that the error, i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;abs(decode(encode(x)) - x)&lt;/code&gt;, is minimized. Additionally, it may be necessary to encode a finite set of numbers perfectly (i.e. so that the error is zero) - for example, it is usually useful to preserve the endpoints: if you’re quantizing pixel component values, you’re encouraged to encode 0 and 1 exactly, or pixels that were previously fully transparent will start to slightly leak color onto the background, and pixels that were previously completely white will darken slightly if you exponentiate their intensity.&lt;/p&gt;

&lt;p&gt;Note that the error function is defined in terms of both encode and decode functions - so the search for a quantization function should start with the decode function. For GPUs, decode functions are usually fixed - there are special ‘normalized’ formats that, when used in a vertex declaration, automatically decode the value from a small-precision integer to a limited-range floating-point value. While it is certainly possible to use integer formats and do the decoding yourself, the default decode functions are usually sane.&lt;/p&gt;

&lt;p&gt;So, what are the functions? For DirectX 10, there are *_UNORM and *_SNORM formats. Their decoding is described in the documentation: for *_UNORM formats of n-bit length, the decode function is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode(x) = x / (2^n - 1)&lt;/code&gt;, for *_SNORM formats of n-bit length the decode function is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode(x) = clamp(x / (2^(n-1) - 1), -1, 1)&lt;/code&gt;. In the first case x is assumed to be an unsigned integer in [0..2&lt;sup&gt;n&lt;/sup&gt;-1] interval, in the second case it’s a signed integer in [-2&lt;sup&gt;n-1&lt;/sup&gt;..2&lt;sup&gt;n-1&lt;/sup&gt;-1] interval.&lt;/p&gt;

&lt;p&gt;In the UNORM case the [0..1] interval is divided into 2^n - 1 equal parts. You can see that 0.0 and 1.0 are represented exactly; 0.5, on the other hand, is not. The SNORM case is slightly more complex - the integer range is not symmetric, so two values map to -1.0 (-2&lt;sup&gt;n-1&lt;/sup&gt; and -(2&lt;sup&gt;n-1&lt;/sup&gt; - 1)).&lt;/p&gt;

&lt;p&gt;This is only one example; other APIs may specify different behaviors. For example, OpenGL 2.0 specification has the same decoding function for unsigned numbers, but a different one for signed: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode(x) = (2x + 1) / (2^n - 1)&lt;/code&gt;. This has slightly better precision (all numbers encode distinct values), but can’t represent 0 exactly. &lt;a href=&quot;http://www.x.org/docs/AMD/R5xx_Acceleration_v1.3.pdf&quot;&gt;AMD GPU documentation&lt;/a&gt; describes a VAP_PSC_SGN_NORM_CNTL register, which may be used to set the normalization behavior to that of either OpenGL, Direct3D 10 or a similar method to Direct3D 10, but without [-1..1] range clamping (i.e. the actual range is not symmetrical).&lt;/p&gt;

&lt;p&gt;Once we know the decoding formula, it’s easy to infer the encoding formula which gives the minimum error on average. Let’s start with unsigned numbers first. We have a [0..1] floating point number, and a 3-bit unsigned integer ([0..7] integer range).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/qfloat_unorm.png&quot;&gt;&lt;img src=&quot;/images/qfloat_unorm.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First let’s mark all values that are exactly representable using the decode function on the [0..1] range (the top row of numbers and the black lines denote these) - just decode all integers from the range and draw a line. Now, in order to minimize the error, for every number we want to encode we pick the closest line and select the corresponding integer. I’ve drawn red lines that are exactly in the middle between the corresponding black lines; all numbers between two red lines (which correspond to values in the row labeled ‘original’) will be encoded to the same number. The number each subrange should encode to is specified in the bottommost row.&lt;/p&gt;

&lt;p&gt;Now we can visualize the encoding; all that’s left is to provide a function. Note that the encoding is not exactly uniform - the size of the leftmost and rightmost subranges is half that of all other subranges. This is not a problem, since we’re optimizing for minimal error, not for equal range lengths.&lt;/p&gt;

&lt;p&gt;The function is easy - if you multiply all numbers from the row ‘original’ by 7 (2&lt;sup&gt;n&lt;/sup&gt; - 1), you’ll see that all that’s left is to apply the round-to-nearest function; since we’re limited to unsigned numbers, the encode function is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encode(x) = int (x * 7.0 + 0.5)&lt;/code&gt; (adding 0.5 is the standard way to turn round-to-zero, which is the C float-to-int cast behavior, into round-to-nearest for positive numbers).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/qfloat_snorm.png&quot;&gt;&lt;img src=&quot;/images/qfloat_snorm.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is another image for the signed numbers, using Direct3D 10 rules. The range is [-1..1], and we still have a 3-bit integer with a [-4..3] range - we’re going to provide an encoding function that gives us a number in the [-3..3] range. Using exactly the same reasoning as above, to encode the number we have to multiply it by 3 and then round to the nearest integer. Be careful - since a float-to-int cast does a round-to-zero, or a truncate, the round function is slightly more complex. The encode function is as follows: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encode(x) = int (x * 3.0 + (x &amp;gt; 0 ? 0.5 : -0.5))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Just for reference, three functions for quantizing values to 8 bits are:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Unsigned quantization: input: [0..1] float; output: [0..255] integer&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;255.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Signed quantization for D3D10 rules: input: [-1..1] float; output: [-127..127] integer&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;127.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Signed quantization for OpenGL rules: input: [-1..1] float; output: [-128..127] integer&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;127.5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;3/4/2014: the original version of signed quantization for OpenGL was incorrect for negative values; the bug is fixed. See comments for discussion.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These functions are the perfect foundation for the next step: reducing the size of vertex buffer by reducing the vertex size. Until next time!&lt;/p&gt;
</description>
			<pubDate>Tue, 14 Dec 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/12/14/quantizing-floats/</link>
			<guid isPermaLink="true">https://zeux.io/2010/12/14/quantizing-floats/</guid>
		</item>
		
		<item>
			<title>Exit code trivia</title>
			<description>&lt;p&gt;Whenever there is an automated process involved, such as asset/code building, unit testing, automatic version packaging, bulk log processing, etc., there often is a set of command-line tools which do their thing and return the result. Then there is a calling process (which may be as simple as a batch file, or as complex as IncrediBuild), which launches the tool and acts upon success/failure.&lt;/p&gt;

&lt;p&gt;In the world of command-line tools, success or failure is communicated via the exit code. However, it is important to understand that exit codes have to be treated carefully.&lt;/p&gt;

&lt;p&gt;Here is a rough set of guidelines to handling exit codes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The canonical success code is 0, not 1. This is also true for return codes of functions - 0 always means success. Never return 1 from your command-line tool to communicate success - no caller will expect this.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Related to the above - there should be only one success code, i.e. everything else should be treated as an error. There is no unambiguous encoding for several success values; the user probably does not care about the details - success is enough; and for some system calls, like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;system()&lt;/code&gt;, cross-platform handling of different success values results in extra work (Windows returns the exit code as is, Linux returns a value that contains the exit code and additional information).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In the vast majority of cases you don’t need more than one error code either. The reasons are the same.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Even if you decide to use several error codes, do not use negative numbers. Some negative numbers may be used as special values for functions that normally return exit codes - in fact, one such number is -1; the family of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spawn&lt;/code&gt; functions returns -1 on error, so if you return -1 from your tool, the resulting error will be unexpected. We had one such case with SCons, where matters were additionally complicated by the fact that -1 raised an OSError exception, which was swallowed by the SCons internals for some weird reason.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the tool fails, returning an error code is not enough - you should also output additional error information, as detailed as needed to investigate the issue further (i.e. don’t just return a ‘file load failed’ flag; print the name of the file that the program failed to open, and the error code).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As a somewhat related thing, if the tool succeeds, prefer less verbose output. An ideal tool outputs zero lines of information on success (which reduces the clutter, enables easier detection of warnings, and generally makes people pay attention to problems in the automated process because they are the only thing that’s printed!). If you need debugging/statistics information, consider adding a separate command-line flag. If you need version information for diagnostics, output it when a special command-line flag is used, not on every build.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Be careful with batch files. It is very easy to accidentally lose an exit code in the batch file. In fact, if you can avoid batch files completely or make them one-liners that call your script interpreter of choice, do it; if you can’t, still try to go that way as far as possible.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So basically, if you only use 0 (success) and 1 (failure) exit codes, return additional failure information via stdout/stderr, and don’t pollute stdout with things that are not indications of some problem, the users of your command line tool will love you.&lt;/p&gt;
</description>
			<pubDate>Mon, 06 Dec 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/12/06/exit-code-trivia/</link>
			<guid isPermaLink="true">https://zeux.io/2010/12/06/exit-code-trivia/</guid>
		</item>
		
		<item>
			<title>Optimizations that aren&apos;t</title>
			<description>&lt;p&gt;We all like it when our code is fast. Some of us like the result, but dislike the process of optimization; others enjoy the process. However, optimization for the sake of optimization is wrong, unless you’re doing it in your pet project. Optimized code is sometimes less readable and, consequently, harder to understand and modify; because of that, optimization often introduces subtle bugs.&lt;/p&gt;

&lt;p&gt;Since optimization is not a process with only positive effects, in production it’s important that the optimization process follow certain guidelines to make sure it does more good than harm. An example set of optimization steps would be:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Make sure that the code you’re optimizing works. If possible, it should be covered by tests; otherwise you can resort to saving the results that the code produces, e.g. a data array for a particular input, or a screenshot.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Measure the performance of the target code in a specific situation, for example on a fixed set of input data, or, in case of games, at the very beginning of the level, or measure the average/maximum timings across the whole level.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Verify that the measurements are precise enough, i.e. don’t have a very large variation between runs.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Verify that the performance is inadequate for your target requirements (you can’t start optimizing if you don’t know your target requirements). It’s important that the measured situation is common enough - ideally you should measure in the worst possible circumstances for the code, which are still possible in the target product (i.e. if the unit number cap is 1000, profile with 1000 units). If necessary, make several measures in different situations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Record the timings/memory statistics/other performance-related information.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Optimize the code using any available means, starting with the ones that are easier to code and minimally affect maintainability. In game development, if there is a substantial gain that is necessary, maintainability reasons should probably be cast aside.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Check that the code still works (run the unit tests, compare the results with those from step 1).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Measure using the same data as in step 2, compare the results, and repeat the process if necessary.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are two absolutely crucial things here - make sure that the code still works, and have proper before and after profiling results. Often it’s useful to make a note of the results after each significant chunk of optimization and save them somewhere - some optimizations might get in the way later, and with the records you’ll probably be able to separate critical optimizations from less critical ones.&lt;/p&gt;

&lt;p&gt;If you did not verify the code, it’s possible that the code now does something different - such optimization is usually bad (one exception is rendering algorithms, where usually you can replace ‘is exactly the same’ with ‘looks something like’ or even ‘is noticeably different, but the artists like it better/can live with it’).&lt;/p&gt;

&lt;p&gt;If you did not profile the code, you don’t know if it works faster, and if it does, if it is considerably faster. Such optimization is worthless.&lt;/p&gt;

&lt;p&gt;I have an actual story about that. Unfortunately, the information I have is incomplete - I have the code with an “optimization” that considerably decreases the actual performance, but I don’t have the change history. Still.&lt;/p&gt;

&lt;p&gt;There is (was?) a COLLADA Exporter by Feeling Software, which, given an input Maya scene, produces a COLLADA XML document. This process is done at export time, which is either triggered by the artist manually, or is done automatically during the build process. The performance requirements for such tools are obviously different from the ones of a game - but optimizing the content pipeline response time is arguably equally important to optimizing game framerate, because faster iteration times and a good team mean more iterations, and more iterations mean more polished product.&lt;/p&gt;

&lt;p&gt;Back at CREAT Studios, we used the COLLADA pipeline for Maya/Max export; we tried to avoid touching the code, but sometimes we could not avoid it. An awesome export response time for a mesh is one second; a good one is ten seconds. We had some models that took several minutes to export. After some profiling, several issues showed up - here is one of them.&lt;/p&gt;

&lt;p&gt;During the export, several parts of a document can reference the same nodes from the Maya DAG (Directed Acyclic Graph - pretty much the entire scene in Maya is a DAG); it is necessary to ‘sample’ said nodes (i.e. to get the values of some attributes of these nodes at different time values). Sampling can be slow in Maya, because it can involve complex updates of the DAG - to accelerate it, there is a special class, CAnimCache, that caches the sampling requests. The key for a sampling request is a pair (object, attribute); the value is the list of attribute values and several flags. The object is represented as an MObject, the attribute as an MPlug.&lt;/p&gt;

&lt;p&gt;The cache is organized as follows: there is an associative container with the key being the object, and the value being a list of parts. Each part holds the attribute and the cached value:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Part&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MPlug&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FloatList&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Node&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MObject&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Part&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cache&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MObject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The code looks reasonable - the cache lookup is logarithmic in object count and then linear in attribute count; objects usually have a modest number of attributes, so it should be fast enough. The cache key could probably be a pair of pointers, but oh well.&lt;/p&gt;

&lt;p&gt;Still, somebody decided that this code was not fast enough - specifically, that the map lookup was slow. I do not know whether any performance measurements were made first; I suspect they were not, or perhaps the map was still a vector when the change was made.&lt;/p&gt;

&lt;p&gt;It’s easy to optimize the map lookup if we assume that consecutive cache lookups usually happen with the same object but different attributes - this is a reasonable assumption, and it holds in practice. So the code was modified to look like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cache&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MObject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;Cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    
    &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FindCacheNode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MObject&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CachePlug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MPlug&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;search&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FindCacheNode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;search&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;search&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        
        &lt;span class=&quot;cm&quot;&gt;/* additional processing of the search node */&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Can you spot the problem?&lt;/p&gt;

&lt;p&gt;At the first call to CachePlug, search is NULL, so the function FindCacheNode is called, which does not find the node. search is still NULL, so a new node is inserted; now search points to this node.&lt;/p&gt;

&lt;p&gt;At the next call to CachePlug with a different MObject, search is non-NULL, but the node is different, so FindCacheNode is called again. It can’t find the desired node - after all, nobody inserted it! - so it returns false… &lt;strong&gt;without resetting search to NULL!&lt;/strong&gt; In fact, nobody ever resets search to NULL, so no new Nodes are ever added - the map always has exactly one element, and that node’s parts vector accumulates the attributes of every node in the scene. As you can imagine, this makes every cache function linear in scene object count, and thus the whole export process quadratic. Everything still worked, but the export was slow for large scenes.&lt;/p&gt;
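&lt;p&gt;A minimal sketch of one possible fix (using hypothetical stand-in types in place of the Maya API, and a Node simplified to just its key): the crucial change is that FindCacheNode clears search on a miss, so a failed lookup can never leave the pointer stuck on the previous node:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Stand-ins for the Maya types in the snippets above; illustrative only.
typedef int MObject;
struct MPlug { MObject obj; MObject node() const { return obj; } };
struct Node { MObject node; Node(MObject n): node(n) {} };

struct Cache
{
    std::map<MObject, Node*> cache;
    Node* search;

    Cache(): search(NULL) {}

    bool FindCacheNode(const MObject& node)
    {
        std::map<MObject, Node*>::iterator it = cache.find(node);
        // The fix: on a miss, reset search instead of leaving it stale.
        search = (it != cache.end()) ? it->second : NULL;
        return search != NULL;
    }

    void CachePlug(const MPlug& plug)
    {
        if (search == NULL || search->node != plug.node()) FindCacheNode(plug.node());
        if (search == NULL)
        {
            search = new Node(plug.node());
            cache.insert(std::make_pair(plug.node(), search));
        }
        /* additional processing of the search node */
    }
};
```

With the original bug, every lookup after the first reused the same node and the map never grew past one element; with the reset in place, each scene object gets its own cache entry.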

&lt;p&gt;It is hard to reconstruct the sequence of events without a change history - however, one thing is certain. At some point somebody made an optimization without any prior profiling (the map lookup could not have been a serious factor - after I fixed the bug, the functions from this class were nowhere near the top of the profile), and without any profiling after the change - otherwise they would have spotted the bug.&lt;/p&gt;

&lt;p&gt;Code travels in sometimes unexpected ways. A year ago I found the same issue in OpenCOLLADA, which inherited some code from the Feeling Software exporter (it was fixed after my report).&lt;/p&gt;

&lt;p&gt;Optimizing without profiling first is wrong. Optimizing without measuring and comparing the results afterwards is wrong. Please do not do either. And please look at your code in the profiler once in a while, even if the performance is tolerable - you’ll find things you didn’t expect.&lt;/p&gt;

&lt;p&gt;P.S. The credit for discovering the optimization bug actually goes to Peter Popov (of the Linux RSX fame).&lt;/p&gt;
</description>
			<pubDate>Mon, 29 Nov 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/11/29/optimizations-that-arent/</link>
			<guid isPermaLink="true">https://zeux.io/2010/11/29/optimizations-that-arent/</guid>
		</item>
		
		<item>
			<title>Z7: Everything old is new again</title>
			<description>&lt;p&gt;Debug information is the data that allows the debugger to, uhm, debug your program. It consists of information about all the types used in the program, source line information (which instructions originated from which source lines), variable binding information (where each local variable lives on the stack frame or in the register pool) and other things that help you debug your program.&lt;/p&gt;

&lt;p&gt;There are two different ways to store the debug information for C/C++ code: one follows the ‘separate compilation’ model of C++ and stores debug information in the object file for each translation unit, another adopts the ‘everything is a huge database’ model and stores debug information for the whole project in a single database. The first approach is the one taken by GCC; MSVC, on the other hand, uses the second approach by default.&lt;/p&gt;

&lt;p&gt;Here’s how it works in practice: suppose you have an application project, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;game&lt;/code&gt;, that references two static library projects, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;render&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sound&lt;/code&gt;. There is a single database file (which has .pdb extension) for each project - they usually are located in the same intermediate folder as object files - so in this example we have three PDB files, which by default are all called something like vc80.pdb, depending on the MSVS version - but, since you can change that, we’ll assume they’re called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;game.pdb&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;render.pdb&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sound.pdb&lt;/code&gt;. While the files in all projects are compiling, the compiler computes the debugging information for the current translation unit and updates the corresponding .pdb file.&lt;/p&gt;

&lt;p&gt;However, the debugger can’t work with multiple PDB files - it wants a single one. So the linker, in the process of linking the final application - in our case the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;game&lt;/code&gt; project - merges all PDB files into a single file - let’s call it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gamefinal.pdb&lt;/code&gt;. The linker gets the paths to all PDB files from the object files (or from object files inside static libraries), reads the debug information from them, generates a single PDB file, writes it to disk and stores the path to this file in the executable (exe or dll). The debugger then reads the PDB path from the executable module and uses the debugging information from that file.&lt;/p&gt;

&lt;p&gt;There are some nice properties of this system:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The resulting debugging information is separate from the executable - you can generate it for all builds, including retail, but don’t redistribute the pdb. In fact, &lt;strong&gt;please always generate the debugging information for all builds!&lt;/strong&gt; Prior to Visual Studio 2010 the default settings for Release configuration excluded any debug information, which is unfortunate.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The mechanism for discovering the “source” PDB files at link stage is flexible - I’ve described the default setup for freshly created projects, however you can modify it - you can have all projects update a single PDB file, or you can have 1 PDB per object file. Linker will work regardless of the setup.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there is a problem - what if several files are compiled in parallel? If they refer to the same PDB file, we have to use some synchronization mechanism. This concern (perhaps along with other reasons I’m not aware of) led to the following design: there is a server process, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mspdbsrv.exe&lt;/code&gt;, which handles PDB file operations and ensures safe concurrent access. The compiler uses the server to update PDB files; the linker uses it to read the source PDB files and update the final one. Some operations are apparently asynchronous - you can sometimes observe that even though the linker process has exited, the final PDB file processing is not finished, which can lead to file access errors.&lt;/p&gt;

&lt;p&gt;So, now everything works fine, right? Almost.&lt;/p&gt;

&lt;p&gt;When you’re using distributed compilation, e.g. via IncrediBuild, the compiler processes run on different machines. They update some PDB file locally, which is then transferred to your machine. However, this effectively disables the PDB server’s coordination - instead of a single server process that updates all PDB files, there are now multiple server processes, one for each worker machine! This leads to disaster, which manifests as corrupted PDB files and can be easily observed if you try to use make/scons/jam/any other build system with MSVC + IncrediBuild + compiler-generated PDB files.&lt;/p&gt;

&lt;p&gt;IncrediBuild has a special hack to make this work - when you compile the solution via Microsoft Visual Studio, IncrediBuild modifies the build command line, splitting the PDB file for each project into several files and making sure that all files with the same PDB name go to the same agent. You should be able to use the same hack for make/scons/jam, since you can declare that your tool behaves like cl.exe in the IncrediBuild profile, but I don’t know the details and couldn’t get it to work.&lt;/p&gt;

&lt;p&gt;It turns out that MSVC initially used the first debug information storage approach - i.e. it stored the debug information in object files. Moreover, this mode is still available via the /Z7 switch (this is the so-called ‘old style debug information’, or ‘C7 Compatible’ in the MSVC GUI - you can find the setting in Project Properties -&amp;gt; C++ -&amp;gt; General -&amp;gt; Debug Information Format). This has the following implications:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Debug information is now local to translation unit - there are no races in case of concurrent compilation by design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The PDB server is no longer used during the compilation, because it is not needed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The linker reads debug information from object files directly, instead of looking for PDB path and opening the PDB (in fact, there is no PDB path in object files).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Static libraries contain embedded object files, so a static library file is now self-contained - it contains all the information that’s necessary for linking.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Obviously, the compile and link file access patterns change greatly. The change in compilation/linking times is hard to estimate - on one hand, with /Zi all debug information was consolidated in a single PDB file (per project), while now it’s scattered throughout object files (which, by the way, increases the size of intermediate files because debug information is duplicated); on the other hand, the linker has to read the object files anyway, so locality should not be worse. Also, we eliminate a theoretical synchronization bottleneck (the PDB server), so multiprocess builds can get faster.&lt;/p&gt;

&lt;p&gt;Here are my completely unscientific benchmark results on OGRE builds with cold cache in four build variants: /Zi (PDB files, single core build), /Zi /MP (PDB files, multicore build), /Z7 (no PDB files, single core build), /Z7 /MP (no PDB files, multicore build). For each configuration, I did a clean build of the OgreMain.dll using a new source folder every time, then I rebooted to force file cache cleanup, changed a single source file and did a build once again. Both compilation and linking times are included. The tests were done on a Core i7 920.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;/Zi&lt;/th&gt;
      &lt;th&gt;/Zi /MP&lt;/th&gt;
      &lt;th&gt;/Z7&lt;/th&gt;
      &lt;th&gt;/Z7 /MP&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;clean cl&lt;/td&gt;
      &lt;td&gt;6:45&lt;/td&gt;
      &lt;td&gt;1:51&lt;/td&gt;
      &lt;td&gt;6:32&lt;/td&gt;
      &lt;td&gt;1:32&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clean link&lt;/td&gt;
      &lt;td&gt;0:20&lt;/td&gt;
      &lt;td&gt;0:20&lt;/td&gt;
      &lt;td&gt;0:17&lt;/td&gt;
      &lt;td&gt;0:17&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;incremental cl&lt;/td&gt;
      &lt;td&gt;0:15&lt;/td&gt;
      &lt;td&gt;0:15&lt;/td&gt;
      &lt;td&gt;0:08&lt;/td&gt;
      &lt;td&gt;0:08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;incremental link&lt;/td&gt;
      &lt;td&gt;0:17&lt;/td&gt;
      &lt;td&gt;0:17&lt;/td&gt;
      &lt;td&gt;0:24&lt;/td&gt;
      &lt;td&gt;0:24&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;While there are some savings for the clean build, the total incremental build time is the same (which can be explained if this is the cost of reading old debug information - it has simply moved from link time to the compilation of the single changed source file). With that in mind, /Z7 and /Zi are probably more or less interchangeable - unless you need Edit &amp;amp; Continue, which does not work with old-style debug information. Still, I like the /Z7 approach better.&lt;/p&gt;
</description>
			<pubDate>Mon, 22 Nov 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/11/22/z7-everything-old-is-new-again/</link>
			<guid isPermaLink="true">https://zeux.io/2010/11/22/z7-everything-old-is-new-again/</guid>
		</item>
		
		<item>
			<title>#include &amp;lt;rules&amp;gt;</title>
			<description>&lt;p&gt;We’re stuck with C++, at least for another console generation. C++ has many quirks that I wish were not there, but there is no real alternative as of today. While modern languages tend to adopt bulk compilation and/or smart linkers and so can have a proper module system and eat the cake too, C++ is stuck with header files (on the other hand, C++ builds are incremental and almost embarrassingly parallel). While the strategy for dealing with header files and staying sane seems more or less obvious, I’m amazed at how many people still get this wrong. I hope this post helps to clear the mud somewhat. The post applies to C as well, but is useless for people who are blessed to work with other languages.&lt;/p&gt;

&lt;p&gt;The problem with include files is that the preprocessor is usually quite dumb - you tell it to include the file, it includes the entire contents of the file, recursively. If you don’t tell it to include the file but try to use the symbol from that file - you get a compilation error. If you tell it to include too many files, it includes all of them, and the compilation time suffers.&lt;/p&gt;

&lt;p&gt;In general, the more a header is included in other files (including transitive inclusion, i.e. A includes B includes C means that A indirectly includes C), the more files you’ll need to recompile once the header changes. Iteration time is very important - which is a topic for another time - so we’d like to minimize the amount of header inclusion. This brings us to the first important rule: &lt;strong&gt;Each file should include the minimum amount of files&lt;/strong&gt;. The rule helps ensure that your code builds fast.&lt;/p&gt;

&lt;p&gt;Now, let’s suppose that the header file contains a class declaration. By the nature of C++, a class declaration won’t compile without some other declarations - for example, if a class A inherits from a class B and contains a field of type C, then you have to give the compiler declarations of both B and C in the same translation unit (i.e. in the cpp file that you’re compiling - after the preprocessor has done its work) - before A’s declaration. Now, there are two options here - you can either include the relevant header files in the header with A’s declaration, or force the user to always include B and C headers manually before A. The problem is that sometimes the user does not know about these dependencies (e.g. the field of type C can be private), and sometimes the dependencies change, so every time you add a declaration dependency to your types you break users’ code; and, since declaration dependencies are transitive, often to include a single header you’d need a dozen or more seemingly unrelated ones. For these reasons, it’s important for all headers to be self-contained - anybody should be able to include any header in any cpp file without compilation errors. Which brings us to the second important rule - &lt;strong&gt;each file should include all dependent headers&lt;/strong&gt;, i.e. for each declaration that’s required by the compiler there should be a corresponding include. This rule helps ensure that the programmers stay sane.&lt;/p&gt;
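&lt;p&gt;To make the A/B/C example concrete, here’s a sketch of what A’s self-contained header boils down to (everything inlined into one file for illustration; in a real codebase B and C would live in their own headers, and A’s header would #include both):&lt;/p&gt;

```cpp
#include <cassert>

// B.h: the base class - A's header needs its full definition to derive from it.
class B
{
public:
    int id;
};

// C.h: the field type - A's header needs its full definition to embed it by value.
class C
{
public:
    float weight;
};

// A.h: with both definitions pulled in by A's own header, any cpp file
// can include A.h alone, without knowing about its dependencies first.
class A : public B
{
public:
    C value;
};
```

The member names here are made up; the point is only which definitions A's header is obliged to provide.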

&lt;p&gt;These two rules together define the algorithm for proper header file authoring: for each required declaration, include a corresponding header in your header file; don’t include more headers than that. In order to guarantee that you did not forget the necessary headers, &lt;strong&gt;make sure that your header file is the first #include in the corresponding source file&lt;/strong&gt;, except the common header, if your codebase has one.&lt;/p&gt;

&lt;p&gt;Do not include a header for a dependency declaration where a forward declaration will suffice; &lt;strong&gt;use forward declarations when possible&lt;/strong&gt; (if you’re not familiar with forward declarations, google it). Sometimes it pays off to go to extra lengths to remove header dependencies, using techniques like pimpl - this depends on the exact situation, but &lt;strong&gt;avoid including heavy platform files, like windows.h or d3d9.h, in popular headers&lt;/strong&gt; (I’ve written about a way to make a slim version of d3d9.h in a &lt;a href=&quot;/2009/03/22/miscellanea/&quot;&gt;blog post&lt;/a&gt;, scroll down to the last section).&lt;/p&gt;
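&lt;p&gt;For example, a hypothetical renderer header that only uses a light by reference can get away with a forward declaration, keeping the light’s header out of every file that includes the renderer (file boundaries shown as comments, names made up):&lt;/p&gt;

```cpp
#include <cassert>

// Renderer.h would contain only this much:
class Light; // forward declaration - enough for pointers and references

class Renderer
{
public:
    int brightness(const Light& light); // only the name Light is needed here
};

// Light.h - the full definition, included only by the few files
// (like Renderer.cpp) that actually access Light's members.
class Light
{
public:
    int intensity;
};

// Renderer.cpp
int Renderer::brightness(const Light& light)
{
    return light.intensity; // member access requires the full definition
}
```

Every file that includes Renderer.h avoids recompiling when Light.h changes - exactly the dependency reduction the first rule asks for.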

&lt;p&gt;With the rules above, there is only one thing left - since we can include a header twice accidentally (e.g. A depends on B and C, and B depends on C, so C is included twice into A), we’ll need some protection against that. So each header should guard against multiple inclusion. There are two methods for this - either use #pragma once or use header guards. #pragma once is a non-standard directive that tells the preprocessor explicitly “don’t include this file more than once in a single translation unit”. Header guards emulate the same behavior using preprocessor defines:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#ifndef FILE_NAME_H
#define FILE_NAME_H
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Many people don’t know this, but #pragma once is widely supported in modern compilers. It’s superior to header guards in two ways: it can be faster (e.g. MSVC does not read a file with #pragma once more than once, but does read a file with header guards several times), and it’s foolproof - you don’t have to invent an identifier for each header, so you can’t screw it up. So &lt;strong&gt;use #pragma once if you can, use header guards if you must&lt;/strong&gt;. If some compilers that you use don’t support #pragma once and you can’t convince the vendors to add the feature, &lt;strong&gt;make sure that the header guards are unique using a deterministic generation algorithm&lt;/strong&gt;. For example, you can use something like “take the name of the project and all components of the relative file path; convert all elements to upper case and join with underscores”, resulting in identifiers like THEGAME_RENDER_LIGHTING_POINTLIGHT_H. Do &lt;strong&gt;not&lt;/strong&gt; use short file names alone, they are &lt;strong&gt;not&lt;/strong&gt; unique (unless your coding standard requires unique file names)! Oh, and if you don’t use an autogenerating macro, don’t put a comment after the #endif (e.g. #endif // THEGAME_RENDER_LIGHTING_POINTLIGHT_H) - such comments are only useful as copy-paste history.&lt;/p&gt;
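&lt;p&gt;A hypothetical generator for such a scheme is only a few lines - concatenate the project name and the relative path, upper-case everything, and turn every non-alphanumeric character into an underscore:&lt;/p&gt;

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Builds a guard like THEGAME_RENDER_LIGHTING_POINTLIGHT_H from the
// project name and the file path relative to the project root.
std::string MakeHeaderGuard(const std::string& project, const std::string& relativePath)
{
    std::string guard = project + "_" + relativePath;
    for (size_t i = 0; i < guard.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(guard[i]);
        // Path separators and the extension dot all collapse to '_'.
        guard[i] = std::isalnum(c) ? static_cast<char>(std::toupper(c)) : '_';
    }
    return guard;
}
```

Because the output is a pure function of the project name and relative path, two distinct headers can never collide, which is the whole point of the scheme.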

&lt;p&gt;While header guards make it safe to include the same file several times in a single translation unit, they also make it possible to test whether the file was already included, i.e. #ifdef THEGAME_RENDER_LIGHTING_POINTLIGHT_H. &lt;strong&gt;You should never conditionally exclude a section of a header file based on whether some file was included!&lt;/strong&gt; Doing this introduces an inclusion-order dependency, which is unnatural and hard to debug without preprocessor output. If you’re thinking something like “oh, if the renderer interface was included, I should probably provide a light renderer class, but otherwise it would just add unnecessary clutter”, you should split your header file into two parts, and the second part should explicitly include the renderer interface, since it depends on it.&lt;/p&gt;

&lt;p&gt;At least in game development, the language is frequently extended with some generally useful primitives that are used throughout the whole codebase. The most used one is probably an assertion macro (since the standard one sucks, you should have your own), but there are other examples - logging facilities, fixed-size types, min/max functions, various platform/configuration defines (“are we on a big-endian platform?”), memory management-related macros. It’s common practice to put all of those in a single common header file; you should control the size of this file (where by ‘size’ I mean the cumulative size of all headers it includes, of course), and you should &lt;strong&gt;make sure that each source file includes the common header before everything else&lt;/strong&gt; - otherwise you’ll get into trouble (sometimes you’ll spend several hours looking for the reason - e.g. if you include a header that checks platform endianness before the common file, you’re in a world of hurt).&lt;/p&gt;

&lt;p&gt;Well, I think that’s all about header files; there are also the include paths though. In order to include the file, you have to specify the path to it - either a “relative to the current file” path, or “relative to one of the include directories” path. There are two important goals here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;If you’re writing a library&lt;/strong&gt; - a relatively small one, i.e. not a platform like Unreal Engine - the header files should require minimal configuration, so ideally the user does not have to add include directories to compile or use your library. For such projects, &lt;strong&gt;consider making all include paths current file-relative&lt;/strong&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Otherwise, include paths should be easily greppable - the path to the same file should ideally be the same in all other files. So &lt;strong&gt;make all include paths include directory-relative&lt;/strong&gt;; moreover, try to make sure that &lt;strong&gt;include paths are unambiguous&lt;/strong&gt; - i.e. that you don’t have two different representations for the same file path inside the render project.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;Whatever rule you use, try to &lt;strong&gt;make sure it’s consistent between different projects&lt;/strong&gt;, as much as possible. Ideally even the include directories should be the same, i.e. include directories for the engine project should be a strict subset of include directories for the game project.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And as a final advice - learn to use the preprocessor output (cl /E, gcc -E), learn to use the include output (cl /showIncludes, gcc -M), gather the codebase statistics (average size after preprocessing, most included header files, header files with largest payload, etc.) and optimize your codebase by eliminating dependencies and spreading the word. Nothing beats a sub-second iteration time.&lt;/p&gt;

&lt;p&gt;Oh, did I mention that good header dependencies decrease the linking time?&lt;/p&gt;
</description>
			<pubDate>Mon, 15 Nov 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/11/15/include-rules/</link>
			<guid isPermaLink="true">https://zeux.io/2010/11/15/include-rules/</guid>
		</item>
		
		<item>
			<title>Lua callstack with C++ debugger</title>
			<description>&lt;p&gt;Lua is a very popular scripting language in the game development industry. Many games use Lua for various scripting needs (data representation, UI scripting, AI scripting), and some go as far as writing the majority of the game in Lua. At CREAT, we used Lua for all UI scripting, and for AI and other game logic on some projects. And, well, there were times when the game crashed - and the callstack consisted mainly of Lua functions.&lt;/p&gt;

&lt;p&gt;While there are probably very few bugs in the Lua library code, and the language is safe - you can’t get buffer overruns or other madness from script code alone - script code by itself is useless, because it can’t interact with the outside world: the user, the world state, scoreboard servers, etc. So naturally there is a Lua binding for some C/C++ functions, so that scripts can call them. Now, if one of these functions crashes - for example, because it got invalid input data - how do we trace the problem back to the script code?&lt;/p&gt;

&lt;p&gt;Assuming we don’t want to modify C++/Lua code in any way, nor do we want to restart the game with tracing hook enabled - the easily reproducible bugs are often a luxury - we’re left with the following methods:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;If the external Lua debugger was attached, it’s likely that we’ll be able to get the callstack and the related information from it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We can trick the game into calling a call stack dumping function (using lua_getstack and lua_getinfo).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We can get the call stack manually, by inspection of Lua data structures.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s possible that you don’t have a working Lua debugger, don’t have it attached, or that it isn’t working at the moment (oh, and the deadline was yesterday). So I’m going to describe the last two approaches here.&lt;/p&gt;

&lt;h3 id=&quot;use-a-stack-dumping-function&quot;&gt;Use a stack dumping function&lt;/h3&gt;

&lt;p&gt;This approach is superior to the third one because you can have arbitrarily complex logic in the stack dumping function - e.g. you can print local variables along with the call stack - and it’s less tedious. Just make sure your stack dumping function does not crash :) However, unless you have good debugger support for this, calling the function so that the program keeps working afterwards can be problematic.&lt;/p&gt;

&lt;p&gt;Anyway, first you’ll need the function itself. A trivial implementation looks like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lua_stacktrace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lua_State&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lua_Debug&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;depth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; 

    &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lua_getstack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;depth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lua_getinfo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;L&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sln&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;dprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;%s(%d): %s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;short_src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currentline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;entry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;?&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;depth&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In order to get local variable information, you’ll have to use lua_getlocal and the ordinary functions for getting values from the Lua stack; this is left as an exercise for the reader.&lt;/p&gt;

&lt;p&gt;Now that we have the function, you’ll have to make sure it’s actually linked into your executable (otherwise the linker may strip it); just reference it from some other function like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;volatile&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lua_stacktrace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now you have to call the function. If you’re lucky enough to have a debugger that can do this - for example, Microsoft Visual Studio can often do this from the Watch or Immediate windows - then just add the expression &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lua_stacktrace(L)&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L&lt;/code&gt; is the pointer to the Lua state (games often have a single Lua state, in which case I recommend saving it to a global variable to make debugging easier).&lt;/p&gt;

&lt;p&gt;Otherwise, you’ll have to save all registers and other relevant CPU state, set up the registers/stack so that you can call the function, set the instruction pointer to the first instruction of the function, add a breakpoint on the function’s return instruction and hit F5. The function code will execute and stop on the breakpoint; there you have to restore all registers and CPU state, restore the instruction pointer and hit F5 again.&lt;/p&gt;

&lt;p&gt;You don’t want to do that.&lt;/p&gt;

&lt;p&gt;Seriously, it’s way too complex, and chances are you’ll screw something up so that the game crashes anyway. So I recommend picking a thread you don’t care about anymore, setting up the necessary state and calling the function from there - the thread won’t work anymore, but you’ll have your callstack. I often used this approach for post-mortem crash debugging, where the program is dead anyway.&lt;/p&gt;

&lt;p&gt;Depending on the platform ABI, the relevant setup is different; for example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;On x86, the argument is read from stack, using the esp register (esp + 4 should contain the pointer); for MSVC, add a watch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*(void**)(esp + 4)&lt;/code&gt;, change the value to the lua_State pointer, get the address of the target function by adding a watch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lua_stacktrace&lt;/code&gt;, go to the function in disassembly window, use “Set Next Statement” command on the first instruction, hit F5.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On PowerPC, the argument is read from register r3; add a watch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r3&lt;/code&gt;, change the value to the lua_State pointer, go to the function in the disassembly window, use “Set Next Statement” or the equivalent command of the debugger on the first instruction, hit F5.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll see the call stack and the game will crash, but now you have additional context for the problem and can debug the crash further. If you’re using this method a lot, I suggest making a less trivial function that is able to dump locals. Just in case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dprintf&lt;/code&gt; in the code above sends the string to the debug window (using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OutputDebugStringA&lt;/code&gt;); use whatever debugging output is available on your platform.&lt;/p&gt;

&lt;h3 id=&quot;inspect-lua-data-structures&quot;&gt;Inspect Lua data structures&lt;/h3&gt;

&lt;p&gt;The approach of calling the function is dangerous, since it can stop or corrupt the execution flow; it also requires code execution, which may be unavailable - for example, you can’t use it if you’re debugging via crash dumps on some platforms. Therefore it’s useful to know how Lua represents the call stack, so that you can get the call stack information using safe debugger features, i.e. object state inspection.&lt;/p&gt;

&lt;p&gt;As before, I’ll assume you know the lua_State pointer; it’ll be referred to as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First, we’ll need to get low-level call stack information. It’s stored in an array of CallInfo structures, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L&lt;/code&gt; has three pointers to it: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base_ci&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ci&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;end_ci&lt;/code&gt;. Get the stack frame count with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;ci - L-&amp;gt;base_ci + 1&lt;/code&gt; (let’s assume it’s 6), then display all of them with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci,6&lt;/code&gt; (this is a special watch expression, it’s supported by Microsoft debugger and PS3 debugger - debuggers for other platforms might have an equivalent feature).&lt;/p&gt;
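
&lt;p&gt;As a sketch of that arithmetic (the addresses and the CallInfo size below are made up for illustration - when you subtract typed pointers in a debugger watch, the division by sizeof(CallInfo) happens automatically):&lt;/p&gt;

```python
# Made-up addresses for illustration; in a real session you read L->ci and
# L->base_ci in the debugger, and typed pointer subtraction divides by
# sizeof(CallInfo) automatically.
sizeof_callinfo = 24        # assumed for this sketch; the real size depends on the build
base_ci = 0x00480000        # hypothetical value of L->base_ci
ci = 0x00480078             # hypothetical value of L->ci
frames = (ci - base_ci) // sizeof_callinfo + 1
print(frames)  # 6 frames, as assumed in the text
```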

&lt;p&gt;Each callstack entry has two important fields: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;func&lt;/code&gt;, which points to a function object representing the call frame (we’ll get the function and source file from it), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;savedpc&lt;/code&gt;, which points to a saved program counter (we’ll get the line from it).&lt;/p&gt;

&lt;p&gt;A function object is a Lua object that can represent either a Lua function or a C function. We can verify that the entry of interest is a function by checking that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[5].func-&amp;gt;tt&lt;/code&gt; equals 6 (LUA_TFUNCTION); after that, we check the type of the function with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[5].func-&amp;gt;value.gc-&amp;gt;cl.c.isC&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If it’s 1, it’s a C function; we can get the function pointer with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[5].func-&amp;gt;value.gc-&amp;gt;cl.c.f&lt;/code&gt;, and that’s it. This function will be in the ordinary call stack of the relevant thread; also, the top stack entry should be a C function, unless you’re inspecting the state while Lua code is running inside the VM.&lt;/p&gt;

&lt;p&gt;The previous frame in our case contains a Lua function (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[4].func-&amp;gt;value.gc-&amp;gt;cl.c.isC&lt;/code&gt; is 0), so we’ll get additional information for it. The Lua function contains a pointer to its prototype, stored in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[4].func-&amp;gt;value.gc-&amp;gt;cl.l.p&lt;/code&gt; (a pointer to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Proto&lt;/code&gt; object, which is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x00330d80&lt;/code&gt; in my case - I’ll use this pointer to reduce the watch expression complexity).&lt;/p&gt;

&lt;p&gt;Now we’re close. The prototype contains the source file path; you can get it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(char*)(&amp;amp;((Proto*)0x00330d80)-&amp;gt;source-&amp;gt;tsv + 1)&lt;/code&gt;. It’s a string, and in Lua, string data is stored right after the string header (you can also skip the char* cast and use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;,s&lt;/code&gt; watch modifier). Now all we need is the line information.&lt;/p&gt;
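
&lt;p&gt;The exact TString layout differs between Lua versions, but the “characters follow the header” idea can be sketched with a made-up header (everything in this snippet is an illustration, not the real Lua structure):&lt;/p&gt;

```python
import ctypes

# Made-up stand-in for Lua's string header (NOT the real TString layout;
# the real fields differ between Lua versions).
class StringHeader(ctypes.Structure):
    _fields_ = [("tt", ctypes.c_int), ("length", ctypes.c_size_t)]

hdr_size = ctypes.sizeof(StringHeader)
buf = ctypes.create_string_buffer(hdr_size + 6)   # header + "hello" + NUL
hdr = StringHeader.from_buffer(buf)
hdr.tt = 4          # pretend type tag
hdr.length = 5
# The character data lives immediately after the header, as in Lua:
ctypes.memmove(ctypes.addressof(hdr) + hdr_size, b"hello", 5)

# Equivalent of the (char*)(header_ptr + 1) trick from the watch expression:
data = ctypes.string_at(ctypes.addressof(hdr) + hdr_size, hdr.length)
print(data)  # b'hello'
```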

&lt;p&gt;Remember &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;savedpc&lt;/code&gt; from earlier? It’s a pointer to some instruction in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;((Proto*)0x00330d80)-&amp;gt;code&lt;/code&gt; array - you can get the instruction index like this: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[4].savedpc - ((Proto*)0x00330d80)-&amp;gt;code&lt;/code&gt;, which is 5 in our case (if you’re doing the address arithmetic by hand, don’t forget to divide by 4 - the instruction size; thankfully, all Lua instructions are 4 bytes). However, this is the instruction that follows the call; we actually need the previous instruction to get the call site, so the instruction index is 4.&lt;/p&gt;

&lt;p&gt;Now all we have to do is to get the line number from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lineinfo&lt;/code&gt; array: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;((Proto*)0x00330d80)-&amp;gt;lineinfo[4]&lt;/code&gt; (which is 41 in our case).&lt;/p&gt;
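
&lt;p&gt;The last two steps can be sanity-checked with plain arithmetic (all addresses and the lineinfo contents here are made up to reproduce the numbers from the text; the 4-byte instruction size assumes Lua 5.1):&lt;/p&gt;

```python
# Made-up addresses chosen to reproduce the numbers from the text.
proto_code = 0x00331000                    # hypothetical Proto::code pointer
savedpc = 0x00331014                       # hypothetical CallInfo::savedpc
instruction = (savedpc - proto_code) // 4  # byte difference / instruction size
call_site = instruction - 1                # savedpc points past the call
lineinfo = [38, 39, 39, 40, 41, 42]        # hypothetical Proto::lineinfo contents
print(instruction, call_site, lineinfo[call_site])  # 5 4 41
```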

&lt;p&gt;That’s all - we know the source file, we know the source line - now we can repeat the process above for each call stack entry.&lt;/p&gt;

&lt;p&gt;Some final remarks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Since Lua implements tail call optimization, the callstack will sometimes be unexpected - some entries will be skipped. You can check if that’s the case by looking at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tailcalls&lt;/code&gt; field inside CallInfo: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L-&amp;gt;base_ci[2].tailcalls&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The first call stack entry (with index 0) contains a nil value; just ignore it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In complex cases you’ll have several Lua states (multithreading, coroutines) - the process of stack unwinding is the same.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You can get local variable values too by using CallInfo &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;top&lt;/code&gt; field and looking at function debug metadata; this is more complicated but doable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you’re writing an embeddable language, please make sure that in your product, getting a call stack is at least as easy.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</description>
			<pubDate>Sun, 07 Nov 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/11/07/lua-callstack-with-c-debugger/</link>
			<guid isPermaLink="true">https://zeux.io/2010/11/07/lua-callstack-with-c-debugger/</guid>
		</item>
		
		<item>
			<title>Moving on</title>
			<description>&lt;p&gt;The day has come - I’ve left &lt;a href=&quot;http://www.creatstudios.com/&quot;&gt;CREAT Studios&lt;/a&gt; and started working at &lt;a href=&quot;http://saber3d.com/&quot;&gt;Saber Interactive&lt;/a&gt; as a PS3 (well, that was obvious) programmer (well, that was obvious too).&lt;/p&gt;

&lt;p&gt;I worked at CREAT for three and a half years; I enjoyed it immensely - I had the privilege of working with some smart people, together we built an engine for the (then) next-generation consoles, and I’m quite proud of the results. During these years I helped ship &lt;a href=&quot;http://zeuxcg.org/projects/&quot;&gt;a lot of PS3 projects&lt;/a&gt; - though none of them were AAA (what does AAA mean anyway?), all of them are good games and some have interesting tech inside. On my last day I got into a TerRover match with my colleagues and only came to at 10 PM - it was that much fun.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You should not have a favourite weapon. To become over-familiar with one weapon is as much a fault as not knowing it sufficiently well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Still, there was a brave new world out there - I wanted to work on projects of a larger scale, to see what other companies look like, and to delve into unknown technology to further my understanding of game development - and here I am. Count me excited!&lt;/p&gt;
</description>
			<pubDate>Wed, 03 Nov 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/11/03/moving-on/</link>
			<guid isPermaLink="true">https://zeux.io/2010/11/03/moving-on/</guid>
		</item>
		
		<item>
			<title>Source code: Implementing Direct3D for fun and profit</title>
			<description>&lt;p&gt;Almost a year and a half ago I blogged &lt;a href=&quot;/2009/06/08/implementing-direct3d-for-fun-and-profit/&quot;&gt;about several useful things that you can do with custom IDirect3DDevice9 implementations&lt;/a&gt;. I don’t know why I did not post the code back then, but anyway - here it is:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/66e62f12fa4616711088#file-dummydevice-h&quot;&gt;dummydevice.h&lt;/a&gt; - this is just an example of a dummy device implementation; it implements all device methods with stubs that trigger a debug break when called. This is useful as a base for other partial implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/66e62f12fa4616711088#file-deferreddevice-h&quot;&gt;deferreddevice.h&lt;/a&gt; - this is an implementation of a device that buffers various rendering calls and then allows executing them later on some other device. Note that it lives in a fixed-size memory buffer (this can easily be changed), and that it implements only a subset of rendering-related functions (i.e. no FFP).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/66e62f12fa4616711088#file-texturedevice-h&quot;&gt;texturedevice.h&lt;/a&gt; - this is an implementation of a device that works with D3DXCreateTextureFromFile for 2D textures and cubemaps (3D texture support is missing, but can be added in the same way).&lt;/p&gt;

&lt;p&gt;DL_BREAK is a replacement for __debugbreak, and DL_ASSERT is a custom assertion macro (with the neat (void)sizeof(!(expr)) trick that I hope everybody knows about by now); everything else should be obvious.&lt;/p&gt;
</description>
			<pubDate>Mon, 25 Oct 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/10/25/source-code-implementing-direct3d-for-fun-and-profit/</link>
			<guid isPermaLink="true">https://zeux.io/2010/10/25/source-code-implementing-direct3d-for-fun-and-profit/</guid>
		</item>
		
		<item>
			<title>Quicksort killer sequence</title>
			<description>&lt;p&gt;Today I’m going to describe a not very practical but neat experiment, the result of which is a sequence that’s awfully slow to sort using the Microsoft STL implementation; additionally, the method of generating such a sequence naturally extends to any other quicksort-like approach.&lt;/p&gt;

&lt;p&gt;First, a quick refresher on how std::sort [in Microsoft STL] works. It is a variant of introsort with insertion sort for small chunks. It proceeds as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;For small sequences (32 elements or less), it uses insertion sort, which has O(n&lt;sup&gt;2&lt;/sup&gt;) average complexity, but a better constant than quicksort;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For other sequences, a median of either three or nine elements, depending on the sequence size, is selected as a pivot;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The array is partitioned in place, resulting in three chunks: the leftmost chunk has all elements that are less than the pivot, the middle chunk has all elements that are equal to the pivot, and the right chunk has all elements that are greater than the pivot;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Left and right chunks are sorted recursively (actually, only the smaller one is sorted via a recursive call, but that’s not significant);&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, if the recursion depth is too big (more than 1.5*log2(N)), the algorithm switches to heap sort, which has a worst-case complexity of O(n*log(n)).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This, given a careful implementation, results in a good general-purpose sorting function - it uses quicksort (which has a lower constant than heapsort), but falls back to heap sort on inputs that sort slowly with quicksort. However, due to unfortunate debug checks inside the pop_heap function in MSVC 2005 and 2008, the heap sort is quadratic in debug builds (this has been fixed in MSVC 2010), so if we can make a sequence that makes quicksort go quadratic, this introsort implementation will also go quadratic in debug builds.&lt;/p&gt;

&lt;p&gt;Since all quicksort-like sorts depend only on the order between elements (they’re comparison-based), we can build the sequence out of any type (e.g. a list of strings) and then make a sequence of some other type (e.g. an integer list) with the same order; the number of comparisons will be the same.&lt;/p&gt;
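
&lt;p&gt;For example, here is one way to sketch such an order-preserving conversion (a hypothetical helper, not taken from the actual generator): each distinct string gets its rank, so the integer list compares exactly like the string list does.&lt;/p&gt;

```python
# Hypothetical helper: assign each distinct string its rank among all
# distinct strings, so the integer list sorts identically to the string list.
def same_order_ints(items):
    rank = {v: i for i, v in enumerate(sorted(set(items)))}
    return [rank[v] for v in items]

print(same_order_ints(["cb", "ca", "ac", "ba", "cc"]))  # [3, 2, 0, 1, 4]
```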

&lt;p&gt;Each quicksort-like sort has the following algorithm:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Select the median(s) either using pseudo-random numbers or some fixed set of elements inside the given range;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Partition the range into several chunks, with the rightmost chunk consisting of all elements larger than the largest median (my method naturally extends to multi-pivot sorts);&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Recursively sort the chunks.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our goal, in order to produce the worst possible sequence, is to maximize the size of the rightmost chunk; then the recursion depth will be linear in the original element count, and the whole routine will be quadratic. To achieve that, we’ll incrementally build the strings in the list with the following algorithm:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Get the locations of median candidates for the first sorting pass (i.e. not including recursive calls);&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;One of them (the middle one, assuming that it’s moved appropriately) is the median (pivot); we append the following letters to all strings:
    &lt;ul&gt;
      &lt;li&gt;‘a’ to all median candidates to the left of the pivot;&lt;/li&gt;
      &lt;li&gt;‘b’ to the pivot itself;&lt;/li&gt;
      &lt;li&gt;‘c’ to all other elements.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;This pass maximizes the number of elements that are larger than the pivot; after it, we proceed recursively.&lt;/li&gt;
&lt;/ol&gt;
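
&lt;p&gt;To see why this yields quadratic behavior: if each pass leaves almost everything in the rightmost chunk, the recursion depth is linear, and every level still scans the whole remaining range. A quick sanity check of the resulting totals:&lt;/p&gt;

```python
# If each pass splits off only the pivot, the pass over k remaining elements
# still touches all k of them, so the total work is n + (n-1) + ... + 1.
def total_partition_work(n):
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total

print(total_partition_work(1000))  # 500500, i.e. n*(n+1)/2 - quadratic growth
```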

&lt;p&gt;In order to get the information about the median candidates, the median and the partition results, we need to slightly instrument the sorting function; I made the following interface:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;sort_context&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;less&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition_begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition_median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;med&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition_end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;predicate&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sort_context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;less&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The sorting function should call partition_begin before each sorting pass, partition_median after the median is selected, and partition_end after the array is partitioned, passing the range of the rightmost chunk.&lt;/p&gt;

&lt;p&gt;Then we can implement the function that retrieves indices of median candidates:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_first_median_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;median_context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sort_context&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;median_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;less&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;push_back&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;push_back&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sort_context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;less&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition_begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition_median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;med&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;med&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// collect median data&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;median_context&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_pair&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// sort &amp;amp; remove duplicates&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;erase&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// convert from pointers to offsets&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// get median position&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_pair&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;median&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;a function that sorts the array and returns the partition information for the first pass:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_first_partition_right_modify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;partition_context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sort_context&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;partition_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition_end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;right_end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// get partitioning data&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;partition_context&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;predicate&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort_instrumented&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// get indices&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_pair&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_pair&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and finally, the main function that uses the above helpers:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// get positions of the first median candidates (along with the median itself)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_first_median_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// fill elements as follows:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// - elements from median candidates before median get an &apos;a&apos; appended&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// - median element gets a &apos;b&apos; appended&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// - all other elements get a &apos;c&apos; appended (so that they go into the right half after partition)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;a&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;b&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;action_otherwise&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;c&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ait&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ait&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;action_otherwise&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ait&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// copy the elements to preserve the original data&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// get the right partition (left should be very small so we don&apos;t care)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_first_partition_right_modify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// process the right half&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;update_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that, as an optimization, the predicate only compares the last characters of the strings; after each partition, the contents of the right chunk consist of equal elements, so the only difference is in the appended character (which is one of ‘a’, ‘b’ or ‘c’).&lt;/p&gt;

&lt;p&gt;The only remaining task is to convert the string array into an integer array with the same order. This is straightforward, except that we have to use std::multiset for sorting, since std::sort is slow on this data set (which was the goal, after all :)):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;generate_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// create element array with empty strings&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// update it to make worst possible order&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;update_array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// make a sorted copy using std::multiset because std::sort is slow on this data (we prepared the data this way!)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;multiset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;copy_set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy_set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;begin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;copy_set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// create an order remap&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// create an integer array with the same order&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;push_back&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// cleanup&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;delete&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;delete&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is &lt;a href=&quot;https://gist.github.com/zeux/148aed5d4bbc8c74a7f4&quot;&gt;the full source code&lt;/a&gt; for this post. It contains the above code for generating the killer sequence for a quick sort implementation, and additionally the instrumented sorting function from MSVC2008 STL. This code may not compile on other compilers because of the MS-specific parts of the sorting function itself, but otherwise should work fine.&lt;/p&gt;
</description>
			<pubDate>Mon, 25 Oct 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/10/25/quicksort-killer-sequence/</link>
			<guid isPermaLink="true">https://zeux.io/2010/10/25/quicksort-killer-sequence/</guid>
		</item>
		
		<item>
			<title>AABB from OBB with component-wise abs</title>
			<description>&lt;p&gt;This post is about a neat trick that is certainly not my invention, but that deserves to be better known; at least, I hadn’t heard of it until I stumbled across it while reading the Box2D sources.&lt;/p&gt;

&lt;p&gt;There are a lot of bounding volumes out there; the most widespread are certainly spheres and boxes, which come in two flavors - axis-aligned bounding boxes (AABB), with faces parallel to the coordinate planes, and oriented bounding boxes (OBB), which are essentially an AABB plus an orientation matrix.&lt;/p&gt;

&lt;p&gt;It’s common to use AABBs in spatial subdivision structures like octrees, kD-trees, ABTs and so on - the intersection test between two AABBs is pretty straightforward. However, when dealing with dynamic meshes, the AABB of the mesh has to be recalculated whenever the mesh transformation changes.&lt;/p&gt;

&lt;p&gt;Assuming that the mesh has a local bounding box (which is an AABB), the usual way to get the world-space AABB for the mesh is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Get 8 corners of the mesh AABB&lt;/li&gt;
  &lt;li&gt;Transform all corners to the world space with the mesh transformation matrix&lt;/li&gt;
  &lt;li&gt;Find the component-wise minimum and maximum of the resulting 8 vectors&lt;/li&gt;
&lt;/ol&gt;
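
&lt;p&gt;As a rough sketch of this brute-force approach (the &lt;code&gt;vec3&lt;/code&gt;/&lt;code&gt;mat4&lt;/code&gt; types and the &lt;code&gt;transform_point&lt;/code&gt; helper are hypothetical stand-ins for whatever math library you use):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// brute-force version: transform all 8 corners and take the min/max
aabb transform_aabb_naive(const aabb&amp;amp; box, const mat4&amp;amp; m)
{
    vec3 result_min, result_max;

    for (int i = 0; i &amp;lt; 8; ++i)
    {
        // pick min or max for each axis depending on the corner index
        vec3 corner(
            (i &amp;amp; 1) ? box.max.x : box.min.x,
            (i &amp;amp; 2) ? box.max.y : box.min.y,
            (i &amp;amp; 4) ? box.max.z : box.min.z);

        vec3 p = transform_point(m, corner);

        result_min = (i == 0) ? p : min(result_min, p);
        result_max = (i == 0) ? p : max(result_max, p);
    }

    return aabb(result_min, result_max);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;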

&lt;p&gt;However, there is a better way, which reduces the number of floating-point operations to roughly a quarter of the above. It’s easily derived once we slightly change the AABB representation - while an AABB is commonly represented with two vectors, min and max, let’s assume that our box is represented with a center and an extent vector:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;min = center - extent
max = center + extent
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;center = (min + max) / 2
extent = (max - min) / 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, the 8 corners of the original AABB are of the form center + (±extent.x, ±extent.y, ±extent.z). Transforming them by the matrix M is thus&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;M * (center + (±extent.x, ±extent.y, ±extent.z))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s expand the matrix-vector multiplication; the result looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;M00 * (center.x ± extent.x) + M01 * (center.y ± extent.y) + M02 * (center.z ± extent.z) + M03
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(and likewise for the other two components)&lt;/p&gt;

&lt;p&gt;We can slightly rearrange the equation to get this:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;(M00 * center.x + M01 * center.y + M02 * center.z + M03) + (±M00 * extent.x + ±M01 * extent.y + ±M02 * extent.z)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(and likewise for the other two components)&lt;/p&gt;

&lt;p&gt;Now, the left part is shared by all 8 points, and is equal to M * center (i.e. to the AABB center transformed to world space); this is the center of the new AABB.&lt;/p&gt;

&lt;p&gt;The right part differs between the points; however, since the extent vector has non-negative components, it’s obvious that the minimum of the right part is reached when ±M00, ±M01, ±M02 are all negative, and the maximum is reached when they are all positive. Thus, the maximum of the right part is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;abs(M00) * extent.x + abs(M01) * extent.y + abs(M02) * extent.z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(likewise for the other two components).&lt;/p&gt;

&lt;p&gt;Note that this is a matrix-vector multiplication, with the matrix being the component-wise absolute value of the original transformation matrix, and the vector being the extent vector (which has to be transformed as if it were a direction, i.e. without taking the matrix translation into account).&lt;/p&gt;

&lt;p&gt;The resulting code looks like this (this is F# with SlimDX math classes):&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix_abs&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mutable&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Matrix&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform_aabb_fast&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BoundingBox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Minimum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Maximum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;extent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Maximum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Minimum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;TransformCoordinate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_extent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Vector3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;TransformNormal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix_abs&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;nc&quot;&gt;BoundingBox&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_extent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_center&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_extent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Instead of 8 shuffles, 8 matrix-point multiplications and 8 vector min+max operations, we need to convert the AABB to and from the center+extent representation (the extent can alternatively be computed as aabb.Maximum - center) and do one matrix-point and one matrix-direction multiplication, which is usually faster.&lt;/p&gt;
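
&lt;p&gt;For readers who prefer C++, the same algorithm can be sketched as follows (again with hypothetical &lt;code&gt;vec3&lt;/code&gt;/&lt;code&gt;mat4&lt;/code&gt; helpers; &lt;code&gt;transform_point&lt;/code&gt; applies the translation, &lt;code&gt;transform_direction&lt;/code&gt; does not):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// fast version: one matrix-point and one matrix-direction multiplication
aabb transform_aabb_fast(const aabb&amp;amp; box, const mat4&amp;amp; m)
{
    vec3 center = (box.min + box.max) * 0.5f;
    vec3 extent = (box.max - box.min) * 0.5f;

    // component-wise absolute value of the rotation/scale part
    mat4 abs_m = m;

    for (int i = 0; i &amp;lt; 3; ++i)
        for (int j = 0; j &amp;lt; 3; ++j)
            abs_m[i][j] = fabsf(m[i][j]);

    vec3 new_center = transform_point(m, center);
    vec3 new_extent = transform_direction(abs_m, extent);

    return aabb(new_center - new_extent, new_center + new_extent);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;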

&lt;p&gt;In case the original mesh bounding volume was an OBB (in my experience this is usually unnecessary, as local-space AABBs give a good enough approximation for common cases, but still), the same trick applies - you just have to get the full transformation matrix by multiplying the OBB and mesh transformation matrices.&lt;/p&gt;

&lt;p&gt;When I first saw this in Box2D, I did not understand why the code worked at all - the meaning of the component-wise absolute value is not immediately obvious. Now I know; and I hope this was of some interest to you.&lt;/p&gt;
</description>
			<pubDate>Sun, 17 Oct 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/10/17/aabb-from-obb-with-component-wise-abs/</link>
			<guid isPermaLink="true">https://zeux.io/2010/10/17/aabb-from-obb-with-component-wise-abs/</guid>
		</item>
		
		<item>
			<title>Death by static initialization</title>
			<description>&lt;p&gt;The language war in game development is long over - and the winner is C++. The vast majority of code that’s going to run on the users’ side (engine code and game code) is written in C++. This is not so much because the language is good, but because there is no better alternative.&lt;/p&gt;

&lt;p&gt;Many features of C++ carry some penalty in different areas - performance, memory overhead, compilation time, code flow clarity, etc. The great thing about the language is that you can usually avoid a feature where you don’t need it or would rather do without it.&lt;/p&gt;

&lt;p&gt;One powerful feature of C++ (which is, by the way, present in most high-level languages, like Java, C#, Python, etc.) is static initialization. In the days of C, the only code that ran before main() was the CRT startup code - basically, nothing interesting ever happened outside of main(). Since in C++ constructors of global variables are executed before main(), you could theoretically run the entire game before main() (not that this is a good idea).&lt;/p&gt;

&lt;p&gt;The use of static initializers is usually discouraged; while useful for removing some glue code, such as various entity registration (one example is auto-registering unit tests via global constructors - many C++ test frameworks use this approach, mine included), static initialization has several problems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The order of execution of static constructors across translation units is not defined; using a global variable from the constructor of another global variable leads to undefined behavior.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The code flow is no longer obvious - i.e. you can get crashes or stalls in the code that’s running before main().&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In order to do anything interesting before main(), you usually have to initialize some of your subsystems (i.e. a logging facility), which leads to more and more code being put into static initializers, which does not help things.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Static initializers only run if the translation unit they’re in is linked into the executable; because of this, the automatic use of static initializers compiled into a static library is sometimes impossible (you have to touch at least one symbol from the object file in question).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
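
&lt;p&gt;The first point is worth a minimal (hypothetical) example, since it’s so easy to trip over:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// a.cpp
std::vector&amp;lt;int&amp;gt; g_registry; // constructed at some unspecified point before main()

// b.cpp
extern std::vector&amp;lt;int&amp;gt; g_registry;

struct AutoRegister
{
    // this constructor may run before g_registry is constructed, since the
    // initialization order between translation units is undefined
    AutoRegister() { g_registry.push_back(42); }
};

AutoRegister g_auto;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;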

&lt;p&gt;However, while working on one of our titles, I found another problem with static initializers - sometimes they cost you memory. I’m working on console titles; memory is a scarce resource on current-generation consoles, so whenever I see a chunk of memory that’s 1 MB or more and that’s not supposed to be there, I try to remove it.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Some of you probably think that a megabyte is such a tiny amount of memory that it’s no use fussing about it; well, the harsh reality of game development is that most optimization consists of shaving off a percent of available performance/memory many times over - there is often no single 50% or even 10% bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because of that, I sometimes look at the game executable file to see what the memory overhead of just loading our code onto the target console is, and where this overhead comes from. We have a GCC-based toolchain, so there is a variety of tools available; the relevant ones for this task are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; (gives section sizes, which is good for a general overview) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nm&lt;/code&gt; (gives a sorted list of symbols, enabling a more detailed analysis).&lt;/p&gt;

&lt;p&gt;Imagine my surprise when I found that slightly more than a megabyte of our 6 MB ELF consists of static initialization code! I found this using a simple command line (did I mention I love Perl one-liners?):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nm --print-size game.elf | perl -ne &quot;$sum += hex($1) if (/^\S+\s+(\S+).*static_init/); END { print $sum; }&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We do not have that many statically initialized objects; in fact, almost the only place where we have them is our serialization system. We have an in-place serialization framework that can save (on a Windows PC) a graph of C++ objects to a file so that the objects have the same memory layout as on the target platform; we can then load the file into memory (on the console), do pointer fixup and start using the objects.&lt;/p&gt;

&lt;p&gt;Unfortunately, due to popular demand among our programmers, the system has to support polymorphic objects and multiple inheritance; this means that, in addition to pointer fixup, we have to fix up pointers to virtual function tables - moreover, because of multiple inheritance, there may be more than one vtbl pointer in a single object! Because of this, the system executes a special constructor for each object via placement new; the constructor itself does nothing except guarantee that it does not initialize any fields/aggregate objects, so that the values from the file are left intact; however, for objects with vfptrs, the compiler adds the relevant vtable setup code to the constructor.&lt;/p&gt;

&lt;p&gt;The only remaining problem is to call the right constructor for each object. We have an RTTI system for this (it’s not RTTI in the usual sense - you can’t get the object’s type at runtime - but you can, at compile time, get a type identifier, which is a CRC32 of the type name and thus the same across all platforms). There is a table of functions indexed by this type identifier; you can look up a function by the identifier, execute it on a chunk of memory, and get an initialized chunk of memory back - all without knowing the type at compile time.&lt;/p&gt;
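
&lt;p&gt;To make this more concrete, here is a minimal sketch of such a table. All names are invented for illustration, and the real system also registers object sizes and destructors (as the registration function below shows) - so treat this as a rough outline, not our actual code:&lt;/p&gt;

```cpp
#include <cassert>
#include <map>
#include <new>

// Minimal sketch of the creator-function table (names invented for
// illustration). Each entry runs a "fixup" constructor on raw memory via
// placement new: the constructor writes nothing except what the compiler
// adds to restore vtable pointers.
typedef void (*CreatorFn)(void* memory);

static std::map<unsigned, CreatorFn>& creatorTable()
{
    static std::map<unsigned, CreatorFn> table;
    return table;
}

void registerCreator(unsigned typeId, CreatorFn creator)
{
    creatorTable()[typeId] = creator;
}

// Run the registered constructor on a chunk of memory loaded from a file,
// without knowing the static type at the call site. A real system would
// handle unknown identifiers gracefully; this sketch does not.
void createInPlace(unsigned typeId, void* memory)
{
    creatorTable()[typeId](memory);
}

// Example polymorphic type: the empty constructor deliberately leaves the
// data field untouched, so only the vtable pointer gets restored.
struct Node
{
    Node() {}
    virtual int tag() const { return 42; }
    int value;
};

void createNode(void* memory)
{
    new (memory) Node;
}
```

&lt;p&gt;After createInPlace runs on a buffer, virtual calls work again even though no field was ever written - the placement new restored the vfptr while leaving the serialized data alone.&lt;/p&gt;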

&lt;p&gt;Well, that’s cute and stuff, but how do we fill the table? In essence, we have to call this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;registerClassByType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_registerClass&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rttiType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_Creator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_Destructor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;for each serializable type. For this, we have the following auto-registration class:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ClassType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AutoRegister&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ping&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;AutoRegister&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ClassesTable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;registerClassByType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClassType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoRegister&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;registrator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;template&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ClassType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoRegister&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClassType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AutoRegister&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClassType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;registrator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, if we ensure that this class has a proper instantiation (which is done by calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AutoRegister::registrator.ping()&lt;/code&gt;), we’re set. The ping call is performed from a function that’s generated by a macro inside the class declaration:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;RTTI&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Foo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… and herein lies the problem. You see, the compiler has to generate the code that calls the static initializer. The problem is that it has to generate this code inside each translation unit (provided ping() is instantiated in that unit, of course): since object files are compiled in isolation, the compiler cannot know whether other translation units call the same initializer. This can result in several calls to the same static initializer, so the compiler, linker and CRT have to ensure that each initializer runs only once.&lt;/p&gt;

&lt;p&gt;There are two approaches to this problem:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Generate a separate section for each static initializer call; mark the section so that the linker puts all these sections together, and CRT gets a pointer to the section block start/end. This is the approach taken by Microsoft compilers; the section, in our case, is called .CRT$XCx, with the last x substituted with some uppercase letter (which controls the initialization order - see crt0dat.c from CRT sources for more details). There is only a single call to each initializer because the linker merges the sections referring to the same initializer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Generate a separate function for each translation unit; the function contains calls to all initializers in the declaration order, and looks like this (on x86, with two static initializers in one translation unit):&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	
	cmpb	$0, __ZGVN12AutoRegisterIiE11registratorE
	je	L8
L4:
	cmpb	$0, __ZGVN12AutoRegisterIjE11registratorE
	je	L9
	leave
	ret

L9:
	movb	$1, __ZGVN12AutoRegisterIjE11registratorE
	leave
	jmp	__Z19registerClassByTypeIjEvv

L8:
	movb	$1, __ZGVN12AutoRegisterIiE11registratorE
	call	__Z19registerClassByTypeIiEvv
	jmp	L4
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There is only a single call to each initializer because of the guard checks and branches inside this function.&lt;/p&gt;
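
&lt;p&gt;In C++ terms, the generated function behaves roughly like the following sketch (a conceptual illustration of the guard-flag pattern, not actual compiler output; the register functions stand in for the registerClassByType instantiations):&lt;/p&gt;

```cpp
#include <cassert>

// Conceptual C++ equivalent of the per-translation-unit initializer shown
// in the assembly above: each static object gets a guard flag (the __ZGV...
// symbols), checked before its constructor runs, so the initializer
// executes exactly once no matter how many times the function is entered.
static bool guardIntRegistrator = false;
static bool guardUintRegistrator = false;

// Counters only exist to make the once-only behavior observable.
static int intRegistrations = 0;
static int uintRegistrations = 0;

void registerInt() { ++intRegistrations; }
void registerUint() { ++uintRegistrations; }

void staticInitForThisTU()
{
    if (!guardIntRegistrator)
    {
        guardIntRegistrator = true;
        registerInt();
    }

    if (!guardUintRegistrator)
    {
        guardUintRegistrator = true;
        registerUint();
    }
}
```

&lt;p&gt;The key point is that this function - guards, branches and all - is emitted once per translation unit, which is exactly where the size overhead comes from.&lt;/p&gt;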

&lt;p&gt;As you can see, in the second case the linker cannot merge anything - there is a big function for each translation unit; so if you have a single serializable class whose header is included in 1000 translation units, it contributes roughly 5 instructions per translation unit in the x86 case; on our target platform the overhead is 9 instructions (36 bytes) per initializer in each translation unit.&lt;/p&gt;

&lt;p&gt;The problem manifests itself when there is a moderate to large number of files, and when each file includes a lot of serializable object headers; unfortunately, while our engine code has a sensible include structure, so it generates &amp;lt;50k of initialization code, the game code tends to have spaghetti includes; thus, while each class instantiation only costs 36 bytes, across a huge number of files the total amount of initialization code became a problem. Eventually we got rid of the automatic type registration, making it semi-automatic (you had to register a type manually, but all types referenced by it got registered automatically), and reduced our executable by 1+ Mb.&lt;/p&gt;

&lt;p&gt;C++ is a powerful language; but some of its powers cost you dearly. A low-level C++ programmer must be aware of various code generation subtleties, employ various analysis tools to notice the problems early, and use certain C++ features sparingly. In other words, “Constant vigilance”!&lt;/p&gt;
</description>
			<pubDate>Sun, 10 Oct 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/10/10/death-by-static-initialization/</link>
			<guid isPermaLink="true">https://zeux.io/2010/10/10/death-by-static-initialization/</guid>
		</item>
		
		<item>
			<title>Taking testing seriously</title>
			<description>&lt;p&gt;As I’ve written &lt;a href=&quot;/2010/09/25/testing-libraries-is-important/&quot;&gt;in the previous post&lt;/a&gt;, there is a long way to go from the first tests to a complete testing suite. Without further ado, here is a list of things I consider important for the test suite of a middleware product. Some of the items are only relevant if you want automatic, continuous integration-style testing - they’re marked with an asterisk (*****).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Get a good testing framework. With a good framework you should be able to add a new test in a couple of lines of code, and a new check in a single line of code. Extra bonus points for libraries that do not require code generation, since this makes the build pipeline simpler. You can look at the existing frameworks (my personal recommendation is &lt;a href=&quot;http://unittest-cpp.sourceforge.net/&quot;&gt;UnitTest++&lt;/a&gt;), or write your own - it’s actually extremely easy to do; my frameworks are usually less than 10 kb of code. This is needed to reduce test writing friction - the more tests you write, the better.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Augment the framework by adding domain-specific testing helpers. For example, &lt;a href=&quot;http://code.google.com/p/pugixml&quot;&gt;pugixml&lt;/a&gt; is about processing XML documents, so I have a special TEST_XML(name, “xml contents”) test declaration macro that automatically declares a test with the document already loaded; I also have a set of XPath-related checking macros, e.g. CHECK_XPATH_STRING(context, “concat(‘a’, ‘b’)”, “ab”).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;***** Assertions, crashes and hangs should result in test failures instead of halting the whole process (although there should be a separate switch that lets everything crash so that you can attach a debugger). This is usually easy to do on top of any framework; on Windows you can override the unhandled exception filter, and hangs are usually dealt with by external code (e.g. the test runner).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Replace the allocation functions with special versions that check for memory leaks automatically; you can either do the check after the application runs to completion, or (preferably) check that each test deallocates all the memory it allocates (which may fail if your library has global caches). Allocation function replacement can be done at the library level (your library does all allocations through an overridable interface, right? RIGHT?), or you can just override operator new/delete - though you’re going to have problems with STL allocations (e.g. some of the memory allocated by iostreams is not freed in some MSVCRT configurations to make applications exit faster).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Depending on your application allocation policy, you can also make the allocator do one of the following:&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;Always allocate memory such that the memory immediately past the user block is a page without write access.&lt;/li&gt;
      &lt;li&gt;Never actually release deallocated memory; instead, mark it as no-access.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps catch incorrect handling of allocated memory early (the page protection itself can be done with VirtualProtect/mprotect calls).&lt;/p&gt;
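
&lt;p&gt;For reference, the first option can be sketched on POSIX systems roughly as follows. This is a simplified illustration - Windows would use VirtualAlloc/VirtualProtect instead, and a real allocator would also have to care about the alignment of the returned pointer:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Sketch of approach 1: place the user block flush against a trailing page
// with no access rights, so any write past the end of the block faults
// immediately instead of silently corrupting memory.
void* allocateGuarded(std::size_t size)
{
    std::size_t page = (std::size_t)sysconf(_SC_PAGESIZE);
    std::size_t dataPages = (size + page - 1) / page;

    // Map the data pages plus one extra page that will become the guard.
    void* base = mmap(0, (dataPages + 1) * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return 0;

    // Revoke all access to the last page.
    mprotect((char*)base + dataPages * page, page, PROT_NONE);

    // Return a pointer such that the block ends exactly at the guard page.
    return (char*)base + dataPages * page - size;
}

void freeGuarded(void* ptr, std::size_t size)
{
    std::size_t page = (std::size_t)sysconf(_SC_PAGESIZE);
    std::size_t dataPages = (size + page - 1) / page;
    void* base = (char*)ptr - (dataPages * page - size);
    munmap(base, (dataPages + 1) * page);
}
```

&lt;p&gt;With this allocator, writing even a single byte past the end of the block hits the PROT_NONE page and faults on the spot, instead of silently corrupting a neighboring allocation.&lt;/p&gt;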

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Reduce test output as much as possible. If the tests succeed, you should stick to outputting a single line of information, e.g. ‘SUCCESS: N tests passed’. On the other hand, if some of the tests fail, give as much information as you can - the names of failing tests, file/line/callstack information for failing checks, the actual tested values - these can often reduce the time it takes to fix the code.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Single click test - you should have a single command that builds the whole library together with the tests (incremental/distributed building is a must for even moderately sized libraries), runs the tests and outputs the test results. All files that are necessary for testing should be included with the tests, together with the testing scripts. Ideally, you should be able to run the tests quickly on any machine, provided it has the necessary development tools installed (e.g. a compiler).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If your library can be compiled in several configurations, test all of them - in fact, ideally you should test all configuration combinations. This ensures that you don’t have code that simply does not work in some weird configuration combo, which may very well be required by one of your users. Also this forces you to reduce the configuration combination count, which is (arguably) a good thing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If your library should support several compilers, test all of them (or at least as many as you can handle) - many C++ constructs are treated slightly differently by different compilers, and don’t even get me started on the standard library. Test all versions of all supported compilers to be sure you actually support them, since every commit can break the compilation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If your library is cross-platform, test all supported platforms (or at least as many as you can handle). Don’t forget to test 64-bit targets; also, if possible, test on both little-endian and big-endian platforms (see below).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Single click full test - again, ideally testing all of the platforms with all compilers and configurations should be automatic - you should be able to run a single command, go watch a movie, and then return to see the test report. Speaking of the test report - if you have more than a couple of platforms/configurations, you should construct a report that gives a bird’s-eye view of the state of your library. It should ideally fit on a single page, or a couple of pages, so you can immediately tell if something is wrong; keep the full build log near the summary report to be able to dig in should a problem arise.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;***** If there are many people working on the project, you should really invest your time in a continuous integration process. Usually a separate machine that runs basic tests (e.g. major configurations on the most important platforms) after each commit and does full-blown tests during the night is good enough. You do not need any special software to pull that off, although it may help - I do not have any positive experience with CI software, so I can’t really recommend anything except the DIY approach.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Code coverage is important. If some code is not executed by the tests, you do not have any evidence that it works at all. Remember how I talked about the safety net? Well, there are holes in the safety net wherever there is no coverage. You can use free tools like &lt;strong&gt;gcov&lt;/strong&gt; (although it only works with MinGW/gcc compilers) to do that; it’s trivial to write a simple gcov information parser to include the coverage statistics in your test report.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Code coverage is not everything. Even if all code lines and/or branches run under the tests, that’s not a proof of code correctness. I did a curious experiment once - I ran a script which commented out each line or consecutive pair of lines in the source code in turn, and then ran the tests; if the tests still passed, it meant that the coverage was not complete. While this is certainly not an ideal approach (it’s essentially a crude form of mutation testing), and is not feasible at all unless your code is small - around 10-30k LOC - it did help me find some redundant code, and I even caught one bug (a memory allocation failure was not handled correctly in a function) with it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;And at last, but certainly not at least - after you’ve done all of these, maintain the test suite. These things tend to break if they’re left by themselves - read the overnight test report every morning, pay attention to every test failure, and make sure they are solved as fast as possible. Otherwise, all of the above would be in vain.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, that’s it, basically. With all these steps in place, you’ll be able to say that you’ve done everything you could to ensure your product’s quality. While this does not mean that you won’t have any bugs, at least it means that you won’t have any bugs you could have anticipated.&lt;/p&gt;

&lt;p&gt;Finally, I’ll summarize the &lt;a href=&quot;http://code.google.com/p/pugixml&quot;&gt;pugixml&lt;/a&gt; test setup.&lt;/p&gt;

&lt;p&gt;All test code and data is in the Subversion repository, so everyone can check it out and build. The tests are built with the help of the &lt;a href=&quot;http://www.jamplus.org/&quot;&gt;Jamplus&lt;/a&gt; build framework - they are automatic, except that you have to install Jamplus and additionally configure all the necessary compilers on Windows - there is no way most of them can be configured automatically. All pugixml allocations go through special allocators that use both of the page protection approaches I outlined above. Since I don’t use CI, I don’t guard myself against asserts, crashes or hangs, although sometimes I feel I should.&lt;/p&gt;

&lt;p&gt;At the higher level, there are several scripts that launch jamplus with all toolsets that are supported on the current platform, with the desired configuration combinations. All configurations of a single toolset are built in a single jam run, which gives me maximum parallelism. Each script produces a log with special markers for each configuration test result.&lt;/p&gt;

&lt;p&gt;There is a top-level script, which launches the tests on all platforms with all toolsets, merges the output logs by concatenation, and then invokes the script that parses the log and produces the HTML report, a screenshot of which you can see at the beginning of the post (it’s clickable!). I run the local single-toolset single-configuration tests after each change; the full test suite is run manually after several changes (every 20 revisions or so).&lt;/p&gt;

&lt;p&gt;To test the library on different platforms, I use VirtualBox; I have several virtual machines (one for each OS, two each for Linux/FreeBSD because of 32/64 bitness), each configured to launch a special listener script on startup, which receives the build command over a socket, runs the build, outputs the result through the socket, and shuts itself down. In addition to the usual platforms (x86/x64 on Linux, FreeBSD, Solaris and MacOS X), I use MacOS X to run the tests in a big-endian environment - MacOS X lets you run programs compiled for the PowerPC architecture (they’re emulated, but it’s good enough).&lt;/p&gt;

&lt;p&gt;So, that’s it. I hope the description of the important points of the testing process, and of the process itself, was of some use to you; if you’re interested in the details (e.g. in automatically running tests via VirtualBox), you can &lt;a href=&quot;http://code.google.com/p/pugixml/source/browse/#svn/trunk/tests&quot;&gt;look at the source&lt;/a&gt; - look for .pl and .sh files, since most of the scripts are in Perl, with additional /bin/sh help. While the minimalism of my library allowed me to give extreme attention to testing, I believe that a proper testing process is critical for the code quality of any other library, regardless of size; here at work we are lacking in test coverage, but we still have a CI process that tests all platforms with all configurations automatically, and it has been very helpful - I’ve certainly never regretted the invested time.&lt;/p&gt;
</description>
			<pubDate>Sun, 03 Oct 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/10/03/taking-testing-seriously/</link>
			<guid isPermaLink="true">https://zeux.io/2010/10/03/taking-testing-seriously/</guid>
		</item>
		
		<item>
			<title>Testing libraries is important - who knew?!</title>
			<description>&lt;p&gt;Four and a half years ago, I was working on a pet game project which used XML as an intermediate storage format. Initially we used TinyXML, but I got tired of its interface and horrible parsing performance, and found pugxml. It was somewhat faster, with a somewhat better interface, but still - it was very rough. I decided to slightly change the library, improving performance and design along the way. Thus &lt;a href=&quot;http://code.google.com/p/pugixml/&quot;&gt;pugixml&lt;/a&gt; was born.&lt;/p&gt;

&lt;p&gt;Little did I know at the time that five years later I’d still be working on the same code. The amount of code and documentation is nearing a megabyte (the library itself is 280 kb; the rest is samples, tests and documentation), the revision number is 750+, and hardly any original code is left untouched - it’s not a weekend affair anymore, that’s for sure.&lt;/p&gt;

&lt;p&gt;In 2006 my approach to programming was very different: initially the library had no tests at all. When I started developing an XPath implementation, I worked with a set of simple expressions in a single function in a single source file; once I considered my implementation complete, I made a Perl script that matched the test function output against the expected pattern, to check it occasionally. Amazingly, I survived without tests for quite a while (the first proper test was added a year ago). Currently the test code is 1.5x the size of the library code, the code and platform coverage is, in my opinion, very good, and it’s time I wrote about testing.&lt;/p&gt;

&lt;p&gt;There are different types of projects, and - at least in my opinion - automated testing is not mission critical for many, and not feasible for some. Often the requirements are vague and/or non-existent, as in game development, and often they change on a weekly basis; a single feature may have three radically different implementations, with the first two being thrown out completely - you probably do not want to waste time testing those.&lt;/p&gt;

&lt;p&gt;When you’re making a library (or a cross-platform / cross-title engine), the situation is different. First, the code you’re writing is going to be used by many people on many projects/platforms. Whereas a bug in game code affects that game alone and can be fixed without any additional problems once it’s found, fixing a bug in a library has huge latency - the users will be using the old version for a while. Some of them will find the bug, and either update to the new version, fix it themselves without telling you, or disregard it entirely (“The application crashes once an hour? What, can’t you just restart it?”). The code you’ve written will fail to work on some of the platforms (we’re talking about C++ here) due to sloppy code, buggy libraries, buggy compilers (I’ve found many bugs in compilers/libraries during pugixml development; while most of them are in outdated software, there are still people out there who use pugixml with MSVC6), etc. - most of that you don’t care about when you’re delivering an application, but it will hurt you if you’re a library developer.&lt;/p&gt;

&lt;p&gt;Why are tests so important, anyway? Is it because they make sure your code is correct?&lt;/p&gt;

&lt;p&gt;No, unfortunately this is not true. You can’t even make sure your code is correct by proving it, because the proof will likely contain a bug (as a famous quote tells us, “Beware of bugs in the above code; I have only proved it correct, not tried it.”). The tests can pass because you’re lucky and they don’t expose that particular bug; the tests can have a bug; or perhaps you think that the tests are fine, and the function works, but the user/specification/other library/etc. expects a different behavior and thus the code is still incorrect.&lt;/p&gt;

&lt;p&gt;None of the above is a reason to skip the tests, because tests do improve code quality substantially. And they do it because…&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;They force you to &lt;strong&gt;use&lt;/strong&gt; your own code. Without that you’ll get hard-to-use interfaces, functions that a person once wrote and never ran (or perhaps he ran them once, and then the supporting code was restructured so that the code broke).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;They force you to &lt;strong&gt;think&lt;/strong&gt; about your own code. When you’re writing a test, you’re trying to cover many different code paths (e.g. if you have a function that does a slow name lookup and a fast handle lookup, you’ll test both of them). While thinking about your code, you’ll likely think of some way to break it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;“What if I don’t delete this object?” “Uh-oh, this callback takes a non-const reference - what if I change the object?” “This function sums the list elements - what if I pass the empty list?” “Hmm, I did not write code to deallocate strings - why isn’t there a memory leak?”&lt;/p&gt;

&lt;p&gt;By thinking about your code, you’ll be able to better understand its internals and flaws, and eventually get a better version of the code.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;They give you a &lt;strong&gt;safety net&lt;/strong&gt;. If you’re optimizing an algorithm, how do you know that it works the same way it did before? If you encounter a bug and fix it, how do you know that this bug will never appear again? If you have to upgrade to the next version of a library you’re using - how do you know what works the same way as before and what does not? How do you port your code to a new platform - does the old code work there? One possible answer to all of the above is automated testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yes, it won’t guarantee that everything works - but it’s the best you can do. Moreover, every time a bug is found, you should understand why it was not caught by your tests. Perhaps that branch of code was never tested - you should expand your tests. Or perhaps the object’s internal state got messed up - you can likely add validation code to catch such bugs earlier. Also, the bug is likely to have siblings - think about the similar code in the rest of the codebase, and add the relevant tests.&lt;/p&gt;

&lt;p&gt;Now, there are different types of tests. The exact type of testing you’re doing should depend on the component in question - some components can be unit tested, some require functional tests (at work we have unit tests for some components, and screenshot-based tests for others). Often the results are impossible to verify analytically (e.g. an A.I. simulation), but you can do a smoke test (which, I believe, is generally too weak for libraries, but can be applied to games with good results).&lt;/p&gt;

&lt;p&gt;Since we’re talking about libraries, the majority of them do not require smoke tests - they can be tested using a combination of unit and functional tests.&lt;/p&gt;

&lt;p&gt;The first steps are easy - you get a testing framework, and start writing tests. The tests verify the functionality the code has to deliver, the new bug reports are converted to tests - everything is fine. It’s also easy to use - there is this special compilation mode which you have to toggle in the vcproj, then copy a couple of files to the bin folder, and then the application outputs the test log, which says ‘passed’ or ‘failed’ for each test, along with some useful debugging information - but it’s easy to grep for ‘fail’.&lt;/p&gt;

&lt;p&gt;It seems that even after all these tests, there is a wide gap between this kind of testing and what I’ll call serious testing. Next time I’ll discuss the features of a serious testing process, as I see it; for now let me tease you with a screenshot of the pugixml automated test report (it’s clickable), taken while I was writing this post:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/pugixml_autotest.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</description>
			<pubDate>Sat, 25 Sep 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/09/25/testing-libraries-is-important/</link>
			<guid isPermaLink="true">https://zeux.io/2010/09/25/testing-libraries-is-important/</guid>
		</item>
		
		<item>
			<title>There&apos;s more than one way to map a cat</title>
			<description>&lt;p&gt;There are lots of data structures out there, ranging from primitive to so sophisticated that only a single person in the world understands them (these are, of course, mostly useless). The choice of the data structure is mostly specific to the problem; however, obviously some data structures are generally more popular/useful than others.&lt;/p&gt;

&lt;p&gt;I’d say that the most generally useful data structure in the imperative world is the array. Arrays are as good as you can get if you do not need fast searching over large datasets - they have a simple, efficient memory access pattern (unless you need a multi-gigabyte array), leading to fast iteration; they have minimal metadata (the per-instance cost is zero); they can be indexed in guaranteed constant time; array transformations are trivially parallelizable; they generalize to multiple dimensions easily; etc.&lt;/p&gt;

&lt;p&gt;However, in the functional world, from my limited experience, the favorite data structure appears to be the consed list, better known as a singly-linked list. There is a type called a cons cell, which is a pair. A list consists of cons cells linked together - the first element of a cons cell holds the data, while the second points to the next cell, or to a special object, nil, which represents the empty list:&lt;/p&gt;

&lt;div class=&quot;language-lisp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cons&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cons&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cons&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cons&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;For some reason, a linked list is also the favorite data structure of our game logic programmers, though this has nothing to do with functional programming.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consed lists are present in lots of functional programming languages; in some, they’re the fundamental data structure (Lisp/Scheme rely heavily on consed lists - even the code is composed of s-expressions, which are stored in consed-list form; Haskell strings are consed lists of characters; etc.). However, for a programmer who spends half of his work day in a profiler, consed lists are a rather bad structure.&lt;/p&gt;

&lt;p&gt;The benefits of consed lists are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A new element can be prepended to the list in constant time, without mutation; the old and new lists share the tail.&lt;/li&gt;
&lt;/ul&gt;
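&lt;p&gt;To make this concrete, here is a small F# sketch (my example, not from the discussion above): prepending allocates exactly one cons cell, and the new list shares its entire tail with the old one:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;let xs = [2; 3; 4]
let ys = 1 :: xs   // O(1): allocates a single cell; ys shares [2; 3; 4] with xs
let zs = 0 :: xs   // xs is unchanged and is now shared by both ys and zs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;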

&lt;p&gt;Not much, is it? The drawbacks are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The memory access pattern is unpredictable and usually bad, leading to cache misses;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Even with a good allocator, at least half of the memory is used for metadata, increasing memory bandwidth requirements;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Parallel processing is hard; if the processing function is relatively cheap, there will be no gains from parallelism;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;While you can easily insert an element before the first one, you can’t easily append to the list. There is nothing pretty about non-symmetrical data structures;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You can’t easily remove an element from the list either;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Each cell is a heap-allocated object, which puts considerable pressure on the garbage collector in GC’d environments.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is a function called map (mapcar in Common Lisp), which takes a consed list and a one-argument function, and returns a new list produced by applying the function to every element of the source list. This function obviously has O(N) time complexity and O(N) space requirements (after all, it generates a new list!). Looking at various possible implementations of this function gives some insight into the problems of consed lists in particular and of functional programming style in general.&lt;/p&gt;

&lt;p&gt;My language of choice for today is F#, a multi-paradigm language (with both functional and imperative object-oriented elements) built on the .NET platform. Of course, F# has built-in consed lists, so we’ll start with them.&lt;/p&gt;
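&lt;p&gt;For reference, the built-in F# version of the function behaves like this:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;List.map (fun x -&amp;gt; x * 10) [1; 2; 3; 4]   // evaluates to [10; 20; 30; 40]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;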

&lt;h3 id=&quot;naive-recursive-approach&quot;&gt;Naive recursive approach&lt;/h3&gt;

&lt;p&gt;When an imperative programmer has to implement a map function, his first natural reaction is to write a for loop. However, for loops usually require a mutable iteration variable, and thus are either discouraged or absent altogether in functional languages. Instead, functional programmers like to use recursion. Indeed, a recursive implementation of map is straightforward, once you get used to recursive functions:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;rec&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapcat1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Aside from some syntactic weirdness (:: is used to make a cons cell, function arguments are specified without commas or braces, and I use pattern matching here instead of ifs), the code should be self-explanatory. Mapping an empty list produces an empty list; mapping anything else can be done recursively by making a cons cell whose first element is the transformed head of the original list, and whose rest is the result of recursively mapping the tail.&lt;/p&gt;

&lt;p&gt;This function works, but it has a tiny problem - it’s recursive. Wait, what?&lt;/p&gt;

&lt;h3 id=&quot;tail-recursive-approaches&quot;&gt;Tail-recursive approaches&lt;/h3&gt;

&lt;p&gt;You see, as you’ve likely been taught, recursion is bad. Each call to a recursive function pushes the function arguments and the return address onto the stack, creating a stack frame; a recursive function like map creates N stack frames for a list of N elements, so it requires O(N) temporary memory. What’s worse is that in some functional languages, like F#, the size of the stack is bounded, so processing a large list is going to generate a StackOverflowException (Haskell, on the contrary, is happy to grow the stack until no address space is left in the process).&lt;/p&gt;

&lt;p&gt;To solve this problem without loops, the concept of tail recursion is introduced. If a function call is the very last thing the function does, no additional memory is needed - the old stack frame is not needed after the call, so the new frame can replace it. In some languages tail-call elimination is an optional optimization (for example, some C++ compilers do it); in others it’s a spec requirement (i.e. a guaranteed feature).&lt;/p&gt;
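&lt;p&gt;To illustrate the difference on a simpler function (my example, not part of the map discussion): in the first version an addition is still pending after the recursive call returns, so the frame must be kept alive; in the second, the recursive call is the very last thing the function does:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;let rec sum lst =
    match lst with
    | car :: cdr -&amp;gt; car + sum cdr   // not a tail call: the + runs after the call returns
    | [] -&amp;gt; 0

let sumTail lst =
    let rec loop rest acc =
        match rest with
        | car :: cdr -&amp;gt; loop cdr (acc + car)   // tail call: nothing is left to do in this frame
        | [] -&amp;gt; acc
    loop lst 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;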

&lt;p&gt;There is a common recursive function transformation pattern, which is almost never done automatically by the compiler, but is usually easy to do by hand. Our map function can be transformed into this one:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;rec&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There is an inner recursive function, conveniently named loop (which, I guess, shows my imperative background); the part of the result list which has already been built is stored in the acc(umulator) argument, which is updated at each call with the new element.&lt;/p&gt;

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loop cdr (pred car :: acc)&lt;/code&gt; is tail recursive, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(pred car) :: (loop pred cdr)&lt;/code&gt; is not - there is an extra cons operator after the function call.&lt;/p&gt;

&lt;p&gt;The new version of the function works without stack overflows even on large inputs. However, due to the code structure change, it produces a list with the elements in reverse order: we walk through the list and prepend each element to the result, so the first element of the original list ends up as the last element of the new list.&lt;/p&gt;
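&lt;p&gt;For example, assuming the definitions above:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mapcat1 (fun x -&amp;gt; x * 10) [1; 2; 3]   // [10; 20; 30]
mapcat2 (fun x -&amp;gt; x * 10) [1; 2; 3]   // [30; 20; 10] - reversed!
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;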

&lt;p&gt;Well, it’s easy - instead of prepending, we’ll append!&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;rec&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’ve changed :: to @, which is the list append operator. Victory!&lt;/p&gt;

&lt;p&gt;Wait, why is the program still running?&lt;/p&gt;

&lt;p&gt;Remember I said that consed lists are not symmetrical? You can easily prepend an element, but you don’t know where the last element of the list is, so append has to iterate through its entire left argument, making the new function quadratic. Oops. Note that this wouldn’t be a problem for a list that keeps a pointer to its last node, but, unfortunately, append was not a priority 50 years ago, when Lisp was accidentally created.&lt;/p&gt;
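&lt;p&gt;A sketch of why (a naive, non-tail-recursive illustration, not the actual library implementation): @ has to rebuild a copy of its entire left argument just to attach a new tail to it:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;let rec append left right =
    match left with
    | car :: cdr -&amp;gt; car :: append cdr right   // copies every cell of the left list
    | [] -&amp;gt; right
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;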

&lt;p&gt;Ok, it seems we’ll just have to reverse the result. A relatively common interview question is to code a singly-linked list reverse function. An in-place version is usually required, so that the old list gets destroyed in the reversing process; we can’t do that here, because cons cells, like all other objects in a pure functional world, are immutable - you can’t change them once they’re created. So we have to make an additional copy of the list (we can even reuse our map function for this, because it conveniently returns a transformed reversed list):&lt;/p&gt;
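&lt;p&gt;The reverse used below can simply be the standard List.rev; a hand-rolled accumulator version (which is also tail-recursive) would look like this:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;let reverse lst =
    let rec loop rest acc =
        match rest with
        | car :: cdr -&amp;gt; loop cdr (car :: acc)
        | [] -&amp;gt; acc
    loop lst []
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;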

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat4&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;mapcat2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reverse&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;mapcat2&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;|&amp;gt; is the F# pipeline operator; you can read it as “the result of the left expression is passed as an additional (last) argument to the right expression”.&lt;/p&gt;

&lt;p&gt;Now we finally have a correct solution; it is tail-recursive, so it consumes O(1) stack space, but it creates an additional copy of the list, so it still needs O(N) temporary memory. However, it works in F# for long lists.&lt;/p&gt;

&lt;h3 id=&quot;continuation-passing-style&quot;&gt;Continuation-passing style&lt;/h3&gt;

&lt;p&gt;There is another workaround for the stack frame problem, and it is very functional in spirit. To build the list in the right order, we need to traverse the list, remembering the chain of nodes, and then unwind the chain starting from the last node. This is essentially what the naive recursive approach does, but we can stay tail-recursive if the unwinding chain is formed from continuations.&lt;/p&gt;

&lt;p&gt;Think of it this way: we have to make a function, which, given a tail of the result, prepends another element to it, forming the next tail. If we have N such functions, and each is calling the next one, then the unwind chain will be formed from the function calls, which coincidentally happen to be tail recursive.&lt;/p&gt;

&lt;p&gt;This is an example implementation (fun x -&amp;gt; expr is a way to create an anonymous function with argument x which evaluates expr as its result):&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat6&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;rec&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cont&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cont&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cont&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;[]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We start with an identity function, which, given an argument x, returns it back. Then we form a function chain, which for a list [1; 2; 3] will look like this (in the order of creation):&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;(* this is equal to fun acc1 -&amp;gt; pred 1 :: acc1 *)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;(* this is equal to fun acc2 -&amp;gt; (pred 1) :: (pred 2) :: acc2 *)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fun&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acc3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;(* this is equal to fun acc3 -&amp;gt; (pred 1) :: (pred 2) :: (pred 3) :: acc3 *)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And finally the result is called with an empty list, which results in (pred 1) :: (pred 2) :: (pred 3) :: [], which is what we want.&lt;/p&gt;

&lt;p&gt;We’ve successfully traded N stack frames for N closure instances, hurray! (Amazingly, this is faster than the original recursive version; see below for timings.)&lt;/p&gt;

&lt;h3 id=&quot;imperative-world&quot;&gt;Imperative world&lt;/h3&gt;

&lt;p&gt;Timing the standard List.map function, which does the same thing, showed that it’s way faster than the fastest of the solutions above. The only way to close the gap, as far as I understand, is to introduce a mutable data structure, which means introducing a special type instead of the built-in cons cell (Scheme-aware readers will immediately recognize that Scheme has a built-in set-cdr! function, which is what we’d need here).&lt;/p&gt;

&lt;p&gt;The code is very much imperative, apart from the tail-recursion-instead-of-loops, so I’ll leave it without explanations:&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;mutable&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;car_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Unchecked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;defaultof&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cons&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;car&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;car&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapcat2_mut&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;rec&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nn&quot;&gt;Object&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;ReferenceEquals&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cons&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pred&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;car&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nil&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cell&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Cell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loop&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lst&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cdr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that I always create a stub head object which is then thrown away; that’s because I’m lazy.&lt;/p&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;p&gt;Now let’s stuff everything in a single program, add some timing code, and look at the results (the complete F# source is &lt;a href=&quot;https://gist.github.com/zeux/505cd2e6547d26ee002f&quot;&gt;available here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The program, when run, outputs the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;recursive&quot; took 1.656725 ms
&quot;tail-recursive (cons)&quot; took 0.341175 ms
&quot;tail-recursive (cons) + reverse&quot; took 0.922769 ms
&quot;tail-recursive (cons) + reverse (via map)&quot; took 0.943074 ms
&quot;tail-recursive (continuations)&quot; took 1.110428 ms
&quot;standard&quot; took 0.496710 ms
&quot;mutable recursive&quot; took 1.430596 ms
&quot;mutable tail-recursive&quot; took 0.563062 ms
&quot;mutable loop&quot; took 0.577617 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, as we can see, the fastest way is the List.map function, closely followed by our mutable variant (there are both tail-recursive and loop versions here - F# has native support for loops); the next best are the functions which construct two lists, followed by the continuation version (amazing!), and, finally, the naive recursive version. The first tail-recursive variant is the fastest of them all, but it’s incorrect (it returns the elements in reverse order).&lt;/p&gt;

&lt;p&gt;How did they do that? Why is List.map as fast as (well, even 10% faster than) our mutable version, given that the F# list node is immutable? I studied the F# assembly using ildasm, and found out that…&lt;/p&gt;

&lt;p&gt;… they mutate the resulting list. List.map creates a head node from the first element of the list, and then calls mapToFreshConsTail, which creates the rest, and modifies the tail (cdr) of the cells in the process.&lt;/p&gt;

&lt;p&gt;Conclusion: when purity and performance collide, performance usually wins.&lt;/p&gt;

&lt;p&gt;Oh, and using arrays here results in 0.1 ms runtime, which is 5x faster than the fastest list-based solution. Just saying.&lt;/p&gt;
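&lt;p&gt;For reference, the array version is a one-liner (my sketch; the timing harness is in the linked gist):&lt;/p&gt;

&lt;div class=&quot;language-ocaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Array.map (fun x -&amp;gt; x * 10) [| 1; 2; 3; 4 |]   // [| 10; 20; 30; 40 |]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;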
</description>
			<pubDate>Sun, 19 Sep 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/09/19/mapcat/</link>
			<guid isPermaLink="true">https://zeux.io/2010/09/19/mapcat/</guid>
		</item>
		
		<item>
			<title>View frustum culling optimization - Balancing the pipes</title>
			<description>&lt;p&gt;Last time (I don’t blame you if you forgot, that was a year and a half ago) I described the view frustum culling solution, which involved projecting the box to clip space and testing the points against plane equations there. This is more or less the solution we used at work at the time; the production version has two additional features, size culling (after the culling operation we have the clip-space extents, so we can test if the box screen-space area is too small) and occlusion culling (boxes are tested against the software Z-buffer, again, using the clip-space extents). However, we’re going to keep things simple and see if we can optimize the simple culling function further.&lt;/p&gt;

&lt;p&gt;The fastest version back then stopped at 104 cycles per test. Let’s look at the code, count the instructions, and think about the further optimization possibilities.&lt;/p&gt;

&lt;p&gt;The full version of previous code &lt;a href=&quot;https://gist.github.com/zeux/1fb08fb04ae97c79852e&quot;&gt;is here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;SPU instructions can be separated into several groups; instructions in each group usually share the same latency and pipeline. The SPU has two pipelines - even and odd. The instruction set is split accordingly: even instructions execute on the even pipe, odd instructions on the odd pipe. The SPU can execute two instructions per cycle if they go to different pipes, if there are no stalls due to register dependencies, and if the addresses of the instructions are “even” and “odd”, respectively (all instructions are 4 bytes long, so the even/odd distinction refers to the offset modulo 8 - it can be either 0 or 4).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Obviously, the information described here is SPU-specific, but only up to a point. Latency hiding is very useful on many other architectures, and optimizing for proper pipe utilization is often useful even in pixel shaders.&lt;/p&gt;
&lt;/blockquote&gt;
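&lt;p&gt;For example, a fragment that satisfies all three conditions could dual-issue like this (the instructions and addresses here are made up for illustration):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;0x100 (even slot): fma r1, r2, r3, r4    \ issued together
0x104 (odd slot):  shufb r5, r6, r7, r8  / on the same cycle
0x108 (even slot): fa r9, r10, r11       \ issued together
0x10c (odd slot):  lqd r12, 0(r13)       / on the same cycle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;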

&lt;p&gt;The ultimate goal, of course, is to keep each cycle completely busy - so that on every cycle there are two instructions to execute. This is hard to achieve in practice because of register dependencies and a lack of instruction balance.&lt;/p&gt;

&lt;p&gt;The register dependency problem refers to the fact that each instruction has some latency. Generally, after an instruction is issued, the next instruction can be issued on the next cycle (or on the same cycle, if the dual-issue conditions above are met). However, the actual result of an instruction usually becomes available only after several cycles; trying to read from the destination register before the instruction has written new data into it results in a stall. Let’s take vector-matrix multiplication as an example. We have four columns of a matrix in registers c0 through c3, and the column vector in register p. Then the transformation code (which resembles the code from &lt;a href=&quot;/2009/02/08/view-frustum-culling-optimization-–-vectorize-me/&quot;&gt;the earlier article in this series&lt;/a&gt;, look for transform_point) might look like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;shufb px, p, p, const_splat_x
shufb py, p, p, const_splat_y
shufb pz, p, p, const_splat_z

fma result, pz, c2, c3
fma result, py, c1, result
fma result, px, c0, result
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
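&lt;p&gt;To see where the cycles go, here is an approximate issue timeline for this sequence (a sketch, assuming a 4-cycle shufb latency and a 6-cycle fma latency):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cycle  0: shufb px        (px ready at cycle 4)
cycle  1: shufb py        (py ready at cycle 5)
cycle  2: shufb pz        (pz ready at cycle 6)
cycles 3-5: stall         (waiting for pz)
cycle  6: fma result      (result ready at cycle 12)
cycles 7-11: stall
cycle 12: fma result      (result ready at cycle 18)
cycles 13-17: stall
cycle 18: fma result      (result ready at cycle 24)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;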

&lt;p&gt;Wow, this is fast! Only three instructions for the actual math, and three for splatting the components. Yeah, but fma has a 6-cycle latency, so there is a 5-cycle stall after each fma (and a 3-cycle stall before the first fma, because shufb has a 4-cycle latency). So this code transforms the point in 24 cycles. We can modify the code to slightly reduce the stalls:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;shufb px, p, p, const_splat_x
shufb py, p, p, const_splat_y
shufb pz, p, p, const_splat_z

fm tempx, px, c0
fma tempy, py, c1, c3
fa tempxy, tempx, tempy
fma result, pz, c2, tempxy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We waste 1 cycle before the first fm (dependency on px), 5 cycles before the first fa (dependency on tempy) and 5 cycles before the final fma (dependency on tempxy). So we won approximately 6 cycles. This is still not a lot, and the proper way to speed this up is to add other code to hide the latencies. For example, if we have to transform 6 points, we can just replicate each instruction 6 times (of course, we’ll use 6x more registers) and eliminate all latency stalls; that way we can transform 6 points in approximately 41 cycles (6 instructions * 6 points = 36 cycles, plus 5 cycles of latency for the last point), which is much better.&lt;/p&gt;
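&lt;p&gt;The replicated schedule can be sketched as follows (the register names are made up; only the first of the three fma stages is shown):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;shufb px0 ... shufb pz5   ; 18 shuffles for 6 points
fma r0, pz0, c2, c3       ; by the time the next stage reads r0,
fma r1, pz1, c2, c3       ; five other fma have issued, so the
fma r2, pz2, c2, c3       ; 6-cycle latency is completely hidden
fma r3, pz3, c2, c3
fma r4, pz4, c2, c3
fma r5, pz5, c2, c3
fma r0, py0, c1, r0       ; second stage starts without a stall
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;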

&lt;p&gt;The next goal for full pipe utilization after register dependencies are eliminated is to make sure that pipes are evenly balanced. As I’ve written before, each instruction executes on one pipe; for example, all floating-point instructions execute on even pipe (here is &lt;a href=&quot;http://www.insomniacgames.com/tech/articles/0907/files/spu_instruction_cheat_sheet.pdf&quot;&gt;a useful document&lt;/a&gt; that has latency and pipe information for all SPU instructions). If your code executes in 100 cycles, and has 80 floating-point instructions, then there is not much you can do (unless you can remove some of those).&lt;/p&gt;

&lt;p&gt;Let’s check the code for the function above. I’ve run a simple Perl one-liner to count the instructions in an assembly fragment (did I mention I love Perl one-liners?):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;perl -e &quot;while (&amp;lt;&amp;gt;) { $x{$2}++ if (/^.+?:(\s+\S+){4}\s+(\S+)/); } print qq{$_ $x{$_}\n} foreach (sort keys %x);&quot; &amp;lt;file.s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and got this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;and 5
bi 1
fcgt 12
fm 3
fma 33
hbr 1
il 1
ila 1
ilhu 4
iohl 3
lqd 10
lqr 2
or 6
orx 6
shufb 32
xor 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We have 48 floating-point instructions (even), 12 loads (odd), 13 per-component bitwise operations (even), 6 orx (this is a bitwise operation that operates on the whole quadword at once; generally, such operations are odd), 32 shufb (odd) and 11 other instructions that deal with constant formation and returning from the function, which we’ll ignore for now (pretend they don’t exist). So there are 61 even instructions and 50 odd instructions - the code is more or less balanced, but it could’ve been better. There are two problems in the code that are easy to fix.&lt;/p&gt;

&lt;p&gt;The first problem is that we call transform_points_4 twice; while the shuffles are shared between the calls (each call to transform_points_4 has 16 shuffles, but they are the same across the two calls, because they operate on the same view-projection matrix), some math could be shared but is not. We call the function like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat-&amp;gt;row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat-&amp;gt;row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat-&amp;gt;row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat-&amp;gt;row0, c), res_ ## c); \
    dest[c] = res_ ## c;
&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
&lt;span class=&quot;cp&quot;&gt;#undef COMP
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the X and Y vectors for the two groups of points are the same, which is expected because in local space our box is an AABB (each group of 4 points represents the points of the face with the normal pointing up or down the Z axis). However, we start the multiply-add chain with the Z component, which prevents sharing the calculations.&lt;/p&gt;

&lt;p&gt;Rearranging the computations in xyz or yxz order enables us to share 8 floating-point operations with the previous call:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;res_&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;##&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res_&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;##&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; \
&lt;span class=&quot;n&quot;&gt;res_&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;##&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res_&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;##&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; \
&lt;span class=&quot;n&quot;&gt;res_&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;##&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res_&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;##&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; \
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
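&lt;p&gt;After inlining, the compiler can merge the common subexpressions between the two calls; conceptually, per component c, the computation becomes something like this (a sketch, not actual compiler output):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;; shared between both calls (2 fma per component, 8 in total)
tmp_c  = fma(x, splat(row0, c), splat(row3, c))
tmp_c  = fma(y, splat(row1, c), tmp_c)

; per-call part - only the z input differs
res0_c = fma(z0, splat(row2, c), tmp_c)
res1_c = fma(z1, splat(row2, c), tmp_c)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;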

&lt;p&gt;Another minor annoyance is that we have to negate the w component and compare Z with 0. The point of the code in question:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// calculate -w&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0_negw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_xor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1_negw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_xor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// for each plane...&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define NOUT(a, b, c, d) si_orx(si_or(si_fcgt(a, b), si_fcgt(c, d)))
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#undef NOUT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;is to calculate, for each plane, whether any point is not outside the plane - e.g. whether there is any point with z &amp;lt; w for the far plane. The code does it by computing z &amp;lt; w for all points, and then or-ing the results together. Instead we can abuse the fact that for negative numbers, the sign (most significant) bit is 1. For the far plane we can take w - z instead; if it is negative for all points, then z &amp;lt; w holds for no point, and the box is outside. So we can take w - z for all points, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;and&lt;/code&gt; the results together, and check the most significant bit - it is 1 iff the box is outside.&lt;/p&gt;
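&lt;p&gt;To illustrate with concrete numbers (the bit patterns are IEEE-754 single precision; the values are made up):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;point 0: w = 2.0, z = 1.0   w - z = +1.0 = 0x3F800000  (sign bit 0)
point 1: w = 2.0, z = 3.0   w - z = -1.0 = 0xBF800000  (sign bit 1)

0x3F800000 and 0xBF800000 = 0x3F800000                 (sign bit 0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The sign bit of the combined result is 0, so at least one point has z &amp;lt; w, and the box is not culled by the far plane.&lt;/p&gt;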

&lt;p&gt;SPU does not have a horizontal-and instruction (a straightforward way to do the above would be to do something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_andx(si_and(..., ...))&lt;/code&gt;), but we can replace this with the equivalent:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;not&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;andx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;not&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;not&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Fortunately, there is a not(and(a, b)) instruction available, so we can write the code as follows:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// for each plane...&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define NOUT(op, idx0, idx1) si_orx(si_nand(op(points_cs_0[idx0], points_cs_0[idx1]), op(points_cs_1[idx0], points_cs_1[idx1])))
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_fa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// (x + w) &amp;gt;= 0 for any point&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_fs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// (w - x) &amp;gt;= 0 for any point&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_fa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// (y + w) &amp;gt;= 0 for any point&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_fs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// (w - y) &amp;gt;= 0 for any point&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_orx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_nand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]));&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// z &amp;gt;= 0 for any point&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_fs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// (w - z) &amp;gt;= 0 for any point&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#undef NOUT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With these two modifications, we remove 10 floating-point operations and two xor’s (and replace or’s with nand’s, which cost the same); we do have to convert the most significant bit of the result into a mask, which can be done with a single arithmetic right shift (it replicates the sign bit, yielding 0 or -1):&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_to_int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In total we saved 11 even instructions, so now there are 50 even and 50 odd instructions - much better. The performance of is_visible improved slightly (by 7 cycles, to be precise), so it is now 97 cycles. Why does it take 97 cycles to execute 50 even instructions? Well, the code still has some register dependency stalls; also, while the amount of work on the two pipes is now roughly equal, this work has to be done at different times throughout the execution - i.e. this code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;shufb
...
shufb
fma
...
fma
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;is slower than this code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fma
shufb
...
fma
shufb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;because the first one issues 1 instruction per cycle (given no register dependencies), and the second one issues 2. There are several ways to fix this, the easiest being to do more work in the function and let the compiler rearrange the instructions to better utilize dual issue. Additionally, some calculations, such as constant formation, can then be shared.&lt;/p&gt;

&lt;p&gt;Making this change is trivial - just rename the old function to is_visible_impl and make it inline, and add a new function:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;__attribute__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;noinline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;is_visible&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;aabb_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_from_uint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_visible_impl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_from_uint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_visible_impl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_from_uint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_visible_impl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_from_uint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_visible_impl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the frustum is the same for all AABB/matrix pairs, which makes sense for common usage patterns.&lt;/p&gt;

&lt;p&gt;This code runs at 74 cycles per iteration (measured over 1024 iterations), which is much closer to the optimal 50. Of course, the code size is larger now, and we’ll have to restructure the calling code.&lt;/p&gt;

&lt;p&gt;There is another technique that can reduce stalls and improve the dual-issue rate, called software pipelining. I don’t know yet whether it will prove useful in this case; if it does, I’ll demonstrate it on this code, otherwise I’ll show it on different (simpler) code.&lt;/p&gt;

&lt;p&gt;The complete source for this post can be &lt;a href=&quot;https://gist.github.com/zeux/ef12716faf6e3b54a2b3&quot;&gt;grabbed here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;View Frustum Culling series contents:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/01/31/view-frustum-culling-optimization-introduction/&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/08/view-frustum-culling-optimization-vectorize-me/&quot;&gt;Vectorize me&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/&quot;&gt;Structures and arrays&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/&quot;&gt;Never let me branch&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/15/view-frustum-culling-optimization-representation-matters/&quot;&gt;Representation matters&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Balancing the pipes&lt;/strong&gt;&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Sat, 11 Sep 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/</link>
			<guid isPermaLink="true">https://zeux.io/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/</guid>
		</item>
		
		<item>
			<title>Reset and reload</title>
			<description>&lt;p&gt;Long time no see, everyone.&lt;/p&gt;

&lt;p&gt;First of all, the blog has been moved. The new blog address is &lt;a href=&quot;http://zeuxcg.org&quot;&gt;http://zeuxcg.org&lt;/a&gt;, and the new feed address is &lt;a href=&quot;http://zeuxcg.org/feed&quot;&gt;http://zeuxcg.org/feed&lt;/a&gt;. Please update your bookmarks!&lt;/p&gt;

&lt;p&gt;In addition to changing the address, I’ve changed the blogging platform - this blog is now powered by WordPress, which at first impression is superior to Blogger in many ways - a built-in code highlighter, built-in ‘read more’ support, image storage, slightly better HTML generation (i.e. it does not mangle my posts as often as Blogger did), better themes, non-anonymous comments without a Google account, etc. I bet there are downsides too, but I hope it will be a better experience (and will motivate me to write more posts, of course).&lt;/p&gt;

&lt;p&gt;All old posts are imported from Blogger along with the comments; their content is left as-is, apart from minor cleanup and link cross-referencing.&lt;/p&gt;

&lt;p&gt;Previously most of my posts were of considerable length; I even went as far as stuffing several completely different notes into a single post. The format is going to change slightly - there will be small notes as well as normal-sized posts. The number of non-graphics-related posts will probably increase as well; still, I’ll try to keep the content mostly relevant to game development.&lt;/p&gt;
</description>
			<pubDate>Sat, 11 Sep 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/09/11/reset-and-reload/</link>
			<guid isPermaLink="true">https://zeux.io/2010/09/11/reset-and-reload/</guid>
		</item>
		
		<item>
			<title>Don&apos;t you dare flip that bit</title>
			<description>&lt;p&gt;I’ve decided to take a small break from VFC series and post something completely different. VFC series continues next time, don’t worry.&lt;/p&gt;

&lt;p&gt;In our studio, there is a single engine that’s shared between most projects. There is a separate team which develops and maintains it – I happen to be part of that team. Each project has its own branch of the engine, and changes are integrated in both directions as needed. Of course, the engine needs a solid test base. We have two methods of testing our code – unit tests and small demos (we call them spikes). Each spike demonstrates some specific feature (i.e. particle systems, animation trees, a huge scene with complex shaders, etc.) and often has some custom C++ code and some art, of course. Sometimes it’s just boxes and spheres (for some reason, there is no teapot object in Maya, so no teapots :-( ), sometimes it’s scenes from our projects. We have some automated means to test the spikes (memory leak/log error checks, screenshot comparison), but this post is not about them.&lt;/p&gt;

&lt;p&gt;The story starts with one of the spikes, which contains a level from a game in development. It’s a moderately complex level; I usually develop/optimize rendering code in it. One day I run it on PS3, move the camera somewhere and suddenly notice something weird – there is an object (a tree, actually) which stretches from the place it ought to be to the level boundary or farther (unfortunately, I don’t have any screenshots, and I’m not allowed to post them for that matter). Well, there is a PIX-like tool on PS3, so I launch it, capture the frame, look at the mesh and see that there is one vertex that went astray. I reload the spike to narrow down the problem via edit &amp;amp; continue (I’ll have to blog about that some day), and the object is okay. Well, &lt;em&gt;shrug&lt;/em&gt; – back to work.&lt;/p&gt;

&lt;p&gt;Then approximately a week later I was fixing shadows on the same project the scene was from (for some reason, my love for PSM vanished once I started using it in an actual project…), and while going through the level I suddenly notice the same bug – some other object has a stray vertex. I launch the PIX-like tool again, and now inspect the problem closer – it appears that one of the indices is broken – instead of 0x0023, it’s 0x8023, which leads to sampling an incorrect vertex (with position.w != 1). I note the address and reload the game (this tool requires a special program to be launched on some devkit to “replay” the capture – replaying gives the ability to see wireframe, for example – and since for some reason we only have one PS3 devkit for the engine team, I launch the replay on the same kit I captured from, so if I want the game running I have to reload it) – and the problem is gone again. Well, I reload it several times over the next hour, and the object is still ok. I check our geometry file – and the index in it is 0x0023. There we have it – a heisenbug. Apparently, either the file loaded incorrectly, or someone damaged my index buffer (which is in video memory, btw).&lt;/p&gt;

&lt;p&gt;Well, how do you debug a heisenbug? First, I had to make a list of possible reasons. But in order to do that, I need more data – two random occurrences are not enough! I talk this over with my lead and we decide to wait till it appears again. I also check with the project’s QA – they’ve never seen it, but they’ll tell me immediately if they do. Great.&lt;/p&gt;

&lt;p&gt;The next day I launch the game again, go through the level a couple of times and suddenly notice the same bug – the same vertex is displaced. After leaving the game running for some more time and looking in different places, I find several more instances – it even appears on some skinned characters. Moreover, it’s obvious now that the bug appears over time – I can cycle through the level and see more and more stray vertices. So I do a few captures, analyze the data, record the addresses along with the expected and actual contents, look at the resulting file and think.&lt;/p&gt;

&lt;p&gt;Now we have a pattern. First, it’s obviously memory corruption; file loading is not to blame – the loaded meshes stay at the same places in memory. Second, it affects both vertex and index buffers – on some models the wrong data is in the vertex buffer (by the way, noticing a bug in a vertex buffer by looking at vertex positions is… hard. Especially if it’s 3500 vertices long). Third, it, uh, affects only one bit. For 0x0023 vs 0x8023 this could have been a full-byte corruption, but on vertex data it’s obvious that exactly one bit is damaged – sometimes it’s 1 when it should have been 0, sometimes vice versa. Neighboring bits/bytes are not affected (comparison with the geometry file confirms it). As another side note, index buffer data is very self-similar – when I search for a pattern from a vertex buffer in the file, there is often a single match even if the pattern is only 16 bytes long; for index buffers it sometimes takes 100 bytes or so to pin down the correct place.&lt;/p&gt;

&lt;p&gt;Now, the weird thing is that this happens to geometry that should never be touched by any code after it’s loaded – we don’t do any vertex/index processing on static geometry; we even skin on the GPU. So it should be some seemingly unrelated code that performs bit manipulation – probably some state packed in a bit vector? Well, who knows… Anyway, I try to disable whatever code can be disabled more or less safely without losing the ability to render (so that I can still see something) – it does not help, i.e. the bugs are still there. They appear in different places on different runs, but there is one object that is consistently damaged in the same way no matter what I change. I quickly find the address and plug it into a debugger watch.&lt;/p&gt;

&lt;p&gt;In this case the memory layout for level data is static (thank god!), so addresses do not change between runs. After several runs the address is confirmed – it’s always damaged. Moreover, if I fix the value there, after some time (ranging from instantly to ten-twenty seconds) it’s damaged again – and I’m able to confirm that only one bit changes. We’re getting somewhere – one of the many manifestations of the bug reproduces reliably, so I can do something about it.&lt;/p&gt;

&lt;p&gt;Now that we know the story, it’s time to think about suspects. We’ve got the PPU, the SPUs and the RSX in the picture. It does not look like the RSX – it’s pretty hard to set up rendering so that exactly one bit somewhere in the middle of geometry data is damaged and everything around it is okay. SPUs can only access external (video/system) memory via DMA, and DMA transfers have to be at least 4-byte aligned – so if it’s an SPU, it loaded a chunk of memory, changed a bit there and put it back. We had around 100 kb of our own SPU code back then, so we checked it, and it did not seem to do anything like that. Still, that was a possibility, and we also had Havok running on SPUs, so we could not eliminate them. And, well, we have lots of PPU code that can freely write wherever it wishes.&lt;/p&gt;

&lt;p&gt;Step one – eliminating suspects. I wanted to rule out SPU code, so I temporarily switched the systems that were using it (both our code and Havok) to their PPU variants where possible, and added some asserts to the DMAs in the single remaining job. No change – the bug is still there, so it’s not our SPU code; or, if it is SPU code, it’s none that we know about (there are some SPU processes launched by GameOS). So it has to be PPU code. The behaviour looked like that of our video memory manager (it stores block information in video memory, and indeed can change some bits), but replacing it with a dumb linear one (ptr += size) did not make a difference. Time to move further!&lt;/p&gt;

&lt;p&gt;Step two – determining the place of corruption. It should be simple – flip the bit back, put a data breakpoint on the address and find the culprit? Well, no – for some reason, the DABR (data-access breakpoint, PowerPC) does not work on video memory – I can’t even set it. There is another facility to trap memory accesses (which works on more types of memory accesses than the DABR), but it does not work with video memory either. So I resorted to launching a high-priority thread that constantly checks the address in question.&lt;/p&gt;
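For illustration, the watcher-thread idea can be sketched in portable C++ with std::thread and std::atomic (purely a sketch with hypothetical names: the original was a high-priority PPU thread asserting on a raw video memory address):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Spin on a memory location and raise a flag as soon as its contents
// deviate from the expected value. A real watcher would assert (or trap)
// right here, so the debugger stops close to the moment of corruption.
void watch_for_corruption(const std::atomic<unsigned int>* addr,
                          unsigned int expected,
                          std::atomic<bool>* corrupted)
{
    while (addr->load() == expected)
        ; // busy-wait; acceptable for a dedicated debugging thread
    corrupted->store(true);
}
```

The thread is started once the value is known to be good; the flag firing narrows the corruption down to "somewhere between now and the previous check".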

&lt;p&gt;The assert started triggering amazingly early – even before the level started to load! When it triggered, the main thread (and other threads) were either sleeping or stopped in seemingly random places, but it was a definite improvement – the first assertion fired before game initialization even completed, so previously the geometry just happened to be at the wrong place (at the wrong time…).&lt;/p&gt;

&lt;p&gt;At the end of game initialization there were 6 active SPU threads and 15 or so PPU threads, which was clearly a bit much. Luckily, since we didn’t have to load the level any more, we could remove stuff more freely. So I added an infinite loop with beginFrame/endFrame calls to the end of the initialization routine (to prevent post-initialization crashes) and started commenting out various subsystems’ initialization, eventually reaching a state with no SPU threads and only two PPU threads – the main one and the graphics interrupt handler. There was still a lot of initialization code left, since I kept the stuff that did not create threads.&lt;/p&gt;

&lt;p&gt;The assert still triggered randomly, so I thought I’d change the checking method. GCC has an option that instruments all functions by adding calls to special functions at the beginning/end. The option is called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-finstrument-functions&lt;/code&gt; (it makes the compiler call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__cyg_profile_func_enter&lt;/code&gt; on entry and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__cyg_profile_func_exit&lt;/code&gt; on exit; moreover, this works even if the function was inlined); MSVC has a similar pair of options, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/GH&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/Gh&lt;/code&gt;. I added this option to all our PPU code (this excludes middleware we have no source for, like Havok, and also Sony libraries, but in most cases there is still our code that calls into the middleware/libraries), and added enter/exit functions that assert if our address contains wrong data. It’s funny that GCC instrumented my enter/exit functions as well (despite the fact that they should have a fixed prototype), so I had to tag them with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no_instrument_function&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Running the resulting image did not produce clear results either – the assert still triggered in more or less random places. Additionally, the callstack now did not contain any meaningful information except the function from which the enter/exit stub was called, which complicated things a bit. Getting frustrated, I started removing the remaining initialization code piece by piece. Finally all that was left was a single call to graphics device initialization – I removed some inner parts, but could not remove it altogether because it maps video memory into the process address space – and the loop with beginFrame/endFrame, which I reduced to 4 calls into Sony libraries (which essentially performed a flip). Replacing the loop with a plain infinite one (while (true) ;) made the bug disappear.&lt;/p&gt;

&lt;p&gt;So it’s not our code; moreover, it occurs because of the Sony graphics library. There were two answers left – either it’s not our fault, or we somehow initialize the library incorrectly. I ran the game (the original version) on another devkit, but there were no problems there – so I suspected the firmware version, but reflashing the devkit to different versions did not change anything. After that I launched one of the SDK samples – but the address contents were intact (i.e. did not change after I set them in the debugger to some meaningful value). So it had to be our code. Careful comparison of the initialization routine indeed spotted some differences (we passed a zero-initialized structure to one of the functions, though the documentation and the sample stated that NULL should be passed instead), but eliminating them again did not fix anything. Well, that was weird. A year and a half earlier, when we started porting the engine to PS3, I had written a small several-day demo in the process of studying the graphics library/RSX. Luckily I still had the sources and a prebuilt image (several SDK revisions and a year and a half in the past!). I launched it, and the address contents were intact – but I decided to peek further, and dumped the whole video memory to a file. Inspection of this file revealed that there was a single non-zero byte at an offset &amp;gt; 32 Mb from the video memory start. Amazingly, its low half-word (two bytes) was equal to the address I had been working with the whole time! So the bug was even in my small demo? Give me one last try…&lt;/p&gt;

&lt;p&gt;I launched Sony’s sample again and did not even have to dump video memory – the very same address was corrupt! So I filed a support request, suggesting that it was a hardware problem – after all, I had finally reproduced it on an SDK sample, and it did not reproduce on another devkit – and soon got a confirmation. Damn. Damn. Damn!&lt;/p&gt;

&lt;p&gt;It took me a day and a half to reach the answer (it could’ve been much worse, though). I guess the moral is that you have to presume nothing, &lt;em&gt;including&lt;/em&gt; correct hardware operation – though of course I could not blame the hardware until all other explanations proved wrong.&lt;/p&gt;
</description>
			<pubDate>Wed, 08 Sep 2010 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2010/09/08/dont-you-dare-flip-that-bit/</link>
			<guid isPermaLink="true">https://zeux.io/2010/09/08/dont-you-dare-flip-that-bit/</guid>
		</item>
		
		<item>
			<title>On joys and sorrows of library development</title>
			<description>&lt;p&gt;This may come as a surprise, but I am not dead. In fact, what you see is a new post! As usual I have a lot of interesting themes to cover, and barely enough time to spare. While I’m at it, let me tell you about NDAs. I hate NDAs with a passion – I’ve got some things to blog about that are partially covered by NDA (of course, the interesting parts are NOT); also I’ve been thinking that this is a non-issue and basically that I can blog about things that are not quite critical, but half a year ago or so I was forced to remove a blog post; the reasons are not exactly clear but it seems that it was because of a single sentence that mentioned something that’s NOT secret in my point of view and was NOT relevant to post contents. For this reason I’m hesitant to write about some topics so I’ll either skip them altogether (which is a shame) or find a way to omit all details that might seem sensitive to people. Also I’m not sure if blogging about post removal due to NDA is an NDA violation?..&lt;/p&gt;

&lt;p&gt;Anyway, the topic for today is something different – I’ll write a bit about library development.&lt;/p&gt;

&lt;p&gt;Over the past few years I’ve developed and maintained a C++ XML parser, &lt;a href=&quot;http://code.google.com/p/pugixml&quot;&gt;PugiXML&lt;/a&gt;. It is a tiny library which focuses on performance and ease of use. We’ve had tremendous speedups of the export process after converting from &lt;a href=&quot;http://www.grinninglizard.com/tinyxml/&quot;&gt;TinyXML&lt;/a&gt;, and I know of lots of other success stories. PugiXML is portable (lots of platforms and compilers are supported; I’ve gone to special lengths to support MSVC6 and old CodeWarriors), console-aware (i.e. you can toggle off STL/exception support, override memory management, etc.), small, robust, etc. It even features an XPath evaluator!&lt;/p&gt;

&lt;p&gt;PugiXML was born as a project to clean up &lt;a href=&quot;http://www.codeproject.com/KB/cpp/pugxml.aspx&quot;&gt;pugxml&lt;/a&gt; – the initial idea was to strip the pugxml header from sources (thus reducing compilation/linking times), slightly clean up the interface and use it. What followed was an almost complete rewrite of the code, bringing the parser closer to standard compliance, adding useful features for DOM inspection, and greatly improving speed. There are bits of code left from pugxml, and the interface is very similar, but it’s quite a different project now. As far as I know, the only parser in use that beats PugiXML at parsing speed is &lt;a href=&quot;http://rapidxml.sourceforge.net/&quot;&gt;RapidXML&lt;/a&gt;, and the only major problem with PugiXML is that its Unicode support is pretty much limited to UTF-8. Though both of those may change at some point in the future :)&lt;/p&gt;

&lt;p&gt;I’m going to write some stuff here that may be of interest to other people.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Interface&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The initial API was taken as-is from pugxml; in hindsight, this was both a good thing (since it offered a very simple transition for pugxml users) and a bad thing. It’s bad because the interface is seriously cluttered.&lt;/p&gt;

&lt;p&gt;For example, there are at least four ways of traversing child nodes: you can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next_sibling()&lt;/code&gt; function (the DOM is structured as a graph of nodes connected via pointers; each node contains pointers to both its right and left siblings, and the function returns the right one), you can use the node iterator, you can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xml_tree_walker&lt;/code&gt; (which is a Visitor-like interface), and finally you can grab all child elements via an insert iterator with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;all_elements_by_name()&lt;/code&gt;. Oh, and you can use XPath, which makes it five.&lt;/p&gt;
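To make the first option concrete, here is a minimal self-contained sketch of the sibling-pointer layout described above (a hypothetical struct for illustration, not pugixml’s actual node representation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each node links to its first child and to its left/right siblings;
// next_sibling()-style traversal is then just a pointer chase.
struct node
{
    std::string name;
    node* first_child = nullptr;
    node* prev = nullptr; // left sibling
    node* next = nullptr; // right sibling
};

// Collect the names of all direct children by walking the 'next' links.
std::vector<std::string> child_names(const node& parent)
{
    std::vector<std::string> names;
    for (const node* c = parent.first_child; c; c = c->next)
        names.push_back(c->name);
    return names;
}
```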

&lt;p&gt;As another example, every method for string-based queries (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xml_node::attribute(const char*)&lt;/code&gt;, which means “give me the first attribute with the following name”) has a corresponding method which uses wildcard matching instead of string comparison (i.e. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node.attribute_w(&quot;foo*ba?&quot;)&lt;/code&gt; will match foobar or fooatbaz).&lt;/p&gt;
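The wildcard semantics ('*' matches any sequence, '?' matches any single character) can be illustrated with a naive recursive matcher; this is only a sketch, not pugixml’s actual matching code, which the post does not show:

```cpp
#include <cassert>

// Naive recursive wildcard matcher: '*' matches any (possibly empty)
// sequence, '?' matches exactly one character. Exponential in the worst
// case, so real implementations tend to be iterative.
bool wildcard_match(const char* pattern, const char* text)
{
    if (*pattern == 0)
        return *text == 0;
    if (*pattern == '*')
        // '*' either matches nothing, or consumes one character and retries.
        return wildcard_match(pattern + 1, text) ||
               (*text && wildcard_match(pattern, text + 1));
    if (*text == 0)
        return false;
    return (*pattern == '?' || *pattern == *text) &&
           wildcard_match(pattern + 1, text + 1);
}
```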

&lt;p&gt;Overall, it’s not that much (I have a friend who’s been working with a codebase that has an interface with 760+ virtual functions, so I’m not easily scared), and it does not stand in your way while you’re using the library, but it certainly does not help with maintaining and developing it.&lt;/p&gt;

&lt;p&gt;But the worst part is that I can’t remove any of those functions. For example, I consider the tree walker to be a bad abstraction; it’s rarely usable, and when it is, it’s easy to write one outside the library. If I had full API usage statistics, I could make a conscious decision – either nobody uses it and I remove it, or there are very few who do and I extract it into a helper class in an external header (possibly changing the interface slightly), or it’s a feature that is used in every second application that uses my library and I can’t do anything. The problem is that I have no statistics, so I can’t do anything.&lt;/p&gt;

&lt;p&gt;Other than that, I feel the interface is good (I use it relatively often, both in my pet projects and at work, so if there was something that annoyed me I would’ve fixed it by now). The best decision, for me, was the pointer abstraction – in pugixml you don’t work with pointers to nodes (as with TinyXML); you work with a tiny pointer wrapper class (the size is equal to that of a pointer) that’s passed by value. The point is that there is no null pointer exception – all operations on “null” nodes/attributes are perfectly well-defined. Of course, the same could be done with a pointer API by using a dummy object instead of a null pointer; what matters is the decision to protect the user. I also find that this makes parsing code much more concise – you don’t have to do error handling for every API call!&lt;/p&gt;

&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;/h3&gt;

&lt;p&gt;The parsing performance is very good: on COLLADA files it’s hundreds of megabytes per second (probably closer to a gigabyte); the bottleneck is always HDD read speed unless the file is cached. Of course, it’s still slightly slower than it could be; also, the performance comes at the price of not being fully standard-compliant – this manifests in allowing certain XML standard violations, such as disallowed Unicode symbols in attribute/node names, multiple node attributes with the same name, etc. This means that while any correct XML file will be parsed, some malformed ones will not be rejected. Up to a certain point there were even flags to make the parser accept specific standard violations (e.g. there was a mode that could handle HTML-style unclosed tags by scanning for the matching open tag and automatically closing all descendants), but I removed them to reduce clutter (that was at the point when the parser was used by me and a couple of friends, so no harm done).&lt;/p&gt;

&lt;p&gt;The memory consumption is also good enough (when we switched from TinyXML at work, we got a ~2x improvement in terms of memory required to parse a DOM tree), although it could be better. Surprisingly, this was achieved without any of the tricks that I love (take a pointer, take the lower N bits, stuff something useful in there, pretend that everything was that way all along) and almost without any bit-packing.&lt;/p&gt;

&lt;p&gt;All good things come at a price – the parser currently requires that the whole XML file is a large contiguous chunk of memory (i.e. if you have a 200 Mb file to parse, you have to have a 200 Mb chunk of address space); also, this chunk lives as long as the document does, so in the worst case PugiXML can lose in peak memory consumption if you modify your tree too much (e.g. load a 200 Mb document from a file, remove all nodes, then add an equivalent amount of content by hand – the memory overhead of PugiXML will be roughly 400 Mb (larger than that, in fact, because nodes take some space too), while the memory overhead of a typical parser will be 200 Mb). Of course, this is almost never a problem in practice.&lt;/p&gt;

&lt;p&gt;Next time: performance highlights (tricks to make parsing fast, saving performance), user requests, documentation, portability concerns&lt;/p&gt;
</description>
			<pubDate>Tue, 29 Sep 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/09/29/on-joys-and-sorrows-of-library-development/</link>
			<guid isPermaLink="true">https://zeux.io/2009/09/29/on-joys-and-sorrows-of-library-development/</guid>
		</item>
		
		<item>
			<title>Implementing Direct3D for fun and profit</title>
			<description>&lt;p&gt;I can’t believe I’m writing this, it’s been what, 2 months? During that time a lot of things happened – I’ve been to the conference and gave an hour-long talk about our SPU rendering stuff (which was more or less well received), I’ve almost completed an occlusion subsystem (rasterization-based), which is giving good results; and the financial crisis has finally hit the company I work at – some projects are freezed due to the lack of funding, and some people are fired. It’s kind of sad walking through half-empty offices… Anyway, I know I promised to write often but as I am actively developing my pet engine at home and there is a lot of stuff to work on at my day job, so time is a scarce resource for me. My blog/todo.txt file is already 20 entries long, where some things are too small to deserve a post, and others demand a lengthy series. I’ll try to select something interesting from time to time and blog about it. As for todays topic,&lt;/p&gt;

&lt;p&gt;Every object in core Direct3D (I’ll be talking about 9 today, but the same thing should apply to 10 and 11) is an interface. This means that the details of the actual implementation are hidden from us, but it also means that we can implement those interfaces ourselves. Why would we want to do that?&lt;/p&gt;

&lt;h3 id=&quot;reverse-engineering&quot;&gt;Reverse engineering&lt;/h3&gt;

&lt;p&gt;If you work in the game industry/computer graphics – or, well, any other IT-related field, I suppose – then you should be constantly gaining new knowledge; otherwise your qualification as a specialist will decrease very fast. There are lots of ways to learn, and one of the best is to learn from others’ experience. Unfortunately, while there is a lot of information on the technology behind some titles, most are not described at all. Also, sometimes the descriptions are inaccurate – after all, the devil is in the details. So what you can do is take an existing title and reverse-engineer it – that is, gain information about implementation details from the outside. &lt;em&gt;Disclaimer: Of course, this information is provided only for educational value. Reverse engineering can violate the laws of your country and/or the EULA of the product. Don’t do it if that’s the case.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the PC / Direct3D world there are two primary tools that allow such introspection – NVidia PerfHUD and Microsoft PIX. There is also a beta of Intel GPA (which is, by the way, quite promising, if lacking polish), but it is more or less like PIX. Using PIX does not require modifications to the host program; however, PIX does not work for some titles (it might crash), is slow (especially for titles with complex scenes, lots of draw calls, etc.) and is not very convenient to use as a reverse engineering tool for other reasons.&lt;/p&gt;

&lt;p&gt;PerfHUD is more useful in some areas, but you need to create the Direct3D device with a special adapter and in REF mode in order for PerfHUD to work. While some games already have this kind of support in the released version (notable examples include The Elder Scrolls 4: Oblivion and S.T.A.L.K.E.R.: Shadow of Chernobyl), others are more careful (I hope that if you’re reading this blog you have a build configuration such as Master or Retail, which sets appropriate defines to compile development-only stuff, such as asset reloading, profiling or NVPerfHUD support, out of the executable). But still, if you manage to intercept the call to Direct3DCreate9 (which can be done, for example, by creating a DLL, calling it d3d9.dll and putting it near the game executable), you can return a proxy IDirect3D9 object that forwards all calls to the actual object, except that it modifies the adapter/device type that is passed to CreateDevice. In fact, such proxy objects are used by both PIX and GPA, though the injection technique is more complex.&lt;/p&gt;

&lt;p&gt;There are even some programs that do all of the above for you, allowing you to run any title in a PerfHUD-compatible mode.&lt;/p&gt;

&lt;h3 id=&quot;multithreaded-rendering&quot;&gt;Multithreaded rendering&lt;/h3&gt;

&lt;p&gt;In fact, this is already described in a Gamefest 2008 presentation, “Practical Parallel Rendering with DirectX 9 and 10: Windows PC Command Buffer Recording” (you can get slides and example code &lt;a href=&quot;http://www.emergent.net/GameFest2008&quot;&gt;here&lt;/a&gt;). Basically, since neither Direct3D 9 nor Direct3D 10 supports proper multithreading (creating the device as multithreaded means that all device calls will be synchronized with one per-device critical section), you can emulate it via a special proxy device, which records all rendering calls in a buffer and then uses the buffer to replay the command stream via the real device. This lets the rendering work you do alongside the API calls run in multiple threads, and is a good stub for the deferred context functionality that’s available on other platforms (including Direct3D 11 and all console platforms). I use this technique in my pet engine mainly for the purpose of portability – I can render different parts of the scene into different contexts simultaneously, and then “kick” the deferred contexts via the main one. On PS3 the “kick” part is very lightweight, so the savings are huge; on Windows the “kick” part replays the command stream, so it can be quite heavy, but it’s still faster than doing everything in one thread, and the code works the same way. When I start supporting Direct3D 11, the same code will work concurrently, provided good driver/runtime support of course.&lt;/p&gt;

&lt;p&gt;Note that I don’t use Emergent library as is – I consider it too heavyweight and obscure for my purposes. They try to support all Direct3D calls, while I use only a handful – I don’t use FFP, I don’t create resources via this device, etc. My implementation is simple and straightforward, and is only 23 Kb in size (11 of which are reused in another component – see below). If anybody wants to use it I can provide the code to you to save you an hour of work – just drop a comment.&lt;/p&gt;

&lt;p&gt;Currently my implementation has a fixed size command buffer, so if you exceed it, you’re doomed. There are several more or less obvious ways to fix this, but I hope that by the time I get to it I’ll already have D3D11 in place.&lt;/p&gt;

&lt;h3 id=&quot;asset-pipeline&quot;&gt;Asset pipeline&lt;/h3&gt;

&lt;p&gt;My asset pipeline is more or less the same for all asset types – there is a source for the asset (Maya/Max scene, texture, sound file, etc.), which is converted via some set of actions to a platform-specific binary that can be loaded by the engine. This way, the complexity of dealing with different resource formats, complex structures, data not suitable for runtime use, etc. is moved from the engine to the tools, which is great since it reduces the amount of runtime code, making it more robust and easier to maintain. The data is saved to a custom format which is optimized for loading time (target endianness, platform-specific data layout/format for graphics resources, compression). I think I’ll blog about some interesting aspects/choices in the future as time permits (for example, about my experience of using build systems, such as SCons and Jam, for data builds), but for now I’ll focus on a tool that builds textures.&lt;/p&gt;

&lt;p&gt;This tool loads the texture file, generates mipmap levels for the texture if necessary (if it was not a DDS with a mip chain, and if the target texture requires mipmap levels), compresses it to DXTn if necessary (again, that depends on the source format and build settings), and performs some other actions, both platform-specific and platform-independent. In order for it to work, I need an image library that can load the image formats I care about, including DDS with DXTn contents (so that I don’t need to unpack/repack it every time, and so that artists can tweak DXT compression settings in the Photoshop plugin – in my experience there is rarely a visible difference, but if they give me a texture and I compress it to DXT and there are some artifacts, I’m to blame – and if they use Photoshop, it’s not my scope :)). As it turns out, D3DX is a good enough image loading library – at least it works for me (although in retrospect I probably should’ve used DevIL, and perhaps I will switch to it in the future).&lt;/p&gt;

&lt;p&gt;Anyway, to load a texture via D3DX, you need a Direct3D device. As it turns out, while you can create a working REF device in under 10 lines of code (using a desktop window and hardcoded settings), you can’t create any device, including NULLREF, if your PC does not have a monitor attached. This problem appeared once I got my pipeline working via IncrediBuild – texture building would sometimes fail on some machines. Since I did not want to modify my code too much, I ended up implementing another proxy device, which is suitable for loading a texture with D3DX functions. This time it was slightly harder, because I needed implementations for some functions of IDirect3DDevice9, IDirect3DTexture9 and IDirect3DSurface9, but again the resulting code is quite small and simple – 6 Kb (plus the 11 Kb dummy device I mentioned earlier) – and I can load any 2D texture. Of course, I’ll need to add some code to load cubemaps and even more code to load volume textures, but for now it’s fine the way it is.&lt;/p&gt;

&lt;p&gt;So these are some examples of situations where implementing Direct3D interfaces might prove useful. The next post is going to either be about multithreading, or about some asset pipeline-related stuff, I guess I’ll decide once I get to writing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UPDATE 25 OCT 2010:&lt;/strong&gt; Here is the example code:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/66e62f12fa4616711088#file-dummydevice-h&quot;&gt;dummydevice.h&lt;/a&gt; - this is just an example of a dummy device implementation; it implements all device methods with stubs that trigger a debugging break if called. This is useful as a base for other partial implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/66e62f12fa4616711088#file-deferreddevice-h&quot;&gt;deferreddevice.h&lt;/a&gt; - this is the implementation of the device that buffers various rendering calls and then allows you to execute them on some other device. Note that it records into a fixed-size memory buffer, which can be easily changed, and that it implements only a subset of rendering-related functions (i.e. no FFP).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/66e62f12fa4616711088#file-texturedevice-h&quot;&gt;texturedevice.h&lt;/a&gt; - this is the implementation of the device that works with D3DXCreateTextureFromFile for 2D textures and cubemaps (3D texture support is missing but can be added in the same way).&lt;/p&gt;
</description>
			<pubDate>Mon, 08 Jun 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/06/08/implementing-direct3d-for-fun-and-profit/</link>
			<guid isPermaLink="true">https://zeux.io/2009/06/08/implementing-direct3d-for-fun-and-profit/</guid>
		</item>
		
		<item>
			<title>Placeholder</title>
			<description>&lt;p&gt;I’m sorry for the lack of real post - it was a busy week, and a somewhat busy month lies ahead - I’m attending a local game conference in May and giving a speech about the process of porting our rendering subsystem to SPU (I hope to cover this topic here some day), so some time is spent preparing slides/etc.; my pet projects demand more attention than usual; there’s some weird but nevertheless interesting stuff at work… I’ll try to keep up, but you should really expect some more weeks without any posts. Don’t beat me.&lt;/p&gt;

&lt;p&gt;Anyway, a bunch of slides from the GDC09 Tutorial sessions &lt;a href=&quot;http://www.gdconf.com/conference/tutorials.html&quot;&gt;are finally uploaded&lt;/a&gt;; there is some good stuff in “Advanced Visual Effects with Direct3D”, and there’s some awesome stuff in “Insomniac Games’ Secrets of Console and Playstation 3 Programming”. I mean, finally someone told the people who compute view-space normal Z as $sqrt(1 - x^2 - y^2)$ that they don’t know what they’re doing! Not to mention the SPU stuff, like the KISS SPU scheduler (we have a simple enough custom scheduler at work, but it’s still far from that), SPU debugging stories and other SPU talks. By the way, if you’re interested in SPU-related topics and have not &lt;a href=&quot;http://www.insomniacgames.com/tech/techpage.php&quot;&gt;read everything here&lt;/a&gt;, then you don’t take SPU seriously.&lt;/p&gt;

&lt;p&gt;There are also &lt;a href=&quot;http://www.khronos.org/library/detail/game-developers-conference-2009-press-kit/&quot;&gt;Khronos’ slides here&lt;/a&gt; - don’t read them unless you have absolutely nothing to do.&lt;/p&gt;
</description>
			<pubDate>Tue, 31 Mar 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/03/31/placeholder/</link>
			<guid isPermaLink="true">https://zeux.io/2009/03/31/placeholder/</guid>
		</item>
		
		<item>
			<title>Hashes, hazards and vfptr</title>
			<description>&lt;p&gt;There is a bunch of small notes I’d like to share – none of them deserves a post, but I don’t want them to disappear forever.&lt;/p&gt;

&lt;h3 id=&quot;using-hash-table-as-a-fixed-size-cache&quot;&gt;Using hash table as a fixed-size cache&lt;/h3&gt;

&lt;p&gt;When I worked with Direct3D 10, I found state objects quite cumbersome to work with – they’re very slow to create (or at least they were back then), and the exact separation of states into objects was sometimes inconvenient from a design point of view. Also, I already had a set of classes that divided states into groups, with functions like setDepthState with redundancy checking, so I needed to write an implementation for the existing interface. The solution I came up with was very simple and elegant, so I’d like to outline it once more (although I sort of mentioned it in the &lt;a href=&quot;/2007/10/06/render-state-rant/&quot;&gt;original post&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The natural thing to do here is to cache the state object pointer inside the state class, and recompute it when necessary (when binding a newly created/modified state). There are two issues to solve here: 1. state object creation is expensive (even if you’re creating an object with the same data several times in a row – in which case the D3D10 runtime returns the same pointer – the call takes 10k cycles); 2. there is a limit on the number of state objects (4096 for each type). Solving the first one is easy – just make a cache with the key being the state object description and the value being the actual pointer; solving the second one is slightly harder because you’ll have to evict entries from your cache based on some policy. The way I went was to create a fixed-size array (the size should be a power of two, less than or equal to 4096, and depends on the state usage pattern), make a hash function for the state description and use this array as a cache indexed by hash. In case of a cache collision the old state object gets released.&lt;/p&gt;

&lt;p&gt;I often use simple non-resizable hash tables (recent examples include the path lookup table in a flat file system and a vertex data hash to compute index buffers from COLLADA streams), and I always insert collision handling code – but, as it turns out, in this case you can omit it and get some benefit at the same time.&lt;/p&gt;

&lt;h3 id=&quot;direct3d-10-readwrite-hazard-woes&quot;&gt;Direct3D 10 Read/Write hazard woes&lt;/h3&gt;

&lt;p&gt;While in some ways Direct3D 10 is clearly an improvement over Direct3D 9, in lots of areas the design could’ve been better. I surely can’t count all the deficiencies using both my hands, but some problems annoy me more than others. One of the things that leads to possible performance/memory compromises is the resource Read/Write hazard. There are several input slots for various stages (shader resources (textures, buffers), constant buffers, vertex/index buffers) and several output ones (render targets, depth surfaces, stream out buffers), and there are resources that can be bound to both input and output stages; for example, you can bind a texture to the output stage as a render target, render something to it, and then bind the same texture as a shader resource so that a shader can sample the rendered data from it. However, the Direct3D 10 runtime does not allow a resource to be bound to both input and output stages at the same time.&lt;/p&gt;

&lt;p&gt;One disadvantage is that sometimes you’d like to do an in-place update of a render target – for example, to do color correction or some other transformation. In fact, this is a perfectly well-defined operation – at least on NVidia hardware – if you’re always reading the same pixel you’re writing to; otherwise you’ll get the old value for some pixels and the new one for others. Here there is an actual read/write hazard, but with this specific hardware knowledge we could exploit it to save memory.&lt;/p&gt;

&lt;p&gt;Another disadvantage is that a resource being bound to an output pipeline stage does not mean it’s being written to! A common example is soft particles – Direct3D 10 introduced cross-platform unified depth textures, so you can apply postprocessing effects that require scene depth without an extra pass to output depth into a texture or MRT – you can use the same depth buffer you were using for the scene render as a texture input to the shader. While this works perfectly for postprocessing (except for the fact that you can’t read depth from MSAA surfaces – ugh…), it fails miserably for soft particles. You usually disable depth writes for particles, so there is no real read/write hazard, but because the runtime thinks there is one you can’t bind the depth buffer so that the HW performs the depth test – you can only do depth testing in the pixel shader yourself via discard. This disables early coarse/fine Z culling, which results in abysmal performance.&lt;/p&gt;

&lt;p&gt;Luckily MSAA depth readback is supported in D3D10.1, and in D3D11 you can bind resources to output pipeline stages as read-only. Too bad there is no D3D11 HW yet, and D3D10.1 is not supported by NVidia…&lt;/p&gt;

&lt;h3 id=&quot;knowing-the-class-layout--vfptr&quot;&gt;Knowing the class layout – vfptr&lt;/h3&gt;

&lt;p&gt;There are two weird points regarding class layout and vfptr (virtual function table pointer) that I’d like to note here – they are related to very simple cases, I’m not going to talk about multiple or god forbid virtual inheritance here.&lt;/p&gt;

&lt;p&gt;Why do you need to know class layout? Well, it’s useful while writing code so your classes can occupy less space, it’s extremely useful while debugging obscure bugs, and you can’t even start doing in-place loading/saving without such knowledge (I think I’ll make a special post regarding in-place stuff soon). And don’t even get me started on debuggers that can’t display anything except registers and (if you’re lucky) primitive locals – we used to have such debugger on PSP, and CodeWarrior for Wii is only slightly better.&lt;/p&gt;

&lt;p&gt;Anyway, the first weird point is related to CodeWarrior – it had been like this on PS2, and it’s like this on Wii – I doubt that’ll ever change. You see, while on normal compilers there is no way to control vfptr placement – for simple classes without inheritance it always goes in the first word – on CodeWarrior it lies in the place of declaration – except that you can’t declare vfptr in C++, so it lies in the place where the first virtual function is declared. Some examples follow:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// layout is vfptr, a, b&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// layout is a, vfptr, b&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// layout is a, vfptr, b&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// layout is a, b, vfptr&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Marvelous, isn’t it? Now there is an entry in our coding standard at work which says “first virtual function declaration has to appear before any member declarations”.&lt;/p&gt;

&lt;p&gt;The second point was discovered only recently and appears to happen with MSVC. Let’s look at the following classes:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foo2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;virtual&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Assuming sizeof(unsigned int) == 4, sizeof(float) == 4, sizeof(double) == 8, what are the layouts of these classes? A couple of days ago I’d have said:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Foo1: 4 bytes for vfptr, 4 bytes for a, 4 bytes for b; alignof(Foo1) == 4
Foo2: 4 bytes for vfptr, 4 bytes for a, 8 bytes for b; alignof(Foo2) == 8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And in fact this is exactly the way these classes are laid out in GCC (PS3/Win32), CodeWarrior (Wii) and other relatively sane compilers; MSVC however chooses the following layout for Foo2:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Foo2: 4 bytes for vfptr, 4 bytes of padding, 4 bytes for a, 4 bytes of padding, 8 bytes for b; alignof(Foo2) == 8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, the amount of padding increases if we replace double with e.g. __m128. I don’t see any reason for such memory wastage, but that’s the way things are implemented, and again I doubt this will ever change.&lt;/p&gt;

&lt;h3 id=&quot;optimizing-build-times-with-direct3d-9&quot;&gt;Optimizing build times with Direct3D 9&lt;/h3&gt;

&lt;p&gt;Yesterday after making some finishing touches to D3D9 implementation of some functions in my pet project (which is coincidentally a game engine wannabe), I hit rebuild and could not help noticing the difference in compilation speed for different files. The files that did not include any heavy platform-specific headers (such as windows.h or d3d9.h) were compiled almost immediately, files with windows.h included were slightly slower (don’t forget to define WIN32_LEAN_AND_MEAN!), and files with d3d9.h were slow as hell compared to them – the compilation delay was clearly visible. Upon examination I understood that including windows.h alone gets you 651 Kb of preprocessed source (all numbers are generated via cl /EP, so the source doesn’t include #line directives; also WIN32_LEAN_AND_MEAN is included in compilation flags), and including d3d9.h results in a 1.5 Mb source.&lt;/p&gt;

&lt;p&gt;Well, I care about compilation times, so I decided to make things right – after all, d3d9.h can’t require EVERYTHING in windows.h and other headers it includes. After half an hour of work, I arrived with minid3d9.h (which can be &lt;a href=&quot;https://gist.github.com/zeux/4c763996ce8e45eb8077&quot;&gt;downloaded here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Including minid3d9.h gets you 171 Kb of preprocessed source, which is much better. This file defines everything that’s necessary for d3d9.h and also a couple of things my D3D9 code used (e.g. the SUCCEEDED/FAILED macros); you might need to add something else – it’s not always a drop-in replacement. I’ve also taken some measures that enable safe inclusion of this file after CRT/Platform SDK headers, but don’t include it before them – generally, include it after everything else.&lt;/p&gt;

&lt;p&gt;This decreased the full rebuild time by 30% for me (even though D3D9 code is less than 15% in terms of code size and less than 10% in terms of translation unit count) – I certainly expected less benefit! You’re free to use this at your own risk; remember that I did not test it on 64-bit platforms, so perhaps it needs more work there.&lt;/p&gt;
</description>
			<pubDate>Sun, 22 Mar 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/03/22/hashes-hazards-and-vfptr/</link>
			<guid isPermaLink="true">https://zeux.io/2009/03/22/hashes-hazards-and-vfptr/</guid>
		</item>
		
		<item>
			<title>View frustum culling optimization - Representation matters</title>
			<description>&lt;p&gt;Before getting into professional game development I’ve spent a fair amount of time doing it for fun (in fact, I still do it now, although less intensively). The knowledge came from a variety of sources, but the only way that I knew and used to calculate frustum plane equations was as follows – get the equations in clip space (they’re really simple – (1, 0, 0, 1), (0, -1, 0, 1), etc.) and then get the world space ones by transforming the planes with the inverse transpose of the view projection camera matrix [correction: in fact, you need to transform with the inverse transpose of the inverse view projection matrix, which equals just the transpose of the view projection matrix]. It’s very simple and intuitive – if you know a simple way to express what you need in some space, and a simple way to transform things from that space to your target one, you’re good to go.&lt;/p&gt;
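&lt;p&gt;The transpose trick is short enough to write down directly. With column vectors (p_clip = M * p_world), a clip-space plane L becomes the world-space plane L * M, i.e. sums and differences of the rows of M. The sketch below assumes row-major storage m[row][col] and the D3D depth range – neither is necessarily what your engine uses:&lt;/p&gt;

```cpp
// Extract world-space frustum plane equations (a, b, c, d) from a
// view-projection matrix M (column-vector convention p_clip = M * p_world).
// A clip-space plane L maps to world space as L * M, which is the
// "transpose of view projection" transform described above.
void extract_frustum_planes(const float m[4][4], float planes[6][4])
{
    for (int c = 0; c < 4; ++c)
    {
        planes[0][c] = m[3][c] + m[0][c]; // left:   x >= -w
        planes[1][c] = m[3][c] - m[0][c]; // right:  x <=  w
        planes[2][c] = m[3][c] + m[1][c]; // bottom: y >= -w
        planes[3][c] = m[3][c] - m[1][c]; // top:    y <=  w
        planes[4][c] = m[2][c];           // near:   z >=  0 (D3D convention)
        planes[5][c] = m[3][c] - m[2][c]; // far:    z <=  w
    }
}
```

&lt;p&gt;A point p is inside a plane when a*p.x + b*p.y + c*p.z + d &amp;gt;= 0; with an identity matrix the left plane comes out as (1, 0, 0, 1), matching the clip-space equation x &amp;gt;= -w.&lt;/p&gt;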

&lt;p&gt;Imagine my surprise when I started doing game development as a day job and after some time accidentally stumbled upon a piece of our codebase that calculated frustum planes in a completely different way. Given a usual perspective camera setup, it calculated frustum points via some trigonometry (utilizing knowledge about vertical/horizontal FOV angles, near/far distances and the fact that it’s a perspective camera without any unusual alterations), and then used them to obtain the equations. I thought it was very weird – after all, it’s more complex and is constrained to a specific camera representation, whereas the clip space method works for any camera that can be set up for rendering (orthographic projection, oblique-clipping, etc.).&lt;/p&gt;

&lt;p&gt;But as it turns out, the same thing can be said about our culling code. It’s quite good at culling a given box against an arbitrary set of planes (i.e. if you use it for portal/anti-portal culling with arbitrarily shaped portals/occluders), but since we have a usual frustum, maybe we can improve it by going to clip space, entirely skipping world space? Let’s try it.&lt;/p&gt;

&lt;p&gt;We’re going to transform AABB points to clip space, and then test them against frustum planes in clip space. Note that we can’t divide by w after transforming – that will lead to culling bugs because post-projective space exhibits a discontinuity at the plane with equation z = 0 in view space; however, this is not needed – the frustum plane equations in clip space are as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;x &amp;gt;= -w, x &amp;lt;= w: left/right planes
y &amp;gt;= -w, y &amp;lt;= w: top/bottom planes
z &amp;gt;= 0, z &amp;lt;= w: near/far planes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that if you’re using OpenGL clip space convention, the near plane equation is z &amp;gt;= -w; this is a minor change to the culling procedure.&lt;/p&gt;
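&lt;p&gt;Before moving to SPU intrinsics, the whole clip-space test can be expressed as a portable scalar sketch – a reference implementation, not the production code that follows (the matrix convention p_clip = M * p_world and row-major storage are assumptions):&lt;/p&gt;

```cpp
// Conservative AABB-vs-frustum test done entirely in clip space: transform
// the 8 corners by the world-view-projection matrix and check each plane's
// inequality against w, without ever dividing by w.
struct Vec4 { float x, y, z, w; };

static Vec4 mul_point(const float m[4][4], float x, float y, float z)
{
    Vec4 r;
    r.x = m[0][0]*x + m[0][1]*y + m[0][2]*z + m[0][3];
    r.y = m[1][0]*x + m[1][1]*y + m[1][2]*z + m[1][3];
    r.z = m[2][0]*x + m[2][1]*y + m[2][2]*z + m[2][3];
    r.w = m[3][0]*x + m[3][1]*y + m[3][2]*z + m[3][3];
    return r;
}

// returns true if the box [mn, mx] is certainly outside the frustum of m
bool is_culled(const float m[4][4], const float mn[3], const float mx[3])
{
    int out[6] = {};
    for (int i = 0; i < 8; ++i)
    {
        Vec4 p = mul_point(m, (i & 1) ? mx[0] : mn[0],
                              (i & 2) ? mx[1] : mn[1],
                              (i & 4) ? mx[2] : mn[2]);

        out[0] += p.x < -p.w; out[1] += p.x > p.w; // left/right
        out[2] += p.y < -p.w; out[3] += p.y > p.w; // bottom/top
        out[4] += p.z < 0;    out[5] += p.z > p.w; // near/far (D3D convention)
    }
    // outside only if all 8 corners are on the wrong side of a single plane
    for (int k = 0; k < 6; ++k)
        if (out[k] == 8) return true;
    return false;
}
```

&lt;p&gt;Like the SPU version, this is conservative: a box with corners straddling several planes is reported as potentially visible.&lt;/p&gt;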

&lt;p&gt;First, to transform points to clip space, we’re going to need a world view projection matrix – I hope the code does not require any additional explanations:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform_matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define COMP_0(c) \
    qword res_ ## c = si_fm((qword)lhs-&amp;gt;row2, SPLAT((qword)rhs-&amp;gt;row ## c, 2)); \
    res_ ## c = si_fma((qword)lhs-&amp;gt;row1, SPLAT((qword)rhs-&amp;gt;row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs-&amp;gt;row0, SPLAT((qword)rhs-&amp;gt;row ## c, 0), res_ ## c); \
    dest-&amp;gt;row ## c = (vec_float4)res_ ## c;
&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#define COMP_1(c) \
    qword res_ ## c = si_fma((qword)lhs-&amp;gt;row2, SPLAT((qword)rhs-&amp;gt;row ## c, 2), (qword)lhs-&amp;gt;row3); \
    res_ ## c = si_fma((qword)lhs-&amp;gt;row1, SPLAT((qword)rhs-&amp;gt;row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs-&amp;gt;row0, SPLAT((qword)rhs-&amp;gt;row ## c, 0), res_ ## c); \
    dest-&amp;gt;row ## c = (vec_float4)res_ ## c;
&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#undef COMP_0
#undef COMP_1
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After that we’ll transform the points to clip space, yielding 2 groups with 4 vectors (x, y, z, w) in each one; the code is almost the same as in the previous post, only we now have 4 components:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat-&amp;gt;row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat-&amp;gt;row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat-&amp;gt;row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat-&amp;gt;row0, c), res_ ## c); \
    dest[c] = res_ ## c;
&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
&lt;span class=&quot;cp&quot;&gt;#undef COMP
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// transform points to clip space&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If all 8 clip-space points are outside the left plane, i.e. if for all 8 points p.x &amp;lt;= -p.w, then the box is completely outside. Since we’re going to use SoA layout, such tests are very easy to perform. We’ll need a vector which contains -w for 4 points; SPUs do not have a dedicated negation instruction, but you can easily emulate it either by subtracting from zero or by xoring with 0x80000000. Theoretically xor is better (it has 2 cycles of latency, subtract has 6), but in our case there is no difference in speed; I’ll use xor nonetheless:&lt;/p&gt;
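&lt;p&gt;The sign-flip trick is not SPU-specific – on any IEEE-754 target a float can be negated by xoring its top bit, which is what si_xor does per lane below. A scalar illustration:&lt;/p&gt;

```cpp
#include <cstdint>
#include <cstring>

// negate a float by flipping the IEEE-754 sign bit, mirroring what
// si_xor with 0x80000000 does to each lane on the SPU
inline float negate_via_xor(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // bit-exact, aliasing-safe copy
    bits ^= 0x80000000u;                  // flip the sign bit only
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```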

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// calculate -w&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0_negw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_xor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1_negw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_xor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we’ll calculate “not outside” flags for each plane; the method is exactly the same as in the previous post (as is the final result computation), only now we’re not doing dot products.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// for each plane…&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define NOUT(a, b, c, d) si_orx(si_or(si_fcgt(a, b), si_fcgt(c, d)))
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1_negw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NOUT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_cs_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#undef NOUT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the final version. It runs at 104 cycles per test, so it’s slightly faster than the last version. But this method is better for another reason – we’ve calculated clip space positions of box vertices as a by-product (we’ve also calculated the world view projection matrix, but it’s likely to be of little further use, because usually shader constant setup happens at a later point in another module). Some things you can do with them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Feed them as an input to the rasterizer to do rasterization-based occlusion culling (simple depth buffer, HOM, etc.). This is the road I have not taken yet, though I hope I will do it some day.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Use them for screen size culling – if your bounding box when projected to the screen is small enough (e.g. less than 3x3 pixels), you can usually safely throw it away. This is what I do in our production code; it involves dividing positions by w (don’t forget to discard the size culling results if any point has w &amp;lt; epsilon!), computing min/max x/y for the results, subtracting min from max and checking if the difference along each axis is less than the threshold. The actual implementation is left as an exercise to the reader.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
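&lt;p&gt;The screen size test from the last bullet can be sketched in scalar form – names and the pixel-threshold parameterization are illustrative, not the production implementation:&lt;/p&gt;

```cpp
struct Vec4 { float x, y, z, w; };

// Returns true if the 8 clip-space corners project to a screen rectangle
// smaller than `threshold` pixels on both axes. The result is discarded
// (returns false) if any point is too close to the w = 0 plane.
bool small_on_screen(const Vec4 pts[8], float viewport_w, float viewport_h,
                     float threshold)
{
    float minx = 1e30f, miny = 1e30f, maxx = -1e30f, maxy = -1e30f;

    for (int i = 0; i < 8; ++i)
    {
        if (pts[i].w < 1e-5f) return false; // perspective divide unsafe

        float x = pts[i].x / pts[i].w;
        float y = pts[i].y / pts[i].w;
        minx = x < minx ? x : minx; maxx = x > maxx ? x : maxx;
        miny = y < miny ? y : miny; maxy = y > maxy ? y : maxy;
    }

    // NDC spans 2 units across the viewport, so half-extent * viewport = pixels
    return (maxx - minx) * 0.5f * viewport_w < threshold &&
           (maxy - miny) * 0.5f * viewport_h < threshold;
}
```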

&lt;p&gt;This concludes the computation part of view-frustum culling. There is still something to do here – there is the p/n-vertex approach, which I did not implement (but I’m certain that it won’t be a win over my current methods on SPUs); there are minor potential improvements to the current code that are not worth the trouble for me; and all implemented tests return only a binary result (outside / not outside), while a ternary version can help in hierarchical culling (though this can be achieved with a minor modification to all presented code). There might be something else I can’t think of now – post a comment if you’d like to hear about other VFC-related topics!&lt;/p&gt;

&lt;p&gt;I’m going to write one final post regarding VFC, which deals with the code that uses is_visible to perform culling of a given batch array – the topics include DMA and double buffering; after that the whole VFC series will be over and I’m going to switch to something different.&lt;/p&gt;

&lt;p&gt;The complete source for this post can be &lt;a href=&quot;https://gist.github.com/zeux/1fb08fb04ae97c79852e&quot;&gt;grabbed here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;View Frustum Culling series contents:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/01/31/view-frustum-culling-optimization-introduction/&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/08/view-frustum-culling-optimization-vectorize-me/&quot;&gt;Vectorize me&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/&quot;&gt;Structures and arrays&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/&quot;&gt;Never let me branch&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Representation matters&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/&quot;&gt;Balancing the pipes&lt;/a&gt;&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Sun, 15 Mar 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/03/15/view-frustum-culling-optimization-representation-matters/</link>
			<guid isPermaLink="true">https://zeux.io/2009/03/15/view-frustum-culling-optimization-representation-matters/</guid>
		</item>
		
		<item>
			<title>Fighting against CRT heap and winning</title>
			<description>&lt;p&gt;Memory management is one of (many) cornerstones of tech quality for console games. Proper memory management can decrease the number of bugs, increase product quality (for example, by eliminating desperate pre-release asset shrinking) and generally make life way easier – long term, that is. Improper memory management can wreak havoc. For example, any middleware without means to control/override memory management is, well, often not an option; any subsystem that uncontrollably allocates memory can and will lead to problems and thus needs redesigning/reimplementing. While you can tolerate more reckless memory handling on PC, it often results in negative user experience as well.&lt;/p&gt;

&lt;p&gt;In my opinion, there are two steps to proper memory management. First one is global and affects all code – it’s memory separation and budgeting. Every subsystem has to live in its own memory area of fixed size (of course, size can be fixed for the whole game or vary per level, this is not essential). This has several benefits:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Memory fragmentation is now local – subsystems don’t fragment each other’s storage, thus fragmentation problems happen less frequently and can be reproduced and fixed faster&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Fixed sizes mean explicit budgets – because of them, out of memory problems are again local and easily tracked to their source. For example, there is no more “game does not fit in video memory, let’s resize some textures” - instead, you know that, e.g., level textures fit in their budget perfectly, but the UI artists added several screen-size backgrounds, overflowing the UI texture budget&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Because each subsystem lives in its own area, we have detailed memory statistics for no additional work, which again is a good thing for obvious reasons&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If memory areas have fixed sizes, they either have fixed addresses or it’s easy to trace address range for each of them – this helps somewhat in debugging complex bugs&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Second one is local to each subsystem – once you know that your data lives in a fixed area, you have to come up with a way to lay your data in this area. The exact decisions are specific to the nature of data and are up to the programmer; this is out of this post’s scope.&lt;/p&gt;

&lt;p&gt;Memory is divided into regions, each region is attributed to a single subsystem/usage type – if we accept this, it becomes apparent that any unattributed allocations (i.e. any allocations into the global heap) are there either because nobody knows where they should belong or because the person who coded them does not want to think about memory – which is even worse (strict separation and budgeting makes things more complicated in the short term by forcing people to think about memory usage – but that’s a good thing!). Because of this, the global heap contains junk by definition and thus should ideally be eliminated altogether, or, if this is not possible for some reason, should be of limited and rather small size.&lt;/p&gt;
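&lt;p&gt;To make the “fixed area per subsystem” idea concrete, here is a hypothetical minimal arena – the name and the bump-pointer policy are illustrative only; a real subsystem would pick a layout suited to its data:&lt;/p&gt;

```cpp
#include <cstddef>

// A fixed-size memory area owned by a single subsystem. Exceeding the
// budget fails locally and visibly instead of fragmenting a global heap.
class Arena
{
public:
    Arena(void* memory, std::size_t bytes)
        : base_((char*)memory), size_(bytes), offset_(0) {}

    void* allocate(std::size_t bytes, std::size_t align = 16)
    {
        // round the bump pointer up to the requested alignment
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + bytes > size_) return 0; // budget exceeded
        offset_ = aligned + bytes;
        return base_ + aligned;
    }

    std::size_t used() const { return offset_; } // per-subsystem stats for free

private:
    char* base_;
    std::size_t size_, offset_;
};
```

&lt;p&gt;Out-of-memory here points straight at the owning subsystem’s budget, and used() gives the detailed statistics mentioned above with no extra work.&lt;/p&gt;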

&lt;p&gt;Now that we know the goal, it’s necessary to implement it – i.e. we want to have a way to replace allocations in global heap with either fatal errors or allocations in our own small memory area. On different platforms there are different ways to do it – for example, on PS3 there is a documented (and easy) way to override CRT memory management functions (malloc/free/etc.); on other platforms with GNU-based toolchain there is often a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--wrap&lt;/code&gt; linker switch – however, on some platforms, like Windows (assuming MSVC), there does not seem to be a clean way to do it. In fact, it seems that the only known solution is &lt;a href=&quot;http://benjamin.smedbergs.us/blog/2008-01-10/patching-the-windows-crt/&quot;&gt;to modify the CRT code&lt;/a&gt;. I work with statically linked CRT, so this would mean fewer distribution problems, but more development ones – I’d have to either replace prebuilt CRT libraries (which is out of the question because it makes working with other projects impossible) or ignore them and link my own, which is better – but still, the process required building my own (hacked) version of CRT. I did not like the approach, so I came up with my own.&lt;/p&gt;

&lt;p&gt;First, some disclaimers. This code is tested for statically linked Win32 CRT only – it requires some modifications to work on Win64 or with dynamically linked CRT – I might do the Win64 part some day, but not DLL CRT. Also I’m not too clear on EULA issues; because of this, I’ll post my entire code except for one function that’s essentially ripped from CRT and fixed so that it compiles – read further for more details. Finally, there may be some unresolved issues with CRT functions I don’t currently use (though I think my solution covers most of them) – basically, this is a demonstration of approach with proof-of-concept code, and if you decide to use it you’re expected to fix the problems if they arise :)&lt;/p&gt;

&lt;p&gt;Our first priority is to replace CRT allocation functions without modifying libraries. There are basically two ways to do something like this – link time and run time. The link time approach involves telling the linker somehow that instead of existing functions it should use the ones supplied by us. Unfortunately, there does not seem to be a way to do this except &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/FORCE:MULTIPLE&lt;/code&gt;, which results in annoying linker warnings and disables incremental linking. The run time way involves patching code after the executable is started – hooking libraries like Detours do it, but we don’t need such a heavyweight solution here. In fact, all that’s needed is a simple function:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;patch_with_jump&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// get offset for relative jmp&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// unprotect memory&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_protect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;VirtualProtect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PAGE_READWRITE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_protect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// write jmp&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xe9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// protect memory&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;VirtualProtect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_protect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_protect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This function replaces the first 5 bytes of code at dest with a jump to address (the jump is relative, so we need to compute the relative offset; also, the code area is read-only by default, so we unprotect it for the duration of the patch). The primitive for stubbing CRT functions is in place – now we need to figure out where to invoke it. At first I thought that a static initializer (specially tagged so that it’s guaranteed to execute before other initializers) would be sufficient, but after looking inside the CRT source it became apparent that the heap is initialized and (more critically) used before static initialization. Thus I had to define my own entry point:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;entrypoint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mainCRTStartup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;patch_memory_management_functions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mainCRTStartup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now to patch the functions. We’re interested in heap initialization, heap termination and the various (de)allocation utilities. There are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_heap_init&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_heap_term&lt;/code&gt; and lots of variants of malloc/free and friends – they are all listed in the source code. Note that I stubbed all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_aligned_*&lt;/code&gt; functions with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BREAK()&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__asm int 3&lt;/code&gt;), because neither CRT code nor my code uses them – of course, you can implement them if you need to.&lt;/p&gt;

&lt;p&gt;There are several highlights here. The first one I stumbled upon is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_heap_term&lt;/code&gt; is not getting called! At least not in the static CRT. After some digging through the CRT source I decided to patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__crtCorExitProcess&lt;/code&gt; – it’s useful only for managed C++, and it’s the last thing that gets called before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExitProcess&lt;/code&gt;. The second one is in the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt;, and is specific to the allocator you’re using to replace the default one. The purpose of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt; is to reallocate the memory as realloc does, but to zero any additional memory – so if you do &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc(3)&lt;/code&gt; and then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc(4)&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;((char*)ptr)[3]&lt;/code&gt; is guaranteed to be 0. My allocator aligns everything to 4 bytes and has a minimal allocation size; the original size that was passed to the allocation function is not stored anywhere. It’s easy to work around for the CRT because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt; is used in the CRT only for blocks allocated with calloc, and I hope &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt; is not used anywhere else.
By the way, there is a bug in the CRT related to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt; – &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc(0)&lt;/code&gt; with a subsequent &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc(1)&lt;/code&gt; does not clear the first byte (because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc(0)&lt;/code&gt; creates a block of size 1); moreover, more bugs of this nature are theoretically possible on Win64. Personally I find &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;calloc&lt;/code&gt; weird and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt; disgusting; luckily the latter is Windows-only.&lt;/p&gt;
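
&lt;p&gt;To make the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_recalloc&lt;/code&gt; contract concrete, here is a portable sketch (all names invented, and the size-in-a-header trick is an illustration only – it is not how the CRT or my allocator tracks sizes): reallocate like realloc, then zero the grown tail, which requires knowing the original request size.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the _recalloc contract: reallocate like realloc,
// but guarantee that bytes past the old size read as zero. A replacement
// allocator needs the original request size for this; here we keep it in a
// small header in front of each block (illustration only).
struct alloc_header { size_t size; };

void* my_malloc(size_t size)
{
    alloc_header* h = (alloc_header*)malloc(sizeof(alloc_header) + size);
    if (!h) return 0;
    h->size = size;
    return h + 1;
}

void* my_recalloc(void* ptr, size_t size)
{
    size_t old_size = ptr ? ((alloc_header*)ptr - 1)->size : 0;
    alloc_header* h = (alloc_header*)realloc(ptr ? (alloc_header*)ptr - 1 : 0,
                                             sizeof(alloc_header) + size);
    if (!h) return 0;
    h->size = size;
    char* data = (char*)(h + 1);
    // realloc leaves the grown tail uninitialized; zero it explicitly
    if (size > old_size) memset(data + old_size, 0, size - old_size);
    return data;
}

void my_free(void* ptr)
{
    if (ptr) free((alloc_header*)ptr - 1);
}
```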

&lt;p&gt;Ok, now we’re done – or are we? Well, everything went well until I turned leak detection on. It turns out that the CRT leaves lots of allocations unfreed – amazingly, there is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__freeCrtMemory&lt;/code&gt; function that frees some of them, but it’s compiled in only under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_DEBUG&lt;/code&gt;, and it’s called only if the CRT debugging facilities are configured to dump memory leaks on exit. Because of this I had to copy the code, modify it slightly so that it compiles, and invoke the function before heap termination. However, this function does not free everything – there were some more allocations left that I had to handle myself. You can see the code in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cleanup_crt_leaks()&lt;/code&gt;. After cleaning up the leaks, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;printf()&lt;/code&gt;, which was used to output leaks to the console, became unusable (oh, horror!), so I came up with the following function:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;debug_printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4096&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    
    &lt;span class=&quot;kt&quot;&gt;va_list&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arglist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;va_start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arglist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;wvsprintfA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arglist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;va_end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arglist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// console output&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;HANDLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;handle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GetStdHandle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;STD_OUTPUT_HANDLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;WriteFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;handle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strlen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;


    &lt;span class=&quot;c1&quot;&gt;// debug output&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;OutputDebugStringA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, the last problem is that some CRT code checks the global variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_crtheap&lt;/code&gt; prior to allocation, so we have to initialize it to something (this affects &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fopen()&lt;/code&gt; and other functions that use dynamically created critical sections).&lt;/p&gt;

&lt;p&gt;Well, now it works and I’m quite happy with the results. Of course it’s slightly hackish, but CRT code is such a mess that it blends in nicely. The more or less complete source code &lt;a href=&quot;https://gist.github.com/zeux/9e05771f7edca8165a3e&quot;&gt;is here&lt;/a&gt;. Note that if you’re using C++ &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete&lt;/code&gt; and you have not overridden them globally for some reason, you might want to patch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_nh_malloc&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_heap_alloc&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc_stub&lt;/code&gt; as well.&lt;/p&gt;
</description>
			<pubDate>Sun, 08 Mar 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/03/08/fighting-against-crt-heap-and-winning/</link>
			<guid isPermaLink="true">https://zeux.io/2009/03/08/fighting-against-crt-heap-and-winning/</guid>
		</item>
		
		<item>
			<title>View frustum culling optimization - Never let me branch</title>
			<description>&lt;p&gt;In previous iteration we converted the code to SoA instead of AoS, which enabled us to transform OBB points to world space relatively painlessly, and eliminated ugly and slow dot product, thus making the code faster. Still, the code is slow. Why?&lt;/p&gt;

&lt;p&gt;Well, as it appears, the problem is branching.&lt;/p&gt;

&lt;p&gt;I wanted to write a long post about branches and why they are often a bad idea for PPU/SPU, but it turns out that Mike Acton beat me to it – be sure to read his articles for detailed explanation: &lt;a href=&quot;http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html&quot;&gt;part 1&lt;/a&gt; &lt;a href=&quot;http://www.cellperformance.com/articles/2006/04/background_on_branching.html&quot;&gt;part 2&lt;/a&gt; &lt;a href=&quot;http://www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html&quot;&gt;part 3&lt;/a&gt; - so I’ll make it short. For our case, there are two problems with branching:&lt;/p&gt;

&lt;p&gt;First, code performance depends on the input data. Visible boxes are the worst case (this is the one the cycle count is for); invisible boxes are faster, with the fastest case (where the box is behind the first plane) taking 128 cycles. Because of this, it’s hard to estimate the run time of culling given the number of objects – the upper bound is three times the lower bound.&lt;/p&gt;

&lt;p&gt;Second, branches divide the code into blocks, and the compiler has trouble performing optimizations across blocks. We have a constant-length loop; inside it we compute 8 dot products for a single plane, then check if all of them are negative, and in that case we early-out. Note that there are a lot of dependencies in the computation of the dot products – the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt;s in dot4 depend on the result of the previous &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fcgt&lt;/code&gt; depends on the result of dot4, etc. Here is an example of the disassembly for a single dot4 operation, assuming that we already have SPLAT(v, i) in registers:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fma res, v2, z, v3
fma res, v1, y, res
fma res, v0, x, res
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pretty reasonable? Well, not exactly. While we have 3 instructions, each one depends on the result of the previous one, so we can use the result in 18 cycles instead of 3 (fma latency is 6 cycles). If we need to compute 6 dot4, and we have some sort of branching after each one, as we had in the code for the previous attempt, we’ll pay the 18-cycle cost for each iteration (of course, there’ll also be some cost associated with comparison and branching). On the other hand, if we computed all 6 dot4 without any branches, the code could’ve looked like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fma res[0], v2[0], z[0], v3[0]
fma res[1], v2[1], z[1], v3[1]
…
fma res[5], v2[5], z[5], v3[5]


fma res[0], v1[0], y[0], res[0]
…
fma res[5], v1[5], y[5], res[5]

fma res[0], v0[0], x[0], res[0]
…
fma res[5], v0[5], x[5], res[5]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code has 18 instructions, and all results are computed in 24 cycles – but we’re computing 6 dot4 instead of 1! Also, 24 cycles is the latency for res[5] – we can start working with res[0] immediately after the last fma is issued.&lt;/p&gt;
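
&lt;p&gt;The same scheduling idea can be sketched in portable scalar C++ (no SPU intrinsics; the function name is mine): computing six dot4 results in lockstep keeps the six updates within each round independent of each other, so their latencies can overlap instead of forming one long dependency chain.&lt;/p&gt;

```cpp
#include <cassert>

// Six 4-component dot products computed in lockstep: within each round the
// six fma-style updates are independent, so a scheduler can overlap their
// latencies instead of waiting on a single 18-cycle-deep dependency chain.
void dot4_batch6(const float* v0, const float* v1, const float* v2, const float* v3,
                 const float* x, const float* y, const float* z, float* res)
{
    for (int i = 0; i < 6; ++i) res[i] = v2[i] * z[i] + v3[i];  // round 1: 6 independent fmas
    for (int i = 0; i < 6; ++i) res[i] = v1[i] * y[i] + res[i]; // round 2
    for (int i = 0; i < 6; ++i) res[i] = v0[i] * x[i] + res[i]; // round 3
}
```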

&lt;p&gt;The problem is not only related to instruction latency (in our case, register dependencies), but also to pipeline stalls – the SPU has two pipelines (even and odd), and can issue one instruction per pipeline per cycle for, uhm, perfect code. Each type of instruction can be issued only on one of the pipes – for example, arithmetic instructions belong to the even pipe, load/store/shuffle instructions to the odd one. Because of this, shuffles can be free if they dual-issue with arithmetic and do not cause subsequent dependency stalls.&lt;/p&gt;

&lt;p&gt;The compiler tries to rearrange instructions in order to minimize all stalls – register dependencies, pipeline stalls and some other types – but it is often not allowed to do so across branches. Because of this it’s best to eliminate all branches – the compiler is then left with a single block of instructions and can do a pretty good job hiding latencies and dual-issuing instructions. This is often critical – for example, our current version wastes almost half of its cycles waiting for results because of register dependencies.&lt;/p&gt;

&lt;p&gt;Of course, eliminating branches is often a tradeoff – sometimes it makes the worst case run faster, but the best case now runs slower, as we observed last time with the x86 code. The decision depends on your goals and on the frequency of the various cases – remember that branchless code will give you a guaranteed (and usually acceptable) lower bound on performance.&lt;/p&gt;
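
&lt;p&gt;A minimal scalar illustration of this tradeoff (hypothetical helper functions, not the SPU code): the early-out loop has a data-dependent run time, while the mask-accumulating version always does the same amount of work regardless of input.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Branchy version: early-outs make the run time depend on the data –
// fast when a positive value comes first, slow when none exists.
bool any_positive_branchy(const float* v, int n)
{
    for (int i = 0; i < n; ++i)
        if (v[i] > 0.0f) return true;
    return false;
}

// Branchless version: accumulate a comparison mask over all elements;
// the comparison typically compiles to a select, not a branch, so the
// cost is fixed for a given n.
bool any_positive_branchless(const float* v, int n)
{
    uint32_t mask = 0;
    for (int i = 0; i < n; ++i)
        mask |= v[i] > 0.0f ? ~0u : 0u;
    return mask != 0;
}
```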

&lt;p&gt;So, in order to eliminate branches, we’ll restructure our code a bit – instead of checking for each plane whether all points are outside, we’ll check whether any point is inside, i.e. whether the box is not outside the plane:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;


    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp0pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fcgt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dp0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fcgt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dp1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_orx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dp0pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_orx&lt;/code&gt; is a horizontal or (or across) instruction, which ORs the 4 32-bit components of the source register together and returns the result in the preferred slot, filling the rest of the vector with zeroes. Thus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_not_outside&lt;/code&gt; returns 0xffffffff in the preferred slot if the box is not outside the plane, and 0 if it is.&lt;/p&gt;
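
&lt;p&gt;For readers without SPU experience, the or-across semantics can be emulated in portable C++ like this (an illustration with an invented 4-lane type, not the real intrinsic):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Emulated 4-lane register for illustration purposes.
struct qword4 { uint32_t lane[4]; };

// Or-across: OR all four 32-bit lanes together, put the result in lane 0
// (the "preferred slot"), and zero the remaining lanes.
qword4 orx(const qword4& v)
{
    qword4 r = { { v.lane[0] | v.lane[1] | v.lane[2] | v.lane[3], 0, 0, 0 } };
    return r;
}
```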

&lt;p&gt;Now all we have to do is call this function for all planes and combine the results – we can do that with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_and&lt;/code&gt;, since the box is not outside the frustum only if it’s not outside any of the planes; if any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_not_outside&lt;/code&gt; call returns 0, we have to return 0.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// for each plane…&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_not_outside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;// merge &quot;not outside&quot; flags&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout01&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; 
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout012&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; 

&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout34&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; 
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout345&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout34&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; 

&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout012&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nout345&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;si_to_uint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I changed the return type of is_visible to unsigned int, with 0 meaning false and 0xffffffff meaning true; this doesn’t require any changes in client code, but slightly improves performance.&lt;/p&gt;

&lt;p&gt;Now that we compute everything in a single block, the compiler schedules instructions so that almost no cycles are wasted on latency. The new branchless version runs at 119 cycles, which is more than 3 times faster than the previous version and 10 times faster than the initial scalar version. That amounts to 37 msec for a million calls, almost 2 times faster than the fastest x86 result (finally!). Moreover, it is slightly faster than the best case of the previous version – so there is no tradeoff here: the new version is always faster than the old one. Note that eliminating branches is not worth it for x86 code (i.e. it does not make the worst case faster), which is expected if you remember that we had to do 2 checks per plane to make the SoA approach faster than AoS.&lt;/p&gt;

&lt;p&gt;The current source can be &lt;a href=&quot;https://gist.github.com/zeux/9707c36deb26e8297e28&quot;&gt;grabbed here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s all for now – stay tuned for next weekend’s post! I plan to post something non-VFC-related next week, then another VFC post the week after that. If you’re starting to hate frustums, SPUs, me and my blog - sorry about that, but we’ll be done with VFC some day, I swear! :)&lt;/p&gt;

&lt;p&gt;View Frustum Culling series contents:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/01/31/view-frustum-culling-optimization-introduction/&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/08/view-frustum-culling-optimization-vectorize-me/&quot;&gt;Vectorize me&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/&quot;&gt;Structures and arrays&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Never let me branch&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/15/view-frustum-culling-optimization-representation-matters/&quot;&gt;Representation matters&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/&quot;&gt;Balancing the pipes&lt;/a&gt;&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Sun, 01 Mar 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/</link>
			<guid isPermaLink="true">https://zeux.io/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/</guid>
		</item>
		
		<item>
			<title>View frustum culling optimization - Structures and arrays</title>
			<description>&lt;p&gt;Last week I’ve tried my best at optimizing the underlying functions without touching the essence of algorithm (if there was a function initially that filled a 8-vector array with AABB points, optimizations from previous post could be done in math library). It seems the strategy has to be changed.&lt;/p&gt;

&lt;p&gt;There are several reasons why the code is still slow. One is branching; we’ll cover that in the next issue. Another has already been discussed on this blog in the context of shaders – we have 4-way SIMD instructions, but we are not using them properly. For example, our point transformation function wastes 1 scalar operation per &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt;, and requires an additional .w component fixup after that. Our dot product function is simply horrible. Once again we’re going to switch the layout of intermediate data from Array of Structures to Structure of Arrays.&lt;/p&gt;

&lt;p&gt;We have 8 AABB points, so we’d need 6 vectors – 2 vectors per component. Do we need all 6? Nah. Since it’s an AABB, we can organize the data so that we only need 4, like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;x X x X
y y Y Y
z z z z
Z Z Z Z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the vectors for the x and y components are shared between the two 4-point groups. Of course this sharing will go away once we transform the points to world space – but it makes it easier to generate the SoA points from the min/max vectors.&lt;/p&gt;
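&lt;p&gt;To make the layout concrete, here’s a plain scalar C sketch (illustrative only – not SPU code; the vec3 type and helper name are mine) that fills the four SoA lanes from min/max exactly as the diagram above describes:&lt;/p&gt;

```c
/* Scalar sketch (not SPU code) of the 4-vector SoA corner layout:
   lane i of (xs, ys, zs0) is corner i of the first group,
   lane i of (xs, ys, zs1) is corner i of the second group. */
typedef struct { float x, y, z; } vec3;

/* hypothetical helper: fills the shared x/y lanes and the two z lanes */
static void soa_corners(vec3 mn, vec3 mx,
                        float xs[4], float ys[4], float zs0[4], float zs1[4])
{
    /* x X x X */
    xs[0] = mn.x; xs[1] = mx.x; xs[2] = mn.x; xs[3] = mx.x;
    /* y y Y Y */
    ys[0] = mn.y; ys[1] = mn.y; ys[2] = mx.y; ys[3] = mx.y;
    /* z z z z  and  Z Z Z Z */
    for (int i = 0; i != 4; ++i) { zs0[i] = mn.z; zs1[i] = mx.z; }
}
```

&lt;p&gt;Lane i of the two groups together enumerates all 8 corners of the box.&lt;/p&gt;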

&lt;p&gt;How do we generate them? Well, we already know the solution for Z – the magical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; instruction that worked for us before. It’s time to learn what exactly it does, as it can be used to generate the x/y vectors too.&lt;/p&gt;

&lt;p&gt;What &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb(a, b, c)&lt;/code&gt; does is take the a/b registers and permute their contents using c as a pattern, yielding a new value. The permutation is done at the byte level – each byte of c determines the corresponding byte of the result, which is computed in one of the following ways:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;0x0v selects byte v of the left operand&lt;/li&gt;
  &lt;li&gt;0x1v selects byte v of the right operand&lt;/li&gt;
  &lt;li&gt;0x80 produces the constant 0x00&lt;/li&gt;
  &lt;li&gt;0xC0 produces the constant 0xFF&lt;/li&gt;
  &lt;li&gt;0xE0 produces the constant 0x80&lt;/li&gt;
  &lt;li&gt;other values map to one of the above; the exact treatment is out of scope here&lt;/li&gt;
&lt;/ol&gt;
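&lt;p&gt;The selection rules above can be sketched in plain C. This is an illustrative scalar emulation of the per-byte rule, not the actual intrinsic; the treatment of the remaining control values here is an assumption consistent with the listed constants:&lt;/p&gt;

```c
/* Illustrative scalar emulation of si_shufb's per-byte selection
   (a sketch for clarity; a and b are the 16 bytes of the left/right
   operands, c is one byte of the pattern). */
static unsigned char shufb_byte(const unsigned char a[16],
                                const unsigned char b[16],
                                unsigned char c)
{
    if (c >= 0xE0) return 0x80;         /* 0xE0.. : constant 0x80 */
    if (c >= 0xC0) return 0xFF;         /* 0xC0.. : constant 0xFF */
    if (c >= 0x80) return 0x00;         /* 0x80.. : constant 0x00 */
    if (c & 0x10)  return b[c & 0x0F];  /* 0x1v   : byte v of right operand */
    return a[c & 0x0F];                 /* 0x0v   : byte v of left operand */
}
```

&lt;p&gt;Applying this rule to all 16 pattern bytes reproduces the full shuffle.&lt;/p&gt;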

&lt;p&gt;This is a superset of the Altivec vec_perm instruction, and can be used to do very powerful things, as we’ll see soon enough. For example, you can implement the usual GPU-style swizzling like so:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;src_zxxx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_shufb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x08090a0b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00010203&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00010203&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00010203&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first four bytes of my pattern correspond to bytes 8-11 of the left argument; all other four-byte groups correspond to bytes 0-3 of the left argument. This is equivalent to applying a .zxxx swizzle. As you can probably see, the code can get very obscure if you use shuffles a lot, so I’ve made some helper macros:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// shuffle helpers&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define L0 0x00010203
#define L1 0x04050607
#define L2 0x08090a0b
#define L3 0x0c0d0e0f
&lt;/span&gt;

&lt;span class=&quot;cp&quot;&gt;#define R0 0x10111213
#define R1 0x14151617
#define R2 0x18191a1b
#define R3 0x1c1d1e1f
&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define SHUFFLE(l, r, x, y, z, w) si_shufb(l, r, ((qword)(vec_uint4){x, y, z, w}))
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// splat helper&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define SPLAT(v, idx) si_shufb(v, v, (qword)(vec_uint4)(L ## idx))
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;SHUFFLE is for general shuffling, SPLAT is for component replication (.yyyy-like swizzles). Note that in the previous post, SPLAT was used in transform_point to generate the .xxxx, .yyyy and .zzzz swizzles from an AABB point.&lt;/p&gt;

&lt;p&gt;Let’s generate AABB points then.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// get aabb points (SoA)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;L0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;L0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// x X x X&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SHUFFLE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;L1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;L1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// y y Y Y&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// z z z z&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// Z Z Z Z&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That was easy. Now, for the first 4 points we use minmax_x, minmax_y, minmax_z_0; for the second group we use minmax_x, minmax_y, minmax_z_1.&lt;/p&gt;

&lt;p&gt;Now we have 2 groups of 4 points each, SoA style – and we have to transform them to world space. It’s actually quite easy – remember the first scalar version? If you glanced at the code, you saw a macro for computing a single resulting component:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#define COMP(c) p-&amp;gt;c = op.x * mat-&amp;gt;row0.c + op.y * mat-&amp;gt;row1.c + op.z * mat-&amp;gt;row2.c + mat-&amp;gt;row3.c
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As it turns out, this can be converted to SoA-style multiplication almost literally – you just need to think of op.x, op.y, op.z as vectors holding one component of 4 points each; mat-&amp;gt;rowi.c has to be splatted over all components. The resulting function becomes:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix43_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat-&amp;gt;row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat-&amp;gt;row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat-&amp;gt;row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat-&amp;gt;row0, c), res_ ## c); \
    dest[c] = res_ ## c;
&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;COMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
&lt;span class=&quot;cp&quot;&gt;#undef COMP
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that it’s not really that much different from the scalar version, only now it transforms 4 points in 9 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt; and 12 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; instructions. We’re going to transform 2 groups of points, so we’ll need 18 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt; instructions, while the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; results can be shared – luckily, the compiler does this for us, so we just need to call transform_points_4 twice:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// transform points to world space&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;transform_points_4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;minmax_z_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The previous vectorized version required 24 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt; and 24 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt;, plus 8 correcting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_selb&lt;/code&gt; (to be fair, it could actually be optimized down to 6 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; + 8 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_selb&lt;/code&gt;, but that’s still not a win over SoA). Note that 18 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt; + 12 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; does not mean 30 cycles. SPUs are capable of dual-issuing some instructions – there are two groups of instructions: one group runs on the even pipeline, the other on the odd one. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fma&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; run on different pipelines, so the net throughput will be closer to 18 cycles (slightly more if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_shufb&lt;/code&gt; latency can’t be hidden).&lt;/p&gt;
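&lt;p&gt;As a back-of-the-envelope model (an assumption for illustration, not a cycle-accurate simulator), perfect dual issue with hidden latencies means throughput is bounded below by the busier pipeline:&lt;/p&gt;

```c
/* Toy throughput model (illustrative assumption, not a cycle-accurate SPU
   simulator): with perfect dual issue and hidden latencies, the cycle count
   is bounded below by whichever pipeline has more instructions queued. */
static int dual_issue_lower_bound(int even_pipe_ops, int odd_pipe_ops)
{
    return (even_pipe_ops > odd_pipe_ops) ? even_pipe_ops : odd_pipe_ops;
}
```

&lt;p&gt;With 18 si_fma on the even pipeline and 12 si_shufb on the odd one, the bound is 18 cycles.&lt;/p&gt;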

&lt;p&gt;Now all that’s left is to calculate the dot products with a plane. Of course we’ll calculate them 4 at a time. But wait – in our case the inner loop terminated after the first iteration. So previously we were doing only one (albeit ugly) dot product, and now we’re doing 4, or even 8! Isn’t that a bit excessive? Well, it’s not – but we’ll save the more detailed explanation for a later post; for now, let the results speak for themselves.&lt;/p&gt;

&lt;p&gt;In order to calculate 4 dot products, we’ll make a helper function:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dot4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;


    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SPLAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And call it twice. Again, we’ll be doing the four splats twice, but the compiler is smart enough to eliminate the duplicates. After that we compare all 8 dot products with zero, and return false if all of them are negative.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// for each plane...&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plane&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;


    &lt;span class=&quot;c1&quot;&gt;// calculate 8 dot products&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plane&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points_ws_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// get signs&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp0neg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fcgt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1neg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fcgt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_to_uint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_gb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;si_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dp0neg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dp1neg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fcgt&lt;/code&gt; is just a floating-point greater-than comparison; I’m abusing the fact that 0.0f is represented as a vector with all bytes equal to zero. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_fcgt&lt;/code&gt; operates like SSE comparisons, returning 0xffffffff for elements where the comparison is true and 0 for the others. After that I AND the results together, and then use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_gb&lt;/code&gt; instruction to gather the result bits. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_gb&lt;/code&gt; takes the least significant bit of each element and inserts it into the corresponding bit of the result; we get a 4-bit value in the preferred slot, with everything else zeroed out. If it’s equal to 15, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_and&lt;/code&gt; returned a mask where all elements are 0xffffffff, which means all dot products are less than zero, so the box is outside.&lt;/p&gt;
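&lt;p&gt;For clarity, here’s a scalar C sketch of that compare/AND/gather sequence (illustrative only – plain C standing in for the SPU intrinsics, and the function name is mine; the bit order within the gathered value is irrelevant for the == 15 test):&lt;/p&gt;

```c
/* Scalar sketch (not SPU code) of the si_fcgt / si_and / si_gb sequence:
   build a per-lane mask where 0 is greater than the dot product, AND the
   two 4-lane groups, then gather one bit per lane like si_gb does. */
static unsigned gather_outside_bits(const float dp0[4], const float dp1[4])
{
    unsigned bits = 0;
    for (int lane = 0; lane != 4; ++lane) {
        unsigned m0 = (0.0f > dp0[lane]) ? 0xFFFFFFFFu : 0u; /* si_fcgt, lane-wise */
        unsigned m1 = (0.0f > dp1[lane]) ? 0xFFFFFFFFu : 0u;
        bits |= ((m0 & m1) & 1u) << (3 - lane); /* si_gb takes each lane's LSB */
    }
    return bits; /* 15 means all 8 dot products are negative: box is outside */
}
```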

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_gb&lt;/code&gt; is like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_movemask_ps&lt;/code&gt;, only it takes the least significant bits instead of the most significant ones – in the SSE case the sign bits are already where movemask expects them, so we don’t need comparisons at all. We could avoid the comparisons here as well by ANDing the dot products directly and then moving the sign bit into the least significant bit (that can be done by rotating each element 1 bit to the left via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;si_roti(v, 1)&lt;/code&gt;), but this turns out to be slightly slower, so we won’t do it.&lt;/p&gt;

&lt;p&gt;Now, the results. The code runs at 376 cycles, which is more than 2 times faster than the previous version, and almost 4 times faster than the original. This speedup comes partially from doing things more efficiently and partially from getting rid of branches; we’ll discuss this next week. A million calls take 117 msec, which is still worse than the x86 results – but it’s not the end of the story. Astonishingly, applying exactly the same optimizations to the SSE code results in 81 msec for gcc (which is 30% faster than the naively vectorized version), and in 104 msec for msvc8 (which is 40% slower!).&lt;/p&gt;

&lt;p&gt;The fastest version is still the one msvc8 produces from the previous post’s code. This should not be very surprising: we changed the inner loop from performing one dot product to performing 8 at once, and that shows. We can optimize for this case by adding an early out – after we compute the first 4 dot products, we check their signs; if at least one of them is non-negative, the box can’t be outside this plane, so we can safely skip the remaining 4 dot products and continue to the next iteration. This results in 87 msec for msvc8 and 65 msec for gcc, with the gcc-compiled SoA version finally being faster than all previous approaches. Of course, this is a worst case for SoA – if the inner loop did not terminate after the first iteration, the performance gain would be greater. Adding the same optimization to the SPU code makes it slightly (by 3 cycles) slower; the penalty is tens of cycles if the early out does not happen and we have to compute all 8 dot products, so it’s definitely not worth it.&lt;/p&gt;
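
The early out described above can be sketched in portable scalar C++ – hypothetical function names and a hypothetical SoA layout (8 corner points with components stored in separate arrays), not the actual source; the standard includes are assumed:

```cpp
// Returns true if at least one of the first 'count' dot products is
// non-negative, i.e. the corresponding point is not behind the plane.
static bool any_nonnegative(const float* dp, int count)
{
    for (int i = 0; i != count; ++i)
        if (dp[i] >= 0.0f)
            return true;
    return false;
}

// Box is outside the plane only if all 8 dot products are negative; the
// second batch of 4 is computed only when the first batch is all-negative.
static bool aabb_outside_plane(const float px[8], const float py[8], const float pz[8],
                               const float plane[4])
{
    float dp[8];
    for (int i = 0; i != 4; ++i)
        dp[i] = px[i] * plane[0] + py[i] * plane[1] + pz[i] * plane[2] + plane[3];
    if (any_nonnegative(dp, 4))
        return false; // early out: the remaining 4 dot products are not needed
    for (int i = 4; i != 8; ++i)
        dp[i] = px[i] * plane[0] + py[i] * plane[1] + pz[i] * plane[2] + plane[3];
    return !any_nonnegative(dp + 4, 4);
}
```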

&lt;p&gt;The current source can be &lt;a href=&quot;https://gist.github.com/zeux/d0b700f0e8e700bec5c8&quot;&gt;grabbed here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s all for now – stay tuned for the next weekend’s post!&lt;/p&gt;

&lt;p&gt;View Frustum Culling series contents:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/01/31/view-frustum-culling-optimization-introduction/&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/08/view-frustum-culling-optimization-vectorize-me/&quot;&gt;Vectorize me&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Structures and arrays&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/&quot;&gt;Never let me branch&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/15/view-frustum-culling-optimization-representation-matters/&quot;&gt;Representation matters&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/&quot;&gt;Balancing the pipes&lt;/a&gt;&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Sun, 15 Feb 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/</link>
			<guid isPermaLink="true">https://zeux.io/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/</guid>
		</item>
		
		<item>
			<title>View frustum culling optimization - Vectorize me</title>
<description>&lt;p&gt;Last week I posted some teaser code that will be transformed several times, each time yielding a faster version – “faster” in terms of taking fewer cycles for the test case on SPU. A lot of you probably looked at my admittedly lame excuse for, uhm, a math library and wanted to ask – why the hell do you use scalar code? We’re going to address that in this issue. This is probably a no-brainer for most of my readers, but it’s a good opportunity to introduce some important points about SPUs and show some actual vector code before diving further.&lt;/p&gt;

&lt;p&gt;But first, we need some background information on SPUs. For today’s post, there is a single important thing to know about SPUs – they are vector processors. Unlike most common architectures (PowerPC, x86, etc.), SPUs have only one register set, which consists of 128-bit vectors. The current implementation has 128 of them, and a register is treated differently by different instructions (there are different instructions for adding two registers as if they contained 4 single-precision floats or 16 8-bit integers). The important point is that while you can compile a piece of scalar code for SPU, it’s going to use vector registers and vector instructions; scalar values are assumed to reside in the so-called preferred slot – for our current needs, we only care about the preferred slot for 32-bit scalars, which is the first one (index 0). Register components are numbered from the lowest address in memory onwards, which is really refreshing after SSE little-endian madness.&lt;/p&gt;

&lt;p&gt;This actually goes slightly further – not only are all registers 16 bytes wide, but all memory accesses must be 16-byte aligned as well (I’m obviously talking about local storage access here – though the same mostly applies to DMA; I’ll probably discuss something DMA-related after the VFC series ends): you can only load/store a full register’s worth of data from/to a 16-byte-aligned location. Of course, you can implement a workaround for scalar values – for loading, load the 16-byte chunk the value is in, and then shift it in the register so that it resides in the preferred slot; for storing, load the destination 16-byte chunk, insert the desired value into it via shifting/masking, and then store the whole chunk back. In fact, this is exactly what the compiler does. Moreover, for our struct vector3_t, loading the three components into registers will generate such load/shift code for every component, since the compiler does not know the alignment (the whole vector could be in one 16-byte chunk, or it could be split between two chunks at any component boundary).&lt;/p&gt;

&lt;p&gt;In order to leverage the available power, we have to use vector instructions. SPUs have a custom instruction set, which is &lt;a href=&quot;http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44&quot;&gt;well documented&lt;/a&gt;. For now, it’s important to know that there is a fused multiply-add instruction, which computes a*b+c, and there is no dot product instruction (or floating-point horizontal sum, for that matter). In fact, among the current generation of consoles, the Xbox 360 is pretty unique in that it does have a dot product instruction.&lt;/p&gt;

&lt;p&gt;So, our code is bad because we have lots of scalar memory accesses and lots of scalar operations, which are not using available processing power properly. Let’s change this!&lt;/p&gt;

&lt;p&gt;One option is to code in assembly; this has obvious benefits and obvious pitfalls, so we’ll use intrinsics instead. For SPUs, we have three intrinsics sets to choose from – Altivec emulated (vec_*, the same as we use on PPU), generic type-aware (spu_*) and low-level (si_*). GCC provides several vector types as language extensions (some examples are ‘vector float’ and ‘vector unsigned char’, which correspond to 4 32-bit floats and 16 8-bit unsigned integers, respectively); a single spu_* intrinsic translates to different assembly instructions depending on the type, while si_* intrinsics operate on an abstract register (it has type ‘qword’, which corresponds to ‘vector signed char’) – i.e. to add two vectors, you can use spu_add(v1, v2) with typed registers, or one of si_a, si_ah, si_fa, si_dfa to add registers as 32-bit integers, 16-bit integers, 32-bit floats or 64-bit doubles, respectively. We’ll be using the si_* family for two reasons: first, these intrinsics map to assembly exactly, so getting used to them makes it much easier to read (and possibly write) actual assembly, which is very useful when debugging or optimizing code; second, the spu_* family is not available in C, as it relies on function overloading. I’ll explain specific intrinsics as we start using them.&lt;/p&gt;

&lt;p&gt;The first thing we’ll do is dispose of the redundant vector3_t/plane_t structures (in a real math library we wouldn’t, of course, but this is a sample) and replace them with qwords. This way everything will be properly aligned, and we won’t need to write load/store code ourselves (as opposed to something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct vector3_t { float v[4]; }&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Then we have to generate an array of points. Each resulting point is a combination of aabb-&amp;gt;min and aabb-&amp;gt;max – for each component we select either the minimum or the maximum value. As it turns out, there is an instruction that does exactly that – it accepts two registers with the actual values and a third one with a selection pattern; for each pattern bit, it takes the bit from the left operand for 0 and from the right operand for 1 – it’s equivalent to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(a &amp;amp; ~c) | (b &amp;amp; c)&lt;/code&gt;, only in one instruction.&lt;/p&gt;

&lt;p&gt;The code becomes&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// get aabb points&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                                                   &lt;span class=&quot;c1&quot;&gt;// x y z&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})),&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// X y z&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})),&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// X Y z&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})),&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// x Y z&lt;/span&gt;


    &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})),&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// x y Z&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})),&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// X y Z&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                                                   &lt;span class=&quot;c1&quot;&gt;// X Y Z&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})),&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// x Y Z&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that I’m using another gcc extension to form vector constants. This is very convenient and does not exhibit any unexpected penalties (the expected ones being additional constant storage and additional instructions to load them).&lt;/p&gt;
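
For reference, the same corner generation can be written in portable scalar form, with a per-component ternary select standing in for si_selb. The bit-to-component mapping and function name here are assumptions of this sketch, not the corner ordering used above:

```cpp
// Corner i takes max.x if bit 0 of i is set, max.y for bit 1, max.z for
// bit 2; otherwise the corresponding min component is used. The ternary
// plays the role of the lane-wise bit select.
static void aabb_corners(const float mn[3], const float mx[3], float out[8][3])
{
    for (int i = 0; i != 8; ++i)
        for (int c = 0; c != 3; ++c)
            out[i][c] = ((i >> c) % 2 != 0) ? mx[c] : mn[c];
}
```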

&lt;p&gt;Then we have transform_point; we have to transform a given vector by a matrix, and additionally stuff a 1.0f into the .w component of the result in order for the following dot product to work (I sort of hacked around this in the scalar version by using dot(vector3, vector4)). Vector-matrix SIMD multiplication is very well known – we’ll need add/multiply instructions, and the ability to replicate a vector element across the whole vector. For the latter we’ll use the si_shufb instruction – I’ll leave the detailed explanation for the next issue; for now just assume that it works as desired :)&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform_point&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix43_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_shufb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x00010203&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;py&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_shufb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x04050607&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_shufb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x08090a0b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;


    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;py&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fma&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_selb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_float4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vec_uint4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}));&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We replicate the point components, yielding three vectors, and then compute the transformation result using the si_fma (fused multiply-add; returns a * b + c) instruction. After that, we combine the result with a constant vector via si_selb to get 1.0f into the last component.&lt;/p&gt;

&lt;p&gt;Note that in this case we are fortunate to have our matrix laid out as it is – another layout would force us to transpose it prior to further computations to make vectorization possible. In the scalar case, the layout does not make any difference.&lt;/p&gt;
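
A plain scalar model of the transform above may help: with the matrix stored as rows, the result is row3 + z*row2 + y*row1 + x*row0, each step being one fused multiply-add on the SPU. The row-major 4x3 struct and function name below are assumptions of this sketch:

```cpp
// Hypothetical row-major layout matching the SPU code's mat->row0..row3.
struct matrix43 { float row0[4], row1[4], row2[4], row3[4]; };

// Scalar equivalent of the three si_fma steps, followed by forcing .w to
// 1.0f (the job of the final si_selb in the vector version).
static void transform_point_scalar(float out[4], const float p[3], const matrix43* m)
{
    for (int i = 0; i != 4; ++i)
        out[i] = m->row3[i] + p[2] * m->row2[i] + p[1] * m->row1[i] + p[0] * m->row0[i];
    out[3] = 1.0f;
}
```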

&lt;p&gt;Finally, we have to compute the dot product. As there is no dedicated dot product instruction, we’ll have to emulate it, which is not pretty.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mul&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rhs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;


    &lt;span class=&quot;c1&quot;&gt;// two pairs of sums&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mul_zwxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_rotqbyi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mul_zwxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// single sum&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum_2y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_rotqbyi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;qword&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_fa&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum_2y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// return result&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;si_to_float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First we get the component-wise multiplication result using fm; then we have to compute the horizontal sum. We start by summing pairs of elements: we rotate the register to the left by 8 bytes (si_rotqbyi) and add the rotated copy to the original. After that, we rotate the partial sums left by 4 bytes (to get the second partial sum into the preferred slot) and add again.&lt;/p&gt;

&lt;p&gt;For mul = (1, 2, 3, 4), we get the following values:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mul_zwxy = 3 4 1 2
sum_2 = 4 6 4 6
sum_2y = 6 4 6 4
sum_1 = 10 10 10 10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
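
The same rotate-and-add sequence can be modeled in portable scalar C++, with lane rotation expressed as index arithmetic (the function name is made up for this sketch; standard includes assumed):

```cpp
// Scalar model of the horizontal sum: rotating the register by 8 bytes moves
// elements by two lanes, by 4 bytes moves them by one lane; two rotate+add
// steps leave the total sum in every lane, including the preferred slot.
static float hsum4(const float v[4])
{
    float sum2[4];
    float sum1[4];
    for (int i = 0; i != 4; ++i)
        sum2[i] = v[i] + v[(i + 2) % 4];       // add rotate-by-two-lanes
    for (int i = 0; i != 4; ++i)
        sum1[i] = sum2[i] + sum2[(i + 1) % 4]; // add rotate-by-one-lane
    return sum1[0];                            // preferred slot
}
```

For (1, 2, 3, 4) this reproduces the intermediate values listed above and returns 10.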

&lt;p&gt;The result is converted to a float via the si_to_float cast intrinsic – it just tells the compiler to reinterpret the register as if it held a float (the actual scalar value is assumed to be in the preferred slot); this usually does not generate any additional instructions.&lt;/p&gt;

&lt;p&gt;Note that in the case of SPU, there is only one register set – thus there is no penalty for such vector/scalar conversion. This code will not perform well on other architectures, however – for example, on PowerPC, converting a vector to a float this way causes a LHS (Load Hit Store; it occurs when you read from an address you have just written to), because the vector has to be stored to the stack in order to load a vector element into a float register; a LHS causes a huge stall (40-50 cycles), and thus performance can be compromised here. For this reason, if your PPU/VMX math library has an optimized dot product function that returns a float, don’t use it in performance-critical code – find another approach. Interestingly, if you think about it, you don’t need dot products that much, as I’ll show in the next issue.&lt;/p&gt;

&lt;p&gt;Anyway, the current code runs at 820 cycles, which is 50% faster than the scalar code. This equates to approximately 256 msec per million calls, the corresponding numbers for x86 being 136 msec for gcc and 74 msec for msvc8. Once the x86 code is changed so that the dot() function returns its result in a vector register, and the resulting signs are analyzed via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_mm_movemask_ps&lt;/code&gt; intrinsic, the timings change to 126/68 msec, respectively. We’ve made some progress, but our SPU implementation is still far behind x86 in terms of speed even though we’re using the same techniques. I promise that the end result will be much more pleasing though :)&lt;/p&gt;

&lt;p&gt;The current source can be &lt;a href=&quot;https://gist.github.com/zeux/218be90b7ce38c81777e&quot;&gt;grabbed here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s all for now – stay tuned for the next weekend’s post!&lt;/p&gt;

&lt;p&gt;View Frustum Culling series contents:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/01/31/view-frustum-culling-optimization-introduction/&quot;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Vectorize me&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/&quot;&gt;Structures and arrays&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/&quot;&gt;Never let me branch&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/15/view-frustum-culling-optimization-representation-matters/&quot;&gt;Representation matters&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/&quot;&gt;Balancing the pipes&lt;/a&gt;&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Sun, 08 Feb 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/02/08/view-frustum-culling-optimization-vectorize-me/</link>
			<guid isPermaLink="true">https://zeux.io/2009/02/08/view-frustum-culling-optimization-vectorize-me/</guid>
		</item>
		
		<item>
			<title>View frustum culling optimization - Introduction</title>
			<description>&lt;p&gt;Here I come again, back from almost a year long silence – and for some weird reason a visitor counter shows that people are still reading my blog! This was an eventful year for me – I worked on lots of things at work and on some at home, got 3 more shipped titles to put in my CV, started really programming on PS3 (including many RSX-related adventures, optimizations and, recently, SPU coding, which I happen to enjoy a lot), and, as some of you will probably guess from the code below, started using Vim. Some other (good) changes gave me more free time, so this post is a first one in a new one-post-in-a-week series (which will hopefully not be the, uh, last one also).&lt;/p&gt;

&lt;p&gt;I have a small piece of code at work, which performs simple frustum culling for a given OBB. Initially it was written in an unoptimized (and cross-platform) way; later it was rewritten for Altivec with interesting optimizations, which yielded a 3x performance boost, IIRC; and recently it was rewritten again for SPU. I considered this series of code transformations an interesting one, so I thought I’d expand it slightly (adding more intermediate stages), pretend it was always on SPU, and write several posts about it.&lt;/p&gt;

&lt;p&gt;This post is a teaser, featuring the testing methodology, the source data, the algorithm and the initial (unoptimized) version of the code, complete with an unoptimized underlying math library. Each code snippet will be short and self-contained, since the problem at hand is simple enough. Later posts in the series will each feature some performance-related transformation (most of which can obviously be applied to lots of other algorithms), with the last post giving more or less the current version of my code at work.&lt;/p&gt;

&lt;p&gt;So, let’s get started!&lt;/p&gt;

&lt;p&gt;It starts simple - we have a handful of meshes, with each mesh having some kind of bounding volume and a local-to-world transformation. The task at hand is to determine, for a given frustum, whether the mesh is potentially visible (inside/intersecting with the frustum). The test has to be conservative – i.e. it can answer “visible” for meshes which are actually invisible – but it does not have to be exact. In fact, for many meshes the bounding volume itself is inexact, but additionally the algorithm can sacrifice some accuracy in order to be faster.&lt;/p&gt;

&lt;p&gt;For our case, the bounding volume is an AABB (axis-aligned bounding box) - “axis-aligned” here means that the box axes are aligned to the mesh’s local space axes, so in world space this is an OBB. We’re testing it against an arbitrary frustum – it can be a usual perspective/orthogonal one, or something fancier (for example, our reflection/refraction rendering passes use a perspective projection with oblique clipping, so the near/far planes are not perpendicular to the viewing direction – in fact, they are not parallel at all!). The frustum is defined by a 4x4 matrix, though obviously we’re free to convert it to any other representation we like.&lt;/p&gt;

&lt;p&gt;There are two common approaches to testing whether the box is inside the frustum. In both, the equations of all 6 frustum planes are extracted first. Then, for each plane, it’s determined whether the box is completely outside it (i.e. in the negative half-space, assuming the planes’ normals point inside the frustum). If the box is not completely outside any plane, it’s reported visible; otherwise it’s reported invisible. This can be extended to differentiate between “completely inside” and “partially inside” results, though we don’t need that, since we’re going to render both groups of meshes anyway.&lt;/p&gt;

&lt;p&gt;The two approaches differ in the box-plane test – one (bruteforce) tests all 8 vertices of the box, while the other tests a single point (the p-vertex), which is enough to give the correct answer if the correct p-vertex is chosen. The series will concentrate on the bruteforce approach, at least for now.&lt;/p&gt;
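
&lt;p&gt;For reference, the p-vertex is the box corner furthest along the plane normal; if even that corner is behind the plane, the whole box is. A minimal sketch in plain C (using a hypothetical vec4 type; note that this picks the p-vertex of a world-space AABB – for our OBB case the plane would have to be transformed into box local space first):&lt;/p&gt;

```c
#include <assert.h>

struct vec4 { float x, y, z, w; };

/* Select the p-vertex of an AABB for a plane (plane->w is the
   distance term): for each axis, the sign of the plane normal
   component decides between the min and max corner. */
static struct vec4 p_vertex(const struct vec4* bmin, const struct vec4* bmax,
                            const struct vec4* plane)
{
    struct vec4 r;
    r.x = (plane->x >= 0) ? bmax->x : bmin->x;
    r.y = (plane->y >= 0) ? bmax->y : bmin->y;
    r.z = (plane->z >= 0) ? bmax->z : bmin->z;
    r.w = 1.f;
    return r;
}
```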

&lt;p&gt;The naïve version of the algorithm first extracts the plane equations from the frustum’s combined view-projection matrix (this is done once per frustum, so it is not performance-sensitive; as such, the code for it is omitted and the frustum is assumed to start out with 6 computed plane equations). Then it applies the described bruteforce algorithm as follows:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;is_visible&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;matrix43_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;aabb_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;frustum_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// get aabb points&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;vector3_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;points&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;


        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aabb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// transform points to world space&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;transform_point&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// for each plane…&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;points&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frustum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;planes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inside&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
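
&lt;p&gt;As an aside, the omitted plane extraction is the usual trick of combining the rows of the view-projection matrix; here is a sketch (assuming a row-vector convention, i.e. clip = world * M, and OpenGL-style depth – the exact signs depend on your matrix conventions, and this is not the engine’s actual extraction code):&lt;/p&gt;

```c
#include <assert.h>

/* m[i][j] is the combined view-projection matrix (row-major,
   row-vector convention). Produces six unnormalized planes
   (a, b, c, d) with normals pointing inside the frustum. */
static void extract_planes(const float m[4][4], float planes[6][4])
{
    for (int i = 0; i < 4; ++i)
    {
        planes[0][i] = m[i][3] + m[i][0]; /* left   */
        planes[1][i] = m[i][3] - m[i][0]; /* right  */
        planes[2][i] = m[i][3] + m[i][1]; /* bottom */
        planes[3][i] = m[i][3] - m[i][1]; /* top    */
        planes[4][i] = m[i][3] + m[i][2]; /* near   */
        planes[5][i] = m[i][3] - m[i][2]; /* far    */
    }
}
```

&lt;p&gt;With an identity matrix this yields, e.g., the left plane (1, 0, 0, 1) – the x &amp;gt;= -1 face of the clip cube, as expected.&lt;/p&gt;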

&lt;p&gt;It uses five predefined data structures (vector3_t, matrix43_t, aabb_t, frustum_t and plane_t, which is the type of the frustum-&amp;gt;planes array elements); those, and the (again, naïve) code of the two functions used, are available &lt;a href=&quot;https://gist.github.com/zeux/a3114def35a16ad63e6b&quot;&gt;here (along with the rest of the code)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that matrix43_t is laid out so that the three translation components are adjacent to each other in memory (I don’t use the term “whatever-major” here because it’s very misleading); in our real code, rows actually consist of four components, with the fourth one being undefined for matrix43_t (of course, all operations should proceed as if that column were filled with 0 0 0 1). Similarly, vector3_t has four components, with the fourth one undefined (this affects aabb_t). This is assumed to stay this way forever, so all of the code we discuss will work around it when needed. From the next post onwards, the data layout of the sample code will be exactly the same as in our engine; I’ve omitted the padding fields in today’s version for simplicity.&lt;/p&gt;
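
&lt;p&gt;The padded layout described above might look like this (an illustration only – these are not the actual engine definitions, and today’s linked code omits the padding):&lt;/p&gt;

```c
#include <assert.h>

/* The fourth component of each vector/row is padding with undefined
   contents; code must behave as if it were ignored (vectors), or as
   if the fourth matrix column were 0 0 0 1. */
struct vector3_t { float x, y, z, pad; };

struct matrix43_t
{
    struct vector3_t row0, row1, row2; /* rotation/scale rows */
    struct vector3_t row3;             /* translation: x y z adjacent in memory */
};
```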

&lt;p&gt;The testing methodology is simple – I compile the code for SPU using a compiler from Sony’s toolchain with the -O3 level of optimization (the code is C99, by the way), and then run it via the SPU simulator. This is an extremely useful tool provided by Sony to PS3 developers, which can run a SPU program and, for example, report run statistics – cycles elapsed, various stalls, etc. As far as I know, IBM has a public simulation suite with comparable capabilities, but since it requires Linux I never bothered to test it out. The number of cycles that the tool reports is then slightly reduced to account for some startup overhead (which is 34 cycles), so the number I present here is the number of cycles for a non-inlined call, with the function call overhead included. For reference, I’ll include the expected run time on a million OBBs (on one SPU, obviously), and the corresponding run times of more or less the same code on PC (on my Core2 Duo 2.13 GHz), compiled with gcc 4.3.0 and MSVC 8.0 (SPU intrinsics will be replaced with SSE1 code). Those are only for reference, don’t quote me on them :)&lt;/p&gt;

&lt;p&gt;The cycle count for this naïve code is 1204, which (given a 3.2 GHz SPU) translates to 376 msec per million calls. The same code gives me roughly 84 msec when compiled with gcc (switches -O3 -msse) and 117 msec when compiled with cl (switches /O2 /fp:fast /arch:SSE). The test runs on the same data each time (while making sure that the processing is actually performed), which excludes cache misses from the picture; the actual data is the same for the SPU and PC tests and consists of a small box completely inside the frustum – that’s the worst case for the outer loop (we have to test the box against all planes only to report that it’s actually visible), although it’s the best case for the inner loop (the first vertex tested for each plane is inside, so we can bail out early). I don’t have any real-world average statistics on the iteration counts of those loops; in any case, we’re eventually going to eliminate some, and then all, of those branches, so that the code will perform at a constant speed.&lt;/p&gt;
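
&lt;p&gt;The cycle-to-time conversion used throughout the series is simple: at 3.2 GHz, cycles per call divided by 3.2 gives nanoseconds per call, which is numerically equal to milliseconds per million calls. A quick sanity check:&lt;/p&gt;

```c
#include <assert.h>

/* msec per million calls for a given per-call cycle count,
   assuming a 3.2 GHz clock (cycles / 3.2 = ns/call = ms/Mcalls) */
static int msec_per_million(int cycles)
{
    return (int)(cycles / 3.2);
}
```

&lt;p&gt;This reproduces both numbers quoted in the series so far: 1204 cycles is 376 msec, and 820 cycles is 256 msec.&lt;/p&gt;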

&lt;p&gt;That’s all for now – stay tuned for the next weekend’s post!&lt;/p&gt;

&lt;p&gt;View Frustum Culling series contents:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/08/view-frustum-culling-optimization-vectorize-me/&quot;&gt;Vectorize me&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/02/15/view-frustum-culling-optimization-structures-and-arrays/&quot;&gt;Structures and arrays&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/01/view-frustum-culling-optimization-never-let-me-branch/&quot;&gt;Never let me branch&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2009/03/15/view-frustum-culling-optimization-representation-matters/&quot;&gt;Representation matters&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/2010/09/11/view-frustum-culling-optimization-balancing-the-pipes/&quot;&gt;Balancing the pipes&lt;/a&gt;&lt;/li&gt;
  &lt;/ol&gt;
&lt;/blockquote&gt;
</description>
			<pubDate>Sat, 31 Jan 2009 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2009/01/31/view-frustum-culling-optimization-introduction/</link>
			<guid isPermaLink="true">https://zeux.io/2009/01/31/view-frustum-culling-optimization-introduction/</guid>
		</item>
		
		<item>
			<title>COLLADA: quick update</title>
			<description>&lt;p&gt;Time’s running fast. Two weeks have passed since my post about COLLADA, and I’ve found a killer bug in the FCollada TBN generation code.&lt;/p&gt;

&lt;p&gt;As the 3dsmax native API does not provide support for returning TBN data (I don’t know about Maya; perhaps it doesn’t either), Feeling Software implemented their own algorithm for TBN calculation, based on source found in the Maya 7.0 documentation, “Appendix A: Tangent and binormal vectors”. Of course, relying on NVMeshMender would be too easy.&lt;/p&gt;

&lt;p&gt;And after three years of Feeling Software’s Collada plugins, there is a bug in the TBN generation code. You can &lt;a href=&quot;http://sourceforge.net/forum/forum.php?thread_id=1966038&amp;amp;forum_id=460918&quot;&gt;read the full details here&lt;/a&gt; (the poster is me), but to keep it simple - the returned tangent/binormal are opposite to the correct ones because of an incorrect sign in the equations (a proof with asset files and a comparison between the Maya reference code and FCollada is also in the post). Well, perhaps I’m just misunderstanding something, but I definitely think it is a bug - there are just too many things backing it up.&lt;/p&gt;

&lt;p&gt;And suddenly I can’t post a bug report on the Feeling Software forum, and that’s how I find out that free COLLADA support has been discontinued. Given that the other alternatives for DAE export from Max/Maya are just not worth the trouble, COLLADA suddenly starts to feel much less attractive than before.&lt;/p&gt;

&lt;p&gt;I’m even considering writing a small (geometry, node hierarchy, skin controller and sampled &amp;amp; baked animation - should not be that hard) plugin for 3dsmax/Maya…&lt;/p&gt;
</description>
			<pubDate>Wed, 12 Mar 2008 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2008/03/12/collada-quick-update/</link>
			<guid isPermaLink="true">https://zeux.io/2008/03/12/collada-quick-update/</guid>
		</item>
		
		<item>
			<title>COLLADA: Best thing since sliced bread or not?</title>
			<description>&lt;p&gt;About half a year ago, our team at work that develops the engine decided to try and switch from the proprietary Maya export plugin (it exported geometry, animation and materials) to COLLADA. The old export pipeline was somewhat obscure, lacked some useful optimizations, and (what’s most important) lacked any convenient way to setup materials. That was not a problem for platforms with more or less fixed functionality, but with next-generation consoles (or should I say current-generation already?) it’s quite different.&lt;/p&gt;

&lt;p&gt;So the switch has been made (it did not take half a year, it’s just that I’m writing about it only now), and I’m going to share some experience gained throughout the process.&lt;/p&gt;

&lt;p&gt;What is COLLADA exactly? It’s an asset interchange format, based on XML (with complete XML Schema and specification), and a series of tools – exporters from popular DCC software (Maya, 3d Studio Max, XSI, Blender, etc.), viewers, libraries for loading/saving/manipulating COLLADA DOM tree.&lt;/p&gt;

&lt;p&gt;This means several important things. First, it’s an asset interchange format, which means that it is not supposed to be used as a format for retail assets. The DCC saves a COLLADA file; a custom tool loads it, reads the useful information from it, applies optimizations (possibly platform-specific) and saves it to some binary format. Second, you don’t have to write an export plugin for every DCC tool you use – in theory, all you do is write the said tool that converts .dae to your format, and it magically works with all possible tools. Third, it’s slowly becoming something of an industry standard – every popular DCC has an export plugin, some well-known tools can read DAE files (e.g. FXComposer), it has the support of well-known companies like Sony, and more and more engines are adopting it.&lt;/p&gt;

&lt;p&gt;But that, of course, does not mean that it is a perfect solution.&lt;/p&gt;

&lt;p&gt;So, what exactly are COLLADA advantages (why do you want to use it)?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;You get a more or less DCC-independent pipeline. Even if your artists only ever use Maya, it does not mean that you’ll never need 3dsmax support (our engine is now being used by a company which only has 3dsmax-savvy artists, so the task “support 3dsmax as a geometry/animation export tool” came up – and it took a day or two).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It is an additional layer of abstraction between the DCC and your builder. This means that the tedious work with DCC APIs is now inside the exporter, which is (ideally) code you shouldn’t even have to know about. As a result, the export pipeline is much simpler.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There is built-in custom material support (ColladaFX). Basically, it allows you to specify a material created from a hardware shader (Cg/CgFX), and supplies the artist with a convenient way of tweaking the shader parameters (with a viewport preview as an added bonus).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;DCC plugins usually support importing DAE files. Why is this important? In the old pipeline, we had the proprietary plugin export an .sb file; then a platform-independent tool applied some optimizations (reducing the scene graph, removing redundant stuff from the scene, merging meshes, etc.); and then the platform-specific exporter read that file and converted it to a platform-specific format (stripification, cache optimization, vertex packing, etc.). Obviously, any kind of visual feedback is lost the moment you export the .sb file from Maya, so a special viewing/introspection tool was developed. If you use COLLADA and manage to write your export tools so that they only modify the .dae file, you can later import it back into your DCC tool. If your pipeline is made of a series of such builders, and (!) you save the result of each builder, debugging the pipeline becomes much easier.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, those are all the good things about COLLADA I know of. Unfortunately, there are a number of things that are not so good.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The schema is complex and redundant in many ways. Writing a complete parser (one able to parse any compliant COLLADA file) is hard, so either use an existing library (FCollada?) or parse only the subset of the schema that your DCC tool exports. I prefer the latter approach, because it’s simpler for me and also much faster in terms of performance.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;DCC export is sometimes quite slow (in the case of Maya, for example, the export usually takes twice as long as the code that parses the file, builds the platform-specific structures and saves them). So cache your .dae files (we’re using SCons as a build system, along with a network cache, so it’s not as frustrating as it could be).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It is an additional layer of abstraction between the DCC and your builder. This means that every time you encounter a bug, it could be in either the export plugin or your builder (or, uh, in a series of your builders). And if you use a DAE-parsing library, it could also be the source of the problem. Fortunately, such cases are rare.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Export plugins are sometimes not of top quality – for example, the lack of pivot animation export in ColladaMaya, ColladaFX support problems, bugs, etc.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It is just an export plugin, so do not expect miracles. For example, if 3dsmax gives you a TBN basis that does not make sense, COLLADA is not going to fix it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ColladaFX is very bad from the usability standpoint:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;It’s hard for artists to create a new material and correctly set up its binding to geometry (for example, TBN shader binding in Maya is not quite clear because of Cg).&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;It’s much harder for them to use it in 3dsmax because of an even less convenient interface and some problems with parameter binding – just ask your artists to set up an existing model with ColladaFX materials in 3dsmax and you’ll see why.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Perhaps it’s slightly better with CgFX, but since we don’t use it, I can’t say for sure.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The ColladaFX implementation is quite bad:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;There are frequent crashes in 3dsmax (we fixed some of them and are considering submitting a patch).&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Cg materials did not even export correctly because of an exporter bug in 3dsmax! We submitted a patch that should already be in trunk.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;ColladaFX material export from Maya did not work in batch builds.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, generally, ColladaFX seems great on paper but requires a lot of work, both on the technical implementation and on the usability side. We are considering rewriting the Maya interface part from scratch.&lt;/p&gt;

&lt;p&gt;Fortunately, the COLLADA exporter plugins we’re using are open-source, so we can debug them when they do not work, fix bugs (isn’t it exciting?!) and add functionality as we see fit (though of course this complicates updating to new plugin versions).&lt;/p&gt;

&lt;p&gt;Let’s summarize the above. If you do not have an established, well-working export pipeline and are not planning a custom DCC plugin for material setup or the like, I’d definitely recommend COLLADA: it’ll be easier than a custom plugin if you don’t have the relevant experience, and it makes it possible to support several DCC tools, which is a good thing. If you have a well-established export pipeline that you’re happy with, there is obviously no need to switch to COLLADA. In other cases the answer is more complex. I myself am quite happy with the transition to COLLADA, because it made everything better; the major disappointment was ColladaFX, which we did not have an equivalent for anyway (and export of standard materials like Phong/Blinn/Lambert works just fine) – but of course your mileage may vary.&lt;/p&gt;

&lt;p&gt;If you are using COLLADA and your experience differs in any of the areas listed, please leave a comment! For example, do you use ColladaFX? Do you use FCollada and/or ColladaDOM, and do they help you? Perhaps you use Feeling Software’s proprietary export plugins and have something good (or bad) to say about them?&lt;/p&gt;
</description>
			<pubDate>Sun, 24 Feb 2008 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2008/02/24/collada-best-thing-since-sliced-bread-or-not/</link>
			<guid isPermaLink="true">https://zeux.io/2008/02/24/collada-best-thing-since-sliced-bread-or-not/</guid>
		</item>
		
		<item>
			<title>My own lighting shader with blackjack and hookers</title>
			<description>&lt;p&gt;So, yesterday we were discussing various lighting approaches at local IRC chat, and someone said that it was impossible to make a ps.2.0 single-pass lighting shader for 8 lights with diffuse, specular and attenuation, of course with normal map. Well, I told him he was actually wrong, and that I can prove it. Proving it turned out to be a very fun and challenging task. I am going to share the resulting shader and the lessons learned with you.&lt;/p&gt;

&lt;p&gt;From the start I knew it was not going to be easy – ps.2.0 means that no arbitrary swizzles are available (which can in some cases restrict possible optimizations), and there is a limit of 64 arithmetic instructions. Add to that the fact that there is no instruction pairing as in ps.1.x (which is rather strange, as all hardware I know of is capable of pairing vector and scalar instructions).&lt;/p&gt;

&lt;p&gt;I decided to compute lighting in world space, as I thought that passing all lighting data from VS to PS was going to consume too many interpolators. So I passed the world-space view vector, the world-space position and the tangent-to-world-space conversion matrix from VS to PS, and had the PS transform the normal to world space and compute everything else. This proved insufficient to reach the target, but I am going to show you the shader anyway, as it has some interesting ideas, some of which survive in the final shader.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/zeux/01d4555fb000fa25bc3c/b7b422f8f853eb591c2f69700db64ee529637d3b&quot;&gt;Here is the shader&lt;/a&gt;. Ignore stuff like bindings.h, TRILINEAR_SAMPLER, etc. – these are to allow FX Composer bindings of parameters.&lt;/p&gt;

&lt;p&gt;Interesting things to note:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;For world-space lighting, there is no need to do normal expansion (normal * 2 – 1) explicitly – you can fold it into the tangent space -&amp;gt; world space conversion matrix. This saves 1 instruction.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Vectorize like crazy. You can do a lot of stuff cheaper if you vectorize your calculations. For example, here we have 8 lights – that’s 2 groups of 4 lights each. You can save instructions by computing attenuation for 4 lights at once (1 mad instruction for 4 lights instead of 1 mad per light), you can save a lot of instructions on specular computation (read below), etc.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you want to add a lot of scalar values, dot is your friend. dot(value, 1) will add all components of value.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Be smart about replacing equations with equivalent ones. For example, the naïve Phong specular equation is dot(R, V), where R is the reflected light vector. In this form, you have to do a reflect() for each light. Reflect is not exactly cheap, and, as we have 8 lights and only 64 instructions, every instruction executed per light is very expensive. But dot(reflect(L, N), V) is equal to dot(reflect(V, N), L) (V is the view vector), which requires only a single reflect().&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Specular is expensive, because the pow() call is expensive. Why? Because, sadly, pow() can’t be vectorized. pow() is expanded into three instructions – log, mul, exp (pow(a, b) is equal to exp(log(a) * b)) – and neither log nor exp has a vectorized form. The result is that you’ll waste at least two instructions per light just to compute the specular power.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
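&lt;p&gt;The reflect() identity from point 4 is easy to verify numerically. Here is a minimal C++ sketch (a plain struct standing in for the HLSL float3, with reflect() defined the HLSL way):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Plain stand-ins for the HLSL float3 type and intrinsics used in the post.
struct float3 { float x, y, z; };

float dot(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// HLSL reflect(i, n) = i - 2 * dot(n, i) * n (n assumed normalized).
float3 reflect(float3 i, float3 n) {
    float d = 2.0f * dot(n, i);
    return { i.x - d * n.x, i.y - d * n.y, i.z - d * n.z };
}

// Both sides expand to dot(L, V) - 2 * dot(N, L) * dot(N, V),
// so a single reflect(V, N) can be shared across all 8 lights.
float spec_per_light(float3 l, float3 n, float3 v) { return dot(reflect(l, n), v); }
float spec_shared(float3 l, float3 n, float3 v) { return dot(reflect(v, n), l); }
```

&lt;p&gt;With a normalized N, both functions return the same value for any L and V, which is why reflect() can be hoisted out of the per-light work.&lt;/p&gt;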

&lt;p&gt;There are several solutions. The first one is to fix the specular power to some convenient value. For example, pow(a, 16) can be implemented as 4 muls (and in fact the HLSL compiler does this automatically), and muls can be vectorized – so you can compute the specular power for all 4 lights in 4 instructions, which is much better.&lt;/p&gt;

&lt;p&gt;However, there is a better solution, which is described here: &lt;a href=&quot;http://www.gamasutra.com/features/20020801/beaudoin_01.htm&quot;&gt;A Non-Integer Power Function on the Pixel Shader&lt;/a&gt;. Basically, you can approximate the function pow(x, N) with pow(max(A*x + B, 0), M). A and B can be tuned such that the maximum error is low enough for the approximation to be usable in practice, and M can be made very small – for example, I use M=2, and there are no artifacts for N=18 (the results ARE different, as you can see by switching between the formulas in realtime, but the difference is small and no one will be able to tell that an approximation is used). Alternatively, you can have your artists tune A and B directly.&lt;/p&gt;

&lt;p&gt;The net effect is that instead of several instructions per light, we compute the specular power for a whole group of 4 lights in 2 instructions – mad_sat to compute A*x+B, and mul to square the result (remember, M=2).&lt;/p&gt;
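&lt;p&gt;Here is a small C++ sketch of the approximation for N=18, M=2. Note that the constants A=3.0 and B=-2.1 are hand-picked for illustration (the referenced article describes how to fit them properly), so treat the error figure as indicative only:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cmath>

// pow(x, 18) approximated as pow(saturate(A * x + B), 2):
// one mad_sat and one mul in ps.2.0 terms, both vectorizable across 4 lights.
// A = 3.0 and B = -2.1 are hand-picked; the article fits them per exponent.
float approx_pow18(float x) {
    float t = std::max(3.0f * x - 2.1f, 0.0f); // mad_sat
    return t * t;                              // mul (M = 2)
}

// Maximum absolute error versus the real pow(x, 18) over [0, 1].
float max_error() {
    float e = 0.0f;
    for (int i = 0; i <= 1000; ++i) {
        float x = i / 1000.0f;
        e = std::max(e, std::fabs(approx_pow18(x) - std::pow(x, 18.0f)));
    }
    return e;
}
```

&lt;p&gt;With these rough constants the maximum absolute error sits around 0.2 near the shoulder of the curve – visible in a side-by-side comparison, but hard to spot in an actual specular highlight; properly fitted constants do better.&lt;/p&gt;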

&lt;p&gt;Okay, we have a bunch of cool optimizations here. Are we done? Sadly, no, we are not. The presented shader compiles into 75 instructions. The problem is that the per-light cost is still too high. We have to compute an unnormalized light vector (1 sub), compute its squared length (1 dp3), normalize it (1 rsq + 1 mul), compute dot(N, L) for diffuse lighting (1 dp3), compute dot(R, L) for specular lighting (1 dp3), and combine diffuse lighting with the specified colors (1 mad). This is 7*8=56 instructions, which leaves only 8, and we need many more. What can we do?&lt;/p&gt;

&lt;p&gt;Well, we can simplify our calculations and replace light colors with light intensities. This strips 8 instructions, but adds 2 instructions for multiplying the NdotL vectors by intensities, and 1 instruction for summing all diffuse components together (dp3), which is still not enough – we need to reserve instructions for other things (for example, the normal transformation takes 3 instructions, specular computation is 4 instructions for 8 lights, attenuation is 4 instructions for 8 lights, and there are still several other things left).&lt;/p&gt;

&lt;p&gt;Yesterday, I gave up and went to sleep. But today, after some thinking, I felt like a barrier in my head collapsed – I knew why it did not work out, and I knew the better solution.&lt;/p&gt;

&lt;p&gt;See, it is true that we have a lot of overhead per light source. The problem is not that we need to do a lot of calculations – the problem is that we are using the available computation power inefficiently.&lt;/p&gt;

&lt;p&gt;Let’s look at the instruction counts I presented earlier. Yes, it’s true that we do 1 sub per light – but sub is capable of processing 4-float vectors, and we are using it to subtract 3-component vectors, so we waste a quarter of its throughput. The same holds for everything else – all the dot products and muls, for example.&lt;/p&gt;

&lt;p&gt;And the barrier that collapsed in my head had an inscription: “You have the mighty dot product, use it”. As it turned out, there is not much sense in treating GPU assembly differently from, say, SSE instructions. There is no dot product in SSE. Trying to compute 4 dot products at once in a straightforward way in SSE fails miserably – most likely the FPU will be faster. But if you change the way your data is organized, and instead of laying it out in AoS order:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;v1.x v1.y v1.z (unused)
v2.x v2.y v2.z (unused)
v3.x v3.y v3.z (unused)
v4.x v4.y v4.z (unused)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;lay it out in SoA order:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;v1.x v2.x v3.x v4.x
v1.y v2.y v3.y v4.y
v1.z v2.z v3.z v4.z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And lo and behold, four slow dot product instructions are now just 3 muls and 2 adds – simple and fast.&lt;/p&gt;
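&lt;p&gt;To make the idea concrete, here is a C++ sketch of the same four dot products in SoA form (names are illustrative; on the GPU each line of the helper corresponds to one 4-wide mul/mad instead of a dp3 per vector):&lt;/p&gt;

```cpp
#include <cassert>

// Four dot products against a common vector (nx, ny, nz) in SoA layout:
// one 4-wide mul plus two 4-wide mads -- no per-vector dp3 needed.
void dot4_soa(const float x[4], const float y[4], const float z[4],
              float nx, float ny, float nz, float out[4]) {
    for (int i = 0; i < 4; ++i) out[i] = x[i] * nx;          // mul
    for (int i = 0; i < 4; ++i) out[i] = y[i] * ny + out[i]; // mad
    for (int i = 0; i < 4; ++i) out[i] = z[i] * nz + out[i]; // mad
}
```

&lt;p&gt;This is exactly the transposition trick from the SSE world applied to shader code: the data layout changes, the math does not.&lt;/p&gt;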

&lt;blockquote&gt;
  &lt;p&gt;This was a triumph. I’m making a note here: HUGE SUCCESS.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I decided to try the same thing with my shader code. It solved all problems like magic. &lt;a href=&quot;https://gist.github.com/zeux/01d4555fb000fa25bc3c/f0879a95cef520ffff9e47b8139eb43700cf1514&quot;&gt;The resulting code&lt;/a&gt;, while doing exactly the same calculations, now compiles to 54 instructions.&lt;/p&gt;

&lt;p&gt;The reason is simple, of course. For example, where the previous shader computed squared lengths in 1 instruction per light, here we do it in 3 instructions for 4 lights, effectively using 25% less ALU. The new layout also made it possible to pass the lights via interpolators (the light data fits into 6 interpolators), which removed 1 sub instruction per light, and also the 3 instructions for transforming the normal (at the expense of adding 1 expand instruction, of course).&lt;/p&gt;

&lt;p&gt;Apart from the SoA data layout, which is effectively the reason the new shader is so much smaller, there is only one new trick – instead of normalizing each light vector, we correct the dot product results. This saves a couple of instructions over the entire shader.&lt;/p&gt;

&lt;p&gt;The old shader did not fit into the instruction limit; the new one does, with 10 instructions to spare. There are a bunch of things you could do with them. For example, you could implement parallax mapping – 10 instructions should be enough for several parallax steps. Note that one interpolator can be freed (the view vector can be stored in a COLOR interpolator at the cost of 1 additional expand instruction, from [0,1] to [-1,1]), so you could also implement shadow mapping (for example, make the first light source directional – this is straightforward: modify the vertex shader to supply the correct tangent-space direction, and put 0 in light_radius_inv to disable attenuation – and add shadows for it).&lt;/p&gt;

&lt;p&gt;There is also room for some small tweaks – e.g. disable specular for pixels with dot(N, L) &amp;lt; 0, use wrap-around lighting, add ambient lighting, have colored specular (a per-light color instead of the global one), add specular attenuation, etc.&lt;/p&gt;

&lt;p&gt;Note that a couple more instructions can be saved if you do not need light colors, only intensities (see above).&lt;/p&gt;

&lt;p&gt;So, this was a pleasant experience, and I am glad that I decided to write this shader. Sadly, all articles I’ve read about shader optimization (save for one, “Bump My Shiny Metal” by Andrew Aksyonoff in ShaderX 4) are about common and trivial stuff – vectorize your computations, inspect compiler output… I hope that this post was more interesting.&lt;/p&gt;
</description>
			<pubDate>Sun, 28 Oct 2007 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2007/10/28/my-own-lighting-shader-with-blackjack-and-hookers/</link>
			<guid isPermaLink="true">https://zeux.io/2007/10/28/my-own-lighting-shader-with-blackjack-and-hookers/</guid>
		</item>
		
		<item>
			<title>Render state rant</title>
			<description>&lt;p&gt;While designing D3D10, a number of decisions were made to improve runtime efficiency (to reduce batch cost, basically). It’s no secret that D3D9 runtime is not exactly lean &amp;amp; mean – there is a lot of magic going on behind the scenes, a lot of validation, caching, patching…&lt;/p&gt;

&lt;p&gt;For example, D3D9 has the ability to set any render or sampler state at any time. However, hardware does not really work that way. The states are separated into groups, and you can only set a whole group at a time (of course, the exact composition of the groups is hardware-dependent). What this means for the runtime/driver is that SetRenderState/SetSamplerState often do not actually set anything. Instead, they modify a so-called shadow state – they update the changed states in a local shadow copy and mark the respective groups as dirty. Then, when you call DrawIndexedPrimitive, the changed state groups are actually set, and the dirty flags are cleared.&lt;/p&gt;
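&lt;p&gt;A hypothetical sketch of this pattern in C++ (the group sizes and int-valued states are made up for illustration):&lt;/p&gt;

```cpp
#include <cassert>

// Sketch of the shadow-state pattern described above: state setters only
// touch a local copy and mark the owning group dirty; the grouped hardware
// update happens once, at draw time.
enum { GROUP_COUNT = 4, STATES_PER_GROUP = 16 };

struct ShadowState {
    int values[GROUP_COUNT][STATES_PER_GROUP]; // local shadow copy
    bool dirty[GROUP_COUNT];
    int flushes; // how many group uploads draw() actually issued

    void set(int group, int state, int value) {
        if (values[group][state] == value) return; // redundant set: no-op
        values[group][state] = value;
        dirty[group] = true;
    }

    void draw() {
        for (int g = 0; g < GROUP_COUNT; ++g)
            if (dirty[g]) { ++flushes; dirty[g] = false; } // upload whole group
    }
};
```

&lt;p&gt;Setting the same value twice costs nothing, and several changed states in one group collapse into a single group upload at draw time.&lt;/p&gt;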

&lt;p&gt;There could also be some cross-state checking going on; I don’t know for sure.&lt;/p&gt;

&lt;p&gt;So, the D3D10 designers decided to replace hundreds of states with 4 simple state objects. Behold, ID3D10BlendState, ID3D10DepthStencilState, ID3D10RasterizerState and ID3D10SamplerState. These are immutable driver state objects – you create them once, you can’t modify them, and the driver knows about them, which means it can do smart things (for example, store inside each state object a chunk of push buffer commands that sets the respective state) and thus optimize state setting. Also, all validation happens at creation time only.&lt;/p&gt;

&lt;p&gt;Sounds cool, right? Yeah, it did the first time I read about state objects. Except…&lt;/p&gt;

&lt;p&gt;Problem #1. State objects are relatively expensive to construct – 10-100 thousand CPU cycles on my Core 2 Duo. Consider that we have a user-mode driver, and that the most work I can imagine creation doing is: 1. hash the state to check whether the same state has already been created (this is actually done – you can verify it by creating the same state object several times); 2. if the hash lookup failed, validate the state (perform some sanity checks); 3. construct a chunk of push buffer commands; 4. store it in allocated memory. All of that is user-mode code. Something is horribly wrong here, I swear.&lt;/p&gt;

&lt;p&gt;Note that even creating an already existing state object (hash the state, do the lookup, compare the actual state values, remember?) takes 10 thousand cycles. The caching must be performed in the D3D10 runtime, since the pointer returned is exactly the same (for input layouts, by contrast, the caching is performed by the driver – the InputLayout pointers differ but the underlying driver object is the same, while for state objects the runtime pointers are equal). So computing a hash of a 40-byte description object, doing a table lookup, and then comparing two 40-byte objects for equality in case of a hash collision takes 10 thousand cycles.&lt;/p&gt;

&lt;p&gt;Problem #2. You can’t create more than 4096 state objects. This means you can’t just forget about the limit and create a state for each object, even for static ones. Well, you can try, but one day it’ll fail.&lt;/p&gt;

&lt;p&gt;Problem #3. The separation of states into groups is outrageous. I did not mention one small thing – not all states are actually immutable. There are exactly two things you can change without constructing a new state object (they act as parameters of the state-setting functions). Those are… the stencil reference value and the blend factor. Everything else is immutable – for example, depth bias and slope-scaled depth bias.&lt;/p&gt;

&lt;p&gt;How many of you have used blend factor with pixel shader (let’s say ps.2.0)-capable hardware?&lt;/p&gt;

&lt;p&gt;How many of you have used a constantly changing stencil reference value?&lt;/p&gt;

&lt;p&gt;I’d like to do things like progressive texture loading. What do I mean? Let’s stream our texture in from smaller mip levels to larger ones, and interpolate the MinLOD parameter from N to N-1 over some fixed time once the (N-1)-th mip level has completely loaded. This way there is no mip level popping; instead we see gradually improving quality as new levels arrive – trilinear filtering does the interpolation for us. That’s easy, right? No. Not in D3D10.&lt;/p&gt;
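&lt;p&gt;The MinLOD ramp itself is trivial – a hypothetical sketch (the result would feed the sampler’s MinLOD, which in D3D10 means creating a new sampler state object every time the value changes):&lt;/p&gt;

```cpp
#include <cassert>

// Once mip level N-1 has finished streaming, slide MinLOD from N down to N-1
// over fade_time seconds; trilinear filtering blends the two levels for us.
float min_lod(float loaded_level /* N-1 */, float fade_time, float t_since_loaded) {
    float k = t_since_loaded / fade_time;
    if (k < 0.0f) k = 0.0f;
    if (k > 1.0f) k = 1.0f;
    return loaded_level + 1.0f - k; // N at t = 0, N-1 at t = fade_time
}
```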

&lt;p&gt;Yes, I know I could cache render states inside objects and perform some lifetime management (LRU?). But this won’t help with constantly changing parameters.&lt;/p&gt;

&lt;p&gt;Yes, I know I could separate render states into groups however I like, keep a 4096-entry hash table, and do lookups in it. And this is actually what I am doing now.&lt;/p&gt;
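&lt;p&gt;For completeness, here is a minimal C++ sketch of such a cache (the description struct, the hash function and the int standing in for a state object are all illustrative):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hash the state description and reuse an existing object, so the expensive
// creation (and the 4096-object cap) is hit only once per unique description.
struct BlendDesc {
    int src, dst, op; // stand-ins for the real description fields
    bool operator==(const BlendDesc& o) const {
        return src == o.src && dst == o.dst && op == o.op;
    }
};

struct DescHash {
    size_t operator()(const BlendDesc& d) const {
        return (size_t)d.src * 31u * 31u + (size_t)d.dst * 31u + (size_t)d.op;
    }
};

struct StateCache {
    std::unordered_map<BlendDesc, int, DescHash> cache; // int stands in for a state object
    int next_id = 0;

    int get(const BlendDesc& d) {
        auto it = cache.find(d);
        if (it != cache.end()) return it->second; // cache hit: no creation
        cache[d] = next_id;
        return next_id++;
    }
};
```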

&lt;p&gt;But it does not make me happy.&lt;/p&gt;
</description>
			<pubDate>Sat, 06 Oct 2007 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2007/10/06/render-state-rant/</link>
			<guid isPermaLink="true">https://zeux.io/2007/10/06/render-state-rant/</guid>
		</item>
		
		<item>
			<title>Robust unit cube clipping for shadow mapping</title>
			<description>&lt;p&gt;Shadow mapping is my primary area of interest in computer graphics, so expect more posts on this topic. Today I’d like to tell about robust unit cube clipping regarding different projection matrix building techniques.&lt;/p&gt;

&lt;p&gt;The task at hand is relatively simple – given a set of points representing shadow receivers and a set of points representing shadow casters, build a matrix that, when used for shadow rendering, maximizes shadow map utilization (in terms of both shadow map area and depth precision) while eliminating all shadow clipping artifacts. I have only implemented directional light support properly so far, so expect another post sooner or later.&lt;/p&gt;

&lt;p&gt;Note that all points are assumed to be in world space, and for many algorithms it’s preferable to take the vertices of the receivers’ convex hull clipped by the view frustum instead of the actual receiver vertices – it’s not always required, but for the sake of simplicity we will assume that all receiver points are inside the view frustum. Caster points, of course, are arbitrary.&lt;/p&gt;

&lt;h3 id=&quot;uniform-shadow-mapping&quot;&gt;Uniform Shadow Mapping&lt;/h3&gt;

&lt;p&gt;Uniform shadow mapping is shadow mapping with a simple orthographic projection, without any perspective reparametrization. As unit cube clipping is usually done &lt;em&gt;after&lt;/em&gt; constructing some approximate matrix, let’s suppose we already have a view and projection matrix for our light configuration. Note that for a directional light the only things we care about are that the view matrix looks along the light direction and that the projection matrix represents some orthographic projection – everything else is irrelevant. So we can take an arbitrary view position and a more or less arbitrary up vector, construct a view matrix, and assume the projection matrix is the identity.&lt;/p&gt;

&lt;p&gt;Now let’s construct two axis-aligned bounding boxes: one for the receiver points transformed to our light’s post-perspective space (PPS), another for the caster points, again in light PPS. Note that the projection is orthographic, so there is no perspective division involved and no singularities during AABB construction.&lt;/p&gt;

&lt;p&gt;Now we have to build a new matrix that transforms our light viewprojection matrix to minimize shadow map wastage and Z precision loss, while preserving shadows. The actual choice of values depends on some of the states that are set when applying the shadow map to the scene.&lt;/p&gt;

&lt;p&gt;First, let’s deal with the XY extents. Many papers propose using the receivers’ XY extents, because we don’t care about caster points that are outside them – they can’t cast shadows on the receivers. This produces correct results, but we can do slightly better – we can select the intersection of the casters’ and receivers’ XY extents:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;casters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;casters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;casters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;casters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What we get here is that if the casters’ extents are “wider” than the receivers’, we get the same extents as before. However, if you have a big receiver and relatively small casters on it, this will select smaller extents. Note that the extents are always correct – they enclose all points that BOTH some caster and some receiver project to, i.e. all points where you could potentially need shadow. Everywhere else there is no shadow.&lt;/p&gt;

&lt;p&gt;Whether this is beneficial depends on the scene – you’ll usually see no benefit in real-life scenes with a single shadow map for the whole view, but with some kind of frustum-splitting approach this could, depending on your scene, improve quality.&lt;/p&gt;

&lt;p&gt;There is still one problem left – with CLAMP addressing set for the shadow map, this approach will cause visual bugs, because now some receiver points actually fall outside the shadow map – so if a caster fills the border pixels with a Z value that produces shadow, the shadow will stretch outwards. The solution? Either use BORDER addressing, or, when rendering to an NxN shadow map, set a viewport with X = 1, Y = 1, width = N-2, height = N-2 – this ensures border pixels are not touched by casters. This is a small price to pay for a potentially tighter frustum.&lt;/p&gt;
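&lt;p&gt;The inset viewport is a one-liner; a small sketch for clarity:&lt;/p&gt;

```cpp
#include <cassert>

// Render casters into an (N-2)x(N-2) region of the NxN shadow map, so the
// one-texel border always keeps the cleared "no shadow" depth and CLAMP
// addressing becomes safe.
struct Viewport { int x, y, width, height; };

Viewport shadow_viewport(int n) { // n = shadow map size
    Viewport v = { 1, 1, n - 2, n - 2 };
    return v;
}
```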

&lt;p&gt;Now we have tight XY extents, and we’ll have to solve problems with Z.&lt;/p&gt;

&lt;p&gt;Let’s suppose we’ve chosen ZN and ZF as our Z extents. This would mean that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;All casters are clipped by ZN and ZF&lt;/li&gt;
  &lt;li&gt;The shadow map contains depth values in the range [0..1] in light PPS&lt;/li&gt;
  &lt;li&gt;All receiver points with Z &amp;lt; ZN will have Z &amp;lt; 0 in light PPS&lt;/li&gt;
  &lt;li&gt;All receiver points with Z &amp;gt; ZF will have Z &amp;gt; 1 in light PPS&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, we can’t let our casters be clipped by the near plane – that would produce artifacts – so the resulting ZN value has to be less than or equal to the casters’ minimal Z value. If there are no receivers with Z &amp;lt; casters.min_z, there is no point in pushing ZN any further (decreasing it, that is). If there ARE receivers with Z &amp;lt; casters.min_z, then there should simply be no shadows there. Let’s look at our shadow test:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_in_shadow&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shadowmap_depth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pixel_depth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For receiver points with Z &amp;lt; ZN, pixel_depth is below 0 while shadowmap_depth is in [0..1], so the test always fails and is_in_shadow = 0 – which is exactly what we want. So there is no need to make ZN less than casters.min_z; thus ZN = casters.min_z.&lt;/p&gt;

&lt;p&gt;For receiver points with Z &amp;gt; ZF, the situation is the opposite – pixel_depth is above 1, and the test always returns true, making all such points shadowed. This means there should be no receiver points with Z &amp;gt; ZF, so ZF has to be greater than or equal to receivers.max_z.&lt;/p&gt;

&lt;p&gt;Is there any reason to push ZF beyond that? No – we don’t care about caster points with Z &amp;gt; receivers.max_z, as they are not going to cast shadows on any receiver anyway. Thus ZF = receivers.max_z.&lt;/p&gt;

&lt;p&gt;Now that we have our XY and Z extents, we construct a scaling/biasing matrix that maps [min_x..max_x] x [min_y..max_y] x [ZN..ZF] to [-1..1] x [-1..1] x [0..1] (or to [-1..1] x [-1..1] x [-1..1] if you’re using OpenGL), and multiply the light viewprojection matrix by it. By the way, the scaling/biasing matrix is of course equal to the corresponding orthographic projection matrix, so you can use existing functions like D3DXMatrixOrthoOffCenterLH to compute it.&lt;/p&gt;
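&lt;p&gt;As a sketch, the scale/bias step reduces to one scale and one offset per axis (D3D depth range assumed; the struct and function names are illustrative):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Map [min_x..max_x] x [min_y..max_y] x [zn..zf] to [-1..1] x [-1..1] x [0..1]
// (D3D depth range) -- the same scale/offset terms that
// D3DXMatrixOrthoOffCenterLH places into an orthographic projection matrix.
struct Crop { float sx, sy, sz, ox, oy, oz; }; // scale + offset per axis

Crop make_crop(float min_x, float max_x, float min_y, float max_y,
               float zn, float zf) {
    Crop c;
    c.sx = 2.0f / (max_x - min_x); c.ox = -(max_x + min_x) / (max_x - min_x);
    c.sy = 2.0f / (max_y - min_y); c.oy = -(max_y + min_y) / (max_y - min_y);
    c.sz = 1.0f / (zf - zn);       c.oz = -zn / (zf - zn);
    return c;
}
```

&lt;p&gt;Multiplying the light viewprojection matrix by the matrix form of this transform yields the final shadow matrix.&lt;/p&gt;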

&lt;p&gt;Now that we’ve solved the problem for uniform shadow mapping, let’s move on to various perspective reparametrization algorithms.&lt;/p&gt;

&lt;h3 id=&quot;trapezoidal-shadow-mapping&quot;&gt;Trapezoidal Shadow Mapping&lt;/h3&gt;

&lt;p&gt;The brief outline of the TSM algorithm is as follows: construct the light viewprojection matrix, transform the receiver points to light PPS, approximate them by a trapezoid, construct a matrix that maps the trapezoid to the unit cube, and multiply the light viewprojection matrix by this trapezoid mapping matrix.&lt;/p&gt;

&lt;p&gt;Thus applying unit cube clipping to TSM is very simple – you first construct a tight frustum for uniform shadow mapping (see above), and then use it for TSM, with no further corrections. This produces correct extents in the resulting matrix.&lt;/p&gt;

&lt;p&gt;Since we selected the XY extents as the intersection of the casters’ and receivers’ extents, a slightly more correct approach would be to build the trapezoidal approximation not from the receiver points themselves, but from the receivers’ volume clipped by the planes corresponding to the chosen XY extents. However, in my experiments this yielded no significant quality improvement – the texel distribution was good enough without this step.&lt;/p&gt;

&lt;p&gt;Also note that as TSM produces a frustum with a relatively high FOV, the distortion of the post-perspective W coordinate can affect the Z coordinate badly. Some solutions are already presented in the original TSM paper (though expect another post about various methods for fixing Z errors).&lt;/p&gt;

&lt;h3 id=&quot;light-space-perspective-shadow-mapping&quot;&gt;Light-space Perspective Shadow Mapping&lt;/h3&gt;

&lt;p&gt;The brief outline of the LiSPSM algorithm is as follows: construct a light space with certain restrictions, transform all “interesting” points into that light space, build a perspective frustum that encloses them all, and transform it back from light space to world space.&lt;/p&gt;

&lt;p&gt;The problem is that if you treat only the receivers’ points as “interesting”, you get shadow clipping due to the Z extents; if you treat both receivers’ and casters’ points as “interesting”, the shadows are correct, but you get worse texel distribution. Also, you can’t really fix the Z extents AFTER you’ve computed the frustum – the perspective projection has singularities for points on the Z == 0 plane, and occasionally some caster point gets near this plane, producing very large post-perspective Z extents that ruin Z precision.&lt;/p&gt;

&lt;p&gt;In short, I don’t know a good solution. If the light faces the viewer, we can use the same approach as for a normal positional light in PSM (see below). Otherwise, it does not seem we can do anything. Currently I focus my frustum on both casters’ and receivers’ points. If you know a solution, I’d be more than happy to hear it.&lt;/p&gt;

&lt;h3 id=&quot;perspective-shadow-mapping&quot;&gt;Perspective Shadow Mapping&lt;/h3&gt;

&lt;p&gt;Note that I am not going to describe unit cube clipping for the PSM from the original paper by Stamminger and Drettakis, because there is a much better PSM algorithm, described in GPU Gems 1 by Simon Kozlov (Chapter 14, “Perspective Shadow Maps: Care and Feeding”).&lt;/p&gt;

&lt;p&gt;The brief outline of the algorithm is as follows – construct a virtual camera, which is essentially the real camera slid back a bit to improve quality (specifically, the zf/zn ratio), and transform the light into the virtual camera’s PPS. If it becomes a directional light in PPS (which means the light’s direction was orthogonal to the view direction), construct a uniform shadow mapping matrix for the light direction in PPS (transforming all casters and receivers into PPS, of course). The resulting matrix should first transform incoming points into the virtual camera’s PPS, and then transform them by the uniform shadow mapping matrix.&lt;/p&gt;

&lt;p&gt;If the light is a positional one in PPS (which, of course, happens most often), first compute a bounding cone centered at the light’s PPS position around all receivers’ points (again transformed into PPS). Then we have two cases – either the light becomes an inverted one in PPS, which happens when the light shines from behind the viewer, i.e. the Z coordinate of the light direction in view space is positive, or the light is a normal one. For further reference I suggest reading the Stamminger and Drettakis paper (Stamminger M., Drettakis G.: Perspective shadow maps. In Proceedings of SIGGRAPH (2002), pp. 557-562).&lt;/p&gt;

&lt;p&gt;If the light is a normal one, then all casters that can possibly cast shadows on visible receivers’ regions are in front of the virtual camera’s near plane – so we can construct a normal shadow mapping matrix from the bounding cone parameters. If the light is an inverted one, the usual shadow mapping matrix will not contain all casters; therefore Kozlov proposes an “inverse projection matrix”, constructed by specifying -z_near as the near plane value and z_near as the far plane value. Again, for further reference I suggest reading his GPU Gems chapter.&lt;/p&gt;

&lt;p&gt;Now we want to perform unit cube clipping, and there are actually three cases to resolve (directional light in PPS, positional light in PPS, inverted positional light in PPS). Why do we have problems? Well, all receiver points are inside the original view frustum (because we clipped the receivers’ volumes), and thus inside the virtual view frustum – but caster points are arbitrary, so there are singularities for caster points with view-space Z close to 0.&lt;/p&gt;

&lt;p&gt;In the case of a normal positional light in PPS (e.g. a directional light that shines in our face), we don’t care about casters that are behind the near plane; so we can clip our casters by the near plane, and all caster points will be well-defined in post-projective space. This means that after we’ve computed the projection in PPS from the bounding cone, we can find the receivers’ and casters’ extents and use the same algorithm we had for uniform shadow mapping – just multiply the light matrix in PPS by the unit cube clipping matrix (this extends the light frustum in PPS to hold all needed caster points).&lt;/p&gt;

&lt;p&gt;In theory, you’d need precise geometrical clipping of the caster volumes by the near plane. In practice, however, the simple approach of computing extents only for the points in front of the near plane worked well for my test scenes. Your mileage may vary.&lt;/p&gt;
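&lt;p&gt;The simple variant can be sketched like this (hypothetical types and helper names – toPPS stands in for the full virtual-camera plus shadow matrix transform):&lt;/p&gt;

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>

struct Vec3 { float x, y, z; };
struct Extents2D { float minX, maxX, minY, maxY; };

// Accumulates XY extents of caster points in post-projective space, simply
// skipping points behind the camera near plane instead of clipping geometry.
// viewZ is each point's Z in camera view space; toPPS is a caller-supplied
// projection into post-projective space.
Extents2D casterExtents(const Vec3* points, const float* viewZ, size_t count,
                        float znear, Vec3 (*toPPS)(const Vec3&))
{
    Extents2D e = { FLT_MAX, -FLT_MAX, FLT_MAX, -FLT_MAX };

    for (size_t i = 0; i < count; ++i)
    {
        if (viewZ[i] < znear)
            continue; // behind the near plane: ignore instead of clipping

        Vec3 p = toPPS(points[i]);
        if (p.x < e.minX) e.minX = p.x;
        if (p.x > e.maxX) e.maxX = p.x;
        if (p.y < e.minY) e.minY = p.y;
        if (p.y > e.maxY) e.maxY = p.y;
    }

    return e;
}
```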

&lt;p&gt;In the case of an inverted positional light in PPS, we can no longer clip by the near plane, because we’d clip away potential casters. What we’d like to do instead is take the Z extents as in normal unit cube clipping for all caster points, and modify the shadow mapping matrix as usual. Why can’t we do that (and, for that matter, why couldn’t we do that for the normal positional light in PPS – why do we need near-plane clipping at all)?&lt;/p&gt;

&lt;p&gt;The problem is that if there are caster points with view-space Z near 0, the casters’ PPS Z extents become huge. As we use the casters’ minimal Z value as our Z near value, this can lead to huge extents in the unit cube clipping matrix, which ruins Z precision. For the positional light in PPS, we avoid this by clipping the casters by the near plane, so that no caster has very large coordinates in PPS.&lt;/p&gt;

&lt;p&gt;Luckily, with the inverse projection matrix we can solve this easily – just clamp the casters’ minimal Z value to some small negative value, e.g. -10. (Note: here and below, “casters’ minimal Z value” refers to the minimal Z value of the casters transformed first into the virtual camera’s PPS, and then by the shadow mapping matrix.)&lt;/p&gt;

&lt;p&gt;Why does it work? With a normal perspective matrix, decreasing the Z near value’s magnitude (i.e. decreasing the actual Z near value, as it’s positive for a normal matrix) enlarges the viewing frustum. It turns out that for the inverse projection matrix, decreasing the Z near value’s magnitude again enlarges the viewing frustum – and clamping the Z near value to some negative value decreases its magnitude while increasing the actual value.&lt;/p&gt;
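&lt;p&gt;This enlargement claim is easy to check numerically. A small sketch (hypothetical helper names), using the standard D3D depth mapping z’ = zf*(z - zn) / (z*(zf - zn)) with the inverse matrix’s zn = -a, zf = a:&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Post-projective depth produced by a D3D-style projection with near zn, far zf
// is z' = zf * (z - zn) / (z * (zf - zn)). For the inverse projection matrix,
// zn = -a and zf = a, which simplifies to z' = (z + a) / (2 * z).
float inverseProjDepth(float z, float a)
{
    return (z + a) / (2.0f * z);
}

// A point is inside the frustum's depth range when z' lies in [0..1].
bool insideDepthRange(float z, float a)
{
    float d = inverseProjDepth(z, a);
    return d >= 0.0f && d <= 1.0f;
}
```

With a = 10, a caster at z = -5 maps outside [0..1]; shrinking the near magnitude to a = 2 brings it inside – decreasing the magnitude enlarges the frustum, which is why clamping the casters’ minimal Z keeps the depth range (and hence Z precision) sane without losing casters.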

&lt;p&gt;Finally, if the light is a directional one in PPS, we can clip casters by camera’s near plane too, as we did in case of normal positional light.&lt;/p&gt;

&lt;p&gt;Recap:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;If the light is a directional one in PPS, construct a uniform shadow mapping matrix as in the first part of this post, only using casters clipped by the camera near plane (either by geometrical clipping or just by throwing away caster points that are behind the near plane).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the light is a positional one in PPS, construct a bounding cone centered at the light’s position that holds all receiver points. Then, for a normal (non-inverted) light, construct the shadow mapping matrix (view matrix: a simple view matrix with eye position = light’s position, view direction = bounding cone direction, and an arbitrary up vector; projection matrix: a matrix with FOV wide enough to hold the whole bounding cone, and Z extents that encompass all receiver points), and perform unit cube clipping, again only using casters clipped by the camera near plane.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For an inverted light, construct an inverted projection matrix instead of a normal one, and instead of clipping the casters, clamp the casters’ minimal Z value.&lt;/p&gt;
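&lt;p&gt;The bounding cone from step 2 of the recap can be computed naively like this (a sketch with hypothetical names – the axis is taken as the normalized average of the directions to the points, which is not the minimal cone but is simple and conservative):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Vec3 { float x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 normalize(const Vec3& v) { float l = std::sqrt(dot(v, v)); return Vec3{v.x / l, v.y / l, v.z / l}; }

struct Cone { Vec3 apex, axis; float halfAngle; };

// Bounding cone with apex at the light's position: the axis is the normalized
// average of the directions to the receiver points, and the half-angle is the
// widest angle between the axis and any of those directions.
Cone boundingCone(const Vec3& light, const Vec3* points, size_t count)
{
    Vec3 sum = {0, 0, 0};
    for (size_t i = 0; i < count; ++i)
    {
        Vec3 d = normalize(sub(points[i], light));
        sum.x += d.x; sum.y += d.y; sum.z += d.z;
    }

    Cone c = { light, normalize(sum), 0.0f };
    for (size_t i = 0; i < count; ++i)
    {
        float cosA = dot(c.axis, normalize(sub(points[i], light)));
        if (cosA > 1.0f) cosA = 1.0f; // guard acos against rounding
        float a = std::acos(cosA);
        if (a > c.halfAngle) c.halfAngle = a;
    }
    return c;
}
```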

&lt;p&gt;Thanks to Simon Kozlov for suggesting solutions to the shadow clipping problems – essentially all the text in this section consists of his ideas.&lt;/p&gt;

&lt;h3 id=&quot;extended-perspective-shadow-mapping&quot;&gt;eXtended Perspective Shadow Mapping&lt;/h3&gt;

&lt;p&gt;I am not going to describe the algorithm and its clipping problems in detail, as they are well summarized in the original paper. The only note I will make is that at step 11 of the algorithm, all points with w &amp;lt; epsilonW are clipped, which sometimes clips shadows. The solution is to modify the extents building procedure as follows: instead of throwing points with w &amp;lt; epsilonW away completely, just don’t compute Z bounds for them. This is a hack, and it works only because we’re doing a 2D rectangle intersection while performing unit cube clipping.&lt;/p&gt;

&lt;p&gt;The author of the algorithm knows about the problem. We discussed several approaches that lead to fixing it, and this was the best one we could come up with for now.&lt;/p&gt;
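&lt;p&gt;In code, the modified extents pass might look like this (a hypothetical sketch – homogeneous points are accumulated into the XY bounds unconditionally, while the Z bounds skip the near-singular points):&lt;/p&gt;

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>

struct Vec4 { float x, y, z, w; };
struct Bounds
{
    float minX = FLT_MAX, maxX = -FLT_MAX;
    float minY = FLT_MAX, maxY = -FLT_MAX;
    float minZ = FLT_MAX, maxZ = -FLT_MAX;
};

// Accumulates post-projective bounds. Points with w below epsilonW still
// contribute to the XY bounds (the later 2D rectangle intersection clamps any
// huge values), but are excluded from the Z bounds to avoid the singularity.
void accumulateBounds(Bounds& b, const Vec4* points, size_t count, float epsilonW)
{
    for (size_t i = 0; i < count; ++i)
    {
        const Vec4& p = points[i];
        float x = p.x / p.w, y = p.y / p.w;

        if (x < b.minX) b.minX = x;
        if (x > b.maxX) b.maxX = x;
        if (y < b.minY) b.minY = y;
        if (y > b.maxY) b.maxY = y;

        if (p.w >= epsilonW)
        {
            float z = p.z / p.w;
            if (z < b.minZ) b.minZ = z;
            if (z > b.maxZ) b.maxZ = z;
        }
    }
}
```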
</description>
			<pubDate>Tue, 25 Sep 2007 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2007/09/25/robust-unit-cube-clipping-for-shadow-mapping/</link>
			<guid isPermaLink="true">https://zeux.io/2007/09/25/robust-unit-cube-clipping-for-shadow-mapping/</guid>
		</item>
		
		<item>
			<title>Particle rendering revisited</title>
			<description>&lt;p&gt;Recently I was doing particle rendering for different platforms (D3D9, PS3, XBox360), and I wanted to share my experience. The method I came with (which is more or less the same for all 3 platforms) is nothing new or complex - in fact, I know people were and are doing it already - but nevertheless I’ve never seen it described anywhere, so it might help somebody.&lt;/p&gt;

&lt;p&gt;The prerequisites are that we want a pretty flexible simulation (say, all parameters are controlled by splines coming from some particle editor – perhaps with some collision detection, or with particles driven entirely by some complex physics) – which means that (a) we don’t have a simple function position(time) (and likewise for other particle parameters), and (b) we don’t want a fully GPU-based solution with rendering to vertex buffer/streamout. After all, next-gen GPUs are not &lt;em&gt;that&lt;/em&gt; powerful, we don’t have that many particles, and (at least on PS3/360) we often don’t use all available cores efficiently.&lt;/p&gt;

&lt;p&gt;Also let’s suppose for the sake of simplicity that our particles are actually billboards that can only rotate around view Z axis – i.e. they are always camera-facing. This does not really matter so much, but it will make things easier.&lt;/p&gt;

&lt;p&gt;What we’d like to do, ideally, is upload the particles to a buffer and have the GPU render from it. To keep the amount of data low, we’d like to copy exactly one instance of each particle, without duplication. The classical (trivial) approach is to fill a VB with particle data, four vertices per particle, doing all computations on the CPU – the vertex shader only transforms the particle into clip space. This is of course not very wise (after all, we’re trying to save some CPU clocks here), so another classical (slightly less trivial) approach is to fill the VB with particle data, four vertices per particle, where those four vertices differ only in their UV coordinates. The UVs act as corner identifiers – knowing the UV coordinate in the vertex shader, you know which corner of the particle you’re processing ((0, 0) = upper left corner, etc.). Thus you can easily compute the actual corner position in the vertex shader like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;float3&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;corner_position&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;particle_position&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;camera_axis_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;camera_axis_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We also have point sprites, which &lt;em&gt;seem&lt;/em&gt; to achieve exactly what we need – you upload exactly one vertex per particle. However, they have lots of disadvantages – the point size is limited, you can’t rotate them, etc.&lt;/p&gt;

&lt;p&gt;The method I am talking about goes slightly further. Let’s divide our particle data into two parts: the actual particle data (position, color, angle of rotation, etc.) and the UV coordinates. Now we notice that what we really want is two streams of data – one stream contains the particle data without duplication, the other contains ONLY UV coordinates. Moreover, the second buffer consists of the same data repeated many times – four vertices (0, 0), (1, 0), (1, 1), (0, 1) for the first particle, the same four vertices for the second one, etc. – so we’d like to specify them once and have the GPU “loop” over them.&lt;/p&gt;

&lt;p&gt;In effect, we want something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/particle_system_diagram.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, there is a feature that solves exactly this – hardware instancing. Unfortunately, it’s not available everywhere – you (usually) need SM3.0 support for it. We’re going to accept this disadvantage, however.&lt;/p&gt;

&lt;p&gt;Thus we have a static stream with 4 “vertices” representing corner data (each “vertex” consists of a single float2), and a dynamic stream with N “instances” representing particles (each “instance” consists of, in our example, a position, color and angle). We render N quads, so the vertex shader gets executed 4*N times – each invocation gets one piece of instance data and one piece of corner data. We compute the actual particle corner position as shown above, and output it.&lt;/p&gt;
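&lt;p&gt;For concreteness, the two vertex layouts might look like this (hypothetical structures – the exact field set and packing are up to your particle system):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Per-corner vertex: four of these, shared by every particle.
struct CornerVertex { float u, v; };

// Per-instance particle data: exactly one copy per particle.
struct ParticleData
{
    float position[3]; // world-space center
    uint32_t color;    // packed RGBA
    float angle;       // rotation around the view Z axis
    float size;        // billboard edge length
};

// The static corner stream: (0, 0), (1, 0), (1, 1), (0, 1).
static const CornerVertex kCorners[4] = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };
```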

&lt;p&gt;Note that this looks a lot like point sprites. It has the disadvantage that we get 4 vertex shader runs per particle instead of 1 with point sprites – but I have yet to see vertex processing become the limiting factor for particles. It also has a more limited hardware scope. What we get in return is much more flexibility (you are not even limited to screen-facing particles; you can pass an orientation (i.e. a quaternion) instead of a single angle). The amount of data the application has to upload per frame is the same.&lt;/p&gt;

&lt;p&gt;Now let’s go over platform-specific implementation details.&lt;/p&gt;

&lt;h3 id=&quot;direct3d-9&quot;&gt;Direct3D 9&lt;/h3&gt;

&lt;p&gt;Unfortunately, D3D9 does not have “quad” primitive type, so we’ll have to use a dummy index buffer. The setup is as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;For all particle systems, create a single index buffer with 6 indices describing a quad (0, 1, 2, 2, 1, 3), and a single corner stream that will contain corner values. I chose to store UV coordinates in D3DCOLOR, though FLOAT2 is ok too.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Create a proper vertex declaration, that says that corner (UV) data goes in stream 0, and particle data goes in stream 1.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For each particle system, create a dynamic vertex buffer which will hold the particle data (note: it’s usually better to create two buffers and alternate between them each frame – buffer 0 on the first frame, buffer 1 on the second, buffer 0 on the third, etc. – thus lowering synchronization costs and reducing the chance of buffer renaming).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Every frame, lock your buffer and upload the particle data into it as is (i.e. one copy of data per particle).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Draw as follows:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetStreamSource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shared_buffer_with_corner_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CornerVertex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetStreamSourceFreq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D3DSTREAMSOURCE_INDEXEDDATA&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;particle_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetStreamSource&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buffer_with_particle_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ParticleData&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetStreamSourceFreq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D3DSTREAMSOURCE_INSTANCEDATA&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DrawIndexedPrimitive&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D3DPT_TRIANGLELIST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;You have to set corner data as stream 0 due to D3D9 restrictions&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You pass parameters to DrawIndexedPrimitive as if you were rendering a single quad&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In theory this method requires hardware instancing, which is only exposed for SM3.0-compliant cards. In practice, however, all SM2.0-capable ATi cards support hardware instancing – it’s just that Direct3D 9 does not let you use it. ATi engineers made a hack that lets you enable instancing for their cards – just do this once at application startup:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SUCCEEDED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d3d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CheckDeviceFormat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D3DADAPTER_DEFAULT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D3DDEVTYPE_HAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D3DFMT_X8R8G8B8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D3DRTYPE_SURFACE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D3DFORMAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MAKEFOURCC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;I&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;N&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;S&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;T&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))))&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; 
    &lt;span class=&quot;c1&quot;&gt;// Enable instancing &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetRenderState&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D3DRS_POINTSIZE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MAKEFOURCC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;I&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;N&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;S&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;T&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I love ATi D3D9 hacks, they are ingenious.&lt;/p&gt;

&lt;h3 id=&quot;xbox-360&quot;&gt;XBox 360&lt;/h3&gt;

&lt;p&gt;You can perform rendering with the proposed method as is – the only difference is that you’ll have to fetch the vertex data manually in the vertex shader via the vfetch instruction, because there is no explicit instancing support. For further reference, look at the CustomVFetch sample.&lt;/p&gt;

&lt;h3 id=&quot;ps3&quot;&gt;PS3&lt;/h3&gt;

&lt;p&gt;You can perform rendering with the proposed method as is – you’ll have to set frequency divider operation to MODULO with frequency = 4 for corner stream, and to DIVIDE with frequency = 4 for particle data stream.&lt;/p&gt;

&lt;h3 id=&quot;direct3d-10&quot;&gt;Direct3D 10&lt;/h3&gt;

&lt;p&gt;I have not actually implemented this for Direct3D 10, but it should be pretty straightforward – create a proper input layout (with D3D10_INPUT_PER_INSTANCE_DATA set for all elements except the corner data), create an index buffer with 6 indices as for D3D9, and then render via DrawIndexedInstanced(6, particle_count, 0, 0, 0).&lt;/p&gt;

&lt;p&gt;Note that with Direct3D 10 you can also render from your particle data stream with D3D10_PRIMITIVE_TOPOLOGY_POINTLIST, and perform the quad expansion in a geometry shader. In theory this should somewhat speed up the vertex processing part, but in practice I’ve had very bad performance experiences with geometry shaders on NVidia cards. If you have an ATi R600 or (perhaps) a next-generation NVidia card, I’d be happy to hear that things are okay there.&lt;/p&gt;
</description>
			<pubDate>Sat, 22 Sep 2007 00:00:00 +0000</pubDate>
			<link>https://zeux.io/2007/09/22/particle-rendering-revisited/</link>
			<guid isPermaLink="true">https://zeux.io/2007/09/22/particle-rendering-revisited/</guid>
		</item>
		
	</channel>
</rss>
