Is C++ fast?

17 Jan 2019

A library that I work on often these days, meshoptimizer, has changed over time to use fewer and fewer C++ library features, up until the current state where the code closely resembles C even though it uses some C++ features. There have been many reasons behind the changes - dropping C++11 requirement allowed me to make sure anybody can compile the library on any platform, removing std::vector substantially improved performance of unoptimized builds, removing algorithm includes sped up compilation. However, I’ve never quite taken the leap all the way to C with this codebase. Today we’ll explore the gamut of possible C++ implementations for one specific algorithm, mesh simplifier, henceforth known as simplifier.cpp, and see if going all the way to C is worthwhile.

Methodology

Mesh simplifier is an implementation of an edge collapse quadric based simplification algorithm with many tweaks to improve performance and quality of the result. The algorithm is still in development but has had a fair share of effort put into it. The details are really not that important, but it helps to understand the structure and size:

The entire algorithm is implemented in one standalone .cpp file that has almost exactly a thousand lines of code (1004 as of this writing), including comments, blank lines, lines with braces, etc.
The algorithm almost exclusively uses heap-allocated arrays as data structures, using raw pointers for this
The algorithm needs a hash table and a sorting routine, implemented from scratch

We will look at several variations of the implementation, starting with one that uses C++ containers and algorithms that would be helpful for that algorithm, then remove one C++ feature at a time and measure compilation speed and runtime performance as we go on three compilers, gcc 7.3, clang 6 and msvc 2017 on Core i7-8700K running Windows 10 / Ubuntu 16.10. We’ll measure compilation performance by just compiling one .cpp file (with default options in debug and -O2 optimization level in release), and measure runtime performance by simplifying buddha.obj (1M triangle mesh) to 25% of its size. After we reach the current implementation, we will explore the option of changing the code to pure C99.

Note that the way I arrived at these implementations is by taking the code you can see in the repository right now, and changing it to be more idiomatic Modern C++¹. However, these are generally very close to past versions of simplifier.cpp - the difference being that it’s possible to directly compare the variants now.

Baseline: a lot of C++

The version we’re starting with is the original simplifier.cpp from current meshoptimizer master, with the following modifications:

All raw pointers changed to std::vector
Instead of a home-grown hash table we use std::unordered_set
Instead of a home-grown sorting routine we use std::sort

Here’s the performance that we’re getting as a result:

compiler/stl	debug compile	release compile	debug run	release run
gcc	520 ms	646 ms	2273 ms	572 ms
clang	400 ms	684 ms	2356 ms	566 ms
clang libc++	400 ms	725 ms	1535 ms	584 ms
msvc	422 ms	566 ms	36317 ms	579 ms

This is a good starting point. We can see that performance is pretty solid in release - 0.6 seconds to decimate 1M triangle mesh is a good level of performance - generally more or less reasonable in debug with a notable exception of MSVC (the adverse behavior of MSVC STL in debug mode was one of the forcing functions to remove all STL use from meshoptimizer), and compile times generally vary but are uninspiring.

To put the compile times in perspective, Jonathan Blow recently posted a video stream with compiler performance improvements, where his game engine and game written in his new language compile and link in about a second (compilation itself takes about 0.9 seconds). That’s on a codebase that has 100K lines of code - our algorithm only has 1K lines of code (excluding STL, of course - it’s not entirely fair to exclude STL, but it’s not entirely fair to include STL either since we know our algorithm can be implemented in 1K LOC without any STL dependencies). 400 ms is something you notice when compiling your code, even if it’s just one file, and something that makes me less happy when working on the code - given many files like that, cumulative compilation performance can be bad. This is given the fact that our implementation is pretty spartan about the STL dependencies - we only use three algorithms/containers. Let’s see what happens when we stop using one of them.

Not using unordered_set in the first place

The secret about the previous version we benchmarked is that it never existed in that form. While meshoptimizer initially used STL containers and algorithms, it never used std::unordered_set - that’s because based on prior experience I expected the performance to be insufficient for the kinds of algorithms I wanted to write, and had a custom replacement that was using quadratic probing in a large power of two sized array, which is similar to Google’s dense_hash_set design. It’s a kind of hash table I use and implement often in different codebases for different applications, so I’m very familiar with it. The implementation in simplifier.cpp is just 35 lines of code², so it’s easy to drop in and adapt for the use case at hand. Let’s see what happens when we use that instead.

compiler/stl	debug compile	release compile	debug run	release run
gcc	334 ms	461 ms	2054 ms	460 ms
clang	270 ms	517 ms	2152 ms	452 ms
clang libc++	314 ms	609 ms	1179 ms	415 ms
msvc	361 ms	461 ms	28337 ms	380 ms

It looks like the extra 35 lines for a manual implementation of a better hash table were worth it. We’re seeing significant performance improvements across the board, in debug/release and both in terms of compile time and run time. The largest increase in runtime performance is on MSVC, we got 1.5x faster, and this is given the fact that hash table isn’t used as a core part of the algorithm - it’s only used to establish uniqueness relationship between individual vertices before the algorithm starts.

This highlights the poor fit of std::unordered_set to performance-critical workloads, especially ones that are insert-heavy. Unfortunately, this is not an implementation defect and thus is not possible to correct - the issue is that the standard requirements on unordered containers preclude more efficient implementations. Here’s to hoping that eventually we’ll get a better hash table in the standard.

Exact sorting algorithms are overrated

At some point during development of the simplifier repeated profiling of various meshes showed that a lot of time is being spent in std::sort. Now, std::sort isn’t the fastest sorting algorithm, but it’s generally extremely competitive with custom implementations and it’s hard to beat without changing the problem around. In my case, sorting was used on an array of edge collapses, with the sort key being a floating point error value - so the natural instinct is to use a 3-pass radix sort, using 11, 11 and 10 bits of the key in each pass. However, there’s an interesting alternative available to us here - we can do radix sort in a single pass, using an 11 bit key³.

What happens is that we have a 32-bit non-negative floating point value; if we take the top 12 bits and ignore the topmost one (since that’s a sign bit and is always 0), we get 11 bits that represent 8 bits of exponent and 3 bits of mantissa, which essentially gives us a value of similar magnitude but a significant round-off error. If we sort using this value as a key, as a result the sorting sequence isn’t going to be perfectly ordered with respect to the full 32-bit key. However, in our case we need to sort to be able to process better edge collapses first based on a heuristic - and the heuristic is a gross approximation so the extra error our sorting introduces is not noticeable. This technique is surprisingly useful in other domains where you don’t necessarily need an exact order either. A benefit of a single-pass radix sort is that it’s faster (you only need to do one pass over the data instead of 3!) and simpler to implement than a full-blown radix sort, taking just 36 lines of code⁴.

compiler/stl	debug compile	release compile	debug run	release run
gcc	287 ms	403 ms	949 ms	334 ms
clang	230 ms	461 ms	962 ms	327 ms
clang libc++	312 ms	546 ms	940 ms	328 ms
msvc	330 ms	430 ms	26824 ms	285 ms

This time the gains in compilation times are somewhat more modest. We’ve removed <algorithm> header but it doesn’t seem so have had very significant benefits to compilation time - we’re still including <vector> and it’s possible that large STL headers are pulled by both. However, the effects on performance are very significant, especially on debug performance in libstdc++ (most likely std::sort is very slow in debug there) but the gains in release builds are also exciting. What is not obvious from this graph is that sorting got so much faster that it almost completely disappeared from the profiles compared to the other work - the entire algorithm runs “just” 1.35x faster, but the gains measured on just the sorting code are much larger, 117 ms -> 10 ms in release builds.

So long, std::vector

One number that we haven’t moved substantially yet is the time it takes to run this code in debug using MSVC. While it’s natural to expect unoptimized builds to be slower than optimized, they have to be fast enough. Sometimes you want to debug your problem on a non-trivial input dataset. Sometimes you want to run the debug build with full checks through your tests to make sure they don’t trigger any bugs that could disappear in release. Sometimes you are trying to debug a different part of the program, but you still need to run the rest of it. Programmers creatively come up with many workarounds that make the problem less severe - you can make special builds that enable some optimizations but not all, you can use mixed optimization settings for different projects, you can use #pragma optimize to temporarily disable optimizations around offending parts of the code - but all of these seem like duct-tape. Let’s try to replace the only STL component we’re still using, std::vector, with a really simple dynamic array - we don’t need resize or push_back in our code, all arrays are initialized with the right size. Our demands are low enough that our std::vector replacement is just 40 lines of code⁵, and mostly consists of operator[] definitions ;)

compiler/stl	debug compile	release compile	debug run	release run
gcc	158 ms	303 ms	980 ms	318 ms
clang	138 ms	320 ms	1021 ms	297 ms
clang libc++	142 ms	324 ms	1028 ms	299 ms
msvc	156 ms	219 ms	3482 ms	265 ms

This is certainly… interesting. By replacing std::vector with our own type we not only significantly improved debug performance in MSVC, but also halved the compile time for several compilers we were testing. Debug performance in gcc/clang regressed a bit - I believe this is because my replacement uses assert to perform bounds checking on every operator[] access, and in libc++ and libstdc++ these are controlled using separate defines, _GLIBCXX_ASSERTIONS and _LIBCPP_DEBUG respectively. Enabling these defines for the std::vector variant increases the debug runtime performance to ~1350 ms for both libraries⁶, so our replacement is faster when comparable functionality is enabled.

Release performance also slightly increased across the board - this is because for many of our arrays, the default initialization performed by std::vector’s constructor is redundant as we’re going to fill the array anyway. With std::vector, you can either resize a large array and then compute the items (which requires default-initializing every item redundantly), or reserve and push_back repeatedly (which requires a bit more code for adding each item, and this overhead can also add up). With a custom container it’s easy to have an option to skip initialization - in fact, in our replacement that’s the only option since it’s easy to memset the array manually if necessary.

There and back again*

A custom container with bounds checking operator[] was mostly a success, but it didn’t quite make me happy. In some algorithms the extra cost of the container was still pretty substantial. In some algorithms internal functions would use raw pointers to maximize release performance, which meant bounds checking isn’t performed anyway. And algorithm inputs used raw pointers which required careful handling. Because of the use of raw pointers in many critical places, I would run builds with Address Sanitizer as part of CI pipeline and also occasionally locally, so I felt safe about the lack of out-of-bounds accesses. Debuggers wouldn’t be able to display the arrays without custom visualizers, and more crucially would have problems evaluating member access (this is true of std::vector as well depending on the debugger), which made watch expressions more complex and debugging - less pleasant. The status quo provided neither complete safety, nor complete performance, and I decided to try to use raw pointers instead.

Of course, one other benefit of containers is the extra protection against memory leaks - I wasn’t particularly keen on remembering to free each allocated pointer, so I made a meshopt_Allocator class⁷ that could allocate large blocks of typed data and remember the allocated pointer; at the end of the scope all allocated blocks would be deleted. This resulted in the fused allocator+array class being split into two - a special allocator class fulfilled the memory management duties, and as for the array a raw pointer would suffice. Address Sanitizer, along with rigorous testing and hand crafted assertion statements, would keep the code correct.

compiler/stl	debug compile	release compile	debug run	release run
gcc	147 ms	260 ms	720 ms	320 ms
clang	132 ms	294 ms	699 ms	301 ms
clang libc++	131 ms	297 ms	697 ms	300 ms
msvc	141 ms	194 ms	1080 ms	261 ms

While I’m not 100% happy with the tradeoff, it has worked well so far. It’s great to remove the cognitive overhead associated with figuring out whether in each function we should use a raw pointer, an iterator or the container. Worth noting is that the overhead that builds with Address Sanitizer have is very reasonable, and having it on makes me feel safer since it captures a superset of the problems bounds checks in containers do.

compiler/sanitizer	compile	run
gcc	147 ms	721 ms
gcc asan	200 ms	1229 ms
gcc asan ubsan	260 ms	1532 ms
clang	135 ms	695 ms
clang asan	154 ms	1266 ms
clang asan ubsan	180 ms	1992 ms

Let’s C

Once we’ve switched to raw pointers, there’s really not much of C++ left in our code. There is still an occasional template or two, but the number of instantiations is small enough that we could duplicate the code for each type we need it for. meshoptimizer uses C++ casts for pointers and functional casts (int(v)) for numbers, but C has neither, so that has to change. A few other syntactical annoyances emerge, but really it’s not hard at this point to make a C version of the code. It does require more sacrifices, and there’s the issue of MSVC that either has to use C89 or compile our C99 code as C++, unless we’re willing to only support latest MSVC versions, but it’s doable. After we have stopped using every C++ standard header though, does it really matter?

compiler/stl	debug compile	release compile	debug run	release run
gcc	105 ms	209 ms	710 ms	321 ms
clang	95 ms	254 ms	711 ms	310 ms
msvc c++	139 ms	192 ms	1087 ms	262 ms
msvc c99	125 ms	180 ms	1085 ms	261 ms

There is a notable impact on gcc/clang compilation time - we save ~40 ms in both by switching to C. The real difference at this point is in the standard headers - simplifier.cpp uses math.h that happens to be substantially larger in C++ mode compared to C mode, and that difference will increase even more once the default compilation mode is set to C++17:

compiler	c99	c++98	c++11	c++14	c++17
gcc	105 ms	143 ms	147 ms	147 ms	214 ms
clang	95 ms	129 ms	133 ms	134 ms	215 ms
clang libc++	95 ms	130 ms	132 ms	136 ms	140 ms

The issue is that math.h includes cmath in gcc/clang which pulls in a lot of C++ machinery, and in C++17 in libstdc++ adds a slew of new special functions, that are rarely useful but will make compilation slower anyway. Removing the dependency on math.h is easy in this case⁸:

#ifdef __GNUC__
#define fabsf(x) __builtin_fabsf(x)
#define sqrtf(x) __builtin_sqrtf(x)
#else
#include <math.h>
#endif

and brings us all the way to C compile times. This is definitely an area where libstdc++ and libc++ could improve in the future - I don’t think it’s reasonable to force users of C headers to pay for the C++ baggage. With the exception of the math.h issue, it doesn’t look like C is faster to compile than C++ assuming a compile time conscious subset of C++ is used - so at this point a switch to C isn’t warranted for meshoptimizer.

Conclusion

Hopefully the excursion through past, present and possible future changes in simplifier.cpp was useful. When making C/C++ libraries, it’s important to pay attention to more than just correctness - portability, ease of compilation, compilation times, runtimes in both debug and release, debuggability - all of these are important and will help reduce the friction for both users of the library and contributors. C++ is an unforgiving language, but, given enough time and effort, it’s possible to get good performance - assuming you’re willing to question everything, including practices that are sometimes believed to be universal, such as the effectiveness or efficiency of STL or CRT.

We started with half a second of compile times on gcc and 36 seconds of runtime on MSVC in debug mode and ended with 100 ms compile times for gcc and around a second of runtime on MSVC, which is much more pleasant to work with. Of course, at 1K lines compiling for 100 ms, and assuming linear scaling, we would require a full second per 10K lines, which is still substantially slower than some other languages - but not entirely unreasonable for a full build ran on a single core. Getting there for large codebases developed over many years is a much harder problem, one that will be left as an exercise for the reader ;)

All source modifications to simplifier.cpp are available here; in order described in the article, it’s simplifiervsm.cpp, simplifiervs.cpp, simplifierv.cpp, simplifierb.cpp, simplifier.cpp and simplifier.c.

Some time last year during a regular lunch time C++ discussion at work somebody said “there’s a good language subset of C++, C with classes”, to which I replied “there’s an even better subset, C with structs”. That’s what most of meshoptimizer source code looks like, barring a few templates :) ↩
The hash table interface is just two functions, hashBuckets and hashLookup: simplifier.cpp:124 ↩
Using 11 bits is a reasonable choice because that requires a 2048-entry histogram, which takes 8 KB and comfortably fits within a 16 KB L1 cache. Given a 32 KB L1 cache you can extend the histogram to 12 bits but going beyond that is generally less efficient. You can read more about the radix sort in the Radix Sort Revisited article by Pierre Terdiman. ↩
The full implementation of sortEdgeCollapses function is available here: simplifier.cpp:712 ↩
This class is no longer a part of meshoptimizer, but you can look at the older slightly longer version here: meshoptimizer.h:605 ↩
I discovered this after investigating the curious performance difference in debug; I’m loathe to repeat the debug benchmarks for all previous test cases, so I’ll assume that the overhead is the extra ~30% seen on the std::vector - hopefully that doesn’t change the general picture. I’m not sure why these assertions aren’t enabled by default in the first place - that doesn’t seem very user friendly - but this should mirror the default experience of working with these libraries. ↩
This class is used by all meshoptimizer algorithms and is available here: meshoptimizer.h:662 ↩
This feels like duct tape, and one that I’d have to apply to multiple source files independently, so for now I opted for not doing this. However if C++17 mode becomes the default before this issue gets fixed, I’ll have to reconsider since 2x compile time penalty is a bit too much to swallow. ↩

« Voxel terrain: physics

Flavors of SIMD »

Show Comments

Author

Posts