# Is C++ fast?
17 Jan 2019

A library that I work on often these days, meshoptimizer, has changed over time to use fewer and fewer C++ library features, up until the current state where the code closely resembles C even though it uses some C++ features. There have been many reasons behind the changes - dropping the C++11 requirement allowed me to make sure anybody can compile the library on any platform, removing `std::vector` substantially improved the performance of unoptimized builds, removing `<algorithm>` includes sped up compilation. However, I've never quite taken the leap all the way to C with this codebase. Today we'll explore the gamut of possible C++ implementations for one specific algorithm, the mesh simplifier, henceforth known as `simplifier.cpp`, and see if going all the way to C is worthwhile.
## Methodology
The mesh simplifier is an implementation of an edge-collapse, quadric-based simplification algorithm with many tweaks to improve performance and the quality of the result. The algorithm is still in development but has had a fair share of effort put into it. The details are really not that important, but it helps to understand the structure and size:
- The entire algorithm is implemented in one standalone .cpp file that has almost exactly a thousand lines of code (1004 as of this writing), including comments, blank lines, lines with braces, etc.
- The algorithm almost exclusively uses heap-allocated arrays as data structures, using raw pointers for this
- The algorithm needs a hash table and a sorting routine, implemented from scratch
We will look at several variations of the implementation, starting with one that uses C++ containers and algorithms that would be helpful for this algorithm, then removing one C++ feature at a time, measuring compilation speed and runtime performance as we go on three compilers - gcc 7.3, clang 6 and msvc 2017 - on a Core i7-8700K running Windows 10 / Ubuntu 16.10. We'll measure compilation performance by compiling just the one .cpp file (with default options in debug and at the `-O2` optimization level in release), and measure runtime performance by simplifying `buddha.obj` (a 1M triangle mesh) to 25% of its size. After we reach the current implementation, we will explore the option of changing the code to pure C99.
Note that the way I arrived at these implementations is by taking the code you can see in the repository right now, and changing it to be more idiomatic Modern C++[^1]. However, these are generally very close to past versions of `simplifier.cpp` - the difference being that it's possible to directly compare the variants now.
## Baseline: a lot of C++
The version we're starting with is the original `simplifier.cpp` from current meshoptimizer master, with the following modifications:

- All raw pointers changed to `std::vector`
- Instead of a home-grown hash table we use `std::unordered_set`
- Instead of a home-grown sorting routine we use `std::sort`
Here’s the performance that we’re getting as a result:
| compiler/stl | debug compile | release compile | debug run | release run |
|---|---|---|---|---|
| gcc | 520 ms | 646 ms | 2273 ms | 572 ms |
| clang | 400 ms | 684 ms | 2356 ms | 566 ms |
| clang libc++ | 400 ms | 725 ms | 1535 ms | 584 ms |
| msvc | 422 ms | 566 ms | 36317 ms | 579 ms |
This is a good starting point. We can see that performance is pretty solid in release - 0.6 seconds to decimate a 1M triangle mesh is a good level of performance - and generally more or less reasonable in debug, with the notable exception of MSVC (the adverse behavior of MSVC's STL in debug mode was one of the forcing functions behind removing all STL use from meshoptimizer). Compile times vary but are uninspiring.
To put the compile times in perspective, Jonathan Blow recently posted a video stream with compiler performance improvements, where his game engine and game written in his new language compile and link in about a second (compilation itself takes about 0.9 seconds). That's on a codebase with 100K lines of code - our algorithm only has 1K lines of code (excluding STL, of course - it's not entirely fair to exclude STL, but it's not entirely fair to include it either, since we know our algorithm can be implemented in 1K LOC without any STL dependencies). 400 ms is something you notice when compiling your code, even if it's just one file, and something that makes me less happy when working on the code - given many files like that, cumulative compilation performance can be bad. And this is despite the fact that our implementation is pretty spartan about its STL dependencies - we only use three algorithms/containers. Let's see what happens when we stop using one of them.
## Not using unordered_set in the first place
The secret about the previous version we benchmarked is that it never existed in that form. While meshoptimizer initially used STL containers and algorithms, it never used `std::unordered_set` - based on prior experience I expected its performance to be insufficient for the kinds of algorithms I wanted to write, and had a custom replacement that uses quadratic probing in a large power of two sized array, which is similar to Google's `dense_hash_set` design. It's a kind of hash table I use and implement often in different codebases for different applications, so I'm very familiar with it. The implementation in `simplifier.cpp` is just 35 lines of code[^2], so it's easy to drop in and adapt for the use case at hand. Let's see what happens when we use that instead.
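To make the design concrete, here is a minimal sketch of this kind of lookup routine - an illustration, not the actual 35-line implementation from `simplifier.cpp` (the `hashLookup` name comes from its interface, but the `empty`-marker convention and the probing details here are assumptions):

```c++
#include <assert.h>
#include <stddef.h>

// Open addressing with quadratic probing over a power of two sized bucket
// array. `empty` is a reserved key value that marks free slots. Returns a
// pointer to either the matching entry or a free slot the caller can fill.
template <typename T, typename Hash>
T* hashLookup(T* table, size_t buckets, const Hash& hash, const T& key, const T& empty)
{
    assert(buckets > 0);
    assert((buckets & (buckets - 1)) == 0); // table size must be a power of two

    size_t hashmod = buckets - 1;
    size_t bucket = hash(key) & hashmod;

    for (size_t probe = 0; probe <= hashmod; ++probe)
    {
        T& item = table[bucket];

        if (item == empty)
            return &item; // free slot: the caller can insert the key here

        if (item == key)
            return &item; // existing entry

        // quadratic probing: offsets form triangular numbers, which visit
        // every slot when the table size is a power of two
        bucket = (bucket + probe + 1) & hashmod;
    }

    assert(false && "Hash table is full");
    return 0;
}
```

The caller sizes the table to a power of two comfortably above the element count, fills it with `empty`, and inserts by writing the key through the returned pointer.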
| compiler/stl | debug compile | release compile | debug run | release run |
|---|---|---|---|---|
| gcc | 334 ms | 461 ms | 2054 ms | 460 ms |
| clang | 270 ms | 517 ms | 2152 ms | 452 ms |
| clang libc++ | 314 ms | 609 ms | 1179 ms | 415 ms |
| msvc | 361 ms | 461 ms | 28337 ms | 380 ms |
It looks like the extra 35 lines for a manual implementation of a better hash table were worth it. We're seeing significant improvements across the board, in both debug and release, and in both compile time and run time. The largest increase in runtime performance is on MSVC, where we got 1.5x faster - and this despite the fact that the hash table isn't a core part of the algorithm: it's only used to establish uniqueness relationships between individual vertices before the algorithm starts.
This highlights the poor fit of `std::unordered_set` for performance-critical workloads, especially insert-heavy ones. Unfortunately, this is not an implementation defect that could be corrected - the standard's requirements on unordered containers preclude more efficient implementations. Here's to hoping that eventually we'll get a better hash table in the standard.
## Exact sorting algorithms are overrated
At some point during development of the simplifier, repeated profiling of various meshes showed that a lot of time was being spent in `std::sort`. Now, `std::sort` isn't the fastest sorting algorithm, but it's generally extremely competitive with custom implementations and hard to beat without changing the problem around. In my case, sorting was used on an array of edge collapses, with the sort key being a floating point error value - so the natural instinct is to use a 3-pass radix sort, using 11, 11 and 10 bits of the key in each pass. However, there's an interesting alternative available to us here - we can do a radix sort in a single pass, using an 11-bit key[^3].
What happens is that we have a 32-bit non-negative floating point value; if we take the top 12 bits and ignore the topmost one (since that's the sign bit and is always 0), we get 11 bits that represent 8 bits of exponent and 3 bits of mantissa - essentially a value of similar magnitude, but with significant round-off error. If we sort using this value as a key, the resulting sequence isn't going to be perfectly ordered with respect to the full 32-bit key. However, in our case we sort so that we can process better edge collapses first based on a heuristic - and since the heuristic is a gross approximation, the extra error our sorting introduces is not noticeable. This technique is surprisingly useful in other domains where you don't necessarily need an exact order either. A benefit of a single-pass radix sort is that it's faster (you only need to do one pass over the data instead of 3!) and simpler to implement than a full-blown radix sort, taking just 36 lines of code[^4].
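As an illustration, here is a minimal sketch of the approach - `sortKey` and `sortApprox` are hypothetical names, not the actual `sortEdgeCollapses` implementation, and it assumes non-negative IEEE-754 single-precision keys as discussed above:

```c++
#include <stddef.h>
#include <string.h>

// Extract the approximate 11-bit key: take the top 12 bits of the float's
// bit pattern and drop the sign bit, leaving 8 exponent bits and 3
// mantissa bits.
static unsigned int sortKey(float value)
{
    unsigned int bits;
    memcpy(&bits, &value, sizeof(bits)); // type-pun without aliasing issues

    return (bits >> 20) & 0x7ff;
}

// One counting-sort pass: histogram the 2048 possible keys, prefix-sum the
// counts into offsets, then scatter. `order` receives the indices of `keys`
// in approximately ascending key order.
static void sortApprox(unsigned int* order, const float* keys, size_t count)
{
    unsigned int histogram[2048] = {};

    for (size_t i = 0; i < count; ++i)
        histogram[sortKey(keys[i])]++;

    unsigned int offset = 0;

    for (size_t i = 0; i < 2048; ++i)
    {
        unsigned int current = histogram[i];
        histogram[i] = offset;
        offset += current;
    }

    for (size_t i = 0; i < count; ++i)
        order[histogram[sortKey(keys[i])]++] = (unsigned int)i;
}
```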
| compiler/stl | debug compile | release compile | debug run | release run |
|---|---|---|---|---|
| gcc | 287 ms | 403 ms | 949 ms | 334 ms |
| clang | 230 ms | 461 ms | 962 ms | 327 ms |
| clang libc++ | 312 ms | 546 ms | 940 ms | 328 ms |
| msvc | 330 ms | 430 ms | 26824 ms | 285 ms |
This time the gains in compilation times are somewhat more modest. We've removed the `<algorithm>` header, but it doesn't seem to have had a very significant benefit to compilation time - we're still including `<vector>`, and it's possible that large STL headers are pulled in by both. However, the effects on performance are very significant, especially on debug performance with libstdc++ (most likely `std::sort` is very slow in debug there), but the gains in release builds are also exciting. What is not obvious from this table is that sorting got so much faster that it almost completely disappeared from the profiles compared to the other work - the entire algorithm runs "just" 1.35x faster, but the gains measured on the sorting code alone are much larger, 117 ms -> 10 ms in release builds.
## So long, std::vector
One number that we haven't moved substantially yet is the time it takes to run this code in debug using MSVC. While it's natural to expect unoptimized builds to be slower than optimized ones, they have to be fast enough. Sometimes you want to debug your problem on a non-trivial input dataset. Sometimes you want to run the debug build with full checks through your tests to make sure they don't trigger any bugs that could disappear in release. Sometimes you are trying to debug a different part of the program, but you still need to run the rest of it. Programmers creatively come up with many workarounds that make the problem less severe - you can make special builds that enable some optimizations but not all, you can use mixed optimization settings for different projects, you can use `#pragma optimize` to temporarily disable optimizations around offending parts of the code - but all of these feel like duct tape. Let's try to replace the only STL component we're still using, `std::vector`, with a really simple dynamic array - we don't need `resize` or `push_back` in our code, as all arrays are initialized with the right size. Our demands are low enough that our `std::vector` replacement is just 40 lines of code[^5], and mostly consists of `operator[]` definitions ;)
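Here is a minimal sketch of what such a replacement can look like - `Buffer` is a hypothetical name and the real meshoptimizer class differed in details; notably, this one assumes trivially constructible and destructible element types, which holds for the plain arrays the simplifier uses:

```c++
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// A fixed-size heap array with a bounds-checked operator[]. The memory is
// deliberately left uninitialized since the algorithm fills it anyway.
template <typename T>
struct Buffer
{
    Buffer(size_t size): data(static_cast<T*>(malloc(size * sizeof(T)))), size(size)
    {
        assert(data || size == 0);
    }

    ~Buffer()
    {
        free(data);
    }

    T& operator[](size_t i)
    {
        assert(i < size);
        return data[i];
    }

    const T& operator[](size_t i) const
    {
        assert(i < size);
        return data[i];
    }

    T* data;
    size_t size;

private:
    // non-copyable: declared but not defined (pre-C++11 idiom)
    Buffer(const Buffer&);
    Buffer& operator=(const Buffer&);
};
```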
| compiler/stl | debug compile | release compile | debug run | release run |
|---|---|---|---|---|
| gcc | 158 ms | 303 ms | 980 ms | 318 ms |
| clang | 138 ms | 320 ms | 1021 ms | 297 ms |
| clang libc++ | 142 ms | 324 ms | 1028 ms | 299 ms |
| msvc | 156 ms | 219 ms | 3482 ms | 265 ms |
This is certainly… interesting. By replacing `std::vector` with our own type we not only significantly improved debug performance in MSVC, but also halved the compile time on several compilers we were testing. Debug performance in gcc/clang regressed a bit - I believe this is because my replacement uses `assert` to perform bounds checking on every `operator[]` access, whereas in libstdc++ and libc++ such checks are controlled by separate defines, `_GLIBCXX_ASSERTIONS` and `_LIBCPP_DEBUG` respectively. Enabling these defines for the `std::vector` variant increases the debug runtime to ~1350 ms for both libraries[^6], so our replacement is faster when comparable functionality is enabled.
Release performance also slightly improved across the board - this is because for many of our arrays, the default initialization performed by `std::vector`'s constructor is redundant, as we're going to fill the array anyway. With `std::vector`, you can either `resize` a large array and then compute the items (which requires default-initializing every item redundantly), or `reserve` and `push_back` repeatedly (which requires a bit more code for adding each item, and this overhead can also add up), as the sketch below shows. With a custom container it's easy to have an option to skip initialization - in fact, in our replacement that's the only option, since it's easy to `memset` the array manually if necessary.
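For illustration, here are the two `std::vector` patterns described above side by side (hypothetical fill functions, just to show the tradeoff):

```c++
#include <stddef.h>
#include <vector>

void fillWithResize(std::vector<int>& v, size_t n)
{
    v.resize(n); // zero-initializes every element, redundantly

    for (size_t i = 0; i < n; ++i)
        v[i] = (int)(i * 2);
}

void fillWithPushBack(std::vector<int>& v, size_t n)
{
    v.reserve(n); // no initialization, but each append carries bookkeeping

    for (size_t i = 0; i < n; ++i)
        v.push_back((int)(i * 2));
}
```

With the custom container, the equivalent code allocates uninitialized storage and writes each element exactly once.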
## There and back again
A custom container with a bounds-checked `operator[]` was mostly a success, but it didn't quite make me happy. In some algorithms the extra cost of the container was still pretty substantial. In some algorithms internal functions would use raw pointers to maximize release performance, which meant bounds checking wasn't performed anyway. And algorithm inputs used raw pointers, which required careful handling. Because of the use of raw pointers in many critical places, I would run builds with Address Sanitizer as part of the CI pipeline and also occasionally locally, so I felt safe about the lack of out-of-bounds accesses. Debuggers wouldn't be able to display the arrays without custom visualizers, and more crucially would have problems evaluating member access (this is true of `std::vector` as well, depending on the debugger), which made watch expressions more complex and debugging less pleasant. The status quo provided neither complete safety nor complete performance, and I decided to try to use raw pointers instead.
Of course, one other benefit of containers is the extra protection against memory leaks - I wasn't particularly keen on remembering to free each allocated pointer, so I made a `meshopt_Allocator` class[^7] that could allocate large blocks of typed data and remember the allocated pointer; at the end of the scope all allocated blocks would be deleted. This resulted in the fused allocator+array class being split in two - a special allocator class fulfilled the memory management duties, and for the array a raw pointer would suffice. Address Sanitizer, along with rigorous testing and hand-crafted assertion statements, would keep the code correct.
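A minimal sketch of the idea - not the exact `meshopt_Allocator` (see the footnote for the real class), and the fixed block capacity here is an arbitrary assumption:

```c++
#include <assert.h>
#include <new>
#include <stddef.h>

// Hands out typed blocks and frees everything it allocated when it goes
// out of scope, so algorithm code can use plain pointers without leaks.
class ScopedAllocator
{
public:
    ScopedAllocator(): count(0)
    {
    }

    ~ScopedAllocator()
    {
        // free all blocks in reverse allocation order
        for (size_t i = count; i > 0; --i)
            operator delete(blocks[i - 1]);
    }

    template <typename T>
    T* allocate(size_t size)
    {
        assert(count < sizeof(blocks) / sizeof(blocks[0]));
        T* result = static_cast<T*>(operator new(size * sizeof(T)));
        blocks[count++] = result;
        return result;
    }

private:
    void* blocks[24];
    size_t count;

    ScopedAllocator(const ScopedAllocator&);
    ScopedAllocator& operator=(const ScopedAllocator&);
};
```

Inside an algorithm this reads as `float* vertex_positions = allocator.allocate<float>(vertex_count * 3);`, and every block is freed when the allocator leaves scope.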
| compiler/stl | debug compile | release compile | debug run | release run |
|---|---|---|---|---|
| gcc | 147 ms | 260 ms | 720 ms | 320 ms |
| clang | 132 ms | 294 ms | 699 ms | 301 ms |
| clang libc++ | 131 ms | 297 ms | 697 ms | 300 ms |
| msvc | 141 ms | 194 ms | 1080 ms | 261 ms |
While I'm not 100% happy with the tradeoff, it has worked well so far. It's great to remove the cognitive overhead of figuring out whether each function should use a raw pointer, an iterator or the container. It's also worth noting that the overhead of builds with Address Sanitizer is very reasonable, and having it on makes me feel safer, since it catches a superset of the problems that bounds checks in containers do.
| compiler/sanitizer | compile | run |
|---|---|---|
| gcc | 147 ms | 721 ms |
| gcc asan | 200 ms | 1229 ms |
| gcc asan ubsan | 260 ms | 1532 ms |
| clang | 135 ms | 695 ms |
| clang asan | 154 ms | 1266 ms |
| clang asan ubsan | 180 ms | 1992 ms |
## Let's C
Once we've switched to raw pointers, there's really not much C++ left in our code. There is still an occasional template or two, but the number of instantiations is small enough that we could duplicate the code for each type we need it for. meshoptimizer uses C++ casts for pointers and functional casts (`int(v)`) for numbers, but C has neither, so that has to change. A few other syntactic annoyances emerge, but at this point it's really not hard to make a C version of the code. It does require more sacrifices, and there's the issue that MSVC either has to use C89 or compile our C99 code as C++ (unless we're willing to only support the latest MSVC versions), but it's doable. After we have stopped using every C++ standard header, though, does it really matter?
| compiler/mode | debug compile | release compile | debug run | release run |
|---|---|---|---|---|
| gcc | 105 ms | 209 ms | 710 ms | 321 ms |
| clang | 95 ms | 254 ms | 711 ms | 310 ms |
| msvc c++ | 139 ms | 192 ms | 1087 ms | 262 ms |
| msvc c99 | 125 ms | 180 ms | 1085 ms | 261 ms |
There is a notable impact on gcc/clang compilation time - we save ~40 ms in both by switching to C. The real difference at this point is in the standard headers - `simplifier.cpp` uses `math.h`, which happens to be substantially larger in C++ mode compared to C mode, and that difference will increase even more once the default compilation mode is set to C++17:
| compiler | c99 | c++98 | c++11 | c++14 | c++17 |
|---|---|---|---|---|---|
| gcc | 105 ms | 143 ms | 147 ms | 147 ms | 214 ms |
| clang | 95 ms | 129 ms | 133 ms | 134 ms | 215 ms |
| clang libc++ | 95 ms | 130 ms | 132 ms | 136 ms | 140 ms |
The issue is that in gcc/clang `math.h` includes `cmath`, which pulls in a lot of C++ machinery, and in C++17 libstdc++ adds a slew of new special functions that are rarely useful but make compilation slower anyway. Removing the dependency on `math.h` is easy in this case[^8]:
```c++
// On gcc/clang, use compiler builtins instead of including math.h
#ifdef __GNUC__
#define fabsf(x) __builtin_fabsf(x)
#define sqrtf(x) __builtin_sqrtf(x)
#else
#include <math.h>
#endif
```
and brings us all the way down to C compile times. This is definitely an area where libstdc++ and libc++ could improve in the future - I don't think it's reasonable to force users of C headers to pay for the C++ baggage. With the exception of the `math.h` issue, it doesn't look like C is faster to compile than C++, assuming a compile-time-conscious subset of C++ is used - so at this point a switch to C isn't warranted for meshoptimizer.
## Conclusion
Hopefully the excursion through past, present and possible future changes in `simplifier.cpp` was useful. When making C/C++ libraries, it's important to pay attention to more than just correctness - portability, ease of compilation, compilation times, runtimes in both debug and release, debuggability - all of these are important and will help reduce the friction for both users of the library and contributors. C++ is an unforgiving language, but, given enough time and effort, it's possible to get good performance - assuming you're willing to question everything, including practices that are sometimes believed to be universal, such as the effectiveness or efficiency of STL or CRT.
We started with half a second of compile time on gcc and 36 seconds of runtime on MSVC in debug mode, and ended with 100 ms compile times on gcc and around a second of runtime on MSVC, which is much more pleasant to work with. Of course, at 1K lines compiling in 100 ms, and assuming linear scaling, we would need a full second per 10K lines - still substantially slower than some other languages, but not entirely unreasonable for a full build run on a single core. Getting there for large codebases developed over many years is a much harder problem, one that will be left as an exercise for the reader ;)
All source modifications to `simplifier.cpp` are available here; in the order described in the article, they are `simplifiervsm.cpp`, `simplifiervs.cpp`, `simplifierv.cpp`, `simplifierb.cpp`, `simplifier.cpp` and `simplifier.c`.
[^1]: Some time last year during a regular lunchtime C++ discussion at work somebody said “there's a good language subset of C++, C with classes”, to which I replied “there's an even better subset, C with structs”. That's what most of meshoptimizer source code looks like, barring a few templates :)

[^2]: The hash table interface is just two functions, `hashBuckets` and `hashLookup`: simplifier.cpp:124

[^3]: Using 11 bits is a reasonable choice because it requires a 2048-entry histogram, which takes 8 KB and comfortably fits within a 16 KB L1 cache. Given a 32 KB L1 cache you can extend the histogram to 12 bits, but going beyond that is generally less efficient. You can read more about radix sort in the Radix Sort Revisited article by Pierre Terdiman.

[^4]: The full implementation of the `sortEdgeCollapses` function is available here: simplifier.cpp:712

[^5]: This class is no longer a part of meshoptimizer, but you can look at the older, slightly longer version here: meshoptimizer.h:605

[^6]: I discovered this after investigating the curious performance difference in debug; I'm loath to repeat the debug benchmarks for all previous test cases, so I'll assume that the overhead is the extra ~30% seen on the `std::vector` variant - hopefully that doesn't change the general picture. I'm not sure why these assertions aren't enabled by default in the first place - that doesn't seem very user friendly - but this should mirror the default experience of working with these libraries.

[^7]: This class is used by all meshoptimizer algorithms and is available here: meshoptimizer.h:662

[^8]: This feels like duct tape, and one that I'd have to apply to multiple source files independently, so for now I opted not to do this. However, if C++17 mode becomes the default before this issue gets fixed, I'll have to reconsider, since a 2x compile time penalty is a bit too much to swallow.