VPEXPANDB on NEON with Z3
02 Sep 2022When working on vertex compressor for meshoptimizer in 2018, one of the goals was to make a bitstream that can be decompressed using (short) SIMD vector operations. This led to a host of design decisions in terms of how the data is structured, and some challenges when mapping the decoder flow onto various SIMD architectures. The most significant issue has to do with implementing an operation that often doesn’t have a straightforward implementation: byte expansion.
Without going into the details of the bitstream format, we get a byte mask (with 16 bit elements, one per each vector byte) from an earlier stage of decoding process, and for every 1 in the mask we need to read a byte from the input stream and place that into the corresponding byte in the SIMD register. For example, a mask with three bits, 0000_0100_0011_0000
, leads to reading three bytes from the current position in the stream, creating a vector xxxxxxxx_xxAAxxxx_xxxxBBCC_xxxx
, and advancing the stream position by 3.
Baseline implementation: SSSE3/AVX512
The bitstream was designed to be quickly decodable on mainstream Intel/AMD processors, so there must be an efficient lowering for some reasonable version of SSE instruction set. By structuring the bitstream to make sure we can always read 16 bytes from valid positions in the stream, we can load 16 bytes after the current position in the stream, but we then need to move the bytes corresponding to the ones set in the mask in their right locations. The general way to do this is to use PSHUFB
(SSSE3) instruction which can perform an arbitrary byte shuffle on the source 16byte register, yielding any 16byte permutation as a result.
Computing the shuffle masks from bit masks is difficult to do efficiently, so instead we’ll precompute the masks and store them in a table. There’s a grand total of 2^{16} masks which would require a 1MB lookup table, which is not very space or performance efficient, but we can instead precompute the table for an 8bit mask, which requires just 4KB of space. From there on we need to be able to synthesize a full shuffle mask, and the number of bits set in the first half determines the offset that the second mask should pull data from: for example, if the first 8 bits have 5 bits set, and the second 8 bits only have one, then the corresponding byte in the second half of the vector needs to be read from an offset +5 in the input stream.
Importantly, the mask originally comes as a result of vectorized comparisons, with one byte for each comparison result (containing 0b11111111
or 0b00000000
); however we can easily convert it to a 16bit mask in a general purpose register using PMOVMSKB
(SSE) instruction.
To compute the offset we thus need to have an 8bit population count  something that’s available as an instruction on modern x64 CPUs, but something that’s also easy to precompute a 256byte table for. Conveniently, to figure out how much to advance the stream after reading the data, we also need the population count  we need to shift the position by the total number of bits set in the 16bit masks, which can be computed as a sum of two 8bit population counts.
Overall, this requires a bit of setup, but is reasonably efficient, with the following pseudocode:
static uint8_t kTableShuffle[256][8]; // 8bit shuffle masks
static uint8_t kTableCount[256]; // 8bit population counts
// mask is __m128i
int mask16 = _mm_movemask_epi8(mask);
uint8_t mask0 = uint8_t(mask16 & 255);
uint8_t mask1 = uint8_t(mask16 >> 8);
__m128i sm0 = _mm_loadl_epi64(&kTableShuffle[mask0]);
__m128i sm1 = _mm_loadl_epi64(&kTableShuffle[mask1]);
__m128i sm1off = _mm_set1_epi8(kTableCount[mask0]);
__m128i sm1r = _mm_add_epi8(sm1, sm1off);
__m128i smfull = _mm_unpacklo_epi64(sm0, sm1r);
__m128i data16 = _mm_loadu_si128(data);
__m128i result = _mm_shuffle_epi8(data16, smfull);
data += kTableCount[mask0] + kTableCount[mask1];
This approach works well on Intel/AMD hardware, however it does take a bunch of setup  to compute the shuffle mask from the two halves of the 16bit mask, we need to load two halves of it from memory and reconstruct the full mask with ~4 additional instructions. I didn’t know about this initially, but byte expansion is very useful in contexts like this and so AVX512 includes a dedicated instruction that matches our desired semantics perfectly, VPEXPANDB
, which allows us to eliminate all the setup and replace all of the above with:
// mask is __mmask16
__m128i data16 = _mm_loadu_si128(data);
__m128i result = _mm_mask_expand_epi8(_mm_setzero_si128(), mask, data16);
data += _mm_popcnt_u32(mask);
Note: AVX512 has other fantastic instructions that help in other areas of the decoder, such as
VPMULTISHIFTQB
; overall, using AVX512 allows us to make the decoder >10% faster on an Intel CPU  all this while still using 128 bit SIMD vectors, and as such without the risk for associated frequency adjustments. Unfortunately, AVX512 is not widely supported but I’m looking forward to benchmarking this on Zen4 in the future.
NEON emulation
Unfortunately, NEON doesn’t have anything as powerful as byte expansion instructions, so we need to use something similar to the baseline SSE implementation. Most instructions translate without issues, and shuffle can be implemented using table lookup instruction TBL
, but PMOVMSKB
presents a challenge.
This problem comes up frequently when porting SSE code to NEON, as PMOVMSKB
is incredibly useful in many algorithms; until recently, a reasonable generic alternative involved using horizontal add instruction (only available on AArch64). In our case the problem is made a little bit simpler  instead of a 16bit mask we actually need two 8bit masks, as we’re going to use them to compute load offsets anyway, and we know that the input is a 16byte register with each byte containing all 1s or 0s, which simplifies the emulation and makes it pretty reasonable:
static const uint8_t kByteMask[16] = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
uint8x16_t byte_mask = vld1q_u8(kByteMask);
uint8x16_t masked = vandq_u8(mask, byte_mask);
uint8_t mask0 = vaddv_u8(vget_low_u8(masked));
uint8_t mask1 = vaddv_u8(vget_high_u8(masked));
However, it still requires two horizontal adds and horizontal adds aren’t particularly cheap: 4 cycles of latency per CortexX2 optimization guide. A recent blog post from Google Cloud team presents a more efficient replacement for many uses of PMOVMSKB
by using a narrowing shift, which allows to create a 64bit mask from a 128bit vector, and then change the rest of the algorithm to adjust the operations to expect the bits in the right places. This results in significant speedups on a number of algorithms  unfortunately, it’s less useful for our case, because we ultimately need two 8bit masks instead of a single 64bit one.
A further complication here is that horizontal adds aren’t universally supported. On ARMv7, these are not available and a longer instruction sequence using paired adds must be used instead; additionally, Microsoft does not implement horizontal adds in their NEON headers, so when targeting Windows on ARM we again can’t use the most efficient sequence.
Sometimes all it takes is a MUL
For a while, the above is the version that was used in native NEON builds. The same issue had to be solved in the WebAssembly version of the decoder  while WebAssembly SIMD proposal now implements i8x16.bitmask instruction, the earlier versions of the proposal didn’t have it (the shuffle instruction was also initially absent, which would have put the SIMD decoder at risk, but v8x16.swizzle instruction was added relatively early^{1}).
Because of this, the WebAssembly version used a rather slow scalar fallback that used many 64bit ors and shifts to move the bits to the right places. Before I got around to replacing it with i8x16.bitmask
, someone named 0b0000000000000
shared a much more efficient, if cryptic, solution with me:
usize MoveMask(U8x16 m) {
alignas(16) uint64 xy[2];
wasm_v128_store(xy, m);
constexpr uint64 magic = 0x103070F1F3F80ULL;
uint64 lo = ((xy[0] * magic) >> 56);
uint64 hi = ((xy[1] * magic) >> 48) & 0xFF00;
return (hi + lo);
}
Since we need 8bit masks anyway, this results in the following NEON code:
const uint64_t magic = 0x000103070f1f3f80ull;
uint64x2_t mask2 = vreinterpretq_u64_u8(mask);
mask0 = uint8_t((vgetq_lane_u64(mask2, 0) * magic) >> 56);
mask1 = uint8_t((vgetq_lane_u64(mask2, 1) * magic) >> 56);
It does assume that 64bit multiplication is efficient, however on ARMv7 this ends up being about the same speed as the less efficient NEON implementation with paired adds, and it’s consistently faster compared to horizontal adds on AArch64 (on the entire algorithm, of which all of the above is only a part of, it results in ~2% faster decoding on Apple M2 and ~3% faster decoding on AWS Graviton2; it also compiles into the same efficient code on MSVC where the lack of horizontal adds is not an issue anymore).
Ok, but… why does this code even work? Where do we get the magic constant from? And how did 0b0000000000000
arrive at this idea?
Enter Z3
The fact that the multiplication gets us what we want is possible to see if we look at what happens when we multiply the magic constant by 255  since all input bytes are either 00000000 or 11111111, the multiplication result is the sum of (255*magic) << (k*8)
for each k that signifies a nonzero input byte:
0x000103070f1f3f80 * 255 = 0x0102040810204080
We can see that the result has one bit set in each resulting byte; this means that this value can be shifted by any multiple of 8 and the resulting values can be OR’d together (which is the same as adding them because there’s no carry involved); doing so results in the top 8 bits of the value corresponding to the values of k
where the byte was nonzero  exactly what we need to get our bitmask!^{2}
There are a few ways to derive this in reverse; I’m not sure what exact method 0b0000000000000
used, however we can simultaneously devise the magic constant, and validate that the multiplication gives us what we want, by using Z3 theorem prover.
A full introduction to Z3 is neither inside the scope of this article nor is it mine to give  however, what Z3 allows us to do in this case is to give the solver our problem statement:
 Given any 64bit integer
x
, such that every byte ofx
is either11111111
or00000000
 Compute a 64bit integer
y = x * magic
 Ensure that each of the top 8 bits of
y
is set iff the corresponding byte ofx
is11111111
We can then give Z3 a fixed magic
value and verify that this holds for every x
 something that we could check by bruteforcing all 256 permutations of x
in this case, but that can be impractical in some other cases  or, instead, ask Z3 to find a magic
value that satisfies the conditions for any x
, or report to us that it does not exist.
Z3 can be used via a SMT interface, but I’d recommend avoiding it and using Python Z3 module instead. With the following short program, in a couple seconds the solver gives us the answer:
from z3 import *
def same8(v):
return Or(v == 0, v == 0xff)
magic = BitVec('magic', 64)
x = BitVec('x', 64)
y = x * magic
solve(ForAll([x],
Or(
# x has bytes that aren't equal to 0xff or 0x00
Not(And([same8((x >> (i * 8)) & 0xff) for i in range(8)])),
# every byte of x has bits that are equal to a corresponding top bit of y
And([(x >> (i * 8)) & 1 == (y >> (56 + i)) & 1 for i in range(8)])
)))
$ python3 bitmask.py
[magic = 284803830071168]
We can also ask the solver to find the constant for a slightly different version of the problem  for example, if all bytes are either 0 or 1, we can just tweak the same8
function and have the solver give us the new constant, 0x102040810204080, which we have seen before. Or, to see something that is harder to get the intuition for, if we change the code to remove the same8
condition and just expect to find a constant that, after multiplication, moves the top bits of every byte to the top 8 bits (for a more general emulation of movemask), Z3 will print no solution
 for any magic constant, given arbitrary input bits due to carry propagation there’s some value of x
for which the equality doesn’t hold.
Z3 is probably not the best tool for every problem like this  you need to know the shape of the solution you’re looking for, and in certain cases the problem constraints are such that a solution simply does not exist. However, it’s surprisingly versatile; it can be used to validate optimizations, discover vulnerabilities in hash functions, and that’s before you venture outside of the domain of bit vecs and explore the full power of predicate logic.
Removing the shift
The solution we’ve come up with so far requires two 64bit multiplies and a shift. Is that the best we can do, or can we try to remove the shift?
Well, we can’t remove the shift as it is  Z3 will happily tell us that it’s impossible to find a magic constant that will move bits from the end of a number into the beginning (although it’s also obvious without Z3). However, by using a 128bit multiplication instruction, we can adjust the magic constant such that the 8 bit result lands into the top 64bit half of the product. This can be accessed conveniently on GCC/clang with the __int128
extension, resulting in the following code:
const unsigned __int128 magic = 0x0103070f1f3f8000ull;
uint64x2_t mask2 = vreinterpretq_u64_u8(mask);
mask0 = uint8_t((vgetq_lane_u64(mask2, 0) * magic) >> 64);
mask1 = uint8_t((vgetq_lane_u64(mask2, 1) * magic) >> 64);
The shift instruction gets optimized away, resulting in just two UMULH
instructions in addition to vector>scalar moves, and an extra ~1% throughput improvement on Apple M2. However, UMULH may not be as fast as the regular multiplication; CortexX2 optimization guide lists it as having 1 extra cycle of latency, and using this on AWS Graviton2 actually results in ~2% throughput reduction. On MSVC, the same optimization requires using _umul128
intrinsic so overall this solution is unfortunately not as portable as the original optimization.
Conclusion
These few lines of meshoptimizer code had an interesting journey  starting with an SSSE3 implementation, which was the mental model I had at the time, discovering that the core idea, byte expansion, is so common that of course AVX512 has a dedicated instruction for it, wrestling with NEON emulation of movemask, enhancing WebAssembly SIMD proposal (this snippet was one of the use cases for addition of swizzle and bitmask) and finally discovering that in some cases, a plain old 64bit multiply is as powerful as a SIMD instruction or two.
The performance gains quoted here may seem small  but as often happens with performance tuning, to get solid wins for the entire algorithm you need to optimize all individual parts to get cumulative gains! Getting a few % of throughput gains on the entire algorithm merely from tuning something as small as movemask emulation is pretty exciting from that perspective.
To reproduce the performance measurements, you can clone meshoptimizer from GitHub and run make config=release codecbench && ./codecbench
; the relevant throughput measurement is the first one (vtx), and the source code is in src/vertexcodec.cpp
, most notably decodeBytesGroupSimd
function.
While the specific code snippets presented here are probably not as generally useful unless you’re writing a SIMDfriendly compressor, I hope that the introduction to Z3 motivates other people to use theorem provers more, both to discover numeric properties that might not be obvious at first glance, and to validate code transformations. The single program shared here likely represents a tiny fraction of Z3 power and utility  feel free to share other fun things you’ve used Z3 for in the comments!

As an aside, the process of contributing to WebAssembly SIMD proposal was very enjoyable! ↩

In our case, the natural bit order works best; when a different bit order is desired, it’s a matter of simply changing the magic constant. E.g. to get the reverse order, you need to multiply
0x8040201008040201
by the multiplicative inverse of 255 mod 2^{64} to get0x0080c0e0f0f8fcff
↩
« On Proebsting's Law 