Don't you dare flip that bit

I’ve decided to take a small break from VFC series and post something completely different. VFC series continues next time, don’t worry.

In our studio, there is a single engine that’s shared between most projects. There is a separate team, which develops and maintains it – I happen to be a part of this team. Each project has its own branch of the engine, and changes are integrated in both directions as needed. Of course, the engine needs a solid test base. We have two methods of testing our code – unit-tests and small demos (we call them spikes). Each spike is demonstrating some specific feature (i.e. particle systems, animation trees, huge scene with complex shaders, etc.) and often has some custom C++ code and some art, of course. Sometimes it’s just boxes and spheres (for some reason, there is no teapot object in Maya, so no teapots :-( ), sometimes it’s scenes from our projects. We have some automated means to test the spikes (memory leak/log errors check, screenshot comparison), but the post is not about them.

The story starts with one of the spike which contains a level from a game which is being developed. It’s a moderately complex level, I usually develop/optimize rendering code in it. One day I run it on PS3, move camera somewhere and suddenly notice something weird – there is an object (a tree, actually), which extrudes from the place it ought to be to level boundary or farther (unfortunately, I don’t have any screenshots and I’m not allowed to post them for that matter). Well, there is a PIX-like tool on PS3, so I launch it, capture the frame, look at the mesh and see that there is one vertex that went astray. I reload the spike to narrow down the problem via edit & continue (I’ll have to blog on that some day), and the object is okay. Well, shrug back to work.

Then approximately a week later I was fixing shadows on the same project the scene was from (for some reason, my love for PSM vanished once I started using it in an actual project…), and suddenly while going through the level notice the same bug – there is some other object with a stray vertex. I launch the PIX-like tool again, and now inspect the problem closer – it appears that one of indices is broken – instead of 0x0023, it’s 0x8023, which leads to sampling incorrect vertex (with position.w != 1). I note the address, reload the game (this tool requires a special program to be launched on some devkit to “replay” the capture – replaying gives ability to see wireframe, for example – and since for some reason we only have one PS3 devkit for the engine team, I launch replay on the same kit I captured from, so if I want the game running I have to reload it) – and the problem is not there again. Well, I reload it several times for the next hour, and the object is still ok. I check our geometry file – and the index in it is 0x0023. There we have it – a heisenbug. Apparently, either the file loaded incorrectly, or someone damaged my index buffer (it’s in video memory, btw).

Well, how do you debug a heisenbug? First, I had to make a list of possible reasons. But in order to do this, I need more data – two random occurrences are not enough! I talk this over with my lead and we decide to wait till it appears again. Also I check with the projects Q&A – they’ve never seen it but they will tell me immediately if they do. Great.

The next day I launch the game again, go through the level a couple of times and suddenly notice the same bug – the same vertex is displaced. After leaving the game running for some more time and looking in different places, I find several more bugs – it even appears on some skinned characters. Moreover, it’s obvious now that the bug appears after some time – I can cycle through the level and see more and more stray vertices. So I do a few captures, analyze the data, record addresses, expected and actual contents, look at the resulting file and think.

Now we have a pattern. First, it’s obviously a memory damage, file loading is not to blame – the loaded meshes stay in memory at the same places. Second, it affects both vertex and index buffers – on some models the wrong data is in vertex buffer (by the way, noticing a bug in vertex buffer by looking at vertex positions is… hard. Especially if it’s 3500 vertices long). Third, it, uh, affects only one bit. For a 0x0023 vs 0x8023, this could’ve been a full-byte damage, but on vertex data it’s obvious that there is exactly one bit damaged – sometimes it’s 1 when it should have been 0, sometimes vice versa. Neighbor bits/bytes are not affected (the comparison with geometry file confirms it). As another side-note, index buffers are very much alike – when I search for pattern from vertex buffer in the file, there is often a single match even if a pattern is only 16 bytes long, for index buffers it sometimes takes me 100 bytes or so to detect the correct place.

Now, the weird thing is that this happens on geometry that should never be touched by any code after it’s loaded – we don’t do any vertex/index processing on static geometry, we even skin on GPU. So it should be some seemingly unrelated code, which performs bit manipulation – probably some state packed in bit vector? Well, who knows… Anyway, I try to disable some code that can be disabled more or less safely without losing the ability to render stuff (so that I can see something) – it does not help, i.e. bugs are still there. They are in different places for different runs, but there is an object that is constantly damaged the same way no matter what I change. I quickly find the address, and plug it into debugger watch.

In this case the memory layout for level data is static (thank god!), so addresses do not change. After several runs, the address is confirmed – it’s always damaged. Moreover, if I fix the value there, after some time (which ranges from instantly to ten-twenty seconds) it’s damaged again – and I’m able to confirm that only one bit is getting changed. We’re getting somewhere – one of many manifestations of the bug reproduces without problems, so I can do something about it.

Now that we know the story, it’s time to think about suspects. We’ve got PPU, SPU and RSX in picture. It does not look like RSX – it’s pretty hard to setup the rendering so that exactly one bit somewhere in the middle of geometry data is damaged and everything around it is okay – so it’s not RSX. SPUs can only access external (video/system) memory via DMAs, and DMAs have to be at least 4 byte aligned – so if it’s SPU, it loaded a chunk of memory, changed a bit there and put it back. We had around 100 kb of our SPU code back then, so we checked it and it did not seem to do anything like it. Still, that was a possibility, and we also had Havok running on SPUs, so we could not eliminate them. And, well, we have lots of PPU code that can freely write wherever it wishes.

Step one – eliminating the suspects. I wanted to eliminate SPU code, so I temporarily switched the systems that were using it (both our code and Havok) to their PPU variants where possible, and added some asserts to DMAs in the single remaining job. No changes, the bug is still there – so it’s not SPU code or at least if it’s SPU code, we don’t know anything about it (there are some SPU processes launched by GameOS). So it has to be PPU code. The behaviour looked like that of our video memory manager (it stores block information in video memory, and indeed can change some bits), but replacing it with dumb linear one (ptr += size) did not make a difference. Time to move further!

Step two – determining the place of corruption. It should be simple – we switch the bit off, put a data breakpoint and find the culprit? Well, no – for some reason, DABR (data-access breakpoint, PowerPC) does not work on video memory – i.e. I can’t even set it. There is another facility to trap memory accesses (which works on more types of memory accesses than DABR), but it does not work with video memory either. So the resolution was that we have to launch a high-priority thread that constantly checks the address in question.

The assert started trigging amazingly early – even before the level started to load! When it was triggered, the main thread (and other threads) were either sleeping or stopped in seemingly random places, but it was a definite improvement – the first assertion was before game initialization even completed, so the geometry just happened to be at the wrong place (in the wrong time…).

At the end of game initialization there was 6 active SPU threads and 15 or so PPU, which was clearly a bit much. Luckily, since we didn’t have to load the level any more, we could remove stuff more freely. So I added an infinite loop with beginFrame/endFrame calls to the end of initialization routine (to prevent post-initialization crashes) and started commenting out various subsystems’ initialization, eventually reaching the state where there are no SPU threads and only two PPU threads – main one and graphics interrupt handler. There was still a lot of initialization code left, since I left the stuff that did not create threads.

The assert still triggered randomly, so I thought I’d change the checking method. GCC has an option that instruments all functions by adding calls to special functions at the beginning/end. The option is called -finstrument-functions (it makes the compiler call __cyg_profile_func_enter on enter and __cyg_profile_func_exit on exit; moreover, this even works in case the function was inlined), MSVC has a similar pair of options, /GH and /Gh. I added this option to all our PPU code (this excludes middleware we have no source for, like Havok, and also excludes Sony libraries, but in most cases there still is our code that calls into middleware/libraries), and added enter/exit functions that assert if our address contains wrong data. It’s funny that GCC instrumented my enter/exit functions as well (despite the fact they should have fixed prototype), so I had to tag them with no_instrument_function attribute.

Running the resulting image did not produce clear results either – assert still triggered in more or less random places. Additionally now the callstack did not contain any meaningful information except the function from which the enter/exit stub was called, which complicated stuff a bit. Getting frustrated, I started removing the remaining initialization code one by one. Finally all that was left was a single call to graphics device initialization – I removed some inner parts, but I could not remove that altogether because it maps video memory to process space – and loop with beginFrame/endFrame, which I reduced to 4 calls to Sony libraries (which essentially performed a flip). Replacing a loop with plain infinite one (while (true) ;) made the bug disappear.

So it’s not our code; moreover, it occurs because of Sony graphics library. There were two answers left – either it’s not our fault or we somehow initialize the library incorrectly. I ran the game (original version) on another devkit, but there was no problems there – so I suspected firmware version, but reflashing devkit to different versions did not change anything. After that I launched one of SDK samples – but the address contents was intact (i.e. did not change after I changed it in debugger to some meaningful value). So it had to be our code. Careful comparison of initialization routine indeed spotted some differences (we passed a zero-initialized structure in one of the functions, though the documentation and the sample stated that NULL should be passed instead), but eliminating them again did not fix it. Well, that was weird. A year and a half ago, when we started porting the engine to PS3, I’ve written a small several-days demo in the process of studying the graphics library/RSX. Luckily I’ve got the sources and a prebuilt image (several SDK revisions and year and a half in the past!). I launched it, and the address contents was in tact – but I’ve decided to peek further, and dumped the whole video memory to the file. Inspection of this file revealed, that there was a single non-zero byte with offset > 32 Mb from video memory start. Amazingly, the low half-word (two bytes) of it was equal to the address I was working with the whole time! So the bug was even in my small demo? Give me one last try…

I’ve launched Sony’s sample again and did not even have to dump video memory – the very same address was corrupt! So I filed a support request, suggesting that it was a hardware problem – after all, I’ve finally reproduced it on SDK sample, and it did not reproduce on another devkit – and soon got a confirmation. Damn. Damn. Damn!

It took me a day and a half to reach the answer (it could’ve been much worse though). I guess the moral is that you have to presume nothing, including correct hardware operation – though of course I could not blame the hardware until all other reasons proved themselves wrong.