This post features some additional notes on the topic of transporting bugs with checkpoints, which is the subject of a paper at the S4D 2010 conference.
The idea of transporting bugs with checkpoints is in some ways obvious. If you have a checkpoint of a state, of course you can move it. Right? However, changing how you think about reporting bugs takes time, and there are some practical issues to be resolved. The S4D paper goes into some of the aspects of making checkpointing practical.
In particular, we need the checkpoints to be:
- Portable – so that checkpoints can be copied around between computers
- Deterministic – so that everyone opening a checkpoint sees the same behavior
- Compact – so that they can actually be moved around without incurring undue pain
- Differential – so that a checkpoint can build on previous state and just contain a set of changes, not the entire state of the target system
Most of my paper is spent on how to make checkpoints small enough to be easily transported, and how it fits with development workflows. The requirements above would seem to be common sense, but there are checkpointing systems out there that do not fulfill them. In particular, the portability aspect is hard to get right.
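To make the differential requirement concrete, here is a minimal Python sketch of the idea: a differential checkpoint stores only the memory pages that changed relative to a base checkpoint. This is my own toy illustration, not the actual format used in the paper; all names are hypothetical, and a real checkpoint would also cover CPU, device, and disk state.

```python
# Minimal sketch of a differential checkpoint: store only the memory
# pages that changed relative to a base checkpoint. All names are
# hypothetical; a real format also covers CPU and device state.

PAGE_SIZE = 4096

def pages(mem: bytes):
    """Split a flat memory image into fixed-size pages."""
    return [mem[i:i + PAGE_SIZE] for i in range(0, len(mem), PAGE_SIZE)]

def diff_checkpoint(base: bytes, current: bytes) -> dict:
    """Record only the pages that differ from the base image."""
    return {i: p for i, (b, p) in enumerate(zip(pages(base), pages(current)))
            if b != p}

def restore(base: bytes, delta: dict) -> bytes:
    """Rebuild the full image from the base plus the recorded changes."""
    result = pages(base)
    for i, p in delta.items():
        result[i] = p
    return b"".join(result)
```

A chain of such deltas lets each checkpoint build on the previous one, so a bug report only has to ship the pages its session actually touched.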
There are other ways to achieve transportation of bugs, and this blog post fills in some related work that I could not fit into the paper, or that I discovered only after the final version of the paper was submitted.
There seems to be boundless creativity in creating methods to record live systems and replay their inputs/outputs/internal behavior/other interesting behavior on another system to support debugging or analysis or other tasks. It shows just how important the replication of bugs is to the development of systems, and just how hard it is to accurately capture a bug in practice.
A company called Zealcore was doing some interesting work on software-based recording of “only the relevant events”, and then replaying this on a lab machine. Their angle on the problem was to have software record a minimal trace of important events on a live system, and then control the runtime system in a lab to replicate the event trace. Making this efficient and precise was the subject of a sequence of research papers in the early 2000s. Zealcore was acquired by Enea in 2008, and I have not seen much from them since. From what I can tell, the fundamental Zealcore technology for recording on a live system (or at least the ideas) has been carried on into a new company called Percepio.
A fundamental difference between these recording systems and checkpointing systems is that they do not capture the complete target system state in the way a checkpoint does. The recording is much more compact, but it does not really solve the same problem. It is not based on running the target inside a simulator (other than at the replay end). What the relative success of such recording systems indicates, however, is that in many systems there are “important” and “irrelevant” aspects of inputs and events and behaviors, and that recording and replaying only the “important” aspects is often sufficient to trigger bugs.
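As a toy illustration of the recording idea (my own sketch, not Zealcore’s actual mechanism): if a handler’s behavior depends only on a small set of input events, then recording that event stream on the live system and feeding it back in the lab is enough to reproduce the run. All names here are hypothetical.

```python
# Toy record/replay: capture only the "important" inputs (messages
# delivered to a handler) on the live run, then replay exactly that
# trace in the lab. Real systems also record timing and scheduling.

class Recorder:
    def __init__(self, source):
        self.source = source      # live input source (an iterable here)
        self.trace = []           # the compact event log

    def events(self):
        for event in self.source:
            self.trace.append(event)
            yield event

def run(events, state=0):
    """Deterministic handler: its result depends only on the events."""
    for event in events:
        state = state * 31 + sum(event.encode())
    return state

live = Recorder(["msg-a", "irq-7", "msg-b"])
result_live = run(live.events())
result_replay = run(iter(live.trace))   # replay the recorded trace in the lab
assert result_live == result_replay
```

The trace is tiny compared to a full state snapshot, which is exactly the trade-off described above: compact, but only sufficient when the handler really is deterministic with respect to the recorded events.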
You can also throw hardware at the problem.
Completely unexpectedly, I also found a reference to a hardware-based record/replay system in a Communications of the ACM interview with Edsger Dijkstra (a rerun of an interview from 2002). Apparently, during the early programming of the IBM 360, IBM realized that debugging interrupts was hard. The solution was to create a piece of special hardware which would record interrupts, and later replay them with precise timing. In this way, they achieved repeatable executions of the most difficult code there was. I must quote what Dijkstra says on this “throw money at the problem” approach:
When IBM had to develop the software for the 360, they built one or two machines especially equipped with a monitor. That is an extra piece of machinery that would exactly record when interrupts took place and from where to where. And if something went wrong, it could replay it again and use the recorded history to control when interrupts would occur. So they made it reproducible, yes, but at the expense of much more hardware than we could afford. Needless to say, they never got the OS/360 right.
The final comment is typical of Dijkstra’s thinking: debugging is just an indication that you did not get the program and design right from the start. That’s certainly true, and he would likely have considered my little S4D paper an unnecessarily complicated solution to a problem that should not have existed in the first place.
I, however, find the idea of the monitor interesting. I think that building something like that today would be much more difficult, as chips are very highly integrated and the support for replaying interrupts would have to go right into the heart of an SoC. But it would be interesting if it could be done.
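For illustration, the monitor’s scheme can be sketched in simulator form: record at which instruction count each interrupt fired, then drive interrupts at exactly the same counts on replay, giving an identical interleaving. This is my own toy reconstruction of the idea, not IBM’s design; all names are hypothetical.

```python
# Sketch of the IBM monitor idea in simulator form: interrupts are
# replayed at the exact instruction counts where they were recorded,
# making the interleaving (and thus the run) reproducible.

def execute(program, interrupt_schedule):
    """Run a toy 'CPU'; interrupts fire at exact instruction counts."""
    log = []
    for count, op in enumerate(program):
        if count in interrupt_schedule:
            log.append(("irq", interrupt_schedule[count]))
        log.append(("op", op))
    return log

schedule = {1: 7, 3: 9}   # interrupt vectors recorded at counts 1 and 3
first = execute(["add", "sub", "mul", "div"], schedule)
second = execute(["add", "sub", "mul", "div"], schedule)
assert first == second    # identical interleaving => reproducible run
```

The hard part in real hardware is, of course, capturing and reinjecting the interrupts with cycle accuracy, which is what the extra machinery paid for.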
There is also a paper from 1969 that I wrote about a few years ago that does include the idea of recording and replaying asynchronous external inputs to a simulator.
Other Checkpointing Systems
There might be some related use of checkpoints (or snapshots as they are more commonly known) in the development of game emulators. There is clearly the ability to save game state in a portable way in emulators like MAME. Such states can be useful to help debug the emulator, but in a different way from the approach that I presented. In the emulator case, the state is really the state of the emulated target. It is not the state of the emulator program itself. If game emulator snapshots were used to debug the game code, it would be the same situation as what I describe in the S4D paper.
As I understand it, this is more like attaching an example document that makes a program crash to a bug report than like transporting the state of the emulator itself.
Going down in the level of abstraction, I have also been told that RTL simulators offer a similar ability and that it has been used in a similar way. Since I am not at all familiar with that field, I did not comment on this in the paper.
Transporting RTL bugs using checkpoints makes perfect sense. In an RTL simulator, the target state is very clearly described in an unambiguous way, with no relationship to host state. Checkpointing should be easy to implement and checkpoints should be portable; anything else would be a poor implementation. The simulation is also deterministic, assuming a reasonable implementation of the simulator. The simulated world is also encapsulated, driven by a set of test cases; RTL simulations are too slow to be interfaced to the real world. If an RTL simulator is interfaced to something else, recording the incoming signals should be straightforward, since they are at a very low level (bits, clocks, pin states).
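As a toy sketch of why recording is easy at this level: the external interface is just pin values sampled at clock cycles, so a recording is a list of (cycle, pin, value) samples, and replay is a matter of driving the same values at the same cycles. This is a hypothetical illustration, not any particular simulator’s API.

```python
# Sketch of recording an RTL simulator's external inputs: the interface
# is just pin values at clock cycles, so the portable trace is a list of
# (cycle, pin, value) tuples. All names are hypothetical.

def record(stimuli):
    """Turn live stimuli into a portable trace of pin changes."""
    return [(cycle, pin, value) for cycle, pin, value in stimuli]

def replay(trace, num_cycles):
    """Drive a toy design from the trace, sampling pin state per cycle."""
    changes = {(c, p): v for c, p, v in trace}
    pins = {}
    sampled = []
    for cycle in range(num_cycles):
        for (c, p), v in changes.items():
            if c == cycle:
                pins[p] = v
        sampled.append(dict(pins))   # pin state as seen this cycle
    return sampled

trace = record([(0, "rst", 1), (2, "rst", 0), (2, "data", 1)])
assert replay(trace, 3) == replay(trace, 3)   # deterministic replay
```

Since the trace is pure data with no host dependencies, it is portable by construction, which matches the observation that determinism and portability come almost for free at the RTL level.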
The use of checkpointing with RTL also fits with a conversation I had in 2005 when Virtutech introduced reverse execution in Simics. At one of the tradeshows where we showed the technology, an older gentleman approached me and told me that he had done similar things with hardware simulators back in the 1980s. He immediately understood the implementation idea (checkpoints with deterministic replay), and sounded as if he felt there was nothing much new to it.
Finally, at some other event last year, I saw a demonstration of an RTL-level tool where the trace of the execution was generated on one machine, but inspected on a different machine. That amounts to a portable trace, even if the data volumes were rather large (many GB) and essentially required the RTL simulator (or hardware-accelerated emulator) to share disks with the investigation machine. Still, nothing prevents such a solution from being used remotely. The main difference from what I describe is that only the result of the execution (a trace of signals) is transported, not an actual state snapshot that can be brought up to continue the execution in a different place.
If you have any other notes on this, please comment!