After some discussions at the S4D conference last week, I have some additional updates to the history and technologies of reverse execution. I have found one new commercial product at a much earlier point in time, and an interesting note on memory consistency.
First and most importantly, I must revise my previously published history of reverse execution. It turns out I was wrong about Lauterbach. Their CTS (Context Tracking System) is not the record-replay debugger I took it for, but a working reverse debugger. It has been that way since 1999, beating Green Hills Time Machine to market by quite a few years. Thus, the award for “first reverse debugger based on hardware trace recording” has to be reassigned to Lauterbach. I was presented with a Trace32 newsletter from 1999 that made it very clear that the CTS allows a user to move backwards in time in the trace, tying each point in time to source code and registers. The CTS can also be given commands to step backwards in time until some condition is fulfilled, which is in essence reverse breakpointing.
I have updated my post from early 2012 to reflect this new understanding. That is the charm of blogging: you are at liberty to go back and rewrite what you wrote when new facts appear. Still, I keep the old text around, struck through, to show what I originally wrote even though it was wrong. Better to be clear about what has been revised than to silently change things.
Second, as noted before, weak memory models complicate replay for reverse execution. In a typical host-based approach to reconstruction-based reverse debug, the replay of a parallel execution is done serially, with interleaving at all points where communication was detected. This allows debugging of race conditions caused by accessing shared memory without proper locking. However, races that arise not from the program design but from the hardware behavior will not be reproduced. In a weak memory model like the one ARM employs, it is theoretically possible for certain memory operations to take noticeable time to propagate from one processor to another. Such conditions will not be replayable when using a single processor to reproduce a parallel scenario. If nothing else, replay on a single processor will ensure perfect memory consistency between threads. There is no way that hardware can allow such scenarios to be replayed without very special support being designed into the hardware.
The only method I can imagine that can handle this correctly is to use a cycle-accurate simulator that is still deterministic and applicable to reconstruction, running with a perfect model of the memory system. That will be painfully slow, but when debugging complex hardware-software memory consistency bugs, it might be the only tool that works.
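To make the limitation concrete, here is a minimal sketch in Python using the classic store-buffering litmus test. It enumerates every result that a serial, interleaved replay can produce and compares that with a toy machine where each store sits in a per-thread buffer before reaching memory. The names `THREADS`, `serial_outcomes` and `buffered_outcomes` are my own inventions for illustration, and the store-buffer machine is an assumed simplification, far cruder than the real ARM memory model. It only shows the kind of behavior that a single-processor replay cannot reach.

```python
from itertools import combinations

# Store-buffering litmus test; x and y both start at 0.
#   Thread 0: x = 1; r0 = y
#   Thread 1: y = 1; r1 = x
THREADS = [
    [("store", "x", 1), ("load", "y", "r0")],
    [("store", "y", 1), ("load", "x", "r1")],
]


def serial_outcomes(threads):
    """Results reachable by interleaving the two threads on one processor."""
    n0, n1 = len(threads[0]), len(threads[1])
    outcomes = set()
    # Choose which global steps belong to thread 0; program order is preserved.
    for slots in combinations(range(n0 + n1), n0):
        mem, regs, pc = {"x": 0, "y": 0}, {}, [0, 0]
        for step in range(n0 + n1):
            t = 0 if step in slots else 1
            op, addr, arg = threads[t][pc[t]]
            pc[t] += 1
            if op == "store":
                mem[addr] = arg          # immediately visible to everyone
            else:
                regs[arg] = mem[addr]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes


def buffered_outcomes(threads):
    """Same test on a toy machine with a FIFO store buffer per thread."""
    outcomes = set()

    def explore(mem, bufs, regs, pc):
        finished = True
        for t in range(len(threads)):
            # Option 1: thread t executes its next instruction.
            if pc[t] < len(threads[t]):
                finished = False
                op, addr, arg = threads[t][pc[t]]
                new_bufs = [list(b) for b in bufs]
                new_regs = dict(regs)
                new_pc = list(pc)
                new_pc[t] += 1
                if op == "store":
                    new_bufs[t].append((addr, arg))   # not yet in memory
                else:
                    # A load sees its own pending stores first, then memory.
                    pending = [v for a, v in bufs[t] if a == addr]
                    new_regs[arg] = pending[-1] if pending else mem[addr]
                explore(mem, new_bufs, new_regs, new_pc)
            # Option 2: the oldest buffered store of thread t reaches memory.
            if bufs[t]:
                finished = False
                addr, val = bufs[t][0]
                new_mem = dict(mem)
                new_mem[addr] = val
                new_bufs = [list(b) for b in bufs]
                new_bufs[t].pop(0)
                explore(new_mem, new_bufs, regs, pc)
        if finished:
            outcomes.add((regs["r0"], regs["r1"]))

    explore({"x": 0, "y": 0}, [[], []], {}, [0, 0])
    return outcomes


if __name__ == "__main__":
    print("serial replay: ", sorted(serial_outcomes(THREADS)))
    print("store buffers: ", sorted(buffered_outcomes(THREADS)))
```

In this sketch, no serial interleaving ever ends with both `r0` and `r1` equal to 0, while the buffered machine reaches that result when both loads execute before either store has drained to memory. If a bug only shows up in that state, a replay that serializes the threads on one processor has no schedule that gets there.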