A comment on my old blog post about the history of reverse execution gave me a pointer to a fairly early example of replay debugging. The comment pointed at a 2002 blog post which in turn pointed at a 1999 LWN.net text which almost in passing describes a seemingly working record-replay debugger from 1995. The author was a Michael Elizabeth Chastain, of whom I have not managed to find any later traces.
The critical part of the LWN write-up is this:
I have a trace-and-replay program based on ptrace. The tracer is similar to strace.
The replayer is the cool part. It takes control whenever the target process executes a system call, annuls the original system call, and overwrites the target process registers and address space with the values that I want to be in there.
This is the core of any replay system for user-space processes: intercept OS interaction, record it on the real run, and replay it in the replay run. The system could apparently be used with gdb to get a record-replay debugger (similar to where the RR debugger started before going full reverse debugger).
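The record-then-replay idea can be sketched in a few lines. This is only an illustration of the principle, not a reconstruction of the 1995 tool: the real thing intercepted system calls with ptrace, while the sketch below routes calls through a hypothetical `SyscallBoundary` object. All names here (`SyscallBoundary`, `call`, `program`) are made up for the example.

```python
import json
import os
import time


class SyscallBoundary:
    """Hypothetical stand-in for the OS interface a traced program calls through."""

    def __init__(self, mode, trace=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.trace = [] if trace is None else list(trace)
        self.position = 0  # index of the next event to replay

    def call(self, name, real_fn, *args):
        if self.mode == "record":
            # Real run: perform the call and log what came back.
            result = real_fn(*args)
            self.trace.append({"name": name, "args": list(args), "result": result})
            return result
        # Replay run: never touch the OS, hand back the recorded result.
        entry = self.trace[self.position]
        self.position += 1
        if entry["name"] != name or entry["args"] != list(args):
            raise RuntimeError("replay diverged at event %d" % (self.position - 1))
        return entry["result"]


def program(osif):
    """A toy 'target' whose output depends only on OS interaction."""
    now = osif.call("time", time.time)
    pid = osif.call("getpid", os.getpid)
    noise = osif.call("urandom", lambda n: os.urandom(n).hex(), 4)
    return "%r %r %r" % (now, pid, noise)


# Record on the "real" run, then ship the trace as text and replay it.
recorder = SyscallBoundary("record")
recorded_output = program(recorder)
trace_text = json.dumps(recorder.trace)

replayer = SyscallBoundary("replay", json.loads(trace_text))
replayed_output = program(replayer)
```

Because the replayed run gets every answer from the trace, it produces the same output as the recorded run, even though the time, the process ID, and the random bytes would all differ if the program were simply run again.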
However, the implementation also suffered from a huge issue of practicality: the need to know what each syscall did and how it affected the state. With the Linux kernel moving quickly in 1995 (and still moving pretty quickly today, two decades later), maintaining that part of the implementation turned out to be impractical.
The replayer needs a table of every system call and how it affects memory, and that table needs more entries every week (thanks to ioctl). So I have a great demo, if you have 1.3.42 kernel headers to compile it against.
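To make the maintenance burden concrete, here is a rough sketch of what such a table might look like. It is my own illustration, not the original code: the table layout and function names are hypothetical, the structure sizes assume 64-bit Linux, and the ioctl request numbers assume x86 Linux.

```python
# Hypothetical effect table: for each system call, a function mapping
# (arguments, return value) to the (address, length) regions the kernel
# wrote into the target's address space. Those regions are what a recorder
# must save and a replayer must write back.

TIMEVAL_SIZE = 16    # assumption: struct timeval on 64-bit Linux
TIOCGWINSZ = 0x5413  # assumption: x86 Linux request number
FIONREAD = 0x541B    # assumption: x86 Linux request number

# ioctl needs its own sub-table, keyed by request code.
IOCTL_OUTPUT_SIZE = {
    TIOCGWINSZ: 8,  # struct winsize: four unsigned shorts
    FIONREAD: 4,    # a single int
}


def read_effects(args, ret):
    fd, buf, count = args
    # On success the kernel wrote `ret` bytes starting at `buf`.
    return [(buf, ret)] if ret > 0 else []


def gettimeofday_effects(args, ret):
    tv, tz = args
    # Only the timeval is handled here; the timezone argument is ignored.
    return [(tv, TIMEVAL_SIZE)] if ret == 0 and tv != 0 else []


def ioctl_effects(args, ret):
    fd, request, argp = args
    if request not in IOCTL_OUTPUT_SIZE:
        raise KeyError("no table entry for ioctl request 0x%x" % request)
    return [(argp, IOCTL_OUTPUT_SIZE[request])] if ret == 0 else []


SYSCALL_EFFECTS = {
    "read": read_effects,
    "gettimeofday": gettimeofday_effects,
    "ioctl": ioctl_effects,
    "getpid": lambda args, ret: [],  # returns a value, writes no memory
}


def regions_to_record(name, args, ret):
    if name not in SYSCALL_EFFECTS:
        raise KeyError("no table entry for system call %r" % name)
    return SYSCALL_EFFECTS[name](args, ret)
```

Every new system call needs a new entry, and every new driver-specific ioctl request needs a new line in the sub-table. An unknown request cannot be replayed at all, which is presumably why the table had to grow every week.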
Following an API as it changes is always a challenge, and that’s one of the advantages of a heavy solution like Simics or a full VM: by simulating the hardware interface, you actually have a narrower interface to deal with than when working at the OS API level. However, an API-level solution initially requires much less work.
Still, what this method did provide was a portable trace, allowing a recording to be made on one machine and replayed on another. The LWN write-up mentions that it was possible to exchange traces over the Internet, recreating a run from a different host on the local host. That is very powerful and a feature that is still hard to come by today.
One of the two guys put up a mud server and traced it. He sent me the trace file, and I ran gdb on it. I re-executed his program, I set breakpoints anywhere I wanted, I inspected data at any breakpoint. Hmmm, there’s a structure that looks funny, I’ll just restart and set an
This quote makes it clear that there was no reverse debugging going on here, only perfect repeatability. It would seem that this is a user-level, record-replay solution for programs on Linux. It was presumably single-threaded and single-processor only, since no mention is made of threads (and multicore processors had not yet appeared in 1995, so multiprocessors were exotic beasts).