A Replay Debugger from 1995!

A comment on my old blog post about the history of reverse execution gave me a pointer to a fairly early example of replay debugging. The comment pointed at a 2002 blog post which in turn pointed at a 1999 LWN.net text which almost in passing describes a seemingly working record-replay debugger from 1995. The author was a Michael Elizabeth Chastain, of whom I have not managed to find any later traces.

The critical part of the LWN write-up is this:

I have a trace-and-replay program based on ptrace. The tracer is similar to strace.

The replayer is the cool part. It takes control whenever the target process executes a system call, annuls the original system call, and overwrites the target process registers and address space with the values that I want to be in there.

This is the core of any replay system for user-space processes: intercept OS interaction, record it on the real run, and replay it in the replay run. The system could apparently be used with gdb to get a record-replay debugger (similar to where the RR debugger started before going full reverse debugger).
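
To make the record/replay idea concrete, here is a minimal sketch in Python. It is a toy model, not Chastain's ptrace-based code: the names `Recorder`, `Replayer`, `target`, and the `syscall` method are all invented for illustration, and the "system calls" are ordinary Python functions standing in for real OS interaction.

```python
import random
import time


class Recorder:
    """Recording run: perform the real call and log what came back."""

    def __init__(self):
        self.log = []

    def syscall(self, name, real_fn, *args):
        result = real_fn(*args)          # real interaction with the outside world
        self.log.append((name, result))  # remember the result for later replay
        return result


class Replayer:
    """Replay run: never touch the outside world, feed back logged results."""

    def __init__(self, log):
        self.log = list(log)
        self.pos = 0

    def syscall(self, name, real_fn, *args):
        logged_name, result = self.log[self.pos]
        if logged_name != name:
            raise RuntimeError(
                f"replay diverged: expected {logged_name}, got {name}"
            )
        self.pos += 1
        return result                    # the real call is skipped ("annulled")


def target(os_iface):
    """Toy deterministic program whose only inputs arrive through os_iface."""
    total = 0
    for _ in range(5):
        total = total * 31 + os_iface.syscall("getrandom", random.getrandbits, 16)
    stamp = os_iface.syscall("time", time.time)
    return total, stamp


recorder = Recorder()
recorded_run = target(recorder)                 # real run, produces the trace
replayed_run = target(Replayer(recorder.log))   # replay run, consumes the trace
print(recorded_run == replayed_run)
```

Because everything outside the intercepted calls is deterministic, the replay can be restarted as many times as needed and will always follow the same path. A real implementation also has to restore memory that the kernel wrote into the process, which this model hides inside the return value.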

However, the implementation also suffered from a huge issue of practicality: the need to know what each syscall did and how it affected the state. With the Linux kernel moving quickly in 1995 (and still moving pretty quickly today, two decades later), maintaining that part of the implementation turned out to be impractical.

The replayer needs a table of every system call and how it affects memory, and that table needs more entries every week (thanks to ioctl). So I have a great demo, if you have 1.3.42 kernel headers to compile it against.
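
A rough sketch of what such a table could look like, again as a toy Python model rather than the original code. The table layout, the names `SYSCALL_EFFECTS`, `IOCTL_EFFECTS`, `Untraceable`, and `memory_effect` are hypothetical, and the byte sizes assume a common Linux ABI (4-byte `int`, `struct winsize` made of four unsigned shorts).

```python
class Untraceable(Exception):
    """Raised when the tracer does not know how a call affects memory."""


# For each system call: index of the argument that points at an output
# buffer (or None), and a function giving the number of bytes written.
SYSCALL_EFFECTS = {
    "read":   (1, lambda args, ret: ret),   # kernel fills buf with ret bytes
    "write":  (None, lambda args, ret: 0),  # only reads process memory
    "getpid": (None, lambda args, ret: 0),  # result is in the return value
}

# ioctl is the hard case: the effect depends on the request code, so it
# needs its own table with one entry per request.
IOCTL_EFFECTS = {
    "FIONREAD":   4,  # writes one int (assumed 4 bytes)
    "TIOCGWINSZ": 8,  # writes struct winsize (assumed 4 x unsigned short)
}


def memory_effect(name, args, ret):
    """Return (arg_index, nbytes) the replayer must restore, or None."""
    if ret < 0:
        return None                         # failed call: nothing was written
    if name == "ioctl":
        request = args[1]
        if request not in IOCTL_EFFECTS:
            raise Untraceable(f"ioctl request {request!r} not in table")
        return (2, IOCTL_EFFECTS[request])
    if name not in SYSCALL_EFFECTS:
        raise Untraceable(f"system call {name!r} not in table")
    out_arg, out_len = SYSCALL_EFFECTS[name]
    nbytes = out_len(args, ret)
    return (out_arg, nbytes) if out_arg is not None and nbytes > 0 else None


print(memory_effect("read", (3, 0x1000, 4096), 100))
print(memory_effect("ioctl", (0, "TIOCGWINSZ", 0x2000), 0))
```

Every new driver can add new ioctl requests, so the second table is the one that never stops growing. Raising `Untraceable` for unknown requests is one possible way to bound the work.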

It is always a challenge to follow an API as it changes, and that’s one of the advantages of a heavy solution like Simics or a full VM: by simulating the hardware interface, you actually have a narrower interface than when trying to work at the OS API level. However, an API-level solution initially requires much less work.

Still, what this method did provide was a portable trace, allowing a recording to be made on one machine and replayed on another. The LWN write-up mentions that it was possible to exchange traces over the Internet, recreating a run from a different host on the local host. That is very powerful and a feature that is still hard to come by today.

One of the two guys put up a mud server and traced it. He sent me the trace file, and I ran gdb on it. I re-executed his program, I set breakpoints anywhere I wanted, I inspected data at any breakpoint. Hmmm, there’s a structure that looks funny, I’ll just restart and set an earlier breakpoint.

This quote makes it clear that there was no reverse debugging going on here, only perfect repeatability. It would seem that this is a user-level record-replay solution for programs on Linux. It was presumably single-threaded and single-processor only, since no mention is made of threads (multicore processors did not yet exist in 1995, and multiprocessors were exotic beasts).

6 thoughts on “A Replay Debugger from 1995!”

  1. Thanks for following up on my link to this and a great job on documenting the history of reverse debuggers in your previous blog post! Hope all is well in Uppsala…

  2. Hi, I’m Michael Elizabeth Chastain.

    The code is available here:
    http://ibiblio.org/pub/linux/devel/debuggers/mec-0.3.tar.gz

    (Despite the suffix, I think this is a .tar file, not a .tar.gz file).

    Indeed, it was a single-threaded, single-processor, record-replay debugger. And I did get a friend who was running a MUD to send me a trace file and I ran gdb on the trace file.

    Why a MUD? Because I had worked on Merc Mud and I knew that it was an interesting multi-user server based on single-processor, single-processing select() calls, with no signals.

    My demo was the culmination of about 1 year solo full-time unpaid work. I learned incredible amounts about Linux, though, and that paid off further in my career.

    The key technical insight is that a Linux process starts in a deterministic state when it is execve’d, and all ordinary user-mode instructions are deterministic. System calls are the only source of new information (in the simple model). And ptrace() has enough power to handle system calls.

    My technical downfall were those damn ioctl’s. I should have just instrumented the top 50 of them and then marked any others as “untraceable”. But I was OCD about tracing all of them!

    My social downfall was not starting (or finding) a community. My original plan was to make a commercial product for SunOS and make my living from that. I believed I was years ahead of anyone else, so I worked solo in stealth mode.

    Well, Chris Faylor of Cygnus Solutions saw the lwn.net post, and invited me to give a talk at Cygnus and apply for a job at Cygnus. It was exciting to talk to a room full of engineers who understood ptrace and understood gdb. One of the high points came when I explained that I didn’t have to make a special version of gdb — that gdb *itself* interacts with its inferior process by making system calls, so I just used my system call interception technology to tell gdb my version of what was happening in the inferior. The room broke out in laughter.

    Later on I worked for Google, on projects unrelated to debugging. I’m retired now.

    I haven’t looked at debuggers for several years and I’m looking forward to reading your essays and catching up on the field.
