I have a long-standing interest in debugging in general, and in reverse debugging and the related idea of record-replay debug in particular (see a series of blog posts I did a few years ago on the topic: history 1, history 2, history 3, S4D report, updates, Simics reverse execution, and then Lab Cloud record/replay). Recently, I found out that Undo Software, one of the pioneers in the field, had released a product called “Live Recorder”. So I went to check it out by reading their materials and comparing it to what we have seen before.
Live Recorder is a product that promises to enable the workflow you really want: run tests in production or in a test lab, and when a failure occurs, automatically and perfectly reproduce the error in the development lab. This is a workflow that I have been pitching for Simics for a long time (see for example a blog post on Simics recording session checkpoints, and on how to do continuous integration with simulation). Making it work from the field, however, is a bit of a challenge since you cannot slip in a tool underneath a running deployed system like you would do in the lab.
Enter Live Recorder, where a programmer inserts some instrumentation code and a library into their (Linux user-level) application, and ships it. The program then records all asynchronous inputs so that its execution can be later replicated in the lab in case a failure occurs. The assumption is that the program itself is deterministic (the only sane assumption you can make unless you want to take a complete instruction trace which is just technically impractical) and that given the same set of inputs, you will get the same behavior and the same eventual bug. Not a bad idea. By integrating the recording into the application, there is no need to modify the system itself to insert the instrumentation, and an application developer can deploy the application just like they normally do.
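To make the principle concrete, here is a minimal sketch of input recording and replay in Python. It is not how Live Recorder is implemented (I have not seen its internals); the `InputLog` class and the `application` function are made up for illustration. The only assumption is the one stated above: the program is deterministic apart from the inputs that pass through the log.

```python
import random
import time


class InputLog:
    """Hypothetical helper: records nondeterministic inputs, or feeds them back."""

    def __init__(self, events=None):
        # With no events we are recording; with events we are replaying them.
        self.replaying = events is not None
        self.events = list(events) if events is not None else []
        self._cursor = 0

    def capture(self, source):
        """Return source() while recording, or the next logged value in replay."""
        if self.replaying:
            value = self.events[self._cursor]
            self._cursor += 1
            return value
        value = source()
        self.events.append(value)
        return value


def application(log):
    """A deterministic program; all nondeterminism enters through log.capture()."""
    total = 0
    for _ in range(5):
        sample = log.capture(lambda: random.randint(0, 100))  # "external" input
        stamp = log.capture(time.monotonic_ns)                # clock reading
        total = (total * 31 + sample + stamp % 7) % 1_000_003
    return total


# "Field" run: inputs are recorded as a side effect of running.
recording = InputLog()
original = application(recording)

# "Lab" run: the same code is driven from the log instead of the real world.
replayed = application(InputLog(recording.events))
```

The replayed run computes the same result as the original because every value that could differ between runs comes out of the log. Miss a single source of nondeterminism and the replay diverges, which is why the completeness of the recording matters so much.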
Update: Note that this integration means that Live Recorder is a bit less convenient to use for a quick recording of a program in the lab, where just attaching or running the program under a reverse debugger probably achieves the desired effect. Requiring in-program instrumentation really means that you make recording part of the feature set of your software. This is like building in extra logging, really. I think it makes sense for deployed applications (how else would you do it there), but not necessarily for one-off debug.
The application will then always log inputs, and if a failure occurs, the log file is sent back to the developer. Not unlike how you do things today when a failure happens – ask the customer for system information and logs. But this log lets you replay, so it is a bit more powerful.
Once a log is in the lab, it can be replayed under the control of the UndoDB reverse debugger, and after that first replay, you can debug it using reverse debugging techniques. Record/replay and reverse debugging complement each other very nicely, actually. We see the same workflow with Simics, and it is very powerful. Microsoft also supports this with their IntelliTrace tool, which however is not a full reverse debugger and only works for “managed” code, while Live Recorder works for standard C code.
The devil, as always, is in the details. From what I understand, Live Recorder can only work if a program is serialized on a single execution thread. Threaded programs are supported, but they are serialized on a single actual execution thread to avoid non-deterministic timing-dependent thread interactions. Regular UndoDB enforces serialization, as does gdb reverse. RogueWave ReplayEngine also replays in a single thread from what I understand. Unless you do a full system simulation like Simics or get a deterministic OS in place, this is likely as good as you can get. There are clearly applications where this limitation is not a problem. For example, if you have a set of programs that collaborate using networking and messaging, their relative behavior is going to be highly non-deterministic, and recording each program on its own will definitely help in debug. So I don’t think Live Recorder is that useful for a massively parallel high-performance program, but it would be applicable to most other classes.
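Why serialization helps is easy to show with a toy. In the sketch below, "threads" are Python generators and the scheduler only ever runs one of them at a time, so the entire interleaving boils down to a list of scheduling decisions that can be logged and replayed. This is my own illustration with made-up names (`worker`, `run_serialized`), not a description of what UndoDB or Live Recorder actually do internally.

```python
import random


def worker(name, shared, steps):
    """A 'thread' written as a generator; each yield is a possible switch point."""
    for i in range(steps):
        shared.append(f"{name}{i}")
        yield


def run_serialized(schedule=None, seed=None):
    """Run two workers one at a time, recording (or replaying) who runs when."""
    shared = []
    threads = {"A": worker("A", shared, 3), "B": worker("B", shared, 3)}
    rng = random.Random(seed)
    recorded = []
    replay = iter(schedule) if schedule is not None else None
    while threads:
        if replay is not None:
            name = next(replay)               # replay: follow the logged decision
        else:
            name = rng.choice(sorted(threads))  # record: arbitrary decision
        recorded.append(name)
        try:
            next(threads[name])               # run the chosen thread to its next yield
        except StopIteration:
            del threads[name]                 # thread finished
    return shared, recorded


# First run: the interleaving is arbitrary, but every decision is logged.
trace, schedule = run_serialized()

# Second run: replaying the logged decisions reproduces the same interleaving.
replayed_trace, replayed_schedule = run_serialized(schedule=schedule)
```

With truly concurrent threads there is no such single list of decisions to log, since memory accesses from different cores interleave at a much finer grain. That is the part that gets expensive to record.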
There is an interesting tool from Intel, DrDebug, that uses the PIN binary instrumentation framework to successfully instrument and record and replay even truly concurrent programs. However, the cost for this is a much bigger slowdown than what is claimed for Live Recorder (around 100x vs less than 2x). It just shows there really is no free lunch. Thus, DrDebug is more of a lab tool than something you could deploy for ongoing recording of running programs, even for a test build on a deployed system.
Another limitation is the limited length of the recording buffer. It has to be limited for obvious reasons, but when it runs out, it just “rolls over”, without saving the state of the program. Thus, a replay will not start from the same initial state as the real run if you pick things up in mid-run. I did not find any clear answer to how this works in practice. I do know of quite a few more limited recording systems from the past that did the same, and basically made the assumption that program internal state is a function of the inputs, and given a long enough stream of inputs, you will get to the same state. I.e., all you need is the last N inputs and you should be fine. The long-defunct Zealcore system had the same idea, and fuzzing input replay from tools like Codenomicon does the same – bring up the system, and subject it to a series of inputs to hopefully force the same crash as seen before.
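The "last N inputs" assumption can be illustrated with a small Python sketch. Again, this is not taken from Live Recorder; the two state functions and the buffer size are invented to show when a rolled-over log is enough and when it is not.

```python
from collections import deque


def windowed_state(inputs, window=3):
    """State that depends only on the most recent `window` inputs."""
    recent = deque(maxlen=window)
    for value in inputs:
        recent.append(value)
    return sum(recent)


def accumulated_state(inputs):
    """State that depends on the entire input history."""
    total = 0
    for value in inputs:
        total += value
    return total


full_run = [5, 1, 4, 2, 8, 3, 7]

# A bounded recording buffer: once full, the oldest inputs silently drop off.
log = deque(maxlen=4)
for value in full_run:
    log.append(value)
truncated = list(log)  # only the last four inputs survive: [2, 8, 3, 7]

# Replaying the truncated log reproduces the windowed state...
windowed_ok = windowed_state(truncated) == windowed_state(full_run)

# ...but not the state that remembers everything it has ever seen.
accumulated_ok = accumulated_state(truncated) == accumulated_state(full_run)
```

Here `windowed_ok` is true and `accumulated_ok` is false. Real programs tend to sit somewhere between these two extremes, which is exactly why I would like to know how well the rollover works in practice.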
So overall, a nice take on a good idea, and I wish them luck. As a product person, I am really curious as to just where and how the particular set of limitations and capabilities in Live Recorder makes it a viable solution. Such use cases definitely do exist, and Undo Software lists a few examples including financial trading software in their white paper, but that threading limitation is annoying.
Here’s a nice case study giving a bit more detail on how exactly Live Recorder can be used to help automated testing in very large complex codebases (SAP HANA in this case) – http://undo.io/resources/case-studies/sap-hana-case-study/