A new record, replay, and reverse debugger has appeared, and I just had to take a look at what it does and how it does it. “rr” has been developed by the Firefox developers at Mozilla Corporation, initially for the purpose of debugging Firefox itself. Starting a debugger project from the angle of attacking a particular program does let you get things going quickly, but the resulting tool is clearly generally useful, at least for Linux user-land programs on x86. Since I have tried to keep up with the developments in this field, a write-up seems to be called for.
The best overview of the technology behind rr is the talk by Robert O’Callahan, one of the lead developers, given at the Linux conference in Australia in late January of 2016. The talk can be found on YouTube at https://www.youtube.com/watch?v=ytNlefY8PIE. More details are written up in an online slide set at https://mozilla.github.io/rr/rr.html
The introduction to rr on http://rr-project.org/ provides a good overview of the benefits of record-replay and reversible debugging. It is all familiar, as I have worked with reverse debuggers since around 2005. It is worth pointing out, just like the materials around rr do, that having record and replay is a tremendous game-changer in debugging. Once a bug is repeatable, you are much further along towards fixing it. Reverse debugging then adds icing on the cake by allowing easier root-cause analysis.
They also call out a somewhat subtle benefit that I have not seen stated as clearly in the past:
> We also hoped that deterministic replay would make debugging of any kind of bug easier. With normal debuggers, information you learn during the debugging session (e.g. the addresses of objects of interest, and the ordering of important events) often becomes obsolete when you have to rerun the testcase. With deterministic replay, that never needs to happen: your knowledge of what happens during the failing run increases monotonically.
Thinking about it, we have done this with Simics since forever – reusing addresses and building up debug script logic over many reruns is entirely natural when determinism is your standard behavior. But we never really realized that this was a huge thing in and of itself. Nice point.
Another point worth spending a paragraph on is the relationship to gdb reverse debugging. Just like many other reverse debuggers, rr uses the standard reverse debugging commands in gdb as its frontend. It seems that nobody actually uses the built-in gdb reverse debugger, as it is just too slow to be useful – but having that primitive solution in place means that we have a set of reverse debug commands in the world’s most common debugger, and the presence of those commands makes it easier to deploy new reverse debuggers. Thus, what is essentially a toy demo implementation inside standard gdb still serves a good purpose in the ecosystem. UndoDB relies on those gdb commands as one of its frontends, and Simics can use the gdb reverse interface as well.
A key goal of the rr project was performance. In their opinion, speed makes a tool attractive, and slowness makes it unattractive. Makes sense. The question is always “how slow is slow?” Their aim was very aggressive – over 100% overhead (double the execution time) and you lose the audience. Under 50% is necessary, and they claim to have ended up at around 25% for many common runs of Firefox. With performance as the goal, they did some truly interesting optimizations to the recording process. Update: note that the performance issue here is multiplied by 1000x when doing reruns of the same test to “shake the system”. The more times you want to rerun the same thing, the more important performance becomes – especially when recording. Depending on the technology employed, the slowdown for reverse debugging can be more or less correlated with the slowdown of recording. For a tool like Simics, recording overhead is minimal, even though reverse debugging on a recorded session is quite a bit slower.
In particular, the recording of system calls was heavily optimized, as it was a major bottleneck. Plain ptrace is too slow, since each syscall ends up trapping into the rr process, with the accompanying context switches. The tracing was optimized by injecting code into the recorded program using seccomp-bpf, resulting in a shim layer that handles the 25 most common system calls within the context of the debugged program. This complicates the implementation, since all system calls must still work correctly, but the performance improvement is up to 100x on microbenchmarks. The mechanism is very dependent on Linux, and Linux seems to be the only OS that currently offers the necessary features. It is easy to understand why a Windows implementation would be essentially a complete rewrite; right now, Windows just plain lacks the features needed to make the same recording logic work.
The developers pointed out that rr could only have been done in the last five years; before then, there just wasn’t enough support in the kernel. In a way, rr is the most modern possible implementation of single-process, user-level, multiple-threaded record, replay, and reverse execution (updated: used to say single-threaded, which was plain wrong). Previous solutions had to do things in other, slower ways, simply because the mechanisms that rr now uses weren’t available. A bit like how modern VM systems on x86 all use VT-x, while in the past you had to rely on heroics like VMware’s binary rewriting of the target code.
A clever trick was used to allow gdb to call functions in the debugged program. This functionality, while very useful for calling custom inspection functions and the like, is not very sound and is potentially dangerous, as it changes the state of the debugged program. Indeed, the rr developers point out that there is a high chance of a function call resulting in a total program crash. Doing a function call into the debuggee clearly violates replay – so they use a classic Unix trick: spawn off a fork of the program and do the call in there. The fork starts from the same state but does not affect the state of the original process. Thus, you can do calls to things like JIT instrumentation code in the debugged program, while still retaining record-replay.
Probably the most interesting part of the rr implementation is how it handles the eternal issue of process scheduling. How do you replay the points where a program is interrupted by the OS because its time slice ended? For this, rr uses performance counter values to note how long the process has run since it was started, and on replay it sets up a performance counter interrupt that stops the process at the same point. This requires a reliable and deterministic counter, and they settled on the retired-branches counter in modern Intel processors. This counter is actually not entirely deterministic, as the developers note in a presentation found on GitHub.
That the system still works despite this is a bit mysterious, but it probably comes down to the serialized execution of the program. By removing true concurrency, the code is rendered less sensitive to the precise timing of task switches. Furthermore, if most interaction between threads goes through OS mechanisms or proper synchronization, the replay will likely be close enough for the “important” interactions between threads. It could also be that execution implicitly syncs itself up at syscalls; as long as a program makes a fair number of syscalls, this could be sufficient to keep the replay from diverging. A colleague of mine pointed out that if this imprecise switching had been applied to actually concurrent code, and especially kernel-level code, it would not have worked reliably at all. So, the limited scope of rr is what makes this work.
As always when reverse debuggers and record-replay systems are analyzed, we need to look at the assumptions and limitations that made the approach work. There is no free lunch, but the question is just what the cost is.
For rr, the main limitations are:
- Only supports user-space processes.
- Single-core execution – this is common to pretty much all other reverse debuggers except those built on full-system simulators like Simics (and a few recording systems like PinPlay). Running on a single core does limit the observable bugs to some extent: I have seen quite a few bugs requiring multiple cores acting simultaneously to trigger, like this one. However, there are many bugs that are triggered just by using multiple threads sharing a single core. The creators of rr said that this could maybe be solved with better hardware support – support that is not available on any architecture today.
- Update: as noted in the comment by Robert O’Callahan, by making the scheduling more chaotic errors can be triggered far more often. This is a known technique, and on Simics we have applied this by changing the length of time quanta for multiprocessor simulation. Similar idea.
- No shared-memory interaction with outside processes, since shared-memory operations cannot be recorded.
- Only supports Linux host, as discussed above.
- Requires a modern Intel processor in the host – less of an issue; everyone should be on something Sandy Bridge or newer by now on a desktop or laptop (unless you are on AMD). However, servers tend to live longer, and I still see old Westmere servers around from time to time.
- Does not work on AMD x86 processors, since they lack a performance counter with the required behavior. rr is very close to the edge in terms of processor behavior.
- No support for architectures like ARM, since they lack the necessary deterministic instruction-based counters. The problem is apparently (at least in part) that the load-locked/store-conditional atomic instructions have entirely unpredictable counts, since events like interrupts affect the counts of the user-level process in a way that cannot be corrected for. The user code will loop an unknown number of times in an LL-SC pair.
- On a Linux host, rr does not yet cover all system calls, just the ones they have needed so far. Developing in this agile way is an advantage of focusing on a single program – you can make a functional Firefox debugger far faster than you can make a debugger that works for any and all programs. It is just a matter of work, not any fundamental limitation.
- OpenGL and GPU cannot be supported – this is rather interesting. In the past, reverse execution never even bothered to list this as a limitation, since direct use of GPUs in a program was so rare. Today, GPUs are more important and visible to user-level software, and this limitation becomes worth noting. Still, most of Firefox can apparently be tested and record-replay debugged with direct GPU access turned off (according to the creators of rr).
One thing that I am wondering about, and where I am not really sure what the status is, is whether a recording can be moved from one machine to another. Since the recording and replay of task switches depend on performance counter events, and those events usually differ between microarchitectures, it would seem that there are limits to the transportability of recordings. I would be surprised if two different generations of Intel chips produced the exact same counts for the same program. In addition, what happens if you change the underlying OS kernel on the host? That might well affect the behavior and availability of OS calls. For user-level debuggers, the UndoDB LiveRecorder, Rogue Wave ReplayEngine, Microsoft IntelliTrace, and DrDebug all support recording on one machine and replaying on another, but they also use heavier implementation methodologies than rr does.
As a point of comparison, a simulator-based reverse, record, and replay debugger (such as the debugger built into Simics) can get around all these limitations, at the cost of slower execution and a higher implementation time cost.
Why yet another one?
rr seems to be a nice addition to the field of reverse debuggers. I like the approach of starting with record-replay and adding reverse execution as a secondary feature. That captures the essential problem of reproducibility.
But I have to wonder why it had to be created at all. It seems that the answer is “none of the others worked” (they have a related-work page at https://github.com/mozilla/rr/wiki/Related-work that explains why some solutions did not really work, but which unfortunately also seems to have missed quite a few commercial solutions such as RogueWave and Simics). I also have an overview of reverse debuggers from a few years ago.
Given the application, it would have seemed that UndoDB was the most obvious solution to adopt, but I suppose that an open-source project would have a problem relying on a commercial tool. The real issue here might be that gdb reverse debug is hopelessly slow, as well as not supporting record-replay in this manner. Or maybe it just indicates that reverse debugging in practice is a niche solution that comes with too many caveats to ever bloom into a general open-source tool like gdb. Given that, rr might be as good as it gets right now, if you are on a modern Linux, on an Intel processor, debugging user-level code that is contained inside a single process and that does not suffer from too many real concurrency bugs.