There are some things in computing that seem “obviously” true and that “clearly” make certain things “impossible”. One example is the idea that you cannot go backwards in time from the current state of a program or computer system and recover previous state just by reversing the semantics of the instructions in the program. In particular, that you cannot take a core dump from a failed system and reverse-execute back from it – how could you? In order to do reverse debugging and reverse execution, you “have to” record the state at the first point in time that you want to be able to go back to, and then record all changes to the state. Turns out I was wrong, as shown by a recent USENIX OSDI paper.
REPT: reverse debugging of failures in deployed software
Proving my preconceptions wrong was a paper called “REPT: Reverse Debugging of Failures in Deployed Software”, published at the USENIX OSDI (Operating Systems Design and Implementation) 2018 conference by Weidong Cui, Xinyang Ge, and Ben Niu from Microsoft Research; Baris Kasikci and Upamanyu Sharma from the University of Michigan; Ruoyu Wang from Arizona State University; and Insu Yun from the Georgia Institute of Technology.
REPT means “Reverse Execution with Processor Trace”, and the approach is based on using Intel Processor Trace to capture the control flow of a program, along with a core dump of the final state of memory and registers at the point of a crash (or other error). So, to be really precise, it is not actually “reverse from a core dump”, since it requires the instruction trace information to be available – but it still recovers information about past register and memory values starting from the current state. Which is a pretty amazing feat in itself.
Note that using an instruction flow trace together with a dump of the initial state of the program would be a rather simple and typical way to build a reverse debugger. That is essentially what systems like Simics and other “re-execution based” reverse debuggers do. The core trick in REPT is to recover past values using just the execution history and the current state.
REPT uses hardware tracing (Intel Processor Trace on Intel processors) to record the past execution of each thread in the program. This captures the execution history in terms of which code was executed, but not the values of memory and registers. The tool uses a circular trace buffer that continuously captures the most recent execution. When a crash happens, it is thus easy to capture the current state as a core dump plus the trace buffers for each thread.
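To make the recording model concrete, here is a toy Python sketch of the idea. This is conceptual only – the real trace buffers are configured in hardware and managed by the OS, and all the names here are my own inventions for illustration: each thread logs control-flow packets into a fixed-size ring buffer, and a crash handler bundles the buffers with a dump of the final state.

```python
from collections import deque

class ThreadTrace:
    """Toy stand-in for a per-thread hardware trace buffer."""
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entries automatically,
        # so only the most recent execution history is retained.
        self.packets = deque(maxlen=capacity)

    def record(self, packet):
        # In reality, Intel PT records control-flow packets in hardware;
        # here a "packet" is just a placeholder (e.g., a taken branch).
        self.packets.append(packet)

def on_crash(traces, registers, memory):
    # Everything the offline analysis needs: the final state (core dump)
    # plus the recent control flow of each thread.
    return {
        "core_dump": {"registers": registers, "memory": memory},
        "traces": {tid: list(t.packets) for tid, t in traces.items()},
    }
```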
Classification
To get a sense for where this fits among the many other variants of reverse debugging that I have looked at over the years, it is good to take a look at what it supports and does not support.
- Windows target only, but that does not seem to be inherent in the technology itself – this reproduction technology ought to work for any OS.
- Requires the hardware to support Intel Processor Trace, available from Broadwell and forward (2014). The paper says that the ARM Embedded Trace Macrocell should also provide sufficient information to implement REPT.
- Applies to user-level programs, not system-level code.
- Handles multi-threaded software and true multicore concurrency.
- Works at the binary level (i.e., does not require a language virtual machine).
- Handles code emitted by just-in-time (JIT) compilers, as long as the memory dump still contains the code that was executed. I.e., if a JIT compiler overwrites its own generated code in the middle of a trace, the early part of the trace will not be valid.
- Overhead is very low since only instruction traces are collected using Intel Processor Trace – the chosen approach was designed as a way to gain performance compared to the previous WinDbg Time Travel Debugging (TTD) approach.
- There is a problem with I/O-intensive software, since a syscall introduces a lot of unknown data changes that REPT currently does not try to resolve.
One key property of the approach is that it has an “event horizon” beyond which it cannot see: going backwards from the present, at some point there is too much uncertainty in the recovered information for it to be useful.
Overall, this is a decent set of capabilities and definitely applicable to real software – which the paper also shows by using the tool to analyse a set of known bugs from a variety of real-world open-source software like PHP, Python, PuTTY, Apache, and LibreOffice. It is also being used inside Microsoft for software debugging.
Recreating Past State
The paper shows that you can (most often) recover memory and register contents from the past (a previous point in time) from the current state, even with instructions that “destroy” information like “XOR RAX, RAX”. The trick is to look at how information flows between locations and instructions, and build an inference graph backwards in time. This graph mixes register values, memory values, and the values of pointers to other memory locations. The algorithm can correct previously derived values if they are later proven wrong by new information. Even if a particular value is destroyed locally in a register, there are often (usually, it seems) traces of it in other parts of the state that can be used to recover it.
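To illustrate the core idea, here is a toy Python sketch of backward value recovery over a two-instruction trace. This is my own simplification, not the paper’s algorithm, and the instruction representation is invented: even though “XOR RAX, RAX” destroys RAX, an earlier “MOV RBX, RAX” left a copy of the value in RBX at crash time, from which the analysis can reconstruct the pre-XOR value of RAX.

```python
UNKNOWN = "?"  # marks values the analysis cannot reconstruct

# Final (crash-time) register state: RAX was zeroed by "xor rax, rax",
# but a copy of its old value survived in RBX.
final_state = {"rax": 0, "rbx": 42}

# Recorded control flow, oldest first: (opcode, destination, source).
trace = [
    ("mov", "rbx", "rax"),  # rbx := rax  (copies rax into rbx)
    ("xor", "rax", "rax"),  # rax := 0    (destroys rax)
]

def recover(trace, final_state):
    """Walk the trace backwards, returning the state before each instruction."""
    states = []
    state = dict(final_state)
    for op, dst, src in reversed(trace):
        before = dict(state)
        if op == "xor" and dst == src:
            # dst was zeroed; its prior value is unknown from this
            # instruction alone (REPT would look for copies elsewhere).
            before[dst] = UNKNOWN
        elif op == "mov":
            # src is unchanged by the mov, and afterwards dst == src,
            # so the source value is recoverable from either copy.
            before[src] = state[src] if state[src] != UNKNOWN else state[dst]
            before[dst] = UNKNOWN  # the old dst value was overwritten
        states.append(before)
        state = before
    states.reverse()
    return states

states = recover(trace, final_state)
print(states[0]["rax"])  # -> 42: recovered despite the destructive XOR
```

The real algorithm iterates both forwards and backwards over the trace, and handles memory locations via recovered pointer values, but the flavor is the same.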
The algorithm is described in detail in the paper, and makes for a very good read. However, it makes little sense for me to dig into it and explain it here. It is more relevant to take a look at just how this can be used.
It is important to notice that the algorithm might be unable to determine values for all locations at all points in time. Indeed, this happens to some extent for all the test cases discussed in the paper. However, this does not mean that the results are not useful – the recovered values are usually sufficient to debug the bugs they looked at. Intuitively, this makes sense – how often do you actually need to look at ALL the variables in a program at a certain point in time in order to figure out what caused a bug? Normally, it is enough to look at a few values.
Concurrency
As mentioned above, REPT handles concurrent execution of multiple threads in a program. The recording of the trace proceeds independently for each thread, with local time stamps issued at regular intervals in each trace – note that this does not happen on each instruction. The instruction trace is split into segments at each time stamp. It is not clear from the paper how often these time stamps happen; it seems to depend on the hardware being used.
Since x86 is TSO, what is needed next is essentially to order all the stores in the program in a linear sequence. The per-thread trace time stamps provide a partial ordering of all instructions, allowing the REPT system to find execution segments that overlap in time. For such “concurrent” segments, REPT arbitrarily linearizes them into a concrete order. This works “well enough”, especially with some special handling of writes to common locations in concurrent segments.
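As a rough illustration, here is a small Python sketch of such a linearization. The segment representation is my own invention, not REPT’s actual data structure: segments whose timestamp ranges do not overlap are ordered by time, while overlapping (“concurrent”) segments end up in some arbitrary but fixed order.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    thread: int
    start_ts: int  # hardware timestamp at the start of the segment
    end_ts: int    # hardware timestamp at the end of the segment

def linearize(segments):
    # Sorting by start time respects the partial order that the timestamps
    # provide: if segment A ended before segment B began, A sorts first.
    # Segments that overlap in time ("concurrent" segments) end up in an
    # arbitrary but deterministic order -- the "good enough" choice.
    return sorted(segments, key=lambda s: (s.start_ts, s.end_ts, s.thread))

segs = [
    Segment(thread=0, start_ts=0,  end_ts=10),
    Segment(thread=1, start_ts=5,  end_ts=15),  # overlaps thread 0: concurrent
    Segment(thread=0, start_ts=20, end_ts=30),  # strictly after both
]
for s in linearize(segs):
    print(s)
```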
The main limitation they point out is that if one thread writes a pointer that is used by another thread within a concurrent segment, the value analysis might go astray, as it would follow the wrong pointer value. Imperfect, but often good enough.
They stress-test the concurrency handling using the Racey stress-testing benchmark, and find that even for this extreme case, only 6% of all segments feature concurrent accesses to the array. Thus, they claim that in most cases, concurrency is a small problem.
OS Calls
OS calls are simply handled as unknown code – any memory writes are unknown, and all registers that are volatile according to the calling convention are marked as unknown. This means that OS calls introduce quite a bit of uncertainty into the analysis. The authors note that this could likely be improved by more semantically aware recording of OS calls, such as is performed in rr when recording a run. Going backwards, this would probably mean figuring out more precisely what data in the core dump resulted from the system call – or something along those lines.
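As a sketch of what this means for the analysis (my own illustration; the helper name and state representation are invented), stepping backwards across a syscall forces the analysis to conservatively drop everything the kernel might have touched:

```python
# Integer registers a Windows x64 callee (and thus a syscall) is allowed
# to clobber ("volatile"); non-volatile registers survive the call.
VOLATILE_REGS = {"rax", "rcx", "rdx", "r8", "r9", "r10", "r11"}

UNKNOWN = "?"

def cross_syscall_backwards(regs, memory_facts):
    """Weaken the known state when stepping backwards over a syscall:
    pre-call values of volatile registers are unrecoverable, and memory
    observed after the call may have been written by the kernel."""
    before_regs = {r: (UNKNOWN if r in VOLATILE_REGS else v)
                   for r, v in regs.items()}
    # Without knowing the syscall's semantics, no memory fact derived
    # after the call can be trusted to hold before it.
    before_memory_facts = {}
    return before_regs, before_memory_facts
```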
In practice, given the rather short final execution sequences that REPT analyses, this does not prevent the tool from being useful. But it does exclude certain classes of problems from debugging.
Short Time Horizon
As already mentioned, REPT uses a circular buffer to capture the most recent execution history for each thread. This limits how far back REPT can go, but it seems sufficient to solve many real problems. Still, what is striking to me is the short length of the execution traces that they used when reproducing the bugs in the paper.
If I understand their tables right, most of the time they did not need to go more than 100k instructions back in order to find a bug, and often just a few thousand instructions were enough to diagnose a bug. Based on the kind of use I make of reverse debugging, I would have expected this to be too small to be useful. When using Simics, I often see jumps of millions of instructions when going backwards – but that is using a full-system reverse execution approach. For REPT, only user-level activity counts, which might explain the difference at least in part.
The length of the traces that can be debugged is limited not just by the trace horizon but also by the fact that the further back you go, the more register and memory values end up being unknown. The authors ran some experiments to evaluate the accuracy (using WinDbg TTD as their ground truth), and showed that the accuracy dropped as they went further back. Still, some 80-90% of register values could be recovered as far back as 1M instructions. This is before running out of execution trace buffer, presumably.
Practical Power
REPT is integrated into WinDbg (which already had another way to do reverse execution). More interestingly, it is also integrated with Windows Error Reporting (WER) – developers can request that the WER system collect PT traces on user machines, and those traces are then forwarded to the developers. This deploys record-replay debugging at a scale never seen before.
As the paper says:
We have received anecdotal stories from Microsoft developers in using REPT to successfully debug failures reported to WER. The very first production bug that was successfully resolved with the help of REPT had been left unfixed for almost two years because developers could not reproduce the crash locally… With the reverse engineering enabled by REPT, the developer was able to step through the function based on the reconstructed execution history and quickly find out the root cause and fix the bug. In summary, a two-year-old bug was fixed in just a few minutes thanks to REPT.
Essentially, the first part of this comes down to “record-replay helps with reproduction”. This would be helpful even without a reverse debugger to exploit the recording. The second part is the expected and common observation that reverse debugging is just a lovely way to debug.
The paper notes that REPT is low overhead, especially when compared to previous approaches like WinDbg TTD. This is thanks to the efficiency of using hardware execution tracing and avoiding recording changing data.
The unique approach taken by REPT means that it has a somewhat different set of limitations compared to other reverse debuggers. I have already mentioned some above, like the fact that it cannot debug across the point where a JIT compiler overwrites existing code. The paper also mentions that code that gets stuck in an infinite loop is a hard case, since the circular trace buffer basically just contains the loop with no indication of how it was entered. System calls and mallocs can also cause problems for REPT since they introduce a lot of fuzziness in the value analysis.
Summary
Overall, REPT is a fabulous technical achievement that does something I would spontaneously have said was, if not impossible, at least pointless. But it is neither, and REPT produces a good-enough solution for many real-world problems. It might not be as general as other reverse-debugging solutions, but combining REPT with Windows crash dumps provides a way to get insight into rare bugs from the field, which is a huge benefit.
Count me impressed.