Reversing out of Reverse

The Intel Simics simulator version 7 removed a long-standing feature from the simulator framework. Reverse execution is no longer available. In its place, in-memory snapshots were introduced, which arguably offer most of the benefits at a lower implementation cost. What happened? I’ve been asked about the reasoning behind the change on several occasions since I left Intel. I’d like to share my perspective on the decision, as it highlights the challenges of turning an idea into a robust, shippable feature.

Disclaimer: This blog post reflects my own opinions and reflections, and not those of any previous or future employer.

Back in the Days

Reverse execution was one of the headline features of Virtutech Simics 3.0, launched in 2005. It was incredibly cool (still is), and we made a big splash with the feature. The “Simics Hindsight” product got its name from reverse execution. The launch was accompanied by good slogans like “rethink reality” and “time is now on your side”.

The DATE 2005 Virtutech booth, with reverse execution slogans.

We did some early demos to showcase the feature, including one that ran a video player forwards until it crashed and then backwards to debug the crash. This included seeing the video play backwards, which looked really cool. However, it was incredibly brittle. Setting breakpoints at the wrong point or doing the wrong sequence of forward and backward execution tended to result in a crashing simulator or a deviating simulation. Making the video play nicely required some very specific settings for just when the display was updated when going backwards.

There was hope that reverse debug would allow the simulator to be sold to the general software development community, and not “just” the embedded market. However, it turned out it was just too big a hurdle to move a software setup into the simulated environment for debug. Doable, but too cumbersome in practice (despite some clever convenience hacks developed by Simics developers who applied Simics reverse execution to debug Simics itself).

Using a system-level tool for application-level development and test requires massive investments in convenience features to even start to make sense. The tool has to be made part of standard flows and not just something brought out for difficult problems. This is much more likely to be the case for embedded applications, firmware, and low-level code where the simulator is used for most or all testing anyways – then reverse is a nice add-on feature. But as a tool for “generic” desktop/server software? Too hard.

Implementation Concept: Simple

The implementation of reverse execution and reverse debugging in Simics was described extensively in a 2012 paper and blog post series: https://jakob.engbloms.se/archives/1547. In short, the way it works is to take checkpoints regularly while simulating forward, and then “faking” reverse by jumping back to a checkpoint in the past and simulating forward to a point before the starting point. This makes the simulation stop at a point “in the past”, from the perspective of the user.
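To make the idea concrete, here is a minimal Python sketch of checkpoint-and-replay reverse. It is a toy, not Simics code: the class `ToySim`, its methods, and the checkpoint interval are all invented for illustration, and it assumes a fully deterministic simulation with no external inputs.

```python
import copy


class ToySim:
    """Toy deterministic simulator (hypothetical, for illustration only)."""

    def __init__(self, checkpoint_interval=10):
        self.time = 0
        self.state = {"acc": 0}
        self.interval = checkpoint_interval
        # Always keep a checkpoint of the initial state.
        self.checkpoints = {0: copy.deepcopy(self.state)}

    def step(self):
        # Deterministic update standing in for executing one instruction.
        self.state["acc"] = (self.state["acc"] * 31 + self.time) % 1_000_003
        self.time += 1
        # Take an in-memory checkpoint at regular intervals while running forward.
        if self.time % self.interval == 0:
            self.checkpoints[self.time] = copy.deepcopy(self.state)

    def run_to(self, target):
        while self.time < target:
            self.step()

    def reverse_to(self, target):
        # "Reverse" = restore the nearest checkpoint at or before the target,
        # then simulate forward until the target time is reached.
        base = max(t for t in self.checkpoints if t <= target)
        self.time = base
        self.state = copy.deepcopy(self.checkpoints[base])
        self.run_to(target)


sim = ToySim()
sim.run_to(37)       # run forward, checkpoints taken at 10, 20, 30
sim.reverse_to(25)   # restores the checkpoint at 20 and replays 5 steps
```

From the user's point of view the simulation is now stopped at time 25, even though nothing ever actually ran backwards.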

Implementation Details: Not so Simple

The simulation framework must provide some specific abilities to make reverse execution and in particular reverse debugging work:

  • In-memory checkpointing – the ability to set the state of the model to a certain previously visited state.
  • Recording – recording and replaying asynchronous inputs to the simulator, as it is reversing and running forward repeatedly.
  • Breakpoint handling – this is what really distinguishes reverse debugging.
  • Reverse as a simulator state – features like scripting must deal with the reverse illusion correctly, and thus the simulator must have a defined execution state for reverse.

Breakpoint handling was covered in my previous blog posts and papers on the subject, but in short, “stop at the most recent breakpoint occurrence” is implemented by running the same segment of time twice. The first time, breakpoint hits are just noted without stopping the simulator. With this information, the simulator can then replay the time segment a second time, and ignore all breakpoint hits except the last one.
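The two-pass scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single checkpoint, a breakpoint on a toy "program counter" value, and a made-up deterministic `step` function. None of the names correspond to the Simics API.

```python
import copy


def step(state, t):
    # Deterministic toy update standing in for executing one instruction.
    state["pc"] = (state["pc"] * 7 + t) % 13


def reverse_to_last_breakpoint(checkpoint, cp_time, now, bp_pc):
    """Return (time, state) at the most recent breakpoint hit in [cp_time, now)."""
    # Pass 1: replay the segment and only note the hits, without stopping.
    state = copy.deepcopy(checkpoint)
    hits = []
    for t in range(cp_time, now):
        if state["pc"] == bp_pc:
            hits.append(t)
        step(state, t)
    if not hits:
        # A fuller implementation would retry from an earlier checkpoint.
        return None
    # Pass 2: replay again, ignoring every hit except the last one.
    last = hits[-1]
    state = copy.deepcopy(checkpoint)
    for t in range(cp_time, last):
        step(state, t)
    return last, state


result = reverse_to_last_breakpoint({"pc": 1}, cp_time=0, now=10, bp_pc=1)
```

The sketch only works because `step` is deterministic: the second pass must hit the breakpoint at exactly the same times as the first.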

Scripting is non-trivial when reverse gets involved. For example, consider a script that waits for a breakpoint. If the point of the script is to implement a work-around for a model limitation or software issue by taking an action each time the breakpoint is hit, it clearly should be triggered on each breakpoint hit. On the other hand, if the point of the script is to perform debug actions it should not be triggered until the last breakpoint in a reverse action is hit.

It is obvious that supporting reverse execution and debugging is more complicated than just running forward. It required a long period of testing, tuning, and fixing to make it truly solid. Some corner cases only manifested themselves a decade after the feature originally launched. But in practice the framework was usually good enough. After all, it is just a single code base and improvements accumulate over time.

Implementation Details: Models…

The biggest issue for reverse execution in virtual platforms was always model support. Supporting reverse execution puts additional requirements on models. Unlike the simulator framework, new models are written all the time, and improvements do not accumulate as nicely.

A model needs to support the following to work well under reverse:

  • Saving in-memory checkpoints, which for a model means providing its state to the framework to save.
  • Restoring in-memory checkpoints, by loading a previously saved model state.
  • Deterministic execution, so that replaying each time segment gives the same result each time.
  • No reliance on scripts for correct execution.

These requirements are not all that hard to meet provided the model is designed with them in mind from the start. I have done it. But it is easy to miss something.

Typical problem areas are complex cached internal state (when to clean up and rebuild the caches?), model aspects like register aliases (where do you and don’t you save the state?), or models that expect to call other models as part of their setup (when exactly is it safe to make such calls?). Models that rely on scripts to provide workarounds for software or model limitations are a very real problem (this pattern was not considered when reverse was designed).

What makes this extra painful is that a single bad model can ruin the whole simulation. All the models in a system must be reverse-execution capable. Using non-Simics-native models tends to break both checkpointing and reverse.

Let’s go through just how this works in practice.

Starting a Model from Scratch

The typical startup of a simulated system (in Simics) gets the system set up in a few steps:

  • All objects get created, with all attributes at default values.
  • Some object attribute values are changed from the creation defaults, from the system setup code and scripts. This happens exactly once.
  • The model-internal setup code is run after all objects have all their attributes set.
  • The system setup code finally runs some actions that are global, after all objects have been initialized. This also includes starting scripts that will be running during the simulation – and critically, such scripts might be doing workarounds for model issues, logically being part of the model but physically separated from them.

This process is basically “running code forward”. All virtual platform models support this flow, and it is a flow that is tested each and every time a new simulation is started.  

Starting from a Checkpoint

Starting a new simulation from a checkpoint is slightly different. The steps are:

  • Create the objects with defaults – exactly like before.
  • Set object attribute values to the values saved in the checkpoint. This covers all attributes, not just the ones that were changed in the setup that started the original simulation. The order of attribute setting might also be different from what is seen when setting up the simulation using startup scripts (especially if some attribute modifications are done in the final script stage).
  • The model-internal setup code is run after all objects have all their attributes set.

Notably, no setup scripts are part of this flow, since a checkpoint does not contain any scripts. It is possible to have a script wrapping the loading of a checkpoint and starting runtime scripts – but it is not part of the simulator framework flow. This means that workaround scripts are not included when starting from a checkpoint, which can result in a different behavior compared to the simulation from which the checkpoint was saved.

Thus, making a model checkpointable requires a bit more care than creating a model that can just be created once and then used in a simulation. The crucial assumption is that once the setup scripts have finished, the state of all the attributes of all the objects in the simulation is complete and consistent. The models must also work without any scripts.

In practice, making a model checkpointable requires additional tests that have to be explicitly added to the CI system.
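A checkpoint round-trip test might look roughly like the following Python sketch. The `Uart` model, its attribute names, and the `finalize`, `save_checkpoint`, and `load_checkpoint` helpers are hypothetical stand-ins for the flows described above, not real Simics interfaces.

```python
class Uart:
    """Toy device model (hypothetical) with two checkpointed attributes."""
    ATTRS = ("baud", "tx_count")

    def __init__(self):
        # Creation defaults.
        self.baud = 9600
        self.tx_count = 0
        self.divisor = None  # derived state, not saved in checkpoints

    def finalize(self):
        # Model-internal setup, run once after all attributes are set.
        self.divisor = 1_843_200 // (16 * self.baud)

    def send(self, byte):
        self.tx_count += 1


def fresh_start():
    uart = Uart()
    uart.baud = 115200   # setup script overrides one default
    uart.finalize()
    return uart


def save_checkpoint(obj):
    # A checkpoint holds all attribute values, not only the changed ones.
    return {name: getattr(obj, name) for name in obj.ATTRS}


def load_checkpoint(cls, saved):
    obj = cls()                       # create with defaults
    for name, value in saved.items():
        setattr(obj, name, value)     # set every attribute from the checkpoint
    obj.finalize()                    # then run model-internal setup
    return obj


original = fresh_start()
original.send(0x41)
saved = save_checkpoint(original)
clone = load_checkpoint(Uart, saved)
```

The test then checks that the clone reports the same attributes and the same derived state as the original. A real test would also run both copies forward and compare their behavior.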

Restoring In-Memory Checkpoints

In-memory checkpoints put further requirements on models. Restoring an in-memory checkpoint means only setting the attributes of an object, with no other actions. There is a single step:

  • Set object attribute values to the values saved in the checkpoint.

The attribute values are set in objects that already exist. I.e., there is no object creation phase. Neither is the model-internal setup code called, which is logical since no object is being created. Just like with checkpoints, no scripts are invoked.

In practice, this adds another level of difficulty to the model coding. For example, a checkpoint-compatible model might rely on model-internal setup calls to do things like allocating internal buffers based on attribute values. Generalizing to in-memory checkpoints requires working only from attribute values and attribute setter functions.
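The buffer example can be illustrated with a small Python sketch. Both classes and the two helper functions are invented for this post and only mimic the flows described above. They are not how Simics models are actually written.

```python
class FinalizeOnlyFifo:
    """Sizes its buffer in the one-time setup hook (hypothetical model)."""

    def __init__(self):
        self.depth = 4       # attribute with a creation default
        self.buffer = None   # internal state derived from the attribute

    def set_depth(self, value):
        self.depth = value

    def finalize(self):
        # Only runs when the object is created, never on in-memory restore.
        self.buffer = [0] * self.depth


class SetterDrivenFifo:
    """Rebuilds derived state in the attribute setter (hypothetical model)."""

    def __init__(self):
        self.depth = 4
        self.buffer = [0] * self.depth

    def set_depth(self, value):
        self.depth = value
        self.buffer = [0] * value   # derived state follows the attribute

    def finalize(self):
        pass


def load_checkpoint(cls, attrs):
    obj = cls()                      # create, set attributes, run setup
    obj.set_depth(attrs["depth"])
    obj.finalize()
    return obj


def restore_snapshot(obj, attrs):
    obj.set_depth(attrs["depth"])    # attributes only: no creation, no setup


a = load_checkpoint(FinalizeOnlyFifo, {"depth": 8})
restore_snapshot(a, {"depth": 2})    # a.buffer still has 8 entries: stale

b = load_checkpoint(SetterDrivenFifo, {"depth": 8})
restore_snapshot(b, {"depth": 2})    # b.buffer now has 2 entries
```

Both models pass the checkpoint-loading flow, but only the setter-driven one ends up consistent after an in-memory restore.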

This means that a model passing checkpointing tests might still not work for in-memory checkpoints and reverse execution.

Determinism

A key part of reverse execution is strict determinism. Every run of a segment of code from a certain initial state absolutely must result in the same end state. This is another case where subtle issues can occur – it is possible that a model behaves deterministically in a particular environment, but fails when compiled for a different host or even when used in a different target system due to differences in how model interfaces are called.

Checking that a model is deterministic is another special type of test that would not be covered by simple “does it work” tests. It really calls for powerful unit tests. Testing for determinism at the system level is necessary to ensure that a system is deterministic in aggregate. But if a system turns out to be non-deterministic, debugging that at the system level is really hard. Therefore, ideally, each model should be tested in isolation and determinism issues found at the unit level.
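A unit-level determinism check could be sketched like this in Python. The helper `is_deterministic` and the two toy timer models are made up for illustration. The non-deterministic one leaks hidden state across runs, which is one of several ways a real model can go wrong.

```python
import copy


def is_deterministic(make_model, inputs, runs=3):
    """Apply the same inputs from the same initial state and compare end states."""
    end_states = []
    for _ in range(runs):
        model = make_model()
        for value in inputs:
            model.write(value)
        end_states.append(copy.deepcopy(model.get_state()))
    return all(state == end_states[0] for state in end_states)


class GoodTimer:
    """All behavior depends only on the saved state and the inputs."""

    def __init__(self):
        self.count = 0

    def write(self, value):
        self.count = (self.count + value) & 0xFFFF

    def get_state(self):
        return {"count": self.count}


class LeakyTimer(GoodTimer):
    """Behavior also depends on hidden state that outlives a single run."""
    _calls = 0   # shared across instances and not part of the saved state

    def write(self, value):
        LeakyTimer._calls += 1
        self.count = (self.count + value + LeakyTimer._calls) & 0xFFFF


good = is_deterministic(GoodTimer, [1, 2, 3])
leaky = is_deterministic(LeakyTimer, [1, 2, 3])
```

Here `good` comes out true and `leaky` false. A model like `LeakyTimer` can pass every functional test and still break replay.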

Model Support for Reverse

Given these constraints, it should come as no surprise that in practice, reverse debugging only works for a few well-tested and proven-in-use system models. Among those is the Simics generic PC model, the QSP, which provides a way to use Simics to debug Simics with reverse. Most models do not support reverse.

The model problem was not noticeable in the early days of reverse execution. In the 2000s, the Simics team was small, the models were small and built by the core development team. Models generally supported both checkpointing and reverse execution. The models also tended to have a long life as they were sold to users using the same hardware for a long time – basically, providing models of existing hardware to improve (embedded) software development. There was time to polish models and make sure they supported features like reverse.

After Simics was acquired by Intel, model development was scaled up tremendously and changed in nature. More and bigger models are being built by more and different teams. The primary driver of model development is delivery time, to meet requirements from pre-silicon project plans. Models are usually not maintained after the corresponding hardware is released. Models are constantly updated during their lifetime to track changes to the hardware and system design. This all results in models with less time spent on polishing. In practice, no models of real hardware support reverse execution.

Thus, what we are left with is a feature in the simulator framework that can be demoed using well-behaved models, but that fails to work for most real models.

Replacing with “Snapshots”

The designated replacement for reverse execution in Simics is “snapshots”, which means in-memory checkpoints without the reverse framework. This removes the framework mechanisms related to reverse, keeping only the in-memory checkpointing system. That is still a bit of code as it has to deal with differential state in memory images and other details that I glossed over above, but it is a radical simplification.

As I see it, the snapshot feature provides a 90% solution compared to full reverse.

Debug: Manual Faked Reverse

Snapshots can be used to improve the debug experience. Instead of restarting a whole run from scratch, a snapshot can save a state within the confines of a simulation session and then quickly go back to it. This makes it easier to iterate debugging a small section of recent code. It won’t replicate the “run backwards to most recent breakpoint” functionality, but in the absolute majority of cases it is still a time-saver.

The same functionality could be achieved using complete checkpoints, but it is much faster and easier to jump back to a snapshot than to start a whole new simulation session over. It also retains all scripts, breakpoints, and other session state.

Fuzzing: Perfect Match

The core idea of fuzzing is to repeatedly pull the software under test back to an initial state and then apply different inputs. By observing the effects of the inputs on the code execution and the results, a fuzzer can generate new inputs that explore the execution paths and state space of the software.

Software fuzzing on top of the Simics simulator has become a popular use case. We presented a paper on it at DVCon Europe 2023, for example. The Excite project at Intel published an early application in the 2016-2017 time frame. More recently, the TSFFS project from Intel was made open-source. The DARPA Cyber Grand Challenge used fuzzing and reverse execution to both find and debug issues in code.

Fuzzing is an obvious match for in-memory snapshots. However, in previous versions of the Simics simulator, using in-memory snapshots meant activating the whole reverse execution framework, which sometimes added significant overhead to the snapshot save and restore process.

It is expected that fuzzing will run faster with pure snapshots, and arguably fuzzing is one of the best arguments for the simulator and its models supporting in-memory checkpoints.

Snapshots and Models?

From the model perspective, just how much does snapshotting simplify things compared to reverse execution? I do think snapshots are easier on models.

  • They do not necessarily need determinism. Since the semantics of a snapshot is to represent a point in time, a non-deterministic simulator will still “work”. It would be a bummer to use it for repeated debug of the same code. For fuzzing, non-determinism is OK, if not ideal, as it might make it harder to replicate a failure. At least it is technically not unsound.
  • Workaround scripts can be made to work. If all a script does is to wait for some event and then poke a model, such a script will work with snapshots as all the state is in the model. If a script has a state machine, not so much. With classic reverse, the scripts would have had to consider the reversing state of the simulator.

The biggest problem is the state handling – the model must still support having its state set multiple times during a simulation. For example, fuzzing in the simulator in practice consists of repeated in-memory checkpoint restores, with additional actions to start and finish each test case.
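The restore-run-check loop can be sketched in Python as follows. `ToyTarget`, its crash condition, and the `fuzz` function are hypothetical and stand in for the simulated system and the fuzzer harness. A real fuzzer would also use coverage feedback to generate the next inputs.

```python
import copy


class ToyTarget:
    """Stand-in for the simulated software plus all model state."""

    def __init__(self):
        self.state = {"crashed": False, "bytes_seen": 0}

    def run(self, data):
        self.state["bytes_seen"] += len(data)
        if data.startswith(b"\x42"):        # made-up crash condition
            self.state["crashed"] = True    # sticky until state is restored


def fuzz(target, inputs):
    snapshot = copy.deepcopy(target.state)       # take the snapshot once
    crashes = []
    for data in inputs:
        # Restore: the state is set on the existing object, nothing is re-created.
        target.state = copy.deepcopy(snapshot)
        target.run(data)                         # start of test case: inject input
        if target.state["crashed"]:              # end of test case: collect result
            crashes.append(data)
    target.state = copy.deepcopy(snapshot)       # leave the target at the snapshot
    return crashes


target = ToyTarget()
crashes = fuzz(target, [b"A", b"\x42x", b"zz"])
```

Without the restore at the top of the loop, the sticky crash flag would make every input after the first crashing one look like a failure. That is the same requirement the text puts on models: the state must be settable many times within one session.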

Final Notes: When to Give Up?

As I said at the beginning, reverse execution offers a case study in the rise and fall of a product feature. It rises since the fundamental idea is great and the initial execution solid. Eventually, it falls down due to the difficulty of actually making it work at scale. Dropping the feature from the framework is a case of “kill your darlings”. The team spent almost two decades with this feature in the framework, so it is sad to see it go, but if it does not work with most models in actual use, there is no point in keeping it.

Simics reverse execution might have worked out as a sellable feature if the use case for the simulator had been modeling a small number of relatively simple targets for user-level software development. But that is not the market that Simics found.

In a way, it is a close miss. But it was time to give up on the feature.

Reverse Still Lives

However, there is a way to do reverse debugging today. Undo offers a record-replay reverse debugger for user-level Linux software, which is great for chasing down tricky bugs, especially those related to concurrency. By working at the user level, most system support issues go away (the tool still depends on the processor architecture). Getting it to work is still not trivial, but the problem is like that of the Simics framework (a single code base where improvements accumulate), not like that of the Simics models.
