Checkpointing: Meaningless, Difficult, or just Overlooked?

One thing that surprises me is how rare the feature of checkpointing or snapshotting is in the land of virtual platforms, despite its obvious benefits. Indeed, checkpointing was one of the first cool things demonstrated to me when I joined Virtutech back in 2002. Today, I could not imagine doing without it. Not having checkpointing is like having a word processor where you only get to save once, when your document is finished, with no option of saving intermediate states.

But not everyone seems to consider this an important feature, judging from its relative rarity in the world of EDA and virtual platforms. Why is this? Let’s look at some possible explanations.

But first, let’s examine the subject of this post a bit more. What is checkpointing, precisely?

What?

In short, it is the ability of a virtual platform or virtualization environment to save the state of an executing simulation to disk (or memory or something) and later bring the saved state back and continue the simulation as if nothing had happened.

In detail, there are four operations that need to be supported for this to be truly useful:


  • Saving and restoring to the same simulation system on the same host machine (i.e., into the exact same program binary for the simulation).
  • Restoring on a different machine (where different can mean a machine with a different word-length, endianness, and operating system).
  • Restoring into a bug-fixed version of the same simulation model.
  • Restoring into a completely different simulation model that happens to have the same state.
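As a rough illustration of these four operations, here is a minimal Python sketch. Every name in it (`FastTimer`, `DetailedTimer`, `save_checkpoint`, `restore_checkpoint`) is hypothetical and not taken from Simics or any other real simulator. It assumes the checkpoint is written as text (JSON), which sidesteps host word-length and endianness, and that models expose their state by attribute name, so that a bug-fixed or more detailed model can pick up the same state.

```python
import json


class FastTimer:
    """Hypothetical fast functional model of a wrapping countdown timer."""

    def __init__(self):
        self.count = 0
        self.reload = 0

    def step(self, cycles):
        # Fast mode: jump straight to the result without modeling each cycle.
        if self.reload:
            self.count = (self.count - cycles) % self.reload

    def get_state(self):
        return {"count": self.count, "reload": self.reload}

    def set_state(self, state):
        self.count = state["count"]
        self.reload = state["reload"]


class DetailedTimer(FastTimer):
    """Hypothetical detailed model: same state, but stepped cycle by cycle."""

    def step(self, cycles):
        if not self.reload:
            return
        for _ in range(cycles):
            self.count = self.count - 1 if self.count > 0 else self.reload - 1


def save_checkpoint(objects, path):
    """Write the state of every named object as host-independent text."""
    snapshot = {name: obj.get_state() for name, obj in objects.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2, sort_keys=True)


def restore_checkpoint(path, model_classes):
    """Rebuild objects from a checkpoint, using the model class given per name."""
    with open(path, encoding="utf-8") as f:
        snapshot = json.load(f)
    objects = {}
    for name, state in snapshot.items():
        obj = model_classes[name]()  # may be a different model than was saved
        obj.set_state(state)
        objects[name] = obj
    return objects


if __name__ == "__main__":
    import os
    import tempfile

    timer = FastTimer()
    timer.set_state({"count": 40, "reload": 100})
    timer.step(1_000_000)  # position quickly in fast mode

    path = os.path.join(tempfile.mkdtemp(), "timer.ckpt")
    save_checkpoint({"timer0": timer}, path)

    # Restore the same state into the detailed model and continue from there.
    detailed = restore_checkpoint(path, {"timer0": DetailedTimer})["timer0"]
    detailed.step(50)
    timer.step(50)
    print(detailed.get_state() == timer.get_state())
```

Because `restore_checkpoint` takes the class to instantiate as a parameter, the same file serves all four cases in this toy setting: the same model, another host, a fixed model, or a different model with matching state. A real simulator would also have to record the object list and connections, which this sketch leaves out.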

Why?

Let’s look at some use cases for checkpointing:

The last operation listed above, restoring into a completely different simulation model, is very interesting, since it carries with it the ability to change abstraction level. It is used in IBM Mambo (see a 2006 IBM paper that you now have to buy due to an annoying change in IBM policy) to exactly this effect, and in Simics for the Freescale QorIQ P4080 as well. It is also well exploited by academic research frameworks for Simics, such as GEMS and SimFlex. Essentially, the idea is to position the workload using a fast mode, and then switch over to a detailed mode. The advantage of doing this via a checkpoint is that you can farm out the experiments across many different hosts, save the precise starting point for future regression tests, and try different detailed settings from a known common starting position.

The most obvious use for checkpoints is to avoid repeating simulation work that does not add value, in particular booting operating systems. A modern OS boot easily takes billions of instructions (say 10 seconds on a dual-core gigahertz machine… do the math). Being able to save a simulation effort like this for instant reuse is such a standard part of how I work with virtual platforms that I could not imagine the pain of not having it.
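Doing the math under some loudly stated assumptions: the sketch below assumes both cores are busy for the whole boot and retire about one instruction per cycle at 1 GHz, so it gives an upper-end estimate. The simulator speed of 100 million instructions per second is a made-up figure for illustration, not a measurement of any real tool.

```python
# Back-of-the-envelope estimate; every input is an assumption, not a measurement.
BOOT_SECONDS = 10             # boot time on real hardware (figure from the text)
CORES = 2                     # "dual-core"
CLOCK_HZ = 1_000_000_000      # "gigahertz machine", taken here as exactly 1 GHz
IPC = 1.0                     # assumed instructions per cycle per core
SIM_SPEED_IPS = 100_000_000   # hypothetical simulator speed, instructions/second

boot_instructions = BOOT_SECONDS * CORES * CLOCK_HZ * IPC
sim_seconds = boot_instructions / SIM_SPEED_IPS

print(f"{boot_instructions:.0e} instructions per boot")         # 2e+10
print(f"{sim_seconds:.0f} s of simulation time per boot")       # 200
```

With these numbers, every test run that starts from a cold boot pays minutes of simulation before reaching the interesting part, which is the cost a checkpoint taken after boot removes.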

Checkpointing is also a useful communications tool: it makes it possible for any user of a virtual platform to precisely communicate the system state and configuration to anybody else with access to the same virtual platform system (note that a checkpoint, at least in Simics land, contains the list of objects in the simulation and how they are connected, so you do not need any other description of the simulation setup). This helps in debugging models: a user testing a model can easily package up problems and report them to the modeling team. And it helps in debugging software running on the virtual platform, as a tester can package up the precise system state right before a bug hits and send it back to development. Incredibly powerful! Here, portability of checkpoints across hosts is obviously very important, as is portability across model versions. Once you have a fix for a model bug, you test it using the checkpoint, and check that things now proceed as they should.

Checkpointing also comes in handy as a backup-save ability when configuring an interactive target system. In many cases, the loading and configuration of software on a target is a very valuable and hard-to-repeat-exactly activity. Adding software, configuring it, starting servers, assigning network addresses, and configuring communications paths for backplanes can take a lot of time. On physical machines, or on virtual platforms without checkpointing, if you mess up, you have to go back and start over. With checkpointing, you can incrementally save work as you go along. This is a common use case for the snapshotting ability in VMware, for example. But it works equally well for embedded targets modeled as virtual platforms.

There are more uses; the paragraphs above just scratch the surface of the utility of checkpoints.

Why Not?

But despite the obvious benefits, this feature is very rarely found in virtual platforms. I can see three main lines of argument:

  • Meaningless: For tests comprising only short software runs, like a few million or tens of millions of instructions, rerunning them is fast enough, or the changes between runs are major enough, that checkpointing seems pointless. I can buy that, but only until the simple target is part of a greater context. If a DSP, for example, is part of a big system setup, you want to save its state even if it is only running a few small million-instruction loops.
  • Difficult: I think this might be the most important explanation. Doing checkpointing right puts requirements on the simulation kernel and on all processor and device models. All models have to be coded with discipline so that all state is available and can be set at any point in time. In particular, this means that explicit threading, as employed in SystemC SC_THREAD, is out. It must also be admitted that certain types of models, like detailed processor models, can be very difficult to serialize to and deserialize from disk, simply due to the enormous intricacies of their implementations. But had they been designed with checkpointing in mind from the start, it would have been less difficult.
  • Overlooked: The virtual platform was designed without thinking of checkpointing. Alternatively, no customers asked for it, so it was not built.
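The point about explicit threading can be illustrated with a loose Python analogy (this is not SystemC). A Python generator, somewhat like an SC_THREAD, keeps part of its state in a suspended execution context, and the standard `pickle` module refuses to serialize it. The same behavior written as an object with explicit attributes is trivial to save and restore. All class and function names below are hypothetical.

```python
import pickle


def blinker_thread(period):
    """Thread-style model: progress hides in locals and in the suspension point."""
    led = False
    while True:
        for _ in range(period):
            yield led          # "wait" one cycle
        led = not led


class BlinkerModel:
    """Event-style model: all state lives in attributes, readable at any time."""

    def __init__(self, period):
        self.period = period
        self.elapsed = 0
        self.led = False

    def tick(self):
        out = self.led
        self.elapsed += 1
        if self.elapsed == self.period:
            self.elapsed = 0
            self.led = not self.led
        return out


def save_state(model):
    """Serialize the model's attribute dictionary."""
    return pickle.dumps(dict(vars(model)))


def load_state(blob):
    """Create a fresh model and overwrite its attributes from the saved state."""
    state = pickle.loads(blob)
    model = BlinkerModel(state["period"])
    vars(model).update(state)
    return model


if __name__ == "__main__":
    thread = blinker_thread(3)
    next(thread)
    try:
        pickle.dumps(thread)
    except TypeError as err:
        print("thread-style model cannot be saved:", err)

    model = BlinkerModel(3)
    for _ in range(4):
        model.tick()
    clone = load_state(save_state(model))
    print([model.tick() for _ in range(5)] == [clone.tick() for _ in range(5)])
```

The two models produce the same output sequence, but only the second one can be stopped mid-period, written out, and resumed elsewhere. That is the coding discipline the "Difficult" argument refers to.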

I find the last argument very interesting, since I can see what happens once you have tried checkpointing. In my experience, once a user of a virtual platform has tried checkpointing, they want it. It goes from an interesting idea to a must-have feature very quickly. No arguments about why it is hard or why they can do without it work, as they have seen how things should be done.

For me, I think it is akin to my first encounter with a Macintosh computer, and the concept of “undo” in programs. Before that, I was happily editing code on a ZX Spectrum, in an environment where “undo” meant “manually remember how it looked and change it back”. I had no problems with that, but once I saw how things could be done, there was no going back.

10 thoughts on “Checkpointing: Meaningless, Difficult, or just Overlooked?”

  1. It does not look quite as “overlooked” anymore, considering that SystemC virtual platform systems from both Cadence and CoWare now support it, albeit in a very limited form: essentially just restarting on the same machine.
