A while ago I wrote a blog post on checkpointing in virtual platforms, and what it is good for. Checkpointing has been a fairly rare feature in virtual platform tools for some reason, but it seems to be picking up some implementations. In particular, I recently noticed that Cadence added it to their simulator solutions a while ago (2007 according to their blog posts). There are two blog posts by George Frazier of Cadence (“saving boot time” and “advanced usage”) that offer some insight into what is going on.
Note that checkpointing is nothing new to RTL-level HDL simulators, since that is a much more controlled environment than a general virtual platform. I think the Cadence blog put it quite well:
Save and restore (or restart) has existed in HDL simulators for years, but things are trickier if SystemC is involved. For one thing, SystemC simulators use external tools for compilation and linking: i.e. gcc. They have more or less a “black box” understanding of global variables, local variables, file descriptors and heap values that make up the simulation state at any point in time. When you throw in multiple threads implemented with application-level threading packages and the fact that C++ heap objects are impractical to save programmatically, it’s easy to see why save and restore tools for HDL simulators can’t be easily extended for SystemC.
I could not say it better myself. What is interesting is that the Cadence solution does solve this problem, in a limited way, for a limited use case. I have not looked at their solution in detail (I have not used it myself), but this paragraph indicates that the solution is essentially a complete dump of the memory contents of the simulation process:
During restart, all internal variables inherit the same values from the process as it existed at the time of save (for example, C variables declared static). While this behavior helps assure that SystemC state information is properly saved and restored, it can also leave variables that reference the process environment (like file descriptor and sockets to other processes) in limbo.
Doing it this way is heroic in effort but also quite limited in scope. If I look at the four operations for restoring from a checkpoint that I outlined in my previous blog post on checkpointing:
- Restore to same machine, same model
- Restore to different machine, same model
- Restore to same or different machine, updated model
- Restore to same or different machine, completely different model
It is clear that you can only do the first, as the solution will restore the state of an implementation of a model, not just its relevant state as is done in Simics checkpointing. The sole advantage of this approach is that it does work with arbitrary code. But it does not support any of the more powerful uses of checkpoints beyond simply not repeating work for a single user on a particular machine.
I don’t think a memory dump can travel even to a second machine of similar make and setup, since it will depend on the precise memory layout of the process when it starts. That layout is affected by the load order of DLLs and shared objects, which is hard to control. The versions of all libraries have to be exactly the same too. It is not even clear that a checkpoint survives an upgrade of the OS on the machine being used, as that will surely change the precise memory allocation. I would be happy, for the sake of the Cadence users, to be proven wrong, but in principle I think checkpointing done right requires models to be written explicitly to support it. Just like any serialization solution in any programming language.
I must admit that George Frazier does mention using checkpoints “weeks or even months” after the initial save, but there is no mention of changing the code of the model in that time frame. For me, checkpoints tend to live for years. I have some nice Simics demo checkpoints that have been with us for some five years at this point, surviving from Simics 2.0 to 2.2 to 3.0 to 3.2 to 4.0 to 4.2… thanks to the power of the Simics “save only explicitly defined state” principle, and to checkpoint updater functions that essentially rewrite old checkpoints to make them compatible with new and updated machine models.
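To make the "explicit state plus updater functions" idea a bit more concrete, here is a minimal sketch in Python. It is not how Simics implements it; the TimerModel class, its attributes, the version numbers, and the JSON format are all assumptions invented for illustration. The point is only that the checkpoint contains named state, so an old checkpoint can be rewritten step by step to fit a newer model.

```python
import json

CURRENT_VERSION = 3  # hypothetical version of the model's checkpoint format


class TimerModel:
    """Hypothetical device model; only what save() lists counts as state."""

    def __init__(self):
        self.count = 0
        self.reload_value = 0
        self.enabled = False
        self._cache = {}  # implementation detail, deliberately not saved

    def save(self):
        return {"version": CURRENT_VERSION, "count": self.count,
                "reload_value": self.reload_value, "enabled": self.enabled}

    def restore(self, state):
        self.count = state["count"]
        self.reload_value = state["reload_value"]
        self.enabled = state["enabled"]


def _v1_to_v2(state):
    # Assumed history: version 2 renamed "reload" to "reload_value".
    state = dict(state)
    state["reload_value"] = state.pop("reload")
    state["version"] = 2
    return state


def _v2_to_v3(state):
    # Assumed history: version 3 added "enabled"; old timers always ran.
    return dict(state, enabled=True, version=3)


UPDATERS = {1: _v1_to_v2, 2: _v2_to_v3}


def update_checkpoint(state):
    """Rewrite an old checkpoint one version at a time."""
    while state["version"] < CURRENT_VERSION:
        state = UPDATERS[state["version"]](state)
    return state


def load_checkpoint(text):
    model = TimerModel()
    model.restore(update_checkpoint(json.loads(text)))
    return model
```

A checkpoint written by the imagined version 1 of the model, such as `{"version": 1, "count": 7, "reload": 100}`, passes through both updaters and loads into the current model. Nothing in it depends on the memory layout or the compiled code of the model that wrote it.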
The other bit that I find interesting is what is considered the biggest headache: not the actual saving of the memory state, but how to handle open files and similar host operating-system connections:
If a save operation is performed when the file is open then problems can arise if the program attempts to write to the same file after restore (because the file descriptor associated with the open file will be in a different state after restore).
It is nice to have a way to solve this, but it is also pretty shocking that you have to solve it! It kicks off a mini-rant… In my rulebook for sound programming practice, a virtual platform model should not read, write, or otherwise access host resources directly in any way. All host dependencies should be handled via the simulation core and framework, in a manner that is checkpoint-safe, portable, and does not rely on any information from the host directly in the models. It is crucial to localize all such host interactions in specially written host connection modules that make sure all regular simulation modules run in a completely encapsulated and virtual world.
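A rough sketch of what I mean by a host connection module, in Python. The HostConnection and UartModel classes and the "opener" callable are hypothetical names of my own, not part of any real framework. The model only ever talks to the connection object; the connection object is the one place that knows about the host stream, and on restore it reopens the stream and rewinds it instead of trusting a stale descriptor.

```python
import io


class HostConnection:
    """Hypothetical framework-owned wrapper around a host output stream."""

    def __init__(self, opener):
        self._opener = opener      # callable returning a writable binary stream
        self._stream = opener()
        self.bytes_written = 0     # the only checkpointed state

    def write(self, data):
        self._stream.write(data)
        self.bytes_written += len(data)

    def save(self):
        self._stream.flush()
        return {"bytes_written": self.bytes_written}

    def restore(self, state):
        # Reopen the host resource and discard output produced after the save.
        self._stream = self._opener()
        self._stream.seek(state["bytes_written"])
        self._stream.truncate()
        self.bytes_written = state["bytes_written"]


class UartModel:
    """Hypothetical serial port model; it never touches host resources."""

    def __init__(self, console):
        self.console = console
        self.chars_sent = 0

    def transmit(self, byte_value):
        self.console.write(bytes([byte_value]))
        self.chars_sent += 1

    def save(self):
        return {"chars_sent": self.chars_sent}

    def restore(self, state):
        self.chars_sent = state["chars_sent"]
```

Whether truncating the output on restore is the right policy depends on the use case; appending or starting a fresh file are equally valid. What matters is that the decision lives in one module and not in every model that happens to print something.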
So, overall, hats off to Cadence for actually doing something, but keep in mind that it will be very limited until some discipline is exercised in modeling and state is considered as something separate from the implementation.
Some more on threads and checkpointing.
A key feature of the Cadence solution is that you do restore threads. This is also a key problem, in that it prevents most use cases of checkpointing.
The problem with threads is fundamental, in that they are an implementation mechanism, not a part of the target system state.
If you want to restore the state of a model to where it left off, and that model contains threads, you really expect to see the threads restored to the same state. This includes the state of their local call stack, registers holding local variables, the program counter, and the stack pointer. Such information cannot be saved in a portable and reliable manner, as it is directly derived from a particular compilation of a particular version of the code. The smallest change to the code or the addition of a single variable will throw the mechanism off. Not to mention modifying a model to add threads, remove threads, or use threads in a different way.
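A loose analogy, not a SystemC example: a Python generator behaves like a small cooperative thread, with its progress stored in a suspended call frame. The standard pickle module refuses to serialize such an object, for much the same reason as above. The dma_transfer_thread function and the try_checkpoint helper are names made up for this sketch.

```python
import pickle


def dma_transfer_thread(length):
    """Hypothetical thread-style model: progress lives in the call frame."""
    for index in range(length):
        yield index  # stand-in for waiting on the next simulation step


def try_checkpoint(obj):
    """Return the serialized bytes, or None if the object cannot be saved."""
    try:
        return pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return None


thread = dma_transfer_thread(8)
next(thread)
next(thread)  # the thread is now suspended partway through the loop

thread_saved = try_checkpoint(thread)        # implicit state in a frame
state_saved = try_checkpoint({"index": 2})   # the same progress as plain data
```

The position inside the loop is exactly the kind of information that is tied to one particular version of the code, while the plain dictionary holding the same progress serializes without trouble.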
A mechanism that converts the current thread location to an explicit state variable and then checkpoints that state is certainly possible to create. However, that is fairly complicated to implement, and an event-driven model will achieve the same with less complexity in implementation. Another advantage of a pure state variable and event-driven approach is that the state of the model can be investigated at any point in time, and potentially changed.
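Here is a minimal sketch of the event-driven alternative in Python, under the assumption of a trivially simple model that carries its own clock and a single pending event. The DmaModel class, its attribute names, and the fixed delay of 10 time units are all invented for illustration. All progress is held in named attributes, including the time of the pending event, so a checkpoint is just those values written out.

```python
import json


class DmaModel:
    """Hypothetical event-driven DMA: all progress is explicit state."""

    STATE = ("now", "remaining", "copied", "next_event_time")

    def __init__(self):
        self.now = 0
        self.remaining = 0
        self.copied = 0
        self.next_event_time = None  # a pending event is state too

    def start(self, length, delay=10):
        self.remaining = length
        self.copied = 0
        self.next_event_time = self.now + delay

    def run_until(self, time):
        """Process pending events up to and including the given time."""
        while (self.next_event_time is not None
               and self.next_event_time <= time):
            self.now = self.next_event_time
            self._on_word_done()
        self.now = time

    def _on_word_done(self, delay=10):
        self.copied += 1
        self.remaining -= 1
        # Post the next event only if there is work left to do.
        self.next_event_time = (self.now + delay) if self.remaining else None

    def save(self):
        return json.dumps({name: getattr(self, name) for name in self.STATE})

    @classmethod
    def restore(cls, text):
        model = cls()
        for name, value in json.loads(text).items():
            setattr(model, name, value)
        return model
```

A transfer checkpointed halfway through can be restored into a fresh object and run to completion with the same result as the original. Because the state is plain data, it can also be inspected or edited between save and restore, which is much harder when the progress is buried in a thread's stack.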
So I prefer TLM models not to use threads at all.