Time-Accurate Simulation Revisited – 15 years later

stopwatchA long time ago, when I was a PhD student at Uppsala University, I supervised a few Master’s students at the company CC-Systems, in some topics related to the simulation of real-time distributed computer systems for the purpose of software testing. One of the students, Magnus Nilsson, worked on a concept called “Time-Accurate Simulation”, where we annotated the source code of a program with the time it would take to execute (roughly) on the its eventual hardware platform. It was a workable idea at the time that we used for the simulation of distributed CAN systems. So, I was surprised and intrigued when I saw the same idea pop up in a paper written last year – only taken to the next level (or two) and used for detailed hardware design!

The idea inTAS was to compile and run the code for a real-time distributed system on a Windows PC, using an API that would compile to either the real machine or the target system (classic host-compiled API-based embedded system simulation). Using the timing annotations, we would regularly stop the simulation and wait for the real-world time to catch up with the time in the simulation, thus achieving a simulation that would run at about the same speed as the real system. This let us run multiple nodes in a CAN network together with correct relative timing, as well as integrating the simulated nodes with real hardware without getting strange timing effects. The underlying assumption obviously being that the Windows PC was quite a few times faster than the target system, which was certainly the case.

The code was annotated with breakpoints that would stop the current task and wait for the rest of the simulation and the world to catch up. For purely simulated setups, I think we allowed time to run both faster and slower than the real world, as that provided some additional testing value. It was a nice idea, and it was put into production use at CC-Systems from what I know. Guessing from publicly available information, the concept has survived into the tool that was first known as CCSimTech, but which is now sold as SimTecc from Maximatecc. CC-Systems has become part of a company called Maximatecc, apparently.

That was background.

The 2015 paper that I found has taken the idea of host-compiled API-based simulation with timing annotations in the source code to a whole new level. Maybe two or three levels above what we did back in 2002. The paper is Bringmann et al, “The Next Generation of Virtual Prototyping: Ultra-fast Yet Accurate Simulation of HW/SW Systems”, from DATE 2015. It describes an approach where the host-compiled simulator is used to replace a cycle-accurate instruction-set simulation in a virtual platform, used for detailed hardware performance analysis and architectural design. The goal is to gain orders of magnitude in speed without losing too much precision in timing; a traditional functionally correct virtual platform ISS like that used in Simics, Qemu, ARM FastSim, and similar is insufficient for hardware design at this level. The programs are assumed to use an OS API that is known and that is simulated as part of their VP. Behind the scenes, the OS API simulator does time accounting and integrates to a global time manager that computes penalties for accesses to shared resources. The OS API simulation can adjust the scheduling and timing of OS events to better approximate the behavior on the real hardware.

Compared to the TAS approach that we implemented, the target domain is rather different. In TAS, we were trying to simulate software for a distributed system. In this case, we are looking at driving a transaction-level virtual platform of an SoC, including buses and models of devices.
The overall approach taken to implement the “ultra-fast yet accurate” virtual platform is something that I would like to describe as approximate-and-adjust. The system would do things, compute a time for the operation, and then periodically take a look at all operations that are going on and add penalties to various tasks and hardware operations to account for resource conflicts. Rather than synchronize all parallel units each cycle, they synchronize occasionally and maintain enough knowledge and context to do fairly accurate penalization. It is a way to combine temporally decoupled execution with precise timing. The concept of adjusting post-facto is used to good effect in the paper and previous work by the same group. It is worth considering in other VP approaches as well, even though penalizing software tasks post-facto is quite a bit harder when using an ISS running the actual OS rather than having the simulated OS as part of the directly controllable model.

For code running on processors, the idea is to insert markers in the code that correspond to basic blocks in the compiled code. This approach sounds simple in principle, but it is fairly sophisticated in practice -it tries to adjust the correspondence to account for the actions of an optimizing compiler. We never bothered with that back in 2002 as it we were only concerned about fairly coarse-grained correctness of time. They do not make the assumption that the compiled code has the same exact structure as the source code, which is very important to be able to handle real-world software compiled by real-world compilers.

In addition, their design flow generates a secondary simulation task for each software task, that only models the software flow. Each time the main task reaches an instrumentation point, the secondary task is called, and it updates its model of where the code is. Essentially, the secondary task is very close to the execution flow of the actual binary. Having that true mirror image of the real binary makes it possible to account for timing dependencies between different pairs of basic blocks (like cache and pipeline conflicts), which is really hard to do when forced to work in the source-code structure of the original program.

I am honestly amazed at the power and sophistication and complexity of this approach. The team has clearly started with a fairly simple idea, and then added several layers of fixes and adjustment mechanisms to get the timing to the level of precision that they want. Their dedication to the idea of not using a cycle-accurate processor is impressive. The machinery employed is rich and powerful, and is very far from the simple approach that I helped build once upon a time.

Obviously, the approach has limitations. It would seem to me that a shared-memory multiprocessing (SMP) OS setup would be very hard to correctly simulate with the given approach. Code that generates code on the fly (like a JIT compiler) or code that relies on opaque binary-only libraries would not work too well either. But for basic compiled control code running on a reasonably simple RTOS, it should work. Keeping the OS API simulator compatible with the real OS is also an issue, but it seems solvable enough to be practical.

For functional-level virtual platforms, I honestly believe that using an ISS is the right thing to do. It means running the whole code, running absolutely any code, and there is no need to try to model the OS API and behavior. The complexity of simulating an OS is usually less than that of building a simulator for the underlying processor – especially when the goal of the exercise (modeling an SoC) dictates that you have to model the entire device set anyway. However, given that the cost of doing accurate timing in an ISS is a slowdown of something like 100000 from a fast functional simulator, I can see why users interested in timing as a primary factor would find it worth the time to build these kinds of elaborate systems.

I want to make it totally clear that I am not in any way being disparaging about the approach described in the paper. I am amazed that they could find a way to make this work and work well, and that they persevered. It shows that there are always different ways to solve a problem, and that we should not discount a particular approach just because it seems limited at first glance. If someone would have told me that that TAS approach could be used for low-level timing-accurate simulation, I would have dismissed the idea – since I would have seen all kinds of issues that would make it hard to pull off. I am happy that the paper showed me that the approach could be extended to the point of very good timing accuracy.

As a final note, while researching the old TAS paper, I came across yet another previous incarnation of the same idea, namely the TIBBIT system from 1995 where timing annotations are used at to facilitate code conversions from one real-time system to another without disturbing the system timing. Yet another use for timing as part of source code!


One thought on “Time-Accurate Simulation Revisited – 15 years later”

  1. You truly are kind of retarded Mr. Engblom. The best thing coming out of Sweden these days is Koenigsegg. With your maturity instead of making on your own and starting your own companies you instead choose to dedicate yourself to the retarded american Intel company. What is wrong with building your own dreams in Sweden?!?!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.