I have been working with computer simulation and computer architecture for more than 20 years, and one thing that has been remarkably stable over time is the simulation slowdown inherent in “cycle accurate” computer simulation. Regardless of who I talked to or what they were modeling, the simulators ran at around 100 thousand times slower than the machine being modeled. It even holds true going back to the 1960s! However, there is a variant of simulation that aims to make useful performance predictions while running around 10x faster (or more) – mechanistic models (in particular, the Sniper simulator).
Mechanistic models as implemented in Sniper and related simulators streamline the simulation of a processor pipeline while still modeling enough of the architecture to provide meaningful experimental data for guiding computer architects. These models run code and collect performance data during the run, in classic computer architecture fashion.
There are higher-level models proposed in the computer architecture literature, where evaluations do not require the re-simulation of a workload on each architecture change, but these are more abstracted than the mechanistic models and suffer from worse accuracy. They are very helpful for design-space exploration, but the goal that I see for mechanistic models is to make product and engineering calls on fairly concrete designs.
I have read the following papers as the basis for this blog post:
- Sniper homepage, at http://snipersim.org/w/The_Sniper_Multi-Core_Simulator
- Trevor Carlson, Wim Heirman, Stijn Eyerman, Ibrahim Hur, Lieven Eeckhout. “An Evaluation of High-Level Mechanistic Core Models”, ACM Transactions on Architecture and Code Optimization (TACO), August 2014.
- Almutaz Adileh, Cecilia González-Álvarez, Juan Miguel De Haro Ruiz, Lieven Eeckhout. “Racing to Hardware-Validated Simulation”, 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2019.
- Sam Van den Steen, Sander De Pestel, Moncef Mechri, Stijn Eyerman, Trevor Carlson, David Black-Schaffer, Erik Hagersten, and Lieven Eeckhout. “Micro-Architecture Independent Analytical Processor Performance and Power Modeling”, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2015.
Simulating the Long Effects
To get even close to the real timing of real code on a real processor, some things absolutely have to be simulated: caches, branch predictors, and prefetchers. These mechanisms have a huge effect on performance, and their effect is “long”. A long effect accumulates state over many thousands or millions of instructions, and a small change in one instruction can radically affect the timing of code millions of instructions into the future. It is simply impossible to make a reasonable local analysis of the timing of a memory access without looking at quite a few preceding ones.
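To make the “long” nature of these effects concrete, here is a toy sketch (not Sniper’s model – the cache organization, capacity, and traces are all made up for illustration) of an LRU cache where growing the working set by a single line flips a trace from almost all hits to all misses. State accumulated across thousands of earlier accesses determines whether each later access hits:

```python
from collections import OrderedDict

# Toy fully-associative LRU cache; capacity is a made-up illustration value.
class LRUCache:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.lines = OrderedDict()  # line address -> present, in LRU order

    def access(self, line):
        """Return True on hit; update the LRU state either way."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)          # refresh recency
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)    # evict least recently used
            self.lines[line] = True
        return hit

def miss_count(trace, capacity=64):
    cache = LRUCache(capacity)
    return sum(not cache.access(line) for line in trace)

fits   = list(range(64)) * 100  # working set == capacity: hits after warmup
thrash = list(range(65)) * 100  # one extra line: LRU degenerates to all misses
print(miss_count(fits), miss_count(thrash))  # 64 vs 6500 misses
```

A one-line change in the working set turns 64 cold misses into 6500 misses spread over the whole run – exactly the kind of non-local behavior that makes purely local timing analysis impossible.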
Traditional computer architecture approaches that fast-forward execution to a certain point combine the fast-forwarding with warming simulation for the long effects. The current typical length of warming for caches, prefetchers, and branch predictors is on the order of 100M instructions (which would be nice to shorten in the interest of performance).
In Sniper and other mechanistic models, the simulation of caches, prefetchers, and branch predictors is carried out in a manner very similar to a classic cycle-level processor model. However, simulating such factors is not all that slow – say at most a slowdown of 100x.
The memory-management unit translation look-aside buffers (TLBs) feel like they have a pretty long reach too, but their modeling is more sketchy in Sniper since they require operating-system involvement.
The papers I have read only considered user-level code (which puts a huge asterisk on the generality of the results in my mind). One argument made in the papers for ignoring the TLB is that well-behaved compute code typically suffers very few TLB misses. Note that there is code that is hugely TLB-dependent for performance – for example, code that allocates and uses tens of gigabytes of memory, where the use or non-use of “huge pages” will directly impact performance.
Simplifying the Core
The core contribution of the mechanistic models presented in the TACO paper (in particular, what they call the IW-centric model) is the abstraction they make of the core execution pipeline of a processor. The idea is that it is sufficient to model the issuing of instructions and the latency of each type of instruction, along with the size of the reorder buffer. Walking each instruction (or micro-operation) through all pipeline stages is not necessary to get a reasonable estimate.
The underlying assumption as I see it is really that the execution pipeline of a modern high-end core is good enough that you do not usually need to worry about the details of the pipeline stages, interlocks, forwarding paths, reorder buffers, wrong path predictions, or rename registers. The pipeline can be approximated as a statistical fog that typically works – but simulating it in detail is extremely expensive. By ignoring these details, the simulation can be much faster while still being sufficiently accurate to be useful. In the words of the TACO 2014 paper, Section 4.1:
Although the cache hierarchy and branch predictors continue to be simulated in detail […], many structures such as the fetch and decode logic, additional hardware structures for issuing instructions such as the issue queue, and register renaming, as well as the commit stage, are not simulated in detail because of the assumption of a balanced processor microarchitecture. In addition, we assume a fixed penalty from the discovery of a mispredicted branch to the dispatching of new instructions (Table I), and we do not simulate the impact of wrong-path instructions on the cache and branch predictors.
Looking at this from a different perspective: what is actually being modeled?
- The different functional units and the types of instructions that they can take – including issue contention.
- The dependency between instructions, as the critical path through the code dictates how quickly the processor can actually get through the code. In particular, operations that depend on memory operations that miss in the cache can really stretch out how long it takes for a processor to run a particular piece of code.
- Reorder buffer – the IW-centric core model keeps track of a window of micro-operations and actually determines when each is dispatched. Simpler models like the interval model ignore this aspect, and lose accuracy as a result.
- Memory-level parallelism (MLP) – which memory operations can and cannot proceed in parallel. Instruction-level parallelism (ILP) is provided by the previous three mechanisms. In addition, the interaction between MLP and ILP is included.
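As a sketch of how these mechanisms combine, here is my own simplification for illustration (not Sniper’s actual IW-centric code; the latencies and ROB size are made up): in-order dispatch into a reorder buffer, completion times driven by data dependencies, and in-order retirement. MLP falls out naturally – independent misses overlap, dependent misses serialize:

```python
MISS_LATENCY = 200  # hypothetical cache-miss latency in cycles
ALU_LATENCY = 1

def model_cycles(ops, rob_size=8):
    """ops: list of (latency, [indices of producer ops]).
    Returns the cycle at which the last op retires."""
    finish = []  # completion time of each op
    retire = []  # retirement time of each op (in program order)
    for i, (lat, deps) in enumerate(ops):
        # ROB constraint: op i cannot dispatch before op i - rob_size retires.
        dispatch = retire[i - rob_size] if i >= rob_size else 0
        # Data constraint: all producers must have completed.
        ready = max([finish[d] for d in deps], default=0)
        finish.append(max(dispatch, ready) + lat)
        # In-order retirement: never retire before the previous op.
        retire.append(max(finish[i], retire[-1] if retire else 0))
    return retire[-1]

# Two independent misses overlap (MLP): total is about one miss latency.
parallel = [(MISS_LATENCY, []), (MISS_LATENCY, []), (ALU_LATENCY, [0, 1])]
# A dependent chain of misses serializes: total is about two miss latencies.
serial = [(MISS_LATENCY, []), (MISS_LATENCY, [0]), (ALU_LATENCY, [1])]
print(model_cycles(parallel), model_cycles(serial))  # 201 vs 401 cycles
```

Note that nothing here walks instructions through fetch, decode, rename, or individual pipeline stages – yet the model distinguishes the 2x performance difference between the two miss patterns.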
A mechanistic model ignores the execution of wrong paths due to misses in branch prediction and other speculation mechanisms, as well as the effect of (weak) memory consistency models. It does not account for instruction decoding as a bottleneck either, as that is part of the assumption of a balanced core.
The table below summarizes the architectural mechanisms and whether they are included in mechanistic models or not. Also included, as a point of comparison, is the fast functional model with penalties.

| Mechanism | Classic Cycle Simulation | Mechanistic model (Sniper) | Fast functional + penalties |
|---|---|---|---|
| TLB | Yes | No (user-level only) | Yes |
| Memory consistency model | Yes | No | No |
The papers contain several useful insights and observations on performance simulation.
Without instruction parallelism being modeled, there is no way to model memory parallelism. If the simulator handles each instruction and each memory access as a unit before processing the next, no parallelism can be accounted for. Obvious, but important to keep in mind.
Modeling relative differences between design alternatives might require good absolute estimates of performance. The reason is that resources like DRAM and links between individual chips have a certain absolute bandwidth limit – and hitting or not hitting it has a huge impact on performance. The intensity of accesses in absolute terms matters greatly. To quote the TACO paper again:
This difference in relative (scaling) accuracy can in fact be correlated with each core model’s absolute accuracy. Although architects usually claim to require only relative simulation accuracy, scaling often largely depends on contention of shared resources such as DRAM and QPI bandwidth, which in turn requires request rates and thus core performance to have a certain level of absolute accuracy.
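A back-of-the-envelope illustration of this point (all numbers here are my own made-up assumptions, not from the paper): whether cores saturate a shared DRAM bandwidth limit depends entirely on their absolute request rates, so a core model that underestimates miss rates can completely miss the contention:

```python
DRAM_BW = 25.6e9  # hypothetical shared DRAM bandwidth: 25.6 GB/s
LINE = 64         # bytes transferred per cache-line fill

def throughput(cores, misses_per_sec_per_core):
    """Fraction of ideal performance once DRAM bandwidth is shared."""
    demand = cores * misses_per_sec_per_core * LINE  # bytes/s requested
    # If aggregate demand exceeds the limit, cores slow down proportionally.
    return min(1.0, DRAM_BW / demand)

accurate = 100e6   # assumed real per-core miss rate: 100M misses/s
optimistic = 50e6  # a core model that is 2x optimistic about miss rate
print(throughput(8, accurate), throughput(8, optimistic))  # 0.5 vs 1.0
```

The optimistic model predicts that eight cores scale perfectly, while the accurate rate predicts a 2x slowdown from bandwidth saturation – so the *relative* scaling conclusion hinges on *absolute* accuracy.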
The papers also show that the “1 IPC model” is useless. The 1-IPC model is popular since it is very easy to implement given a normal fast functional virtual platform (like Simics or Qemu): the functional simulator is extended with models of the long effects (caches, branch predictors, prefetchers, TLB), and a fixed time penalty is charged for each cache miss, branch-prediction miss, etc., on top of a basic model where each instruction takes one cycle (or half a cycle, or whatever). This model can run very quickly, but since it entirely misses ILP and MLP, it is also rather wrong.
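For contrast with the mechanistic approach, the whole 1-IPC model reduces to an accumulation like this (the penalty value is made up), which by construction cannot express any overlap between misses:

```python
MISS_PENALTY = 200  # assumed fixed penalty per cache miss, in cycles

def one_ipc_cycles(instructions, cache_misses):
    """1-IPC model: one cycle per instruction plus a fixed penalty per miss."""
    return instructions + cache_misses * MISS_PENALTY

# 1000 instructions with 10 misses: the model charges the full penalty for
# every miss, no matter how many the hardware could overlap.
print(one_ipc_cycles(1000, 10))  # 3000 cycles, regardless of MLP
# If the hardware sustained four outstanding misses (MLP of 4), the real cost
# would be closer to 1000 + 10/4 * 200 = 1500 cycles -- a 2x error.
```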
A bit of a let-down: I used to think that a fast functional simulator with caches would be somewhat useful for estimating at least relative performance, but the error is so large that it can give totally misleading results – at least when used to compute time. It still makes sense to simulate the effects and count events, as a first-order approximation. I think. Maybe.
The above only covers what happens inside of a single core, which is not really how computers are built today. Dealing with multicore can be a real killer for simulator performance: not only does the simulator have to run the simulations for the additional cores, it also has to accurately account for the resource sharing and contention between the cores. When using temporal decoupling, the time quantum typically has to be lower than the latency of an L2 miss, or around 10 cycles. This really affects performance. Trying to run the simulation in parallel requires a lot of synchronization to capture effects between the cores.
For Sniper, the solution is to run each core simulation in parallel (on its own host thread) and allow for quite a bit of time difference between the cores (according to the TACO paper, some 100ns, which works out to something like 100 to 500 instructions depending on the processor being modeled). This does reduce the accuracy, but also allows for much greater simulation speed. A classic cycle-level simulator could never get away with this, since it would be too imprecise when modeling the memory system in more detail.
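The scheduling idea can be sketched like this (my simplification of the scheme described for Sniper, with made-up step durations and a skew bound standing in for the ~100ns window): each core advances through its own work, but no core may run further than the skew bound ahead of the slowest still-active core:

```python
def run_decoupled(core_steps, skew=100):
    """core_steps: one list per core of per-step durations (e.g. in ns).
    Returns the order in which (core, step) pairs are simulated."""
    times = [0] * len(core_steps)  # local simulated time of each core
    pos = [0] * len(core_steps)    # next step index of each core
    order = []
    while True:
        active = [c for c in range(len(core_steps))
                  if pos[c] < len(core_steps[c])]
        if not active:
            return order
        slowest = min(times[c] for c in active)
        for c in active:
            # A core may only start a step while within `skew` of the
            # slowest active core; otherwise it waits for others to catch up.
            if times[c] <= slowest + skew:
                times[c] += core_steps[c][pos[c]]
                order.append((c, pos[c]))
                pos[c] += 1

# Core 0 takes many short steps, core 1 a few long ones; the skew bound
# forces them to stay loosely synchronized instead of running freely.
print(run_decoupled([[10] * 10, [60] * 3]))
```

In a classic cycle-level simulator the equivalent bound would have to be a handful of cycles, forcing constant synchronization; allowing a skew of hundreds of instructions is what buys the parallel speedup, at some cost in accuracy.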
It is not entirely clear to me just how memory contention is modeled, but it seems to be about looking at all the accesses from all cores in a certain time interval. This is the same kind of model that has been used in the past for network modeling, and it does make sense as a speed-accuracy trade-off.
The Operating System?
Unfortunately, none of the publications that I have seen around Sniper deal with full-system models and simulation. They are all looking at user-level programs like SPEC or other compute workloads. Which makes me wonder how well this works when simulating a complete software stack including the operating system, drivers, hypervisors, hardware, and other fun.
It seems like it should be rather easy to run Sniper with full-system code by attaching the core models to an existing full-system simulator… but is there something hiding in the transitions between execution modes that could trip up the chosen approximations? How would the model deal with input and output operations over PCIe and memory-mapped devices? Is there something new there, or just memory accesses with a slightly different latency?
Core-to-core synchronization is another issue that is crucial if running an operating system, and where the temporal decoupling inherent in the Sniper memory simulation could probably have some interesting effects. Maybe all that is needed is a decent model for the IO subsystem and the latencies involved, but how precise must it be? What about DMA from devices and interrupt timing? Is that also just noise that can be handled approximately?
Good questions, and I have no real idea about the answer.
Finally, how much faster is something like Sniper compared to other abstraction levels? This is a key question, since “speed” is a key motivator for the approach to begin with. Judging from the published data, the overall slowdown seems to be some 10,000 (ten thousand) times (it could be 5000x as well, but to me that is roughly the same order of magnitude).
This is at least 10 times faster than the traditional detailed computer architecture simulators I described at the beginning of this blog post (which typically run at around a 100,000x slowdown). Add in the benefit of some parallel execution, and it makes sense to say that it is usefully faster than precise simulators. The advantage can be even larger against unusually slow cycle-level simulators – comparing a 5000x slowdown for Sniper to the 250,000x slowdown I have heard professors cite for some architectural simulations gives a 50x advantage.
Thus, my order-of-magnitude map of simulator performance has been updated and currently looks like this:
- 1x – using virtualization, a simulator like Simics can get to basically the same speed as native hardware.
- 10x – a good just-in-time (JIT) compiler gets within 10x of the hardware (when running on the same hardware, like an x86 JIT on x86 – simulating something like a microcontroller core on a modern x86 can run many times real-time speed due to the performance advantage of the host over the target).
- 100x – adding cache simulation to a JIT simulator slows the execution down by another factor of 10 (see for example this Intel blog post on cache simulation in Simics).
- 5000x to 10,000x – Sniper et al.
- 100,000x to 250,000x – Cycle-level simulation
Modeling instruction dependencies and the front-end and back-end like Sniper does costs roughly another 50x to 100x slowdown on top of just simulating caches and branch predictors. Adding in the pipeline details slows things down by another 10x. This means that roughly 90% of the work of a cycle-level simulator is spent on the pipeline details. This is critical for design validation and unavoidable there, but Sniper gets what seems to be usefully close without it.
Overall, an interesting idea that appears to have established a useful new tier in computer architecture simulation. I wonder why I did not look at it before.