Simulating Computer Architecture with “Mechanistic” Models – No more 100k Slowdown?

I have been working with computer simulation and computer architecture for more than 20 years, and one thing that has been remarkably stable over time is the simulation slowdown inherent in “cycle accurate” computer simulation. Regardless of who I talked to or what they were modeling, the simulators ran at around 100 thousand times slower than the machine being modeled. It even holds true going back to the 1960s! However, there is a variant of simulation that aims to make useful performance predictions while running around 10x faster (or more) than cycle-accurate simulation – mechanistic models (in particular, as implemented in the Sniper simulator).

Mechanistic Models

Mechanistic models as implemented in Sniper and related simulators streamline the simulation of a processor pipeline while still modeling enough of the architecture to provide meaningful experimental data for guiding computer architects. These models run code and collect performance data during the run, in classic computer architecture fashion.

There are higher-level models proposed in the computer architecture literature, where evaluations do not require re-simulating a workload on each architecture change, but these are more abstracted than the mechanistic models and suffer from worse accuracy. Such models are very helpful for design-space exploration, but the goal that I see for mechanistic models is to support product and engineering calls on fairly concrete designs.

I have read the following papers as the basis for this blog post:

Simulating the Long Effects

To get even close to the real timing of real code on a real processor, some things absolutely have to be simulated: caches, branch predictors, and prefetchers. These mechanisms have a huge effect on performance, and their effect is “long”. A long effect accumulates state over many thousands or millions of instructions, and a small change in one instruction can radically affect the timing of code millions of instructions into the future. It is simply impossible to make a reasonable local analysis of the timing of a memory access without looking at quite a few preceding ones.
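To make the “long effect” concrete, here is a minimal sketch of my own (not Sniper code) of a set-associative cache model with LRU replacement. The per-set state persists across every access, so whether a given access hits or misses depends on the entire history of accesses before it.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Minimal LRU set-associative cache model (illustration only, not Sniper's code)."""
    def __init__(self, size_bytes=32 * 1024, line_bytes=64, ways=8):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered dict per set: tag -> None, ordered by recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)          # update LRU order
            self.hits += 1
            return True
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)       # evict the least recently used line
        s[tag] = None
        return False

# The outcome of any single access depends on the whole access history:
cache = SetAssociativeCache()
for i in range(1_000_000):
    cache.access((i * 64) % (8 * 1024 * 1024))   # streaming pattern, mostly misses
print(cache.hits, cache.misses)
```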

Traditional computer architecture approaches that fast-forward execution to a certain point combine the fast-forwarding with warming simulation for the long effects. The current typical length of warming for caches, prefetchers, and branch predictors is on the order of 100M instructions (which would be nice to shorten in the interest of performance).
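A typical sampled-simulation driver then looks roughly like the following rough sketch of my own. The `cpu`, `cache`, `branch_pred`, `prefetcher`, and `timing_model` objects are hypothetical interfaces, and the fast-forward and sample lengths are placeholders; only the 100M-instruction warming length comes from the discussion above.

```python
# Hypothetical driver loop for sampled simulation (illustration, not a real simulator API).

FAST_FORWARD_INSNS = 1_000_000_000   # skip to the region of interest (placeholder)
WARMING_INSNS      = 100_000_000     # warm caches/BP/prefetchers, no timing
DETAILED_INSNS     = 10_000_000      # measured sample (placeholder)

def run_sample(cpu, cache, branch_pred, prefetcher, timing_model):
    # 1. Fast functional execution: no microarchitectural state is touched.
    cpu.execute(FAST_FORWARD_INSNS)

    # 2. Warming: execute functionally, but feed every memory access and
    #    branch into the long-state models so they reach a realistic state.
    for insn in cpu.stream(WARMING_INSNS):
        if insn.is_memory:
            cache.access(insn.address)
            prefetcher.observe(insn.address)
        if insn.is_branch:
            branch_pred.update(insn.pc, insn.taken)

    # 3. Detailed (or mechanistic) timing simulation over the sample itself.
    for insn in cpu.stream(DETAILED_INSNS):
        timing_model.simulate(insn, cache, branch_pred, prefetcher)

    return timing_model.cycles
```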

In Sniper and other mechanistic models, the simulation of caches, prefetchers, and branch predictors is carried out in a manner very similar to a classic cycle-level processor model. However, simulating such factors is not all that slow – say at most a slowdown of 100x.

TLBs?

The memory-management unit translation look-aside buffers (TLBs) feel like they have a pretty long reach too, but their modeling in Sniper is sketchier, since modeling them properly requires operating-system involvement.

The papers I have read only consider user-level code (which puts a huge asterisk on the generality of the results in my mind). One argument made in the papers for ignoring the TLB is that well-behaved compute code typically suffers very few TLB misses. Note, however, that there is code whose performance is hugely TLB-dependent: for example, code that allocates and uses tens of gigabytes of memory, where the use or non-use of “huge pages” directly impacts performance.
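A back-of-the-envelope calculation shows why: TLB reach is just the number of entries times the page size. The entry count below is an illustrative assumption, not a measured value for any particular core.

```python
# Rough TLB-reach arithmetic (the entry count is an illustrative assumption).
stlb_entries = 1536                      # assumed second-level TLB size

reach_4k = stlb_entries * 4 * 1024       # ~6 MiB of address space covered
reach_2m = stlb_entries * 2 * 1024**2    # ~3 GiB with 2 MB huge pages

print(f"4 KB pages: {reach_4k / 2**20:.0f} MiB of TLB reach")
print(f"2 MB pages: {reach_2m / 2**30:.0f} GiB of TLB reach")
# A workload touching tens of gigabytes randomly will miss the TLB constantly
# with 4 KB pages, while huge pages can remove most of those misses.
```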

Simplifying the Core

The core contribution of the mechanistic models presented in the TACO paper (in particular, what they call the IW-centric model) is the abstraction they make of the core execution pipeline of a processor.  The idea is that it is sufficient to model the issuing of instructions and the latency of each type of instruction, along with the size of the reorder buffer. Walking each instruction (or micro-operation) through all pipeline stages is not necessary to get a reasonable estimate.  

The underlying assumption as I see it is really that the execution pipeline of a modern high-end core is good enough that you do not usually need to worry about the details of the pipeline stages, interlocks, forwarding paths, reorder buffers, wrong path predictions, or rename registers. The pipeline can be approximated as a statistical fog that typically works – but simulating it in detail is extremely expensive. By ignoring these details, the simulation can be much faster while still being sufficiently accurate to be useful. In the words of the TACO 2014 paper, Section 4.1:

Although the cache hierarchy and branch predictors continue to be simulated in detail […], many structures such as the fetch and decode logic, additional hardware structures for issuing instructions such as the issue queue, and register renaming, as well as the commit stage, are not simulated in detail because of the assumption of a balanced processor microarchitecture. In addition, we assume a fixed penalty from the discovery of a mispredicted branch to the dispatching of new instructions (Table I), and we do not simulate the impact of wrong-path instructions on the cache and branch predictors.

Looking at this from a different perspective: what is actually being modeled?

  • The different functional units and the types of instructions that they can take – including issue contention.
  • The dependencies between instructions, as the critical path through the code dictates how quickly the processor can actually get through it. In particular, operations that depend on memory operations that miss in the cache can really stretch out how long it takes for a processor to run a particular piece of code.
  • Reorder buffer – the IW-centric core model keeps track of a window of micro-operations and actually determines when each is dispatched. Simpler models like the interval model ignore this aspect and lose accuracy as a result.
  • Memory-level parallelism (MLP) – which memory operations can and cannot proceed in parallel. Instruction-level parallelism (ILP) is provided by the previous three mechanisms. In addition, the interaction between MLP and ILP is included.  

A mechanistic model ignores wrong-path execution caused by branch mispredictions and other speculation mechanisms, as well as the effects of (weak) memory consistency models. It does not account for instruction decoding as a bottleneck either, since that is covered by the assumption of a balanced core.
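To make the list above more concrete, here is a heavily simplified sketch of my own in the spirit of such a core model (not Sniper’s actual algorithm). The ROB size, issue width, and miss latency are assumed values, and in-order retirement is approximated by completion time.

```python
from dataclasses import dataclass, field

@dataclass
class MicroOp:
    latency: int                               # execution latency (long for cache misses)
    deps: list = field(default_factory=list)   # indices of producing micro-ops

def simulate(uops, rob_size=224, issue_width=4):
    done_at = [0] * len(uops)    # completion cycle of each micro-op
    issued_in_cycle = {}         # cycle -> number of micro-ops issued in that cycle

    for i, u in enumerate(uops):
        # Reorder-buffer limit: micro-op i cannot dispatch until the micro-op
        # rob_size places ahead of it has completed (a simplification that uses
        # completion time as a stand-in for in-order retirement).
        rob_free = done_at[i - rob_size] if i >= rob_size else 0

        # Dependencies: operands are ready when all producers have completed.
        operands_ready = max((done_at[d] for d in u.deps), default=0)

        start = max(rob_free, operands_ready)

        # Issue contention: at most issue_width micro-ops start per cycle.
        while issued_in_cycle.get(start, 0) >= issue_width:
            start += 1
        issued_in_cycle[start] = issued_in_cycle.get(start, 0) + 1

        done_at[i] = start + u.latency

    return max(done_at, default=0)

# Two independent cache misses overlap (MLP); a dependent miss does not.
MISS = 200   # assumed last-level-cache miss latency in cycles
uops = [MicroOp(MISS), MicroOp(MISS), MicroOp(MISS, deps=[1]), MicroOp(1, deps=[0, 2])]
print(simulate(uops))   # ~401 cycles: the two dependent misses serialize, the third overlaps
```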

The table below summarizes the architectural mechanisms and whether they are included in mechanistic models or not. Also included is the fast functional model with penalties, as a point of comparison.

Mechanism | Classic cycle simulation | Mechanistic model (Sniper) | Fast functional + penalties
Cache | Yes | Yes | Yes
Branch predictor | Yes | Yes | Yes
Prefetchers | Yes | Yes | Yes
TLB | Yes | No (user-level only) | Yes
Instruction decode | Yes | No | No
Issue ports | Yes | Yes | No
Reorder-buffer size | Yes | Yes | No
Instruction-level parallelism | Yes | Yes | No
Instruction dependencies | Yes | Yes | No
Pipeline stages | Yes | No | No
Memory-level parallelism | Yes | Yes | No
Wrong path | Yes | No | No
Memory consistency model | Yes | No | No

Other Insights

The papers contain several useful insights and observations on performance simulation.

Without instruction parallelism being modeled, there is no way to model memory parallelism. If the simulator handles each instruction and each memory access as a unit before processing the next, no parallelism can be accounted for. Obvious, but important to keep in mind.

Modeling relative differences between design alternatives might require good absolute estimates of performance. The reason is that resources like DRAM and links between individual chips have a certain absolute bandwidth limit – and hitting or not hitting it has a huge impact on performance. The intensity of accesses in absolute terms matters greatly. To quote the TACO paper again:

This difference in relative (scaling) accuracy can in fact be correlated with each core model’s absolute accuracy. Although architects usually claim to require only relative simulation accuracy, scaling often largely depends on contention of shared resources such as DRAM and QPI bandwidth, which in turn requires request rates and thus core performance to have a certain level of absolute accuracy.

The papers also show that the “1 IPC model” is useless. The 1-IPC model is popular since it is very easy to implement on top of a normal fast functional virtual platform (like Simics or Qemu). The functional simulator is extended with models of the long effects (caches, branch predictors, prefetchers, TLB). A fixed time penalty is added for each cache miss, branch mispredict, etc., on top of a basic model where each instruction takes one cycle (or half a cycle, or whatever). This model can run very quickly, but since it entirely misses ILP and MLP, it is also rather wrong.
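For contrast, a 1-IPC penalty model boils down to something like this sketch of my own (the penalty values are assumptions): every instruction costs one cycle plus a fixed penalty per miss event, which by construction cannot express any overlap between misses.

```python
# Sketch of a "1 IPC + penalties" model (illustration; penalty values are assumptions).
MISS_PENALTY   = {"l1": 10, "l2": 40, "llc": 200}   # cycles, assumed
BRANCH_PENALTY = 15                                  # cycles, assumed

def one_ipc_cycles(insn_count, miss_counts, branch_misses):
    cycles = insn_count                      # one cycle per instruction
    for level, n in miss_counts.items():
        cycles += n * MISS_PENALTY[level]    # every miss pays the full penalty
    cycles += branch_misses * BRANCH_PENALTY
    return cycles

# Two independent last-level-cache misses are charged 2 * 200 cycles here, even
# though a real out-of-order core (and the mechanistic sketch above) would overlap them.
print(one_ipc_cycles(1000, {"l1": 0, "l2": 0, "llc": 2}, 0))   # 1400 cycles
```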

A bit of a let-down: I used to think that a fast functional simulator with caches would be somewhat useful for estimating at least relative performance, but the error is so large that it can give totally misleading results – at least when used to compute time. It still makes sense to simulate the effects and count events, as a first-order approximation. I think. Maybe.

Going Multicore

The above only covers what happens inside of a single core, which is not really how computers are built today. Dealing with multicore can be a real killer for simulator performance, since not only does the simulator have to run the simulation for each additional core, it also has to accurately account for the resource sharing and contention between the cores. When using temporal decoupling, the time quantum typically has to be lower than the latency of an L2 miss, or around 10 cycles, which really hurts performance. Trying to run the simulation in parallel requires a lot of synchronization to capture the effects between the cores.

For Sniper, the solution is to run each core simulation in parallel (on its own host thread) and allow for quite a bit of time difference between the cores (according to the TACO paper, some 100ns, which works out to something like 100 to 500 instructions depending on the processor being modeled). This does reduce the accuracy, but also allows for much greater simulation speed. A classic cycle-level simulator could never get away with this, since it would be too imprecise when modeling the memory system in more detail.
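In pseudo-Python, that relaxed synchronization scheme looks roughly like the following sketch of my own (an illustration of the general idea, not Sniper’s actual scheduler): each core thread simulates ahead on its own and only blocks when it gets more than the allowed skew ahead of the slowest core.

```python
# Sketch of relaxed temporal decoupling across core threads (illustration only).
import threading

SKEW_NS = 100          # allowed time difference between cores (per the TACO paper)

class SkewLimiter:
    def __init__(self, num_cores):
        self.local_time = [0.0] * num_cores      # simulated time of each core, in ns
        self.cond = threading.Condition()

    def advance(self, core_id, delta_ns):
        with self.cond:
            self.local_time[core_id] += delta_ns
            self.cond.notify_all()
            # Block while this core is more than SKEW_NS ahead of the slowest core.
            while self.local_time[core_id] > min(self.local_time) + SKEW_NS:
                self.cond.wait()

def core_thread(core_id, limiter, chunks):
    for chunk_ns in chunks:
        # ... simulate one chunk of instructions on this core's model ...
        limiter.advance(core_id, chunk_ns)

limiter = SkewLimiter(num_cores=2)
threads = [threading.Thread(target=core_thread, args=(i, limiter, [30.0] * 100))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(limiter.local_time)
```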

It is not entirely clear to me just how memory contention is modeled, but it seems to be about looking at all the accesses from all cores in a certain time interval. This is the same kind of model that has been used in the past for network modeling, and it does make sense as a speed-accuracy trade-off.
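One plausible reading of that, sketched below, is to collect the accesses from all cores that fall into a time window and stretch their latency by the factor by which the aggregate rate exceeds the available bandwidth. This is my guess at the kind of model, not a description of Sniper’s implementation, and all the numbers are assumptions.

```python
# One plausible interpretation of windowed contention modeling
# (my guess, not Sniper's documented algorithm; numbers are assumptions).
WINDOW_NS         = 1000      # length of the accounting window
PEAK_BYTES_PER_NS = 25.6      # assumed DRAM bandwidth (25.6 GB/s)
BASE_LATENCY_NS   = 60        # assumed uncontended DRAM latency

def contended_latency(accesses_in_window, bytes_per_access=64):
    demanded = accesses_in_window * bytes_per_access / WINDOW_NS   # bytes per ns
    utilization = demanded / PEAK_BYTES_PER_NS
    if utilization <= 1.0:
        return BASE_LATENCY_NS
    # Over-subscription: stretch the latency by the over-subscription factor.
    return BASE_LATENCY_NS * utilization

print(contended_latency(200))   # under the bandwidth limit: 60 ns
print(contended_latency(800))   # 2x over-subscribed: 120 ns
```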

The Operating System?

Unfortunately, none of the publications that I have seen around Sniper deal with full-system models and simulation. They all look at user-level programs like SPEC or other compute workloads, which makes me wonder how well this works when simulating a complete software stack including the operating system, drivers, hypervisors, hardware, and other fun.

It seems like it should be rather easy to run Sniper with full-system code by attaching the core models to an existing full-system simulator… but is there something hiding in the transitions between execution modes that could trip up the chosen approximations? How would the model deal with input and output operations over PCIe and memory-mapped devices? Is there something new there, or just memory accesses with a slightly different latency?

Core-to-core synchronization is another issue that is crucial when running an operating system, and one where the temporal decoupling inherent in the Sniper memory simulation could probably have some interesting effects. Maybe all that is needed is a decent model of the I/O subsystem and the latencies involved, but how precise must it be? What about DMA from devices and interrupt timing? Is that also just noise that can be handled approximately?

Good questions, and I have no real idea about the answer.

Simulation Performance

Finally, how much faster is something like Sniper compared to other abstraction levels? This is an important question, since speed is a key motivator for the approach to begin with. Judging from the published data, it seems that the overall slowdown is some 10,000 (ten thousand) times (it could be 5,000x as well, but that is roughly the same as I see it).
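To put those slowdown factors into more tangible units, here is a quick back-of-the-envelope conversion, assuming a 3 GHz target running at roughly one instruction per cycle (both numbers are assumptions for illustration).

```python
# Converting slowdown into simulated instruction throughput
# (assumes a 3 GHz target running at ~1 IPC; purely illustrative).
TARGET_IPS = 3_000_000_000           # ~3 billion target instructions per second

for slowdown in (5_000, 10_000, 100_000):
    print(f"{slowdown:>7,}x slowdown -> ~{TARGET_IPS / slowdown / 1e3:,.0f} thousand instructions/s")
# 5,000x -> ~600 KIPS, 10,000x -> ~300 KIPS, 100,000x -> ~30 KIPS of simulated code
```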

This is at least 10 times faster than the traditional detailed computer architecture simulators I described at the beginning of this blog post (which typically run at around a 100,000x slowdown). Add in the benefit of some parallel execution, and it makes sense to say that it is usefully faster than precise simulators. The advantage can be even larger against unusually slow cycle-level simulators – comparing a 5,000x slowdown for Sniper to the 250,000x slowdown I have heard professors cite for some architectural simulations gives a 50x advantage.

Thus, my order-of-magnitude map of simulator performance has been updated and currently looks like this:

  • 1x – using virtualization, a simulator like Simics can get to basically the same speed as native hardware.
  • 10x – a good just-in-time (JIT) compiler gets within 10x of the hardware (when running on the same hardware, like an x86 JIT on x86 – simulating something like a microcontroller core on a modern x86 can run many times real-time speed due to the performance advantage of the host over the target).
  • 100x – adding cache simulation to a JIT simulator slows the execution down by another factor of 10 (see for example this Intel blog post on cache simulation in Simics).
  • 5000x to 10,000x – Sniper et al.
  • 100,000x to 250,000x – Cycle-level simulation

Modeling instruction dependencies and the front-end and back-end like Sniper does costs roughly another 50x to 100x slowdown on top of just simulating caches and branch predictors. Adding in the pipeline details slows things down by another 10x. This means that roughly 90% of the work of a cycle-level simulator is spent on the pipeline details. That level of detail is critical for design validation and unavoidable there, but Sniper gets what seems to be usefully close without it.

Overall, an interesting idea that appears to have established a useful new tier in computer architecture simulation. I wonder why I did not look at it before.

5 thoughts on “Simulating Computer Architecture with “Mechanistic” Models – No more 100k Slowdown?”

  1. At last, even with the dated Sniper as the example, we are seeing some meaningful progress on the performance simulation topic. In particular, many times YES – you cannot get any stable and accurate performance estimation (let alone modeling) without data dependency analysis (what is called ILP and MLP here). No way, even and especially with machine learning :).

    Another sad story is the slowdown scale across the different kinds of simulators. Just ask yourself: how come cycle-accurate simulation is 250,000x slower? Given a 3 GHz core and a smallest instruction throughput of half a cycle (like any integer ALU), it means that simulating that same instruction takes 250,000 clocks (assuming the simulation code runs at one IPC on the host, that turns into 250,000 host instructions!). How could that be? C++ OOP, I guess, in a very academic or hardware engineer’s style.

    Let’s drop the past and use a JIT to schedule and execute performance simulation like we do for functional simulation. Data dependencies are fixed across hyper-blocks and are the main determinant of accuracy for both ILP and MLP (in the single-core case). Sure, it is not completely static (because of pipeline hazards and cache misses, for example), but it is quite easy to adjust.

    On system-level performance simulation – from past experience, I believe loosely timed device models would be enough and could easily be produced based on the available functional ones. However, the key is that all instructions are retired in order, so a memory load from memory-mapped PCIe space would stall time until its completion. Sure, the instructions past it will be speculatively executed and their latency should be covered (hidden) by the latency of the prior MMIO access. So you should model scheduling with an (infinite?) ROB plus retirement.

    As I commented some time ago, I believe we should rewrite cycle-accurate simulators to improve their current nonsensical speed, and for fast performance simulation we should instead simulate data dependencies and retirement on an ideal (infinitely wide) machine and then enforce latency and throughput constraints on that model.

  2. Thanks!

    I think the reason for the 100k to 250k slowdown is not so much how simulators are coded as what they have to do. The factor has been present for 50+ years, and it seems reasonable to assume that if it were possible to speed it up significantly, that would have happened by now.

    The cycle-driven simulator is not really running code. It is simulating a system of parts that work together to eventually run instructions. I believe the key problem for cycle-driven simulation is that there is so much going on in parallel in the hardware each cycle that it simply takes a long time to churn through it all. Once you get the memory system and speculation into the model, data dependencies are no longer necessarily easy to see… and from the perspective of the simulation of the processor, they are just one of many things handled by the multitude of little simulation units. It is more interesting to model how the processor tracks dependencies than to discover them from the code being simulated.

    The slowdown stems from the domain being simulated, not necessarily from bad programming (that said, you can likely get 10x speedup from most simulators with some proper software engineering and optimization).

  3. Recently I had some experience optimizing databases – it is quite normal to see something like 1000 TPS (transactions per second) in single-threaded in-memory mode without any parallelization – i.e., a transaction is locked to one thread and done entirely, from parsing the SQL to updating all the required data structures, within that same thread. It looks like simulating a single cycle of an x86 core is slower than processing a single SQL transaction 🙂 This is weird.

    I have some side projects on simulating RISC-V pipelines now; hopefully that will help me finally conclude on the “expected” speed. Though I believe it should be at least 1 MIPS for the single-core case with a simple main memory model.
