I had many interesting conversations at the HiPEAC 2017 conference in Stockholm back in January 2017. One topic that came up several times was the GEM5 research simulator, and some cool tricks implemented in it in order to speed up the execution of computer architecture experiments. Later, I located some research papers explaining the “full speed ahead” technology in more detail. The mix of fast simulation using virtualization and clever tricks with cache warming is worth a blog post.
What is GEM5?
The GEM5 simulator is the current “standard” tool in computer architecture research. Just like when I was a PhD student and “everyone” used Simplescalar to do research on processor architecture, memory systems, and related topics, it seems “everyone” is using GEM5 today. GEM5 is just what a computer architecture researcher needs (and apparently it is also being used in some commercial companies). It is the research equivalent of the detailed simulators that commercial computer architects use (see my previous blog on architectural simulators for more discussions on this).
At its core, GEM5 is a classic configurable out-of-order pipeline simulator with caches, branch predictors, etc. There is a big ecosystem built on GEM5, including add-on simulators for GPUs, power consumption, networks-on-chip, and other areas of research. This is a real incarnation of the product management concept of completing a product with partner offerings and integrations, having been created spontaneously from the needs of researchers.
GEM5 features several different instruction sets and machine models, but it seems that most users today use the Intel Architecture and simple PC target system. The ARM targets also see some use, and then RISC-V appears to be getting more popular (which I did find a bit surprising).
The standard methodology for using GEM5 is to use sampling, where you only run small parts of a big benchmark at full detail. In this way, you get around the months-long simulation runs that would be needed to run the full benchmark in detailed mode.
The idea behind sampling is to run the entire program in a fast functional mode (where the definition of “fast” depends on the simulator, obviously), and drop to a fully detailed simulation at sampling points. The state of the detailed simulator is based on the state of the functional simulator at the start of the detailed run.
Since a detailed simulator contains much more detail than a functional simulator (such as cache contents, branch predictor state, pipelined instructions, etc.), you typically first do a warming phase to warm up these mechanisms in order to render the final detailed simulation accurate. The warming is necessary for processor performance mechanisms that depend on a long history of execution – in particular, the caches.
Warming only has to update the state of these mechanisms, not model the pipeline, so it runs much faster than the full detailed simulation, but it is still much slower than the functional simulation. In my experience, I would expect a fast functional simulator to slow down around 100x when adding caches and similar, and then by another 1000x to do a detailed pipeline simulation. Or it might be 10x and 1000x – we are talking about many orders of magnitude either way.
[Figure: the phases of a sampled simulation run (functional execution, warming, detailed simulation); the size of each run is not at all shown to scale.]
Sampling is a great innovation, but as always when you remove one bottleneck, another appears. In the case of GEM5, the next bottleneck is the time it takes for the functional simulator to run through a benchmark and create checkpoints. After that, it is cache warming. Both these aspects have been addressed in the Full Speed Ahead work.
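To put rough numbers on where the time goes in a sampled run, here is a back-of-the-envelope sketch in Python. Everything in it is an illustrative assumption rather than a measurement: the 1000 MIPS functional speed, the 100x and 1000x slowdown factors (taken from the rough figures above), and the instruction counts in the usage note.

```python
def sampled_run_seconds(total_insns, n_samples, warm_insns, detail_insns,
                        functional_mips=1000.0, warm_slowdown=100.0,
                        detail_slowdown=1000.0):
    """Estimate wall-clock seconds spent in each mode of a sampled run.

    warm_slowdown is relative to functional simulation;
    detail_slowdown is relative to warming. All values are assumptions.
    """
    functional_ips = functional_mips * 1e6
    warm_ips = functional_ips / warm_slowdown
    detail_ips = warm_ips / detail_slowdown

    warm_total = n_samples * warm_insns
    detail_total = n_samples * detail_insns
    # Everything that is not warmed or simulated in detail runs functionally.
    functional_total = total_insns - warm_total - detail_total

    return {
        "functional": functional_total / functional_ips,
        "warming": warm_total / warm_ips,
        "detailed": detail_total / detail_ips,
    }


# Hypothetical benchmark: 1e11 instructions, 100 samples, each with
# 10 million warming instructions and 10,000 detailed instructions.
estimate = sampled_run_seconds(1e11, 100, 1e7, 1e4)
```

With these assumed numbers, the functional pass, the warming, and the detailed simulation each take on the order of 100 seconds, while running all 1e11 instructions in detailed mode would take about 1e7 seconds, which is nearly four months. Dropping `functional_mips` to 1 makes the functional pass dominate everything else by a wide margin.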
Full Speed Ahead – Virtualization for Functional Speed
My understanding of Full Speed Ahead is based on the following two papers:
- “Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed”, 2015 IEEE International Symposium on Workload Characterization (IISWC 2015), by Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, Stefanos Kaxiras, and David Black-Schaffer.
- “Adaptive Cache Warming for Faster Simulations”, 9th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO 2017), by Gustaf Borgström, Andreas Sembrant and David Black-Schaffer.
The first thing that the FSA approach does is to replace the slow standard functional simulator model in GEM5 with direct execution on the host using X86 or ARM virtualization. This speeds things up tremendously, since we go from a single-MIPS-level interpreted simulator to several thousand MIPS-level native execution. Obviously this is only applicable to the IA targets.
I have to note at this point that the slow speed of purely functional simulation in GEM5 is an artifact of GEM5 itself and not a general fact for all functional simulators – the literature cites a speed of about 1 MIPS to run an IA target, which is not very impressive. A good just-in-time compiler (JIT) or dynamic binary translator (DBT) as used in tools like Qemu, Simics, OVP, CAPTIVE, and many others should be able to raise that to 1000 MIPS or more.
As an aside, it also seems that everyone in the academic world has missed that Simics [link to Wind River product page] has had virtualization-based fast simulation of X86 on X86 since 2007, just over a year after the instructions were announced by Intel and AMD. Thus, the absolute benefits in time quoted in the papers are compared to a relatively poor baseline – but that is the real baseline that is being used by everyone in the GEM5 community.
There is obviously a lot of work that goes into making the state transfer from the virtualized execution to the detailed execution work. The approach makes use of Unix-style fork() semantics to spin off multiple clones of the main execution, which is a common way to achieve in-memory checkpoints on Linux machines.
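The fork() idea can be illustrated with a minimal Python sketch (Unix only). This is not GEM5's implementation; `run_in_clone` and `detailed_sample` are hypothetical names, and a dictionary stands in for the simulator state. The point is only that the forked clone works on a copy-on-write snapshot, so the parent can continue from the untouched state.

```python
import os
import pickle


def run_in_clone(state, work):
    """Fork a clone that runs work(state) and send its result to the parent.

    The child sees a copy-on-write snapshot of the parent's memory, so
    whatever it does to `state` is invisible to the parent.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: operate on the private copy and report the result.
        os.close(read_fd)
        try:
            result = work(state)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(result, out)
        finally:
            os._exit(0)  # leave without running the parent's cleanup code
    # Parent: read the clone's result, then reap the child process.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    return result


def detailed_sample(state):
    """Hypothetical stand-in for a detailed run that mutates the state."""
    state["pc"] += 1000
    return state["pc"]
```

Calling `run_in_clone(state, detailed_sample)` returns the value computed in the clone while the parent's `state` keeps its original contents, which is the property an in-memory checkpoint needs.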
It should be noted that there is a rather large limitation in that the FSA approach does not simulate multicore targets. This simplifies the initial implementation, but I wonder how easy it is to introduce a temporally decoupled or even truly parallel simulation engine to GEM5. GEM5 appears to assume very tight synchronization between the simulated cores when it does multicore simulation, like switching between cores every cycle, and that might be hard to mesh with virtualization-based acceleration.
Another possible issue that is not addressed is how the nature of the host system could affect the simulation. When using VT-x, you are normally exposing a copy of the underlying machine to the software running inside the virtual machine – and that might be anything. There is a footnote in the first paper that mentions that they force the compiler to emit SSE code instead of X87 classic floating-point code in order to get around limitations in the GEM5 simulator, which does not do X87 very well. If they allowed benchmarks to run using X87, it would work fine in FSA, but then break when transferring over to the standard simulator for warming. This might be better solved by using CPUID virtualization to mask dynamically discovered instruction set extensions. I can see this becoming a problem over time, as IA software is getting better and better at probing the host hardware for particular instruction set variants and using them dynamically when available.
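As a sketch of what CPUID masking amounts to, the snippet below clears feature bits from the ECX value that CPUID leaf 1 would report to the guest. The bit positions are the documented ones for that leaf; the function name and the choice of which features to hide are assumptions for illustration, and nothing here is taken from the FSA papers or from GEM5.

```python
# Feature bits reported in ECX by CPUID leaf 1.
ECX_SSE3 = 1 << 0
ECX_FMA = 1 << 12
ECX_SSE4_1 = 1 << 19
ECX_SSE4_2 = 1 << 20
ECX_AVX = 1 << 28


def mask_features(host_ecx, unsupported):
    """Return the ECX value to show the guest, with unsupported bits cleared."""
    return host_ecx & ~unsupported & 0xFFFFFFFF


# Hypothetical host that offers SSE3, SSE4.2, FMA and AVX, paired with a
# detailed simulator model that is assumed to lack FMA and AVX.
host_ecx = ECX_SSE3 | ECX_SSE4_2 | ECX_FMA | ECX_AVX
guest_ecx = mask_features(host_ecx, ECX_FMA | ECX_AVX)
```

Note that masking only affects software that checks CPUID before using an extension; the instructions themselves still execute on the host if a program uses them anyway.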
Execution vs Checkpointing
The FSA approach is very straightforward, and clearly speeds up the functional part of a sampled simulation. However, there is a competing solution – using checkpoints of the state at each sampling point and reusing them for many experiments. In such a case, the cost of running the fast functional simulation to establish the checkpoints would be easily amortized across many detailed runs, making FSA a rather marginal improvement…
However, as argued by the FSA authors, the checkpoint approach has some serious problems. It limits experiments to the variation achievable from each checkpoint. Fundamentally, each checkpoint represents a frozen state for the system state covered by the checkpoint.
If you want to change the software configuration, you need to redo the checkpoints. If a checkpoint contains the cache warming state, you can only simulate with the given cache size and have to redo the checkpoint to change that parameter.
A faster functional simulator makes it possible to replace checkpoints with on-the-fly single-use state transfers into detailed mode, which means that more varied scenarios can be explored. You can vary both software and hardware configurations between runs, and thus learn more about the performance landscape.
I actually think this is rather important… modern software often takes a look at the hardware it is running on (size of caches, number of cores, types of cores, observed execution time, etc.) to configure itself for best performance. Thus, you cannot meaningfully freeze samples from an execution and use them with another hardware configuration – this removes a critical feedback path from the hardware to the software. The SPEC benchmarks beloved by researchers do not do this, but many real and interesting workloads most definitely do. Modern computer design is huge on adaptability and feedback loops, and computer architecture experiments really need to be set up to deal with this, which means being execution driven as far as ever possible.
I wonder if the use of functional checkpoints that do not contain architecture state would substantially affect the timing trade-offs for FSA. Using purely functional state checkpoints would mean redoing all the warming for each checkpoint, allowing for more architectural flexibility – but the FSA approach might still be better if you do not reuse checkpoints more than a few times.
Being Clever about Cache
Apart from the use of virtualization, the FSA approach does some truly clever things to improve cache warming. It seems that once virtualization is used for fast functional simulation, the cache warming is the most time-consuming part of the simulation! Since it needs to run through very many instructions (10s or 100s of millions seem to be standard) to properly warm up all the long-term memories in the core, it takes a lot more time than the final detailed simulation! This is a result that rather surprised me, since detailed simulation is incredibly slow – but it also needs to run far fewer instructions to produce interesting results.
A simple way to overcome this bottleneck is to execute multiple warming runs and detailed simulations in parallel. This increases throughput, while not reducing the latency of each simulation. This is a classic comp.arch technique – having very many simulations running in parallel was a common use case for academic Simics users back in the early days. I recall working with universities that used hundreds of Simics licenses in order to parallelize simulation of many different variants and sample points, while most commercial users would only use a single simulation instance per user. Today, this has changed with the advent of continuous integration [link to Wind River blog on CI with Simics], where commercial software testing is also using hundreds or thousands of simultaneous runs.
However, you can do better.
The first trick used in FSA is to do a short cache warming run that does not actually end up warming the entire cache. Some cache lines have not been touched at all. Then, you run two variants of the detailed simulation. One assumes that memory accesses that access untouched (cold) cache lines would have hit the cache, and the other assumes that they all miss. Logically, this should result in a best-case and worst-case time, with the “real” time being somewhere between the two. In this way, it is possible to estimate the error caused by the much shorter cache warming, in turn making it possible to shorten it down to a “minimal” amount.
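The bounding idea can be shown with a toy model. The sketch below is an assumption-laden simplification, not the method from the paper: it uses a direct-mapped cache, an address trace as a list, and miss counts instead of cycle times. Cache sets that warming never touched are treated as cold, and every access to a cold set is counted as a hit in the optimistic run and as a miss in the pessimistic run.

```python
def miss_bounds(trace, sample_start, sample_len, warm_len, n_sets,
                line_bytes=64):
    """Return (optimistic_misses, pessimistic_misses) for one sample.

    Toy direct-mapped cache: a set that was never touched during warming
    is cold, and its first access in the sample is the uncertain one.
    """
    tags = {}  # set index -> tag; a missing key means the set is cold

    def locate(addr):
        line = addr // line_bytes
        return line % n_sets, line // n_sets

    # Warming: only update cache contents, count nothing.
    for addr in trace[max(0, sample_start - warm_len):sample_start]:
        idx, tag = locate(addr)
        tags[idx] = tag

    known_misses = 0   # misses we are sure about
    cold_accesses = 0  # accesses whose outcome depends on unknown history
    for addr in trace[sample_start:sample_start + sample_len]:
        idx, tag = locate(addr)
        if idx not in tags:
            cold_accesses += 1
        elif tags[idx] != tag:
            known_misses += 1
        tags[idx] = tag  # after this access the set content is known

    return known_misses, known_misses + cold_accesses
```

In this toy model the miss count of a fully warmed cache always lies between the two returned values, and the gap shrinks as `warm_len` grows because fewer sets remain cold. A set-associative cache needs more bookkeeping, since one access does not determine the content of the whole set.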
In the RAPIDO 2017 paper, this approach is automated to build an adaptive system that searches for the minimal cache warming run length that provides the desired accuracy (as measured in the error in cycles-per-instruction from the detailed simulation run). If the error is too high, you rerun the same sample with a longer cache warming stage. This is really clever, and offers a way to automatically save time in computer architecture simulation by doing just enough work to get useful results.
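A minimal sketch of such an adaptive loop could look as follows. The doubling strategy, the relative-gap error metric, and the `evaluate` callback are all assumptions made for illustration; the paper's actual search strategy and error definition may differ. `evaluate(warm_len)` is expected to run the optimistic and pessimistic detailed simulations and return their CPI values.

```python
def adaptive_warming(evaluate, initial_len, max_len, tolerance):
    """Grow the warming length until the CPI bounds are close enough.

    evaluate(warm_len) must return (best_cpi, worst_cpi). Returns the
    chosen warming length together with the final pair of bounds.
    """
    warm_len = initial_len
    while True:
        best_cpi, worst_cpi = evaluate(warm_len)
        gap = (worst_cpi - best_cpi) / best_cpi
        # Stop when the bounds agree well enough, or when we hit the cap.
        if gap <= tolerance or warm_len >= max_len:
            return warm_len, best_cpi, worst_cpi
        warm_len = min(warm_len * 2, max_len)


def toy_evaluate(warm_len):
    """Hypothetical sample whose worst-case bound tightens with warming."""
    return 1.0, 1.0 + 1000.0 / warm_len


chosen_len, best, worst = adaptive_warming(toy_evaluate, 1000, 10**6, 0.05)
```

The `max_len` cap matters for samples where the bounds never converge: the loop then stops at the cap and returns the remaining gap, so the caller can see that the result is less trustworthy.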
There are some benchmarks where this just does not work, but for most cases the best-case and worst-case simulations quickly converge.
I must say that these two papers were among the most readable I have ever read. Well-written, clearly reasoned and with good illustrations. The authors take pains to bring up cases when the approaches proposed do not work well, which is very honorable and shows confidence in the basic results. This is the approach to research you would like everyone to have, but unfortunately not everyone is this clear about the limitations of their approach. Far too often I see papers where the authors are trying to hide or ignore limitations in order to get a “better” result.
It would have been nice to see references to Simics’ use of virtualization (we did mention it in our 2010 book chapter in Processor and System-on-Chip Simulation, as well as in our 2014 book about Simics and virtual platforms).
Good work, and it is great to see it come out of the department where I spent my PhD days.