The SystemC Evolution Fika on April 7 had threading/parallelism as its theme. There were four speakers who presented various angles on how to parallelize SystemC models. The presentations and following discussion provided a variety of perspectives on threading as it can be applied in virtual platforms and other computer architecture simulations. It was pretty clear that the presenters and audience had quite different ideas about just what the target domain looks like and the best way to introduce parallelism to SystemC. Here is my take on what was said.
The Event
The SystemC Evolution Fika contained the following presentations:
- Matthieu Moy: sc-during: Parallel Programming on Top of SystemC
- Tanguy Sassolas: Ensuring reproducible parallel LT TLM models simulation with SCale SystemC kernel
- Jakob Engblom: The Intel Simics Simulator and SystemC and Threading (yes, that is the author of this post)
- Rainer Dömer: RISC: A Compiler for Parallel SystemC with Maximum Standard Compliance
Update: The presentations can be found at https://systemc.org/events/scef202204/
I have read papers from or seen presentations by all of the other participants in the past. Indeed, Rainer presented his thoughts on parallelization of SystemC at the very first SystemC Evolution Day in 2016, which I also attended. I have had some discussions with Matthieu as well in the past year, and I have seen the SCale approach presented previously.
What is the Problem?
Overall, I got the sense that people, myself included, came with different application backgrounds and considered quite different use cases under the heading of “parallel simulation”. The nature of the problem is clearly perceived differently by different people, which could be one reason why discussions around parallelizing simulations of virtual platforms sometimes seem to go nowhere. There might just be a conceptual difference that nobody picks up on.
This is my attempt to structure the problem, from what I heard at the SystemC Evolution Fika.
Granularity
The first aspect to consider is the granularity of parallelism. What is actually being considered as a relevant target for the parallelization?
Chunky Parallelism
If you have a virtual platform background (i.e., a focus on the software running on the platform), it seems natural to start with the need to parallelize instruction-set simulators and heavy compute jobs. Those are the heavy components of the simulation, while regular devices tend to take less time and be less critical. This leads to a view of parallelism as something applied selectively to a few blocks in a design. The assumption is that most of the model stays the same as when writing a serial model. The parallel blocks tend to be quite dissimilar from each other.
It looks something like this:
The parallel program in this case looks more like a threaded desktop application that uses many threads to do a lot of different things at once, or like the set of processes running on top of an operating system in daily interactive use. I.e., rather varied, noisy, and with constantly varying load balance.
This is in essence the approach taken in Simics, where most device models run in a serial environment, and individual models are explicitly made thread-aware. Sometimes entire sets of processors are given a single thread, other times each processor core gets its own, depending on how much work they have to do. Behind the scenes there is a worker pool and a core scheduler that finds work to run in parallel, and a threading programming model specific to Simics.
The sc-during approach from Matthieu Moy targets this kind of chunky parallelism, providing an API to offload work onto threads that run in parallel to the main simulation. The code would be written to explicitly spawn compute work, and to carefully synchronize back to the serial SystemC kernel.
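To make the pattern concrete, here is a rough sketch of offloading a computation onto a host thread while simulated time keeps advancing. This is my own illustration of the general idea, not the actual sc-during API; the module and function names are mine.

```cpp
// Sketch only: offload a heavy computation to a host thread while the
// (serial) SystemC kernel keeps advancing simulated time, then sync back.
#include <systemc.h>
#include <atomic>
#include <thread>

SC_MODULE(Offloader) {
    std::atomic<bool> done{false};

    SC_CTOR(Offloader) { SC_THREAD(run); }

    void heavy_compute() {
        // Long-running host computation; must not touch SystemC state.
        done = true;
    }

    void run() {
        std::thread worker(&Offloader::heavy_compute, this);
        // The worker runs in parallel while simulated time advances here.
        while (!done)
            wait(10, SC_US);
        worker.join();
        // Back in kernel context: safe to touch SystemC objects again.
    }
};
```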
The SCale system presented by Tanguy Sassolas is also looking for parallelism in a chunky system with instruction-set simulators. The programmer is responsible for determining which parts of the system should be executed on separate threads, as well as for annotating accesses to shared resources for safety.
Fine-Grained Parallelism
Another view is to consider a mostly symmetrical parallel hardware design such as a compute accelerator. Here, there are typically many identical small blocks that are run in parallel in the hardware, and that seem like very natural candidates for parallel simulation. This is typically expected to be more fine-grained than the chunky approach, since each block considered for parallelization is smaller.
Something like this:
This leads down the route of mapping individual models to threads, and of thinking of a model as something quite similar to a high-performance computing (HPC) program that performs a single computation as a large set of small parallel compute tasks.
Rainer Dömer’s work on RISC leans towards this formulation of the problem. He has been looking at cases where the SystemC simulation expresses hardware designs, and it is possible to analyze the SystemC code to determine what can be run in parallel. This is very similar to HPC parallelizing compilers. The RISC approach is harder to apply to setups containing instruction-set simulators, as the actual parallelism and shared-data accesses cannot be determined from the simulator source code.
There was a variant of this view in a discussion comment by Martin Barnasconi. He noted that from a purely standards perspective, all we have are the SC_THREAD constructs. Thus, that is basically what we would have to try to parallelize.
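For reference, this is what that unit of parallelism looks like in plain standard SystemC; each SC_THREAD process yields to the kernel at its wait() calls, and would be a candidate for parallel execution under this view:

```cpp
#include <systemc.h>

SC_MODULE(Worker) {
    SC_CTOR(Worker) {
        SC_THREAD(run);  // each SC_THREAD is a candidate unit of parallelism
    }
    void run() {
        for (;;) {
            wait(10, SC_NS);  // yield to the kernel; serial semantics apply
            // ... compute one step of the model ...
        }
    }
};
```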
Multiple-Kernel Parallelism
No discussion of parallelism would be complete without considering the “multiple-kernel” approach to SystemC parallelization. This is where multiple separate programs (operating-system processes) running on the same host are used to execute a model. Each process runs a stand-alone simulation, but the different simulations communicate over special channels. Like this:
None of the presentations at the SystemC Evolution Fika brought up this particular style of parallelism, but it is one that is used in practice both in academia and industry. There are commercial solutions available for SystemC to connect and synchronize execution across multiple processes, for example CoMix (Concurrent Model Interface) from Cadence. Simics can run in a distributed mode like this (in addition to the internal threading in each process).
Definition of Correctness
Given some division of a system into parts that should be run in parallel, what are the semantics of the parallelized simulation? Here, there are definitely quite a few different opinions.
One view is to say that a parallel simulation should produce the same result as a serial execution of the same model, only faster. I.e., that the parallel execution has no impact on the semantics of the virtual platform. In a SystemC context, this means that the execution of the standard serial SystemC kernel is the gold standard, and any deviation from that is a problem.
The SCale approach is built from this perspective. It enables safe and correct simulation by providing primitives to annotate accesses to shared model resources like memory, and then watching the execution to detect cases where parallel execution would produce incorrect results. There are mechanisms in place to stop divergence and bring the simulation semantics back to serial-equivalence.
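To make that concrete, here is a purely illustrative sketch of what annotated shared accesses could look like. The names below are invented for this post and are not SCale’s actual API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical annotation hook -- invented for illustration, not SCale's API.
enum class AccessKind { Read, Write };
void report_access(uint32_t addr, AccessKind kind) {
    // In a SCale-like kernel this would record the access, so that
    // conflicting parallel accesses can be detected and serialized.
    (void)addr; (void)kind;
}

struct Device {
    std::vector<uint32_t> shared_mem = std::vector<uint32_t>(1024);

    uint32_t read_shared(uint32_t addr) {
        report_access(addr, AccessKind::Read);
        return shared_mem[addr];
    }
    void write_shared(uint32_t addr, uint32_t val) {
        report_access(addr, AccessKind::Write);  // kernel may serialize here
        shared_mem[addr] = val;
    }
};
```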
In my experience, enforcing serial-equivalent semantics for the whole simulation is typically quite costly, at least for closely coupled models like those sharing memory. The classic Simics multimachine accelerator does provide serial-equivalent semantics for a simulation made up of separate boards that only communicate over networks (similar in granularity to the multi-kernel approach, but running on a single simulation kernel).
However, when Simics was redesigned to run shared-memory processors in parallel, serial equivalence and determinism were dropped. It just required too much synchronization to enforce such semantics.
That leads to the other view, that the parallel execution produces a different result from a serial execution. This is how threaded code works in general (with the exception of highly structured HPC code). Each run can be (will be) different depending on timing differences, which threads get to run when, the order in which shared data is modified, etc.
This has some drawbacks, and debugging threading-related problems in a simulator is about as much fun as doing it in any other program. However, it also definitely unlocks maximum performance, and in general it works sufficiently well in practice.
There is a theoretical middle ground, where the parallel execution is repeatable, but with different semantics than the serial execution. The idea here is to ensure that if the simulation is started in the same configuration and with the same inputs, it follows the same execution path. Using a different number of threads to run the simulation would result in a different execution path, but it would at least be possible to repeat any particular run, which is all you need for solid debugging.
I am not sure if anyone has implemented such a solution (record-replay debug does not count; this would be a parallelism scheme that is deterministic by design).
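As a thought experiment, here is a minimal sketch, not based on any of the presented tools, of what deterministic-by-design parallelism could look like: workers simulate a fixed quantum in parallel, queue their shared-state updates locally, and the updates are applied in a fixed worker order at a barrier. For a given thread count, every run interleaves the same way.

```cpp
// Requires C++20 for std::barrier.
#include <barrier>
#include <functional>
#include <thread>
#include <vector>

constexpr int N = 4;  // number of parallel workers
std::vector<std::vector<std::function<void()>>> pending(N);

// The completion step runs once per quantum, on a single thread, after all
// workers have arrived: applying updates in worker order makes the result
// independent of host-thread timing.
std::barrier sync_point(N, []() noexcept {
    for (auto& queue : pending) {
        for (auto& update : queue) update();
        queue.clear();
    }
});

void worker(int id, int quanta) {
    for (int q = 0; q < quanta; ++q) {
        // ... simulate one quantum, queuing (not applying) shared writes ...
        pending[id].push_back([] { /* apply one shared-state write */ });
        sync_point.arrive_and_wait();  // deterministic rendezvous
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < N; ++i) pool.emplace_back(worker, i, 100);
    for (auto& t : pool) t.join();
}
```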
Coding Impact
The talks at the Fika did not go all that deep into the actual coding of parallel simulation models, but there seem to be three approaches here.
The first is exemplified by Simics, SCale, and sc-during: explicitly code parallelism into a few selected parts of the model, and leave the rest alone. Ideally, the rest of the model does not need to change at all, since the parallelized models provide the same interface as regular serial models.
The second is exemplified by RISC: the code is analyzed to find parallelism. All code is considered implicitly parallel, but the modeler should not need to deal with all the complexities of parallel coding.
The third is to make the whole model into a parallel program, using locks to protect shared data, etc. This is not an ideal scenario in my opinion, since parallel programming is hard to get right. Modeling a hardware system effectively while also constantly keeping an eye out for parallel programming problems would seem to slow down the modeling process and likely lead to more bugs.
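As a tiny illustration of the burden this style imposes (a sketch of mine, not from any of the presentations): even a trivial counter device has to guard every access once any thread may call into it.

```cpp
#include <cstdint>
#include <mutex>

// Every device model in the "everything is parallel" style pays for locking
// on every access, whether or not contention ever happens in practice.
class Counter {
    std::mutex m;
    uint64_t value = 0;
public:
    void increment() {
        std::lock_guard<std::mutex> g(m);
        ++value;
    }
    uint64_t read() {
        std::lock_guard<std::mutex> g(m);
        return value;
    }
};
```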
Ways to Move Forward
Since this was a SystemC Evolution event, what were the actual designs proposed for introducing parallelism in SystemC?
Matthieu Moy and Tanguy Sassolas follow the same design philosophy of working around SystemC: adding new functionality and APIs to allow threads to run in parallel to classic SystemC, but without requiring changes to the core kernel. In a way, this approach does not so much evolve SystemC as use it as-is as the basis for an extended system.
Rainer Dömer wants to introduce a set of targeted changes to the standard to turn SystemC into a language that is friendly to parallelization. This is a more fundamental approach that seems very hard to actually get consensus around – but it is also an honest attempt to actually evolve the standard towards incorporating parallelism outright.
The Simics approach provides coarse-grained parallelism for SystemC models from the outside. In Simics, each SystemC subsystem effectively has its own SystemC kernel, and can run in parallel to other SystemC subsystems with their own kernel. This is multi-kernel parallelism, but inside a single process, and connected to native Simics models that also run in parallel. It does not really affect the SystemC code at all, providing parallel execution to existing code with an adaptation layer.
Disclaimer: As always, my personal blog posts reflect my own opinions, not those of my employer.
The fine-grained approach should be the best, as you say it can guarantee correctness and best “load balance” the operations. If we were using a “real” simulation language, it would be easier for the compiler to analyse the parallelism. I’ve always felt the single-threaded speed benefit that C++ gave a few decades ago has had the long-term cost of making parallelism a really hard problem for SystemC. (Of course there are many other benefits of C++ as a base language.)
Is RISC a standalone compiler, or perhaps a hybrid approach of adding a SystemC analysis stage to an existing C++ compiler?
A quick and dirty, but performant, multiple-kernel parallelism solution can be hacked together with a shared memory interface. One of the problems I encountered doing that was “bridging” OS events and threading parallelism events with simulation events. I’ve also encountered this when interfacing with external devices. Is there a good standard solution for this in the SystemC kernel?
Or do solutions like CoMIx rely on custom kernel changes?
Thanks for the notes!
In my opinion, fine-grained parallelism is very hard to do well once you throw in instruction-set simulators and the attendant behavior of the software, as well as the need for significant temporal decoupling to provide locality. It works pretty well for cycle-detailed models of microarchitecture, less so for high-level transaction-level modeling. It is all about the balance between computation and communication/sync.
Not sure precisely how the RISC approach is implemented.
I believe there are ideas being worked on by Mark Burton and his team to add a better interface for managing the connection of a SystemC simulation to other simulators. But still, the issue of how to introduce external events to the simulation remains – I think the simplest solution there is to decouple them to some extent: have a thread handle the actual communication with the outside world, and then have some device model periodically check, in virtual time, whether something has happened. Some level of asynchronicity is needed, it seems.
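Something like this bare-bones sketch of the polling pattern (my illustration, with invented names; note that IEEE 1666-2011 also offers async_request_update() on primitive channels for this kind of bridging):

```cpp
#include <systemc.h>
#include <chrono>
#include <mutex>
#include <queue>
#include <thread>

SC_MODULE(ExternalBridge) {
    std::mutex m;
    std::queue<int> inbox;  // filled by the host-side thread

    SC_CTOR(ExternalBridge) {
        SC_THREAD(poll);
        std::thread(&ExternalBridge::host_receiver, this).detach();
    }

    void host_receiver() {  // plain OS thread, outside the kernel
        for (;;) {
            // Stand-in for blocking on a socket or external device.
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            std::lock_guard<std::mutex> g(m);
            inbox.push(42);
        }
    }

    void poll() {  // SystemC process, in kernel context
        for (;;) {
            wait(100, SC_US);  // decoupled: check every 100 us of virtual time
            std::lock_guard<std::mutex> g(m);
            while (!inbox.empty()) {
                int msg = inbox.front();
                inbox.pop();
                // ... deliver msg into the model as a normal event ...
                (void)msg;
            }
        }
    }
};
```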
Very interesting, a great introduction to the problem. It’s a pity how difficult these problems are, since so many people need the simulation to go faster.