DRAMsys – Cycle-Accurate Simulation using Transactions

DRAMsys is a simulator for modern RAM systems, built by researchers at Fraunhofer IESE and the Technische Universität Kaiserslautern. Over the past few years, I have heard several talks about the tool and also had the luck to talk a bit to the team behind it. It is an interesting piece of simulation technology, in particular for how it manages to build a truly cycle-accurate model on top of the approximately-timed (AT) style defined by SystemC TLM-2.0.

For a short overview of DRAMsys (version 4.0), see the SAMOS XX talk by Lukas Steiner available on YouTube. The open-source edition of DRAMsys is available on GitHub, at https://github.com/tukl-msd/DRAMSys.

What does DRAMsys Model?

The purpose of DRAMsys (and associated tools) is to support performance analysis (including driving DRAM power models) for Dynamic RAM (DRAM) systems. Such models can be used as part of a system-wide performance simulator to include the effects of a specific type of DRAM as part of the memory hierarchy. They can also be used to try different types of DRAMs or different configurations of DRAM for a given workload or application, in order to optimize the system design of specialized computer systems.

Modern Double-Data Rate (DDR) DRAM has a very rich internal structure, with many layers of hierarchy going from the memory channels (specified as part of the capabilities of a certain platform or chip) to individual DIMMs (Dual Inline Memory Modules), to the chips (known as devices) on the DIMM, and on to their internal structure of ranks, banks, arrays, and finally to rows and columns.

Illustration of the structure of DRAMsys, from “Exploration of DDR5 with the Open-Source Simulator DRAMSys” by Lukas Steiner, Matthias Jung, and Norbert Wehn, in the proceedings of Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV) 2021.

DRAM controller front ends are responsible for scheduling and prioritizing requests from the system, while the control back end (channel controllers) drives the actual DRAM DIMMs and devices. The back-end protocol towards the DRAM devices is not trivial, with many different commands available to retrieve memory information, manage memory refresh, and allow for optimization of data accesses. Each DRAM standard defines a command set (which typically gets richer and richer for each new generation of DRAM), as well as the timing constraints between commands. The job of DRAMsys is to faithfully model the commands and their constraints, in order to provide an executable model of the DRAM standards.

The DRAMsys model is executed using SystemC TLM-2.0, with a model coded using the AT style.

TLM AT style, CA results

By design, the SystemC TLM AT modeling style is intended for approximately-timed models. Such models are expected to approximate the timing of some actual hardware design, but not to really achieve full cycle-accuracy (CA). Hiding behind the terminology is the assumption that a truly accurate model would have to be cycle-driven and not transaction-driven. However, DRAMsys shows that for some types of hardware, a model coded using the AT style can actually be cycle-accurate in its results. This is an interesting data point in the design space of simulators.

For DRAMsys, my intuition is that the cycle accuracy comes down to DRAM being specified and functioning in a way that fits the definition of approximate timing very well: as a set of events (timing points) with defined latencies between them (many of which happen in parallel). Events typically comprise a command being given from a controller towards a device, and a response arriving after an amount of time given by the protocol and the properties of the DRAM being modeled. There is no need to look for the response at all cycles between the command and the response. The cycle accuracy derives from this rare “impedance match” between the real world and the modeling style. Note that DRAMsys uses some custom phases in SystemC TLM-2.0 AT to fully match the DRAM specifications, since the default TLM-2.0 phases are not rich enough.
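To make the "timing points with latencies" idea concrete, here is a minimal Python sketch of an event queue. It is not how DRAMsys is implemented (that is SystemC TLM-2.0 with custom phases); the function name, the command names, and the latency values are all made up for illustration. The point is only that a response is scheduled as a future event when the command is issued, and nothing is examined in between.

```python
import heapq

def simulate(commands, latency):
    """Run a tiny event-driven simulation of commands and their responses.

    commands: list of (issue_time, name) tuples.
    latency:  dict mapping command name to response latency in cycles
              (hypothetical values, not taken from any DRAM standard).
    Returns the ordered list of (time, kind, name) timing points.
    """
    queue = []
    for seq, (time, name) in enumerate(commands):
        # seq breaks ties so that events at the same time keep insertion order.
        heapq.heappush(queue, (time, seq, "command", name))
    seq = len(commands)

    log = []
    while queue:
        now, _, kind, name = heapq.heappop(queue)
        log.append((now, kind, name))
        if kind == "command":
            # Schedule the response as a future timing point.
            # No cycle between now and then is ever visited.
            heapq.heappush(queue, (now + latency[name], seq, "response", name))
            seq += 1
    return log

log = simulate([(0, "RD"), (4, "WR")], {"RD": 22, "WR": 18})
for entry in log:
    print(entry)
```

With these invented latencies, both responses land on cycle 22, and the simulation touches exactly four timing points to get there.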

The primary advantage of using an event-driven transaction model is performance, in particular for light loads (few memory operations per time unit) and idle periods. Simulating billions of cycles of doing nothing works very well in a TLM model as it can jump to the next event, whereas a cycle-driven model would check each clock cycle along the way. Under heavy load, a TLM model will exhibit similar performance to cycle-based modeling since there is activity on every cycle and thus there are no idle cycles to skip over. Overall, the TLM model comes out ahead, since it adapts better to varying levels of load.
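A rough way to see the difference is to count how many loop iterations each approach needs for the same set of events. The sketch below is a toy comparison in Python, with invented function names and event times; it says nothing about the actual cost per iteration in a real simulator.

```python
def cycle_driven_steps(event_times, end_time):
    """Visit every clock cycle up to end_time; return (iterations, events handled)."""
    pending = set(event_times)
    steps = 0
    handled = 0
    for cycle in range(end_time + 1):
        steps += 1                # work is done on every cycle, busy or idle
        if cycle in pending:
            handled += 1
    return steps, handled

def event_driven_steps(event_times, end_time):
    """Jump from event to event; return (iterations, events handled)."""
    steps = 0
    handled = 0
    for time in sorted(event_times):
        if time > end_time:
            break
        steps += 1                # work is done only when something happens
        handled += 1
    return steps, handled

# Light load: three events spread over a million cycles.
light = [0, 500_000, 1_000_000]
print(cycle_driven_steps(light, 1_000_000))  # (1000001, 3)
print(event_driven_steps(light, 1_000_000))  # (3, 3)

# Heavy load: one event on every cycle.
heavy = list(range(1000))
print(cycle_driven_steps(heavy, 999))        # (1000, 1000)
print(event_driven_steps(heavy, 999))        # (1000, 1000)
```

Under the light load the event-driven loop does three iterations instead of a million; under the heavy load the two counts are identical, which matches the argument above.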

I would also think that the model itself is easier to express using a TLM style, given that this matches the domain. Expressing DRAM timing in a cycle-driven simulator seems like it would require implementing latencies between events using some kind of list of active transaction phases and the remaining time before the next timing point… and checking each active transaction on each cycle… which is basically an inefficient implementation of an event queue. For DRAM, transactions just make sense.

DRAMml – Modeling DRAM

Instead of coding DRAM standards directly in SystemC, the DRAMsys simulator uses a domain-specific language called DRAMml (for DRAM modeling language I guess) to describe the command sequences and states of the JEDEC DRAM standards. The semantics of DRAMml are based on timed Petri nets, providing a solid underlying theoretical model (as an aside, that part just feels archetypically German – for some reason, Petri nets are very popular in German computer science research and not quite as common elsewhere). The complete simulation generation flow is thus that DRAMml gets compiled into Petri nets, which in turn are used to generate SystemC code which is actually used in the execution of the simulation.

In addition to the DRAMml description of each protocol generation, JSON files are used to describe the particulars of a memory device like size and actual timings. It seems that the open-source repository for DRAMsys only contains the generated SystemC, and not the actual DRAMml source code.
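As a purely hypothetical illustration of what such a device description might carry, the Python sketch below parses a small JSON document and converts nanosecond timings into clock cycles. The field names (`size_mib`, `clock_mhz`, `timings_ns`) and all values are my own invention and do not reflect the actual DRAMsys configuration schema.

```python
import json
import math

# Hypothetical device description; the field names and values are invented
# for this example and are NOT the DRAMsys JSON format.
DEVICE_JSON = """
{
  "name": "example-device",
  "size_mib": 1024,
  "clock_mhz": 1600,
  "timings_ns": {"tRCD": 13.75, "tRP": 13.75, "tRAS": 32.0}
}
"""

def timings_in_cycles(text):
    """Convert each nanosecond timing to whole clock cycles, rounding up."""
    device = json.loads(text)
    period_ns = 1000.0 / device["clock_mhz"]   # 0.625 ns at 1600 MHz
    return {
        name: math.ceil(ns / period_ns)
        for name, ns in device["timings_ns"].items()
    }

print(timings_in_cycles(DEVICE_JSON))
# {'tRCD': 22, 'tRP': 22, 'tRAS': 52}
```

Rounding up is the conservative choice here: a constraint given in nanoseconds must not be shortened when it is mapped onto a clock.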

DRAMml is impressively expressive. The JEDEC DRAM standards are hundreds of pages long, containing dozens of pages of timing diagrams. The papers about DRAMsys claim that this can be boiled down to a couple of pages of DRAMml code for each standard. Obviously, this raises the question of whether JEDEC will be willing to adopt this type of formal description of memory standards going forward. It seems like a really good idea to move to a truly executable/simulatable specification, but that would likely require a long process in standardization committees.

A key benefit of the use of a DSL is that it is a lot quicker to model new standards. For example, the DRAMsys team was able to present a DDR5 model fairly soon after the standard was finalized. Hand-coding the detailed simulation would take a much longer time.

Connections and System Context

One interesting question raised by a good subsystem model like DRAMsys is just what you need to do to feed it useful data to process. DRAMsys is built to receive transactions using standard SystemC TLM-2.0 and from trace files. The operations need to be concrete: reads or writes, to a certain address. Operations in DRAM are 64 bytes wide (a single standard cache line), which means it cannot be hooked up straight to a processor core. If you want to run code that generates memory operations and hook it up to DRAMsys, a functional cache model, at least, is necessary.
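The kind of functional cache model I have in mind can be very small. The Python sketch below is an assumption-laden toy: a fully associative, write-back, write-allocate LRU cache with no timing at all, whose only job is to turn byte-level accesses from a core into 64-byte line reads and writes. The class and its interface are hypothetical and unrelated to any DRAMsys API.

```python
from collections import OrderedDict

LINE_BYTES = 64  # one cache line, matching the 64-byte DRAM operations

class FunctionalCache:
    """Toy fully associative write-back LRU cache without any timing."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line address -> dirty flag, oldest first
        self.dram_ops = []          # (op, line address) sent towards DRAM

    def access(self, op, address):
        line = address - (address % LINE_BYTES)
        if line in self.lines:
            self.lines.move_to_end(line)          # hit: refresh LRU position
        else:
            if len(self.lines) >= self.num_lines:
                victim, dirty = self.lines.popitem(last=False)
                if dirty:
                    self.dram_ops.append(("write", victim))  # write-back
            self.dram_ops.append(("read", line))  # line fill (also on write miss)
            self.lines[line] = False
        if op == "write":
            self.lines[line] = True               # mark the line dirty

cache = FunctionalCache(num_lines=2)
for op, addr in [("read", 0x1000), ("write", 0x1008),
                 ("read", 0x2000), ("read", 0x3004)]:
    cache.access(op, addr)

for op, line in cache.dram_ops:
    print(op, hex(line))
```

Four core accesses become four DRAM operations here, but different ones: the write to 0x1008 hits in the cache and only reaches DRAM later, as a write-back of line 0x1000 when that line is evicted.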

Just how timing-accurate must that upstream model be to provide useful data? While I am a strong proponent of simplified models that focus on the important aspects for a particular task, here the level of abstraction is pretty much a given. If you have an interest in the detailed timing behavior of DRAM, the rest of the system model has to be at a similar level of timing accuracy to really make sense. It is definitely interesting to see how DRAM operates based on a functional trace, but the timing information produced by DRAMsys feels a bit wasted unless the timing and bandwidth information is used in some kind of feedback loop.

Checker Mode

The description of a DRAM standard can be used not only to build a running simulation; it can also be used to build a checker that looks at the timing traces generated from other simulators. For example, it is possible to use parts of DRAMsys to check an RTL implementation of DRAM or a DRAM controller. This is a neat additional capability that really only makes sense with a cycle-accurate simulation. A checker that was cycle-approximate would be a bit odd…
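In spirit, such a checker replays a command trace and flags every pair of commands that sits closer together than the standard allows. The Python sketch below shows the idea with a hypothetical constraint table; the command names are generic DRAM commands, the cycle counts are invented, and the real DRAMsys checker is of course far more complete.

```python
# Hypothetical minimum spacing between commands on the same bank, in cycles.
# The values are illustrative and not taken from any JEDEC standard.
MIN_GAP = {
    ("ACT", "RD"): 14,   # activate to read
    ("ACT", "PRE"): 32,  # activate to precharge
    ("PRE", "ACT"): 14,  # precharge to activate
}

def check_trace(trace):
    """Check a trace of (cycle, bank, command) tuples, sorted by cycle.

    Returns a list of violations as
    (cycle, bank, previous command, command, elapsed cycles, required cycles).
    """
    last_issued = {}   # (bank, command) -> cycle of most recent issue
    violations = []
    for cycle, bank, command in trace:
        for (prev, nxt), gap in MIN_GAP.items():
            if nxt == command and (bank, prev) in last_issued:
                elapsed = cycle - last_issued[(bank, prev)]
                if elapsed < gap:
                    violations.append((cycle, bank, prev, command, elapsed, gap))
        last_issued[(bank, command)] = cycle
    return violations

trace = [
    (0, 0, "ACT"),
    (5, 1, "ACT"),
    (10, 0, "RD"),    # only 10 cycles after ACT on bank 0: too early
    (19, 1, "RD"),    # exactly 14 cycles after ACT on bank 1: fine
    (32, 0, "PRE"),
    (40, 0, "ACT"),   # only 8 cycles after PRE on bank 0: too early
]
for violation in check_trace(trace):
    print(violation)
```

An off-by-one in the trace is enough to trigger a violation, which is exactly why a checker like this only makes sense when the reference is cycle-accurate.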

Concluding Remarks

DRAMsys has been developed for something like ten years into a mature and stable piece of simulation technology which achieves cycle accuracy while using TLM abstractions. It is a neat example of a focused simulator that does one thing (DRAM) and does it well. In my mind, such focused solutions are often easier to apply since they do not force a user to extract an interesting kernel from a complex integrated simulator.