In a funny coincidence, I published an article at SCDSource.com about the need for cycle-accurate models for virtual platforms on the same day that ARM announced that they were selling their cycle-accurate simulators and associated tool chain to Carbon Design Systems. That makes one wonder where cycle accuracy is going, or whether it is a valid idea at all… is ARM right or am I right, or are we both right since we are talking about different things?
Let’s look at this in more detail.
Definitions
A clock-cycle (CC) model in this discussion is something that attempts to provide a cycle-by-cycle depiction of the behavior of a computer system. Usually, such models are driven by a cycle-by-cycle clock, as that is the easiest way to write and structure them.
A cycle-accurate (CA) model is a CC model where the depiction is “the same” as what would happen in the real system provided they both started from the same state.
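To make the CC/CA distinction a bit more concrete, here is a minimal sketch (all types and names are my own invention, not any vendor's modeling API) of a trivial two-stage pipeline written first as a purely functional model and then as a clock-cycle model driven by an explicit clock loop. Only the second one has a notion of cycles at all, which is what cycle accuracy can then be measured against:

```cpp
// Minimal illustration (invented names, not a real tool's API) of the
// difference between a functional model and a clock-cycle (CC) model of a
// trivial 2-stage fetch/execute pipeline.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Insn { uint32_t opcode; uint32_t operand; };

// Functional style: execute each instruction to completion, no notion of time.
uint32_t run_functional(const std::vector<Insn>& prog) {
    uint32_t acc = 0;
    for (const Insn& i : prog)
        acc += i.operand;                 // semantics only
    return acc;
}

// CC style: a clock loop where each stage does one cycle's worth of work.
uint32_t run_cycle_by_cycle(const std::vector<Insn>& prog) {
    uint32_t acc = 0;
    size_t fetch_pc = 0;
    Insn ex_slot{};                       // instruction currently in execute
    bool ex_valid = false;
    uint64_t cycle = 0;
    while (fetch_pc < prog.size() || ex_valid) {
        // Execute stage consumes what fetch produced in the previous cycle.
        if (ex_valid) acc += ex_slot.operand;
        // Fetch stage hands the next instruction to execute next cycle.
        if (fetch_pc < prog.size()) { ex_slot = prog[fetch_pc++]; ex_valid = true; }
        else ex_valid = false;
        ++cycle;
    }
    std::printf("finished in %llu cycles\n", (unsigned long long)cycle);
    return acc;
}
```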
What is ARM Doing?
ARM seems to be passing on the tools and technologies they acquired when they bought Axys back in 2004. These tools are CC-oriented and are aimed at hardware architects (and some really low-level software work). They make it possible to evolve a target design cycle by cycle in the simulator and get a very accurate picture of the target behavior. I think this fits Carbon very well, as they generate very accurate cycle-driven models by essentially compiling the actual RTL implementation of a piece of logic, processor, or device into something a bit faster than plain HDL simulation. Carbon models are a natural fit for the Axys tools.
Basically, it sounds as if ARM decided that manually creating CC-level CA models of their latest processors for use in the Axys tools (SoC Designer) was too much work and too hard to validate. Thus, they are passing the whole thing on to Carbon and seem to expect Carbon to generate CA models for use with SoC Designer straight from the actual ARM implementation RTL. Carbon will take over the old CC/CA models written by Axys (and later ARM), and then generate new models for new generations of ARM chips like the Cortex-A9. I quote:
“The model generation flow will be optimized and validated using the RTL code, ensuring speed and accuracy. The processor models will also leverage the Carbon model application programming interface (API) to offer a direct connection to the ARM RealView(R) Debugger. Carbon-generated models of ARM IP will offer our customers the fastest, most-accurate path for firmware development and architectural exploration.” (press release)
And:
ARM made this decision, Cornish said, because it’s become increasingly difficult and time-consuming to develop cycle-accurate models. “We recognized it would make more sense to work with a specialist like Carbon that has technology for generating models directly from RTL,” he said. (SCDSource News Piece on the deal)
Feasibility of Construction
The core argument here is really how feasible it is to build CA models of a processor core (or any other really complex piece of logic). There are several interesting views to consider.
- The ARM statement is basically saying that building CA models of a processor core is very hard. It is hard to get right, hard to validate, and hard to maintain. So why even try? Better to generate the model from the RTL and let the experts at doing that do the work.
- In my PhD thesis from 2002, I concluded that building an accurate model of a processor from public information and reverse engineering is very, very difficult, and cited a number of attempts from computer architecture and real-time systems research that all turned out to have accuracy issues. I did not know much about EDA then, and ESL did not really exist. But I think the point still holds water: constructing a model of a processor is hard.
- In the SCDSource article, I make the statement that “Building cycle-accurate (CA) models is very difficult, as you need to understand and describe the implementation details of complicated hardware units. … It is quite easy to end up with something that is essentially an alternative implementation to the actual chip RTL. It is especially difficult for third parties, as it requires access to the device and processor core designers to explain the design.” That is essentially saying that you need to get inside the processor design group to get the information.
- It is common knowledge that all great processor design teams, from the DEC Alpha to Intel x86, AMD Opteron, IBM Power, Freescale Power, Infineon TriCore, and Sun Niagara, use internal cycle-detailed simulators as their main design tools to prototype and decide how to design pipelines, memory systems, and system platforms. In this case, the simulator comes before the processor, not the other way around.
- Tensilica has, as Grant Martin points out in comments at SCDSource, tools that generate both the processor and an accurate model at the same time from the same information base.
- CoWare’s LisaTek tools for describing and generating application-specific processors also claim to generate accurate models from the LISA source files, in a way similar to Tensilica but with the user describing a completely custom design in a third-party tool. In the Tensilica case, the tool and the design come from the same company.
So where does this leave us? It makes it clear that in order to build a good cycle-accurate model you need access to internal information and to the processor design team. The CA model can be built either:
- By synthesizing from the RTL, Carbon-style.
- By synthesizing from some more abstract design description, Tensilica or LisaTek-style.
- By the design team as part of the design process.
- By some poor guy working after the fact from specs and test cases.
I think the ARM-Carbon deal (and all practical experience as well) invalidates the fourth variant. Essentially, that is what Axys had to do: build models after the fact, separate from the CPU design flow. This is a consequence of how ARM designs processors and the fact that Axys began life outside of ARM (my guess, nota bene). It is what computer architecture researchers often want to do but fail at over and over again. In fact, a common question from computer architecture newbies is whether Virtutech Simics has correct models of processors like the Intel Pentium 4 or Core 2 available to use as starting points in research. It would be nice, but sorry, we do not.
But the other three variants do make sense, and will all result in some kind of decent model. Which one you end up with depends on the style of your design and quite likely on the complexity of the processor and system design. In the end, any truly revolutionary design (think Sun Rock, for example) will need a custom simulator, as existing tools will not have the concepts needed to model all of its ideas. Simple “standard” designs that fit the categories of “custom RISC” or “custom DSP” and that do not break new ground in computer architecture can probably be designed using tools that generate the processor and the simulator together. I think that most heavy-duty general-purpose processor cores will have to take either the design-team-model or the RTL-generation path, while more accelerator-style cores can use the tools approach.
As a final note, there could really be two different problems being addressed here under the heading of “cycle accuracy”, and this might explain the different levels of feasibility:
- Using the simulator to validate and optimize software performance. This use can tolerate some errors in the details as long as the errors do not accumulate (see for example the “timing anomalies” or “unbounded long timing effects” found in WCET research). It is about understanding the software behavior against the processor design (or a complex accelerator design against its input data), in small, focused spots of execution.
- Using the simulator to validate a chip design, including buses and other devices that can be bus masters. This ought to require a higher level of accuracy, as the penalty for errors would seem greater. This is also where ARM’s SoC Designer fits in, rather than as a tool to understand software behavior. The scope here is larger, and there is usually no notion of zooming in on detail at particular points in time.
So where does this land us?
I guess that CC/CA models can be built if you have a nice inside track to the design team, and that the only sensible way to use them is as a zoom device for the places in your code where you absolutely need the details. Most of the time (say 90-95-99%) software does not need CC models, but rather something that is functionally accurate and that runs really, really fast so that all the software can at least be executed. That is something a CC model will never be able to do, at least not for systems using non-trivial operating systems that require a few billion instructions just to boot…
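To illustrate what I mean by using a CA model as a zoom device, here is a small, hypothetical sketch; the ToySimulator class and all of the numbers are made up for illustration and do not correspond to any particular tool's API:

```cpp
// Hypothetical sketch of the "zoom" usage model: run in fast functional mode
// most of the time, and drop into the (much slower) cycle-accurate mode only
// around the region of interest. Everything here is invented for illustration;
// real tools (Simics, SoC Designer, and others) expose this kind of mode
// switching through their own interfaces.
#include <cstdint>
#include <cstdio>

class ToySimulator {
public:
    // Pretend-functional mode: advance the program counter quickly,
    // keeping no timing information at all.
    void run_functional(uint64_t instructions) { pc_ += instructions; }

    // Pretend-cycle-accurate mode: advance slowly, cycle by cycle,
    // collecting detailed timing (here just counted).
    void run_cycle_accurate(uint64_t cycles) { pc_ += cycles; detailed_ += cycles; }

    uint64_t program_counter() const { return pc_; }
    uint64_t detailed_cycles()  const { return detailed_; }

private:
    uint64_t pc_ = 0;
    uint64_t detailed_ = 0;
};

int main() {
    ToySimulator sim;
    const uint64_t hot_loop_entry = 2'000'000'000;   // "a few billion" instructions in

    // Phase 1: position the workload in fast functional mode
    // (boot the OS, start the application, reach the interesting code).
    while (sim.program_counter() < hot_loop_entry)
        sim.run_functional(1'000'000);

    // Phase 2: zoom in with the cycle-accurate model for a short window only.
    sim.run_cycle_accurate(100'000);

    std::printf("ran %llu instructions fast, %llu cycles in detail\n",
                (unsigned long long)(sim.program_counter() - sim.detailed_cycles()),
                (unsigned long long)sim.detailed_cycles());
    return 0;
}
```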
Jakob, a very nice summary of the issues around building Cycle Accurate models. One thing to note is that when optimising a processor in the data plane, it is very important to understand detailed and accurate cycle counts especially for frequently executed loop nests. This is one reason having the Cycle Accurate ISS, as well as the Fast Functional one, is so important for us. This is one area where you need to zoom in to get details. But once the processor is fully defined, the fast functional models come into play to validate software correctness, with the need to run in Cycle Accurate mode greatly reduced. However, we have found that users should still do some regressions in Cycle Accurate mode, to validate that they are on the right performance track, and to validate the use of processors in a system context (especially depending on their synchronisation schemes).
Grant, that fits with my experience as well, where DSPs tend to come with CA simulators for this express purpose.
What might “save” data plane processors from absolute modeling misery is that they are usually architected to have performance locality and a certain robustness in performance. In my WCET work, it was clear that local optimizations in a pipeline were much easier to model, analyze, and simulate well than global effects like multi-level branch predictors or caches with non-LRU replacement policies.
I guess you agree with the following:
Designing a processor for good data-crunching performance should usually mean making it more “local” in this respect. Good control plane performance tends to be more “global” in nature, to tackle a more varying and unpredictable load where statistics and averages matter more than perfect loops. And such designs are also usually far more difficult to model and simulate well.
/jakob
Hi Jakob:
Great write up.
The item not to be underestimated is number three in your list of how to get to a cycle-accurate model: “… by the design team as part of the design process”. The cycle-accurate models are often not an afterthought but have been used to determine how to configure the processor in the first place. Making assessments about pipeline issues, caching, etc. is certainly easier in NML, LISA, or other associated languages. You can even run the software as stimulus on such a model quite nicely.
However, keeping this model in sync with the actual developed RTL model is a whole different story (unless you create it automatically). It looks like the complexity of the models has grown so much, and they have become so difficult to validate, that it is better to create them from the RTL. I had already argued for a “polarization of system-level modeling styles” a while ago in my post “Bring in the Models” at http://www.synopsysoc.org/viewfromtop/?p=7.
Best, Frank
Thanks Frank, I missed that part when I originally read your post. Here is the relevant excerpt, for reference in this discussion:
====
In the cycle accurate domain today the effort to develop and validate the models has sometimes reached the same order of magnitude as the implementation itself, while not providing full accuracy. As a result polarization of models happens.
At one end of the spectrum users find instruction accurate models, but at the other end we see less and less fully cycle accurate modeling going on. Given that cycle accurate models are an absolute requirement, users rely on hardware prototypes, models directly derived from the RTL code or even the simulated RTL code itself.
====
Nice article Jakob. I am not sure how fast and practical the cycle-accurate models converted from RTL are. They may provide an ISS speed of around 100 KIPS. That is too slow for the software development flows that use the CA simulators. Any thoughts?
I think models converted from RTL are bound to be slow, sure. But at least they exist, and if ARM say that they are fast enough to be useful, I believe them. However, I guess your speed estimate of 100 kHz or so is likely in the right ballpark.
That’s why the concept of mixing fast and detailed simulation is so important, where by “fast” I mean 100-1000 MIPS or more. As discussed in my SCDSource article (http://www.scdsource.com/article.php?id=266), you will spend most of the time (in my experience, 90-95-99%) in fast mode to position workloads, and switch to detailed (cycle-accurate) mode only for the interesting, critical parts of the software.
You cannot develop interestingly large software workloads on a CA model; there is not sufficient time in the life of a software engineer to complete operating system boots and similar tasks on CA models.
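To put rough numbers on that, here is a back-of-envelope calculation using the assumed figures from this thread: about 100 KIPS for an RTL-derived CA model, a few hundred MIPS for a fast functional model, and a couple of billion instructions for an operating system boot:

```cpp
// Back-of-envelope check of the claim above, using figures assumed in this
// discussion rather than measured numbers from any particular tool.
#include <cstdio>

int main() {
    const double boot_instructions = 2e9;   // "a few billion" instructions to boot
    const double ca_ips   = 100e3;          // ~100 KIPS, cycle-accurate model
    const double fast_ips = 500e6;          // ~500 MIPS, fast functional model

    std::printf("CA model:   %.1f hours to boot\n",
                boot_instructions / ca_ips / 3600.0);   // roughly 5.6 hours
    std::printf("fast model: %.1f seconds to boot\n",
                boot_instructions / fast_ips);          // roughly 4 seconds
    return 0;
}
```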