There is an eternal debate going on in virtual platform land over what the right level of abstraction is for each job. Depending on background, people favor different levels. For those with a hardware background, more detail tends to be the comfort zone, while those with a software background, like myself, are quite comfortable with less detail. I recently ran some experiments using quite low levels of hardware modeling detail for early architecture exploration and system specification.
It all comes down to a simple, classic tradeoff that I usually illustrate like this (using more neutral ground than computer systems, with credit to Peter Magnusson, who had this slide already in place when I joined Virtutech back in 2002):
What this is telling you is simple:
- You simulate something very large using large units, i.e., low level of detail; or
- You simulate something quite small using small units, i.e., high level of detail.
I wanted to test the idea that by using less detail, you can run larger test cases and therefore obtain better coverage of the overall landscape than by diving in and counting cycles in some small part of it. In the end, this made me cross the trillion-instruction line: since each experiment took a few hundred billion target instructions to complete, repeating and tweaking during the development work definitely added up to more than a trillion instructions.
And this is where I put my little finger close to my mouth and say:
‘I want one trillion instructions’
So what did I get from these trillion instructions?
An interesting study in how operating-system overhead can have a big impact on the profitability of hardware accelerators. By running hundreds of test cases with different assigned computation latencies for a hardware accelerator, as well as different driver models for my hardware (all running under Linux on my favorite MPC8641D), a key diagram emerged:
Read the paper for all the details, but the key thing to note is that with a poor driver architecture, making the hardware 100 times faster resulted in zero gain in system performance. Had this experiment been performed on a bare-bones platform without a full operating system in place, I am fairly certain that the faster hardware would have been considered much more worthwhile.
In the end, I resorted to a driver variant where user-level code directly accesses the device programming interface via an mmap()-mapped memory region. Not pretty: essentially this was bare-metal programming wrapped inside a big, cosy Linux package, but it sure was efficient compared to doing a kernel/user-mode switch for each hardware operation. But even here, it turned out that making the hardware very, very fast, as opposed to just very fast, had no benefit. This proved to me that the software has to be taken into account in full in order to properly evaluate an idea for a hardware design.
You could say that the poor results for acceleration here were due to my inept Linux driver programming skills, but that just underscores the key result: you have to take the software into account. If the conclusion is that a better Linux device driver programmer is needed, you have still decided that the key system bottleneck is not just the speed of the hardware, but how it is used. And that is exactly what system design needs to be about.
As an aside, playing around with a complete system like this, and automatically running large volumes of tests with varying parameters, was a really interesting experience. I must admit that getting to these trillions of instructions required a few hours of simulation time, but nothing that could not be solved by leaving a computer running over lunch or a long meeting. The machine was modeled using standard Simics “software timing”, i.e., without any particular cache, pipeline, or bus details, and it seems that that is usually all you need. Had I increased the level of detail and slowed things down by a factor of ten or a hundred, I would never have covered such a large set of test cases or been able to evaluate as many different variants of drivers and hardware speeds.
IBM did it before me
Finally, I found it interesting that an analogous experience with building a complete software stack and testing what looked like a very good hardware idea was reported in an IBM paper from a few years ago: “Application of full-system simulation in exploratory system design and development”, by Peterson et al., in the IBM Journal of Research and Development. Look at the section about the “MIP Morphing” feature, which is essentially cache locking. They do use a fairly detailed simulator for the final evaluation of performance, but the key message is that by running a full software stack, they realized that merely managing the feature was too hard in a realistic software environment to make it worthwhile:
Initially, the MIP morphing feature was well received by internal development and HPCS customers alike. The team was aware of the need to both manage this hardware feature at the OS level and provide portable abstractions to the programmer to exploit this feature in a productive way. …
The implementation effort was facilitated by Mambo, allowing the OS team to prototype the MIP morph idea in a controlled development environment. Taking the prototyping effort to this level of realism uncovered many complexities in supporting the MIP morph in a virtualized manner. …
By prototyping the software support that was needed at the OS level and exposing the usage issues at the application programmer’s level, the magnitude of the problem was exposed at its fullest. Further, the improvement in performance did not show a sufficient payback for the immense effort that would be required at the software level to support the idea, and as a result it was dropped from further consideration.
It seems that whatever you do, IBM did it first… and it validates the idea of full-system simulation and that software is king today.