Does ISA Matter for Performance?

When I grew up with computers, the big RISC vs CISC debate was raging. At the time, in the late 1980s, it did indeed seem that RISC was inherently superior to CISC. SPARCs, MIPS, and Alpha all outpaced boring old x86, VAX and 68000 processors. This turned out to be a historical parenthesis, as the Pentium Pro from Intel showed how RISC-style performance could be mated to a CISC ISA. However, maybe ISAs still do matter.

The conventional wisdom (see for example http://archive.arstechnica.com/cpu/4q99/risc-cisc/rvc-1.html) since the Pentium Pro has been that for a mainstream processor, the decoder logic is such a small part of the chip that it does not really matter. Behind that decoder, both a high-end RISC and a high-end CISC do pretty much the same thing, and the sheer weight of design manpower and manufacturing advantages that Intel has had has ensured that their processors have been the fastest, or at least very competitive, for the past decade and a half. CISC vs RISC also turned out to be a blurry boundary – some PowerPC processors did the same trick as Intel processors, breaking up their instructions into micro-operations before sending them down the pipeline. Indeed, Power Architecture is sometimes not even considered RISC, but rather something of a hybrid.

It seems that market share is just as important for processor performance as technical elegance and ingenuity – a bit sad, but pretty true. At least, this holds true in the “old mainstream” desktop and server markets.

Still, ISA matters. And even within a family, ISA changes can have a huge impact on performance.

Obviously, we have specialized ISAs like those of DSPs, network processors, and GPUs. In these cases, the ISAs let us express computations in ways that are not possible on a general-purpose processor (GPP), often trading ease of writing general code for performance on a certain class of computation. For the same work, these processors are maybe an order of magnitude more efficient than a GPP. A high-end GPP can often match the absolute performance of a specialized processor, by the brute force of high clock speeds, large caches, and aggressive out-of-order pipelines. But in doing so, it uses 10-100 times more energy.

Inside a mainstream ISA, the trend of the last decade has been the successive addition of small and large sets of specialized instructions for doing various important computations. Floating-point units have been essentially replaced by vector processing instruction sets such as MMX, SSE, AltiVec, VFP, and NEON. Crypto instructions have entered the mainstream on all major architectures. Power Architecture has included binary-coded-decimal instructions, and IBM mainframes have instructions that do string copies in the L3 cache. Configurable architectures like Tensilica show the value in adding a few well-chosen application-specific instructions. Adding features to existing ISAs has proven to give very good bang for the buck.

Another, more interesting, aspect is the modernization of old ISAs. The actual cause for this blog post was that the Microprocessor Report estimated that an ARM Cortex-A57 (and A53) core would gain 10% performance when running in ARM AArch64 rather than AArch32 mode. That is pretty significant – the same processor core, the same program, just compiled using a different set of instructions. It is especially interesting for ARM, since the ARMv8 AArch64 64-bit instruction set is almost completely different from the old ARM 32-bit instruction set. ARM took the chance to modernize and remove old “good ideas at the time” aspects that are hard to implement efficiently today, such as predication on most instructions, the wonderfully complex shifting operands, and pc-relative constant pools swimming around in the instruction stream (see http://www.realworldtech.com/arm64/ for more details). AArch64 also finally makes ARM an architecture with 32 registers, rather than the old 16, which makes compiler optimizations much easier.

This means that AArch64 improves performance in two dimensions: it allows better, more streamlined silicon implementations compared to the old ARM, and it makes the life of a compiler a bit easier, with fewer spills and more data kept in registers. I think it is mostly this second effect that is seen for the Cortex-A57 and A53.
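To make the "fewer spills" point a bit more concrete, here is a minimal sketch of a toy linear-scan register allocator that counts how many values have to be spilled to memory for a given number of registers. Everything in it is an assumption for illustration: the function name `count_spills`, the made-up workload of 24 overlapping values, and the use of the nominal register counts 16 and 32 (a real compiler has fewer allocatable registers than that, and real allocators are far more sophisticated).

```python
def count_spills(intervals, num_regs):
    """Toy linear-scan allocation: return how many values get spilled.

    intervals: list of (start, end) live ranges, with end exclusive.
    num_regs:  number of registers available to the allocator.
    """
    active = []  # end points of the values currently held in registers
    spills = 0
    for start, end in sorted(intervals):
        # Release registers whose values are dead by this point.
        active = [e for e in active if e > start]
        if len(active) < num_regs:
            active.append(end)
        else:
            # No free register: spill the value that lives the longest,
            # which is either an active one or the new one.
            longest = max(active)
            if longest > end:
                active.remove(longest)
                active.append(end)
            spills += 1
    return spills


# Hypothetical workload: 24 values whose live ranges all overlap.
loop_values = [(i, i + 40) for i in range(24)]

print(count_spills(loop_values, 16))  # prints 8
print(count_spills(loop_values, 32))  # prints 0
```

With the same "program", the smaller register file forces eight values out to memory while the larger one keeps everything in registers. Real code is not this tidy, but it is the same mechanism that lets a compiler do a better job with more registers.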

I think this is good proof that ISA design does matter. Another similar data point is the effect of going to x86-64 from x86-32. Typically, you can see a five to ten percent improvement in the speed of compute-intensive code from the better register allocation allowed by having eight extra registers and the overall somewhat cleaner instruction set. This is a different approach compared to ARM, since x86 maintains the 32-bit instruction set and simply extends it, whereas ARM basically changed the ISA completely.

Thus, my conclusion is that the design of an ISA can have a significant effect on the performance of a processor, even when all other factors are held constant. What the “best” design is seems to change over time – in the 1970s, CISC designs like the 8008 and 8086 were dominant as they let you do more with each slow fetch from memory. In the 1980s, with faster memory, RISC used a bit more program space to create an instruction set that could map directly to a very efficient pipelined processor. In the 1990s, with more complex processors, RISC and CISC turned out to be equivalent. In the 2010s, it is clear that efficiencies can be achieved by designing a clean ISA that is easy to implement in many different ways in out-of-order speculative high-frequency designs. The key today is hardware-independence, where it used to be the close co-design of instructions and hardware.

On the ultra-low end, I think the classic RISC idea of “simple to implement” instructions still has clear value. Something like the ARM Cortex-M0 can be implemented in 12000 transistors, on the same level as a 1970s 8-bit design – while being fully 32-bit and clocking in at many times the performance of these old machines. It is even half the size of an Intel 8086. It is hard to see how an x86 processor could ever be cut down to this kind of size – unless of course Intel does the same as ARM did, and cuts the instruction set down to an ultra-reduced version. Note that a classic simple RISC like the MIPS R2000 was still about twice the size of a 68000 processor, so it is not necessarily the case that RISC means small. You need to design for smallness too.

 

7 thoughts on “Does ISA Matter for Performance?”

  1. “The actual cause for this blog post was that the Microprocessor Report estimated that an ARM Cortex-A57 (and A53) core would gain 10% performance when running in ARM AArch64 rather than AArch32 mode. That is pretty significant …”

    Is 10% really that significant, considering Moore’s law? It sounds like almost a rounding error given that changing the ISA is a one-time thing that doesn’t pay off over time.

    Here’s an article from yesterday that gives another data point:
    http://www.tomshardware.com/reviews/atom-z2760-power-consumption-arm,3387.html

    Nice article!

  2. I think 10% is significant, as it applies to all future chips of the same design. It means that in the same process, with the same design, you get 10% more performance. Compared to typical microarchitecture gains of a few percent, I think it does mean something.

    The Atom Clover Trail clearly demonstrates the benefit of the Intel manufacturing lead, and that there is nothing magical about a certain instruction set at the high end – as I see it, ARM tends to be more efficient as a result of a long tradition of low-power design, not by using a particular ISA.

  3. Bollocks! Intel is kind of crap (I use AMD). ARM is slowly but surely catching up with Intel.
    Now that ARM has a 64-bit architecture, people can leverage it in places where Intel is in use now.
    Moore’s law will stop and single-molecule transistors will emerge. For now, architecture is important as you put to use all those ever smaller transistors.
    I am holding out for photonic circuits and quantum computers. We will master matter at ever finer scales.

  4. @molecule_computing

    Intel is superb on cache speed and memory controllers. AMD is much slower on cache. Maybe they can improve the Bulldozer core. ARM and MIPS (China is putting a lot of money into MIPS now) both have a long way to go, even if the CPUs are not bad.

  5. I would like to see some results from ARM and MIPS (the best from China) on SPEC2000 or SPEC2006. If you want to use ARM in a server, you know what I mean?
