Useful Instruction Set Computing

I tend to get into discussions about computer processor instruction-set architecture (ISA) design. ISA design is far from my day job, but it is an interesting topic on which everyone working with computers at the machine level has an opinion, typically based on a mix of personal experience and fond memories of particular machines. This in turn leads to intricate and intriguing arguments. In this blog post, I will give my take on the current state of instruction sets in industry and the age-old “complexity of instruction set” question.

What is a Good Instruction Set?

The goodness of an instruction set can be measured by many yardsticks. For example:

  • Simplicity – in the sense of regularity, elegance, and minimality
  • Code size – for a given program, is the code smaller than with other instruction sets?
  • Performance – does the instruction set affect the performance?
  • Compatibility – is it compatible with something that already exists?
  • Compilability – how easy is it to target the instruction set with a compiler?
  • Usability – how easy is it to create optimized code for the instruction set?

Obviously, others can be imagined.

RISC vs CISC

I don’t think the “RISC vs CISC” discussion is relevant anymore. That said, it is still the general reference point for instruction-set design.

When the early commercial RISC processors came to market in the mid-to-late 1980s (MIPS, SPARC, PA-RISC, and ARM), they offered significant performance advantages over existing machines based on “complex” instruction sets (CISC). The predecessor of all RISC machines, the IBM 801, was a leap ahead of its IBM contemporaries in performance. Early RISC machines showed great performance together with simplicity and ease of compilation, but with varying results on code size.

However, the performance advantage was temporary and likely based on a particular confluence of factors in the early to mid 1980s. Memory was comparatively fast at the time, reducing the cost of instruction fetch. The CISC machines of the day used microcode to implement most instructions, slowing them down compared to executing an instruction directly in hardware. Thus, it worked well to use several simple instructions to implement the same function as a single CISC instruction – for example, using separate instructions for memory accesses instead of integrating memory accesses as an operand type in arithmetic instructions.
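
To make the difference concrete, here is a minimal sketch of my own (not from any particular compiler) of how the same C statement lowers to one memory-operand instruction on a CISC machine versus a load plus an arithmetic instruction on a classic RISC machine; the assembly in the comments is representative rather than exact compiler output:

    /* a += *p: one instruction on x86-64 versus two on a classic RISC. */
    int add_from_memory(int a, const int *p) {
        /* x86-64 (CISC): memory access folded into the arithmetic instruction:
         *     add eax, DWORD PTR [rsi]
         * MIPS-style RISC: separate load, then a register-to-register add:
         *     lw   $t0, 0($a1)
         *     addu $v0, $a0, $t0
         */
        return a + *p;
    }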

The simplicity of the early RISC designs turned out to be less than ideal. Simplicity often got in the way of efficiency in the long run, in particular by requiring more instructions to do the same thing. I was pointed at an old interview from 2005, where Jim Bourgoin from MIPS had this to say about the early days of the RISC era:

Instead of massive numbers of simple instructions, we should have been learning more about how to make the representation of the algorithm in main memory as dense as possible, so that we had some hope of getting it into the CPU in a timely way.

Convergence of Designs

As a result of the appearance of RISC, existing CISC designs like X86 and IBM 370 adopted some of the ideas from the RISC machines, and as more transistors became available for processor implementations, their backend pipelines effectively turned into RISC machines. Eventually, the instruction sets themselves were also extended to pick up some of the RISC ideas.

One important take-away from the RISC designs was the realization that you wanted a decent number of general registers. No self-respecting RISC design has fewer than 32. Nobody seems to find 64 registers worth the encoding space. X86 went to 16 general registers with X86-64, which solved many of the compilation problems for that architecture. Before this, processors tended to have more or less specialized registers – clearly simplifying the implementation when being severely transistor-starved, but also complicating programming. Even the nice 68000 had a split of registers into address and data registers.

Another realization was that instructions have to be generated by compilers, and instructions that compilers cannot make use of are pretty useless.

On the other hand, the RISC designers realized that they needed to provide ways to reduce the size of compiled code. Variable-length instructions (ARM Thumb, MIPS Compact, etc.) were introduced for this purpose. This is still a matter of contention – there is an argument that a single fixed instruction size really simplifies the frontend (promulgated by ARM fans in particular, as modern high-end ARMv8 and ARMv9 are fixed 32-bit). But there is also the fact that it restricts your encoding space and makes it harder to create useful instructions like loading large random constants into registers or making jumps with 64-bit destinations. Notably, IBM extended the Power ISA to contain 64-bit instructions in 2020, turning what used to be a fixed-length ISA into a variable-length one.
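
As a small illustration of the encoding-space point, consider loading an arbitrary 64-bit constant. The sketch below is mine, and the assembly sequences in the comments are representative of typical compiler output rather than anything mandated:

    #include <stdint.h>

    /* Returning an arbitrary 64-bit constant (the value itself is just an example).
     * Fixed-length AArch64 typically synthesizes it from four 32-bit instructions:
     *     movz x0, #0xdef0
     *     movk x0, #0x9abc, lsl #16
     *     movk x0, #0x5678, lsl #32
     *     movk x0, #0x1234, lsl #48
     * Variable-length x86-64 can do it with a single 10-byte instruction:
     *     movabs rax, 0x123456789abcdef0
     */
    uint64_t magic_constant(void) {
        return 0x123456789abcdef0ULL;
    }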

Current State

Today, I would argue that there are no “reduced” instruction sets in practical use. Instruction-set design has evolved away from the simplicity of the early designs towards providing instructions that address specific use cases in efficient ways – i.e., useful instruction sets. These might be encoded into fixed or variable-length packages, but universally everyone appears to agree that new instructions should be designed to solve specific problems, and indeed that what Jim Bourgoin said twenty years ago is true: you want to pack as much functionality as possible into as few bytes as possible.

Let’s look at some examples of what I am talking about.

Example: Vector Instructions

The easiest way to show that complex instructions carry great benefits is to look at vector instruction sets. All mainstream architectures today feature vector instructions, the more famous examples being Intel Architecture AVX (AVX, AVX2, AVX-512, AVX10, etc.) and ARM NEON and SVE. RISC-V has the V extension (and likely more variants coming), and the Power ISA has VSX.

The pure RISC way to do vectors would have been to just loop simple instructions over the data in memory, which is obviously a lousy design today.

Just look at ARM SVE2, where you have registers from 128 to 2048 bits long, holding elements that can be integer or floating point, ranging from 8 to 64 bits. Vector instructions regardless of architecture tend to be wonderfully complex, doing things like fused multiply-and-add guided by mask registers… nobody would call that “reduced”.
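
For a feel of what this looks like from C, here is a minimal sketch using the x86 AVX2/FMA intrinsics (compile with something like -mavx2 -mfma on GCC or Clang; the function name and the assumption that n is a multiple of 8 are mine, purely to keep the example short):

    #include <immintrin.h>

    /* dst[i] = a[i] * b[i] + c[i], eight floats per iteration. */
    void fma_arrays(float *dst, const float *a, const float *b,
                    const float *c, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vc = _mm256_loadu_ps(c + i);
            /* One fused multiply-add instruction handles eight elements at once. */
            _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(va, vb, vc));
        }
    }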

Example: Bit Manipulation

Bit manipulation instructions are another example that I like, as they really show the effectiveness of specific instructions even in the context of standard integer code. For example, counting the number of bits set in a word requires a loop in software, but it can be accomplished with a single instruction like ARM CNT or X86 POPCNT (apparently the population count instruction popped up in the 1960s but was then largely ignored until the 2000s).

Compilers might not be able to infer the use of such an instruction from arbitrary source code that uses loops to accomplish the work. Instead, these kinds of well-defined operations should be accessed through libraries or intrinsics. One example is Python, which added the int.bit_count() method in 3.10. It maps back to whatever the host computer provides in terms of instructions or library routines, and it is way faster than doing a bit-wise loop in Python.
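
In C the same idea shows up as a compiler builtin rather than a loop. A minimal sketch using the GCC/Clang builtin, with the loop version included only for comparison:

    #include <stdint.h>

    /* Portable but slow: shift-and-mask loop over the bits. */
    unsigned popcount_loop(uint32_t x) {
        unsigned count = 0;
        while (x) {
            count += x & 1u;
            x >>= 1;
        }
        return count;
    }

    /* Maps to POPCNT on x86 or CNT on ARM when the target supports it,
     * otherwise falls back to a library routine. */
    unsigned popcount_builtin(uint32_t x) {
        return (unsigned)__builtin_popcount(x);
    }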

Communicating Intent

Which brings me to the last point in favor of expressive “complex” instructions: they tell the processor what you, the program or programmer, are trying to do. Knowing the purpose of code at a higher level is very useful for optimizations and efficient execution. While it is kind of beautiful and elegant to build up complex operations from simple parts, just telling the underlying library or machine what it is you are trying to do in a single operation makes it so much easier to execute efficiently.

Thus, having instructions for things like “count bits” or “perform part of an AES encryption” or “move data” is really handy, as it lets the hardware do what the hardware does best – in particular, performing computations in parallel.
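
The AES case is a nice illustration: with the x86 AES-NI intrinsics, one full encryption round is a single instruction. A minimal sketch (compile with -maes on GCC/Clang; the state and round key here are placeholders, and a real implementation would also perform the key schedule and the correct number of rounds):

    #include <wmmintrin.h>

    /* One AES encryption round: ShiftRows, SubBytes, MixColumns and the
     * round-key XOR, all performed by a single AESENC instruction. */
    __m128i aes_round(__m128i state, __m128i round_key) {
        return _mm_aesenc_si128(state, round_key);
    }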

My personal favorite here is the X86 REP prefix. When I first saw it used in recent years, I thought that it was some odd remnant of old code from the 1980s or something – until I realized that the Intel processor architects had optimized this in hardware into what basically amounts to implementing memcpy and memset in a single instruction! It is not entirely clear exactly when this is faster than complex looping implementations, but the hardware is clearly stepping forward and saying that it wants to handle this case for the code, with the added benefit of using fewer registers and far fewer instruction bytes.
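
For the curious, this is roughly what “memcpy in one instruction” looks like. The sketch below uses GCC/Clang inline assembly and is x86-64 only; in real code you would simply call memcpy() and let the compiler and the C library pick the best implementation:

    #include <stddef.h>

    /* REP MOVSB copies RCX bytes from [RSI] to [RDI], advancing both pointers. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }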

It seems that complex instructions can actually be good, and that simplicity is a bit overrated if taken too far. Symmetry is definitely nice for compilers, but simple instructions not so much.

Final Note: Performance of an Instruction Set?

Computing history thus far seems to show that there is no inherent performance advantage to any particular instruction set or type of instruction set. The early RISC processors did indeed have really good performance. However, the CISC X86 architecture soon overtook them, leading to the disappearance of all the RISC-based high-performance machines except for the POWER architecture.

Today the “RISC is better” argument is making a comeback due to the efficiency seen from ARM-based systems. I distinctly remember an architect telling me that RISC-V is so superior to X86 that it will be no contest. And we have articles like this one that claim that ARM is inherently more efficient than X86 designs.

Apple offers a very interesting case study, as they have flip-flopped back and forth. They started out with the very nice “CISC” 68000 processor, but when Motorola fell behind they switched to the RISC PowerPC architecture. However, that eventually fell behind Intel X86, and Apple switched to the very CISC-y X86 architecture. Most recently, they have famously switched to their fourth instruction set, ARM. What this shows is that over time, different processor designers and designs will be better than others. Which ones are better does not seem to depend on the instruction set.

Text Heavy

This was a bit text heavy, but I could not see a need for any illustrations beyond cooking up some random image of a chip with an AI image generator, which would add no information and thus be pretty pointless.

One thought on “Useful Instruction Set Computing”

  1. Nice article, but I wonder if the choice of RISC vs CISC at Apple was more about Apple’s control (PowerPC, ARM) vs market power (Motorola, Intel). Interestingly, the market gave them more control over RISC.
