“RISC-V in Practice” – Computer and System Architecture Unraveled Event Three

On Wednesday, March 13, we had our third CaSA, Computer and System Architecture Unraveled, meetup. Same place as the previous one, the 25th floor of the Kista Science Tower building, thanks to the kind sponsorship of Vasakronan and our collaboration with Kista Science City. The theme this time was “The RISC-V ISA in Practice”, with two speakers named Björn. Another great event!

The event took place in one of the sharp corners of floor 25, providing a great view of the southern part of Kista.

Björn Töpel – RISC-V: The pains of growing an ISA

The first Björn was Björn Töpel from RiVOS, a startup building a high-performance RISC-V-based core.

Björn Töpel talked about various aspects of the RISC-V ISA and how he had seen it impact its application for a high-performance processor core running Linux. Björn comes from the software side, in particular Linux, and had many interesting observations about how the instruction set actually does impact the programmability and ergonomics of a processor.

Some people claim that compilers mean that ISAs do not matter, but for general-purpose processors running code that is ported between architectures, it really does. Not so much in the details of how your “add” instructions are encoded, but rather aspects like the reach of jump instructions and how easy it is to code common patterns like patching code (see below).

The way that RISC-V started as an academic project still shows. There is some emphasis on purity and symmetry over efficiency. For example, Björn talked about the jump-and-link instructions (JAL and JALR) used to call functions in the RISC-V ISA. Those instructions have a full register operand encoded, in theory allowing the return address to be put into any register. But the RISC-V ABI restricts programs to use either register X1 or X5 as the link register, leaving most of the five-bit register field unused. Those bits could instead be used to extend the reach of the instructions with larger immediates. It would be objectively better with dedicated call and return instructions like on most other architectures.
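To put numbers on the reach argument, here is a minimal sketch in Python. The standard JAL figures follow from its 20 encoded offset bits. The wider variant is purely hypothetical and not part of any RISC-V specification. It assumes one bit is kept to choose between X1 and X5 and the other four register bits go to the offset, and it ignores that JAL with X0 as destination is also used for plain jumps.

```python
def jump_reach_bytes(offset_field_bits: int) -> tuple[int, int]:
    """Return (min, max) byte offset for a signed jump offset field.

    RISC-V jump offsets are multiples of 2 bytes, so a field of N encoded
    bits represents a signed (N + 1)-bit byte offset.
    """
    span = 1 << offset_field_bits  # half of the total range, in bytes
    return -span, span - 2

# Standard JAL: 20 encoded offset bits next to a 5-bit destination register.
jal_min, jal_max = jump_reach_bytes(20)

# Hypothetical encoding (an assumption for illustration only): 4 of the 5
# register bits are handed over to the offset field.
wide_min, wide_max = jump_reach_bytes(24)

print(f"JAL today:        {jal_min} .. {jal_max} bytes")    # about +/- 1 MiB
print(f"hypothetical JAL: {wide_min} .. {wide_max} bytes")  # about +/- 16 MiB
```

Four extra offset bits would move the reach from roughly 1 MiB to roughly 16 MiB in each direction, which matters for large binaries such as a Linux kernel.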

There is also some tension between microcontrollers and high-end processors. Many of the RISC-V design choices make it really easy to build a simple core, but they are not really all that relevant when it comes to high-performance cores where you will spend a lot of silicon area on decode, instruction cracking, instruction fusing, etc., anyway. More complex encodings are not a burden, but rather an asset if they help performance and expressivity.

Björn Forsberg – Heterogeneous multi-ISA, multi-data-model, RISC-V Research Platform

The second Björn (or, as we joked, Björn One since the first speaker was obviously Björn Zero, as you always count from zero in computing) was Björn Forsberg from RI.SE. His perspective on RISC-V came from university research. He has worked in the PULP project and in particular on the HERO platform for research into heterogeneous accelerators. His presentation was based on the article “HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing”.

The HEROv2 platform uses 32-bit RISC-V as the base instruction set for accelerator cores. There are currently three accelerator cores called IBEX, RI5CY, and snitch. The focus here was mostly on RI5CY, which adds many DSP-processor-style extensions to the instruction set to reduce the number of instructions needed to express certain computations. There are SIMD instructions, hardware loops, and memory operations with automatic post-increment, for example.

This is a pretty common use for RISC-V, I think: you use the base instruction set as, well, a base, and then use the openness of the instruction specification to add your own special instructions for a specific application. A bit like Cadence Xtensa does it, except it is all based on an open specification. The advantage of RISC-V is that you do not have to spend time reinventing the base instruction set and you get a general toolchain for free from the community. Previously, you had to spend some time on basic software enabling for whatever custom instruction set you came up with.

The accelerator code is embedded inside the main program using OpenMP #pragmas. The main program runs on a 64-bit controller core, leading to some issues communicating pointer values to an accelerator. The communications model is based on explicit DMA rather than cache coherency to keep things simple. The DMA model probably helps by allowing the main controller to be either an ARM or a RISC-V core. The typical experimental platform for HERO is on a Xilinx Zynq SoC FPGA which provides hard-wired ARM cores in addition to the FPGA fabric, so why not use it?

Discussions

After the two presentations we had several hours of discussions over food and drinks.

Question and answer session with the speakers

There were some aspects of the RISC-V instruction set and ecosystem that are worth mentioning in their own right.

Patch Code

A topic that I had not expected at all took up a lot of the discussions: self-modifying code, also known as CMODX in RISC-V land, or probably more politely as code patching. Basically, it is the practice of changing the contents of the memory holding the code being executed. This is a surprisingly common technique.

For example, the Linux ftrace functionality is based on rewriting the first instruction of a function from a NOP to a call to a trace function. This obviously offers the best performance possible, as there is no flag checking or other dynamic behavior. Just a call when needed, and no call when not needed. Handy. Similar techniques are used in virtual machines like OpenJDK.

Runtime code patching has been supported on Intel Architecture for a very long time and just works. For example, for ftrace, Björn Zero showed how you can put a large-size NOP (8 or 9 bytes I think he said) as the first instruction of a kernel function. When tracing is enabled, you overwrite the NOP with a call to a trace function using a single atomic write. X86 processors have no problem with this operation. There is no need for memory fences or global synchronization locks.

It is a lot more complicated on ARM or RISC-V. Basically, the current RISC-V standard semantics do not guarantee anything about data writes to instruction areas. The underlying idea is the same as on X86: replace the first two 32-bit instructions of a function with code that makes it call a tracer function. How to do this safely is another matter altogether. You do not want one core to write to shared code at the same time that another core is executing it, so that the second core sees just half of the change… The current standard solution is to stop the kernel and make sure all processors are parked in a safe location, which is obviously expensive.
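The "half of the change" hazard can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions. It models memory as a byte array and looks only at the order of stores, and it says nothing about instruction caches, fetch atomicity, or what any real core guarantees. The two replacement words are AUIPC and JALR encodings with zero offsets, chosen for illustration rather than taken from the talk.

```python
import struct

NOP = 0x00000013         # addi x0, x0, 0
AUIPC_T0 = 0x00000297    # auipc t0, 0     (illustrative, zero offset)
JALR_T0_T0 = 0x000282E7  # jalr t0, 0(t0)  (illustrative, zero offset)

def function_entry(first: int, second: int) -> bytes:
    """Pack two 32-bit little-endian instruction words."""
    return struct.pack("<II", first, second)

def visible_states(old: bytes, new: bytes, store_size: int) -> list[bytes]:
    """Every memory state another core could observe while patching
    old into new using stores of store_size bytes, in address order."""
    code = bytearray(old)
    states = [bytes(code)]
    for offset in range(0, len(old), store_size):
        code[offset:offset + store_size] = new[offset:offset + store_size]
        states.append(bytes(code))
    return states

def torn_states(old: bytes, new: bytes, store_size: int) -> list[bytes]:
    """States that are neither the old code nor the new code."""
    return [s for s in visible_states(old, new, store_size) if s not in (old, new)]

old = function_entry(NOP, NOP)
new = function_entry(AUIPC_T0, JALR_T0_T0)

print(len(torn_states(old, new, 4)))  # two 32-bit stores: 1 torn state
print(len(torn_states(old, new, 8)))  # one 64-bit store: 0 torn states
```

In the torn state the function starts with the AUIPC followed by the old NOP, so a core executing it at that moment runs a sequence that was never intended. Even the single-store case is only safe if the architecture promises that instruction fetch observes the store as one unit, which is exactly the kind of guarantee the discussion was about.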

Solving this problem involves defining the instruction-set semantics for the cache and memory system. In a way that seems “wrong” – why should an ISA define how the cache system works? But it is clear that modifying code through data writes to memory is critical to much modern software, and as such it has to be defined. Interesting.

Instruction Lengths

An undeniable attraction of the RISC-V ecosystem is that design decisions regarding the instruction set take place in the open. Traditionally, this was the domain of internal discussions at companies like IBM, Intel, and ARM. With RISC-V, you get to see the architects debate the merits of different proposals, and gain some insight into the microarchitectural implications of instruction-set decisions. Bring out the popcorn!

One such aspect that is currently being debated is the question of variable-length instruction encodings and the compressed instruction set (“C”). These two questions are different but also deeply intertwined.

First of all, does it make sense to have a fixed-length instruction encoding? The industry does not really have a consensus on this topic. ARM AArch64 has gone for all-32-bit instructions. Intel Architecture has instructions from 1 to 15 bytes and an admittedly very complex encoding scheme. The Power ISA has gone from 32-bit to a mix of 32- and 64-bit instructions. RISC-V currently mixes 16-bit and 32-bit instructions, making it variable-length.

Some architects argue that fixed lengths simplify instruction decoding and therefore provide a potential advantage in performance. There is no need to deal with instructions straddling cache-line or page boundaries. Thus, it should be possible to make a frontend simpler and faster. On the other hand, variable-length instruction sets typically result in smaller code, which means that more of a program can fit into a given-size instruction cache, and that more instructions can be fetched with a given fetch width.

A related question is just what you want or need to encode in the instructions. A 32-bit instruction set makes it quite painful to do things like load 64-bit immediates into registers. Classic 32-bit instruction sets tend towards three-operand operations. However, once we start looking at vector math instruction sets, it starts to get a bit tight. There are hundreds or thousands of operation variants, and often four or more operands really make sense. Thus, variable-length instructions might allow more convenient expressions of operations and thus make code more compact and easier to parse and execute for a processor.
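The immediate problem shows up already for 32-bit constants. As a minimal sketch, assuming an RV32 core and the usual LUI plus ADDI pair, the Python below splits a constant into the two immediates and checks that they rebuild the original value. The function names are made up for this example. A 64-bit constant needs a longer sequence or a load from a constant pool.

```python
def split_for_lui_addi(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into (20-bit LUI immediate, signed 12-bit ADDI immediate)."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        # ADDI sign-extends its 12-bit immediate, so the low part becomes
        # negative and the upper part has to compensate by one.
        lo -= 0x1000
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo

def rebuild(hi: int, lo: int) -> int:
    """Model of what RV32 computes for: lui rd, hi ; addi rd, rd, lo."""
    return ((hi << 12) + lo) & 0xFFFFFFFF

hi, lo = split_for_lui_addi(0xDEADBEEF)
print(hex(hi), lo)            # 0xdeadc -273
print(hex(rebuild(hi, lo)))   # 0xdeadbeef
```

Note how the upper part comes out as 0xDEADC rather than 0xDEADB. That is the kind of detail that an instruction with room for the full immediate would make unnecessary.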

RISC-V is currently at 16-bit plus 32-bit, with the option to extend into at least 48-bit instructions in the future. I must admit to liking this model. It just seems to make sense to size instructions to the information that they carry. If all I want is to add a small constant to a single register, use a small 16-bit instruction for that. If I need to do a five-operand very-complicated vector math operation, 48 bits make a lot of sense. Having two or three sizes just makes sense, and should still be a lot simpler than the old CISC architectures that vary in units of bytes.
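Part of why a few fixed sizes stay manageable is that the length can be read from the lowest bits of the first 16-bit parcel. The sketch below is my reading of the length-encoding scheme in the RISC-V unprivileged specification. The 16-bit and 32-bit cases are what cores implement today, while the 48-bit and 64-bit patterns should be treated as a convention for future use rather than as shipping instructions.

```python
def instruction_length_bits(first_halfword: int) -> int:
    """Return the instruction length implied by its first 16-bit parcel."""
    low = first_halfword & 0xFFFF
    if low & 0b11 != 0b11:
        return 16                      # compressed: low two bits are 00, 01 or 10
    if low & 0b11100 != 0b11100:
        return 32                      # standard: bits [4:2] are not 111
    if low & 0b111111 == 0b011111:
        return 48
    if low & 0b1111111 == 0b0111111:
        return 64
    raise ValueError("longer or reserved encoding")

print(instruction_length_bits(0x0001))  # c.nop -> 16
print(instruction_length_bits(0x0013))  # low half of addi x0, x0, 0 -> 32
print(instruction_length_bits(0x001F))  # -> 48
```

A decoder only needs to look at a handful of bits per parcel to find instruction boundaries, in contrast to a byte-granular encoding where prefixes and opcode bytes have to be parsed first.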

Compressed Instructions

A particular debate is raging in RISC-V land right now about the compressed, C, instruction set. It definitely provides benefits in code size and helps microcontrollers. But it also occupies “75% of the instruction space”, since all non-compressed instructions must set the two least significant bits of the instruction to 11. Maybe not an ideal design point. Googling around finds a lot of interesting debate on the topic (like this thread, and this presentation from SiFive).
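The 75% figure follows directly from the two-bit rule and can be checked by brute force. This small Python sketch simply counts how many of the possible first 16-bit parcels fall on the compressed side.

```python
TOTAL = 1 << 16  # every possible value of the first 16-bit parcel

# A parcel starts a compressed instruction unless its low two bits are 11.
compressed = sum(1 for parcel in range(TOTAL) if parcel & 0b11 != 0b11)

fraction = compressed / TOTAL
print(fraction)  # 0.75
```

Put differently, every 32-bit instruction has only 30 bits left to carry information, since two bits are spent saying "I am not compressed".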

Qualcomm is apparently campaigning for removing C from high-end cores, going strictly 32-bit, and then modifying the instruction encodings in something that seems similar to ARM AArch64. Other companies like SiFive argue the opposite. As I said, great to be able to see this discussion. It should offer great study material for computer architecture students at universities.

CPUID

One of my main surprises when I started to look at RISC-V was that the ISA did not have a real CPUID instruction! There is the original MISA register that contained 26 bits, one for each letter of the alphabet, with the idea that each instruction group would get a letter. This quickly ran out of space, as should have surprised nobody. But that it was not replaced with a more sophisticated mechanism is a total mystery.
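For reference, decoding the letter bits of MISA is trivial, which is both its charm and its limit. The sketch below assumes the layout from the RISC-V privileged specification where bit 0 stands for "A" and bit 25 for "Z". The example value is one I constructed for an RV64GC-style core with supervisor and user modes, not a value read from real hardware.

```python
import string

def decode_misa_extensions(misa: int) -> str:
    """Return the letters whose bits are set in the low 26 bits of misa."""
    return "".join(
        letter
        for bit, letter in enumerate(string.ascii_uppercase)
        if misa & (1 << bit)
    )

# Constructed example: bits for A, C, D, F, I, M, S and U.
example_misa = 0x14112D
print(decode_misa_extensions(example_misa))  # ACDFIMSU
```

With one bit per letter there is no room for the many multi-letter extensions that exist today, which is the problem described above.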

The user-facing way to represent the instruction set of a core is to list all its supported extensions in one very long string. This is pretty much unusable:

Björn showing a typical processor ID string for a well-featured RISC-V processor
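To give a feel for what software has to do with such a string, here is a minimal parsing sketch in Python. It assumes a simplified format: "rv", the register width, a run of single-letter extensions, then multi-letter extensions separated by underscores. Version numbers and shorthands like "g" are not handled, and the example string is made up for illustration, not the one shown at the meetup.

```python
import re

def parse_isa_string(isa: str) -> tuple[int, set[str]]:
    """Parse a simplified RISC-V ISA string into (xlen, set of extension names)."""
    match = re.fullmatch(r"rv(32|64|128)([a-z]+)((?:_[a-z0-9]+)*)", isa.lower())
    if match is None:
        raise ValueError(f"unrecognized ISA string: {isa!r}")
    xlen = int(match.group(1))
    extensions = set(match.group(2))  # single-letter extensions
    extensions.update(name for name in match.group(3).split("_") if name)
    return xlen, extensions

# Made-up example string for illustration.
xlen, extensions = parse_isa_string("rv64imafdc_zicsr_zifencei_zba_zbb")
print(xlen)                  # 64
print("zbb" in extensions)   # True
print("v" in extensions)     # False
```

Checking for one feature means parsing the whole string first, compared to testing a single bit in a register.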

On the hardware level, the current method for most cases seems to be to just try to run an instruction and see if it results in an illegal-instruction exception. Not the most elegant. RISC-V is working on a “configuration structure” standard that should resolve this. The idea appears to be that boot firmware figures out the setup of a machine, in the form of some kind of data structure in memory. Exactly how this data structure is passed from the hardware seems not to be agreed upon yet. While it is kind of elegant in one way, it is also not very user-friendly if you just want to check a few things.

To be honest, I think the X86 CPUID is a better design. It allows user-level libraries and applications to do their own quick flag checks to see what is available on a given piece of hardware in a way that is consistent across vendors and over time. Just checking a flag is simple and easy, compared to trying to parse out something in a large configuration structure. CPUID can be extended at runtime by hypervisors to allow them to communicate configuration information to guest operating systems and the software running on them, which is also kind of nifty.

Final notes on RISC-V

To me, all of this points to how something that started out as an academic project doing the basics of an instruction-set architecture is now trying to turn itself into a serious data-center architecture. Starting simple is not necessarily bad, but in order to build an industrial-strength system that can solve the problems solved by X86, ARM, Power, IBM System Z, and suchlike, much more is needed.

To a total outsider like me, it just looks like the early efforts on RISC-V were missing people familiar with “real” computers and what current system architectures actually look like and need from the processor. Not including basic functionality like virtualization from the beginning looks strange in hindsight. Or thinking about CPUID. Or the need for control-flow checks and shadow stacks. Or support for language virtual machines. All of that could have been designed-in from the start, but instead it ends up as extensions to the base. Which is just following the trajectory of all other instruction sets where stuff gets added over time “wherever it can be made to fit”. I see a lost opportunity, really.

It is also fascinating to see the old CISC vs RISC debate on what is a “fast” instruction set play out again. I thought that was settled on the side of “variable-length instructions are not a problem”. Apparently that was not necessarily the case.

Future

The next meetup is not yet planned, but keep an eye on the Meetup page or the Kista.com event page. Hope to see you there, whatever the topic turns out to be.
