Modeling Endianness – Observations from Uppsala

Endianness is a topic in computer architecture that can give anyone a headache trying to understand exactly what is happening and why. In the field of computer simulation, it is a pervasive problem that takes some thinking to solve in an efficient, composable, and portable way.

This blog post describes how I am used to working with endianness in virtual platforms, and why this approach makes sense to me. There are other ways of dealing with endianness, with different trade-offs and overriding goals.

Fundamentals

What is endianness? In my way of looking at it, it is the arbitrary solution to the problem you get when a large unit of information (say, a 32-bit word) needs to be stored as a set of smaller units (say, 8-bit bytes). When this happens, you need to split the large unit into smaller units, and decide on how to order the smaller units. There is no objectively better or worse way to do this – as long as the result is unambiguous and based on positional numerics (i.e., no roman numerals, please), it is hard to claim that one order is better than another.

We use “endianness” all the time without really thinking about it when we write regular decimal numbers. In our standard base-10 writing system, any value greater than 9 has to be written down using multiple digits. The order we use is a big-endian representation: the most significant digits come first in our reading order (hundreds before tens before single digits, etc.).

In computer architecture, we have three main schools of endianness, plus a fourth that is rarely seen:

No endian, where we never break things down to bytes but always operate on equal-size words (not very common in practice, but certain machines like the Microchip PIC have instruction ROMs as wide as the instructions, and no way to address components of the instructions).
Big endian, BE, where the most significant bytes are put first in order of ascending addresses. I.e., the “big end” comes first.
Little endian, LE, where the least significant bytes are put first in order of ascending addresses.
“Middle endian”, where the ordering differs for different sizes of data (Wikipedia mentions this, but I have never seen an example). I have heard stories about chips that also used different endianness to store data by different instructions (by misdesign, I am not referring to the Power Architecture load/store byte-reversed instructions).
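To make the BE and LE definitions concrete, here is a minimal sketch that breaks a 32-bit word into four bytes in each order. The function names `split_be32` and `split_le32` are made up for illustration. The code uses only shifts, so it gives the same result regardless of the machine it runs on.

```c
#include <stdint.h>

/* Big endian: the most significant byte goes to the lowest index (address). */
void split_be32(uint32_t value, uint8_t out[4])
{
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)value;
}

/* Little endian: the least significant byte goes to the lowest index (address). */
void split_le32(uint32_t value, uint8_t out[4])
{
    out[0] = (uint8_t)value;
    out[1] = (uint8_t)(value >> 8);
    out[2] = (uint8_t)(value >> 16);
    out[3] = (uint8_t)(value >> 24);
}
```

For the word 0x01020304, `split_be32` produces the sequence 0x01, 0x02, 0x03, 0x04 and `split_le32` produces 0x04, 0x03, 0x02, 0x01.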

BE is the traditional choice of IBM and the major early RISC chips, with Power Architecture, MIPS, SPARC, and the zSeries as the most important representatives. LE is the choice of x86, and more recently ARM. MIPS also seems to be gravitating towards LE, probably as a way to make x86 software slightly easier to port. Note that even though some processor cores are described as endianness-neutral, that really means that they can run as either LE or BE. In practice, particular chip designs incorporating such cores tend to lean heavily towards one endianness, since devices are designed for a particular endianness.

The Software View

For me, the most important view of endianness is how the software sees it. When a program is running on any current architecture, it logically sees memory as an array of bytes. Inside the memory chips, we have a very different physical layout, usually with words much wider than a byte, as well as an addressing scheme that is not one-dimensional. The interconnect (“bus”) moving data from a processor to memory and back is a complex system containing caches, buses of different widths (usually 64 bits or more), memory controllers, cache controllers, bus bridges, and other devices. All of this is usually completely invisible to software, as illustrated below:

Basically, the bus system is invisible. The important endianness property as far as software is concerned is the order in which bytes are put into memory, and memory is considered as an array of bytes (since a byte is the smallest unit of addressing). If you look at the memory of a computer system using a debugger, this is the view you will get – both for on-target and off-target debuggers like ICE units and JTAG debuggers. Each memory access (store or load) will logically pass a small array of bytes into some position in the very large array that is memory.

The Modeling View

Modeling endianness is not optional when building a virtual platform. At some point, the software will assume a certain relationship between word layouts and byte addresses in memory, for example when overlaying a byte array on an integer in a C union, or when interpreting network packets (which are defined to use BE byte order, so network code has to convert values to native endianness to process them).
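The two software idioms mentioned here can be sketched in a few lines of C. The names `word_bytes`, `host_is_little_endian`, and `read_net32` are hypothetical and chosen only for this example. The union shows code whose result depends on the byte order of the machine it runs on, while the packet decoder is written with shifts and therefore does not.

```c
#include <stdint.h>

/* A byte array overlaid on an integer: the result of reading bytes[0]
 * depends on the byte order of the machine running this code. */
union word_bytes {
    uint32_t word;
    uint8_t  bytes[4];
};

/* Returns 1 if the least significant byte sits at the lowest address. */
int host_is_little_endian(void)
{
    union word_bytes u;
    u.word = 1;
    return u.bytes[0] == 1;
}

/* Decode a 32-bit field stored in network (big-endian) order.
 * Built from shifts, so it behaves the same on any machine. */
uint32_t read_net32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

A virtual platform has to make both kinds of code behave as they would on the physical target, which is why the byte order of simulated memory cannot be left to chance.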

If you start from the software view of endianness and memory, the obvious simulation model for memory operations is to maintain the array of bytes view of memory matching the physical target.

Each memory access from a simulated processor gets turned into a transaction in the simulator.
The transaction has variable size, matching the size of the memory access operation issued by the processor.
The transaction contains a sequence of bytes, in the same order as they would end up in target memory on a physical machine. I.e., the order reflects the endianness of the processor.
The transaction has a starting address (byte-based) matching the memory access the processor issues.
The contents of the memory model in the simulation is an array of bytes, and its content matches what you would find on the physical target – the logical software view of the target.
The bus system connecting the processor to the memory is basically considered as a black box that just moves the transaction to memory.
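The points above could be captured in a data structure along the following lines. This is a sketch under my own assumptions, not the API of any particular simulator: the type names `transaction_t` and `memory_t`, the 8-byte access limit, and the return-code convention are all invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ACCESS_BYTES 8          /* assumed largest single access */

/* One memory access: a byte address, a size, and the bytes in target order. */
typedef struct {
    uint64_t address;
    size_t   size;
    uint8_t  data[MAX_ACCESS_BYTES];
} transaction_t;

/* Target memory as software sees it: a plain array of bytes. */
typedef struct {
    uint8_t *bytes;
    size_t   size;
} memory_t;

/* Returns 1 if the access fits inside the memory, 0 otherwise. */
static int access_in_range(const memory_t *mem, const transaction_t *t)
{
    return t->size <= MAX_ACCESS_BYTES &&
           t->address <= mem->size &&
           t->size <= mem->size - t->address;
}

/* Copy the transaction bytes into memory unchanged. Returns 0 on success. */
int memory_write(memory_t *mem, const transaction_t *t)
{
    if (!access_in_range(mem, t))
        return -1;
    memcpy(mem->bytes + t->address, t->data, t->size);
    return 0;
}

/* Fill the transaction with bytes from memory unchanged. Returns 0 on success. */
int memory_read(const memory_t *mem, transaction_t *t)
{
    if (!access_in_range(mem, t))
        return -1;
    memcpy(t->data, mem->bytes + t->address, t->size);
    return 0;
}
```

Note that the memory model never looks at the meaning of the bytes. All knowledge about endianness stays with whoever produces or consumes the transaction.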

The above is very easy to implement, and actually a very convenient implementation for someone used to the software view of hardware. The only thing that remains to be considered is how a processor simulator is implemented in practice.

In a typical processor simulator, you represent the target system registers using words of the same size as the target processor uses. I.e., for a 32-bit processor, you use 32-bit words on the host to represent the contents of a register. As the processor model is running, the contents of the register might have to be stored in data structures internal to the processor (such as an array of words representing the register file). Naturally, such data structures are kept in host endianness since they are just plain compiled C code. As the processor model runs, arithmetic is carried out using host endianness.

Actually, usually no endianness is involved as the values are considered as words. Remember that a word does not have endianness until it is broken down into bytes and someone actually looks at the bytes. In particular, an operation like

uint8  a;
uint32 b;
a = (b & 0xff);

will pick up the 8 lowest bits of a word on any processor. The code is logically working inside of registers and is perfectly portable. However, the result of

uint32 *c;
*c = b;
a = *((uint8 *)c);

will pick up the first (at the lowest address) byte stored in memory when b was written – which is the same as the above on an LE processor, but different on a BE processor. The crucial observation here is that the latter variant contains an explicit store of a word, and an explicit load of a byte. Thus, endianness enters as we store the word (the byte load has no endianness, as it is loading the smallest unit of addressability).

What this means is that a processor simulator will have to do an explicit ordering of bytes as it is writing out values to memory. The simulator will need to take a word it has represented in “host order” (as it is within the simulator itself) and convert it to the byte order of the target processor. If the two match, such as simulating a little-endian ARM target on an (always little-endian) x86 host, nothing needs to be done. If they do not match, such as simulating a big-endian PPC target on an x86 host, the bytes have to be swapped before being sent to simulated memory.

When the processor does a load, it similarly has to swap the bytes being read from memory (if using different target and host endianness).
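A minimal sketch of this store and load path could look as follows. The helper names (`swap32`, `host_is_little_endian`, `cpu_store32`, `cpu_load32`) are made up for this example, and the host check is done at run time for simplicity; a real simulator would more likely settle it at compile time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the order of the four bytes in a word. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Returns true if the host stores the least significant byte first. */
bool host_is_little_endian(void)
{
    uint32_t one = 1;
    uint8_t first;
    memcpy(&first, &one, 1);
    return first == 1;
}

/* Store a register value to simulated memory in target byte order. */
void cpu_store32(uint8_t *mem, size_t addr, uint32_t value, bool target_is_le)
{
    if (target_is_le != host_is_little_endian())
        value = swap32(value);              /* host and target differ */
    memcpy(mem + addr, &value, sizeof value); /* host-native word store */
}

/* Load a register value from simulated memory in target byte order. */
uint32_t cpu_load32(const uint8_t *mem, size_t addr, bool target_is_le)
{
    uint32_t value;
    memcpy(&value, mem + addr, sizeof value); /* host-native word load */
    if (target_is_le != host_is_little_endian())
        value = swap32(value);
    return value;
}
```

When host and target agree, both functions reduce to a plain word copy. The contents of `mem` are the same no matter which host the simulator runs on.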

As soon as we leave the processor simulator, the order of bytes in transactions and simulated memory has to be defined and managed in a host-independent way. This is crucial to enable snapshots of memory to be shared across hosts, time, and space, and simply to allow the simulation to work correctly. The semantics of the simulation must be defined by the simulator, not by the nature of the host.

Note that, as an optimization, we quite often do not create an explicit transaction at all, but instead let the processor simulator write directly to the representation of the target memory in the memory simulator. In this design, the target memory representation is just an array of bytes mirroring the contents that the processor would see on a physical target.

Let’s go through this with a simple example. We assume we are on an x86 host. Our processor simulator contains a 32-bit register with the value 0x01020304. This value is endianless until we have to send it to simulated memory; it is just a value of 32 bits. We write it to target memory at address 0x100.

On a simulated LE target, the memory write will result in a transaction containing the byte sequence (0x04, 0x03, 0x02, 0x01) – lowest byte comes first. The memory model will store this with 0x04 at address 0x100, 0x03 at 0x101, etc. The processor model can achieve this effect by simply doing a host-native word store to the memory array.

On a simulated BE target, the memory write will result in a transaction containing (0x01, 0x02, 0x03, 0x04). In memory, 0x01 will be stored at address 0x100, 0x02 at 0x101, etc. To store this word correctly, the processor model will have to do a byte swap operation on the word before writing it out to memory. Such a byte swap operation might seem expensive, but the evidence does not indicate that it matters. All the fastest instruction-set simulators use this method internally as far as I know (Wind River Simics, Imperas OVP, QEMU, IBM Mambo), which to me indicates that the design works well on a simulation system level.

Device Models

Device models are the main part of a functional simulator for a computer system. They also have endianness, as they expose memory-mapped interfaces to software. To deal with devices in a consistent manner, they will interpret inbound memory transactions using their local register endianness. This makes it simple and reliable to simulate systems where the processor and the devices have different endianness.

Systems with mixed device endianness are very common, mostly thanks to PCI. PCI is defined to use little-endian byte ordering in all memory accesses, as it originated in the x86 world. PCI is still being used in almost all computer systems, and thus LE PCI devices are being connected to BE processors.

Internally, a device model will also use words to represent data. When data is written to a device, it will interpret the bytes in the write transaction using its local order. When data is read from a device, it will fill in the data in the read transaction using its local order. This makes device drivers that byte-swap incoming data from an LE PCI device on a BE processor work just like they do on physical hardware.

This makes endianness a local property of the device. The same device model can be used without change in both an LE and a BE target system. This mirrors reality: PCI devices are used in all kinds of systems, and the devices do not change, and neither do the models have to.
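As an illustration, here is a sketch of a device with a single 32-bit little-endian register. The type `le_device_t` and its access functions are hypothetical, and a real device model would of course decode register offsets as well. The point is that the device assembles its register value from the transaction bytes using its own byte order, without knowing anything about the processor or the host.

```c
#include <stddef.h>
#include <stdint.h>

/* A device with one 32-bit register, defined by the device to be LE. */
typedef struct {
    uint32_t reg;   /* held as a plain word inside the model */
} le_device_t;

/* Interpret the bytes of a write transaction in the device's local order. */
int le_device_write(le_device_t *dev, const uint8_t *data, size_t size)
{
    if (size != 4)
        return -1;  /* this sketch only supports full-word accesses */
    dev->reg = (uint32_t)data[0]         |
               ((uint32_t)data[1] << 8)  |
               ((uint32_t)data[2] << 16) |
               ((uint32_t)data[3] << 24);
    return 0;
}

/* Fill in the bytes of a read transaction in the device's local order. */
int le_device_read(const le_device_t *dev, uint8_t *data, size_t size)
{
    if (size != 4)
        return -1;
    data[0] = (uint8_t)dev->reg;
    data[1] = (uint8_t)(dev->reg >> 8);
    data[2] = (uint8_t)(dev->reg >> 16);
    data[3] = (uint8_t)(dev->reg >> 24);
    return 0;
}
```

If a BE processor stores 0x01020304 to this register without swapping, the transaction carries 0x01, 0x02, 0x03, 0x04 and the device sees 0x04030201. If the driver byte-swaps first, as it would on physical hardware, the device sees 0x01020304.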

In some systems, the designers try to hide the RISC-processor-to-PCI endianness mismatch by making the hardware swap bytes around as they move from the memory bus into the PCI subsystem. If this is the case in a target system, the simplest simulation method is to insert a byte-swapping intermediary on the path from the processor to the devices. This will do an extra byte swap on all transactions passing by, and things will work correctly (note that this byte swap has to be defined to work on a certain word length, and if transactions are bigger than this length, you will also have to order the words).
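Such an intermediary could be sketched as below. The function name `bridge_swap` is hypothetical. As an assumption of this sketch, the bridge reverses the bytes inside each word of the configured length and leaves the words themselves in their original order; what a specific piece of hardware does with transactions longer than one word has to be taken from its documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bytes inside each word_len-sized word of a transaction.
 * The order of the words themselves is left unchanged (an assumption).
 * Returns 0 on success, -1 if size is not a multiple of word_len. */
int bridge_swap(uint8_t *data, size_t size, size_t word_len)
{
    if (word_len == 0 || size % word_len != 0)
        return -1;

    for (size_t base = 0; base < size; base += word_len) {
        for (size_t i = 0; i < word_len / 2; i++) {
            uint8_t tmp = data[base + i];
            data[base + i] = data[base + word_len - 1 - i];
            data[base + word_len - 1 - i] = tmp;
        }
    }
    return 0;
}
```

With a word length of 4, an 8-byte transaction containing 1, 2, 3, 4, 5, 6, 7, 8 comes out as 4, 3, 2, 1, 8, 7, 6, 5.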

Note that as long as all units involved on the path from a device to a processor use the same word length, you can replace all the byte swapping operations with a simple flag. This flag will indicate if a transaction has been swapped or not. For example, consider a BE processor talking to a BE device on an LE host. The BE processor will flag the transaction as “wrong-endian” as it sends it out but actually store the bytes in LE order in the transaction. The BE device will check the flag and realize that it is wrong-endian too. And since two wrongs make a right, it does not have to swap the bytes either but can copy the transaction contents directly into its internal registers.
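The flag trick could be sketched like this, under the stated restriction that every access is exactly one 32-bit word. The names `flagged_txn_t`, `cpu_send`, and `device_receive` are invented for the example and do not come from any particular simulator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reverse the order of the four bytes in a word. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* A word-sized transaction: the payload stays in host order, and the flag
 * records that the sender's byte order differs from the host's. */
typedef struct {
    uint32_t host_word;
    bool     wrong_endian;
} flagged_txn_t;

/* The processor sends its register value unswapped and only sets the flag. */
flagged_txn_t cpu_send(uint32_t reg_value, bool cpu_differs_from_host)
{
    flagged_txn_t t;
    t.host_word = reg_value;
    t.wrong_endian = cpu_differs_from_host;
    return t;
}

/* The receiver swaps only if exactly one side differs from the host. */
uint32_t device_receive(flagged_txn_t t, bool device_differs_from_host)
{
    if (t.wrong_endian == device_differs_from_host)
        return t.host_word;         /* same order on both ends: plain copy */
    return swap32(t.host_word);     /* orders differ: one real swap */
}
```

In the BE processor to BE device case on an LE host, both flags are set and no swap is ever executed. A BE processor talking to an LE device still costs one swap, which is the same amount of work as in the byte-array scheme.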

Dealing with Data

There are other things you want to do with a memory image in a virtual platform apart from reading and writing it from a processor. One particular task is to move data into and out of the memory model in order to load code and data, as well as to save the state of the system. The representation of a memory as an array of bytes works very well for this purpose, since it corresponds naturally to how software files are created on the host. Since most software files are intended to be loaded by the target into target memory, they are prepared in target byte order. Another advantage of using a byte-based memory representation is that file formats like ELF can be loaded straight into virtual memory without having to convert addresses.

The representation is also host-independent, which facilitates moving memory images from one host to another, a key part of using virtual platforms as a communications mechanism. Another benefit of viewing memory as an array of bytes as accessed from a processor is that debuggers can look at memory in the same way as they would when running on the same host.

Summary

This long post (WordPress tells me it is more than 2500 words) really only starts to scratch the surface of this fascinating topic. It has described one approach to endianness modeling, and some of the subtleties involved. There are many more subtleties that we could go into.

Footnote: SystemC TLM-2.0

There are other ways to model endianness. In particular, the approach described here is not used in the SystemC TLM-2.0 standard. In TLM-2.0, all data is stored in a transaction in host order, not target order. To model the target endianness, you instead change a descriptor array that tells the simulator about how to interpret the bytes when viewed from the target.

As I see it, this means that TLM-2.0 is better suited for modeling the ins and outs of a bus system, including discovering how data ends up at a target from the actions of the various components of the bus system. It models byte lanes and the width of buses, and uses host byte order for all transfers of data. In contrast, the approach described in this blog post works by modeling the documented (or intended) effect of the hardware at the software level.

Overall, I would say that TLM-2.0 is slightly more geared towards the “design” use of modeling, rather than “describe”. By modeling bus widths, actual byte lanes, and other concepts, the simulator will discover the shape and endianness of data as it arrives at a target memory or device.