Back to Bare Metal

Once upon a time, all programming was bare metal programming. You coded to the processor core, you took care of memory, and no operating system got in your way. Over time, as computer programmers, users, and designers got more sophisticated and as more clock cycles and memory bytes became available, more and more layers were added between the programmer and the computer. However, I have recently spotted what might seem like a trend away from ever-thicker software stacks, in the interest of performance and, in particular, latency.

Make no mistake. The big long-term trend is still towards more layers. Higher-level programming languages, operating systems, libraries and middleware, dynamic languages, interpreted languages, and virtual-machine-based languages all add software layers to provide more powerful abstractions and increased programmer productivity. Dynamic and virtualized systems aim to divorce even operating systems from the hardware and make processors an abstraction rather than bedrock. This happens in all areas of computing, including embedded systems.

However, there are some areas of computing that buck this trend. When maximum performance and minimum latency really matter, it seems that software architects and programmers realize that they have to remove or at least get around most of the abstraction layers. Real control is sometimes really necessary. APIs are de-abstracted to something quite close to the hardware.

One particular area where this is happening is in networking – with the advent of 10 gigabit per second (10G) Ethernet and higher speeds, software has to be brutally efficient to produce and consume packets at the rates that the network can deliver them. Network speeds have been increasing faster than processor speeds in recent years, which affects not only the core infrastructure of the Internet, but also regular application programs that need to take advantage of high data rates.

I have recently seen quite a few examples of this.

Luigi Rizzo wrote a nice and accessible article on the subject in the March 2012 issue of Communications of the ACM, called “Revisiting Network I/O APIs: The Netmap Framework”. In it, he notes that at 10 Gbps Ethernet, maximum packet rates are close to 15 million packets per second. 15 million events per second leaves a typical 2 to 3 gigahertz general-purpose processor core with at most a couple of hundred cycles to handle each event. At such data rates, the driver and network stack overhead in a typical OS is many times bigger than the time to process a single packet. Basically, the old sockets API and its layered implementation just cannot keep up. There is too much abstraction in the way. The solution proposed is the netmap API, which basically uses the abstraction of raw packets and works on a set of abstractions very close to how modern Ethernet controllers are implemented. Netmap aims to process batches of packets, not just individual packets, and to avoid all copying costs by directly putting received data packets into user space. Netmap also makes use of multicore hardware, having a single network interface map different hardware queues to different host cores (this is a standard function that has been available in many Ethernet controllers for some years).
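To make the cycle budget concrete, here is a small back-of-the-envelope sketch in C. It assumes minimum-size 64-byte Ethernet frames plus the usual 8 bytes of preamble and 12 bytes of inter-frame gap, and it assumes a single core has to keep up with the whole link. The function names are my own, made up for illustration; they are not taken from the article or from netmap.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that a minimum-size Ethernet frame occupies on the wire:
 * 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap.
 * (Assumption: worst case, every packet is minimum size.) */
enum { MIN_FRAME_WIRE_BYTES = 64 + 8 + 12 };

/* Maximum packets per second on a link with the given bit rate. */
static uint64_t max_packets_per_second(uint64_t link_bits_per_second)
{
    return link_bits_per_second / (MIN_FRAME_WIRE_BYTES * 8u);
}

/* Cycles available per packet if one core must keep up with line rate. */
static uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t link_bits_per_second)
{
    uint64_t pps = max_packets_per_second(link_bits_per_second);
    assert(pps > 0);   /* guards against a link rate too low to carry a frame */
    return cpu_hz / pps;
}
```

Calling max_packets_per_second(10000000000ULL) gives about 14.88 million packets per second, and cycles_per_packet() for a 2 GHz and a 3 GHz core gives roughly 134 and 201 cycles. Spreading the queues over several cores raises the budget per core, but the order of magnitude stays the same.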

The way that netmap provides software with very direct access to hardware is reminiscent of other APIs that are used in industry. Freescale has an API called Netcomm, for accelerating network code on their QorIQ and PowerQUICC SoCs. Intel has the DPDK (Data Plane Development Kit) API, which is used with quite a few different Intel platforms and Ethernet cards. Both these APIs let programmers use the hardware acceleration functions, in a way that does not rely on standard OS APIs for networking. The APIs are also necessary to access other kinds of hardware acceleration, like encryption, for which there is no standard API available at all. I have not worked extensively with either API, but the overall pattern is clear.  You cut out the OS drivers, go directly to the hardware, and rely on polling rather than interrupts to synchronize software and hardware.  Network interfaces (or queues in an individual interface) are tied to particular processor cores in multicore processors to reduce latencies and guarantee throughput.
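The polling pattern can be sketched in a few lines of C. This is a toy model under my own assumptions, not the netmap, Netcomm, or DPDK API: the ring layout, the names rx_ring, fake_nic_deliver() and rx_poll_batch(), and the ring size are all hypothetical, and a plain function stands in for the hardware that would normally fill the ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* hypothetical ring size; must be a power of two */

/* One receive descriptor: the producer fills in len and then sets ready. */
struct rx_slot {
    volatile uint32_t ready;   /* set by the producer, polled by software */
    uint32_t len;              /* packet length in bytes */
};

struct rx_ring {
    struct rx_slot slots[RING_SLOTS];
    uint32_t head;             /* next slot the software will look at */
};

/* Stand-in for the network hardware: mark slot pos as holding a packet. */
static void fake_nic_deliver(struct rx_ring *ring, uint32_t pos, uint32_t len)
{
    struct rx_slot *slot = &ring->slots[pos & (RING_SLOTS - 1u)];
    slot->len = len;
    slot->ready = 1;
}

/* Poll once: consume up to max_batch ready packets without blocking and
 * without interrupts. Returns the number of packets consumed and adds
 * their lengths to *bytes. A real driver would also need memory barriers
 * between reading the ready flag and reading the packet data. */
static size_t rx_poll_batch(struct rx_ring *ring, size_t max_batch, uint64_t *bytes)
{
    size_t n = 0;
    assert(ring != NULL && bytes != NULL);
    while (n < max_batch) {
        struct rx_slot *slot = &ring->slots[ring->head & (RING_SLOTS - 1u)];
        if (!slot->ready)
            break;             /* ring is empty for now, return to the caller */
        *bytes += slot->len;   /* "process" the packet in place, no copy */
        slot->ready = 0;       /* hand the slot back to the producer */
        ring->head++;
        n++;
    }
    return n;
}
```

In a real system, a thread bound to one core would call something like rx_poll_batch() in a tight loop on its own queue. The cost of finding out that there is work to do is then a single memory read instead of an interrupt and a trip through the OS.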

This makes perfect sense. I did some experiments a few years ago with device drivers in Linux that make me believe that OS overhead can be very significant (blog post). In these experiments with a simple hardware accelerator, I finally connected my user-level software directly to the hardware (using the mmap() function in Linux). I also switched from being interrupt-driven to polled, as that provided a much more efficient way to use really fast hardware. The details are captured in a Wind River whitepaper.
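For readers who have not used mmap() this way, the general shape is sketched below. This is not the code from my experiments. It is a minimal illustration that assumes a made-up register layout (REG_STATUS, STATUS_DONE) and uses an ordinary file as a stand-in for the device node, so that it can run without any special hardware. With a real device you would open the node that the driver exposes and would not create or resize it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REG_WINDOW_BYTES 4096u   /* hypothetical size of the register window */
#define REG_STATUS       0u      /* hypothetical status register index */
#define STATUS_DONE      0x1u    /* hypothetical "operation finished" bit */

/* Map the register window into user space. A regular file stands in for
 * the device here, which is why it is created, truncated and sized. */
static volatile uint32_t *map_registers(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, REG_WINDOW_BYTES) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, REG_WINDOW_BYTES, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Poll the status register instead of waiting for an interrupt.
 * Returns 1 if the done bit was seen within max_spins reads, else 0. */
static int poll_done(volatile uint32_t *regs, unsigned long max_spins)
{
    assert(regs != NULL);
    while (max_spins--) {
        if (regs[REG_STATUS] & STATUS_DONE)
            return 1;
    }
    return 0;
}

/* Remove the mapping again. */
static void unmap_registers(volatile uint32_t *regs)
{
    munmap((void *)regs, REG_WINDOW_BYTES);
}
```

The volatile qualifier matters: it forces the compiler to perform every read in the polling loop, which is what you want when the other side of the memory location is a piece of hardware.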

Yet another example of this kind of static allocation of work to cores and short paths from software to hardware is found in the Microsoft Sora software-defined-radio project. In Sora, they even manage to convince Windows XP to provide short latencies and tasks bound to cores, showing that it seems possible to retrofit real-time behavior to almost any OS as long as you have a few cores to spare.

Overall, an interesting trend. It is clear that not all software requires very short latency or the ability to handle millions of events per second. But it is equally clear that when this is the case, reducing the number of abstraction layers is the way to go. There are recurring attempts to find some way to build an abstraction that provides this kind of control on top of or inside existing many-layered systems, but for some reason I just like this reduction to the simplest possible solution. It just seems simpler.

One thought on “Back to Bare Metal”

  1. It kind of comes back to deterministic machines. Computers are and should be predictable machines. Over the course of the business cycle, computers should perform and should give the same results for the same input. Computing fewer instructions should free the processor to start computing the next batch sooner.
    Alcatel claims to have a 400 Gbps throughput processor; this makes it suitable for implementing a 40-port 10 Gbps network switch.
    Once you have maxed out your available processor, you need to scale it up somehow by going multiprocessor or multicore.
    400 Gbps might seem like a lot of throughput, but it just consists of 20 Mbps × 50 × 400 Full HD channels. This capacity can easily be used up by Full HD videoconferencing (Skype).
