Observations from Uppsala: Computer Simulation, Virtual Platforms, Embedded Programming, Multicore and More (by Jakob Engblom)

Back to Bare Metal

2012 March 30 22:10 / 1 Comment / Jakob

Once upon a time, all programming was bare metal programming. You coded to the processor core, you took care of memory, and no operating system got in your way. Over time, as computer programmers, users, and designers got more sophisticated and as more clock cycles and memory bytes became available, more and more layers were added between the programmer and the computer. However, I have recently spotted what might seem like a trend away from ever-thicker software stacks, in the interest of performance and, in particular, latency.

Make no mistake. The big long-term trend is still towards more layers. Higher-level programming languages, operating systems, libraries and middleware, dynamic languages, interpreted languages, and virtual-machine-based languages all add software layers to provide more powerful abstractions and increased programmer productivity. Dynamic and virtualized systems aim to divorce even operating systems from the hardware and make processors an abstraction rather than bedrock. This happens in all areas of computing, including embedded systems.

However, there are some areas of computing that buck this trend. When maximum performance and minimum latency really matter, it seems that software architects and programmers realize that they have to remove or at least get around most of the abstraction layers. Real control is sometimes really necessary. APIs are de-abstracted to something quite close to the hardware.

One particular area where this is happening is in networking – with the advent of 10 gigabit per second (10G) Ethernet and higher speeds, software has to be brutally efficient to produce and consume packets at the rates that the network can deliver them. Network speeds have been increasing faster than processor speeds in recent years, which affects not only the core infrastructure of the Internet, but also regular application programs that need to take advantage of high data rates.

I have recently seen quite a few examples of this.

Luigi Rizzo wrote a nice and accessible article on the subject in the March 2012 issue of Communications of the ACM, called “Revisiting Network I/O APIs: The Netmap Framework”. In it, he notes that at 10 Gbps Ethernet, maximum packet rates are close to 15 million packets per second. 15 million events per second leaves a typical 2 to 3 gigahertz general-purpose processor with only a couple of hundred cycles, at best, to handle each event. At such data rates, the driver and network stack overhead in a typical OS is many times bigger than the time to process a single packet. Basically, the old sockets API and its layered implementation just cannot keep up. There is too much abstraction in the way. The solution proposed is the netmap API, which uses the abstraction of raw packets and works on a set of abstractions very close to how modern Ethernet controllers are implemented. Netmap aims to process batches of packets, not just individual packets, and to avoid all copying costs by directly putting received data packets into user space. Netmap also makes use of multicore hardware, having a single network interface map different hardware queues to different host cores (this is a standard function that has been available in many Ethernet controllers for some years).
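The packet-rate figure is easy to check by hand. The sketch below is my own back-of-the-envelope arithmetic, not code from the article: it assumes minimum-size 64-byte Ethernet frames plus the 20 bytes of preamble and inter-frame gap that each frame occupies on the wire, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the 10G packet rate and per-packet cycle budget.
# Assumption: worst case of minimum-size frames arriving back to back.

MIN_FRAME_BYTES = 64           # minimum Ethernet frame, including the CRC
WIRE_OVERHEAD_BYTES = 8 + 12   # preamble/start delimiter + inter-frame gap


def max_packets_per_second(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Highest packet rate a link can carry for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def cycles_per_packet(clock_hz, link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Clock cycles one core can spend per packet at the maximum rate."""
    return clock_hz / max_packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    rate = max_packets_per_second(10e9)
    print(f"10G Ethernet: {rate / 1e6:.2f} Mpps")
    for ghz in (2, 3):
        budget = cycles_per_packet(ghz * 1e9, 10e9)
        print(f"{ghz} GHz core: {budget:.0f} cycles per packet")
```

Under these assumptions the rate comes out at about 14.88 million packets per second, and the budget at roughly 134 cycles for a 2 GHz core and 202 cycles for a 3 GHz core. Larger frames relax the budget considerably, which is why the small-packet case is the one that hurts.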

The way that netmap provides software with very direct access to hardware is reminiscent of other APIs that are used in industry. Freescale has an API called Netcomm, for accelerating network code on their QorIQ and PowerQUICC SoCs. Intel has the DPDK (Data Plane Development Kit) API, which is used with quite a few different Intel platforms and Ethernet cards. Both of these APIs let programmers use the hardware acceleration functions in a way that does not rely on standard OS APIs for networking. The APIs are also necessary to access other kinds of hardware acceleration, like encryption, for which there is no standard API available at all. I have not worked extensively with either API, but the overall pattern is clear. You cut out the OS drivers, go directly to the hardware, and rely on polling rather than interrupts to synchronize software and hardware. Network interfaces (or queues in an individual interface) are tied to particular processor cores in multicore processors to reduce latencies and guarantee throughput.
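To make the pattern concrete, here is a toy sketch of a pinned, polling consumer that drains a receive ring in batches. It is not the netmap, Netcomm, or DPDK API: `RxRing`, `hw_receive`, `poll`, `drain` and `pin_to_one_core` are names I made up, and the "hardware" is just a Python list. The only real OS calls are `os.sched_getaffinity` and `os.sched_setaffinity`, which exist on Linux only.

```python
import os


def pin_to_one_core():
    """Pin this process to one of its allowed cores; None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):   # Linux-only API
        return None
    core = min(os.sched_getaffinity(0))        # pick a core we may run on
    os.sched_setaffinity(0, {core})
    return core


class RxRing:
    """Toy stand-in for a hardware receive ring with head/tail indices."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.head = 0   # next slot the "hardware" fills
        self.tail = 0   # next slot the software consumes

    def hw_receive(self, packet):
        """Simulate the NIC writing a packet; drop it if the ring is full."""
        if self.head - self.tail == self.size:
            return False
        self.slots[self.head % self.size] = packet
        self.head += 1
        return True

    def poll(self, max_batch):
        """Return up to max_batch pending packets without ever blocking."""
        count = min(self.head - self.tail, max_batch)
        batch = [self.slots[(self.tail + i) % self.size] for i in range(count)]
        self.tail += count
        return batch


def drain(ring, max_batch, handle):
    """Poll until the ring is empty, passing each batch to handle()."""
    total = 0
    while True:
        batch = ring.poll(max_batch)
        if not batch:
            return total
        handle(batch)
        total += len(batch)
```

A real data-plane loop would call something like `pin_to_one_core()` once at startup and then spin on `poll()` forever instead of returning when the ring is empty. Handling a whole batch per call is what spreads the fixed per-call cost over many packets.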

This makes perfect sense. I did some experiments a few years ago with device drivers in Linux that make me believe that OS overhead can be very significant (blog post). In these experiments with a simple hardware accelerator, I finally connected my user-level software directly to the hardware (using the mmap() function in Linux). I also switched from being interrupt-driven to polled, as that provided a much more efficient way to use really fast hardware. The details are captured in a Wind River whitepaper.
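The shape of that approach can be sketched as follows. This is not the code from my experiments: the register layout is hypothetical, a temporary file stands in for the device memory so that the example runs anywhere without special privileges, and `fake_device_step` plays the role of the accelerator. On real hardware you would map the device's own memory region instead, and the device would make progress by itself.

```python
import mmap
import struct
import tempfile

# Hypothetical register layout, chosen only for this illustration.
CTRL_OFFSET = 0     # writing a non-zero operand starts a job
STATUS_OFFSET = 4   # the device sets this to 1 when the job is done
RESULT_OFFSET = 8   # the device leaves its result here
REGION_SIZE = mmap.PAGESIZE


def write_reg(regs, offset, value):
    """Store a 32-bit little-endian value into the mapped region."""
    struct.pack_into("<I", regs, offset, value)


def read_reg(regs, offset):
    """Load a 32-bit little-endian value from the mapped region."""
    return struct.unpack_from("<I", regs, offset)[0]


def fake_device_step(regs):
    """Stand-in for the hardware: double the operand and flag completion."""
    operand = read_reg(regs, CTRL_OFFSET)
    if operand and not read_reg(regs, STATUS_OFFSET):
        write_reg(regs, RESULT_OFFSET, operand * 2)
        write_reg(regs, STATUS_OFFSET, 1)


def run_job(regs, operand, max_polls=1000):
    """Start a job, then poll the status register instead of taking an interrupt."""
    write_reg(regs, STATUS_OFFSET, 0)
    write_reg(regs, CTRL_OFFSET, operand)
    for _ in range(max_polls):
        fake_device_step(regs)   # real hardware would progress on its own
        if read_reg(regs, STATUS_OFFSET) == 1:
            return read_reg(regs, RESULT_OFFSET)
    raise TimeoutError("device did not complete")


def demo():
    """Map a page-sized scratch file as if it were the device and run one job."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(REGION_SIZE)
        with mmap.mmap(backing.fileno(), REGION_SIZE) as regs:
            return run_job(regs, 21)
```

Calling `demo()` returns 42 with this fake device. The point is that after the initial mapping, every register access is an ordinary load or store from user space, with no system call and no interrupt handler on the fast path.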

Yet another example of this kind of static allocation of work to cores and short paths from software to hardware is found in the Microsoft Sora software-defined-radio project. In Sora, they even manage to convince Windows XP to provide short latencies and tasks bound to cores, showing that it seems possible to retrofit real-time behavior to almost any OS as long as you have a few cores to spare.

Overall, an interesting trend. It is clear that not all software requires very short latency or the ability to handle millions of events per second. But it is equally clear that when this is the case, reducing the number of abstraction layers is the way to go. There are recurring attempts to find some way to build an abstraction that provides this kind of control on top of or inside existing many-layered systems, but for some reason I just like this reduction to the simplest possible form. It just seems simpler.

Posted in: computer architecture, embedded software, multicore software, programming / Tagged: Communications of the ACM, ethernet, Luigi Rizzo, NetMap, networking

One Thought on “Back to Bare Metal”

  1. usdhfjsdf on 2012 March 31 at 14:38 said:

    It kind of comes back to deterministic machines. Computers are and should be predictable machines. Over the course of the business cycle computers should perform and should have the same results for the same input. Computing fewer instructions should free the processor to start computing the next batch faster.
    Alcatel claims to have a 400 Gbps throughput processor; this makes it suitable for implementing a 40-port 10 Gbps network switch.
    Once you have maxed out your available processor you need to scale it up somehow by going multiprocessor or multicore.
    400 Gbps might seem like a lot of throughput, but it amounts to only 20 Mbps × 50 × 400, that is, 20,000 Full HD channels. This capacity can easily be used up by Full HD videoconferencing (Skype).

© Copyright 2013 - Observations from Uppsala