CACM on DSAs

The July 2020 edition of the Communications of the ACM (CACM) had a front-page theme of “Domain-Specific Hardware Accelerators”, or DSAs. It contained two articles about the subject, one about an academic genomics accelerator, and one about the Google TPU. Hardware accelerators dedicated to particular types of computation are basically everywhere today, and an accepted part of the evolution of computers. The CACM articles have some good tidbits and points about how accelerators are designed and used today. At the same time, I also found a YouTube talk about the first hardware accelerator, the IBM Stretch HARVEST, showing both contrasts with today and a remarkable continuity in concept.

Sources

For precision, the two articles in the CACM issue and the YouTube video are:

“Domain-Specific Hardware Accelerators”, by William J. Dally, Yatish Turakhia, and Song Han. The title is a bit too general for the content. They do start with some general observations, but mostly the article is about a set of accelerators that the authors built for genomics-related computations.

“A Domain-Specific Supercomputer for Training Deep Neural Networks”, by Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. This is about the Google Tensor Processing Unit (TPU) in its versions 2 and 3.

“The Stretch HARVEST Compiler”, by Frances Allen, a talk recorded at the Computer History Museum in November 2000. The HARVEST was arguably the world’s first dedicated domain-specific hardware accelerator, created by IBM for the National Security Agency in the US (which at the time was so secret that NSA was jokingly said to stand for “No Such Agency”). The HARVEST went into operation in 1962, was twice as big as the already-large host computer (the IBM 7030 Stretch), and accelerated operations related to crypto cracking and signals analysis by up to 200x compared to the general-purpose computers of the day.

Why are Accelerators Efficient?

One question that needs to be answered is when and why specialized accelerators make sense as a design choice. What makes them actually better at doing something than a general-purpose processor running a piece of software? When and why does specialization pay off? It might seem obvious that being “specialized” or “dedicated” automatically makes something “better”… but that is not necessarily the case.

In general, a few different factors combine to make accelerators more efficient and higher-peak-performance than a general-purpose processor occupying the same amount of silicon.

Reduced overhead, since most of the energy of a general-purpose processor is spent dealing with the overhead of instruction flow. In practice, only a small fraction (like 1% for an aggressive out-of-order machine) is spent actually computing the results of operations. A general-purpose processor has many layers of middle managers determining what the workers on the floor should be doing. An accelerator reduces this overhead by restricting what the hardware can do and how it is done.

Accelerators typically employ specialized operations that would take many instructions to do on a general-purpose machine. Note that these are not enough on their own – if an operation is sufficiently useful, it can be added to the instruction sets of general-purpose processors. Look at how processor designers keep adding instructions for tasks like crypto computations, bit manipulation, vectorized math, and similar.
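As a software-level analogy (my own sketch, not an example from the articles), compare counting the set bits in a word with generic operations to doing it with one dedicated operation, much as a POPCNT-style instruction replaces a whole loop:

```python
# Generic approach: one step per bit, many simple operations.
def popcount_generic(x: int) -> int:
    count = 0
    while x:
        count += x & 1
        x >>= 1
    return count

# "Specialized operation": a single call, analogous to a dedicated
# POPCNT instruction (int.bit_count() requires Python 3.10+).
x = 0xDEADBEEF
assert popcount_generic(x) == x.bit_count() == 24
```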

Specialized datatypes that closely match the domain are commonly employed in accelerators. The TPU article describes the idea behind the “brain floating point”, bfloat16, format. Bfloat16 has been adopted in recent Intel and ARM instruction sets, as it is an easy and useful addition to general-purpose processors.

A new floating point format might seem like an extremely esoteric concern, but it is actually a very clever innovation, which is described like this in the TPU article:

The resulting brain floating format (bf16) in Figure 5 keeps the same 8-bit exponent as fp32. Given the same exponent size, there is no danger in losing the small update values due to FP underflow of a smaller exponent, […] However, fp16 requires adjustments to training software (loss scaling) to deliver convergence and efficiency. […]

As the size of an FP multiplier scales with the square of the mantissa width, the bf16 multiplier is half the size and energy of a fp16 multiplier […]. Bf16 delivers a rare combination: reducing hardware and energy while simplifying software by making loss scaling unnecessary.
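To make the quoted argument concrete, here is a minimal Python sketch (my own illustration, not code from the article) that produces bfloat16 values by truncating the float32 bit pattern, and redoes the multiplier-area arithmetic:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of a float32: sign, 8-bit exponent, 7-bit mantissa.
    (Real hardware typically rounds as well; truncation keeps the sketch short.)"""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Expand a bfloat16 bit pattern back to float32 by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159265)))  # ~3.140625: less precision...
print(bf16_bits_to_f32(f32_to_bf16_bits(1e-30)))       # ...but no underflow (fp16 would flush this to 0)

# Multiplier area scales roughly with the square of the mantissa width
# (including the implicit leading bit): bf16 is about half of fp16.
print(8**2 / 11**2)  # ~0.53
```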

In the genomics article, they also describe how the accelerator can make use of compressed and specialized formatting for arrays of data that provide both compute efficiency and memory efficiency, but which would be hard to do in a general-purpose processor. The 1962 IBM HARVEST system also featured several unique data representations, tailored to its job of cracking codes.

Extensive parallelism is a common property of accelerators, at least where the problem domain exhibits parallelism – in particular, where multiple pieces of data can be processed in parallel. Parallelism is not strictly necessary – there have been accelerators designed to reduce the latency of inherently serial tasks – but in most cases, accelerators do exploit the availability of parallelism in the domain in a way that would be hard to do within a single general-purpose processor core.
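As a rough software illustration (my own, using NumPy on a CPU rather than a real accelerator), the kind of parallelism meant here is element-wise work over large arrays, where every element can be processed independently:

```python
import numpy as np

# One million independent elements: ideal material for wide vector units,
# many small cores, or a systolic array, since no element depends on another.
data = np.random.rand(1_000_000).astype(np.float32)

# The whole transformation is expressed as one array operation, leaving
# the "how parallel" decision to the underlying implementation/hardware.
result = np.clip(data * 2.0 - 1.0, -0.5, 0.5)
print(result.shape, result.dtype)
```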

Updating the Algorithms

It should be noted that making the best use of the operations and datatypes available in an accelerator most likely requires reformulating algorithms and implementations that were tuned for general-purpose processors.

For example, accelerators typically have a different balance between the cost of doing compute operations and the cost of accessing memory. General-purpose-oriented code has traditionally tended to reduce operation counts, using memory (for example, precomputed tables) to avoid compute. Many accelerators work the other way around: it is worth doing many extra operations as long as they avoid going out to memory. The genomics article goes into some detail on this, which is well worth reading. The hardware design and software design are closely coupled, and a good design will only result from classic hardware-software co-design.
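A toy example of this rebalancing (my own sketch, not from the genomics article): the same function implemented in the classic lookup-table style versus a recompute-on-the-fly style that an accelerator with cheap arithmetic but expensive memory would favor.

```python
import math

# CPU-oriented style: trade memory for arithmetic with a precomputed table.
SINE_TABLE = [math.sin(2 * math.pi * i / 1024) for i in range(1024)]

def sin_lut(phase: float) -> float:
    return SINE_TABLE[int(phase * 1024) & 1023]   # one memory access, almost no math

# Accelerator-oriented style: spend extra arithmetic to avoid memory entirely.
def sin_poly(phase: float) -> float:
    x = (2 * math.pi * phase + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi)
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040                  # truncated Taylor series

print(sin_lut(0.1), sin_poly(0.1), math.sin(2 * math.pi * 0.1))
```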

The inherent parallelism of accelerators often requires quite a bit of data to be made available for processing at once in order to achieve efficient execution. This means a risk of increased latency in order to achieve increased throughput… where a general-purpose processor can process a single data unit just as well as multiple, an accelerator is often best used by collecting thousands of units into a large batch for processing at once. Most current accelerators tend to be better at high throughput for large amounts of data than at minimal latency for each data unit. You can obviously build one optimized for latency as well; it is just less common.

The data batch size is a balancing act for interactive use cases. The Google article talks about their tradeoffs between neural network training and inference batch sizes and latency. For efficiency, they need to run with rather large batch sizes. However, these large batches must still meet the latencies required by the applications – in this case, inference that is run as part of Google’s online services and therefore has humans waiting for results at the other end. Thanks to the constant flood of user requests that hit Google’s services, they are able to do both.
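A minimal sketch of such a batching loop (my own illustration with made-up numbers, not Google's serving code): requests are collected until either the batch is full or a latency budget runs out, trading a bounded amount of latency for much better utilization.

```python
import time
from collections import deque

MAX_BATCH = 32        # hypothetical batch size the accelerator runs efficiently
MAX_WAIT_S = 0.010    # hypothetical latency budget: 10 ms

pending: deque = deque()   # filled by request-handling threads elsewhere

def serve_forever(run_on_accelerator):
    while True:
        batch, deadline = [], time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            if pending:
                batch.append(pending.popleft())
            else:
                time.sleep(0.0005)     # briefly wait for more requests to arrive
        if batch:
            run_on_accelerator(batch)  # one large, efficient launch
```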

Memory, How Annoying

Creating an accelerator today would be a lot more straightforward if there wasn’t the problem of how to deal with memory. Ideally, an accelerator would use only on-chip memory, which is fast to access and has mostly predictable latency. However, as soon as it is necessary to go to external DRAM, the cost in energy and time of memory accesses goes up by orders of magnitude. Not to mention the need to put a large memory controller onto the accelerator chip.

This means that most accelerator designs end up with software-managed on-chip fast scratchpad memories, putting some responsibility on the programmer or compiler to manage this memory efficiently. The size of memory that an accelerator can access can also be a limiting factor for applicability – if your design happens to be capped at a certain problem size, its value becomes zero as soon as the problem grows beyond that size. General-purpose processors tend to be better at scaling up the size of memory (at the obvious costs of bigger chips and greater complexity). Often, these issues can be mitigated by splitting a problem into appropriately sized chunks – going back to the need to reformulate algorithms to fit the accelerator.
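In code, the chunking pattern looks roughly like the sketch below (sizes and names are made up for illustration): the input is streamed through a fixed-size fast memory, one tile at a time.

```python
SCRATCHPAD_BYTES = 256 * 1024               # hypothetical 256 KiB on-chip scratchpad
ELEMENT_BYTES = 4                           # float32 elements
CHUNK = SCRATCHPAD_BYTES // ELEMENT_BYTES   # elements per tile

def process_in_chunks(data, kernel):
    """Run 'kernel' over an arbitrarily large input, one scratchpad-sized tile at a time."""
    results = []
    for start in range(0, len(data), CHUNK):
        tile = data[start:start + CHUNK]    # conceptually: DMA the tile into the scratchpad
        results.extend(kernel(tile))        # compute entirely out of fast on-chip memory
    return results

# Example: a trivial "kernel" that squares each element.
print(process_in_chunks(list(range(10)), lambda tile: [x * x for x in tile])[:5])
```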

Domain-Specific Accelerator Programming

The programming environment is key to making an accelerator actually useful. The accelerator programs do not necessarily look like the sequences of instructions in memory used by a standard processor; they can be settings in configuration registers, linked descriptor tables, or something else entirely… Fundamentally, something is needed to tell the accelerator what to do, and that something needs to be accessible to the programmers. If programming is too hard, the accelerator will not be very successful in the market, no matter how powerful its hardware.
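As a purely hypothetical illustration of programming without an instruction stream (all names here are invented, not taken from any real device), an accelerator could be driven by a chain of linked descriptors that the hardware walks on its own:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    operation: str                          # e.g. "copy", "filter", "reduce"
    src: int                                # source buffer address
    length: int                             # number of elements to process
    next: Optional["Descriptor"] = None     # link to the next descriptor, or None to stop

def run_descriptor_chain(head: Descriptor):
    """Stand-in for the hardware walking the descriptor chain."""
    d = head
    while d is not None:
        print(f"execute {d.operation} on {d.length} elements at {d.src:#x}")
        d = d.next

run_descriptor_chain(Descriptor("copy", 0x1000, 4096, Descriptor("reduce", 0x2000, 4096)))
```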

It often makes sense to rely on a domain-specific programming environment. Allowing users to express the computations to be performed in a way that fits the domain makes it easier to fit the programs to the accelerator, as well as to express problems as programs. A general-purpose language like C can sometimes be used… but such programming is typically more convoluted than using domain-specific expressions.

When building a new accelerator, I would argue that the choice and provision of a programming environment is extremely important. To ease adoption and increase the chance of success, an accelerator that enters an existing market (like AI or ML) should make use of an existing programming language, API, or framework. The Google TPU followed this course, being built to run TensorFlow code – which is both domain-specific and in widespread use, providing a perfect entry point for a new accelerator with quite a bit of code ready to run.
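A tiny TensorFlow example (my own, not from the TPU article) shows why this is such a convenient entry point: the model code never mentions the hardware, so the framework and its compiler are free to target a CPU, a GPU, or a TPU.

```python
import tensorflow as tf

@tf.function                      # traced and compiled by the framework
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
print(dense_layer(x, w, b).shape)   # (8, 64), on whatever device TensorFlow placed it
```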

Of course… maybe some adjustments are needed to run well on a particular accelerator. But for higher-abstraction-level programming like this, it seems that the compilation chain can do a lot to make code run well on a new architecture without requiring major changes to the algorithm code. The rebalancing of algorithms to work well on accelerators can be done in the lower layers of the software stack where the kernel implementations are often provided by the accelerator designers, providing a high degree of performance for rather portable code.

The HARVEST system had a unique and bespoke programming language called ALPHA (for Alphabet, it seems). This was designed by the users at the NSA, to allow the expression of the kinds of operations that they needed for their crypto and analysis work. The implementation was then worked out together with IBM. It should be noted that according to Frances Allen, the IBM people were never told what the system was supposed to do. Today, hiding that kind of basic aspect from the engineers would be considered a rather bad idea, since the more you know about what the users are doing, the better solutions you can build for them.

TPU vs GPU

What is an appropriate comparison point? When designing an accelerator, it is important to know what the alternatives are. It is typically easy enough to show that a given design is better than a general-purpose processor on some specific task. But what if there is some other accelerator already in place in the market? 

The Google TPU article tackles this head-on. For the TPU, the competition is not really general-purpose processor cores, but rather Graphics Processing Units (GPUs). If you do not have a TPU, GPUs are one of the most common ways to accelerate machine-learning workloads. The TPU is a lot more specialized than a GPU – GPUs have honestly become quite multi-purpose in recent years. The results are as expected: the TPU is a lot better at what it does than the GPU.

One thing that the Google article dodges very elegantly is the actual cost of their chips. Instead of talking about per-chip or per-unit costs, they measure the cost of renting compute in the cloud. This is relevant to users, definitely, but it is also a nice way to obscure the underlying costs by the fuzziness in how rental costs relate to actual hardware costs.

Google shows an interesting example of building a “TPU Supercomputer”. They are not content to just use a single TPU alongside a general-purpose processor to tackle some smaller workload. Instead, with the TPUv3, they have designed a solution that can scale up to 1024 TPUs (serviced by 128 general-purpose processors) and that can produce theoretical floating point operations per second (FLOPS) numbers that compare well to the leaders in the Top500 supercomputer ranking. Provided you want to do 16-bit and 32-bit math to train neural networks.

It should be noted that the field of AI and ML accelerators is getting rather complex and full of different solutions. For some jobs, a general-purpose processor with some well-chosen instructions (like Intel VNNI) is good enough or even better than accelerators. For inference workloads, there are probably dozens of different accelerators big and small that purport to run them at various points of performance and power efficiency. Even for training workloads for neural networks, the TPU and GPU compete with multiple other accelerator designs.

Note on Accelerator History

About ten years ago, accelerators started to become really important in the embedded space, and I wrote quite a few blog posts about the topic. There was a trend towards networking processors with serious on-board acceleration power (like the Freescale P4080), even if standard PCs and servers at the time were mostly just general-purpose processors. Automotive and other embedded processors already featured a lot of domain-specific acceleration outside the main processor cores.

Today, all new hardware designs for all markets (mobile, laptop, desktop, server, embedded) put a large emphasis on accelerators. The portion of a chip that is dedicated to standard general-purpose processing cores is much smaller than a decade ago, with both IO and accelerators taking up much more room. Not to mention the graphics units, which are typically the biggest block on systems-on-chip today.

It appears that right now, we have found quite a few domains where domain-specific accelerators make sense. Graphics, image processing, audio processing, machine learning/artificial intelligence (really just convolutional neural networks for the most part)… even in end-user machines. A decade ago, it was mostly about networking and other infrastructure. I guess the proliferation of accelerators is also why we never got to the 100-core laptops we expected 15 years ago. All the parallelism got eaten by the accelerators. Instead of a 100-general-purpose-core laptop, we get something with 4 general-purpose cores and thousands of specialized ones.

Taking a long view, it seems like this is another field where the 1970 rule holds: IBM did it before 1970. The HARVEST machine and programming environment are very different from what we have today (the whole machine contained less than 500k transistors, which will barely get you a single functional unit these days), but all the principles were there: built for a specific domain, featuring a specific programming environment, and offering orders-of-magnitude acceleration of a few critical tasks compared to the general-purpose processors of the day. It required a control processor to run it, as it was not capable of booting or running on its own, and there was a special runtime system to manage the hardware resources.
