At the ISCA 2014 conference (the biggest event in computer architecture research), a group of researchers from Microsoft Research presented a paper on their Catapult system. The full title of the paper is “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services”, and it is about using FPGAs to accelerate search engine queries at datacenter scale. It has 23 authors, which is probably the most I have ever seen on an interesting paper. There is much to learn from and discuss in this paper, and here are my thoughts on it.
My first impression is the crazy scale of the experiment. Where most research is done using a simulator or a few physical prototypes, this group deployed experimental boards into a system consisting of 1632 servers. If I read the paper right, that means 1632 prototypes. Insane, and something only a giant like Microsoft can afford to do – no academic institution could ever reach that level. You also need a major datacenter actually doing things for this to make sense, and even in this day of large web systems, there are not that many of those around. Good job, but unfortunately a real rarity. This is modestly described as a “medium scale” deployment in the paper. They built real hardware boards, custom for deployment as additions to the datacenter servers used. These boards had to fit in the power, cooling, and volume envelopes of tightly packed servers.
I have an admitted fondness for hardware acceleration as an efficient way to get things done, and this experiment definitely shows that doing things in hardware beats software when the task fits in hardware. Even using FPGAs rather than ASICs, they achieved a near-doubling of performance at a 10% increase in energy consumption and a 30% increase in total cost of ownership. Pretty clear that in this case, hardware acceleration is profitable. It is worth noting that the team found FPGAs better than GPUs for the types of low-latency work considered: GPUs would both have latency issues and consume much more power than FPGAs. Both have their place, but the right solution does depend on the application.
The reconfigurability of an FPGA is a feature in its own right in this kind of setting, since the FPGA board attached to each server can be used for many different tasks. Previously, we have seen FPGAs being used to power more high-performance-computing-style workloads, but here it is more in the domain of databases. The way the team proceeded to make this power accessible is a bit interesting, as they dedicated about a quarter of the FPGA area to a “shell” that provides standard fixed services. This shell is then used by the rest of the FPGA that actually does the application-specific computing. This is one part of bringing FPGA programming complexity down towards something that is accessible to “ordinary programmers”. Still, the research paper used Verilog to program the application, partitioning it across seven FPGAs. In theory, it would be possible to switch different kernels in and out over time, but this seems not to have been necessary here.
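The shell/role split is essentially an interface contract: infrastructure services stay fixed, and only the application logic is swapped. A toy sketch of the idea in Python (all names hypothetical, and of course the real thing is Verilog on an FPGA, not software):

```python
from abc import ABC, abstractmethod

class Shell:
    """Fixed infrastructure services: memory access, host DMA,
    inter-FPGA links. Hypothetical names, toy in-memory model."""
    def __init__(self):
        self.dram = {}

    def read_dram(self, addr):
        return self.dram.get(addr, 0)

    def write_dram(self, addr, value):
        self.dram[addr] = value

class Role(ABC):
    """Application-specific logic; it only touches the board
    through the shell's services, never the raw pins."""
    def __init__(self, shell):
        self.shell = shell

    @abstractmethod
    def handle(self, request): ...

class DoublerRole(Role):
    """Toy role: doubles a value stored in 'DRAM'."""
    def handle(self, addr):
        return 2 * self.shell.read_dram(addr)

shell = Shell()
shell.write_dram(0x10, 21)
role = DoublerRole(shell)
print(role.handle(0x10))  # → 42
```

The point of the pattern is that a new role can be dropped in without re-validating the infrastructure logic, which is exactly what makes the FPGA reusable across tasks.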
A significant part of the Bing ranking engine was put on top of the FPGAs, including feature computation (such as the number of times a word is found in a document) and the machine learning algorithm that does ranking. That is pretty far from your typical numerical computation. Each request to the FPGA is a (document, query) pair, producing a single floating-point score. The requests are compressed to fit inside the 64 kB communication windows employed – the FPGAs really operate very autonomously, with their own memory and communications between them. This is a very capable accelerator! The paper has many more details on how this was implemented and how the algorithms were mapped.
It is worth noting that they also supported a change in the evaluation model used by the algorithms configured into the FPGAs by means of reloading the contents of the on-FPGA RAM banks (a few megabytes per FPGA). This process was orders of magnitude faster than a full reconfiguration, but still slow enough that it was necessary to group requests using different model parameters to minimize the overhead of changes. This does point to a common design pattern, where it is faster to change parameters to an algorithm stored in regular memory than to change the algorithm itself.
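This pattern – group requests that share expensive-to-swap state, so the swap happens once per group rather than once per request – can be sketched in a few lines of Python (a hypothetical illustration, not the team's actual scheduler):

```python
from itertools import groupby

# Each request names the model it needs; reloading a model into the
# on-FPGA RAM banks is assumed to be the expensive step.
requests = [("q1", "modelA"), ("q2", "modelB"), ("q3", "modelA"),
            ("q4", "modelB"), ("q5", "modelA")]

def process_grouped(requests):
    swaps = 0
    # Sort so requests for the same model are adjacent, then pay
    # for a single model reload per group.
    by_model = sorted(requests, key=lambda r: r[1])
    for model, group in groupby(by_model, key=lambda r: r[1]):
        swaps += 1              # one reload of the RAM banks per group
        for query, _ in group:
            pass                # score query against the loaded model
    return swaps

print(process_grouped(requests))  # → 2 swaps, versus 5 in arrival order
```

In arrival order the model would have to be swapped on every request (A, B, A, B, A); grouping cuts that to one swap per distinct model.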
Finally, they also designed their own custom multithreaded processor core, and ran it on the FPGA. They claim to fit 60 of these cores into a single FPGA, with each core having four threads for latency-hiding. Six cores shared the complex floating-point unit, in a Niagara-1-style architecture. This demonstrates how FPGAs can be used to keep the “custom core” design pattern alive for a long time. They did look at standard soft cores, but those were not at all suitable for their kinds of workloads. Thus, a custom core was the better solution.
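The latency-hiding idea behind such a core is simple: each cycle, issue from whichever thread is ready, skipping threads that are waiting on a long-latency unit like the shared FPU. A toy software illustration (hypothetical numbers, nothing like the actual hardware):

```python
def run(threads, cycles):
    """Round-robin issue across hardware threads. A stalled thread
    (here: pretending every 4th instruction hits the shared FPU and
    stalls for a few cycles) is skipped, so the core keeps issuing."""
    issued = 0
    n = len(threads)
    ptr = 0
    for _ in range(cycles):
        for t in threads:               # stalls drain one per cycle
            if t["stall"] > 0:
                t["stall"] -= 1
        for i in range(n):              # find the next ready thread
            t = threads[(ptr + i) % n]
            if t["stall"] == 0:
                issued += 1
                if issued % 4 == 0:     # toy model: every 4th op
                    t["stall"] = 3      # is a slow FPU op
                ptr = (ptr + i + 1) % n
                break
    return issued

one = run([{"stall": 0}], 40)
four = run([{"stall": 0} for _ in range(4)], 40)
print(one, four)  # four threads issue more in the same cycle budget
```

A single thread loses cycles to every FPU stall; with four threads the core almost always finds something ready to issue, which is the whole point of the design.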
Doing research at this scale also necessitated attention to fault tolerance. With thousands of parts in play, and many connections per part, it is inevitable that some failures will occur. There will be faulty hardware that never works, there will be drop-outs during computation, and there will be communications failures. All of this had to be handled – assuming ideal conditions would have resulted in a pretty dead system. RAM had to be protected with ECC, something that is getting increasingly common. It is worth noting that many Intel parts dedicated to the “embedded” market add ECC to the desktop parts that they are derived from, and that ECC is mandatory in servers. It is really only in single-person interactive computers that ECC is still avoidable!
The main error source reported in the paper was the reconfiguration of FPGAs. During reconfiguration, an FPGA could accidentally generate PCIe errors towards the host. It could send bogus messages to neighbors in the FPGA fabric, and there could be a mix of new and old FPGA configurations talking to each other, confusing algorithms and corrupting state. Solutions were implemented to work around all of those, essentially keeping FPGAs isolated during their reconfiguration and only bringing them back into communication once they were up and stable. That is the kind of thing that you do not learn without actually doing things on real hardware.
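The workaround amounts to a small state machine: quarantine the FPGA so neighbors ignore it, reconfigure, and rejoin the fabric only after a health check. A hypothetical sketch of that protocol (not the paper's actual implementation):

```python
from enum import Enum, auto

class State(Enum):
    ACTIVE = auto()
    QUARANTINED = auto()
    RECONFIGURING = auto()

class FabricNode:
    """Toy model of an FPGA in the fabric: traffic from a node that
    is not ACTIVE is dropped by its neighbors (hypothetical rule)."""
    def __init__(self, name):
        self.name = name
        self.state = State.ACTIVE
        self.image = None

    def reconfigure(self, new_image):
        self.state = State.QUARANTINED    # neighbors start ignoring us
        self.state = State.RECONFIGURING  # load the new bitstream
        self.image = new_image
        if self.health_check():
            self.state = State.ACTIVE     # only now rejoin the fabric

    def health_check(self):
        return True  # placeholder: verify links and configuration

    def accept_message(self, sender):
        # Drop traffic from nodes mid-reconfiguration, so a mix of
        # old and new configurations never corrupts shared state.
        return sender.state is State.ACTIVE
```

The key property is that a half-configured node can never inject messages into the fabric, no matter how garbled its outputs are while the bitstream loads.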
It was also necessary to consider debugging. It is impossible to look inside the FPGAs to see what is going on directly, since there is no way to attach JTAG units to each and every FPGA. Instead the researchers rely on a “flight data recorder” that continuously records information into an on-chip memory, which is then exported to the host during health checks. This is essentially log-based debugging, which is the only reasonable approach at this scale. Even with a fairly small buffer, the researchers were able to debug problems that only showed up in the full-scale deployment. Being able to see what went on just prior to a crash is powerful in all circumstances and systems, and this was no exception.
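A flight data recorder is at heart a fixed-size ring buffer that always holds the last N events, overwriting the oldest. A minimal sketch in Python (hypothetical, standing in for on-chip trace logic):

```python
from collections import deque

class FlightRecorder:
    """Keeps only the most recent `capacity` events, like a small
    on-chip trace buffer that is dumped out during health checks."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def record(self, event):
        self.buf.append(event)  # oldest entry silently overwritten

    def dump(self):
        return list(self.buf)   # what the host sees after a crash

rec = FlightRecorder(capacity=4)
for i in range(10):
    rec.record(f"event-{i}")
print(rec.dump())  # → ['event-6', 'event-7', 'event-8', 'event-9']
```

The overwrite-oldest policy is what makes this workable with tiny on-chip memories: recording costs the same no matter how long the system runs, and the tail of the log is exactly the part you want after a crash.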
There is much more to be learned from this paper, so go read it!