The 2012 edition of the SiCS Multicore Day was fun, as these events have always been. I missed it in 2010 and 2011, but made it back this year. It was interesting to see that the points on which the keynote speakers disagreed were similar to previous years, albeit with some new twists. There was also a trend in architecture, moving crypto operations into the core processor ISA, that offers another angle on the hardware accelerator space.
Five years have passed since the first SiCS Multicore Day in 2007 (making this the sixth event), and in the introduction Erik Hagersten looked back at some of the predictions made back then. One missed prediction stood out clearly: the idea that by now, 128 cores would be mainstream in personal computers. My theory of why this has not happened is simple. GPGPU. GPUs have eaten up the easy parallelism. Instead of using massively multicore regular processors, heavy-duty personal computing has been shifted onto GPUs. With the disappearance of these workloads, there has been little pressure on main processors to become more parallel, as there would not be much to gain performance-wise. GPUs have turned out to be perfect for massively data-parallel work in media and other areas (including tasks like cracking password hashes and mining for bitcoins), achieving performance orders of magnitude higher than what could be hoped for with a multicore main processor – while costing less and using comparatively little power.
The prevalence of GPGPU on the desktop is not mirrored in the top supercomputers, however. According to Erik Hagersten, there is no real GPGPU machine in the Top500 supercomputer list at the moment; maybe 5% of the performance and 3% of the chips are GPUs. I suspect part of this has to do with the kinds of tasks being done. High-end HPC probably requires more flexibility and programmer control than GPUs can offer.
Programmability might be more important in architectural design for HPC, as HPC users tend to be programmers. Most regular computer users, on the other hand, just use software written by someone else. Thus, it is enough that a few people go through the hard work of coding in CUDA or OpenCL or similar toolkits, and the results of their work can be spread across a very large user base. GPGPUs are perfect for providing “performance for the rest of us”: common tasks coded by a few expert programmers.
The debate over GPGPU is part of a bigger debate about homogeneous vs heterogeneous compute systems (see previous blog posts like this, this, this, and this). The debate is still going on, with the same intensity as it always has. To me, that would seem to indicate that hardware accelerators are here to stay, even if some people do not really like them.
This year, the primary example of the drive to homogeneity was Intel’s recently announced “more than 50 x86 cores on a chip” Knights Corner (Xeon Phi). The argument for the chip is very much programmability: “just a large x86 box that runs Linux”. But I suspect you do need special compilers or libraries to make use of the big, somewhat Cray-like, 512-bit SIMD vector unit that each core has been equipped with. At the very least, special optimization will be needed to make the best use of the chip, just as you always need to do when performance matters.
The UltraSparc T5 presented by Rich Hetherington from Oracle fell somewhere in between. It has 16 identical cores, but can tweak how it uses the SMT threading to make a core run a certain serial task faster than it otherwise would. This is a step towards the kind of heterogeneous performance within a single ISA that ARM is going after with its Cortex-A15/A7 big.LITTLE approach – but without the same span in performance, and also with less impact on the overall flexibility of the chip. The T5 also removed the special crypto accelerator hardware that used to be there, instead adding a few crypto instructions to the ISA.
The reason they moved crypto from an accelerator into the ISA was that it turned out to be costly to use a separate hardware unit for small pieces of data. There is OS overhead in invoking an accelerator, and that requires a decent-sized buffer of data to work on. With instructions in the ISA, you can work on a single word and still get performance gains. User-level software also has a far easier time accessing it, as the instructions are just part of the regular instruction stream. Interestingly, ARM (as presented by Stephen Hill) had done the exact same thing for crypto, for the same reason. This is an important point for hardware accelerators in general: the driver overhead has to be managed, sometimes by mapping hardware straight to individual programs (I made a simple experiment a few years ago that showed this nicely). On the other hand, everything put into the ISA risks making the entire processor a bit slower and more power-hungry, so general ISA extensions are something to be done with great care. Hardware accelerators can be removed from a particular SoC if they turn out not to be needed; that is not so easy with ISA components.
Stephen Hill from ARM clearly believed in heterogeneity, with four types of processing on a typical chip:
- Big core (ARM Cortex-A15 today)
- Little core (ARM Cortex-A7) – to create the kind of big.LITTLE setup that allows for a bigger span of power-performance settings.
- GPU (from ARM, that means Mali T604 today) – they clearly see that GPGPU is moving into the mobile space very quickly, doing the same kind of work that it has done on the desktop, and with the same effect of reducing the need for general processor cores.
- Special-purpose accelerators – except when merged into the ISA, as noted above.
In researching some of the material from James Larus’ talk, I also came across an interesting talk from Surge 2011 in which Artur Bergman of fastly.com tells how they optimized their content delivery network by relying only on plain processors, without any network-processing offload, router ASICs, etc. Too hard to use, too easy to make errors and have the software crash, and “Xeons are simply faster”. Note that the word “energy” is never mentioned in his talk.
Software Needs to be 100x Better
The software perspective was presented by James Larus from Microsoft Research. His talk made many interesting points, but I think the main ones were these:
We are not even trying to make efficient systems today, throwing away billions of clock cycles on plain pure overhead. Example: IBM had investigated the conversion of a SOAP (text) date to a Java date object in the IBM Trade benchmark: 268 function calls and 70 object allocations. There is great modularity and nothing obviously wrong in the code. About 20% of memory is used to hold actual data; the rest is hash tables, object-management overhead, and the like. In general, objects are small and waste is large. Great for programmers, bad for machines. We could and should find ways to do better in programming than this, and we need to find a way to make performance an abstraction we can work with.
Note that Larus is not advocating going back to assembly language – there is far too much value in programmer productivity – but just that we remove unnecessary waste from our systems while advancing the state of the art in programming languages.
Distributed systems are the new norm. Why don’t we teach them? All programmers need to understand how to build systems from many separate parts, and in particular the impact of IO and network traffic on software performance. Distribution is not free either.
For an example of how bad things can be, he brought up a nice introspective talk from Surge 2011 about the Etsy website.
So, that’s my summary of an interesting day.