My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.
The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that “everyone has anyway” — i.e., something that is found on any PC. Outside of PCs, such powerful GPUs are pretty non-existent. Then, the GPU companies picked up on this idea and are making their GPUs more applicable to general purpose tasks.
But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPs per chip. Mostly, this seems to be because the GPU has many times the number of processing units. Something like 1000s of them, rather than maybe 10 in a general purpose unit.
How can all of these fit on a die that is similar in size to the general-purpose processor? As always when you see a disparity like this, it stems from optimization for different target uses leading to different architectures.
The reasons for GPU raw performance seem to be three-fold:
- Each processor is much simpler, with a simple instruction set and no out-of-order execution, speculation, or other complex logic. This makes it possible to fit many more cores into the same die area, at the cost of more complicated programming: programs run on groups of processors and come with lots of little constraints.
- There is far less cache on the die, which forces programs to rely on memory bandwidth and on streaming data through the processor.
- Processors are built to be good at repetitive math, and very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.
- And I guess you can add a fourth parameter: power consumption and heat are not really a big problem. Water cooling, huge fans, and 300W power draws are OK…
What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. The data also needs to be wide enough to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intensive code is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel. And you are looking to solve one large problem at a time.
If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in its instruction set, very efficient support for loops, and is often SIMD. But DSPs very rarely operate like vector processors. They are also general enough to run a rudimentary OS and operate semi-independently from the main processor. Also, DSPs tend to be used in large multicore clusters, but there each DSP operates on its own problem. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of sync with each other. DSPs also tend to have much simpler programming models than GPGPUs, even if they can be painful compared to general-purpose processors.
So GPGPUs are quite different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replacing general DSPs any day soon.
Note that something like the Tilera chip is another intermediate point between a multicore DSP and a GPU. There seems to be a long continuum of core counts, from around 4 to 8 for DSPs, to around 100 for Tilera, to 1000 for GPUs…