My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.
The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that “everyone has anyway”, i.e., something that is found in any PC. Outside of PCs, such powerful GPUs are practically non-existent. The GPU companies then picked up on this idea and are now making their GPUs more applicable to general-purpose tasks.
But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPS per chip. Mostly, this seems to be because the GPU has many times the number of processing units: something like thousands of them, rather than maybe ten in a general-purpose processor.
How can all of these fit on a die that is similar in size to that of a general-purpose processor? As always when you see a disparity like this, it stems from optimization for different target uses leading to different architectures.
The reasons for the raw performance of GPUs seem to be threefold:
- Each processor is much simpler, with a simple instruction set and no out-of-order execution, speculation, or other complex logic. This makes it possible to fit more cores into the same area. The price is that programming is more complicated, as programs run on groups of processors and are subject to lots of little constraints.
- There is far less cache on the die, which forces programs to rely on memory bandwidth and on streaming data through the processor.
- Processors are built to be good at repetitive math, and very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.
- And I guess you can add a fourth parameter: power consumption and heat are not really a big problem. Water cooling, huge fans, and 300W power draws are OK…
What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. Data also needs to be wide to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intensive code is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel, and that you are looking to solve one large problem at a time.
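To make the distinction concrete, here is a minimal sketch in plain Python (not GPU code). Both functions are illustrative examples chosen for this note, not taken from any real workload: `saxpy` stands in for the "data-intensive" kind of code, and `collatz_steps` for the "integer decision-making" kind.

```python
def saxpy(a, xs, ys):
    """Data-parallel: each output element depends only on its own inputs,
    so all elements could in principle be computed at the same time."""
    return [a * x + y for x, y in zip(xs, ys)]


def collatz_steps(n):
    """Control-heavy: every iteration branches on the previous result,
    so the steps cannot be spread across a wide vector of cores."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(saxpy(2.0, [1.0, 2.0], [3.0, 4.0]))
print(collatz_steps(6))
```

The first function maps naturally onto a wide group of simple cores, one element per core. The second is a single chain of dependent decisions, which is where a big out-of-order core with good branch prediction earns its area.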
If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in the instruction set, very efficient support for loops, and is often SIMD. But DSPs very rarely operate like vector processors. They are also general enough to run a rudimentary OS and operate semi-independently from the main processor. DSPs also tend to be used in large multicore clusters, but there each DSP works on a different problem. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of sync with each other. Finally, DSPs tend to have much simpler programming models than GPGPUs, even if they can be painful compared to general-purpose processors.
So GPGPUs are quite different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replacing general DSPs any time soon.
Note that something like the Tilera chip is another intermediate point between multicore DSP and a GPU. There seems to be a long continuum of core counts from around 4 to 8 for DSP to around 100 for Tilera to 1000 for GPUs…
Hmm I guess I’m partly responsible for this post… 🙂
I agree with you, at least to some extent.
Yes, the GPU could be seen as the rebirth of the vector processor. Yes, the PEs are simple, scalar in-order-issue machines with less cache (but loads of registers and PE-shared mem – think SW-controlled cache). Yes, they are currently mainly for big things connected to a good PSU and can therefore be power hogs.
All those things are true. Right now.
In about a decade the PEs have gone from specialized instances of datapaths with a few control plane adaptations to complete 32-bit RISC engines. This trend has largely been driven by MS and the development of DirectX as well as the OpenGL standard evolving with requirements on ever growing flexibility in the datapath.
The Nvidia vs AMD/ATI arms race is still going on, building ever bigger and hungrier GPUs for all those gami^D^D^D^D scientific applications. But while that is happening and gets a lot of press and steam, these players as well as a lot of niche players have actively started to target the embedded/low-power space. Today you can get devices as well as IP cores that include dynamic (as in data-dependent as well as control-plane adjustable) power control. In terms of W/FLOP for general computing they are not a match for CPUs or DSPs, but I’d venture to say that they are getting closer. The devices we are talking about here do not have 1000+ PEs, but somewhere between a few PEs and the low hundreds.
I would not be surprised to see variants of GPUs in the near future where one or a few PEs are more equal than the others and are allowed to control the activity (power supply) of some of the others. Yes, this would require a more complicated instruction fetch and memory interface, but it should be doable. A mobile phone platform having three ARM11 cores and a GPU with four cores is pretty close to a GPU with eight cores.
Also, the thing about streaming. Since the ratio of processing power to bandwidth is so large, especially for big GPUs, you are actually not talking about streaming things through but about batch processing with double buffering. If you look at the Nvidia Tesla T10, Nvidia themselves suggest that it is better to burn a few more cycles than to shuffle data. When I worked on a GPGPU project we gained performance by going to a cubic transform rather than a quadratic transform, simply because the amount of data needed per unit of time was reduced and we could use some of that compute power sitting there idle.
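The trade-off described above (spend extra compute to avoid moving data) can be put into rough numbers with the roofline model. That model is not from this comment; it is a commonly used back-of-the-envelope bound, brought in here only as an illustration. A minimal sketch, where the peak and bandwidth figures are made-up round numbers and not the specs of the T10 or any other real GPU:

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Roofline bound: throughput is capped either by the compute units
    or by memory traffic, whichever limit is lower."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


# Hypothetical device: 1000 GFLOPS peak, 100 GB/s memory bandwidth.
PEAK = 1000.0
BANDWIDTH = 100.0

# A cheap kernel doing 2 FLOPs per byte fetched is starved by memory.
cheap = attainable_gflops(PEAK, BANDWIDTH, 2.0)
# A costlier kernel doing 20 FLOPs per byte keeps the compute units busy.
costly = attainable_gflops(PEAK, BANDWIDTH, 20.0)
print(cheap, costly)
```

Under these assumed numbers, a more expensive algorithm that needs less data per result can come out ahead, which matches the experience with the cubic versus quadratic transform.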
One key issue here is programmability. When DSPs are going multicore and are adding multiple MACs into the core datapath etc, how hard will they be to program efficiently? Compared to a GPU with CUDA or OpenCL?
As a computer arch geek I love looking at new designs from Tilera, BOPs, Excellera and all other startups as well as research architecture proposals. But at the end of the day the application needs to be mapped onto the instruction set and the programming model of the device. How well that works out is down to two things:
(1) How well do the instruction set and programming model match the application?
(2) How easy is it to write the code implementing the application mapping?
The first one is what you talk about, the difference between DSPs and GPUs and how well they match a set of applications. The second one is all about tools.
My point of view as an embedded engineer is that squeezing the performance out of DSPs (to get that MIPS/W) can be pretty hard, and quite a few engineers have neither the competence nor the time allowed in the project to do it. My observation is that when the DSP complexity goes up, this discrepancy increases.
And, to finish off this long text, if DSPs in real-world usage are getting less and less efficient, how far off are GPUs if they can (more) easily be programmed efficiently?
Aloha!
And to clarify the section about Tilera, Excellera etc: When I have evaluated NPUs, GPUs, DSPs and other more or less app-area specific architectures I often reflect on how cool, powerful and interesting the architecture is… and how lousy the provided toolchain needed to program the machine is.
A port of GCC is basically always provided (but sometimes not even that) and of course an assembler. But debuggers might be quite simplistic, and things like data flow, optimization and hotspot analysis are not that common.
The result is that yes, you can get your application onto the device, and it might even be pretty well debugged. But getting it really efficiently mapped onto the device might be hard even though your application might be in the application domain of the device.
I guess this is where Simics comes in. 😉
I think we are basically making the same argument, in terms of the end result being pretty much like today’s DSPs… and in particular the SPE engines of the Cell in terms of programming and use style. The Tesla use model sounds very much like the archetypal DSP DMA-in, process, DMA-out cycle.
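For readers who have not seen it, the DMA-in, process, DMA-out cycle with double buffering looks roughly like the sketch below. It is plain sequential Python, so nothing actually overlaps; on real hardware the load of the next batch would run concurrently with the processing of the current one. The function name `process_in_batches` and the `kernel` callback are hypothetical names made up for this illustration.

```python
def process_in_batches(data, batch_size, kernel):
    """Ping-pong between two buffers: one is being processed while the
    other is being filled with the next batch."""
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    results = []
    if not batches:
        return results

    buffers = [None, None]
    buffers[0] = batches[0]  # "DMA in" the first batch
    for i in range(len(batches)):
        current = i % 2
        following = (i + 1) % 2
        if i + 1 < len(batches):
            # Prefetch the next batch into the idle buffer.
            # (This is the step that would overlap with the kernel on real hardware.)
            buffers[following] = batches[i + 1]
        # Process the current buffer and "DMA out" the results.
        results.extend(kernel(buffers[current]))
    return results


squares = process_in_batches([1, 2, 3, 4, 5], 2, lambda batch: [x * x for x in batch])
print(squares)
```

The structure is the same whether the worker is a DSP, a Cell SPE, or a GPU: the code spends its effort keeping local memory fed, and the actual computation sits in the middle as a small kernel.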
That GPGPUs are easier to program is an interesting point. Certainly, desktop use tends to drive better tools. But that is a point not entirely lost on the DSP vendors, Tilera, and the others either… and the solution seems to be moving to model-based composition of algorithm elements. Which is essentially what OpenCL is also about, as well as graphical programming in LabView or MatLab or textual domain-specific languages.
In essence, the key advantage you see with the GPU is that it has a thicker software stack that hides a thornier architecture.
http://www.nvidia.com/fermi
Thanks! Also see http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932 for a readable analysis.
It seems that Nvidia is making this more and more like a vector processor, actually moving away from discrete graphics as their main market. But that also partially defeats the charm of GPGPU — to leverage for other purposes the steady march of GPU improvements driven by graphics for gamers. It is still not as easy to use as your average DSP, for example.
Yes, Nvidia is really taking the HPC market seriously, especially with Fermi. Fermi’s on-chip shared memory is quadrupled and can also be split into an L1 cache if desired. There is also a shared L2 cache.
This theoretically means that the GPU will be better at working on applications where one needs to do more random accesses to data.
One strength of the current Nvidia CUDA architecture is that it forces you to handle memory in an efficient way if you want good results. You’re not really allowed to write dumb code, which is just as well…
@Jakob, the Fermi lineup doesn’t necessarily mean that Nvidia is moving away from their main market. The GeForce series will probably not support ECC and other fancy HPC attributes and will still be very much focused on games. Let the gamers buy the GeForce series, while HPC geeks can buy the Quadro and Tesla series.
Btw, both Larrabee and the IBM Cell are dead/cancelled. They both realised they aren’t going to be able to compete with Nvidia / AMD/ATI in this area.