In the March/April 2008 issue of ACM Queue, there is an article on GPU programming by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs in the most coherent and understandable way that I have seen so far. The PC GPU offers a model for programming parallel hardware that might be a good pattern for other areas of processing: programmers do not have to write explicitly parallel code; the machinery and hardware take care of ensuring parallel behavior, as long as the code follows the assumptions made in the model.
The fundamentals of the GPU model are the following:
- It presents a fixed pipeline of stages through which data is streamed and transformed into final output.
- Some stages are fixed-function (with some parameters), some are fully programmable.
- The programmable stages are written in a local-state, simple input-to-output transformation style, with no access to global variables and no way to affect other computations. In this respect, the model is similar to DSP programming with its DMA-in, compute, DMA-out style, if a bit more automated.
- There is shared global state — but it is read-only, for parameters (textures, etc.)
- Parallelism is present in two dimensions: each stage operates on lots of data in parallel, and all stages execute concurrently.
- The really tricky transformations of the input data stream, those that involve dependencies between data items, are encapsulated inside the fixed-function stages. In essence, this lets a few experts take care of the hard part of parallel programming and presents everyone else with a streamlined, simple model.
- It is possible for users to destroy performance with badly written programs, but the typical use case and the hardware design rest on users doing sensible things within a fairly narrow domain.
- Code is compiled into bytecode, which is then translated and optimized for the particular GPU by the driver on the PC that runs the application. This two-stage just-in-time compilation (or dynamic recompilation, or whatever we want to call it) technique is a known good way to combine performance with portability.
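To make the model above concrete, here is a minimal sketch in Python. All the names (vertex_shader, rasterize, UNIFORMS) are illustrative stand-ins, not a real GPU API: the programmable stage is a pure per-item function that may only read shared state, the fixed-function stage handles the cross-item work, and the "framework" runs the programmable stage in parallel without the user writing any threading code.

```python
from concurrent.futures import ThreadPoolExecutor

# Read-only shared state, analogous to textures and shader parameters.
UNIFORMS = {"scale": 2.0, "offset": 1.0}

def vertex_shader(v):
    """Programmable stage: a pure input-to-output transform.
    It may read UNIFORMS but never writes shared state."""
    return v * UNIFORMS["scale"] + UNIFORMS["offset"]

def rasterize(vertices):
    """Fixed-function stage: the framework-owned 'hard part' that involves
    dependencies between data items. Here, a trivial stand-in that pairs
    neighbouring vertices."""
    return list(zip(vertices, vertices[1:]))

def run_pipeline(data):
    # The framework parallelizes the programmable stage automatically;
    # the user's code stays sequential and side-effect free.
    with ThreadPoolExecutor() as pool:
        transformed = list(pool.map(vertex_shader, data))
    return rasterize(transformed)

print(run_pipeline([0.0, 1.0, 2.0]))  # → [(1.0, 3.0), (3.0, 5.0)]
```

Because vertex_shader cannot see or mutate anything outside its input and the read-only parameters, the framework is free to run it over any number of items concurrently, which is exactly the assumption the GPU model exploits.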
This model is an interesting pattern because it has been extensively proven in practice. Large numbers of programmers write graphics programs, and they seem to have little trouble getting GPUs to run massively parallel computations. If you think about it, that is a pretty major success story! It also validates the idea, which keeps being brought up, that “no shared state” does simplify programming. The model above is quite similar to what you have in Erlang/OTP, for example.
The lesson that can be drawn from this for other domains is likely that you need to create a framework (both in concept and in implementation) for processing that makes the code users write simple, single-threaded, and straightforward. The framework then automatically runs lots of little sequential snippets in parallel, and takes care of resource scheduling and the data flow.
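The second dimension of parallelism mentioned earlier, all stages executing concurrently, can be sketched the same way. This is not any real framework, just an illustration of the division of labor: users supply plain sequential functions, and the framework chains them into concurrently running stages and takes care of the data flow between them.

```python
import queue
import threading

_SENTINEL = object()  # framework-internal end-of-stream marker

def _stage(fn, inbox, outbox):
    # Framework-side loop: pull an item, apply the user's sequential
    # snippet, push the result downstream.
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            outbox.put(_SENTINEL)
            return
        outbox.put(fn(item))

def run(stages, items):
    """Run each user function as its own concurrent pipeline stage."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=_stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for item in items:           # stream the input data through
        qs[0].put(item)
    qs[0].put(_SENTINEL)
    results = []
    while (out := qs[-1].get()) is not _SENTINEL:
        results.append(out)
    for t in threads:
        t.join()
    return results

# User code: simple, single-threaded, straightforward.
print(run([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))  # → [4, 6, 8]
```

The user never touches a thread or a queue; the framework owns the scheduling and the plumbing, which is the division of labor the GPU model demonstrates.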
Obviously, there are domains where this seems harder to do than in others, but I think this is the pattern for the future. Unless most parallel applications are “easy” to create, we will not make much use of parallelism. And I think that this can usually be the case.
In particular, I can see this kind of framework being quite possible for things like packet processing in various network transforms like firewalls, routing, switching, and virus scanning. A smart hardware manufacturer in the networking market should provide these frameworks, just like the graphics chip providers are doing today.
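As a hypothetical illustration of how the same pattern might look for packets: each transform is a pure per-packet function reading only shared read-only tables, so a framework could fan the work out over many packets in parallel (here it is just a plain map). The table contents and field names are invented for the example.

```python
# Read-only shared state, analogous to a firewall's block list.
BLOCKED = {"10.0.0.9"}

def firewall(pkt):
    """Drop packets from blocked sources; pure per-packet decision."""
    return None if pkt["src"] in BLOCKED else pkt

def route(pkt):
    """Pick an output interface. A trivial stand-in for a real
    longest-prefix-match lookup."""
    pkt["out_if"] = "eth1" if pkt["dst"].startswith("10.") else "eth0"
    return pkt

def process(packets):
    # Each stage is an independent per-packet transform, so a real
    # framework could parallelize both maps across packets.
    surviving = [p for p in map(firewall, packets) if p is not None]
    return list(map(route, surviving))

pkts = [{"src": "10.0.0.9", "dst": "10.1.2.3"},
        {"src": "192.0.2.1", "dst": "10.1.2.3"}]
print(process(pkts))
```

The firewall, routing, and scanning transforms each fit the "simple local transform, no shared mutable state" mold, which is what would let such a framework scale them across parallel hardware.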
Finally, here is a nice-looking link to the article, as generated by ACM’s online magazine publishing system: