The Register has a pretty good report from the Supercomputing (SC) 2007 conference. Quite knowledgeable, and mostly about the thorny issue of programming massively parallel, fairly homogeneous machines like GPUs and floating-point accelerators. Of course, their commentary has to be commented on. Read on for more.
The following quote on programming for the ClearSpeed chips describes a solution that I find very attractive: use higher-level programming languages that describe the actual problem or computation to be performed, and then let a compiler/code generator take care of generating a suitable parallel implementation:
For one, it notes that a number of applications such as Matlab and Mathematica can run on the CSX600 chips without any changes to the underlying code thanks to work done by ClearSpeed and the software makers and the presence of friendly ClearSpeed libraries.
Using libraries is one solution suitable for certain classes of problems, and it can cover a fair amount of the supercomputing market, where the set of kernel algorithms in use tends to be fairly limited. I wouldn’t try using these tools to program a telecom switch, but that is not what the hardware is designed for either. The CSX600 chip contains 96 floating-point coprocessor cores, and there are solutions out there using massive numbers of the chips (144, according to The Register, for a total of 13824 cores). Scalability like that is pretty cool.
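To make the library approach concrete, here is a minimal sketch in plain C (my example, not from the article, and the accelerated library name in the build comment is hypothetical). The application codes against the standard BLAS interface; whether the call runs as an ordinary CPU loop or gets offloaded to an accelerator is decided by which BLAS implementation is linked in, so the application source stays unchanged:

```c
/* Library-based acceleration in a nutshell: the application calls the
 * standard BLAS interface, and the implementation is chosen at link time.
 * Build with, say:  cc matmul.c -lcblas         (reference CPU BLAS)
 * or:               cc matmul.c -lvendor_blas   (hypothetical accelerated drop-in)
 */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    enum { N = 512 };
    static double a[N * N], b[N * N], c[N * N];

    /* Fill the inputs with something simple. */
    for (int i = 0; i < N * N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* C = 1.0 * A * B + 0.0 * C. The same source runs unchanged whether
     * this symbol resolves to a CPU loop or an accelerator offload. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, a, N, b, N, 0.0, c, N);

    printf("c[0] = %f\n", c[0]); /* expect 2*N = 1024.0 */
    return 0;
}
```

This also shows why the approach only stretches so far: it works beautifully as long as your hot code is made of a small number of well-known kernels like this one.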
So it seems that by targeting a selected set of applications, ClearSpeed does manage to produce a decently programmable solution. One also has to presume that the value of the solution to the end users is great enough that the time spent programming is worthwhile.
Another example of a domain-limited solution with great power is what Acceleware is doing: take the hardware and software from Nvidia for using GPUs as accelerators, and build a solution for a particular problem domain on top of it. This makes it very easy for end users to pick up, since they essentially buy a package targeted at their problem, with all the hard parts already taken care of. In Acceleware’s case, the domain is electromagnetic simulation, and the customers are big companies like Nokia and Samsung. These end customers only need a month or so to incorporate the acceleration into their own programs. Everybody wins, and the value Acceleware adds is in letting a group of GPU-acceleration users share the cost of creating a common platform. A classic play in high-tech.
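As a purely hypothetical illustration (none of these names are Acceleware’s actual API), the appeal of such a packaged solution is that the customer-facing surface can be tiny: describe the problem in domain terms, call the solver, and let the library hide every GPU detail. A compilable stub sketch:

```c
/* Hypothetical sketch of a domain-targeted accelerator API. The "solver"
 * is a do-nothing stub standing in for the real engine; in a product it
 * would handle GPU setup, memory transfers, and kernel launches internally. */
#include <stdio.h>

typedef struct {
    double freq_hz;    /* excitation frequency */
    int    timesteps;  /* time-stepping count for the simulation */
} em_problem;

/* Stub: a real implementation would dispatch to the GPU here. */
static int em_solve(const em_problem *p, double *field_energy)
{
    (void)p;
    *field_energy = 0.0;  /* pretend result */
    return 0;             /* 0 = success */
}

int main(void)
{
    em_problem p = { .freq_hz = 2.4e9, .timesteps = 10000 };
    double energy;

    /* This is roughly the whole integration burden on the end user. */
    if (em_solve(&p, &energy) != 0) {
        fprintf(stderr, "solver failed\n");
        return 1;
    }
    printf("field energy: %g\n", energy);
    return 0;
}
```

An interface this narrow is what makes a month-long integration plausible: the customer never touches the GPU programming model at all.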
Finally, The Register looks at a few players working with FPGAs as their acceleration platform. FPGAs offer theoretically immense performance and performance per watt, but also a much steeper learning curve for programmers. In my favorite field of embedded systems, I would rather see FPGAs and on-chip FPGA fabric used to create accelerated peripherals, or to embed simple algorithms in hardware, than as general-purpose math accelerators. The difference might not be that big in theory, but it does affect the style of the programming tools and the programs they target. And there is a big difference between floating-point math and the operations needed to decode video or do parallel table lookups.
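To illustrate that last point with a made-up example: the loop below is nothing but byte-wide table lookups. An FPGA maps this kind of operation naturally onto its lookup tables and can translate many bytes in parallel per clock cycle, while a floating-point accelerator brings nothing to it:

```c
/* Illustrative contrast (my example, not from the article): an inner loop
 * of pure byte-wide table lookups, the kind of code found in video decoding
 * and crypto. An FPGA can unroll this into many parallel LUTs; a
 * floating-point accelerator cannot speed it up at all. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Translate a buffer through a 256-entry substitution table, byte by byte.
 * A CPU does this serially; an FPGA can do N bytes per cycle in parallel. */
static void substitute(uint8_t *buf, size_t len, const uint8_t table[256])
{
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}

int main(void)
{
    uint8_t table[256];
    uint8_t data[16];

    for (int i = 0; i < 256; i++)
        table[i] = (uint8_t)(255 - i);  /* a trivial demo substitution */
    memset(data, 0x01, sizeof data);

    substitute(data, sizeof data, table);
    printf("first byte: 0x%02x\n", data[0]);  /* 255 - 1 = 0xfe */
    return 0;
}
```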