I just finished reading the October 2010 issue of Communications of the ACM. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap’s article series “Thinking Clearly about Performance”.
Cary’s articles deal mostly with database tuning in the Oracle ecosystem, but most of his observations apply to any kind of programming with a performance requirement. It is worth a read. It was good to see him dissect performance, including obvious – but not really obvious – concepts like the difference in usefulness between average and worst-case response times from a user perspective. In essence, you need to watch the spread of response times, and try to keep the worst times from getting too bad, rather than just look at an average that might conceal extremes that frustrate users.
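To make that point concrete, here is a tiny sketch (my own made-up numbers, not from the article) comparing the mean of a set of response times with a high percentile and the worst case:

```python
# Hypothetical response times in seconds: most requests are quick,
# but one slow outlier is hidden by the average.
response_times = [0.2, 0.3, 0.25, 0.2, 0.3, 0.28, 0.22, 0.26, 0.3, 4.5]

mean = sum(response_times) / len(response_times)

# 90th percentile, nearest-rank method: the value below which roughly
# 90% of the observations fall.
ranked = sorted(response_times)
p90 = ranked[int(0.9 * len(ranked)) - 1]
worst = ranked[-1]

print(f"mean  = {mean:.2f} s")   # ~0.68 s -- looks acceptable
print(f"p90   = {p90:.2f} s")    # 0.30 s
print(f"worst = {worst:.2f} s")  # 4.50 s -- the response the user remembers
```

The average looks harmless; the frustrating case only shows up when you look at the tail.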
Cary also made the comment noted in the title of this post. In his opinion, the performance instrumentation built into Oracle has an overhead of -10% – or even -20% or -30% – since it enables optimizations that would otherwise have been impossible. This is something worth noting in general: overhead that looks bad when considered as a local cost might be a net benefit in the grand scheme of things, by enabling measurements and insight that let a program run much faster.
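As a back-of-the-envelope illustration of how an overhead can come out negative (my numbers, not Cary’s): if the instrumentation itself costs a few percent of run time, but the tuning it enables removes a much larger chunk of work, the net effect is a faster program.

```python
# Hypothetical figures, just to show the arithmetic behind "negative overhead".
baseline = 100.0             # uninstrumented run time, arbitrary units
instrumentation_cost = 0.05  # 5% slowdown from measuring
tuning_gain = 0.20           # 20% speedup enabled by what the measurements reveal

instrumented = baseline * (1 + instrumentation_cost) * (1 - tuning_gain)
net_overhead = (instrumented - baseline) / baseline

print(f"run time with instrumentation and tuning: {instrumented:.1f}")  # 84.0
print(f"net overhead: {net_overhead:+.0%}")                             # -16%
```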
The ACM case study on Photoshop can also be found online at ACM Queue, with what seems to be mostly the same content. It is an interview in which Clem Cole of Intel talks to Russell Williams of the Photoshop team. It is very instructive to see how the Photoshop team has built an application that works well on 2 to 4 and maybe 8 cores, but that really needs to have parts of its architecture reconsidered to scale beyond 8.
Clem from Intel pushes Russell by bringing up various examples of next-generation architectures, in particular the fact that clusters-on-a-chip and NUMA memories look inevitable. The Photoshop people seem to take a wait-and-see approach to this: they first want to see some architecture have real traction in the market before they commit and rearchitect their software to make use of it.
The problems of debugging parallel software are also brought up. There used to be a simple bug in the asynchronous I/O system in Photoshop that took ten years to uncover! Essentially, the programmers had not considered atomicity properly in the presence of multiple threads. With that kind of example, it is not surprising that the Photoshop programmers are very careful when planning and performing parallelizations.
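The article does not show the offending code, but a bug of that general flavor is easy to sketch: a read-modify-write on shared state that is not atomic, so two threads can interleave and lose an update. Everything below is a hypothetical illustration, not Photoshop’s actual I/O code:

```python
import threading

pending_io = 0  # shared count of outstanding asynchronous I/O requests
lock = threading.Lock()

def complete_request_racy():
    # BUG: the read-modify-write is not atomic. Two threads can both read
    # the same value and one decrement is lost. The window is tiny, which
    # is exactly why a bug like this can hide for years.
    global pending_io
    pending_io = pending_io - 1

def complete_request_safe():
    # Holding a lock around the update makes it effectively atomic.
    global pending_io
    with lock:
        pending_io = pending_io - 1
```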
The target domain of Photoshop is to some extent naturally parallel, but not as much as I would have thought. Since a user might operate on any part of an image, large or small, and maybe start and then abort an operation, it is not just a matter of splitting an image evenly across threads or cores. There is a significant amount of variation in just how parallel things can be in Photoshop.
Photoshop has had an easy-to-use parallelization system in place since around 1994, which lets programmers write simple serial computational kernels that are automatically applied to parts of an image in parallel. The Photoshop program itself takes care of the synchronization between kernels, so the kernels can stay simple and robust, with no parallel code inside. This is a pattern that has been seen before, and one that makes a lot of sense – if it can be applied successfully. Apparently, it is not necessarily the easiest thing to scale beyond four cores.
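The article does not describe Adobe’s actual API, but the general shape of such a system is easy to sketch: the programmer writes a serial kernel that transforms one tile, and the framework cuts the image into tiles, farms them out to a pool of workers, and stitches the results back together. All the names below are mine, not Adobe’s, and the “image” is just a flat list of pixel values:

```python
from concurrent.futures import ProcessPoolExecutor

def invert_kernel(tile):
    """Serial per-tile kernel written by the application programmer.
    It knows nothing about threads; it just transforms one tile."""
    return [255 - value for value in tile]

def split_into_tiles(image, tile_size):
    """Framework side: cut the image into fixed-size tiles."""
    return [image[i:i + tile_size] for i in range(0, len(image), tile_size)]

def apply_kernel_parallel(kernel, image, tile_size=4, workers=4):
    """Framework side: run the serial kernel over all tiles in parallel
    and reassemble the result in tile order."""
    tiles = split_into_tiles(image, tile_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(kernel, tiles))  # map preserves tile order
    return [value for tile in processed for value in tile]

if __name__ == "__main__":
    image = list(range(16))  # stand-in for real pixel data
    print(apply_kernel_parallel(invert_kernel, image))
```

The attraction of the pattern is that the kernel author never touches threads or locks; all the parallel machinery lives in one well-tested place.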
The main limitation on Photoshop performance continues to be memory bandwidth rather than raw compute performance. This also limits the need to scale aggressively to higher levels of parallelism: as long as adding threads does not add bandwidth, there is little point in using more than two or three threads on a multicore processor, since that is enough to saturate the memory system. Apparently, this is different on the Nehalem (Core i7/i5/i3) generation of Intel multicore processors, where each core has a dedicated, non-stealable slice of the memory bandwidth.
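A crude way to see this kind of saturation on your own machine is to run a memory-bound kernel in an increasing number of worker processes and watch the aggregate throughput flatten out well before the core count does. The sketch below uses NumPy for the streaming sum and is only a rough probe, not a proper benchmark; the numbers depend entirely on the machine:

```python
import time
from multiprocessing import Pool

import numpy as np

N = 50_000_000  # ~400 MB of float64 per worker; shrink this if memory is tight

def stream_sum(_):
    """A mostly memory-bound kernel: stream a large array and sum it."""
    data = np.ones(N)                 # allocated outside the timed region
    start = time.perf_counter()
    data.sum()
    elapsed = time.perf_counter() - start
    return N * 8 / elapsed            # bytes read per second

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        with Pool(workers) as pool:
            rates = pool.map(stream_sum, range(workers))
        # Crude aggregate: assumes the workers' timed regions overlap,
        # which they roughly do here.
        print(f"{workers} workers: {sum(rates) / 1e9:.1f} GB/s aggregate")
```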
For the near future, it seems that the big step for Photoshop is to use GPUs for acceleration, rather than main processors with 10+ cores.