SiCS Multicore Day 2009

Last Friday, I attended this year’s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community.

I think that for next year, the organizers need to find keynote speakers who are not from the general computing multicore world. The Microsoft talk this year was a step in that direction, as it came from multicore programming rather than multicore hardware. Richard and Anders gave very interesting and good talks, no doubt about it. But it would have been nice to have someone from ARM or Freescale or Tensilica or TI or ST or Ericsson or Cisco talking about the kinds of multicore embedded hardware that are being developed and used today. For example, the “next new thing” touted by the keynotes this year was GPGPU. Interesting for HPC and desktops, certainly. But pretty irrelevant for most of the people that I know. GPUs are huge, expensive, and power hungry.

GPGPU was one part of the theme this year. It is definitely catching on as the way to do number crunching in the desktop, server, and HPC world. It is not the universal panacea for any kind of parallelism, however, as Hazim and I noted in the panel discussion that ended the day. There are applications (such as parallel Simics…) that scale well on general-purpose cores, but that will never ever work on GPUs. In general, the class of problems that work on GPUs is pretty limited to massive data-parallel problems like image and video manipulation.

In the eternal homogeneous vs heterogeneous debate (follow the tags in my blog for more posts on this topic), GPGPU was grudgingly accepted as a good candidate for something that will not be homogenized with the main processors. Additionally, Richard Kaufmann gave some hints that Intel or AMD are coming out with new chips with more accelerators on board… I guess it will be security, as is already done by Sun and IBM. When I brought up the topic of more accelerators like pattern matching, compression, and the other things we see in chips from Freescale, Cavium, and others, the response was very “can only be economical for very high volume applications”.

It is striking how the GPGPU idea is bringing the classic telecommunications DSP-data plane/CPU-control plane division into the desktop and server space. Without any recognition being paid or any experience being reused from the 40 years that that has been done in telecoms and consumer electronics… as Jack Ganssle often says, us embedded folks get no respect.

In terms of programming, this year was all about general programming languages. Hazim from Microsoft talked about (and demoed) the quite pervasive addition of parallelism to both native C/C++ and managed .NET code in Visual Studio 2010. Microsoft is dead serious about parallel programming, and is bringing out a whole set of different libraries and support structures to allow easier expression of parallel code. In the “LINQ” data query language subset of C#, you could add some easy modifiers to “foreach” statements to make them parallel, for example. Having a language that is your own and which you can extend at will certainly pays off in terms of innovation here. C++ moves far slower than C#, and that is becoming clearer and clearer. C# and its cousins in the .NET system seem to be sneaking in lots of powerful language design ideas from places like Python, and also results from Microsoft’s powerful group of language researchers.
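To give a feel for what "add a modifier to make the loop parallel" means, here is a rough analogy in Python. It is not the C# or LINQ syntax shown in the talk, and the function names are made up for illustration. The thread pool only shows the shape of the change; in CPython, CPU-bound work like this would need processes rather than threads to actually run faster.

```python
from concurrent.futures import ThreadPoolExecutor


def brightness(pixel):
    """Hypothetical per-item work: average of an (r, g, b) tuple."""
    r, g, b = pixel
    return (r + g + b) // 3


def sequential(pixels):
    # Ordinary loop: one item at a time.
    return [brightness(p) for p in pixels]


def parallel(pixels, workers=4):
    # Same loop body, but the iterations are handed to a pool of workers.
    # map() yields results in input order, so callers see no difference.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brightness, pixels))
```

The point is that the loop body stays untouched and only the way the iterations are driven changes, which is only safe when the iterations do not depend on each other.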

When I tried to bring up the idea of using domain-specific languages to program parallel applications, Hazim had the wonderful comment that “that might be applicable in certain domains…” Yes, that is the idea. By being narrow in terms of target domains, you gain expressive power and semantic insight that helps move programming from “how” towards “what”. But it sounds like domain-specific is a foul word inside of Microsoft: when the audience asked whether LINQ was not exactly a domain-specific language for data access, Hazim was at pains to point out that it is Turing-complete and that someone had managed to write a raytracer using it… interesting. This feels more political than market-based. I guess Micro

Richard Kaufmann had some interesting notes on throughput vs TTC (time-to-completion) jobs in servers. In the “cloud computing” era, throughput is much easier to scale: just add more servers. Classic HPC is more oriented towards TTC, as you do want your results within a reasonable time. Quite often, you can turn most work into a throughput-oriented style by simply running lots of jobs in parallel rather than pushing through a series of jobs sequentially. Note however that we have the entire field of real-time control, real-time communications, etc., that does not work like this. But that is not the market that HP is building servers for, or that Intel and AMD are servicing.
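The throughput vs TTC distinction can be made concrete with a toy model. This is my own sketch, not something from the talk, and it assumes identical, independent jobs that each run unsplit on one server with no scheduling overhead.

```python
import math


def batch_makespan(num_jobs, job_seconds, servers):
    """Seconds until the whole batch is done, one job per server at a time."""
    waves = math.ceil(num_jobs / servers)
    return waves * job_seconds


def throughput(num_jobs, job_seconds, servers):
    """Jobs completed per second for the whole batch."""
    return num_jobs / batch_makespan(num_jobs, job_seconds, servers)


# 100 one-minute jobs: ten servers finish the batch ten times sooner...
print(batch_makespan(100, 60, 1), batch_makespan(100, 60, 10))
# ...but a single job still takes a full minute, however many servers exist.
print(batch_makespan(1, 60, 10))
```

Adding servers scales the first number but never the second, which is why a workload that needs one answer quickly cannot simply be fixed by buying more machines.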

Outside the keynotes, Per Holmberg of Ericsson gave an interesting presentation on the adoption of multicore in the control plane of the Ericsson CPP platform. The core of his talk was the observation that in these kinds of systems, multicore is not such a big revolution.

They have been distributed since the beginning. Thus, scaling by adding more processors (with local memories) is easy and multicore is only a packaging change from that. Also, most performance-intense operations are already offloaded onto DSP groups, network processors, ASICs, or FPGAs. There is not much parallelism left for the control plane to exploit. Essentially, only functions that unexpectedly become performance bottlenecks due to changes in traffic patterns are likely candidates for parallelization. Interesting point, and might be why the EETimes noted that multicore is slow to catch on in communications (the article is a bit flawed).

Patrik Nyblom from Ericsson held a talk about how the Erlang runtime engine was parallelized. From a practical perspective, the most interesting aspect was that this made applications parallel without changing a single line of code in the applications. Of course, applications had to be threaded to start with, but that is the most natural way in Erlang. He mentioned systems containing up to a quarter of a million threads — hard to do that in anything except Erlang.

He described how they had evolved from a simple implementation that worked well on synthetic benchmarks to a truly industrial-strength implementation. The difference was quite radical, as real codes feature more complex communications patterns, and make heavy use of device drivers and network stacks. This process forced the use of more and finer locks, and rethinking the balance between shared and separate heaps for threads.

They also had the opportunity to test their solution on a Tilera 64-core machine. This mercilessly exposed any scalability limitations in their system, and proved the conventional wisdom that going beyond 10+ cores is quite different from scaling from 1 to 8… The two key lessons they learned were that no shared lock goes unpunished, and that data has to be distributed as well as code. Very interesting to hear this story from real software developers solving real problems.
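The two lessons can be illustrated with a small sketch of my own. It is in Python and has nothing to do with how the Erlang runtime is actually implemented, and the function names are hypothetical. Python threads will not show the scaling difference because of the interpreter lock; the point is only the structure of the two designs.

```python
import threading


def count_shared(num_workers, increments):
    """Every worker updates one counter behind one lock."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(increments):
            with lock:  # all workers contend for this one lock
                total += 1

    threads = [threading.Thread(target=work) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_distributed(num_workers, increments):
    """Each worker owns its data; results are combined once at the end."""
    partial = [0] * num_workers

    def work(slot):
        local = 0
        for _ in range(increments):
            local += 1  # private data, no lock needed
        partial[slot] = local  # one write, to a slot only this worker uses

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)
```

Both versions compute the same result, but the first serializes every update through a single point, while the second keeps each worker on its own data until the final merge.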

The next multicore event taking place around here is the Second Swedish Workshop on Multicore Computing (MCC 2009), in Uppsala, November 26-27.

Update: note that the presentations from the event are available via

8 thoughts on “SiCS Multicore Day 2009”

  1. Aloha!

    Thanks for the writeup from the conference. One thing about GPGPU in embedded space. Consumer electronics are moving more and more towards fancy GUIs – think iPhone (and other mobile phones), GPS devices etc. These devices are powered with things like gfx-accelerators from ARM, Anoto and whatnot. Esp accelerators for OpenGL ES to provide good 3D gfx on embedded platforms.

    Since embedded is all about maximizing your resources I believe we will see GPGPU move into embedded space, not to use massive parallelism, but to use computational resources that otherwise are idle but still powered on.

    Some possible applications might be audio decoding, digital camera effects (zoom, gamma-correction, motion blur compensation, pan-tilt) as well as GPS coordinate filtering, which today requires DSP support or a high performance main MCU but could possibly be off-loaded onto a small GPU.

  2. I think the difference between a GPU and a DSP is that the GPU has a narrower field of application. Thus, it would make more sense to consolidate on something like a Cell SPE or similar processor. The GPU is not a natural and established part in most embedded fields (space, control, telecom infra, mobile consumer apart from phones) like the DSP is. The fact that a GPU has to drive a display does skew its implementation some… and lets it be very efficient at drawing pictures.

    We will see, it is an interesting idea to have a merged GPU and DSP subsystem. However, lots of DSP work is time-critical and cannot really be shared with other work that is bursty in nature like screen updates.

    But I guess the media encoding/decoding acceleration circuitry could merge with the GPU. That would be a bit more natural — but note that even in PCs, audio is separate, as is video handling. Video is on the GPU, but separate function blocks not using the actual GPU processors.

  3. Aloha!

    I’d say that the trend is towards ever more general GPUs. The latest ones from Nvidia are pretty much real RISC cores albeit with some constraints on memory accesses and shared instruction fetch between several cores making them SIMDish. (Which I’m sure you know.)

    We’ll see what happens with OpenCL. There seems to be a flurry of activities related to VLC as well as audio processing programs on the Mac utilizing the GPUs. If I understood correctly the next Garageband will use OpenCL to accelerate the audio processing.

    I agree that on embedded devices (mobile phones) there are DSP-heavy applications that have very hard real time constraints – baseband modem and speech codecs.

    But audio in general is actually quite slow and undemanding. Relatively speaking. The reason audio is separate is (imho) more related to the issue of having a driver for the ports and that they used to be on a separate card. On other platforms, external soundcards are connected via USB or Firewire and the processing is done on the boards and is more related to output buffering and jitter than to codec and other DSP processing.

    Finally I agree that time-critical apps will have problems sharing resources, but that is always the problem with shared resources and RT. If you replaced the GPU with a DSP and tried to process gfx and time-critical modem processing you will have starvation too.

  4. One note on DSLs.
    They are coming, big time. Ericsson is working on DSLs for various domains, Stanford’s PPL lab is developing DSLs for creating parallel software for various domains etc. Also watch out for a WS on the subject at next year’s ICSE conference 🙂
    Finally, a post on the subject on my blog:
