Probably thanks to the yearly Mobile World Congress, there have been a slew of recent announcements of mobile application processors recently. Everything is ARM-based, but show quite some variety in the CPU core configurations used. Indeed, I think this variety has something to say on the general state of multicore.
Category Archives: Multicore Computer Architecture
Nvidia recently announced that their already-known “Kal-El” quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a “normal” quad-core would. They call the architecture “Variable SMP”, and it is a pretty smart design. The one where you think, “I should have thought of that”, which is the best sign of something truly good.
By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well.
Episodes 299 and 301 of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas of predictability, observability, repeatability, and determinism. I have worked and wrangled with these concepts for almost 15 years now, from my research into timing prediction for embedded processors to my current work with the repeatable and reversible Simics simulator.
I have another blog up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called True Concurrency is Truly Different (Again). It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.
My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to neither my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.
Last Friday, I attended this year’s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community. Read More →
In a post from late June, Jeff Atwood at Coding Horror discusses the horrible cost of a large HP server (scaling up to 32 processor cores in eight AMD x86 sockets), compared to a bunch of simple single-socket basic servers. There are some interesting notes on relative costs of small-and-simple servers, including things like administration and power. There is an undercurrent to the post and the comments that the big HP machine is “overpriced”. I don’t think it is. If you have ever had Erik Hagersten as a teacher in computer architecture, you will know why.
About two months ago, Cavium Networks launched their second generation of Octeon chips, the Octeon II. The most obvious difference to the previous generation (Octeon, Octeon Plus) is a new MIPS64 core with much better support for hypervisors and virtualization. There are some other interesting aspects to this chip, though.
Last year in a blog post on video encoding for the iPod Nano, I complained about the lack of performance on my old Athlon. A bit later, I noted that (obviously) video encoding is a good example of an application that can take advantage of parallelism. Yesterday I put these two topics together in a practical test. And it worked nicely enough.
A common question from simulation users to us simulation providers is “can I simulate a machine with N cores”, where N is “large”. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform is easy. Creating a software stack for that arbitrary platform is a lot harder, since an SMP software stack needs to understand about the cores and how they communicate.
Essentially, what you need is a hardware design that has addressing room for lots of cores, and a software stack that is capable of using lots of cores — even if such configurations do not exist in hardware. Unfortunately, since software is normally written to run on real existing machines, there tends to be unexpected limitations even where scalability should be feasible “in principle”.
Here is the story of how I convinced Linux to handle more than two cores in a virtual MPC8641D machine.
I keep looking out for interesting examples of parallel software, and there is constant trickle of these. This past week I spotted a couple of new ones in the EDA field: SPICE simulation and chip timing analysis.
It is a week ago now, and sometimes it is good to let impressions sink in and get processed a bit before writing about an event like the SiCS Multicore Days. Overall, the event was serious fun, and I found the speakers very insightful and the panel discussion and audience questions added even more information.
The two days of the SiCS Multicore Days is now over, and it was a really fun event this year too. I will be writing a few things inspired by the event, and here is the first.
Kunle Olukotun‘s presentation on the work of the Stanford Pervasive Parallelism lab included a diagram where they showed a range of domain-specific languages (DSL) being compiled to a universal implementation language. That language is currently Scala, and in the end all applications end up being compiled into Scala byte codes, which are then optimized and dynamically reoptimized and executed on a particular hardware system based on the properties of that system. Fundamentally, the problem of creating and compiling a DSL, and combining program segments written in different DSLs, is solved by interposing a layer of indirection.
But this idea got me thinking about what the best such intermediary might be for large-scale general deployment.
I will give a presentation on how Simics was threaded and how we created a parallel virtual platform system at the SiCS Multicore Days 2008, which takes place in Kista, Sweden, on September 11 and 12. The schedule is now up (so I edited the post and added updated to the title), at http://www.sics.se/node/3182, and my talk is on Friday, Sept 12, at 13.00 in “track 2″. Speaker bios and abstracts are also online.
Even apart from my own humble participation, I think the event itself will be well worth attending. Last year was really good and serious fun! See my writeups from last year: part 1 and part 2 (and a short note on the Rock processor and transactional memory).
The first Swedish Workshop on Multicore Computing (MCC) will take place in Ronneby on November 27 and 28, 2008. The call for papers is now out, and it is open until September 26. If you have something cool to present or publish about multicore computing, and happen to be here in Sweden, please do submit an abstract!
Disclosure: I am in the program committee for this event.
In Episode 157 of Security Now,Steve Gibson and Leo Laporte discuss the recently discovered security issues with DNS. In particular, the cost of making a good fix in terms of bandwidth and computation capacity. Fundamentally, according to Steve, today’s DNS servers are running at a fairly high load, and there is no room to improve the security of DNS updates by for example sending extra UDP packets or switching to TCP/IP. As this theoretically means a doubling or tripling of the number of packets per query, I can believe that. The “real solutions” to DNS problems should lie in the adoption of a truly secured protocol like DNSSEC. As this uses public key crypto (PKC), it would add a processing load to the servers that would kill the DNS servers on the CPU side instead…
In the March/April 2008 issue of ACM Queue, there is an article on GPU Programming by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs, in the most coherent and understandable way that I have seen so far. The PC GPU has a model for programming parallel hardware that might be a good pattern for other areas of processing. Programmers do not have to write explicitly parallel code, the machinery and hardware takes care of ensuring parallel behavior, as long as the code follows the assumptions made in the model.
The Radio Register has a nice interview with Kunle Olukotun, the man most known for the Afara/Sun Niagara/UltraSparc T1-2-etc. design. It is a long interview, lasting well over an hour, but it is worth a listen. A particular high point is the story on how Kunle worked on parallel processors in the mid-1990s when everyone else was still chasing single-thread performance. He really was a very early proponent of multicore, and saw it coming a bit before most other (general-purpose) computer architects did. Currently, he is working on how to program multiprocessors, at the Stanford Pervasive Parallelism Laboratory (PPL). In the interview, I see several themes that I have blogged about before being reinforced…
YouTube – Freescale QorIQ P4080 Hybrid Simulation is a video of a demo of the QorIQ P4080 hybrid simulation. Cool of Freescale to be publishing it like this, I think it is a very smart move!
Updated: Here is the video inline, let’s see if this works.