Via the EETimes, I found a very interesting talk by Bristol professor David May, presented at the 4th Annual Bristol Multicore Challenge in June of 2013. The talk can be found as a YouTube video here, and the slides are available here. The EETimes focused on the idea of cutting ARM down to be truly RISC, but I think the more interesting part is Professor May’s observations on multicore computing in general, and the case for and against heterogeneity in (parallel) computers.
I am usually in favor of heterogeneous computing. It seems to make intuitive sense, and I like the idea of a small processor optimized to do one thing and do it well. General-purpose processors carry an awful lot of overhead, so why waste them on things that specialized processors can handle better? To read what I have said about this before, look for the “heterogeneous” tag on my blog posts. Still, there are some interesting points well worth considering in Professor May’s talk.
Before going into details, let’s note that Professor David May is a multicore and parallel computing veteran. He was part of the team that built the Inmos Transputer, and he also co-founded XMOS, a company that builds an embedded parallel multicore microcontroller (of which I know precious little). I liked the way he teaches computer history to his students as a way to get them to understand why things are the way they are in our industry (basically, today’s computers are really odd if you look at them from an a priori perspective), and his talk starts out with some really good historical background material that even includes the idea that most computer architecture ideas were developed at IBM in the 1960s 🙂. So we have a solid presenter here who knows what he is talking about, and he does it well.
Clearly, he belongs in the camp that thinks homogeneous computing is the best way to do things. However, there is a twist to his thinking. Most proponents of homogeneous computer architectures tend to be HPC or server designers who also think that the systems have to use cache-coherent shared memory and be microarchitecturally homogeneous too. For David May, cache coherency is not the right way to do things, and he is quite happy to have heterogeneous implementations of the same instruction set (if that helps overall efficiency).
In short, his idea of a good computer architecture is to start with an efficient and general interconnect, and then bolt on whatever processor core fulfills the requirements. I am not sure just what a “general” interconnect means, but it seems to be about having fair scalability and not being a 2D mesh. Once the interconnect is designed, add the processor cores, which just have to be able to keep up with the interconnect. This is backwards from how typical computers are designed, where you start with the cores (big, important, popular) and bolt on some wires to connect them. He noted that older computer architects tend to build interconnects, as experience has told them what is really important… makes sense to me.
That “good enough” processor core is where the reduced ARM comes in. Right now, he is looking at a 30-odd instruction subset of the ARM Thumb 1 instruction set (see a paper on his site for details), which can be implemented in a really small core.
His vision is a computer similar in spirit to the original Transputer: hundreds or thousands or millions of simple cores, interconnected via a message-passing network, with a programming system that makes it simple to write programs that execute in parallel. Input and output are handled by assigning a few cores to the tasks, no special processors needed. When you have thousands of cores, assigning some of them to particular tasks does not hurt (something I thought about a few years ago, too).
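To make that programming model a bit more concrete, here is a minimal sketch in Go of the style: many simple “cores” that share nothing and communicate only by passing messages, with ordinary cores assigned to input and output duty. This is my own illustration, not May’s or XMOS’s actual programming system; the core count, the message type, and the worker function are all made up for the example.

```go
// A worker-farm sketch of the message-passing style: compute "cores" are
// goroutines, the "interconnect" is a pair of channels, and input/output are
// handled by ordinary cores given that task.
package main

import "fmt"

// worker is the program loaded onto one compute core: receive a message,
// do a little work, send the result onwards.
func worker(jobs <-chan int, results chan<- int) {
	for j := range jobs {
		results <- j * j // placeholder "work"
	}
}

func main() {
	const cores = 8 // arbitrary small core count for the sketch
	jobs := make(chan int)
	results := make(chan int)

	// The compute cores.
	for i := 0; i < cores; i++ {
		go worker(jobs, results)
	}

	// One core is assigned to input...
	go func() {
		for j := 1; j <= 20; j++ {
			jobs <- j
		}
		close(jobs)
	}()

	// ...and this loop plays the role of the output core: no special
	// I/O processors anywhere.
	for n := 0; n < 20; n++ {
		fmt.Println(<-results)
	}
}
```

Go’s channels are a convenient stand-in here because they descend from the same CSP ideas that shaped occam on the Transputer; in May’s vision the channels would presumably map onto links in the interconnect rather than in-process queues.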
In the end, what he describes starts to sound an awful lot like a computer scientist’s version of an FPGA. Thousands of simple processing elements, connected via a programmable interconnect, on which any program or algorithm can be made to run via a compiler. The difference is in the nature and complexity of the processing nodes. In an FPGA, they are mostly lookup tables, while in May’s vision they are simple processors.
Such a massive number of simple cores should be able to replace current specialized hardware like GPUs. He makes an interesting historical case about the REYES machine from Pixar (essentially a very early GPU implemented in a machine all of its own), which gave birth to Renderman. Apparently, they started out designing special-purpose boards for each phase of the rendering pipeline, but ended up using the same Transputer-based board for all phases – with different software.
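As I read that anecdote, the point is that every phase ran on identical hardware and only the loaded software differed. A tiny Go sketch of that idea, with invented phase names and toy arithmetic standing in for the real rendering work, might look like this:

```go
// Same core, different software: a three-stage pipeline where every stage runs
// the identical stage() loop and only the function it is handed differs.
package main

import "fmt"

type phase func(float64) float64

// stage is the same for every "board"/core; the loaded software (fn) is what
// makes it a transform, shade, or sample stage.
func stage(fn phase, in <-chan float64, out chan<- float64) {
	for v := range in {
		out <- fn(v)
	}
	close(out)
}

func main() {
	// Hypothetical rendering-like phases; the real REYES pipeline stages were
	// of course far more involved.
	phases := []phase{
		func(v float64) float64 { return v * 2 },   // "transform"
		func(v float64) float64 { return v + 0.5 }, // "shade"
		func(v float64) float64 { return v * v },   // "sample"
	}

	// Wire the stages into a chain of channels and start one core per phase.
	chans := make([]chan float64, len(phases)+1)
	for i := range chans {
		chans[i] = make(chan float64)
	}
	for i, fn := range phases {
		go stage(fn, chans[i], chans[i+1])
	}

	// Feed some data in and print what falls out at the end.
	go func() {
		for _, v := range []float64{1, 2, 3} {
			chans[0] <- v
		}
		close(chans[0])
	}()
	for out := range chans[len(phases)] {
		fmt.Println(out)
	}
}
```

Every stage runs the exact same loop; changing the software means handing it a different function, which is the shape of the Pixar story as I understand it.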
I think the key to making that work is that the cores are so simple that their energy efficiency is sufficiently high. Basically, as a processor core gets simpler, its efficiency increases and the overhead of processing instruction streams as compared to the useful work performed is reduced. Thus, it makes some kind of sense that a GPU could be replaced efficiently (and not just theoretically) by an array of simple general processors. In a sense, GPGPU is pushing GPUs in that general direction, even though the cores can hardly be considered simple…
The one weak spot in his argument is how to deal with the real need for high performance on single-threaded, stubbornly sequential programs. He does acknowledge in his history section that this has always been a key factor in building a successful parallel machine – it also has to do sequential stuff very well. However, the talk leaves it quite unclear just how a large number of simple cores would work together to run a sequential program quickly. That is probably the key technical issue to be resolved before his vision of an interconnect with a sea of simple cores can become a general unified computing architecture that replaces today’s complex mix of IO processors, GPUs, DSPs, and GPPs.
In summary, I must highly recommend this talk. It is interesting, thought-provoking, and presents some good anecdotes and facts from the history (and future) of computing.
Hi Jakob!
The article “Brawny cores still beat wimpy cores, most of the time” (2010), http://research.google.com/pubs/archive/36448.pdf, seems related to what you write about the need for sequential performance.
Well,
What is the target date for building a trillion-core machine?
Can you compute, a priori, the peak performance of a trillion-core machine?
Is it a trillion times better than just one core? Or is the sum better than its parts?
I look forward to a trillion-core computer.
Regards
A trillion cores seems “some way” out… since we are only now entering the million-core range for building-sized supercomputers.
Let’s say a core needs 1 mm3 of space.
Then you can cram 1,000,000,000 cores into 1 m3.
To make a trillion-core computer you would need about 1,000 m3 of space. A cube building with height=width=length=10m will do.
What kind of applications will this computer run?
You forget the need to connect the cores together, and to have memory for them. All of that takes up the majority of the space in large SoCs. Not to mention the effort needed to connect each little chip to its neighbors across boards, racks, and rooms.
Ok,
Maybe in the future 1 core will be implemented using 1 atom or molecule.
Let’s try a different approach with today’s supercomputers. How about connecting together all the supercomputers in one country (any country with lots of supercomputers will do)? Will the übercomputer be faster or slower than each one of them?
But I guess in a few decades you will hold million-core devices in your palm. All the intrinsic quantum physics of the atoms will be exploited.
I want a trillion core computer! Yesterday!
What is the function of this software? Thanks