The SiCS Multicore Day took place last week, for the tenth year in a row! It is still a very good event to learn about multicore and computer architecture, and to meet a broad selection of industry and academic people interested in multicore in various ways. While multicore is not the bright shiny new thing it once was, it is still an exciting area of research – even if much of the innovation is moving away from the traditional field of making a bunch of processor cores work together, towards system-level optimizations. For the past few years, SiCS has had the good taste to publish all the lectures online, so you can go to their YouTube playlist and see all the talks for free, right now!
The first talk of the day was by Peter Sewell from Cambridge, who presented the work of his research group in formalizing and analyzing weak memory models. I heard a talk from the very same group five years ago at the UPMARC Summer School, and it was interesting to hear how they had refined their insights. The core of their work is proving that Intel x86 (I work for Intel, but I have nothing to do with this particular part of our product design) is indeed TSO, Total Store Order. Five years ago, they lumped ARM and Power together, but this year they had refined their understanding to show that the two architectures are actually a bit different from each other and need to be treated separately. Not too surprising, really.
One thing he said that was rather intriguing was that they were looking into ways to build a simulator that could explain to a user what is going on in their use of the memory system… and somehow explore all possible executions allowed by the architecture – not just by a particular implementation of the architecture – in a way that can be understood and used to guide the creation of correct and efficient code. That is an idea that I recognize from what we do with Simics: to explore a larger space of behaviors than the hardware or a particular run on a particular piece of hardware can expose (in our case, for example, how functional specs are implemented). But doing it for the multitude of subtle possible execution patterns in a weak memory model is harder to do in a pedagogic way. Maybe just doing a lot of pseudorandom test runs that show a wide variety of possibilities is enough. Such a simulator is going to be rather slow though, so we need some way to focus on the important snippets of code – running it on a complete boot of Windows or Linux, at tens of billions of instructions, is probably not going to be very helpful.
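To make the idea of "all executions allowed by the architecture" a bit more concrete, here is a minimal sketch in Python that exhaustively explores the classic store-buffering litmus test. This is my own toy illustration, not the tool from Sewell's group: the `explore` function, the program encoding, and the simple "one FIFO store buffer per thread" model of TSO are all assumptions made for the example.

```python
# Store-buffering litmus test; both locations start at 0.
#   Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r0")),
    (("store", "y", 1), ("load", "x", "r1")),
)


def replace_at(tup, index, value):
    """Return a copy of `tup` with position `index` set to `value`."""
    return tup[:index] + (value,) + tup[index + 1:]


def explore(program, store_buffers):
    """Enumerate every reachable final (r0, r1) by exhaustive search.

    store_buffers=False: stores hit memory at once (sequential consistency).
    store_buffers=True:  stores wait in a per-thread FIFO buffer and drain
                         to memory at any later point (a toy TSO model).
    """
    addrs = sorted({addr for thread in program for _, addr, _ in thread})
    start = (
        tuple(0 for _ in program),       # program counter per thread
        tuple(() for _ in program),      # store buffer per thread
        tuple((a, 0) for a in addrs),    # memory as sorted (addr, value) pairs
        (),                              # registers as sorted (name, value) pairs
    )
    outcomes, seen, todo = set(), set(), [start]
    while todo:
        state = todo.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pc == len(t) for pc, t in zip(pcs, program)) and not any(bufs):
            final = dict(regs)
            outcomes.add((final["r0"], final["r1"]))
            continue
        for tid, thread in enumerate(program):
            if bufs[tid]:
                # Drain the oldest buffered store of this thread to memory.
                (addr, val), rest = bufs[tid][0], bufs[tid][1:]
                new_mem = tuple(sorted({**dict(mem), addr: val}.items()))
                todo.append((pcs, replace_at(bufs, tid, rest), new_mem, regs))
            if pcs[tid] == len(thread):
                continue
            op, addr, arg = thread[pcs[tid]]
            new_pcs = replace_at(pcs, tid, pcs[tid] + 1)
            if op == "store" and store_buffers:
                new_buf = bufs[tid] + ((addr, arg),)
                todo.append((new_pcs, replace_at(bufs, tid, new_buf), mem, regs))
            elif op == "store":
                new_mem = tuple(sorted({**dict(mem), addr: arg}.items()))
                todo.append((new_pcs, bufs, new_mem, regs))
            else:
                # A load sees its own newest buffered store first, else memory.
                val = dict(mem)[addr]
                for buf_addr, buf_val in bufs[tid]:
                    if buf_addr == addr:
                        val = buf_val
                new_regs = tuple(sorted(regs + ((arg, val),)))
                todo.append((new_pcs, bufs, mem, new_regs))
    return outcomes


if __name__ == "__main__":
    print("SC :", sorted(explore(PROGRAM, store_buffers=False)))
    print("TSO:", sorted(explore(PROGRAM, store_buffers=True)))
```

With store buffers enabled, the search also finds the outcome where both threads read 0, which no plain interleaving of the four instructions can produce. Even for four instructions the state space needs memoization to stay small, which hints at why this kind of exploration only makes sense for short snippets.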
Joe Armstrong advocated that message passing is the right way to build software. Once it is split into a lot of small units passing messages, parallelism and even distribution become an easy problem for the runtime system. I do think this makes sense – programming with shared memory is a bad model for most programs and most programmers. However, no matter how you do it, you always end up ALSO requiring some kind of shared state… so that’s why we see people implementing shared memory on top of message passing and message passing on top of shared memory. I guess in practice, you need both primitives.
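As a small illustration of shared state built on top of message passing, here is a sketch in Python where a single "actor" thread owns a counter and everyone else can only reach it by sending messages. The names `counter_actor`, `worker`, and `run_demo`, as well as the message format, are made up for this example; it is not how Erlang does it, just the same idea in miniature.

```python
import queue
import threading


def counter_actor(inbox):
    """Owns the counter; the only code that ever touches `total`."""
    total = 0
    while True:
        msg = inbox.get()
        if msg[0] == "add":
            total += msg[1]
        elif msg[0] == "get":
            msg[1].put(total)  # reply on the caller's private queue
        elif msg[0] == "stop":
            return


def worker(inbox, count):
    """Ask the actor to increment the counter `count` times."""
    for _ in range(count):
        inbox.put(("add", 1))


def run_demo(num_workers=4, adds_per_worker=1000):
    inbox = queue.Queue()
    actor = threading.Thread(target=counter_actor, args=(inbox,))
    actor.start()
    workers = [
        threading.Thread(target=worker, args=(inbox, adds_per_worker))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # All "add" messages are already queued, so "get" is handled after them.
    reply = queue.Queue()
    inbox.put(("get", reply))
    total = reply.get()
    inbox.put(("stop",))
    actor.join()
    return total


if __name__ == "__main__":
    print(run_demo())
```

No worker ever takes a lock, yet the counter behaves like shared state. The other direction is visible here too: Python's `queue.Queue` is itself implemented with locks over shared memory, so this is message passing on top of shared memory on top of which we build shared state again.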
Torbiörn Fritzon from Spotify went over their issues and how their profile is different from other cloud-based services. He shared some good insights into how their system works, and the problems that they are facing. For example, they have to process some 20 TB of logged data each day! That’s the kind of scale where you need to employ thousands of computers – multicore is just a basic building block; here we are looking at many thousands of such chips stacked into thousands of machines. And that is for what is still just a mid-sized service that just broke through 100 million users.
One thing that Torbiörn said was that they do not expect that doing 1 billion users would be the same as doing 100 million users times ten. Rather, if they were to use the same system architecture and setup as today, it would likely be much more than 10 times as expensive to deal with 10 times more users. It resembles something I heard at the SiCS Multicore Day in 2009 – where it was noted that when scaling the Erlang engine (Joe Armstrong’s favorite, BTW), each power of two of cores required a re-architecting and tackling a new set of problems. It seems there is a similar effect at play in large systems – maybe with each power of 10 you have to rethink radically how you do things.
Zoran Radovic (unfortunately not available as a video) from Oracle (ex-Sun) presented a talk about the current “7” generation of SPARC chips from Oracle. In his talk, he made a very clear point that hardware acceleration is the way to go right now. Interestingly, back at the SiCS Multicore Day in 2012, another Oracle speaker presented the “5” generation, where the focus was on adding features to the ISA rather than on specific accelerators. In the meantime, it seems that Oracle and other companies have invested much more in hardware offload units of various kinds. I guess we have all got better at building the right APIs and libraries to make using them easier, and at using asynchronous programming methods to overcome the latency issues. Also, by having libraries that use different implementations depending on the hardware available, the implementation cost for programmers to use the new hardware features has gone down. Today, you can pick up a library like Oracle’s libdax and have it run accelerated on a SPARC M7 chip – and then fall back to various software implementations on other chips. The same is true for DPDK, which originally came from Intel – you can use it on pretty much any hardware, which makes the cost of investing in it lower. Right now, it means that hardware accelerators for particular common expensive tasks seem to be the right way to add performance at the smallest cost in silicon area and power consumption.
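The "same library, different implementation depending on the hardware" pattern can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration: the scan operation, the backend names, and the probe functions are my own, and none of it is the actual libdax or DPDK API.

```python
from typing import Callable, List, Sequence, Tuple

ScanFn = Callable[[Sequence[int], int], List[int]]
Backend = Tuple[str, Callable[[], bool], ScanFn]


def scan_software(values: Sequence[int], key: int) -> List[int]:
    """Portable fallback: return the indices where values[i] == key."""
    return [i for i, v in enumerate(values) if v == key]


def scan_offloaded(values: Sequence[int], key: int) -> List[int]:
    """Placeholder for a call into an offload unit (hypothetical).

    A real library would hand the buffer to a driver here; this stand-in
    reuses the software path so the sketch stays runnable anywhere.
    """
    return scan_software(values, key)


def make_backends(accelerator_present: bool) -> List[Backend]:
    """Backends in order of preference; each has a probe and an implementation."""
    return [
        ("offload", lambda: accelerator_present, scan_offloaded),
        ("software", lambda: True, scan_software),
    ]


def select_backend(backends: List[Backend]) -> Tuple[str, ScanFn]:
    """Pick the first backend whose probe reports that it is usable."""
    for name, probe, impl in backends:
        if probe():
            return name, impl
    raise RuntimeError("no usable backend")


if __name__ == "__main__":
    name, scan = select_backend(make_backends(accelerator_present=False))
    print(name, scan([3, 1, 3, 7], 3))
```

The caller only ever sees `scan`, so code written against the library keeps working, with the same results, whether or not the accelerator is there. That is what keeps the cost of adopting the new hardware low.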
GPGPU was entirely absent this year, but I guess you can’t find new angles on it every year.
Going to the Multicore Day was time well spent. It offered a good mix of low-level stuff like memory consistency, system issues like what Spotify talked about, and real product insights like Oracle’s SPARC chips. It would have been nice to go to the other days that were part of the software week, but there is only so much time available.
I must also say that I was very impressed by the speed with which the videos from the Multicore Day got posted – I accidentally found them the day after, which is very quick. Bravo, that’s the way you should do it!