The SiCS Multicore Day took place last week, for the tenth year in a row! It is still a very good event to learn about multicore and computer architecture, and meet with a broad selection of industry and academic people interested in multicore in various ways. While multicore is not bright shiny new thing it once was, it is still an exciting area of research – even if much of the innovation is moving away from the traditional field of making a bunch of processor cores work together, towards system-level optimizations. For the past few years, SiCS has had to good taste to publish all the lectures online, so you can go to their Youtube playlist and see all the talks for free, right now!
I love bug and debug stories in general. Bugs are a fun and interesting part of software engineering, programming, and systems development. Stories that involve running Simics on Simics to find bugs are a particular category that is fascinating, as it shows how to apply serious software technology to solve problems related to said serious software technology. On the Intel Software and Services blog, I just posted a story about just that: debugging a Linux kernel bug provoked by Simics, by running Simics on a small network of machines inside of Simics. See https://blogs.intel.com/evangelists/2016/05/30/finding-kernel-1-2-3-bug-running-wind-river-simics-simics/ for the full story.
A new record, replay, and reverse debugger has appeared, and I just had to take a look at what they do and how they do it. “rr” has been developed by the Firefox developers at Mozilla Corporation, initially for the purpose of debugging Firefox itself. Starting at a debugger from the angle of attacking a particular program does let you get things going quickly, but the resulting tool is clearly generally useful, at least for Linux user-land programs on x86. Since I have tried to keep up with the developments in this field, a write-up seems to be called for.
In a blog post at Wind River, I describe how the Wind River Helix Lab Cloud system can be used to communicate hardware design to software developers. The idea is that you upload a virtual platform to the cloud-based system, and then share it to the software developers. In this way, there is no need to install or build a virtual platform locally, and the sender has perfect control over access and updates. It is a realization of the hardware communication principles I presented in an earlier blog post on use cases for Lab Cloud.
But the past part is that the targets I talk about in the blog post and use in the video are available for anyone! Just register on Lab Cloud, and you can try your own threaded software and check how it scales on a simulated 8-core ARM!
On November 3, 2015, I will give a presentation at the Embedded Conference Scandinavia about simulating IoT systems. The conference program can be found at http://www.svenskelektronik.se/ECS/ECS15/Program.html, with my session detailed at http://www.svenskelektronik.se/ECS/ECS15/Program/IoT%20Development.html.
My topic is how to realistically simulate very large IoT networks for software testing and system development. This is a fun field where I have spent significant time recently. Only a couple of weeks ago, I tried my hand simulating a 1000-node network. Which worked! I had 1000 ARM-based nodes running VxWorks running at the same time, inside a single Simics process, and at speeds close to real time! It did use some 55GB of RAM, which I think is a personal record for largest use of system resources from a single process. Still, it only took a dozen processors to do it.
I have a silly demo program that I have been using for a few years to demonstrate the Simics Analyzer ability to track software programs as they are executing and plot which threads run where and when. This demo involves using that plot window to virtually draw text, in a way akin to how I used to make my old ZX Spectrum “draw” things in the border. But when I brought it up in a new setting it failed to run properly and actually starting hanging on me. Strange, but also quite funny when I realized that I had originally foreseen this very problem and consciously decided not to put in a fix for it… which now came back to bite me in a pretty spectacular way. But at least I did get an interesting bug to write about.
Via the EETimes, I found a very interesting talk by Bristol professor David May, presented at the 4th Annual Bristol Multicore Challenge, in June of 2013. The talk can be found as a Youtube movie here, and the slides are available here. The EETimes focused on the idea to cut down ARM to be really RISC, but I think the more interesting part is Professor May’s observations on multicore computing in general, and the case for and against heterogeneity in (parallel) computers.
Probably thanks to the yearly Mobile World Congress, there have been a slew of recent announcements of mobile application processors recently. Everything is ARM-based, but show quite some variety in the CPU core configurations used. Indeed, I think this variety has something to say on the general state of multicore.
The 2012 edition of the SiCS Multicore Day was fun, like they have always been in the past. I missed it in 2010 and 2011, but could make it back this year. It was interesting to see that the points where keynote speakers disagreed was similar to previous years, albeit with some new twists. There was also a trend in architecture, moving crypto operations into the core processor ISA, that indicates another angle on the hardware accelerator space.
A few years ago, I built a demo on Simics that used a hacked Freescale MPC8641D target that was forced to scale from 1 to 8 cores. Some interesting experiements could be made using this target, and it was nicely scalable for its time. However, I always wanted to have something just a bit bigger. Say 20 cores, or 100. Just to see what would happen. Finally, I got it.
The Simics QSP target that we quietly launched earlier this Summer is such a scalable target. As discussed in a blog post describing the architecture, it is designed to scale to 128 cores currently. Using this ability, I repeated my old experiments, but trying very large threads counts and target core counts. The results show clearly that the way that I coded my parallel computation program was pretty bad, and I really would like to try to rewrite it using some more modern threading library. All I need is time and a way to cross-compile Wool…
Anyway, the new blog post is here.
Once upon a time, all programming was bare metal programming. You coded to the processor core, you took care of memory, and no operating system got in your way. Over time, as computer programmers, users, and designers got more sophisticated and as more clock cycles and memory bytes became available, more and more layers were added between the programmer and the computer. However, I have recently spotted what might seem like a trend away from ever-thicker software stacks, in the interest of performance and, in particular, latency.
I was recently pointed to a 2011 SPLASH presentation by David Ungar, an IBM researcher working on parallel programming for manycore systems. In particular, in a project called Renaissance, run together with the Vrije Universiteit Brussels in Belgium (VUB) and Portland State University in the US. The title of the presentation is “Everything You Know (about Parallel Programming) Is Wrong! A Wild Screed about the Future“, and it has provoked some discussion among people I know about just how wrong is wrong.
I just read a quite interesting article by Christian Pinto et al, “GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms“, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the execution is not without its flaws and it is unclear at least from the paper just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.
By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well.
I just finished reading the October 2010 issue of Communications of the ACM. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap’s articles about “Thinking Clearly about Performance”.
I have a fairly lengthy new blog post at my Wind River blog. This time, I interview Tennessee Carmel-Veilleux, a Canadian MSc student who have done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore.
I listened to an interesting FLOSS Weekly interview with Adam Hall and Achim Hasenmuller of VirtualBox. For someone interested in virtual machines and hardware simulation, the interview was full of interested tidbits. I think the best part was the discussion on multiprocessing in Virtualbox.
I recently read a couple of articles on multicore that felt a bit like jumping back in time. In IEEE Spectrum, David Patterson at Berkeley’s parallel computing lab brings up the issue of just how hard it is to program in parallel and that this makes the wholesale move to multicore into something like a “hail Mary pass” for the computer industry. In Computer World, Chris Nicols at NICTA in Australia asks what you will do with a hundred cores – implying that there is not much you can do today. While both articles make some good points, I also think they should be taken with a grain of salt. Things are better than they make them seem. Continue reading “Multicore is not That Bad”
I have another blog up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called True Concurrency is Truly Different (Again). It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.
Today here at the MCC 2009 workshop, I heard an interesting talk by David Black-Schaffer of Stanford university. His work is on stream programming for image processing (“2D streams”). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing.
Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like “let’s show how we unravel a randomly occurring hard-to-reproduce bug using Simics“. This then turns into a hard hunt for a program with a suitable bug in it… not the Simics tooling to resolve the bug. For some reason, when I best need bugs, I have hard time getting them into my code.
I guess it is Murphy’s law — if you really set out to want a bug to show up in your code, your code will stubbornly be perfect and refuse to break. If you set out to build a perfect piece of software, it will never work…
So I was actually quite happy a few weeks ago when I started to get random freezes in a test program I wrote to show multicore scaling. It was the perfect bug! It broke some demos that I wanted to have working, but fixing the code to make the other demos work was a very instructive lesson in multicore debug that would make for a nice demo in its own right. In the end, it managed to nicely illustrate some common wisdom about multicore software. It was not a trivial problem, fortunately.
Andras Vajda of Ericsson wrote an interesting blog post on domain-specific languages (DSLs). Thanks for some success stories and support in what sometimes feels like an uphill battle trying to make people accept that DSLs are a large part of the future of programming. In particular for parallel computing, as they let you hide the complexities of parallel programming.
I found this quote in Stackoverflow Podcast #68 quite funny in its extreme dislike of parallel programming in C…
Last Friday, I attended this year’s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community. Continue reading “SiCS Multicore Day 2009”
Øredev is the premier software development conference in Sweden and Europe (they claim). I gave some presentations there in 2006 and 2007, but since then they have dropped the general embedded software development track and just focused on programming for mobile phones. Most of the material is “general IT”. If you are doing software development on the desktop or for servers, it is a good place to go to learn new things from the general world of computing.
Freescale has now released the collected, updated, and restyled book version of the article series on embedded multicore that I wrote last year together with Patrik Strömblad of Enea, and Jonas Svennebring, and John Logan of Freescale. The book covers the basics of multicore software and hardware, as well as operating systems issues and virtual platforms. Obviously, the virtual platform part was my contribution.
Last year, FLOSS Weekly interviewed Jan Lehnard of the CouchDB project. I put up a blog post on this, noting that it was interesting with a scalable parallel program written in Erlang, a true concurrent language. The interview was interesting, but not very deeply technical. Now, almost a year later, the StackOverflow podcast, number 59, interviewed the founder of the project, Damien Katz. This interview goes a bit more into the technical details and what CouchDB is good for and what not, as well as some details on the use and performance of Erlang.
I have a short article on multicore systems development and virtual platforms in the May 2009 issue of ECN magazine, over at www.ecnmag.com.
Last year in a blog post on video encoding for the iPod Nano, I complained about the lack of performance on my old Athlon. A bit later, I noted that (obviously) video encoding is a good example of an application that can take advantage of parallelism. Yesterday I put these two topics together in a practical test. And it worked nicely enough.