To continue from last week’s post about my Linux device driver and hardware teaching setup in Simics, here is a lesson I learnt this week when doing some performance analysis based on various hardware speeds.
I have recently discovered stackoverflow.com and I must say it is something I very much recommend. The idea is simple, and the details rich and interesting.
The two days of the SiCS Multicore Days is now over, and it was a really fun event this year too. I will be writing a few things inspired by the event, and here is the first.
Kunle Olukotun‘s presentation on the work of the Stanford Pervasive Parallelism lab included a diagram where they showed a range of domain-specific languages (DSL) being compiled to a universal implementation language. That language is currently Scala, and in the end all applications end up being compiled into Scala byte codes, which are then optimized and dynamically reoptimized and executed on a particular hardware system based on the properties of that system. Fundamentally, the problem of creating and compiling a DSL, and combining program segments written in different DSLs, is solved by interposing a layer of indirection.
But this idea got me thinking about what the best such intermediary might be for large-scale general deployment.
This might appear as a stretched analogy, but it struck as me as obvious when I tried playing the Lego Racers boardgame with my 3-year old this weekend. The game is ranked pretty low on Boardgamegeek, and deservedly so. The promise and premise is great: use Lego cars to race around a track and pick up new pieces to modify the powers of your car… sounds like great fun. Right? But it is not, and that’s where my analogy with the age of software comes in.
This was a refreshingly different post: Too Many Cores, not Enough Brains:
More importantly, I believe the whole movement is misguided. Remember that we already know how to exploit multicore processors: with now-standard multithreading techniques. Multithreaded programming is notoriously difficult and error-prone, so the challenge is to invent techniques that will make it easier. But I just don’t see vast hordes of programmers needing to do multithreaded programming, and I don’t see large application domains where it is needed. Internet server apps are architected to scale across a CPU farm far beyond the limits of multicore. Likewise CGI rendering farms. Desktop apps don’t really need more CPU cycles: they just absorb them in lieu of performance tuning. It is mostly specialized performance-intensive domains that are truly in need of multithreading: like OS kernels and database engines and video codecs. Such code will continue to be written in C no matter what.
The argument at core is that multicore is about performance, and performance optimization is generally something that we do prematurely rather than focussing on how to solve the core problem in the best way. You have to respect Jonathan Edwards, and often this is true: programmers optimize themselves into a horrible design that is also slow.
I just read an opinion-provoking piece “Software developer attitudes: just get on with it” by Frank Schirrmeister, as well as the article “Life imitating art: Hardware development imitating software development” by Glenn Perry that he linked to. Both these articles touch on the long-standing question of who does development the “best” in computing. I have heard these arguments many times, where software developers think that there is something mythical about hardware development that makes things work so much better with much fewer bugs, and hardware people looking at the speed of development and fanciful fireworks of coding that software engineers can do. It could be a case of the grass always looking greener on the other side… but there are some concrete things that are relevant here.
I just listened to another Floss Weekly show, Number 36 where they interviewed Jan Lehnard of the CouchDB project. CouchDB is very interesting, in that it is a database designed for replication, redundancy, and thus massive parallelism. It was initially written by Damien Katz on his own, but now it is an Apache Foundation project sponsored by IBM. The most interesting thing is that Damien decided in 2006 to rewrite the C++ prototype he had in Erlang, and did so in just a few months if I understood my Erlang friends right. So here we have a really good parallel program written in a true parallel language.
In the March/April 2008 issue of ACM Queue, there is an article on GPU Programming by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs, in the most coherent and understandable way that I have seen so far. The PC GPU has a model for programming parallel hardware that might be a good pattern for other areas of processing. Programmers do not have to write explicitly parallel code, the machinery and hardware takes care of ensuring parallel behavior, as long as the code follows the assumptions made in the model.
In the July 2008 issue of IEEE Computer, there is short article called “In Praise of Scripting: Real Programming Pragmatism“, by Ronald P. Loui, a professor at Washington University (WUSTL). The article deals with the issue of what is the appropriate first language to teach new CS (Computer Science) students, and considers that a “scripting” langauge like Python or Ruby might be way better than Java (no doubt about that I think).
What can this teach us for the purpose of simulation and the creation of models of computer system hardware for the purpose of simulation? Maybe a fair bit…
Model-based architecture (MDA) or model-based development is an idea that to me comes from the automotive field. To, it means that you use some tool that is capable of modeling both a computer controller system and the environment being controlled to create a simulation world where computer control and environment meet and the characteristics of the controller can be ascertained quickly. The key is to not have to convert controller algorithms to concrete code, and not have to run concrete code on concrete hardware against physical prototypes to test the controllers. Today, this seems to be applied to many fields where you are creating control systems (automotive, aviation, robotics). The tools are math-based like MatLab and LabView, along with special programming environments based on UML and StateCharts.
What is interesting is that most of these tools are graphical in nature. And they do seem to work quite well, which is quite surprising given the otherwise poor record of graphical programming as opposed to text-based programming. There were a pile of graphical programming environments in the 1980’s, none of which amounted to much. What survived and prospered were the good old text-based languages like C, C++, Java, VisualBasic, etc. In practice, it seems like it is very hard to beat sequential text when it is time to actual get code working. More efficient programming seems to boil down to having to write less text and having text which is easier to write (for example, dynamic typing, rich libraries, garbage collection, and other modern language features that remove intellectual burdens from the programmer).
But graphics do seem to work for domain-specific cases (like control engineering or signal processing), especially for data-flow-style problems. And for abstract architecture work. So there has to be something to it… but what?
In early July, Cadence announced their new “C2S” C-to-silicon compiler. This event was marked with some excitement and blogging in the EDA space (SCDSource, EDN-Wilson, CDM-Martin, to give some links for more reading). At core, I agree that what they are doing is fairly cool — taking an essentially hardware-unrelated sequential program in C and creating hardware from it. The kind of heavy technology that I have come to admire in the EDA space.
But I have to ask: why start with C?
SystemC TLM-2.0 has just been released, and on the heels of that everyone in the EDA world is announcing various varieties of support. TLM-2.0-compliant models, tools that can run TLM-2.0 models, and existing modeling frameworks that are being updated to comply with the TLM-2.0 standard. All of this feeds a general feeling that the so-called Electronic System Level design market (according to Frank Schirrmeister of Synopsys, the term was coined by Gary Smith) is finally reaching a level of maturity where there is hope to grow the market by standards. This is something that has to happen, but it seems to be getting hijacked by a certain part of the market addressing the needs of a certain set of users.
There is more to virtual platforms than ESL. Much more. Remember the pure software people.
Edit: Maybe it is more correct to say “there is more to virtual platforms than SoC”, as that is what several very smart comments to this post has said. ESL is not necessarily tied to SoC, it is in theory at least a broader term. But currently, most tools retain an SoC focus.
A very interesting idea that has been bandied around for a while in manycore land is the notion that in the future, we will see a total inversion in today’s cost intuition for computers. Today, we are all versed in the idea that processor cores and processing times are quite precious, while memory is free. For best performance, you need to care about the cache system, but in the end, the goal is to keep those processor pipelines as busy as possible. Processors have traditionally been the most expensive part of a system, and ideas such as Integrated Modular Avionics are invented to make the best use of a resource perceived as rare and expensive…
But is that really always going to be true? Is it reasonably to think of CPU cores are being free but other resources as expensive? And what happens to program and system design then?
I got another email from my friend with the thesis that processors will become ever more homogeneous as time goes on, while I believe in a relative heterogenezation (is that a word?) of computer architecture with many special-purpose accelerators and helper processors. This argument is put forward in a previous blog post. In this round, the arguments for homogenization are from the gaming world.
Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people that for a long time. In my talks about Debugging Multicore Systems at the Embedded Systems Conference Silicon Valley in 2006 and 2007, I had a fairly long discussion about relaxed or weak memory consistency models and their effect on parallel software when run on a truly concurrent machine. I used Dekker’s Algorithm as an example of code that works just fine on a single-processor machine with a multitasking operating system, but that fails to work on a dual-processor machine. Over Christmas, I finally did a practical test of just how easy it was to make it fail in reality. Which turned out to showcase some interesting properties of various types and brands of hardware and software.
The book “Multicore Programming – Increasing Performance through Software Multithreading” by Shameem Akhter and Jason Roberts is part of a series of books put out by Intel in their multicore software push. In case you have not noticed, Intel has a huge market push currently where they give seminars, publish articles and books, and give curricula to universities in order to get more parallel software in place. I read this book recently, and here is a short review.
Continue reading “Book Review: Intel’s Multicore Programming Book”
Most of the time when talking about the impact of multicore processing on software, we complain that it makes the software more complicated because it has to cope with the additional complexities of parallelism. There are some cases, however, when moving to multicore hardware allows a software structure to be simplified. The case of Integrated Modular Avionics (IMA) and the honestly idiotic design of the ARINC 653 standard is one such case.
Continue reading “When Multicore makes Things Simpler, like IMA”
Just like in 2006, I went to the Øredev conference in Malmö and presented a workshop using Virtutech Simics. This year, I worked with Jonas Svennebring from Freescale and we created a workshop around parallelizing network processing software for running on a multicore Freescale processor. The workshop went reasonably well, and the participants definitely learned something about what we trying to get across, even though we did not have much time to actualy complete the programming assignments.
An old colleague just sent me an email bringing up a discussion we had last year, where he was a strong proponent for the homogeneous model of a multiprocessor. The root of that discussion was the difference between the Xbox 360 and Playstation 3 processors. The Xbox 360 has a three-core, two-threads-per-core homogeneous PowerPC main processor called the Xenon (plus a graphics processor, obviously), while the PS3 has a Cell processor with a single two-threaded PowerPC core and seven SPEs, Synergistic Processing Elements (basically DSP-like SIMD machines).
In the game business, it is clear that the Xenon CPU is considered easier to code for. This means that even though the Cell processor clearly has higher theoretical raw performance, in practical the two machines are about equal in power since it is harder to make use of the Cell. Which seems to be a fact.
So here, homogeneous systems do appear to have it easier among programmers. However, I do not believe that that extends to all systems, all the time, everywhere.
Some more thoughts on how to program multicore machines that did not make it into my original posting from last week. Some of this was discussed at the multicore day, and others I have been thinking about for some time now.
One of the best ways to handle any hard problem is to make it “somebody else’s problem“. In computer science this is also known as abstraction, and it is a very useful principle for designing more productive programming languages and environments. Basically, the idea I am after is to let a programmer focus on the problem at hand, leaving somebody else to fill in the details and map the problem solution onto the execution substrate.
Sun slots transactional memory into Rock | The Register
That is just so cool! For us around Stockholm, we can hear Marc Tremblay talk about this at the SiCS Multicore Day on August 31, 2007.