This is a small Linux SMP programming tip, which I had a hard time finding documented clearly anywhere on the web. I guess people won’t find it here either, but with some luck some search engine will pick up on this.
Edited on 2009-Feb-01, to include the link to the illustrated guide that really helps you get there faster. Thanks Simon! Also, promoted to front page, original post was put up on 2008-Nov-09.
Thanks to Simon Kågströms post (and the even better second-generation with screenshots) about using Eclipse for the Linux kernel, I have a much nicer work environment now for my ongoing work in learning Linux device drivers on PowerPC, which has helped me work my way through several hard-to-figure-out system calls. Continue reading “Eclipse Linux Kernel Indexing Works”
Embedded.com just listed the ten most visited articles on their website during 2008, and my contribution on debugging multiprocessor code was number ten. If you want some more meat around multiprocessor debug, please peruse the various papers and presentations found on my personal website.
Jack Ganssle wrote a column about the failure of multicore to scale, based on an article in IEEE Spectrum. He makes the following claim:
Now a study in IEEE Spectrum shows that even for the classic embarrassingly parallel problems like weather simulations multicore offers little benefit. The curve in that article is priceless. As the number of cores grow from two to 64 performance plummets by a factor of five. Additional processors nullify each other.
Call it the Nulticore Effect.
I just read an interesting paper from the 2004 Embedded System’s Conference (ESC) written by Gary Stringham. It is called “ASIC Design Practices from a Firmware Perspective” and straddles the boundary between hardware design and driver software development. It was good to see someone take the viewpoint of “how you actually program a hardware device is as important as what it does”. Gary seems to understand both the hardware design and implementation view of things, as well as that of the embedded software engineer. To me, that seems to be a fairly rare combination of skills, to the detriment of our entire economy of computer system development.
The first real snow reached Uppsala this weekend, lots of nice fluffy slippery cold snow on the ground and on the roads and everywhere else. It really is nice to have snow again, it lessens the effect of our dark winters and kind of puts you in a Christmas-like mood, especially now that the Christmas decorations are going up in town and shopping centers.
I also had to bring out the car for some errands and transports yesterday, and that new snow was probably the slipperiest I have ever driven on. It also provided an unsought opportunity for the electronic systems in our car to show themselves… both the stability and traction control and the anti-lock brakes were activated several times despite my pretty careful driving. For some reason, I never really believe that they would apply to me. I know that ESP and ABS are really good for safety, but for some reason I am a diehard skeptic that never quite believe these things work as they should. I guess this is another example of an embedded system that works as it should. Which really should not be a surprise.
To continue from last week’s post about my Linux device driver and hardware teaching setup in Simics, here is a lesson I learnt this week when doing some performance analysis based on various hardware speeds.
There are times when working with virtual hardware and not real hardware feels very liberating and efficient (not to mention safe). Bringing up, modifying, and extending operating systems is one obvious such case. Recently, I have been preparing an open-source-based demonstration and education systems based on embedded PowerPC machines, and teaching myself how to do Linux device drivers in the process. This really brought out the best in virtual platform use.
I am a skeptic when it comes to technology. Despite working in the tech field — or maybe because I am — I always expect technology to fail or at least disappoint. But sometimes that instinct is actually wrong! Here are two recent examples when I felt “wow, that was pretty good” about some fairly mundane pieces of computerized equipment.
Chip Design Magazine published an article by me in their August/September 2008, about Getting Software into the Hardware Design Loop. The article is about the technical and marketing aspects of how chip designers can get early feedback from software and systems designers, early in the hardware design process. The vehicle for this? Virtual platforms, obviously.
I have recently discovered stackoverflow.com and I must say it is something I very much recommend. The idea is simple, and the details rich and interesting.
The article/editoral “Using virtual platforms to improve AdvancedTCA software development practice” is now up at CompactPCI and AdvancedTCA Systems, an online and paper journal for the rack-based market. It is about our experience at Virtutech in using virtual platforms to drive system and software development for “pretty large” target systems, even those based on standard hardware.
And really, there is no such thing as a standard embedded system. Even if you use a standard backplane and buy off-the-shelf boards and cards to put in it, the combination of cards and added mezzanine cards makes each system quite unique. If you could use completely standard PC hardware for your system with no custom additions or special IO units, the thing would in likelihood not actually be an embedded system.
More from the SiCS multicore days 2008.
There were some interesting comments on how to define efficiency in a world of plentiful cores. The theme from my previous blog post called “Real-Time Control when Cores Become Free” came up several times during the talks, panels, and discussions. It seems that this year, everybody agreed that we are heading to 100s or 1000s of “self-respecting” cores on a single chip, and that with that kind of core count, it is not too important to keep them all busy at all times at any cost. As I stated earlier, cores and instructions are now free, while other aspects are limiting, turning the classic optimization imperatives of computing on its head. Operating systems will become more about space-sharing than time-sharing, and it might make sense to dedicate processing cores to the sole job of impersonating peripheral units or doing polling work. Operating systems can also be simplified when the job of time-sharing is taken away, even if communications and resource management might well bring in some new interesting issues.
So, what is efficiency in this kind of environment?
In early July, Cadence announced their new “C2S” C-to-silicon compiler. This event was marked with some excitement and blogging in the EDA space (SCDSource, EDN-Wilson, CDM-Martin, to give some links for more reading). At core, I agree that what they are doing is fairly cool — taking an essentially hardware-unrelated sequential program in C and creating hardware from it. The kind of heavy technology that I have come to admire in the EDA space.
But I have to ask: why start with C?
I have another short technical piece published about Multicore Debug at the EETimes (and their network of related publications, like Embedded.com). Pretty short piece, and they cut out some bits to make it fit their format. Nothing new to fans of virtual platforms for software development, basically we can use virtual platforms to reintroduce control over parallel and for all practical purposes chaotic hardware/software systems.
I have another opinion piece published over at SCDsource.com. The title, “Why virtual platforms need cycle-accurate models“, was their creation, not mine, and I think it is a little bit off the main message of the piece.The follow-up discussion is also fairly interesting.
The key thing that I want to get across is that we need virtual platforms where we can spend most of our time executing in a fast, not-very-detailed mode to get the software somewhere interesting. Once we get to the interesting spot, we can then switch to more detailed models to get detailed information about the software behavior and especially its low-level timing. Getting to that point in detailed mode is impossible since it would take too much time.
This is something that computer architecture researchers have been doing for a very long time, just look at how toolsets like SimpleScalar and Simics with the Wisconsin GEMS system use fast mode for “positioning” and more detailed execution for “measurement”. It is also what is now commercial with the Simics Freescale QorIQ P4080 Hybrid virtual platform. Tensilica also have the ability to switch mode in their toolchain.
See an upcoming post for more on how to get at the cycle-accurate models – this was just to point out that that the article is there, for symmetry with previous posts about my articles popping up in places.
Only half an hour ago, the embargoes lifted. Freescale announced its new QorIQ series of multicore (and some single- and dual-core) processors. For the top-end of that line, the P4080, Freescale and Virtutech (where I work, remember) have developed a virtual platform solution to help Freescale customers get to working products faster. The virtual platform is available now, and is already running several operating systems including VxWorks, QNX, and a variety of Linuxes. Apart from the fairly large scale of this SoC, the really new part of the virtual platform is the so-called Hybrid solution, where the fast models are combined with detailed models from Freescale themselves. This creates a cycle-level detailed model with validated timing, “from the source” — but without the performance issues of having to run everything at great level of detail. Rather, you use the fast model to steer the simulation of a workload to an interesting spot, and then turn up the level of detail then and there. You can also select which components of the chip are actually detailed and which parts are modeled with the fast functional models, avoiding the incredible slow-down of running and entire virtual platform at a great level of detail.
If you happen to be at the FTF in Orlando, do come by and look at the demos!
I have been involved in this work for the past year, and it is wonderful to finally see it coming out and be able to talk about it.
As a follow-up to my previous post on the scope of ESL, I found a nice tidbit in an EETimes article… basically saying that hardware design is declining inside the typical system houses.
SystemC TLM-2.0 has just been released, and on the heels of that everyone in the EDA world is announcing various varieties of support. TLM-2.0-compliant models, tools that can run TLM-2.0 models, and existing modeling frameworks that are being updated to comply with the TLM-2.0 standard. All of this feeds a general feeling that the so-called Electronic System Level design market (according to Frank Schirrmeister of Synopsys, the term was coined by Gary Smith) is finally reaching a level of maturity where there is hope to grow the market by standards. This is something that has to happen, but it seems to be getting hijacked by a certain part of the market addressing the needs of a certain set of users.
There is more to virtual platforms than ESL. Much more. Remember the pure software people.
Edit: Maybe it is more correct to say “there is more to virtual platforms than SoC”, as that is what several very smart comments to this post has said. ESL is not necessarily tied to SoC, it is in theory at least a broader term. But currently, most tools retain an SoC focus.
On Tuesday next week, I will be presenting at the Power Architecture Conference (PAC) in München, Germany. The topics will be multicore debug using virtual hardware, and the new Simics Accelerator technology. Especially Simics Accelerator is pretty interesting technology.
It is a simple idea, using multiple host cores to run a virtual platform, with fairly amazing results. Now, using a single computer we can run fairly incredible simulations that were the realm of pure fantasy just a few years ago. We also got a nice new little box to demonstrate it with, an eight-core Dell with 16 GB of RAM. With 64-bit Linux, this thing makes my Core 2 Duo laptop with 32-bit Vista look like yesteryear’s snail… And creates that giggling feeling that a really impressive new toy brings up in even the most grown up boys. Booting a 16-machine network of PowerPC boards was so fast it was not demoworthy. I think we have to up the ante to some 100 target machines to make it interesting, and I have no doubt that a combination of multithreading and idle-loop optimization will make that thing be usefully interactive from the target command lines. There are many other wild things we could try on that demo box, once it gets back from the Power Architecture Conferences tour.
A very interesting idea that has been bandied around for a while in manycore land is the notion that in the future, we will see a total inversion in today’s cost intuition for computers. Today, we are all versed in the idea that processor cores and processing times are quite precious, while memory is free. For best performance, you need to care about the cache system, but in the end, the goal is to keep those processor pipelines as busy as possible. Processors have traditionally been the most expensive part of a system, and ideas such as Integrated Modular Avionics are invented to make the best use of a resource perceived as rare and expensive…
But is that really always going to be true? Is it reasonably to think of CPU cores are being free but other resources as expensive? And what happens to program and system design then?
I just got another article published! In the April 2008 issue of the ACM Transactions on Embedded Computing Systems (TECS), we have an article called “The worst-case execution-time problem – overview of methods and survey of tools”. “We” is kind of understatement, the article has fifteen authors from three continents, and presents an overview of the state of the field of WCET (Worst-Case Execution Time) analysis. The article was started back in 2005, with submission in 2006, accepted in January of 2007, and then finally it appeared in 2008. It is probably my last shot in the WCET area where I did my PhD thesis (please see my list of publications for an idea of what all of that is about).
Now the ESC SV 2008 is over. I really enjoyed going to the show this year, and presenting on simulation for embedded systems. The topic has to be heating up, I had some fifty people listen to the talk, which is really very good. Hope that they learnt how to build good transaction-level hardware models, and have some idea on how to apply this to their own projects. Hopefully, I can come back next year for the ESC 2009 (update: this did not happen) and do it again (even though the recent travel trouble makes it a less attractive idea to fly back here right now…).
Most of the time when talking about the impact of multicore processing on software, we complain that it makes the software more complicated because it has to cope with the additional complexities of parallelism. There are some cases, however, when moving to multicore hardware allows a software structure to be simplified. The case of Integrated Modular Avionics (IMA) and the honestly idiotic design of the ARINC 653 standard is one such case.
Continue reading “When Multicore makes Things Simpler, like IMA”
An old colleague just sent me an email bringing up a discussion we had last year, where he was a strong proponent for the homogeneous model of a multiprocessor. The root of that discussion was the difference between the Xbox 360 and Playstation 3 processors. The Xbox 360 has a three-core, two-threads-per-core homogeneous PowerPC main processor called the Xenon (plus a graphics processor, obviously), while the PS3 has a Cell processor with a single two-threaded PowerPC core and seven SPEs, Synergistic Processing Elements (basically DSP-like SIMD machines).
In the game business, it is clear that the Xenon CPU is considered easier to code for. This means that even though the Cell processor clearly has higher theoretical raw performance, in practical the two machines are about equal in power since it is harder to make use of the Cell. Which seems to be a fact.
So here, homogeneous systems do appear to have it easier among programmers. However, I do not believe that that extends to all systems, all the time, everywhere.
The “Handbook of Real-Time and Embedded Systems” (ToC, Amazon, CRC Press) is now out. I and my university research colleague and friend Andreas Ermedahl have written a chapter on worst-case execution time analysis. We talk some about the theories and techniques, but we try to discuss practical experience in actual industrial use. Both static, dynamic, and hybrid techniques are covered.
I just got my personal copy, but my first impression of the book overall is very positive. The contents seems quite practical to a large extent, not as academic as one might have feared. Do check it out if you are into the field. It is not a collection of research paper, rather instructive chapters informed by solid research but with applications in mind.
It just dawned on me recently (and it sure must have been obvious to those working with configuring AMP — Assymtric Multiprocessing Systems) that in an AMP setup, the operating systems involved actually know about each other and have to account for the fact that they are sharing a single processor chip with other operating systems. So you cannot just take two single-core operating system images from an existing multiple-processor (local memory) solution and put them on a single chip and things just work. You do need to prepare the boot process and find a way to nicely share the common I/O devices, timers, accelerator engines and other resources on the chip. This is materially different from a virtualized setup.
The SICS Multicore Day August 31 was a really great event! We had some fantastic speakers presenting the latest industry research view on multicores and how to program them. Marc Tremblay did the first presentation in Europe of Sun’s upcoming Rock processor. Tim Mattson from Intel tried hard to provoke the crowd, and Vijay Saraswat of IBM presented their X10 language. Erik Hagersten from Uppsala University provided a short scene-setting talk about how multicore is becoming the norm.
RTiS 2007 just took place in Västerås, Sweden. It is a biannual event where Swedish real-time research (and that really means embedded in general these days) presents new results and summarizes results from the past two years. For someone who has worked in the field for ten years, it really feels like a gathering of friends and old acquaintances. And always some fresh new faces. Due to a scheduling conflict, I was only able to make it to day one of two.
I presented a short summary of a paper I and a colleague at Virtutech wrote last year together with Ericsson and TietoEnator, on the Simics-based simulator for the Ericsson CPP system (see the publications page for 2006 and soon for 2007). I also presented the Simics tool and demoed it in the demo session. Overall, nice to be talking to the mixed academic-industrial audience.