What is Efficiency when Cores are Free?

More from the SiCS Multicore Days 2008.

There were some interesting comments on how to define efficiency in a world of plentiful cores. The theme from my previous blog post called “Real-Time Control when Cores Become Free” came up several times during the talks, panels, and discussions. It seems that this year, everybody agreed that we are heading to 100s or 1000s of “self-respecting” cores on a single chip, and that with that kind of core count, it is not too important to keep them all busy at all times at any cost. As I stated earlier, cores and instructions are now free, while other aspects are limiting, turning the classic optimization imperatives of computing on their head. Operating systems will become more about space-sharing than time-sharing, and it might make sense to dedicate processing cores to the sole job of impersonating peripheral units or doing polling work. Operating systems can also be simplified when the job of time-sharing is taken away, even if communications and resource management might well bring in some new interesting issues.

So, what is efficiency in this kind of environment?

Continue reading “What is Efficiency when Cores are Free?”

The JVM as Universal Parallel Glue?

The two days of the SiCS Multicore Days are now over, and it was a really fun event this year too. I will be writing a few things inspired by the event, and here is the first.

Kunle Olukotun’s presentation on the work of the Stanford Pervasive Parallelism lab included a diagram where they showed a range of domain-specific languages (DSLs) being compiled to a universal implementation language. That language is currently Scala, and in the end all applications end up being compiled into Scala byte codes, which are then optimized and dynamically reoptimized and executed on a particular hardware system based on the properties of that system. Fundamentally, the problem of creating and compiling a DSL, and combining program segments written in different DSLs, is solved by interposing a layer of indirection.

But this idea got me thinking about what the best such intermediary might be for large-scale general deployment.

Continue reading “The JVM as Universal Parallel Glue?”

SiCS Multicore Days 2008: Talk about Threading Simics (updated)

I will give a presentation on how Simics was threaded and how we created a parallel virtual platform system at the SiCS Multicore Days 2008, which takes place in Kista, Sweden, on September 11 and 12. The schedule is now up (so I edited the post and added “updated” to the title), at http://www.sics.se/node/3182, and my talk is on Friday, Sept 12, at 13.00 in “track 2”. Speaker bios and abstracts are also online.

Even apart from my own humble participation, I think the event itself will be well worth attending. Last year was really good and serious fun! See my writeups from last year: part 1 and part 2 (and a short note on the Rock processor and transactional memory).

Swedish Workshop on Multicore 2008: Nov 27-28: CFP!

The first Swedish Workshop on Multicore Computing (MCC) will take place in Ronneby on November 27 and 28, 2008. The call for papers is now out, and it is open until September 26. If you have something cool to present or publish about multicore computing, and happen to be here in Sweden, please do submit an abstract!

Disclosure: I am on the program committee for this event.

DNS: Hardware Accelerator Time!

In Episode 157 of Security Now, Steve Gibson and Leo Laporte discuss the recently discovered security issues with DNS, and in particular the cost of making a good fix in terms of bandwidth and computation capacity. Fundamentally, according to Steve, today’s DNS servers are running at a fairly high load, and there is no room to improve the security of DNS updates by, for example, sending extra UDP packets or switching to TCP/IP. As this theoretically means a doubling or tripling of the number of packets per query, I can believe that. The “real solution” to DNS problems should lie in the adoption of a truly secured protocol like DNSSEC. As this uses public key crypto (PKC), it would add a processing load that would kill the DNS servers on the CPU side instead…

Continue reading “DNS: Hardware Accelerator Time!”

GPU Programming: a Good Pattern to Follow?

In the March/April 2008 issue of ACM Queue, there is an article on GPU Programming by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs in the most coherent and understandable way that I have seen so far. The PC GPU has a model for programming parallel hardware that might be a good pattern for other areas of processing. Programmers do not have to write explicitly parallel code; the machinery and hardware take care of ensuring parallel behavior, as long as the code follows the assumptions made in the model.
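To make the shape of that model concrete, here is a minimal sketch in Python. It is not GPU code and not taken from the article: `brighten` and `launch` are hypothetical names, and a plain thread pool stands in for the GPU runtime. The point is only that the programmer writes scalar, per-element code with no threads or locks, and something else decides how to spread it across the data. The assumption the kernel must respect is that it touches nothing but its own element.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel):
    """Per-element kernel: plain scalar code, no threads, locks, or shared state."""
    return tuple(min(255, channel + 40) for channel in pixel)


def launch(kernel, elements, workers=4):
    """Hypothetical stand-in for the runtime: applies the kernel to every element.

    How many workers run, and in which order elements are processed, is
    decided here and not in the kernel. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, elements))


if __name__ == "__main__":
    image = [(0, 0, 0), (250, 10, 100), (30, 60, 90)]
    print(launch(brighten, image))
```

Because the kernel is independent per element, the result is the same whether `launch` uses one worker or many, which is exactly what lets the machinery choose the degree of parallelism. In CPython the thread pool will not actually speed this up; the sketch is about the division of labor, not performance.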

Continue reading “GPU Programming: a Good Pattern to Follow?”

Kunle Olukotun Interview: Heterogeneity, Domain-Specific Programming

The Radio Register has a nice interview with Kunle Olukotun, the man best known for the Afara/Sun Niagara/UltraSparc T1-2-etc. design. It is a long interview, lasting well over an hour, but it is worth a listen. A particular high point is the story of how Kunle worked on parallel processors in the mid-1990s when everyone else was still chasing single-thread performance. He really was a very early proponent of multicore, and saw it coming a bit before most other (general-purpose) computer architects did. Currently, he is working on how to program multiprocessors, at the Stanford Pervasive Parallelism Laboratory (PPL). In the interview, I see several themes that I have blogged about before being reinforced…

Continue reading “Kunle Olukotun Interview: Heterogeneity, Domain-Specific Programming”

Freescale QorIQ P4080 Hybrid Simulation on YouTube(!)

YouTube – Freescale QorIQ P4080 Hybrid Simulation is a video of a demo of the QorIQ P4080 hybrid simulation. Cool of Freescale to be publishing it like this; I think it is a very smart move!

Updated: Here is the video inline, let’s see if this works.

Freescale QorIQ P4080 is out — with Simics support

Only half an hour ago, the embargoes lifted. Freescale announced its new QorIQ series of multicore (and some single- and dual-core) processors. For the top end of that line, the P4080, Freescale and Virtutech (where I work, remember) have developed a virtual platform solution to help Freescale customers get to working products faster. The virtual platform is available now, and is already running several operating systems including VxWorks, QNX, and a variety of Linuxes. Apart from the fairly large scale of this SoC, the really new part of the virtual platform is the so-called Hybrid solution, where the fast models are combined with detailed models from Freescale themselves. This creates a cycle-level detailed model with validated timing, “from the source”, but without the performance issues of having to run everything at a great level of detail. Rather, you use the fast model to steer the simulation of a workload to an interesting spot, and then turn up the level of detail then and there. You can also select which components of the chip are actually detailed and which parts are modeled with the fast functional models, avoiding the incredible slow-down of running an entire virtual platform at a great level of detail.

If you happen to be at the FTF in Orlando, do come by and look at the demos!

I have been involved in this work for the past year, and it is wonderful to finally see it coming out and be able to talk about it.

Is SoC (was: ESL) all there is to virtual platforms?

SystemC TLM-2.0 has just been released, and on the heels of that everyone in the EDA world is announcing various varieties of support: TLM-2.0-compliant models, tools that can run TLM-2.0 models, and existing modeling frameworks that are being updated to comply with the TLM-2.0 standard. All of this feeds a general feeling that the so-called Electronic System Level design market (according to Frank Schirrmeister of Synopsys, the term was coined by Gary Smith) is finally reaching a level of maturity where there is hope to grow the market by standards. This is something that has to happen, but it seems to be getting hijacked by a certain part of the market addressing the needs of a certain set of users.

There is more to virtual platforms than ESL. Much more. Remember the pure software people.

Edit: Maybe it is more correct to say “there is more to virtual platforms than SoC”, as that is what several very smart comments to this post have said. ESL is not necessarily tied to SoC; it is, in theory at least, a broader term. But currently, most tools retain an SoC focus.

Continue reading “Is SoC (was: ESL) all there is to virtual platforms?”

The 1970 rule strikes again: Virtual Platform Principles in 1967

Being a bit of a computer history buff, I am often struck by how most key concepts and ideas in computer science and computer architecture were invented in some form or other before 1970, and commonly by IBM. This goes for caches, virtual memory, pipelining, out-of-order execution, virtual machines, operating systems, multitasking, byte-code machines, etc. Even so, I have found a quite extraordinary example of this that actually surprised me in its range of modern techniques employed. This is a follow-up to a previous post, after having actually digested the paper I talked about earlier.

Continue reading “The 1970 rule strikes again: Virtual Platform Principles in 1967”

Tri-core or Tricore or TriCore(tm)

I do find it kind of funny when marketing names go bad or collide in unexpected ways. There is this fairly old Infineon combined DSP/MCU core called TriCore (the name means it is at once a RISC, a DSP, and an MCU). It was a nice name, easy to recognize, easy to pronounce, unlike the competition at the time. Today though, we are seeing multicore chips with three cores on the die. So what are these, if not tri-core chips, in analogy with single-, dual-, quad-, oct-, etc.? And this makes it very necessary to use the hyphen. For example, Freescale’s recent StarCore 8113 chip with three cores has its press release explicitly headed tri-core with a hyphen. I guess marketing would have liked the more visually pleasing tricore moniker along with dualcore, which looks fairly established.

Ah well, not to mention the fun Infineon will have if it launches a triple-core TriCore device. Maybe in a third generation TriCore 3? The power of three, indeed. TriTriTriCore possibly?

Real-time control when cores become free

A very interesting idea that has been bandied around for a while in manycore land is the notion that in the future, we will see a total inversion in today’s cost intuition for computers. Today, we are all versed in the idea that processor cores and processing time are quite precious, while memory is free. For best performance, you need to care about the cache system, but in the end, the goal is to keep those processor pipelines as busy as possible. Processors have traditionally been the most expensive part of a system, and ideas such as Integrated Modular Avionics were invented to make the best use of a resource perceived as rare and expensive…

But is that really always going to be true? Is it reasonable to think of CPU cores as being free but other resources as expensive? And what happens to program and system design then?

Continue reading “Real-time control when cores become free”

David Ditzel Interview at The Register/Semicoherent Computing

The Register has a few podcasts in addition to their website, and the one called “Semicoherent Computing” has turned into a very nice series of interviews with interesting people from the computer industry. I recently listened to their interview from September 2007 with David Ditzel of Transmeta fame. He had a lot to say about the history of computing, as well as interesting things on where computing is going. Well worth a listen! Particularly interesting highlights…

Continue reading “David Ditzel Interview at The Register/Semicoherent Computing”

Heterogeneous vs homogeneous systems, revisited

I got another email from my friend with the thesis that processors will become ever more homogeneous as time goes on, while I believe in a relative heterogenization (is that a word?) of computer architecture with many special-purpose accelerators and helper processors. This argument is put forward in a previous blog post. In this round, the arguments for homogenization are from the gaming world.

Continue reading “Heterogeneous vs homogeneous systems, revisited”

IBM z6: Multicore, Accelerators

The IBM mainframe family that started with the S/360 back in the 1960s is still going strong. The naming has been interesting in recent years, going from S/390 to z900 to z990 to z9.

Continue reading “IBM z6: Multicore, Accelerators”

VIA/Centaur Isaiah: Multicore when needed, not earlier than that

Scott Wasson at techreport.com has a short but fairly good write-up on the microarchitecture of VIA’s new ‘Isaiah’ processor core, developed by their subsidiary Centaur Technology. What is interesting is the part on multicore processing. Scott is quoting Glenn Henry from Centaur:

He points out that most people don’t and shouldn’t care what type of CPU they have in their PCs, so long as it gets the job done. When Centaur started, Henry says, they had to develop engineers with a different mindset, not “faster is better.” He set a series of targets involving die size limits and a ship date, and then directed his people to make the processor fast enough within those constraints that people would want to buy it.

Indeed, once you’ve absorbed the Centaur mindset, Henry’s answers to questions become somewhat predictable. Will Isaiah go multi-core? It can; it’s built that way, and Henry thinks Intel’s approach of a shared L2 cache makes sense. But he scoffs at the notion that people need multiple cores in basic computing devices right now. Henry says Centaur will go to multiple cores if it needs that level of performance or if Intel convinces people they have to have it.

It is a very interesting and different way to go about processor design: aiming for “good enough for what people do now”. A Skoda, not a BMW, to use an old automotive analogy. But note that while in some markets “good enough” also means “bleak and boring”, this is not necessarily the case in personal computing.

In computers, the software and system construction are what make one PC stand out from another, not the raw performance of the processor. And as good enough in the VIA case also means fairly radically low-power, you can build some cool and compact solutions from something like the Isaiah, provided that you absolutely want it to run a commodity OS like Windows or x86 standard Linux (which makes sense for a home machine).

If you care to do a bit more customization and create true consumer electronics, you can easily get the same features at half the power budget using an ARM, MIPS, or embedded POWER core.

The only bit that strikes me as interesting is that Henry thinks of multicore as a way to increase performance, rather than as a way to decrease power. Maybe the additional size of a second core has something to do with this, but other players in the market, most notably ARM, are using multicore as a way to reduce the power consumption at a given level of performance. Could be that in the x86 world, anything slower than the current Isaiah design will just be too single-threaded slow to be viable running Windows.

Dekker’s Algorithm Does not Work, as Expected

Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people that for a long time. In my talks about Debugging Multicore Systems at the Embedded Systems Conference Silicon Valley in 2006 and 2007, I had a fairly long discussion about relaxed or weak memory consistency models and their effect on parallel software when run on a truly concurrent machine. I used Dekker’s Algorithm as an example of code that works just fine on a single-processor machine with a multitasking operating system, but that fails to work on a dual-processor machine. Over Christmas, I finally did a practical test of just how easy it was to make it fail in reality, which turned out to showcase some interesting properties of various types and brands of hardware and software.
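For readers who have not seen it, here is a minimal sketch of Dekker’s algorithm for two threads, written in Python. This is not the test code from the post, and names such as `run_dekker`, `wants`, and `turn` are my own for illustration. It assumes standard CPython with the global interpreter lock, where loads and stores from different threads appear in program order. That is precisely the assumption that breaks on a machine with a relaxed memory model, so this sketch shows the logic of the algorithm and not the failure.

```python
import threading
import time


def run_dekker(iterations=2000):
    """Two threads increment a shared counter, using Dekker's algorithm as the lock."""
    wants = [False, False]  # wants[i] is True while thread i wants the critical section
    turn = [0]              # which thread gets priority when both want in
    counter = [0]           # shared data protected by the lock

    def lock(me):
        other = 1 - me
        wants[me] = True
        while wants[other]:
            if turn[0] != me:
                # Not our turn: back off so the other thread can proceed.
                wants[me] = False
                while turn[0] != me:
                    time.sleep(0)  # yield while busy-waiting
                wants[me] = True
            else:
                time.sleep(0)      # our turn: wait for the other thread to back off

    def unlock(me):
        turn[0] = 1 - me   # hand priority to the other thread
        wants[me] = False

    def worker(me):
        for _ in range(iterations):
            lock(me)
            value = counter[0]       # unprotected, this read-modify-write could lose updates
            counter[0] = value + 1
            unlock(me)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(run_dekker())
```

The algorithm depends on each thread seeing the other’s store to `wants` before it reads its own way into the critical section. On hardware that may reorder a store followed by a load to a different address, both threads can read `False` and enter together unless memory barriers or sequentially consistent atomics are used, which is the kind of failure the post is about.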

Continue reading “Dekker’s Algorithm Does not Work, as Expected”

Book Review: Intel’s Multicore Programming Book

The book “Multicore Programming – Increasing Performance through Software Multithreading” by Shameem Akhter and Jason Roberts is part of a series of books put out by Intel in their multicore software push. In case you have not noticed, Intel has a huge market push currently where they give seminars, publish articles and books, and give curricula to universities in order to get more parallel software in place. I read this book recently, and here is a short review.

Continue reading “Book Review: Intel’s Multicore Programming Book”

When Multicore makes Things Simpler, like IMA

Most of the time when talking about the impact of multicore processing on software, we complain that it makes the software more complicated because it has to cope with the additional complexities of parallelism. There are some cases, however, when moving to multicore hardware allows a software structure to be simplified. The case of Integrated Modular Avionics (IMA) and the honestly idiotic design of the ARINC 653 standard is one such case.

Continue reading “When Multicore makes Things Simpler, like IMA”

Homogeneous and Heterogeneous Multicore vs Programmers

An old colleague just sent me an email bringing up a discussion we had last year, where he was a strong proponent of the homogeneous model of a multiprocessor. The root of that discussion was the difference between the Xbox 360 and Playstation 3 processors. The Xbox 360 has a three-core, two-threads-per-core homogeneous PowerPC main processor called the Xenon (plus a graphics processor, obviously), while the PS3 has a Cell processor with a single two-threaded PowerPC core and seven SPEs, Synergistic Processing Elements (basically DSP-like SIMD machines).

In the game business, it is clear that the Xenon CPU is considered easier to code for. This means that even though the Cell processor clearly has higher theoretical raw performance, in practice the two machines are about equal in power, since it is harder to make use of the Cell. That seems to be a fact.

So here, homogeneous systems do appear to have it easier among programmers. However, I do not believe that that extends to all systems, all the time, everywhere.

Continue reading “Homogeneous and Heterogeneous Multicore vs Programmers”

Parallel Processing Requires Parallel IO

One common use-case for multicore processing on the desktop and elsewhere is “doing many things at the same time”. You could be running many user-interface programs at once, like the “typical today’s teenager template” of tens of IM clients, web sessions, email conversations, music and video players, downloading movies, etc. Or it could be a more business-like mix of background indexing of hard drives, backups being taken, downloading large business files, compiling software, updating source code repositories, etc.

I have been working in both of these modes to some extent, and the main problem with them, at least on a PC, is that while the processors might be good at multitasking and sharing the CPU load, my IO system is annoyingly non-parallel.

Continue reading “Parallel Processing Requires Parallel IO”

Applications that can make use of more compute power (e.g., iPod Video)

A question that pops up quite often when computer architects and representatives from firms like Intel encounter a crowd today is “but just what do you need more computing power for?” Most regular users are fairly happy with the speed at which they process words, surf the web, read email, do IP phone calls, crunch numbers in Excel, and do other common tasks. It is hard to perceive the need for more speed in everyday tasks, unlike a decade or two ago when you could definitely ask for improvement. I remember scrolling a page in PageMaker on a Mac SE (8 MHz 68000). You counted the clicks and waited for the screen to jump, redraw, jump, redraw, stabilize… quite a different experience from working with modern computers and far more complex software that still responds instantaneously to almost any input.

Continue reading “Applications that can make use of more compute power (e.g., iPod Video)”

SICS Multicore Day 2007 – More on Programming

Some more thoughts on how to program multicore machines that did not make it into my original posting from last week. Some of this was discussed at the multicore day, and other parts I have been thinking about for some time now.

One of the best ways to handle any hard problem is to make it “somebody else’s problem”. In computer science this is also known as abstraction, and it is a very useful principle for designing more productive programming languages and environments. Basically, the idea I am after is to let a programmer focus on the problem at hand, leaving somebody else to fill in the details and map the problem solution onto the execution substrate.

Continue reading “SICS Multicore Day 2007 – More on Programming”

SICS Multicore Day August 31

The SICS Multicore Day August 31 was a really great event! We had some fantastic speakers presenting the latest industry research view on multicores and how to program them. Marc Tremblay did the first presentation in Europe of Sun’s upcoming Rock processor. Tim Mattson from Intel tried hard to provoke the crowd, and Vijay Saraswat of IBM presented their X10 language. Erik Hagersten from Uppsala University provided a short scene-setting talk about how multicore is becoming the norm.

Continue reading “SICS Multicore Day August 31”

Real-Time in Sweden (RTiS) 2007

RTiS 2007 just took place in Västerås, Sweden. It is a biennial event where Swedish real-time research (and that really means embedded in general these days) presents new results and summarizes results from the past two years. For someone who has worked in the field for ten years, it really feels like a gathering of friends and old acquaintances. And always some fresh new faces. Due to a scheduling conflict, I was only able to make it to day one of two.

I presented a short summary of a paper that a colleague at Virtutech and I wrote last year together with Ericsson and TietoEnator, on the Simics-based simulator for the Ericsson CPP system (see the publications page for 2006 and soon for 2007). I also presented the Simics tool and demoed it in the demo session. Overall, it was nice to be talking to the mixed academic-industrial audience.

Continue reading “Real-Time in Sweden (RTiS) 2007”